Control Flow Over Prompts: Why My n8n Agents Started Failing
16 min read · By Carlos Aragon


Three production agents. Zero error logs. All delivering garbage output. After the post-mortem and full rebuild, I found the same root cause every time: I was using prompt engineering to solve problems that need deterministic code.

The Morning Three Agents Failed Silently

It was a Monday. I opened my Telegram and had no alerts — which should have meant everything was fine. It wasn't. Three n8n workflows that run daily across my VIXI client stack had all executed successfully, logged green checkmarks, and delivered completely wrong output. No errors. No failed executions. Just garbage data flowing into downstream systems that had no idea anything had gone wrong.

Agent one was a lead qualifier that scores inbound contacts from Retell AI calls. It had been assigning scores like "high-qualified" instead of numeric 1-10 values — which broke the Supabase insert that expects an integer. The records inserted with null scores. Silently.

Agent two was a content classifier that routes blog drafts to different n8n branches based on topic. It had invented three categories that didn't exist in my routing logic — "AI-Automation-General", "Tech-Misc", and "Business Strategy" — none of which matched any Switch node branch. Every piece of content had silently routed to the default fallback and done nothing.

Agent three was a Monday.com task generator that creates client onboarding items from a structured intake form. It had started generating tasks with missing required fields, because the prompt I'd written assumed the model would always include them. Sometimes it did. Sometimes it didn't. The Monday.com API accepted partial records without error.

I spent six hours that Monday in post-mortem mode. What I found was not a model problem, not a prompt problem, and not an n8n bug. It was an architecture problem. I had built agents where the LLM was responsible for decisions that should have been made by code.

The pattern I kept finding:

Every failure had the same shape. The LLM node produced output that was structurally plausible. The next node consumed it without checking. Execution completed. No one was the wiser until I manually reviewed the data three days later.

This post is the rebuild. I'm going to show you exactly what the prompt-heavy architecture looked like, why it fails at scale, and the specific control flow patterns I used to fix each agent. If you're running n8n workflows in production that rely on LLM output to drive routing or validation decisions, you're likely one bad prompt variation away from the same Monday I had.

What "Prompt-Driven" Architecture Actually Looks Like

Prompt-driven architecture is what you build when you're moving fast and the LLM seems smart enough to handle it. You write a single Claude node that does classification, extraction, and formatting in one shot. You tell it to return JSON. It does. You move on.

Here's what the original lead qualifier node prompt looked like:

You are a lead qualification assistant.
Analyze the following call transcript and return a JSON object with:
- qualification_score: a number from 1-10 indicating lead quality
- primary_intent: the lead's main reason for calling
- recommended_action: what the sales team should do next
- urgency: low, medium, or high

Transcript: {{$json.transcript}}

Return only valid JSON. Do not include any other text.

This prompt works 95% of the time. The model returns clean JSON with integer scores, valid urgency values, and sensible recommended actions. The problem is the 5% — and what happens when it goes wrong at scale.

At 10 runs per day, a 5% failure rate is one failure every two days. Noticeable. At 100 runs per day, it's five failures daily. At 500 runs, you have 25 silent failures every single day flowing into your downstream systems.

The deeper problem: you can't prompt your way to zero failures. Every workaround you add to the prompt — "ALWAYS return a numeric score", "NEVER use string values for numbers" — works until it doesn't. The model is a probabilistic system. You're trying to enforce determinism with text instructions. Those are fundamentally incompatible goals.

The three things I was asking the LLM to do that it shouldn't:

  • Make routing decisions — Choosing categories that determine which n8n branch executes
  • Enforce data schemas — Returning exact field names, types, and value ranges that downstream nodes depend on
  • Handle edge cases gracefully — Deciding what to do when input is malformed, truncated, or ambiguous

None of these are generation tasks. They're structural decisions that need structural code.

The Three Failure Modes I Hit

After the post-mortem, I identified three distinct failure modes. They're worth naming separately because they have different causes and different fixes.

Failure Mode 1: Silent Hallucination

The model returns output that is structurally correct but semantically wrong. The JSON is valid. The fields are present. The values are plausible. But the data is fabricated or inconsistently computed.

My lead qualifier started returning qualification_score: "high-qualified" instead of a number. This is a hallucination of the format — the model produced something that sounds like a qualification score but isn't what I specified. Because the JSON was otherwise valid, the n8n node didn't throw an error. The bad value flowed downstream.

// What the model returned (unexpected)

{
  "qualification_score": "high-qualified",  // should be integer 1-10
  "primary_intent": "storm damage inspection",
  "recommended_action": "Schedule same-day inspection",
  "urgency": "high"
}

// Supabase insert: qualification_score column is INTEGER — silently stores NULL

Failure Mode 2: Prompt Drift

The model's behavior changes over time without any change to your workflow. This happens because LLM APIs don't guarantee identical outputs for identical inputs — model updates, temperature variation, and context window artifacts all shift behavior gradually.

My content classifier ran fine for six weeks. Then it started inventing category names. I hadn't changed the prompt. The model had drifted. The categories I listed in the prompt were still there, but the model was occasionally generating variations of them — treating the list as examples rather than an exhaustive constraint.

You cannot fix prompt drift with prompt engineering. You fix it by making the model's output irrelevant to your routing logic. More on that in the rebuild section.

Failure Mode 3: Context Overflow Degradation

Long-running agents that accumulate context across many tool calls or conversation turns start behaving unpredictably as the context window fills. Instructions from the beginning of the conversation get less weight. The model starts "forgetting" constraints you specified early in the system prompt.

My Monday.com task generator runs a multi-step workflow: read intake form → extract client requirements → generate task list → validate → insert. By the third or fourth step, the model was omitting required fields that I'd specified clearly in the initial system prompt. Not always. Often enough to cause production failures.

The common thread across all three:

In every case, the workflow had no structural checkpoint between the LLM output and the downstream action. The model's output was trusted implicitly. There was no code asking: "Is this output actually valid before we proceed?"

Deterministic Control Flow: The Core Pattern

The fix is a single architectural principle: LLMs do generation. Code does routing.

Every decision about what happens next in your workflow — which branch executes, whether data is valid, what to do when something fails — should be made by a Code node, IF node, or Switch node. Never by the LLM.

This sounds obvious. But when you're building fast and the model is smart, it's easy to let it do too much. The discipline is deliberately separating what the model produces from what your workflow decides to do with it.

The Schema Gate Pattern

The first structural layer is a validation Code node immediately after every LLM node. Before any downstream node touches the output, the code asks: does this match what I expect?

// n8n Code node — validate LLM output before it flows downstream

const output = $input.first().json;

// Parse if string
const data = typeof output.text === 'string'
  ? JSON.parse(output.text)
  : output;

// Validate schema
const score = parseInt(data.qualification_score, 10);
const validUrgency = ['low', 'medium', 'high'];

if (isNaN(score) || score < 1 || score > 10) {
  throw new Error(`Invalid qualification_score: ${data.qualification_score}`);
}

if (!validUrgency.includes(data.urgency)) {
  throw new Error(`Invalid urgency value: ${data.urgency}`);
}

if (!data.primary_intent || typeof data.primary_intent !== 'string') {
  throw new Error('Missing or invalid primary_intent');
}

// Return clean, typed output
return [{
  json: {
    qualification_score: score,
    primary_intent: String(data.primary_intent).trim(),
    recommended_action: String(data.recommended_action || '').trim(),
    urgency: data.urgency,
    validated: true
  }
}];

When this validation throws an error, n8n routes to the error branch. The downstream Supabase insert never runs. No silent failure. You get a Telegram alert with the exact validation error and the raw LLM output for debugging.
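If you wire that error output to a small Code node, the alert can carry everything you need for debugging. Here's a minimal sketch of what that node might look like (the exact shape of the error item depends on how your error branch is wired, so treat the field names as assumptions):

// Error-branch Code node: formats the Telegram alert text
// NOTE: the error item's shape depends on your error wiring; adjust field names as needed
const item = $input.first().json;
const rawOutput = $('Claude').first().json.text || '';

const message = [
  'Lead qualifier validation failed',
  `Error: ${item.error?.message || item.message || JSON.stringify(item)}`,
  `Raw LLM output: ${String(rawOutput).slice(0, 500)}`,
  `Time: ${new Date().toISOString()}`
].join('\n');

return [{ json: { telegram_message: message } }];

The Telegram node then sends telegram_message as-is, so every failure lands in my pocket with the raw output attached.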

The Closed Enum Pattern for Classification

For classification tasks, stop asking the LLM to generate a category name and start asking it to select from a numbered list. Then use the number — not the text — as your routing key.

// Prompt that prevents category hallucination

Classify the following content into exactly one category.
Reply with ONLY the number. No other text.

1 = n8n automation
2 = AI agents
3 = Hyros attribution
4 = Voice AI
5 = Cost optimization
0 = Other

Content: {{$json.content}}

// n8n Switch node routes on the number, not the text

// Code node to extract and validate the number
const raw = $input.first().json.text.trim();
const category = parseInt(raw, 10);

if (isNaN(category) || category < 0 || category > 5) {
  throw new Error(`Unexpected category response: "${raw}"`);
}

return [{ json: { category_id: category } }];

// Switch node: routes on {{ $json.category_id }}
// Case 1 → n8n workflow branch
// Case 2 → AI agents branch
// Case 3 → Hyros branch
// etc.

Now the model can't invent a category. It returns a digit. The Code node validates it's a valid digit. The Switch node routes deterministically. Prompt drift on category names becomes irrelevant because we never parse the category name.

Rebuilding the Lead Qualifier: Before and After

The original lead qualifier was four nodes: Webhook → Claude → Supabase insert → Telegram notification. Forty-two lines of prompt. Zero validation.

The rebuild is nine nodes: Webhook → Claude → Code (validate) → IF (score threshold) → two branches (high-value vs. nurture) → two Supabase inserts → two Telegram notifications.

More nodes. Much more reliable. Here's what changed in the Claude prompt:

// BEFORE — asking the model to do too much

You are a lead qualification assistant. Analyze the call transcript
and return JSON with qualification_score (1-10), primary_intent,
recommended_action, and urgency (low/medium/high).
Return only valid JSON.

// AFTER — model only generates, code validates and routes

Analyze the call transcript. Return a JSON object with these exact fields:
{
  "score": [integer 1-10, where 10 is highest quality],
  "intent": [one sentence describing why they called],
  "action": [one sentence on recommended next step],
  "urgency": [exactly one of: "low", "medium", "high"]
}

Return only the JSON object. No explanation.

The prompt change is subtle — more explicit field types, shorter instructions, exact value constraints. But the real change is what comes after: the validation Code node that throws on any deviation, and the IF node that makes the routing decision rather than the prompt.
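For completeness, here's a sketch of what that validation node looks like against the new field names, following the same schema gate pattern as above:

// Validation Code node for the rebuilt prompt: same schema gate pattern, new field names
const data = JSON.parse($input.first().json.text);

const score = parseInt(data.score, 10);
const validUrgency = ['low', 'medium', 'high'];

if (isNaN(score) || score < 1 || score > 10) {
  throw new Error(`Invalid score: ${data.score}`);
}
if (!validUrgency.includes(data.urgency)) {
  throw new Error(`Invalid urgency: ${data.urgency}`);
}
if (!data.intent || !data.action) {
  throw new Error('Missing intent or action');
}

return [{
  json: {
    score,
    intent: String(data.intent).trim(),
    action: String(data.action).trim(),
    urgency: data.urgency
  }
}];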

// IF node condition — deterministic routing by score

// IF: {{ $json.score }} >= 7
// True → High-value branch (immediate callback alert)
// False → Nurture branch (add to email sequence)

// No LLM involved in this decision.
// The model gave us a number. The IF node decides what to do with it.

Result after 30 days in production: 2,400 executions, zero silent failures. Three hard failures caught by the validation node (all were malformed API responses from upstream, not LLM issues), all surfaced immediately with Telegram alerts.

Rebuilding the Content Classifier: Closed Enums in Action

The content classifier failure was pure prompt drift. The model was treating my category list as examples, not constraints. The fix: make the list non-interpretable.

When you ask a model to "classify into one of: n8n automation, AI agents, Hyros attribution...", it will occasionally paraphrase. "n8n automation" becomes "n8n workflow automation". "AI agents" becomes "AI Agent Systems". Your Switch node case-matches on exact strings — so all those variations fall through to the default.

The rebuild replaced string categories with integer codes. Here's the full n8n workflow logic:

// Node 1: Claude (classification prompt)
// Returns: "3" (just the digit)

// Node 2: Code (validate and parse)
const raw = $input.first().json.text.trim();
const categoryId = parseInt(raw, 10);
const validIds = [0, 1, 2, 3, 4, 5];

if (!validIds.includes(categoryId)) {
  throw new Error(`Invalid category ID: "${raw}"`);
}

const categoryNames = {
  0: 'other',
  1: 'n8n-automation',
  2: 'ai-agents',
  3: 'hyros-attribution',
  4: 'voice-ai',
  5: 'cost-optimization'
};

return [{
  json: {
    category_id: categoryId,
    category_name: categoryNames[categoryId],
    original_content: $input.first().json.content
  }
}];

// Node 3: Switch (routes on category_id integer)
// Case 0 → Manual review branch
// Case 1 → n8n content queue
// Case 2 → AI agents content queue
// etc.

Notice that the category name is assigned by the Code node, not by the model. The model returns a number. The Code node maps it to a name. If the model drifts, it still returns a number — and we catch any non-number with the validation check.

Running this for 45 days: 1,800 classifications, zero routing failures. Two validation errors were caught: the model returned "3 - Hyros attribution" instead of just "3" on two runs. Both were caught and alerted, and I tightened the prompt slightly to remove that edge case.

The Control Flow Hierarchy I Use Now

After the rebuild, I documented a four-level hierarchy for how I structure every production n8n agent. If a workflow doesn't have all four levels, it's not production-ready.

Level 1 — Schema Gates

A Code node immediately after every LLM node. Validates types, required fields, and value ranges. Throws on failure. Never optional.
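For the Monday.com task generator, this gate is mostly a required-fields check. A minimal sketch (the field names here are illustrative, not my actual schema):

// Required-fields gate for generated tasks (illustrative field names)
const tasks = JSON.parse($input.first().json.text);

if (!Array.isArray(tasks)) {
  throw new Error('Expected an array of tasks from the LLM');
}

const requiredFields = ['name', 'client_id', 'due_date', 'owner'];

for (const task of tasks) {
  for (const field of requiredFields) {
    if (task[field] === undefined || task[field] === null || String(task[field]).trim() === '') {
      throw new Error(`Generated task is missing required field "${field}": ${JSON.stringify(task)}`);
    }
  }
}

return tasks.map(task => ({ json: task }));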

Level 2 — Explicit Routing

All branching logic uses IF or Switch nodes with deterministic conditions. The LLM never decides which branch executes — it only provides values that code evaluates.

Level 3 — Error Branches with Escalation

Every workflow has an error branch that sends a Telegram alert with the error message and raw input. Never just retries. Human reviews all failures.

Level 4 — Audit Logging

Every execution logs inputs, LLM outputs, routing decisions, and final actions to a Supabase audit table. When something goes wrong, the full trace is available.

The Audit Log Pattern

The audit log was the last piece I added, and it's paid off the most. After an execution, a final Code node inserts a record to Supabase with the full execution trace:

// Final Code node — audit log insert
const execution = {
  workflow_id: 'lead-qualifier-v2',
  run_at: new Date().toISOString(),
  input: {
    contact_id: $('Webhook').first().json.contact_id,
    transcript_length: $('Webhook').first().json.transcript.length
  },
  llm_output: $('Claude').first().json.text,
  validated_output: $('Validate Schema').first().json,
  routing_decision: $('Score Threshold IF').first().json.qualification_score >= 7
    ? 'high-value'
    : 'nurture',
  final_action: $('Supabase Insert').first().json.id ? 'inserted' : 'failed'
};

// This goes to Supabase agent_audit_log table
return [{ json: execution }];

Three months of audit logs have let me spot patterns I never would have caught otherwise: certain call transcript formats consistently produce lower scores, specific client types trigger more validation errors, and certain times of day have higher LLM latency that occasionally causes timeout failures in downstream nodes.

When Prompt Engineering Is Still the Right Tool

I've been critical of prompt engineering as an architecture strategy, but I use it constantly. The distinction is what you're using it for.

Prompt engineering is the right tool for:

  • Pure content generation — Writing email drafts, summarizing call notes, generating blog post outlines. The output is for human review. If it's wrong, a human catches it.
  • Open-ended extraction without downstream dependencies — Pulling key points from a document where the result goes into a UI for a human to read, not into another automated system.
  • One-shot tasks with manual review — Anything where the n8n workflow ends with output in a Slack message or email that a person reads before acting.

The litmus test I apply to every LLM node before I ship a workflow: if this node returns garbage, does the workflow execute an irreversible action without a human noticing?

If the answer is yes — inserting to a database, sending a message, creating a record in a CRM, posting to an API — you need structural guardrails. A better prompt isn't good enough. The question isn't whether your prompt is good. It's whether your system can detect and stop bad output before it causes damage.

Quick decision framework:

  • LLM output goes to a human for review → prompt engineering is fine
  • LLM output drives a downstream API call → schema gate required
  • LLM output determines which branch executes → replace with a closed enum + Switch node
  • LLM output inserts to a database → schema gate + type validation required
  • LLM output sends a message or email → length/content check, plus human approval for high-stakes sends

The Mental Model Shift That Changes Everything

The shift I had to make was from thinking "how do I write a better prompt?" to thinking "what structural guarantees does this workflow need?"

A better prompt makes the model more likely to return what you expect. A Code node with schema validation makes it impossible for bad output to proceed without you knowing. Those are different levels of reliability — and only one of them scales.

Every time I'm tempted to add another instruction to a prompt, I ask: is this instruction compensating for a missing structural check? If the instruction is "always return valid JSON", the real fix is a JSON parse with error handling. If the instruction is "always include the score field", the real fix is a required field check that throws if it's absent.
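As a sketch, the structural versions of those two instructions are only a few lines:

// Structural replacements for "always return valid JSON" and "always include the score field"
const raw = $input.first().json.text;

let data;
try {
  data = JSON.parse(raw);
} catch (e) {
  throw new Error(`LLM did not return valid JSON: ${String(raw).slice(0, 200)}`);
}

if (data.score === undefined || data.score === null) {
  throw new Error('LLM output is missing the required field: score');
}

return [{ json: data }];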

The LLM is incredibly powerful for the things it's good at: reasoning about language, extracting meaning, generating content. But it's not a type system. It's not a router. It's not a validator. When you treat it like one, you get exactly the kind of Monday morning I had.

Build the structure first. Let the model do the creative work within it. That's the architecture that doesn't fail silently.

Building production n8n agents?

I'm sharing everything I've learned running 11 agents in production — the failures, the rebuilds, and the patterns that actually hold up at scale. If you're running into similar issues or want to talk through your workflow architecture, reach out.

Get in touch →