Pipelines, agents, and the maturity curve

Hamel Husain, who has consulted across thirty-plus AI teams, opens his field guide with a sentence that reads like a private joke and is actually a diagnostic: "The engineers kept saying, 'We're going to build an agent that does XYZ,' when really the job to be done was writing a prompt."^[1]

This is the agent reflex. It is the most expensive failure mode in AI engineering as of mid-2026, and the most common. It is also explicitly warned against by the AI vendor with the most commercial interest in the agentic future. Anthropic's own Building Effective Agents — the canonical reference in this space — opens with the recommendation to start with simple prompts and add multi-step agentic systems only when simpler solutions fall short.^[2]

The discourse goes the other way. Frameworks default to agents. Sequoia decks call this The Year of the Agent for the third year in a row. The dominant practitioner essay arguing pipeline-first is published by the company selling agents.

This essay names what most teams should default to, what they should escalate to, and the four-question test that decides between them.

Three architectures

There is a real taxonomy beneath the agent discourse. Two primary sources define it: Anthropic's Building Effective Agents (December 2024) and LangChain's LangGraph documentation, which mirrors Anthropic's framing almost verbatim.^[3] Both agree on the shape:

Prompt — a single LLM call. The augmented case adds retrieval, tools, or memory, but the control flow is the call.
Pipeline (Anthropic calls it workflow) — systems where LLMs and tools are orchestrated through predefined code paths.^[2] The developer writes the graph; the LLM fills in the boxes.
Agent — systems where LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks.^[2] The developer hands the LLM a goal and a toolbox; the LLM decides what to do next.
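The difference between the three is easiest to see in who owns the control flow. The following is a minimal sketch, not an implementation from either source: `LLM` is a hypothetical stand-in for any model call, the prompts are invented, and the agent's `"<tool> <argument>"` / `"DONE <answer>"` reply protocol is an assumption made for illustration.

```python
from typing import Callable, Dict

# Hypothetical stand-in for a model call; a real system would wrap an LLM API here.
LLM = Callable[[str], str]
Tool = Callable[[str], str]


def run_prompt(llm: LLM, text: str) -> str:
    """Prompt: the control flow is the single call."""
    return llm(f"Summarise: {text}")


def run_pipeline(llm: LLM, text: str) -> str:
    """Pipeline: the developer fixes the path; the LLM fills in each box."""
    facts = llm(f"Extract the key facts: {text}")
    draft = llm(f"Draft a summary from these facts: {facts}")
    return llm(f"Tighten this draft: {draft}")


def run_agent(llm: LLM, goal: str, tools: Dict[str, Tool], max_steps: int = 5) -> str:
    """Agent: the LLM chooses the next step each turn, up to a step budget."""
    history = goal
    for _ in range(max_steps):
        # Assumed reply protocol: "<tool> <argument>" or "DONE <answer>".
        decision = llm(f"{history}\nNext action?")
        name, _sep, arg = decision.partition(" ")
        if name == "DONE":
            return arg
        observation = tools[name](arg) if name in tools else f"unknown tool {name}"
        history += f"\n{name}({arg}) -> {observation}"
    return "step budget exhausted"
```

The pipeline always makes exactly three calls. The agent makes however many the model asks for, which is why even this toy version carries a `max_steps` budget.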

Anthropic's umbrella term for everything in this space is agentic systems — both workflows and agents are subtypes. The word agentic has been overloaded so thoroughly in practitioner discourse that this nuance is almost always missed. A retrieval-augmented prompt with three retry rules is an agentic system. So is a multi-agent research orchestrator. They sit on the same continuum and behave nothing like each other.

The continuum has a direction. Pipelines are deterministic in shape: the same input produces a predictable graph traversal even if individual LLM calls vary. Agents are non-deterministic in shape: the same input can produce a different traversal on every run because the LLM is choosing the next step. As a team moves rightward along the continuum, three things happen monotonically. Cost goes up. Latency variance goes up. Debuggability goes down.

The vocabulary matters because most arguments about agents are arguments about a word that means three different things to three different people. When a team says "we are building an agent for X," the right first question is which point on this continuum they actually mean. If they mean a single LLM call with a retry, the conversation is done. If they mean an orchestrator-worker pattern with four specialised subagents, the conversation is just starting.

You get two

Cost, reliability, and flexibility form a triangle. You get two. The further right you move on the architecture continuum, the more flexibility you buy and the more of the other two you sell.

Anthropic states this directly:

Agentic systems often trade latency and cost for better task performance, and you should consider when this tradeoff makes sense. — Anthropic^[2]

The cost side has a hard number from Anthropic's own production deployment. Their multi-agent research system uses an orchestrator-worker pattern — a lead researcher agent decomposes the query, spawns three to five parallel subagents, and aggregates their findings through a citation-checking pass. The published engineering retro reports the system consumes approximately fifteen times more tokens than standard chat interactions.^[4] This is from the AI lab whose commercial future depends on agentic systems being the answer.

An order-of-magnitude token premium isn't a problem if the work justifies it. It's a problem when you don't notice it until the bill comes.
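To make the premium visible before the bill does, the arithmetic can be run up front. In this sketch only the 15x multiplier comes from the retro cited above; the request volume, tokens per request, and blended token price are illustrative assumptions, not measurements.

```python
def monthly_cost(requests_per_day: int, tokens_per_request: int,
                 usd_per_million_tokens: float, days: int = 30) -> float:
    """Monthly spend in USD for a given volume and per-request token count."""
    total_tokens = requests_per_day * days * tokens_per_request
    return total_tokens * usd_per_million_tokens / 1_000_000


REQUESTS_PER_DAY = 10_000      # assumed volume
BASELINE_TOKENS = 2_000        # assumed tokens for a standard chat interaction
USD_PER_MILLION_TOKENS = 5.0   # assumed blended price
MULTIPLIER = 15                # approximate figure reported in the cited retro

chat_cost = monthly_cost(REQUESTS_PER_DAY, BASELINE_TOKENS, USD_PER_MILLION_TOKENS)
multi_agent_cost = monthly_cost(
    REQUESTS_PER_DAY, BASELINE_TOKENS * MULTIPLIER, USD_PER_MILLION_TOKENS
)
print(f"chat: ${chat_cost:,.0f}/month, multi-agent: ${multi_agent_cost:,.0f}/month")
```

Under these assumed inputs the same traffic moves from $3,000 to $45,000 a month. The absolute numbers are invented; the ratio is the point.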

The reliability side is harder to put a number on but easier to feel in production. An agentic system is a closed-loop control structure where the controller is an LLM. Closed loops are where the worst failure modes in any engineering domain live. The widely circulated 2025 meme about an agent that "burned $47,000 in a loop" turned out, when people tried to verify it, to be unverifiable in its specifics. What survives is the pattern. A practitioner postmortem published in April 2026 documents an agent caught in a plan → tool-call → 429 error → replan → retry cycle, each iteration re-ingesting its own failure trace as new context, with the loop firing roughly 4,800 times per hour before it was caught.^[5] The specific dollar number from that postmortem did not survive independent verification. The loop pattern did. It has a name in the agent-engineering literature: unbounded retry plus context accumulation. It is common enough to need one.
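Both halves of that pattern have a direct guard: cap the attempts, and cap how much failure trace is fed back in. A minimal sketch follows, assuming a hypothetical `step` callable that returns `(succeeded, payload)`. The names, the limits, and the exponential backoff are illustrative choices, not details from the postmortem.

```python
import time
from typing import Callable, Tuple


class RetryBudgetExceeded(RuntimeError):
    """Raised when the loop hits its attempt cap without a success."""


# A step takes the current failure context and returns (succeeded, payload).
Step = Callable[[str], Tuple[bool, str]]


def run_bounded(step: Step, max_attempts: int = 5, max_context_chars: int = 2_000,
                base_delay_s: float = 1.0,
                sleep: Callable[[float], None] = time.sleep) -> str:
    """Retry `step` with a hard attempt cap and a bounded failure context."""
    context = ""
    for attempt in range(1, max_attempts + 1):
        ok, payload = step(context)
        if ok:
            return payload
        # Guard 1: keep only the tail of the trace so context cannot grow without bound.
        context = (context + "\n" + payload)[-max_context_chars:]
        # Backoff between attempts (an added assumption; the usual response to HTTP 429).
        if attempt < max_attempts:
            sleep(base_delay_s * 2 ** (attempt - 1))
    # Guard 2: a hard stop instead of another replan.
    raise RetryBudgetExceeded(f"no success after {max_attempts} attempts")
```

A loop wrapped this way fails loudly after a handful of attempts instead of firing thousands of times an hour.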

The flexibility side is what you bought. Agents do work pipelines structurally cannot. Anthropic's multi-agent research system, by their own report, outperforms a single-agent setup by over 90 percent on its target task class — open-ended research requiring breadth-first exploration.^[4] Sean Goedecke makes the maximalist case in Build agents, not pipelines: when in doubt, use agents, because agents handle context-gathering autonomously, scale with model improvements, and remove most of the manual engineering pipelines require.^[6] His argument has weight on tasks where the shape of the work cannot be enumerated in advance. The contested ground is how often that's the actual task.

The choice between architectures is not aesthetic. It is a budgeted decision: how much variance in cost, latency, and debuggability the application will tolerate to buy how much flexibility. Most production applications, when this question is asked plainly, want very little variance and very little flexibility. Pipelines are the natural answer. The agent reflex is what happens when the question doesn't get asked at all.

The agent reflex

The reflex has three reinforcing causes.

The first is linguistic. Agentic is a high-status word in 2025–2026 engineering discourse. It signals familiarity with the frontier. A team that says "we're building an agent" sounds further along than a team that says "we're building a pipeline," even when the pipeline does the same job better. Hamel Husain documents this pattern across thirty-plus AI consulting engagements:

The engineers kept saying, 'We're going to build an agent that does XYZ,' when really the job to be done was writing a prompt. — Hamel Husain^[1]

Same observation, in his telling, repeating verbatim across legal-tech companies, mental-health startups, and healthcare firms. The word agent has become the default frame; the actual job-to-be-done has become the thing that gets noticed second.

The second cause is tool design. The dominant frameworks for AI engineering (LangGraph, CrewAI, AutoGen, Mastra, OpenAI's Agents SDK) are agent-first by construction. Their getting-started examples are agents. Their template galleries lead with agents. The path of least resistance from a fresh npm init is an agent. Building a deterministic pipeline in any of these frameworks requires more code, more decisions, and produces an artefact that looks like less. The artefact-aesthetics of the discourse reward complexity even when the customer doesn't.

The third cause is demo-driven roadmaps. An agent demo is unforgettable. An LLM that opens a browser, navigates three pages, fills in a form, and confirms an action plays well in the C-suite. A pipeline demo is "we extract this entity from this document at 99 percent precision and 95 percent recall; here are the confusion matrices." The second is the one that ships value. The first is the one that gets greenlit. Roadmaps written from board decks select for the first; production retros select for the second.

The cost of the reflex is visible in the most prominent production case. Cognition's Devin — the highest-profile autonomous coding agent — achieves a sixty-seven percent PR merge rate after eighteen months in production, up from thirty-four percent the year before.^[7] The doubling is real. It also means roughly one in three PRs from the most-funded, most-tuned, most-deployed autonomous coding agent in production still does not merge. Cognition's own commentary on the limitation:

Like most junior engineers, Devin does best with clear requirements. Devin can't independently tackle an ambiguous coding project end-to-end like a senior engineer could. — Cognition AI^[7]

And:

Human review is still necessary, because code quality is not straightforwardly verifiable. — Cognition AI^[7]

This is the ceiling for the most-invested case in the field, against the most repetitive, most-scoped, most well-documented task class possible. The reflex says "let's build an autonomous agent for this." The data, from the company selling autonomous agents, says that ceiling exists and isn't moving fast.

The maturity curve

Anthropic's Building Effective Agents enumerates a seven-stage progression of agentic systems, in order of increasing complexity:

Augmented LLM — a single call enhanced with retrieval, tools, and memory.
Prompt chaining — sequential task decomposition with intermediate checks between calls.
Routing — input classification that dispatches to specialised downstream tasks.
Parallelisation — simultaneous processing (sectioning or voting) with aggregated outputs.
Orchestrator-workers — a central LLM dynamically delegates subtasks to worker LLMs.
Evaluator-optimiser — an iterative refinement loop where one LLM critiques another's output.
Autonomous agents — LLMs operating independently with environmental feedback loops.

LangChain's LangGraph documentation mirrors five of these as named workflow patterns and treats autonomous agents as the separate, higher-complexity case.^[3] Together, the two documents are the canonical reference for the space.
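Two of the simpler patterns fit in a few lines each, which is part of the argument for reaching for them first. This is a sketch under assumptions: `LLM` is a hypothetical stand-in for a model call, the prompts are invented, and the voting helper calls the model sequentially for clarity where a real system would issue the calls in parallel.

```python
from collections import Counter
from typing import Callable, Optional

# Hypothetical stand-in for a model call.
LLM = Callable[[str], str]


def chain_with_gate(llm: LLM, text: str, gate: Callable[[str], bool]) -> Optional[str]:
    """Prompt chaining: two sequential calls with a programmatic check between them."""
    outline = llm(f"Outline: {text}")
    if not gate(outline):
        return None  # stop early instead of building on a bad intermediate result
    return llm(f"Expand this outline: {outline}")


def vote(llm: LLM, prompt: str, n: int = 3) -> str:
    """Parallelisation (voting variant): ask n times and keep the most common answer."""
    answers = [llm(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```

In both helpers the number of model calls is fixed by the code, not by the model. That is what keeps them on the workflow side of the line.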

The progression is real and useful. It is also explicitly not a maturity ladder in the way the word is usually meant. Anthropic itself flags this:

These building blocks aren't prescriptive. They're common patterns that developers can shape and combine to fit different use cases. Add complexity only when it demonstrably improves outcomes. — Anthropic^[2]

This is the gap the essay is closing. The patterns are enumerated; the when is not. A team that reads Building Effective Agents learns the vocabulary and the components. It does not learn the decision rule that picks one. In practice this means teams that adopt the framework still default to the highest-complexity option that seems to fit, because fit is judged on plausibility rather than on a forcing function.

The maturity curve, as we use the term, is the progression Anthropic describes plus the prescriptive overlay Anthropic deliberately doesn't supply. Each step on the curve is earned by evidence that the previous step couldn't meet the requirement. A team that jumps from augmented LLM directly to orchestrator-workers without spending real production time at the intermediate steps is not building maturity. It is buying complexity at a discount it will pay back later.

The four-question decision tree below is the forcing function we install. It is the smallest test that prevents the agent reflex from running unopposed.

The four-question decision tree

Four questions, in order. The team answers each one for the actual production task, with the actual production constraints. The first YES answer settles the matter: build a pipeline and stop there. Only four NO answers in a row are permission to consider escalating from pipeline to agent.

1 — Determinism

Does the task have a knowable correct output, given a fixed input?

If yes — pipeline. A task with a verifiable answer is a task where determinism is a feature. Classification, extraction, structured data generation, schema mapping, summarisation against a rubric: all of these have a notion of right and wrong that does not change run-to-run. An agent gives you variance you didn't ask for. A pipeline gives you reproducibility for free.

If no — keep going.

2 — Bounded state

Can the task be expressed as a finite state machine where the states and transitions are knowable in advance?

If yes — pipeline, possibly with routing or evaluator-optimiser components. Knowable in advance is the load-bearing phrase. Most production tasks that look open-ended turn out, when sketched on a whiteboard, to have between three and twelve states and a few dozen transitions. Once they're sketched, the pipeline writes itself.

If no — the task is genuinely open-ended. Keep going.

3 — Cost ceiling

Is there a fixed per-run cost cap above which the application is broken, not just degraded?

If yes — pipeline. Pipelines have bounded cost by construction. Each step is a known call to a known model at a known cost. Agents have unbounded cost by construction. The Anthropic multi-agent retro's fifteen-times-token-cost multiplier^[4] is the well-behaved case. The unbounded retry loop documented in the April 2026 postmortem^[5] is the failure case. If your application falls over at $5 per request, you need pipelines.

If no — cost is a softer constraint. Keep going.
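"Bounded by construction" can be checked before anything ships: a pipeline's worst-case cost per run is the sum of its steps' token caps times the price. A sketch with illustrative numbers; the step names, token caps, prices, and helper names are all assumptions.

```python
from typing import List, Tuple

# (step name, max input tokens, max output tokens): illustrative caps, not measurements.
Step = Tuple[str, int, int]


def worst_case_cost(steps: List[Step], usd_per_m_in: float, usd_per_m_out: float) -> float:
    """Upper bound on the USD cost of one run when every step hits its token caps."""
    tokens_in = sum(step[1] for step in steps)
    tokens_out = sum(step[2] for step in steps)
    return (tokens_in * usd_per_m_in + tokens_out * usd_per_m_out) / 1_000_000


def fits_cap(steps: List[Step], usd_per_m_in: float, usd_per_m_out: float,
             cap_usd: float) -> bool:
    """True when the worst-case run stays at or under the per-run cost cap."""
    return worst_case_cost(steps, usd_per_m_in, usd_per_m_out) <= cap_usd


PIPELINE = [("route", 300, 20), ("draft", 4_000, 600), ("evaluate", 1_200, 100)]
print(f"worst case per run: ${worst_case_cost(PIPELINE, 3.0, 15.0):.4f}")
```

No equivalent function exists for an agent unless a step or token budget is imposed from outside, because the number of steps is not known before the run.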

4 — Debuggability

Will a production incident require post-hoc tracing of why a specific output was produced?

If yes — pipeline. Pipeline traces are sequential, inspectable, and reproducible. Agent traces are forensic archaeology — you can see what the LLM decided but rarely why in a way that helps you fix the root cause. The post-incident review you'll be sitting in three months from now is the question being asked here.

If no — all four answers are NO. You have permission to escalate to an agent. The escalation should still be the simplest agent that fits.

Any YES answer points to a pipeline; three or four YES answers leave no room for argument. The agent reflex says "some of those questions don't apply to our case." The reflex is wrong about that more often than it is right.

The tree is a forcing function for an explicit conversation that most teams skip. Run it. Document the answers. Re-run it every quarter — the same task can answer differently as the team's production constraints evolve. If the answers move in the direction of YES, the architecture should de-escalate: from agent back down to pipeline. De-escalation is the move the discourse does not talk about and the move that compounds the most.
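Because the tree is meant to be run and documented, it helps to write it down as code that can sit next to the component it governs. This sketch encodes the per-question rules as stated above (a YES at any step stops at pipeline; only four NOs permit escalation). The class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TaskAnswers:
    """One team's answers for one production task; True means YES."""
    deterministic_output: bool
    bounded_state: bool
    hard_cost_ceiling: bool
    needs_post_hoc_tracing: bool


# Questions in the order the tree asks them.
QUESTIONS = [
    ("deterministic_output", "determinism"),
    ("bounded_state", "bounded state"),
    ("hard_cost_ceiling", "cost ceiling"),
    ("needs_post_hoc_tracing", "debuggability"),
]


def recommend(answers: TaskAnswers) -> Tuple[str, str]:
    """Return (architecture, reason), stopping at the first YES."""
    for field_name, label in QUESTIONS:
        if getattr(answers, field_name):
            return "pipeline", f"YES on {label}"
    return "agent", "all four answers are NO"


print(recommend(TaskAnswers(False, True, True, True)))
```

Storing the `TaskAnswers` record with a date makes the quarterly re-run a comparison of two records, not a fresh debate.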

Three architectures in production

One case per architecture, with the decision-tree answers shown explicitly.

Prompt — entity extraction at scale

A consumer-facing finance product needs to extract counterparty, amount, currency, and date from free-text transaction descriptions. Tens of millions of calls per day, p99 latency budget of 80 ms, cost ceiling of fractions of a cent per call.

| Question | Answer |
|---|---|
| Determinism | yes: the extraction either matches the ground-truth or doesn't |
| Bounded state | yes: the schema is fixed |
| Cost ceiling | yes, hard |
| Debuggability | yes, mandatory |

Architecture: single LLM call, fine-tuned for the schema, with a regex post-validator. No chain. No router. No agent. This is the boring answer and it is correct. The pipeline version is one HTTP call away from this; the agent version would be twenty.
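The post-validator is the part that makes a single call safe to run at this volume. A sketch follows, assuming the model is asked to return a JSON object with the four fields as strings. The field encodings (two-decimal amounts, three-letter currency codes, ISO-style dates) are assumptions, since the case names the fields but not their formats.

```python
import json
import re
from typing import Optional

# Assumed encodings for the four fields named in the case.
FIELD_PATTERNS = {
    "counterparty": re.compile(r"\S.*"),
    "amount": re.compile(r"\d+(\.\d{1,2})?"),
    "currency": re.compile(r"[A-Z]{3}"),
    # Shape check only: this does not reject impossible dates such as month 13.
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
}


def validate_extraction(raw: str) -> Optional[dict]:
    """Return the parsed record if it matches the schema exactly, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or set(data) != set(FIELD_PATTERNS):
        return None
    for field_name, pattern in FIELD_PATTERNS.items():
        value = data[field_name]
        if not isinstance(value, str) or pattern.fullmatch(value) is None:
            return None
    return data
```

A `None` result is a deterministic, countable failure that can be logged and sent to a fallback path, and it is what keeps the debuggability answer at "yes".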

Pipeline — incident triage automation

An infrastructure team gets ~400 alerts a day. They want a system that ingests each alert, retrieves the relevant runbook, drafts a first-pass triage note (severity, suspected component, recommended action), and either auto-pages or routes to a human queue.

| Question | Answer |
|---|---|
| Determinism | partially: the triage decision has a notion of correctness but not always a single right answer |
| Bounded state | yes: about a dozen alert classes and four severity levels |
| Cost ceiling | yes: page-level cost matters |
| Debuggability | yes: post-incident reviews need it |

Architecture: a pipeline. Routing on alert class, retrieval on the relevant runbook section, a single drafting call against a structured template, evaluator-optimiser if the first draft fails a confidence check. Crucially: no agent loop. The team uses Anthropic's evaluator-optimiser pattern as the most-complex component and deliberately stops short of an autonomous agent. The pipeline is auditable end-to-end and costs predictably.
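The shape of that pipeline can be sketched in one function. Everything named here is hypothetical: `classify`, `draft`, and `confidence` stand in for the routing call, the drafting call, and the confidence check, and the runbook text, threshold, and revision cap are illustrative.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical runbook store keyed by alert class.
RUNBOOKS = {
    "disk": "Check volume usage and rotate logs.",
    "latency": "Check upstream dependencies.",
}


@dataclass
class TriageNote:
    alert_class: str
    draft: str
    revisions: int
    action: str  # "auto_page" or "human_queue"


def triage(alert: str,
           classify: Callable[[str], str],
           draft: Callable[[str, str, str], str],
           confidence: Callable[[str], float],
           threshold: float = 0.8,
           max_revisions: int = 1) -> TriageNote:
    """Route, retrieve, draft, then revise at most `max_revisions` times."""
    alert_class = classify(alert)                       # routing
    if alert_class not in RUNBOOKS:
        return TriageNote("unknown", "", 0, "human_queue")
    runbook = RUNBOOKS[alert_class]                     # retrieval
    note = draft(alert, runbook, "")                    # single drafting call
    revisions = 0
    # Evaluator-optimiser with a hard cap: a bounded loop, not an agent loop.
    while confidence(note) < threshold and revisions < max_revisions:
        note = draft(alert, runbook, note)
        revisions += 1
    action = "auto_page" if confidence(note) >= threshold else "human_queue"
    return TriageNote(alert_class, note, revisions, action)
```

Every path through the function is visible in the source, and the worst case is a fixed number of calls. A draft that still fails the check goes to the human queue; it does not trigger another round.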

This is where the Agentless paper (Xia et al., 2024) is instructive.^[8] The paper asks, verbatim, "Do we really have to employ complex autonomous software agents?" and answers with a three-phase localise-repair-validate pipeline that scored thirty-two percent on SWE-bench Lite at $0.70 per issue — beating most agent baselines of the time. It is academic, peer-reviewed, and on the canonical agent-shaped task. The pipeline-vs-agent question is genuinely contested even at the frontier.

Agent — open-ended deep research

A product team needs an internal research tool: given a question, the tool reads across an internal knowledge base, the open web, and a set of structured data sources; synthesises an answer; and cites everything. The query space is open-ended. The number of sources to consult per question is not knowable in advance.

| Question | Answer |
|---|---|
| Determinism | no: the same question reasonably produces different (correct) answers |
| Bounded state | no: the research path branches unpredictably |
| Cost ceiling | soft: high-value internal use justifies the cost |
| Debuggability | relaxed: the citation chain provides traceability |

Architecture: an agent, or more specifically an orchestrator-worker pattern with a citation-checking pass. This is the architecture Anthropic's own multi-agent research system uses, and the architecture they report a 15× token premium on.^[4] The premium is paid because the task is the one task this architecture genuinely fits. The team built this only after a six-month pipeline attempt established that the open-endedness was real and not a missing-spec problem in disguise.
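The orchestrator-worker shape itself is small; the open-endedness lives in what the lead decides to delegate. This is a sketch under assumptions, not Anthropic's implementation: `plan` stands in for the lead agent's decomposition, `worker` for a subagent, and the citation check is reduced to keeping only findings whose source is in a known set.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Set, Tuple

Finding = Tuple[str, str]  # (claim, source id)


def research(question: str,
             plan: Callable[[str], List[str]],
             worker: Callable[[str], List[Finding]],
             known_sources: Set[str],
             max_workers: int = 5) -> List[Finding]:
    """Decompose, fan out to parallel workers, then drop findings with unknown sources."""
    # The lead decides the subtasks at run time; the cap is the only fixed part.
    subtasks = plan(question)[:max_workers]
    if not subtasks:
        return []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        batches = list(pool.map(worker, subtasks))  # results keep subtask order
    findings = [finding for batch in batches for finding in batch]
    # Simplified citation pass: keep only claims tied to a source that was actually consulted.
    return [finding for finding in findings if finding[1] in known_sources]
```

Each worker is itself a model-driven loop in a real system, which is where the token premium comes from. The cap on subtasks is the one cost control this shape offers for free.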

Three task classes, three architectures, one principle: the architecture follows the answers to the four questions, not the other way around.

The Studio default

Whitescroll's Studio practice runs pipeline-first by construction. The maturity curve is a ratchet, not a slope. The team only earns the next step when production data justifies it. The first agent in a system is the one we resist the longest.

The implementation discipline is mundane and decisive:

Every new LLM-touched component starts as a single prompt, instrumented with eval traffic before it ships.
A component graduates to a pipeline when, and only when, eval data shows a structured handoff between two prompts outperforms the single one.
A pipeline graduates to an agent when, and only when, the question space provably cannot be enumerated and the production cost ceiling allows variance.
Every quarter, every component is reviewed against the four-question decision tree. Components that have drifted from their original answers are candidates for de-escalation — agent back to pipeline, or pipeline back to prompt.
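The graduation rule in the second item can be made mechanical. A sketch, assuming both variants are run against the same eval cases and scored pass or fail; the function names and the `min_gain` margin are hypothetical, and the margin is a tunable assumption, not a figure from the practice described here.

```python
from typing import List


def pass_rate(results: List[bool]) -> float:
    """Fraction of eval cases that passed."""
    return sum(results) / len(results)


def should_graduate(single_prompt: List[bool], pipeline: List[bool],
                    min_gain: float = 0.02) -> bool:
    """True only when the pipeline beats the single prompt by at least `min_gain`."""
    if not single_prompt or len(single_prompt) != len(pipeline):
        raise ValueError("both variants must be scored on the same non-empty eval set")
    return pass_rate(pipeline) - pass_rate(single_prompt) >= min_gain
```

Requiring a margin means a tie does not justify the extra moving part. The same comparison, run in the other direction, is the evidence for de-escalation.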

De-escalation is the move that compounds. The discourse is built around moving rightward on the maturity curve. Production retros — Cognition's framing of Devin's ceiling, Anthropic's simplicity-first guidance, Xia et al.'s academic counter to agent maximalism, Hamel Husain's field observations — are all evidence that the same architecture can drift out of its fit. The team that catches the drift early de-escalates and saves the cost difference. The team that doesn't catch it ships the architecture into production and pays it forever.

The closing diagnostic question, the one we ask in every Studio engagement:

If you removed the agent from this architecture and replaced it with a pipeline, would you ship the same product? If the honest answer is yes, you don't need the agent.

This question kills more agent proposals than any technical critique. It is also the easiest one to answer truthfully before you have built the agent, and the hardest one to face after.

The teams that compound on AI engineering are not the ones that built the most sophisticated agents. They are the ones that resisted the agent reflex long enough for the production data to tell them what the architecture should be. The decision tree is a forcing function for that resistance. The Studio default is what installs the discipline. The discipline is what most teams are missing.

If you're staring at an agent in your stack that you suspect should be a pipeline, the four questions take twenty minutes to answer. We read every note that lands at hello@whitescroll.com.

References

Husain, Hamel. A Field Guide to Rapidly Improving AI Products. 2025. Primary practitioner essay; widely cited in AI-engineering discourse; observations from 30+ consulting engagements.
Anthropic. Building Effective Agents. December 2024. Primary vendor engineering essay; the canonical reference for workflow / agent vocabulary; the source of the "pipeline-first" guidance and the seven-stage progression cited throughout.
LangChain. Workflows and agents — LangGraph documentation. 2025. Primary vendor documentation; mirrors Anthropic's framing nearly verbatim and serves as the practitioner-facing operationalisation.
ByteByteGo. How Anthropic Built a Multi-Agent Research System. 2025. Secondary source summarising Anthropic's published engineering retro on its multi-agent research system; source of the ~15× token-cost multiplier and the >90% performance gain over single-agent for the same task class.
Jain, Sattyam. A Production AI Postmortem — The Unbounded Retry Loop. Medium. April 2026. Practitioner postmortem; the loop pattern (plan → tool-call → 429 → replan → retry, with context re-ingestion) is independently corroborated by Anthropic and LangChain agent-engineering guidance. Specific dollar figures from this report did not survive independent verification — the pattern is cited; the numbers are not.
Goedecke, Sean. Build agents, not pipelines. 2025. Practitioner essay; the strongest extant maximalist case for agent-first defaults; engaged with directly as the counter-position in this essay.
Cognition AI. Devin 2025 Annual Performance Review. November 2025. Primary vendor self-reported retro on the highest-profile autonomous coding agent; source of the 67% PR merge rate after 18 months, the "human review is still necessary" admission, and the clear-vs-ambiguous-scope characterisation.
Xia, Chunqiu Steven, et al. Agentless: Demystifying LLM-based Software Engineering Agents. arXiv preprint 2407.01489. July 2024 (revised October 2024); peer-reviewed at ACM PACMSE. Primary academic; 32% on SWE-bench Lite at $0.70 per issue using a non-agentic localise-repair-validate pipeline; the canonical counter-argument to agent maximalism on the canonical agent-shaped task.

Cite this essay: Whitescroll. "Pipelines, Agents, and the Maturity Curve." 2026-06-19. whitescroll.com/writing/pipelines-agents-and-the-maturity-curve.