Most 'human-in-the-loop' is escalation done badly

Most AI products don't need a human in every loop. They need clear escalation points, useful review context, and a system that knows when not to ask for help.

id 0x02 · cluster: Agents and Workflows · read ~14 min
tags: agents, workflows, hitl, ai-oversight, escalation
The loop everyone draws.

Short answer

Put a human in the default decision path of an AI system and you get the worst of every option. At low risk they’re a throughput bottleneck. At volume they rubber-stamp. When they actually engage with a hard case, the peer-reviewed evidence says they often make the system less accurate, not more.

What “human-in-the-loop” claims to deliver is meaningful oversight. What actually delivers it is calibrated escalation, where humans are invoked only by triggers you can name and measure. Most teams running HITL are doing escalation, just without the calibration. The good teams are doing it deliberately.

The terms are a mess, and the mess matters

I want to be precise about three patterns before going further, because the literature, the vendor pages, and the regulators are all using the same words to mean different things.

In human-in-the-loop the human is part of the default path. Every action, or every action of a certain class, pauses for human review or co-decision before it proceeds. Volume of cases drives cost linearly. The human is the gate.

In human-on-the-loop the AI acts and the human supervises. The human can intervene at any time, but does not by default. Cost scales with the vigilance demand, not the volume of cases.

In escalation the AI acts by default, and the human is invoked only by a trigger you have defined: low confidence, an irreversible action, a customer asking for one, a category the AI shouldn’t touch. Cost scales with the edge-case rate, not the total volume.

These three are not the same thing wearing different labels. They have different cost curves, different failure modes, and different things they ask the human to do. The single biggest source of confusion in this space is that “HITL” has come to mean “a human is involved somewhere.” Once that conflation happens, you can’t argue about design any more, because you’re not pointing at the same object.
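To make the three cost curves concrete, here is a back-of-the-envelope sketch. Every number in it (review minutes, hourly rate, edge-case rate, supervisor count) is an illustrative assumption, not data from any real system, and the function names are mine.

```python
def hitl_cost(cases: int, minutes_per_review: float, hourly_rate: float) -> float:
    """Default-path HITL: every case is reviewed, so cost is linear in volume."""
    return cases * minutes_per_review / 60 * hourly_rate


def hotl_cost(supervisors: int, hours: float, hourly_rate: float) -> float:
    """Human-on-the-loop: cost follows supervision hours, not case volume."""
    return supervisors * hours * hourly_rate


def escalation_cost(cases: int, edge_case_rate: float,
                    minutes_per_review: float, hourly_rate: float) -> float:
    """Escalation: only the cases that fire a trigger reach a human."""
    return cases * edge_case_rate * minutes_per_review / 60 * hourly_rate


# Hypothetical month: 100,000 cases, 3-minute reviews, 30 per reviewer hour.
cases, minutes, rate = 100_000, 3.0, 30.0
print(hitl_cost(cases, minutes, rate))              # 150000.0
print(hotl_cost(2, 160, rate))                      # 9600.0
print(escalation_cost(cases, 0.05, minutes, rate))  # 7500.0
```

Double the case volume and the first number doubles, the second stays put, and the third doubles only if the edge-case rate holds. That is the whole difference between the patterns, in three lines of arithmetic.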

The argument I want to make is about the first one. Default-path HITL. The pattern where a human sits inside every loop iteration, every action, every output. That pattern is broken in ways most product teams don’t realise until it shows up in their metrics.

Fig. 1: three patterns the term HITL gets used for.

What the peer-reviewed evidence actually says

I’ll lead with two findings, because they’re the ones I’d want a sceptical reader to have in front of them.

The first is a meta-analysis published in Nature Human Behaviour in 2024 by Vaccaro, Almaatouq and Malone. They aggregated 106 experimental studies, published between 2020 and 2023, covering 370 effect sizes. The headline: human-AI combinations performed significantly worse, on average, than the better of humans alone or AI alone. When the AI was stronger than the human, the combined team was weaker than the AI. When the human was stronger, the team was stronger than the human alone. Combining the two is not free. It is often net negative, especially on decision-making tasks.

The second is a 2024 PLOS One paper by Sele and Chugunova called, with admirable honesty, “Putting a human in the loop: increasing uptake, but decreasing accuracy of automated decision-making.” They ran a controlled experiment with 292 participants. People preferred algorithmic recommendations and used them more when they were given the ability to adjust them. But their adjustments decreased the final accuracy of the system. And, the finding that should make every product team uncomfortable: humans were less likely to override AI recommendations that contained large errors. The bigger the AI’s mistake, the smaller the human correction. The emergency brake was worst exactly when you needed it.

You can argue with the generalisability of any single study. You can’t easily argue with both of those together. Adding a human to a default decision path is not a free upgrade. On many of the tasks where teams reach for HITL, it is a downgrade.

Fig. 2: biased AI made clinicians less accurate, not more.

Why default-path HITL fails in production

There’s forty-three years of human-factors research on this. Lisanne Bainbridge wrote “Ironies of Automation” in 1983, and it predicted every failure mode the AI industry is rediscovering at expensive scale. The core argument is that the more you automate the routine work, the worse the human gets at the residual work, and the more exhausting the monitoring task becomes. If you keep the human in the loop to catch the failures the machine can’t, you’ve designed a job that requires sustained attention to rare events. Humans are bad at sustained attention to rare events. We are particularly bad at it after the system has worked correctly for a while and we’ve stopped really watching. The Therac-25 radiation therapy accidents are the textbook case: the machine worked correctly so often that operators couldn’t recognise when it didn’t.

The same pattern shows up in modern clinical decision support. Clinicians override 90 to 95 percent of medication alerts in computerised order entry systems. More than half of those overrides are attributed to the alert being clinically irrelevant. Primary care clinicians receive over 56 alerts per day and spend roughly 49 minutes a day responding to asynchronous ones. This is what HITL looks like when scaled to production volume. The humans aren’t providing oversight. They’re dismissing notifications so they can get on with the actual work. The system is lying to itself about what oversight is happening.

And there’s automation bias, the failure that goes the other way: the AI is wrong and the human defers to it anyway. In a 2023 JAMA study, 457 clinicians given diagnostic assistance from intentionally biased AI saw their accuracy fall from 73 percent to 61.7 percent [1]. The clinicians weren’t ignorant of the right answer. They knew it when working alone. They lost it when given a confident-looking machine to defer to. A separate 2025 governance analysis cited in the EDPS TechDispatch on automated decision-making found that humans providing in-the-loop oversight gave “correct” oversight only about half the time.

Put these together and you get the production reality of default-path HITL. At low risk, you’ve added latency and headcount cost to actions that didn’t need a human. At high volume, the human is desensitised and rubber-stamps. When the human does engage on a hard case, they often defer to the model exactly when they shouldn’t. The shape of what you’ve built is not “human catches AI errors.” The shape is “human enables AI errors with their signature on the bottom.”

The reframe that actually works

The question is not whether a human is involved in your AI system. It is what triggers them to be invoked, and what they can actually do once they are.

Default-path HITL gives you one trigger (“every action”) and one role (“approve or correct”). Both are wrong for almost every production AI workflow. The trigger is wrong because most actions don’t warrant a human, and putting one there desensitises them for the actions that do. The role is wrong because “approve or correct” requires sustained vigilance that no human delivers under volume. You wouldn’t design any other production system that way.

Calibrated escalation gives you a different shape. The trigger is conditional, and you choose the conditions deliberately. The role is whatever the trigger calls for. Sometimes that’s “approve before send.” Sometimes “take the conversation from here.” Sometimes “tell us this is the wrong category so we can add it to the training set.” Human time is finite and expensive. You spend it on the cases where it does real work.

The escalation literature has converged on four useful trigger categories. Every team I’ve watched build this badly was missing one or two.

The first is confidence. The model produces a confidence signal, calibrated against historical correctness, and you escalate below a threshold. This is the trigger that gets the most attention in vendor pitches, and it’s the one most likely to be wrong if you skip the calibration work. LLM confidence is harder to calibrate than classical ML confidence because the sources of uncertainty are different. Input ambiguity, reasoning path divergence, and decoding stochasticity all introduce uncertainty that doesn’t fit the aleatoric/epistemic split. Verbalised confidence with the right prompting can be reasonably well calibrated. The selective-prediction and learning-to-defer literature gives you a deeper toolkit if you need it. If your “confidence threshold” is the model’s raw self-reported probability with no calibration check, you’re guessing.
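A minimal sketch of what a calibration check can look like, assuming you have logged the model's stated confidence alongside whether each answer turned out to be correct. It computes a binned expected calibration error and then picks the lowest cut-off at which the auto-handled cases meet a target accuracy. The record format, the function names, the bin count, and the sample history are all my assumptions, and a real threshold needs far more than eight records.

```python
from typing import List, Optional, Tuple

Record = Tuple[float, bool]  # (model confidence in [0, 1], was the answer correct?)


def expected_calibration_error(records: List[Record], bins: int = 10) -> float:
    """Weighted average gap between stated confidence and observed accuracy."""
    total = len(records)
    ece = 0.0
    for b in range(bins):
        lo, hi = b / bins, (b + 1) / bins
        # The last bin is closed on the right so a confidence of 1.0 is counted.
        bucket = [(c, ok) for c, ok in records
                  if lo <= c < hi or (b == bins - 1 and c == 1.0)]
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += len(bucket) / total * abs(avg_conf - accuracy)
    return ece


def escalation_threshold(records: List[Record],
                         target_accuracy: float) -> Optional[float]:
    """Lowest confidence cut-off at which auto-handled cases meet the target."""
    for cutoff in sorted({c for c, _ in records}):
        kept = [ok for c, ok in records if c >= cutoff]
        if sum(kept) / len(kept) >= target_accuracy:
            return cutoff
    return None  # no cut-off is good enough: escalate everything


# Hypothetical history of (confidence, correct) pairs.
history = [(0.95, True), (0.9, True), (0.9, True), (0.8, True),
           (0.8, False), (0.6, True), (0.6, False), (0.55, False)]
print(escalation_threshold(history, target_accuracy=0.9))  # 0.9
```

If the calibration error is large, the threshold is not telling you what you think it is, and the number to fix is the confidence signal, not the cut-off.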

The second is reversibility. Independent of confidence, some actions are cheap to undo and some aren’t. Add a label, post a comment, run a read-only query: these tolerate high autonomy. Delete a record, transfer money, send a customer email, push a config change: these don’t. The most useful single guardrail in agentic AI right now is gating tool calls by reversibility. This is what most of the production agent guidance from Anthropic and others is converging on, and it’s the trigger that actually does the work in agentic workflows.
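Gating by reversibility can be sketched in a few lines. This is an illustration under my own assumptions, not any vendor's agent API: `Tool`, `Reversibility`, `call_tool`, and the `approve` callback are hypothetical names, and in a real system `approve` would block on a human decision rather than return immediately.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Reversibility(Enum):
    REVERSIBLE = "reversible"      # cheap to undo: label, comment, read-only query
    IRREVERSIBLE = "irreversible"  # hard to undo: delete, payment, customer email


@dataclass(frozen=True)
class Tool:
    name: str
    reversibility: Reversibility
    run: Callable[[dict], str]


def call_tool(tool: Tool, args: dict,
              approve: Callable[[Tool, dict], bool]) -> str:
    """Run reversible tools directly; pause irreversible ones for approval."""
    if tool.reversibility is Reversibility.IRREVERSIBLE and not approve(tool, args):
        return f"blocked: {tool.name} needs human approval"
    return tool.run(args)


# Hypothetical tools for illustration.
add_label = Tool("add_label", Reversibility.REVERSIBLE,
                 lambda a: f"labelled {a['id']}")
delete_record = Tool("delete_record", Reversibility.IRREVERSIBLE,
                     lambda a: f"deleted {a['id']}")


def deny(tool: Tool, args: dict) -> bool:
    """Stand-in for a human who has not approved anything yet."""
    return False


print(call_tool(add_label, {"id": 7}, deny))      # labelled 7
print(call_tool(delete_record, {"id": 7}, deny))  # blocked: delete_record needs human approval
```

The design choice that matters is that reversibility is declared on the tool, once, by whoever registers it. The model never gets to decide whether its own action is safe to undo.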

The third is policy. Categorical rules that override everything else. Insurance underwriting beyond an authority limit is referred to an underwriter, no matter how confident the model is. A loan application from a sanctioned country is escalated, no matter how clean the application looks. A complaint above a certain monetary value is not the model’s to resolve. These are rules, not signals, and they should be in code, not in the prompt.
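Here is roughly what "in code, not in the prompt" means. The sketch below mirrors the three examples above; the limits, the placeholder country code, the role names, and every identifier are hypothetical and would come from your own policy documents.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Case:
    kind: str        # e.g. "underwriting", "loan", "complaint"
    amount: float
    country: str


@dataclass(frozen=True)
class PolicyRule:
    name: str
    applies: Callable[[Case], bool]
    route_to: str


# Hypothetical values; real ones belong to compliance, not to engineering.
SANCTIONED = {"XX"}
RULES: List[PolicyRule] = [
    PolicyRule("underwriting_authority_limit",
               lambda c: c.kind == "underwriting" and c.amount > 250_000,
               "underwriter"),
    PolicyRule("sanctioned_country",
               lambda c: c.kind == "loan" and c.country in SANCTIONED,
               "compliance_officer"),
    PolicyRule("complaint_value_cap",
               lambda c: c.kind == "complaint" and c.amount > 500,
               "senior_agent"),
]


def policy_escalation(case: Case) -> Optional[PolicyRule]:
    """Return the first rule that fires. Model confidence is not an input."""
    return next((rule for rule in RULES if rule.applies(case)), None)


hit = policy_escalation(Case("loan", 12_000, "XX"))
print(hit.name if hit else "no policy trigger")  # sanctioned_country
```

Note what the function signature leaves out. There is no confidence argument, so there is no way for a confident model to talk its way past the rule.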

The fourth is user request. The customer asked for a human. This is the simplest trigger and the one too many teams under-tune. If a user has typed “agent” or “human” or “this isn’t working”, you’ve already lost the resolution on that case. Stop trying to win it back. Hand it over.
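The user-request trigger can start as something this simple. The phrase list is a hypothetical starting point built from the examples above, and it will produce false positives (someone asking about a "travel agent", say), so treat it as a sketch to tune against real transcripts rather than a finished detector.

```python
import re

# Hypothetical phrase list; extend it from your own conversation logs.
HANDOFF_PATTERN = re.compile(
    r"\b(agent|human|real person|representative)\b|this isn'?t working",
    re.IGNORECASE,
)


def wants_human(message: str) -> bool:
    """True when the user asks for a person or signals the bot has failed."""
    # Normalise curly apostrophes so both spellings of "isn't" match.
    return HANDOFF_PATTERN.search(message.replace("’", "'")) is not None


print(wants_human("Can I talk to a human?"))  # True
print(wants_human("Where is my refund?"))     # False
```

When the check is this cheap, erring toward handing over costs you very little. A false handoff wastes a few minutes of an agent's time. A missed one loses the customer.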

A good escalation system uses all four. If any trigger fires, you escalate, with full context, to the right kind of human (a domain reviewer, a senior agent, a compliance officer, not all the same person). The human’s role at that moment is the action the trigger named, not “review everything we just did.” You measure the escalation rate, the false-escalation rate, the missed-escalation rate, and the outcome quality on escalated cases. These are the metrics that tell you whether the triggers are working, and they are also the metrics that almost no team reports on, which is why so many production AI rollouts have a quiet quality problem you only discover at the post-mortem.
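The four metrics are easy to compute once you have an audited sample. The sketch below assumes each sampled case carries two after-the-fact labels from a reviewer (`needed_human` and `good_outcome`); those labels, the class name, and the exact denominators are my choices, and other definitions are defensible as long as you state them.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class AuditedCase:
    escalated: bool      # did any trigger fire?
    needed_human: bool   # reviewer's after-the-fact judgement (assumed label)
    good_outcome: bool   # e.g. resolved without a repeat contact (assumed label)


def escalation_metrics(cases: List[AuditedCase]) -> Dict[str, float]:
    escalated = [c for c in cases if c.escalated]
    needed = [c for c in cases if c.needed_human]

    def share(part: int, whole: int) -> float:
        return part / whole if whole else 0.0

    return {
        # Escalated cases as a share of all cases.
        "escalation_rate": share(len(escalated), len(cases)),
        # Escalated cases that did not need a human, as a share of escalations.
        "false_escalation_rate": share(
            sum(not c.needed_human for c in escalated), len(escalated)),
        # Cases that needed a human but never got one, as a share of those.
        "missed_escalation_rate": share(
            sum(not c.escalated for c in needed), len(needed)),
        # Good outcomes as a share of escalated cases.
        "escalated_outcome_quality": share(
            sum(c.good_outcome for c in escalated), len(escalated)),
    }


# Hypothetical audited sample of ten cases.
sample = [
    AuditedCase(True, True, True),
    AuditedCase(True, True, False),
    AuditedCase(True, False, True),
    AuditedCase(False, True, False),
] + [AuditedCase(False, False, True)] * 6

for name, value in escalation_metrics(sample).items():
    print(f"{name}: {value:.2f}")
```

The missed-escalation rate is the expensive one to measure, because it requires someone to review cases that never reached a human. It is also the one that would have told you about the quality problem before the post-mortem did.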

Klarna, read correctly

The Klarna story is the most common example in this space and the most commonly misread. Let me put it back together with the right reading.

In February 2024 Klarna announced that their AI assistant was handling two-thirds of all customer service chats in its first month, around 2.3 million conversations, with resolution times dropping from eleven minutes to under two. The company claimed the system was doing the work of seven hundred agents and saving forty million dollars a year. (Treat those numbers as Klarna’s own claims rather than independently verified; analysts have pushed back.) By early 2025, CSAT had dropped, complaints had risen, and customers were specifically frustrated by responses that felt generic and that failed on nuanced cases. By mid-2025, the CEO publicly said the company had pushed too hard on cost and quality had suffered, and announced a pivot to a hybrid “Uber-style” model with rehired remote human agents.

The dominant reading of this story is that AI customer service “doesn’t work” and that humans are back. That reading is wrong, and the reason is the central argument of this post. Klarna didn’t fail at automation. They failed at escalation design. They optimised for resolution rate as a metric, which biased the system toward not escalating, which meant the AI kept trying to resolve cases that should have gone to a human, which is precisely how you produce the “generic, repetitive” complaints customers actually had. The fix was not “put a human in every loop.” It was to redesign the trigger so the AI hands off when it should, with context, to a human who can do something the AI can’t.

Read the case study at the level of the trigger, not the level of the model, and it stops being a cautionary tale about AI and starts being one about the design discipline that needs to surround it.

The high-stakes caveat

I have to be honest about regulated high-stakes decisions, because they’re where the argument runs into its hardest test, and where it actually gets sharper.

GDPR Article 22 gives data subjects the right not to be subject to a decision based solely on automated processing if it produces legal or similarly significant effects on them. The EU AI Act’s Article 14 mandates human oversight for high-risk AI systems. NYC’s Local Law 144 imposes audit obligations on automated employment decisions. US state insurance regulators are tightening explainability and human-review requirements for underwriting decisions. The surface reading of all this is “the law requires HITL, so the argument fails for high-stakes decisions.”

The deeper reading is that the law is asking for meaningful oversight, and the empirical evidence I just walked through says default-path HITL doesn’t deliver it. The European Data Protection Supervisor’s 2025 TechDispatch on human oversight of automated decision-making explicitly recognises the rubber-stamping problem. EU AI Act Article 14 requires that the oversight design enable the people doing it to remain aware of “the possible tendency of automatically relying or over-relying on the output produced by a high-risk AI system (automation bias).” Guidance under GDPR Article 22 has been clear for years that mere rubber-stamping or superficial review doesn’t satisfy the requirement.

The legal requirement of “meaningful human oversight” is, when you look closely, asking you to do escalation properly. With designed triggers, meaningful human authority, audit logging, and the human capacity to actually do the job when invoked.
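One way to make the audit-logging part tangible is to record every human review in a form that lets you check afterwards whether it was a real review. The sketch below is my own illustration: the field names and the ten-second cut-off are assumptions, and nothing here is a statement of what any regulation requires you to log.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class OversightRecord:
    case_id: str
    trigger: str          # which trigger invoked the human
    reviewer_role: str
    ai_proposal: str
    human_decision: str
    review_seconds: float


def to_log_line(record: OversightRecord) -> str:
    """Serialise one review as a JSON line for an append-only audit log."""
    return json.dumps(asdict(record), sort_keys=True)


def rubber_stamp_share(records: List[OversightRecord],
                       min_seconds: float = 10.0) -> float:
    """Share of reviews that accepted the AI proposal unchanged, very fast."""
    if not records:
        return 0.0
    stamped = [r for r in records
               if r.human_decision == r.ai_proposal
               and r.review_seconds < min_seconds]
    return len(stamped) / len(records)


# Hypothetical reviews.
reviews = [
    OversightRecord("c1", "reversibility", "senior_agent", "refund", "refund", 4.0),
    OversightRecord("c2", "policy", "compliance_officer", "approve", "reject", 180.0),
    OversightRecord("c3", "confidence", "domain_reviewer", "close", "close", 95.0),
    OversightRecord("c4", "confidence", "domain_reviewer", "close", "close", 3.0),
]
print(to_log_line(reviews[0]))
print(rubber_stamp_share(reviews))  # 0.5
```

A fast, unchanged approval is not proof of rubber-stamping, but a reviewer queue where most decisions look like that is a strong hint that the oversight exists on paper only.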

This is the part of the argument I’d want to lean on if I were writing for a chief compliance officer, not a head of product. It is the most contrarian and the most defensible move in this whole space.

When default-path HITL is still the right call

I’m not writing this to argue that humans don’t belong in AI systems. They do. There are three contexts where default-path HITL is the correct, non-negotiable choice, and I’d push back on any team I worked with that tried to escalation-design them away.

The first is the calibration phase of a new system in a new domain. If you have no production telemetry, no eval data, and no signal you can trust, your confidence threshold is a guess. Default-path human review during a deliberate ramp-up gives you the data you need to design the trigger. Anthropic’s own data on Claude Code adoption shows this pattern: new users auto-approve about 20 percent of the time, experienced users (more than 750 sessions) auto-approve more than 40 percent of the time, and the trust is co-constructed between the model, the product, and the user as the data accumulates. You don’t start at the destination. You earn your way to it.
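One possible way to decide when the ramp-up has earned you enough trust to stop reviewing everything: track how often reviewers agree with the AI and wait until the lower end of a confidence interval on that agreement rate clears a target. This is my suggestion, not something taken from the Anthropic data. The Wilson score interval is a standard statistic, but the 95 percent target and the function names are assumptions.

```python
import math


def wilson_lower_bound(agreements: int, reviewed: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for the observed agreement rate."""
    if reviewed == 0:
        return 0.0
    p = agreements / reviewed
    denom = 1 + z * z / reviewed
    centre = p + z * z / (2 * reviewed)
    margin = z * math.sqrt(p * (1 - p) / reviewed
                           + z * z / (4 * reviewed * reviewed))
    return (centre - margin) / denom


def ready_to_relax_review(agreements: int, reviewed: int,
                          target: float = 0.95) -> bool:
    """True once the data supports the target agreement rate, not just hints at it."""
    return wilson_lower_bound(agreements, reviewed) >= target


# 19 of 20 looks like 95 percent, but twenty reviews prove very little.
print(ready_to_relax_review(19, 20))       # False
# 1,950 of 2,000 is enough evidence to start sampling instead of reviewing all.
print(ready_to_relax_review(1950, 2000))   # True
```

The point of the interval is the first example. A small sample with a good-looking rate is exactly the situation in which teams switch off default review too early.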

The second is active learning. If the point of putting the human in the loop is to teach the model, the human is doing different work and the comparison in this post doesn’t apply. They are a labeller. Different problem, different design.

The third is decisions where a wrong autonomous outcome is genuinely catastrophic and there is no audit, undo, or apology that fixes it. Some medical decisions. Some military targeting decisions. Some critical infrastructure decisions. The bar for “catastrophic” is higher than most teams claim. Be honest about whether your decision actually clears it, or whether you’re invoking the catastrophic-irreversible argument to avoid the harder work of designing the trigger properly.

Outside those three, the argument holds. Default-path HITL is the worst of the available patterns, and the pattern most teams reach for first.

Where the human actually goes

A human in the loop is not oversight when the human is a rubber stamp, and not a safeguard when they get pulled in after the damage is done. The hard part was never building the demo. It is choosing where the human goes, on what trigger, and with the authority to actually act when they arrive. Get that wrong and you have built the shape of oversight without the function.

FAQ

Is human-on-the-loop the same as escalation? Close, but not quite. Human-on-the-loop has the human monitoring throughout and able to intervene at any time. Escalation has the human invoked only when a trigger fires. In practice, mature production systems often use HOTL for the monitoring layer (dashboards, anomaly alerts) and escalation for the action layer (specific cases routed to specific humans). They complement each other.

What about regulated decisions where the law explicitly names “human review”? Read the regulation and the regulator commentary together, not just the law’s heading. In every major framework I’ve looked at (GDPR Article 22, EU AI Act Article 14, FDA SaMD guidance, US state insurance regulation), the substantive requirement is meaningful human review with authority and capacity to act. That’s what designed escalation gives you. Rubber-stamping HITL doesn’t satisfy the substance even if it satisfies the heading.

Caveats and claim-safety notes

  • The Vaccaro et al. meta-analysis draws mostly on non-experts doing synthetic tasks. Generalising it to production expert workflows is one step of inference, and I’ve made it; that’s a fair place to push back, not a refutation.
  • The 73 to 61.7 percent clinician drop was with intentionally biased AI. The size is illustrative of the failure mode, not a literal prediction for your system.
  • LLM confidence calibration is workable in 2026 but not solved. If nobody on your team can do it, that’s the gap to close before escalation design pays off.
  • Klarna’s dollar and headcount figures are the company’s own and disputed by analysts. The CSAT decline and hybrid pivot are well documented; treat the specifics as illustrative.
  • The GDPR Article 22 and EU AI Act Article 14 reading is mine. It’s consistent with the EDPS commentary, but a lawyer could argue you need default-path HITL for specific decisions in specific sectors. In regulated territory, get advice for your use case.

Footnotes

[1] Jabbour S, Fouhey D, Shepard S, Valley TS, Kazerooni EA, Banovic N, Wiens J, Sjoding MW. Measuring the Impact of AI in the Diagnosis of Hospitalized Patients: A Randomized Clinical Vignette Survey Study. JAMA. 2023;330(23):2275–2284. doi:10.1001/jama.2023.22295. Baseline clinician accuracy 73.0% (95% CI 68.3–77.8); accuracy under systematically biased AI without explanations 61.7% (95% CI 55.3–68.2); absolute drop 11.3 percentage points (95% CI 7.2–15.5, p < .001). n = 457 hospitalist physicians, nurse practitioners, and physician assistants across 13 US states.

If you’re designing this right now

If you can’t name your escalation triggers in one sentence each, the system has an oversight gap you haven’t found yet. That’s what I help teams fix.