When to Force the LLM, and When to Use a Button

Some product decisions belong to the model. Others belong to boring UI. A practical guide to when to use the LLM, when to use a button, and why voice AI makes the trade-off obvious.

id 0x01 · cluster: Agents and Workflows · read ~7 min
tags: agents, workflows, voice-ai, ui-design
[Figure: Sometimes the answer is the button.]

Short answer

If the user’s intent has a finite list of options and the cost of misreading it is high, use a button. If the intent is open-ended and the cost of asking a clarifying question is low, use the model. Most teams default to the model because it feels more impressive. They pay for it in accuracy, latency, and trust.

The most expensive AI lesson was free

The Humane AI Pin and the Rabbit R1 were two of the most-watched AI consumer hardware launches of 2024. Both went badly. The post-mortems converge on something almost embarrassingly simple: pure voice doesn’t work for roughly 80% of real use cases. You need to see lists. You need to compare options. You need to re-read what was just said. Removing the screen sent the experience back fifteen years, and the model couldn’t paper over it.

The deeper lesson is not “voice is bad.” Voice is fine. The lesson is that once you have a language model in your stack, the temptation to put it in charge of everything is what kills the product. Navigation. Confirmation. State. Control. The model ends up owning all of them, and the product ends up brittle in places that should never have been brittle. The Pin and the R1 are the spectacular version of a mistake almost every AI feature team makes in some form.

What “forcing the LLM” actually costs

Three things, paid at once.

Accuracy goes first. Every step you hand to the model is a step where intent can be misread, and the joint probability of running five of those steps cleanly drops faster than most people expect.
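To put rough numbers on that drop, here is a minimal sketch. It assumes every step is interpreted correctly with the same probability and that steps fail independently, which real conversations only approximate. The function name and the accuracy values are illustrative, not measurements.

```python
def chain_success(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step in a chain is read correctly,
    assuming independent steps with identical accuracy."""
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_accuracy ** steps


# Illustrative accuracies only; measure your own.
for accuracy in (0.99, 0.95, 0.90):
    print(f"{accuracy:.2f} per step -> {chain_success(accuracy, 5):.3f} over five steps")
```

Under those assumptions, a step that is right 95% of the time gives a clean five-step run only about 77% of the time, and 90% per step lands near 59%.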

Latency comes next. Each step is an inference call. A button takes roughly zero milliseconds. Natural voice turn-taking leaves a budget of around half a second, and an inference call can’t promise to fit inside it. Repeat the same action ten times in a session and the maths is brutal.

Then there’s cost, which is the silent one. The 2026 story in AI economics isn’t that prices went up. It’s that agentic workflows multiplied token use across every step. One user action can fan out to plan, retrieve, call a tool, validate, retry, respond. If it could have been a button, you just paid for six inference calls to do the work of one click.
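A back-of-the-envelope sketch of that fan-out, for anyone who wants to plug in their own numbers. The six stage names come from the paragraph above; the per-call latency and the ten actions per session are hypothetical placeholders, and the estimate assumes the calls run one after another.

```python
from dataclasses import dataclass

# The six stages one agentic user action can fan out to.
AGENT_STAGES = ("plan", "retrieve", "call_tool", "validate", "retry", "respond")


@dataclass(frozen=True)
class SessionEstimate:
    inference_calls: int
    added_latency_ms: float


def estimate_session(actions: int, calls_per_action: int,
                     latency_per_call_ms: float) -> SessionEstimate:
    """Total inference calls and summed model latency for one session,
    assuming calls run sequentially."""
    calls = actions * calls_per_action
    return SessionEstimate(calls, calls * latency_per_call_ms)


# Hypothetical inputs: ten control actions per session, 400 ms per call.
via_model = estimate_session(10, len(AGENT_STAGES), 400.0)
via_button = estimate_session(10, 0, 400.0)
print("model: ", via_model)
print("button:", via_button)
```

With those placeholder inputs the model path makes 60 inference calls per session and the button path makes none. The absolute figures matter less than the ratio.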

[Figure: Voice has a budget the LLM can’t promise to hit.]

A small fix that taught me the pattern

I worked on a voice AI workflow at Brokerloop. The system ran structured intake conversations with insurance brokers, and we hit this wall early. The model handled the open part of the conversation well: asking the right questions, listening, capturing detail. But we also wanted users to be able to do four things at any moment: pause, go back to the previous question, skip to the next one, and confirm that what the model had captured was correct.

For a few weeks, all four lived inside the model. The user could say “go back” or “wait, that’s wrong” and the model would try to interpret intent. Sometimes it worked. Sometimes “go back” was read as “stop the call.” Sometimes “actually, the date is wrong” was acknowledged verbally but not reflected in the captured data. Users got frustrated because they were never sure the correction had landed.

The fix was small and slightly embarrassing: we added buttons to the companion web interface. Pause. Previous question. Next question. Confirm captured value. The four things the model kept guessing at became four things the user could click. The voice agent kept doing the open-ended part: listening, asking, capturing. The four deterministic controls moved out of the prompt and into the UI.
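A minimal sketch of what moving those four controls out of the prompt can look like. This is not the Brokerloop code: the class names, the question list, and the choice to make pause a toggle are all assumptions for illustration. The point it shows is that a button press changes application state directly, with no inference call and nothing to interpret.

```python
from dataclasses import dataclass, field
from enum import Enum


class Control(Enum):
    """The four deterministic controls from the companion UI."""
    PAUSE = "pause"
    PREVIOUS = "previous"
    NEXT = "next"
    CONFIRM = "confirm"


@dataclass
class IntakeSession:
    """Hypothetical session state owned by the application, not the prompt."""
    questions: list[str]
    index: int = 0
    paused: bool = False
    confirmed: set[int] = field(default_factory=set)

    def apply(self, control: Control) -> None:
        """Apply a button press directly to state."""
        if control is Control.PAUSE:
            self.paused = not self.paused  # assumption: the button toggles
        elif control is Control.PREVIOUS:
            self.index = max(0, self.index - 1)
        elif control is Control.NEXT:
            self.index = min(len(self.questions) - 1, self.index + 1)
        elif control is Control.CONFIRM:
            self.confirmed.add(self.index)

    @property
    def current_question(self) -> str:
        return self.questions[self.index]


session = IntakeSession(["Policy type?", "Renewal date?", "Sum insured?"])
session.apply(Control.NEXT)     # user clicks "Next question"
session.apply(Control.CONFIRM)  # user clicks "Confirm captured value"
print(session.current_question, session.confirmed)
```

In a design like this the voice agent would read `current_question` from the session and never decide navigation itself.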

Accuracy on captured data went up. Frustration went down. Latency on those four interactions dropped to roughly zero. None of this was a model upgrade. It was a workflow change.

Back-channel UI: the pattern, named

The thing we ended up building isn’t really a chat interface, and it isn’t really a form. It’s a back-channel: a persistent surface running alongside the voice flow, never interrupting it, but always available to take a deterministic action from the user.

The controls are always there. They don’t appear and disappear with the conversation, and the user doesn’t have to wait for a turn or a prompt to use them.

They run parallel to the voice. Pressing one doesn’t cut the model off. The flow advances because the user pressed something. The model keeps listening and keeps speaking; the user steers.

They’re authoritative. When the button and the model disagree, the button wins. If the user clicks “that’s wrong,” the captured value is wrong, full stop.
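One way to encode "the button wins" is to treat anything the model captures as provisional until the user acts on it. The sketch below is an assumption about how that rule could be written down, not a description of a specific system; `CapturedField`, `Status`, and the method names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    PROVISIONAL = "provisional"  # written by the model, not yet reviewed
    CONFIRMED = "confirmed"      # user clicked confirm
    REJECTED = "rejected"        # user clicked "that's wrong"


@dataclass
class CapturedField:
    value: Optional[str] = None
    status: Status = Status.PROVISIONAL

    def model_write(self, value: str) -> bool:
        """The model proposes a value. Returns False if the user has
        already confirmed the field, because the button outranks the model."""
        if self.status is Status.CONFIRMED:
            return False
        self.value = value
        self.status = Status.PROVISIONAL
        return True

    def user_confirm(self) -> None:
        """Button press: lock the current value."""
        self.status = Status.CONFIRMED

    def user_reject(self) -> None:
        """Button press: the value is wrong, whatever the model believes."""
        self.value = None
        self.status = Status.REJECTED


renewal = CapturedField()
renewal.model_write("12 March")
renewal.user_confirm()
print(renewal.model_write("21 March"), renewal.value)  # model cannot overwrite
```

After a rejection the model may propose a new value, but it stays provisional until the user confirms it.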

[Figure: Where the model sits inside the workflow.]

The button test

When you’re trying to decide whether a step in your product should be owned by the model or by something deterministic, run it through these four questions.

  1. Is the set of valid outcomes finite and known?
  2. Would a wrong interpretation here cost the user real time, money, or trust?
  3. Is the user trying to control the flow (pause, skip, undo) rather than express something?
  4. Will this happen many times per session?

If you’re answering yes to two or more, that step almost certainly belongs to a button, form, dropdown, slider, or rule, not the model. Save the model for the parts where the input is open-ended and the human gains something from being heard rather than clicking.
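The four questions and the two-or-more threshold can be written down as a small checklist function. This is only a sketch of the rule as stated above; the `Step` fields, the return labels, and the two example steps are illustrative assumptions, and the answers still have to come from a human who knows the product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    finite_outcomes: bool    # 1. valid outcomes are finite and known
    costly_if_misread: bool  # 2. a misread costs real time, money, or trust
    flow_control: bool       # 3. user is steering (pause, skip, undo)
    frequent: bool           # 4. happens many times per session


def button_test(step: Step, threshold: int = 2) -> str:
    """Return 'deterministic' when at least `threshold` answers are yes,
    otherwise 'model'."""
    yes_count = sum((step.finite_outcomes, step.costly_if_misread,
                     step.flow_control, step.frequent))
    return "deterministic" if yes_count >= threshold else "model"


# Illustrative answers for two steps in a voice intake flow.
go_back = Step("go back to previous question", True, True, True, True)
describe = Step("describe the situation in your own words", False, False, False, False)

for step in (go_back, describe):
    print(f"{step.name}: {button_test(step)}")
```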

The structured-input research nobody is quoting

If you suspect this is just my taste, there’s a 2025 paper on generative interfaces for language models (Chen, Zhang, Zhang, Shao & Yang, arXiv 2508.19227) that ran the experiment cleanly. They compared two ways of letting an LLM handle a user request: one where the interface was described to the model in free natural language, and one where it was described as a structured representation with explicit fields and constraints. Structured representations won, measurably, in human evaluation. Around a 13% to 17% lift on overall win rates.

The takeaway sits one level deeper than “use buttons for users.” It’s that structured beats free-text even when the LLM is the one doing the work. The model is not actually better off in ambiguity. We just keep assuming it is, because the demo runs fine.

When the model is right, and when it isn’t

Some places do call for the model:

  • The input is unstructured language and you can’t predict the shape of it. Free-text intake, summarisation, classification of fuzzy categories, extraction from messy documents.
  • The user wins something specifically by talking rather than tapping. Hands-busy contexts. Field workflows. Anywhere typing or clicking adds real friction.
  • The branching is too wide for a UI. Thirty leaves where most users only touch three: a model that routes is cheaper than thirty buttons no one will read.
  • Personalisation needs context you can’t pre-encode. Tone, follow-up phrasing, adaptive prompts.

And the cases where deterministic logic does the job better:

  • Anything binary and consequential. Confirmations. Authorisations. Payments. Account deletions.
  • Anything stateful that the user expects to control. Pause, resume, go back, skip, restart.
  • Anything where the valid answer set is small and the cost of a misread is large. Yes/no, today/tomorrow, this account/that account.
  • Anything that has to be auditable or compliant. Regulators don’t want to read your prompt.
  • Anything that has to happen in under 500 milliseconds.

A diagnostic you can run today

Walk through one full session of your product. Note every place a user has to repeat themselves, rephrase, correct, or say “no, that’s not what I meant.” Note every place they pause because they’re not sure the model heard them. Note every place they ask the model to do something procedural rather than something expressive.

Each of those notes is a candidate for a button, a form field, or a fixed rule. You probably don’t need to rebuild the product. You need to peel the model off a few specific steps.
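If you have session transcripts, a crude first pass over them can be automated. The sketch below assumes a transcript is just a list of user utterances in order, and it only catches verbatim repeats and a handful of correction phrases. The marker list is a made-up starting point; it is no substitute for reading the session yourself.

```python
# Hypothetical phrases that suggest the user is correcting or steering.
CORRECTION_MARKERS = ("not what i meant", "that's wrong", "go back", "actually")


def normalize(turn: str) -> str:
    """Lowercase, straighten curly apostrophes, and collapse whitespace."""
    return " ".join(turn.lower().replace("\u2019", "'").split())


def flag_turns(user_turns: list[str]) -> list[tuple[int, str]]:
    """Return (turn index, reason) for turns that look like friction."""
    flagged = []
    previous = None
    for i, turn in enumerate(user_turns):
        text = normalize(turn)
        if previous is not None and text == previous:
            flagged.append((i, "repeat"))
        elif any(marker in text for marker in CORRECTION_MARKERS):
            flagged.append((i, "correction"))
        previous = text
    return flagged


turns = [
    "My renewal is in March",
    "My renewal is in March",
    "No, that\u2019s not what I meant",
    "The premium is fine",
]
print(flag_turns(turns))
```

Each flagged turn is a place to ask whether a button, a form field, or a fixed rule would have saved the user the effort.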

What getting it right looks like

You’ll know you’ve drawn the line correctly when users stop noticing the model in the places where it’s doing its job, and start noticing the buttons in the places where they need control. The model gets quieter, more useful, and more accurate. The product gets faster and more trusted. Nothing about it looks like a demo, which is exactly the point.

[Figure: The line, written out.]

FAQ

Doesn’t adding buttons make the experience feel less like AI? Yes, and that’s usually a feature. Most users don’t want an “AI experience.” They want the thing they were trying to do to happen reliably. The model is a means, not an aesthetic.

How does this relate to agents? Same idea, scaled up. An agent with full autonomy over a long task is the design where the LLM owns the most. A workflow with the model inside specific steps is the design where it owns the least. Most real products live closer to the workflow end, and the ones that try to live near the agent end usually walk it back.

Is this just “use traditional software inside your AI agent”? Related, but a different angle. That argument is about the back-end parts of an agent that should stay deterministic: rules, queues, database constraints, retries. This one is about the front-end: what the user clicks, sees, and steers with.

Caveats

The Brokerloop example is described in general terms. Accuracy and frustration improvements were observed qualitatively across the team, not measured against a fixed benchmark or an A/B test. The ~500ms figure is a commonly cited threshold in conversational UX work, not a number from a specific study in my back pocket. Your stack will produce its own numbers.

If you’re working through this

If you’re deciding how much of your product the model should own, the cheapest version of this conversation is to walk through your current workflow with someone who has shipped one. Reach out if that’s useful.