Stop running one coding agent for everything (my May 2026 setup)

A working two-agent setup for founder-engineers in 2026. Claude as the continuity layer, Codex as the bounded specialist, hooks and MCPs as the controlled automation layer.

For about a year I tried to make a single coding agent do everything. Planning, memory, refactors, audits, migrations, test generation, browser QA. It mostly worked. It also produced the same failure mode every time. The agent would start a job well, drift into a side investigation, lose the plot of the original request, and either over-edit or run out of context exactly when the careful part was needed.

The fix in May 2026, for me, was not a better single agent. It was giving two agents defined roles and writing the handoff down. Claude is the continuity layer: planning, memory, product judgement, user-facing synthesis, small local edits. Codex is the deep specialist: bounded review, audits, migrations, test generation, simulations, large-codebase tracing, focused implementation slices. Hooks, MCPs and skills are the controlled automation layer that codifies the split and catches obvious unsafe moves.

This is a working setup, not a manifesto. If you run one agent and it suits you, do not break it. If you run several long-lived projects with their own conventions, persistent memory, and tasks you do not want bleeding into your planning context, this is the playbook I would lift and adapt.

Why this is practical now

A year ago I ran two or three Claude windows side by side on the same project. Context filled up, I copied state between them by hand, and continuity died every few hours when the oldest one had to be killed. Same for Codex.

That constraint is mostly gone. Claude’s 1M-token context window shipped in August 2025 and is now the default across the Sonnet 4 line. One session can hold a real project for a working day before it needs compaction. Claude Code and Codex CLI subagents take the side investigations into their own context windows and hand back a summary, so the orchestration thread does not fill up with the contents of every file the agent had to read. The split used to need two of everything just to survive the context limit. Now it needs one of each.

The short answer

Use Claude for orchestration and continuity. Use Codex for bounded specialist work. Make every handoff explicit: working directory, sandbox, model, expected output, verification command. Put the per-project rules in AGENTS.md and CLAUDE.md and let both agents read them. Use Claude Code hooks for guardrails, not magic. Treat MCPs and subagents as least-privilege adapters. Audit your skills folder this week.

Most people will not bother because the single-agent setup is already good enough most days. That is exactly the trap. The cost of the split is borne up front. The benefit shows up six months in, when you can hand a 200-file migration to Codex while Claude keeps holding the thread of why you are doing it.

The split, written down

The handoff is the load-bearing part. If you cannot say in one sentence what each agent is for, the agents will quietly converge on whichever one your fingers reach for first.

| Role | Claude | Codex |
| --- | --- | --- |
| Planning and product framing | yes | rarely |
| Memory and project continuity | yes | no |
| User-facing synthesis | yes | no |
| Small local edits | yes | sometimes |
| Code review and audits | sometimes | yes |
| Large-codebase tracing | sometimes | yes |
| Test generation | sometimes | yes |
| Migrations and refactors | rarely | yes |
| Simulations and replay harnesses | rarely | yes |

That table is not original. It is a written form of the orchestrator/specialist pattern any team that has run microservices already knows. The point is that it has to be written. Otherwise the cleverer of the two agents (whichever you find more pleasant to talk to) will absorb both jobs and you will lose the gain.

In practice the handoff prompt I send to Codex looks like this:

cwd: ~/code/voiceai/services/orchestrator
sandbox: writable
model: gpt-5.5, reasoning: high
task: rename JobStatus enum from started|running|done to queued|running|completed|failed across Go domain, mappers, DTOs, frontend TS types, Python schemas, voice contracts. Update tests.
expected output: branch, diff summary, list of changed files grouped by service.
verify: go test ./... && pnpm test && pytest. Stop and report on first failure rather than retrying.

Five fields around the task (working directory, sandbox, model, expected output, verification command), all explicit. The agent does not have to guess what good looks like. When Codex does something I did not expect, it is almost always because one of those five fields was implicit.
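One way to keep the packet from going out half-filled is to build it from a small structure that refuses to render when a field is blank. The following is a minimal sketch under that assumption, not a feature of either tool: the `Handoff` class and its `render` method are hypothetical names, and the field names simply mirror the example packet above.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Handoff:
    """One handoff packet. Field names mirror the example packet."""
    cwd: str
    sandbox: str
    model: str
    task: str
    expected_output: str
    verify: str

    def render(self) -> str:
        # Refuse to render when any field is blank: an implicit field is
        # exactly what the written handoff is meant to prevent.
        missing = [f.name for f in fields(self) if not getattr(self, f.name).strip()]
        if missing:
            raise ValueError(f"handoff is missing: {', '.join(missing)}")
        return "\n".join(
            f"{f.name.replace('_', ' ')}: {getattr(self, f.name).strip()}"
            for f in fields(self)
        )
```

Calling `render()` on a complete packet returns the same `key: value` layout as the example; a packet with an empty `verify` raises before anything is sent.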

Fig. 1: the orchestrator / specialist split, with the five-field handoff packet between them. The diagram shows Claude as the orchestration layer holding planning, memory and synthesis, with a narrow handoff arrow to Codex holding audits, migrations, tests and simulations, plus a returning verification arrow.

Audit your skills folder this week

This is the one part of the post I would underline if I were sitting across from you.

I went through my own skill ecosystem doing exactly the audit I am about to recommend, and found a user-invocable production-analysis skill that contained sensitive access material. Not credentials in a clever obfuscated form. Material that should never have been in a skill file in the first place. The skill ran fine. It would have continued to run fine. The risk was that it was sitting in a folder a coding agent reads on every session.

The audit takes an hour. Open every skill in ~/.claude/skills/ and ~/.codex/skills/, or wherever your tool keeps them. For each one, ask: does this contain anything that would not survive being read aloud at a standup; does it pass secrets via positional arguments rather than environment variables or a secret store; does it write to a path the agent should not be writing to; does it depend on a tool surface (browser, shell, database) that should be scoped to a subagent rather than available to the whole session?

Anything that fails the first question moves out of the skill file immediately. Either to a runtime prompt the user has to confirm, or to a proper secret store. The skill becomes the workflow and only the workflow. Do not rely on the agent to safely ignore sensitive material it can read.
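A first pass over the skills folder can be automated before the manual read. The sketch below assumes skills are plain text files under a directory tree; `audit_skills` and `SUSPECT_PATTERNS` are hypothetical names, and the three regexes are illustrative rather than exhaustive. A hit means "open this file and read it", and a clean result does not replace the hour of reading.

```python
import re
from pathlib import Path

# Illustrative patterns only. A hit means "read this file", not "this is a secret".
SUSPECT_PATTERNS = {
    "private key block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "AWS access key id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "inline secret assignment": re.compile(
        r"(?i)(password|passwd|secret|token|api[_-]?key)\w*\s*[:=]\s*['\"]?[^\s'\"]{8,}"
    ),
}


def audit_skills(root: Path) -> list:
    """Return (file, line number, pattern label) for each suspicious line under root."""
    findings = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binaries and unreadable files
        for lineno, line in enumerate(text.splitlines(), start=1):
            for label, pattern in SUSPECT_PATTERNS.items():
                if pattern.search(line):
                    findings.append((str(path), lineno, label))
    return findings
```

Pointing it at `Path.home() / ".claude" / "skills"` and then `Path.home() / ".codex" / "skills"` gives a list of lines to inspect by hand. The remaining audit questions (secrets passed as positional arguments, write paths, tool surface) still need a person.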

If you take nothing else, run the audit.

AGENTS.md and CLAUDE.md hold the local contract

Project guidance lives in the repo, not the global config. Codex reads AGENTS.md, Claude reads CLAUDE.md, and both override anything my global setup says. The two files often share content; that is fine. The discipline is that the per-project file is authoritative. If you want one source of truth, import @AGENTS.md into CLAUDE.md and put Claude-specific rules underneath.

What goes in there for me: exact build, test and lint commands with the actual flags (not “run the tests”); the directories the agent is allowed to write to; the compact output preferences for the project; the least-privilege MCP list for this repo (trading repos enable Polymarket and whale-tracking MCPs, the voice repo enables none); and the “do not touch” list (.env, anything under ~/.codex, ~/.claude, ~/.aws, ~/.ssh). The AGENTS.md spec is right that exact-command sections do the heaviest lifting. Agents execute commands literally, and ambiguity becomes errors.

A small structural choice that paid for itself: I put a Projects/{project}/_overview.md in Obsidian for every active project, and the global Claude config has a hook that injects a compact summary of it into the session. The agent walks in already knowing what the project is, what is broken, and what the user has been working on. Same idea as AGENTS.md, but for the durable narrative rather than the build commands.

Four hook patterns that pay rent

Hooks fire deterministically every time, which is the whole point. Unlike prompt instructions the agent can talk itself out of, hooks run on the lifecycle event regardless. Keep them short and boring. Mine are four.

The first is protect-sensitive-files.sh. It blocks writes to .env, anything under ~/.aws, ~/.ssh, ~/.codex, ~/.claude, and MCP credential paths. It fires on PreToolUse for any file-write tool. Failure is loud, not silent.
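The actual protect-sensitive-files.sh is a shell script and is not reproduced here. The following is a minimal Python sketch of the same check, under two assumptions worth verifying against the current hooks reference: that the hook receives a JSON event on stdin with the target under `tool_input.file_path` and the session directory under `cwd`, and that exit code 2 blocks the tool call and surfaces stderr. The names `is_protected` and `run` are hypothetical, and MCP credential paths are left out because they vary by setup.

```python
import json
from pathlib import Path

# Directories and file names from the list above. MCP credential paths
# differ per setup, so they are left for the reader to add.
PROTECTED_DIRS = tuple(Path.home() / name for name in (".aws", ".ssh", ".codex", ".claude"))
PROTECTED_NAMES = frozenset({".env"})


def is_protected(raw_path: str, cwd: str = ".") -> bool:
    """True if raw_path is a .env file or sits under a protected directory."""
    path = Path(raw_path).expanduser()
    if not path.is_absolute():
        path = Path(cwd) / path
    path = path.resolve()
    if path.name in PROTECTED_NAMES:
        return True
    for protected in PROTECTED_DIRS:
        root = protected.resolve()
        if path == root or root in path.parents:
            return True
    return False


def run(stream, err) -> int:
    """Read one hook event from stream and return the exit code to use."""
    event = json.load(stream)
    target = event.get("tool_input", {}).get("file_path", "")
    if target and is_protected(target, event.get("cwd", ".")):
        # Assumption: exit code 2 blocks the call and stderr reaches the agent.
        print(f"blocked: {target} is on the protected list", file=err)
        return 2
    return 0
```

A hook script built on this would end with `sys.exit(run(sys.stdin, sys.stderr))`. Resolving the path before comparing means a relative path or a `~` prefix cannot slip past the check.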

The second is codex-delegate.sh. It pattern-matches the user prompt for words like “audit”, “review”, “migrate”, “profile”, “simulate”, “trace”. When it matches, it surfaces a one-line nudge: “this looks like specialist work, consider handing it to Codex”. It does not block. The nudge is enough.
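The matching itself is small. Below is a sketch in Python, assuming a plain word-boundary regex over the trigger words listed above; `delegation_nudge` is a hypothetical name, and how the returned line is surfaced to the session depends on the hook event used, which is not shown.

```python
import re
from typing import Optional

# Trigger words from the paragraph above, plus simple suffixes
# ("audits", "migrated"). Deliberately loose: the result is only a nudge.
TRIGGERS = re.compile(
    r"\b(audit|review|migrate|profile|simulate|trace)(s|d|ed)?\b",
    re.IGNORECASE,
)
NUDGE = "this looks like specialist work, consider handing it to Codex"


def delegation_nudge(prompt: str) -> Optional[str]:
    """Return the one-line nudge when the prompt mentions a trigger word."""
    return NUDGE if TRIGGERS.search(prompt) else None
```

A prompt such as "update the user profile page" will also match. Because the hook only suggests and never blocks, a false positive costs one ignored line.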

The third is compact-test-output.sh. It rewrites simple pytest and vitest invocations to compact reporters, and compresses successful runs to a one-line summary. Failures pass through verbatim. The trap to avoid: never let the compactor swallow stderr. If it ever hides a failure, throw it out.
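The compression half of that hook can be sketched as a pure function. This assumes the caller already has the command, exit code, stdout and stderr of a finished test run; `compact` is a hypothetical name, and the rewriting of pytest and vitest invocations to compact reporters is not covered.

```python
def compact(command: str, returncode: int, stdout: str, stderr: str) -> str:
    """Collapse a passing run to one line; return a failing run untouched."""
    if returncode != 0:
        # Failures pass through verbatim, stderr included.
        return stdout + stderr
    lines = [line for line in stdout.splitlines() if line.strip()]
    last = lines[-1].strip() if lines else "no output"
    summary = f"ok: {command} ({last})"
    # Even on success, stderr is appended rather than dropped.
    if stderr.strip():
        return summary + "\n" + stderr
    return summary
```

The design choice is that the exit code alone decides whether to compress. Anything non-zero is returned byte for byte, so the compactor cannot hide a failure by misreading the output format.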

The fourth is obsidian-awareness.sh. On SessionStart it reads the project’s _overview.md from Obsidian and injects a 40-line summary. Cheap, durable, the single hook that most changed how the agent talks to me at the start of a session.
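A minimal sketch of the read-and-trim step follows, assuming the vault layout `Projects/{project}/_overview.md` described earlier. Taking the first 40 non-blank lines is a stand-in for whatever summarising the real hook does; `overview_summary` is a hypothetical name, and the mechanism that injects the text at session start is not shown.

```python
from pathlib import Path

MAX_LINES = 40  # matches the 40-line summary described above


def overview_summary(vault: Path, project: str, max_lines: int = MAX_LINES) -> str:
    """Return the first max_lines non-blank lines of the project's _overview.md."""
    overview = vault / "Projects" / project / "_overview.md"
    if not overview.is_file():
        # No overview yet: inject nothing rather than fail the session.
        return ""
    lines = [
        line.rstrip()
        for line in overview.read_text(encoding="utf-8").splitlines()
        if line.strip()
    ]
    return "\n".join(lines[:max_lines])
```

A hard line cap keeps the injected context predictable in size, which matters more at session start than a clever summary does.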

Two patterns I do not recommend. The first is anything that auto-runs a destructive cleanup (“tidy”, “format-and-commit”) on Stop. Always have an explicit opt-in. The second is hooks that rewrite the user prompt without showing the rewrite. If the agent is going to act on a different prompt than the one you typed, you need to be able to see what it actually saw.

MCPs and subagents, least privilege by default

MCPs are useful and dangerous in the same way. The protocol Anthropic shipped in November 2024 solved the real M-models-by-N-tools problem, but it also turned tool surface area into a config decision instead of a code decision. My rule is the same as anyone’s networking rule. Default off, scoped per project, explicit when on.

The split I actually run: OpenAI questions go to the openaiDeveloperDocs MCP first, with the official OpenAI docs as a fallback (not web search). Browser QA does not get a global Playwright MCP. It gets a browser-tester subagent with Playwright tools scoped to that agent’s context, so the main session never sees the browser surface area. Claude Code’s subagent docs describe this pattern well. The practical reason to use it is so a planning conversation does not accidentally open a Chrome window. Linear and Sentry are off by default and turn on when I am actually working a ticket or chasing an error. Obsidian is on and treated as durable memory, not scratchpad. Trading-specific MCPs (Polymarket, Kalshi, whale tracking) live with the trading repos and nowhere else. Stitch, Remotion and shadcn MCPs live with the design-to-code repos and nowhere else.

Treat MCPs, hooks, plugins and skills as executable automation with least privilege. None of them is “just config”.
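The "default off, scoped per project, explicit when on" rule can be written as a small allowlist. This is an illustrative sketch only: the repo kinds, the lowercase server names and the `enabled_mcps` function are hypothetical labels for the groupings described above, not real configuration keys of either tool.

```python
from typing import Iterable

# Hypothetical labels for the groupings described above.
ALWAYS_ON = frozenset({"obsidian"})
ON_DEMAND = frozenset({"linear", "sentry"})
PER_REPO = {
    "trading": frozenset({"polymarket", "kalshi", "whale-tracking"}),
    "design-to-code": frozenset({"stitch", "remotion", "shadcn"}),
    "voice": frozenset(),
}


def enabled_mcps(repo_kind: str, opt_in: Iterable[str] = ()) -> frozenset:
    """Return the MCP servers a session may use; anything unlisted stays off."""
    requested = frozenset(opt_in)
    unknown = requested - ON_DEMAND
    if unknown:
        # Repo-scoped servers cannot be switched on from another repo.
        raise ValueError(f"not an on-demand MCP: {', '.join(sorted(unknown))}")
    return ALWAYS_ON | PER_REPO.get(repo_kind, frozenset()) | requested
```

An unknown repo kind gets only the always-on set, and asking for a trading server from the voice repo raises instead of quietly widening the tool surface.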

Fig. 2: four project families share one orchestrator/specialist setup, with per-repo files holding the local contract. The map shows trading and finance systems, voice and insurance automation, web3 game monorepos, and frontend content and product apps, all connected to a single shared agent setup at the centre.

Where the setup bends

Three places I have had to adjust.

The first is small repos. For a single isolated script, a one-off data extraction, or a five-file CLI, the orchestrator/specialist split is overhead. I use Claude alone for those and skip the handoff ceremony.

The second is greenfield work where the architecture is still molten. Codex is excellent at executing inside a known shape and weaker at proposing the shape. If I do not yet know what the system wants to be, I keep it in Claude until the bounded contexts have edges.

The third is when a Codex run goes wrong inside a long migration. The instinct is to let it try again. The better move is to stop it, hand the failure log back to Claude, ask Claude what changed about the assumptions, and only then re-handoff. The split is what makes the recovery clean. If both agents try to be the planner, the recovery turns into another drift.

What to avoid

Do not give one agent both the orchestration role and a large set of MCP tools at the same time. The combination produces an agent that is too distractible to plan and too unbounded to specialise. Either it is the planner with very few tools, or it is the specialist with many tools and a narrow brief.

Do not let global hooks reach into project-level decisions. Global hooks are for things that are always true (protect secrets, compact test output, suggest delegation). Anything project-specific belongs in the per-project AGENTS.md or CLAUDE.md, where it can be reviewed alongside the code.

Do not skip the handoff packet to save time. The five-field handoff is the single intervention that most reduces wasted runs. The minute you start sending Codex three-word prompts, you are paying for the agent twice and getting the benefit once.

Do not treat the changelogs as optional reading. Codex and Claude Code moved quickly through April and May 2026, especially around hooks, MCP/plugin support, goals, browser/app workflows, subagents and workspace flows. Some capabilities, like image input, landed earlier but now sit inside the current working surface area. If you have not re-read both docs/changelogs in the last month, your mental model of what each tool does is probably stale.

The split is the thing

The setup I run is not the only one that works. It is the one that survived six months of real projects across trading bots, voice automation, web3 monorepos, and frontend product work without reproducing the failure mode I opened with. It comes down to two agents with written roles, not one agent with infinite ambition. Everything else is implementation detail.

Caveats

This is a personal setup as of May 2026. Tooling moves fast, so any specific feature claim in this post should be sanity-checked against the relevant changelog before you copy it.

I have not benchmarked the split against a single-agent baseline. The improvement is observed, not measured. A reader running rigorous evals will produce a more confident answer than I have.

The “audit your skills folder” anecdote is mine. I have not tested whether the same finding generalises across the skills ecosystem. Treat it as a prompt to check, not as a claim about anyone else’s setup.

Why layered codebases punish humans and AI coding agents (and what vertical slice fixes): structure the repo so a bounded specialist can land a slice without tracing the whole stack first.
Why most AI agent programs are repeating RPA’s mistakes: the operating-model view of the same written-down, least-privilege discipline, scaled past a single developer’s setup.
Most “human-in-the-loop” is escalation done badly: stop the specialist, hand the failure back to the planner, re-handoff only once the assumptions are clear.

References

Claude Code hooks reference: official hook lifecycle events, command/HTTP/prompt/agent hook types.
Claude Code subagents: independent context windows, tool restriction, scoping rules.
AGENTS.md open standard: the cross-vendor per-repo agent instruction convention.
Introducing the Model Context Protocol: Anthropic, November 2024.
Claude Sonnet 4 now supports 1M tokens of context: Anthropic, August 2025.
OpenAI Codex CLI: official Codex CLI documentation.
Codex changelog: May 2026 features and version history.
