The senior-only team just became the best way to build software

For thirty years, building more software meant adding more people. AI quietly broke that assumption. The most effective team in 2026 isn't the biggest one: it's a small group of senior engineers running a fleet of agents, and the tooling now exists to make that concrete rather than aspirational.

The default way to scale a software project has always been headcount. More scope meant more engineers: juniors for the grunt work, mid-levels to wire things together, and layers of coordination on top. The problem, as anyone who has scaled a team knows, is that communication overhead grows faster than output. Add people and you often add meetings, hand-offs, and merge conflicts faster than you add shipped features.
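One rough way to put numbers on that overhead is to count pairwise communication channels, n(n-1)/2 for a team of n people. This is a simplifying model chosen for illustration, not a measurement from our projects, and the function name is our own.

```python
def communication_channels(team_size: int) -> int:
    """Number of distinct person-to-person pairs in a team of the given size."""
    if team_size < 0:
        raise ValueError("team_size must be non-negative")
    return team_size * (team_size - 1) // 2


for size in (4, 8, 12):
    print(size, communication_channels(size))
# 4 people -> 6 channels, 8 -> 28, 12 -> 66
```

Tripling the team from four to twelve multiplies the channels by eleven under this model, which is the intuition behind output growing more slowly than headcount.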

That math has changed, and it changed fast. What's interesting isn't the abstract claim: it's what the new shape actually looks like day to day, tool by tool, on a real engagement.

The grunt work became delegable, and we measure it

The clearest signal is in the benchmarks. On SWE-bench Verified, a test set built from real, messy GitHub issues in open-source projects, leading models paired with a capable agent harness now resolve the large majority of tasks, somewhere in the 70–90% range. In 2023, when the original SWE-bench was published (the Verified subset followed in 2024), the best reported resolve rates were in the low single digits, around four percent. In roughly two years, AI went from "interesting autocomplete" to "resolves most real engineering tickets end to end."

~4%: SWE-bench resolve rate, 2023 (original benchmark)
70–90%: SWE-bench Verified resolve rate with modern agent harnesses
2 years: to go from autocomplete to end-to-end task resolution

On our projects this isn't a statistic we read about; it's a daily workflow. A typical sprint on a Dedicated AI Pod engagement looks like this: a senior engineer breaks a feature into a handful of well-scoped tickets, each with acceptance criteria, relevant file references, and constraints (which libraries to use, which patterns to follow, what not to touch). Those tickets go to Claude Code running in the repo, which reads the existing codebase, drafts an implementation plan, writes the code, runs the test suite, and iterates until the tests pass. A single run often spans dozens of files and takes anywhere from twenty minutes to a couple of hours unattended.
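The ticket shape described above can be sketched as a small data structure. This is a minimal illustration, not our actual tooling: the class name, field names, prompt layout, and the example ticket (including its file paths) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AgentTicket:
    """A well-scoped unit of work handed to a coding agent (hypothetical shape)."""
    title: str
    acceptance_criteria: list[str] = field(default_factory=list)
    relevant_files: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)
    do_not_touch: list[str] = field(default_factory=list)

    def missing_fields(self) -> list[str]:
        """Names of the scoping fields that are still empty."""
        checks = {
            "acceptance_criteria": self.acceptance_criteria,
            "relevant_files": self.relevant_files,
            "constraints": self.constraints,
        }
        return [name for name, value in checks.items() if not value]

    def to_prompt(self) -> str:
        """Render the ticket as plain text, skipping empty sections."""
        sections = [
            ("Task", [self.title]),
            ("Acceptance criteria", self.acceptance_criteria),
            ("Relevant files", self.relevant_files),
            ("Constraints", self.constraints),
            ("Do not touch", self.do_not_touch),
        ]
        lines = []
        for heading, items in sections:
            if not items:
                continue
            lines.append(f"{heading}:")
            lines.extend(f"- {item}" for item in items)
        return "\n".join(lines)


# Illustrative ticket; every value here is made up.
ticket = AgentTicket(
    title="Add CSV export to the invoices endpoint",
    acceptance_criteria=["Export matches the on-screen filters"],
    relevant_files=["api/invoices.py"],
    constraints=["Use the standard-library csv module"],
    do_not_touch=["billing/legacy/*"],
)
print(ticket.to_prompt())
```

A check like `missing_fields()` is a cheap way to refuse to start a session on a ticket that has no acceptance criteria or constraints yet.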

The work that used to justify a big team (boilerplate, glue code, test scaffolding, routine refactors, migrations, dependency upgrades, writing the tenth CRUD endpoint that looks like the first nine) is now reliably delegable. We don't staff for that volume anymore. The volume work is no longer where the people need to be, which is why three or four senior people can now cover what used to take ten or twelve.

The bottleneck moved from typing to judgement

When the keystrokes get cheap, the scarce resource becomes knowing which keystrokes are worth making. Anthropic's 2026 framing of the shift is a useful one: engineering is moving from hands-on implementation toward a loop of delegate, review, and own. The job is increasingly about defining the goal, directing agents toward it, and judging whether what comes back is actually correct, secure, and maintainable.

That is senior work, and it shows up concretely in how we structure review. Every Claude Code or Cursor session that touches production code on our projects produces a diff that a senior engineer reads (reads, not skims) before it merges. We've found the failure mode isn't that agents write code that doesn't run. It's code that runs, passes the tests it was given, and is subtly wrong in a way that only shows up at scale or under an edge case nobody thought to write a test for. A junior engineer reviewing that diff tends to check "does it work": does it compile, does the demo pass. A senior engineer reviewing the same diff is checking "is this the right thing," which is a different question entirely: does this introduce an N+1 query, does this error path actually get hit in production, does this change quietly break an assumption three modules away.

Spotting the subtle bug an agent confidently introduced, making the architectural call that decides whether the product survives its first ten thousand users, knowing when a "working" solution is a trap: none of that comes from volume. It comes from having shipped real things before, and from knowing what real things tend to break on.

Engineering shifts from hands-on implementation to a continuous delegate, review, own loop

AI didn't make engineers replaceable. It made seniority the whole job.

What a senior-only, agent-augmented sprint actually looks like

"AI-augmented team" has become a phrase vague enough to mean almost anything. Here's what a two-week sprint actually looks like on one of our Dedicated AI Pod engagements, with three senior engineers, a designer, and a data/ML person.

Planning (half a day). The team works through the sprint backlog with Claude open alongside, not to generate the plan wholesale, but as a thinking partner for scoping. We'll paste in a rough feature description and ask it to surface edge cases, ambiguities, and dependencies we might have missed. For anything with a UI component, the designer works in parallel, using Figma AI to generate first-pass layout variations from a written brief and v0 to turn a rough Figma frame into a working React component that the frontend engineer can actually run. Design review then happens against something clickable on day one, not a static mockup.

Build (the bulk of the sprint). Each engineer typically has two or three agent sessions running in parallel: one in Claude Code working through a backend feature end to end (read the spec, implement, write tests, run them, fix failures, repeat), another in Cursor doing more interactive frontend work where the engineer is pairing closely and accepting changes incrementally, and sometimes a third doing a mechanical task like a dependency bump or a lint-rule migration across the whole repo. The engineer's own time goes to reviewing diffs, answering the clarifying questions agents raise, and doing the parts that need a human: deciding on a data model, resolving an ambiguous product requirement, or debugging something that depends on context the agent doesn't have (a flaky third-party API, an undocumented business rule, a decision made in a meeting two months ago).

Review and merge (continuous). Every PR runs through GitHub Actions: linting, type-checking, the test suite, and, for projects with an ML component, a check that the model evaluation metrics haven't regressed. A senior engineer does the human review described above. Nothing merges to main without a human who has read the diff and can explain, in their own words, what it does and why it's safe.
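The metric-regression check can be as simple as comparing a candidate run against a stored baseline. The sketch below assumes higher-is-better metrics held in plain dictionaries; the function name, the tolerance, and the example values are illustrative, not taken from a real pipeline.

```python
from typing import Optional


def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.01,
) -> dict[str, tuple[float, Optional[float]]]:
    """Return metrics that dropped by more than `tolerance`, or went missing.

    Assumes every metric is higher-is-better. Each entry maps the metric
    name to (baseline value, candidate value or None if absent).
    """
    regressions: dict[str, tuple[float, Optional[float]]] = {}
    for name, base_value in baseline.items():
        if name not in candidate:
            regressions[name] = (base_value, None)
        elif candidate[name] < base_value - tolerance:
            regressions[name] = (base_value, candidate[name])
    return regressions


# Made-up numbers: accuracy moved within tolerance, f1 dropped.
baseline_metrics = {"accuracy": 0.91, "f1": 0.88}
candidate_metrics = {"accuracy": 0.912, "f1": 0.85}
print(find_regressions(baseline_metrics, candidate_metrics))
```

In a CI job, a script wrapping this would exit with a non-zero status when the result is non-empty, which is what turns the check into a merge gate.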

Ship and observe (end of sprint, continuous after). Deploys go out through the same CI/CD pipeline, and Datadog dashboards and alerting are part of the definition of done for anything user-facing, not an afterthought. Increasingly we're also wiring in lightweight AIOps monitoring that runs anomaly detection on the metrics themselves, so a quietly degrading endpoint gets flagged before a user complains.
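As a minimal sketch of what anomaly detection on a metric can mean, the example below flags points that sit far from the mean of a trailing window, measured in standard deviations. It is a textbook z-score check, not the detector any particular monitoring product uses, and the window size, threshold, and latency values are assumptions.

```python
from statistics import mean, stdev


def flag_anomalies(values: list[float], window: int = 5, threshold: float = 3.0) -> list[int]:
    """Indices whose value is more than `threshold` standard deviations
    away from the mean of the preceding `window` values."""
    flagged = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # A perfectly flat history: any change at all is a deviation.
            if values[i] != mu:
                flagged.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Hypothetical p95 latencies in milliseconds, one sample per minute.
latencies = [120, 118, 122, 119, 121, 120, 310, 121]
print(flag_anomalies(latencies))
```

A check this simple catches sudden spikes; slow drifts need a longer baseline or a trend-based method, which is where managed tooling earns its keep.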

Why this shape wins on quality, not just speed

Pair a senior engineer with a fleet of agents and you get something that used to require an entire pod: someone who can direct several workstreams in parallel, catch the failures agents still produce, and own the decisions that matter. Pair a junior with the same agents and you get volume, but not the judgement to know whether that volume is right. Unreviewed AI output is simply the fastest way ever invented to accumulate technical debt, and a junior reviewer is poorly positioned to catch it because they haven't yet built the pattern-matching for what "this will hurt later" looks like.

This is why "small and senior" isn't a staffing compromise anymore. It's the optimal shape, and the tooling makes the case mechanically: fewer people means less coordination overhead, fewer hand-offs, fewer Slack threads asking "wait, who owns this file," and more output per head, with a higher quality floor, not a lower one. The agents handle the parts where more hands used to help (typing, boilerplate, mechanical changes). The humans handle the parts where more hands never actually helped (judgement, architecture, taste).

The tools we actually run, by stage

Discover & Define: Claude and Perplexity for research synthesis and competitive scans; NotebookLM for turning long discovery docs and call transcripts into something the whole team can query.
Design & Prototype: Figma AI for rapid layout exploration, v0 for turning designs into working React, Firefly for quick illustration and image assets.
Architect: Claude for working through trade-offs and writing ADRs (architecture decision records), Eraser AI and Mermaid for system diagrams that live in the repo and stay in sync with the code.
Build: Claude Code for end-to-end feature implementation and large mechanical changes, Cursor for interactive pairing, Copilot for inline completion on smaller edits.
Add Intelligence: Bedrock and Vertex AI for managed model access, LangChain for orchestration when a feature needs RAG or multi-step agent logic.
Ship & Operate: GitHub Actions for CI/CD, Datadog for observability, and AIOps-style anomaly detection on top of both.

The skills that matter now, and how we hire for them

If the bottleneck has moved from typing to judgement, hiring has to move with it. We don't screen for how fast someone can write a sorting algorithm on a whiteboard; an agent can do that in seconds. We screen for the things that show up when the agent's output is sitting in front of you and you have thirty seconds to decide whether to trust it.

In practice that means our interviews involve a lot of code review: not writing code from scratch, but reading a diff (often one we've deliberately had an agent produce, including its mistakes) and asking what's wrong here, what you would change, and why it matters. The candidates who do well aren't necessarily the fastest typists. They're the ones who immediately spot that a caching layer was added without an invalidation strategy, or that a "fix" for a race condition just moved where the race happens.

We also look for people who are comfortable being wrong quickly. Working with agents means proposing an approach, watching it half-work, and adjusting, repeatedly, in rapid cycles. Engineers who need to be certain before they act tend to slow this loop down. Engineers who treat each agent run as a cheap experiment tend to converge on the right answer faster, because they're not precious about any single attempt.

Where this breaks down, and how we guard against it

It would be dishonest to present this as friction-free. Agent-augmented development has real failure modes, and pretending otherwise is how teams end up with the "fast but fragile" reputation that gives AI-native development a bad name.

The most common failure we see is scope creep inside a single agent session. Given a loosely specified ticket, an agent will often make reasonable-seeming decisions that compound: adding a new dependency to solve a small problem, refactoring a function it didn't need to touch, introducing a new pattern alongside an existing one instead of reusing it. None of these looks wrong individually in a diff. Collectively, over a sprint, they're how a codebase becomes inconsistent. Our mitigation is unglamorous: tickets are scoped tightly, with explicit "do not touch" boundaries, and a senior engineer reviews not just correctness but consistency with the rest of the codebase.
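A "do not touch" boundary is easy to check mechanically before a human even opens the diff. The sketch below compares a list of changed paths against glob patterns; the function name, the patterns, and the file paths are hypothetical, and a real setup would read the changed files from version control.

```python
from fnmatch import fnmatchcase


def boundary_violations(changed_files: list[str], do_not_touch: list[str]) -> list[str]:
    """Changed paths that match any protected glob pattern.

    Note that fnmatch-style `*` also matches `/`, so `billing/legacy/*`
    covers nested directories too.
    """
    return [
        path
        for path in changed_files
        if any(fnmatchcase(path, pattern) for pattern in do_not_touch)
    ]


# Illustrative inputs only.
protected = ["billing/legacy/*", "migrations/*"]
changed = [
    "api/invoices.py",
    "billing/legacy/ledger.py",
    "tests/test_invoices.py",
]
print(boundary_violations(changed, protected))
```

This catches the blunt case (a file that should not have changed). Consistency with existing patterns still needs the senior reviewer.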

The second failure mode is confidently wrong test coverage. An agent asked to "add tests" will produce tests, and they'll pass, sometimes because they test the implementation rather than the requirement, so they'd pass even if the requirement had been misunderstood. We treat agent-written tests as a starting point that a senior engineer reads against the original spec, not as proof of correctness.
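A small, contrived example of the difference: the requirement, the discount rule, and every name below are invented for illustration. The implementation has a deliberate off-by-boundary bug, and only the check that is pinned to the written requirement notices.

```python
from decimal import Decimal

# Hypothetical requirement: orders of 100.00 or more get 10% off.


def apply_discount(total: Decimal) -> Decimal:
    """Buggy on purpose: uses > where the requirement says 'or more'."""
    if total > Decimal("100.00"):
        return total * Decimal("0.90")
    return total


def mirrors_implementation() -> bool:
    """Re-derives the expected value with the same rule as the code,
    so it agrees with the code whether or not the rule is right."""
    total = Decimal("100.00")
    expected = total * Decimal("0.90") if total > Decimal("100.00") else total
    return apply_discount(total) == expected


def checks_requirement() -> bool:
    """Pins the expected value to the written requirement instead."""
    return apply_discount(Decimal("100.00")) == Decimal("90.00")


print(mirrors_implementation())  # True: passes despite the bug
print(checks_requirement())      # False: the boundary case exposes it
```

Both checks would count as "coverage" in a report. Only the second one would have failed when the requirement was misread.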

The third is context loss across long sessions. Agents working for hours on a large task can lose track of an earlier constraint, especially if it was mentioned once at the start and the session has since covered a lot of ground. We've found that breaking large features into smaller, independently reviewable chunks produces better outcomes than letting an agent run for six hours unsupervised, even when a single long session could technically handle the whole feature and even though the long run looks more impressive in a demo.

A live example: how this played out on a real engagement

One useful way to make this concrete is a feature we recently shipped for a fintech client: a reconciliation dashboard that needed to ingest transaction data from three different upstream systems, normalise it, flag mismatches, and present them for human review.

The discovery phase took half a day. Claude helped us work through the edge cases in how the three systems represented the same transaction differently (currency rounding, timezone handling, partial refunds), and that work became the spec. The architecture (a normalisation layer, a matching engine, and a review queue) was sketched in Mermaid and reviewed with the client before a line of code was written.

Build took four days, not because the code itself was complex, but because the matching logic had subtle edge cases that needed a human to get right. Claude Code handled the bulk of the implementation (the API integrations, the normalisation layer, the database schema, the review queue UI) in a series of sessions over the first two days, each scoped to one part of the pipeline. The senior engineer on the project spent those two days mostly reviewing diffs and refining the matching rules, which is where almost all of the actual thinking happened. The remaining two days went to the parts that needed human judgement: deciding how to handle a category of mismatches the original spec hadn't anticipated, and tuning the matching tolerance so it caught real discrepancies without flooding the review queue with rounding noise.
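To show what tuning a matching tolerance means in practice, here is a deliberately simplified sketch. It is not the client's matching engine: the function name, the three labels, and the one-cent default are assumptions, and real rules would also have to handle currencies, timezones, and partial refunds.

```python
from decimal import Decimal


def classify_pair(
    amount_a: Decimal,
    amount_b: Decimal,
    tolerance: Decimal = Decimal("0.01"),
) -> str:
    """Label two amounts that are meant to describe the same transaction.

    - "match":    identical amounts
    - "rounding": differ by no more than `tolerance` (kept out of the queue)
    - "mismatch": differ by more than `tolerance` (sent for human review)
    """
    difference = abs(amount_a - amount_b)
    if difference == 0:
        return "match"
    if difference <= tolerance:
        return "rounding"
    return "mismatch"


print(classify_pair(Decimal("19.99"), Decimal("19.99")))  # match
print(classify_pair(Decimal("19.99"), Decimal("20.00")))  # rounding
print(classify_pair(Decimal("19.99"), Decimal("25.00")))  # mismatch
```

Set the tolerance too low and the review queue fills with rounding noise; set it too high and real discrepancies slip through. Using `Decimal` rather than floats keeps the comparison itself from adding rounding error.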

The team on this engagement was two people. A year ago, the same scope would have been a four-to-six person team over three to four weeks. The difference wasn't that the work got easier; the matching-logic edge cases were just as hard as they'd have been in 2023. The difference was that the volume work around that hard core (the integrations, the schema, the UI) stopped being where the time went.

What we don't outsource to agents, on principle

There's a category of decision we deliberately keep human even when an agent could plausibly attempt it: anything that trades off business risk against engineering convenience. Should this feature ship behind a flag or to everyone? Is a two-second p95 latency acceptable for this endpoint given who uses it? Is the cost of supporting an edge case worth the complexity it adds? These aren't technical questions with technically correct answers. They're judgement calls that depend on context an agent doesn't have and shouldn't be trusted to infer. We've seen teams elsewhere try to have an agent "decide" on acceptable trade-offs by giving it a rubric, and the result is usually a rubric-following answer that's defensible on paper and wrong in practice.

The other category is anything involving another human's trust: a difficult client conversation, a decision that needs to be explained and defended to a stakeholder, a judgement call that someone needs to be accountable for. Agents are extraordinary at producing artifacts. They're not a substitute for a person standing behind a decision.

What it means if you're the one buying

The old agency model sold you a large team and asked you to trust that the size meant seriousness. The economics have flipped. Industry analyses now put AI-driven operating-cost reductions in the 20–40% range for organisations that have genuinely reorganised around AI, and for a product build, that shows up as shorter timelines delivered by a smaller, sharper team, not as the same timeline delivered cheaper.

You shouldn't be paying for a big team to look busy, and you shouldn't be paying for a small team that's quietly shipping unreviewed agent output either. The question worth asking any vendor in 2026 isn't "how many people will be on this," it's "what does your review process actually look like, and who's doing it." You should be paying for senior people who own the outcome and let AI absorb the toil. That's the entire premise behind how we're built, and it's why our pods stay small even as the scope grows.
