Two years ago, AI in the editor meant smarter autocomplete. In 2026 it means agents that read a whole codebase, plan, write, test, and open a pull request — running for hours at a stretch. That's a fundamentally different tool, and it changes how a product gets built, from how we scope tickets to how we structure repos.
The first wave of AI coding tools was autocomplete with a longer memory. You typed, it suggested the next line, you accepted or ignored it. Useful, but the human was still doing the building one keystroke at a time. The mental model was "a slightly smarter version of what I was already doing."
What's in production now is categorically different. Tools like Claude Code, Cursor, Windsurf and Cline operate on a repository over time through execution loops: they read the code, form a plan, make changes across many files, run the tests, read the failures, and try again — often unattended for long stretches. The unit of work went from "a line" to "a task," and that change rippled through everything else: how we write tickets, how we structure repos, how we run code review, and what "a junior engineer's job" even means.
Context engineering beat static retrieval
Early tools tried to understand your codebase by stuffing a vector index and retrieving snippets — chunk the repo, embed it, pull back the "relevant" pieces for a prompt. It wasn't enough for real systems, because relevance for a code change is rarely a semantic-similarity problem. Knowing how to fix a bug in a payment flow often requires understanding a config file, an environment variable, and a database migration that have nothing in common textually with the bug report.
Modern agents behave more like a developer dropped into an unfamiliar repo: they grep, open files, trace call paths, follow imports, and build a working mental model by investigating — the same way a new hire would spend their first week. Persistent convention files — the now-common AGENTS.md or CLAUDE.md pattern — give them durable instructions about how a project is structured, which commands to run for tests and linting, what patterns to follow, and what's off-limits. On every project we start, writing this file is one of the first things we do, and we treat it as a living document: when an agent makes a mistake because of a missing convention, the fix isn't just correcting that one instance — it's adding the convention to the file so the next session doesn't make the same mistake.
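To make the contrast with embedding-based retrieval concrete, here is a minimal sketch of one investigative step: following imports outward from a starting file. It assumes a Python repo that uses absolute imports, and the names `local_imports` and `trace` are illustrative only; real agents do this with general-purpose tools like grep and file reads rather than a dedicated function.

```python
import ast
from pathlib import Path


def local_imports(source_file: Path, repo_root: Path) -> set[Path]:
    """Return the repo files that source_file imports (absolute imports only)."""
    tree = ast.parse(source_file.read_text(encoding="utf-8"))
    modules = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules.add(node.module)

    found = set()
    for module in modules:
        candidate = repo_root.joinpath(*module.split("."))
        # A module is either a single file or a package directory.
        for path in (candidate.with_suffix(".py"), candidate / "__init__.py"):
            if path.is_file():
                found.add(path)
    return found  # third-party and stdlib imports are skipped: no file in the repo


def trace(entry: Path, repo_root: Path) -> list[Path]:
    """Breadth-first walk over the import graph, starting at entry."""
    seen, queue = [entry], [entry]
    while queue:
        current = queue.pop(0)
        for dep in sorted(local_imports(current, repo_root)):
            if dep not in seen:
                seen.append(dep)
                queue.append(dep)
    return seen
```

Nothing here depends on textual similarity to the bug report: a config loader or migration helper shows up because the code path reaches it, which is the kind of relevance an embedding lookup tends to miss.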
The practical upshot is that agents are far better at operating inside a real, messy, multi-year codebase than the demos of two years ago suggested they would be, provided the codebase gives them the context to work with. A repo with no tests, no conventions file, and inconsistent patterns will produce worse agent output than the same repo with those things in place. The gap is large enough that "agent-readiness" has become something we actively invest in on every engagement, including codebases that predate our involvement.
The honest part: the verifiability spectrum
Agents are not uniformly good, and pretending otherwise is how teams end up disappointed. They are excellent where the output is verifiable (code that compiles, tests that pass, a CLI task with a clear success condition) and noticeably shakier where it isn't, as with ambiguous UI flows or fuzzy product decisions. A widely cited Sourcegraph study of well over a thousand agent runs in large codebases found that the main bottleneck wasn't raw model intelligence at all; it was infrastructure and context. The model is rarely the weakest link; the surrounding scaffolding usually is.
This matters because it tells you exactly where to trust an agent and where to keep a human firmly in the loop, and it maps onto a spectrum we think about explicitly:
- High verifiability — agents run largely unsupervised: writing tests for existing code, fixing a failing test where the expected behaviour is clear, migrating from one library version to another, implementing a well-specified API endpoint with a documented contract, refactoring for a named pattern (extract this into a hook, convert this class to a function component).
- Medium verifiability — agents draft, humans steer closely: implementing a new feature from a written spec where some details are underspecified, writing the first version of a UI component before design feedback, drafting an architecture proposal for review.
- Low verifiability — humans lead, agents assist: deciding the product trade-off itself (not implementing it — deciding what it should be), resolving a genuinely ambiguous requirement, anything where "correct" depends on context the agent can't access (a conversation with the client, an unwritten business rule, institutional history about why something was done a certain way).
Most of the velocity gains people talk about come from the first category, and that's fine — it's also where the gains are safest to take. The mistakes happen when teams apply first-category trust to second- or third-category work.
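One way to keep the three tiers explicit is to encode the triage as data. The sketch below is only an illustration: the three yes/no questions on `Task` and the `classify` rule are assumptions chosen to mirror the bullets above, not a standard or a description of any particular tool.

```python
from dataclasses import dataclass
from enum import Enum


class Verifiability(Enum):
    HIGH = "agents run largely unsupervised"
    MEDIUM = "agents draft, humans steer closely"
    LOW = "humans lead, agents assist"


@dataclass(frozen=True)
class Task:
    title: str
    is_product_decision: bool         # deciding what to build, not building it
    needs_inaccessible_context: bool  # client conversations, unwritten rules, history
    has_checkable_success: bool       # a test, contract, or named pattern defines "done"


def classify(task: Task) -> Verifiability:
    """Hypothetical triage rule; the three questions are assumptions."""
    if task.is_product_decision or task.needs_inaccessible_context:
        return Verifiability.LOW
    if task.has_checkable_success:
        return Verifiability.HIGH
    return Verifiability.MEDIUM


library_migration = Task(
    title="Migrate to the next library version",
    is_product_decision=False,
    needs_inaccessible_context=False,
    has_checkable_success=True,
)
print(classify(library_migration).value)
```

The ordering of the checks matters: a task with passing tests still lands in the low tier if "correct" depends on context the agent cannot see, which is the failure mode of applying first-category trust to third-category work.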
What changed in our day-to-day toolchain
It's worth walking through what this looks like tool by tool, because "we use AI" has become almost meaningless as a description.
Claude Code is where most of our high-verifiability work happens. We give it a ticket with acceptance criteria, point it at the relevant part of the codebase, and let it run. For a typical backend feature (say, a new endpoint with validation, error handling, and tests), a session might run for 30–90 minutes, touching the route handler, a service layer, the database access code, and a test file, then running the test suite and fixing whatever fails. The engineer usually spends that time on something else: reviewing a previous session's diff, or running a second session for a different feature.
Cursor is where medium-verifiability work happens, particularly frontend. The interaction model is tighter: the engineer is in the loop turn by turn, accepting some suggestions, rejecting others, and redirecting when the agent goes down a path that doesn't match the design. A lot of our UI work lives here, because "does this look and feel right" is a judgement call that benefits from a human reacting in real time rather than reviewing a finished diff.
Copilot still has a role, mostly for the smallest-scale completions — finishing a line, suggesting an obvious next statement — in files where spinning up a full agent session would be overkill.
GitHub Actions is where the verifiability spectrum gets enforced mechanically. Every PR — whether it came from an agent session or a human typing directly — runs the same checks: type-checking, linting, the full test suite, and for projects with a data or ML component, a check on model evaluation metrics. An agent's diff doesn't get special treatment, which is exactly the point; the CI pipeline doesn't know or care who wrote the code, it just enforces that it works.
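The author-blind gate can be sketched as a small runner that a CI step might call. This is a sketch under assumptions: the commands in `EXAMPLE_CHECKS` (mypy, ruff, pytest) stand in for whatever type-checker, linter, and test runner a project actually uses, and the function names are hypothetical.

```python
import subprocess
from typing import Mapping, Sequence

# Hypothetical check commands; substitute whatever the project actually uses.
EXAMPLE_CHECKS: dict[str, list[str]] = {
    "types": ["mypy", "."],
    "lint": ["ruff", "check", "."],
    "tests": ["pytest"],
}


def run_checks(checks: Mapping[str, Sequence[str]]) -> dict[str, bool]:
    """Run every check and record whether it exited with status 0."""
    results = {}
    for name, command in checks.items():
        completed = subprocess.run(list(command), capture_output=True, text=True)
        results[name] = completed.returncode == 0
    return results


def gate(results: Mapping[str, bool]) -> bool:
    """The PR passes only if every check passed; authorship is never an input."""
    return all(results.values())
```

A CI step could call `gate(run_checks(EXAMPLE_CHECKS))` and exit non-zero on `False`. The point of the design is in the signatures: nothing in them carries who or what produced the diff.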
Good harnesses ask for help
One of the more important recent patterns is agents learning to flag uncertainty instead of bluffing through it. We lean into that deliberately: an agent that pauses mid-task and asks "this is ambiguous: should X behave like A or B?" is doing its job well, and we configure our setups to encourage that rather than reward agents for ploughing ahead and guessing.
In practice this means our CLAUDE.md files explicitly instruct agents to stop and ask when they hit a decision that affects behaviour rather than implementation — for example, "if you need to choose between two valid approaches to handling a missing field, stop and ask rather than picking one." This adds a small amount of back-and-forth, but it's dramatically cheaper than discovering three days later that the agent picked the approach that doesn't match what three other parts of the system assume.
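The rule in that instruction is simple enough to state as code. The sketch below is purely illustrative: agents follow the instruction from prose, not from a function like this, and `Decision` and `next_step` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    description: str
    options: tuple[str, ...]
    affects_behaviour: bool  # changes what callers or users observe


def next_step(decision: Decision) -> str:
    """Ask when a behavioural choice has more than one valid answer."""
    if not decision.options:
        raise ValueError("a decision needs at least one option")
    if decision.affects_behaviour and len(decision.options) > 1:
        choices = " or ".join(decision.options)
        return f"ASK: {decision.description}: {choices}?"
    # Pure implementation choices are the agent's to make.
    return f"PROCEED: {decision.options[0]}"


missing_field = Decision(
    description="missing field handling",
    options=("raise an error", "default to null"),
    affects_behaviour=True,
)
print(next_step(missing_field))
```

The distinction doing the work is `affects_behaviour`: two ways of writing the same loop never trigger a question, while two ways of treating a missing field always do.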
More code and shorter timelines are real — but only because a human still owns the review gate.
Where the velocity actually comes from
It's tempting to attribute speed gains entirely to "AI writes code faster than humans." That's part of it, but not the biggest part. The bigger contributor is that the feedback loop between "I have an idea" and "I can see whether it works" got dramatically shorter.
Before, testing a hypothesis about how a feature should work meant: write the code (hours to days), then see if it's right. Now: describe the hypothesis to an agent, get a working implementation in tens of minutes, and see immediately whether it's right. When it's wrong — which it often is, on the first try, for anything nontrivial — the cost of being wrong dropped from days to minutes. That changes the economics of exploration. Teams that used to commit to one approach because trying two was too expensive can now try three and pick the best one, in less time than trying one used to take.
This is also why "vibe coding" — generating plausible-looking code without verification — is such a trap. It optimises for the part that got cheap (generating code) while ignoring the part that didn't (knowing whether it's right). The teams getting burned right now are the ones who saw the velocity and removed the human from the verification step. Agents ship fast; without review they also break quietly, often in ways that don't surface until the code has been in production for weeks. The whole craft in 2026 is keeping the speed while keeping the judgement — which is a people and process problem, not a tooling one.
A worked example: migrating an API version across a real codebase
To make this less abstract, here's a task we ran recently: migrating a mid-sized service from one major version of a payment provider's SDK to the next, across roughly 40 files. In 2023, this would have been a multi-day task — read the migration guide, find every call site, update each one, handle the inevitable signature changes, fix the tests that broke, repeat for edge cases the migration guide didn't mention.
With Claude Code, the process looked like this: we pointed it at the migration guide (fetched and saved locally) and the codebase, and asked it to identify every call site, propose the updated signatures, and make the changes incrementally, one logical group of call sites at a time, running tests after each group. The first pass took about 90 minutes and got roughly 90% of the way there. The remaining 10% were call sites with custom error-handling wrappers, which the agent correctly flagged as uncertain rather than guessing: exactly the "ask for help" behaviour we want. A senior engineer resolved those six call sites by hand in about twenty minutes, because doing so required knowing why the custom wrapper existed in the first place (it worked around a bug in the old SDK that had since been fixed, so the workaround no longer applied).
Total time: about two hours, including human review of every changed file. The agent did the mechanical 90%; the human did the 10% that required institutional knowledge, and reviewed the other 90% to confirm it didn't introduce anything subtle. That ratio — agent handles the bulk, human handles the part that needs context plus reviews everything — is the pattern that shows up again and again, across very different kinds of tasks.
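The shape of that loop (change one group, run the tests, set aside anything uncertain) can be sketched as follows. All names here are hypothetical, and the callbacks stand in for work the agent did itself. Reverting a group whose tests fail is an assumption made to keep the sketch short; in the session described above, the agent read the failures and retried.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class MigrationReport:
    migrated: list[str] = field(default_factory=list)
    flagged: list[str] = field(default_factory=list)  # left for a human to resolve


def migrate_in_groups(
    groups: dict[str, list[str]],
    is_uncertain: Callable[[str], bool],
    apply_change: Callable[[str], None],
    revert_change: Callable[[str], None],
    run_tests: Callable[[], bool],
) -> MigrationReport:
    """Migrate call sites one logical group at a time, testing after each group."""
    report = MigrationReport()
    for call_sites in groups.values():
        to_change = []
        for site in call_sites:
            if is_uncertain(site):
                report.flagged.append(site)  # e.g. a custom error-handling wrapper
            else:
                to_change.append(site)
        for site in to_change:
            apply_change(site)
        if run_tests():
            report.migrated.extend(to_change)
        else:
            # Assumption: undo the whole group and hand it to a human.
            for site in to_change:
                revert_change(site)
            report.flagged.extend(to_change)
    return report
```

Working group by group keeps each test failure attributable to a small set of changes, and the `flagged` list is the hand-off point to the engineer with the institutional knowledge.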
The repos that work well with agents, and the ones that don't (yet)
Not every codebase gets equal benefit from this shift, and it's worth being honest about why. The codebases where agents are most effective tend to share a few traits: a test suite that actually runs and actually fails when something's broken, consistent patterns (one way to do data access, one way to handle errors, not three competing conventions from three different eras of the project), and reasonably small, focused files rather than 3,000-line god-modules.
Codebases without these things aren't unworkable — but the first phase of work on them often has to be making them more agent-friendly before the velocity gains show up. We've taken on a few engagements where the first two weeks were effectively "improve test coverage and consolidate three different state-management patterns into one" — unglamorous work that doesn't look like "AI-native development" from the outside, but that's the precondition for everything that follows being faster. Skipping this step is how teams end up with agents that produce plausible-looking code that doesn't fit the codebase's actual conventions, because there weren't consistent conventions to fit.
Where we expect this to go next
The trajectory over the last two years has been: longer unsupervised runs, better context-gathering, and more reliable self-assessment of uncertainty. We expect that to continue, but we don't expect the fundamental shape — agents handle verifiable work, humans own judgement and review — to change, because the distinction between verifiable and non-verifiable work isn't a capability gap that better models close. It's a structural feature of what "correct" means. No amount of model improvement makes "should this feature exist" a question with an objectively checkable answer.
What we do expect to keep improving is the boundary — more of what currently sits in the "medium verifiability, humans steer closely" bucket moving into "high verifiability, agents run largely unsupervised," as agents get better at inferring conventions and asking better clarifying questions. That's a real and valuable trend. It's just not the same as "agents will eventually do everything," which is a different claim, and one we don't think the evidence supports.
If you want the full stage-by-stage picture of where we apply which tools across an engagement, that's laid out on our process page.