Stop worshipping the model. Start designing the system.

The fastest way to build a worse AI product in 2026 is to obsess over which model tops the leaderboard this week. The teams actually shipping useful AI aren't winning on model choice: they're winning on system design, and the difference shows up in concrete architecture decisions, not in which API they call.

There's a real arms race among frontier models, and it's genuinely impressive. But for people building products, it has quietly commoditised the part everyone fixates on. The leading model families (Claude, GPT, Gemini and a fast-improving field of open-weight models) now ship million-token context windows and native multimodality as standard. Multimodal stopped being a differentiator. It became the floor.

When the raw capability is broadly comparable and changes every few weeks, betting your architecture on "whichever model is best today" is a losing strategy. Worse, it's the wrong place to compete, because if your product's value proposition is "we use the best model," that proposition has a shelf life measured in months, and your competitors can copy it by changing one API endpoint.

[Figure: The model sits on top; most of what determines product quality lives in the layers underneath.]

The lesson everyone learned the expensive way

A pattern played out across a lot of teams in the last year, and we watched it happen on both sides. One team had the biggest model budget in the room: they ran every feature through the most capable frontier model available, with minimal pre- or post-processing, betting that raw model quality would carry the product. Another team had a smaller budget but spent it on retrieval infrastructure, a routing layer, and an evaluation harness, paired with a mix of model sizes.

Six months later, the second team was shipping useful AI to real users at a fraction of the inference cost, while the first was still tuning prompts and arguing about which new model release to switch to next. The centre of gravity moved from model worship to system design, and the gap wasn't subtle. It was the difference between a product that worked and one that demoed well.

That's the whole story of 2026 in one anecdote. Value isn't in the model. It's in the system around it. Here's what that system actually looks like, piece by piece, and how we build each piece.

Retrieval: the difference between an assistant and a confident liar

RAG (retrieval-augmented generation) sounds like a solved problem from the outside: embed your documents, store the vectors, retrieve the top-k matches for a query, stuff them in the prompt. In practice, the gap between "RAG that demos well" and "RAG people trust with real decisions" is almost entirely in details that don't show up in a quick demo.

On a recent engagement for a legal-document intelligence product, the naive approach (chunk documents into fixed-size windows, embed, retrieve by similarity) produced answers that looked right and were sometimes subtly wrong, because a chunk boundary had split a clause from the exception that modified it. The fix wasn't a better model. It was better chunking: splitting on document structure (sections, clauses, defined terms) rather than character count, and retrieving with enough surrounding context that a clause and its exceptions stayed together. We also added explicit source attribution: every answer links back to the specific clause it's grounded in, so a lawyer can verify rather than trust blindly. None of this involved choosing a "better" model. All of it involved engineering the data pipeline the model sits on top of.
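To make the chunking idea concrete, here is a minimal sketch of splitting on clause headings and pulling a clause back together with its sub-clauses. It is not the pipeline from the engagement: the heading pattern (numbered clauses such as "2." or "2.1" at the start of a line), the `Chunk` type, and both function names are assumptions chosen for illustration, and real legal documents need a far more careful parser.

```python
import re
from dataclasses import dataclass

# Assumed heading style: "1.", "2.1", "10.3.2" at the start of a line.
# A body line that happens to start with a number would be misread as a
# heading, so treat this pattern as illustrative only.
CLAUSE_HEADING = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+", re.MULTILINE)


@dataclass
class Chunk:
    clause_id: str  # kept so an answer can link back to its source clause
    text: str


def chunk_by_clause(document: str) -> list[Chunk]:
    """Split on clause headings instead of a fixed character count."""
    matches = list(CLAUSE_HEADING.finditer(document))
    chunks = []
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(document)
        chunks.append(Chunk(match.group(1), document[match.start():end].strip()))
    return chunks


def with_subclauses(chunks: list[Chunk], clause_id: str) -> list[Chunk]:
    """Return a clause plus its sub-clauses (2 with 2.1, 2.2, ...)."""
    prefix = clause_id + "."
    return [c for c in chunks
            if c.clause_id == clause_id or c.clause_id.startswith(prefix)]


sample = """1. Definitions
"Term" means the period stated in the order form.
2. Termination
Either party may terminate on 30 days' notice.
2.1 Exception
Termination is not permitted during the initial Term.
3. Governing law
This agreement is governed by the laws of England."""

context = with_subclauses(chunk_by_clause(sample), "2")
print([c.clause_id for c in context])
```

When retrieval matches clause 2, the sketch hands the model clause 2 and clause 2.1 together, so the exception travels with the rule it modifies, and each chunk carries the clause identifier needed for attribution.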

The other half of retrieval that matters is keeping it fresh. A RAG system over stale data is worse than no RAG system, because it produces confident answers based on outdated information. For most of our projects, the retrieval pipeline includes an explicit refresh cadence and a way to detect when source documents have changed. That is infrastructure work that has nothing to do with which model generates the final answer, and everything to do with whether that answer is trustworthy.
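One common way to detect changed source documents is to store a content fingerprint at index time and compare it on each refresh. The sketch below assumes that approach; the function names and the shape of the stored index are hypothetical, and a production pipeline might rely on modification timestamps or change feeds from the source system instead.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Stable content hash used to notice that a document has changed."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_refresh(indexed: dict[str, str],
                 current_docs: dict[str, str]) -> dict[str, list[str]]:
    """Decide what to re-embed and what to drop from the index.

    indexed:      doc_id -> fingerprint stored when the doc was last indexed
    current_docs: doc_id -> current text in the source system
    """
    changed = [doc_id for doc_id, text in current_docs.items()
               if doc_id in indexed and indexed[doc_id] != fingerprint(text)]
    added = [doc_id for doc_id in current_docs if doc_id not in indexed]
    removed = [doc_id for doc_id in indexed if doc_id not in current_docs]
    return {"reembed": sorted(changed + added), "delete": sorted(removed)}
```

Running `plan_refresh` on a schedule gives the "explicit refresh cadence", and only documents whose content actually changed pay the re-embedding cost.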

Routing: not every step needs the frontier model

One of the quieter shifts in the last year: capable models in the sub-4B parameter range, paired with quantisation, are now good enough for many bounded tasks: classification, extraction, simple formatting, routine routing decisions. That unlocks low-latency, cheap, even on-device features that used to require a frontier API call for everything.

This doesn't make big models irrelevant. It makes routing a strategic architecture decision rather than an afterthought. On a predictive-event-monitoring platform we built for an IT operations client, the pipeline ingests a high volume of events, far too many to run each one through a frontier model without either bankrupting the client or introducing latency that defeats the purpose of "real-time." The architecture routes: a lightweight classifier (a small fine-tuned model, cheap enough to run on every event) does initial triage and flags anomalies; only the flagged subset, typically a small fraction of total volume, goes to a more capable model for root-cause analysis and natural-language summarisation for the on-call engineer.

The result is a system that's both fast enough for real-time alerting and affordable enough to run on the client's actual event volume, neither of which would be true if every event hit a frontier model. The "smart" part of this system isn't the model. Bedrock and Vertex AI both offer comparable frontier models, and we could swap between them with a config change. The smart part is the routing logic that decides what goes where, which is the part that took the engineering effort.
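The triage-then-escalate pattern can be sketched in a few lines. Everything here is a stand-in: `Event`, `route_events`, the 0.8 threshold, and the two stub functions are assumptions for illustration, with the stubs playing the roles of the small fine-tuned classifier and the more capable model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Event:
    source: str
    message: str


def route_events(events: list[Event],
                 triage: Callable[[Event], float],
                 analyse: Callable[[Event], str],
                 threshold: float = 0.8) -> list[tuple[Event, str]]:
    """Run cheap triage on every event; escalate only the flagged ones."""
    reports = []
    for event in events:
        score = triage(event)  # cheap model, runs on the full volume
        if score >= threshold:
            # expensive model, runs on the flagged subset only
            reports.append((event, analyse(event)))
    return reports


# Stubs standing in for real models (hypothetical behaviour).
def stub_triage(event: Event) -> float:
    return 0.95 if "timeout" in event.message else 0.05


def stub_analyse(event: Event) -> str:
    return f"Possible root cause in {event.source}: {event.message}"


events = [Event("api", "request ok"), Event("db", "connection timeout"),
          Event("api", "request ok")]
for event, summary in route_events(events, stub_triage, stub_analyse):
    print(summary)
```

Because the two models are passed in as plain callables, swapping either one does not touch the routing logic, which is where the engineering effort lives.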

[Figure: A lightweight classifier triages full volume; only flagged events reach the frontier model.]

Evaluation: turning "it feels better" into a number

You can't improve what you can't measure, and "it feels better" is not a measurement. Every AI feature we build ships with an evaluation harness before it ships to users: a set of representative inputs with expected properties. These are not necessarily exact expected outputs, since generative outputs vary, but properties such as whether the answer cites a source, stays within the requested format, and avoids a list of known failure patterns.
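A property-based harness of this kind can be very small. The sketch below is one possible shape, not our production harness: the `[source: ...]` citation format, the check helpers, and the `EvalCase` and `run_evals` names are all assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import Callable

Check = Callable[[str], bool]


def cites_source(output: str) -> bool:
    # Assumed citation format: "[source: clause 2]".
    return re.search(r"\[source:\s*[^\]]+\]", output) is not None


def max_words(limit: int) -> Check:
    return lambda output: len(output.split()) <= limit


def avoids(phrases: list[str]) -> Check:
    return lambda output: not any(p.lower() in output.lower() for p in phrases)


@dataclass
class EvalCase:
    prompt: str
    checks: dict[str, Check]  # property name -> check on the model output


def run_evals(generate: Callable[[str], str], cases: list[EvalCase]) -> dict:
    """Score a model on properties of its outputs, not exact strings."""
    failures = {}
    for case in cases:
        output = generate(case.prompt)
        failed = [name for name, check in case.checks.items()
                  if not check(output)]
        if failed:
            failures[case.prompt] = failed
    pass_rate = (len(cases) - len(failures)) / len(cases)
    return {"pass_rate": pass_rate, "failures": failures}
```

`generate` is any function from prompt to output, so the same cases can be run against the current model and a candidate replacement without changing the harness.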

This matters enormously for the model-swapping point above. When a new model version comes out, which happens often enough that "wait for the next one" is not a strategy, we run it against the existing eval harness before considering a switch. Sometimes a newer, more capable model performs worse on our specific eval set, because "more capable" on general benchmarks doesn't always mean "better at this specific task with this specific prompt structure." Without an eval harness, you'd only discover this from user complaints. With one, it's a five-minute check before a config change.
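The pre-switch check can be reduced to comparing per-case results for the current and candidate models. This is a sketch of one possible policy (switch only if nothing regresses and the pass rate does not drop); the function name, the result format, and the policy itself are assumptions, and a team might reasonably accept some regressions in exchange for cost or latency gains.

```python
def compare_models(current: dict[str, bool],
                   candidate: dict[str, bool]) -> dict:
    """Compare per-case pass/fail results (case_id -> passed).

    Assumes both result sets are non-empty and cover the same eval cases.
    """
    regressions = sorted(case for case, ok in current.items()
                         if ok and not candidate.get(case, False))
    fixes = sorted(case for case, ok in candidate.items()
                   if ok and not current.get(case, False))
    current_rate = sum(current.values()) / len(current)
    candidate_rate = sum(candidate.values()) / len(candidate)
    return {
        "current_pass_rate": current_rate,
        "candidate_pass_rate": candidate_rate,
        "regressions": regressions,
        "fixes": fixes,
        # Illustrative policy: no regressions and no drop in pass rate.
        "switch": not regressions and candidate_rate >= current_rate,
    }
```

Listing regressions by case, rather than only comparing aggregate pass rates, is what surfaces the situation described above: a model that scores higher overall while breaking a case that used to work.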

Tracing is the other half of this. When something goes wrong in production (say, a user gets a bad answer), being able to see exactly what was retrieved, what prompt was constructed, what the model returned, and what post-processing happened to it is the difference between "we can fix this specific issue" and "we think it's probably fine now, we changed the prompt a bit."
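A trace does not need a dedicated platform to be useful; a structured record per request already answers most "what happened here" questions. The sketch below uses only the standard library, and the `Trace` class, its field names, and the stage labels are hypothetical rather than a description of any particular tracing product.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class Trace:
    query: str
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    started_at: float = field(default_factory=time.time)
    steps: list[dict] = field(default_factory=list)

    def record(self, stage: str, **payload) -> None:
        """Append one pipeline stage and whatever it produced."""
        self.steps.append({"stage": stage, **payload})

    def to_json(self) -> str:
        """Serialise to one JSON line, ready for a log file or store."""
        return json.dumps(asdict(self))


trace = Trace(query="What is the notice period?")
trace.record("retrieval", chunk_ids=["2", "2.1"])
trace.record("prompt", text="Answer using clauses 2 and 2.1 only: ...")
trace.record("model", model="example-model", output="30 days [source: clause 2]")
trace.record("postprocess", citations_verified=True)
print(trace.to_json())
```

With one such line per request, a bad answer can be traced back to the stage that caused it: the wrong chunks retrieved, a malformed prompt, the model output itself, or the post-processing.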

Tool use and MCP: integration stops being bespoke glue

Standards like the Model Context Protocol (MCP) give models a clean, reusable way to reach your data and take actions, and the practical impact is bigger than it sounds. Before standards like this, every integration between a model and an external system (a database, a CRM, an internal API) was bespoke: custom function definitions, custom parsing of model output, custom error handling, repeated for every tool and every project.

With MCP, a tool integration is written once and can be reused across projects and even across different model providers, because the protocol is the same regardless of which model is calling it. On a recent project building an employee-benefits assistant, this meant the integration with the client's HR system, built once as an MCP server, worked identically whether the underlying model was Claude or a different provider, which gave the client genuine flexibility on model choice without re-engineering the integration layer. That flexibility is exactly the point of treating the model as a swappable component: the value is in the integration and the data, not in which model happens to be calling it this month.
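To show why the integration becomes provider-independent, here is a deliberately simplified, standard-library-only sketch of the message shapes involved. It is not a compliant MCP server and not the integration from the project: a real one would use an official MCP SDK and handle initialisation, transports, and capabilities. The `tools/list` and `tools/call` method names and the `inputSchema` and `content` fields follow our reading of the MCP tools specification; the `lookup_benefit` tool and the `BENEFITS` data are invented for the example.

```python
# Hypothetical stand-in for the client's HR system.
BENEFITS = {"E100": {"dental": "enrolled", "vision": "not enrolled"}}

# Tool description: the model sees this, whichever provider it comes from.
TOOLS = [{
    "name": "lookup_benefit",
    "description": "Return an employee's enrolment status for one benefit.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "employee_id": {"type": "string"},
            "benefit": {"type": "string"},
        },
        "required": ["employee_id", "benefit"],
    },
}]


def handle(request: dict) -> dict:
    """Answer one JSON-RPC 2.0 request (simplified, no validation)."""
    method = request["method"]
    params = request.get("params", {})
    if method == "tools/list":
        result = {"tools": TOOLS}
    elif method == "tools/call" and params.get("name") == "lookup_benefit":
        args = params["arguments"]
        status = BENEFITS.get(args["employee_id"], {}).get(args["benefit"])
        result = {
            "content": [{"type": "text", "text": status or "unknown"}],
            "isError": status is None,
        }
    else:
        # Simplification: anything unrecognised is reported as an
        # unknown method (JSON-RPC code -32601).
        return {"jsonrpc": "2.0", "id": request.get("id"),
                "error": {"code": -32601, "message": "Method not found"}}
    return {"jsonrpc": "2.0", "id": request.get("id"), "result": result}
```

Nothing in `handle` knows which model is on the other end. The client for any provider sends the same `tools/call` request, which is what lets the integration outlive a change of model.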

The leaderboard changes monthly. Good system design doesn't.

A case where model choice genuinely mattered, and why it was the exception

To be fair to the "model matters" side of the argument: there are cases where model choice is a real decision, not a distraction. On a computer-vision project (comparing photos of retail store displays against a reference planogram and scoring compliance), the choice between vision-capable models had a measurable impact on accuracy for the specific task of identifying small product placement details in cluttered images.

But notice what made this a real decision rather than leaderboard-chasing: it was a specific, measurable property (accuracy on a representative test set of store photos), tied to a specific step in the pipeline (the visual comparison step, not the whole system), evaluated with the harness described above. We didn't pick a model because it was "the best model in 2026" in general; we picked it because it scored highest on our specific eval set for this specific task, and we re-run that evaluation periodically because the ranking can and does change. The decision was downstream of the system design, not a substitute for it.

The cost conversation nobody wants to have

There's a version of "just use the best model for everything" that's seductive because it's simple: one model, one API, no routing logic to build or maintain. The problem shows up in the bill, and it shows up nonlinearly as usage grows.

On the predictive-event-monitoring platform mentioned earlier, running every event through a frontier model would have made the unit economics of the product simply not work: the inference cost per event would have exceeded what the client could reasonably charge for the monitoring service itself. The routing architecture wasn't an optimisation we added later; it was a precondition for the product being viable at all. We've seen this pattern on enough projects that "what are the unit economics at 10x current volume" is now a question we ask during the architecture phase, not after launch when the bill arrives.
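The "10x volume" question is simple arithmetic once the per-event costs are written down. The sketch below compares sending every event to a frontier model with the triage-then-escalate design; all numbers in it are made up for illustration and are not figures from the client project.

```python
def monthly_inference_cost(events_per_month: int,
                           flagged_fraction: float,
                           triage_cost_per_event: float,
                           frontier_cost_per_event: float) -> dict[str, float]:
    """Compare 'frontier model for everything' with triage plus escalation."""
    all_frontier = events_per_month * frontier_cost_per_event
    routed = (events_per_month * triage_cost_per_event
              + events_per_month * flagged_fraction * frontier_cost_per_event)
    return {"all_frontier": all_frontier, "routed": routed}


# Hypothetical inputs: 10M events/month, 2% flagged, costs in dollars.
for scale in (1, 10):
    costs = monthly_inference_cost(
        events_per_month=10_000_000 * scale,
        flagged_fraction=0.02,
        triage_cost_per_event=0.0001,
        frontier_cost_per_event=0.01,
    )
    print(f"{scale}x volume: {costs}")
```

With these invented inputs the all-frontier design costs about 100,000 per month against about 3,000 for the routed design, and the gap grows to about 1,000,000 against 30,000 at 10x volume. The exact values matter less than running the calculation during the architecture phase.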

This is also where the "model as swappable component" framing pays off in a way that's easy to underrate: when a cheaper model becomes good enough for a given step, which happens regularly as the smaller-model tier improves, switching it in is a config change plus an eval run, not a re-architecture. Products built around a single hardcoded frontier-model dependency don't get to capture that improvement without engineering work; products built with routing get it almost for free.

What "AI-native" actually means, concretely

We use the phrase "AI-native" a lot, and what it actually means in light of all this is not "uses AI," and it's definitely not "uses the newest model." A product is AI-native when the system design assumes, from the start, that the intelligence layer is a component that will be retrieved-from, evaluated, routed-through, and swapped, not a black box bolted onto an otherwise-conventional architecture.

Concretely, that means: the data layer is designed for retrieval quality from day one, not retrofitted later. The architecture has a routing layer even if it initially routes everything to one model, because adding routing later means re-architecting call sites. An evaluation harness exists before the first user sees the feature, not after the first complaint. And the integration layer uses standards like MCP where possible, so the product isn't locked into one provider's specific function-calling format.
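A routing layer that initially routes everything to one model can be as small as a lookup table in front of the call sites. The sketch below is one way to do it; `ModelRouter`, the step names, and the model names are hypothetical, and the lambdas stand in for real provider clients.

```python
from typing import Callable

ModelFn = Callable[[str], str]  # prompt in, completion out


class ModelRouter:
    """Single place where pipeline steps are mapped to models."""

    def __init__(self, models: dict[str, ModelFn],
                 routes: dict[str, str], default: str) -> None:
        self.models = models    # model name -> callable client
        self.routes = routes    # pipeline step -> model name (config)
        self.default = default  # used for any step without an explicit route

    def complete(self, step: str, prompt: str) -> str:
        model_name = self.routes.get(step, self.default)
        return self.models[model_name](prompt)


models = {
    "frontier": lambda prompt: f"[frontier] {prompt}",
    "small": lambda prompt: f"[small] {prompt}",
}

# Day one: no explicit routes, so every step goes to the frontier model.
router = ModelRouter(models, routes={}, default="frontier")
print(router.complete("classify", "Is this event anomalous?"))

# Later: one config entry moves a step to a cheaper model.
router.routes["classify"] = "small"
print(router.complete("classify", "Is this event anomalous?"))
```

Call sites only ever name the step, never the model, so moving a step to a cheaper model is a change to `routes` followed by an eval run rather than an edit at every call site.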

None of these are about which model you use. All of them are about whether your product can adapt as the model layer keeps changing underneath it, which, given the pace of the last two years, it will keep doing for the foreseeable future. So when we build an AI feature, we design the system first (data, retrieval, guardrails, evals) and treat the model as a swappable component. That's how a product stays current as the rankings reshuffle every few months, and how you keep control of cost, latency, and quality instead of being at the mercy of whatever launched this week.
