A per-location anomaly-detection system across 1,000+ sites that catches sales dips and integration failures in real time — before they become losses.
A retail brand with 900+ locations generated rich operational and sales data, but early warning signs — a dip in sales, a broken integration at one site — were lost in the aggregate noise and usually noticed too late.
Detecting anomalies for one location is straightforward; doing it for over a thousand, each with its own normal patterns and seasonality, and acting in real time, is a systems problem as much as a modelling one.
We built anomaly-detection models tuned per site across 1,000+ locations, so each store is judged against its own baseline rather than a global average.
An automated CI/CD pipeline handles training, validation, and deployment at scale, so a thousand models stay current without manual effort.
When something anomalous appears, alerts are routed in real time to the stakeholders who can act — not buried in a dashboard.
The core architectural decision was to train one model per location rather than a single global model. The brand's locations don't share a baseline: a store near a university has a completely different weekly rhythm from a suburban one, and a single global threshold would either miss anomalies at quiet stores or flood busy ones with false positives. Each location's model learns its own seasonality (day-of-week, time-of-day, and seasonal patterns specific to that site) and flags deviations against that baseline, not an aggregate one.
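To make the per-location idea concrete, here is a minimal sketch, not the production model. It assumes a simple baseline of mean and standard deviation per (weekday, hour) slot and a z-score against it; the class name `LocationBaseline`, the slot layout, and the sample figures are all illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev


class LocationBaseline:
    """Hypothetical per-site baseline: mean/std of a metric per (weekday, hour) slot."""

    def __init__(self, min_std=1e-9):
        self.min_std = min_std  # guards against division by zero on flat history
        self.stats = {}

    def fit(self, history):
        # history: iterable of (weekday, hour, value) tuples for ONE location
        buckets = defaultdict(list)
        for weekday, hour, value in history:
            buckets[(weekday, hour)].append(value)
        self.stats = {slot: (mean(v), pstdev(v)) for slot, v in buckets.items()}
        return self

    def score(self, weekday, hour, value):
        # z-score against this site's own slot; raises KeyError for an unseen slot
        mu, sigma = self.stats[(weekday, hour)]
        return (value - mu) / max(sigma, self.min_std)


# Two made-up stores, same slot (Monday, 12:00), very different normal levels.
busy = LocationBaseline().fit([(0, 12, v) for v in (1000, 1040, 960, 1000)])
quiet = LocationBaseline().fit([(0, 12, v) for v in (100, 104, 96, 100)])

# The same absolute drop of 20 units is scored against each store's own baseline.
busy_z = busy.score(0, 12, 980)   # about -0.71: well within normal variation
quiet_z = quiet.score(0, 12, 80)  # about -7.07: far outside normal variation
```

The same 20-unit drop is unremarkable at the busy store and a clear outlier at the quiet one, which is the case a single global threshold handles badly.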
Training a thousand-plus models by hand isn't a workflow; it's a job for a pipeline. We built an automated CI/CD training pipeline on AWS SageMaker: retraining runs on a schedule as new data arrives, each model is validated against a holdout window before promotion, and a model registry tracks which version is live per location. If a location's data pattern shifts (a renovation, a new product line, a changed operating schedule), the next scheduled retrain absorbs it automatically, without anyone needing to notice and intervene.
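The validate-then-promote step can be sketched in a few lines. This is plain Python for illustration only and does not use the SageMaker API; `ModelVersion`, `Registry`, `promote_if_valid`, and the 5% flag-rate gate are hypothetical names and values, and the holdout window is assumed to contain normal operation.

```python
from dataclasses import dataclass, field


@dataclass
class ModelVersion:
    """Hypothetical stand-in for a trained per-location model."""
    location_id: str
    version: int
    mean: float
    std: float

    def is_anomalous(self, value, threshold=3.0):
        return abs(value - self.mean) / max(self.std, 1e-9) > threshold


@dataclass
class Registry:
    """Tracks which model version is live for each location."""
    live: dict = field(default_factory=dict)  # location_id -> ModelVersion

    def promote_if_valid(self, candidate, holdout, max_flag_rate=0.05):
        # Gate (an assumption): a candidate that flags too much of a normal
        # holdout window is rejected, and the current live version stays.
        if not holdout:
            return False
        flagged = sum(candidate.is_anomalous(v) for v in holdout)
        if flagged / len(holdout) <= max_flag_rate:
            self.live[candidate.location_id] = candidate
            return True
        return False


registry = Registry()
holdout = [98, 103, 95, 101, 107]  # made-up recent values for one store

good = ModelVersion("store-042", version=1, mean=100.0, std=5.0)
bad = ModelVersion("store-042", version=2, mean=200.0, std=5.0)  # badly fit retrain

promoted_good = registry.promote_if_valid(good, holdout)  # True: nothing flagged
promoted_bad = registry.promote_if_valid(bad, holdout)    # False: everything flagged
```

Because a failed candidate never replaces the live entry, a bad retrain at one site leaves that site on its previous version without affecting any other location.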
Incoming data is scored against the live per-location model in near real time. When a score crosses the anomaly threshold, the event is enriched with context (which location, what metric, how far from baseline, how long it's persisted) and routed to the right stakeholder — a regional ops manager for a sales dip, an IT contact for an integration failure — rather than landing in a shared dashboard that nobody owns. The routing logic itself is configurable per alert type, so different anomaly categories reach different teams.
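As a rough illustration of the enrich-and-route step, the sketch below turns a score into an alert carrying the context described above and looks up the recipient by alert category. The routing table, team names, default route, and threshold are assumptions made for the example, not the deployed configuration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical routing table: alert category -> owning team.
ROUTES = {
    "sales_dip": "regional-ops",
    "integration_failure": "it-oncall",
}
DEFAULT_ROUTE = "ops-triage"  # fallback so no alert goes unowned


@dataclass
class Alert:
    location_id: str
    metric: str
    category: str
    deviation: float        # how far from the location's baseline
    persisted_minutes: int  # how long the anomaly has lasted
    recipient: str


def build_alert(location_id, metric, category, score, persisted_minutes,
                threshold=3.0, routes=ROUTES) -> Optional[Alert]:
    """Return an enriched, routed alert, or None if the score is within threshold."""
    if abs(score) < threshold:
        return None
    recipient = routes.get(category, DEFAULT_ROUTE)
    return Alert(location_id, metric, category, score, persisted_minutes, recipient)


dip = build_alert("store-042", "hourly_sales", "sales_dip", -4.2, 30)
outage = build_alert("store-117", "pos_heartbeat", "integration_failure", 6.0, 10)
normal = build_alert("store-042", "hourly_sales", "sales_dip", 1.0, 0)
unknown = build_alert("store-300", "footfall", "new_category", 5.0, 15)
```

Keeping the routes in a plain mapping means a new anomaly category can be pointed at a different team by changing configuration rather than scoring code.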
Problems were detected and routed automatically, location by location, in real time — protecting revenue and the customer experience by catching issues while they were still small.