Machine Learning

Time-Series Anomaly Detection at Scale

A per-location anomaly-detection system across 1,000+ sites that catches sales dips and integration failures in real time, before they become losses.

Client

Retail brand · 900+ locations

Discipline

Machine Learning

Engagement

Dedicated product team

1,000+

store locations monitored with per-site anomaly models

70-90%

of anomalous events caught before manual reporting

24/7

automated retraining keeps every model current

Context

A retail brand with 900+ locations generated rich operational and sales data, but early warning signs (a dip in sales, a broken integration at one site) were lost in the aggregate noise and usually noticed too late.

The challenge

Detecting anomalies for one location is straightforward; doing it for over a thousand, each with its own normal patterns and seasonality, and acting in real time, is a systems problem as much as a modelling one.

Our approach

A model per location

We built anomaly-detection models tuned per site across 1,000+ locations, so each store is judged against its own baseline rather than a global average.

Industrialized training

An automated CI/CD pipeline handles training, validation, and deployment at scale, so a thousand models stay current without manual effort.

Real-time, routed alerts

When something anomalous appears, alerts are routed in real time to the stakeholders who can act, not buried in a dashboard.

Figure: Per-location data flows from store systems through a shared ingestion layer to 1,000+ independently trained models.

Architecture

Per-location modelling, not a global average

The core architectural decision was to train one model per location rather than a single global model. A retail brand's 900+ locations don't share a baseline: a store near a university has a completely different weekly rhythm than a suburban location, and a single global threshold would either miss anomalies at quiet stores or flood busy ones with false positives. Each location's model learns its own seasonality (day-of-week, time-of-day, and seasonal patterns specific to that site) and flags deviations against that baseline, not an aggregate one.
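
To make the per-site baseline idea concrete, here is a minimal Python sketch. It is not the production model, whose exact technique is not described here. It assumes a simple robust z-score against a median baseline per (day-of-week, hour) slot, and the class name `SeasonalBaseline`, the helper `weekly_history`, the 3.5 threshold, and the sample figures are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import median


class SeasonalBaseline:
    """Hypothetical per-location baseline keyed by (weekday, hour)."""

    def __init__(self, threshold=3.5):
        self.threshold = threshold  # assumed cut-off on the robust z-score
        self.stats = {}             # (weekday, hour) -> (median, MAD)

    def fit(self, history):
        """history: iterable of (datetime, value) pairs for ONE location."""
        buckets = defaultdict(list)
        for ts, value in history:
            buckets[(ts.weekday(), ts.hour)].append(value)
        for key, values in buckets.items():
            med = median(values)
            mad = median(abs(v - med) for v in values)
            self.stats[key] = (med, mad)
        return self

    def score(self, ts, value):
        """Robust z-score of value against this slot's own baseline.

        Raises KeyError for a slot that never appeared in the history.
        """
        med, mad = self.stats[(ts.weekday(), ts.hour)]
        scale = 1.4826 * mad or 1.0  # fall back to 1.0 when the slot has no spread
        return (value - med) / scale

    def is_anomalous(self, ts, value):
        return abs(self.score(ts, value)) > self.threshold


def weekly_history(values, start=datetime(2024, 1, 1, 12)):
    """Pair each value with the same Monday-noon slot in consecutive weeks."""
    return [(start + timedelta(weeks=i), v) for i, v in enumerate(values)]


# Made-up figures: one model per store, each fitted only on that store's data.
busy = SeasonalBaseline().fit(weekly_history([200, 210, 190, 205, 195, 202, 198, 207]))
quiet = SeasonalBaseline().fit(weekly_history([38, 42, 40, 41, 39, 43, 37, 40]))

probe = datetime(2024, 3, 4, 12)  # a later Monday at noon
print(busy.is_anomalous(probe, 40), quiet.is_anomalous(probe, 40))
```

The same reading of 40 is flagged at the busy store and treated as normal at the quiet one, which is the behaviour a single chain-wide threshold cannot reproduce.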

Industrialising training across 1,000+ models

Training a thousand-plus models by hand isn't a workflow; it's a job for a pipeline. We built an automated CI/CD training pipeline on AWS SageMaker: retraining runs on a schedule as new data arrives, each model is validated against a holdout window before promotion, and a model registry tracks which version is live per location. If a location's data pattern shifts (a renovation, a new product line, a changed operating schedule), the next scheduled retrain absorbs it automatically, without anyone needing to notice and intervene.
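
The validate-then-promote step can be sketched in a few lines of plain Python. This illustrates the gating logic only and makes no SageMaker calls. The registry is a plain dict standing in for a real model registry, and `retrain_location`, `mean_forecaster`, `broken_forecaster`, and the `max_mae` threshold are hypothetical names and values chosen for the example.

```python
def holdout_mae(predict, holdout):
    """Mean absolute error of a forecast function over the holdout window."""
    errors = [abs(predict(ts) - actual) for ts, actual in holdout]
    return sum(errors) / len(errors)


def retrain_location(registry, location_id, train_fn, history, holdout_size, max_mae):
    """Train a candidate, check it on the most recent points, promote if it passes.

    registry: dict mapping location_id -> {"version", "mae", "predict"}.
    Returns True when the candidate became the live model.
    """
    train, holdout = history[:-holdout_size], history[-holdout_size:]
    predict = train_fn(train)
    mae = holdout_mae(predict, holdout)
    if mae > max_mae:
        return False  # the previously promoted version stays live
    previous = registry.get(location_id, {"version": 0})
    registry[location_id] = {
        "version": previous["version"] + 1,
        "mae": mae,
        "predict": predict,
    }
    return True


def mean_forecaster(train):
    """Toy trainer (assumption): always forecast the training mean."""
    level = sum(value for _, value in train) / len(train)
    return lambda ts: level


def broken_forecaster(train):
    """Deliberately bad trainer, used to show a rejected candidate."""
    return lambda ts: 0.0


registry = {}
history = [(day, 100.0) for day in range(30)]  # made-up daily values

first = retrain_location(registry, "store-001", mean_forecaster, history, 7, max_mae=20.0)
second = retrain_location(registry, "store-001", broken_forecaster, history, 7, max_mae=20.0)
print(first, second, registry["store-001"]["version"])
```

The first candidate passes the holdout check and becomes version 1. The second fails it, so the registry still points at version 1 and scoring continues on the last model that validated.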

Real-time scoring and routed alerting

Incoming data is scored against the live per-location model in near real time. When a score crosses the anomaly threshold, the event is enriched with context (which location, what metric, how far from baseline, how long it's persisted) and routed to the right stakeholder (a regional ops manager for a sales dip, an IT contact for an integration failure) rather than landing in a shared dashboard that nobody owns. The routing logic itself is configurable per alert type, so different anomaly categories reach different teams.
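
A minimal sketch of the enrich-and-route step follows, assuming a lookup table from alert type to owning team. The route names, the `Alert` fields, the `build_alert` helper, and the fallback queue are hypothetical. The real service's configuration format is not described here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumed routing configuration: alert type -> team that owns it.
ROUTES = {
    "sales_dip": "regional-ops",
    "integration_failure": "it-support",
}
DEFAULT_ROUTE = "ops-triage"  # assumed fallback for unmapped alert types


@dataclass
class Alert:
    location_id: str
    metric: str
    alert_type: str
    deviation: float          # how far from baseline (e.g. an anomaly score)
    persisted_for: timedelta  # how long the anomaly has lasted
    recipient: str


def build_alert(location_id, metric, alert_type, deviation, first_seen, now, routes=ROUTES):
    """Attach context to a threshold crossing and pick the team to notify."""
    return Alert(
        location_id=location_id,
        metric=metric,
        alert_type=alert_type,
        deviation=deviation,
        persisted_for=now - first_seen,
        recipient=routes.get(alert_type, DEFAULT_ROUTE),
    )


now = datetime(2024, 3, 4, 12, 45)
alert = build_alert(
    "store-001", "pos_transactions", "integration_failure",
    deviation=-6.2, first_seen=now - timedelta(minutes=45), now=now,
)
unmapped = build_alert(
    "store-002", "footfall", "sensor_gap",
    deviation=-4.0, first_seen=now, now=now,
)
print(alert.recipient, alert.persisted_for, unmapped.recipient)
```

Keeping the routes in data rather than in code is what lets a new anomaly category be pointed at a different team without touching the scoring path.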

What we built

Per-location time-series anomaly-detection models across 1,000+ sites
An automated CI/CD training, validation, and deployment pipeline on SageMaker
A real-time scoring service with configurable alert thresholds per metric
Stakeholder-aware alert routing and escalation
A model registry tracking live model versions per location

Technology stack

AI / ML

Time-series forecasting · Anomaly detection (statistical + ML) · Per-entity model training · Seasonality decomposition

MLOps

AWS SageMaker · Automated CI/CD pipelines · Model registry · Scheduled retraining

Data & Infra

Python · AWS (S3, Lambda, EventBridge) · Stream ingestion · CloudWatch

Delivery

Alert routing service · Stakeholder notification (email/Slack) · Dashboards for ops teams

Results & impact

Problems were detected and routed automatically, location by location, in real time, protecting revenue and the customer experience by catching issues while they were still small.

The platform now monitors operational and sales signals across 1,000+ locations continuously, with each location judged against its own historical baseline rather than a chain-wide average.
Anomalies (a sales dip at one store, a POS integration silently failing at another) surface within the same operational window they occur in, rather than being noticed days later in a weekly report.
The automated retraining pipeline means model drift from seasonal shifts, renovations, or changing store operations is absorbed without manual model maintenance: a thousand models stay current with the same operational overhead as one.
Alerts reach the team that can act on them directly, cutting the time between 'something's wrong' and 'someone's looking at it' from days to the same shift.