Machine Learning

Time-Series Anomaly Detection at Scale

A per-location anomaly-detection system across 1,000+ sites that catches sales dips and integration failures in real time — before they become losses.

Client: Retail brand · 900+ locations
Discipline: Machine Learning
Engagement: Dedicated product team

1,000+ store locations monitored with per-site anomaly models
70-90% of anomalous events caught before manual reporting
24/7 automated retraining keeps every model current

Context

A retail brand with 900+ locations generated rich operational and sales data, but early warning signs — a dip in sales, a broken integration at one site — were lost in the aggregate noise and usually noticed too late.

The challenge

Detecting anomalies for one location is straightforward; doing it for over a thousand, each with its own normal patterns and seasonality, and acting in real time, is a systems problem as much as a modelling one.

Our approach

A model per location

We built anomaly-detection models tuned per site across 1,000+ locations, so each store is judged against its own baseline rather than a global average.

Industrialized training

An automated CI/CD pipeline handles training, validation, and deployment at scale, so a thousand models stay current without manual effort.

Real-time, routed alerts

When something anomalous appears, alerts are routed in real time to the stakeholders who can act — not buried in a dashboard.

[Architecture diagram. Tiers: Device / Field, Edge, Cloud. Flow: Store POS / Ops (per-location data) → Ingestion Pipeline (stream + normalise) → Per-Site Models (SageMaker) → Anomaly Scoring (Z-score + ML) → Alert Routing (stakeholder push)]
Per-location data flows from store systems through a shared ingestion layer to 1,000+ independently trained models

Architecture

Per-location modelling, not a global average

The core architectural decision was to train one model per location rather than a single global model. The brand's 900+ locations don't share a baseline: a store near a university has a completely different weekly rhythm from a suburban one, and a single global threshold would either miss anomalies at quiet stores or flood busy ones with false positives. Each location's model learns its own seasonality (day-of-week, time-of-day, and seasonal patterns specific to that site) and flags deviations against that baseline, not an aggregate one.
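To make the per-location baseline idea concrete, here is a minimal sketch in Python. It is an illustration, not the production models: it assumes a plain mean and standard deviation per (day-of-week, hour) slot and a z-score against that slot, and the names `fit_baseline` and `z_score` are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def fit_baseline(history):
    """Fit a seasonal baseline for ONE location.

    history: iterable of (day_of_week, hour, value) observations.
    Returns {(day_of_week, hour): (mean, std)} for slots with enough data.
    """
    buckets = defaultdict(list)
    for dow, hour, value in history:
        buckets[(dow, hour)].append(value)
    baseline = {}
    for slot, values in buckets.items():
        if len(values) >= 2:  # stdev needs at least two points
            baseline[slot] = (mean(values), stdev(values))
    return baseline


def z_score(baseline, dow, hour, value):
    """Deviation from the location's own baseline, in standard deviations.

    Returns None when the slot is unseen or has zero variance (unscoreable).
    """
    slot = baseline.get((dow, hour))
    if slot is None or slot[1] == 0:
        return None
    mu, sigma = slot
    return (value - mu) / sigma


# Two hypothetical stores, same Monday-noon slot, very different normals.
campus = fit_baseline((0, 12, v) for v in [290, 310, 300, 305, 295])
suburban = fit_baseline((0, 12, v) for v in [95, 105, 100, 102, 98])

# The same reading of 100 is a severe dip for one store and normal for the other.
campus_z = z_score(campus, 0, 12, 100)      # strongly negative
suburban_z = z_score(suburban, 0, 12, 100)  # zero
```

The same observation scores very differently at each store, which is the behaviour a single chain-wide threshold cannot reproduce.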

Industrialising training across 1,000+ models

Training a thousand-plus models by hand isn't a workflow; it's a job for a pipeline. We built an automated CI/CD training pipeline on AWS SageMaker: retraining runs on a schedule as new data arrives, each model is validated against a holdout window before promotion, and a model registry tracks which version is live per location. If a location's data pattern shifts (a renovation, a new product line, a changed operating schedule), the next scheduled retrain absorbs it automatically, without anyone needing to notice and intervene.
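The validate-then-promote step can be sketched as follows. This is a simplified stand-in that makes several assumptions: the registry is an in-memory dictionary rather than SageMaker's model registry, the holdout check is a mean absolute error compared with a `max_error` limit, and every name here is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ModelRegistry:
    """Hypothetical in-memory registry: location id -> live model record."""
    live: dict = field(default_factory=dict)

    def promote(self, location_id, version, holdout_error):
        self.live[location_id] = {"version": version, "holdout_error": holdout_error}


def holdout_mae(predict, holdout):
    """Mean absolute error of `predict` over (features, actual) pairs."""
    return sum(abs(predict(x) - y) for x, y in holdout) / len(holdout)


def validate_and_promote(registry, location_id, version, predict, holdout, max_error):
    """Promote the candidate only if it clears the holdout check.

    On failure the previously live version stays in place.
    """
    error = holdout_mae(predict, holdout)
    if error <= max_error:
        registry.promote(location_id, version, error)
        return True
    return False


registry = ModelRegistry()
holdout = [(1, 10.0), (2, 20.0)]  # assumed (features, actual) pairs

# A candidate that fits the holdout window is promoted.
promoted = validate_and_promote(registry, "store-001", "v2", lambda x: 10.0 * x, holdout, 2.0)

# A candidate that fails validation is rejected, and v2 stays live.
rejected = validate_and_promote(registry, "store-001", "v3", lambda x: 0.0, holdout, 2.0)
```

Gating promotion on the holdout window means a bad retrain for one location never replaces a working model, which matters when no one is reviewing a thousand retrains by hand.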

Real-time scoring and routed alerting

Incoming data is scored against the live per-location model in near real time. When a score crosses the anomaly threshold, the event is enriched with context (which location, what metric, how far from baseline, how long it's persisted) and routed to the right stakeholder — a regional ops manager for a sales dip, an IT contact for an integration failure — rather than landing in a shared dashboard that nobody owns. The routing logic itself is configurable per alert type, so different anomaly categories reach different teams.
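The enrichment and routing step might look like the sketch below. It assumes a simple lookup table keyed by alert category; the category names, recipient names, the fallback route, and the threshold of 3.0 are illustrative assumptions rather than details of the deployed service.

```python
from dataclasses import dataclass


@dataclass
class AnomalyEvent:
    location_id: str
    metric: str
    category: str           # e.g. "sales_dip" or "integration_failure"
    deviation: float        # distance from baseline, in standard deviations
    persisted_minutes: int  # how long the anomaly has lasted


# Hypothetical routing table: configurable per alert type.
ROUTES = {
    "sales_dip": "regional-ops-manager",
    "integration_failure": "it-contact",
}
DEFAULT_ROUTE = "ops-triage"  # assumed fallback for unmapped categories


def build_alert(event, threshold=3.0, routes=ROUTES):
    """Return a routed alert, or None if the score is under the threshold."""
    if abs(event.deviation) < threshold:
        return None
    return {
        "recipient": routes.get(event.category, DEFAULT_ROUTE),
        "message": (
            f"{event.location_id}: {event.metric} is {event.deviation:+.1f} sigma "
            f"from baseline, persisting {event.persisted_minutes} min"
        ),
    }


dip = AnomalyEvent("store-042", "hourly_sales", "sales_dip", -4.2, 45)
outage = AnomalyEvent("store-107", "pos_heartbeat", "integration_failure", -6.0, 10)
noise = AnomalyEvent("store-042", "hourly_sales", "sales_dip", -1.1, 5)

dip_alert = build_alert(dip)        # goes to the regional ops manager
outage_alert = build_alert(outage)  # goes to the IT contact
noise_alert = build_alert(noise)    # below threshold, no alert
```

Keeping the routing table as data rather than code is what lets different anomaly categories reach different teams without redeploying the scoring service.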

What we built

  • Per-location time-series anomaly-detection models across 1,000+ sites
  • An automated CI/CD training, validation, and deployment pipeline on SageMaker
  • A real-time scoring service with configurable alert thresholds per metric
  • Stakeholder-aware alert routing and escalation
  • A model registry tracking live model versions per location

Technology stack

AI / ML
Time-series forecasting · Anomaly detection (statistical + ML) · Per-entity model training · Seasonality decomposition
MLOps
AWS SageMaker · Automated CI/CD pipelines · Model registry · Scheduled retraining
Data & Infra
Python · AWS (S3, Lambda, EventBridge) · Stream ingestion · CloudWatch
Delivery
Alert routing service · Stakeholder notification (email/Slack) · Dashboards for ops teams

Results & impact

Problems were detected and routed automatically, location by location, in real time — protecting revenue and the customer experience by catching issues while they were still small.

  • The platform now monitors operational and sales signals across 1,000+ locations continuously, with each location judged against its own historical baseline rather than a chain-wide average.
  • Anomalies — a sales dip at one store, a POS integration silently failing at another — surface within the same operational window they occur in, rather than being noticed days later in a weekly report.
  • The automated retraining pipeline means model drift from seasonal shifts, renovations, or changing store operations is absorbed without manual model maintenance — a thousand models stay current with the same operational overhead as one.
  • Alerts reach the team that can act on them directly, cutting the time between 'something's wrong' and 'someone's looking at it' from days to within the same shift.
