As more companies rely on machine learning to make product decisions, a structural issue is becoming increasingly visible, particularly inside large recommendation platforms. The same systems are often expected to handle two very different responsibilities: delivering real-time personalised results and running experiments to evaluate new ideas. At Spotify, engineers concluded that trying to do both within a single system was holding them back.
Personalisation systems are designed for speed and stability. They must respond instantly, handle massive traffic, and remain reliable even during spikes in usage. Experimentation systems operate under a different set of priorities. They need room for iteration, comparison, analysis, and failure. Over time, Spotify found that combining these workloads made both harder to manage.
The company has spent years building large-scale personalisation pipelines that determine which songs, podcasts, and playlists users see. Alongside that, it continuously experiments with ranking models, recommendation logic, and user experiences. Early on, these systems were closely intertwined. As the platform grew, that tight coupling began to create operational and organisational friction.
Rather than treating experimentation as a layer within its personalisation stack, Spotify made a deliberate architectural shift: separate the systems entirely. One system is responsible for serving results to users. The other is responsible for learning from data and evaluating change.
While subtle on the surface, this decision reshaped how teams deploy models, assess risk, and move ideas into production.
Two Distinct Systems, Two Different Priorities
In Spotify’s architecture, personalisation pipelines are optimised for low latency and high availability. These systems power live user requests and operate under strict time constraints. Any failure or delay is immediately visible to users.
Experimentation systems follow a different model. They focus on collecting data, running controlled comparisons, and supporting long-term analysis. For these systems, accuracy, traceability, and reproducibility matter far more than response time.
By keeping these systems separate, Spotify allows each to evolve independently. Experimentation workflows can change rapidly without threatening production stability, while personalisation systems remain reliable even as new ideas are tested elsewhere.
The separation also reduces risk. A broken experiment does not impact live traffic, and a production issue does not invalidate weeks of experimental data.
For engineers, this changes how work progresses through the organisation. Models no longer move directly into user-facing systems. Instead, they pass through a defined evaluation process, where results are reviewed and debated before being deployed at scale. As machine learning models grow more complex, this evaluation stage becomes increasingly critical.
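In practice, such an evaluation stage often boils down to a small set of promotion criteria applied to experiment results. The sketch below illustrates the idea in Python; every name, field, and threshold here is hypothetical, chosen for illustration rather than drawn from Spotify's actual stack:

```python
from dataclasses import dataclass

@dataclass
class ExperimentResult:
    """Summary of an A/B comparison for a candidate model (hypothetical schema)."""
    model_id: str
    metric_lift: float       # relative improvement vs. control on the primary metric
    p_value: float           # statistical significance of the observed lift
    guardrails_passed: bool  # no regressions on latency, error rate, etc.

def ready_for_production(result: ExperimentResult,
                         min_lift: float = 0.01,
                         alpha: float = 0.05) -> bool:
    """Promote a candidate only if it shows a significant lift
    and violates no guardrail metrics."""
    return (result.guardrails_passed
            and result.metric_lift >= min_lift
            and result.p_value < alpha)

candidate = ExperimentResult("ranker-v2", metric_lift=0.023,
                             p_value=0.01, guardrails_passed=True)
print(ready_for_production(candidate))  # True: passes all three checks
```

The point is not the specific thresholds but that the decision to ship is made by an explicit, reviewable rule rather than by whoever deploys last.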
Why Separation Matters in the AI Era
As recommendation systems evolve, understanding why a model behaves a certain way becomes more difficult. Small adjustments can produce unexpected outcomes, and problems are often discovered only after users notice changes.
Separating experimentation from serving allows Spotify to slow down decision-making without slowing down delivery. Teams can examine whether a change improved outcomes, caused regressions, or shifted user behaviour in unanticipated ways before those changes reach everyone.
This approach also creates a durable record of decision-making. Experiments are logged, reviewed, and compared over time, making it easier to revisit past decisions or explain outcomes internally.
More broadly, it reflects a shift in how large engineering organisations think about AI systems. Models are no longer treated as static components but as evolving processes that require oversight, validation, and rollback mechanisms.
For developers, this means more effort happens before production, not less. The goal is to identify issues early, when they are easier to diagnose and less costly to fix.
Platform Engineering Over Model Tuning
What stands out in Spotify’s approach is that the core challenge is not selecting better models; it is coordination at scale.
Separating experimentation from personalisation forces teams to define clear interfaces, data contracts, and ownership boundaries. It requires shared tooling for logging, evaluation, and review. It also demands patience, since not every idea progresses to production.
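One common way to express such a boundary is a shared contract that the serving side depends on, while experiment code supplies interchangeable implementations behind it. The sketch below uses a Python `Protocol` to show the shape of the idea; the interface and model names are hypothetical, not Spotify's actual APIs:

```python
from typing import Protocol

class RankingModel(Protocol):
    """The contract both sides agree on: anything that can rank candidates."""
    def rank(self, user_id: str, candidates: list[str]) -> list[str]: ...

def serve_recommendations(model: RankingModel, user_id: str,
                          candidates: list[str], k: int = 10) -> list[str]:
    """The serving path depends only on the contract, never on a
    specific experiment's implementation."""
    return model.rank(user_id, candidates)[:k]

class PopularityModel:
    """A trivial implementation satisfying the contract, for illustration."""
    def __init__(self, play_counts: dict[str, int]) -> None:
        self._plays = play_counts

    def rank(self, user_id: str, candidates: list[str]) -> list[str]:
        return sorted(candidates, key=lambda c: self._plays.get(c, 0),
                      reverse=True)

model = PopularityModel({"track-a": 5, "track-b": 9, "track-c": 1})
top = serve_recommendations(model, "user-1", ["track-a", "track-b", "track-c"], k=2)
```

Because the serving function only sees the contract, an experimental ranker can be swapped in behind the same interface without touching production code.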
This is where platform engineering plays a central role. The platform defines how ideas move through the organisation, not just the tools used to build them.
When those rules are well-defined, teams can move quickly without interfering with one another. When they are unclear, progress slows and trust erodes. Spotify’s experience suggests that scaling AI is less about algorithm choice and more about building systems that support disagreement, measurement, and gradual change.
Lessons for Teams Scaling AI Systems
Most organisations do not operate at Spotify’s scale, but the trade-offs are widely applicable.
Many teams still run experiments directly inside production systems because it feels faster and simpler. Over time, that simplicity fades. Changes become harder to understand, rollbacks become riskier, and confidence in results declines.
Separating experimentation from serving does introduce additional process and friction. It requires upfront investment and forces teams to slow down in areas where speed once seemed paramount.
But that friction can be valuable.
It creates space to question assumptions, test ideas without commitment, and identify what is truly safe to deploy. As AI systems take on greater responsibility, those signals become increasingly important.
Spotify’s architectural choice is not a strict blueprint for others to follow. It is a reminder that infrastructure shapes behaviour. When systems are designed to prioritise learning over speed, teams are better positioned to make sound long-term decisions.
In the race to scale AI responsibly, that may be one of the most practical lessons to take away.