August 3, 2023
TL;DR: Spotify is releasing a new commercial product for software development teams: a version of our homegrown experimentation platform that we’re calling Confidence. Based on everything we’ve learned over the last 10+ years about what it takes to enable experimentation at scale, the platform makes it easy for teams to set up, run, coordinate, and analyze their own user tests, from simple A/B testing to the most complex use cases, so they can quickly validate their ideas and optimize them for impact. We designed Confidence to be flexible, extensible, and customizable. Our goal is to make the platform simple to get started with and impossible to outgrow, making experimentation an integral part of your org, like it is for ours.
Note: Confidence is available now as a private beta only. Sign up for the waitlist to be eligible for an invitation and to get updates on features, demos, launch dates, and other product news.
One platform for a million ideas
Spotify’s data scientists and engineers have been developing and honing our product testing methods for years. Whether it’s automatically coordinating simultaneous A/B tests or orchestrating the rollout of an AI recommendation system across mobile, desktop, and web, the platform we built scales experimentation best practices and capabilities to all our teams. Soon this experimentation platform will be available to any company that wants to build, test, and iterate on ideas the way we do at Spotify: quickly, reliably, and with confidence.
How did we get here? We didn’t realize it when we started down this road years ago with our first homegrown experimentation tools, but we’ve been on a decade-long journey to take A/B testing to the next level. As our vision for what Spotify could be expanded, so did our need for an experimentation platform that could scale and keep pace with us. So that’s what we’ve built. And it’s what enables the experimentation culture at Spotify to thrive today.
Building a culture of experimentation
With hundreds of squads and thousands of developers, designers, data scientists, and PMs, there is no shortage of ideas at Spotify. “What if we used different playlist art for different regions?” “What if you could preview the most interesting parts of podcast episodes just by swiping through them?” “What if every listener had their own personal DJ?” As a product-focused company, we’re always looking for ways to add value and deliver a great experience for our users, from listeners to creators to advertisers.
We don’t want to slow down the flow of those ideas or get in the way of the development cycle. Our philosophy is “Think it, build it, ship it, tweak it.” Shipping more ideas faster gets us to the best ideas faster. But how do we know which ideas are great? And which ideas are just learning experiences for the next idea?
Data or it didn’t happen
We’ve come a long way from being just a music player. From playlists using Algotorial technology to our annual Wrapped campaign to an AI DJ, even the most ambitious ideas at Spotify got their start as just another idea in an ocean of ideas. Each one is a tiny spark: bright, shiny, new, and completely unproven.
What we’ve learned over time is that, no matter how exciting the idea is or what kind of data you have to support it, if you’re not running controlled experiments, then you’re not confronting your ideas with reality. Customer feedback, intuition, and creativity are all essential tools for bringing innovations to market. But without a solid scientific method and engineering infrastructure to help make data-informed decisions, your teams will be forever chasing ideas instead of shipping and improving the ones that have the most impact on your users and your business.
From a handful of experiments to hundreds
Our early experimentation efforts began more than a decade ago. In the early 2010s (let’s carbon-date it to around when Adele’s “Rolling in the Deep” was climbing the charts), a few data scientists and engineers started conducting small A/B tests internally. These tests were manual and error-prone, but we believed in the importance of experimenting and wanted to get better at it.
So, we decided to build our own A/B testing platform, which we called ABBA. ABBA was pretty basic: it did feature flagging and analysis for a set of standardized metrics. The simplicity and flexibility of ABBA unlocked a wave of experimentation across the company. We grew from running fewer than 20 priority experiments per year to running hundreds of experiments per year across multiple squads.
More testing is not the same as better testing
Then around 2018 (when Drake’s “God’s Plan” was the top-streaming song), Spotify launched a revamped free tier, and with the sudden influx of users, we were presented with even more experimentation opportunities. By this time, we had migrated to Google Cloud, and access to all the raw processing power of BigQuery made getting test results faster and easier. So, as the business continued to grow, we continued to increase what we tested.
Then a funny thing started to happen: the more testing we did, the more we could see flaws in the testing methods themselves. Our teams were getting bogged down by restarting experiments, manually calculating statistical analyses in notebooks, and coordinating test groups in spreadsheets. Tying features to specific tests via feature flags also started to prove restrictive. As the bottlenecks, workarounds, and errors continued to pile up, it was clear that we were running into the limits of what ABBA could do for us.
We needed to be able to scale our testing methods across more devices and software platforms, new and more complex use cases (testing personalized, context-aware recommendations powered by machine learning is very different from testing the color of a button), an expanding user base (with more cloud processing power came more data management issues), and most importantly, a growing number of teams, which meant more experiments crashing into each other than ever. In short, we needed to learn how to experiment both better and faster. We needed to learn how to experiment at scale.
Learning how to learn better
So we took everything we learned from ABBA and started over. We began building new tools and incorporating more advanced testing methods into how we work. We also started to automate some of these scientific best practices so it was easier for teams to set up controlled experiments themselves, without having to coordinate or schedule test groups with others. And that’s where our new Experimentation Platform (EP for short) came in.
Our data platform team introduced two major improvements with EP: (1) a new Metrics Catalog that made analyzing metrics self-service and eliminated the need for data scientists to run analysis manually in notebooks, and (2) a coordination engine that allowed us to run many mutually exclusive experiments at the same time, including managing holdback groups.
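To make the idea of mutually exclusive experiments and holdbacks concrete, here is a minimal sketch of one common approach: hash each user into a stable bucket and give each experiment a disjoint range of buckets. This is an illustration only, not a description of how EP or Confidence is implemented. The bucket count, the salt value, the experiment names, and the functions `bucket_for` and `assign` are all assumptions made up for this example.

```python
import hashlib
from typing import Optional

NUM_BUCKETS = 1000  # hypothetical number of buckets in one exclusivity layer

# Hypothetical layout: disjoint bucket ranges, so a user can land in at most
# one experiment, and users in the holdback are kept out of every experiment.
ALLOCATIONS = {
    "holdback": range(0, 50),
    "playlist_art_test": range(50, 450),
    "podcast_preview_test": range(450, 850),
    # buckets 850-999 are left unallocated for future experiments
}


def bucket_for(user_id: str, salt: str) -> int:
    """Map a user to a stable bucket in [0, NUM_BUCKETS) with a salted hash."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_BUCKETS


def assign(user_id: str, salt: str = "layer-1") -> Optional[str]:
    """Return the allocation that owns the user's bucket, or None if unallocated."""
    bucket = bucket_for(user_id, salt)
    for name, buckets in ALLOCATIONS.items():
        if bucket in buckets:
            return name
    return None


if __name__ == "__main__":
    for uid in ("user-1", "user-2", "user-3"):
        print(uid, assign(uid))
```

Because the hash depends only on the salt and the user ID, a user gets the same assignment on every device and every request. Changing the salt in this sketch reshuffles which users fall into which buckets.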
With EP, any team at Spotify can run any kind of experiment with the confidence that, at the end of it, they’ll have insights they can trust and use to move forward in an informed way.
From hundreds of experiments to thousands
Once we made it easier for teams to create, run, manage, and analyze experiments on their own and in a scientifically reliable way, naturally, more teams ran more experiments. And so the culture of experimentation at Spotify grew even more. By the time we turned off ABBA in 2020, we’d gone from running hundreds of experiments to running thousands of experiments per year across almost every aspect of our business.
That culture of experimentation is ingrained throughout our engineering organization and how we build features: not just in how we develop features for our apps, but also in how we improve backend services and data pipelines. This virtuous cycle of learning about what’s working and what isn’t (many teams testing and shipping, testing and shipping) is what we were able to unlock at scale using the internal experimentation platform we built for ourselves. And now we’re making a commercial version available to everyone.
We love platforms
This tune may be familiar if you’ve followed the evolution of Backstage, our homegrown developer portal, which we open sourced and donated to the CNCF three years ago. That, too, was a platform: a way to unlock the potential of many independent teams by bringing them together on a shared set of tooling and concepts in order to solve common problems.
As with Backstage, providing a great developer experience is key to the platform’s success: making sure the best way for your developers to do something is also the easiest and most supported way. As our own teams adopted this way of doing things, we’ve come to think of experimentation not as a tool our teams pick up and sometimes use, but as a capability they always possess.
That’s what we’re aiming to deliver with Confidence, the latest iteration of our Experimentation Platform. Scientific best practices are built right into the platform so that many different teams can run many different experiments reliably and quickly at scale.
An experimentation platform that scales with you
There is seldom a one-size-fits-all solution to experimentation. If you’re serious about using A/B testing to validate user behavior and working in a data-informed way, you need a platform that works across a wide range of needs and use cases. From usability to messaging to advertising to acquisition funnels and beyond, Confidence can help you find answers to questions of every shape and size.
Extensible and customizable
Throughout our journey, we’ve compared notes with other companies struggling to scale reliable experimentation practices within their organizations. Often these companies have outgrown their current A/B testing tools (whether purchased off-the-shelf or built internally) and are now seeking greater customization for how they run experiments.
But not every company is at the same point on this journey. Confidence is designed to bring value whether you’ve outgrown your current testing platform or are looking for a quick, easy way to get started with A/B testing that will scale with you as your needs change.
One platform, available three ways
To make it easier to fit your needs, the Confidence platform will be available to customers in three ways:
1. Managed service. Want to get up and running quickly and with the lowest technical overhead? Run the experimentation platform as a standalone web service managed by our team.
2. Backstage plugin. Already have a Backstage instance running (or want to get started)? Get all the features of Confidence as a plugin next to your other developer tools. This is how we run our experimentation platform at Spotify.
3. APIs. Need more customization? Want to build a bandit or do switchback testing? Integrate the Confidence platform into your own infrastructure with maximum flexibility and extensibility. Confidence will provide you with the capabilities to do what you need to do.
We believe these three options will make it easy to access and grow with the platform, no matter what your company’s needs are today or tomorrow.
Sign up for the beta
We’re really excited to share this new platform with you. Confidence is currently available for select customers in private beta. You can join the private beta waitlist on our website and join our mailing list on the same form to get updates on all things Confidence.
Appendix: Engineering better experimentation
Learn more about experimentation at Spotify, including a little light reading on automated salting and bucket reuse, choosing sequential testing frameworks, comparing quantiles at scale, and how we scale other scientific best practices across the org, all right here on the Spotify Engineering blog:
- Spotify’s New Experimentation Platform (Part 1): How we went from our first A/B testing tool, ABBA, to building EP, the internal experimentation platform we use today and that Confidence is based on. Learn about why we dropped feature “flags” in favor of “properties” for Remote Configuration, our move away from relying on notebooks to the Metrics Catalog for analyses, and how we manage and orchestrate experiments using the Experiment Planner.
- Spotify’s New Experimentation Platform (Part 2): More features of our internal platform, including: coordinating many experiments at once while preserving exclusivity and holdbacks, using our “salt machine” to automatically reshuffle users without the need to stop and restart experiments, the importance of setting up both success and guardrail metrics up front, and how validity checks and gradual rollouts further protect you from errors and unexpected regressions.
- Comparing Quantiles at Scale in Online A/B Testing: How we use the Poisson bootstrap algorithm and quantile estimators to easily calculate bootstrap confidence intervals for difference-in-quantiles in A/B tests with hundreds of millions of observations.
- Experimenting at Scale, the Spotify Home Way: How we use our internal Experimentation Platform to run over 250 tests a year on our Home screen alone, coordinating the work of dozens of teams, each inventing new kinds of personalized experiences for hundreds of millions of users.
- Experimenting with Machine Learning to Target In-App Messaging: We believed that we could use machine learning to determine who should receive in-app messages and that this more precise targeting would improve user experience without harming business metrics. To find out if our hypothesis was correct, we used uplift modeling to try to directly model the effect of in-app messaging on user behavior.
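For readers curious what a Poisson bootstrap for a difference in quantiles looks like, here is a small, naive sketch using only the Python standard library. It is not the method from the post listed above, which is built for hundreds of millions of observations; this version simply gives every observation an independent Poisson(1) weight on each resample and reads off a weighted quantile. The function names (`poisson1`, `weighted_quantile`, `quantile_diff_ci`) and all default parameters are assumptions made up for this example.

```python
import math
import random


def poisson1(rng):
    """Draw one Poisson(1) variate using Knuth's multiplication method."""
    limit, k, p = math.exp(-1.0), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def weighted_quantile(sorted_values, weights, q):
    """Smallest value whose cumulative weight reaches q of the total weight."""
    total = sum(weights)
    if total == 0:
        return None  # degenerate resample in which every weight was zero
    threshold = q * total
    running = 0
    for value, weight in zip(sorted_values, weights):
        running += weight
        if weight > 0 and running >= threshold:
            return value
    return sorted_values[-1]


def quantile_diff_ci(control, treatment, q=0.5, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap interval for quantile(treatment) - quantile(control)."""
    rng = random.Random(seed)
    c, t = sorted(control), sorted(treatment)
    diffs = []
    for _ in range(n_boot):
        # Poisson(1) weights stand in for "how many times was this row resampled".
        qc = weighted_quantile(c, [poisson1(rng) for _ in c], q)
        qt = weighted_quantile(t, [poisson1(rng) for _ in t], q)
        if qc is not None and qt is not None:
            diffs.append(qt - qc)
    diffs.sort()
    low = diffs[int((alpha / 2) * (len(diffs) - 1))]
    high = diffs[int((1 - alpha / 2) * (len(diffs) - 1))]
    return low, high


if __name__ == "__main__":
    data_rng = random.Random(1)
    control = [data_rng.gauss(0.0, 1.0) for _ in range(500)]
    treatment = [data_rng.gauss(1.0, 1.0) for _ in range(500)]
    print(quantile_diff_ci(control, treatment, q=0.5, n_boot=200))
```

Each observation's weight is drawn independently of the others, which is the property that makes Poisson resampling attractive for large or distributed data sets: no step has to know the total sample size before drawing weights.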
Hear about it from the people who lived it. Listen to Spotify’s experimentation journey on the NerdOut@Spotify podcast:
- Episode 20: The Rise and Fall of ABBA: Host Dave Zolotusky talks with Mark Grey, a senior staff engineer and 10-year Spotify veteran, about our very first A/B testing tool, ABBA, and early lessons about doing product experimentation at scale.
- Episode 21: The Man Who Killed ABBA: Dave and Mark are joined by another longtime Spotify engineer, Dima Kunin. They talk about why we replaced ABBA with Spotify’s current internal Experimentation Platform, how we built it, and how it enabled our teams to go from running hundreds of experiments to thousands.