Behind the Streams: Three Years Of Live at Netflix. Part 1.


By Sergey Fedorov, Chris Pham, Flavio Ribeiro, Chris Newton, and Wei Wei

Many great ideas at Netflix start with a question, and three years ago, we asked one of our boldest yet: if we were to entertain the world through Live (a format almost as old as television itself), how would we do it?

What started with an engineering plan to pave the path towards our first Live comedy special, Chris Rock: Selective Outrage, has since led to hundreds of Live events, ranging from the biggest comedy shows and NFL Christmas Games to record-breaking boxing fights and becoming the home of WWE.

In our series Behind the Streams, where we take you through the technical journey of our biggest bets, we will do a multi-part deep dive into the architecture of Live and what we learned while building it. Part one starts with the foundation we set for Live, and the critical decisions we made that influenced our approach.

While Live as a television format is not new, the streaming experience we intended to build required capabilities we didn't have at the time. Despite 15 years of on-demand streaming under our belt, Live introduced new considerations influencing architecture and technology choices:


References: 1. Content Pre-Positioning on Open Connect, 2. Load-Balancing Netflix Traffic at Global Scale

This means that we had a lot to build in order to make Live work well on Netflix. That starts with making the right choices regarding the fundamentals of our Live Architecture.

Our Live Technology needed to extend the same promise to members that we've made with on-demand streaming: great quality on as many devices as possible without interruptions. Live is one of many entertainment formats on Netflix, so we also needed to seamlessly blend Live events into the user experience, all while scaling to over 300 million global subscribers.

When we started, we had 9 months until the first launch. While we needed to execute quickly, we also wanted to architect for future growth in both the magnitude and multitude of events. As a key principle, we leveraged our unique position of building support for a single product, Netflix, and having control over the entire Live lifecycle, from Production to Screen.


Dedicated Broadcast Facilities to Ingest Live Content from Production

Live events can happen anywhere in the world, but not every location has Live facilities or great connectivity. To ensure secure and reliable live signal transport, we leverage distributed and highly connected broadcast operations centers, with specialized equipment for signal ingest and inspection, closed-captioning, graphics, and advertisement management. We prioritized repeatability, conditioning engineering to launch live events consistently, reliably, and cost-effectively, leaning into automation wherever possible. As a result, we have been able to reduce the event-specific setup to the transmission between production and the Broadcast Operations Center, reusing the rest across events.

Cloud-based Redundant Transcoding and Packaging Pipelines

The feed received at the Broadcast Center contains a fully produced program, but still needs to be encoded and packaged for streaming on devices. We chose a Cloud-based approach to allow for dynamic scaling, flexibility in configuration, and ease of integration with our Digital Rights Management (DRM), content management, and content delivery services already deployed in the cloud. We leverage AWS Elemental MediaConnect and AWS Elemental MediaLive to acquire feeds in the cloud and transcode them into various video quality levels with bitrates tailored per show. We built a custom packager to better integrate with our delivery and playback systems. We also built a custom Live Origin to ensure strict read and write SLAs for Live segments.
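To make the redundancy idea concrete, here is a minimal sketch (not our actual implementation) of one way an origin can consume redundant pipeline outputs: accept the first valid copy of each segment and ignore the duplicate, so the loss of a single pipeline is transparent to readers. All class and field names are illustrative.

```python
# A minimal, hypothetical stand-in for a Live Origin write path consuming
# two redundant transcoding/packaging pipelines.
import threading


class RedundantSegmentStore:
    def __init__(self):
        self._segments = {}          # (rendition, segment_index) -> payload bytes
        self._lock = threading.Lock()

    def write(self, rendition: str, segment_index: int, payload: bytes, pipeline: str) -> bool:
        """Store a segment; return False if the redundant pipeline already wrote it."""
        key = (rendition, segment_index)
        with self._lock:
            if key in self._segments:
                return False         # duplicate copy from the other pipeline, ignore
            self._segments[key] = payload
            return True

    def read(self, rendition: str, segment_index: int) -> bytes:
        """Serve a segment to the CDN; raises KeyError if neither pipeline delivered it."""
        return self._segments[(rendition, segment_index)]
```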

Scaling Live Content Delivery to Millions of Viewers with Open Connect CDN

In order for the produced media assets to be streamed, they need to be transferred from a few AWS regions, where Live Origin is deployed, to hundreds of millions of devices worldwide. We leverage Netflix's CDN, Open Connect, to scale Live asset delivery. Open Connect servers are positioned close to the viewers at over 6K locations and connected to AWS regions via a dedicated Open Connect Backbone network.


18K+ servers in 6K+ locations, in Internet Exchanges, or embedded into ISP networks

Open Connect Backbone connects servers in Internet Exchange locations to 5 AWS regions

By enabling Live delivery on Open Connect, we build on top of $1B+ in Netflix investments over the last 12 years focused on scaling the network and optimizing the performance of delivery servers. By sharing capacity across on-demand and Live viewership we improve utilization, and by caching past Live content on the same servers used for on-demand streaming, we can easily enable catch-up viewing.

Optimizing Live Playback for Device Compatibility, Scale, Quality, and Stability

To make Live accessible to the majority of our customers without upgrading their streaming devices, we settled on using HTTPS-based Live Streaming. While UDP-based protocols can provide extra features like ultra-low latency, HTTPS has ubiquitous support among devices and compatibility with delivery and encoding systems. Furthermore, we use AVC and HEVC video codecs, transcode with multiple quality levels from SD up to 4K, and use a 2-second segment duration to balance compression efficiency, infrastructure load, and latency. While prioritizing streaming quality and playback stability, we have also achieved industry-standard latency from camera to device, and continue to improve it.
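As an illustration of the encoding ladder described above, the sketch below shows a hypothetical set of renditions spanning AVC and HEVC from SD to 4K with a 2-second segment duration. The specific resolutions and bitrates are placeholders, not our production configuration, which is tailored per show.

```python
# Hypothetical per-show encoding ladder: quality levels, codecs, and bitrates.
from dataclasses import dataclass


@dataclass
class Rendition:
    codec: str        # "avc" or "hevc"
    resolution: str   # e.g. "1920x1080"
    bitrate_kbps: int


# Illustrative ladder from SD to 4K; real bitrates are tuned per show.
EXAMPLE_LADDER = [
    Rendition("avc",  "640x480",    800),
    Rendition("avc",  "1280x720",  2500),
    Rendition("hevc", "1920x1080", 4300),
    Rendition("hevc", "3840x2160", 12000),
]

# Shorter segments reduce latency but increase request rate and hurt
# compression efficiency; 2 seconds is the balance point discussed above.
SEGMENT_DURATION_SECONDS = 2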

To configure playback, the device player receives a playback manifest at play start. The manifest contains items like the encoding bitrates and CDN servers players should use. We deliver the manifest from the cloud instead of the CDN, as it allows us to personalize the configuration for each device. To reference segments of the stream, the manifest includes a segment template that is used by devices to map a wall-clock time to URLs on the CDN. Using a segment template vs. periodic polling for manifest updates minimizes network dependencies, CDN server load, and overhead on resource-constrained devices, like smart TVs, thus improving both scalability and stability of our system. While streaming, the player monitors network performance and dynamically chooses the bitrate and CDN server, maximizing streaming quality while minimizing rebuffering.
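The segment-template approach can be illustrated with a small sketch: given a template and stream start time from the manifest, the player computes the URL of the current segment from wall-clock time alone, with no manifest polling. The template format, host, and field names below are hypothetical.

```python
# A minimal sketch of mapping wall-clock time to a live segment URL.
import time

SEGMENT_DURATION_SECONDS = 2


def segment_url(template: str, stream_start_epoch: float, now_epoch: float | None = None) -> str:
    """Return the URL of the segment covering the given wall-clock time."""
    now = now_epoch if now_epoch is not None else time.time()
    segment_index = int((now - stream_start_epoch) // SEGMENT_DURATION_SECONDS)
    return template.format(segment=segment_index)


# Example with a hypothetical CDN host and template from the manifest:
template = "https://cdn.example.com/live/event123/video_1080p/{segment}.m4s"
print(segment_url(template, stream_start_epoch=1_700_000_000.0, now_epoch=1_700_000_123.0))
# -> .../61.m4s  (segment 61 covers seconds 122-124 of the stream)
```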

Run Discovery and Playback Control Services in the Cloud

So far, we have covered the streaming path from Camera to Device. To make the stream fully work, we also need to orchestrate across all systems and ensure viewers can find and start the Live event. This functionality is implemented by dozens of Cloud services, with capabilities like playback configuration, personalization, or metrics collection. These services tend to receive disproportionately higher loads around Live event start time, and Cloud deployment provides flexibility in dynamically scaling compute resources. Moreover, as Live demand tends to be localized, we are able to balance load across multiple AWS regions, better utilizing our global footprint. Deployment in the cloud also allows us to build a user experience where we embed Live content into a broader selection of entertainment options in the UI, like on-demand titles or Games.

Centralize Real-time Metrics in the Cloud with Specialized Tools and Facilities

With control over ingest, encoding pipelines, the Open Connect CDN, and device players, we have nearly end-to-end observability into the Live workflow. During Live, we collect system and user metrics in real time (e.g., where members see the title on Netflix and their quality of experience), alerting us to poor user experiences or degraded system performance. Our real-time monitoring is built using a combination of internally developed tools, such as Atlas, Mantis, and Lumen, and open-source technologies, such as Kafka and Druid, processing up to 38 million events per second during some of our largest live events while providing critical metrics and operational insights in a matter of seconds. Furthermore, we set up dedicated "Control Center" facilities, which bring key metrics together for the operational team that monitors the event in real time.
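For illustration, here is a minimal sketch of what emitting a quality-of-experience event into Kafka could look like. The topic name, event fields, broker address, and choice of the confluent_kafka client are assumptions made for this example, not our actual schema or tooling.

```python
# Hypothetical QoE event producer feeding a real-time metrics pipeline.
import json
import time

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})  # placeholder broker


def emit_qoe_event(device_id: str, title_id: str, rebuffer_ms: int, bitrate_kbps: int) -> None:
    event = {
        "ts": time.time(),
        "device_id": device_id,
        "title_id": title_id,
        "rebuffer_ms": rebuffer_ms,
        "bitrate_kbps": bitrate_kbps,
    }
    # Fire-and-forget for brevity; a real pipeline batches and handles delivery errors.
    producer.produce("live-qoe-events", value=json.dumps(event).encode("utf-8"))


emit_qoe_event("tv-123", "live-event-42", rebuffer_ms=0, bitrate_kbps=4300)
producer.flush()
```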

Building new functionality always brings fresh challenges and opportunities to learn, especially with a system as complex as Live. Even after three years, we are still learning every day how to deliver Live events more effectively. Here are a few key highlights:

Extensive testing: Prior to Live, we heavily relied on the predictable flow of on-demand traffic for pre-release canaries or A/B tests to validate deployments. But Live traffic was not always available, especially not at the scale representative of a big launch. As a result, we spent considerable effort to:

  1. Generate internal “test streams,” which engineers use to run integration, regression, or smoke tests as part of the development lifecycle.
  2. Build synthetic load testing capabilities to stress test cloud and CDN systems. We use two approaches, allowing us to generate up to 100K starts-per-second (see the sketch after this list):
    — Capture, modify, and replay past Live production traffic, representing a variety of user devices and request patterns.
    — Virtualize Netflix devices and generate traffic against CDN or Cloud endpoints to test the impact of the latest changes across all systems.
  3. Run automated failure injection, forcing missing or corrupted segments from the encoding pipeline, loss of a cloud region, network drops, or server timeouts.
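As a rough illustration of the load-generation idea in approach 2, the sketch below fires a configurable rate of synthetic playback starts against a test endpoint and counts server errors. The endpoint, payload, and single-process pacing are placeholders; real tests replay captured production traffic from a distributed fleet.

```python
# A toy starts-per-second load generator, for illustration only.
import asyncio

import aiohttp


async def start_playback(session: aiohttp.ClientSession, url: str) -> int:
    async with session.post(url, json={"title_id": "live-event-42"}) as resp:
        return resp.status


async def generate_starts(url: str, starts_per_second: int, duration_s: int) -> None:
    async with aiohttp.ClientSession() as session:
        for _ in range(duration_s):
            tasks = [start_playback(session, url) for _ in range(starts_per_second)]
            statuses = await asyncio.gather(*tasks, return_exceptions=True)
            errors = sum(1 for s in statuses if isinstance(s, Exception) or s >= 500)
            print(f"sent={starts_per_second} errors={errors}")
            await asyncio.sleep(1)  # coarse pacing; real tools pace far more precisely


# asyncio.run(generate_starts("https://test.example.com/playback/start", 100, 10))
```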

Regular practice: Despite rigorous pre-release testing, nothing beats a production environment, especially when operating at scale. We found that having a regular schedule with diverse Live content is essential to making improvements while balancing the risks of member impact. We run A/B tests, perform chaos testing and operational exercises, and train operational teams for upcoming launches.

Viewership predictions: We use prediction-based approaches to pre-provision Cloud and CDN capacity, and share forecasts with our ISP and Cloud partners ahead of time so they can plan network and compute resources. Then we complement them with reactive scaling of cloud systems powering sign-up, log-in, title discovery, and playback services to account for viewership exceeding our predictions. We have found success with forward-looking real-time viewership predictions during a live event, allowing us to take steps to mitigate risks earlier, before more members are impacted.
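The interplay between forecast-based provisioning and reactive scaling can be sketched as follows; the capacity model, per-instance viewer count, and headroom factor are purely illustrative assumptions.

```python
# A toy reconciliation loop: never scale below the forecast plan, but scale
# up when observed viewership trends above it.
def needed_instances(concurrent_viewers: int, viewers_per_instance: int = 5000,
                     headroom: float = 1.3) -> int:
    """Capacity target with headroom for bursts (illustrative model)."""
    return max(1, int(concurrent_viewers * headroom / viewers_per_instance) + 1)


def reconcile(forecast_viewers: int, observed_viewers: int, current_instances: int) -> int:
    """Return the instance count to scale to."""
    planned = needed_instances(forecast_viewers)
    reactive = needed_instances(observed_viewers)
    target = max(planned, reactive)
    if target > current_instances:
        print(f"scale up: {current_instances} -> {target}")
    return target


# Example: viewership running 40% above forecast triggers a scale-up.
print(reconcile(forecast_viewers=1_000_000, observed_viewers=1_400_000, current_instances=261))
```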

Graceful degradation: Despite our best efforts, we can (and did!) find ourselves in a situation where viewership exceeded our predictions and provisioned capacity. For this case, we developed a number of levers to continue streaming, even if it means gradually removing some nice-to-have features. For example, we use service-level prioritized load shedding to prioritize live traffic over non-critical traffic (like pre-fetch). Beyond that, we can lighten the experience, like dialing down personalization, disabling bookmarks, or lowering the maximum streaming quality. Our load tests include scenarios where we under-scale systems to validate the desired behavior.
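A minimal sketch of service-level prioritized load shedding, with priority tiers and utilization thresholds that are illustrative assumptions rather than our production values:

```python
# Shed lower-priority traffic classes first as utilization rises.
from enum import IntEnum


class Priority(IntEnum):
    LIVE_PLAYBACK = 0       # most critical, shed last
    ON_DEMAND_PLAYBACK = 1
    PERSONALIZATION = 2
    PREFETCH = 3            # nice-to-have, shed first


def should_shed(priority: Priority, utilization: float) -> bool:
    """Return True if requests of this class should be rejected right now."""
    shed_above = {
        Priority.PREFETCH: 0.70,
        Priority.PERSONALIZATION: 0.85,
        Priority.ON_DEMAND_PLAYBACK: 0.95,
        Priority.LIVE_PLAYBACK: 1.00,   # only shed live traffic as a last resort
    }
    return utilization >= shed_above[priority]


# At 88% utilization, pre-fetch and personalization are shed; playback is not.
for p in Priority:
    print(p.name, "shed" if should_shed(p, 0.88) else "serve")
```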

Retry storms: When systems reach capacity, our key focus is to avoid cascading issues or further overloading systems with retries. Beyond device retries, users may retry manually; we've seen a 10x increase in traffic load due to stream restarts after viewing interruptions of as little as 30 seconds. We spent considerable time understanding device retry behavior in the presence of issues like network timeouts or missing segments. As a result, we implemented strategies like server-guided backoff for device retries, absorbing spikes via prioritized traffic shedding at the Cloud Edge Gateway, and re-balancing traffic between cloud regions.
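Server-guided backoff can be sketched as a client that prefers the server's retry hint over its own schedule and adds jitter so clients don't retry in lockstep. The HTTP client and the use of the standard Retry-After header here are assumptions for illustration, not our actual protocol.

```python
# A minimal sketch of a retry loop that honors a server-provided backoff hint.
import random
import time

import requests


def fetch_with_server_guided_backoff(url: str, max_attempts: int = 5) -> requests.Response:
    delay = 1.0  # fallback delay if the server gives no guidance
    for _ in range(max_attempts):
        resp = requests.get(url, timeout=5)
        if resp.status_code < 500 and resp.status_code != 429:
            return resp
        # Overloaded: prefer the server's hint over local exponential backoff.
        server_hint = resp.headers.get("Retry-After")
        delay = float(server_hint) if server_hint else min(delay * 2, 30.0)
        time.sleep(delay * random.uniform(0.5, 1.5))  # jitter spreads out retries
    return resp
```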

Contingency planning: "Everyone has a plan until they get punched in the mouth" is very relevant for Live. When something breaks, there is almost no time for troubleshooting. For big events, we set up in-person launch rooms with engineering owners of critical systems. For quick detection and response, we developed a small set of metrics as early indicators of issues, and have extensive runbooks for common operational issues. We don't learn on launch day; instead, launch teams practice failure response via Game Day exercises ahead of time. Finally, our runbooks extend beyond engineering, covering escalation to executive leadership and coordination across functions like Customer Service, Production, Communications, or Social.

Our commitment to improving the member experience doesn't end at the "Thanks for Watching!" screen. Shortly after every live stream, we dive into metrics to identify areas for improvement. Our Data & Insights team conducts comprehensive analyses, A/B tests, and consumer research to ensure the next event is even more enjoyable for our members. We leverage insights on member behavior, preferences, and expectations to refine the Netflix product experience and optimize our Live experience, like reducing latency by ~10 seconds through A/B tests, without affecting quality or stability.

Despite three years of effort, we are far from done! In fact, we are just getting started, actively building on the learnings shared above to deliver more joy to our members with Live events. To support the growing number of Live titles and new formats, like the FIFA WWC in 2027, we keep building out our broadcast and delivery infrastructure and are actively working to further improve the Live experience.

In this post, we've provided a broad overview and have barely scratched the surface. In upcoming posts, we will dive deeper into key pillars of our Live systems, covering our encoding, delivery, playback, and user experience investments in more detail.

Getting this far wouldn't have been possible without the hard work of dozens of teams across Netflix, who collaborate closely to design, build, and operate Live systems: Operations and Reliability, Encoding Technologies, Content Delivery, Device Playback, Streaming Algorithms, UI Engineering, Search and Discovery, Messaging, Content Promotion and Distribution, Data Platform, Cloud Infrastructure, Tooling and Productivity, Program Management, Data Science & Engineering, Product Management, Globalization, Consumer Insights, Ads, Security, Payments, Live Production, Experience and Design, Product Marketing and Customer Service, among many others.
