Shyam Gala, Javier Fernandez-Ivern, Anup Rokkam Pratap, Devang Shah
Hundreds of millions of customers tune into Netflix every day, expecting an uninterrupted and immersive streaming experience. Behind the scenes, a myriad of systems and services are involved in orchestrating the product experience. These backend systems are consistently being evolved and optimized to meet and exceed customer and product expectations.
When undertaking system migrations, one of the main challenges is establishing confidence and seamlessly transitioning the traffic to the upgraded architecture without adversely impacting the customer experience. This blog series will examine the tools, techniques, and strategies we have utilized to achieve this goal.
The backend for the streaming product uses a highly distributed microservices architecture; hence these migrations also happen at different points of the service call graph. They can happen on an edge API system serving customer devices, between the edge and mid-tier services, or from mid-tiers to data stores. Another relevant factor is that the migration could be happening on APIs that are stateless and idempotent, or it could be happening on stateful APIs.
We have categorized the tools and techniques we have used to facilitate these migrations into two high-level phases. The first phase involves validating functional correctness, scalability, and performance concerns and ensuring the new systems' resilience before the migration. The second phase involves migrating the traffic over to the new systems in a manner that mitigates the risk of incidents while continually monitoring and confirming that we are meeting critical metrics tracked at multiple levels. These include Quality-of-Experience (QoE) measurements at the customer device level, Service-Level-Agreements (SLAs), and business-level Key-Performance-Indicators (KPIs).
This blog post will provide a detailed analysis of replay traffic testing, a versatile technique we have applied in the preliminary validation phase for multiple migration initiatives. In a follow-up blog post, we will focus on the second phase and look deeper at some of the tactical steps that we use to migrate the traffic over in a controlled manner.
Replay traffic refers to production traffic that is cloned and forked over to a different path in the service call graph, allowing us to exercise new/updated systems in a manner that simulates actual production conditions. In this testing strategy, we execute a copy (replay) of production traffic against a system's existing and new versions to perform relevant validations. This approach has a handful of benefits.
- Replay traffic testing enables sandboxed testing at scale without significantly impacting production traffic or the user experience.
- Utilizing cloned real traffic, we can exercise the diversity of inputs from a wide range of devices and device application software versions in production. This is particularly important for complex APIs that have many high-cardinality inputs. Replay traffic provides the reach and coverage required to test the ability of the system to handle infrequently used input combinations and edge cases.
- This technique facilitates validation on multiple fronts. It allows us to assert functional correctness and provides a mechanism to load test the system and tune the system and scaling parameters for optimal functioning.
- By simulating a real production environment, we can characterize system performance over an extended period while considering expected and unexpected traffic pattern shifts. It provides a good read on availability and latency ranges under different production conditions.
- It provides a platform to ensure that relevant operational insights, metrics, logging, and alerting are in place before migration.
Replay Solution
The replay traffic testing solution comprises two essential components.
- Traffic Duplication and Correlation: The initial step requires implementing a mechanism to clone and fork production traffic to the newly established pathway, along with a process to record and correlate responses from the original and alternative routes.
- Comparative Analysis and Reporting: Following traffic duplication and correlation, we need a framework to compare and analyze the responses recorded from the two paths and produce a comprehensive report for the analysis.
We have tried different approaches for the traffic duplication and recording step across various migrations, making improvements along the way. These include options where replay traffic generation is orchestrated on the device, on the server, and via a dedicated service. We will examine these solutions in the upcoming sections.
Device Driven
In this option, the device makes a request on both the production path and the replay path, then discards the response on the replay path. These requests are executed in parallel to minimize any potential delay on the production path. The selection of the replay path on the backend can be driven by the URL the device uses when making the request or by using specific request parameters in routing logic at the appropriate layer of the service call graph. The device also includes a unique identifier with identical values on both paths, which is used to correlate the production and replay responses. The responses can be recorded at the most optimal location in the service call graph or by the device itself, depending on the particular migration.
The device-driven approach's obvious downside is that we are wasting device resources. There is also a risk of impact on device QoE, especially on low-resource devices. Adding forking logic and complexity to the device code can create dependencies on device application release cycles, which generally run at a slower cadence than service release cycles, leading to bottlenecks in the migration. Moreover, allowing the device to execute untested server-side code paths can inadvertently expose an attack surface for potential misuse.
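To make the mechanics concrete, here is a minimal sketch of what device-side forking could look like, assuming hypothetical endpoints where the replay path is selected by URL and a correlation identifier is sent on both requests; it is an illustration, not the actual client code.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.UUID;

public class DeviceDrivenReplay {

    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    // Hypothetical endpoints: the replay path is selected via the URL.
    private static final String PRODUCTION_URL = "https://api.example.com/playback/manifest";
    private static final String REPLAY_URL = "https://api.example.com/replay/playback/manifest";

    public static String fetchManifest(String titleId) throws Exception {
        // A shared correlation id lets the backend join the two responses later.
        String correlationId = UUID.randomUUID().toString();

        HttpRequest productionRequest = HttpRequest.newBuilder(
                        URI.create(PRODUCTION_URL + "?titleId=" + titleId))
                .header("X-Correlation-Id", correlationId)
                .GET().build();
        HttpRequest replayRequest = HttpRequest.newBuilder(
                        URI.create(REPLAY_URL + "?titleId=" + titleId))
                .header("X-Correlation-Id", correlationId)
                .GET().build();

        // Fire the replay request in parallel and discard its response;
        // only the production response is used by the device.
        CLIENT.sendAsync(replayRequest, HttpResponse.BodyHandlers.discarding());
        HttpResponse<String> productionResponse =
                CLIENT.send(productionRequest, HttpResponse.BodyHandlers.ofString());
        return productionResponse.body();
    }
}
```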
Server Driven
To address the concerns of the device-driven approach, the other option we have used is to handle the replay concerns entirely on the backend. The replay traffic is cloned and forked in the appropriate service upstream of the migrated service. The upstream service calls the existing and new replacement services concurrently to minimize any latency increase on the production path. The upstream service records the responses on the two paths along with an identifier with a common value that is used to correlate the responses. This recording operation is also done asynchronously to minimize any impact on latency on the production path.
The server-driven approach's benefit is that the entire complexity of the replay logic is encapsulated in the backend, and no device resources are wasted. Also, since this logic resides on the server side, we can iterate on any required changes faster. However, we are still inserting the replay-related logic alongside the production code that handles business logic, which can result in unnecessary coupling and complexity. There is also an increased risk that bugs in the replay logic could impact production code and metrics.
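A simplified sketch of the forking an upstream service might do is shown below; the `ServiceClient` and `ResponseRecorder` interfaces are hypothetical stand-ins, and the key point is that the replay call and the recording both run asynchronously so the production path only waits on the existing service.

```java
import java.util.UUID;
import java.util.concurrent.CompletableFuture;

public class ServerDrivenReplayFilter {

    private final ServiceClient currentClient;       // existing service
    private final ServiceClient replacementClient;   // new service under test
    private final ResponseRecorder responseRecorder; // async sink for both responses

    public ServerDrivenReplayFilter(ServiceClient currentClient,
                                    ServiceClient replacementClient,
                                    ResponseRecorder responseRecorder) {
        this.currentClient = currentClient;
        this.replacementClient = replacementClient;
        this.responseRecorder = responseRecorder;
    }

    public Response handle(Request request) {
        String correlationId = UUID.randomUUID().toString();

        // Fire the replay call without blocking the production path.
        CompletableFuture
                .supplyAsync(() -> replacementClient.call(request))
                .thenAccept(replayResponse ->
                        responseRecorder.record(correlationId, "replay", replayResponse));

        // The production path only waits for the existing service;
        // its response is recorded asynchronously as well.
        Response productionResponse = currentClient.call(request);
        CompletableFuture.runAsync(() ->
                responseRecorder.record(correlationId, "production", productionResponse));
        return productionResponse;
    }

    // Hypothetical collaborator types, for illustration only.
    interface ServiceClient { Response call(Request request); }
    interface ResponseRecorder { void record(String correlationId, String path, Response response); }
    record Request(String payload) {}
    record Response(String payload) {}
}
```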
Dedicated Service
The latest approach we have used is to completely isolate all components of replay traffic into a separate dedicated service. In this approach, we record the requests and responses for the service that needs to be updated or replaced to an offline event stream asynchronously. Quite often, this logging of requests and responses is already happening for operational insights. Subsequently, we use Mantis, a distributed stream processor, to capture these requests and responses and replay the requests against the new service or cluster while making any required adjustments to the requests. After replaying the requests, this dedicated service also records the responses from the production and replay paths for offline analysis.
This approach centralizes the replay logic in an isolated, dedicated code base. Apart from not consuming device resources and not impacting device QoE, this approach also reduces any coupling between the production business logic and replay traffic logic on the backend. It also decouples any updates to the replay framework from device and service release cycles.
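The core loop of such a dedicated service could look roughly like the sketch below. This is not the Mantis API; it assumes a plain in-memory stream of logged events and hypothetical client and storage interfaces purely to illustrate the replay-and-record flow.

```java
import java.util.stream.Stream;

public class DedicatedReplayService {

    private final ServiceClient newServiceClient; // client for the new service/cluster
    private final ResponseStore responseStore;    // sink for offline analysis

    public DedicatedReplayService(ServiceClient newServiceClient, ResponseStore responseStore) {
        this.newServiceClient = newServiceClient;
        this.responseStore = responseStore;
    }

    // Each event carries the original request, the production response that was
    // already logged for operational insights, and a correlation id.
    public void replay(Stream<LoggedEvent> eventStream) {
        eventStream.forEach(event -> {
            // Adjust the request as needed for the new service (e.g. routing headers).
            Request adjusted = adjustForNewService(event.request());
            Response replayResponse = newServiceClient.call(adjusted);

            // Store both responses, keyed by the correlation id, for offline comparison.
            responseStore.save(event.correlationId(), event.productionResponse(), replayResponse);
        });
    }

    private Request adjustForNewService(Request request) {
        return request; // placeholder: request rewriting is migration-specific
    }

    // Hypothetical types, for illustration only.
    interface ServiceClient { Response call(Request request); }
    interface ResponseStore { void save(String correlationId, Response production, Response replay); }
    record LoggedEvent(String correlationId, Request request, Response productionResponse) {}
    record Request(String payload) {}
    record Response(String payload) {}
}
```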
Analyzing Replay Traffic
Once we have run replay traffic and recorded a statistically significant volume of responses, we are ready for the comparative analysis and reporting phase of replay traffic testing. Given the scale of the data generated by replay traffic, we record the responses from the two sides to a cost-effective cold storage facility using technology like Apache Iceberg. We can then create offline distributed batch processing jobs to correlate and compare the responses across the production and replay paths and generate detailed reports on the analysis.
Normalization
Depending on the nature of the system being migrated, the responses might need some preprocessing before being compared. For example, if some fields in the responses are timestamps, those will differ. Similarly, if there are unsorted lists in the responses, it might be best to sort them before comparing. In certain migration scenarios, there may be intentional alterations to the response generated by the updated service or component. For instance, a field that was a list in the original path might be represented as key-value pairs in the new path. In such cases, we can apply specific transformations to the response on the replay path to simulate the expected changes. Based on the system and the associated responses, there might be other specific normalizations that we apply before comparing the responses.
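A minimal sketch of this kind of normalization is shown below, operating on a generic map representation of a response; the field names `generatedAt` and `streams` are purely illustrative, and migration-specific transformations (such as a list-to-map conversion) would be applied in the same place.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ResponseNormalizer {

    // Normalize a response (represented here as a simple map) before comparison.
    public static Map<String, Object> normalize(Map<String, Object> response) {
        Map<String, Object> normalized = new HashMap<>(response);

        // Timestamps will always differ between the two paths, so drop them.
        normalized.remove("generatedAt");

        // Sort unordered lists so element order does not cause spurious mismatches.
        Object streams = normalized.get("streams");
        if (streams instanceof List<?> list) {
            List<String> sorted = new ArrayList<>(list.stream().map(Object::toString).toList());
            sorted.sort(String::compareTo);
            normalized.put("streams", sorted);
        }

        // Intentional schema changes on the new path would be simulated here as well.
        return normalized;
    }
}
```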
Comparison
After normalizing, we diff the responses on the two sides and check whether we have matching or mismatching responses. The batch job creates a high-level summary that captures key comparison metrics. These include the total number of responses on both sides, the count of responses joined by the correlation identifier, and the number of matches and mismatches. The summary also records the number of passing/failing responses on each path. This summary provides an excellent high-level view of the analysis and the overall match rate across the production and replay paths. Additionally, for mismatches, we record the normalized and unnormalized responses from both sides to another big data table along with other relevant parameters, such as the diff. We use this additional logging to debug and identify the root cause of the issues driving the mismatches. Once we discover and address those issues, we can iteratively use the replay testing process to bring the mismatch percentage down to an acceptable number.
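As a rough illustration of the comparison step, the sketch below assumes the records from each path have already been loaded and normalized; the summary fields mirror the metrics described above, and the types are simplified stand-ins rather than the actual batch job.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

public class ReplayComparisonJob {

    record ResponseRecord(String correlationId, Map<String, Object> normalizedResponse) {}
    record Summary(long productionTotal, long replayTotal, long joined, long matches, long mismatches) {}

    public static Summary compare(List<ResponseRecord> production, List<ResponseRecord> replay) {
        Map<String, ResponseRecord> replayById = new HashMap<>();
        replay.forEach(r -> replayById.put(r.correlationId(), r));

        long joined = 0, matches = 0, mismatches = 0;
        for (ResponseRecord prod : production) {
            ResponseRecord rep = replayById.get(prod.correlationId());
            if (rep == null) {
                continue; // no replay response was recorded for this request
            }
            joined++;
            if (Objects.equals(prod.normalizedResponse(), rep.normalizedResponse())) {
                matches++;
            } else {
                mismatches++;
                // In practice, the diff plus the normalized and unnormalized responses
                // would be written to another table for root-cause analysis.
            }
        }
        return new Summary(production.size(), replay.size(), joined, matches, mismatches);
    }
}
```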
Lineage
When comparing responses, a common source of noise arises from the use of non-deterministic or non-idempotent dependency data for generating responses on the production and replay paths. For instance, envision a response payload that delivers media streams for a playback session. The service responsible for generating this payload consults a metadata service that provides all available streams for the given title. Various factors can lead to the addition or removal of streams, such as identifying issues with a particular stream, incorporating support for a new language, or introducing a new encode. Consequently, there is a potential for discrepancies in the sets of streams used to determine the payloads on the production and replay paths, resulting in divergent responses.
A comprehensive summary of data versions or checksums for all dependencies involved in generating a response, referred to as a lineage, is compiled to address this challenge. Discrepancies can be identified and discarded by comparing the lineage of the production and replay responses in the automated jobs analyzing the responses. This approach mitigates the impact of noise and ensures accurate and reliable comparisons between production and replay responses.
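One way to fold such a lineage check into the comparison is sketched below, assuming each recorded response carries a map of dependency names to version or checksum values; the shape of the lineage data is an assumption made for illustration.

```java
import java.util.Map;

public class LineageFilter {

    // A lineage maps each dependency (e.g. "stream-metadata") to the version
    // or checksum of the data that was used to build the response.
    public static boolean lineagesMatch(Map<String, String> productionLineage,
                                        Map<String, String> replayLineage) {
        return productionLineage.equals(replayLineage);
    }

    // Only count a mismatch when both paths saw the same dependency data;
    // otherwise the pair is discarded as noise rather than a regression.
    public static boolean isActionableMismatch(Map<String, String> productionLineage,
                                               Map<String, String> replayLineage,
                                               boolean responsesMatch) {
        return !responsesMatch && lineagesMatch(productionLineage, replayLineage);
    }
}
```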
Comparing Live Traffic
An alternative to recording responses and performing the comparison offline is to perform a live comparison. In this approach, we fork the replay traffic on the upstream service as described in the `Server Driven` section. The service that forks and clones the replay traffic directly compares the responses on the production and replay paths and records the relevant metrics. This option is feasible if the response payload isn't very complex, such that the comparison doesn't significantly increase latencies, or if the services being migrated are not on the critical path. Logging is selective to cases where the old and new responses do not match.
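A rough sketch of this live-comparison variant, building on the earlier server-driven example, is shown below; `Metrics` and `MismatchLog` are hypothetical stand-ins for whatever metrics and logging facilities the service already uses.

```java
import java.util.Objects;
import java.util.concurrent.CompletableFuture;

public class LiveComparisonFilter {

    private final Metrics metrics;          // e.g. counters for match/mismatch
    private final MismatchLog mismatchLog;  // selective logging, mismatches only

    public LiveComparisonFilter(Metrics metrics, MismatchLog mismatchLog) {
        this.metrics = metrics;
        this.mismatchLog = mismatchLog;
    }

    public void compareAsync(String correlationId,
                             CompletableFuture<String> productionResponse,
                             CompletableFuture<String> replayResponse) {
        // Compare off the production path once both responses have arrived.
        productionResponse.thenCombine(replayResponse, (prod, replay) -> {
            if (Objects.equals(prod, replay)) {
                metrics.increment("replay.match");
            } else {
                metrics.increment("replay.mismatch");
                mismatchLog.log(correlationId, prod, replay); // only mismatches are logged
            }
            return null;
        });
    }

    // Hypothetical types, for illustration only.
    interface Metrics { void increment(String counter); }
    interface MismatchLog { void log(String correlationId, String production, String replay); }
}
```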
Load Testing
Besides functional testing, replay traffic allows us to stress test the updated system components. We can adjust the load on the replay path by controlling the amount of traffic being replayed and the new service's horizontal and vertical scale factors. This approach allows us to evaluate the performance of the new services under different traffic conditions. We can see how availability, latency, and other system performance metrics, such as CPU consumption, memory consumption, garbage collection rate, etc., change as the load factor changes. Load testing the system using this technique allows us to identify performance hotspots using actual production traffic profiles. It helps expose memory leaks, deadlocks, caching issues, and other system issues. It enables the tuning of thread pools, connection pools, connection timeouts, and other configuration parameters. Further, it helps in the determination of reasonable scaling policies and estimates for the associated cost and the broader cost/risk tradeoff.
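One simple way to dial the replay load up or down is to sample the cloned traffic at a configurable rate before replaying it; the sketch below assumes a basic percentage-based sampler rather than any particular framework or the mechanism we actually use.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicInteger;

public class ReplayLoadController {

    // Percentage (0-100) of cloned production requests that are actually replayed.
    private final AtomicInteger replayPercentage = new AtomicInteger(100);

    public void setReplayPercentage(int percentage) {
        replayPercentage.set(Math.max(0, Math.min(100, percentage)));
    }

    // Decide whether a given cloned request should be replayed under the current load factor.
    public boolean shouldReplay() {
        return ThreadLocalRandom.current().nextInt(100) < replayPercentage.get();
    }
}
```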
Stateful Systems
So far, we have used replay testing to build confidence in migrations involving stateless and idempotent systems. Replay testing can also validate migrations involving stateful systems, although additional measures must be taken. The production and replay paths must have distinct and isolated data stores that are in identical states before the replay of traffic is enabled. Additionally, all the different request types that drive the state machine must be replayed. In the recording step, apart from the responses, we also want to capture the state associated with each specific response. Correspondingly, in the analysis phase, we want to compare both the response and the related state in the state machine. Given the overall complexity of using replay testing with stateful systems, we have employed other techniques in such scenarios. We will look at one of them in the follow-up blog post in this series.
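For stateful migrations, the recording step would capture a snapshot of the relevant state alongside each response, and the analysis step would compare both; the minimal sketch below is a hypothetical illustration of that pairing, not our actual tooling.

```java
import java.util.Map;
import java.util.Objects;

public class StatefulReplayComparison {

    // For stateful systems we record the response and a snapshot of the state
    // it left behind (e.g. rows keyed by entity id) on each path.
    record Observation(String correlationId, String response, Map<String, String> stateSnapshot) {}

    public static boolean matches(Observation production, Observation replay) {
        // Both the response and the resulting state must agree for a match.
        return Objects.equals(production.response(), replay.response())
                && Objects.equals(production.stateSnapshot(), replay.stateSnapshot());
    }
}
```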
We have adopted replay traffic testing at Netflix for numerous migration initiatives. A recent example involved leveraging replay testing to validate an extensive re-architecture of the edge APIs that drive the playback component of our product. Another instance involved migrating a mid-tier service from REST to gRPC. In both cases, replay testing facilitated comprehensive functional testing, load testing, and system tuning at scale using real production traffic. This approach enabled us to identify elusive issues and rapidly build confidence in these substantial redesigns.
Upon concluding replay testing, we are ready to start introducing these changes in production. In an upcoming blog post, we will look at some of the techniques we use to roll out significant changes to the system to production in a gradual, risk-controlled manner while building confidence via metrics at different levels.