Netflix operates a highly efficient cloud computing infrastructure that supports a wide array of applications essential for our SVOD (Subscription Video on Demand), live streaming, and gaming services. Utilizing Amazon AWS, our infrastructure is hosted across multiple geographic regions worldwide. This global distribution allows our applications to deliver content more effectively by serving traffic closer to our customers. Like any distributed system, our applications occasionally require data synchronization between regions to maintain seamless service delivery.
The following diagram shows a simplified cloud network topology for cross-region traffic.
Our Cloud Network Engineering on-call team received a request to address a network issue affecting an application with cross-region traffic. Initially, it appeared that the application was experiencing timeouts, most likely due to suboptimal network performance. As we all know, the longer the network path, the more devices the packets traverse, increasing the likelihood of problems. For this incident, the client application is located in an internal subnet in the US region while the server application is located in an external subnet in a European region. Therefore, it is natural to blame the network, since packets have to travel long distances through the internet.
As network engineers, our initial reaction when the network is blamed is usually, "No, it can't be the network," and our task is to prove it. Given that there had been no recent changes to the network infrastructure and no reported AWS issues impacting other applications, the on-call engineer suspected a noisy neighbor issue and sought help from the Host Network Engineering team.
In this context, a noisy neighbor issue occurs when a container shares a host with other network-intensive containers. These noisy neighbors consume excessive network resources, causing other containers on the same host to suffer from degraded network performance. Even though each container has bandwidth limitations, oversubscription can still lead to such issues.
Upon investigating other containers on the same host, most of which were part of the same application, we quickly eliminated the possibility of noisy neighbors. The network throughput for both the problematic container and all others was significantly below the set bandwidth limits. We attempted to resolve the issue by removing these bandwidth limits, allowing the application to use as much bandwidth as necessary. However, the problem persisted.
We observed some TCP packets in the network marked with the RST flag, which indicates that a connection should be immediately terminated. Although the frequency of these packets was not alarmingly high, the presence of any RST packets still raised suspicion on the network. To determine whether this was indeed a network-induced issue, we performed a tcpdump on the client. In the packet capture file, we found one TCP stream that was closed after exactly 30 seconds.
SYN at 18:47:06
After the 3-way handshake (SYN, SYN-ACK, ACK), the traffic started flowing normally. Nothing strange until FIN at 18:47:36 (30 seconds later).
The packet capture results clearly indicated that it was the client application that initiated the connection termination by sending a FIN packet. Following this, the server continued to send data; however, since the client had already decided to close the connection, it responded with RST packets to all subsequent data from the server.
To make sure the client wasn't closing the connection because of packet loss, we also performed a packet capture on the server side to verify that all packets sent by the server were received. This task was complicated by the fact that the packets passed through a NAT gateway (NGW), which meant that on the server side, the client's IP and port appeared as those of the NGW, differing from those seen on the client side. Consequently, to accurately match TCP streams, we needed to identify the TCP stream on the client side, locate the raw TCP sequence number, and then use this number as a filter on the server side to find the corresponding TCP stream.
With packet capture results from both the client and server sides, we confirmed that all packets sent by the server were correctly received before the client sent a FIN.
Now, from the network point of view, the story is clear. The client initiated the connection requesting data from the server. The server kept sending data to the client without any problem. However, at a certain point, despite the server still having data to send, the client chose to terminate the reception of data. This led us to suspect that the issue might be related to the client application itself.
In order to fully understand the problem, we now need to understand how the application works. As shown in the diagram below, the application runs in the us-east-1 region. It reads data from cross-region servers and writes the data to clients within the same region. The client runs as containers, whereas the servers are EC2 instances.
Notably, the cross-region read was the problematic path while the write path was smooth. Most importantly, there is a 30-second application-level timeout for reading the data. The application (client) errors out if it fails to read an initial batch of data from the servers within 30 seconds. When we increased this timeout to 60 seconds, everything worked as expected. This explains why the client initiated a FIN: it lost patience waiting for the server to transfer the data.
Could it be that the server was updated to send data more slowly? Could it be that the client application was updated to receive data more slowly? Could it be that the data volume became too large to be completely sent out within 30 seconds? Sadly, we received negative answers to all three questions from the application owner. The server had been running without changes for over a year, there were no significant updates in the latest rollout of the client, and the data volume had remained consistent.
If neither the network nor the application had changed recently, then what had? In fact, we discovered that the issue coincided with a recent Linux kernel upgrade from version 6.5.13 to 6.6.10. To test this hypothesis, we rolled back the kernel upgrade, and doing so restored normal operation to the application.
Honestly speaking, at that time I didn't believe it was a kernel bug, because I assumed the TCP implementation in the kernel should be solid and stable (spoiler alert: how wrong I was!). But we were also out of ideas from other angles.
There were about 14k commits between the good and bad kernel versions. Engineers on the team methodically and diligently bisected between the two versions. When the bisecting narrowed down to a few commits, a change with "tcp" in its commit message caught our attention. The final bisecting confirmed that this commit was our culprit.
Interestingly, while reviewing the email history related to this commit, we found that another user had reported a Python test failure following the same kernel upgrade. Although their solution was not directly applicable to our situation, it suggested that a simpler test might also reproduce our problem. Using strace, we observed that the application configured the following socket options when communicating with the server:
[pid 1699] setsockopt(917, SOL_IPV6, IPV6_V6ONLY, [0], 4) = 0
[pid 1699] setsockopt(917, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
[pid 1699] setsockopt(917, SOL_SOCKET, SO_SNDBUF, [131072], 4) = 0
[pid 1699] setsockopt(917, SOL_SOCKET, SO_RCVBUF, [65536], 4) = 0
[pid 1699] setsockopt(917, SOL_TCP, TCP_NODELAY, [1], 4) = 0
We then developed a minimal client-server C application that transfers a file from the server to the client, with the client configuring the same set of socket options. During testing, we used a 10M file, which represents the volume of data typically transferred within 30 seconds before the client issues a FIN. On the old kernel, this cross-region transfer completed in 22 seconds, whereas on the new kernel, it took 39 seconds to finish.
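For illustration, here is a minimal sketch of what the client side of such a reproduction could look like, assuming a placeholder server address and port. It applies the same socket options seen in the strace output (the IPV6_V6ONLY call is omitted because this sketch uses an IPv4 socket) and simply times how long it takes to read everything the server sends. This is not our exact test program, just a sketch along the same lines.

/* Hypothetical reproduction client: connects to SERVER_IP:SERVER_PORT,
 * applies the socket options seen in the strace output, and times how
 * long it takes to read everything the server sends. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

#define SERVER_IP   "192.0.2.10"   /* placeholder address */
#define SERVER_PORT 9000           /* placeholder port */

int main(void)
{
    int one = 1, sndbuf = 131072, rcvbuf = 65536;
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    /* Same options the application set via setsockopt(2). */
    setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &one, sizeof(one));
    setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf));
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf));
    setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));

    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(SERVER_PORT);
    inet_pton(AF_INET, SERVER_IP, &addr.sin_addr);
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("connect");
        return 1;
    }

    /* Read until the server closes the connection, then report elapsed time. */
    char buf[16384];
    long long total = 0;
    ssize_t n;
    time_t start = time(NULL);
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        total += n;
    printf("received %lld bytes in %ld seconds\n", total, (long)(time(NULL) - start));

    close(fd);
    return 0;
}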
With the help of this minimal reproduction setup, we were ultimately able to pinpoint the root cause of the problem. To understand the root cause, it helps to have a basic grasp of the TCP receive window.
TCP Receive Window
Simply put, the TCP receive window is how the receiver tells the sender "this is how many bytes you can send me without me ACKing any of them". Assuming the sender is the server and the receiver is the client, then we have:
The Window Size
Now that we know the TCP receive window size can affect throughput, the question is: how is the window size calculated? As an application writer, you can't decide the window size; however, you can decide how much memory you want to use for buffering received data. This is configured using the SO_RCVBUF socket option we saw in the strace result above. Note, however, that the value of this option determines how much application data can be queued in the receive buffer. In man 7 socket, there is
SO_RCVBUF
Sets or gets the maximum socket receive buffer in bytes.
The kernel doubles this value (to allow space for
bookkeeping overhead) when it is set using setsockopt(2),
and this doubled value is returned by getsockopt(2). The
default value is set by the
/proc/sys/net/core/rmem_default file, and the maximum
allowed value is set by the /proc/sys/net/core/rmem_max
file. The minimum (doubled) value for this option is 256.
This means that when the user supplies a value X, the kernel stores 2X in the variable sk->sk_rcvbuf. In other words, the kernel assumes that the bookkeeping overhead is as large as the actual data (i.e. 50% of sk_rcvbuf).
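You can observe this doubling yourself by setting SO_RCVBUF and reading it back with getsockopt; a small sketch (the value actually stored is subject to the rmem_max limit mentioned above):

/* Demonstrates the SO_RCVBUF doubling described in man 7 socket:
 * setting 65536 and reading it back typically returns 131072
 * (subject to the limits in /proc/sys/net/core/rmem_max). */
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int requested = 65536;
    int effective = 0;
    socklen_t len = sizeof(effective);

    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &requested, sizeof(requested));
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &effective, &len);

    /* Prints e.g. "requested 65536, kernel stored 131072" */
    printf("requested %d, kernel stored %d\n", requested, effective);

    close(fd);
    return 0;
}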
sysctl_tcp_adv_win_scale
However, the assumption above may not hold, because the actual overhead really depends on factors such as the Maximum Transmission Unit (MTU). Therefore, the kernel provided sysctl_tcp_adv_win_scale, which you can use to tell the kernel what the actual overhead is. (I believe 99% of people also don't know how to set this parameter correctly, and I'm definitely one of them. You're the kernel; if you don't know the overhead, how can you expect me to know?)
According to the sysctl doc,
tcp_adv_win_scale - INTEGER
Obsolete since linux-6.6. Count buffering overhead as bytes/2^tcp_adv_win_scale (if tcp_adv_win_scale > 0) or bytes-bytes/2^(-tcp_adv_win_scale), if it is <= 0.
Possible values are [-31, 31], inclusive.
Default: 1
For 99% of people, we're just using the default value of 1, which in turn means the overhead is calculated as rcvbuf/2^tcp_adv_win_scale = 1/2 * rcvbuf. This matches the assumption made when setting the SO_RCVBUF value.
Let's recap. Assume you set SO_RCVBUF to 65536, which is the value the application set as shown in the setsockopt syscall. Then we have:
- SO_RCVBUF = 65536
- rcvbuf = 2 * 65536 = 131072
- overhead = rcvbuf / 2 = 131072 / 2 = 65536
- receive window size = rcvbuf - overhead = 131072 - 65536 = 65536
(Note, this calculation is simplified. The real calculation is more complex.)
In short, the receive window size before the kernel upgrade was 65536. With this window size, the application was able to transfer 10M of data within 30 seconds.
The Change
This commit obsoleted sysctl_tcp_adv_win_scale and introduced a scaling_ratio that can calculate the overhead, and hence the window size, more accurately, which is the right thing to do. With the change, the window size is now rcvbuf * scaling_ratio.
So how is scaling_ratio calculated? It is calculated as skb->len/skb->truesize, where skb->len is the length of the TCP data in an skb and truesize is the total size of the skb. This is certainly a more accurate ratio based on real data rather than a hardcoded 50%. Now, here is the next question: during the TCP handshake, before any data is transferred, how do we decide the initial scaling_ratio? The answer: a magic and conservative ratio was chosen, with the value being roughly 0.25.
Now we’ve:
- SO_RCVBUF = 65536
- rcvbuf = 2 * 65536 = 131072
- receive window size = rcvbuf * 0.25 = 131072 * 0.25 = 32768
In short, the receive window size was halved after the kernel upgrade. Hence the throughput was cut in half, causing the data transfer time to double.
Naturally, you may ask: I understand that the initial window size is small, but why doesn't the window grow once we have a more accurate ratio of the payload later (i.e. skb->len/skb->truesize)? With some debugging, we eventually found that the scaling_ratio does get updated to the more accurate skb->len/skb->truesize, which in our case is around 0.66. However, another variable, window_clamp, is not updated accordingly. window_clamp is the maximum receive window allowed to be advertised, and it is also initialized to 0.25 * rcvbuf using the initial scaling_ratio. As a result, the receive window size is capped at this value and can't grow any bigger.
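To illustrate the capping, here is a simplified model of the relationship we observed; it is not the kernel code, just the arithmetic behind why the window stays stuck:

/* Simplified model of the capping we observed: even after scaling_ratio
 * improves to ~0.66, the advertised window stays limited by window_clamp,
 * which was initialized from the conservative 0.25 ratio and never raised. */
#include <stdio.h>

int main(void)
{
    double rcvbuf = 131072.0;            /* 2 * SO_RCVBUF */
    double initial_ratio = 0.25;
    double measured_ratio = 0.66;        /* skb->len / skb->truesize seen in our case */

    double window_clamp = rcvbuf * initial_ratio;   /* 32768, never updated */
    double desired = rcvbuf * measured_ratio;       /* ~86k the window could use */
    double advertised = desired < window_clamp ? desired : window_clamp;

    printf("desired %.0f bytes, but clamped to %.0f bytes\n", desired, advertised);
    return 0;
}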
In theory, the fix is to update window_clamp along with scaling_ratio. However, in order to have a simple fix that doesn't introduce other unexpected behaviors, our final fix was to increase the initial scaling_ratio from 25% to 50%. This makes the receive window size backward compatible with the original default sysctl_tcp_adv_win_scale.
Meanwhile, note that the problem was caused not only by the changed kernel behavior but also by the fact that the application sets SO_RCVBUF and has a 30-second application-level timeout. In fact, the application is Kafka Connect, and both settings are the default configurations (receive.buffer.bytes=64k and request.timeout.ms=30s). We also created a Kafka ticket to change receive.buffer.bytes to -1 to allow Linux to auto-tune the receive window.
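For reference, an override along these lines in a Kafka Connect worker configuration would defer the receive buffer size to the kernel's auto-tuning (the values shown are illustrative, not our exact production config):

# Illustrative Kafka Connect worker override, not our exact production config.
# A value of -1 tells Kafka to leave SO_RCVBUF unset so the OS can auto-tune it.
consumer.receive.buffer.bytes=-1
# The default request timeout that translated into the 30-second FIN we observed.
consumer.request.timeout.ms=30000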
This was a very interesting debugging exercise that covered many layers of Netflix's stack and infrastructure. While it technically wasn't the "network" to blame, this time the culprit turned out to be the software components that make up the network (i.e. the TCP implementation in the kernel).
If tackling such technical challenges excites you, consider joining our Cloud Infrastructure Engineering teams. Explore opportunities by visiting Netflix Jobs and searching for Cloud Engineering positions.
Special thanks to our stunning colleagues Alok Tiagi, Artem Tkachuk, Ethan Adams, Jorge Rodriguez, Nick Mahilani, Tycho Andersen and Vinay Rayini for investigating and mitigating this issue. We would also like to thank Linux kernel networking expert Eric Dumazet for reviewing and applying the patch.