Investigation of a Workbench UI Latency Issue



12 min read

Oct 14, 2024

By: Hechao Li and Marcelo Mayworm

With special thanks to our stunning colleagues Amer Ather, Itay Dafna, Luca Pozzi, Matheus Leão, and Ye Ji.

At Netflix, the Analytics and Developer Experience organization, part of the Data Platform, offers a product called Workbench. Workbench is a remote development workspace based on Titus that allows data practitioners to work with big data and machine learning use cases at scale. A common use case for Workbench is running JupyterLab Notebooks.

Recently, several users reported that their JupyterLab UI becomes slow and unresponsive when running certain notebooks. This document details the intriguing process of debugging this issue, all the way from the UI down to the Linux kernel.

Machine Learning engineer Luca Pozzi reported to our Data Platform team that the JupyterLab UI on their workbench becomes slow and unresponsive when running some of their Notebooks. Restarting the ipykernel process, which runs the Notebook, could temporarily alleviate the problem, but the frustration persists as more notebooks are run.

While we observed the issue firsthand, the term "UI being slow" is subjective and hard to measure. To investigate this issue, we needed a quantitative analysis of the slowness.

Itay Dafna devised an effective and simple method to quantify the UI slowness. Specifically, we opened a terminal via JupyterLab and held down a key (e.g., "j") for 15 seconds while running the user's notebook. The input to stdin is sent to the backend (i.e., JupyterLab) via a WebSocket, and the output to stdout is sent back from the backend and displayed on the UI. We then exported the .har file recording all communications from the browser and loaded it into a Notebook for analysis.
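As a rough illustration, an analysis along these lines could look like the sketch below. The file name and the pairing logic are ours (not the exact notebook we used), and it assumes Chrome's HAR export, which records WebSocket frames under the non-standard _webSocketMessages key.

import json

# A rough sketch: pair each keystroke frame sent over the terminal
# WebSocket with the next frame received back, and report the echo latency.
with open("jupyterlab.har") as f:
    har = json.load(f)

latencies = []
for entry in har["log"]["entries"]:
    frames = entry.get("_webSocketMessages", [])
    sends = [m["time"] for m in frames if m["type"] == "send"]
    receives = [m["time"] for m in frames if m["type"] == "receive"]
    for s in sends:
        echoes = [r for r in receives if r >= s]
        if echoes:
            latencies.append(min(echoes) - s)

if latencies:
    print(f"avg: {sum(latencies) / len(latencies):.2f}s, max: {max(latencies):.2f}s")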

Using this method, we observed latencies ranging from 1 to 10 seconds, averaging 7.4 seconds.

Now that we have an objective metric for the slowness, let's officially start our investigation. If you have read the symptom carefully, you must have noticed that the slowness only occurs when the user runs certain notebooks but not others.

Therefore, the first step is scrutinizing the specific Notebook experiencing the issue. Why does the UI always slow down after running this particular Notebook? Naturally, you would assume that there must be something wrong with the code running in it.

Upon closely inspecting the user's Notebook, we noticed that a library called pystan, which provides Python bindings to a native C++ library called stan, looked suspicious. Specifically, pystan uses asyncio. However, because there is already an existing asyncio event loop running in the Notebook process and asyncio cannot be nested by design, in order for pystan to work, the authors of pystan recommend injecting pystan into the existing event loop by using a package called nest_asyncio, a library that became unmaintained because the author unfortunately passed away.
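For context, the typical nest_asyncio pattern looks roughly like this (a minimal sketch, not code from the user's notebook):

import asyncio
import nest_asyncio

# Patch the already-running event loop (e.g., the one owned by ipykernel)
# so that run_until_complete() can be called from inside it.
nest_asyncio.apply()

async def fit_model():
    # Stand-in for the kind of coroutine pystan schedules.
    await asyncio.sleep(0.1)
    return "done"

# Without the patch, calling this inside a notebook cell raises
# "RuntimeError: This event loop is already running".
print(asyncio.get_event_loop().run_until_complete(fit_model()))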

Given this seemingly hacky usage, we naturally suspected that the events injected by pystan into the event loop were blocking the handling of the WebSocket messages used to communicate with the JupyterLab UI. This reasoning sounds very plausible. However, the user claimed that there were cases when a Notebook not using pystan ran and the UI also became slow.

Moreover, after several rounds of discussion with ChatGPT, we learned more about the architecture and realized that, in theory, the usage of pystan and nest_asyncio should not cause the slowness in handling the UI WebSocket, for the following reasons:

Even though pystan uses nest_asyncio to inject itself into the main event loop, the Notebook runs in a child process (i.e., the ipykernel process) of the jupyter-lab server process, which means the main event loop being injected into by pystan is that of the ipykernel process, not the jupyter-server process. Therefore, even if pystan blocks its event loop, it should not impact the jupyter-lab main event loop that is used for UI WebSocket communication. See the diagram below:

In other words, pystan events are injected into event loop B in this diagram instead of event loop A. So, they should not block the UI WebSocket events.

You might also assume that because event loop A handles both the WebSocket events from the UI and the ZeroMQ socket events from the ipykernel process, a high volume of ZeroMQ events generated by the notebook could block the WebSocket. However, when we captured packets on the ZeroMQ socket while reproducing the issue, we didn't observe heavy traffic on this socket that could cause such blocking.

A stronger piece of evidence to rule out pystan was that we were eventually able to reproduce the issue even without it, which I'll dive into later.

The Workbench instance runs as a Titus container. To efficiently utilize our compute resources, Titus employs a CPU oversubscription feature, meaning the combined virtual CPUs allocated to containers exceed the number of available physical CPUs on a Titus agent. If a container is unlucky enough to be scheduled alongside other "noisy" containers (those that consume a lot of CPU resources), it could suffer from CPU starvation.

However, after checking the CPU usage of neighboring containers on the same Titus agent as the Workbench instance, as well as the overall CPU usage of the Titus agent, we quickly ruled out this hypothesis. Using the top command on the Workbench, we observed that when running the Notebook, the Workbench instance uses only 4 out of the 64 CPUs allocated to it. Simply put, this workload is not CPU-bound.

The next theory was that the network between the web browser UI (on the laptop) and the JupyterLab server was slow. To investigate, we captured all the packets between the laptop and the server while running the Notebook and repeatedly pressing 'j' in the terminal.

When the UI experienced delays, we observed a 5-second pause in packet transmission from server port 8888 to the laptop. Meanwhile, traffic from other ports, such as port 22 for SSH, remained unaffected. This led us to conclude that the pause was caused by the application running on port 8888 (i.e., the JupyterLab process) rather than the network.

As previously mentioned, another strong piece of evidence proving the innocence of pystan was that we could reproduce the issue without it. By gradually stripping down the "bad" Notebook, we eventually arrived at a minimal snippet of code that reproduces the issue without any third-party dependencies or complex logic:

import time
import os
from multiprocessing import Process

N = os.cpu_count()

def launch_worker(worker_id):
    time.sleep(60)

if __name__ == '__main__':
    # Read a 2GB file into memory; the data is never used afterwards.
    with open('/root/2GB_file', 'r') as file:
        data = file.read()
    # Start one do-nothing process per "CPU" reported by os.cpu_count().
    processes = []
    for i in range(N):
        p = Process(target=launch_worker, args=(i,))
        processes.append(p)
        p.start()

    for p in processes:
        p.join()

The code does only two things:

  1. Read a 2GB file into memory (the Workbench instance has 480GB of memory in total, so this memory usage is almost negligible).
  2. Start N processes, where N is the number of CPUs. The N processes do nothing but sleep.

There is no doubt that this is the silliest piece of code I've ever written. It is neither CPU bound nor memory bound. Yet it can cause the JupyterLab UI to stall for as long as 10 seconds!

There are a couple of interesting observations that raise several questions:

  • We noticed that both steps are required in order to reproduce the issue. If you don't read the 2GB file (which isn't even used!), the issue is not reproducible. Why could using 2GB out of 480GB of memory impact the performance?
  • When the UI delay occurs, the jupyter-lab process CPU usage spikes to 100%, hinting at contention on the single-threaded event loop in this process (event loop A in the diagram before). What does the jupyter-lab process need the CPU for, given that it isn't the process that runs the Notebook?
  • The code runs in a Notebook, which means it runs in the ipykernel process, which is a child process of the jupyter-lab process. How can anything that happens in a child process cause the parent process to have CPU contention?
  • The workbench has 64 CPUs. But when we printed os.cpu_count(), the output was 96. That means the code starts more processes than the number of CPUs. Why is that?

Let's answer the last question first. In fact, if you run the lscpu and nproc commands inside a Titus container, you will also see different results: the former gives you 96, which is the number of physical CPUs on the Titus agent, while the latter gives you 64, which is the number of virtual CPUs allocated to the container. This discrepancy is due to the lack of a "CPU namespace" in the Linux kernel, causing the number of physical CPUs to be leaked to the container when calling certain functions to get the CPU count. The assumption here is that Python's os.cpu_count() uses the same function as the lscpu command, causing it to get the CPU count of the host instead of the container. Python 3.13 has a new call that can be used to get the accurate CPU count, but it's not GA'ed yet.
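For instance, the discrepancy is easy to see from inside the container with a snippet like this (a sketch; os.sched_getaffinity is Linux-only, and the exact numbers depend on the container):

import os

# The host CPU count leaks into the container...
print(os.cpu_count())                # 96 on this workbench (host CPUs)
# ...while the scheduler affinity mask reflects the CPUs the container
# is actually allowed to run on.
print(len(os.sched_getaffinity(0)))  # 64 (allocated virtual CPUs)
# Python 3.13 adds os.process_cpu_count(), which returns the latter.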

It will be confirmed later that this inaccurate number of CPUs can be a contributing factor to the slowness.

Next, we used py-spy to profile the jupyter-lab process. Note that we profiled the parent jupyter-lab process, not the ipykernel child process that runs the reproduction code. The profiling result is as follows:

As one can see, a lot of CPU time (89%!!) is spent on a function called __parse_smaps_rollup. In comparison, the terminal handler used only 0.47% of CPU time. From the stack trace, we see that this function is inside event loop A, so it can definitely cause the UI WebSocket events to be delayed.

The stack trace also shows that this function is ultimately called by a function used by a JupyterLab extension called jupyter_resource_usage. We then disabled this extension and restarted the jupyter-lab process. As you may have guessed, we could no longer reproduce the slowness!

But our puzzle is not solved yet. Why does this extension cause the UI to slow down? Let's keep digging.

From the name of the extension and the names of the other functions it calls, we can infer that this extension is used to get resource information such as CPU and memory usage. Examining the code, we see that this function call stack is triggered when an API endpoint /metrics/v1 is called from the UI. The UI apparently calls this endpoint periodically, according to the network traffic tab in Chrome's Developer Tools.
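For illustration, the endpoint can also be polled by hand with something like the sketch below (the port, route prefix, and token handling are assumptions that depend on the Jupyter server configuration):

import requests

# Hit the resource-usage endpoint that the UI polls periodically. The
# response includes aggregate memory statistics for the jupyter-lab
# process and all of its children.
resp = requests.get(
    "http://localhost:8888/api/metrics/v1",
    headers={"Authorization": "token <jupyter-token>"},  # hypothetical token
)
print(resp.status_code, resp.json())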

Now let's look at the implementation, starting from the call to get (jupyter_resource_usage/api.py:42). The full code is here, and the key lines are shown below:

cur_process = psutil.Process()
all_processes = [cur_process] + cur_process.children(recursive=True)

for p in all_processes:
    info = p.memory_full_info()

Basically, it gets all child processes of the jupyter-lab process recursively, including both the ipykernel Notebook process and all processes created by the Notebook. Obviously, the cost of this function is linear in the number of all child processes. In the reproduction code, we create 96 processes. So here we will have at least 96 (sleep processes) + 1 (ipykernel process) + 1 (jupyter-lab process) = 98 processes, when it should really be 64 (allocated CPUs) + 1 (ipykernel process) + 1 (jupyter-lab process) = 66 processes, because the number of CPUs allocated to the container is, in fact, 64.
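To see this linear cost directly, a quick sketch like the one below (our own code, not the extension's) times one round of the same scan:

import time
import psutil

# Time one round of what the extension does on every metrics request:
# enumerate all descendant processes and query full memory info for each
# (which reads /proc/<pid>/smaps_rollup on Linux).
cur_process = psutil.Process()  # or psutil.Process(<jupyter-lab pid>) to match the extension
all_processes = [cur_process] + cur_process.children(recursive=True)

start = time.time()
for p in all_processes:
    try:
        p.memory_full_info()
    except psutil.NoSuchProcess:
        pass  # a child may have exited in the meantime
print(f"{len(all_processes)} processes scanned in {time.time() - start:.3f}s")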

This is truly ironic. The more CPUs we have, the slower we are!

At this point, we have answered one question: why does starting many grandchild processes in the child process cause the parent process to be slow? Because the parent process runs a function whose cost is linear in the number of all its child processes, counted recursively.

However, this solves only half of the puzzle. If you remember the earlier analysis, starting many child processes ALONE does not reproduce the issue. If we don't read the 2GB file, even if we create 2x more processes, we can't reproduce the slowness.

So now we must answer the next question: why does reading a 2GB file in the child process affect the parent process's performance, especially when the workbench has as much as 480GB of memory in total?

To answer this question, let's look closely at the function __parse_smaps_rollup. As the name implies, this function parses the file /proc/<pid>/smaps_rollup.

def _parse_smaps_rollup(self):
    uss = pss = swap = 0
    with open_binary("{}/{}/smaps_rollup".format(self._procfs_path, self.pid)) as f:
        for line in f:
            if line.startswith(b"Private_"):
                # Private_Clean, Private_Dirty, Private_Hugetlb
                uss += int(line.split()[1]) * 1024
            elif line.startswith(b"Pss:"):
                pss = int(line.split()[1]) * 1024
            elif line.startswith(b"Swap:"):
                swap = int(line.split()[1]) * 1024
    return (uss, pss, swap)

Naturally, you might assume that when memory usage increases, this file becomes larger in size, causing the function to take longer to parse. Unfortunately, this is not the answer, because:

  • First, the number of lines in this file is constant for all processes.
  • Second, this is a special file in the /proc filesystem, which should be seen as a kernel interface rather than a regular file on disk. In other words, I/O operations on this file are handled by the kernel rather than the disk.

This file was introduced in this commit in 2017, with the goal of improving the performance of user programs that determine aggregate memory statistics. Let's first focus on the handler of the open syscall for this /proc/<pid>/smaps_rollup file.

Following through the single_open function, we will find that it uses the function show_smaps_rollup for the show operation, which translates to the read system call on the file. Next, we look at the show_smaps_rollup implementation. You will find a do-while loop whose iteration count is linear in the number of virtual memory areas.

static int show_smaps_rollup(struct seq_file *m, void *v) {

    vma_start = vma->vm_start;
    do {
        smap_gather_stats(vma, &mss, 0);
        last_vma_end = vma->vm_end;

    } for_each_vma(vmi, vma);

}

This perfectly explains why the function gets slower when a 2GB file is read into memory: the handler for reading the smaps_rollup file now takes longer to run the loop. Basically, although smaps_rollup already improved the performance of getting memory information compared to the old method of parsing the /proc/<pid>/smaps file, it is still linear in the virtual memory used.

Even though at this point the puzzle is solved, let's conduct a more quantitative analysis. How much is the time difference when reading the smaps_rollup file with small versus large virtual memory usage? Let's write some simple benchmark code like below:

import os

def read_smaps_rollup(pid):
    with open("/proc/{}/smaps_rollup".format(pid), "rb") as f:
        for line in f:
            pass

if __name__ == "__main__":
    pid = os.getpid()

    read_smaps_rollup(pid)

    with open("/root/2G_file", "rb") as f:
        data = f.read()

    read_smaps_rollup(pid)

This program performs the following steps:

  1. Reads the smaps_rollup file of the current process.
  2. Reads a 2GB file into memory.
  3. Repeats step 1.

We then use strace to find the accurate time spent reading the smaps_rollup file.

$ sudo strace -T -e trace=openat,read python3 benchmark.py 2>&1 | grep "smaps_rollup" -A 1

openat(AT_FDCWD, "/proc/3107492/smaps_rollup", O_RDONLY|O_CLOEXEC) = 3 <0.000023>
read(3, "560b42ed4000-7ffdadcef000 ---p 0"..., 1024) = 670 <0.000259>
...
openat(AT_FDCWD, "/proc/3107492/smaps_rollup", O_RDONLY|O_CLOEXEC) = 3 <0.000029>
read(3, "560b42ed4000-7ffdadcef000 ---p 0"..., 1024) = 670 <0.027698>

As you can see, both times the read syscall returned 670, meaning the file size remained the same at 670 bytes. However, the second read (i.e., 0.027698 seconds) took 100x as long as the first (i.e., 0.000259 seconds)! This means that if there are 98 processes, the time spent on reading this file alone will be 98 * 0.027698 = 2.7 seconds! Such a delay can significantly affect the UI experience.

This extension is used to display the CPU and memory usage of the notebook process in the bar at the bottom of the Notebook:

We confirmed with the user that disabling the jupyter-resource-usage extension meets their requirements for UI responsiveness, and that this extension is not critical to their use case. Therefore, we provided a way for them to disable the extension.

This was a challenging issue that required debugging from the UI all the way down to the Linux kernel. It is fascinating that the problem is linear in both the number of CPUs and the virtual memory size, two dimensions that are typically considered separately.

Overall, we hope you enjoyed the irony of:

  1. The extension used to monitor CPU usage causing CPU contention.
  2. An interesting case where the more CPUs you have, the slower you get!

If you are excited by tackling technical challenges like this one and want the opportunity to solve complex problems and drive innovation, consider joining our Data Platform teams. Be part of shaping the future of Data Security and Infrastructure, Data Developer Experience, Analytics Infrastructure and Enablement, and more. Explore the impact you can make with us!
