Investigation of a Workbench UI Latency Issue

Netflix Technology Blog
Netflix TechBlog · 12 min read · Oct 14, 2024

By: Hechao Li and Marcelo Mayworm

With special thanks to our stunning colleagues Amer Ather, Itay Dafna, Luca Pozzi, Matheus Leão, and Ye Ji.

At Netflix, the Analytics and Developer Experience organization, part of the Data Platform, offers a product called Workbench. Workbench is a remote development workspace based on Titus that allows data practitioners to work with big data and machine learning use cases at scale. A common use case for Workbench is running JupyterLab Notebooks.

Recently, several users reported that their JupyterLab UI becomes slow and unresponsive when running certain notebooks. This document details the intriguing process of debugging this issue, all the way from the UI down to the Linux kernel.

Machine Learning engineer Luca Pozzi reported to our Data Platform team that the JupyterLab UI on their workbench becomes slow and unresponsive when running some of their Notebooks. Restarting the ipykernel process, which runs the Notebook, might temporarily alleviate the problem, but the frustration persists as more notebooks are run.

While we observed the issue firsthand, the term "UI being slow" is subjective and difficult to measure. To investigate this issue, we needed a quantitative analysis of the slowness.

Itay Dafna devised an effective and simple method to quantify the UI slowness. Specifically, we opened a terminal via JupyterLab and held down a key (e.g., "j") for 15 seconds while running the user's notebook. The input to stdin is sent to the backend (i.e., JupyterLab) via a WebSocket, and the output to stdout is sent back from the backend and displayed on the UI. We then exported the .har file recording all communications from the browser and loaded it into a Notebook for analysis.
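For reference, here is a minimal sketch of that analysis, assuming the HAR file was exported from Chrome (whose export includes a _webSocketMessages array for each WebSocket entry, with timestamps in seconds); it pairs each outgoing keystroke with the next incoming echo:

import json

# Load the exported HAR file (the filename is illustrative).
with open("jupyterlab.har") as f:
    har = json.load(f)

latencies = []
for entry in har["log"]["entries"]:
    last_send = None
    # Chrome-specific field holding the WebSocket frames of this connection.
    for msg in entry.get("_webSocketMessages", []):
        if msg["type"] == "send":
            last_send = msg["time"]
        elif msg["type"] == "receive" and last_send is not None:
            latencies.append(msg["time"] - last_send)
            last_send = None

if latencies:
    print(f"max={max(latencies):.1f}s avg={sum(latencies) / len(latencies):.1f}s")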

Using this method, we observed latencies ranging from 1 to 10 seconds, averaging 7.4 seconds.

Now that we have an objective metric for the slowness, let's officially start our investigation. If you have read the symptom carefully, you may have noticed that the slowness only occurs when the user runs certain notebooks but not others.

Therefore, the first step is scrutinizing the specific Notebook experiencing the issue. Why does the UI always slow down after running this particular Notebook? Naturally, you would assume that there must be something wrong with the code running in it.

Upon closely inspecting the user's Notebook, we noticed that a library called pystan, which provides Python bindings to a native C++ library called stan, looked suspicious. Specifically, pystan uses asyncio. However, because there is already an existing asyncio event loop running in the Notebook process and asyncio cannot be nested by design, in order for pystan to work, the authors of pystan recommend injecting pystan into the existing event loop by using a package called nest_asyncio, a library that became unmaintained because the author unfortunately passed away.
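For illustration, the recommended pattern looks roughly like this (a sketch only; sample() is a stand-in for pystan's asynchronous sampling call, and nest_asyncio must be installed):

import asyncio
import nest_asyncio

# Patch asyncio so run_until_complete() can be called even though the
# Notebook's event loop is already running (normally this raises an error).
nest_asyncio.apply()

async def sample():
    await asyncio.sleep(1)  # stand-in for pystan's actual async work

asyncio.get_event_loop().run_until_complete(sample())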

Given this seemingly hacky usage, we naturally suspected that the events injected by pystan into the event loop were blocking the handling of the WebSocket messages used to communicate with the JupyterLab UI. This reasoning sounds very plausible. However, the user claimed that there were cases when a Notebook not using pystan ran and the UI also became slow.

Moreover, after several rounds of discussion with ChatGPT, we learned more about the architecture and realized that, in theory, the usage of pystan and nest_asyncio should not cause the slowness in handling the UI WebSocket, for the following reasons:

Even though pystan uses nest_asyncio to inject itself into the main event loop, the Notebook runs in a child process (i.e., the ipykernel process) of the jupyter-lab server process, which means the main event loop being injected by pystan is that of the ipykernel process, not the jupyter-server process. Therefore, even if pystan blocks its event loop, it shouldn't impact the jupyter-lab main event loop that is used for UI WebSocket communication. See the diagram below:
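Schematically, the processes and event loops involved look like this:

Browser (JupyterLab UI)
        |  WebSocket (port 8888)
        v
jupyter-lab server process ............ event loop A
        |  ZeroMQ sockets
        v
ipykernel process (runs the Notebook) .. event loop B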

In other words, pystan events are injected into event loop B in this diagram instead of event loop A. So it should not block the UI WebSocket events.

You might also think that because event loop A handles both the WebSocket events from the UI and the ZeroMQ socket events from the ipykernel process, a high volume of ZeroMQ events generated by the notebook could block the WebSocket. However, when we captured packets on the ZeroMQ socket while reproducing the issue, we didn't observe heavy traffic on this socket that could cause such blocking.

A stronger piece of evidence to rule out pystan was that we were ultimately able to reproduce the issue even without it, which I'll dive into later.

The Workbench instance runs as a Titus container. To efficiently utilize our compute resources, Titus employs a CPU oversubscription feature, meaning the combined virtual CPUs allocated to containers exceed the number of available physical CPUs on a Titus agent. If a container is unfortunate enough to be scheduled alongside other "noisy" containers, i.e., ones that consume a lot of CPU resources, it could suffer from CPU deficiency.

However, after examining the CPU utilization of neighboring containers on the same Titus agent as the Workbench instance, as well as the overall CPU utilization of the Titus agent, we quickly ruled out this hypothesis. Using the top command on the Workbench, we observed that when running the Notebook, the Workbench instance uses only 4 out of the 64 CPUs allocated to it. Simply put, this workload is not CPU-bound.

The next theory was that the network between the web browser UI (on the laptop) and the JupyterLab server was slow. To investigate, we captured all the packets between the laptop and the server while running the Notebook and repeatedly pressing 'j' in the terminal.

When the UI experienced delays, we observed a 5-second pause in packet transmission from server port 8888 to the laptop. Meanwhile, traffic on other ports, such as port 22 for SSH, remained unaffected. This led us to conclude that the pause was caused by the application running on port 8888 (i.e., the JupyterLab process) rather than the network.

As previously mentioned, another strong piece of evidence proving the innocence of pystan was that we could reproduce the issue without it. By gradually stripping down the "bad" Notebook, we eventually arrived at a minimal snippet of code that reproduces the issue without any third-party dependencies or complex logic:

import time
import os
from multiprocessing import Process

N = os.cpu_count()

def launch_worker(worker_id):
    time.sleep(60)

if __name__ == '__main__':
    with open('/root/2GB_file', 'r') as file:
        data = file.read()
    processes = []
    for i in range(N):
        p = Process(target=launch_worker, args=(i,))
        processes.append(p)
        p.start()

    for p in processes:
        p.join()

The code does only two things:

  1. Read a 2GB file into memory (the Workbench instance has 480GB of memory in total, so this memory usage is nearly negligible).
  2. Start N processes, where N is the number of CPUs. The N processes do nothing but sleep.

There is no doubt that this is the silliest piece of code I've ever written. It is neither CPU bound nor memory bound. Yet it can cause the JupyterLab UI to stall for as long as 10 seconds!

There are a couple of interesting observations that raise several questions:

  • We noticed that both steps are required in order to reproduce the issue. If you don't read the 2GB file (which isn't even used!), the issue is not reproducible. Why would using 2GB out of 480GB of memory impact performance?
  • When the UI delay occurs, the jupyter-lab process CPU utilization spikes to 100%, hinting at contention on the single-threaded event loop in this process (event loop A in the diagram above). What does the jupyter-lab process need the CPU for, given that it isn't the process that runs the Notebook?
  • The code runs in a Notebook, which means it runs in the ipykernel process, a child process of the jupyter-lab process. How can anything that happens in a child process cause the parent process to have CPU contention?
  • The workbench has 64 CPUs. But when we printed os.cpu_count(), the output was 96. That means the code starts more processes than the number of CPUs. Why is that?

Let's answer the last question first. In fact, if you run the lscpu and nproc commands inside a Titus container, you will also see different results: the former gives you 96, which is the number of physical CPUs on the Titus agent, while the latter gives you 64, which is the number of virtual CPUs allocated to the container. This discrepancy is due to the lack of a "CPU namespace" in the Linux kernel, causing the number of physical CPUs to be leaked to the container when certain functions are called to get the CPU count. The assumption here is that Python's os.cpu_count() uses the same function as the lscpu command, causing it to get the CPU count of the host instead of the container. Python 3.13 has a new call that can be used to get the correct CPU count, but it is not generally available yet.
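The discrepancy is easy to see from inside the container (a minimal sketch; os.sched_getaffinity reflects the CPUs the process is actually allowed to run on, which is what nproc reports):

import os

print(os.cpu_count())                # 96 on our Workbench: physical CPUs of the Titus agent
print(len(os.sched_getaffinity(0)))  # 64: virtual CPUs the container may actually use
# Python 3.13 adds os.process_cpu_count(), which returns the affinity-aware count directly.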

It will be shown later that this inaccurate number of CPUs can be a contributing factor to the slowness.

Next, we used py-spy to profile the jupyter-lab process. Note that we profiled the parent jupyter-lab process, not the ipykernel child process that runs the reproduction code. The profiling result is as follows:

As one can see, a lot of CPU time (89%!!) is spent in a function called _parse_smaps_rollup. In comparison, the terminal handler used only 0.47% of the CPU time. From the stack trace, we can see that this function runs inside event loop A, so it could definitely cause the UI WebSocket events to be delayed.

The stack trace also shows that this function is ultimately called by a function used by a JupyterLab extension called jupyter_resource_usage. We then disabled this extension and restarted the jupyter-lab process. As you may have guessed, we could no longer reproduce the slowness!

But our puzzle is not solved yet. Why does this extension cause the UI to slow down? Let's keep digging.

From the name of the extension and the names of the other functions it calls, we can infer that this extension is used to get resource information such as CPU and memory usage. Examining the code, we see that this function call stack is triggered when an API endpoint /metrics/v1 is called from the UI. The UI apparently calls this endpoint periodically, according to the network traffic tab in Chrome's Developer Tools.

Now let's look at the implementation, starting from the call get(jupyter_resource_usage/api.py:42). The full code is here and the key lines are shown below:

cur_process = psutil.Process()
all_processes = [cur_process] + cur_process.children(recursive=True)

for p in all_processes:
    info = p.memory_full_info()

Basically, it gets all child processes of the jupyter-lab process recursively, including both the ipykernel Notebook process and all processes created by the Notebook. Obviously, the cost of this function is linear in the number of all child processes. In the reproduction code, we create 96 processes. So here we will have at least 96 (sleep processes) + 1 (ipykernel process) + 1 (jupyter-lab process) = 98 processes, when it should really be 64 (allocated CPUs) + 1 (ipykernel process) + 1 (jupyter-lab process) = 66 processes, because the number of CPUs allocated to the container is, in fact, 64.
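To see this cost directly, you can time the same calls the extension makes (a minimal sketch using psutil, run here against the current process tree; on the Workbench the target would be the jupyter-lab process):

import time
import psutil

# Walk the process tree and read memory info for every process, which on
# Linux is backed by /proc/<pid>/smaps_rollup. This is effectively what the
# /metrics/v1 handler does on every poll from the UI.
cur_process = psutil.Process()
all_processes = [cur_process] + cur_process.children(recursive=True)

start = time.perf_counter()
for p in all_processes:
    p.memory_full_info()
elapsed = time.perf_counter() - start
print(f"{len(all_processes)} processes scanned in {elapsed:.3f}s")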

This is truly ironic. The more CPUs we have, the slower we are!

At this point, we have answered one question: Why does starting many grandchild processes in the child process cause the parent process to be slow? Because the parent process runs a function whose cost is linear in the number of all child processes, counted recursively.

However, this solves only half of the puzzle. If you remember the earlier analysis, starting many child processes ALONE doesn't reproduce the issue. If we don't read the 2GB file, even if we create twice as many processes, we can't reproduce the slowness.

So now we must answer the next question: Why does reading a 2GB file in the child process affect the parent process's performance, especially when the workbench has as much as 480GB of memory in total?

To answer this question, let's look closely at the function _parse_smaps_rollup. As the name implies, this function parses the file /proc/<pid>/smaps_rollup.

def _parse_smaps_rollup(self):
    uss = pss = swap = 0
    with open_binary("{}/{}/smaps_rollup".format(self._procfs_path, self.pid)) as f:
        for line in f:
            if line.startswith(b"Private_"):
                # Private_Clean, Private_Dirty, Private_Hugetlb
                uss += int(line.split()[1]) * 1024
            elif line.startswith(b"Pss:"):
                pss = int(line.split()[1]) * 1024
            elif line.startswith(b"Swap:"):
                swap = int(line.split()[1]) * 1024
    return (uss, pss, swap)

Naturally, you might think that when memory usage increases, this file becomes larger in size, causing the function to take longer to parse. Unfortunately, this is not the answer, because:

  • First, the number of lines in this file is constant for all processes.
  • Second, this is a special file in the /proc filesystem, which should be seen as a kernel interface rather than a regular file on disk. In other words, I/O operations on this file are handled by the kernel rather than the disk.

This file was introduced in this commit in 2017, with the purpose of improving the performance of user programs that determine aggregate memory statistics. Let's first focus on the handler of the open syscall for /proc/<pid>/smaps_rollup.

Following through the single_open function, we find that it uses the function show_smaps_rollup for the show operation, which translates to the read system call on the file. Next, we look at the show_smaps_rollup implementation. You will notice a do-while loop that is linear in the size of the virtual memory area.

static int show_smaps_rollup(struct seq_file *m, void *v) {
    ...
    vma_start = vma->vm_start;
    do {
        smap_gather_stats(vma, &mss, 0);
        last_vma_end = vma->vm_end;
        ...
    } for_each_vma(vmi, vma);
    ...
}

This perfectly explains why the function gets slower when a 2GB file is read into memory: the handler for reading the smaps_rollup file now takes longer to run the loop. Basically, although smaps_rollup already improved the performance of getting memory information compared to the old method of parsing the /proc/<pid>/smaps file, it is still linear in the virtual memory used.

Even though the puzzle is solved at this point, let's conduct a more quantitative analysis. How big is the time difference when reading the smaps_rollup file with small versus large virtual memory usage? Let's write some simple benchmark code like the below:

import os

def read_smaps_rollup(pid):
    with open("/proc/{}/smaps_rollup".format(pid), "rb") as f:
        for line in f:
            pass

if __name__ == "__main__":
    pid = os.getpid()

    read_smaps_rollup(pid)

    with open("/root/2G_file", "rb") as f:
        data = f.read()

    read_smaps_rollup(pid)

This program performs the following steps:

  1. Reads the smaps_rollup file of the current process.
  2. Reads a 2GB file into memory.
  3. Repeats step 1.

We then use strace to find the exact time spent reading the smaps_rollup file.

$ sudo strace -T -e trace=openat,read python3 benchmark.py 2>&1 | grep "smaps_rollup" -A 1

openat(AT_FDCWD, "/proc/3107492/smaps_rollup", O_RDONLY|O_CLOEXEC) = 3 <0.000023>
read(3, "560b42ed4000-7ffdadcef000 ---p 0"..., 1024) = 670 <0.000259>
...
openat(AT_FDCWD, "/proc/3107492/smaps_rollup", O_RDONLY|O_CLOEXEC) = 3 <0.000029>
read(3, "560b42ed4000-7ffdadcef000 ---p 0"..., 1024) = 670 <0.027698>

As you can see, both times the read syscall returned 670, meaning the file size remained the same at 670 bytes. However, the second read (i.e., 0.027698 seconds) took 100x as long as the first (i.e., 0.000259 seconds)! This means that if there are 98 processes, the time spent reading this file alone will be 98 * 0.027698 = 2.7 seconds! Such a delay can significantly affect the UI experience.

This extension is used to display the CPU and memory usage of the notebook process in the bar at the bottom of the Notebook:

We confirmed with the user that disabling the jupyter-resource-usage extension meets their requirements for UI responsiveness, and that this extension is not critical to their use case. Therefore, we provided a way for them to disable the extension.
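For reference, one possible way to do that is through the Jupyter Server configuration (a sketch under the assumption that the extension is loaded as a server extension named jupyter_resource_usage; the exact mechanism may differ depending on how Workbench provisions Jupyter):

# ~/.jupyter/jupyter_server_config.py
c = get_config()  # provided by Jupyter when it loads this config file
# Do not load the jupyter_resource_usage server extension at startup.
c.ServerApp.jpserver_extensions = {"jupyter_resource_usage": False}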

This was a challenging issue that required debugging from the UI all the way down to the Linux kernel. It is fascinating that the problem is linear in both the number of CPUs and the virtual memory size, two dimensions that are typically considered separately.

Overall, we hope you enjoyed the irony of:

  1. The extension used to monitor CPU usage causing CPU contention.
  2. An interesting case where the more CPUs you have, the slower you get!

If you're excited by tackling such technical challenges and want the opportunity to solve complex problems and drive innovation, consider joining our Data Platform teams. Be part of shaping the future of Data Security and Infrastructure, Data Developer Experience, Analytics Infrastructure and Enablement, and more. Explore the impact you can make with us!
