{"id":104436,"date":"2023-05-19T23:08:46","date_gmt":"2023-05-19T23:08:46","guid":{"rendered":"https:\/\/showbizztoday.com\/index.php\/2023\/05\/19\/debugging-a-fuse-deadlock-in-the-linux-kernel-by-netflix-technology-blog-may-2023\/"},"modified":"2023-05-19T23:08:46","modified_gmt":"2023-05-19T23:08:46","slug":"debugging-a-fuse-impasse-within-the-linux-kernel-by-netflix-technology-blog-may-2023","status":"publish","type":"post","link":"https:\/\/showbizztoday.com\/index.php\/2023\/05\/19\/debugging-a-fuse-impasse-within-the-linux-kernel-by-netflix-technology-blog-may-2023\/","title":{"rendered":"Debugging a FUSE deadlock in the Linux kernel | by Netflix Technology Blog | May, 2023"},"content":{"rendered":"<div>\n<p id=\"939c\" class=\"pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj\"><a class=\"af np\" href=\"https:\/\/tycho.pizza\" rel=\"noopener ugc nofollow\" target=\"_blank\">Tycho Andersen<\/a><\/p>\n<p id=\"97e4\" class=\"pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj\">The Compute team at Netflix is charged with managing all AWS and containerized workloads at Netflix, including autoscaling, deployment of containers, issue remediation, etc. As part of this team, I work on fixing strange things that users report.<\/p>\n<p id=\"adb6\" class=\"pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj\">This particular issue involved a custom internal <a class=\"af np\" href=\"https:\/\/www.kernel.org\/doc\/html\/latest\/filesystems\/fuse.html\" rel=\"noopener ugc nofollow\" target=\"_blank\">FUSE filesystem<\/a>: <a class=\"af np\" rel=\"noopener ugc nofollow\" target=\"_blank\" href=\"https:\/\/netflixtechblog.com\/netflix-drive-a607538c3055\">ndrive<\/a>. It had been festering for some time, but needed somebody to sit down and look at it in anger. 
This blog post describes how I poked at <code class=\"cw nq nr ns nt b\">\/proc<\/code> to get a sense of what was going on, before posting the issue to the kernel mailing list and getting schooled on how the kernel\u2019s wait code actually works!<\/p>\n<p id=\"1e38\" class=\"pw-post-body-paragraph mr ms gq mt b mu os mw mx my ot na nb nc ou ne nf ng ov ni nj nk ow nm nn no gj bj\">We had a stuck docker API call:<\/p>\n<pre class=\"ox oy oz pa pb pc nt pd bo pe pf pg\"><span id=\"91cc\" class=\"ph nv gq nt b bf pi pj l pk pl\">goroutine 146 [select, 8817 minutes]:<br\/>net\/http.(*persistConn).roundTrip(0xc000658fc0, 0xc0003fc080, 0x0, 0x0, 0x0)<br\/>\/usr\/local\/go\/src\/net\/http\/transport.go:2610 +0x765<br\/>net\/http.(*Transport).roundTrip(0xc000420140, 0xc000966200, 0x30, 0x1366f20, 0x162)<br\/>\/usr\/local\/go\/src\/net\/http\/transport.go:592 +0xacb<br\/>net\/http.(*Transport).RoundTrip(0xc000420140, 0xc000966200, 0xc000420140, 0x0, 0x0)<br\/>\/usr\/local\/go\/src\/net\/http\/roundtrip.go:17 +0x35<br\/>net\/http.send(0xc000966200, 0x161eba0, 0xc000420140, 0x0, 0x0, 0x0, 0xc00000e050, 0x3, 0x1, 0x0)<br\/>\/usr\/local\/go\/src\/net\/http\/client.go:251 +0x454<br\/>net\/http.(*Client).send(0xc000438480, 0xc000966200, 0x0, 0x0, 0x0, 0xc00000e050, 0x0, 0x1, 0x10000168e)<br\/>\/usr\/local\/go\/src\/net\/http\/client.go:175 +0xff<br\/>net\/http.(*Client).do(0xc000438480, 0xc000966200, 0x0, 0x0, 0x0)<br\/>\/usr\/local\/go\/src\/net\/http\/client.go:717 +0x45f<br\/>net\/http.(*Client).Do(...)<br\/>\/usr\/local\/go\/src\/net\/http\/client.go:585<br\/>golang.org\/x\/net\/context\/ctxhttp.Do(0x163bd48, 0xc000044090, 0xc000438480, 0xc000966100, 0x0, 0x0, 0x0)<br\/>\/go\/pkg\/mod\/golang.org\/x\/net@v0.0.0-20211209124913-491a49abca63\/context\/ctxhttp\/ctxhttp.go:27 +0x10f<br\/>github.com\/docker\/docker\/client.(*Client).doRequest(0xc0001a8200, 0x163bd48, 
0xc000044090, 0xc000966100, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)<br\/>\/go\/pkg\/mod\/github.com\/moby\/moby@v0.0.0-20190408150954-50ebe4562dfc\/client\/request.go:132 +0xbe<br\/>github.com\/docker\/docker\/client.(*Client).sendRequest(0xc0001a8200, 0x163bd48, 0xc000044090, 0x13d8643, 0x3, 0xc00079a720, 0x51, 0x0, 0x0, 0x0, ...)<br\/>\/go\/pkg\/mod\/github.com\/moby\/moby@v0.0.0-20190408150954-50ebe4562dfc\/client\/request.go:122 +0x156<br\/>github.com\/docker\/docker\/client.(*Client).get(...)<br\/>\/go\/pkg\/mod\/github.com\/moby\/moby@v0.0.0-20190408150954-50ebe4562dfc\/client\/request.go:37<br\/>github.com\/docker\/docker\/client.(*Client).ContainerInspect(0xc0001a8200, 0x163bd48, 0xc000044090, 0xc0006a01c0, 0x40, 0x0, 0x0, 0x0, 0x0, 0x0, ...)<br\/>\/go\/pkg\/mod\/github.com\/moby\/moby@v0.0.0-20190408150954-50ebe4562dfc\/client\/container_inspect.go:18 +0x128<br\/>github.com\/Netflix\/titus-executor\/executor\/runtime\/docker.(*DockerRuntime).Kill(0xc000215180, 0x163bdb8, 0xc000938600, 0x1, 0x0, 0x0)<br\/>\/var\/lib\/buildkite-agent\/builds\/ip-192-168-1-90-1\/netflix\/titus-executor\/executor\/runtime\/docker\/docker.go:2835 +0x310<br\/>github.com\/Netflix\/titus-executor\/executor\/runner.(*Runner).doShutdown(0xc000432dc0, 0x163bd10, 0xc000938390, 0x1, 0xc000b821e0, 0x1d, 0xc0005e4710)<br\/>\/var\/lib\/buildkite-agent\/builds\/ip-192-168-1-90-1\/netflix\/titus-executor\/executor\/runner\/runner.go:326 +0x4f4<br\/>github.com\/Netflix\/titus-executor\/executor\/runner.(*Runner).startRunner(0xc000432dc0, 0x163bdb8, 0xc00071e0c0, 0xc0a502e28c08b488, 0x24572b8, 0x1df5980)<br\/>\/var\/lib\/buildkite-agent\/builds\/ip-192-168-1-90-1\/netflix\/titus-executor\/executor\/runner\/runner.go:122 +0x391<br\/>created by github.com\/Netflix\/titus-executor\/executor\/runner.StartTaskWithRuntime<br\/>\/var\/lib\/buildkite-agent\/builds\/ip-192-168-1-90-1\/netflix\/titus-executor\/executor\/runner\/runner.go:81 +0x411<\/span><\/pre>\n<p id=\"abe6\" 
class=\"pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj\">Here, our management engine has made an HTTP call to the Docker API\u2019s unix socket asking it to kill a container. Our containers are configured to be killed via <code class=\"cw nq nr ns nt b\">SIGKILL<\/code>. But this is strange. <code class=\"cw nq nr ns nt b\">kill(SIGKILL)<\/code> should be relatively fatal, so what is the container doing?<\/p>\n<pre class=\"ox oy oz pa pb pc nt pd bo pe pf pg\"><span id=\"4458\" class=\"ph nv gq nt b bf pi pj l jf pl\">$ docker exec -it 6643cd073492 bash<br\/>OCI runtime exec failed: exec failed: container_linux.go:380: starting container process caused: process_linux.go:130: executing setns process caused: exit status 1: unknown<\/span><\/pre>\n<p id=\"9510\" class=\"pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj\">Hmm. Seems like it\u2019s alive, but <code class=\"cw nq nr ns nt b\">setns(2)<\/code> fails. Why would that be? If we look at the process tree via <code class=\"cw nq nr ns nt b\">ps awwfux<\/code>, we see:<\/p>\n<pre class=\"ox oy oz pa pb pc nt pd bo pe pf pg\"><span id=\"e7da\" class=\"ph nv gq nt b bf pi pj l jf pl\">_ containerd-shim -namespace moby -workdir \/var\/lib\/containerd\/io.containerd.runtime.v1.linux\/moby\/6643cd073492ba9166100ed30dbe389ff1caef0dc3d35<br\/>|  _ [docker-init]<br\/>|      _ [ndrive] &lt;defunct&gt;<\/span><\/pre>\n<p id=\"484b\" class=\"pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj\">Ok, so the container\u2019s init process is still alive, but it has one zombie child. 
What could the container\u2019s init process possibly be doing?<\/p>\n<pre class=\"ox oy oz pa pb pc nt pd bo pe pf pg\"><span id=\"2bb2\" class=\"ph nv gq nt b bf pi pj l jf pl\"># cat \/proc\/1528591\/stack<br\/>[&lt;0&gt;] do_wait+0x156\/0x2f0<br\/>[&lt;0&gt;] kernel_wait4+0x8d\/0x140<br\/>[&lt;0&gt;] zap_pid_ns_processes+0x104\/0x180<br\/>[&lt;0&gt;] do_exit+0xa41\/0xb80<br\/>[&lt;0&gt;] do_group_exit+0x3a\/0xa0<br\/>[&lt;0&gt;] __x64_sys_exit_group+0x14\/0x20<br\/>[&lt;0&gt;] do_syscall_64+0x37\/0xb0<br\/>[&lt;0&gt;] entry_SYSCALL_64_after_hwframe+0x44\/0xae<\/span><\/pre>\n<p id=\"5c89\" class=\"pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj\">It is in the process of exiting, but it seems stuck. The only child is the ndrive process in Z (i.e. \u201czombie\u201d) state, though. Zombies are processes that have successfully exited, and are waiting to be reaped by a corresponding <code class=\"cw nq nr ns nt b\">wait()<\/code> syscall from their parents. So how could the kernel be stuck waiting on a zombie?<\/p>\n<pre class=\"ox oy oz pa pb pc nt pd bo pe pf pg\"><span id=\"b55b\" class=\"ph nv gq nt b bf pi pj l jf pl\"># ls \/proc\/1544450\/task<br\/>1544450  1544574<\/span><\/pre>\n<p id=\"5d90\" class=\"pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj\">Ah ha, there are two threads in the thread group. 
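As an aside, the zombie-and-reap lifecycle is easy to reproduce from userspace. Here is a minimal C sketch (the helper names `proc_state` and `make_and_reap_zombie` are mine, not from the kernel or ndrive): it forks a child that exits immediately, observes it sitting in `Z` state via `/proc/<pid>/stat`, and then reaps it with `waitpid()`.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Read the one-character process state (R, S, Z, ...) for a pid from
 * /proc/<pid>/stat; returns '?' on error. */
char proc_state(pid_t pid)
{
	char path[64], buf[256];
	snprintf(path, sizeof(path), "/proc/%d/stat", (int)pid);
	FILE *f = fopen(path, "r");
	if (!f)
		return '?';
	char *ok = fgets(buf, sizeof(buf), f);
	fclose(f);
	if (!ok)
		return '?';
	/* the state is the third field, right after the "(comm)" entry */
	char *close_paren = strrchr(buf, ')');
	return close_paren ? close_paren[2] : '?';
}

/* Fork a child that exits immediately, observe it as a zombie, then
 * reap it with waitpid(); returns the state seen before the reap. */
char make_and_reap_zombie(void)
{
	pid_t pid = fork();
	if (pid < 0)
		return '?';
	if (pid == 0)
		_exit(0);	/* child: exit now, becoming a zombie */
	char state = '?';
	for (int i = 0; i < 100 && state != 'Z'; i++) {
		usleep(10 * 1000);	/* give the child time to exit */
		state = proc_state(pid);
	}
	int status;
	waitpid(pid, &status, 0);	/* reap: the zombie disappears */
	return state;
}
```

Until the parent calls `waitpid()`, the child stays visible as `Z` in `ps` output, exactly like the defunct ndrive process above.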
One of them is a zombie, maybe the other one isn\u2019t:<\/p>\n<pre class=\"ox oy oz pa pb pc nt pd bo pe pf pg\"><span id=\"1cc3\" class=\"ph nv gq nt b bf pi pj l jf pl\"># cat \/proc\/1544574\/stack<br\/>[&lt;0&gt;] request_wait_answer+0x12f\/0x210<br\/>[&lt;0&gt;] fuse_simple_request+0x109\/0x2c0<br\/>[&lt;0&gt;] fuse_flush+0x16f\/0x1b0<br\/>[&lt;0&gt;] filp_close+0x27\/0x70<br\/>[&lt;0&gt;] put_files_struct+0x6b\/0xc0<br\/>[&lt;0&gt;] do_exit+0x360\/0xb80<br\/>[&lt;0&gt;] do_group_exit+0x3a\/0xa0<br\/>[&lt;0&gt;] get_signal+0x140\/0x870<br\/>[&lt;0&gt;] arch_do_signal_or_restart+0xae\/0x7c0<br\/>[&lt;0&gt;] exit_to_user_mode_prepare+0x10f\/0x1c0<br\/>[&lt;0&gt;] syscall_exit_to_user_mode+0x26\/0x40<br\/>[&lt;0&gt;] do_syscall_64+0x46\/0xb0<br\/>[&lt;0&gt;] entry_SYSCALL_64_after_hwframe+0x44\/0xae<\/span><\/pre>\n<p id=\"f377\" class=\"pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj\">Indeed it is not a zombie. It is trying to become one as hard as it can, but it\u2019s blocking inside FUSE for some reason. To find out why, let\u2019s look at some kernel code. 
If we look at <code class=\"cw nq nr ns nt b\"><a class=\"af np\" href=\"https:\/\/git.kernel.org\/pub\/scm\/linux\/kernel\/git\/torvalds\/linux.git\/tree\/kernel\/pid_namespace.c?h=v5.19#n166\" rel=\"noopener ugc nofollow\" target=\"_blank\">zap_pid_ns_processes()<\/a><\/code>, it does:<\/p>\n<pre class=\"ox oy oz pa pb pc nt pd bo pe pf pg\"><span id=\"ee96\" class=\"ph nv gq nt b bf pi pj l pk pl\">\/*<br\/>* Reap the EXIT_ZOMBIE children we had before we ignored SIGCHLD.<br\/>* kernel_wait4() will also block until our children traced from the<br\/>* parent namespace are detached and become EXIT_DEAD.<br\/>*\/<br\/>do {<br\/>clear_thread_flag(TIF_SIGPENDING);<br\/>rc = kernel_wait4(-1, NULL, __WALL, NULL);<br\/>} while (rc != -ECHILD);<\/span><\/pre>\n<p id=\"b65e\" class=\"pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj\">which is where we are stuck, but before that, it has done:<\/p>\n<pre class=\"ox oy oz pa pb pc nt pd bo pe pf pg\"><span id=\"73e6\" class=\"ph nv gq nt b bf pi pj l pk pl\">\/* Don't allow any more processes into the pid namespace *\/<br\/>disable_pid_allocation(pid_ns);<\/span><\/pre>\n<p id=\"3eed\" class=\"pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj\">which is why docker can\u2019t <code class=\"cw nq nr ns nt b\">setns()<\/code> \u2014 the <em class=\"pm\">namespace<\/em> is a zombie. Ok, so we can\u2019t <code class=\"cw nq nr ns nt b\">setns(2)<\/code>, but why are we stuck in <code class=\"cw nq nr ns nt b\">kernel_wait4()<\/code>? 
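That `kernel_wait4(-1, ...)` loop is the in-kernel sibling of userspace `waitpid(-1, ...)`: keep reaping until the kernel reports `ECHILD`, i.e. "no children left". A userspace sketch of the same idea (my code, not the kernel's; `reap_all_children` and `demo_reap` are hypothetical names):

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Reap every child until waitpid() reports ECHILD, mirroring the
 * do/while loop in zap_pid_ns_processes(). Returns the count reaped. */
int reap_all_children(void)
{
	int reaped = 0;

	for (;;) {
		pid_t rc = waitpid(-1, NULL, 0);
		if (rc > 0) {
			reaped++;	/* reaped one child; try for more */
		} else if (errno == ECHILD) {
			break;		/* nothing left to wait for */
		}
		/* other errors (e.g. EINTR) just retry, much like the
		 * kernel loop clearing TIF_SIGPENDING and waiting again */
	}
	return reaped;
}

/* Fork `n` short-lived children, then reap them all. */
int demo_reap(int n)
{
	for (int i = 0; i < n; i++)
		if (fork() == 0)
			_exit(0);
	return reap_all_children();
}
```

The crucial property: if some child can never finish exiting, like the thread stuck in `fuse_flush` above, this loop never terminates.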
To understand why, let\u2019s look at what the other thread was doing in FUSE\u2019s <code class=\"cw nq nr ns nt b\"><a class=\"af np\" href=\"https:\/\/git.kernel.org\/pub\/scm\/linux\/kernel\/git\/torvalds\/linux.git\/tree\/fs\/fuse\/dev.c?h=v5.19#n407\" rel=\"noopener ugc nofollow\" target=\"_blank\">request_wait_answer()<\/a><\/code>:<\/p>\n<pre class=\"ox oy oz pa pb pc nt pd bo pe pf pg\"><span id=\"ea99\" class=\"ph nv gq nt b bf pi pj l pk pl\">\/*<br\/>* Either request is already in userspace, or it was forced.<br\/>* Wait it out.<br\/>*\/<br\/>wait_event(req-&gt;waitq, test_bit(FR_FINISHED, &amp;req-&gt;flags));<\/span><\/pre>\n<p id=\"df37\" class=\"pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj\">Ok, so we\u2019re waiting for an event (in this case, that userspace has replied to the FUSE flush request). But <code class=\"cw nq nr ns nt b\">zap_pid_ns_processes()<\/code> sent a <code class=\"cw nq nr ns nt b\">SIGKILL<\/code>! <code class=\"cw nq nr ns nt b\">SIGKILL<\/code> should be very fatal to a process. If we look at the process, we can indeed see that there\u2019s a pending <code class=\"cw nq nr ns nt b\">SIGKILL<\/code>:<\/p>\n<pre class=\"ox oy oz pa pb pc nt pd bo pe pf pg\"><span id=\"8bc8\" class=\"ph nv gq nt b bf pi pj l jf pl\"># grep Pnd \/proc\/1544574\/status<br\/>SigPnd: 0000000000000000<br\/>ShdPnd: 0000000000000100<\/span><\/pre>\n<p id=\"3a27\" class=\"pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj\">Viewing process status this way, you can see <code class=\"cw nq nr ns nt b\">0x100<\/code> (i.e. the 9th bit is set) under <code class=\"cw nq nr ns nt b\">ShdPnd<\/code>, which is the signal number corresponding to <code class=\"cw nq nr ns nt b\">SIGKILL<\/code>. 
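The mask math here can be decoded mechanically: for signal number n, the corresponding bit in `SigPnd`/`ShdPnd` is bit n-1. A one-line helper (`sig_mask_bit` is a hypothetical name, just for illustration):

```c
#include <signal.h>

/* Bit for signal `signum` in the SigPnd/ShdPnd masks shown in
 * /proc/<pid>/status: signal n occupies bit n-1 of the mask. */
unsigned long sig_mask_bit(int signum)
{
	return 1UL << (signum - 1);
}
```

SIGKILL is signal 9 on Linux, so its bit is `1 << 8 == 0x100`, matching the `ShdPnd` value above.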
Pending signals are signals that have been generated by the kernel, but have not yet been delivered to userspace. Signals are only delivered at certain times, for example when entering or leaving a syscall, or when waiting on events. If the kernel is currently doing something on behalf of the task, the signal may be pending. Signals can also be blocked by a task, so that they are never delivered. Blocked signals will show up in their respective pending sets as well. However, <code class=\"cw nq nr ns nt b\">man 7 signal<\/code> says: \u201cThe signals <code class=\"cw nq nr ns nt b\">SIGKILL<\/code> and <code class=\"cw nq nr ns nt b\">SIGSTOP<\/code> cannot be caught, blocked, or ignored.\u201d But here the kernel is telling us that we have a pending <code class=\"cw nq nr ns nt b\">SIGKILL<\/code>, aka that it is being ignored even while the task is waiting!<\/p>\n<p id=\"03f8\" class=\"pw-post-body-paragraph mr ms gq mt b mu os mw mx my ot na nb nc ou ne nf ng ov ni nj nk ow nm nn no gj bj\">Well, that\u2019s weird. The wait code (i.e. <code class=\"cw nq nr ns nt b\">include\/linux\/wait.h<\/code>) is used everywhere in the kernel: semaphores, wait queues, completions, etc. Surely it knows to look for <code class=\"cw nq nr ns nt b\">SIGKILL<\/code>s. So what does <code class=\"cw nq nr ns nt b\">wait_event()<\/code> actually do? 
Digging through the macro expansions and wrappers, the meat of it is:<\/p>\n<pre class=\"ox oy oz pa pb pc nt pd bo pe pf pg\"><span id=\"7675\" class=\"ph nv gq nt b bf pi pj l pk pl\">#define ___wait_event(wq_head, condition, state, exclusive, ret, cmd)           <br\/>({                                                                              <br\/>__label__ __out;                                                        <br\/>struct wait_queue_entry __wq_entry;                                     <br\/>long __ret = ret;       \/* explicit shadow *\/                           <br\/><br\/>init_wait_entry(&amp;__wq_entry, exclusive ? WQ_FLAG_EXCLUSIVE : 0);        <br\/>for (;;) {                                                              <br\/>long __int = prepare_to_wait_event(&amp;wq_head, &amp;__wq_entry, state);<br\/><br\/>if (condition)                                                  <br\/>break;                                                  <br\/><br\/>if (___wait_is_interruptible(state) &amp;&amp; __int) {                 <br\/>__ret = __int;                                          <br\/>goto __out;                                             <br\/>}                                                               <br\/><br\/>cmd;                                                            <br\/>}                                                                       <br\/>finish_wait(&amp;wq_head, &amp;__wq_entry);                                     <br\/>__out:  __ret;                                                                  <br\/>})<\/span><\/pre>\n<p id=\"2343\" class=\"pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj\">So it loops forever, doing <code class=\"cw nq nr ns nt b\">prepare_to_wait_event()<\/code>, checking the condition, then checking to see if we need to interrupt. 
Then it does <code class=\"cw nq nr ns nt b\">cmd<\/code>, which in this case is <code class=\"cw nq nr ns nt b\">schedule()<\/code>, i.e. \u201cdo something else for a while\u201d. <code class=\"cw nq nr ns nt b\">prepare_to_wait_event()<\/code> looks like:<\/p>\n<pre class=\"ox oy oz pa pb pc nt pd bo pe pf pg\"><span id=\"ec85\" class=\"ph nv gq nt b bf pi pj l pk pl\">long prepare_to_wait_event(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry, int state)<br\/>{<br\/>unsigned long flags;<br\/>long ret = 0;<p>spin_lock_irqsave(&amp;wq_head-&gt;lock, flags);<br\/>if (signal_pending_state(state, current)) {<br\/>\/*<br\/>* Exclusive waiter must not fail if it was selected by wakeup,<br\/>* it should \"consume\" the condition we were waiting for.<br\/>*<br\/>* The caller will recheck the condition and return success if<br\/>* we were already woken up, we can not miss the event because<br\/>* wakeup locks\/unlocks the same wq_head-&gt;lock.<br\/>*<br\/>* But we need to ensure that set-condition + wakeup after that<br\/>* can't see us, it should wake up another exclusive waiter if<br\/>* we fail.<br\/>*\/<br\/>list_del_init(&amp;wq_entry-&gt;entry);<br\/>ret = -ERESTARTSYS;<br\/>} else {<br\/>if (list_empty(&amp;wq_entry-&gt;entry)) {<br\/>if (wq_entry-&gt;flags &amp; WQ_FLAG_EXCLUSIVE)<br\/>__add_wait_queue_entry_tail(wq_head, wq_entry);<br\/>else<br\/>__add_wait_queue(wq_head, wq_entry);<br\/>}<br\/>set_current_state(state);<br\/>}<br\/>spin_unlock_irqrestore(&amp;wq_head-&gt;lock, flags);<\/p><p>return ret;<br\/>}<br\/>EXPORT_SYMBOL(prepare_to_wait_event);<\/p><\/span><\/pre>\n<p id=\"7e53\" class=\"pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj\">It looks like the only way we can break out of this with a non-zero exit code is if <code class=\"cw nq nr ns nt b\">signal_pending_state()<\/code> is true. 
Since our call site was just <code class=\"cw nq nr ns nt b\">wait_event()<\/code>, we know that state here is <code class=\"cw nq nr ns nt b\">TASK_UNINTERRUPTIBLE<\/code>; the definition of <code class=\"cw nq nr ns nt b\">signal_pending_state()<\/code> looks like:<\/p>\n<pre class=\"ox oy oz pa pb pc nt pd bo pe pf pg\"><span id=\"bc4b\" class=\"ph nv gq nt b bf pi pj l pk pl\">static inline int signal_pending_state(unsigned int state, struct task_struct *p)<br\/>{<br\/>if (!(state &amp; (TASK_INTERRUPTIBLE | TASK_WAKEKILL)))<br\/>return 0;<br\/>if (!signal_pending(p))<br\/>return 0;<p>return (state &amp; TASK_INTERRUPTIBLE) || __fatal_signal_pending(p);<br\/>}<\/p><\/span><\/pre>\n<p id=\"45d6\" class=\"pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj\">Our task is not interruptible, so the first if fails. Our task should have a signal pending, though, right?<\/p>\n<pre class=\"ox oy oz pa pb pc nt pd bo pe pf pg\"><span id=\"bc82\" class=\"ph nv gq nt b bf pi pj l pk pl\">static inline int signal_pending(struct task_struct *p)<br\/>{<br\/>\/*<br\/>* TIF_NOTIFY_SIGNAL isn't really a signal, but it requires the same<br\/>* behavior in terms of ensuring that we break out of wait loops<br\/>* so that notify signal callbacks can be processed.<br\/>*\/<br\/>if (unlikely(test_tsk_thread_flag(p, TIF_NOTIFY_SIGNAL)))<br\/>return 1;<br\/>return task_sigpending(p);<br\/>}<\/span><\/pre>\n<p id=\"d7ba\" class=\"pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj\">As the comment notes, <code class=\"cw nq nr ns nt b\">TIF_NOTIFY_SIGNAL<\/code> isn\u2019t relevant here, despite its name, but let\u2019s look at <code class=\"cw nq nr ns nt b\">task_sigpending()<\/code>:<\/p>\n<pre class=\"ox oy oz pa pb pc nt pd bo pe pf pg\"><span id=\"88bf\" class=\"ph nv gq nt b bf pi pj l pk pl\">static inline int task_sigpending(struct task_struct *p)<br\/>{<br\/>return 
unlikely(test_tsk_thread_flag(p,TIF_SIGPENDING));<br\/>}<\/span><\/pre>\n<p id=\"962f\" class=\"pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj\">Hmm. Seems like we should have that flag set, right? To figure that out, let\u2019s look at how signal delivery works. When we\u2019re shutting down the pid namespace in <code class=\"cw nq nr ns nt b\">zap_pid_ns_processes()<\/code>, it does:<\/p>\n<pre class=\"ox oy oz pa pb pc nt pd bo pe pf pg\"><span id=\"ac63\" class=\"ph nv gq nt b bf pi pj l pk pl\">group_send_sig_info(SIGKILL, SEND_SIG_PRIV, task, PIDTYPE_MAX);<\/span><\/pre>\n<p id=\"a015\" class=\"pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj\">which eventually gets to <code class=\"cw nq nr ns nt b\">__send_signal_locked()<\/code>, which has:<\/p>\n<pre class=\"ox oy oz pa pb pc nt pd bo pe pf pg\"><span id=\"68ae\" class=\"ph nv gq nt b bf pi pj l pk pl\">pending = (type != PIDTYPE_PID) ? &amp;t-&gt;signal-&gt;shared_pending : &amp;t-&gt;pending;<br\/>...<br\/>sigaddset(&amp;pending-&gt;signal, sig);<br\/>...<br\/>complete_signal(sig, t, type);<\/span><\/pre>\n<p id=\"a0ed\" class=\"pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj\">Using <code class=\"cw nq nr ns nt b\">PIDTYPE_MAX<\/code> here as the type is a little weird, but it roughly means \u201cthis is very privileged kernel stuff sending this signal, you should definitely deliver it\u201d. There is a bit of unintended consequence here, though, in that <code class=\"cw nq nr ns nt b\">__send_signal_locked()<\/code> ends up sending the <code class=\"cw nq nr ns nt b\">SIGKILL<\/code> to the shared set, instead of the individual task\u2019s set. 
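The shared-versus-individual distinction is visible from userspace, too. In this sketch (the helper names are mine; it needs `-pthread` on older glibc), a process-directed `kill()` lands in `ShdPnd` while a thread-directed `pthread_kill()` lands in `SigPnd`, matching the `grep Pnd` output above.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Read a hex mask field such as "ShdPnd:" or "SigPnd:" from
 * /proc/self/status; returns 0 if the field is missing. */
unsigned long pending_mask(const char *field)
{
	char line[256];
	unsigned long mask = 0;
	FILE *f = fopen("/proc/self/status", "r");

	if (!f)
		return 0;
	while (fgets(line, sizeof(line), f)) {
		if (strncmp(line, field, strlen(field)) == 0) {
			sscanf(line + strlen(field), "%lx", &mask);
			break;
		}
	}
	fclose(f);
	return mask;
}

/* Send a process-directed and a thread-directed signal (both blocked)
 * and report which pending set each one landed in. */
void demo_pending_sets(unsigned long *shd, unsigned long *sig)
{
	sigset_t mask;

	sigemptyset(&mask);
	sigaddset(&mask, SIGUSR1);
	sigaddset(&mask, SIGUSR2);
	sigprocmask(SIG_BLOCK, &mask, NULL);

	kill(getpid(), SIGUSR1);		/* process-directed -> ShdPnd */
	pthread_kill(pthread_self(), SIGUSR2);	/* thread-directed -> SigPnd */

	*shd = pending_mask("ShdPnd:");
	*sig = pending_mask("SigPnd:");

	signal(SIGUSR1, SIG_IGN);	/* discard both pending signals */
	signal(SIGUSR2, SIG_IGN);
	sigprocmask(SIG_UNBLOCK, &mask, NULL);
	signal(SIGUSR1, SIG_DFL);
	signal(SIGUSR2, SIG_DFL);
}
```

This is exactly the state the stuck FUSE task was in: `ShdPnd` set, `SigPnd` empty.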
If we look at the <code class=\"cw nq nr ns nt b\">__fatal_signal_pending()<\/code> code, we see:<\/p>\n<pre class=\"ox oy oz pa pb pc nt pd bo pe pf pg\"><span id=\"0a6e\" class=\"ph nv gq nt b bf pi pj l pk pl\">static inline int __fatal_signal_pending(struct task_struct *p)<br\/>{<br\/>return unlikely(sigismember(&amp;p-&gt;pending.signal, SIGKILL));<br\/>}<\/span><\/pre>\n<p id=\"ba1a\" class=\"pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj\">But it turns out this is a bit of a red herring (<a class=\"af np\" href=\"https:\/\/lore.kernel.org\/all\/YuGUyayVWDB7R89i@tycho.pizza\/\" rel=\"noopener ugc nofollow\" target=\"_blank\">though<\/a> <a class=\"af np\" href=\"https:\/\/lore.kernel.org\/all\/20220728091220.GA11207@redhat.com\/\" rel=\"noopener ugc nofollow\" target=\"_blank\">it<\/a> <a class=\"af np\" href=\"https:\/\/lore.kernel.org\/all\/871qu6bjp3.fsf@email.froward.int.ebiederm.org\/\" rel=\"noopener ugc nofollow\" target=\"_blank\">took<\/a> <a class=\"af np\" href=\"https:\/\/lore.kernel.org\/all\/8735elhy4u.fsf@email.froward.int.ebiederm.org\/\" rel=\"noopener ugc nofollow\" target=\"_blank\">a<\/a> <a class=\"af np\" href=\"https:\/\/lore.kernel.org\/all\/87pmhofr1q.fsf@email.froward.int.ebiederm.org\/\" rel=\"noopener ugc nofollow\" target=\"_blank\">while<\/a> for me to understand that).<\/p>\n<p id=\"8339\" class=\"pw-post-body-paragraph mr ms gq mt b mu os mw mx my ot na nb nc ou ne nf ng ov ni nj nk ow nm nn no gj bj\">To understand what\u2019s really going on here, we need to look at <code class=\"cw nq nr ns nt b\">complete_signal()<\/code>, since it unconditionally adds a <code class=\"cw nq nr ns nt b\">SIGKILL<\/code> to the task\u2019s pending set:<\/p>\n<pre class=\"ox oy oz pa pb pc nt pd bo pe pf pg\"><span id=\"5c53\" class=\"ph nv gq nt b bf pi pj l pk pl\">sigaddset(&amp;t-&gt;pending.signal, SIGKILL);<\/span><\/pre>\n<p id=\"4553\" 
class=\"pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj\">but why doesn\u2019t it work? At the top of the function we have:<\/p>\n<pre class=\"ox oy oz pa pb pc nt pd bo pe pf pg\"><span id=\"3636\" class=\"ph nv gq nt b bf pi pj l pk pl\">\/*<br\/>* Now find a thread we can wake up to take the signal off the queue.<br\/>*<br\/>* If the main thread wants the signal, it gets first crack.<br\/>* Probably the least surprising to the average bear.<br\/>*\/<br\/>if (wants_signal(sig, p))<br\/>t = p;<br\/>else if ((type == PIDTYPE_PID) || thread_group_empty(p))<br\/>\/*<br\/>* There is just one thread and it does not need to be woken.<br\/>* It will dequeue unblocked signals before it runs again.<br\/>*\/<br\/>return;<\/span><\/pre>\n<p id=\"07ad\" class=\"pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj\">but as <a class=\"af np\" href=\"https:\/\/lore.kernel.org\/all\/877d4jbabb.fsf@email.froward.int.ebiederm.org\/\" rel=\"noopener ugc nofollow\" target=\"_blank\">Eric Biederman described<\/a>, basically every thread can handle a <code class=\"cw nq nr ns nt b\">SIGKILL<\/code> at any time. Here\u2019s <code class=\"cw nq nr ns nt b\">wants_signal()<\/code>:<\/p>\n<pre class=\"ox oy oz pa pb pc nt pd bo pe pf pg\"><span id=\"d689\" class=\"ph nv gq nt b bf pi pj l pk pl\">static inline bool wants_signal(int sig, struct task_struct *p)<br\/>{<br\/>if (sigismember(&amp;p-&gt;blocked, sig))<br\/>return false;<br\/><br\/>if (p-&gt;flags &amp; PF_EXITING)<br\/>return false;<br\/><br\/>if (sig == SIGKILL)<br\/>return true;<br\/><br\/>if (task_is_stopped_or_traced(p))<br\/>return false;<br\/><br\/>return task_curr(p) || !task_sigpending(p);<br\/>}<\/span><\/pre>\n<p id=\"04f2\" class=\"pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj\">So\u2026 if a thread is already exiting (i.e. it has <code class=\"cw nq nr ns nt b\">PF_EXITING<\/code>), it doesn\u2019t want a signal. 
Consider the following sequence of events:<\/p>\n<p id=\"f18d\" class=\"pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj\">1. a task opens a FUSE file, and doesn\u2019t close it, then exits. During that exit, the kernel dutifully calls <code class=\"cw nq nr ns nt b\">do_exit()<\/code>, which does the following:<\/p>\n<pre class=\"ox oy oz pa pb pc nt pd bo pe pf pg\"><span id=\"07d1\" class=\"ph nv gq nt b bf pi pj l pk pl\">exit_signals(tsk); \/* sets PF_EXITING *\/<\/span><\/pre>\n<p id=\"2fd3\" class=\"pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj\">2. <code class=\"cw nq nr ns nt b\">do_exit()<\/code> continues on to <code class=\"cw nq nr ns nt b\">exit_files(tsk);<\/code>, which flushes all files that are still open, resulting in the stack trace above.<\/p>\n<p id=\"ded2\" class=\"pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj\">3. the pid namespace exits, and enters <code class=\"cw nq nr ns nt b\">zap_pid_ns_processes()<\/code>, sends a <code class=\"cw nq nr ns nt b\">SIGKILL<\/code> to everyone (that it expects to be fatal), and then waits for everyone to exit.<\/p>\n<p id=\"8eb7\" class=\"pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj\">4. this kills the FUSE daemon in the pid ns so it can never respond.<\/p>\n<p id=\"2db9\" class=\"pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj\">5. <code class=\"cw nq nr ns nt b\">complete_signal()<\/code> for the FUSE task that was already exiting ignores the signal, since it has <code class=\"cw nq nr ns nt b\">PF_EXITING<\/code>.<\/p>\n<p id=\"b28c\" class=\"pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj\">6. Deadlock. 
Without manually aborting the FUSE connection, things will hang forever.<\/p>\n<p id=\"16d8\" class=\"pw-post-body-paragraph mr ms gq mt b mu os mw mx my ot na nb nc ou ne nf ng ov ni nj nk ow nm nn no gj bj\">It doesn\u2019t really make sense to wait for flushes in this case: the task is dying, so there\u2019s nobody to tell the return code of <code class=\"cw nq nr ns nt b\">flush()<\/code> to. It also turns out that this bug can happen with several filesystems (anything that calls the kernel\u2019s wait code in <code class=\"cw nq nr ns nt b\">flush()<\/code>, i.e. basically anything that talks to something outside the local kernel).<\/p>\n<p id=\"0106\" class=\"pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj\">Individual filesystems will need to be patched in the meantime; for example, the fix for FUSE is <a class=\"af np\" href=\"https:\/\/github.com\/torvalds\/linux\/commit\/14feceeeb012faf9def7d313d37f5d4f85e6572b\" rel=\"noopener ugc nofollow\" target=\"_blank\">here<\/a>, which was released on April 23 in Linux 6.3.<\/p>\n<p id=\"ea5e\" class=\"pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj\">While this blog post addresses FUSE deadlocks, there are definitely issues in the nfs code and elsewhere, which we have not hit in production yet, but almost certainly will. You can also see it as a <a class=\"af np\" href=\"https:\/\/lore.kernel.org\/all\/20230512225414.GE3223426@dread.disaster.area\/\" rel=\"noopener ugc nofollow\" target=\"_blank\">symptom of other filesystem bugs<\/a>. 
Something to look out for if you have a pid namespace that won\u2019t exit.<\/p>\n<p id=\"6e54\" class=\"pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj\">This is just a small taste of the variety of strange issues we encounter running containers at scale at Netflix. Our team is hiring, so please reach out if you also love red herrings and kernel deadlocks!<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Tycho Andersen The Compute team at Netflix is charged with managing all AWS and containerized workloads at Netflix, including autoscaling, deployment of containers, issue remediation, etc. As part of this team, I work on fixing strange things that users report. This particular issue involved a custom internal FUSE filesystem: ndrive. [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":104438,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[37],"tags":[],"class_list":{"0":"post-104436","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-netflix"},"_links":{"self":[{"href":"https:\/\/showbizztoday.com\/index.php\/wp-json\/wp\/v2\/posts\/104436","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/showbizztoday.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/showbizztoday.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/showbizztoday.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/showbizztoday.com\/index.php\/wp-json\/wp\/v2\/comments?post=104436"}],"version-history":[{"count":0,"href":"https:\/\/showbizztoday.com\/index.php\/wp-json\/wp\/v2\/posts\/104436\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/showbizztoday.com
\/index.php\/wp-json\/wp\/v2\/media\/104438"}],"wp:attachment":[{"href":"https:\/\/showbizztoday.com\/index.php\/wp-json\/wp\/v2\/media?parent=104436"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/showbizztoday.com\/index.php\/wp-json\/wp\/v2\/categories?post=104436"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/showbizztoday.com\/index.php\/wp-json\/wp\/v2\/tags?post=104436"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}