{"id":112352,"date":"2023-11-02T16:00:02","date_gmt":"2023-11-02T16:00:02","guid":{"rendered":"https:\/\/showbizztoday.com\/index.php\/2023\/11\/02\/how-netflixs-container-platform-connects-linux-kernel-panics-to-kubernetes-pods\/"},"modified":"2023-11-02T16:00:03","modified_gmt":"2023-11-02T16:00:03","slug":"how-netflixs-container-platform-connects-linux-kernel-panics-to-kubernetes-pods","status":"publish","type":"post","link":"https:\/\/showbizztoday.com\/index.php\/2023\/11\/02\/how-netflixs-container-platform-connects-linux-kernel-panics-to-kubernetes-pods\/","title":{"rendered":"How Netflix&#8217;s Container Platform Connects Linux Kernel Panics to Kubernetes Pods"},"content":{"rendered":"<p> [ad_1]<br \/>\n<\/p>\n<div>\n<div>\n<div class=\"hs ht hu hv hw\">\n<div class=\"speechify-ignore ab co\">\n<div class=\"speechify-ignore bg l\">\n<div class=\"hx hy hz ia ib ab\">\n<div>\n<div class=\"ab ic\"><a href=\"https:\/\/netflixtechblog.medium.com\/?source=post_page-----ed620b9c6225--------------------------------\" rel=\"noopener follow\" target=\"_blank\"><\/p>\n<div>\n<div class=\"bl\" aria-hidden=\"false\">\n<div class=\"l id ie bx if ig\">\n<div class=\"l fg\"><img decoding=\"async\" alt=\"Netflix Technology Blog\" class=\"l fa bx dc dd cw\" src=\"https:\/\/miro.medium.com\/v2\/resize:fill:88:88\/1*BJWRqfSMf9Da9vsXG9EBRQ.jpeg\" width=\"44\" height=\"44\" loading=\"lazy\" data-testid=\"authorPhoto\"\/><\/div>\n<\/div>\n<\/div>\n<\/div>\n<p><\/a><a href=\"https:\/\/netflixtechblog.com\/?source=post_page-----ed620b9c6225--------------------------------\" rel=\"noopener  ugc nofollow\" target=\"_blank\"><\/p>\n<div class=\"ij ab fg\">\n<div>\n<div class=\"bl\" aria-hidden=\"false\">\n<div class=\"l ik il bx if im\">\n<div class=\"l fg\"><img decoding=\"async\" alt=\"Netflix TechBlog\" class=\"l fa bx bq in cw\" src=\"https:\/\/miro.medium.com\/v2\/resize:fill:48:48\/1*ty4NvNrGg4ReETxqU2N3Og.png\" width=\"24\" height=\"24\" loading=\"lazy\" 
data-testid=\"publicationPhoto\"\/><\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<p><\/a><\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<p id=\"8759\" class=\"pw-post-body-paragraph mu mv gr mw b mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr gk bj\">How Netflix\u2019s Container Platform Connects Linux Kernel Panics to Kubernetes Pods<\/p>\n<p id=\"158b\" class=\"pw-post-body-paragraph mu mv gr mw b mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr gk bj\"><em class=\"ns\">By Kyle Anderson<\/em><\/p>\n<p id=\"53a2\" class=\"pw-post-body-paragraph mu mv gr mw b mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr gk bj\">With a latest effort to scale back buyer (engineers, not finish customers) ache on our container platform <a class=\"af nt\" href=\"https:\/\/netflixtechblog.com\/tagged\/titus\" rel=\"noopener ugc nofollow\" target=\"_blank\">Titus<\/a>, I began investigating \u201corphaned\u201d pods. There are pods that by no means acquired to complete and needed to be rubbish collected with no actual passable remaining standing. Our Service job (suppose <a class=\"af nt\" href=\"https:\/\/kubernetes.io\/docs\/concepts\/workloads\/controllers\/replicaset\/\" rel=\"noopener ugc nofollow\" target=\"_blank\">ReplicatSet<\/a>) house owners don\u2019t care an excessive amount of, however our Batch customers care loads. Without an actual return code, how can they know whether it is protected to retry or not?<\/p>\n<p id=\"e0e7\" class=\"pw-post-body-paragraph mu mv gr mw b mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr gk bj\">These orphaned pods symbolize actual ache for our customers, even when they&#8217;re a small share of the whole pods within the system. Where are they going, precisely? 
Why did they go away?<\/p>\n<p id=\"5820\" class=\"pw-post-body-paragraph mu mv gr mw b mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr gk bj\">This weblog submit exhibits the right way to join the dots from the worst case situation (a kernel panic) by to Kubernetes (k8s) and finally as much as us operators in order that we are able to observe how and why our k8s nodes are going away.<\/p>\n<p id=\"23e0\" class=\"pw-post-body-paragraph mu mv gr mw b mx os mz na nb ot nd ne nf ou nh ni nj ov nl nm nn ow np nq nr gk bj\">Orphaned pods get misplaced as a result of the underlying k8s node object goes away. Once that occurs a <a class=\"af nt\" href=\"https:\/\/kubernetes.io\/docs\/concepts\/workloads\/pods\/pod-lifecycle\/#pod-garbage-collection\" rel=\"noopener ugc nofollow\" target=\"_blank\">GC<\/a> course of deletes the pod. On Titus we run a customized controller to retailer the historical past of Pod and Node objects, in order that we are able to avoid wasting clarification and present it to our customers. 
This failure mode looks like this in our UI:

[Figure: What it looks like to our users when a k8s node and its pods disappear]

This is *an* explanation, but it wasn't very satisfying to me or to our users. *Why* was the agent lost?

Nodes can go away for any reason, especially in "the cloud". When this happens, usually a k8s cloud-controller provided by the cloud vendor will detect that the actual server, in our case an EC2 instance, has gone away, and will in turn delete the k8s node object. That still doesn't really answer the question of *why*.

How can we make sure every instance that goes away has a reason, account for that reason, and bubble it all the way up to the pod?
It all starts with an annotation:

```json
{
  "apiVersion": "v1",
  "kind": "Pod",
  "metadata": {
    "annotations": {
      "pod.titus.netflix.com/pod-termination-reason": "Something really bad happened!",
      ...
```

Just making a place to put this data is a great start. Now all we have to do is make our GC controllers aware of this annotation, and then sprinkle it into any process that could possibly make a pod or node go away unexpectedly. Adding an annotation (as opposed to patching the status) preserves the rest of the pod as-is for historical purposes. (We also add annotations for what did the terminating, and a short `reason-code` for tagging.)

The `pod-termination-reason` annotation is useful for populating human-readable messages like:

- "This pod was preempted by a higher priority job ($id)"
- "This pod had to be terminated because the underlying hardware failed ($failuretype)"
- "This pod had to be terminated because $user ran sudo halt on the node"
- **"This pod died unexpectedly because the underlying node kernel panicked!"**

But wait: how are we going to annotate a pod for a node whose kernel panicked?

When the Linux kernel panics, there is just not much you can do. But what if you could send out some sort of "with my final breath, I curse Kubernetes!" UDP packet?

Inspired by this [Google Spanner paper](https://research.google/pubs/pub45855/), in which Spanner nodes send out a "last gasp" UDP packet to release leases and locks, you too can configure your servers to do the same upon kernel panic, using a stock Linux module: [netconsole](https://www.kernel.org/doc/Documentation/networking/netconsole.txt).

The fact that the Linux kernel can send out UDP packets containing the string "kernel panic", *while it is panicking*, is kind of amazing. This works because netconsole has to be configured beforehand with almost the entire IP header already filled out. That's right: you have to tell Linux exactly what your source MAC, IP, and UDP port are, as well as the destination MAC, IP, and UDP port. You are practically constructing the UDP packet for the kernel. But with that prework done, when the time comes, the kernel can easily [construct](https://github.com/torvalds/linux/blob/94f6f0550c625fab1f373bb86a6669b45e9748b3/drivers/net/netconsole.c#L932) the packet and get it out the (preconfigured) network interface as things come crashing down. Luckily, the [netconsole-setup](https://manpages.ubuntu.com/manpages/jammy/en/man8/netconsole-setup.8.html) command makes the setup fairly easy. All the configuration options can be set [dynamically](https://wiki.ubuntu.com/Kernel/Netconsole#Step_3:_Initialize_netconsole_at_boot_time) as well, so that when the endpoint changes one can point to the new IP.

Once this is set up, kernel messages start flowing right after `modprobe`. Imagine the whole thing operating like `dmesg | netcat -u $destination 6666`, but in kernel space.

With netconsole set up, the last gasp from a crashing kernel looks like a set of UDP packets exactly as one might expect, where the data of each UDP packet is simply the text of the kernel message.
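As a concrete illustration of how much of the packet must be specified up front, loading the module with a static configuration might look like the line below. This follows the parameter format from the kernel's netconsole documentation; every address, port, interface name, and MAC here is a placeholder, not a value from the post.

```shell
# Hypothetical example. Format (from the kernel docs):
#   netconsole=<src-port>@<src-ip>/<dev>,<dst-port>@<dst-ip>/<dst-mac>
# Sends kernel console messages from 10.0.0.1:6665 on eth0 to a
# collector at 10.0.0.2:6666 whose MAC address is 00:11:22:33:44:55.
modprobe netconsole netconsole=6665@10.0.0.1/eth0,6666@10.0.0.2/00:11:22:33:44:55
```

In practice `netconsole-setup` computes these values for you (including resolving the destination MAC, which is typically the gateway's for off-subnet targets).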
In the case of a kernel panic, it looks something like this (one UDP packet per line):

```
Kernel panic - not syncing: buffer overrun at 0x4ba4c73e73acce54
[ 8374.456345] CPU: 1 PID: 139616 Comm: insmod Kdump: loaded Tainted: G OE
[ 8374.458506] Hardware name: Amazon EC2 r5.2xlarge/, BIOS 1.0 10/16/2017
[ 8374.555629] Call Trace:
[ 8374.556147] <TASK>
[ 8374.556601] dump_stack_lvl+0x45/0x5b
[ 8374.557361] panic+0x103/0x2db
[ 8374.558166] ? __cond_resched+0x15/0x20
[ 8374.559019] ? do_init_module+0x22/0x20a
[ 8374.655123] ? 0xffffffffc0f56000
[ 8374.655810] init_module+0x11/0x1000 [kpanic]
[ 8374.656939] do_one_initcall+0x41/0x1e0
[ 8374.657724] ? __cond_resched+0x15/0x20
[ 8374.658505] ? kmem_cache_alloc_trace+0x3d/0x3c0
[ 8374.754906] do_init_module+0x4b/0x20a
[ 8374.755703] load_module+0x2a7a/0x3030
[ 8374.756557] ? __do_sys_finit_module+0xaa/0x110
[ 8374.757480] __do_sys_finit_module+0xaa/0x110
[ 8374.758537] do_syscall_64+0x3a/0xc0
[ 8374.759331] entry_SYSCALL_64_after_hwframe+0x62/0xcc
[ 8374.855671] RIP: 0033:0x7f2869e8ee69
...
```

The last piece is to connect this to Kubernetes (k8s). We need a k8s controller to do the following:

1. Listen for netconsole UDP packets on port 6666, watching for things that look like kernel panics from nodes.
2. Upon a kernel panic, look up the k8s node object associated with the IP address of the incoming netconsole packet.
3. For that k8s node, find all the pods bound to it, annotate them, then delete them (they're toast!).
4. Annotate the node and then delete it too (it is also toast!).

Parts 1 & 2 might look like this:

```go
for {
	n, addr, err := serverConn.ReadFromUDP(buf)
	var line string
	if err != nil {
		klog.Errorf("Error ReadFromUDP: %s", err)
	} else {
		line = sanitizeNetConsoleBuffer(buf[0:n])
		if isKernelPanic(line) {
			// Log the next 20 lines of context following the panic line.
			panicCounter = 20
			go handleKernelPanicOnNode(ctx, addr, nodeInformer, podInformer, kubeClient, line)
		}
	}
	if panicCounter > 0 {
		klog.Infof("KernelPanic context from %s: %s", addr.IP, line)
		panicCounter--
	}
}
```

And then parts 3 & 4 might look like this:

```go
func handleKernelPanicOnNode(ctx context.Context, addr *net.UDPAddr, nodeInformer cache.SharedIndexInformer, podInformer cache.SharedIndexInformer, kubeClient kubernetes.Interface, line string) {
	node := getNodeFromAddr(addr.IP.String(), nodeInformer)
	if node == nil {
		klog.Errorf("Got a kernel panic from %s, but couldn't find a k8s node object for it?", addr.IP.String())
	} else {
		pods := getPodsFromNode(node, podInformer)
		klog.Infof("Got a kernel panic from node %s, annotating and deleting all %d pods and that node.", node.Name, len(pods))
		annotateAndDeletePodsWithReason(ctx, kubeClient, pods, line)
		err := deleteNode(ctx, kubeClient, node.Name)
		if err != nil {
			klog.Errorf("Error deleting node %s: %s", node.Name, err)
		} else {
			klog.Infof("Deleted panicked node %s", node.Name)
		}
	}
}
```

With that code in place, as soon as a kernel panic is detected, the pods and node immediately go away. No need to wait for any GC process.
The annotations help document what happened to the node and pod:

[Figure: A real pod lost on a real k8s node that had a real kernel panic!]

Marking that a job failed because of a kernel panic may not be *that* satisfying to our customers. But they can take satisfaction in knowing that we now have the observability tooling required to start fixing those kernel panics!

Do you also enjoy really getting to the bottom of why things fail in your systems, or think kernel panics are cool? [Join us on the Compute Team](https://jobs.netflix.com/jobs/198642264), where we are building a world-class container platform for our engineers.