How Netflix’s Container Platform Connects Linux Kernel Panics to Kubernetes Pods
By Kyle Anderson
With a recent effort to reduce customer (engineers, not end users) pain on our container platform Titus, I started investigating “orphaned” pods. These are pods that never got to finish and had to be garbage collected with no real satisfactory final status. Our Service job (think ReplicaSet) owners don’t care too much, but our Batch users care a lot. Without a real return code, how can they know whether it is safe to retry or not?
These orphaned pods represent real pain for our users, even if they are a small percentage of the total pods in the system. Where exactly are they going? Why did they go away?
This blog post shows how to connect the dots from the worst case scenario (a kernel panic) through to Kubernetes (k8s) and eventually up to us operators, so that we can track how and why our k8s nodes are going away.
Orphaned pods get lost because the underlying k8s node object goes away. Once that happens, a GC process deletes the pod. On Titus we run a custom controller to store the history of Pod and Node objects, so that we can save some explanation and show it to our users. This failure mode looks like this in our UI:
This is an explanation, but it wasn’t very satisfying to me or to our users. Why was the agent lost?
Nodes can go away for any reason, especially in “the cloud”. When this happens, usually a k8s cloud-controller provided by the cloud vendor will detect that the actual server, in our case an EC2 Instance, has in fact gone away, and will in turn delete the k8s node object. That still doesn’t really answer the question of why.
How can we make sure that every instance that goes away has a reason, account for that reason, and bubble it up all the way to the pod? It all starts with an annotation:
{
  "apiVersion": "v1",
  "kind": "Pod",
  "metadata": {
    "annotations": {
      "pod.titus.netflix.com/pod-termination-reason": "Something really bad happened!",
...
Just making a place to put this data is a great start. Now all we have to do is make our GC controllers aware of this annotation, and then sprinkle it into any process that could potentially make a pod or node go away unexpectedly. Adding an annotation (as opposed to patching the status) preserves the rest of the pod as-is for historical purposes. (We also add annotations for what did the terminating, and a short reason-code for tagging.)
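For illustration only, a termination path might stamp the annotation with a simple merge patch before the pod is deleted. This is a hypothetical sketch using client-go; the helper name and wiring are made up, not Titus’s actual code:

import (
	"context"
	"encoding/json"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// annotatePodTerminationReason merges the termination-reason annotation onto a
// pod without touching the rest of the object. (Hypothetical sketch.)
func annotatePodTerminationReason(ctx context.Context, kubeClient kubernetes.Interface, namespace, name, reason string) error {
	patch, err := json.Marshal(map[string]interface{}{
		"metadata": map[string]interface{}{
			"annotations": map[string]string{
				"pod.titus.netflix.com/pod-termination-reason": reason,
			},
		},
	})
	if err != nil {
		return err
	}
	_, err = kubeClient.CoreV1().Pods(namespace).Patch(
		ctx, name, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}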
The pod-termination-reason annotation is useful for populating human readable messages like:
- “This pod was preempted by a higher priority job ($id)”
- “This pod had to be terminated because the underlying hardware failed ($failuretype)”
- “This pod had to be terminated because $user ran sudo halt on the node”
- “This pod died unexpectedly because the underlying node kernel panicked!”
But wait, how are we going to annotate a pod on a node whose kernel panicked?
When the Linux kernel panics, there is just not much you can do. But what if you could send out some sort of “with my final breath, I curse Kubernetes!” UDP packet?
Inspired by this Google Spanner paper, where Spanner nodes send out a “last gasp” UDP packet to release leases & locks, you too can configure your servers to do the same upon kernel panic using a stock Linux module: netconsole.
The fact that the Linux kernel can even send out UDP packets with the string ‘kernel panic’, while it is panicking, is kind of amazing. This works because netconsole needs to be configured with almost the entire IP header filled out ahead of time. That is right: you have to tell Linux exactly what your source MAC, IP, and UDP port are, as well as the destination MAC, IP, and UDP port. You are practically constructing the UDP packet for the kernel. But, with that prework done, when the time comes, the kernel can easily construct the packet and get it out the (preconfigured) network interface as things come crashing down. Luckily the netconsole-setup command makes the setup pretty easy. All the configuration options can be set dynamically as well, so that when the endpoint changes one can point to the new IP.
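If you want to do that dynamic reconfiguration by hand instead of through a helper like netconsole-setup, the kernel exposes netconsole targets via configfs (when built with CONFIG_NETCONSOLE_DYNAMIC and with configfs mounted at /sys/kernel/config). The sketch below is only an illustration under those assumptions; the target name, device, and addresses are made up:

import (
	"os"
	"path/filepath"
)

// pointNetconsoleAt (re)creates a dynamic netconsole target aimed at the given
// collector endpoint. A newly created target starts disabled, so it is safe to
// fill in its attributes before flipping "enabled" to 1.
func pointNetconsoleAt(dev, localIP, remoteIP, remoteMAC, remotePort string) error {
	target := "/sys/kernel/config/netconsole/last-gasp" // made-up target name
	if err := os.MkdirAll(target, 0o755); err != nil {
		return err
	}
	attrs := map[string]string{
		"dev_name":    dev,       // e.g. "eth0"
		"local_ip":    localIP,   // source IP stamped on the packet
		"local_port":  "6665",
		"remote_ip":   remoteIP,  // the collector listening for last-gasp packets
		"remote_mac":  remoteMAC, // next hop's MAC address (e.g. the gateway)
		"remote_port": remotePort,
	}
	for name, value := range attrs {
		if err := os.WriteFile(filepath.Join(target, name), []byte(value+"\n"), 0o644); err != nil {
			return err
		}
	}
	return os.WriteFile(filepath.Join(target, "enabled"), []byte("1\n"), 0o644)
}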
Once this is set up, kernel messages will start flowing right after modprobe. Imagine the whole thing operating like a dmesg | netcat -u $destination 6666, but in kernel space.
With netconsole set up, the last gasp from a crashing kernel looks like a set of UDP packets exactly like one might expect, where the data of each UDP packet is simply the text of the kernel message. In the case of a kernel panic, it will look something like this (one UDP packet per line):
Kernel panic - not syncing: buffer overrun at 0x4ba4c73e73acce54
[ 8374.456345] CPU: 1 PID: 139616 Comm: insmod Kdump: loaded Tainted: G OE
[ 8374.458506] Hardware name: Amazon EC2 r5.2xlarge/, BIOS 1.0 10/16/2017
[ 8374.555629] Call Trace:
[ 8374.556147] <TASK>
[ 8374.556601] dump_stack_lvl+0x45/0x5b
[ 8374.557361] panic+0x103/0x2db
[ 8374.558166] ? __cond_resched+0x15/0x20
[ 8374.559019] ? do_init_module+0x22/0x20a
[ 8374.655123] ? 0xffffffffc0f56000
[ 8374.655810] init_module+0x11/0x1000 [kpanic]
[ 8374.656939] do_one_initcall+0x41/0x1e0
[ 8374.657724] ? __cond_resched+0x15/0x20
[ 8374.658505] ? kmem_cache_alloc_trace+0x3d/0x3c0
[ 8374.754906] do_init_module+0x4b/0x20a
[ 8374.755703] load_module+0x2a7a/0x3030
[ 8374.756557] ? __do_sys_finit_module+0xaa/0x110
[ 8374.757480] __do_sys_finit_module+0xaa/0x110
[ 8374.758537] do_syscall_64+0x3a/0xc0
[ 8374.759331] entry_SYSCALL_64_after_hwframe+0x62/0xcc
[ 8374.855671] RIP: 0033:0x7f2869e8ee69
...
The last piece is to connect this to Kubernetes (k8s). We need a k8s controller to do the following:
- Listen for netconsole UDP packets on port 6666, watching for things that look like kernel panics from nodes.
- Upon kernel panic, look up the k8s node object associated with the IP address of the incoming netconsole packet.
- For that k8s node, find all the pods bound to it, annotate, and then delete those pods (they are toast!).
- For that k8s node, annotate and then delete the node too (it is also toast!).
Parts 1&2 might look like this:
for {
	n, addr, err := serverConn.ReadFromUDP(buf)
	if err != nil {
		klog.Errorf("Error ReadFromUDP: %s", err)
	} else {
		line := sanitizeNetConsoleBuffer(buf[0:n])
		if isKernelPanic(line) {
			// Log the next 20 lines of console output for extra context.
			panicCounter = 20
			go handleKernelPanicOnNode(ctx, addr, nodeInformer, podInformer, kubeClient, line)
		}
		if panicCounter > 0 {
			klog.Infof("KernelPanic context from %s: %s", addr.IP, line)
			panicCounter--
		}
	}
}
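The sanitizeNetConsoleBuffer and isKernelPanic helpers are not shown in this post. A minimal sketch of what they might do, assuming the panic banner format shown earlier, could be:

import "strings"

// sanitizeNetConsoleBuffer turns one raw netconsole datagram into a single
// printable log line by trimming whitespace and dropping non-printable bytes.
// (Minimal sketch; the real helper is not shown in this post.)
func sanitizeNetConsoleBuffer(buf []byte) string {
	return strings.Map(func(r rune) rune {
		if r < ' ' || r > '~' {
			return -1 // drop control characters and anything non-ASCII
		}
		return r
	}, strings.TrimSpace(string(buf)))
}

// isKernelPanic reports whether a console line looks like the start of a panic,
// e.g. "Kernel panic - not syncing: ...".
func isKernelPanic(line string) bool {
	return strings.Contains(line, "Kernel panic - not syncing")
}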
And then parts 3&4 might look like this:
func handleKernelPanicOnNode(ctx context.Context, addr *net.UDPAddr, nodeInformer cache.SharedIndexInformer, podInformer cache.SharedIndexInformer, kubeClient kubernetes.Interface, line string) {
	node := getNodeFromAddr(addr.IP.String(), nodeInformer)
	if node == nil {
		klog.Errorf("Got a kernel panic from %s, but couldn't find a k8s node object for it?", addr.IP.String())
	} else {
		pods := getPodsFromNode(node, podInformer)
		klog.Infof("Got a kernel panic from node %s, annotating and deleting all %d pods and that node.", node.Name, len(pods))
		annotateAndDeletePodsWithReason(ctx, kubeClient, pods, line)
		err := deleteNode(ctx, kubeClient, node.Name)
		if err != nil {
			klog.Errorf("Error deleting node %s: %s", node.Name, err)
		} else {
			klog.Infof("Deleted panicked node %s", node.Name)
		}
	}
}
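The getNodeFromAddr and getPodsFromNode helpers are also not shown here. Assuming the node’s InternalIP matches the source address of the netconsole packet, they could be as simple as scanning the informer caches:

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
)

// getNodeFromAddr looks up the Node whose InternalIP matches the source IP of
// the netconsole packet. (Hypothetical sketch; the real helper is not shown.)
func getNodeFromAddr(ip string, nodeInformer cache.SharedIndexInformer) *corev1.Node {
	for _, obj := range nodeInformer.GetStore().List() {
		node, ok := obj.(*corev1.Node)
		if !ok {
			continue
		}
		for _, addr := range node.Status.Addresses {
			if addr.Type == corev1.NodeInternalIP && addr.Address == ip {
				return node
			}
		}
	}
	return nil
}

// getPodsFromNode returns all pods in the informer cache that are bound to the
// given node. (Hypothetical sketch; a real controller might use an indexer.)
func getPodsFromNode(node *corev1.Node, podInformer cache.SharedIndexInformer) []*corev1.Pod {
	var pods []*corev1.Pod
	for _, obj := range podInformer.GetStore().List() {
		if pod, ok := obj.(*corev1.Pod); ok && pod.Spec.NodeName == node.Name {
			pods = append(pods, pod)
		}
	}
	return pods
}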
With that code in place, as soon as a kernel panic is detected, the pods and node immediately go away. No need to wait for any GC process. The annotations help document what happened to the node & pod:
Marking that a job failed because of a kernel panic may not be that satisfying to our customers. But they can take satisfaction in knowing that we now have the observability tooling required to start fixing those kernel panics!
Do you also enjoy really getting to the bottom of why things fail in your systems, or think kernel panics are cool? Join us on the Compute Team, where we are building a world-class container platform for our engineers.