The sched_wakeup and sched_wakeup_new hooks are invoked when a process changes state from 'sleeping' to 'runnable.' They allow us to identify when a process is ready to run and is waiting for CPU time. During this event, we generate a timestamp and store it in an eBPF hash map using the process ID as the key.
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, MAX_TASK_ENTRIES);
    __uint(key_size, sizeof(u32));
    __uint(value_size, sizeof(u64));
} runq_enqueued SEC(".maps");

SEC("tp_btf/sched_wakeup")
int tp_sched_wakeup(u64 *ctx)
{
    struct task_struct *task = (void *)ctx[0];
    u32 pid = task->pid;
    u64 ts = bpf_ktime_get_ns();

    bpf_map_update_elem(&runq_enqueued, &pid, &ts, BPF_NOEXIST);
    return 0;
}
Conversely, the sched_switch hook is triggered when the CPU switches between processes. This hook provides pointers to the process currently using the CPU and the process about to take over. We use the upcoming task's process ID (PID) to fetch the timestamp from the eBPF map. This timestamp represents when the process was enqueued, which we had previously stored. We then calculate the run queue latency by simply subtracting the timestamps.
SEC("tp_btf/sched_switch")
int tp_sched_switch(u64 *ctx)
{
struct task_struct *prev = (struct task_struct *)ctx[1];
struct task_struct *subsequent = (struct task_struct *)ctx[2];
u32 prev_pid = prev->pid;
u32 next_pid = next->pid;// fetch timestamp of when the following job was enqueued
u64 *tsp = bpf_map_lookup_elem(&runq_enqueued, &next_pid);
if (tsp == NULL) {
return 0; // missed enqueue
}
// calculate runq latency earlier than deleting the saved timestamp
u64 now = bpf_ktime_get_ns();
u64 runq_lat = now - *tsp;
// delete pid from enqueued map
bpf_map_delete_elem(&runq_enqueued, &next_pid);
....
One of the benefits of eBPF is its ability to provide pointers to the actual kernel data structures representing processes or threads, known as tasks in kernel terminology. This feature allows access to a wealth of information stored about a process. For our specific use case, we needed the process's cgroup ID to associate it with a container. However, the cgroup information in the process struct is protected by an RCU (Read Copy Update) lock.
To safely access this RCU-protected information, we can leverage kfuncs in eBPF. kfuncs are kernel functions that can be called from eBPF programs. There are kfuncs available to lock and unlock RCU read-side critical sections. These functions ensure that our eBPF program remains safe and efficient while retrieving the cgroup ID from the task struct.
void bpf_rcu_read_lock(void) __ksym;
void bpf_rcu_read_unlock(void) __ksym;

u64 get_task_cgroup_id(struct task_struct *task)
{
    struct css_set *cgroups;
    u64 cgroup_id;

    bpf_rcu_read_lock();
    cgroups = task->cgroups;
    cgroup_id = cgroups->dfl_cgrp->kn->id;
    bpf_rcu_read_unlock();

    return cgroup_id;
}
Once the data is ready, we must package it and send it to userspace. For this purpose, we chose the eBPF ring buffer. It is efficient, high-performing, and user-friendly. It can handle variable-length data records and allows data to be read without requiring extra memory copying or syscalls. However, the sheer number of data points was causing the userspace program to use too much CPU, so we implemented a rate limiter in eBPF to sample the data.
struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, RINGBUF_SIZE_BYTES);
} events SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_HASH);
    __uint(max_entries, MAX_TASK_ENTRIES);
    __uint(key_size, sizeof(u64));
    __uint(value_size, sizeof(u64));
} cgroup_id_to_last_event_ts SEC(".maps");

struct runq_event {
    u64 prev_cgroup_id;
    u64 cgroup_id;
    u64 runq_lat;
    u64 ts;
};
SEC("tp_btf/sched_switch")
int tp_sched_switch(u64 *ctx)
{
// ....
// The earlier code
// ....
u64 prev_cgroup_id = get_task_cgroup_id(prev);
u64 cgroup_id = get_task_cgroup_id(subsequent);
// per-cgroup-id-per-CPU rate-limiting
// to steadiness observability with efficiency overhead
u64 *last_ts =
bpf_map_lookup_elem(&cgroup_id_to_last_event_ts, &cgroup_id);
u64 last_ts_val = last_ts == NULL ? 0 : *last_ts;
// verify the speed restrict for the cgroup_id in consideration
// earlier than doing extra work
if (now - last_ts_val < RATE_LIMIT_NS) {
// Rate restrict exceeded, drop the occasion
return 0;
}
struct runq_event *occasion;
occasion = bpf_ringbuf_reserve(&occasions, sizeof(*occasion), 0);
if (occasion) {
event->prev_cgroup_id = prev_cgroup_id;
event->cgroup_id = cgroup_id;
event->runq_lat = runq_lat;
event->ts = now;
bpf_ringbuf_submit(occasion, 0);
// Update the final occasion timestamp for the present cgroup_id
bpf_map_update_elem(&cgroup_id_to_last_event_ts, &cgroup_id,
&now, BPF_ANY);
}
return 0;
}
Our userspace application, developed in Go, processes events from the ring buffer to emit metrics to our metrics backend, Atlas. Each event includes a run queue latency sample with a cgroup ID, which we associate with containers running on the host. If no such association is found, we categorize it as a system service. When a cgroup ID is associated with a container, we emit a percentile timer Atlas metric (runq.latency) for that container. We also increment a counter metric (sched.switch.out) to monitor preemptions occurring for the container's processes. Access to the prev_cgroup_id of the preempted process allows us to tag the metric with the cause of the preemption, whether it is due to a process within the same container (or cgroup), a process in another container, or a system service.
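The shape of that consumer loop looks roughly like the sketch below. This is a minimal illustration, assuming the eBPF objects are loaded with the cilium/ebpf library and a little-endian host; the runqEvent struct name and the emit callback are illustrative, not the actual implementation.

// Illustrative sketch of the ring buffer consumer (assumes cilium/ebpf).
package consumer

import (
    "bytes"
    "encoding/binary"
    "log"

    "github.com/cilium/ebpf"
    "github.com/cilium/ebpf/ringbuf"
)

// runqEvent mirrors struct runq_event defined in the eBPF program.
type runqEvent struct {
    PrevCgroupID uint64
    CgroupID     uint64
    RunqLat      uint64 // nanoseconds spent waiting on the run queue
    Ts           uint64
}

// consumeEvents reads runq_event records from the ring buffer map and passes
// them to emit, which would resolve the cgroup IDs to containers and publish
// the runq.latency and sched.switch.out metrics.
func consumeEvents(eventsMap *ebpf.Map, emit func(runqEvent)) error {
    rd, err := ringbuf.NewReader(eventsMap)
    if err != nil {
        return err
    }
    defer rd.Close()

    for {
        record, err := rd.Read()
        if err != nil {
            return err
        }
        var ev runqEvent
        // Decode the raw sample; the eBPF side writes native (little-endian) u64s.
        if err := binary.Read(bytes.NewReader(record.RawSample), binary.LittleEndian, &ev); err != nil {
            log.Printf("decode event: %v", err)
            continue
        }
        emit(ev)
    }
}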
It's important to highlight that both the runq.latency metric and the sched.switch.out metric are needed to determine whether a container is affected by noisy neighbors, which is the goal we aim to achieve; relying solely on the runq.latency metric can lead to misconceptions. For example, if a container is at or over its cgroup CPU limit, the scheduler will throttle it, resulting in an apparent spike in run queue latency due to delays in the queue. If we only considered this metric, we might incorrectly attribute the performance degradation to noisy neighbors when it is actually because the container is hitting its CPU quota. However, simultaneous spikes in both metrics, particularly when the cause is a different container or system process, clearly indicate a noisy neighbor issue.
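To make that reasoning concrete, a detection rule can require both signals before blaming a neighbor. The helper below is purely a hypothetical sketch: the thresholds, field names, and the idea of tracking a throttling flag per window are assumptions for illustration, not part of the actual pipeline.

// Hypothetical two-signal check; thresholds and fields are illustrative assumptions.
package consumer

// containerWindow summarizes one container over an aggregation window.
type containerWindow struct {
    RunqLatP99Micros    float64 // p99 of runq.latency in the window
    ExternalPreemptions float64 // sched.switch.out events caused by another container or a system service
    CPUThrottled        bool    // container hit its cgroup CPU quota during the window
}

func likelyNoisyNeighbor(w containerWindow, baselineMicros float64) bool {
    // No latency spike relative to the baseline: nothing to explain.
    if w.RunqLatP99Micros <= 5*baselineMicros {
        return false
    }
    // A spike while throttled and with no external preemptions points at the
    // container's own CPU quota rather than a neighbor.
    if w.CPUThrottled && w.ExternalPreemptions == 0 {
        return false
    }
    // Only attribute the spike to a noisy neighbor when preemptions by other
    // containers or system services rise at the same time.
    return w.ExternalPreemptions > 0
}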
Below is the runq.latency metric for a server running a single container with ample CPU capacity. The 99th percentile averages 83.4µs (microseconds), serving as our baseline. Although there are some spikes reaching 400µs, the latency remains within acceptable parameters.
