| author | Linus Torvalds <torvalds@linux-foundation.org> | 2025-12-03 13:25:39 -0800 |
|---|---|---|
| committer | Linus Torvalds <torvalds@linux-foundation.org> | 2025-12-03 13:25:39 -0800 |
| commit | 02baaa67d9afc2e56c6e1ac6a1fb1f1dd2be366f (patch) | |
| tree | 13ae2fec8be92b2f774cfb3fd725c027740be3ac /tools/sched_ext/scx_cpu0.bpf.c | |
| parent | 8449d3252c2603a51ffc7c36cb5bd94874378b7d (diff) | |
| parent | 1dd6c84f1c544e552848a8968599220bd464e338 (diff) | |
Merge tag 'sched_ext-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext updates from Tejun Heo:
- Improve recovery from misbehaving BPF schedulers.
When a scheduler puts many tasks with varying affinity restrictions
on a shared DSQ, CPUs scanning through tasks they cannot run can
overwhelm the system, causing lockups.
Bypass mode now uses per-CPU DSQs with a load balancer to avoid this,
and hooks into the hardlockup detector to attempt recovery.
Add scx_cpu0 example scheduler to demonstrate this scenario.
- Add lockless peek operation for DSQs to reduce lock contention for
schedulers that need to query queue state during load balancing.
- Allow scx_bpf_reenqueue_local() to be called from anywhere in
preparation for deprecating cpu_acquire/release() callbacks in favor
of generic BPF hooks.
- Prepare for hierarchical scheduler support: add
scx_bpf_task_set_slice() and scx_bpf_task_set_dsq_vtime() kfuncs,
make scx_bpf_dsq_insert*() return bool, and wrap kfunc args in
structs for a future aux__prog parameter (see the sketch after this
list).
- Implement cgroup_set_idle() callback to notify BPF schedulers when a
cgroup's idle state changes.
- Fix migration tasks being incorrectly downgraded from
stop_sched_class to rt_sched_class across sched_ext enable/disable.
Applied late as the fix is low risk and the bug, while subtle, needs
stable backporting.
- Various fixes and cleanups including cgroup exit ordering,
SCX_KICK_WAIT reliability, and backward compatibility improvements.
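
As a rough illustration of the hierarchical-prep item, the sketch below shows how a BPF scheduler's enqueue path might use the new kfuncs. Only the kfunc names and the bool return of scx_bpf_dsq_insert*() are taken from the summary above; the exact prototypes, the DSQ_DEMO/demo_enqueue names, and the wakeup-based slice policy are illustrative assumptions, not code from the tree.

```c
/* Illustrative enqueue path only; exact kfunc prototypes are assumed. */
#include <scx/common.bpf.h>

#define DSQ_DEMO	0	/* hypothetical custom DSQ created in ops.init() */

void BPF_STRUCT_OPS(demo_enqueue, struct task_struct *p, u64 enq_flags)
{
	/* Trim the slice of freshly woken tasks (illustrative policy). */
	if (enq_flags & SCX_ENQ_WAKEUP)
		scx_bpf_task_set_slice(p, SCX_SLICE_DFL / 2);

	/* Charge vtime via the new kfunc rather than writing p->scx.dsq_vtime. */
	scx_bpf_task_set_dsq_vtime(p, bpf_ktime_get_ns());

	/*
	 * scx_bpf_dsq_insert*() now returns bool, so a rejected insertion
	 * can fall back to the local DSQ instead of being silently lost.
	 */
	if (!scx_bpf_dsq_insert(p, DSQ_DEMO, SCX_SLICE_DFL, enq_flags))
		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, enq_flags);
}
```

In a real hierarchy-aware scheduler the slice and vtime would come from the parent level's accounting; constants are used here only to keep the sketch self-contained.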
* tag 'sched_ext-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (44 commits)
sched_ext: Fix incorrect sched_class settings for per-cpu migration tasks
sched_ext: tools: Removing duplicate targets during non-cross compilation
sched_ext: Use kvfree_rcu() to release per-cpu ksyncs object
sched_ext: Pass locked CPU parameter to scx_hardlockup() and add docs
sched_ext: Update comments replacing breather with aborting mechanism
sched_ext: Implement load balancer for bypass mode
sched_ext: Factor out abbreviated dispatch dequeue into dispatch_dequeue_locked()
sched_ext: Factor out scx_dsq_list_node cursor initialization into INIT_DSQ_LIST_CURSOR
sched_ext: Add scx_cpu0 example scheduler
sched_ext: Hook up hardlockup detector
sched_ext: Make handle_lockup() propagate scx_verror() result
sched_ext: Refactor lockup handlers into handle_lockup()
sched_ext: Make scx_exit() and scx_vexit() return bool
sched_ext: Exit dispatch and move operations immediately when aborting
sched_ext: Simplify breather mechanism with scx_aborting flag
sched_ext: Use per-CPU DSQs instead of per-node global DSQs in bypass mode
sched_ext: Refactor do_enqueue_task() local and global DSQ paths
sched_ext: Use shorter slice in bypass mode
sched_ext: Mark racy bitfields to prevent adding fields that can't tolerate races
sched_ext: Minor cleanups to scx_task_iter
...
Diffstat (limited to 'tools/sched_ext/scx_cpu0.bpf.c')
| -rw-r--r-- | tools/sched_ext/scx_cpu0.bpf.c | 88 |
1 file changed, 88 insertions, 0 deletions
diff --git a/tools/sched_ext/scx_cpu0.bpf.c b/tools/sched_ext/scx_cpu0.bpf.c
new file mode 100644
index 000000000000..6326ce598c8e
--- /dev/null
+++ b/tools/sched_ext/scx_cpu0.bpf.c
@@ -0,0 +1,88 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * A CPU0 scheduler.
+ *
+ * This scheduler queues all tasks to a shared DSQ and only dispatches them on
+ * CPU0 in FIFO order. This is useful for testing bypass behavior when many
+ * tasks are concentrated on a single CPU. If the load balancer doesn't work,
+ * bypass mode can trigger task hangs or RCU stalls as the queue is long and
+ * there's only one CPU working on it.
+ *
+ * It also demonstrates:
+ *
+ * - Statistics tracking how many tasks are queued to local and CPU0 DSQs.
+ * - Termination notification for userspace.
+ *
+ * Copyright (c) 2025 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2025 Tejun Heo <tj@kernel.org>
+ */
+#include <scx/common.bpf.h>
+
+char _license[] SEC("license") = "GPL";
+
+const volatile u32 nr_cpus = 32;	/* !0 for veristat, set during init */
+
+UEI_DEFINE(uei);
+
+/*
+ * We create a custom DSQ with ID 0 that we dispatch to and consume from on
+ * CPU0.
+ */
+#define DSQ_CPU0	0
+
+struct {
+	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
+	__uint(key_size, sizeof(u32));
+	__uint(value_size, sizeof(u64));
+	__uint(max_entries, 2);		/* [local, cpu0] */
+} stats SEC(".maps");
+
+static void stat_inc(u32 idx)
+{
+	u64 *cnt_p = bpf_map_lookup_elem(&stats, &idx);
+	if (cnt_p)
+		(*cnt_p)++;
+}
+
+s32 BPF_STRUCT_OPS(cpu0_select_cpu, struct task_struct *p, s32 prev_cpu, u64 wake_flags)
+{
+	return 0;
+}
+
+void BPF_STRUCT_OPS(cpu0_enqueue, struct task_struct *p, u64 enq_flags)
+{
+	/*
+	 * select_cpu() always picks CPU0. If @p is nevertheless being
+	 * enqueued on a different CPU, it can't run on CPU0 (e.g. its
+	 * affinity excludes it). Queue on whichever CPU it's currently on.
+	 */
+	if (scx_bpf_task_cpu(p) != 0) {
+		stat_inc(0);	/* count local queueing */
+		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
+		return;
+	}
+
+	stat_inc(1);	/* count cpu0 queueing */
+	scx_bpf_dsq_insert(p, DSQ_CPU0, SCX_SLICE_DFL, enq_flags);
+}
+
+void BPF_STRUCT_OPS(cpu0_dispatch, s32 cpu, struct task_struct *prev)
+{
+	if (cpu == 0)
+		scx_bpf_dsq_move_to_local(DSQ_CPU0);
+}
+
+s32 BPF_STRUCT_OPS_SLEEPABLE(cpu0_init)
+{
+	return scx_bpf_create_dsq(DSQ_CPU0, -1);
+}
+
+void BPF_STRUCT_OPS(cpu0_exit, struct scx_exit_info *ei)
+{
+	UEI_RECORD(uei, ei);
+}
+
+SCX_OPS_DEFINE(cpu0_ops,
+	       .select_cpu	= (void *)cpu0_select_cpu,
+	       .enqueue		= (void *)cpu0_enqueue,
+	       .dispatch	= (void *)cpu0_dispatch,
+	       .init		= (void *)cpu0_init,
+	       .exit		= (void *)cpu0_exit,
+	       .name		= "cpu0");
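
The file's header mentions per-CPU statistics and termination notification for userspace. Below is a rough sketch of what the matching userspace loader could look like, modeled on the other loaders in tools/sched_ext; the scx_cpu0.bpf.skel.h skeleton name and the SCX_OPS_*/UEI_* helpers from scx/common.h are assumptions based on those examples, not part of this diff.

```c
/*
 * Hypothetical userspace counterpart for scx_cpu0 (sketch only). The
 * skeleton header and the SCX_OPS_ / UEI_ helpers mirror the other
 * example loaders and are assumptions, not taken from the diff above.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <signal.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>
#include <scx/common.h>
#include "scx_cpu0.bpf.skel.h"

static volatile int exit_req;

static void sig_handler(int sig)
{
	exit_req = 1;
}

/* Sum the per-CPU counters for [local, cpu0] into stats[0] and stats[1]. */
static void read_stats(struct scx_cpu0 *skel, __u64 *stats)
{
	int nr_cpus = libbpf_num_possible_cpus();
	__u64 cnts[2][nr_cpus];
	__u32 idx;

	memset(stats, 0, sizeof(stats[0]) * 2);

	for (idx = 0; idx < 2; idx++) {
		int cpu;

		if (bpf_map_lookup_elem(bpf_map__fd(skel->maps.stats),
					&idx, cnts[idx]))
			continue;
		for (cpu = 0; cpu < nr_cpus; cpu++)
			stats[idx] += cnts[idx][cpu];
	}
}

int main(int argc, char **argv)
{
	struct scx_cpu0 *skel;
	struct bpf_link *link;

	signal(SIGINT, sig_handler);
	signal(SIGTERM, sig_handler);

	skel = SCX_OPS_OPEN(cpu0_ops, scx_cpu0);
	skel->rodata->nr_cpus = libbpf_num_possible_cpus();
	SCX_OPS_LOAD(skel, cpu0_ops, scx_cpu0, uei);
	link = SCX_OPS_ATTACH(skel, cpu0_ops, scx_cpu0);

	while (!exit_req && !UEI_EXITED(skel, uei)) {
		__u64 stats[2];

		read_stats(skel, stats);
		printf("local=%llu cpu0=%llu\n", stats[0], stats[1]);
		fflush(stdout);
		sleep(1);
	}

	bpf_link__destroy(link);
	UEI_REPORT(skel, uei);	/* report the exit info recorded by cpu0_exit() */
	scx_cpu0__destroy(skel);
	return 0;
}
```

Whether the real loader sets nr_cpus this way or prints different fields is not visible in this diff; the point of the sketch is only how UEI_EXITED()/UEI_REPORT() would consume the termination info that cpu0_exit() records via UEI_RECORD().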