path: root/tools/sched_ext
Age  Commit message  Author

2025-11-20  sched_ext: tools: Remove duplicate targets during non-cross compilation  (Rong Tao)

When cross-compilation is not used, BPFOBJ and HOST_BPFOBJ refer to the same file, libbpf.a, so the duplicate libbpf.a target should be removed.

Signed-off-by: Rong Tao <rongtao@cestc.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>

2025-11-12  sched_ext: Add scx_cpu0 example scheduler  (Tejun Heo)

Add scx_cpu0, a simple scheduler that queues all tasks to a single DSQ and only dispatches them from CPU0 in FIFO order. This is useful for testing bypass behavior when many tasks are concentrated on a single CPU. If the load balancer doesn't work, bypass mode can trigger task hangs or RCU stalls as the queue is long and there's only one CPU working on it.

v2: Check whether task is on CPU0 at enqueue using scx_bpf_task_cpu() instead of nr_cpus_allowed (Andrea Righi).

Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
Cc: Emil Tsalapatis <etsal@meta.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

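The shape of such a scheduler can be sketched as follows (an illustration of the idea, not the actual scx_cpu0 source; CPU0_DSQ is a made-up DSQ ID assumed to be created in ops.init()):

  #define CPU0_DSQ 0

  /* Queue every runnable task on one shared FIFO DSQ. */
  void BPF_STRUCT_OPS(cpu0_enqueue, struct task_struct *p, u64 enq_flags)
  {
          scx_bpf_dsq_insert(p, CPU0_DSQ, SCX_SLICE_DFL, enq_flags);
  }

  /* Only CPU0 ever pulls work from the shared DSQ. */
  void BPF_STRUCT_OPS(cpu0_dispatch, s32 cpu, struct task_struct *prev)
  {
          if (cpu == 0)
                  scx_bpf_dsq_move_to_local(CPU0_DSQ);
  }
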
2025-10-29  sched_ext/tools: Restore backward compat with v6.12 kernels  (Tejun Heo)

Commit 111a79800aed ("tools/sched_ext: Strip compatibility macros for cgroup and dispatch APIs") removed the compat layer for the v6.12-v6.13 kfunc renaming, but v6.12 is the current LTS kernel and will remain supported through 2026. Restore backward compatibility so schedulers built with v6.19+ headers can run on v6.12 kernels.

The restored compat differs from the original in two ways:

1. Uses ___new/___old suffixes instead of ___compat for clarity. The new macros check for v6.13+ names (scx_bpf_dsq_move*), fall back to v6.12 names (scx_bpf_dispatch_from_dsq*, scx_bpf_consume), then return safe no-ops for missing symbols.

2. Integrates with the args-struct-packing changes added in c0d630ba347c ("sched_ext: Wrap kfunc args in struct to prepare for aux__prog"). scx_bpf_dsq_insert_vtime() now tries __scx_bpf_dsq_insert_vtime (args struct), then scx_bpf_dsq_insert_vtime___compat (v6.13-v6.18), then scx_bpf_dispatch_vtime___compat (pre-v6.13).

Forward compatibility is not restored: binaries built against v6.13 or earlier headers won't run on v6.19+ kernels, as the old kfunc names are no longer exported. This is acceptable since the priority is new binaries running on older kernels.

Also add missing compat checks for ops.cgroup_set_bandwidth() (added in v6.17) and ops.cgroup_set_idle() (added in v6.19). These need to be NULLed out in userspace on older kernels.

Reported-by: Andrea Righi <arighi@nvidia.com>
Acked-and-tested-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

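A minimal sketch of the fallback pattern such compat macros rely on, assuming the usual weak-ksym approach of compat.bpf.h (the wrapper name compat_move_to_local() is hypothetical):

  /* Both generations declared as weak ksyms; at most one resolves. */
  bool scx_bpf_dsq_move_to_local___new(u64 dsq_id) __ksym __weak;
  bool scx_bpf_consume___old(u64 dsq_id) __ksym __weak;

  static inline bool compat_move_to_local(u64 dsq_id)
  {
          if (bpf_ksym_exists(scx_bpf_dsq_move_to_local___new))
                  return scx_bpf_dsq_move_to_local___new(dsq_id);  /* v6.13+ */
          if (bpf_ksym_exists(scx_bpf_consume___old))
                  return scx_bpf_consume___old(dsq_id);            /* v6.12 */
          return false;  /* safe no-op when neither symbol exists */
  }
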
2025-10-29  sched_ext: Allow scx_bpf_reenqueue_local() to be called from anywhere  (Tejun Heo)

The ops.cpu_acquire/release() callbacks miss events under multiple conditions. There are two distinct task dispatch gaps that can cause cpu_released flag desynchronization:

1. balance-to-pick_task gap: This is what was originally reported. balance_scx() can enqueue a task, but during consume_remote_task(), when the rq lock is released, a higher priority task can be enqueued and ultimately picked while cpu_released remains false. This gap is closeable via RETRY_TASK handling.

2. ttwu-to-pick_task gap: ttwu() can directly dispatch a task to a CPU's local DSQ. By the time the sched path runs on the target CPU, higher class tasks may already be queued. In such cases, nothing on the sched_ext side will be invoked, and the only solution would be a hook invoked regardless of sched class, which isn't desirable.

Rather than adding invasive core hooks, BPF schedulers can use generic BPF mechanisms like tracepoints. From an SCX scheduler's perspective, this is congruent with other mechanisms it already uses and doesn't add further friction.

The main use case for cpu_release() was calling scx_bpf_reenqueue_local() when a CPU gets preempted by a higher priority scheduling class. However, the old scx_bpf_reenqueue_local() could only be called from cpu_release() context. Add a new version of scx_bpf_reenqueue_local() that can be called from any context by deferring the actual re-enqueue operation. This eliminates the need for the cpu_acquire/release() ops entirely. Schedulers can now use standard BPF mechanisms like the sched_switch tracepoint to detect and handle CPU preemption.

Update scx_qmap to demonstrate the new approach using sched_switch instead of cpu_release, with compat support for older kernels. Mark cpu_acquire/release() as deprecated. The old scx_bpf_reenqueue_local() variant will be removed in v6.23.

Reported-by: Wen-Fang Liu <liuwenfang@honor.com>
Link: https://lore.kernel.org/all/8d64c74118c6440f81bcf5a4ac6b9f00@honor.com/
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>

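A hedged sketch of the tracepoint-based approach; comparing p->policy against SCHED_EXT is one plausible way to detect that a higher class took over, not necessarily what scx_qmap itself does:

  SEC("tp_btf/sched_switch")
  int BPF_PROG(on_sched_switch, bool preempt, struct task_struct *prev,
               struct task_struct *next)
  {
          /*
           * An SCX task is being switched out for a non-SCX task, i.e.
           * the CPU was claimed by a higher scheduling class: punt all
           * tasks on the local DSQ back to the BPF scheduler.
           */
          if (prev->policy == SCHED_EXT && next->policy != SCHED_EXT)
                  scx_bpf_reenqueue_local();
          return 0;
  }
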
2025-10-24  sched_ext: Add ___compat suffix to scx_bpf_dsq_insert___v2 in compat.bpf.h  (Tejun Heo)

2dbbdeda77a6 ("sched_ext: Fix scx_bpf_dsq_insert() backward binary compatibility") renamed the new bool-returning variant to scx_bpf_dsq_insert___v2 in the kernel. However, libbpf currently only strips ___SUFFIX on the BPF side, not on kernel symbols, so the compat wrapper couldn't match the kernel kfunc and would always fall back to the old variant even when the new one was available.

Add an extra ___compat suffix as a workaround: libbpf strips one suffix on the BPF side, leaving ___v2, which then matches the kernel kfunc directly. In the future, when libbpf strips all suffixes on both sides, all suffixes can be dropped.

Fixes: 2dbbdeda77a6 ("sched_ext: Fix scx_bpf_dsq_insert() backward binary compatibility")
Cc: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>

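Concretely, the declaration takes roughly this shape (a sketch following the commit's description, not the verbatim header):

  /*
   * BPF-side name:                  scx_bpf_dsq_insert___v2___compat
   * after libbpf strips one suffix: scx_bpf_dsq_insert___v2
   * which matches the kernel kfunc of that exact name.
   */
  bool scx_bpf_dsq_insert___v2___compat(struct task_struct *p, u64 dsq_id,
                                        u64 slice, u64 enq_flags) __ksym __weak;
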
2025-10-21  sched_ext: Fix scx_bpf_dsq_insert() backward binary compatibility  (Tejun Heo)

cded46d97159 ("sched_ext: Make scx_bpf_dsq_insert*() return bool") introduced a new bool-returning scx_bpf_dsq_insert() and renamed the old void-returning version to scx_bpf_dsq_insert___compat, with the expectation that libbpf would match old binaries to the ___compat variant, maintaining backward binary compatibility. However, while libbpf ignores ___SUFFIX on the BPF side when matching symbols, it doesn't do so for kernel-side symbols. Old binaries compiled with the original scx_bpf_dsq_insert() could no longer resolve the symbol.

Fix by reversing the naming: keep scx_bpf_dsq_insert() as the old void-returning interface and add ___v2 to the new bool-returning version. This allows old binaries to continue working while new code can use the ___v2 variant. Once libbpf is updated to ignore kernel-side ___SUFFIX, the ___v2 suffix can be dropped when the compat interface is removed.

v2: Use ___v2 instead of ___new.

Fixes: cded46d97159 ("sched_ext: Make scx_bpf_dsq_insert*() return bool")
Signed-off-by: Tejun Heo <tj@kernel.org>

2025-10-15  sched_ext: Add lockless peek operation for DSQs  (Ryan Newton)

The builtin DSQ data structures are meant to be used by a wide range of sched_ext schedulers with different demands: they might be low-contention per-CPU queues or high-contention shared queues. Unfortunately, DSQs have a coarse-grained lock around the whole data structure. Without going all the way to a lock-free, more scalable implementation, a small step we can take to reduce lock contention is to allow a lockless, small-fixed-cost peek at the head of the queue.

This change allows certain custom SCX schedulers to cheaply peek at queues, e.g. during load balancing, before locking them. The cost is a few extra memory operations to update the pointer each time the DSQ is modified, including a memory barrier on ARM so the write appears correctly ordered.

This commit adds a first_task pointer field which is updated atomically when the DSQ is modified, allowing any thread to peek at the head of the queue without holding the lock.

Signed-off-by: Ryan Newton <newton@meta.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

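The protocol can be sketched as follows; field and helper names here are illustrative, not the kernel's actual identifiers:

  struct scx_dsq_sketch {
          struct task_struct *first_task;  /* lockless-peek cache */
          /* ... list head, lock, nr, ... */
  };

  /* Writer side, called under the DSQ lock whenever the queue changes. */
  static void dsq_update_first(struct scx_dsq_sketch *dsq,
                               struct task_struct *head)
  {
          /* release ordering: task fields are visible before the pointer */
          smp_store_release(&dsq->first_task, head);
  }

  /* Reader side, no lock held; the result is a hint and may go stale. */
  static struct task_struct *dsq_peek(struct scx_dsq_sketch *dsq)
  {
          return smp_load_acquire(&dsq->first_task);
  }
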
2025-10-13  sched_ext/tools: Add compat wrapper for scx_bpf_task_set_slice/dsq_vtime()  (Tejun Heo)

scx_bpf_task_set_slice() and scx_bpf_task_set_dsq_vtime() were added for sub-scheduler authority checks. Add compat wrappers which fall back to direct p->scx field writes on older kernels.

Suggested-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

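Such a wrapper plausibly looks like this (the exact kfunc signature and the wrapper name are assumptions):

  void scx_bpf_task_set_slice(struct task_struct *p, u64 slice) __ksym __weak;

  static inline void compat_task_set_slice(struct task_struct *p, u64 slice)
  {
          if (bpf_ksym_exists(scx_bpf_task_set_slice))
                  scx_bpf_task_set_slice(p, slice);  /* kfunc available */
          else
                  p->scx.slice = slice;              /* older kernels */
  }
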
2025-10-13  sched_ext: Make scx_bpf_dsq_insert*() return bool  (Tejun Heo)

In preparation for hierarchical schedulers, change scx_bpf_dsq_insert() and scx_bpf_dsq_insert_vtime() to return bool instead of void. With sub-schedulers, there will be no reliable way to guarantee a task is still owned by the sub-scheduler at insertion time (e.g., the task may have been migrated to another scheduler). The bool return value will enable sub-schedulers to detect and gracefully handle insertion failures.

For the root scheduler, insertion failures will continue to trigger scheduler abort via scx_error(), so existing code doesn't need to check the return value. Backward compatibility is maintained through compat wrappers.

Also update scx_bpf_dsq_move() documentation to clarify that it can return false for sub-schedulers when @dsq_id points to a disallowed local DSQ.

Reviewed-by: Changwoo Min <changwoo@igalia.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

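For a future sub-scheduler, usage in ops.enqueue() might look like this hedged fragment (MY_DSQ is illustrative); root schedulers can keep ignoring the return value:

  if (!scx_bpf_dsq_insert(p, MY_DSQ, SCX_SLICE_DFL, enq_flags)) {
          /* @p is no longer ours, e.g. migrated to another scheduler */
          return;
  }
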
2025-10-13  sched_ext: Wrap kfunc args in struct to prepare for aux__prog  (Tejun Heo)

scx_bpf_dsq_insert_vtime() and scx_bpf_select_cpu_and() currently have 5 parameters. An upcoming change will add an aux__prog parameter, which would exceed BPF's 5-argument limit. Prepare by adding new kfuncs __scx_bpf_dsq_insert_vtime() and __scx_bpf_select_cpu_and() that take args structs. The existing kfuncs are kept as compatibility wrappers.

BPF programs use inline wrappers that detect the kernel API version via bpf_core_type_exists() and use the new struct-based kfuncs when available, falling back to the compat kfuncs otherwise. This allows BPF programs to work with both old and new kernels.

Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

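A hedged sketch of the wrapper pattern; the args struct layout and field names are assumptions, not the kernel's actual definition:

  struct scx_dsq_insert_vtime_args {
          u64 dsq_id;
          u64 slice;
          u64 vtime;
          u64 enq_flags;
  };

  bool __scx_bpf_dsq_insert_vtime(struct task_struct *p,
                                  struct scx_dsq_insert_vtime_args *args)
                                  __ksym __weak;

  static inline void insert_vtime_wrapper(struct task_struct *p, u64 dsq_id,
                                          u64 slice, u64 vtime, u64 enq_flags)
  {
          if (bpf_core_type_exists(struct scx_dsq_insert_vtime_args)) {
                  /* new kernels: pack the tail args into a struct */
                  struct scx_dsq_insert_vtime_args args = {
                          .dsq_id = dsq_id, .slice = slice,
                          .vtime = vtime, .enq_flags = enq_flags,
                  };
                  __scx_bpf_dsq_insert_vtime(p, &args);
          } else {
                  /* old kernels: 5-argument compat kfunc */
                  scx_bpf_dsq_insert_vtime(p, dsq_id, slice, vtime, enq_flags);
          }
  }
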
2025-10-13  sched_ext: Add scx_bpf_task_set_slice() and scx_bpf_task_set_dsq_vtime()  (Tejun Heo)

With the planned hierarchical scheduler support, sub-schedulers will need to be verified for authority before being allowed to modify task->scx.slice and task->scx.dsq_vtime. Add scx_bpf_task_set_slice() and scx_bpf_task_set_dsq_vtime() which will perform the necessary permission checks. Root schedulers can still directly write to these fields, so this doesn't affect existing schedulers.

Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

2025-10-13  tools/sched_ext: Strip compatibility macros for cgroup and dispatch APIs  (Tejun Heo)

Enough time has passed since the introduction of scx_bpf_task_cgroup() and the scx_bpf_dispatch* -> scx_bpf_dsq* kfunc renaming. Strip the compatibility macros.

Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

2025-09-23  tools/sched_ext: scx_qmap: Make debug output quieter by default  (Tejun Heo)

scx_qmap currently outputs verbose debug messages, including cgroup operations and CPU online/offline events, by default, which can be noisy during normal operation. While the existing -P option controls DSQ dumps and event statistics, there's no way to suppress the other debug messages.

Split the debug output controls to make scx_qmap quieter by default. The -P option continues to control DSQ dumps and event statistics (print_dsqs_and_events), while a new -M option controls debug messages like cgroup operations and CPU events (print_msgs). This allows users to run scx_qmap with minimal output and selectively enable debug information as needed.

Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

2025-09-23  sched_ext: Make qmap dump operation non-destructive  (Tejun Heo)

The qmap dump operation was destructively consuming queue entries while displaying them. As dump can be triggered anytime, this can easily lead to stalls. Add a temporary dump_store queue and modify the dump logic to pop entries, display them, and then restore them back to the original queue. This allows dump operations to be performed without affecting the scheduler's queue state. Note that if racing against new enqueues during dump, ordering can get mixed up, but this is acceptable for debugging purposes.

Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

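The pop-print-stash-restore cycle can be sketched with BPF queue maps roughly as follows (map names, value type, and sizes are illustrative, not scx_qmap's actual definitions):

  struct qmap {
          __uint(type, BPF_MAP_TYPE_QUEUE);
          __uint(max_entries, 4096);
          __type(value, u32);
  };

  struct qmap queue SEC(".maps");
  struct qmap dump_store SEC(".maps");

  static void dump_queue_nondestructively(void)
  {
          u32 pid;

          /* pop each entry, print it, and stash it in dump_store */
          bpf_repeat(4096) {
                  if (bpf_map_pop_elem(&queue, &pid))
                          break;
                  bpf_printk("queued: %u", pid);
                  bpf_map_push_elem(&dump_store, &pid, 0);
          }
          /* drain dump_store back; concurrent enqueues may reorder */
          bpf_repeat(4096) {
                  if (bpf_map_pop_elem(&dump_store, &pid))
                          break;
                  bpf_map_push_elem(&queue, &pid, 0);
          }
  }
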
2025-09-03  tools/sched_ext: Add compat helper for scx_bpf_cpu_curr()  (Andrea Righi)

Introduce a compatibility helper that allows BPF schedulers to use scx_bpf_cpu_curr() on older kernels.

Fixes: 20b158094a1ad ("sched_ext: Introduce scx_bpf_cpu_curr()")
Cc: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

2025-09-03  sched_ext: Introduce scx_bpf_cpu_curr()  (Christian Loehle)

Provide scx_bpf_cpu_curr() as a way for scx schedulers to check the curr task of a remote rq without assuming its lock is held. Many scx schedulers make use of scx_bpf_cpu_rq() to check a remote curr (e.g. to see if it should be preempted). This is problematic because scx_bpf_cpu_rq() provides access to all fields of struct rq, most of which aren't safe to use without holding the associated rq lock.

Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

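Typical usage, hedged: peek at a remote CPU's current task and only read fields that are safe under RCU (the slice check is just an example trigger for a preemption decision):

  static bool remote_curr_slice_expired(s32 cpu)
  {
          struct task_struct *curr = scx_bpf_cpu_curr(cpu);

          /* NULL when nothing meaningful is running there */
          return curr && curr->scx.slice == 0;
  }
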
2025-09-03  sched_ext: Introduce scx_bpf_locked_rq()  (Christian Loehle)

Most fields of the rq returned by scx_bpf_cpu_rq() assume that the rq lock is held, and they are meaningless without it. Provide a safer variant, scx_bpf_locked_rq(), that only returns a rq if the caller holds that rq's lock. Also annotate the new scx_bpf_locked_rq() as possibly returning NULL, as scx_bpf_cpu_rq() should have been.

Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

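Usage then reduces to a NULL check, sketched here under the assumption that it runs inside an ops callback that may or may not hold the rq lock:

  struct rq *rq = scx_bpf_locked_rq();

  if (!rq)
          return;  /* we don't hold any rq lock; don't touch rq fields */
  /* rq fields are safe to read here */
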
2025-08-11  tools/sched_ext: Receive updates from SCX repo  (Andrea Righi)

Receive tools/sched_ext updates from https://github.com/sched-ext/scx to sync userspace bits:

- basic BPF arena allocator abstractions,
- additional process flags definitions,
- fixed is_migration_disabled() helper,
- separate out user_exit_info BPF and user space code.

This also fixes the following warning when building the selftests:

  tools/sched_ext/include/scx/common.bpf.h:550:9: warning: 'likely' macro redefined [-Wmacro-redefined]
    550 | #define likely(x) __builtin_expect(!!(x), 1)
        |         ^

Co-developed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

2025-06-20  sched_ext: Add support for cgroup bandwidth control interface  (Tejun Heo)

- Add CONFIG_GROUP_SCHED_BANDWIDTH which is selected by both CONFIG_CFS_BANDWIDTH and EXT_GROUP_SCHED.
- Put bandwidth control interface files for both cgroup v1 and v2 under CONFIG_GROUP_SCHED_BANDWIDTH.
- Update tg_bandwidth() to fetch configuration parameters from fair if CONFIG_CFS_BANDWIDTH, SCX otherwise.
- Update tg_set_bandwidth() to update the parameters for both fair and SCX.
- Add bandwidth control parameters to struct scx_cgroup_init_args.
- Add sched_ext_ops.cgroup_set_bandwidth() which is invoked on bandwidth control parameter updates.
- Update scx_qmap and the maximal selftest to test the new feature.

Signed-off-by: Tejun Heo <tj@kernel.org>

2025-04-18  sched_ext: Change the variable name for the slice refill event  (Honglei Wang)

SCX_EV_ENQ_SLICE_DFL gives the impression that the event only occurs when tasks are enqueued, which is not accurate. What it actually counts is refills of the default slice, which can happen at either enqueue or pick_task time. Rename the variable to SCX_EV_REFILL_SLICE_DFL.

Signed-off-by: Honglei Wang <jameshongleiwang@126.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

2025-04-14  sched_ext: Improve cross-compilation support in Makefile  (yangsonghua)

Modify the tools/sched_ext/Makefile to better handle cross-compilation environments:

1. Fix host tools build directory structure by separating obj/ from output (HOST_BUILD_DIR now points to $(OBJ_DIR)/host/obj)
2. Properly propagate CROSS_COMPILE to libbpf sub-make invocation
3. Add missing $(HOST_BPFOBJ) build rule with proper host toolchain flags (ARCH=, CROSS_COMPILE=, explicit HOSTCC/HOSTLD)
4. Consistently quote $(HOSTCC) in bpftool build rule
5. Change LDFLAGS assignment to += to allow external extensions

The changes ensure proper cross-compilation behavior while maintaining backward compatibility with native builds. Host tools are now correctly built with the host toolchain while target binaries use the cross-toolchain.

Signed-off-by: yangsonghua <yangsonghua@lixiang.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

2025-04-08  sched_ext: Merge branch 'for-6.15-fixes' into for-6.16  (Tejun Heo)

Pull for-6.15-fixes to receive:

  e776b26e3701 ("sched_ext: Remove cpu.weight / cpu.idle unimplemented warnings")

which conflicts with:

  1a7ff7216c8b ("sched_ext: Drop "ops" from scx_ops_enable_state and friends")

The former removes code updated by the latter. Resolved by removing the updated section.

Signed-off-by: Tejun Heo <tj@kernel.org>

2025-04-08  sched_ext: Mark SCX_OPS_HAS_CGROUP_WEIGHT for deprecation  (Tejun Heo)

SCX_OPS_HAS_CGROUP_WEIGHT was only used to suppress the missing cgroup weight support warnings. Now that the warnings are removed, the flag doesn't do anything. Mark it for deprecation and remove its usage from scx_flatcg.

v2: Actually include the scx_flatcg update.

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-and-reviewed-by: Andrea Righi <arighi@nvidia.com>

2025-04-07  sched_ext: idle: Introduce scx_bpf_select_cpu_and()  (Andrea Righi)

Provide a new kfunc, scx_bpf_select_cpu_and(), that can be used to apply the built-in idle CPU selection policy to a subset of the allowed CPUs.

This new helper is basically an extension of scx_bpf_select_cpu_dfl(). However, when an idle CPU can't be found, it returns a negative value instead of @prev_cpu, aligning its behavior more closely with scx_bpf_pick_idle_cpu(). It also accepts %SCX_PICK_IDLE_* flags, which can be used to enforce strict selection to @prev_cpu's node (%SCX_PICK_IDLE_IN_NODE), or to request only a full-idle SMT core (%SCX_PICK_IDLE_CORE), while applying the built-in selection logic.

With this helper, BPF schedulers can apply the built-in idle CPU selection policy restricted to any arbitrary subset of CPUs.

Example usage
=============

Possible usage in ops.select_cpu():

  s32 BPF_STRUCT_OPS(foo_select_cpu, struct task_struct *p,
                     s32 prev_cpu, u64 wake_flags)
  {
          const struct cpumask *cpus = task_allowed_cpus(p) ?: p->cpus_ptr;
          s32 cpu;

          cpu = scx_bpf_select_cpu_and(p, prev_cpu, wake_flags, cpus, 0);
          if (cpu >= 0) {
                  scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
                  return cpu;
          }

          return prev_cpu;
  }

Results
=======

Load distribution on a 4 sockets, 4 cores per socket system, simulated using virtme-ng, running a modified version of scx_bpfland that uses scx_bpf_select_cpu_and() with 0xff00 as the allowed subset of CPUs:

  $ vng --cpu 16,sockets=4,cores=4,threads=1
  ...
  $ stress-ng -c 16
  ...
  $ htop
  ...
  0[    0.0%]   8[||||||||||||||||||||||||100.0%]
  1[    0.0%]   9[||||||||||||||||||||||||100.0%]
  2[    0.0%]  10[||||||||||||||||||||||||100.0%]
  3[    0.0%]  11[||||||||||||||||||||||||100.0%]
  4[    0.0%]  12[||||||||||||||||||||||||100.0%]
  5[    0.0%]  13[||||||||||||||||||||||||100.0%]
  6[    0.0%]  14[||||||||||||||||||||||||100.0%]
  7[    0.0%]  15[||||||||||||||||||||||||100.0%]

With scx_bpf_select_cpu_dfl() tasks would be distributed evenly across all the available CPUs.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

2025-04-04sched_ext: Drop "ops" from scx_ops_bypass(), scx_ops_breather() and friendsTejun Heo
The tag "ops" is used for two different purposes. First, to indicate that the entity is directly related to the operations such as flags carried in sched_ext_ops. Second, to indicate that the entity applies to something global such as enable or bypass states. The second usage is historical and causes confusion rather than clarifying anything. For example, scx_ops_enable_state enums are named SCX_OPS_* and thus conflict with scx_ops_flags. Let's drop the second usages. Drop "ops" from scx_ops_bypass(), scx_ops_breather() and friends. Update scx_show_state.py accordingly. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-04sched_ext: Drop "ops" from scx_ops_helper, scx_ops_enable_mutex and ↵Tejun Heo
__scx_ops_enabled The tag "ops" is used for two different purposes. First, to indicate that the entity is directly related to the operations such as flags carried in sched_ext_ops. Second, to indicate that the entity applies to something global such as enable or bypass states. The second usage is historical and causes confusion rather than clarifying anything. For example, scx_ops_enable_state enums are named SCX_OPS_* and thus conflict with scx_ops_flags. Let's drop the second usages. Drop "ops" from scx_ops_helper, scx_ops_enable_mutex and __scx_ops_enabled. Update scx_show_state.py accordingly. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-04sched_ext: Drop "ops" from scx_ops_enable_state and friendsTejun Heo
The tag "ops" is used for two different purposes. First, to indicate that the entity is directly related to the operations such as flags carried in sched_ext_ops. Second, to indicate that the entity applies to something global such as enable or bypass states. The second usage is historical and causes confusion rather than clarifying anything. For example, scx_ops_enable_state enums are named SCX_OPS_* and thus conflict with scx_ops_flags. Let's drop the second usages. Drop "ops" from scx_ops_enable_state and friends. Update scx_show_state.py accordingly. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-02  tools/sched_ext: Sync with scx repo  (Tejun Heo)

Synchronize with https://github.com/sched-ext/scx at dc44584874f0 ("kernel: Synchronize with kernel tools/sched_ext").

- READ/WRITE_ONCE() is made more proper and READA_ONCE_ARENA() is dropped.
- scale_by_task_weight[_inverse]() helpers added.
- Enum defs expanded to cover more and new enums.
- Don't trigger fatal error when some enums are missing from kernel BTF.

Signed-off-by: Tejun Heo <tj@kernel.org>

2025-03-04  sched_ext: Change the event type from u64 to s64  (Changwoo Min)

The event count could be negative in the future, so change the event type from u64 to s64.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

2025-03-03  sched_ext: Merge branch 'for-6.14-fixes' into for-6.15  (Tejun Heo)

Pull for-6.14-fixes to receive:

  9360dfe4cbd6 ("sched_ext: Validate prev_cpu in scx_bpf_select_cpu_dfl()")

which conflicts with:

  337d1b354a29 ("sched_ext: Move built-in idle CPU selection policy to a separate file")

Signed-off-by: Tejun Heo <tj@kernel.org>

2025-02-27  tools/sched_ext: Provide a compatible helper for scx_bpf_events()  (Andrea Righi)

Introduce __COMPAT_scx_bpf_events() to use scx_bpf_events() in a compatible way, even with kernels that don't provide this kfunc.

This also fixes the following error with scx_qmap when running on a kernel that does not provide scx_bpf_events():

  ; scx_bpf_events(&events, sizeof(events)); @ scx_qmap.bpf.c:777
  318: (b7) r2 = 72                     ; R2_w=72 async_cb
  319: <invalid kfunc call>
  kfunc 'scx_bpf_events' is referenced but wasn't resolved

Fixes: 9865f31d852a4 ("sched_ext: Add scx_bpf_events() and scx_read_event() for BPF schedulers")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

2025-02-25  tools/sched_ext: Provide consistent access to scx flags  (Andrea Righi)

Make all the SCX_OPS_* and SCX_PICK_IDLE_* flags available to the user-space part of the schedulers via the compat interface. This allows schedulers / selftests to set all the ops flags in user-space, rather than having them split between BPF and user-space.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

2025-02-24  sched_ext: idle: Introduce scx_bpf_nr_node_ids()  (Andrea Righi)

Similarly to scx_bpf_nr_cpu_ids(), introduce a new kfunc scx_bpf_nr_node_ids() to expose the maximum number of NUMA nodes in the system. BPF schedulers can use this information together with the new node-aware kfuncs, for example to create per-node DSQs, validate node IDs, etc.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

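For example, per-node DSQs could be created at init time along these lines (a sketch; using the node ID as the DSQ ID is just an illustrative convention):

  s32 BPF_STRUCT_OPS_SLEEPABLE(foo_init)
  {
          int node;

          bpf_for(node, 0, scx_bpf_nr_node_ids()) {
                  s32 err = scx_bpf_create_dsq(node, node);

                  if (err)
                          return err;
          }
          return 0;
  }
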
2025-02-18  sched_ext: idle: Introduce node-aware idle cpu kfunc helpers  (Andrea Righi)

Introduce a new kfunc to retrieve the node associated to a CPU:

  int scx_bpf_cpu_node(s32 cpu)

Add the following kfuncs to provide BPF schedulers direct access to per-node idle cpumask information:

  const struct cpumask *scx_bpf_get_idle_cpumask_node(int node)
  const struct cpumask *scx_bpf_get_idle_smtmask_node(int node)
  s32 scx_bpf_pick_idle_cpu_node(const cpumask_t *cpus_allowed,
                                 int node, u64 flags)
  s32 scx_bpf_pick_any_cpu_node(const cpumask_t *cpus_allowed,
                                int node, u64 flags)

Moreover, trigger an scx error when any of the non-node-aware idle CPU kfuncs are used when SCX_OPS_BUILTIN_IDLE_PER_NODE is enabled.

Cc: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

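A hedged usage sketch: prefer an idle CPU on @prev_cpu's node, then fall back to any idle CPU (the helper name is illustrative):

  static s32 pick_idle_near(struct task_struct *p, s32 prev_cpu)
  {
          int node = scx_bpf_cpu_node(prev_cpu);
          s32 cpu;

          /* strict: only consider idle CPUs within @prev_cpu's node */
          cpu = scx_bpf_pick_idle_cpu_node(p->cpus_ptr, node,
                                           SCX_PICK_IDLE_IN_NODE);
          if (cpu >= 0)
                  return cpu;

          /* relaxed: any idle CPU the task is allowed on */
          return scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);
  }
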
2025-02-16  sched_ext: idle: Per-node idle cpumasks  (Andrea Righi)

Using a single global idle mask can lead to inefficiencies and a lot of stress on the cache coherency protocol on large systems with multiple NUMA nodes, since all the CPUs can create a really intense read/write activity on the single global cpumask. Therefore, split the global cpumask into multiple per-NUMA node cpumasks to improve scalability and performance on large systems.

The concept is that each cpumask will track only the idle CPUs within its corresponding NUMA node, treating CPUs in other NUMA nodes as busy. In this way, concurrent access to the idle cpumask will be restricted within each NUMA node.

The split into multiple per-node idle cpumasks can be controlled using the SCX_OPS_BUILTIN_IDLE_PER_NODE flag. By default, SCX_OPS_BUILTIN_IDLE_PER_NODE is not enabled and a global host-wide idle cpumask is used, maintaining the previous behavior.

NOTE: if a scheduler explicitly enables the per-node idle cpumasks (via SCX_OPS_BUILTIN_IDLE_PER_NODE), scx_bpf_get_idle_cpu/smtmask() will trigger an scx error, since there are no system-wide cpumasks.

= Test =

Hardware:
- System: DGX B200
- CPUs: 224 SMT threads (112 physical cores)
- Processor: INTEL(R) XEON(R) PLATINUM 8570
- 2 NUMA nodes

Scheduler:
- scx_simple [1] (so that we can focus on the built-in idle selection policy and not on the scheduling policy itself)

Test:
- Run a parallel kernel build `make -j $(nproc)` and measure the average elapsed time over 10 runs:

           avg time | stdev
  ---------+--------+------
  before:  52.431s  | 2.895
  after:   50.342s  | 2.895

= Conclusion =

Splitting the global cpumask into multiple per-NUMA cpumasks helped to achieve a speedup of approximately +4% with this particular architecture and test case.

The same test on a DGX-1 (40 physical cores, Intel Xeon E5-2698 v4 @ 2.20GHz, 2 NUMA nodes) shows a speedup of around 1.5-3%.

On smaller systems, I haven't noticed any measurable regressions or improvements with the same test (parallel kernel build) and scheduler (scx_simple).

Moreover, with a modified scx_bpfland that uses the new NUMA-aware APIs, I observed an additional +2-2.5% performance improvement with the same test.

[1] https://github.com/sched-ext/scx/blob/main/scheds/c/scx_simple.bpf.c

Cc: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

2025-02-16  sched_ext: idle: Introduce SCX_OPS_BUILTIN_IDLE_PER_NODE  (Andrea Righi)

Add the new scheduler flag SCX_OPS_BUILTIN_IDLE_PER_NODE, which allows BPF schedulers to select between using a global flat idle cpumask or multiple per-node cpumasks. This only introduces the flag and the mechanism to enable/disable this feature without affecting any scheduling behavior.

Cc: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

2025-02-14  tools/sched_ext: Sync with scx repo  (Tejun Heo)

Synchronize with https://github.com/sched-ext/scx at d384453984a0 ("kernel: Sync at ad3b301aa05a ("sched_ext: Provides a sysfs 'events' to expose core event counters")").

Signed-off-by: Tejun Heo <tj@kernel.org>

2025-02-13  sched_ext: Fix the incorrect bpf_list kfunc API in common.bpf.h  (Chuyi Zhou)

BPF currently only supports the bpf_list_push_{front,back}_impl() kfuncs, not bpf_list_push_{front,back}(). This patch fixes the issue. Without it, using the bpf_list kfuncs in scx makes the BPF verifier complain:

  libbpf: extern (func ksym) 'bpf_list_push_back': not found in kernel or module BTFs
  libbpf: failed to load object 'scx_foo'
  libbpf: failed to load BPF skeleton 'scx_foo': -EINVAL

With this patch, the bpf list kfuncs work as expected.

Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Fixes: 2a52ca7c98960 ("sched_ext: Add scx_simple and scx_example_qmap example schedulers")
Signed-off-by: Tejun Heo <tj@kernel.org>

2025-02-10  tools/sched_ext: Update enum_defs.autogen.h  (Changwoo Min)

Add where the script is located to the comment lines of the header file. This helps anyone re-generate the header file if required. Note that this is a sync from the PR [1] in the scx repo.

[1] https://github.com/sched-ext/scx/pull/1322

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

2025-02-08  tools/sched_ext: Compatible testing of SCX_ENQ_CPU_SELECTED  (Changwoo Min)

This provides compatible testing of SCX_ENQ_CPU_SELECTED. More specifically, it handles two cases:

1. A BPF scheduler is compiled against a vmlinux.h where SCX_ENQ_CPU_SELECTED is defined, but it runs on a kernel that does not have SCX_ENQ_CPU_SELECTED. In this case, the test 'enq_flags & SCX_ENQ_CPU_SELECTED' will always be false. That result is semantically incorrect because a kernel predating SCX_ENQ_CPU_SELECTED never skipped select_task_rq_scx(), so the result should be true.

2. A BPF scheduler is compiled against a vmlinux.h where SCX_ENQ_CPU_SELECTED is not defined. In this case, directly using SCX_ENQ_CPU_SELECTED causes compilation errors.

To hide such complexity, introduce __COMPAT_is_enq_cpu_selected(), which checks if SCX_ENQ_CPU_SELECTED exists at runtime using BPF CO-RE. This consists of three parts:

1. Add enum_defs.autogen.h, which has macros (HAVE_{enum name}) denoting whether SCX enums are defined in the vmlinux.h or not.
2. Implement __COMPAT_is_enq_cpu_selected(), which provides the test of SCX_ENQ_CPU_SELECTED in a compatible way.
3. Use __COMPAT_is_enq_cpu_selected() in scx_qmap.

Note that this is a sync of the relevant PR [1] in the scx repo.

[1] https://github.com/sched-ext/scx/pull/1314

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

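Putting the three parts together, the helper plausibly looks like this (a sketch consistent with the description above, not the verbatim header):

  static inline bool __COMPAT_is_enq_cpu_selected(u64 enq_flags)
  {
  #ifdef HAVE_SCX_ENQ_CPU_SELECTED
          /* flag visible at compile time; check the running kernel too */
          if (bpf_core_enum_value_exists(enum scx_enq_flags,
                                         SCX_ENQ_CPU_SELECTED))
                  return enq_flags &
                         bpf_core_enum_value(enum scx_enq_flags,
                                             SCX_ENQ_CPU_SELECTED);
  #endif
          /* kernels predating the flag never skipped select_task_rq_scx() */
          return true;
  }
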
2025-02-08  tools/sched_ext: Event counter dumping updates  (Tejun Heo)

- There's no need to dump event counters from both scx_qmap and scx_central. Drop counter dumping from scx_central.
- bpf_printk() implies a trailing newline and the explicit newline leads to double newlines. Drop the explicit newlines.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Changwoo Min <changwoo@igalia.com>

2025-02-08  Merge branch 'for-6.14-fixes' into for-6.15  (Tejun Heo)

Pull to receive:

- 2fa0fbeb69ed ("sched_ext: Implement auto local dispatching of migration disabled tasks")
- 32966821574c ("sched_ext: Fix migration disabled handling in targeted dispatches")

as planned for-6.15 changes depend on them (e.g. adding an event counter for implicit migration disabled task handling).

2025-02-07  sched_ext: Print an event, SCX_EV_ENQ_SLICE_DFL, in scx_qmap/central  (Changwoo Min)

Modify the scx_qmap and scx_central schedulers to print the SCX_EV_ENQ_SLICE_DFL event every second.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

2025-02-04  sched_ext: Print core event count in scx_qmap scheduler  (Changwoo Min)

Modify the scx_qmap scheduler to print the core event counter every second.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

2025-02-04  sched_ext: Print core event count in scx_central scheduler  (Changwoo Min)

Modify the scx_central scheduler to print the core event counter every second.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

2025-02-04  sched_ext: Add scx_bpf_events() and scx_read_event() for BPF schedulers  (Changwoo Min)

scx_bpf_events() is added to the header files so the BPF scheduler can use it. Also, scx_read_event() is added to read an event type in a compatible way.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

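Usage in a scheduler then reads roughly as follows (struct and event names follow the commit text; the exact read pattern is an assumption):

  struct scx_event_stats events;

  scx_bpf_events(&events, sizeof(events));
  bpf_printk("SLICE_DFL refills: %llu",
             scx_read_event(&events, SCX_EV_ENQ_SLICE_DFL));
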
2025-02-02  sched_ext: Fix incorrect time delta calculation in time_delta()  (Changwoo Min)

When (s64)(after - before) > 0, the code returned the boolean result of the comparison (s64)(after - before) > 0 instead of the intended delta (s64)(after - before). This happened because the middle operand of the ternary operator was incorrectly omitted. Add the middle operand -- (s64)(after - before) -- so the correct time delta is returned.

Fixes: d07be814fc71 ("sched_ext: Add time helpers for BPF schedulers")
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

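The fix, reconstructed from the description (the helper's exact shape in common.bpf.h is assumed):

  static inline s64 time_delta(u64 after, u64 before)
  {
          /*
           * The buggy version used the GNU "?:" extension with the
           * middle operand omitted, returning the 0/1 comparison result:
           *
           *     return (s64)(after - before) > 0 ?: 0;
           *
           * Spell out the middle operand so the actual delta, clamped
           * at zero, is returned.
           */
          return (s64)(after - before) > 0 ? (s64)(after - before) : 0;
  }
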
2025-01-27  tools/sched_ext: Add helper to check task migration state  (Andrea Righi)

Introduce a new helper for BPF schedulers to determine whether a task can migrate or not (supporting both SMP and UP systems).

Fixes: e9fe182772dc ("sched_ext: selftests/dsp_local_on: Fix sporadic failures")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

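A hedged sketch of such a helper: guard the SMP-only field with CO-RE so the same object loads on UP kernels (the helper name is illustrative):

  static inline bool task_can_migrate(const struct task_struct *p)
  {
          /* p->migration_disabled only exists on CONFIG_SMP kernels */
          if (bpf_core_field_exists(p->migration_disabled) &&
              p->migration_disabled)
                  return false;
          return p->nr_cpus_allowed > 1;
  }
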
2025-01-10  sched_ext: Use time helpers in BPF schedulers  (Changwoo Min)

Modify the BPF schedulers to use the time helpers defined in common.bpf.h.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

2025-01-10  sched_ext: Replace bpf_ktime_get_ns() with scx_bpf_now()  (Changwoo Min)

In the BPF schedulers that use bpf_ktime_get_ns() -- scx_central and scx_flatcg -- replace the bpf_ktime_get_ns() calls with scx_bpf_now().

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>