path: root/kernel/sched/core.c
Age | Commit message | Author
7 days | Merge tag 'sched_ext-for-6.19' of ↵ | Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext updates from Tejun Heo: - Improve recovery from misbehaving BPF schedulers. When a scheduler puts many tasks with varying affinity restrictions on a shared DSQ, CPUs scanning through tasks they cannot run can overwhelm the system, causing lockups. Bypass mode now uses per-CPU DSQs with a load balancer to avoid this, and hooks into the hardlockup detector to attempt recovery. Add scx_cpu0 example scheduler to demonstrate this scenario. - Add lockless peek operation for DSQs to reduce lock contention for schedulers that need to query queue state during load balancing. - Allow scx_bpf_reenqueue_local() to be called from anywhere in preparation for deprecating cpu_acquire/release() callbacks in favor of generic BPF hooks. - Prepare for hierarchical scheduler support: add scx_bpf_task_set_slice() and scx_bpf_task_set_dsq_vtime() kfuncs, make scx_bpf_dsq_insert*() return bool, and wrap kfunc args in structs for future aux__prog parameter. - Implement cgroup_set_idle() callback to notify BPF schedulers when a cgroup's idle state changes. - Fix migration tasks being incorrectly downgraded from stop_sched_class to rt_sched_class across sched_ext enable/disable. Applied late as the fix is low risk and the bug subtle but needs stable backporting. - Various fixes and cleanups including cgroup exit ordering, SCX_KICK_WAIT reliability, and backward compatibility improvements. 
* tag 'sched_ext-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (44 commits) sched_ext: Fix incorrect sched_class settings for per-cpu migration tasks sched_ext: tools: Removing duplicate targets during non-cross compilation sched_ext: Use kvfree_rcu() to release per-cpu ksyncs object sched_ext: Pass locked CPU parameter to scx_hardlockup() and add docs sched_ext: Update comments replacing breather with aborting mechanism sched_ext: Implement load balancer for bypass mode sched_ext: Factor out abbreviated dispatch dequeue into dispatch_dequeue_locked() sched_ext: Factor out scx_dsq_list_node cursor initialization into INIT_DSQ_LIST_CURSOR sched_ext: Add scx_cpu0 example scheduler sched_ext: Hook up hardlockup detector sched_ext: Make handle_lockup() propagate scx_verror() result sched_ext: Refactor lockup handlers into handle_lockup() sched_ext: Make scx_exit() and scx_vexit() return bool sched_ext: Exit dispatch and move operations immediately when aborting sched_ext: Simplify breather mechanism with scx_aborting flag sched_ext: Use per-CPU DSQs instead of per-node global DSQs in bypass mode sched_ext: Refactor do_enqueue_task() local and global DSQ paths sched_ext: Use shorter slice in bypass mode sched_ext: Mark racy bitfields to prevent adding fields that can't tolerate races sched_ext: Minor cleanups to scx_task_iter ...
7 days | Merge tag 'cgroup-for-6.19' of ↵ | Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - Defer task cgroup unlink until after the dying task's final context switch so that controllers see the cgroup properly populated until the task is truly gone - cpuset cleanups and simplifications. Enforce that domain isolated CPUs stay in root or isolated partitions and fail if isolated+nohz_full would leave no housekeeping CPU. Fix sched/deadline root domain handling during CPU hot-unplug and race for tasks in attaching cpusets - Misc fixes including memory reclaim protection documentation and selftest KTAP conformance * tag 'cgroup-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits) cpuset: Treat cpusets in attaching as populated sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug cgroup/cpuset: Introduce cpuset_cpus_allowed_locked() docs: cgroup: No special handling of unpopulated memcgs docs: cgroup: Note about sibling relative reclaim protection docs: cgroup: Explain reclaim protection target selftests/cgroup: conform test to KTAP format output cpuset: remove need_rebuild_sched_domains cpuset: remove global remote_children list cpuset: simplify node setting on error cgroup: include missing header for struct irq_work cgroup: Fix sleeping from invalid context warning on PREEMPT_RT cgroup/cpuset: Globally track isolated_cpus update cgroup/cpuset: Ensure domain isolated CPUs stay in root or isolated partition cgroup/cpuset: Move up prstate_housekeeping_conflict() helper cgroup/cpuset: Fail if isolated and nohz_full don't leave any housekeeping cgroup/cpuset: Rename update_unbound_workqueue_cpumask() to update_isolation_cpumasks() cgroup: Defer task cgroup unlink until after the task is done switching out cgroup: Move dying_tasks cleanup from cgroup_task_release() to cgroup_task_free() cgroup: Rename cgroup lifecycle hooks to cgroup_task_*() ...
8 days | Merge tag 'core-rseq-2025-11-30' of ↵ | Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull rseq updates from Thomas Gleixner:

"A large overhaul of the restartable sequences and CID management:

The recent enablement of RSEQ in glibc resulted in regressions which are caused by the related overhead. It turned out that the decision to invoke the exit to user work was not really a decision: more or less every context switch caused it. A long list of small issues adds up to a 3-4% regression in I/O benchmarks.

The other detail which caused issues due to extra work in context switch and task migration is the CID (memory context ID) management. It also requires a task work to consolidate the CID space, which is executed in the context of an arbitrary task and results in sporadic uncontrolled exit latencies.

The rewrite addresses this by:

- Removing deprecated and long unsupported functionality
- Moving the related data into dedicated data structures which are optimized for fast path processing
- Caching values so actual decisions can be made
- Replacing the current implementation with an optimized inlined variant
- Separating fast and slow path for architectures which use the generic entry code, so that only fault and error handling goes into the TIF_NOTIFY_RESUME handler
- Rewriting the CID management so that it becomes mostly invisible in the context switch path. That moves the work of switching modes into the fork/exit path, which is a reasonable tradeoff. That work is only required when a process creates more threads than the cpuset it is allowed to run on or when enough threads exit after that. An artificial thread pool benchmark which triggers this did not degrade, it actually improved significantly.

The main effect in migration heavy scenarios is that runqueue lock held time and therefore contention goes down significantly"

* tag 'core-rseq-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits) sched/mmcid: Switch over to the new mechanism sched/mmcid: Implement deferred mode change irqwork: Move data struct to a types header sched/mmcid: Provide CID ownership mode fixup functions sched/mmcid: Provide new scheduler CID mechanism sched/mmcid: Introduce per task/CPU ownership infrastructure sched/mmcid: Serialize sched_mm_cid_fork()/exit() with a mutex sched/mmcid: Provide precomputed maximal value sched/mmcid: Move initialization out of line signal: Move MMCID exit out of sighand lock sched/mmcid: Convert mm CID mask to a bitmap cpumask: Cache num_possible_cpus() sched/mmcid: Use cpumask_weighted_or() cpumask: Introduce cpumask_weighted_or() sched/mmcid: Prevent pointless work in mm_update_cpus_allowed() sched/mmcid: Move scheduler code out of global header sched: Fixup whitespace damage sched/mmcid: Cacheline align MM CID storage sched/mmcid: Use proper data structures sched/mmcid: Revert the complex CID management ...
9 days | Merge tag 'sched-core-2025-12-01' of ↵ | Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "Scalability and load-balancing improvements: - Enable scheduler feature NEXT_BUDDY (Mel Gorman) - Reimplement NEXT_BUDDY to align with EEVDF goals (Mel Gorman) - Skip sched_balance_running cmpxchg when balance is not due (Tim Chen) - Implement generic code for architecture specific sched domain NUMA distances (Tim Chen) - Optimize the NUMA distances of the sched-domains builds of Intel Granite Rapids (GNR) and Clearwater Forest (CWF) platforms (Tim Chen) - Implement proportional newidle balance: a randomized algorithm that runs newidle balancing proportional to its success rate. (Peter Zijlstra) Scheduler infrastructure changes: - Implement the 'sched_change' scoped_guard() pattern for the entire scheduler (Peter Zijlstra) - More broadly utilize the sched_change guard (Peter Zijlstra) - Add support to pick functions to take runqueue-flags (Joel Fernandes) - Provide and use set_need_resched_current() (Peter Zijlstra) Fair scheduling enhancements: - Forfeit vruntime on yield (Fernand Sieber) - Only update stats for allowed CPUs when looking for dst group (Adam Li) CPU-core scheduling enhancements: - Optimize core cookie matching check (Fernand Sieber) Deadline scheduler fixes: - Only set free_cpus for online runqueues (Doug Berger) - Fix dl_server time accounting (Peter Zijlstra) - Fix dl_server stop condition (Peter Zijlstra) Proxy scheduling fixes: - Yield the donor task (Fernand Sieber) Fixes and cleanups: - Fix do_set_cpus_allowed() locking (Peter Zijlstra) - Fix migrate_disable_switch() locking (Peter Zijlstra) - Remove double update_rq_clock() in __set_cpus_allowed_ptr_locked() (Hao Jia) - Increase sched_tick_remote timeout (Phil Auld) - sched/deadline: Use cpumask_weight_and() in dl_bw_cpus() (Shrikanth Hegde) - sched/deadline: Clean up select_task_rq_dl() (Shrikanth Hegde)" * tag 'sched-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 
commits) sched: Provide and use set_need_resched_current() sched/fair: Proportional newidle balance sched/fair: Small cleanup to update_newidle_cost() sched/fair: Small cleanup to sched_balance_newidle() sched/fair: Revert max_newidle_lb_cost bump sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals sched/fair: Enable scheduler feature NEXT_BUDDY sched: Increase sched_tick_remote timeout sched/fair: Have SD_SERIALIZE affect newidle balancing sched/fair: Skip sched_balance_running cmpxchg when balance is not due sched/deadline: Minor cleanup in select_task_rq_dl() sched/deadline: Use cpumask_weight_and() in dl_bw_cpus sched/deadline: Document dl_server sched/deadline: Fix dl_server stop condition sched/deadline: Fix dl_server time accounting sched/core: Remove double update_rq_clock() in __set_cpus_allowed_ptr_locked() sched/eevdf: Fix min_vruntime vs avg_vruntime sched/core: Add comment explaining force-idle vruntime snapshots sched/core: Optimize core cookie matching check sched/proxy: Yield the donor task ...
2025-11-25 | sched/mmcid: Switch over to the new mechanism | Thomas Gleixner
Now that all pieces are in place, change the implementations of sched_mm_cid_fork() and sched_mm_cid_exit() to adhere to the new strict ownership scheme and switch context_switch() over to use the new mm_cid_schedin() functionality. The common case is that no mode change is required, which makes fork() and exit() just update the user count and the constraints. If a new user would exceed the CID space limit, the fork() context handles the transition to per CPU mode with mm::mm_cid::mutex held. exit() handles the transition back to per task mode when the user count drops below the switch back threshold. fork() might also be forced to handle a deferred switch back to per task mode, when an affinity change increased the number of allowed CPUs enough. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251119172550.280380631@linutronix.de
2025-11-25 | sched/mmcid: Implement deferred mode change | Thomas Gleixner
When affinity changes increase the number of CPUs allowed for tasks which are related to a MM, that might result in a situation where the ownership mode can go back from per CPU mode to per task mode. As affinity changes happen with runqueue lock held, there is no way to do the actual mode change and required fixup right there. Add the infrastructure to defer it to a workqueue. The scheduled work can race with a fork() or exit(). Whatever happens first takes care of it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251119172550.216484739@linutronix.de
2025-11-25 | sched/mmcid: Provide CID ownership mode fixup functions | Thomas Gleixner
CIDs are either owned by tasks or by CPUs. The ownership mode depends on the number of tasks related to a MM and the number of CPUs on which these tasks are theoretically allowed to run. Theoretically because that number is the superset of CPU affinities of all tasks which only grows and never shrinks.

Switching to per CPU mode happens when the user count becomes greater than the maximum number of CIDs, which is calculated by:

    opt_cids = min(mm_cid::nr_cpus_allowed, mm_cid::users);
    max_cids = min(1.25 * opt_cids, nr_cpu_ids);

The +25% allowance is useful for tight CPU masks in scenarios where only a few threads are created and destroyed to avoid frequent mode switches. This allowance shrinks, though, the closer opt_cids gets to nr_cpu_ids, which is the (unfortunate) hard ABI limit.

At the point of switching to per CPU mode the new user is not yet visible in the system, so the task which initiated the fork() runs the fixup function: mm_cid_fixup_tasks_to_cpus() walks the thread list and either transfers each task's owned CID to the CPU the task runs on or drops it into the CID pool if a task is not on a CPU at that point in time. Tasks which schedule in before the task walk reaches them do the handover in mm_cid_schedin(). When mm_cid_fixup_tasks_to_cpus() completes it's guaranteed that no task related to that MM owns a CID anymore.

Switching back to task mode happens when the user count goes below the threshold which was recorded on the per CPU mode switch:

    pcpu_thrs = min(opt_cids - (opt_cids / 4), nr_cpu_ids / 2);

This threshold is updated when an affinity change increases the number of allowed CPUs for the MM, which might cause a switch back to per task mode.

If the switch back was initiated by an exiting task, then that task runs the fixup function. If it was initiated by an affinity change, then it's run either in the deferred update function in context of a workqueue or by a task which forks a new one or by a task which exits. Whatever happens first.

mm_cid_fixup_cpus_to_tasks() walks through the possible CPUs and either transfers the CPU owned CIDs to a related task which runs on the CPU or drops it into the pool. Tasks which schedule in on a CPU which the walk did not cover yet do the handover themselves.

This transition from CPU to per task ownership happens in two phases:

1) mm:mm_cid.transit contains MM_CID_TRANSIT. This is OR'ed on the task CID and denotes that the CID is only temporarily owned by the task. When it schedules out the task drops the CID back into the pool if this bit is set.

2) The initiating context walks the per CPU space and after completion clears mm:mm_cid.transit. After that point the CIDs are strictly task owned again.

This two phase transition is required to prevent CID space exhaustion during the transition as a direct transfer of ownership would fail if two tasks are scheduled in on the same CPU before the fixup freed per CPU CIDs.

When mm_cid_fixup_cpus_to_tasks() completes it's guaranteed that no CID related to that MM is owned by a CPU anymore.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251119172550.088189028@linutronix.de
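The three limits quoted in this commit message can be worked through with a small standalone sketch. The function names below are hypothetical and not taken from the kernel, and computing the 1.25 factor as `opt + opt / 4` in integer arithmetic is an assumption about rounding; only the formulas themselves come from the message.

```c
#include <assert.h>

/* Hypothetical helpers; only the formulas come from the commit message. */
static inline unsigned int sketch_min(unsigned int a, unsigned int b)
{
    return a < b ? a : b;
}

/* opt_cids = min(nr_cpus_allowed, users) */
static inline unsigned int sketch_opt_cids(unsigned int nr_cpus_allowed,
                                           unsigned int users)
{
    return sketch_min(nr_cpus_allowed, users);
}

/* max_cids = min(1.25 * opt_cids, nr_cpu_ids); integer rounding is assumed */
static inline unsigned int sketch_max_cids(unsigned int opt_cids,
                                           unsigned int nr_cpu_ids)
{
    assert(nr_cpu_ids > 0);
    return sketch_min(opt_cids + opt_cids / 4, nr_cpu_ids);
}

/* pcpu_thrs = min(opt_cids - (opt_cids / 4), nr_cpu_ids / 2) */
static inline unsigned int sketch_pcpu_thrs(unsigned int opt_cids,
                                            unsigned int nr_cpu_ids)
{
    return sketch_min(opt_cids - opt_cids / 4, nr_cpu_ids / 2);
}
```

With 8 allowed CPUs on a 64 CPU machine, this gives a per task limit of 10 users and a switch back threshold of 6. With 60 allowed CPUs the allowance is clipped by nr_cpu_ids to 64, which illustrates how the +25% headroom shrinks near the hard limit.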
2025-11-25 | sched/mmcid: Provide new scheduler CID mechanism | Thomas Gleixner
The MM CID management has two fundamental requirements:

1) It has to guarantee that at no given point in time the same CID is used by concurrent tasks in userspace.

2) The CID space must not exceed the number of possible CPUs in a system. While most allocators (glibc, tcmalloc, jemalloc) do not care about that, there seems to be at least some LTTng library depending on it.

The CID space compaction itself is not a functional correctness requirement, it is only a useful optimization mechanism to reduce the memory footprint in unused user space pools.

The optimal CID space is:

    min(nr_tasks, nr_cpus_allowed);

Where @nr_tasks is the number of actual user space threads associated to the mm and @nr_cpus_allowed is the superset of all task affinities. It is growth only as it would be insane to take a racy snapshot of all task affinities when the affinity of one task changes just to redo it 2 milliseconds later when the next task changes its affinity.

That means that as long as the number of tasks is lower than or equal to the number of CPUs allowed, each task owns a CID. If the number of tasks exceeds the number of CPUs allowed it switches to per CPU mode, where the CPUs own the CIDs and the tasks borrow them as long as they are scheduled in.

For transition periods CIDs can go beyond the optimal space as long as they don't go beyond the number of possible CPUs.

The current upstream implementation adds overhead into task migration to keep the CID with the task. It also has to do the CID space consolidation work from a task work in the exit to user space path. As that work is assigned to a random task related to a MM this can inflict unwanted exit latencies.

Implement the context switch parts of a strict ownership mechanism to address this.

This removes most of the work from the task which schedules out. Only during transitioning from per CPU to per task ownership it is required to drop the CID when leaving the CPU to prevent CID space exhaustion. Other than that scheduling out is just a single check and branch.

The task which schedules in has to check whether:

1) The ownership mode changed

2) The CID is within the optimal CID space

In stable situations this results in zero work. The only short disruption is when ownership mode changes or when the associated CID is not in the optimal CID space. The latter only happens when tasks exit and therefore the optimal CID space shrinks.

That mechanism is strictly optimized for the common case where no change happens. The only case where it actually causes a temporary one time spike is on mode changes when and only when a lot of tasks related to a MM schedule exactly at the same time and eventually have to compete on allocating a CID from the bitmap.

In the sysbench test case which triggered the spinlock contention in the initial CID code, __schedule() drops significantly in perf top on a 128 Core (256 threads) machine when running sysbench with 255 threads, which fits into the task mode limit of 256 together with the parent thread:

    Upstream   rseq/perf branch   +CID rework
    0.42%      0.37%              0.32%         [k] __schedule

Increasing the number of threads to 256, which puts the test process into per CPU mode looks about the same.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251119172550.023984859@linutronix.de
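The two checks a task performs when it schedules in can be illustrated with a minimal sketch. The struct layout, enum and function names here are hypothetical and do not mirror the kernel's data structures; the sketch only shows why the stable case costs nothing beyond two comparisons.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical types for illustration; not the kernel's structures. */
enum sketch_cid_mode { SKETCH_PER_TASK, SKETCH_PER_CPU };

struct sketch_mm_state {
    enum sketch_cid_mode mode;  /* current ownership mode of the MM */
    unsigned int opt_cids;      /* min(nr_tasks, nr_cpus_allowed) */
};

struct sketch_task_state {
    enum sketch_cid_mode mode;  /* mode in which the task got its CID */
    unsigned int cid;           /* CID currently associated with the task */
};

/* Optimal CID space as given in the commit message. */
static inline unsigned int sketch_opt_space(unsigned int nr_tasks,
                                            unsigned int nr_cpus_allowed)
{
    return nr_tasks < nr_cpus_allowed ? nr_tasks : nr_cpus_allowed;
}

/*
 * Returns true when the task scheduling in has to do extra work:
 * either the ownership mode changed, or its CID lies outside the
 * optimal space (CIDs are numbered from 0, so cid >= opt_cids is outside).
 */
static inline bool sketch_schedin_needs_work(const struct sketch_task_state *t,
                                             const struct sketch_mm_state *mm)
{
    assert(t && mm);
    if (t->mode != mm->mode)
        return true;
    return t->cid >= mm->opt_cids;
}
```

In this model a task holding CID 3 in a MM with an optimal space of 4 does no work on schedule-in, while the same task holding CID 4 would have to pick a lower CID once the space has shrunk.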
2025-11-25 | sched/mmcid: Introduce per task/CPU ownership infrastructure | Thomas Gleixner
The MM CID management has two fundamental requirements: 1) It has to guarantee that at no given point in time the same CID is used by concurrent tasks in userspace. 2) The CID space must not exceed the number of possible CPUs in a system. While most allocators (glibc, tcmalloc, jemalloc) do not care about that, there seems to be at least librseq depending on it. The CID space compaction itself is not a functional correctness requirement, it is only a useful optimization mechanism to reduce the memory footprint in unused user space pools. The optimal CID space is: min(nr_tasks, nr_cpus_allowed); Where @nr_tasks is the number of actual user space threads associated to the mm and @nr_cpus_allowed is the superset of all task affinities. It is growth only as it would be insane to take a racy snapshot of all task affinities when the affinity of one task changes just to redo it 2 milliseconds later when the next task changes its affinity. That means that as long as the number of tasks is lower than or equal to the number of CPUs allowed, each task owns a CID. If the number of tasks exceeds the number of CPUs allowed it switches to per CPU mode, where the CPUs own the CIDs and the tasks borrow them as long as they are scheduled in. For transition periods CIDs can go beyond the optimal space as long as they don't go beyond the number of possible CPUs. The current upstream implementation adds overhead into task migration to keep the CID with the task. It also has to do the CID space consolidation work from a task work in the exit to user space path. As that work is assigned to a random task related to a MM this can inflict unwanted exit latencies. This can be done differently by implementing a strict CID ownership mechanism. Either the CIDs are owned by the tasks or by the CPUs. The latter provides less locality when tasks are heavily migrating, but there is no justification to optimize for overcommit scenarios and thereby penalizing everyone else. 
Provide the basic infrastructure to implement this: - Change the UNSET marker to BIT(31) from ~0U - Add the ONCPU marker as BIT(30) - Add the TRANSIT marker as BIT(29) That allows to check for ownership trivially and provides a simple check for UNSET as well. The TRANSIT marker is required to prevent CID space exhaustion when switching from per CPU to per task mode. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20251119172549.960252358@linutronix.de
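The marker layout described above (UNSET in bit 31, ONCPU in bit 30, TRANSIT in bit 29) can be sketched as follows. The macro and function names are hypothetical stand-ins, not the kernel's definitions; the point is that each ownership question becomes a single mask test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; bit positions are the ones listed in the commit. */
#define SKETCH_CID_UNSET    (UINT32_C(1) << 31)
#define SKETCH_CID_ONCPU    (UINT32_C(1) << 30)
#define SKETCH_CID_TRANSIT  (UINT32_C(1) << 29)
#define SKETCH_CID_FLAGS    (SKETCH_CID_UNSET | SKETCH_CID_ONCPU | \
                             SKETCH_CID_TRANSIT)

static inline bool sketch_cid_is_unset(uint32_t cid)
{
    return (cid & SKETCH_CID_UNSET) != 0;
}

static inline bool sketch_cid_owned_by_cpu(uint32_t cid)
{
    return (cid & SKETCH_CID_ONCPU) != 0;
}

static inline bool sketch_cid_in_transit(uint32_t cid)
{
    return (cid & SKETCH_CID_TRANSIT) != 0;
}

/* Strip the marker bits to obtain the plain CID number. */
static inline uint32_t sketch_cid_number(uint32_t cid)
{
    assert(!sketch_cid_is_unset(cid));
    return cid & ~SKETCH_CID_FLAGS;
}
```

Because valid CID numbers are bounded by the number of possible CPUs, the top bits are free for markers, and a value such as `5 | SKETCH_CID_ONCPU` still yields CID 5 after masking.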
2025-11-25 | sched/mmcid: Serialize sched_mm_cid_fork()/exit() with a mutex | Thomas Gleixner
Prepare for the new CID management scheme which puts the CID ownership transition into the fork() and exit() slow path by serializing sched_mm_cid_fork()/exit() with it, so task list and cpu mask walks can be done in interruptible and preemptible code. The contention on it is not worse than on other concurrency controls in the fork()/exit() machinery. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251119172549.895826703@linutronix.de
2025-11-25 | sched/mmcid: Provide precomputed maximal value | Thomas Gleixner
Reading mm::mm_users and mm::mm_cid::nr_cpus_allowed every time to compute the maximal CID value is just wasteful as that value only changes on fork(), exit() and eventually when the affinity changes. So it can be easily precomputed at those points and provided in mm::mm_cid for consumption in the hot path. But there is an issue with using mm::mm_users for accounting because that does not necessarily reflect the number of user space tasks as other kernel code can take temporary references on the MM which skew the picture. Solve that by adding a users counter to struct mm_mm_cid, which is modified by fork() and exit() and used for precomputing under mm_mm_cid::lock. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251119172549.832764634@linutronix.de
2025-11-25 | sched/mmcid: Move initialization out of line | Thomas Gleixner
It's getting bigger soon, so just move it out of line to the rest of the code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251119172549.769636491@linutronix.de
2025-11-25 | signal: Move MMCID exit out of sighand lock | Thomas Gleixner
There is no need anymore to keep this under sighand lock as the current code and the upcoming replacement are not depending on the exit state of a task anymore. That allows to use a mutex in the exit path. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251119172549.706439391@linutronix.de
2025-11-25 | sched/mmcid: Convert mm CID mask to a bitmap | Thomas Gleixner
This is truly a bitmap and just conveniently uses a cpumask because the maximum size of the bitmap is nr_cpu_ids. But that prevents searching for a zero bit in a limited range, which is helpful to provide an efficient mechanism to consolidate the CID space when the number of users decreases. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Yury Norov (NVIDIA) <yury.norov@gmail.com> Link: https://patch.msgid.link/20251119172549.642866767@linutronix.de
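A range limited zero bit search, as motivated above, can be sketched in plain C. The function name is hypothetical and the bit-by-bit loop is deliberately naive; it only illustrates the semantics of "find a free CID below a given limit", not how the kernel's bitmap helpers are implemented.

```c
#include <assert.h>
#include <limits.h>

#define SKETCH_BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/*
 * Return the index of the first clear bit below @limit,
 * or @limit if every bit in [0, limit) is set.
 * Hypothetical helper; a naive loop for illustration only.
 */
static inline unsigned int sketch_first_zero_below(const unsigned long *map,
                                                   unsigned int limit)
{
    assert(map);
    for (unsigned int bit = 0; bit < limit; bit++) {
        unsigned long word = map[bit / SKETCH_BITS_PER_WORD];

        if (!(word & (1UL << (bit % SKETCH_BITS_PER_WORD))))
            return bit;
    }
    return limit;
}
```

Passing the current optimal CID count as the limit means a task only ever picks up a CID inside the compacted range, which is how a shrinking user count can pull CIDs back toward zero.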
2025-11-20 | sched/mmcid: Use cpumask_weighted_or() | Thomas Gleixner
Use cpumask_weighted_or() instead of cpumask_or() followed by cpumask_weight() on the result, which walks the same bitmap twice. This results in 10-20% fewer cycles, which reduces the runqueue lock hold time. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Yury Norov (NVIDIA) <yury.norov@gmail.com> Link: https://patch.msgid.link/20251119172549.511736272@linutronix.de
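The saving comes from fusing the OR and the population count into one pass over the words. The sketch below shows that idea on plain 64-bit word arrays; the function names and the array representation are assumptions for illustration and say nothing about the actual signature or implementation of cpumask_weighted_or().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable population count (Kernighan's method). */
static inline unsigned int sketch_popcount64(uint64_t w)
{
    unsigned int n = 0;

    while (w) {
        w &= w - 1;  /* clear the lowest set bit */
        n++;
    }
    return n;
}

/*
 * dst = a | b, returning the number of set bits in dst.
 * Each word is touched once instead of once for the OR
 * and a second time for the weight.
 */
static inline unsigned int sketch_weighted_or(uint64_t *dst,
                                              const uint64_t *a,
                                              const uint64_t *b,
                                              size_t nwords)
{
    unsigned int weight = 0;

    assert(dst && a && b);
    for (size_t i = 0; i < nwords; i++) {
        dst[i] = a[i] | b[i];
        weight += sketch_popcount64(dst[i]);
    }
    return weight;
}
```

On large masks the single pass keeps each word in a register between the OR and the count, which is consistent with the cycle reduction the commit reports.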
2025-11-20 | sched/mmcid: Prevent pointless work in mm_update_cpus_allowed() | Thomas Gleixner
mm_update_cpus_allowed() is not required to be invoked for affinity changes due to migrate_disable() and migrate_enable(). migrate_disable() restricts the task temporarily to a CPU on which the task was already allowed to run, so nothing changes. migrate_enable() restores the actual task affinity mask. If that mask changed between migrate_disable() and migrate_enable() then that change was already accounted for. Move the invocation to the proper place to avoid that. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251119172549.385208276@linutronix.de
2025-11-20 | sched/mmcid: Move scheduler code out of global header | Thomas Gleixner
This is only used in the scheduler core code, so there is no point to have it in a global header. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Yury Norov (NVIDIA) <yury.norov@gmail.com> Link: https://patch.msgid.link/20251119172549.321259077@linutronix.de
2025-11-20 | sched: Fixup whitespace damage | Thomas Gleixner
With whitespace checks enabled in the editor this makes eyes bleed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251119172549.258651925@linutronix.de
2025-11-20 | sched/mmcid: Use proper data structures | Thomas Gleixner
Having a lot of CID functionality specific members in struct task_struct and struct mm_struct is not really making the code easier to read. Encapsulate the CID specific parts in data structures and keep them separate from the stuff they are embedded in. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251119172549.131573768@linutronix.de
2025-11-20 | sched/mmcid: Revert the complex CID management | Thomas Gleixner
The CID management is a complex beast, which affects both scheduling and task migration. The compaction mechanism forces random tasks of a process into task work on exit to user space causing latency spikes. Revert back to the initial simple bitmap allocating mechanics, which are known to have scalability issues as that allows to gradually build up a replacement functionality in a reviewable way. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251119172549.068197830@linutronix.de
2025-11-17 | sched/fair: Proportional newidle balance | Peter Zijlstra
Add a randomized algorithm that runs newidle balancing proportional to its success rate.

This improves schbench significantly:

    6.18-rc4:                2.22 Mrps/s
    6.18-rc4+revert:         2.04 Mrps/s
    6.18-rc4+revert+random:  2.18 Mrps/s

Conversely, per Adam Li this affects SpecJBB slightly, reducing it by 1%:

    6.17:                -6%
    6.17+revert:          0%
    6.17+revert+random:  -1%

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Chris Mason <clm@meta.com>
Link: https://lkml.kernel.org/r/6825c50d-7fa7-45d8-9b81-c6e7e25738e2@meta.com
Link: https://patch.msgid.link/20251107161739.770122091@infradead.org
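The commit message does not spell out the algorithm, so the following is only a generic sketch of "run an operation with probability proportional to its observed success rate". Every name, the 0..1024 scaling and the xorshift generator are assumptions made for illustration and are not taken from the kernel patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bookkeeping; not the kernel's sched_domain fields. */
struct sketch_newidle_stats {
    uint32_t calls;      /* balance attempts that were actually run */
    uint32_t successes;  /* attempts that pulled at least one task */
    uint32_t rng;        /* xorshift32 state, must be non-zero */
};

/* Small deterministic PRNG (xorshift32) so the sketch is self-contained. */
static inline uint32_t sketch_rand(struct sketch_newidle_stats *s)
{
    uint32_t x = s->rng;

    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    s->rng = x;
    return x;
}

/* Success ratio scaled to 0..1024; optimistic before any data exists. */
static inline uint32_t sketch_ratio(const struct sketch_newidle_stats *s)
{
    if (!s->calls)
        return 1024;
    return (uint32_t)(((uint64_t)s->successes * 1024) / s->calls);
}

/* Decide randomly whether to run, with probability ratio / 1024. */
static inline bool sketch_should_balance(struct sketch_newidle_stats *s)
{
    assert(s->rng != 0);
    return (sketch_rand(s) % 1024) < sketch_ratio(s);
}

/* Record the outcome of an attempt that was run. */
static inline void sketch_record(struct sketch_newidle_stats *s, bool pulled)
{
    s->calls++;
    if (pulled)
        s->successes++;
}
```

A practical version would need a floor on the ratio or some decay of old samples; in this sketch a ratio of zero means the operation is never attempted again, so the statistics could not recover.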
2025-11-17 | sched: Increase sched_tick_remote timeout | Phil Auld
Increase the sched_tick_remote WARN_ON timeout to remove false positives due to temporarily busy HK cpus. The suggestion was 30 seconds to catch really stuck remote tick processing but not trigger it too easily. Suggested-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Phil Auld <pauld@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://patch.msgid.link/20250911161300.437944-1-pauld@redhat.com
2025-11-11 | sched/core: Remove double update_rq_clock() in __set_cpus_allowed_ptr_locked() | Hao Jia
Since commit d4c64207b88a ("sched: Cleanup the sched_change NOCLOCK usage"), update_rq_clock() is called in do_set_cpus_allowed() -> sched_change_begin() to update the rq clock. This results in a duplicate call to update_rq_clock() in __set_cpus_allowed_ptr_locked(). While holding the rq lock and before calling do_set_cpus_allowed(), there is nothing that depends on an updated rq_clock. Therefore, remove the redundant update_rq_clock() in __set_cpus_allowed_ptr_locked() to avoid the warning about double rq clock updates. Fixes: d4c64207b88a ("sched: Cleanup the sched_change NOCLOCK usage") Signed-off-by: Hao Jia <jiahao1@lixiang.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://patch.msgid.link/20251029093655.31252-1-jiahao.kernel@gmail.com
2025-11-06 | sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining | Aaron Lu
When a cfs_rq is to be throttled, its limbo list should be empty and that's why there is a warn in tg_throttle_down() for a non-empty cfs_rq->throttled_limbo_list. When running a test with the following hierarchy:

          root
        /      \
        A*     ...
     /  |  \   ...
        B
       /  \
      C*

where both A and C have quota settings, that warn on a non-empty limbo list is triggered for a cfs_rq of C, let's call it cfs_rq_c (and ignore the cpu part of the cfs_rq for the sake of simpler representation). Debug showed it happened like this: Task group C is created and quota is set, so in tg_set_cfs_bandwidth(), cfs_rq_c is initialized with runtime_enabled set, runtime_remaining equal to 0 and *unthrottled*. Before any tasks are enqueued to cfs_rq_c, *multiple* throttled tasks can migrate to cfs_rq_c (e.g., due to task group changes). When enqueue_task_fair(cfs_rq_c, throttled_task) is called and cfs_rq_c is in a throttled hierarchy (e.g., A is throttled), these throttled tasks are directly placed into cfs_rq_c's limbo list by enqueue_throttled_task(). Later, when A is unthrottled, tg_unthrottle_up(cfs_rq_c) enqueues these tasks. The first enqueue triggers check_enqueue_throttle(), and with zero runtime_remaining, cfs_rq_c can be throttled in throttle_cfs_rq() if it can't get more runtime and enters tg_throttle_down(), where the warning is hit due to remaining tasks in the limbo list. Triggering a throttle on the unthrottle path is chaotic: a cfs_rq that is being unthrottled can end up in a mixed state. Fix this by granting 1ns to cfs_rq in tg_set_cfs_bandwidth(). This ensures cfs_rq_c has a positive runtime_remaining when initialized as unthrottled and cannot enter tg_unthrottle_up() with zero runtime_remaining. Also, update outdated comments in tg_throttle_down() since unthrottle_cfs_rq() is no longer called with zero runtime_remaining. While at it, remove a redundant assignment to se in tg_throttle_down().
Fixes: e1fad12dcb66 ("sched/fair: Switch to task based throttle model") Reviewed-By: Benjamin Segall <bsegall@google.com> Suggested-by: Benjamin Segall <bsegall@google.com> Signed-off-by: Aaron Lu <ziqianlu@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: Hao Jia <jiahao1@lixiang.com> Link: https://patch.msgid.link/20251030032755.560-1-ziqianlu@bytedance.com
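The effect of the 1ns grant can be seen in a reduced model. The sketch below is an assumption-laden toy, not kernel code: the `toy_` types and helpers are hypothetical, and the enqueue check is simplified so that no additional runtime can ever be obtained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, heavily reduced stand-in for a bandwidth-controlled cfs_rq. */
struct toy_cfs_rq {
	bool    runtime_enabled;
	bool    throttled;
	int64_t runtime_remaining; /* nanoseconds */
};

/* Mirrors the fix described above: start with 1ns instead of 0 when quota is set. */
static void toy_set_bandwidth(struct toy_cfs_rq *cfs_rq, bool quota_enabled)
{
	cfs_rq->runtime_enabled = quota_enabled;
	cfs_rq->throttled = false;
	cfs_rq->runtime_remaining = quota_enabled ? 1 : 0;
}

/*
 * Simplified enqueue-time check: throttle only when runtime is exhausted.
 * Returns true if the queue got throttled by this call.
 */
static bool toy_check_enqueue_throttle(struct toy_cfs_rq *cfs_rq)
{
	if (!cfs_rq->runtime_enabled || cfs_rq->throttled)
		return false;
	if (cfs_rq->runtime_remaining > 0)
		return false;
	cfs_rq->throttled = true;
	return true;
}
```

With the 1ns grant, the first enqueue on a freshly configured queue passes the check; only after the runtime is actually consumed does the throttle trigger.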
2025-11-04rseq: Optimize event settingThomas Gleixner
After removing the various condition bits earlier it turns out that one extra piece of information is needed to avoid setting event::sched_switch and TIF_NOTIFY_RESUME unconditionally on every context switch. The update of the RSEQ user space memory is only required when either

  - the task was interrupted in user space and schedules, or
  - the CPU or MM CID changes in schedule(), independent of the entry mode.

Right now only the interrupt-from-user information is available. Add an event flag, which is set when the CPU or MM CID or both change. Evaluate this event in the scheduler to decide whether the sched_switch event and the TIF bit need to be set. It's an extra conditional in context_switch(), but the downside of unconditionally handling RSEQ after a context switch to user is way more significant. The utilized boolean logic minimizes this to a single conditional branch. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251027084307.578058898@linutronix.de
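One way to picture the "single conditional branch" is to combine the two conditions with a bitwise OR before testing them. The sketch below assumes a hypothetical event structure and function name; the real field layout is not shown in this log.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical event flags; names are illustrative only. */
struct toy_rseq_event {
	bool user_irq;     /* task was interrupted in user space */
	bool ids_changed;  /* CPU or MM CID (or both) changed */
	bool sched_switch; /* user space rseq area needs updating */
};

/*
 * The non-short-circuit '|' combines both conditions without an intermediate
 * branch, leaving one conditional for the whole decision. Returns true when
 * the caller would also set the notify-resume flag.
 */
static bool toy_rseq_sched_switch(struct toy_rseq_event *ev)
{
	bool raise = ev->user_irq | ev->ids_changed;

	if (raise)
		ev->sched_switch = true;
	return raise;
}
```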
2025-11-04rseq: Simplify the event notificationThomas Gleixner
Since commit 0190e4198e47 ("rseq: Deprecate RSEQ_CS_FLAG_NO_RESTART_ON_* flags") the bits in task::rseq_event_mask are meaningless and just extra work in terms of setting them individually. Aside from that, the only relevant point where an event has to be raised is context switch. Neither the CPU nor MM CID can change without going through a context switch. Collapse them all into a single boolean which simplifies the code a lot and remove the pointless invocations which have been sprinkled all over the place for no value. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251027084306.336978188@linutronix.de
2025-11-03sched_ext: Fix cgroup exit ordering by moving sched_ext_free() to ↵Tejun Heo
finish_task_switch() sched_ext_free() was called from __put_task_struct() when the last reference to the task is dropped, which could be long after the task has finished running. This causes cgroup-related problems: - ops.init_task() can be called on a cgroup which didn't get ops.cgroup_init()'d during scheduler load, because the cgroup might be destroyed/unlinked while the zombie or dead task is still lingering on the scx_tasks list. - ops.cgroup_exit() could be called before ops.exit_task() is called on all member tasks, leading to incorrect exit ordering. Fix by moving it to finish_task_switch() to be called right after the final context switch away from the dying task, matching when sched_class->task_dead() is called. Rename it to sched_ext_dead() to match the new calling context. By calling sched_ext_dead() before cgroup_task_dead(), we ensure that: - Tasks visible on scx_tasks list have valid cgroups during scheduler load, as cgroup_mutex prevents cgroup destruction while the task is still linked. - All member tasks have ops.exit_task() called and are removed from scx_tasks before the cgroup can be destroyed and trigger ops.cgroup_exit(). This fix is made possible by the cgroup_task_dead() split in the previous patch. This also makes more sense resource-wise as there's no point in keeping scheduler side resources around for dead tasks. Reported-by: Dan Schatzberg <dschatzberg@meta.com> Cc: Peter Zijlstra <peterz@infradead.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-03sched_ext: Merge branch 'for-6.19' of ↵Tejun Heo
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup into for-6.19 Pull cgroup/for-6.19 to receive: 16dad7801aad ("cgroup: Rename cgroup lifecycle hooks to cgroup_task_*()") 260fbcb92bbe ("cgroup: Move dying_tasks cleanup from cgroup_task_release() to cgroup_task_free()") d245698d727a ("cgroup: Defer task cgroup unlink until after the task is done switching out") These are needed for the sched_ext cgroup exit ordering fix. Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-03cgroup: Defer task cgroup unlink until after the task is done switching outTejun Heo
When a task exits, css_set_move_task(tsk, cset, NULL, false) unlinks the task from its cgroup. From the cgroup's perspective, the task is now gone. If this makes the cgroup empty, it can be removed, triggering ->css_offline() callbacks that notify controllers the cgroup is going offline resource-wise. However, the exiting task can still run, perform memory operations, and schedule until the final context switch in finish_task_switch(). This creates a confusing situation where controllers are told a cgroup is offline while resource activities are still happening in it. While this hasn't broken existing controllers, it has caused direct confusion for sched_ext schedulers. Split cgroup_task_exit() into two functions. cgroup_task_exit() now only calls the subsystem exit callbacks and continues to be called from do_exit(). The css_set cleanup is moved to the new cgroup_task_dead() which is called from finish_task_switch() after the final context switch, so that the cgroup only appears empty after the task is truly done running. This also reorders operations so that subsys->exit() is now called before unlinking from the cgroup, which shouldn't break anything. Cc: Dan Schatzberg <dschatzberg@meta.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-29Merge branch 'linus/master' into sched/core, to resolve conflictPeter Zijlstra
Conflicts: kernel/sched/ext.c Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-10-28sched: Fix the do_set_cpus_allowed() locking fixPeter Zijlstra
Commit abfc01077df6 ("sched: Fix do_set_cpus_allowed() locking") overlooked that __balance_push_cpu_stop() calls select_fallback_rq() with rq->lock held. As a result, set_cpus_allowed_force() recursively takes rq->lock and the machine locks up. Run select_fallback_rq() earlier, without holding rq->lock. This opens up a race window where a task could get migrated out from under us, but that is harmless: we want the task migrated. select_fallback_rq() itself will not be subject to concurrency as it will be fully serialized by p->pi_lock, so there is no chance of set_cpus_allowed_force() getting called with different arguments and selecting different fallback CPUs for one task. Fixes: abfc01077df6 ("sched: Fix do_set_cpus_allowed() locking") Reported-by: Jan Polensky <japo@linux.ibm.com> Reported-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Jan Polensky <japo@linux.ibm.com> Closes: https://lore.kernel.org/oe-lkp/202510271206.24495a68-lkp@intel.com Link: https://patch.msgid.link/20251027110133.GI3245006@noisy.programming.kicks-ass.net
2025-10-16sched/ext: Fold balance_scx() into pick_task_scx()Peter Zijlstra
With pick_task() having an rf argument, it is possible to do the lock-break there, get rid of the weird balance/pick_task hack. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tejun Heo <tj@kernel.org>
2025-10-16sched: Add support to pick functions to take rfJoel Fernandes
Some pick functions like the internal pick_next_task_fair() already take rf but some others don't. We need this for scx's server pick function. Prepare for this by having pick functions accept it. [peterz: - added RETRY_TASK handling - removed pick_next_task_fair indirection] Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tejun Heo <tj@kernel.org>
2025-10-16sched: Detect per-class runqueue changesPeter Zijlstra
Have enqueue/dequeue set a per-class bit in rq->queue_mask. This then enables easy tracking of which runqueues are modified over a lock-break. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tejun Heo <tj@kernel.org>
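The tracking scheme described above can be sketched as a bitmask that is cleared before a lock-break and inspected afterwards. The class bits, the `toy_rq` structure, and the helper names are hypothetical; the log only states that enqueue/dequeue set a per-class bit in rq->queue_mask.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical class bits; the actual bit assignment is not given in the log. */
enum toy_class_bit {
	TOY_CLASS_FAIR = 1 << 0,
	TOY_CLASS_RT   = 1 << 1,
	TOY_CLASS_DL   = 1 << 2,
};

struct toy_rq {
	unsigned int queue_mask; /* classes touched since the last reset */
};

/* Called from both the enqueue and dequeue paths of a class. */
static void toy_mark_queue_changed(struct toy_rq *rq, unsigned int class_bit)
{
	rq->queue_mask |= class_bit;
}

/* Clear the mask right before dropping the lock. */
static void toy_lock_break_begin(struct toy_rq *rq)
{
	rq->queue_mask = 0;
}

/* After re-acquiring the lock, test whether any class of interest changed. */
static bool toy_changed_since_break(const struct toy_rq *rq, unsigned int classes)
{
	return (rq->queue_mask & classes) != 0;
}
```

A caller that only cares about higher-priority classes would pass just those bits, so activity in unrelated classes does not force a retry.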
2025-10-16sched: Mandate shared flags for sched_changePeter Zijlstra
Shrikanth noted that sched_change pattern relies on using shared flags. Suggested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-10-16sched: Cleanup the sched_change NOCLOCK usagePeter Zijlstra
Teach the sched_change pattern how to do update_rq_clock(); this allows for some simplifications / cleanups. Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16sched: Match __task_rq_{,un}lock()Peter Zijlstra
In preparation for adding more rules to __task_rq_lock(), such that __task_rq_unlock() will no longer be equivalent to rq_unlock(), make sure every __task_rq_lock() is matched by a __task_rq_unlock() and vice-versa. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16sched: Add locking comments to sched_class methodsPeter Zijlstra
'Document' the locking context the various sched_class methods are called under. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16sched: Make __do_set_cpus_allowed() use the sched_change patternPeter Zijlstra
Now that do_set_cpus_allowed() holds all the regular locks, convert it to use the sched_change pattern helper. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16sched: Rename do_set_cpus_allowed()Peter Zijlstra
Hopefully saner naming. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16sched: Fix do_set_cpus_allowed() lockingPeter Zijlstra
All callers of do_set_cpus_allowed() only take p->pi_lock, which is not sufficient to actually change the cpumask. Again, this is mostly ok in these cases, but it results in unnecessarily complicated reasoning. Furthermore, there is no reason whatsoever to not just take all the required locks, so do just that. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16sched: Fix migrate_disable_switch() lockingPeter Zijlstra
For some reason migrate_disable_switch() was more complicated than it needs to be, resulting in mind bending locking of dubious quality. Recognise that migrate_disable_switch() must be called before a context switch, but any place before that switch is equally good. Since the current place results in troubled locking, simply move the thing before taking rq->lock. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16sched: Move sched_class::prio_changed() into the change patternPeter Zijlstra
Move sched_class::prio_changed() into the change pattern. And while there, extend it with sched_class::get_prio() in order to fix the deadline situation. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16sched: Cleanup sched_delayed handling for class switchesPeter Zijlstra
Use the new sched_class::switching_from() method to dequeue delayed tasks before switching to another class. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Tejun Heo <tj@kernel.org>
2025-10-16sched: Fold sched_class::switch{ing,ed}_{to,from}() into the change patternPeter Zijlstra
Add {DE,EN}QUEUE_CLASS and fold the sched_class::switch* methods into the change pattern. This completes and makes the pattern more symmetric. This changes the order of callbacks slightly:

  OLD                            NEW
                               |
                               |  switching_from()
  dequeue_task();              |  dequeue_task()
  put_prev_task();             |  put_prev_task()
                               |  switched_from()
                               |
  ... change task ...          |  ... change task ...
                               |
  switching_to();              |  switching_to()
  enqueue_task();              |  enqueue_task()
  set_next_task();             |  set_next_task()
  prev_class->switched_from()  |
  switched_to()                |  switched_to()
                               |

Notably, it moves the switched_from() callback right after the dequeue/put. Existing implementations don't appear to be affected by this change in location -- specifically the task isn't enqueued on the class in question in either location. Make (CLASS)^(SAVE|MOVE), because there is nothing to save-restore when changing scheduling classes. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
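To make the NEW ordering from the OLD/NEW comparison above concrete, the following toy records the callback names in the order they would fire during a class switch. It is purely illustrative: nothing is dequeued or enqueued, and the helper names and the `change_task` placeholder are assumptions.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Append one callback name to a comma-separated trace buffer. */
static void toy_trace(char *buf, size_t len, const char *name)
{
	size_t used = strlen(buf);

	snprintf(buf + used, len - used, "%s%s", used ? "," : "", name);
}

/*
 * Record the NEW callback order for a class switch.
 * "change_task" stands in for the actual modification of the task.
 */
static void toy_class_switch_order(char *buf, size_t len)
{
	static const char *const order[] = {
		"switching_from", "dequeue_task", "put_prev_task", "switched_from",
		"change_task",
		"switching_to", "enqueue_task", "set_next_task", "switched_to",
	};
	size_t i;

	assert(len > 0);
	buf[0] = '\0';
	for (i = 0; i < sizeof(order) / sizeof(order[0]); i++)
		toy_trace(buf, len, order[i]);
}
```

The trace makes the symmetry visible: the "from" callbacks bracket the dequeue/put pair, and the "to" callbacks bracket the enqueue/set pair.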
2025-10-16sched: Employ sched_change guardsPeter Zijlstra
As proposed a long while ago -- and half done by scx -- wrap the scheduler's 'change' pattern in a guard helper. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
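A guard helper of this kind can be approximated in plain C with the GCC/Clang `cleanup` variable attribute, which runs a function when a variable goes out of scope. The sketch below is a userspace analogy under that assumption; the `toy_` names are hypothetical and the "queue" is a single flag rather than a real runqueue.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical task with just enough state to show the pattern. */
struct toy_task {
	int  prio;
	bool queued;
	int  requeues; /* how many times the guard re-enqueued the task */
};

struct toy_change_ctx {
	struct toy_task *p;
	bool was_queued;
};

/* Begin: take the task off its queue so it can be modified safely. */
static struct toy_change_ctx toy_change_begin(struct toy_task *p)
{
	struct toy_change_ctx ctx = { .p = p, .was_queued = p->queued };

	p->queued = false;
	return ctx;
}

/* End: runs automatically at scope exit and restores the queued state. */
static void toy_change_end(struct toy_change_ctx *ctx)
{
	if (ctx->was_queued) {
		ctx->p->queued = true;
		ctx->p->requeues++;
	}
}

/* GCC/Clang extension: call toy_change_end() when the variable leaves scope. */
#define toy_change_guard(name, p) \
	struct toy_change_ctx name __attribute__((cleanup(toy_change_end))) = \
		toy_change_begin(p)

/* Change a property of the task inside the guarded region. */
static void toy_set_prio(struct toy_task *p, int prio)
{
	toy_change_guard(ctx, p);

	assert(!p->queued); /* inside the guard the task is dequeued */
	p->prio = prio;
}
```

The benefit of the guard form is that every exit path from the function, including early returns, restores the queued state without repeating the teardown code.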
2025-10-14sched/deadline: Stop dl_server before CPU goes offlinePeter Zijlstra (Intel)
IBM CI tool reported kernel warning[1] when running a CPU removal operation through drmgr[2], i.e. "drmgr -c cpu -r -q 1":

  WARNING: CPU: 0 PID: 0 at kernel/sched/cpudeadline.c:219 cpudl_set+0x58/0x170
  NIP [c0000000002b6ed8] cpudl_set+0x58/0x170
  LR [c0000000002b7cb8] dl_server_timer+0x168/0x2a0
  Call Trace:
  [c000000002c2f8c0] init_stack+0x78c0/0x8000 (unreliable)
  [c0000000002b7cb8] dl_server_timer+0x168/0x2a0
  [c00000000034df84] __hrtimer_run_queues+0x1a4/0x390
  [c00000000034f624] hrtimer_interrupt+0x124/0x300
  [c00000000002a230] timer_interrupt+0x140/0x320

Git bisects to: commit 4ae8d9aa9f9d ("sched/deadline: Fix dl_server getting stuck")

This happens since:

  - dl_server hrtimer gets enqueued close to cpu offline, when kthread_park enqueues a fair task.
  - CPU goes offline and drmgr removes it from cpu_present_mask.
  - hrtimer fires and warning is hit.

Fix it by stopping the dl_server before CPU is marked dead.

[1]: https://lore.kernel.org/all/8218e149-7718-4432-9312-f97297c352b9@linux.ibm.com/
[2]: https://github.com/ibm-power-utilities/powerpc-utils/tree/next/src/drmgr

[sshegde: wrote the changelog and tested it] Fixes: 4ae8d9aa9f9d ("sched/deadline: Fix dl_server getting stuck") Closes: https://lore.kernel.org/all/8218e149-7718-4432-9312-f97297c352b9@linux.ibm.com Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reported-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
2025-09-30Merge tag 'timers-core-2025-09-29' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer core updates from Thomas Gleixner: - Address the inconsistent shutdown sequence of per CPU clockevents on CPU hotplug, which only removed it from the core but failed to invoke the actual device driver shutdown callback. This kept the timer active, which prevented power savings and caused pointless noise in virtualization. - Encapsulate the open coded access to the hrtimer clock base, which is a private implementation detail, so that the implementation can be changed without breaking a lot of usage sites. - Enhance the debug output of the clocksource watchdog to provide better information for analysis. - The usual set of cleanups and enhancements all over the place * tag 'timers-core-2025-09-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: time: Fix spelling mistakes in comments clocksource: Print durations for sync check unconditionally LoongArch: Remove clockevents shutdown call on offlining tick: Do not set device to detached state in tick_shutdown() hrtimer: Reorder branches in hrtimer_clockid_to_base() hrtimer: Remove hrtimer_clock_base::get_time hrtimer: Use hrtimer_cb_get_time() helper media: pwm-ir-tx: Avoid direct access to hrtimer clockbase ALSA: hrtimer: Avoid direct access to hrtimer clockbase lib: test_objpool: Avoid direct access to hrtimer clockbase sched/core: Avoid direct access to hrtimer clockbase timers/itimer: Avoid direct access to hrtimer clockbase posix-timers: Avoid direct access to hrtimer clockbase jiffies: Remove obsolete SHIFTED_HZ comment
2025-09-30Merge tag 'sched-core-2025-09-26' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "Core scheduler changes: - Make migrate_{en,dis}able() inline, to improve performance (Menglong Dong) - Move STDL_INIT() functions out-of-line (Peter Zijlstra) - Unify the SCHED_{SMT,CLUSTER,MC} Kconfig (Peter Zijlstra) Fair scheduling: - Defer throttling to when tasks exit to user-space, to reduce the chance & impact of throttle-preemption with held locks and other resources (Aaron Lu, Valentin Schneider) - Get rid of sched_domains_curr_level hack for tl->cpumask(), as the warning was getting triggered on certain topologies (Peter Zijlstra) Misc cleanups & fixes: - Header cleanups (Menglong Dong) - Fix race in push_dl_task() (Harshit Agarwal)" * tag 'sched-core-2025-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Fix some typos in include/linux/preempt.h sched: Make migrate_{en,dis}able() inline rcu: Replace preempt.h with sched.h in include/linux/rcupdate.h arch: Add the macro COMPILE_OFFSETS to all the asm-offsets.c sched/fair: Do not balance task to a throttled cfs_rq sched/fair: Do not special case tasks in throttled hierarchy sched/fair: update_cfs_group() for throttled cfs_rqs sched/fair: Propagate load for throttled cfs_rq sched/fair: Get rid of throttled_lb_pair() sched/fair: Task based throttle time accounting sched/fair: Switch to task based throttle model sched/fair: Implement throttle task work and related helpers sched/fair: Add related data structure for task based throttle sched: Unify the SCHED_{SMT,CLUSTER,MC} Kconfig sched: Move STDL_INIT() functions out-of-line sched/fair: Get rid of sched_domains_curr_level hack for tl->cpumask() sched/deadline: Fix race in push_dl_task()
2025-09-30Merge tag 'sched_ext-for-6.18' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext updates from Tejun Heo: - Code organization cleanup. Separate internal types and accessors to ext_internal.h to reduce the size of ext.c and improve maintainability. - Prepare for cgroup sub-scheduler support by adding @sch parameter to various functions and helpers, reorganizing scheduler instance handling, and dropping obsolete helpers like scx_kf_exit() and kf_cpu_valid(). - Add new scx_bpf_cpu_curr() and scx_bpf_locked_rq() BPF helpers to provide safer access patterns with proper RCU protection. scx_bpf_cpu_rq() is deprecated with warnings due to potential race conditions. - Improve debugging with migration-disabled counter in error state dumps, SCX_EFLAG_INITIALIZED flag, bitfields for warning flags, and other enhancements to help diagnose issues. - Use cgroup_lock/unlock() for cgroup synchronization instead of scx_cgroup_rwsem based synchronization. This is simpler and allows enable/disable paths to synchronize against cgroup changes independent of the CPU controller. - rhashtable_lookup() replacement to avoid redundant RCU locking was reverted due to RCU usage warnings. Will be redone once rhashtable is updated to use rcu_dereference_all(). - Other misc updates and fixes including bypass handling improvements, scx_task_iter_relock() improvements, tools/sched_ext updates, and compatibility helpers. 
* tag 'sched_ext-for-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (28 commits) Revert "sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast()" sched_ext: Misc updates around scx_sched instance pointer sched_ext: Drop scx_kf_exit() and scx_kf_error() sched_ext: Add the @sch parameter to scx_dsq_insert_preamble/commit() sched_ext: Drop kf_cpu_valid() sched_ext: Add the @sch parameter to ext_idle helpers sched_ext: Add the @sch parameter to __bstr_format() sched_ext: Separate out scx_kick_cpu() and add @sch to it tools/sched_ext: scx_qmap: Make debug output quieter by default sched_ext: Make qmap dump operation non-destructive sched_ext: Add SCX_EFLAG_INITIALIZED to indicate successful ops.init() sched_ext: Use bitfields for boolean warning flags sched_ext: Fix stray scx_root usage in task_can_run_on_remote_rq() sched_ext: Improve SCX_KF_DISPATCH comment sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast() sched_ext: Verify RCU protection in scx_bpf_cpu_curr() sched_ext: Add migration-disabled counter to error state dump sched_ext: Fix NULL dereference in scx_bpf_cpu_rq() warning tools/sched_ext: Add compat helper for scx_bpf_cpu_curr() sched_ext: deprecation warn for scx_bpf_cpu_rq() ...