| Age | Commit message (Collapse) | Author |
|
|
|
|
|
mutex_init() invokes __mutex_init() providing the name of the lock and
a pointer to a the lock class. With LOCKDEP enabled this information is
useful but without LOCKDEP it not used at all. Passing the pointer
information of the lock class might be considered negligible but the
name of the lock is passed as well and the string is stored. This
information is wasting storage.
Split __mutex_init() into a _genereic() variant doing the initialisation
of the lock and a _lockdep() version which does _genereic() plus the
lockdep bits. Restrict the lockdep version to lockdep enabled builds
allowing the compiler to remove the unused parameter.
This results in the following size reduction:
text data bss dec filename
| 30237599 8161430 1176624 39575653 vmlinux.defconfig
| 30233269 8149142 1176560 39558971 vmlinux.defconfig.patched
-4.2KiB -12KiB
| 32455099 8471098 12934684 53860881 vmlinux.defconfig.lockdep
| 32455100 8471098 12934684 53860882 vmlinux.defconfig.patched.lockdep
| 27152407 7191822 2068040 36412269 vmlinux.defconfig.preempt_rt
| 27145937 7183630 2067976 36397543 vmlinux.defconfig.patched.preempt_rt
-6.3KiB -8KiB
| 29382020 7505742 13784608 50672370 vmlinux.defconfig.preempt_rt.lockdep
| 29376229 7505742 13784544 50666515 vmlinux.defconfig.patched.preempt_rt.lockdep
-5.6KiB
[peterz: folded fix from boqun]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Link: https://lkml.kernel.org/r/20251125145425.68319-1-boqun.feng@gmail.com
Link: https://patch.msgid.link/20251105142350.Tfeevs2N@linutronix.de
|
|
- In order to prepare the layout for nohz_full work deferral to
user exit, the context tracking state must shrink the counter
of transitions to/from RCU not watching. The only possible hazard
is to trigger wrap-around more easily, delaying a bit grace periods
when that happens. This should be a rare event though. Yet add
debugging and torture code to test that assumption.
- Fix memory leak on locktorture module
- Annotate accesses in rculist_nulls.h to prevent from KCSAN warnings.
On recent discussions, we also concluded that all those WRITE_ONCE()
and READ_ONCE() on list APIs deserve appropriate comments. Something
to be expected for the next cycle.
- Provide a script to apply several configs to several commits with torture.
- Allow torture to reuse a build directory in order to save needless
rebuild time.
- Various cleanups.
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Borislav Petkov:
- Have timekeeping aux clocks sysfs interface setup function return an
error code on failure instead of success
* tag 'timers_urgent_for_v6.18_rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timekeeping: Fix error code in tk_aux_sysfs_init()
|
|
The __mm_flags_set_word() function is slightly ambiguous - we use 'set' to
refer to setting individual bits (such as in mm_flags_set()) but here we
use it to refer to overwriting the value altogether.
Rename it to __mm_flags_overwrite_word() to eliminate this ambiguity.
We additionally simplify the functions, eliminating unnecessary
bitmap_xxx() operations (the compiler would have optimised these out but
it's worth being as clear as we can be here).
Link: https://lkml.kernel.org/r/8f0bc556e1b90eca8ea5eba41f8d5d3f9cd7c98a.1764064557.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Acked-by: Alice Ryhl <aliceryhl@google.com> [rust]
Cc: Alex Gaynor <alex.gaynor@gmail.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andreas Hindborg <a.hindborg@kernel.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Björn Roy Baron <bjorn3_gh@protonmail.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Chris Li <chrisl@kernel.org>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Gary Guo <gary@garyguo.net>
Cc: Gregory Price <gourry@gourry.net>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mathew Brost <matthew.brost@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Trevor Gross <tmgross@umich.edu>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Wei Xu <weixugc@google.com>
Cc: xu xin <xu.xin16@zte.com.cn>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Updating a BPF_MAP_TYPE_HASH_OF_MAPS or BPF_MAP_TYPE_ARRAY_OF_MAPS via
bpf_map_update_elem() is very expensive.
In one of our workloads, we're inserting ~1400 maps of type
BPF_MAP_TYPE_ARRAY into a BPF_MAP_TYPE_ARRAY_OF_MAPS. This takes ~21
seconds on a single thread, with an average of ~15ms per call:
Function Name: map_update_elem
Number of calls: 1369
Total time: 21s 182ms 966µs
Maximum: 47ms 937µs
Average: 15ms 473µs
Minimum: 7µs
Profiling shows that nearly all of this time is going to synchronize_rcu(),
via maybe_wait_bpf_programs() in map_update_elem().
The call to synchronize_rcu() is done to ensure that after
bpf_map_update_elem() returns, no BPF programs are still looking at the old
value of the map, per commit 1ae80cf31938 ("bpf: wait for running BPF
programs when updating map-in-map").
As discussed on the bpf mailing list, replace synchronize_rcu() with
synchronize_rcu_expedited(). This is 175x faster: it now takes an average
of 88 microseconds per call, for a total of 127 milliseconds in the same
benchmark:
Function Name: map_update_elem
Number of calls: 1439
Total time: 127ms 626µs
Maximum: 445µs
Average: 88µs
Minimum: 10µs
Link: https://lore.kernel.org/bpf/CAH6OuBR=w2kybK6u7aH_35B=Bo1PCukeMZefR=7V4Z2tJNK--Q@mail.gmail.com/
Signed-off-by: Ritesh Oedayrajsingh Varma <ritesh@superluminal.eu>
Link: https://lore.kernel.org/r/20251128000422.20462-1-ritesh@superluminal.eu
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Make kprobe_multi_link_prog_run() always inline to obtain better
performance. Before this patch, the bench performance is:
./bench trig-kprobe-multi
Setting up benchmark 'trig-kprobe-multi'...
Benchmark 'trig-kprobe-multi' started.
Iter 0 ( 95.485us): hits 62.462M/s ( 62.462M/prod), [...]
Iter 1 (-80.054us): hits 62.486M/s ( 62.486M/prod), [...]
Iter 2 ( 13.572us): hits 62.287M/s ( 62.287M/prod), [...]
Iter 3 ( 76.961us): hits 62.293M/s ( 62.293M/prod), [...]
Iter 4 (-77.698us): hits 62.394M/s ( 62.394M/prod), [...]
Iter 5 (-13.399us): hits 62.319M/s ( 62.319M/prod), [...]
Iter 6 ( 77.573us): hits 62.250M/s ( 62.250M/prod), [...]
Summary: hits 62.338 ± 0.083M/s ( 62.338M/prod)
And after this patch, the performance is:
Iter 0 (454.148us): hits 66.900M/s ( 66.900M/prod), [...]
Iter 1 (-435.540us): hits 68.925M/s ( 68.925M/prod), [...]
Iter 2 ( 8.223us): hits 68.795M/s ( 68.795M/prod), [...]
Iter 3 (-12.347us): hits 68.880M/s ( 68.880M/prod), [...]
Iter 4 ( 2.291us): hits 68.767M/s ( 68.767M/prod), [...]
Iter 5 ( -1.446us): hits 68.756M/s ( 68.756M/prod), [...]
Iter 6 ( 13.882us): hits 68.657M/s ( 68.657M/prod), [...]
Summary: hits 68.792 ± 0.087M/s ( 68.792M/prod)
As we can see, the performance of kprobe-multi increase from 62M/s to
68M/s.
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Link: https://lore.kernel.org/r/20251126085246.309942-1-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
While previous commits sufficiently address the deadlocks, there are
still scenarios where queueing of waiters in NMIs can exacerbate the
possibility of timeouts.
Consider the case below:
CPU 0
<NMI>
res_spin_lock(A) -> becomes non-head waiter
</NMI>
lock owner in CS or pending waiter spinning
CPU 1
res_spin_lock(A) -> head waiter spinning on owner/pending bits
In such a scenario, the non-head waiter in NMI on CPU 0 will not poll
for deadlocks or timeout since it will simply queue behind previous
waiter (head on CPU 1), and also not enter the trylock fallback since
no rqspinlock queue waiter is active on CPU 0. In such a scenario, the
transaction initiated by the head waiter on CPU 1 will timeout,
signalling the NMI and ending the cyclic dependency, but it will cost
250 ms of time.
Instead, the NMI on CPU 0 could simply check for the presence of an AA
deadlock and only proceed with queueing on success. Add such a check
right before any form of queueing is initiated.
The reason the AA deadlock check is not used in conjunction with
in_nmi() is that a similar case could occur due to a reentrant path
in the owner's critical section, and unconditionally checking for AA
before entering the queueing path avoids expensive timeouts. Non-NMI
reentrancy only happens at controlled points in the slow path (with
specific tracepoints which do not impede the forward progress of a
waiter loop), or in the owner CS, while NMIs can land anywhere.
While this check is only needed for non-head waiter queueing, checking
whether we are head or not is racy without xchg_tail, and after that
point, we are already queued, hence for simplicity we must invoke the
check unconditionally.
Note that a more contrived case could still be constructed by using two
locks, and interrupting the progress of the respective owners by
non-head waiters of the other lock, in an ABBA fashion, which would
still not be covered by the current set of checks and conditions. It
would still lead to a timeout though, and not a deadlock. An ABBA check
cannot happen optimistically before the queueing, since it can be racy,
and needs to be happen continuously during the waiting period, which
would then require an unlinking step for queued NMI/reentrant waiters.
This is beyond the scope of this patch.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20251128232802.1031906-6-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
The original trylock fallback was inherited from qspinlock, and then
reused for the reentrant NMIs while the slow path is active. However,
under contention, it is very unlikely for the trylock to succeed in
taking the lock. In addition, a trylock also has no fairness guarantees,
and thus is prone to starvation issues under extreme scenarios.
The original qspinlock had no choice in terms of returning an error the
caller; if the node count was breached, it had to fall back to trylock
to attempt to take the lock. In case of rqspinlock, we do have the
option of returning to the user. Thus, simply attempt the trylock once,
and instead of spinning, return an error in case the lock cannot be
taken.
This ends up significantly reducing the time spent in the trylock
fallback, since we no longer wait for the timeout duration trying to
aimlessly acquire the lock when there's a high-probability that under
contention, it won't be available to us anyway.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20251128232802.1031906-5-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
In addition to deferring to the trylock fallback in NMIs, only do so
when an rqspinlock waiter is queued on the current CPU. This is detected
by noticing a non-zero node index. This allows NMI waiters to join the
waiter queue if it isn't interrupting an existing rqspinlock waiter, and
increase the chances of fairly obtaining the lock, performing deadlock
detection as the head, and not being starved while attempting the
trylock.
The trylock path in particular is unlikely to succeed under contention,
as it relies on the lock word becoming 0, which indicates no contention.
This means that the most likely result for NMIs attempting a trylock is
a timeout under contention if they don't hit an AA or ABBA case.
The core problem being addressed through the fixed commit was removing
the dependency edge between an NMI queue waiter and the queue waiter it
is interrupting. Whenever a circular dependency forms, and with no way
to break it (as non-head waiters don't poll for deadlocks or timeouts),
we would enter into a deadlock. A trylock either breaks such an edge by
probing for deadlocks, and finally terminating the waiting loop using a
timeout.
By excluding queueing on CPUs where the node index is non-zero for NMIs,
this sort of dependency is broken. The CPU enters the trylock path for
those cases, and falls back to deadlock checks and timeouts. However, in
other case where it doesn't interrupt the CPU in the slow path while its
queued on the lock, it can join the queue as a normal waiter, and avoid
trylock associated starvation and subsequent timeouts.
There are a few remaining cases here that matter: the NMI can still
preempt the owner in its critical section, and if it queues as a
non-head waiter, it can end up impeding the progress of the owner. While
this won't deadlock, since the head waiter will eventually signal the
NMI waiter to either stop (due to a timeout), it can still lead to long
timeouts. These gaps will be addressed in subsequent commits.
Note that while the node count detection approach is less conservative
than simply deferring NMIs to trylock, it is going to return errors
where attempts to lock B in NMI happen while waiters for lock A are in a
lower context on the same CPU. However, this only occurs when the lower
context is queued in the slow path, and the NMI attempt can proceed
without failure in all other cases. To continue to prevent AA deadlocks
(or ABBA in a similar NMI interrupting lower context pattern), we'd need
a more fleshed out algorithm to unlink NMI waiters after they queue and
detect such cases. However, all that complexity isn't appealing yet to
reduce the failure rate in the small window inside the slow path.
It is important to note that reentrancy in the slow path can also happen
through trace_contention_{begin,end}, but in those cases, unlike an NMI,
the forward progress of the head waiter (or the predecessor in general)
is not being blocked.
Fixes: 0d80e7f951be ("rqspinlock: Choose trylock fallback for NMI waiters")
Reported-by: Ritesh Oedayrajsingh Varma <ritesh@superluminal.eu>
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20251128232802.1031906-4-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Currently, while we enter the check_timeout call immediately due to the
way the ts.spin is initialized, we still invoke the AA and ABBA checks
in the second invocation, and only initialize the timestamp in the first
one. Since each iteration is at least done with a 1ms delay, this can
add delays in detection of AA deadlocks, up to a ms.
Rework check_timeout() to avoid this. First, call check_deadlock_AA()
while initializing the timestamps for the wait period. This also means
that we only do it once per waiting period, instead of every invocation.
Finally, drop check_deadlock() and call check_deadlock_ABBA() directly.
To save on unnecessary ktime_get_mono_fast_ns() in case of AA deadlock,
sample the time only if it returns 0.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20251128232802.1031906-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Ritesh reported that timeouts occurred frequently for rqspinlock despite
reentrancy on the same lock on the same CPU in [0]. This patch closes
one of the races leading to this behavior, and reduces the frequency of
timeouts.
We currently have a tiny window between the fast-path cmpxchg and the
grabbing of the lock entry where an NMI could land, attempt the same
lock that was just acquired, and end up timing out. This is not ideal.
Instead, move the lock entry acquisition from the fast path to before
the cmpxchg, and remove the grabbing of the lock entry in the slow path,
assuming it was already taken by the fast path. The TAS fallback is
invoked directly without being preceded by the typical fast path,
therefore we must continue to grab the deadlock detection entry in that
case.
Case on lock leading to missed AA:
cmpxchg lock A
<NMI>
... rqspinlock acquisition of A
... timeout
</NMI>
grab_held_lock_entry(A)
There is a similar case when unlocking the lock. If the NMI lands
between the WRITE_ONCE and smp_store_release, it is possible that we end
up in a situation where the NMI fails to diagnose the AA condition,
leading to a timeout.
Case on unlock leading to missed AA:
WRITE_ONCE(rqh->locks[rqh->cnt - 1], NULL)
<NMI>
... rqspinlock acquisition of A
... timeout
</NMI>
smp_store_release(A->locked, 0)
The patch changes the order on unlock to smp_store_release() succeeded
by WRITE_ONCE() of NULL. This avoids the missed AA detection described
above, but may lead to a false positive if the NMI lands between these
two statements, which is acceptable (and preferred over a timeout).
The original intention of the reverse order on unlock was to prevent the
following possible misdiagnosis of an ABBA scenario:
grab entry A
lock A
grab entry B
lock B
unlock B
smp_store_release(B->locked, 0)
grab entry B
lock B
grab entry A
lock A
! <detect ABBA>
WRITE_ONCE(rqh->locks[rqh->cnt - 1], NULL)
If the store release were is after the WRITE_ONCE, the other CPU would
not observe B in the table of the CPU unlocking the lock B. However,
since the threads are obviously participating in an ABBA deadlock, it
is no longer appealing to use the order above since it may lead to a
250 ms timeout due to missed AA detection.
[0]: https://lore.kernel.org/bpf/CAH6OuBTjG+N=+GGwcpOUbeDN563oz4iVcU3rbse68egp9wj9_A@mail.gmail.com
Fixes: 0d80e7f951be ("rqspinlock: Choose trylock fallback for NMI waiters")
Reported-by: Ritesh Oedayrajsingh Varma <ritesh@superluminal.eu>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20251128232802.1031906-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
A use-after-free bug may be triggered by calling bpf_inode_storage_get()
in a BPF LSM program hooked to file_alloc_security. Disable the hook to
prevent this from happening.
The cause of the bug is shown in the trace below. In alloc_file(), a
file struct is first allocated through kmem_cache_alloc(). Then,
file_alloc_security hook is invoked. Since the zero initialization or
assignment of f->f_inode happen after this LSM hook, a BPF program may
get a dangeld inode pointer by walking the file struct.
alloc_file()
-> alloc_empty_file()
-> f = kmem_cache_alloc()
-> init_file()
-> security_file_alloc() // f->f_inode not init-ed yet!
-> f->f_inode = NULL;
-> file_init_path()
-> f->f_inode = path->dentry->d_inode
Reported-by: Kaiyan Mei <M202472210@hust.edu.cn>
Reported-by: Yinhao Hu <dddddd@hust.edu.cn>
Reported-by: Dongliang Mu <dzm91@hust.edu.cn>
Closes: https://lore.kernel.org/bpf/1d2d1968.47cd3.19ab9528e94.Coremail.kaiyanm@hust.edu.cn/
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Link: https://lore.kernel.org/r/20251126202927.2584874-1-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Do not abuse the strict_alignment_once flag, and check if the map is
an instruction array inside the check_ptr_alignment() function.
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Link: https://lore.kernel.org/r/20251128063224.1305482-3-a.s.protopopov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
The original implementation added a hack to check_mem_access()
to prevent programs from writing into insn arrays. To get rid
of this hack, enforce BPF_F_RDONLY_PROG on map creation.
Also fix the corresponding selftest, as the error message changes
with this patch.
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Link: https://lore.kernel.org/r/20251128063224.1305482-2-a.s.protopopov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Add performance testing for common context synchronizations
(Preemption, IRQ, Softirq) and per-cpu increments. Those are
relevant comparisons against SRCU-fast read side APIs, especially
as they are planned to synchronize further tracing fast-path code.
|
|
Merge PM QoS updates and a cpupower utility update for 6.19-rc1:
- Introduce and document a QoS limit on CPU exit latency during wakeup
from suspend-to-idle (Ulf Hansson)
- Add support for building libcpupower statically (Zuo An)
* pm-qos:
Documentation: power/cpuidle: Document the CPU system wakeup latency QoS
cpuidle: Respect the CPU system wakeup QoS limit for cpuidle
sched: idle: Respect the CPU system wakeup QoS limit for s2idle
pmdomain: Respect the CPU system wakeup QoS limit for cpuidle
pmdomain: Respect the CPU system wakeup QoS limit for s2idle
PM: QoS: Introduce a CPU system wakeup QoS limit
* pm-tools:
tools/power/cpupower: Support building libcpupower statically
|
|
Merge energy model management updates and operating performance points
(OPP) library changes for 6.19-rc1:
- Add support for sending netlink notifications to user space on energy
model updates (Changwoo Mini, Peng Fan)
- Minor improvements to the Rust OPP interface (Tamir Duberstein)
- Fixes to scope-based pointers in the OPP library (Viresh Kumar)
* pm-em:
PM: EM: Add to em_pd_list only when no failure
PM: EM: Notify an event when the performance domain changes
PM: EM: Implement em_notify_pd_created/updated()
PM: EM: Implement em_notify_pd_deleted()
PM: EM: Implement em_nl_get_pd_table_doit()
PM: EM: Implement em_nl_get_pds_doit()
PM: EM: Add an iterator and accessor for the performance domain
PM: EM: Add a skeleton code for netlink notification
PM: EM: Add em.yaml and autogen files
PM: EM: Expose the ID of a performance domain via debugfs
PM: EM: Assign a unique ID when creating a performance domain
* pm-opp:
rust: opp: simplify callers of `to_c_str_array`
OPP: Initialize scope-based pointers inline
rust: opp: fix broken rustdoc link
|
|
Merge updates related to system suspend and hibernation for 6.19-rc1:
- Replace snprintf() with scnprintf() in show_trace_dev_match()
(Kaushlendra Kumar)
- Fix memory allocation error handling in pm_vt_switch_required()
(Malaya Kumar Rout)
- Introduce CALL_PM_OP() macro and use it to simplify code in
generic PM operations (Kaushlendra Kumar)
- Add module param to backtrace all CPUs in the device power management
watchdog (Sergey Senozhatsky)
- Rework message printing in swsusp_save() (Rafael Wysocki)
- Make it possible to change the number of hibernation compression
threads (Xueqin Luo)
- Clarify that only cgroup1 freezer uses PM freezer (Tejun Heo)
- Add document on debugging shutdown hangs to PM documentation and
correct a mistaken configuration option in it (Mario Limonciello)
- Shut down wakeup source timer before removing the wakeup source from
the list (Kaushlendra Kumar, Rafael Wysocki)
- Introduce new PMSG_POWEROFF event for system shutdown handling with
the help of PM device callbacks (Mario Limonciello)
- Make pm_test delay interruptible by wakeup events (Riwen Lu)
- Clean up kernel-doc comment style usage in the core hibernation
code and remove unuseful comments from it (Sunday Adelodun, Rafael
Wysocki)
- Add support for handling wakeup events and aborting the suspend
process while it is syncing file systems (Samuel Wu, Rafael Wysocki)
* pm-sleep: (21 commits)
PM: hibernate: Extra cleanup of comments in swap handling code
PM: sleep: Call pm_sleep_fs_sync() instead of ksys_sync_helper()
PM: sleep: Add support for wakeup during filesystem sync
PM: hibernate: Clean up kernel-doc comment style usage
PM: suspend: Make pm_test delay interruptible by wakeup events
usb: sl811-hcd: Add PM_EVENT_POWEROFF into suspend callbacks
scsi: Add PM_EVENT_POWEROFF into suspend callbacks
PM: Introduce new PMSG_POWEROFF event
PM: wakeup: Update after recent wakeup source removal ordering change
PM: wakeup: Delete timer before removing wakeup source from list
Documentation: power: Correct a mistaken configuration option
Documentation: power: Add document on debugging shutdown hangs
freezer: Clarify that only cgroup1 freezer uses PM freezer
PM: hibernate: add sysfs interface for hibernate_compression_threads
PM: hibernate: make compression threads configurable
PM: hibernate: dynamically allocate crc->unc_len/unc for configurable threads
PM: hibernate: Rework message printing in swsusp_save()
PM: dpm_watchdog: add module param to backtrace all CPUs
PM: sleep: Introduce CALL_PM_OP() macro to simplify code
PM: console: Fix memory allocation error handling in pm_vt_switch_required()
...
|
|
Merge a core power management update and runtime PM framework updates
for 6.19-rc1:
- Add WQ_UNBOUND to pm_wq workqueue (Marco Crivellari)
- Add runtime PM wrapper macros for ACQUIRE()/ACQUIRE_ERR() and use
them in the PCI core and the ACPI TAD driver (Rafael Wysocki)
- Improve runtime PM in the ACPI TAD driver (Rafael Wysocki)
- Update pm_runtime_allow/forbid() documentation (Rafael Wysocki)
- Fix typos in runtime.c comments (Malaya Kumar Rout)
* pm-core:
PM: WQ_UNBOUND added to pm_wq workqueue
* pm-runtime:
PCI/sysfs: Use PM_RUNTIME_ACQUIRE()/PM_RUNTIME_ACQUIRE_ERR()
ACPI: TAD: Use PM_RUNTIME_ACQUIRE()/PM_RUNTIME_ACQUIRE_ERR()
PM: runtime: Wrapper macros for ACQUIRE()/ACQUIRE_ERR()
PM: runtime: fix typos in runtime.c comments
ACPI: TAD: Improve runtime PM using guard macros
ACPI: TAD: Rearrange runtime PM operations in acpi_tad_remove()
PM: runtime: docs: Update pm_runtime_allow/forbid() documentation
|
|
This commit adds refscale readers based on srcu_read_lock_fast_updown()
and srcu_read_unlock_fast_updown()
("refscale.scale_type=srcu-fast-updown"). On my x86 laptop, these are
about 2.2ns per pair.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: <bpf@vger.kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
|
|
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-24-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-23-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux
Pull dma-mapping fixes from Marek Szyprowski:
"Two last minute fixes for the recently modified DMA API infrastructure:
- proper handling of DMA_ATTR_MMIO in dma_iova_unlink() function (me)
- regression fix for the code refactoring related to P2PDMA (Pranjal
Shrivastava)"
* tag 'dma-mapping-6.18-2025-11-27' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
dma-direct: Fix missing sg_dma_len assignment in P2PDMA bus mappings
iommu/dma: add missing support for DMA_ATTR_MMIO for dma_iova_unlink()
|
|
The trace_marker_raw file in tracefs takes a buffer from user space that
contains an id as well as a raw data string which is usually a binary
structure. The structure used has the following:
struct raw_data_entry {
struct trace_entry ent;
unsigned int id;
char buf[];
};
Since the passed in "cnt" variable is both the size of buf as well as the
size of id, the code to allocate the location on the ring buffer had:
size = struct_size(entry, buf, cnt - sizeof(entry->id));
Which is quite ugly and hard to understand. Instead, add a helper macro
called struct_offset() which then changes the above to a simple and easy
to understand:
size = struct_offset(entry, id) + cnt;
This will likely come in handy for other use cases too.
Link: https://lore.kernel.org/all/CAHk-=whYZVoEdfO1PmtbirPdBMTV9Nxt9f09CK0k6S+HJD3Zmg@mail.gmail.com/
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org>
Link: https://patch.msgid.link/20251126145249.05b1770a@gandalf.local.home
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Kees Cook <kees@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Conflicts:
net/xdp/xsk.c
0ebc27a4c67d ("xsk: avoid data corruption on cq descriptor number")
8da7bea7db69 ("xsk: add indirect call for xsk_destruct_skb")
30ed05adca4a ("xsk: use a smaller new lock for shared pool case")
https://lore.kernel.org/20251127105450.4a1665ec@canb.auug.org.au
https://lore.kernel.org/eb4eee14-7e24-4d1b-b312-e9ea738fefee@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The macro for_each_console_srcu iterates over all registered consoles. It's
implied that all registered consoles have CON_ENABLED flag set, making
the check for the flag unnecessary. Call console_is_usable function to
fully verify if the given console is usable before calling the ->unblank
callback.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://patch.msgid.link/20251121-printk-cleanup-part2-v2-3-57b8b78647f4@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
Make do_proc_douintvec static and export proc_douintvec_conv wrapper
function for external use. This is to keep with the design in sysctl.c.
Update fs/pipe.c to use the new public API.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
|
|
Create a converter for the pipe-max-size proc_handler using the
SYSCTL_UINT_CONV_CUSTOM. Move SYSCTL_CONV_IDENTITY macro to the sysctl
header to make it available for pipe size validation. Keep returning
-EINVAL when (val == 0) by using a range checking converter and setting
the minimal valid value (extern1) to SYSCTL_ONE. Keep round_pipe_size by
passing it as the operation for SYSCTL_USER_TO_KERN_INT_CONV.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
|
|
Move proc_doulongvec_ms_jiffies_minmax to kernel/time/jiffies.c. Create
a non static wrapper function proc_doulongvec_minmax_conv that
forwards the custom convmul and convdiv argument values to the internal
do_proc_doulongvec_minmax. Remove unused linux/times.h include from
kernel/sysctl.c.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
|
|
Move integer jiffies converters (proc_dointvec{_,_ms_,_userhz_}jiffies
and proc_dointvec_ms_jiffies_minmax) to kernel/time/jiffies.c. Error
stubs for when CONFIG_PRCO_SYSCTL is not defined are not reproduced
because all the jiffies converters go through proc_dointvec_conv which
is already stubbed. This is part of the greater effort to move sysctl
logic out of kernel/sysctl.c thereby reducing merge conflicts in
kernel/sysctl.c.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
|
|
Move SYSCTL_USER_TO_KERN_UINT_CONV and SYSCTL_UINT_CONV_CUSTOM macros to
include/linux/sysctl.h. No need to embed sysctl_kern_to_user_uint_conv
in a macro as it will not need a custom kernel pointer operation. This
is a preparation commit to enable jiffies converter creation outside
kernel/sysctl.c.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
|
|
Move direction macros (SYSCTL_{USER_TO_KERN,KERN_TO_USER}) and the
integer converter macros (SYSCTL_{USER_TO_KERN,KERN_TO_USER}_INT_CONV,
SYSCTL_INT_CONV_CUSTOM) into include/linux/sysctl.h. This is a
preparation commit to enable jiffies converter creation outside
kernel/sysctl.c.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
|
|
The new non-static proc_dointvec_conv forwards a custom converter
function to do_proc_dointvec from outside the sysctl scope. Rename the
do_proc_dointvec call points so any future changes to proc_dointvec_conv
are propagated in sysctl.c This is a preparation commit that allows the
integer jiffie converter functions to move out of kernel/sysctl.c.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
|
|
The buffer arg in proc handler functions have been void* (no __user
qualifier) since commit 32927393dc1c ("sysctl: pass kernel pointers to
->proc_handler"). The __user qualifier was erroneously brought back in
commit 0df8bdd5e3b3 ("stackleak: move stack_erasing sysctl to
stackleak.c"). This fixes the error by removing the __user qualifier.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202510221719.3ggn070M-lkp@intel.com/
Signed-off-by: Joel Granados <joel.granados@kernel.org>
|
|
Replace sysctl_user_to_kern_uint_conv function with
SYSCTL_USER_TO_KERN_UINT_CONV macro that accepts u_ptr_op parameter for
value transformation. Replacing sysctl_kern_to_user_uint_conv is not
needed as it will only be used from within sysctl.c. This is a
preparation commit for creating a custom converter in fs/pipe.c. No
Functional changes are intended.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
|
|
Add k_ptr_range_check parameter to SYSCTL_UINT_CONV_CUSTOM macro to
enable range validation using table->extra1/extra2. Replace
do_proc_douintvec_minmax_conv with do_proc_uint_conv_minmax generated
by the updated macro.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
|
|
Pass sysctl_{user_to_kern,kern_to_user}_uint_conv (unsigned integer
uni-directional converters) to the new SYSCTL_UINT_CONV_CUSTOM macro
to create do_proc_douintvec_conv's replacement (do_proc_uint_conv).
This is a preparation commit to use the unsigned integer converter from
outside sysctl. No functional change is intended.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
|
|
Extend the SYSCTL_INT_CONV_CUSTOM macro with a k_ptr_range_check
parameter to conditionally generate range validation code. When enabled,
validation is done against table->extra1 (min) and table->extra2 (max)
bounds before assignment. Add base minmax and ms_jiffies_minmax
converter instances that utilize the range checking functionality.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
|
|
New SYSCTL_INT_CONV_CUSTOM macro creates "bi-directional" converters
from a user-to-kernel and a kernel-to-user functions. Replace integer
versions of do_proc_*_conv functions with the ones from the new macro.
Rename "_dointvec_" to just "_int_" as these converters are not applied
to vectors and the "do" is already in the name.
Move the USER_HZ validation directly into proc_dointvec_userhz_jiffies()
Signed-off-by: Joel Granados <joel.granados@kernel.org>
|
|
Eight converter functions are created using two new macros
(SYSCTL_USER_TO_KERN_INT_CONV & SYSCTL_KERN_TO_USER_INT_CONV); they are
called from four pre-existing converter functions: do_proc_dointvec_conv
and do_proc_dointvec{,_userhz,_ms}_jiffies_conv. The function names
generated by the macros are differentiated by a string suffix passed as
the first macro argument.
The SYSCTL_USER_TO_KERN_INT_CONV macro first executes the u_ptr_op
operation, then checks for overflow, assigns sign (-, +) and finally
writes to the kernel var with WRITE_ONCE; it always returns an -EINVAL
when an overflow is detected. The SYSCTL_KERN_TO_USER_INT_CONV uses
READ_ONCE, casts to unsigned long, then executes the k_ptr_op before
assigning the value to the user space buffer.
The overflow check is always done against MAX_INT after applying
{k,u}_ptr_op. This approach avoids rounding or precision errors that
might occur when using the inverse operations.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
|
|
Rename converter parameter to indicate data flow direction: "lvalp" to
"u_ptr" indicating a user space parsed value pointer. "valp" to "k_ptr"
indicating a kernel storage value pointer. This facilitates the
identification of discrepancies between direction (copy to kernel or
copy to user space) and the modified variable. This is a preparation
commit for when the converter functions are exposed to the rest of the
kernel.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
|
|
Replace the "write" integer parameter with SYSCTL_USER_TO_KERN() and
SYSCTL_KERN_TO_USER() that clearly indicate data flow direction in
sysctl operations.
"write" originates in proc_sysctl.c (proc_sys_{read,write}) and can take
one of two values: "0" or "1" when called from proc_sys_read and
proc_sys_write respectively. When write has a value of zero, data is
"written" to a user space buffer from a kernel variable (usually
ctl_table->data). Whereas when write has a value greater than zero, data
is "written" to an internal kernel variable from a user space buffer.
Remove this ambiguity by introducing macros that clearly indicate the
direction of the "write".
The write mode names in sysctl_writes_mode are left unchanged as these
directly relate to the sysctl_write_strict file in /proc/sys where the
word "write" unambiguously refers to writing to a file.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
|
|
Remove "__" from __do_proc_do{intvec,uintvec,ulongvec_minmax} internal
functions and delete their corresponding do_proc_do* wrappers. These
indirections are unnecessary as they do not add extra logic nor do they
indicate a layer separation.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
|
|
Remove superfluous tbl_data param from do_proc_douintvec{,_r,_w}
and __do_proc_do{intvec,uintvec,ulongvec_minmax}. There is no need to
pass it as it is always contained within the ctl_table struct.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
|
|
* Replace void* data in the converter functions with a const struct
ctl_table* table as it was only getting forwarding values from
ctl_table->extra{1,2}.
* Remove the void* data in the do_proc_* functions as they already had a
pointer to the ctl_table.
* Remove min/max structures do_proc_do{uint,int}vec_minmax_conv_param;
the min/max values get passed directly in ctl_table.
* Keep min/max initialization in extra{1,2} in proc_dou8vec_minmax.
* The do_proc_douintvec was adjusted outside sysctl.c as it is exported
to fs/pipe.c.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
|
|
This commit updates the initialization for the "srcu-fast" scale
type to use DEFINE_STATIC_SRCU_FAST() when reader_flavor is equal to
SRCU_READ_FLAVOR_FAST.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: <bpf@vger.kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
|
|
This commit causes rcutorture's srcu_torture_init() and
srcud_torture_init() functions to announce on the console log
which variant of SRCU is being tortured, for example: "torture:
srcud_torture_init fast SRCU".
[ paulmck: Apply feedback from kernel test robot. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
|
|
This commit creates an SRCU-fast-updown API, including
DEFINE_SRCU_FAST_UPDOWN(), DEFINE_STATIC_SRCU_FAST_UPDOWN(),
__init_srcu_struct_fast_updown(), init_srcu_struct_fast_updown(),
srcu_read_lock_fast_updown(), srcu_read_unlock_fast_updown(),
__srcu_read_lock_fast_updown(), and __srcu_read_unlock_fast_updown().
These are initially identical to their SRCU-fast counterparts, but both
SRCU-fast and SRCU-fast-updown will be optimized in different directions
by later commits. SRCU-fast will lack any sort of srcu_down_read() and
srcu_up_read() APIs, which will enable extremely efficient NMI safety.
For its part, SRCU-fast-updown will not be NMI safe, which will enable
reasonably efficient implementations of srcu_down_read_fast() and
srcu_up_read_fast().
This API fork happens to meet two different future use cases.
* SRCU-fast will become the reimplementation basis for RCU-TASK-TRACE
for consolidation. Since RCU-TASK-TRACE must be NMI safe, SRCU-fast
must be as well.
* SRCU-fast-updown will be needed for uretprobes code in order to get
rid of the read-side memory barriers while still allowing entering the
reader at task level while exiting it in a timer handler.
This commit also adds rcutorture tests for the new APIs. This
(annoyingly) needs to be in the same commit for bisectability. With this
commit, the 0x8 value tests SRCU-fast-updown. However, most SRCU-fast
testing will be via the RCU Tasks Trace wrappers.
[ paulmck: Apply s/0x8/0x4/ missing change per Boqun Feng feedback. ]
[ paulmck: Apply Akira Yokosawa feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: <bpf@vger.kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
|