|
Merge tag 'trace-ringbuffer-v6.18-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull ring-buffer fix from Steven Rostedt:
- Do not allow mmapped ring buffer to be split
When the ring buffer VMA is split by a partial munmap or a MAP_FIXED,
the kernel calls vm_ops->close() on each portion, which causes
ring_buffer_unmap() to be called multiple times. Subsequent calls
then return -ENODEV and trigger a warning.
There's no reason to allow user space to split up the memory mapping of
the ring buffer. Have it return -EINVAL when that happens.
* tag 'trace-ringbuffer-v6.18-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Fix WARN_ON in tracing_buffers_mmap_close for split VMAs
|
|
Prior to commit a25e7962db0d7 ("PCI/P2PDMA: Refactor the p2pdma mapping
helpers"), P2P segments were mapped using the pci_p2pdma_map_segment()
helper. This helper was responsible for populating sg->dma_address,
marking the bus address, and also setting sg_dma_len(sg).
The refactor[1] removed this helper and moved the mapping logic directly
into the callers. While iommu_dma_map_sg() was correctly updated to set
the length in the new flow, it was missed in dma_direct_map_sg().
Thus, in dma_direct_map_sg(), the PCI_P2PDMA_MAP_BUS_ADDR case sets the
dma_address and marks the segment, but immediately executes 'continue',
which causes the loop to skip the standard assignment logic at the end:
sg_dma_len(sg) = sg->length;
As a result, when CONFIG_NEED_SG_DMA_LENGTH is enabled, the dma_length
field remains uninitialized (zero) for P2P bus address mappings. This
breaks upper-layer drivers (e.g. RDMA/IB) that rely on sg_dma_len()
to determine the transfer size.
Fix this by explicitly setting the DMA length in the
PCI_P2PDMA_MAP_BUS_ADDR case before continuing to the next scatterlist
entry.
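A minimal sketch of the loop shape described above, with stub types and
callbacks standing in for the kernel's scatterlist and mapping helpers; only
the placement of the length assignment reflects the actual fix:
  #include <stdbool.h>

  /* Stand-ins for the kernel structures, illustrative only. */
  struct sg_sketch {
      bool is_p2p_bus_addr;        /* stands in for the PCI_P2PDMA_MAP_BUS_ADDR case */
      unsigned int length;
      unsigned long long dma_address;
      unsigned int dma_length;     /* what sg_dma_len() resolves to */
  };

  /* Shape of the dma_direct_map_sg() loop: the bus-address branch must set
   * the DMA length itself, because its "continue" skips the common
   * "sg_dma_len(sg) = sg->length" assignment at the end of the loop. */
  static void map_sg_sketch(struct sg_sketch *sg, int nents,
                            unsigned long long (*p2p_bus_addr)(struct sg_sketch *),
                            unsigned long long (*direct_map)(struct sg_sketch *))
  {
      for (int i = 0; i < nents; i++, sg++) {
          if (sg->is_p2p_bus_addr) {
              sg->dma_address = p2p_bus_addr(sg);
              sg->dma_length  = sg->length;  /* the fix: set it before "continue" */
              continue;
          }
          sg->dma_address = direct_map(sg);
          sg->dma_length  = sg->length;      /* common assignment */
      }
  }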
Fixes: a25e7962db0d7 ("PCI/P2PDMA: Refactor the p2pdma mapping helpers")
Reported-by: Jacob Moroni <jmoroni@google.com>
Signed-off-by: Pranjal Shrivastava <praan@google.com>
[1]
https://lore.kernel.org/all/ac14a0e94355bf898de65d023ccf8a2ad22a3ece.1746424934.git.leon@kernel.org/
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Shivaji Kant <shivajikant@google.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20251126114112.3694469-1-praan@google.com
|
|
With PID filtering working via ftrace_pids_enabled() and fgraph_pid_func,
the coarse-grained ftrace_trace_task() check in graph_entry() is obsolete.
It was only a fallback for uninitialized op->private (now fixed), and its
removal ensures consistent PID filtering with standard function tracing.
Also remove unused ftrace_trace_task() definition from trace.h.
Cc: <wang.yaxin@zte.com.cn>
Cc: <mhiramat@kernel.org>
Cc: <mark.rutland@arm.com>
Cc: <mathieu.desnoyers@efficios.com>
Cc: <zhang.run@zte.com.cn>
Cc: <yang.yang29@zte.com.cn>
Link: https://patch.msgid.link/20251126173552333XoJZN20143fWbsdTEtWoU@zte.com.cn
Signed-off-by: Shengming Hu <hu.shengming@zte.com.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
When registering ftrace_graph, check if ftrace_pids_enabled is active.
If enabled, assign entryfunc to fgraph_pid_func to ensure filtering
is performed before executing the saved original entry function.
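An illustrative sketch of that registration-time check; ftrace_pids_enabled()
and fgraph_pid_func are the names referenced above, while the structure
layout and helpers below are assumptions:
  struct fgraph_ops_sketch {
      int  (*entryfunc)(void);
      int  (*saved_func)(void);   /* assumed: holds the original entry function */
      void  *ops_private;         /* stands in for the underlying ftrace_ops->private */
  };

  static int fgraph_pid_func_sketch(void) { return 0; }                  /* PID check, then saved_func */
  static int pids_enabled_sketch(void *private) { return private != 0; }

  static void register_fgraph_sketch(struct fgraph_ops_sketch *gops)
  {
      if (pids_enabled_sketch(gops->ops_private)) {
          /* Filter by PID first, then run the saved original entry function. */
          gops->saved_func = gops->entryfunc;
          gops->entryfunc  = fgraph_pid_func_sketch;
      }
  }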
Cc: stable@vger.kernel.org
Cc: <wang.yaxin@zte.com.cn>
Cc: <mhiramat@kernel.org>
Cc: <mark.rutland@arm.com>
Cc: <mathieu.desnoyers@efficios.com>
Cc: <zhang.run@zte.com.cn>
Cc: <yang.yang29@zte.com.cn>
Link: https://patch.msgid.link/20251126173331679XGVF98NLhyLJRdtNkVZ6w@zte.com.cn
Fixes: df3ec5da6a1e7 ("function_graph: Add pid tracing back to function graph tracer")
Signed-off-by: Shengming Hu <hu.shengming@zte.com.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
The ftrace_pids_enabled(op) check relies on op->private being properly
initialized, but fgraph_ops's underlying ftrace_ops->private was left
uninitialized. This caused ftrace_pids_enabled() to always return false,
effectively disabling PID filtering for function graph tracing.
Fix this by copying src_ops->private to dst_ops->private in
fgraph_init_ops(), ensuring PID filter state is correctly propagated.
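An illustrative sketch of the fix; the struct layout is a stand-in, only the
->private propagation reflects the actual change:
  struct ftrace_ops_sketch {
      unsigned long flags;
      void *private;              /* the PID filter state lives behind this pointer */
  };

  static void fgraph_init_ops_sketch(struct ftrace_ops_sketch *dst_ops,
                                     struct ftrace_ops_sketch *src_ops)
  {
      dst_ops->flags = src_ops->flags;
      /* The fix: without this copy, ftrace_pids_enabled() on the fgraph ops
       * always sees a NULL ->private and returns false. */
      dst_ops->private = src_ops->private;
  }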
Cc: stable@vger.kernel.org
Cc: <wang.yaxin@zte.com.cn>
Cc: <mhiramat@kernel.org>
Cc: <mark.rutland@arm.com>
Cc: <mathieu.desnoyers@efficios.com>
Cc: <zhang.run@zte.com.cn>
Cc: <yang.yang29@zte.com.cn>
Fixes: c132be2c4fcc1 ("function_graph: Have the instances use their own ftrace_ops for filtering")
Link: https://patch.msgid.link/20251126172926004y3hC8QyU4WFOjBkU_UxLC@zte.com.cn
Signed-off-by: Shengming Hu <hu.shengming@zte.com.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
simultaneously
Currently, the funcgraph-args and funcgraph-retaddr features are
mutually exclusive. This patch resolves this limitation by allowing
funcgraph-retaddr to have an args array.
To verify the change, use perf to trace vfs_write with both options
enabled:
Before:
# perf ftrace -G vfs_write --graph-opts args,retaddr
......
down_read() { /* <-n_tty_write+0xa3/0x540 */
__cond_resched(); /* <-down_read+0x12/0x160 */
preempt_count_add(); /* <-down_read+0x3b/0x160 */
preempt_count_sub(); /* <-down_read+0x8b/0x160 */
}
After:
# perf ftrace -G vfs_write --graph-opts args,retaddr
......
down_read(sem=0xffff8880100bea78) { /* <-n_tty_write+0xa3/0x540 */
__cond_resched(); /* <-down_read+0x12/0x160 */
preempt_count_add(val=1); /* <-down_read+0x3b/0x160 */
preempt_count_sub(val=1); /* <-down_read+0x8b/0x160 */
}
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Xiaoqin Zhang <zhangxiaoqin@xiaomi.com>
Link: https://patch.msgid.link/20251125093425.2563849-1-dolinux.peng@gmail.com
Signed-off-by: pengdonglin <pengdonglin@xiaomi.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Currently, the persistent ring buffer instance needs to be read before
it can be reused. This means we have to wait until user space boots up
and dumps the persistent ring buffer, so we can not start tracing on it
from the kernel cmdline.
To remove this limitation, add an option which allows creating a trace
instance as a backup of the persistent ring buffer at boot.
If the user specifies trace_instance=<BACKUP>=<PERSIST_RB>, then the
<BACKUP> instance is made as a copy of the <PERSIST_RB> instance.
For example, the kernel cmdline below records all syscall, scheduler
and interrupt events on the persistent ring buffer `boot_map`, but
before starting the tracing, it makes a `backup` instance from
`boot_map`. Thus, the `backup` instance holds the previous boot events.
'reserve_mem=12M:4M:trace trace_instance=boot_map@trace,syscalls:*,sched:*,irq:* trace_instance=backup=boot_map'
As you can see, this just makes a copy of the entire reserved area and
creates a backup instance on it, so you can release (or shrink) the
backup instance after using it to save memory.
/sys/kernel/tracing/instances # free
total used free shared buff/cache available
Mem: 1999284 55704 1930520 10132 13060 1914628
Swap: 0 0 0
/sys/kernel/tracing/instances # rmdir backup/
/sys/kernel/tracing/instances # free
total used free shared buff/cache available
Mem: 1999284 40640 1945584 10132 13060 1929692
Swap: 0 0 0
Note: since there is no reason to make a copy of an empty buffer, this
backup only accepts a persistent ring buffer as the original instance.
Also, since this backup is based on vmalloc(), it does not support
user-space mmap().
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/176377150002.219692.9425536150438129267.stgit@devnote2
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
There are times when tracing the tracing infrastructure can be useful for
debugging the tracing code. Currently all files in the tracing directory
are set to "notrace" their functions.
Add a new config option FUNCTION_SELF_TRACING that will allow some of the
files in the tracing infrastructure to be traced. It requires a config
option to enable because it will add noise to the function tracer if events and
other tracing features are enabled. Tracing functions and events together
is quite common, so not tracing the event code should be the default.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Link: https://patch.msgid.link/20251120181514.736f2d5f@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
The function trigger_process_regex() is called by a few functions, where
only one calls strim() on the buffer passed to it. The other callers do not
trim the end of the buffer they pass in, which makes the behavior a little
inconsistent.
Remove the strim() from event_trigger_regex_write() and have
trigger_process_regex() use strim() instead of skip_spaces(). The buff
variable is not passed in as const, so it can be modified.
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Link: https://patch.msgid.link/20251125214032.323747707@kernel.org
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
The event trigger data requires a full tracepoint_synchronize_unregister()
call before freeing. That call can take 100s of milliseconds to complete.
In order to allow bulk freeing of the trigger data, it must not call
tracepoint_synchronize_unregister() for every individual trigger data
being freed.
Add a kthread that gets created the first time a trigger data is freed,
and have it use a lockless llist to collect the data to free, run
tracepoint_synchronize_unregister(), and then free everything on the list.
By freeing hundreds of event_trigger_data elements together, it only
requires two runs of the synchronization function, and not hundreds of
runs. This speeds up the operation by orders of magnitude (milliseconds
instead of several seconds).
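A hedged sketch of that deferred-free pattern using the kernel's llist and
kthread primitives; the type, field and function names are assumptions, not
the patch itself:
  #include <linux/llist.h>
  #include <linux/kthread.h>
  #include <linux/sched.h>
  #include <linux/slab.h>
  #include <linux/tracepoint.h>

  struct trigger_free_sketch {
      struct llist_node node;
      /* ... the event_trigger_data payload would live here ... */
  };

  static LLIST_HEAD(trigger_free_list);
  static struct task_struct *trigger_free_task;  /* created lazily with kthread_run(), not shown */

  static int trigger_free_thread(void *unused)
  {
      for (;;) {
          struct llist_node *batch;
          struct trigger_free_sketch *d, *tmp;

          set_current_state(TASK_INTERRUPTIBLE);
          batch = llist_del_all(&trigger_free_list);
          if (!batch) {
              schedule();
              continue;
          }
          __set_current_state(TASK_RUNNING);
          /* One synchronization covers the whole batch instead of one call
           * per freed element. */
          tracepoint_synchronize_unregister();
          llist_for_each_entry_safe(d, tmp, batch, node)
              kfree(d);
      }
      return 0;
  }

  /* Producer side: queue the element and kick the kthread. */
  static void queue_trigger_free(struct trigger_free_sketch *d)
  {
      llist_add(&d->node, &trigger_free_list);
      if (trigger_free_task)
          wake_up_process(trigger_free_task);
  }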
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Link: https://patch.msgid.link/20251125214032.151674992@kernel.org
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
In event_trigger_regex_release(), the only code is:
mutex_lock(&event_mutex);
if (file->f_mode & FMODE_READ)
        seq_release(inode, file);
mutex_unlock(&event_mutex);
return 0;
There's nothing special about the file->f_mode or the seq_release() that
requires any locking. Remove the unnecessary locks.
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Link: https://patch.msgid.link/20251125214031.975879283@kernel.org
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Now that there's pretty much a one-to-one mapping between the struct
event_trigger_ops and struct event_command, there's no reason to have two
different structures. Merge the function pointers of event_trigger_ops
into event_command.
There's one exception in trace_events_hist.c for the
event_hist_trigger_named_ops. This has special logic for the init and free
function pointers for "named histograms". In this case, allocate the
cmd_ops of the event_trigger_data and set it to the proper init and free
functions, which are used to initialize and free the event_trigger_data
respectively. Have the free function and the init function (on failure)
free the cmd_ops of the data element.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20251125200932.446322765@kernel.org
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
The struct event_command has a callback function called get_trigger_ops().
This callback returns the "trigger_ops" to use for the trigger. These ops
define the trigger function, how to init the trigger, how to print the
trigger and how to free it.
The only reason there's a callback function to get these ops is because
some triggers have two types of operations. One is an "always on"
operation, and the other is a "count down" operation, used when a user
passes in a parameter saying how many times the trigger should execute.
For example:
echo stacktrace:5 > events/kmem/kmem_cache_alloc/trigger
It will trigger the stacktrace for the first 5 times the kmem_cache_alloc
event is hit.
Since the only difference between the two trigger_ops is the trigger
function itself (the print, init and free functions are all the same),
just use a single ops that the event_command points to, and add a
count_func function field to the trigger_ops.
When a trigger is added to an event, if there's a count attached to it and
the trigger ops has the count_func field, the data allocated to represent
this trigger will have a new flag set called COUNT.
Then when the trigger executes, it will check if the COUNT data flag is
set, and if so, it will call the ops count_func(). If that returns false,
it returns without executing the trigger.
This removes the need for duplicate event_trigger_ops structures.
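An illustrative sketch of the single-ops scheme; the COUNT flag and
count_func names follow the description above, the layouts and the bit value
are assumptions:
  #include <stdbool.h>

  struct event_trigger_data_sketch;

  struct event_command_sketch {
      void (*trigger)(struct event_trigger_data_sketch *data);
      /* Returns false once the remaining count is exhausted. */
      bool (*count_func)(struct event_trigger_data_sketch *data);
  };

  #define TRIGGER_FL_COUNT (1UL << 0)   /* assumed bit value */

  struct event_trigger_data_sketch {
      unsigned long flags;
      struct event_command_sketch *cmd_ops;
  };

  static void run_trigger_sketch(struct event_trigger_data_sketch *data)
  {
      if ((data->flags & TRIGGER_FL_COUNT) &&
          data->cmd_ops->count_func &&
          !data->cmd_ops->count_func(data))
          return;                       /* count exhausted, skip this trigger */
      data->cmd_ops->trigger(data);
  }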
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20251125200932.274566147@kernel.org
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Since tracer_init_tracefs_work_func() only updates the tracer options
for the global_trace, the instances created by the kernel cmdline
do not have those options.
Fix this by updating the tracer options for those boot-time created
instances so that they expose those options.
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/176354112555.2356172.3989277078358802353.stgit@mhiramat.tok.corp.google.com
Fixes: 428add559b69 ("tracing: Have tracer option be instance specific")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
The FTRACE_OPS_FL_INITIALIZED flag is cleared in register_ftrace_direct(),
which can cause the ops to be initialized again by ftrace_ops_init() even
though it is already initialized. This does not seem to cause any real
problem, but let's still fix it.
Link: https://patch.msgid.link/20251110121808.1559240-1-dongml2@chinatelecom.cn
Fixes: f64dd4627ec6 ("ftrace: Add multi direct register/unregister interface")
Acked-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
The flags and opts used in tracing_trace_options_show() now come directly
from the trace array "current_trace_flags" and not the current_trace. The
variable "trace" was still being assigned to tr->current_trace but never
used. This caused a warning in clang.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251117120637.43ef995d@gandalf.local.home
Reported-by: Andy Shevchenko <andriy.shevchenko@intel.com>
Tested-by: Andy Shevchenko <andriy.shevchenko@intel.com>
Closes: https://lore.kernel.org/all/aRtHWXzYa8ijUIDa@black.igk.intel.com/
Fixes: 428add559b692 ("tracing: Have tracer option be instance specific")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
The variable fgraph_no_sleep_time changed from being a boolean to being a
counter. A check is made to make sure that it never goes below zero, but
since the variable is unsigned the check can never trigger, even when the
counter does go below zero (it wraps around instead).
Make the variable a signed int so that checking it going below zero still
works.
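A standalone demonstration of the pitfall (not kernel code): with an
unsigned counter the below-zero check can never fire, because the decrement
wraps around instead.
  #include <stdio.h>

  int main(void)
  {
      unsigned int ucnt = 0;
      int scnt = 0;

      ucnt--;                           /* wraps to UINT_MAX instead of -1 */
      scnt--;                           /* becomes -1 */

      printf("unsigned check fires: %d\n", ucnt < 0);  /* always 0 */
      printf("signed check fires:   %d\n", scnt < 0);  /* 1 */
      return 0;
  }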
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251125104751.4c9c7f28@gandalf.local.home
Fixes: 5abb6ccb58f0 ("tracing: Have function graph tracer option sleep-time be per instance")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/all/aR1yRQxDmlfLZzoo@stanley.mountain/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
When excl_prog_hash is 0 and excl_prog_hash_size is non-zero, the map also
needs to be freed. Otherwise, the map memory will not be reclaimed, just
like the memory leak problem reported by syzbot [1].
syzbot reported:
BUG: memory leak
backtrace (crc 7b9fb9b4):
map_create+0x322/0x11e0 kernel/bpf/syscall.c:1512
__sys_bpf+0x3556/0x3610 kernel/bpf/syscall.c:6131
Fixes: baefdbdf6812 ("bpf: Implement exclusive map creation")
Reported-by: syzbot+cf08c551fecea9fd1320@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=cf08c551fecea9fd1320
Tested-by: syzbot+cf08c551fecea9fd1320@syzkaller.appspotmail.com
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/tencent_3F226F882CE56DCC94ACE90EED1ECCFC780A@qq.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Unify the map flags checking for the lookup_elem, update_elem,
lookup_batch and update_batch APIs.
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20251125145857.98134-2-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
When a VMA is split (e.g., by partial munmap or MAP_FIXED), the kernel
calls vm_ops->close on each portion. For trace buffer mappings, this
results in ring_buffer_unmap() being called multiple times while
ring_buffer_map() was only called once.
This causes ring_buffer_unmap() to return -ENODEV on subsequent calls
because user_mapped is already 0, triggering a WARN_ON.
Trace buffer mappings cannot support partial mappings because the ring
buffer structure requires the complete buffer including the meta page.
Fix this by adding a may_split callback that returns -EINVAL to prevent
VMA splits entirely.
Cc: stable@vger.kernel.org
Fixes: cf9f0f7c4c5bb ("tracing: Allow user-space mapping of the ring-buffer")
Link: https://patch.msgid.link/20251119064019.25904-1-kartikey406@gmail.com
Closes: https://syzkaller.appspot.com/bug?extid=a72c325b042aae6403c7
Tested-by: syzbot+a72c325b042aae6403c7@syzkaller.appspotmail.com
Reported-by: syzbot+a72c325b042aae6403c7@syzkaller.appspotmail.com
Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Now that all pieces are in place, change the implementations of
sched_mm_cid_fork() and sched_mm_cid_exit() to adhere to the new strict
ownership scheme and switch context_switch() over to use the new
mm_cid_schedin() functionality.
The common case is that there is no mode change required, which makes
fork() and exit() just update the user count and the constraints.
In case that a new user would exceed the CID space limit the fork() context
handles the transition to per CPU mode with mm::mm_cid::mutex held. exit()
handles the transition back to per task mode when the user count drops
below the switch back threshold. fork() might also be forced to handle a
deferred switch back to per task mode, when an affinity change increased the
number of allowed CPUs enough.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251119172550.280380631@linutronix.de
|
|
When affinity changes cause an increase of the number of CPUs allowed for
tasks which are related to an MM, that might result in a situation where
the ownership mode can go back from per CPU mode to per task mode.
As affinity changes happen with runqueue lock held there is no way to do
the actual mode change and required fixup right there.
Add the infrastructure to defer it to a workqueue. The scheduled work can
race with a fork() or exit(). Whatever happens first takes care of it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251119172550.216484739@linutronix.de
|
|
CIDs are either owned by tasks or by CPUs. The ownership mode depends on
the number of tasks related to a MM and the number of CPUs on which these
tasks are theoretically allowed to run on. Theoretically because that
number is the superset of CPU affinities of all tasks which only grows and
never shrinks.
Switching to per CPU mode happens when the user count becomes greater than
the maximum number of CIDs, which is calculated by:
opt_cids = min(mm_cid::nr_cpus_allowed, mm_cid::users);
max_cids = min(1.25 * opt_cids, nr_cpu_ids);
The +25% allowance is useful for tight CPU masks in scenarios where only a
few threads are created and destroyed to avoid frequent mode
switches. Though this allowance shrinks, the closer opt_cids becomes to
nr_cpu_ids, which is the (unfortunate) hard ABI limit.
At the point of switching to per CPU mode the new user is not yet visible
in the system, so the task which initiated the fork() runs the fixup
function: mm_cid_fixup_tasks_to_cpu() walks the thread list and either
transfers each task's owned CID to the CPU the task runs on or drops it into
the CID pool if a task is not on a CPU at that point in time. Tasks which
schedule in before the task walk reaches them do the handover in
mm_cid_schedin(). When mm_cid_fixup_tasks_to_cpus() completes it's
guaranteed that no task related to that MM owns a CID anymore.
Switching back to task mode happens when the user count goes below the
threshold which was recorded on the per CPU mode switch:
pcpu_thrs = min(opt_cids - (opt_cids / 4), nr_cpu_ids / 2);
This threshold is updated when an affinity change increases the number of
allowed CPUs for the MM, which might cause a switch back to per task mode.
If the switch back was initiated by a exiting task, then that task runs the
fixup function. If it was initiated by a affinity change, then it's run
either in the deferred update function in context of a workqueue or by a
task which forks a new one or by a task which exits. Whatever happens
first. mm_cid_fixup_cpus_to_task() walks through the possible CPUs and
either transfers the CPU owned CIDs to a related task which runs on the CPU
or drops it into the pool. Tasks which schedule in on a CPU which the walk
did not cover yet do the handover themselves.
This transition from CPU to per task ownership happens in two phases:
1) mm::mm_cid.transit contains MM_CID_TRANSIT. This is OR'ed on the task
CID and denotes that the CID is only temporarily owned by the
task. When it schedules out the task drops the CID back into the
pool if this bit is set.
2) The initiating context walks the per CPU space and after completion
clears mm::mm_cid.transit. After that point the CIDs are strictly
task owned again.
This two phase transition is required to prevent CID space exhaustion
during the transition as a direct transfer of ownership would fail if
two tasks are scheduled in on the same CPU before the fixup freed per
CPU CIDs.
When mm_cid_fixup_cpus_to_tasks() completes it's guaranteed that no CID
related to that MM is owned by a CPU anymore.
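A standalone sketch of the sizing arithmetic quoted above; the inputs are
plain parameters here, not the kernel's actual variables:
  #include <stdio.h>

  static unsigned int min_u(unsigned int a, unsigned int b) { return a < b ? a : b; }

  static void cid_limits(unsigned int users, unsigned int nr_cpus_allowed,
                         unsigned int nr_cpu_ids)
  {
      unsigned int opt_cids  = min_u(nr_cpus_allowed, users);
      /* +25% allowance to avoid frequent mode switches */
      unsigned int max_cids  = min_u(opt_cids + opt_cids / 4, nr_cpu_ids);
      /* switch back to per task mode when the user count drops below this */
      unsigned int pcpu_thrs = min_u(opt_cids - opt_cids / 4, nr_cpu_ids / 2);

      printf("opt=%u max=%u thrs=%u\n", opt_cids, max_cids, pcpu_thrs);
  }

  int main(void)
  {
      cid_limits(255, 256, 256);        /* stays in per task mode */
      cid_limits(400, 256, 256);        /* users > max_cids -> per CPU mode */
      return 0;
  }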
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251119172550.088189028@linutronix.de
|
|
The MM CID management has two fundamental requirements:
1) It has to guarantee that at no given point in time the same CID is
used by concurrent tasks in userspace.
2) The CID space must not exceed the number of possible CPUs in a
system. While most allocators (glibc, tcmalloc, jemalloc) do not
care about that, there seems to be at least some LTTng library
depending on it.
The CID space compaction itself is not a functional correctness
requirement, it is only a useful optimization mechanism to reduce the
memory footprint in unused user space pools.
The optimal CID space is:
min(nr_tasks, nr_cpus_allowed);
Where @nr_tasks is the number of actual user space threads associated to
the mm and @nr_cpus_allowed is the superset of all task affinities. It is
growth only as it would be insane to take a racy snapshot of all task
affinities when the affinity of one task changes just do redo it 2
milliseconds later when the next task changes it's affinity.
That means that as long as the number of tasks is lower or equal than the
number of CPUs allowed, each task owns a CID. If the number of tasks
exceeds the number of CPUs allowed it switches to per CPU mode, where the
CPUs own the CIDs and the tasks borrow them as long as they are scheduled
in.
For transition periods CIDs can go beyond the optimal space as long as they
don't go beyond the number of possible CPUs.
The current upstream implementation adds overhead into task migration to
keep the CID with the task. It also has to do the CID space consolidation
work from a task work in the exit to user space path. As that work is
assigned to a random task related to an MM, this can inflict unwanted exit
latencies.
Implement the context switch parts of a strict ownership mechanism to
address this.
This removes most of the work from the task which schedules out. Only
during transitioning from per CPU to per task ownership it is required to
drop the CID when leaving the CPU to prevent CID space exhaustion. Other
than that scheduling out is just a single check and branch.
The task which schedules in has to check whether:
1) The ownership mode changed
2) The CID is within the optimal CID space
In stable situations this results in zero work. The only short disruption
is when ownership mode changes or when the associated CID is not in the
optimal CID space. The latter only happens when tasks exit and therefore
the optimal CID space shrinks.
That mechanism is strictly optimized for the common case where no change
happens. The only case where it actually causes a temporary one-time spike
is on mode changes, when and only when a lot of tasks related to an MM
schedule at exactly the same time and eventually have to compete on
allocating a CID from the bitmap.
In the sysbench test case which triggered the spinlock contention in the
initial CID code, __schedule() drops significantly in perf top on a 128
Core (256 threads) machine when running sysbench with 255 threads, which
fits into the task mode limit of 256 together with the parent thread:
             Upstream     rseq/perf branch     +CID rework
              0.42%            0.37%              0.32%      [k] __schedule
Increasing the number of threads to 256, which puts the test process into
per CPU mode, looks about the same.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251119172550.023984859@linutronix.de
|
|
The MM CID management has two fundamental requirements:
1) It has to guarantee that at no given point in time the same CID is
used by concurrent tasks in userspace.
2) The CID space must not exceed the number of possible CPUs in a
system. While most allocators (glibc, tcmalloc, jemalloc) do not care
about that, there seems to be at least librseq depending on it.
The CID space compaction itself is not a functional correctness
requirement, it is only a useful optimization mechanism to reduce the
memory footprint in unused user space pools.
The optimal CID space is:
min(nr_tasks, nr_cpus_allowed);
Where @nr_tasks is the number of actual user space threads associated to
the mm and @nr_cpus_allowed is the superset of all task affinities. It is
grow-only, as it would be insane to take a racy snapshot of all task
affinities when the affinity of one task changes, just to redo it 2
milliseconds later when the next task changes its affinity.
That means that as long as the number of tasks is lower than or equal to the
number of CPUs allowed, each task owns a CID. If the number of tasks
exceeds the number of CPUs allowed it switches to per CPU mode, where the
CPUs own the CIDs and the tasks borrow them as long as they are scheduled
in.
For transition periods CIDs can go beyond the optimal space as long as they
don't go beyond the number of possible CPUs.
The current upstream implementation adds overhead into task migration to
keep the CID with the task. It also has to do the CID space consolidation
work from a task work in the exit to user space path. As that work is
assigned to a random task related to an MM, this can inflict unwanted exit
latencies.
This can be done differently by implementing a strict CID ownership
mechanism. Either the CIDs are owned by the tasks or by the CPUs. The
latter provides less locality when tasks are heavily migrating, but there
is no justification to optimize for overcommit scenarios and thereby
penalize everyone else.
Provide the basic infrastructure to implement this:
- Change the UNSET marker to BIT(31) from ~0U
- Add the ONCPU marker as BIT(30)
- Add the TRANSIT marker as BIT(29)
That allows checking for ownership trivially and provides a simple check for
UNSET as well. The TRANSIT marker is required to prevent CID space
exhaustion when switching from per CPU to per task mode.
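A small sketch of the marker encoding; the bit assignments follow the list
above, the helper names are assumptions:
  #include <stdbool.h>

  #define MM_CID_UNSET   (1U << 31)
  #define MM_CID_ONCPU   (1U << 30)
  #define MM_CID_TRANSIT (1U << 29)

  static inline bool cid_is_unset(unsigned int cid)     { return cid & MM_CID_UNSET; }
  static inline bool cid_owned_by_cpu(unsigned int cid) { return cid & MM_CID_ONCPU; }
  static inline bool cid_in_transit(unsigned int cid)   { return cid & MM_CID_TRANSIT; }

  /* The plain CID value with all markers stripped off. */
  static inline unsigned int cid_value(unsigned int cid)
  {
      return cid & ~(MM_CID_UNSET | MM_CID_ONCPU | MM_CID_TRANSIT);
  }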
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251119172549.960252358@linutronix.de
|
|
Prepare for the new CID management scheme which puts the CID ownership
transition into the fork() and exit() slow path by serializing
sched_mm_cid_fork()/exit() with it, so task list and cpu mask walks can be
done in interruptible and preemptible code.
The contention on it is not worse than on other concurrency controls in the
fork()/exit() machinery.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251119172549.895826703@linutronix.de
|
|
Reading mm::mm_users and mm::mm_cid::nr_cpus_allowed every time to compute
the maximal CID value is just wasteful as that value is only changing on
fork(), exit() and eventually when the affinity changes.
So it can be easily precomputed at those points and provided in mm::mm_cid
for consumption in the hot path.
But there is an issue with using mm::mm_users for accounting because that
does not necessarily reflect the number of user space tasks as other kernel
code can take temporary references on the MM which skew the picture.
Solve that by adding a users counter to struct mm_mm_cid, which is modified
by fork() and exit() and used for precomputing under mm_mm_cid::lock.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251119172549.832764634@linutronix.de
|
|
It's getting bigger soon, so just move it out of line to the rest of the
code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251119172549.769636491@linutronix.de
|
|
There is no longer any need to keep this under sighand lock, as neither the
current code nor the upcoming replacement depends on the exit state of a
task anymore.
That allows using a mutex in the exit path.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251119172549.706439391@linutronix.de
|
|
This is truly a bitmap and just conveniently uses a cpumask because the
maximum size of the bitmap is nr_cpu_ids.
But that prevents searching for a zero bit in a limited range, which is
helpful for providing an efficient mechanism to consolidate the CID space
when the number of users decreases.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Yury Norov (NVIDIA) <yury.norov@gmail.com>
Link: https://patch.msgid.link/20251119172549.642866767@linutronix.de
|
|
Reevaluating num_possible_cpus() over and over does not make sense. That
becomes a constant after init as cpu_possible_mask is marked ro_after_init.
Cache the value during initialization and provide that for consumption.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Link: https://patch.msgid.link/20251119172549.578653738@linutronix.de
|
|
A CPU system wakeup QoS limit may have been requested by user space. To
avoid breaking this constraint when entering a low power state during
s2idle, let's start to take into account the QoS limit.
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dhruva Gole <d-gole@ti.com>
Reviewed-by: Kevin Hilman (TI) <khilman@baylibre.com>
Tested-by: Kevin Hilman (TI) <khilman@baylibre.com>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://patch.msgid.link/20251125112650.329269-5-ulf.hansson@linaro.org
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Some platforms support multiple low power states for CPUs that can be used
when entering system-wide suspend. Currently we are always selecting the
deepest possible state for the CPUs, which can break the system wakeup
latency constraint that may be required for a use case.
Let's take the first step towards addressing this problem by introducing an
interface for user space that allows us to specify the CPU system wakeup
QoS limit. Subsequent changes will start taking the new QoS limit into
account.
Reviewed-by: Dhruva Gole <d-gole@ti.com>
Reviewed-by: Kevin Hilman (TI) <khilman@baylibre.com>
Tested-by: Kevin Hilman (TI) <khilman@baylibre.com>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://patch.msgid.link/20251125112650.329269-2-ulf.hansson@linaro.org
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
If kobject_create_and_add() fails on the first iteration, then the error
code is set to -ENOMEM which is correct. But if it fails in subsequent
iterations then "ret" is zero, which means success, but it should be
-ENOMEM.
Set the error code to -ENOMEM correctly.
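A generic sketch of the pattern (not the kernel's exact code): the error
code must be set on every failure path, otherwise a later iteration that
fails after "ret" was reset to 0 reports success.
  #include <errno.h>

  void *alloc_node_sketch(int i);        /* stands in for kobject_create_and_add() */
  int   publish_node_sketch(void *kobj); /* a later step that resets ret to 0 on success */
  void  cleanup_sketch(void);

  static int init_aux_sketch(int n)
  {
      int ret;

      for (int i = 0; i < n; i++) {
          void *kobj = alloc_node_sketch(i);
          if (!kobj) {
              ret = -ENOMEM;             /* must be set on *every* failure path */
              goto err;
          }
          ret = publish_node_sketch(kobj);
          if (ret)
              goto err;
      }
      return 0;
  err:
      cleanup_sketch();
      return ret;
  }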
Fixes: 7b5ab04f035f ("timekeeping: Fix resource leak in tk_aux_sysfs_init() error paths")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Malaya Kumar Rout <mrout@redhat.com>
Link: https://patch.msgid.link/aSW1R8q5zoY_DgQE@stanley.mountain
|
|
Continue recent cleanups of comments in the swap handling code.
Unify the use of white space in the comments, drop some unhelpful
comments outside function bodies, and move some other comments into
function bodies.
No functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://patch.msgid.link/5943864.DvuYhMxLoT@rafael.j.wysocki
|
|
Implement the "jmp" mode for the bpf trampoline. For the ftrace_managed
case, we need only to set the FTRACE_OPS_FL_JMP on the tr->fops if "jmp"
is needed.
For the bpf poke case, we will check the origin poke type with the
"origin_flags", and current poke type with "tr->flags". The function
bpf_trampoline_update_fentry() is introduced to do the job.
The "jmp" mode will only be enabled with CONFIG_DYNAMIC_FTRACE_WITH_JMP
enabled and BPF_TRAMP_F_SHARE_IPMODIFY is not set. With
BPF_TRAMP_F_SHARE_IPMODIFY, we need to get the origin call ip from the
stack, so we can't use the "jmp" mode.
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20251118123639.688444-7-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
In the original logic, bpf_arch_text_poke() assumes that the old and new
instructions have the same opcode. However, they can have different opcodes
if we want to replace a "call" insn with a "jmp" insn.
Therefore, add the new function parameter "old_t" along with "new_t",
which are used to indicate the old and new poke type. Meanwhile, adjust
the implementation of bpf_arch_text_poke() for all the archs.
"BPF_MOD_NOP" is added to make the code more readable. In
bpf_arch_text_poke(), we still check if the new and old addresses are NULL
to determine if a nop insn should be used, which I think is safer.
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Link: https://lore.kernel.org/r/20251118123639.688444-6-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
For now, the "nop" will be replaced with a "call" instruction when a
function is hooked by ftrace. However, sometimes the "call" can break
the RSB and introduce extra overhead. Therefore, introduce the flag
FTRACE_OPS_FL_JMP, which indicates that the ftrace_ops should be called
with a "jmp" instead of a "call". For now, it is only used by the direct
call case.
When a direct ftrace_ops is marked with FTRACE_OPS_FL_JMP, the last bit of
the ops->direct_call will be set to 1. Therefore, we can tell if we should
use "jmp" for the callback in ftrace_call_replace().
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20251118123639.688444-2-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
In commit b4ce5923e780 ("bpf, x86: add new map type: instructions array")
env->used_maps was copied to func[i]->aux->used_maps before jitting.
Clear these fields out after jitting such that pointers to freed memory
(env->used_maps is freed later) are not kept in a live data structure.
The reason why the copies were initially added is explained in
https://lore.kernel.org/bpf/20251105090410.1250500-1-a.s.protopopov@gmail.com
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Fixes: b4ce5923e780 ("bpf, x86: add new map type: instructions array")
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Link: https://lore.kernel.org/r/20251124151515.2543403-1-a.s.protopopov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Currently when the length of a symbol is longer than 0x7f characters,
its type shown in /proc/kallsyms can be incorrect.
I found this issue when reading the code, but it can be reproduced by
following steps:
1. Define a function whose symbol length is 130 characters:
#define X13(x) x##x##x##x##x##x##x##x##x##x##x##x##x
static noinline void X13(x123456789)(void)
{
printk("hello world\n");
}
2. The type in vmlinux is 't':
$ nm vmlinux | grep x123456
ffffffff816290f0 t x123456789x123456789x123456789x12[...]
3. Then boot the kernel, the type shown in /proc/kallsyms becomes 'g'
instead of the expected 't':
# cat /proc/kallsyms | grep x123456
ffffffff816290f0 g x123456789x123456789x123456789x12[...]
The root cause is that, after commit 73bbb94466fd ("kallsyms: support
"big" kernel symbols"), ULEB128 was used to encode symbol name length.
That is, for "big" kernel symbols of which name length is longer than
0x7f characters, the length info is encoded into 2 bytes.
kallsyms_get_symbol_type() expects to read the first char of the
symbol name which indicates the symbol type. However, due to the
"big" symbol case not being handled, the symbol type read from
/proc/kallsyms may be wrong, so handle it properly.
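A minimal ULEB128 length decoder illustrating the format; this is a sketch,
not the kallsyms.c implementation:
  #include <stddef.h>

  /* Returns the number of length bytes to skip before the name data; names
   * longer than 0x7f characters use two length bytes, so reading the byte
   * right after a single length byte picks up the wrong position. */
  static size_t uleb128_decode(const unsigned char *p, size_t *len)
  {
      size_t value = 0, shift = 0, i = 0;

      do {
          value |= (size_t)(p[i] & 0x7f) << shift;
          shift += 7;
      } while (p[i++] & 0x80);

      *len = value;
      return i;
  }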
Cc: stable@vger.kernel.org
Fixes: 73bbb94466fd ("kallsyms: support "big" kernel symbols")
Signed-off-by: Zheng Yejian <zhengyejian@huaweicloud.com>
Acked-by: Gary Guo <gary@garyguo.net>
Link: https://patch.msgid.link/20241011143853.3022643-1-zhengyejian@huaweicloud.com
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
|
|
With commit ("printk: Avoid scheduling irq_work on suspend") the
implementation of printk_get_console_flush_type() was modified to
avoid offloading when irq_work should be blocked during suspend.
Since printk uses the returned flush type to determine what
flushing methods are used, this was thought to be sufficient for
avoiding irq_work usage during the suspend phase.
However, vprintk_emit() implements a hack to support
printk_deferred(). In this hack, the returned flush type is
adjusted to make sure no legacy direct printing occurs when
printk_deferred() was used.
Because of this hack, the legacy offloading flushing method can
still be used, causing irq_work to be queued when it should not
be.
Adjust the vprintk_emit() hack to also consider
@console_irqwork_blocked so that legacy offloading will not be
chosen when irq_work should be blocked.
Link: https://lore.kernel.org/lkml/87fra90xv4.fsf@jogness.linutronix.de
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Fixes: 26873e3e7f0c ("printk: Avoid scheduling irq_work on suspend")
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
Merge tag 'timers-urgent-2025-11-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Ingo Molnar:
- Fix a race in timer->function clearing in timer_shutdown_sync()
- Fix a timekeeper sysfs-setup resource leak in error paths
- Fix the NOHZ report_idle_softirq() syslog rate-limiting
logic to have no side effects on the return value
* tag 'timers-urgent-2025-11-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timers: Fix NULL function pointer race in timer_shutdown_sync()
timekeeping: Fix resource leak in tk_aux_sysfs_init() error paths
tick/sched: Fix bogus condition in report_idle_softirq()
|
|
Merge tag 'perf-urgent-2025-11-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
"Fix perf CPU-clock counters, and address a static checker warning"
* tag 'perf-urgent-2025-11-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Fix 0 count issue of cpu-clock
perf/x86/intel/uncore: Remove superfluous check
|
|
There is a race condition between timer_shutdown_sync() and timer
expiration that can lead to hitting a WARN_ON in expire_timers().
The issue occurs when timer_shutdown_sync() clears the timer function
to NULL while the timer is still running on another CPU. The race
scenario looks like this:
 CPU0                                     CPU1
 <SOFTIRQ>
  lock_timer_base()
  expire_timers()
   base->running_timer = timer;
  unlock_timer_base()
  [call_timer_fn enter]
                                          mod_timer()
                                          ...
                                          timer_shutdown_sync()
                                           lock_timer_base()
                                           // For now, will not detach the timer
                                           // but only clear its function to NULL
                                           if (base->running_timer != timer)
                                            ret = detach_if_pending(timer, base, true);
                                           if (shutdown)
                                            timer->function = NULL;
                                           unlock_timer_base()
  [call_timer_fn exit]
  lock_timer_base()
  base->running_timer = NULL;
  unlock_timer_base()
 ...
 // Now timer is pending while its function set to NULL.
 // next timer trigger
 <SOFTIRQ>
  expire_timers()
   WARN_ON_ONCE(!fn) // hit
 ...
                                          lock_timer_base()
                                          // Now timer will detach
                                          if (base->running_timer != timer)
                                           ret = detach_if_pending(timer, base, true);
                                          if (shutdown)
                                           timer->function = NULL;
                                          unlock_timer_base()
The problem is that timer_shutdown_sync() clears the timer function
regardless of whether the timer is currently running. This can leave a
pending timer with a NULL function pointer, which triggers the
WARN_ON_ONCE(!fn) check in expire_timers().
Fix this by only clearing the timer function when actually detaching the
timer. If the timer is running, leave the function pointer intact, which is
safe because the timer will be properly detached when it finishes running.
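A shape-only sketch of the fixed shutdown step, using stub types rather than
the exact kernel code:
  #include <stdbool.h>
  #include <stddef.h>

  struct timer_sketch { void (*function)(struct timer_sketch *); };
  struct base_sketch  { struct timer_sketch *running_timer; };

  int detach_if_pending_sketch(struct timer_sketch *t, struct base_sketch *b, bool clear);

  static int try_shutdown_sketch(struct timer_sketch *timer,
                                 struct base_sketch *base, bool shutdown)
  {
      int ret = -1;

      if (base->running_timer != timer) {
          ret = detach_if_pending_sketch(timer, base, true);
          if (shutdown)
              timer->function = NULL;   /* safe: not running, now detached */
      }
      /* If the timer is running, ->function is left intact; the caller keeps
       * looping and clears it once the callback has finished. */
      return ret;
  }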
Fixes: 0cc04e80458a ("timers: Add shutdown mechanism to the internal functions")
Signed-off-by: Yipeng Zou <zouyipeng@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20251122093942.301559-1-zouyipeng@huawei.com
|
|
Pick up OF changes to resolve dependencies
|
|
Failing to allocate the affinity mask of an interrupt descriptor fails the
whole descriptor initialization. It is then guaranteed that the cpumask is
always available whenever the related interrupt objects are alive, such as
the kthread handler.
Therefore remove the superfluous check since it is merely a historical
leftover. Also get rid of the obsolete and useless comments above it.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20251121143500.42111-4-frederic@kernel.org
|
|
When a cpuset isolated partition is created / updated or destroyed, the
interrupt threads are affined blindly to all the non-isolated CPUs. This
happens without taking into account the interrupt threads initial affinity
that becomes ignored.
For example in a system with 8 CPUs, if an interrupt and its kthread are
initially affine to CPU 5, creating an isolated partition with only CPU 2
inside will eventually end up affining the interrupt kthread to all CPUs
but CPU 2 (that is CPUs 0,1,3-7), losing the kthread preference for CPU 5.
Besides the blind re-affining, this doesn't take care of the actual low
level interrupt which isn't migrated. As of today the only way to isolate
non managed interrupts, along with their kthreads, is to overwrite their
affinity separately, for example through /proc/irq/
To avoid doing that manually, future development should focus on updating
the interrupt's affinity whenever cpuset isolated partitions are updated.
In the meantime, cpuset shouldn't fiddle with interrupt threads directly.
To prevent that, set the PF_NO_SETAFFINITY flag on them.
This is done through kthread_bind_mask() by affining them initially to all
possible CPUs as at that point the interrupt is not started up which means
the affinity of the hard interrupt is not known. The thread will adjust
that once it reaches the handler, which is guaranteed to happen after the
initial affinity of the hard interrupt is established.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20251121143500.42111-3-frederic@kernel.org
|
|
During initialization, the interrupt thread is created before the interrupt
is enabled. The interrupt enablement happens before the actual kthread wake
up point. Once the interrupt is enabled the hardware can raise an interrupt
and once setup_irq() drops the descriptor lock a interrupt wake-up can
happen.
Even when such an interrupt can be considered premature, this is not a
problem in general because at the point where the descriptor lock is
dropped and the wakeup can happen, the data which is used by the thread is
fully initialized.
Though from the perspective of least surprise, the initial wakeup really
should be performed by the setup code and not randomly by a premature
interrupt.
Prevent this by performing a wake-up only if the target is in state
TASK_INTERRUPTIBLE, which the thread uses in wait_for_interrupt().
If the thread is still in state TASK_UNINTERRUPTIBLE, the wake-up is not
lost because after the setup code completed the initial wake-up the thread
will observe the IRQTF_RUNTHREAD and proceed with the handling.
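A sketch of how that rule can look inside kernel/irq/; wake_up_state() and
IRQTF_RUNTHREAD are existing kernel symbols, but the function below is
illustrative, not the actual patch:
  static void irq_wake_thread_sketch(struct irqaction *action)
  {
      /* Record that there is work for the thread ... */
      set_bit(IRQTF_RUNTHREAD, &action->thread_flags);
      /* ... but only wake it if it already waits in wait_for_interrupt()
       * (TASK_INTERRUPTIBLE). A thread still in TASK_UNINTERRUPTIBLE is woken
       * later by the setup code and then observes IRQTF_RUNTHREAD anyway. */
      wake_up_state(action->thread, TASK_INTERRUPTIBLE);
  }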
[ tglx: Simplified the changes and extended the changelog. ]
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20251121143500.42111-2-frederic@kernel.org
|
|
Currently, nested rcu critical sections are rejected by the verifier and
rcu_lock state is managed by a boolean variable. Add support for nested
rcu critical sections by making active_rcu_locks a counter similar to
active_preempt_locks. bpf_rcu_read_lock() increments this counter and
bpf_rcu_read_unlock() decrements it; the MEM_RCU -> PTR_UNTRUSTED transition
happens when active_rcu_locks drops to 0.
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20251117200411.25563-2-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Update bpf_insn_successors() to reflect that control flow might jump over
the instructions between a tail call and the function exit. Otherwise the
verifier might assume that some writes to the parent stack always happen,
which is not the case.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Martin Teichmann <martin.teichmann@xfel.eu>
Link: https://lore.kernel.org/r/20251119160355.1160932-4-martin.teichmann@xfel.eu
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|