Pull KVM updates from Paolo Bonzini:
"ARM:
- Support for userspace handling of synchronous external aborts
(SEAs), allowing the VMM to potentially handle the abort in a
non-fatal manner
- Large rework of the VGIC's list register handling with the goal of
supporting more active/pending IRQs than available list registers
in hardware. In addition, the VGIC now supports EOImode==1 style
deactivations for IRQs which may occur on a separate vCPU than the
one that acked the IRQ
- Support for FEAT_XNX (user / privileged execute permissions) and
FEAT_HAF (hardware update to the Access Flag) in the software page
table walkers and shadow MMU
- Allow page table destruction to reschedule, fixing long
need_resched latencies observed when destroying a large VM
- Minor fixes to KVM and selftests
LoongArch:
- Get VM PMU capability from HW GCFG register
- Add AVEC basic support
- Use 64-bit register definition for EIOINTC
- Add KVM timer test cases for tools/selftests
RISC-V:
- SBI message passing (MPXY) support for KVM guest
- Give a new, more specific error subcode for the case when in-kernel
AIA virtualization fails to allocate IMSIC VS-file
- Support KVM_DIRTY_LOG_INITIALLY_SET, enabling dirty log gradually
in small chunks
- Fix guest page fault within HLV* instructions
- Flush VS-stage TLB after VCPU migration for Andes cores
s390:
- Always allocate ESCA (Extended System Control Area), instead of
starting with the basic SCA and converting to ESCA with the
addition of the 65th vCPU. The price is an increased number of exits
(and worse performance) on z10 and earlier processors; ESCA was
introduced by z114/z196 in 2010
- VIRT_XFER_TO_GUEST_WORK support
- Operation exception forwarding support
- Cleanups
x86:
- Skip the costly "zap all SPTEs" on an MMIO generation wrap if MMIO
SPTE caching is disabled, as there can't be any relevant SPTEs to
zap
- Relocate a misplaced export
- Fix an async #PF bug where KVM would clear the completion queue
when the guest transitioned in and out of paging mode, e.g. when
handling an SMI and then returning to paged mode via RSM
- Leave KVM's user-return notifier registered even when disabling
virtualization, as long as kvm.ko is loaded. On reboot/shutdown,
keeping the notifier registered is ok; the kernel does not use the
MSRs and the callback will run cleanly and restore host MSRs if the
CPU manages to return to userspace before the system goes down
- Use the checked version of {get,put}_user()
- Fix a long-lurking bug where KVM's lack of catch-up logic for
periodic APIC timers can result in a hard lockup in the host
- Revert the periodic kvmclock sync logic now that KVM doesn't use a
clocksource that's subject to NTP corrections
- Clean up KVM's handling of MMIO Stale Data and L1TF, and bury the
latter behind CONFIG_CPU_MITIGATIONS
- Context switch XCR0, XSS, and PKRU outside of the entry/exit fast
path; the only reason they were handled in the fast path was to
paper over a bug in the core #MC code, and that has long since been
fixed
- Add emulator support for AVX MOV instructions, to play nice with
emulated devices whose guest drivers like to access PCI BARs with
large multi-byte instructions
x86 (AMD):
- Fix a few missing "VMCB dirty" bugs
- Fix the worst of KVM's lack of EFER.LMSLE emulation
- Add AVIC support for addressing 4k vCPUs in x2AVIC mode
- Fix incorrect handling of selective CR0 writes when checking
intercepts during emulation of L2 instructions
- Fix a currently-benign bug where KVM would clobber SPEC_CTRL[63:32]
on VMRUN and #VMEXIT
- Fix a bug where KVM would corrupt the guest code stream when
re-injecting
a soft interrupt if the guest patched the underlying code after the
VM-Exit, e.g. when Linux patches code with a temporary INT3
- Add KVM_X86_SNP_POLICY_BITS to advertise supported SNP policy bits
to userspace, and extend KVM "support" to all policy bits that
don't require any actual support from KVM
x86 (Intel):
- Use the root role from kvm_mmu_page to construct EPTPs instead of
the current vCPU state, partly as worthwhile cleanup, but mostly to
pave the way for tracking per-root TLB flushes, and elide EPT
flushes on pCPU migration if the root is clean from a previous
flush
- Add a few missing nested consistency checks
- Rip out support for doing "early" consistency checks via hardware
as the functionality hasn't been used in years and is no longer
useful in general; replace it with an off-by-default module param
to WARN if hardware fails a check that KVM does not perform
- Fix a currently-benign bug where KVM would drop the guest's
SPEC_CTRL[63:32] on VM-Enter
- Misc cleanups
- Overhaul the TDX code to address systemic races where KVM (acting
on behalf of userspace) could inadvertently trigger lock contention
in the TDX-Module; KVM was either working around these in weird,
ugly ways, or was simply oblivious to them (though even Yan's
devilish selftests could only break individual VMs, not the host
kernel)
- Fix a bug where KVM could corrupt a vCPU's cpu_list when freeing a
TDX vCPU, if creating said vCPU failed partway through
- Fix a few sparse warnings (bad annotation, 0 != NULL)
- Use struct_size() to simplify copying TDX capabilities to userspace
- Fix a bug where TDX would effectively corrupt user-return MSR
values if the TDX Module rejects VP.ENTER and thus doesn't clobber
host MSRs as expected
Selftests:
- Fix a math goof in mmu_stress_test when running on a single-CPU
system/VM
- Forcefully override ARCH from x86_64 to x86 to play nice with
specifying ARCH=x86_64 on the command line
- Extend a bunch of nested VMX tests to validate nested SVM as well
- Add support for LA57 in the core VM_MODE_xxx macro, and add a test
to verify KVM can save/restore nested VMX state when L1 is using
5-level paging, but L2 is not
- Clean up the guest paging code in anticipation of sharing the core
logic for nested EPT and nested NPT
guest_memfd:
- Add NUMA mempolicy support for guest_memfd, and clean up a variety
of rough edges in guest_memfd along the way
- Define a CLASS to automatically handle get+put when grabbing a
guest_memfd from a memslot to make it harder to leak references
- Enhance KVM selftests to make it easier to develop and debug
selftests like those added for guest_memfd NUMA support, e.g. where
test and/or KVM bugs often result in hard-to-debug SIGBUS errors
- Misc cleanups
Generic:
- Use the recently-added WQ_PERCPU when creating the per-CPU
workqueue for irqfd cleanup
- Fix a goof in the dirty ring documentation
- Fix choice of target for directed yield across different calls to
kvm_vcpu_on_spin(); the function was always starting from the first
vCPU instead of continuing the round-robin search"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (260 commits)
KVM: arm64: at: Update AF on software walk only if VM has FEAT_HAFDBS
KVM: arm64: at: Use correct HA bit in TCR_EL2 when regime is EL2
KVM: arm64: Document KVM_PGTABLE_PROT_{UX,PX}
KVM: arm64: Fix spelling mistake "Unexpeced" -> "Unexpected"
KVM: arm64: Add break to default case in kvm_pgtable_stage2_pte_prot()
KVM: arm64: Add endian casting to kvm_swap_s[12]_desc()
KVM: arm64: Fix compilation when CONFIG_ARM64_USE_LSE_ATOMICS=n
KVM: arm64: selftests: Add test for AT emulation
KVM: arm64: nv: Expose hardware access flag management to NV guests
KVM: arm64: nv: Implement HW access flag management in stage-2 SW PTW
KVM: arm64: Implement HW access flag management in stage-1 SW PTW
KVM: arm64: Propagate PTW errors up to AT emulation
KVM: arm64: Add helper for swapping guest descriptor
KVM: arm64: nv: Use pgtable definitions in stage-2 walk
KVM: arm64: Handle endianness in read helper for emulated PTW
KVM: arm64: nv: Stop passing vCPU through void ptr in S2 PTW
KVM: arm64: Call helper for reading descriptors directly
KVM: arm64: nv: Advertise support for FEAT_XNX
KVM: arm64: Teach ptdump about FEAT_XNX permissions
KVM: s390: Use generic VIRT_XFER_TO_GUEST_WORK functions
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 CPU feature updates from Dave Hansen:
"The biggest thing of note here is Linear Address Space Separation
(LASS). It represents the first time I can think of that the
upper=>kernel/lower=>user address space convention is actually
recognized by the hardware on x86. It ensures that userspace cannot
even get the hardware to _start_ page walks for the kernel address
space. This, of course, is a really nice generic side channel defense.
This is really only a down payment on LASS support. There are still
some details to work out in its interaction with EFI calls and
vsyscall emulation. For now, LASS is disabled if either of those
features is compiled in (which is almost always the case).
There's also one straggler commit in here which converts an
under-utilized AMD CPU feature leaf into a generic Linux-defined leaf
so more features can be packed in there.
Summary:
- Enable Linear Address Space Separation (LASS)
- Change X86_FEATURE leaf 17 from an AMD leaf to Linux-defined"
* tag 'x86_cpu_for_6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu: Enable LASS during CPU initialization
selftests/x86: Update the negative vsyscall tests to expect a #GP
x86/traps: Communicate a LASS violation in #GP message
x86/kexec: Disable LASS during relocate kernel
x86/alternatives: Disable LASS when patching kernel code
x86/asm: Introduce inline memcpy and memset
x86/cpu: Add an LASS dependency on SMAP
x86/cpufeatures: Enumerate the LASS feature bits
x86/cpufeatures: Make X86_FEATURE leaf 17 Linux-specific
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 entry update from Dave Hansen:
"This one is pretty trivial: fix a badly-named FRED data structure
member"
* tag 'x86_entry_for_6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/fred: Fix 64bit identifier in fred_ss
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 updates from Dave Hansen:
"The most significant are some changes to ensure that symbols exported
for KVM are used only by KVM modules themselves, along with some
related cleanups.
In true x86/misc fashion, the other patch is completely unrelated and
just enhances an existing pr_warn() to make it clear to users how they
have tainted their kernel when something is mucking with MSRs.
Summary:
- Make MSR-induced taint easier for users to track down
- Restrict KVM-specific exports to KVM itself"
* tag 'x86_misc_for_6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Restrict KVM-induced symbol exports to KVM modules where obvious/possible
x86/mm: Drop unnecessary export of "ptdump_walk_pgd_level_debugfs"
x86/mtrr: Drop unnecessary export of "mtrr_state"
x86/bugs: Drop unnecessary export of "x86_spec_ctrl_base"
x86/msr: Add CPU_OUT_OF_SPEC taint name to "unrecognized" pr_warn(msg)
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 SGX updates from Dave Hansen:
"The main content here is adding support for the new EUPDATESVN SGX
ISA. Before this, folks who updated microcode had to reboot before
enclaves could attest to the new microcode. The new functionality lets
them do this without a reboot.
The rest are some nice, but relatively mundane comment and kernel-doc
fixups.
Summary:
- Allow security version (SVN) updates so enclaves can attest to new
microcode
- Fix kernel docs typos"
* tag 'x86_sgx_for_6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/sgx: Fix a typo in the kernel-doc comment for enum sgx_attribute
x86/sgx: Remove superfluous asterisk from copyright comment in asm/sgx.h
x86/sgx: Document structs and enums with '@', not '%'
x86/sgx: Add kernel-doc descriptions for params passed to vDSO user handler
x86/sgx: Add a missing colon in kernel-doc markup for "struct sgx_enclave_run"
x86/sgx: Enable automatic SVN updates for SGX enclaves
x86/sgx: Implement ENCLS[EUPDATESVN]
x86/sgx: Define error codes for use by ENCLS[EUPDATESVN]
x86/cpufeatures: Add X86_FEATURE_SGX_EUPDATESVN feature flag
x86/sgx: Introduce functions to count the sgx_(vepc_)open()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm updates from Borislav Petkov:
- Use the proper accessors when reading CR3 as part of the page level
transitions (5-level to 4-level, the use case being kexec) so that
only the physical address in CR3 is picked up and not flags which are
above the physical mask shift
- Clean up and unify __phys_addr_symbol() definitions
* tag 'x86_mm_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
efi/libstub: Fix page table access in 5-level to 4-level paging transition
x86/boot: Fix page table access in 5-level to 4-level paging transition
x86/mm: Unify __phys_addr_symbol()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 SEV updates from Borislav Petkov:
- Largely cleanups along with a change to save XSS to the GHCB
(Guest-Host Communication Block) in SEV-ES guests so that the
hypervisor can determine the guest's XSAVES buffer size properly
and thus support shadow stacks in AMD confidential guests
* tag 'x86_sev_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cc: Fix enum spelling to fix kernel-doc warnings
x86/boot: Drop unused sev_enable() fallback
x86/coco/sev: Convert has_cpuflag() to use cpu_feature_enabled()
x86/sev: Include XSS value in GHCB CPUID request
x86/boot: Move boot_*msr helpers to asm/shared/msr.h
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Borislav Petkov:
- The mandatory pile of cleanups the cat drags in every merge window
* tag 'x86_cleanups_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/boot: Clean up whitespace in a20.c
x86/mm: Delete disabled debug code
x86/{boot,mtrr}: Remove unused function declarations
x86/percpu: Use BIT_WORD() and BIT_MASK() macros
x86/cpufeatures: Correct LKGS feature flag description
x86/idtentry: Add missing '*' to kernel-doc lines
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 resource control updates from Borislav Petkov:
- Add support for AMD's Smart Data Cache Injection feature which allows
for direct insertion of data from I/O devices into the L3 cache, thus
bypassing DRAM and saving its bandwidth; the resctrl side of the
feature allows the size of the L3 used for data injection to be
controlled
- Add Intel Clearwater Forest to the list of CPUs which support
Sub-NUMA clustering
- Other fixes and cleanups
* tag 'x86_cache_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
fs/resctrl: Update bit_usage to reflect io_alloc
fs/resctrl: Introduce interface to modify io_alloc capacity bitmasks
fs/resctrl: Modify struct rdt_parse_data to pass mode and CLOSID
fs/resctrl: Introduce interface to display io_alloc CBMs
fs/resctrl: Add user interface to enable/disable io_alloc feature
fs/resctrl: Introduce interface to display "io_alloc" support
x86,fs/resctrl: Implement "io_alloc" enable/disable handlers
x86,fs/resctrl: Detect io_alloc feature
x86/resctrl: Add SDCIAE feature in the command line options
x86/cpufeatures: Add support for L3 Smart Data Cache Injection Allocation Enforcement
fs/resctrl: Consider sparse masks when initializing new group's allocation
x86/resctrl: Support Sub-NUMA Cluster (SNC) mode on Clearwater Forest
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 microcode loading updates from Borislav Petkov:
- Add microcode staging support on Intel: move the loading of the
microcode blobs themselves to a non-critical path so that microcode
loading latencies are kept at a minimum. Directing the hardware to
actually load the microcode is the only step which remains on the
critical path. The scheme is also opportunistic: on failure, the
machinery falls back to normal loading
- Add the capability to the AMD side of the loader to select one of two
per-family/model/stepping patches: one pre-Entrysign and the other
post-Entrysign, with the goal of taking care of machines which
haven't updated their BIOS yet - something they should absolutely do
as this is the only proper Entrysign fix
- Other small cleanups and fixlets
* tag 'x86_microcode_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/microcode: Mark early_parse_cmdline() as __init
x86/microcode/AMD: Select which microcode patch to load
x86/microcode/intel: Enable staging when available
x86/microcode/intel: Support mailbox transfer
x86/microcode/intel: Implement staging handler
x86/microcode/intel: Define staging state struct
x86/microcode/intel: Establish staging control logic
x86/microcode: Introduce staging step to reduce late-loading time
x86/cpu/topology: Make primary thread mask available with SMP=n
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 RAS updates from Borislav Petkov:
- The second part of the AMD MCA interrupts rework, after the
last-minute show-stopper from the last merge window was sorted out.
With this, the AMD MCA deferred error, thresholding and corrected
error interrupt handlers use common MCA code and are tightly
integrated into the core MCA code, getting rid of considerable
duplication. All of this culminates in allowing CMCI error
thresholding storms to be detected on AMD too, using the common
infrastructure
- Add support for two new MCA bank bits on AMD Zen6 which denote
whether the logged error address is a system physical address,
obviating the need to translate it before further error recovery
can be done
* tag 'ras_core_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Handle AMD threshold interrupt storms
x86/mce: Do not clear bank's poll bit in mce_poll_banks on AMD SMCA systems
x86/mce: Add support for physical address valid bit
x86/mce: Save and use APEI corrected threshold limit
x86/mce/amd: Define threshold restart function for banks
x86/mce/amd: Remove redundant reset_block()
x86/mce/amd: Support SMCA Corrected Error Interrupt
x86/mce/amd: Enable interrupt vectors once per-CPU on SMCA systems
x86/mce: Unify AMD DFR handler with MCA Polling
x86/mce: Unify AMD THR handler with MCA Polling
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull rseq updates from Thomas Gleixner:
"A large overhaul of the restartable sequences and CID management:
The recent enablement of RSEQ in glibc resulted in regressions caused
by the related overhead. It turned out that the decision to invoke
the exit-to-user work was not really a decision; more or less every
context switch triggered it. There is a long list of small issues
which sum up nicely and result in a 3-4% regression in I/O
benchmarks.
The other detail which caused issues, due to extra work in context
switch and task migration, is the CID (memory context ID) management.
It also requires using a task work to consolidate the CID space,
which is executed in the context of an arbitrary task and results in
sporadic uncontrolled exit latencies.
The rewrite addresses this by:
- Removing deprecated and long unsupported functionality
- Moving the related data into dedicated data structures which are
optimized for fast path processing.
- Caching values so actual decisions can be made
- Replacing the current implementation with an optimized inlined
variant.
- Separating fast and slow path for architectures which use the
generic entry code, so that only fault and error handling goes into
the TIF_NOTIFY_RESUME handler.
- Rewriting the CID management so that it becomes mostly invisible in
the context switch path. That moves the work of switching modes
into the fork/exit path, which is a reasonable tradeoff. That work
is only required when a process creates more threads than the
cpuset it is allowed to run on or when enough threads exit after
that. An artificial thread pool benchmark which triggers this did
not degrade; it actually improved significantly.
The main effect in migration heavy scenarios is that runqueue lock
held time and therefore contention goes down significantly"
* tag 'core-rseq-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits)
sched/mmcid: Switch over to the new mechanism
sched/mmcid: Implement deferred mode change
irqwork: Move data struct to a types header
sched/mmcid: Provide CID ownership mode fixup functions
sched/mmcid: Provide new scheduler CID mechanism
sched/mmcid: Introduce per task/CPU ownership infrastructure
sched/mmcid: Serialize sched_mm_cid_fork()/exit() with a mutex
sched/mmcid: Provide precomputed maximal value
sched/mmcid: Move initialization out of line
signal: Move MMCID exit out of sighand lock
sched/mmcid: Convert mm CID mask to a bitmap
cpumask: Cache num_possible_cpus()
sched/mmcid: Use cpumask_weighted_or()
cpumask: Introduce cpumask_weighted_or()
sched/mmcid: Prevent pointless work in mm_update_cpus_allowed()
sched/mmcid: Move scheduler code out of global header
sched: Fixup whitespace damage
sched/mmcid: Cacheline align MM CID storage
sched/mmcid: Use proper data structures
sched/mmcid: Revert the complex CID management
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scoped user access updates from Thomas Gleixner:
"Scoped user mode access and related changes:
- Implement the missing u64 user access function on ARM when
CONFIG_CPU_SPECTRE=n.
This makes it possible to access a 64bit value in generic code with
[unsafe_]get_user(). All other architectures and ARM variants
provide the relevant accessors already.
- Ensure that ASM GOTO jump label usage in the user mode access
helpers always goes through a local C scope label indirection
inside the helpers.
This is required because compilers do not support an ASM GOTO
target leaving an auto cleanup scope. GCC silently fails to emit
the cleanup invocation and Clang fails the build.
[ Editor's note: gcc-16 will have fixed the code generation issue
in commit f68fe3ddda4 ("eh: Invoke cleanups/destructors in asm
goto jumps [PR122835]"). But we obviously have to deal with clang
and older versions of gcc, so.. - Linus ]
This provides generic wrapper macros and the conversion of affected
architecture code to use them.
- Scoped user mode access with auto cleanup
Access to user mode memory can be required in hot code paths, but
if it has to be done with user controlled pointers, the access is
shielded with a speculation barrier, so that the CPU cannot
speculate around the address range check. Those speculation
barriers impact performance quite significantly.
This cost can be avoided by "masking" the provided pointer so it is
guaranteed to be in the valid user memory access range and
otherwise to point to a guaranteed unpopulated address space. This
has to be done without branches so it creates an address dependency
for the access, which the CPU cannot speculate ahead.
This results in repetitive and error-prone programming patterns:

	if (can_do_masked_user_access())
		from = masked_user_read_access_begin((from));
	else if (!user_read_access_begin(from, sizeof(*from)))
		return -EFAULT;
	unsafe_get_user(val, from, Efault);
	user_read_access_end();
	return 0;
Efault:
	user_read_access_end();
	return -EFAULT;

which can be replaced with scopes and automatic cleanup (see the
sketch after this list):

	scoped_user_read_access(from, Efault)
		unsafe_get_user(val, from, Efault);
	return 0;
Efault:
	return -EFAULT;
- Convert code which implements the above pattern over to
scoped_user_.*_access(). This also corrects a couple of imbalanced
masked_*_begin() instances which are harmless on most
architectures, but prevent PowerPC from implementing the masking
optimization.
- Add a missing speculation barrier in copy_from_user_iter()"
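As a concrete illustration of the conversion, here is a minimal sketch of a
read helper written in the scoped style; the scoped_user_read_access() and
unsafe_get_user() usage follows the pattern quoted above, while the helper
itself and its signature are illustrative assumptions:

	/* Illustrative helper, not from the series: read one u64 from userspace */
	static int read_user_u64(const u64 __user *from, u64 *val)
	{
		scoped_user_read_access(from, Efault)
			unsafe_get_user(*val, from, Efault);
		return 0;
	Efault:
		return -EFAULT;
	}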
* tag 'core-uaccess-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
lib/strn*,uaccess: Use masked_user_{read/write}_access_begin when required
scm: Convert put_cmsg() to scoped user access
iov_iter: Add missing speculation barrier to copy_from_user_iter()
iov_iter: Convert copy_from_user_iter() to masked user access
select: Convert to scoped user access
x86/futex: Convert to scoped user access
futex: Convert to get/put_user_inline()
uaccess: Provide put/get_user_inline()
uaccess: Provide scoped user access regions
arm64: uaccess: Use unsafe wrappers for ASM GOTO
s390/uaccess: Use unsafe wrappers for ASM GOTO
riscv/uaccess: Use unsafe wrappers for ASM GOTO
powerpc/uaccess: Use unsafe wrappers for ASM GOTO
x86/uaccess: Use unsafe wrappers for ASM GOTO
uaccess: Provide ASM GOTO safe wrappers for unsafe_*_user()
ARM: uaccess: Implement missing __get_user_asm_dword()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull bug handling infrastructure updates from Ingo Molnar:
"Core updates:
- Improve WARN(), which has vararg printf-like arguments, to work
with the x86 #UD based WARN-optimizing infrastructure by hiding the
format in the bug_table and replacing this first argument with the
address of the bug-table entry, while making the actual function
that's called a UD1 instruction (Peter Zijlstra)
- Introduce the CONFIG_DEBUG_BUGVERBOSE_DETAILED Kconfig switch (Ingo
Molnar, s390 support by Heiko Carstens)
Fixes and cleanups:
- bugs/s390: Remove private WARN_ON() implementation (Heiko Carstens)
- <asm/bugs.h>: Make i386 use GENERIC_BUG_RELATIVE_POINTERS (Peter
Zijlstra)"
* tag 'core-bugs-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (31 commits)
x86/bugs: Make i386 use GENERIC_BUG_RELATIVE_POINTERS
x86/bug: Fix BUG_FORMAT vs KASLR
x86_64/bug: Inline the UD1
x86/bug: Implement WARN_ONCE()
x86_64/bug: Implement __WARN_printf()
x86/bug: Use BUG_FORMAT for DEBUG_BUGVERBOSE_DETAILED
x86/bug: Add BUG_FORMAT basics
bug: Allow architectures to provide __WARN_printf()
bug: Implement WARN_ON() using __WARN_FLAGS()
bug: Add report_bug_entry()
bug: Add BUG_FORMAT_ARGS infrastructure
bug: Clean up CONFIG_GENERIC_BUG_RELATIVE_POINTERS
bug: Add BUG_FORMAT infrastructure
x86: Rework __bug_table helpers
bugs/s390: Remove private WARN_ON() implementation
bugs/core: Reorganize fields in the first line of WARNING output, add ->comm[] output
bugs/sh: Concatenate 'cond_str' with '__FILE__' in __WARN_FLAGS(), to extend WARN_ON/BUG_ON output
bugs/parisc: Concatenate 'cond_str' with '__FILE__' in __WARN_FLAGS(), to extend WARN_ON/BUG_ON output
bugs/riscv: Concatenate 'cond_str' with '__FILE__' in __BUG_FLAGS(), to extend WARN_ON/BUG_ON output
bugs/riscv: Pass in 'cond_str' to __BUG_FLAGS()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"Scalability and load-balancing improvements:
- Enable scheduler feature NEXT_BUDDY (Mel Gorman)
- Reimplement NEXT_BUDDY to align with EEVDF goals (Mel Gorman)
- Skip sched_balance_running cmpxchg when balance is not due (Tim
Chen)
- Implement generic code for architecture specific sched domain NUMA
distances (Tim Chen)
- Optimize the NUMA distances of the sched-domains builds of Intel
Granite Rapids (GNR) and Clearwater Forest (CWF) platforms (Tim
Chen)
- Implement proportional newidle balance: a randomized algorithm that
runs newidle balancing proportional to its success rate. (Peter
Zijlstra)
Scheduler infrastructure changes:
- Implement the 'sched_change' scoped_guard() pattern for the entire
scheduler (Peter Zijlstra)
- More broadly utilize the sched_change guard (Peter Zijlstra)
- Add support to pick functions to take runqueue-flags (Joel
Fernandes)
- Provide and use set_need_resched_current() (Peter Zijlstra)
Fair scheduling enhancements:
- Forfeit vruntime on yield (Fernand Sieber)
- Only update stats for allowed CPUs when looking for dst group (Adam
Li)
CPU-core scheduling enhancements:
- Optimize core cookie matching check (Fernand Sieber)
Deadline scheduler fixes:
- Only set free_cpus for online runqueues (Doug Berger)
- Fix dl_server time accounting (Peter Zijlstra)
- Fix dl_server stop condition (Peter Zijlstra)
Proxy scheduling fixes:
- Yield the donor task (Fernand Sieber)
Fixes and cleanups:
- Fix do_set_cpus_allowed() locking (Peter Zijlstra)
- Fix migrate_disable_switch() locking (Peter Zijlstra)
- Remove double update_rq_clock() in __set_cpus_allowed_ptr_locked()
(Hao Jia)
- Increase sched_tick_remote timeout (Phil Auld)
- sched/deadline: Use cpumask_weight_and() in dl_bw_cpus() (Shrikanth
Hegde)
- sched/deadline: Clean up select_task_rq_dl() (Shrikanth Hegde)"
* tag 'sched-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
sched: Provide and use set_need_resched_current()
sched/fair: Proportional newidle balance
sched/fair: Small cleanup to update_newidle_cost()
sched/fair: Small cleanup to sched_balance_newidle()
sched/fair: Revert max_newidle_lb_cost bump
sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
sched/fair: Enable scheduler feature NEXT_BUDDY
sched: Increase sched_tick_remote timeout
sched/fair: Have SD_SERIALIZE affect newidle balancing
sched/fair: Skip sched_balance_running cmpxchg when balance is not due
sched/deadline: Minor cleanup in select_task_rq_dl()
sched/deadline: Use cpumask_weight_and() in dl_bw_cpus
sched/deadline: Document dl_server
sched/deadline: Fix dl_server stop condition
sched/deadline: Fix dl_server time accounting
sched/core: Remove double update_rq_clock() in __set_cpus_allowed_ptr_locked()
sched/eevdf: Fix min_vruntime vs avg_vruntime
sched/core: Add comment explaining force-idle vruntime snapshots
sched/core: Optimize core cookie matching check
sched/proxy: Yield the donor task
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull performance events updates from Ingo Molnar:
"Callchain support:
- Add support for deferred user-space stack unwinding for perf,
enabled on x86. (Peter Zijlstra, Steven Rostedt)
- unwind_user/x86: Enable frame pointer unwinding on x86 (Josh
Poimboeuf)
x86 PMU support and infrastructure:
- x86/insn: Simplify for_each_insn_prefix() (Peter Zijlstra)
- x86/insn,uprobes,alternative: Unify insn_is_nop() (Peter Zijlstra)
Intel PMU driver:
- Large series to prepare for and implement architectural PEBS
support for Intel platforms such as Clearwater Forest (CWF) and
Panther Lake (PTL). (Dapeng Mi, Kan Liang)
- Check dynamic constraints (Kan Liang)
- Optimize PEBS extended config (Peter Zijlstra)
- cstates:
- Remove PC3 support from LunarLake (Zhang Rui)
- Add Pantherlake support (Zhang Rui)
- Clearwater Forest support (Zide Chen)
AMD PMU driver:
- x86/amd: Check event before enable to avoid GPF (George Kennedy)
Fixes and cleanups:
- task_work: Fix NMI race condition (Peter Zijlstra)
- perf/x86: Fix NULL event access and potential PEBS record loss
(Dapeng Mi)
- Misc other fixes and cleanups (Dapeng Mi, Ingo Molnar, Peter
Zijlstra)"
* tag 'perf-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
perf/x86/intel: Fix and clean up intel_pmu_drain_arch_pebs() type use
perf/x86/intel: Optimize PEBS extended config
perf/x86/intel: Check PEBS dyn_constraints
perf/x86/intel: Add a check for dynamic constraints
perf/x86/intel: Add counter group support for arch-PEBS
perf/x86/intel: Setup PEBS data configuration and enable legacy groups
perf/x86/intel: Update dyn_constraint base on PEBS event precise level
perf/x86/intel: Allocate arch-PEBS buffer and initialize PEBS_BASE MSR
perf/x86/intel: Process arch-PEBS records or record fragments
perf/x86/intel/ds: Factor out PEBS group processing code to functions
perf/x86/intel/ds: Factor out PEBS record processing code to functions
perf/x86/intel: Initialize architectural PEBS
perf/x86/intel: Correct large PEBS flag check
perf/x86/intel: Replace x86_pmu.drain_pebs calling with static call
perf/x86: Fix NULL event access and potential PEBS record loss
perf/x86: Remove redundant is_x86_event() prototype
entry,unwind/deferred: Fix unwind_reset_info() placement
unwind_user/x86: Fix arch=um build
perf: Support deferred user unwind
unwind_user/x86: Teach FP unwind about start of function
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool updates from Ingo Molnar:
- klp-build livepatch module generation (Josh Poimboeuf)
Introduce new objtool features and a klp-build script to generate
livepatch modules using a source .patch as input.
This builds on concepts from the longstanding out-of-tree kpatch
project which began in 2012 and has been used for many years to
generate livepatch modules for production kernels. However, this is a
complete rewrite which incorporates hard-earned lessons from 12+
years of maintaining kpatch.
Key improvements compared to kpatch-build:
- Integrated with objtool: Leverages objtool's existing control-flow
graph analysis to help detect changed functions.
- Works on vmlinux.o: Supports late-linked objects, making it
compatible with LTO, IBT, and similar.
- Simplified code base: ~3k fewer lines of code.
- Upstream: No more out-of-tree #ifdef hacks, far less cruft.
- Cleaner internals: Vastly simplified logic for
symbol/section/reloc inclusion and special section extraction.
- Robust __LINE__ macro handling: Avoids false positive binary diffs
caused by the __LINE__ macro by introducing a fix-patch-lines
script which injects #line directives into the source .patch to
preserve the original line numbers at compile time.
- Disassemble code with libopcodes instead of running objdump
(Alexandre Chartre)
- Disassemble support (-d option to objtool) by Alexandre Chartre,
which supports the decoding of various Linux kernel code generation
specials such as alternatives:
17ef: sched_balance_find_dst_group+0x62f mov 0x34(%r9),%edx
17f3: sched_balance_find_dst_group+0x633 | <alternative.17f3> | X86_FEATURE_POPCNT
17f3: sched_balance_find_dst_group+0x633 | call 0x17f8 <__sw_hweight64> | popcnt %rdi,%rax
17f8: sched_balance_find_dst_group+0x638 cmp %eax,%edx
... jump table alternatives:
1895: sched_use_asym_prio+0x5 test $0x8,%ch
1898: sched_use_asym_prio+0x8 je 0x18a9 <sched_use_asym_prio+0x19>
189a: sched_use_asym_prio+0xa | <jump_table.189a> | JUMP
189a: sched_use_asym_prio+0xa | jmp 0x18ae <sched_use_asym_prio+0x1e> | nop2
189c: sched_use_asym_prio+0xc mov $0x1,%eax
18a1: sched_use_asym_prio+0x11 and $0x80,%ecx
... exception table alternatives:
native_read_msr:
5b80: native_read_msr+0x0 mov %edi,%ecx
5b82: native_read_msr+0x2 | <ex_table.5b82> | EXCEPTION
5b82: native_read_msr+0x2 | rdmsr | resume at 0x5b84 <native_read_msr+0x4>
5b84: native_read_msr+0x4 shl $0x20,%rdx
.... x86 feature flag decoding (also see the X86_FEATURE_POPCNT
example in sched_balance_find_dst_group() above):
2faaf: start_thread_common.constprop.0+0x1f jne 0x2fba4 <start_thread_common.constprop.0+0x114>
2fab5: start_thread_common.constprop.0+0x25 | <alternative.2fab5> | X86_FEATURE_ALWAYS | X86_BUG_NULL_SEG
2fab5: start_thread_common.constprop.0+0x25 | jmp 0x2faba <.altinstr_aux+0x2f4> | jmp 0x4b0 <start_thread_common.constprop.0+0x3f> | nop5
2faba: start_thread_common.constprop.0+0x2a mov $0x2b,%eax
... NOP sequence shortening:
1048e2: snapshot_write_finalize+0xc2 je 0x104917 <snapshot_write_finalize+0xf7>
1048e4: snapshot_write_finalize+0xc4 nop6
1048ea: snapshot_write_finalize+0xca nop11
1048f5: snapshot_write_finalize+0xd5 nop11
104900: snapshot_write_finalize+0xe0 mov %rax,%rcx
104903: snapshot_write_finalize+0xe3 mov 0x10(%rdx),%rax
... and much more.
- Function validation tracing support (Alexandre Chartre)
- Various -ffunction-sections fixes (Josh Poimboeuf)
- Clang AutoFDO (Automated Feedback-Directed Optimizations) support
(Josh Poimboeuf)
- Misc fixes and cleanups (Borislav Petkov, Chen Ni, Dylan Hatch, Ingo
Molnar, John Wang, Josh Poimboeuf, Pankaj Raghav, Peter Zijlstra,
Thorsten Blum)
* tag 'objtool-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (129 commits)
objtool: Fix segfault on unknown alternatives
objtool: Build with disassembly can fail when including bdf.h
objtool: Trim trailing NOPs in alternative
objtool: Add wide output for disassembly
objtool: Compact output for alternatives with one instruction
objtool: Improve naming of group alternatives
objtool: Add Function to get the name of a CPU feature
objtool: Provide access to feature and flags of group alternatives
objtool: Fix address references in alternatives
objtool: Disassemble jump table alternatives
objtool: Disassemble exception table alternatives
objtool: Print addresses with alternative instructions
objtool: Disassemble group alternatives
objtool: Print headers for alternatives
objtool: Preserve alternatives order
objtool: Add the --disas=<function-pattern> action
objtool: Do not validate IBT for .return_sites and .call_sites
objtool: Improve tracing of alternative instructions
objtool: Add functions to better name alternatives
objtool: Identify the different types of alternatives
...
|
|
Linus figured less #ifdef is more better and making x86-32 use
GENERIC_BUG_RELATIVE_POINTERS removes one layer of macro magic from
the bug.h bits.
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
|
Encoding a relative NULL pointer doesn't work with KASLR: when the
whole kernel image gets shifted, the __bug_table and the target string
get shifted by the same amount and the relative offset is preserved.
However, when the target is an absolute 0 value and the __bug_table
gets moved about, the end result is a pointer equivalent to
kaslr_offset(), not NULL.
Notably, this will generate SHN_UNDEF relocations, and Ard would
really like to not have those at all.
Use the empty string to denote no-string.
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
|
KVM SVM changes for 6.19:
- Fix a few missing "VMCB dirty" bugs.
- Fix the worst of KVM's lack of EFER.LMSLE emulation.
- Add AVIC support for addressing 4k vCPUs in x2AVIC mode.
- Fix incorrect handling of selective CR0 writes when checking intercepts
during emulation of L2 instructions.
- Fix a currently-benign bug where KVM would clobber SPEC_CTRL[63:32] on
VMRUN and #VMEXIT.
- Fix a bug where KVM would corrupt the guest code stream when re-injecting a
  soft interrupt if the guest patched the underlying code after the VM-Exit,
  e.g. when Linux patches code with a temporary INT3.
- Add KVM_X86_SNP_POLICY_BITS to advertise supported SNP policy bits to
userspace, and extend KVM "support" to all policy bits that don't require
any actual support from KVM.
|
|
KVM TDX changes for 6.19:
- Overhaul the TDX code to address systemic races where KVM (acting on behalf
of userspace) could inadvertently trigger lock contention in the TDX-Module,
which KVM was either working around in weird, ugly ways, or was simply
oblivious to (as proven by Yan tripping several KVM_BUG_ON()s with clever
selftests).
- Fix a bug where KVM could corrupt a vCPU's cpu_list when freeing a vCPU if
creating said vCPU failed partway through.
- Fix a few sparse warnings (bad annotation, 0 != NULL).
- Use struct_size() to simplify copying capabilities to userspace.
|
|
KVM x86 misc changes for 6.19:
- Fix an async #PF bug where KVM would clear the completion queue when the
guest transitioned in and out of paging mode, e.g. when handling an SMI and
then returning to paged mode via RSM.
- Fix a bug where TDX would effectively corrupt user-return MSR values if the
TDX Module rejects VP.ENTER and thus doesn't clobber host MSRs as expected.
- Leave the user-return notifier used to restore MSRs registered when
disabling virtualization, and instead pin kvm.ko. Restoring host MSRs via
IPI callback is either pointless (clean reboot) or dangerous (forced reboot)
since KVM has no idea what code it's interrupting.
- Use the checked versions of {get,put}_user(), as Linus wants to kill off
  the unchecked variants; the checked versions are also measurably faster on
  modern CPUs, since the unchecked ones contain an LFENCE.
- Fix a long-lurking bug where KVM's lack of catch-up logic for periodic APIC
timers can result in a hard lockup in the host.
- Revert the periodic kvmclock sync logic now that KVM doesn't use a
clocksource that's subject to NTP corrections.
- Clean up KVM's handling of MMIO Stale Data and L1TF, and bury the latter
behind CONFIG_CPU_MITIGATIONS.
- Context switch XCR0, XSS, and PKRU outside of the entry/exit fastpath as
the only reason they were handled in the fastpath was to paper over a bug
in the core #MC code that has long since been fixed.
- Add emulator support for AVX MOV instructions to play nice with emulated
devices whose PCI BARs guest drivers like to access with large multi-byte
instructions.
|
|
(Ab)use the static_call infrastructure to convert all:

	call	__WARN_trap

instances into the desired:

	ud1	(%edx), %rdi

eliminating the CALL/RET, but more importantly, fixing the fact that
all WARNs will have:

	RIP: 0010:__WARN_trap+0
Basically, by making it a static_call trampoline call, objtool will
collect the callsites, and then the inline rewrite will hit the
special case and replace the code with the magic instruction.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251110115758.456717741@infradead.org
|
|
Implement WARN_ONCE() like WARN(), using BUGFLAG_ONCE.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251110115758.339309119@infradead.org
|
|
The basic idea is to have __WARN_printf() be a vararg function such
that the compiler can do the optimal calling convention for us. This
function body will be a #UD; the exception handler then sets up a
va_list from pt_regs.
But because the trap will be in a called function, the bug_entry must
be passed in. Have that be the first argument, with the format tucked
away inside the bug_entry.
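For orientation, here is a hedged sketch of what the extended bug_entry
could look like with the format stored alongside the existing fields; the
field names are assumptions derived from the __bug_table comments in the
assembly below:

	/* Sketch only; relative (displacement) encoding per GENERIC_BUG_RELATIVE_POINTERS */
	struct bug_entry {
		signed int	bug_addr_disp;	/* relative address of the trap */
		signed int	format_disp;	/* relative format string (new) */
		signed int	file_disp;	/* relative file name */
		unsigned short	line;
		unsigned short	flags;
	};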
The comments should clarify the real fun details.
The big downside is that all WARNs will now show:

	RIP: 0010:__WARN_trap+0
One possible solution is to simply discard the top frame when
unwinding. A follow up patch takes care of this slightly differently
by abusing the x86 static_call implementation.
This changes (with the next patches):

	WARN_ONCE(preempt_count() != 2*PREEMPT_DISABLE_OFFSET,
		  "corrupted preempt_count: %s/%d/0x%x\n",

from:

	cmpl	$2, %ecx	#, _7
	jne	.L1472
	...
.L1472:
	cmpb	$0, __already_done.11(%rip)
	je	.L1513
	...
.L1513:
	movb	$1, __already_done.11(%rip)
	movl	1424(%r14), %edx	# _15->pid, _15->pid
	leaq	1912(%r14), %rsi	#, _17
	movq	$.LC43, %rdi	#,
	call	__warn_printk	#
	ud2
	.pushsection __bug_table,"aw"
2:
	.long	1b - .		# bug_entry::bug_addr
	.long	.LC1 - .	# bug_entry::file
	.word	5093		# bug_entry::line
	.word	2313		# bug_entry::flags
	.org	2b + 12
	.popsection
	.pushsection .discard.annotate_insn,"M", @progbits, 8
	.long	1b - .
	.long	8		# ANNOTYPE_REACHABLE
	.popsection

into:

	cmpl	$2, %ecx	#, _7
	jne	.L1442		#,
	...
.L1442:
	lea	(2f)(%rip), %rdi
1:
	.pushsection __bug_table,"aw"
2:
	.long	1b - .		# bug_entry::bug_addr
	.long	.LC43 - .	# bug_entry::format
	.long	.LC1 - .	# bug_entry::file
	.word	5093		# bug_entry::line
	.word	2323		# bug_entry::flags
	.org	2b + 16
	.popsection
	movl	1424(%r14), %edx	# _19->pid, _19->pid
	leaq	1912(%r14), %rsi	#, _13
	ud1	(%edx), %rdi
Notably, by pushing everything into the exception handler it can take
care of the ONCE thing.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251110115758.213813530@infradead.org
|
|
Since we have an explicit format string, use it for the condition string
instead of frobbing it in the file string.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251110115758.097401406@infradead.org
|
|
Opt in to BUG_FORMAT for x86_64, adjust the BUGTABLE helper and, for
now, just store NULL pointers.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251110115757.980264454@infradead.org
|
|
Smart Data Cache Injection (SDCI) is a mechanism that enables direct insertion
of data from I/O devices into the L3 cache. By directly caching data from I/O
devices rather than first storing the I/O data in DRAM, SDCI reduces demands on
DRAM bandwidth and reduces latency to the processor consuming the I/O data.
The SDCIAE (SDCI Allocation Enforcement) PQE feature allows system software to
control the portion of the L3 cache used for SDCI.
When enabled, SDCIAE forces all SDCI lines to be placed into the L3 cache
partitions identified by the highest-supported L3_MASK_n register, where n is
the maximum supported CLOSID.
Add CPUID feature bit that can be used to configure SDCIAE.
The SDCIAE feature details are documented in:
AMD64 Architecture Programmer's Manual Volume 2: System Programming
Publication # 24593 Revision 3.41 section 19.4.7 L3 Smart Data Cache
Injection Allocation Enforcement (SDCIAE).
available at https://bugzilla.kernel.org/show_bug.cgi?id=206537
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://patch.msgid.link/83ca10d981c48e86df2c3ad9658bb3ba3544c763.1762995456.git.babu.moger@amd.com
|
|
Rework the __bug_table helpers such that extension becomes easier.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251110115757.111187573@infradead.org
|
|
Bring in the UDB and objtool data annotations to avoid conflicts while further extending the bug exceptions.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
|
|
Starting with Zen6, AMD's Scalable MCA systems will incorporate two new bits in
MCA_STATUS and MCA_CONFIG MSRs. These bits will indicate if a valid System
Physical Address (SPA) is present in MCA_ADDR.
PhysAddrValidSupported bit (MCA_CONFIG[11]) serves as the architectural
indicator and states if PhysAddrV bit (MCA_STATUS[54]) is Reserved or if it
indicates validity of SPA in MCA_ADDR.
PhysAddrV bit (MCA_STATUS[54]) advertises if MCA_ADDR contains valid SPA or if
it is implementation specific.
Use and prefer MCA_STATUS[PhysAddrV] when checking for a usable address.
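A minimal sketch of the resulting check; only the bit positions come from
the text above, the macro and function names are assumptions:

	#define MCI_CONFIG_PHYSADDRV_SUP	BIT_ULL(11)	/* MCA_CONFIG[11] */
	#define MCI_STATUS_PHYSADDRV		BIT_ULL(54)	/* MCA_STATUS[54] */

	/* True if MCA_ADDR holds a valid System Physical Address */
	static bool mce_addr_is_spa(u64 config, u64 status)
	{
		return (config & MCI_CONFIG_PHYSADDRV_SUP) &&
		       (status & MCI_STATUS_PHYSADDRV);
	}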
Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://patch.msgid.link/20251118191731.181269-1-avadhut.naik@amd.com
|
|
The MCA threshold limit generally is not something that needs to change during
runtime. It is common for a system administrator to decide on a policy for
their managed systems.
If MCA thresholding is OS-managed, then the threshold limit must be set at
every boot. However, many systems allow the user to set a value in their BIOS.
And this is reported through an APEI HEST entry even if thresholding is not in
FW-First mode.
Use this value, if available, to set the OS-managed threshold limit. Users
can still override it through sysfs if desired for testing or debug.
APEI is parsed after MCE is initialized. So reset the thresholding blocks
later to pick up the threshold limit.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20251104-wip-mca-updates-v8-0-66c8eacf67b9@amd.com
|
|
Currently, the need to flush L1D for L1TF is tracked by two bits: one
per-CPU and one per-vCPU.
The per-vCPU bit is always set when the vCPU shows up on a core, so
there is no interesting state that's truly per-vCPU. Indeed, this is a
requirement, since L1D is a part of the physical CPU.
So simplify this by combining the two bits.
The vCPU bit was being written from preemption-enabled regions. To play
nice with those cases, wrap all calls from KVM and use a raw write so that
requesting a flush with preemption enabled doesn't trigger what would
effectively be DEBUG_PREEMPT false positives. Preemption doesn't need to
be disabled, as kvm_arch_vcpu_load() will mark the new CPU as needing a
flush if the vCPU task is migrated, or if userspace runs the vCPU on a
different task.
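A hedged sketch of the kind of KVM-side wrapper described above; the
per-CPU variable and the function name are assumptions, only the
raw-write-with-preemption-enabled idea comes from the text:

	/* May run with preemption enabled, hence the raw accessor */
	static inline void kvm_request_l1d_flush(void)
	{
		raw_cpu_write(kvm_cpu_l1tf_flush_l1d, true);
	}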
Signed-off-by: Brendan Jackman <jackmanb@google.com>
[sean: put raw write in KVM instead of in a hardirq.h variant]
Link: https://patch.msgid.link/20251113233746.1703361-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Disable support for flushing the L1 data cache to mitigate L1TF if CPU
mitigations are disabled for the entire kernel. KVM's mitigation of L1TF
is in no way special enough to justify ignoring CONFIG_CPU_MITIGATIONS=n.
Deliberately use CPU_MITIGATIONS instead of the more precise
MITIGATION_L1TF, as MITIGATION_L1TF only controls the default behavior,
i.e. CONFIG_MITIGATION_L1TF=n doesn't completely disable L1TF mitigations
in the kernel.
Keep the vmentry_l1d_flush module param to avoid breaking existing setups,
and leverage the .set path to alert the user to the fact that
vmentry_l1d_flush will be ignored. Don't bother validating the incoming
value; if an admin misconfigures vmentry_l1d_flush, the fact that the bad
configuration won't be detected when running with CONFIG_CPU_MITIGATIONS=n
is likely the least of their worries.
Reviewed-by: Brendan Jackman <jackmanb@google.com>
Link: https://patch.msgid.link/20251113233746.1703361-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Now that VMX encodes its own sequence for clearing CPU buffers, move
VM_CLEAR_CPU_BUFFERS into SVM to minimize the chances of KVM botching a
mitigation in the future, e.g. using VM_CLEAR_CPU_BUFFERS instead of
checking multiple mitigation flags.
No functional change intended.
Reviewed-by: Brendan Jackman <jackmanb@google.com>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://patch.msgid.link/20251113233746.1703361-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Convert the MMIO Stale Data mitigation tracking from a static branch into
an x86 feature flag so that it can be used via ALTERNATIVE_2 in KVM.
No functional change intended.
Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Reviewed-by: Brendan Jackman <jackmanb@google.com>
Link: https://patch.msgid.link/20251113233746.1703361-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Decouple the use of ALTERNATIVE from the encoding of VERW to clear CPU
buffers so that KVM can use ALTERNATIVE_2 to handle "always clear buffers"
and "clear if guest can access host MMIO" in a single statement.
No functional change intended.
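A hedged sketch of the single-statement form this decoupling enables; the
VERW operand and the second feature flag are assumed names, only the
ALTERNATIVE_2-with-two-VERW-cases idea comes from the text:

	/* Clear CPU buffers: always, or only if the guest can access host MMIO */
	asm volatile(ALTERNATIVE_2("",
				   "verw mds_verw_sel(%%rip)", X86_FEATURE_CLEAR_CPU_BUF,
				   "verw mds_verw_sel(%%rip)", X86_FEATURE_CLEAR_CPU_BUF_MMIO));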
Reviewed-by: Brendan Jackman <jackmanb@google.com>
Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Link: https://patch.msgid.link/20251113233746.1703361-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
For patching, the kernel initializes a temporary mm area in the lower
half of the address range. LASS blocks these accesses because its
enforcement relies on bit 63 of the virtual address as opposed to SMAP
which depends on the _PAGE_BIT_USER bit in the page table. Disable LASS
enforcement by toggling the RFLAGS.AC bit during patching to avoid
triggering a #GP fault.
Introduce LASS-specific STAC/CLAC helpers to set the AC bit only on
platforms that need it. Name the wrappers as lass_stac()/_clac() instead
of lass_disable()/_enable() because they only control the kernel data
access enforcement. The entire LASS mechanism (including instruction
fetch enforcement) is controlled by the CR4.LASS bit.
Describe the usage of the new helpers in comparison to the ones used for
SMAP. Also, add comments to explain when the existing stac()/clac()
should be used. While at it, move the duplicated "barrier" comment to
the same block.
The text poking functions use standard memcpy()/memset() while patching
kernel code. However, objtool complains about calling such dynamic
functions within an AC=1 region. See warning #9, regarding function
calls with UACCESS enabled, in tools/objtool/Documentation/objtool.txt.
To pacify objtool, one option is to add memcpy() and memset() to the
list of allowed-functions. However, that would provide a blanket
exemption for all usages of memcpy() and memset(). Instead, replace the
standard calls in the text poking functions with their unoptimized,
always-inlined versions. Considering that patching is usually small,
there is no performance impact expected.
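Putting the pieces together, a minimal sketch of the patching pattern
described above; the call site is illustrative, while lass_stac()/lass_clac()
and the inline helper are the names introduced by this and the next patch:

	lass_stac();				/* open an AC=1 window for LASS */
	__inline_memcpy(dst, src, len);		/* always-inlined, objtool-safe */
	lass_clac();				/* close the window again */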
Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://patch.msgid.link/20251118182911.2983253-5-sohil.mehta%40intel.com
|
|
Provide inline memcpy and memset functions that can be used instead of
the GCC builtins when necessary. The immediate use case is for the text
poking functions to avoid the standard memcpy()/memset() calls because
objtool complains about such dynamic calls within an AC=1 region. See
tools/objtool/Documentation/objtool.txt, warning #9, regarding function
calls with UACCESS enabled.
Some user copy functions such as copy_user_generic() and __clear_user()
have similar rep_{movs,stos} usages. But, those are highly specialized
and hard to combine or reuse for other things. Define these new helpers
for all other usages that need a completely unoptimized, strictly inline
version of memcpy() or memset().
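A hedged sketch of what such an always-inlined helper might look like on
x86, using a bare "rep movsb"; the name follows the text, the body is an
assumption:

	static __always_inline void *__inline_memcpy(void *to, const void *from, size_t len)
	{
		void *ret = to;

		asm volatile("rep movsb"
			     : "+D" (to), "+S" (from), "+c" (len)
			     : : "memory");
		return ret;
	}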
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://patch.msgid.link/20251118182911.2983253-4-sohil.mehta%40intel.com
|
|
Linear Address Space Separation (LASS) is a security feature that
mitigates a class of side-channel attacks relying on speculative access
across the user/kernel boundary.
Privilege mode based access protection already exists today with paging
and features such as SMEP and SMAP. However, to enforce these
protections, the processor must traverse the paging structures in
memory. An attacker can use timing information resulting from this
traversal to determine details about the paging structures, and to
determine the layout of the kernel memory.
LASS provides the same mode-based protections as paging but without
traversing the paging structures. Because the protections are enforced
prior to page-walks, an attacker will not be able to derive paging-based
timing information from the various caching structures such as the TLBs,
mid-level caches, page walker, data caches, etc.
LASS enforcement relies on the kernel implementation to divide the
64-bit virtual address space into two halves:
Addr[63]=0 -> User address space
Addr[63]=1 -> Kernel address space
Any data access or code execution across address spaces typically
results in a #GP fault, with an #SS generated in some rare cases. The
LASS enforcement for kernel data accesses is dependent on CR4.SMAP being
set. The enforcement can be disabled by toggling the RFLAGS.AC bit
similar to SMAP.
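In code terms, the split above reduces to a test of a single address bit; a
small illustrative helper (not from the patch):

	/* LASS's view of the address-space split: bit 63 selects the half */
	static inline bool lass_is_kernel_addr(unsigned long addr)
	{
		return addr & BIT_ULL(63);
	}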
Define the CPU feature bits to enumerate LASS. Also, disable the feature
at compile time on 32-bit kernels. Use a direct dependency on X86_32
(instead of !X86_64) to make it easier to combine with similar 32-bit
specific dependencies in the future.
LASS mitigates a class of side-channel speculative attacks, such as
Spectre LAM, described in the paper, "Leaky Address Masking: Exploiting
Unmasked Spectre Gadgets with Noncanonical Address Translation".
Add the "lass" flag to /proc/cpuinfo to indicate that the feature is
supported by hardware and enabled by the kernel. This allows userspace
to determine if the system is secure against such attacks.
Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Xin Li (Intel) <xin@zytor.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://patch.msgid.link/20251118182911.2983253-2-sohil.mehta%40intel.com
|
|
This reverts commit 7e44e4495a398eb553ce561f29f9148f40a3448f.
Commit 7e44e4495a39 ("x86: kvm: rate-limit global clock updates")
intends to use kvmclock_update_work to sync the NTP correction across
all vCPUs' kvmclocks, and is based on commit 0061d53daf26f
("KVM: x86: limit difference between kvmclock updates")
Since kvmclock has been switched to mono raw, this commit can be
reverted.
Signed-off-by: Lei Chen <lei.chen@smartx.com>
Link: https://patch.msgid.link/20250819152027.1687487-3-lei.chen@smartx.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
This reverts commit 332967a3eac06f6379283cf155c84fe7cd0537c2.
Commit 332967a3eac0 ("x86: kvm: introduce periodic global clock
updates") introduced a 300s interval work to sync ntp corrections
across all vcpus.
Since commit 53fafdbb8b21 ("KVM: x86: switch KVMCLOCK base to
monotonic raw clock"), kvmclock has used the monotonic raw clock, so
NTP corrections no longer need to be taken into consideration.
Signed-off-by: Lei Chen <lei.chen@smartx.com>
Link: https://patch.msgid.link/20250819152027.1687487-2-lei.chen@smartx.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Pull bpf fixes from Alexei Starovoitov:
- Fix interaction between livepatch and BPF fexit programs (Song Liu)
With Steven and Masami acks.
- Fix stack ORC unwind from BPF kprobe_multi (Jiri Olsa)
With Steven and Masami acks.
- Fix out of bounds access in widen_imprecise_scalars() in the verifier
(Eduard Zingerman)
- Fix conflicts between MPTCP and BPF sockmap (Jiayuan Chen)
- Fix net_sched storage collision with BPF data_meta/data_end (Eric
Dumazet)
- Add _impl suffix to BPF kfuncs with implicit args to avoid breaking
them in bpf-next when KF_IMPLICIT_ARGS is added (Mykyta Yatsenko)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: Test widen_imprecise_scalars() with different stack depth
bpf: account for current allocated stack depth in widen_imprecise_scalars()
bpf: Add bpf_prog_run_data_pointers()
selftests/bpf: Add mptcp test with sockmap
mptcp: Fix proto fallback detection with BPF
mptcp: Disallow MPTCP subflows from sockmap
selftests/bpf: Add stacktrace ips test for raw_tp
selftests/bpf: Add stacktrace ips test for kprobe_multi/kretprobe_multi
x86/fgraph,bpf: Fix stack ORC unwind from kprobe_multi return probe
Revert "perf/x86: Always store regs->ip in perf_callchain_kernel()"
bpf: add _impl suffix for bpf_stream_vprintk() kfunc
bpf:add _impl suffix for bpf_task_work_schedule* kfuncs
selftests/bpf: Add tests for livepatch + bpf trampoline
ftrace: bpf: Fix IPMODIFY + DIRECT in modify_ftrace_direct()
ftrace: Fix BPF fexit with livepatch
|
|
Use the exact enum name when documenting "enum sgx_attribute" to fix a
warning if the file is fed into kernel-doc processing:
WARNING: ./arch/x86/include/asm/sgx.h:139 expecting prototype for enum
sgx_attributes. Prototype was for enum sgx_attribute instead
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://patch.msgid.link/20251112160708.1343355-6-seanjc%40google.com
|
|
Drop an asterisk from a file-level copyright comment so that the comment
isn't interpreted as a kernel-doc comment.
E.g. if arch/x86/include/asm/sgx.h is fed into kernel-doc processing:
WARNING: ./arch/x86/include/asm/sgx.h:2 This comment starts with '/**',
but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://patch.msgid.link/20251112160708.1343355-5-seanjc%40google.com
|
|
Use '@' to document structure members and enum values in kernel-doc markup,
as per Documentation/doc-guide/kernel-doc.rst and flagged by make htmldocs.
WARNING: arch/x86/include/uapi/asm/sgx.h:17 Enum value 'SGX_PAGE_MEASURE'
not described in enum 'sgx_page_flags'
Opportunistically add a missing ':' for SGX_CHILD_PRESENT.
Closes: https://lore.kernel.org/all/20251106145506.145fc620@canb.auug.org.au
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://patch.msgid.link/20251112160708.1343355-4-seanjc%40google.com
|
|
When re-injecting a soft interrupt from an INT3, INTO, or (select) INTn
instruction, discard the exception and retry the instruction if the code
stream is changed (e.g. by a different vCPU) between when the CPU
executes the instruction and when KVM decodes the instruction to get the
next RIP.
As effectively predicted by commit 6ef88d6e36c2 ("KVM: SVM: Re-inject
INT3/INTO instead of retrying the instruction"), failure to verify that
the correct INTn instruction was decoded can effectively clobber guest
state due to decoding the wrong instruction and thus specifying the
wrong next RIP.
The bug most often manifests as "Oops: int3" panics on static branch
checks in Linux guests. Enabling or disabling a static branch in Linux
uses the kernel's "text poke" code patching mechanism. To modify code
while other CPUs may be executing that code, Linux (temporarily)
replaces the first byte of the original instruction with an int3 (opcode
0xcc), then patches in the new code stream except for the first byte,
and finally replaces the int3 with the first byte of the new code
stream. If a CPU hits the int3, i.e. executes the code while it's being
modified, then the guest kernel must look up the RIP to determine how to
handle the #BP, e.g. by emulating the new instruction. If the RIP is
incorrect, then this lookup fails and the guest kernel panics.
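Schematically, the text-poke sequence described above looks as follows;
this is a simplified sketch, not the actual kernel implementation (which
also serializes all CPUs between the steps):

	const u8 int3 = 0xcc;

	text_poke(addr, &int3, 1);		/* 1) plant the int3 byte */
	text_poke(addr + 1, insn + 1, len - 1);	/* 2) patch all but the first byte */
	text_poke(addr, insn, 1);		/* 3) replace int3 with the new first byte */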
The bug reproduces almost instantly by hacking the guest kernel to
repeatedly check a static branch[1] while running a drgn script[2] on
the host to constantly swap out the memory containing the guest's TSS.
[1]: https://gist.github.com/osandov/44d17c51c28c0ac998ea0334edf90b5a
[2]: https://gist.github.com/osandov/10e45e45afa29b11e0c7209247afc00b
Fixes: 6ef88d6e36c2 ("KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction")
Cc: stable@vger.kernel.org
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Link: https://patch.msgid.link/1cc6dcdf36e3add7ee7c8d90ad58414eeb6c3d34.1762278762.git.osandov@fb.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Extend KVM's export macro framework to provide EXPORT_SYMBOL_FOR_KVM(),
and use the helper macro to export symbols for KVM throughout x86 if and
only if KVM will build one or more modules, and only for those modules.
To avoid unnecessary exports when CONFIG_KVM=m but kvm.ko will not be
built (because no vendor modules are selected), let arch code #define
EXPORT_SYMBOL_FOR_KVM to suppress/override the exports.
Note, the set of symbols to restrict to KVM was generated by manual search
and audit; any "misses" are due to human error, not some grand plan.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Tested-by: Kai Huang <kai.huang@intel.com>
Link: https://patch.msgid.link/20251112173944.1380633-5-seanjc%40google.com
|
|
Use BIT_WORD() and BIT_MASK() macros from <linux/bits.h>
in <arch/x86/include/asm/percpu.h> instead of open-coding them.
No functional change intended.
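For reference, the two macros encapsulate the usual word-index/bit-mask
math; a small illustration (the surrounding variables are assumed):

	unsigned long *p    = bitmap + BIT_WORD(nr);	/* nr / BITS_PER_LONG */
	unsigned long  mask = BIT_MASK(nr);		/* 1UL << (nr % BITS_PER_LONG) */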
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://patch.msgid.link/20250907184915.78041-1-ubizjak@gmail.com
|