| Age | Commit message | Author |
|
Merge tag 'soc-drivers-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull SoC driver updates from Arnd Bergmann:
"This is the first half of the driver changes:
- A treewide interface change to the "syscore" operations for power
management, as a preparation for future Tegra specific changes
- Reset controller updates with added drivers for LAN969x, eic7700 and
RZ/G3S SoCs
- Protection of system controller registers on Renesas and Google
SoCs, to prevent trivially triggering a system crash from e.g.
debugfs access
- soc_device identification updates on Nvidia, Exynos and Mediatek
- debugfs support in the ST STM32 firewall driver
- Minor updates for SoC drivers on AMD/Xilinx, Renesas, Allwinner, TI
- Cleanups for memory controller support on Nvidia and Renesas"
* tag 'soc-drivers-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (114 commits)
memory: tegra186-emc: Fix missing put_bpmp
Documentation: reset: Remove reset_controller_add_lookup()
reset: fix BIT macro reference
reset: rzg2l-usbphy-ctrl: Fix a NULL vs IS_ERR() bug in probe
reset: th1520: Support reset controllers in more subsystems
reset: th1520: Prepare for supporting multiple controllers
dt-bindings: reset: thead,th1520-reset: Add controllers for more subsys
dt-bindings: reset: thead,th1520-reset: Remove non-VO-subsystem resets
reset: remove legacy reset lookup code
clk: davinci: psc: drop unused reset lookup
reset: rzg2l-usbphy-ctrl: Add support for RZ/G3S SoC
reset: rzg2l-usbphy-ctrl: Add support for USB PWRRDY
dt-bindings: reset: renesas,rzg2l-usbphy-ctrl: Document RZ/G3S support
reset: eswin: Add eic7700 reset driver
dt-bindings: reset: eswin: Documentation for eic7700 SoC
reset: sparx5: add LAN969x support
dt-bindings: reset: microchip: Add LAN969x support
soc: rockchip: grf: Add select correct PWM implementation on RK3368
soc/tegra: pmc: Add USB wake events for Tegra234
amba: tegra-ahb: Fix device leak on SMMU enable
...
|
|
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"ARM:
- Support for userspace handling of synchronous external aborts
(SEAs), allowing the VMM to potentially handle the abort in a
non-fatal manner
- Large rework of the VGIC's list register handling with the goal of
supporting more active/pending IRQs than available list registers
in hardware. In addition, the VGIC now supports EOImode==1 style
deactivations for IRQs which may occur on a separate vCPU than the
one that acked the IRQ
- Support for FEAT_XNX (user / privileged execute permissions) and
FEAT_HAF (hardware update to the Access Flag) in the software page
table walkers and shadow MMU
- Allow page table destruction to reschedule, fixing long
need_resched latencies observed when destroying a large VM
- Minor fixes to KVM and selftests
Loongarch:
- Get VM PMU capability from HW GCFG register
- Add AVEC basic support
- Use 64-bit register definition for EIOINTC
- Add KVM timer test cases for tools/selftests
RISC-V:
- SBI message passing (MPXY) support for KVM guest
- Give a new, more specific error subcode for the case when in-kernel
AIA virtualization fails to allocate IMSIC VS-file
- Support KVM_DIRTY_LOG_INITIALLY_SET, enabling dirty log gradually
in small chunks
- Fix guest page fault within HLV* instructions
- Flush VS-stage TLB after VCPU migration for Andes cores
s390:
- Always allocate ESCA (Extended System Control Area), instead of
starting with the basic SCA and converting to ESCA with the
addition of the 65th vCPU. The price is an increased number of exits
(and worse performance) on z10 and earlier processors; ESCA was
introduced by z114/z196 in 2010
- VIRT_XFER_TO_GUEST_WORK support
- Operation exception forwarding support
- Cleanups
x86:
- Skip the costly "zap all SPTEs" on an MMIO generation wrap if MMIO
SPTE caching is disabled, as there can't be any relevant SPTEs to
zap
- Relocate a misplaced export
- Fix an async #PF bug where KVM would clear the completion queue
when the guest transitioned in and out of paging mode, e.g. when
handling an SMI and then returning to paged mode via RSM
- Leave KVM's user-return notifier registered even when disabling
virtualization, as long as kvm.ko is loaded. On reboot/shutdown,
keeping the notifier registered is ok; the kernel does not use the
MSRs and the callback will run cleanly and restore host MSRs if the
CPU manages to return to userspace before the system goes down
- Use the checked version of {get,put}_user()
- Fix a long-lurking bug where KVM's lack of catch-up logic for
periodic APIC timers can result in a hard lockup in the host
- Revert the periodic kvmclock sync logic now that KVM doesn't use a
clocksource that's subject to NTP corrections
- Clean up KVM's handling of MMIO Stale Data and L1TF, and bury the
latter behind CONFIG_CPU_MITIGATIONS
- Context switch XCR0, XSS, and PKRU outside of the entry/exit fast
path; the only reason they were handled in the fast path was to
paper over a bug in the core #MC code, and that has long since been
fixed
- Add emulator support for AVX MOV instructions, to play nice with
emulated devices whose guest drivers like to access PCI BARs with
large multi-byte instructions
x86 (AMD):
- Fix a few missing "VMCB dirty" bugs
- Fix the worst of KVM's lack of EFER.LMSLE emulation
- Add AVIC support for addressing 4k vCPUs in x2AVIC mode
- Fix incorrect handling of selective CR0 writes when checking
intercepts during emulation of L2 instructions
- Fix a currently-benign bug where KVM would clobber SPEC_CTRL[63:32]
on VMRUN and #VMEXIT
- Fix a bug where KVM would corrupt the guest code stream when re-injecting
a soft interrupt if the guest patched the underlying code after the
VM-Exit, e.g. when Linux patches code with a temporary INT3
- Add KVM_X86_SNP_POLICY_BITS to advertise supported SNP policy bits
to userspace, and extend KVM "support" to all policy bits that
don't require any actual support from KVM
x86 (Intel):
- Use the root role from kvm_mmu_page to construct EPTPs instead of
the current vCPU state, partly as worthwhile cleanup, but mostly to
pave the way for tracking per-root TLB flushes, and elide EPT
flushes on pCPU migration if the root is clean from a previous
flush
- Add a few missing nested consistency checks
- Rip out support for doing "early" consistency checks via hardware
as the functionality hasn't been used in years and is no longer
useful in general; replace it with an off-by-default module param
to WARN if hardware fails a check that KVM does not perform
- Fix a currently-benign bug where KVM would drop the guest's
SPEC_CTRL[63:32] on VM-Enter
- Misc cleanups
- Overhaul the TDX code to address systemic races where KVM (acting
on behalf of userspace) could inadvertently trigger lock contention
in the TDX-Module; KVM was either working around these in weird,
ugly ways, or was simply oblivious to them (though even Yan's
devilish selftests could only break individual VMs, not the host
kernel)
- Fix a bug where KVM could corrupt a vCPU's cpu_list when freeing a
TDX vCPU, if creating said vCPU failed partway through
- Fix a few sparse warnings (bad annotation, 0 != NULL)
- Use struct_size() to simplify copying TDX capabilities to userspace
- Fix a bug where TDX would effectively corrupt user-return MSR
values if the TDX Module rejects VP.ENTER and thus doesn't clobber
host MSRs as expected
Selftests:
- Fix a math goof in mmu_stress_test when running on a single-CPU
system/VM
- Forcefully override ARCH from x86_64 to x86 to play nice with
specifying ARCH=x86_64 on the command line
- Extend a bunch of nested VMX tests to validate nested SVM as well
- Add support for LA57 in the core VM_MODE_xxx macro, and add a test
to verify KVM can save/restore nested VMX state when L1 is using
5-level paging, but L2 is not
- Clean up the guest paging code in anticipation of sharing the core
logic for nested EPT and nested NPT
guest_memfd:
- Add NUMA mempolicy support for guest_memfd, and clean up a variety
of rough edges in guest_memfd along the way
- Define a CLASS to automatically handle get+put when grabbing a
guest_memfd from a memslot to make it harder to leak references
- Enhance KVM selftests to make it easier to develop and debug
selftests like those added for guest_memfd NUMA support, e.g. where
test and/or KVM bugs often result in hard-to-debug SIGBUS errors
- Misc cleanups
Generic:
- Use the recently-added WQ_PERCPU when creating the per-CPU
workqueue for irqfd cleanup
- Fix a goof in the dirty ring documentation
- Fix choice of target for directed yield across different calls to
kvm_vcpu_on_spin(); the function was always starting from the first
vCPU instead of continuing the round-robin search"
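(For reference, a schematic of the fixed directed-yield selection: resume
the scan after the vCPU that was boosted last time instead of always
restarting from vCPU 0. The index bookkeeping and the try_yield_to()
helper below are invented for illustration, not KVM's actual code.)

    int start = last_boosted_vcpu + 1;

    for (int i = 0; i < nr_vcpus; i++) {
            int idx = (start + i) % nr_vcpus;
            if (try_yield_to(vcpus[idx])) {      /* illustrative helper */
                    last_boosted_vcpu = idx;     /* remember for next call */
                    break;
            }
    }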
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (260 commits)
KVM: arm64: at: Update AF on software walk only if VM has FEAT_HAFDBS
KVM: arm64: at: Use correct HA bit in TCR_EL2 when regime is EL2
KVM: arm64: Document KVM_PGTABLE_PROT_{UX,PX}
KVM: arm64: Fix spelling mistake "Unexpeced" -> "Unexpected"
KVM: arm64: Add break to default case in kvm_pgtable_stage2_pte_prot()
KVM: arm64: Add endian casting to kvm_swap_s[12]_desc()
KVM: arm64: Fix compilation when CONFIG_ARM64_USE_LSE_ATOMICS=n
KVM: arm64: selftests: Add test for AT emulation
KVM: arm64: nv: Expose hardware access flag management to NV guests
KVM: arm64: nv: Implement HW access flag management in stage-2 SW PTW
KVM: arm64: Implement HW access flag management in stage-1 SW PTW
KVM: arm64: Propagate PTW errors up to AT emulation
KVM: arm64: Add helper for swapping guest descriptor
KVM: arm64: nv: Use pgtable definitions in stage-2 walk
KVM: arm64: Handle endianness in read helper for emulated PTW
KVM: arm64: nv: Stop passing vCPU through void ptr in S2 PTW
KVM: arm64: Call helper for reading descriptors directly
KVM: arm64: nv: Advertise support for FEAT_XNX
KVM: arm64: Teach ptdump about FEAT_XNX permissions
KVM: s390: Use generic VIRT_XFER_TO_GUEST_WORK functions
...
|
|
Merge tag 'powerpc-6.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
- Restore clearing of MSR[RI] at interrupt/syscall exit on 32-bit
- Fix unpaired stwcx on interrupt exit on 32-bit
- Fix race condition leading to double list-add in
mac_hid_toggle_emumouse()
- Fix mprotect on book3s 32-bit
- Fix SLB multihit issue during SLB preload with 64-bit hash MMU
- Add support for crashkernel CMA reservation
- Add die_id and die_cpumask for Power10 & later to expose chip
hemispheres
- A series of minor fixes and improvements to the hash SLB code
Thanks to Antonio Alvarez Feijoo, Ben Collins, Bhaskar Chowdhury,
Christophe Leroy, Daniel Thompson, Dave Vasilevsky, Donet Tom,
J. Neuschäfer, Kunwu Chan, Long Li, Naresh Kamboju, Nathan Chancellor,
Ritesh Harjani (IBM), Shirisha G, Shrikanth Hegde, Sourabh Jain, Srikar
Dronamraju, Stephen Rothwell, Thomas Zimmermann, Venkat Rao Bagalkote,
and Vishal Chourasia.
* tag 'powerpc-6.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (32 commits)
macintosh/via-pmu-backlight: Include <linux/fb.h> and <linux/of.h>
powerpc/powermac: backlight: Include <linux/of.h>
powerpc/64s/slb: Add no_slb_preload early cmdline param
powerpc/64s/slb: Make preload_add return type as void
powerpc/ptdump: Dump PXX level info for kernel_page_tables
powerpc/64s/pgtable: Enable directMap counters in meminfo for Hash
powerpc/64s/hash: Update directMap page counters for Hash
powerpc/64s/hash: Hash hpt_order should be only available with Hash MMU
powerpc/64s/hash: Improve hash mmu printk messages
powerpc/64s/hash: Fix phys_addr_t printf format in htab_initialize()
powerpc/64s/ptdump: Fix kernel_hash_pagetable dump for ISA v3.00 HPTE format
powerpc/64s/hash: Restrict stress_hpt_struct memblock region to within RMA limit
powerpc/64s/slb: Fix SLB multihit issue during SLB preload
powerpc, mm: Fix mprotect on book3s 32-bit
powerpc/smp: Expose die_id and die_cpumask
powerpc/83xx: Add a null pointer check to mcu_gpiochip_add
arch:powerpc:tools This file was missing shebang line, so added it
kexec: Include kernel-end even without crashkernel
powerpc: p2020: Rename wdt@ nodes to watchdog@
powerpc: 86xx: Rename wdt@ nodes to watchdog@
...
|
|
Merge tag 'pull-persistency' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull persistent dentry infrastructure and conversion from Al Viro:
"Some filesystems use a kinda-sorta controlled dentry refcount leak to
pin dentries of created objects in dcache (and undo it when removing
those). A reference is grabbed and not released, but it's not actually
_stored_ anywhere.
That works, but it's hard to follow and verify; among other things, we
have no way to tell _which_ of the increments is intended to be an
unpaired one. Worse, on removal we need to decide whether the
reference had already been dropped, which can be non-trivial if that
removal is on umount and we need to figure out if this dentry is
pinned due to e.g. unlink() not done. Usually that is handled by using
kill_litter_super() as ->kill_sb(), but there are open-coded special
cases of the same (consider e.g. /proc/self).
Things get simpler if we introduce a new dentry flag
(DCACHE_PERSISTENT) marking those "leaked" dentries. Having it set
claims responsibility for +1 in refcount.
The end result this series is aiming for:
- get these unbalanced dget() and dput() replaced with new primitives
that would, in addition to adjusting refcount, set and clear
persistency flag.
- instead of having kill_litter_super() mess with removing the
remaining "leaked" references (e.g. for all tmpfs files that hadn't
been removed prior to umount), have the regular
shrink_dcache_for_umount() strip DCACHE_PERSISTENT of all dentries,
dropping the corresponding reference if it had been set. After that
kill_litter_super() becomes an equivalent of kill_anon_super().
Doing that in a single step is not feasible - it would affect too many
places in too many filesystems. It has to be split into a series.
This work has really started early in 2024; quite a few preliminary
pieces have already gone into mainline. This chunk is finally getting
to the meat of that stuff - infrastructure and most of the conversions
to it.
Some pieces are still sitting in the local branches, but the bulk of
that stuff is here"
* tag 'pull-persistency' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits)
d_make_discardable(): warn if given a non-persistent dentry
kill securityfs_recursive_remove()
convert securityfs
get rid of kill_litter_super()
convert rust_binderfs
convert nfsctl
convert rpc_pipefs
convert hypfs
hypfs: swich hypfs_create_u64() to returning int
hypfs: switch hypfs_create_str() to returning int
hypfs: don't pin dentries twice
convert gadgetfs
gadgetfs: switch to simple_remove_by_name()
convert functionfs
functionfs: switch to simple_remove_by_name()
functionfs: fix the open/removal races
functionfs: need to cancel ->reset_work in ->kill_sb()
functionfs: don't bother with ffs->ref in ffs_data_{opened,closed}()
functionfs: don't abuse ffs_data_closed() on fs shutdown
convert selinuxfs
...
|
|
Merge tag 'mm-stable-2025-12-03-21-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"__vmalloc()/kvmalloc() and no-block support" (Uladzislau Rezki)
Rework the vmalloc() code to support non-blocking allocations
(GFP_ATOMIC, GFP_NOWAIT)
"ksm: fix exec/fork inheritance" (xu xin)
Fix a rare case where the KSM MMF_VM_MERGE_ANY prctl state is not
inherited across fork/exec
"mm/zswap: misc cleanup of code and documentations" (SeongJae Park)
Some light maintenance work on the zswap code
"mm/page_owner: add debugfs files 'show_handles' and 'show_stacks_handles'" (Mauricio Faria de Oliveira)
Enhance the /sys/kernel/debug/page_owner debug feature by adding
unique identifiers to differentiate the various stack traces so
that userspace monitoring tools can better match stack traces over
time
"mm/page_alloc: pcp->batch cleanups" (Joshua Hahn)
Minor alterations to the page allocator's per-cpu-pages feature
"Improve UFFDIO_MOVE scalability by removing anon_vma lock" (Lokesh Gidra)
Address a scalability issue in userfaultfd's UFFDIO_MOVE operation
"kasan: cleanups for kasan_enabled() checks" (Sabyrzhan Tasbolatov)
"drivers/base/node: fold node register and unregister functions" (Donet Tom)
Clean up the NUMA node handling code a little
"mm: some optimizations for prot numa" (Kefeng Wang)
Cleanups and small optimizations to the NUMA allocation hinting
code
"mm/page_alloc: Batch callers of free_pcppages_bulk" (Joshua Hahn)
Address long lock hold times at boot on large machines. These were
causing (harmless) softlockup warnings
"optimize the logic for handling dirty file folios during reclaim" (Baolin Wang)
Remove some now-unnecessary work from page reclaim
"mm/damon: allow DAMOS auto-tuned for per-memcg per-node memory usage" (SeongJae Park)
Enhance the DAMOS auto-tuning feature
"mm/damon: fixes for address alignment issues in DAMON_LRU_SORT and DAMON_RECLAIM" (Quanmin Yan)
Fix DAMON_LRU_SORT and DAMON_RECLAIM with certain userspace
configuration
"expand mmap_prepare functionality, port more users" (Lorenzo Stoakes)
Enhance the new(ish) file_operations.mmap_prepare() method and port
additional callsites from the old ->mmap() over to ->mmap_prepare()
"Fix stale IOTLB entries for kernel address space" (Lu Baolu)
Fix a bug (and possible security issue on non-x86) in the IOMMU
code. In some situations the IOMMU could be left hanging onto a
stale kernel pagetable entry
"mm/huge_memory: cleanup __split_unmapped_folio()" (Wei Yang)
Clean up and optimize the folio splitting code
"mm, swap: misc cleanup and bugfix" (Kairui Song)
Some cleanups and a minor fix in the swap discard code
"mm/damon: misc documentation fixups" (SeongJae Park)
"mm/damon: support pin-point targets removal" (SeongJae Park)
Permit userspace to remove a specific monitoring target in the
middle of the current targets list
"mm: MISC follow-up patches for linux/pgalloc.h" (Harry Yoo)
A couple of cleanups related to mm header file inclusion
"mm/swapfile.c: select swap devices of default priority round robin" (Baoquan He)
Improve the selection of swap devices for NUMA machines
"mm: Convert memory block states (MEM_*) macros to enums" (Israel Batista)
Change the memory block labels from macros to enums so they will
appear in kernel debug info
"ksm: perform a range-walk to jump over holes in break_ksm" (Pedro Demarchi Gomes)
Address an inefficiency when KSM unmerges an address range
"mm/damon/tests: fix memory bugs in kunit tests" (SeongJae Park)
Fix leaks and unhandled malloc() failures in DAMON userspace unit
tests
"some cleanups for pageout()" (Baolin Wang)
Clean up a couple of minor things in the page scanner's
writeback-for-eviction code
"mm/hugetlb: refactor sysfs/sysctl interfaces" (Hui Zhu)
Move hugetlb's sysfs/sysctl handling code into a new file
"introduce VM_MAYBE_GUARD and make it sticky" (Lorenzo Stoakes)
Make the VMA guard regions available in /proc/pid/smaps and
improve the mergeability of guarded VMAs
"mm: perform guard region install/remove under VMA lock" (Lorenzo Stoakes)
Reduce mmap lock contention for callers performing VMA guard region
operations
"vma_start_write_killable" (Matthew Wilcox)
Start work on permitting applications to be killed when they are
waiting on a read_lock on the VMA lock
"mm/damon/tests: add more tests for online parameters commit" (SeongJae Park)
Add additional userspace testing of DAMON's "commit" feature
"mm/damon: misc cleanups" (SeongJae Park)
"make VM_SOFTDIRTY a sticky VMA flag" (Lorenzo Stoakes)
Address the possible loss of a VMA's VM_SOFTDIRTY flag when that
VMA is merged with another
"mm: support device-private THP" (Balbir Singh)
Introduce support for Transparent Huge Page (THP) migration in zone
device-private memory
"Optimize folio split in memory failure" (Zi Yan)
"mm/huge_memory: Define split_type and consolidate split support checks" (Wei Yang)
Some more cleanups in the folio splitting code
"mm: remove is_swap_[pte, pmd]() + non-swap entries, introduce leaf entries" (Lorenzo Stoakes)
Clean up our handling of pagetable leaf entries by introducing the
concept of 'software leaf entries', of type softleaf_t
"reparent the THP split queue" (Muchun Song)
Reparent the THP split queue to its parent memcg. This is in
preparation for addressing the long-standing "dying memcg" problem,
wherein dead memcgs linger for too long, consuming memory
resources
"unify PMD scan results and remove redundant cleanup" (Wei Yang)
A little cleanup in the hugepage collapse code
"zram: introduce writeback bio batching" (Sergey Senozhatsky)
Improve zram writeback efficiency by introducing batched bio
writeback support
"memcg: cleanup the memcg stats interfaces" (Shakeel Butt)
Clean up our handling of the interrupt safety of some memcg stats
"make vmalloc gfp flags usage more apparent" (Vishal Moola)
Clean up vmalloc's handling of incoming GFP flags
"mm: Add soft-dirty and uffd-wp support for RISC-V" (Chunyan Zhang)
Teach soft dirty and userfaultfd write protect tracking to use
RISC-V's Svrsw60t59b extension
"mm: swap: small fixes and comment cleanups" (Youngjun Park)
Fix a small bug and clean up some of the swap code
"initial work on making VMA flags a bitmap" (Lorenzo Stoakes)
Start work on converting the vma struct's flags to a bitmap, so we
stop running out of them, especially on 32-bit
"mm/swapfile: fix and cleanup swap list iterations" (Youngjun Park)
Address a possible bug in the swap discard code and clean things
up a little
[ This merge also reverts commit ebb9aeb980e5 ("vfio/nvgrace-gpu:
register device memory for poison handling") because it looks
broken to me, I've asked for clarification - Linus ]
* tag 'mm-stable-2025-12-03-21-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits)
mm: fix vma_start_write_killable() signal handling
mm/swapfile: use plist_for_each_entry in __folio_throttle_swaprate
mm/swapfile: fix list iteration when next node is removed during discard
fs/proc/task_mmu.c: fix make_uffd_wp_huge_pte() huge pte handling
mm/kfence: add reboot notifier to disable KFENCE on shutdown
memcg: remove inc/dec_lruvec_kmem_state helpers
selftests/mm/uffd: initialize char variable to Null
mm: fix DEBUG_RODATA_TEST indentation in Kconfig
mm: introduce VMA flags bitmap type
tools/testing/vma: eliminate dependency on vma->__vm_flags
mm: simplify and rename mm flags function for clarity
mm: declare VMA flags by bit
zram: fix a spelling mistake
mm/page_alloc: optimize lowmem_reserve max lookup using its semantic monotonicity
mm/vmscan: skip increasing kswapd_failures when reclaim was boosted
pagemap: update BUDDY flag documentation
mm: swap: remove scan_swap_map_slots() references from comments
mm: swap: change swap_alloc_slow() to void
mm, swap: remove redundant comment for read_swap_cache_async
mm, swap: use SWP_SOLIDSTATE to determine if swap is rotational
...
|
|
Merge tag 'iommu-updates-v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux
Pull iommu updates from Joerg Roedel:
- Introduction of the generic IO page-table framework with support for
Intel and AMD IOMMU formats from Jason.
This has good potential for unifying more IO page-table
implementations and making future enhancements easier. But this
also needed quite a few fixes during development. All known issues
have been fixed, but my feeling is that there is a higher potential
than usual that more might be needed.
- Intel VT-d updates:
- Use the right invalidation hint in qi_desc_iotlb()
- Reduce the scope of INTEL_IOMMU_FLOPPY_WA
- ARM-SMMU updates:
- Qualcomm device-tree binding updates for Kaanapali and Glymur SoCs
and a new clock for the TBU.
- Fix error handling if level 1 CD table allocation fails.
- Permit more than the architectural maximum number of SMRs for
funky Qualcomm mis-implementations of SMMUv2.
- Mediatek driver:
- MT8189 iommu support
- Move ARM IO-pgtable selftests to kunit
- Device leak fixes for a couple of drivers
- Random smaller fixes and improvements
* tag 'iommu-updates-v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux: (81 commits)
iommupt/vtd: Support mgaw's less than a 4 level walk for first stage
iommupt/vtd: Allow VT-d to have a larger table top than the vasz requires
powerpc/pseries/svm: Make mem_encrypt.h self contained
genpt: Make GENERIC_PT invisible
iommupt: Avoid a compiler bug with sw_bit
iommu/arm-smmu-qcom: Enable use of all SMR groups when running bare-metal
iommupt: Fix unlikely flows in increase_top()
iommu/amd: Propagate the error code returned by __modify_irte_ga() in modify_irte_ga()
MAINTAINERS: Update my email address
iommu/arm-smmu-v3: Fix error check in arm_smmu_alloc_cd_tables
dt-bindings: iommu: qcom_iommu: Allow 'tbu' clock
iommu/vt-d: Restore previous domain::aperture_end calculation
iommu/vt-d: Fix unused invalidation hint in qi_desc_iotlb
iommu/vt-d: Set INTEL_IOMMU_FLOPPY_WA depend on BLK_DEV_FD
iommu/tegra: fix device leak on probe_device()
iommu/sun50i: fix device leak on of_xlate()
iommu/omap: simplify probe_device() error handling
iommu/omap: fix device leaks on probe_device()
iommu/mediatek-v1: add missing larb count sanity check
iommu/mediatek-v1: fix device leaks on probe()
...
|
|
Merge tag 'bpf-next-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Pull bpf updates from Alexei Starovoitov:
- Convert selftests/bpf/test_tc_edt and test_tc_tunnel from .sh to
test_progs runner (Alexis Lothoré)
- Convert selftests/bpf/test_xsk to test_progs runner (Bastien
Curutchet)
- Replace bpf memory allocator with kmalloc_nolock() in
bpf_local_storage (Amery Hung), and in bpf streams and range tree
(Puranjay Mohan)
- Introduce support for indirect jumps in BPF verifier and x86 JIT
(Anton Protopopov) and arm64 JIT (Puranjay Mohan)
- Remove runqslower bpf tool (Hoyeon Lee)
- Fix corner cases in the verifier to close several syzbot reports
(Eduard Zingerman, KaFai Wan)
- Several improvements in deadlock detection in rqspinlock (Kumar
Kartikeya Dwivedi)
- Implement "jmp" mode for BPF trampoline and corresponding
DYNAMIC_FTRACE_WITH_JMP. It improves "fexit" program type performance
from 80 M/s to 136 M/s. With Steven's Ack. (Menglong Dong)
- Add ability to test non-linear skbs in BPF_PROG_TEST_RUN (Paul
Chaignon)
- Do not let BPF_PROG_TEST_RUN emit invalid GSO types to stack (Daniel
Borkmann)
- Generalize buildid reader into bpf_dynptr (Mykyta Yatsenko)
- Optimize bpf_map_update_elem() for map-in-map types (Ritesh
Oedayrajsingh Varma)
- Introduce overwrite mode for BPF ring buffer (Xu Kuohai)
* tag 'bpf-next-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (169 commits)
bpf: optimize bpf_map_update_elem() for map-in-map types
bpf: make kprobe_multi_link_prog_run always_inline
selftests/bpf: do not hardcode target rate in test_tc_edt BPF program
selftests/bpf: remove test_tc_edt.sh
selftests/bpf: integrate test_tc_edt into test_progs
selftests/bpf: rename test_tc_edt.bpf.c section to expose program type
selftests/bpf: Add success stats to rqspinlock stress test
rqspinlock: Precede non-head waiter queueing with AA check
rqspinlock: Disable spinning for trylock fallback
rqspinlock: Use trylock fallback when per-CPU rqnode is busy
rqspinlock: Perform AA checks immediately
rqspinlock: Enclose lock/unlock within lock entry acquisitions
bpf: Remove runqslower tool
selftests/bpf: Remove usage of lsm/file_alloc_security in selftest
bpf: Disable file_alloc_security hook
bpf: check for insn arrays in check_ptr_alignment
bpf: force BPF_F_RDONLY_PROG on insn array creation
bpf: Fix exclusive map memory leak
selftests/bpf: Make CS length configurable for rqspinlock stress test
selftests/bpf: Add lock wait time stats to rqspinlock stress test
...
|
|
Merge tag 'core-uaccess-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scoped user access updates from Thomas Gleixner:
"Scoped user mode access and related changes:
- Implement the missing u64 user access function on ARM when
CONFIG_CPU_SPECTRE=n.
This makes it possible to access a 64bit value in generic code with
[unsafe_]get_user(). All other architectures and ARM variants
provide the relevant accessors already.
- Ensure that ASM GOTO jump label usage in the user mode access
helpers always goes through a local C scope label indirection
inside the helpers.
This is required because compilers do not support an ASM GOTO
target leaving an auto cleanup scope: GCC silently fails to emit
the cleanup invocation and CLANG fails the build.
[ Editor's note: gcc-16 will have fixed the code generation issue
in commit f68fe3ddda4 ("eh: Invoke cleanups/destructors in asm
goto jumps [PR122835]"). But we obviously have to deal with clang
and older versions of gcc, so.. - Linus ]
This provides generic wrapper macros and the conversion of affected
architecture code to use them.
- Scoped user mode access with auto cleanup
Access to user mode memory can be required in hot code paths, but
if it has to be done with user controlled pointers, the access is
shielded with a speculation barrier, so that the CPU cannot
speculate around the address range check. Those speculation
barriers impact performance quite significantly.
This cost can be avoided by "masking" the provided pointer so it is
guaranteed to be in the valid user memory access range and
otherwise to point to a guaranteed unpopulated address space. This
has to be done without branches so it creates an address dependency
for the access, which the CPU cannot speculate ahead.
This results in repeating and error-prone programming patterns:

    if (can_do_masked_user_access())
            from = masked_user_read_access_begin((from));
    else if (!user_read_access_begin(from, sizeof(*from)))
            return -EFAULT;
    unsafe_get_user(val, from, Efault);
    user_read_access_end();
    return 0;
Efault:
    user_read_access_end();
    return -EFAULT;

which can be replaced with scopes and automatic cleanup:

    scoped_user_read_access(from, Efault)
            unsafe_get_user(val, from, Efault);
    return 0;
Efault:
    return -EFAULT;
- Convert code which implements the above pattern over to
scoped_user_*_access(). This also corrects a couple of imbalanced
masked_*_begin() instances which are harmless on most
architectures, but prevent PowerPC from implementing the masking
optimization.
- Add a missing speculation barrier in copy_from_user_iter()"
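(Put together, a complete helper using the scoped form might look like
this; a minimal sketch following the pattern quoted above, with the
wrapper function name made up.)

    static int read_user_u64(u64 __user *from, u64 *val)
    {
            scoped_user_read_access(from, Efault)
                    unsafe_get_user(*val, from, Efault);
            return 0;
    Efault:
            return -EFAULT;
    }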
* tag 'core-uaccess-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
lib/strn*,uaccess: Use masked_user_{read/write}_access_begin when required
scm: Convert put_cmsg() to scoped user access
iov_iter: Add missing speculation barrier to copy_from_user_iter()
iov_iter: Convert copy_from_user_iter() to masked user access
select: Convert to scoped user access
x86/futex: Convert to scoped user access
futex: Convert to get/put_user_inline()
uaccess: Provide put/get_user_inline()
uaccess: Provide scoped user access regions
arm64: uaccess: Use unsafe wrappers for ASM GOTO
s390/uaccess: Use unsafe wrappers for ASM GOTO
riscv/uaccess: Use unsafe wrappers for ASM GOTO
powerpc/uaccess: Use unsafe wrappers for ASM GOTO
x86/uaccess: Use unsafe wrappers for ASM GOTO
uaccess: Provide ASM GOTO safe wrappers for unsafe_*_user()
ARM: uaccess: Implement missing __get_user_asm_dword()
|
|
Include <linux/of.h> instead of depending on the backlight header to
include it. It declares of_find_node_by_name(), of_property_match_string()
and of_node_put().
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Fixes: 243ce64b2b37 ("backlight: Do not include <linux/fb.h> in header file")
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Reviewed-by: Daniel Thompson (RISCstar) <danielt@kernel.org>
Closes: https://lore.kernel.org/linuxppc-dev/CA+G9fYs8fn5URQx2+s2oNxdUgZkSrdLC0P1tNBW_n-6BaBkK2Q@mail.gmail.com/
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
Merge tag 'core-bugs-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull bug handling infrastructure updates from Ingo Molnar:
"Core updates:
- Improve WARN(), which has vararg printf-like arguments, to work
with the x86 #UD based WARN-optimizing infrastructure by hiding the
format string in the bug_table and replacing the first argument with
the address of the bug-table entry, while making the actual function
that's called a UD1 instruction (Peter Zijlstra)
- Introduce the CONFIG_DEBUG_BUGVERBOSE_DETAILED Kconfig switch (Ingo
Molnar, s390 support by Heiko Carstens)
Fixes and cleanups:
- bugs/s390: Remove private WARN_ON() implementation (Heiko Carstens)
- <asm/bugs.h>: Make i386 use GENERIC_BUG_RELATIVE_POINTERS (Peter
Zijlstra)"
* tag 'core-bugs-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (31 commits)
x86/bugs: Make i386 use GENERIC_BUG_RELATIVE_POINTERS
x86/bug: Fix BUG_FORMAT vs KASLR
x86_64/bug: Inline the UD1
x86/bug: Implement WARN_ONCE()
x86_64/bug: Implement __WARN_printf()
x86/bug: Use BUG_FORMAT for DEBUG_BUGVERBOSE_DETAILED
x86/bug: Add BUG_FORMAT basics
bug: Allow architectures to provide __WARN_printf()
bug: Implement WARN_ON() using __WARN_FLAGS()
bug: Add report_bug_entry()
bug: Add BUG_FORMAT_ARGS infrastructure
bug: Clean up CONFIG_GENERIC_BUG_RELATIVE_POINTERS
bug: Add BUG_FORMAT infrastructure
x86: Rework __bug_table helpers
bugs/s390: Remove private WARN_ON() implementation
bugs/core: Reorganize fields in the first line of WARNING output, add ->comm[] output
bugs/sh: Concatenate 'cond_str' with '__FILE__' in __WARN_FLAGS(), to extend WARN_ON/BUG_ON output
bugs/parisc: Concatenate 'cond_str' with '__FILE__' in __WARN_FLAGS(), to extend WARN_ON/BUG_ON output
bugs/riscv: Concatenate 'cond_str' with '__FILE__' in __BUG_FLAGS(), to extend WARN_ON/BUG_ON output
bugs/riscv: Pass in 'cond_str' to __BUG_FLAGS()
...
|
|
Merge tag 'vfs-6.19-rc1.fd_prepare.fs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull fd prepare updates from Christian Brauner:
"This adds the FD_ADD() and FD_PREPARE() primitive. They simplify the
common pattern of get_unused_fd_flags() + create file + fd_install()
that is used extensively throughout the kernel and currently requires
cumbersome cleanup paths.
FD_ADD() - For simple cases where a file is installed immediately:

    fd = FD_ADD(O_CLOEXEC, vfio_device_open_file(device));
    if (fd < 0)
            vfio_device_put_registration(device);
    return fd;

FD_PREPARE() - For cases requiring access to the fd or file, or
additional work before publishing:

    FD_PREPARE(fdf, O_CLOEXEC, sync_file->file);
    if (fdf.err) {
            fput(sync_file->file);
            return fdf.err;
    }
    data.fence = fd_prepare_fd(fdf);
    if (copy_to_user((void __user *)arg, &data, sizeof(data)))
            return -EFAULT;
    return fd_publish(fdf);
The primitives are centered around struct fd_prepare. FD_PREPARE()
encapsulates all allocation and cleanup logic and must be followed by
a call to fd_publish() which associates the fd with the file and
installs it into the caller's fdtable. If fd_publish() isn't called,
both are deallocated automatically. FD_ADD() is a shorthand that does
fd_publish() immediately and never exposes the struct to the caller.
I've implemented this in a way that it's compatible with the cleanup
infrastructure while also being usable separately. IOW, it's centered
around struct fd_prepare which is aliased to class_fd_prepare_t and so
we can make use of all the basic guard infrastructure"
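(For contrast, a sketch of the open-coded pattern that FD_ADD() replaces,
reusing the vfio example from above; the error-handling details are
illustrative rather than lifted from the actual conversion.)

    int fd = get_unused_fd_flags(O_CLOEXEC);
    struct file *file;

    if (fd < 0)
            return fd;
    file = vfio_device_open_file(device);
    if (IS_ERR(file)) {
            put_unused_fd(fd);
            return PTR_ERR(file);
    }
    fd_install(fd, file);
    return fd;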
* tag 'vfs-6.19-rc1.fd_prepare.fs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (42 commits)
io_uring: convert io_create_mock_file() to FD_PREPARE()
file: convert replace_fd() to FD_PREPARE()
vfio: convert vfio_group_ioctl_get_device_fd() to FD_ADD()
tty: convert ptm_open_peer() to FD_ADD()
ntsync: convert ntsync_obj_get_fd() to FD_PREPARE()
media: convert media_request_alloc() to FD_PREPARE()
hv: convert mshv_ioctl_create_partition() to FD_ADD()
gpio: convert linehandle_create() to FD_PREPARE()
pseries: port papr_rtas_setup_file_interface() to FD_ADD()
pseries: convert papr_platform_dump_create_handle() to FD_ADD()
spufs: convert spufs_gang_open() to FD_PREPARE()
papr-hvpipe: convert papr_hvpipe_dev_create_handle() to FD_PREPARE()
spufs: convert spufs_context_open() to FD_PREPARE()
net/socket: convert __sys_accept4_file() to FD_ADD()
net/socket: convert sock_map_fd() to FD_ADD()
net/kcm: convert kcm_ioctl() to FD_PREPARE()
net/handshake: convert handshake_nl_accept_doit() to FD_PREPARE()
secretmem: convert memfd_secret() to FD_ADD()
memfd: convert memfd_create() to FD_ADD()
bpf: convert bpf_token_create() to FD_PREPARE()
...
|
|
Merge tag 'namespace-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull namespace updates from Christian Brauner:
"This contains substantial namespace infrastructure changes including a new
system call, active reference counting, and extensive header cleanups.
The branch depends on the shared kbuild branch for -fms-extensions support.
Features:
- listns() system call
Add a new listns() system call that allows userspace to iterate
through namespaces in the system. This provides a programmatic
interface to discover and inspect namespaces, addressing
longstanding limitations:
Currently, there is no direct way for userspace to enumerate
namespaces. Applications must resort to scanning /proc/*/ns/ across
all processes, which is:
- Inefficient - requires iterating over all processes
- Incomplete - misses namespaces not attached to any running
process but kept alive by file descriptors, bind mounts, or
parent references
- Permission-heavy - requires access to /proc for many processes
- No ordering or ownership information
- No filtering per namespace type
The listns() system call solves these problems:
    ssize_t listns(const struct ns_id_req *req, u64 *ns_ids,
                   size_t nr_ns_ids, unsigned int flags);

    struct ns_id_req {
            __u32 size;
            __u32 spare;
            __u64 ns_id;
            struct /* listns */ {
                    __u32 ns_type;
                    __u32 spare2;
                    __u64 user_ns_id;
            };
    };
Features include:
- Pagination support for large namespace sets
- Filtering by namespace type (MNT_NS, NET_NS, USER_NS, etc.)
- Filtering by owning user namespace
- Permission checks respecting namespace isolation
- Active Reference Counting
Introduce an active reference count that tracks namespace
visibility to userspace. A namespace is visible in the following
cases:
(1) The namespace is in use by a task
(2) The namespace is persisted through a VFS object (namespace file
descriptor or bind-mount)
(3) The namespace is a hierarchical type and is the parent of child
namespaces
The active reference count does not regulate lifetime (that's still
done by the normal reference count) - it only regulates visibility
to namespace file handles and listns().
This prevents resurrection of namespaces that are pinned only for
internal kernel reasons (e.g., user namespaces held by
file->f_cred, lazy TLB references on idle CPUs, etc.) which should
not be accessible via (1)-(3).
- Unified Namespace Tree
Introduce a unified tree structure for all namespaces with:
- Fixed IDs assigned to initial namespaces
- Lookup based solely on inode number
- Maintained list of owned namespaces per user namespace
- Simplified rbtree comparison helpers
Cleanups
- Header Reorganization:
- Move namespace types into separate header (ns_common_types.h)
- Decouple nstree from ns_common header
- Move nstree types into separate header
- Switch to new ns_tree_{node,root} structures with helper functions
- Use guards for ns_tree_lock
- Initial Namespace Reference Count Optimization
- Make all reference counts on initial namespaces a nop to avoid
pointless cacheline ping-pong for namespaces that can never go
away
- Drop custom reference count initialization for initial namespaces
- Add NS_COMMON_INIT() macro and use it for all namespaces
- pid: rely on common reference count behavior
- Miscellaneous Cleanups
- Rename exit_task_namespaces() to exit_nsproxy_namespaces()
- Rename is_initial_namespace() and make argument const
- Use boolean to indicate anonymous mount namespace
- Simplify owner list iteration in nstree
- nsfs: raise SB_I_NODEV, SB_I_NOEXEC, and DCACHE_DONTCACHE explicitly
- nsfs: use inode_just_drop()
- pidfs: raise DCACHE_DONTCACHE explicitly
- pidfs: simplify PIDFD_GET_<type>_NAMESPACE ioctls
- libfs: allow to specify s_d_flags
- cgroup: add cgroup namespace to tree after owner is set
- nsproxy: fix free_nsproxy() and simplify create_new_namespaces()
Fixes:
- setns(pidfd, ...) race condition
Fix a subtle race when using pidfds with setns(). When the target
task exits after prepare_nsset() but before commit_nsset(), the
namespace's active reference count might have been dropped. If
setns() then installs the namespaces, it would bump the active
reference count from zero without taking the required reference on
the owner namespace, leading to underflow when later decremented.
The fix resurrects the ownership chain if necessary - if the caller
succeeded in grabbing passive references, the setns() should
succeed even if the target task exits or gets reaped.
- Return EFAULT on put_user() error instead of success
- Make sure references are dropped outside of RCU lock (some
namespaces like mount namespace sleep when putting the last
reference)
- Don't skip active reference count initialization for network
namespace
- Add asserts for active refcount underflow
- Add asserts for initial namespace reference counts (both passive
and active)
- ipc: enable is_ns_init_id() assertions
- Fix kernel-doc comments for internal nstree functions
- Selftests
- 15 active reference count tests
- 9 listns() functionality tests
- 7 listns() permission tests
- 12 inactive namespace resurrection tests
- 3 threaded active reference count tests
- commit_creds() active reference tests
- Pagination and stress tests
- EFAULT handling test
- nsid tests fixes"
* tag 'namespace-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (103 commits)
pidfs: simplify PIDFD_GET_<type>_NAMESPACE ioctls
nstree: fix kernel-doc comments for internal functions
nsproxy: fix free_nsproxy() and simplify create_new_namespaces()
selftests/namespaces: fix nsid tests
ns: drop custom reference count initialization for initial namespaces
pid: rely on common reference count behavior
ns: add asserts for initial namespace active reference counts
ns: add asserts for initial namespace reference counts
ns: make all reference counts on initial namespace a nop
ipc: enable is_ns_init_id() assertions
fs: use boolean to indicate anonymous mount namespace
ns: rename is_initial_namespace()
ns: make is_initial_namespace() argument const
nstree: use guards for ns_tree_lock
nstree: simplify owner list iteration
nstree: switch to new structures
nstree: add helper to operate on struct ns_tree_{node,root}
nstree: move nstree types into separate header
nstree: decouple from ns_common header
ns: move namespace types into separate header
...
|
|
Merge tag 'vfs-6.19-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"Features:
- Cheaper MAY_EXEC handling for path lookup. This elides MAY_WRITE
permission checks during path lookup and adds the
IOP_FASTPERM_MAY_EXEC flag so filesystems like btrfs can avoid
expensive permission work.
- Hide dentry_cache behind runtime const machinery.
- Add German Maglione as virtiofs co-maintainer.
Cleanups:
- Tidy up and inline step_into() and walk_component() for improved
code generation.
- Re-enable IOCB_NOWAIT writes to files. This refactors file
timestamp update logic, fixing a layering bypass in btrfs when
updating timestamps on device files and improving FMODE_NOCMTIME
handling in VFS now that nfsd started using it.
- Path lookup optimizations extracting slowpaths into dedicated
routines and adding branch prediction hints for mntput_no_expire(),
fd_install(), lookup_slow(), and various other hot paths.
- Enable clang's -fms-extensions flag, requiring a JFS rename to
avoid conflicts.
- Remove spurious exports in fs/file_attr.c.
- Stop duplicating union pipe_index declaration. This depends on the
shared kbuild branch that brings in -fms-extensions support which
is merged into this branch.
- Use MD5 library instead of crypto_shash in ecryptfs.
- Use largest_zero_folio() in iomap_dio_zero().
- Replace simple_strtol/strtoul with kstrtoint/kstrtouint in init and
initrd code.
- Various typo fixes.
Fixes:
- Fix emergency sync for btrfs. Btrfs requires an explicit sync_fs()
call with wait == 1 to commit super blocks. The emergency sync path
never passed this, leaving btrfs data uncommitted during emergency
sync.
- Use local kmap in watch_queue's post_one_notification().
- Add hint prints in sb_set_blocksize() for LBS dependency on THP"
* tag 'vfs-6.19-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits)
MAINTAINERS: add German Maglione as virtiofs co-maintainer
fs: inline step_into() and walk_component()
fs: tidy up step_into() & friends before inlining
orangefs: use inode_update_timestamps directly
btrfs: fix the comment on btrfs_update_time
btrfs: use vfs_utimes to update file timestamps
fs: export vfs_utimes
fs: lift the FMODE_NOCMTIME check into file_update_time_flags
fs: refactor file timestamp update logic
include/linux/fs.h: trivial fix: regualr -> regular
fs/splice.c: trivial fix: pipes -> pipe's
fs: mark lookup_slow() as noinline
fs: add predicts based on nd->depth
fs: move mntput_no_expire() slowpath into a dedicated routine
fs: remove spurious exports in fs/file_attr.c
watch_queue: Use local kmap in post_one_notification()
fs: touch up predicts in path lookup
fs: move fd_install() slowpath into a dedicated routine and provide commentary
fs: hide dentry_cache behind runtime const machinery
fs: touch predicts in do_dentry_open()
...
|
|
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-36-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-35-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-34-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Fixes a UAF for src_info as well.
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-33-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-32-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
'nvidia/tegra', 'intel/vt-d', 'amd/amd-vi' and 'core' into next
|
|
Add the missing forward declarations and includes so mem_encrypt.h does
not have implicit dependencies. It is a public header imported by
drivers; users should not have to guess what include files are needed.
Resolves a kbuild splat:
In file included from drivers/iommu/generic_pt/fmt/iommu_amdv1.c:15:
In file included from drivers/iommu/generic_pt/fmt/iommu_template.h:36:
In file included from drivers/iommu/generic_pt/fmt/amdv1.h:23:
In file included from include/linux/mem_encrypt.h:17:
>> arch/powerpc/include/asm/mem_encrypt.h:13:49: warning: declaration of 'struct device' will not be visible outside of this function [-Wvisibility]
13 | static inline bool force_dma_unencrypted(struct device *dev)
Fixes: 879ced2bab1b ("iommupt: Add the AMD IOMMU v1 page table format")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202511161358.rS5pSb3U-lkp@intel.com/
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
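(The general shape of such a fix, as a sketch with a placeholder body;
the real patch's contents may differ.)

    /* arch/<arch>/include/asm/mem_encrypt.h */
    #ifndef _ASM_MEM_ENCRYPT_H
    #define _ASM_MEM_ENCRYPT_H

    #include <linux/types.h>    /* for bool */

    struct device;              /* forward declaration: the prototype below
                                 * only needs the type name, not its layout */

    static inline bool force_dma_unencrypted(struct device *dev)
    {
            return false;       /* placeholder body for illustration */
    }

    #endif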
|
|
KVM TDX changes for 6.19:
- Overhaul the TDX code to address systemic races where KVM (acting on behalf
of userspace) could inadvertently trigger lock contention in the TDX-Module,
which KVM was either working around in weird, ugly ways, or was simply
oblivious to (as proven by Yan tripping several KVM_BUG_ON()s with clever
selftests).
- Fix a bug where KVM could corrupt a vCPU's cpu_list when freeing a vCPU if
creating said vCPU failed partway through.
- Fix a few sparse warnings (bad annotation, 0 != NULL).
- Use struct_size() to simplify copying capabilities to userspace.
|
|
Change page_free to folio_free to make the folio support for
zone device-private more consistent. The PCI P2PDMA callback
has also been updated and changed to folio_free() as a result.
For drivers that do not support folios (yet), the folio is
converted back into a page via &folio->page and the page is used
as is in the current callback implementation.
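(A sketch of what a not-yet-converted driver's callback might look like
under the new hook; the driver names are invented, and only the
.folio_free member and the &folio->page conversion come from the
description above.)

    static void mydrv_folio_free(struct folio *folio)
    {
            /* drivers without folio support convert back to a page and
             * reuse their existing per-page free path */
            struct page *page = &folio->page;

            mydrv_free_device_page(page);   /* hypothetical driver helper */
    }

    static const struct dev_pagemap_ops mydrv_pagemap_ops = {
            .folio_free = mydrv_folio_free, /* was .page_free */
    };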
Link: https://lkml.kernel.org/r/20251001065707.920170-3-balbirs@nvidia.com
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: Ying Huang <ying.huang@linux.alibaba.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: "Christian König" <christian.koenig@amd.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm: support device-private THP", v7.
This patch series introduces support for Transparent Huge Page (THP)
migration in zone device-private memory. The implementation enables
efficient migration of large folios between system memory and
device-private memory
Background
Current zone device-private memory implementation only supports PAGE_SIZE
granularity, leading to:
- Increased TLB pressure
- Inefficient migration between CPU and device memory
This series extends the existing zone device-private infrastructure to
support THP, leading to:
- Reduced page table overhead
- Improved memory bandwidth utilization
- Seamless fallback to base pages when needed
In my local testing (using lib/test_hmm) and a throughput test, the series
shows a 350% improvement in data transfer throughput and an 80% improvement
in latency.
These patches build on the earlier posts by Ralph Campbell [1].
Two new flags are added in vma_migration to select and mark compound
pages. migrate_vma_setup(), migrate_vma_pages() and
migrate_vma_finalize() support migration of these pages when
MIGRATE_VMA_SELECT_COMPOUND is passed in as arguments.
The series also adds zone device awareness to (m)THP pages along with
fault handling of large zone device private pages. The page vma walk and
the rmap code are also zone device aware. Support has also been added for
folios that might need to be split in the middle of migration (when the
src and dst do not agree on MIGRATE_PFN_COMPOUND), which occurs when the
src side of the migration can migrate large pages but the destination has
not been able to allocate large pages. The code supports and uses
folio_split() when migrating THP pages; this is used when
MIGRATE_VMA_SELECT_COMPOUND is not passed as an argument to
migrate_vma_setup().
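(A sketch of how a driver might opt in to compound migration; the struct
migrate_vma fields shown are the existing ones, only
MIGRATE_VMA_SELECT_COMPOUND is new here, and the surrounding details are
illustrative.)

    struct migrate_vma args = {
            .vma         = vma,
            .start       = start,
            .end         = end,
            .src         = src_pfns,
            .dst         = dst_pfns,
            .pgmap_owner = drvdata,
            .flags       = MIGRATE_VMA_SELECT_DEVICE_PRIVATE |
                           MIGRATE_VMA_SELECT_COMPOUND, /* new in this series */
    };

    if (migrate_vma_setup(&args))
            return -EFAULT;                 /* illustrative error handling */
    /* allocate destination folios (THP if possible, base pages as a
     * fallback), fill args.dst, then: */
    migrate_vma_pages(&args);
    migrate_vma_finalize(&args);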
The test infrastructure lib/test_hmm.c has been enhanced to support THP
migration. A new ioctl to emulate failure of large page allocations has
been added to test the folio split code path. hmm-tests.c has new test
cases for huge page migration and to test the folio split path. A new
throughput test has been added as well.
The nouveau dmem code has been enhanced to use the new THP migration
capability.
mTHP support:
The patches hard-code HPAGE_PMD_NR in a few places, but the code has been
kept generic to support various order sizes. With additional refactoring
of the code, support for different order sizes should be possible.
The future plan is to post enhancements to support mTHP with a rough
design as follows:
1. Add the notion of allowable thp orders to the HMM based test driver
2. For non PMD based THP paths in migrate_device.c, check to see if
a suitable order is found and supported by the driver
3. Iterate across orders to check the highest supported order for migration
4. Migrate and finalize
The mTHP patches can be built on top of this series, the key design
elements that need to be worked out are infrastructure and driver support
for multiple ordered pages and their migration.
HMM support for large folios was added in 10b9feee2d0d ("mm/hmm:
populate PFNs from PMD swap entry").
This patch (of 16):
Add routines to support allocation of large-order zone device folios,
along with helper functions to check whether a folio is device private
and to set zone device data.
When large folios are used, the existing page_free() callback in pgmap is
called when the folio is freed; this is true for both PAGE_SIZE and
higher-order pages.
Zone device private large folios do not support deferred split and scan
like normal THP folios.
Link: https://lkml.kernel.org/r/20251001065707.920170-1-balbirs@nvidia.com
Link: https://lkml.kernel.org/r/20251001065707.920170-2-balbirs@nvidia.com
Link: https://lore.kernel.org/linux-mm/20201106005147.20113-1-rcampbell@nvidia.com/ [1]
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: Ying Huang <ying.huang@linux.alibaba.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: "Christian König" <christian.koenig@amd.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In the original logic, bpf_arch_text_poke() assumes that the old and new
instructions have the same opcode. However, they can have different opcodes
if we want to replace a "call" insn with a "jmp" insn.
Therefore, add the new function parameters "old_t" and "new_t", which
indicate the old and new poke types. Meanwhile, adjust the implementation
of bpf_arch_text_poke() for all the archs.
"BPF_MOD_NOP" is added to make the code more readable. In
bpf_arch_text_poke(), we still check whether the new and old addresses are
NULL to determine if a nop insn should be used, which I think is safer.
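(A sketch of the resulting interface implied by the description;
BPF_MOD_CALL and BPF_MOD_JUMP are the pre-existing poke types, and the
parameter order here is a guess.)

    enum bpf_text_poke_type {
            BPF_MOD_CALL,
            BPF_MOD_JUMP,
            BPF_MOD_NOP,    /* new: explicit nop poke type */
    };

    int bpf_arch_text_poke(void *ip,
                           enum bpf_text_poke_type old_t,
                           enum bpf_text_poke_type new_t,
                           void *old_addr, void *new_addr);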
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Link: https://lore.kernel.org/r/20251118123639.688444-6-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Bring in the UD1 and objtool data annotations to avoid conflicts while
further extending the bug exceptions.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
|
|
The no_slb_preload cmdline option can come in useful for quickly disabling
and/or testing the performance impact of userspace slb preloads. Recently
there was an slb multi-hit issue due to the slb preload cache which was
very difficult to triage. This cmdline option makes it possible to quickly
disable preloads and verify whether the issue exists in the preload cache
or somewhere else. It can also be a useful option to see the effect of slb
preloads for any application workload, e.g. the number of slb faults with
or w/o slb preloads.
with slb_preload:
slb_faults (minimal initrd boot): 15
slb_faults (full systemd boot): 300
with no_slb_preload:
slb_faults (minimal initrd boot): 33
slb_faults (full systemd boot): 138180
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/de484b55c45d831bc2db63945f455153c89a9a65.1761834163.git.ritesh.list@gmail.com
|
|
We dropped preload_new_slb_context() & slb_setup_new_exec() in a
previous patch. That means we don't really need preload_add()'s return
value anymore. So let's make its return type void.
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/88b07223e93b0fc56c1119f6889b7d9c83e20109.1761834163.git.ritesh.list@gmail.com
|
|
This patch adds PGD/PUD/PMD/PTE level information while dumping kernel
page tables. Before this patch it was hard to identify which entries
belong to which page table level, e.g.
~ # dmesg |grep -i radix
[0.000000] radix-mmu: Mapped 0x0000000000000000-0x0000000005400000 with 2.00 MiB pages (exec)
[0.000000] radix-mmu: Mapped 0x0000000005400000-0x0000000040000000 with 2.00 MiB pages
[0.000000] radix-mmu: Mapped 0x0000000040000000-0x0000000100000000 with 1.00 GiB pages
[0.000000] radix-mmu: Initializing Radix MMU
Before:
---[ Start of kernel VM ]---
0xc000000000000000-0xc000000003ffffff XXX 64M r X pte valid present dirty accessed
0xc000000004000000-0xc00000003fffffff XXX 960M r w pte valid present dirty accessed
0xc000000040000000-0xc0000000ffffffff XXX 3G r w pte valid present dirty accessed
...
---[ vmemmap start ]---
0xc00c000000000000-0xc00c0000003fffff XXX 4M r w pte valid present dirty accessed
After:
---[ Start of kernel VM ]---
0xc000000000000000-0xc000000003ffffff XXX 64M PMD r X pte valid present dirty accessed
0xc000000004000000-0xc00000003fffffff XXX 960M PMD r w pte valid present dirty accessed
0xc000000040000000-0xc0000000ffffffff XXX 3G PUD r w pte valid present dirty accessed
...
---[ vmemmap start ]---
0xc00c000000000000-0xc00c0000003fffff XXX 4M PMD r w pte valid present dirty accessed
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/95defb675ee5607ef3923a1e6aeac39311b8fad4.1761834163.git.ritesh.list@gmail.com
|
|
This patch enables the directMap counters to be printed in /proc/meminfo
for the Hash MMU. With this patch, on a system with 8G of DRAM we can see
the entire RAM mapped with 16M pagesize:
cat /proc/meminfo |grep -i direct
DirectMap4k: 0 kB
DirectMap64k: 0 kB
DirectMap16M: 8388608 kB
DirectMap16G: 0 kB
Tested with devdax too:
root@buildroot:/# ndctl create-namespace -r region0 -m devdax -s 2G
{
"dev":"namespace0.0",
"mode":"devdax",
"map":"dev",
"size":"2032.00 MiB (2130.71 MB)",
"uuid":"aa383ded-cd99-43a0-979f-5225467cfb40",
"daxregion":{
"id":0,
"size":"2032.00 MiB (2130.71 MB)",
"align":16777216,
"devices":[
{
"chardev":"dax0.0",
"size":"2032.00 MiB (2130.71 MB)",
"target_node":0,
"align":"16.00 MiB (16.78 MB)",
"mode":"devdax"
}
]
},
"align":16777216
}
root@buildroot:/# cat /proc/meminfo |grep -i direct
DirectMap4k: 0 kB
DirectMap64k: 0 kB
DirectMap16M: 10485760 kB
DirectMap16G: 0 kB
root@buildroot:/# ndctl destroy-namespace -f all
destroyed 1 namespace
root@buildroot:/# cat /proc/meminfo |grep -i direct
DirectMap4k: 0 kB
DirectMap64k: 0 kB
DirectMap16M: 8388608 kB
DirectMap16G: 0 kB
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/ffe47912e17075649b6ce5b2ee5d7f3eb5352e0b.1761834163.git.ritesh.list@gmail.com
|
|
Update the directMap page counters for Hash. Hash by default always uses
mmu_linear_psize only for its directMap. However, once the kernel has
booted and the dmesg log has wrapped, there is no way of knowing the
kernel linear pagesize with the Hash MMU. Features like debug_page_alloc
can cause mmu_linear_psize to be PAGE_SIZE instead of PMD / PUD mappings.
It would be easier if we had this info printed in /proc/meminfo, similar
to Radix, for debugging purposes.
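A minimal sketch of the accounting, assuming the update_page_count()
helper that the Radix direct-map code uses (the exact call site and the
size computation for Hash may differ):
/* size (assumed variable): bytes covered by the Hash linear mapping.
 * Account the pages so /proc/meminfo's DirectMapXX counters reflect it. */
update_page_count(mmu_linear_psize,
                  size >> mmu_psize_defs[mmu_linear_psize].shift);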
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/208e6f946d2ba9c1e2b8b4f665728abe5c891e7c.1761834163.git.ritesh.list@gmail.com
|
|
This disables creating hpt_order debugfs entry with radix mode.
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/99237176a51c73e85f4a7edd60a2460017882d69.1761834163.git.ritesh.list@gmail.com
|
|
Let's use pr_info() instead of printk() in order to utilize the pr_fmt
set to "hash-mmu:". This improves the debug messages that are printed
during kernel bootup.
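For reference, the standard pattern (the define must come before printk.h
is pulled in; the message text here is illustrative):
/* at the very top of the .c file, before any #include */
#define pr_fmt(fmt) "hash-mmu: " fmt

pr_info("Initializing hash MMU\n"); /* prints "hash-mmu: Initializing hash MMU" */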
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/a2af842f85c099cdbd19bf468606960c5226a079.1761834163.git.ritesh.list@gmail.com
|
|
We get the below errors when we try to enable debug logs in
book3s64/hash_utils.c. This patch fixes these errors related to the
phys_addr_t printf format.
arch/powerpc/mm/book3s64/hash_utils.c: In function ‘htab_initialize’:
arch/powerpc/mm/book3s64/hash_utils.c:1401:21: error: format ‘%lx’ expects argument of type ‘long unsigned int’, but argument 2 has type ‘phys_addr_t’ {aka ‘long long unsigned int’} [-Werror=format=]
1401 | DBG("creating mapping for region: %lx..%lx (prot: %lx)\n",
arch/powerpc/mm/book3s64/hash_utils.c:1401:21: error: format ‘%lx’ expects argument of type ‘long unsigned int’, but argument 3 has type ‘phys_addr_t’ {aka ‘long long unsigned int’} [-Werror=format=]
cc1: all warnings being treated as errors
make[6]: *** [../scripts/Makefile.build:287: arch/powerpc/mm/book3s64/hash_utils.o] Error 1
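One standard fix for this class of error (a sketch; the patch may instead
cast the arguments to unsigned long long) is the kernel's %pa format
specifier, which takes a pointer to a phys_addr_t and prints it correctly
on both 32-bit and 64-bit:
phys_addr_t base, end;  /* e.g. memblock region bounds */

pr_debug("creating mapping for region: %pa..%pa\n", &base, &end);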
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/4873e9692fc4411099c9741005d218d5e734c345.1761834163.git.ritesh.list@gmail.com
|
|
The HPTE format was changed from Power9 (ISA 3.0) onwards. While dumping
kernel hash page tables, nothing gets printed on powernv P9+. This patch
utilizes the helpers added in the patch tagged as Fixes to convert the new
format to the old format and dump the hptes. This fix is only needed for
native_find() (powernv), since pseries continues to work fine with the
old format.
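The shape of the fix, as a hedged sketch using the conversion helpers
introduced by the commit in the Fixes: tag (v and r being the raw HPTE
dwords read by the dumper):
if (cpu_has_feature(CPU_FTR_ARCH_300)) {
        /* ISA 3.0 HPTEs: convert back to the old layout before decoding */
        v = hpte_new_to_old_v(v, r);
        r = hpte_new_to_old_r(r);
}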
Fixes: 6b243fcfb5f1e ("powerpc/64: Simplify adaptation to new ISA v3.00 HPTE format")
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/4c2bb9e5b3cfbc0dd80b61b67cdd3ccfc632684c.1761834163.git.ritesh.list@gmail.com
|
|
When HV=0 & IR/DR=0, the Hash MMU is said to be in Virtual Real
Addressing Mode during early boot. During this, we should ensure that
memory region allocations for stress_hpt_struct happen from within the
RMA region, as otherwise the boot might get stuck while doing the memset
of this region.
The history behind why we have the RMA region limitation is better
explained in these 2 patches [1] & [2]. This patch ensures that the
memset to stress_hpt_struct during early boot does not cross the
ppc64_rma_size boundary.
[1]: https://lore.kernel.org/all/20190710052018.14628-1-sjitindarsingh@gmail.com/
[2]: https://lore.kernel.org/all/87wp54usvj.fsf@linux.vnet.ibm.com/
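A minimal sketch of the constraint, assuming the memblock range allocator
is used (the actual patch may use a different memblock helper):
/* keep the allocation below ppc64_rma_size so early-boot accesses in
 * Virtual Real Addressing Mode stay within the RMA */
phys_addr_t pa = memblock_phys_alloc_range(size, align, 0, ppc64_rma_size);

if (pa)
        stress_hpt_struct = __va(pa);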
Fixes: 6b34a099faa12 ("powerpc/64s/hash: add stress_hpt kernel boot option to increase hash faults")
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/ada1173933ea7617a994d6ee3e54ced8797339fc.1761834163.git.ritesh.list@gmail.com
|
|
On systems using the hash MMU, there is a software SLB preload cache that
mirrors the entries loaded into the hardware SLB buffer. This preload
cache is subject to periodic eviction (typically after every 256 context
switches) to remove old entries.
To optimize performance, the kernel skips switch_mmu_context() in
switch_mm_irqs_off() when the prev and next mm_struct are the same.
However, on hash MMU systems, this can lead to inconsistencies between
the hardware SLB and the software preload cache.
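Roughly, the fast path in question looks like this (a sketch; the real
check in switch_mm_irqs_off() has more detail):
if (prev == next)
        return;         /* same mm: HW SLB on this CPU is left untouched */

switch_mmu_context(prev, next, tsk);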
If an SLB entry for a process is evicted from the software cache on one
CPU, and the same process later runs on another CPU without executing
switch_mmu_context(), the hardware SLB may retain stale entries. If the
kernel then attempts to reload that entry, it can trigger an SLB
multi-hit error.
The following timeline shows how stale SLB entries are created and can
cause a multi-hit error when a process moves between CPUs without an
MMU context switch.
CPU 0 CPU 1
----- -----
Process P
exec swapper/1
load_elf_binary
begin_new_exec
activate_mm
switch_mm_irqs_off
switch_mmu_context
switch_slb
/*
* This invalidates all
* the entries in the HW
* and setup the new HW
* SLB entries as per the
* preload cache.
*/
context_switch
sched_migrate_task migrates process P to cpu-1
Process swapper/0 context switch (to process P)
(uses mm_struct of Process P) switch_mm_irqs_off()
switch_slb
load_slb++
/*
* load_slb becomes 0 here
* and we evict an entry from
* the preload cache with
* preload_age(). We still
* keep HW SLB and preload
* cache in sync, that is
* because all HW SLB entries
* anyways gets evicted in
* switch_slb during SLBIA.
* We then only add those
* entries back in HW SLB,
* which are currently
* present in preload_cache
* (after eviction).
*/
load_elf_binary continues...
setup_new_exec()
slb_setup_new_exec()
sched_switch event
sched_migrate_task migrates
process P to cpu-0
context_switch from swapper/0 to Process P
switch_mm_irqs_off()
/*
* Since both prev and next mm struct are same we don't call
* switch_mmu_context(). This will cause the HW SLB and SW preload
* cache to go out of sync in preload_new_slb_context. Because there
* was an SLB entry which was evicted from both HW and preload cache
* on cpu-1. Now later in preload_new_slb_context(), when we will try
* to add the same preload entry again, we will add this to the SW
* preload cache and then will add it to the HW SLB. Since on cpu-0
* this entry was never invalidated, hence adding this entry to the HW
* SLB will cause a SLB multi-hit error.
*/
load_elf_binary continues...
START_THREAD
start_thread
preload_new_slb_context
/*
* This tries to add a new EA to preload cache which was earlier
* evicted from both cpu-1 HW SLB and preload cache. This caused the
* HW SLB of cpu-0 to go out of sync with the SW preload cache. The
* reason for this was, that when we context switched back on CPU-0,
* we should have ideally called switch_mmu_context() which will
* bring the HW SLB entries on CPU-0 in sync with SW preload cache
* entries by setting up the mmu context properly. But we didn't do
* that since the prev mm_struct running on cpu-0 was same as the
* next mm_struct (which is true for swapper / kernel threads). So
* now when we try to add this new entry into the HW SLB of cpu-0,
* we hit a SLB multi-hit error.
*/
WARNING: CPU: 0 PID: 1810970 at arch/powerpc/mm/book3s64/slb.c:62
assert_slb_presence+0x2c/0x50
Modules linked in:
CPU: 0 UID: 0 PID: 1810970 Comm: dd Not tainted 6.16.0-rc3-dirty #12
VOLUNTARY
Hardware name: IBM pSeries (emulated by qemu) POWER8 (architected)
0x4d0200 0xf000004 of:SLOF,HEAD hv:linux,kvm pSeries
NIP: c00000000015426c LR: c0000000001543b4 CTR: 0000000000000000
REGS: c0000000497c77e0 TRAP: 0700 Not tainted (6.16.0-rc3-dirty)
MSR: 8000000002823033 <SF,VEC,VSX,FP,ME,IR,DR,RI,LE> CR: 28888482 XER: 00000000
CFAR: c0000000001543b0 IRQMASK: 3
<...>
NIP [c00000000015426c] assert_slb_presence+0x2c/0x50
LR [c0000000001543b4] slb_insert_entry+0x124/0x390
Call Trace:
0x7fffceb5ffff (unreliable)
preload_new_slb_context+0x100/0x1a0
start_thread+0x26c/0x420
load_elf_binary+0x1b04/0x1c40
bprm_execve+0x358/0x680
do_execveat_common+0x1f8/0x240
sys_execve+0x58/0x70
system_call_exception+0x114/0x300
system_call_common+0x160/0x2c4
From the above analysis, during early exec the hardware SLB is cleared,
and entries from the software preload cache are reloaded into hardware
by switch_slb. However, preload_new_slb_context and slb_setup_new_exec
also attempt to load some of the same entries, which can trigger a
multi-hit. In most cases, these additional preloads simply hit existing
entries and add nothing new. Removing these functions avoids redundant
preloads and eliminates the multi-hit issue. This patch removes these
two functions.
We tested process switching performance using the context_switch
benchmark on POWER9/hash, and observed no regression.
Without this patch: 129041 ops/sec
With this patch: 129341 ops/sec
We also measured SLB faults during boot, and the counts are essentially
the same with and without this patch.
SLB faults without this patch: 19727
SLB faults with this patch: 19786
Fixes: 5434ae74629a ("powerpc/64s/hash: Add a SLB preload cache")
Cc: stable@vger.kernel.org
Suggested-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/0ac694ae683494fe8cadbd911a1a5018d5d3c541.1761834163.git.ritesh.list@gmail.com
|
|
On 32-bit book3s with hash-MMUs, tlb_flush() was a no-op. This was
unnoticed because all uses until recently were for unmaps, and thus
handled by __tlb_remove_tlb_entry().
After commit 4a18419f71cd ("mm/mprotect: use mmu_gather") in kernel 5.19,
tlb_gather_mmu() started being used for mprotect as well. This caused
mprotect to simply not work on these machines:
int *ptr = mmap(NULL, 4096, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
*ptr = 1; // force HPTE to be created
mprotect(ptr, 4096, PROT_READ);
*ptr = 2; // should segfault, but succeeds
Fixed by making tlb_flush() actually flush TLB pages. This finally
agrees with the behaviour of book3s64's tlb_flush().
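A hedged sketch of the shape of the fix (the actual patch may flush a
narrower range than the whole mm):
static inline void tlb_flush(struct mmu_gather *tlb)
{
        /* previously a no-op on 32-bit book3s/hash; actually flush now */
        flush_tlb_mm(tlb->mm);
}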
Fixes: 4a18419f71cd ("mm/mprotect: use mmu_gather")
Cc: stable@vger.kernel.org
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Dave Vasilevsky <dave@vasilevsky.ca>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20251116-vasi-mprotect-g3-v3-1-59a9bd33ba00@vasilevsky.ca
|
|
Patch series "drivers/base/node: fold node register and unregister
functions", v2.
The first patch merges register_one_node() and register_node(), leaving a
single register_node() function.
The second patch merges unregister_one_node() and unregister_node(),
leaving a single unregister_node() function.
There are no functional changes in these patches.
This patch (of 2):
register_node() is only called from register_one_node(). This patch folds
register_node() into its only caller and renames register_one_node() to
register_node().
This reduces unnecessary indirection and simplifies the code structure.
No functional changes are introduced.
[akpm@linux-foundation.org: fix kerneldoc, per David]
Link: https://lkml.kernel.org/r/cover.1760097207.git.donettom@linux.ibm.com
Link: https://lkml.kernel.org/r/910853c9dd61f7a2190a56cba101e73e9c6859be.1760097207.git.donettom@linux.ibm.com
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: SeongJae Park <sj@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Aboorva Devarajan <aboorvad@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
have spufs_new_file() use d_make_persistent() instead of d_add() and
do an unconditional dput() in the caller; the rest is completely
straightforward.
[a braino in spufs_mkgang() fixed]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
In the past, CONFIG_ARCH_HAS_GIGANTIC_PAGE indicated that we support
runtime allocation of gigantic hugetlb folios. In the meantime it evolved
into a generic way for the architecture to state that it supports gigantic
hugetlb folios.
In commit fae7d834c43c ("mm: add __dump_folio()") we started using
CONFIG_ARCH_HAS_GIGANTIC_PAGE to decide MAX_FOLIO_ORDER: whether we could
have folios larger than what the buddy can handle. In the context of that
commit, we started using MAX_FOLIO_ORDER to detect page corruptions when
dumping tail pages of folios. Before that commit, we assumed that we
cannot have folios larger than the highest buddy order, which was
obviously wrong.
In commit 7b4f21f5e038 ("mm/hugetlb: check for unreasonable folio sizes
when registering hstate"), we used MAX_FOLIO_ORDER to detect
inconsistencies, and in fact, we found some now.
Powerpc allows for configs that can allocate gigantic folios during boot
(not at runtime), that do not set CONFIG_ARCH_HAS_GIGANTIC_PAGE and can
exceed PUD_ORDER.
To fix it, let's make powerpc select CONFIG_ARCH_HAS_GIGANTIC_PAGE when
hugetlb is enabled, and increase the maximum folio size with hugetlb to 16
GiB on 64-bit (possible on arm64 and powerpc) and 1 GiB on 32-bit
(powerpc). Note that on some powerpc configurations, whether we actually
have gigantic pages depends on the setting of CONFIG_ARCH_FORCE_MAX_ORDER,
but there is nothing really problematic about setting it unconditionally:
we just try to keep the value small so we can better detect problems in
__dump_folio() and inconsistencies around the expected largest folio in
the system.
Ideally, we'd have a better way to obtain the maximum hugetlb folio size
and detect ourselves whether we really end up with gigantic folios. Let's
defer bigger changes and fix the warnings first.
While at it, handle gigantic DAX folios more clearly: DAX can only end up
creating gigantic folios with HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD.
Add a new Kconfig option HAVE_GIGANTIC_FOLIOS to make both cases clearer.
In particular, worry about ARCH_HAS_GIGANTIC_PAGE only with HUGETLB_PAGE.
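A sketch of what that helper symbol could look like, reconstructed from
the description above (hedged: the ZONE_DEVICE dependency is an
assumption, and the final definition may differ):
config HAVE_GIGANTIC_FOLIOS
        def_bool (HUGETLB_PAGE && ARCH_HAS_GIGANTIC_PAGE) || \
                 (ZONE_DEVICE && HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)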
Note: with enabling CONFIG_ARCH_HAS_GIGANTIC_PAGE on powerpc, we will now
also allow for runtime allocations of folios in some more powerpc configs.
I don't think this is a problem, but if it is we could handle it through
__HAVE_ARCH_GIGANTIC_PAGE_RUNTIME_SUPPORTED.
While __dump_page()/__dump_folio was also problematic (not handling
dumping of tail pages of such gigantic folios correctly), it doesn't seem
critical enough to mark it as a fix.
Link: https://lkml.kernel.org/r/20251114214920.2550676-1-david@kernel.org
Fixes: 7b4f21f5e038 ("mm/hugetlb: check for unreasonable folio sizes when registering hstate")
Reported-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Closes: https://lore.kernel.org/r/3e043453-3f27-48ad-b987-cc39f523060a@csgroup.eu/
Reported-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Closes: https://lore.kernel.org/r/94377f5c-d4f0-4c0f-b0f6-5bf1cd7305b1@linux.ibm.com/
Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>
Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Donet Tom <donettom@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Several drivers can benefit from registering per-instance data along
with the syscore operations. To achieve this, move the modifiable fields
out of the syscore_ops structure and into a separate struct syscore that
can be registered with the framework. Add a void * driver data field for
drivers to store contextual data that will be passed to the syscore ops.
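A minimal sketch of the resulting shape (hedged: the list linkage and the
registration function names are assumptions; only the ops pointer and the
driver data field are taken from the description above):
/* per-instance registration object; the ops table stays const and
 * shareable across instances */
struct syscore {
        struct list_head node;          /* framework list linkage (assumed) */
        const struct syscore_ops *ops;
        void *data;                     /* driver context, passed to the ops */
};

int register_syscore(struct syscore *syscore);          /* assumed name */
void unregister_syscore(struct syscore *syscore);       /* assumed name */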
Acked-by: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Signed-off-by: Thierry Reding <treding@nvidia.com>
|
|
From Power10 processors onwards, each chip has 2 hemispheres. For LPARs
running on PowerVM Hypervisor, hypervisor determines the allocation of
CPU groups to each LPAR, resulting in two LPARs with the same number of
CPUs potentially having different numbers of CPUs from each hemisphere.
Additionally, it is not feasible to ascertain the hemisphere based
solely on the CPU number.
Users wishing to assign their workload to all CPUs, or a subset of CPUs
within a specific hemisphere, encounter difficulties in identifying the
cpumask. To address this, it is proposed to expose hemisphere
information as a die in sysfs. This aligns with other architectures
and facilitates the identification of CPUs within the same hemisphere.
Tools such as lstopo can also access this information.
Please note: The hypervisor reveals the locality of the CPUs to
hemispheres only in dedicated mode. Consequently, in systems where
hemisphere information is unavailable, such as shared LPARs, the
die_cpus information in sysfs will mirror package_cpus, with
die_id set to -1.
Without this change:
$ grep . /sys/devices/system/cpu/cpu16/topology/{die*,package*} 2>/dev/null
/sys/devices/system/cpu/cpu16/topology/package_cpus:000000,000000ff,ffff0000
/sys/devices/system/cpu/cpu16/topology/package_cpus_list:16-39
With this change:
$ grep . /sys/devices/system/cpu/cpu16/topology/{die*,package*} 2>/dev/null
/sys/devices/system/cpu/cpu16/topology/die_cpus:000000,00000000,00ff0000
/sys/devices/system/cpu/cpu16/topology/die_cpus_list:16-23
/sys/devices/system/cpu/cpu16/topology/die_id:2
/sys/devices/system/cpu/cpu16/topology/package_cpus:000000,000000ff,ffff0000
/sys/devices/system/cpu/cpu16/topology/package_cpus_list:16-39
Snipped lstopo-no-graphics output:
Group0 L#0 (total=8747584KB)
Package L#0 (total=3564096KB CPUModel="POWER10 (architected), altivec supported" CPURevision="2.0 (pvr 0080 0200)")
NUMANode L#0 (P#0 local=3564096KB total=3564096KB)
Die L#0 (P#0)
Core L#0 (P#0)
<snipped>
Package L#1 (total=5183488KB CPUModel="POWER10 (architected), altivec supported" CPURevision="2.0 (pvr 0080 0200)")
NUMANode L#1 (P#1 local=5183488KB total=5183488KB)
Die L#2 (P#2)
Core L#2 (P#16)
L3Cache L#4 (size=4096KB linesize=128 ways=16)
L2Cache L#4 (size=1024KB linesize=128 ways=8)
L1dCache L#4 (size=32KB linesize=128 ways=8)
L1iCache L#4 (size=48KB linesize=128 ways=6)
PU L#16 (P#16)
PU L#17 (P#18)
PU L#18 (P#20)
PU L#19 (P#22)
L3Cache L#5 (size=4096KB linesize=128 ways=16)
L2Cache L#5 (size=1024KB linesize=128 ways=8)
L1dCache L#5 (size=32KB linesize=128 ways=8)
L1iCache L#5 (size=48KB linesize=128 ways=6)
PU L#20 (P#17)
PU L#21 (P#19)
PU L#22 (P#21)
PU L#23 (P#23)
Die L#3 (P#3)
Core L#3 (P#24)
L3Cache L#6 (size=4096KB linesize=128 ways=16)
L2Cache L#6 (size=1024KB linesize=128 ways=8)
L1dCache L#6 (size=32KB linesize=128 ways=8)
L1iCache L#6 (size=48KB linesize=128 ways=6)
PU L#24 (P#24)
PU L#25 (P#26)
PU L#26 (P#28)
PU L#27 (P#30)
L3Cache L#7 (size=4096KB linesize=128 ways=16)
L2Cache L#7 (size=1024KB linesize=128 ways=8)
L1dCache L#7 (size=32KB linesize=128 ways=8)
L1iCache L#7 (size=48KB linesize=128 ways=6)
PU L#28 (P#25)
PU L#29 (P#27)
PU L#30 (P#29)
PU L#31 (P#31)
Core L#4 (P#32)
L3Cache L#8 (size=4096KB linesize=128 ways=16)
L2Cache L#8 (size=1024KB linesize=128 ways=8)
L1dCache L#8 (size=32KB linesize=128 ways=8)
L1iCache L#8 (size=48KB linesize=128 ways=6)
PU L#32 (P#32)
PU L#33 (P#34)
PU L#34 (P#36)
PU L#35 (P#38)
L3Cache L#9 (size=4096KB linesize=128 ways=16)
L2Cache L#9 (size=1024KB linesize=128 ways=8)
L1dCache L#9 (size=32KB linesize=128 ways=8)
L1iCache L#9 (size=48KB linesize=128 ways=6)
PU L#36 (P#33)
PU L#37 (P#35)
PU L#38 (P#37)
PU L#39 (P#39)
Group0 L#1 (total=7736896KB)
Package L#2 (total=5170880KB CPUModel="POWER10 (architected), altivec supported" CPURevision="2.0 (pvr 0080 0200)")
NUMANode L#2 (P#2 local=5170880KB total=5170880KB)
Die L#4 (P#4)
<snipped>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20251112074859.814087-1-srikar@linux.ibm.com
|
|
kasprintf() returns a pointer to dynamically allocated memory
which can be NULL upon failure. Ensure the allocation was successful
by checking the pointer validity.
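The pattern is the usual allocation check, roughly (names here are
illustrative, not taken from the patch):
char *name = kasprintf(GFP_KERNEL, "irq-%d", index);

if (!name)
        return -ENOMEM; /* bail out instead of passing a NULL name on */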
Signed-off-by: Kunwu Chan <chentao@kylinos.cn>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20240115094330.33014-1-chentao@kylinos.cn
|
|
This file was missing the shebang line, so add it.
Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Reviewed-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20250722220043.14862-1-unixbhaskar@gmail.com
|
|
Certain versions of kexec don't even work without kernel-end being
added to the device-tree. Add it even if crash-kernel is disabled.
Signed-off-by: Ben Collins <bcollins@kernel.org>
Reviewed-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/2025042122-inescapable-mandrill-8a5ff2@boujee-and-buff
|
|
The watchdog.yaml schema prescribes a node name of "timer" or "watchdog"
rather than the abbreviation "wdt".
Signed-off-by: J. Neuschäfer <j.ne@posteo.net>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20250418-watchdog-v1-4-987ff2046272@posteo.net
|
|
The watchdog.yaml schema prescribes a node name of "timer" or "watchdog"
rather than the abbreviation "wdt".
Signed-off-by: J. Neuschäfer <j.ne@posteo.net>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20250418-watchdog-v1-3-987ff2046272@posteo.net
|
|
The watchdog.yaml schema prescribes a node name of "timer" or "watchdog"
rather than the abbreviation "wdt".
Signed-off-by: J. Neuschäfer <j.ne@posteo.net>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20250418-watchdog-v1-2-987ff2046272@posteo.net
|
|
The watchdog.yaml schema prescribes a node name of "timer" or "watchdog"
rather than the abbreviation "wdt".
Signed-off-by: J. Neuschäfer <j.ne@posteo.net>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20250418-watchdog-v1-1-987ff2046272@posteo.net
|
|
The PUT_64[LB]E() macros need to cast the value to unsigned long long
like the GET_64[LB]E() macros. This caused lots of warnings when compiled
on 32-bit, and clobbered addresses (on the 36-bit P4080).
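To illustrate the failure mode with a hedged, generic example (the
in-tree macros differ): on 32-bit, narrowing the value before the shift
loses the upper bits of a 36-bit physical address.
unsigned long long pa = 0x123456789ULL;         /* 36-bit address on P4080 */

unsigned long truncated = (unsigned long)pa;    /* 0x23456789 on 32-bit */
unsigned int hi = (unsigned int)(pa >> 32);     /* 0x1: high word survives
                                                   only if pa stays 64-bit
                                                   wide before the shift */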
Signed-off-by: Ben Collins <bcollins@kernel.org>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/2025042122-mustard-wrasse-694572@boujee-and-buff
|