summaryrefslogtreecommitdiff
path: root/drivers/infiniband/hw
AgeCommit message (Collapse)Author
5 daysMerge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds
Pull rdma updates from Jason Gunthorpe: "This has another new RDMA driver 'bng_en' for latest generation Broadcom NICs. There might be one more new driver still to come. Otherwise it is a fairly quite cycle. Summary: - Minor driver bug fixes and updates to cxgb4, rxe, rdmavt, bnxt_re, mlx5 - Many bug fix patches for irdma - WQ_PERCPU annotations and system_dfl_wq changes - Improved mlx5 support for "other eswitches" and multiple PFs - 1600Gbps link speed reporting support. Four Digits Now! - New driver bng_en for latest generation Broadcom NICs - Bonding support for hns - Adjust mlx5's hmm based ODP to work with the very large address space created by the new 5 level paging default on x86 - Lockdep fixups in rxe and siw" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (65 commits) RDMA/rxe: reclassify sockets in order to avoid false positives from lockdep RDMA/siw: reclassify sockets in order to avoid false positives from lockdep RDMA/bng_re: Remove prefetch instruction RDMA/core: Reduce cond_resched() frequency in __ib_umem_release RDMA/irdma: Fix SRQ shadow area address initialization RDMA/irdma: Remove doorbell elision logic RDMA/irdma: Do not set IBK_LOCAL_DMA_LKEY for GEN3+ RDMA/irdma: Do not directly rely on IB_PD_UNSAFE_GLOBAL_RKEY RDMA/irdma: Add missing mutex destroy RDMA/irdma: Fix SIGBUS in AEQ destroy RDMA/irdma: Add a missing kfree of struct irdma_pci_f for GEN2 RDMA/irdma: Fix data race in irdma_free_pble RDMA/irdma: Fix data race in irdma_sc_ccq_arm RDMA/mlx5: Add support for 1600_8x lane speed RDMA/core: Add new IB rate for XDR (8x) support IB/mlx5: Reduce IMR KSM size when 5-level paging is enabled RDMA/bnxt_re: Pass correct flag for dma mr creation RDMA/bnxt_re: Fix the inline size for GenP7 devices RDMA/hns: Support reset recovery for bond RDMA/hns: Support link state reporting for bond ...
6 daysMerge tag 'net-next-6.19' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core & protocols: - Replace busylock at the Tx queuing layer with a lockless list. Resulting in a 300% (4x) improvement on heavy TX workloads, sending twice the number of packets per second, for half the cpu cycles. - Allow constantly busy flows to migrate to a more suitable CPU/NIC queue. Normally we perform queue re-selection when flow comes out of idle, but under extreme circumstances the flows may be constantly busy. Add sysctl to allow periodic rehashing even if it'd risk packet reordering. - Optimize the NAPI skb cache, make it larger, use it in more paths. - Attempt returning Tx skbs to the originating CPU (like we already did for Rx skbs). - Various data structure layout and prefetch optimizations from Eric. - Remove ktime_get() from the recvmsg() fast path, ktime_get() is sadly quite expensive on recent AMD machines. - Extend threaded NAPI polling to allow the kthread busy poll for packets. - Make MPTCP use Rx backlog processing. This lowers the lock pressure, improving the Rx performance. - Support memcg accounting of MPTCP socket memory. - Allow admin to opt sockets out of global protocol memory accounting (using a sysctl or BPF-based policy). The global limits are a poor fit for modern container workloads, where limits are imposed using cgroups. - Improve heuristics for when to kick off AF_UNIX garbage collection. - Allow users to control TCP SACK compression, and default to 33% of RTT. - Add tcp_rcvbuf_low_rtt sysctl to let datacenter users avoid unnecessarily aggressive rcvbuf growth and overshot when the connection RTT is low. - Preserve skb metadata space across skb_push / skb_pull operations. - Support for IPIP encapsulation in the nftables flowtable offload. - Support appending IP interface information to ICMP messages (RFC 5837). - Support setting max record size in TLS (RFC 8449). - Remove taking rtnl_lock from RTM_GETNEIGHTBL and RTM_SETNEIGHTBL. - Use a dedicated lock (and RCU) in MPLS, instead of rtnl_lock. - Let users configure the number of write buffers in SMC. - Add new struct sockaddr_unsized for sockaddr of unknown length, from Kees. - Some conversions away from the crypto_ahash API, from Eric Biggers. - Some preparations for slimming down struct page. - YAML Netlink protocol spec for WireGuard. - Add a tool on top of YAML Netlink specs/lib for reporting commonly computed derived statistics and summarized system state. Driver API: - Add CAN XL support to the CAN Netlink interface. - Add uAPI for reporting PHY Mean Square Error (MSE) diagnostics, as defined by the OPEN Alliance's "Advanced diagnostic features for 100BASE-T1 automotive Ethernet PHYs" specification. - Add DPLL phase-adjust-gran pin attribute (and implement it in zl3073x). - Refactor xfrm_input lock to reduce contention when NIC offloads IPsec and performs RSS. - Add info to devlink params whether the current setting is the default or a user override. Allow resetting back to default. - Add standard device stats for PSP crypto offload. - Leverage DSA frame broadcast to implement simple HSR frame duplication for a lot of switches without dedicated HSR offload. - Add uAPI defines for 1.6Tbps link modes. Device drivers: - Add Motorcomm YT921x gigabit Ethernet switch support. - Add MUCSE driver for N500/N210 1GbE NIC series. - Convert drivers to support dedicated ops for timestamping control, and away from the direct IOCTL handling. While at it support GET operations for PHY timestamping. - Add (and convert most drivers to) a dedicated ethtool callback for reading the Rx ring count. - Significant refactoring efforts in the STMMAC driver, which supports Synopsys turn-key MAC IP integrated into a ton of SoCs. - Ethernet high-speed NICs: - Broadcom (bnxt): - support PPS in/out on all pins - Intel (100G, ice, idpf): - ice: implement standard ethtool and timestamping stats - i40e: support setting the max number of MAC addresses per VF - iavf: support RSS of GTP tunnels for 5G and LTE deployments - nVidia/Mellanox (mlx5): - reduce downtime on interface reconfiguration - disable being an XDP redirect target by default (same as other drivers) to avoid wasting resources if feature is unused - Meta (fbnic): - add support for Linux-managed PCS on 25G, 50G, and 100G links - Wangxun: - support Rx descriptor merge, and Tx head writeback - support Rx coalescing offload - support 25G SPF and 40G QSFP modules - Ethernet virtual: - Google (gve): - allow ethtool to configure rx_buf_len - implement XDP HW RX Timestamping support for DQ descriptor format - Microsoft vNIC (mana): - support HW link state events - handle hardware recovery events when probing the device - Ethernet NICs consumer, and embedded: - usbnet: add support for Byte Queue Limits (BQL) - AMD (amd-xgbe): - add device selftests - NXP (enetc): - add i.MX94 support - Broadcom integrated MACs (bcmgenet, bcmasp): - bcmasp: add support for PHY-based Wake-on-LAN - Broadcom switches (b53): - support port isolation - support BCM5389/97/98 and BCM63XX ARL formats - Lantiq/MaxLinear switches: - support bridge FDB entries on the CPU port - use regmap for register access - allow user to enable/disable learning - support Energy Efficient Ethernet - support configuring RMII clock delays - add tagging driver for MaxLinear GSW1xx switches - Synopsys (stmmac): - support using the HW clock in free running mode - add Eswin EIC7700 support - add Rockchip RK3506 support - add Altera Agilex5 support - Cadence (macb): - cleanup and consolidate descriptor and DMA address handling - add EyeQ5 support - TI: - icssg-prueth: support AF_XDP - Airoha access points: - add missing Ethernet stats and link state callback - add AN7583 support - support out-of-order Tx completion processing - Power over Ethernet: - pd692x0: preserve PSE configuration across reboots - add support for TPS23881B devices - Ethernet PHYs: - Open Alliance OATC14 10BASE-T1S PHY cable diagnostic support - Support 50G SerDes and 100G interfaces in Linux-managed PHYs - micrel: - support for non PTP SKUs of lan8814 - enable in-band auto-negotiation on lan8814 - realtek: - cable testing support on RTL8224 - interrupt support on RTL8221B - motorcomm: support for PHY LEDs on YT853 - microchip: support for LAN867X Rev.D0 PHYs w/ SQI and cable diag - mscc: support for PHY LED control - CAN drivers: - m_can: add support for optional reset and system wake up - remove can_change_mtu() obsoleted by core handling - mcp251xfd: support GPIO controller functionality - Bluetooth: - add initial support for PASTa - WiFi: - split ieee80211.h file, it's way too big - improvements in VHT radiotap reporting, S1G, Channel Switch Announcement handling, rate tracking in mesh networks - improve multi-radio monitor mode support, and add a cfg80211 debugfs interface for it - HT action frame handling on 6 GHz - initial chanctx work towards NAN - MU-MIMO sniffer improvements - WiFi drivers: - RealTek (rtw89): - support USB devices RTL8852AU and RTL8852CU - initial work for RTL8922DE - improved injection support - Intel: - iwlwifi: new sniffer API support - MediaTek (mt76): - WED support for >32-bit DMA - airoha NPU support - regdomain improvements - continued WiFi7/MLO work - Qualcomm/Atheros: - ath10k: factory test support - ath11k: TX power insertion support - ath12k: BSS color change support - ath12k: statistics improvements - brcmfmac: Acer A1 840 tablet quirk - rtl8xxxu: 40 MHz connection fixes/support" * tag 'net-next-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1381 commits) net: page_pool: sanitise allocation order net: page pool: xa init with destroy on pp init net/mlx5e: Support XDP target xmit with dummy program net/mlx5e: Update XDP features in switch channels selftests/tc-testing: Test CAKE scheduler when enqueue drops packets net/sched: sch_cake: Fix incorrect qlen reduction in cake_drop wireguard: netlink: generate netlink code wireguard: uapi: generate header with ynl-gen wireguard: uapi: move flag enums wireguard: uapi: move enum wg_cmd wireguard: netlink: add YNL specification selftests: drv-net: Fix tolerance calculation in devlink_rate_tc_bw.py selftests: drv-net: Fix and clarify TC bandwidth split in devlink_rate_tc_bw.py selftests: drv-net: Set shell=True for sysfs writes in devlink_rate_tc_bw.py selftests: drv-net: Use Iperf3Runner in devlink_rate_tc_bw.py selftests: drv-net: introduce Iperf3Runner for measurement use cases selftests: drv-net: Add devlink_rate_tc_bw.py to TEST_PROGS net: ps3_gelic_net: Use napi_alloc_skb() and napi_gro_receive() Documentation: net: dsa: mention simple HSR offload helpers Documentation: net: dsa: mention availability of RedBox ...
8 daysMerge tag 'objtool-core-2025-12-01' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull objtool updates from Ingo Molnar: - klp-build livepatch module generation (Josh Poimboeuf) Introduce new objtool features and a klp-build script to generate livepatch modules using a source .patch as input. This builds on concepts from the longstanding out-of-tree kpatch project which began in 2012 and has been used for many years to generate livepatch modules for production kernels. However, this is a complete rewrite which incorporates hard-earned lessons from 12+ years of maintaining kpatch. Key improvements compared to kpatch-build: - Integrated with objtool: Leverages objtool's existing control-flow graph analysis to help detect changed functions. - Works on vmlinux.o: Supports late-linked objects, making it compatible with LTO, IBT, and similar. - Simplified code base: ~3k fewer lines of code. - Upstream: No more out-of-tree #ifdef hacks, far less cruft. - Cleaner internals: Vastly simplified logic for symbol/section/reloc inclusion and special section extraction. - Robust __LINE__ macro handling: Avoids false positive binary diffs caused by the __LINE__ macro by introducing a fix-patch-lines script which injects #line directives into the source .patch to preserve the original line numbers at compile time. - Disassemble code with libopcodes instead of running objdump (Alexandre Chartre) - Disassemble support (-d option to objtool) by Alexandre Chartre, which supports the decoding of various Linux kernel code generation specials such as alternatives: 17ef: sched_balance_find_dst_group+0x62f mov 0x34(%r9),%edx 17f3: sched_balance_find_dst_group+0x633 | <alternative.17f3> | X86_FEATURE_POPCNT 17f3: sched_balance_find_dst_group+0x633 | call 0x17f8 <__sw_hweight64> | popcnt %rdi,%rax 17f8: sched_balance_find_dst_group+0x638 cmp %eax,%edx ... jump table alternatives: 1895: sched_use_asym_prio+0x5 test $0x8,%ch 1898: sched_use_asym_prio+0x8 je 0x18a9 <sched_use_asym_prio+0x19> 189a: sched_use_asym_prio+0xa | <jump_table.189a> | JUMP 189a: sched_use_asym_prio+0xa | jmp 0x18ae <sched_use_asym_prio+0x1e> | nop2 189c: sched_use_asym_prio+0xc mov $0x1,%eax 18a1: sched_use_asym_prio+0x11 and $0x80,%ecx ... exception table alternatives: native_read_msr: 5b80: native_read_msr+0x0 mov %edi,%ecx 5b82: native_read_msr+0x2 | <ex_table.5b82> | EXCEPTION 5b82: native_read_msr+0x2 | rdmsr | resume at 0x5b84 <native_read_msr+0x4> 5b84: native_read_msr+0x4 shl $0x20,%rdx .... x86 feature flag decoding (also see the X86_FEATURE_POPCNT example in sched_balance_find_dst_group() above): 2faaf: start_thread_common.constprop.0+0x1f jne 0x2fba4 <start_thread_common.constprop.0+0x114> 2fab5: start_thread_common.constprop.0+0x25 | <alternative.2fab5> | X86_FEATURE_ALWAYS | X86_BUG_NULL_SEG 2fab5: start_thread_common.constprop.0+0x25 | jmp 0x2faba <.altinstr_aux+0x2f4> | jmp 0x4b0 <start_thread_common.constprop.0+0x3f> | nop5 2faba: start_thread_common.constprop.0+0x2a mov $0x2b,%eax ... NOP sequence shortening: 1048e2: snapshot_write_finalize+0xc2 je 0x104917 <snapshot_write_finalize+0xf7> 1048e4: snapshot_write_finalize+0xc4 nop6 1048ea: snapshot_write_finalize+0xca nop11 1048f5: snapshot_write_finalize+0xd5 nop11 104900: snapshot_write_finalize+0xe0 mov %rax,%rcx 104903: snapshot_write_finalize+0xe3 mov 0x10(%rdx),%rax ... and much more. - Function validation tracing support (Alexandre Chartre) - Various -ffunction-sections fixes (Josh Poimboeuf) - Clang AutoFDO (Automated Feedback-Directed Optimizations) support (Josh Poimboeuf) - Misc fixes and cleanups (Borislav Petkov, Chen Ni, Dylan Hatch, Ingo Molnar, John Wang, Josh Poimboeuf, Pankaj Raghav, Peter Zijlstra, Thorsten Blum) * tag 'objtool-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (129 commits) objtool: Fix segfault on unknown alternatives objtool: Build with disassembly can fail when including bdf.h objtool: Trim trailing NOPs in alternative objtool: Add wide output for disassembly objtool: Compact output for alternatives with one instruction objtool: Improve naming of group alternatives objtool: Add Function to get the name of a CPU feature objtool: Provide access to feature and flags of group alternatives objtool: Fix address references in alternatives objtool: Disassemble jump table alternatives objtool: Disassemble exception table alternatives objtool: Print addresses with alternative instructions objtool: Disassemble group alternatives objtool: Print headers for alternatives objtool: Preserve alternatives order objtool: Add the --disas=<function-pattern> action objtool: Do not validate IBT for .return_sites and .call_sites objtool: Improve tracing of alternative instructions objtool: Add functions to better name alternatives objtool: Identify the different types of alternatives ...
13 daysRDMA/bng_re: Remove prefetch instructionLeon Romanovsky
The prefetch instruction is meant to speed up access to memory referenced by its address argument. In the bng_re code path, it has no effect, because the pointer refers to a function that is executed immediately afterward. The issue was identified due to the following kbuild compilation error: drivers/infiniband/hw/bng_re/bng_fw.c: In function 'bng_re_creq_irq': drivers/infiniband/hw/bng_re/bng_fw.c:278:9: error: implicit declaration of function 'prefetch' [-Wimplicit-function-declaration] 278 | prefetch(bng_re_get_qe(hwq, sw_cons, NULL)); | ^~~~~~~~ Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202511260607.Kuxn4NnN-lkp@intel.com/ Fixes: 4f830cd8d7fe ("RDMA/bng_re: Add infrastructure for enabling Firmware channel") Link: https://patch.msgid.link/20251126-remove-prefetch-v1-1-fcac22007ea7@nvidia.com Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
14 daysRDMA/irdma: Fix SRQ shadow area address initializationJijun Wang
Fix SRQ shadow area address initialization. Fixes: 563e1feb5f6e ("RDMA/irdma: Add SRQ support") Signed-off-by: Jijun Wang <jijun.wang@intel.com> Signed-off-by: Jay Bhat <jay.bhat@intel.com> Link: https://patch.msgid.link/20251125025350.180-10-tatyana.e.nikolova@intel.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
14 daysRDMA/irdma: Remove doorbell elision logicJacob Moroni
In some cases, this logic can result in doorbell writes being skipped when they should not have been (at least on GEN3 HW), so remove it. This also means that the mb() can be safely downgraded to dma_wmb(). Fixes: 551c46edc769 ("RDMA/irdma: Add user/kernel shared libraries") Signed-off-by: Jacob Moroni <jmoroni@google.com> Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com> Link: https://patch.msgid.link/20251125025350.180-9-tatyana.e.nikolova@intel.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
14 daysRDMA/irdma: Do not set IBK_LOCAL_DMA_LKEY for GEN3+Jacob Moroni
The GEN3 hardware does not appear to support IBK_LOCAL_DMA_LKEY. Attempts to use it will result in an AE. Fixes: eb31dfc2b41a ("RDMA/irdma: Restrict Memory Window and CQE Timestamping to GEN3") Signed-off-by: Jacob Moroni <jmoroni@google.com> Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com> Link: https://patch.msgid.link/20251125025350.180-8-tatyana.e.nikolova@intel.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
14 daysRDMA/irdma: Do not directly rely on IB_PD_UNSAFE_GLOBAL_RKEYJacob Moroni
The HW disables bounds checking for MRs with a length of zero, so the driver will only allow a zero length MR if the "all_memory" flag is set, and this flag is only set if IB_PD_UNSAFE_GLOBAL_RKEY is set for the PD. This means that the "get_dma_mr" method will currently fail unless the IB_PD_UNSAFE_GLOBAL_RKEY flag is set. This has not been an issue because the "get_dma_mr" method is only ever invoked if the device does not support the local DMA key or if IB_PD_UNSAFE_GLOBAL_RKEY is set, and so far, all IRDMA HW supports the local DMA lkey. However, some new HW does not support the local DMA lkey, so the "get_dma_mr" method needs to work without IB_PD_UNSAFE_GLOBAL_RKEY being set. To support HW that does not allow the local DMA lkey, the logic has been changed to pass an explicit flag to indicate when a dma_mr is being created so that the zero length will be allowed. Also, the "all_memory" flag has been forced to false for normal MR allocation since these MRs are never supposed to provide global unsafe rkey semantics anyway; only the MR created with "get_dma_mr" should support this. Fixes: bb6d73d9add6 ("RDMA/irdma: Prevent zero-length STAG registration") Signed-off-by: Jacob Moroni <jmoroni@google.com> Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com> Link: https://patch.msgid.link/20251125025350.180-7-tatyana.e.nikolova@intel.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
14 daysRDMA/irdma: Add missing mutex destroyAnil Samal
Add missing destroy of ah_tbl_lock and vchnl_mutex. Fixes: d5edd33364a5 ("RDMA/irdma: RDMA/irdma: Add GEN3 core driver support") Signed-off-by: Anil Samal <anil.samal@intel.com> Signed-off-by: Krzysztof Czurylo <krzysztof.czurylo@intel.com> Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com> Link: https://patch.msgid.link/20251125025350.180-6-tatyana.e.nikolova@intel.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
14 daysRDMA/irdma: Fix SIGBUS in AEQ destroyKrzysztof Czurylo
Removes write to IRDMA_PFINT_AEQCTL register prior to destroying AEQ, as this register does not exist in GEN3+ hardware and this kind of IRQ configuration is no longer required. Fixes: b800e82feba7 ("RDMA/irdma: Add GEN3 support for AEQ and CEQ") Signed-off-by: Krzysztof Czurylo <krzysztof.czurylo@intel.com> Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com> Link: https://patch.msgid.link/20251125025350.180-5-tatyana.e.nikolova@intel.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
14 daysRDMA/irdma: Add a missing kfree of struct irdma_pci_f for GEN2Tatyana Nikolova
During a refactor of the irdma GEN2 code, the kfree of the irdma_pci_f struct in icrdma_remove(), which was originally introduced upstream as part of commit 80f2ab46c2ee ("irdma: free iwdev->rf after removing MSI-X") was accidentally removed. Fixes: 0c2b80cac96e ("RDMA/irdma: Refactor GEN2 auxiliary driver") Signed-off-by: Krzysztof Czurylo <krzysztof.czurylo@intel.com> Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com> Link: https://patch.msgid.link/20251125025350.180-4-tatyana.e.nikolova@intel.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
14 daysRDMA/irdma: Fix data race in irdma_free_pbleKrzysztof Czurylo
Protects pble_rsrc counters with mutex to prevent data race. Fixes the following data race in irdma_free_pble reported by KCSAN: BUG: KCSAN: data-race in irdma_free_pble [irdma] / irdma_free_pble [irdma] write to 0xffff91430baa0078 of 8 bytes by task 16956 on cpu 5: irdma_free_pble+0x3b/0xb0 [irdma] irdma_dereg_mr+0x108/0x110 [irdma] ib_dereg_mr_user+0x74/0x160 [ib_core] uverbs_free_mr+0x26/0x30 [ib_uverbs] destroy_hw_idr_uobject+0x4a/0x90 [ib_uverbs] uverbs_destroy_uobject+0x7b/0x330 [ib_uverbs] uobj_destroy+0x61/0xb0 [ib_uverbs] ib_uverbs_run_method+0x1f2/0x380 [ib_uverbs] ib_uverbs_cmd_verbs+0x365/0x440 [ib_uverbs] ib_uverbs_ioctl+0x111/0x190 [ib_uverbs] __x64_sys_ioctl+0xc9/0x100 do_syscall_64+0x44/0xa0 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 read to 0xffff91430baa0078 of 8 bytes by task 16953 on cpu 2: irdma_free_pble+0x23/0xb0 [irdma] irdma_dereg_mr+0x108/0x110 [irdma] ib_dereg_mr_user+0x74/0x160 [ib_core] uverbs_free_mr+0x26/0x30 [ib_uverbs] destroy_hw_idr_uobject+0x4a/0x90 [ib_uverbs] uverbs_destroy_uobject+0x7b/0x330 [ib_uverbs] uobj_destroy+0x61/0xb0 [ib_uverbs] ib_uverbs_run_method+0x1f2/0x380 [ib_uverbs] ib_uverbs_cmd_verbs+0x365/0x440 [ib_uverbs] ib_uverbs_ioctl+0x111/0x190 [ib_uverbs] __x64_sys_ioctl+0xc9/0x100 do_syscall_64+0x44/0xa0 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 value changed: 0x0000000000005a62 -> 0x0000000000005a68 Fixes: e8c4dbc2fcac ("RDMA/irdma: Add PBLE resource manager") Signed-off-by: Krzysztof Czurylo <krzysztof.czurylo@intel.com> Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com> Link: https://patch.msgid.link/20251125025350.180-3-tatyana.e.nikolova@intel.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
14 daysRDMA/irdma: Fix data race in irdma_sc_ccq_armKrzysztof Czurylo
Adds a lock around irdma_sc_ccq_arm body to prevent inter-thread data race. Fixes data race in irdma_sc_ccq_arm() reported by KCSAN: BUG: KCSAN: data-race in irdma_sc_ccq_arm [irdma] / irdma_sc_ccq_arm [irdma] read to 0xffff9d51b4034220 of 8 bytes by task 255 on cpu 11: irdma_sc_ccq_arm+0x36/0xd0 [irdma] irdma_cqp_ce_handler+0x300/0x310 [irdma] cqp_compl_worker+0x2a/0x40 [irdma] process_one_work+0x402/0x7e0 worker_thread+0xb3/0x6d0 kthread+0x178/0x1a0 ret_from_fork+0x2c/0x50 write to 0xffff9d51b4034220 of 8 bytes by task 89 on cpu 3: irdma_sc_ccq_arm+0x7e/0xd0 [irdma] irdma_cqp_ce_handler+0x300/0x310 [irdma] irdma_wait_event+0xd4/0x3e0 [irdma] irdma_handle_cqp_op+0xa5/0x220 [irdma] irdma_hw_flush_wqes+0xb1/0x300 [irdma] irdma_flush_wqes+0x22e/0x3a0 [irdma] irdma_cm_disconn_true+0x4c7/0x5d0 [irdma] irdma_disconnect_worker+0x35/0x50 [irdma] process_one_work+0x402/0x7e0 worker_thread+0xb3/0x6d0 kthread+0x178/0x1a0 ret_from_fork+0x2c/0x50 value changed: 0x0000000000024000 -> 0x0000000000034000 Fixes: 3f49d6842569 ("RDMA/irdma: Implement HW Admin Queue OPs") Signed-off-by: Krzysztof Czurylo <krzysztof.czurylo@intel.com> Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com> Link: https://patch.msgid.link/20251125025350.180-2-tatyana.e.nikolova@intel.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24RDMA/mlx5: Add support for 1600_8x lane speedMaher Sanalla
Add a check for 1600G_8X link speed when querying PTYS and report it back correctly when needed. While at it, adjust mlx5 function which maps the speed rate from IB spec values to internal driver values to be able to handle speeds up to 1600Gbps. Reviewed-by: Michael Guralnik <michaelgur@nvidia.com> Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Link: https://patch.msgid.link/20251120-speed-8-v1-2-e6a7efef8cb8@nvidia.com Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24IB/mlx5: Reduce IMR KSM size when 5-level paging is enabledYishai Hadas
Enabling 5-level paging (LA57) increases TASK_SIZE on x86_64 from 2^47 to 2^56. This affects implicit ODP, which uses TASK_SIZE to calculate the number of IMR KSM entries. As a result, the number of entries and the memory usage for KSM mkeys increase drastically: - With 2^47 TASK_SIZE: 0x20000 entries (~2MB) - With 2^56 TASK_SIZE: 0x4000000 entries (~1GB) This issue could happen previously on systems with LA57 manually enabled, but now commit 7212b58d6d71 ("x86/mm/64: Make 5-level paging support unconditional") enables LA57 by default on all supported systems. This makes the issue impact widespread. To mitigate this, increase the size each MTT entry maps from 1GB to 16GB when 5-level paging is enabled. This reduces the number of KSM entries and lowers the memory usage on LA57 systems from 1GB to 64MB per IMR. As now 'mlx5_imr_mtt_size' is larger than 32 bits, we move to use u64 instead of int as part of populate_klm() to prevent overflow of the 'step' variable. In addition, as populate_klm() actually handles KSM and not KLM, as it's used only by implicit ODP, we renamed its signature and the internal structures accordingly while dropping the byte_count handling which is not relevant in KSM. The page size in KSM is fixed for all the entries and come from the log_page_size of the mkey. Note: On platforms where the calculated value for 'mlx5_imr_ksm_page_shift' is higher than the max firmware cap to be changed over UMR, or that the calculated value for 'log_va_pages' is higher than what we may expect, the implicit ODP cap will be simply turned off. Co-developed-by: Or Har-Toov <ohartoov@nvidia.com> Signed-off-by: Or Har-Toov <ohartoov@nvidia.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Michael Guralnik <michaelgur@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com> Link: https://patch.msgid.link/20251120-reduce-ksm-v1-1-6864bfc814dc@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24RDMA/bnxt_re: Pass correct flag for dma mr creationSelvin Xavier
DMA MR doesn't use the unified MR model. So the lkey passed on to the reg_mr command to FW should contain the correct lkey. Driver is incorrectly over writing the lkey with pdid and firmware commands fails due to this. Avoid passing the wrong key for cases where the unified MR registration is not used. Fixes: f786eebbbefa ("RDMA/bnxt_re: Avoid an extra hwrm per MR creation") Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com> Link: https://patch.msgid.link/1763624215-10382-2-git-send-email-selvin.xavier@broadcom.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24RDMA/bnxt_re: Fix the inline size for GenP7 devicesSelvin Xavier
Inline size supported by the device is based on the number of SGEs supported by the adapter. Change the inline size calculation based on that. Fixes: de1d364c3815 ("RDMA/bnxt_re: Add support for Variable WQE in Genp7 adapters") Reviewed-by: Kashyap Desai <kashyap.desai@broadcom.com> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com> Link: https://patch.msgid.link/1763624215-10382-1-git-send-email-selvin.xavier@broadcom.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24RDMA/hns: Support reset recovery for bondJunxian Huang
Re-set bond configuration to HW after HW reset. Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com> Link: https://patch.msgid.link/20251112093510.3696363-9-huangjunxian6@hisilicon.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24RDMA/hns: Support link state reporting for bondJunxian Huang
The link state of bond depends on the upper device. Adapt current link state querying flow and ib_event dispatching flow to report correct link state of bond. Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com> Link: https://patch.msgid.link/20251112093510.3696363-8-huangjunxian6@hisilicon.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24RDMA/hns: Add delayed work for bondingJunxian Huang
When conditions are met, schedule a delayed work in bond event handler to perform bonding operation according to the bond state. In the case of changing slave number or link state, re-set the netdev for the bond ibdev after the modification is complete, since these two operations may not call hns_roce_set_bond_netdev() in hns_roce_init(). The delayed work will be paused when there is a driver reset or exit to avoid concurrency. Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com> Link: https://patch.msgid.link/20251112093510.3696363-7-huangjunxian6@hisilicon.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24RDMA/hns: Implement bonding init/uninit processJunxian Huang
Implement hns_roce_slave_init() and hns_roce_slave_uninit() for device init/uninit in bonding cases. The former is used to initialize a slave ibdev (when the slave is unlinked from a bond) or a bond ibdev, while the latter does the opposite. Most of the process is the same as regular device init/uninit, while some bonding‑specific steps below are also added. In bond device init flow, choose one slave to re-initialize as the main_hr_dev of the bond, and it will be the only device presented for multiple slaves. During registration, set and active netdev to the ibdev based on the link state of the slaves. When this main_hr_dev slave is being unlinked while the bond is still valid, choose a new slave from the rest and initialize it as the new bond device. In uninit flow, add a bond cleanup process, restore all the other slaves and clean up bond resource. This is only for the case where the port of main_hr_dev is directly removed without unlinking it from bond. Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com> Link: https://patch.msgid.link/20251112093510.3696363-6-huangjunxian6@hisilicon.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24RDMA/hns: Add bonding cmdsJunxian Huang
Add three bonding cmds to configure bonding settings to HW. Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com> Link: https://patch.msgid.link/20251112093510.3696363-5-huangjunxian6@hisilicon.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24RDMA/hns: Add bonding event handlerJunxian Huang
Register netdev notifier for two bonding events NETDEV_CHANGEUPPER and NETDEV_CHANGELOWERSTATE. In NETDEV_CHANGEUPPER event handler, check some rules about the HW constraints when trying to link a new slave to the masteri, and store some bonding information from the notifier. In unlinking case, simply check the number of the rest slaves to decide whether the bond is still supported. In NETDEV_CHANGELOWERSTATE event handler, not much is done. It simply sets the bond state when the bond is ready, which will be used later. Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com> Link: https://patch.msgid.link/20251112093510.3696363-4-huangjunxian6@hisilicon.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24RDMA/hns: Initialize bonding resourcesJunxian Huang
Allocate bond_grp resources for each card when the first device in this card is registered. Block the initialization of VF when its PF is a bonded slave, as VF is not supported in this case due to HW constraints. Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com> Link: https://patch.msgid.link/20251112093510.3696363-3-huangjunxian6@hisilicon.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24RDMA/hns: Add helpers to obtain netdev and bus_num from hr_devJunxian Huang
Add helpers to obtain netdev and bus_num from hr_dev. Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com> Link: https://patch.msgid.link/20251112093510.3696363-2-huangjunxian6@hisilicon.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24RDMA/bng_re: Initialize the Firmware and HardwareSiva Reddy Kallam
Initialize the firmware and hardware with HWRM command. Signed-off-by: Siva Reddy Kallam <siva.kallam@broadcom.com> Link: https://patch.msgid.link/20251117171136.128193-9-siva.kallam@broadcom.com Reviewed-by: Usman Ansari <usman.ansari@broadcom.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24RDMA/bng_re: Add basic debugfs infrastructureSiva Reddy Kallam
Add basic debugfs infrastructure for Broadcom next generation controller. Signed-off-by: Siva Reddy Kallam <siva.kallam@broadcom.com> Link: https://patch.msgid.link/20251117171136.128193-8-siva.kallam@broadcom.com Reviewed-by: Usman Ansari <usman.ansari@broadcom.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24RDMA/bng_re: Enable Firmware channel and query device attributesSiva Reddy Kallam
Enable Firmware channel and query device attributes Signed-off-by: Siva Reddy Kallam <siva.kallam@broadcom.com> Link: https://patch.msgid.link/20251117171136.128193-7-siva.kallam@broadcom.com Reviewed-by: Usman Ansari <usman.ansari@broadcom.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24RDMA/bng_re: Add infrastructure for enabling Firmware channelSiva Reddy Kallam
Add infrastructure for enabling Firmware channel. Signed-off-by: Siva Reddy Kallam <siva.kallam@broadcom.com> Link: https://patch.msgid.link/20251117171136.128193-6-siva.kallam@broadcom.com Reviewed-by: Usman Ansari <usman.ansari@broadcom.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24RDMA/bng_re: Allocate required memory resources for Firmware channelSiva Reddy Kallam
Allocate required memory resources for Firmware channel. Signed-off-by: Siva Reddy Kallam <siva.kallam@broadcom.com> Link: https://patch.msgid.link/20251117171136.128193-5-siva.kallam@broadcom.com Reviewed-by: Usman Ansari <usman.ansari@broadcom.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24RDMA/bng_re: Register and get the resources from bnge driverSiva Reddy Kallam
Register and get the basic required resources from bnge driver. Signed-off-by: Siva Reddy Kallam <siva.kallam@broadcom.com> Link: https://patch.msgid.link/20251117171136.128193-4-siva.kallam@broadcom.com Reviewed-by: Usman Ansari <usman.ansari@broadcom.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24RDMA/bng_re: Add Auxiliary interfaceSiva Reddy Kallam
Add basic Auxiliary interface to the driver which supports the BCM5770X NIC family. Signed-off-by: Siva Reddy Kallam <siva.kallam@broadcom.com> Link: https://patch.msgid.link/20251117171136.128193-3-siva.kallam@broadcom.com Reviewed-by: Usman Ansari <usman.ansari@broadcom.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-20RDMA/bnxt_re: Fix wrong check for CQ coalesc supportKalesh AP
Driver is not creating the debugfs hooks for CQ coalesc parameters because of a wrong check. Fixed the condition check inside bnxt_re_init_cq_coal_debugfs(). Fixes: cf2749079011 ("RDMA/bnxt_re: Add a debugfs entry for CQE coalescing tuning") Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Link: https://patch.msgid.link/20251117061306.1140588-1-kalesh-anakkur.purayil@broadcom.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-14Merge branch 'mlx5-next' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux Tariq Toukan says: ==================== mlx5-next updates 2025-11-13 The following pull-request contains common mlx5 updates * 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux: net/mlx5: Expose definition for 1600Gbps link mode net/mlx5: fs, set non default device per namespace net/mlx5: fs, Add other_eswitch support for steering tables net/mlx5: Add OTHER_ESWITCH HW capabilities net/mlx5: Add direct ST mode support for RDMA PCI/TPH: Expose pcie_tph_get_st_table_loc() {rdma,net}/mlx5: Query vports mac address from device ==================== Link: https://patch.msgid.link/1763027252-1168760-1-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.18-rc6). No conflicts, adjacent changes in: drivers/net/phy/micrel.c 96a9178a29a6 ("net: phy: micrel: lan8814 fix reset of the QSGMII interface") 61b7ade9ba8c ("net: phy: micrel: Add support for non PTP SKUs for lan8814") and a trivial one in tools/testing/selftests/drivers/net/Makefile. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-13Merge tag 'v6.18-rc5' into objtool/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-11-12RDMA/irdma: Remove redundant NULL check of udata in irdma_create_user_ah()Tuo Li
The variable udata cannot be NULL because irdma_create_user_ah() always receives it. Therefore, the if() check can be safely removed. Signed-off-by: Tuo Li <islituo@gmail.com> Link: https://patch.msgid.link/20251112120253.68945-1-islituo@gmail.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-11mlx5: Fix default values in create CQAkiva Goldberger
Currently, CQs without a completion function are assigned the mlx5_add_cq_to_tasklet function by default. This is problematic since only user CQs created through the mlx5_ib driver are intended to use this function. Additionally, all CQs that will use doorbells instead of polling for completions must call mlx5_cq_arm. However, the default CQ creation flow leaves a valid value in the CQ's arm_db field, allowing FW to send interrupts to polling-only CQs in certain corner cases. These two factors would allow a polling-only kernel CQ to be triggered by an EQ interrupt and call a completion function intended only for user CQs, causing a null pointer exception. Some areas in the driver have prevented this issue with one-off fixes but did not address the root cause. This patch fixes the described issue by adding defaults to the create CQ flow. It adds a default dummy completion function to protect against null pointer exceptions, and it sets an invalid command sequence number by default in kernel CQs to prevent the FW from sending an interrupt to the CQ until it is armed. User CQs are responsible for their own initialization values. Callers of mlx5_core_create_cq are responsible for changing the completion function and arming the CQ per their needs. Fixes: cdd04f4d4d71 ("net/mlx5: Add support to create SQ and CQ for ASO") Signed-off-by: Akiva Goldberger <agoldberger@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Acked-by: Leon Romanovsky <leon@kernel.org> Link: https://patch.msgid.link/1762681743-1084694-1-git-send-email-tariqt@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-09RDMA/irdma: Remove unused CQ registryJacob Moroni
The CQ registry was never actually used (ceq->reg_cq was always NULL), so remove the dead code. Signed-off-by: Jacob Moroni <jmoroni@google.com> Link: https://patch.msgid.link/20251105162841.31786-1-jmoroni@google.com Acked-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2025-11-09RDMA/mlx5: Add other eswitch support to userspace tablesPatrisious Haddad
Allows the creation of RDMA TRANSPORT tables over VFs/SFs that belong to another eswitch manager. Which is only possible for PFs that were connected via a create_lag PRM command. Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com> Link: https://patch.msgid.link/20251029-support-other-eswitch-v1-7-98bb707b5d57@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-09RDMA/mlx5: Refactor _get_prio() functionPatrisious Haddad
Refactor the _get_prio() function to remove redundant arguments by reusing the existing flow table attributes struct instead of passing attributes separately. This improves code clarity and maintainability. In addition allows downstream patch to add new parameter without needing to change __get_prio() arguments. Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com> Link: https://patch.msgid.link/20251029-support-other-eswitch-v1-6-98bb707b5d57@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-09RDMA/mlx5: Add other_eswitch support for devx destructionPatrisious Haddad
When building a devx object destruction command for steering objects add consideration for other_eswitch argument to allow proper destruction for objects that were created with it. Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com> Link: https://patch.msgid.link/20251029-support-other-eswitch-v1-5-98bb707b5d57@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-09RDMA/mlx5: Change default device for LAG slaves in RDMA TRANSPORT namespacesPatrisious Haddad
In case of a LAG configuration change the root namespace core device for all of the LAG slaves to be the core device of the master device for RDMA_TRANSPORT namespaces, in order to ensure all tables are created through the master device. Once the LAG is disabled revert back to the native core device. Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com> Link: https://patch.msgid.link/20251029-support-other-eswitch-v1-4-98bb707b5d57@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-09Add other eswitch supportLeon Romanovsky
When the device in switchdev mode, the RDMA device manages all the vports which belong to its representors, which can lead to a situation where the PF that is used to manage the RDMA device isn't the native PF of some of the vports it manages. Add infrastructure to allow the master PF to manage all the hardware resources for the vports under its management. Whereas currently the only such resource is RDMA TRANSPORT steering domains. That is done by adding new FW argument other_eswitch which is passed by the driver to the FW to allow the master PF to properly manage vports belonging to other native PF. Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-09RDMA/bnxt_re: Add a debugfs entry for CQE coalescing tuningKalesh AP
This patch adds debugfs interfaces that allows the user to enable/disable the RoCE CQ coalescing and fine tune certain CQ coalescing parameters which would be helpful during debug. Signed-off-by: Damodharam Ammepalli <damodharam.ammepalli@broadcom.com> Reviewed-by: Hongguang Gao <hongguang.gao@broadcom.com> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Link: https://patch.msgid.link/20251103043425.234846-1-kalesh-anakkur.purayil@broadcom.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.18-rc5). Conflicts: drivers/net/wireless/ath/ath12k/mac.c 9222582ec524 ("Revert "wifi: ath12k: Fix missing station power save configuration"") 6917e268c433 ("wifi: ath12k: Defer vdev bring-up until CSA finalize to avoid stale beacon") https://lore.kernel.org/11cece9f7e36c12efd732baa5718239b1bf8c950.camel@sipsolutions.net Adjacent changes: drivers/net/ethernet/intel/Kconfig b1d16f7c0063 ("libie: depend on DEBUG_FS when building LIBIE_FWLOG") 93f53db9f9dc ("ice: switch to Page Pool") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-06RDMA/mlx4: WQ_PERCPU added to alloc_workqueue usersMarco Crivellari
Currently if a user enqueue a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This lack of consistentcy cannot be addressed without refactoring the API. alloc_workqueue() treats all queues as per-CPU by default, while unbound workqueues must opt-in via WQ_UNBOUND. This default is suboptimal: most workloads benefit from unbound queues, allowing the scheduler to place worker threads where they’re needed and reducing noise when CPUs are isolated. This change adds a new WQ_PERCPU flag to explicitly request alloc_workqueue() to be per-cpu when WQ_UNBOUND has not been specified. With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND), any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND must now use WQ_PERCPU. Once migration is complete, WQ_UNBOUND can be removed and unbound will become the implicit default. CC: Yishai Hadas <yishaih@nvidia.com> Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Link: https://patch.msgid.link/20251101163121.78400-5-marco.crivellari@suse.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-06hfi1: WQ_PERCPU added to alloc_workqueue usersMarco Crivellari
Currently if a user enqueue a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This lack of consistentcy cannot be addressed without refactoring the API. alloc_workqueue() treats all queues as per-CPU by default, while unbound workqueues must opt-in via WQ_UNBOUND. This default is suboptimal: most workloads benefit from unbound queues, allowing the scheduler to place worker threads where they’re needed and reducing noise when CPUs are isolated. This change adds a new WQ_PERCPU flag to explicitly request alloc_workqueue() to be per-cpu when WQ_UNBOUND has not been specified. With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND), any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND must now use WQ_PERCPU. Once migration is complete, WQ_UNBOUND can be removed and unbound will become the implicit default. CC: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Link: https://patch.msgid.link/20251101163121.78400-4-marco.crivellari@suse.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-06RDMA/core: RDMA/mlx5: replace use of system_unbound_wq with system_dfl_wqMarco Crivellari
Currently if a user enqueue a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This lack of consistency cannot be addressed without refactoring the API. system_unbound_wq should be the default workqueue so as not to enforce locality constraints for random work whenever it's not required. Adding system_dfl_wq to encourage its use when unbound work should be used. The old system_unbound_wq will be kept for a few release cycles. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Link: https://patch.msgid.link/20251101163121.78400-2-marco.crivellari@suse.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-06RDMA/irdma: Take a lock before moving SRQ tail in poll_cqJay Bhat
Need to take an SRQ lock in poll_cq before moving SRQ tail. Signed-off-by: Jay Bhat <jay.bhat@intel.com> Reviewed-by: Jacob Moroni <jmoroni@google.com> Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com> Link: https://patch.msgid.link/20251031021726.1003-7-tatyana.e.nikolova@intel.com Signed-off-by: Leon Romanovsky <leon@kernel.org>