| Age | Commit message (Collapse) | Author |
|
Pull rdma updates from Jason Gunthorpe:
"This has another new RDMA driver 'bng_en' for latest generation
Broadcom NICs. There might be one more new driver still to come.
Otherwise it is a fairly quite cycle. Summary:
- Minor driver bug fixes and updates to cxgb4, rxe, rdmavt, bnxt_re,
mlx5
- Many bug fix patches for irdma
- WQ_PERCPU annotations and system_dfl_wq changes
- Improved mlx5 support for "other eswitches" and multiple PFs
- 1600Gbps link speed reporting support. Four Digits Now!
- New driver bng_en for latest generation Broadcom NICs
- Bonding support for hns
- Adjust mlx5's hmm based ODP to work with the very large address
space created by the new 5 level paging default on x86
- Lockdep fixups in rxe and siw"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (65 commits)
RDMA/rxe: reclassify sockets in order to avoid false positives from lockdep
RDMA/siw: reclassify sockets in order to avoid false positives from lockdep
RDMA/bng_re: Remove prefetch instruction
RDMA/core: Reduce cond_resched() frequency in __ib_umem_release
RDMA/irdma: Fix SRQ shadow area address initialization
RDMA/irdma: Remove doorbell elision logic
RDMA/irdma: Do not set IBK_LOCAL_DMA_LKEY for GEN3+
RDMA/irdma: Do not directly rely on IB_PD_UNSAFE_GLOBAL_RKEY
RDMA/irdma: Add missing mutex destroy
RDMA/irdma: Fix SIGBUS in AEQ destroy
RDMA/irdma: Add a missing kfree of struct irdma_pci_f for GEN2
RDMA/irdma: Fix data race in irdma_free_pble
RDMA/irdma: Fix data race in irdma_sc_ccq_arm
RDMA/mlx5: Add support for 1600_8x lane speed
RDMA/core: Add new IB rate for XDR (8x) support
IB/mlx5: Reduce IMR KSM size when 5-level paging is enabled
RDMA/bnxt_re: Pass correct flag for dma mr creation
RDMA/bnxt_re: Fix the inline size for GenP7 devices
RDMA/hns: Support reset recovery for bond
RDMA/hns: Support link state reporting for bond
...
|
|
Add a check for 1600G_8X link speed when querying PTYS and report it
back correctly when needed.
While at it, adjust mlx5 function which maps the speed rate from IB
spec values to internal driver values to be able to handle speeds
up to 1600Gbps.
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Link: https://patch.msgid.link/20251120-speed-8-v1-2-e6a7efef8cb8@nvidia.com
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Enabling 5-level paging (LA57) increases TASK_SIZE on x86_64 from 2^47
to 2^56. This affects implicit ODP, which uses TASK_SIZE to calculate
the number of IMR KSM entries.
As a result, the number of entries and the memory usage for KSM mkeys
increase drastically:
- With 2^47 TASK_SIZE: 0x20000 entries (~2MB)
- With 2^56 TASK_SIZE: 0x4000000 entries (~1GB)
This issue could happen previously on systems with LA57 manually
enabled, but now commit 7212b58d6d71 ("x86/mm/64: Make 5-level paging
support unconditional") enables LA57 by default on all supported
systems. This makes the issue impact widespread.
To mitigate this, increase the size each MTT entry maps from 1GB to 16GB
when 5-level paging is enabled. This reduces the number of KSM entries
and lowers the memory usage on LA57 systems from 1GB to 64MB per IMR.
As now 'mlx5_imr_mtt_size' is larger than 32 bits, we move to use u64
instead of int as part of populate_klm() to prevent overflow of the
'step' variable.
In addition, as populate_klm() actually handles KSM and not KLM, as it's
used only by implicit ODP, we renamed its signature and the internal
structures accordingly while dropping the byte_count handling which is
not relevant in KSM. The page size in KSM is fixed for all the entries
and come from the log_page_size of the mkey.
Note:
On platforms where the calculated value for 'mlx5_imr_ksm_page_shift' is
higher than the max firmware cap to be changed over UMR, or that the
calculated value for 'log_va_pages' is higher than what we may expect,
the implicit ODP cap will be simply turned off.
Co-developed-by: Or Har-Toov <ohartoov@nvidia.com>
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20251120-reduce-ksm-v1-1-6864bfc814dc@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
Tariq Toukan says:
====================
mlx5-next updates 2025-11-13
The following pull-request contains common mlx5 updates
* 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
net/mlx5: Expose definition for 1600Gbps link mode
net/mlx5: fs, set non default device per namespace
net/mlx5: fs, Add other_eswitch support for steering tables
net/mlx5: Add OTHER_ESWITCH HW capabilities
net/mlx5: Add direct ST mode support for RDMA
PCI/TPH: Expose pcie_tph_get_st_table_loc()
{rdma,net}/mlx5: Query vports mac address from device
====================
Link: https://patch.msgid.link/1763027252-1168760-1-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Currently, CQs without a completion function are assigned the
mlx5_add_cq_to_tasklet function by default. This is problematic since
only user CQs created through the mlx5_ib driver are intended to use
this function.
Additionally, all CQs that will use doorbells instead of polling for
completions must call mlx5_cq_arm. However, the default CQ creation flow
leaves a valid value in the CQ's arm_db field, allowing FW to send
interrupts to polling-only CQs in certain corner cases.
These two factors would allow a polling-only kernel CQ to be triggered
by an EQ interrupt and call a completion function intended only for user
CQs, causing a null pointer exception.
Some areas in the driver have prevented this issue with one-off fixes
but did not address the root cause.
This patch fixes the described issue by adding defaults to the create CQ
flow. It adds a default dummy completion function to protect against
null pointer exceptions, and it sets an invalid command sequence number
by default in kernel CQs to prevent the FW from sending an interrupt to
the CQ until it is armed. User CQs are responsible for their own
initialization values.
Callers of mlx5_core_create_cq are responsible for changing the
completion function and arming the CQ per their needs.
Fixes: cdd04f4d4d71 ("net/mlx5: Add support to create SQ and CQ for ASO")
Signed-off-by: Akiva Goldberger <agoldberger@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Acked-by: Leon Romanovsky <leon@kernel.org>
Link: https://patch.msgid.link/1762681743-1084694-1-git-send-email-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Allows the creation of RDMA TRANSPORT tables over VFs/SFs that
belong to another eswitch manager. Which is only possible for PFs that
were connected via a create_lag PRM command.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20251029-support-other-eswitch-v1-7-98bb707b5d57@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Refactor the _get_prio() function to remove redundant arguments by
reusing the existing flow table attributes struct instead of passing
attributes separately. This improves code clarity and maintainability.
In addition allows downstream patch to add new parameter without
needing to change __get_prio() arguments.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20251029-support-other-eswitch-v1-6-98bb707b5d57@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
When building a devx object destruction command for steering objects add
consideration for other_eswitch argument to allow proper destruction for
objects that were created with it.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20251029-support-other-eswitch-v1-5-98bb707b5d57@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
In case of a LAG configuration change the root namespace core device for
all of the LAG slaves to be the core device of the master device for
RDMA_TRANSPORT namespaces, in order to ensure all tables are created
through the master device.
Once the LAG is disabled revert back to the native core device.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20251029-support-other-eswitch-v1-4-98bb707b5d57@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
When the device in switchdev mode, the RDMA device manages all the
vports which belong to its representors, which can lead to a situation
where the PF that is used to manage the RDMA device isn't the native PF
of some of the vports it manages.
Add infrastructure to allow the master PF to manage all the hardware
resources for the vports under its management.
Whereas currently the only such resource is RDMA TRANSPORT steering
domains.
That is done by adding new FW argument other_eswitch which is passed by
the driver to the FW to allow the master PF to properly manage vports
belonging to other native PF.
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Currently if a user enqueue a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.
system_unbound_wq should be the default workqueue so as not to enforce
locality constraints for random work whenever it's not required.
Adding system_dfl_wq to encourage its use when unbound work should be used.
The old system_unbound_wq will be kept for a few release cycles.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20251101163121.78400-2-marco.crivellari@suse.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Before this patch during either switchdev or legacy mode enablement we
cleared the mac address of vports between changes. This change allows us
to preserve the vports mac address between eswitch mode changes.
Vports hold information for VFs/SFs such as the permanent mac address.
VF/SF mac can be set either by iproute vf interface or devlink function
interface. For no obvious reason we reset it to 0 on switchdev/legacy
mode changes, this patch is fixing that, to align with other vport
information that are never reset, e.g GUID,mtu,promisc mode, etc ..
Signed-off-by: Adithya Jayachandran <ajayachandra@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Acked-by: Leon Romanovsky <leon@kernel.org> # RDMA
|
|
Pull rdma updates from Jason Gunthorpe:
"A new Pensando ionic driver, a new Gen 3 HW support for Intel irdma,
and lots of small bnxt_re improvements.
- Small bug fixes and improves to hfi1, efa, mlx5, erdma, rdmarvt,
siw
- Allow userspace access to IB service records through the rdmacm
- Optimize dma mapping for erdma
- Fix shutdown of the GSI QP in mana
- Support relaxed ordering MR and fix a corruption bug with mlx5 DMA
Data Direct
- Many improvement to bnxt_re:
- Debugging features and counters
- Improve performance of some commands
- Change flow_label reporting in completions
- Mirror vnic
- RDMA flow support
- New RDMA driver for Pensando Ethernet devices: ionic
- Gen 3 hardware support for the Intel irdma driver
- Fix rdma routing resolution with VRFs"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (85 commits)
RDMA/ionic: Fix memory leak of admin q_wr
RDMA/siw: Always report immediate post SQ errors
RDMA/bnxt_re: improve clarity in ALLOC_PAGE handler
RDMA/irdma: Remove unused struct irdma_cq fields
RDMA/irdma: Fix positive vs negative error codes in irdma_post_send()
RDMA/bnxt_re: Remove non-statistics counters from hw_counters
RDMA/bnxt_re: Add debugfs info entry for device and resource information
RDMA/bnxt_re: Fix incorrect errno used in function comments
RDMA: Use %pe format specifier for error pointers
RDMA/ionic: Use ether_addr_copy instead of memcpy
RDMA/ionic: Fix build failure on SPARC due to xchg() operand size
RDMA/rxe: Fix race in do_task() when draining
IB/sa: Fix sa_local_svc_timeout_ms read race
IB/ipoib: Ignore L3 master device
RDMA/core: Use route entry flag to decide on loopback traffic
RDMA/core: Resolve MAC of next-hop device without ARP support
RDMA/core: Squash a single user static function
RDMA/irdma: Update Kconfig
RDMA/irdma: Extend CQE Error and Flush Handling for GEN3 Devices
RDMA/irdma: Add Atomic Operations support
...
|
|
Cross-merge networking fixes after downstream PR (net-6.17-rc8).
Conflicts:
drivers/net/can/spi/hi311x.c
6b6968084721 ("can: hi311x: fix null pointer dereference when resuming from sleep before interface was enabled")
27ce71e1ce81 ("net: WQ_PERCPU added to alloc_workqueue users")
https://lore.kernel.org/72ce7599-1b5b-464a-a5de-228ff9724701@kernel.org
net/smc/smc_loopback.c
drivers/dibs/dibs_loopback.c
a35c04de2565 ("net/smc: fix warning in smc_rx_splice() when calling get_page()")
cc21191b584c ("dibs: Move data path to dibs layer")
https://lore.kernel.org/74368a5c-48ac-4f8e-a198-40ec1ed3cf5f@kernel.org
Adjacent changes:
drivers/net/dsa/lantiq/lantiq_gswip.c
c0054b25e2f1 ("net: dsa: lantiq_gswip: move gswip_add_single_port_br() call to port_setup()")
7a1eaef0a791 ("net: dsa: lantiq_gswip: support model-specific mac_select_pcs()")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Convert error logging throughout the RDMA subsystem to use
the %pe format specifier instead of PTR_ERR() with integer
format specifiers.
Link: https://patch.msgid.link/e81ec02df1e474be20417fb62e779776e3f47a50.1758217936.git.leon@kernel.org
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
|
|
The global doorbell is used for more than just Ethernet resources, so
move it out of mlx5e_hw_objs into a common place (mlx5_priv), to avoid
non-Ethernet modules (e.g. HWS, ASO) depending on Ethernet structs.
Use this opportunity to consolidate it with the 'uar' pointer already
there, which was used as an RX doorbell. Underneath the 'uar' pointer is
identical to 'bfreg->up', so store a single resource and use that
instead.
For CQ doorbells, care is taken to always use bfreg->up->index instead
of bfreg->index, which may refer to a subsequent UAR page from the same
ALLOC_UAR batch on some NICs.
This paves the way for cleanly supporting multiple doorbells in the
Ethernet driver.
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When using KSM (Key Scatter-gather Memory) access mode, the HW requires
the IOVA to be aligned to the selected page size.
Without this alignment, the HW may not function correctly.
Currently, mlx5_umem_mkc_find_best_pgsz() does not filter out page sizes
that would result in misaligned IOVAs for KSM mode. This can lead to
selecting page sizes that are incompatible with the given IOVA.
Fix this by filtering the page size bitmap when in KSM mode, keeping
only page sizes to which the IOVA is aligned to.
Fixes: fcfb03597b7d ("RDMA/mlx5: Align mkc page size capability check to PRM")
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20250824144839.154717-1-edwards@nvidia.com
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Fix a bug where the driver's event subscription logic for SRQ-related
events incorrectly sets obj_type for RMP objects.
When subscribing to SRQ events, get_legacy_obj_type() did not handle
the MLX5_CMD_OP_CREATE_RMP case, which caused obj_type to be 0
(default).
This led to a mismatch between the obj_type used during subscription
(0) and the value used during notification (1, taken from the event's
type field). As a result, event mapping for SRQ objects could fail and
event notification would not be delivered correctly.
This fix adds handling for MLX5_CMD_OP_CREATE_RMP in get_legacy_obj_type,
returning MLX5_EVENT_QUEUE_TYPE_RQ so obj_type is consistent between
subscription and notification.
Fixes: 759738537142 ("IB/mlx5: Enable subscription for device events over DEVX")
Link: https://patch.msgid.link/r/8f1048e3fdd1fde6b90607ce0ed251afaf8a148c.1755088962.git.leon@kernel.org
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Edward Srouji <edwards@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Previously loopback for MPV was supposed to be permanently enabled,
however other driver flows were able to over-ride that configuration and
disable it.
Add force_lb parameter that indicates that loopback should always be
enabled which prevents all other driver flows from disabling it.
Fixes: a9a9e68954f2 ("RDMA/mlx5: Fix vport loopback for MPV device")
Link: https://patch.msgid.link/r/cfc6b1f0f99f8100b087483cc14da6025317f901.1755088808.git.leon@kernel.org
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
The mlx5 driver currently derives max_qp_wr directly from the
log_max_qp_sz HCA capability:
props->max_qp_wr = 1 << MLX5_CAP_GEN(mdev, log_max_qp_sz);
However, this value represents the number of WQEs in units of Basic
Blocks (see MLX5_SEND_WQE_BB), not actual number of WQEs. Since the size
of a WQE can vary depending on transport type and features (e.g., atomic
operations, UMR, LSO), the actual number of WQEs can be significantly
smaller than the WQEBB count suggests.
This patch introduces a conservative estimation of the worst-case WQE size
— considering largest segments possible with 1 SGE and no inline data or
special features. It uses this to derive a more accurate max_qp_wr value.
Fixes: 938fe83c8dcb ("net/mlx5_core: New device capabilities handling")
Link: https://patch.msgid.link/r/7d992c9831c997ed5c33d30973406dc2dcaf5e89.1755088725.git.leon@kernel.org
Reported-by: Chuck Lever <cel@kernel.org>
Closes: https://lore.kernel.org/all/20250506142202.GJ2260621@ziepe.ca/
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Relaxed Ordering can improve performance in certain scenarios.
Enable it in the Data-Direct use case as well.
Link: https://patch.msgid.link/r/1221dcdda8061ba5f6bc3519044083c7438b257e.1755088503.git.leon@kernel.org
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Gal Shalom <galshalom@Nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
vhca id is already cached in the vport structure no need to query on
every mlx5 layer, use the mlx5_vport_get_vhca_id, where possible.
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Alexei Lazar <alazar@nvidia.com>
Reviewed-by: Feng Liu <feliu@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
|
|
Pull rdma updates from Jason Gunthorpe:
- Various minor code cleanups and fixes for hns, iser, cxgb4, hfi1,
rxe, erdma, mana_ib
- Prefetch supprot for rxe ODP
- Remove memory window support from hns as new device FW is no longer
support it
- Remove qib, it is very old and obsolete now, Cornelis wishes to
restructure the hfi1/qib shared layer
- Fix a race in destroying CQs where we can still end up with work
running because the work is cancled before the driver stops
triggering it
- Improve interaction with namespaces:
* Follow the devlink namespace for newly spawned RDMA devices
* Create iopoib net devces in the parent IB device's namespace
* Allow CAP_NET_RAW checks to pass in user namespaces
- A new flow control scheme for IB MADs to try and avoid queue
overflows in the network
- Fix 2G message sizes in bnxt_re
- Optimize mkey layout for mlx5 DMABUF
- New "DMA Handle" concept to allow controlling PCI TPH and steering
tags
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (71 commits)
RDMA/siw: Change maintainer email address
RDMA/mana_ib: add support of multiple ports
RDMA/mlx5: Refactor optional counters steering code
RDMA/mlx5: Add DMAH support for reg_user_mr/reg_user_dmabuf_mr
IB: Extend UVERBS_METHOD_REG_MR to get DMAH
RDMA/mlx5: Add DMAH object support
RDMA/core: Introduce a DMAH object and its alloc/free APIs
IB/core: Add UVERBS_METHOD_REG_MR on the MR object
net/mlx5: Add support for device steering tag
net/mlx5: Expose IFC bits for TPH
PCI/TPH: Expose pcie_tph_get_st_table_size()
RDMA/mlx5: Fix incorrect MKEY masking
RDMA/mlx5: Fix returned type from _mlx5r_umr_zap_mkey()
RDMA/mlx5: remove redundant check on err on return expression
RDMA/mana_ib: add additional port counters
RDMA/mana_ib: Fix DSCP value in modify QP
RDMA/efa: Add CQ with external memory support
RDMA/core: Add umem "is_contiguous" and "start_dma_addr" helpers
RDMA/uverbs: Add a common way to create CQ with umem
RDMA/mlx5: Optimize DMABUF mkey page size
...
|
|
Currently there isn't a clear layer separation between the counters and
the steering code, whereas the steering code is doing redundant access
to the counter struct.
Separate the fs.c and counters.c, where fs code won't access or be
aware of counter structs but only the objects it needs.
As a result, move mlx5_rdma_counter struct from the header file to be
an internal struct for the counters file only.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/9854d1fdb140e4a6552b7a2fd1a028cfe488a935.1753004208.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
As part of this enhancement, allow the creation of an MKEY associated
with a DMA handle.
Additional notes:
MKEYs with TPH (i.e. TLP Processing Hints) attributes are currently not
UMR-capable unless explicitly enabled by firmware or hardware.
Therefore, to maintain such MKEYs in the MR cache, the TPH fields have
been added to the rb_key structure, with a dedicated hash bucket.
The ability to bypass the kernel verbs flow and create an MKEY with TPH
attributes using DEVX has been restricted. TPH must follow the standard
InfiniBand flow, where a DMAH is created with the appropriate security
checks and management mechanisms in place.
DMA handles are currently not supported in conjunction with On-Demand
Paging (ODP).
Re-registration of memory regions originally created with TPH attributes
is currently not supported.
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/1c485651cf8417694ddebb80446c5093d5a791a9.1752752567.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Extend UVERBS_METHOD_REG_MR to get DMAH and pass it to all drivers.
It will be used in mlx5 driver as part of the next patch from the
series.
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/2ae1e628c0675db81f092cc00d3ad6fbf6139405.1752752567.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
This patch introduces support for allocating and deallocating the DMAH
object.
Further details:
----------------
The DMAH API is exposed to upper layers only if the underlying device
supports TPH.
It uses the mlx5_core steering tag (ST) APIs to get a steering tag index
based on the provided input.
The obtained index is stored in the device-specific mlx5_dmah structure
for future use.
Upcoming patches in the series will integrate the allocated DMAH into
the memory region (MR) registration process.
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/778550776799d82edb4d05da249a1cff00160b50.1752752567.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
mkey_mask is __be64 type, while MLX5_MKEY_MASK_PAGE_SIZE is declared as
unsigned long long. This causes to the static checkers errors:
drivers/infiniband/hw/mlx5/umr.c:663:49: warning: invalid assignment: &=
drivers/infiniband/hw/mlx5/umr.c:663:49: left side has type restricted __be64
drivers/infiniband/hw/mlx5/umr.c:663:49: right side has type int
Fixes: e73242aa14d2 ("RDMA/mlx5: Optimize DMABUF mkey page size")
Link: https://patch.msgid.link/e354d70b98dfa5ecf4c236a36cd36b64add9d9de.1753003467.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
|
|
As Colin reported:
"The variable zapped_blocks is a size_t type and is being assigned a int
return value from the call to _mlx5r_umr_zap_mkey. Since zapped_blocks is an
unsigned type, the error check for zapped_blocks < 0 will never be true."
So separate return error and nblocks assignment.
Fixes: e73242aa14d2 ("RDMA/mlx5: Optimize DMABUF mkey page size")
Reported-by: Colin King (gmail) <colin.i.king@gmail.com>
Closes: https://lore.kernel.org/all/79166fb1-3b73-4d37-af02-a17b22eb8e64@gmail.com
Link: https://patch.msgid.link/71d8ea208ac7eaa4438af683b9afaed78625e419.1753003467.git.leon@kernel.org
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
|
|
Currently all paths that set err and then check it for an error
perform immediate returns, hence err always zero at the end of
the function _mlx5r_umr_zap_mkey. The return expression
err ? err : nblocks has a redundant check on the err since err
is always zero, so just return nblocks instead.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Link: https://patch.msgid.link/20250717112108.4036171-1-colin.i.king@gmail.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
Tariq Toukan says:
====================
mlx5-next updates 2025-07-14
* 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
net/mlx5: IFC updates for disabled host PF
net/mlx5: Expose disciplined_fr_counter through HCA capabilities in mlx5_ifc
RDMA/mlx5: Fix UMR modifying of mkey page size
net/mlx5: Expose HCA capability bits for mkey max page size
====================
Link: https://patch.msgid.link/1752481357-34780-1-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The current implementation of DMABUF memory registration uses a fixed
page size for the memory key (mkey), which can lead to suboptimal
performance when the underlying memory layout may offer better page
size.
The optimization improves performance by reducing the number of page
table entries required for the mkey, leading to less MTT/KSM descriptors
that the HCA must go through to find translations, fewer cache-lines,
and shorter UMR work requests on mkey updates such as when
re-registering or reusing a cacheable mkey.
To ensure safe page size updates, the implementation uses a 5-step
process:
1. Make the first X entries non-present, while X is calculated to be
minimal according to a large page shift that can be used to cover the
MR length.
2. Update the page size to the large supported page size
3. Load the remaining N-X entries according to the (optimized)
page shift
4. Update the page size according to the (optimized) page shift
5. Load the first X entries with the correct translations
This ensures that at no point is the MR accessible with a partially
updated translation table, maintaining correctness and preventing
access to stale or inconsistent mappings, such as having an mkey
advertising the new page size while some of the underlying page table
entries still contain the old page size translations.
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/bc05a6b2142c02f96a90635f9a4458ee4bbbf39f.1751979184.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Align the capabilities checked when using the log_page_size 6th bit in the
mkey context to the PRM definition. The upper and lower bounds are set by
max/min caps, and modification of the 6th bit by UMR is allowed only when
a specific UMR cap is set.
Current implementation falsely assumes all page sizes up-to 2^63 are
supported when the UMR cap is set. In case the upper bound cap is lower
than 63, this might result a FW syndrome on mkey creation, e.g:
mlx5_core 0000:c1:00.0: mlx5_cmd_out_err:832:(pid 0): CREATE_MKEY(0×200) op_mod(0×0) failed, status bad parameter(0×3), syndrome (0×38a711), err(-22)
Previous cap enforcement is still correct for all current HW, FW and
driver combinations. However, this patch aligns the code to be PRM
compliant in the general case.
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/eab4eeb4785105a4bb5eb362dc0b3662cd840412.1751979184.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
From Edward:
This patch series enables the mlx5 driver to dynamically choose the
optimal page size for a DMABUF-based memory key (mkey), rather than
always registering with a fixed page size.
Previously, DMABUF memory registration used a fixed 4K page size for
mkeys which could lead to suboptimal performance when the underlying
memory layout may offer better page sizes.
This approach did not take advantage of larger page size capabilities
advertised by the HCA, and the driver was not setting the proper page
size mask in the mkey mask when performing page size changes,
potentially
leading to invalid registrations when updating to a very large pages.
This series improves DMABUF performance by dynamically selecting the
best page size for a given memory region (MR) both at creation time and
on page fault occurrences, based on the underlying layout and fixing
related gaps and bugs.
By doing so, we reduce the number of page table entries (and thus MTT/
KSM descriptors) that the HCA must traverse, which in turn reduces
cache-line fetches.
Thanks
* mlx5-next:
RDMA/mlx5: Fix UMR modifying of mkey page size
net/mlx5: Expose HCA capability bits for mkey max page size
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
When changing the page size on an mkey, the driver needs to set the
appropriate bits in the mkey mask to indicate which fields are being
modified.
The 6th bit of a page size in mlx5 driver is considered an extension,
and this bit has a dedicated capability and mask bits.
Previously, the driver was not setting this mask in the mkey mask when
performing page size changes, regardless of its hardware support,
potentially leading to an incorrect page size updates.
This fixes the issue by setting the relevant bit in the mkey mask when
performing page size changes on an mkey and the 6th bit of this field is
supported by the hardware.
Fixes: cef7dde8836a ("net/mlx5: Expand mkey page size to support 6 bits")
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/9f43a9c73bf2db6085a99dc836f7137e76579f09.1751979184.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
Tariq Toukan says:
====================
mlx5-next updates 2025-07-08
The following pull-request contains common mlx5 updates
for your *net-next* tree.
v2: https://lore.kernel.org/1751574385-24672-1-git-send-email-tariqt@nvidia.com
* 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
net/mlx5: Check device memory pointer before usage
net/mlx5: fs, fix RDMA TRANSPORT init cleanup flow
net/mlx5: Add IFC bits for PCIe Congestion Event object
net/mlx5: Small refactor for general object capabilities
net/mlx5: fs, add multiple prios to RDMA TRANSPORT steering domain
====================
Link: https://patch.msgid.link/1752002102-11316-1-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
* mlx5-next:
net/mlx5: Check device memory pointer before usage
net/mlx5: fs, fix RDMA TRANSPORT init cleanup flow
net/mlx5: Add IFC bits for PCIe Congestion Event object
net/mlx5: Small refactor for general object capabilities
|
|
Add a NULL check before accessing device memory to prevent a crash if
dev->dm allocation in mlx5_init_once() fails.
Fixes: c9b9dcb430b3 ("net/mlx5: Move device memory management to mlx5_core")
Signed-off-by: Stav Aviram <saviram@nvidia.com>
Link: https://patch.msgid.link/c88711327f4d74d5cebc730dc629607e989ca187.1751370035.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Currently, the capability check is done in the default
init_user_ns user namespace. When a process runs in a
non default user namespace, such check fails. Due to this
when a process is running using Podman, it fails to create
the devx object.
Since the RDMA device is a resource within a network namespace,
use the network namespace associated with the RDMA device to
determine its owning user namespace.
Fixes: a8b92ca1b0e5 ("IB/mlx5: Introduce DEVX")
Signed-off-by: Parav Pandit <parav@nvidia.com>
Link: https://patch.msgid.link/36ee87e92defd81410c6a2b33f9d6c0d6dcfd64c.1750963874.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Currently, the capability check is done in the default
init_user_ns user namespace. When a process runs in a
non default user namespace, such check fails. Due to this
when a process is running using Podman, it fails to create
the anchor.
Since the RDMA device is a resource within a network namespace,
use the network namespace associated with the RDMA device to
determine its owning user namespace.
Fixes: 0c6ab0ca9a66 ("RDMA/mlx5: Expose steering anchor to userspace")
Signed-off-by: Parav Pandit <parav@nvidia.com>
Link: https://patch.msgid.link/c2376ca75e7658e2cbd1f619cf28fbe98c906419.1750963874.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Currently, the capability check is done in the default
init_user_ns user namespace. When a process runs in a
non default user namespace, such check fails. Due to this
when a process is running using Podman, it fails to create
the flow.
Since the RDMA device is a resource within a network namespace,
use the network namespace associated with the RDMA device to
determine its owning user namespace.
Fixes: 322694412400 ("IB/mlx5: Introduce driver create and destroy flow methods")
Signed-off-by: Parav Pandit <parav@nvidia.com>
Link: https://patch.msgid.link/a4dcd5e3ac6904ef50b19e56942ca6ab0728794c.1750963874.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Use the new ib_alloc_device_with_net() API to allocate the IB device
so that it is properly bound to the network namespace obtained via
mlx5_core_net(). This change ensures correct namespace association
(e.g., for containerized setups).
Additionally, expose mlx5_core_net so that RDMA driver can use it.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
|
|
Support the creation of RDMA TRANSPORT tables over multiple priorities
via matcher creation.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/bb38e50ae4504e979c6568d41939402a4cf15635.1750148083.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
- pre_destroy_cq: Destroy FW CQ object so that no new CQ event would
be generated;
- post_destroy_cq: Release all resources.
This patch, along with last one, fixes the crash below.
Unable to handle kernel paging request at virtual address ffff8000114b1180
Mem abort info:
ESR = 0x96000047
EC = 0x25: DABT (current EL), IL = 32 bits
SET = 0, FnV = 0
EA = 0, S1PTW = 0
Data abort info:
ISV = 0, ISS = 0x00000047
CM = 0, WnR = 1
swapper pgtable: 4k pages, 48-bit VAs, pgdp=00000000f4582000
[ffff8000114b1180] pgd=00000447fffff003, p4d=00000447fffff003, pud=00000447ffffe003, pmd=00000447ffffb003, pte=0000000000000000
Internal error: Oops: 96000047 [#1] SMP
Modules linked in: udp_diag uio_pci_generic uio tcp_diag inet_diag binfmt_misc sn_core_odd(OE) rpcrdma(OE) xprtrdma(OE) ib_isert(OE) ib_iser(OE) ib_srpt(OE) ib_srp(OE) ib_ipoib(OE) kpatch_9658536(OK) kpatch_9322385(OK) kpatch_8843421(OK) kpatch_8636216(OK) vfat fat aes_ce_blk crypto_simd cryptd aes_ce_cipher crct10dif_ce ghash_ce sm4_ce sha2_ce sha256_arm64 sha1_ce sbsa_gwdt sg acpi_ipmi ipmi_si ipmi_msghandler m1_uncore_ddrss_pmu m1_uncore_cmn_pmu team_yosemite9rc6(OE) vnic(OE) ip_tables mlx5_ib(OE) sd_mod ast mlx5_core(OE) i2c_algo_bit drm_vram_helper psample drm_kms_helper mlxdevm(OE) auxiliary(OE) mlxfw(OE) syscopyarea sysfillrect tls sysimgblt fb_sys_fops drm_ttm_helper nvme ttm nvme_core drm t10_pi i2c_designware_platform i2c_designware_core i2c_core ahci libahci libata rdma_ucm(OE) ib_uverbs(OE) rdma_cm(OE) iw_cm(OE) ib_cm(OE) ib_umad(OE) ib_core(OE) ib_ucm(OE) mlx_compat(OE) [last unloaded: ipmi_devintf]
CPU: 83 PID: 59375 Comm: kworker/u253:1 Kdump: loaded Tainted: G OE K 5.10.84-004.ali5000.alios7.aarch64 #1
Hardware name: Inspur AliServer-Xuanwu2.0AM-02-2UM1P-5B/AS1221MG1, BIOS 1.2.M1.AL.P.158.00 08/31/2023
Workqueue: ib-comp-unb-wq ib_cq_poll_work [ib_core]
pstate: 82c00089 (Nzcv daIf +PAN +UAO +TCO BTYPE=--)
pc : native_queued_spin_lock_slowpath+0x1c4/0x31c
lr : mlx5_ib_poll_cq+0x18c/0x2f8 [mlx5_ib]
sp : ffff80002be1bc80
x29: ffff80002be1bc80 x28: ffff000810e69000
x27: ffff000810e69000 x26: ffff000810e69200
x25: 0000000000000000 x24: ffff8000117db000
x23: ffff04000156b780 x22: 0000000000000000
x21: ffff04000ce6c160 x20: ffff0008196a4000
x19: 0000000000000010 x18: 0000000000000020
x17: 0000000000000000 x16: 0000000000000000
x15: ffff040055a364e8 x14: ffffffffffffffff
x13: ffff80002318bda8 x12: ffff0400358836e8
x11: 0000000000000040 x10: 0000000000000eb0
x9 : 0000000000000000 x8 : 0000000000000000
x7 : ffff04477fa20140 x6 : ffff8000114b1140
x5 : ffff04477fa20140 x4 : ffff8000114b1180
x3 : ffff000810e69200 x2 : ffff8000114b1180
x1 : 0000000001500000 x0 : ffff04477fa20148
Call trace:
native_queued_spin_lock_slowpath+0x1c4/0x31c
__ib_process_cq+0x74/0x1b8 [ib_core]
ib_cq_poll_work+0x34/0xa0 [ib_core]
process_one_work+0x1d8/0x4b0
worker_thread+0x180/0x440
kthread+0x114/0x120
Code: 910020e0 8b0400c4 f862d929 aa0403e2 (f8296847)
---[ end trace 387be2290557729c ]---
Kernel panic - not syncing: Oops: Fatal exception
SMP: stopping secondary CPUs
Kernel Offset: disabled
CPU features: 0x9850817,7a60aa38
Memory Limit: none
Starting crashdump kernel...
Bye!
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Link: https://patch.msgid.link/aaf0072f350d1c7e8731f43b79e11a560bafb9e0.1750070205.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Always enable vport loopback for both MPV devices on driver start.
Previously in some cases related to MPV RoCE, packets weren't correctly
executing loopback check at vport in FW, since it was disabled.
Due to complexity of identifying such cases for MPV always enable vport
loopback for both GVMIs when binding the slave to the master port.
Fixes: 0042f9e458a5 ("RDMA/mlx5: Enable vport loopback when user context or QP mandate")
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/d4298f5ebb2197459e9e7221c51ecd6a34699847.1750064969.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
In case, CC counters are querying for the second port use the correct
core device for the query instead of always using the master core device.
Fixes: aac4492ef23a ("IB/mlx5: Update counter implementation for dual port RoCE")
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/9cace74dcf106116118bebfa9146d40d4166c6b0.1750064969.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
To get the device HW counters, a non-representor switchdev device
should use the mlx5_ib_query_q_counters() function and query all of
the available counters. While a representor device in switchdev mode
should use the mlx5_ib_query_q_counters_vport() function and query only
the Q_Counters without the PPCNT counters and congestion control counters,
since they aren't relevant for a representor device.
Currently a non-representor switchdev device skips querying the PPCNT
counters and congestion control counters, leaving them unupdated.
Fix that by properly querying those counters for non-representor devices.
Fixes: d22467a71ebe ("RDMA/mlx5: Expand switchdev Q-counters to expose representor statistics")
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Maher Sanalla <msanalla@nvidia.com>
Link: https://patch.msgid.link/56bf8af4ca8c58e3fb9f7e47b1dca2009eeeed81.1750064969.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
The issue arises when kzalloc() is invoked while holding umem_mutex or
any other lock acquired under umem_mutex. This is problematic because
kzalloc() can trigger fs_reclaim_aqcuire(), which may, in turn, invoke
mmu_notifier_invalidate_range_start(). This function can lead to
mlx5_ib_invalidate_range(), which attempts to acquire umem_mutex again,
resulting in a deadlock.
The problematic flow:
CPU0 | CPU1
---------------------------------------|------------------------------------------------
mlx5_ib_dereg_mr() |
→ revoke_mr() |
→ mutex_lock(&umem_odp->umem_mutex) |
| mlx5_mkey_cache_init()
| → mutex_lock(&dev->cache.rb_lock)
| → mlx5r_cache_create_ent_locked()
| → kzalloc(GFP_KERNEL)
| → fs_reclaim()
| → mmu_notifier_invalidate_range_start()
| → mlx5_ib_invalidate_range()
| → mutex_lock(&umem_odp->umem_mutex)
→ cache_ent_find_and_store() |
→ mutex_lock(&dev->cache.rb_lock) |
Additionally, when kzalloc() is called from within
cache_ent_find_and_store(), we encounter the same deadlock due to
re-acquisition of umem_mutex.
Solve by releasing umem_mutex in dereg_mr() after umr_revoke_mr()
and before acquiring rb_lock. This ensures that we don't hold
umem_mutex while performing memory allocations that could trigger
the reclaim path.
This change prevents the deadlock by ensuring proper lock ordering and
avoiding holding locks during memory allocation operations that could
trigger the reclaim path.
The following lockdep warning demonstrates the deadlock:
python3/20557 is trying to acquire lock:
ffff888387542128 (&umem_odp->umem_mutex){+.+.}-{4:4}, at:
mlx5_ib_invalidate_range+0x5b/0x550 [mlx5_ib]
but task is already holding lock:
ffffffff82f6b840 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}, at:
unmap_vmas+0x7b/0x1a0
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #3 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}:
fs_reclaim_acquire+0x60/0xd0
mem_cgroup_css_alloc+0x6f/0x9b0
cgroup_init_subsys+0xa4/0x240
cgroup_init+0x1c8/0x510
start_kernel+0x747/0x760
x86_64_start_reservations+0x25/0x30
x86_64_start_kernel+0x73/0x80
common_startup_64+0x129/0x138
-> #2 (fs_reclaim){+.+.}-{0:0}:
fs_reclaim_acquire+0x91/0xd0
__kmalloc_cache_noprof+0x4d/0x4c0
mlx5r_cache_create_ent_locked+0x75/0x620 [mlx5_ib]
mlx5_mkey_cache_init+0x186/0x360 [mlx5_ib]
mlx5_ib_stage_post_ib_reg_umr_init+0x3c/0x60 [mlx5_ib]
__mlx5_ib_add+0x4b/0x190 [mlx5_ib]
mlx5r_probe+0xd9/0x320 [mlx5_ib]
auxiliary_bus_probe+0x42/0x70
really_probe+0xdb/0x360
__driver_probe_device+0x8f/0x130
driver_probe_device+0x1f/0xb0
__driver_attach+0xd4/0x1f0
bus_for_each_dev+0x79/0xd0
bus_add_driver+0xf0/0x200
driver_register+0x6e/0xc0
__auxiliary_driver_register+0x6a/0xc0
do_one_initcall+0x5e/0x390
do_init_module+0x88/0x240
init_module_from_file+0x85/0xc0
idempotent_init_module+0x104/0x300
__x64_sys_finit_module+0x68/0xc0
do_syscall_64+0x6d/0x140
entry_SYSCALL_64_after_hwframe+0x4b/0x53
-> #1 (&dev->cache.rb_lock){+.+.}-{4:4}:
__mutex_lock+0x98/0xf10
__mlx5_ib_dereg_mr+0x6f2/0x890 [mlx5_ib]
mlx5_ib_dereg_mr+0x21/0x110 [mlx5_ib]
ib_dereg_mr_user+0x85/0x1f0 [ib_core]
uverbs_free_mr+0x19/0x30 [ib_uverbs]
destroy_hw_idr_uobject+0x21/0x80 [ib_uverbs]
uverbs_destroy_uobject+0x60/0x3d0 [ib_uverbs]
uobj_destroy+0x57/0xa0 [ib_uverbs]
ib_uverbs_cmd_verbs+0x4d5/0x1210 [ib_uverbs]
ib_uverbs_ioctl+0x129/0x230 [ib_uverbs]
__x64_sys_ioctl+0x596/0xaa0
do_syscall_64+0x6d/0x140
entry_SYSCALL_64_after_hwframe+0x4b/0x53
-> #0 (&umem_odp->umem_mutex){+.+.}-{4:4}:
__lock_acquire+0x1826/0x2f00
lock_acquire+0xd3/0x2e0
__mutex_lock+0x98/0xf10
mlx5_ib_invalidate_range+0x5b/0x550 [mlx5_ib]
__mmu_notifier_invalidate_range_start+0x18e/0x1f0
unmap_vmas+0x182/0x1a0
exit_mmap+0xf3/0x4a0
mmput+0x3a/0x100
do_exit+0x2b9/0xa90
do_group_exit+0x32/0xa0
get_signal+0xc32/0xcb0
arch_do_signal_or_restart+0x29/0x1d0
syscall_exit_to_user_mode+0x105/0x1d0
do_syscall_64+0x79/0x140
entry_SYSCALL_64_after_hwframe+0x4b/0x53
Chain exists of:
&dev->cache.rb_lock --> mmu_notifier_invalidate_range_start -->
&umem_odp->umem_mutex
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&umem_odp->umem_mutex);
lock(mmu_notifier_invalidate_range_start);
lock(&umem_odp->umem_mutex);
lock(&dev->cache.rb_lock);
*** DEADLOCK ***
Fixes: abb604a1a9c8 ("RDMA/mlx5: Fix a race for an ODP MR which leads to CQE with error")
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/3c8f225a8a9fade647d19b014df1172544643e4a.1750061612.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
The obj_event may be loaded immediately after inserted, then if the
list_head is not initialized then we may get a poisonous pointer. This
fixes the crash below:
mlx5_core 0000:03:00.0: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0 enhanced)
mlx5_core.sf mlx5_core.sf.4: firmware version: 32.38.3056
mlx5_core 0000:03:00.0 en3f0pf0sf2002: renamed from eth0
mlx5_core.sf mlx5_core.sf.4: Rate limit: 127 rates are supported, range: 0Mbps to 195312Mbps
IPv6: ADDRCONF(NETDEV_CHANGE): en3f0pf0sf2002: link becomes ready
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000060
Mem abort info:
ESR = 0x96000006
EC = 0x25: DABT (current EL), IL = 32 bits
SET = 0, FnV = 0
EA = 0, S1PTW = 0
Data abort info:
ISV = 0, ISS = 0x00000006
CM = 0, WnR = 0
user pgtable: 4k pages, 48-bit VAs, pgdp=00000007760fb000
[0000000000000060] pgd=000000076f6d7003, p4d=000000076f6d7003, pud=0000000777841003, pmd=0000000000000000
Internal error: Oops: 96000006 [#1] SMP
Modules linked in: ipmb_host(OE) act_mirred(E) cls_flower(E) sch_ingress(E) mptcp_diag(E) udp_diag(E) raw_diag(E) unix_diag(E) tcp_diag(E) inet_diag(E) binfmt_misc(E) bonding(OE) rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) isofs(E) cdrom(E) mst_pciconf(OE) ib_umad(OE) mlx5_ib(OE) ipmb_dev_int(OE) mlx5_core(OE) kpatch_15237886(OEK) mlxdevm(OE) auxiliary(OE) ib_uverbs(OE) ib_core(OE) psample(E) mlxfw(OE) tls(E) sunrpc(E) vfat(E) fat(E) crct10dif_ce(E) ghash_ce(E) sha1_ce(E) sbsa_gwdt(E) virtio_console(E) ext4(E) mbcache(E) jbd2(E) xfs(E) libcrc32c(E) mmc_block(E) virtio_net(E) net_failover(E) failover(E) sha2_ce(E) sha256_arm64(E) nvme(OE) nvme_core(OE) gpio_mlxbf3(OE) mlx_compat(OE) mlxbf_pmc(OE) i2c_mlxbf(OE) sdhci_of_dwcmshc(OE) pinctrl_mlxbf3(OE) mlxbf_pka(OE) gpio_generic(E) i2c_core(E) mmc_core(E) mlxbf_gige(OE) vitesse(E) pwr_mlxbf(OE) mlxbf_tmfifo(OE) micrel(E) mlxbf_bootctl(OE) virtio_ring(E) virtio(E) ipmi_devintf(E) ipmi_msghandler(E)
[last unloaded: mst_pci]
CPU: 11 PID: 20913 Comm: rte-worker-11 Kdump: loaded Tainted: G OE K 5.10.134-13.1.an8.aarch64 #1
Hardware name: https://www.mellanox.com BlueField-3 SmartNIC Main Card/BlueField-3 SmartNIC Main Card, BIOS 4.2.2.12968 Oct 26 2023
pstate: a0400089 (NzCv daIf +PAN -UAO -TCO BTYPE=--)
pc : dispatch_event_fd+0x68/0x300 [mlx5_ib]
lr : devx_event_notifier+0xcc/0x228 [mlx5_ib]
sp : ffff80001005bcf0
x29: ffff80001005bcf0 x28: 0000000000000001
x27: ffff244e0740a1d8 x26: ffff244e0740a1d0
x25: ffffda56beff5ae0 x24: ffffda56bf911618
x23: ffff244e0596a480 x22: ffff244e0596a480
x21: ffff244d8312ad90 x20: ffff244e0596a480
x19: fffffffffffffff0 x18: 0000000000000000
x17: 0000000000000000 x16: ffffda56be66d620
x15: 0000000000000000 x14: 0000000000000000
x13: 0000000000000000 x12: 0000000000000000
x11: 0000000000000040 x10: ffffda56bfcafb50
x9 : ffffda5655c25f2c x8 : 0000000000000010
x7 : 0000000000000000 x6 : ffff24545a2e24b8
x5 : 0000000000000003 x4 : ffff80001005bd28
x3 : 0000000000000000 x2 : 0000000000000000
x1 : ffff244e0596a480 x0 : ffff244d8312ad90
Call trace:
dispatch_event_fd+0x68/0x300 [mlx5_ib]
devx_event_notifier+0xcc/0x228 [mlx5_ib]
atomic_notifier_call_chain+0x58/0x80
mlx5_eq_async_int+0x148/0x2b0 [mlx5_core]
atomic_notifier_call_chain+0x58/0x80
irq_int_handler+0x20/0x30 [mlx5_core]
__handle_irq_event_percpu+0x60/0x220
handle_irq_event_percpu+0x3c/0x90
handle_irq_event+0x58/0x158
handle_fasteoi_irq+0xfc/0x188
generic_handle_irq+0x34/0x48
...
Fixes: 759738537142 ("IB/mlx5: Enable subscription for device events over DEVX")
Link: https://patch.msgid.link/r/3ce7f20e0d1a03dc7de6e57494ec4b8eaf1f05c2.1750147949.git.leon@kernel.org
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
__xa_store() and __xa_erase() were used without holding the proper lock,
which led to a lockdep warning due to unsafe RCU usage. This patch
replaces them with xa_store() and xa_erase(), which perform the necessary
locking internally.
=============================
WARNING: suspicious RCPU usage
6.14.0-rc7_for_upstream_debug_2025_03_18_15_01 #1 Not tainted
-----------------------------
./include/linux/xarray.h:1211 suspicious rcu_dereference_protected() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
3 locks held by kworker/u136:0/219:
at: process_one_work+0xbe4/0x15f0
process_one_work+0x75c/0x15f0
pagefault_mr+0x9a5/0x1390 [mlx5_ib]
stack backtrace:
CPU: 14 UID: 0 PID: 219 Comm: kworker/u136:0 Not tainted
6.14.0-rc7_for_upstream_debug_2025_03_18_15_01 #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
Workqueue: mlx5_ib_page_fault mlx5_ib_eqe_pf_action [mlx5_ib]
Call Trace:
dump_stack_lvl+0xa8/0xc0
lockdep_rcu_suspicious+0x1e6/0x260
xas_create+0xb8a/0xee0
xas_store+0x73/0x14c0
__xa_store+0x13c/0x220
? xa_store_range+0x390/0x390
? spin_bug+0x1d0/0x1d0
pagefault_mr+0xcb5/0x1390 [mlx5_ib]
? _raw_spin_unlock+0x1f/0x30
mlx5_ib_eqe_pf_action+0x3be/0x2620 [mlx5_ib]
? lockdep_hardirqs_on_prepare+0x400/0x400
? mlx5_ib_invalidate_range+0xcb0/0xcb0 [mlx5_ib]
process_one_work+0x7db/0x15f0
? pwq_dec_nr_in_flight+0xda0/0xda0
? assign_work+0x168/0x240
worker_thread+0x57d/0xcd0
? rescuer_thread+0xc40/0xc40
kthread+0x3b3/0x800
? kthread_is_per_cpu+0xb0/0xb0
? lock_downgrade+0x680/0x680
? do_raw_spin_lock+0x12d/0x270
? spin_bug+0x1d0/0x1d0
? finish_task_switch.isra.0+0x284/0x9e0
? lockdep_hardirqs_on_prepare+0x284/0x400
? kthread_is_per_cpu+0xb0/0xb0
ret_from_fork+0x2d/0x70
? kthread_is_per_cpu+0xb0/0xb0
ret_from_fork_asm+0x11/0x20
Fixes: d3d930411ce3 ("RDMA/mlx5: Fix implicit ODP use after free")
Link: https://patch.msgid.link/r/a85ddd16f45c8cb2bc0a188c2b0fcedfce975eb8.1750061791.git.leon@kernel.org
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Patrisious Haddad <phaddad@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|