| Age | Commit message (Collapse) | Author |
|
Cross-merge networking fixes after downstream PR (net-6.18-rc7).
No conflicts, adjacent changes:
tools/testing/selftests/net/af_unix/Makefile
e1bb28bf13f4 ("selftest: af_unix: Add test for SO_PEEK_OFF.")
45a1cd8346ca ("selftests: af_unix: Add tests for ECONNRESET and EOF semantics")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:
====================
pull request (net): ipsec 2025-11-18
1) Misc fixes for xfrm_state creation/modification/deletion.
Patchset from Sabrina Dubroca.
2) Fix inner packet family determination for xfrm offloads.
From Jianbo Liu.
3) Don't push locally generated packets directly to L2 tunnel
mode offloading, they still need processing from the standard
xfrm path. From Jianbo Liu.
4) Fix memory leaks in xfrm_add_acquire for policy offloads and policy
security contexts. From Zilin Guan.
* tag 'ipsec-2025-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
xfrm: fix memory leak in xfrm_add_acquire()
xfrm: Prevent locally generated packets from direct output in tunnel mode
xfrm: Determine inner GSO type from packet inner protocol
xfrm: Check inner packet family directly from skb_dst
xfrm: check all hash buckets for leftover states during netns deletion
xfrm: set err and extack on failure to create pcpu SA
xfrm: call xfrm_dev_state_delete when xfrm_state_migrate fails to add the state
xfrm: make state as DEAD before final put when migrate fails
xfrm: also call xfrm_state_delete_tunnel at destroy time for states that were never added
xfrm: drop SA reference in xfrm_state_update if dir doesn't match
====================
Link: https://patch.msgid.link/20251118085344.2199815-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The xfrm_add_acquire() function constructs an xfrm policy by calling
xfrm_policy_construct(). This allocates the policy structure and
potentially associates a security context and a device policy with it.
However, at the end of the function, the policy object is freed using
only kfree() . This skips the necessary cleanup for the security context
and device policy, leading to a memory leak.
To fix this, invoke the proper cleanup functions xfrm_dev_policy_delete(),
xfrm_dev_policy_free(), and security_xfrm_policy_free() before freeing the
policy object. This approach mirrors the error handling path in
xfrm_add_policy(), ensuring that all associated resources are correctly
released.
Fixes: 980ebd25794f ("[IPSEC]: Sync series - acquire insert")
Signed-off-by: Zilin Guan <zilin@seu.edu.cn>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
Add a check to ensure locally generated packets (skb->sk != NULL) do
not use direct output in tunnel mode, as these packets require proper
L2 header setup that is handled by the normal XFRM processing path.
Fixes: 5eddd76ec2fd ("xfrm: fix tunnel mode TX datapath in packet offload mode")
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
In the output path, xfrm_dev_offload_ok and xfrm_get_inner_ipproto
need to determine the protocol family of the inner packet (skb) before
it gets encapsulated.
In xfrm_dev_offload_ok, the code checked x->inner_mode.family. This is
unreliable because, for states handling both IPv4 and IPv6, the
relevant inner family could be either x->inner_mode.family or
x->inner_mode_iaf.family. Checking only the former can lead to a
mismatch with the actual packet being processed.
In xfrm_get_inner_ipproto, the code checked x->outer_mode.family. This
is also incorrect for tunnel mode, as the inner packet's family can be
different from the outer header's family.
At both of these call sites, the skb variable holds the original inner
packet. The most direct and reliable source of truth for its protocol
family is its destination entry. This patch fixes the issue by using
skb_dst(skb)->ops->family to ensure protocol-specific headers are only
accessed for the correct packet type.
Fixes: 91d8a53db219 ("xfrm: fix offloading of cross-family tunnels")
Fixes: 45a98ef4922d ("net/xfrm: IPsec tunnel mode fix inner_ipproto setting in sec_path")
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
The pfkey user configuration interface was replaced by the netlink
user configuration interface more than a decade ago. In between
all maintained IKE implementations moved to the netlink interface.
So let config NET_KEY default to no in Kconfig. The pfkey code
will be removed in a second step.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Antony Antony <antony.antony@secunet.com>
Acked-by: Tobias Brunner <tobias@strongswan.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Tuomo Soini <tis@foobar.fi>
Acked-by: Paul Wouters <paul@nohats.ca>
|
|
The xfrm_replay_recheck() function was introduced to handle the issues
arising from asynchronous crypto algorithms.
The crypto offload path is now effectively synchronous, as it holds
the state lock throughout its operation. This eliminates the race
condition, making the recheck an unnecessary overhead. This patch
improves performance by skipping the redundant call when
crypto_done is true.
Additionally, the sequence number assignment is moved to an earlier
point in the function. This improves performance by reducing lock
contention and places the logic at a more appropriate point, as the
full sequence number (including the higher-order bits) can be
determined as soon as the packet is received.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
With newer NICs like mlx5 supporting RSS for IPsec crypto offload,
packets for a single Security Association (SA) are scattered across
multiple CPU cores for parallel processing. The xfrm_state spinlock
(x->lock) is held for each packet during xfrm processing.
When multiple connections or flows share the same SA, this parallelism
causes high lock contention on x->lock, creating a performance
bottleneck and limiting scalability.
The original xfrm_input() function exacerbated this issue by releasing
and immediately re-acquiring x->lock. For hardware crypto offload
paths, this unlock/relock sequence is unnecessary and introduces
significant overhead. This patch refactors the function to relocate
the type_offload->input_tail call for the offload path, performing all
necessary work while continuously holding the lock. This reordering is
safe, since packets which don't pass the checks below will still fail
them with the new code.
Performance testing with iperf using multiple parallel streams over a
single IPsec SA shows significant improvement in throughput as the
number of queues (and thus CPU cores) increases:
+-----------+---------------+--------------+-----------------+
| RX queues | Before (Gbps) | After (Gbps) | Improvement (%) |
+-----------+---------------+--------------+-----------------+
| 2 | 32.3 | 34.4 | 6.5 |
| 4 | 34.4 | 40.0 | 16.3 |
| 6 | 24.5 | 38.3 | 56.3 |
| 8 | 23.1 | 38.3 | 65.8 |
| 12 | 18.1 | 29.9 | 65.2 |
| 16 | 16.0 | 25.2 | 57.5 |
+-----------+---------------+--------------+-----------------+
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
espintcp uses a custom queue (ike_queue) to deliver packets to
userspace. The polling logic relies on datagram_poll, which checks
sk_receive_queue, which can lead to false readiness signals when that
queue contains non-userspace packets.
Switch espintcp_poll to use datagram_poll_queue with ike_queue, ensuring
poll only signals readiness when userspace data is actually available.
Fixes: e27cca96cd68 ("xfrm: add espintcp (RFC 8229)")
Signed-off-by: Ralf Lici <ralf@mandelbit.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20251021100942.195010-3-ralf@mandelbit.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
The current hlist_empty checks only test the first bucket of each
hashtable, ignoring any other bucket. They should be caught by the
WARN_ON for state_all, but better to make all the checks accurate.
Fixes: 73d189dce486 ("netns xfrm: per-netns xfrm_state_bydst hash")
Fixes: d320bbb306f2 ("netns xfrm: per-netns xfrm_state_bysrc hash")
Fixes: b754a4fd8f58 ("netns xfrm: per-netns xfrm_state_byspi hash")
Fixes: fe9f1d8779cb ("xfrm: add state hashtable keyed by seq")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
xfrm_state_construct can fail without setting an error if the
requested pcpu_num value is too big. Set err and add an extack message
to avoid confusing userspace.
Fixes: 1ddf9916ac09 ("xfrm: Add support for per cpu xfrm state handling.")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
In case xfrm_state_migrate fails after calling xfrm_dev_state_add, we
directly release the last reference and destroy the new state, without
calling xfrm_dev_state_delete (this only happens in
__xfrm_state_delete, which we're not calling on this path, since the
state was never added).
Call xfrm_dev_state_delete on error when an offload configuration was
provided.
Fixes: ab244a394c7f ("xfrm: Migrate offload configuration")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
xfrm_state_migrate/xfrm_state_clone_and_setup create a new state, and
call xfrm_state_put to destroy it in case of
failure. __xfrm_state_destroy expects the state to be in
XFRM_STATE_DEAD, but we currently don't do that.
Reported-by: syzbot+5cd6299ede4d4f70987b@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=5cd6299ede4d4f70987b
Fixes: 78347c8c6b2d ("xfrm: Fix xfrm_state_migrate leak")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
were never added
In commit b441cf3f8c4b ("xfrm: delete x->tunnel as we delete x"), I
missed the case where state creation fails between full
initialization (->init_state has been called) and being inserted on
the lists.
In this situation, ->init_state has been called, so for IPcomp
tunnels, the fallback tunnel has been created and added onto the
lists, but the user state never gets added, because we fail before
that. The user state doesn't go through __xfrm_state_delete, so we
don't call xfrm_state_delete_tunnel for those states, and we end up
leaking the FB tunnel.
There are several codepaths affected by this: the add/update paths, in
both net/key and xfrm, and the migrate code (xfrm_migrate,
xfrm_state_migrate). A "proper" rollback of the init_state work would
probably be doable in the add/update code, but for migrate it gets
more complicated as multiple states may be involved.
At some point, the new (not-inserted) state will be destroyed, so call
xfrm_state_delete_tunnel during xfrm_state_gc_destroy. Most states
will have their fallback tunnel cleaned up during __xfrm_state_delete,
which solves the issue that b441cf3f8c4b (and other patches before it)
aimed at. All states (including FB tunnels) will be removed from the
lists once xfrm_state_fini has called flush_work(&xfrm_state_gc_work).
Reported-by: syzbot+999eb23467f83f9bf9bf@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=999eb23467f83f9bf9bf
Fixes: b441cf3f8c4b ("xfrm: delete x->tunnel as we delete x")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
We're not updating x1, but we still need to put() it.
Fixes: a4a87fa4e96c ("xfrm: Add Direction to the SA in or out")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2025-09-26
1) Fix field-spanning memcpy warning in AH output.
From Charalampos Mitrodimas.
2) Replace the strcpy() calls for alg_name by strscpy().
From Miguel García.
* tag 'ipsec-next-2025-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next:
xfrm: xfrm_user: use strscpy() for alg_name
net: ipv6: fix field-spanning memcpy warning in AH output
====================
Link: https://patch.msgid.link/20250926053025.2242061-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Cross-merge networking fixes after downstream PR (net-6.17-rc8).
Conflicts:
drivers/net/can/spi/hi311x.c
6b6968084721 ("can: hi311x: fix null pointer dereference when resuming from sleep before interface was enabled")
27ce71e1ce81 ("net: WQ_PERCPU added to alloc_workqueue users")
https://lore.kernel.org/72ce7599-1b5b-464a-a5de-228ff9724701@kernel.org
net/smc/smc_loopback.c
drivers/dibs/dibs_loopback.c
a35c04de2565 ("net/smc: fix warning in smc_rx_splice() when calling get_page()")
cc21191b584c ("dibs: Move data path to dibs layer")
https://lore.kernel.org/74368a5c-48ac-4f8e-a198-40ec1ed3cf5f@kernel.org
Adjacent changes:
drivers/net/dsa/lantiq/lantiq_gswip.c
c0054b25e2f1 ("net: dsa: lantiq_gswip: move gswip_add_single_port_br() call to port_setup()")
7a1eaef0a791 ("net: dsa: lantiq_gswip: support model-specific mac_select_pcs()")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Xiumei reported a regression in IPsec offload tests over xfrmi, where
the traffic for IPv6 over IPv4 tunnels is processed in SW instead of
going through crypto offload, after commit
cc18f482e8b6 ("xfrm: provide common xdo_dev_offload_ok callback
implementation").
Commit cc18f482e8b6 added a generic version of existing checks
attempting to prevent packets with IPv4 options or IPv6 extension
headers from being sent to HW that doesn't support offloading such
packets. The check mistakenly uses x->props.family (the outer family)
to determine the inner packet's family and verify if
options/extensions are present.
In the case of IPv6 over IPv4, the check compares some of the traffic
class bits to the expected no-options ihl value (5). The original
check was introduced in commit 2ac9cfe78223 ("net/mlx5e: IPSec, Add
Innova IPSec offload TX data path"), and then duplicated in the other
drivers. Before commit cc18f482e8b6, the loose check (ihl > 5) passed
because those traffic class bits were not set to a value that
triggered the no-offload codepath. Packets with options/extension
headers that should have been handled in SW went through the offload
path, and were likely dropped by the NIC or incorrectly
processed. Since commit cc18f482e8b6, the check is now strict (ihl !=
5), and in a basic setup (no traffic class configured), all packets go
through the no-offload codepath.
The commits that introduced the incorrect family checks in each driver
are:
2ac9cfe78223 ("net/mlx5e: IPSec, Add Innova IPSec offload TX data path")
8362ea16f69f ("crypto: chcr - ESN for Inline IPSec Tx")
859a497fe80c ("nfp: implement xfrm callbacks and expose ipsec offload feature to upper layer")
32188be805d0 ("cn10k-ipsec: Allow ipsec crypto offload for skb with SA")
[ixgbe/ixgbevf commits are ignored, as that HW does not support tunnel
mode, thus no cross-family setups are possible]
Fixes: cc18f482e8b6 ("xfrm: provide common xdo_dev_offload_ok callback implementation")
Reported-by: Xiumei Mu <xmu@redhat.com>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
Use ARRAY_SIZE(), so that we know the limit at compile time.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250905165813.1470708-9-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
x->id.spi == 0 means "no SPI assigned", but since commit
94f39804d891 ("xfrm: Duplicate SPI Handling"), we now create states
and add them to the byspi list with this value.
__xfrm_state_delete doesn't remove those states from the byspi list,
since they shouldn't be there, and this shows up as a UAF the next
time we go through the byspi list.
Reported-by: syzbot+a25ee9d20d31e483ba7b@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=a25ee9d20d31e483ba7b
Fixes: 94f39804d891 ("xfrm: Duplicate SPI Handling")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
Convert the ->flowic_tos field of struct flowi_common from __u8 to
dscp_t, rename it ->flowic_dscp and propagate these changes to struct
flowi and struct flowi4.
We've had several bugs in the past where ECN bits could interfere with
IPv4 routing, because these bits were not properly cleared when setting
->flowi4_tos. These bugs should be fixed now and the dscp_t type has
been introduced to ensure that variables carrying DSCP values don't
accidentally have any ECN bits set. Several variables and structure
fields have been converted to dscp_t already, but the main IPv4 routing
structure, struct flowi4, is still using a __u8. To avoid any future
regression, this patch converts it to dscp_t.
There are many users to convert at once. Fortunately, around half of
->flowi4_tos users already have a dscp_t value at hand, which they
currently convert to __u8 using inet_dscp_to_dsfield(). For all of
these users, we just need to drop that conversion.
But, although we try to do the __u8 <-> dscp_t conversions at the
boundaries of the network or of user space, some places still store
TOS/DSCP variables as __u8 in core networking code. Those can hardly be
converted either because the data structure is part of UAPI or because
the same variable or field is also used for handling ECN in other parts
of the code. In all of these cases where we don't have a dscp_t
variable at hand, we need to use inet_dsfield_to_dscp() when
interacting with ->flowi4_dscp.
Changes since v1:
* Fix space alignment in __bpf_redirect_neigh_v4() (Ido).
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/29acecb45e911d17446b9a3dbdb1ab7b821ea371.1756128932.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Going forward skb_dst_set will assert that skb dst_entry
is empty during skb_dst_set. skb_dstref_steal is added to reset
existing entry without doing refcnt. Switch to skb_dstref_steal
in __xfrm_route_forward and add a comment on why it's safe
to skip skb_dstref_restore.
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250818154032.3173645-3-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Replace the strcpy() calls that copy the canonical algorithm name into
alg_name with strscpy() to avoid potential overflows and guarantee NULL
termination.
Destination is alg_name in xfrm_algo/xfrm_algo_auth/xfrm_algo_aead
(size CRYPTO_MAX_ALG_NAME).
Tested in QEMU (BusyBox/Alpine rootfs):
- Added ESP AEAD (rfc4106(gcm(aes))) and classic ESP (sha256 + cbc(aes))
- Verified canonical names via ip -d xfrm state
- Checked IPComp negative (unknown algo) and deflate path
Signed-off-by: Miguel García <miguelgarciaroman8@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
This is partial revert of commit d53dda291bbd993a29b84d358d282076e3d01506.
This change causes traffic using GSO with SW crypto running through a
NIC capable of HW offload to no longer get segmented during
validate_xmit_xfrm, and is unrelated to the bonding use case mentioned
in the commit.
Fixes: d53dda291bbd ("xfrm: Remove unneeded device check from validate_xmit_xfrm")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
Commit 49431af6c4ef incorrectly assumes that the GSO path is only used
by HW offload, but it's also useful for SW crypto.
This patch re-enables GSO for SW crypto. It's not an exact revert to
preserve the other changes made to xfrm_dev_offload_ok afterwards, but
it reverts all of its effects.
Fixes: 49431af6c4ef ("xfrm: rely on XFRM offload")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
While reverting commit f75a2804da39 ("xfrm: destroy xfrm_state
synchronously on net exit path"), I incorrectly changed
xfrm_state_flush's "proto" argument back to IPSEC_PROTO_ANY. This
reverts some of the changes in commit dbb2483b2a46 ("xfrm: clean up
xfrm protocol checks"), and leads to some states not being removed
when we exit the netns.
Pass 0 instead of IPSEC_PROTO_ANY from both xfrm_state_fini
xfrm6_tunnel_net_exit, so that xfrm_state_flush deletes all states.
Fixes: 2a198bbec691 ("Revert "xfrm: destroy xfrm_state synchronously on net exit path"")
Reported-by: syzbot+6641a61fe0e2e89ae8c5@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=6641a61fe0e2e89ae8c5
Tested-by: syzbot+6641a61fe0e2e89ae8c5@syzkaller.appspotmail.com
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
Cross-merge networking fixes after downstream PR (net-6.16-rc8).
Conflicts:
drivers/net/ethernet/microsoft/mana/gdma_main.c
9669ddda18fb ("net: mana: Fix warnings for missing export.h header inclusion")
755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
https://lore.kernel.org/20250711130752.23023d98@canb.auug.org.au
Adjacent changes:
drivers/net/ethernet/ti/icssg/icssg_prueth.h
6e86fb73de0f ("net: ti: icssg-prueth: Fix buffer allocation for ICSSG")
ffe8a4909176 ("net: ti: icssg-prueth: Read firmware-names from device tree")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2025-07-23
1) Optimize to hold device only for the asynchronous decryption,
where it is really needed.
From Jianbo Liu.
2) Align our inbund SA lookup to RFC 4301. Only SPI and protocol
should be used for an inbound SA lookup.
From Aakash Kumar S.
3) Skip redundant statistics update for xfrm crypto offload.
From Jianbo Liu.
Please pull or let me know if there are problems.
* tag 'ipsec-next-2025-07-23' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next:
xfrm: Skip redundant statistics update for crypto offload
xfrm: Duplicate SPI Handling
xfrm: hold device only for the asynchronous decryption
====================
Link: https://patch.msgid.link/20250723080402.3439619-1-steffen.klassert@secunet.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:
====================
pull request (net): ipsec 2025-07-23
1) Premption fixes for xfrm_state_find.
From Sabrina Dubroca.
2) Initialize offload path also for SW IPsec GRO. This fixes a
performance regression on SW IPsec offload.
From Leon Romanovsky.
3) Fix IPsec UDP GRO for IKE packets.
From Tobias Brunner,
4) Fix transport header setting for IPcomp after decompressing.
From Fernando Fernandez Mancera.
5) Fix use-after-free when xfrmi_changelink tries to change
collect_md for a xfrm interface.
From Eyal Birger .
6) Delete the special IPcomp x->tunnel state along with the state x
to avoid refcount problems.
From Sabrina Dubroca.
Please pull or let me know if there are problems.
* tag 'ipsec-2025-07-23' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
Revert "xfrm: destroy xfrm_state synchronously on net exit path"
xfrm: delete x->tunnel as we delete x
xfrm: interface: fix use-after-free after changing collect_md xfrm interface
xfrm: ipcomp: adjust transport header after decompressing
xfrm: Set transport header to fix UDP GRO handling
xfrm: always initialize offload path
xfrm: state: use a consistent pcpu_id in xfrm_state_find
xfrm: state: initialize state_ptrs earlier in xfrm_state_find
====================
Link: https://patch.msgid.link/20250723075417.3432644-1-steffen.klassert@secunet.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
This reverts commit f75a2804da391571563c4b6b29e7797787332673.
With all states (whether user or kern) removed from the hashtables
during deletion, there's no need for synchronous destruction of
states. xfrm6_tunnel states still need to have been destroyed (which
will be the case when its last user is deleted (not destroyed)) so
that xfrm6_tunnel_free_spi removes it from the per-netns hashtable
before the netns is destroyed.
This has the benefit of skipping one synchronize_rcu per state (in
__xfrm_state_destroy(sync=true)) when we exit a netns.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
The ipcomp fallback tunnels currently get deleted (from the various
lists and hashtables) as the last user state that needed that fallback
is destroyed (not deleted). If a reference to that user state still
exists, the fallback state will remain on the hashtables/lists,
triggering the WARN in xfrm_state_fini. Because of those remaining
references, the fix in commit f75a2804da39 ("xfrm: destroy xfrm_state
synchronously on net exit path") is not complete.
We recently fixed one such situation in TCP due to defered freeing of
skbs (commit 9b6412e6979f ("tcp: drop secpath at the same time as we
currently drop dst")). This can also happen due to IP reassembly: skbs
with a secpath remain on the reassembly queue until netns
destruction. If we can't guarantee that the queues are flushed by the
time xfrm_state_fini runs, there may still be references to a (user)
xfrm_state, preventing the timely deletion of the corresponding
fallback state.
Instead of chasing each instance of skbs holding a secpath one by one,
this patch fixes the issue directly within xfrm, by deleting the
fallback state as soon as the last user state depending on it has been
deleted. Destruction will still happen when the final reference is
dropped.
A separate lockdep class for the fallback state is required since
we're going to lock x->tunnel while x is locked.
Fixes: 9d4139c76905 ("netns xfrm: per-netns xfrm_state_all list")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
In the crypto offload path, every packet is still processed by the
software stack. The state's statistics required for the expiration
check are being updated in software.
However, the code also calls xfrm_dev_state_update_stats(), which
triggers a query to the hardware device to fetch statistics. This
hardware query is redundant and introduces unnecessary performance
overhead.
Skip this call when it's crypto offload (not packet offload) to avoid
the unnecessary hardware access, thereby improving performance.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
collect_md property on xfrm interfaces can only be set on device creation,
thus xfrmi_changelink() should fail when called on such interfaces.
The check to enforce this was done only in the case where the xi was
returned from xfrmi_locate() which doesn't look for the collect_md
interface, and thus the validation was never reached.
Calling changelink would thus errornously place the special interface xi
in the xfrmi_net->xfrmi hash, but since it also exists in the
xfrmi_net->collect_md_xfrmi pointer it would lead to a double free when
the net namespace was taken down [1].
Change the check to use the xi from netdev_priv which is available earlier
in the function to prevent changes in xfrm collect_md interfaces.
[1] resulting oops:
[ 8.516540] kernel BUG at net/core/dev.c:12029!
[ 8.516552] Oops: invalid opcode: 0000 [#1] SMP NOPTI
[ 8.516559] CPU: 0 UID: 0 PID: 12 Comm: kworker/u80:0 Not tainted 6.15.0-virtme #5 PREEMPT(voluntary)
[ 8.516565] Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 8.516569] Workqueue: netns cleanup_net
[ 8.516579] RIP: 0010:unregister_netdevice_many_notify+0x101/0xab0
[ 8.516590] Code: 90 0f 0b 90 48 8b b0 78 01 00 00 48 8b 90 80 01 00 00 48 89 56 08 48 89 32 4c 89 80 78 01 00 00 48 89 b8 80 01 00 00 eb ac 90 <0f> 0b 48 8b 45 00 4c 8d a0 88 fe ff ff 48 39 c5 74 5c 41 80 bc 24
[ 8.516593] RSP: 0018:ffffa93b8006bd30 EFLAGS: 00010206
[ 8.516598] RAX: ffff98fe4226e000 RBX: ffffa93b8006bd58 RCX: ffffa93b8006bc60
[ 8.516601] RDX: 0000000000000004 RSI: 0000000000000000 RDI: dead000000000122
[ 8.516603] RBP: ffffa93b8006bdd8 R08: dead000000000100 R09: ffff98fe4133c100
[ 8.516605] R10: 0000000000000000 R11: 00000000000003d2 R12: ffffa93b8006be00
[ 8.516608] R13: ffffffff96c1a510 R14: ffffffff96c1a510 R15: ffffa93b8006be00
[ 8.516615] FS: 0000000000000000(0000) GS:ffff98fee73b7000(0000) knlGS:0000000000000000
[ 8.516619] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8.516622] CR2: 00007fcd2abd0700 CR3: 000000003aa40000 CR4: 0000000000752ef0
[ 8.516625] PKRU: 55555554
[ 8.516627] Call Trace:
[ 8.516632] <TASK>
[ 8.516635] ? rtnl_is_locked+0x15/0x20
[ 8.516641] ? unregister_netdevice_queue+0x29/0xf0
[ 8.516650] ops_undo_list+0x1f2/0x220
[ 8.516659] cleanup_net+0x1ad/0x2e0
[ 8.516664] process_one_work+0x160/0x380
[ 8.516673] worker_thread+0x2aa/0x3c0
[ 8.516679] ? __pfx_worker_thread+0x10/0x10
[ 8.516686] kthread+0xfb/0x200
[ 8.516690] ? __pfx_kthread+0x10/0x10
[ 8.516693] ? __pfx_kthread+0x10/0x10
[ 8.516697] ret_from_fork+0x82/0xf0
[ 8.516705] ? __pfx_kthread+0x10/0x10
[ 8.516709] ret_from_fork_asm+0x1a/0x30
[ 8.516718] </TASK>
Fixes: abc340b38ba2 ("xfrm: interface: support collect metadata mode")
Reported-by: Lonial Con <kongln9170@gmail.com>
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
(dst_entry)->obsolete is read locklessly, add corresponding
annotations.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The issue originates when Strongswan initiates an XFRM_MSG_ALLOCSPI
Netlink message, which triggers the kernel function xfrm_alloc_spi().
This function is expected to ensure uniqueness of the Security Parameter
Index (SPI) for inbound Security Associations (SAs). However, it can
return success even when the requested SPI is already in use, leading
to duplicate SPIs assigned to multiple inbound SAs, differentiated
only by their destination addresses.
This behavior causes inconsistencies during SPI lookups for inbound packets.
Since the lookup may return an arbitrary SA among those with the same SPI,
packet processing can fail, resulting in packet drops.
According to RFC 4301 section 4.4.2 , for inbound processing a unicast SA
is uniquely identified by the SPI and optionally protocol.
Reproducing the Issue Reliably:
To consistently reproduce the problem, restrict the available SPI range in
charon.conf : spi_min = 0x10000000 spi_max = 0x10000002
This limits the system to only 2 usable SPI values.
Next, create more than 2 Child SA. each using unique pair of src/dst address.
As soon as the 3rd Child SA is initiated, it will be assigned a duplicate
SPI, since the SPI pool is already exhausted.
With a narrow SPI range, the issue is consistently reproducible.
With a broader/default range, it becomes rare and unpredictable.
Current implementation:
xfrm_spi_hash() lookup function computes hash using daddr, proto, and family.
So if two SAs have the same SPI but different destination addresses, then
they will:
a. Hash into different buckets
b. Be stored in different linked lists (byspi + h)
c. Not be seen in the same hlist_for_each_entry_rcu() iteration.
As a result, the lookup will result in NULL and kernel allows that Duplicate SPI
Proposed Change:
xfrm_state_lookup_spi_proto() does a truly global search - across all states,
regardless of hash bucket and matches SPI and proto.
Signed-off-by: Aakash Kumar S <saakashkumar@marvell.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
The skb transport header pointer needs to be adjusted by network header
pointer plus the size of the ipcomp header.
This shows up when running traffic over ipcomp using transport mode.
After being reinjected, packets are dropped because the header isn't
adjusted properly and some checks can be triggered. E.g the skb is
mistakenly considered as IP fragmented packet and later dropped.
kworker/30:1-mm 443 [030] 102.055250: skb:kfree_skb:skbaddr=0xffff8f104aa3ce00 rx_sk=(
ffffffff8419f1f4 sk_skb_reason_drop+0x94 ([kernel.kallsyms])
ffffffff8419f1f4 sk_skb_reason_drop+0x94 ([kernel.kallsyms])
ffffffff84281420 ip_defrag+0x4b0 ([kernel.kallsyms])
ffffffff8428006e ip_local_deliver+0x4e ([kernel.kallsyms])
ffffffff8432afb1 xfrm_trans_reinject+0xe1 ([kernel.kallsyms])
ffffffff83758230 process_one_work+0x190 ([kernel.kallsyms])
ffffffff83758f37 worker_thread+0x2d7 ([kernel.kallsyms])
ffffffff83761cc9 kthread+0xf9 ([kernel.kallsyms])
ffffffff836c3437 ret_from_fork+0x197 ([kernel.kallsyms])
ffffffff836718da ret_from_fork_asm+0x1a ([kernel.kallsyms])
Fixes: eb2953d26971 ("xfrm: ipcomp: Use crypto_acomp interface")
Link: https://bugzilla.suse.com/1244532
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
The dev_hold() on skb->dev during packet reception was originally
added to prevent the device from being released prematurely during
asynchronous decryption operations.
As current hardware can offload decryption, this asynchronous path is
not always utilized. This often results in a pattern of dev_hold()
immediately followed by dev_put() for each packet, creating
unnecessary reference counting overhead detrimental to performance.
This patch optimizes this by skipping the dev_hold() and subsequent
dev_put() when asynchronous decryption is not being performed.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
Offload path is used for GRO with SW IPsec, and not just for HW
offload. So initialize it anyway.
Fixes: 585b64f5a620 ("xfrm: delay initialization of offload path till its actually requested")
Reported-by: Sabrina Dubroca <sd@queasysnail.net>
Closes: https://lore.kernel.org/all/aEGW_5HfPqU1rFjl@krikkit
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
Move this API to the canonical timer_*() namespace.
[ tglx: Redone against pre rc1 ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/aB2X0jCKQO56WdMt@gmail.com
|
|
If we get preempted during xfrm_state_find, we could run
xfrm_state_look_at using a different pcpu_id than the one
xfrm_state_find saw. This could lead to ignoring states that should
have matched, and triggering acquires on a CPU that already has a pcpu
state.
xfrm_state_find starts on CPU1
pcpu_id = 1
lookup starts
<preemption, we're now on CPU2>
xfrm_state_look_at pcpu_id = 2
finds a state
found:
best->pcpu_num != pcpu_id (2 != 1)
if (!x && !error && !acquire_in_progress) {
...
xfrm_state_alloc
xfrm_init_tempstate
...
This can be avoided by passing the original pcpu_id down to all
xfrm_state_look_at() calls.
Also switch to raw_smp_processor_id, disabling preempting just to
re-enable it immediately doesn't really make sense.
Fixes: 1ddf9916ac09 ("xfrm: Add support for per cpu xfrm state handling.")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
In case of preemption, xfrm_state_look_at will find a different
pcpu_id and look up states for that other CPU. If we matched a state
for CPU2 in the state_cache while the lookup started on CPU1, we will
jump to "found", but the "best" state that we got will be ignored and
we will enter the "acquire" block. This block uses state_ptrs, which
isn't initialized at this point.
Let's initialize state_ptrs just after taking rcu_read_lock. This will
also prevent a possible misuse in the future, if someone adjusts this
function.
Reported-by: syzbot+7ed9d47e15e88581dc5b@syzkaller.appspotmail.com
Fixes: e952837f3ddb ("xfrm: state: fix out-of-bounds read during lookup")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:
====================
1) Remove some unnecessary strscpy_pad() size arguments.
From Thorsten Blum.
2) Correct use of xso.real_dev on bonding offloads.
Patchset from Cosmin Ratiu.
3) Add hardware offload configuration to XFRM_MSG_MIGRATE.
From Chiachang Wang.
4) Refactor migration setup during cloning. This was
done after the clone was created. Now it is done
in the cloning function itself.
From Chiachang Wang.
5) Validate assignment of maximal possible SEQ number.
Prevent from setting to the maximum sequrnce number
as this would cause for traffic drop.
From Leon Romanovsky.
6) Prevent configuration of interface index when offload
is used. Hardware can't handle this case.i
From Leon Romanovsky.
7) Always use kfree_sensitive() for SA secret zeroization.
From Zilin Guan.
ipsec-next-2025-05-23
* tag 'ipsec-next-2025-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next:
xfrm: use kfree_sensitive() for SA secret zeroization
xfrm: prevent configuration of interface index when offload is used
xfrm: validate assignment of maximal possible SEQ number
xfrm: Refactor migration setup during the cloning process
xfrm: Migrate offload configuration
bonding: Fix multiple long standing offload races
bonding: Mark active offloaded xfrm_states
xfrm: Add explicit dev to .xdo_dev_state_{add,delete,free}
xfrm: Remove unneeded device check from validate_xmit_xfrm
xfrm: Use xdo.dev instead of xdo.real_dev
net/mlx5: Avoid using xso.real_dev unnecessarily
xfrm: Remove unnecessary strscpy_pad() size arguments
====================
Link: https://patch.msgid.link/20250523075611.3723340-1-steffen.klassert@secunet.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Cross-merge networking fixes after downstream PR (net-6.15-rc8).
Conflicts:
80f2ab46c2ee ("irdma: free iwdev->rf after removing MSI-X")
4bcc063939a5 ("ice, irdma: fix an off by one in error handling code")
c24a65b6a27c ("iidc/ice/irdma: Update IDC to support multiple consumers")
https://lore.kernel.org/20250513130630.280ee6c5@canb.auug.org.au
No extra adjacent changes.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
High-level copy_to_user_* APIs already redact SA secret fields when
redaction is enabled, but the state teardown path still freed aead,
aalg and ealg structs with plain kfree(), which does not clear memory
before deallocation. This can leave SA keys and other confidential
data in memory, risking exposure via post-free vulnerabilities.
Since this path is outside the packet fast path, the cost of zeroization
is acceptable and prevents any residual key material. This patch
replaces those kfree() calls unconditionally with kfree_sensitive(),
which zeroizes the entire buffer before freeing.
Signed-off-by: Zilin Guan <zilin@seu.edu.cn>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
nat_keepalive_sk_ipv[46] is a per-CPU variable and relies on disabled BH
for its locking. Without per-CPU locking in local_bh_disable() on
PREEMPT_RT this data structure requires explicit locking.
Use sock_bh_locked which has a sock pointer and a local_lock_t. Use
local_lock_nested_bh() for locking. This change adds only lockdep
coverage and does not alter the functional behaviour for !PREEMPT_RT.
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20250512092736.229935-7-bigeasy@linutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Both packet and crypto offloads perform decryption while packet is
arriving to the HW from the wire. It means that there is no possible
way to perform lookup on XFRM if_id as it can't be set to be "before' HW.
So instead of silently ignore this configuration, let's warn users about
misconfiguration.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
Users can set any seq/seq_hi/oseq/oseq_hi values. The XFRM core code
doesn't prevent from them to set even 0xFFFFFFFF, however this value
will cause for traffic drop.
Is is happening because SEQ numbers here mean that packet with such
number was processed and next number should be sent on the wire. In this
case, the next number will be 0, and it means overflow which causes to
(expected) packet drops.
While it can be considered as misconfiguration and handled by XFRM
datapath in the same manner as any other SEQ number, let's add
validation to easy for packet offloads implementations which need to
configure HW with next SEQ to send and not with current SEQ like it is
done in core code.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
Prior to this patch, the mark is sanitized (applying the state's mask to
the state's value) only on inserts when checking if a conflicting XFRM
state or policy exists.
We discovered in Cilium that this same sanitization does not occur
in the hot-path __xfrm_state_lookup. In the hot-path, the sk_buff's mark
is simply compared to the state's value:
if ((mark & x->mark.m) != x->mark.v)
continue;
Therefore, users can define unsanitized marks (ex. 0xf42/0xf00) which will
never match any packet.
This commit updates __xfrm_state_insert and xfrm_policy_insert to store
the sanitized marks, thus removing this footgun.
This has the side effect of changing the ip output, as the
returned mark will have the mask applied to it when printed.
Fixes: 3d6acfa7641f ("xfrm: SA lookups with mark")
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Signed-off-by: Louis DeLosSantos <louis.delos.devel@gmail.com>
Co-developed-by: Louis DeLosSantos <louis.delos.devel@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
ipcomp_post_acomp currently drops all frags (via pskb_trim_unique(skb,
0)), and then subtracts the old skb->data_len from truesize. This
adjustment has already be done during trimming (in skb_condense), so
we don't need to do it again.
This shows up for example when running fragmented traffic over ipcomp,
we end up hitting the WARN_ON_ONCE in skb_try_coalesce.
Fixes: eb2953d26971 ("xfrm: ipcomp: Use crypto_acomp interface")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
Previously, migration related setup, such as updating family,
destination address, and source address, was performed after
the clone was created in `xfrm_state_migrate`. This change
moves this setup into the cloning function itself, improving
code locality and reducing redundancy.
The `xfrm_state_clone_and_setup` function now conditionally
applies the migration parameters from struct xfrm_migrate
if it is provided. This allows the function to be used both
for simple cloning and for cloning with migration setup.
Test: Tested with kernel test in the Android tree located
in https://android.googlesource.com/kernel/tests/
The xfrm_tunnel_test.py under the tests folder in
particular.
Signed-off-by: Chiachang Wang <chiachangwang@google.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|