|
Add a second page_pool for header buffers to each Rx queue and the ability
to toggle header split on/off using Ethtool (defaulting to off to match
the current behaviour).
Unlike idpf, none of the HW supported by ice requires any workarounds; it
correctly splits all types of packets as configured: after L4 headers
for TCP/UDP/SCTP, after L3 headers for other IPv4/IPv6 frames, and after
the Ethernet header otherwise (in case of tunneling, same as above,
but after the innermost headers).
This doesn't affect the XSk path as there is no benefit to having
it there.
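A user-space sketch of the toggle (assuming the standard ethtool
tcp-data-split ring parameter is the knob meant here; the interface name is
a placeholder):
    # enable header split, inspect the ring parameters, then revert
    ethtool -G eth0 tcp-data-split on
    ethtool -g eth0 | grep -i split
    ethtool -G eth0 tcp-data-split off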
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Tested-by: Alexander Nowlin <alexander.nowlin@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
This patch completes the transition of the ice driver to use the Page Pool
and libeth APIs, following the same direction as commit 5fa4caff59f2
("iavf: switch to Page Pool"). With the legacy page splitting and recycling
logic already removed, the driver is now in a clean state to adopt the
modern memory model.
The Page Pool integration simplifies buffer management by offloading
DMA mapping and recycling to the core infrastructure. This eliminates
the need for driver-specific handling of headroom, buffer sizing, and
page order. The libeth helper is used for CPU-side processing, while
DMA-for-device is handled by the Page Pool core.
Additionally, this patch extends the conversion to cover XDP support.
The driver now uses libeth_xdp helpers for Rx buffer processing,
and optimizes XDP_TX by skipping per-frame DMA mapping. Instead, all
buffers are mapped as bi-directional up front, leveraging Page Pool's
lifecycle management. This significantly reduces overhead in virtualized
environments.
Performance observations:
- In typical scenarios (netperf, XDP_PASS, XDP_DROP), performance remains
on par with the previous implementation.
- In XDP_TX mode:
* With IOMMU enabled, performance improves dramatically - over 5x
increase - due to reduced DMA mapping overhead and better memory reuse.
* With IOMMU disabled, performance remains comparable to the previous
implementation, with no significant changes observed.
- In XDP_DROP mode:
* For small MTUs (where multiple buffers can be allocated on a single
memory page), a performance drop of approximately 20% is observed.
According to 'perf top' analysis, the bottleneck is caused by atomic
reference counter increments in the Page Pool.
* For normal MTUs (where only one buffer can be allocated within a
single memory page), performance remains comparable to baseline
levels.
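The workloads above can be reproduced with standard tooling, e.g. (a sketch;
the peer address is a placeholder and xdp-bench from the xdp-tools suite is
an assumption, not part of this series):
    netperf -H 198.51.100.2 -t TCP_STREAM    # baseline TCP throughput
    xdp-bench pass eth0                      # XDP_PASS
    xdp-bench drop eth0                      # XDP_DROP
    xdp-bench tx eth0                        # XDP_TX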
This change is also a step toward a more modular and unified XDP
implementation across Intel Ethernet drivers, aligning with ongoing
efforts to consolidate and streamline feature support.
Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Suggested-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>
Tested-by: Alexander Nowlin <alexander.nowlin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
As part of the transition toward Page Pool integration, remove the
legacy page splitting and recycling logic from the ice driver. This
mirrors the approach taken in commit 920d86f3c552 ("iavf: drop page
splitting and recycling").
The previous model attempted to reuse partially consumed pages by
splitting them and tracking their usage across descriptors. While
this was once a memory optimization, it introduced significant
complexity and overhead in the Rx path, including:
- Manual refcount management and page reuse heuristics;
- Per-descriptor buffer shuffling, which could involve moving dozens
of `ice_rx_buf` structures per NAPI cycle;
- Increased branching and cache pressure in the hotpath.
This change simplifies the Rx logic by always allocating fresh pages
and letting the networking stack handle their lifecycle. Although this
may temporarily reduce performance (by up to ~98% in some XDP cases), it
greatly improves maintainability and paves the way for Page Pool,
which will restore and exceed previous performance levels.
The `ice_rx_buf` array is retained for now to minimize diffstat and
ease future replacement with a shared buffer abstraction.
Co-developed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>
Tested-by: Alexander Nowlin <alexander.nowlin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
Commit 53844673d555 ("iavf: kill 'legacy-rx' for good") removed
the legacy Rx path in the iavf driver. This change applies the same
rationale to the ice driver.
The legacy Rx path relied on manual skb allocation and header copying,
which has become increasingly inefficient and difficult to maintain.
With the stabilization of build_skb() and the growing adoption of
features like XDP, page_pool, and multi-buffer support, the legacy
approach is no longer viable.
Key drawbacks of the legacy path included:
- Higher memory pressure due to direct page allocations and splitting;
- Redundant memcpy() operations for packet headers;
- CPU overhead from eth_get_headlen() and Flow Dissector usage;
- Compatibility issues with XDP, which imposes strict headroom and
tailroom requirements.
The ice driver, like iavf, does not benefit from the minimal headroom
savings that legacy Rx once offered, as it already splits pages into
fixed halves. Removing this path simplifies the Rx logic, eliminates
unnecessary branches in the hotpath, and prepares the driver for
upcoming enhancements.
In addition to removing the legacy Rx path, this change also eliminates
the custom construct_skb() functions from both the standard and
zero-copy (ZC) Rx paths. These are replaced with the build_skb()
and standardized xdp_build_skb_from_zc() helpers, aligning the driver
with the modern XDP infrastructure and reducing code duplication.
This cleanup also reduces code complexity and improves maintainability
as we move toward a more unified and modern Rx model across drivers.
Co-developed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>
Tested-by: Alexander Nowlin <alexander.nowlin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
E830 supports Earliest TxTime First (ETF) hardware offload, which is
configured via the ETF Qdisc on a per-queue basis (see tc-etf(8)). ETF
introduces a new Tx flow mechanism that utilizes a timestamp ring
(tstamp_ring) alongside the standard Tx ring. This timestamp ring is
used to indicate when hardware will transmit a packet. Tx Time is
supported on the first 2048 Tx queues of the device, and the NVM image
limits the maximum number of Tx queues to 2048 for the device.
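For reference, per-queue ETF offload is typically configured along these
lines (a sketch based on the tc-etf(8) example; the queue mapping and delta
values are illustrative only):
    # map traffic classes to queues with mqprio, then attach an offloaded
    # ETF qdisc to one hardware queue
    tc qdisc replace dev eth0 parent root handle 100 mqprio \
            num_tc 3 map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \
            queues 1@0 1@1 2@2 hw 0
    tc qdisc replace dev eth0 parent 100:1 etf \
            clockid CLOCK_TAI delta 300000 offload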
The allocation and initialization of the timestamp ring occur when the
feature is enabled on a specific Tx queue via tc-etf. The requested Tx
Time queue index cannot be greater than the number of Tx queues
(vsi->num_txq).
To support ETF, the following flags and bitmap are introduced:
- ICE_F_TXTIME: Device feature flag set for E830 NICs, indicating ETF
support.
- txtime_txqs: PF-level bitmap set when ETF is enabled and cleared
when disabled for a specific Tx queue. It is used by
ice_is_txtime_ena() to check if ETF is allocated and configured on
any Tx queue, which is checked during Tx ring allocation.
- ICE_TX_FLAGS_TXTIME: Per Tx ring flag set when ETF is allocated and
configured for a specific Tx queue. It determines ETF status during
packet transmission and is checked by ice_is_txtime_ena() to verify
if ETF is enabled on any Tx queue.
Due to a hardware issue that can result in a malicious driver detection
event, additional timestamp descriptors are required when wrapping
around the timestamp ring. Up to 64 additional timestamp descriptors
are reserved, reducing the available Tx descriptors.
To accommodate this, ICE_MAX_NUM_DESC_BY_MAC is introduced, defining:
- E830: Maximum Tx descriptor count of 8096 (8K - 32 - 64 for timestamp
fetch descriptors).
- E810 and E82X: Maximum Tx descriptor count of 8160 (8K - 32).
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Co-developed-by: Alice Michael <alice.michael@intel.com>
Signed-off-by: Alice Michael <alice.michael@intel.com>
Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
The ice_put_rx_mbuf() function handles calling ice_put_rx_buf() for each
buffer in the current frame. This function was introduced as part of
handling multi-buffer XDP support in the ice driver.
It works by iterating over the buffers from first_desc up to 1 plus the
total number of fragments in the frame, cached from before the XDP program
was executed.
If the hardware posts a descriptor with a size of 0, the logic used in
ice_put_rx_mbuf() breaks. Such descriptors get skipped and don't get added
as fragments in ice_add_xdp_frag. Since the buffer isn't counted as a
fragment, we do not iterate over it in ice_put_rx_mbuf(), and thus we don't
call ice_put_rx_buf().
Because we don't call ice_put_rx_buf(), we don't attempt to re-use the
page or free it. This leaves a stale page in the ring, as we don't
increment next_to_alloc.
The ice_reuse_rx_page() assumes that the next_to_alloc has been incremented
properly, and that it always points to a buffer with a NULL page. Since
this function doesn't check, it will happily recycle a page over the top
of the next_to_alloc buffer, losing track of the old page.
Note that this leak only occurs for multi-buffer frames. The
ice_put_rx_mbuf() function always handles at least one buffer, so a
single-buffer frame will always get handled correctly. It is not clear
precisely why the hardware sometimes hands us descriptors with a size of 0,
but it happens somewhat regularly with "jumbo frames", i.e. when a 9K MTU is
used.
To fix ice_put_rx_mbuf(), we need to make sure to call ice_put_rx_buf() on
all buffers between first_desc and next_to_clean. Borrow the logic of a
similar function in i40e used for this same purpose. Use the same logic
also in ice_get_pgcnts().
Instead of iterating over just the number of fragments, use a loop which
iterates until the current index reaches the next_to_clean element just
past the current frame. Unlike i40e, the ice_put_rx_mbuf() function does
call ice_put_rx_buf() on the last buffer of the frame, which indicates the
end of the packet.
For non-linear (multi-buffer) frames, we need to take care when adjusting
the pagecnt_bias. An XDP program might release fragments from the tail of
the frame, in which case that fragment page is already released. Only
update the pagecnt_bias for the first descriptor and fragments still
remaining post-XDP program. Take care to only access the shared info for
fragmented buffers, as this avoids a significant cache miss.
The xdp_xmit value only needs to be updated if an XDP program is run, and
only once per packet. Drop the xdp_xmit pointer argument from
ice_put_rx_mbuf(). Instead, set xdp_xmit in the ice_clean_rx_irq() function
directly. This avoids needing to pass the argument and avoids an extra
bit-wise OR for each buffer in the frame.
Move the increment of the ntc local variable to ensure it is updated *before*
all calls to ice_get_pgcnts() or ice_put_rx_mbuf(), as the loop logic
requires the index of the element just after the current frame.
Now that we use an index pointer in the ring to identify the packet, we no
longer need to track or cache the number of fragments in the rx_ring.
Cc: Christoph Petrausch <christoph.petrausch@deepl.com>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Reported-by: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
Closes: https://lore.kernel.org/netdev/CAK8fFZ4hY6GUJNENz3wY9jaYLZXGfpr7dnZxzGMYoE44caRbgw@mail.gmail.com/
Fixes: 743bbd93cf29 ("ice: put Rx buffers after being done with current frame")
Tested-by: Michal Kubiak <michal.kubiak@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Tested-by: Priya Singh <priyax.singh@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
The "ice" driver implementation uses the control VSI to handle
the flow director configuration for PFs and VFs.
Unfortunately, although a separate VSI type was created to handle flow
director queues, the Rx queue handler was shared between the flow
director and a standard NAPI Rx handler.
Such a design approach was not very flexible. First, it mixed hotpath
and slowpath code, blocking further optimization of either. It also added
significant overhead to flow director command processing, which is
descriptor-based only, so there is no need to allocate Rx data buffers.
For the above reasons, implement a separate Rx handler for the control
VSI. Also, remove from the NAPI handler the code dedicated to
configuring the flow director rules on VFs.
Do not allocate Rx data buffers to the flow director queues because
their processing is descriptor-based only.
Finally, allow Rx data queues to be allocated only for VSIs that have
netdev assigned to them.
This handler splitting approach is the first step in converting the
driver to use the Page Pool (which can only be used for data queues).
Test hints:
1. Create a VF for any PF managed by the ice driver.
2. In a loop, add and delete flow director rules for the VF, e.g.:
for i in {1..128}; do
        q=$(( i % 16 ))
        ethtool -N ens802f0v0 flow-type tcp4 dst-port "$i" action "$q"
done
for i in {0..127}; do
        ethtool -N ens802f0v0 delete "$i"
done
Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Suggested-by: Michal Swiatkowski <michal.swiatkowski@intel.com>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
E830 supports raw receive and generic transmit checksum offloads.
Raw receive checksum support is provided by hardware calculating the
checksum over the whole packet, regardless of type. The calculated
checksum is provided to driver in the Rx flex descriptor. Then the driver
assigns the checksum to skb->csum and sets skb->ip_summed to
CHECKSUM_COMPLETE.
Generic transmit checksum support is provided by hardware calculating the
checksum given two offsets: the start offset to begin checksum calculation,
and the offset to insert the calculated checksum in the packet. Support is
advertised to the stack using NETIF_F_HW_CSUM feature.
E830 has the following limitations when both generic transmit checksum
offload and TCP Segmentation Offload (TSO) are enabled:
1. Inner packet header modification is not supported. This restriction
includes the inability to alter TCP flags, such as the push flag. As a
result, this limitation can impact the receiver's ability to coalesce
packets, potentially degrading network throughput.
2. The Maximum Segment Size (MSS) is limited to 1023 bytes, which prevents
support of Maximum Transmission Unit (MTU) greater than 1063 bytes.
Therefore NETIF_F_HW_CSUM and NETIF_F_ALL_TSO features are mutually
exclusive. NETIF_F_HW_CSUM hardware feature support is indicated but is not
enabled by default. Instead, IP checksums and NETIF_F_ALL_TSO are the
defaults. Enforcement of mutual exclusivity of NETIF_F_HW_CSUM and
NETIF_F_ALL_TSO is done in ice_set_features(). Mutual exclusivity
of IP checksums and NETIF_F_HW_CSUM is handled by netdev_fix_features().
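A user-space sketch of the resulting behaviour (feature strings assumed from
the standard netdev feature names):
    # NETIF_F_HW_CSUM is exposed as "tx-checksum-ip-generic"; enabling it
    # forces TSO off, since the two are mutually exclusive on E830
    ethtool -K eth0 tx-checksum-ip-generic on
    ethtool -k eth0 | grep -E 'tx-checksum-ip-generic|tcp-segmentation-offload'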
When NETIF_F_HW_CSUM is requested the provided skb->csum_start and
skb->csum_offset are passed to hardware in the Tx context descriptor
generic checksum (GCS) parameters. Hardware calculates the 1's complement
from skb->csum_start to the end of the packet, and inserts the result in
the packet at skb->csum_offset.
Co-developed-by: Alice Michael <alice.michael@intel.com>
Signed-off-by: Alice Michael <alice.michael@intel.com>
Co-developed-by: Eric Joyner <eric.joyner@intel.com>
Signed-off-by: Eric Joyner <eric.joyner@intel.com>
Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250310174502.3708121-2-anthony.l.nguyen@intel.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
The idea behind having ice_rx_buf::act was to simplify and speed up the Rx
data path by walking through buffers that were representing cleaned HW
Rx descriptors. Since it caused us a major headache recently and we
rolled back to the old approach that 'puts' Rx buffers right after running
the XDP prog/creating the skb, this is useless now and should be removed.
Get rid of ice_rx_buf::act and related logic. We still need to take care
of a corner case where the XDP program releases a particular fragment.
Make ice_run_xdp() return its result and use it within
ice_put_rx_mbuf().
Fixes: 2fba7dc5157b ("ice: Add support for XDP multi-buffer on Rx side")
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: Chandan Kumar Rout <chandanx.rout@intel.com> (A Contingent Worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
Paolo Abeni says:
====================
net: introduce TX H/W shaping API
We have a plurality of shaping-related driver APIs, but none flexible
enough to meet existing demand from vendors[1].
This series introduces new device APIs to configure TX H/W shaping in a
flexible way. The new functionalities are exposed via a newly
defined generic netlink interface and include introspection
capabilities. Some self-tests are included, on top of a dummy
netdevsim implementation. Finally a basic implementation for the iavf
driver is provided.
Some usage examples:
* Configure shaping on a given queue:
./tools/net/ynl/cli.py --spec Documentation/netlink/specs/shaper.yaml \
    --do set --json '{"ifindex": '$IFINDEX',
                      "shaper": {"handle":
                                   {"scope": "queue", "id":'$QUEUEID'},
                                 "bw-max": 2000000}}'
* Container B/W sharing
The orchestration infrastructure wants to group the
container-related queues under an RR scheduler and limit the aggregate
bandwidth:
./tools/net/ynl/cli.py --spec Documentation/netlink/specs/shaper.yaml \
    --do group --json '{"ifindex": '$IFINDEX',
                        "leaves": [
                          {"handle": {"scope": "queue", "id":'$QID1'},
                           "weight": '$W1'},
                          {"handle": {"scope": "queue", "id":'$QID2'},
                           "weight": '$W2'},
                          {"handle": {"scope": "queue", "id":'$QID3'},
                           "weight": '$W3'}],
                        "handle": {"scope":"node"},
                        "bw-max": 10000000}'
{'ifindex': $IFINDEX, 'handle': {'scope': 'node', 'id': 0}}
Q1 \
    \
Q2 -- node 0 ------- netdev
    / (bw-max: 10M)
Q3 /
* Delegation
A container wants to limit the aggregate bandwidth of 2 of the 3
queues it owns - the starting configuration is the one from the
previous point:
SPEC=Documentation/netlink/specs/net_shaper.yaml
./tools/net/ynl/cli.py --spec $SPEC \
    --do group --json '{"ifindex": '$IFINDEX',
                        "leaves": [
                          {"handle": {"scope": "queue", "id":'$QID1'},
                           "weight": '$W1'},
                          {"handle": {"scope": "queue", "id":'$QID2'},
                           "weight": '$W2'}],
                        "handle": {"scope": "node"},
                        "bw-max": 5000000 }'
{'ifindex': $IFINDEX, 'handle': {'scope': 'node', 'id': 1}}
Q1 -- node 1 -------\
    / (bw-max: 5M)   \
Q2 /                  node 0 ------- netdev
                     / (bw-max: 10M)
Q3 -----------------/
In a group operation, when the leaves have different parents prior to the
operation itself, the user must specify the parent handle for the
group. I.e., starting from the previous config:
./tools/net/ynl/cli.py --spec $SPEC \
    --do group --json '{"ifindex": '$IFINDEX',
                        "leaves": [
                          {"handle": {"scope": "queue", "id":'$QID1'},
                           "weight": '$W1'},
                          {"handle": {"scope": "queue", "id":'$QID3'},
                           "weight": '$W3'}],
                        "handle": {"scope": "node"},
                        "bw-max": 3000000 }'
Netlink error: Invalid argument
nl_len = 96 (80) nl_flags = 0x300 nl_type = 2
error: -22
extack: {'msg': 'All the leaves shapers must have the same old parent'}
./tools/net/ynl/cli.py --spec $SPEC \
    --do group --json '{"ifindex": '$IFINDEX',
                        "leaves": [
                          {"handle": {"scope": "queue", "id":'$QID1'},
                           "weight": '$W1'},
                          {"handle": {"scope": "queue", "id":'$QID3'},
                           "weight": '$W3'}],
                        "handle": {"scope": "node"},
                        "parent": {"scope": "node", "id": 1},
                        "bw-max": 3000000 }'
{'ifindex': $IFINDEX, 'handle': {'scope': 'node', 'id': 2}}
Q1 -- node 2 -----\
    /(bw-max: 3M)  \
Q3 /                node 1 -------\
                   / (bw-max: 5M)  \
Q2 ---------------/                 node 0 ------- netdev
                                      (bw-max: 10M)
* Cleanup:
Still starting from the previous config, to delete a single queue shaper:
./tools/net/ynl/cli.py --spec $SPEC --do delete --json \
'{"ifindex": '$IFINDEX',
"handle": {"scope": "queue", "id":'$QID3'}}'
Q1 -- node 2 -----\
     (bw-max: 3M)  \
                    node 1 -------\
                   / (bw-max: 5M)  \
Q2 ---------------/                 node 0 ------- netdev
                                      (bw-max: 10M)
Deleting a node shaper relinks all its leaves to the node's parent:
./tools/net/ynl/cli.py --spec $SPEC --do delete --json \
'{"ifindex": '$IFINDEX',
"handle": {"scope": "node", "id":2}}'
Q1 ---\
       \
        node 1 -------\
       / (bw-max: 5M)  \
Q2 ---/                 node 0 ------- netdev
                          (bw-max: 10M)
Deleting the last shaper under a node shaper deletes the node, too:
./tools/net/ynl/cli.py --spec $SPEC --do delete --json \
'{"ifindex": '$IFINDEX',
"handle": {"scope": "queue", "id":'$QID1'}}'
./tools/net/ynl/cli.py --spec $SPEC --do delete --json \
'{"ifindex": '$IFINDEX',
"handle": {"scope": "queue", "id":'$QID2'}}'
./tools/net/ynl/cli.py --spec $SPEC --do get --json \
'{"ifindex": '$IFINDEX',
"handle": {"scope": "node", "id": 1}}'
Netlink error: No such file or directory
nl_len = 44 (28) nl_flags = 0x300 nl_type = 2
error: -2
extack: {'bad-attr': '.handle'}
Such delete recurses on parents that are left over with no leaves:
./tools/net/ynl/cli.py --spec $SPEC --do get --json \
'{"ifindex": '$IFINDEX',
"handle": {"scope": "node", "id": 0}}'
Netlink error: No such file or directory
nl_len = 44 (28) nl_flags = 0x300 nl_type = 2
error: -2
extack: {'bad-attr': '.handle'}
v8: https://lore.kernel.org/cover.1727704215.git.pabeni@redhat.com
v7: https://lore.kernel.org/cover.1725919039.git.pabeni@redhat.com
v6: https://lore.kernel.org/cover.1725457317.git.pabeni@redhat.com
v5: https://lore.kernel.org/cover.1724944116.git.pabeni@redhat.com
v4: https://lore.kernel.org/cover.1724165948.git.pabeni@redhat.com
v3: https://lore.kernel.org/cover.1722357745.git.pabeni@redhat.com
RFC v2: https://lore.kernel.org/cover.1721851988.git.pabeni@redhat.com
RFC v1: https://lore.kernel.org/cover.1719518113.git.pabeni@redhat.com
====================
Link: https://patch.msgid.link/cover.1728460186.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add support to configure the VF queue rate limit and quanta size.
For quanta size configuration, the quanta profiles are divided evenly
among the PFs. For each port, the first quanta profile is reserved as the
default. When a VF is asked to set the queue quanta size, the PF will search
for an available profile, change the fields, and assign this profile to the
queue.
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Wenjun Wu <wenjun1.wu@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/fddefc2c1ec3ab32b241ce444af401da19e834dd.1728460186.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The max_frame and rx_buf_len fields of the VSI set the maximum frame size
for packets on the wire, and configure the size of the Rx buffer. In the
hardware, these are per-queue configuration. Most VSI types use a simple
method to determine the size of the buffers for all queues.
However, VFs may potentially configure different values for each queue.
While the Linux iAVF driver does not do this, it is allowed by the virtchnl
interface.
The current virtchnl code simply sets the per-VSI fields in between calls to
ice_vsi_cfg_single_rxq(). This technically works, as these fields are only
ever used when programming the Rx ring, and otherwise not checked again.
However, it is confusing to maintain.
The Rx ring also already has an rx_buf_len field in order to access the
buffer length in the hotpath. It also has extra unused bytes in the ring
structure which we can make use of to store the maximum frame size.
Drop the VSI max_frame and rx_buf_len fields. Add max_frame to the Rx ring,
and slightly re-order rx_buf_len to better fit into the gaps in the
structure layout.
Change the ice_vsi_cfg_frame_size function so that it writes to the ring
fields. Call this function once per ring in ice_vsi_cfg_rxqs(). This is
done over calling it inside the ice_vsi_cfg_rxq(), because
ice_vsi_cfg_rxq() is called in the virtchnl flow where the max_frame and
rx_buf_len have already been configured.
Change the accesses for rx_buf_len and max_frame to all point to the ring
structure. This has the added benefit that ice_vsi_cfg_rxq() no longer has
the surprise side effect of updating ring->rx_buf_len based on the VSI
field.
Update the virtchnl ice_vc_cfg_qs_msg() function to set the ring values
directly, and drop references to the removed VSI fields.
This now makes the VF logic clear, as the ring fields are obviously
per-queue. This reduces the required cognitive load when reasoning about
this logic.
Note that removing the VSI fields does leave a 4 byte gap, but the ice_vsi
structure has many gaps, and its layout is not as critical in the hot path.
The structure may benefit from a more thorough repacking, but no attempt
was made in this change.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
Add an ICE_RX_FLAG_MULTIDEV flag to the Rx ring.
If it is set, try to find the correct port representor based on the
src_vsi value stored in the flex descriptor. The IDs of the representor
pointers stored in the xarray are equal to the corresponding src_vsi
values, so the correct representor can be looked up directly from src_vsi.
Set the multidev flag during ring configuration.
If the mode is switchdev, change the ring descriptor to the one that
contains the src_vsi value.
The PF netdev should be reconfigured; do it by calling ice_down() and
ice_up() if the netdev was up before configuring switchdev.
Reviewed-by: Marcin Szycik <marcin.szycik@linux.intel.com>
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
Fix an OOM panic in XDP_DRV mode when a XDP program shrinks a
multi-buffer packet by 4k bytes and then redirects it to an AF_XDP
socket.
Since support for handling multi-buffer frames was added to XDP, usage
of bpf_xdp_adjust_tail() helper within XDP program can free the page
that given fragment occupies and in turn decrease the fragment count
within skb_shared_info that is embedded in xdp_buff struct. In current
ice driver codebase, it can become problematic when page recycling logic
decides not to reuse the page. In such case, __page_frag_cache_drain()
is used with ice_rx_buf::pagecnt_bias that was not adjusted after
refcount of page was changed by XDP prog which in turn does not drain
the refcount to 0 and page is never freed.
To address this, let us store on the Rx ring struct the count of frags from
before the XDP program was executed. This will be compared with the
current frag count from the skb_shared_info embedded in xdp_buff. A smaller
value in the latter indicates that the XDP prog freed frag(s). Then, for
the given delta, decrement pagecnt_bias for the XDP_DROP verdict.
While at it, let us also handle the EOP frag within
ice_set_rx_bufs_act() to make our life easier, so all of the adjustments
needed to be applied against freed frags are performed in the single
place.
Fixes: 2fba7dc5157b ("ice: Add support for XDP multi-buffer on Rx side")
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/r/20240124191602.566724-5-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Implement .xmo_rx_vlan_tag callback to allow XDP code to read
packet's VLAN tag.
At the same time, use vlan_tci instead of vlan_tag in touched code,
because VLAN tag often refers to VLAN proto and VLAN TCI combined,
while in the code we clearly store only VLAN TCI.
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Link: https://lore.kernel.org/r/20231205210847.28460-11-larysa.zaremba@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Use previously refactored code and create a function
that allows XDP code to read HW timestamp.
Also, introduce packet context, where hints-related data will be stored.
ice_xdp_buff contains only a pointer to this structure, to avoid copying it
in ZC mode later in the series.
HW timestamp is the first supported hint in the driver,
so also add xdp_metadata_ops.
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Link: https://lore.kernel.org/r/20231205210847.28460-6-larysa.zaremba@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
In order to use XDP hints via kfuncs we need to put
RX descriptor and miscellaneous data next to xdp_buff.
Same as in hints implementations in other drivers, we achieve
this through putting xdp_buff into a child structure.
Currently, xdp_buff is stored in the ring structure,
so replace it with union that includes child structure.
This way enough memory is available while existing XDP code
remains isolated from hints.
Minimum size of the new child structure (ice_xdp_buff) is exactly
64 bytes (single cache line). To place it at the start of a cache line,
move 'next' field from CL1 to CL4, as it isn't used often. This still
leaves 192 bits available in CL3 for packet context extensions.
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/r/20231205210847.28460-5-larysa.zaremba@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Before performing a Tx timestamp in ice_stamp(), the driver checks a ptp_tx
ring variable to see if timestamping is enabled on that ring. This value is
set for all rings whenever userspace configures Tx timestamping.
Ostensibly this was done to avoid wasting cycles checking other fields when
timestamping has not been enabled. However, for Tx timestamps we already
get an individual per-SKB flag indicating whether userspace wants to
request a timestamp on that packet. We do not gain much by also having
a separate flag to check for whether timestamping was enabled.
In fact, the driver currently fails to restore the field after a PF reset.
Because of this, if a PF reset occurs, timestamps will be disabled.
Since this flag doesn't add value in the hotpath, remove it and always
provide a timestamp if the SKB flag has been set.
A following change will fix the reset path to properly restore user
timestamping configuration completely.
This went unnoticed for some time because one of the most common
applications using Tx timestamps, ptp4l, will reconfigure the socket as
part of its fault recovery logic.
Fixes: ea9b847cda64 ("ice: enable transmit timestamps for E810 devices")
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
As not all ICE_TX_FLAGS_* fit in the 16-bit tx_flags field that was
introduced in the Fixes commit, VLAN-related information would be discarded
completely. As such, creating a VLAN and trying to run ping through it
would result in no traffic passing.
Fix that by refactoring the tx_flags variable into flags only and
a separate variable that holds the VLAN ID. As there is some space left,
the type variable can fit between those two. Pahole reports no size
change to the ice_tx_buf struct.
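A reproduction sketch for the broken case (interface name and addresses are
placeholders):
    ip link add link eth0 name eth0.100 type vlan id 100
    ip addr add 192.0.2.1/24 dev eth0.100
    ip link set eth0.100 up
    ping -c 3 192.0.2.2    # passes again once the VLAN ID survives in ice_tx_buf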
Fixes: aa1d3faf71a6 ("ice: Robustify cleaning/completing XDP Tx buffers")
Signed-off-by: Jan Sokolowski <jan.sokolowski@intel.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
As already mentioned, freeing any &xdp_frame via page_frag_free() is
wrong, as it assumes the frame is backed by either an order-0 page or
a page with no "patrons" behind it, while in fact frames backed by
Page Pool can be redirected to a device whose driver doesn't use it.
Keep storing a pointer to the raw buffer and then freeing it
unconditionally via page_frag_free() for %XDP_TX frames, but introduce
a separate type in the enum for frames coming through .ndo_xdp_xmit(),
and free them via xdp_return_frame_bulk(). Note that saving xdpf as
xdp_buff->data_hard_start is intentional and is always true when
everything is configured properly.
After this change, %XDP_REDIRECT from a Page Pool based driver to ice
becomes zero-alloc as it should be and horrendous 3.3 Mpps / queue
turn into 6.6, hehe.
Let it go with no "Fixes:" tag as it spans across good 5+ commits and
can't be trivially backported.
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/bpf/20230210170618.1973430-6-alexandr.lobakin@intel.com
|
|
When queueing frames from a Page Pool for redirecting to a device backed
by the ice driver, `perf top` shows heavy load on page_alloc() and
page_frag_free(), despite that on a properly working system it must be
fully or at least almost zero-alloc. The problem is in fact a bit deeper
and raises from how ice cleans up completed Tx buffers.
The story so far: when cleaning/freeing the resources related to
a particular completed Tx frame (skbs, DMA mappings etc.), ice uses some
heuristics only without setting any type explicitly (except for dummy
Flow Director packets, which are marked via ice_tx_buf::tx_flags).
This kinda works, but only up to some point. For example, currently ice
assumes that each frame coming to __ice_xmit_xdp_ring() is backed by
either a plain order-0 page or a plain page frag, while it may also be
backed by Page Pool or any other possible memory model introduced in
the future. This means any &xdp_frame must be freed properly via the
xdp_return_frame() family with no assumptions.
In order to do that, the whole heuristics must be replaced with setting
the Tx buffer/frame type explicitly, just how it's always been done via
an enum. Let us reuse 16 bits from ::tx_flags -- 1 bit-and instr won't
hurt much -- especially given that sometimes there was a check for
%ICE_TX_FLAGS_DUMMY_PKT, which is now turned from a flag to an enum
member. The rest of the changes are straightforward and most of them are
just a conversion to rely on the type set in &ice_tx_buf rather than
on some secondary properties.
For now, no functional changes are intended; the change only prepares the
ground for starting to free XDP frames properly in the next step. And it
must be done atomically/synchronously to not break stuff.
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/bpf/20230210170618.1973430-5-alexandr.lobakin@intel.com
|
|
Let us store pointer to xdp_buff that came from xsk_buff_pool on tx_buf
so that it will be possible to recycle it via xsk_buff_free() on Tx
cleaning side. This way it is not necessary to do expensive copy to
another xdp_buff backed by a newly allocated page.
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Link: https://lore.kernel.org/bpf/20230131204506.219292-14-maciej.fijalkowski@intel.com
|
|
Now that both ZC and standard XDP data paths stopped using Tx logic
based on next_dd and next_rs fields, we can safely remove these fields
and shrink Tx ring structure.
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Link: https://lore.kernel.org/bpf/20230131204506.219292-13-maciej.fijalkowski@intel.com
|
|
Similarly as for Rx side in previous patch, logic on XDP Tx in ice
driver needs to be adjusted for multi-buffer support. Specifically, the
way how HW Tx descriptors are produced and cleaned.
Currently, XDP_TX works on strict ring boundaries, meaning it sets RS
bit (on producer side) / looks up DD bit (on consumer/cleaning side)
every quarter of the ring. It means that if, for example, a multi-buffer
frame spans the ring quarter boundary (say the frame consists of 4
fragments and we start from descriptor 62 on a ring sized to 256 entries),
the RS bit would be produced in the middle of the multi-buffer frame,
which would be broken behavior as it needs to be set on the last
descriptor of the frame.
To make it work, set RS bit at the last descriptor from the batch of
frames that XDP_TX action was used on and make the first entry remember
the index of last descriptor with RS bit set. This way, cleaning side
can take the index of descriptor with RS bit, look up DD bit's presence
and clean from first entry to last.
In order to clean up the code base introduce the common ice_set_rs_bit()
which will return index of descriptor that got RS bit produced on so
that standard driver can store this within proper ice_tx_buf and ZC
driver can simply ignore return value.
Co-developed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com>
Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Link: https://lore.kernel.org/bpf/20230131204506.219292-12-maciej.fijalkowski@intel.com
|
|
Ice driver needs to be a bit reworked on Rx data path in order to
support multi-buffer XDP. For skb path, it currently works in a way that
Rx ring carries pointer to skb so if driver didn't manage to combine
fragmented frame at current NAPI instance, it can restore the state on
next instance and keep looking for last fragment (so descriptor with EOP
bit set). What needs to be achieved is that xdp_buff needs to be
combined in such way (linear + frags part) in the first place. Then skb
will be ready to go in case of XDP_PASS or BPF program being not present
on interface. If BPF program is there, it would work on multi-buffer
XDP. At this point xdp_buff resides directly on Rx ring, so given the
fact that skb will be built straight from xdp_buff, there will be no
further need to carry skb on Rx ring.
Besides removing skb pointer from Rx ring, lots of members have been
moved around within ice_rx_ring. First and foremost reason was to place
rx_buf with xdp_buff on the same cacheline. This means that once we
touch rx_buf (which is a preceding step before touching xdp_buff),
xdp_buff will already be hot in cache. Second thing was that xdp_rxq is
used rather rarely and it occupies a separate cacheline, so maybe it is
better to have it at the end of ice_rx_ring.
Another change that affects ice_rx_ring is the introduction of
ice_rx_ring::first_desc. Its purpose is twofold - first, to propagate
rx_buf->act to all the parts of the current xdp_buff after running the XDP
program, so that ice_put_rx_buf(), which got moved out of the main Rx
processing loop, will be able to take an appropriate action on each
buffer. Second, for ice_construct_skb().
ice_construct_skb() has a copybreak mechanism which had an explicit
impact on xdp_buff->skb conversion in the new approach when legacy Rx
flag is toggled. It works in a way that linear part is 256 bytes long,
if frame is bigger than that, remaining bytes are going as a frag to
skb_shared_info.
This means while memcpying frags from xdp_buff to newly allocated skb,
care needs to be taken when picking the destination frag array entry.
Upon the time ice_construct_skb() is called, when dealing with
fragmented frame, current rx_buf points to the *last* fragment, but
copybreak needs to be done against the first one. That's where
ice_rx_ring::first_desc helps.
When frame building spans across NAPI polls (the DD bit is not set on the
current descriptor and xdp->data is not NULL), the current Rx buffer
handling state might cause problems.
Since calls to ice_put_rx_buf() were pulled out of the main Rx
processing loop and were scoped from cached_ntc to the current ntc,
remember that the mentioned function now relies on rx_buf->act, which is
set within ice_run_xdp(). ice_run_xdp() is called when the EOP bit is
found, so currently we could put an Rx buffer with rx_buf->act being
*uninitialized*. To address this, change the scoping to rely on first_desc
on both boundaries instead.
This also implies that cleaned_count, which is used as an input to
ice_alloc_rx_buffers() and tells how many new buffers should be refilled,
has to be adjusted. If it stayed as is, ntc could end up going past ntu.
Therefore, remove cleaned_count altogether and, for the allocation routine,
use the newly introduced ICE_RX_DESC_UNUSED() macro, which is an
equivalent of ICE_DESC_UNUSED() dedicated to the Rx side and based on
struct ice_rx_ring::first_desc instead of next_to_clean.
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Link: https://lore.kernel.org/bpf/20230131204506.219292-11-maciej.fijalkowski@intel.com
|
|
Currently calls to ice_put_rx_buf() are sprinkled through
ice_clean_rx_irq() - first place is for explicit flow director's
descriptor handling, second is after running XDP prog and the last one
is after taking care of skb.
The first call site was actually only there to bump ntc, as the Rx buffer
to be recycled is not even passed to the function.
It is possible to walk through the Rx buffers processed in a particular
NAPI cycle by caching ntc from the beginning of ice_clean_rx_irq().
To do so, let us store the XDP verdict inside ice_rx_buf, so the action we
need to take will be known. In the absence of an XDP prog, just store
ICE_XDP_PASS as the verdict.
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Link: https://lore.kernel.org/bpf/20230131204506.219292-7-maciej.fijalkowski@intel.com
|
|
This will allow us to avoid carrying an additional auxiliary array of page
counts when dealing with XDP multi-buffer support. Previously, combining a
fragmented frame into an skb was not affected in the same way as XDP will
be, since the whole frame needs to be in place before executing the XDP
prog. Therefore, when going through HW Rx descriptors one-by-one, calls to
ice_put_rx_buf() need to be taken *after* running the XDP prog on a
potentially multi-buffered frame, so some additional storage of the
page count is needed.
By adding page count to rx buf, it will make it easier to walk through
processed entries at the end of rx cleaning routine and decide whether
or not buffers should be recycled.
While at it, bump ice_rx_buf::pagecnt_bias from u16 up to u32. It was
proven many times that calculations on variables smaller than standard
register size are harmful. This was also the case during experiments
with embedding page count to ice_rx_buf - when this was added as u16 it
had a performance impact.
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Link: https://lore.kernel.org/bpf/20230131204506.219292-4-maciej.fijalkowski@intel.com
|
|
In preparation for XDP multi-buffer support, let's store xdp_buff on
Rx ring struct. This will allow us to combine fragmented frames across
separate NAPI cycles in the same way as currently skb fragments are
handled. This means that the skb pointer on the Rx ring will become
redundant and will be removed. For now it is kept and the layout of the Rx
ring struct was not inspected; some member movement will be needed later
on, and that will be the time to take care of it.
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Link: https://lore.kernel.org/bpf/20230131204506.219292-3-maciej.fijalkowski@intel.com
|
|
Rx path is going to be modified in a way that fragmented frame will be
gathered within xdp_buff in the first place. This approach implies that
underlying buffer has to provide tailroom for skb_shared_info. This is
currently the case when ring uses build_skb but not when legacy-rx knob
is turned on. This case configures 2k Rx buffers and has no way to
provide either headroom or tailroom - FWIW it currently has
XDP_PACKET_HEADROOM, which is broken and is removed here. 2k Rx
buffers were used so that the driver in this setting was able to support a
9k MTU, as it can chain up to 5 Rx buffers. With an offset configured, the
HW writing 2k of data would cross the half-page boundary, which broke the
assumption of our internal page recycling tricks.
Now, if the above were fixed and the legacy-rx path left as is, the
packet's contents would be corrupted again when referring to
skb_shared_info via xdp_get_shared_info_from_buff(). Hence the size of the
Rx buffer needs to be lowered, and therefore the supported MTU as well.
This operation allows us to keep the unified data path, and legacy-rx users
(if any) with an 8k MTU would still be good to go. However, the tendency is
to drop support for this code path at some point.
Add ICE_RXBUF_1664 as vsi::rx_buf_len and ICE_MAX_FRAME_LEGACY_RX (8320)
as vsi::max_frame for legacy-rx. For bigger page sizes configure 3k Rx
buffers, not 2k.
Since headroom support is removed, disable data_meta support on legacy-rx.
When preparing XDP buff, rely on ice_rx_ring::rx_offset setting when
deciding whether to support data_meta or not.
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Link: https://lore.kernel.org/bpf/20230131204506.219292-2-maciej.fijalkowski@intel.com
|
|
Resets may occur with or without user interaction. For example, a TX hang
or reconfiguration of parameters will result in a reset. During reset, the
VSI is freed, freeing any statistics structures inside as well. This would
create an issue for the user where a reset happens in the background,
statistics are set to zero, and the user then checks ring statistics
expecting them to be populated.
To ensure this doesn't happen, accumulate ring statistics over reset.
Define a new ring statistics structure, ice_ring_stats. The new structure
lives in the VSI's parent, preserving ring statistics when VSI is freed.
1. Define a new structure vsi_ring_stats in the PF scope
2. Allocate/free stats only during probe, unload, or change in ring size
3. Replace previous ring statistics functionality with new structure
Signed-off-by: Benjamin Mikailenko <benjamin.mikailenko@intel.com>
Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
The driver can allow the user to configure whether the CRC aka the FCS
(Frame Check Sequence) is DMA'd to the host as part of the receive
buffer. The driver usually wants this feature disabled so that the
hardware checks the FCS and strips it in order to save PCI bandwidth.
Control the reception of FCS to the host using the command:
ethtool -K eth0 rx-fcs <on|off>
The default, shown by "ethtool -k eth0 | grep fcs", should be "off", as the
hardware will drop any frame with a bad checksum, and DMA of the
checksum is useless overhead, especially for small packets.
Testing Hints:
test the FCS/CRC arrives with received packets using
tcpdump -nnpi eth0 -xxxx
and it should show crc data as the last 4 bytes of the packet. Can also
use wireshark to turn on CRC checking and check the data is correct.
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Co-developed-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Co-developed-by: Benjamin Mikailenko <benjamin.mikailenko@intel.com>
Signed-off-by: Benjamin Mikailenko <benjamin.mikailenko@intel.com>
Co-developed-by: Anatolii Gerasymenko <anatolii.gerasymenko@intel.com>
Signed-off-by: Anatolii Gerasymenko <anatolii.gerasymenko@intel.com>
Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
drivers/net/ethernet/mellanox/mlx5/core/main.c
b33886971dbc ("net/mlx5: Initialize flow steering during driver probe")
40379a0084c2 ("net/mlx5_fpga: Drop INNOVA TLS support")
f2b41b32cde8 ("net/mlx5: Remove ipsec_ops function table")
https://lore.kernel.org/all/20220519040345.6yrjromcdistu7vh@sx1/
16d42d313350 ("net/mlx5: Drain fw_reset when removing device")
8324a02c342a ("net/mlx5: Add exit route when waiting for FW")
https://lore.kernel.org/all/20220519114119.060ce014@canb.auug.org.au/
tools/testing/selftests/net/mptcp/mptcp_join.sh
e274f7154008 ("selftests: mptcp: add subflow limits test-cases")
b6e074e171bc ("selftests: mptcp: add infinite map testcase")
5ac1d2d63451 ("selftests: mptcp: Add tests for userspace PM type")
https://lore.kernel.org/all/20220516111918.366d747f@canb.auug.org.au/
net/mptcp/options.c
ba2c89e0ea74 ("mptcp: fix checksum byte order")
1e39e5a32ad7 ("mptcp: infinite mapping sending")
ea66758c1795 ("tcp: allow MPTCP to update the announced window")
https://lore.kernel.org/all/20220519115146.751c3a37@canb.auug.org.au/
net/mptcp/pm.c
95d686517884 ("mptcp: fix subflow accounting on close")
4d25247d3ae4 ("mptcp: bypass in-kernel PM restrictions for non-kernel PMs")
https://lore.kernel.org/all/20220516111435.72f35dca@canb.auug.org.au/
net/mptcp/subflow.c
ae66fb2ba6c3 ("mptcp: Do TCP fallback on early DSS checksum failure")
0348c690ed37 ("mptcp: add the fallback check")
f8d4bcacff3b ("mptcp: infinite mapping receiving")
https://lore.kernel.org/all/20220519115837.380bb8d4@canb.auug.org.au/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Adaptive-rx and Adaptive-tx are interrupt moderation settings
that can be enabled/disabled using ethtool:
ethtool -C ethX adaptive-rx on/off adaptive-tx on/off
Unfortunately those settings are getting cleared after
changing number of queues, or in ethtool world 'channels':
ethtool -L ethX rx 1 tx 1
Clearing was happening due to the introduction of bit fields
in the ice_ring_container struct, where only the itr_setting
bits were rebuilt during ice_vsi_rebuild_set_coalesce().
Introduce an anonymous struct of bitfields and create a
union to refer to them as a single variable.
This way the variable can be easily saved and restored.
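A quick way to verify the fix (a sketch; ethX is a placeholder):
    ethtool -C ethX adaptive-rx on adaptive-tx on
    ethtool -L ethX rx 1 tx 1
    ethtool -c ethX | grep -i adaptive    # should still report on/on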
Fixes: 61dc79ced7aa ("ice: Restore interrupt throttle settings after VSI rebuild")
Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
When XSK pool uses need_wakeup feature, correlate -ENOBUFS that was
returned from xdp_do_redirect() with a XSK Rx queue being full. In such
case, terminate the Rx processing that is being done on the current HW
Rx ring and let the user space consume descriptors from XSK Rx queue so
that there is room that driver can use later on.
Introduce new internal return code ICE_XDP_EXIT that will indicate case
described above.
Note that it does not affect Tx processing that is bound to the same
NAPI context, nor the other Rx rings.
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220413153015.453864-6-maciej.fijalkowski@intel.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:
====================
100GbE Intel Wired LAN Driver Updates 2022-02-09
This series contains updates to ice driver only.
Brett adds support for QinQ. This begins with code refactoring and
re-organization of VLAN configuration functions to allow for
introduction of VSI VLAN ops to enable setting and calling of
respective operations based on device support of single or double
VLANs. Implementations are added for outer VLAN support.
To support QinQ, the device must be set to double VLAN mode (DVM).
In order for this to occur, the DDP package and NVM must also support
DVM. Functions to determine compatibility and properly configure the
device are added as well as setting the proper bits to advertise and
utilize the proper offloads. Support for VIRTCHNL_VF_OFFLOAD_VLAN_V2
is also included to allow for VF to negotiate and utilize this
functionality.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Currently the driver only supports 802.1Q VLAN insertion and stripping.
However, once Double VLAN Mode (DVM) is fully supported, then both 802.1Q
and 802.1ad VLAN insertion and stripping will be supported. Unfortunately
the VSI context parameters only allow for one VLAN ethertype at a time
for VLAN offloads so only one or the other VLAN ethertype offload can be
supported at once.
To support this, multiple changes are needed.
Rx path changes:
[1] In DVM, the Rx queue context l2tagsel field needs to be cleared so
the outermost tag shows up in the l2tag2_2nd field of the Rx flex
descriptor. In Single VLAN Mode (SVM), the l2tagsel field should remain
1 to support SVM configurations.
[2] Modify the ice_test_staterr() function to take a __le16 instead of
the ice_32b_rx_flex_desc union pointer so this function can be used for
both rx_desc->wb.status_error0 and rx_desc->wb.status_error1.
[3] Add the new inline function ice_get_vlan_tag_from_rx_desc() that
checks if there is a VLAN tag in l2tag1 or l2tag2_2nd.
[4] In ice_receive_skb(), add a check to see if NETIF_F_HW_VLAN_STAG_RX
is enabled in netdev->features. If it is, then this is the VLAN
ethertype that needs to be added to the stripping VLAN tag. Since
ice_fix_features() prevents CTAG_RX and STAG_RX from being enabled
simultaneously, the VLAN ethertype will only ever be 802.1Q or 802.1ad.
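A user-space sketch of the CTAG/STAG Rx toggles referenced in [4] (feature
names assumed from the standard netdev feature strings):
    ethtool -K eth0 rxvlan off                   # NETIF_F_HW_VLAN_CTAG_RX
    ethtool -K eth0 rx-vlan-stag-hw-parse on     # NETIF_F_HW_VLAN_STAG_RX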
Tx path changes:
[1] In DVM, the VLAN tag needs to be placed in the l2tag2 field of the Tx
context descriptor. The new define ICE_TX_FLAGS_HW_OUTER_SINGLE_VLAN was
added to the list of tx_flags to handle this case.
[2] When the stack requests the VLAN tag to be offloaded on Tx, the
driver needs to set either ICE_TX_FLAGS_HW_OUTER_SINGLE_VLAN or
ICE_TX_FLAGS_HW_VLAN, so the tag is inserted in l2tag2 or l2tag1
respectively. To determine which location to use, set a bit in the Tx
ring flags field during ring allocation that can be used to determine
which field to use in the Tx descriptor. In DVM, always use l2tag2,
and in SVM, always use l2tag1.
Signed-off-by: Brett Creeley <brett.creeley@intel.com>
Tested-by: Gurucharan G <gurucharanx.g@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
One of the things that commit 5574ff7b7b3d ("i40e: optimize AF_XDP Tx
completion path") introduced was the @xdp_tx_active field. Its usage
can be adapted from i40e to the ice driver and gives us positive
performance results.
If the descriptor that @next_dd points to has been sent by HW (its DD
bit is set), then we are sure that at least a quarter of the ring is
ready to be cleaned. If @xdp_tx_active is 0, meaning the related
xdp_ring is not used for XDP_{TX, REDIRECT} workloads, then we know
exactly how many XSK entries should be placed in the completion
queue, IOW walking through the ring can be skipped.
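In sketch form, with helper and field names assumed (the actual
cleaning routine differs in details):

static u16 ice_count_xsk_frames(struct ice_tx_ring *xdp_ring, u16 budget)
{
	u16 ntc = xdp_ring->next_to_clean;
	u16 xsk_frames = 0, i;

	/* no XDP_TX/REDIRECT frames in flight: every completed slot
	 * holds an XSK frame, so the ring walk can be skipped
	 */
	if (likely(!xdp_ring->xdp_tx_active))
		return budget;

	for (i = 0; i < budget; i++) {
		struct ice_tx_buf *tx_buf = &xdp_ring->tx_buf[ntc];

		if (tx_buf->raw_buf) {
			/* XDP_TX/REDIRECT buffer, freed one by one */
			ice_clean_xdp_tx_buf(xdp_ring, tx_buf);
			tx_buf->raw_buf = NULL;
		} else {
			xsk_frames++;
		}

		if (++ntc == xdp_ring->count)
			ntc = 0;
	}

	return xsk_frames;
}

The result is then handed to xsk_tx_completed() to release the
entries back to the pool.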
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20220125160446.78976-9-maciej.fijalkowski@intel.com
|
|
Apply the logic that was done for regular XDP in commit 9610bd988df9
("ice: optimize XDP_TX workloads") to the ZC side of the driver. On
top of that, introduce batching to Tx that is inspired by i40e's
implementation, with adjustments to the cleaning logic - take the
NAPI budget into account in ice_clean_xdp_irq_zc().
Placing the stats structs on separate cache lines also seemed to
improve the performance.
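A sketch of the batched descriptor fill, with ring wrap handling
omitted and helper names assumed:

static void ice_xmit_pkts_batch(struct ice_tx_ring *xdp_ring,
				struct xdp_desc *descs, u32 nb_pkts)
{
	struct xsk_buff_pool *pool = xdp_ring->xsk_pool;
	u16 ntu = xdp_ring->next_to_use;
	u32 i;

	for (i = 0; i < nb_pkts; i++) {
		struct ice_tx_desc *tx_desc = ICE_TX_DESC(xdp_ring, ntu++);
		dma_addr_t dma = xsk_buff_raw_get_dma(pool, descs[i].addr);

		xsk_buff_raw_dma_sync_for_device(pool, dma, descs[i].len);

		tx_desc->buf_addr = cpu_to_le64(dma);
		tx_desc->cmd_type_offset_bsz =
			ice_build_ctob(ICE_TX_DESC_CMD_EOP, 0,
				       descs[i].len, 0);
	}

	/* a single ring-state update for the whole batch */
	xdp_ring->next_to_use = ntu;
}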
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Link: https://lore.kernel.org/bpf/20220125160446.78976-8-maciej.fijalkowski@intel.com
|
|
XDP_TX workloads use the concept of a Tx threshold that indicates the
interval of setting the RS bit on descriptors, which in turn tells
the HW to generate an interrupt to signal that Tx completed on the HW
side. It is currently based on a constant value of 32, which might
not work out well for various ring sizes combined with, for example,
a batch size that can be set via SO_BUSY_POLL_BUDGET.
Internal tests based on AF_XDP showed that the most convenient
setting for this threshold is a quarter of the ring length. Make use
of the recently introduced ICE_RING_QUARTER macro and use this value
as a substitute for ICE_TX_THRESH.
Also align the ethtool -G callback so that the next_dd/next_rs fields
are kept up to date with the ring size.
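In sketch form (field names as above):

#define ICE_RING_QUARTER(R)	((R)->count >> 2)

/* after the ring size changes via 'ethtool -G', reset the Tx
 * bookkeeping so both fields match the new quarter-of-ring interval
 */
xdp_ring->next_dd = ICE_RING_QUARTER(xdp_ring) - 1;
xdp_ring->next_rs = ICE_RING_QUARTER(xdp_ring) - 1;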
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20220125160446.78976-5-maciej.fijalkowski@intel.com
|
|
Currently, if ice_clean_rx_irq_zc() processed the whole ring and
next_to_use != 0, then ice_alloc_rx_bufs_zc() would not refill the
whole ring even if the XSK buffer pool had enough free entries
(either from the fill ring or the internal recycle mechanism) - this
is because the ring wrap is not handled.
Improve the logic in ice_alloc_rx_bufs_zc() to address the problem
above. Instead of clamping the count of buffers passed to
xsk_buff_alloc_batch() when next_to_use + buffer count >=
rx_ring->count, split it into two calls to that function - one for
the part up until the wrap and one for the part after the wrap.
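A sketch of the split, assuming a helper __ice_alloc_rx_bufs_zc()
that fills one contiguous chunk via xsk_buff_alloc_batch() and wraps
next_to_use when it hits the ring end:

static bool ice_alloc_rx_bufs_zc(struct ice_rx_ring *rx_ring, u16 count)
{
	u16 ntu = rx_ring->next_to_use;
	u16 to_end = rx_ring->count - ntu;

	if (count >= to_end) {
		/* first chunk: fill up to the end of the ring */
		if (!__ice_alloc_rx_bufs_zc(rx_ring, to_end))
			return false;
		count -= to_end;
	}

	/* second chunk: the remainder from the start of the ring */
	return count ? __ice_alloc_rx_bufs_zc(rx_ring, count) : true;
}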
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Link: https://lore.kernel.org/bpf/20220125160446.78976-4-maciej.fijalkowski@intel.com
|
|
Currently cleaned_count is initialized to ICE_DESC_UNUSED(rx_ring)
and later on, during Rx processing, it is incremented for each frame
the driver consumed. This can result in excessive buffers being
requested from the xsk pool based on that value.
To address this, just drop cleaned_count and pass
ICE_DESC_UNUSED(rx_ring) directly as a function argument to
ice_alloc_rx_bufs_zc(). The idea is to ask for exactly as many
buffers as were consumed. Also call ice_alloc_rx_bufs_zc()
unconditionally at the end of ice_clean_rx_irq_zc(); the
corresponding ice_clean_rx_irq() was already changed this way, but
this path was not.
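The tail of ice_clean_rx_irq_zc() then boils down to (a sketch):

/* refill exactly as many buffers as were consumed */
failure = !ice_alloc_rx_bufs_zc(rx_ring, ICE_DESC_UNUSED(rx_ring));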
Fixes: 2d4238f55697 ("ice: Add support for AF_XDP")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: Kiran Bhandare <kiranx.bhandare@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
Add the infrastructure required for "ndo_setup_tc:qdisc_mqprio".
ice_vsi_setup is modified to configure traffic classes based on the
mqprio data received from the stack. This includes low-level
functions to configure the min and max rate-limit parameters in
hardware for traffic classes. Each traffic class gets mapped to a
hardware channel (VSI) which can be individually configured with
different bandwidth parameters.
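A minimal sketch of how such a hook is typically wired up (the
handler and helper names here are assumptions, not necessarily the
exact driver code):

static int ice_setup_tc(struct net_device *netdev, enum tc_setup_type type,
			void *type_data)
{
	switch (type) {
	case TC_SETUP_QDISC_MQPRIO:
		/* type_data carries num_tc, per-TC queue counts/offsets
		 * and the min/max rates requested by the stack
		 */
		return ice_setup_tc_mqprio_qdisc(netdev, type_data);
	default:
		return -EOPNOTSUPP;
	}
}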
Co-developed-by: Tarun Singh <tarun.k.singh@intel.com>
Signed-off-by: Tarun Singh <tarun.k.singh@intel.com>
Signed-off-by: Kiran Patil <kiran.patil@intel.com>
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Sudheer Mogilappagari <sudheer.mogilappagari@intel.com>
Tested-by: Bharathi Sreenivas <bharathi.sreenivas@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
Under rare circumstances there might be a situation where the
requirement of having an XDP Tx queue per CPU cannot be fulfilled and
some of the Tx resources have to be shared between CPUs. This yields
a need to place accesses to xdp_ring inside a critical section
protected by a spinlock. These accesses happen to be in the hot path,
so let's introduce a static branch that will be enabled from the
control plane when the driver cannot provide a Tx queue dedicated to
XDP on each CPU.
Currently, the design that has been picked is to allow any number of
XDP Tx queues that is at least half the count of CPUs that the
platform has. For a lower number, the driver will bail out and tell
the user that there were not enough Tx resources to allow configuring
XDP. The sharing of rings is signalled via static branch enablement,
which in turn indicates that the lock for xdp_ring accesses needs to
be taken in the hot path.
The approach based on a static branch has no impact on the
performance of the non-fallback path. Note that the static branch
acts as a global driver switch, meaning that if one PF runs out of Tx
resources, then the other PFs serviced by the ice driver will suffer
as well. However, given that the HW handled by the ice driver has
1024 Tx queues per PF, this is currently an unlikely scenario.
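In sketch form (key and field names follow the description above and
are assumptions):

DEFINE_STATIC_KEY_FALSE(ice_xdp_locking_key);

/* control plane: enable ring sharing when queues are scarce */
if (vsi->num_xdp_txq < num_possible_cpus())
	static_branch_inc(&ice_xdp_locking_key);

/* hot path: the lock is taken only when the branch is enabled */
if (static_branch_unlikely(&ice_xdp_locking_key))
	spin_lock(&xdp_ring->tx_lock);
ice_xmit_xdp_ring(xdp->data, xdp->data_end - xdp->data, xdp_ring);
if (static_branch_unlikely(&ice_xdp_locking_key))
	spin_unlock(&xdp_ring->tx_lock);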
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
Optimize Tx descriptor cleaning for XDP. The current approach doesn't
really scale and chokes when multiple flows are handled.
Introduce two ring fields, @next_dd and @next_rs, that will keep
track of the descriptor that should be looked at when the need for
cleaning arises and the descriptor that should have the RS bit set,
respectively.
Note that at this point the threshold is a constant (32), but it is
something that we could make configurable.
The first thing is to get away from setting the RS bit on each
descriptor. Let's do this only once NTU is higher than the current
@next_rs value. In such a case, grab tx_desc[next_rs], set the RS bit
in the descriptor and advance @next_rs by 32.
The second thing is to clean the Tx ring only when there are fewer
than 32 free entries. For that case, look up tx_desc[next_dd] for the
DD bit. This bit is written back by HW to let the driver know that
the xmit was successful. It will happen only for those descriptors
that had the RS bit set. Clean only 32 descriptors and advance
@next_dd by 32.
The actual cleaning routine is moved from ice_napi_poll() down to
ice_xmit_xdp_ring(). It is safe to do so as the XDP ring will not get
any SKBs in there that would rely on interrupts for the cleaning. A
nice side effect is that for the rare case of the Tx fallback path
(which the next patch is going to introduce) we don't have to trigger
the SW irq to clean the ring.
With those two concepts, the ring is kept almost full, but it is
guaranteed that the driver will be able to produce Tx descriptors.
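Both pieces in sketch form, with the threshold still hardcoded (macro
and field names as above):

/* in ice_xmit_xdp_ring(), after writing the data descriptor */
if (xdp_ring->next_to_use > xdp_ring->next_rs) {
	tx_desc = ICE_TX_DESC(xdp_ring, xdp_ring->next_rs);
	tx_desc->cmd_type_offset_bsz |=
		cpu_to_le64(ICE_TX_DESC_CMD_RS << ICE_TXD_QW1_CMD_S);
	xdp_ring->next_rs += ICE_TX_THRESH;
}

/* ...and at its start, clean only when the ring is nearly full */
if (ICE_DESC_UNUSED(xdp_ring) < ICE_TX_THRESH)
	ice_clean_xdp_irq(xdp_ring);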
This approach seems to work out well even though the Tx descriptors
are produced one by one. The test was conducted with the ice HW
bombarded with packets from a HW generator configured to generate 30
flows.
Xdp2 sample yields the following results:
<snip>
proto 17: 79973066 pkt/s
proto 17: 80018911 pkt/s
proto 17: 80004654 pkt/s
proto 17: 79992395 pkt/s
proto 17: 79975162 pkt/s
proto 17: 79955054 pkt/s
proto 17: 79869168 pkt/s
proto 17: 79823947 pkt/s
proto 17: 79636971 pkt/s
</snip>
As that sample reports the Rx'ed frames, let's look at the sar
output. It shows that what we Rx'ed we do actually Tx, with no
noticeable drops.
Average: IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
Average: ens4f1 79842324.00 79842310.40 4678261.17 4678260.38 0.00 0.00 0.00 38.32
with tx_busy staying calm.
When compared to a state before:
Average: IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
Average: ens4f1 90919711.60 42233822.60 5327326.85 2474638.04 0.00 0.00 0.00 43.64
it can be observed that the amount of txpck/s is almost doubled,
meaning that the performance is improved by around 90%. All of this
is due to the drops previously happening in the driver: the tx_busy
stat was being bumped at a 7 Mpps rate.
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
With the rings being split, it is now convenient to introduce a
pointer to the XDP ring within the Rx ring. For XDP_TX workloads this
means that the xdp_rings array access, which was previously executed
for each processed frame, will be skipped.
Also, read the XDP prog once per NAPI and, if a prog is present, set
up the local xdp_ring pointer. Reading the prog a single time was
discussed in [1] with some concern raised by Toke around dispatcher
handling and the need to go through an RCU grace period in the
ndo_bpf driver callback, but ice currently tears down the NAPI
instances regardless of the prog presence on the VSI.
Although the pointer to the XDP ring introduced to the Rx ring makes
things a lot slimmer/simpler, I still feel that a single prog read
per NAPI lifetime is beneficial.
A further patch that will introduce the fallback path will also
profit from this, as the xdp_ring pointer will be set during the XDP
rings setup.
[1]: https://lore.kernel.org/bpf/87k0oseo6e.fsf@toke.dk/
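In sketch form (once per NAPI poll, before the per-frame loop):

/* read the prog and resolve the XDP ring a single time */
xdp_prog = READ_ONCE(rx_ring->xdp_prog);
if (xdp_prog)
	xdp_ring = rx_ring->xdp_ring;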
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
While it was convenient to have a generic ring structure that served
both Tx and Rx sides, the next commits are going to introduce several
Tx-specific fields, so in order to avoid hurting the Rx side, let's
split the ring onto new ice_tx_ring and ice_rx_ring structs.
The Rx ring could have kept the old ice_ring name, which would reduce
the code churn within this patch, but that would make things
asymmetric.
Make a union out of the ring pointer within the ring container in
ice_q_vector so that it is possible to iterate over the newly
introduced ice_tx_ring.
Remove @size as it's only accessed from the control path and can be
calculated pretty easily.
Change the definitions of ice_update_ring_stats and
ice_fetch_u64_stats_per_ring so that they are ring agnostic and can
be used for both Rx and Tx rings.
The sizes of the Rx and Tx ring structs are 256 and 192 bytes,
respectively. In the Rx ring, xdp_rxq_info occupies its own
cacheline, so that's the major difference now.
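The container union might look like this (a sketch; the surrounding
fields are omitted):

struct ice_ring_container {
	/* the same slot addresses either ring type */
	union {
		struct ice_rx_ring *rx_ring;
		struct ice_tx_ring *tx_ring;
	};
	/* ... ITR state, counters, etc. ... */
};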
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: Gurucharan G <gurucharanx.g@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
Currently ice_container_type is scoped only to ice_ethtool.c. The
next commit, which will split the ice_ring struct into Rx/Tx specific
ring structs, is also going to modify the type of the linked list of
rings within ice_ring_container. Therefore, the functions taking
ice_ring_container as an input argument will need to be aware of the
ring type being looked up.
Embed ice_container_type within ice_ring_container and initialize it
properly when allocating the q_vectors.
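In sketch form:

enum ice_container_type {
	ICE_RX_CONTAINER,
	ICE_TX_CONTAINER,
};

struct ice_ring_container {
	/* ... */
	enum ice_container_type type;
};

/* when allocating the q_vectors */
q_vector->rx.type = ICE_RX_CONTAINER;
q_vector->tx.type = ICE_TX_CONTAINER;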
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: Gurucharan G <gurucharanx.g@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
This field is dead; the driver makes no use of it. Simply remove it.
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: Gurucharan G <gurucharanx.g@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
Daniel Borkmann says:
====================
bpf-next 2021-10-02
We've added 85 non-merge commits during the last 15 day(s) which contain
a total of 132 files changed, 13779 insertions(+), 6724 deletions(-).
The main changes are:
1) Massive update on test_bpf.ko coverage for JITs as preparatory work for
an upcoming MIPS eBPF JIT, from Johan Almbladh.
2) Add a batched interface for RX buffer allocation in AF_XDP buffer pool,
with driver support for i40e and ice from Magnus Karlsson.
3) Add legacy uprobe support to libbpf to complement recently merged legacy
kprobe support, from Andrii Nakryiko.
4) Add bpf_trace_vprintk() as variadic printk helper, from Dave Marchevsky.
5) Support saving the register state in verifier when spilling <8byte bounded
scalar to the stack, from Martin Lau.
6) Add libbpf opt-in for stricter BPF program section name handling as part
of libbpf 1.0 effort, from Andrii Nakryiko.
7) Add a document to help clarifying BPF licensing, from Alexei Starovoitov.
8) Fix skel_internal.h to propagate errno if the loader indicates an internal
error, from Kumar Kartikeya Dwivedi.
9) Fix build warnings with -Wcast-function-type so that the option can later
be enabled by default for the kernel, from Kees Cook.
10) Fix libbpf to ignore STT_SECTION symbols in legacy map definitions as it
otherwise errors out when encountering them, from Toke Høiland-Jørgensen.
11) Teach libbpf to recognize specialized maps (such as for perf RB) and
internally remove BTF type IDs when creating them, from Hengqi Chen.
12) Various fixes and improvements to BPF selftests.
====================
Link: https://lore.kernel.org/r/20211002001327.15169-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Implement code to handle submission of APP TLVs containing DSCP to TC
mapping.
The first such mapping received on an interface will cause that PF to
switch to L3 DSCP QoS mode, apply the default config for that mode,
and apply the received mapping.
Only one mapping is allowed per DSCP value, and when the last DSCP
mapping is deleted, the PF will switch back into L2 VLAN QoS mode,
applying the appropriate default QoS settings.
L3 DSCP QoS mode is only allowed in SW DCBx mode, in other words,
when the FW LLDP engine is disabled. Commands that would break this
mutual exclusivity are blocked.
Co-developed-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Signed-off-by: Dave Ertman <david.m.ertman@intel.com>
Tested-by: Gurucharan G <gurucharanx.g@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|