git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras
Pull EDAC updates from Borislav Petkov:
- imh_edac: Add a new EDAC driver for Intel Diamond Rapids and future
incarnations of this memory controller architecture
- amd64_edac: Remove the legacy csrow sysfs interface which has been
deprecated and unused (we assume) for at least a decade
- Add the capability to fall back to BIOS-provided address translation
functionality (ACPI PRM) which can be used on systems unsupported by
the current AMD address translation library
- The usual fixes, fixlets, cleanups and improvements all over the
place
* tag 'edac_updates_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
RAS/AMD/ATL: Replace bitwise_xor_bits() with hweight16()
EDAC/igen6: Fix error handling in igen6_edac driver
EDAC/imh: Setup 'imh_test' debugfs testing node
EDAC/{skx_comm,imh}: Detect 2-level memory configuration
EDAC/skx_common: Extend the maximum number of DRAM chip row bits
EDAC/{skx_common,imh}: Add EDAC driver for Intel Diamond Rapids servers
EDAC/skx_common: Prepare for skx_set_hi_lo()
EDAC/skx_common: Prepare for skx_get_edac_list()
EDAC/{skx_common,skx,i10nm}: Make skx_register_mci() independent of pci_dev
EDAC/ghes: Replace deprecated strcpy() in ghes_edac_report_mem_error()
EDAC/ie31200: Fix error handling in ie31200_register_mci
RAS/CEC: Replace use of system_wq with system_percpu_wq
EDAC: Remove the legacy EDAC sysfs interface
EDAC/amd64: Remove NUM_CONTROLLERS macro
EDAC/amd64: Generate ctl_name string at runtime
RAS/AMD/ATL: Require PRM support for future systems
ACPI: PRM: Add acpi_prm_handler_available()
RAS/AMD/ATL: Return error codes from helper functions
|
|
The igen6_edac driver calls device_initialize() for all memory
controllers in igen6_register_mci(), but misses corresponding
put_device() calls in error paths and during normal shutdown in
igen6_unregister_mcis().
Adding the missing put_device() calls improves code readability and
ensures proper reference counting for the device structure.
Found by code review.
Signed-off-by: Ma Ke <make24@iscas.ac.cn>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://patch.msgid.link/20251105090244.23327-1-make24@iscas.ac.cn
|
|
Setup the following debugfs testing node to enable fake memory error
address decoding tests for the imh_edac driver.
/sys/kernel/debug/edac/imh_test/addr
Tested-by: Yi Lai <yi1.lai@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://patch.msgid.link/20251119134132.2389472-8-qiuxu.zhuo@intel.com
|
|
Detect 2-level memory configurations and notify the 'skx_common' library
to enable ADXL 2-level memory error decoding.
Tested-by: Yi Lai <yi1.lai@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://patch.msgid.link/20251119134132.2389472-7-qiuxu.zhuo@intel.com
|
|
The allowed maximum number of row bits for DRAM chips in the Diamond
Rapids server processor is up to 19. Extend the current maximum row
bits from 18 to 19.
Tested-by: Yi Lai <yi1.lai@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://patch.msgid.link/20251119134132.2389472-6-qiuxu.zhuo@intel.com
|
|
Intel Diamond Rapids CPUs include Integrated Memory and I/O Hubs (IMH).
The memory controllers within the IMHs provide memory stacks to the
processor. Create a new driver for these IMH-based memory controllers
rather than applying additional patches to the existing i10nm_edac.c
for the following reasons:
1) The memory controllers are not presented as PCI devices; instead,
the detection and all their registers have been transitioned to
MMIO-based memory spaces.
2) Validation processes are costly. Modifications to i10nm_edac would
require extensive validation checks against multiple platforms,
including Ice Lake, Sapphire Rapids, Emerald Rapids, Granite Rapids,
Sierra Forest, and Grand Ridge.
3) Future Intel CPUs will likely only need patches on top of this new
EDAC driver. Validation can be limited to Diamond Rapids servers
and future Intel CPU generations.
[Tony: Fix kerneldoc for struct local_reg]
[randconfig: Added dependencies on NFIT and DMI]
Tested-by: Yi Lai <yi1.lai@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://patch.msgid.link/20251119134132.2389472-5-qiuxu.zhuo@intel.com
|
|
The upcoming imh_edac driver for Intel Diamond Rapids servers cannot
use skx_get_hi_lo() in skx_common to retrieve the TOHM (Top of High
Memory) and TOLM (Top of Low Memory) parameters. Instead, it obtains
these parameters within its own EDAC driver. To accommodate this,
prepare skx_set_hi_lo() to allow the driver to notify skx_common of
these parameters.
Tested-by: Yi Lai <yi1.lai@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://patch.msgid.link/20251119134132.2389472-4-qiuxu.zhuo@intel.com
|
|
The Intel EDAC library 'skx_common' maintains the Intel server EDAC device
list for {skx, i10nm}_edac drivers, which use skx_get_all_bus_mappings()
to build and retrieve the EDAC device list.
However, the upcoming Intel EDAC driver, imh_edac, for Diamond Rapids
servers is designed for memory controllers that are MMIO-based devices
rather than PCI devices. Consequently, it can't use
skx_get_all_bus_mappings() due to the absence of a PCI bus. To accommodate
this, prepare skx_get_edac_list() to enable the upcoming imh_edac driver
to obtain the EDAC device list from the skx_common library and build the
EDAC device list independently.
Tested-by: Yi Lai <yi1.lai@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://patch.msgid.link/20251119134132.2389472-3-qiuxu.zhuo@intel.com
|
|
Memory controllers in the new Intel server CPUs, such as Diamond Rapids,
are presented as MMIO-based devices rather than PCI devices.
Modify skx_register_mci() to be independent of 'pci_dev' and use a generic
'dev' of 'struct device' to prepare for support of such MMIO-based memory
controllers.
Tested-by: Yi Lai <yi1.lai@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://patch.msgid.link/20251119134132.2389472-2-qiuxu.zhuo@intel.com
|
|
strcpy() has been deprecated¹ because it performs no bounds checking on the
destination buffer, which can lead to buffer overflows. Use the safer
strscpy() instead.
¹ https://www.kernel.org/doc/html/latest/process/deprecated.html#strcpy
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Link: https://patch.msgid.link/20251118135621.101148-2-thorsten.blum@linux.dev
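
The bounds-checking concern behind the strcpy() replacement above can be
sketched in userspace. This is only an illustration: bounded_copy() is a
hypothetical stand-in for the behavior of strscpy(), and it returns -1 on
truncation where the kernel helper returns -E2BIG.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical userspace stand-in for the kernel's strscpy(): copy at most
 * size - 1 bytes, always NUL-terminate, and return the number of bytes
 * copied, or -1 when the source had to be truncated.
 */
static long bounded_copy(char *dst, const char *src, size_t size)
{
	size_t len;

	assert(dst && src);
	if (size == 0)
		return -1;

	/* Measure src, but never look at more than size bytes of it. */
	for (len = 0; len < size && src[len] != '\0'; len++)
		;

	if (len == size) {              /* src does not fit: truncate */
		memcpy(dst, src, size - 1);
		dst[size - 1] = '\0';
		return -1;
	}

	memcpy(dst, src, len + 1);      /* includes the terminating NUL */
	return (long)len;
}
```

Unlike strcpy(), the destination size is an explicit argument, so an
oversized source string is truncated and reported instead of overflowing
the buffer.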
|
|
The current single-bit error injection mechanism flips bits directly in ECC RAM
by performing write and read operations. When the ECC RAM is actively used by
the Ethernet or USB controller, this approach sometimes triggers a false
double-bit error.
Switch both Ethernet and USB EDAC devices to use the INTTEST register
(altr_edac_a10_device_inject_fops) for single-bit error injection, similar to
the existing double-bit error injection method.
Fixes: 064acbd4f4ab ("EDAC, altera: Add Stratix10 peripheral support")
Signed-off-by: Niravkumar L Rabara <niravkumarlaxmidas.rabara@altera.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Dinh Nguyen <dinguyen@kernel.org>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20251111081333.1279635-1-niravkumarlaxmidas.rabara@altera.com
|
|
The OCRAM ECC is always enabled either by the BootROM or by the Secure Device
Manager (SDM) during a power-on reset on SoCFPGA.
However, during a warm reset, the OCRAM content is retained to preserve data,
while the control and status registers are reset to their default values. As
a result, ECC must be explicitly re-enabled after a warm reset.
Fixes: 17e47dc6db4f ("EDAC/altera: Add Stratix10 OCRAM ECC support")
Signed-off-by: Niravkumar L Rabara <niravkumarlaxmidas.rabara@altera.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Dinh Nguyen <dinguyen@kernel.org>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20251111080801.1279401-1-niravkumarlaxmidas.rabara@altera.com
|
|
ie31200_register_mci() calls device_initialize() for priv->dev
unconditionally. However, in the error path, put_device() is not
called, leading to an imbalance. Similarly, in the unload path,
put_device() is missing.
Although edac_mc_free() eventually frees the memory, it does not
release the device initialized by device_initialize(). For code
readability and proper pairing of device_initialize()/put_device(),
add put_device() calls in both error and unload paths.
Found by code review.
Signed-off-by: Ma Ke <make24@iscas.ac.cn>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Link: https://patch.msgid.link/20251106084735.35017-1-make24@iscas.ac.cn
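
The device_initialize()/put_device() pairing described above can be modeled
with a plain reference counter. This is a userspace sketch, not driver code:
struct refdev, dev_init_sketch(), dev_put_sketch() and the register/unregister
helpers are hypothetical names that only mimic the reference-counting rule.

```c
#include <assert.h>

/* Hypothetical model of a refcounted device. */
struct refdev {
	int refcount;
	int released;   /* set once the last reference is dropped */
};

/* Mimics device_initialize(): the object starts with one reference. */
static void dev_init_sketch(struct refdev *d)
{
	d->refcount = 1;
	d->released = 0;
}

/* Mimics put_device(): drop a reference, release on the last one. */
static void dev_put_sketch(struct refdev *d)
{
	assert(d->refcount > 0);
	if (--d->refcount == 0)
		d->released = 1;
}

/* Registration that may fail after the device has been initialized. */
static int register_sketch(struct refdev *d, int fail)
{
	dev_init_sketch(d);
	if (fail) {
		dev_put_sketch(d);      /* error path: balance the init */
		return -1;
	}
	return 0;
}

/* Unload path: drop the reference taken at initialization. */
static void unregister_sketch(struct refdev *d)
{
	dev_put_sketch(d);
}
```

In this model, leaving out either dev_put_sketch() call would leave the
reference count at one forever, which is the imbalance the commit describes.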
|
|
The current code assumes that only DDR errors have split messages. Ensure
proper logging of non-standard event errors that may be split across multiple
messages too.
[ bp: Massage, move comment too, fix it up. ]
Fixes: d5fe2fec6c40 ("EDAC: Add a driver for the AMD Versal NET DDR controller")
Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://patch.msgid.link/20251023113108.3467132-1-shubhrajyoti.datta@amd.com
|
|
Commit
199747106934 ("edac: add a new per-dimm API and make the old per-virtual-rank API obsolete")
introduced a new per-DIMM sysfs interface for EDAC making the old
per-virtual-rank sysfs interface obsolete.
Since this new sysfs interface was introduced more than a decade ago, remove
the obsolete legacy interface.
Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20251106015727.1987246-1-avadhut.naik@amd.com
|
|
Currently, the NUM_CONTROLLERS macro is used to limit the number of memory
controllers (UMCs) available per node. The number of UMCs available per node,
however, is already cached by the max_mcs variable of struct amd64_pvt.
Allocate the relevant data structures dynamically using the variable instead
of static allocation through the macro.
The max_mcs variable is used for legacy systems too. These systems have a max
of 2 controllers. Since the default value of max_mcs, set in per_family_init(),
is 2, these legacy systems are also covered.
Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20251106015727.1987246-1-avadhut.naik@amd.com
|
|
Currently, the ctl_name string is statically assigned based on the family and
model of the SOC when the amd64_edac module is loaded.
The same, however, is not exactly needed as the string can be generated and
assigned at runtime through scnprintf().
Remove all static assignments and generate the string at runtime. Also,
clean up the switch cases which became defunct and consolidate identical cases.
Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20251106015727.1987246-1-avadhut.naik@amd.com
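
Generating a controller name from family and model at runtime might look
like the sketch below. The format string "F%02Xh_M%02Xh" and the helper name
make_ctl_name() are assumptions for illustration, and userspace snprintf()
stands in for the kernel's scnprintf().

```c
#include <assert.h>
#include <stdio.h>

/*
 * Build a name such as "F1Ah_M50h" from numeric family and model values.
 * Returns the name length, or -1 if the buffer is too small.
 * The format string is an assumption, not taken from the driver.
 */
static int make_ctl_name(char *name, size_t size,
			 unsigned int fam, unsigned int model)
{
	int n;

	assert(name && size > 0);
	n = snprintf(name, size, "F%02Xh_M%02Xh", fam, model);
	if (n < 0 || (size_t)n >= size)
		return -1;
	return n;
}
```

With a single format string, a new family or model needs no additional
per-model string assignment, which is the simplification the commit is after.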
|
|
The priv->mci[] array has NUM_CONTROLLERS elements, so this > comparison
needs to be >= to prevent an out-of-bounds access.
Fixes: d5fe2fec6c40 ("EDAC: Add a driver for the AMD Versal NET DDR controller")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
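
The off-by-one described above is easy to see in a reduced form. The sketch
below is not the driver code: struct priv_sketch, mci_slot() and the value of
NUM_CONTROLLERS are placeholders chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_CONTROLLERS 8   /* illustrative value, not taken from the driver */

struct priv_sketch {
	int mci[NUM_CONTROLLERS];   /* valid indices: 0 .. NUM_CONTROLLERS - 1 */
};

/* Return a pointer to slot idx, or NULL when idx is out of range. */
static int *mci_slot(struct priv_sketch *priv, unsigned int idx)
{
	assert(priv);
	/* A '>' test would let idx == NUM_CONTROLLERS through, one past the end. */
	if (idx >= NUM_CONTROLLERS)
		return NULL;
	return &priv->mci[idx];
}
```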
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras
Pull EDAC updates from Borislav Petkov:
- Add support for new AMD family 0x1a models to amd64_edac
- Add an EDAC driver for the AMD VersalNET memory controller which
reports hw errors from different IP blocks in the fabric using an
IPC-type transport
- Drop the silly static number of memory controllers in the Intel EDAC
drivers (skx, i10nm) in favor of a flexible array so that the former
doesn't need to be increased with every new generation which adds
more memory controllers; along with a proper refactoring
- Add support for two Alder Lake-S SOCs to ie31200_edac
- Add an EDAC driver for ARM Cortex A72 cores, and specifically for
reporting L1 and L2 cache errors
- Last but not least, the usual fixes, cleanups and improvements all
over the subsystem
* tag 'edac_updates_for_v6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras: (23 commits)
EDAC/versalnet: Return the correct error in mc_probe()
EDAC/mc_sysfs: Increase legacy channel support to 16
EDAC/amd64: Add support for AMD family 1Ah-based newer models
EDAC: Add a driver for the AMD Versal NET DDR controller
dt-bindings: memory-controllers: Add support for Versal NET EDAC
RAS: Export log_non_standard_event() to drivers
cdx: Export Symbols for MCDI RPC and Initialization
cdx: Split mcdi.h and reorganize headers
EDAC/skx_common: Use topology_physical_package_id() instead of open coding
EDAC: Fix wrong executable file modes for C source files
EDAC/altera: Use dev_fwnode()
EDAC/skx_common: Remove unused *NUM*_IMC macros
EDAC/i10nm: Reallocate skx_dev list if preconfigured cnt != runtime cnt
EDAC/skx_common: Remove redundant upper bound check for res->imc
EDAC/skx_common: Make skx_dev->imc[] a flexible array
EDAC/skx_common: Swap memory controller index mapping
EDAC/skx_common: Move mc_mapping to be a field inside struct skx_imc
EDAC/{skx_common,skx}: Use configuration data, not global macros
EDAC/i10nm: Skip DIMM enumeration on a disabled memory controller
EDAC/ie31200: Add two more Intel Alder Lake-S SoCs for EDAC support
...
|
|
* edac-drivers:
EDAC/versalnet: Return the correct error in mc_probe()
EDAC/mc_sysfs: Increase legacy channel support to 16
EDAC/amd64: Add support for AMD family 1Ah-based newer models
EDAC: Add a driver for the AMD Versal NET DDR controller
dt-bindings: memory-controllers: Add support for Versal NET EDAC
RAS: Export log_non_standard_event() to drivers
cdx: Export Symbols for MCDI RPC and Initialization
cdx: Split mcdi.h and reorganize headers
EDAC/skx_common: Use topology_physical_package_id() instead of open coding
EDAC/altera: Use dev_fwnode()
EDAC/skx_common: Remove unused *NUM*_IMC macros
EDAC/i10nm: Reallocate skx_dev list if preconfigured cnt != runtime cnt
EDAC/skx_common: Remove redundant upper bound check for res->imc
EDAC/skx_common: Make skx_dev->imc[] a flexible array
EDAC/skx_common: Swap memory controller index mapping
EDAC/skx_common: Move mc_mapping to be a field inside struct skx_imc
EDAC/{skx_common,skx}: Use configuration data, not global macros
EDAC/i10nm: Skip DIMM enumeration on a disabled memory controller
EDAC/ie31200: Add two more Intel Alder Lake-S SoCs for EDAC support
dt-bindings: arm: cpus: Add edac-enabled property
EDAC: Add EDAC driver for ARM Cortex A72 cores
* edac-misc:
EDAC: Fix wrong executable file modes for C source files
MAINTAINERS: EDAC: Drop inactive reviewers
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
|
|
Return -ENOMEM if memory allocation in mc_probe() fails.
[ bp: Massage commit message. ]
Fixes: d5fe2fec6c40 ("EDAC: Add a driver for the AMD Versal NET DDR controller")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
|
|
Newer AMD systems can support up to 16 channels per EDAC "mc" device.
These are detected by the EDAC module running on the device, and the
current EDAC interface is appropriately enumerated.
The legacy EDAC sysfs interface, however, provides device attributes for
channels 0 through 11 only. Consequently, the last four channels, 12
through 15, will not be enumerated and will not be visible through the
legacy sysfs interface.
Add additional device attributes to ensure that all 16 channels, if
present, are enumerated by and visible through the legacy EDAC sysfs
interface.
Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250916203242.1281036-1-avadhut.naik@amd.com
|
|
Add support for family 1Ah-based models 50h-57h, 90h-9Fh, A0h-AFh, and
C0h-C7h.
Also, raise the maximum number of memory controllers as those machines
support that many.
Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250916203242.1281036-1-avadhut.naik@amd.com
|
|
Add a driver for the AMD Versal NET DDR memory controller which supports
single bit error correction, double bit error detection and other system
errors from various IP subsystems (e.g., RPU, NOCs, HNICX, PL).
The driver listens for notifications from the NMC (Network management
controller) using RPMsg (Remote Processor Messaging).
The channel used for communicating to RPMsg is named "error_edac". Upon
receipt of a notification, the driver sends a RAS event trace.
[ bp:
- Fixup title
- Rewrite commit message
- Fixup Kconfig text
- Zap unused defines and align them
- Simplify rpmsg_cb() considerably
- Drop silly double-brackets in conditionals
- Use proper void * type in mcdi_request()
- Do not clear chinfo in rpmsg_probe() unnecessarily
- Fix indentation
- Do a proper err unwind path in init_versalnet()
- Redo the error unwind path in mc_probe() properly
- Fix the ordering in mc_remove()
]
Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250908115649.22903-1-shubhrajyoti.datta@amd.com
Link: https://lore.kernel.org/r/20250703173105.GLaGa-WQCESDNsqygm@fat_crate.local
|
|
Use topology_physical_package_id() to get the CPU package ID instead of
open coding.
Suggested-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20250903030648.3285935-1-qiuxu.zhuo@intel.com
|
|
Three EDAC source files were mistakenly marked as executable when adding the
EDAC scrub controls.
These are plain C source files and should not carry the executable bit.
Correcting their modes follows the principle of least privilege and avoids
unnecessary execute permissions in the repository.
[ bp: Massage commit message. ]
Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250828191954.903125-1-visitorckw@gmail.com
|
|
dma_free_coherent() must only be called if the corresponding
dma_alloc_coherent() call has succeeded. Calling it when the allocation fails
leads to undefined behavior.
Delete the wrong call.
[ bp: Massage commit message. ]
Fixes: 71bcada88b0f3 ("edac: altera: Add Altera SDRAM EDAC support")
Signed-off-by: Salah Triki <salah.triki@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Dinh Nguyen <dinguyen@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/aIrfzzqh4IzYtDVC@pc
|
|
irq_domain_create_simple() takes fwnode as the first argument. It can be
extracted from the struct device using dev_fwnode() helper instead of using
of_node with of_fwnode_handle().
So use the dev_fwnode() helper.
Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Acked-by: Dinh Nguyen <dinguyen@kernel.org>
Link: https://lore.kernel.org/20250723062631.1830757-1-jirislaby@kernel.org
|
|
There are no references to the *NUM*_IMC macros, so remove them.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20250731145534.2759334-8-qiuxu.zhuo@intel.com
|
|
Ideally, read the present DDR memory controller count first and then
allocate the skx_dev list using this count. However, this approach
requires adding a significant amount of code similar to
skx_get_all_bus_mappings() to obtain the PCI bus mappings for the first
socket and use these mappings along with the related PCI register offset
to read the memory controller count.
Given that the Granite Rapids CPU is the only one that can detect the
count of memory controllers at runtime (other CPUs use the count in the
configuration data), to reduce code complexity, reallocate the skx_dev
list only if the preconfigured count of DDR memory controllers differs
from the count read at runtime for Granite Rapids CPU.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20250731145534.2759334-7-qiuxu.zhuo@intel.com
|
|
The following upper bound check for the memory controller physical index
decoded by ADXL is the only place where the macro 'NUM_IMC' is used:
res->imc > NUM_IMC - 1
Since this check is already covered by skx_get_mc_mapping(), meaning no
memory controller logical index exists for an invalid memory controller
physical index decoded by ADXL, remove the redundant upper bound check
so that the definition for 'NUM_IMC' can be cleaned up (in another patch).
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20250731145534.2759334-6-qiuxu.zhuo@intel.com
|
|
The current skx->imc[NUM_IMC] array of memory controller instances is
sized using the macro NUM_IMC. Each time EDAC support is added for a
new CPU, NUM_IMC needs to be updated to ensure it is greater than or
equal to the number of memory controllers for the new CPU. This approach
is inconvenient and results in memory waste for older CPUs with fewer
memory controllers.
To address this, make skx->imc[] a flexible array and determine its size
from configuration data or at runtime.
Suggested-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20250731145534.2759334-5-qiuxu.zhuo@intel.com
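
A flexible array member sized at allocation time, as described above, can be
sketched as follows. The struct and function names are hypothetical, and
calloc() stands in for kernel allocation, where the struct_size() helper
would normally compute the size.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-controller state. */
struct imc_sketch {
	int mc_mapping;
};

/* Hypothetical device with a flexible array of controllers. */
struct skx_dev_sketch {
	unsigned int num_imc;
	struct imc_sketch imc[];   /* flexible array member, sized at allocation */
};

/* Allocate a device with room for exactly num_imc controllers. */
static struct skx_dev_sketch *dev_alloc_sketch(unsigned int num_imc)
{
	struct skx_dev_sketch *d;

	assert(num_imc > 0);
	d = calloc(1, sizeof(*d) + num_imc * sizeof(d->imc[0]));
	if (d)
		d->num_imc = num_imc;
	return d;
}
```

A CPU with two memory controllers then allocates two entries instead of a
worst-case NUM_IMC, and a CPU with more controllers needs no macro update.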
|
|
The current mapping of memory controller indices is from physical index [1]
to logical index [2], as shown below:
skx_dev->imc[pmc].mc_mapping = lmc
Since skx_dev->imc[] is an array of present memory controller instances,
mapping memory controller indices from logical index to physical index,
as shown below, is more reasonable. This is also a preparatory step for
making skx_dev->imc[] a flexible array.
skx_dev->imc[lmc].mc_mapping = pmc
Both mappings are equivalent. No functional changes intended.
[1] Indices for memory controllers include both those present to the
OS and those disabled by BIOS.
[2] Indices for memory controllers present to the OS.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20250731145534.2759334-4-qiuxu.zhuo@intel.com
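
The logical-to-physical direction of the mapping can be illustrated with a
small sketch. It assumes, purely for illustration, that the set of present
controllers is given as a bitmask; build_mapping(), pmc_to_lmc() and
MAX_PHYS_MC are hypothetical names and not part of skx_common.

```c
#include <assert.h>

#define MAX_PHYS_MC 8   /* illustrative upper bound on physical indices */

/*
 * Fill lmc_to_pmc[] from a bitmask of present physical controllers and
 * return the number of logical controllers found. Logical indices are
 * dense (0, 1, 2, ...), physical indices may have gaps.
 */
static int build_mapping(unsigned int present_mask, int lmc_to_pmc[MAX_PHYS_MC])
{
	int lmc = 0;

	assert(lmc_to_pmc);
	for (int pmc = 0; pmc < MAX_PHYS_MC; pmc++)
		if (present_mask & (1u << pmc))
			lmc_to_pmc[lmc++] = pmc;
	return lmc;
}

/* Reverse lookup: logical index for a physical index, or -1 if absent. */
static int pmc_to_lmc(const int *lmc_to_pmc, int n, int pmc)
{
	for (int lmc = 0; lmc < n; lmc++)
		if (lmc_to_pmc[lmc] == pmc)
			return lmc;
	return -1;
}
```

Because the array only holds present controllers, it can be sized to the
logical count, and a physical index with no entry simply has no logical
index.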
|
|
The mc_mapping and imc fields of struct skx_dev have the same size,
NUM_IMC. Move mc_mapping to be a field inside struct skx_imc to prepare
for making the imc array of memory controller instances a flexible array.
No functional changes intended.
Suggested-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20250731145534.2759334-3-qiuxu.zhuo@intel.com
|
|
Use model-specific configuration data for the number of memory controllers
per socket, channels per memory controller, and DIMMs per channel as
intended, instead of relying on global macros for maximum values.
No functional changes intended.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20250731145534.2759334-2-qiuxu.zhuo@intel.com
|
|
When loading the i10nm_edac driver on some Intel Granite Rapids servers,
a call trace may appear as follows:
UBSAN: shift-out-of-bounds in drivers/edac/skx_common.c:453:16
shift exponent -66 is negative
...
__ubsan_handle_shift_out_of_bounds+0x1e3/0x390
skx_get_dimm_info.cold+0x47/0xd40 [skx_edac_common]
i10nm_get_dimm_config+0x23e/0x390 [i10nm_edac]
skx_register_mci+0x159/0x220 [skx_edac_common]
i10nm_init+0xcb0/0x1ff0 [i10nm_edac]
...
This occurs because some BIOS may disable a memory controller if there
aren't any memory DIMMs populated on this memory controller. The DIMMMTR
register of this disabled memory controller contains the invalid value
~0, resulting in the call trace above.
Fix this call trace by skipping DIMM enumeration on a disabled memory
controller.
Fixes: ba987eaaabf9 ("EDAC/i10nm: Add Intel Granite Rapids server support")
Reported-by: Jose Jesus Ambriz Meza <jose.jesus.ambriz.meza@intel.com>
Reported-by: Chia-Lin Kao (AceLan) <acelan.kao@canonical.com>
Closes: https://lore.kernel.org/all/20250730063155.2612379-1-acelan.kao@canonical.com/
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Chia-Lin Kao (AceLan) <acelan.kao@canonical.com>
Link: https://lore.kernel.org/r/20250806065707.3533345-1-qiuxu.zhuo@intel.com
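
One way to express such a guard is sketched below, assuming the all-ones
register read is what identifies a disabled controller; the actual driver may
detect this differently. count_enumerable() is a hypothetical helper that
only counts the controllers that would be enumerated.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Walk a list of per-controller register values and count those that
 * would be enumerated. A value of all ones is treated as "controller
 * disabled" and skipped before any field is decoded from it.
 */
static int count_enumerable(const uint32_t *mtr, int n)
{
	int count = 0;

	assert(mtr || n == 0);
	for (int i = 0; i < n; i++) {
		if (mtr[i] == UINT32_MAX)   /* invalid read: skip this controller */
			continue;
		count++;
	}
	return count;
}
```

Skipping before decoding matters because bit fields extracted from an
all-ones value are meaningless and can feed nonsensical shift counts, as in
the UBSAN report above.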
|
|
Host Device IDs (DID0) correspond to:
* Intel Core i7-12700K
* Intel Core i5-12600K
See documentation:
* 12th Generation Intel® Core™ Processors Datasheet
* Volume 1 of 2, Doc. No.: 655258, Rev.: 011
* https://edc.intel.com/output/DownloadPdfDocument?id=8297 (PDF)
Signed-off-by: Kyle Manna <kyle@kylemanna.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Link: https://lore.kernel.org/r/20250819161739.3241152-1-kyle@kylemanna.com
|
|
The driver is designed to support error detection and reporting for
Cortex A72 cores, specifically within their L1 and L2 cache systems.
The errors are detected by reading CPU/L2 memory error syndrome
registers.
Unfortunately there is no robust way to inject errors into the caches,
so this driver doesn't contain any code to actually test it. It has
been tested though with code taken from an older version [1] of this
driver. For reasons stated in thread [1], the error injection code is
not suitable for mainline, so it is removed from the driver.
[1] https://lore.kernel.org/all/1521073067-24348-1-git-send-email-york.sun@nxp.com/#t
[ bp: minor touchups. ]
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Co-developed-by: Vijay Balakrishna <vijayb@linux.microsoft.com>
Signed-off-by: Vijay Balakrishna <vijayb@linux.microsoft.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/1752714390-27389-2-git-send-email-vijayb@linux.microsoft.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras
Pull EDAC updates from Borislav Petkov:
- i10nm:
- switch to using scnprintf()
- Add Granite Rapids-D support
- synopsys: Make sure ECC error and counter registers are cleared
during init/probing to avoid reporting stale errors
- igen6: Add Wildcat Lake SoCs support
- Make sure scrub features sysfs attributes are initialized properly
- Allocate memory repair sysfs attributes statically to reduce stack
usage
- Fix DIMM module size computation for DIMMs with total capacity which
is a non power-of-two number, in amd64_edac
- Do not be too dramatic when reporting disabled memory controllers in
igen6_edac
- Add support to ie31200_edac for the following SoCs:
- Core i5-14[67]00
- Bartlett Lake-S SoCs
- Raptor Lake-HX
* tag 'edac_updates_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
EDAC/{skx_common,i10nm}: Use scnprintf() for safer buffer handling
EDAC/synopsys: Clear the ECC counters on init
EDAC/ie31200: Add Intel Raptor Lake-HX SoCs support
EDAC/igen6: Add Intel Wildcat Lake SoCs support
EDAC/i10nm: Add Intel Granite Rapids-D support
EDAC/mem_repair: Reduce stack usage in edac_mem_repair_get_desc()
EDAC/igen6: Reduce log level to debug for absent memory controllers
EDAC/ie31200: Document which CPUs correspond to each Raptor Lake-S device ID
EDAC/ie31200: Enable support for Core i5-14600 and i7-14700
ie31200/EDAC: Add Intel Bartlett Lake-S SoCs support
|
|
snprintf() is fragile when its return value will be used to append
additional data to a buffer. Use scnprintf() instead.
Signed-off-by: Wang Haoran <haoranwangsec@gmail.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Link: https://lore.kernel.org/r/20250715131700.1092720-1-haoranwangsec@gmail.com
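
The difference matters when the return value is used as an offset for the
next write: snprintf() returns the length the output would have had, while
scnprintf() returns what was actually written. The userspace sketch below
uses scn_sketch(), a hypothetical stand-in for the kernel helper, and
build_msg() with made-up field names.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/*
 * Userspace stand-in mirroring scnprintf() semantics: return the number
 * of characters actually stored in buf, excluding the trailing NUL.
 */
static int scn_sketch(char *buf, size_t size, const char *fmt, ...)
{
	va_list ap;
	int n;

	assert(buf && fmt);
	if (size == 0)
		return 0;

	va_start(ap, fmt);
	n = vsnprintf(buf, size, fmt, ap);
	va_end(ap);

	if (n < 0)
		return 0;
	/* vsnprintf() reports the untruncated length; clamp it. */
	return (size_t)n >= size ? (int)(size - 1) : n;
}

/* Append two fields; len can never move past the end of buf. */
static int build_msg(char *buf, size_t size)
{
	int len = 0;

	len += scn_sketch(buf + len, size - len, "socket:%d ", 1);
	len += scn_sketch(buf + len, size - len, "channel:%d", 2);
	return len;
}
```

With plain snprintf(), the first call on an 8-byte buffer would return 9,
and the second call would then compute a pointer past the buffer and a
wrapped-around size.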
|
|
Clear the ECC error and counter registers during initialization/probe to avoid
reporting stale errors that may have occurred before EDAC registration.
For that, unify the Zynq and ZynqMP ECC state reading paths and simplify the
code.
[ bp: Massage commit message.
Fix an -Wsometimes-uninitialized warning as reported by
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202507141048.obUv3ZUm-lkp@intel.com ]
Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250713050753.7042-1-shubhrajyoti.datta@amd.com
|
|
Intel Raptor Lake-HX SoC shares the same memory controller registers
as Raptor Lake-S SoC. Add a compute die ID for Raptor Lake-HX SoCs with
Out-of-Band ECC capability for EDAC support.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Laurens SEGHERS <laurens@rale.com>
Link: https://lore.kernel.org/r/20250704151609.7833-4-qiuxu.zhuo@intel.com
|
|
Intel Wildcat Lake is a mobile derivative of Panther Lake with one
memory controller. Wildcat Lake SoCs share the same IBECC registers
with Meteor Lake-P SoCs.
Add a compute die ID and a new configuration structure for Wildcat
Lake SoCs with In-Band ECC capability for EDAC support.
Signed-off-by: Lili Li <lili.li@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20250704151609.7833-3-qiuxu.zhuo@intel.com
|
|
The Granite Rapids-D CPU model uses memory controller registers similar
to those of the Granite Rapids server CPU but with a different memory
controller MMIO base.
Add the Granite Rapids-D CPU model ID and use the new memory controller
MMIO base for EDAC support.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: VikasX Chougule <vikasx.chougule@intel.com>
Link: https://lore.kernel.org/r/20250704151609.7833-2-qiuxu.zhuo@intel.com
|
|
Fix the lockdep splat caused by missing sysfs_attr_init() calls for the
recently added EDAC feature's sysfs attributes.
In lockdep_init_map_type(), the following check on the lock-class key
causes the splat:
if (!static_obj(key) && !is_dynamic_key(key))
Backtrace:
RIP: 0010:lockdep_init_map_type
Call Trace:
__kernfs_create_file
sysfs_add_file_mode_ns
internal_create_group
internal_create_groups
device_add
? __init_waitqueue_head
edac_dev_register
devm_cxl_memdev_edac_register
? lock_acquire
? find_held_lock
? cxl_mem_probe
? cxl_mem_probe
? lockdep_hardirqs_on
? cxl_mem_probe
cxl_mem_probe
[ bp: Massage. ]
Fixes: f90b738166fe ("EDAC: Add scrub control feature")
Fixes: bcbd069b11b0 ("EDAC: Add a Error Check Scrub control feature")
Fixes: 699ea5219c4b ("EDAC: Add a memory repair control feature")
Reported-by: Dave Jiang <dave.jiang@intel.com>
Suggested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://lore.kernel.org/20250626101344.1726-1-shiju.jose@huawei.com
|
|
Constructing an array on the stack adds complexity and can exceed the
warning limit for per-function stack usage:
drivers/edac/mem_repair.c:361:5: error: stack frame size (1296) exceeds
limit (1280) in 'edac_mem_repair_get_desc' [-Werror,-Wframe-larger-than]
Allocate the actual attribute array statically instead, and only fill in the
instance number on the per-instance copy.
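
The pattern described above can be illustrated with a minimal userspace
sketch. It is not the code from drivers/edac/mem_repair.c: struct demo_attr,
demo_attr_template, and demo_attrs_init() are hypothetical names, and the
attribute names are placeholders. The point is only that the template lives
in static storage, so the function that prepares a per-instance copy needs
no large array on its own stack.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical attribute descriptor, standing in for a sysfs attribute. */
struct demo_attr {
	const char *name;
	unsigned int instance;	/* filled in on the per-instance copy */
};

/* The template is allocated statically, so it costs no stack space. */
static const struct demo_attr demo_attr_template[] = {
	{ .name = "repair_type" },
	{ .name = "persist_mode" },
	{ .name = "repair" },
};

#define DEMO_NR_ATTRS \
	(sizeof(demo_attr_template) / sizeof(demo_attr_template[0]))

/*
 * Copy the static template into caller-provided per-instance storage
 * (at least DEMO_NR_ATTRS entries) and stamp the instance number on
 * every entry. The template itself is never modified.
 */
static void demo_attrs_init(struct demo_attr *dst, unsigned int instance)
{
	size_t i;

	assert(dst != NULL);

	memcpy(dst, demo_attr_template, sizeof(demo_attr_template));
	for (i = 0; i < DEMO_NR_ATTRS; i++)
		dst[i].instance = instance;
}
```

In this sketch the per-instance storage would typically be embedded in a
longer-lived context structure rather than declared as a local variable, which
is what keeps the frame size of the setup function small.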
Fixes: 699ea5219c4b ("EDAC: Add a memory repair control feature")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250620114135.4017183-1-arnd@kernel.org
|
|
Each Chip-Select (CS) of a Unified Memory Controller (UMC) on AMD Zen-based
SOCs has an Address Mask and a Secondary Address Mask register associated with
it. The amd64_edac module logs DIMM sizes on a per-UMC per-CS granularity
during init using these two registers.
Currently, the module primarily considers only the Address Mask register for
computing DIMM sizes. The Secondary Address Mask register is considered only
for odd CS, and when it is, the Address Mask register is ignored altogether
for that CS. For power-of-two DIMMs, i.e. DIMMs whose total capacity is a
power of two (32GB, 64GB, etc.), this is not an issue since only the Address
Mask register is used.
For non-power-of-two DIMMs, i.e. DIMMs whose total capacity is not a power of
two (48GB, 96GB, etc.), the Secondary Address Mask register is used in
conjunction with the Address Mask register. Since the module considers only
one of the two registers for a given CS, the size it computes is incorrect:
the Secondary Address Mask register is not considered for even CS, and the
Address Mask register is not considered for odd CS.
Introduce a new helper function so that both Address Mask and Secondary
Address Mask registers are considered, when valid, for computing DIMM sizes.
Also rename some variables for greater clarity.
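
The idea of combining both registers can be sketched as follows. This is a
simplified model, not the helper added to amd64_edac: the mask encoding is an
assumption made for illustration (a mask with its low n bits set describes a
region of 2^n MiB, and zero means the register is not valid), and
demo_mask_to_mib() and demo_cs_size_mib() are hypothetical names. The real
register layout and the handling of interleaving are not modeled.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumed encoding for this sketch only: a mask whose low n bits are set
 * covers 2^n MiB. A zero mask means the register is not valid for this CS.
 */
static uint64_t demo_mask_to_mib(uint32_t mask)
{
	uint64_t m = mask;

	if (!m)
		return 0;

	/* Only contiguous low-bit masks are meaningful in this model. */
	assert((m & (m + 1)) == 0);

	return m + 1;
}

/*
 * Size of one chip select: the Address Mask contribution plus the
 * Secondary Address Mask contribution, each counted only when valid.
 */
static uint64_t demo_cs_size_mib(uint32_t addr_mask, uint32_t addr_mask_sec)
{
	return demo_mask_to_mib(addr_mask) + demo_mask_to_mib(addr_mask_sec);
}
```

Under these assumptions, a CS with a 32 GiB primary region and no secondary
region (masks 0x7FFF and 0) comes out at 32768 MiB, while a CS backed by a
32 GiB plus a 16 GiB region (masks 0x7FFF and 0x3FFF) comes out at 49152 MiB,
i.e. 48 GiB. Looking at either mask alone would report 32 GiB or 16 GiB for
the second case, which is the kind of error the commit describes.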
Fixes: 81f5090db843 ("EDAC/amd64: Support asymmetric dual-rank DIMMs")
Closes: https://lore.kernel.org/dbec22b6-00f2-498b-b70d-ab6f8a5ec87e@natrix.lt
Reported-by: Žilvinas Žaltiena <zilvinas@natrix.lt>
Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Tested-by: Žilvinas Žaltiena <zilvinas@natrix.lt>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/20250529205013.403450-1-avadhut.naik@amd.com
|
|
The current KERN_WARNING level message for detecting absent memory
controllers is overly dramatic. The BIOS likely had valid reasons to
disable the memory controller (e.g. it isn't connected to any DIMM
slots on the motherboard for this system). So there's nothing actually
wrong that needs to be fixed.
Reduce the log level to KERN_DEBUG to eliminate the false warning.
Suggested-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250618162307.1523736-2-qiuxu.zhuo@intel.com
|
|
Based on table 103 ("Host Device ID (DID0)") in [1], document which CPUs
correspond to each Raptor Lake-S device ID for better readability.
[1] https://www.intel.com/content/www/us/en/content-details/743844/13th-generation-intel-core-intel-core-14th-generation-intel-core-processor-series-1-and-series-2-and-intel-xeon-e-2400-processor-datasheet-volume-1-of-2.html
Signed-off-by: George Gaidarov <gdgaidarov+lkml@gmail.com>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250529162933.1228735-2-gdgaidarov+lkml@gmail.com
|
|
Device ID '0xa740' is shared by i7-14700, i7-14700K, and i7-14700T.
Device ID '0xa704' is shared by i5-14600, i5-14600K, and i5-14600T.
Tested locally on my i7-14700K.
Signed-off-by: George Gaidarov <gdgaidarov+lkml@gmail.com>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250529162933.1228735-1-gdgaidarov+lkml@gmail.com
|