summaryrefslogtreecommitdiff
path: root/drivers/edac
AgeCommit message (Collapse)Author
6 daysMerge tag 'edac_updates_for_v6.19_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras Pull EDAC updates from Borislav Petkov: - imh_edac: Add a new EDAC driver for Intel Diamond Rapids and future incarnations of this memory controllers architecture - amd64_edac: Remove the legacy csrow sysfs interface which has been deprecated and unused (we assume) for at least a decade - Add the capability to fallback to BIOS-provided address translation functionality (ACPI PRM) which can be used on systems unsupported by the current AMD address translation library - The usual fixes, fixlets, cleanups and improvements all over the place * tag 'edac_updates_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras: RAS/AMD/ATL: Replace bitwise_xor_bits() with hweight16() EDAC/igen6: Fix error handling in igen6_edac driver EDAC/imh: Setup 'imh_test' debugfs testing node EDAC/{skx_comm,imh}: Detect 2-level memory configuration EDAC/skx_common: Extend the maximum number of DRAM chip row bits EDAC/{skx_common,imh}: Add EDAC driver for Intel Diamond Rapids servers EDAC/skx_common: Prepare for skx_set_hi_lo() EDAC/skx_common: Prepare for skx_get_edac_list() EDAC/{skx_common,skx,i10nm}: Make skx_register_mci() independent of pci_dev EDAC/ghes: Replace deprecated strcpy() in ghes_edac_report_mem_error() EDAC/ie31200: Fix error handling in ie31200_register_mci RAS/CEC: Replace use of system_wq with system_percpu_wq EDAC: Remove the legacy EDAC sysfs interface EDAC/amd64: Remove NUM_CONTROLLERS macro EDAC/amd64: Generate ctl_name string at runtime RAS/AMD/ATL: Require PRM support for future systems ACPI: PRM: Add acpi_prm_handler_available() RAS/AMD/ATL: Return error codes from helper functions
2025-11-21EDAC/igen6: Fix error handling in igen6_edac driverMa Ke
The igen6_edac driver calls device_initialize() for all memory controllers in igen6_register_mci(), but misses corresponding put_device() calls in error paths and during normal shutdown in igen6_unregister_mcis(). Adding the missing put_device() calls improves code readability and ensures proper reference counting for the device structure. Found by code review. Signed-off-by: Ma Ke <make24@iscas.ac.cn> Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://patch.msgid.link/20251105090244.23327-1-make24@iscas.ac.cn
2025-11-21EDAC/imh: Setup 'imh_test' debugfs testing nodeQiuxu Zhuo
Setup the following debugfs testing node to enable fake memory error address decoding tests for the imh_edac driver. /sys/kernel/debug/edac/imh_test/addr Tested-by: Yi Lai <yi1.lai@intel.com> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://patch.msgid.link/20251119134132.2389472-8-qiuxu.zhuo@intel.com
2025-11-21EDAC/{skx_comm,imh}: Detect 2-level memory configurationQiuxu Zhuo
Detect 2-level memory configurations and notify the 'skx_common' library to enable ADXL 2-level memory error decoding. Tested-by: Yi Lai <yi1.lai@intel.com> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://patch.msgid.link/20251119134132.2389472-7-qiuxu.zhuo@intel.com
2025-11-21EDAC/skx_common: Extend the maximum number of DRAM chip row bitsQiuxu Zhuo
The allowed maximum number of row bits for DRAM chips in the Diamond Rapids server processor is up to 19. Extend the current maximum row bits from 18 to 19. Tested-by: Yi Lai <yi1.lai@intel.com> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://patch.msgid.link/20251119134132.2389472-6-qiuxu.zhuo@intel.com
2025-11-21EDAC/{skx_common,imh}: Add EDAC driver for Intel Diamond Rapids serversQiuxu Zhuo
Intel Diamond Rapids CPUs include Integrated Memory and I/O Hubs (IMH). The memory controllers within the IMHs provide memory stacks to the processor. Create a new driver for this IMH-based memory controllers rather than applying additional patches to the existing i10nm_edac.c for the following reasons: 1) The memory controllers are not presented as PCI devices; instead, the detection and all their registers have been transitioned to MMIO-based memory spaces. 2) Validation processes are costly. Modifications to i10nm_edac would require extensive validation checks against multiple platforms, including Ice Lake, Sapphire Rapids, Emerald Rapids, Granite Rapids, Sierra Forest, and Grand Ridge. 3) Future Intel CPUs will likely only need patches on top of this new EDAC driver. Validation can be limited to Diamond Rapids servers and future Intel CPU generations. [Tony: Fix kerneldoc for struct local_reg] [randconfig: Added dependencies on NFIT and DMI] Tested-by: Yi Lai <yi1.lai@intel.com> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://patch.msgid.link/20251119134132.2389472-5-qiuxu.zhuo@intel.com
2025-11-19EDAC/skx_common: Prepare for skx_set_hi_lo()Qiuxu Zhuo
The upcoming imh_edac driver for Intel Diamond Rapids servers cannot use skx_get_hi_lo() in skx_common to retrieve the TOHM (Top of High Memory) and TOLM (Top of Low Memory) parameters. Instead, it obtains these parameters within its own EDAC driver. To accommodate this, prepare skx_set_hi_lo() to allow the driver to notify skx_common of these parameters. Tested-by: Yi Lai <yi1.lai@intel.com> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://patch.msgid.link/20251119134132.2389472-4-qiuxu.zhuo@intel.com
2025-11-19EDAC/skx_common: Prepare for skx_get_edac_list()Qiuxu Zhuo
The Intel EDAC library 'skx_common' maintains the Intel server EDAC device list for {skx, i10nm}_edac drivers, which use skx_get_all_bus_mappings() to build and retrieve the EDAC device list. However, the upcoming Intel EDAC driver, imh_edac, for Diamond Rapids servers is designed for memory controllers that are MMIO-based devices rather than PCI devices. Consequently, it can't use skx_get_all_bus_mappings() due to the absence of a PCI bus. To accommodate this, prepare skx_get_edac_list() to enable the upcoming imh_edac driver to obtain the EDAC device list from the skx_common library and build the EDAC device list independently. Tested-by: Yi Lai <yi1.lai@intel.com> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://patch.msgid.link/20251119134132.2389472-3-qiuxu.zhuo@intel.com
2025-11-19EDAC/{skx_common,skx,i10nm}: Make skx_register_mci() independent of pci_devQiuxu Zhuo
Memory controllers in the new Intel server CPUs, such as Diamond Rapids, are presented as MMIO-based devices rather than PCI devices. Modify skx_register_mci() to be independent of 'pci_dev' and use a generic 'dev' of 'struct device' to prepare for support of such MMIO-based memory controllers. Tested-by: Yi Lai <yi1.lai@intel.com> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://patch.msgid.link/20251119134132.2389472-2-qiuxu.zhuo@intel.com
2025-11-18EDAC/ghes: Replace deprecated strcpy() in ghes_edac_report_mem_error()Thorsten Blum
strcpy() has been deprecated¹ because it performs no bounds checking on the destination buffer, which can lead to buffer overflows. Use the safer strscpy() instead. ¹ https://www.kernel.org/doc/html/latest/process/deprecated.html#strcpy Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Link: https://patch.msgid.link/20251118135621.101148-2-thorsten.blum@linux.dev
2025-11-11EDAC/altera: Use INTTEST register for Ethernet and USB SBE injectionNiravkumar L Rabara
The current single-bit error injection mechanism flips bits directly in ECC RAM by performing write and read operations. When the ECC RAM is actively used by the Ethernet or USB controller, this approach sometimes trigger a false double-bit error. Switch both Ethernet and USB EDAC devices to use the INTTEST register (altr_edac_a10_device_inject_fops) for single-bit error injection, similar to the existing double-bit error injection method. Fixes: 064acbd4f4ab ("EDAC, altera: Add Stratix10 peripheral support") Signed-off-by: Niravkumar L Rabara <niravkumarlaxmidas.rabara@altera.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Dinh Nguyen <dinguyen@kernel.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20251111081333.1279635-1-niravkumarlaxmidas.rabara@altera.com
2025-11-11EDAC/altera: Handle OCRAM ECC enable after warm resetNiravkumar L Rabara
The OCRAM ECC is always enabled either by the BootROM or by the Secure Device Manager (SDM) during a power-on reset on SoCFPGA. However, during a warm reset, the OCRAM content is retained to preserve data, while the control and status registers are reset to their default values. As a result, ECC must be explicitly re-enabled after a warm reset. Fixes: 17e47dc6db4f ("EDAC/altera: Add Stratix10 OCRAM ECC support") Signed-off-by: Niravkumar L Rabara <niravkumarlaxmidas.rabara@altera.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Dinh Nguyen <dinguyen@kernel.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20251111080801.1279401-1-niravkumarlaxmidas.rabara@altera.com
2025-11-10EDAC/ie31200: Fix error handling in ie31200_register_mciMa Ke
ie31200_register_mci() calls device_initialize() for priv->dev unconditionally. However, in the error path, put_device() is not called, leading to an imbalance. Similarly, in the unload path, put_device() is missing. Although edac_mc_free() eventually frees the memory, it does not release the device initialized by device_initialize(). For code readability and proper pairing of device_initialize()/put_device(), add put_device() calls in both error and unload paths. Found by code review. Signed-off-by: Ma Ke <make24@iscas.ac.cn> Signed-off-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Link: https://patch.msgid.link/20251106084735.35017-1-make24@iscas.ac.cn
2025-11-07EDAC/versalnet: Handle split messages for non-standard errorsShubhrajyoti Datta
The current code assumes that only DDR errors have split messages. Ensure proper logging of non-standard event errors that may be split across multiple messages too. [ bp: Massage, move comment too, fix it up. ] Fixes: d5fe2fec6c40 ("EDAC: Add a driver for the AMD Versal NET DDR controller") Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://patch.msgid.link/20251023113108.3467132-1-shubhrajyoti.datta@amd.com
2025-11-06EDAC: Remove the legacy EDAC sysfs interfaceAvadhut Naik
Commit 199747106934 ("edac: add a new per-dimm API and make the old per-virtual-rank API obsolete") introduced a new per-DIMM sysfs interface for EDAC making the old per-virtual-rank sysfs interface obsolete. Since this new sysfs interface was introduced more than a decade ago, remove the obsolete legacy interface. Signed-off-by: Avadhut Naik <avadhut.naik@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/20251106015727.1987246-1-avadhut.naik@amd.com
2025-11-06EDAC/amd64: Remove NUM_CONTROLLERS macroAvadhut Naik
Currently, the NUM_CONTROLLERS macro is used to limit the amount of memory controllers (UMCs) available per node. The number of UMCs available per node, however, is already cached by the max_mcs variable of struct amd64_pvt. Allocate the relevant data structures dynamically using the variable instead of static allocation through the macro. The max_mcs variable is used for legacy systems too. These systems have a max of 2 controllers. Since the default value of max_mcs, set in per_family_init(), is 2, these legacy systems are also covered. Signed-off-by: Avadhut Naik <avadhut.naik@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/20251106015727.1987246-1-avadhut.naik@amd.com
2025-11-06EDAC/amd64: Generate ctl_name string at runtimeAvadhut Naik
Currently, the ctl_name string is statically assigned based on the family and model of the SOC when the amd64_edac module is loaded. The same, however, is not exactly needed as the string can be generated and assigned at runtime through scnprintf(). Remove all static assignments and generate the string at runtime. Also, cleanup the switch cases which became defunct and consolidate identical cases. Signed-off-by: Avadhut Naik <avadhut.naik@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/20251106015727.1987246-1-avadhut.naik@amd.com
2025-10-13EDAC/versalnet: Fix off by one in handle_error()Dan Carpenter
The priv->mci[] array has NUM_CONTROLLERS so this > comparison needs to be >= to prevent an out of bounds access. Fixes: d5fe2fec6c40 ("EDAC: Add a driver for the AMD Versal NET DDR controller") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
2025-09-30Merge tag 'edac_updates_for_v6.18' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras Pull EDAC updates from Borislav Petkov: - Add support for new AMD family 0x1a models to amd64_edac - Add an EDAC driver for the AMD VersalNET memory controller which reports hw errors from different IP blocks in the fabric using an IPC-type transport - Drop the silly static number of memory controllers in the Intel EDAC drivers (skx, i10nm) in favor of a flexible array so that former doesn't need to be increased with every new generation which adds more memory controllers; along with a proper refactoring - Add support for two Alder Lake-S SOCs to ie31200_edac - Add an EDAC driver for ADM Cortex A72 cores, and specifically for reporting L1 and L2 cache errors - Last but not least, the usual fixes, cleanups and improvements all over the subsystem * tag 'edac_updates_for_v6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras: (23 commits) EDAC/versalnet: Return the correct error in mc_probe() EDAC/mc_sysfs: Increase legacy channel support to 16 EDAC/amd64: Add support for AMD family 1Ah-based newer models EDAC: Add a driver for the AMD Versal NET DDR controller dt-bindings: memory-controllers: Add support for Versal NET EDAC RAS: Export log_non_standard_event() to drivers cdx: Export Symbols for MCDI RPC and Initialization cdx: Split mcdi.h and reorganize headers EDAC/skx_common: Use topology_physical_package_id() instead of open coding EDAC: Fix wrong executable file modes for C source files EDAC/altera: Use dev_fwnode() EDAC/skx_common: Remove unused *NUM*_IMC macros EDAC/i10nm: Reallocate skx_dev list if preconfigured cnt != runtime cnt EDAC/skx_common: Remove redundant upper bound check for res->imc EDAC/skx_common: Make skx_dev->imc[] a flexible array EDAC/skx_common: Swap memory controller index mapping EDAC/skx_common: Move mc_mapping to be a field inside struct skx_imc EDAC/{skx_common,skx}: Use configuration data, not global macros EDAC/i10nm: Skip DIMM enumeration on a disabled memory controller EDAC/ie31200: Add two more Intel Alder Lake-S SoCs for EDAC support ...
2025-09-26Merge branches 'edac-drivers' and 'edac-misc' into edac-updatesBorislav Petkov (AMD)
* edac-drivers: EDAC/versalnet: Return the correct error in mc_probe() EDAC/mc_sysfs: Increase legacy channel support to 16 EDAC/amd64: Add support for AMD family 1Ah-based newer models EDAC: Add a driver for the AMD Versal NET DDR controller dt-bindings: memory-controllers: Add support for Versal NET EDAC RAS: Export log_non_standard_event() to drivers cdx: Export Symbols for MCDI RPC and Initialization cdx: Split mcdi.h and reorganize headers EDAC/skx_common: Use topology_physical_package_id() instead of open coding EDAC/altera: Use dev_fwnode() EDAC/skx_common: Remove unused *NUM*_IMC macros EDAC/i10nm: Reallocate skx_dev list if preconfigured cnt != runtime cnt EDAC/skx_common: Remove redundant upper bound check for res->imc EDAC/skx_common: Make skx_dev->imc[] a flexible array EDAC/skx_common: Swap memory controller index mapping EDAC/skx_common: Move mc_mapping to be a field inside struct skx_imc EDAC/{skx_common,skx}: Use configuration data, not global macros EDAC/i10nm: Skip DIMM enumeration on a disabled memory controller EDAC/ie31200: Add two more Intel Alder Lake-S SoCs for EDAC support dt-bindings: arm: cpus: Add edac-enabled property EDAC: Add EDAC driver for ARM Cortex A72 cores * edac-misc: EDAC: Fix wrong executable file modes for C source files MAINTAINERS: EDAC: Drop inactive reviewers Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2025-09-18EDAC/versalnet: Return the correct error in mc_probe()Dan Carpenter
Return -ENOMEM if memory allocation in mc_probe() fails. [ bp: Massage commit message. ] Fixes: d5fe2fec6c40 ("EDAC: Add a driver for the AMD Versal NET DDR controller") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2025-09-17EDAC/mc_sysfs: Increase legacy channel support to 16Avadhut Naik
Newer AMD systems can support up to 16 channels per EDAC "mc" device. These are detected by the EDAC module running on the device, and the current EDAC interface is appropriately enumerated. The legacy EDAC sysfs interface however, provides device attributes for channels 0 through 11 only. Consequently, the last four channels, 12 through 15, will not be enumerated and will not be visible through the legacy sysfs interface. Add additional device attributes to ensure that all 16 channels, if present, are enumerated by and visible through the legacy EDAC sysfs interface. Signed-off-by: Avadhut Naik <avadhut.naik@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/20250916203242.1281036-1-avadhut.naik@amd.com
2025-09-17EDAC/amd64: Add support for AMD family 1Ah-based newer modelsAvadhut Naik
Add support for family 1Ah-based models 50h-57h, 90h-9Fh, A0h-AFh, and C0h-C7h. Also, raise the maximum memory controllers number as those machines support that many. Signed-off-by: Avadhut Naik <avadhut.naik@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/20250916203242.1281036-1-avadhut.naik@amd.com
2025-09-15EDAC: Add a driver for the AMD Versal NET DDR controllerShubhrajyoti Datta
Add a driver for the AMD Versal NET DDR memory controller which supports single bit error correction, double bit error detection and other system errors from various IP subsystems (e.g., RPU, NOCs, HNICX, PL). The driver listens for notifications from the NMC (Network management controller) using RPMsg (Remote Processor Messaging). The channel used for communicating to RPMsg is named "error_edac". Upon receipt of a notification, the driver sends a RAS event trace. [ bp: - Fixup title - Rewrite commit message - Fixup Kconfig text - Zap unused defines and align them - Simplify rpmsg_cb() considerably - Drop silly double-brackets in conditionals - Use proper void * type in mcdi_request() - Do not clear chinfo in rpmsg_probe() unnecessarily - Fix indentation - Do a proper err unwind path in init_versalnet() - Redo the error unwind path in mc_probe() properly - Fix the ordering in mc_remove() ] Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/20250908115649.22903-1-shubhrajyoti.datta@amd.com Link: https://lore.kernel.org/r/20250703173105.GLaGa-WQCESDNsqygm@fat_crate.local
2025-09-03EDAC/skx_common: Use topology_physical_package_id() instead of open codingQiuxu Zhuo
Use topology_physical_package_id() to get the CPU package ID instead of open coding. Suggested-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20250903030648.3285935-1-qiuxu.zhuo@intel.com
2025-08-30EDAC: Fix wrong executable file modes for C source filesKuan-Wei Chiu
Three EDAC source files were mistakenly marked as executable when adding the EDAC scrub controls. These are plain C source files and should not carry the executable bit. Correcting their modes follows the principle of least privilege and avoids unnecessary execute permissions in the repository. [ bp: Massage commit message. ] Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/20250828191954.903125-1-visitorckw@gmail.com
2025-08-25EDAC/altera: Delete an inappropriate dma_free_coherent() callSalah Triki
dma_free_coherent() must only be called if the corresponding dma_alloc_coherent() call has succeeded. Calling it when the allocation fails leads to undefined behavior. Delete the wrong call. [ bp: Massage commit message. ] Fixes: 71bcada88b0f3 ("edac: altera: Add Altera SDRAM EDAC support") Signed-off-by: Salah Triki <salah.triki@gmail.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Dinh Nguyen <dinguyen@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/aIrfzzqh4IzYtDVC@pc
2025-08-20EDAC/altera: Use dev_fwnode()Jiri Slaby (SUSE)
irq_domain_create_simple() takes fwnode as the first argument. It can be extracted from the struct device using dev_fwnode() helper instead of using of_node with of_fwnode_handle(). So use the dev_fwnode() helper. Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com> Acked-by: Dinh Nguyen <dinguyen@kernel.org> Link: https://lore.kernel.org/20250723062631.1830757-1-jirislaby@kernel.org
2025-08-19EDAC/skx_common: Remove unused *NUM*_IMC macrosQiuxu Zhuo
There are no references to the *NUM*_IMC macros, so remove them. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20250731145534.2759334-8-qiuxu.zhuo@intel.com
2025-08-19EDAC/i10nm: Reallocate skx_dev list if preconfigured cnt != runtime cntQiuxu Zhuo
Ideally, read the present DDR memory controller count first and then allocate the skx_dev list using this count. However, this approach requires adding a significant amount of code similar to skx_get_all_bus_mappings() to obtain the PCI bus mappings for the first socket and use these mappings along with the related PCI register offset to read the memory controller count. Given that the Granite Rapids CPU is the only one that can detect the count of memory controllers at runtime (other CPUs use the count in the configuration data), to reduce code complexity, reallocate the skx_dev list only if the preconfigured count of DDR memory controllers differs from the count read at runtime for Granite Rapids CPU. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20250731145534.2759334-7-qiuxu.zhuo@intel.com
2025-08-19EDAC/skx_common: Remove redundant upper bound check for res->imcQiuxu Zhuo
The following upper bound check for the memory controller physical index decoded by ADXL is the only place where use the macro 'NUM_IMC' is used: res->imc > NUM_IMC - 1 Since this check is already covered by skx_get_mc_mapping(), meaning no memory controller logical index exists for an invalid memory controller physical index decoded by ADXL, remove the redundant upper bound check so that the definition for 'NUM_IMC' can be cleaned up (in another patch). Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20250731145534.2759334-6-qiuxu.zhuo@intel.com
2025-08-19EDAC/skx_common: Make skx_dev->imc[] a flexible arrayQiuxu Zhuo
The current skx->imc[NUM_IMC] array of memory controller instances is sized using the macro NUM_IMC. Each time EDAC support is added for a new CPU, NUM_IMC needs to be updated to ensure it is greater than or equal to the number of memory controllers for the new CPU. This approach is inconvenient and results in memory waste for older CPUs with fewer memory controllers. To address this, make skx->imc[] a flexible array and determine its size from configuration data or at runtime. Suggested-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20250731145534.2759334-5-qiuxu.zhuo@intel.com
2025-08-19EDAC/skx_common: Swap memory controller index mappingQiuxu Zhuo
The current mapping of memory controller indices is from physical index [1] to logical index [2], as show below: skx_dev->imc[pmc].mc_mapping = lmc Since skx_dev->imc[] is an array of present memory controller instances, mapping memory controller indices from logical index to physical index, as show below, is more reasonable. This is also a preparatory step for making skx_dev->imc[] a flexible array. skx_dev->imc[lmc].mc_mapping = pmc Both mappings are equivalent. No functional changes intended. [1] Indices for memory controllers include both those present to the OS and those disabled by BIOS. [2] Indices for memory controllers present to the OS. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20250731145534.2759334-4-qiuxu.zhuo@intel.com
2025-08-19EDAC/skx_common: Move mc_mapping to be a field inside struct skx_imcQiuxu Zhuo
The mc_mapping and imc fields of struct skx_dev have the same size, NUM_IMC. Move mc_mapping to be a field inside struct skx_imc to prepare for making the imc array of memory controller instances a flexible array. No functional changes intended. Suggested-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20250731145534.2759334-3-qiuxu.zhuo@intel.com
2025-08-19EDAC/{skx_common,skx}: Use configuration data, not global macrosQiuxu Zhuo
Use model-specific configuration data for the number of memory controllers per socket, channels per memory controller, and DIMMs per channel as intended, instead of relying on global macros for maximum values. No functional changes intended. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20250731145534.2759334-2-qiuxu.zhuo@intel.com
2025-08-19EDAC/i10nm: Skip DIMM enumeration on a disabled memory controllerQiuxu Zhuo
When loading the i10nm_edac driver on some Intel Granite Rapids servers, a call trace may appear as follows: UBSAN: shift-out-of-bounds in drivers/edac/skx_common.c:453:16 shift exponent -66 is negative ... __ubsan_handle_shift_out_of_bounds+0x1e3/0x390 skx_get_dimm_info.cold+0x47/0xd40 [skx_edac_common] i10nm_get_dimm_config+0x23e/0x390 [i10nm_edac] skx_register_mci+0x159/0x220 [skx_edac_common] i10nm_init+0xcb0/0x1ff0 [i10nm_edac] ... This occurs because some BIOS may disable a memory controller if there aren't any memory DIMMs populated on this memory controller. The DIMMMTR register of this disabled memory controller contains the invalid value ~0, resulting in the call trace above. Fix this call trace by skipping DIMM enumeration on a disabled memory controller. Fixes: ba987eaaabf9 ("EDAC/i10nm: Add Intel Granite Rapids server support") Reported-by: Jose Jesus Ambriz Meza <jose.jesus.ambriz.meza@intel.com> Reported-by: Chia-Lin Kao (AceLan) <acelan.kao@canonical.com> Closes: https://lore.kernel.org/all/20250730063155.2612379-1-acelan.kao@canonical.com/ Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Chia-Lin Kao (AceLan) <acelan.kao@canonical.com> Link: https://lore.kernel.org/r/20250806065707.3533345-1-qiuxu.zhuo@intel.com
2025-08-19EDAC/ie31200: Add two more Intel Alder Lake-S SoCs for EDAC supportKyle Manna
Host Device IDs (DID0) correspond to: * Intel Core i7-12700K * Intel Core i5-12600K See documentation: * 12th Generation Intel® Core™ Processors Datasheet * Volume 1 of 2, Doc. No.: 655258, Rev.: 011 * https://edc.intel.com/output/DownloadPdfDocument?id=8297 (PDF) Signed-off-by: Kyle Manna <kyle@kylemanna.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Link: https://lore.kernel.org/r/20250819161739.3241152-1-kyle@kylemanna.com
2025-08-15EDAC: Add EDAC driver for ARM Cortex A72 coresSascha Hauer
The driver is designed to support error detection and reporting for Cortex A72 cores, specifically within their L1 and L2 cache systems. The errors are detected by reading CPU/L2 memory error syndrome registers. Unfortunately there is no robust way to inject errors into the caches, so this driver doesn't contain any code to actually test it. It has been tested though with code taken from an older version [1] of this driver. For reasons stated in thread [1], the error injection code is not suitable for mainline, so it is removed from the driver. [1] https://lore.kernel.org/all/1521073067-24348-1-git-send-email-york.sun@nxp.com/#t [ bp: minor touchups. ] Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> Co-developed-by: Vijay Balakrishna <vijayb@linux.microsoft.com> Signed-off-by: Vijay Balakrishna <vijayb@linux.microsoft.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/1752714390-27389-2-git-send-email-vijayb@linux.microsoft.com
2025-07-29Merge tag 'edac_updates_for_v6.17_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras Pull EDAC updates from Borislav Petkov: - i10nm: - switch to using scnprintf() - Add Granite Rapids-D support - synopsys: Make sure ECC error and counter registers are cleared during init/probing to avoid reporting stale errors - igen6: Add Wildcat Lake SoCs support - Make sure scrub features sysfs attributes are initialized properly - Allocate memory repair sysfs attributes statically to reduce stack usage - Fix DIMM module size computation for DIMMs with total capacity which is a non power-of-two number, in amd64_edac - Do not be too dramatic when reporting disabled memory controllers in igen6_edac - Add support to ie31200_edac for the following SoCs: - Core i5-14[67]00 - Bartless Lake-S SoCs - Raptor Lake-HX * tag 'edac_updates_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras: EDAC/{skx_common,i10nm}: Use scnprintf() for safer buffer handling EDAC/synopsys: Clear the ECC counters on init EDAC/ie31200: Add Intel Raptor Lake-HX SoCs support EDAC/igen6: Add Intel Wildcat Lake SoCs support EDAC/i10nm: Add Intel Granite Rapids-D support EDAC/mem_repair: Reduce stack usage in edac_mem_repair_get_desc() EDAC/igen6: Reduce log level to debug for absent memory controllers EDAC/ie31200: Document which CPUs correspond to each Raptor Lake-S device ID EDAC/ie31200: Enable support for Core i5-14600 and i7-14700 ie31200/EDAC: Add Intel Bartlett Lake-S SoCs support
2025-07-15EDAC/{skx_common,i10nm}: Use scnprintf() for safer buffer handlingWang Haoran
snprintf() is fragile when its return value will be used to append additional data to a buffer. Use scnprintf() instead. Signed-off-by: Wang Haoran <haoranwangsec@gmail.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Link: https://lore.kernel.org/r/20250715131700.1092720-1-haoranwangsec@gmail.com
2025-07-14EDAC/synopsys: Clear the ECC counters on initShubhrajyoti Datta
Clear the ECC error and counter registers during initialization/probe to avoid reporting stale errors that may have occurred before EDAC registration. For that, unify the Zynq and ZynqMP ECC state reading paths and simplify the code. [ bp: Massage commit message. Fix an -Wsometimes-uninitialized warning as reported by Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202507141048.obUv3ZUm-lkp@intel.com ] Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/20250713050753.7042-1-shubhrajyoti.datta@amd.com
2025-07-07EDAC/ie31200: Add Intel Raptor Lake-HX SoCs supportQiuxu Zhuo
Intel Raptor Lake-HX SoC shares the same memory controller registers as Raptor Lake-S SoC. Add a compute die ID for Raptor Lake-HX SoCs with Out-of-Band ECC capability for EDAC support. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Laurens SEGHERS <laurens@rale.com> Link: https://lore.kernel.org/r/20250704151609.7833-4-qiuxu.zhuo@intel.com
2025-07-07EDAC/igen6: Add Intel Wildcat Lake SoCs supportLili Li
Intel Wildcat Lake is a mobile derivative of Panther Lake with one memory controller. Wildcat Lake SoCs share the same IBECC registers with Meteor Lake-P SoCs. Add a compute die ID and a new configuration structure for Wildcat Lake SoCs with In-Band ECC capability for EDAC support. Signed-off-by: Lili Li <lili.li@intel.com> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20250704151609.7833-3-qiuxu.zhuo@intel.com
2025-07-07EDAC/i10nm: Add Intel Granite Rapids-D supportQiuxu Zhuo
The Granite Rapids-D CPU model uses memory controller registers similar to those of the Granite Rapids server CPU but with a different memory controller MMIO base. Add the Granite Rapids-D CPU model ID and use the new memory controller MMIO base for EDAC support. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: VikasX Chougule <vikasx.chougule@intel.com> Link: https://lore.kernel.org/r/20250704151609.7833-2-qiuxu.zhuo@intel.com
2025-06-30EDAC: Initialize EDAC features sysfs attributesShiju Jose
Fix the lockdep splat caused by missing sysfs_attr_init() calls for the recently added EDAC feature's sysfs attributes. In lockdep_init_map_type(), the check for the lock-class key if (!static_obj(key) && !is_dynamic_key(key)) causes the splat. Backtrace: RIP: 0010:lockdep_init_map_type Call Trace: __kernfs_create_file sysfs_add_file_mode_ns internal_create_group internal_create_groups device_add ? __init_waitqueue_head edac_dev_register devm_cxl_memdev_edac_register ? lock_acquire ? find_held_lock ? cxl_mem_probe ? cxl_mem_probe ? lockdep_hardirqs_on ? cxl_mem_probe cxl_mem_probe [ bp: Massage. ] Fixes: f90b738166fe ("EDAC: Add scrub control feature") Fixes: bcbd069b11b0 ("EDAC: Add a Error Check Scrub control feature") Fixes: 699ea5219c4b ("EDAC: Add a memory repair control feature") Reported-by: Dave Jiang <dave.jiang@intel.com> Suggested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Link: https://lore.kernel.org/20250626101344.1726-1-shiju.jose@huawei.com
2025-06-26EDAC/mem_repair: Reduce stack usage in edac_mem_repair_get_desc()Arnd Bergmann
Constructing an array on the stack adds complexity and can exceed the warning limit for per-function stack usage: drivers/edac/mem_repair.c:361:5: error: stack frame size (1296) exceeds limit (1280) in 'edac_mem_repair_get_desc' [-Werror,-Wframe-larger-than] Change this to have the actual attribute array allocated statically and then just add the instance number on the per-instance copy. Fixes: 699ea5219c4b ("EDAC: Add a memory repair control feature") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/20250620114135.4017183-1-arnd@kernel.org
2025-06-25EDAC/amd64: Fix size calculation for Non-Power-of-Two DIMMsAvadhut Naik
Each Chip-Select (CS) of a Unified Memory Controller (UMC) on AMD Zen-based SOCs has an Address Mask and a Secondary Address Mask register associated with it. The amd64_edac module logs DIMM sizes on a per-UMC per-CS granularity during init using these two registers. Currently, the module primarily considers only the Address Mask register for computing DIMM sizes. The Secondary Address Mask register is only considered for odd CS. Additionally, if it has been considered, the Address Mask register is ignored altogether for that CS. For power-of-two DIMMs i.e. DIMMs whose total capacity is a power of two (32GB, 64GB, etc), this is not an issue since only the Address Mask register is used. For non-power-of-two DIMMs i.e., DIMMs whose total capacity is not a power of two (48GB, 96GB, etc), however, the Secondary Address Mask register is used in conjunction with the Address Mask register. However, since the module only considers either of the two registers for a CS, the size computed by the module is incorrect. The Secondary Address Mask register is not considered for even CS, and the Address Mask register is not considered for odd CS. Introduce a new helper function so that both Address Mask and Secondary Address Mask registers are considered, when valid, for computing DIMM sizes. Furthermore, also rename some variables for greater clarity. Fixes: 81f5090db843 ("EDAC/amd64: Support asymmetric dual-rank DIMMs") Closes: https://lore.kernel.org/dbec22b6-00f2-498b-b70d-ab6f8a5ec87e@natrix.lt Reported-by: Žilvinas Žaltiena <zilvinas@natrix.lt> Signed-off-by: Avadhut Naik <avadhut.naik@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com> Tested-by: Žilvinas Žaltiena <zilvinas@natrix.lt> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/20250529205013.403450-1-avadhut.naik@amd.com
2025-06-23EDAC/igen6: Reduce log level to debug for absent memory controllersQiuxu Zhuo
The current KERN_WARNING level message for detecting absent memory controllers is overly dramatic. The BIOS likely had valid reasons to disable the memory controller (e.g. it isn't connected to any DIMM slots on the motherboard for this system). So there's nothing actually wrong that needs to be fixed. Reduce the log level to KERN_DEBUG to eliminate the false warning. Suggested-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20250618162307.1523736-2-qiuxu.zhuo@intel.com
2025-06-23EDAC/ie31200: Document which CPUs correspond to each Raptor Lake-S device IDGeorge Gaidarov
Based on table 103 ("Host Device ID (DID0)") in [1], document which CPUs correspond to each Raptor Lake-S device ID for better readability. [1] https://www.intel.com/content/www/us/en/content-details/743844/13th-generation-intel-core-intel-core-14th-generation-intel-core-processor-series-1-and-series-2-and-intel-xeon-e-2400-processor-datasheet-volume-1-of-2.html Signed-off-by: George Gaidarov <gdgaidarov+lkml@gmail.com> Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20250529162933.1228735-2-gdgaidarov+lkml@gmail.com
2025-06-23EDAC/ie31200: Enable support for Core i5-14600 and i7-14700George Gaidarov
Device ID '0xa740' is shared by i7-14700, i7-14700K, and i7-14700T. Device ID '0xa704' is shared by i5-14600, i5-14600K, and i5-14600T. Tested locally on my i7-14700K. Signed-off-by: George Gaidarov <gdgaidarov+lkml@gmail.com> Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20250529162933.1228735-1-gdgaidarov+lkml@gmail.com