| author | Linus Torvalds <torvalds@linux-foundation.org> | 2024-07-18 09:34:02 -0700 |
|---|---|---|
| committer | Linus Torvalds <torvalds@linux-foundation.org> | 2024-07-18 09:34:02 -0700 |
| commit | b3ce7a30847a54a7f96a35e609303d8afecd460b | |
| tree | 81fb53546e55b9c670da4476b4b0b27e57abb25d /drivers/gpu/drm/amd/amdgpu/amdgpu_mca.c | |
| parent | b1bc554e009e3aeed7e4cfd2e717c7a34a98c683 | |
| parent | 478a52707b0abe98aac7f8c53ccddb759be66b06 | |
Merge tag 'drm-next-2024-07-18' of https://gitlab.freedesktop.org/drm/kernel
Pull drm updates from Dave Airlie:
"There's a lot of stuff in here, amd, i915 and xe have new platform
work, lots of core rework around EDID handling, some new COMPILE_TEST
options, maintainer changes and lots of other stuff. Summary:
core:
- deprecate DRM date and return 0 date
- connector: Create a set of helpers to help with HDMI support
- Remove driver owner assignments
- Allow more drivers to compile with COMPILE_TEST
- Conversions to drm_edid
- Sprinkle MODULE_DESCRIPTIONS everywhere they are missing
- Remove drm_mm_replace_node
- print: Add a drm prefix to warn level messages too, remove
___drm_dbg, consolidate prefix handling
- New monochrome TV mode variant
ttm:
- reduce the number of page faults on some platforms
- fix test builds under PREEMPT_RT
- more test coverage
ci:
- Require a more recent version of mesa
- improve farm setup and test generation
dma-buf:
- warn if reserving 0 fence slots
- internal API heap enhancements
fbdev:
- Create memory manager optimized fbdev emulation
panic:
- Allow selecting fonts
- improve drm_fb_dma_get_scanout_buffer
- Allow dumping kmsg to the screen
bridge:
- Remove redundant checks on bridge->encoder
- Remove drm_bridge_chain_mode_fixup
- bridge-connector: Plumb in the new HDMI helper
- analogix_dp: Various improvements, handle AUX transfers timeout
- samsung-dsim: Fix timings calculation
- tc358767: Plenty of small fixes, fix no connector attach, fix
clocks
- sii902x: state validation improvements
panels:
- Switch panels from register table initialization to proper code
- Now that the panel code tracks the panel state, remove every ad-hoc
implementation in the panel drivers
- More cleanup of prepare / enable state tracking in drivers
- edp: Drop legacy panel compatibles
- simple-bridge: Switch to devm_drm_bridge_add
- New panels: Lincoln Tech Sol LCD185-101CT, Microtips Technology
13-101HIEBCAF0-C, Microtips Technology MF-103HIEB0GA0,
BOE nv110wum-l60, IVO t109nw41, WL-355608-A8, PrimeView
PM070WL4, Lincoln Technologies LCD197, Ortustech
COM35H3P70ULC, AUO G104STN01, K&d kd101ne3-40ti
amdgpu:
- DCN 4.0.x support
- GC 12.0 support
- GMC 12.0 support
- SDMA 7.0 support
- MES12 support
- MMHUB 4.1 support
- GFX12 modifier and DCC support
- lots of IP fixes/updates
amdkfd:
- Contiguous VRAM allocations
- GC 12.0 support
- SDMA 7.0 support
- SR-IOV fixes
- KFD GFX ALU exceptions
i915:
- Battlemage Xe2 HPD display enablement
- Panel Replay enabling
- DP AUX-less ALPM/LOBF
- Enable link training failure fallback for DP MST links
- CMRR (Content Match Refresh Rate) enabling
- Increase ADL-S/ADL-P/DG2+ max TMDS bitrate to 6 Gbps
- Enable eDP AUX based HDR backlight
- Support replaying GPU hangs with captured context image
- Automate CCS Mode setting during engine resets
- lots of refactoring
- Increase FLR timeout from 3s to 9s
- Enable w/a 16021333562 for DG2, MTL and ARL [guc]
xe:
- update MAINTAINERS
- New uapi adding OA functionality to Xe
- expose l3 bank mask
- fix display detect on ADL-N
- runtime PM fixes
- Fix silent backmerge issues
- More prep for SR-IOV
- HWmon additions
- per client usage info
- Rework GPU page fault handling
- Drop EXEC_QUEUE_FLAG_BANNED
- Add BMG PCI IDs
- Scheduler fixes and improvements
- Rename xe_exec_queue::compute to xe_exec_queue::lr
- Use ttm_uncached for BO with NEEDS_UC flag
- Rename xe perf layer as xe observation layer
- lots of refactoring
radeon:
- Backlight workaround for iMac
- Silence UBSAN flex array warnings
msm:
- Validate registers XML description against schema in CI
- core/dpu: SM7150 support
- mdp5: Add support for MSM8937
- gpu: Add param for userspace to know if raytracing is supported
- gpu: X185 support (aka gpu in X1 laptop chips)
- gpu: a505 support
ivpu:
- hardware scheduler support
- profiling support
- improvements to the platform support layer
- firmware handling improvements
- clocks/power mgmt improvements
- scheduler/logging improvements
habanalabs:
- Gradual sleep in polling memory macro
- Reduce Gaudi2 MSI-X interrupt count to 128
- Add Gaudi2-D revision support
- Add timestamp to CPLD info
- Gaudi2: Assume hard-reset by firmware upon MC SEI severe error
- Align Gaudi2 interrupt names
- Check for errors after preboot is ready
- Change habanalabs maintainer and git repo path
mgag200:
- refactoring and improvements
- Add BMC output
- enable polling
nouveau:
- add registry command line
v3d:
- perf counters improvements
zynqmp:
- irq and debugfs improvements
atmel-hlcdc:
- Support XLCDC in sam9x7
mipi-dbi:
- Remove mipi_dbi_machine_little_endian
- make SPI bits per word configurable
- support RGB888
- allow pixel formats to be specified in the DT
sun4i:
- Rework the blender setup for DE2
panfrost:
- Enable MT8188 support
vc4:
- Monochrome TV support
exynos:
- fix fallback mode regression
- fix memory leak
- Use drm_edid_duplicate() instead of kmemdup()
etnaviv:
- fix i.MX8MP NPU clock gating
- workaround FE register cdc issues on some cores
- fix DMA sync handling for cached buffers
- fix job timeout handling
- keep TS enabled on MMUv2 cores for improved performance
mediatek:
- Convert to platform remove callback returning void
- Drop chain_mode_fixup call in mode_valid()
- Fix errors in the MediaTek display driver found by IGT
- Add display support for the MT8365-EVK board
- Fix bit depth overwritten for mtk_ovl_set bit_depth()
- Fix possible_crtcs calculation
- Fix spurious kfree()
ast:
- refactor mode setting code
stm:
- Add LVDS support
- DSI PHY updates"
* tag 'drm-next-2024-07-18' of https://gitlab.freedesktop.org/drm/kernel: (2501 commits)
drm/amdgpu/mes12: add missing opcode string
drm/amdgpu/mes11: update opcode strings
Revert "drm/amd/display: Reset freesync config before update new state"
drm/omap: Restrict compile testing to PAGE_SIZE less than 64KB
drm/xe: Drop trace_xe_hw_fence_free
drm/xe/uapi: Rename xe perf layer as xe observation layer
drm/amdgpu: remove exp hw support check for gfx12
drm/amdgpu: timely save bad pages to eeprom after gpu ras reset is completed
drm/amdgpu: flush all cached ras bad pages to eeprom
drm/amdgpu: select compute ME engines dynamically
drm/amd/display: Allow display DCC for DCN401
drm/amdgpu: select compute ME engines dynamically
drm/amdgpu/job: Replace DRM_INFO/ERROR logging
drm/amdgpu: select compute ME engines dynamically
drm/amd/pm: Ignore initial value in smu response register
drm/amdgpu: Initialize VF partition mode
drm/amd/amdgpu: fix SDMA IRQ client ID <-> req mapping
MAINTAINERS: fix Xinhui's name
MAINTAINERS: update powerplay and swsmu
drm/qxl: Pin buffer objects for internal mappings
...
Diffstat (limited to 'drivers/gpu/drm/amd/amdgpu/amdgpu_mca.c')
-rw-r--r-- drivers/gpu/drm/amd/amdgpu/amdgpu_mca.c | 383
1 file changed, 239 insertions(+), 144 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mca.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mca.c
index 0734490347db..2542bd7aa7c7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mca.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mca.c
@@ -153,7 +153,7 @@ int amdgpu_mca_mpio_ras_sw_init(struct amdgpu_device *adev)
 	return 0;
 }
 
-void amdgpu_mca_bank_set_init(struct mca_bank_set *mca_set)
+static void amdgpu_mca_bank_set_init(struct mca_bank_set *mca_set)
 {
 	if (!mca_set)
 		return;
@@ -162,7 +162,7 @@ void amdgpu_mca_bank_set_init(struct mca_bank_set *mca_set)
 	INIT_LIST_HEAD(&mca_set->list);
 }
 
-int amdgpu_mca_bank_set_add_entry(struct mca_bank_set *mca_set, struct mca_bank_entry *entry)
+static int amdgpu_mca_bank_set_add_entry(struct mca_bank_set *mca_set, struct mca_bank_entry *entry)
 {
 	struct mca_bank_node *node;
 
@@ -183,14 +183,36 @@ int amdgpu_mca_bank_set_add_entry(struct mca_bank_set *mca_set, struct mca_bank_
 	return 0;
 }
 
-void amdgpu_mca_bank_set_release(struct mca_bank_set *mca_set)
+static int amdgpu_mca_bank_set_merge(struct mca_bank_set *mca_set, struct mca_bank_set *new)
+{
+	struct mca_bank_node *node;
+
+	list_for_each_entry(node, &new->list, node)
+		amdgpu_mca_bank_set_add_entry(mca_set, &node->entry);
+
+	return 0;
+}
+
+static void amdgpu_mca_bank_set_remove_node(struct mca_bank_set *mca_set, struct mca_bank_node *node)
+{
+	if (!node)
+		return;
+
+	list_del(&node->node);
+	kvfree(node);
+
+	mca_set->nr_entries--;
+}
+
+static void amdgpu_mca_bank_set_release(struct mca_bank_set *mca_set)
 {
 	struct mca_bank_node *node, *tmp;
 
-	list_for_each_entry_safe(node, tmp, &mca_set->list, node) {
-		list_del(&node->node);
-		kvfree(node);
-	}
+	if (list_empty(&mca_set->list))
+		return;
+
+	list_for_each_entry_safe(node, tmp, &mca_set->list, node)
+		amdgpu_mca_bank_set_remove_node(mca_set, node);
 }
 
 void amdgpu_mca_smu_init_funcs(struct amdgpu_device *adev, const struct amdgpu_mca_smu_funcs *mca_funcs)
@@ -200,6 +222,45 @@ void amdgpu_mca_smu_init_funcs(struct amdgpu_device *adev, const struct amdgpu_m
 	mca->mca_funcs = mca_funcs;
 }
 
+int amdgpu_mca_init(struct amdgpu_device *adev)
+{
+	struct amdgpu_mca *mca = &adev->mca;
+	struct mca_bank_cache *mca_cache;
+	int i;
+
+	atomic_set(&mca->ue_update_flag, 0);
+
+	for (i = 0; i < ARRAY_SIZE(mca->mca_caches); i++) {
+		mca_cache = &mca->mca_caches[i];
+		mutex_init(&mca_cache->lock);
+		amdgpu_mca_bank_set_init(&mca_cache->mca_set);
+	}
+
+	return 0;
+}
+
+void amdgpu_mca_fini(struct amdgpu_device *adev)
+{
+	struct amdgpu_mca *mca = &adev->mca;
+	struct mca_bank_cache *mca_cache;
+	int i;
+
+	atomic_set(&mca->ue_update_flag, 0);
+
+	for (i = 0; i < ARRAY_SIZE(mca->mca_caches); i++) {
+		mca_cache = &mca->mca_caches[i];
+		amdgpu_mca_bank_set_release(&mca_cache->mca_set);
+		mutex_destroy(&mca_cache->lock);
+	}
+}
+
+int amdgpu_mca_reset(struct amdgpu_device *adev)
+{
+	amdgpu_mca_fini(adev);
+
+	return amdgpu_mca_init(adev);
+}
+
 int amdgpu_mca_smu_set_debug_mode(struct amdgpu_device *adev, bool enable)
 {
 	const struct amdgpu_mca_smu_funcs *mca_funcs = adev->mca.mca_funcs;
@@ -213,7 +274,7 @@ int amdgpu_mca_smu_set_debug_mode(struct amdgpu_device *adev, bool enable)
 static void amdgpu_mca_smu_mca_bank_dump(struct amdgpu_device *adev, int idx, struct mca_bank_entry *entry,
 					 struct ras_query_context *qctx)
 {
-	u64 event_id = qctx->event_id;
+	u64 event_id = qctx ? qctx->evid.event_id : RAS_EVENT_INVALID_ID;
 
 	RAS_EVENT_LOG(adev, event_id, HW_ERR "Accelerator Check Architecture events logged\n");
 	RAS_EVENT_LOG(adev, event_id, HW_ERR "aca entry[%02d].STATUS=0x%016llx\n",
@@ -228,175 +289,213 @@ static void amdgpu_mca_smu_mca_bank_dump(struct amdgpu_device *adev, int idx, st
 		      idx, entry->regs[MCA_REG_IDX_SYND]);
 }
 
-int amdgpu_mca_smu_log_ras_error(struct amdgpu_device *adev, enum amdgpu_ras_block blk, enum amdgpu_mca_error_type type,
-				 struct ras_err_data *err_data, struct ras_query_context *qctx)
+static int amdgpu_mca_smu_get_valid_mca_count(struct amdgpu_device *adev, enum amdgpu_mca_error_type type, uint32_t *count)
 {
-	struct amdgpu_smuio_mcm_config_info mcm_info;
-	struct ras_err_addr err_addr = {0};
-	struct mca_bank_set mca_set;
-	struct mca_bank_node *node;
-	struct mca_bank_entry *entry;
-	uint32_t count;
-	int ret, i = 0;
-
-	amdgpu_mca_bank_set_init(&mca_set);
-
-	ret = amdgpu_mca_smu_get_mca_set(adev, blk, type, &mca_set);
-	if (ret)
-		goto out_mca_release;
-
-	list_for_each_entry(node, &mca_set.list, node) {
-		entry = &node->entry;
+	const struct amdgpu_mca_smu_funcs *mca_funcs = adev->mca.mca_funcs;
 
-		amdgpu_mca_smu_mca_bank_dump(adev, i++, entry, qctx);
+	if (!count)
+		return -EINVAL;
 
-		count = 0;
-		ret = amdgpu_mca_smu_parse_mca_error_count(adev, blk, type, entry, &count);
-		if (ret)
-			goto out_mca_release;
+	if (mca_funcs && mca_funcs->mca_get_valid_mca_count)
+		return mca_funcs->mca_get_valid_mca_count(adev, type, count);
 
-		if (!count)
-			continue;
+	return -EOPNOTSUPP;
+}
 
-		mcm_info.socket_id = entry->info.socket_id;
-		mcm_info.die_id = entry->info.aid;
+static int amdgpu_mca_smu_get_mca_entry(struct amdgpu_device *adev, enum amdgpu_mca_error_type type,
+					int idx, struct mca_bank_entry *entry)
+{
+	const struct amdgpu_mca_smu_funcs *mca_funcs = adev->mca.mca_funcs;
+	int count;
 
-		if (blk == AMDGPU_RAS_BLOCK__UMC) {
-			err_addr.err_status = entry->regs[MCA_REG_IDX_STATUS];
-			err_addr.err_ipid = entry->regs[MCA_REG_IDX_IPID];
-			err_addr.err_addr = entry->regs[MCA_REG_IDX_ADDR];
-		}
+	if (!mca_funcs || !mca_funcs->mca_get_mca_entry)
+		return -EOPNOTSUPP;
 
-		if (type == AMDGPU_MCA_ERROR_TYPE_UE)
-			amdgpu_ras_error_statistic_ue_count(err_data,
-				&mcm_info, &err_addr, (uint64_t)count);
-		else {
-			if (amdgpu_mca_is_deferred_error(adev, entry->regs[MCA_REG_IDX_STATUS]))
-				amdgpu_ras_error_statistic_de_count(err_data,
-					&mcm_info, &err_addr, (uint64_t)count);
-			else
-				amdgpu_ras_error_statistic_ce_count(err_data,
-					&mcm_info, &err_addr, (uint64_t)count);
-		}
+	switch (type) {
+	case AMDGPU_MCA_ERROR_TYPE_UE:
+		count = mca_funcs->max_ue_count;
+		break;
+	case AMDGPU_MCA_ERROR_TYPE_CE:
+		count = mca_funcs->max_ce_count;
+		break;
+	default:
+		return -EINVAL;
 	}
 
-out_mca_release:
-	amdgpu_mca_bank_set_release(&mca_set);
+	if (idx >= count)
+		return -EINVAL;
 
-	return ret;
+	return mca_funcs->mca_get_mca_entry(adev, type, idx, entry);
 }
-
-int amdgpu_mca_smu_get_valid_mca_count(struct amdgpu_device *adev, enum amdgpu_mca_error_type type, uint32_t *count)
+static bool amdgpu_mca_bank_should_update(struct amdgpu_device *adev, enum amdgpu_mca_error_type type)
 {
-	const struct amdgpu_mca_smu_funcs *mca_funcs = adev->mca.mca_funcs;
-
-	if (!count)
-		return -EINVAL;
-
-	if (mca_funcs && mca_funcs->mca_get_valid_mca_count)
-		return mca_funcs->mca_get_valid_mca_count(adev, type, count);
+	struct amdgpu_mca *mca = &adev->mca;
+	bool ret = true;
+
+	/*
+	 * Because the UE Valid MCA count will only be cleared after reset,
+	 * in order to avoid repeated counting of the error count,
+	 * the aca bank is only updated once during the gpu recovery stage.
+	 */
+	if (type == AMDGPU_MCA_ERROR_TYPE_UE) {
+		if (amdgpu_ras_intr_triggered())
+			ret = atomic_cmpxchg(&mca->ue_update_flag, 0, 1) == 0;
+		else
+			atomic_set(&mca->ue_update_flag, 0);
+	}
 
-	return -EOPNOTSUPP;
+	return ret;
 }
 
-int amdgpu_mca_smu_get_mca_set_error_count(struct amdgpu_device *adev, enum amdgpu_ras_block blk,
-					   enum amdgpu_mca_error_type type, uint32_t *total)
+static int amdgpu_mca_smu_get_mca_set(struct amdgpu_device *adev, enum amdgpu_mca_error_type type, struct mca_bank_set *mca_set,
+				      struct ras_query_context *qctx)
 {
-	const struct amdgpu_mca_smu_funcs *mca_funcs = adev->mca.mca_funcs;
-	struct mca_bank_set mca_set;
-	struct mca_bank_node *node;
-	struct mca_bank_entry *entry;
-	uint32_t count;
+	struct mca_bank_entry entry;
+	uint32_t count = 0, i;
 	int ret;
 
-	if (!total)
+	if (!mca_set)
 		return -EINVAL;
 
-	if (!mca_funcs)
-		return -EOPNOTSUPP;
-
-	if (!mca_funcs->mca_get_ras_mca_set || !mca_funcs->mca_get_valid_mca_count)
-		return -EOPNOTSUPP;
-
-	amdgpu_mca_bank_set_init(&mca_set);
+	if (!amdgpu_mca_bank_should_update(adev, type))
+		return 0;
 
-	ret = mca_funcs->mca_get_ras_mca_set(adev, blk, type, &mca_set);
+	ret = amdgpu_mca_smu_get_valid_mca_count(adev, type, &count);
 	if (ret)
-		goto err_mca_set_release;
-
-	*total = 0;
-	list_for_each_entry(node, &mca_set.list, node) {
-		entry = &node->entry;
+		return ret;
 
-		count = 0;
-		ret = mca_funcs->mca_parse_mca_error_count(adev, blk, type, entry, &count);
+	for (i = 0; i < count; i++) {
+		memset(&entry, 0, sizeof(entry));
+		ret = amdgpu_mca_smu_get_mca_entry(adev, type, i, &entry);
 		if (ret)
-			goto err_mca_set_release;
+			return ret;
 
-		*total += count;
-	}
+		amdgpu_mca_bank_set_add_entry(mca_set, &entry);
 
-err_mca_set_release:
-	amdgpu_mca_bank_set_release(&mca_set);
+		amdgpu_mca_smu_mca_bank_dump(adev, i, &entry, qctx);
	}
 
-	return ret;
+	return 0;
 }
 
-int amdgpu_mca_smu_parse_mca_error_count(struct amdgpu_device *adev, enum amdgpu_ras_block blk,
-					 enum amdgpu_mca_error_type type, struct mca_bank_entry *entry, uint32_t *count)
+static int amdgpu_mca_smu_parse_mca_error_count(struct amdgpu_device *adev, enum amdgpu_ras_block blk,
+						enum amdgpu_mca_error_type type, struct mca_bank_entry *entry, uint32_t *count)
 {
 	const struct amdgpu_mca_smu_funcs *mca_funcs = adev->mca.mca_funcs;
+
 	if (!count || !entry)
 		return -EINVAL;
 
 	if (!mca_funcs || !mca_funcs->mca_parse_mca_error_count)
 		return -EOPNOTSUPP;
-
 	return mca_funcs->mca_parse_mca_error_count(adev, blk, type, entry, count);
 }
 
-int amdgpu_mca_smu_get_mca_set(struct amdgpu_device *adev, enum amdgpu_ras_block blk,
-			       enum amdgpu_mca_error_type type, struct mca_bank_set *mca_set)
+static int amdgpu_mca_dispatch_mca_set(struct amdgpu_device *adev, enum amdgpu_ras_block blk, enum amdgpu_mca_error_type type,
+				       struct mca_bank_set *mca_set, struct ras_err_data *err_data)
 {
-	const struct amdgpu_mca_smu_funcs *mca_funcs = adev->mca.mca_funcs;
+	struct ras_err_addr err_addr;
+	struct amdgpu_smuio_mcm_config_info mcm_info;
+	struct mca_bank_node *node, *tmp;
+	struct mca_bank_entry *entry;
+	uint32_t count;
+	int ret;
 
 	if (!mca_set)
 		return -EINVAL;
 
-	if (!mca_funcs || !mca_funcs->mca_get_ras_mca_set)
-		return -EOPNOTSUPP;
+	if (!mca_set->nr_entries)
+		return 0;
 
-	WARN_ON(!list_empty(&mca_set->list));
+	list_for_each_entry_safe(node, tmp, &mca_set->list, node) {
+		entry = &node->entry;
+
+		count = 0;
+		ret = amdgpu_mca_smu_parse_mca_error_count(adev, blk, type, entry, &count);
+		if (ret && ret != -EOPNOTSUPP)
+			return ret;
+
+		if (!count)
+			continue;
+
+		memset(&mcm_info, 0, sizeof(mcm_info));
+		memset(&err_addr, 0, sizeof(err_addr));
+
+		mcm_info.socket_id = entry->info.socket_id;
+		mcm_info.die_id = entry->info.aid;
+
+		if (blk == AMDGPU_RAS_BLOCK__UMC) {
+			err_addr.err_status = entry->regs[MCA_REG_IDX_STATUS];
+			err_addr.err_ipid = entry->regs[MCA_REG_IDX_IPID];
+			err_addr.err_addr = entry->regs[MCA_REG_IDX_ADDR];
+		}
+
+		if (type == AMDGPU_MCA_ERROR_TYPE_UE) {
+			amdgpu_ras_error_statistic_ue_count(err_data,
+							    &mcm_info, &err_addr, (uint64_t)count);
+		} else {
+			if (amdgpu_mca_is_deferred_error(adev, entry->regs[MCA_REG_IDX_STATUS]))
+				amdgpu_ras_error_statistic_de_count(err_data,
+								    &mcm_info, &err_addr, (uint64_t)count);
+			else
+				amdgpu_ras_error_statistic_ce_count(err_data,
+								    &mcm_info, &err_addr, (uint64_t)count);
+		}
 
-	return mca_funcs->mca_get_ras_mca_set(adev, blk, type, mca_set);
+		amdgpu_mca_bank_set_remove_node(mca_set, node);
+	}
+
+	return 0;
 }
 
-int amdgpu_mca_smu_get_mca_entry(struct amdgpu_device *adev, enum amdgpu_mca_error_type type,
-				 int idx, struct mca_bank_entry *entry)
+static int amdgpu_mca_add_mca_set_to_cache(struct amdgpu_device *adev, enum amdgpu_mca_error_type type, struct mca_bank_set *new)
 {
-	const struct amdgpu_mca_smu_funcs *mca_funcs = adev->mca.mca_funcs;
-	int count;
+	struct mca_bank_cache *mca_cache = &adev->mca.mca_caches[type];
+	int ret;
 
-	if (!mca_funcs || !mca_funcs->mca_get_mca_entry)
-		return -EOPNOTSUPP;
+	mutex_lock(&mca_cache->lock);
+	ret = amdgpu_mca_bank_set_merge(&mca_cache->mca_set, new);
+	mutex_unlock(&mca_cache->lock);
 
-	switch (type) {
-	case AMDGPU_MCA_ERROR_TYPE_UE:
-		count = mca_funcs->max_ue_count;
-		break;
-	case AMDGPU_MCA_ERROR_TYPE_CE:
-		count = mca_funcs->max_ce_count;
-		break;
-	default:
-		return -EINVAL;
+	return ret;
+}
+
+int amdgpu_mca_smu_log_ras_error(struct amdgpu_device *adev, enum amdgpu_ras_block blk, enum amdgpu_mca_error_type type,
+				 struct ras_err_data *err_data, struct ras_query_context *qctx)
+{
+	struct mca_bank_set mca_set;
+	struct mca_bank_cache *mca_cache = &adev->mca.mca_caches[type];
+	int ret;
+
+	amdgpu_mca_bank_set_init(&mca_set);
+
+	ret = amdgpu_mca_smu_get_mca_set(adev, type, &mca_set, qctx);
+	if (ret)
+		goto out_mca_release;
+
+	ret = amdgpu_mca_dispatch_mca_set(adev, blk, type, &mca_set, err_data);
+	if (ret)
+		goto out_mca_release;
+
+	/* add remain mca bank to mca cache */
+	if (mca_set.nr_entries) {
+		ret = amdgpu_mca_add_mca_set_to_cache(adev, type, &mca_set);
+		if (ret)
+			goto out_mca_release;
 	}
 
-	if (idx >= count)
-		return -EINVAL;
+	/* dispatch mca set again if mca cache has valid data */
+	mutex_lock(&mca_cache->lock);
+	if (mca_cache->mca_set.nr_entries)
+		ret = amdgpu_mca_dispatch_mca_set(adev, blk, type, &mca_cache->mca_set, err_data);
+	mutex_unlock(&mca_cache->lock);
 
-	return mca_funcs->mca_get_mca_entry(adev, type, idx, entry);
+out_mca_release:
+	amdgpu_mca_bank_set_release(&mca_set);
+
+	return ret;
 }
 
 #if defined(CONFIG_DEBUG_FS)
@@ -437,36 +536,32 @@ static void mca_dump_entry(struct seq_file *m, struct mca_bank_entry *entry)
 static int mca_dump_show(struct seq_file *m, enum amdgpu_mca_error_type type)
 {
 	struct amdgpu_device *adev = (struct amdgpu_device *)m->private;
-	struct mca_bank_entry *entry;
-	uint32_t count = 0;
-	int i, ret;
+	struct mca_bank_node *node;
+	struct mca_bank_set mca_set;
+	struct ras_query_context qctx;
+	int ret;
 
-	ret = amdgpu_mca_smu_get_valid_mca_count(adev, type, &count);
+	amdgpu_mca_bank_set_init(&mca_set);
+
+	qctx.evid.event_id = RAS_EVENT_INVALID_ID;
+	ret = amdgpu_mca_smu_get_mca_set(adev, type, &mca_set, &qctx);
 	if (ret)
-		return ret;
+		goto err_free_mca_set;
 
 	seq_printf(m, "amdgpu smu %s valid mca count: %d\n",
-		   type == AMDGPU_MCA_ERROR_TYPE_UE ? "UE" : "CE", count);
-
-	if (!count)
-		return 0;
+		   type == AMDGPU_MCA_ERROR_TYPE_UE ? "UE" : "CE", mca_set.nr_entries);
 
-	entry = kmalloc(sizeof(*entry), GFP_KERNEL);
-	if (!entry)
-		return -ENOMEM;
+	if (!mca_set.nr_entries)
+		goto err_free_mca_set;
 
-	for (i = 0; i < count; i++) {
-		memset(entry, 0, sizeof(*entry));
+	list_for_each_entry(node, &mca_set.list, node)
+		mca_dump_entry(m, &node->entry);
 
-		ret = amdgpu_mca_smu_get_mca_entry(adev, type, i, entry);
-		if (ret)
-			goto err_free_entry;
+	/* add mca bank to mca bank cache */
+	ret = amdgpu_mca_add_mca_set_to_cache(adev, type, &mca_set);
 
-		mca_dump_entry(m, entry);
-	}
-
-err_free_entry:
-	kfree(entry);
+err_free_mca_set:
+	amdgpu_mca_bank_set_release(&mca_set);
 
 	return ret;
 }
@@ -513,7 +608,7 @@ DEFINE_DEBUGFS_ATTRIBUTE(mca_debug_mode_fops, NULL, amdgpu_mca_smu_debug_mode_se
 void amdgpu_mca_smu_debugfs_init(struct amdgpu_device *adev, struct dentry *root)
 {
 #if defined(CONFIG_DEBUG_FS)
-	if (!root || amdgpu_ip_version(adev, MP1_HWIP, 0) != IP_VERSION(13, 0, 6))
+	if (!root)
 		return;
 
 	debugfs_create_file("mca_debug_mode", 0200, root, adev, &mca_debug_mode_fops);
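
For readers skimming the diff, the data structure at its heart is small: a counted, list-backed set of MCA bank entries that can be merged into a longer-lived per-error-type cache. The sketch below is a user-space analogue of that pattern, not the driver code itself; the names only mirror the kernel's, a plain singly linked list stands in for list_head, and malloc/free for kvmalloc/kvfree.

```c
#include <stdio.h>
#include <stdlib.h>

/* User-space stand-ins for the kernel structures; same shape, simpler types. */
struct mca_bank_entry {
	unsigned long long status;
	unsigned long long addr;
};

struct mca_bank_node {
	struct mca_bank_entry entry;
	struct mca_bank_node *next;
};

struct mca_bank_set {
	int nr_entries;
	struct mca_bank_node *head;
};

static void bank_set_init(struct mca_bank_set *set)
{
	set->nr_entries = 0;
	set->head = NULL;
}

static int bank_set_add_entry(struct mca_bank_set *set, const struct mca_bank_entry *entry)
{
	struct mca_bank_node *node = calloc(1, sizeof(*node));

	if (!node)
		return -1;

	node->entry = *entry;
	node->next = set->head;	/* push front; ordering does not matter for a bank set */
	set->head = node;
	set->nr_entries++;

	return 0;
}

/* The cache "merge" step: copy every entry of @src into @set. */
static int bank_set_merge(struct mca_bank_set *set, const struct mca_bank_set *src)
{
	const struct mca_bank_node *node;

	for (node = src->head; node; node = node->next)
		if (bank_set_add_entry(set, &node->entry))
			return -1;

	return 0;
}

static void bank_set_release(struct mca_bank_set *set)
{
	struct mca_bank_node *node = set->head, *tmp;

	while (node) {
		tmp = node->next;	/* save the successor before freeing */
		free(node);
		node = tmp;
	}
	bank_set_init(set);
}

int main(void)
{
	struct mca_bank_set fresh, cache;
	struct mca_bank_entry entry = { .status = 0x1ULL, .addr = 0x1000ULL };

	bank_set_init(&fresh);
	bank_set_init(&cache);
	bank_set_add_entry(&fresh, &entry);

	/* Entries that could not be dispatched yet are parked in the cache. */
	bank_set_merge(&cache, &fresh);
	printf("cached entries: %d\n", cache.nr_entries);

	bank_set_release(&fresh);
	bank_set_release(&cache);
	return 0;
}
```

The point of the merge step is visible in amdgpu_mca_smu_log_ras_error() above: banks whose error counts cannot be dispatched yet stay in the set, are merged into adev->mca.mca_caches[type] under its mutex, and are re-dispatched on a later query.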
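The other notable mechanism is ue_update_flag: because the UE valid-MCA count is cleared only by a reset, the patch lets exactly one caller harvest the UE banks during GPU recovery. Below is a minimal sketch of that one-shot guard, using C11 atomics in place of the kernel's atomic_t; ras_intr_triggered() is a stubbed-out stand-in for amdgpu_ras_intr_triggered().

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_int ue_update_flag;

/* Stand-in for amdgpu_ras_intr_triggered(): pretend recovery is in progress. */
static bool ras_intr_triggered(void)
{
	return true;
}

static bool ue_bank_should_update(void)
{
	if (ras_intr_triggered()) {
		/* The first caller during recovery flips 0 -> 1 and harvests
		 * the banks; every later caller sees 1 and skips, so UE banks
		 * are counted exactly once per recovery. */
		int expected = 0;

		return atomic_compare_exchange_strong(&ue_update_flag, &expected, 1);
	}

	/* Outside of recovery: re-arm the one-shot flag. */
	atomic_store(&ue_update_flag, 0);
	return true;
}

int main(void)
{
	int first = ue_bank_should_update();
	int second = ue_bank_should_update();

	printf("%d %d\n", first, second);	/* prints "1 0" */
	return 0;
}
```

Outside of recovery the flag is re-armed, so the next recovery window again admits a single harvester.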