summaryrefslogtreecommitdiff
path: root/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
AgeCommit message (Collapse)Author
14 daysdrm/amdgpu: fix cyan_skillfish2 gpu info fw handlingAlex Deucher
If the board supports IP discovery, we don't need to parse the gpu info firmware. Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4721 Fixes: fa819e3a7c1e ("drm/amdgpu: add support for cyan skillfish gpu_info") Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-11-24drm/amdgpu: Fix GFX hang on SteamDeck when amdgpu is reloadedRodrigo Siqueira
When trying to unload amdgpu in the SteamDeck (TTY mode), the following set of errors happens and the system gets unstable: [..] [drm] Initialized amdgpu 3.64.0 for 0000:04:00.0 on minor 0 amdgpu 0000:04:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on gfx_0.0.0 (-110). amdgpu 0000:04:00.0: amdgpu: ib ring test failed (-110). [..] amdgpu 0000:04:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000001E SMN_C2PMSG_82:0x00000000 amdgpu 0000:04:00.0: amdgpu: Failed to disable gfxoff! amdgpu 0000:04:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000001E SMN_C2PMSG_82:0x00000000 amdgpu 0000:04:00.0: amdgpu: Failed to disable gfxoff! [..] When the driver initializes the GPU, the PSP validates all the firmware loaded, and after that, it is not possible to load any other firmware unless the device is reset. What is happening in the load/unload situation is that PSP halts the GC engine because it suspects that something is amiss. To address this issue, this commit ensures that the GPU is reset (mode 2 reset) in the unload sequence. Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Suggested-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Rodrigo Siqueira <siqueira@igalia.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-11-19drm/amd: Skip power ungate during suspend for VPEMario Limonciello
During the suspend sequence VPE is already going to be power gated as part of vpe_suspend(). It's unnecessary to call during calls to amdgpu_device_set_pg_state(). It actually can expose a race condition with the firmware if s0i3 sequence starts as well. Drop these calls. Cc: Peyton.Lee@amd.com Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-11-11drm/amdgpu: Use DC by default on SI dGPUsTimur Kristóf
Now that DC supports analog connectors, it has reached feature parity with the legacy non-DC display driver on SI dGPUs. Use the DC display driver by default on SI dGPUs, unless it is explicitly disabled using the amdgpu.dc=0 module parameter. DC brings proper support for DP/HDMI audio, DP MST, 10-bit colors, some HDR features, atomic modesetting, etc. Also clarify the comment about what is missing to have full DC support for CIK APUs. Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-11-06drm/amd/ras: Fix the issue of incorrect function callYiPeng Chai
When amdgpu_device_health_check fails, amdgpu_ras_pre_reset will not be called and therefore amdgpu_ras_post_reset cannot be called either. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-11-04drm/amdgpu: suspend ras module before gpu resetYiPeng Chai
During gpu reset, all GPU-related resources are inaccessible. To avoid affecting ras functionality, suspend ras module before gpu reset and resume it after gpu reset is complete. V2: Rename functions to avoid misunderstanding. V3: Move flush_delayed_work to amdgpu_ras_process_pause, Move schedule_delayed_work to amdgpu_ras_process_unpause. V4: Rename functions. V5: Move the function to amdgpu_ras.c. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Acked-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-11-04drm/amdgpu: Add ras ip block nameYiPeng Chai
Add ras ip block name. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-11-04drm/amdgpu: Implement user queue reset functionalityJesse.Zhang
This patch adds robust reset handling for user queues (userq) to improve recovery from queue failures. The key components include: 1. Queue detection and reset logic: - amdgpu_userq_detect_and_reset_queues() identifies failed queues - Per-IP detect_and_reset callbacks for targeted recovery - Falls back to full GPU reset when needed 2. Reset infrastructure: - Adds userq_reset_work workqueue for async reset handling - Implements pre/post reset handlers for queue state management - Integrates with existing GPU reset framework 3. Error handling improvements: - Enhanced state tracking with HUNG state - Automatic reset triggering on critical failures - VRAM loss handling during recovery 4. Integration points: - Added to device init/reset paths - Called during queue destroy, suspend, and isolation events - Handles both individual queue and full GPU resets The reset functionality works with both gfx/compute and sdma queues, providing better resilience against queue failures while minimizing disruption to unaffected queues. v2: add detection and reset calls when preemption/unmaped fails. add a per device userq counter for each user queue type.(Alex) v3: make sure we hold the adev->userq_mutex when we call amdgpu_userq_detect_and_reset_queues. (Alex) warn if the adev->userq_mutex is not held. v4: make sure we have all of the uqm->userq_mutex held. warn if the uqm->userq_mutex is not held. v5: Use array for user queue type counters.(Alex) all of the uqm->userq_mutex need to be held when calling detect and reset. (Alex) v6: fix lock dep warning in amdgpu_userq_fence_dence_driver_process v7: add the queue types in an array and use a loop in amdgpu_userq_detect_and_reset_queues (Lijo) v8: remove atomic_set(&userq_mgr->userq_count[i], 0). it should already be 0 since we kzalloc the structure (Alex) v9: For consistency with kernel queues, We may want something like: amdgpu_userq_is_reset_type_supported (Alex) Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-11-04drm/amd: Unwind for failed device suspendMario Limonciello (AMD)
If device suspend has failed, add a recovery flow that will attempt to unwind the suspend and get things back up and running. Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4627 Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-11-04drm/amd: Add an unwind for failures in amdgpu_device_ip_suspend_phase2()Mario Limonciello (AMD)
If any hardware IPs involved with the second phase of suspend fail, unwind all steps to restore back to original state. Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-11-04drm/amd: Add an unwind for failures in amdgpu_device_ip_suspend_phase1()Mario Limonciello (AMD)
If any hardware IPs involved with the first phase of suspend fail, unwind all steps to restore back to original state. Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-11-04drm/amdgpu: Drop PMFW RLC notifier from amdgpu_device_suspend()Alex Deucher
For S3 on vangogh, PMFW needs to be notified before the driver powers down RLC. This already happens in smu_disable_dpms() so drop the superfluous call in amdgpu_device_suspend(). Co-developed-by: Mario Limonciello (AMD) <superm1@kernel.org> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-11-04drm/amdgpu: Remove invalidate and flush hdp macrosAsad Kamal
Remove amdgpu_asic_flush_hdp & amdgpu_asic_invalidate_hdp functions and directly use the mapped ones Signed-off-by: Asad Kamal <asad.kamal@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-31Merge tag 'amd-drm-next-6.19-2025-10-29' of ↵Simona Vetter
https://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-next-6.19-2025-10-29: amdgpu: - VPE idle handler fix - Re-enable DM idle optimizations - DCN3.0 fix - SMU fix - Powerplay fixes for fiji/iceland - License copy-pasta fixes - HDP eDP panel fix - Vblank fix - RAS fixes - SR-IOV updates - SMU 13 VCN reset fix - DMUB fixes - DC frame limit fix - Additional DC underflow logging - DCN 3.1.5 fixes - DC Analog encoders support - Enable DC on bonaire by default - UserQ fixes - Remove redundant pm_runtime_mark_last_busy() calls amdkfd: - Process cleanup fix - Misc fixes radeon: - devm migration fixes - Remove redundant pm_runtime_mark_last_busy() calls UAPI - Add ABM KMS property Proposed kwin changes: https://invent.kde.org/plasma/kwin/-/merge_requests/6028 Signed-off-by: Simona Vetter <simona.vetter@ffwll.ch> From: Alex Deucher <alexander.deucher@amd.com> Link: https://patch.msgid.link/20251029205713.9480-1-alexander.deucher@amd.com
2025-10-31Merge tag 'amd-drm-next-6.19-2025-10-24' of ↵Simona Vetter
https://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-next-6.19-2025-10-24: amdgpu: - HMM cleanup - Add new RAS framework - DML2.1 updates - YCbCr420 fixes - DC FP fixes - DMUB fixes - LTTPR fixes - DTBCLK fixes - DMU cursor offload handling - Userq validation improvements - Misc code cleanups - Unify shutdown callback handling - Suspend improvements - Power limit code cleanup - Fence cleanup - IP Discovery cleanup - SR-IOV fixes - AUX backlight fixes - DCN 3.5 fixes - HDMI compliance fixes - DCN 4.0.1 cursor updates - DCN interrupt fix - DC KMS full update improvements - Add additional HDCP traces - DCN 3.2 fixes - DP MST fixes - Add support for new SR-IOV mailbox interface Signed-off-by: Simona Vetter <simona.vetter@ffwll.ch> From: Alex Deucher <alexander.deucher@amd.com> Link: https://lore.kernel.org/r/20251024175249.58099-1-alexander.deucher@amd.com
2025-10-28drm/amdgpu: Use DC by default for BonaireTimur Kristóf
Now that DC supports analog connectors, there is nothing stopping us from using it by default on Bonaire. Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Harry Wentland <harry.wentland@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-28drm/amdgpu: Convert amdgpu userqueue management from IDR to XArrayJesse.Zhang
This commit refactors the AMDGPU userqueue management subsystem to replace IDR (ID Allocation) with XArray for improved performance, scalability, and maintainability. The changes address several issues with the previous IDR implementation and provide better locking semantics. Key changes: 1. **Global XArray Introduction**: - Added `userq_doorbell_xa` to `struct amdgpu_device` for global queue tracking - Uses doorbell_index as key for efficient global lookup - Replaces the previous `userq_mgr_list` linked list approach 2. **Per-process XArray Conversion**: - Replaced `userq_idr` with `userq_mgr_xa` in `struct amdgpu_userq_mgr` - Maintains per-process queue tracking with queue_id as key - Uses XA_FLAGS_ALLOC for automatic ID allocation 3. **Locking Improvements**: - Removed global `userq_mutex` from `struct amdgpu_device` - Replaced with fine-grained XArray locking using XArray's internal spinlocks 4. **Runtime Idle Check Optimization**: - Updated `amdgpu_runtime_idle_check_userq()` to use xa_empty 5. **Queue Management Functions**: - Converted all IDR operations to equivalent XArray functions: - `idr_alloc()` → `xa_alloc()` - `idr_find()` → `xa_load()` - `idr_remove()` → `xa_erase()` - `idr_for_each()` → `xa_for_each()` Benefits: - **Performance**: XArray provides better scalability for large numbers of queues - **Memory Efficiency**: Reduced memory overhead compared to IDR - **Thread Safety**: Improved locking semantics with XArray's internal spinlocks v2: rename userq_global_xa/userq_xa to userq_doorbell_xa/userq_mgr_xa Remove xa_lock and use its own lock. v3: Set queue->userq_mgr = uq_mgr in amdgpu_userq_create() v4: use xa_store_irq (Christian) hold the read side of the reset lock while creating/destroying queues and the manager data structure. (Chritian) Acked-by: Alex Deucher <alexander.deucher@amd.com> Suggested-by: Christian König <christian.koenig@amd.com> Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-24Merge tag 'drm-misc-next-2025-10-21' of ↵Simona Vetter
https://gitlab.freedesktop.org/drm/misc/kernel into drm-next drm-misc-next for v6.19: UAPI Changes: amdxdna: - Support reading last hardware error Cross-subsystem Changes: dma-buf: - heaps: Create heap per CMA reserved location; Improve user-space documentation Core Changes: atomic: - Clean up and improve state-handling interfaces, update drivers bridge: - Improve ref counting buddy: - Optimize block management Driver Changes: amdxdna: - Fix runtime power management - Support firmware debug output ast: - Set quirks for each chip model atmel-hlcdc: - Set LCDC_ATTRE register in plane disable - Set correct values for plane scaler bochs: - Use vblank timer bridge: - synopsis: Support CEC; Init timer with correct frequency cirrus-qemu: - Use vblank timer imx: - Clean up ivu: - Update JSM API to 3.33.0 - Reset engine on more job errors - Return correct error codes for jobs komeda: - Use drm_ logging functions panel: - edp: Support AUO B116XAN02.0 panfrost: - Embed struct drm_driver in Panfrost device - Improve error handling - Clean up job handling panthor: - Support custom ASN_HASH for mt8196 renesas: - rz-du: Fix dependencies rockchip: - dsi: Add support for RK3368 - Fix LUT size for RK3386 sitronix: - Fix output position when clearing screens qaic: - Support dma-buf exports - Support new firmware's READ_DATA implementation - Replace kcalloc with memdup - Replace snprintf() with sysfs_emit() - Avoid overflows in arithmetics - Clean up - Fixes qxl: - Use vblank timer rockchip: - Clean up mode-setting code vgem: - Fix fence timer deadlock virtgpu: - Use vblank timer Signed-off-by: Simona Vetter <simona.vetter@ffwll.ch> From: Thomas Zimmermann <tzimmermann@suse.de> Link: https://lore.kernel.org/r/20251021111837.GA40643@linux.fritz.box
2025-10-20drm/amdgpu: Introduce SRIOV critical regions v2 during VF initEllen Pan
1. Introduced amdgpu_virt_init_critical_region during VF init. - VFs use init_data_header_offset and init_data_header_size_kb transmitted via PF2VF mailbox to fetch the offset of critical regions' offsets/sizes in VRAM and save to adev->virt.crit_region_offsets and adev->virt.crit_region_sizes_kb. Signed-off-by: Ellen Pan <yunru.pan@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-20drm/amdgpu: use GPU_HDP_FLUSH for sriovVictor Zhao
Currently SRIOV runtime will use kiq to write HDP_MEM_FLUSH_CNTL for hdp flush. This register need to be write from CPU for nbif to aware, otherwise it will not work. Implement amdgpu_kiq_hdp_flush and use kiq to do gpu hdp flush during sriov runtime. v2: - fallback to amdgpu_asic_flush_hdp when amdgpu_kiq_hdp_flush failed - add function amdgpu_mes_hdp_flush v3: - changed returned error Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Victor Zhao <Victor.Zhao@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-20drm/amd: Add a helper to tell whether an IP block HW is enabledMario Limonciello
There is already a helper for telling if a block is valid, but if IP handling wants to check if it's HW is enabled no such helper exists. Reviewed-by: Harry Wentland <harry.wentland@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-18drm/client: Remove holds_console_lock parameter from suspend/resumeThomas Zimmermann
No caller of the client resume/suspend helpers holds the console lock. The last such cases were removed from radeon in the patch series at [1]. Now remove the related parameter and the TODO items. v2: - update placeholders for CONFIG_DRM_CLIENT=n Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Link: https://patchwork.freedesktop.org/series/151624/ # [1] Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Petr Vorel <pvorel@suse.cz> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com> Acked-by: Danilo Krummrich <dakr@kernel.org> Reviewed-by: Lyude Paul <lyude@redhat.com> Reviewed-by: Jocelyn Falempe <jfalempe@redhat.com> Link: https://lore.kernel.org/r/20251001143709.419736-1-tzimmermann@suse.de
2025-10-13drm/amdgpu: Move reset-on-init sequence earlierLijo Lazar
Complete reset-on-init sequence before sysfs interfaces are created. Devices get properly initiaized only after reset, and then only sysfs interfaces should be made available. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Asad Kamal <asad.kamal@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13drm/amdgpu: Add amdgpu_discovery_infoLijo Lazar
Add amdgpu_discovery_info structure to keep all discovery related information. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13drm/amdgpu: Reorganize sysfs ini/fini callsLijo Lazar
Aggregate sysfs ini/fini calls into separate functions. No functional change. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Asad Kamal <asad.kamal@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13drm/amdgpu: clean up and unify hw fence handlingAlex Deucher
Decouple the amdgpu fence from the amdgpu_job structure. This lets us clean up the separate fence ops for the embedded fence and other fences. This also allows us to allocate the vm fence up front when we allocate the job. v2: Additional cleanup suggested by Christian v3: Additional cleanups suggested by Christian v4: Additional cleanups suggested by David and vm fence fix v5: cast seqno (David) Cc: David.Wu3@amd.com Cc: christian.koenig@amd.com Tested-by: David (Ming Qiang) Wu <David.Wu3@amd.com> Reviewed-by: David (Ming Qiang) Wu <David.Wu3@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13drm/amd: Pass userq suspend failures up to callerMario Limonciello
If a userq failed to suspend the rest of the suspend sequence may have problems. Pass the error code up to the caller for a decision on what to do. Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13drm/amd: Pass IP suspend errors up to callersMario Limonciello
If IP suspend fails the callers should be notified so that they can potentially react. Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13drm/amd: Don't always set IP block HW status to falseMario Limonciello
amdgpu_device_ip_suspend_phase2() calls amdgpu_ip_block_suspend() which already sets HW block status to false when succeeding with IP suspend. Remove the explicit call in amdgpu_device_ip_suspend_phase2() so that the status is accurate. Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13drm/amd: Remove comment about handling errors in ↵Mario Limonciello
amdgpu_device_ip_suspend_phase1() Error handling was introduced in commit e095026f0066 ("drm/amdgpu: validate suspend before function call") so the comment about TODO is no longer needed. Fixes: e095026f0066 ("drm/amdgpu: validate suspend before function call") Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13drm/amd: Stop exporting amdgpu_device_ip_suspend() outside amdgpu_deviceMario Limonciello
amdgpu_device_ip_suspend() doesn't have a caller outside of amdgpu_device.c. Make it static. No intended functional changes. Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13drm/amdgpu: reduce queue timeout to 2 seconds v2Christian König
There has been multiple complains that 10 seconds are usually to long. The original requirement for longer timeout came from compute tests on AMDVLK, since that is no longer a topic reduce the timeout back to 2 seconds for all queues. While at it also remove any special handling for compute queues under SRIOV or pass through. v2: fix checkpatch warning. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13drm/amd: Disable ASPM on SITimur Kristóf
Enabling ASPM causes randoms hangs on Tahiti and Oland on Zen4. It's unclear if this is a platform-specific or GPU-specific issue. Disable ASPM on SI for the time being. Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-07drm/amdgpu: Report individual reset errorLijo Lazar
If reinitialization of one of the GPUs fails after reset, it logs failure on all subsequent GPUs eventhough they have resumed successfully. A sample log where only device at 0000:95:00.0 had a failure - amdgpu 0000:15:00.0: amdgpu: GPU reset(19) succeeded! amdgpu 0000:65:00.0: amdgpu: GPU reset(19) succeeded! amdgpu 0000:75:00.0: amdgpu: GPU reset(19) succeeded! amdgpu 0000:85:00.0: amdgpu: GPU reset(19) succeeded! amdgpu 0000:95:00.0: amdgpu: GPU reset(19) failed amdgpu 0000:e5:00.0: amdgpu: GPU reset(19) failed amdgpu 0000:f5:00.0: amdgpu: GPU reset(19) failed amdgpu 0000:05:00.0: amdgpu: GPU reset(19) failed amdgpu 0000:15:00.0: amdgpu: GPU reset end with ret = -5 To avoid confusion, report the error for each device separately and return the first error as the overall result. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Asad Kamal <asad.kamal@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-07drm/amdgpu: Check swus/ds for switch state saveLijo Lazar
For saving switch state, check if the GPU is having SWUS/DS architecture. Otherwise, skip saving. Reported-by: Roman Elshin <roman.elshin@gmail.com> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4602 Fixes: 1dd2fa0e00f1 ("drm/amdgpu: Save and restore switch state") Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-18drm/amdgpu: suspend KFD and KGD user queues for S0ixAlex Deucher
We need to make sure the user queues are preempted so GFX can enter gfxoff. Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Tested-by: David Perry <david.perry@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-15drm/amd: Avoid evicting resources at S5Mario Limonciello (AMD)
Normally resources are evicted on dGPUs at suspend or hibernate and on APUs at hibernate. These steps are unnecessary when using the S4 callbacks to put the system into S5. Cc: AceLan Kao <acelan.kao@canonical.com> Cc: Kai-Heng Feng <kaihengf@nvidia.com> Cc: Mark Pearson <mpearson-lenovo@squebb.ca> Cc: Denis Benato <benato.denis96@gmail.com> Cc: Merthan Karakaş <m3rthn.k@gmail.com> Tested-by: Eric Naim <dnaim@cachyos.org> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-15drm/amdgpu: Release hive reference properlyLijo Lazar
xgmi hive reference is taken on function entry, but not released correctly for all paths. Use __free() to release reference properly. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Ce Sun <cesun102@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-09drm/amdgpu: Fix NULL ptr deref in amdgpu_device_cache_switch_state()John Olender
Kaveri has no upstream bridge, therefore parent is NULL. $ lspci -PP ... 00:01.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Kaveri [Radeon R7 Graphics] (rev d4) For comparison, Raphael: $ lspci -PP ... 00:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Internal GPP Bridge to Bus [C:A] ... 00:08.1/0e:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Raphael (rev c5) Fixes: 1dd2fa0e00f1 ("drm/amdgpu: Save and restore switch state") Link: https://lore.kernel.org/amd-gfx/38fe6513-f8a9-4669-8e86-89c54c465611@gmail.com/ Reviewed-by: Candice Li <candice.li@amd.com> Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Signed-off-by: John Olender <john.olender@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-05drm/amdgpu: add support for cyan skillfish gpu_infoAlex Deucher
Some SOCs which are part of the cyan skillfish family rely on an explicit firmware for IP discovery. Add support for the gpu_info firmware. Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-02Merge tag 'amd-drm-next-6.18-2025-08-29' of ↵Dave Airlie
https://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-next-6.18-2025-08-29: amdgpu: - Replay fixes - RAS updates - VCN SRAM load fixes - EDID read fixes - eDP ALPM support - AUX fixes - Documenation updates - Rework how PTE flags are generated - DCE6 fixes - VCN devcoredump cleanup - MMHUB client id fixes - SR-IOV fixes - VRR fixes - VCN 5.0.1 RAS support - Backlight fixes - UserQ fixes - Misc code cleanups - SMU 13.0.12 updates - Expanded PCIe DPC support - Expanded VCN reset support - SMU 13.0.x Updates - VPE per queue reset support - Cusor rotation fix - DSC fixes - GC 12 MES TLB invalidation update - Cursor fixes - Non-DC TMDS clock validation fix amdkfd: - debugfs fixes - Misc code cleanups - Page migration fixes - Partition fixes - SVM fixes radeon: - Misc code cleanups Signed-off-by: Dave Airlie <airlied@redhat.com> From: Alex Deucher <alexander.deucher@amd.com> Link: https://lore.kernel.org/r/20250829190848.1921648-1-alexander.deucher@amd.com
2025-08-29drm/amd/amdgpu: unified amdgpu ip block nameYang Wang
v1: 1. Unified amdgpu ip block name print with format "{ip_type}_v{major}_{minor}_{rev}" 2. Avoid IP block name conflicts for SMU/PSP ip block v2: Update IP block print format to keep legacy IP block name (Alex) "{ip_type}_v{major}_{minor}_{rev} ({funcs->name})" Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Asad Kamal <asad.kamal@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-11drm/amdgpu: Save and restore switch stateLijo Lazar
During a DPC error kernel waits for the link to be active before notifying downstream devices. On certain platforms with Broadcom switch in synthetiic mode, switch responds with values even though the link is not fully ready. The config space restoration done by pcie port driver for SWUS/DS of dGPU is thus not effective as the switch is still doing internal enumeration. As a workaround, save state of SWUS/DS device in driver. Add additional check to see if link is active and restore the values during DPC error callbacks. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-08Merge tag 'drm-next-2025-08-08' of https://gitlab.freedesktop.org/drm/kernelLinus Torvalds
Pull drm fixes from Dave Airlie: "This is the fixes that built up in the merge window, mostly amdgpu and xe with one i915 display fix, seems like things are pretty good for rc1. i915: - DP LPFS fixes xe: - SRIOV: PF fixes and removal of need of module param - Fix driver unbind around Devcoredump - Mark xe driver as BROKEN if kernel page size is not 4kB amdgpu: - GC 9.5.0 fixes - SMU fix - DCE 6 DC fixes - mmhub client ID fixes - VRR fix - Backlight fix - UserQ fix - Legacy reset fix - Misc fixes amdkfd: - CRIU fix - Debugfs fix" * tag 'drm-next-2025-08-08' of https://gitlab.freedesktop.org/drm/kernel: (28 commits) drm/amdgpu: add missing vram lost check for LEGACY RESET drm/amdgpu/discovery: fix fw based ip discovery drm/amdkfd: Destroy KFD debugfs after destroy KFD wq amdgpu/amdgpu_discovery: increase timeout limit for IFWI init drm/amdgpu: Update SDMA firmware version check for user queue support drm/amdgpu: Add NULL check for asic_funcs drm/amd/display: Revert "drm/amd/display: Fix AMDGPU_MAX_BL_LEVEL value" drm/amd/display: fix a Null pointer dereference vulnerability drm/amd/display: Add primary plane to commits for correct VRR handling drm/amdgpu: update mmhub 3.3 client id mappings drm/amdgpu: update mmhub 3.0.1 client id mappings drm/amdgpu: Retain job->vm in amdgpu_job_prepare_job drm/amd/display: Fix DCE 6.0 and 6.4 PLL programming. drm/amd/display: Don't overwrite dce60_clk_mgr drm/amdkfd: Fix checkpoint-restore on multi-xcc drm/amd: Restore cached manual clock settings during resume drm/amd: Restore cached power limit during resume drm/amdgpu: Update external revid for GC v9.5.0 drm/amdgpu: Update supported modes for GC v9.5.0 Mark xe driver as BROKEN if kernel page size is not 4kB ...
2025-08-06drm/amdgpu: add missing vram lost check for LEGACY RESETAlex Deucher
Legacy resets reset the memory controllers so VRAM contents may be unreliable after reset. Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit aae94897b6661a2a4b1de2d328090fc388b3e0af) Cc: stable@vger.kernel.org
2025-08-06drm/amdgpu/discovery: fix fw based ip discoveryAlex Deucher
We only need the fw based discovery table for sysfs. No need to parse it. Additionally parsing some of the board specific tables may result in incorrect data on some boards. just load the binary and don't parse it on those boards. Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4441 Fixes: 80a0e8282933 ("drm/amdgpu/discovery: optionally use fw based ip discovery") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 62eedd150fa11aefc2d377fc746633fdb1baeb55) Cc: stable@vger.kernel.org
2025-08-06drm/amdgpu: add missing vram lost check for LEGACY RESETAlex Deucher
Legacy resets reset the memory controllers so VRAM contents may be unreliable after reset. Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-06drm/amdgpu/discovery: fix fw based ip discoveryAlex Deucher
We only need the fw based discovery table for sysfs. No need to parse it. Additionally parsing some of the board specific tables may result in incorrect data on some boards. just load the binary and don't parse it on those boards. Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4441 Fixes: 80a0e8282933 ("drm/amdgpu/discovery: optionally use fw based ip discovery") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-06drm/amdgpu: Add helpers to set/get unique idsLijo Lazar
Add a struct to store unique id information for each type. Add helper to fetch the unique id. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Asad Kamal <asad.kamal@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-06drm/amdgpu: Prevent hardware access in dpc stateLijo Lazar
Don't allow hardware access while in dpc state. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Ce Sun <cesun102@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>