author Lijo Lazar <lijo.lazar@amd.com> 2024-10-24 11:01:57 +0530
committer Alex Deucher <alexander.deucher@amd.com> 2024-12-10 10:26:46 -0500
commit e1ee2111ca48169a9fdc5075f7863f5d4d591e2f
tree 487517237aa6a8c5587ef88e717accf3a72878b2 /drivers/gpu/drm/amd/amdgpu/aldebaran.c
parent 0eecff79e49f8ce5475e1b4d968f26263587be66
drm/amdgpu: Prefer RAS recovery for scheduler hang
Before scheduling a recovery due to a scheduler/job hang, check whether a RAS error has been detected. If so, choose RAS recovery to handle the situation. A scheduler/job hang could be the side effect of a RAS error. In such cases, it is required to go through the RAS error recovery process. In certain cases, the RAS error recovery process can also avoid a full device reset.

An error state is maintained in the RAS context to detect the affected block. The fatal error state uses an unused block id. Set the block id when an error is detected. If the interrupt handler detected a poison error, it is not required to look for a fatal error. Skip fatal error checking in such cases.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
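The decision described in the commit message can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not the driver code: the block ids, the bitmask layout, and every name below (`ras_ctx`, `ras_set_err_state`, `pick_recovery_on_hang`, and so on) are hypothetical stand-ins. It shows one bit per block, the first unused block id reused as the fatal error marker, and the hang handler preferring RAS recovery whenever any error bit is set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical block ids for illustration; not the driver's real enum. */
enum ras_block {
    RAS_BLOCK_GFX = 0,
    RAS_BLOCK_SDMA = 1,
    RAS_BLOCK_LAST = 2, /* first unused id, reused here to mark a fatal error */
};

#define RAS_ERR_BIT(id) (UINT32_C(1) << (id))
#define RAS_ERR_FATAL   RAS_ERR_BIT(RAS_BLOCK_LAST)

/* Assumed layout: one bit per affected block, kept in the RAS context. */
struct ras_ctx {
    uint32_t err_state;
};

enum recovery_path { RECOVER_SCHED_HANG, RECOVER_RAS };

/* Record that an error was detected on the given block (or the fatal id). */
static void ras_set_err_state(struct ras_ctx *ctx, unsigned int id)
{
    assert(id <= RAS_BLOCK_LAST);
    ctx->err_state |= RAS_ERR_BIT(id);
}

/* Drop all recorded errors, e.g. once a reset has succeeded. */
static void ras_clear_err_state(struct ras_ctx *ctx)
{
    ctx->err_state = 0;
}

static bool ras_err_pending(const struct ras_ctx *ctx)
{
    return ctx->err_state != 0;
}

/* On a scheduler/job hang, prefer RAS recovery if any error is recorded. */
static enum recovery_path pick_recovery_on_hang(const struct ras_ctx *ctx)
{
    return ras_err_pending(ctx) ? RECOVER_RAS : RECOVER_SCHED_HANG;
}
```

In this model, clearing the whole state after a successful reset mirrors the coarse clear in the hunk for this file, whose TBD comment notes that ideally only the GFX and SDMA blocks would be cleared.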
Diffstat (limited to 'drivers/gpu/drm/amd/amdgpu/aldebaran.c')
-rw-r--r-- drivers/gpu/drm/amd/amdgpu/aldebaran.c 2
1 files changed, 2 insertions, 0 deletions
diff --git a/drivers/gpu/drm/amd/amdgpu/aldebaran.c b/drivers/gpu/drm/amd/amdgpu/aldebaran.c
index f44de9d4b6a1..e13fbd974141 100644
--- a/drivers/gpu/drm/amd/amdgpu/aldebaran.c
+++ b/drivers/gpu/drm/amd/amdgpu/aldebaran.c
@@ -334,6 +334,8 @@ aldebaran_mode2_restore_hwcontext(struct amdgpu_reset_control *reset_ctl,
 					AMDGPU_INIT_LEVEL_RESET_RECOVERY);
 		dev_info(tmp_adev->dev,
 			 "GPU reset succeeded, trying to resume\n");
+		/*TBD: Ideally should clear only GFX, SDMA blocks*/
+		amdgpu_ras_clear_err_state(tmp_adev);
 		r = aldebaran_mode2_restore_ip(tmp_adev);
 		if (r)
 			goto end;