summaryrefslogtreecommitdiff
path: root/drivers/gpu/drm/amd/amdgpu/mes_userqueue.c
AgeCommit message (Collapse)Author
2025-11-11drm/amdgpu: resume MES scheduling after user queue hang detection and recoveryJesse.Zhang
This patch ensures the Micro-Engine Scheduler (MES) is properly resumed after detecting and recovering from a user queue hang condition. Key changes: 1. Track when a hung user queue is detected using found_hung_queue flag 2. Call amdgpu_mes_resume() to restart MES scheduling after completing the hang recovery process 3. This complements the existing recovery steps (fence force completion and device wedging) by ensuring the scheduler can process new work Without this resume call, the MES scheduler may remain in a paused state even after the hung queue has been handled, preventing newly submitted work from being processed and leading to system stalls. Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-28drm/amdgpu/userq: fix SDMA and compute validationAlex Deucher
The CSA and EOP buffers have different alignement requirements. Hardcode them for now as a bug fix. A proper query will be added in a subsequent patch. v2: verify gfx shadow helper callback (Prike) Fixes: 9e46b8bb0539 ("drm/amdgpu: validate userq buffer virtual address and size") Reviewed-by: Prike Liang <Prike.Liang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-28drm/amdgpu: Convert amdgpu userqueue management from IDR to XArrayJesse.Zhang
This commit refactors the AMDGPU userqueue management subsystem to replace IDR (ID Allocation) with XArray for improved performance, scalability, and maintainability. The changes address several issues with the previous IDR implementation and provide better locking semantics. Key changes: 1. **Global XArray Introduction**: - Added `userq_doorbell_xa` to `struct amdgpu_device` for global queue tracking - Uses doorbell_index as key for efficient global lookup - Replaces the previous `userq_mgr_list` linked list approach 2. **Per-process XArray Conversion**: - Replaced `userq_idr` with `userq_mgr_xa` in `struct amdgpu_userq_mgr` - Maintains per-process queue tracking with queue_id as key - Uses XA_FLAGS_ALLOC for automatic ID allocation 3. **Locking Improvements**: - Removed global `userq_mutex` from `struct amdgpu_device` - Replaced with fine-grained XArray locking using XArray's internal spinlocks 4. **Runtime Idle Check Optimization**: - Updated `amdgpu_runtime_idle_check_userq()` to use xa_empty 5. **Queue Management Functions**: - Converted all IDR operations to equivalent XArray functions: - `idr_alloc()` → `xa_alloc()` - `idr_find()` → `xa_load()` - `idr_remove()` → `xa_erase()` - `idr_for_each()` → `xa_for_each()` Benefits: - **Performance**: XArray provides better scalability for large numbers of queues - **Memory Efficiency**: Reduced memory overhead compared to IDR - **Thread Safety**: Improved locking semantics with XArray's internal spinlocks v2: rename userq_global_xa/userq_xa to userq_doorbell_xa/userq_mgr_xa Remove xa_lock and use its own lock. v3: Set queue->userq_mgr = uq_mgr in amdgpu_userq_create() v4: use xa_store_irq (Christian) hold the read side of the reset lock while creating/destroying queues and the manager data structure. (Chritian) Acked-by: Alex Deucher <alexander.deucher@amd.com> Suggested-by: Christian König <christian.koenig@amd.com> Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13drm/amdgpu: add userq object va track helpersPrike Liang
Add the userq object virtual address list_add() helpers for tracking the userq obj va address usage. Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13drm/amdgpu: fix hung reset queue array memory allocationJonathan Kim
By design the MES will return an array result that is twice the number of hung doorbells it can report. i.e. if up k reported doorbells are supported, then the second half of the array, also of length k, holds the HQD information (type/queue/pipe) where queue 1 corresponds to index 0 and k, queue 2 corresponds to index 1 and k + 1 etc ... The driver will use the HDQ info to target queue/pipe reset for hardware scheduled user compute queues. Signed-off-by: Jonathan Kim <jonathan.kim@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-15drm/amdgpu: adjust MES API used for suspend and resumeJesse.Zhang
Use the suspend and resume API rather than remove queue and add queue API. The former just preempts the queue while the latter remove it from the scheduler completely. There is no need to do that, we only need preemption in this case. V2: replace queue_active with queue state v3: set the suspend_fence_addr v4: allocate another per queue buffer for the suspend fence, and set the sequence number. also wait for the suspend fence. (Alex) v5: use a wb slot (Alex) v6: Change the timeout period. For MES, the default timeout is 2100000; /* 2100 ms */ (Alex) Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-15drm/amdgpu: validate userq buffer virtual address and sizePrike Liang
It needs to validate the userq object virtual address to determine whether it is residented in a valid vm mapping. Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-09drm/amdgpu: validate userq input argsPrike Liang
This will help on validating the userq input args, and rejecting for the invalid userq request at the IOCTLs first place. Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-05drm/amdgpu/userq: add a detect and reset callbackJesse.Zhang
Add a detect and reset callback and add the implementation for mes. The callback will detect all hung queues of a particular ip type (e.g., GFX or compute or SDMA) and reset them. v2: increase reset counter and set fence force completion v3: Removed userq_mutex in mes_userq_detect_and_reset since the driver holds it when calling Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-22drm/amdgpu/userq: use consistent function namingAlex Deucher
s/userqueue/userq/ 1. remove the mix of amdgpu_userqueue and amdgpu_userq 2. to be consistent with other amdgpu_userq_fence.c 3. it's shorter Reviewed-by: Prike Liang <Prike.Liang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-22drm/amdgpu: switch from queue_active to queue stateAlex Deucher
Track the state of the queue rather than simple active vs not. This is needed for other states (hung, preempted, etc.). While we are at it, move the state tracking into the user queue front end code. Reviewed-by: Prike Liang <Prike.Liang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-22drm/amdgpu/userq/mes: pass the secure flag to mqd initAlex Deucher
So that we initialize the MQD as a secure queue. Reviewed-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Jesse.Zhang <Jesse.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-21drm/amdgpu/userq/mes: handle user queue priorityAlex Deucher
Handle the queue priority set by the user. Reviewed-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Jesse.Zhang <Jesse.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-21drm/amdgpu/userq: move runpm handling into core userq codeAlex Deucher
Pull it out of the MES code and into the generic code. It's not MES specific and needs to be applied to all user queues regardless of the backend. Reviewed-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Shaoyun.liu <Shaoyun.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-21drm/amdgpu/userq: rework front end call sequenceAlex Deucher
Split out the queue map from the mqd create call and split out the queue unmap from the mqd destroy call. This splits the queue setup and teardown with the actual enablement in the firmware. Reviewed-by: Sunil Khatri <sunil.khatri@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-21drm/amdgpu/userq: rename suspend/resume callbacksAlex Deucher
Rename to map and umap to better align with what is happening at the firmware level and remove the extra level of indirection in the MES userq code. Reviewed-by: Sunil Khatri <sunil.khatri@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-21drm/amdgpu/userq/mes: remove unused headerAlex Deucher
This is unused so remove it. Reviewed-by: Sunil Khatri <sunil.khatri@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-11drm/amdgpu: remove the duplicated mes queue active state settingPrike Liang
The MES queue deactivation and active status are already set in mes_userq_unmap|map(), so the caller needn't set the queue_active bit again. Signed-off-by: Prike Liang <Prike.Liang@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08drm/amdgpu/userq: handle runtime pmAlex Deucher
Take a reference when we create a queue and drop it when we destroy the queue. We need to keep the device active while user queues are active. v2: squash in fix from Sunil v3: squash in fix from Prike Reviewed-by: Prike Liang <Prike.Liang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08drm/amdgpu: Modify the MES process va end limitChristian König
Modify the MES process va end limit to max pfn. Signed-off-by: Christian König <christian.koenig@amd.com> Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08drm/amdgpu: enable userqueue secure sem for GFX 12Arunpravin Paneer Selvam
- Add a field in struct amdgpu_mqd_prop for userqueue secure sem fence address since now we have a generic file for mes_userqueue.c - Add secure sem fence address mqd support to gfx12 into their corresponding init functions. - Enable secure semaphore IRQ handling V2: Address review comment from Alex: Use fence_address instead of fenceaddress (Shashank) Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Christian Koenig <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com> Signed-off-by: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com> Signed-off-by: Shashank Sharma <shashank.sharma@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08drm/amdgpu/uq: make MES UQ setup genericAlex Deucher
Now that all of the IP specific code has been moved into the IP specific functions, we can make this code generic. V2: Fixed build errors and porting logics (Shashank) Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Shashank Sharma <shashank.sharma@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>