- 13 Jul, 2021 1 commit
-
-
Luben Tuikov authored
In amdgpu_ras_query_error_count(), return an error if the device doesn't support RAS. The function then no longer has to set the values behind the integer pointers (when they are non-NULL) regardless of whether RAS is supported, which removes that side effect. Also, if no pointers are set, don't count, since there is no way to report the counts. Also, give this function a kernel-doc. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: John Clements <john.clements@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Reported-by:
Tom Rix <trix@redhat.com> Fixes: a46751fb ("drm/amdgpu: Fix RAS function interface") Signed-off-by:
Luben Tuikov <luben.tuikov@amd.com> Reviewed-by:
Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
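The contract described in this commit can be sketched in user-space C. This is only an illustration: `struct demo_dev`, `demo_query_error_count()` and the choice of `-EOPNOTSUPP` as the error code are assumptions made for the sketch, not the driver's actual code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the device; not the real struct amdgpu_device. */
struct demo_dev {
    bool ras_supported;   /* assumed flag: does this device support RAS? */
    unsigned long ce;     /* correctable errors the "hardware" would report */
    unsigned long ue;     /* uncorrectable errors the "hardware" would report */
};

/*
 * Returns 0 on success and a negative errno if RAS is unsupported.
 * Either output pointer may be NULL; when both are NULL no counting is done.
 * On error the caller's integers are left untouched.
 */
static int demo_query_error_count(const struct demo_dev *dev,
                                  unsigned long *ce_count,
                                  unsigned long *ue_count)
{
    assert(dev != NULL);

    if (!dev->ras_supported)
        return -EOPNOTSUPP;   /* assumed error code for this sketch */

    if (!ce_count && !ue_count)
        return 0;             /* nothing to report into, so skip the work */

    if (ce_count)
        *ce_count = dev->ce;
    if (ue_count)
        *ue_count = dev->ue;

    return 0;
}
```

A caller checks the return value first and reads the counts only on success, so no sentinel values need to be written in the unsupported case.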
-
- 18 Jun, 2021 2 commits
-
-
Stanley.Yang authored
Use SMU to update the bad pages rather than directly accessing the EEPROM from the driver. Signed-off-by:
Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by:
John Clements <john.clements@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Stanley.Yang authored
Signed-off-by:
Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 11 Jun, 2021 1 commit
-
-
Guchun Chen authored
Use adev_to_drm() to get to the drm_device pointer. Signed-off-by:
Guchun Chen <guchun.chen@amd.com> Reviewed-by:
Luben Tuikov <luben.tuikov@amd.com> Reviewed-by:
Alex Deucher <alexander.deucher@amd.com> Reviewed-by:
Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by:
Christian König <christian.koenig@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 27 May, 2021 2 commits
-
-
Luben Tuikov authored
On Context Query2 IOCTL return the correctable and uncorrectable errors in O(1) fashion, from cached values, and schedule a delayed work function to calculate and cache them for the next such IOCTL. v2: Cancel pending delayed work at ras_fini(). v3: Remove conditionals when dealing with delayed work manipulation as they're inherently racy. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Christian König <christian.koenig@amd.com> Cc: John Clements <john.clements@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Luben Tuikov <luben.tuikov@amd.com> Reviewed-by:
Alexander Deucher <Alexander.Deucher@amd.com> Reviewed-by:
Christian König <christian.koenig@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
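The caching scheme in this commit can be modelled roughly as follows. This is a simplified user-space sketch: a boolean flag stands in for scheduling kernel delayed work, and all names (`demo_ras_cache`, `demo_query2`, `demo_refresh_work`) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cache of the last computed error counts. */
struct demo_ras_cache {
    unsigned long ce_cached;
    unsigned long ue_cached;
    bool refresh_pending;   /* stands in for a scheduled delayed work item */
};

/* Fast path: O(1) copy of the cached values, then request a refresh. */
static void demo_query2(struct demo_ras_cache *c,
                        unsigned long *ce, unsigned long *ue)
{
    assert(c != NULL && ce != NULL && ue != NULL);
    *ce = c->ce_cached;
    *ue = c->ue_cached;
    c->refresh_pending = true;
}

/* Slow path, run later: store freshly counted values for the next query. */
static void demo_refresh_work(struct demo_ras_cache *c,
                              unsigned long hw_ce, unsigned long hw_ue)
{
    assert(c != NULL);
    c->ce_cached = hw_ce;
    c->ue_cached = hw_ue;
    c->refresh_pending = false;
}
```

The trade-off is that a query may return counts that are one refresh behind, in exchange for never doing the expensive count in the IOCTL path.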
-
Luben Tuikov authored
The correctable and uncorrectable errors are calculated at each invocation of this function. Therefore, it is highly inefficient to return just one of them based on a Boolean input. If the caller wants both, twice the work would be done. (And this work is O(n^3) on Vega20.) Fix this "interface" to simply return what it had calculated--both values. Let the caller choose what it wants to record, inspect, use. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: John Clements <john.clements@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Luben Tuikov <luben.tuikov@amd.com> Reviewed-by:
Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 20 May, 2021 2 commits
-
-
Andrey Grodzovsky authored
Some of the work in amdgpu_device_fini, such as disabling HW interrupts and finalizing pending fences, must be done right away on pci_remove, while most of the work related to finalizing and releasing driver data structures can be deferred until the drm_driver.release hook is called, i.e. when the last device reference is dropped. v4: Change function prefixes early->hw and late->sw Signed-off-by:
Andrey Grodzovsky <andrey.grodzovsky@amd.com> Acked-by:
Christian König <christian.koenig@amd.com> Reviewed-by:
Alex Deucher <alexander.deucher@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20210512142648.666476-3-andrey.grodzovsky@amd.com
-
John Clements authored
Only clear RAS error counters if persistent EDC harvesting is not supported Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
John Clements <john.clements@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 10 May, 2021 11 commits
-
-
Dwaipayan Ray authored
Fix a couple of syntax errors and remove one excess parameter in the function documentation, which led to kernel-doc build warnings. Reviewed-by:
Christian König <christian.koenig@amd.com> Signed-off-by:
Dwaipayan Ray <dwaipayanray1@gmail.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Dennis Li authored
For aldebaran, the hardware does not clear the error status automatically when the error status register is read; instead, the driver should explicitly set the clear bit of the error status register to clear the error status. Signed-off-by:
Dennis Li <Dennis.Li@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Dennis Li authored
The original code uses RAS status and kernel errno together in the same function, which is bad code style. Signed-off-by:
Dennis Li <Dennis.Li@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Oak Zeng authored
If RAS is disabled through the amdgpu_ras_enable kernel parameter, we should quit the RAS initialization earlier to avoid initializing some RAS data structures, such as sysfs. Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Oak Zeng <Oak.Zeng@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Luben Tuikov authored
Export the runtime-set "ras_hw_enabled" and "ras_enabled" to debugfs, for debugging. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: John Clements <john.clements@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Luben Tuikov <luben.tuikov@amd.com> Acked-by:
Christian König <christian.koenig@amd.com> Reviewed-by:
John Clements <John.Clements@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Luben Tuikov authored
Rename, ras_hw_supported --> ras_hw_enabled, and ras_features --> ras_enabled, to show that ras_enabled is a subset of ras_hw_enabled, which itself is a subset of the ASIC capability. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: John Clements <john.clements@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Luben Tuikov <luben.tuikov@amd.com> Acked-by:
Christian König <christian.koenig@amd.com> Reviewed-by:
John Clements <John.Clements@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Luben Tuikov authored
Move ras_hw_supported into struct amdgpu_dev. The dependency is: struct amdgpu_ras <== struct amdgpu_dev <== ASIC, read as "struct amdgpu_ras depends on struct amdgpu_dev, which depends on the hardware." This can be loosely understood as, "if RAS is supported, which is a property of the ASIC (struct amdgpu_dev), then we can access struct amdgpu_ras." v2: Fix a typo: must binary AND in ternary cond in amdgpu_ras.c Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: John Clements <john.clements@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Luben Tuikov <luben.tuikov@amd.com> Acked-by:
Christian König <christian.koenig@amd.com> Reviewed-by:
John Clements <John.Clements@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Luben Tuikov authored
Remove redundant ras->supported, as this value is also stored in adev->ras_features. Use adev->ras_features, as that supersedes "ras", since the latter is its member. The dependency goes like this: ras <== adev->ras_features <== hw_supported, and is read as "ras depends on ras_features, which depends on hw_supported." The arrows show the flow of information, i.e. the dependency update. "hw_supported" should also live in "adev". Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: John Clements <john.clements@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Luben Tuikov <luben.tuikov@amd.com> Acked-by:
Christian König <christian.koenig@amd.com> Reviewed-by:
John Clements <John.Clements@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Stanley.Yang authored
Signed-off-by:
Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by:
Feifei Xu <Feifei.Xu@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Hawking Zhang authored
gfx ras now can be enabled by default in aldebaran Signed-off-by:
Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by:
John Clements <John.Clements@amd.com> Reviewed-by:
Dennis Li <Dennis.Li@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Hawking Zhang authored
add hdp block ras error query and reset support in amdgpu ras error count query and reset interface Signed-off-by:
Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by:
John Clements <John.Clements@amd.com> Reviewed-by:
Dennis Li <Dennis.Li@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 29 Apr, 2021 1 commit
-
-
Hawking Zhang authored
Add socket/die information in RAS messages for platforms that support querying that information Signed-off-by:
Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by:
John Clements <John.Clements@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 23 Apr, 2021 2 commits
-
-
Hawking Zhang authored
aldebaran gfx ras is still under development Signed-off-by:
Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by:
Dennis Li <Dennis.Li@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Stanley.Yang authored
Signed-off-by:
Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by:
Feifei Xu <Feifei.Xu@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 21 Apr, 2021 2 commits
-
-
Mukul Joshi authored
Reset the RAS error count and error status registers after reading to prevent over-reporting error counts on Aldebaran. Signed-off-by:
Mukul Joshi <mukul.joshi@amd.com> Reviewed-By:
John Clements <John.Clements@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Dennis Li authored
Because "sscanf(str, "retire_page")" always returns 0, an application that uses the raw data for error injection always wrongly falls into "op == 3". Use strstr instead. Signed-off-by:
Dennis Li <Dennis.Li@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
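The reason `sscanf` cannot detect the keyword is that its return value counts assigned conversions, and a format with no conversion specifiers assigns none. A minimal user-space sketch of the difference follows; the helper names and sample command strings are hypothetical, not the driver's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Returns what sscanf reports for a format with no conversion specifiers. */
static int demo_sscanf_literal(const char *str)
{
    assert(str != NULL);
    /* No %-conversions, so for non-empty input this is 0 on match or mismatch. */
    return sscanf(str, "retire_page");
}

/* Keyword detection with strstr, the approach the commit switches to. */
static bool demo_is_retire_page(const char *str)
{
    assert(str != NULL);
    return strstr(str, "retire_page") != NULL;
}
```

With `strstr`, a matching command and a non-matching one give different results, which is what the branch on the operation type needs.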
-
- 15 Apr, 2021 5 commits
-
-
Luben Tuikov authored
Add back the double-sscanf so that both decimal and hexadecimal values can be read in, but this time invert the scan so that hexadecimal format with a leading 0x is tried first, and if that fails, then try decimal format. Also use a logical-AND instead of a nested double if-conditional. See commit "drm/amdgpu: Fix a bug for input with double sscanf" Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Luben Tuikov <luben.tuikov@amd.com> Reviewed-by:
John Clements <john.clements@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
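The scan order described here (hexadecimal with a leading 0x first, then decimal, joined by a logical AND) might look like the following in user-space C. `demo_parse_address()` is a hypothetical helper written for this illustration; the driver's actual format strings may differ.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Parse an address given either as hex with a leading 0x or as decimal.
 * Returns 0 on success, -EINVAL if neither format matches.
 */
static int demo_parse_address(const char *str, unsigned long long *out)
{
    assert(str != NULL && out != NULL);

    /* Try hex first; fall back to decimal only if the hex scan fails. */
    if (sscanf(str, "0x%llx", out) != 1 &&
        sscanf(str, "%llu", out) != 1)
        return -EINVAL;

    return 0;
}
```

Trying the 0x form first matters because a decimal scan would otherwise accept the leading "0" of "0x..." and stop there.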
-
Luben Tuikov authored
Improve the kernel-doc for the RAS sysfs interface. Fix the grammar, fix the context. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Luben Tuikov <luben.tuikov@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Luben Tuikov authored
Add bad_page_cnt_threshold to debugfs, an optional file system used for debugging, for reporting purposes only--it usually matches the size of EEPROM but may be different depending on the "bad_page_threshold" kernel module option. The "bad_page_cnt_threshold" is a dynamically computed value. It depends on three things: the VRAM size; the size of the EEPROM (or the size allocated to the RAS table therein); and the "bad_page_threshold" module parameter. It is a dynamically computed value, when the amdgpu module is run, on which further parameters and logic depend, and as such it is helpful to see the dynamically computed value in debugfs. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Luben Tuikov <luben.tuikov@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Luben Tuikov authored
Fix if (ret) --> if (!ret), a bug, for "retire_page", which caused the kernel to recall the method with *pos == end of file, and that bounced back with error. On the first run, we advanced *pos, but returned 0 back to fs layer, also a bug. Fix the logic of the check of the result of amdgpu_reserve_page_direct()--it is 0 on success, and non-zero on error, not the other way around. This patch fixes this bug. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: John Clements <john.clements@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Luben Tuikov <luben.tuikov@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Luben Tuikov authored
Remove double-sscanf to scan for %llu and 0x%llx, as that is not going to work! The %llu will consume the "0" in "0x" of your input, and the hex value you think you're entering will always be 0. That is, a valid hex value can never be consumed. On the other hand, just entering a hex number without leading 0x will either be scanned as a string and not match, for instance FAB123, or the leading decimal portion is scanned as the %llu, for instance 123FAB will be scanned as 123, which is not correct. Thus remove the first %llu scan and leave only the %llx scan, removing the leading 0x since %llx can scan either. Addresses are usually hex values, so this suffices. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Xinhui Pan <xinhui.pan@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Luben Tuikov <luben.tuikov@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
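The failure mode described in this commit can be reproduced in a few lines of user-space C. The two wrappers below are hypothetical and only expose the `sscanf` return value for each conversion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Scan with %llu; returns the number of conversions sscanf assigned. */
static int demo_scan_decimal(const char *str, unsigned long long *out)
{
    assert(str != NULL && out != NULL);
    return sscanf(str, "%llu", out);
}

/* Scan with %llx, which accepts hex digits with or without a 0x prefix. */
static int demo_scan_hex(const char *str, unsigned long long *out)
{
    assert(str != NULL && out != NULL);
    return sscanf(str, "%llx", out);
}
```

For the input "0x10", the decimal scan succeeds but stores 0 because it stops at the "x", while the hex scan stores 16. For "123FAB", the decimal scan stores 123, and for "FAB123" it matches nothing.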
-
- 09 Apr, 2021 11 commits
-
-
John Clements authored
added support in RAS debugfs to add bad page for isolated page retirement testing Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
John Clements <john.clements@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
John Clements authored
In event of RAS UE + warm reset, error counters shall be harvested and cleared on driver load Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
John Clements <john.clements@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Hawking Zhang authored
gfx ras is only available in certain ip generations. Signed-off-by:
Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by:
Dennis Li <Dennis.Li@amd.com> Reviewed-by:
John Clements <John.Clements@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Hawking Zhang authored
mmhub ras is only available in certain mmhub ip generations. Signed-off-by:
Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by:
Dennis Li <Dennis.Li@amd.com> Reviewed-by:
John Clements <John.Clements@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Hawking Zhang authored
umc ras is not managed by gpu driver when gpu is connected to cpu through xgmi. split umc callbacks into ras and non-ras ones so gpu driver only initializes umc ras callbacks when it manages umc ras. Signed-off-by:
Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by:
Dennis Li <Dennis.Li@amd.com> Reviewed-by:
John Clements <John.Clements@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Hawking Zhang authored
xgmi ras is not managed by gpu driver when gpu is connected to cpu through xgmi. move all xgmi ras functions to xgmi_ras_funcs so gpu driver only initializes xgmi ras functions when it manages xgmi ras. Signed-off-by:
Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by:
Dennis Li <Dennis.Li@amd.com> Reviewed-by:
John Clements <John.Clements@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Hawking Zhang authored
nbio ras is not managed by gpu driver when gpu is connected to cpu through xgmi. split nbio callbacks into ras and non-ras ones so gpu driver only initializes nbio ras callbacks when it manages nbio ras. Signed-off-by:
Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by:
Dennis Li <Dennis.Li@amd.com> Reviewed-by:
John Clements <John.Clements@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Hawking Zhang authored
The driver only manages GFX/SDMA/MMHUB RAS on platforms where the gpu node is connected to the cpu through XGMI; otherwise, it queries VBIOS for RAS capabilities. Signed-off-by:
Hawking Zhang <Hawking.Zhang@amd.com> Acked-by:
Alex Deucher <alexander.deucher@amd.com> Reviewed-by:
John Clements <John.Clements@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Bernard Zhao authored
Fix patch check warning:
WARNING: suspect code indent for conditional statements (8, 17)
+ if (obj && obj->use < 0) {
+ DRM_ERROR("RAS ERROR: Unbalance obj(%s) use\n", obj->head.name);
WARNING: braces {} are not necessary for single statement blocks
+ if (obj && obj->use < 0) {
+ DRM_ERROR("RAS ERROR: Unbalance obj(%s) use\n", obj->head.name);
+ }
Signed-off-by:
Bernard Zhao <bernard@vivo.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Stanley.Yang authored
Signed-off-by:
Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by:
Dennis Li <Dennis.Li@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Tian Tao authored
Fix the following coccicheck warning:
drivers/gpu//drm/amd/amdgpu/amdgpu_ras.c:434:9-17: WARNING: use scnprintf or sprintf
drivers/gpu//drm/amd/amdgpu/amdgpu_xgmi.c:220:8-16: WARNING: use scnprintf or sprintf
drivers/gpu//drm/amd/amdgpu/amdgpu_xgmi.c:249:8-16: WARNING: use scnprintf or sprintf
drivers/gpu//drm/amd/amdgpu/df_v3_6.c:208:8-16: WARNING: use scnprintf or sprintf
drivers/gpu//drm/amd/amdgpu/amdgpu_psp.c:2973:8-16: WARNING: use scnprintf or sprintf
drivers/gpu//drm/amd/amdgpu/amdgpu_vram_mgr.c:75:8-16: WARNING: use scnprintf or sprintf
drivers/gpu//drm/amd/amdgpu/amdgpu_vram_mgr.c:112:8-16: WARNING: use scnprintf or sprintf
drivers/gpu//drm/amd/amdgpu/amdgpu_vram_mgr.c:58:8-16: WARNING: use scnprintf or sprintf
drivers/gpu//drm/amd/amdgpu/amdgpu_vram_mgr.c:93:8-16: WARNING: use scnprintf or sprintf
drivers/gpu//drm/amd/amdgpu/amdgpu_vram_mgr.c:125:9-17: WARNING: use scnprintf or sprintf
drivers/gpu//drm/amd/amdgpu/amdgpu_gtt_mgr.c:52:8-16: WARNING: use scnprintf or sprintf
drivers/gpu//drm/amd/amdgpu/amdgpu_gtt_mgr.c:71:8-16: WARNING: use scnprintf or sprintf
drivers/gpu//drm/amd/amdgpu/amdgpu_device.c:140:8-16: WARNING: use scnprintf or sprintf
drivers/gpu//drm/amd/amdgpu/amdgpu_device.c:164:8-16: WARNING: use scnprintf or sprintf
drivers/gpu//drm/amd/amdgpu/amdgpu_device.c:186:8-16: WARNING: use scnprintf or sprintf
drivers/gpu//drm/amd/amdgpu/amdgpu_device.c:208:8-16: WARNING: use scnprintf or sprintf
drivers/gpu//drm/amd/amdgpu/amdgpu_atombios.c:1916:8-16: WARNING: use scnprintf or sprintf
Signed-off-by:
Tian Tao <tiantao6@hisilicon.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-