Re: Freeing Machine Learning with ROCm

From: Zacchaeus Scheffer
Subject: Re: Freeing Machine Learning with ROCm
Date: Tue, 26 Apr 2022 00:45:13 -0400

> Based on the fact that many ROCm packages exist in guix, and
> that I don't see people complain, it seems it must have worked in the
> past.
Indeed, I am using Guix’ darktable and rocm-opencl-runtime packages
for OpenCL-accelerated photo editing. But I’m also doing this on a
foreign distribution with a custom kernel (5.15) – not Guix System.

I tried kernel version 5.15 before I tried 5.4. Is there anything else special about your kernel version?

> > ROCk module is loaded
> > Unable to open /dev/kfd read-write: No such file or directory
> > <my username here> is member of video group
Which GPU are you using? Can you see it with `lspci` and does it have the
`amdgpu` driver attached? Is the firmware loaded (`dmesg | grep amdgpu`,
I’m guessing no, since you use linux-libre)?

I have an AMD Radeon Instinct MI60, one of the few officially supported GPUs. `lspci | grep -i amd` gives:

01:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Device 14a0
02:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Device 14a1
03:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 20

so it seems to be detected. `dmesg | grep amdgpu` gives:

[   12.446826] [drm] amdgpu kernel modesetting enabled.
[   12.485012] amdgpu 0000:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[   12.522503] amdgpu 0000:03:00.0: amdgpu: Fetched VBIOS from ROM BAR
[   12.522538] amdgpu: ATOM BIOS: 113-D1630600-107
[   12.523127] [drm:sdma_v4_0_early_init.cold [amdgpu]] *ERROR* sdma_v4_0: Failed to load firmware "/*(DEBLOBBED)*/"
[   12.523277] [drm:sdma_v4_0_early_init.cold [amdgpu]] *ERROR* Failed to load sdma firmware!
[   12.533887] amdgpu 0000:03:00.0: amdgpu: MEM ECC is active.
[   12.533889] amdgpu 0000:03:00.0: amdgpu: SRAM ECC is active.
[   12.533893] amdgpu 0000:03:00.0: amdgpu: RAS INFO: ras initialized successfully, hardware ability[7fff] ras_mask[7fff]
[   12.533902] amdgpu 0000:03:00.0: amdgpu: VRAM: 32752M 0x0000008000000000 - 0x00000087FEFFFFFF (32752M used)
[   12.533904] amdgpu 0000:03:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[   12.533905] amdgpu 0000:03:00.0: amdgpu: AGP: 267878400M 0x0000008800000000 - 0x0000FFFFFFFFFFFF
[   12.557543] [drm] amdgpu: 32752M of VRAM memory ready
[   12.557549] [drm] amdgpu: 24018M of GTT memory ready.
[   12.557775] amdgpu 0000:03:00.0: amdgpu: failed to init sos firmware
[   12.557777] [drm:psp_sw_init [amdgpu]] *ERROR* Failed to load psp firmware!
[   12.557916] [drm:amdgpu_device_init.cold [amdgpu]] *ERROR* sw_init of IP block <psp> failed -2
[   12.558042] amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_init failed
[   12.558044] amdgpu 0000:03:00.0: amdgpu: Fatal error during GPU init
[   12.558047] amdgpu 0000:03:00.0: amdgpu: amdgpu: finishing device.
[   12.602981] amdgpu: probe of 0000:03:00.0 failed with error -2
[   12.603026] [drm] amdgpu: ttm finalized
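
So the card is detected and its VBIOS is read, but initialization fails as soon as the driver needs firmware: the sdma, sos, and psp loads all fail, and the "/*(DEBLOBBED)*/" path suggests that linux-libre has stripped the firmware requests. That "Fatal error during GPU init" is pretty discouraging, and since the probe fails outright it would also explain why /dev/kfd never appears.

With the way AMD promoted ROCm as being so open, I was really under the impression that I would be able to make this work on Guix, albeit with some work on my end, but you sound skeptical. Do you think it is possible?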
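
For anyone comparing against their own machine, here is a minimal sketch of the two checks discussed above: filtering the kernel log down to the firmware failures, and testing whether the compute device node exists. The function names `amdgpu_fw_errors` and `kfd_present` are made up for illustration and are not part of ROCm or Guix; the pattern is only tuned to the messages in the log above and may miss other wordings.

```shell
#!/bin/sh
# Hypothetical helpers, not part of ROCm or Guix.

# Read kernel log text on stdin and print only the amdgpu lines that
# report a firmware load/init failure.
amdgpu_fw_errors() {
    grep -i 'amdgpu' | grep -i -E 'failed to (load|init) .*firmware'
}

# Succeed if the given path (default /dev/kfd) is a character device.
kfd_present() {
    [ -c "${1:-/dev/kfd}" ]
}
```

Usage would be something like `dmesg | amdgpu_fw_errors` (reading the kernel log may require root on some systems) followed by `kfd_present || echo "no /dev/kfd"`.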

Thanks for your kind response,