Basically all computer vision and/or machine learning research is done on GPUs in PyTorch and/or TensorFlow. In principle this should be possible with the ROCm drivers on a supported AMD GPU, but I'm having trouble using my GPU with ROCm on Guix. The problem seems to be specific to the current Guix version, since I was able to use the same GPU fine on a different OS. Given that many ROCm packages exist in Guix, and that I don't see people complaining, I assume it must have worked in the past. While I am interested in helping fix this in the current Guix version (discussed more below), I also think it is important that people be able to use GPUs now. This brings me to my first question:
Has anyone been able to run a ROCm-compatible GPU on a Guix system using the ROCm drivers? If so, could you share the files needed to reproduce it (channels.scm with a working Guix commit, system.scm, home.scm, manifest.scm, etc.)? Also, if you were able to get PyTorch/TensorFlow to play well with the GPU, info on that would be nice too.
So far, I have tried putting the results of "guix search rocm" (minus procmail) into a manifest (included below) and running rocminfo (roughly AMD's equivalent of nvidia-smi). This gives me:
> ROCk module is loaded
> Unable to open /dev/kfd read-write: No such file or directory
> <my username here> is member of video group
Maybe there is a missing magic udev rule? I found a thread somewhere (I can't find it now) that suggested rolling back the kernel version. Cross-checking with the ROCm 4.3 install documentation (because the ROCm version in the Guix repo is 4.3), I saw that the supported Ubuntu version had kernel version 5.4.*, so I tried downgrading my kernel by adding:
(kernel (specification->package "firstname.lastname@example.org"))
to my system.scm, reconfiguring, and rebooting. I also tried adding all the ROCm packages to my system.scm in the same way. In every instance, I ran rocminfo both as a regular user (with the groups indicated by the ROCm documentation) and as root. In all cases, I get the error printed above. This would seem like a good question for upstream ROCm, but they officially support only a few OSes, so here I am.
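To make the checks above easier to reproduce, here is a minimal sketch in Python that collects them in one place: whether amdgpu is listed in /proc/modules, whether /dev/kfd and the /dev/dri/renderD* nodes exist, and which GPU-related groups the current user is in. The function names are my own, and the "render" group is an assumption based on newer ROCm documentation. It only reads system state and changes nothing.

```python
import grp
import os
from pathlib import Path


def module_loaded(name, proc_modules="/proc/modules"):
    """Return True if the kernel module `name` is listed in /proc/modules."""
    try:
        with open(proc_modules) as fh:
            return any(line.split()[0] == name for line in fh if line.strip())
    except OSError:
        return False


def device_status(path):
    """Describe whether a device node exists and is read-write accessible."""
    if not os.path.exists(path):
        return "missing"
    if os.access(path, os.R_OK | os.W_OK):
        return "read-write"
    return "present, no read-write access"


def user_groups():
    """Names of the groups the current process belongs to."""
    names = set()
    for gid in os.getgroups():
        try:
            names.add(grp.getgrgid(gid).gr_name)
        except KeyError:
            # Group id without a name entry; keep the number instead.
            names.add(str(gid))
    return names


def report(dev_root="/dev", proc_modules="/proc/modules"):
    """Collect the kernel-module, device-node and group checks in one dict."""
    render_nodes = sorted(str(p) for p in Path(dev_root).glob("dri/renderD*"))
    return {
        "amdgpu_loaded": module_loaded("amdgpu", proc_modules),
        "kfd": device_status(os.path.join(dev_root, "kfd")),
        "render_nodes": render_nodes,
        # "render" is an assumption; the ROCm 4.3 docs I followed mention "video".
        "groups": sorted(user_groups() & {"video", "render"}),
    }


if __name__ == "__main__":
    for key, value in report().items():
        print(f"{key}: {value}")
```

If this reports amdgpu as loaded but /dev/kfd as "missing" (rather than present without access), that would suggest the node is never created in the first place, which points at the kernel side rather than at group membership or permissions.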
In retrospect, could it maybe be that I can use the card without probing it with rocminfo? It would certainly be nice to be able to check the temperature (especially so I don't have to leave the fan on full blast) among other things, but maybe that isn't strictly necessary for doing machine learning on it?
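On the temperature question specifically: the amdgpu kernel driver exposes a hwmon interface in sysfs, so the temperature should be readable without any ROCm tooling as long as the driver itself has bound to the card. A minimal sketch, assuming the sensor shows up under /sys/class/hwmon with the name "amdgpu" and a temp1_input file in millidegrees Celsius (the function name is my own):

```python
from pathlib import Path


def amdgpu_temperatures(hwmon_root="/sys/class/hwmon"):
    """Return {hwmon_name: degrees_celsius} for every amdgpu hwmon device."""
    readings = {}
    for hwmon in sorted(Path(hwmon_root).glob("hwmon*")):
        try:
            # Each hwmon directory carries a "name" file naming its driver.
            if (hwmon / "name").read_text().strip() != "amdgpu":
                continue
            # temp1_input is reported in millidegrees Celsius.
            raw = (hwmon / "temp1_input").read_text().strip()
            readings[hwmon.name] = int(raw) / 1000.0
        except (OSError, ValueError):
            # Missing or unreadable sensor file; skip this device.
            continue
    return readings


if __name__ == "__main__":
    temps = amdgpu_temperatures()
    if not temps:
        print("no amdgpu hwmon device found")
    for name, celsius in temps.items():
        print(f"{name}: {celsius:.1f} C")
```

This does not touch /dev/kfd at all, so it says nothing about whether compute works; it only covers the monitoring side.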
Any suggestions for how to get closer to running GPU-accelerated (ROCm) PyTorch/TensorFlow on Guix are appreciated.
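For anyone who wants to compare setups, this is the kind of check I would run once a ROCm build of PyTorch is installed. It is only a sketch: it relies on ROCm builds of PyTorch reusing the torch.cuda namespace and setting torch.version.hip, and the function name is my own. As far as I understand, the ROCm runtime opens /dev/kfd itself, so I would expect this to report no GPU for as long as that node is missing.

```python
def torch_gpu_summary():
    """Summarise what an installed PyTorch build reports about the GPU."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}

    summary = {
        "torch_installed": True,
        # A version string on ROCm builds, None on CPU-only or CUDA builds.
        "hip_version": getattr(torch.version, "hip", None),
        # ROCm builds expose AMD GPUs through the torch.cuda API.
        "gpu_available": torch.cuda.is_available(),
    }
    if summary["gpu_available"]:
        summary["device_name"] = torch.cuda.get_device_name(0)
    return summary


if __name__ == "__main__":
    for key, value in torch_gpu_summary().items():
        print(f"{key}: {value}")
```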
The archive for email@example.com is pretty sparse. Should I be posting this to bug-guix instead?
Contents of the manifest.scm mentioned above: