Update NVIDIA driver symlink script#158

Draft
casparvl wants to merge 3 commits into EESSI:main from casparvl:link_nvidia_drivers

casparvl commented Feb 4, 2026

We'll need the following variant symlinks to be in place before this script can work as intended:

ln -s '$(EESSI_202506_NVIDIA_OVERRIDE:-/cvmfs/software.eessi.io/defaults/nvidia)' /cvmfs/software.eessi.io/versions/2025.06/compat/linux/x86_64/lib/nvidia
ln -s '$(EESSI_202506_NVIDIA_OVERRIDE:-/cvmfs/software.eessi.io/defaults/nvidia)' /cvmfs/software.eessi.io/versions/2025.06/compat/linux/aarch64/lib/nvidia
ln -s '$(EESSI_202506_NVIDIA_OVERRIDE:-/cvmfs/software.eessi.io/defaults/nvidia)' /cvmfs/software.eessi.io/versions/2025.06/compat/linux/riscv64/lib/nvidia

And then:

ln -s '$(EESSI_NVIDIA_OVERRIDE_DEFAULT:-/dev/null)' /cvmfs/software.eessi.io/defaults/nvidia
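
Together, the two levels of links form a fallback chain: the version-specific variable wins if set, otherwise the default variable is used, otherwise the chain ends at /dev/null. A minimal sketch of that lookup order in plain bash follows. The function name `resolve_nvidia_target` is hypothetical, and the sketch only emulates the order with shell parameter expansion; the real expansion is done by the CVMFS client from its own configuration, not from the shell environment.

```shell
#!/usr/bin/env bash
# Hypothetical helper that mirrors the lookup order of the two variant symlinks.
# It does not touch CVMFS; it only reproduces the ':-' fallback logic.
resolve_nvidia_target() {
    # Second-level link: /cvmfs/software.eessi.io/defaults/nvidia
    local default_target="${EESSI_NVIDIA_OVERRIDE_DEFAULT:-/dev/null}"
    # First-level link: .../versions/2025.06/compat/linux/<arch>/lib/nvidia
    printf '%s\n' "${EESSI_202506_NVIDIA_OVERRIDE:-$default_target}"
}
```

With neither variable set, `resolve_nvidia_target` prints `/dev/null`; with only `EESSI_NVIDIA_OVERRIDE_DEFAULT=/opt/eessi/nvidia` set, it prints `/opt/eessi/nvidia`.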

This can then be quite easily tested from within the container:

./eessi_container.sh -a rw -r software.eessi.io -b $<host-software-layer-scripts>:/software-layer-scripts --nvidia all
cd /software-layer-scripts/scripts/gpu_support/nvidia
./link_nvidia_host_libraries.sh

This should error out stating that the variant symlink resolves to /dev/null. Then, you can edit /etc/cvmfs/default.local to set, for example, EESSI_NVIDIA_OVERRIDE_DEFAULT to /opt/eessi/nvidia and run the linking script again; this should then install the symlinks.
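
For repeated testing it can help to set the variable without piling up duplicate lines in the config file. The following is a rough sketch only: the helper name `set_cvmfs_variable` is made up for illustration, it assumes GNU `sed` and a value without `|` or `&` characters, and it does not reload the CVMFS client configuration, which may still be needed before the change takes effect.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write NAME=VALUE to a CVMFS client config file,
# replacing an existing assignment instead of appending a duplicate.
# Assumes GNU sed (-i) and a VALUE that contains no '|' or '&'.
set_cvmfs_variable() {
    local name="$1" value="$2" conf="${3:-/etc/cvmfs/default.local}"
    touch "$conf"
    if grep -q "^${name}=" "$conf"; then
        # Use '|' as the sed delimiter, since values are usually paths with '/'
        sed -i "s|^${name}=.*|${name}=${value}|" "$conf"
    else
        printf '%s=%s\n' "$name" "$value" >> "$conf"
    fi
}
```

For example, `set_cvmfs_variable EESSI_NVIDIA_OVERRIDE_DEFAULT /opt/eessi/nvidia` (run with sufficient permissions) would leave a single `EESSI_NVIDIA_OVERRIDE_DEFAULT=/opt/eessi/nvidia` line in /etc/cvmfs/default.local.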

casparvl commented Feb 4, 2026

Although we don't have the symlinks yet, I can actually already test this in the container - it will just create the symlinks in /cvmfs/software.eessi.io/versions/2025.06/compat/linux/x86_64/lib/nvidia/ in the writeable overlay. That's fine.

What I did:

$ cd /software-layer-scripts/scripts/gpu_support/nvidia/
$ umask 0022
$  source /cvmfs/software.eessi.io/versions/2025.06/init/lmod/bash
# For some reason this failed to load the module - some module cache issue?
$ module load EESSI/2025.06
$ cat > dummy.c <<'EOF'
int main(void) { return 0; }
EOF
$ gcc -Wall -Wl,--no-as-needed -lcuda dummy.c -o dummy -L /cvmfs/software.eessi.io/versions/2025.06/compat/linux/x86_64/lib/nvidia/
# singularity has /.singularity.d/libs with the CUDA drivers in the LD_LIBRARY_PATH, but those are not the ones we want to find...
$ unset LD_LIBRARY_PATH
$ ldd dummy
        linux-vdso.so.1 (0x00007ffc59bb4000)
        libcuda.so.1 => /cvmfs/software.eessi.io/versions/2025.06/compat/linux/x86_64/lib/nvidia/libcuda.so.1 (0x000014f19b377000)
...

Works as intended. After implementing the variant symlinks, we should retest, try to use the EESSI_NVIDIA_OVERRIDE_DEFAULT symlink, and, once that works, try again using the EESSI_202506_NVIDIA_OVERRIDE variant symlink.

bedroge commented Feb 17, 2026

Tested in the container using EESSI 2025.06 and without having configured the variant symlinks:

ERROR: /cvmfs/software.eessi.io/versions/2025.06/compat/linux/x86_64/lib/nvidia is a symlink pointing to /cvmfs/software.eessi.io/defaults/nvidia, which is a symlink pointing to /dev/null
If you want to symlink the drivers in a single location for all EESSI versions, please define the EESSI_NVIDIA_OVERRIDE_DEFAULT variant symlink in your local CVMFS configuration to point to writeable location. This will change the target of symlink /cvmfs/software.eessi.io/defaults/nvidia.
If you want to symlink the drivers only for this version of EESSI (2025.06), please define the EESSI__NVIDIA_OVERRIDE variant symlink in your local CVMFS configuration to point to writeable location. This will change the target of symlink /cvmfs/software.eessi.io/versions/2025.06/compat/linux/x86_64/lib/nvidia.

With the variant symlink reconfigured as EESSI_NVIDIA_OVERRIDE_DEFAULT=/opt/eessi/nvidia:

Ensure the final target of /cvmfs/software.eessi.io/versions/2025.06/compat/linux/x86_64/lib/nvidia (/opt/eessi/nvidia) exists
Host NVIDIA GPU drivers linked successfully for EESSI

Wiping that dir and doing it again using EESSI_202506_NVIDIA_OVERRIDE=/opt/eessi/nvidia yields the same result.

Also checked the symlinks, and they pointed to the expected locations.
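
A check of this kind could be scripted roughly as follows. This is a sketch, not the logic of link_nvidia_host_libraries.sh: the function name `check_link_chain` and the hop limit of 40 are assumptions made for illustration. It prints every hop of a symlink chain and fails if the chain ends at /dev/null.

```shell
#!/usr/bin/env bash
# Hypothetical helper: follow a symlink chain hop by hop, print each hop,
# and return non-zero if the final target is /dev/null.
check_link_chain() {
    local path="$1" target hops=0
    while [ -L "$path" ]; do
        hops=$((hops + 1))
        # Guard against cyclic links (the limit of 40 is an arbitrary choice)
        if [ "$hops" -gt 40 ]; then
            echo "ERROR: too many levels of symbolic links" >&2
            return 1
        fi
        target="$(readlink "$path")"
        echo "$path -> $target"
        # Relative targets are resolved against the directory of the link
        case "$target" in
            /*) path="$target" ;;
            *)  path="$(dirname "$path")/$target" ;;
        esac
    done
    if [ "$path" = "/dev/null" ]; then
        echo "ERROR: symlink chain ends at /dev/null" >&2
        return 1
    fi
    echo "final target: $path"
}
```

It could be called as `check_link_chain /cvmfs/software.eessi.io/versions/2025.06/compat/linux/x86_64/lib/nvidia`, both before and after configuring one of the override variables.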

# Do some checks on the existence of links and that we don't end up at /dev/null (the default), so we can print an informative message
# One downside is that we can't explicitly check if something is a variant symlink, so we'll just assume that if it's a link AND it
# lives in our CVMFS repository, it must be a variant symlink
nvidia_trusted_dir="${EESSI_EPREFIX}/lib/nvidia"
Contributor

Does this mean that the script will no longer work for 2023.06?

Contributor Author

Hm, yeah, that's annoying: this script is in an unversioned prefix. I mean, if we deploy this only for 2025.06, we keep the old version for 2023.06. But then if we want to update that, we have to revert all changes, etc. Maybe we should just duplicate the script? I.e. create something like scripts/gpu_support/nvidia/2023.06/link_nvidia_host_libraries.sh? Or should it be at a higher level, 2023.06/scripts/gpu_support...?

casparvl and others added 2 commits February 17, 2026 15:09
Co-authored-by: Bob Dröge <b.e.droge@rug.nl>
Co-authored-by: Bob Dröge <b.e.droge@rug.nl>