Compiling multi-GPU MPI-CUDA code on Casper

Updated 12/23/2020: The first paragraph was revised to more specifically identify the compiler commands to use and to recommend using nvfortran for CUDA Fortran code rather than a PGI compiler mentioned previously.

Follow the example below to build and run a multi-GPU, MPI/CUDA application on the Casper cluster. The example uses the NVIDIA nvcc compiler to compile CUDA C code. (If your code is CUDA Fortran, use the nvfortran compiler from the nvhpc module instead.)
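
For CUDA Fortran, the analogous compile step would look something like the following sketch (the file name is hypothetical, not one of the sample files):

module load nvhpc
nvfortran -c hello.cuf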

Any libraries you build to support an application should be built with the same compiler, compiler version, and compatible flags used to compile the other parts of the application, including the main executable(s). Also, when you run the application, be sure you have loaded the same module/version environment in which you built it. This avoids job failures that can result from missing MPI launchers and library routines.
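
One way to keep the build and run environments consistent is to save the module set you compiled with and restore it before running, for example with a module collection (the collection name below is arbitrary):

module save mpi_cuda_hello
module restore mpi_cuda_hello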

Log in to either Casper or Cheyenne, then copy the sample files from the following directory to your own GLADE file space:

/glade/u/home/csgteam/Examples/mpi_cuda_hello
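
For example, to copy the directory into your home directory and work from there:

cp -r /glade/u/home/csgteam/Examples/mpi_cuda_hello ~/
cd ~/mpi_cuda_hello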

Run execdav to start an interactive job on a GPU-accelerated Casper node. Request 5 cores for this example. (Your login shell uses 1 core, or "slot," from this request; the remaining 4 are used when you launch the executable with srun below.)

execdav --constraint=gp100 --ntasks=5

Load the CUDA module when your job starts.

module load cuda
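
Optionally, confirm that a GPU is visible from your interactive session before you compile:

nvidia-smi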

Use the NVIDIA compiler (nvcc) to compile portions of your code that contain CUDA calls. (As an alternative to doing each of the following compiling and linking steps separately, you can run make to automate those steps. The necessary makefile is included with the sample files.)

nvcc -c gpu_driver.cu
nvcc -c hello.cu
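
For orientation, a .cu file compiled this way typically defines the kernel plus extern "C" wrappers so the mpicc-compiled C code can call them. The sketch below is illustrative only; the kernel and function names are assumptions, not the contents of the actual hello.cu or gpu_driver.cu.

#include <cuda_runtime.h>

__global__ void hello_kernel(char *data)
{
    const char msg[] = "Hello World!";
    if (threadIdx.x < sizeof(msg))
        data[threadIdx.x] = msg[threadIdx.x];   /* each thread writes one character */
}

/* Select a GPU for this MPI rank (hypothetical helper, callable from C) */
extern "C" void set_gpu_for_rank(int rank)
{
    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);
    if (ngpus > 0)
        cudaSetDevice(rank % ngpus);
}

/* Copy data to the GPU, run the kernel, and copy the result back */
extern "C" void launch_hello(char *host_data, int n)
{
    char *dev_data;
    cudaMalloc((void **)&dev_data, n);
    cudaMemcpy(dev_data, host_data, n, cudaMemcpyHostToDevice);
    hello_kernel<<<1, n>>>(dev_data);
    cudaMemcpy(host_data, dev_data, n, cudaMemcpyDeviceToHost);
    cudaFree(dev_data);
}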

Compile any portions of the code containing MPI calls.

mpicc -c main.c
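
The MPI side is plain C that calls the nvcc-compiled routines through their extern "C" interfaces. Again, this is only a sketch under the assumptions above, not the actual main.c:

#include <mpi.h>
#include <stdio.h>

/* Defined in the nvcc-compiled source files (hypothetical names) */
void set_gpu_for_rank(int rank);
void launch_hello(char *host_data, int n);

int main(int argc, char **argv)
{
    int rank, ntasks;
    char data[16] = "(unset)";

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    if (rank == 0)
        printf("Using %d MPI Tasks\n", ntasks);

    printf("[task %d] data before kernel call: %s\n", rank, data);
    set_gpu_for_rank(rank);                 /* pick a GPU for this rank */
    launch_hello(data, (int)sizeof(data));  /* run the kernel on it */
    printf("[task %d] data after kernel call: %s\n", rank, data);

    MPI_Finalize();
    return 0;
}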

Link the object files with the MPI C++ wrapper:

mpicxx -o hello gpu_driver.o hello.o main.o

or use the MPI C wrapper and link the C++ standard library explicitly, since the nvcc-compiled object files depend on it:

mpicc -o hello gpu_driver.o hello.o main.o -lstdc++
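
If the link step fails to resolve CUDA runtime symbols (references to cudaMalloc, cudaMemcpy, and so on), you may need to add the CUDA runtime library explicitly; for example:

mpicxx -o hello gpu_driver.o hello.o main.o -lcudart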

Launch the executable with srun.

srun -n 4 ./hello

Sample output:

[task 2] Contents of data before kernel call: HdjikhjcZ
there are 1 gpus on host casper26
[task 2] is using gpu 0 on host casper26
[task 2] Contents of data after kernel call: Hello World!
Using 4 MPI Tasks
[task 0] Contents of data before kernel call: HdjikhjcZ
there are 1 gpus on host casper26
[task 0] is using gpu 0 on host casper26
[task 0] Contents of data after kernel call: Hello World!
[task 3] Contents of data before kernel call: HdjikhjcZ
there are 1 gpus on host casper26
[task 3] is using gpu 0 on host casper26
[task 3] Contents of data after kernel call: Hello World!
[task 1] Contents of data before kernel call: HdjikhjcZ
there are 1 gpus on host casper26
[task 1] is using gpu 0 on host casper26
[task 1] Contents of data after kernel call: Hello World!