CUDA

A. GPU nodes in the CoARE HPC clusters

Two clusters have GPU nodes: (1) the gpu cluster (202.90.159.241) and (2) the duke cluster (202.90.159.130). The former has 2 NVIDIA Tesla K80 GPU computing processors, while the latter has an older NVIDIA Tesla M2070 GPU computing processor. See the specifications below.

Specifications                        Tesla K80    Tesla M2070
Total amount of global memory (MB)    11,440       5,301
CUDA cores                            2,496        448
GPU Max Clock rate (MHz)              824          1,147
Memory Clock rate (MHz)               2,505        1,566
Memory Bus Width (bits)               384          384
L2 Cache Size (bytes)                 1,572,864    786,432
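As a quick sanity check on the table, the theoretical memory bandwidth can be derived from the memory clock and bus width. The sketch below assumes the listed clock is half the effective GDDR5 data rate (hence the ×2 factor), which matches NVIDIA's published per-GPU figures for both cards:

```python
# Theoretical memory bandwidth = memory clock (Hz) x bus width (bytes) x 2,
# assuming the table lists half the effective GDDR5 data rate.
def bandwidth_gb_s(mem_clock_mhz, bus_width_bits):
    return mem_clock_mhz * 1e6 * (bus_width_bits / 8) * 2 / 1e9

print(round(bandwidth_gb_s(2505, 384), 1))  # Tesla K80 (per GPU): 240.5 GB/s
print(round(bandwidth_gb_s(1566, 384), 1))  # Tesla M2070: 150.3 GB/s
```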

B. Running PyCUDA applications

PyCUDA lets you access NVIDIA's CUDA parallel computation API from Python. To run a PyCUDA application on the CoARE GPU cluster, follow these instructions:

  1. Login using your ASTI-designated user account to 202.90.159.241.
    $ ssh username@202.90.159.241
    
  2. Load the "anaconda2" module so you can set up your own Python environment. This is necessary since the OS (CentOS 7.2) does not yet ship Python 3.x.
    [username@tux-gpu-01-g2 ~]$ module load anaconda2/4.3.0 
    
  3. Create your anaconda environment. In this example the environment is named cuda; you can configure the environment according to your preference.
    [username@tux-gpu-01-g2 ~]$ conda create -n cuda python=3.5
    
  4. Activate your newly created anaconda environment.
    [username@tux-gpu-01-g2 ~]$ source activate cuda
    
  5. Install numpy, load the cuda module, then install pycuda into your environment.
    (cuda) [username@tux-gpu-01-g2 ~]$ pip install numpy
    (cuda) [username@tux-gpu-01-g2 ~]$ module load cuda/8.0.61
    >> cuda-8.0.61 has been loaded.
    (cuda) [username@tux-gpu-01-g2 ~]$ echo $CUDA_ROOT
    /opt/hpcc/cuda/8.0.61
    (cuda) [username@tux-gpu-01-g2 ~]$ pip install pycuda
    
  6. Validate your installation. Try executing an interactive Python session in your shell and import pycuda.
    (cuda) [username@tux-gpu-01-g2 ~]$ python
    Python 3.5.3 |Continuum Analytics, Inc.| (default, Mar  6 2017, 11:58:13) 
    [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import pycuda
    >>> exit()
    (cuda) [username@tux-gpu-01-g2 ~]$ source deactivate
    [username@tux-gpu-01-g2 ~]$
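The check in step 6 can also be scripted instead of typed interactively. The probe below is illustrative only: it merely reports whether PyCUDA is importable in the active environment (it does not require a GPU).

```python
# Report whether PyCUDA is importable in the active environment.
try:
    import pycuda
    status = "pycuda " + pycuda.VERSION_TEXT + " is installed"
except ImportError:
    status = "pycuda is not installed in this environment"
print(status)
```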
    
  7. Create a sample Python script to test PyCUDA. The source code is copied from https://redmine.hpc.rug.nl/redmine/projects/peregrine/wiki/Submitting_a_single_job_with_Python_(GPU). Save it as tut.py.
    import pycuda.gpuarray as gpuarray
    import pycuda.driver as cuda
    import pycuda.autoinit
    import numpy
    from pycuda.curandom import rand as curand
    
    a_gpu = curand((50,))
    b_gpu = curand((50,))
    
    from pycuda.elementwise import ElementwiseKernel
    lin_comb = ElementwiseKernel(
            "float a, float *x, float b, float *y, float *z",
            "z[i] = a*x[i] + b*y[i]",
            "linear_combination")
    
    c_gpu = gpuarray.empty_like(a_gpu)
    lin_comb(5, a_gpu, 6, b_gpu, c_gpu)
    
    import numpy.linalg as la
    assert la.norm((c_gpu - (5*a_gpu+6*b_gpu)).get()) < 1e-5
    print (c_gpu) # This line is added to the original file to show the final output of the c_gpu array.
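On the CPU, the elementwise kernel above computes nothing more than z = a*x + b*y. A NumPy reference implementation of the same check (with seeded host arrays standing in for the random gpuarrays) looks like this:

```python
import numpy as np
import numpy.linalg as la

# Host-side equivalent of the linear_combination kernel in tut.py:
# z[i] = a*x[i] + b*y[i], verified the same way as the GPU version.
rng = np.random.default_rng(0)
a_host = rng.random(50, dtype=np.float32)
b_host = rng.random(50, dtype=np.float32)

c_host = 5 * a_host + 6 * b_host
assert la.norm(c_host - (5 * a_host + 6 * b_host)) < 1e-5
print(c_host[:5])
```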
    
  8. Create a SLURM script to run your Python code in the HPC. Save it as cuda.slurm.
    The SLURM script contains the necessary information about the specific amount and type of computational resources you require for a particular job/run. It includes the sequence of commands you would normally invoke in an interactive session in order to properly execute an application using the batch scheduler.
    #!/bin/bash
    
    #SBATCH --output=cuda-tutorial.log
    #SBATCH --job-name=cuda-tut
    #SBATCH --gres=gpu:1
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=2
    
    # check which GPU device was allocated
    echo "CUDA_DEVICE=/dev/nvidia$CUDA_VISIBLE_DEVICES" 
    
    # prepare working environment 
    module load anaconda2/4.3.0
    module load cuda/8.0.61
    
    # activate your conda environment
    source activate cuda
    
    srun python tut.py
    
    # deactivate environment
    source deactivate
    

    IMPORTANT NOTES:
    • --output: path where the logs of the job are stored
    • --job-name: serves as an identifier for your job
    • --gres=gpu:<number>: specifies the number of GPU devices that your job requires in order to run.
    • --ntasks=1: instructs the batch scheduler that the job will spawn one process.
    • --cpus-per-task=2: indicates the number of processors that will be assigned to the “python” process. Note that the batch scheduler relies on Linux Control Groups (cgroups), a kernel mechanism that isolates user processes from each other, to prevent users from consuming resources beyond their allocations.
    • source activate <env> / source deactivate pair: activates/deactivates your conda environment. In this example, it activates the cuda environment where pycuda is installed.
    • srun python tut.py: runs the python script tut.py under your conda environment.
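The echo line in cuda.slurm works because SLURM sets CUDA_VISIBLE_DEVICES to the comma-separated indices of the GPUs allocated to the job. The Python sketch below mirrors that line; the fallback of "0" is only so the sketch runs outside a job, where the variable is unset.

```python
import os

# Outside a SLURM job CUDA_VISIBLE_DEVICES is unset; assume device 0 then.
indices = os.environ.get("CUDA_VISIBLE_DEVICES", "0").split(",")
paths = ["/dev/nvidia" + i for i in indices]
for p in paths:
    print("CUDA_DEVICE=" + p)
```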

  9. Submit your job script to the queue and wait for available resources to turn up.
    [username@tux-gpu-01-g2 ~]$ ls
    cuda.slurm  tut.py
    [username@tux-gpu-01-g2 ~]$ sbatch cuda.slurm
    
  10. Check the status of your job. R - Running; PD - Pending
    [username@tux-gpu-01-g2 ~]$ squeue
    
  11. As soon as your job starts to run, all of the console messages generated by cuda.slurm will appear in cuda-tutorial.log. More information about the usage of SLURM commands, as well as the parameters available for configuring job runs, can be found in the SLURM documentation.
    
  12. To check the “occupancy” or usage of the GPU devices, one can issue:
    [username@tux-gpu-01-g2 ~]$ nvidia-smi

C. Running C/C++ CUDA applications

  1. Login using your ASTI-designated user account to 202.90.159.241.
    $ ssh username@202.90.159.241
    
  2. Load the cuda module. Verify that CUDA_ROOT is defined after loading the module.
    $ module load cuda/8.0.61
    >> cuda-8.0.61 has been loaded.
    $ echo $CUDA_ROOT
    /opt/hpcc/cuda/8.0.61
    
  3. Download the sample C/C++ sources from NVIDIA. This may take a while to complete.
    $ mkdir scripts
    $ cd scripts
    $ cuda-install-samples-8.0.sh ./
    
  4. Let's compile a sample script.
    $ cd NVIDIA_CUDA-8.0_Samples/5_Simulations/nbody
    $ make
    "/opt/hpcc/cuda/8.0.61"/bin/nvcc -ccbin g++ -I../../common/inc  -m64    -ftz=true -gencode arch=compute_20,code=sm_20 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_60,code=compute_60 -o nbody.o -c nbody.cpp
    nvcc warning : The 'compute_20', 'sm_20', and 'sm_21' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
    "/opt/hpcc/cuda/8.0.61"/bin/nvcc -ccbin g++   -m64      -gencode arch=compute_20,code=sm_20 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_60,code=compute_60 -o nbody bodysystemcuda.o nbody.o render_particles.o  -L/usr/lib64/nvidia -lGL -lGLU -lX11 -lglut
    nvcc warning : The 'compute_20', 'sm_20', and 'sm_21' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
    mkdir -p ../../bin/x86_64/linux/release
    cp nbody ../../bin/x86_64/linux/release
    

    If the compilation was successful, you should find an nbody binary.
    $ ls | grep nbody
    nbody
    nbody.cpp
    nbody.o
    $ 
    
  5. Create a SLURM script to run your compiled code in the HPC. Save it as cudacpp.slurm.
    #!/bin/bash
    
    #SBATCH --output=cudacpp-tutorial.log
    #SBATCH --ntasks=1
    #SBATCH --gres=gpu:1
    #SBATCH --cpus-per-task=2
    
    # check which GPU device was allocated
    echo "CUDA_DEVICE=/dev/nvidia$CUDA_VISIBLE_DEVICES" 
    
    ./nbody -benchmark -numbodies=25600 -numdevices=1
    
  6. Submit your job script to the queue and wait for available resources to turn up.
    [username@tux-gpu-01-g2 ~]$ sbatch cudacpp.slurm
    
  7. Check the status of your job. R - Running; PD - Pending
    [username@tux-gpu-01-g2 ~]$ squeue
    
  8. After the job is finished, check the contents of cudacpp-tutorial.log. The output should look like the following:
    CUDA_DEVICE=/dev/nvidia0
    Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
            -fullscreen       (run n-body simulation in fullscreen mode)
            -fp64             (use double precision floating point values for simulation)
            -hostmem          (stores simulation data in host memory)
            -benchmark        (run benchmark to measure performance) 
            -numbodies=<N>    (number of bodies (>= 1) to run in simulation) 
            -device=<d>       (where d=0,1,2.... for the CUDA device to use)
            -numdevices=<i>   (where i=(number of CUDA devices > 0) to use for simulation)
            -compare          (compares simulation results running once on the default GPU and once on the CPU)
            -cpu              (run n-body simulation on the CPU)
            -tipsy=<file.bin> (load a tipsy model file for simulation)
    
    NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
    
    number of CUDA devices  = 1
    > Windowed mode
    > Simulation data stored in video memory
    > Single precision floating point simulation
    > 1 Devices used for simulation
    GPU Device 0: "Tesla K80" with compute capability 3.7
    
    > Compute 3.7 CUDA device: [Tesla K80]
    number of bodies = 25600
    25600 bodies, total time for 10 iterations: 112.389 ms
    = 58.312 billion interactions per second
    = 1166.235 single-precision GFLOP/s at 20 flops per interaction
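The two derived figures in the log follow directly from the reported timing. As a sketch, using the numbers from the sample output above (25,600 bodies, 10 iterations, 112.389 ms, 20 flops per interaction):

```python
# Reproduce the derived figures from the nbody benchmark log.
bodies = 25600
iterations = 10
total_seconds = 112.389e-3
flops_per_interaction = 20

# Each iteration evaluates all pairwise interactions: bodies^2 of them.
interactions_per_second = bodies**2 * iterations / total_seconds
gflops = interactions_per_second * flops_per_interaction / 1e9

print(round(interactions_per_second / 1e9, 3))  # 58.312 billion/s, as in the log
print(round(gflops, 3))                         # 1166.235 GFLOP/s, as in the log
```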
    
  9. To check the “occupancy” or usage of the GPU devices, one can issue:
    [username@tux-gpu-01-g2 ~]$ nvidia-smi
    Fri Aug  4 18:08:33 2017       
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 375.51                 Driver Version: 375.51                    |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla K80           Off  | 0000:04:00.0     Off |                    0 |
    | N/A   39C    P0   137W / 149W |    100MiB / 11439MiB |    100%      Default |
    +-------------------------------+----------------------+----------------------+
    |   1  Tesla K80           Off  | 0000:05:00.0     Off |                    0 |
    | N/A   27C    P0    71W / 149W |      2MiB / 11439MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID  Type  Process name                               Usage      |
    |=============================================================================|
    |    0     26544    C   ./nbody                                         98MiB |
    +-----------------------------------------------------------------------------+
    
Important notes:
  1. The above instructions are suited for the GPU cluster (202.90.159.241). For the Duke cluster, change --gres=gpu:1 to --gres=gpu:tesla:1; and add an #SBATCH -p gpu line in the SLURM script.
  2. Since there are only 3 GPU nodes in the CoARE facility, we recommend that you only use 1 GPU node per job so that others are able to run their jobs as well.
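Applying note 1 to the PyCUDA job script gives the Duke-cluster variant sketched below. Only the partition and gres lines differ from cuda.slurm above; the module names and versions are assumed to be the same as on the gpu cluster.

```shell
#!/bin/bash

#SBATCH -p gpu
#SBATCH --output=cuda-tutorial.log
#SBATCH --job-name=cuda-tut
#SBATCH --gres=gpu:tesla:1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2

# check which GPU device was allocated
echo "CUDA_DEVICE=/dev/nvidia$CUDA_VISIBLE_DEVICES"

# prepare working environment (versions assumed identical to the gpu cluster)
module load anaconda2/4.3.0
module load cuda/8.0.61

source activate cuda
srun python tut.py
source deactivate
```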