How to Use the HPC¶
- How to Use the HPC
- SLURM Workload Manager
- Logging in
- Compiling and Installing Software
- Loading Software Into Your Environment
- Managing Jobs
- Uploading files to the home directory (/home)
SLURM Workload Manager¶
- Simple Linux Utility for Resource Management
- The native scheduler software that runs on ASTI's HPC cluster
- Free and open-source job scheduler for the Linux kernel, used by many of the world's supercomputers and computer clusters (on the November 2013 Top500 list, five of the top ten systems used Slurm)
- Users request allocations of compute resources through SLURM
SLURM Three Key Functions¶
- Allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work
- Provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes
- Arbitrates contention for resources by managing a queue of pending work
- Node
  A compute resource managed by SLURM
- Partition
  A logical set of nodes with the same queue parameters (job size limit, job time limit, users permitted to use it, etc.)
- Job
  An allocation of resources assigned to a user for a specified amount of time
- Job Step
  A set of (possibly parallel) tasks within a job
Types of Jobs¶
- Multi-node parallel jobs
- Use more than one node and require MPI to communicate between nodes
- Jobs usually require computing resource (cores) more than a single node can offer
- Single node parallel jobs
- Use only one node, but multiple cores on that node
- Includes pthreads, OpenMP and shared memory MPI
- Truly serial jobs
- Require only one core on one node
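The resource requests for these three job types differ mainly in their #SBATCH directives. A minimal sketch (the node, task, and core counts below are illustrative, not recommendations):

```shell
# Multi-node parallel job: 2 nodes, 8 MPI tasks on each (launched via srun/mpirun)
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8

# Single-node parallel job: 1 node, one task using 8 cores (e.g. OpenMP threads)
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8

# Truly serial job: one core on one node
#SBATCH --nodes=1
#SBATCH --ntasks=1
```

Only one set of directives would appear in a real job script; they are shown together here for comparison.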
CoARE facility's HPC Cluster Partitions¶
- Batch partition
  - Suitable for jobs that take a long time to finish (up to 7 days)
  - Up to six (6) nodes may be allocated to any single job
  - Each job can allocate up to 4 GB of memory per CPU core
  - Default partition when the partition directive is left unspecified in a request
- Debug partition
  - Queue for small/short jobs
  - Maximum run time limit per job is 60 minutes (1 hour)
  - Best for interactive usage (e.g. compiling, debugging)
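In a job script, the target queue is selected with SLURM's -p/--partition directive. A sketch for the two queues above (assuming they are named batch and debug, the names used elsewhere in this guide):

```shell
# Choose one, depending on the job:
#SBATCH --partition=batch    # long-running jobs, up to 7 days
##SBATCH --partition=debug   # short/interactive jobs, up to 1 hour (extra # disables the line)
```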
Logging in¶
A. Linux/macOS¶
- Launch your terminal application
- Issue one of these commands:
    $ ssh email@example.com
    $ ssh firstname.lastname@example.org
- Useful options:
    OPTION | USE
    -v | Increase verbosity (show more connection messages)
    -i <identity_file> | Indicate the private key to be used (location of file either absolute or relative to the current working directory)
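Putting the options together, a typical connection command looks like the following (the key path is a placeholder; substitute your own account name and key):

```shell
# Connect verbosely using a specific private key
ssh -v -i ~/.ssh/hpc_key firstname.lastname@example.org
```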
B. Windows OS¶
1. Open the PuTTY application (Please click on the link to download the application: http://the.earth.li/~sgtatham/putty/latest/x86/putty.exe)
2. Set the port number to 22 and the host name to the given IP address
3. Under the 'SSH' category, go to 'Auth', then browse to the location of your private key (.ppk file). Click Open.
4. When the terminal opens, enter your account user name
5. Select Yes when prompted by a PuTTY Security Alert window
6. If you want to use your key (generated using PuTTY) in OpenSSH, check the article on converting a PuTTY key to an SSH key
Compiling and Installing Software¶
Installation of applications is done by the CoARE project's technical team. However, users can also install their own applications.
Who should compile and install¶
Compiling and installing software in the HPC are done by the CoARE project's technical team.
Users can also compile and install their own applications. The technical team allows this when certain steps can only be performed by the user, but whenever possible it is preferred that the technical team compile and install applications in the HPC. You may contact the team via email with instructions on how to compile and install your applications.
If the software requires license¶
If the software you need requires a license, you will be asked to provide the necessary license. Once the software is installed, you may share it with your colleagues and even with other users of the HPC.
Loading Software Into Your Environment¶
- A Linux utility that manages what software is loaded into a user's environment.
- Software modules automatically set and modify the environment variables that are necessary for the successful execution of an application.
- For example, suppose a job needs to run a binary that was compiled using icc 14.0 and OpenMPI 1.8. The application is linked against the icc and MPI library files, and it will fail to run if those shared libraries are not in the directories specified by $LD_LIBRARY_PATH.
- It is also quite cumbersome to memorize and type the absolute path of the binary file you're trying to execute. It is more convenient to execute just the name of the binary; to do this, the directory in which the binary files are installed should be included in the PATH variable.
- These cases are very common in an HPC environment since application files are not installed in the default location.
- Module files are scripts that define the particular variables that need to be set for each application.
- Each application has its own module file.
- Associated module files are available for packages installed in /opt/hpcc
A. Listing All Available Packages¶
- The command to list all available packages:
    [user@hpc ~]$ module avail
    -----------------------------/opt/hpc/modulefiles/-----------------------------
    intel/2014      mvapich/2.2     openmpi/1.8.4
    openmpi/1.6.1   slurm/14.11.4   torque/5.1.0
- The format of the listed packages:
    <package name>/<package version>

  Example: intel/14.0 (version 14.0 of the Intel compiler suite)
- A slightly more informative listing can be obtained by using:
    [user@hpc ~]$ module whatis
    intel/2014     : gives access to the intel compilers (icc, ifort, etc.) and mkl library
    mvapich/2.2    : loads the MVAPICH2 mpi toolset for MPI programming
    openmpi/1.8.4  : loads the OpenMPI 1.8.4 mpi toolset for MPI programming
    openmpi/1.6.1  : loads the OpenMPI 1.6.1 mpi toolset for MPI programming
    slurm/14.11.4  : adds the Slurm resource manager for interaction with cluster
    torque/5.1.0   : adds the Torque resource manager for interaction with cluster
    [user@hpc ~]$ module whatis mvapich/2.2
    mvapich/2.2    : loads the MVAPICH2 mpi toolset for MPI programming
B. Loading a Package¶
- The command for loading a package is:

    module load <package name>

  or

    module add <package name>
- If there are multiple versions of the package available, you can append the version as shown in the module avail listing
- Packages will also load the module files of prerequisites when available.
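For instance, to load a specific version taken from the module avail listing shown earlier:

```shell
# Loads OpenMPI 1.8.4, plus any prerequisite modules defined in its module file
module load openmpi/1.8.4
```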
C. Listing Currently Loaded Packages¶
- To see the list of currently loaded modules in a user's shell, use:

    module list
D. Unloading a Package¶
- To unload a package from the user's environment:
module unload <package name>
- To remove all packages that are loaded:

    module purge
E. Load Modules Automatically on Login¶
- Modules can be automatically loaded on a user's shell environment by adding the appropriate module commands into the ~/.bashrc file
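For example, appending lines such as these to ~/.bashrc (the module names are illustrative, taken from the listing earlier in this guide) loads them in every new shell:

```shell
# ~/.bashrc -- load frequently used modules automatically at login
module load intel/2014
module load openmpi/1.8.4
```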
A. Submitting Jobs¶
- Command to submit your job to the queue:

    sbatch <name of script>
- In preparing your script, specify the partition, time limit, memory allocation, and number of cores
- Other parameters such as job name, email notification, and wall clock limit can also be included
    SCRIPT | MEANING | USE
    #!/bin/bash | | Allows script to run as a bash script
    #SBATCH -p general | # partition (queue) | Set partition to "general"
    #SBATCH -N 1 | # number of nodes | To request that all cores should be in one node
    #SBATCH -n 1 | # number of cores | Determines the no. of cores you need
    #SBATCH -t 0-4:00 | # time (D-HH:MM) | To set time allocation
    #SBATCH -o slurm.%N.%j.out | # STDOUT | To show non-error output messages (%N = name of the node, %j = job ID)
    #SBATCH -e slurm.%N.%j.err | # STDERR | A parameter to separate error from non-error output messages
    #SBATCH --mail-type=END,FAIL | # notifications for job done & fail | To receive email notification when the job is completed or failed
    #SBATCH --mail-user=myemail@asti.dost.gov.ph | # send-to address | To specify the email address of the recipient
- Sample script
    #!/bin/bash
    #SBATCH --partition=batch
    #SBATCH --time=00:15:00
    #SBATCH --nodes=2           ## replace --nodes with --ntasks to specify number of cores; nodes and ntasks can't be specified together
    #SBATCH --ntasks-per-node=8
    #SBATCH --mem=24000         ## mem-per-cpu can be used to specify per core memory limit; mem-per-cpu and mem can't be specified together
    #SBATCH --job-name="hello_test"
    #SBATCH --output=test-srun.out
    #SBATCH --mail-user=email@example.com
    #SBATCH --mail-type=ALL
    #SBATCH --requeue

    echo "SLURM_JOBID="$SLURM_JOBID
    echo "SLURM_JOB_NODELIST"=$SLURM_JOB_NODELIST
    echo "SLURM_NNODES"=$SLURM_NNODES
    echo "SLURMTMPDIR="$SLURMTMPDIR
    echo "working directory = "$SLURM_SUBMIT_DIR

    # Place commands to load environment modules here
    module load intel/compiler/14.0.3

    # Set stack size to unlimited
    ulimit -s unlimited

    # MAIN
    srun /path/to/binary
- After preparing your script, you can now submit the job to the queue with the command:
    sbatch name-of-file.slurm

  Note: To view a list of sbatch options, visit this link: http://www.schedmd.com/slurmdocs/sbatch.html
- IMPORTANT NOTES:
- It is important to request accurate resources and parameters. Doing so helps the scheduler place jobs effectively, prevents your program from crashing, and avoids wasting resources. Also, before you submit your job, determine which partition to submit it to, batch or debug. Refer to this manual to help you decide the appropriate partition for your job.
- Running jobs in /home is not allowed
- Active files should be transferred to /scratch1 and/or /scratch2
- Outputs of your run should be stored in /home
- /scratch1 and/or /scratch2 should not be used as long-term storage for your files. If you wish to store your files for a longer time, please use your home directory
B. Monitoring Job Progress¶
- To monitor the progress of jobs submitted to the HPC, you can use the squeue command
    OPTION | USE
    squeue -u <username> | To list all current jobs for a particular user
    squeue -u <username> -t RUNNING | To list all running jobs for a particular user
    squeue -u <username> -t PENDING | To list all pending jobs for a particular user
    squeue -u <username> -p general | To list all current jobs in the general partition for a user
    squeue -j <jobid> | To get information about specific jobs

  For more squeue options, visit this link: http://www.schedmd.com/slurmdocs/squeue.html
- You can also use showq, scontrol and sstat:
    OPTION | USE
    showq-slurm -o -U -q <partition> | To list priority order of jobs in a given partition
    scontrol show jobid -dd <jobid> | To list detailed information for a job, which can be useful for troubleshooting
    sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j <jobid> --allsteps | To list status info for a currently running job
- The sacct command gives you information for currently running and previously finished jobs, including information that cannot be viewed while the job is running, such as run time and memory used.
    OPTION | USE
    sacct -j <jobid> | To show details on the state of a job
    sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed | To get statistics on completed jobs by job ID
    sacct -u <username> --format=JobID,JobName,MaxRSS,Elapsed | To view the same information for all jobs of a user

  MaxRSS = Maximum Resident Set Size
- Both squeue and sacct commands provide the state of a job:
    STATE | MEANING
    Running | The job is currently running
    Completed | The job finished successfully
    Failed | The job finished but was unsuccessful
    Cancelled | The job was terminated and removed from the queue
    Pending | The job is waiting for resources to become available
- Note: How long your job takes to complete depends on the resources and time it requires and on the resources available in the facility.
- To check the status of nodes in the cluster, use these commands:
    $ sinfo
    $ scontrol show nodes

  Notes:
  * These commands are also used to:
    - Provide information on the state of the nodes in the cluster (idle, mix, down)
    - Show the partition names, node allocation and availability
  * Consult the sinfo man pages for more info on usage and options
C. Controlling Jobs¶
- You can pause, resume, and requeue your jobs using the scontrol command:
    OPTION | USE
    scontrol hold <jobid> | To pause a job
    scontrol resume <jobid> | To resume a job that was previously paused
    scontrol requeue <jobid> | To cancel and rerun a job
D. Terminating Jobs¶
- Jobs can be terminated using the scancel command:
    OPTION | USE
    scancel <jobid> | To cancel a job
    scancel -u <username> | To terminate all your jobs
    scancel -t PENDING -u <username> | To cancel all your PENDING jobs
    scancel --name myJobName | To cancel one or more jobs by name

  For more scancel options, go to this link: http://www.schedmd.com/slurmdocs/scancel.html#index
Uploading files to the home directory (/home)¶
The home directory is intended to store inactive and output files. Each user has a quota of up to 100 GB in this directory.
A. UNIX OS¶
- For UNIX OS you may use scp or rsync commands
- To use the scp command, first make sure that it is available on your machine. To check, type this command in your terminal:

    which scp

- If the command is available on your machine, type the following in your terminal:
scp -rv (absolute path of the file) firstname.lastname@example.org:/home/username
- To use the rsync command:
rsync -avh (absolute path of the file) email@example.com:/home/username
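For example, to upload a whole results directory (the local path is a placeholder; use your own account name and files):

```shell
# -a archive mode (recursive, preserves permissions/times), -v verbose, -h human-readable sizes
rsync -avh ~/results email@example.com:/home/username
```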
B. Windows OS¶
- Download winscp: https://winscp.net/eng/download.php
- Run the program
- Set up the program. You may use the link provided below to help you set up the winscp program:
- Once set up, establish a connection to the HPC cluster
- Drag and drop your files to copy them from your computer to the HPC cluster or vice versa