Slurm

    Commands

    • salloc: obtain a job allocation
    • sbatch: submit a batch script for later execution
    • srun: obtain a job allocation and execute an application
    • squeue: view information about jobs in the queue
    • sinfo: view information about nodes and partitions
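
    For instance, a minimal interactive session (the task count is illustrative): salloc spawns a shell inside the allocation, srun runs inside it, and exit releases the allocation.

    salloc -n1
    srun hostname
    exit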

    Run a simple command

    srun -n1 -l /bin/hostname
    

    Notes:

    • -n: number of tasks
    • -l: prefix each output line with its task number
    • without -p/--partition, this runs on the default partition (see the example below)
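
    A sketch of targeting a specific partition; the partition name debug is an assumption:

    srun -p debug -n2 -l /bin/hostname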

    Global information

    sinfo

    To inspect a specific node, see also:

    scontrol show node nodename
    

    Note:

    • replace nodename with the actual node name
    • you can also show information about a job (if running) or a partition, as shown below
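
    For instance, assuming a job ID of 12345 and a partition named debug (both illustrative):

    scontrol show job 12345
    scontrol show partition debug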

    Update the state of a node

    scontrol update NodeName=nodename State=RESUME
    

    Fix unknown state (or unexpected reboot)

    scontrol update NodeName=nodename State=DOWN Reason='undraining'
    scontrol update NodeName=nodename State=RESUME
    

    History of jobs

    sacct --format=User,JobIDRaw,JobID,Jobname%30,state,elapsed,start,end,nodelist,partition,timelimit
    

    Notes:

    • filter by job name with --name=jobname
    • filter by start date with --starttime=2020-07-25
    • filter by state with --state=FAILED (filters can be combined, as shown below)
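
    A combined query; the job name myjob is an assumption:

    sacct --name=myjob --starttime=2020-07-25 --state=FAILED --format=JobID,Jobname%30,state,elapsed,nodelist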

    Job names only

    sacct -X -n --starttime 2010-01-01 --format=Jobname%100 | sort -u
    

    Maintenance on node/partition

    Set the state of the partition (or node) as down:

    scontrol update Partition=debug State=down
    

    Suspend every job that is still running (replace jobid with each running job's ID):

    scontrol suspend jobid
    

    Once maintenance is done, resume all jobs that were suspended, then set the partition (or nodes) back to up, as sketched below.
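
    A minimal sketch, reusing the debug partition and the jobid placeholder from above:

    scontrol resume jobid
    scontrol update Partition=debug State=up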

    sbatch example with pyxis

    #!/bin/bash
    #
    #SBATCH --job-name=test
    #
    #SBATCH --ntasks=2
    #SBATCH --mem-per-cpu=100
    #
    #SBATCH --distribution=cyclic:block
    #
    #SBATCH --output="/mnt/share/result_%A_%a.log"
    #SBATCH --error="/mnt/share/result_%A_%a.log"
    #SBATCH --array=1-300
    #
    #SBATCH --export=ENROOT_CONFIG_PATH=/etc/enroot
    
    hostname
    
    ## --- pyxis required here (also check the ENROOT_CONFIG_PATH export above if registry auth is required) ---
    srun --cpu-bind=cores --container-image="user@domain.tld#hpc/marymorstan-experiment:v1.0.0" --container-mount-home ls /
    ## -- end of pyxis usage ---
    
    result="$?"
    
    # share_dir is assumed to match the --output destination above
    share_dir="/mnt/share"
    log_file="result_${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}.log"
    
    if [ "${result}" != "0" ]; then
            echo "TROUBLE SRUN" >>${log_file}
    else
            echo "SRUN OK" >>${log_file}
    fi
    
    if [ -f "${log_file}" ]
    then
            mv "${log_file}" "${share_dir}"
    fi
    
    if [ "${result}" != "0" ]; then
            exit ${result}
    fi
    

    Note: you can add #SBATCH --requeue to make a job eligible for requeueing (e.g. after a node failure or preemption); see the example below.
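
    An eligible job can also be requeued by hand; the job ID is illustrative:

    scontrol requeue 12345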

    Tricks

    Global view of the cluster

    sinfo --Node --long
    

    Cancel all jobs

    squeue --noheader --format="%A" | xargs -n 1 scancel
    

    Suspend all running jobs

    squeue --format="%A" --state RUNNING | xargs scontrol suspend
    

    Overall status of all jobs submitted via sbatch

    sbatch ... >>jobs.txt
    sacct -X --format=User,JobIDRaw,JobID,Jobname%30,state,timelimit,start,end,elapsed,nodelist --job=$(awk '{print $4}' jobs.txt | paste -sd,)
    

    Show unique job names in the queue

    squeue --noheader --format=%j | sort | uniq
    

    Priority per job in the queue

    squeue --noheader --format='%j %.15p' | sort | uniq
    

    Squeue essentials

    squeue --format='%i %.50j %.15T %.15L %.15p'
    

    Interactive use

    srun ... --pty bash
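
    A concrete invocation; the partition and resource counts are illustrative:

    srun -p debug -n1 --pty bash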
    

    Count CPUs used in a partition

    sinfo --Node --partition=prod | tail -n +2 | awk '{print $1}' | xargs -I {} scontrol show node {} | grep Alloc= | cut -d"=" -f2 | cut -d" " -f1 | paste -sd+ | bc
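
    A shorter alternative using sinfo's %C format token, which prints CPU counts as allocated/idle/other/total:

    sinfo -p prod -o "%C"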
    
