How to use SGE at the BIC

system · March 26, 2018, 6:05pm

What is SGE? SGE (Sun Grid Engine) and SOGE (Son of Grid Engine) are both job scheduling and management systems used for distributed computing environments. SGE was originally developed by Sun Microsystems and was later acquired by Oracle, while SOGE is a fork of the open-source version of SGE.

Both SGE and SOGE allow users to submit jobs to a cluster of computers and manage resources such as CPU, memory, and disk space. They also provide features such as job prioritization, job dependencies, and job monitoring. Additionally, they support parallel computing, enabling users to run multiple tasks simultaneously.

Overall, SGE and SOGE are powerful tools for managing and scheduling jobs in high-performance computing environments.

Note that at the BIC, we use SOGE and we refer to it as SGE! Please read on if you would like to use SGE to access compute/cpu resources at the BIC.

Various queues are defined (see the command qstat -g c below). Some queues are restricted to particular labs and regroup that lab’s workstations; those available to the entire community are all.q, gpu.q and gpu-watch.q. The latter should only be used with qlogin to troubleshoot jobs running on gpu.q.

SGE uses the commands qsub (along with qrsh and qlogin), qdel, qstat and qhost to submit, delete and query jobs and queue states. For inexperienced users, a simple script/wrapper around SGE is available: qbatch. It can be used instead of the more complicated qsub command. A few examples are given below.

Extensive documentation for SGE is available on all BIC hosts. Just type “apropos -s 1 SGE” (or “man -s 1 -k sge”) at the shell prompt. Don’t be overwhelmed by the length and complexity of the man pages! The learning curve for SGE is very steep, as for any queueing system that can scale from tens up to thousands of nodes. Notwithstanding this, what is written below should be sufficient to get you going as an SGE beginner just learning the ropes.

You should at least peruse the man pages for the SGE commands qsub, qdel and qhost.
Most if not all of the BIC’s Linux machines can run jobs in the background using SGE.
All workstations have their queues disabled on weekdays, Monday to Friday, between 09:00 and 19:00.

1 Shell environment setup for users

For users with bash as a login shell: . /opt/sge/default/common/settings.sh
For users with tcsh as a login shell: source /opt/sge/default/common/settings.csh

Note that it should not be necessary to add anything to your login setup as the bits are already in place in the default initialization files for bash and tcsh on all BIC systems.

Important Note: you are on your own if you muck around with the SGE environment variable $SGE_ROOT!

2 To query the state of SGE use the commands qstat and qhost

For jobs in any state and for all users: qstat -u \*. Don’t forget to escape * with a backslash \ to prevent the shell from expanding it first!
To query the status of ALL your jobs, those running, queued, or in error: qstat -u <username>.
To limit the query to only your running jobs: qstat -s r -u <username>.
To query status of SGE hosts, their queues and the jobs associated with them try the command qhost.
To query the status of ALL running jobs from ALL users: qstat -f -s r -u \*.
To query the status of ALL running jobs of ALL users in ALL the non-empty queues: qstat -f -ne -s r -u \*.
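
Why the backslash matters can be seen without SGE at all. The following is a minimal sketch, not part of the BIC setup: echo stands in for qstat, and the scratch directory and file names are invented for the demonstration.

```shell
#!/bin/sh
# Demonstration only: 'echo' stands in for qstat so this runs anywhere.
# The scratch directory and its two file names are made up for the example.
demo_dir=$(mktemp -d)
touch "$demo_dir/alice.txt" "$demo_dir/bob.txt"

# Unescaped: the shell replaces * with the file names before the command runs.
unescaped=$(cd "$demo_dir" && echo qstat -u *)
# Escaped: the command receives a literal * and can treat it as "all users".
escaped=$(cd "$demo_dir" && echo qstat -u \*)

echo "unescaped: $unescaped"
echo "escaped:   $escaped"
rm -rf "$demo_dir"
```

In a directory that contains files, the unescaped form would hand qstat a list of file names instead of the wildcard it expects.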

These shell aliases are defined for all BIC users in their default shell environment for both tcsh and bash:

alias q='qstat -f -u \* \!*'
alias qne='qstat -f -ne -u \* \!*'
alias qr='qstat -f -s r -u \* \!*'
alias qrne='qstat -f -ne -s r -u \* \!*'

To get a cluster queue summary of all the queues:

~$ qstat -g c
CLUSTER QUEUE                   CQLOAD   USED    RES  AVAIL  TOTAL aoACDS  cdsuE  
--------------------------------------------------------------------------------
alain.q                           0.12      0      0     36     50      0     14 
all.q                             0.01      3      0    397    520      0    120 
amirlab.q                         0.28      0      0      4      4      0      0 
brainweb.q                        0.00      0      0      1      6      0      5 
chai.q                            0.02      0      0    128    140      0     12 
dblab.q                           0.07      0      0    256    256      0      0 
doyon.q                           0.01      0      0      4      4      0      0 
durcan.q                          0.00      0      0     30     30      0      0 
gil.q                             0.07      0      0     20     20      0      0 
gpu-watch.q                       0.02      0      0      6      6      0      0 
gpu.q                             0.02      0      0      3      3      0      0 
grova.q                           0.10      0      0      6     15      0      9 
ipl.q                             0.00      0      0      0    120      0    120 
klein.q                           0.01      0      0     38     38      0      0 
meg.q                             0.33      0      0     94    114     10     10 
mica.q                            0.10      0      0    324    390      0     66 
misic.q                           0.00      0      0     52     52      0      0 
neuropm.q                         0.01      0      0     30     36      0      6 
noel.q                            0.00      0      0    164    176      0     12 
noel64.q                          0.00      0      0     24     24      0      0 
origami.q                         0.04      0      0     60     60      0      0 
spreng.q                          0.13      0      0     51     51      0      0 
test.q                            0.00      0      0     20     20      0      0

3 To submit jobs use qsub or qbatch

Notes on qsub:

qsub will not take binaries as command-line arguments.
qsub will interpret anything from STDIN as a script.

See below for more info on qbatch.

4 A few examples

Before proceeding, a special note concerning memory usage. On some queues (namely all*.q and gpu.q), some measures were put in place to restrict memory usage in an effort to mitigate system over-subscription of memory - the default amount per job is equal to the total amount of physical RAM/memory on the system divided by the number of slots available on that same system.

You can use the qhost command to see how much RAM is available per host.

The default amount should be enough for most users but it’s better to know your requirements in advance and make a request as with the following…

$ qsub -l h_vmem=5G …, if you need for example 5GB of RAM.

This will ensure that you get a system with that much memory available. The job is killed if it exceeds the requested virtual memory, and the reported messages are not always helpful.

Special attention regarding memory must be given to matlab processes as the requirement can be quite high depending on whether JVM (Java Virtual Machine) is started (matlab -nodesktop) or not (matlab -nojvm). The memory usage reported by SGE (see qacct) is also not so helpful as it reports the maximum resident memory size while the virtual size can be a lot higher. After some investigation, the following was observed and this is ONLY to start matlab:

VMEM=6.0G for $(matlab19b -nodesktop)
VMEM=1.5G for $(matlab19b -nojvm)

VMEM=3.5G for $(matlab15b -nodesktop)
VMEM=0.8G for $(matlab15b -nojvm)

This VMEM value should be your starting point for setting/requesting h_vmem; add to that what you think you will need for your processing. Be careful when using parallel environments, as h_vmem is multiplied by the number of threads requested: if you need a total of 10 GB of memory and request 10 threads, then your ask should be h_vmem=1G.
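
The division for parallel jobs can be sketched as below. The helper name h_vmem_per_slot is hypothetical (it is not an SGE command), and the result is expressed in M so that totals that do not divide evenly are rounded up rather than truncated.

```shell
#!/bin/sh
# Sketch: per-slot h_vmem for a parallel job, given the TOTAL memory needed.
# h_vmem_per_slot is a hypothetical helper, not an SGE command.
# Usage: h_vmem_per_slot <total_GB> <slots>
h_vmem_per_slot() {
    total_mb=$(( $1 * 1024 ))                   # work in MB to avoid fractions of a GB
    per_slot=$(( (total_mb + $2 - 1) / $2 ))    # integer division, rounded up
    echo "${per_slot}M"
}

# The case from the text: 10 GB in total spread over 10 threads.
h_vmem_per_slot 10 10
# A case that does not divide evenly: 10 GB over 3 threads.
h_vmem_per_slot 10 3
```

The first call prints 1024M, which is the same request as h_vmem=1G. Remember to add the Matlab start-up figure listed above to the total before dividing.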

Now let’s start with a simple command - a directory listing. One needs to submit the job to the scheduler using the qsub command.

This will fail: qsub "ls -l". Instead do this: echo "ls -l" | qsub

The following is an attempt at demonstrating how to submit a Matlab job to SGE. First we create the relevant files and directories and then we send the job to be processed.

Create a directory to contain your scripts, matlab .m files and the job output files:

~$ mkdir /my/big/data/eve
~$ cd /my/big/data/eve
~$ mkdir output

Using an editor of your choice (nano, joe, vi, etc) create a shell script test-matlab.sh:

#!/bin/sh
#
#$ -cwd
#$ -o output/out.txt
#$ -e output/err.txt
#$ -m e
#$ -l h_vmem=3G
matlab -singleCompThread -nojvm -nosplash  < test-matlab.m

A few things to notice: the lines starting with the characters #$ are special and are interpreted by SGE as embedded command flags. The same behaviour can be obtained by adding command line options to the submission executable: qsub -cwd -o output/out.txt -e output/err.txt -m e -l h_vmem=3G <script_name>. This is explained below.

In this case we specify that the current working directory will be the SGE working directory, as if we had specified it on the qsub command line as qsub -cwd. If you don’t specify it, the job will run from the top of your home directory, $HOME, and Matlab will expect to find the .m file there.

The -o and -e lines specify the standard output and error filenames of the job, as if qsub had been called with the command options -o /some/path -e /some/other/path. The #$ -m e option is to email you when the job completes, and #$ -l h_vmem=3G requests 3 GB of virtual memory as described above.

The online help man qsub will show all the available options. There are quite a few, don’t despair!
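
To see how the embedded #$ lines map onto a qsub command line, one can list them mechanically. This is only a sketch: the temporary file is a copy of the test-matlab.sh header above, and embedded_opts is a made-up helper name.

```shell
#!/bin/sh
# Sketch: list the qsub options embedded in a job script as "#$" lines.
# The script written here is a copy of test-matlab.sh from above; the
# temporary file and the embedded_opts helper are made up for the example.
job_script=$(mktemp)
cat > "$job_script" <<'EOF'
#!/bin/sh
#
#$ -cwd
#$ -o output/out.txt
#$ -e output/err.txt
#$ -m e
#$ -l h_vmem=3G
matlab -singleCompThread -nojvm -nosplash  < test-matlab.m
EOF

embedded_opts() {
    # Keep only lines starting with "#$", drop that prefix, join on one line.
    sed -n 's/^#\$ *//p' "$1" | paste -s -d ' ' -
}

opts=$(embedded_opts "$job_script")
echo "qsub $opts <script_name>"
rm -f "$job_script"
```

Nothing is submitted here; the last line only prints the command line that would be equivalent to the embedded flags.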

Create a matlab .m file test-matlab.m containing:

disp  'matlab test ...'
1+1
prod(1:5)
disp ' ... all done.'

Very complex: compute 1 + 1 and the factorial of 5

Turn on the execute bits on the shell script test-matlab.sh

~$ chmod u+x test-matlab.sh

Submit the job:

~$ qsub -q all.q ./test-matlab.sh

Note the option -singleCompThread above. THIS IS important: by default matlab will gladly try to use all the system resources, so imagine 2, 3, perhaps 20 multi-threaded matlab processes all trying to use all of the resources at once! Welcome to swap city, not to mention that this is unfair to yourself and above all to other users. See below on how to use matlab in parallel environments.

Check the status of all your jobs:

~$ qstat -u eve
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
3350496 0.29262 STDIN      eve          r     07/21/2015 15:33:01 all.q@arnode08.bic.mni.mcgill.     1

Check the job output in output/out.txt:

< M A T L A B (R) >
  Copyright 1984-2012 The MathWorks, Inc.
R2012b (8.0.0.783) 64-bit (glnxa64)
  August 22, 2012
To get started, type one of these: helpwin, helpdesk, or demo.
For product information, visit www.mathworks.com.
>> matlab test ...
>> 
ans =
 2
>> 
ans =
   120
>>  ... all done.

5 Monitoring jobs and resource usage with qstat

Use qstat -j <job-id> to monitor a job using its job-id. This example of a typical output displays a lot of things not discussed so far, like job array tasks, but it should give you an idea of what to expect. Even to untrained eyes some of the fields have an obvious meaning.

~$ qstat -j 3533996
==============================================================
job_number:                 3533996
exec_file:                  job_scripts/3533996
submission_time:            Mon Nov 16 13:07:25 2015
owner:                      jane
uid:                        2320
group:                      pet
gid:                        200
sge_o_home:                 /home/bic/jane
sge_o_log_name:             jane
sge_o_path:                 .:/home/bic/jane/stuffScripts:/usr/sge/bin/lx26-amd64:/usr/local/bic/bin:/usr/local/mni/bin:/usr/local/bic/emma-1.0.0/bin:/data/spades/spades1/quarantines/Linux-x86_64/Oct-2010/bin:/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/games:/usr/X11R6/bin:/opt/bin:/boss/boss4/joe/share/AFNI/:/usr/lib/fsl/5.0
sge_o_shell:                /bin/bash
sge_o_workdir:              /warehouse/wh1/PreprocessedData
sge_o_host:                 tuna
account:                    sge
stderr_path_list:           NONE:NONE:/warehouse/wh1/PreprocessedData/NV-121/NV-121_V1_DTI_Output.bedpostX/logs/job_bedpostx_NV-121.elog
mail_list:                  jane@tuna.bic.mni.mcgill.ca
notify:                     FALSE
job_name:                   job_bedpostx_NV-121
stdout_path_list:           NONE:NONE:/warehouse/wh1/PreprocessedData/NV-121/NV-121_V1_DTI_Output.bedpostX/logs/job_bedpostx_NV-121.olog
jobshare:                   0
hard_queue_list:            all.q
env_list:                   SSH_AGENT_PID=11571,MATLABPATH=/usr/local/spm8,TERM=xterm,SHELL=/bin/bash,DESKTOP_STARTUP_ID=NONE,XDG_SESSION_COOKIE=fa09a4d81d7e18ccb0dd7f2b0000001a-1445260416.389186-1005527843,FSLMULTIFILEQUIT=TRUE,SGE_CELL=default,WINDOWID=44040195,POSSUMDIR=/usr/share/fsl/5.0,gflag=0,LC_ALL=C,USER=jane,LD_LIBRARY_PATH=/usr/lib/fsl/5.0:/usr/sge/lib/lx24-ia64/:/data/spades/spades1/quarantines/Linux-x86_64/Oct-2010/lib,subjdir=/warehouse/wh1/PreprocessedData/NV-121/NV-121_V1_DTI_Output,SSH_AUTH_SOCK=/tmp/ssh-YZtlc11542/agent.11542,SESSION_MANAGER=local/tuna:@/tmp/.ICE-unix/11647,unix/tuna:/tmp/.ICE-unix/11647,preprocjob=job_bedpostx_preproc_NV-121,PATH=.:/home/bic/jane/stuffScripts:/usr/sge/bin/lx26-amd64:/usr/local/bic/bin:/usr/local/mni/bin:/usr/local/bic/emma-1.0.0/bin:/data/spades/spades1/quarantines/Linux-x86_64/Oct-2010/bin:/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/games:/usr/X11R6/bin:/opt/bin:/boss/boss4/joe/share/AFNI/:/usr/lib/fsl/5.0,bedpostjob=job_bedpostx_NV-121,PWD=/warehouse/wh1/PreprocessedData,LANG=en_CA.ISO-8859-1,FSLTCLSH=/usr/bin/tclsh,SGE_QMASTER_PORT=6444,VOLUME_CACHE_THRESHOLD=-1,FSLMACHINELIST=NONE,SGE_ROOT=/var/lib/gridengine,FSLREMOTECALL=NONE,FSLWISH=/usr/bin/wish,FSLBROWSER=/etc/alternatives/x-www-browser,SHLVL=2,HOME=/home/bic/jane,LOGNAME=jane,FSLDIR=/usr/share/fsl/5.0,DBUS_SESSION_BUS_ADDRESS=unix:abstract=/tmp/dbus-mbE4W1bath,guid=eb5d8b5de59358d12501d287000003de,DISPLAY=:0.0,FSLLOCKDIR=NONE,FSLOUTPUTTYPE=NIFTI_GZ,COLORTERM=Terminal
script_file:                /warehouse/wh1/PreprocessedData/NV-121/NV-121_V1_DTI_Output.bedpostX/commands.txt
parallel environment:  all.pe range: 10
jid_predecessor_list (req):  job_bedpostx_preproc_NV-121
jid_successor_list:          3533997
job-array tasks:            1-70:1
usage   48:                 cpu=13:08:39, mem=2808.69090 GBs, io=0.67630, vmem=74.691M, maxvmem=102.383M
usage   49:                 cpu=11:36:59, mem=2491.12906 GBs, io=0.61604, vmem=78.102M, maxvmem=102.285M
usage   50:                 cpu=11:01:28, mem=2364.79003 GBs, io=0.63801, vmem=79.566M, maxvmem=102.438M
usage   51:                 cpu=01:31:47, mem=278.89056 GBs, io=0.11190, vmem=72.891M, maxvmem=86.496M
usage   52:                 cpu=01:31:04, mem=276.53716 GBs, io=0.10675, vmem=73.570M, maxvmem=83.043M
usage   53:                 cpu=01:29:21, mem=270.97023 GBs, io=0.10124, vmem=73.312M, maxvmem=85.707M
usage   54:                 cpu=01:23:45, mem=252.87035 GBs, io=0.10027, vmem=72.668M, maxvmem=83.043M
usage   55:                 cpu=01:06:56, mem=199.64696 GBs, io=0.07497, vmem=72.340M, maxvmem=76.434M
usage   56:                 cpu=00:16:45, mem=48.41589 GBs, io=0.02036, vmem=66.520M, maxvmem=78.473M
usage   57:                 cpu=00:10:17, mem=30.12382 GBs, io=0.00971, vmem=66.047M, maxvmem=68.555M
scheduling info:            queue instance "louislab.q@banquo.bic.mni.mcgill.ca" dropped because it is temporarily not available
                            queue instance "louislab.q@fonseca.bic.mni.mcgill.ca" dropped because it is temporarily not available
                            queue instance "louislab.q@portia.bic.mni.mcgill.ca" dropped because it is temporarily not available
                            queue instance "louislab.q@volsce.bic.mni.mcgill.ca" dropped because it is temporarily not available
                            queue instance "neuronav.q@varro.bic.mni.mcgill.ca" dropped because it is temporarily not available
                            queue instance "grova.q@davy.bic.mni.mcgill.ca" dropped because it is temporarily not available
                            queue instance "grova.q@jack.bic.mni.mcgill.ca" dropped because it is temporarily not available
                            queue instance "meg.q@rosaline.bic.mni.mcgill.ca" dropped because it is disabled
                            queue instance "noel64.q@somerville.bic.mni.mcgill.ca" dropped because it is disabled
                            queue instance "noel64.q@philemon.bic.mni.mcgill.ca" dropped because it is disabled
                            queue instance "noel64.q@gypsy.bic.mni.mcgill.ca" dropped because it is full
                            queue instance "noel64.q@dejerine.bic.mni.mcgill.ca" dropped because it is full
                            queue instance "noel64.q@oberon.bic.mni.mcgill.ca" dropped because it is full
                            queue instance "noel64.q@weka.bic.mni.mcgill.ca" dropped because it is full
                            queue instance "all.q@arnode10.bic.mni.mcgill.ca" dropped because it is full
                            queue instance "all.q@node21.bic.mni.mcgill.ca" dropped because it is full
                            queue instance "all.q@node20.bic.mni.mcgill.ca" dropped because it is full
                            queue instance "all.q@node17.bic.mni.mcgill.ca" dropped because it is full
                            queue instance "all.q@node18.bic.mni.mcgill.ca" dropped because it is full
                            queue instance "all.q@arnode05.bic.mni.mcgill.ca" dropped because it is full
                            queue instance "all.q@arnode08.bic.mni.mcgill.ca" dropped because it is full
                            queue instance "all.q@arnode02.bic.mni.mcgill.ca" dropped because it is full
                            queue instance "all.q@node19.bic.mni.mcgill.ca" dropped because it is full
                            queue instance "all.q@arnode04.bic.mni.mcgill.ca" dropped because it is full
                            queue instance "all.q@arnode01.bic.mni.mcgill.ca" dropped because it is full
                            queue instance "all.q@arnode03.bic.mni.mcgill.ca" dropped because it is full
                            queue instance "all.q@arnode11.bic.mni.mcgill.ca" dropped because it is full
                            queue instance "all.q@arnode07.bic.mni.mcgill.ca" dropped because it is full
                            queue instance "all.q@arnode06.bic.mni.mcgill.ca" dropped because it is full
                            queue instance "all.q@arnode12.bic.mni.mcgill.ca" dropped because it is full
                            cannot run in queue "gil-gpu.q" because it is not contained in its hard queue list (-q)
                            cannot run in queue "gil.q" because it is not contained in its hard queue list (-q)
                            cannot run in queue "hrrt.q" because it is not contained in its hard queue list (-q)
                            cannot run in queue "louislab.q" because it is not contained in its hard queue list (-q)
                            cannot run in queue "brainweb.q" because it is not contained in its hard queue list (-q)
                            cannot run in queue "grova.q" because it is not contained in its hard queue list (-q)
                            cannot run in queue "meg.q" because it is not contained in its hard queue list (-q)
                            cannot run in queue "int.q" because it is not contained in its hard queue list (-q)
                            cannot run in queue "ipl.q" because it is not contained in its hard queue list (-q)
                            cannot run in queue "noel64.q" because it is not contained in its hard queue list (-q)
                            cannot run in queue "klein.q" because it is not contained in its hard queue list (-q)
                            cannot run in queue "alain.q" because it is not contained in its hard queue list (-q)
                            cannot run in PE "all.pe" because it only offers 0 slots
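
The usage lines are a handy source for sizing a later h_vmem request. The sketch below picks the largest maxvmem out of three of the sample lines above; peak_vmem_mb is a hypothetical helper, and the unit handling (K, M, G suffixes) is an assumption about how the values are printed.

```shell
#!/bin/sh
# Sketch: find the largest maxvmem reported in the "usage" lines of qstat -j.
# The three sample lines are copied from the output above; on a live system
# one could pipe `qstat -j <job-id>` into peak_vmem_mb instead.
# peak_vmem_mb is a hypothetical helper name.
peak_vmem_mb() {
    awk '
        /^usage/ && match($0, /maxvmem=[0-9.]+[KMG]/) {
            field = substr($0, RSTART + 8, RLENGTH - 8)   # e.g. "102.383M"
            unit  = substr(field, length(field))          # trailing K, M or G
            value = substr(field, 1, length(field) - 1) + 0
            if (unit == "K") value /= 1024                # normalise to MB
            if (unit == "G") value *= 1024
            if (value > peak) peak = value
        }
        END { printf "%.3f\n", peak }
    '
}

peak=$(peak_vmem_mb <<'EOF'
usage   48:                 cpu=13:08:39, mem=2808.69090 GBs, io=0.67630, vmem=74.691M, maxvmem=102.383M
usage   49:                 cpu=11:36:59, mem=2491.12906 GBs, io=0.61604, vmem=78.102M, maxvmem=102.285M
usage   50:                 cpu=11:01:28, mem=2364.79003 GBs, io=0.63801, vmem=79.566M, maxvmem=102.438M
EOF
)
echo "peak maxvmem: ${peak}M"
```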

6 Deleting submitted jobs

Say your username is bing. Display your job(s) with qstat -u bing:

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
2288997 0.38822 bada       bing         r     02/14/2014 21:15:12 all.q@arnode01.bic.mni.mcgill.     1        
2288998 0.38822 bada       bing         r     02/14/2014 21:15:12 all.q@arnode01.bic.mni.mcgill.     1        
2288999 0.38822 bada       bing         r     02/14/2014 21:15:12 all.q@arnode02.bic.mni.mcgill.     1

The first column displays the job ids.
Use command qdel to delete a job.
You can specify a range of job-ids like qdel 2288997-2288999.
You can only delete your jobs unless you are an SGE administrator or operator.
Job array task IDs are discussed below.
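
When there are many jobs, the IDs can be collected from the listing rather than typed by hand. A minimal sketch, using the qstat -u bing output above as sample input; job_ids is a hypothetical helper, and the qdel line is only printed, never executed.

```shell
#!/bin/sh
# Sketch: pull the job IDs out of a qstat listing.
# The sample text is the qstat -u bing listing shown above; job_ids is a
# hypothetical helper. Nothing is deleted here, the qdel line is only printed.
job_ids() {
    # Header and separator lines have no purely numeric first column,
    # so only real job lines are printed.
    awk '$1 ~ /^[0-9]+$/ { print $1 }'
}

ids=$(job_ids <<'EOF' | paste -s -d ' ' -
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
2288997 0.38822 bada       bing         r     02/14/2014 21:15:12 all.q@arnode01.bic.mni.mcgill.     1
2288998 0.38822 bada       bing         r     02/14/2014 21:15:12 all.q@arnode01.bic.mni.mcgill.     1
2288999 0.38822 bada       bing         r     02/14/2014 21:15:12 all.q@arnode02.bic.mni.mcgill.     1
EOF
)
echo "qdel $ids"
```

On a live system the here-document would be replaced by the output of qstat -u <username>; review the printed command before running it.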

7 Interactive jobs submission with qlogin and qrsh

You can also use qlogin and qrsh to submit interactive jobs to SGE. As their names imply, the former submits an interactive login session to SGE while the latter starts an interactive shell session. On the host named agrippa:

$ qlogin -q all.q
local configuration agrippa.bic.mni.mcgill.ca not defined - using global configuration
Your job 2377466 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 2377466 has been successfully scheduled.
Establishing builtin session to host node18.bic.mni.mcgill.ca ...
@node18>

If you want to use X-forwarding, try the command:

$ qrsh -q all.q xterm

An xterm should pop up on one of the execution hosts of the specified cluster queue, in this particular case all.q. Note that qrsh is programmed to use ssh:

$ qconf -sconf | grep ssh
rsh_command                  /usr/bin/ssh
rlogin_command               /usr/bin/ssh
rlogin_daemon                /usr/sbin/sshd -i
rsh_daemon                   /usr/sbin/sshd -i

8 Selecting submission hosts and queues

You may restrict job submission to specific system(s).
Read also the section about using wildcards to choose queues and hosts with qsub.
With qbatch simply use the option -q as in: qbatch -q all.q <my_script>.
qbatch -help will give you some hints for more options.

With qsub you can specify a specific queue instance on a host or a hostname eg:

qsub -q ipl.q@node17.bic.mni.mcgill.ca <my_script>
qsub -l 'hostname=weka.bic.mni.mcgill.ca' <my_script>

There are a few sets of cluster queues. They are displayed by the command qconf -sql. The mnemonic for the command option -sql is show queue list.

$ qconf -sql
alain.q
all.q
all128.q
all512.q
amirlab.q
brainweb.q
chai.q
durcan.q
gil.q
gpu-watch.q
gpu.q
grova.q
ipl.q 
klein.q 
meg.q
mica.q
neuropm.q
noel.q
noel64.q

9 Using wildcards and matching types to select cluster queues, queue domains, queue instances and hostgroups

This is an advanced topic and beginning users can safely skip it.

You can use wildcards and patterns to select or exclude the specific SGE queues in which to run (or not run) your jobs.
Read man sge_types for all the details in full glory. Here we explain the concepts and syntax.
First some definitions:

Expressions are patterns separated by regular boolean operators

 expression= ["!"] ["("] valExp [")"] [ AND_OR expression ]*
 valExp  = pattern | expression
 AND_OR  = "&" | "|"

Boolean operators

   "!"       not operator  -- negate the following pattern or expression
   "&"       and operator  -- logically and with the following expression
   "|"       or operator   -- logically or with the following expression
   "("       open bracket  -- begin an inner expression.
   ")"       close bracket -- end an inner expression.

Patterns

   "*"     matches any character and any number of characters
           (between 0 and inv).
   "?"     matches any character. It cannot be no character
   "."     is the character ".". It has no other meaning
   "\"     escape character. "\\" = "\", "\*" = "*", "\?" = "?"
   "[...]" specifies an array or a range of allowed
           characters for one character at a specific position.
           Character ranges may be specified using the a-z notation.
           The caret symbol (^) is not interpreted as a logical
           not; it is interpreted literally.

The pattern itself should be put inside quotes (") to ensure that clients receive the complete pattern.
Range specifier has the form

n[-m[:s]][,n[-m[:s]], ...] or n[-m[:s]][ n[-m[:s]] ...]

and thus consists of a comma or blank separated list of range specifiers n[-m[:s]]. Each range may be a single number, a simple range of the form n-m or a range with a step size.
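
What a single range specifier n[-m[:s]] denotes can be sketched as follows. expand_range is a hypothetical helper written for illustration only; SGE does this parsing internally.

```shell
#!/bin/sh
# Sketch: expand one range specifier n[-m[:s]] into the numbers it denotes.
# expand_range is a hypothetical helper; SGE parses ranges internally.
expand_range() {
    spec=$1
    step=1
    case $spec in
        *:*) step=${spec##*:}; spec=${spec%%:*} ;;   # split off ":s" if present
    esac
    case $spec in
        *-*) first=${spec%%-*}; last=${spec##*-} ;;  # "n-m"
        *)   first=$spec;       last=$spec ;;        # single number "n"
    esac
    n=$first
    out=
    while [ "$n" -le "$last" ]; do
        out="$out${out:+ }$n"                        # space-separated list
        n=$(( n + step ))
    done
    echo "$out"
}

expand_range 7        # single number
expand_range 1-5      # simple range
expand_range 2-10:2   # range with a step size
```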

qsub option -q wc_queue_list

-q wc_queue_list is a comma separated list of queues wc_queue [“,” wc_queue “,” …].

wc_queue can consist of 3 types of wildcard objects: cluster queues (wc_cqueue), cluster domain queues (wc_qdomain) and cluster instances (wc_qinstance).

A cluster queue wildcard wc_cqueue matches some queues in the cluster. It cannot contain the character “@”, eg

* all queues
a* all queues starting with the letter a
a*&!all all queues starting with a, except the queue named all

A cluster queue domain wildcard wc_qdomain is of the form wc_cqueue “@” wc_hostgroup

eg

*@@* all queue instances whose underlying host is part of at least one hostgroup
a*@@e* all queue instances whose queue name begins with “a” and whose underlying host is part of at least one hostgroup beginning with “e”
*@@linux-64bit all queue instances on hosts part of the @linux-64bit hostgroup

Note: a hostgroup wc_hostgroup always starts with the @ character.

A cluster queue instance wildcard wc_qinstance is of the form wc_cqueue “@” wc_host where wc_host is a hostname wildcard

eg

*@* all queue instances in the cluster
*@b* all queue instances whose hostname begins with b
*@b*|c* all queue instances whose hostname begins with b or c
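
The simple wildcards (* and ?) behave much like shell globs, so a pattern can be previewed against a list of queue names before it is handed to qsub -q. The sketch below is an approximation under that assumption: it uses the shell’s own matching, does not implement the boolean operators (!, &, |), and match_queues is a made-up helper. The queue names are taken from the qconf -sql listing above.

```shell
#!/bin/sh
# Sketch: preview which cluster queue names a simple wildcard would select.
# This uses the shell's own glob matching as an approximation; it does not
# implement the boolean operators (!, &, |) described above.
# match_queues is a hypothetical helper; the queue names come from the
# qconf -sql listing earlier in this document.
queues="alain.q all.q all128.q all512.q amirlab.q brainweb.q chai.q gpu.q gpu-watch.q"

match_queues() {
    pattern=$1
    matched=
    for q in $queues; do
        case $q in
            $pattern) matched="$matched${matched:+ }$q" ;;   # unquoted: treated as a glob
        esac
    done
    echo "$matched"
}

match_queues 'a*'      # every queue starting with "a"
match_queues 'all*'    # all.q and its variants
match_queues 'gpu?q'   # "?" matches exactly one character
```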

10 More fun with qsub options

Here are the most commonly used options when submitting jobs:

“-l h_rt=time” to limit the job run time either in seconds or in hh:mm:ss format.
“-l h_vmem=mem” to specify the job’s hard virtual memory limit; mem specifier may include k, K, m, M, g, G.
“-N job_name” to specify a job name.
“-m bae” to receive an email when the job begins/aborts/ends. Just select which one you want.
“-cwd” to execute the job in the current working directory. If not specified, the job will run in your home dir.
“-j y” to merge the standard and error output streams.
“-o file” to redirect the standard output to the named file.
“-S shell” the shell to interpret the job script: /bin/bash (default) or /bin/tcsh.
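
Put together, these options might appear in a job script as embedded #$ lines. The following is a sketch only: the job name, time and memory limits and log file name are placeholder values, not BIC defaults.

```shell
#!/bin/sh
# Sketch of a job script combining the options listed above as "#$" lines.
# The job name, limits and log file name are placeholder values.
#$ -S /bin/sh
#$ -N demo_job
#$ -cwd
#$ -j y
#$ -o demo_job.log
#$ -l h_rt=00:10:00
#$ -l h_vmem=2G
#$ -m ae

# SGE sets JOB_ID and JOB_NAME inside a running job; the fallbacks after ":-"
# let the script also run by hand, outside SGE, for testing.
summary="job ${JOB_NAME:-demo_job} (id ${JOB_ID:-none}) in $(pwd)"
echo "$summary"
```

Because the #$ lines are ordinary comments to the shell, the script can be run directly to check the body before submitting it with qsub.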

11 SGE configuration: displaying cluster queues, queue instances, submit hosts, etc…

You can display all sorts of info regarding SGE using the command qconf.

list all cluster queues with qconf -sql. Mnemonic: Show Queue List.
list specific queue with qconf -sq hrrt.q. Here hrrt.q is a cluster queue from the output of the previous command. Mnemonic: Show Queue.
list all groups of hosts with qconf -shgrpl. Mnemonic: Show Hostgroup List.
list specific host group with qconf -shgrp @ipl. Here @ipl is a host group from the output of the previous command. Mnemonic: Show Hostgroup.
list all submit hosts with qconf -ss. Mnemonic: Show Submit hosts.

12 SGE status: displaying hosts, queues and jobs status

12.1 Host States

Use the command qhost to get information on the status of SGE hosts, queues and jobs. Read the man page man qhost for more info. The output will consist of one line for each host with the following columns:

the Hostname
the Architecture.
the Number of processors.
the Load.
the Total Memory.
the Used Memory.
the Total Swapspace.
the Used Swapspace.

For instance, restricting the output to host agrippa with the option -h agrippa (see the man page for all the other options):

$ qhost -h agrippa
 HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
agrippa.bic.mni.mcgill.ca lx26-amd64     24  0.27   47.3G    1.0G   96.0G     0.0

12.2 Queues States

Using the option -q gives you more info about the queues hosted on all the SGE execution hosts. Restrict the output to a specific host by using the option -h:

$ qhost -q -h node21
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
node21.bic.mni.mcgill.ca lx26-amd64     24  2.04   47.3G    1.1G   96.0G   20.3M
   ipl.q                BIP   0/0/20        
   all.q                BIP   0/1/4         D

The first line contains the hostname, the host architecture (lx26-amd64), the number of cpu/cores, the last 5 minutes load average, total memory available, its current usage, the memory swap space available and its current usage.

There are extra indented lines of output for each queue associated with the host: the queue name, the queue type and the number of reserved, used and available job slots and the queue state.

Queue type: one of B(atch), I(nteractive), C(heckpointing), P(arallel), T(ransfer) or combinations thereof.
Queue state: it can be a combination of the following abbreviations:

u(nknown) sge_execd cannot be contacted
a(larm) load thresholds exceeded
A(larm) suspend thresholds exceeded
s(uspended)/d(isabled) queue states assigned via qmod.
D(isabled) queue disabled via calendar facility.
C(alendar suspended) queue suspended via calendar facility.
S(ubordinate) queue suspended via subordination to another queue.
E(rror) sge_execd can’t locate sge_shepherd on that host. Check error logfile of that sge_execd.
c(onfiguration ambiguous) state of a queue instance. Advanced stuff: read ‘sge_qmaster(8)’!!
o(rphaned) queue instance is no longer demanded by the current cluster queue’s configuration or the host group configuration. Jobs will continue to run but cannot be submitted until further intervention is done.
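
Since several letters can be combined in the state column, a small decoder can help when reading qhost -q output. This is a sketch: describe_queue_state is a hypothetical helper and its wording is condensed from the list of abbreviations above.

```shell
#!/bin/sh
# Sketch: translate the queue state letters listed above into words.
# describe_queue_state is a hypothetical helper; the wording is condensed
# from the list of abbreviations in this section.
describe_queue_state() {
    state=$1
    [ -z "$state" ] && { echo "no state flags"; return; }
    out=
    while [ -n "$state" ]; do
        rest=${state#?}            # everything after the first letter
        letter=${state%"$rest"}    # the first letter itself
        case $letter in
            u) text="unknown" ;;
            a) text="load alarm" ;;
            A) text="suspend alarm" ;;
            s) text="suspended" ;;
            d) text="disabled" ;;
            D) text="disabled by calendar" ;;
            C) text="suspended by calendar" ;;
            S) text="suspended by subordination" ;;
            E) text="error" ;;
            c) text="configuration ambiguous" ;;
            o) text="orphaned" ;;
            *) text="unrecognised ($letter)" ;;
        esac
        out="$out${out:+, }$text"
        state=$rest
    done
    echo "$out"
}

describe_queue_state D     # the state shown for all.q on node21 above
describe_queue_state au    # two flags combined
describe_queue_state ""    # empty state column, as for ipl.q above
```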

12.3 Jobs States

The option -j (it implicitly calls -q) outputs info about jobs tied to a host and its underlying queues. After the queue status line, a single line is printed for each job currently running in this queue. Each job status line contains

the job ID,
the job name,
the job owner name,
the status of the job - one of t(ransferring), r(unning), R(estarted), s(uspended), S(uspended) or T(hreshold)
the start date and time
the function of the job (MASTER or SLAVE - only meaningful in case of a parallel job)
the job array task-id.

$ qhost -j -h arnode10
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
arnode10.bic.mni.mcgill.ca lx26-amd64     16  5.41   70.9G    2.2G   70.9G   48.9M
   job-ID  prior   name       user         state submit/start at     queue      master ja-task-ID
   ----------------------------------------------------------------------------------------------
   3484527 0.75000 shell.sh   joe          r     11/01/2015 13:35:05 all.q@arno MASTER
                                                                     all.q@arno SLAVE
                                                                     all.q@arno SLAVE
                                                                     all.q@arno SLAVE
                                                                     all.q@arno SLAVE
                                                                     all.q@arno SLAVE
   3533996 0.34234 job_bedpos jane         r     11/19/2015 00:37:53 all.q@arno MASTER 48
                                                                     all.q@arno SLAVE  48
                                                                     all.q@arno SLAVE  48
                                                                     all.q@arno SLAVE  48
                                                                     all.q@arno SLAVE  48
                                                                     all.q@arno SLAVE  48
                                                                     all.q@arno SLAVE  48
                                                                     all.q@arno SLAVE  48
                                                                     all.q@arno SLAVE  48
                                                                     all.q@arno SLAVE  48

Job status can have the values:

d(eletion) qdel(1) has been used to initiate job deletion.
t(ransferring) and r(unning) job is about to be executed or is already executing.
s(uspended), S(uspended) and T(hreshold) show that an already running job has been suspended:
s(uspended) the job was suspended via the qmod(1) command.
S(uspended) the queue containing the job is suspended, hence the job too.
T(hreshold) at least one suspend threshold of the corresponding queue was exceeded, hence the job was suspended too.
R(estarted) job was restarted.
w(aiting) and h(old) only appear for pending jobs:
h(old) job currently is not eligible for execution due to a hold state assigned to it.
w(aiting) job is waiting for completion of the jobs to which job dependencies have been assigned.
E(rror) job couldn’t be started due to job properties.
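
State letters are often combined (for instance dr for a running job being deleted). The following is a small sketch, not an SGE tool, that spells out such a string letter by letter using the meanings listed above. The function name describe_job_state is invented for this example, and the letter q (queued, as in the qw shown by qstat for pending jobs) is an addition that is not part of the list above.

```shell
#!/bin/bash
# Hypothetical helper: translate a job state string into words, one letter
# at a time, following the list of job states given above.
describe_job_state() {
    local state=$1 out="" letter word i
    for (( i = 0; i < ${#state}; i++ )); do
        letter=${state:i:1}
        case $letter in
            d) word="deletion" ;;
            t) word="transferring" ;;
            r) word="running" ;;
            R) word="restarted" ;;
            s) word="suspended by qmod" ;;
            S) word="suspended with its queue" ;;
            T) word="suspended by queue threshold" ;;
            h) word="hold" ;;
            w) word="waiting" ;;
            q) word="queued" ;;   # assumption: from qstat output, not listed above
            E) word="error" ;;
            *) word="unknown ($letter)" ;;
        esac
        out+="${out:+, }${word}"   # join the words with ", "
    done
    printf '%s\n' "$out"
}

describe_job_state r      # prints: running
describe_job_state hqw    # prints: hold, queued, waiting
```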

13 Submitting array jobs

Say a job consists of many tasks that are very similar in nature: for instance, a single input file generates a very large number of independent outputs, each of which differs only through some difference in its processing. In a situation like this it might be preferable to use an SGE array job rather than submitting hundreds or thousands of jobs. This allows you to create a single script and submit it once as a single job to SGE rather than tracking thousands of jobs and their outputs.

An array job is an array of identical tasks, differentiated only by an index number and treated by SGE almost like a series of jobs. The number of tasks to be run is set at job submission time using the -t argument to the qsub command, e.g.

~$ qsub -t 1-10:1 myscript.sh

This command will submit an array job consisting of 10 tasks, numbered 1 to 10. The ‘:1’ following the range of tasks to execute indicates the size of the increment between job tasks (1 by default). The task id range specified in the option argument may be a single number, a simple range of the form n-m or a range with a step size of the form n-m:s. The index numbers will be exported to the job tasks via the environment variable SGE_TASK_ID. The option arguments n, m and s will be available through the environment variables SGE_TASK_FIRST, SGE_TASK_LAST and SGE_TASK_STEPSIZE.

Hence, the task id range specified by 2-10:2 would result in the task id indexes 2, 4, 6, 8, and 10, for a total of 5 identical tasks, each with the environment variable SGE_TASK_ID containing one of the 5 index numbers.
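
To check which task ids a given -t argument would produce before submitting, the expansion can be reproduced in plain bash. This is only a sketch of the rule described above (single number, n-m, or n-m:s); expand_task_range is a made-up name and the function does not talk to SGE at all.

```shell
#!/bin/bash
# Hypothetical helper: list the task ids generated by a `qsub -t` range
# of the form n, n-m or n-m:s, as described above.
expand_task_range() {
    local range=$1 first last step=1 id
    local -a ids=()
    if [[ $range == *:* ]]; then
        step=${range##*:}      # part after the colon is the step size
        range=${range%%:*}
    fi
    first=${range%%-*}
    last=${range##*-}          # for a single number, first equals last
    for (( id = first; id <= last; id += step )); do
        ids+=("$id")
    done
    echo "${ids[*]}"
}

expand_task_range 2-10:2   # prints: 2 4 6 8 10
expand_task_range 1-10:1   # prints: 1 2 3 4 5 6 7 8 9 10
```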

Let’s say you have a script, /home/bic/joe/bin/process.sh, that processes a single input file and generates a hundred different output files according to some initial state values (the details are irrelevant here; this is just an example). Rather than submitting 100 single jobs, create a single wrapper script, task.sh, submit it once as a task array and use SGE_TASK_ID as an index to label the output:

#!/bin/bash
# job name
#$ -N NV-121
# merge standard error into standard output
#$ -j y
# run the job from the current working directory
#$ -cwd
/home/bic/joe/bin/process.sh --input=blah --output=/data/PreprocessedData/NV-121/NV-121-output.${SGE_TASK_ID} --nf=2 --fudge=1 --bi=1000 --nj=1250 --se=25 --model=fuzzy --cnonlinear

~$ qsub -t 1-100 task.sh

You can query the job status with the same command as before, qstat -j followed by the job-id. You can delete individual tasks of a job by specifying the parent job-id and the task ids. So if you have submitted a job with qsub -t 1-10 sleep.sh, you can delete task ids 6 to 10 with qdel job-id -t 6-10, where job-id is the id of the array job. Task ids 1 to 5 will be executed or left to run to completion.
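
Besides labelling outputs, SGE_TASK_ID is also handy for choosing a different input for each task. One common pattern, sketched here under the assumption that you keep a plain text file with one input per line, is to let task N read line N of that file. The names input_for_task and inputs.txt are hypothetical and only serve the example.

```shell
#!/bin/bash
# Hypothetical helper: print the line of a list file that corresponds to the
# current array task. The task id comes from the second argument if given,
# otherwise from SGE_TASK_ID as set by SGE for array jobs.
input_for_task() {
    local list=$1
    local task_id=${2:-${SGE_TASK_ID:?SGE_TASK_ID is not set}}
    sed -n "${task_id}p" "$list"
}

# In a task script one could then write (inputs.txt is a hypothetical list
# holding one input file name per line):
#   input=$(input_for_task inputs.txt)
```

Submitting such a script with qsub -t 1-100 would then require inputs.txt to hold at least 100 lines.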

14 Submitting parallel matlab jobs with SGE

A light introduction may be found here: How To Use Sge Batch with Matlab.

Running Shared Memory Parallel Matlab Jobs
Parallel jobs are those which make use of more than one CPU core (or job slot) simultaneously. It’s possible to run jobs on multiple cores on a single host, but to do so you must tell both Matlab and SGE. The advantage is that the host memory is available to all the different threads of the job. However, you must take care not to overload the host, and Matlab is by default very greedy in that respect.

List all the parallel environment (PE) instances available at the BIC:

~$ qconf -spl
    alain.pe
    all.pe
    gil.pe
    grova.pe
    ipl.pe
    ipl_mpi.pe
    make
    matlab
    mpi
    smp

Display a specific parallel environment (PE) instance:

~$ qconf -sp all.pe
pe_name            all.pe
slots              560
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $pe_slots
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE
qsort_args         NONE

This shows that there are 560 cores (slots) available in it. Say you want to restrict your job to 8 cores or less. Use the embedded option #$ -pe all.pe 8 in your shell script: this tells SGE to run your job in the parallel environment (PE) all.pe with 8 slots, so that the job may use a maximum of 8 cores. You can also specify this on the command line when submitting your job with qsub -q all.q -pe all.pe 8 test-matlab.sh.

Make Matlab aware of the core limits.

Note: Matlab versions more recent than 2012b use the function parpool to create a pool of workers in a cluster, whereas versions 2012b and older use matlabpool.

For a job submitted with -pe, the scheduler stores the number of cores granted to the job in the environment variable $NSLOTS.

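For Matlab2012b and older
In your matlab script test-matlab.m add the following:

% NSLOTS is set by SGE to the number of cores requested with -pe
thrds = str2double(getenv('NSLOTS'));
% use the function form: "matlabpool open thrds" would pass the literal string 'thrds'
matlabpool('open', thrds);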

For Matlab2013 and more recent:
Matlab uses the Parallel Computing Toolbox. To use multiple cores in Matlab, request N cores via -pe all.pe N and set the pool size to no more than N when calling the parpool() function. To ensure that the number of workers matches the number of cores requested, invoke parpool() explicitly with that number. The easiest way to do this is with the following line of code, since $NSLOTS is set by the scheduler to the number of cores requested.

parpool(str2num(getenv('NSLOTS')));

The following could be run at the start of your cluster job, or just once in a Linux session on one of the BIC hosts, to set up a parallel cluster profile.

cl = parcluster('local');
cl.NumWorkers = 8;
saveProfile(cl);

Note that Java must be initialized in order to use the Parallel Computing Toolbox so do not pass the option -nojvm to matlab in this case.
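
Putting the pieces together, a submission script along the lines of test-matlab.sh might look like the sketch below. It is only an illustration under a few assumptions: the job name, the choice of all.q and all.pe with 8 slots, and feeding test-matlab.m to Matlab on standard input are choices made for this example, and the fallback of NSLOTS to 1 is there only so the script can be tried by hand outside of SGE.

```shell
#!/bin/bash
# job name (arbitrary, chosen for this example)
#$ -N test-matlab
# run from the current working directory and merge stderr into stdout
#$ -cwd
#$ -j y
# queue and parallel environment: 8 slots in all.pe
#$ -q all.q
#$ -pe all.pe 8

# SGE sets NSLOTS for jobs submitted with -pe. Fall back to 1 so that the
# script can also be tried by hand outside of SGE (assumption of this sketch).
export NSLOTS="${NSLOTS:-1}"
echo "Running with ${NSLOTS} slot(s)"

# Start Matlab without a display and feed it the script on standard input.
# -nojvm is deliberately NOT used: the Parallel Computing Toolbox needs Java.
if command -v matlab >/dev/null 2>&1; then
    matlab -nodisplay -nosplash < test-matlab.m
else
    echo "matlab not found in PATH; nothing to run" >&2
fi
```

Inside test-matlab.m, the pool is then opened with the NSLOTS-based call shown above, so the number of workers follows whatever was requested with -pe.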

Reserving Cores:

Finally, if the cluster is heavily loaded and you submit a job requesting many cores, it is possible that, as cores free up a few at a time, jobs requesting fewer cores will slip in front of your job, and your job may have to wait a long time before the necessary resources are available. To tell SGE to accumulate cores for your job via what is known as a reservation, you can add the flag -R y to your submission, e.g.,

~$ qsub -q all.q -pe all.pe 20 -R y grid1.sh

or embed the command #$ -R y in your scripts.

This should allow your job to start once it reaches the top of the queue and SGE is able to accumulate enough cores for the job, though it’s possible some other jobs might step ahead of your job because SGE needs to choose a specific node on which to accumulate cores and make some guesses about how long currently-running jobs will take to finish.

By making a reservation, the necessary slots on a node are blocked from accepting new jobs, so that when the jobs running on those slots are complete there are sufficient free slots on a single node to execute the SMP job. SGE may still make use of these reserved slots for jobs with a short run time, as long as they will not interfere with scheduling the SMP job. Job execution can be monitored as usual with qstat; note, however, that the number of slots shown for the job reflects the number requested from the parallel environment.