5. Job Scheduler¶
The system's job scheduling uses Altair Grid Engine, which efficiently schedules single and parallel jobs according to priority and required resources.
5.1. Job Scheduling System Configuration¶
TSUBAME4.0 provides the following queue types/job types for different purposes.
Queue types | Job types | Command | Resource types |
---|---|---|---|
Normal queue | Batch job | qsub | Selectable |
Normal queue | Interactive job | qrsh | Selectable |
Interactive job queue | Interactive job | iqrsh | Fixed |
[Queue types]
- Normal queue
System resources are allocated and occupied in units of logically divided resource types.
This is the queue normally used. Node reservation is also available.
- Interactive job queue
An interactive queue is an environment where the same resource is shared among users, making it less likely to fail to secure a node even when congested, and providing a quick start to visualization and interactive programs.
This queue is intended for programs with intermittent processor usage, such as debuggers, visualizers, and Jupyter Lab. Do not use this service for programs that occupy processors continuously.
Please be sure to read List of various limits before use.
Info
Interactive job queue is available free of charge only for intramural users (tgz-edu) and access card users.
See Interactive job queue for usage.
[Job types]
- Batch job
Create and submit a job script. See Batch job for details.
- Interactive job
Execute a program or shell script interactively. Usage differs depending on the queue type.
When using "Normal queue", please refer to Interactive job.
When using "Interactive job queue", please refer to Interactive job queue.
[Resource types]
- Use Normal queue
Please refer to Resource types for available resource types.
- Use Interactive job queue
The assigned resources are 24 physical CPU cores, 96 GB of memory, and 1 MIG instance, and up to 12 users share the same resources.
5.1.1. Resource types¶
This system uses resource types in which compute nodes are logically divided to reserve system resources.
When submitting a job, specify the resource type and how many of it to use (e.g., -l node_f=2). The available resource types are listed below.
Resource type | Physical CPU cores | Memory (GB) | GPUs | Local scratch area (GB) |
---|---|---|---|---|
node_f | 192 | 768 | 4 | 1920 |
node_h | 96 | 384 | 2 | 960 |
node_q | 48 | 192 | 1 | 480 |
node_o | 24 | 96 | 1/2 | 240 |
gpu_1 | 8 | 96 | 1 | 240 |
gpu_h | 4 | 48 | 1/2 | 120 |
cpu_160 | 160 | 368 | 0 | 960 |
cpu_80 | 80 | 184 | 0 | 480 |
cpu_40 | 40 | 92 | 0 | 240 |
cpu_16 | 16 | 36.8 | 0 | 96 |
cpu_8 | 8 | 18.4 | 0 | 48 |
cpu_4 | 4 | 9.2 | 0 | 24 |
- "Physical CPU cores", "Memory (GB), "GPUs" are the available resources per resource type.
- The same resource type can be requested in multiples with [Resource type]=[Num]. Combining different resource types is not available.
- Maximum run time is 24 hours.
- TSUBAME4 has various limits, such as the number of jobs that can run at the same time and the total number of slots that can run at the same time. (slots = physical cores in each resource type x number of nodes; same as the slots shown by qstat)
Current limits are shown below:
https://www.t4.gsic.titech.ac.jp/en/resource-limit
Note that these limits may change at any time depending on usage.
5.2. Normal queue¶
5.2.1. Batch job¶
To run a job on this system, log in to the login node and execute the qsub command.
5.2.1.1. Job submission flow¶
To submit a job, create a job script and submit it with the qsub command.
- Create a job script
- Submit a job using qsub
- Status check using qstat
- Cancel a job using qdel
- Check job result
The qsub command confirms billing information (TSUBAME points) and accepts jobs.
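A minimal sketch of this flow is shown below; the script name job.sh, the group name tga-example, and the job ID 12345 are placeholders.
$ qsub -g tga-example job.sh   # 2. submit the job script
$ qstat                        # 3. check the job status
$ qdel 12345                   # 4. cancel the job if necessary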
5.2.1.2. Creating job script¶
Here is a job script format:
#!/bin/sh
#$ -cwd
#$ -l [Resource type]=[Number]
#$ -l h_rt=[Maximum run time]
#$ -p [Priority]
[Load relevant modules required for the job]
[Your program]
Warning
The shebang (the #!/bin/sh line) must be the first line of the job script.
- [Load relevant modules required for the job] Load the modules required for your job. For example, to load the Intel compiler:
module load intel
- [Your program]
Execute your program. For example, running an a.out binary:
./a.out
The resource type can be set on the command line or in the comment block (#$) at the top of the job script file.
Make sure to set the resource type and maximum run time, which are required to submit a job.
The main options for the qsub command are listed below.
Option | Description |
---|---|
-l [Resource type]=[Number] | Specify the resource type. |
-l h_rt=[Maximum run time] (Required) | Specify the maximum run time in hours, minutes, and seconds: [[HH:]MM:]SS, i.e. HH:MM:SS, MM:SS, or SS. |
-N | Name of the job (script file name if not specified). |
-o | Name of the standard output file. |
-e | Name of the standard error output file. |
-j y | Merge the standard error output into the standard output file. |
-m | Send email when the job begins, ends, or aborts. The conditions for the -m argument are: a: mail is sent when the job is aborted; b: mail is sent when the job begins; e: mail is sent when the job ends. These can be combined, e.g. abe. When a large number of jobs with the mail option are submitted, a large amount of mail is also sent, placing a heavy load on the mail server; this may be detected as an attack and mail from Science Tokyo may be blocked. If you need to execute such jobs, please remove the mail option or rewrite the script so that the work runs as a single job. |
-p (Premium Options) | Specify the job execution priority. If -4 or -3 is specified, a higher charge factor than for -5 is applied. The values -5, -4, -3 correspond to priorities 0, 1, 2 of the charging rule. -5: standard execution priority (default). -4: execution priority higher than -5 and lower than -3. -3: highest execution priority. Note that all priority values are negative; do not forget the leading minus sign. |
-t | Submits an array job with the range start-end[:step]. |
-hold_jid | Defines the job dependency list of the submitted job. The job is executed after the specified dependent job is finished. |
-ar | Specify the reserved AR ID when using the reserved node. |
Note that the -V option, which passes environment variables from the submission environment to the job, is not available on this system.
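As a sketch, the required resource type and maximum run time may also be given on the qsub command line instead of in the script; the group name tga-example and the script job.sh are placeholders.
$ qsub -g tga-example -l cpu_4=1 -l h_rt=0:30:00 -N myjob -j y job.sh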
5.2.1.3. Job script examples¶
5.2.1.3.1. Single job/GPU job¶
The following is an example of a job script created when executing a single job (job not parallelized) or GPU job.
For a GPU job, please replace -l cpu_4=1 with -l gpu_1=1 and load the required modules, such as the CUDA environment.
#!/bin/sh
# run in current working directory
#$ -cwd
#$ -l cpu_4=1
# maximum run time
#$ -l h_rt=1:00:00
#$ -N serial
# load cuda module
module load cuda
# load Intel Compiler
module load intel
./a.out
5.2.1.3.2. SMP job¶
An example of a job script created when executing an SMP parallel job is shown below. Hyper-threading is enabled for compute nodes. Please explicitly specify the number of threads to use.
#!/bin/sh
#$ -cwd
# node_f 1 node
#$ -l node_f=1
#$ -l h_rt=1:00:00
#$ -N openmp
module load cuda
module load intel
# 192 threads per node
export OMP_NUM_THREADS=192
./a.out
5.2.1.3.3. MPI job¶
An example of a job script created when executing an MPI parallel job is shown below. Please load the MPI environment that matches the MPI library your program uses, as shown in the following examples. For OpenMPI, to pass library environment variables to all nodes, you need to use -x LD_LIBRARY_PATH.
Intel MPI
#!/bin/sh
#$ -cwd
# node_f x 4
#$ -l node_f=4
#$ -l h_rt=1:00:00
#$ -N flatmpi
module load cuda
module load intel
# Intel MPI
module load intel-mpi
# 8 processes per node, 32 MPI processes in total
mpiexec.hydra -ppn 8 -n 32 ./a.out
OpenMPI
#!/bin/sh
#$ -cwd
# node_f x 4
#$ -l node_f=4
#$ -l h_rt=1:00:00
#$ -N flatmpi
# Load OpenMPI: Intel compiler, CUDA are loaded automatically
module load openmpi/5.0.2-intel
# 8 processes per node, 32 MPI processes in total
mpirun -npernode 8 -n 32 -x LD_LIBRARY_PATH ./a.out
The list of nodes assigned to the submitted job can be found in the file pointed to by $PE_HOSTFILE.
$ echo $PE_HOSTFILE
/var/spool/age/r15n10/active_jobs/1407687.1/pe_hostfile
$ cat /var/spool/age/r15n10/active_jobs/1407687.1/pe_hostfile
r15n10 24 all.q@r15n10 <NULL>
r20n11 24 all.q@r20n11 <NULL>
r20n10 24 all.q@r20n10 <NULL>
r23n9 24 all.q@r23n9 <NULL>
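For example, a job script can extract just the host names (the first column) from this file; this is a sketch, and the output file name is arbitrary.
# write the host names assigned to this job into a file
awk '{print $1}' $PE_HOSTFILE > hosts.$JOB_ID
cat hosts.$JOB_ID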
5.2.1.3.4. Hybrid parallel (Hybrid, MPI+OpenMP)¶
An example of a job script created when executing a hybrid (MPI+OpenMP) parallel job is shown below. Please load the MPI environment that matches the MPI library your program uses, as shown in the following examples. For OpenMPI, to pass library environment variables to all nodes, you need to use -x LD_LIBRARY_PATH.
Intel MPI
#!/bin/sh
#$ -cwd
# node_f x 4
#$ -l node_f=4
#$ -l h_rt=1:00:00
#$ -N hybrid
module load cuda
module load intel
module load intel-mpi
# 192 threads per node
export OMP_NUM_THREADS=192
# 1 MPI process per node, 4 MPI processes in total
mpiexec.hydra -ppn 1 -n 4 ./a.out
OpenMPI
#!/bin/sh
#$ -cwd
# node_f x 4
#$ -l node_f=4
#$ -l h_rt=1:00:00
#$ -N hybrid
# Open MPI: Intel compiler, CUDA are loaded automatically
module load openmpi/5.0.2-intel
# 192 threads per node
export OMP_NUM_THREADS=192
# 1 MPI process per node, 4 MPI processes in total
mpirun -npernode 1 -n 4 -x LD_LIBRARY_PATH ./a.out
5.2.1.4. Job submission¶
A job is queued and executed by specifying the job script with the qsub command. You can submit a job using qsub as follows.
$ qsub -g [TSUBAME group] SCRIPTFILE
Option | Description |
---|---|
-g | Specify the TSUBAME group name. Please add as qsub command option, not in script. |
-q prior | Submit as a subscription job. The job waits at most one hour before execution. |
5.2.1.5. Job status¶
The qstat command displays the job status.
$ qstat [option]
The options of qstat are listed below.
Option | Description |
---|---|
-r | Displays job resource information |
-j [job-ID] | Displays additional information about the job |
Here is an example of qstat:
$ qstat
job-ID   prior    name       user      state  submit/start at      queue                jclass  slots  ja-task-ID
----------------------------------------------------------------------------------
    307  0.55500  sample.sh  testuser  r      02/12/2023 17:48:10  all.q@r8n1A.default          32
(snip)
Item | Description |
---|---|
Job-ID | Job-ID number |
prior | Priority of job |
name | Name of the job |
user | ID of the user who submitted the job |
state | State of the job. r: running, qw: waiting in the queue, h: on hold, d: deleting, t: in transition (e.g. during job start), s: suspended, S: suspended by the queue, T: has reached the limit of the tail, E: error, Rq: rescheduled and waiting to run, Rr: rescheduled and running |
submit/start at | Submit or start time and date of the job |
queue | queue name |
jclass | job class name |
slots | The number of slots the job occupies (slots = physical cores in each resource type x number of nodes) |
ja-task-ID | Array job task-id |
5.2.1.6. Deleting job¶
To delete your job, use the qdel command.
$ qdel [Job-ID]
An example of deleting job is shown as below:
$ qstat
job-ID   prior    name       user      state  submit/start at      queue                jclass  slots  ja-task-ID
----------------------------------------------------------------------------------
    307  0.55500  sample.sh  testuser  r      02/12/2023 17:48:10  all.q@r8n1A.default          32
$ qdel 307
testuser has registered the job 307 for deletion
$ qstat
job-ID   prior    name       user      state  submit/start at      queue                jclass  slots  ja-task-ID
----------------------------------------------------------------------------------
5.2.1.7. Job result¶
The standard output of an AGE job is stored in the file "[script file name].o[job ID]" in the directory where the job was executed, and the standard error output in "[script file name].e[job ID]". For example, job 307 submitted as sample.sh writes its output to sample.sh.o307 and its errors to sample.sh.e307.
5.2.1.8. Array job¶
An array job is a mechanism for repeatedly executing the operations contained in a job script, parameterized by a task number.
Each run within an array job is called a task and is managed by a task ID. A job ID given without a task ID refers to all of the tasks.
Info
Each task in an array job is scheduled as a separate job, so the schedule wait time is proportional to the number of tasks.
If each task is short or the number of tasks is large, it is strongly recommended to reduce the number of tasks by grouping several tasks together.
Example: 10,000 tasks are combined into 100 tasks, each of which processes 100 of the original tasks.
The task number can be specified as an option of the qsub command or defined in a job script.
Submission options are specified as -t (start number)-(end number):(step size). If the step size is 1, it can be omitted. An example is shown below.
# describe below in a job script
#$ -t 2-10:2
The above example (2-10:2) specifies a start number of 2, an end number of 10, a step size of 2 (one skipped index), and consists of five tasks with task numbers 2, 4, 6, 8, and 10.
The task number for each task is set in an environment variable named $SGE_TASK_ID, so this environment variable can be used in the job script to enable parameter study. The result file will be output with the job name followed by the task ID.
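A minimal sketch of such a parameter study is shown below; the program a.out and the input/output file names are placeholders.
#!/bin/sh
#$ -cwd
#$ -l cpu_4=1
#$ -l h_rt=0:30:00
# tasks 1 to 10
#$ -t 1-10
# each task processes its own input file, e.g. input_1.dat ... input_10.dat
./a.out input_${SGE_TASK_ID}.dat > output_${SGE_TASK_ID}.dat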
Also, if you want to remove a specific task ID before/while it is running, use qdel's -t option as follows.
$ qdel [Job-ID] -t [Task-ID]
5.2.2. Interactive job¶
The system's job scheduler has the ability to run programs and shell scripts interactively. To run an interactive job, use the qrsh command and specify the resource type and elapsed time with -l. After submitting a job with qrsh, a command prompt will be returned when the job is dispatched. The following shows how to use an interactive job.
$ qrsh -g [TSUBAME group] -l [Resource type]=[number] -l h_rt=[maximum run time]
Directory: /home/N/username
(job start time)
username@rXnY:~> [commands to run]
username@rXnY:~> exit
If the -g option is not specified, the job is treated as a "trial run": at most two of the specified resource type, at most 10 minutes of elapsed time, and priority -5.
Example specifying resource type node_f, 1 node, elapsed time 10 minutes
$ qrsh -g [TSUBAME group] -l node_f=1 -l h_rt=0:10:00
Directory: /home/N/username
(job start time)
username@rXnY:~> [commands to run]
username@rXnY:~> exit
Enter exit to stop the interactive job.
5.2.2.1. X forwarding using interactive nodes¶
To perform X forwarding directly from a node connected by qrsh, please follow the procedure below.
- Enable X forwarding and connect to the login node with ssh (a command sketch follows this list).
- Execute qrsh command with X11 forwarding like the following example.
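For step 1, a typical OpenSSH command looks like the following; the user name and key path are placeholders, and -Y enables X11 forwarding.
$ ssh -Y -i /path/to/key username@login.t4.gsic.titech.ac.jp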
Example)
The example shows a 2-hour job with resource type cpu_4 and 1 node.
The assigned node is chosen by the scheduler from the free nodes at the time the command is executed, so you cannot explicitly specify a node.
# Execution of qrsh command
$ qrsh -g [TSUBAME group] -l cpu_4=1 -l h_rt=2:00:00
username@rXnY:~> module load [application modules to load]
username@rXnY:~> [command to run X11 application]
username@rXnY:~> exit
Info
For X forwarding using interactive nodes, Open OnDemand is also available.
5.2.2.2. Connecting to the network applications¶
If you need to operate an application through a Web browser or similar, you can use SSH port forwarding to access the application from a Web browser on your local machine.
(1) Get hostname of interactive node connected by qrsh
$ qrsh -g tga-hpe_group00 -l cpu_4=1 -l h_rt=0:10:00
$ hostname
r1n1
$ [execute the program that requires Web browser]
After starting an interactive job with qrsh, get the hostname of the machine.
In the above example, r1n1 is the hostname.
Nothing more needs to be done in this console, but keep it open until you finish working with the application.
(2) Connect with SSH port forwarding enabled from the console on your local machine from which you ssh (not from the login node or an interactive job).
ssh -i /path/to/key -l username -L 8888:<hostname>:<network port of the application to connect to from your PC> login.t4.gsic.titech.ac.jp
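For instance, with the hostname r1n1 obtained in step (1) and an application listening on port 8888, the command would look like the following; the key path and user name are placeholders.
ssh -i /path/to/key -l username -L 8888:r1n1:8888 login.t4.gsic.titech.ac.jp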
Tips
SSH port forwarding settings depend on the SSH console (or terminal software) used to SSH into TSUBAME4. For details, please check the manual of each SSH console or refer to the FAQ.
(3) Connect to the application with a web browser.
Launch a web browser (Microsoft Edge, Firefox, Safari, etc.) on the console at hand and go to http://localhost:8888/.
5.2.3. Trial run¶
Info
This feature is available only for TSUBAME account holders. It is designed mainly for Science Tokyo users who can sign up by themselves.
TSUBAME provides the "trial run" feature, in which users can execute jobs without consuming points, for those who are unsure whether TSUBAME suits their research. To use this feature, submit jobs without specifying a group via the -g option. In this case, the job is limited to 2 nodes, 10 minutes of run time, and priority -5 (lowest).
Warning
The trial run feature is only for testing whether your program works or not. Do not use it for production runs or measurements for your research.
It does not mean that you may run jobs free of charge as long as they fit within the limits above.
The trial run function is provided so that users who are unsure whether TSUBAME can be used for their research can check that their programs work before purchasing points. Please do not use trial runs for calculations that deviate significantly from this purpose or that directly produce research results.
Please consider using interactive queue if you wish to perform small calculations for educational purposes in class.
For trial runs, the following restrictions apply to the amount of resources:
Item | Limit |
---|---|
Maximum number of the specified resource type | 2 |
Maximum run time | 10 minutes |
Number of concurrent runs | 1 |
Resource type | No limitation |
For a trial run, submit the job without specifying a TSUBAME group. Note that points are consumed when you submit a job with a TSUBAME group specified.
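As a sketch, a trial-run submission simply omits the -g option; job.sh is a placeholder script that fits within the limits above.
$ qsub job.sh
$ qrsh -l cpu_4=1 -l h_rt=0:10:00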
5.2.4. Subscription job¶
Submitting a job for compute node subscription requires the -q prior option. The other options are the same as for other jobs.
$ qsub -q prior -g [TSUBAME group] SCRIPTFILE
For more details about compute node subscription, check here.
Warning
Note that even for a job from a subscription group, if -q prior is not specified, the job will be processed as a pay-as-you-go job.
5.2.5. Reserving compute nodes¶
It is possible to execute jobs exceeding 24 hours and/or 72 nodes by reserving compute nodes.
The steps for reservation are as follows.
- Make a reservation from TSUBAME portal
- Check reservation status, cancel a reservation from TSUBAME portal
- Submit a job using qsub for reserved node
- Cancel a job using qdel
- Check job result
- Check the reservation status and AR ID from the command line
Please refer to TSUBAME Portal User's Guide "Reserving compute nodes" on reservation from the portal, confirmation of reservation status and cancellation of the reservation.
When the reservation time is reached, the job can be executed in the reservation group account. An example of job submission specifying an AR ID, which is a reservation ID, is shown below.
- Submitting a job to a reserved node with qsub
$ qsub -g [TSUBAME group] -ar [AR ID] SCRIPTFILE
- Submitting an interactive job to a reserved node with qrsh
$ qrsh -g [TSUBAME group] -l [Resource type]=[number] -l h_rt=[maximum run time] -ar [AR ID]
The resource types available for reserved execution are node_f, node_h, node_q, and node_o. Other resource types are not available for reservation.
The qstat command is used to check the status of a job after it has been submitted, and the qdel command is used to delete a job.
Also, the format of the script is the same as that of the normal execution.
Use t4-user-info compute ar to check the reservation status and AR ID from the command line.
xxxxx@login1:~> t4-user-info compute ar
ar_id uid user_name gid group_name state start_date end_date time_hour node_count point return_point
-------------------------------------------------------------------------------------------------------------------------------------------------------
1320 2005 A2901247 2015 tga-red000 r 2023-01-29 12:00:00 2023-01-29 13:00:00 1 1 18000 0
1321 2005 A2901247 2015 tga-red000 r 2023-01-29 13:00:00 2023-01-29 14:00:00 1 1 18000 0
1322 2005 A2901247 2015 tga-red000 w 2023-01-29 14:00:00 2023-02-02 14:00:00 96 1 1728000 1728000
1323 2005 A2901247 2015 tga-red000 r 2023-01-29 14:00:00 2023-02-02 14:00:00 96 1 1728000 1728000
1324 2005 A2901247 2015 tga-red000 r 2023-01-29 15:00:00 2023-01-29 16:00:00 1 17 306000 0
1341 2005 A2901247 2015 tga-red000 w 2023-02-25 12:00:00 2023-02-25 13:00:00 1 18 162000 162000
3112 2004 A2901239 2349 tgz-training r 2023-04-24 12:00:00 2023-04-24 18:00:00 6 20 540000 0
3113 2004 A2901239 2349 tgz-training r 2023-04-25 12:00:00 2023-04-25 18:00:00 6 20 540000 0
3116 2005 A2901247 2015 tga-red000 r 2023-04-18 17:00:00 2023-04-25 16:00:00 167 1 3006000 0
3122 2005 A2901247 2014 tga-blue000 r 2023-04-25 08:00:00 2023-05-02 08:00:00 168 5 15120000 0
3123 2005 A2901247 2014 tga-blue000 r 2023-05-02 08:00:00 2023-05-09 08:00:00 168 5 3780000 0
3301 2005 A2901247 2015 tga-red000 r 2023-08-30 14:00:00 2023-08-31 18:00:00 28 1 504000 0
3302 2005 A2901247 2009 tga-green000 r 2023-08-30 14:00:00 2023-08-31 18:00:00 28 1 504000 0
3304 2005 A2901247 2014 tga-blue000 r 2023-09-03 10:00:00 2023-09-04 10:00:00 24 1 432000 0
3470 2005 A2901247 2014 tga-blue000 w 2023-11-11 22:00:00 2023-11-11 23:00:00 1 1 4500 4500
4148 2004 A2901239 2007 tga-hpe_group00 w 2024-04-12 17:00:00 2024-04-12 18:00:00 1 1 4500 4500
4149 2005 A2901247 2015 tga-red000 w 2024-04-12 17:00:00 2024-04-13 17:00:00 24 1 108000 108000
4150 2004 A2901239 2007 tga-hpe_group00 w 2024-04-12 17:00:00 2024-04-12 18:00:00 1 1 4500 4500
-------------------------------------------------------------------------------------------------------------------------------------------------------
total : 818 97 28507500 3739500
Use t4-user-info compute ars to check reservation availability for the current month from the command line.
5.2.6. SSH login to compute nodes¶
Compute nodes allocated to jobs submitted with resource type node_f can be logged in to directly via ssh. The allocated nodes can be checked with the following procedure.
$ qstat -j 1463
==============================================================
job_number: 1463
(snip)
exec_host_list 1: r8n3:28, r8n4:28 <-- assigned nodes: r8n3, r8n4
(snip)
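You can then log in directly to one of the assigned nodes, for example:
$ ssh r8n3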
Info
When you ssh into a compute node, the GID of the ssh session is set to tsubame-users (2000), so immediately after ssh you cannot see your running job processes (except for trial run jobs) and cannot attach to them with gdb. To make them visible, execute one of the following after ssh, using the name of the group that executed the job.
newgrp <groupname>
or
sg <groupname>
5.3. Interactive job queue¶
An interactive job queue is an environment where the same resource is shared among users, making it less likely to fail to secure a node even when congested, and providing a quick start to visualization and interactive programs.
Submit jobs to the interactive job queue as follows:
Info
Interactive job queue is available free of charge only for intramural users (tgz-edu) and access card users.
To use interactive job queue free of charge, please submit a job without the -g [TSUBAME group] option.
Please note that you will be charged for the target group if the -g option is specified.
iqrsh -g [TSUBAME group] -l h_rt=<maximum run time>
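To use the interactive job queue free of charge (for eligible users), omit the -g option; the run time below is only an example.
iqrsh -l h_rt=1:00:00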
Note that CPU/GPU over-commitment is allowed in the interactive job queue.
Limits for the interactive job queue are described here.
5.4. Storage use on Compute Nodes¶
5.4.1. Local scratch area¶
The SSD on each compute node can be used as a local scratch area.
Using the local scratch area gives the fastest file access, since no access across nodes is involved.
Note, however, that the file transfer process from the group disk or home directory to the local scratch area must be performed at the beginning of the job, which may result in slower execution time if the number of files to be transferred is large or the access frequency is low.
Files in the local scratch area are deleted at the end of the job, so users should explicitly save necessary files in the Home directory or group disk.
The capacity of local scratch area in each resource type is described here.
The local scratch area is set to /local/${JOB_ID} (or /local/${JOB_ID}${SGE_TASK_ID} for array jobs). It can be referenced by specifying the path of the work area in the job script.
Since the local scratch area is a separate area for each compute node and is not shared, input and output files from within the job script must be staged on the local host.
The following example copies the input data set from the home directory to the local scratch area and copies the results back to the home directory; it assumes that only one compute node is used. (Multiple nodes are not supported.)
#!/bin/sh
# copy input data to the scratch
cp -rp $HOME/datasets /local/${JOB_ID}
# run your program that uses input/output data
./a.out /local/${JOB_ID}/datasets /local/${JOB_ID}/results
# copy the result back to your home directory
cp -rp /local/${JOB_ID}/results $HOME/results
Tips
/local/${JOB_ID} will be deleted when the job completes.