5. Job Scheduler

The system's job scheduling uses Altair Grid Engine, which efficiently schedules single and parallel jobs according to priority and required resources.

5.1. Job Scheduling System Configuration

TSUBAME4.0 provides the following queue types/job types for different purposes.

Queue type             Job type         Command  Resource types
Normal queue           Batch job        qsub     Selectable
Normal queue           Interactive job  qrsh     Selectable
Interactive job queue  Interactive job  iqrsh    Fixed

[Queue types]

  • Normal queue
    System resources are allocated and occupied in units of logically divided resource types.
    This queue is normally used. Node reservation is also available.
  • Interactive job queue
    An interactive job queue is an environment where the same resources are shared among users. This makes it less likely that securing a node fails even when the system is congested, and provides a quick start for visualization and interactive programs.
    It is intended for programs with intermittent processor usage, such as debuggers, visualizers, and Jupyter Lab. Do not use this service for programs that occupy processors continuously.
    Please be sure to read List of various limits before use.

Info

Interactive job queue is available free of charge only for intramural users (tgz-edu) and access card users.
See Interactive job queue for usage.

[Job types]

  • Batch job
    Create and submit a job script. See Batch job for details.
  • Interactive job
    Execute a program or shell script interactively. Usage differs depending on the queue type.
    When using "Normal queue", please refer to Interactive job.
    When using "Interactive job queue", please refer to Interactive job queue.

[Resource types]

  • Normal queue
    Please refer to Resource types for the available resource types.
  • Interactive job queue
    The assigned resources are 24 physical CPU cores, 96 GB of memory, and 1 MIG instance, but up to 12 users share the same resources.

5.1.1. Resource types

This system uses resource types in which compute nodes are logically divided to reserve system resources.

When submitting a job, specify the resource type and the number of units to use (e.g., -l node_f=2). The available resource types are listed below.

Resource type  Physical CPU cores  Memory (GB)  GPUs  Local scratch area (GB)
node_f         192                 768          4     1920
node_h         96                  384          2     960
node_q         48                  192          1     480
node_o         24                  96           1/2   240
gpu_1          8                   96           1     240
gpu_h          4                   48           1/2   120
cpu_160        160                 368          0     960
cpu_80         80                  184          0     480
cpu_40         40                  92           0     240
cpu_16         16                  36.8         0     96
cpu_8          8                   18.4         0     48
cpu_4          4                   9.2          0     24
  • "Physical CPU cores", "Memory (GB), "GPUs" are the available resources per resource type.
  • Same resource type can be used with [Resource type]=[Num]. Different resource types combinations are not available.
  • Maximum run time is 24 hours.
  • TSUBAME4 has various limits such as "the number of jobs that can be executed at the same time" and "the total number of slots that can be executed". (slots=Physical cores in each resoruce type x nodes number (same as slots of qstat))

Current limits are shown below:

https://www.t4.gsic.titech.ac.jp/en/resource-limit

Note that the limits may change at any time depending on system usage.

5.2. Normal queue

5.2.1. Batch job

To run a job on this system, log in to a login node and execute the qsub command.

5.2.1.1. Job submission flow

To submit a job, create and submit a job script. The submission command is qsub.

  • Create a job script
  • Submit a job using qsub
  • Status check using qstat
  • Cancel a job using qdel
  • Check job result

The qsub command confirms billing information (TSUBAME points) and accepts jobs.
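
As a minimal sketch of this flow (the group name tga-example, the script name job.sh and the job ID 1234 are placeholders; the response line is typical Grid Engine output):

$ qsub -g tga-example job.sh
Your job 1234 ("job.sh") has been submitted
$ qstat
$ qdel 1234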

5.2.1.2. Creating job script

Here is a job script format:

#!/bin/sh
#$ -cwd
#$ -l [Resource type]=[Number]
#$ -l h_rt=[Maximum run time]
#$ -p [Priority]

[Load relevant modules required for the job]

[Your program]

Warning

The shebang line (#!/bin/sh) must be the first line of the job script.

  • [Load relevant modules required for the job] Load the modules that your job requires. For example, the Intel compiler is loaded as below:
module load intel
  • [Your program]
    Execute your program. Running the a.out binary looks like below:
./a.out

Resource type can be set on the command line or in the comment block (#$) at the top of the job script file.

Be sure to set the resource type and the maximum run time, both of which are required to submit a job.

The main options for the qsub command are listed below.

Option                       Description
-l [Resource type]=[Number]  (Required) Specify the resource type and the number of units.
-l h_rt=[Maximum run time]   (Required) Specify the maximum run time (hours, minutes and seconds).
                             [[HH:]MM:]SS, i.e. HH:MM:SS, MM:SS or SS
-N                           Name of the job (script file name if not specified)
-o                           Name of the standard output file
-e                           Name of the standard error output file
-j y                         Merge the standard error output into the standard output file.
-m                           Send email when the job begins, ends or aborts. The conditions for the -m argument are:
                             a: mail is sent when the job is aborted.
                             b: mail is sent when the job begins.
                             e: mail is sent when the job ends.
                             They can also be combined, e.g. abe.
                             When a large number of jobs with the mail option are submitted, a large amount of mail is also sent. This puts a heavy load on the mail server, may be detected as an attack, and may cause mail from Science Tokyo to be blocked. If you need to execute such jobs, please remove the mail option or rewrite the script so that everything runs in one job.
-p                           (Premium options) Specify the job execution priority. If -4 or -3 is specified, a higher charge factor than -5 is applied. The setting values -5, -4, -3 correspond to priorities 0, 1, 2 of the charging rule.
                             -5: Standard execution priority. (Default)
                             -4: Execution priority higher than -5 and lower than -3.
                             -3: Highest execution priority.
                             Note that all priority values are negative. Do not forget the preceding minus sign.
-t                           Submits an array job.
                             Specified as start-end[:step].
-hold_jid                    Defines the job dependency list of the submitted job.
                             The job is executed after the specified dependent jobs have finished.
-ar                          Specify the AR ID of the reservation when using reserved nodes.

Note that the -V option, which passes environment variables from the job submission environment, is not available on this system.
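
Putting several of these options together, a submission command might look like the following (the group name tga-example and the script job.sh are placeholders; -p -4 raises the priority at a higher charge factor):

$ qsub -g tga-example -l cpu_4=1 -l h_rt=2:00:00 -N myjob -p -4 job.sh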

5.2.1.3. Job script examples

5.2.1.3.1. Single job/GPU job

The following is an example of a job script for executing a single (non-parallelized) job or a GPU job.

For a GPU job, replace -l cpu_4=1 with -l gpu_1=1 and load the required modules such as the CUDA environment.

#!/bin/sh
# run in current working directory
#$ -cwd

#$ -l cpu_4=1
# maximum run time
#$ -l h_rt=1:00:00
#$ -N serial

# load cuda module
module load cuda
# load Intel Compiler
module load intel
./a.out

5.2.1.3.2. SMP job

An example of a job script created when executing an SMP parallel job is shown below. Hyper-threading is enabled for compute nodes. Please explicitly specify the number of threads to use.

#!/bin/sh
#$ -cwd
# node_f 1 node
#$ -l node_f=1
#$ -l h_rt=1:00:00
#$ -N openmp

module load cuda
module load intel
# 192 threads per node
export OMP_NUM_THREADS=192
./a.out

5.2.1.3.3. MPI job

An example of a job script for executing an MPI parallel job is shown below. Please load an MPI environment that matches the MPI library your program uses, as in the examples below. For OpenMPI, you need to specify -x LD_LIBRARY_PATH to pass library environment variables to all nodes.

Intel MPI

#!/bin/sh
#$ -cwd
# node_f x 4 
#$ -l node_f=4
#$ -l h_rt=1:00:00
#$ -N flatmpi

module load cuda
module load intel
# Intel MPI
module load intel-mpi
# 8 processes per node, 32 MPI processes in total
mpiexec.hydra -ppn 8 -n 32 ./a.out

OpenMPI

#!/bin/sh
#$ -cwd
# node_f x 4
#$ -l node_f=4
#$ -l h_rt=1:00:00
#$ -N flatmpi

# Load OpenMPI: Intel compiler, CUDA are loaded automatically
module load openmpi/5.0.2-intel
# 8 processes per node, 32 MPI processes in total
mpirun -npernode 8 -n 32 -x LD_LIBRARY_PATH ./a.out

The file listing the nodes assigned to the submitted job can be referenced via $PE_HOSTFILE.

$ echo $PE_HOSTFILE  
/var/spool/age/r15n10/active_jobs/1407687.1/pe_hostfile  
$ cat /var/spool/age/r15n10/active_jobs/1407687.1/pe_hostfile  
r15n10 24 all.q@r15n10 <NULL>  
r20n11 24 all.q@r20n11 <NULL>  
r20n10 24 all.q@r20n10 <NULL>  
r23n9 24 all.q@r23n9 <NULL>  
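
As a simple usage sketch, the host names alone can be extracted from this file inside a job script (the output file name hostlist.txt is arbitrary):

# extract the first column (host names) of the assigned node list
awk '{print $1}' $PE_HOSTFILE > hostlist.txt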

5.2.1.3.4. Hybrid parallel (Hybrid, MPI+OpenMP)

An example of a job script for executing a process/thread parallel (hybrid) job is shown below. Please load an MPI environment that matches the MPI library your program uses, as in the examples below. For OpenMPI, you need to specify -x LD_LIBRARY_PATH to pass library environment variables to all nodes.

Intel MPI

#!/bin/sh
#$ -cwd
# node_f x 4
#$ -l node_f=4
#$ -l h_rt=1:00:00
#$ -N hybrid

module load cuda
module load intel
module load intel-mpi
# 192 threads per node
export OMP_NUM_THREADS=192
# 1 MPI process per node, 4 MPI processes in total
mpiexec.hydra -ppn 1 -n 4 ./a.out

OpenMPI

#!/bin/sh
#$ -cwd
# node_f x 4
#$ -l node_f=4
#$ -l h_rt=1:00:00
#$ -N hybrid

# Open MPI: Intel compiler, CUDA are loaded automatically
module load openmpi/5.0.2-intel
# 192 threads per node 
export OMP_NUM_THREADS=192
# 1 MPI process per node, 4 MPI processes in total
mpirun -npernode 1 -n 4 -x LD_LIBRARY_PATH ./a.out

5.2.1.4. Job submission

A job is queued and executed by passing the job script to the qsub command. You can submit a job using qsub as follows.

$ qsub -g [TSUBAME group] SCRIPTFILE
Option    Description
-g        Specify the TSUBAME group name.
          Add it as a qsub command option, not in the script.
-q prior  Subscription job.
          Waits at most one hour before execution.

5.2.1.5. Job status

qstat is the job status display command.

$ qstat [option]

The options of qstat are listed below.

Option       Description
-r           Displays job resource information
-j [job-ID]  Displays additional information about the job

Here is an example of qstat:

$ qstat
job-ID  prior    name       user      state  submit/start at      queue        jclass    slots  ja-task-ID
----------------------------------------------------------------------------------
   307  0.55500  sample.sh  testuser  r      02/12/2023 17:48:10  all.q@r8n1A  .default     32
(snip)
Item             Description
job-ID           Job ID number
prior            Priority of the job
name             Name of the job
user             ID of the user who submitted the job
state            State of the job
                 r: running
                 qw: waiting in the queue
                 h: on hold
                 d: deleting
                 t: a transition, such as during job start
                 s: suspended
                 S: suspended by the queue
                 T: has reached the limit of the tail
                 E: error
                 Rq: rescheduled and waiting to run
                 Rr: rescheduled and running
submit/start at  Submission or start date and time of the job
queue            Queue name
jclass           Job class name
slots            Number of slots the job occupies
                 (slots = physical cores of the resource type x number of nodes)
ja-task-ID       Array job task ID

5.2.1.6. Deleting job

To delete your job, use the qdel command.

$ qdel [Job-ID]

An example of deleting a job is shown below:

$ qstat
job-ID  prior    name       user      state  submit/start at      queue        jclass    slots  ja-task-ID
----------------------------------------------------------------------------------
   307  0.55500  sample.sh  testuser  r      02/12/2023 17:48:10  all.q@r8n1A  .default     32

$ qdel 307
testuser has registered the job 307 for deletion

$ qstat
job-ID  prior    name       user      state  submit/start at      queue        jclass    slots  ja-task-ID
----------------------------------------------------------------------------------

5.2.1.7. Job result

The standard output of AGE jobs is stored in the file "[script file name].o[job ID]" in the directory where the job was executed. The standard error output is stored in "[script file name].e[job ID]".

5.2.1.8. Array job

An array job is a function for repeatedly executing the operations contained in a job script, parameterized by a task number.

Each job executed within an array job is called a task and is managed by a task ID. A job ID specified without a task ID refers to all tasks of the array job.

Info

Each task in an array job is scheduled as a separate job, so the schedule wait time is proportional to the number of tasks.
If each task is short or the number of tasks is large, it is strongly recommended to reduce the number of tasks by grouping several tasks together.
Example: 10000 tasks are combined into 100 tasks, each of which processes 100 of the original tasks.

The task numbers can be specified as an option of the qsub command or defined in the job script. The submission option is specified as -t (start number)-(end number):(step size). If the step size is 1, it can be omitted. An example is shown below.

# describe below in a job script
#$ -t 2-10:2

The above example (2-10:2) specifies a start number of 2, an end number of 10, a step size of 2 (one skipped index), and consists of five tasks with task numbers 2, 4, 6, 8, and 10.

The task number of each task is set in the environment variable $SGE_TASK_ID, so this environment variable can be used in the job script for parameter studies. The result files are named with the job name followed by the task ID.
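
As a minimal sketch combining the -t option and $SGE_TASK_ID, the following job script groups 10000 inputs into 100 tasks of 100 inputs each, as recommended above (the input naming input_N.dat and the program ./a.out are placeholders):

#!/bin/sh
#$ -cwd
#$ -l cpu_4=1
#$ -l h_rt=1:00:00
# 100 tasks with task numbers 1, 101, 201, ..., 9901
#$ -t 1-10000:100

# each task processes the 100 inputs starting at its own task number
START=$SGE_TASK_ID
END=$((SGE_TASK_ID + 99))
for i in $(seq $START $END); do
    ./a.out input_${i}.dat
done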

Also, if you want to delete a specific task before or while it runs, use the -t option of qdel as follows.

$ qdel [Job-ID] -t [Task-ID]

5.2.2. Interactive job

The system's job scheduler has the ability to run programs and shell scripts interactively. To run an interactive job, use the qrsh command and specify the resource type and elapsed time with -l. After submitting a job with qrsh, a command prompt will be returned when the job is dispatched. The following shows how to use an interactive job.

$ qrsh -g [TSUBAME group] -l [Resource type]=[number] -l h_rt=[maximum run time]
Directory: /home/N/username
(job start time)
username@rXnY:~> [commands to run]
username@rXnY:~> exit

If the -g option is not specified, the job becomes a "trial run", limited to at most two units of a resource type, at most 10 minutes of elapsed time, and priority -5.

Example specifying resource type node_f, 1 node, and an elapsed time of 10 minutes:

$ qrsh -g [TSUBAME group] -l node_f=1 -l h_rt=0:10:00
Directory: /home/N/username
(job start time)
username@rXnY:~> [commands to run]
username@rXnY:~> exit

Entering exit stops the interactive job.

5.2.2.1. X forwarding using interactive nodes

To perform X forwarding directly from a node connected by qrsh, please follow the procedure below.

  1. Enable X forwarding and connect to login node with ssh.
  2. Execute the qrsh command with X11 forwarding as in the following example.

Example)

The example shows a 2-hour job with resource type cpu_4 and 1 node.

The assigned node is dispatched by the scheduler from the free nodes at the time the command is executed, so you cannot explicitly specify a node.

# Execution of qrsh command
$ qrsh -g [TSUBAME group] -l cpu_4=1 -l h_rt=2:00:00
username@rXnY:~> module load [application modules to load]
username@rXnY:~> [command to run X11 application]
username@rXnY:~> exit

Info

For X forwarding using interactive nodes, Open OnDemand is also available.

5.2.2.2. Connecting to the network applications

If you need to operate an application through a Web browser or similar, you can use SSH port forwarding to access the application from a Web browser on your local machine.

(1) Get hostname of interactive node connected by qrsh

$  qrsh -g tga-hpe_group00 -l cpu_4=1 -l h_rt=0:10:00
$ hostname
r1n1
$ [execute the program that requires Web browser]

After starting an interactive job with qrsh, get the hostname of the machine. In the above example, r1n1 is the hostname. Nothing further needs to be done in this console, but keep the session open until you have finished working with the application.

(2) From the console you use to ssh into TSUBAME (not on the login node or in the interactive job), connect with SSH port forwarding enabled.

ssh -i /path/to/key -l username -L 8888:<hostname>:<network port of the application to connect to> login.t4.gsic.titech.ac.jp
The network port of the application to be connected is different for each application. For details, please refer to the documentation of each application or the application's startup message.
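
For example, assuming the interactive node r1n1 obtained in step (1) and an application listening on port 8888 (replace both with your actual hostname and port):

ssh -i /path/to/key -l username -L 8888:r1n1:8888 login.t4.gsic.titech.ac.jp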

Tips

SSH port forwarding settings depend on the SSH console (or terminal software) used to SSH into TSUBAME4. For details, please check the manual of each SSH console or refer to the FAQ.

(3) Connect to the application with a web browser.

Launch a web browser (Microsoft Edge, Firefox, Safari, etc.) on the console at hand and go to http://localhost:8888/.

5.2.3. Trial run

Info

This feature is available only for TSUBAME account holders. It is designed mainly for Science Tokyo users who can sign up by themselves.

TSUBAME provides the "trial run" feature, in which users can execute jobs without consuming points, for those who are anxious whether TSUBAME applies to their research or not. To use this feature, submit jobs without specifying a group via -g option. In this case, the job is limited to 2 nodes, 10 minutes of running time, and priority -5 (worst).

Warning

The trial run feature is only for testing whether your program works or not. Do not use it for production runs or measurements for your research.
It does not mean that you can execute jobs free of charge whenever the job size meets the limitations written above.

The trial run function is provided so that users who are not sure whether they can use TSUBAME for their own research can check that their programs work before purchasing points. Please do not use the trial run for calculations that deviate significantly from this purpose or that directly produce research results.

Please consider using interactive queue if you wish to perform small calculations for educational purposes in class.

For trial runs, the following restrictions apply to the amount of resources:

Maximum number of units of the resource type  2
Maximum run time                              10 minutes
Number of concurrent runs                     1
Resource type                                 No limitation

For a trial run, the job must be submitted without specifying a TSUBAME group. Note that points are consumed when you submit a job with a TSUBAME group.

5.2.4. Subscription job

Submitting a job under a compute node subscription requires the -q prior option. The other options are the same as for other jobs.

$ qsub -q prior -g [TSUBAME group] SCRIPTFILE

For more details about compute node subscription, check here.

Warning

Note that even for a job from a subscription group, if -q prior is not specified, the job is processed as a pay-as-you-go job.

5.2.5. Reserving compute nodes

It is possible to execute jobs exceeding 24 hours and/or 72 nodes by reserving computation nodes.

The steps for reservation are as follows.

  • Make a reservation from TSUBAME portal
  • Check reservation status, cancel a reservation from TSUBAME portal
  • Submit a job using qsub for reserved node
  • Cancel a job using qdel
  • Check job result
  • Check the reservation status and AR ID from the command line

Please refer to the TSUBAME Portal User's Guide "Reserving compute nodes" for making a reservation from the portal, checking the reservation status, and canceling a reservation.

When the reservation time arrives, jobs can be executed under the reservation group's account. An example of job submission specifying an AR ID, which is a reservation ID, is shown below.

  • Submitting a job to a reserved node with qsub
$ qsub -g [TSUBAME group] -ar [AR ID] SCRIPTFILE
  • Submitting an interactive job to a reserved node with qrsh
$ qrsh -g [TSUBAME group] -l [Resource type]=[number] -l h_rt=[maximum run time] -ar [AR ID]

The resource types available for reserved execution are node_f, node_h, node_q, and node_o. Other resource types are not available for reservation.

The qstat command is used to check the status of a job after it has been submitted, and the qdel command is used to delete a job.

Also, the format of the script is the same as that of the normal execution.

Use t4-user-info compute ar to check the reservation status and AR ID from the command line.

xxxxx@login1:~> t4-user-info compute ar
ar_id   uid user_name         gid group_name                state     start_date           end_date        time_hour node_count      point return_point
-------------------------------------------------------------------------------------------------------------------------------------------------------
 1320  2005 A2901247         2015 tga-red000                  r   2023-01-29 12:00:00 2023-01-29 13:00:00          1          1      18000            0
 1321  2005 A2901247         2015 tga-red000                  r   2023-01-29 13:00:00 2023-01-29 14:00:00          1          1      18000            0
 1322  2005 A2901247         2015 tga-red000                  w   2023-01-29 14:00:00 2023-02-02 14:00:00         96          1    1728000      1728000
 1323  2005 A2901247         2015 tga-red000                  r   2023-01-29 14:00:00 2023-02-02 14:00:00         96          1    1728000      1728000
 1324  2005 A2901247         2015 tga-red000                  r   2023-01-29 15:00:00 2023-01-29 16:00:00          1         17     306000            0
 1341  2005 A2901247         2015 tga-red000                  w   2023-02-25 12:00:00 2023-02-25 13:00:00          1         18     162000       162000
 3112  2004 A2901239         2349 tgz-training                r   2023-04-24 12:00:00 2023-04-24 18:00:00          6         20     540000            0
 3113  2004 A2901239         2349 tgz-training                r   2023-04-25 12:00:00 2023-04-25 18:00:00          6         20     540000            0
 3116  2005 A2901247         2015 tga-red000                  r   2023-04-18 17:00:00 2023-04-25 16:00:00        167          1    3006000            0
 3122  2005 A2901247         2014 tga-blue000                 r   2023-04-25 08:00:00 2023-05-02 08:00:00        168          5   15120000            0
 3123  2005 A2901247         2014 tga-blue000                 r   2023-05-02 08:00:00 2023-05-09 08:00:00        168          5    3780000            0
 3301  2005 A2901247         2015 tga-red000                  r   2023-08-30 14:00:00 2023-08-31 18:00:00         28          1     504000            0
 3302  2005 A2901247         2009 tga-green000                r   2023-08-30 14:00:00 2023-08-31 18:00:00         28          1     504000            0
 3304  2005 A2901247         2014 tga-blue000                 r   2023-09-03 10:00:00 2023-09-04 10:00:00         24          1     432000            0
 3470  2005 A2901247         2014 tga-blue000                 w   2023-11-11 22:00:00 2023-11-11 23:00:00          1          1       4500         4500
 4148  2004 A2901239         2007 tga-hpe_group00             w   2024-04-12 17:00:00 2024-04-12 18:00:00          1          1       4500         4500
 4149  2005 A2901247         2015 tga-red000                  w   2024-04-12 17:00:00 2024-04-13 17:00:00         24          1     108000       108000
 4150  2004 A2901239         2007 tga-hpe_group00             w   2024-04-12 17:00:00 2024-04-12 18:00:00          1          1       4500         4500
-------------------------------------------------------------------------------------------------------------------------------------------------------
total :                                                                                                          818         97   28507500      3739500

Use t4-user-info compute ars to check reservation availability for the current month from the command line.

5.2.6. SSH login to compute nodes

Nodes allocated to jobs submitted with resource type node_f can be logged in to directly via ssh. The allocated nodes can be checked by the following procedure.

$ qstat -j 1463
==============================================================
job_number:                 1463
(snip)
exec_host_list        1:    r8n3:28, r8n4:28       <-- assigned nodes: r8n3, r8n4
(snip)
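
For example, once the output above shows r8n3 as an assigned node, you can log in to it directly from a login node:

$ ssh r8n3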

Info

When you ssh into a compute node, the GID of the ssh session is set to tsubame-users (2000). Except for trial runs, this means that right after ssh you cannot see your running job processes and cannot attach to them with gdb. To make them visible, execute one of the following after ssh, using the name of the group that executed the job.

newgrp <groupname>
or
sg <groupname>

5.3. Interactive job queue

An interactive job queue is an environment where the same resources are shared among users. This makes it less likely that securing a node fails even when the system is congested, and provides a quick start for visualization and interactive programs.

Jobs are submitted to the interactive job queue as follows.

Info

Interactive job queue is available free of charge only for intramural users (tgz-edu) and access card users.
To use interactive job queue free of charge, please submit a job without the -g [TSUBAME group] option.
Please note that you will be charged for the target group if the -g option is specified.

iqrsh -g [TSUBAME group] -l h_rt=<maximum run time>

Note that CPU/GPU over-commitment is allowed in the interactive job queue.
The limits for the interactive job queue are described here.

5.4. Storage use on Compute Nodes

5.4.1. Local scratch area

SSD on each compute node can be used as a local scratch area.

Using the local scratch area provides the fastest file access, with no traffic between nodes.
Note, however, that files must be transferred from the group disk or home directory to the local scratch area at the beginning of the job; if many files are transferred or they are accessed infrequently, total execution time may actually increase.
Files in the local scratch area are deleted at the end of the job, so be sure to explicitly save the necessary files to your home directory or group disk.

The capacity of the local scratch area for each resource type is described here.

The local scratch area is set to /local/${JOB_ID} (or /local/${JOB_ID}${SGE_TASK_ID} for array jobs). It can be referenced by specifying the path of the work area in the job script.

Since the local scratch area is a separate area on each compute node and is not shared, input and output files used in the job script must be staged on each local host.

The following example copies the input data set from the home directory to the local scratch area and copies the results back to the home directory, for the case where only one compute node is used. (Multiple nodes are not supported.)

#!/bin/sh
# copy input data to the scratch
cp -rp $HOME/datasets /local/${JOB_ID}
# run your program that uses input/output data
./a.out /local/${JOB_ID}/datasets /local/${JOB_ID}/results
# copy the result back to your home directory
cp -rp /local/${JOB_ID}/results $HOME/results

Tips

/local/${JOB_ID} will be deleted when the job completes.