SGE Grid Job Dependency
Job dependencies in SGE (Sun Grid Engine), or in any other grid engine, can be described as a DAG (Directed Acyclic Graph). With the open-source Graphviz, it is easy to document these dependencies in the DOT language.
Below is a sample DOT file, in which an edge a -> b means that job a waits for job b to finish:
$ cat job-dep.dot
digraph jobs101 {
job_1 -> job_11;
job_1 -> job_12;
job_1 -> job_13;
job_11 -> job_111;
job_12 -> job_111;
job_2 -> job_13;
job_2 -> job_21;
job_3 -> job_21;
job_3 -> job_31;
}
With this DOT file, one can generate the graphical representation:
$ dot -Tpng -o job-dep.png job-dep.dot
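Since the holds can only resolve if the graph really is acyclic, it may be worth checking the DOT file before submitting anything. Below is a minimal sketch, assuming the file contains only simple "a -> b;" edges as in the sample and that the standard tsort utility is installed. The function name check_dag is hypothetical.

```shell
#!/bin/sh
# check_dag (hypothetical helper): report whether the simple "a -> b;" edges
# in a DOT file form an acyclic graph. Assumes one edge per line.
check_dag() {
    # Turn each "parent -> child;" line into a "parent child" pair.
    edges=$(sed -n 's/^[[:space:]]*\([A-Za-z0-9_][A-Za-z0-9_]*\)[[:space:]]*->[[:space:]]*\([A-Za-z0-9_][A-Za-z0-9_]*\).*/\1 \2/p' "$1")
    # tsort complains on stderr when the input contains a loop, so capture
    # stderr only and discard the sorted output.
    # Note: tsort treats a pair like "a a" as a plain node, so a self-edge
    # is not flagged by this check.
    err=$(printf '%s\n' "$edges" | tsort 2>&1 >/dev/null)
    if [ -z "$err" ]; then
        echo "acyclic"
    else
        echo "cycle"
        return 1
    fi
}
```

Calling check_dag job-dep.dot on the sample above should print "acyclic"; adding an edge such as job_111 -> job_1 would make it print "cycle".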
It is also possible to derive the corresponding SGE commands with the following Tcl script:
$ cat ./dot2sge.tcl
#! /usr/local/bin/tclsh
if { $argc != 1 } {
    puts stderr "Usage: $argv0 <dotfile>"
    exit 1
}
set dotfile [lindex $argv 0]
if { [file exists $dotfile] == 0 } {
    puts stderr "Error. $dotfile does not exist"
    exit 2
}
# assume a simple directed graph where "a -> b" means job a waits for job b
set fp [open $dotfile r]
set data [read $fp]
close $fp
set sge_jobs {}
# treat the file as a Tcl list: "digraph", the graph name, then the graph body
foreach i [split [lindex $data 2] {;}] {
    if { [regexp {(\w+)\s*->\s*(\w+)} $i x parent child] != 0 } {
        lappend sge_jobs $parent
        lappend sge_jobs $child
        lappend sge_job_rel($parent) $child
    }
}
# submit unique jobs, and hold
set queue all.q
set sge_unique_jobs [lsort -unique $sge_jobs]
foreach i $sge_unique_jobs {
    puts "qsub -h -q $queue -N $i job-submit.sh"
}
# alter the job dependency, but unhold after all the hold relationships are
# established
foreach i $sge_unique_jobs {
    if { [info exists sge_job_rel($i)] } {
        # with dependency
        puts "qalter -hold_jid [join $sge_job_rel($i) {,}] $i"
    }
}
foreach i $sge_unique_jobs {
    puts "qalter -h U $i"
}
Run this Tcl script to generate the SGE submission commands and the alteration commands that register the job dependencies:
$ ./dot2sge.tcl job-dep.dot
qsub -h -q all.q -N job_1 job-submit.sh
qsub -h -q all.q -N job_11 job-submit.sh
qsub -h -q all.q -N job_111 job-submit.sh
qsub -h -q all.q -N job_12 job-submit.sh
qsub -h -q all.q -N job_13 job-submit.sh
qsub -h -q all.q -N job_2 job-submit.sh
qsub -h -q all.q -N job_21 job-submit.sh
qsub -h -q all.q -N job_3 job-submit.sh
qsub -h -q all.q -N job_31 job-submit.sh
qalter -hold_jid job_11,job_12,job_13 job_1
qalter -hold_jid job_111 job_11
qalter -hold_jid job_111 job_12
qalter -hold_jid job_13,job_21 job_2
qalter -hold_jid job_21,job_31 job_3
qalter -h U job_1
qalter -h U job_11
qalter -h U job_111
qalter -h U job_12
qalter -h U job_13
qalter -h U job_2
qalter -h U job_21
qalter -h U job_3
qalter -h U job_31
Below is the proof-of-concept in action. So sit back....
#
# ----------below is a very simple script
#
$ cat job-submit.sh
#! /bin/sh
#$ -S /bin/sh
date
sleep 10
#
# ----------run all the qsub to submit jobs, but put them on hold
#
$ qsub -h -q all.q -N job_1 job-submit.sh
Your job 333 ("job_1") has been submitted.
$ qsub -h -q all.q -N job_11 job-submit.sh
Your job 334 ("job_11") has been submitted.
$ qsub -h -q all.q -N job_111 job-submit.sh
Your job 335 ("job_111") has been submitted.
$ qsub -h -q all.q -N job_12 job-submit.sh
Your job 336 ("job_12") has been submitted.
$ qsub -h -q all.q -N job_13 job-submit.sh
Your job 337 ("job_13") has been submitted.
$ qsub -h -q all.q -N job_2 job-submit.sh
Your job 338 ("job_2") has been submitted.
$ qsub -h -q all.q -N job_21 job-submit.sh
Your job 339 ("job_21") has been submitted.
$ qsub -h -q all.q -N job_3 job-submit.sh
Your job 340 ("job_3") has been submitted.
$ qsub -h -q all.q -N job_31 job-submit.sh
Your job 341 ("job_31") has been submitted.
#
# ----------show the status, all jobs are in hold position (hqw)
#
$ qstat -f
queuename qtype used/tot. load_avg arch states
----------------------------------------------------------------------------
all.q@sgeexec0 BIP 0/4 0.01 sol-amd64
----------------------------------------------------------------------------
all.q@sgeexec1 BIP 0/4 0.01 sol-amd64
----------------------------------------------------------------------------
all.q@sgeexec2 BIP 0/4 0.01 sol-amd64
############################################################################
- PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
333 0.00000 job_1 chihung hqw 07/19/2007 21:04:34 1
334 0.00000 job_11 chihung hqw 07/19/2007 21:04:34 1
335 0.00000 job_111 chihung hqw 07/19/2007 21:04:34 1
336 0.00000 job_12 chihung hqw 07/19/2007 21:04:34 1
337 0.00000 job_13 chihung hqw 07/19/2007 21:04:34 1
338 0.00000 job_2 chihung hqw 07/19/2007 21:04:34 1
339 0.00000 job_21 chihung hqw 07/19/2007 21:04:34 1
340 0.00000 job_3 chihung hqw 07/19/2007 21:04:34 1
341 0.00000 job_31 chihung hqw 07/19/2007 21:04:34 1
#
# ----------register the job dependency
#
$ qalter -hold_jid job_11,job_12,job_13 job_1
modified job id hold list of job 333
blocking jobs: 334,336,337
exited jobs: NONE
$ qalter -hold_jid job_111 job_11
modified job id hold list of job 334
blocking jobs: 335
exited jobs: NONE
$ qalter -hold_jid job_111 job_12
modified job id hold list of job 336
blocking jobs: 335
exited jobs: NONE
$ qalter -hold_jid job_13,job_21 job_2
modified job id hold list of job 338
blocking jobs: 337,339
exited jobs: NONE
$ qalter -hold_jid job_21,job_31 job_3
modified job id hold list of job 340
blocking jobs: 339,341
exited jobs: NONE
#
# ----------release all the holds and let SGE sort itself out
#
$ qalter -h U job_1
modified hold of job 333
$ qalter -h U job_11
modified hold of job 334
$ qalter -h U job_111
modified hold of job 335
$ qalter -h U job_12
modified hold of job 336
$ qalter -h U job_13
modified hold of job 337
$ qalter -h U job_2
modified hold of job 338
$ qalter -h U job_21
modified hold of job 339
$ qalter -h U job_3
modified hold of job 340
$ qalter -h U job_31
modified hold of job 341
#
# ----------query SGE stats
#
$ qstat -f
queuename qtype used/tot. load_avg arch states
----------------------------------------------------------------------------
all.q@sgeexec0 BIP 0/4 0.01 sol-amd64
----------------------------------------------------------------------------
all.q@sgeexec1 BIP 0/4 0.01 sol-amd64
----------------------------------------------------------------------------
all.q@sgeexec2 BIP 0/4 0.01 sol-amd64
############################################################################
- PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
333 0.00000 job_1 chihung hqw 07/19/2007 21:04:34 1
334 0.00000 job_11 chihung hqw 07/19/2007 21:04:34 1
335 0.00000 job_111 chihung qw 07/19/2007 21:04:34 1
336 0.00000 job_12 chihung hqw 07/19/2007 21:04:34 1
337 0.00000 job_13 chihung qw 07/19/2007 21:04:34 1
338 0.00000 job_2 chihung hqw 07/19/2007 21:04:34 1
339 0.00000 job_21 chihung qw 07/19/2007 21:04:34 1
340 0.00000 job_3 chihung hqw 07/19/2007 21:04:34 1
341 0.00000 job_31 chihung qw 07/19/2007 21:04:34 1
#
# ----------some jobs started to run
#
$ qstat -f
queuename qtype used/tot. load_avg arch states
----------------------------------------------------------------------------
all.q@sgeexec0 BIP 2/4 0.01 sol-amd64
339 0.55500 job_21 chihung r 07/19/2007 21:05:36 1
341 0.55500 job_31 chihung r 07/19/2007 21:05:36 1
----------------------------------------------------------------------------
all.q@sgeexec1 BIP 1/4 0.01 sol-amd64
335 0.55500 job_111 chihung r 07/19/2007 21:05:36 1
----------------------------------------------------------------------------
all.q@sgeexec2 BIP 1/4 0.01 sol-amd64
337 0.55500 job_13 chihung r 07/19/2007 21:05:36 1
############################################################################
- PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
333 0.00000 job_1 chihung hqw 07/19/2007 21:04:34 1
334 0.00000 job_11 chihung hqw 07/19/2007 21:04:34 1
336 0.00000 job_12 chihung hqw 07/19/2007 21:04:34 1
338 0.00000 job_2 chihung hqw 07/19/2007 21:04:34 1
340 0.00000 job_3 chihung hqw 07/19/2007 21:04:34 1
$ qstat -f
queuename qtype used/tot. load_avg arch states
----------------------------------------------------------------------------
all.q@sgeexec0 BIP 2/4 0.01 sol-amd64
339 0.55500 job_21 chihung r 07/19/2007 21:05:36 1
341 0.55500 job_31 chihung r 07/19/2007 21:05:36 1
----------------------------------------------------------------------------
all.q@sgeexec1 BIP 1/4 0.01 sol-amd64
335 0.55500 job_111 chihung r 07/19/2007 21:05:36 1
----------------------------------------------------------------------------
all.q@sgeexec2 BIP 1/4 0.01 sol-amd64
337 0.55500 job_13 chihung r 07/19/2007 21:05:36 1
############################################################################
- PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
333 0.00000 job_1 chihung hqw 07/19/2007 21:04:34 1
334 0.00000 job_11 chihung hqw 07/19/2007 21:04:34 1
336 0.00000 job_12 chihung hqw 07/19/2007 21:04:34 1
338 0.00000 job_2 chihung hqw 07/19/2007 21:04:34 1
340 0.00000 job_3 chihung hqw 07/19/2007 21:04:34 1
$ qstat -f
queuename qtype used/tot. load_avg arch states
----------------------------------------------------------------------------
all.q@sgeexec0 BIP 0/4 0.01 sol-amd64
----------------------------------------------------------------------------
all.q@sgeexec1 BIP 0/4 0.01 sol-amd64
----------------------------------------------------------------------------
all.q@sgeexec2 BIP 0/4 0.01 sol-amd64
############################################################################
- PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
333 0.00000 job_1 chihung hqw 07/19/2007 21:04:34 1
334 0.00000 job_11 chihung qw 07/19/2007 21:04:34 1
336 0.00000 job_12 chihung qw 07/19/2007 21:04:34 1
338 0.00000 job_2 chihung qw 07/19/2007 21:04:34 1
340 0.00000 job_3 chihung qw 07/19/2007 21:04:34 1
$ qstat -f
queuename qtype used/tot. load_avg arch states
----------------------------------------------------------------------------
all.q@sgeexec0 BIP 2/4 0.01 sol-amd64
338 0.55500 job_2 chihung r 07/19/2007 21:05:51 1
340 0.55500 job_3 chihung r 07/19/2007 21:05:51 1
----------------------------------------------------------------------------
all.q@sgeexec1 BIP 1/4 0.01 sol-amd64
334 0.55500 job_11 chihung r 07/19/2007 21:05:51 1
----------------------------------------------------------------------------
all.q@sgeexec2 BIP 1/4 0.01 sol-amd64
336 0.55500 job_12 chihung r 07/19/2007 21:05:51 1
############################################################################
- PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
333 0.00000 job_1 chihung hqw 07/19/2007 21:04:34 1
$ qstat -f
queuename qtype used/tot. load_avg arch states
----------------------------------------------------------------------------
all.q@sgeexec0 BIP 2/4 0.01 sol-amd64
338 0.55500 job_2 chihung r 07/19/2007 21:05:51 1
340 0.55500 job_3 chihung r 07/19/2007 21:05:51 1
----------------------------------------------------------------------------
all.q@sgeexec1 BIP 1/4 0.01 sol-amd64
334 0.55500 job_11 chihung r 07/19/2007 21:05:51 1
----------------------------------------------------------------------------
all.q@sgeexec2 BIP 1/4 0.01 sol-amd64
336 0.55500 job_12 chihung r 07/19/2007 21:05:51 1
############################################################################
- PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
333 0.00000 job_1 chihung hqw 07/19/2007 21:04:34 1
$ qstat -f
queuename qtype used/tot. load_avg arch states
----------------------------------------------------------------------------
all.q@sgeexec0 BIP 0/4 0.01 sol-amd64
----------------------------------------------------------------------------
all.q@sgeexec1 BIP 0/4 0.01 sol-amd64
----------------------------------------------------------------------------
all.q@sgeexec2 BIP 0/4 0.01 sol-amd64
############################################################################
- PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
333 0.00000 job_1 chihung qw 07/19/2007 21:04:34 1
$ qstat -f
queuename qtype used/tot. load_avg arch states
----------------------------------------------------------------------------
all.q@sgeexec0 BIP 0/4 0.01 sol-amd64
----------------------------------------------------------------------------
all.q@sgeexec1 BIP 0/4 0.01 sol-amd64
----------------------------------------------------------------------------
all.q@sgeexec2 BIP 1/4 0.01 sol-amd64
333 0.55500 job_1 chihung r 07/19/2007 21:06:06 1
$ qstat -f
queuename qtype used/tot. load_avg arch states
----------------------------------------------------------------------------
all.q@sgeexec0 BIP 0/4 0.01 sol-amd64
----------------------------------------------------------------------------
all.q@sgeexec1 BIP 0/4 0.01 sol-amd64
----------------------------------------------------------------------------
all.q@sgeexec2 BIP 1/4 0.01 sol-amd64
333 0.55500 job_1 chihung r 07/19/2007 21:06:06 1
#
# ----------output of all jobs, you can see job_1, job_2 and job_3 each ran after the jobs they wait for
#
$ grep 2007 job_*.o*
job_111.o335:Thu Jul 19 21:05:36 SGT 2007
job_11.o334:Thu Jul 19 21:05:51 SGT 2007
job_12.o336:Thu Jul 19 21:05:51 SGT 2007
job_13.o337:Thu Jul 19 21:05:36 SGT 2007
job_1.o333:Thu Jul 19 21:06:06 SGT 2007
job_21.o339:Thu Jul 19 21:05:36 SGT 2007
job_2.o338:Thu Jul 19 21:05:51 SGT 2007
job_31.o341:Thu Jul 19 21:05:37 SGT 2007
job_3.o340:Thu Jul 19 21:05:52 SGT 2007
Another successful proof-of-concept. :-)
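As a possible variation on the hold/alter/release sequence above, the dependencies could be attached at submission time with qsub -hold_jid, provided the jobs are submitted children first so that every name given to -hold_jid refers to a job that has already been submitted. The sketch below is not the method used in this post, only an untested-on-a-cluster alternative. It assumes the same simple "a -> b;" DOT file, the same job-submit.sh script, and that tsort and awk are available. The function name dot2qsub is hypothetical.

```shell
#!/bin/sh
# dot2qsub (hypothetical helper): print one qsub command per job, children
# first, with -hold_jid listing the jobs each parent waits for.
# Usage: dot2qsub DOTFILE [QUEUE]   (QUEUE defaults to all.q)
dot2qsub() {
    dotfile=$1
    queue=${2:-all.q}
    # Turn each "parent -> child;" line into a "parent child" pair.
    edges=$(sed -n 's/^[[:space:]]*\([A-Za-z0-9_][A-Za-z0-9_]*\)[[:space:]]*->[[:space:]]*\([A-Za-z0-9_][A-Za-z0-9_]*\).*/\1 \2/p' "$dotfile")
    # tsort prints a parent before its children; reverse the list so that
    # children are submitted before the jobs that wait for them.
    order=$(printf '%s\n' "$edges" | tsort |
        awk '{ line[NR] = $0 } END { for (i = NR; i >= 1; i--) print line[i] }')
    for job in $order; do
        # Comma-separated list of the jobs this one waits for (may be empty).
        deps=$(printf '%s\n' "$edges" |
            awk -v p="$job" '$1 == p { printf "%s%s", sep, $2; sep = "," }')
        if [ -n "$deps" ]; then
            echo "qsub -q $queue -N $job -hold_jid $deps job-submit.sh"
        else
            echo "qsub -q $queue -N $job job-submit.sh"
        fi
    done
}
```

Like dot2sge.tcl, this only prints the commands, so the output of dot2qsub job-dep.dot can be reviewed before it is run.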
