SGE Grid Job Dependency
It is possible to describe job dependencies in SGE (Sun Grid Engine), or in any other grid engine, as a DAG (Directed Acyclic Graph). With the open-source Graphviz, it is easy to document these dependencies in the DOT language. In the sample used here, an edge a -> b means that job a waits for job b to finish.
Below is a sample DOT file:
$ cat job-dep.dot
digraph jobs101 {
    job_1 -> job_11;
    job_1 -> job_12;
    job_1 -> job_13;
    job_11 -> job_111;
    job_12 -> job_111;
    job_2 -> job_13;
    job_2 -> job_21;
    job_3 -> job_21;
    job_3 -> job_31;
}
With this DOT file, one can generate the graphical representation:
$ dot -Tpng -o job-dep.png job-dep.dot
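Before turning the graph into scheduler commands, it may be worth confirming that it really is acyclic, since a cycle would leave jobs waiting on each other forever. The following is a minimal Python sketch (not part of the original Tcl tooling) that does this with Kahn's algorithm. The names parse_edges and is_acyclic are hypothetical, and the parser assumes only simple edges of the form a -> b.

```python
import re
from collections import defaultdict

# Sample graph copied from job-dep.dot.
SAMPLE = """
digraph jobs101 {
    job_1 -> job_11; job_1 -> job_12; job_1 -> job_13;
    job_11 -> job_111; job_12 -> job_111;
    job_2 -> job_13; job_2 -> job_21;
    job_3 -> job_21; job_3 -> job_31;
}
"""

def parse_edges(dot_text):
    """Return (parent, child) pairs for every simple `a -> b` edge."""
    return re.findall(r"(\w+)\s*->\s*(\w+)", dot_text)

def is_acyclic(edges):
    """Kahn's algorithm: a graph is a DAG if all nodes can be removed in turn."""
    children = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
        nodes.update((parent, child))

    # Start with the nodes that have no incoming edge.
    ready = [node for node in nodes if indegree[node] == 0]
    removed = 0
    while ready:
        node = ready.pop()
        removed += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    # Nodes left over can only be part of (or downstream of) a cycle.
    return removed == len(nodes)

print(is_acyclic(parse_edges(SAMPLE)))  # prints True
```

A graph such as a -> b; b -> a would make is_acyclic return False, which is a cheap sanity check to run before any job is submitted.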
It is also possible to derive the corresponding SGE commands with the following Tcl script.
$ cat ./dot2sge.tcl
#! /usr/local/bin/tclsh

if { $argc != 1 } {
    puts stderr "Usage: $argv0 dotfile"
    exit 1
}

set dotfile [lindex $argv 0]
if { [file exists $dotfile] == 0 } {
    puts stderr "Error. $dotfile does not exist"
    exit 2
}

# assume simple directed graph a -> b
set fp [open $dotfile r]
set data [read $fp]
close $fp

set sge_jobs {}
foreach i [split [lindex $data 2] {;}] {
    if { [regexp {(\w+)\s*->\s*(\w+)} $i x parent child] != 0 } {
        lappend sge_jobs $parent
        lappend sge_jobs $child
        lappend sge_job_rel($parent) $child
    }
}

# submit unique jobs, and hold
set queue all.q
set sge_unique_jobs [lsort -unique $sge_jobs]
foreach i $sge_unique_jobs {
    puts "qsub -h -q $queue -N $i job-submit.sh"
}

# alter the job dependency, but unhold after all the hold relationships are
# established
foreach i $sge_unique_jobs {
    if { [info exists sge_job_rel($i)] } {
        # with dependency
        puts "qalter -hold_jid [join $sge_job_rel($i) {,}] $i"
    }
}
foreach i $sge_unique_jobs {
    puts "qalter -h U $i"
}
Run this Tcl script to generate the SGE submission commands and the qalter commands that register the job dependencies:
$ ./dot2sge.tcl job-dep.dot
qsub -h -q all.q -N job_1 job-submit.sh
qsub -h -q all.q -N job_11 job-submit.sh
qsub -h -q all.q -N job_111 job-submit.sh
qsub -h -q all.q -N job_12 job-submit.sh
qsub -h -q all.q -N job_13 job-submit.sh
qsub -h -q all.q -N job_2 job-submit.sh
qsub -h -q all.q -N job_21 job-submit.sh
qsub -h -q all.q -N job_3 job-submit.sh
qsub -h -q all.q -N job_31 job-submit.sh
qalter -hold_jid job_11,job_12,job_13 job_1
qalter -hold_jid job_111 job_11
qalter -hold_jid job_111 job_12
qalter -hold_jid job_13,job_21 job_2
qalter -hold_jid job_21,job_31 job_3
qalter -h U job_1
qalter -h U job_11
qalter -h U job_111
qalter -h U job_12
qalter -h U job_13
qalter -h U job_2
qalter -h U job_21
qalter -h U job_3
qalter -h U job_31
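A possible alternative to the submit-on-hold, alter, release sequence is to pass -hold_jid directly to qsub and submit each job only after the jobs it waits for. The Python sketch below prints such commands for the same graph. It is an illustration, not the author's method: qsub_commands and its parameters are hypothetical names, the graph is assumed to be acyclic with simple a -> b edges, and the idea that a name-based -hold_jid only takes effect for jobs that already exist is an assumption worth checking against your qsub man page.

```python
import re

# Sample graph from job-dep.dot; an edge "a -> b" means job a waits for job b.
SAMPLE = """
digraph jobs101 {
    job_1 -> job_11; job_1 -> job_12; job_1 -> job_13;
    job_11 -> job_111; job_12 -> job_111;
    job_2 -> job_13; job_2 -> job_21;
    job_3 -> job_21; job_3 -> job_31;
}
"""

def qsub_commands(dot_text, queue="all.q", script="job-submit.sh"):
    """Return qsub command lines ordered so prerequisites are submitted first."""
    waits_for = {}  # job name -> list of jobs it must wait for
    for parent, child in re.findall(r"(\w+)\s*->\s*(\w+)", dot_text):
        waits_for.setdefault(parent, []).append(child)
        waits_for.setdefault(child, [])

    commands = []
    seen = set()

    def visit(job):
        # Depth-first post-order: emit a job only after everything it waits for.
        # Assumes the graph is acyclic; a cycle would produce a wrong order.
        if job in seen:
            return
        seen.add(job)
        for dep in waits_for[job]:
            visit(dep)
        hold = ("-hold_jid " + ",".join(waits_for[job]) + " ") if waits_for[job] else ""
        commands.append(f"qsub {hold}-q {queue} -N {job} {script}")

    for job in sorted(waits_for):
        visit(job)
    return commands

for line in qsub_commands(SAMPLE):
    print(line)
```

With this ordering, job_111 is submitted first without any hold, and job_1 is submitted with -hold_jid job_11,job_12,job_13 only after those three jobs exist.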
Below is the above proof of concept in action. So sit back....
#
# ----------below is a very simple script
#
$ cat job-submit.sh
#! /bin/sh
#$ -S /bin/sh
date
sleep 10

#
# ----------run all the qsub to submit jobs, but put them on hold
#
$ qsub -h -q all.q -N job_1 job-submit.sh
Your job 333 ("job_1") has been submitted.
$ qsub -h -q all.q -N job_11 job-submit.sh
Your job 334 ("job_11") has been submitted.
$ qsub -h -q all.q -N job_111 job-submit.sh
Your job 335 ("job_111") has been submitted.
$ qsub -h -q all.q -N job_12 job-submit.sh
Your job 336 ("job_12") has been submitted.
$ qsub -h -q all.q -N job_13 job-submit.sh
Your job 337 ("job_13") has been submitted.
$ qsub -h -q all.q -N job_2 job-submit.sh
Your job 338 ("job_2") has been submitted.
$ qsub -h -q all.q -N job_21 job-submit.sh
Your job 339 ("job_21") has been submitted.
$ qsub -h -q all.q -N job_3 job-submit.sh
Your job 340 ("job_3") has been submitted.
$ qsub -h -q all.q -N job_31 job-submit.sh
Your job 341 ("job_31") has been submitted.

#
# ----------show the status, all jobs are in hold position (hqw)
#
$ qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@sgeexec0                 BIP   0/4       0.01     sol-amd64
----------------------------------------------------------------------------
all.q@sgeexec1                 BIP   0/4       0.01     sol-amd64
----------------------------------------------------------------------------
all.q@sgeexec2                 BIP   0/4       0.01     sol-amd64

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
    333 0.00000 job_1      chihung      hqw   07/19/2007 21:04:34     1
    334 0.00000 job_11     chihung      hqw   07/19/2007 21:04:34     1
    335 0.00000 job_111    chihung      hqw   07/19/2007 21:04:34     1
    336 0.00000 job_12     chihung      hqw   07/19/2007 21:04:34     1
    337 0.00000 job_13     chihung      hqw   07/19/2007 21:04:34     1
    338 0.00000 job_2      chihung      hqw   07/19/2007 21:04:34     1
    339 0.00000 job_21     chihung      hqw   07/19/2007 21:04:34     1
    340 0.00000 job_3      chihung      hqw   07/19/2007 21:04:34     1
    341 0.00000 job_31     chihung      hqw   07/19/2007 21:04:34     1

#
# ----------register the job dependency
#
$ qalter -hold_jid job_11,job_12,job_13 job_1
modified job id hold list of job 333
   blocking jobs: 334,336,337
   exited jobs: NONE
$ qalter -hold_jid job_111 job_11
modified job id hold list of job 334
   blocking jobs: 335
   exited jobs: NONE
$ qalter -hold_jid job_111 job_12
modified job id hold list of job 336
   blocking jobs: 335
   exited jobs: NONE
$ qalter -hold_jid job_13,job_21 job_2
modified job id hold list of job 338
   blocking jobs: 337,339
   exited jobs: NONE
$ qalter -hold_jid job_21,job_31 job_3
modified job id hold list of job 340
   blocking jobs: 339,341
   exited jobs: NONE

#
# ----------release all the holds and let SGE to sort itself out
#
$ qalter -h U job_1
modified hold of job 333
$ qalter -h U job_11
modified hold of job 334
$ qalter -h U job_111
modified hold of job 335
$ qalter -h U job_12
modified hold of job 336
$ qalter -h U job_13
modified hold of job 337
$ qalter -h U job_2
modified hold of job 338
$ qalter -h U job_21
modified hold of job 339
$ qalter -h U job_3
modified hold of job 340
$ qalter -h U job_31
modified hold of job 341

#
# ----------query SGE stats
#
$ qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@sgeexec0                 BIP   0/4       0.01     sol-amd64
----------------------------------------------------------------------------
all.q@sgeexec1                 BIP   0/4       0.01     sol-amd64
----------------------------------------------------------------------------
all.q@sgeexec2                 BIP   0/4       0.01     sol-amd64

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
    333 0.00000 job_1      chihung      hqw   07/19/2007 21:04:34     1
    334 0.00000 job_11     chihung      hqw   07/19/2007 21:04:34     1
    335 0.00000 job_111    chihung      qw    07/19/2007 21:04:34     1
    336 0.00000 job_12     chihung      hqw   07/19/2007 21:04:34     1
    337 0.00000 job_13     chihung      qw    07/19/2007 21:04:34     1
    338 0.00000 job_2      chihung      hqw   07/19/2007 21:04:34     1
    339 0.00000 job_21     chihung      qw    07/19/2007 21:04:34     1
    340 0.00000 job_3      chihung      hqw   07/19/2007 21:04:34     1
    341 0.00000 job_31     chihung      qw    07/19/2007 21:04:34     1

#
# ----------some jobs started to run
#
$ qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@sgeexec0                 BIP   2/4       0.01     sol-amd64
    339 0.55500 job_21     chihung      r     07/19/2007 21:05:36     1
    341 0.55500 job_31     chihung      r     07/19/2007 21:05:36     1
----------------------------------------------------------------------------
all.q@sgeexec1                 BIP   1/4       0.01     sol-amd64
    335 0.55500 job_111    chihung      r     07/19/2007 21:05:36     1
----------------------------------------------------------------------------
all.q@sgeexec2                 BIP   1/4       0.01     sol-amd64
    337 0.55500 job_13     chihung      r     07/19/2007 21:05:36     1

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
    333 0.00000 job_1      chihung      hqw   07/19/2007 21:04:34     1
    334 0.00000 job_11     chihung      hqw   07/19/2007 21:04:34     1
    336 0.00000 job_12     chihung      hqw   07/19/2007 21:04:34     1
    338 0.00000 job_2      chihung      hqw   07/19/2007 21:04:34     1
    340 0.00000 job_3      chihung      hqw   07/19/2007 21:04:34     1

$ qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@sgeexec0                 BIP   2/4       0.01     sol-amd64
    339 0.55500 job_21     chihung      r     07/19/2007 21:05:36     1
    341 0.55500 job_31     chihung      r     07/19/2007 21:05:36     1
----------------------------------------------------------------------------
all.q@sgeexec1                 BIP   1/4       0.01     sol-amd64
    335 0.55500 job_111    chihung      r     07/19/2007 21:05:36     1
----------------------------------------------------------------------------
all.q@sgeexec2                 BIP   1/4       0.01     sol-amd64
    337 0.55500 job_13     chihung      r     07/19/2007 21:05:36     1

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
    333 0.00000 job_1      chihung      hqw   07/19/2007 21:04:34     1
    334 0.00000 job_11     chihung      hqw   07/19/2007 21:04:34     1
    336 0.00000 job_12     chihung      hqw   07/19/2007 21:04:34     1
    338 0.00000 job_2      chihung      hqw   07/19/2007 21:04:34     1
    340 0.00000 job_3      chihung      hqw   07/19/2007 21:04:34     1

$ qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@sgeexec0                 BIP   0/4       0.01     sol-amd64
----------------------------------------------------------------------------
all.q@sgeexec1                 BIP   0/4       0.01     sol-amd64
----------------------------------------------------------------------------
all.q@sgeexec2                 BIP   0/4       0.01     sol-amd64

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
    333 0.00000 job_1      chihung      hqw   07/19/2007 21:04:34     1
    334 0.00000 job_11     chihung      qw    07/19/2007 21:04:34     1
    336 0.00000 job_12     chihung      qw    07/19/2007 21:04:34     1
    338 0.00000 job_2      chihung      qw    07/19/2007 21:04:34     1
    340 0.00000 job_3      chihung      qw    07/19/2007 21:04:34     1

$ qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@sgeexec0                 BIP   2/4       0.01     sol-amd64
    338 0.55500 job_2      chihung      r     07/19/2007 21:05:51     1
    340 0.55500 job_3      chihung      r     07/19/2007 21:05:51     1
----------------------------------------------------------------------------
all.q@sgeexec1                 BIP   1/4       0.01     sol-amd64
    334 0.55500 job_11     chihung      r     07/19/2007 21:05:51     1
----------------------------------------------------------------------------
all.q@sgeexec2                 BIP   1/4       0.01     sol-amd64
    336 0.55500 job_12     chihung      r     07/19/2007 21:05:51     1

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
    333 0.00000 job_1      chihung      hqw   07/19/2007 21:04:34     1

$ qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@sgeexec0                 BIP   2/4       0.01     sol-amd64
    338 0.55500 job_2      chihung      r     07/19/2007 21:05:51     1
    340 0.55500 job_3      chihung      r     07/19/2007 21:05:51     1
----------------------------------------------------------------------------
all.q@sgeexec1                 BIP   1/4       0.01     sol-amd64
    334 0.55500 job_11     chihung      r     07/19/2007 21:05:51     1
----------------------------------------------------------------------------
all.q@sgeexec2                 BIP   1/4       0.01     sol-amd64
    336 0.55500 job_12     chihung      r     07/19/2007 21:05:51     1

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
    333 0.00000 job_1      chihung      hqw   07/19/2007 21:04:34     1

$ qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@sgeexec0                 BIP   0/4       0.01     sol-amd64
----------------------------------------------------------------------------
all.q@sgeexec1                 BIP   0/4       0.01     sol-amd64
----------------------------------------------------------------------------
all.q@sgeexec2                 BIP   0/4       0.01     sol-amd64

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
    333 0.00000 job_1      chihung      qw    07/19/2007 21:04:34     1

$ qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@sgeexec0                 BIP   0/4       0.01     sol-amd64
----------------------------------------------------------------------------
all.q@sgeexec1                 BIP   0/4       0.01     sol-amd64
----------------------------------------------------------------------------
all.q@sgeexec2                 BIP   1/4       0.01     sol-amd64
    333 0.55500 job_1      chihung      r     07/19/2007 21:06:06     1

$ qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@sgeexec0                 BIP   0/4       0.01     sol-amd64
----------------------------------------------------------------------------
all.q@sgeexec1                 BIP   0/4       0.01     sol-amd64
----------------------------------------------------------------------------
all.q@sgeexec2                 BIP   1/4       0.01     sol-amd64
    333 0.55500 job_1      chihung      r     07/19/2007 21:06:06     1

#
# ----------output of all jobs, you can see job job_1/2/3 finished last
#
$ grep 2007 job_*.o*
job_111.o335:Thu Jul 19 21:05:36 SGT 2007
job_11.o334:Thu Jul 19 21:05:51 SGT 2007
job_12.o336:Thu Jul 19 21:05:51 SGT 2007
job_13.o337:Thu Jul 19 21:05:36 SGT 2007
job_1.o333:Thu Jul 19 21:06:06 SGT 2007
job_21.o339:Thu Jul 19 21:05:36 SGT 2007
job_2.o338:Thu Jul 19 21:05:51 SGT 2007
job_31.o341:Thu Jul 19 21:05:37 SGT 2007
job_3.o340:Thu Jul 19 21:05:52 SGT 2007
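The ordering seen in the grep output can also be checked mechanically rather than by eye. The Python sketch below is one hedged way to do that: it reads the timestamps printed by each job and confirms that every job started at least 10 seconds (the sleep in job-submit.sh) after each job it waits for. The function names are hypothetical, and the check assumes all jobs ran on the same day.

```python
import re

# Lines copied from the `grep 2007 job_*.o*` output of the run.
GREP_OUTPUT = """\
job_111.o335:Thu Jul 19 21:05:36 SGT 2007
job_11.o334:Thu Jul 19 21:05:51 SGT 2007
job_12.o336:Thu Jul 19 21:05:51 SGT 2007
job_13.o337:Thu Jul 19 21:05:36 SGT 2007
job_1.o333:Thu Jul 19 21:06:06 SGT 2007
job_21.o339:Thu Jul 19 21:05:36 SGT 2007
job_2.o338:Thu Jul 19 21:05:51 SGT 2007
job_31.o341:Thu Jul 19 21:05:37 SGT 2007
job_3.o340:Thu Jul 19 21:05:52 SGT 2007
"""

# Edges from job-dep.dot, read as "left job waits for right job".
EDGES = [
    ("job_1", "job_11"), ("job_1", "job_12"), ("job_1", "job_13"),
    ("job_11", "job_111"), ("job_12", "job_111"),
    ("job_2", "job_13"), ("job_2", "job_21"),
    ("job_3", "job_21"), ("job_3", "job_31"),
]

SLEEP_SECONDS = 10  # each job prints the date, then sleeps this long

def start_seconds(grep_output):
    """Map job name -> start time as seconds since midnight."""
    pattern = r"^(\w+)\.o\d+:.*?(\d\d):(\d\d):(\d\d)"
    starts = {}
    for name, hh, mm, ss in re.findall(pattern, grep_output, re.M):
        starts[name] = int(hh) * 3600 + int(mm) * 60 + int(ss)
    return starts

def ordering_violations(edges, starts, min_gap=SLEEP_SECONDS):
    """Return (job, dependency) pairs where the job started too early."""
    return [(job, dep) for job, dep in edges
            if starts[job] < starts[dep] + min_gap]

violations = ordering_violations(EDGES, start_seconds(GREP_OUTPUT))
print(violations)  # prints []
```

An empty list means every dependency in the graph was honoured in this run.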
Another successful proof-of-concept. :-)