Thursday, July 19, 2007

SGE Grid Job Dependency

It is possible to describe SGE (Sun Grid Engine) job (or any other grid engine) dependency in a DAG (Directed Acyclic Graph) format. By taking advantage of the opensource Graphviz, it is very easy to document this dependency in DOT language format. Below shows you a sample DOT file:

$ cat job-dep.dot
digraph jobs101 {
        job_1 -> job_11;
        job_1 -> job_12;
        job_1 -> job_13;
        job_11 -> job_111;
        job_12 -> job_111;
        job_2 -> job_13;
        job_2 -> job_21;
        job_3 -> job_21;
        job_3 -> job_31;
}

With this DOT file, one can generate the graphical representation:

$ dot -Tpng -o job-dep.png job-dep.dot

It is also possible to derive the corresponding SGE commands by the following Tcl script.

$ cat ./dot2sge.tcl
#! /usr/local/bin/tclsh


if { $argc != 1 } {
        puts stderr "Usage: $argv0 "
        exit 1
}
set dotfile [lindex $argv 0]
if { [file exists $dotfile] == 0 } {
        puts stderr "Error. $dotfile does not exist"
        exit 2
}


# assume simple directed graph a -> b

set fp [open $dotfile r]
set data [read $fp]
close $fp


set sge_jobs {}
foreach i [split [lindex $data 2] {;}] {
        if { [regexp {(\w+)\s*->\s*(\w+)} $i x parent child] != 0 } {
                lappend sge_jobs $parent
                lappend sge_jobs $child

                lappend sge_job_rel($parent) $child
        }
}


# submit unique jobs, and hold
set queue all.q
set sge_unique_jobs [lsort -unique $sge_jobs]
foreach i $sge_unique_jobs {
        puts "qsub -h -q $queue -N $i job-submit.sh"
}


# alter the job dependency, but unhold after all the hold relationships are
# established
foreach i $sge_unique_jobs {
        if { [info exists sge_job_rel($i)] } {
                # with dependency
                puts "qalter -hold_jid [join $sge_job_rel($i) {,}] $i"
        }
}
foreach i $sge_unique_jobs {
        puts "qalter -h U $i"
}

Run this Tcl script to generate the SGE submission commands and alternation commands to register the job dependency

$ ./dot2sge.tcl job-dep.dot
qsub -h -q all.q -N job_1 job-submit.sh
qsub -h -q all.q -N job_11 job-submit.sh
qsub -h -q all.q -N job_111 job-submit.sh
qsub -h -q all.q -N job_12 job-submit.sh
qsub -h -q all.q -N job_13 job-submit.sh
qsub -h -q all.q -N job_2 job-submit.sh
qsub -h -q all.q -N job_21 job-submit.sh
qsub -h -q all.q -N job_3 job-submit.sh
qsub -h -q all.q -N job_31 job-submit.sh
qalter -hold_jid job_11,job_12,job_13 job_1
qalter -hold_jid job_111 job_11
qalter -hold_jid job_111 job_12
qalter -hold_jid job_13,job_21 job_2
qalter -hold_jid job_21,job_31 job_3
qalter -h U job_1
qalter -h U job_11
qalter -h U job_111
qalter -h U job_12
qalter -h U job_13
qalter -h U job_2
qalter -h U job_21
qalter -h U job_3
qalter -h U job_31

Below show the above proof-of-concept in action. So sit back....

#
# ----------below is a very simple script
#
$ cat job-submit.sh
#! /bin/sh
#$ -S /bin/sh

date
sleep 10


#
# ----------run all the qsub to submit jobs, but put them on hold
#
$ qsub -h -q all.q -N job_1 job-submit.sh
Your job 333 ("job_1") has been submitted.
$ qsub -h -q all.q -N job_11 job-submit.sh
Your job 334 ("job_11") has been submitted.
$ qsub -h -q all.q -N job_111 job-submit.sh
Your job 335 ("job_111") has been submitted.
$ qsub -h -q all.q -N job_12 job-submit.sh
Your job 336 ("job_12") has been submitted.
$ qsub -h -q all.q -N job_13 job-submit.sh
Your job 337 ("job_13") has been submitted.
$ qsub -h -q all.q -N job_2 job-submit.sh
Your job 338 ("job_2") has been submitted.
$ qsub -h -q all.q -N job_21 job-submit.sh
Your job 339 ("job_21") has been submitted.
$ qsub -h -q all.q -N job_3 job-submit.sh
Your job 340 ("job_3") has been submitted.
$ qsub -h -q all.q -N job_31 job-submit.sh
Your job 341 ("job_31") has been submitted.


#
# ----------show the status, all jobs are in hold position (hqw)
#
$ qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@sgeexec0                 BIP   0/4       0.01     sol-amd64
----------------------------------------------------------------------------
all.q@sgeexec1                 BIP   0/4       0.01     sol-amd64
----------------------------------------------------------------------------
all.q@sgeexec2                 BIP   0/4       0.01     sol-amd64

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
    333 0.00000 job_1      chihung      hqw   07/19/2007 21:04:34     1
    334 0.00000 job_11     chihung      hqw   07/19/2007 21:04:34     1
    335 0.00000 job_111    chihung      hqw   07/19/2007 21:04:34     1
    336 0.00000 job_12     chihung      hqw   07/19/2007 21:04:34     1
    337 0.00000 job_13     chihung      hqw   07/19/2007 21:04:34     1
    338 0.00000 job_2      chihung      hqw   07/19/2007 21:04:34     1
    339 0.00000 job_21     chihung      hqw   07/19/2007 21:04:34     1
    340 0.00000 job_3      chihung      hqw   07/19/2007 21:04:34     1
    341 0.00000 job_31     chihung      hqw   07/19/2007 21:04:34     1


#
# ----------register the job dependency
#
$ qalter -hold_jid job_11,job_12,job_13 job_1
modified job id hold list of job 333
   blocking jobs: 334,336,337
   exited jobs:   NONE
$ qalter -hold_jid job_111 job_11
modified job id hold list of job 334
   blocking jobs: 335
   exited jobs:   NONE
$ qalter -hold_jid job_111 job_12
modified job id hold list of job 336
   blocking jobs: 335
   exited jobs:   NONE
$ qalter -hold_jid job_13,job_21 job_2
modified job id hold list of job 338
   blocking jobs: 337,339
   exited jobs:   NONE
$ qalter -hold_jid job_21,job_31 job_3
modified job id hold list of job 340
   blocking jobs: 339,341
   exited jobs:   NONE


#
# ----------release all the holds and let SGE to sort itself out
#
$ qalter -h U job_1
modified hold of job 333
$ qalter -h U job_11
modified hold of job 334
$ qalter -h U job_111
modified hold of job 335
$ qalter -h U job_12
modified hold of job 336
$ qalter -h U job_13
modified hold of job 337
$ qalter -h U job_2
modified hold of job 338
$ qalter -h U job_21
modified hold of job 339
$ qalter -h U job_3
modified hold of job 340
$ qalter -h U job_31
modified hold of job 341


#
# ----------query SGE stats
#
$ qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@sgeexec0                 BIP   0/4       0.01     sol-amd64
----------------------------------------------------------------------------
all.q@sgeexec1                 BIP   0/4       0.01     sol-amd64
----------------------------------------------------------------------------
all.q@sgeexec2                 BIP   0/4       0.01     sol-amd64

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
    333 0.00000 job_1      chihung      hqw   07/19/2007 21:04:34     1
    334 0.00000 job_11     chihung      hqw   07/19/2007 21:04:34     1
    335 0.00000 job_111    chihung      qw    07/19/2007 21:04:34     1
    336 0.00000 job_12     chihung      hqw   07/19/2007 21:04:34     1
    337 0.00000 job_13     chihung      qw    07/19/2007 21:04:34     1
    338 0.00000 job_2      chihung      hqw   07/19/2007 21:04:34     1
    339 0.00000 job_21     chihung      qw    07/19/2007 21:04:34     1
    340 0.00000 job_3      chihung      hqw   07/19/2007 21:04:34     1
    341 0.00000 job_31     chihung      qw    07/19/2007 21:04:34     1


#
# ----------some jobs started to run
#
$ qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@sgeexec0                 BIP   2/4       0.01     sol-amd64
    339 0.55500 job_21     chihung      r     07/19/2007 21:05:36     1
    341 0.55500 job_31     chihung      r     07/19/2007 21:05:36     1
----------------------------------------------------------------------------
all.q@sgeexec1                 BIP   1/4       0.01     sol-amd64
    335 0.55500 job_111    chihung      r     07/19/2007 21:05:36     1
----------------------------------------------------------------------------
all.q@sgeexec2                 BIP   1/4       0.01     sol-amd64
    337 0.55500 job_13     chihung      r     07/19/2007 21:05:36     1

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
    333 0.00000 job_1      chihung      hqw   07/19/2007 21:04:34     1
    334 0.00000 job_11     chihung      hqw   07/19/2007 21:04:34     1
    336 0.00000 job_12     chihung      hqw   07/19/2007 21:04:34     1
    338 0.00000 job_2      chihung      hqw   07/19/2007 21:04:34     1
    340 0.00000 job_3      chihung      hqw   07/19/2007 21:04:34     1


$ qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@sgeexec0                 BIP   2/4       0.01     sol-amd64
    339 0.55500 job_21     chihung      r     07/19/2007 21:05:36     1
    341 0.55500 job_31     chihung      r     07/19/2007 21:05:36     1
----------------------------------------------------------------------------
all.q@sgeexec1                 BIP   1/4       0.01     sol-amd64
    335 0.55500 job_111    chihung      r     07/19/2007 21:05:36     1
----------------------------------------------------------------------------
all.q@sgeexec2                 BIP   1/4       0.01     sol-amd64
    337 0.55500 job_13     chihung      r     07/19/2007 21:05:36     1

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
    333 0.00000 job_1      chihung      hqw   07/19/2007 21:04:34     1
    334 0.00000 job_11     chihung      hqw   07/19/2007 21:04:34     1
    336 0.00000 job_12     chihung      hqw   07/19/2007 21:04:34     1
    338 0.00000 job_2      chihung      hqw   07/19/2007 21:04:34     1
    340 0.00000 job_3      chihung      hqw   07/19/2007 21:04:34     1


$ qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@sgeexec0                 BIP   0/4       0.01     sol-amd64
----------------------------------------------------------------------------
all.q@sgeexec1                 BIP   0/4       0.01     sol-amd64
----------------------------------------------------------------------------
all.q@sgeexec2                 BIP   0/4       0.01     sol-amd64

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
    333 0.00000 job_1      chihung      hqw   07/19/2007 21:04:34     1
    334 0.00000 job_11     chihung      qw    07/19/2007 21:04:34     1
    336 0.00000 job_12     chihung      qw    07/19/2007 21:04:34     1
    338 0.00000 job_2      chihung      qw    07/19/2007 21:04:34     1
    340 0.00000 job_3      chihung      qw    07/19/2007 21:04:34     1


$ qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@sgeexec0                 BIP   2/4       0.01     sol-amd64
    338 0.55500 job_2      chihung      r     07/19/2007 21:05:51     1
    340 0.55500 job_3      chihung      r     07/19/2007 21:05:51     1
----------------------------------------------------------------------------
all.q@sgeexec1                 BIP   1/4       0.01     sol-amd64
    334 0.55500 job_11     chihung      r     07/19/2007 21:05:51     1
----------------------------------------------------------------------------
all.q@sgeexec2                 BIP   1/4       0.01     sol-amd64
    336 0.55500 job_12     chihung      r     07/19/2007 21:05:51     1

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
    333 0.00000 job_1      chihung      hqw   07/19/2007 21:04:34     1


$ qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@sgeexec0                 BIP   2/4       0.01     sol-amd64
    338 0.55500 job_2      chihung      r     07/19/2007 21:05:51     1
    340 0.55500 job_3      chihung      r     07/19/2007 21:05:51     1
----------------------------------------------------------------------------
all.q@sgeexec1                 BIP   1/4       0.01     sol-amd64
    334 0.55500 job_11     chihung      r     07/19/2007 21:05:51     1
----------------------------------------------------------------------------
all.q@sgeexec2                 BIP   1/4       0.01     sol-amd64
    336 0.55500 job_12     chihung      r     07/19/2007 21:05:51     1

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
    333 0.00000 job_1      chihung      hqw   07/19/2007 21:04:34     1


$ qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@sgeexec0                 BIP   0/4       0.01     sol-amd64
----------------------------------------------------------------------------
all.q@sgeexec1                 BIP   0/4       0.01     sol-amd64
----------------------------------------------------------------------------
all.q@sgeexec2                 BIP   0/4       0.01     sol-amd64

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
    333 0.00000 job_1      chihung      qw    07/19/2007 21:04:34     1


$ qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@sgeexec0                 BIP   0/4       0.01     sol-amd64
----------------------------------------------------------------------------
all.q@sgeexec1                 BIP   0/4       0.01     sol-amd64
----------------------------------------------------------------------------
all.q@sgeexec2                 BIP   1/4       0.01     sol-amd64
    333 0.55500 job_1      chihung      r     07/19/2007 21:06:06     1


$ qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@sgeexec0                 BIP   0/4       0.01     sol-amd64
----------------------------------------------------------------------------
all.q@sgeexec1                 BIP   0/4       0.01     sol-amd64
----------------------------------------------------------------------------
all.q@sgeexec2                 BIP   1/4       0.01     sol-amd64
    333 0.55500 job_1      chihung      r     07/19/2007 21:06:06     1


#
# ----------output of all jobs, you can see job job_1/2/3 finished last
#
$ grep 2007 job_*.o*
job_111.o335:Thu Jul 19 21:05:36 SGT 2007
job_11.o334:Thu Jul 19 21:05:51 SGT 2007
job_12.o336:Thu Jul 19 21:05:51 SGT 2007
job_13.o337:Thu Jul 19 21:05:36 SGT 2007
job_1.o333:Thu Jul 19 21:06:06 SGT 2007
job_21.o339:Thu Jul 19 21:05:36 SGT 2007
job_2.o338:Thu Jul 19 21:05:51 SGT 2007
job_31.o341:Thu Jul 19 21:05:37 SGT 2007
job_3.o340:Thu Jul 19 21:05:52 SGT 2007

Another successful proof-of-concept. :-)

Labels: , ,

0 Comments:

Post a Comment

<< Home