200 Countries, 200 Years, 4 Minutes
Wonder why Congo is still stuck in the bottom-left of the chart? You may want to find out more about Rape of a Nation.
The Scripting Guy in the Lion City with a performance sense.
I developed a launcher program (the watchdog) to limit the runtime of a script (the worker) that may show the above-mentioned behaviour. It works well with my worker script and it should work for other programs too. So, no more hanging jobs in cron!
#! /bin/ksh
#
# A watchdog program to limit the elapsed time of the worker shell script
# to avoid hanging processes that can pile up if worker runs under cron
#

export PATH=/usr/bin:/usr/sbin:/bin

#
# default time limit is 60 seconds
#
timelimit=${1:-60}

worker="${0%/*}/check-worker.ksh"
worker_name=${worker##*/}
worker_name=${worker_name%.*}

if [ ! -f "$worker" ]; then
    echo "Error. \"$worker\" cannot be found"
    exit 1
fi
if [ ! -x "$worker" ]; then
    echo "Error. \"$worker\" is not executable"
    exit 2
fi

watchdog()
{
    sleep 1; # wait for the worker to start
    while [ $timelimit -gt 0 ]
    do
        # pgrep is available since 5.8, else use ps -ef | grep -v grep | grep $worker_name
        jobid=`pgrep $worker_name`
        if [ $? -eq 1 ]; then
            # worker is no longer running
            break
        else
            sleep 1
        fi
        ((timelimit-=1))
    done

    if [ $timelimit -eq 0 ]; then
        # kill worker + child processes
        ptree $jobid | awk '$1=='$jobid'{start=1}start==1{print $1}' | while read pid
        do
            kill -TERM "$pid" > /dev/null 2>&1
        done
    fi
}

#
# start the watchdog before the worker
#
watchdog &

tmpfile="/tmp/.$worker_name.$$"
$worker > $tmpfile 2>&1 &
worker_id=$!
wait $worker_id > /dev/null 2>&1
rc=$?

if [ $rc -ne 0 ]; then
    # replace this line to do whatever you want, send email, sms, logger....
    #
    # echo .... | mailx someone@somewhere.com
    details=`cat $tmpfile 2>/dev/null`
    echo "Exit status=$rc. There is a problem with the server '`hostname`' - $details"
fi
rm -f $tmpfile
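For comparison, here is a minimal sketch of the same time-limit idea in a more generic form. It is not the script above: the function name run_with_limit is hypothetical, it tracks the worker by the PID from $! instead of using pgrep, and it does not walk the process tree, so children of the worker are left alone.

```shell
#!/bin/sh
# run_with_limit SECONDS COMMAND [ARGS...]   (hypothetical helper name)
# Runs COMMAND in the background and sends it SIGTERM if it is still
# running after SECONDS. Returns the exit status of COMMAND.
run_with_limit() {
    limit=$1
    shift

    "$@" &                      # start the worker
    pid=$!

    # timer: after the limit, terminate the worker if it is still there
    ( sleep "$limit"; kill -TERM "$pid" ) > /dev/null 2>&1 &
    timer=$!

    rc=0
    wait "$pid" || rc=$?        # non-zero if the worker failed or was killed

    # worker is done, so the timer is no longer needed
    kill "$timer" 2> /dev/null || :
    wait "$timer" 2> /dev/null || :

    return $rc
}
```

A call such as "run_with_limit 60 ./check-worker.ksh" would then give the worker at most about 60 seconds. If the worker spawns children that can hang on their own, the ptree approach in the script above is still needed.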
Labels: shell script, Solaris
Anyway, here is my last script in the company.
If this script is going to run periodically via cron, we cannot simply 'grep' for certain patterns in the file. If /var/adm/messages has not been rotated for days or even weeks, scanning the entire file on every run is pretty inefficient. BTW, I have seen /var/adm/messages grow as big as a few GB. Also, problems that were fixed previously will keep showing up in /var/adm/messages until it is rotated.
The only way is to 'grep' only the content from where the previous run left off. It is easy to keep track of the current file size and store this information somewhere for the next poll to consume. The challenge is to 'seek' to the previous location and output only the content written between the two polls. However, I was not able to find any UNIX command that lets me 'cat' a file by byte range.
To accomplish this, a higher-level scripting language is needed. I chose Perl because it has been available since Solaris 8. The one-liner below 'cat's only the updated contents of /var/adm/messages.
POS=$prev_pos perl -e 'open(F,"/var/adm/messages");seek(F,$ENV{"POS"},0); while($line=<F>){print $line;}'
Who says shell scripts are slow and inefficient? Always challenge yourself to find a better solution.
Here is the complete script that does all the checking:
#! /usr/bin/ksh
#
# Check on all the faults
# hardware, svm, vx, df, /var/adm/messages, ifconfig, psrinfo
# Exit 1 if there is an error
#

PATH=/usr/bin:/usr/sbin:/usr/platform/`arch -k`/sbin:/usr/sfw/bin:/usr/local/bin
export PATH
LD_LIBRARY_PATH=/usr/lib:/usr/platform/`arch -k`/lib:/usr/sfw/lib:/usr/local/lib
export LD_LIBRARY_PATH

Update()
{
    # if you do not want output, simply comment out the 'echo' command
    # remember to leave the colon (no-ops) command there
    echo "$1,\c"
    :
}

RC=0

#
# prtdiag(1M) - display system diagnostic information
#
# Exit Status
# 0 No failures or errors are detected in the system.
# 1 Failures or errors are detected in the system.
# 2 An internal prtdiag error occurred, for example, out of memory.
#
prtdiag > /dev/null 2>&1
[ $? -ne 0 ] && ((RC+=1)) && Update prtdiag

#
# fmadm(1M) - fault management configuration tool
#
# Exit Status
# 0 Successful completion.
# 1 An error occurred. Errors include a failure to communicate with fmd or
#   insufficient privileges to perform the requested operation.
# 2 Invalid command-line options were specified.
#
osver=`uname -r`
if [ "${osver#*.}" -ge 10 ]; then
    fmadm faulty > /dev/null 2>&1
    [ $? -ne 0 ] && ((RC+=1)) && Update fmadm
fi

#
# df(1M) - displays number of free disk blocks and free files
#
# Exit Status
# 0 Successful completion.
# >0 An error occurred.
#
df -l > /dev/null 2>&1
[ $? -ne 0 ] && ((RC+=1)) && Update df

#
# ifconfig(1M) - configure network interface parameters
#
# Interface flag:
# FAILED
#   The interface has failed. New addresses cannot be
#   created on this interface. If this interface is part of
#   an IP network multipathing group, a failover will occur
#   to another interface in the group, if possible
#
ifconfig -a | grep FAILED > /dev/null 2>&1
[ $? -eq 0 ] && ((RC+=1)) && Update ifconfig

#
# check svm via metadb and metastat
#
pkginfo SUNWmdu > /dev/null 2>&1
if [ $? -eq 0 ]; then
    metaset=`metaset`
    metadb > /dev/null 2>&1
    if [ $? -eq 0 -a ! -z "$metaset" ]; then
        # output should not have these characters from metadb, see metadb -i man page
        # d - replica does not have an associated device ID.
        # r - replica does not have device relocation information
        # W - replica has device write errors
        # M - replica had problem with master blocks
        # D - replica had problem with data blocks
        # F - replica had format problems
        # S - replica is too small to hold current database
        # R - replica had device read errors
        # B - tagged data associated with the replica is not valid
        metadb -i | sed -n '2,$s/[1-9][0-9].*//p' | egrep '[drWMDFSRB]' > /dev/null 2>&1
        rc1=$?
        #
        metastat | awk '$1=="State:" && $2!="Okay" {exit(1)}'
        rc2=$?
        # egrep exits 0 when one of the bad flags IS present, so rc1 equal to 0 means trouble
        if [ $rc1 -eq 0 -o $rc2 -ne 0 ]; then
            ((RC+=1))
            Update svm
        fi
    fi
fi

#
# psrinfo(1M) - displays information about processors
# Exit Status
# 0 Successful completion.
# >0 An error occurred.
#
psrinfo > /dev/null 2>&1
rc1=$?
psrinfo | awk '$2!="on-line"{exit(1)}'
rc2=$?
if [ $rc1 -ne 0 -o $rc2 -ne 0 ]; then
    ((RC+=1))
    Update psrinfo
fi

#
# VX
#
pkginfo VRTSvxfs VRTSvxvm > /dev/null 2>&1
if [ $? -eq 0 ]; then
    vxdisk list 2>&1 | grep -i fail > /dev/null 2>&1
    if [ $? -eq 0 ]; then
        ((RC+=1))
        Update vx
    fi
fi

#
# 'grep' from /var/adm/messages
# need to seek to the previous position, instead of scan through the whole file again
#
hidden=${0##*/}
hidden="/tmp/.${hidden%%.*}_var-adm-messages"
prev_pos=0
if [ -f $hidden ]; then
    prev_pos=`cat $hidden`
fi

# if previous position is greater than current file size, likely file has been recycled
filesize=`ls -l /var/adm/messages | awk '{print $5}'`
if [ $prev_pos -gt $filesize ]; then
    prev_pos=0
fi

#
perl -e 'open(F,"/var/adm/messages");seek(F,'$prev_pos',0); while($line=<F>){print $line;}' | \
egrep -i 'ECC error|PS[0-9] has FAILED|Link Down|reboot after panic|file system full|by NR list: I/O error|fmd.*SEVERITY: Critical' > /dev/null 2>&1
if [ $? -eq 0 ]; then
    ((RC+=1))
    Update /var/adm/messages
fi
ls -l /var/adm/messages | awk '{print $5}' > $hidden

if [ $RC -gt 0 ]; then
    exit 1
fi
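Most of the checks above share one shape: run a command quietly, and if its exit status is non-zero, bump RC and record the name. As a possible refactoring (a sketch only, not part of the original script), that shape could be folded into a helper. The name check and the FAILED variable are hypothetical, and true/false stand in for the real commands so the example runs anywhere.

```shell
#!/bin/sh
RC=0          # number of failed checks
FAILED=""     # comma-separated names of failed checks

# check NAME COMMAND [ARGS...]   (hypothetical helper name)
# Runs COMMAND with its output discarded; a non-zero exit status is
# counted in RC and NAME is appended to FAILED.
check() {
    name=$1
    shift
    "$@" > /dev/null 2>&1 || {
        RC=$((RC + 1))
        FAILED="$FAILED$name,"
    }
}

check passing true      # stands in for something like: check df df -l
check failing false     # stands in for a command that reports a fault
echo "failed checks: ${FAILED:-none}"
```

Checks that look for a pattern instead of an exit status, such as the ifconfig and /var/adm/messages ones, signal a fault when grep succeeds, so they would need their own handling and do not fit this helper as is.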
Labels: Perl, shell script, Solaris