Chi Hung Chan: March 2011

Have you ever seen 'prtdiag' or another command hang for no obvious reason? If a script that runs such commands is launched from cron, the hung jobs will simply pile up until cron hits its limit. By default, Solaris configures cron to run at most 100 concurrent jobs, and the 101st job will just fail.

I developed a launcher program (watchdog) to limit the runtime of a script (worker) that may behave this way. It works well with my worker script and it should work for other programs too. So, no more hanging jobs in cron!

#! /bin/ksh
#
# A watchdog program to limit the elapsed time of the worker shell script
# to avoid hanging processes that can pile up if worker runs under cron
#


export PATH=/usr/bin:/usr/sbin:/bin


#
# default time limit is 60 seconds
#
timelimit=${1:-60}

worker="${0%/*}/check-worker.ksh"
worker_name=${worker##*/}
worker_name=${worker_name%.*}
if [ ! -f "$worker" ]; then
    echo "Error. \"$worker\" cannot be found"
    exit 1
fi
if [ ! -x "$worker" ]; then
    echo "Error. \"$worker\" is not executable"
    exit 2
fi


watchdog()
{
    sleep 1; # wait for the worker to start
    while [ $timelimit -gt 0 ]
    do
        # pgrep is available since 5.8, else use ps -ef | grep -v grep | grep $worker_name
        jobid=`pgrep $worker_name`
        if [ $? -eq 1 ]; then
            break
        else
            sleep 1
        fi
        ((timelimit-=1))
    done
    if [ $timelimit -eq 0 ]; then
        # kill worker + child processes
        ptree $jobid | awk '$1=='$jobid'{start=1}start==1{print $1}' | while read pid
            do
                kill -TERM "$pid" > /dev/null 2>&1
            done
    fi
}


#
# start the watchdog before the worker
#
watchdog &


tmpfile="/tmp/.$worker_name.$$"
$worker > $tmpfile 2>&1 &
worker_id=$!
wait $worker_id > /dev/null 2>&1
rc=$?


if [ $rc -ne 0 ]; then
    # replace this line to do whatever you want, send email, sms, logger....
    #
    # echo .... | mailx someone@somewhere.com

    details=`cat $tmpfile 2>/dev/null`
    echo "Exit status=$rc. There is a problem with the server '`hostname`' - $details"
fi


rm -f $tmpfile
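
The same idea can be sketched in a more compact form when the launcher starts the worker itself and therefore already knows its PID. This is only an illustration, not a replacement for the script above: run_with_limit is a hypothetical helper name, it signals only the worker process (not its children, which the ptree loop above takes care of), and it assumes a POSIX-style shell in which 'kill -0' can be used to test whether a process still exists.

```shell
# Sketch only: run_with_limit is a hypothetical helper, not part of the script above.
# Usage: run_with_limit SECONDS COMMAND [ARGS...]
run_with_limit()
{
    limit=$1
    shift

    "$@" &                      # start the worker in the background
    pid=$!

    # poll once per second until the worker exits or the time limit is used up
    while [ "$limit" -gt 0 ] && kill -0 "$pid" 2>/dev/null
    do
        sleep 1
        limit=$((limit - 1))
    done

    # still alive after the limit: ask it to terminate
    if kill -0 "$pid" 2>/dev/null; then
        kill -TERM "$pid" 2>/dev/null
    fi

    wait "$pid"                 # returns the worker's exit status
}

# 'true' and 'sleep 30' stand in for a real worker script
if run_with_limit 5 true; then
    echo "quick job finished in time"
fi
run_with_limit 1 sleep 30 || echo "slow job was stopped, exit status $?"
```

A non-zero status from a worker that ran into the limit usually reflects the TERM signal rather than an error reported by the worker itself, so the caller may want to treat the two cases differently.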

Labels: shell script, Solaris

I have not been scripting for quite a while and felt like I didn't qualify as a "Scripting Guy in the Lion City" anymore. Luckily my colleague (now ex-colleague, 'cos I will be starting my new job next week) approached me to develop a script to check for hardware error messages in /var/adm/messages on Solaris. Now I am as happy as a sandboy, or should I say "As Happy As A Scripting Guy!"

Anyway, here is my last script at the company.

If this script is going to run periodically via cron, we cannot just 'grep' for certain patterns in the file. If /var/adm/messages has not been rotated for days or even weeks, scanning the entire file on every run is pretty inefficient. BTW, I have seen /var/adm/messages grow to a few GB. Also, problems that were already fixed will keep showing up in /var/adm/messages until it is rotated.

The only way is to 'grep' only the content added since the previous run. It is easy to record the current file size and store it somewhere for the next poll to consume. The challenge is to 'seek' to the previous position and output only the content written between the two polls. However, I was not able to find any UNIX command that lets me 'cat' a file by byte range.

To accomplish this, I turned to a higher-level scripting language. I chose Perl because it has been available since Solaris 8. The one-liner below 'cat's only the new content in /var/adm/messages.

POS=$prev_pos perl -e 'open(F,"/var/adm/messages");seek(F,$ENV{"POS"},0); while($line=<F>){print $line;}'
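
For comparison, a similar incremental read can be sketched without Perl on systems whose 'tail' accepts a '-c +N' byte offset. POSIX specifies that option, but whether the 'tail' on a given Solaris release supports it is an assumption to verify first. The helper name new_bytes and its state file argument are made up for this illustration.

```shell
# Sketch only: new_bytes is a hypothetical helper.
# Usage: new_bytes FILE STATEFILE
# Prints what was appended to FILE since the previous call and records
# the new position in STATEFILE.
new_bytes()
{
    file=$1
    state=$2

    prev=0
    if [ -f "$state" ]; then
        prev=$(cat "$state")
    fi

    # current size in bytes (tr strips the padding some wc versions add)
    size=$(wc -c < "$file" | tr -d ' ')

    # a smaller file than last time suggests it was rotated: start over
    if [ "$prev" -gt "$size" ]; then
        prev=0
    fi

    # tail -c +N starts output at byte N, counting from 1
    tail -c "+$((prev + 1))" "$file"

    echo "$size" > "$state"
}

log=/tmp/new_bytes_demo.$$
printf 'first line\n' > "$log"
new_bytes "$log" "$log.pos"      # prints: first line
printf 'second line\n' >> "$log"
new_bytes "$log" "$log.pos"      # prints: second line
rm -f "$log" "$log.pos"
```

Because the size is sampled before the file is read, anything appended in between is printed now and again on the next poll; for a periodic error check that only needs to notice a matching line, this is usually acceptable.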

Who says shell scripts are slow and inefficient! Always challenge yourself to find a better solution.

Here is the complete script that does all the checking:

#! /usr/bin/ksh
#
# Check on all the faults
#       hardware, svm, vx, df, /var/adm/messages, ifconfig, psrinfo
# Exit 1 if there is an error
#



PATH=/usr/bin:/usr/sbin:/usr/platform/`arch -k`/sbin:/usr/sfw/bin:/usr/local/bin
export PATH
LD_LIBRARY_PATH=/usr/lib:/usr/platform/`arch -k`/lib:/usr/sfw/lib:/usr/local/lib
export LD_LIBRARY_PATH


Update()
{
    # if you do not want output, simply comment out the 'echo' command
    # remember to leave the colon (no-ops) command there
    echo "$1,\c"
    :
}


RC=0


#
# prtdiag(1M) - display system diagnostic information
#
# Exit Status
# 0 No failures or errors are detected in the system.
# 1 Failures or errors are detected in the system.
# 2 An internal prtdiag error occurred, for example, out of memory.
#
prtdiag > /dev/null 2>&1
[ $? -ne 0 ] && ((RC+=1)) && Update prtdiag



#
# fmadm(1M) - fault management configuration tool
#
# Exit Status
# 0 Successful completion.
# 1 An error occurred. Errors include a failure to communicate with fmd or 
#   insufficient privileges to perform the requested operation.
# 2 Invalid command-line options were specified.
#
osver=`uname -r`
if [ "${osver#*.}" -ge 10 ]; then
    fmadm faulty > /dev/null 2>&1
    [ $? -ne 0 ] && ((RC+=1)) && Update fmadm
fi



#
# df(1M) - displays number of free disk blocks and free files
#
# Exit Status
# 0  Successful completion.
# >0 An error occurred.
#
df -l > /dev/null 2>&1
[ $? -ne 0 ] && ((RC+=1)) && Update df



#
# ifconfig(1M) - configure network interface parameters
#
# Interface flag:
# FAILED
#     The  interface  has  failed.  New  addresses  cannot  be
#     created  on this interface. If this interface is part of
#     an IP network multipathing group, a failover will  occur
#     to another interface in the group, if possible
#
ifconfig -a | grep FAILED > /dev/null 2>&1
[ $? -eq 0 ] && ((RC+=1)) && Update ifconfig



#
# check svm via metadb and metastat
#
pkginfo SUNWmdu > /dev/null 2>&1
if [ $? -eq 0 ]; then
    metaset=`metaset`
    metadb > /dev/null 2>&1
    if [ $? -eq 0 -a -n "$metaset" ]; then

        # output should not have these characters from metadb, see metadb -i man page
        # d - replica does not have an associated device ID.
        # r - replica does not have device relocation information
        # W - replica has device write errors
        # M - replica had problem with master blocks
        # D - replica had problem with data blocks
        # F - replica had format problems
        # S - replica is too small to hold current database
        # R - replica had device read errors
        # B - tagged data associated with the replica is not valid
        metadb -i | sed -n '2,$s/[1-9][0-9].*//p' | egrep '[drWMDFSRB]' > /dev/null 2>&1
        rc1=$?
        #
        metastat | awk '$1=="State:" && $2!="Okay" {exit(1)}'
        rc2=$?
        if [ $rc1 -ne 0 -o $rc2 -ne 0 ]; then
            ((RC+=1))
            Update svm
        fi
    fi
fi



#
# psrinfo(1M) - displays information about processors
# Exit Status
# 0  Successful completion.
# >0 An error occurred.
#
psrinfo > /dev/null 2>&1
rc1=$?
psrinfo | awk '$2!="on-line"{exit(1)}'
rc2=$?
if [ $rc1 -ne 0 -o $rc2 -ne 0 ]; then
    ((RC+=1))
    Update psrinfo
fi



#
# VX
#
pkginfo VRTSvxfs VRTSvxvm > /dev/null 2>&1
if [ $? -eq 0 ]; then
    vxdisk list 2>&1 | grep -i fail > /dev/null 2>&1
    if [ $? -eq 0 ]; then
        ((RC+=1))
        Update vx
    fi
fi



#
# 'grep' from /var/adm/messages
# seek to the previous position instead of scanning the whole file again
#
hidden=${0##*/}
hidden="/tmp/.${hidden%%.*}_var-adm-messages"
prev_pos=0
if [ -f $hidden ]; then
    prev_pos=`cat $hidden`
fi
# if previous position is greater than current file size, the file has likely been rotated
filesize=`ls -l /var/adm/messages | awk '{print $5}'`
if [ $prev_pos -gt $filesize ]; then
    prev_pos=0
fi
#
perl -e 'open(F,"/var/adm/messages");seek(F,'$prev_pos',0); while($line=<F>){print $line;}' | \
    egrep -i 'ECC error|PS[0-9] has FAILED|Link Down|reboot after panic|file system full|by NR list: I/O error|fmd.*SEVERITY: Critical' > /dev/null 2>&1
if [ $? -eq 0 ]; then
    ((RC+=1))
    Update /var/adm/messages
fi
ls -l /var/adm/messages | awk '{print $5}' > $hidden



if [ $RC -gt 0 ]; then
    exit 1
fi
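
Most of the checks above follow the same pattern: run a command, discard its output, and count a failure when the exit status is non-zero. A minimal sketch of that pattern as a reusable function is shown here. The names check and FAILED are hypothetical and do not appear in the script, and 'true' and 'false' stand in for the real commands.

```shell
# Sketch only: check and FAILED are hypothetical names, not part of the script above.
RC=0        # number of failed checks
FAILED=""   # comma-separated labels of the failed checks

# Usage: check LABEL COMMAND [ARGS...]
check()
{
    label=$1
    shift
    if "$@" > /dev/null 2>&1; then
        return 0
    fi
    RC=$((RC + 1))
    FAILED="$FAILED$label,"
    return 0
}

# true and false stand in for commands such as prtdiag or df -l
check diag true
check disk false

echo "failed checks: $RC ($FAILED)"
```

Checks with the opposite sense, such as the 'grep FAILED' test on the ifconfig output where a match is the bad case, would still need their own handling or a small wrapper that inverts the status.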

Labels: Perl, shell script, Solaris

Chi Hung Chan

Saturday, March 05, 2011
