Tuesday, October 26, 2010

House Keep Script Not Doing Its Job

Recently I found one of my customer's scripts is not doing what it suppose to do, which is to house keep old log files. Here is how I locate the culprit.

I have been dumping out those zero link files daily for all my managed servers and realised that one of the application accounts has been choking up 2GB+ worth of memory. Since these files are having zero link, there is no way to find out the original file name. With the help of DTrace in Solaris 10, I am able to tell when this particular user trying to delete (unlink in syscall) a file. Below is a simple dtrace script that I run for a couple of days.

#! /usr/sbin/dtrace -s

#pragma D option quiet

syscall::unlink:entry
/uid==1234/
{
        printf("%-20Y pid=%d ppid=%d execname=%s args=%s\n", walltimestamp, pid, ppid, execname, copyinstr(arg0));
}

Sample output:

2010 Oct 21 20:01:15 pid=1667 ppid=29587 execname=gzip args=/opt/chihung/log/Err-s_ac.txt.20101021064218
2010 Oct 21 20:01:15 pid=29599 ppid=29587 execname=gzip args=/opt/chihung/log/ChanSecurityLog211010.log
2010 Oct 21 20:01:48 pid=1668 ppid=29587 execname=gzip args=/opt/chihung/log/ChanControlServicesLog211010.log
2010 Oct 21 20:02:21 pid=1916 ppid=29587 execname=gzip args=/opt/chihung/log/ChanBusinessServicesLog211010.log
2010 Oct 21 20:02:27 pid=2160 ppid=29587 execname=gzip args=/opt/chihung/log/ChanFixedexpenseLog211010.log
2010 Oct 21 20:02:47 pid=2290 ppid=29587 execname=gzip args=/opt/chihung/log/aclist.txt.20101021084119
2010 Oct 21 20:02:47 pid=2294 ppid=29587 execname=gzip args=/opt/chihung/log/Chan_20101020_error_13360_batch
2010 Oct 21 20:02:47 pid=2296 ppid=2295 execname=rm args=/opt/chihung/log/ChanControlServicesLog211010.log.gz
2010 Oct 21 20:02:47 pid=2297 ppid=2295 execname=rm args=/opt/chihung/log/ChanFixedexpenseLog211010.log.gz
2010 Oct 21 20:02:47 pid=2299 ppid=2295 execname=rm args=/opt/chihung/log/ChanUserLog211010.log.gz
2010 Oct 21 20:02:47 pid=2300 ppid=2295 execname=rm args=/opt/chihung/log/aclist.txt.20101021084119.gz
2010 Oct 21 20:02:47 pid=2301 ppid=2295 execname=rm args=/opt/chihung/log/Chan003_20101020_error_13359_adhoc.gz
2010 Oct 21 20:02:47 pid=2303 ppid=2295 execname=rm args=/opt/chihung/log/ChanBusinessServicesLog211010.log.gz
2010 Oct 21 20:02:47 pid=2304 ppid=2295 execname=rm args=/opt/chihung/log/ChanDepositLog211010.log.gz
2010 Oct 21 20:02:47 pid=2305 ppid=2295 execname=rm args=/opt/chihung/log/ChanNdfLog211010.log.gz
2010 Oct 21 20:02:47 pid=2307 ppid=2295 execname=rm args=/opt/chihung/log/Err-S001_20101020_error_13358_adhoc.gz

At 20:00 daily, gzip is compressing a couple of log files and that followed by rm on the gzipped files. That's wierd. Why would someone want to compress log files and remove it after that. When the application rotate log file daily, it is unable to locate the original log file handler to release the memory. Therefore, the number of zero link files increases and hence more used memory. By cross examining the scheduled jobs at 20:00, I found the following script

#!/bin/ksh -x

APPDIR=/opt/$1/log
if [[ -d $APPDIR ]]; then
        find $APPDIR \( -name "*.out" -o -name "*.gz" \) -prune -o -type f -mtime -1 -exec gzip {} \;
        find $APPDIR -name "*.gz" -mtime -1 -exec rm -f {} \;
        exit 0
fi

If you look at the man page for find(1), -mtime -1 means modified time less than 1 day! Shouldn't it be -mtime +1 - modified time older than 1 day.

Labels: ,

0 Comments:

Post a Comment

<< Home