Sunday, September 30, 2012

Disk Usage Summary per User and Time, take 2

I wanted to compare awk (gawk) with Python, so I coded the same thing in gawk. Here is the code:
#! /bin/bash
#
# count user file size by block
#


if [ $# -ne 1 ]; then
    echo "Usage: $0 <directory>"
    exit 1
fi


if [ ! -d "$1" ]; then
    echo "Error. $1 does not exist"
    exit 1
fi


now=$(date +%s)
# asorti() below is a gawk extension, so call gawk explicitly
ls -lRs --time-style=+%s "$1" | gawk -v now="$now" '

function print_header() {
    printf("%-15s %8s %8s %8s %8s %8s %8s %8s %8s\n", 
        "User", "0m-1m", "1m-3m", "3m-6m", "6m-1y",
        "1y-2y", "2y-3y", "3y-  ", "Total")
}

function print_line() {
    d8="--------"
    d15="---------------"
    printf("%-15s %8s %8s %8s %8s %8s %8s %8s %8s\n", d15, d8, d8, d8, d8, d8, d8, d8, d8)
}

function print_footer() {
    printf("\nNote: Size in GB\n")
}

BEGIN {
    print_header()
    print_line()

    factor=1024.0*1024.0

    y0m0=0
    y0m1=1*30*86400
    y0m3=3*30*86400
    y0m6=6*30*86400
    y1m0=1*365*86400
    y2m0=2*365*86400
    y3m0=3*365*86400
    yxm0=100*365*86400
}

# match directory, link and file
$2 ~ /^[dl-]/ {
    block=$1
    user= $4
    epoch=$7

    users[user]=1

    dt=now-epoch

    # reset so an out-of-range age (e.g. a future timestamp) does not reuse the previous bucket
    cnt=0
    if ( y0m0<=dt && dt<y0m1 ) { cnt=1 }
    if ( y0m1<=dt && dt<y0m3 ) { cnt=2 }
    if ( y0m3<=dt && dt<y0m6 ) { cnt=3 }
    if ( y0m6<=dt && dt<y1m0 ) { cnt=4 }
    if ( y1m0<=dt && dt<y2m0 ) { cnt=5 }
    if ( y2m0<=dt && dt<y3m0 ) { cnt=6 }
    if ( y3m0<=dt && dt<yxm0 ) { cnt=7 }

    summary[user,cnt]+=block
    total_time[cnt]+=block
    total_user[user]+=block
}
END {
    # sort user name using asorti (gawk)
    n=asorti(users, users_sorted)
    for(i=1;i<=n;++i) {
        user=users_sorted[i]
        printf("%-15s", user)
        for(cnt=1;cnt<=7;++cnt) {
            if ( summary[user,cnt] == "" ) {
                summary[user,cnt]=0.0
            }
            printf(" %8.2f", summary[user,cnt]/factor)
        }

        # print per user total
        printf(" %8.2f\n", total_user[user]/factor)
    }

    print_line()

    # print total per time
    total=0.0
    printf("%15s", "Total:")
    for(cnt=1;cnt<=7;++cnt) {
        if ( total_time[cnt] == "" ) {
            total_time[cnt]=0.0
        }
        total+=total_time[cnt]
        printf(" %8.2f", total_time[cnt]/factor)
    }
    printf(" %8.2f\n", total/factor)

    print_footer()
}
'


With 814 MB and 10,208 files in /var, the Python solution took 1.17s and gawk took 0.95s. I have yet to find out how the two compare for millions of files.
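
Both versions classify each file by age with the same chain of range tests. As a rough sketch of that one step (not what either script actually does), the lookup can also be written as a binary search over the bucket edges. The names DAY, EDGES, LABELS and age_bucket are made up for this illustration, it uses Python 3 syntax, and unlike the scripts it does not cap the last bucket at 100 years.

```python
from bisect import bisect_right

DAY = 86400  # seconds per day

# Upper edges of the first six age buckets, using the same 30-day month
# and 365-day year as the scripts. Names here are illustrative only.
EDGES = [30 * DAY, 90 * DAY, 180 * DAY, 365 * DAY, 2 * 365 * DAY, 3 * 365 * DAY]
LABELS = ["0m-1m", "1m-3m", "3m-6m", "6m-1y", "1y-2y", "2y-3y", "3y-"]


def age_bucket(dt):
    """Return the bucket index (0..6) for an age in seconds.

    Returns None for a negative age, i.e. a timestamp in the future.
    """
    if dt < 0:
        return None
    # bisect_right counts the edges that are <= dt, which is exactly
    # the index of the half-open range [lower, upper) containing dt.
    return bisect_right(EDGES, dt)


if __name__ == "__main__":
    for days in (0, 29, 30, 400, 5000):
        print(days, "days ->", LABELS[age_bucket(days * DAY)])
```

Handling the negative case explicitly avoids silently counting a file with a future timestamp in some unrelated column.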

Saturday, September 29, 2012

Disk Usage Summary per User and Time

If you are a system administrator, you often face a disk utilisation dilemma. On one hand, you need to clean up old and unwanted files. On the other hand, you cannot do so because they are owned by other users and you need their permission.

The script below summarises each user's disk utilisation by file age. Hopefully this helps users determine when they need to do some housekeeping.

# cat b.py
#! /usr/bin/python

import fileinput, time, re


fmt="%-15s %8s %8s %8s %8s %8s %8s %8s %8s"
def print_line():
    print fmt % ('-'*15,'-'*8,'-'*8,'-'*8,'-'*8,'-'*8,'-'*8,'-'*8, '-'*8)
def print_header():
    print fmt % ('User','0-1m','1m-3m','3m-6m','6m-1y','1y-2y','2y-3y','3y-  ', 'Total')
def print_footer():
    print '\nNote: Size in GB'


# match directory, link, file
p=re.compile("^[ ]*[1-9][0-9]* [dl-]")


now=int(time.time())


y0m0=0
y0m1=1*30*86400
y0m3=3*30*86400
y0m6=6*30*86400
y1m0=1*365*86400
y2m0=2*365*86400
y3m0=3*365*86400
yxm0=100*365*86400
tranges=(
    [y0m0, y0m1],
    [y0m1, y0m3],
    [y0m3, y0m6],
    [y0m6, y1m0],
    [y1m0, y2m0],
    [y2m0, y3m0],
    [y3m0, yxm0]
)


users=set()
summary=dict()
total_time=dict()
total_user=dict()


for line in fileinput.input():

    if p.match(line):

        (block, perm, link, user, group, size, epoch, others)=line.split(None,7)
        iblock=int(block)
        dt=now-int(epoch)
        users.add(user)

        # summary per user+duration
        cnt=0
        for (t1,t2) in tranges:
            if t1<=dt and dt<t2:
                key=(user,cnt)
                if key in summary:
                    summary[key]+=iblock
                else:
                    summary[key]=iblock

                # total per duration
                if cnt in total_time:
                    total_time[cnt]+=iblock
                else:
                    total_time[cnt]=iblock

            cnt+=1

        # total per user
        if user in total_user:
            total_user[user]+=iblock
        else:
            total_user[user]=iblock


allusers=list(users)
allusers.sort()
factor=1024.0*1024.0


print_header()
print_line()


for user in allusers:
    print "%-15s" % user,
    for cnt in range(len(tranges)):
        key=(user,cnt)
        if key in summary:
            gb=summary[key]/factor
        else:
            gb=0.0
        print "%8.2lf" % gb,

    # user total
    gb=total_user[user]/factor
    print "%8.2lf" % gb


print_line()


print "%15s" % "Total:",
total=0.0
for cnt in range(len(tranges)):
    if cnt in total_time:
        gb=total_time[cnt]/factor
    else:
        gb=0.0
    print "%8.2lf" % gb,
    total+=gb
print "%8.2lf" % total


print_footer()


Here is sample output from my 16GB SSD netbook:

# ls -lRs --time-style=+%s /var | ./b.py
User                0-1m    1m-3m    3m-6m    6m-1y    1y-2y    2y-3y    3y-      Total
--------------- -------- -------- -------- -------- -------- -------- -------- --------
avahi-autoipd       0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
chihung             0.00     0.00     0.00     0.41     0.00     0.00     0.00     0.41
colord              0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
daemon              0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
libuuid             0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
lightdm             0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
lp                  0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
man                 0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
mysql               0.03     0.00     0.00     0.00     0.00     0.00     0.00     0.03
root                0.21     0.07     0.04     0.02     0.00     0.00     0.00     0.35
speech-dispatcher     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
syslog              0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
www-data            0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
--------------- -------- -------- -------- -------- -------- -------- -------- --------
         Total:     0.24     0.07     0.04     0.43     0.00     0.00     0.00     0.79

Note: Size in GB
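
The script depends on the column layout of "ls -lRs", so a user or group name containing a space would shift the fields. A possible alternative, sketched here under my own assumptions and not benchmarked, is to read the same three values (owner, allocated blocks, modification time) straight from lstat. The function name iter_usage is hypothetical, the code uses Python 3 syntax, and it needs the Unix-only pwd module.

```python
import os
import pwd
import time


def iter_usage(top, now=None):
    """Yield (user, kib, age_seconds) for every entry below `top`.

    Uses lstat so symbolic links are counted themselves and not followed,
    which is also how ls -l reports them.
    """
    now = time.time() if now is None else now
    for dirpath, dirnames, filenames in os.walk(top):
        for name in dirnames + filenames:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # entry vanished or is unreadable; skip it
            try:
                user = pwd.getpwuid(st.st_uid).pw_name
            except KeyError:
                user = str(st.st_uid)  # uid without a passwd entry
            # st_blocks is counted in 512-byte units; convert to KiB
            yield user, st.st_blocks // 2, now - st.st_mtime


if __name__ == "__main__":
    import sys
    totals = {}
    for user, kib, age in iter_usage(sys.argv[1] if len(sys.argv) > 1 else "."):
        totals[user] = totals.get(user, 0) + kib
    for user in sorted(totals):
        print("%-15s %10d KiB" % (user, totals[user]))
```

The per-age columns could then be filled from the third value of each tuple in the same way the script fills them from the epoch column.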

ssh runs only once inside a loop

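If you run ssh inside a loop that reads its input from stdin (for example "while read h"), pass the "-n" flag so that ssh takes its stdin from /dev/null. Otherwise ssh consumes the rest of the loop's input and the loop stops after the first iteration. In a plain for loop like the one below, the flag is harmless.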

for h in host1 host2 host3 host4
do
    ssh -n user@$h "/run/something"
done

The man page on my Ubuntu says:
     -n      Redirects stdin from /dev/null (actually, prevents reading from
             stdin).  This must be used when ssh is run in the background.  A
             common trick is to use this to run X11 programs on a remote
             machine.  For example, ssh -n shadows.cs.hut.fi emacs & will
             start an emacs on shadows.cs.hut.fi, and the X11 connection will
             be automatically forwarded over an encrypted channel.  The ssh
             program will be put in the background.  (This does not work if
             ssh needs to ask for a password or passphrase; see also the -f
             option.)
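
The effect is easiest to reproduce when the loop reads its host names from stdin. The sketch below is only a simulation: it uses "cat" as a stand-in for ssh (both read whatever is on stdin), and the function name run_loop and the host names are made up. It assumes a POSIX "sh" and "cat" are available and uses Python 3 syntax.

```python
import subprocess


def run_loop(redirect_stdin):
    """Run a `while read` loop over three host names and return its output lines.

    `cat` stands in for ssh here. With redirect_stdin=True its stdin comes
    from /dev/null, which is what `ssh -n` arranges for ssh.
    """
    if redirect_stdin:
        body = "cat </dev/null >/dev/null"
    else:
        body = "cat >/dev/null"  # swallows the remaining host names
    script = 'while read -r h; do %s; echo "visited $h"; done' % body
    result = subprocess.run(
        ["sh", "-c", script],
        input="host1\nhost2\nhost3\n",
        capture_output=True,
        text=True,
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    print("stdin shared:    ", run_loop(False))
    print("stdin /dev/null: ", run_loop(True))
```

With the shared stdin only the first host is visited, because the stand-in command eats the lines the loop was going to read next.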