Tuesday, May 04, 2010

Splitting CSV

Today my colleague was asking me whether I can assist to split a csv file with millions of lines into individual csv files based on the user name appeared in column 2.

My initial solution was a two-pass approach using shell script and AWK. First pass to find out all the unique names and second pass to extract the records that matches the unique name. That means I have to parse the original csv files many times and that make the solution very inefficient. I do encounter this type of request every now and then, therefore I decided to come up with a better solution in just 1 parse.

I used Python CSV module before and I believe it will be the right tool.

Here is the python script that will split into separate csv files based on the name in column 2:

$ cat a.py
#! /usr/bin/python


import os, sys, csv, re


if len(sys.argv) != 2:
        sys.stderr.write('Usage: %s csv-file\n' % sys.argv[0] )
        sys.exit(1)
csvfile = sys.argv[1]


if not os.path.exists( csvfile ):
        sys.stderr.write( 'Error: "%s" file does not\n' % csvfile )
        sys.exit(2)


#
# keep track of file handler and csv writer
#
name2fp = dict()
name2writer = dict()


for line in csv.reader(open(csvfile, 'rb')):

        #
        # safe name to avoid problem
        #
        safename = line[1].strip()
        safename = re.sub('[ ]', '_', safename)
        safename = re.sub(os.path.sep, ',', safename)
        safename = re.sub('[^a-zA-Z_.,()-]+', '', safename)

        # create file handler and csv writer if not exist
        if not safename in name2fp:
                outfile = "%s.%s" % (safename, csvfile)
                name2fp[safename] = open( outfile, 'wb' )
                name2writer[safename] = csv.writer( name2fp[safename] )

        name2writer[safename].writerow(line)


for safename in name2fp:
        name2fp[safename].close()

Here is a listing of the corresponding csv files and some command output to verify the task.

$ cat a.csv
gdz-u-01,"Chan Chi Hung",20,43
cld-d-01,"Chan Chi Hung",22,41
gdz-u-02,"Chi Hung",53,30
gdz-u-01,"Chan Chi Hung",20,43
cld-p-01,"Chan Chi Hung",21,42
gri-d-01,"Chi Hung",52,31
gdz-u-02,"Chan Chi Hung",23,40
gdz-u-01,"Chan Chi Hung",20,43
cld-p-01,"Chi Hung",51,32
cld-d-01,"Chi Hung",52,31
gdz-u-02,"Chan Chi Hung",23,40
gdz-u-01,"CH Chan/Chi Hung",90,93
cld-p-01,"Chan Chi Hung",21,42
gri-d-01,"CH Chan",92,91
zld-p-09,"Chan Chi Hung",21,42

$ ./a.py a.csv

$ ls
CH_Chan,Chi_Hung.a.csv  Chan_Chi_Hung.a.csv  a.csv
CH_Chan.a.csv           Chi_Hung.a.csv       a.py

$ wc -l *csv
  1 CH_Chan,Chi_Hung.a.csv
  1 CH_Chan.a.csv
  9 Chan_Chi_Hung.a.csv
  4 Chi_Hung.a.csv
 15 a.csv
 30 total

Labels: ,

0 Comments:

Post a Comment

<< Home