Splitting CSV
My initial solution was a two-pass approach using shell script and AWK. First pass to find out all the unique names and second pass to extract the records that matches the unique name. That means I have to parse the original csv files many times and that make the solution very inefficient. I do encounter this type of request every now and then, therefore I decided to come up with a better solution in just 1 parse.
I used Python CSV module before and I believe it will be the right tool.
Here is the python script that will split into separate csv files based on the name in column 2:
$ cat a.py
#! /usr/bin/python
import os, sys, csv, re
if len(sys.argv) != 2:
sys.stderr.write('Usage: %s csv-file\n' % sys.argv[0] )
sys.exit(1)
csvfile = sys.argv[1]
if not os.path.exists( csvfile ):
sys.stderr.write( 'Error: "%s" file does not\n' % csvfile )
sys.exit(2)
#
# keep track of file handler and csv writer
#
name2fp = dict()
name2writer = dict()
for line in csv.reader(open(csvfile, 'rb')):
#
# safe name to avoid problem
#
safename = line[1].strip()
safename = re.sub('[ ]', '_', safename)
safename = re.sub(os.path.sep, ',', safename)
safename = re.sub('[^a-zA-Z_.,()-]+', '', safename)
# create file handler and csv writer if not exist
if not safename in name2fp:
outfile = "%s.%s" % (safename, csvfile)
name2fp[safename] = open( outfile, 'wb' )
name2writer[safename] = csv.writer( name2fp[safename] )
name2writer[safename].writerow(line)
for safename in name2fp:
name2fp[safename].close()
Here is a listing of the corresponding csv files and some command output to verify the task.
$ cat a.csv gdz-u-01,"Chan Chi Hung",20,43 cld-d-01,"Chan Chi Hung",22,41 gdz-u-02,"Chi Hung",53,30 gdz-u-01,"Chan Chi Hung",20,43 cld-p-01,"Chan Chi Hung",21,42 gri-d-01,"Chi Hung",52,31 gdz-u-02,"Chan Chi Hung",23,40 gdz-u-01,"Chan Chi Hung",20,43 cld-p-01,"Chi Hung",51,32 cld-d-01,"Chi Hung",52,31 gdz-u-02,"Chan Chi Hung",23,40 gdz-u-01,"CH Chan/Chi Hung",90,93 cld-p-01,"Chan Chi Hung",21,42 gri-d-01,"CH Chan",92,91 zld-p-09,"Chan Chi Hung",21,42 $ ./a.py a.csv $ ls CH_Chan,Chi_Hung.a.csv Chan_Chi_Hung.a.csv a.csv CH_Chan.a.csv Chi_Hung.a.csv a.py $ wc -l *csv 1 CH_Chan,Chi_Hung.a.csv 1 CH_Chan.a.csv 9 Chan_Chi_Hung.a.csv 4 Chi_Hung.a.csv 15 a.csv 30 total


0 Comments:
Post a Comment
<< Home