Splitting CSV
My initial solution was a two-pass approach using shell script and AWK. First pass to find out all the unique names and second pass to extract the records that matches the unique name. That means I have to parse the original csv files many times and that make the solution very inefficient. I do encounter this type of request every now and then, therefore I decided to come up with a better solution in just 1 parse.
I used Python CSV module before and I believe it will be the right tool.
Here is the python script that will split into separate csv files based on the name in column 2:
$ cat a.py #! /usr/bin/python import os, sys, csv, re if len(sys.argv) != 2: sys.stderr.write('Usage: %s csv-file\n' % sys.argv[0] ) sys.exit(1) csvfile = sys.argv[1] if not os.path.exists( csvfile ): sys.stderr.write( 'Error: "%s" file does not\n' % csvfile ) sys.exit(2) # # keep track of file handler and csv writer # name2fp = dict() name2writer = dict() for line in csv.reader(open(csvfile, 'rb')): # # safe name to avoid problem # safename = line[1].strip() safename = re.sub('[ ]', '_', safename) safename = re.sub(os.path.sep, ',', safename) safename = re.sub('[^a-zA-Z_.,()-]+', '', safename) # create file handler and csv writer if not exist if not safename in name2fp: outfile = "%s.%s" % (safename, csvfile) name2fp[safename] = open( outfile, 'wb' ) name2writer[safename] = csv.writer( name2fp[safename] ) name2writer[safename].writerow(line) for safename in name2fp: name2fp[safename].close()
Here is a listing of the corresponding csv files and some command output to verify the task.
$ cat a.csv gdz-u-01,"Chan Chi Hung",20,43 cld-d-01,"Chan Chi Hung",22,41 gdz-u-02,"Chi Hung",53,30 gdz-u-01,"Chan Chi Hung",20,43 cld-p-01,"Chan Chi Hung",21,42 gri-d-01,"Chi Hung",52,31 gdz-u-02,"Chan Chi Hung",23,40 gdz-u-01,"Chan Chi Hung",20,43 cld-p-01,"Chi Hung",51,32 cld-d-01,"Chi Hung",52,31 gdz-u-02,"Chan Chi Hung",23,40 gdz-u-01,"CH Chan/Chi Hung",90,93 cld-p-01,"Chan Chi Hung",21,42 gri-d-01,"CH Chan",92,91 zld-p-09,"Chan Chi Hung",21,42 $ ./a.py a.csv $ ls CH_Chan,Chi_Hung.a.csv Chan_Chi_Hung.a.csv a.csv CH_Chan.a.csv Chi_Hung.a.csv a.py $ wc -l *csv 1 CH_Chan,Chi_Hung.a.csv 1 CH_Chan.a.csv 9 Chan_Chi_Hung.a.csv 4 Chi_Hung.a.csv 15 a.csv 30 total
0 Comments:
Post a Comment
<< Home