Netflix, While I Am Waiting, .... Part 2
As I mentioned in my previous blog post, there are some strange customers who rated more movies than anyone else, and it is almost impossible for a human being to rate (or watch) that many movies over the last six years. So the question is: can I simply ignore these customers? Do these customers provide any genuine data? If so, what is considered genuine and what is not?
If I could summarise the whole training set using the customer id and date as the index, I should be able to see how often someone rates movies on a per-day basis. However, we are talking about 100 million records, and 'awk' is unlikely to be able to handle that in a single run. If I split it into manageable sets, I should be able to overcome this hurdle. Here is my script to handle that in 10 separate awk runs:
#! /bin/sh
# process the training files in 10 shards, split by the last digit of
# the movie file name, so each nawk run only has to hold 1/10 of the keys
for i in 0 1 2 3 4 5 6 7 8 9
do
    for j in mv_*$i.txt
    do
        cat $j
    done | nawk '
        BEGIN { FS = "," }
        # data lines are CustomerID,Rating,Date; requiring three fields
        # skips the "MovieID:" header line at the top of every mv_*.txt
        NF == 3 {
            ind = sprintf("%s:%s", $1, $3)
            ++s[ind]
        }
        END {
            for (i in s)
                print i, s[i]
        }' > perday.txt.$i
done
To merge the 10 files back together, you can do a for loop and cat them into a single awk run that adds up the counts per key, pretty similar to the above; see the sketch below.
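Here is a minimal sketch of that merge step, assuming the per-shard output format above (one "custid:date count" pair per line). Since a customer can rate movies from different shards on the same day, the same key can appear in several files and the counts must be summed:

#! /bin/sh
# merge the 10 per-shard summaries into one perday.txt,
# summing the counts for each "custid:date" key
for i in 0 1 2 3 4 5 6 7 8 9
do
    cat perday.txt.$i
done | nawk '
    { s[$1] += $2 }                     # $1 is the custid:date key, $2 its count
    END { for (k in s) print k, s[k] }
' > perday.txt

One pass is enough here because the summarised customer-day pairs are far fewer than the 100 million raw records that forced the sharding in the first place.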
To my surprise, customer #1664010 rated 5446 times on 2005-10-12!! I think I have my answer for finding the 'non-genuine' data. However, what should the cutoff be for the number of movies rated per day? Okay, let's plot the cumulative graph. I realised that if I keep only ratings from customers who rated fewer than 50 movies in a day, I can reduce the data set by 37%. Now I have to wait for the SQL DELETE to complete before I can prove my concept. In the meantime, the calculation continues.
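Checking a single cutoff doesn't even need the plot. A rough sketch, again assuming the merged perday.txt format above and a hypothetical CUTOFF variable, that reports how much of the data a given cutoff would drop:

#! /bin/sh
# for a given cutoff, report what fraction of all ratings comes from
# customer-days at or above it, i.e. how much the DELETE would remove
CUTOFF=50
nawk -v cutoff=$CUTOFF '
    {
        total += $2                     # every rating counted once
        if ($2 >= cutoff)
            dropped += $2               # ratings from suspicious customer-days
    }
    END {
        printf "cutoff %d drops %d of %d ratings (%.1f%%)\n",
               cutoff, dropped, total, 100 * dropped / total
    }' perday.txt

Rerunning it with a few different CUTOFF values gives the same cumulative picture as the graph, one point at a time.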