Netflix, While I Am Waiting, .... Part 2
As I mentioned in my previous blog, there are some strange customers who rated more movies than anyone else, and it is almost impossible for a human being to rate (or watch) that many movies over the last 6 years. So the questions are: can I simply ignore these customers? Will these customers provide some genuine data? If so, what is considered genuine and what is not?
If I could summarise the whole training set using the customer id and date as the index, I should be able to see how often someone rates movies on a per-day basis. However, we are talking about 100 million records, and awk may not be able to handle that in a single run. If I split it into manageable sets, I should be able to overcome this hurdle. Here is my script to handle that in 10 separate awk runs, one per last digit of the movie file name:
#! /bin/sh
# One awk run per last digit of the movie id, to keep each run small.
for i in 0 1 2 3 4 5 6 7 8 9
do
    for j in mv_*$i.txt
    do
        cat $j
    done | nawk '
    BEGIN {
        FS=","
    }
    # Rating lines look like "custid,rating,date" (3 fields); the
    # "movieid:" header line at the top of each file has only 1,
    # so NF==3 skips every header, not just the first one.
    NF==3 {
        ind=sprintf("%s:%s",$1,$3)   # key: customer id + date
        ++s[ind]                     # ratings by this customer on this day
    }
    END {
        for(i in s) {
            print i,s[i]
        }
    }' > perday.txt.$i
done
To merge the 10 files together, you can do a similar for loop to cat them into another awk run. Note that the same customer:date key can show up in more than one of the 10 files (a customer can rate movies from different splits on the same day), so the merge pass has to sum the counts rather than just concatenate the lines; see the sketch below.
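Here is a minimal merge sketch along those lines; the merged file name perday.txt is my own choice, not from the original post.

#! /bin/sh
# Merge the 10 partial files, summing the counts for keys that
# appear in more than one file. Each input line is "custid:date count".
for i in 0 1 2 3 4 5 6 7 8 9
do
    cat perday.txt.$i
done | nawk '
{
    s[$1]+=$2
}
END {
    for(i in s) {
        print i,s[i]
    }
}' > perday.txt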
To my surprise, customer #1664010 rated 5446 times on 2005-10-12!! I think I have my answer for finding the 'non-genuine' data. However, what should the cut-off point be for the number of movies rated per day? Okay, let's plot the cumulative graph. I realised that if I keep only the ratings from customer-days with fewer than 50 movies rated, I can reduce the data set by 37%. Now I have to wait for the SQL DELETE to complete before I can prove my concept. In the meantime, the calculation continues.
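For what it's worth, the cumulative figures behind that graph can be derived with one more small awk pass over the merged file. This is just a sketch of one way to do it, assuming the perday.txt produced by the merge step above (the hist array and the cumulative.txt output name are mine):

#! /bin/sh
# For each per-day rating count c, print the fraction of all ratings
# that come from customer-days with at most c ratings. Plotting the
# second column against the first gives the cumulative graph.
nawk '
{
    hist[$2]+=$2      # ratings contributed by customer-days of this size
    if($2>max) max=$2
    total+=$2
}
END {
    kept=0
    for(c=1; c<=max; c++) {
        kept+=hist[c]
        printf "%d %.4f\n", c, kept/total
    }
}' perday.txt > cumulative.txt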