Wednesday, December 26, 2007

Netflix, While I Am Waiting, .... Part 2

I was hoping that my Netflix run would have made better progress after my 10-day vacation overseas. To my disappointment, it did not achieve much. I am only getting 3 predictions based on my 'buddy' information for every 10 runs. Although I have completed almost 70% of the movie files, that covers less than 9% of all the predictions. If I did not change my approach, I would have to wait a whole year before I could make my first submission (if the contest is still open by then).

As I mentioned in my previous blog, there are some strange customers who rated more movies than anyone else, and it is almost impossible for a human being to watch (let alone rate) that many movies over the last 6 years. So the question is: can I simply ignore these customers? Do these customers provide any genuine data? If so, what is considered genuine and what is not?

If I could summarise the whole training set using the customer id and date as the index, I should be able to see how often someone rates movies on a per-day basis. However, we are talking about 100 million records, and 'awk' will likely not be able to handle that in a single run. If I split the work into some manageable sets, I should be able to overcome this hurdle. Here is my script to handle it in 10 separate awk runs:

#! /bin/sh

# Split the work into 10 awk runs, grouping the movie files by the
# last digit of their file name, and count ratings per customer per day.
for i in 0 1 2 3 4 5 6 7 8 9
do
        for j in mv_*$i.txt
        do
                cat $j
        done | nawk '
BEGIN {
        FS=","
}
# Only count the rating records (CustomerID,Rating,Date); the per-movie
# header lines ("MovieID:") have no commas and therefore a single field.
NF == 3 {
        ind=sprintf("%s:%s",$1,$3)      # key: customer id + rating date
        ++s[ind]                        # ratings by this customer on this day
}
END {
        for(i in s) {
                print i,s[i]
        }
}' > perday.txt.$i
done

To merge the 10 files together, you can use a similar for loop to cat them together and feed the result into another awk run, pretty much like the above (a sketch follows below).
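Here is a minimal sketch of that merge step, assuming the perday.txt.* files produced above (each line being "customerid:date count"); the perday.all.txt output name and the final sort, which simply puts the busiest customer-days on top, are my own additions:

#! /bin/sh

# Merge the 10 per-run summaries into one ratings-per-customer-per-day file.
# The same "customerid:date" key can show up in several perday.txt.* files,
# because a customer may have rated movies from different file groups on the
# same day, so the counts have to be added up.
for i in 0 1 2 3 4 5 6 7 8 9
do
        cat perday.txt.$i
done | nawk '
{
        s[$1] += $2
}
END {
        for(i in s) {
                print i,s[i]
        }
}' | sort -k2,2nr > perday.all.txt

# The first line of perday.all.txt is then the busiest customer-day.
head -1 perday.all.txt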

To my surprise, customer #1664010 rated 5446 movies on 2005-10-12!! I think I have my answer for identifying the 'non-genuine' data. However, what should the cut-off be for the number of movies rated per day? Okay, let's plot the cumulative graph. I realised that if I keep only the ratings from days where a customer rated fewer than 50 movies, I can reduce the data set by 37%. Now I have to wait for the SQL DELETE to complete before I can prove my concept. In the meantime, the calculation continues.
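For reference, the figures behind that cumulative graph can be pulled straight from the merged per-day counts. A minimal sketch, assuming the perday.all.txt file from the merge step above (the "customerid:date count" line format comes from the earlier scripts, the cumulative.txt name is my own):

#! /bin/sh

# For each per-day rating count, show how many ratings (cumulatively) come
# from customer-days at or below that count, as a fraction of all ratings.
nawk '
{
        ratings[$2] += $2       # ratings contributed by customer-days with this count
        total += $2
        if ($2 > max) max = $2
}
END {
        for (c = 1; c <= max; c++) {
                if (c in ratings) {
                        cum += ratings[c]
                        printf("%d %d %.4f\n", c, cum, cum/total)
                }
        }
}' perday.all.txt > cumulative.txt

Reading off where the cumulative fraction flattens out (around 0.63 at 49 ratings per day, going by the 37% reduction mentioned above) is what suggests the cut-off of 50 movies per day.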

