Tuesday, December 11, 2007

Netflix, My Performance Journey

My ex-colleague introduced me the Netflix Prize 3 months ago. As quoted from the website, "The Netflix Prize seeks to substantially improve the accuracy of predictions about how much someone is going to love a movie based on their movie preferences. Improve it enough and you win one (or more) Prizes. Winning the Netflix Prize improves our ability to connect people to the movies they love.". The prize is US$1,000,000 and a job in Netflix.

I downloaded the dataset in tar.gz format of is 697,552,015 bytes. The training set comprises of 17,770 movies rated by 480,189 unique customers. All in all, it consists of 100,480,507 ratings. These movies were rated between 11 Nov 1999 to 31 Dec 2005.

Your job is to work out the 2,817,131 predictions out of 17,470 movies with 478,615 unique customers. That means 300 movies and 1,574 customers are not included in the prediction. Why, I have no idea at this moment.

As you already know, I am going to tackle this problem with my favourite programming language. With such a large dataset, a database is definitely a must. There are a couple of databases that I worked with before with Tcl. They are Metakit, MySQL and Oracle. However, none of them appeal to me as the "ideal" candidate. Metakit may not be able to scale to 100 million records. Oracle requires a license to run and installation/operation may require expertise that I may not be able to acquire within a short period. MySQL may not be able to provide me with the performance that I need.

I knew about this file-based SQL database called SQLite some time ago but do not have any chance to work on it. It is written by D. Richard Hipp, a long time Tcl developer. This file-based database has been documented in the website that it is faster than PostgreSQL and MySQL. So, this is a golden opportunity to 'test drive' this "in-process library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine". FYI, SQLite comes with C/C++, Java, Perl, Python bindings/API. If you want to find out more, watch this video for an introduction.

I am no database person and therefore the best way to populate the database is to have one table to store all the records. Each record with 4 different fields, movie id, customer id, rating, and date. It took more than a week to populate the entire database before I can start doing any work. It is absolutly frustrating to wait for the database to finish. The initial runtime for just one prediction took almost 19 minutes. This is definitely unacceptable. Due to work commitment, I am kind of losing interest in this exercise.

I talked to my ex-colleague recently and he got me interested again. This time, I focused more on the approach and design of the program. I spent quite a bit of time reading up on the SQLite and realised that why it took 7+ days to populate and long run time. By default, SQL insert in SQLite is synchronous. By committing the database transaction for every 1000 inserts, I am able reduce the population time to just a couple of hours. Also, I introduced couple of table indexes. Although this will create a larger database (6.5GB), it speeds up the query tremendously . Instead of 19 minutes to process a prediction, I am able to reduce it down to a minute.

Another performance trick that I used is to compile Tcl and SQLite with SunStudio 12 with optimisation flag (-O) turn on and configure it to create binary for the target platform (-xarch=amd64). SunStudio, a production grade (now free of charge) compiler is definitely better than the gcc compiler.

My first version of the program is single-threaded and I have to run a few copies of it in order to take full advantage of all the CPUs. Once the logic is correct, I converted it to a multi-threaded version. In order to speed up the entire calculation, I need to take advantage of whatever relationship that I established for the future calculation. I term this as "buddy". If customer-1 and customer-2 like the same set of movies, they are consider to be "buddy". In the future prediction, if these two customers happen to be involved in the calculation, I can make use of these buddy information. A weight factor is also given to differentiate the "closeness" of these buddies.

The multi-threaded version speeds up the calculation but it gives rise to another problem. Now the issue becomes an I/O problem and I can see that all the threads are waiting for SQLite database to respond. By moving the calculation to a better disk sub-system server, SunFire X4500 (with 4 CPU cores), it provides better performance than a SunFire X4600 with 16 CPU cores with SAS disks.

In each thread calculation, I stored all the "buddy" information in the shared thread variable. To avoid threads overwriting information, the shared variable is locked during update. However, this slowed thing down because each thread has to wait to acquire the lock before it can proceed. In Solaris, you can use "prstat -m" to find out the microstate process accounting to identify the percentage of time the process/thread has spent waiting for locks.

To overcome this lock acquiring, I stored the "buddy" information in a variable that is unique to the thread id. In this case, each thread does not have to 'lock' the variable 'cos the variable name is unique to their own thread. With this, I am able to bring the performance down to 8 seconds using 4 CPU cores. Hey, that is about 140+ times improvement overall.

According to some of my initial benchmarks, movies rated by more customers will take a longer time to process. Another trick that I hope will give me an edge is to arrange the order of processing. Less rated movies will be processed first so that I can accumulate more "buddy" data to shorten the time for the heavily rated movies. You may argue that the information may not be accurate, but this is a trade-off between accuracy and performance.

The whole process is not really about getting the prize, but the journey of continuously pushing the performance envelop limit is really satisfying. I started the entire run on 6 Dec and I can see 20% of the prediction is based on my previous "buddy" information. Stay tune for more blog on this Netflix. Cheers.

Labels: , ,


Post a Comment

<< Home