
Statistics Can Find You a Movie, Part 2

by: Staff with Robert Bell, Yehuda Koren, Chris Volinsky, May 19, 2010

  Part 2 of a 2-part series (Statistics Can Find You a Movie, Part 1)

This article is a continuation of a behind-the-scenes account of the Netflix Prize told from the viewpoint of BellKor, the AT&T Research team made up of computer scientist Yehuda Koren and statisticians Robert Bell and Chris Volinsky. The Netflix Prize was a statistical contest in which teams competed to be the first to build a recommender tool that outperformed the Netflix-developed Cinematch by 10%. At the end of Part 1, BellKor had just won the first progress prize by achieving an impressive 8.43% improvement after a single year.

 

On November 13, in accordance with contest rules, BellKor, as the winner of the first progress prize, published a paper detailing its methods:

How combining predictions from multiple, complementary models improved performance, with one model’s strengths compensating for the weaknesses of others.

How results from SVD (singular value decomposition) models—useful for predicting global, overall trends—were combined with results from nearest-neighbor models, useful for picking up strong local interactions of a small set of related DVDs.

How global effects normalized the biases among customers, some of whom rated generously and others more critically.

How ridge regression controlled the overfitting that inevitably results from large, complex models.

And how the 107 final outputs were linearly blended according to each model’s individual contribution. 
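The paper spells out how that blend was fit; purely as an illustration of the idea (not BellKor's actual code or data), a linear blend can be estimated by regressing the true ratings of a held-out probe set on each model's predictions, with a ridge penalty keeping the weights stable. Everything below, from the simulated model outputs to the penalty value, is a placeholder.

    import numpy as np

    # Hypothetical setup: each column of P holds one model's predictions on a
    # held-out probe set; y holds the true ratings for those same customer-DVD pairs.
    rng = np.random.default_rng(0)
    y = rng.uniform(1, 5, size=1000)                     # placeholder "true" ratings
    P = np.column_stack([y + rng.normal(0, s, 1000)      # placeholder model outputs
                         for s in (0.90, 0.95, 1.00)])

    lam = 10.0                                           # ridge penalty to control overfitting
    A = P.T @ P + lam * np.eye(P.shape[1])
    w = np.linalg.solve(A, P.T @ y)                      # blending weights

    blended = P @ w                                      # blended predictions
    rmse = np.sqrt(np.mean((blended - y) ** 2))
    print(w, rmse)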


In a one-way free and open exchange of information, BellKor gave away to all takers a year’s worth of statistical modeling and parameter tweaking. Anyone with programming skills and familiarity with statistics could understand and even build their own recommender tool.

And any team happening to read the paper could apply the information supplied by BellKor and incorporate it into their own methods. Teams with different methods would benefit the most since they were being handed a brand-new information source to add to their own.

BellKor provided even more information in two papers submitted to the KDD data mining conference, and the team’s responses to specific questions on the forum (especially those of Yehuda Koren) provided additional details and context.

Temporal effects

Not long after publication, some new teams started appearing in the top 20, and in early February, BellKor lost its hold on the lead to rival When Gravity and Dinosaurs Unite.

BellKor was in need of new information to exploit. During the first year, the team had made steady progress with almost every new idea it tried. But now well into the second year, new ideas were harder to come by, and the ones that did come along contributed less improvement, and sometimes none at all.

The models had put the team in first place, and while the models would be continually refined during the three years of the contest, refinements alone would not produce any new big gains. What was needed was an entirely new source of information. And presently, one appeared.

Much earlier, in September, the team basho (John Tomfohr) had blogged about achieving an RMSE of 0.8805 with a single model, without any blending. That got BellKor’s attention; no single model of BellKor’s was capable of 0.8805. Based on experience with the data, Koren knew such results were not possible by pure algorithmic modeling of the ratings data alone. basho had discovered some new way of interpreting the data. Koren suspected temporal effects.

Temporal effects introduce the concept of change over time: kids grow up, people’s tastes evolve, and customers become more or less generous when rating DVDs. The perception of DVDs also changes, with some gaining in popularity as time passes (The Big Lebowski) while others lose it (Revenge of the Transformers). Such gradual changes are easy to model, and were already incorporated to some extent in year-1 models as global effects (see the team’s first Progress Prize paper). Global effects assumed little day-to-day variation in an individual’s ratings.

But in a February Wired article, Gavin Potter (team Just a Guy in a Garage) alluded to temporal effects that were not gradual but abrupt, such as when members of a household share an account but have different tastes in DVDs.


Individual customers, Potter reasoned, were also likely to exhibit abrupt changes. This was borne out in the data, which showed customers rated DVDs differently depending on whether they had just seen the DVD or were rating a bunch of DVDs seen a while ago.  (At any time, customers can batch-rate DVDs to help the Netflix system “learn” their preferences.)

People also change their opinion of DVDs in relation to other DVDs they’ve seen recently; they might rate a DVD differently if they see it after watching a bunch of disappointing DVDs in a row  (this is the anchoring effect cited by Potter). A customer’s mood or friends’ opinions might also temporarily change a customer’s usual rating behavior.

Robert Bell was initially hesitant about tackling temporal effects. The models were already complex; incorporating time would make them even more so. To illustrate, take this very simple equation that derives a single customer-DVD (item) prediction by adding an item effect (M_i) and a user effect (M_u). (A user effect, or bias, is the tendency of a customer to rate above or below the average, while an item effect is how much a DVD’s ratings deviate from the average.)

         p_ui  =  M_i  +  M_u

Incorporating temporal effects changes the equation by making the prediction depend on a time period:

           p_ui(t)  =  M_i(t)  +  M_u(t)

Plugging in January 2005 for t may yield a different prediction than plugging in April 2004.

 

[Figure: examples of DVD and customer ratings shifting over time]

Not all ratings for DVDs and customers showed such dramatic change over time, but some clearly did. And the team’s previous approach of simply averaging all the data points for a rating would not capture the dynamic quality of the data.

The amount of increased complexity depended on the granularity of the time intervals. For DVDs, BellKor divided the training data period (October 1999 to December 2005) into 30 intervals, or bins, each covering 10 consecutive weeks, a span over which attitudes toward a DVD barely change.

For customers, who exhibit abrupt changes in their behavior, much finer granularity was needed; in fact, each rating date was treated separately. Since customers rated on 40 different days on average, temporal effects produced a roughly 40-fold increase in the number of customer factors. With BellKor working with between 40 and 200 factors, the equations (much more complicated than the simple one shown here) would need to be repeated for each factor and for each time interval. When Netflix competitors talk about billions of factors, they’re not exaggerating, just carefully counting every combination of factor and temporal interval.
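As a concrete sketch of the bookkeeping behind the p_ui(t) equation above (illustrative only, with made-up bin sizes, start date, and effect tables rather than BellKor’s actual model), each rating date maps to a 10-week bin for the DVD effect and to an individual day index for the customer effect:

    from datetime import date
    from collections import defaultdict

    DATA_START = date(1999, 10, 1)        # approximate start of the training period

    def item_bin(rating_date, weeks_per_bin=10, n_bins=30):
        # DVDs get coarse granularity: one bin per 10 consecutive weeks.
        weeks = (rating_date - DATA_START).days // 7
        return min(weeks // weeks_per_bin, n_bins - 1)

    def user_day(rating_date):
        # Customers get the finest granularity: every rating date is its own index.
        return (rating_date - DATA_START).days

    # Illustrative effect tables; in the real models these are learned from the data.
    item_effect = defaultdict(float)      # keyed by (item_id, bin)
    user_effect = defaultdict(float)      # keyed by (user_id, day)

    def predict(user_id, item_id, rating_date, overall_average=3.6):
        # p_ui(t) = M_i(t) + M_u(t), with the effects expressed as deviations
        # from an overall average rating (the 3.6 here is just a placeholder).
        return (overall_average
                + item_effect[(item_id, item_bin(rating_date))]
                + user_effect[(user_id, user_day(rating_date))])

    print(predict(user_id=42, item_id=7, rating_date=date(2005, 1, 15)))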

Temporal effects also increased the risk of overfitting since abrupt changes over a short time period constituted a small pattern that was very specific to that set of data, and thus unlikely to fit another set of data. But after long exposure to large models, the team had gained confidence that the resulting overfitting could be managed; it was certainly worth the effort to do so. Temporal effects improved each single model from around 0.8900 to less than 0.8800, an improvement that held up even after blending, which could have diluted the effect.


Information the team gleaned from temporal effects was also incorporated into the existing global effects, producing some additional improvement.

In total, temporal effects boosted BellKor’s score by about .75%, a big jump at this stage, propelling the team back into the lead. But the team was hoping for another big boost. In September, Yehuda Koren would be leaving AT&T Research for Yahoo! Research and moving to Israel. He’d be preoccupied for a while and wasn’t yet sure how much time his new job would allow him to spend on the contest. Another big boost might get BellKor the win.

One sure way of getting a boost: collaborating with another team. 

BellKor + BigChaos

Teams that collaborated always improved over their individual scores, provided each team brought different methods to the table.

In mid-2008, BellKor approached several teams about partnering, including the current second-place team, BigChaos (Austrians Andreas Töscher and Michael Jahrer). While it might seem obvious to combine with a team that was already doing well, there would be no gain if both teams were simply duplicating one another’s methods. A 15th-place team with something new would be a better match.

How to know a team’s methods? Short of simply divulging them, which few teams were prepared to do, another way was to compare the two sets of predictions. If they were correlated very highly (that is, for each customer-DVD pair, they predicted roughly the same rating), it could be assumed the teams were using the same methods; however, if the individual predictions were different but both produced low RMSEs, it signaled different methods at work.

So that teams wouldn’t have to reveal their actual predictions, Robert Bell proposed that each team add noise to its data (see sidebar). The teams could then exchange their noisy predictions, perform some statistical calculations to account for the noise, and evaluate how similar the other team’s predictions were to their own.
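The sidebar describes the noise scheme itself; the sketch below illustrates the underlying statistical idea rather than Bell’s actual formula. Noise added independently to each team’s predictions leaves the covariance between the two prediction sets unchanged while inflating each variance by a known amount, so the correlation of the underlying predictions can be backed out. All data here is simulated.

    import numpy as np

    rng = np.random.default_rng(1)

    # Simulated "true" predictions from two teams for the same customer-DVD pairs.
    true_a = rng.uniform(1, 5, 100_000)
    true_b = 0.7 * true_a + 0.3 * rng.uniform(1, 5, 100_000)   # partly similar methods

    # Each team shares only a noisy version (uniform noise on [-2, 2], centered on 0).
    noise_var = 4.0 ** 2 / 12.0                  # variance of Uniform(-2, 2) is 4/3
    noisy_a = true_a + rng.uniform(-2, 2, true_a.size)
    noisy_b = true_b + rng.uniform(-2, 2, true_b.size)

    # Independent noise doesn't change the covariance, but it inflates each variance
    # by noise_var, so subtract that back out before computing the correlation.
    cov = np.cov(noisy_a, noisy_b)[0, 1]
    est_corr = cov / np.sqrt((noisy_a.var(ddof=1) - noise_var) *
                             (noisy_b.var(ddof=1) - noise_var))
    print(f"estimated correlation {est_corr:.3f} vs actual {np.corrcoef(true_a, true_b)[0, 1]:.3f}")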

Through this exercise, BellKor identified BigChaos as the best fit: it had both a low RMSE and methods of its own that differed from BellKor’s. Part of BigChaos’s secret was sophisticated blending through the use of neural networks. Where BellKor used simple, linear blending based on each model’s individual performance, BigChaos had determined that the individual RMSE of a single model was not the best indication of how much that model would add to the blend. Using neural networks, BigChaos optimized the blend directly for the lowest RMSE.

With the teams agreeing to partner, AT&T drafted a contract that was signed by both parties, forming the new team BellKor in BigChaos. From then on, BellKor turned over its outputs to BigChaos for blending.

The second Progress Prize and paper

The BellKor and BigChaos collaboration gave the combined team enough of a boost (.25%) that it was comfortably in the lead both the day before and the day after the deadline for the second Progress Prize. Official notification of the win again arrived minutes after the deadline.

[Image: the Netflix notification email]

 A new paper was written and published on December 13, 2008, and in this way many teams learned for the first time about the importance of temporal effects. 

It had taken a full year and another team to raise improvement from 8.43% to 9.44%.

Another team, another boost

In early 2009, the team Pragmatic Theory (Martin Piotte and Martin Chabbert) started moving to the top of the Leaderboard. This Quebec team specialized in offbeat predictors, finding customers who rated differently on a Monday than a Friday, or even investigating whether the length of a title affected a customer’s rating.

By themselves, such factors can’t predict a movie rating, but they each add a little bit of information. Hundreds of such little bits of information, each carefully weighted, produce slightly better results. And BellKor in BigChaos, closing in on 10%, would take extra information wherever it could be found.

BellKor in BigChaos approached Pragmatic Theory about collaborating. The contest was now in its third year, and it was becoming clear that BellKor in BigChaos probably wouldn’t reach 10% by itself, at least not anytime soon. Thinking along the same lines, Gravity and Dinosaur Planet (the two teams that had collaborated as When Gravity and Dinosaurs Unite) formed a new team, Grand Prize Team, specifically to facilitate the team-building process; the team’s website invited other teams to upload predictions and even spelled out how prize money would be distributed.

There was one sticking point to a merger between Pragmatic Theory and BellKor in BigChaos: Pragmatic Theory was technically prohibited from competing. Netflix had strictly interpreted Quebec’s laws governing sweepstakes and other games of chance and made the province ineligible. Pragmatic Theory appealed to the company; the lawyers re-examined their initial decision and changed the rule, allowing Pragmatic Theory to compete.

More lawyering was needed to hammer out contracts, a more drawn-out process this time than last: there were now four parties involved, each with its own set of lawyers, including AT&T, Yahoo! (where Yehuda Koren was now working), and BigChaos’s Commendo Research firm. After a mere four months, all parties emerged satisfied, and the new team, BellKor’s Pragmatic Chaos (BPC), came into being, discreetly.

(The day before the last of the paperwork was signed, Pragmatic Theory moved into first place.)

The new team’s existence would remain a secret for the time being; both Pragmatic Theory and BellKor in BigChaos continued posting under their own team names, but some of those posts were actually the combined predictions for BPC. Progressively the BPC posts got closer and closer to the 10% goal, though this was not apparent on the Leaderboard, which seemed to show BellKor in BigChaos stalled at 9.71% and Pragmatic Theory at 9.8%; but this was a fiction invented to keep the other teams in the dark.


BPC was disguising its true results by adding noise (see sidebar) to inflate the posted RMSE. Knowing exactly how much noise was being added, BPC simply made calculations to derive the true RMSE from the posted one.  The team could see clear, gradual progress.

And at the beginning of June, the team reached the 10% goal, though the news was kept within the team. BPC didn’t want to trigger the last-call period until it was certain all team members would be available for the next 30 days. Once a 10% posting appeared, other teams were likely to pull out all the stops to win.

On June 26, BPC went public and submitted a 10.05% prediction set.

The home stretch: Fixing the overfitting

For the next two weeks, the forum was alive with teams coming together in a last-ditch effort to take the lead away. The Grand Prize Team added members and surged to 9.91%, and Vandelay Industries combined with Opera Solutions to post 9.89%.

[Image: the Leaderboard with 11 days to go]

And then things suddenly quieted. Obviously something big was up. A team or maybe more would certainly post a better result; how much better was the question.

BPC was not sitting around wondering about it. The team had to fix the overfitting caused by its large, complex models. The solution had always been to “shrink” the predictions (move them toward the average) to make them more general. If a prediction is 1 star higher than average, the fix is to shrink it back by some factor. It was that “some factor” that had to be figured out.
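A minimal sketch of what shrinking toward the average looks like (the factor and numbers here are made up, not BPC’s):

    import numpy as np

    def shrink(predictions, factor, center=None):
        # Pull each prediction part of the way back toward a center value,
        # e.g. the overall average rating. factor = 0 leaves predictions alone;
        # factor = 1 collapses them all to the center. Choosing the factor well
        # was the hard part.
        if center is None:
            center = predictions.mean()
        return center + (1.0 - factor) * (predictions - center)

    preds = np.array([4.8, 1.2, 3.9, 2.5])
    print(shrink(preds, 0.1))   # each prediction moves 10% of the way to the mean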

With their pre-10% submissions, BPC adjusted shrinkage according to the Leaderboard feedback, increasing or decreasing the shrinkage until hitting the perfect sweet spot. But the Leaderboard was based on the quiz set; the winner would be determined by the secret test set (see sidebar). Leaderboard feedback wasn’t going to help determine how much shrinkage was needed for the test set.

Without having a way to evaluate the right amount of shrinkage against the test set, Bob Bell had to come up with a theoretical basis.

In statistics, there are formulas for predicting overfitting according to the complexity of the model (the higher the complexity, the greater the overfitting), and BPC’s models were very complex. In his research, Bell was able (luckily) to find a formula showing that the complexity-overfitting relationship was not one-to-one. Based on the formula, he decided to do more shrinkage than initially planned.

Other work continued also. Yehuda Koren worked closely with Pragmatic Theory to find additional parameters to tweak, and BigChaos continued perfecting the blending. These efforts had tangible results, raising BPC’s result to 10.08% on the Leaderboard. It wouldn’t be known until the end whether the team was doing the right amount of shrinkage.

The last two days

Then, on the day before the deadline, almost as predicted, another team posted an over-10% submission: 10.09%. The submitting team, The Ensemble, was new, but the individual sub-teams were not. The Gravity-Dinosaur Planet-Grand Prize Team consortium had merged with the Vandelay-Opera Solutions one; altogether 23 former teams and individuals made up The Ensemble. And now they were on top of the Leaderboard, ahead of BPC by a mere .01%.

[Image: diagram of how teams combined to form The Ensemble]

The Ensemble’s submission actually brought some relief to BPC, where the fear had been for a 10.15% or higher submission. 10.09% was within reach.

The next day, with 25 minutes remaining, BPC submitted again, tying The Ensemble at 10.09%. With four minutes remaining, The Ensemble submitted once more, achieving 10.10%. (There was scrambling on both sides. A member of The Ensemble had trouble locating, then uploading, the team’s best prediction set, inconveniently located on the same server as the team’s now-busy website. For a play-by-play account of their final submission countdown, go here.) The contest ended with The Ensemble at the top of the Leaderboard and BPC in second place.

It is important here to point out again that the Leaderboard showed the results against the quiz set; the actual winner would be determined by the secret test data set. No one knew which team had scored better on the test set. BPC suspected strongly that The Ensemble was overfitting (just as BPC had been doing; unlike The Ensemble, BPC had an entire month to fix it), and was not too concerned about appearing second on the Leaderboard.

[Image: the Leaderboard at the close of the contest (not the final results)]

The contest ended at 18:42 UTC (2:42 PM Eastern). BPC waited. Five minutes passed; then five more. As more quiet minutes slipped by, worry set in; prize announcements had always arrived within five or ten minutes before. Finally, 58 minutes after the deadline, an email from Netflix: BPC had won; the team had scored 10.06% against the hidden test set. It wasn’t official; results would have to be validated, contracts signed, intellectual property issues ironed out, etc., etc., etc.

For the others, there was only the silence on the Leaderboard to indicate that at least one team had achieved 10% against the hidden test set. Not until September 21 would the public announcement be made.

What the announcement didn’t say was how close the contest really was. Only after a member of The Ensemble computed both final predictions did BPC learn that The Ensemble had also achieved 10.06%; both teams had the exact same RMSE of 0.8567. It was BPC’s earlier posting time that had clinched the win. (The RMSE scores reported by Netflix were rounded to four decimal places. Measuring at five decimal places revealed that BPC had an ever-so-slight lead: 0.856704 to 0.856714.)

Post-script

The contest has been over for almost a year and can be safely judged to be a business-textbook success, delivering exactly what it was designed to deliver: a recommender method that performed 10% better than Netflix’s tool.

In the process, the contest demonstrated that crowd-sourcing can be a viable format for solving sophisticated, scientific problems.

And it advanced knowledge in the field of recommender systems, showing unequivocally that a disparity of approaches works better than a small number of well-tuned techniques. Disparity could be achieved by using complementary models or by combining teams in which different people approached problems differently. The highest-scoring teams did both.

Other lessons may be more elusive. The contest was exceptionally well-run and it was designed for a single objective: achieve the lowest RMSE possible with a data set of 100 million DVD ratings. Whether or not the crowd-sourcing format works for other objectives is an open question. Nor is it clear if the disparity of approaches and predictive modeling techniques that won the Netflix prize can extend to solving other problems, such as improving education or finding a cure for cancer.

Public policy challenges, for example, require transparency; it’s important to identify which factors contribute to good or bad results. The large models that proved so necessary for the Netflix contest are something of a black box whose inner workings aren’t really understood. BellKor members can’t tell you the precise factors used to evaluate customers and DVDs. Blending, another critical step, is just as opaque.

Such questions will no doubt be at the center of research projects and graduate-level theses in the coming years. But it’s unlikely they will be addressed in another contest as open and with such a large dataset as this one. A second Netflix prize contest, this one involving demographic data, has been scuttled by privacy concerns.

So further research into how best to apply large-scale predictive modeling may have to be done behind closed doors with data sets kept secret, without the public participation, the drama, or the inspiration of the Netflix Prize contest.

The rest of us will miss out on the fun.

Related links:
TechChannel: The Netflix Prize
BellKor Home Page
Netflix GMap by Yifan Hu (interactive version)


 

  In Brief

Number of Netflix customers in training set:
480,189

Training dataset:
100,480,507 ratings

Probe dataset:
1,408,395 ratings

Qualifying data set:
2,817,131 ratings
(divided 50-50 into quiz set and test set; quiz set results posted to Leaderboard)

Rating dates for training set:
12/99 to 12/05

Rating scale:
1 - 5 stars

Beginning RMSE:
0.9514

RMSE to win:
0.8572

Contest start date:
Oct. 2, 2006

Number of teams submitting:
5,169

Number of submissions:
44,014

  RMSE

(root mean square error)

A measure of accuracy derived by comparing predicted values to actual values and taking the square root of the mean squared error.

This plot shows predicted values along a fitted line with actual values shown here as points.

 

[Figure: actual values plotted as points around a fitted line of predicted values]

The RMSE is, roughly, the typical vertical distance of a data point from the fitted line (which represents the predictions made by the model).

To calculate RMSE, for each point take the distance vertically from the point to the corresponding y value on the line (the error), and find the square of the value. Add up all those values for all points, and divide by the number of points. Finally, report the square root of the result. Squaring prevents negative values from canceling positive ones.
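That recipe translates directly into a few lines of code (a generic illustration, not Netflix’s scoring script):

    import numpy as np

    def rmse(predicted, actual):
        # Square the errors, average them, then take the square root.
        errors = np.asarray(predicted) - np.asarray(actual)
        return np.sqrt(np.mean(errors ** 2))

    print(rmse([3.5, 4.0, 2.0], [4.0, 4.0, 1.0]))   # about 0.645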

 

 

Disguising predictions with noise

When evaluating other teams, BellKor altered each prediction by adding a randomly generated number between -2 and 2, with a distribution centered on 0.

[Table: sample predictions before and after adding noise]

Since the noise is centered on 0, the additions and subtractions across the prediction set roughly cancel out.

Predictions are wrong, but in a mathematically trackable way, and Robert Bell developed a formula for deriving, with some accuracy, a team's original predictions from the noisy ones.

Data was also disguised to inflate the RMSE posted on the Leaderboard (to hide how close BellKor's Pragmatic Chaos was getting to the magic 10%). In this case, the team simply added 1/4 point to half of its predictions and subtracted the same amount from the other half (unless the altered prediction would fall outside the range 1-5).

[Table: sample predictions with 1/4 point added or subtracted]

 

The resulting RMSE was higher than the actual one, but the team, knowing the exact amount of added noise, could calculate the true RMSE.
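A rough sketch of that back-calculation (not the team's actual formula): if the added offsets are independent of the prediction errors and the 1-5 clipping is ignored, the noisy mean squared error is approximately the true one plus the mean square of the noise, so the true RMSE can be backed out.

    import math

    def true_rmse_from_noisy(noisy_rmse, noise_magnitude=0.25):
        # Assumes each prediction was shifted by +/- noise_magnitude independently
        # of its error, and ignores clipping, so the result is only approximate.
        return math.sqrt(noisy_rmse ** 2 - noise_magnitude ** 2)

    print(true_rmse_from_noisy(0.8950))   # a hypothetical posted score; about 0.859 true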

 

 

  Scoring

The roughly 2,800,000-rating qualifying set was divided by Netflix into two equal-size subsets: a quiz set and a final test set. Results against the quiz set were posted immediately to a public Leaderboard, which listed the standings of the teams. The final test set was reserved for determining the winners of the progress and final prizes. Only Netflix knew which predictions belonged to the quiz set and which to the test set.

Winning a progress prize or the grand prize depended on having the best score against the test set (not the quiz set).

It's important to point out that the Leaderboard reflected only the quiz-set results, not the test-set results. Teams would not know how they had scored against the final test set until after the competition was over.

 

[Diagram: how the qualifying data was split into quiz and test sets]

 

  Some top contenders
  of year 2

BellKor
AT&T Research team. Winner of the 2007 progress prize (under the name KorBell). Later combined with the Austrian team BigChaos to form BellKor in BigChaos; this team won the second Progress Prize. In the last year, BellKor in BigChaos combined with the Montreal team Pragmatic Theory to form BellKor's Pragmatic Chaos, the eventual winner of the grand prize.

Yehuda Koren. A computer scientist in the information visualization group who at the time was applying linear algebra to network visualizations. Koren started the competition while at AT&T Research, then moved to Yahoo! in September 2008 while remaining on the  BellKor team.

Robert Bell. A statistician specializing in machine learning methods, analysis of data from complex samples, and record linkage methods.

Chris Volinsky. A statistician with expertise applying statistical and data mining techniques to the real-world problems of massive data sets. Also Director of the Statistics Research Department.

 

BigChaos
Andreas Töscher and Michael Jahrer, machine learning researchers and founders of commendo research in Austria. This team developed sophisticated blending techniques using neural networks and combined with BellKor in the summer of 2008 to form BellKor in BigChaos, the team that won the second Progress Prize. Later BellKor in BigChaos combined with Pragmatic Theory.


Pragmatic Theory
The Quebec team of Martin Chabbert and Martin Piotte, two computer engineers with no previous experience in data mining or machine learning. This team, specializing in offbeat predictors that integrated human intuition, achieved the number-one ranking just before combining with BellKor in BigChaos.

 

Gravity
Team led by Gábor Takács, a well-known data mining expert, and three PhD candidates from the University of Budapest. The team was the frontrunner between January and May 2007 and in October 2007 combined with Dinosaur Planet (forming When Gravity and Dinosaurs Unite) to challenge BellKor for the First Progress Prize.
Later, a founding member of the Grand Prize Team.

 

Dinosaur Planet
Made up of Princeton undergraduates. The team was always among the top teams during the first year and combined with Gravity to almost win the First Progress Prize. Later, a founding member of the Grand Prize Team, which merged with Opera Solutions and Vandelay United to form The Ensemble.

 

Grand Prize Team
Formed in January 2009 by Gravity and Dinosaur Planet (the two teams that had collaborated as When Gravity and Dinosaurs Unite), this team set up a website and invited other teams to join, with prize money to be distributed according to each team's improvement. Grand Prize Team later merged with Opera Solutions and Vandelay United to form The Ensemble.

 

The Ensemble
Tied BellKor's Pragmatic Chaos on the test set with a 10.06% improvement (and an RMSE of 0.8567), but placed second due to a later submission time. The Ensemble was formed when Grand Prize Team merged with Opera Solutions and Vandelay United toward the end of the 30-day last-call period.

 

Opera Solutions
A global management consulting firm whose team was among the leaders on the Leaderboard. This team first merged with Vandelay Industries! to form Opera Solutions and Vandelay United, which later merged with Grand Prize Team to form The Ensemble.

 

Just a Guy in a Garage (Gavin Potter)
An English psychologist, whose first entry in November 2007 was 7.15% (it had taken BellKor seven months to achieve the same score). He was featured in a Wired article, which bolstered Yehuda Koren’s developing ideas about using temporal effects.