UCI ICS 273A - Neighborhood-based approach - D1308666

Home> Schools> University of California, Irvine> (ICS) > ICS 273A> Neighborhood-based approach

DOC PREVIEW

UCI ICS 273A - Neighborhood-based approach

School name University of California, Irvine

Course Ics 273a- Machine Learning

Pages 7

This preview shows page 1-2 out of 7 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 7 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 7 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 7 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

Exploring collaborative filters:Neighborhood-based approachMinas [email protected] id: 42369077Fabio [email protected] id: 90007558AbstractIn this project, we study the effectiveness of collaborative filtering mechanisms in the context ofthe Netflix competition. We focus our attention on a dataset provided by Netflix which includes atraining set with more than 100 million 4-tuples: user id, movie id, rating, and date [3]. In the firstpart of this project, we develop a simple model to predict future ratings of users based on the ratingsof the same users on similar movies using k-nearest neighbors (k-NN) methods. In the second partof the project we also implement two of the k-NN methods which were proposed by the currentlyfirst team in the contest [1]. We test the performance of the implemented techniques on the wholeNetflix dataset.1 IntroductionIn recent years, the importance of automated systems that are able to analyze user interest in specific items and provideaccurate future recommendations has grown significantly. Some of the major e-commerce based companies, such asAmazon or Netflix, have made recommender systems a central component of their commercial strategies.In October 2006 Netflix launched a new competition to further enhance the performance of its recommender systemand promote deeper research in the area. As part of the competition, Netflix released a dataset which consists of around100 millions ratings by users for specific movies. The winner of the competition has to implement an algorithm whichoutperforms Netflix’s proprietary recommender system, Cinematch, in predicting future user ratings by more than 10percent. As an incentive to encourage participation in the competition the winning team will receive an award of$1.000.000.The consistent prize and the rich dataset have stimulated much interest in the Netflix competition. Within the first yearover 2600 teams posted their algorithm and tried to reach the 10% gap from the Cinematech score. Until March 18th,the best improvement obtained by a team was at 9% (precisely 9.01%). This team, known as BellKor (also previouslyKorBell), recently highlighted part of its algorithms after winning the progress price. The latter is an annual prize forthe team which reaches the highest score at the end of each year that the competition lasts.The final solution of the BellKor team consists of linear combination of 107 separate sets of predictions. However,retrospectively the members of the BellKor team acknowledged that “it is possible to achieve at least 7.58% improve-ment with a linear combination of three prediction sets that combine all the key elements discussed in their article:nearest neighbor models, neighborhood aware factorization, and latent factor models”. In the project we analyze theperformance of nearest neighbor models. We try both to implement some simple but new scheme, and to reproducethe same nearest neighbor model employed by the BellKor team [4].2 The Netflix CompetitionThe dataset released by Netflix consist of the ratings created by over 480,000 users on 17,770 movies. Each rating isan integer from 1 to 5. The date when the rating was created, which varies between 1999 and 2005, is also given.The overall dataset consists of more than 100 million entries. These entries have been partitioned in two parts: atraining set, and a hold-out set of about 4.2 million entries. Specifically, the hold-out part is made up of the last 9ratings provided by each user - or less if a user did not rate at least 18 movies in the temporal period examined. Thehold-out part is further (randomly) split in 3 parts, called Probe, Quiz, and Test set. The Probe set is attached with theTraining set and can be downloaded together with it. The main purpose of this Probe dataset is to provide participantswith an easy way to estimate the performance of their algorithm. The Quiz and the Test set are kept by Netflix andconsist of an evaluation set that participants are required to predict ratings for. Once a participant submits predictions,the Netflix system returns the root mean square error (RMSE) on the Quiz set which is defined as:RM SE(dataset) =sX∀(u,i)∈dataset(predict ion(u, i) − realrating(u, i))2(1)RMSE is also posted online in the leaderboard ([6]). However, in order to win the competition, the 10% improvementmust be achieved over the Test set, and results over this set are never disclosed by Netflix. This precludes cleversystems to somehow figure out which entries are present in the Test set by mean of repeated submissions.3 Dataset AnalysisIn this section we present our analysis of the Netflix dataset. In our analysis we consider the complete Training +Probe dataset, which requires about 2GB of hard disk storage uncompressed. Every entry in the dataset is representedby a 4-tuples: (userid, movie id, rating, date). The size of the dataset combined with the characteristics of the dataposes an interesting challenge for prediction.One of the first challenges encountered was the actual manipulation of such a large dataset. For such a huge dataset,tools such as Matlab or even python proved to be too slow in executing even simple algorithms in the amount oftime compatible with the constraints imposed by the project deadline. Thus, we had to resort to a low-level C/C++framework, icefox (available at [5]) to implement basic operations on the Netflix dataset (e.g. fast look-up of user’srating, movie’s rating, etc.. ). Leveraging the API provided by the icefox framework we implemented our algorithmsas a set of C/C++ modules which actually extend the icefox framework capabilities. Obviously, making use of the C orC++ language significantly increased the time required to code the algorithms, however, as aforementioned, we triedseveral programming languages, and this was the only solution which enabled us to perform efficient operations, anddata manipulation of such a large dataset.Another challenge posed by this competition is that, despite the large size of the dataset provided, about 99% of thepotential pairs user-movie have no rating! Consequently, many machine learning methods designed to work on densedata are likely to fail in providing good predictions for the Netflix dataset.Moreover, both the distribution of ratings per movie, and the distribution of ratings per user is extremely skewed. Forinstance from fig 1(a), there is a movie with 262144 ratings while the most common cases for movies are those

View Full Document


School:
Email:
New Password:
Confirm Password:

This preview shows page 1-2 out of 7 pages.

UCI ICS 273A - Neighborhood-based approach

Sign up for free to view:

Please select your school