Contents

- High Dimensional Data
- Example: Recommender Systems
- Recommendations
- From Scarcity to Abundance
- Sidenote: The Long Tail
- Physical vs. Online
- Types of Recommendations
- Formal Model
- Utility Matrix
- Key Problems
- (1) Gathering Ratings
- (2) Extrapolating Utilities
- Content-based Recommendations
- Plan of Action
- Item Profiles
- Sidenote: TF-IDF
- User Profiles and Prediction
- Pros: Content-based Approach
- Cons: Content-based Approach
- Collaborative Filtering
- Finding “Similar” Users
- Similarity Metric
- Rating Predictions
- Item-Item Collaborative Filtering
- Item-Item CF (|N|=2)
- CF: Common Practice
- Item-Item vs. User-User
- Pros/Cons of Collaborative Filtering
- Hybrid Methods
- Remarks & Practical Tips
- Evaluation
- Evaluating Predictions
- Problems with Error Measures
- Collaborative Filtering: Complexity
- Tip: Add Data

Recommender Systems: Content-based Systems & Collaborative Filtering
Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman
Stanford University
http://www.mmds.org

Note to other teachers and users of these slides: We would be delighted if you found our material useful for your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org

High Dimensional Data
- High dim. data: Locality sensitive hashing; Clustering; Dimensionality reduction
- Graph data: PageRank, SimRank; Community Detection; Spam Detection
- Infinite data: Filtering data streams; Web advertising; Queries on streams
- Machine learning: SVM; Decision Trees; Perceptron, kNN
- Apps: Recommender systems; Association Rules; Duplicate document detection

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

Example: Recommender Systems

- Customer X buys a Metallica CD, then buys a Megadeth CD
- Customer Y does a search on Metallica
- The recommender system suggests Megadeth to Customer Y from data collected about Customer X

Recommendations

[Diagram: a large set of items reaches users through two channels, search and recommendations]
- Examples: products, web sites, blogs, news items, …

From Scarcity to Abundance

- Shelf space is a scarce commodity for traditional retailers (also: TV networks, movie theaters, …)
- The Web enables near-zero-cost dissemination of information about products: from scarcity to abundance
- More choice necessitates better filters: recommendation engines
- How Into Thin Air made Touching the Void a bestseller: http://www.wired.com/wired/archive/12.10/tail.html

Sidenote: The Long Tail

[Figure: the long-tail plot of product popularity. Source: Chris Anderson (2004)]

Physical vs. Online
- Read http://www.wired.com/wired/archive/12.10/tail.html to learn more!

Types of Recommendations

- Editorial and hand curated: lists of favorites; lists of “essential” items
- Simple aggregates: Top 10, Most Popular, Recent Uploads
- Tailored to individual users: Amazon, Netflix, …

Formal Model

- X = set of Customers
- S = set of Items
- Utility function u: X × S → R
- R = set of ratings; R is a totally ordered set (e.g., 0-5 stars, or a real number in [0, 1])

Utility Matrix

         Avatar   LOTR   Matrix   Pirates
Alice      1               0.2
Bob                0.5               0.3
Carol     0.2                        1
David                      0.4

Key Problems

- (1) Gathering “known” ratings for the matrix: how to collect the data in the utility matrix
- (2) Extrapolating unknown ratings from the known ones: we are mainly interested in high unknown ratings; we are not interested in knowing what you don’t like but what you like
- (3) Evaluating extrapolation methods: how to measure the success/performance of recommendation methods

(1) Gathering Ratings

- Explicit: ask people to rate items. This doesn’t work well in practice: people can’t be bothered
- Implicit: learn ratings from user actions, e.g., a purchase implies a high rating. What about low ratings?

(2) Extrapolating Utilities

- Key problem: the utility matrix U is sparse; most people have not rated most items
- Cold start: new items have no ratings, and new users have no history
- Three approaches to recommender systems:
  1) Content-based
  2) Collaborative
  3) Latent factor based
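The formal model above (a sparse utility function u: X × S → R) can be sketched in a few lines of Python. The entries below mirror a few ratings from the utility-matrix example; the representation as a dict is an illustrative choice, not part of the slides.

```python
# Minimal sketch of the formal model: the utility function u: X × S -> R
# stored sparsely as a dict keyed by (customer, item). Only a few
# illustrative entries are filled in; most pairs are unrated.
utility = {
    ("Alice", "Avatar"): 1.0,
    ("Alice", "Matrix"): 0.2,
    ("Bob", "LOTR"): 0.5,
}

customers = {"Alice", "Bob", "Carol"}
items = {"Avatar", "LOTR", "Matrix", "Pirates"}

def u(x, s):
    """Known rating of item s by customer x, or None if unrated."""
    return utility.get((x, s))

# Sparsity in action: only 3 of the 12 (customer, item) pairs are known.
known = sum(1 for x in customers for s in items if u(x, s) is not None)
sparsity = known / (len(customers) * len(items))
print(sparsity)  # 0.25
```

The extrapolation problem is exactly filling in the `None` entries, and mainly the high ones.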
Today: Content-based Recommender Systems

Content-based Recommendations

- Main idea: recommend to customer x items similar to previous items rated highly by x
- Example: movie recommendations: recommend movies with the same actor(s), director, genre, …
- Example: websites, blogs, news: recommend other sites with “similar” content

Plan of Action

[Diagram: from the items a user likes, build item profiles; from those, infer a user profile; match the user profile against other items to recommend]

Item Profiles

- For each item, create an item profile
- A profile is a set (vector) of features
  - Movies: author, title, actor, director, …
  - Text: the set of “important” words in the document
- How to pick important features? The usual heuristic from text mining is TF-IDF (Term Frequency × Inverse Document Frequency)
  - Term … Feature; Document … Item

Sidenote: TF-IDF

- f_ij = frequency of term (feature) i in doc (item) j
- TF_ij = f_ij / max_k f_kj
- n_i = number of docs that mention term i
- N = total number of docs
- IDF_i = log(N / n_i)
- TF-IDF score: w_ij = TF_ij × IDF_i
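The TF-IDF heuristic above can be sketched as follows, assuming the standard normalization TF_ij = f_ij / max_k f_kj and IDF_i = log(N / n_i). The toy documents are invented for illustration, and the log base is a convention (natural log here).

```python
import math

# Toy corpus: each "document" is a list of its terms (invented for
# illustration; in the slides, documents are items and terms are features).
docs = {
    "d1": ["the", "matrix", "sequel", "the", "matrix"],
    "d2": ["the", "pirates", "sequel"],
    "d3": ["the", "matrix"],
}

N = len(docs)  # total number of docs

# n_i: number of docs that mention term i
n = {}
for words in docs.values():
    for term in set(words):
        n[term] = n.get(term, 0) + 1

def tf_idf(term, doc_id):
    """w_ij = TF_ij * IDF_i for term i in doc j."""
    words = docs[doc_id]
    f_ij = words.count(term)
    max_f = max(words.count(w) for w in set(words))  # most frequent term in j
    tf = f_ij / max_f
    idf = math.log(N / n[term])
    return tf * idf

# "the" appears in every doc, so IDF = log(3/3) = 0 and its weight vanishes:
print(tf_idf("the", "d1"))  # 0.0
# "pirates" appears in only one doc, so it gets a high weight in d2:
print(tf_idf("pirates", "d2"))
```

This is why TF-IDF picks out “important” words for an item profile: terms that occur everywhere (like stop words) score zero, while terms frequent in one document but rare overall score highest.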