Efficient Parallel Access Patterns for Sparse Matrices

Kevin Waugh
kwaugh@andrew

Abstract

A notable characteristic of the scientific computing and machine learning problem domains is the large amount of data to be analyzed and manipulated. Here, carefully crafted parallel algorithms have the potential to make a massive impact on the size of a problem that is deemed tractable. Sparse matrices are a data structure frequently encountered within these domains. Typically, these sparse matrices have a non-uniform structure, which can make the task of designing efficient parallel algorithms for sparse matrices much more complex than for their dense counterparts. In this report, we consider multiple ways of parallelizing the most computationally intensive section of a typical sparse matrix algorithm: its inner loop. We compare the theoretical trade-offs between the approaches and empirically test performance on the Netflix Prize data set [4, 5], a large collaborative filtering task that seeks to learn movie preferences from a sparse collection of ratings.

1 Introduction

Large-scale scientific computing and machine learning have rapidly become popular research areas in the last decade. The advances in these areas have led to many practical applications that have dramatically impacted society in a generally beneficial way. For example, real-time weather simulation, financial market analysis, web indexing and searching, voice and face recognition, product recommendation, intrusion and fraud detection, and control automation are tasks that computers were rarely used for or trusted with fifteen years ago.
Though these applications would not be possible without the staggering advancements made by researchers in these areas, it is safe to say that the upward trend in processor performance and memory capacity has been pivotal to this progression. Recently, the rate at which single-processor performance improves has stagnated. Instead, computer architects have shifted focus towards multi-core designs to keep up with the ever-growing computational demands. This paradigm shift will require the efficient design of parallel algorithms to effectively make use of the newest computing resources.

A sparse matrix is a frequently recurring data structure in scientific computing and machine learning. Unlike a dense matrix, most of the entries in a sparse matrix are zero and are not explicitly stored in memory. For example, a web search engine might use a matrix of word frequency counts for a set of web pages. Here, a web page typically uses a small subset of the words in the dictionary, which leads to most of the matrix containing zeros. Similarly, a social networking graph, where each edge indicates some type of relationship, will likely be sparse when represented as an adjacency matrix. In these cases, a machine learning algorithm for creating a search index or inferring relationships must be equipped to handle this explicit sparse structure to gain traction on any reasonably sized data set.

Let us denote an m by n sparse matrix as a pair (A, S), where (i, j) ∈ S is a non-zero entry in the matrix and A(i, j) is its value. A recurring pattern in sparse matrix algorithms is, then, to repeatedly evaluate functions f and g on the non-zero entries to compute

    u = \sum_{(i,j) \in S} f(i, j, A(i, j))   and   v = \sum_{(i,j) \in S} g(i, j, A(i, j))    (1)

where u ∈ R^m and v ∈ R^n are row and column vectors, respectively. This model encompasses a large class of algorithms, including the vast array centered around matrix-vector products, which have been well studied and optimized [2, 6].
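The access pattern in Equation (1) can be made concrete with a short sketch. The following is a minimal illustration only, assuming a triplet (coordinate-list) representation of S; the names `sparse_accumulate` and `entries` are ours, not the report's:

```python
def sparse_accumulate(m, n, entries, f, g):
    """Accumulate u[i] += f(i, j, a) and v[j] += g(i, j, a) over all
    non-zero entries (i, j, a) of an m-by-n sparse matrix."""
    u = [0.0] * m
    v = [0.0] * n
    for (i, j, a) in entries:  # the inner loop this report parallelizes
        u[i] += f(i, j, a)
        v[j] += g(i, j, a)
    return u, v

# Example: a sparse matrix-vector product y = A x is the special case
# f(i, j, a) = a * x[j], with g contributing nothing.
entries = [(0, 1, 2.0), (1, 0, 3.0), (1, 2, -1.0)]
x = [1.0, 2.0, 3.0]
y, v = sparse_accumulate(2, 3, entries,
                         f=lambda i, j, a: a * x[j],
                         g=lambda i, j, a: 0.0)
# y == [4.0, 0.0]
```

Note that u is written only through row indices and v only through column indices; it is exactly these per-entry writes that the parallelization schemes in this report must coordinate.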
Examples include solution techniques for linear systems, such as solving a positive definite system with the conjugate gradient method; a variety of matrix factorization algorithms, including those for computing eigenvector and singular value decompositions; and optimization algorithms such as interior-point methods for linear and quadratic programming.

Despite its simple structure, no one approach to parallelizing this family of algorithms is obviously best. Usually, determining how to effectively distribute this computation will depend on the non-uniform pattern of sparseness in the data in addition to the functions f and g. In this report, we will examine a variety of schemes for parallelizing algorithms in this family. We will compare and contrast these schemes empirically on both shared-memory machines and a chip multiprocessor, as well as theoretically in terms of computational and memory resource overhead in addition to cache performance.

This report is organized as follows. First, we will introduce the Netflix Prize data set and its collaborative filtering task, which we will use for our empirical study. Then, we will describe one possible approach for collaborative filtering, the pseudo-singular value decomposition learned using gradient descent, whose inner loop follows the noted sparse access pattern. Finally, we will investigate and contrast, both theoretically and empirically, a variety of ways to parallelize this approach.

2 Collaborative Filtering and the Netflix Prize Dataset

A collaborative filtering task is most easily described as a recommendation system. Take, for example, an online store selling a number of related items. The store might wish to infer the buying habits of a user to strategically advertise goods that are most likely to be purchased. Similarly, an online movie or music retailer might allow a user to rate music that they like and dislike. The system could then recommend similar music or movies.
More formally, a collaborative filtering system takes as input a set of k user/item/rating tuples and gives as output a function that predicts any unobserved tuple. The quality of such a system is usually measured by the root mean squared error of its predictions on a withheld test set.

Good collaborative filtering systems consider groups of users, or perhaps all of them, at once when learning preferences. This is because the observation vector for a single user is typically sparsely populated, and little can be inferred from a single user's ratings in isolation. That is, one typically assumes that, though there are many distinct users of the system, many of them share similar tastes.
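As a concrete illustration of the evaluation criterion just described, here is a minimal sketch of root mean squared error over a withheld test set; the sample data and the constant predictor are hypothetical, chosen only to show the shape of the computation:

```python
import math

def rmse(test_set, predict):
    """Root mean squared error of predict(user, item) against the
    ratings in test_set, an iterable of (user, item, rating) tuples."""
    errors = [(rating - predict(user, item)) ** 2
              for (user, item, rating) in test_set]
    return math.sqrt(sum(errors) / len(errors))

# Toy withheld test set and a baseline that always predicts 3 stars.
test = [("u1", "m1", 4.0), ("u1", "m2", 2.0), ("u2", "m1", 3.0)]
error = rmse(test, lambda user, item: 3.0)  # sqrt((1 + 1 + 0) / 3) ~ 0.816
```

A learned predictor, such as the gradient-descent pseudo-SVD described later in the report, is evaluated the same way: it is fit on the training tuples and scored by this error on the withheld ones.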