CS6350 BIG DATA ANALYTICS and MANAGEMENT
Spring 2015
HW 4: Related to Spark Data Analytics and Recommendation System
Due: April 22, 2015

This homework consists of two parts. The first part focuses on K-means clustering data analytics, and the second focuses on recommendation systems.

Q1. Implement the k-means algorithm from scratch using Scala and Spark. Use the attached dataset in file Q1_testkmean.txt as input. The number of clusters, K, should be 3. Your Scala code should produce the following output:
- Print each point and the cluster it belongs to.
- Print the final centroids.

Q2. Read the following link on co-occurrence-based recommendation in Mahout:
https://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

Mahout is currently switching from MapReduce to Apache Spark. It has an interactive shell (it will be shown in class; the lecture covers how to install it). Using that, apply item-based collaborative filtering with Mahout's spark-itemsimilarity. spark-itemsimilarity can be used to create "other people also liked these things" type recommendations.

You can find the dataset in eLearning. Copy the data into your Hadoop cluster and use it as input data. You can use the -put or -copyFromLocal HDFS shell command to copy the files into your HDFS directory. There are 3 data files:
- movies.dat
- ratings.dat
- users.dat

Please read the README Important file to learn how the data is organized and what each attribute means. Everything is explained very well in that file.

A user rates some movies with rating 3. Our task is to recommend to that user some movies that have similar ratings from other users.

Steps to follow:

Read the above link carefully and construct the item similarity matrix of each movie having rating 3 (use ratings.dat). In the output matrix, the first integer on each line is the movie id (the movie for which we recommend); the rest of the line contains the list of recommended movie ids with their values (movie id:value).

1. Save the above file to HDFS. Now run the Apache Spark interactive shell. From the shell, take the user id as input (you can fix the id, e.g. val userID = 20). Now find all the movies that this user rates with rating 3.

2. Load the above file (the item similarity file) and match the user's rated movies against the keys of the item similarity file. For example, suppose a user has id 20 and rates movies 120 and 855 as 3. Write the code to extract the movie ids from the item similarity matrix file that are present in the rows for movies 120 and 855, and generate a matrix like the following:

120    898 951 910 905 1269
855    3265 1218 1089 3224 247

3. Now replace each movie id with movieid:movie name. For example:

120:<Movie Name>    898:<Movie Name> 951:<Movie Name> 910:<Movie Name> 905:<Movie Name> 1269:<Movie Name>
855:<Movie Name>    3265:<Movie Name> 1218:<Movie Name> 1089:<Movie Name> 3224:<Movie Name> 247:<Movie Name>

You can apply a join if necessary. Use movies.dat and ratings.dat.
Note: In 120:<Movie Name>, <Movie Name> should be replaced with the name of the movie with id 120. Display it without the angle brackets.

Submission

You have to upload your submission via eLearning before the due date. Please upload the following to eLearning:
1. A scripting file like Q2_1.txt that shows the building of spark-itemsimilarity, and another scripting file Q2_2.txt that shows the Scala/Java program containing the code for steps 1-3. If you use Java/Scala, then submit all source files.
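
For Q1, the assign-then-recompute loop of k-means can be sketched on plain Scala collections as below. This is only an illustration under stated assumptions, not the expected solution: the sample points are made up (they are not taken from the attached dataset), the initial centroids are hand-picked, and all names (KMeansSketch, closest, step, run) are hypothetical. A Spark version would presumably express the same assignment and averaging over an RDD of parsed points.

```scala
object KMeansSketch {
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the centroid nearest to p.
  def closest(p: Point, centroids: Vector[Point]): Int =
    centroids.indices.minBy(i => sqDist(p, centroids(i)))

  // Component-wise mean of a non-empty group of points.
  def mean(ps: Seq[Point]): Point =
    ps.transpose.map(col => col.sum / ps.size).toVector

  // One iteration: assign every point to its nearest centroid, then
  // recompute each centroid as the mean of its assigned points.
  // A centroid that attracts no points keeps its old position.
  def step(points: Seq[Point], centroids: Vector[Point]): Vector[Point] = {
    val groups = points.groupBy(p => closest(p, centroids))
    centroids.indices
      .map(i => groups.get(i).map(mean).getOrElse(centroids(i)))
      .toVector
  }

  // Repeat until the centroids stop moving or maxIter is exhausted.
  @annotation.tailrec
  def run(points: Seq[Point], centroids: Vector[Point],
          maxIter: Int = 20, eps: Double = 1e-9): Vector[Point] = {
    val next = step(points, centroids)
    val shift = centroids.zip(next).map { case (a, b) => sqDist(a, b) }.sum
    if (maxIter <= 1 || shift < eps) next
    else run(points, next, maxIter - 1, eps)
  }

  def main(args: Array[String]): Unit = {
    // Made-up 2-D sample points, for illustration only.
    val points: Seq[Point] = Seq(
      Vector(1.0, 1.0), Vector(1.5, 2.0), Vector(1.0, 0.5),
      Vector(8.0, 8.0), Vector(9.0, 9.5), Vector(8.5, 9.0),
      Vector(1.0, 9.0), Vector(0.5, 8.0), Vector(1.5, 8.5))
    // K = 3 initial centroids, hand-picked here (an assumption).
    val init = Vector(points(0), points(3), points(6))
    val centroids = run(points, init)

    // Print each point with its cluster, then the final centroids.
    points.foreach { p =>
      println(s"${p.mkString(",")} -> cluster ${closest(p, centroids)}")
    }
    centroids.zipWithIndex.foreach { case (c, i) =>
      println(s"centroid $i: ${c.mkString(",")}")
    }
  }
}
```

The result of k-means depends on the initial centroids, so a different initialization (for example, random sampling) may give different clusters on real data.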
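
For Q2, steps 1 to 3 amount to a filter on the ratings, a lookup in the similarity file, and a join with the movie titles. The sketch below shows that flow on in-memory lines and rests on assumptions that should be checked against the README and the actual spark-itemsimilarity output: similarity lines are taken to be a movie id, a tab, then space-separated id:value pairs; ratings.dat and movies.dat are taken to be `::`-separated (user::movie::rating::timestamp and movie::title::genres). All sample lines and titles are made up, and the names (RecommendSketch, parseSimilarity, ratedBy, titles, recommend) are hypothetical.

```scala
object RecommendSketch {
  // Parse one similarity line, assumed to look like
  // "<movieId>\t<id>:<value> <id>:<value> ...", keeping only the ids.
  def parseSimilarity(line: String): (Int, Seq[Int]) = {
    val parts = line.split("\t", 2)
    val similar =
      if (parts.length < 2 || parts(1).trim.isEmpty) Seq.empty[Int]
      else parts(1).trim.split(" ").toSeq.map(_.split(":")(0).toInt)
    (parts(0).trim.toInt, similar)
  }

  // Step 1: movies that userID rated with exactly `rating`
  // (assumed line layout: user::movie::rating::timestamp).
  def ratedBy(ratings: Seq[String], userID: Int, rating: Int): Seq[Int] =
    ratings.map(_.split("::")).collect {
      case Array(u, m, r, _*) if u.toInt == userID && r.toInt == rating => m.toInt
    }

  // movieId -> title (assumed line layout: movie::title::genres).
  def titles(movies: Seq[String]): Map[Int, String] =
    movies.map(_.split("::")).collect {
      case Array(id, title, _*) => id.toInt -> title
    }.toMap

  // Steps 2 and 3: keep similarity rows whose key is a movie the user
  // rated, then print every id as "id:title".
  def recommend(simLines: Seq[String], ratings: Seq[String], movies: Seq[String],
                userID: Int, rating: Int = 3): Seq[String] = {
    val sims = simLines.map(parseSimilarity).toMap
    val name = titles(movies)
    def label(id: Int): String = s"$id:${name.getOrElse(id, "?")}"
    ratedBy(ratings, userID, rating).filter(sims.contains).map { m =>
      s"${label(m)}\t${sims(m).map(label).mkString(", ")}"
    }
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample lines, for illustration only.
    val simLines = Seq("120\t898:2.5 951:1.7", "855\t3265:3.1", "7\t120:0.4")
    val ratings = Seq("20::120::3::0", "20::855::3::0", "20::7::5::0", "21::7::3::0")
    val movies = Seq(
      "120::Title A::Drama", "855::Title B::Comedy", "898::Title C::Drama",
      "951::Title D::Action", "3265::Title E::Drama", "7::Title F::Comedy")
    val userID = 20
    recommend(simLines, ratings, movies, userID).foreach(println)
  }
}
```

In the Spark shell, the same parsing functions could be applied to lines loaded with sc.textFile from HDFS, with the lookup expressed as a join on movie id; the in-memory Seq and Map here just keep the sketch runnable without a cluster.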