UT Dallas CS 6350 - HW4#2015 - D3104998

School name University of Texas at Dallas

Course Cs 6350- Big Data Management and Analytics

Pages 3

This preview shows page 1 out of 3 pages.

CS6350: BIG DATA ANALYTICS and MANAGEMENT
Spring 2015
HW #4
Related to: Spark, Data Analytics and Recommendation System
Due: April 22, 2015

This homework consists of two parts. The first part focuses on k-means clustering (data analytics) and the second focuses on recommendation systems.

Q1. Implement the k-means algorithm from scratch using Scala and Spark. Use the attached dataset in the file Q1_testkmean.txt as input. The number of clusters K should be 3. Your Scala code should produce the following output:
- Print each point and the cluster it belongs to.
- Print the final centroids.

Q2. Read the following link on co-occurrence-based recommendation as implemented in Mahout:
https://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

Mahout is currently switching from MapReduce to Apache Spark. It has an interactive shell (it will be shown in class; the lecture covers how to install it). Using that shell, apply item-based collaborative filtering using Mahout’s spark-itemsimilarity. spark-itemsimilarity can be used to create "other people also liked these things" type recommendations.

You can find the dataset in eLearning. Copy the data into your Hadoop cluster and use it as input data. You can use the put or copyFromLocal HDFS shell command to copy those files into your HDFS directory. There are 3 data files: movies.dat, ratings.dat, users.dat. Please read the “README_Important” file to learn how the data is organized and what the attributes of the data are. Everything is explained in that README_Important file.

“A user rates some movies with rating 3. Our task is to recommend some movies to him that have similar ratings from other users.”

Steps to follow:
Read the above link carefully and construct the item-similarity matrix of each movie having rating 3 (use ratings.dat).
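Before building the item-similarity matrix, ratings.dat has to be reduced to the records with rating 3. The following is a minimal sketch of that filtering step using plain Scala collections so that it runs without a cluster. It assumes a record layout of UserID::MovieID::Rating::Timestamp, which is not stated in this handout (README_Important is the authority), and the names Rating3Filter, parseRating and rating3Pairs are hypothetical. The comma-separated user,movie output is only an illustrative choice; check the linked Mahout page for the input format spark-itemsimilarity expects.

```scala
import scala.util.Try

object Rating3Filter {
  // ASSUMPTION: each record looks like UserID::MovieID::Rating::Timestamp.
  // Check README_Important for the real layout before relying on this.
  def parseRating(line: String): Option[(Int, Int, Int)] =
    line.split("::") match {
      case Array(user, movie, rating, _*) =>
        // Malformed numbers yield None instead of throwing.
        Try((user.trim.toInt, movie.trim.toInt, rating.trim.toInt)).toOption
      case _ => None
    }

  // Keep only the records whose rating is exactly 3 and emit "user,movie" pairs.
  def rating3Pairs(lines: Seq[String]): Seq[String] =
    lines.flatMap(parseRating).collect {
      case (user, movie, 3) => s"$user,$movie"
    }

  def main(args: Array[String]): Unit = {
    // Made-up sample records, for illustration only.
    val sample = Seq(
      "20::120::3::978300760",
      "20::661::4::978302109",
      "21::855::3::978301968",
      "not a rating record"
    )
    rating3Pairs(sample).foreach(println)
  }
}
```

On Spark, the same parsing function could be applied to the lines of ratings.dat read from HDFS, with the filtered pairs written back to HDFS as the input for spark-itemsimilarity.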
The output should be like this:

In the above matrix, the first integer is the movie id (the movie for which we recommend); the rest of the text contains the list of recommended movie ids with their values (movieid:value).

1. Save the above file to HDFS. Now run the Apache Spark interactive shell. From the shell, take the user id as input (you can fix the id, e.g., val userID = 20). Now find all the movies that he rates with rating 3.

2. Load/read the above file (item-similarity file) and find the movies that match the user’s rated movies with the key of the item-similarity file.
For example, suppose a user has id 20 and he rates movies 120 and 855 as 3. Write the code to extract the movie ids from the item-similarity matrix file that are present in the rows for movies 120 and 855, and generate a matrix like the following:
120 898,951,910,905,1269
855 3265,1218,1089,3224,247

3. Now replace each movie id with movieid:movie_name.
For example,
120:<Movie_Name> 898:<Movie_Name>,951:<Movie_Name>,910:<Movie_Name>,905:<Movie_Name>,1269:<Movie_Name>
855:<Movie_Name> 3265:<Movie_Name>,1218:<Movie_Name>,1089:<Movie_Name>,3224:<Movie_Name>,247:<Movie_Name>
You can apply a join if necessary. (Use movies.dat and ratings.dat.)
Note: In 120:<Movie_Name>, <Movie_Name> should be replaced with the name of the movie with id 120. Display without angle brackets.

Submission:
You have to upload your submission via eLearning before the due date. Please upload the following to eLearning:
1. A scripting file such as Q2_1.txt that shows the building of spark-itemsimilarity, and another scripting file, Q2_2.txt, that shows the Scala/Java program (containing the code for steps 1 - 3). If you use java/scala, then submit all source
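Steps 1 to 3 of Q2 amount to a filter, a keyed lookup and a join against movie names. The sketch below shows that logic on plain Scala collections so it can be run without a cluster. Several things in it are assumptions rather than facts from this handout: the item-similarity rows are taken to look like movieId, a tab, then space-separated id:value entries; movies.dat is taken to look like MovieID::Title::Genres; all object and function names are hypothetical; and the similarity values and titles in main are made up.

```scala
import scala.util.Try

object RecommendLookup {
  // Step 1: movies that the given user rated with exactly 3.
  // Ratings are (userId, movieId, rating) triples.
  def moviesRatedThree(userId: Int, ratings: Seq[(Int, Int, Int)]): Seq[Int] =
    ratings.collect { case (u, movie, 3) if u == userId => movie }

  // ASSUMPTION: a similarity row is "movieId<TAB>id:value id:value ...".
  // Only the recommended ids are kept; the values are dropped.
  def parseSimilarityRow(line: String): Option[(Int, Seq[Int])] =
    line.split("\t", 2) match {
      case Array(key, rest) =>
        Try {
          val ids = rest.trim.split("\\s+").toSeq
            .filter(_.nonEmpty)
            .map(entry => entry.takeWhile(_ != ':').toInt)
          (key.trim.toInt, ids)
        }.toOption
      case _ => None
    }

  // ASSUMPTION: a movies.dat record is "MovieID::Title::Genres".
  def parseMovie(line: String): Option[(Int, String)] =
    line.split("::") match {
      case Array(id, title, _*) => Try((id.trim.toInt, title)).toOption
      case _ => None
    }

  // Steps 2 and 3: look up each rated movie in the similarity rows and
  // render every id as id:name. Movies without a similarity row are skipped.
  def recommend(rated: Seq[Int],
                similarity: Map[Int, Seq[Int]],
                titles: Map[Int, String]): Seq[String] = {
    def label(id: Int): String = {
      val name = titles.getOrElse(id, "unknown")
      s"$id:$name"
    }
    rated.flatMap { movie =>
      similarity.get(movie).map { recs =>
        val recText = recs.map(label).mkString(",")
        label(movie) + " " + recText
      }
    }
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample data, for illustration only.
    val ratings = Seq((20, 120, 3), (20, 855, 3), (20, 661, 4), (21, 120, 3))
    val similarity = Seq("120\t898:0.9 951:0.8", "855\t3265:0.7 1218:0.6")
      .flatMap(parseSimilarityRow).toMap
    val titles = Seq("120::Title A::Drama", "898::Title B::Comedy", "855::Title C::Drama")
      .flatMap(parseMovie).toMap
    val userID = 20
    recommend(moviesRatedThree(userID, ratings), similarity, titles).foreach(println)
  }
}
```

In the Spark shell the two maps would more likely be keyed RDDs combined with a join, as the assignment suggests; the local maps here stand in for that join.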

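For Q1, the k-means loop alternates between assigning each point to its nearest centroid and recomputing each centroid as the mean of its assigned points. The following is a minimal sketch of that loop on local Scala collections, not a Spark program. The handout does not say how to pick the initial centroids or when to stop, so taking the first k points as initial centroids and stopping when the centroids no longer change (or after maxIterations) are assumptions, and all names are hypothetical.

```scala
object KMeansSketch {
  type Point = Vector[Double]

  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the centroid nearest to p.
  def closest(p: Point, centroids: Vector[Point]): Int =
    centroids.indices.minBy(i => squaredDistance(p, centroids(i)))

  // Coordinate-wise mean of a non-empty group of points.
  def mean(points: Seq[Point]): Point =
    points.transpose.map(col => col.sum / points.size).toVector

  // Returns the final centroids and the cluster index of every input point.
  // ASSUMPTION: the first k points serve as the initial centroids.
  def kMeans(points: Seq[Point], k: Int, maxIterations: Int = 100): (Vector[Point], Seq[Int]) = {
    var centroids = points.take(k).toVector
    var changed = true
    var iteration = 0
    while (changed && iteration < maxIterations) {
      // Assignment step: group points by their nearest centroid.
      val groups = points.groupBy(p => closest(p, centroids))
      // Update step: an empty cluster keeps its previous centroid.
      val updated = centroids.indices
        .map(i => groups.get(i).map(mean).getOrElse(centroids(i)))
        .toVector
      changed = updated != centroids
      centroids = updated
      iteration += 1
    }
    (centroids, points.map(p => closest(p, centroids)))
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample points, for illustration only.
    val points: Seq[Point] = Seq(
      Vector(1.0, 1.0), Vector(9.0, 9.0), Vector(20.0, 20.0),
      Vector(1.0, 2.0), Vector(9.0, 10.0), Vector(20.0, 21.0)
    )
    val (centroids, assignment) = kMeans(points, 3)
    points.zip(assignment).foreach { case (p, c) => println(p.mkString(",") + " -> cluster " + c) }
    centroids.foreach(c => println("centroid " + c.mkString(",")))
  }
}
```

On Spark, the assignment step would typically become a map over an RDD of points and the update step an aggregation by cluster index. The result depends on the initial centroids, so a different initialization can produce different clusters on the same data.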