UT Dallas CS 6350 - homework2


CS6350: Big Data Analytics and Management, Spring 2015
DUE DATE: March 6, 11:59pm
TA: Gbadebo [email protected]

Homework 2

In this homework you will learn how to solve problems using MapReduce. Apply Hadoop MapReduce to derive some statistics from IMDB movie data. You can find the dataset on eLearning. Copy the data into your Hadoop cluster and use it as input data; you can use the put or copyFromLocal HDFS shell command to copy the files into your HDFS directory.

There are 3 data files: movies.dat, ratings.dat, and users.dat (use the same data as in homework 1). Please read the "README" file to learn how the data is organized and what each attribute means; it is all explained there. There will be a brief demo and discussion in class. Read the questions carefully and use only the data files you need: some questions may need only users.dat, and some may need only movies.dat.

Once you are familiar with the data, you are required to write efficient Hadoop MapReduce programs in Java to find the following information:

Q1: Find the top 5 movies by average rating from female users, and print out the titles and the average rating given to each movie by female users.

This question involves filtering, joining data from multiple files, and job chaining. You should use a reduce-side join for this question.

Note: First get all movies rated by female users, then find the average rating given to each movie by female users. You will need all three data files for this question.

Please check out the lecture on Hadoop programming given in class for an implementation sample: Hadooplecture2withsamplecode.zip.
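The filter, join, and average steps named in the Q1 note can be sketched in plain Java without any Hadoop dependency. This is only an illustration of the reduce-side join idea, not the lecture's sample code: the class name `FemaleAverageSketch`, the tags `U:` and `R:`, and the whitespace-separated rows are all assumptions made for the example (the real field delimiter is whatever the README specifies), and the top-5 selection and the join with movies.dat are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class FemaleAverageSketch {

    // Stands in for the map and shuffle phases of a reduce-side join:
    // every record is keyed by userid and tagged with its source file.
    // "U:<gender>" marks a users.dat row, "R:<movieid>:<rating>" a ratings.dat row.
    static Map<String, List<String>> mapAndShuffle(String[] users, String[] ratings) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String line : users) {
            String[] f = line.trim().split("\\s+"); // assumed delimiter
            grouped.computeIfAbsent(f[0], k -> new ArrayList<>()).add("U:" + f[1]);
        }
        for (String line : ratings) {
            String[] f = line.trim().split("\\s+"); // assumed delimiter
            grouped.computeIfAbsent(f[0], k -> new ArrayList<>())
                   .add("R:" + f[1] + ":" + f[2]);
        }
        return grouped;
    }

    // Stands in for the reducer of the join job plus a second, chained
    // averaging job: keep ratings only for female users, then average per movie.
    static Map<String, Double> averageByFemaleUsers(String[] users, String[] ratings) {
        Map<String, double[]> sumAndCount = new TreeMap<>();
        for (List<String> values : mapAndShuffle(users, ratings).values()) {
            if (!values.contains("U:F")) {
                continue; // filter: skip users who are not female
            }
            for (String v : values) {
                if (!v.startsWith("R:")) {
                    continue;
                }
                String[] f = v.split(":");
                double[] acc = sumAndCount.computeIfAbsent(f[1], k -> new double[2]);
                acc[0] += Double.parseDouble(f[2]); // running sum of ratings
                acc[1] += 1;                        // running count of ratings
            }
        }
        Map<String, Double> averages = new TreeMap<>();
        for (Map.Entry<String, double[]> e : sumAndCount.entrySet()) {
            averages.put(e.getKey(), e.getValue()[0] / e.getValue()[1]);
        }
        return averages;
    }

    public static void main(String[] args) {
        // Illustrative rows only: "userid gender" and "userid movieid rating".
        String[] users = {"1 M", "2 F", "3 F"};
        String[] ratings = {"1 30 4", "1 40 3", "2 20 3", "2 30 4", "3 30 2"};
        averageByFemaleUsers(users, ratings)
            .forEach((movie, avg) -> System.out.println(movie + " " + avg));
    }
}
```

In a real Hadoop job the grouping by userid is done by the framework's shuffle, and the per-movie averaging would typically be a second job chained after the join; here both are collapsed into ordinary method calls to keep the sketch short.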
The sample code document (Hadooplecture2withsamplecode.zip) is available at the URL below on eLearning (requires eLearning username and password):
https://elearning.utdallas.edu/bbcswebdav/pid-727586-dt-content-rid-6316587_1/xid-6316587_1

E.g., given ratings.dat as (note: this is just an example data format for simplicity)

userid  movieid  rating
1       30       4
1       40       3
2       20       3
2       30       4
3       30       2

and users.dat as

userid  gender
1       M
2       F
3       F

we can see from this data that users 2 and 3 are female. Movie id 30 is rated by female users 2 and 3, so the average rating given to the movie with id 30 by female users is (4 + 2) / 2 = 3, since users 2 and 3 rated it 4 and 2 respectively.

Note: we ignore the rating given by the male user with id 1, even though that user also rated the movie with id 30.

You will then join the results obtained above with the movies.dat file to get the title. Your final result can be in the following format:

“title of movies” “avg rating by females”
Toy Story 3.5

To run your jobs, use the following syntax:

hadoop jar name_of_jar_file Classname <input dir> <output dir> [<extra input parameter>]

Submission: You have to upload your submission via eLearning before the due date. Please upload the following to eLearning:
1. Two jar files, one for each problem, or one jar file containing all solutions.
2. Java files which have the source code.
3. ***A Readme text file about how to run your jar file. Give the command to run your jar file.

Q2: Given the id of a movie, find the userid, gender, and age of all users who rated the movie 4 or greater. You will input the movie id from the command line.

For this question, use a map-side join to implement the join in Hadoop. To implement the map-side join, you will load users.dat into the Hadoop distributed cache.

Please check out the second lecture on Hadoop programming given in class for an implementation sample: Hadooplecture2withsamplecode.zip.
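The map-side join asked for in Q2 can be sketched in plain Java as follows. This is a minimal illustration under stated assumptions, not the lecture's sample code and not Hadoop API usage: `loadUsers` plays the role of a mapper's setup step reading users.dat from the distributed cache, `join` plays the role of the map function over ratings.dat, and the class name `MapSideJoinSketch`, the ages, and the whitespace-separated rows are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MapSideJoinSketch {

    // Stands in for the mapper's setup step: the small table (users.dat)
    // is read once and held in memory as userid -> "gender age".
    static Map<String, String> loadUsers(String[] userLines) {
        Map<String, String> users = new HashMap<>();
        for (String line : userLines) {
            String[] f = line.trim().split("\\s+"); // assumed delimiter
            users.put(f[0], f[1] + " " + f[2]);
        }
        return users;
    }

    // Stands in for map(): each ratings row is filtered by movie id and
    // rating, then joined against the in-memory user table. No reducer
    // is needed because the join happens entirely on the map side.
    static List<String> join(Map<String, String> users, String[] ratingLines, String movieId) {
        List<String> out = new ArrayList<>();
        for (String line : ratingLines) {
            String[] f = line.trim().split("\\s+"); // userid movieid rating
            if (!f[1].equals(movieId) || Integer.parseInt(f[2]) < 4) {
                continue; // wrong movie, or rated below 4
            }
            String info = users.get(f[0]);
            if (info != null) {
                out.add(f[0] + " " + info); // userid gender age
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The movie id comes from the command line; "30" is a fallback for the demo.
        String movieId = args.length > 0 ? args[0] : "30";
        // Illustrative rows only; the ages are made up for this example.
        String[] users = {"1 M 25", "2 F 35", "3 F 18"};
        String[] ratings = {"1 30 4", "1 40 3", "2 20 3", "2 30 4", "3 30 2"};
        for (String row : join(loadUsers(users), ratings, movieId)) {
            System.out.println(row);
        }
    }
}
```

The design point is that users.dat is small enough to fit in each mapper's memory, so the large ratings.dat never has to be shuffled to reducers for the join.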
The sample code document (Hadooplecture2withsamplecode.zip) is available at the URL below on eLearning (requires eLearning username and password):
https://elearning.utdallas.edu/bbcswebdav/pid-727586-dt-content-rid-6316587_1/xid-6316587_1

Use the users.dat and

