CS775: Distributed Systems
Spring 2012
Homework #6
Points: 30
Due: March 15, 2012

Implement the following application using Map-Reduce and MPI. You will be provided with three separate files, with different data, but in the same format. Each record has four fields:

(i) Age
(ii) Gender (M/F)
(iii) Program (BS/MS/PhD)
(iv) GPA

Age may be classified as: G1: 15-20; G2: 21-25; G3: 26-30; G4: higher than 30.

For each of these classifications (by age, by gender, by program), compute the following statistics for the GPA:

• Number of records in each class (by age, by gender, by program)
• Average
• Maximum
• Minimum
• Standard deviation

In summary, the inputs are five files: file1.dat, file2.dat, file3.dat, file4.dat, file5.dat. Each file has records in the format:

<Age, Gender, Program, GPA>

where Age is an integer, GPA is a real number, and Gender (M/F) and Program (B/M/P) are character variables.

The output would look like:

Statistics based on Age:
Age 15-20: #of records: XXX Average GPA: XXX Maximum GPA: XXX Minimum GPA: XXX Std. Dev.: XXX
Age 21-25: #of records: XXX Average GPA: XXX Maximum GPA: XXX Minimum GPA: XXX Std. Dev.: XXX
Age 26-30: #of records: XXX Average GPA: XXX Maximum GPA: XXX Minimum GPA: XXX Std. Dev.: XXX
Age Higher than 30: #of records: XXX Average GPA: XXX Maximum GPA: XXX Minimum GPA: XXX Std. Dev.: XXX

Statistics based on Gender:
Male: #of records: XXX Average GPA: XXX Maximum GPA: XXX Minimum GPA: XXX Std. Dev.: XXX
Female: #of records: XXX Average GPA: XXX Maximum GPA: XXX Minimum GPA: XXX Std. Dev.: XXX

Statistics based on Program:
BS: #of records: XXX Average GPA: XXX Maximum GPA: XXX Minimum GPA: XXX Std. Dev.: XXX
MS: #of records: XXX Average GPA: XXX Maximum GPA: XXX Minimum GPA: XXX Std. Dev.: XXX
PhD: #of records: XXX Average GPA: XXX Maximum GPA: XXX Minimum GPA: XXX Std. Dev.: XXX

Your program should employ maximum parallelism.
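
One way to see why these statistics parallelize well: count, sum, sum of squares, minimum, and maximum can each be computed on a part of the data and then merged in any order, and the average and standard deviation follow from the merged values. The sketch below is only an illustration of that idea in plain Python, not a required design. The names Stats, age_group, map_record, and reduce_pairs are hypothetical, the standard deviation is assumed to be the population form (the assignment does not say which form to use), and ages below 15 are rejected because the assignment does not classify them.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Stats:
    """Partial GPA statistics that can be merged in any order."""
    n: int
    total: float
    total_sq: float
    lo: float
    hi: float

    @staticmethod
    def of(gpa):
        # Statistics for a single record.
        return Stats(1, gpa, gpa * gpa, gpa, gpa)

    def merge(self, other):
        # Associative and commutative, so it suits a reducer or combiner.
        return Stats(self.n + other.n,
                     self.total + other.total,
                     self.total_sq + other.total_sq,
                     min(self.lo, other.lo),
                     max(self.hi, other.hi))

    def mean(self):
        return self.total / self.n

    def std(self):
        # ASSUMPTION: population standard deviation.
        var = self.total_sq / self.n - self.mean() ** 2
        return math.sqrt(max(var, 0.0))  # guard against tiny negative rounding


def age_group(age):
    """Classify an age into the four groups G1..G4 from the assignment."""
    if 15 <= age <= 20:
        return "Age 15-20"
    if 21 <= age <= 25:
        return "Age 21-25"
    if 26 <= age <= 30:
        return "Age 26-30"
    if age > 30:
        return "Age Higher than 30"
    # ASSUMPTION: ages below 15 are outside the assignment's classes.
    raise ValueError(f"age {age} is not covered by any class")


def map_record(age, gender, program, gpa):
    """Map step: emit one (key, Stats) pair per classification."""
    s = Stats.of(gpa)
    return [(("age", age_group(age)), s),
            (("gender", gender), s),
            (("program", program), s)]


def reduce_pairs(pairs):
    """Reduce step: merge all partial statistics that share a key."""
    merged = {}
    for key, s in pairs:
        merged[key] = merged[key].merge(s) if key in merged else s
    return merged
```

Because merge is associative, the same function could serve as a Hadoop combiner and reducer, or as the body of a user-defined reduction in MPI; which of those fits best is a design choice left to the solution.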
Do not write a simple C++ program which reads the three files and prints the statistics. That would be a CS149/CS150 assignment. Instead, use Map-Reduce and MPI to employ as much parallelism as possible. For both solutions, you MUST show the dataflow in a diagram so we understand how you have employed parallelism and the power of Map-Reduce and MPI. You may need several Map/Reduce functions to achieve this task. Make sure to get all the final output into a single file, especially for Map-Reduce (Hadoop).

Submit your code, the dataflow diagram, and the final output. Use any language that is compatible with Map-Reduce and MPI.

I will soon be providing you with the data files. In the meantime, use any test data file. I will provide one version with data fields separated by commas, and another with fields separated by spaces. Use whichever is easier for you. For simplicity, Gender will be represented by 1 (for male) and 2 (for Female) and Program by 1 (for BS), 2 (for MS), and 3 (for
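
A record parser that accepts either delimiter might look like the following sketch. It is an illustration only: parse_record, GENDER, and PROGRAM are hypothetical names, and the mapping of Program code 3 to PhD is an assumption inferred from the BS/MS/PhD ordering used elsewhere in the assignment, since the text is cut off at that point.

```python
import re

# Numeric codes as described in the assignment text.
GENDER = {"1": "M", "2": "F"}
# ASSUMPTION: code 3 means PhD; the source text is truncated there.
PROGRAM = {"1": "BS", "2": "MS", "3": "PhD"}


def parse_record(line):
    """Parse '<Age, Gender, Program, GPA>' split by commas or spaces."""
    # Split on any run of commas and/or whitespace, dropping empty pieces.
    fields = [f for f in re.split(r"[,\s]+", line.strip()) if f]
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
    age, gender, program, gpa = fields
    return int(age), GENDER[gender], PROGRAM[program], float(gpa)
```

For example, parse_record("22, 1, 2, 3.45") and parse_record("22 1 2 3.45") both yield (22, "M", "MS", 3.45), so the same map function can read either version of the data files.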


ODU CS 775 - Homework #6
