BIG DATA MANAGEMENT AND ANALYTICS (CS6350)
SPARK TUTORIAL

SPARK INSTALLATION
Spark works on both Linux and Windows operating systems.
Go to https://spark.apache.org/downloads.html.
Choose a package type: Pre-built for Hadoop 2.4 or later.
Download the Spark archive.
Extract it and change directory to bin.
Run spark-shell.

Simple Scala/Spark example

  val a = Array(1,2,3,2,3,4,5,2,1,2,3,4,3,4,5)
  val r = sc.parallelize(a)            // distribute the local array as an RDD
  val newr = r.map(x => (x, 1))        // pair each element with a count of 1
  newr.reduceByKey(_ + _).collect()    // sum the counts per distinct value

The output will be:

  res6: Array[(Int, Int)] = Array((4,3), (1,2), (5,2), (2,4), (3,4))

Word count program

  val in = sc.textFile("beeline")      // "beeline" is the input text file
  in.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
    .collectAsMap()

Filter commands
Filter users by zip code (fields in users.dat are separated by "::"):

  val lines = sc.textFile("users.dat")
  val ln = readLine()                  // read the zip code from the command line
  val linesZipcode = lines.filter(line => line.contains(ln))
                          .map(line => line.split("::"))
                          .map(fields => fields(0))   // keep only the first field (the user id)
                          .collect

Finding averages
Find the top 10 movies by average rating, in descending order of rating:

  val lines = sc.textFile("ratings.dat")
  val sumratings = lines.map(line => line.split("::"))
                        .map(fields => (fields(1), fields(2).toDouble))
                        .reduceByKey(_ + _)           // total rating per movie
  val counts = lines.map(line => line.split("::"))
                    .map(fields => (fields(1), 1))
                    .reduceByKey(_ + _)               // number of ratings per movie
  sumratings.join(counts)
            .mapValues { case (sum, count) => (1.0 * sum) / count }  // average rating
            .map(item => item.swap)                   // key by average rating
            .sortByKey()
            .top(10)                                  // 10 highest averages, descending

Defining functions in Scala

  def addInt(a: Int, b: Int): Int = {
    var sum: Int = 0
    sum = a + b
    return sum
  }

Applying functions

  val a = Array(1,2,3,2,3,4,5,2,1,2,3,4,3,4,5)
  val r = sc.parallelize(a)
  r.map(x => addInt(x, x)).collect     // doubles each element

Stand-alone Scala programs
Create a folder structure as shown below:

  ./simple.sbt
  ./src
  ./src/main
  ./src/main/scala
  ./src/main/scala/SimpleApp.scala

In SimpleApp.scala, write your code:

  /* SimpleApp.scala */
  package org.apache.spark.examples.streaming

  import org.apache.spark.SparkContext
  import org.apache.spark.SparkContext._
  import org.apache.spark.SparkConf

  object SimpleApp {
    def main(args: Array[String]) {
      val logFile = "/home/gbaduz/money"   // should be some file on your system
      val conf = new SparkConf().setAppName("Simple Application")
      conf.setMaster("local[2]")
      val sc = new SparkContext(conf)
      val logData = sc.textFile(logFile, 2).cache()
      val numAs = logData.filter(line => line.contains("a")).count()
      val numBs = logData.filter(line => line.contains("b")).count()
      println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
    }
  }

In simple.sbt, add the project metadata such as the name, dependencies, and main class:

  name := "Simple Project"

  version := "1.0"

  scalaVersion := "2.10.4"

  libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"

  mainClass in (Compile, run) := Some("org.apache.spark.examples.streaming.SimpleApp")

Run the sbt commands to package and run your code:

  ./sbt/bin/sbt package
  ./sbt/bin/sbt "run"

Running ./sbt/bin/sbt with no arguments starts the interactive sbt shell.
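
Besides sbt "run", the packaged application can also be launched with spark-submit from the Spark installation directory. A minimal sketch, assuming sbt package produced the jar target/scala-2.10/simple-project_2.10-1.0.jar (the exact file name depends on the name, version, and scalaVersion set in simple.sbt):

  ./bin/spark-submit \
    --class org.apache.spark.examples.streaming.SimpleApp \
    --master local[2] \
    target/scala-2.10/simple-project_2.10-1.0.jar

Note that SimpleApp already calls setMaster("local[2]") on its SparkConf; properties set directly in code take precedence over the --master flag passed to spark-submit.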