Spark Fast Interactive Language Integrated Cluster Computing Dr M Zaharia An Architecture for Fast and General Data Processing on Large Clusters PhD Dissertation http www eecs berkeley edu Pubs Tech Rpts 2014 EECS 2014 12 pdf Project Goals Extend the MapReduce model to better support two common classes of analytics apps Iterative algorithms machine learning graph Interactive data mining Enhance programmability Integrate into Scala programming language Allow interactive use from Scala interpreter Motivation MapReduce greatly simplified big data analys is on large unreliable clusters But as soon as it got popular users wanted mor e More complex multi stage applications e g iterative machine learning graph processing More interactive ad hoc queries Response specialized frameworks for some of these apps e g Pregel for graph processing Motivation Complex apps and interactive queries both ne ed one thing that MapReduce lacks Efficient primitives for data sharing In MapReduce the only way to share data across jobs is stable storage slow Examples HDFS read HDFS write HDFS read iter 1 HDFS write iter 2 Input HDFS read Input query 1 result 1 query 2 result 2 query 3 result 3 Slow due to replication and disk I O but necessary for fault tolerance Goal In Memory Data Sharing iter 1 iter 2 Input query 1 one time processing Input query 2 query 3 10 100 faster than network disk but how to get FT Challenge How to design a distributed memory abstraction that is both fault tolerant and efficient Challenge Existing storage abstractions have interfaces ba sed on fine grained updates to mutable state RAMCloud databases distributed mem Piccolo Requires replicating data or logs across nodes f or fault tolerance Costly for data intensive apps 10 100x slower than memory write Solution Resilient Distributed Datasets RDD s Restricted form of distributed shared memory Immutable partitioned collections of records Can only be built through coarse grained determinis tic transformations map filter join Efficient fault recovery using lineage Log one operation to apply to many elements Recompute lost partitions on failure No cost if nothing fails RDD Recovery iter 1 iter 2 Input query 1 one time processing Input query 2 query 3 Generality of RDDs Despite their restrictions RDDs can express surpr isingly many parallel algorithms These naturally apply the same operation to multiple items Unify many current programming models Data flow models MapReduce Dryad SQL Specialized models for iterative apps BSP Pregel ite rative MapReduce Haloop bulk incremental Support new apps that these models don t Outline Introduction to Scala functional programming What is Spark Resilient Distributed Datasets RDDs Implementation Demo Conclusion About Scala High level language for JVM Object oriented Functional programming FP Statically typed Comparable in speed to Java no need to write types due to type inference Interoperates with Java Can use any Java class inherit from it etc Can also call Scala code from Java Quick Tour Quick Tour All of these leave the list unchanged List is Immutable Outline Introduction to Scala functional programming What is Spark Resilient Distributed Datasets RDDs Implementation Demo Conclusion Spark Overview Goal work with distributed collections as you would with local ones Concept resilient distributed datasets RDDs Immutable collections of objects spread across a cluster Built through parallel transformations map filter etc Automatically rebuilt on failure Controllable persistence e g caching in RAM for reuse Shared variables that can be used in parallel operations Outline Introduction to Scala functional programming What is Spark Resilient Distributed Datasets RDDs Implementation Demo Conclusion RDD Abstraction An RDD is a read only partitioned collection of records Can only be created by 1 Data in stable storage 2 Other RDDs transformation lineage An RDD has enough information about how it was derived fro m other datasets its lineage Users can control two aspects of RDDs 1 Persistence in RAM reuse 2 Partitioning hash range k v RDDs in More Detail An RDD is an immutable partitioned logical col lection of records Need not be materialized but rather contains infor mation to rebuild a dataset from stable storage Partitioning can be based on a key in each recor d using hash or range partitioning Built using bulk transformations on other RDDs Can be cached for future reuse RDD Types parallelized collections By calling SparkContext s parallelize method on a n existing Scala collection a Seq obj Once created the distributed dataset can be op erated on in parallel RDD Types Hadoop Datasets Spark supports text files SequenceFiles and any other Hadoop inputFormat Local path or hdfs s3n kfs val distFiles sc textFile URI Other Hadoop inputFormat val distFile sc hadoopRDD URI RDD Operations Transformations create a new dataset from an existing one Actions Return a value to the driver program Transformations are lazy they don t compute right awa y Just remember the transformations applied to datase ts lineage Only compute when an action require Spark Operations Transformations define a new RDD Actions return a result to driver program map filter sample groupByKey reduceByKey sortByKey collect reduce count save lookupKey flatMap union join cogroup cross mapValues Transformations Transformations map func Meaning Return a new distributed dataset formed by passing each element of the source through a function func flatMap func Return a new datasets formed by selecting those elements of the source on which func returns true union otherDateset Return a new dataset that contains the union of the elements in the source dataset and the argument Actions Actions reduce func collect Meaning Aggregate the elements of the dataset using a function func Return all the elements of the dataset as an array at the driver program count first saveAsTextFile path Return the number of elements in dataset Return the first element of the dataset Write the elements of the dataset as text file or set of text file in a given dir in the local file system HDFS or any other Hadoop supported file system Memory Management Spark provides three options for persist RDDs 1 in memory storage as deserialized Java Objs fastest JVM can access RDD natively 2 in memory storage as serialized data space limited choose another efficient r epresentation lower performance cost 3 on disk storage RDD too large to keep in memory and cost ly to recompute Example Log
View Full Document