UT Dallas CS 6350 - SparkChapter3(1)

Spark: Fast, Interactive, Language-Integrated Cluster Computing

Reference: Dr. M. Zaharia, An Architecture for Fast and General Data Processing on Large Clusters (PhD dissertation).
http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-12.pdf

Slide outline (some titles span several slides): Project Goals · Motivation · Examples · Goal: In-Memory Data Sharing · Challenge · Solution: Resilient Distributed Datasets (RDDs) · RDD Recovery · Generality of RDDs · Outline · About Scala · Quick Tour · Spark Overview · RDD Abstraction · RDDs in More Detail · RDD Types: Parallelized Collections · RDD Types: Hadoop Datasets · RDD Operations · Spark Operations · Transformations · Actions · Memory Management · Example: Log Mining · RDD Fault Tolerance · Fault Recovery · Benefits of RDD Model · RDDs vs Distributed Shared Memory · Representing RDDs · Interface Used to Represent RDDs · Internals of the RDD Interface · Example: Hadoop RDD · Example: Filtered RDD · Example: Joined RDD · A More Complex DAG · Narrow and Wide Transformations · RDD Dependencies · Implementation · Job Scheduler · Task Scheduler · Advanced Features · Working With Spark · Using the Shell · SparkContext · Creating RDDs · Basic Transformations · Basic Transformations/Actions: Scala · Basic Actions · Working with Key-Value Pairs · Some Key-Value Operations · Example: Word Count · Other Key-Value Operations · Setting the Level of Parallelism · Using Local Variables · Closure Mishap Example · More RDD Operators · Example Application: PageRank · Example: PageRank · Basic Idea · Algorithm · Scala Implementation · PageRank Performance · Other Iterative Algorithms · Example: Logistic Regression · Logistic Regression Code · Example: MapReduce

Project Goals

Extend the MapReduce model to better support two common classes of analytics apps:
  >> Iterative algorithms (machine learning, graph)
  >> Interactive data mining
Enhance programmability:
  >> Integrate into the Scala programming language
  >> Allow interactive use from the Scala interpreter

Motivation

• MapReduce greatly simplified "big data" analysis on large, unreliable clusters
• But as soon as it got popular, users wanted more:
  – More complex, multi-stage applications (e.g. iterative machine learning and graph processing)
  – More interactive ad-hoc queries
Response: specialized frameworks for some of these apps (e.g. Pregel for graph processing)

Motivation

• Complex apps and interactive queries both need one thing that MapReduce lacks: efficient primitives for data sharing
• In MapReduce, the only way to share data across jobs is stable storage, which is slow!

Examples

[Figure: an iterative job writes its result to HDFS after each step and reads it back before the next (Input → iter. 1 → iter. 2 → …, with HDFS read/write in between); interactive queries 1, 2, 3, … each re-read the input from HDFS to produce results 1, 2, 3, …]

Slow due to replication and disk I/O, but necessary for fault tolerance.

Goal: In-Memory Data Sharing

[Figure: the input is processed one time ("one-time processing") and kept in memory; iterations (iter. 1, iter. 2, …) and queries (query 1, 2, 3, …) then share the in-memory data instead of going through HDFS.]

10-100× faster than network/disk, but how do we get fault tolerance?
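To make the goal concrete, here is a small sketch (not part of the original slides) of what in-memory data sharing looks like with the Spark API that the rest of this deck introduces; it mirrors the log-mining example listed in the outline, and the HDFS path and the ERROR/timeout filters are illustrative assumptions.

    // Minimal sketch, assuming a SparkContext named sc and a hypothetical log file in HDFS.
    val lines  = sc.textFile("hdfs://...")              // read once from stable storage
    val errors = lines.filter(_.contains("ERROR"))      // coarse-grained transformation
    errors.cache()                                      // keep the filtered data in RAM for reuse

    // Several "queries" over the same in-memory data, with no further HDFS reads:
    val totalErrors   = errors.count()
    val timeoutErrors = errors.filter(_.contains("timeout")).count()

If a node holding part of errors fails, Spark can rebuild just the lost partitions by re-running the filter on the corresponding input blocks; that lineage-based recovery is exactly what the next slides motivate.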
Challenge

How do we design a distributed memory abstraction that is both fault-tolerant and efficient?

Challenge

• Existing storage abstractions have interfaces based on fine-grained updates to mutable state
  – RAMCloud, databases, distributed memory, Piccolo
• They therefore require replicating data or logs across nodes for fault tolerance
  – Costly for data-intensive apps
  – 10-100× slower than a memory write

Solution: Resilient Distributed Datasets (RDDs)

• A restricted form of distributed shared memory
  – Immutable, partitioned collections of records
  – Can only be built through coarse-grained deterministic transformations (map, filter, join, …)
• Efficient fault recovery using lineage
  – Log one operation to apply to many elements
  – Recompute lost partitions on failure
  – No cost if nothing fails

RDD Recovery

[Figure: the in-memory sharing diagrams again (one-time processing of the input, then iterations and queries over in-memory data); lost partitions are recovered by recomputation from lineage rather than by re-reading everything from HDFS.]

Generality of RDDs

• Despite their restrictions, RDDs can express surprisingly many parallel algorithms
  – These algorithms naturally apply the same operation to multiple items
• They unify many current programming models
  – Data-flow models: MapReduce, Dryad, SQL, …
  – Specialized models for iterative apps: BSP (Pregel), iterative MapReduce (HaLoop), bulk incremental, …
• They support new apps that these models don't

Outline

• Introduction to Scala & functional programming
• What is Spark
• Resilient Distributed Datasets (RDDs)
• Implementation
• Demo
• Conclusion

About Scala

A high-level language for the JVM:
  >> Object-oriented + functional programming (FP)
Statically typed:
  >> Comparable in speed to Java
  >> No need to write types, thanks to type inference
Interoperates with Java:
  >> Can use any Java class, inherit from it, etc.
  >> Can also call Scala code from Java

Quick Tour (two slides of Scala examples; the code itself is not captured in this text preview)

All of these operations leave the list unchanged, because List is immutable.

Outline (repeated as a section divider)

Spark Overview

Concept: resilient distributed datasets (RDDs)
  >> Immutable collections of objects spread across a cluster
  >> Built through parallel transformations (map, filter, etc.)
  >> Automatically rebuilt on failure
  >> Controllable persistence (e.g. caching in RAM) for reuse
  >> Shared variables that can be used in parallel operations
Goal: work with distributed collections as you would with local ones.

Outline (repeated as a section divider)

RDD Abstraction

An RDD is a read-only, partitioned collection of records.
It can only be created from:
  (1) data in stable storage, or
  (2) other RDDs (via transformations, which form its lineage).
An RDD carries enough information about how it was derived from other datasets (its lineage).
Users can control two aspects of RDDs:
  1) Persistence (keep it in RAM for reuse)
  2) Partitioning (hash or range partitioning, e.g. on the key of <k, v> records)

RDDs in More Detail

• An RDD is an immutable, partitioned, logical collection of records
  – It need not be materialized; it contains the information needed to rebuild the dataset from stable storage
• Partitioning can be based on a key in each record (using hash or range partitioning)
• Built using bulk transformations on other RDDs
• Can be cached for future reuse

RDD Types: Parallelized Collections

Created by calling SparkContext's parallelize method on an existing Scala collection (a Seq object).
Once created, the distributed dataset can be operated on in parallel.

RDD Types: Hadoop Datasets

Spark supports text files, SequenceFiles, and any other Hadoop InputFormat:

    val distFiles = sc.textFile(URI)

Other … (the preview is truncated here)
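Because the preview stops mid-slide, here is a short, hedged sketch (not from the deck itself) tying together the two RDD creation paths just described, parallelized collections and Hadoop datasets; the sample numbers and the HDFS URI are made-up placeholders.

    // Parallelized collection: distribute an existing Scala Seq across the cluster.
    val data     = Seq(1, 2, 3, 4, 5)
    val distData = sc.parallelize(data)            // RDD[Int], partitioned across workers
    val squares  = distData.map(x => x * x)        // transformation, evaluated in parallel
    println(squares.reduce(_ + _))                 // action; prints 55

    // Hadoop dataset: a text file (or any other Hadoop InputFormat) from HDFS, the local file system, etc.
    val distFiles = sc.textFile("hdfs://namenode/path/to/file.txt")   // hypothetical URI
    println(distFiles.count())                     // action; number of lines in the file

Both forms yield ordinary RDDs, so the transformations and actions covered in the remaining slides apply to either one.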

