Parallel and distributed databases II

Some interesting recent systems
- MapReduce
- Dynamo
- Peer-to-peer

Then and now
A modern search engine

MapReduce

How do I write a massively parallel, data-intensive program?
- Develop the algorithm
- Write the code to distribute work to machines
- Write the code to distribute data among machines
- Write the code to retry failed work units
- Write the code to redistribute data for a second stage of processing
- Write the code to start the second stage after the first finishes
- Write the code to store intermediate result data
- Write the code to reliably store final result data

MapReduce: two phases
- Map: take input data and map it to zero or more key/value pairs
- Reduce: take key/value pairs with the same key and reduce them to a result
- The MapReduce framework takes care of the rest: partitioning data, repartitioning data, handling failures, tracking completion, ...
(Figures: MapReduce data flow; a failed worker, marked X, is retried by the framework.)

Example: count the number of times each word appears on the web.
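In a single process, the two phases of the word-count example can be sketched as follows. This is a minimal illustration only; a real MapReduce framework distributes these same steps across machines and handles the partitioning and retries itself:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word occurrence."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Group pairs by key -- the framework's repartitioning step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: sum the counts for one word."""
    return (key, sum(values))

docs = ["apple banana apple", "grape apple apple grape"]
result = dict(reduce_phase(k, vs) for k, vs in shuffle(map_phase(docs)).items())
print(result)  # {'apple': 4, 'banana': 1, 'grape': 2}
```

Note that the word-count logic itself is only a few lines; everything else (grouping, distribution, fault handling) is what the framework takes off the programmer's hands.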
(Figure: the map phase emits a pair such as (apple, 1) for every word occurrence; the shuffle groups pairs by key; the reduce phase sums each group, yielding totals such as (apple, 5), (grape, 5), (banana, 5).)

Other MapReduce uses
- Grep
- Sort
- Analyze the web graph
- Build inverted indexes
- Analyze access logs
- Document clustering
- Machine learning

Dynamo
- Always-writable data store
- Do I need ACID for this?

Eventual consistency
- Weak consistency guarantee for replicated data
- Updates initiated at any replica
- Updates eventually reach every replica
- If updates cease, eventually all replicas will have the same state
- Tentative versus stable writes: tentative writes are applied in a per-server partial order; stable writes are applied in the global commit order
- Bayou system at PARC
(Figure: inconsistent copies are visible at the storage units, but local write order is preserved and a global commit order is eventually established.)

(Figure: starting from (Joe, 22, Arizona), the updates 22 -> 32, Arizona -> Montana, and Joe -> Bob reach the replicas in different orders, yet all replicas end up in the same state, (Bob, 32, Montana).)

Mechanisms
- Epidemics/rumor mongering: updates are "gossiped" to random sites; gossip slows when (almost) every site has heard it
- Anti-entropy: pairwise sync of whole replicas, via log-shipping and occasional DB snapshots; ensures everyone has heard all updates
- Primary copy: one replica determines the final commit order of updates

Epidemic replication
(Figure: nodes are susceptible (no update), infective (spreading the update), or removed (updated, no longer spreading); an update gossiped to an already-infected node is wasted.)

Anti-entropy
(Figure: pairwise sync delivers the update even to susceptible nodes that gossip alone would miss.)

What if I get a conflict?
How to detect it? Keep a version vector: (A's count, B's count).
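The dominance test on version vectors can be sketched as follows. This is a minimal two-replica illustration; the function names are mine, not from any particular system:

```python
def dominates(v1, v2):
    """v1 dominates v2 if every component of v1 is >= the matching component of v2."""
    return all(a >= b for a, b in zip(v1, v2))

def check(v_a, v_b):
    """Two versions conflict exactly when neither vector dominates the other."""
    if dominates(v_a, v_b) or dominates(v_b, v_a):
        return "no conflict"
    return "conflict"

# Only A has written: A's vector (1, 0) dominates B's (0, 0).
print(check((1, 0), (0, 0)))  # no conflict
# A and B wrote concurrently: (1, 0) vs (0, 1) -- neither dominates.
print(check((1, 0), (0, 1)))  # conflict
```

The same pairwise comparison generalizes to one counter per replica: a write increments the local counter, and concurrent writes always produce incomparable vectors.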
(Figure: replica A writes Brian = 'Good Professor' while replica B writes Brian = 'Bad Professor'.)
- Initially the vector is (0,0) at both replicas
- If only A writes, it sets its version vector to (1,0); (1,0) dominates B's version (0,0), so there is no conflict
- If A writes, setting its vector to (1,0), and B also writes, setting its vector to (0,1), neither vector dominates the other: conflict!

How to resolve conflicts?
- Commutative operations: allow both. Adding "Fight Club" to the shopping cart and adding "Legends of the Fall" to the shopping cart can occur in either order
- Thomas write rule: take the last update; that's the one we "meant" to have stick
- Let the application cope with it: expose the possible alternatives to the application, which must write back one answer

Peer-to-peer
- Great technology
- Shady business model
- Focus on the technology for now

Peer-to-peer origins
Where can I find songs for download?
(Figures: a centralized web interface answers the query; Napster keeps a central index but transfers files peer-to-peer; Gnutella floods the query among the peers themselves.)

Characteristics
- Peers both generate and process messages: server + client = "servent"
- Massively parallel
- Distributed
- Data-centric: route queries and data, not packets

Gnutella

Joining the network
(Figure: a new node's ping messages are flooded through the network; each reachable node replies with a pong.)

Search
(Figure: a query Q is flooded to neighbors with TTL = 4.)

Download
(Figure: the file is then transferred directly between the two peers.)

Failures
(Figure: a failed node simply stops forwarding; the flood routes around it.)

Scalability!
- Messages flood the network
- Example: the Gnutella meltdown of 2000

How to make it more scalable?
- Search more intelligently
- Replicate information
- Reorganize the topology

Smarter search strategies:
- Iterative deepening (Yang and Garcia-Molina 2002; Lv et al 2002)
- Directed breadth-first search (Yang and Garcia-Molina 2002)
- Random walk (Adamic et al 2001)
- Random walk with replication (Cohen and Shenker 2002; Lv et al 2002)
- Supernodes, as in Kazaa (Yang and Garcia-Molina 2003)

Some interesting observations
- Most peers are short-lived: average up-time is 60 minutes; for a 100K-node network, this implies a churn rate of 1,600 nodes per minute (Saroiu et al 2002)
- Most peers are "freeloaders": 70 percent of peers share no files, and most results come from 1 percent of peers (Adar and Huberman 2000)
- The network tends toward a power-law topology, in which the nth most connected peer has k/n connections: a few peers have many connections, while most peers have few (Ripeanu and Foster 2002)

"Structured" networks
- Idea: form a structured topology that gives certain performance guarantees
- Number of hops needed for
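The Gnutella-style flooded search with a TTL described above, including the duplicate-dropping that peers perform, can be sketched as follows. The topology, peer names, and file names here are hypothetical:

```python
from collections import deque

def flood_search(neighbors, files, start, wanted, ttl):
    """Flood a query from `start` with a hop limit (TTL).
    `neighbors` maps peer -> list of peers; `files` maps peer -> set of files.
    Returns the peers that answered the query, in the order reached."""
    hits = []
    seen = {start}
    queue = deque([(start, ttl)])
    while queue:
        peer, ttl_left = queue.popleft()
        if wanted in files.get(peer, set()):
            hits.append(peer)              # peer sends a query-hit back
        if ttl_left == 0:
            continue                       # TTL exhausted: do not forward
        for nbr in neighbors.get(peer, []):
            if nbr not in seen:            # duplicate queries are dropped
                seen.add(nbr)
                queue.append((nbr, ttl_left - 1))
    return hits

# Hypothetical 5-peer network: A-B, A-C, B-D, D-E.
net = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A"], "D": ["B", "E"], "E": ["D"]}
shared = {"D": {"song.mp3"}, "E": {"song.mp3"}}
print(flood_search(net, shared, "A", "song.mp3", ttl=2))  # ['D']
print(flood_search(net, shared, "A", "song.mp3", ttl=3))  # ['D', 'E']
```

The two runs show why TTL matters: with TTL 2 the query dies one hop short of E, while raising the TTL finds more copies at the cost of flooding more of the network — the trade-off behind the scalability problems above.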