UW CSE 444 - Lecture Notes

Introduction to Database Systems
CSE 444, Lecture 21: Parallel and Distributed Databases
CSE 444 - Summer 2010

Where We Are
• How to use a DBMS as a:
  – Data analyst: SQL, SQL, SQL, ...
  – Application programmer: JDBC, XML, ...
  – Database administrator: tuning, triggers, security
  – Massive-scale data analyst: parallel DBMSs and Pig/MapReduce
• How DBMSs work
  – Transactions
  – Data storage and indexing
  – Query execution
  – Databases as a service

Outline
• Parallel data management motivation
  – All database vendors offer parallel DBMSs
  – Full support for the entire SQL language and for ACID transactions
• Basic concepts behind parallel DBMSs
• The MapReduce data processing approach
• Readings: Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004. Sections 1 through 3 only.

Parallel vs. Distributed Databases
• Parallel database system: improves performance through a parallel implementation
• Distributed database system: data is stored across several sites, each site managed by a DBMS capable of running independently

Parallel DBMSs
• Goal: improve performance by executing multiple operations in parallel
  – Want to scale transactions per second
  – Want to scale large analytics queries
• Key benefit: cheaper to scale than relying on a single, increasingly more powerful processor
• Key challenge: ensure that overhead and contention do not kill performance
Performance Metrics for Parallel DBMSs
• Speedup: more processors → higher speed
  – Individual queries should run faster
  – Should have more transactions per second (TPS)
• Scaleup: more processors → can process more data
  – Batch scaleup: same time for the same query on larger data
  – Transaction scaleup: N times the TPS on an N-times-larger database

Linear vs. Non-linear Speedup
[Figure: speedup (e.g., TPS) as a function of the number of processors P, contrasting linear with non-linear (sub-linear) speedup]

Linear vs. Non-linear Scaleup
[Figure: batch scaleup as a function of the number of processors P and the data size, at ×1, ×5, ×10, and ×15]

Challenges to Speedup & Scaleup
• Startup cost: the cost of starting an operation on many processors
• Interference: contention for resources between processors
• Skew: the slowest step becomes the bottleneck

Architectures for Parallel Databases
• Shared memory
• Shared disk
• Shared nothing

Shared Memory
[Figure: processors P connected through an interconnection network to a global shared memory and disks D]
• Single address space
• Typically each processor has its own memory, but the combined memory forms a single address space
  – NUMA (non-uniform memory access): local memory is faster than remote memory
  – Any memory access is still faster than disk

Shared Disk
[Figure: nodes with processor P and memory M connected through an interconnection network to shared disks D]
• Separate disks, not attached to processors
• Disk controllers manage competing requests from the processors

Shared Nothing
[Figure: an interconnection network connecting nodes, each with its own processor P, memory M, and disk D]
• Most scalable architecture
  – Minimizes interference by minimizing resource sharing
  – Can use commodity hardware
• Also the most difficult to program and manage
• Terminology: processor = server = node; P = number of nodes
• We will focus on shared nothing
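The speedup metric defined above can be made concrete with a small calculation. The TPS numbers below are invented for illustration; they are not measurements from these notes:

```python
# Hypothetical transactions-per-second (TPS) measured with P processors.
# Speedup(P) = TPS(P) / TPS(1); linear speedup means Speedup(P) == P.
tps = {1: 100, 2: 190, 4: 360, 8: 640}

for p in sorted(tps):
    speedup = tps[p] / tps[1]
    efficiency = speedup / p  # 1.0 would be perfectly linear
    print(f"P={p}: speedup={speedup:.1f}x, efficiency={efficiency:.2f}")
# P=8 gives speedup 6.4x, i.e. efficiency 0.80
```

Startup cost, interference, and skew are exactly the effects that push efficiency below 1.0 as P grows.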
Taxonomy for Parallel Query Evaluation
• Inter-query parallelism
  – Each query runs on one processor
• Inter-operator parallelism
  – A query runs on multiple processors
  – An operator runs on one processor
• Intra-operator parallelism
  – An operator runs on multiple processors
• We study only intra-operator parallelism: it is the most scalable

Horizontal Data Partitioning
• Relation R is split into P chunks R_0, ..., R_{P-1}, stored at the P nodes
• Round robin: tuple t_i goes to chunk (i mod P)
• Hash-based partitioning on attribute A: tuple t goes to chunk h(t.A) mod P
• Range-based partitioning on attribute A: tuple t goes to chunk i if v_{i-1} < t.A < v_i

Parallel Selection
• Compute σ_{A=v}(R) or σ_{v1<A<v2}(R)
• On a conventional database: cost = B(R)
• Q: What is the cost on a parallel database with P processors, if R is round-robin, hash, or range partitioned?
• A: B(R)/P in all cases
• However, different processors do the work:
  – Round robin: all servers do the work
  – Hash: one server for σ_{A=v}(R); all servers for σ_{v1<A<v2}(R)
  – Range: one server only

Data Partitioning Revisited
What are the pros and cons?
• Round robin: good load balance, but always needs to read all the data
• Hash-based partitioning: good load balance, but works only for equality predicates and full scans
• Range-based partitioning: works well for range predicates, but can suffer from data skew
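The three partitioning schemes above can be sketched in a few lines of Python. The value of P, the use of the built-in `hash` as h, and the range boundaries are all illustrative choices, not anything prescribed by the notes:

```python
P = 4  # number of nodes (illustrative)

def round_robin(tuples):
    """Tuple i goes to chunk (i mod P)."""
    chunks = [[] for _ in range(P)]
    for i, t in enumerate(tuples):
        chunks[i % P].append(t)
    return chunks

def hash_partition(tuples, attr):
    """Tuple t goes to chunk h(t.A) mod P."""
    chunks = [[] for _ in range(P)]
    for t in tuples:
        chunks[hash(t[attr]) % P].append(t)
    return chunks

def range_partition(tuples, attr, v):
    """Tuple t goes to chunk i if v[i-1] <= t.A < v[i], given
    P-1 sorted boundaries v (one closed/open convention of several)."""
    chunks = [[] for _ in range(P)]
    for t in tuples:
        i = sum(1 for b in v if t[attr] >= b)  # chunks covering t.A
        chunks[i].append(t)
    return chunks

R = [{"A": a} for a in range(20)]
print([len(c) for c in round_robin(R)])                        # [5, 5, 5, 5]
print([len(c) for c in range_partition(R, "A", [5, 10, 15])])  # [5, 5, 5, 5]
```

Uniform data gives balanced chunks under every scheme; range partitioning is the one that degrades when t.A clusters inside a few boundary intervals, which is the skew problem noted above.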
Parallel Group By
• Compute γ_{A, sum(B)}(R)
• Step 1: server i partitions its chunk R_i using a hash function h(t.A) mod P into R_{i,0}, R_{i,1}, ..., R_{i,P-1}
• Step 2: server i sends partition R_{i,j} to server j
• Step 3: server j computes γ_{A, sum(B)} on R_{0,j}, R_{1,j}, ..., R_{P-1,j}

Cost of Parallel Group By
• Recall the conventional cost: 3B(R)
• Cost of Step 1: B(R)/P I/O operations
• Cost of Step 2: (P-1)/P × B(R) blocks are sent
  – Network costs are assumed to be much lower than I/O costs
• Cost of Step 3: 2B(R)/P
  – Why? When can we reduce it to 0?
• Total = 3B(R)/P + communication costs

Parallel Join
• Step 1
  – For

