6.830 2010 Lecture 15: C-Store (Sam Madden)

Why are we reading this paper?
  C-Store has a standard interface, but a very different design
  helps us understand what choices standard DBs made
  makes us think about a different set of apps than OLTP

Paper status
  most individual techniques already existed
  C-Store pulls them together
  not just a limited special-purpose DB
    transparent -- SQL interface
    read/write, not just r/o
    consistency
    transactional update
  paper doesn't describe a complete system
    design + partial implementation
  commercialized as Vertica

What's a data warehouse?
  big historical collection of data
  companies analyze it to spot trends &c
    what products are in? what's out? where is cherry coke popular?
    spot it early, order more
  mostly r/o, but must be updated, maybe continuously

Example: big chain of stores, e.g. Walmart
  each store records every sale
  uploads to a central DB at headquarters
  keep the last few years

Typical schema (logical):
  time(tid, year, mon, day, hour)
  product(pid, type, color, supplier)
  xact(tid, pid, sid, cid, price, discount, tax, coupon, &c)
  store(sid, state, mgr, size)
  customer(cid, city, zip)
  called a "star schema"
  "fact table" in the middle (xact)
    gigantic # of rows
    might have 100s of columns
  "dimension tables" typically much smaller

How big is the data?
  50K products (5 MB)
  3K stores (1 MB)
  5M customers (5 GB)
  150K times (5 MB) (10-minute granularity)
  350B xact rows (35 TB) (100 bytes/sale)
    3000 stores * 10 registers * 20 items/min * 2 years

Example 1:
  total sales by store in Texas on Mondays
  join xact to time and store
  filter by day and state
  group by sid, aggregate

Example 2:
  average daily sales for Nikon cameras
  join xact to product and time
  filter by supplier
  group by day, aggregate

How long would these queries take on a traditional DB?
  probably have to look at every page of the fact table
    even if only 1% of records pass the filter,
    every block might have one relevant record,
    so an index into the fact table may not be very useful
  joins to the dimension tables are pretty cheap
    they fit in memory, fast hash lookups
  how long to read the whole fact table?
    35 TB, say 100 disks, 50 MB/sec/disk => 2 hours
    ouch!

You can imagine building special setups
  e.g. maintain aggregates in real time -- pre-compute
    know all the queries in advance
    update aggregate answers as new data arrives
    table of daily sales of Nikon cameras, &c
  but then it's hard to run "ad-hoc" queries

C-Store

Why columns? Why store each column separately?
  to avoid reading bytes from the fact table that you don't need

Why "projections" of columns?
  you usually want more than one column
  e.g. sid and price for example 1

Why is a projection sorted on one of the columns?
  to help aggregation: bring all the data for a given store together
  or to help filtering, by bringing all the data with a given column value together
    so you only have to read an interval of the column

What projection would help example 1?
  columns: sid, price, store.state, time.day
  note we are including columns from multiple logical tables
  note we are duplicating a lot of data, e.g. store.state
  note the projection must have every column you need -- you can't consult the "original" row
    thus you don't need a notion of tupleID
  note the i'th row in each column comes from the same xact row
  order? sid
      or state, sid
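A minimal sketch (illustrative only; the column values and the sparse-index shortcut are made up, not from the paper) of how such a projection, stored as row-aligned columns sorted on (state, sid), answers example 1 with one sequential pass and no joins:

    # Hypothetical columnar projection for example 1: four columns drawn from
    # different logical tables, kept as parallel arrays sorted on (state, sid).
    from collections import defaultdict

    state = ["CA",  "CA",  "TX",  "TX",  "TX",  "TX"]   # store.state
    sid   = [  7,     7,     2,     2,     3,     3 ]   # xact.sid
    day   = ["Mon", "Tue", "Mon", "Mon", "Mon", "Fri"]  # time.day
    price = [ 10,     4,     3,     1,     8,     2 ]   # xact.price

    # Total sales by store in Texas on Mondays.
    # Because the projection is sorted on state first, the Texas rows form one
    # contiguous interval; a real column store would find it via a sparse index
    # rather than index()/len().
    lo = state.index("TX")
    hi = len(state)

    totals = defaultdict(int)
    for i in range(lo, hi):
        if day[i] == "Mon":              # filter by day
            totals[sid[i]] += price[i]   # group by sid, aggregate

    print(dict(totals))                  # {2: 4, 3: 8}

The point is the access pattern: only the four needed columns are read at all, and only the Texas interval of each.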
Why multiple overlapping projections?
  why store the same column multiple times?

What projection would help example 2?
  columns: price, time.year, time.mon, time.day, product.supplier
  note we are not including the join columns! e.g. pid
  order? supplier, year, mon, day
      or year, mon, day, supplier

What if there isn't an appropriate projection for your query?
  you lose -> wait 2 hours
  ask the DB administrator to add a projection

Could we get the same effect in a conventional DB?
  keep heap files sorted ("clustered")?
    can only do it one way
  B+Trees for ordering and filtering?
    have to avoid seeks into the main heap file, so multi-key B+Trees
  copy the data into many tables, one per projection
  so yes, we could
  but it's very manual
    choose the right table for each query
    updating?
  "materialized views" partially automate this for a conventional DB
    the eval in Section 9 shows they make the row store perform 10x better
    but C-Store is still faster

Won't all this burn up huge quantities of disk space?
  How do they compress?
  Why does self-order vs foreign-order matter in Section 3.1?
  How would you compress our example projections?
    sid ordered by sid?
    price ordered by sid?
    store.state ordered by sid?
    time.day ordered by sid?

Won't it be slow to update if there are lots of copies?
  How does C-Store update efficiently?
  How does C-Store run consistent r/o queries despite updates?

Why segment across a cluster of servers?
  parallel speedup
  many disks, more memory, many CPUs

How do they ensure good parallel speedup on a cluster?
  What is a "horizontal partition"?
  Why will that lead to good parallel speedup?
  sorting allows filtering and aggregating to proceed in parallel
  will talk about parallel DBs more later

Evaluation? Section 9
  what are the main claims that need to be substantiated?
    faster on data-warehouse queries than a traditional row store
    uses a reasonable amount of space

Experimental setup
  standard data-warehouse benchmark "TPC-H"
  single machine
    one disk
    2 GB RAM
  this is a little odd -- the original data is also 2 GB
    a small reduction in memory requirement could give a huge boost in this setup
    but would make no difference for larger data sets

TPC-H scale_10
  standard data-warehouse benchmark
  comes in different sizes ("scale")
    the scale defines how many rows in each table
  customer: 1.5M rows, about 15 MB
  orders: 15M rows, about 150 MB
  lineitem: 60M rows, about 2.4 GB

The results are spectacular!
  mostly > 100x faster than the row store

Q4 is 400x faster on C-Store -- why?
  prints o_orderdate, l_shipdate
  groups by o_orderdate
  filters on l_orderkey, o_orderkey, o_orderdate
  must be using D2: o_orderdate, l_shipdate, l_suppkey | o_orderdate, l_suppkey
    D2 is missing o_orderkey and l_orderkey -- do we need them?
    D2 is already in good order to aggregate by o_orderdate
  how much data is C-Store scanning?
    two columns with 60M rows
    o_orderdate is probably compressed down to a bit or a byte
    l_shipdate might be 4 bytes
    so 300 MB?
    read from disk in 6 seconds
    read from RAM in 0.3 seconds
    actual performance is in between: 2 seconds
      maybe skipping due to the o_orderdate > D filter?
      maybe some in memory, some on disk?
  what is the row DB probably doing for 723 seconds?
    it would have to scan the 2 GB LINEITEM table
    if it doesn't fit in RAM, that's 40 seconds at 50 MB/sec from disk
    it must join to the ORDERS table, which fits in memory, so that should be fast
    hash then sort (or something) by o_orderdate
    hard to guess why the row DB takes 723 rather than 40+ seconds

Q2 is only 3x faster with C-Store
  needs l_suppkey,
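To make the back-of-envelope numbers above easy to check, here is a small Python calculation that just reproduces the arithmetic in these notes; the disk bandwidth, row width, and compressed column widths are the notes' assumptions, not measurements (the ~1 GB/sec RAM scan rate is inferred from the "300 MB in 0.3 seconds" figure):

    # Back-of-envelope arithmetic from the notes (assumed figures, not measurements).
    TB, MB = 1e12, 1e6

    # Row store scanning the whole 35 TB fact table:
    fact_bytes = 350e9 * 100                  # 350B rows * 100 bytes/sale
    disks, per_disk = 100, 50 * MB            # 100 disks at 50 MB/sec each
    scan_secs = fact_bytes / (disks * per_disk)
    print(f"fact table {fact_bytes / TB:.0f} TB, full scan ~{scan_secs / 3600:.1f} hours")  # ~1.9 hours

    # C-Store answering Q4 from projection D2:
    rows = 60e6                               # lineitem rows
    bytes_per_row = 1 + 4                     # o_orderdate ~1 byte compressed, l_shipdate ~4 bytes
    q4_bytes = rows * bytes_per_row
    print(f"Q4 scan {q4_bytes / MB:.0f} MB")                        # ~300 MB
    print(f"  one 50 MB/s disk: {q4_bytes / (50 * MB):.0f} s")      # ~6 s
    print(f"  RAM at ~1 GB/s:   {q4_bytes / 1e9:.1f} s")            # ~0.3 s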