1IntroductionCPS 116Introduction to Database Systems2A few words about myself (and databases) Have been doing (and enjoying) research in databases ever since grad school (1995) But didn’t take any database course as an undergrad Just didn’t appreciate it)No h o ld o nt to t ke 116?)Now, why would you want to take 116?) It’s not really about databases per se—it’s about principlesof data management E.g., Google won’t care whether you know SQL, but… They still ask you “big data” questions in interviews Brin was a grad student in the Stanford Database GroupTrend: Moore’s Law reversed Moore’s Law: Processing power doubles every 18 months Amount of data doubles every 9 months Disk sales (# of bits) doubles every 9 months Parkinson’s Law: Data expands to fill the space available for storage)M’Ld3)Moore s Law reversed:Time to process all data doubles every 18 months! Does your attention span double every 18 months? No, so we need smarter data management techniques4Misc. course information Course website: http://www.cs.duke.edu/courses/fall08/cps116/ Course information; tentative syllabus and reference sections in the book; lecture slides, assignments, programming notes Book: Database Systems: The Complete Book, by H. Garcia-Molina, J. D. Ullman, and J. Widom. 2ndEd. Gradiance (“Online Accelerated Learning”): see course website for purchase information Blackboard: for grades only Mailing list: [email protected] Messages of general interest only No “official” recitation sessions; help sessions for assignments, project, and exams to be scheduled5Grading[90%, 100%] A- / A / A+[80%, 90%) B- / B / B+[70%, 80%) C- / C / C+[60%, 70%) D[0% 60%)F[0%, 60%)F No curves Scale may be adjusted downwards (i.e., grades upwards) if, for example, an exam is too difficult Scale will not go upwards—mistake would be mine alone if I made an exam too easy6Course load Four homework assignments (35%) Including Gradiance as well as additional written and programming problems Course project (25%) Details to be given in the third week of class Midterm and final (20% each) Open book, open notes Final is comprehensive, but emphasizes the second half of the course27Example projects SensorDB: a system for managing, cleansing, and visualizing sensor data collected from the Duke Forest Ashley DeMass, Jonathan Jou, Jonathan Odom, 2007 SuperDatabase: GUI for creating schema with rich datatypes, as well as editing and querying such dataAd E i M R Li C iW dD idAndy Ewing, MacRae Linton, Congyi Wu, and David Zhang, 2007 yourTunes: social music networking Nick Patrick, 2006 Facebook+ Tyler Brock and Beth Trushkowsky, 2005 Web-based K-ville tenting management Zach Marshall, 20058So, what is a database system?From Oxford Dictionary: Database: an organized body of related information Database system, DataBase Management System (DBMS): a software system that facilitates the ()ycreation and maintenance and use of an electronic database9What do you want from a DBMS? Keep data around (persistent) Answer queries (questions) about data Update dataExample: a traditional banking applicationExample: a traditional banking application Data: Each account belongs to a branch, has a number, an owner, a balance, …; each branch has a location, a manager, … Persistency: Balance can’t disappear after a power outage Query: What’s the balance in Homer Simpson’s account? What’s the difference in average balance between Springfield and Capitol City accounts? Modification: Homer withdraws $100; charge account with lower than $500 balance with a $5 fee10Sounds simple!1001#Springfield#Mr. Morgan... ...00987-00654#Ned Flanders#2500.0000123-00456#Homer Simpson#400.0000142-00857#Montgomery Burns#1000000000.00... ... ASCII file Accounts/branches separated by newlines Fields separated by #’s11Query1001#Springfield#Mr. Morgan... ...00987-00654#Ned Flanders#2500.0000123-00456#Homer Simpson#400.0000142-00857#Montgomery Burns#1000000000.00... ... What’s the balance in Homer Simpson’s account? A simple script Scan through the accounts file Look for the line containing “Homer Simpson” Print out the balance12Query processing tricks Tens of thousands of accounts are not Homer’s)Cluster accounts by owner’s initial: those owned by “A...” go into file A; those owned by “B...” go into file B; etc. → decide which file to search using the initial)Keep accounts sorted by owner name → binary search?)Hash accounts using owner name → compute file offset directly)Index accounts by owner name: index entries have the formh owner_name, file_offset i → search index to get file offset)And the list goes on… What happens when the query changes to: What’s the balance in account 00142-00857?313Observations Tons of tricks (not only in storage and query processing, but also in concurrency control, recovery, etc.) Different tricks may work better in different usage scenarios (example?) Same tricks get used over and over again in different applications14The birth of DBMS – 1(Figure from Hans-J. Schek’s VLDB 2000 slides)15The birth of DBMS – 2(Figure from Hans-J. Schek’s VLDB 2000 slides)16The birth of DBMS – 3(Figure from Hans-J. Schek’s VLDB 2000 slides)17Early efforts “Factoring out” data management functionalities from applications and standardizing these functionalities is an important first step CODASYL standard (circa 1960’s))Bachman got a Turing award for this in 1973 But getting the abstraction right (the API between applications and the DBMS) is still tricky18CODASYL Query: Who have accounts with 0 balance managed by a branch in Springfield? Pseudo-code of a CODASYL application:Use index on account(balance) to get accounts with 0 balance;For each account recordFor each account record:Get the branch id of this account; Use index on branch(id) to get the branch record;If the branch record’s location field reads “Springfield”:Output the owner field of the account record. Programmer controls “navigation”: accounts → branches How about branches → accounts?419What’s wrong? The best navigation strategy & the best way of organizing the data depend on data/workload characteristics With the CODASYL approach To write correct code, application programmers need
View Full Document