DOC PREVIEW
MIT 6 830 - Problem Set 2

This preview shows page 1-2-3 out of 8 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 8 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 8 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 8 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 8 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

6.830 Problem Set 2 16.830 Problem Set 2 (2014)Assigned: Monday, Sep 22, 2014Due: Wednesday, Oct 8, 2014, 11:59 PMSubmit to Stellar: https://stellar.mit.edu/S/course/6/fa14/6.830/homework/The purpose of this problem set is to give you some practice with concepts related to schema design, query planning, and queryprocessing. Start early as this assignment is long.Part 1 - Query Plans and Cost ModelsIn this part of the problem set, you will examine query plans that PostgreSQL uses to execute queries, and try to understandwhy it produces the plan it does for a certain query.The data set you will use has the same schema as the MIMIC-II dataset you used in problem set 1. Rather than running your owninstance of SQLite, however, you will be connecting to our PostgreSQL server, since PostgreSQL produces more interestingquery plans than SQLite.To understand what query plan is being used, PostgreSQL includes the EXPLAIN command. It prints the plan for a query,including all of the physical operators and access methods being used. For example, the following SQL command displays thequery plan for the SELECT:EXPLAIN SELECT*FROM d_meditems WHERE label LIKE '%Folate%';In this problem, you will find \di and \d tablename commands useful. In order to use these, you must install PostgreSQLcommand-line client. Make sure you use PostgreSQL 8.3+ so that your results are consistent with the solutions.Athena already has version 9.3.5 installed, so you can simply ssh into athena.dialup.mit.edu and get started. In caseyou want to work on your own Debian/Ubuntu machine, you can install the postgresql-client package by running thefollowing command in your shell.sudo apt-get install postgresql-clientYou can then connect to our PostgreSQL server by running the following command.psql -h vise3.csail.mit.edu -U mimic2 -d mimic2Note that we currently only allow connections from MIT IPs so you will need to connect from on campus or by ssh’ing intoAthena.To understand the output of EXPLAIN, you will probably want to read the performance tips chapter of the PostgreSQL docu-mentation:http://www.postgresql.org/docs/9.2/static/performance-tips.htmlIn general to understand the plans that are generated you may need to spend a bit of time searching the Internet.We have run VACUUM FULL ANALYZE on all of the tables in the database, which means that all of the statistics used byPostgreSQL server should be up to date.6.830 Problem Set 2 21. [15 points]: Query PlansRun the following query (using EXPLAIN) in PostgreSQL and answer questions (a) – (f) below:EXPLAIN SELECT ci.itemid, ci.label, count(*)FROM d_chartitems ci,chartevents ce,d_patients p,demographic_detail dWHERE ci.itemid = ce.itemidAND d.subject_id = p.subject_idAND ce.subject_id = p.subject_idAND d.overall_payor_group_descr = 'MEDICAID'AND p.sex = 'F'AND ce.icustay_id IS NOT NULLGROUP BY ci.itemid, ci.labelORDER BY ci.itemid, ci.label;a. What physical plan does PostgreSQL use? Your answer should consist of a drawing of a query tree annotated withthe access method types and join algorithms (note that you can use the pgadmin3 tool shown in class to draw plans,but will need to annotate them by hand.)b. Why do you think PostgreSQL selected this particular plan?c. What does PostgreSQL estimate the size of the result set to be?d. When you actually run the query, how big is the result set?e. Run some queries to compute the sizes of the intermediate results in the query. Where do PostgreSQL’s estimatesdiffer from the actual intermediate result cardinalities?f. Given the tables and indices we have created, do you think PostgreSQL selected the best plan? You can list allindices with \di, or list the indices for a particular table with \d tablename. If not, what would be a betterplan? Why?6.830 Problem Set 2 32. [15 points]: Estimating Cost of Query PlansUse EXPLAIN to find the query plans for the following two queries and answer question (a).1. EXPLAIN SELECT m.labelFROM d_chartitems AS c,chartevents AS ce,medevents AS me,d_meditems AS mWHERE c.itemid=ce.itemidAND ce.subject_id=me.subject_idAND me.itemid=m.itemidAND ce.subject_id > 100AND m.itemid>370;2. EXPLAIN SELECT m.labelFROM d_chartitems AS c,chartevents AS ce,medevents AS me,d_meditems AS mWHERE c.itemid=ce.itemidAND ce.subject_id=me.subject_idAND me.itemid=m.itemidAND ce.subject_id > 27000AND m.itemid>370;a. Notice that the query plans for the two queries above are different, even though they have the same general form.Explain the difference in the query plan that PostgreSQL chooses, and explain why you think the plans are different.Now use EXPLAIN to find the query plan for the following query, and answer questions (b) – (g).3. EXPLAIN SELECT m.labelFROM d_chartitems AS c,chartevents AS ce,medevents AS me,d_meditems AS mWHERE c.itemid=ce.itemidAND ce.subject_id=me.subject_idAND me.itemid=m.itemidAND ce.subject_id > 26300AND m.itemid>370;b. What is PostgreSQL doing for this query? How is it different from the previous two plans that are generated? Youmay find it useful to draw out the plans (or use pgadmin3) to get a visual representation of the differences, thoughyou are not required to submit drawings of the plans in your answer.c. Run some more EXPLAIN commands using this query, sweeping the constant after the ’>’ sign from 100 to 27000.For what value of subject id does PostgreSQL switch plans?Note: There might be additional plans other than these three. If you find this to be the case, please write theestimated last value of subject id for which PostgreSQL switches query plans.d. Why do you think it decides to switch plans at the values you found? Please be as quantitative as possible in yourexplanation.e. Suppose the crossover point (from the previous question) between queries 2 and 3 is subject id23. Compare theactual running times corresponding to the two alternative query plans at the crossover point. How much do theydiffer? Do the same for all switch points.Inside psql, you can measure the query time by using the \timing command to turn timing on for all queries. Toget accurate timing, you may also wish to redirect output to /dev/null, using the command \o /dev/null;6.830 Problem Set 2 4later on you can stop redirection just by issuing the \o command without any file name. You may also wish to runeach query several times, throwing out the first time (which may be longer if the data is not resident in the bufferpool) and averaging the remaining runs.f. Based on your answers to the previous two


View Full Document
Download Problem Set 2
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Problem Set 2 and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Problem Set 2 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?