CMSC 424 – Database designLecture 20Join size predictionConcurrency/recoveryMihai PopAdminJob DescriptionSoftware DeveloperThis Software Developer position is located at the U.S. Army Medical Research and Materiel Command (MRMC)Bioinformatics Cell (BIC), Ft. Detrick, MD. Applicants should have a Bachelors or Masters degree in computerscience, engineering, or a related discipline and an interest in pursuing the development of software tools thatdirectly support the mission of the MRMC life scientists. In particular, the Software Developer supportsbiomedical research by transitioning proof of concept software prototypes to production quality systems forresearch applications. This position requires substantial experience developing Web enabled client/serversystems with a database backend. A strong knowledge of servelets, CGI, HTML, JavaScript, SQL, XML, Javaapplets, and GNU tools is desirable. Must know C and possess good knowledge of OO, MySQL/Oracle, andPython/Ruby/LISP. Operating system knowledge should include Unix variants (BSD’s, GNU/Linux) andWindows OS. The Software Developer will also help maintain the BIC’s IT infrastructure. Experience withinstalling and maintaining UNIX/LINUX based server machines, and a working knowledge of Local AreaNetworks, MS Exchange based mail servers, DNS servers and CISCO firewalls is a plus.Please submit resume to:Jaques Reifman, Ph.D.Senior Research ScientistU.S. Army Medical Research and Material CommandFt. Detrick, MDEmail: [email protected] Phone: 301-619-7915www.bioanalysis.orgStatistical Information for Cost Estimation•nr: number of tuples in a relation r.•br: number of blocks containing tuples of r.•lr: size of a tuple of r.•fr: blocking factor of r — i.e., the number of tuples of r that fit into one block.•V(A, r): number of distinct values that appear in r for attribute A; same as the size of ∏A(r).•If tuples of r are stored together physically in a file, then: •Also: instead of V(A,r) can store a histogram=rfrnrbSelection Size Estimation•σA=v(r)•nr / V(A,r) : number of records that will satisfy the selection•Equality condition on a key attribute: size estimate = 1•σA≤V(r) (case of σA ≥ V(r) is symmetric)–Let c denote the estimated number of tuples satisfying the condition. –If min(A,r) and max(A,r) are available in catalog•c = 0 if v < min(A,r)•c =– If histograms available, can refine above estimate–In absence of statistical information c is assumed to be nr / 2.nr.v−min A , r max A , r−min A , rSize Estimation of Complex Selections•The selectivity of a condition θi is the probability that a tuple in the relation r satisfies θi . – If si is the number of satisfying tuples in r, the selectivity of θi is given by si /nr.•Conjunction: σθ1∧ θ2∧. . . ∧ θn (r). Assuming indepdence, estimate of tuples in the result is:•Disjunction:σθ1∨ θ2 ∨. . . ∨ θn (r). Estimated number of tuples:•Negation: σ¬θ(r). Estimated number of tuples:nr – size(σθ(r))nrnrnsssn∗∗∗∗ . . . 21−∗∗−∗−−∗)1(...)1()1(121rnrrrnsnsnsnJoin Operation: Running ExampleRunning example: depositor customerCatalog information for join examples:•ncustomer = 10,000.•fcustomer = 25, which implies that bcustomer =10000/25 = 400.•ndepositor = 5000.•fdepositor = 50, which implies that bdepositor = 5000/50 = 100.•V(customer_name, depositor) = 2500, which implies that , on average, each customer has two accounts.–Also assume that customer_name in depositor is a foreign key on customer.–V(customer_name, customer) = 10000 (primary key!)hEstimation of the Size of Joins•The Cartesian product r x s contains nr .ns tuples; each tuple occupies sr + ss bytes.•If R ∩ S = ∅, then r s is the same as r x s. •If R ∩ S is a key for R, then a tuple of s will join with at most one tuple from r–therefore, the number of tuples in r s is no greater than the number of tuples in s.•If R ∩ S in S is a foreign key in S referencing R, then the number of tuples in r s is exactly the same as the number of tuples in s.•The case for R ∩ S being a foreign key referencing S is symmetric.•In the example query depositor customer, customer_name in depositor is a foreign key of customer– hence, the result has exactly ndepositor tuples, which is 5000Estimation of the Size of Joins (Cont.)•If R ∩ S = {A} is not a key for R or S.If we assume that every tuple t in R produces tuples in R S, the number of tuples in R S is estimated to be:If the reverse is true, the estimate obtained will be:The lower of these two estimates is probably the more accurate one.•Can improve on above if histograms are available–Use formula similar to above, for each cell of histograms on the two relations ),( sAVnnsr∗),( rAVnnsr∗Estimation of the Size of Joins (Cont.)•Compute the size estimates for depositor customer without using information about foreign keys:–V(customer_name, depositor) = 2500, andV(customer_name, customer) = 10000–The two estimates are 5000 * 10000/2500 - 20,000 and 5000 * 10000/10000 = 5000–We choose the lower estimate, which in this case, is the same as our earlier computation using foreign keys.Size Estimation for Other Operations•Projection: estimated size of ∏A(r) = V(A,r)•Aggregation : estimated size of AgF(r) = V(A,r)•Set operations– For unions/intersections of selections on the same relation: rewrite and use size estimate for selections•E.g. σθ1 (r) ∪ σθ2 (r) can be rewritten as σθ1 σθ2 (r)–For operations on different relations:•estimated size of r ∪ s = size of r + size of s. •estimated size of r ∩ s = minimum size of r and size of s.•estimated size of r – s = r.•All the three estimates may be quite inaccurate, but provide upper bounds on the sizes.Transactions, concurrency, recovery•Until now we learned–how to design an efficient database–how to quickly answer queries•Next: how we ensure database is consistent:–equipment failures (disk, power, internet...)–concurrent accessesOverview•Transaction: A sequence of database actions enclosed within special tags•Properties:–Atomicity: Entire transaction or nothing–Consistency: Transaction, executed completely, takes database from one consistent state to another–Isolation: Concurrent transactions appear to run
View Full Document