USC CSCI 599 - DM3_jkns2000-12pst

The VLDB Journal (2000) 9: ♣–♣

One-dimensional and multi-dimensional substring selectivity estimation

H.V. Jagadish (1), Olga Kapitskaia (2), Raymond T. Ng (3), Divesh Srivastava (4)

(1) University of Michigan, Ann Arbor; E-mail: [email protected]
(2) Pôle Universitaire Léonard de Vinci; E-mail: [email protected]
(3) University of British Columbia; E-mail: [email protected]
(4) AT&T Labs – Research, 180 Park Avenue, Bldg 103, Florham Park, NJ 07932, USA; E-mail: [email protected]

Edited by M.P. Atkinson. Received: April 28, 2000 / Accepted: July 11, 2000

Abstract. With the increasing importance of XML, LDAP directories, and text-based information sources on the Internet, there is an ever-greater need to evaluate queries involving (sub)string matching. In many cases, matches need to be on multiple attributes/dimensions, with correlations between the multiple dimensions. Effective query optimization in this context requires good selectivity estimates. In this paper, we use pruned count-suffix trees (PSTs) as the basic data structure for substring selectivity estimation. For the 1-D problem, we present a novel technique called MO (Maximal Overlap). We then develop and analyze two 1-D estimation algorithms, MOC and MOLC, based on MO and a constraint-based characterization of all possible completions of a given PST. For the k-D problem, we first generalize PSTs to multiple dimensions and develop a space- and time-efficient probabilistic algorithm to construct k-D PSTs directly. We then show how to extend MO to multiple dimensions. Finally, we demonstrate, both analytically and experimentally, that MO is both practical and substantially superior to competing algorithms.

Key words: String selectivity – Maximal overlap – Short memory property – Pruned count-suffix tree

1 Introduction

One often wishes to obtain a quick estimate of the number of times a particular substring occurs in a database. A traditional application is for optimizing SQL queries with the like predicate (e.g., name like %jones%).
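For concreteness, the selectivity of such a like predicate is simply the fraction of rows whose value contains the pattern as a substring. A brute-force computation over a hypothetical name column (feasible only offline, which is why optimizers need estimates) looks like this sketch:

```python
def exact_selectivity(values, pattern):
    """Fraction of rows matching value LIKE %pattern% (brute-force scan)."""
    return sum(pattern in v for v in values) / len(values)

# Hypothetical name column; real tables are far too large to scan
# at query-optimization time.
names = ["jones", "jonas", "smith", "joneson"]
print(exact_selectivity(names, "jones"))  # 0.5 (2 of 4 rows match)
```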
Such predicates are pervasive in data warehouse queries, because of the presence of "unclean" data [HS95]. With the growing importance of XML, LDAP directories, and other text-based information stores on the Internet, substring queries are becoming increasingly common.

Furthermore, in many situations with these applications, a query may specify substrings to be matched on multiple alphanumeric attributes or dimensions. The query [(name like %jones%) AND (tel like 973360%) AND (mail like %research.att.com)] is one example. Often the attributes mentioned in these kinds of multi-dimensional queries are correlated. In the above example, because of the geographical location of the research labs, people who satisfy the predicate (mail like %research.att.com) may have an unexpectedly high probability of satisfying the predicate (tel like 973360%). In such situations, assuming attribute independence and estimating the selectivity of the query as the product of the selectivities of the individual dimensions can lead to gross inaccuracy.

1.1 The data structure

A natural question that arises is which data structure to use for substring selectivity estimation. Histograms have long been used for selectivity estimation in databases (see, e.g., [SAC+79, MD88, LN90, Ioa93, IP95, PIHS96, JKM+98]). They have been designed to work well for numeric attribute value domains. For the string domain, one could continue to use such "value-range" histograms by sorting substrings in lexicographic order and computing the appropriate counts.
However, in this case, a histogram bucket that includes a range of consecutive lexicographic values is not likely to produce a good approximation, since the number of times a string occurs as a substring is likely to be very different for lexicographically successive substrings. As a result, we look for a different solution, one that is suitable for the string domain.

A commonly used data structure for indexing substrings in a database is the suffix tree [Wei73, McC76], which is a trie that satisfies the following property: whenever a string α is stored in the trie, then all suffixes of α are stored in the trie as well. Given a substring query, one can locate all the desired matches using the suffix tree. Krishnan et al. [KVI96] proposed an interesting variation of the suffix tree: the pruned count-suffix tree (PST), which maintains a count, Cα, for each substring α in the tree and retains only those substrings α (and their counts) for which Cα exceeds some pruning threshold. In this paper, following [KVI96], we use PSTs as the basic summary data structure for substring selectivity estimation.

To estimate substring selectivity in multiple dimensions, we need to generalize the PST to multiple dimensions. Here, only those k-D substrings (α1, ..., αk) for which the count exceeds the pruning threshold are maintained in the tree (along with their counts).

1.2 The problem

The substring selectivity estimation problem can be formally stated as follows:

Given a pruned count-suffix tree T, and a (1-D or k-D) substring query q, estimate the fraction Cq/N, where N is the count associated with the root of T.

The 1-D version of the above problem considers the situation when the pruned tree T is created for a single attribute (e.g., name).
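As a toy illustration of the structure just described (a naive sketch, not the construction from [KVI96] nor the algorithms developed later in the paper), one can insert every suffix of every string into a trie with per-node counts and then discard nodes whose count does not exceed the pruning threshold; here the root count N is taken to be the number of strings:

```python
class Node:
    """Trie node holding the occurrence count C_alpha of its substring."""
    def __init__(self):
        self.count = 0
        self.children = {}

def build_pst(strings, threshold):
    """Build a 1-D pruned count-suffix tree (naive O(n^2) insertion)."""
    root = Node()
    root.count = len(strings)          # N: the count at the root
    for s in strings:
        for i in range(len(s)):        # insert every suffix of s
            node = root
            for ch in s[i:]:
                node = node.children.setdefault(ch, Node())
                node.count += 1
    prune(root, threshold)
    return root

def prune(node, threshold):
    """Retain only substrings whose count exceeds the threshold."""
    for ch in list(node.children):
        child = node.children[ch]
        if child.count <= threshold:
            del node.children[ch]      # drops the whole subtree
        else:
            prune(child, threshold)

def lookup(root, q):
    """Return C_q if q is fully retained in the pruned tree, else None."""
    node = root
    for ch in q:
        if ch not in node.children:
            return None
        node = node.children[ch]
    return node.count
```

With threshold 1 and the strings "jones" and "jon", for instance, the substring "jon" (count 2) is retained while "jones" (count 1) is pruned away, so lookup("jones") must fall back to estimation. The sketch is 1-D only; the k-D tree instead stores tuples (α1, ..., αk) of substrings with their counts.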
The k-D version of the problem considers the case when T is set up for multiple attributes (e.g., name and tel).

What we gain in space by pruning a count-suffix tree, we lose in accuracy in the estimation of the selectivities of those strings that are not completely retained in the pruned tree. Our main challenge, then, is: given a pruned tree, to estimate as accurately as possible the selectivity of such strings.

1.3 Our contributions

We begin by describing the 1-D problem and its solution first (in Sects. 4–6), and then go on to generalize our results to multiple dimensions (in Sects. 7–10). Specifically, we make the following contributions:

– In Sect. 4, for the 1-D problem, we present a novel selectivity estimation algorithm MO (Maximal Overlap), which estimates the selectivity of the query string σ based on all maximal substrings, βi, of σ in the 1-D PST. We demonstrate that MO is provably better than KVI, the independence-based estimation technique developed in [KVI96] using a greedy parsing of σ, under the natural assumption that strings exhibit the so-called short memory property. We also
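The KVI baseline mentioned in that contribution can be sketched as follows (reconstructed from the description above, not from [KVI96] itself; the retained substrings of the pruned tree are modelled as a plain dict from substring to count):

```python
def kvi_estimate(counts, N, sigma):
    """Greedy parse of sigma into retained pieces; multiply C_beta / N."""
    estimate, i = 1.0, 0
    while i < len(sigma):
        count, end = None, i
        j = i + 1
        # greedily take the longest prefix of sigma[i:] in the summary
        while j <= len(sigma) and sigma[i:j] in counts:
            count, end = counts[sigma[i:j]], j
            j += 1
        if count is None:
            return 0.0          # not even one character is retained
        estimate *= count / N   # independence across the parsed pieces
        i = end
    return estimate

# Toy pruned summary (hypothetical counts), root count N = 100:
counts = {"j": 60, "jo": 50, "jon": 40, "e": 30, "es": 20, "s": 25}
kvi_estimate(counts, 100, "jones")  # (40/100) * (20/100), i.e. 0.08
```

MO differs in that it combines all maximal overlapping substrings βi of σ found in the tree, rather than a disjoint greedy parse, which is what makes it provably better under the short memory property.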

