Forms of RetrievalSlide 2Range Searchk-d TreesDefinition: k-d TreesBWB-CheckBOB-CheckExampleUsing Decision Trees as IndexProperties of Retrieval with Indexed CasesForms of Retrieval•Sequential Retrieval•Two-Step Retrieval•Retrieval with Indexed CasesRetrieval with Indexed CasesSources:–Textbook, Chapter 7–Davenport & Prusack’s book on Advanced Data Structures–Samet’s book on Data StructuresRange SearchRed light on? YesBeeping? Yes…Transistor burned!Space of known problemsk-d Trees•Idea: Partition of the case base in smaller fragments•Representation of a k-dimensional space in a binary tree•Similar to a decision tree: comparison with nodes•During retrieval:Search for a leaf, butUnlike decision trees backtracking may occurDefinition: k-d Trees•Given:K types: T1, …, Tk for the attributes A1, …, AkA case base CB containing cases in T1 … Tk A parameter b (size of bucket)•A K-D tree T(CB) for a case base CB is a binary tree defined as follows:If |CB| < b then T(CB) is a leaf node (a bucket)Else T(CB) defines a tree such that:The root is marked with an attribute Ai and a value v in Ai andThe 2 k-d trees T({c CB: c.i-attribute < v}) and T({c CB: c.i-attribute v}) are the left and right subtrees of the rootBWB-Check•Ball-With in-Bounds check:Suppose that algorithm reaches a leave node M (with at most b cases) while searching for the most similar case to PLet c be a case in B such that dist(c,P) is the smallestThen c is a candidate NN for P For each boundary B of M, dist(P,B) > dist(c,P) then c is the NNBut if for any boundary B of M, if dist(P,B) < dist(c,P) then the algorithm needs to backtrack and check if in the regions of B, there is a better candidateFor computing distance, simply use: f-1 be the inverse of the distance-similarity compatible function:distance(P,C) = f-1(sim(P,C))BOB-Check•Ball-Out of-Bounds check:Used during backtrackingChecks if for the boundary B defined in the node: dist(P,B) < dist(c,P) Where c is our current candidate for best case (e.g., the closest case to P in the initial bucket)If the condition is true, The algorithm needs to check if in those boundary’s regions, there is a better candidateExample(0,0)(0,100)(25,35)Omaha(5,45)Denver(35,40)Chicago(50,10)Mobile(90,5)MiamiAtlanta (85,15)(80,65)Buffalo(60,75)Toronto(100,0)A1<3535DenverOmahaA2<4040A1<8585MobileAtlantaMiamiA1<6060ChicagoTorontoBuffaloNotes:•Priority lists are used for computing kNN P(32,45)Using Decision Trees as IndexAiv1v2…vnStandard Decision TreeAiv1v2…vnVariant: InReCA TreeunknownCan be combined with numeric attributesAiv1>v1v2…>vnunknownNotes:•Supports Hamming distance•May require backtracking (using BOB-check) Operates in a similar fashion as k-d trees•Priority lists are used for computing kNNProperties of Retrieval with Indexed Cases•Advantage:•Disadvantages:Efficient retrievalIncremental: don’t need to rebuild index again every time a new case is entered-error does not occurCost of construction is highOnly work for monotonic similarity
View Full Document