CS466-8 1
Vector Models for IR
• Gerald Salton, Cornell
  (Salton + Lesk, 68) (Salton, 71) (Salton + McGill, 83)
• SMART System
  Chris Buckley, Cornell / SAPIR systems
  Current keeper of the flame
  Salton's Magical Automatic Retrieval Tool (?)

CS466-8 2
Vector Models for IR
Boolean Model
  Doc V1: 0 0 0 0 0 1 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0
  Doc V2: 0 0 0 0 0 1 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0
SMART Vector Model
  Doc V1: 1.0 3.5 4.6 0.1 0.0 0.0
  Doc V2: 0.0 0.0 0.0 0.1 4.0 0.0
Term i: Word, Stem, Special compounds
SMART vectors are composed of real-valued term weights, NOT simply Boolean term present or not.

CS466-8 3
Example
Term labels: Comput*, C++, Sparc, genome, biolog*, protein, Compiler, DNA
  Doc V1: 3 5 4 1 0 1 0 0
  Doc V2: 1 0 0 0 5 3 1 4
  Doc V3: 2 8 0 1 0 1 0 0
Issues
• How are weights determined? (simple option: raw freq. weighted by region, titles, keywords)
• Which terms to include? Stoplists
• Stem or not?

CS466-8 4
Queries and Documents share same vector representation
[Figure: document vectors D1, D2, D3 and query vector Q]
Given query D_Q, map it to vector V_Q and find the document D_i for which sim(V_i, V_Q) is greatest.

CS466-8 5
Similarity Functions
• Many other options available (Dice, Jaccard)
• Cosine similarity is self-normalizing
[Figure: vectors D2, D3 and Q]
  V1: 100 200 300 50
  V2: 1 2 3 0.5
  V3: 10 20 30 5
Can use arbitrary integer values (don't need to be probabilities)

CS466-8 6
Projection of Vectors into 2-D Plane
[Figure: document vectors V1 to V10 with centroids C1 and C2]

CS466-8 7
Centroid computation:
$$C_t = \frac{1}{|D|} \sum_{d=1}^{|D|} V_{d,t} \quad \forall\, t \in \text{term set}$$
D = documents in centroid set
|D| = total docs in centroid set
Basically, the average of the vectors in the centroid set.
[Figure: centroids C1 and C2]

CS466-8 8
Hierarchical Search with Document Centroids
[Figure: centroid tree over document vectors V1 to V10]

CS466-8 9
Hierarchical Query Matching
VQ = Query Vector
Ci = Root Centroid
For all children of Ci {Cj}
• find Cj : sim(VQ, Cj) is maximum
• if Cj is a leaf (document vector), return Cj
• else Ci = Cj and iterate
log(|D|) vector comparisons (height of tree)

CS466-8 10
Ideal Clustering Behavior

CS466-8 11
Sample Clustered Document Collection
[Figure legend: document vector, centroid vector]

CS466-8 12
Ideal Document Space
[Figure legend: relevant document with respect to a query vector, nonrelevant document with respect to a query]

CS466-8 13
[Figure legend: document vector, centroid vector, supercentroid vector]
Introduction of
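
The cosine similarity, centroid computation, and hierarchical query matching described on the slides can be sketched together in a few lines of Python. This is a minimal illustration under stated assumptions, not the SMART implementation: the `Node` class, the helper names, the grouping of V1 and V3 into one cluster with V2 in another, and the query vector are all hypothetical choices made for the example. Only the three document vectors come from the Example slide.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Term-by-term average of the vectors in the centroid set."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


class Node:
    """Hypothetical tree node: a leaf holds a document vector,
    an inner node holds the centroid of the documents beneath it."""

    def __init__(self, vector, children=(), label=""):
        self.vector = vector
        self.children = list(children)
        self.label = label


def leaves(node):
    """All document (leaf) nodes beneath a node."""
    if not node.children:
        return [node]
    found = []
    for child in node.children:
        found.extend(leaves(child))
    return found


def cluster(children, label):
    """Build an inner node whose vector averages every document below it."""
    docs = [leaf.vector for child in children for leaf in leaves(child)]
    return Node(centroid(docs), children, label)


def hierarchical_match(root, query):
    """Descend from the root, following the most similar child each time."""
    node = root
    while node.children:
        node = max(node.children, key=lambda c: cosine(query, c.vector))
    return node


# Document vectors from the Example slide; the clustering is assumed.
v1 = Node([3, 5, 4, 1, 0, 1, 0, 0], label="V1")
v2 = Node([1, 0, 0, 0, 5, 3, 1, 4], label="V2")
v3 = Node([2, 8, 0, 1, 0, 1, 0, 0], label="V3")
c1 = cluster([v1, v3], "C1")
c2 = cluster([v2], "C2")
root = cluster([c1, c2], "root")

query = [1, 1, 0, 0, 0, 0, 0, 0]  # hypothetical query vector
best = hierarchical_match(root, query)
print(best.label)  # prints V3
```

The same `cosine` function also shows why cosine similarity is called self-normalizing: the vectors (100, 200, 300, 50), (1, 2, 3, 0.5) and (10, 20, 30, 5) from the Similarity Functions slide point in the same direction, so every pair scores 1.0 up to floating-point rounding, regardless of magnitude.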