Johns Hopkins EN 600 446 - Vector Models for IR (13 pages)


Vector Models for IR




Lecture Notes


Pages: 13
School: Johns Hopkins University
Course: EN 600 446 - Computer Integrated Surgery II


CS466 8-1: Vector Models for IR

Gerald Salton, Cornell (Salton & Lesk 68; Salton 71; Salton & McGill 83)
SMART System
Chris Buckley, Cornell; SAPIR systems. Current keeper of the flame.
SMART: Salton's Magical Automatic Retrieval Tool

CS466 8-2: Vector Models for IR

Boolean Model
Doc V1: 0 0 0 0 0 1 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0
Doc V2: 0 0 0 0 0 1 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0

SMART Vector Model
Term_i: word stem, special compounds
Doc V1: 1 0 3 5 4 6 0 1 0 0 0 0
Doc V2: 0 0 0 0 0 0 0 1 4 0 0 0

SMART vectors are composed of real-valued term weights, NOT simply Boolean "term present or not".

CS466 8-3: Example

Terms: DNA, Compiler, Comput, C, Sparc, genome, bilog, protein
Doc V1: 3 5 4 1 0 1 0 0
Doc V2: 1 0 0 0 5 3 1 4
Doc V3: 2 8 0 1 0 1 0 0

Issues:
- How are weights determined? (Simple option: raw frequency, weighted by region such as titles and keywords.)
- Which terms to include? Stoplists.
- Stem or not?

CS466 8-4: Queries and Documents share same vector representation

[Figure: document vectors D1, D2, D3 and query vector Q]

Given query D_Q, map it to a vector V_Q and find the document D_i for which sim(V_i, V_Q) is greatest.

CS466 8-5: Similarity Functions

Many other options are available (Dice, Jaccard).
Cosine similarity is self-normalizing:
V1 = <100 200 300 50>
V2 = <1 2 3 0.5>
V3 = <10 20 30 5>

[Figure: vectors D2, D3 and Q]

Can use arbitrary integer values (they don't need to be probabilities).

CS466 8-6: Projection of Vectors into 2-D Plane

[Figure: document vectors V1 to V10 grouped around centroids C1 and C2]

CS466 8-7: Centroid computation

[Figure: centroids C1 and C2]

Basically the average of the vectors in the centroid set:

$C[t] = \frac{1}{|D|} \sum_{d=1}^{|D|} V_d[t]$

t = term
D = set of documents in the centroid set
|D| = total docs in the centroid set

CS466 8-8: Hierarchical Search with Document Centroids

[Figure: centroid tree over document vectors V1 to V10]

CS466 8-9: Hierarchical Query Matching

V_Q = query vector
C_i = root centroid
For all children C_j of C_i:
    find the C_j for which sim(V_Q, C_j) is maximum
    if C_j is a leaf (document vector), return C_j
    else set C_i = C_j and iterate

log |D| vector comparisons (height of tree)

CS466 8-10: Ideal Clustering Behavior

[Figure]

CS466 8-11: Sample Clustered Document Collection

[Figure; legend: document vector, centroid vector]

CS466 8: Ideal Document Space

[Figure; legend: relevant document with respect to a
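The Similarity Functions slide names cosine similarity without showing a formula in this preview. The following is a minimal sketch using the standard definition, cos(u, v) = (u . v) / (|u| |v|), not code from the lecture. The function names `cosine` and `best_match` and the sample query vector are hypothetical; the document vectors are the ones listed on the slides.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an all-zero vector matches nothing
    return dot / (norm_u * norm_v)

def best_match(query, docs):
    """Return the name of the document whose vector is most similar to the query."""
    return max(docs, key=lambda name: cosine(docs[name], query))

# Vectors from the Similarity Functions slide: same direction, different lengths.
v1 = [100, 200, 300, 50]
v2 = [1, 2, 3, 0.5]
v3 = [10, 20, 30, 5]

# Document vectors from the Example slide, with a made-up query vector.
docs = {
    "V1": [3, 5, 4, 1, 0, 1, 0, 0],
    "V2": [1, 0, 0, 0, 5, 3, 1, 4],
    "V3": [2, 8, 0, 1, 0, 1, 0, 0],
}
query = [0, 0, 0, 0, 1, 1, 0, 1]  # hypothetical query, not from the slides
```

Because the cosine divides by both vector lengths, `cosine(v1, v2)` and `cosine(v1, v3)` come out equal (up to rounding) even though the raw weights differ by factors of 100 and 10. This is the sense in which the measure is self-normalizing.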

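The centroid computation and hierarchical query matching described in the slides can be sketched as below. This is an illustration under assumptions, not the SMART implementation: the `Node` class, the helper names, and the hand-built two-cluster tree are hypothetical, and cosine similarity is assumed as the `sim` function.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def centroid(vectors):
    """Per-term average of the vectors in the centroid set."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

class Node:
    """Tree node: a leaf holds a document vector, an internal node holds a centroid."""
    def __init__(self, vector, children=None):
        self.vector = vector
        self.children = children or []

def cluster(doc_vectors):
    """Build an internal node whose vector is the centroid of its leaf documents."""
    return Node(centroid(doc_vectors), [Node(v) for v in doc_vectors])

def hierarchical_match(root, query):
    """Descend from the root, always following the child most similar to the query."""
    node = root
    while node.children:
        node = max(node.children, key=lambda child: cosine(child.vector, query))
    return node.vector  # leaf reached: a document vector

# Hand-built tree over the three example document vectors (grouping is an assumption).
d1 = [3, 5, 4, 1, 0, 1, 0, 0]
d2 = [1, 0, 0, 0, 5, 3, 1, 4]
d3 = [2, 8, 0, 1, 0, 1, 0, 0]
c1 = cluster([d1, d3])
c2 = cluster([d2])
root = Node(centroid([d1, d2, d3]), [c1, c2])

query = [0, 0, 0, 0, 1, 1, 0, 1]  # hypothetical query vector
result = hierarchical_match(root, query)
```

The search compares the query only against the children of one node per level, so the number of levels visited equals the height of the tree. The greedy descent can miss the globally best document when a centroid represents its cluster poorly.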

