Johns Hopkins EN 600 446 - Issues/Parameters in Vector Model

CS466-9 1 - Issues/Parameters in Vector Model
1. Term weighting
2. Term selection (special case of term weighting: stop words = words with weight 0)
3. Vector similarity functions (Dice, Jaccard, Cosine)
4. Clustering approach (agglomerative hierarchical clustering)

CS466-9 2 - Term Weighting Strategies
• Boolean weighting
  $$\mathrm{Weight}_{t,d} = 1$$ if term t is present in document d, $$0$$ if term t is NOT present in document d
• Term weight ∝ term frequency
  $$\mathrm{Weight}_{t,d} = \mathrm{Freq}_{t,d}$$ (t = term, d = document; raw frequency of term in document)
• Normalized term frequency
  $$\mathrm{Weight}_{t,d} = \frac{\mathrm{Freq}_{t,d}}{\mathrm{Freq}_{t,\mathrm{corpus}}}$$ (term frequency normalized by overall corpus frequency)

CS466-9 3 - Term Weighting Strategies
• TF-IDF
• TF log IDF → "TF-IDF"
  $$\mathrm{Weight}_{t,d} = \mathrm{Freq}_{t,d} \times \frac{1}{\mathrm{DFreq}_t}$$
  where $\mathrm{DFreq}_t$ = total # of documents in which unique term t occurs
  $$\mathrm{Weight}_{t,d} = \mathrm{Freq}_{t,d} \times \log_2 \frac{N}{\mathrm{DFreq}_t}$$
  Term Frequency (frequency of term in documents) × Inverse Document Frequency (N = # of doc. in the corpus, $\mathrm{DFreq}_t$ = # of doc. with term t)
  TF × IDF

CS466-9 4 - Term Selection/Weighting
What makes a good term?
[Figure: freq. of term in doc. vs. doc. #]
Poor terms:
• High freq. function words (in all documents, e.g. the, in, of, for)
• Low freq. function words (e.g. certainly)

CS466-9 5 - Localized, but Not too Infrequent
[Figure: freq. of term in doc. vs. doc. #; example term = 183]
Poor signal/noise ratio

CS466-9 6 - Document Internal Weighting
"Genome" – 20 times in document more indicative than 10 times? than 2 times?
Question the assumption that $\mathrm{Weight}_{t,d} \propto \mathrm{Freq}_{t,d}$ ??
[Figure: indicativeness vs. # of times (unit length)]

CS466-9 7 - Better Terms
[Figure: freq. of term in doc. vs. doc. #]
Terms like "genome", "cytochrome-c", "Plasmasis"
Localized to subset of documents → presence of term "indicative" of documents

CS466-9 8 - Stoplists
• Human intuition of which terms are bad → excluded from vector

CS466-9 9 - Similarity Functions/Measures
Terms: Comput*, C++, Sparc, genome, bilog*, protein, Compiler, DNA
Doc V1: 3 5 4 1 0 1 0 0
Doc V2: 1 0 0 0 5 3 1 4
Doc V3: 2 8 0 1 0 1 0 0
$$\mathrm{cos\_sim}(V_i, V_j) = \frac{\sum_{t=1}^{T} w_{t,i}\, w_{t,j}}{\sqrt{\sum_{t=1}^{T} w_{t,i}^2}\ \sqrt{\sum_{t=1}^{T} w_{t,j}^2}}$$
Numerator: sum over all terms in document; $w_{t,j}$ = weight of term t in document j. Denominator: normalizing factor.

CS466-9 10 - Region Weighting
• Title
• Keywords
• Abstract
• Section Heads
• Body Text
  – 1st page
  – 30th page
• Footnotes
Should words in each of these regions be weighted equally?
$$W_{t,d} = RW_R \cdot TF_{t,d} \cdot (IDF)$$
$RW_R$ = multiplicative weighting factor depending on the region the word appears in, e.g.
3.0 Keywords
2.0 Title
0.8 Body Text

CS466-9 11 - Relevance Weighting
$$W_{t,d} = F_{t,d} \cdot \mathrm{TermRel}_t, \qquad \mathrm{TermRel}_t = \frac{r_t / (R - r_t)}{s_t / (I - s_t)}$$
$F_{t,d}$ = raw term freq. (TF)
$r_t$ = # of relevant documents with term t
$R$ = # of relevant documents in corpus
$s_t$ = # of irrelevant documents with term t
$I$ = # of irrelevant documents in corpus
Theoretically optimal if you know relevance

CS466-9 12 - Type of Document
TF weight [Croft, '83]:
$$\mathrm{Cfreq}_{t,d} = K + (1 - K)\,\frac{\mathrm{freq}_{t,d}}{\mathrm{maxfreq}_d}$$ if term t in d
(for titles) K = 1 → boolean weighting
(for full text) K = 0 → similar to $\mathrm{Freq}_{t,d}$
(Title vs. Abstract vs. Paper vs. Query)

CS466-9 13 - Document Internal Term Weighting
$$\mathrm{Nfreq}_{t,d} = \frac{\log_2(\mathrm{Freq}_{t,d} + 1)}{\log_2(\mathrm{length}(d))}$$
Use instead of $\mathrm{Freq}_{t,d}$ in TF-IDF [Harman '86]

CS466-9 14 - Compound Identification
Salton + McGill (1983) – cohesion measure
$$\mathrm{cohesion}(t_i, t_j) = \text{small factor} \times \frac{\mathrm{co\_occurrence\_freq}(t_i, t_j)}{\mathrm{total\_freq}(t_i) \times \mathrm{total\_freq}(t_j)}$$
Measure is similar to: Mutual Information
Examples: "venetian blind", "blind venetian", "school library", "library school", "water spaniel", "Hong"
Compounding may increase or decrease vocabulary size
Collocation extraction
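
As a rough illustration of how the TF-IDF weighting (Freq × log2(N / document frequency)) and the cosine similarity from the slides fit together, here is a minimal Python sketch using the three example vectors Doc V1, Doc V2, Doc V3. The function names `tf_idf` and `cosine_sim` and the list-of-counts representation are assumptions made for this example, not part of the course material.

```python
import math

# Raw term counts for the three example documents from the slides.
V1 = [3, 5, 4, 1, 0, 1, 0, 0]
V2 = [1, 0, 0, 0, 5, 3, 1, 4]
V3 = [2, 8, 0, 1, 0, 1, 0, 0]


def tf_idf(docs):
    """Reweight raw counts as Freq[t,d] * log2(N / DFreq[t]).

    N is the number of documents; DFreq[t] is the number of documents
    that contain term t at least once (hypothetical helper for this sketch).
    """
    n_docs = len(docs)
    n_terms = len(docs[0])
    doc_freq = [sum(1 for d in docs if d[t] > 0) for t in range(n_terms)]
    return [
        [
            d[t] * math.log2(n_docs / doc_freq[t]) if doc_freq[t] > 0 else 0.0
            for t in range(n_terms)
        ]
        for d in docs
    ]


def cosine_sim(vi, vj):
    """Dot product of the two vectors divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(vi, vj))
    norm = math.sqrt(sum(a * a for a in vi)) * math.sqrt(sum(b * b for b in vj))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # Similarity on raw counts: V1 is far closer to V3 than to V2.
    print(round(cosine_sim(V1, V2), 3))  # 6 / 52, about 0.115
    print(round(cosine_sim(V1, V3), 3))  # 48 / sqrt(52 * 70), about 0.796

    # Same comparison after TF-IDF reweighting.
    w1, w2, w3 = tf_idf([V1, V2, V3])
    print(cosine_sim(w1, w2), cosine_sim(w1, w3))
```

Note that with only three documents, a term that occurs in all of them (the first column here) gets weight 0 under this scheme, which is the stop-word effect described on the first slide.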

