CSC 9010: Search Engines
Google
Dr. Paula Matuszek
[email protected]
(610) 270-6851
©2003 Paula Matuszek

Outline
– Search Engine Basics
– Selecting Relevant Documents
– Extracting Lexical Features
– Lexical Analyser
– Design Issues for Lexical Analyser
– Slide 7
– Stemming
– Stemming -- How?
– Noise Words (Stop Words)
– Example Corpora
– Structured Attributes for Medline
– Textual Fields for Medline
– Structured Fields for Email
– Text Fields for Email
– Indexing
– Basic Indexing Algorithm
– Fine Points
– Choosing Keywords
– Manually Choosing Keywords
– Examples of Constrained Vocabularies
– Automated Vocabulary Selection
– Slide 23
– Keyword Choice for WWW
– Comparing and Ranking Documents
– Determining Relevance by Keyword
– Keywords for Relevance Ranking
– Slide 28
– Comparing Documents
– Characterizing a Document: Term Frequency
– Characterizing a Document: Document Frequency
– TF*IDF
– Describing an Entire Document
– Vector Space
– Similarity Between Documents
– Bag of Words
– GOOGLE
– Google Goals
– Determining Relevance
– Page Rank
– Simple PageRank
– Page Rank, cont.
– Google Architecture
– Architecture: Modules
– Architecture: Data Stores
– GOOGLE API
– Web Page Freshness
– Experimental Setup
– Average Change Interval
– Change Interval – By Domain
– Refresh Strategy
– Proportional Often Not Good!
– Comparing Policies
– Not Every Page is Equal!
– Allocating Freshness Resources
– Additional Search Issues

Search Engine Basics
A spider or crawler starts at a web page, identifies all links on it, and follows them to new web pages.
A parser processes each web page and extracts individual words.
An indexer creates/updates a hash table which connects words with documents.
A searcher uses the hash table to retrieve documents based on words.
A ranking system decides the order in which to present the documents: their relevance.

Selecting Relevant Documents
Assume:
– we already have a corpus of documents defined.
– goal is to return a subset of those documents.
– individual documents have been separated into individual files.
Remaining components must parse, index, find, and rank documents.
The traditional approach is based on the words in the documents (it predates the web).

Extracting Lexical Features
Process a string of characters:
– assemble characters into tokens (tokenizer)
– choose tokens to index
In place (a problem for the WWW).
Standard lexical analysis problem; use a lexical analyser generator, such as lex.

Lexical Analyser
Basic idea is a finite state machine.
Triples of input state, transition token, output state.
Must be very efficient; gets used a LOT.
[Figure: a three-state machine reconstructed from the slide. State 0 loops on blank; A-Z moves to state 1, which loops on A-Z; blank or EOF moves from state 1 to state 2, ending the token.]

Design Issues for Lexical Analyser
Punctuation:
– treat as whitespace?
– treat as characters?
– treat specially?
Case:
– fold?
Digits:
– assemble into numbers?
– treat as characters?
– treat as punctuation?

Lexical Analyser
The output of the lexical analyser is a string of tokens.
Remaining operations are all on these tokens.
We have already thrown away some information; this makes search more efficient, but limits its power.

Stemming
Additional processing at the token level.
Turn words into a canonical form:
– "cars" into "car"
– "children" into "child"
– "walked" into "walk"
Decreases the total number of different tokens to be processed.
Decreases the precision of a search, but increases its recall.

Stemming -- How?
Plurals to singulars (e.g., children to child).
Verbs to infinitives (e.g., talked to talk).
Clearly non-trivial in English!
Typical stemmers use a context-sensitive transformation grammar:
– (.*)SSES -> \1SS
50-250 rules are typical.

Noise Words (Stop Words)
Function words that contribute little or nothing to meaning.
Very frequent words:
– If a word occurs in every document, it is not useful in choosing among documents.
– However, need to be careful, because this is corpus-dependent.
Often implemented as a discrete list.
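The pipeline described above (a state-machine tokenizer, a stemming rule, stop-word removal) can be sketched in Python. This is a minimal illustration, not a real implementation: the single `(.*)SSES -> \1SS` rule stands in for a full 50-250 rule stemmer, and the tiny stop list stands in for a corpus-tuned one.

```python
import re

# Illustrative stop list; a real one is corpus-dependent.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "is"}

def tokenize(text):
    """Scanner in the spirit of the slide's finite state machine:
    outside a token vs. inside a token, with case folded and anything
    that is not a letter treated as a blank."""
    tokens, current, in_token = [], [], False
    for ch in text.lower() + " ":        # trailing blank plays the role of EOF
        if ch.isalpha():                 # A-Z: stay in (or enter) the token state
            current.append(ch)
            in_token = True
        elif in_token:                   # blank/punctuation ends the current token
            tokens.append("".join(current))
            current, in_token = [], False
    return tokens

def stem(token):
    """One context-sensitive rule from the slide: (.*)SSES -> \\1SS."""
    return re.sub(r"(.*)sses$", r"\1ss", token)

def index_terms(text):
    """Tokenize, drop stop words, then stem what remains."""
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]

print(index_terms("The caresses of a child"))   # -> ['caress', 'child']
```

Note the trade-off the slides describe: folding case and stemming shrink the token set and raise recall, but queries can no longer distinguish "caress" from "caresses".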
Example Corpora
We are assuming a fixed corpus. Some sample corpora:
– Medline abstracts
– Email (anyone's email)
– Reuters corpus
– Brown corpus
Textual fields vs. structured attributes:
– Textual: free, unformatted, no meta-information
– Structured: additional information beyond the content

Structured Attributes for Medline
PubMed ID
Author
Year
Keywords
Journal

Textual Fields for Medline
Abstract:
– Reasonably complete standard academic English
– Captures the basic meaning of the document
Title:
– Short, formalized
– Captures the most critical part of the meaning
– Proxy for abstract

Structured Fields for Email
To, From, Cc, Bcc
Dates
Content type
Status
Content length
Subject (partially)

Text Fields for Email
Subject:
– Format is structured, content is arbitrary.
– Captures the most critical part of the content.
– Proxy for content -- but may be inaccurate.
Body of email:
– Highly irregular, informal English.
– Entire document, not a summary.
– Spelling and grammar irregularities.
– Structure and length vary.

Indexing
We have a tokenized, stemmed sequence of words.
The next step is to parse each document, extracting index terms.
– Assume that each token is a word and we don't want to recognize any structures more complex than single words.
When all documents are processed, create the index.

Basic Indexing Algorithm
For each document in the corpus:
– get the next token
– save the posting in a list: doc ID, frequency
For each token found in the corpus:
– calculate the number of documents and the total frequency
– sort by frequency

Fine Points
Dynamic corpora require incremental algorithms.
Higher-resolution data (e.g., character position).
Giving extra weight to proxy text (typically by doubling or tripling its frequency count).
Document-type-specific processing:
– In HTML, want to ignore tags.
– In email, may want to ignore quoted material.

Choosing Keywords
Don't necessarily want to index on every word:
– Takes more space for the index
– Takes more processing time
– May not improve our resolving power
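The basic indexing algorithm from the earlier slide can be sketched as follows. The corpus is assumed to be a dict mapping document IDs to already tokenized, stemmed word lists; all names here are illustrative, not from the slides.

```python
from collections import Counter, defaultdict

def build_index(corpus):
    """corpus: {doc_id: [token, ...]}.
    Returns an inverted index {token: [(doc_id, frequency), ...]}."""
    postings = defaultdict(list)
    for doc_id, tokens in corpus.items():
        # One posting per (document, token) pair: doc ID plus frequency.
        for token, freq in Counter(tokens).items():
            postings[token].append((doc_id, freq))
    return dict(postings)

def corpus_stats(index):
    """For each token: number of documents and total frequency,
    sorted by total frequency, as in the second loop of the algorithm."""
    stats = [(tok, len(posts), sum(f for _, f in posts))
             for tok, posts in index.items()]
    return sorted(stats, key=lambda s: s[2], reverse=True)

index = build_index({
    "d1": ["search", "engine", "search"],
    "d2": ["search", "index"],
})
print(index["search"])       # -> [('d1', 2), ('d2', 1)]
print(corpus_stats(index))   # -> [('search', 2, 3), ('engine', 1, 1), ('index', 1, 1)]
```

Note this is the batch version; as the Fine Points slide says, a dynamic corpus would need an incremental variant rather than rebuilding the postings from scratch.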
How do we choose keywords?
– Manually
– Statistically
Exhaustivity vs. specificity.

Manually Choosing Keywords
Unconstrained vocabulary: allow the creator of the document to choose whatever he/she wants.
– "best" match
– captures new terms easily
– easiest for person