CSC 9010: Search Engines
Google
Dr. Paula Matuszek
[email protected]
(610) 270-6851
©2003 Paula Matuszek

Outline
– Search Engine Basics
– Selecting Relevant Documents
– Extracting Lexical Features
– Lexical Analyser
– Design Issues for Lexical Analyser
– Slide 7
– Stemming
– Stemming -- How?
– Noise Words (Stop Words)
– Example Corpora
– Structured Attributes for Medline
– Textual Fields for Medline
– Structured Fields for Email
– Text Fields for Email
– Indexing
– Basic Indexing Algorithm
– Fine Points
– Choosing Keywords
– Manually Choosing Keywords
– Examples of Constrained Vocabularies
– Automated Vocabulary Selection
– Slide 23
– Keyword Choice for WWW
– Comparing and Ranking Documents
– Determining Relevance by Keyword
– Keywords for Relevance Ranking
– Slide 28
– Comparing Documents
– Characterizing a Document: Term Frequency
– Characterizing a Document: Document Frequency
– TF*IDF
– Describing an Entire Document
– Vector Space
– Similarity Between Documents
– Bag of Words
– GOOGLE
– Google Goals
– Determining Relevance
– Page Rank
– Simple PageRank
– Page Rank, cont.
– Google Architecture
– Architecture: Modules
– Architecture: Data Stores
– GOOGLE API
– Web Page Freshness
– Experimental Setup
– Average Change Interval
– Change Interval – By Domain
– Refresh Strategy
– Proportional Often Not Good!
– Comparing Policies
– Not Every Page is Equal!
– Allocating Freshness Resources
– Additional Search Issues

Search Engine Basics
A spider or crawler starts at a web page, identifies all links on it, and follows them to new web pages.
A parser processes each web page and extracts individual words.
An indexer creates/updates a hash table which connects words with documents.
A searcher uses the hash table to retrieve documents based on words.
A ranking system decides the order in which to present the documents: their relevance.

Selecting Relevant Documents
Assume:
– we already have a corpus of documents defined.
– goal is to return a subset of those documents.
– individual documents have been separated into individual files.
Remaining components must parse, index, find, and rank documents.
The traditional approach is based on the words in the documents (it predates the web).

Extracting Lexical Features
Process a string of characters:
– assemble characters into tokens (tokenizer)
– choose tokens to index
In place (a problem for the WWW).
Standard lexical analysis problem; use a lexical analyser generator, such as lex.

Lexical Analyser
Basic idea is a finite state machine.
Triples of input state, transition token, output state.
Must be very efficient; gets used a LOT.
[Figure: a three-state machine reconstructed from the slide. State 0 loops on blank; A-Z moves to state 1, which loops on A-Z; blank or EOF moves from state 1 to state 2, ending the token.]

Design Issues for Lexical Analyser
Punctuation:
– treat as whitespace?
– treat as characters?
– treat specially?
Case:
– fold?
Digits:
– assemble into numbers?
– treat as characters?
– treat as punctuation?

Lexical Analyser
The output of the lexical analyser is a string of tokens.
Remaining operations are all on these tokens.
We have already thrown away some information; this makes search more efficient, but limits its power.

Stemming
Additional processing at the token level.
Turn words into a canonical form:
– "cars" into "car"
– "children" into "child"
– "walked" into "walk"
Decreases the total number of different tokens to be processed.
Decreases the precision of a search, but increases its recall.

Stemming -- How?
Plurals to singulars (e.g., children to child).
Verbs to infinitives (e.g., talked to talk).
Clearly non-trivial in English!
Typical stemmers use a context-sensitive transformation grammar:
– (.*)SSES -> \1SS
50-250 rules are typical.

Noise Words (Stop Words)
Function words that contribute little or nothing to meaning.
Very frequent words:
– If a word occurs in every document, it is not useful in choosing among documents.
– However, need to be careful, because this is corpus-dependent.
Often implemented as a discrete list.
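The pipeline described above (a state-machine tokenizer, a stemming rule, stop-word removal) can be sketched in Python. This is a minimal illustration, not a real implementation: the single `(.*)SSES -> \1SS` rule stands in for a full 50-250 rule stemmer, and the tiny stop list stands in for a corpus-tuned one.

```python
import re

# Illustrative stop list; a real one is corpus-dependent.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "is"}

def tokenize(text):
    """Scanner in the spirit of the slide's finite state machine:
    outside a token vs. inside a token, with case folded and anything
    that is not a letter treated as a blank."""
    tokens, current, in_token = [], [], False
    for ch in text.lower() + " ":        # trailing blank plays the role of EOF
        if ch.isalpha():                 # A-Z: stay in (or enter) the token state
            current.append(ch)
            in_token = True
        elif in_token:                   # blank/punctuation ends the current token
            tokens.append("".join(current))
            current, in_token = [], False
    return tokens

def stem(token):
    """One context-sensitive rule from the slide: (.*)SSES -> \\1SS."""
    return re.sub(r"(.*)sses$", r"\1ss", token)

def index_terms(text):
    """Tokenize, drop stop words, then stem what remains."""
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]

print(index_terms("The caresses of a child"))   # -> ['caress', 'child']
```

Note the trade-off the slides describe: folding case and stemming shrink the token set and raise recall, but queries can no longer distinguish "caress" from "caresses".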
Example Corpora
We are assuming a fixed corpus. Some sample corpora:
– Medline abstracts
– Email (anyone's email)
– Reuters corpus
– Brown corpus
Textual fields vs. structured attributes:
– Textual: free, unformatted, no meta-information
– Structured: additional information beyond the content

Structured Attributes for Medline
PubMed ID
Author
Year
Keywords
Journal

Textual Fields for Medline
Abstract:
– Reasonably complete standard academic English
– Captures the basic meaning of the document
Title:
– Short, formalized
– Captures the most critical part of the meaning
– Proxy for abstract

Structured Fields for Email
To, From, Cc, Bcc
Dates
Content type
Status
Content length
Subject (partially)

Text Fields for Email
Subject:
– Format is structured, content is arbitrary.
– Captures the most critical part of the content.
– Proxy for content -- but may be inaccurate.
Body of email:
– Highly irregular, informal English.
– Entire document, not a summary.
– Spelling and grammar irregularities.
– Structure and length vary.

Indexing
We have a tokenized, stemmed sequence of words.
The next step is to parse each document, extracting index terms.
– Assume that each token is a word and we don't want to recognize any structures more complex than single words.
When all documents are processed, create the index.

Basic Indexing Algorithm
For each document in the corpus:
– get the next token
– save the posting in a list: doc ID, frequency
For each token found in the corpus:
– calculate the number of documents and the total frequency
– sort by frequency

Fine Points
Dynamic corpora require incremental algorithms.
Higher-resolution data (e.g., character position).
Giving extra weight to proxy text (typically by doubling or tripling its frequency count).
Document-type-specific processing:
– In HTML, want to ignore tags.
– In email, may want to ignore quoted material.

Choosing Keywords
Don't necessarily want to index on every word:
– Takes more space for the index
– Takes more processing time
– May not improve our resolving power
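The basic indexing algorithm from the earlier slide can be sketched as follows. The corpus is assumed to be a dict mapping document IDs to already tokenized, stemmed word lists; all names here are illustrative, not from the slides.

```python
from collections import Counter, defaultdict

def build_index(corpus):
    """corpus: {doc_id: [token, ...]}.
    Returns an inverted index {token: [(doc_id, frequency), ...]}."""
    postings = defaultdict(list)
    for doc_id, tokens in corpus.items():
        # One posting per (document, token) pair: doc ID plus frequency.
        for token, freq in Counter(tokens).items():
            postings[token].append((doc_id, freq))
    return dict(postings)

def corpus_stats(index):
    """For each token: number of documents and total frequency,
    sorted by total frequency, as in the second loop of the algorithm."""
    stats = [(tok, len(posts), sum(f for _, f in posts))
             for tok, posts in index.items()]
    return sorted(stats, key=lambda s: s[2], reverse=True)

index = build_index({
    "d1": ["search", "engine", "search"],
    "d2": ["search", "index"],
})
print(index["search"])       # -> [('d1', 2), ('d2', 1)]
print(corpus_stats(index))   # -> [('search', 2, 3), ('engine', 1, 1), ('index', 1, 1)]
```

Note this is the batch version; as the Fine Points slide says, a dynamic corpus would need an incremental variant rather than rebuilding the postings from scratch.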
How do we choose keywords?
– Manually
– Statistically
Exhaustivity vs. specificity.

Manually Choosing Keywords
Unconstrained vocabulary: allow the creator of the document to choose whatever he/she wants.
– "best" match
– captures new terms easily
– easiest for person