Outline: Information Retrieval; Finding Out About; Information Retrieval Systems; Asking a question; Finding the Information; IR Basics; How Good Is The IR?; Selecting Relevant Documents; Extracting Lexical Features; Lexical Analyser; Design Issues for Lexical Analyser; Slide 12; Stemming; Noise Words (Stop Words); Example Corpora; Structured Attributes for Medline; Textual Fields for Medline; Structured Fields for Email; Text fields for Email; Indexing; Basic Indexing Algorithm; Fine Points; Choosing Keywords; Manually Choosing Keywords; Examples of Constrained Vocabularies; Automated Vocabulary Selection; Slide 27; Keyword Choice for WWW; Comparing and Ranking Documents; Determining Relevance by Keyword; Keywords for Relevance Ranking; Slide 32; Comparing Documents; Characterizing a Document: Term Frequency; Characterizing a Document: Document Frequency; TF*IDF; Describing an Entire Document; Vector Space; Similarity Between Documents; Bag of Words; Improvements; Query Expansion; Dictionary/Thesaurus Example; Relevance Feedback; Blind Feedback; Post-Hoc Analyses; Additional IR Issues; Summary

Slide 1: Information Retrieval
CSC 9010: Special Topics. Natural Language Processing.
Paula Matuszek, Mary-Angela Papalaskari
Spring, 2005

Slide 2: Finding Out About
• There are many large corpora of information that people use. The web is the obvious example. Others include:
  – scientific journals
  – patent databases
  – Medline
  – Usenet groups
• People interact with all that information because they want to KNOW something; there is a question they are trying to answer or a piece of information they want.
• Information Retrieval, or IR, is the process of answering that information need.
• Simplest approach:
  – Knowledge is organized into chunks (pages or documents)
  – The goal is to return appropriate chunks

Slide 3: Information Retrieval Systems
• The goal of an information retrieval system is to return appropriate chunks.
• The steps involved include:
  – asking a question
  – finding answers
  – evaluating answers
  – presenting answers
• The value of an IR tool depends on how well it does on all of these.
• Web search engines are the IR tools most familiar to most people.

Slide 4: Asking a question
• A query reflects some information need.
• The query syntax needs to allow that information need to be expressed:
  – Keywords
  – Combining terms
    • Simple: “required”, NOT (+ and -)
    • Boolean expressions with and/or/not and nested parentheses
    • Variations: strings, NEAR, capitalization
  – Simplest syntax that works
  – Typically more acceptable if predictable
• Another set of problems arises when the information isn’t text: graphics, music.

Slide 5: Finding the Information
• The goal is to retrieve all relevant chunks. That is too time-consuming to do in real time, so IR systems index pages.
• Two basic approaches:
  – Index and classify by hand
  – Automate
• For BOTH approaches, deciding what to index on (e.g., what is a keyword) is a significant issue.
• Many IR tools, like search engines, provide both.
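As a small illustrative sketch (not part of the original slides) of the indexing idea above, the following Python builds a toy inverted index and answers keyword queries against it; the three-document corpus and the whitespace tokenizer are assumptions made purely for the example, and it previews the indexer and searcher components described on the next slide.

    # Toy inverted index (illustration only; corpus and tokenizer are made up).
    from collections import defaultdict

    corpus = {
        "doc1": "information retrieval finds relevant documents",
        "doc2": "the web is a large corpus of documents",
        "doc3": "search engines index web pages automatically",
    }

    # Indexer: map each word to the set of documents that contain it.
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for token in text.lower().split():
            index[token].add(doc_id)

    # Searcher: AND-combine the posting sets for the query keywords.
    def search(query):
        postings = [index.get(tok, set()) for tok in query.lower().split()]
        return set.intersection(*postings) if postings else set()

    print(search("web documents"))   # -> {'doc2'}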
Slide 6: IR Basics
• A retriever collects a page or chunk. This may involve spidering web pages, extracting documents from a DB, etc.
• A parser processes each chunk and extracts individual words.
• An indexer creates/updates a hash table which connects words with documents.
• A searcher uses the hash table to retrieve documents based on words.
• A ranking system decides the order in which to present the documents: their relevance.

Slide 7: How Good Is The IR?
• Information retrieval systems are evaluated with two basic metrics:
  – Precision: what percent of the documents returned are actually relevant to the information need
  – Recall: what percent of the documents relevant to the information need are returned
• These typically can’t be measured exactly; they are usually estimated on test sets.

Slide 8: Selecting Relevant Documents
• Assume:
  – we already have a corpus of documents defined
  – the goal is to return a subset of those documents
  – individual documents have been separated into individual files
• The remaining components must parse, index, find, and rank documents.
• The traditional approach is based on the words in the documents (it predates the web).

Slide 9: Extracting Lexical Features
• Process a string of characters:
  – assemble characters into tokens (tokenizer)
  – choose tokens to index
• This is a standard lexical analysis problem.
• It can be handled with a lexical analyser generator, such as lex.

Slide 10: Lexical Analyser
• The basic idea is a finite state machine.
• Triples of input state, transition token, output state.
• Must be very efficient; it gets used a LOT.
(State-machine diagram: states 0, 1, 2, with transitions labeled blank, A-Z, and blank/EOF.)

Slide 11: Design Issues for Lexical Analyser
• Punctuation
  – treat as whitespace?
  – treat as characters?
  – treat specially?
• Case
  – fold?
• Digits
  – assemble into numbers?
  – treat as characters?
  – treat as punctuation?

Slide 12: Lexical Analyser
• The output of the lexical analyser is a string of tokens.
• All remaining operations are on these tokens.
• We have already thrown away some information; this makes processing more efficient, but somewhat limits the power of our search.

Slide 13: Stemming
• Additional processing at the token level
  – We covered this earlier this semester.
• Turn words into a canonical form:
  – “cars” into “car”
  – “children” into “child”
  – “walked” into “walk”
• Decreases the total number of different tokens to be processed.
• Decreases the precision of a search, but increases its recall.

Slide 14: Noise Words (Stop Words)
• Function words that …
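To make the token-level processing of the last few slides concrete, here is a small sketch (again an illustration, not taken from the slides) that folds case, drops a hand-picked stop-word list, and stems the remaining tokens with NLTK's Porter stemmer; the stop-word list and the sample sentence are assumptions.

    # Token normalization sketch: case folding, stop-word removal, stemming.
    # Illustration only; the stop-word list and sample text are assumptions.
    # Requires NLTK: pip install nltk
    from nltk.stem import PorterStemmer

    STOP_WORDS = {"the", "a", "an", "of", "and", "or", "to", "in", "is"}
    stemmer = PorterStemmer()

    def normalize(text):
        tokens = text.lower().split()                        # simple whitespace tokenizer
        tokens = [t for t in tokens if t not in STOP_WORDS]  # drop noise words
        return [stemmer.stem(t) for t in tokens]             # canonical (stemmed) forms

    print(normalize("The children walked to the cars"))
    # -> ['children', 'walk', 'car']
    # Note: the Porter stemmer strips suffixes like -ed and -s, but does not
    # handle irregular forms such as "children" -> "child"; that needs a lemmatizer.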

