Berkeley COMPSCI 186 - Elementary IR: -Scalable Boolean Text Search - D2161370

Home> Schools> University of California, Berkeley> Computer Science (COMPSCI) > COMPSCI 186> Elementary IR: -Scalable Boolean Text Search

Berkeley COMPSCI 186 - Elementary IR: -Scalable Boolean Text Search

School name University of California, Berkeley

Course Compsci 186- Introduction to Database Systems

Pages 22

Download Save

Unformatted text preview:

Elementary IR: Scalable Boolean Text SearchInformation Retrieval: HistoryToday: Simple (naïve!) IRIR vs. DBMSIR’s “Bag of Words” ModelBoolean Text SearchText “Indexes”A Simple Relational Text IndexAn Inverted FileHandling Boolean LogicBoolean Search in SQLOne step fancier: Phrases and “Near”For better compressionUpdates and Text SearchLots more tricks in IRRecall From the First LectureYou Know The Basics!Revisiting Our IR/DBMS DistinctionsRevisiting Distinctions, Cont.SummaryIR Buzzwords to Know (so far!)Exercise!Elementary IR: Scalable Boolean Text Search(Compare with R & G 27.1-3)Information Retrieval: History•A research field traditionally separate from Databases–Hans P. Luhn, IBM, 1959: “Keyword in Context (KWIC)”–G. Salton at Cornell in the 60’s/70’s: SMART•Around the same time as relational DB revolution–Tons of research since then•Especially in the web era•Products traditionally separate–Originally, document management systems for libraries, government, law, etc.–Gained prominence in recent years due to web search•Still used for non-web document management. (“Enterprise search”).Today: Simple (naïve!) IR•Boolean Search on keywords•Goal:–Show that you already have the tools to do this from your study of relational DBs•We’ll skip:–Text-oriented storage formats–Intelligent result ranking (hopefully later!)–Parallelism•Critical for modern relational DBs too–Various bells and whistles (lots of little ones!)•Engineering the specifics of (written) human language–E.g. dealing with tense and plurals–E.g. identifying synonyms and related words–E.g. disambiguating multiple meanings of a word–E.g. clustering outputIR vs. DBMS•Seem like very different beasts•Under the hood, not as different as they might seem–But in practice, you have to choose between the 2 todayIR DBMSImprecise Semantics Precise SemanticsKeyword search SQLUnstructured data format Structured dataRead-Mostly. Add docs occasionallyExpect reasonable number of updatesPage through top k resultsGenerate full answerIR’s “Bag of Words” Model•Typical IR data model:–Each document is just a bag of words (“terms”)•Detail 1: “Stop Words”–Certain words are not helpful, so not placed in the bag–e.g. real words like “the” –e.g. HTML tags like <H1>•Detail 2: “Stemming”–Using language-specific rules, convert words to basic form –e.g. “surfing”, “surfed” --> “surf”–Unfortunately have to do this for each language•Yuck!Boolean Text Search•Find all documents that match a Boolean containment expression:–“Windows” AND (“Glass” OR “Door”) AND NOT “Microsoft”•Note: query terms are also filtered via stemming and stop words•When web search engines say “10,000 documents found”, that’s the Boolean search result size–More or less ;-)Text “Indexes”•When IR folks say “text index”…–usually mean more than what DB people mean•In our terms, both “tables” and indexes–Really a logical schema (i.e. tables)–With a physical schema (i.e. indexes)–Usually not stored in a DBMS•Tables implemented as files in a file systemA Simple Relational Text Index•Given: a corpus of text files–Files(docID string, content string)•Create and populate a “bag of words” tableInvertedFile(term string, docID string)•Build a B+-tree or Hash index on InvertedFile.term–Something like “Alternative 3” critical here!! •Keep lists of dup keys sorted by docID–Will provide “interesting orders” later on!•Fancy list compression on the docIDs is important, too•Typically called a postings list by IR people–Note: URL instead of RID, the web is your “heap file”!•Can also cache pages and use RIDs•This is often called an “inverted file” or “inverted index”–Maps from words -> docs, rather than docs -> words•Given this, you can now do single-word text search queries!An Inverted File•Snippets from:–Old class web page–Old microsoft.com home page•Search for–databases–microsoftdata http://www-inst.eecs.berkeley.edu/~cs186database http://www-inst.eecs.berkeley.edu/~cs186date http://www-inst.eecs.berkeley.edu/~cs186day http://www-inst.eecs.berkeley.edu/~cs186dbms http://www-inst.eecs.berkeley.edu/~cs186decision http://www-inst.eecs.berkeley.edu/~cs186demonstrate http://www-inst.eecs.berkeley.edu/~cs186description http://www-inst.eecs.berkeley.edu/~cs186design http://www-inst.eecs.berkeley.edu/~cs186desire http://www-inst.eecs.berkeley.edu/~cs186developer http://www.microsoft.comdiffer http://www-inst.eecs.berkeley.edu/~cs186disability http://www.microsoft.comdiscussion http://www-inst.eecs.berkeley.edu/~cs186division http://www-inst.eecs.berkeley.edu/~cs186do http://www-inst.eecs.berkeley.edu/~cs186document http://www-inst.eecs.berkeley.edu/~cs186document http://www.microsoft.commicrosoft http://www.microsoft.commicrosoft http://www-inst.eecs.berkeley.edu/~cs186midnight http://www-inst.eecs.berkeley.edu/~cs186midterm http://www-inst.eecs.berkeley.edu/~cs186minibase http://www-inst.eecs.berkeley.edu/~cs186million http://www.microsoft.commonday http://www.microsoft.commore http://www.microsoft.commost http://www-inst.eecs.berkeley.edu/~cs186ms http://www-inst.eecs.berkeley.edu/~cs186msn http://www.microsoft.commust http://www-inst.eecs.berkeley.edu/~cs186necessary http://www-inst.eecs.berkeley.edu/~cs186need http://www-inst.eecs.berkeley.edu/~cs186network http://www.microsoft.comnew http://www-inst.eecs.berkeley.edu/~cs186new http://www.microsoft.comnews http://www.microsoft.comnewsgroup http://www-inst.eecs.berkeley.edu/~cs186Term docIDHandling Boolean Logic•How to do “term1” OR “term2”?–Union of two postings lists (docID sets)!•How to do “term1” AND “term2”?–Intersection of two postings lists!•Can be done via merge of postings lists•Remember: postings list per key sorted by docID in index•How to do “term1” AND NOT “term2”?–Set subtraction•Also easy because sorted (basically merge logic again)•How to do “term1” OR NOT “term2”–Union of “term1” with “NOT term2”.•“Not term2” = all docs not containing term2. Yuck!–Usually not allowed!•Optimizations: What order to handle terms if you have many ANDs? Can you do better than merge? How does this interact with postings list compression?Boolean Search in SQL•(SELECT docID FROM InvertedFile WHERE word = “window” INTERSECT

View Full Document


School:
Email:
New Password:
Confirm Password:

Berkeley COMPSCI 186 - Elementary IR: -Scalable Boolean Text Search

Sign up for free to view:

Please select your school