TAMU CSCE 315 - text-processing




Text Operations: Preprocessing

Introduction
Document preprocessing
– to improve the precision of documents retrieved
– lexical analysis, stopwords elimination, stemming, index term selection, thesauri
– build a thesaurus

Document Preprocessing
Lexical analysis of the text
– digits, hyphens, punctuation marks, the case of letters
Elimination of stopwords
– filtering out the useless words for retrieval purposes
Stemming
– dealing with the syntactic variations of query terms
Index terms selection
– determining the terms to be used as index terms
Thesauri
– the expansion of the original query with related terms

The Process of Preprocessing
[Figure: flow diagram with stages labeled structure, lexical analysis, stopwords, noun groups, stemming, and manual indexing; input: Docs; outputs: structure, full text, index terms]

Lexical Analysis of the Text
Four particular cases
Numbers
• usually not good index terms because of their vagueness
• need some advanced lexical analysis procedure
  – ex) 510B.C., 4105-1201-2310-2213, 12/2/2000, …
Hyphens
• breaking up hyphenated words might be useful
  – ex) state-of-the-art → state of the art (Good!)
  – but, B-1 → B 1 (???)
• need to adopt a general rule and to specify exceptions on a case by case basis

Lexical Analysis of the Text
Punctuation marks
– removed entirely
  • ex) 510B.C → 510BC
  • if the query contains ‘510B.C’, removing the dot in both the query term and the documents will not affect retrieval performance
– require the preparation of a list of exceptions
  • ex) val.id → valid (???)
The case of letters
– converts all the text to either lower or upper case
– part of the semantics might be lost
  • Northwestern University → northwestern university (???)

Elimination of Stopwords
Basic concept
– filtering out words with very low discrimination values
  • ex) a, the, this, that, where, when, …
Advantage
– reduces the size of the indexing structure considerably
Disadvantage
– might reduce recall as well
  • ex) “to be or not to be”

Stemming
What is the “stem”?
– the portion of a word which is left after the removal of its affixes (i.e., prefixes and suffixes)
– ex) ‘connect’ is the stem for the variants ‘connected’, ‘connecting’, ‘connection’, ‘connections’
Effect of stemming
– reduces variants of the same root to a common concept
– reduces the size of the indexing structure
– controversy about the benefits of stemming

Index Term Selection
Index terms selection
– not all words are equally significant for representing the semantics of a document
Manual selection
– selection of index terms is usually done by specialists
Automatic selection of index terms
– most of the semantics is carried by the nouns
– clustering nouns which appear nearby in the text into a single indexing component (or concept)
– ex) computer science

Thesauri
What is the “thesaurus”?
– list of important words in a given domain of knowledge
– a set of related words derived from a synonymity relationship
– a controlled vocabulary for indexing and searching
Main purposes
– provide a standard vocabulary for indexing and searching
– assist users with locating terms for proper query formulation
– provide classified hierarchies that allow the broadening and narrowing of the current query request

Thesauri
Thesaurus index terms
– denote a concept which is the basic semantic unit
– can be individual words, groups of words, or phrases
  • ex) building, teaching, ballistic missiles, body temperature
– frequently, it is necessary to complement a thesaurus entry with a definition or an explanation
  • ex) seal (marine animals), seal (documents)
Thesaurus term relationships
– mostly composed of synonyms and near-synonyms
– BT (Broader Term), NT (Narrower Term), RT (Related
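
The preprocessing steps in these slides (lexical analysis, stopword elimination, stemming, thesaurus-based query expansion) can be sketched in a few lines of Python. This is a minimal illustration, not the method from the course: the names STOPWORDS, SUFFIXES, THESAURUS and all function names are assumptions made for the example, the stopword list is a tiny sample, the thesaurus entries are hypothetical, and the stemmer is naive suffix stripping rather than an established algorithm such as Porter's.

```python
import re

# Small illustrative stopword list (assumption; real lists are much longer).
STOPWORDS = {"a", "the", "this", "that", "where", "when",
             "of", "in", "to", "be", "or", "not"}

# Suffixes for a toy stemmer, longest first. This is NOT the Porter algorithm.
SUFFIXES = ("ions", "ion", "ing", "ed", "s")

# Hypothetical thesaurus mapping a term to related terms.
THESAURUS = {"car": ["automobile", "vehicle"]}


def tokenize(text):
    """Lexical analysis: lowercase, delete dots, split on other non-alphanumerics."""
    text = text.lower().replace(".", "")  # 510B.C. becomes 510bc
    return re.findall(r"[a-z0-9]+", text)  # hyphens and slashes act as separators


def remove_stopwords(tokens):
    """Drop tokens with very low discrimination value."""
    return [t for t in tokens if t not in STOPWORDS]


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Run the three steps in order and return candidate index terms."""
    return [stem(t) for t in remove_stopwords(tokenize(text))]


def expand_query(terms):
    """Append related thesaurus terms to the original query terms."""
    expanded = list(terms)
    for term in terms:
        for related in THESAURUS.get(term, []):
            if related not in expanded:
                expanded.append(related)
    return expanded


print(preprocess("The state-of-the-art system connected in 510B.C."))
# ['state', 'art', 'system', 'connect', '510bc']
print(preprocess("to be or not to be"))
# []
print(expand_query(["car", "engine"]))
# ['car', 'engine', 'automobile', 'vehicle']
```

The second call shows the recall problem noted under stopword elimination: with this list, the query “to be or not to be” is reduced to nothing. The blanket dot removal and hyphen splitting also reproduce the questionable cases from the slides (B-1 becomes two tokens, val.id becomes valid), which is why a real system would need an exception list.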

