ICS 278: Data Mining
Lecture 12: Text Mining
Padhraic Smyth
Department of Information and Computer Science
University of California, Irvine

Outline
• Text Mining
• Text Mining Applications
• General concepts in Information Retrieval
• Query by Content
• Issues in Query by Content
• The Standard Approach
• Evaluating Retrieval Methods
• Precision versus Recall
• Precision-Recall Curve (form of ROC)
• TREC evaluations
• Text Retrieval
• Toy example of a document-term matrix
• Distances between Documents
• Distance matrices for toy document-term data
• TF-IDF Term Weighting Schemes
• TF-IDF Example
• Typical Document Querying System
• Synonymy and Polysemy
• Latent Semantic Indexing
• SVD
• Example of SVD
• Another LSI Example
• LSI Example (continued)
• LSI Example
• Further Reading
• Next up …

Text Mining
• Information Retrieval
• Text Classification
• Text Clustering
• Information Extraction

Text Mining Applications
• Information Retrieval
  – Query-based search of large text archives, e.g., the Web
• Text Classification
  – Automated assignment of topics to Web pages, e.g., Yahoo, Google
  – Automated classification of email into spam and non-spam
• Text Clustering
  – Automated organization of search results into categories in real time
  – Discovery of clusters and trends in technical literature (e.g., CiteSeer)
• Information Extraction
  – Extracting standard fields from free text, e.g.,
    • extracting names and places from reports and newspapers (e.g., military applications)
    • extracting structured information automatically from resumes
    • extracting protein-interaction information from biology papers

General concepts in Information Retrieval
• Representation language
  – typically a vector of p attribute values, e.g.,
    • a set of color, intensity, and texture features characterizing an image
    • word counts for a text document
• Data set D of N objects
  – typically represented as an N x p matrix
• Query Q
  – the user poses a query to search D
  – the query is typically expressed in the same representation language as the data, e.g.,
    • each text document is represented by the set of words that occur in it
    • the query Q is likewise a set of words, e.g., "data" and "mining"

Query by Content
• Traditional DB query: exact matches
  – e.g., query Q = [level = MANAGER] & [age < 30]
• Query-by-content: more general / less precise
  – e.g., Q = which historic record is most similar to this new one?
  – for text data, often called "information retrieval" (IR)
• Goal
  – match the query Q to the N objects in the database
  – return a ranked list (typically) of the most similar/relevant objects in the data set D, given Q

Issues in Query by Content
– What representation language to use
– How to measure similarity between Q and each object in D
– How to compute the results in real time (for interactive querying)
– How to rank the results for the user
– How to allow user feedback (query modification)
– How to evaluate and compare different IR algorithms/systems

The Standard Approach
• Fixed-length (d-dimensional) vector representation
  – for both the query (a d x 1 vector Q) and the database objects (a d x n matrix X)
• Use domain-specific higher-level features (rather than raw data)
  – image: color (e.g., RGB), texture (e.g., Gabor or Fourier coefficients), …
  – text: "bag of words", i.e., a frequency count for each word in each document, …
• Compute distances between the vectorized representations
• Use k-NN to find the k vectors in X closest to Q (two short sketches follow this slide)
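To make the bag-of-words step concrete, here is a minimal sketch in Python; it is not material from the lecture. The toy corpus, the vocabulary construction, and the helper name to_vector are all made up for illustration: a corpus of N documents becomes an N x p document-term matrix of word counts, and a query is expressed in the same representation.

```python
# Minimal bag-of-words sketch (illustrative, not from the lecture).
from collections import Counter

docs = [
    "data mining finds patterns in data",
    "text mining applies data mining to text",
    "information retrieval ranks documents for a query",
]

# Vocabulary = the p terms, i.e., the columns of the matrix.
vocab = sorted({word for doc in docs for word in doc.split()})

def to_vector(text):
    """Map a document (or a query) to its vector of term counts."""
    counts = Counter(text.split())
    return [counts[term] for term in vocab]

X = [to_vector(doc) for doc in docs]   # N x p document-term matrix
q = to_vector("data mining")           # query in the same language
```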
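Continuing the same toy setup, a sketch of the retrieval step itself: score the query vector against every row of the document-term matrix with cosine similarity and return the k nearest as the ranked list. The names cosine and knn_retrieve are illustrative, not a standard API.

```python
# k-NN retrieval over the toy matrix X and query q defined above.
import math

def cosine(u, v):
    """Cosine similarity of two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_retrieve(q, X, k=2):
    """Indices of the k documents most similar to the query, best first."""
    scores = sorted(((cosine(q, x), i) for i, x in enumerate(X)), reverse=True)
    return [i for _, i in scores[:k]]

# knn_retrieve(q, X) -> [0, 1]: the two "data mining" documents rank first.
```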
Evaluating Retrieval Methods
• For predictive models (classification/regression) the objective is straightforward
  – score = accuracy on unseen test data
• Evaluation is more complex for query by content
  – the real score is how "useful" the retrieved information is, which is subjective
    • e.g., how would you define a real score for Google's top 10 hits?
• Towards objectivity, assume:
  – 1) each object is either "relevant" or "irrelevant"
    • simplification: the label is binary and the same for all users (e.g., decided by committee vote)
  – 2) each object is labelled by an objective/consistent oracle
• These assumptions suggest a classifier approach is possible
  – but the goals are rather different: we want the objects nearest to Q, not separability per se
  – and it would require learning a classifier at query time (with Q defining the positive class)
  – which is why a k-NN type approach seems so appropriate …

Precision versus Recall
• DQ = Q's ranked retrievals (smallest distance first)
• DQT = those retrievals with distance < threshold T
• Threshold near 0: few false positives (FP: called relevant, but not), many false negatives (FN: relevant, but missed)
• Large threshold: few false negatives (FN), many false positives (FP)
• precision = TP / (TP + FP)
  – the fraction of retrieved objects that are relevant
• recall = TP / (TP + FN)
  – the fraction of all relevant objects that are retrieved
• Tradeoff: high precision tends to mean low recall, and vice versa (a worked sketch follows the TREC slide)
• For multiple queries, precision at specific ranges of recall can be averaged (so-called "interpolated precision")

Precision-Recall Curve (form of ROC)
[Figure: precision-recall curves for three retrieval systems A, B, and C; C is universally worse than A and B. Alternative single-point summaries: precision where recall = precision, precision at a fixed number of retrievals, or average precision over multiple recall levels.]

TREC evaluations
• Text REtrieval Conference (TREC)
  – Web site: trec.nist.gov
• Annual impartial evaluation of IR systems
  – e.g., D = 1 million documents
  – TREC organizers supply contestants with several hundred queries Q
  – each competing system provides its ranked list of documents for each query
  – the union of the top 100 or so documents from each system is then judged for relevance
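The worked sketch promised above: it applies the definitions precision = TP / (TP + FP) and recall = TP / (TP + FN) to a made-up ranked list and made-up oracle relevance labels. Sweeping the number of retrieved items plays the role of sweeping the distance threshold, tracing out one precision-recall curve.

```python
# Precision/recall over a ranked list (ids and labels made up for illustration).
def precision_recall(ranked, relevant, n):
    """Precision and recall when the top n ranked items are retrieved."""
    retrieved = set(ranked[:n])
    tp = len(retrieved & relevant)        # relevant and retrieved
    fp = len(retrieved - relevant)        # retrieved but not relevant
    fn = len(relevant - retrieved)        # relevant but not retrieved
    return tp / (tp + fp), tp / (tp + fn)

ranked = [3, 1, 4, 0, 2]                  # document ids, smallest distance first
relevant = {1, 2, 3}                      # oracle-labelled relevant ids

for n in range(1, len(ranked) + 1):       # small n ~ distance threshold near 0
    p, r = precision_recall(ranked, relevant, n)
    print(f"top {n}: precision={p:.2f}  recall={r:.2f}")
# Precision falls from 1.00 to 0.60 as recall climbs to 1.00: the tradeoff.
```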