ICS 278: Data Mining
Lecture 12: Text Mining
Padhraic Smyth
Department of Information and Computer Science
University of California, Irvine

Outline
• Text Mining
• Text Mining Applications
• General concepts in Information Retrieval
• Query by Content
• Issues in Query by Content
• The Standard Approach
• Evaluating Retrieval Methods
• Precision versus Recall
• Precision-Recall Curve (form of ROC)
• TREC evaluations
• Text Retrieval
• Toy example of a document-term matrix
• Distances between Documents
• Distance matrices for toy document-term data
• TF-IDF Term Weighting Schemes
• TF-IDF Example
• Typical Document Querying System
• Synonymy and Polysemy
• Latent Semantic Indexing
• SVD
• Example of SVD
• Another LSI Example
• LSI Example (continued)
• LSI Example
• Further Reading
• Next up …

Text Mining
• Information Retrieval
• Text Classification
• Text Clustering
• Information Extraction

Text Mining Applications
• Information Retrieval
  – Query-based search of large text archives, e.g., the Web
• Text Classification
  – Automated assignment of topics to Web pages, e.g., Yahoo, Google
  – Automated classification of email into spam and non-spam
• Text Clustering
  – Automated organization of search results into categories in real time
  – Discovery of clusters and trends in technical literature (e.g., CiteSeer)
• Information Extraction
  – Extracting standard fields from free text, e.g.,
    • extracting names and places from reports and newspapers (e.g., military applications)
    • extracting structured information automatically from resumes
    • extracting protein-interaction information from biology papers

General concepts in Information Retrieval
• Representation language
  – typically a vector of p attribute values, e.g.,
    • a set of color, intensity, and texture features characterizing an image
    • word counts for a text document
• Data set D of N objects
  – typically represented as an N x p matrix
• Query Q
  – the user poses a query to search D
  – the query is typically expressed in the same representation language as the data, e.g.,
    • each text document is represented by the set of words that occur in it
    • the query Q is likewise a set of words, e.g., "data" and "mining"

Query by Content
• Traditional DB query: exact matches
  – e.g., query Q = [level = MANAGER] & [age < 30]
• Query-by-content: more general / less precise
  – e.g., Q = which historic record is most similar to this new one?
  – for text data, often called "information retrieval" (IR)
• Goal
  – match the query Q to the N objects in the database
  – return a ranked list (typically) of the most similar/relevant objects in the data set D, given Q

Issues in Query by Content
– What representation language to use
– How to measure similarity between Q and each object in D
– How to compute the results in real time (for interactive querying)
– How to rank the results for the user
– How to allow user feedback (query modification)
– How to evaluate and compare different IR algorithms/systems

The Standard Approach
• Fixed-length (d-dimensional) vector representation
  – for both the query (a d x 1 vector Q) and the database objects (a d x n matrix X)
• Use domain-specific higher-level features (rather than raw data)
  – image: color (e.g., RGB), texture (e.g., Gabor or Fourier coefficients), …
  – text: "bag of words", i.e., a frequency count for each word in each document, …
• Compute distances between the vectorized representations
• Use k-NN to find the k vectors in X closest to Q (two short sketches follow this slide)
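To make the bag-of-words step concrete, here is a minimal sketch in Python; it is not material from the lecture. The toy corpus, the vocabulary construction, and the helper name to_vector are all made up for illustration: a corpus of N documents becomes an N x p document-term matrix of word counts, and a query is expressed in the same representation.

```python
# Minimal bag-of-words sketch (illustrative, not from the lecture).
from collections import Counter

docs = [
    "data mining finds patterns in data",
    "text mining applies data mining to text",
    "information retrieval ranks documents for a query",
]

# Vocabulary = the p terms, i.e., the columns of the matrix.
vocab = sorted({word for doc in docs for word in doc.split()})

def to_vector(text):
    """Map a document (or a query) to its vector of term counts."""
    counts = Counter(text.split())
    return [counts[term] for term in vocab]

X = [to_vector(doc) for doc in docs]   # N x p document-term matrix
q = to_vector("data mining")           # query in the same language
```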
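Continuing the same toy setup, a sketch of the retrieval step itself: score the query vector against every row of the document-term matrix with cosine similarity and return the k nearest as the ranked list. The names cosine and knn_retrieve are illustrative, not a standard API.

```python
# k-NN retrieval over the toy matrix X and query q defined above.
import math

def cosine(u, v):
    """Cosine similarity of two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_retrieve(q, X, k=2):
    """Indices of the k documents most similar to the query, best first."""
    scores = sorted(((cosine(q, x), i) for i, x in enumerate(X)), reverse=True)
    return [i for _, i in scores[:k]]

# knn_retrieve(q, X) -> [0, 1]: the two "data mining" documents rank first.
```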
Evaluating Retrieval Methods
• For predictive models (classification/regression) the objective is straightforward
  – score = accuracy on unseen test data
• Evaluation is more complex for query by content
  – the real score is how "useful" the retrieved information is, which is subjective
    • e.g., how would you define a real score for Google's top 10 hits?
• Towards objectivity, assume:
  – 1) each object is either "relevant" or "irrelevant"
    • simplification: the label is binary and the same for all users (e.g., decided by committee vote)
  – 2) each object is labelled by an objective/consistent oracle
• These assumptions suggest a classifier approach is possible
  – but the goals are rather different: we want the objects nearest to Q, not separability per se
  – and it would require learning a classifier at query time (with Q defining the positive class)
  – which is why a k-NN type approach seems so appropriate …

Precision versus Recall
• DQ = Q's ranked retrievals (smallest distance first)
• DQT = those retrievals with distance < threshold T
• Threshold near 0: few false positives (FP: called relevant, but not), many false negatives (FN: relevant, but missed)
• Large threshold: few false negatives (FN), many false positives (FP)
• precision = TP / (TP + FP)
  – the fraction of retrieved objects that are relevant
• recall = TP / (TP + FN)
  – the fraction of all relevant objects that are retrieved
• Tradeoff: high precision tends to mean low recall, and vice versa (a worked sketch follows the TREC slide)
• For multiple queries, precision at specific ranges of recall can be averaged (so-called "interpolated precision")

Precision-Recall Curve (form of ROC)
[Figure: precision-recall curves for three retrieval systems A, B, and C; C is universally worse than A and B. Alternative single-point summaries: precision where recall = precision, precision at a fixed number of retrievals, or average precision over multiple recall levels.]

TREC evaluations
• Text REtrieval Conference (TREC)
  – Web site: trec.nist.gov
• Annual impartial evaluation of IR systems
  – e.g., D = 1 million documents
  – TREC organizers supply contestants with several hundred queries Q
  – each competing system provides its ranked list of documents for each query
  – the union of the top 100 or so documents from each system is then judged for relevance
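The worked sketch promised above: it applies the definitions precision = TP / (TP + FP) and recall = TP / (TP + FN) to a made-up ranked list and made-up oracle relevance labels. Sweeping the number of retrieved items plays the role of sweeping the distance threshold, tracing out one precision-recall curve.

```python
# Precision/recall over a ranked list (ids and labels made up for illustration).
def precision_recall(ranked, relevant, n):
    """Precision and recall when the top n ranked items are retrieved."""
    retrieved = set(ranked[:n])
    tp = len(retrieved & relevant)        # relevant and retrieved
    fp = len(retrieved - relevant)        # retrieved but not relevant
    fn = len(relevant - retrieved)        # relevant but not retrieved
    return tp / (tp + fp), tp / (tp + fn)

ranked = [3, 1, 4, 0, 2]                  # document ids, smallest distance first
relevant = {1, 2, 3}                      # oracle-labelled relevant ids

for n in range(1, len(ranked) + 1):       # small n ~ distance threshold near 0
    p, r = precision_recall(ranked, relevant, n)
    print(f"top {n}: precision={p:.2f}  recall={r:.2f}")
# Precision falls from 1.00 to 0.60 as recall climbs to 1.00: the tradeoff.
```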