Semi-Supervised Learning and Text Analysis

Machine Learning 10-701
November 29, 2005
Tom M. Mitchell
Carnegie Mellon University


Document Classification: Bag of Words Approach

Word       Count
aardvark   0
about      2
all        2
Africa     1
apple      0
anxious    0
...
gas        1
...
oil        1
...
Zaire      0

For code, see www.cs.cmu.edu/~tom/mlbook.html (click on "Software and Data").


Supervised Training for Document Classification

- Common algorithms:
    - Logistic regression, Support Vector Machines, Bayesian classifiers
- Quite successful in practice:
    - Email classification (spam, foldering)
    - Web page classification (product description, publication)
    - Intranet document organization
- Research directions:
    - More elaborate, domain-specific classification models (e.g., for email)
    - Using unlabeled data too: semi-supervised methods


EM for Semi-supervised document classification


Using Unlabeled Data to Help Train Naïve Bayes Classifier

Learn P(Y | X)

[Diagram: Naïve Bayes network, Y with children X1, X2, X3, X4]

Y   X1  X2  X3  X4
1   0   0   1   1
0   0   1   0   0
0   0   0   1   0
?   0   1   1   0
?   0   1   0   1


From [Nigam et al., 2000]

E step, M step (w_t is the t-th word in the vocabulary)


Elaboration 1: Downweight the influence of unlabeled examples by a factor λ

- New M step
- λ chosen by cross-validation


Using one labeled example per class

[Figure: results on 20 Newsgroups]


EM for Semi-Supervised Doc Classification

- If all data is labeled, corresponds to the Naïve Bayes classifier
- If all data is unlabeled, corresponds to mixture-of-multinomial clustering
- If both labeled and unlabeled data, it helps if and only if the mixture-of-multinomial modeling assumption is correct
- Of course we could extend this to Bayes net models other than Naïve Bayes (e.g., TAN tree)


Bags of Words, or Bags of Topics?


LDA: Generative model for documents [Blei, Ng, Jordan 2003]

Also extended to the case where the number of topics is not known in advance (hierarchical Dirichlet processes [Blei et al., 2004])


Clustering words into topics with Hierarchical Topic Models (unknown number of clusters) [Blei, Ng, Jordan 2003]

Probabilistic model for generating document D:
1. Pick a distribution P(z | θ) of topics according to P(θ | α)
2. For each word w:
    - Pick topic z from P(z | θ)
    - Pick word w from P(w | z, φ)

Training this model defines topics (i.e., φ, which defines P(W | Z))


Example topics induced from a large collection of text [Tenenbaum et al.]

JOB            SCIENCE      BALL        FIELD      STORY       MIND           DISEASE         WATER
WORK           STUDY        GAME        MAGNETIC   STORIES     WORLD          BACTERIA        FISH
JOBS           SCIENTISTS   TEAM        MAGNET     TELL        DREAM          DISEASES        SEA
CAREER         SCIENTIFIC   FOOTBALL    WIRE       CHARACTER   DREAMS         GERMS           SWIM
EXPERIENCE     KNOWLEDGE    BASEBALL    NEEDLE     CHARACTERS  THOUGHT        FEVER           SWIMMING
EMPLOYMENT     WORK         PLAYERS     CURRENT    AUTHOR      IMAGINATION    CAUSE           POOL
OPPORTUNITIES  RESEARCH     PLAY        COIL       READ        MOMENT         CAUSED          LIKE
WORKING        CHEMISTRY    FIELD       POLES      TOLD        THOUGHTS       SPREAD          SHELL
TRAINING       TECHNOLOGY   PLAYER      IRON       SETTING     OWN            VIRUSES         SHARK
SKILLS         MANY         BASKETBALL  COMPASS    TALES       REAL           INFECTION       TANK
CAREERS        MATHEMATICS  COACH       LINES      PLOT        LIFE           VIRUS           SHELLS
POSITIONS      BIOLOGY      PLAYED      CORE       TELLING     IMAGINE        MICROORGANISMS  SHARKS
FIND           FIELD        PLAYING     ELECTRIC   SHORT       SENSE          PERSON          DIVING
POSITION       PHYSICS      HIT         DIRECTION  FICTION     CONSCIOUSNESS  INFECTIOUS      DOLPHINS
FIELD          LABORATORY   TENNIS      FORCE      ACTION      STRANGE        COMMON          SWAM
OCCUPATIONS    STUDIES      TEAMS       MAGNETS    TRUE        FEELING        CAUSING         LONG
REQUIRE        WORLD        GAMES       BE         EVENTS      WHOLE          SMALLPOX        SEAL
OPPORTUNITY    SCIENTIST    SPORTS      MAGNETISM  TELLS       BEING          BODY            DIVE
EARN           STUDYING     BAT         POLE       TALE        MIGHT          INFECTIONS      DOLPHIN
ABLE           SCIENCES     TERRY       INDUCED    NOVEL       HOPE           CERTAIN         UNDERWATER

Significance:
- Learned topics reveal hidden, implicit semantic categories in the corpus
- In many cases, we can represent documents with 10^2 topics instead of 10^5 words
- Especially important for short documents (e.g., emails): topics overlap when words don't!


Can we analyze roles and relationships between people by analyzing email word or topic distributions?


Author-Recipient-Topic model for Email

[Diagram: Latent Dirichlet Allocation (LDA) [Blei, Ng, Jordan, 2003] alongside Author-Recipient-Topic (ART) [McCallum, Corrada, Wang, 2004]]


Enron Email Corpus

- 250k email messages
- 23k people

Date: Wed, 11 Apr 2001 06:56:00 -0700 (PDT)
From: debra.perlingiere@enron.com
To: steve.hooser@enron.com
Subject: Enron TransAltaContract dated Jan 1, 2001

Please see below. Katalin Kiss of TransAlta has requested an electronic copy of our final draft. Are you OK with this? If so, the only version I have is the original draft without revisions.

DP

Debra Perlingiere
Enron North America Corp.
Legal Department
1400 Smith Street, EB 3885
Houston, Texas 77002
dperlin@enron.com


Topics, and prominent sender/receivers discovered by ART [McCallum et al., 2004]

[Figure: top words within each topic; top author-recipients exhibiting each topic]

- Beck = "Chief Operations Officer"
- Dasovich = "Government Relations Executive"
- Shapiro = "Vice President of Regulatory Affairs"
- Steffes = "Vice President of Government Affairs"


Discovering Role Similarity

connection strength (A, B):
- Traditional SNA: similarity in recipients they sent email to
- ART: similarity in authored topics, conditioned on recipient


Co-Training for Semi-supervised document classification

Idea: take advantage of redundancy


Redundantly Sufficient Features

[Figure: web page example, with hyperlink text "my advisor" pointing to a page containing "Professor Faloutsos"]


Co-Training

Key idea: Classifier1 and ClassifierJ must:
1. Correctly classify labeled examples
2. Agree on
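
Sketch: EM for semi-supervised Naïve Bayes

As an illustration of the EM procedure for semi-supervised document classification covered earlier in these slides, here is a minimal pure-Python sketch under stated assumptions. It is not the code from the course page. Documents are bag-of-words count vectors. The function names (`m_step`, `posterior`, `train`, `predict`), the parameter `lam` (standing in for the downweighting factor on unlabeled examples), the add-one smoothing, and the toy data are all assumptions made for this example.

```python
import math


def m_step(weighted_docs, n_classes, vocab_size):
    """Re-estimate log P(c) and log P(w_t | c) from (counts, resp, weight) triples."""
    class_mass = [1.0] * n_classes  # add-one smoothing (an assumption of this sketch)
    word_mass = [[1.0] * vocab_size for _ in range(n_classes)]
    for counts, resp, weight in weighted_docs:
        for c in range(n_classes):
            r = weight * resp[c]  # unlabeled documents contribute with weight lam
            class_mass[c] += r
            for t, n in enumerate(counts):
                word_mass[c][t] += r * n
    total = sum(class_mass)
    log_prior = [math.log(m / total) for m in class_mass]
    log_word = [[math.log(m / sum(row)) for m in row] for row in word_mass]
    return log_prior, log_word


def posterior(counts, model):
    """E step for one document: P(c | document) under the current model."""
    log_prior, log_word = model
    scores = [lp + sum(n * lw[t] for t, n in enumerate(counts))
              for lp, lw in zip(log_prior, log_word)]
    top = max(scores)  # subtract the max before exponentiating, for stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def train(labeled, unlabeled, n_classes, vocab_size, lam=1.0, n_iters=10):
    """Initialize from labeled data only, then alternate E and M steps."""
    hard = [(x, [1.0 if c == y else 0.0 for c in range(n_classes)], 1.0)
            for x, y in labeled]
    model = m_step(hard, n_classes, vocab_size)
    for _ in range(n_iters):
        soft = [(x, posterior(x, model), lam) for x in unlabeled]  # E step
        model = m_step(hard + soft, n_classes, vocab_size)         # M step
    return model


def predict(counts, model):
    """Return the index of the most probable class."""
    p = posterior(counts, model)
    return max(range(len(p)), key=p.__getitem__)


# Toy data (hypothetical): a 4-word vocabulary, one labeled example per class.
labeled = [([2, 1, 0, 0], 0), ([0, 0, 1, 2], 1)]
unlabeled = [[1, 2, 0, 0], [3, 0, 0, 0], [0, 0, 2, 1], [0, 0, 0, 3]]
model = train(labeled, unlabeled, n_classes=2, vocab_size=4, lam=0.5)
print(predict([0, 3, 0, 0], model), predict([0, 0, 3, 0], model))
```

With `lam=0` the unlabeled documents carry no weight and the procedure reduces to ordinary supervised Naïve Bayes. With no labeled documents it would amount to mixture-of-multinomials clustering, although this sketch initializes from labeled data and so expects at least one labeled example per class.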