CMU CS 10701 - Semi-Supervised Learning and Text Analysis

This preview shows page 1-2-3-4-24-25-26-50-51-52-53 out of 53 pages.


Semi-Supervised Learning and Text Analysis
Machine Learning 10-701
November 29, 2005
Tom M. Mitchell
Carnegie Mellon University

Document Classification: Bag of Words Approach

  aardvark  0
  about     2
  all       2
  Africa    1
  apple     0
  anxious   0
  gas       1
  oil       1
  Zaire     0

For code, see www.cs.cmu.edu/~tom/mlbook.html (click on "Software and Data").

Supervised Training for Document Classification

- Common algorithms:
  - Logistic regression
  - Support Vector Machines
  - Bayesian classifiers
- Quite successful in practice:
  - Email classification (spam, foldering)
  - Web page classification (product description, publication)
  - Intranet document organization
- Research directions:
  - More elaborate, domain-specific classification models (e.g., for email)
  - Using unlabeled data too: semi-supervised methods

EM for Semi-Supervised Document Classification

Using Unlabeled Data to Help Train Naïve Bayes Classifier

Learn P(Y|X)

[Figure: graphical model over Y, X1, X2, X3, X4]

  Y   X1  X2  X3  X4
  1   0   0   1   1
  0   0   1   0   0
  0   0   0   1   0
  ?   0   1   1   0
  ?   0   1   0   1

From [Nigam et al., 2000]

[Equations: E Step, M Step]
w_t is the t-th word in the vocabulary.

Elaboration 1: Downweight the influence of unlabeled examples by factor λ

[Equation: New M step]
λ is chosen by cross validation.

[Figure: using one labeled example per class]
[Figure: 20 Newsgroups]
[Figure: 20 Newsgroups]

EM for Semi-Supervised Doc Classification

- If all data is labeled, corresponds to the Naïve Bayes classifier.
- If all data is unlabeled, corresponds to mixture-of-multinomials clustering.
- If both labeled and unlabeled data, it helps if and only if the mixture-of-multinomials modeling assumption is correct.
- Of course, we could extend this to Bayes net models other than Naïve Bayes (e.g., TAN tree).

Bags of Words, or Bags of Topics?

LDA: Generative model for documents [Blei, Ng, Jordan 2003]

Also extended to the case where the number of topics is not known in advance (hierarchical Dirichlet processes [Blei et al., 2004]).

Clustering words into topics with Hierarchical Topic Models (unknown number of clusters) [Blei, Ng, Jordan 2003]

Probabilistic model for generating document D:
1. Pick a distribution P(z|θ) of topics according to P(θ|α)
2. For each word w:
   - Pick topic z from P(z|θ)
   - Pick word w from P(w|z,φ)

Training this model defines topics (i.e., φ, which defines P(W|Z)).

Example topics induced from a large collection of text [Tenenbaum et al.]

  JOB:     JOB, WORK, JOBS, CAREER, EXPERIENCE, EMPLOYMENT, OPPORTUNITIES, WORKING, TRAINING, SKILLS, CAREERS, POSITIONS, FIND, POSITION, FIELD, OCCUPATIONS, REQUIRE, OPPORTUNITY, EARN, ABLE
  SCIENCE: SCIENCE, STUDY, SCIENTISTS, SCIENTIFIC, KNOWLEDGE, WORK, RESEARCH, CHEMISTRY, TECHNOLOGY, MANY, MATHEMATICS, BIOLOGY, FIELD, PHYSICS, LABORATORY, STUDIES, WORLD, SCIENTIST, STUDYING, SCIENCES
  BALL:    BALL, GAME, TEAM, FOOTBALL, BASEBALL, PLAYERS, PLAY, FIELD, PLAYER, BASKETBALL, COACH, PLAYED, PLAYING, HIT, TENNIS, TEAMS, GAMES, SPORTS, BAT, TERRY
  FIELD:   FIELD, MAGNETIC, MAGNET, WIRE, NEEDLE, CURRENT, COIL, POLES, IRON, COMPASS, LINES, CORE, ELECTRIC, DIRECTION, FORCE, MAGNETS, BE, MAGNETISM, POLE, INDUCED
  STORY:   STORY, STORIES, TELL, CHARACTER, CHARACTERS, AUTHOR, READ, TOLD, SETTING, TALES, PLOT, TELLING, SHORT, FICTION, ACTION, TRUE, EVENTS, TELLS, TALE, NOVEL
  MIND:    MIND, WORLD, DREAM, DREAMS, THOUGHT, IMAGINATION, MOMENT, THOUGHTS, OWN, REAL, LIFE, IMAGINE, SENSE, CONSCIOUSNESS, STRANGE, FEELING, WHOLE, BEING, MIGHT, HOPE
  DISEASE: DISEASE, BACTERIA, DISEASES, GERMS, FEVER, CAUSE, CAUSED, SPREAD, VIRUSES, INFECTION, VIRUS, MICROORGANISMS, PERSON, INFECTIOUS, COMMON, CAUSING, SMALLPOX, BODY, INFECTIONS, CERTAIN
  WATER:   WATER, FISH, SEA, SWIM, SWIMMING, POOL, LIKE, SHELL, SHARK, TANK, SHELLS, SHARKS, DIVING, DOLPHINS, SWAM, LONG, SEAL, DIVE, DOLPHIN, UNDERWATER

Significance:
- Learned topics reveal hidden, implicit semantic categories in the corpus.
- In many cases, we can represent documents with 10^2 topics instead of 10^5 words.
- Especially important for short documents (e.g., emails). Topics overlap when words don't!
[Tenenbaum et al.]

Can we analyze roles and relationships between people by analyzing email word or topic distributions?

Author-Recipient-Topic model for Email

- Latent Dirichlet Allocation (LDA) [Blei, Ng, Jordan, 2003]
- Author-Recipient Topic (ART) [McCallum, Corrada, Wang, 2004]

Enron Email Corpus

250k email messages, 23k people

  Date: Wed, 11 Apr 2001 06:56:00 -0700 (PDT)
  From: debra.perlingiere@enron.com
  To: steve.hooser@enron.com
  Subject: Enron/TransAltaContract dated Jan 1, 2001

  Please see below. Katalin Kiss of TransAlta has requested an electronic copy of our final draft? Are you OK with this? If so, the only version I have is the original draft without revisions.

  DP

  Debra Perlingiere
  Enron North America Corp.
  Legal Department
  1400 Smith Street, EB 3885
  Houston, Texas 77002
  dperlin@enron.com

Topics, and prominent sender/receivers, discovered by ART [McCallum et al., 2004]

[Figure: top words within each topic; top author-recipients exhibiting this topic]

  Beck     = "Chief Operations Officer"
  Dasovich = "Government Relations Executive"
  Shapiro  = "Vice President of Regulatory Affairs"
  Steffes  = "Vice President of Government Affairs"

Discovering Role Similarity

connection strength (A, B) =
- Traditional SNA: similarity in recipients they sent email to
- ART: similarity in authored topics, conditioned on recipient

Co-Training for Semi-Supervised Document Classification

Idea: take advantage of *redundancy*

Redundantly Sufficient Features

[Figure: "Professor Faloutsos", "my advisor"]

Co-Training

Key idea: Classifier1 and ClassifierJ must:
1. Correctly classify labeled examples
2. Agree on
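
The EM procedure for semi-supervised Naïve Bayes outlined earlier (E step, M step, and a down-weighting factor for unlabeled examples) can be sketched roughly as follows. This is a minimal illustration, not the lecture's own code: it assumes a multinomial Naïve Bayes model with Laplace smoothing in the spirit of Nigam et al. (2000), the function names (`train_nb`, `posterior`, `semi_supervised_nb`) and the parameter `lam` (standing in for the weighting factor) are hypothetical, and the toy documents are invented for the example.

```python
import math
from collections import Counter


def train_nb(docs, weights, classes, vocab, alpha=1.0):
    """M step: fit multinomial Naive Bayes from (possibly fractional) class weights.

    docs    -- list of token lists
    weights -- list of dicts mapping class -> weight of that document for the class
    alpha   -- Laplace smoothing constant (an assumption of this sketch)
    """
    prior_num = {c: alpha for c in classes}
    word_num = {c: Counter() for c in classes}
    for tokens, w in zip(docs, weights):
        for c in classes:
            prior_num[c] += w[c]
            for t in tokens:
                word_num[c][t] += w[c]
    total = sum(prior_num.values())
    log_prior = {c: math.log(prior_num[c] / total) for c in classes}
    log_word = {}
    for c in classes:
        denom = sum(word_num[c].values()) + alpha * len(vocab)
        log_word[c] = {t: math.log((word_num[c][t] + alpha) / denom) for t in vocab}
    return log_prior, log_word


def posterior(tokens, model, classes):
    """E step for one document: P(class | tokens) under the current model."""
    log_prior, log_word = model
    scores = {
        c: log_prior[c] + sum(log_word[c][t] for t in tokens if t in log_word[c])
        for c in classes
    }
    m = max(scores.values())  # subtract the max for numerical stability
    exp = {c: math.exp(s - m) for c, s in scores.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}


def semi_supervised_nb(labeled, unlabeled, lam=0.5, iters=10):
    """EM over labeled (tokens, label) pairs and unlabeled token lists.

    lam down-weights the soft counts contributed by unlabeled documents.
    """
    classes = sorted({y for _, y in labeled})
    vocab = sorted({t for d, _ in labeled for t in d} | {t for d in unlabeled for t in d})
    lab_docs = [d for d, _ in labeled]
    lab_w = [{c: 1.0 if c == y else 0.0 for c in classes} for _, y in labeled]
    model = train_nb(lab_docs, lab_w, classes, vocab)  # initialize from labeled data only
    for _ in range(iters):
        soft = [posterior(d, model, classes) for d in unlabeled]        # E step
        unl_w = [{c: lam * p[c] for c in classes} for p in soft]        # down-weight
        model = train_nb(lab_docs + unlabeled, lab_w + unl_w, classes, vocab)  # M step
    return model, classes


# Toy data (invented): one labeled document per class plus a few unlabeled ones.
labeled = [(["oil", "gas", "price"], "energy"), (["team", "game", "coach"], "sports")]
unlabeled = [["oil", "pipeline", "gas"], ["pipeline", "price"],
             ["coach", "player", "game"], ["player", "team"]]
model, classes = semi_supervised_nb(labeled, unlabeled, lam=0.5)
pred = posterior(["pipeline"], model, classes)
print(pred)
```

In this toy run the word "pipeline" never appears in a labeled document, so the labeled-only classifier cannot tell the classes apart on it; after EM, the unlabeled co-occurrences with "oil", "gas", and "price" shift its probability toward the "energy" class. Setting `lam=0` recovers plain supervised Naïve Bayes, and in practice the value would be chosen by cross validation, as the lecture notes.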

