Stanford CS 224 - Lecture Notes


1. Introduction
2. System Architecture
2.1 Thread Extraction
2.2 Topic Extraction
2.2.1 Chi-square
2.2.2 Latent Semantic Analysis
2.3 Topic Clustering
3. Results
3.1 Discussions
4. Conclusions and Future Work

Automatic Topic Extraction and Classification of Usenet Threads

Susumu Harada, Shashikant Khandelwal
Dept. of Computer Science
Stanford University
{harada,kshashi}@stanford.edu

Abstract

We implemented a topic extraction system that takes as input a collection of postings from a Usenet newsgroup and outputs a list of prominent topics that characterize the contents of the newsgroup, and for each topic, gives the set of threads that discuss the topic along with their relevance measure with respect to the topic. We use two methods, chi-square feature extraction and Latent Semantic Analysis, to extract the topic terms.

1. Introduction

Usenet has been providing a means for people across the world to participate in online discussions on a variety of topics since the early 1980s, and still continues to draw thousands of postings a day. With so many postings and a great number of different threads proceeding concurrently, it is a great challenge to identify the set of postings or threads that relate to a specific topic one is interested in. Current Usenet clients can group postings by the thread they belong to, but they do not provide any further organization. The technical details discussed in newsgroups are highly valued and often give better answers, particularly to technical questions, than searching the web does. However, it is a daunting task to search the complete list of newsgroups, many of which may be talking about the same topic, and browse through all the threads searching for the right ones.

A system that could automatically process the newsgroup and generate a list of topics being talked about could be useful to give a brief synopsis of the whole newsgroup and also for advanced search. Even a goal-based traditional IR system could be significantly improved if backed by some classification or metadata about the newsgroup postings. Also, given the weights of the topics and their relative similarity measures among each other as well as between documents, one can generate a visualization of the newsgroup threads and their topics, providing a way to both visualize and navigate the clustered threads easily.

We implemented a system to automatically extract newsgroup threads and categorize them based on the most prominent topical categories. We implement and compare two methods for feature extraction from newsgroup threads, one based on purely statistical methods and the other making use of semantic information in the newsgroup threads in order to build a topical structure for a newsgroup corpus. We compared these two methods, viz. chi-square based feature extraction and Latent Semantic Analysis, to determine which method would yield better results. Our initial hypothesis was that Latent Semantic Analysis would be able to extract more sensible topics, as suggested by much of the literature, but in practice the difference in results from the two methods was not very significant.

2. System Architecture

The system architecture diagram is shown in Figure 1. The following sections describe each stage of the processing pipeline.

2.1 Thread Extraction

The first task involves extracting thread structure from the raw dump of the newsgroup postings. The dump was acquired by exporting the entire newsgroup into a single text file through PINE with full header information for each message.

The dump file is then fed through the thread extractor, which organizes each message in a tree structure corresponding to the thread order, determined by parsing the message headers. During this process, a corpus of the entire newsgroup is constructed incrementally as each new term is encountered, together with statistical information for each term such as the term frequencies (number of times a term occurred in a thread) and thread frequencies (number of threads in which the term occurred). Since we used Java for ease of implementation, we make ample use of references to keep the memory usage to the minimum.

We also applied a lemmatizer to strip off the non-content words, based on existing lists [11] [12] combined with our own additions, as features in newsgroups are mostly nouns and, sometimes, verbs. In addition, a word-filtering module attempts to filter out any other undesired words based on a set of heuristics. First, any words starting with a non-alphabetic character are discarded. It also removes any words starting with URL prefixes (“http”, “ftp”, etc.) or containing “@”, “.” or “?”. The Porter Stemming Algorithm [10] was also used to convert words into their root form to reduce the size of the corpus.

Headers, which contain information (such as date, sender’s email, etc.) that might be relevant for data mining procedures and must consequently be recognized and stored, were also parsed out. The interpretation of newsgroup items is often dependent on the context, i.e. on the list of messages to which the current message is an answer.

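The word-filtering heuristics and the per-term statistics described above can be illustrated with a minimal Python sketch. It is not the authors' Java implementation: the function names, the whitespace tokenization, the stripping of surrounding punctuation, and the exact prefix list are all assumptions made for illustration, and lemmatization and Porter stemming are left out.

```python
from collections import Counter

# The text names "http" and "ftp" and then says "etc."; this tuple is an
# assumption and can be extended.
URL_PREFIXES = ("http", "ftp")
FORBIDDEN_CHARS = ("@", ".", "?")
# Assumption: punctuation around a token is stripped before filtering, so a
# sentence-final word such as "boot." is not rejected for containing ".".
SURROUNDING_PUNCT = ".,;:!?\"'()"


def keep_word(word: str) -> bool:
    """Apply the filtering heuristics from the text to a single token."""
    if not word or not word[0].isalpha():
        return False  # starts with a non-alphabetic character
    if word.lower().startswith(URL_PREFIXES):
        return False  # starts with a URL prefix
    if any(ch in word for ch in FORBIDDEN_CHARS):
        return False  # contains "@", "." or "?"
    return True


def build_corpus_stats(threads):
    """Return (term_freq, thread_freq) for a mapping thread_id -> text.

    term_freq[thread_id][term] is how often the term occurs in that thread;
    thread_freq[term] is the number of threads in which the term occurs.
    """
    term_freq = {}
    thread_freq = Counter()
    for thread_id, text in threads.items():
        tokens = (w.strip(SURROUNDING_PUNCT) for w in text.split())
        counts = Counter(t.lower() for t in tokens if keep_word(t))
        term_freq[thread_id] = counts
        thread_freq.update(counts.keys())  # one increment per thread
    return term_freq, thread_freq
```

Given a dictionary that maps hypothetical thread identifiers to their concatenated text, `build_corpus_stats` returns both tables in one pass, which mirrors the incremental corpus construction described in this section.
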
As previous messages are often reported in the email using standard graphical conventions (most often a “>” sign), it is extremely important 1) to avoid extracting information from reported emails; 2) to preserve the text of reported email for the phase of semantic interpretation, in order to provide a correct resolution of anaphoric references. However, we did not pursue this problem further: we interpreted each thread as belonging to one topic, and hence treated the whole thread as one document, neglecting the quoted messages from previous postings while parsing.

Finally, the word-filtering module included an adjustable parameter which specified a threshold number of times a term must occur across the entire newsgroup in order to be included in the corpus. The reasoning behind this was
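
The two parsing decisions in this part of the section, skipping quoted text and applying an occurrence threshold, might look roughly like the following Python sketch. The function names are hypothetical, a quoted line is assumed to be any line whose first non-blank character is “>”, and the threshold is assumed to be inclusive (a term is kept when its total count is at least the threshold); the source does not specify these details.

```python
from collections import Counter


def strip_quoted_lines(body: str) -> str:
    """Drop lines quoted from earlier postings (assumed to start with '>')."""
    kept = [line for line in body.splitlines()
            if not line.lstrip().startswith(">")]
    return "\n".join(kept)


def apply_threshold(term_counts_per_thread, min_total: int):
    """Keep only terms whose count over the whole newsgroup is >= min_total.

    term_counts_per_thread maps thread_id -> {term: count in that thread}.
    """
    totals = Counter()
    for counts in term_counts_per_thread.values():
        totals.update(counts)  # adds per-thread counts to the global totals
    vocabulary = {term for term, n in totals.items() if n >= min_total}
    return {
        thread_id: {t: n for t, n in counts.items() if t in vocabulary}
        for thread_id, counts in term_counts_per_thread.items()
    }
```

Raising `min_total` shrinks the vocabulary, so it acts as the adjustable parameter mentioned above.
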

