Stanford CS 276 - An Internet Forum Index

Slide titles:
Overview, Forum Examples, vBulletin, phpBB, UBB, gentoo, evolutionM, bayarea, prelude, warcraft, Paw talk, Current Solutions, Google, lycos, internal, boardsearch, Slide 17, Evaluation Metric, Implementation,
Improving Software Package Search Quality, The Problem, Sourceforge.org, Gentoo.org, Freshmeat.net, How can we improve this?, Better Sources of Information, Building the System, How do we measure success?, Any questions?,
Incorporating Social Clusters in Email Classification, Previous Work, Email Classification, Incorporating Social Clusters, Evaluation, Extensions, References,
A research literature search engine with abbreviation recognition, Outline, Motivation, Slide 40, Search result in Google Scholar, Goal, Approach, Architecture, Technology, Slide 46,
A Web-based Question Answering System, Outline, QA Background, Our QA System, System Architecture, Question Classifier, Query Rewrite, Answer Pattern Learning, Slide 55,
Streaming XPath Engine, Traditional XML Processing, Streaming XML Processing, What is XPath?, A Simple Example, Objective, XPath Challenges, Algorithms, Slide 64

Board Search
An Internet Forum Index

Overview
• Forums provide a wealth of information
• Semi-structured data is not taken advantage of by popular search software
• Despite being crawled, many information-rich posts are lost because of low page rank

Forum Examples
• vBulletin
• phpBB
• UBB
• Invision
• YaBB
• Phorum
• WWWBoard

[Example slides (titles only): vBulletin, phpBB, UBB, gentoo, evolutionM, bayarea, prelude, warcraft, Paw talk]

Current Solutions
• Search engines
• Forum’s internal search

[Example slides (titles only): Google, lycos, internal, boardsearch]

Evaluation Metric
Metrics: Recall = C/N, Precision = C/E
Rival system:
• Rival system is the search engine / forum internal search combination
• Rival system lacks precision
Evaluations:
• How good our system is at finding forums
• How good our system is at finding relevant posts/threads
Problems:
• Relevance is in the eye of the beholder
• How many correct extractions exist?

Implementation
• Lucene
• MySQL
• Ted Grenager’s Crawler Source
• Jakarta HTTPClient

Improving Software Package Search Quality
Dan Fingal and Jamie Nicolson

The Problem
• Search engines for software packages typically perform poorly
• Tend to search project name and blurb only
• For example…

[Example slides (titles only): Sourceforge.org, Gentoo.org, Freshmeat.net]

How can we improve this?
• Better keyword matching
• Better ranking of the results
• Better source of information about the package
• Pulling in nearest neighbors of top matches

Better Sources of Information
• Every package is associated with a website that contains much more detailed information about it
• Spidering these sites should give us a richer representation of the package
• Freshmeat.net has data regarding popularity, vitality, and user ratings

Building the System
• Will spider freshmeat.net and the project webpages and put the results into a MySQL database
• Also convert the gentoo package database to MySQL
• Text indexing done with Lucene
• Results generator will combine this with other available metrics

How do we measure success?
• Create a gold corpus mapping queries to relevant packages
• Measure precision within the first N results
• Compare results with search on packages.gentoo.org, freshmeat.net, and google.com

Any questions?

Incorporating Social Clusters in Email Classification
By Mahesh Kumar Chhaparia

Previous Work
• Previous work on email classification focuses mostly on:
  – Binary classification (spam vs. non-spam)
  – Supervised learning techniques for grouping into multiple existing folders
    • Rule-based learning, naïve-Bayes classifier, support vector machines
• Sender and recipient information usually discarded
• Some existing classification tools
  – POPFile: Naïve-Bayes classifier
  – RIPPER: Rule-based learning
  – MailCat: TF-IDF weighting

Email Classification
• Emails:
  – Usually small documents
  – Keyword sharing across related emails may be small or indistinctive
  – Hence, on-the-fly training may be slow
  – Classifications change over time, and
  – Different for different users!
• Motivation:
  – The sender-receiver link mostly has a unique role (social/professional) for a particular user
  – Hence, it may be used as one of the distinctive characteristics of classification

Incorporating Social Clusters
• Identify initial social clusters (unsupervised)
• Weights to distinguish
  – From and cc fields,
  – Number of occurrences in distinct emails
• Study effects of incorporating sender and recipient information:
  – Can it substitute part of the training required?
  – Can it compensate for documental evidence of similarity?
  – Quality of results vs. training time tradeoff?
  – How does it affect regular classification if used as terms too?

Evaluation
• The Enron Email Dataset was recently made public
  – The only substantial collection of “real” email that is public
  – Fast becoming a benchmark for most of the experiments in
    • Social Network Analysis
    • Email Classification
    • Textual Analysis …
• Study/comparison of the aforementioned metrics against the folder classification already available in the Enron Dataset

Extensions
• Role discovery using Author-Topic-Recipient Model to facilitate classification
• Lexicon expansion to capture similarity in small amounts of data
• Using past history of conversation to relate messages

References
• Provost, J. “Naïve-Bayes vs. Rule-Learning in Classification of Email.” Technical Report AI-TR-99-284, Artificial Intelligence Lab, The University of Texas at Austin, 1999.
• Crawford, E., Kay, J., and McCreath, E. “Automatic Induction of Rules for E-mail Classification.” Proc. Australasian Document Computing Symposium, 2001.
• Kiritchenko, S. and Matwin, S. “Email Classification with Co-Training.” CASCON’02 (IBM Center for Advanced Studies Conference), Toronto, 2002.
• Turenne, N. “Learning Semantic Classes for improving Email Classification.” Proc. IJCAI 2003, Text-Mining and Link-Analysis Workshop, 2003.
• Arey, M. and Chakravarthy, S. “eMailSift: Adapting Graph Mining Techniques for Email Classification.” SIGIR 2004.

A research literature search engine with abbreviation recognition
Group members: Cheng-Tao Chu, Pei-Chin Wang

Outline
• Motivation
• Approach
  – Architecture
• Technology
• Evaluation

Motivation
• Existing research literature search engines do not handle abbreviations of author, conference, and proceedings names well
• Ex: search “C. Manning, IJCAI” in Citeseer, Google Scholar

Search result in Google Scholar

Goal
• Instead of searching by index only, identify the semantics of the query
• Recognize abbreviations of author and proceedings names

Approach
• Crawl DBLP as the data source
• Index the data
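
The abbreviation-recognition goal above can be illustrated with a minimal Python sketch. This is not the authors' method, which the slides do not describe: the function names (`author_matches`, `acronym`, `venue_matches`), the stopword list, and the matching rules are all assumptions made for illustration. It only shows one simple way a query such as “C. Manning, IJCAI” might be matched against full author and venue names.

```python
import re

# Hypothetical stopword list: words skipped when forming a venue acronym.
STOPWORDS = {"on", "of", "the", "in", "for", "and"}


def _name_tokens(name: str) -> list[str]:
    """Lowercase a name and split it into tokens, dropping trailing periods."""
    return [t.strip(".").lower() for t in name.split() if t.strip(".")]


def author_matches(query: str, full_name: str) -> bool:
    """Return True if an abbreviated query such as "C. Manning" is
    compatible with a full name such as "Christopher D. Manning"."""
    q_tokens = _name_tokens(query)
    n_tokens = _name_tokens(full_name)
    if not q_tokens or not n_tokens:
        return False
    # Assumption: the surname is the last token and must agree exactly.
    if q_tokens[-1] != n_tokens[-1]:
        return False
    # Every other query token must be a prefix of some given name, in order.
    given = iter(n_tokens[:-1])
    return all(any(g.startswith(q) for g in given) for q in q_tokens[:-1])


def acronym(title: str) -> str:
    """Build an acronym from the initial letters of the non-stopword words."""
    words = re.findall(r"[A-Za-z]+", title)
    return "".join(w[0].upper() for w in words if w.lower() not in STOPWORDS)


def venue_matches(query: str, title: str) -> bool:
    """Match a venue query either as the full title or as its acronym."""
    q = query.strip()
    return q.lower() == title.lower() or q.upper() == acronym(title)


if __name__ == "__main__":
    venue = "International Joint Conference on Artificial Intelligence"
    print(author_matches("C. Manning", "Christopher D. Manning"))
    print(venue_matches("IJCAI", venue))
```

A real system would presumably need more than this, for example venue acronyms that are not simple initialisms and names whose surname is not the last token; the sketch ignores both cases.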

