CS276B
Text Retrieval and Mining
Winter 2005

Project Practicum 2

Plan for today
- General discussion of your proposals
- Sample project overview (what you have to turn in on Tuesday)
- More tools you might want to use
- More examples of past projects

General feedback on proposals
- We need more specifics on what exactly you’re planning to build.
  - Vagueness was fine for the proposals, but it’s not appropriate for your overview.
  - Avoid discussion of “possible applications”: your overview is a commitment to develop a fleshed-out, polished application.
- Be ambitious but realistic. It’s okay if at some future point you realize that you don’t have time to implement every feature described in your overview, but your final product should not deviate too far from the scope of your overview.
- Measurement criteria are essential.
  - Creating a cool application is great but not sufficient: you also need a predetermined standard for evaluating the success or failure of your work.
  - Some kind of scientific numerical analysis of your system’s performance in comparison to a baseline or rival system:
    - precision/recall
    - user satisfaction ratings
    - correlation or mean squared error (if you’re predicting values)
    - processing time, main memory requirements, disk space
- Remember: a successful project doesn’t have to achieve great performance!
  - Of course it’s better to get good results…
  - But there can be significant value in trying something interesting and finding that it doesn’t work very well.
  - So don’t be afraid to explore an idea that isn’t guaranteed to pan out, as long as there’s reason to believe that it might.

Project overview: suggested structure
- Title
- Group members
- Abstract (one short paragraph)
- Topic(s) investigated
- Relevant prior work (paper citations, actual systems)
- Delineation of group member responsibilities
- Data sources
- Technologies (programming languages, software, etc.)
- Existing tools leveraged
- Implementation details
- Submission calendar:
  - Block 1
  - Block 2
  - Block 3 (final product)

Sample project overview
(idealized; not my actual proposal!)

MovieThing: A web-based collaborative filtering movie recommendation system

Group: Louis Eisenberg (CS coterm) and Joe User (CS senior)

Abstract: I will conduct an online experiment by building a website on which registered users can provide ratings for popular movies using a graphical interface. Once I have collected ratings from a substantial number of users, I will generate movie recommendations, assigning each user randomly to one of a handful of distinct recommendation algorithms. I will then solicit feedback from the users on the quality of the recommendations and use that feedback to perform a qualitative analysis of the relative accuracy of the different algorithms.

Topics investigated: collaborative filtering, recommendation systems

Relevant prior work:
- MovieLens (U. of Minn.)
- Jester (UC-Berkeley)
- CF research papers: http://jamesthornton.com/cf/
- Empirical Analysis of Predictive Algorithms for Collaborative Filtering: http://research.microsoft.com/users/breese/cfalgs.html
- More research papers…

Group member responsibilities:
- Louis: set up database, JDBC and utility code, JavaScript sliders, evaluation code
- Joe: AWS code, JSP and servlet front-end code, literature review
- Both: fill movie table, design CF algorithms, recruit subjects, write final paper

Data sources:
- Movie data (title, actors, genres, etc.) from IMDB and Amazon
- Movie ratings supplied by my users
- Amazon product similarity data

Technologies: servlets/JSP, JavaScript, MySQL

Existing tools leveraged: Amazon Web Services

Implementation details:
- Website will display movies in tabular format with ability to search/filter by title, genre, actors, etc.
- Users rate movies by dragging sliders.
- Algorithms:
  - Amazon: use product similarity to generate predicted ratings based on weighted averages using user’s ratings and movies considered “similar” to those the user has rated
  - Standard: predicted ratings are weighted averages using user’s Pearson correlation to other users and the ratings of the other users
  - General deviation: emphasize movies for which user has an unusual opinion by introducing additional term into covariance calculation (which factors into user similarity weight)
  - Personal deviation: emphasize movies about which user feels strongly by cubing covariance terms
  - Both deviations: combine tweaks of general and personal
- Evaluation:
  - Overall ratings of quality of recommendation lists
  - Correlation between predicted and actual ratings for recommended movies that user has already seen

Submission calendar:
- Block 1:
  - movies table is fully populated
  - website is live and accepting ratings
- Block 2:
  - sufficient users and ratings have been collected
  - Amazon similarity data has been retrieved
  - recommendation algorithms are functional
- Block 3:
  - users have received recommendations and provided feedback
  - final paper includes analysis of algorithms’ relative performance

Notes on sample project overview
- Your overview should be more extensive than this sample…
  - More specific implementation details, particularly in regard to algorithms
  - More specific goals for each block/milestone
  - Contingency plans for slight modifications to your project if you encounter obstacles?

More tools

MALLET: A Machine Learning for Language Toolkit
- http://mallet.cs.umass.edu/
- “an integrated collection of Java code useful for statistical natural language processing, document classification, clustering, information extraction, and other machine learning applications to text”
- Minimally documented but has lots of stuff:
  - Building feature vectors
  - Various classification methods (Naïve Bayes, max-ent, boosting, winnowing)
  - Evaluation: precision, recall, F1, etc.
  - N-grams
  - Selecting features using information gain
- They have some examples of front-end code

MinorThird
- http://minorthird.sourceforge.net/
- “a collection of Java classes for storing text, annotating text, and learning to extract entities and categorize text”
- Documentation seems to be pretty good: comprehensive Javadocs, tutorial, FAQ…
- Has the concept of “spans” (sequences of words) that can be extracted and classified based on
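
Returning to the sample project overview: its “Standard” algorithm (predicted ratings as weighted averages using the user’s Pearson correlation to other users) and its evaluation criterion (correlation between predicted and actual ratings) can be illustrated with a small sketch. This is not the sample project’s code. It is written in Python for brevity, although the sample names servlets/JSP. The user names, movie IDs, and ratings are made up, and the mean-offset form of the weighted average is one common formulation that is assumed here rather than taken from the slides.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; 0.0 if undefined."""
    n = len(xs)
    if n < 2:
        return 0.0
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def user_similarity(ratings, a, b):
    """Pearson correlation between two users over the movies both have rated."""
    common = sorted(set(ratings[a]) & set(ratings[b]))
    return pearson([ratings[a][m] for m in common],
                   [ratings[b][m] for m in common])

def predict(ratings, user, movie):
    """Predict a rating as the user's mean plus a similarity-weighted
    average of other users' deviations from their own means
    (assumed formulation, see the lead-in)."""
    mean_u = sum(ratings[user].values()) / len(ratings[user])
    num = den = 0.0
    for other, theirs in ratings.items():
        if other == user or movie not in theirs:
            continue
        w = user_similarity(ratings, user, other)
        mean_o = sum(theirs.values()) / len(theirs)
        num += w * (theirs[movie] - mean_o)
        den += abs(w)
    # Fall back to the user's own mean when no neighbor carries any weight.
    return mean_u if den == 0 else mean_u + num / den

# Hypothetical ratings on a 1-5 scale: user -> {movie: rating}.
ratings = {
    "ann": {"A": 5, "B": 3, "C": 4},
    "bob": {"A": 4, "B": 2, "C": 3, "D": 5},
    "cat": {"A": 1, "B": 5, "C": 2, "D": 1},
    "dan": {"A": 3},
}

if __name__ == "__main__":
    print("similarity ann/bob:", user_similarity(ratings, "ann", "bob"))
    print("predicted ann on D:", predict(ratings, "ann", "D"))
```

The same `pearson` function can serve the evaluation step: pass it the list of predicted ratings and the list of actual ratings for movies a user has already seen. Note that a mean-offset prediction can fall outside the rating scale, so a real system would probably clamp it.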