A Politics Mash-Up

Inki Hong ([email protected])
Faculty Advisor: Ben Taskar ([email protected])

Abstract

A mashup on the World Wide Web is an application that collects data from multiple sources of information and presents them in a unique way. This paper describes a mashup on politics that integrates both geographical and chronological dimensions regarding the 2008 presidential election of the United States of America. Users of this mashup can browse through news articles, poll results, donations/contributions, and so forth, arranged geographically over a time frame. The system helps users organize diverse pieces of information and facilitates geographical/chronological interpretation that would otherwise remain hidden behind a disorganized collection of sources.

Related Work

Most of the portal sites with a large market share have created some form of election section. Yahoo! has opened a Democratic candidate mashup (http://debates.news.yahoo.com/) which focuses on video interviews and poll results. CNN (http://www.cnn.com/ELECTION/2008/) has broad coverage of election-related news. Although these sites put together mixed forms of content such as video and message boards, they are still more traditional portal sites than mashups, in the sense that the sites themselves are large producers of content.

Some smaller sites, such as 186kbps.com (2008), integrate a large variety of political news sources such as Reuters, USA Today, and Yahoo! News. Only a few of the articles, however, are grouped together into meaningful categories: mashing up is used to collect more than one source, but the site itself adds little value in presenting them. One attempt to link politics to the Google Maps service can be found in a web post (Sinreich, 2007) and in a website by Matthew Kane (2005) which tries to track political campaign contributions geographically.

This project focuses on both geographical and chronological interpretations. A work by Veronis (2007) discusses the correlation between the number of candidate citations and the actual result (contrasted with poll results and the outcome). Users are able to add a geographical perspective to this kind of interpretation and prediction with the help of this mashup.

The New York Times has compiled an up-to-date geographical distribution of campaign donations on its web site, where users can browse by individual candidate and further by time frame. It also lists past and upcoming schedules for each candidate by location (The New York Times Company, 2008). An excellent site on campaign contributions is "The FundRace 2008" by the Huffington Post (http://fundrace.huffingtonpost.com/), which provides its data to other sites as well.

This system also includes geographical/chronological media exposure, based on the observation by Veronis (2007) that it has a significant correlation with the actual election result. Very similar to the approach that this system takes is "Map the Candidates" (Slate.com, 2008). The difference is that while that web site primarily focuses on candidate events, this system focuses, based on the observation by Veronis (2007), on candidates' media exposure, poll results, and contributions, of which one is expected to be very indicative of the actual result and another less predictive than many people have traditionally assumed.

Technical Approach and Challenges

System Composition

The system is composed of three major subsystems. First, the back subsystem is responsible for collecting data from multiple sources of information, analyzing the assembled data, and extracting useful information from them. Second, the middle subsystem is responsible for permanently storing the constructed information in a structured way, so that it serves as a link between the front and the back. Finally, the front subsystem is responsible for presenting these pieces of information in a geographically and chronologically meaningful way.
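To make the division of responsibilities concrete, the following is a minimal sketch of the three-subsystem boundary in Java, the language used for the back end. The interface and class names are hypothetical; the paper does not name its internal types.

import java.util.Date;
import java.util.List;

// One extracted fact with its what/where/when: the unit passed between subsystems.
class Item {
    String kind;        // e.g. "news", "poll", "contribution"
    String location;    // state or city the item is tied to
    Date date;          // when the item was published or reported
    String payload;     // headline, poll figure, donation amount, etc.
}

// Back subsystem: crawls the sources and turns raw pages into Items.
interface BackEnd {
    List<Item> collect();
}

// Middle subsystem: stores Items so the front never talks to the crawlers directly.
interface MiddleStore {
    void save(List<Item> items);
    List<Item> query(String location, Date from, Date to);
}

// Front subsystem: renders a set of Items on a map over a chosen time frame.
interface FrontEnd {
    void render(List<Item> items);
}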
Back End: Crawling

The back end is mainly based on several web crawlers which periodically fetch news articles, poll results, and other data available on the web and store them. Stand-alone web crawler programs were developed in Java (version 6) to leverage the rich set of available libraries that facilitate the development process. The Eclipse platform was used, as it is the most widely used Java development tool, and the crawlers were developed and tested on a regular IBM-compatible desktop PC. For the networking modules, Sun Microsystems's own network libraries as well as Apache.org's HTTP Client library (http://jakarta.apache.org/httpcomponents/) have been used. The crawler is a single-threaded process.

Since the system only deals with HTML files, which are small most of the time, data are processed as memory-based InputStream-like objects rather than disk-based File-like objects. During processing, HTML is converted to an XML document to make easy use of its tree structure (http://java-source.net/open-source/html-parsers/jtidy). Once an XML representation is available, content extractors apply pre-specified rules and information retrieval techniques to extract useful information and store it in the middle subsystem.

Back End: Web Sites

PresidentPolls (http://www.presidentpolls2008.com/) is an excellent resource for poll results about presidential candidates. It has a convenient structure where each day's results are summarized in a table and all previous results can be downloaded by following a single link. All donation/contribution information can be searched at the Federal Election Commission (http://www.fec.gov/). It also has a parser-friendly structure, only with a larger amount of data.

News articles are, as one can naturally expect, very heterogeneous. Since it was not feasible to come up with a master parser for appropriate link detection, representative newspapers with high circulation were selected (BurrellsLuce, 2007). Another method attempted for collecting news articles is meta-searching search engines. For example, USA Today (http://www.usatoday.com), the newspaper with the highest circulation in the United States, provides an archive search service, but for free it returns only a headline and an excerpt. In that case, two-pass crawling was attempted: the first pass collects headlines from the archive, and in the second pass those headlines are searched against popular search engines such as Yahoo! or Google to find links to the original URLs.

As the number of web sites visited grows, the time to build a new parser for a specific site should be reduced. Although every site has its own presentation
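As an illustration of the fetch-and-parse step described under "Back End: Crawling", the following is a minimal sketch that downloads a page into memory with the standard java.net classes and converts it to a DOM tree with JTidy (the HTML parser linked above). The class and method names are hypothetical and the extraction step is only hinted at in a comment.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.URL;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

public class PageFetcher {

    // Downloads a page entirely into memory; no temporary files are written.
    static byte[] fetch(String url) throws Exception {
        InputStream in = new URL(url).openStream();
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        for (int n; (n = in.read(chunk)) != -1; ) {
            buf.write(chunk, 0, n);
        }
        in.close();
        return buf.toByteArray();
    }

    // Converts raw HTML into a well-formed DOM tree using JTidy.
    static Document toDom(byte[] html) {
        Tidy tidy = new Tidy();
        tidy.setXmlOut(true);          // ask for XHTML so the tree is well formed
        tidy.setQuiet(true);
        tidy.setShowWarnings(false);
        return tidy.parseDOM(new ByteArrayInputStream(html), null);
    }

    public static void main(String[] args) throws Exception {
        Document doc = toDom(fetch("http://www.presidentpolls2008.com/"));
        // A site-specific content extractor would walk the tree from here,
        // e.g. locating the daily poll-summary table by tag name and position.
        NodeList tables = doc.getElementsByTagName("table");
        System.out.println("tables found: " + tables.getLength());
    }
}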
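The two-pass strategy used for the USA Today archive can be sketched as follows. The headline scraping and result-page parsing are site-specific and only stubbed out here (they would reuse the JTidy-based parser above), and the query URL pattern and method names are assumptions rather than the paper's actual code.

import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;

public class TwoPassCrawler {

    // Pass 1: walk the newspaper's archive-search pages and collect headlines.
    static List<String> collectHeadlines(String archiveSearchUrl) {
        return new ArrayList<String>();   // site-specific extraction stubbed out
    }

    // Pass 2 helper: turn one headline into a quoted query for a search engine.
    static String buildQueryUrl(String engineBase, String headline) throws Exception {
        return engineBase + URLEncoder.encode("\"" + headline + "\"", "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        String archive = "archive-search URL for the chosen newspaper"; // placeholder
        for (String headline : collectHeadlines(archive)) {
            String query = buildQueryUrl("http://search.yahoo.com/search?p=", headline);
            // Fetch `query`, parse the result page, and keep the first link whose
            // host matches the newspaper's domain as the article's original URL.
            System.out.println(headline + " -> " + query);
        }
    }
}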

