Penn CIS 400 - Intelligent RSS Aggregation


IRA: Intelligent RSS Aggregation

Scott Graham ([email protected])
Advisor: Lyle Ungar

Abstract

RDF Site Summary (RSS) is an Extensible Markup Language (XML) format for syndicating articles on the internet to desktop and web aggregators. RSS was officially released in December 2000 and has been quickly adopted by the internet community, including major sites such as the New York Times, the Christian Science Monitor, and Scientific American. The broad acceptance and relative ease of use of RSS have prompted the development of multiple desktop aggregators for reading RSS articles. Intelligent RSS Aggregation (IRA) aims to replicate the concept of TiVo with RSS feeds by watching usage patterns in Firefox and suggesting articles tailored to the specific interests of the user. By analyzing the content viewed within a user’s web browser, it is possible to suggest new articles and RSS feeds. These articles are suggested by Bayesian heuristics based on the web browsing habits and bookmarks of the user.

[Figure: Blog/RSS Activity From April 2004 to January 2005. Source: http://www.pubsub.com/tracking.php]

Related Work

Existing Products

Dozens of RSS readers already exist for all major platforms. However, each reader is primarily concerned with article delivery and RSS subscription management; aggregating news articles within one piece of software is the primary feature of these products. NetNewsWire, SharpReader, and Straw rank among the most popular readers for Macintosh, Windows, and Linux, respectively.

Existing Research

Most existing research on news aggregation centers on the intelligent clustering of newsgroup articles. Newsgroup artificial intelligence (AI) is roughly split between collaborative filtering and information filtering.
Collaborative filtering aggregates data provided by a large set of users and recommends new articles to specific users based on their personal preferences. Information filtering attempts to recommend new articles (and information in general) by finding patterns in large amounts of data and categorizing new data according to these patterns. Many information filtering techniques are relevant to intelligent RSS aggregation. However, collaborative filtering is not applicable to IRA because of privacy concerns and the complex connectivity issues that arise from a large client base. Despite the advantages of collaborative filtering for products and information, its complexity has pushed many of the new developments toward online companies with vast resources that are interested in selling additional products. For example, Amazon.com uses a database of customer purchases to recommend new products based on a user’s individual actions within its website.

Intelligent RSS Aggregation

No existing RSS reader applies artificial intelligence to recommend articles or RSS feeds. IRA was conceived to fill a void in the RSS market by identifying interesting articles based on a user’s previous behavior. One of the major problems with RSS is identifying new feeds that are relevant to the user. In the past twelve months, a few services for searching RSS and blogs have gained popularity, but they require explicit user input and must be manually configured. Today, the estimated number of RSS feeds is between 10 million and 20 million. As the RSS space grows, it will be increasingly important to pull pertinent information from the comparative noise of irrelevant RSS feeds.

Technical Approach

Project Evolution

As initially conceived in September, the project had three major components:

• A Mozilla browser extension to collect web content and RSS feeds.
• A database for heuristics and usage patterns.
• An RSS reader to read and suggest RSS feeds and articles.

The browser would pass keywords and RSS feeds to a central database, from which they are later retrieved by the RSS reader. The reader uses the keywords to generate a profile for the user, which is continually updated. User actions and selected RSS articles within the RSS reader reinforce the profile heuristics. RSS feeds are polled for relevant articles according to the user’s profile, and matching articles are suggested within the RSS reader.

The basic project framework has remained intact, but the components have shifted slightly due to the technical constraints of Mozilla and typical RSS usage. The majority of RSS feeds contain a short description of a news article, blog entry, or general text entry and a URL pointing to the full contents of the entry. The average user of an RSS reader browses the list of available articles for something that appears interesting and then clicks the associated link, which launches a web browser. With this in mind, there was no identifiable reason to create a separate RSS reader, since the reader functionality can be embedded inside the Mozilla extension. Embedding the reader has the added benefit of allowing IRA to reuse the heuristics code for web pages with minimal modifications.

The second change to the initial project specification was to the database of heuristics and usage patterns. Maintaining a full relational database such as MySQL is unrealistic and unnecessary for the amount of web content and associated heuristic data. Instead, data is written to Java hashtables and saved as serialized objects on disk. Testing shows performance is acceptable, at the slight expense of features inherent to SQL and relational databases.

Browser Extension

Mozilla internet browsers (specifically Mozilla and Firefox) have full support for browser alterations via installable modules called browser extensions.
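The storage change described under Project Evolution, Java hashtables saved as serialized objects on disk, can be illustrated with a minimal sketch. The class name ProfileStore, its save and load methods, the keyword-to-count table layout, and the file name ira-profile.ser are all assumptions made for illustration; the source does not describe IRA's actual data layout.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.Hashtable;

// Hypothetical helper: the class, method, and file names are illustrative
// assumptions, not taken from the IRA source.
class ProfileStore {

    // Write the whole table to disk as one serialized object.
    static void save(Hashtable<String, Integer> counts, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(counts);
        }
    }

    // Read the table back; a missing file yields an empty table.
    @SuppressWarnings("unchecked")
    static Hashtable<String, Integer> load(File file) throws IOException, ClassNotFoundException {
        if (!file.exists()) {
            return new Hashtable<>();
        }
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (Hashtable<String, Integer>) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        File file = new File("ira-profile.ser");
        Hashtable<String, Integer> counts = load(file);
        counts.merge("rss", 1, Integer::sum); // record one more sighting of a keyword
        save(counts, file);
        System.out.println("Stored keywords: " + counts.size());
    }
}
```

Rewriting the whole table on every save is simple and avoids a database dependency, which matches the trade-off the text describes: acceptable performance for a small profile, without the querying features of SQL.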
The browser extension wraps JavaScript code and a Java daemon that runs while Firefox is active. Browser extensions have the same security settings as the browser itself and are tightly integrated into it. Extensions have full access to browser components, allowing an extension, if desired, to overwrite browser code with custom routines. Direct integration with the browser is extremely powerful, and users must explicitly install an extension and restart their browser for the extension to operate properly. The rising popularity of Firefox has spawned over a hundred extensions that customize different aspects of the browser. Mozilla and Firefox have many of the
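The abstract says articles are suggested by Bayesian heuristics built from the user's browsing habits, but the available text does not say which Bayesian method IRA uses. As one common possibility, the sketch below swaps in a multinomial naive Bayes classifier with add-one smoothing: it is trained on text the user did or did not show interest in, and scores a new article description by its log-odds of being interesting. The class name InterestModel, its methods, and the sample strings are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: multinomial naive Bayes with add-one smoothing.
// This is an assumed stand-in for IRA's unspecified "Bayesian heuristics".
class InterestModel {
    private final Map<String, Integer> likedCounts = new HashMap<>();
    private final Map<String, Integer> otherCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int likedTokens, otherTokens; // total words seen per class
    private int likedDocs, otherDocs;     // total texts seen per class

    // Lowercase the text and split it into alphanumeric words.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Add one text (a viewed page or an ignored article) to the profile.
    void train(String text, boolean interesting) {
        Map<String, Integer> counts = interesting ? likedCounts : otherCounts;
        for (String token : tokenize(text)) {
            counts.merge(token, 1, Integer::sum);
            vocabulary.add(token);
            if (interesting) likedTokens++; else otherTokens++;
        }
        if (interesting) likedDocs++; else otherDocs++;
    }

    // Log-odds that the text is interesting; positive means "suggest it".
    double score(String text) {
        double v = vocabulary.size() + 1.0; // +1 reserves mass for unseen words
        double logOdds = Math.log((likedDocs + 1.0) / (otherDocs + 1.0));
        for (String token : tokenize(text)) {
            double pLiked = (likedCounts.getOrDefault(token, 0) + 1.0) / (likedTokens + v);
            double pOther = (otherCounts.getOrDefault(token, 0) + 1.0) / (otherTokens + v);
            logOdds += Math.log(pLiked / pOther);
        }
        return logOdds;
    }

    public static void main(String[] args) {
        InterestModel model = new InterestModel();
        model.train("RSS feed aggregator for Firefox", true);
        model.train("celebrity gossip and fashion", false);
        System.out.println(model.score("new Firefox RSS extension"));
        System.out.println(model.score("fashion gossip"));
    }
}
```

Working in log-odds keeps the arithmetic stable for long texts, and the smoothing prevents a single unseen word from forcing a probability of zero. A real profile would also need stop-word removal and a larger training set than this example uses.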

