Penn CIS 400 - Instant Message Retrieval


Instant Message Retrieval
GoGetIM

Michael Seltzer
Faculty Advisor: Zachary Ives

Abstract

Many companies and individuals have recently developed a wide variety of software that attempts to monitor, store, and organize the vast collection of information that we interact with as computer users. Applications such as Google Desktop Search and MSN Search Toolbar allow users to easily find and access a wide variety of data, from e-mail to word processing documents. However, most of these applications ignore the characteristics and intricacies of other types of information. Instant messages are a prime example of information that is fundamentally different from other web or local information. They are essentially threads of back-and-forth messages occurring in spurts; however, they are currently not represented in this manner. Given these complexities, traditional indexing and retrieval procedures cannot be applied practically. Saving the conversation information is relatively easy; indexing, organizing, and retrieving it in a practical fashion is the real challenge.

The goal of my system was therefore to improve retrieval methods for conversations. While I primarily wanted to improve the process of searching for instant message conversations, several functions underlay this goal. These included searching by simple message content or through an advanced search that specifies other constraints. An example of a more advanced search would be a user searching for a message from conversations with a specific subset of users, within a certain range of dates. The goal also included producing search results that reflect the specific conversation the user is seeking, rather than the entire log file of the user’s conversations for the day or the single line in that log that contains the match.
Thus, logs can be shaped and presented in their original conversational context, instead of appearing within the concatenated and cumbersome file.

Related Work

There is currently a diverse selection of commercial applications that archive and retrieve both web-based and local information. Search engines such as Google and Yahoo! have developed and released desktop search software with the capability of archiving and retrieving instant message conversations. However, current search technology for instant messages is mostly limited to simple text matching. Additionally, some instant messaging clients or software add-ins currently store conversation history; however, retrieval is restricted to a basic search across all log files or to tedious manual browsing. For example, DeadAIM archives instant message conversations, but its built-in search is rudimentary and ineffective. Hence, there is room for improvement in establishing efficient techniques for accessing instant message conversations. The system that I implemented far surpasses these current systems in terms of the user’s ability to specify search limitations as well as the relevance of the information that the program retrieves.

Technical Approach

In order to keep the focus of my project on the retrieval of instant message conversations, I did not build any components to archive conversations. Instead, I used the output of the DeadAIM software as my archived instant message data. One of many third-party plug-ins for AOL Instant Messenger (AIM), DeadAIM automatically archives instant message conversations. For each user X and date Y, DeadAIM produces a simple HTML file that contains all of the conversations held with X on Y, in chronological sequence. Because I already had a collection of these files from my previous use of the software, using them was by far the most practical approach.
In addition to using DeadAIM, I extended Apache Lucene, an open-source Java library that can be adapted to index and search a variety of data. I used Lucene to build an instant message conversation index and to provide search functionality for that index. On top of the application is a GUI, built with Java’s Swing components, that lets the user interact with the functional code I have written. The functional code is separated into two general components: an indexing component and a searching component. The indexing component parses all DeadAIM HTML logs to create an index of conversations, and the searching component returns results from that index that satisfy multiple user-supplied constraints.

Figure 1: Overall Technical Implementation Diagram (components shown: DeadAIM Archives, Indexing Component, Conversation Index, Searching Component, Graphical User Interface; labels: Conversations, Usernames, Date Range, Search Query, Search Results)

Indexing

I created a package of classes called imstrip to contain all of the code pertaining to indexing instant message conversations. Inside this package resides the Strip class, which handles all parsing for an individual log file. Each log file contains all instant messages sent to and received from a particular user on a particular date. Since AIM produces horribly formatted HTML for these conversations, each Strip instance is passed a version of this HTML cleaned up using the open-source program JTidy. This cleaned HTML is traversed, and all message usernames, times, and message texts are collected. The IMStrip class, also within the imstrip package, finds all relevant log files for a Strip instance to parse. Once these HTML log files have been parsed into usernames, times, and message texts, the IMStrip instance uses the Lucene data class Document to organize all of the information into conversations.
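The text available here does not say how a day's log is divided into separate conversations. Purely as an illustration, the sketch below assumes one simple heuristic: a new conversation starts whenever the idle time between two consecutive messages exceeds a threshold. The class ConversationSplitter, the Message type, and the 30-minute threshold are all hypothetical and are not taken from the imstrip package.

```java
import java.time.Duration;
import java.time.LocalTime;
import java.util.ArrayList;
import java.util.List;

public class ConversationSplitter {

    /** One parsed log line. Hypothetical type, not a class from the imstrip package. */
    static final class Message {
        final String sender;
        final LocalTime time;
        final String text;

        Message(String sender, LocalTime time, String text) {
            this.sender = sender;
            this.time = time;
            this.text = text;
        }
    }

    /**
     * Splits one day's messages (already in chronological order) into
     * conversations. ASSUMPTION: a gap longer than maxGap ends a conversation.
     */
    static List<List<Message>> split(List<Message> dayLog, Duration maxGap) {
        List<List<Message>> conversations = new ArrayList<>();
        List<Message> current = new ArrayList<>();
        for (Message m : dayLog) {
            if (!current.isEmpty()) {
                LocalTime previous = current.get(current.size() - 1).time;
                // Close the current conversation when the idle gap is too long.
                if (Duration.between(previous, m.time).compareTo(maxGap) > 0) {
                    conversations.add(current);
                    current = new ArrayList<>();
                }
            }
            current.add(m);
        }
        if (!current.isEmpty()) {
            conversations.add(current);
        }
        return conversations;
    }

    public static void main(String[] args) {
        List<Message> log = List.of(
            new Message("remoteUser", LocalTime.of(9, 0, 5), "morning!"),
            new Message("localUser", LocalTime.of(9, 1, 30), "hey"),
            new Message("remoteUser", LocalTime.of(14, 30, 0), "lunch was good"));
        List<List<Message>> conversations = split(log, Duration.ofMinutes(30));
        System.out.println(conversations.size()); // prints 2
    }
}
```

Each inner list would then correspond to one indexed conversation. Other segmentation rules, for example ones based on sign-off phrases, would fit the same interface.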
While a Document normally contains file-oriented fields such as title, date, and location, I gave each Document fields for the remote username, the local username, the date of the conversation, and fields for every message sent. The value of each message field follows the regular expression (r | l)(m | a)(######)(.+). The ‘r’ marks a message that the remote user sent, and the ‘l’ a message that the local user sent. The ‘m’ marks a normal message, and the ‘a’ an automatically sent message (through AIM’s away message function). The (######) represents the time of the message in 24-hour format. The actual message text follows.
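The field-value format described above can be sketched in plain Java, without Lucene. The class MessageFieldCodec and its methods are hypothetical names chosen for illustration. Reading the six-character time slot as digits in HHmmss order is an assumption, since the text only says the time is in 24-hour format.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MessageFieldCodec {

    // (r | l)(m | a)(######)(.+), with ###### read as six digits.
    // ASSUMPTION: the six digits are HHmmss; the source only says "24-hour format".
    private static final Pattern FIELD =
        Pattern.compile("([rl])([ma])(\\d{6})(.+)", Pattern.DOTALL);

    /** Builds a field value such as "rm143005hello". */
    static String encode(boolean fromRemote, boolean automatic, String time, String text) {
        if (!time.matches("\\d{6}")) {
            throw new IllegalArgumentException("time must be six digits: " + time);
        }
        if (text.isEmpty()) {
            throw new IllegalArgumentException("message text must not be empty");
        }
        return (fromRemote ? "r" : "l") + (automatic ? "a" : "m") + time + text;
    }

    /** Splits a field value into { sender flag, kind flag, time, text }. */
    static String[] decode(String value) {
        Matcher m = FIELD.matcher(value);
        if (!m.matches()) {
            throw new IllegalArgumentException("not a message field value: " + value);
        }
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4) };
    }

    public static void main(String[] args) {
        String value = encode(true, false, "143005", "are you free for lunch?");
        System.out.println(value); // prints rm143005are you free for lunch?
        System.out.println(String.join(" | ", decode(value)));
        // prints r | m | 143005 | are you free for lunch?
    }
}
```

Packing the sender, message type, and time into a fixed-width prefix keeps each message in a single string value, so the original ordering and attribution can be recovered by decoding the prefix when a conversation is displayed.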

