Pigeonhole

Liz Palmer
[email protected]
CSE 400
Professor: Camillo J. Taylor
Advisor: Pat Palmer
April 11, 2005

Abstract

Pigeonhole aims to provide a free indexing tool for private websites, one that better summarizes a site's contents and lets site users search the data more effectively. A number of search engines for web-based data systems are available for private indexing, but they require the user to already know the exact words appearing in a document in order to locate the relevant pages. A better low-cost indexing system for private websites is therefore needed. Pigeonhole automatically generates an interface for website users, built from the information contained in the pages of the site.

Related Work

A number of web-based services offer indexing of private websites for the purpose of information retrieval, and each also offers a search tool. Picosearch (www.picosearch.com) offers a set of basic features free of charge, including indexing a website by URL, minimal design control, and display of the context surrounding matched keywords. At higher price levels, users can modify the criteria used to build the index; the paid upgrade also adds file types beyond HTML and plain text, such as Word, PowerPoint, and PDF, along with advanced statistics on the searches performed. A second system, Spiderline (www.spiderline.com), provides all of its services by subscription only; its indexing features include customizable synonym lists, indexing of multiple file formats, and support for password-protected areas. A similar product called Indexing Service was offered for servers running Windows 2000 but has since been phased out in newer versions. Perlfect Solutions (http://www.perlfect.com/) offers Perlfect Search as a free script; it focuses on searching, returning a list of results for a given keyword. A search engine that has taken a novel approach to categorization is KartOO (www.kartoo.com): given a query, it returns a graphical representation of the sets of sites containing the search term and the links between those sites.

Pigeonhole includes and expands on the features found in existing indexers, with better organization techniques, more customization options for the website owner, and more advanced language-processing techniques. It combines the best features of many existing tools with further research into keyword selection and organization. Pigeonhole's goal is a tool that creates a sensible structure representing the information on a private website: a categorized index offered as a starting point for navigation. The index is information-based, rather than letting the link structure dictate the layout; using keywords and phrases as starting points may better direct users' attention to the information areas of a given site. Professional indexers produce such indices by hand, and the American Society of Indexers maintains lists of such people, but Pigeonhole is capable of remaining completely automated, from crawling, through the sorting of information, all the way to producing the index document.

Technical Approach

Pigeonhole is designed as a service accessible through the internet, implemented on the Microsoft ASP.NET and SQL Server platforms. The interface is formatted to display correctly on most versions of Internet Explorer and Netscape, as well as Mozilla/Firefox.

To create a site index, a user must first establish a login with identification information: a valid email address, a password entered twice, and a username. The user must then follow a hashed link sent to their email address; once followed, the link verifies the email and automatically identifies the user. Once the user can be identified, Pigeonhole can index and search one or more websites that the user specifies. Cookies allow the user to remain logged in on a given computer for up to seven days, and all stored passwords are kept in an encrypted format to protect user privacy.

[Screenshot 1 – Login Page]
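The report does not show the account-handling code, so the following is a minimal sketch of the two mechanisms just described: one-way password storage and the hashed verification link. The salted-hash scheme and all function names here are illustrative assumptions (the report says only that passwords are stored in an encrypted format), and Python stands in for the actual ASP.NET implementation.

    import hashlib
    import os
    import secrets

    def hash_password(password):
        # The report says passwords are stored "in an encrypted format";
        # a salted one-way hash is one common way to meet that goal.
        salt = os.urandom(16)
        digest = hashlib.sha256(salt + password.encode("utf-8")).hexdigest()
        return salt.hex() + ":" + digest

    def check_password(password, stored):
        salt_hex, digest = stored.split(":")
        salt = bytes.fromhex(salt_hex)
        return hashlib.sha256(salt + password.encode("utf-8")).hexdigest() == digest

    def make_verification_link(base_url):
        # The "hashed link" emailed to a new user: a random token is stored
        # server-side against the account and embedded in the URL, so that
        # following the link both verifies the email and identifies the user.
        token = secrets.token_urlsafe(32)
        return token, base_url + "/verify?token=" + token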
The user interface has gone through multiple revisions toward the goal of a simple interface for users who wish to enter no additional information, while still allowing customization for advanced users. The settings page features a column for existing data and statistics and a second area for input; the bottom area walks a basic user through the main steps of creating the desired table of contents. While many layouts were explored, we found Picosearch's structure to be an intuitive model, and these and other decisions were reached through end-user testing and feedback.

[Screenshot 2 – User Settings Page]

Crawling

Pigeonhole first crawls from one or more starting URLs, which are checked for validity of format. For URL parsing, the built-in URI handler was used at first, but it proved unreliable, so a more reliable handler was written. The new URI handler adds features such as discrimination between files and directories and more advanced parsing; building this necessary component resulted in a two-week setback.

A user may crawl in addition to the previous data and append the results, or clear the databases before a new crawl. The crawler can also be restricted to a given number of pages, limiting both the time taken and the request load on the user's website; this limit is a user setting with a default value of one hundred. Existing techniques for crawling are taken into account, including support for the exclusion of files and folders via robots.txt. Once retrieved and parsed, the exclusion information is stored to expedite later crawling.

The crawler has been moved to run in an independent worker thread, which allows the user to continue viewing their settings and other web pages while the crawling occurs; previously, the user had to wait for the webpage to reload while the crawling completed. Crawling currently occurs on wren.cis.upenn.edu, a Dell Precision 530 running Windows Server 2003 Enterprise. A benchmark website was crawled to estimate the time
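The crawler's code is not included in the report; the sketch below only illustrates the behavior this section describes, namely a page limit defaulting to one hundred, robots.txt exclusions fetched once and reused, and the crawl running in an independent worker thread. The class and function names are assumptions, and Python again stands in for the actual ASP.NET implementation.

    import threading
    import urllib.robotparser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen
    from html.parser import HTMLParser

    class LinkExtractor(HTMLParser):
        # Collect href targets from anchor tags.
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(start_url, max_pages=100):
        # Breadth-first crawl limited to max_pages (default one hundred,
        # matching Pigeonhole's user setting), honoring robots.txt.
        robots = urllib.robotparser.RobotFileParser()
        robots.set_url(urljoin(start_url, "/robots.txt"))
        robots.read()  # fetched and parsed once, then reused for every URL

        seen, queue, pages = set(), [start_url], {}
        while queue and len(pages) < max_pages:
            url = queue.pop(0)
            if url in seen or not robots.can_fetch("*", url):
                continue
            seen.add(url)
            try:
                html = urlopen(url).read().decode("utf-8", errors="replace")
            except OSError:
                continue
            pages[url] = html
            extractor = LinkExtractor()
            extractor.feed(html)
            for link in extractor.links:
                target = urljoin(url, link)
                # stay on the site being indexed
                if urlparse(target).netloc == urlparse(start_url).netloc:
                    queue.append(target)
        return pages

    # Run the crawl in an independent worker thread so the settings page
    # stays responsive while crawling proceeds (the URL is a placeholder).
    worker = threading.Thread(target=crawl, args=("http://example.com/",), daemon=True)
    worker.start()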