Ryan Caplet
CSE 4904 – Fall 08
Milestone 1
Sept 17, 2008

Crawler Based Search Engine

Background:
The purpose of this project is to design a crawler-based search engine, that is, a search engine that searches the web using a crawler. A crawler is a bot or script used to gather information and/or download pages. The downloaded web pages are then processed to find the information needed. A search engine like this initially starts with a small list of URLs; as the crawler searches the web and follows hyperlinks, it adds new URLs to the list of places to search. The crawler searches the downloaded pages for text and various other elements to determine the rank of each web page, and the indexer indexes the pages by the number of keywords found. The ranked pages are then output to the screen on a new page, with the highest-ranked pages displayed first; the database is listed out on the results pages in the order the indexer ranked the downloaded web pages. These are the basics of how the project is going to function (BHerath).

Motivation:
The main motivation for this project is that all members of the group are interested in how a search engine works, as well as how the web works. Search engines are among the most commonly used types of web pages among those who have access to the internet. This project will also give the group more experience with the various programming languages and programs we will be using, and with large programming projects that let us create something people could actually use.

Projected Task Breakdown Between Team Members:
The team is made up of Ryan Caplet, Bryan Chapman, and Morris Wright. The team is going to divide up the jobs so that everyone has something to do and knows what the group is doing at all times. The group has decided on some preliminary ideas about who is doing what, but they are not totally solidified just yet. Ryan will mostly program the interface for the web page that interacts with the crawler to produce the output; he will also test the functions to make sure they work the way they are supposed to. Bryan is going to create the crawler, which will download the files; the crawler will also analyze the files and rank them by what it finds. Morris is going to work on developing a user interface for the project, which will aid in searching; he will also manage the accounts on the server we will be using and the databases. These roles may change as the project goes along and the group gets more organized.

Product Function and Perspective:
The basis of the project is that the group will be developing a search engine, which is a web program used to search the internet. The search engine will download pages and index URLs while following hyperlinks. Each new page that is found will be added to the list of pages to visit (which will be stored in a database), and that page may then be downloaded to the computer. Once a page is downloaded, it is parsed and searched for whatever the user searched for, and the pages are ranked by how well everything is found within the text of each page. The ranking system will be implemented using various arrays and database tables to store keywords and keyword counts. The resulting information will be printed out to a new page so that the user can view it. The basic bounds of this project are to search the University of Connecticut campus domain for what is being looked for; this model could easily be extended to search a larger area or a different set of web sites or domains.

Development Environment:
For the development environment the group has decided to use an Apache-driven Linux server. The group might use one of the servers available from the School of Engineering, but some of the work will also be done on group members' own computers. Those computers will need the Apache web server, PHP, Perl, and MySQL installed, since the group will need all of these programs to develop the project. The group is also considering other tools and interpreters, such as Perl, as needed to connect different parts of the project; these are available through Linux and other sources. An example of a Linux approach would be a recursive wget (wget -r) to download web pages. Another way to download pages would be to use PHP curl to fetch a page on an as-needed basis once the database URL list is set up.
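To make the PHP curl option concrete, here is a minimal sketch of fetching a single page. The function name, the placeholder seed URL, and the specific curl options are illustrative assumptions, not a finalized design:

<?php
// Minimal sketch: fetch one page with PHP's curl extension.
// fetch_page() and the seed URL below are placeholders, not project code.
function fetch_page($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // keep one slow page from hanging the crawler
    curl_setopt($ch, CURLOPT_USERAGENT, 'CSE4904-Crawler/0.1');
    $html = curl_exec($ch);
    curl_close($ch);
    return ($html === false) ? null : $html;
}

$html = fetch_page('http://www.uconn.edu/');
if ($html !== null) {
    echo strlen($html) . " bytes downloaded\n";
}
?>

Unlike a blanket recursive wget, fetching one URL at a time like this fits the plan of pulling addresses from the database list only as they are needed.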
User Environment:
The user environment for this project is any computer, running any operating system, that has a web browser, since the project is going to be web-based. The page will be a search box with a search button, and the page will be integrated with PHP to interact with the rest of the system. Any system architecture fast enough to run a web browser and a simple recursive loop should be sufficient. There are very few constraints on the user environment, because most of the processing of the information is going to be done on the server.

Some Design Constraints:
Some basic constraints the group thought it would need to think about are how to develop this for multiple users accessing the list of pages at once; the group would need to make sure the project is implemented efficiently so the server does not hang if too many users were to use it. The group has decided to keep the URLs that are found in a database, so that when the search engine needs a website to search it can simply take that website, search it, and then delete it in order to save space (a sketch of this appears after the User Interfaces section below). Other constraints include resources that may be available on some of the group's individual laptops but not on the server. The group may also need a good amount of space to hold the pages if it chooses to download them.

User Interfaces:
For this project the group has to come up with some ideas about which aspects are important to include in the external interfaces that run this project. There are a few different types of interfaces this project might use, ranging from hardware interfaces to software interfaces. The external user interface is going to be a dynamic website which will find other web pages.
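As referenced under Some Design Constraints, the sketch below shows one way the take-then-delete URL list could work in PHP with MySQL. The table name (url_queue), its columns, and the connection details are all assumptions made for illustration:

<?php
// Sketch of the URL list from the Design Constraints section: take the next
// URL out of the database, then delete its row so the table stays small.
// The database name, credentials, and url_queue schema are all assumed.
$db = mysqli_connect('localhost', 'crawler', 'secret', 'search_engine');

function next_url($db) {
    // Take the oldest queued URL.
    $result = mysqli_query($db, 'SELECT id, url FROM url_queue ORDER BY id LIMIT 1');
    $row = mysqli_fetch_assoc($result);
    if ($row === null) {
        return null; // the list is empty
    }
    // Delete the row right away to save space, as described above.
    mysqli_query($db, 'DELETE FROM url_queue WHERE id = ' . (int)$row['id']);
    return $row['url'];
}

while (($url = next_url($db)) !== null) {
    // fetch_page($url) from the earlier curl sketch would be called here.
    echo "crawling $url\n";
}
?>

One trade-off worth noting: deleting a row as soon as it is taken saves space, but a crawl that fails mid-page loses that URL; marking rows as visited instead of deleting them would be a safer variant, at the cost of the space savings the group wants.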

