Ryan Caplet
CSE 4904 – Fall 08
Milestone 4
Nov 5, 2008

Crawler Based Search Engine

Introduction:

Purpose and Scope:

The purpose of this document is to explain the progress of the project on a crawler-based search engine. At this point the group is beginning to bring the different parts of the project together so that they operate as a single system. This document highlights the parts the group has worked on and how the group intends to make the project work as a whole. It also discusses how the system as a whole is tested and how the pieces the group has designed fit together. One part of this document lays out a thorough strategy for integrating each part of the system, with criteria for when a part is ready to be tested and specifics on how it is to be tested; that testing in turn prepares the rest of the project for testing. There are also criteria for when the system is ready for beta testing by users.

Integration Strategy:

Entry Criteria:

The criterion for a part of the system to enter testing is this: the first party, the creator of that part, must be able to run it and know that it works, and then a second party must run it and verify that it is correct. In the sequence explained later, the earlier parts of the project need to be carried out first so that testing of the later parts is more effective.

Elements to be integrated:

Several parts make up the search engine. So far the group has developed the crawler, the indexer, the keyword generator, and the search function for the engine. The crawler is the part of the project that analyzes each web page and builds an index of the downloaded websites so that the other parts of the project have quicker access to them. The keyword generator is the part of the system that builds a table per word, so that when people search for a word they can find which sites contain it. The search function then searches those keyword tables and produces a list of URLs related to the word. The crawler, indexer, and keyword generator are separate from the rest of the project because they are run only once in a while to rebuild their respective lists. Within this project, every part depends on something else in order to run correctly. The crawler/indexer and the keyword generator are already integrated together; the search is the only part that is not yet directly integrated. The crawler, new_strip.pl, is written in Perl; the index is a MySQL table; and the keyword-building script is written in both PHP and Perl: keyword.php is the keyword-building script, and it uses the processKeyword.pl Perl script.

Integration Strategy:

The group decided to organize the project into the separate parts discussed above because not every part of the project will be running at all times; the crawler, index, and keyword tables change only when the database administrator decides to update them. The project's integration approach is a bottom-up one, chosen because the crawler-based search engine divides into clearly separate parts. Each group member was given a task within the project and built and tested a separate part as the project was made.
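To make the bottom-up integration order concrete, a driver along the following lines could run the stages in sequence. This is only a sketch: the script names new_strip.pl and keyword.php come from this document, but the run_pipeline.pl wrapper itself, the wget flags, the seed URL, the pages directory, and running keyword.php through the PHP command line are all illustrative assumptions.

    #!/usr/bin/perl
    # run_pipeline.pl - illustrative driver for the bottom-up integration
    # order (hypothetical; only the new_strip.pl and keyword.php names come
    # from the project). Assumes wget and a PHP command-line binary exist.
    use strict;
    use warnings;

    my $seed = 'http://www.uconn.edu/';   # assumed seed URL for the recursive fetch

    # Stage 1: recursive wget downloads the pages that will be indexed.
    system('wget', '--recursive', '--level=3', '--directory-prefix=pages', $seed) == 0
        or die "wget failed: $?";

    # Stage 2: the crawler/indexer builds the MySQL index from the downloaded pages.
    system('perl', 'new_strip.pl') == 0
        or die "new_strip.pl failed: $?";

    # Stage 3: the keyword generator builds one table per word from the index.
    system('php', 'keyword.php') == 0
        or die "keyword.php failed: $?";

    print "Index and keyword tables rebuilt; the search page can now be exercised.\n";

Because each stage only reads what the previous stage produced, a failure stops the run early, which matches the entry criteria above: a later part is not tested until the part before it has run correctly.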
Sequence of Features / Functions and Software Integration:

The sequence of the features is: the recursive wget, which retrieves the pages; the crawler, which builds the index; and the keyword generator, which builds the tables for each keyword. Those keyword tables are then searched by the search page, and the results are displayed on that page. The subsystems are integrated in this order because each one depends on the one immediately before it.

These features have no real hardware dependencies, but there are some software dependencies. The project requires PHP 4 or higher with the MySQL module, and Perl, which comes standard with Linux. The server also needs to run Linux and have wget installed so that the needed pages can be downloaded.

Individual Steps:

The way the group knows that everything has worked so far is that each part was tested together with the next part. First, the recursive wget was used to download all the web pages on the UConn network. The crawler/indexer used this data to create the index of the web site; the crawler must build a clean index with no blanks in the title and no duplicates. The keyword generator script uses the index for quick access to each of the files so that it can read in keywords and create a new database table for each word. This step is very hard to test at large scale, but getting through an entire run without any errors is sufficient. The search has basic functionality, and it will search the word tables created by the keyword script. Right now the search uses a test database to test its functionality; eventually the search page will use the keyword database tables to search for words across the UConn web site.

Software & Subsystem Integration Test Description:

The project has multiple parts and subsections that make up the entire system. The basic functionality of subsystem functions such as the quicksort and reverse functions, discussed in more depth later, was tested using an array of numbers. Once Ryan knew that these two functions worked, he put them into the search script and modified them so that they could work on two-dimensional arrays.

As far as integrating the separate software sections goes, each part of the project is, as mentioned throughout this document, dependent on the part that was made and tested before it. To test each part of the system, one needs to run the section before it to obtain that part's data. The search engine itself relies on the keyword tables, which can be generated once and then used for searching; there should be no need to run a constant update of all the pages every time one searches.

Final Functional Tests:

The functional tests that will be done at the end of this project are going to have the search web page interact with
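As an illustration of the quicksort and reverse helpers described in the test description above, a two-dimensional version might look like the following. This is a hypothetical sketch, not the group's actual search script: the [url, hit count] row layout and the sample data are assumptions.

    #!/usr/bin/perl
    # sort_results.pl - illustrative version of the quicksort/reverse helpers,
    # extended from plain numbers to two-dimensional data as described above.
    # The [url, hit_count] row layout is an assumption for this sketch.
    use strict;
    use warnings;

    # Quicksort over array-of-arrays rows, ordering by the numeric field $col.
    sub quicksort_rows {
        my ($rows, $col) = @_;
        return @$rows if @$rows <= 1;
        my ($pivot, @rest) = @$rows;
        my @less    = grep { $_->[$col] <  $pivot->[$col] } @rest;
        my @greater = grep { $_->[$col] >= $pivot->[$col] } @rest;
        return (quicksort_rows(\@less, $col), $pivot,
                quicksort_rows(\@greater, $col));
    }

    # Reverse the sorted rows so the highest hit counts come first.
    sub reverse_rows { return reverse @_; }

    # Example: rank search results by hit count, best match first.
    my @results = (
        [ 'http://www.uconn.edu/a.html', 3  ],
        [ 'http://www.uconn.edu/b.html', 17 ],
        [ 'http://www.uconn.edu/c.html', 8  ],
    );
    my @ranked = reverse_rows(quicksort_rows(\@results, 1));
    printf "%-35s %d\n", @$_ for @ranked;

Sorting ascending and then reversing mirrors the two separate helpers the document describes, rather than folding the descending order into the comparison itself.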

