USF CS 110 - Programming Assignment 4

Programming Assignment 4
Computer Science 110-03
Due date: Monday, November 21 (note the change)

1 Web Crawling and Web Search

A web crawler is a program that systematically visits pages on the internet. For programming assignment 4, you should write a web crawler that builds a database of the words stored on the pages that are visited. The database should be a Python dictionary whose keys are the absolute URLs of the sites visited and whose values are lists of the words on the sites visited.

After building the URL–word-list dictionary, your program should build a reverse index of the database. This is another dictionary whose keys are the words in the original dictionary and whose values are lists of the URLs of the pages that contain each word. For example, if the first dictionary is

    {'http://cs.usfca.edu/~peter/cs110/hello.html':
         ['hello', 'ciao', 'goodday'],
     'http://cs.usfca.edu/~peter/cs110/goodbye.html':
         ['goodbye', 'bye', 'ciao', 'goodday']}

then the reverse index would be

    {'hello': ['http://cs.usfca.edu/~peter/cs110/hello.html'],
     'ciao': ['http://cs.usfca.edu/~peter/cs110/hello.html',
              'http://cs.usfca.edu/~peter/cs110/goodbye.html'],
     'goodday': ['http://cs.usfca.edu/~peter/cs110/hello.html',
                 'http://cs.usfca.edu/~peter/cs110/goodbye.html'],
     'goodbye': ['http://cs.usfca.edu/~peter/cs110/goodbye.html'],
     'bye': ['http://cs.usfca.edu/~peter/cs110/goodbye.html']}

After building the two dictionaries, your program should give the user a choice of commands:

    Search the database for a word (s)
    Open a URL in a browser (o)
    Print the dictionary with keys that are URLs (u)
    Print the dictionary with keys that are words on web pages (w)
    Print this menu (m)
    Quit (q)

The user can type in the letter for a command, and your program should carry it out. After completing the command, it should continue to request and execute commands until the user quits.

2 Details

Initial input to the program will consist of a "root URL" and the filename of the first web page to be visited.
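As a minimal sketch of the reverse index described in Section 1, the following shows one way to invert the URL–word-list dictionary. The function name build_reverse_index is hypothetical, and skipping a URL that is already listed for a word is an assumption, since the assignment does not say how repeated words on a page should be handled.

```python
def build_reverse_index(url_words):
    """Map each word to the list of URLs whose word lists contain it."""
    reverse_index = {}
    for url, words in url_words.items():
        for word in words:
            urls = reverse_index.setdefault(word, [])
            # Assumption: list a URL only once per word, even if the word repeats.
            if url not in urls:
                urls.append(url)
    return reverse_index


# The example dictionary from Section 1.
url_words = {
    "http://cs.usfca.edu/~peter/cs110/hello.html":
        ["hello", "ciao", "goodday"],
    "http://cs.usfca.edu/~peter/cs110/goodbye.html":
        ["goodbye", "bye", "ciao", "goodday"],
}
reverse_index = build_reverse_index(url_words)
print(reverse_index["ciao"])
```

With this structure, the search command reduces to a single dictionary lookup on the word the user types.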
The root URL is an absolute URL for a directory that contains all the web pages the program will visit. In the example above, the root URL would be

    http://cs.usfca.edu/~peter/cs110

and the filename might be

    hello.html

So an absolute URL for the first page visited would be

    http://cs.usfca.edu/~peter/cs110/hello.html

After getting the initial input, the program should use depth-first search to crawl the pages in the directory specified by the root URL. During the crawl it should build the URL–word-list dictionary. Keys should be absolute URLs. After visiting all the files that are reachable by depth-first search, the program should build the reverse-index dictionary. It should then begin prompting for and executing user commands.

As discussed in class, the HTML pages that will be visited have a very simple format: each line is either an HTML tag or a word in ordinary text. The words will consist only of lower-case letters (no spaces or punctuation). The HTML tags will be

    <html>  </html>
    <title>  </title>
    <body>  </body>
    <a href=". . .">  </a>
    <br>

The URLs in the <a href=". . ."> tags will be relative to the root URL. So they'll simply be filenames ending in "html", e.g. hello.html or goodbye.html.

You can use the Python requests module to get the contents of a webpage:

    import requests
    . . .
    page = requests.get(url)

Here, url is the absolute URL of the page. The object returned by requests.get has a content attribute. This contains a string with the html. In this example, then,

    page.content

contains the html for the web page.

The search command should prompt for a word to search for. If the word is in the database, it should print a list of the absolute URLs of the pages that contained the word. If the word isn't in the database, it should just print a message to this effect.

The open command should prompt for an absolute URL, and try to open Firefox at this URL. On the Linux systems in the CS labs, this can be done using the system command from the os module:

    import os
    . . .
    os.system("firefox " + url + " &")

Here, url is the absolute URL of the page. Notes: If Firefox is already running on the system you're using, you may need to kill it before opening a page from the program. Also, this will probably only work on Linux systems. The ampersand (&) at the end will start the browser in the background; this will allow your program to continue without killing the browser first.

The commands that print the dictionaries should print each key on a single line, and on the lines immediately following the key the elements of the corresponding value (which is a list). For the URL–word-list example, above, the u command should print

    http://cs.usfca.edu/~peter/cs110/hello.html
    hello
    ciao
    goodday
    http://cs.usfca.edu/~peter/cs110/goodbye.html
    goodbye
    bye
    ciao
    goodday

3 Test Data

I'll put some test data on the class web site. I'll include the output of the print commands when a solution to the assignment is run with the test data.

4 Due Date

In order to receive full credit, your program must be in the p4 subdirectory of your Subversion repository by 2:00 pm on Monday, November 21, and you must turn in a printout of your program by 5 pm on the 21st. (Note that this is different from the date in the syllabus.)

5 Grading

Your program will be graded on the basis of its correctness and its "static features."

1. Correctness will be 60% of your grade. Does your program take input in the correct format? Does it use depth-first search to visit the pages? Are the dictionaries correct? Does it correctly execute the user commands?

2. The following static features will be graded.

    (a) Documentation will be 10% of your grade. Does your header documentation include the author's name, the purpose of the program, and a description of how to use the program? Are the identifiers meaningful? Are any obscure constructs clearly explained?

    (b) Source format will be 5% of your grade. Is the indentation consistent?
        Have blank lines been used so that the program is easy to read?

    (c) Quality of solution will be 20% of your grade. Does your solution contain unnecessary calculations? Is your solution too clever (e.g., has the solution been condensed to the point where it's incomprehensible)? Are functions and data structures (e.g., lists, strings, dictionaries) used properly? Are any functions (including the main program) more than 20 lines long?

3. Extra credit. You can get up to 10 points extra credit by augmenting the output of the search command: instead of simply printing a list of URLs, the search command
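Returning to the depth-first crawl described in Section 2, here is a minimal sketch of how it might be structured, not a definitive solution. The names parse_page, crawl, and fetch_with_requests are hypothetical, and the example.com pages are made-up test data that follow the one-tag-or-word-per-line format. The sketch takes the page-fetching function as a parameter so it can be tried without network access. It also reads the response's text attribute rather than content, because under Python 3 content holds bytes rather than a string.

```python
def parse_page(html):
    """Split a page in the one-item-per-line format into words and link targets."""
    words, links = [], []
    for line in html.splitlines():
        line = line.strip()
        if line.startswith('<a href="'):
            # The link target is the text between the first pair of double quotes.
            links.append(line.split('"')[1])
        elif line and not line.startswith("<"):
            words.append(line)  # any non-empty line that is not a tag is a word
    return words, links


def crawl(root_url, filename, fetch, url_words=None):
    """Depth-first crawl from root_url/filename; returns the URL-to-words dict."""
    if url_words is None:
        url_words = {}
    url = root_url.rstrip("/") + "/" + filename
    if url in url_words:  # already visited, so stop here (this also breaks cycles)
        return url_words
    words, links = parse_page(fetch(url))
    url_words[url] = words  # record the page before following its links
    for link in links:
        crawl(root_url, link, fetch, url_words)
    return url_words


def fetch_with_requests(url):
    """Fetch a live page with the third-party requests module."""
    import requests
    return requests.get(url).text  # text is a str; content would be bytes


# Hypothetical in-memory pages standing in for a real site.
PAGES = {
    "http://example.com/site/hello.html":
        '<html>\n<body>\nhello\n<a href="goodbye.html">\nciao\n</a>\n</body>\n</html>\n',
    "http://example.com/site/goodbye.html":
        '<html>\n<body>\ngoodbye\n<a href="hello.html">\n</a>\n</body>\n</html>\n',
}
url_words = crawl("http://example.com/site", "hello.html", PAGES.__getitem__)
for url, words in url_words.items():
    print(url)
    for word in words:
        print(word)
```

To crawl a live site, pass fetch_with_requests in place of PAGES.__getitem__. The recursion follows each link as soon as it is found, which is what makes the traversal depth-first; the loop at the end prints each key followed by its words, one per line, in the style of the u command.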

