Unformatted text preview:

MIT OpenCourseWare http ocw mit edu 6 006 Introduction to Algorithms Spring 2008 For information about citing these materials or our Terms of Use visit http ocw mit edu terms Lecture 1 Introduction and Document Distance 6 006 Spring 2008 Lecture 1 Introduction and the Document Distance Problem Course Overview E cient procedures for solving problems on large inputs Ex entire works of Shake speare human genome U S Highway map Scalability Classic data structures and elementary algorithms CLRS text Real implementations in Python Fun problem sets version of the class feedback is welcome Pre requisites Familiarity with Python and Discrete Mathematics Contents The course is divided into 7 modules each of which has a motivating problem and problem set except for the last module Modules and motivating problems are as described below 1 Linked Data Structures Document Distance DD 2 Hashing DD Genome Comparison 3 Sorting Gas Simulation 4 Search Rubik s Cube 2 2 2 5 Shortest Paths Caltech MIT 6 Dynamic Programming Stock Market 7 Numerics 2 Document Distance Problem Motivation Given two documents how similar are they Identical easy Modi ed or related Ex DNA Plagiarism Authorship 1 Lecture 1 Introduction and Document Distance 6 006 Spring 2008 Did Francis Bacon write Shakespeare s plays To answer the above we need to de ne practical metrics Metrics are de ned in terms of word frequencies De nitions 1 Word Sequence of alphanumeric characters For example the phrase 6 006 is fun has 4 words 2 Word Frequencies Word frequency D w of a given word w is the number of times it occurs in a document D For example the words and word frequencies for the above phrase are as below Count 1 0 1 1 0 1 W ord 6 the is 006 easy f un In practice while counting it is easy to choose some canonical ordering of words 3 Distance Metric The document distance metric is the inner product of the vectors D1 and D2 containing the word frequencies for all words in the 2 documents Equivalently this is the projection of vectors D1 onto D2 or vice versa Mathematically this is expressed as D1 D2 D1 w D2 w 1 w 4 Angle Metric The angle between the vectors D1 and D2 gives an indication of overlap between the 2 documents Mathematically this angle is expressed as D1 D2 arccos D1 D2 D1 D2 0 2 An angle metric of 0 means the two documents are identical whereas an angle metric of 2 implies that there are no common words 5 Number of Words in Document The magnitude of the vector D which contains word frequencies of all words in the document Mathematically this is expressed as N D D D D So let s apply the ideas to a few Python programs and try to esh out more 2 2 Lecture 1 Introduction and Document Distance 6 006 Spring 2008 Document Distance in Practice Computing Document Distance docdist1 py The python code and results relevant to this section are available here This program com putes the distance between 2 documents by performing the following steps Read le Make word list the year Count frequencies the 4012 year 55 Sort into order a 3120 after 17 Compute Ideally we would like to run this program to compute document distances between writings of the following authors Jules Verne document size 25k Bobsey Twins document size 268k Lewis and Clark document size 1M Shakespeare document size 5 5M Churchill document size 10M Experiment Comparing the Bobsey and Lewis documents with docdist1 py gives 0 574 However it takes approximately 3 minutes to compute this document distance and probably gets slower as the inputs get large What is wrong with the e ciency of this program Is it a Python vs C issue Is it a choice of algorithm issue n2 versus n Pro ling docdist2 py In order to gure out why our initial program is so slow we now instrument the program so that Python will tell us where the running time is going This can be done simply using the pro le module in Python The pro le module indicates how much time is spent in each routine See this link for details on pro le The pro le module is imported into docdist1 py and the end of the docdist1 py le is modi ed The modi ed docdist1 py le is renamed as docdist2 py Detailed results of document comparisons are available here 3 Lecture 1 Introduction and Document Distance 6 006 Spring 2008 More on the di erent columns in the output displayed on that webpage tottime per call column3 is tottime column2 ncalls column1 cumtime column4 includes subroutine calls cumtime per call column5 is cumtime column4 ncalls column1 The pro ling of the Bobsey vs Lewis document comparison is as follows Total 195 secs Get words from line list 107 secs Count frequency 44 secs Get words from string 13 secs Insertion sort 12 secs So the get words from line list operation is the culprit The code for this particular section is word list for line in L words in line get words from string line word list word list words in line return word list The bulk of the computation time is to implement word list word list words in line There isn t anything else that takes up much computation time List Concatenation docdist3 py The problem in docdist1 py as illustrated by docdist2 py is that concatenating two lists takes time proportional to the sum of the lengths of the two lists since each list is copied into the output list L L1 L2 takes time proportional to L1 L2 If we had n lines each with one word computation time would be proportional to 1 2 3 n n n 1 n2 2 Solution word list extend words in line word list append word for each word in words in line Ensures L1 extend L2 time proportional to L2 4 Lecture 1 Introduction and Document Distance 6 006 Spring 2008 Take Home Lesson Python has powerful primitives like concatenation of lists built in To write e cient algorithms we need to understand their costs See Python Cost Model for details PS1 also has an exercise on guring out cost of a set of operations Incorporate this solution into docdist1 py rename as docdist3 py Implementation details and results are available here This modi cation helps run the Bobsey vs Lewis example in 85 secs as opposed to the original 195 secs We can improve further by looking for other quadratic running times hidden in our routines The next o ender in terms of overall computation time is the count frequency routine which computes the frequency of each word given the word list Analysing Count Frequency def count frequency word list Return a list giving pairs of form word frequency L for new word in word list for entry in L if new word entry 0 entry 1 entry 1 1 break


View Full Document

MIT 6 006 - Lecture Notes

Documents in this Course
Quiz 1

Quiz 1

7 pages

Quiz 2

Quiz 2

12 pages

Quiz 2

Quiz 2

9 pages

Quiz 1

Quiz 1

10 pages

Quiz 2

Quiz 2

11 pages

Quiz 1

Quiz 1

12 pages

Graphs

Graphs

27 pages

Load more
Loading Unlocking...
Login

Join to view Lecture Notes and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Lecture Notes and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?