Unformatted text preview:

6 006 Introduction to Algorithms Lecture 1 Prof Constantinos Daskalakis Today s Menu Motivation Administrivia Course Overview Linked Lists and Document Distance Intro to Peak Finding Al go rithms wha Remember Logarithms they have nothing to do with Algorithms Well specified method for solving a problem u sing a finite sequence of instructions Description might be English Pseudocode or real code Key no ambiguity Al Khw rizm 780 850 Efficient Algorithms Why Solving problems consumes resources that are often limited valuable Time Plan a flight path Space Query a database Energy Save money Bigger problems consume more resources Need algorithms that scale to large inputs e g searching the web Efficient Algorithms How Define problem Unambiguous description of desired result Abstract irrelevant detail Assume the cow is a sphere Pull techniques from the algorithmic toolbox CLRS class textbook Implement and evaluate performance Revise problem abstraction Generalize Algorithm to apply to broad class of problems Administrivia Handout course info Profs Daskalakis Jaillet TAs Goldstein Griner Bhattacharya Madry Sign up for class at https sec csail mit edu to get a recitation assignment Prereqs 6 01 6 042 Python Grades Problem sets 30 Quiz1 Oct 13 7 30 9 30pm 20 Quiz2 Nov 17 7 30 9 30pm 20 Exam 30 Read collaboration policy Content 8 modules with motivating problem pset Linked Data Structures Document Distance Divide Conquer Peak Finding Hashing Efficient File Update Synchronization Sorting Graph Search Rubik s Cube Shortest Paths Google Maps Dynamic Programming print justification Numerical Algorithms linear systems Document Distance Given 2 documents how similar are they if one document is a query this is web search find similar documents to a given one detect plagiarism Goal algorithm to compute similarity Actually we ll compute distance 1 similarity Problem Definition Need unambiguous definition of similarity We define in terms of distance Word sequence of alpha characters Ignore punctuation formatting Document sequence of words Word frequencies D w is number of occurences of w in D Similarity based on amount of word overlap Vector Space Model Salton Wang Yang 1975 Treat each doc as a vector of its words one coordinate per word of the English dictionary dog e g doc1 the cat 1 doc2 the dog similarity by dot product the 1 1 cat trouble not scale invariant d1 d2 1 documents the the cat cat and the the dog dog will appear closer than doc1 and doc2 Vector Space Model Solution Normalization divide by the length of the vectors measure distance by angle e g 0 documents identical if of the same size permutations of each other 2 not even share a word Algorithm Read file Make word list divide file into words Count frequencies of words Compute dot product for every word in the first document check if it appears in the other document if yes multiply their frequencies and add to the dot product worst case time order of words D1 x words D2 micro optimization sort documents into word order alphabetically compute inner product in time words D1 words D2 Python Implementation Docdist1 py see handout Read file read file filename Output list of lines strings Make word list get words from line list L Output list of words array Count frequencies count frequency word list Output list of word frequency pairs Sort into word order insertion sort Output sorted list of pairs Dot product inner product D1 D2 Output number Inputs Jules Verne 25K Bobsey Twins 268K Francis Bacon 324K Lewis and Clark 1M Shakespears 5 5M Churchill 10M Profiling Tells how much time spent in each routine import profile profile run main One line per routine reports 1 calls 2 total time excluding subroutine calls 3 Time per call 2 1 4 Cumulative time including subroutines 5 Cumulative per call 4 1 What s with L L1 L2 is concatenation of arrays Take L1 and L2 Copy to a bigger array Time proportional to sum of lengths Suppose n one word lines Time 1 2 n n n 1 2 n2 Solution word list extend words in line appends list named words in line to list named word list Takes time proportional to length of list words in line Total time in example of n one word lines n resulting improvement get words from line list 23s 0 12s Other Improvements Docdist4 py Instead of inserting words in list insert in dictionary total to 42s 5 py Process words instead of chars to 17s 6 py merge sort instead of insertion 6s 7 py dictionary again instead of sort 0 5s Next time Peak Finding Array of numbers Find one that is bigger than its neighbors A local minimum


View Full Document

MIT 6 006 - Lecture Notes

Documents in this Course
Quiz 1

Quiz 1

7 pages

Quiz 2

Quiz 2

12 pages

Quiz 2

Quiz 2

9 pages

Quiz 1

Quiz 1

10 pages

Quiz 2

Quiz 2

11 pages

Quiz 1

Quiz 1

12 pages

Graphs

Graphs

27 pages

Load more
Loading Unlocking...
Login

Join to view Lecture Notes and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Lecture Notes and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?