6.006 – Introduction to Algorithms
Lecture 1
Prof. Constantinos Daskalakis

Today's Menu
• Motivation
• Administrivia
• Course Overview
• Linked Lists and Document Distance
• Intro to "Peak Finding"

"Al-go-rithms"… wha?
• Remember logarithms?
  – they have nothing to do with algorithms
• A well-specified method for solving a problem using a finite sequence of instructions
• Description might be English, pseudocode, or real code
• Key: no ambiguity
[Image: Al-Khwārizmī (780–850)]

Efficient Algorithms: Why?
• Solving problems consumes resources that are often limited/valuable:
  – Time: plan a flight path
  – Space: query a database
  – Energy: save money
• Bigger problems consume more resources
• Need algorithms that "scale" to large inputs, e.g. searching the web…

Efficient Algorithms: How?
• Define the problem:
  – unambiguous description of the desired result
• Abstract away irrelevant detail
  – "Assume the cow is a sphere"
• Pull techniques from the "algorithmic toolbox"
  – [CLRS] is the class textbook
• Implement and evaluate performance
  – revise the problem/abstraction
• Generalize
  – make the algorithm apply to a broad class of problems

Administrivia
• Handout: course info
• Profs: Daskalakis, Jaillet
• TAs: Goldstein, Griner, Bhattacharya, Madry
• Sign up for the class at https://sec.csail.mit.edu/ to get a recitation assignment
• Prereqs: 6.01, 6.042
• Python
• Grades:
  – Problem sets (30%)
  – Quiz 1 (Oct 13, 7:30–9:30pm; 20%)
  – Quiz 2 (Nov 17, 7:30–9:30pm; 20%)
  – Exam (30%)
• Read the collaboration policy!

Content
• 8 modules, each with a motivating problem/pset
• Linked Data Structures: Document Distance
• Divide & Conquer: Peak Finding
• Hashing: Efficient File Update/Synchronization
• Sorting
• Graph Search: Rubik's Cube
• Shortest Paths: Google Maps
• Dynamic Programming: print justification
• Numerical Algorithms: linear systems

Document Distance
• Given 2 documents, how similar are they?
  – if one "document" is a query, this is web search
  – find "similar documents" to a given one
  – detect plagiarism
• Goal: an algorithm to compute similarity
  – actually, we'll compute "distance" =
1/similarity

Problem Definition
• Need an unambiguous definition of similarity
• We define it in terms of "distance"
• Word: a sequence of alphabetic characters
  – ignore punctuation and formatting
• Document: a sequence of words
• Word frequencies: D(w) is the number of occurrences of word w in document D
• Similarity is based on the amount of word overlap

Vector Space Model
• [Salton, Wong, Yang 1975]
• Treat each document as a vector of its words
  – one coordinate per word of the English dictionary
  – e.g. doc1 = "the cat", doc2 = "the dog"
    over the coordinates ('the', 'cat', 'dog'):
    d1 = (1, 1, 0), d2 = (1, 0, 1)
  – similarity by dot product: d1 · d2 = 1
  – trouble: not scale invariant
    the documents "the the cat cat" and "the the dog dog" will appear closer than doc1 and doc2

Vector Space Model
• Solution: normalization
  – divide by the lengths of the vectors
  – measure distance by the angle: θ = arccos( d1 · d2 / (|d1| |d2|) )
    e.g. θ = 0: the documents are "identical"
    (if of the same size, they are permutations of each other)
    θ = π/2: they don't even share a word

Algorithm
• Read the file
• Make a word list (divide the file into words)
• Count the frequency of each word
• Compute the dot product
  – for every word in the first document, check whether it appears in the other document; if yes, multiply their frequencies and add to the dot product
  – worst-case time: order of #words(D1) × #words(D2)
  – micro-optimization:
    • sort the documents into word (alphabetical) order
    • compute the inner product in time #words(D1) + #words(D2)

Python Implementation
• docdist1.py (see handout)
• Read file: read_file(filename)
  – output: list of lines (strings)
• Make word list: get_words_from_line_list(L)
  – output: list of words
• Count frequencies: count_frequency(word_list)
  – output: list of word–frequency pairs
• Sort into word order: insertion_sort()
  – output: sorted list of pairs
• Dot product: inner_product(D1, D2)
  – output: a number

Inputs:
• Jules Verne: 25K
• Bobbsey Twins: 268K
• Francis Bacon: 324K
• Lewis and Clark: 1M
• Shakespeare: 5.5M
• Churchill: 10M

Profiling
• Tells how much time is spent in each routine
  – import profile
  – profile.run("main()")
• One line per routine reports:
  1. # calls
  2. total time, excluding subroutine calls
  3. time per call (#2/#1)
  4. cumulative time, including subroutines
  5. cumulative time per call (#4/#1)

What's with +?
• L = L1 + L2 is concatenation of arrays
  – take L1 and L2
  – copy both into a bigger array
  – time proportional to the sum of their lengths
• Suppose the file has n one-word lines
  – time 1 + 2 + … + n = n(n+1)/2 = Θ(n²)

Solution
• word_list.extend(words_in_line) appends the list named words_in_line to the list named word_list
• Takes time proportional to the length of words_in_line
• Total time in the example of n one-word lines: Θ(n)
• Resulting improvement:
  – get_words_from_line_list: 23s → 0.12s

Other Improvements
• docdist4.py:
  – instead of inserting words into a list, insert into a dictionary: total time down to 42s
• docdist5.py:
  – process words instead of characters: down to 17s
• docdist6.py: merge sort instead of insertion sort: 6s
• docdist7.py: dictionary (again) instead of sorting: 0.5s

Next time: Peak Finding
• Array of numbers
• Find one that is bigger than its neighbors
• A local maximum suffices
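As a preview of that problem, here is a minimal divide-and-conquer sketch of 1-D peak finding; this is my illustration of the idea, not the lecture's own code, and it assumes the definition above (a local maximum counts as a peak, with boundary elements compared only to their one neighbor):

```python
def find_peak(a):
    """Return an index i such that a[i] >= its neighbors.

    Binary-search style: if a[mid] < a[mid+1], a peak must exist on
    the right half (the values cannot climb forever); otherwise a peak
    exists at mid or to its left. O(log n) probes instead of O(n).
    """
    lo, hi = 0, len(a) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if a[mid] < a[mid + 1]:
            lo = mid + 1   # climb right: a peak lies in a[mid+1..hi]
        else:
            hi = mid       # a[mid] >= a[mid+1]: a peak lies in a[lo..mid]
    return lo
```

Note that this finds *some* peak, not necessarily the global maximum; that weaker goal is exactly what makes the logarithmic bound possible.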
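To recap the lecture's main algorithm, the document-distance pipeline (split into words, count frequencies, dot product, angle) can be sketched as follows. This is a minimal illustration using dictionary-based counting in the spirit of docdist7.py, not the handout's actual code; the function names echo the handout but the bodies are mine:

```python
import math
import string

def get_words(text):
    """Split text into lowercase words, ignoring punctuation."""
    strip = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.lower().translate(strip).split()

def count_frequency(words):
    """Map each word w to D(w), its number of occurrences."""
    freq = {}
    for w in words:
        freq[w] = freq.get(w, 0) + 1
    return freq

def inner_product(d1, d2):
    """Dot product of two word-frequency vectors stored as dicts."""
    if len(d1) > len(d2):
        d1, d2 = d2, d1            # iterate over the smaller dict
    return sum(c * d2.get(w, 0) for w, c in d1.items())

def vector_angle(d1, d2):
    """Distance = arccos( d1 . d2 / (|d1| |d2|) )."""
    num = inner_product(d1, d2)
    den = math.sqrt(inner_product(d1, d1) * inner_product(d2, d2))
    return math.acos(num / den)

d1 = count_frequency(get_words("the cat"))
d2 = count_frequency(get_words("the dog"))
# d1 . d2 = 1, and the angle between them is pi/3
```

Because the angle is invariant under scaling, "the the cat cat" and "the the dog dog" come out exactly as far apart as "the cat" and "the dog", fixing the scale-invariance trouble noted on the Vector Space Model slide.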