DOC PREVIEW
Princeton COS 226 - Symbol Table Applications

This preview shows page 1-2 out of 5 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 5 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 5 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 5 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

Robert Sedgewick and Kevin Wayne • Copyright © 2005 • http://www.Princeton.EDU/~cos2264.5 Symbol Table Applications2Set ADTSet ADT: unordered collection of distinct keys.! Insert a key.! Check if set contains a given key.! Delete a key.SET interface.! add(key) insert the key! contains(key) is given key present?! remove(key) remove the key! iterator() return iterator over all keysQ. How to implement?Java library: java.util.HashSet.3Set Client: Remove DuplicatesRemove duplicates. [e.g., from commercial mailing list]! Read in a key.! If key is not in set, insert and print it out.public class DeDup { public static void main(String[] args) { SET<String> set = new SET<String>(); while (!StdIn.isEmpty()) { String key = StdIn.readString(); if (!set.contains(key)) { set.add(key); System.out.println(key); } } }}4More Set ApplicationsApplication Purpose KeySpell checker Identify misspelled words WordBrowser Highlight previously visited pages URLChess Detect repetition draw Board positionSpam blacklist Prevent spam IP addresses of spammersRobert Sedgewick and Kevin Wayne • Copyright © 2005 • http://www.Princeton.EDU/~cos226Inverted Index6Vectors and MatricesInverted index. Given a list of documents, preprocess so that you canquickly find all documents containing a given query.Ex 1. Book index.Ex 2. Web search engine index.Key. Query word.Value. Set of documents.7Inverted Index Implementationpublic class InvertedIndex { public static void main(String[] args) { ST<String, SET<String>> st; st = new ST<String, SET<String>>(); // create inverted index for (String filename : args) { In in = new In(filename); while (!in.isEmpty()) { String word = in.readString(); if (!st.contains(word)) st.put(word, new SET<String>()); SET<String> set = st.get(word); set.add(filename); } }use SET to avoid duplicates 8Inverted Index Implementation (cont) // read queries from standard input In stdin = new In(); while (!stdin.isEmpty()) { String query = in.readString(); if (!st.contains(query)) System.out.println("NOT FOUND"); else { SET<String> set = st.get(query); for (String filename : set) System.out.print(filename + " "); } System.out.println(); } }}9Inverted IndexExtensions.! Ignore stopwords: the, on, of, etc.! Multi-word queries:– set intersection (AND)– set union (OR).! Record position and number of occurrences of word in document.Robert Sedgewick and Kevin Wayne • Copyright © 2005 • http://www.Princeton.EDU/~cos226Sparse Vectors and Matrices11Vectors and MatricesVector. Ordered sequence of N real numbers.Matrix. N-by-N table of real numbers. ! a = 0 3 15 [ ], b = "1 2 2 [ ]a + b = "1 5 17 [ ]a o b = (0 # "1) + (3 # 2) + (15 # 2) = 36! 0 1 12 4 "20 3 15# $ % % % & ' ( ( ( ) "122# $ % % % & ' ( ( ( =4236# $ % % % & ' ( ( ( ! A =0 1 12 4 "20 3 15# $ % % % & ' ( ( ( , B = 1 0 00 2 00 0 3# $ % % % & ' ( ( ( , A + B = 1 1 12 6 "20 3 18# $ % % % & ' ( ( ( 12SparsityDef. An N-by-N matrix is sparse if it has O(N) nonzeros entries.Empirical fact. Large matrices that arise in practice are usually sparse.Matrix representations.! 2D array: space proportional to N2.! Goal: space proportional to number of nonzeros,without sacrificing fast access to individual elements.Ex. Google performs matrix-vector product with N = 4 billion!13Sparse Vector Implementationpublic class SparseVector { private final int N; private ST<Integer, Double> st = new ST<Integer, Double>(); public SparseVector(int N) { this.N = N; } public void put(int i, double value) { if (value == 0.0) st.remove(i); else st.put(i, value); } public double get(int i) { if (st.contains(i)) return st.get(i); else return 0.0; } a[i] = valuea[i]14Sparse Vector Implementation (cont)// return a ! bpublic double dot(SparseVector b) { SparseVector a = this; double sum = 0.0; for (int i : a.st) if (b.st.contains(i)) sum += a.get(i) * b.get(i); return sum;} // return c = a + bpublic SparseVector plus(SparseVector b) { SparseVector a = this; SparseVector c = new SparseVector(N); for (int i : a.st) c.put(i, a.get(i)); for (int i : b.st) c.put(i, b.get(i) + c.get(i)); return c;}15Sparse Matrix Implementationpublic class SparseMatrix { private final int N; private SparseVector[] rows; public SparseMatrix(int N) { this.N = N; rows = new SparseVector[N]; for (int i = 0; i < N; i++) rows[i] = new SparseVector(N); } public void put(int i, int j, double value) { rows[i].put(j, value); } public double get(int i, int j) { return rows[i].get(j); }A[i][j] = valueA[i][j]N-by-N matrixof all zeros16Sparse Matrix Implementation (cont)// return b = A*xpublic SparseVector times(SparseVector x) { SparseMatrix A = this; SparseVector b = new SparseVector(N); for (int i = 0; i < N; i++) b.put(i, rows[i].dot(x)); return b;}// return C = A + Bpublic SparseMatrix plus(SparseMatrix B) { SparseMatrix A = this; SparseMatrix C = new SparseMatrix(N); for (int i = 0; i < N; i++) C.rows[i] = A.rows[i].plus(B.rows[i]); return C;}Robert Sedgewick and Kevin Wayne • Copyright © 2005 • http://www.Princeton.EDU/~cos226A Plan for Spam18A Plan for SpamBayesian spam filter.! Filter based on analysis of previous messages.! User trains the filter by classifying messages as spam or ham.! Parse messages into tokens (alphanumeric, dashes, ', $)Build data structures.! Hash table A of tokens and frequencies for spam.! Hash table B of tokens and frequencies for ham.! Hash table C of tokens with probability p that they appear in spam.Reference: http://www.paulgraham.com/spam.htmldouble h = 2.0 * ham.freq(word);double s = 1.0 * spam.freq(word);double p = (s/spams) / (h/hams + s/spams);bias probabilities toavoid false positives19A Plan for SpamIdentify incoming email as spam or ham.! Find 15 most interesting tokens (difference from 0.5).! Combine probabilities using Bayes law.! Declare as spam if threshold > 0.9.Details.! Words you've never seen.! Words that appear in ham corpus but not spam corpus, vice versa.! Words that appear less than 5 times in spam and ham corpuses.! Update data structures.which data


View Full Document

Princeton COS 226 - Symbol Table Applications

Documents in this Course
QUICKSORT

QUICKSORT

14 pages

QUICKSORT

QUICKSORT

55 pages

STRINGS

STRINGS

69 pages

Lecture

Lecture

4 pages

STRINGS

STRINGS

18 pages

Hashing

Hashing

7 pages

MERGESORT

MERGESORT

14 pages

Quicksort

Quicksort

12 pages

Strings

Strings

10 pages

Overview

Overview

15 pages

Hashing

Hashing

13 pages

Mergesort

Mergesort

15 pages

Tries

Tries

13 pages

Final

Final

12 pages

Final

Final

12 pages

Mergesort

Mergesort

13 pages

Final

Final

10 pages

Load more
Download Symbol Table Applications
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Symbol Table Applications and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Symbol Table Applications 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?