Unformatted text preview:

Scaling up Na ve Bayes on Hadoop What How and Why Big ML c 2001 Banko Brill Scaling to Very Very Large ACL 2001 Task distinguish pairs of easily confused words affect vs effect in context Some other hot BigData topics Scaling up SGD parallelizing the updates asynchronous SGD and parameter servers Other models for parallelization Graph based machine learning Using less memory Lossy compression Bloom filters Fast nearest neighbors Locality sensitive hash functions LSH Some other hot BigData topics These are all covered in depth in my other course 10 605 Machine Learning from Large Datasets MW 1 30 3 00 next semester Today I m going to talk about learning in Hadoop The Na ve Bayes classifier v2 Dataset each example has A unique id id Why For debugging the feature extractor d attributes X1 Xd Each Xi takes a discrete value in dom Xi One class label Y in dom Y You have a train dataset and a test dataset The Na ve Bayes classifier v2 You have a train dataset and a test dataset Initialize an event counter hashtable C For each example id y x1 xd in train C Y ANY C Y y For j in 1 d C Y y X xj For each example id y x1 xd in test For each y in dom Y d Compute Pr y x1 xd Pr X x j Y y Pr Y y j 1 d Pr X x j Y y Pr Y y Pr Y y j 1 Return the best y The Na ve Bayes classifier v2 You have a train dataset and a test dataset Initialize an event counter hashtable C For each example id y x1 xd in train C Y ANY C Y y For j in 1 d C Y y X xj For each example id y x1 xd in test For each y in dom Y Compute log Pr y x1 xd where qj 1 V qy 1 dom Y m 1 C X x j Y y mqx C Y y mq y log log C Y y m C Y ANY m j Return the best y Complexity of Na ve Bayes Sequential reads You have a train dataset and a test dataset Initialize an event counter hashtable C Complexity For each example id y x1 xd in train O n n size of C Y ANY C Y y train For j in 1 d C Y y X xj For each example id y x1 xd in test For each y in dom Y Compute log Pr y x1 xd where qj 1 V qy 1 dom Y m 1 Sequential reads C X x j Y y mqx C Y y mq y log log C Y y m C Y ANY m j Return the best y Complexity O dom Y n n size of test Na ve Bayes v2 This is one example of a streaming classifier Each example is only read only once You can update a classifier after every example and can perform classifications at any point Memory is minimal O n Ideally it would be constant Traditionally less than O sqrt N Order doesn t matter Nice because we may not control the order of examples in real life This is a hard one to get a learning system to have There are few competitive learning methods that as stream y as na ve Bayes Scaling up Na ve Bayes Next DONE What How and Why Assume hashtable holding all counts fits in memory Complexity of Na ve Bayes You have a train dataset and a test dataset Initialize an event counter hashtable C For each example id y x1 xd in train C Y ANY C Y y For j in 1 d C Y y X xj For each example id y x1 xd in test For each y in dom Y Compute log Pr y x1 xd C X x j Y y mqx C Y y mq y log log C Y y m C Y ANY m j Return the best y Big ML c 2001 Banko Brill Scaling to Very Very Large ACL 2001 Task distinguish pairs of easily confused words affect vs effect in context What s next How to implement Na ve Bayes Assuming the event counters do not fit in memory This is a core problem in big data processing Large scale text classification challenge O 1 000 000 documents O 100 000 categories Or classify web pages as product nonProduct product review product offer Disk based database Memory based distributed database Stream and sort design pattern Hadoop What s next How to implement Na ve Bayes Assuming the event counters do not fit in memory Possible approaches Use a database or at least a key value store Numbers Jeff Dean says Everyone Should Know 10x 15x 40x 100 000x 100k sec day 100 hr decade Who Compilers don t warn Jeff Dean Jeff Dean warns compilers Jeff Dean builds his code before committing it but only to check for compiler and linker bugs Jeff Dean writes directly in binary He then writes the source code as a documentation for other developers Jeff Dean once shifted a bit so hard it ended up on another computer When Jeff Dean has an ergonomic evaluation it is for the protection of his keyboard gcc O4 emails your code to Jeff Dean for a rewrite When he heard that Jeff Dean s autobiography would be exclusive to the platform Richard Stallman bought a Kindle Jeff Dean puts his pants on one leg at a time but if he had more legs you d realize the algorithm is actually only O logn A single large file can be spread out among many non adjacent blocks sectors and then you need to seek around to scan the contents of the file Question What could you do to reduce this cost Distributed Counting example 1 example 2 example 3 Counting logic Machine 0 increment C x by D Hash table1 Machine 1 Hash table2 Machine 2 New issues Machines and memory cost table2 Routing incrementHash requests to right machine Sending increment requests across the network Communication Machine K Numbers Jeff Dean says Everyone Should Know 10x 15x 40x 100 000x What s next How to implement Na ve Bayes Assuming the event countersO n scan do not fit in memory O n scan n send Possible approaches Use a memory based distributed database Extra cost Communication costs O n but that s ok Extra complexity routing requests correctly Great ideas Compress the counter hash table which lead to Use integers as keys instead of strings interesting Use approximate counts work eg randomized Discard infrequent unhelpful words Trade off time for space somehow algorithms Observation if the counter updates were better ordered we could avoid using disk Large vocabulary Na ve Bayes Counting One way trade off time for space Assume you need K times as much memory as you actually have Method Construct a hash function h event For i 0 K 1 Scan thru the train dataset Increment counters for event …


View Full Document

CMU MLG 10601 - 1204bigdata-nb

Loading Unlocking...
Login

Join to view 1204bigdata-nb and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view 1204bigdata-nb and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?