Berkeley COMPSCI 258 - Accelerating Machine Learning Applications on Graphics Processors

Accelerating Machine Learning Applications on Graphics Processors
Narayanan Sundaram and Bryan Catanzaro
Presented by Narayanan Sundaram
CS258 Parallel Computer Architecture, 5/14/08

Contents: Big Picture; GPUs as proxy for manycore; GPUs are not for everyone; NVIDIA G80 Architecture; NVIDIA GeForce 8800 GTX Specifications; GPU programming - CUDA; Support Vector Machines; SVM Training; Choice of parallel algorithm (among chunking algorithms); Fitting SMO on a GPU; Adaptive heuristic; Results; Overall speedup compared to LIBSVM; SVM Classification; Restructuring the Classification problem; Results (slides 17-18); Is this compute or memory bound?; Importance of memory coalescing; Is SIMD becoming ubiquitous?; Conclusion

Big Picture
[Diagram: pattern-language stack relating the applications (consumer search / face search, a CBIR application with feature extraction & classifier), the application frameworks and patterns (Map Reduce programming framework and pattern, barrier/reduction computation & communication framework and patterns), the CUDA SW infrastructure, and the NVIDIA G80 platform, together with the roles involved (application framework developer, framework developer, searcher, hardware architect).]

GPUs as proxy for manycore
• GPUs are interesting architectures to program
• Transitioning from highly specialized pipelines to general purpose
• The only way to get performance from GPUs is through parallelism (no caching, branch prediction, prefetching, etc.)
• Can launch millions of threads in one call

GPUs are not for everyone
• Memory coalescing is really important
• Irregular memory accesses, even to local stores, are discouraged - up to a 30% performance hit on some apps from local-memory bank conflicts
• Cannot forget that it is a SIMD machine
• Memory consistency is non-existent and inter-SM synchronization is absent
• Threads are hardware scheduled
• 20 us of overhead per kernel call (20,000 instructions at 1 GHz)

NVIDIA G80 Architecture
[Figure: G80 block diagram]

NVIDIA GeForce 8800 GTX Specifications
  Number of Streaming Multiprocessors   16
  Multiprocessor Width                   8
  Local Store Size                       16 KB
  Total number of Stream Processors      128
  Peak SP Floating Point Rate            346 Gflops
  Clock                                  1.35 GHz
  Device Memory                          768 MB
  Peak Memory Bandwidth                  86.4 GB/s
  Connection to Host CPU                 PCI Express
  CPU -> GPU bandwidth                   2.2 GB/s (measured)
  GPU -> CPU bandwidth                   1.7 GB/s (measured)

GPU programming - CUDA
• Each block can have up to 512 threads that synchronize
• Millions of blocks can be issued
• No synchronization between blocks
• No control over scheduling
(A small CUDA sketch of this execution model appears after the Conclusion slide.)

Support Vector Machines
• A hugely popular machine learning technique for classification
• Tries to find a hyperplane separating the different classes with "maximum margin"
• Non-linear surfaces can be generated through non-linear kernel functions
• Uses quadratic programming for training (the specific set of constraints admits a wide variety of solution techniques)

SVM Training
• Quadratic program: [equation not preserved in this preview; a standard formulation is reconstructed below]
• Some kernel functions: [not preserved; see below]
Variables:
  α: weight for each training point (determines the classifier)
Data:
  l: number of training points
  C: trades off error on the training set against generalization performance
  y: label (+/- 1) for each training point
  x: training points
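
The equations on the "SVM Training" slide were images and are missing from this text preview. As a hedged reconstruction, the standard C-SVM dual consistent with the variable definitions above (alpha, l, C, y, x) is:

\begin{align*}
\max_{\alpha}\quad & \sum_{i=1}^{l} \alpha_i \;-\; \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l} \alpha_i \alpha_j\, y_i y_j\, K(x_i, x_j) \\
\text{subject to}\quad & 0 \le \alpha_i \le C, \qquad i = 1, \dots, l \\
& \sum_{i=1}^{l} \alpha_i y_i = 0
\end{align*}

Commonly used kernel functions K(x_i, x_j) (the slide's exact list is not preserved) include:

\begin{align*}
\text{linear:}\quad & x_i \cdot x_j \\
\text{polynomial:}\quad & (a\, x_i \cdot x_j + r)^d \\
\text{radial basis function:}\quad & \exp\!\left(-\gamma\, \lVert x_i - x_j \rVert^2\right) \\
\text{sigmoid:}\quad & \tanh(a\, x_i \cdot x_j + r)
\end{align*}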
Choice of parallel algorithm (among chunking algorithms)
[Comparison of chunking algorithms; Sequential Minimal Optimization (SMO) is the algorithm chosen.]

Fitting SMO on a GPU
• The GPU's shared-memory constraints fit the algorithm well, as only two vectors need to be shared among all the threads
• Performance is strongly dependent on the choice of the working set
• Several heuristics have been proposed - two are popular (1st and 2nd order)
• The 2nd order heuristic is almost twice as costly per iteration, but saves on the number of iterations

Adaptive heuristic
• Both heuristics can be expressed as a series of "Map Reduce" stages
• A Map Reduce code generator was used to generate the code
• Sample periodically and adapt, using whichever heuristic is converging fastest at any given time
• Tightly coupled map-reduces are essential for machine learning algorithms
• Cannot afford the overhead of a general library call when it is invoked millions of times

Results
[Chart: training results, normalized to the 1st order heuristic]

Overall speedup compared to LIBSVM
[Chart]

SVM Classification
• The SVM classification task involves finding which side of the hyperplane a point lies on
• Specifically, where [equation not preserved in this preview]
• Insight: instead of doing this serially for all points, the evaluation over the whole test set can be restructured (next slide; a batched sketch appears at the end of these notes)

Restructuring the Classification problem
[Diagram: the same Output computed from SV and Test Data two ways - per-test-point evaluation versus a single restructured computation over Test Data and SV]

Results (slides 17-18)
[Charts]

Is this compute or memory bound?
• GPUs are better for memory-bound jobs (observed 7 GB/s vs. 1 GB/s for other streaming-like apps)

Importance of memory coalescing
• To avoid non-coalesced memory accesses, both Data and its transpose (DataT) were carried into GPU memory (an access-pattern sketch appears at the end of these notes)
• Letting 0.05% of memory accesses be non-coalesced led to a 21% drop in performance in one case
• Well-written code should scale with GPU size (parallelism should be limited by problem size, not machine size)

Is SIMD becoming ubiquitous?
• SIMD is already important for performance on uniprocessor systems
• Task vs. data parallelism
• Intel's new GPU has wide SIMD
• CUDA lesson: runtime SIMD binding is easier for programmers
• Non-SIMD code leads to a performance penalty, not incorrect programs - this prevents premature optimization and keeps code flexible

Conclusion
• GPUs and manycore CPUs are on a collision course
• Data parallelism on GPUs or task parallelism on CPUs
• Rethink serial control and data structures
• Sequential optimizations may harm parallelism
• Machine learning can use a lot of parallel hardware if the software is engineered properly
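
Sketch 1 (not from the original slides): a minimal CUDA example of the execution model on the "GPU programming - CUDA" slide - up to 512 threads per block that can synchronize with __syncthreads(), millions of independently scheduled blocks in one launch, and no synchronization between blocks. Kernel and variable names are hypothetical.

    #include <cuda_runtime.h>

    // Each block cooperatively reduces its slice of the input. Threads within a
    // block may synchronize; blocks may not, and the hardware schedules them.
    __global__ void block_sum(const float *in, float *block_out, int n) {
        __shared__ float partial[256];            // lives in the 16 KB per-SM local store
        int tid = threadIdx.x;
        int i   = blockIdx.x * blockDim.x + tid;
        partial[tid] = (i < n) ? in[i] : 0.0f;
        __syncthreads();                          // barrier is valid only within one block
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (tid < stride) partial[tid] += partial[tid + stride];
            __syncthreads();
        }
        if (tid == 0) block_out[blockIdx.x] = partial[0];   // one partial sum per block
    }

    int main() {
        const int n = 1 << 20, threads = 256;             // threads per block <= 512 on G80
        const int blocks = (n + threads - 1) / threads;   // block count can reach millions
        float *in = 0, *block_out = 0;
        cudaMalloc((void **)&in, n * sizeof(float));
        cudaMalloc((void **)&block_out, blocks * sizeof(float));
        cudaMemset(in, 0, n * sizeof(float));             // stand-in for real input data
        block_sum<<<blocks, threads>>>(in, block_out, n); // one call launches every block
        cudaDeviceSynchronize();
        cudaFree(in);
        cudaFree(block_out);
        return 0;
    }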

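Sketch 2 (not from the original slides): a hedged illustration of the batching behind the "SVM Classification" and "Restructuring the Classification problem" slides. The decision value for a test point z is sum_s alpha_s y_s K(x_s, z) + b; computing the test-point / support-vector dot products for the whole test set at once is one dense matrix product, after which the kernel function and weighted sum are applied per test point. A real implementation would use a tuned SGEMM (e.g. cuBLAS) instead of the naive first kernel; all names here are hypothetical.

    // Stage 1: D[t*numSV + s] = dot(test[t], sv[s]) for every (test point, support
    // vector) pair - a dense matrix product over the whole test set.
    __global__ void dot_products(const float *test, const float *sv, float *D,
                                 int numTest, int numSV, int dim) {
        int s = blockIdx.x * blockDim.x + threadIdx.x;   // support-vector index
        int t = blockIdx.y * blockDim.y + threadIdx.y;   // test-point index
        if (t >= numTest || s >= numSV) return;
        float acc = 0.0f;
        for (int k = 0; k < dim; ++k)
            acc += test[t * dim + k] * sv[s * dim + k];
        D[t * numSV + s] = acc;
    }
    // Example launch:
    //   dot_products<<<dim3((numSV + 15) / 16, (numTest + 15) / 16), dim3(16, 16)>>>(...);

    // Stage 2 (RBF kernel): decision[t] = sum_s alphaY[s] * exp(-gamma * ||test_t - sv_s||^2) + b,
    // where alphaY[s] = alpha_s * y_s and the squared norms are precomputed on the host.
    __global__ void rbf_decision(const float *D, const float *alphaY,
                                 const float *testNorm2, const float *svNorm2,
                                 float *decision, int numTest, int numSV,
                                 float gamma, float b) {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t >= numTest) return;
        float sum = 0.0f;
        for (int s = 0; s < numSV; ++s) {
            float dist2 = testNorm2[t] + svNorm2[s] - 2.0f * D[t * numSV + s];
            sum += alphaY[s] * __expf(-gamma * dist2);
        }
        decision[t] = sum + b;   // the predicted class is the sign of this value
    }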

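Sketch 3 (not from the original slides): the access-pattern point behind the "Importance of memory coalescing" slide. With a row-major matrix, consecutive threads reading along a row touch consecutive addresses and coalesce; reading down a column strides by the row length and does not. Keeping both Data and its transpose resident, as the slide describes, turns the column read back into a coalesced one. Names are hypothetical.

    // A is row-major (rows x cols); A_T is its transpose (cols x rows).

    // Coalesced: thread j reads A[r*cols + j], so a warp touches consecutive addresses.
    __global__ void read_row(const float *A, float *out, int r, int cols) {
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        if (j < cols) out[j] = A[r * cols + j];
    }

    // Non-coalesced: thread i reads A[i*cols + c]; consecutive threads are `cols`
    // floats apart, so each warp access is scattered across device memory.
    __global__ void read_col_strided(const float *A, float *out, int c, int rows, int cols) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < rows) out[i] = A[i * cols + c];
    }

    // With the transposed copy resident, the same logical column read becomes a
    // stride-1 row read of A_T and coalesces again.
    __global__ void read_col_from_transpose(const float *A_T, float *out, int c, int rows) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < rows) out[i] = A_T[c * rows + i];
    }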