Berkeley COMPSCI 258 - Accelerating Machine Learning Applications on Graphics Processors

This preview shows pages 1-3 of the 10-page document.

Accelerating Machine Learning Applications on Graphics Processors

Narayanan Sundaram & Bryan Catanzaro
Department of Electrical Engineering and Computer Sciences
University of California at Berkeley, CA, USA
{narayans,catanzar}@eecs.berkeley.edu

Abstract

Recent developments in programmable, highly parallel Graphics Processing Units (GPUs) have enabled high performance implementations of machine learning algorithms. We describe a solver for Support Vector Machine training running on a GPU, using Platt's Sequential Minimal Optimization algorithm and an adaptive first and second order working set selection heuristic, which achieves speedups of 9-35× over LIBSVM running on a traditional processor. We also present a GPU-based system for SVM classification which achieves speedups of 63-133× over LIBSVM.

1 Introduction

The computing industry is facing one of its most significant challenges in recent years: the transition from single-processor to manycore CPUs. This change has been driven by the continued growth in transistor density dictated by Moore's law and by the dwindling returns from increasing uniprocessor complexity. Managing power and design complexity also pushes architects toward multiple-processors-on-a-chip designs. Building successful, scalable, and reliable software in this model is a challenge [1].

Machine learning algorithms are widely used in information retrieval, bioinformatics, and speech and image recognition systems. These tasks are important future workloads, as identified in Intel's Recognition-Mining-Synthesis (RMS) categorization [2], in which machine learning is a dominant field. Machine learning is therefore an area that can benefit from the increased computational resources made available by parallelism.

Graphics processors are currently transitioning from their initial role as specialized accelerators for triangle rasterization to general purpose engines for high throughput floating-point computation. Because they still serve the large gaming industry, they are ubiquitous and relatively inexpensive. Interestingly, GPUs are massively parallel computers whose use for general purpose computation had long been hampered by the difficulty of programming them. With the recent trend towards more programmable GPUs, they appear to be on a collision course with manycore CPU architectures, and we believe the successes and failures in porting applications to GPUs will also inform manycore CPU architects and application developers.

In this paper, we show how Support Vector Machine training and classification can be adapted to a highly parallel, yet widely available and affordable computing platform: the graphics processor, or more specifically the Nvidia GeForce 8800 GTX, and we detail the performance gains achieved.

The paper is organized as follows. Section 2 gives an overview of the architectural and programming features of the GPU. Section 3 briefly describes the SVM training and classification problems. Section 4 presents the details of the parallel SVM training and classification implementation on the GPU. Results are presented in Section 5. Section 6 reviews related work in SVM parallelization and in general purpose computation on GPUs. We conclude in Section 7.

2 GPU Architecture

GPU architectures are specialized for compute-intensive, memory-intensive, highly parallel computation, and are therefore designed so that more resources are devoted to data processing than to caching or control flow.
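As a concrete illustration of this design point, the minimal CUDA sketch below maps one lightweight thread to each data element and launches enough thread blocks to cover the whole array; the GPU keeps thousands of such threads in flight to hide memory latency. This sketch is ours, not the paper's: the kernel name saxpy, the array size, and the block size are arbitrary placeholders.

    // Minimal illustrative CUDA example (not from the paper): one thread per element.
    #include <cuda_runtime.h>

    __global__ void saxpy(int n, float a, const float *x, float *y) {
        // Global index of this thread across the whole grid.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];   // each thread handles exactly one element
    }

    int main() {
        const int n = 1 << 20;                 // 1M elements (placeholder size)
        const size_t bytes = n * sizeof(float);
        float *d_x, *d_y;
        cudaMalloc(&d_x, bytes);
        cudaMalloc(&d_y, bytes);
        // ... initialize d_x and d_y, e.g. with cudaMemcpy from host arrays ...

        // 256 threads per block, and enough blocks to cover all n elements,
        // so a single launch exposes roughly a million concurrent threads.
        int threadsPerBlock = 256;
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        saxpy<<<blocks, threadsPerBlock>>>(n, 2.0f, d_x, d_y);
        cudaDeviceSynchronize();

        cudaFree(d_x);
        cudaFree(d_y);
        return 0;
    }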
State of the art GPUs provide up to an order of magnitude more peak IEEE single-precision floating-point throughput than their CPU counterparts. Additionally, GPUs have much more aggressive memory subsystems, typically providing more than 10x the memory bandwidth of a CPU. Peak performance is usually impossible to achieve on general purpose applications, yet capturing even a fraction of peak performance yields significant speedup.

GPU performance depends on finding high degrees of parallelism: a typical computation running on the GPU must express thousands of threads in order to use the hardware effectively. As such, we consider the GPU an example of future "many-core" processing [1]. Algorithms for machine learning applications will need to express this much parallelism in order to utilize many-core processors, and applications which do not will not continue to improve in performance on newer computing platforms at the rates we have enjoyed in the past. Finding large scale parallelism is therefore essential for compute performance in the future, and programming for GPUs is indicative of the future many-core programming experience. Since the individual processors in the GPU are very simple (no branch prediction, caches, prefetch mechanisms, etc.), the only way to get performance is through parallelism.

2.1 Nvidia GeForce 8800 GTX

In this project, we employ the NVIDIA GeForce 8800 GTX GPU, an instance of the G80 architecture and a standard GPU widely available on the market. Pertinent facts about the platform are given in Table 1; we refer the reader to the Nvidia CUDA reference manual for more details [3].

[Figure 1: NVIDIA GeForce 8800 GPU architecture. The device contains N multiprocessors, each with M processors and their registers, an instruction unit, a shared memory (local store), a constant cache, and a texture cache, all connected to the device memory.]

Table 1: Nvidia GeForce 8800 GTX parameters

  Number of multiprocessors         16
  Multiprocessor width              8
  Number of stream processors       128
  Peak general purpose IEEE SP      346 GFLOPS
  Multiprocessor local store size   16 KB
  Clock rate                        1.35 GHz
  Memory capacity                   768 MB
  Memory bandwidth                  86.4 GB/s
  CPU-GPU bandwidth                 3.2 Gbit/s

The GPU has a local store (referred to as shared memory), which can act as a software managed cache. The GPU is essentially a SIMD machine: it executes 32 threads (called a warp) per instruction issue. Since instruction issue never happens at less than warp granularity, a conditional statement (such as if-then-else) executes both paths whenever the threads of a warp do not all take the same branch; this leads to code divergence. Loads and stores to the device memory are issued for 16 threads at a time. If the memory addresses involved in such an issue are consecutive, the requests are coalesced into one large request and make the most efficient use of the memory system; otherwise, the memory requests are serialized, leading to slowdowns. Extracting the maximum performance from the GPU involves efficient utilization …
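The divergence and coalescing rules above are easiest to see in code. The following short CUDA kernels are our own illustration rather than an excerpt from the paper; the kernel names and the stride parameter are hypothetical.

    #include <cuda_runtime.h>

    // Branch divergence: if the threads of one warp evaluate the condition
    // differently, the warp executes BOTH paths, masking off the threads
    // that did not take each branch.
    __global__ void divergent(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (in[i] > 0.0f)
            out[i] = 2.0f * in[i];   // path taken by some threads of the warp
        else
            out[i] = 0.0f;           // path taken by the others
    }

    // Coalesced access: thread k touches element k, so the 16 loads of a
    // half-warp fall on consecutive addresses and combine into one request.
    __global__ void copy_coalesced(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];
    }

    // Strided access: neighbouring threads touch addresses 'stride' elements
    // apart, so the half-warp's loads cannot be coalesced and are serialized.
    __global__ void copy_strided(const float *in, float *out, int n, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i * stride < n)
            out[i] = in[i * stride];
    }

On G80-class hardware, restructuring data layouts so that consecutive threads read consecutive addresses is typically among the most profitable optimizations.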

