OpenCL

Introduction
- Open standard for parallel programming across heterogeneous devices
- Devices can consist of CPUs, GPUs, embedded processors, etc.; uses all the processing resources available
- Includes a language based on C99 for writing kernels and an API used to define and control the devices
- Parallel computing through task-based and data-based parallelism

OpenCL Design Goals
- Use all computational resources in the system
- Platform independence
- Provide a data- and task-parallel computational model
- Provide a programming model that abstracts the specifics of the underlying hardware
- Specify accuracy of floating-point computations
- Support both desktop and handheld/portable devices

OpenCL Platform Model
- Host connected to one or more OpenCL devices
- Device consists of one or more cores
- Execution per processor may be SIMD or SPMD
- Contexts group together devices and enable inter-device communication
[Figure: host connected to Device A (CPU), Device B (GPU), and Device C (DSP), with devices grouped into contexts]

OpenCL Programming Model
- Kernel: basic unit of execution; data parallel
- Program: collection of kernels and other related functions
- Kernels executed across a collection of work-items; one work-item per computation
- Work-items grouped together into workgroups
- Workgroups executed together on one device
- Multiple workgroups are executed independently
- Applications queue kernel instances for execution in order, but they may be executed in-order or out-of-order

OpenCL Task-Parallel Kernels
- Some compute devices can also execute task-parallel kernels
- Execute as a single work-item
- Implemented as either a kernel in OpenCL C or a native C/C++ function

OpenCL Memory Model
- Private memory is available per work-item
- Local memory shared within workgroup
- No synchronization between workgroups
- Synchronization possible between work-items in a workgroup
- Global/constant memory for access by work-items; not synchronized
- Host memory: access through the CPU
- Memory management is explicit
- Data should be moved from host -> global -> local and back

OpenCL Objects
- Devices: multiple cores on CPU/GPU together taken as a single device
  - Kernels executed across all cores in a data-parallel manner
- Contexts: enable sharing between different devices
  - Devices must be within the same context to be able to share
- Queues: used for submitting work, one per device
- Buffers: simple chunks of memory like arrays; read-write access
- Images: 2D/3D data structures
  - Access using the read_image and write_image built-ins (e.g. read_imagef(), write_imagef())
  - Either read or write within a kernel, but not both

OpenCL Kernel Objects
- Declared with a kernel qualifier
- Encapsulate a kernel function
- Kernel objects are created after the executable is built
- Execution
  - Set the kernel arguments
  - Enqueue the kernel
- Kernels are executed asynchronously
- Events used to track the execution status
  - Used for synchronizing execution of two kernels
  - clWaitForEvents(), clEnqueueMarker(), etc.

OpenCL Program Objects
- Encapsulate
  - A program source/binary
  - List of devices and latest successfully built executable for each device
  - List of kernel objects
- Kernel source specified as a string can be provided and compiled at runtime using clCreateProgramWithSource(); this gives platform independence
- Overhead: compiling programs can be expensive
  - OpenCL allows for reusing precompiled binaries

Overall Pipeline

OpenCL C Language
- Derived from ISO C99
  - No standard headers, function pointers, recursion, variable-length arrays, bit fields
- Added features: work-items, workgroups, vector types, synchronization
- Address space qualifiers
- Optimized image access
- Built-in functions specific to OpenCL
- Data types
  - char, uchar, short, ushort, int, uint, long, ulong
  - bool, intptr_t, ptrdiff_t, size_t, uintptr_t, half
  - image2d_t, image3d_t, sampler_t
  - Vector types: portable, varying length (2, 4, 8, 16), endian safe
  - char2, ushort4, int8, float16, double2, etc.

OpenCL C Language
- Work-item and workgroup functions
  - get_work_dim(), get_global_size()
  - get_group_id(), get_local_id()
- Vector operations and components are pre-defined as a language feature
- Kernel functions
  - get_global_id(): returns the global ID of the calling work-item
- Conversions
  - Explicit: convert_destType<_sat><_roundingMode>
  - Reinterpret: as_destType
  - Scalar and pointer conversions follow C99 rules
  - No implicit conversions/casts for vector types

OpenCL C Language
- Address spaces
  - Kernel pointer arguments must use global, local or constant
  - Default for local variables is private
  - image2d_t and image3d_t are always in global address space
  - Global (program-scope) variables must be in constant address space
  - Casting between different address spaces is undefined

Summary
- Portable and high-performance framework
  - Computationally intensive algorithms
  - Access to all computational resources
  - Well-defined memory/computational model
- An efficient parallel programming language
  - C99 with extensions for task and data parallelism
  - Set of built-in functions for synchronization, math and memory operations
- Open standard for parallel computing across heterogeneous collection of
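
Example: work-item indexing, emulated in plain C

The relationship between get_group_id(), get_local_id() and get_global_id() in the programming model can be sketched on the host without any OpenCL runtime. The code below is a minimal emulation, not OpenCL API code: the names global_id, vec_add_item and run_vec_add are hypothetical helpers introduced for illustration, the range is one-dimensional, the global offset is assumed to be zero, and the global size is assumed to be a multiple of the workgroup size. The nested loops run sequentially, whereas a real device may run work-items and workgroups concurrently and in any order.

```c
#include <assert.h>
#include <stddef.h>

/* For comparison, an OpenCL C kernel doing the same work might look like:
 *
 *   __kernel void vec_add(__global const float *a,
 *                         __global const float *b,
 *                         __global float *out)
 *   {
 *       size_t i = get_global_id(0);
 *       out[i] = a[i] + b[i];
 *   }
 */

/* Hypothetical helper: the global ID a work-item sees in one dimension,
 * assuming a zero global offset. */
size_t global_id(size_t group_id, size_t local_size, size_t local_id)
{
    return group_id * local_size + local_id;
}

/* Body of the data-parallel "kernel": one work-item handles one element. */
void vec_add_item(size_t gid, const float *a, const float *b, float *out)
{
    out[gid] = a[gid] + b[gid];
}

/* Hypothetical host-side emulation of a 1-D launch: the outer loop walks
 * the workgroups, the inner loop walks the work-items of one workgroup. */
void run_vec_add(size_t global_size, size_t local_size,
                 const float *a, const float *b, float *out)
{
    /* Assumption of this sketch: workgroups tile the range exactly. */
    assert(local_size > 0 && global_size % local_size == 0);

    size_t num_groups = global_size / local_size;
    for (size_t group = 0; group < num_groups; ++group) {
        for (size_t local = 0; local < local_size; ++local) {
            vec_add_item(global_id(group, local_size, local), a, b, out);
        }
    }
}
```

Calling run_vec_add(8, 4, a, b, out) on 8-element arrays emulates two workgroups of four work-items each, with every work-item writing exactly one output element.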