Parallel Buffers for Chip Multiprocessors

John Cieslewicz*† ([email protected])
Kenneth A. Ross† ([email protected])
Ioannis Giannakakis† ([email protected])
Columbia University

ABSTRACT

Chip multiprocessors (CMPs) present new opportunities for improving database performance on large queries. Because CMPs often share execution, cache, or bandwidth resources among many hardware threads, implementing parallel database operators that efficiently share these resources is key to maximizing performance. A crucial aspect of this parallelism is managing concurrent, shared input and output to the parallel operators. In this paper we propose and evaluate a parallel buffer that enables intra-operator parallelism on CMPs by avoiding contention between hardware threads that need to concurrently read or write to the same buffer. The parallel buffer handles parallel input and output coordination as well as load balancing so individual operators do not need to reimplement that functionality.

1. INTRODUCTION

Modern database systems process queries by constructing query plans. A plan consists of a collection of operators connected to each other by data buffers. The output from one operator is fed as input to another operator.

Recently, microprocessor designs have shifted from fast uniprocessors that exploit instruction level parallelism to chip multiprocessors that exploit thread level parallelism. Because of power and design issues that reduce the performance improvement obtainable from faster uniprocessors, improved performance now depends on taking advantage of on-chip parallelism by writing applications with a high degree of thread level parallelism [11].

In this paper, we focus on query plans running on a chip multiprocessor. On such a machine, many concurrent threads may cooperate to collaboratively perform a single database operation [20, 6]. The advantage of using the available parallelism in this way is improved locality. Instructions and some data structures are shared, leading to good cache behavior. In contrast, treating each thread as a parallel processor that performs independent tasks can lead to cache interference [20].

*Supported by a U.S. Department of Homeland Security Graduate Research Fellowship.
†Supported by NSF grant IIS-0534389.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Proceedings of the Third International Workshop on Data Management on New Hardware (DaMoN 2007), June 15, 2007, Beijing, China.
Copyright 2007 ACM 978-1-59593-772-8 ...$5.00.

In a collaborative design, an operator would be executed by all threads for a certain time-slice. The time-slice needs to be long enough to amortize the initial compulsory misses on the instruction and data caches, as well as the context-switch costs. A time-slice might end when either (a) a time-window expires, (b) the input is fully consumed, or (c) the output buffer becomes full.

A critical question for a system employing parallel operators is the design of buffers for passing data between operators. A naïve choice can lead to hot-spot points of contention that prevent the operator from working at its full capacity [3]. For example, if all threads write output records to a common output array, then there will be contention for the mutex on the pointer to the current position within the array.

One way to avoid output contention is to give each thread its own output array [3, 20]. While this choice avoids contention, it has other disadvantages. The next operator that consumes the data has to be aware of the partitioned nature of its input.
This can lead to a relatively complex implementation for all operators, because they have to take into account run-time information such as the number of available threads, which may vary during the course of query execution. An output partition could, in the worst case, grow much faster than the others. For instance, consider a parallel range selection operator where each thread runs on a portion of the data that has been partitioned by the attribute being selected. As a result, each output thread must be pessimistic in allocating output space consistent with worst-case behavior. Such allocation can waste memory resources, particularly when there are many threads.

There is no guarantee that the separate output partitions will be balanced. Therefore, the consuming operator also becomes responsible for load balancing. If the operator does not balance the load, then it is possible for a single slow thread to cause all other threads to stall for long periods, substantially reducing the effective data parallelism. We take a closer look at load balancing in Section 5.4.

Simply put, a challenge for multithreaded database and other similar data intensive workloads is that each thread may consume different amounts of input, generate different amounts of output, and take significantly different amounts of time to run. In this paper, we propose a buffer structure that avoids these pitfalls, while still minimizing contention for a shared buffer. Our solution has the following desirable properties:

• The output and input structures are the same, so that arbitrary operators can be composed.

• The buffer is allocated from a single array, so that memory can be allocated in proportion to the expected output of all threads rather than the worst-case output.

• Data records are processed in chunks. Mutex or atomic operations are required only for whole chunks. By choosing sufficiently large chunks, contention can be minimized.

• Parallel utilization is high. In particular, no thread is stalled for longer than the smaller of the input-chunk processing time and the time to generate one output-chunk. Utilization is high even when there is an imbalance in the rate of progress made by the various threads.

• Operators use simple get_chunk and put_chunk abstractions, and do not have to re-implement locking or load-balancing functions.

We evaluate our parallel buffer data structure on real hardware, the Sun Microsystems UltraSPARC T1. The T1 is a chip multiprocessor with eight cores and four hardware threads per core for a total of 32 threads on one chip.

