CS 152 Computer Architecture and Engineering
Lecture 18: Multithreading

Krste Asanovic
Electrical Engineering and Computer Sciences
University of California, Berkeley
http://www.eecs.berkeley.edu/~krste
http://inst.cs.berkeley.edu/~cs152

4/15/2008


Last Time: Vector Computers

• Vectors provide efficient execution of data-parallel loop code
• A vector ISA provides a compact encoding of machine parallelism
• Vector ISAs scale to more lanes without changing binary code
• Vector registers provide fast temporary storage to reduce memory bandwidth demands, and simplify dependence checking between vector instructions
• Scatter/gather, masking, and compress/expand operations increase the set of vectorizable loops
• Extensive compiler analysis (or programmer annotation) is required to be certain that loops can be vectorized
• Full "long" vector support is still found only in supercomputers (NEC SX8R, Cray X1E); microprocessors have limited "short" vector operations
  – Intel x86 MMX/SSE/AVX
  – IBM/Motorola PowerPC VMX/Altivec


Vector Conditional Execution

Problem: we want to vectorize loops with conditional code:

  for (i = 0; i < N; i++)
    if (A[i] > 0) A[i] = B[i];

Solution: add vector mask (or flag) registers
  – the vector version of predicate registers, 1 bit per element
...and maskable vector instructions
  – a vector operation becomes a NOP at elements where the mask bit is clear

Code example:

  CVM             # Turn on all elements
  LV vA, rA       # Load entire A vector
  SGTVS.D vA, F0  # Set bits in mask register where A > 0
  LV vA, rB       # Load B vector into A under mask
  SV vA, rA       # Store A back to memory under mask


Masked Vector Instructions

[Figure: two datapaths in which mask bits M[0..7] control whether each result C[i] = A[i] op B[i] is written]
• Simple implementation: execute all N operations, and turn result writeback on or off per element according to the mask (a write enable per element)
• Density-time implementation: scan the mask vector and execute only the elements with non-zero mask bits


Compress/Expand Operations

• Compress packs the elements whose mask bit is set contiguously at the start of the destination vector register
  – the population count of the mask vector gives the packed vector length
• Expand performs the inverse operation
• Used for density-time conditionals and also for general selection operations

[Figure: with mask bits M[1], M[4], M[5], M[7] set, compress gathers A[1], A[4], A[5], A[7] to the front of the destination register; expand scatters them back to those masked positions]


Vector Reductions

Problem: loop-carried dependence on reduction variables:

  sum = 0;
  for (i = 0; i < N; i++)
    sum += A[i];              # Loop-carried dependence on sum

Solution: re-associate the operations if possible, and use a binary tree to perform the reduction:

  # Rearrange as:
  sum[0:VL-1] = 0               # Vector of VL partial sums
  for (i = 0; i < N; i += VL)   # Stripmine VL-sized chunks
    sum[0:VL-1] += A[i:i+VL-1]; # Vector sum
  # Now have VL partial sums in one vector register
  do {
    VL = VL / 2;                  # Halve vector length
    sum[0:VL-1] += sum[VL:2*VL-1] # Halve number of partials
  } while (VL > 1);
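The reduction pseudocode above maps directly to scalar C. The sketch below is an editor's illustration rather than anything from the lecture: the names sum, A, and VL follow the pseudocode, VL is assumed to be a power of two, and N is assumed to be a multiple of VL to keep the stripmine loop short.

  #include <stdio.h>

  #define VL 8   /* vector length: number of partial sums (power of two) */

  /* Sum A[0..n-1]; assumes n is a multiple of VL for brevity. */
  double vector_sum(const double *A, int n)
  {
      double sum[VL] = {0.0};           /* VL independent partial sums */

      /* Stripmine: each pass is one VL-wide elementwise add, so the
         loop-carried dependence is between passes, not elements. */
      for (int i = 0; i < n; i += VL)
          for (int j = 0; j < VL; j++)
              sum[j] += A[i + j];

      /* Binary tree: halve the number of partial sums each step. */
      for (int vl = VL / 2; vl >= 1; vl /= 2)
          for (int j = 0; j < vl; j++)
              sum[j] += sum[j + vl];

      return sum[0];
  }

  int main(void)
  {
      double A[16];
      for (int i = 0; i < 16; i++) A[i] = i + 1;  /* 1 + 2 + ... + 16 */
      printf("%g\n", vector_sum(A, 16));          /* prints 136 */
      return 0;
  }

One caveat hiding in the slide's "if possible": re-associating floating-point additions changes rounding, so the tree reduction can give a slightly different result than the sequential loop.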
Multimedia Extensions (aka SIMD Extensions)

• Very short vectors added to existing microprocessor ISAs
• Use the existing 64-bit registers, split into 2x32b, 4x16b, or 8x8b elements
  – the concept was first used on the Lincoln Labs TX-2 computer in 1957, with a 36b datapath split into 2x18b or 4x9b
  – newer designs have 128-bit registers (PowerPC Altivec, Intel SSE2/3/4)
• A single instruction operates on all elements within the register

[Figure: a 64b register viewed as 2x32b, 4x16b, or 8x8b elements; a partitioned adder performs four 16b adds with one instruction]


Multimedia Extensions versus Vectors

• Limited instruction set:
  – no vector length control
  – no strided load/store or scatter/gather
  – unit-stride loads must be aligned to a 64/128-bit boundary
• Limited vector register length:
  – requires superscalar dispatch to keep the multiply/add/load units busy
  – loop unrolling to hide latencies increases register pressure
• Trend towards fuller vector support in microprocessors:
  – better support for misaligned memory accesses and for compilers
  – support for double-precision (64-bit) floating point
  – the new Intel AVX spec (announced April 2008): 256b vector registers, expandable up to 1024b


Multithreading

• It is difficult to continue extracting instruction-level parallelism (ILP) or data-level parallelism (DLP) from a single thread
• Many workloads can make use of thread-level parallelism (TLP)
  – TLP from multiprogramming (run independent sequential jobs)
  – TLP from multithreaded applications (run one job faster using parallel threads)
• Multithreading uses TLP to improve the utilization of a single processor


Pipeline Hazards

• Each instruction may depend on the one before it:

  LW   r1, 0(r2)
  LW   r5, 12(r1)
  ADDI r5, r5, #12
  SW   12(r1), r5

[Figure: 5-stage (F D X M W) pipeline diagram of this sequence; each dependent instruction stalls in decode until the prior instruction's write-back completes]

What can be done to cope with this?
  – interlocks (slow)
  – or bypassing (needs hardware, and doesn't help all hazards)


Multithreading

How can we guarantee there are no dependencies between instructions in the pipeline? One way is to interleave the execution of instructions from different program threads on the same pipeline.

Interleave 4 threads, T1-T4, on a non-bypassed 5-stage pipe:

  T1: LW   r1, 0(r2)
  T2: ADD  r7, r1, r4
  T3: XORI r5, r4, #12
  T4: SW   0(r7), r5
  T1: LW   r5, 12(r1)

The prior instruction in a thread always completes write-back before the next instruction in the same thread reads the register file.


CDC 6600 Peripheral Processors (Cray, 1964)

• First multithreaded hardware
• 10 "virtual" I/O processors
• Fixed interleave on a simple pipeline
• The pipeline has a 100 ns cycle time
• Each virtual processor executes one instruction every 1000 ns
• Accumulator-based instruction set to reduce processor state


Simple Multithreaded Pipeline

[Figure: 5-stage pipeline with four PCs and four register files; a thread-select counter chooses which PC fetches and which GPR file is read/written each cycle]

• Have to carry the thread select down the pipeline to ensure the correct state bits are read/written at each pipe stage
• Appears to software (including the OS) as multiple, albeit slower, CPUs


Multithreading Costs

• Each thread requires its own
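To make the fixed-interleave timing above concrete, here is a minimal C sketch (an editor's illustration, not from the lecture) of the round-robin thread select on a non-bypassed 5-stage pipe. It checks the invariant stated on the interleaving slide: a thread's write-back always completes before that thread's next instruction reads the register file. NTHREADS and the 20-cycle run are assumptions made for the sketch.

  #include <stdio.h>

  #define NTHREADS 4   /* threads interleaved, as on the slide */

  int main(void)
  {
      /* Cycle of each thread's most recent write-back (-1 = none yet). */
      int last_wb[NTHREADS];
      for (int t = 0; t < NTHREADS; t++) last_wb[t] = -1;

      /* One fetch per cycle, round-robin thread select.  On a 5-stage
         F D X M W pipe, an instruction fetched at cycle c reads the
         register file in D at c+1 and writes back in W at c+4. */
      for (int c = 0; c < 20; c++) {
          int t    = c % NTHREADS;   /* fixed-interleave thread select */
          int read = c + 1;          /* D stage: register file read */
          int wb   = c + 4;          /* W stage: register file write */

          printf("cycle %2d: T%d fetch, regfile read at %2d, write-back at %2d\n",
                 c, t + 1, read, wb);

          /* The guarantee: this thread's previous instruction has
             already written back before this one reads its operands. */
          if (last_wb[t] >= read)
              printf("  HAZARD: T%d reads before its prior write-back\n", t + 1);

          last_wb[t] = wb;
      }
      return 0;
  }

Dropping NTHREADS to 3 makes the hazard check fire: with only three threads, a thread's next register read lands in the same cycle as its prior write-back, which is exactly why the slide interleaves four threads on this pipe.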

