CS 152 Computer Architecture and Engineering
Lecture 18: Multithreading

Krste Asanovic
Electrical Engineering and Computer Sciences
University of California, Berkeley
http://www.eecs.berkeley.edu/~krste
http://inst.cs.berkeley.edu/~cs152

4/15/2008


Last Time: Vector Computers

• Vectors provide efficient execution of data-parallel loop code
• A vector ISA provides a compact encoding of machine parallelism
• Vector ISAs scale to more lanes without changing binary code
• Vector registers provide fast temporary storage to reduce memory bandwidth demands, and simplify dependence checking between vector instructions
• Scatter/gather, masking, and compress/expand operations increase the set of vectorizable loops
• Extensive compiler analysis (or programmer annotation) is required to be certain that loops can be vectorized
• Full "long" vector support is still found only in supercomputers (NEC SX8R, Cray X1E); microprocessors have limited "short" vector operations
  – Intel x86 MMX/SSE/AVX
  – IBM/Motorola PowerPC VMX/Altivec


Vector Conditional Execution

Problem: we want to vectorize loops with conditional code:

  for (i = 0; i < N; i++)
    if (A[i] > 0) A[i] = B[i];

Solution: add vector mask (or flag) registers
  – the vector version of predicate registers, 1 bit per element
...and maskable vector instructions
  – a vector operation becomes a NOP at elements where the mask bit is clear

Code example:

  CVM             # Turn on all elements
  LV vA, rA       # Load entire A vector
  SGTVS.D vA, F0  # Set bits in mask register where A > 0
  LV vA, rB       # Load B vector into A under mask
  SV vA, rA       # Store A back to memory under mask


Masked Vector Instructions

[Figure: two datapaths in which mask bits M[0..7] control whether each result C[i] = A[i] op B[i] is written]
• Simple implementation: execute all N operations, and turn result writeback on or off per element according to the mask (a write enable per element)
• Density-time implementation: scan the mask vector and execute only the elements with non-zero mask bits


Compress/Expand Operations

• Compress packs the elements whose mask bit is set contiguously at the start of the destination vector register
  – the population count of the mask vector gives the packed vector length
• Expand performs the inverse operation
• Used for density-time conditionals and also for general selection operations

[Figure: with mask bits M[1], M[4], M[5], M[7] set, compress gathers A[1], A[4], A[5], A[7] to the front of the destination register; expand scatters them back to those masked positions]


Vector Reductions

Problem: loop-carried dependence on reduction variables:

  sum = 0;
  for (i = 0; i < N; i++)
    sum += A[i];              # Loop-carried dependence on sum

Solution: re-associate the operations if possible, and use a binary tree to perform the reduction:

  # Rearrange as:
  sum[0:VL-1] = 0               # Vector of VL partial sums
  for (i = 0; i < N; i += VL)   # Stripmine VL-sized chunks
    sum[0:VL-1] += A[i:i+VL-1]; # Vector sum
  # Now have VL partial sums in one vector register
  do {
    VL = VL / 2;                  # Halve vector length
    sum[0:VL-1] += sum[VL:2*VL-1] # Halve number of partials
  } while (VL > 1);
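The reduction pseudocode above maps directly to scalar C. The sketch below is an editor's illustration rather than anything from the lecture: the names sum, A, and VL follow the pseudocode, VL is assumed to be a power of two, and N is assumed to be a multiple of VL to keep the stripmine loop short.

  #include <stdio.h>

  #define VL 8   /* vector length: number of partial sums (power of two) */

  /* Sum A[0..n-1]; assumes n is a multiple of VL for brevity. */
  double vector_sum(const double *A, int n)
  {
      double sum[VL] = {0.0};           /* VL independent partial sums */

      /* Stripmine: each pass is one VL-wide elementwise add, so the
         loop-carried dependence is between passes, not elements. */
      for (int i = 0; i < n; i += VL)
          for (int j = 0; j < VL; j++)
              sum[j] += A[i + j];

      /* Binary tree: halve the number of partial sums each step. */
      for (int vl = VL / 2; vl >= 1; vl /= 2)
          for (int j = 0; j < vl; j++)
              sum[j] += sum[j + vl];

      return sum[0];
  }

  int main(void)
  {
      double A[16];
      for (int i = 0; i < 16; i++) A[i] = i + 1;  /* 1 + 2 + ... + 16 */
      printf("%g\n", vector_sum(A, 16));          /* prints 136 */
      return 0;
  }

One caveat hiding in the slide's "if possible": re-associating floating-point additions changes rounding, so the tree reduction can give a slightly different result than the sequential loop.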
Multimedia Extensions (aka SIMD Extensions)

• Very short vectors added to existing microprocessor ISAs
• Use the existing 64-bit registers, split into 2x32b, 4x16b, or 8x8b elements
  – the concept was first used on the Lincoln Labs TX-2 computer in 1957, with a 36b datapath split into 2x18b or 4x9b
  – newer designs have 128-bit registers (PowerPC Altivec, Intel SSE2/3/4)
• A single instruction operates on all elements within the register

[Figure: a 64b register viewed as 2x32b, 4x16b, or 8x8b elements; a partitioned adder performs four 16b adds with one instruction]


Multimedia Extensions versus Vectors

• Limited instruction set:
  – no vector length control
  – no strided load/store or scatter/gather
  – unit-stride loads must be aligned to a 64/128-bit boundary
• Limited vector register length:
  – requires superscalar dispatch to keep the multiply/add/load units busy
  – loop unrolling to hide latencies increases register pressure
• Trend towards fuller vector support in microprocessors:
  – better support for misaligned memory accesses and for compilers
  – support for double-precision (64-bit) floating point
  – the new Intel AVX spec (announced April 2008): 256b vector registers, expandable up to 1024b


Multithreading

• It is difficult to continue extracting instruction-level parallelism (ILP) or data-level parallelism (DLP) from a single thread
• Many workloads can make use of thread-level parallelism (TLP)
  – TLP from multiprogramming (run independent sequential jobs)
  – TLP from multithreaded applications (run one job faster using parallel threads)
• Multithreading uses TLP to improve the utilization of a single processor


Pipeline Hazards

• Each instruction may depend on the one before it:

  LW   r1, 0(r2)
  LW   r5, 12(r1)
  ADDI r5, r5, #12
  SW   12(r1), r5

[Figure: 5-stage (F D X M W) pipeline diagram of this sequence; each dependent instruction stalls in decode until the prior instruction's write-back completes]

What can be done to cope with this?
  – interlocks (slow)
  – or bypassing (needs hardware, and doesn't help all hazards)


Multithreading

How can we guarantee there are no dependencies between instructions in the pipeline? One way is to interleave the execution of instructions from different program threads on the same pipeline.

Interleave 4 threads, T1-T4, on a non-bypassed 5-stage pipe:

  T1: LW   r1, 0(r2)
  T2: ADD  r7, r1, r4
  T3: XORI r5, r4, #12
  T4: SW   0(r7), r5
  T1: LW   r5, 12(r1)

The prior instruction in a thread always completes write-back before the next instruction in the same thread reads the register file.


CDC 6600 Peripheral Processors (Cray, 1964)

• First multithreaded hardware
• 10 "virtual" I/O processors
• Fixed interleave on a simple pipeline
• The pipeline has a 100 ns cycle time
• Each virtual processor executes one instruction every 1000 ns
• Accumulator-based instruction set to reduce processor state


Simple Multithreaded Pipeline

[Figure: 5-stage pipeline with four PCs and four register files; a thread-select counter chooses which PC fetches and which GPR file is read/written each cycle]

• Have to carry the thread select down the pipeline to ensure the correct state bits are read/written at each pipe stage
• Appears to software (including the OS) as multiple, albeit slower, CPUs


Multithreading Costs

• Each thread requires its own
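To make the fixed-interleave timing above concrete, here is a minimal C sketch (an editor's illustration, not from the lecture) of the round-robin thread select on a non-bypassed 5-stage pipe. It checks the invariant stated on the interleaving slide: a thread's write-back always completes before that thread's next instruction reads the register file. NTHREADS and the 20-cycle run are assumptions made for the sketch.

  #include <stdio.h>

  #define NTHREADS 4   /* threads interleaved, as on the slide */

  int main(void)
  {
      /* Cycle of each thread's most recent write-back (-1 = none yet). */
      int last_wb[NTHREADS];
      for (int t = 0; t < NTHREADS; t++) last_wb[t] = -1;

      /* One fetch per cycle, round-robin thread select.  On a 5-stage
         F D X M W pipe, an instruction fetched at cycle c reads the
         register file in D at c+1 and writes back in W at c+4. */
      for (int c = 0; c < 20; c++) {
          int t    = c % NTHREADS;   /* fixed-interleave thread select */
          int read = c + 1;          /* D stage: register file read */
          int wb   = c + 4;          /* W stage: register file write */

          printf("cycle %2d: T%d fetch, regfile read at %2d, write-back at %2d\n",
                 c, t + 1, read, wb);

          /* The guarantee: this thread's previous instruction has
             already written back before this one reads its operands. */
          if (last_wb[t] >= read)
              printf("  HAZARD: T%d reads before its prior write-back\n", t + 1);

          last_wb[t] = wb;
      }
      return 0;
  }

Dropping NTHREADS to 3 makes the hazard check fire: with only three threads, a thread's next register read lands in the same cycle as its prior write-back, which is exactly why the slide interleaves four threads on this pipe.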

