BROCKPORT CPS 303 - Chapter 2: Architecture of Parallel Computers


CPS 303 High Performance Computing
Wensheng Shen
Department of Computational Science
SUNY Brockport

Chapter 2: Architecture of Parallel Computers (Hardware and Software)

2.1.1 Flynn's taxonomy

Michael Flynn classified systems according to the number of instruction streams and the number of data streams:
- Single-instruction single-data (SISD)
- Single-instruction multiple-data (SIMD)
- Multiple-instruction single-data (MISD)
- Multiple-instruction multiple-data (MIMD)

Instruction streams and data streams
- Data stream: a sequence of digitally encoded signals or data packets used to transmit or receive information.
- Instruction stream: a sequence of instructions.

Instruction set architecture

Stored-program computer: memory stores programs as well as data, so a program must travel from memory to the CPU, where it is executed. Programs consist of many instructions, which are the 0's and 1's that tell the CPU what to do. The format and semantics of the instructions are defined by the ISA (instruction set architecture). Instructions reside in memory because a CPU holds very little storage of its own; the more memory a CPU carries, the slower it runs. Thus memory and CPU are separate chips.

2.1.2 SISD --- the classic von Neumann machine

[Figure: a single processor P draws instructions (Load X; Load Y; Add Z, X, Y; Store Z) from one instruction pool and operates on one data pool. The CPU (control unit plus arithmetic logic unit) connects to memory, input devices, output devices, and external storage.]

A single processor executes a single instruction stream to operate on data stored in a single memory. During any CPU cycle, only one data stream is used. The performance of a von Neumann machine can be improved by caching.

Steps to run a single instruction
- IF (instruction fetch): the instruction is fetched from memory.
  The address of the instruction is taken from the program counter (PC), and the instruction is copied from memory into the instruction register (IR).
- ID (instruction decode): decode the instruction and fetch the operands.
- EX (execute): perform the operation; done by the ALU (arithmetic logic unit).
- MEM (memory access): normally occurs during load and store instructions.
- WB (write back): write the result of the operation in the EX step to a register in the register file.
- PC (update program counter): update the value in the program counter, normally PC ← PC + 4.

[Figure: three instructions passing through the IF, ID, EX, MEM, WB stages strictly one after another.]

Subscalar CPUs: since only one instruction is executed at a time, the entire CPU must wait for that instruction to complete before proceeding to the next. As a result, a subscalar CPU gets "hung up" on instructions that take more than one clock cycle to complete. This process is inherently inefficient: it takes 15 cycles to complete three instructions.

[Figure: five instructions overlapped in the IF-ID-EX-MEM-WB pipeline, each starting one cycle after the previous.]

Scalar CPUs: with this 5-stage pipeline, the CPU can approach a performance of one instruction per CPU clock cycle.

2.1.3 Pipeline and vector architecture

[Figure: two parallel 5-stage pipelines, with two instructions entering every cycle.]

Superscalar CPUs: in the simple superscalar pipeline, two instructions are fetched and dispatched at the same time, so a performance of a maximum of two instructions per CPU clock cycle can be achieved.

Example: the functional units are arranged in a pipeline, where the output of one functional unit is the input to the next. Say x[0] and y[0] are being added; then one of x[1] and y[1] can be shifted, the exponents of x[2] and y[2] can be compared, and x[3] and y[3] can be fetched.
After pipelining, we can produce a result six times faster than without pipelining.

float x[100], y[100], z[100];
for (i = 0; i < 100; i++)
    z[i] = x[i] + y[i];

[Table: clock-by-clock occupancy of the six-stage floating-point pipeline (fetch the operands from memory; compare exponents; shift one operand; add; normalize the result; store the result in memory), with successive pairs x[i], y[i] entering one clock apart.]

By adding vector instructions to the basic machine instruction set, we can further improve performance. Without vector instructions, each of the basic instructions has to be issued 100 times; with vector instructions, each has to be issued only once.

Using multiple memory banks: operations that access main memory (fetch and store) are several times slower than CPU-only operations (add). For example, suppose we can execute a CPU operation once every CPU cycle, but a memory access only once every four cycles. If we use four memory banks and distribute the data so that z[i] lives in memory bank i mod 4, we can execute one store operation per cycle.

Fortran 77:
do i = 1, 100
    z(i) = x(i) + y(i)
enddo

Fortran 90:
z(1:100) = x(1:100) + y(1:100)

2.1.4 SIMD

[Figure: several processors P share a single instruction pool and draw on a common data pool; each runs the same Load X[i]; Load Y[i]; Add Z[i], X[i], Y[i]; Store Z[i] sequence on its own elements.]

A type of parallel computer.
- Single instruction: all processor units execute the same instruction at any given clock cycle.
- Multiple data: each processing unit can operate on a different data element.

A SIMD machine typically has an instruction dispatcher, a very high-bandwidth internal network, and a very large array of very small-capacity instruction units. It is best suited for specialized problems characterized by a high degree of regularity, e.g., image processing.

A single CPU controls a large collection of subordinate ALUs, each with its own memory.
During each instruction cycle, the control processor broadcasts an instruction to all of the subordinate processors, and each subordinate processor either executes the instruction or is idle.

for (i = 0; i < 100; i++)
    if (y[i] != 0.0)
        z[i] = x[i] / y[i];
    else
        z[i] = x[i];

Time step 1: test local_y != 0.
Time step 2: if local_y != 0, z[i] = x[i]/y[i]; if local_y == 0, idle.
Time step 3: if local_y != 0, idle; if local_y == 0, z[i] = x[i].

Disadvantage: in a program with many conditional branches, or long segments of code whose execution depends on conditionals, it is likely that many processors will remain idle for long periods.

2.1.5 MISD

[Figure: two processors P fed by a single data pool but separate instruction pools; each loads the same X[1], multiplies it by its own constant (A, B, or C) into Y[k], adds to form Z[k], and stores Z[k].]

- A single data stream is fed into multiple processing units.
- Each processing unit operates on the data independently via independent instruction streams.
- Very few actual machines: CMU's C.mmp computer (1971).

2.1.6 MIMD

[Figure: many processors P, each with its own instruction stream and its own data stream.]

Multiple instruction stream: every processor may execute a different instruction stream.

