Exam2 Review
Dr. Bernard Chen Ph.D.
University of Central Arkansas
Spring 2010

Outline
- Pipeline
- Memory Hierarchy

Parallel processing
- A parallel processing system is able to perform concurrent data processing to achieve a faster execution time.
- The system may have two or more ALUs and be able to execute two or more instructions at the same time.
- The goal is to increase the throughput: the amount of processing that can be accomplished during a given interval of time.

Parallel processing classification
- Single instruction stream, single data stream (SISD)
- Single instruction stream, multiple data stream (SIMD)
- Multiple instruction stream, single data stream (MISD)
- Multiple instruction stream, multiple data stream (MIMD)

9.2 Pipelining
- Instruction execution is divided into k segments, or stages.
- An instruction exits pipe stage k−1 and proceeds into pipe stage k.
- All pipe stages take the same amount of time, called one processor cycle.
- The length of the processor cycle is determined by the slowest pipe stage.

Speedup
- A k-segment pipeline completes n tasks in (k + n − 1) clock cycles; executed sequentially in a single processing unit, the same work takes (k × n) clock cycles.
- The speedup gained by using the pipeline is therefore
  S = T_sequential / T_pipeline = (n × k) / (k + n − 1)
  or, writing the non-pipelined task time as t_n and the pipeline cycle time as t_p,
  S = (n × t_n) / ((k + n − 1) × t_p)

Example
- A non-pipeline system takes 100 ns to process a task; the same task can be processed in a five-segment pipeline taking 20 ns per segment.
- Speedup ratio for 1000 tasks: (100 × 1000) / ((5 + 1000 − 1) × 20) ≈ 4.98
- However, if the task cannot be evenly divided…

Example
- A non-pipeline system takes 100 ns to process a task; the same task can be processed in a six-segment pipeline whose segment delays are 20 ns, 25 ns, 30 ns, 10 ns, 15 ns, and 30 ns. Determine the speedup ratio of the pipeline for 10, 100, and 1000 tasks. What is the maximum speedup that can be achieved?

Example Answer
- The pipeline cycle time is set by the slowest segment: 30 ns.
- Speedup ratio for 10 tasks: (100 × 10) / ((6 + 10 − 1) × 30) ≈ 2.22
- Speedup ratio for 100 tasks: (100 × 100) / ((6 + 100 − 1) × 30) ≈ 3.17
- Speedup ratio for 1000 tasks: (100 × 1000) / ((6 + 1000 − 1) × 30) ≈ 3.32
- Maximum speedup: (100 × N) / ((6 + N − 1) × 30) → 100/30 = 10/3 as N grows

Instruction steps
1. Fetch the instruction
2. Decode the instruction
3. Fetch the operands from memory
4. Execute the instruction
5. Store the results in the proper place

5-Stage Pipelining
- Stages: Fetch Instruction (FI), Decode Instruction (DI), Fetch Operand (FO), Execute Instruction (EI), Write Operand (WO).
- (Timing diagram: five instructions streaming through stages S1–S5, one entering per cycle; the fifth instruction completes in cycle 5 + 5 − 1 = 9.)

Pipeline Hazards
- There are situations, called hazards, that prevent the next instruction in the instruction stream from executing during its designated cycle.
- There are three classes of hazards: structural hazards, data hazards, and branch hazards.

Data hazard
- Example:
  ADD R1 ← R2 + R3
  SUB R4 ← R1 − R5
  AND R6 ← R1 AND R7
  OR  R8 ← R1 OR R9
  XOR R10 ← R1 XOR R11
- Every instruction after the ADD reads R1 before the ADD has written it back.

Data hazard
- FO fetches the data value; WO stores the executed value.
- (Timing diagram: SUB reaches its FO stage one cycle before ADD reaches its WO stage, so SUB would read a stale R1.)

Data hazard
- The delay load approach inserts no-operation instructions to avoid the data conflict:
  ADD R1 ← R2 + R3
  No-op
  No-op
  SUB R4 ← R1 − R5
  AND R6 ← R1 AND R7
  OR  R8 ← R1 OR R9
  XOR R10 ← R1 XOR R11

Data hazard
- The conflict can be further solved by a simple hardware technique called forwarding (also called bypassing or short-circuiting).
- The insight in forwarding is that the result is not really needed by SUB until the ADD has executed completely.
- If the forwarding hardware detects that the previous ALU operation has written the register corresponding to a source for the current ALU operation, control logic selects the result from the ALU output instead of from memory.

Branch hazards
- Branch hazards can cause a greater performance loss for pipelines.
- When a branch instruction is executed, it may or may not change the PC.
- If a branch changes the PC to its target address, it is a taken branch; otherwise, it is untaken.

Branch hazards
- There are four schemes to handle branch hazards:
  1. Freeze scheme
  2. Predict-untaken scheme
  3. Predict-taken scheme
  4. Delayed branch

Branch Untaken (Freeze approach) / Branch Taken (Freeze approach)
- The simplest method of dealing with branches is to freeze the pipeline and redo the fetch following the branch.
- (Timing diagrams for the untaken and taken cases.)

Branch Untaken (Predicted-untaken) / Branch Taken (Predicted-untaken) / Branch Untaken (Predicted-taken) / Branch Taken (Predicted-taken)
- (Timing diagrams showing each prediction scheme with the branch untaken and taken.)

Delayed Branch
- A fourth scheme, in use in some processors, is called delayed branch.
- It is done at compile time: the compiler modifies the code.
- The general format is:
  branch instruction
  delay slot
  branch target (if taken)
- (Diagram: delay-slot scheduling choices; filling the slot with an independent instruction is optimal.)

Outline
- Pipeline
- Memory Hierarchy

Memory Hierarchy
- The main memory occupies a central position: it is able to communicate directly with the CPU, and with auxiliary memory devices through an I/O processor.
- A special very-high-speed memory called cache is used to increase the speed of processing by making current programs and data available to the CPU at a rapid rate.

RAM / ROM / Memory Address Map
- (Slides: RAM and ROM chip connection diagrams and the memory address map table.)

Cache memory
- When the CPU refers to memory and finds the word in cache, it is said to produce a hit.
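To make the speedup formula and the two worked examples concrete, here is a small Python sketch; the function name and parameter names are mine, not from the slides:

```python
def pipeline_speedup(t_nonpipe, n_tasks, k_segments, t_cycle):
    """Speedup of a k-segment pipeline with cycle time t_cycle over a
    non-pipelined unit that takes t_nonpipe per task, for n_tasks tasks."""
    t_sequential = t_nonpipe * n_tasks
    # The first task needs k cycles to fill the pipe; each later task adds one.
    t_pipelined = (k_segments + n_tasks - 1) * t_cycle
    return t_sequential / t_pipelined

# Five-segment pipeline, 20 ns per segment, vs. 100 ns non-pipelined:
print(round(pipeline_speedup(100, 1000, 5, 20), 2))  # → 4.98

# Six-segment pipeline with unequal delays: the cycle time is the slowest
# segment, max(20, 25, 30, 10, 15, 30) = 30 ns.
for n in (10, 100, 1000):
    print(n, round(pipeline_speedup(100, n, 6, 30), 2))

# The maximum speedup as n grows is t_nonpipe / t_cycle = 100/30 = 10/3.
```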
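The 5-stage timing diagram lost in extraction can be reconstructed programmatically. This sketch (names and layout are illustrative) assumes one instruction enters the pipeline per cycle with no stalls:

```python
STAGES = ["FI", "DI", "FO", "EI", "WO"]  # the five stages from the slides

def timing_chart(n_instructions):
    """Map (instruction number, stage name) -> clock cycle, assuming one
    new instruction enters the pipe every cycle and nothing stalls."""
    chart = {}
    for i in range(n_instructions):
        for s, stage in enumerate(STAGES):
            # Instruction i+1 starts in cycle i+1 and advances one stage per cycle.
            chart[(i + 1, stage)] = i + s + 1
    return chart

chart = timing_chart(5)
# Instruction 1 writes its result in cycle 5; instruction 5 writes in cycle 9,
# matching the k + n - 1 = 5 + 5 - 1 = 9 total cycles of the diagram.
```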
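The delay-load approach can be sketched as a small scheduling pass over a toy instruction list. The (dest, op, sources) tuple format and the two-instruction hazard window are assumptions for illustration, chosen to match the two no-ops on the slide:

```python
NOP = ("NOP", None, ())

def insert_noops(program, hazard_window=2):
    """Insert no-ops so no instruction reads a register written by either of
    the two instructions immediately before it (the delay-load approach)."""
    out = []
    for dest, op, srcs in program:
        # Destination registers of the most recently emitted instructions.
        recent = [d for d, _, _ in out[-hazard_window:]]
        while any(s in recent for s in srcs):
            out.append(NOP)  # pad until the conflicting write has drained
            recent = [d for d, _, _ in out[-hazard_window:]]
        out.append((dest, op, srcs))
    return out

prog = [
    ("R1", "ADD", ("R2", "R3")),
    ("R4", "SUB", ("R1", "R5")),   # reads R1 right after it is written
    ("R6", "AND", ("R1", "R7")),   # far enough behind ADD: no padding needed
]
scheduled = insert_noops(prog)
# Two no-ops appear between ADD and SUB, as on the slide; AND follows directly.
```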
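The forwarding check amounts to comparing the previous instruction's destination register against the current instruction's source; a minimal sketch, with all names hypothetical:

```python
def select_operand(src_reg, prev_dest, alu_result, register_file):
    """Return the operand value for src_reg, bypassing the register file
    when the previous ALU op wrote the register we are about to read."""
    if src_reg == prev_dest:      # hazard detected: forward the ALU result
        return alu_result
    return register_file[src_reg]

regs = {"R1": 0, "R2": 7, "R3": 5}
# ADD R1 <- R2 + R3 has just executed; its result (12) is still in the ALU
# output latch and not yet written back, so regs["R1"] holds a stale 0.
operand = select_operand("R1", prev_dest="R1", alu_result=12, register_file=regs)
# operand is 12, the forwarded ALU result, rather than the stale 0.
```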