UCF CDA 5106 - Chapter 4 Solutions

Solutions for Modern Processor Design: Fundamentals of Superscalar Processors, Chapter 4.

1. Is it reasonable to build a scalar pipeline that supports out-of-order execution? If so, describe a code execution scenario where such a pipeline would perform better than a conventional in-order scalar pipeline.

Yes. An example scenario is when a load instruction stalls on a cache miss. Subsequent independent instructions can still make progress if the pipeline supports out-of-order execution.

2. Superscalar pipelines require replication of pipeline resources across each parallel pipeline, naively including the replication of cache ports. In practice, however, a two-wide superscalar pipeline may have two data cache ports but only a single instruction cache port. Explain why this is possible, but also discuss why a single instruction cache port can perform worse than two (replicated) instruction cache ports.

The spatial locality in the instruction reference stream is quite high. Hence, a single port that is twice as wide can usually satisfy both pipelines by fetching two adjacent instructions. The exception is when the first instruction fetched in a cycle is a taken branch: the spatially adjacent instruction is then not useful, since it lies on the fall-through path. In this case, assuming the fetch stage recognized that a taken branch was being fetched for the first pipeline, it could use the second instruction cache port to fetch an instruction from the branch target.

3. Section 4.3.1 suggests that a compiler can generate object code where branch targets are aligned at the beginning of physical cache lines to increase the likelihood of fetching multiple instructions from the branch target in a single cycle. However, given a fixed number of instructions between taken branches, this approach may simply shift the unused fetch slots from before the branch target to after the branch that terminates sequential fetch at the target. For example, moving the code at label0 so it aligns with a physical cache line will not improve fetch efficiency, since the wasted fetch slot shifts from the beginning of the physical line to the end.

        bc  cond, label0
        ...
    label0:
        add r1, r2, r3
        cmp cond, r1, r5
        be  cond, label1

    [Figure: two four-slot physical cache lines (slots bc, add, cmp, bc). In the original code the wasted slot sits at the beginning of the line, before the branch target; in the optimized code it shifts to the end of the line, after the branch that terminates sequential fetch.]

Discuss the relationship between fetch block size and the dynamic distance between taken branches. Describe how one affects the other, how important branch target alignment is for small vs. large fetch blocks and short vs. long dynamic distances, and how well static compiler-based target alignment might work in all cases.

The essay should point out the following. As fetch block size increases, the probability of encountering a taken branch within the block increases. The greater the dynamic distance between taken branches, the better a simple fetch mechanism will work, since taken branches cause discontinuities in the fetched instruction stream. For small fetch blocks, aligning branch targets is important, since relatively more fetch slots are lost even with minor misalignment. However, for large fetch blocks, major misalignment can also cause severe fetch bandwidth loss. The bottom line is that target alignment is only beneficial if the target block being aligned contains enough instructions to satisfy fetch bandwidth. If the target block itself is short, whether it is aligned or not will not matter. Hence, a compiler heuristic that is sensitive to target block size could improve fetch bandwidth by aligning targets only in those cases where the target block exceeds some threshold, as in the sketch below.
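As a rough illustration, here is a minimal C sketch of such a size-sensitive alignment heuristic. All names (struct block, maybe_align_target) and the FETCH_WIDTH constant are hypothetical, invented for illustration; a real compiler would apply this kind of padding through alignment directives on basic-block labels.

    #include <stdio.h>
    #include <stddef.h>

    #define FETCH_WIDTH 4  /* instructions fetched per cycle (assumed) */

    /* A branch-target basic block (hypothetical representation). */
    struct block {
        size_t num_insns;  /* instructions before the next taken branch */
        size_t align;      /* requested start alignment, in instruction slots */
    };

    /* Pad a branch target to a cache-line boundary only when the target
       block is long enough to use the fetch slots gained by aligning it.
       Short target blocks are left alone: aligned or not, they cannot
       fill a fetch block, so padding them only wastes code space. */
    static void maybe_align_target(struct block *b)
    {
        if (b->num_insns >= FETCH_WIDTH)
            b->align = FETCH_WIDTH;  /* align to start of a physical line */
        /* else: keep the default alignment; alignment cannot help here */
    }

    int main(void)
    {
        struct block shrt = { 2, 1 }, lng = { 6, 1 };
        maybe_align_target(&shrt);
        maybe_align_target(&lng);
        printf("short block align: %zu, long block align: %zu\n",
               shrt.align, lng.align);
        return 0;
    }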
4. The auto-realigning instruction fetch hardware shown in Figure 4-13 still fails to achieve full-width fetch bandwidth (i.e., four instructions per cycle). Describe a more aggressive organization that is always able to fetch four instructions per cycle. Comment on the additional hardware such an organization implies.

An even/odd interleaved cache would work, where each bank is able to fetch four instructions per cycle. An 8:4 mux connects the outputs of the two banks and can always deliver four instructions (assuming both banks hit).

5. One idea to eliminate the branch misprediction penalty is to build a machine that executes both paths of a branch. In a 2-3 paragraph essay, explain why this may or may not be a good idea.

6. Section 4.3.2 discusses adding predecode bits to the instruction cache to simplify the task of decoding instructions after they have been fetched. A logical extension of predecode bits is to simply store the instructions in decoded form in a decoded instruction cache; this is particularly attractive for processors like the Pentium Pro that dynamically translate fetched instructions into a sequence of simpler RISC-like instructions for the core to execute. Identify and describe at least one factor that complicates the building of decoded instruction caches for processors that translate from a complex instruction set to a simpler RISC-like instruction set.

Possible factors include: unequal expansion ratios (some CISC sequences will expand to more instructions than others); self-modifying code (where cached decoded sequences must be invalidated); and alignment issues with a fixed cache line length in the decoded cache.

7. What is the most important advantage of a centralized reservation station over distributed reservation stations?

Better utilization (which is always true for a shared resource).

8. In an in-order pipelined processor, pipeline latches are used to hold result operands from the time an execution unit computes them until they are written back to the register file during the writeback stage. In an out-of-order processor, rename registers are used for the same purpose. Given a four-wide out-of-order TYP pipeline, compute the minimum number of rename registers needed to prevent rename register starvation from limiting concurrency. What happens to this number if frequency demands force a designer to add five extra pipeline stages between dispatch and execute, and five more stages between execute and retire/writeback?

For maximum throughput, each pipeline stage will contain four in-flight instructions. Since rename registers are allocated at decode and freed at retire, each instruction occupies a rename register for every stage between decode and retire, so the minimum number of rename registers is the machine width times the number of decode-to-retire stages. Adding five stages between dispatch and execute and five more between execute and retire deepens that window by ten stages, raising the minimum by 4 x 10 = 40 rename registers.
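A small C sketch of this arithmetic follows. The base decode-to-retire depth of 4 stages used below is an assumption for illustration, not a figure from the text; the 40-register increase caused by the ten added stages does not depend on it.

    #include <stdio.h>

    /* Minimum rename registers = machine width x number of pipeline stages
       between allocation (decode) and release (retire): at peak throughput,
       every one of those stages holds a full group of in-flight
       instructions, each occupying one rename register. */
    static int min_rename_regs(int width, int decode_to_retire_stages)
    {
        return width * decode_to_retire_stages;
    }

    int main(void)
    {
        const int width = 4;        /* four-wide machine */
        const int base_stages = 4;  /* assumed decode-to-retire depth of the
                                       base TYP pipeline; adjust as needed */

        printf("base pipeline:   %d\n", min_rename_regs(width, base_stages));
        /* Five stages added between dispatch and execute plus five between
           execute and retire deepen the decode-to-retire window by 10,
           i.e. 4 x 10 = 40 additional rename registers. */
        printf("deeper pipeline: %d\n",
               min_rename_regs(width, base_stages + 10));
        return 0;
    }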

