EE392C: Advanced Topics in Computer Architecture (Polymorphic Processors)
Stanford University

Thread-Parallel Architectures
Lecture #3: Tuesday, 8 April 2003
Lecturer: Metha Jeeradit and Wajahat Qadeer
Scribe: John Kim and Rebecca Schultz

1 Analysis of Multi-threaded Architectures for Parallel Computing [1]

1.1 Summary and Original Contributions

This paper provides a model that characterizes the behavior of a multithreaded machine in terms of four parameters: the memory access latency, the number of threads that can be interleaved, the cost of a context switch, and the run length of each thread. The first three parameters are architecture dependent, determined by the machine's interconnection network, the amount of processor state available, and the switching mechanism used. The run length, however, depends on both the architecture and the application behavior.

The authors identify two basic regions of operation: a linear region and a saturation region. In the linear region, processor efficiency grows linearly with the number of contexts. In the saturation region, efficiency is independent of both the number of contexts and the memory latency; instead, the ratio between the switching cost and the run length determines the maximum efficiency. However, once cache interference effects are taken into account, a large number of threads can actually hurt efficiency in saturation.

1.2 Critique

The main strength of this paper is a practical, deterministic model that can serve as a first approximation when starting to design a multithreaded architecture. With this model, an architect can quickly estimate the performance of a design before doing any software simulation of the architecture.

The main critique concerns the debatable assumptions the authors make:

1. They assume sufficient parallelism is available at all times. This is application dependent, and the scenario seems unrealistic.

2. They ignore synchronization issues, which may substantially reduce processor efficiency.

3. They assume a constant memory access latency, which may not hold: an access may be served by a local cache, a remote cache, local memory, or remote memory, each with a different latency.

1.3 Future Work

A natural extension of this paper is to broaden the model to account for the effects that the debatable assumptions above ignore.
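To make the two-region model concrete, the sketch below is a minimal Python rendering of the model as described above; the parameter names (N, R, L, C) and the exact formulas are our reading of it, not code from [1].

    # Sketch of the two-region efficiency model of [1].
    # Assumed parameter names: R = run length between long-latency
    # accesses, L = memory access latency, C = context-switch cost,
    # N = number of hardware contexts. All values are in cycles.

    def efficiency(N, R, L, C):
        linear = N * R / (R + L)      # latency not yet fully hidden
        saturated = R / (R + C)       # limited only by switch overhead
        return min(linear, saturated)

    def saturation_point(R, L, C):
        # Number of contexts at which the two regions meet.
        return (R + L) / (R + C)

    # Example: R = 16, L = 128, C = 4 gives a saturation point of
    # 144 / 20 = 7.2 contexts and a maximum efficiency of 16 / 20 = 0.80.

Note how the saturated expression depends only on the switching cost and the run length, which is exactly the saturation-region behavior described in 1.1.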
2 Interleaving: A Multi-threading Technique Targeting Multiprocessors and Workstations [2]

2.1 Summary and Original Contributions

This paper proposes a new hardware technique, called interleaving, for implementing efficient multithreading. It suggests architectural improvements to commodity microprocessors that make multithreading efficient under both workstation and multiprocessor loads. The technique extends fine-grained multithreading with data caches and pipeline interlocks to support multiple contexts while matching the performance of a single-context microprocessor. The interleaved scheme shows considerable improvement over the blocked scheme in both workstation and multiprocessor environments, owing to its low switching cost and its ability to hide small latencies.

2.2 Critique

The main strength of this paper is the convincing results substantiating the claim that an interleaved multithreading architecture yields better performance than the blocked and fine-grained schemes in both workstations and multiprocessors.

The main critiques are the significant hardware complexity of the caches and program control unit for RISC-based machines, and the fact that the interleaving technique is difficult to implement in a dynamically scheduled superscalar processor.

2.3 Future Work

One possible direction is to find an efficient way to implement multithreading in a dynamically scheduled superscalar processor. Another is to combine the interleaved and blocked approaches, providing a low switching cost while keeping the performance of a high-priority thread comparable to that of a single-threaded architecture.

3 Comparative Evaluation of Latency Reducing and Tolerating Techniques [3]

3.1 Summary

This paper provides a consistent framework for evaluating the following techniques for multiprocessor architectures:

• Coherent caches
• Memory consistency models
• Software-controlled prefetching
• Multiple contexts

The results show that coherent caches offer substantial performance gains, especially by avoiding read misses, although the hit rate is lower than in uniprocessors. A relaxed memory consistency model is found to be better than the sequential model because of its potential for improved performance, though its complexity is higher. Prefetching and multiple contexts are both very application dependent, though each still provides some performance improvement when applied separately.

3.2 Critique

The main strength of this paper is a systematic and consistent scheme for comparing various latency tolerating techniques, which gives a better understanding of the tradeoffs between them.

The main critiques of this paper include:

• An insufficient number of applications to substantiate the results presented in the paper.
• The combination of prefetching and multiple contexts was not evaluated appropriately: the prefetching implementation (the one used in the single-context case) should have been modified before being used with multiple contexts.
• Lockup-free caches were not exploited for read misses, which would have offered substantial performance gains for architectures supporting multiple contexts.

3.3 Future Work

A natural follow-up to this paper is to compare these techniques against other latency tolerating schemes, such as out-of-order execution and data-level parallel (DLP) architectures.

4 Discussion

Multithreading is a technique for tolerating latency in the system. Latency-causing events include cache misses, synchronization penalties, TLB faults, and data/control dependencies.

Architectural techniques for avoiding, reducing, or tolerating latency include caching, prefetching in hardware or software, out-of-order execution, exploiting data-level parallelism, relaxed memory consistency, and multithreading.
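As a rough illustration of how switching cost separates the schemes discussed above, the sketch below reuses the two-region model of [1]; the cycle counts are illustrative assumptions on our part, not measurements from any of the three papers.

    # Hypothetical comparison: interleaving is modeled with a near-zero
    # switch cost, a blocked scheme with a multi-cycle penalty.
    # These numbers are assumptions chosen only for illustration.

    def efficiency(N, R, L, C):
        return min(N * R / (R + L), R / (R + C))

    R, L = 16, 128                    # run length and miss latency, cycles
    for name, C in [("interleaved", 0), ("blocked", 10)]:
        for N in (1, 2, 4, 8, 16):
            print(f"{name:12s} N={N:2d}  efficiency={efficiency(N, R, L, C):.2f}")

    # With C near zero the interleaved scheme saturates at an efficiency
    # close to 1.0, while the blocked scheme is capped at about
    # 16 / 26 = 0.62, consistent with the claim in [2] that interleaving's
    # low switching cost is what lets it hide small latencies.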

