How to Communicate Poorly
(giving bad talks, show bad posters, writing bad papers)
Professor David A. Patterson
December 2004
www.cs.berkeley.edu/~pattrsn/talks/nontech.html
DAP Spr 01 UCB 1

7 Talk Commandments for a Bad Career
I. Thou shalt not illustrate.
II. Thou shalt not covet brevity.
III. Thou shalt not print large.
IV. Thou shalt not use color.
V. Thou shalt cover thy naked slides.
VI. Thou shalt not skip slides in a long talk.
VII. Thou shalt not practice.
DAP Feb 04 UCB 2

Following all the commandments in PowerPoint
We describe the philosophy and design of the control flow machine, and present the results of detailed simulations of the performance of a single processing element. Each factor is compared with the measured performance of an advanced von Neumann computer running equivalent code. It is shown that the control flow processor compares favorably in the program. We present a denotational semantics for a logic program to construct a control flow for the logic program. The control flow is defined as an algebraic manipulator of idempotent substitutions and it virtually reflects the resolution deductions. We also present a bottom-up compilation of medium-grain clusters from a fine-grain control flow graph. We compare the basic block and the dependence sets algorithms that partition control flow graphs into clusters. A hierarchical macro control flow computation allows them to exploit the coarse-grain parallelism inside a macrotask, such as a subroutine or a loop, hierarchically. We use a hierarchical definition of macrotasks, a parallelism extraction scheme among macrotasks defined inside an upper-layer macrotask, and a scheduling scheme which assigns hierarchical macrotasks on hierarchical clusters. We apply a parallel simulation scheme to a real problem: the simulation of a control flow architecture, and we compare the performance of this simulator with that of a sequential one. Moreover, we investigate the effect of modeling the application on the performance of the simulator. Our study indicates that parallel simulation can reduce the execution time significantly if appropriate modeling is used. We have demonstrated that to achieve the best execution time for a control flow program, the number of nodes within the system and the type of mapping scheme used are particularly important. In addition, we observe that a large number of subsystem nodes allows more actors to be fired concurrently, but the communication overhead in passing control tokens to their destination nodes causes the overall execution time to increase substantially. The relationship between the mapping scheme employed and locality effect in a program are discussed. The mapping scheme employed has to exhibit a strong locality effect in order to allow efficient execution. Medium-grain execution can benefit from a higher output bandwidth of a processor and, finally, a simple superscalar processor with an issue rate of ten is sufficient to exploit the internal parallelism of a cluster. Although the technique does not exhaustively detect all possible errors, it detects nontrivial errors with a worst-case complexity quadratic to the system size. It can be automated and applied to systems with arbitrary loops and nondeterminism.

7 Poster Commandments for a Bad Career
I. Thou shalt not illustrate.
II. Thou shalt not covet brevity.
III. Thou shalt not print large.
IV. Thou shalt not use color.
V. Thou shalt not attract attention to thyself.
VI. Thou shalt not prepare a short oral overview.
VII. Thou shalt not prepare in advance.

Following all the commandments
How to Do a Bad Poster
David Patterson
University of California, Berkeley, CA 94720
We describe the philosophy and design of the control flow machine, and present the results of detailed simulations of the performance of a single processing element. Each factor is compared with the measured performance of an advanced von Neumann computer running equivalent code. It is shown that the control flow processor compares favorably in the program. Our compiling strategy is to exploit coarse-grain parallelism at function application level, and the function application level parallelism is implemented by fork-join mechanism. The compiler translates source programs into control flow graphs based on analyzing flow of control, and then serializes instructions within graphs according to flow arcs such that function applications which have no control dependency are executed in parallel. A hierarchical macro control flow computation allows them to exploit the coarse-grain parallelism inside a macrotask, such as a subroutine or a loop, hierarchically. We use a hierarchical definition of macrotasks, a parallelism extraction scheme among macrotasks defined inside an upper-layer macrotask, and a scheduling scheme which assigns hierarchical macrotasks on hierarchical clusters. We have demonstrated that to achieve the best execution time for a control flow program, the number of nodes within the system and the type of mapping scheme used are particularly important. In addition, we observe that a large number of subsystem nodes allows more actors to be fired concurrently, but the communication overhead in passing control tokens to their destination nodes causes the overall execution time to increase substantially. The relationship between the mapping scheme employed and locality effect in a program are discussed. The mapping scheme employed has to exhibit a strong locality effect in order to allow efficient execution. We assess the average number of instructions in a cluster and the reduction in matching operations compared with fine-grain control flow execution. We present a denotational semantics for a logic program to construct a control flow for the logic program. The control flow is defined as an algebraic manipulator of idempotent substitutions and it virtually reflects the resolution deductions. We also present a bottom-up compilation of medium-grain clusters from a fine-grain control flow graph. We compare the basic block and the dependence sets algorithms that partition control flow graphs into clusters. We apply a parallel simulation scheme to a real problem: the simulation of a control flow architecture, and we compare the performance of this simulator with that of a sequential one. Moreover, we investigate the effect of modeling the application on the performance of the simulator. Our study indicates that parallel simulation can reduce the execution time significantly if appropriate modeling is used. Medium-grain execution can