Combinatorial Algorithms for Fast Clock MeshOptimization∗Ganesh Venkataraman, Zhuo Feng, Jiang Hu, Peng LiDept. of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843{ganesh, jianghu}@ece.tamu.edu, {fengzhuo, pli}@neo.tamu.eduABSTRACTWe present a fast and efficient combinatorial algorithm to simul-taneously identify the candidate locations as well as the sizes ofthe buffers driving a clock mesh. Due to the high redundancy, amesh architecture offers high tolerance towards variation in theclock skew. However, such a redundancy comes at the expense ofmesh wire length and power dissipation. Based on survivable net-work theory, we formulate the problem to reduce the clock meshby retaining only those edges that are critical to maintain redun-dancy. Such a formulation offers designer the option to trade-offbetween power and tolerance to process variations. Experimentalresults indicate that our techniques can result in power savingsup to 28% with less than 4% delay penalty.1. INTRODUCTIONThe function of the Clock Distribution Network (CDN) isto deliver the clock signal from the clock source to the clocksinks. The design of CDN could include multiple (and oftenconflicting) objectives like wire length, p ower, signal slewrate and tolerance of clock skew to variations. Tree base ddistributions offer the advantage of simplicity (single pathbetween source and sinks) as well as lower wirelength [3].However, tree based distributions have a relatively low tol-erance towards variations.Non-tree based distributions, on the other hand provide ahigh tolerance to variations [4–7] due to the redundancy cre-ated by multiple paths between clock source and the sinks.One of the most widely used non-tree based CDN is clo ckmesh [9]. The mesh consists of a rectangular grid driven bya top level tree. The buffers at the leaf of the top level treeshall henceforth be referred to as mesh bu ffers. Mesh ar-chitecture is used mainly in high performance systems suchas IBM G 5 [8], Power4 [9] and SUN Sparc V9 [10]. In allthe above processors, a very low clock skew has been re-ported which proves the effectiveness of the clock mesh inmitigating skew. Clock mesh consumes significantly higherwire area compared to tree based distributions. (up to 168%higher area compared to tree [11]). Higher wire area leadsthat a higher load capacitance for the clock buffers which inturn implies a higher power dissipation.The maximum permissible delay between any two regis-∗This work was supported in part by SRC under contractnumber 2004-TJ-1205 and 2006-TJ-1416.Permission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies arenot made or distributed for profit or commercial advantage and that copiesbear this notice and the full citation on the first page. To copy otherwise, torepublish, to post on servers or to redistribute to lists, requires prior specificpermission and/or a fee.ICCAD’06, November 5-9, 2006, San Jose, CACopyright 2006 ACM 1-59593-389-1/06/0011 ...$5.00.ters (i, j) is given by [2]:Pijdelay= Tclock− tsetup− skewij(1)In the above equation, Tclockdenotes the clock period, tsetupthe set up time and skewijdenotes the skew between i and j.Pijdelayrepresents the maximum permissible delay between iand j. Pdelay= min∀(i,j)Pijdelaydenotes the maximumpermissible delay of the entire circuit. Typically, inthe zero skew design skewijis designed to be zero. How-ever, in the presence of variations, skewijmay increase, sub-sequently reducing the maximum speed at which the circuitcan function. Higher skew would bring down the maximumpermissible delay Pdelay. Hence a mesh architecture is suitedwell for high performance systems since it mitigates the clockskew even though the resource consumption is high (wirelength, power etc.). However, wi th power occupying an in-creasingly important role in chip design, it may be necessaryto trade the clock skew for low power. At the very least, thedesigner should be given the flexibility to do so. Even thoughthere has been previous works on in mesh architecture, thefollowing issues remain largely unaddressed:• What are the ideal locations to drive the clock mesh?Is it ok to distribute the drivers uniformly across themesh?• Can the mesh buffers be sized differently? If yes, thendo e s it work better than sizing uniformly?• A mesh has a high level of redundancy. Can someamount of redundancy be sacrificed to reduce the wirelength? If yes, then how much is the trade-off? Canwe quantify the skew vs power trade-off?In this work, we address all the above mentioned issues.Our contributions include: The contributions in this workinclude:• We propose a set-cover based algorithm for finding themesh buffer locations and their sizes. Our algorithmworks fast on a discrete library of buffer sizes.• We formulate the mesh reduction problem by usingsurvivable network theory. We present heuristics forsolving the formulation efficiently and quickly. Exper-imental results indicate up to 29% reduction in wirelength, 28% reduction in power with less than4% increase in delay penalty.• Our techniques allow the designer to trade-off betweenskew and power dissipation. In fact, the formulationpresented is flexible enough to allow a high range oftrade-off (that is either a high skew- low power de-sign or a low skew - high power design or anywhere inbetween).• Our algorithms run very fast. In fact, it runs withina few seconds even for large test cases. Such ahigh speed helps the designer to run the same algo-rithm several times with different parameter valuesthat produce different solutions in the power delaycurve.• We present an efficient gate delay model suited forthe clock mesh. Such a model achieves near SPICEaccuracy with speed-up up to 62X compared withHSPICE. Such a speed up is particularly helpful inanalyzing large mesh with thousands of clock sinks.2. PRELIMINARIES AND PROBLEMSTATEMENTWe shall introduce certain notations and conventions whichwill be followed throughout the paper.• m × n denotes the dim ensi on of the clock mesh. Idenotes the set of no des in the mesh.• Clock buffers of B sizes {b1, b2, . . . bk} in non-decreasingorder. Buffer bican drive a load of capacitance at mostci.• Buffer Mapping Function BM : i → j maps each nodelocation i ∈ I to j ∈ B. BM (i) = φ implies that thelocation i has no buffer in it.• S = {s1, s2. . . sn} denotes the set of
or
We will never post anything without your permission.
Don't have an account? Sign up