# A POWER DELIVERY NETWORK AND CELL PLACEMENT AWARE DYNAMIC IR MITIGATION TECHNIQUE: HARVESTING UNUSED TIMING SLACKS TO SCHEDULE USEFUL SKEWS

by Lakshmi Saraswathi Bhamidipati

A Thesis Submitted to the Graduate Faculty of George Mason University in Partial Fulfillment of The Requirements for the Degree of Master of Science, Electrical Engineering

| Committee: | |
|------------|-----------------------------------------------------------|
| | Dr. Avesta Sasan, Thesis Director |
| | Dr. Houman Homayoun, Committee Member |
| | Dr. Jens-Peter Kaps, Committee Member |
| | Dr. Monson Hayes, Department Chair |
| | Dr. Kenneth S. Ball, Dean, Volgenau School of Engineering |
| Date: | Fall Semester 2016, George Mason University, Fairfax, VA |

A Power Delivery Network and Cell Placement Aware Dynamic IR Mitigation Technique: Harvesting Unused Timing Slacks to Schedule Useful Skews

A Thesis submitted in partial fulfillment of the requirements for the degree of Master of Science at George Mason University

by Lakshmi Saraswathi Bhamidipati
Bachelor of Electronics and Instrumentation Engineering
Anna University, 2013

> Director: Avesta Sasan, Professor
> Department of Electrical Engineering
>
> Fall Semester 2016
> George Mason University
> Fairfax, VA

Copyright 2016 Lakshmi Saraswathi Bhamidipati
All Rights Reserved

# **DEDICATION**

I dedicate this thesis to my parents B. Jayasree and B.N. Sastri, my sisters B. Srilalitha, B. Haripriya, B. Aarna and my brother B. Krishna.

### **ACKNOWLEDGEMENTS**

I would like to express my heartfelt gratitude to my advisor Dr. Avesta Sasan for introducing me to several aspects of VLSI Design and for continuously motivating and guiding me throughout the research. This thesis would not have been possible if not for him. I would also like to take this opportunity to thank Bhoopal Reddy for his continuous support in getting the work done. Finally, I would like to thank Aditya Atkuri for his moral support and encouragement.
# TABLE OF CONTENTS

- List of Tables
- List of Figures
- List of Equations
- List of Abbreviations and/or Symbols
- Abstract
- 1 Introduction
  - 1.1 Static IR Drop v/s Dynamic IR Drop
- 2 Background
  - 2.1 Useful Skew for IR Drop Reduction
- 3 On Die PDN Construction
  - 3.1 Package Design
  - 3.2 Power Switches
  - 3.3 Connectivity to Power Switches
- 4 Impact of IR drop on Timing
- 5 Methodology and Algorithm
  - 5.1 Problem formulation
- 6 Experimental Results
  - 6.1 Impact on timing
  - 6.2 Peak current reduction
  - 6.3 Differential Voltage Difference (DVD) Reduction
  - 6.4 Instance Peak Current Reduction
- 7 Future Work and Discussion
- 8 References

# LIST OF TABLES

- Table 1 Toggle rate for some of the gates
- Table 2 Comparison of all benchmark circuits

# LIST OF FIGURES

- Figure 1 Peak Current Demand
- Figure 2 Distinguishing localized current from peak current
- Figure 3 Implementing Useful Skew
- Figure 4 Construction of the PDN
- Figure 5 Simultaneous switching of cells at the M1-M7 intersection
- Figure 6 Physical layout of timing path over the IR hotspot
- Figure 6 Algorithm to calculate CAT
- Figure 7 Structure of the Timing Path
- Figure 8 Threshold Voltages v/s Nominal Delay
- Figure 9 Peak reduction v/s Hot spot mitigation
- Figure 10 TW for cells in MRR and their current demand
- Figure 11 Algorithm to Schedule CAT
- Figure 12 Skew Transfer without violations
- Figure 13 Algorithm to Calculate CAT
- Figure 14 Marking Cells Early and Late (Yellow=early, Red=Late)
- Figure 15 Start and endpoints from Register X and their useful skew calculation
- Figure 16 Peak Current reduction using the algorithm
- Figure 17 Worst DVD drop for the designs
- Figure 18 Average Currents for the designs
- Figure 19 IR Drop before and after running our algorithm and range of voltage drop
- Figure 20 Instantaneous VDD vs Time
- Figure 21 Peak Current vs Time
- Figure 22 Original and Proposed DVD
- Figure 23 Proposed Peak Current

# LIST OF EQUATIONS

- $L\frac{di}{dt}$
- $x - \frac{D}{2} < X[C[j]] < x + \frac{D}{2}$
- $Y[C[j]] = y[i]$
- $v[i] \in MRP[C[j]]$

# LIST OF ABBREVIATIONS AND SYMBOLS

| Term | Abbreviation |
|-----------------------------|--------------|
| Voltage Drop | IR Drop |
| Power Delivery Network | PDN |
| Clock Tree Synthesis | CTS |
| Clock Arrival Time | CAT |
| Nominal Threshold Voltage | NTV |
| Super Threshold Voltage | STV |
| Threshold Voltage | |
| Toggle Rate | |
| Metal 1- Metal 7 | M1-M7 |
| Effective Series Resistance | ESR |
| Decoupling Capacitors | DECAPS |
| Circuit Under Design | |
| Re-distribution Layer | RDL |
| Minimum Resistance Path | MRP |
| Supply Voltage | VDD |
| Ground Voltage | |
| Micro meter squared | |

**ABSTRACT**

A POWER DELIVERY NETWORK AND CELL PLACEMENT AWARE DYNAMIC IR MITIGATION TECHNIQUE: HARVESTING UNUSED TIMING SLACKS TO SCHEDULE USEFUL SKEWS

Lakshmi Saraswathi Bhamidipati, M.S.

George Mason University, 2016

Thesis Director: Dr. Avesta Sasan

To prevent setup and hold failures during the operation of a chip, different sources of on chip variability need to be modeled and margined during the physical design. One of the sources of the variability is dynamic IR drop and cycle to cycle voltage variation. The excessive IR drop or large cycle to cycle voltage variation could cause various forms of timing failure.
In this thesis, we present a novel technique for reducing the dynamic IR drop by leveraging available timing slacks and scheduling useful skews. Unlike previous work, which focuses on reducing the peak current, we break down the peak current minimization problem into many smaller problems of reducing the intensity of individual hot spots. In addition to timing information, the power delivery network, floorplan, and cell placement information are considered while scheduling the clock arrival times. This technique reduces the peak dynamic IR drop by ~50%, peak current by ~30%, and cycle to cycle voltage variation by more than 30%.

## 1 INTRODUCTION

A synchronous pipelined circuit relies on the distribution of the clock for timing management. At every rising edge of the clock, each register captures the incoming signal and injects it into the next stage of the combinational circuit. Each rising edge of the clock gives rise to a surge of switching activity; however, the switching activity is quickly suppressed as signals propagate down the timing paths [1]. As illustrated in figure 1, this naturally makes the triggering edge of the clock the timing window in which the peak current demand and the peak IR drop occur.

Figure 1: Peak current demand

The situation is worsened when synchronous circuits are optimized for zero clock skew [2], since all registers fire at the same time. Many previous studies have investigated techniques to reduce the IR drop. IR drop has a resistive and an inductive element. The resistive element of the voltage drop, denoted by IR, can be reduced by lowering the resistance of the Power Delivery Network (PDN) or by reducing the current demand. The inductive element of the voltage drop, denoted by $L\frac{di}{dt}$, can be reduced by reducing the inductance of the board and package, or by reducing the rate of change of the demanded current. Following these guidelines, several researchers have attempted to formulate techniques for IR drop reduction.
For example, the work in [3] sizes the P/G lines in the PDN to reduce the voltage drop by reducing the PDN resistance. This is an effective method; however, its application is limited because it trades PDN resistance against routing resources. Considering that the silicon area in advanced geometries is dominated by routing resources, such an approach cannot be pushed beyond a certain limit. The work in [4] [5] focuses on PDN planning and synthesis for IR drop reduction while minimizing metal usage. Clock skew optimization was first explored by J.P. Fishburn [6], followed by Vittal et al. [7] and Benini et al. [1], who scattered clock signal arrival times. A polarity assignment technique was later developed by Nieh et al. [8] to reduce the total peak current. The work in [9] tried to reduce the peak current by state replication and recoding in FSM circuits. The work in [10] proposes a heuristic algorithm that reduces the peak current by scheduling clock skew for large circuits. A more interesting approach is taken by [11] [12] [13] [2], where the Clock Arrival Times (CAT) to individual registers are skewed to reduce the peak current, attempting to smooth the overall current signature of the circuit.

These solutions are indeed very effective in reducing the overall peak current; however, they have limited ability to address the occurrence of local IR hot spots. This is because IR hot spots are created by the simultaneous switching of a small collection of cells that are placed close to one another and share the most resistive parts of the PDN, such as M1 rails and lower metal layer vias. Therefore, although peak current reduction techniques widen the distribution of CATs, they are unaware of cell placement with respect to the PDN and cannot prevent a concentrated cluster of registers from having similar CATs. Our proposed technique is an IR hot-spot mitigation technique.
During the physical design, this technique is applicable at any stage after the Clock Tree Synthesis (CTS) to remove IR hot spots. The proposed technique modifies the CAT, with consideration for the placement and connectivity of the different registers in the hot-spot region, to reduce the intensity of IR hot spots.

Figure 2: Distinguishing localized current from peak current

## 1.1 Static IR Drop v/s Dynamic IR Drop

As shown in figure 1, Static IR Drop depends on the overall peak current of the circuit, as opposed to Dynamic IR Drop, which occurs due to localized currents caused at the clock rising edges and exists for a shorter duration. Thus, it can be said that Dynamic IR Drop depends on the switching activity of the cells, whereas Static IR Drop depends entirely on the clock period [14]. IR Drop can cause hold time violations on the clock network and setup violations on the data path signals. Most of the time, undetected silicon IR drop means that the appropriate voltage does not reach a transistor, which can to an extent be compensated by increasing the supply at board level. In extreme cases, it could also lead to customer returns because of a timing failure whose root cause is eventually found to be a higher-than-rated voltage drop during operation. Dynamic voltage drop cannot be modeled by traditional Static Timing Analysis alone [15]. Hence, it is vital to mitigate Dynamic IR Drop.

#### 2 BACKGROUND

In order to prevent setup and hold timing failures during the operation of a chip, different sources of on-chip variability should be modeled and margined for. One of the sources of on-chip variability is dynamic IR drop and its implied cycle to cycle voltage variation. Excessive IR drop or large cycle to cycle voltage variation could cause various forms of timing failure.
Considering that a hold or setup failure of a single timing path is enough to make the chip non-functional, the margins for IR drop and voltage noise are calculated based on the worst case scenario and not the average case. With the rising interest and growth in applications that demand very low power profiles, such as Internet of Things (IoT), Cyber Physical Systems (CPS), and mobile, handheld and wearable devices, it is highly desirable to design and operate the circuit Near the Threshold Voltage (NTV) of its transistors, as this region is the energy optimal point of operation [14]. However, voltage variation poses a big challenge for the implementation of NTV circuits. With 5% voltage noise in a design operated at Super Threshold Voltage (STV), a performance variation in the range of 10-20% is observed [15]. The situation becomes far worse when the circuit is operated at NTV, where 5% voltage noise causes more than ~200% performance variation [15]. To prevent the delay variation from causing timing failures, large design margins for IR drop and endpoint uncertainty are adopted.

From this discussion, reducing the IR drop at STV makes the circuit more competitive, as the reduced IR drop, and the resulting higher voltage seen by the transistors, could be used to improve the Power, Performance, and Area (PPA) of the chip. At NTV, reduction of IR drop and noise is even more important, as it could determine the existence or feasibility of these solutions. For this reason, it is highly desirable to reduce the peak IR drop and the extent of cycle to cycle noise in the physical design. Note that if IR drop is fought only with margining techniques, the price will be paid in terms of PPA, and, as mentioned, the penalties increase exponentially when moving to lower voltages. Although previous work that aims at reducing the peak current by scheduling useful skews proves useful in reducing the overall IR drop, it is blind to the existence and occurrence of local hot spots.
Therefore, during the timing analysis, the larger IR drop reported for IR hot regions must still be considered; otherwise timing failures would occur, and a single timing failure is enough for a circuit to become non-functional. In this work, we revisit the idea of useful skew scheduling, however with the primary objective of reducing the intensity of IR hot spots rather than reducing the overall current. This requires considering the PDN, the placement, and the availability of decaps prior to scheduling the useful skews.

## 2.1 Useful Skew for IR Drop Reduction

The conventional goal of CTS algorithms is to build a clock tree with minimized or zero skew in order to distribute the timing slacks evenly between the various pipeline stages in both data and control paths. However, pipeline stages cannot always have equal delays, and it is the delay of the longest stage that dictates the overall achievable frequency. To alleviate this problem, CTS flows adopted the idea of time borrowing by means of scheduling useful skews [16] [17] [18]. As illustrated in figure 3, this is done by engineering the arrival time of the clock such that a portion of the timing slack of one stage is pushed to another. This widely adopted technique, which is now available in many commercial EDA tools [19] [20] [21], takes the available timing slack from non-critical timing paths and gives it to critical and violating timing paths for the purpose of timing closure [22] [23] [24]. The source stage and destination stage for the transfer of timing slack do not need to be successive stages; the timing slack could be moved across many stages until it is delivered to the destination stage. Useful skew could also be explored to improve dynamic and leakage power recovery by pushing the available timing slacks from short data paths to near critical data paths, increasing the chances of VT and cell swapping [25] [26].
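The slack transfer described above can be illustrated with a small sketch. It is not taken from the thesis: the `Stage` class, the `setup_slack` helper, the register names and all delay values are hypothetical, and only the textbook setup check is modeled (hold checks, which bound how far a clock may be skewed, are deliberately left out).

```python
from dataclasses import dataclass


@dataclass
class Stage:
    launch: str       # name of the launching register (hypothetical)
    capture: str      # name of the capturing register (hypothetical)
    max_delay: float  # worst-case data path delay in ns, including clk-to-q


def setup_slack(stage, cat, period, t_setup=0.0):
    """Setup slack of one stage for the given clock arrival times (CAT)."""
    return period + cat[stage.capture] - cat[stage.launch] - stage.max_delay - t_setup


# Hypothetical three-register pipeline R1 -> R2 -> R3 with a 1.0 ns clock.
PERIOD = 1.0
stages = [Stage("R1", "R2", 1.1), Stage("R2", "R3", 0.7)]

zero_skew = {"R1": 0.0, "R2": 0.0, "R3": 0.0}
# Delaying the clock at R2 by 0.15 ns lends slack from stage 2 to stage 1.
useful_skew = {"R1": 0.0, "R2": 0.15, "R3": 0.0}

slacks_before = [setup_slack(s, zero_skew, PERIOD) for s in stages]
slacks_after = [setup_slack(s, useful_skew, PERIOD) for s in stages]

print("zero skew  :", [round(s, 3) for s in slacks_before])
print("useful skew:", [round(s, 3) for s in slacks_after])
```

With zero skew the first stage violates setup by about 0.1 ns while the second has about 0.3 ns of slack; after the 0.15 ns shift at R2 both stages have positive slack. The same mechanism is what allows clock arrival times to be spread apart without creating violations, as long as the donor stage has slack to give.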
Figure 3: Implementing Useful Skew

Figure 1 illustrates the current signature of an example design, a DES engine, obtained after power and IR analysis using Ansys Apache RedHawk [27]. Each current spike is in line with the triggering edge of the clock. Each triggering edge of the clock (the rising edge in this example) initiates a surge of toggle activity. However, as the signals propagate down the different timing paths, their toggle activity is quickly suppressed, and thus their current demand is quickly reduced, as illustrated. This identifies the triggering edge of the clock as the timing window where the peak current and the largest IR drop occur. From this observation, widening the distribution of clock arrival times reduces the number of simultaneous switching events, resulting in a reduction of the peak current. This has been previously studied in [2] [12] [13] [11] and successfully implemented to reduce the peak current. However, although these techniques reduce the overall peak current, they cannot effectively address the problem of local IR hot spots. This is because widening the distribution of the arrival time of the clock triggering edge gives no guarantee against simultaneous switching of a subset of registers that are placed in close proximity to each other. In order to address this issue, in addition to the arrival time of the clock, the connectivity to the PDN, the placement of the cells, and the suppressing impact of the various decoupling capacitances (either dedicated or device decap) should be considered. The simultaneous switching of cells that are placed close to one another becomes concerning when these cells share the same M1 (and possibly M2) rail(s) and the same via stack for power delivery (power, ground, or both). To explain how our technique works, we first need to describe how the PDN is constructed and when a part of the PDN becomes an IR hot spot.
#### 3 ON DIE PDN CONSTRUCTION

In this section, we explain how a generic on-die PDN is constructed. The discussion is tuned for more advanced geometries (45nm and below), where 9 or more on-die metal layers are available. Construction of the PDN starts with building the Metal 1 (M1) rails to which the power and ground pins of standard cells are connected. M1 rails are laid out in the design alternating between Power (P) and Ground (G). M1 rails are usually implemented by placing filler cells in the design, pre-routing them, and removing them, leaving behind M1 rails that are separated by the height of the standard cells. M2 rails could optionally be routed parallel to the M1 rails. Considering the increase in current and power density in state of the art processes, the M1 rail alone may not be strong enough to meet IR drop and Electromigration (EM) requirements. For this reason, parallel M2 rails are being adopted in geometries below 32nm and are becoming a must as we move to 10 and 7nm designs. If M2 rails are laid in parallel with M1 rails, via-1s are inserted at regular intervals to improve the resistive characteristic of the parallel rails. This also improves the flight of charge between the distributed De-Coupling Capacitors (decaps) connected to the M1 rail and the voltage deprived cells on the same rail by reducing the Effective Series Resistance (ESR), which makes the decaps more useful.

Figure 4: Construction of the PDN

A batch of higher-level metal straps, with a routing direction orthogonal to the M1 rails, is used to distribute the power vertically. These orthogonal straps are usually implemented in M7; however, it is possible to use M5 and possibly local M3 rails as well (depending on the size of the block, the number of metal layers available, the power and current density on die, etc.).
However, M7 rails have a lower resistance, and considering that lower metal layers are used more heavily by the router, using M7 creates fewer routing issues than M5 and M3. The M7 power and ground straps are connected to the M1 or M2 rails at each location where they intersect using a via stack. When choosing the size of the via stack, its tradeoff with routing resources should be considered. Larger via stacks reduce the resistive IR drop but consume more routing resources, and vice versa. In addition, the distance between the orthogonal M7 straps also plays a role in the selection of via stack sizes. As M7 straps are pushed closer to one another, the number of intersections, and therefore the number of via stacks, increases, reducing the peak current through each via stack, and vice versa. Therefore, as the distance between M7 straps is reduced, the size of the individual via stacks could be reduced.

Further construction of the PDN in higher metal layers helps with a more uniform distribution of current. Usually two of the upper metal layers (M8/M9) are again used for the PDN construction. The PDN in these layers could be sparse or dense. Sometimes global signals are routed in these layers, and the PDN should therefore allow these signals to pass over the block, making it sparser. The PDN is then extended to the Re-Distribution Layer (RDL) to further distribute the supplied voltage and to connect the PDN to the bumps at the RDL. Note that as we move up in the PDN layers, the width of the PDN metal straps increases. Pads/bumps are then connected to the package. The package also plays a big role in dictating the IR drop, as it is highly inductive. The power and ground routes in the package should be alternated to reduce inductive coupling and the resulting $L\frac{di}{dt}$ voltage drop. Note that the IR drop in the package and board is mostly of an inductive nature.

## 3.1 Package Design

Pads/bumps are the connection to the package.
Package design is also an important aspect of IR drop minimization, with the ultimate goals of reducing the inductive coupling within the package, reducing the inductance and resistance of the traces from in-package or on-package decoupling capacitance to the different pads, and assuring that individual traces in the package have an acceptable inductive profile. A major improvement in package design comes from optimizing the inductive traces with respect to the underlying current or power density scenarios, minimizing the inductance and resistance of the traces that are connected to bumps on top of power and IR hot spots.

#### 3.2 Power Switches

The remaining component of the on-die PDN is the insertion of power switches. The ASIC could be designed without power switches as always on (or with a VMU controlled power switch), or with on-die power switches. The on-die power switches could be one of three kinds: fine-grain distributed power switches, mild-grain distributed switch islands, and global power switches. The fine-grain power switches are distributed uniformly in the design and are controlled by a broadcast control signal. The control signal is usually constructed in one or multiple chains (or with a fish-bone structure, or a hybrid of fish-bone and chain), similar to a scan chain. In this implementation, the control signal regulates a switch and is then propagated through a buffer to the next switch. The buffer could be inserted within or after each switch cell (or possibly after every N switches) to control the in-rush current during power up and power down. Design for in-rush current is also a tradeoff between how responsive the system is to shut down and power up versus how much in-rush current is induced in the system. In-rush current and its implied inductive IR drop become a serious concern when the PDN of a block is shared with multiple other blocks in the package or board, as the in-rush induced inductive drop affects the voltage of other running blocks that share the PDN.
The mild-grain distributed switch islands are constructed of larger switches or a batch of tightly placed switches. The mild-grain switches are distributed; however, the frequency of repetition of the switches is lower, and the distance between switch islands is much larger, than that of fine-grain distributed switches. The global power switches usually sit on the side(s) of a voltage regulated block and connect to a power mesh around the block. This implementation does not interfere with the routing and placement of the cells; however, the voltage is less uniformly distributed in the design. The switches in this implementation are also larger than those of mild-grain power switches.

## 3.3 Connectivity to Power Switches

The fine-grain power switches are placed like standard cells (they might be single or double height). They are distributed in the design at pre-defined locations right underneath the vertical voltage distribution straps (M7 in the text above). In this case the via stack connects the M7 strap to the power pin of the switch rather than to the M1 or M2 rail. The connection between the PDN and the M1 rail is internal to the switch and is managed by the control signal and the switch transistor. The mild-grain and global switches, however, are connected slightly differently. The power pins of these switches are connected to a higher metal layer (M7, M8 or M9) using a via stack. The output of these switches is then connected to another via stack that moves up to a higher metal layer, from which it is distributed across the chip. The upper metal layers in turn connect to the M1 (and optional M2) rails using another set of via stacks. This means that for both mild-grain and global switches there are three sets of via stacks that have to be traversed before the charge/current from the RDL reaches the M1 rail.
However, note that the via stacks that connect the RDL to the switch, and the switch back to the distribution layer, are usually very large and have considerably less IR drop compared to the third set of via stacks, or to the via stacks that are used in the fine-grain switch distribution scheme.

With the PDN construction described, let us return to our proposed clock skew driven IR mitigation methodology. Depending on the placement and location of the high IR cells, their timing windows, and the timing slack in their paths, we can determine whether the problem can be fixed with clock skew adjustment. Let us first define the "timing window" of a cell. The delay of a cell is a function of the voltage of the cell at the time that an input signal reaches the cell. Therefore, although a cell could see a varying voltage during a clock cycle, what determines the speed of the cell is only the voltage waveform, or more specifically the differential voltage (VDD(t)-VSS(t)), that the cell sees while it is propagating a signal. Simultaneous switching of cells that share one or both power rails in close proximity calls for additional current through the via stack and M1 rails and causes a larger IR drop. Figure 5 gives an example of a layout of multiple cells after cell placement. If cells A and B share a portion of their switching windows, they could be activated at the same time. Because they share both the VDD and VSS rails, the current demand through the shared PDN is increased for the duration in which both cells are active. This causes a larger resistive and inductive drop. Cells C and D, if they switch at the same time as cell A, also have a high impact, although smaller than that of cell B, as they share only one rail with cell A.

Figure 5: Simultaneous Switching of cells at the M1-M7 intersection

This is a good place to also introduce the concept of the Minimum Resistance Path (MRP). The MRP is the lowest resistance path from a cell to a bump or a pad.
For a typical PDN as described earlier, we will have a distinct MRP for each of the ground and power pins of each cell. For example, in the case of cell A, the MRP for the VDD pin or the VSS pin traverses the M1 rail to the right or to the left to the closest via stack, goes up the via stack, traverses the vertical power distribution strap (M7) up or down to the closest Via7, and follows the lowest resistance path in the upper metal layers until the closest bump or wire bond is reached. The MRP carries the largest share of the charge into (power pin) or out of (ground pin) a cell. Note that cells A and B share the same MRP, and therefore simultaneous switching of these cells has an additive effect on the incoming and outgoing charge in the shared MRP. Cells D and C share one MRP with cell A, and therefore could cause a moderate drop if switched at the same time. Cells F and J share a portion of the MRP (the via stack but not the M1 rail). Cells H, I and G are too far away; they connect to the PDN and share the MRP only at higher layers after the via stack, and will have little impact when switched at the same time as cell A. Note that cells also have different strengths and current demands depending on their input signal transition time, the strength of their transistors, and the output capacitance they drive. It is therefore possible for a current hungry cell that sits further away from the area of concern and shares a smaller portion of the MRP to have an impact equal to or larger than that of a small cell with a low current demand that shares a large portion of the MRP.

From this discussion and the background provided, it is evident that engineering the activation timing windows of spatially close cells with overlapping switching windows (timing windows) could go a long way in reducing the charge demand and therefore the IR drop. In addition, as explained, the switching activity is largest at the beginning of a clock cycle and is suppressed very quickly as the signal propagates down the timing path.
Therefore, spreading the arrival times of the clock in a local region with a possible IR issue could drastically reduce simultaneous switching and instantaneous current demand, clearing the IR hot spots. In the rest of this thesis, we explain our approach and methodology for reducing the instantaneous voltage drop using useful skew and cross data path time borrowing, and we report the IR drop improvement obtained by applying our methodology in the Results chapter. We further explain how this methodology could be tuned to have minimal or no impact on circuit timing and on leakage recovery through VT swapping. We present the results of applying our technique to 5 different circuit benchmarks and illustrate its effectiveness as an easy and applicable remedy for treating IR related timing issues in the design.

#### 4 IMPACT OF IR DROP ON TIMING

The delay of a cell depends on the voltage waveform, or more specifically the differential voltage $V_{DD}(t) - V_{SS}(t)$, that is available at its power and ground pins while it is propagating a signal. This duration of time is formally defined as the cell's Timing Window (TW). Static Timing Analysis (STA) assumes a fixed voltage for the purpose of timing closure. In reality, no two cells see the same voltage waveform; the voltage waveform at each point of the PDN is unique and is affected by the pattern of cell activity and the dynamic nature of the demanded current. Therefore, each cell experiences a different voltage signature at its power and ground pins, and the correlation between the voltages experienced by two cells decreases as the cells are placed further apart and share a smaller portion (the upper layers) of the PDN. Figure 6 illustrates the physical layout of two timing paths in the design, drawn on top of the IR map of the circuit. Let us first consider the two highlighted cells. Cell A is in an IR hotspot.
Due to higher activity in its local region, this cell usually sees a lower mean voltage. At the same time, being in an area where the toggle activity changes frequently, cell A sees a larger voltage variation over time. Cell B is located in an area with low toggle density and therefore generally sees a higher voltage and less variation in the observed voltage.

Figure 6: Physical layout of timing path over the IR hotspot

The structure of each timing path, as illustrated in figure 7, could be broken into Common, Launch, Capture, and Data paths. As illustrated in figure 7, it is possible to have most of the launch path in an IR hot spot whereas the capture path lies mostly in a lower IR drop region, and vice versa. Therefore, it is possible that a substantial difference is observed between the voltages supplied to the cells in the launch and capture paths. This means that, as the variation between the voltages observed in high IR and low IR regions increases, the timing checks should be additionally margined for it. This margin cannot be applied in terms of a lower voltage, as the lower or higher voltage is only seen at the launch or the capture portion of the clock. Therefore, it should be margined as jitter and modeled using uncertainty. The larger the IR drop, the larger the uncertainty. In addition, in a timing path, the capture happens at least one cycle after the launch. Therefore, the voltage of the capture path could change even further. That is why cycle to cycle voltage variation becomes important, and such voltage variation is also modeled and margined using uncertainty and not using IR drop. There are two main contributing factors to cycle to cycle voltage variation: (1) the overall change in peak current, which causes RLC oscillation in the board and package and causes the voltage to change from cycle to cycle; this is a slowly changing voltage; and (2) the change in toggle rate of local cells, which could result in temporary depletion of charge and high frequency noise.
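As a reminder of where such an uncertainty margin enters, the standard setup and hold checks can be written as follows. This is the generic textbook form rather than a formula from this thesis; the symbols $t_{launch}$ and $t_{capture}$ (clock arrival at the launching and capturing registers), $d^{max}_{data}$ and $d^{min}_{data}$ (longest and shortest data path delays), $T$ (clock period), and $U_{setup}$, $U_{hold}$ (uncertainty margins) are introduced here only for illustration.

$$
t_{launch} + d^{max}_{data} + t_{setup} + U_{setup} \le T + t_{capture}
$$

$$
t_{launch} + d^{min}_{data} \ge t_{capture} + t_{hold} + U_{hold}
$$

A larger IR drop difference between the launch and capture paths, or a larger cycle to cycle voltage variation, translates into larger values of $U_{setup}$ and $U_{hold}$, which directly consume the slack available on both checks.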
Figure 7: Structure of the timing path

Different cells will be impacted differently by voltage noise and voltage drop. Figure 8 illustrates the delay of 5 cells with different threshold voltages, including ultra-high, high, standard, low, and ultra-low threshold voltages, abbreviated as UHVT, HVT, SVT, LVT, and ULVT respectively. The higher the threshold voltage, the lower the voltage headroom and the larger the delay increase. In addition, this figure illustrates the impact of a 50mV voltage variation on the delay of different cells. As illustrated, at one extreme the ULVT cell sees only 60% delay variation, whereas the HVT cell sees 450% delay variation, and the UHVT cell does not even operate at the lower voltage after seeing 50mV of additional IR drop.

Figure 8: Threshold voltages v/s Nominal Delay

Having explained the role of the PDN in the formation of hot spots, the way clock arrival time scheduling could help us reduce IR drop, and why it is important to control the IR drop to minimize its impact on circuit timing, we move to the next section to explain our proposed technique.

#### 5 METHODOLOGY AND ALGORITHM

In order to reduce the Dynamic IR drop, rather than reducing the overall peak current, we focus on reducing the intensity of local hot spots. Previous work that has addressed peak current reduction, as illustrated in figure 9, cares only about the current signature observed at the package balls (or, in the case of a wire-bond package, at the pads). In this case, if the peak value of $I_{VDD}(t)$ and $I_{GND}(t)$ is reduced, RLC resonance in the package and board is reduced, and all cells in the design see a higher mean voltage and a smaller overall IR drop. Although a peak current mitigation technique widens the distribution of the clock arrival times, many timing endpoints with similar arrival times could still be placed close to one another.
This is because these techniques are pre-placement techniques and are not aware of the physical layout and placement of individual registers and their relative connectivity to the PDN. By using a peak current reduction technique, the overall IR drop will be reduced; however, it is the possibility of high, concentrated, local current demand that causes the formation of regions with high IR drop.

In order to mitigate this problem, we break the problem of peak current reduction into many smaller problems. The most resistive section of the PDN is the M1 rail and the via stack that connects M1 to the wider upper layer metal straps. Because M1 rails and lower level via stacks are highly resistive, they are the problematic area in the formation of high-IR regions. Therefore, if the current demanded through each lower level via stack could be lowered, the occurrence of IR hot-spots could be mitigated.

Figure 9: Peak current reduction v/s Hot Spot Mitigation

Let us define the Minimum Resistance Path (MRP) as the lowest resistance path from a cell to a bump or a pad. Considering that a PDN, constructed as explained in section 2.2, has a regular structure, for each standard cell the via stacks closest to the power and ground pins of that cell are a part of its MRP. Note that most of the current delivered to, or returned from, a logic cell runs through its MRP. Let us also define a few terms which will be used in formulating the problem:

**MRP(FF[j]):** the MRP of flip-flop FF[j].

**CAT(FF[j]):** Clock Arrival Time to flip-flop FF[j].

**D:** the distance between the via stacks.

**V[i]:** the i-th via stack considered.

**X(FF[i]), X(V[i]):** Cartesian X location of FF[i] or V[i].

**Y(FF[i]), Y(V[i]):** Cartesian Y location of FF[i] or V[i].

**MRR(i):** the region that contains all cells cell[j] whose MRP includes V[i].
(in figure 7 the highlighted region is the MRR of the central via, which includes all yellow colored cells).

**TR(cell[j]):** toggle rate of cell[j].

## 5.1 Problem formulation

For each via V[i], and all flip-flops FF[j] that satisfy the conditions below, schedule the clock arrival time CAT(FF[j]) such that the peak current demand through V[i] is minimized.

$$X(V[i]) - \frac{D}{2} < X(FF[j]) < X(V[i]) + \frac{D}{2} \tag{1}$$

$$Y(FF[j]) = Y(V[i]) \tag{2}$$

$$V[i] \in MRP(FF[j]) \tag{3}$$

In other words, considering each via stack as a source or sink for the current, schedule the arrival time of the triggering edge of the clock to the FFs for which the specified via stack is a part of their MRP, such that the timing windows of different FFs have a minimum overlap. Note that when FFs share the same MRP, simultaneous switching of these FFs has an additive effect on the demanded current and injected charge in the shared MRP. In addition, FFs have different strengths and various output loads. Therefore, when scheduling the arrival time of these cells, their size and output load should also be considered.

In order to account for cell strength and output loads we build a simple yet effective model: the current that each cell draws during switching is directly related to its $C_L/t_p$. Here $C_L$ is the capacitive load of the cell, which can be obtained by adding the internal capacitance, wire capacitance and fan-out gate capacitances, and $t_p$ is the propagation delay through the cell obtained from timing analysis.

In addition, the occurrence of a hot spot depends on the toggle rate of the cells in the MRR. Let us define the TR of a cell as its probability of switching (including both $0 \rightarrow 1$ and $1 \rightarrow 0$). With this definition, the TR of a cell depends on the probability of input switching. Note that propagation through logic gates significantly modifies the signal statistics [30] and suppresses the toggle rates.
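As a rough illustration of how logic suppresses switching activity, the sketch below evaluates the static-gate expressions that table 1 lists, exactly as the table gives them. The function names, the 0.5 input probabilities, and the load and delay values are hypothetical choices made for illustration; `current_estimate` simply weights the $C_L/t_p$ model above by a toggle rate and is not taken from the thesis implementation.

```python
def toggle_rate(gate: str, p_a: float, p_b: float = 0.0) -> float:
    """Output 0->1 transition probability, transcribed from table 1."""
    if gate in ("BUF", "INV"):
        return p_a
    if gate == "AND2":
        p_one = p_a * p_b
    elif gate == "OR2":
        p_one = 1.0 - (1.0 - p_a) * (1.0 - p_b)
    elif gate == "XOR2":
        p_one = p_a + p_b - 2.0 * p_a * p_b
    else:
        raise ValueError(f"unsupported gate: {gate}")
    # Every two-input entry in table 1 has the form (1 - p_one) * p_one.
    return (1.0 - p_one) * p_one


def current_estimate(tr: float, c_load: float, t_p: float) -> float:
    """Toggle-weighted C_L / t_p estimate (units are arbitrary here)."""
    return tr * c_load / t_p


# Two cascaded AND2 gates: the toggle rate drops at each stage.
first = toggle_rate("AND2", 0.5, 0.5)       # 0.1875
second = toggle_rate("AND2", first, first)  # smaller than `first`

# Hypothetical load (fF) and delay (ps), for illustration only.
demand = current_estimate(first, c_load=8.0, t_p=40.0)
```

With equal inputs of 0.5, the second AND2 stage already toggles far less often than the first, which is the suppression effect the text describes.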
For example, table 1 provides the output transition probability for a few static logic gates. The probability of switching $(0 \rightarrow 1)$ at the inputs of two-input gates is taken to be $P_A$ and $P_B$, and $P_A$ for one-input gates. In order to come up with the TR for each cell, the probability of switching at the output of a FF is considered to be 1, and it is propagated downstream in the timing path.

Table 1: Toggle rate for some of the gates

| Gate    | $\alpha_{0 \to 1}$                                   |
|---------|------------------------------------------------------|
| BUF/INV | $P_A$                                                |
| AND2    | $(1 - P_A P_B)(P_A P_B)$                             |
| OR2     | $(1-P_A)(1-P_B)[1-(1-P_A)(1-P_B)]$                   |
| XOR2    | $[1 - (P_A + P_B - 2P_A P_B)](P_A + P_B - 2P_A P_B)$ |

Usually, in a combinational path, moving away from the launch register quickly depresses the probability of switching, which is why the peak current occurs early in each cycle. Putting all these together, the Expected Current Demand (ECD) of a cell is defined as $TR(cell[j]) \cdot C_L/t_p$.

Figure 10: TW for cells in MRR and their current demand

In order to schedule the CAT for each FF (clock skew), the ECD for all cells in the MRR is calculated. This process is illustrated in figure 11. The process is as follows: each cell is associated with a timing window in which its ECD is valid. In order to roughly obtain the current signature over the entire clock period, the ECDs are integrated over the entire cycle. The Integrated ECD (IECD) curve is divided into many ECD Slices (ECDS), where the boundary of each slice is defined by the min and max arrival of the timing windows of individual cells. Our proposed clock scheduling technique works by shifting the timing windows of the FFs in the MRR such that the IECD curve with the smallest max-valued ECD slice is obtained. The following algorithm is used for scheduling the CAT of individual FFs in the MRR:

```
(1) Perform IR analysis
(2) If IR hotspot is a concern {
      Set tslice = "small time increment, e.g. 5ps"
      Add a user defined attribute for all FFs named
      For each power via(i) in the hot spot, and its associated MRR[i] {
        Annotate each Cell[j] in MRR[i] with its TR
        Annotate each Cell[j] in MRR[i] with its TW
        Annotate each Cell with its ECD
        Obtain the IECD curve.
        For each flip-flop FF in MRR[i] in decreasing order of ECD {
          Set tmin = early arrival time of clock to clock-pin of FF
          Set tmax = late arrival time of clock to clock-pin of FF
          Set tws = tmax-tmin
          Set mst = the min slack in all paths ending at FF
          Set msf = the min slack in all paths starting from FF
          For tt=tmin-mst; until tt< tmin+msf; tt+=tslice {
            Set SAT = tt; // SAT: Scheduled Arrival Time
          }
          if status = "updated" {
            Annotate the FF with the SAT
            // (to be considered in the next iteration, when looking at slack from or to this register)
          }
        }
      }
    }
(3) Run Incremental CTS to implement scheduled arrival times
```

Figure 11: Algorithm to schedule CAT

The above algorithm runs very quickly and achieves considerably good results. However, there is a small issue with this algorithm: rescheduling the CAT of a FF will shift the TW of all cells in that timing path, which may result in an IR hot spot somewhere else. Although this is a valid concern, in practice it is not a big issue because the toggle rates drop very quickly within the first few cells of each timing path; therefore, although timing overlap may happen, considering the reduced probability of switching, the actual occurrence of simultaneous switching is far less. Furthermore, the algorithm can be executed multiple times to remedy the occurrence of new IR hot spots.

Figure 12: Skew transfer without violations

In addition to minimizing the ECDs, the algorithm ensures that changing the skew creates no timing failure. This is done by considering the available slack in all timing paths to and from the FF under investigation. This is illustrated in figure 12.
The minimum available timing slack from start-points S1, S2 and S3 to FF is that of S2 (1ns), and the smallest slack with FF as a start point is that of FF→E1 with 0.5ns. Therefore, without causing a timing violation, the arrival time of the clock to FF could be skewed early by up to 1ns, or late by up to 0.5ns. Note that this work could be easily extended to consider multiple stages, to increase the available slack. For example, if E1 could be pushed out by 0.5ns, which requires available slack on all timing paths starting from E1, the FF could be pushed late by 1ns instead of 0.5ns. The detailed algorithm is shown in figure 13 and described below.

We first cluster the registers in the design based on their physical location on the floorplan, considering the nearest via location, as they would be sharing a portion of the MRP as described in the previous chapter. In our work, we have grouped cells that are near the via intersection of the VDD power straps of the M1 and M7 layers. The basic idea is to change the clock arrival time of the cells in order to change the cell switching time. We can do that in two ways: we can make the cell switch either early or late without affecting the timing. By doing this we reduce the number of cells that switch together, thereby minimizing the switching current at that location. So, we annotate some cells in the cluster to switch early and the rest of them to switch late. This annotation of the cells to switch early or late is done alternately from left to right and bottom to top of the core area as shown in figure 14, so that they are equally distributed over the core area.

Figure 13: Algorithm to calculate CAT

Figure 14: Marking Cells Early/Late (Yellow = early, Red = late)

The clock arrival time (CAT) can be adjusted only by considering the timing to and from the cell. So, the early/late assignment depends on the available slack to and from the cell.
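The slack bound from the figure 12 discussion can be written as a small helper. This is a minimal sketch, not the thesis implementation: the function name `skew_window` is hypothetical, and the slack values for S1 and S3 are assumed for illustration because the text only gives the values for S2 (1 ns) and FF→E1 (0.5 ns).

```python
from typing import Dict, Tuple


def skew_window(slack_in_ns: Dict[str, float],
                slack_out_ns: Dict[str, float]) -> Tuple[float, float]:
    """Return (max_early, max_late) clock shift in ns for one flip-flop.

    slack_in_ns:  slack of every path ending at the flip-flop
    slack_out_ns: slack of every path starting from the flip-flop
    """
    # Pushing the clock early consumes slack on incoming paths.
    max_early = max(min(slack_in_ns.values()), 0.0)
    # Pushing the clock late consumes slack on outgoing paths.
    max_late = max(min(slack_out_ns.values()), 0.0)
    return max_early, max_late


# Figure 12 example; the S1 and S3 slacks are assumed values.
slack_in = {"S1": 2.0, "S2": 1.0, "S3": 1.5}
slack_out = {"E1": 0.5}
early, late = skew_window(slack_in, slack_out)  # (1.0, 0.5)
```

Clamping at zero means a flip-flop whose paths are already failing gets no shift in that direction, which is one conservative way to guarantee the rescheduling introduces no new violation.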
Figures 15a and 15b give a detailed explanation of the CAT calculation for a register marked as early. A given register 'X' has 3 startpoints and 3 endpoints. The slacks from the startpoints are 4ns, 2ns and 3ns, whereas the slacks to the endpoints are 3ns, 2ns and 4ns. Since this register is marked early, we consider shifting the clock arrival time to the left. The minimum startpoint slack is 2ns, so we can shift the clock arrival time of register 'X' by up to 1ns as shown in figure 15b. X is now moved by 1ns, all its startpoint slacks are decremented by 1ns, and all its endpoint slacks are incremented by 1ns, as shown in the figure. We can similarly derive the CAT for cells marked as late.

Figure 15a and 15b: Start points and endpoints from Register X and their useful skew calculations

Thus, we pick a register 'X' and find all the start points, the endpoints and the available slack of the timing paths through this register. If the cell is annotated as early, we consider the minimum slack available from all the start points to register X. Similarly, when the cell is marked as late, we consider the minimum slack available from register X to all its endpoints. Given all the data, we use the minimum available slack as the bound on the time by which the cell is made to switch early or late, as shown in figure 15b.

Now that we have calculated the new clock arrival time (CAT) for register X, we need a method to record the new slacks, resulting from the new CAT, on the paths from its start points and to its endpoints. This is required so that the new slack values are considered when calculating the available slacks to and from these cells during future iterations of CAT calculation for other cells. Similarly, we annotate all the registers that have been touched, and once the algorithm has traversed all registers it terminates. All the new constraints (new CATs) are applied after the cell placement stage of the design, and then Clock Tree Synthesis is performed.
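The slack bookkeeping described above can be sketched as follows, using the register X numbers from figure 15. This is an illustrative sketch under assumptions: the `(start, end) -> slack` dictionary and the function name `shift_clock` are hypothetical and do not come from the TCL implementation.

```python
from typing import Dict, Tuple

Path = Tuple[str, str]  # (start point, end point)


def shift_clock(paths: Dict[Path, float], reg: str,
                delta_ns: float) -> Dict[Path, float]:
    """Shift the clock of `reg` by delta_ns (negative = early).

    Returns a new slack table so that later registers are scheduled
    against the updated slacks. Raises if any slack would go negative.
    """
    updated = {}
    for (start, end), slack in paths.items():
        if end == reg:
            slack += delta_ns  # early capture removes slack
        if start == reg:
            slack -= delta_ns  # early launch adds slack
        if slack < 0:
            raise ValueError(f"path {start}->{end} would violate timing")
        updated[(start, end)] = slack
    return updated


# Register X from figure 15: three incoming and three outgoing paths (ns).
paths = {
    ("S1", "X"): 4.0, ("S2", "X"): 2.0, ("S3", "X"): 3.0,
    ("X", "E1"): 3.0, ("X", "E2"): 2.0, ("X", "E3"): 4.0,
}
after = shift_clock(paths, "X", -1.0)  # move X early by 1 ns
```

After the call, the incoming slacks read 3, 1 and 2 ns and the outgoing slacks read 4, 3 and 5 ns, matching the decrement and increment described for figure 15b. Returning a new table rather than mutating the input keeps the pre-shift slacks available if a schedule has to be rolled back.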
This method helps to reduce the IR drop locally.

### **6 EXPERIMENTAL RESULTS**

We ran our algorithm on IWLS [31] benchmarks. We used Design Compiler TOPO [32] to synthesize our design and Synopsys ICC [22] to floorplan the design, place the cells, perform Clock Tree Synthesis, and route the design. The algorithm is written in TCL and runs in ICC's GUI. The IR drop is calculated using ANSYS Apache RedHawk [29]. The metrics used to test the effectiveness of the proposed algorithm include the reduction in peak current, max IR drop, mean IR drop, IR drop variance and cycle to cycle voltage variation before and after the execution of the proposed IR mitigation algorithm.

## 6.1 Impact on timing

Our proposed IR hot-spot mitigation technique reduces the intensity of IR hot spots and helps timing as follows: (1) by removing the local hot spots, the mean voltage seen by cells is higher, and therefore they are faster. (2) by distributing the arrival times of the local clock, the high frequency voltage noise due to local simultaneous switching is reduced, and therefore less uncertainty margin is required during timing closure. (3) the cumulative impact of distributing local IR drop reduces the peak current, and therefore the RLC oscillation, further reducing the cycle to cycle voltage noise and the uncertainty margin required during timing closure.

## 6.2 Peak current reduction

Although the primary objective of the proposed algorithm is to reduce the intensity of hot spots, it is still very effective in reducing the peak current, as it uses the concept of widening the distribution of clock arrival times to reduce the local current densities. Figure 16 illustrates the impact of the proposed algorithm on the peak current demand of a DES Engine. The reduction in the peak current of the designs on which we have tested this algorithm is summarized in table 2. The proposed technique achieves between 20% and 30% reduction in peak current.
Figure 16: Peak Current Reduction using the algorithm

Figure 17: Average current Improvement

Figure 18: Worst DVD of the designs

## 6.3 Differential Voltage Difference (DVD) Reduction

This algorithm is quite effective at removing IR hot-spots. Figure 19 illustrates the IR map of the DES Engine before and after application of the proposed algorithm. Table 2 captures the impact of IR reduction on the worst cell, the worst 10 cells, the worst 1000 cells, and all cells. As illustrated, the proposed technique removes the hot spots and the outliers. Up to 40% of voltage variation is minimized by our technique.

Table 2: Comparison of all the benchmark circuits

| Cell count | Register count | Worst DVD before (mV) | Worst DVD after (mV) | Worst DVD % reduction | Top 10 cells % reduction | Top 1K cells % reduction | All cells % reduction | Iavg % reduction | Power % reduction | Cell area % reduction |
|------------|----------------|-----------------------|----------------------|-----------------------|--------------------------|--------------------------|-----------------------|------------------|-------------------|-----------------------|
| 45787      | 8808           | 116                   | 75                   | 35%                   | 33%                      | 18%                      | 11%                   | 33%              | 5%                | 2%                    |
| 33437      | 10545          | 89                    | 45                   | 49%                   | 47%                      | 22%                      | 12%                   | 28%              | 4%                | 2%                    |
| 105116     | 1595           | 82                    | 59                   | 28%                   | 26%                      | 15%                      | 8%                    | 19%              | 3%                | 1%                    |
| 12384      | 1485           | 81                    | 51                   | 37%                   | 31%                      | 24%                      | 9%                    | 40%              | 6%                | 2%                    |
| 12442      | 1314           | 79                    | 52                   | 34%                   | 33%                      | 19%                      | 8%                    | 41%              | 5%                | 3%                    |

Figure 19 a and b: IR Drop before and after running our algorithm and range of voltage drop

Figure 20: Instantaneous VDD vs time

Figure 21: Peak Current v/s Time

## 6.4 Instance Peak Current Reduction

Considering that the proposed approach improves the mean IR drop, all cells see a higher voltage. The mean IR improvement in our case was ~10%. We reflected this improvement in IR drop and retimed the solution. Then both the original and the improved design went through a cell downsizing ECO. Because all cells in the proposed solution see a higher mean voltage, a larger number of cells were downsized.
Downsized cells consume a smaller peak current. Figure 21 compares the original and proposed designs in terms of instance peak current. As illustrated, many of the outlier cells, especially those related to clock and sequential cells, are removed.

Figure 22: Original and Proposed DVD

Figure 23: Proposed Peak Current

#### 7 FUTURE WORK AND DISCUSSION

The algorithm could be taken to the next level to consider all logic cells and not only flip-flops. In this case the algorithm would be more involved, because each cell could have multiple start points and could lead to multiple endpoints. The timing window of the cell could then be engineered in two different ways: (1) by shrinking the timing window, or (2) by shifting the timing window.

In order to shrink the timing window, each timing path to the cell should be considered. The min arrival time in a timing window comes from the start point that gets to that cell the fastest, and the max comes from the start point that gets to that cell the latest. In order to reduce the timing window, these two start points should be rescheduled: the start point that contributes the min arrival time is skewed late, and the register that contributes the late arrival time is skewed early. This process could be repeated iteratively until the timing window has shrunk considerably, reducing the chances of overlap.

Shifting the timing window is even more difficult. In this case all start points to that cell should be identified. The min slack from each start point to all its associated endpoints should be obtained. The min of these min slacks is the amount of time by which all start points could be shifted late, causing the shift in the timing window of the logic cell. Shifting early follows the same trend. All start points are identified. The min slack to each start point (now considered as an end point) is calculated, and the min of these min slacks is the amount by which the cell could be shifted early.
As seen, shifting the timing window of a cell in the middle of a timing path requires multiple rounds of CTS scheduling. Although possible, this would be a heavy burden for CTS tools. In addition, our implementation showed that by scheduling the flip-flops alone the hot spots are mitigated considerably, and in our simulations we observed only slight improvement from a full-blown rescheduling that shifts the timing windows of mid-path logic cells. Having said this, there may be scenarios where such a tradeoff is legitimate and, as explained, the flow of diagram 10 could be easily extended to do logic cell timing window shifting.

#### **REFERENCES**

- [1] P. V. A. B. a. G. D. M. L. Benini, "Clock Skew Optimization for Peak Current Reduction," *VLSI Signal Process. Syst*, 1997.
- [2] W.-C. D. Lam, C.-K. Koh and C.-W. A. Tsao, "Power Supply Noise Suppression via Clock Skew Scheduling," in *ISQED*, 2002.
- [3] H. H. Su, K. Gala and S. Sapatnekar, "Fast analysis and optimization of power/ground network," in *Conf. on Computer Aided Design*.
- [4] Z. L. a. Y. J. K. Shi, "A power network synthesis method for industrial power gating designs," in *ISQED*, 2007.
- [5] J. S. B. L. a. Y. C. Q. Zhou, "Floorplanning considering IR drop in multiple supply voltage island designs," *IEEE Trans. Very Large Scale Integration Systems*, 2011.
- [6] J. Fishburn, "Clock Skew Optimization," *IEEE Trans. on Computers*, 1990.
- [7] H. H., B. a. M. M.-S. A. Vittal, "Clock Skew Optimization for Ground Bounce Control," in *ICCAD*, 1996.
- [8] S.-H. H. a. S.-Y. H. Y.-T. Nieh, "Minimizing peak current via opposite-phase clock tree," in *DAC*, 2005.
- [9] G. Q. L. Y. a. Q. Z. J. Gu, "Peak Current Reduction by Simultaneous State Replication and Re-Encoding," in *ICCAD*, 2010.
- [10] V. Arunkumar Vijayakumar and S. Kundu, "An Efficient Method for Clock Skew Scheduling to Reduce Peak Current," in *VLSI Design and 2016 15th International Conference on Embedded Systems (VLSID)*, 2016.
- [11] S. H. Huang, C. M. Chang and Y. Nieh, "Fast Multiplier Domain Clock Skew Scheduling for Peak Current Reduction," in *Asia and South Pacific Design Automation Conference*.
- [12] R. Sankaranarayan and A. Mukherjee, "Retiming and Clock Scheduling to Minimize Simultaneous Switching," in *IEEE SOC Conference*, 2004.
- [13] L. B. A. B. P. Vuillod and G. De Micheli, "Clock-skew Optimization for Peak Current Reduction," in *International Symposium on Low Power Electronics and Design*.
- [14] G. S. S. C. Nithin S K, "Dynamic Voltage (IR) Drop Analysis and Design Closure: Issues and Challenges," in *ISQED*, 2010.
- [15] V. S. Abhishek Nigam, "An efficient approach to evaluate Dynamic and Static voltage-drop on a multi-million transistor SoC design".
- [16] M. W. D. B. D. S. a. T. M. R. G. Dreslinski, "Near-Threshold Computing: Reclaiming Moore's Law Through Energy Efficient Integrated Circuits," in *Proceedings of the IEEE*, 2010.
- [17] H. Kaul and Mark Anders, "Near-Threshold Voltage (NTV) Design: Opportunities and Challenges," in *DAC*, 2010.
- [18] P. M. M. L. M.-N. D. Z. P. S. Roy, "Clock Tree Resynthesis for Multi-Corner Multi-Mode Timing Closure," in *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, 2015.
- [19] S. R. K. H. M. N. Parthiban, "Clock Skew Optimization in Pre and Post CTS," in *Advances in Computing and Communications (ICACC)*, 2012.
- [20] S.-H. H. W.-K. C. Y.-C. C. T.-J. Wang, "Top-level activity-driven clock tree synthesis with clock skew variation considered," in *ISCAS Circuits and Systems*, 2016.
- [21] Azure Rubix.
- [22] Synopsys ICC.
- [23] Cadence EDI.
- [24] S. J. a. C.-K. K. R. Ewetz, "Fast clock skew scheduling based on sparse-graph algorithms," in *The 20th Asia and South Pacific Design Automation Conference*, 2015.
- [25] Y. C. W. C. Y. L. Q. Z. a. J. H. W. Shen, "Useful clock skew optimization under a multi-corner multi-mode design framework," in *ISQED*, 2010.
- [26] T.-Y. W. L.-Y. L. a. K.-Y. C. J.-K. Wu, "IR Drop Reduction via a Flip-Flop Resynthesis Technique," in *9th International Symposium on Quality Electronic Design*, 2008.
- [27] A. R. Y. Sudarsanam, "Clock skew automation for power and area reduction in deep sub micron designs," in *Circuits and Systems Workshop (DCAS)*, 2010.
- [28] J. X. a. W.-M. Dai, "Useful-skew clock routing with gate sizing for low power design," in *Design Automation Conference Proceedings*, 1996.
- [29] Ansys Apache RedHawk.
- [30] J. Rabaey, Digital Integrated Circuits, Prentice Hall, p. 259.
- [31] C. R. Laboratories, IWLS 2005 Benchmarks, Berkeley.
- [32] Synopsys Design Compiler.

## **BIOGRAPHY**

Lakshmi Saraswathi Bhamidipati is a candidate for Master of Science in Electrical Engineering, with a specialization in VLSI Design and Microelectronics, from George Mason University. She has been a member of Dr. Avesta Sasan's research lab since December 2015. She received her Bachelor of Engineering in Electronics and Instrumentation from Anna University, Chennai, India in 2013.