
  1. (Department of Computer Science, Sogang University, Seoul, Korea)



VLSI, clock tree synthesis, aging, bias temperature instability, IR-drop

I. Introduction

As CMOS technology rapidly advances to the nanometer scale, clock tree synthesis (CTS) has become one of the crucial steps in high-performance circuit design. A clock distribution network (CDN) is constructed to deliver the synchronous signal to every sequential element within skew constraints by applying buffer insertion, sizing, routing, and wire snaking. Clock skew optimization in CTS is a challenging problem in recent nanoscale technology due to the increasing demand for high-frequency design and circuit reliability.

Reliability issues such as process, voltage, and temperature (PVT) variation, aging, and noise have become dominant factors to be considered. Although the initial clock tree meets the skew constraint, aging effects, especially asymmetric ones, may cause unexpected delay and skew.

Moreover, aging-induced performance degradation depends heavily on the operating voltage, temperature, and ON time of a transistor. In general, the CDN accounts for about 40% of the entire chip power budget [1]. Clock gating has been one of the essential techniques for reducing excessive power consumption by shutting down part of the clock tree. To reduce skew violations, buffers are inserted from source to sinks, and they alternately undergo stress and recovery. A domain that is ON for only 5% of the total lifetime may incur up to 40% of the degradation of an always-on domain [2]. If buffers in a gated clock tree do not experience uniform stress and recovery times, asymmetric aging may arise in the clock tree [3] and lead to significant circuit malfunction.

Bias temperature instability (BTI) is known as the major source of aging, and it has a strongly negative impact on the performance and reliability of devices. There are two types of BTI: negative BTI (NBTI) and positive BTI (PBTI), which occur in PMOS and NMOS transistors respectively. Above the 90 nm process with SiO2 gate oxide devices, NBTI mainly affects PMOS transistors, and NBTI-induced degradation is not negligible [4]. With high-k metal gates, the PBTI effect becomes more severe. Thus, both NBTI and PBTI effects must be considered simultaneously for accurate aging analysis. As technology scales down below 5 nm, BTI degradation has a greater impact on device lifetime than hot carrier injection (HCI) and time-dependent dielectric breakdown (TDDB). The BTI effect may result in threshold voltage increase, drive current reduction, timing violations, and functional failures [5].

Even if all buffers in a clock tree are connected to the same power distribution network (PDN), the supply voltage levels on the buffers can differ due to IR drop [6]. Non-uniform supply voltages on buffers can therefore incur a large clock skew and degrade timing yield [7]. The clock skew must satisfy the given constraints for correct operation of the circuit. Hence, handling IR drop in CTS can provide better performance and optimization convergence. In addition, BTI-induced Vth degradation is estimated under different stress conditions, namely static and dynamic [8]. Static BTI occurs when a transistor is subjected to a constant bias voltage and high temperature over time. Dynamic BTI, on the other hand, occurs when the bias voltage on a transistor changes rapidly. IR drop can cause a variation in the gate voltage, which affects both the static and the dynamic BTI degradation rates by altering the pulse width and amplitude of the input signals. Although IR-drop estimation at the placement stage is not as accurate as at the post-layout stage, it is still desirable to account for IR drop early for faster time to market.

The main contribution of our work is an asymmetric-aging-aware CTS with LP-based supply voltage assignment. Although asymmetric aging can be induced by clock gating and supply voltage variation, the clock tree generated by our method remains robust in terms of circuit reliability. Thus, this work enables robust clock tree synthesis in real designs that necessarily utilize clock gating and power gating techniques. The work is summarized as follows: First, the bottom-up phase performs clock sink clustering [9] and parent/buffer region generation to construct a power-aware symmetrical clock tree topology. Second, the top-down phase performs BTI-aware optimal supply voltage alignment on buffers based on linear programming (LP) and wire routing without additional optimization stages. Third, our CTS meets the skew constraints before and after years of BTI degradation. In addition, we achieve on average 55% skew reduction after 10 years compared to existing CTS methods.

The rest of the paper is organized as follows: The related works for clock tree design are briefly reviewed in Section II. Section III illustrates the BTI effect, our motivation, and the problem formulation. Our symmetrical buffered CTS considering BTI is introduced in Section IV. Section V presents the experimental results compared with existing CTS methods. Finally, Section VI addresses the conclusions.

II. Related Works

Various works have proposed frameworks and methodologies that consider asymmetric aging. The work [2] presents a statistical framework for asymmetric aging of mixed-signal CMOS circuits. The framework focuses on the problems caused by power management techniques. [4] proposed a framework for beat-frequency detection using ring oscillators under asymmetric BTI. The work [10] predicted timing violations by identifying the critical moments in circuit operation under asymmetric NBTI aging. [8] presents an aging-aware STA process that develops device-level, variation-aware timing models of CMOS inverters to represent threshold-crossing points (TCPs). [11] analyzes the impact of asymmetrical aging due to BTI in the clock tree segments of power-efficient designs.

There has been much research in the area of CTS. The patent [12] handled the NBTI effect on clock skew. It calculates the clock skew and aging effect as a guard-band in clock tree generation. The authors of [13] added extra circuitry to simplify the clock control logic for skew reduction. The authors of [5] proposed the Gating with Both Logic Value (GBLV) scheme to equalize the stress time of clock subtrees. This approach chooses whether the clock signal being gated should be frozen at logic 0 or logic 1.

Several studies require no additional circuitry. In [3], a critical-PMOS-aware clock tree design method is introduced that manipulates the stress probabilities of the critical PMOS transistors. NBTI-induced skew reduction is also achieved by a gate selection method. The work [14] introduced a framework to address aging-induced performance degradation and skew violation in the clock tree. The work [15] presented an adaptive clock buffer design and an adaptive CDN to mitigate power-supply variations on clock distribution buffers. [16] proposed clock gating with ICG cell selection to reduce N/PBTI-induced clock skew according to latency and clock cycle. Symmetrical buffered clock tree synthesis was proposed in [7]. This approach reduces clock skew by aligning identical supply voltages. A CTS flow for skew reduction considering NBTI and process variation was proposed in [17].

III. Preliminary

1. Bias Temperature Instability

Major parameters of BTI induced aging include voltage, temperature, and time. BTI increases threshold voltages of logic gates and causes performance degradation.

During the repeated stress-recovery phases described in Fig. 1, BTI exhibits a stress phase, in which the threshold voltage increases, and a recovery phase, in which part of the increased threshold voltage decreases again. The recovery is never complete. BTI arises at the gate oxide and silicon substrate through hole trapping in transistors, which is explained by the atomistic trap-based BTI (ATB) model [4].

Fig. 1. The shift of threshold voltage (∆Vth) during stress and recovery phase in transistors.

../../Resources/ieie/JSTS.2024.24.5.424/fig1.png

When the gate-source voltage (V$_{gs}$) of a PMOS transistor is smaller than 0 (stress phase), the electric field in the gate oxide generates interface traps between the gate oxide and the substrate. Thus, the threshold voltage of the PMOS transistor increases over time due to interface trap generation. When the gate is positively biased (V$_{gs}$ > 0), an analogous mechanism occurs in NMOS transistors, and the aging degradation is caused by PBTI. Static BTI analysis leads to pessimistic results because it does not consider the actual behavior of transistors. The recovery phase is very significant for the correct estimation of BTI. Accurate BTI analysis has been performed by considering the recovery phase through the duty cycle. The duty cycle is defined as the ratio between ON and OFF time. A long-term prediction model [18] that approximates the threshold voltage shift caused by BTI is described in Eq. (1).

(1)
$ \Delta V_{th,BTI}\left(t\right)=A\cdot t^{n}\cdot V_{gs}^{m} $

where A is a pre-factor, t is the stress time, n is the power-law time exponent, and m is the power-law voltage acceleration exponent. Throughout this paper, the long-term prediction model is used to compute the threshold voltage shift.
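As an illustration of how Eq. (1) could be evaluated, the following minimal sketch computes the threshold voltage shift for a given stress time. The parameter values (`A`, `n`, `m`) are hypothetical placeholders, not fitted values from this paper, and using the magnitude of V$_{gs}$ (so that a negative PMOS bias can be passed directly) is an assumption of the sketch.

```python
def delta_vth_bti(t, v_gs, A, n, m):
    """Long-term BTI model of Eq. (1): dVth = A * t^n * |Vgs|^m.

    t    : stress time (unit set by how A was fitted; seconds assumed here)
    v_gs : gate-source voltage; its magnitude is used (assumption)
    A, n, m : pre-factor, time exponent, voltage exponent (hypothetical values)
    """
    if t < 0:
        raise ValueError("stress time must be non-negative")
    return A * (t ** n) * (abs(v_gs) ** m)


# Hypothetical parameters, for illustration only.
A_HYP, N_HYP, M_HYP = 3e-3, 1.0 / 6.0, 3.0
TEN_YEARS_S = 10 * 365 * 24 * 3600

shift_10y = delta_vth_bti(TEN_YEARS_S, -1.0, A_HYP, N_HYP, M_HYP)
shift_20y = delta_vth_bti(2 * TEN_YEARS_S, -1.0, A_HYP, N_HYP, M_HYP)
```

Because of the power law in time, doubling the stress time scales the shift by 2^n rather than by 2, which is why the degradation flattens over the lifetime.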

2. Tilted Rectangular Region

A tilted rectangular region (TRR) represents the potential buffer embedding and parent node regions, as described in Fig. 2. TRR extension and intersection operations generate the potential buffer/parent regions while maintaining wirelength constraints. A TRR [7,19] is a 45-degree or 135-degree rectangular region that consists of a core and a radius. The core, called the merging segment, is the line segment inside the dashed rectangle of Fig. 2(a), and the radius is the Manhattan distance from the core to the boundary of the dashed rectangle. Fig. 2(a) extends sink A and sink B by the extended radius to generate the potential parent node region (dotted rectangle). The extended radius equals the longest wirelength among the clusters at a given level. The intersection is the overlapping area obtained when each sink is expanded by the radius. Then, the potential parent node region, which is this intersection, is extended to form the potential buffer regions, shown as the two dashed rectangles in Fig. 2(b). Fig. 2(c) represents the potential buffer and parent node embedding regions generated through the extension and intersection operations [7] for buffer insertion and wire routing. After the buffer regions are constructed as represented in Fig. 2(d), the voltage domain (candidates) for the supply voltage alignment can be scanned in the potential buffer region as shown in Fig. 2(b).
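The extension and intersection operations can be sketched compactly by rotating coordinates: with u = x + y and v = x - y, the Manhattan distance becomes max(|Δu|, |Δv|), so a TRR becomes an axis-aligned rectangle. The class and function names below are hypothetical and the representation is a common simplification, not necessarily the data structure used in this work.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TRR:
    """A TRR stored as an axis-aligned box in rotated (u, v) coordinates."""
    u_lo: float
    u_hi: float
    v_lo: float
    v_hi: float


def trr_from_point(x: float, y: float) -> TRR:
    """A sink is a degenerate TRR: a single point with radius 0."""
    u, v = x + y, x - y
    return TRR(u, u, v, v)


def extend(t: TRR, radius: float) -> TRR:
    """Grow the region by a Manhattan radius (a box expansion in u, v)."""
    return TRR(t.u_lo - radius, t.u_hi + radius,
               t.v_lo - radius, t.v_hi + radius)


def intersect(a: TRR, b: TRR) -> Optional[TRR]:
    """Overlap of two TRRs, or None if they do not meet."""
    u_lo, u_hi = max(a.u_lo, b.u_lo), min(a.u_hi, b.u_hi)
    v_lo, v_hi = max(a.v_lo, b.v_lo), min(a.v_hi, b.v_hi)
    if u_lo > u_hi or v_lo > v_hi:
        return None
    return TRR(u_lo, u_hi, v_lo, v_hi)


# Two sinks at Manhattan distance 6: a radius of 3 yields a merging segment.
sink_a = trr_from_point(0.0, 0.0)
sink_b = trr_from_point(4.0, 2.0)
parent_region = intersect(extend(sink_a, 3.0), extend(sink_b, 3.0))
buffer_region = extend(parent_region, 1.0)  # extended again for buffer embedding
```

Here `parent_region` degenerates to a segment (u fixed at 3), every point of which lies at Manhattan distance 3 from both sinks; a radius smaller than half the sink distance produces no intersection.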

Fig. 2. Tilted Rectangular Region (TRR): (a) TRR extension and intersection operation; (b) Potential buffer and parent node region; (c) Potential parent node region (merging point) generation; (d) Potential buffer region construction for obtaining the voltage candidates.

../../Resources/ieie/JSTS.2024.24.5.424/fig2.png

3. Asymmetric Aging by Clock Gating

Different transistors on a clock tree are degraded by aging asymmetrically. Transistors at various physical locations undergo asymmetric aging with the increasing use of power management techniques such as clock gating or power gating. Fig. 3(a) shows a symmetrical buffered CTS that does not consider clock gating. Due to the symmetric property, the clock signal arrives from the clock source at each buffer (A, B, C, D) at the same time. The clock skew becomes zero because all clock buffers have a uniform signal probability (SP). However, all modern clock tree networks include clock gating in order to reduce high power consumption. Thus, the clock tree structure can develop unexpected clock skew after years of aging due to the non-uniform SP caused by clock gating. Although the supply voltages on buffers, the buffer sizes, and the wire lengths at each level are identical, asymmetric aging caused by non-uniform SP can bring about a large clock skew, leading to performance degradation and failure. Hence, CTS that does not consider clock gating is neither realistic nor practical. Fig. 3(b) illustrates CTS that considers clock gating. The SPs of the buffers are non-uniform due to clock gating. If the optimal supply voltages (SVs) on buffers at each level are assigned with consideration for asymmetric aging, the clock tree of Fig. 3(b) can maintain a smaller clock skew than that of Fig. 3(a) after aging. Therefore, it is essential to handle the asymmetric aging effect (BTI) caused by clock gating.

Fig. 3. (a) Symmetrical buffered clock tree synthesis without consideration for clock gating (CG); (b) Our symmetrical buffered clock tree synthesis with consideration for clock gating.

../../Resources/ieie/JSTS.2024.24.5.424/fig3.png

4. Problem Formulation

Our objective is to construct a symmetrical buffered clock tree with supply voltage alignment. Linear programming (LP) with signal probability and supply voltage is applied to obtain optimal supply voltage values. The threshold voltage shift as a function of signal probability is formulated in linear form. Then, the delay associated with signal probability is computed by a simple transformation. Finally, the optimal supply voltage is calculated using LP with inputs such as skews, delays, and the signal probability.

The inputs to the problem are as follows. A set of clock sinks is {$s_{1}$, $s_{2}$, … , $s_{n}$}. Power information consists of the power network and the IR-drop map [20]. A set of libraries is given for buffers, wires, and clock gates such as AND, NAND, and NOR. The libraries of buffers and wires include fresh and aging information. The clock skew is reduced under a capacitance limit and without slew violations. The slew must be less than 100 ps.

IV. BTI-tolerant Clock Tree Synthesis using LP-based Supply Voltage Alignment

Our CTS approach considering BTI is based on deferred merge embedding to minimize wire length. The BTI-aware symmetrical buffered CTS consists of two steps, 1) a bottom-up phase and 2) a top-down phase, as described in Fig. 4. First, we generate a symmetrical abstract tree topology considering power consumption in the bottom-up phase. Second, buffer insertion and wire routing considering BTI are performed to complete the CTS in the top-down phase. The objective of our CTS approach is to construct a symmetrical buffered clock tree considering BTI that satisfies the clock skew constraints both before and after aging.

Fig. 4. The overall flow of our buffered clock tree synthesis (CTS) considering BTI.

../../Resources/ieie/JSTS.2024.24.5.424/fig4.png

1. Bottom-up Phase

The bottom-up phase constructs a symmetrical abstract tree topology with parent/buffer regions. Our proposed abstract tree topology generation algorithm is based on the nearest neighbor graph (NNG). The objective of our method is not only to minimize power consumption but also to generate the parent node and buffer regions for supply voltage alignment. Constructing an abstract tree topology that minimizes power consumption is important because it reduces the power overhead of the supply voltage alignment performed in the top-down phase and has a great impact on circuit performance. Our symmetrical abstract tree topology algorithm has three properties: 1) bottom-up traversal, 2) minimized power consumption, and 3) symmetrical tree construction.

1) Abstract tree topology generation

Our abstract tree topology generation algorithm recursively selects a pair of subtrees or sinks and then merges the selected pair. The selection of pairs is based on a cost function that includes the power consumption of the wire and the downstream capacitance of the sinks. The cost function for selecting the pairs is as follows:

(2)
$ Cost=c_{wire}\cdot \left(E_{a}+E_{b}\right)+\left(C_{a}+C_{b}\right) $

where c$_{wire}$ is the unit capacitance of the wire. The terms E$_{a}$ and E$_{b}$ represent the Manhattan distance from node a to its parent node and from node b to its parent node respectively. C$_{a}$ and C$_{b}$ are the downstream capacitances at nodes a and b respectively.

Assume that a set of clock sinks is given. First, the NNG of all subtrees and sinks is constructed based on cost function (2). At each sink, after the costs of all edges from that sink to the others are obtained, the edge with the lowest cost is selected as its directed edge. Thus, each sink has one directed edge. Next, the directed edges are sorted by cost in non-decreasing order and stored in a table. The edge with the minimum cost is selected from the table to minimize power consumption, and it is then removed from the table. The parent nodes (internal nodes) of removed edges are stored in a new table. This process is repeated level by level in a bottom-up fashion.
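One level of this pairing can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes the parent lies on the shortest path between the two children, so that E$_{a}$ + E$_{b}$ equals their Manhattan distance, and it rebuilds the nearest-neighbor edges for the remaining nodes after each pass. The function names and the sink data are hypothetical.

```python
def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def merge_cost(a, b, c_wire):
    """Cost function (2), assuming E_a + E_b = Manhattan distance between a and b."""
    return c_wire * manhattan(a["pos"], b["pos"]) + a["cap"] + b["cap"]


def pair_level(nodes, c_wire):
    """Greedily pair the nodes of one tree level.

    nodes: dict name -> {"pos": (x, y), "cap": downstream capacitance}
    Returns (pairs, leftovers), where pairs is a list of (a, b, cost).
    """
    unpaired = dict(nodes)
    pairs = []
    while len(unpaired) >= 2:
        # Build the nearest-neighbor graph: one directed edge per node.
        edges = []
        for name, node in unpaired.items():
            cost, best = min(
                (merge_cost(node, other, c_wire), other_name)
                for other_name, other in unpaired.items()
                if other_name != name
            )
            edges.append((cost, name, best))
        edges.sort()  # merging-cost table in non-decreasing order
        for cost, a, b in edges:
            if a in unpaired and b in unpaired:
                pairs.append((a, b, cost))
                del unpaired[a], unpaired[b]
    return pairs, list(unpaired)


sinks = {
    "S1": {"pos": (0, 0), "cap": 1.0},
    "S2": {"pos": (1, 0), "cap": 1.0},
    "S3": {"pos": (10, 0), "cap": 1.0},
    "S4": {"pos": (11, 1), "cap": 1.0},
    "S5": {"pos": (20, 20), "cap": 1.0},
}
pairs, leftovers = pair_level(sinks, c_wire=0.5)
```

With an odd number of nodes one node stays unpaired at this level; how such a node is carried to the next level is left open in this sketch.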

Fig. 5 describes the clock sink clustering algorithm used in this work. Fig. 5(a) shows the initial sink distribution given as a primary input before the NNG is constructed. At each sink, after the costs of all edges from that sink to the others are obtained, the edge with the lowest cost is selected as its directed edge, as shown in Fig. 5(b). After the NNG based on (2) is constructed, every sink has a directed edge, as represented in Fig. 5(c). Sink S2 has a directed edge to sink S1. This means that the cost for S2 is minimal when it is paired with S1. For sink S1, however, the cost is minimal when it is paired with S5. After the NNG for all sinks is finished, the costs of all edges are stored in the merging-cost table in non-decreasing order. Fig. 5(e) is the table that lists each edge and its cost in non-decreasing order. The costs of all edges are calculated by (2), and then the edge with the minimum cost, such as edge (S5, S6), is selected for minimum power consumption. The edge (S5, S6) is removed from the table and stored in a new table for its parent node (I$_{1}$). This process is repeated until the table is empty. Once the set of parent nodes is completed, the same process is repeated for the parent nodes (internal nodes) until the clock source is reached.

Fig. 5. Our clustering algorithm based on NNG: (a) Initial clock sink distribution; (b) The selection of a directed edge at each sink; (c) After the directed edges of all sinks are constructed; (d) The cost calculation for edges of all sinks; (e) The table for merging cost.

../../Resources/ieie/JSTS.2024.24.5.424/fig5.png

2) Buffer and parent node region generation

In the bottom-up stage, a buffer region is constructed for buffer insertion with supply voltage alignment, and a parent node region is generated for the merging point with symmetrical wirelength. As mentioned in Section III-2, a TRR represents the potential buffer embedding and parent node regions generated through the extension/intersection operations [7] for buffer insertion and wire routing. For the extension/intersection operations, the radius equals the longest wirelength among the pairs clustered with (2) at a given level. After the buffer regions are constructed by TRR, we can scan the voltage domain (candidates) for the supply voltage alignment as shown in Fig. 6(b). Since shorter connections must be lengthened through snaking to satisfy the identical-wirelength property, the total wirelength is determined by the longest wire within a given level [7]. As shown in Fig. 6(a), the bottom-up phase produces a symmetrical abstract tree topology that minimizes power consumption.

Fig. 6. (a) Our symmetrical abstract tree topology; (b) The voltage domain (candidates) for the buffer insertion.

../../Resources/ieie/JSTS.2024.24.5.424/fig6.png

3) Time Complexity

The pseudocode of our symmetrical abstract tree generation algorithm is shown in Algorithm 1. The time complexity of constructing the nearest neighbor graph for N sinks is O(N$^{2}$) and calculating the merging costs for all sinks takes O(N). The time complexity of sorting the table in non-decreasing order becomes O(N·logN). Finding a minimum cost in a table takes O(N). The time complexity of constructing the potential parent node and buffer region is O(N). Scanning the voltage domain in a buffer region from power map becomes O(N). The time complexity of inner loop is O(N) and the outer loop takes O(N). Therefore, the overall time complexity of the proposed symmetrical abstract tree generation algorithm becomes O(N$^{2}$).

2. Top-down Phase

After the symmetrical abstract tree topology with parent/buffer regions is generated in the bottom-up phase, the top-down phase decides the exact buffer locations and wire lengths to complete the BTI-aware symmetrical buffered clock tree. The optimal supply voltages on buffers are assigned considering BTI. Then, buffer insertion and wire routing are performed at each level recursively. The top-down phase of our CTS consists of four steps. First, the signal probabilities of the clock buffers are calculated at each tree level. In the second step, the threshold voltage shift (${\Delta}$Vth) after 10 years of aging is obtained according to SP, and the delay change (${\Delta}$D) caused by ${\Delta}$Vth is obtained. The third step determines the BTI-aware optimal supply voltages (SVs) of the buffers at each level based on LP. Then, the exact location of each buffer among the voltage candidates in the potential buffer region is determined, and the buffer is inserted on the corresponding grid. The final step conducts wire routing and adjustment through wire snaking to complete the symmetrical buffered clock tree.

1) Signal probability (SP) propagation

The first step of our top-down phase is to calculate the SPs of the buffers at each level for estimating the BTI effect. Signal probability (SP) is a common measure of the relative time spent in the stress and recovery phases.

Fig. 7 shows an example of SP propagation under clock gating in a buffered clock tree. The input SP of clock source B1 is 0.5. Assume that an AND gate is used for clock gating. The output SP of the AND gate is SP·GP. Other logic gates, e.g., NAND or NOR, can also be employed for clock gating. This work uses AND gates to implement clock gating. The gating probability (GP) is the probability that a gate is in the ON state. If the GP is 1, the gate is always ON and the SP passes through unchanged. If the GP is 0, the gate is never ON. Assume that the GPs of the buffers are as shown in Fig. 7. When the GP of clock buffer B3 is 0.7, its SP becomes 0.35. The SP of clock buffer B6 is the same as that of B3 since its GP is 1.0. The GP of B7, which is in the right subtree of clock buffer B3, is 0.9, so the SP of B7 becomes 0.315. The buffers in shaded circles have a GP of 1. Thus, the SPs of B2, B4, and B5 are equal to the SP of clock source B1. In this way, to estimate the BTI effect, we obtain the SPs of all buffers level by level from the clock source to the sinks.
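The propagation rule SP(child) = SP(parent)·GP(child) can be sketched as a top-down traversal. The tree connectivity below (B1 driving B2 and B3, B2 driving B4 and B5, B3 driving B6 and B7) is an assumption made to match the values quoted in the text; the function name is hypothetical.

```python
def propagate_sp(children, gp, root, root_sp=0.5):
    """Propagate signal probabilities from the clock source to the leaves.

    children: dict buffer -> list of child buffers
    gp:       dict buffer -> gating probability of the AND gate driving it
              (buffers missing from gp are treated as GP = 1.0)
    """
    sp = {root: root_sp}
    stack = [root]
    while stack:
        parent = stack.pop()
        for child in children.get(parent, []):
            sp[child] = sp[parent] * gp.get(child, 1.0)
            stack.append(child)
    return sp


# Assumed topology and the GPs quoted in the text.
children = {"B1": ["B2", "B3"], "B2": ["B4", "B5"], "B3": ["B6", "B7"]}
gp = {"B3": 0.7, "B6": 1.0, "B7": 0.9}
sp = propagate_sp(children, gp, root="B1")
```

Under these assumptions B2, B4, and B5 keep the source SP of 0.5, B3 and B6 obtain 0.35, and B7 obtains 0.315, which matches the example above.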

Fig. 7. The signal probability propagation under clock gating in a buffered clock tree.

../../Resources/ieie/JSTS.2024.24.5.424/fig7.png

2) The delay degradation by threshold voltage shift

In this step, the threshold voltage shift (${\Delta}$Vth) of a buffer after 10 years of aging is obtained from its SP. To observe the impact of SP on the buffer, we first obtain ${\Delta}$Vth of each buffer in a buffered clock tree according to its SP after 10 years of BTI degradation. Using ${\Delta}$Vth obtained through (1), we performed SPICE simulation to obtain the buffer delay (rise delay) corresponding to ${\Delta}$Vth. Since the BTI effect impacts both PMOS and NMOS devices, it affects both the rise and fall delays of the buffer. ${\Delta}$Vth increases sharply as SP goes from 0 to 0.1. When SP is larger than 0.1, however, the growth of ${\Delta}$Vth flattens. The relationship between signal probability (SP) and the threshold voltage shift (${\Delta}$Vth) is therefore modeled in piecewise-linear form as follows:

(3)
$ \Delta V_{th}=\begin{cases} a_{0}\cdot SP+a_{1}, & \text{for } SP>0.1\\ b_{0}\cdot SP+b_{1}, & \text{for } SP<0.1 \end{cases} $

where a$_{0}$ and b$_{0}$ are coefficients, and a$_{1}$ and b$_{1}$ are constants. The cases SP < 0.1 and SP > 0.1 are each linear. Because ${\Delta}$V$_{th}$ is affected by the input transition time (T$_{\mathrm{r}}$) and the output load capacitance (C$_{\mathrm{L}}$), these two parameters are fixed at specific values when determining the coefficients a$_{0}$ and b$_{0}$ and the constants a$_{1}$ and b$_{1}$. The relationship between ${\Delta}$V$_{th}$ and the buffer delay (D$_{buffer}$) can then be represented as follows:

(4)
$ D_{buffer}=c_{0}*\Delta V_{th}+c_{1} $

where c$_{0}$ is a coefficient and c$_{1}$ is a constant value. Using (3) and (4), the linear function (5) between SP and D$_{buffer}$ can be obtained as below:

(5)
$ D_{buffer}=\begin{cases} c_{0}\cdot \left(a_{0}\cdot SP+a_{1}\right)+c_{1}, & \text{for } SP>0.1\\ c_{0}\cdot \left(b_{0}\cdot SP+b_{1}\right)+c_{1}, & \text{for } SP<0.1 \end{cases} $

Therefore, by (5), the relationship between the SP of a buffer and its delay is a piecewise-linear function. We built a look-up table (LUT) based on the input slew and output capacitance of the clock buffers. In addition, an aging-aware LUT is constructed to obtain the buffer delay after 10 years of BTI degradation. For the CDN, we also built LUTs of clock gates such as NAND, NOR, and AND in order to handle the aging of the clock gates themselves and thus improve the accuracy of the aging analysis.
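The composition of (3) and (4) into (5) can be sketched as below. All coefficient values are hypothetical placeholders chosen only so that the two branches meet at SP = 0.1; real values would come from fitting SPICE data at a fixed input transition time and load capacitance. The source leaves SP = 0.1 itself undefined, and assigning it to the upper branch is an assumption of this sketch.

```python
# Hypothetical fitted coefficients (volts for dVth, picoseconds for delay).
A0, A1 = 0.02, 0.030   # branch for SP > 0.1 (shallow slope)
B0, B1 = 0.30, 0.002   # branch for SP < 0.1 (steep slope)
C0, C1 = 100.0, 50.0   # delay = C0 * dVth + C1


def delta_vth_from_sp(sp):
    """Eq. (3): piecewise-linear threshold voltage shift versus SP."""
    if not 0.0 <= sp <= 1.0:
        raise ValueError("signal probability must lie in [0, 1]")
    if sp >= 0.1:  # SP == 0.1 placed on the upper branch (assumption)
        return A0 * sp + A1
    return B0 * sp + B1


def buffer_delay_from_vth(d_vth):
    """Eq. (4): linear buffer delay versus threshold voltage shift."""
    return C0 * d_vth + C1


def buffer_delay_from_sp(sp):
    """Eq. (5): substitute (3) into (4)."""
    return buffer_delay_from_vth(delta_vth_from_sp(sp))


aged_delay_b3 = buffer_delay_from_sp(0.35)   # e.g. buffer B3 from Fig. 7
aged_delay_low = buffer_delay_from_sp(0.05)
```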

3) BTI-aware supply voltage alignment

To model IR drop, we used a power network represented by an equivalent circuit [20]. Horizontal and vertical power trunks are evenly distributed over the circuit. Each power grid consists of power straps for the VDD port connections of clock buffers and orthogonal lower metal rails in a mesh structure [21]. Cells and blocks are connected to the power straps, and their currents can be obtained for power analysis during the placement stage. Fig. 8 shows an example of the power network used in this work. The power network and IR-drop information can be obtained at each intersection of power straps with sufficient accuracy. Non-uniform SPs of the clock buffers at each tree level lead to asymmetric BTI degradation.

Fig. 8. The example of 3 x 3 power distribution network and the difference of supply voltage.

../../Resources/ieie/JSTS.2024.24.5.424/fig8.png

The optimal supply voltage alignment (SVA) is introduced to reduce the large clock skew caused by asymmetric BTI degradation after 10 years. To assign SVs to buffers at each level, the SVs are selected among the voltage candidates in the potential buffer region. If the exact embedding location is determined within the potential buffer region, the cost of buffer insertion is zero. Thus, obtaining the SV within the potential buffer region is cost effective. Since the VDD ports of buffers are connected to power straps, the corresponding voltages in the potential buffer region are obtained from the power analysis map. The maximum and minimum voltages can be found in each potential buffer region. The continuous voltages between the maximum and the minimum form an interval called the voltage candidates. Since the IR drop at different locations (nodes) of the power network may vary, it is difficult to choose an identical SV across potential buffer regions. To minimize the skew, a supply voltage that is as uniform as possible is first selected at each tree level. The uniform supply voltage is defined as the voltage that minimizes the total difference from the supply voltages available in each potential buffer region, as represented in (6). To obtain the uniform supply voltage, the following cost function is proposed in [7]:

(6)
$ \min f\left(v_{u}\right)=\sum_{i\in voltage\ candidates}m\left(v_{u},v_{i,\min },v_{i,\max }\right) $

Here, v$_{i,min}$ and v$_{i,max}$ are the minimum and maximum voltages in voltage candidate i respectively, and m is defined as follows:

(7)
$ m\left(v_{u},v_{i,\min },v_{i,\max }\right)=\begin{cases} |v_{u}-v_{i,\min }|, & \text{for } v_{u}<v_{i,\min }\\ 0, & \text{for } v_{i,\min }<v_{u}<v_{i,\max }\\ |v_{u}-v_{i,\max }|, & \text{for } v_{u}>v_{i,\max } \end{cases} $
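A minimal sketch of how (6) and (7) could be evaluated is given below. Since f is a sum of convex piecewise-linear terms whose breakpoints are the interval endpoints, a minimum is always attained at one of those endpoints, so it suffices to test them. The function names and the voltage intervals are hypothetical; the interval boundaries are treated as inside the interval, where the deviation is zero either way.

```python
def deviation(v_u, v_min, v_max):
    """Eq. (7): distance from v_u to the voltage interval [v_min, v_max]."""
    if v_u < v_min:
        return v_min - v_u
    if v_u > v_max:
        return v_u - v_max
    return 0.0


def total_deviation(v_u, intervals):
    """Eq. (6): f(v_u), summed over the voltage candidates of one tree level."""
    return sum(deviation(v_u, lo, hi) for lo, hi in intervals)


def uniform_voltage(intervals):
    """Return an endpoint that minimizes f; ties go to the lowest voltage."""
    endpoints = sorted({v for interval in intervals for v in interval})
    return min(endpoints, key=lambda v: total_deviation(v, intervals))


# Hypothetical (min, max) supply voltages of three potential buffer regions.
overlapping = [(0.95, 1.00), (0.97, 1.02), (0.99, 1.05)]
v_u = uniform_voltage(overlapping)

# Two regions with no common voltage: some deviation is unavoidable.
disjoint = [(0.90, 0.92), (0.98, 1.00)]
v_d = uniform_voltage(disjoint)
```

When all intervals overlap, the chosen v$_{u}$ lies in their common range and f is zero; when they do not, any voltage in the gap between them gives the same minimal f.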

By (6), the uniform SV v$_{u}$ of the buffers is obtained at each level of the clock tree network. As the supply voltage assigned to a buffer increases, the buffer delay decreases linearly. Hence, a buffer with a large SV has a smaller delay than one with a small SV, as shown in Fig. 9. However, the degradation rate of the buffer increases after years of BTI degradation (years of aging). The degradation rate of a buffer with a large SV is larger than that of a buffer with a small SV, as shown in Fig. 10. Using this characteristic, optimal SVs are assigned to the buffers with the largest/smallest SP and the uniform SV to the others in order to reduce the clock skew caused by asymmetric BTI degradation. As shown in Fig. 11(a), although the clock skew of an existing CTS method without BTI consideration is zero at time 0, the clock tree can develop an unexpectedly large clock skew after 10 years of BTI degradation. The objective of our CTS approach is to meet the clock skew constraints before and after 10 years of BTI degradation, as shown in Fig. 11(b). Therefore, the asymmetric aging effect (BTI) caused by clock gating must be considered for circuit reliability and performance. In the clock tree network, the arrival time (8) is represented as follows:

Fig. 9. The relationship between the delay of buffer and supply voltage.

../../Resources/ieie/JSTS.2024.24.5.424/fig9.png

Fig. 10. The degradation rate depending on the supply voltage of buffer.

../../Resources/ieie/JSTS.2024.24.5.424/fig10.png

Fig. 11. (a) Existing CTS method without BTI consideration before and after years of BTI degradation; (b) Our CTS method with BTI consideration before and after years of BTI degradation.

../../Resources/ieie/JSTS.2024.24.5.424/fig11.png

(8)
$ AT_{i}^{t}=\sum _{i=0}^{k}D_{buf,i}\left(SP,v\right)+D_{wire,i},\quad t\in \left\{fresh,\ aging\right\} $

AT$^{fresh}$$_{i}$ denotes the arrival time from the clock source (i=0) to the i-th tree level at time 0 (fresh), and AT$^{aging}$$_{i}$ denotes the arrival time from the clock source to the i-th tree level after 10 years of BTI degradation (aging). The term D$_{buf,i}$(SP,v) is the buffer delay as a function of the signal probability (SP) and the buffer voltage v. D$_{wire,i}$ is the wire delay. The clock skew at each level is then defined as follows:

(9)
$ skew^{t}_{i}=AT_{max,i}^{t}-AT_{min,i}^{t} $

In (9), the clock skew (skew$^{t}$$_{i}$) at i-th tree level can be represented as the difference between the largest (AT$^{t}$$_{max,i}$) and the smallest (AT$^{t}$$_{min,i}$) arrival time from clock source at time t. We formulate the linear function through (5) and (9) as follows:

(10)
$ \begin{array}{ll} \text{Minimize} & skew^{t}_{i}\\ \text{s.t.} & v_{low}\leq v_{u}\leq v_{high},\quad \forall v\in voltage\_interval,\\ & skew^{fresh}_{i}\leq skew_{cons},\\ & skew^{aging}_{i}\leq skew_{cons},\quad \forall i\in L,\\ & C<C_{cons} \end{array} $

where v$_{high}$ and v$_{low}$ denote the optimal voltages of the buffers with the largest and smallest SP respectively. skew$_{cons}$ denotes the clock skew constraint and C$_{cons}$ the capacitance limit. L is the set of clock tree levels. To reduce the clock skew after 10 years of BTI degradation, we find the optimal SVs (v$_{high}$, v$_{low}$) of buffers that minimize skew$_{i}$$^{aging}$ = (AT$^{aging}$$_{max,i}$ - AT$^{aging}$$_{min,i}$) among the voltage candidates. At each level, the buffer with the largest SP is assigned v$_{high}$, which is larger than the uniform voltage (v$_{u}$), and the buffer with the smallest SP is assigned v$_{low}$, which is smaller than v$_{u}$. Minimizing the objective function skew$^{t}$$_{i}$ not only reduces the clock skew after 10 years of aging but also satisfies the clock skew constraints at time 0. We solve (10) to find the BTI-aware optimal SVs (v$_{high}$, v$_{low}$) at each level recursively from the clock source to the sinks. Our method does not require multi-level supply voltages; instead, it finds the optimal voltages within the potential buffer region using the power analysis map. First, we obtain the voltage candidates in the potential buffer region using the power analysis map. Second, we determine the uniform SV at each level using (6) for the symmetrical buffered clock tree structure. Third, the longest and shortest paths from the clock source to the i-th level are extracted using the depth-first search (DFS) algorithm. Next, the optimal SVs of the buffers with the largest/smallest SP at the i-th level are obtained in the potential buffer region using (10). Finally, after the optimal SVs are found, the buffer location is determined, and the buffer is inserted on the grid with the optimal SV in the potential buffer region. In this way, the optimal SVs at each level are obtained recursively from the clock source until the clock sinks are reached.
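The selection of (v$_{high}$, v$_{low}$) at one tree level can be illustrated with the sketch below. It is not the LP of (10): for readability it replaces the LP solver with a brute-force search over a small set of discrete voltage candidates, considers only the buffer delays of a single level, and omits the capacitance constraint. The delay models `fresh_delay` and `aged_delay` and all numbers are hypothetical stand-ins for the fresh and aging-aware LUTs.

```python
from itertools import product


def fresh_delay(v):
    """Hypothetical fresh buffer delay (ps): decreases linearly with voltage."""
    return 50.0 - 40.0 * (v - 1.0)


def aged_delay(sp, v):
    """Hypothetical aged delay (ps): degradation grows with SP and with voltage."""
    return fresh_delay(v) + 10.0 * sp * (1.0 + 2.0 * (v - 1.0))


def skew(delays):
    return max(delays) - min(delays)


def assign_voltages(sps, candidates, v_u, skew_cons):
    """Pick (v_high, v_low) minimizing aged skew under both skew constraints.

    The buffer with the largest SP gets v_high >= v_u, the one with the
    smallest SP gets v_low <= v_u, and all others keep the uniform voltage.
    Returns (aged_skew, fresh_skew, v_high, v_low), or None if infeasible.
    """
    hi = max(range(len(sps)), key=sps.__getitem__)
    lo = min(range(len(sps)), key=sps.__getitem__)
    highs = [v for v in candidates if v >= v_u]
    lows = [v for v in candidates if v <= v_u]
    best = None
    for v_high, v_low in product(highs, lows):
        volts = [v_u] * len(sps)
        volts[hi], volts[lo] = v_high, v_low
        s_fresh = skew([fresh_delay(v) for v in volts])
        s_aged = skew([aged_delay(sp, v) for sp, v in zip(sps, volts)])
        if s_fresh <= skew_cons and s_aged <= skew_cons:
            if best is None or s_aged < best[0]:
                best = (s_aged, s_fresh, v_high, v_low)
    return best


sps = [0.5, 0.35, 0.315, 0.2]                  # SPs of the buffers at one level
candidates = [0.95, 0.975, 1.0, 1.025, 1.05]   # hypothetical voltage candidates
baseline = assign_voltages(sps, [1.0], v_u=1.0, skew_cons=3.5)
result = assign_voltages(sps, candidates, v_u=1.0, skew_cons=3.5)
```

In this toy setting the uniform assignment has zero fresh skew but the largest aged skew, while the selected pair trades a bounded fresh skew for a much smaller aged skew, which mirrors the behavior described for Fig. 11.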

Fig. 12 illustrates the BTI-aware optimal supply voltage alignment on buffers proposed in this work. For example, the clock source is S, its SP is 0.5, and the voltage candidates are {$v_{1}$, $v_{2}$, $v_{3}$, $v_{4}$, $v_{5}$}, sorted in ascending order. Assume that the SPs of B3, B4, B5, and B6 are as shown in Fig. 12. First, the SPs of all buffers are obtained by the top-down traversal described in Section IV-2. Second, the uniform SV (v$_{u}$) of the buffers at level 2, obtained using (6), is v$_{3}$. This uniform voltage is one of the voltage candidates. It is assigned to all buffers except those with the largest and smallest SP. Thus, all buffers at level 2 are assigned v$_{3}$ except B3 and B4. B3 is assigned a voltage (v$_{high}$) higher than the uniform voltage v$_{u}$, and B4 is assigned a voltage (v$_{low}$) lower than v$_{u}$. These optimal voltages (v$_{high}$, v$_{low}$) are selected from the voltage candidates by the linear function (10). The SVs of B3 and B4 become v$_{5}$ and v$_{1}$, respectively. After the location of buffer B3 in the potential buffer region is determined, B3 is inserted into the corresponding region (grid) with v$_{5}$. If no grid with v$_{5}$ exists in the potential buffer region, B3 is inserted on the grid whose voltage is as close to v$_{5}$ as possible. Although our CTS method may produce a small nonzero skew at time 0, it satisfies the clock skew constraints both before and after 10 years of BTI degradation.
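The assignment in this example can be written as a short sketch. The SP values, the numeric voltages standing in for $v_{1}$ to $v_{5}$, the grid coordinates, and the helper names are all hypothetical; the actual SPs are those shown in Fig. 12.

```python
def assign_level_voltages(sp_by_buffer, v_u, v_low, v_high):
    """Give every buffer v_u, except the largest-SP and smallest-SP buffers."""
    hi = max(sp_by_buffer, key=sp_by_buffer.get)  # buffer with the largest SP
    lo = min(sp_by_buffer, key=sp_by_buffer.get)  # buffer with the smallest SP
    voltages = {name: v_u for name in sp_by_buffer}
    voltages[hi] = v_high
    voltages[lo] = v_low
    return voltages


def nearest_grid(region_grids, target_v):
    """Return the grid whose supply voltage is closest to target_v.

    region_grids maps a grid coordinate to the supply voltage at that grid.
    """
    return min(region_grids, key=lambda g: abs(region_grids[g] - target_v))


# Assumed values for v1..v5 (V) and assumed SPs for the level-2 buffers.
v1, v2, v3, v4, v5 = 0.95, 0.975, 1.0, 1.025, 1.05
sp = {"B3": 0.75, "B4": 0.25, "B5": 0.5, "B6": 0.5}
voltages = assign_level_voltages(sp, v_u=v3, v_low=v1, v_high=v5)

# The potential buffer region of B3 has no grid at exactly v5 here,
# so the closest available grid is used instead.
region_b3 = {(0, 0): 1.0, (0, 1): 1.04, (1, 0): 0.97}
grid_b3 = nearest_grid(region_b3, voltages["B3"])
```

Here B3 and B4 receive v5 and v1, B5 and B6 keep v3, and B3 lands on the grid at 1.04 V because it is the closest available voltage to v5.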

Fig. 12. Example of our supply voltage alignment among voltage candidates in potential buffer region to reduce the clock skew caused by asymmetric BTI degradation.

../../Resources/ieie/JSTS.2024.24.5.424/fig12.png

4) Wire routing

In this section, we perform wire routing to complete the symmetrical buffered clock tree. The wire delay is calculated based on the Elmore delay model as follows:

(11)
$ D_{wire}=r_{0}l_{wire}\left(\frac{1}{2}c_{0}l_{wire}+C_{load}\right) $

where D$_{wire}$ is the wire delay, r$_{0}$ and c$_{0}$ are the unit resistance and unit capacitance of the wire, respectively, l$_{wire}$ is the wire length, and C$_{load}$ is the load capacitance at the end of the wire. The wire slew model [22] used in this paper is expressed as follows:

(12)
$ Sl_{e}=\ln 9\cdot D_{wire} $

where Sl$_{e}$ is the slew degradation on the wire. Thus, the wire delay considering slew is obtained through (11) and (12) in order to meet the slew constraints. If wire snaking lengthens a wire by ${\Delta}$L so that the wires are routed with identical length, the clock skew can be reduced further. We obtain the wire length (${\Delta}$L) for wire snaking and route the wires with identical length at each level to maintain the symmetric tree structure.
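Equations (11) and (12) and the length equalization by wire snaking can be sketched as follows. The unit values and wire lengths are made up for illustration, and the helper names are not from this work. The sketch assumes that each wire at a level is simply extended to the longest wire length of that level.

```python
import math


def wire_delay(r0, c0, length, c_load):
    """Elmore wire delay, Eq. (11)."""
    return r0 * length * (0.5 * c0 * length + c_load)


def wire_slew(r0, c0, length, c_load):
    """Slew degradation on the wire, Eq. (12): ln(9) times the Elmore delay."""
    return math.log(9) * wire_delay(r0, c0, length, c_load)


def snaking_lengths(lengths):
    """Extra length (delta L) each wire needs to match the longest wire."""
    longest = max(lengths)
    return [longest - length for length in lengths]


# Assumed units: ohm/um, fF/um, um, fF, so the delay is in ohm*fF (fs).
r0, c0, c_load = 0.1, 0.2, 10.0
lengths = [100.0, 80.0, 95.0]            # wires of one level before snaking
delta = snaking_lengths(lengths)         # delta L per wire
snaked = [l + d for l, d in zip(lengths, delta)]
delays = [wire_delay(r0, c0, l, c_load) for l in snaked]
slews = [wire_slew(r0, c0, l, c_load) for l in snaked]
```

After snaking, all wires of the level have the same length, so with equal load capacitance they have the same Elmore delay and the same slew degradation.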

5) Time complexity analysis

Algorithm 2 is the top-down phase of our BTI-aware CTS. The top-down phase takes as primary inputs the symmetrical abstract tree topology generated in the bottom-up phase and the clock skew constraints. In the first step, the time complexity of calculating SP and obtaining ${\Delta}$Vth is O(M+N), where M is the number of wires and N is the number of buffers. Since (5) is equivalent to the problem of finding a median value, the uniform SV can be obtained in linear time. The time complexity of finding the longest/shortest path in a clock tree with S sinks (nodes) is O(S·logS) owing to the symmetrical tree structure. The time complexity of obtaining the optimal SVs is O(V$^{2}$), where V is the number of candidate voltages. The total time complexity of the above linear function becomes O(V$^{2}$·logL). Therefore, the total time complexity of our top-down phase becomes O(V$^{2}$).
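The path-extraction step analyzed above can be sketched with an iterative depth-first search. The tree representation (a dictionary from a node to its children and edge delays), the node names, and the delay values are assumptions made for illustration. The sketch visits each node down to the requested level once, and copying the path at each node costs up to the tree depth.

```python
def extreme_paths(tree, source, level):
    """Find the longest and shortest source-to-level paths by DFS.

    tree   : dict mapping a node to a list of (child, delay) pairs
    source : clock source node
    level  : depth at which arrival times are compared
    Returns ((max_AT, path), (min_AT, path)); both are None if the
    requested level is not reached.
    """
    longest = shortest = None
    stack = [(source, 0.0, 0, [source])]  # (node, arrival time, depth, path)
    while stack:
        node, at, depth, path = stack.pop()
        if depth == level:
            if longest is None or at > longest[0]:
                longest = (at, path)
            if shortest is None or at < shortest[0]:
                shortest = (at, path)
            continue  # do not descend below the requested level
        for child, delay in tree.get(node, []):
            stack.append((child, at + delay, depth + 1, path + [child]))
    return longest, shortest


# Toy two-level tree with made-up stage delays.
tree = {
    "S": [("B1", 1.0), ("B2", 1.5)],
    "B1": [("B3", 2.0), ("B4", 1.0)],
    "B2": [("B5", 1.0), ("B6", 1.25)],
}
longest, shortest = extreme_paths(tree, "S", level=2)
skew = longest[0] - shortest[0]
```

The difference between the two arrival times is the skew at that level, which is the quantity constrained in (10).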

V. Experimental Results

The proposed approach was implemented in the C/C++/Python programming languages on a quad-core 3.1 GHz Linux machine. We verified the proposed method using a 45 nm process technology and a predictive technology model (PTM) [23] with HSPICE. CPLEX is used to solve our linear programming problem. We tested our CTS on the ISPD’09 [24] and IBM [25] benchmark circuits. Since the ISPD benchmark circuits do not include a power network, we generated a random current for each power node in order to build a power network for each circuit. The power network consists of 20 x 20 grids. We randomly set 10% to 20% of the clock buffers to be gating-enabled. For various workloads, we randomly assigned various clock gates such as NAND, NOR, and AND. The input signal probability (SP) of the clock source was assumed to be 50%. The slew limit is set to 100 ps. We set the nominal supply voltage between 0.95 V and 1.05 V and limited the IR-drop tolerance to 10% of the nominal supply voltage. The clock skew constraint is set to 7.5 ps, as used in the ISPD clock network synthesis contest.

We compared the performance of our abstract tree topology with the abstract topology algorithms of [26] and [27] in Table 1. [26] is a DME-based CTS with a slew-aware buffer insertion/sizing algorithm. It performs buffer sizing to meet the slew constraints while satisfying the capacitance limit. However, buffer sizing to meet the slew limit leads to unexpected skew and imbalance in [26]. [27] constructs a DME-based clock tree based on the equivalence between the wirelength of a zero-skew tree (ZST) and the diameter sum of the hierarchical clustering (HC) in a subtree. Since it is a heuristic algorithm based on BST/DME, the choice of the skew bound is important for the total wirelength. Our abstract tree topology is based on NNS [9] with minor modifications for reducing power consumption and generating the parent/buffer region.

Table 1 compares the total wirelength, the number of buffers, and the clock skew, together with the power consumption. Compared with [26], our proposed method incurs about 14% more total wirelength, inserted buffers, and power consumption. This small power overhead is due to the difference in the number of clusters. However, because our proposed method generates a symmetrical clock tree, the clock skew is reduced by 28% compared with [26]. In addition, the clock skew of the proposed method is within 2% of that of [27], while the total wirelength and power consumption are 9% and 18% lower than those of [27], respectively. Although our method incurs a small overhead, both the average skew and the power consumption are improved overall compared with existing works. Therefore, our abstract tree topology algorithm is effective.

Table 1. Skew and power comparison between proposed method and the other methods
../../Resources/ieie/JSTS.2024.24.5.424/tb1.png

We compared our method with the existing symmetrical CTS in Table 2. The existing symmetrical CTS method, which does not consider asymmetric aging, assigns an identical supply voltage to all buffers because of the symmetrical property. Thus, it causes unexpected clock skew after years of BTI degradation. In the table, the symbol ‘-’ indicates an increase. In Table 2, the clock skew of our method is larger than that of the existing symmetric CTS (uniform) at 0 years. However, the proposed method still meets the clock skew constraint at 0 years. In addition, our method yields a smaller clock skew than the existing symmetric CTS after years of BTI degradation. The existing symmetric CTS leads to skew violations on benchmark circuits f11, f21, f22, r1, and r2 after 5 and 10 years of aging. In the experimental results, our method reduced the clock skew by 24.9% and 34.5% on average after 5 and 10 years of BTI degradation, respectively, compared to the existing symmetrical CTS.

Table 2. Skew comparison between proposed method (Non-Uniform) and existing method (Uniform)
../../Resources/ieie/JSTS.2024.24.5.424/tb2.png

To show the effectiveness of the top-down stage of our CTS method, we compared it with the existing works [14], [3], and [28] on the ISPD’09 and IBM benchmark circuits. [28] constructs a symmetrical clock tree without BTI consideration. We implemented [14] and [3], which are existing methods that consider BTI. The approach in [14] reduces aging-induced clock skew by inserting duty-cycle converters in post-CTS; it is applied to the clock tree synthesized by the latest existing CTS [27]. The work in [3] implements clock gating by selecting a NAND or NOR gate as the output stage of the integrated clock gating cells, manipulating the signal probabilities to account for NBTI-induced clock skew. In Table 3, when aging is not considered, as in [28], the skew increases by more than 2.5 times after 10 years. The proposed method achieved up to 42% and 55% skew reduction compared to [3] and [14], respectively, after 5 years of BTI degradation. When fresh, the clock skew of the proposed algorithm is similar to or slightly larger than that of the other methods but remains within the skew constraint. After 10 years of BTI degradation, we achieved 48% and 53% skew reduction, respectively. In addition, the skew of our method after 10 years of BTI degradation is similar to its skew at 0 years, whereas the result of [28], which does not consider the aging effect, worsens by more than 2 times. The existing methods [3], [14], and [28] cause skew violations (>7.5 ps) on some benchmark circuits after 10 years of BTI degradation. In contrast, our CTS method satisfies the given skew constraint for all benchmark circuits. Our method is practical and robust because its change in skew after 10 years of BTI degradation is the smallest. Through our supply voltage alignment, our method assigns optimal supply voltages to buffers to reduce the clock skew caused by BTI.

Table 3. Skew comparison between proposed method and the other methods after aging
../../Resources/ieie/JSTS.2024.24.5.424/tb3.png

Fig. 13 presents the shortest and longest paths of benchmark circuit f21 when fresh and after 5 and 10 years of aging. The paths change because of asymmetric aging. The longest and shortest paths are #60 and #83 at 0 years, #60 and #1 after 5 years, and #109 and #1 after 10 years. These changes can cause unexpected skew, leading to skew violations. As shown in Table 3, the skew variations caused by aging result not only from the BTI effect itself but also from the switching of the longest and shortest paths due to asymmetric aging. Our method performs CTS with these changes taken into account to prevent skew violations.

Fig. 13. The shortest and longest paths for benchmark circuit f21. The longest path is shown in blue and the shortest path in red (the 0-year result is drawn at a different scale).

../../Resources/ieie/JSTS.2024.24.5.424/fig13.png

Table 4 shows the skew of the clock tree generated by our method under process and temperature variation. Threshold voltage, channel length, oxide thickness, and metal width are adjusted from 90% to 110% in nMOS and pMOS to create the fast, typical, and slow corners for the process variation parameters. Temperature variation was measured by evaluating the delay of the circuit at 0$^{\circ}$C, 25$^{\circ}$C, and 125$^{\circ}$C. Our method produced a clock tree without skew violation under both process variation and temperature variation.

Table 4. Skew comparison of proposed method with process variation and temperature variation
../../Resources/ieie/JSTS.2024.24.5.424/tb4.png

VI. Conclusion

We have presented a symmetrical buffered clock tree synthesis method that considers asymmetric BTI. The bottom-up phase performs clock sink clustering and DME-based parent/buffer region generation to construct a power-aware symmetrical clock tree topology. The top-down phase performs BTI-aware optimal supply voltage alignment on buffers based on linear programming (LP) and wire routing without an additional optimization stage. Experimental results show that our approach efficiently constructs a BTI-tolerant clock tree. Because the clock tree topology is constructed in the bottom-up phase, we compared it with clustering algorithms, and it shows similar clock skew with low power consumption. Compared with the other methods after aging, the proposed algorithm achieves on average 51% skew reduction after 10 years compared to existing CTS methods. Our approach produces a more robust clock tree than other methods, satisfying the skew constraints both at 0 years and after 10 years of BTI degradation. We built a clock tree topology with minimized power consumption and proposed BTI-aware optimal supply voltage alignment on buffers using linear programming. Our method does not require additional hardware (level converters). Also, the skew minimization is performed during the CTS process without an additional skew optimization stage. The proposed algorithm thus provides a robust clock tree synthesis for reliability. In future work, we will study BTI-aware clock tree synthesis based on machine learning algorithms and 3D clock tree synthesis considering the aging effect.

ACKNOWLEDGMENTS

This work was supported by Samsung Electronics Co., Ltd. (IO201210-07940-01). It was also supported by the MSIT, Korea, under the National Program for Excellence in SW (2015-0-00910) supervised by the IITP. The EDA tool was supported by the IC Design Education Center (IDEC), Korea.

References

[1] J. Lu, W. K. Chow and C. W. Sham, “Fast power- and slew-aware gated clock tree synthesis,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 20, no. 11, pp. 2094-2103, Nov. 2012.
[2] P. Jain, F. Cano, B. Pudi and N. V. Arvind, “Asymmetric Aging: Introduction and Solution for Power-Managed Mixed-Signal SoCs,” in IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 22, no. 3, pp. 691-695, March 2014.
[3] S.-H. Huang, C.-M. Chang, W.-P. Tu, and S.-B. Pan, “Critical-PMOS-aware clock tree design methodology for anti-aging zero skew clock gating,” in Proc. Asia and South Pacific Des. Autom. Conf., Jan. 2010, pp. 480-485.
[4] K. B. Sutaria et al., “Duty cycle shift under static/dynamic aging in 28nm HK-MG technology,” 2015 IEEE International Reliability Physics Symposium, 2015, pp. CA.7.1-CA.7.5.
[5] A. Chakraborty, G. Ganesan, A. Rajaram, and D. Pan, “Analysis and optimization of NBTI induced clock skew in gated clock trees,” in Proc. Des. Autom. Test Europe Conf., pp. 296-299, Apr. 2009.
[6] S. Kirolos, Y. Massoud, and Y. Ismail, “Mitigating Power-Supply Induced Delay Variations using Self Adjusting Clock Buffers,” IEEE Int. Midwest Symp. Circ. Syst. (MWSCAS), 2008.
[7] Xin-Wei Shih, Tzu-Hsuan Hsu, Hsu-Chieh Lee, Yao-wen Chang, Kai-Yuan Chao, “Symmetrical buffered clock-tree synthesis with supply-voltage alignment,” in Proc. Asia and South Pacific Des. Autom. Conf., pp. 447-452, 2013.
[8] L. C. Acharya et al., “Aging Aware Timing Model of CMOS Inverter: Path Level Timing Performance and Its Impact on the Logical Effort,” in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.
[9] M. Edahiro, “A clustering-based optimization algorithm in zero-skew routings,” in Proc. Des. Autom. Conf. (DAC), pp. 612-616, Jun. 1993.
[10] J. B. Velamala, K. B. Sutaria, V. S. Ravi and Y. Cao, “Failure Analysis of Asymmetric Aging Under NBTI,” in IEEE Transactions on Device and Materials Reliability, vol. 13, no. 2, pp. 340-349, June 2013.
[11] J. M. Cohn, “Method for Reducing Design Effect of Wearout Mechanisms on Signal Skew in Integrated Circuit Design,” United States Patent No. 6651230, 2003.
[12] S. Arasu, M. Nourani, F. Cano, J. M. Carulli and V. Reddy, “Asymmetric aging of clock networks in power efficient designs,” Fifteenth International Symposium on Quality Electronic Design, Santa Clara, CA, USA, 2014, pp. 484-489.
[13] D. Borkovic and K.S. McElvain, “Reducing Clock Skew in Clock Gating Circuits,” United States Patent No. 7082582, 2006.
[14] K.-C. Wu, T.-H. Tseng and S.-C. Li, “MAUI: Making aging useful, intentionally,” 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), Dresden, Germany, 2018, pp. 527-532.
[15] Bujumalla, C. Koh, “Synthesis of low power clock trees for handling power-supply variations,” in Proc. Int. Symp. Phys. Des., pp. 37-44, 2011.
[16] J. Chen and M. Tehranipoor, “A Novel Flow for Reducing Clock Skew Considering NBTI Effect and Process Variations,” in Proc. Int. Symp. Qual. Electron. Design, pp. 327-334, 2013.
[17] L. Lai, V. Chandra, R. Aitken and P. Gupta, “BTI-Gater: An Aging-Resilient Clock Gating Methodology,” in IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 4, no. 2, pp. 180-189, June 2014.
[18] R. Kishida, T. Asuke, J. Furuta and K. Kobayashi, “Extracting Voltage Dependence of BTI-induced Degradation Without Temporal Factors by Using BTI-Sensitive and BTI-Insensitive Ring Oscillators,” in IEEE Transactions on Semiconductor Manufacturing, vol. 33, no. 2, pp. 174-179, May 2020.
[19] T. H. Chao, Y. C. Hsu, J. M. Ho, K. D. Boese, and A. B. Kahng, “Zero skew clock routing with minimum wirelength,” in IEEE Trans. Circuits Syst., vol. 39, pp. 799-814, 1992.
[20] A. Mukherjee, K. Wang, L. Chen, and M. Marek-Sadowska, “Sizing power/ground meshes for clocking and computing circuit components,” in Proc. Des. Autom. Test Europe Conf., pp. 176-183, 2002.
[21] X.-D. S. Tan and C.-J. R. Shi, “Fast power/ground network optimization based on equivalent circuit modeling,” in Proc. Des. Autom. Conf. (DAC), pp. 550-554, 2001.
[22] S. Hu, C. J. Alpert, J. Hu, S. Karandikar, Z. Li, W. Shi, et al., “Fast algorithms for slew constrained minimum cost buffering,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 26, no. 11, pp. 2009-2022, Nov. 2007.
[23] Predictive Technology Model, downloaded from http://ptm.asu.edu/.
[24] C. Sze, P. Restle, G.-J. Nam, C. J. Alpert, “ISPD 2009 Clock Network Synthesis Contest,” in Proc. Int. Symp. Phys. Des., pp. 149-150, 2009.
[25] R.-S. Tsay, “Exact zero skew,” in Proc. Int. Conf. Comput.-Aided Design, pp. 336-339, 1991.
[26] M. Choi, D. Oh and J. Kim, “Slew-aware fast clock tree synthesis with buffer sizing,” 2018 International Conference on Electronics, Information, and Communication (ICEIC), Honolulu, HI, USA, 2018, pp. 1-4.
[27] G. Chen and E. F. Y. Young, “Dim Sum: Light Clock Tree by Small Diameter Sum,” 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE), Florence, Italy, 2019, pp. 174-179.
[28] X.-W. Shih and Y.-W. Chang, “Fast timing-model independent clock tree synthesis,” in Proc. Des. Autom. Conf. (DAC), pp. 80-85, 2010.
Mujun Choi
../../Resources/ieie/JSTS.2024.24.5.424/au1.png

Mujun Choi received the B.S. degree in Computer Science and Engineering from Sogang University, Korea, in 2011. He is currently pursuing a unified Master's and Ph.D. degree in Computer Science and Engineering at Sogang University, Korea. His research interests are variation-aware timing analysis, design for reliability enhancement, interconnect variation, and 3D-IC.

Deokkeun Oh
../../Resources/ieie/JSTS.2024.24.5.424/au2.png

Deokkeun Oh received the B.S., Master's, and Ph.D. degrees in Computer Science and Engineering from Sogang University, Korea, in 2012, 2014, and 2020, respectively. He joined Samsung Electronics, Hwaseong-si, in 2020, where he is currently a CAE Engineer. His research interests are variation-aware timing analysis, interconnect and process variation, Monte-Carlo analysis, design for reliability enhancement, clock tree, and low power.

Juho Kim
../../Resources/ieie/JSTS.2024.24.5.424/au3.png

Juho Kim received the B.S. and Ph.D. degrees in Computer and Information Science from the University of Minnesota in 1987 and 1995, respectively. After receiving the Ph.D. degree, he worked as a senior member of technical staff at Cadence Design Systems until 1997. Professor Kim joined the Department of Computer Science and Engineering at Sogang University, Seoul, Korea, in 1997, and served as department chair from 2005 to 2008. His research interests are variation-aware timing analysis and low-power design.