Our CTS approach considering BTI is based on the deferred merge embedding to minimize
wire length. The BTI-aware symmetrical buffered CTS consists of two steps: 1) bottom-up
phase, 2) top-down phase as described in Fig. 4. First, we generate a symmetrical abstract tree topology considering power consumption
in bottom-up phase. Second, buffer insertion and wire routing considering BTI are
performed for completing CTS in top-down phase. The objective of our CTS approach
is to construct a symmetrical buffered clock tree considering BTI to satisfy the clock
skew constraints considering both before and after aging.
1. Bottom-up Phase
The bottom-up phase constructs a symmetrical abstract tree topology with parent/buffer
region. Our proposed abstract tree topology generation algorithm is based on nearest
neighbor graph (NNG). The objective of our method is not only to minimize power consumption
but also to generate the parent node and buffer region for supply voltage alignment.
Constructing an abstract tree topology with power consumption minimization is significantly
important since it can reduce the power overhead of supply voltage alignment performed
in top-down phase, and it has a great impact on circuit performance. Our symmetrical
abstract tree topology algorithm has three properties as follows: 1) bottom-up traversal,
2) minimized power consumption, 3) symmetrical tree construction.
1) Abstract tree topology generation
Our abstract tree topology generation algorithm recursively selects a pair of subtrees
or sinks and then merges the selected pair. The selection of pairs is based on the
cost function which includes the power consumption of the wire, downstream capacitance
of sinks as shown in (2). The cost function for selecting the pairs as follows:
where C$_{wire}$ is a unit capacitance of wire. The terms E$_{a}$ and E$_{b}$ represent
the Manhattan distance from node a to its parent node and from node b to its parent
node respectively. C$_{a}$ and C$_{b}$ are defined as downstream capacitance at node
a and b respectively.
Assume that a set of clock sinks is given. First, NNG of all subtrees and sinks is
constructed based on cost function (2). At each sink, after the costs of all edges from each sink to the others are obtained,
an edge with the lowest cost is selected among them as a directed edge. Thus, each
sink has a directed edge. Next, the directed edges are sorted according to their costs
in non-decreasing order. An edge with minimum value is selected among sorted edges
in a table for minimizing power consumption. The selected edge is removed from the
table. The parent nodes (internal nodes) of removed edges are stored in a new table.
This process is repeated level by level in bottom-up sink.
Fig. 5 describes the algorithm of our clock sink clustering used in this work. Fig. 5(a) shows the initial sink distribution as a primary input before NNG is constructed.
At each sink, after the costs of all edges from each sink to the others are obtained,
an edge with the lowest cost is selected among them as a directed edge as shown in
Fig. 5(b). After NNG based on (2) is constructed, all sinks have a directed edge represented in Fig. 5(c). Sink S2 has a directed edge connecting to sink S1. This mean that the cost is minimum
when sink S2 is paired with S1. However, when sink S1 is paired with S5, the cost
is minimum value. After NNG for all sinks is finished, the costs of all edges are
stored in the table for merging cost with non-decreasing order. Fig. 5(e) is a table that consists of edge information and its cost in non-decreasing order.
The costs of all edges are calculated by (2), and then an edge with the minimum cost is selected for minimum power consumption
such as edge (S5, S6). Sinks S5 and S6 are the edges removed from the table. The removed
edges (S5, S6) are stored in new table for its parent node (I$_{1}$). This process
is repeated until the edges in the table are empty. Through this process, after the
set of parent nodes is completed, the above process for parent nodes (internal nodes)
is repeated until it reaches the clock source.
Fig. 5. Our clustering algorithm based on NNG: (a) Initial clock sink distribution;
(b) The selection of a directed edge at each paired and located at level 3 in bottom-up
fashion level by level; (c) After the directed edges of all sink are constructed;
(d) The cost calculation for edges of all sinks; (e) The table for merging cost.
2) Buffer and parent node region generation
In bottom-up stage, a buffer region is constructed for buffer insertion with supply
voltage alignment, and a parent node region is generated for merging point with symmetrical
wire-length. As we mentioned in Section III-B, TRR represents potential buffer embedding
and parent node region generated through extension/ intersection operation [7] for buffer insertion and wire routing. For extension/intersection operation, the
radius is same as the longest wirelength among pairs of a certain level clustered
with (2). After the buffer regions are constructed by TRR, we can scan the voltage domain
(candidates) for the supply voltage alignment as shown in Fig. 6(b). Since shorter connections should be lengthened through snaking to satisfy the identical-wirelength
property, the total wirelength is determined by the longest wire within a certain
level [7]. As shown in Fig. 6(a), the bottom-up phase generates the symmetrical abstract tree topology minimizing
power consumption as a result.
Fig. 6. (a) Our symmetrical abstract tree topology; (b) The voltage domain (candidates)
for the buffer insertion.
3) Time Complexity
The pseudocode of our symmetrical abstract tree generation algorithm is shown in Algorithm
1. The time complexity of constructing the nearest neighbor graph for N sinks is O(N$^{2}$)
and calculating the merging costs for all sinks takes O(N). The time complexity of
sorting the table in non-decreasing order becomes O(N·logN). Finding a minimum cost
in a table takes O(N). The time complexity of constructing the potential parent node
and buffer region is O(N). Scanning the voltage domain in a buffer region from power
map becomes O(N). The time complexity of inner loop is O(N) and the outer loop takes
O(N). Therefore, the overall time complexity of the proposed symmetrical abstract
tree generation algorithm becomes O(N$^{2}$).
2. Top-down Phase
After the symmetrical abstract tree topology with parent/buffer region is generated
in bottom-up phase, top-down phase decides the exact location of buffer and wire length
for completing BTI-aware symmetrical buffered clock tree. The optimal supply voltages
on buffers are assigned with considering BTI. Then, buffer insertion and wire routing
are performed at each level recursively. The top-down phase in our CTS is made up
of four steps: First, the signal probabilities of clock buffers are calculated at
each tree level. In the second step, the threshold voltage shift (${\Delta}$Vth) is
obtained according to SP after 10 years of aging, and the delay changed (${\Delta}$D)
by ${\Delta}$Vth is obtained. The third step determines BTI-aware optimal supply voltages
(SVs) of buffers at each level based on LP. Then, the exact location of buffer among
voltage candidates in potential buffer region is determined, and it is inserted on
corresponding grid. The final step conducts wire routing and adjustment through wire
snaking to complete a symmetrical buffered clock tree.
1) Signal probability (SP) propagation
The first step of our top-down phase is to calculate the SPs of buffers at each level
for estimating BTI effect. Signal probability (SP) is a common method to represent
relative time of stress and recovery stage.
Fig. 7 shows an example of SP propagation under clock gating in a buffered clock tree. The
input SP of clock source B1 is 0.5. Assume that an AND gate is used as clock gating
technique. The output SP of the AND gate is represented as SP·GP. The different logic
gates can be employed as clock gating e.g., NAND, NOR, etc. This work uses AND gates
for implementing clock gating technique. The gating probability (GP) is denoted as
the probability of a gate to be ON state. If the GP is 1, clock is always gated. If
the GP becomes 0, clock is never gated. Assume that the GPs of buffers are as shown
in Fig. 7. When the GP of clock buffer B3 is 0.7, its SP becomes 0.35. The SP of clock buffer
B6 is same as B3 since its GP is 1.0. Also, the GP of B7 is 0.9 and the SP of B7,
which is right subtree of clock buffer B3 becomes 0.315. The buffers in shaded circle
is always gated. Thus, their SPs of B2, B4, and B5 are equal to the SP of clock source
B1. Likewise, for estimating BTI effect, we can obtain the SPs of all buffers at each
level from clock source to sinks level by level recursively.
Fig. 7. The signal probability propagation under clock gating in a buffered clock
tree.
2) The delay degradation by threshold voltage shift
In this step, the shift of threshold voltage (${\Delta}$Vth) on a buffer is obtained
by utilizing SP after 10 years of aging. To observe the impacts of SP on the buffer,
we first obtain ${\Delta}$Vth of each buffer according to SP in a buffered clock tree
after 10 years of BTI degradation. Using ${\Delta}$Vth obtained through (1), we performed SPICE simulation to obtain the buffer delay (rise delay) corresponding
to ${\Delta}$Vth. Since BTI effect impacts both on PMOS and NMOS devices, it has an
effect on the rise and fall delay of the buffer. ${\Delta}$Vth is a large increase
when SP is from 0 to 0.1. However, when SP is larger than 0.1, the shift of threshold
voltage (${\Delta}$Vth) is flattened. The relationship between signal probability
(SP) and the shift of threshold voltage (${\Delta}$Vth) is linear form as follows:
where a$_{0}$ and b$_{0}$ are coefficients, a$_{1}$ and b$_{1}$ are constant values.
Both cases for SP < 0.1 and SP > 0.1 are linear form respectively. Because ${\Delta}$V$_{th}$
is affected by input transition time (T$_{\mathrm{r}}$) and output load capacitance
(C$_{\mathrm{L}}$), the two parameter is fixed with specific value of the coefficients,
a$_{0}$ and b$_{0}$, constant, a$_{1}$ and b$_{1}$. Thus, the relationship between
${\Delta}$V$_{th}$ and the delay of buffer (D$_{buffer}$) can be represented as follows:
where c$_{0}$ is a coefficient and c$_{1}$ is a constant value. Using (3) and (4), the linear function (5) between SP and D$_{buffer}$ can be obtained as below:
Therefore, we can obtain that the relationship between the SP of buffer and its delay
is linear function by (5). We built a look-up table (LUT) based on input slew and output capacitance of the
clock buffers. In addition, an aging-aware LUT is constructed to achieve the delay
of buffer after 10 years of BTI degradation. In CDN, we built LUTs of clock gates
such as NAND, NOR, and AND in order to handle the aging effect of clock gates itself
for further accuracy analysis of aging.
3) BTI-aware supply voltage alignment
To build a power network for IR drop, we used the power network that is modeled by
an equivalent circuit [20]. Horizontal and vertical power trunks are evenly distributed on a circuit. Also,
each power grid consists of power straps that are for VDD port connections of clock
buffers, orthogonal lower metal rails in a mesh structure [21]. Cells and blocks are connected to power straps, and their current can be obtained
for the power analysis during the placement stages. Fig. 8 is the example of power network used in this work. The power network and IR-drop
information can be obtained at each intersection of power straps in sufficient accuracy.
Non-uniform SPs of clock buffers at each tree level lead to asymmetric BTI degradation.
Fig. 8. The example of 3 x 3 power distribution network and the difference of supply
voltage.
The optimal supply voltage alignment (SVA) is introduced to reduce a large clock skew
caused by asymmetric BTI degradation after 10 years. For assigning SVs to buffers
at each level, the SVs are selected among voltage candidates in potential buffer region.
If the exact embedding location is determined in potential buffer region, the cost
for buffer insertion is zero. Thus, obtaining SV in potential buffer region is cost
effective. Since the VDD ports of buffers are connected to power straps, the corresponding
voltage in potential buffer region is obtained from the power analysis map. The maximum
and minimum voltage can be found in potential buffer region. The continuous voltages
between the maximum and the minimum exactly form an interval called voltage candidates.
Since the IR drop at different locations (nodes) of the power network may vary, it
is difficult to choose the identical SV between potential buffer regions. To minimize
the skew, uniform supply voltage is first selected at each tree level as possible.
The uniform supply voltage is denoted as the minimum difference among used supply
voltages in each potential buffer region as represented in (6). To obtain the uniform supply voltage, following cost function is proposed in [7]:
The v$_{i,min}$ and v$_{i,max}$ are represented as the minimum and maximum voltage
in voltage candidate i respectively.
By (6), The uniform SV v$_{u}$ of buffers is obtained at each level in a clock tree network.
As the assigned supply voltage on a buffer increase, the buffer delay is linearly
decreased. Hence, the delay of buffer with large SV has smaller value than that with
small SV in Fig. 9. However, the degradation rate of the buffer is increased after years of BTI degradation
(years of aging). The degradation rate of the buffer with large SV is larger than
that with small SV in Fig. 10. Using the characteristic, optimal SVs are assigned to buffers with the largest/smallest
SP and uniform SV to the others for reducing the clock skew caused by asymmetric BTI
degradation. As shown in Fig. 11(a), although the clock skew of existing CTS method without BTI consideration is zero
at time 0, this clock tree can cause unexpected large clock skew after 10 years of
BTI degradation. The objective of our CTS approach is to meet clock skew constraints
before and after 10 years of BTI degradation as shown in Fig. 11(b). Therefore, asymmetric aging effect (BTI) by clock gating must be considered for
circuit reliability and performance. In clock tree network, the arrival time (8) is represented as follows:
Fig. 9. The relationship between the delay of buffer and supply voltage.
Fig. 10. The degradation rate depending on the supply voltage of buffer.
Fig. 11. (a) Existing CTS method without BTI consideration before and after years
of BTI degradation; (b) Our CTS method with BTI consideration before and after years
of BTI degradation.
The AT$^{fresh}$$_{i}$ is denoted as the arrival time from clock source (i=0) to i-th
tree level at time 0 (fresh) and the AT$^{aging}$$_{i}$ is denoted as the arrival
time from clock source to i-th tree level after 10 years of BTI degradation (aging).
The term D$_{buf,i}$(SP,v) is the delay function of signal probability (SP) and the
voltage v of the buffer. The D$_{wire,i}$ is the delay of wire. Therefore, the clock
skew at each level is defined as follows:
In (9), the clock skew (skew$^{t}$$_{i}$) at i-th tree level can be represented as the difference
between the largest (AT$^{t}$$_{max,i}$) and the smallest (AT$^{t}$$_{min,i}$) arrival
time from clock source at time t. We formulate the linear function through (5) and (9) as follows:
where the v$_{high}$ and v$_{low}$ are denoted as the optimal voltages of buffers
with the largest/smallest SP respectively. The skew$_{cons}$ is denoted as clock skew
constraints and C$_{cons}$ is defined as capacitance limit. L is the level of clock
tree. To reduce the clock skew after 10 years of BTI degradation, we find optimal
SVs (v$_{high}$, v$_{low}$) of buffers for minimizing skew$_{i}$$^{aging}$ = (AT$^{aging}$$_{max,i}$
- AT$^{aging}$$_{min,i}$) among voltage candidates. The buffer with the largest SP
is assigned to v$_{high}$ that is larger than uniform voltage (v$_{u}$) and the buffer
with the smallest SP is assigned to v$_{low}$ that is smaller than v$_{u}$ at each
level. To minimize objective function skew$^{t}$$_{i}$ is not only to reduce the clock
skew after 10 years of aging but also to satisfy clock skew constraints at time 0.
We perform the function (10) for finding BTI-aware optimal SVs (v$_{high}$, v$_{low}$) from clock source to sinks
at each level recursively. Also, we do not require multi-level supply voltages but
find optimal voltages in potential buffer region using power analysis map. First,
we obtain the voltage candidates in potential buffer region using power analysis map.
Second, we determine uniform SV at each level using (6) for symmetrical buffered clock tree structure. Third, the longest and shortest paths
from clock source to i-th level are extracted using DFS (Depth First Search) algorithm.
Next, the optimal SVs of buffers with largest/smallest SP at i-th level are obtained
in potential buffer region using (10). Finally, after finding the optimal SVs of buffer, the location of buffer is obtained
and it is inserted on grid with optimal SV in potential buffer region. Likewise, the
optimal SVs at each level are obtained from clock source until clock sink is reached
recursively.
Fig. 12 illustrates optimal supply voltage alignment on buffer considering BTI proposed in
this work. For example, the clock source is S, its SP is 0.5, and the voltage candidates
are {$v_{1}$, $v_{2}$, $v_{3}$, $v_{4}$, $v_{5}$} that is sorted by ascending order.
Assume that the SPs of B3, B4, B5, and B6 are as represented in Fig. 12. In our work, first, all SPs of buffers can be obtained by method described in Section
IV-2 as top-down traversal. Second, the uniform SV (v$_{u}$) on buffers which is obtained
using (6) is v$_{3}$ at level 2. This uniform voltage is included in voltage candidates. It
is assigned to all buffers except buffers with the largest/smallest SP. Thus, all
buffers at level 2 are assigned to v$_{3}$ except B3 and B4. B3 is assigned to higher
voltage (v$_{high}$) than uniform voltage v$_{u}$ and B4 is assigned to lower voltage
(v$_{low}$) than v$_{u}$. These optimal voltages (v$_{high}$, v$_{low}$) are obtained
among voltage candidates by linear function (10). The SVs of B3 and B4 become v$_{5}$ and v$_{1}$ respectively. After the location
of buffer B3 in potential buffer region is determined, buffer B3 is inserted to corresponding
region (grid) with v$_{5}$. If the grid with v$_{5}$ does not exist in potential buffer
region, B3 is inserted on grid which is as close to v$_{5}$ as possible. Although
our CTS method may cause not zero skew but small skew at time 0, it satisfies clock
skew constraints before and after 10 years of BTI degradation.
Fig. 12. Example of our supply voltage alignment among voltage candidates in potential
buffer region to reduce the clock skew caused by asymmetric BTI degradation.
4) Wire routing
In this section, we perform wire routing to complete the symmetrical buffered clock
tree. The wire delay is calculated based on the Elmore delay model as follows:
where D$_{wire}$ is defined as the wire delay. r$_{0}$ and c$_{0}$ are represented
as a unit resistance and a unit capacitance of the wire respectively. The wire length
is denoted as l$_{wire}$. C$_{load}$ is the load capacitance at the end of the wire.
The wire slew model [22] used in this paper is expressed as follows:
where Sl$_{e}$ is the slew degradation on wire. Thus, the wire delay considering slew
is obtained through (11) and (12) in order to meet the slew constraints. If the wire is routed as identical length
using wire snaking that lengthen wire as ${\Delta}$L, it can further reduce clock
skew. We can obtain wire-length (${\Delta}$L) for wire snaking and perform wire routing
with the identical length at each level to maintain symmetric tree structure.
5) Time complexity Analysis
Algorithm 2 is the top-down phase in our CTS considering BTI. In top-down phase, a
symmetrical abstract tree topology generated in bottom-up phase and clock skew constraints
are required as primary inputs. In the first step, the time complexity of both calculating
SP and obtaining ${\Delta}$Vth is O(M+N) where M be the number of wires and N be the
number of buffers. Since (5) is the same problem of finding median value, obtaining uniform SV can be solved by
linear time. The time complexity for finding the longest/shortest path in a clock
tree with sinks (nodes) S is O(S·logS) due to symmetrical tree structure. The time
complexity of obtaining optimal SV is O(V$^{2}$) where V is the number of candidate
voltages. The total time complexity of above linear function becomes O(V$^{2}$·logL).
Therefore, the total time complexity of our top-down phase becomes O(V$^{2}$).