In contrast to supervised learning, where correct responses are established in advance,
Q-learning is effective in scenarios where a predefined dataset is lacking, making
it particularly useful in environments with scarce or inefficiently generated data,
such as analog layout design in a cutting-edge process node. In analog layout design, the amount of accumulated data is typically limited, apart from a small number of layout revisions. Furthermore, because of concerns about the potential leakage of circuit information and the need to protect intellectual property (IP), it is not feasible to build large datasets of diverse analog layout designs for training neural networks. Moreover, analog circuits are highly sensitive to the process node used, so even circuits with identical functionality may require changes in topology. Consequently, a dedicated dataset would have to be built for each process node, even for the same circuit. Reinforcement learning (RL) is therefore a fitting strategy for device placement optimization.
1. Problem formulation
1) Quantizing design space for states ($s$) and actions ($a$)
To implement a practically computable version of Q-learning, the actions have been quantized into a finite action space. For device placement, five possible actions (north, south, west, east, and stationary) are defined within this space, with each device treated as a state. Because each action must be selectable within a finite space, the environment in which the states move is also quantized into a grid-based format. The process employed is a GAA process node using nanosheet FET devices, for which the available channel widths are constrained to discrete values. Although
it is possible to construct an analog block using devices with disparate channel widths,
this approach requires specific gate length values and the inclusion of spacers. This
method is inefficient from an area perspective and introduces an additional LLE, termed
tapered RX, to neighboring devices [11]. Furthermore, the contact-to-poly pitch (CPP) is fixed for each device gate length,
and a design rule exists that only allows devices with the same gate length to be
merged. Consequently, in order to minimize the size of a circuit block with a specific
function, the channel width and gate length are standardized. As a result, a grid-based
environment, as illustrated in Fig. 4(b), can be implemented for device placement. The horizontal grid is defined as one CPP,
and the vertical grid corresponds to the channel width of the device used in the design.
The blue and orange colors represent different types of devices, abstracted in the
manner shown in Fig. 4(a).
Fig. 4. (a) Device abstraction; each abstracted device is a state to be moved in the proposed Q-learning. (b) Quantized design space. P-type (blue) and N-type (yellow) devices are randomly placed at the initial step.
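As a concrete illustration of this quantization, a minimal sketch of the grid environment and the five-action space is given below. The class and function names (Action, Device, step) and the clipping of moves at the grid boundary are illustrative assumptions rather than part of the proposed tool; only the grid units (one CPP horizontally, one channel width vertically) and the five actions come from the description above.

    # Minimal sketch of the quantized design space and the five-action space.
    # Grid units: one horizontal cell = one CPP, one vertical cell = one channel width.
    from dataclasses import dataclass
    from enum import Enum

    class Action(Enum):
        NORTH = (0, 1)
        SOUTH = (0, -1)
        WEST = (-1, 0)
        EAST = (1, 0)
        STATIONARY = (0, 0)

    @dataclass
    class Device:
        name: str
        dev_type: str      # "P" or "N"
        col: int           # horizontal grid index (CPP units)
        row: int           # vertical grid index (channel-width units)

    def step(device: Device, action: Action, n_cols: int, n_rows: int) -> Device:
        """Move a device by one grid cell; moves beyond the grid boundary are clipped."""
        dc, dr = action.value
        col = min(max(device.col + dc, 0), n_cols - 1)
        row = min(max(device.row + dr, 0), n_rows - 1)
        return Device(device.name, device.dev_type, col, row)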
2) Reward functions with LLE surrogate models
In analog circuits, small variations in the threshold voltage ($V_{th}$) can result
in significant output current deviation from the nominal value. This deviation increases
in proportion to the number of devices, rendering these applications highly susceptible
to threshold voltage variation. By extracting geometric parameters from the placed
devices and utilizing the trained LLE surrogate models, we can directly obtain the
estimated threshold voltage of the devices. This enables placement optimization under
LLE considerations. Therefore, reward functions are integrated with the LLE surrogate
models.
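For illustration, the coupling between the placement and the surrogate models can be sketched as follows. The geometry extractors and the scikit-learn-style predict() interface of the surrogate objects (lod_model, dti_model) are assumptions made for this sketch; they stand in for the trained LLE surrogate models of Section II.

    # Sketch: extract per-device LLE geometry from the current placement and query
    # trained surrogate models for the estimated threshold voltages.
    import numpy as np

    def estimate_vth(devices, lod_model, dti_model, extract_lod, extract_dti):
        """Return per-device Vth estimates from the trained LLE surrogate models.

        extract_lod(d) -> (LOD_a, LOD_b) and extract_dti(d) -> DTI distances are
        placeholder geometry extractors computed from the current placement.
        """
        vth = {}
        for d in devices:
            lod_feat = np.asarray(extract_lod(d)).reshape(1, -1)
            dti_feat = np.asarray(extract_dti(d)).reshape(1, -1)
            vth_lod = float(lod_model.predict(lod_feat)[0])
            vth_dti = float(dti_model.predict(dti_feat)[0])
            vth[d.name] = (vth_lod, vth_dti)
        return vth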
The reward of the ${i}$-th device or ${i}$-th cluster state at time step $t$ in the $x$-th sequence, denoted by $r^{{i}}_{x,t}$, encompasses a number of factors, including the area, wire length, LLE parameters, type clustering, sub-circuit clustering, threshold voltage difference, and the threshold voltage estimated by the LLE surrogate models. The area reward and wire length reward ($r_{area,t}$ and $r_{wire,t}$) are intended to minimize the area and wire length, respectively. They are applied as inverse rewards, in which smaller values of the corresponding metrics result in larger rewards.
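One simple way to realize such inverse rewards is sketched below. The reciprocal form, the half-perimeter wire length (HPWL) estimate, and the scaling constants k_area and k_wire are illustrative assumptions; the exact expressions are defined by the paper's reward equations.

    # Sketch of inverse rewards: a smaller bounding-box area and a smaller total
    # wire length yield larger rewards. Devices expose grid coordinates (col, row).
    def area_reward(devices, k_area=1.0):
        xs = [d.col for d in devices]
        ys = [d.row for d in devices]
        bbox_area = (max(xs) - min(xs) + 1) * (max(ys) - min(ys) + 1)
        return k_area / bbox_area

    def wire_reward(nets, positions, k_wire=1.0):
        """nets: list of device-name lists; positions: name -> (col, row)."""
        total = 0.0
        for net in nets:
            cols = [positions[n][0] for n in net]
            rows = [positions[n][1] for n in net]
            total += (max(cols) - min(cols)) + (max(rows) - min(rows))  # HPWL
        return k_wire / (1.0 + total)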
The LLE parameter rewards $(r_{\Delta LOD,t}, r_{\Delta DTI,t})$ consist of LOD parameter rewards and DTI parameter rewards. These rewards consider the similarity of the LLE parameters $(\Delta LOD, \Delta DTI)$ between devices $i$ and $j$ at time step $t$. The difference in each LLE parameter serves as the criterion in Equation (6) for determining whether a penalty or an optimal reward value should be applied. $\nu_1$ and $\nu_2$ are hyperparameters that are controlled during reward tuning.
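A possible realization of this criterion is sketched below. Because Equation (6) is not reproduced here, the simple thresholding form and the default optimal/penalty values are assumptions; only the roles of $\nu_1$ and $\nu_2$ as tuned thresholds follow the text.

    # Sketch of the LLE parameter rewards: a device pair whose LOD (or DTI)
    # parameters differ by less than a tuned threshold receives the optimal
    # reward, otherwise a penalty.
    def lle_param_reward(delta_lod, delta_dti, nu1, nu2,
                         r_optimal=1.0, r_penalty=-1.0):
        r_dlod = r_optimal if abs(delta_lod) <= nu1 else r_penalty
        r_ddti = r_optimal if abs(delta_dti) <= nu2 else r_penalty
        return r_dlod, r_ddti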
Clustering rewards ($r_{clustering,t}$) are used primarily for two purposes: clustering
by type and clustering by same sub-circuit (ssc). Type refers to the categorization
of devices into N- and P-types to maximize device matching. Within the same sub-circuit, such as a differential pair or a current mirror, and even for multi-finger devices split into multiple single-finger devices as shown in Fig. 7(a), matching between device pairs or groups is critical. Therefore, they must be placed as close together as possible and in geometrically identical environments, such as symmetric and row-based arrangements. To this end, the clustering rewards are used together with the LLE parameter rewards in the proposed algorithm, as explained in the next subsection. In Equation (7), ${\Omega}_t$ refers to an optimal reward that provides additional value when the device groups are placed close enough while respecting layout constraints at time step $t$. ${\overline{d}}_{type,t}$, ${\overline{dx}}_t$, and ${\overline{dy}}_t$ in Equation (7) refer to the distance between devices of the same type, the coordinate difference in the $x$-axis direction, and the coordinate difference in the $y$-axis direction, respectively, at time step $t$. $\theta$, $\mu$, $\sigma$, and $\omega$ are hyperparameters that are controlled during reward tuning.
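The sketch below shows one way the clustering reward could combine these terms. Since Equation (7) is not reproduced here, the linear combination, the closeness test, and the handling of the bonus ${\Omega}_t$ are assumptions; only the meaning of the distance terms and hyperparameters follows the text.

    # Sketch of the clustering reward for one group (same type or same sub-circuit):
    # small spread around the group centroid is rewarded, and a bonus Omega_t is
    # added once the group is placed close enough.
    import statistics

    def clustering_reward(group_positions, theta, mu, sigma, omega,
                          omega_t=1.0, close_enough=1.0):
        """group_positions: list of (col, row) grid cells of the group's devices."""
        cols = [c for c, _ in group_positions]
        rows = [r for _, r in group_positions]
        c_mean, r_mean = statistics.mean(cols), statistics.mean(rows)
        # Mean Manhattan distance of the group members from the group centroid.
        d_type = statistics.mean(abs(c - c_mean) + abs(r - r_mean)
                                 for c, r in group_positions)
        dx = max(cols) - min(cols)          # spread along the x-axis
        dy = max(rows) - min(rows)          # spread along the y-axis
        reward = -omega * (theta * d_type + mu * dx + sigma * dy)
        if d_type <= close_enough:          # close enough: add the optimal bonus
            reward += omega_t
        return reward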
The purpose of the threshold voltage difference reward ($r_{dvth,t}$) is to reduce differences in the estimated threshold voltage within the same sub-circuit topology and among devices of the same type. This reward is calculated through Equation (8). $p$ is the number of sub-circuits in the input circuit, $m$ is the number of devices in the same sub-circuit, and $k$ and $l$ are the total numbers of P-type and N-type devices, respectively. $V_{th,LOD}$ and $V_{th,DTI}$ are calculated using the trained LLE surrogate models explained in Section II. $\alpha$, $\beta$, $\delta$, and $\zeta$ are hyperparameters controlled during reward tuning.
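A sketch of this reward is given below. The pairwise aggregation of $V_{th}$ mismatches and the assignment of $\alpha$, $\beta$, $\delta$, and $\zeta$ to the sub-circuit and type terms are assumptions, as Equation (8) is not reproduced here.

    # Sketch of r_dvth: accumulate pairwise mismatches of the surrogate-estimated
    # Vth within each sub-circuit and within each device type, then weight the
    # LOD- and DTI-related terms with the four tuning hyperparameters.
    from itertools import combinations

    def dvth_reward(subcircuit_vth, p_type_vth, n_type_vth,
                    alpha, beta, delta, zeta):
        """subcircuit_vth: p lists of (vth_lod, vth_dti) tuples, one per sub-circuit;
        p_type_vth / n_type_vth: (vth_lod, vth_dti) tuples of the k P-type and
        l N-type devices."""
        def mismatch(pairs, idx):
            return sum(abs(a[idx] - b[idx]) for a, b in combinations(pairs, 2))

        sub_lod = sum(mismatch(sc, 0) for sc in subcircuit_vth)
        sub_dti = sum(mismatch(sc, 1) for sc in subcircuit_vth)
        typ_lod = mismatch(p_type_vth, 0) + mismatch(n_type_vth, 0)
        typ_dti = mismatch(p_type_vth, 1) + mismatch(n_type_vth, 1)
        # Smaller mismatch -> larger (less negative) reward.
        return -(alpha * sub_lod + beta * sub_dti + delta * typ_lod + zeta * typ_dti)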
The objective of utilizing $r_{vth\_lle,t}$ is to optimize the estimated threshold
voltage ($V_{th}$), which is calculated by the LOD surrogate models. As illustrated
in Fig. 3(b), in contrast to DTI, there is a saturated region with a near-zero gradient at the maximum or minimum threshold voltage. Therefore, the objective is to maximize
or minimize $V_{th, LOD}({LOD}_a,{LOD}_b)$ while simultaneously minimizing the parameters
${LOD}_a$ and ${LOD}_b$. In the case of P-type devices, Equation (9) is employed, whereas for N-type devices, Equation (11) is utilized. As a result, the outputs of Equations (9)-(12) are integrated into Equations (13) and (14) as optimal ($r_{optimal}$) and penalty ($r_{penalty}$) reward threshold values. In
the case of a P-type device with a threshold voltage, $V^{{i}}_{th}$, the reward for
said device ($r^{{i},P-type}_{vth\_lle,t}$) is calculated using Equation (13). Conversely, the reward for device $j$ $\left(r^{{j},N-type}_{vth\_lle,t}\right)$,
an N-type device with threshold voltage $V^{{j}}_{th}$, would be calculated using
Equation (14). Finally, as expressed in Equation (15), the reward value $r_{vth\_lle,t}$ is determined by aggregating the outputs of the aforementioned equations. The two hyperparameters, $w_1$ and $w_2$, are calibrated
during the reward tuning process.
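The sketch below illustrates how such a threshold-based reward could be assembled. The specific threshold comparisons and the $\pm 1$ scores stand in for Equations (13)-(15), whose exact forms are not reproduced here; the convention of pushing P-type and N-type devices toward opposite LOD saturation extremes is likewise an assumption made for the sketch.

    # Sketch of r_vth_lle: compare each device's surrogate-estimated Vth with
    # type-specific optimal/penalty thresholds (standing in for Eqs. (13)-(14)),
    # then combine the per-type sums with weights w1 and w2 (cf. Eq. (15)).
    def vth_lle_reward(p_type_vth, n_type_vth,
                       p_opt_thr, p_pen_thr, n_opt_thr, n_pen_thr,
                       w1, w2):
        def score(vth, opt_thr, pen_thr, maximize):
            if maximize:
                return 1.0 if vth >= opt_thr else (-1.0 if vth <= pen_thr else 0.0)
            return 1.0 if vth <= opt_thr else (-1.0 if vth >= pen_thr else 0.0)

        r_p = sum(score(v, p_opt_thr, p_pen_thr, maximize=True) for v in p_type_vth)
        r_n = sum(score(v, n_opt_thr, n_pen_thr, maximize=False) for v in n_type_vth)
        return w1 * r_p + w2 * r_n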
2. Proposed algorithm
The Q-learning algorithm utilized in this paper is based on Watkins-Dayan Q-learning [12], as shown in Equation (16). $\alpha$ and $\gamma$ represent the learning rate and discount factor, respectively.
Additionally, epsilon-greedy exploration is applied to balance the trade-off between
exploitation and exploration, with probabilities of $1-\epsilon $ and $\epsilon $,
respectively. This approach allows the agent to reinforce the evaluation of known
good actions while also exploring new actions [13]. In this context, $x$ denotes the sequence index, $t$ the step number, ${i}$ the ${i}$-th device, and $n$ the total number of devices. The reward value, state set, and action set of the devices in sequence
$x$ at step $t$ are ${\mathrm{R}}_{x,t}$, ${{\mathbf{S}}}_{x,t},$ and ${{\mathbf{A}}}_{x,t}$
in Equations (17), (18), and (19), respectively. Prior related research [14] applied deep Q-learning in which a single state represents the placement of all devices and the agent selects actions for all devices. In this work, an alternative approach is taken whereby an individual Q-table ($q^{{i}}_{x,t}(s^{{i}}_t,a^{{i}}_t)$) is maintained for the state-action pairs of the ${i}$-th device, rather than applying a shared Q-table to all
devices. It was observed that selecting all actions simultaneously in a single step
of Q-learning led to a failure in convergence to an optimal layout, instead resulting
in a persistent oscillatory pattern. The root cause of this issue lies in the trade-off
relationship between LLE rewards and traditional metrics such as area and wire length,
combined with the dynamic nature of LLEs, which continuously vary and exert influence
based on the real-time relative distances among all devices. Moreover, the Q-table
structure enables the effective representation of state and action pairs for individual
devices, thereby facilitating straightforward management. Consequently, as the number
of devices increases, the number of Q-tables increases correspondingly. However, this
approach may not fully capture the interactions between devices, especially when the
optimal action for one device depends on the state or action of another device. To
address this limitation, we applied a total reward (${\mathrm{R}}_{x,t}$) that considers
all devices in the circuit, as shown in Equation (17). Fig. 5 illustrates the structure of the proposed algorithm. Based on this structure, in order to emulate the knowledge-based design flow used by experts, a sequential algorithm is implemented
to identify the optimal placement and ensure efficient convergence to the optimal
solution. For this purpose, rather than simultaneously applying all reward functions
previously described, each sequence selectively uses them according to Equation (20) to calculate the next reward value at step $t+1$. The agent selects the current actions
for devices that yield the maximum Q values based on the current states array and
the reward at sequence $x$. After taking actions, the next states at step $t+1$ include
updated features for each device such as relative distances, device type, length,
coordinates, and LLE parameters. These features interact with the sequentially selected rewards. Fig. 6 depicts the overall proposed algorithm.
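A compact sketch of this sequential, per-device Q-learning loop is given below. The environment interface (observe, apply, action_space) and the per-sequence reward schedule are illustrative assumptions, while the update itself follows the Watkins-Dayan rule of Equation (16) with one Q-table per device and the shared total reward ${\mathrm{R}}_{x,t}$ of Equation (17).

    # Sketch of the sequential, per-device Q-learning loop (cf. Figs. 5 and 6).
    # Each device i keeps its own Q-table q[i][(state, action)]; all devices share
    # the total reward R_{x,t}. States returned by env.observe() must be hashable.
    import random
    from collections import defaultdict

    def run_sequences(env, devices, reward_fns_per_sequence,
                      n_steps, alpha=0.1, gamma=0.9, epsilon=0.1):
        """devices: hashable device identifiers (e.g., names);
        reward_fns_per_sequence: one list of active reward functions per sequence."""
        q = {d: defaultdict(float) for d in devices}     # one Q-table per device

        for x, active_rewards in enumerate(reward_fns_per_sequence):   # cf. Eq. (20)
            states = {d: env.observe(d) for d in devices}
            for t in range(n_steps):
                # Epsilon-greedy selection per device: explore with probability epsilon.
                actions = {}
                for d in devices:
                    if random.random() < epsilon:
                        actions[d] = random.choice(env.action_space)
                    else:
                        actions[d] = max(env.action_space,
                                         key=lambda a: q[d][(states[d], a)])

                env.apply(actions)                         # move devices on the grid
                next_states = {d: env.observe(d) for d in devices}
                R = sum(fn(env) for fn in active_rewards)  # shared total reward R_{x,t}

                # Watkins-Dayan update (Eq. (16)) applied to each device's Q-table.
                for d in devices:
                    best_next = max(q[d][(next_states[d], a)] for a in env.action_space)
                    td = R + gamma * best_next - q[d][(states[d], actions[d])]
                    q[d][(states[d], actions[d])] += alpha * td
                states = next_states
        return q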
Fig. 5. Structure of the proposed Q-learning's agent and environment.
Fig. 6. Sequential Q-learning algorithm.