
  1. (Department of Electronic Engineering, Kwangwoon University, 615, Bima, 20, Gwangun-ro, Nowon-gu, Seoul 139-701, Korea)



CMOS, IO transceiver, scalable, delay compensation, pre-emphasis, FIR driver

I. INTRODUCTION

Recently, demand for high-resolution displays has increased the per-pin data rate required for high chip-to-chip data throughput. The data transmission speed of a display varies (even during real-time operation) depending on the image content, and the interface circuits must cover various standards within a restricted power and noise budget. For the interface scheme to operate at various speeds, the high-speed digital logic should be scalable. In [1], diverse clock phases with correlations are used to solve the hold-time violation (HTV) problem, and scalable operation is achieved. However, a multi-phase clock must be available, and keeping the phases equally spaced at high speed is an issue. A scheme that adaptively selects the clock polarity after detecting an HTV has been suggested in [2]. Since cascaded digital logic may require multiple selections along the path, multiple adaptive loops need to be implemented to remove all HTVs in that scheme.

In this paper, we theoretically analyze the mechanism of HTV events at multiple data speeds and propose an efficient design methodology that avoids HTV across scalable data speeds. All high-speed digital paths in our transceiver are designed to be scalable by means of a delay matching technique. The transceiver covers a speed range of 2.65 Gb/s-6.4 Gb/s, which meets various standards such as DP1.4 (5.4 Gb/s), LPDDR5 (6.4 Gb/s), SATA3 (6 Gb/s) and XAUI (3.125 Gb/s). The measured performance is compared with that of similar applications [3,4]. In addition, the half-rate design of the drivers and samplers in the front-end can reduce the power significantly.

II. ARCHITECTURE

Fig. 1 presents our proposed 2-channel transceiver, which operates at scalable data speeds. The pseudo-random bit sequence (PRBS) generator produces 18 lanes of 147-356 Mb/s parallel signals with pattern lengths of $2^{23}-1$ and $2^{31}-1$. The following 18:2 serializers assemble them into 2 lanes of 1.325-3.2 Gb/s EVEN/ODD data. As shown in Fig. 1(b), the tap signal generator delays the half-rate D$_{\mathrm{ODD}}$/D$_{\mathrm{EVEN}}$ signals to generate the PRE/MAIN/POST signals. The tap signal generator consists of consecutive latches, and each latch stage delays the data by 0.5 UI of the data speed. As shown in the timing diagram of Fig. 1(b), the PRE/MAIN/POST signals are taken from the nodes where the three tap signals are aligned and are provided to the drivers. As shown in Fig. 1(c), our voltage-mode driver consumes one quarter of the current of a current-mode driver. The driver consists of three taps (PRE/MAIN/POST), which have 3, 15, and 7 segments, respectively. The number of ON segments of each tap is adjusted through PU SEG, so that the pre-emphasis amplitude can be adjusted for various channel losses.
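The effect of a three-tap driver on an NRZ stream can be illustrated with an ideal discrete-time FIR model. The following is a minimal sketch, not the circuit itself: it assumes that each tap weight is proportional to the number of ON segments of that tap, that the PRE and POST taps subtract from the MAIN tap, and that the output is normalized to the total number of ON segments. The function name `fir_preemphasis` and its arguments are hypothetical.

```python
def fir_preemphasis(bits, on_pre, on_main, on_post):
    """Ideal 3-tap FIR model of a PRE/MAIN/POST driver (illustrative only).

    Assumption: tap weight = ON segments of the tap / total ON segments,
    with the PRE and POST taps subtracting from the MAIN tap.
    """
    total = on_pre + on_main + on_post
    if total == 0:
        raise ValueError("at least one segment must be ON")
    c_pre, c_main, c_post = on_pre / total, on_main / total, on_post / total

    x = [1.0 if b else -1.0 for b in bits]  # normalized NRZ levels
    y = []
    for n, cur in enumerate(x):
        prev = x[n - 1] if n > 0 else cur          # hold the first bit at the start
        nxt = x[n + 1] if n + 1 < len(x) else cur  # hold the last bit at the end
        y.append(c_main * cur - c_pre * nxt - c_post * prev)
    return y


# Example setting (assumed): 15 MAIN and 5 of the 7 POST segments ON, PRE off.
levels = fir_preemphasis([0, 0, 0, 1, 1, 1, 0], on_pre=0, on_main=15, on_post=5)
print(levels)
```

In this model a bit that follows a transition is driven at the full normalized level, while a repeated bit is attenuated; this is the behaviour that the segment settings trade against channel loss.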

Fig. 1. (a) Block diagram of the 2-channel TRx; (b) tap signal generator; (c) FIR driver; (d) sampler.
../../Resources/ieie/JSTS.2024.24.3.184/fig1.png

The equalizing drivers generate 2.65-6.4 Gb/s differential non-return-to-zero (NRZ) signals. Along the path, all high-speed logic is scalable because HTV is avoided by matching delays across the speed range. In our receivers, 1-stage continuous-time linear equalizers (CTLE) mitigate the channel inter-symbol interference and improve the bit-error rate (BER) performance. As shown in Fig. 1(d), each sampler consists of a strong-arm latch followed by an SR latch, which converts the return-to-zero (RZ) output of the strong-arm latch into an NRZ signal. The strong-arm latch compares the voltage levels of the differential inputs (V$_{\mathrm{in,p}}$, V$_{\mathrm{in,n}}$), which carry 2.65-6.4 Gb/s data, at the rising edge of the clock recovered by the CDR. If V$_{\mathrm{in,p}}$ is larger than V$_{\mathrm{in,n}}$, OUTP is resolved as 1; if V$_{\mathrm{in,p}}$ is smaller than V$_{\mathrm{in,n}}$, OUTP is resolved as 0.

The following 2:18 deserializers parallelize the EVEN/ODD data into 18 lanes $\times $ 147-356 Mb/s signals, and the PRBS checkers detect errors in the received signals. The BER counters can count up to $2^{40}$ errors, and the error count can be monitored in real time via the serial peripheral interface (SPI). The scalable logic enables data operation at various speeds, up to the maximum set by the clock speed constraint.
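The lane rates quoted for the serializer and deserializer follow directly from the 18-lane parallel width and the half-rate EVEN/ODD split. The short sketch below reproduces that arithmetic; the function name `lane_rates` is hypothetical.

```python
LANES = 18  # parallel PRBS lanes per channel, as described in the architecture


def lane_rates(line_rate_bps):
    """Return (half-rate EVEN/ODD rate, per-lane parallel rate) in bit/s."""
    half_rate = line_rate_bps / 2     # each of the two EVEN/ODD lanes
    parallel = line_rate_bps / LANES  # each of the 18 parallel lanes
    return half_rate, parallel


for rate in (2.65e9, 6.4e9):
    half, par = lane_rates(rate)
    print(f"{rate / 1e9:.2f} Gb/s line rate: "
          f"{half / 1e9:.3f} Gb/s half-rate, {par / 1e6:.1f} Mb/s per lane")
```

The two ends of the 2.65-6.4 Gb/s range map to roughly 147 Mb/s and 356 Mb/s per parallel lane, matching the figures given in the text.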

Fig. 2 illustrates the delay matching technique for scalable-speed operation of the high-speed logic in our architecture. Fig. 2(a) shows a typical case of consecutive positive-edge-triggered flip-flops (FFs) that share a single clock source. The input clock is inverted for FF2 because PVT variation and line delay mismatch can cause a timing mismatch between the data and the clock at FF2, which may cause setup- and hold-time violations. The data delay $t_{d}$ and the clock delay $t_{c}$ arise from the propagation delay of the combinational logic that implements the logical functions (i.e., muxing, demuxing, or clock dividing) or from the clock-to-Q delay. In all cases, $t_{d}$ and $t_{c}$ depend not on the data speed or clock speed but on the propagation delay of the logic circuits. Fig. 2(b) is a simple implementation of the logic blocks as inverter chains, used to find the changes in $t_{d}$ and $t_{c}$ under PVT variation. The values of $t_{d}$ and $t_{c}$ are 468.9 ps and 157.8 ps at 3.2 Gb/s in the typical corner at 27℃, and Table 1 summarizes the data and clock delays across corners and temperatures. In Table 1, $t_{d,corner}$ and $t_{c,corner}$ are the data delay and clock delay at each corner and temperature, and $t_{d,var}$ and $t_{c,var}$ are defined as $t_{d,var}= t_{d,corner}- t_{d}$ and $t_{c,var}= t_{c,corner}- t_{c}$. When Clock B is located at the optimal point of Data B in the typical corner at 27℃, the deviation of Clock B from the optimal point of Data B is defined as $\left| t_{d,var}- t_{c,var}\right| $. As listed in Table 1, the maximum value of $\left| t_{d,var}- t_{c,var}\right| $ at the maximum data rate (3.2 Gb/s) of our circuit is 95 ps, or 0.3 UI, in the ss corner at 120℃. Usually, an eye opening of more than 0.8 UI is secured in a digital circuit. Since the delay difference due to PVT variation does not vary with the data speed, the lower the data speed, the smaller the fraction of 1 UI that $\left| t_{d,var}- t_{c,var}\right| $ occupies. As a result, this circuit reduces HTV caused by PVT variation.

Table 1. Data and Clock Delay with PVT variations

| Corner | Temperature [℃] | $t_{d,corner}$ [ps] | $t_{c,corner}$ [ps] | $t_{d,var}$ [ps] | $t_{c,var}$ [ps] | $\lvert t_{d,var}- t_{c,var}\rvert$ [ps] | $\lvert t_{d,var}- t_{c,var}\rvert$ [UI @ 3.2 Gb/s] |
|---|---|---|---|---|---|---|---|
| tt | -40 | 460.6 | 156.9 | -8.3 | -0.9 | 7.4 | 0.02 |
| tt | 120 | 485.3 | 161.7 | 16.4 | 3.9 | 12.5 | 0.04 |
| ss | -40 | 590.5 | 200 | 121.6 | 42.2 | 79.4 | 0.25 |
| ss | 120 | 607.5 | 201.4 | 138.6 | 43.6 | 95 | 0.3 |
| ff | -40 | 368.4 | 124.7 | -100.5 | -33.1 | 67.4 | 0.22 |
| ff | 120 | 398.2 | 132.9 | -70.7 | -24.9 | 45.8 | 0.15 |
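The derived columns of Table 1 can be recomputed from the corner delays and the typical-corner values of $t_{d}$ and $t_{c}$. The sketch below does this and also expresses the worst-case deviation in UI at a lower data rate, to illustrate why the same picosecond deviation occupies a smaller fraction of the eye at lower speed. The function and variable names are hypothetical; the numbers are those of Table 1.

```python
T_D_TYP_PS = 468.9  # data delay t_d, typical corner, 27 C
T_C_TYP_PS = 157.8  # clock delay t_c, typical corner, 27 C

# (corner, temperature in C) -> (t_d,corner, t_c,corner) in ps, from Table 1
CORNER_DELAYS_PS = {
    ("tt", -40): (460.6, 156.9),
    ("tt", 120): (485.3, 161.7),
    ("ss", -40): (590.5, 200.0),
    ("ss", 120): (607.5, 201.4),
    ("ff", -40): (368.4, 124.7),
    ("ff", 120): (398.2, 132.9),
}


def clock_deviation_ps(t_d_corner, t_c_corner):
    """|t_d,var - t_c,var| in ps for one corner."""
    t_d_var = t_d_corner - T_D_TYP_PS
    t_c_var = t_c_corner - T_C_TYP_PS
    return abs(t_d_var - t_c_var)


def as_ui(deviation_ps, data_rate_bps):
    """Express a deviation in ps as a fraction of one unit interval."""
    return deviation_ps * 1e-12 * data_rate_bps


worst_key = max(CORNER_DELAYS_PS,
                key=lambda k: clock_deviation_ps(*CORNER_DELAYS_PS[k]))
worst_ps = clock_deviation_ps(*CORNER_DELAYS_PS[worst_key])
for rate in (3.2e9, 1.325e9):
    print(worst_key, f"{worst_ps:.1f} ps = {as_ui(worst_ps, rate):.2f} UI "
                     f"at {rate / 1e9:.3f} Gb/s")
```

The second rate, 1.325 Gb/s, is the lower end of the half-rate EVEN/ODD range and is used here only as an example operating point.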

Fig. 2. Simulation testbench for the delay matching technique: (a) typical flip-flop logic in which a timing issue occurs at various speeds; (b) logic blocks made up of inverter chains; (c) timing illustration of Data B and Clock B.
../../Resources/ieie/JSTS.2024.24.3.184/fig2.png

Fig. 3(a) shows a timing diagram for the case in which $t_{d}$ is $3\alpha $, where $\alpha $ is assumed to be 0.5 UI at the maximum data speed for illustration. To avoid HTV, $t_{c}$ can be delayed by $\alpha $ (Method 1) or by $3\alpha $ (Method 2). Fig. 3(b) shows that both Method 1 and Method 2 operate without HTV at 2/3 of the maximum data speed. If both the data and clock speeds are halved, as shown in Fig. 3(c), Method 1 results in HTV. Method 2 avoids HTV over the continuous, wide range of data rates between the maximum data speed and 0.5 ${\times}$ the maximum data speed.

Fig. 3. Illustration of delay matching techniques for scalable speed: (a) Timing diagram for maximum speed ( $t_{d}$ > $t_{c}$ case); (b) Timing diagram for 2/3 speed of the maximum ( $t_{d}$ > $t_{c}$ case); (c) Timing diagram for 0.5 speed of the maximum ( $t_{d}$ > $t_{c}$ case); (d) Timing diagram for maximum speed ( $t_{d}$ < $t_{c}$ case); (e) Timing diagram for 2/3 speed of the maximum ( $t_{d}$ < $t_{c}$ case); (f) Timing diagram for 0.5 speed of the maximum ( $t_{d}$ < $t_{c}$ case).
../../Resources/ieie/JSTS.2024.24.3.184/fig3.png

Because the clock trigger timing remains at the point of optimal data BER under Method 2, the consecutive FF scheme can operate without HTV regardless of the data speed. Fig. 3(d)-(f) show the case in which $t_{c}$ is $3\alpha $ and the delay compensation is applied to $t_{d}$ by $\alpha $ and $3\alpha $. Similarly, Method 2 ($t_{d}$=$3\alpha $) avoids HTV at various data speeds.
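The three speed cases of Fig. 3 can be checked with a simple phase calculation. The following is a sketch under stated assumptions: ideal edges, a clock period $T$ equal to 1 UI as in Fig. 2(a), and FF2 triggered by the inverted clock. The symbol $\phi$ (position of the Clock B edge inside the Data B eye, with $\phi = T/2$ the eye center and $\phi = 0$ a data transition) is introduced here for illustration only.

```latex
\begin{align*}
\phi &= \left( t_{c} + \tfrac{T}{2} - t_{d} \right) \bmod T \\[4pt]
\text{Method 1 } (t_{d}=3\alpha,\ t_{c}=\alpha):\quad
\phi &= \left( \tfrac{T}{2} - 2\alpha \right) \bmod T \\
T = 2\alpha \ (\text{maximum speed}):\quad
\phi &= (-\alpha) \bmod 2\alpha = \alpha = \tfrac{T}{2} \\
T = 3\alpha \ (2/3 \text{ speed}):\quad
\phi &= \left(-\tfrac{\alpha}{2}\right) \bmod 3\alpha = \tfrac{5\alpha}{2} = \tfrac{5T}{6} \\
T = 4\alpha \ (0.5 \text{ speed}):\quad
\phi &= 0 \bmod 4\alpha = 0 \quad (\text{clock edge on a data transition, HTV}) \\[4pt]
\text{Method 2 } (t_{d}=t_{c}=3\alpha):\quad
\phi &= \tfrac{T}{2} \bmod T = \tfrac{T}{2} \quad \text{for every } T
\end{align*}
```

With the roles of $t_{d}$ and $t_{c}$ exchanged, the same expression gives $\phi = (T/2 + 2\alpha) \bmod T$ for Method 1, which is again zero at $T = 4\alpha$, while matched delays keep $\phi = T/2$.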

In Fig. 2(a), the speed of the input data is the same as that of the input clock, and the value of $\alpha $ at the maximum data rate (3.2 Gb/s) is set to 156.25 ps. In Method 1, $t_{d}=3\alpha $ and $t_{c} = \alpha $; in Method 2, $t_{d}=3\alpha $ and $t_{c}=3\alpha $. In the simulation, the pattern checker detects errors and calculates the BER at each frequency while the input clock speed is swept over 0.1-3.2 GHz. Fig. 4 shows the resulting BERs of Method 1 and Method 2 over this wide range of data rates. In Method 2, the BER is close to 0 across 0.1-3.2 GHz, whereas in Method 1 the BER increases near 0.5 ${\times}$ the maximum data rate.
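The trend of the frequency sweep can be approximated with an idealized timing model. The sketch below is not the authors' simulation: it ignores jitter, setup/hold windows, and PVT variation, assumes FF2 is clocked by the inverted clock as in Fig. 2(a), and reports a timing margin rather than a BER. The function name `sampling_margin` and the 0.1 GHz step are assumptions made for illustration.

```python
ALPHA_PS = 156.25  # alpha = 0.5 UI at the maximum rate of 3.2 Gb/s


def sampling_margin(t_d_ps, t_c_ps, freq_ghz):
    """Distance from the FF2 clock edge to the nearest Data B transition,
    as a fraction of one clock period (0.5 = eye centre, 0 = on a transition).
    """
    period_ps = 1000.0 / freq_ghz
    # FF2 uses the inverted clock, hence the half-period offset.
    phase = (t_c_ps + period_ps / 2 - t_d_ps) % period_ps
    frac = phase / period_ps
    return min(frac, 1.0 - frac)


METHODS = {
    "Method 1": (3 * ALPHA_PS, 1 * ALPHA_PS),  # (t_d, t_c)
    "Method 2": (3 * ALPHA_PS, 3 * ALPHA_PS),
}

freqs_ghz = [k / 10 for k in range(1, 33)]  # 0.1 ... 3.2 GHz
worst = {}
for name, (t_d, t_c) in METHODS.items():
    margins = [(sampling_margin(t_d, t_c, f), f) for f in freqs_ghz]
    worst[name] = min(margins)  # (smallest margin, frequency where it occurs)
    print(name, "worst margin %.3f at %.1f GHz" % worst[name])
```

In this model the margin of Method 1 collapses at 1.6 GHz, which is 0.5 ${\times}$ the maximum rate, while Method 2 stays at the eye center over the whole sweep.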

Fig. 4. Simulated BER versus frequency for Method 1 and Method 2.
../../Resources/ieie/JSTS.2024.24.3.184/fig4.png

Fig. 5(a) and (b) show the circuits of the 3:1 serializer in the transmitter and the 1:3 deserializer in the receiver, where the timing issue occurs at the second FF of the consecutive FFs that share a single clock source. In the serializer, as shown in Fig. 5(a), the mux has to use a divide-by-3 clock, and $t_{d}$ is larger than $t_{c}$. For scalable operation, the delay compensation should be applied to $t_{c}$ by adding a chain of buffers. In the deserializer, on the other hand, $t_{d}$ is smaller than $t_{c}$, so the compensation is applied to $t_{d}$ in the same manner, as shown in Fig. 5(b). The delay compensation buffers for the deserializer can be placed at either A or B. Placing them at A would affect the timing of FF1, so B is the better choice.

Fig. 5. Delay matching techniques used in our transceiver IP: (a) 3:1 Serializer; (b) 1:3 Deserializer.
../../Resources/ieie/JSTS.2024.24.3.184/fig5.png

III. MEASUREMENT

Fig. 6 presents the measurement results of our transceiver at 3.2 Gb/s and 6.4 Gb/s. A Tektronix TDS6154C was used to measure the Tx eye performance, and the built-in BER counter in the Rx measures the BER while the sampler clock phase is swept horizontally. The estimated parasitic loading of the Tx output, pad, and channel is 7.5 pF, which results in a 17.8 dB channel loss at the Nyquist rate. The measured vertical eye openings for channel 1 and channel 2 are 94.8 mV/993 mV and 59.5 mV/997 mV, respectively, at 6.4 Gb/s without pre-emphasis, as shown in Fig. 6(a) and (b). With pre-emphasis on, the vertical eye openings improve to 221.6 mV/577.8 mV and 185.1 mV/534.3 mV. Fig. 6(c) shows the Rx horizontal bathtub curves measured with the built-in BER counter in our IP at 3.2 Gb/s and 6.4 Gb/s with and without pre-emphasis. The horizontal eye opening is improved by 0.23 UI and 0.25 UI at $10^{-9}$ BER.

Fig. 6. Measurement results of our transceiver at 3.2 Gb/s and 6.4 Gb/s: (a) Tx output eye opening w/ and w/o FIR at channel1 (6.4 Gb/s); (b) Tx output eye opening w/ and w/o FIR at channel2 (6.4 Gb/s); (c) Rx bathtub curves w/ and w/o FIR for 3.2 and 6.4 Gb/s (channel1).
../../Resources/ieie/JSTS.2024.24.3.184/fig6.png

Our transceiver has been fabricated in a 65 nm CMOS process and occupies a die area of 1.02 $\mathrm{mm}^{2}$. Fig. 7 shows the layout of our IP and the measurement setup. Table 2 summarizes the measured performance of our transceiver and compares it with prior art. In measurement, the proposed transceiver shows successful data transmission over the entire speed range of 2.65 Gb/s-6.4 Gb/s owing to the scalable design technique. Our transceiver consumes 72 mW/ch from a 1.2 V power supply.

Fig. 7. Layout for 2-ch transceivers (1.02 $\mathrm{mm}^{2}$) and measurement setup.
../../Resources/ieie/JSTS.2024.24.3.184/fig7.png
Table 2. Comparison Table

| | [3] | [4] | This work |
|---|---|---|---|
| Technology | 28 nm CMOS | 90 nm CMOS | 65 nm CMOS |
| Data rate (Gb/s) | 0.5 - 6.6 | 4 | 2.65 - 6.4 |
| Supply (V) | 1 | - | 1.2 |
| Power (mW/ch) | 129 | 56 | 72 |
| Channel Loss (dB) | 22 | 18.2 | 17.8 |
| Tx Vertical eye opening (mV) | 180 | - | 221.6 (FR4) |
| Rx Horizontal eye opening (UI) | 0.25 (@ $10^{-9}$) | 0.2 (@ $10^{-9}$) | 0.25 (@ $10^{-9}$) |
| Swing (mV) | - | 250 - 1000 | 577.8 |
| Single Tx/Rx Area ($\mathrm{mm}^{2}$/ch) | 0.64 | 1.11 | 0.51 |
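A per-bit energy figure is a common way to read the power and data-rate rows of Table 2 together. The sketch below computes it under a loudly stated assumption: that each reported power corresponds to the maximum data rate of that design, which the table itself does not state. The function name `energy_per_bit_pj` is hypothetical.

```python
def energy_per_bit_pj(power_mw, rate_gbps):
    """Energy per bit in pJ/bit; mW divided by Gb/s is numerically pJ/bit."""
    return power_mw / rate_gbps


# (power in mW/ch, assumed operating rate in Gb/s) taken from Table 2.
# Assumption: power is quoted at the maximum data rate of each design.
DESIGNS = {
    "[3]": (129.0, 6.6),
    "[4]": (56.0, 4.0),
    "This work": (72.0, 6.4),
}

for name, (power_mw, rate_gbps) in DESIGNS.items():
    print(f"{name}: {energy_per_bit_pj(power_mw, rate_gbps):.2f} pJ/bit")
```

Because the designs use different process nodes and channel losses, these numbers are only a rough normalization and not a like-for-like ranking.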

IV. CONCLUSIONS

A design methodology for high-speed clock-triggered logic that supports scalable-speed operation has been proposed and used to implement complete 2-channel IO transceivers. The HTV timing issue at various data speeds has been analyzed theoretically. The IP shows successful, error-free data transmission over the speed range of 2.65 Gb/s-6.4 Gb/s.

ACKNOWLEDGMENTS

This work was supported in part by the ATC+ (Advanced Technology Center plus) Program through the Korea Evaluation Institute of Industrial Technology under Grant 20017980 and was supported by the Research Grant of Kwangwoon University in 2022. The EDA tool was supported by the IC Design Education Center (IDEC), Korea.

References

[1] Frans, Y., Carey, D., Erett, M. et al.: ‘A 0.5-16.3 Gb/s Fully Adaptive Flexible-Reach Transceiver for FPGA in 20 nm CMOS’, IEEE Journal of Solid-State Circuits, 2015, 50, 8, pp. 1932-1944, doi:10.1109/JSSC.2015.2413849.
[2] Abdollahi, R., Hadidi, K. and Khoei, A.: ‘A Simple and Reliable System to Detect and Correct Setup/Hold Time Violations in Digital Circuits’, IEEE Transactions on Circuits and Systems I: Regular Papers, 2016, 63, 10, pp. 1682-1689, doi:10.1109/TCSI.2016.2582239.
[3] Savoj, J., Hsieh, K.C.H., An, F.T. et al.: ‘A Low-Power 0.5-6.6 Gb/s Wireline Transceiver Embedded in Low-Cost 28 nm FPGAs’, IEEE Journal of Solid-State Circuits, 2013, 48, 11, pp. 2582-2594, doi:10.1109/JSSC.2013.2274824.
[4] Faust, A.C., Narasimha, R.L., Bhatia, K. et al.: ‘FEC-based 4 Gb/s backplane transceiver in 90 nm CMOS’, Proceedings of the IEEE 2012 Custom Integrated Circuits Conference, San Jose, CA, USA, 9-12 Sept. 2012, doi:10.1109/CICC.2012.6330665.
Goohyung Chung
../../Resources/ieie/JSTS.2024.24.3.184/au1.png

Goohyung Chung received the Bachelor of Science (B.S.) degree from the Department of Electronic Engineering, Kwangwoon University, Korea, in 2022. He is currently pursuing the Master of Science (M.S.) degree at Kwangwoon University, Korea. His current research field is the design of clock and data recovery (CDR) circuits, including high-speed IO circuits.

Kyoungub Cho
../../Resources/ieie/JSTS.2024.24.3.184/au2.png

Kyoungub Cho received the Bachelor of Science (B.S.) degree from the Department of Electronic Engineering, Kwangwoon University, Korea, in 2022. He is currently pursuing the Master of Science (M.S.) degree at Kwangwoon University, Korea. His current research field is the design of clock generation circuits, including phase-locked loops (PLL), and high-speed IO circuits.

Taehyoun Oh (S’05)
../../Resources/ieie/JSTS.2024.24.3.184/au3.png

Taehyoun Oh (S’05) received the Bachelor of Science (B.S.) and Master of Science (M.S.) degrees in Electrical Engineering from Seoul National University in 2005 and 2007, respectively. He received his Ph.D. degree in Electrical Engineering from the University of Minnesota, Minneapolis, under the supervision of Dr. Ramesh Harjani. His doctoral research focused on high-speed I/O circuits and architectures. During the summer of 2010, he worked on I/O channel modeling at the AMD Boston Design Center, MA. In the fall semester of 2011, he conducted research on I/O architecture and jitter budgeting of the link at Intel Corp., CA. In the fall of 2012, he joined the IBM system technology group, NY, where he worked on performance verification of high-speed decision feedback equalizers for server processors. In the spring of 2013, he joined the Department of Electronic Engineering at Kwangwoon University, Seoul, Korea, as an assistant professor. His current research interest is focused on clock generation IC design.