
  1. (Department of Electronics and Electrical Engineering, Dankook University, Yong-In, Korea)



Keywords: Bitonic sorting network, bit-serial bitonic sorting network, sorting network, odd-even sorting network, bit-serial operation

I. INTRODUCTION

Sorting is a widely used primitive operation in most data-centric applications such as big data [1], image processing [2], database query operations [3], wireless local area networks [4], and ATM switching [5]. For applications that require high performance, sorting is often performed in hardware. Because of its speed advantage over software approaches, developing an efficient hardware sorting unit has been a significant challenge in overcoming the computational bottleneck of sorting problems.

Many hardware sorters, called sorting networks, have been developed. In 1968, Batcher presented the bitonic sorting network and the odd-even sorting network [6], which have $O((logN)^2)$ time complexity and $O(N(logN)^2)$ space complexity. In 1993, Liszka and Batcher presented a generalized bitonic sorting network that sorts general bitonic sequences of any size using an odd-merge method [7]. The AKS sorting network [8] proposed by Ajtai et al. has $O(logN)$ time complexity; however, the constant hidden by the O-notation is so large that it is much slower in practice than other parallel sorting networks, making it unsuitable for actual implementation. Although other sorting networks with $O(logN)$ time complexity have been presented [9, 10], the bitonic and odd-even sorting networks remain the most practical solutions [11, 12].

Some modified bitonic sorting networks that are efficient in size or power have been developed. The recirculating bitonic sorting network [12, 13] was proposed to minimize communication in the bitonic sorting network. It is composed of one column of comparators and a $(logN - 1)$-level switch network. By combining the comparators and the switch network, this architecture can realize the connection pattern of any step in the bitonic sorting network. Bitonic sorting is performed by recirculating the data through this architecture $k(k + 1)$ times. Although this architecture saves hardware, it increases the power consumption and latency of the sorting network.

A power- and area-efficient bitonic sorting network that uses unary coding has been proposed [14-16]. With unary coding, $m$-bit keys are encoded into $2^m$-bit streams, the streams are processed in a pipelined fashion by one-bit comparators, and the results are decoded back into $m$-bit keys. The unary coding method minimizes the compare-and-swap units, so the size and power are greatly reduced. However, it can be used only for low-speed applications, because it requires encoding and decoding processes and takes longer to operate on the $2^m$-bit streams. In 2025, Han et al. implemented a unary-coding-based bitonic sorting network with single flux quantum (SFQ) circuits [16], a high-speed and low-power technology for large-scale superconducting digital systems. Their experiments show that the unary coding method greatly reduces the area of the sorting network, but the power saving is limited to 8-bit inputs because of the longer processing time for the $2^m$-bit streams. These results indicate that the unary coding method is suitable for circuits with low precision or limited hardware resources.

Although some modified bitonic sorting networks have succeeded in reducing the area and/or power consumption of the conventional bitonic sorting network, none of them is faster than the normal bitonic sorting network. A design of a bit-serial bitonic sorting network is presented in this paper. Compared to the normal bitonic sorting network, the new bit-serial bitonic sorting network is efficient not only in size and power but also in speed. It is implemented with conventional CMOS technologies, so it can easily be employed in the many applications that require a high-speed and/or low-cost hardware sorting network.

II. BITONIC SORTING NETWORK

Most sorting algorithms are based on the compare-and-swap (CAS) operation, which compares the magnitudes of two inputs and swaps their positions according to the result. For a conventional bitonic sorting network that sorts $N$ $m$-bit binary inputs, the CAS operation is implemented by an $m$-bit comparator and an $m$-bit multiplexer. The structure of the CAS unit is presented in Fig. 1(a), and the symbol usually used in drawing sorting networks is shown in Fig. 1(b). The arrow in the symbol represents the direction of sorting. Using this symbol, a 16-input bitonic sorting network can be depicted as in Fig. 2.
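As a software reference for the CAS behavior described above (a behavioral sketch, not the hardware of Fig. 1; the function name is illustrative), a word-level compare-and-swap can be modeled as:

```python
def cas(a, b, descending=True):
    """Word-level compare-and-swap: a model of an m-bit comparator
    plus multiplexer. For descending order the larger value is
    routed to the first output (the arrow direction in Fig. 1(b))."""
    if (a < b) == descending:
        return b, a          # swap the two inputs
    return a, b

print(cas(11, 52))                      # (52, 11): descending order
print(cas(11, 52, descending=False))    # (11, 52): ascending order
```

A sorting network is then simply a fixed wiring of such units.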

Fig. 1. The structure and symbol of CAS unit (a) block diagram (b) symbol.


Fig. 2. Architecture of 16-input bitonic sorting network.


Bitonic merge is the principal operation of the bitonic sorting algorithm. It merges two $N/2$-element data sets sorted in opposite directions into $N$ sorted data. A bitonic sorting network consists of $k$ stages to sort $N (= 2^k)$ unsorted inputs. In the $i$-th stage, the $N$ inputs are partitioned into $2^{k-i}$ groups of $2^i$ adjacent inputs. Each group is composed of two sets of inputs sorted in opposite directions, and the group is sorted through the bitonic merge operation. The unsorted $2^k$ inputs are sorted by passing through stage 1 to stage $k$ as in Fig. 2. The $i$-th stage consists of $i$ CAS steps, so $k(k+1)/2$ CAS steps are required for a bitonic sorting network to sort $2^k$ inputs. The latency of a bitonic sorting network ($t_{SN}$) is

(1)
$t_{SN} = t_s \cdot k(k + 1)/2,$

where $t_s$ is the average step delay of the sorting network, which is almost the same as the delay of a CAS unit ($t_{cas}$). Each step is composed of $N/2$ CAS units, so the number of CAS units in a bitonic sorting network is $N logN(logN + 1)/4$; hence the bitonic sorting network has $O(N(logN)^2)$ hardware complexity. Since $k(k + 1)/2$ in Eq. (1) is constant for a given $N$, $t_s$ is the only parameter affecting the speed of the bitonic sorting network. Therefore, the design of the CAS unit greatly affects the speed, power, and size of a bitonic sorting network.
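The stage and step structure described above can be sketched behaviorally (a software model under the paper's stage indexing, not a hardware description; the partner-index arithmetic `i ^ stride` is one common way to wire the CAS pairs):

```python
def bitonic_sort(data, descending=True):
    """Behavioral model of the network in Fig. 2: stage i (group
    size 2**i) runs i CAS steps, k(k+1)/2 steps total for N = 2**k."""
    a, n = list(data), len(data)
    assert n > 0 and n & (n - 1) == 0, "N must be a power of two"
    steps = 0
    size = 2
    while size <= n:                 # stage i: group size = 2**i
        stride = size // 2
        while stride > 0:            # i CAS steps in stage i
            for i in range(n):
                j = i ^ stride       # partner wire of this CAS
                if j > i:
                    # direction of the group this CAS belongs to
                    asc = ((i & size) == 0) != descending
                    if (a[i] > a[j]) == asc:
                        a[i], a[j] = a[j], a[i]
            steps += 1
            stride //= 2
        size *= 2
    return a, steps

out, steps = bitonic_sort([11, 52, 57, 58])
print(out, steps)    # [58, 57, 52, 11] 3
```

For $N = 16$ the model executes $k(k+1)/2 = 10$ steps, matching the step count used in Eq. (1).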

Generally, the CAS unit is implemented with an $m$-bit comparator and an $m$-bit parallel multiplexer for $m$-bit binary inputs as in Fig. 1. The unary-coding-based bitonic sorting network simplifies the CAS structure by processing the inputs bit-serially instead of $m$ bits in parallel. With unary coding, the CAS can be implemented by one AND gate and one OR gate as in Fig. 3, resulting in an extremely small and fast CAS unit. Although this approach succeeds in reducing the size of the bitonic sorting network, it fails to increase the sorting speed because of the bit count inflated by unary coding. For example, to sort 8 inputs of 8-bit binary numbers, each 8-bit binary input is converted into a 256-bit unary code. Since an 8-input bitonic sorting network is composed of 6 steps, the latency of the conventional bitonic sorting network is

$t_{SN} = 6t_s = 6t_{cas},$

but the bit-serial operation of 256-bit unary code requires $(256+6-1)$ cycles so that

$t_{SN,unary} = 261t_{cas,unary}.$

To achieve a speed improvement, $t_{cas,unary}$ must be about 44 times smaller than $t_{cas}$, which is an impossible requirement.
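A small model of the unary-coded CAS (Fig. 3) illustrates why it is so compact; the thermometer encoding below is an assumed convention, and the function names are illustrative:

```python
def to_unary(x, m):
    """Encode an m-bit value as a 2**m-bit thermometer stream of
    x ones followed by zeros (assumed encoding convention)."""
    return [1] * x + [0] * ((1 << m) - x)

def unary_cas(s1, s2):
    """CAS of Fig. 3: bitwise OR gives the unary maximum and AND the
    minimum, so one AND gate and one OR gate suffice per stream bit."""
    return ([a | b for a, b in zip(s1, s2)],   # larger value
            [a & b for a, b in zip(s1, s2)])   # smaller value

hi, lo = unary_cas(to_unary(52, 6), to_unary(11, 6))
print(sum(hi), sum(lo))    # 52 11  (decode by counting ones)
```

The price of this simplicity is the stream length: the model above processes $2^6 = 64$ bits per key for 6-bit values, and $2^m$ grows exponentially with the key width.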

Fig. 3. CAS unit of unary coding based bitonic sorting network.


In this paper, a bit-serial bitonic sorting network is proposed. It uses a one-bit CAS instead of a word-size CAS, and each input is processed as a bit stream. Like the unary-coding-based bitonic sorting network, the bit-serial bitonic sorting network uses bit-serial operation to take advantage of a small and fast CAS, but it does not encode the inputs; it uses the binary inputs as they are. As a result, the bit-serial bitonic sorting network can operate faster than the conventional bitonic sorting network.

For sorting a small number of inputs, the normal (parallel) bitonic sorting network is still competitive in speed, but the proposed design becomes faster as the number of inputs increases. With the advance of technology, the complexity of problems and systems has increased significantly, resulting in substantial growth in data volume, and the need to sort large numbers of inputs grows with it. The new bit-serial bitonic sorting network can meet this trend. To distinguish the two designs, NBSN denotes the normal bitonic sorting network and BBSN denotes the new bit-serial bitonic sorting network.

III. DESIGN OF BIT-SERIAL BITONIC SORTING NETWORK

Generally, sorting networks avoid using memories between CAS units to swap inputs, because the memories require more area, power, and delay. Therefore, the NBSN operates asynchronously, without a clock: once sorting begins, all input bits propagate from the first step to the end of the sorting network without a clock or any other control signals.

Unlike usual sorting networks, a bit-serial sorting network must operate synchronously, which requires memory elements between adjacent steps. Although the BBSN uses a one-bit comparator and multiplexer, it requires more complex control than the $m$-bit CAS of the NBSN.

The major difficulty of the BBSN is that the switch paths are determined at different clock cycles, even for CAS units in the same step. In the NBSN, all CAS units in the same step determine the paths of their inputs at about the same time, in order from the first step to the last. No such order exists in the BBSN. For example, consider sorting four 6-bit binary numbers (11, 52, 57, 58) in descending order with the BBSN. The bit-serial sorting takes 8 cycles, and Fig. 4 shows the operation of the BBSN cycle by cycle.

Fig. 4. The sorting flow of 4 input bit-serial sorting network in sorting 6-bit inputs (a) cycle 1 (b) cycle 2 (c) cycle 3 (d) cycle 4 (e) cycle 5 (f) cycle 6 (g) cycle 7 (h) cycle 8.


The 4-input BBSN is composed of 3 steps, 2 CAS units per step, and one memory element per input between steps. Let $C_{ij}$ denote the $j$-th CAS unit in the $i$-th step. At the beginning, all memory elements are reset and all CAS units are unfixed. A CAS is in either the fixed or the unfixed state. An unfixed CAS compares its two input bits and fixes the paths of the inputs if the bits differ; otherwise, it remains unfixed and passes the inputs along the default paths. Between the two inputs ($I_1$, $I_2$) and the two outputs ($O_1$, $O_2$) of a CAS, the default path is determined by $I_1$: for a descending CAS, the default path of $I_1$ is $O_1$ if $I_1$ is 1 and $O_2$ otherwise. A fixed CAS simply passes the inputs along the fixed paths and remains fixed until the end of sorting. In Fig. 4, the paths of unfixed CAS units are drawn with dashed lines and those of fixed CAS units with bold lines.
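The fixed/unfixed behavior above can be modeled as a small state machine (a software sketch of the descending one-bit CAS with the default-path rule from $I_1$; the class name is illustrative):

```python
class BitSerialCAS:
    """Descending one-bit CAS: unfixed until the first cycle in which
    the input bits differ, then the routing is latched until reset."""
    def __init__(self):
        self.fixed = False
        self.swap = False            # latched multiplexer setting

    def step(self, i1, i2):
        """Process one bit pair, MSB first; returns (O1, O2)."""
        if not self.fixed:
            if i1 != i2:
                self.fixed = True       # FF set: latch the paths
                self.swap = (i2 == 1)   # larger stream goes to O1
            else:
                self.swap = (i1 == 0)   # default path chosen by I1
        return (i2, i1) if self.swap else (i1, i2)

# stream 11 (001011) and 52 (110100) MSB-first through one CAS
cas = BitSerialCAS()
o1 = o2 = 0
for c in range(5, -1, -1):
    y1, y2 = cas.step((11 >> c) & 1, (52 >> c) & 1)
    o1, o2 = (o1 << 1) | y1, (o2 << 1) | y2
print(o1, o2)    # 52 11
```

The CAS fixes in the very first cycle here because the most significant bits already differ; when the leading bits are equal, it stays unfixed and passes them through until a difference appears.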

In the first cycle, only one CAS, $C_{11}$, becomes fixed, and only $C_{22}$ does in cycle 2 (a CAS drawn with red lines is fixed in that cycle). Likewise, each CAS becomes fixed in a different cycle. Note that $C_{21}$ remains unfixed until cycle 4 while $C_{22}$ and $C_{23}$ are fixed in cycles 2 and 3, respectively. In the normal bitonic sorting network, $C_{ij}$ is fixed before $C_{ik}$ if $j < k$. No such restriction exists in the BBSN, so the fixing order of the $C_{ij}$ is not predictable: some CAS units are fixed in an early cycle, while others may remain unfixed until the end of sorting.

Although the comparator and multiplexer of the BBSN are much smaller and simpler than those of the NBSN, the CAS requires additional logic to control its state. The CAS unit of the BBSN is designed as in Fig. 5. It is composed of a one-bit comparator, a one-bit multiplexer, and memory elements (a latch and a flip-flop) that store the current state of the CAS. The latch is used to fix the paths of the input bits, and the state of the CAS is determined by the flip-flop (FF). When the output of the flip-flop is 0, the CAS is unfixed. At the beginning of the sorting operation, the FFs of all CAS units are reset so that all CAS units are in the unfixed state. The output of the FF controls the door of the latch: while the CAS is unfixed, the door is open, and it closes when the CAS becomes fixed.

Fig. 5. VLSI implementation of compare and swap circuit (descending order) for bit-serial bitonic sorting network.


In every cycle, each unfixed CAS determines the path of its inputs and stores the result in the latch through the open door. If the two input bits differ, the FF becomes 1 in the middle of the cycle, which closes the door and disables the FF, fixing the multiplexer on the current paths. Once a CAS is fixed, it cannot change its state until the next reset signal.

Fig. 6 shows the block diagram of the BBSN. Triggered by the external Start signal, the Signal Generator produces the reset signal and the local clock. In every step, one FF per input is used to hold the input bit to the CAS, as in Fig. 5; this is required for synchronized operation. The output memories can be shared with the input memories as in Fig. 4.
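Putting the pieces together, the whole BBSN can be modeled cycle by cycle (a behavioral sketch, assuming MSB-first streaming and one register per wire between steps; Eq. (2)'s $N_s + m - 1$ cycle count falls out of the model):

```python
def bbsn_sort(values, m, descending=True):
    """Cycle-accurate behavioral sketch of the BBSN: N = 2**k inputs
    stream MSB-first through k(k+1)/2 one-bit CAS steps separated by
    registers; each CAS latches its path at the first differing bits."""
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "N must be a power of two"
    steps, size = [], 2
    while size <= n:                       # enumerate the CAS steps
        stride = size // 2
        while stride:
            steps.append((stride, size))
            stride //= 2
        size *= 2
    ns = len(steps)                        # Ns steps in the network
    latched = [dict() for _ in range(ns)]  # per-CAS latched swap bit
    pipe = [[0] * n for _ in range(ns)]    # registers between steps
    out = [0] * n
    for cycle in range(ns + m - 1):        # Eq. (2): Ns + m - 1 cycles
        feed = [(v >> (m - 1 - cycle)) & 1 if cycle < m else 0
                for v in values]
        new_pipe = []
        for s, (stride, size) in enumerate(steps):
            bits = feed if s == 0 else pipe[s - 1]
            nxt = list(bits)
            for i in range(n):
                j = i ^ stride
                if j <= i:
                    continue
                sw = latched[s].get(i)
                if sw is None and bits[i] != bits[j]:
                    # unfixed CAS fixes its path on the first difference
                    asc = ((i & size) == 0) != descending
                    sw = (bits[i] == 1) if asc else (bits[j] == 1)
                    latched[s][i] = sw
                if sw:
                    nxt[i], nxt[j] = bits[j], bits[i]
            new_pipe.append(nxt)
        pipe = new_pipe
        if cycle >= ns - 1:        # first bit exits after Ns cycles
            out = [(o << 1) | b for o, b in zip(out, pipe[-1])]
    return out

print(bbsn_sort([11, 52, 57, 58], m=6))   # [58, 57, 52, 11]
```

The model reproduces the Fig. 4 example: the four 6-bit inputs emerge in descending order after exactly $N_s + m - 1 = 3 + 6 - 1 = 8$ cycles.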

Fig. 6. Block diagram for the structure of bit-serial bitonic sorting network.


Although the comparator and multiplexer of the BBSN are smaller and faster than the CAS of the NBSN, their benefit is reduced by the newly inserted components: the FFs and the additional logic in the CAS. To sort $N$ $m$-bit inputs, the latency of the BBSN ($t_{BBSN}$) is given by

(2)
$t_{BBSN} = t_{s,BBSN}(N_s + m - 1),$

where $N_s$ is the number of steps in an $N$-input NBSN, and $t_{s,BBSN}$ is the average step delay, which equals the clock cycle time ($t_{clk}$) of the BBSN. The $t_{s,BBSN}$ includes the propagation delay of the FF ($t_{FF}$) and the latency of the CAS ($t_{CAS,BBSN}$). The first bit of the input stream exits the BBSN at the end of $N_s$ cycles, but $(m-1)$ more cycles are required until the last bit exits, so the BBSN has a time overhead of $(m-1)$ cycles. Comparing Eqs. (1) and (2), the BBSN is faster than the NBSN when the following condition is satisfied.

(3)
$\frac{t_{s,BBSN}}{t_{s,NBSN}} \le \frac{N_s}{N_s + m - 1}.$

According to Eq. (3), the effect of the overhead cycles decreases as $N_s$ increases, so the BBSN has a speed benefit when $N_s$ is large. In sorting 8 inputs of 16 bits, for example, $N_s = 6$ and $m = 16$, so the BBSN is faster than the NBSN if

$t_{s,BBSN} \le t_{s,NBSN}/3.5.$

Although this is a tough requirement, it is eased to

$t_{s,BBSN} \le t_{s,NBSN}/2,$

when $N = 32$, which is a challenging but achievable target.

The NBSN is faster than the BBSN in sorting a small $N$ with a large $m$, but the BBSN is faster when $N$ becomes large. The crossover point depends on the design of the CAS unit of the BBSN.
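The break-even condition of Eq. (3) is easy to tabulate (a helper sketch; the function names are illustrative, and `ns` computes the step count $N_s = k(k+1)/2$):

```python
def ns(n):
    """Number of CAS steps in an N-input bitonic network: k(k+1)/2."""
    k = n.bit_length() - 1
    return k * (k + 1) // 2

def max_step_ratio(n, m):
    """Eq. (3): BBSN wins when t_s,BBSN / t_s,NBSN <= Ns / (Ns + m - 1)."""
    return ns(n) / (ns(n) + m - 1)

for n in (8, 16, 32, 64, 128):
    print(n, round(max_step_ratio(n, 16), 3))
# for m = 16: N = 8 needs a 3.5x faster step (ratio 0.286),
# while N = 32 needs only a 2x faster step (ratio 0.5)
```

As the table shows, the required step-speed ratio relaxes monotonically with $N$, which is why the BBSN favors large input counts.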

IV. EXPERIMENTS

The efficiency of the BBSN is estimated by comparing the characteristics of the BBSN and the NBSN. For the estimation, both the BBSN and the NBSN are designed for various numbers of inputs and synthesized with Synopsys Design Compiler using a 45-nm CMOS standard cell library. Ten sorting networks with different input counts (8, 16, 32, 64, and 128) and key widths (8-bit and 16-bit) are synthesized for each architecture.

The sizes of the sorting networks are compared in Table 1. The size of the unit cell (CAS unit) and the total size of each sorting network are measured. For the BBSN, the size of the signal generator (SG) is also measured and listed in Table 1. The size of the I/O block is not included in the size of the sorting network, because the I/O streaming method depends on the application. In general, the sizes of the I/O memories are almost the same for the BBSN and the NBSN, but the BBSN needs less routing area from the input buffer to the sorting network because it needs only 1 line per input while the NBSN requires $m$ lines per input.

Table 1. Comparison of area ($\mu m^2$).

                 N =      8       16       32       64      128
 NB   8-bit   CAS      59.8     59.8     59.8     59.8     59.8
              Total   3,105   10,623   33,142   91,426  244,680
     16-bit   CAS     120.6    120.6    120.6    120.6    120.6
              Total   6,907   23,114   69,960  193,247  528,982
 BB   8-bit   CAS      68.4     68.4     68.4     68.4     68.4
              SG      159.0    236.6    315.6    410.4    534.8
              Total   2,516    5,761   13,664   25,965   49,254
     16-bit   CAS      68.4     68.4     68.4     68.4     68.4
              SG      253.8    394.6    552.6    742.2    977.2
              Total   5,286    9,896   19,875   40,545   74,083

Table 2. Comparison of power consumption (mW).

                  N =      8      16     32     64    128
                $N_S$      6      10     15     21     28
 8-bit   $m/N_S$        1.33    0.8   0.53   0.38   0.29
         NB            0.132  0.524   1.76   5.78  15.34
         BB            0.125  0.416   1.25   3.49   9.31
 16-bit  $m/N_S$        2.67    1.6   1.07   0.76   0.57
         NB            0.256  1.008   4.09  14.01  40.34
         BB            0.249  0.812   2.50   6.99  18.63

The total size of the 8-bit BB(SN) is about 81% of that of the NB(SN) when $N$ is 8, but only 20% when $N$ is 128. For 16-bit keys, the size of the BB is about 77% of the NB when $N$ is 8 and 14% when $N$ is 128. Although the CAS of the BB in Fig. 5 uses a one-bit comparator and a one-bit multiplexer, it is bigger than the CAS of the 8-bit NB because of the additional FFs and state-control logic. The area efficiency of the BB comes mostly from savings in routing area. As $N$ increases, the interconnection lines between steps become longer and more lines overlap one another, so more area is required for routing; the ratio of routing area to CAS area increases rapidly with $N$. The routing area of the BB may converge to $1/m$ of that of the NB for large $N$. Although the CAS of the BB is smaller than that of the NB when $m > 8$, its effect on area efficiency is not critical.

To measure power consumption, the networks are activated by a Start signal generated at a 50 MHz rate. A sorting network is activated by the Start signal, sorts the given inputs, and is deactivated when sorting finishes. The power consumption of the two sorting networks is compared in Table 2. The $m/N_s$ ratio represents the critical-path overhead of the BB over the NB, meaning the BB must operate for $m/N_s$ times more steps than the NB. In general, the BB consumes less power than the NB, but the difference is small when $N$ is small: the power consumption of the BB and the NB differs by only 5.5% for 8 inputs of 8 bits and by 2.8% for 8 inputs of 16 bits, because the $m/N_s$ ratio is large when $N$ is small. As $N$ increases, the power efficiency of the BBSN also grows. When $N$ is 128, the BB consumes 40% less power than the NB for 8-bit inputs and 54% less for 16-bit inputs.

The experimental results for the latency of the sorting networks are shown in Fig. 7. Generally, the BBSN is faster than the NBSN. According to the graph, however, the BBSN can be slower than the NBSN for small $N$, especially for large $m$; this again is the effect of the overhead cycles. The $m/N_s$ ratio is high wherever the NBSN is faster than the BBSN. Comparing Fig. 7 and Table 2, the BBSN is faster than the NBSN when the $m/N_s$ ratio is less than about 1.4; for the proposed design, the critical value of $m/N_s$ lies between 1.33 and 1.6. The $m/N_s$ ratio can thus be used as a parameter to determine which network is faster for given inputs. The critical value depends on the relative speed of the steps and therefore on the CAS design. In any case, the BBSN is faster than the NBSN for large $N$.

Fig. 7. Execution time for sorting.


The experimental results show that the BBSN is more efficient than the NBSN in size, power, and speed. Many modified architectures have been proposed to improve on the NBSN; although they are efficient in size and/or power consumption, none of them is more efficient than the NBSN in speed. The BBSN achieves speed efficiency over the NBSN in addition to size and power efficiency. Generally, the efficiencies increase as the key width and the number of inputs increase. However, the NBSN can still be faster than the BBSN for small $N$ ($\le 32$) with large $m$ ($\ge 16$) inputs, although the BBSN remains efficient in size and power consumption.

The complexity of problems and systems keeps increasing rapidly and involves ever larger-scale data, and the need for high-speed, low-cost sorting networks grows with it. The proposed bit-serial bitonic sorting network can be a solution.

V. CONCLUSIONS

An implementation of a bit-serial bitonic sorting network is presented in this paper. Through bit-serial operation, it is possible to use a smaller and faster CAS unit than that of the conventional bitonic sorting network. As a result, the proposed sorting network reduces the area of the conventional bitonic sorting network by 84% and its power consumption by 60%. Furthermore, it is 16% faster than the conventional bitonic sorting network in sorting 128 inputs.

Although many modified bitonic sorting networks have been developed, none of them is more efficient in speed than the conventional bitonic sorting network. The proposed architecture is efficient not only in size and power but also in speed, and it can be used in any application that requires hardware-aided high-speed sorting operations.

ACKNOWLEDGEMENT

The present research was supported by the research fund of Dankook University in 2023.

REFERENCES

[1] Wen L., Fen L., DangDang D., Huan X., XiangLi P., 2020, Adaptive sorting method of big data with scattered defects based on cloud computing, Proc. of 2020 13th International Conference on Intelligent Computation Technology and Automation (ICICTA), pp. 370-376.
[2] Li P., Lilja D. J., Qian W., Bazargan K., Riedel M. D., 2014, Computation on stochastic bit streams: digital image processing case studies, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 22, No. 3, pp. 449-462.
[3] Graefe G., 2006, Implementing sorting in database systems, ACM Computing Surveys, Vol. 38, No. 3, Article 10.
[4] Chang R. C., Wei M., Chen H., Lin K., Chen H., Gao Y., Lin S., 2014, Implementation of a high-throughput modified merge sort in MIMO detection systems, IEEE Transactions on Circuits and Systems I, Vol. 61, No. 9, pp. 2730-2737.
[5] Colavita A. A., Cicuttin A., Fratnik F., Capello G., 2003, SORTCHIP: a VLSI implementation of a hardware algorithm for continuous data sorting, IEEE Journal of Solid-State Circuits, Vol. 38, No. 6, pp. 1076-1079.
[6] Batcher K. E., 1968, Sorting networks and their applications, Proc. of 1968 Spring Joint Computer Conference, AFIPS, Vol. 32, pp. 307-314.
[7] Liszka K. J., Batcher K. E., 1993, A generalized bitonic sorting network, Proc. of 1993 International Conference on Parallel Processing (ICPP'93), pp. 105-108.
[8] Ajtai M., Komlos J., Szemeredi E., 1983, An O(n log n) sorting network, Proc. of 15th Annual ACM Symposium on Theory of Computing, pp. 1-9.
[9] Leighton F., 1985, Tight bounds on the complexity of parallel sorting, IEEE Transactions on Computers, Vol. 34, No. 4, pp. 344-354.
[10] Paterson M. S., 1990, Improved sorting networks with O(log N) depth, Algorithmica, Vol. 5, pp. 75-92.
[11] Chen R., Prasanna V. K., 2017, Computer generation of high throughput and memory efficient sorting designs on FPGA, IEEE Transactions on Parallel and Distributed Systems, Vol. 28, No. 11, pp. 3100-3113.
[12] Lee J.-D., Batcher K. E., 2000, Minimizing communication in the bitonic sort, IEEE Transactions on Parallel and Distributed Systems, Vol. 11, No. 5, pp. 459-474.
[13] Lee J.-D., Batcher K. E., 1996, Minimizing communication of a recirculating bitonic sorting network, Proc. of the 1996 ICPP Workshop on Challenges for Parallel Processing, Vol. I, pp. 251-254.
[14] Najafi M. H., Lilja D. J., Riedel M., Bazargan K., 2017, Power and area efficient sorting networks using unary processing, Proc. of 2017 IEEE International Conference on Computer Design (ICCD), pp. 125-128.
[15] Najafi M. H., Lilja D. J., Riedel M. D., Bazargan K., 2018, Low-cost sorting network circuits using unary processing, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 26, No. 8, pp. 1471-1480.
[16] Han Z., Yoshikawa N., Yamanashi Y., 2025, Small-area sorting network based on unary coding using single flux quantum circuits, IEEE Transactions on Applied Superconductivity, Vol. 35, No. 5, pp. 1-5.
Myungchul Yoon

Myungchul Yoon received his B.S. and M.S. degrees in electronics engineering from Seoul National University, Korea, in 1986 and in 1988, respectively, and a Ph.D. degree in electrical and computer engineering from The University of Texas at Austin, USA, in 1998. From 1988 to 2002, he was with Hynix Inc. Icheon, Korea as technical research staff at Semiconductor R&D Lab., and Mobile Communication R&D Lab. From 2005 to 2006, he was with DGIST, Korea as technical staff at the Information Technology R&D Division. Since 2006, he has been with the Department of Electronics and Electrical Engineering, Dankook University, YongIn, Korea, where he is a professor. His research interests are in low-power VLSI design, embedded systems, and mobile communication.