I. INTRODUCTION
Sorting is a fundamental operation in most data-centric applications such as big data ^{[1]}, image processing ^{[2]}, database query operations ^{[3]}, wireless local area networks ^{[4]}, ATM switching ^{[5]}, etc. Due to the great impact of sorting operations, many software algorithms and hardware implementations have been developed. In many cases, especially for real-time sorting operations ^{[6,7]}, software-oriented designs have difficulty meeting timing constraints. Since hardware implementation has many advantages over software approaches, developing an efficient hardware sorting unit has been a significant challenge in overcoming the computational bottleneck of sorting problems.
Most prior hardware sorting designs are implemented as modifications of traditional sorting algorithms or as altered switching-network structures with parallel processing and pipelining stages. Most of these approaches are based on the compare-and-swap (CAS) operation, which compares the magnitudes of two inputs and switches their positions according to the result. To improve the performance of sorting architectures, parallel data-independent CAS blocks are devised to form a sorting network ^{[8-10]}. CAS-based sorting architectures recursively move inputs from memory to memory, so they require large memory space and long processing time for moving inputs.
Alternative sorting engines, called comparison-free sorting engines, use special sorting methods instead of the CAS operation ^{[11-14]}. These architectures use a non-comparative sorting algorithm, which does not compare inputs against each other. Due to the absence of recursive moving and swapping of input elements, comparison-free sorting engines can save swapping memories and data load/store time. The number of comparison-free sorting engines, however, is quite limited compared to CAS-based sorting engines, because implementing existing non-comparative sorting algorithms is much harder than CAS-based design. Comparison-free sorting engines adopt a non-comparative algorithm either by modifying an existing algorithm or by developing special architectures and operations.
Despite the many potential advantages of comparison-free sorting, most implementations proposed so far are not adequate as general-purpose sorting engines. The main reason is that their architectures use special operations requiring unusual hardware, so they are not easy to synthesize with standard ASIC design tools.
In this paper, a comparison-free sorting engine called the Sorting Grid is presented. The Sorting Grid (SG) is intended to be used as a general hardware module for ASIC design. The SG is composed of three major blocks, and each block is built by replicating a basic unit so that it can be easily synthesized with ASIC tools. It supports stable sorting for inputs with the same key, and even multi-key sorting. Any ASIC chip requiring a hardware sorting IP can use this module.
Relevant hardware sorting implementations are reviewed in Section II. The architecture and operation of the proposed sorting engine are described in Section III. The performance of the new architecture is evaluated by simulations, and the results are presented in Section IV. Section V concludes this paper.
II. RELATED WORK
Most hardware sorting architectures are based on comparative sorting algorithms, while only a few employ a non-comparative algorithm.
All hardware sorting engines adopting a comparative sorting algorithm use CAS elements ^{[1-10]}, and their differences lie in how these elements are arranged in the architecture. Efficient parallel processing and pipelining are the keys to high-speed operation. Although some CAS-based engines are implemented as ASIC chips, most of them are implemented with FPGAs. Since comparators and local memories are included in most FPGAs, it is relatively easy to implement a CAS-based engine with an FPGA. However, CAS-based architectures need long processing times and large memories for the comparing and swapping operations, especially when sorting a large number of inputs.
Some sorting engines employ a non-comparative sorting algorithm to avoid the disadvantages of the CAS operation. However, only a few comparison-free sorting engines have been developed so far. A binary number sorting engine is presented in ^{[11]}, based on a rank-order filter and an efficiently implemented multi-input majority function. Alternatively, the design in ^{[12]} uses a rank matrix that assigns relative ranks to the inputs. The rank matrix is updated in accordance with the value of a particular bit in each of the inputs. However, this design cannot be used when the number of elements is less than the bit size of the key. A one-hot weight matrix is employed for the sorting operation in ^{[13]}. Each binary input is assigned a unique one-hot weight, and the one-hot weight matrix is made from the weights of the inputs. Sorting is performed using this matrix. Due to the inefficiency of the one-hot weight matrix, however, this engine is competitive only when the key size ($\textit{k}$) is small and the number of inputs is about 2$^{k}$. A largest-element detection scheme is used for sorting in ^{[14]}. By employing bit-serial scanning of the inputs, it finds the largest element in the first scan and returns the address of that element. The second largest element is sorted in the next scan, and so on. One element is sorted per scan, so sorting is performed by repeating the scan $\textit{N}$ times.
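The scan-per-element behavior of ^{[14]} described above can be illustrated in software. This is a behavioral sketch only (function and variable names are ours, and the hardware performs each scan bit-serially rather than with a software maximum):

```python
def scan_sort_desc(values):
    """Behavioral sketch of the scan-per-element scheme: each pass
    finds the current largest element and emits its address, so
    sorting N elements takes N scans."""
    remaining = dict(enumerate(values))        # address -> value
    order = []
    while remaining:
        addr = max(remaining, key=remaining.get)  # one full "scan"
        order.append(addr)
        del remaining[addr]                    # exclude from later scans
    return order
```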
Most comparison-free sorting engines are implemented as ASIC chips. Generally, a comparison-free sorting engine uses a special sorting scheme not employed in any other ASIC chip. Therefore, most of their components are not supported in commercial FPGA chips, so implementation with an FPGA is either impossible or very inefficient. The uniqueness and unfamiliarity of the architecture are the main disadvantages of comparison-free sorting engines as generally usable sorting engines.
III. THE SORTING GRID
A sorting-engine module called the Sorting Grid (SG) is developed to be used as a hardware IP for ASIC design. It can be used as an easily synthesizable hardware module for any ASIC chip that requires high-speed sorting operation. The structure and operation of the SG are described in this section.
1. Block Diagram of The Sorting Grid
The operation of the SG is based on the radix sort algorithm. Radix sort is a well-known non-comparative sorting algorithm which avoids comparison by distributing data into buckets according to their radix ^{[15]}. Starting from the most significant digit (MSD), this bucketing process is repeated for each digit, while preserving the ordering of the prior step, until all digits have been considered.
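The MSD bucketing described above can be sketched for the binary (radix-2) case used by the SG. This is a plain software illustration of the algorithm, not the SG hardware; we assume descending output order (1s bucket before 0s bucket, matching the bin layout described later):

```python
def msd_radix2_sort(keys, k, bit=None):
    """Minimal sketch of MSD radix-2 sort on k-bit keys: split on the
    current bit into a 1s bucket and a 0s bucket, then refine each
    bucket on the next bit. Input order is preserved inside a bucket,
    so equal keys come out stably ordered."""
    if bit is None:
        bit = k - 1                      # start at the MSD
    if bit < 0 or len(keys) <= 1:
        return list(keys)
    ones  = [x for x in keys if (x >> bit) & 1]
    zeros = [x for x in keys if not (x >> bit) & 1]
    return msd_radix2_sort(ones, k, bit - 1) + msd_radix2_sort(zeros, k, bit - 1)
```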
The SG uses a novel architecture to implement the radix-2 sort algorithm. Fig. 1 shows the structure of the SG. It consists of four main blocks: the I/O register block, the grid block, the bin management block, and the control block. The I/O register block is composed of conventional shift registers. To sort $\textit{N}$ $\textit{m}$-bit binary inputs, $\textit{N}$ ($\textit{m}$+1)-bit shift registers are used to store both the inputs and the sorted data. One additional bit is automatically inserted when each input element is loaded into the registers. Before sorting, the input elements are loaded into the registers, and they are sorted in ($\textit{m}$+1) cycles. The most significant bits (MSBs) of the $\textit{N}$ registers are used for the sorting operation in each cycle and discarded by shifting at the end of the cycle. The least significant bits (LSBs) of the registers are filled with the outputs of the bin management block, as in Fig. 1. By repeating this operation ($\textit{m}$+1) times, the original input elements are expelled from the registers as sorting completes, and the registers are filled with the sorted elements.
Fig. 1. Block diagram of the Sorting Grid.
The grid block is an array of $\textit{N}$${\times}$$\textit{N}$ grid units. The principal function of a grid unit is a 1${\times}$2 demultiplexer, and the array is used to count the number of 1s in a bin. The bin management (BM) block creates bins, deactivates useless grid units, and controls the progress of a cycle. The flow of the entire sorting process is governed by the control block.
2. Operation of The Sorting Grid
The operation of the SG is explained using a simple example. The SG supports two types of inputs. The first type is key-only inputs, where the keys themselves are the objects of sorting. In many applications, the raw data are the keys to be sorted; the key values are all that is needed, and no other information is necessary. For this type of input, stability of sorting is not required because there is no way to distinguish two identical data items.
In the other type of input, the key is only a part of the input element. Typically, an input element is composed of a key and an object, where the key represents the weight of the object. Sorting is performed to determine an order of the objects, although it is done using the keys. Since various kinds of objects are possible in sorting problems, a pointer is used to retrieve the object. For this type of input, stability is an important property of the sorting operation.
For $\textit{m}$-bit ($\textit{m}$ = $\textit{k}$+$\textit{p}$) inputs with a $\textit{k}$-bit key and a $\textit{p}$-bit pointer, the I/O block requires ($\textit{m}$+1)-bit shift registers, which are divided into three fields (key, TB, and pointer). Keys must have a fixed size. One bit, called the tie-break (TB) bit, is inserted (as 0) between the key field and the pointer field. This bit is used to achieve a stable sort when multiple inputs have the same key. It is inserted/removed by the control block when data are transmitted between the I/O block and external systems. For key-only inputs, the TB bit is not necessary, so an $\textit{m}$(=$\textit{k}$)-bit register is sufficient for sorting. The values of $\textit{k}$ and $\textit{p}$ must be given to the control block before input transmission.
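The register word layout above can be illustrated as follows. This sketch shows only the field packing (key | TB | pointer); the helper names are ours, not part of the design:

```python
def pack_element(key, pointer, k, p):
    """Build the (m+1)-bit register word described above:
    [ k-bit key | 1 tie-break bit (inserted as 0) | p-bit pointer ]."""
    assert 0 <= key < (1 << k) and 0 <= pointer < (1 << p)
    return (key << (p + 1)) | (0 << p) | pointer

def unpack_element(word, k, p):
    """Recover (key, pointer), dropping the TB bit on the way out,
    as the control block does when sending data back."""
    pointer = word & ((1 << p) - 1)
    key = word >> (p + 1)
    return key, pointer
```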
Fig. 2 presents the sorting process of the SG. For example, assume that there are 8 objects ($\textit{D}$$_{1}$, $\textit{D}$$_{2}$, ... $\textit{D}$$_{8}$) whose weights are marked by 4-bit binary keys (8, 5, 15, 10, 5, 7, 5, 11). Among the 8 objects, three have the same key. As shown in Fig. 2(a), 8-bit registers are used to store the input elements: in addition to the 4-bit key field, 1 bit for stability and a 3-bit index field are required. Instead of a pointer, the index of the object is used, to save area and operation time.
Before sorting begins, the input elements are loaded into the I/O registers, all grid units are activated, and the entire array belongs to one bin. In the first cycle, only the right-bottom input of the grid array is 0. Let us call this 0 value a ball. According to the values of the MSBs of the I/O registers, the grid units pass the ball to the top of the array as in Fig. 2(a).
Fig. 2. Example of the operation of the Sorting Grid: (a) the first cycle; (b) the second cycle; (c) the third cycle; (d) the fourth cycle; (e) the fifth cycle (tie-break cycle); (f) the 6th~8th cycles (transmission cycles).
The behavior of a grid unit is controlled mainly by the values of the MSBs of the I/O registers. A grid unit passes the ball to its upper grid unit when MSB=1, but to its upper-left grid unit when MSB=0. Since the ball moves left only when MSB=0, the ball arrives at the top of the $\textit{j}$-th column such that $\textit{n}$-$\textit{j}$ equals the number of 0s in the MSBs of the registers.
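The ball's destination column can be stated compactly. A one-line sketch, assuming columns are numbered 1..n from the left (our convention):

```python
def ball_column(msbs):
    """With n registers, the ball surfaces at the top of column j such
    that n - j equals the number of 0s among the current MSBs: every
    MSB=0 row moves the ball one column to the left."""
    n = len(msbs)
    return n - msbs.count(0)
```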
As soon as the $\textit{j}$-th BM unit receives the ball, it divides the array into two bins: the columns to the right become the 0s bin and those to the left the 1s bin. According to the bin decision, the memory cell in each BM unit is written with the proper value as in Fig. 2(a). After the bin decision, the useless grid units in each bin are deactivated with kill signals K0 and K1. The K0 (kill 0) signal, activated in the 1s bin, disables all grid units with MSB=0, while the K1 (kill 1) signal in the 0s bin disables grid units with MSB=1. If a grid unit is deactivated, it just passes a ball to the upper unit regardless of the value of the MSB. As a result, $\textit{j}$${\times}$$\textit{j}$ active grid units are left in the 1s bin, and ($\textit{n}$-$\textit{j}$)${\times}$($\textit{n}$-$\textit{j}$) active units form a new array in the 0s bin as in Fig. 2(b). At the end of each cycle, the contents of the shift registers are shifted left so that the bits in the MSBs are discarded and the bits in the next digit move to the MSBs, while the contents of the memories in the BM block are shifted into the LSBs of the registers as in Fig. 2(b).
The operations in cycles 2, 3, and 4 are the same as those in the first cycle, except that there are multiple bins and the first-cycle operations are applied to every bin simultaneously. At the right-bottom of each bin, a new ball is placed, and it propagates to the top of that bin. On receiving one of the balls, the BM block divides the corresponding bin and kills the unnecessary grid units in the new bins. Then, the registers are shifted with the contents of the memories in the BM block. The first 4 cycles ($\textit{k}$ cycles in general) are performed with the keys, so these cycles are called sorting cycles.
When no identical keys exist among the input elements, only one active grid unit is left in every column of the grid array. However, multiple active units exist in a column when some elements have the same key, as in Fig. 2(e). One additional cycle, called the tie-break cycle, is required to set an order among those elements. By making their order the same as their input order, the entire sorting operation becomes a stable sort. The tie-break cycle operates the same way as a sorting cycle, except that the $\textit{T}$ signal is generated instead of the KILL signal. The $\textit{T}$ signal deactivates the unnecessary grid units.
After the tie-break cycle, only one active grid unit is left in every column of the array, as in Fig. 2(f). The remaining cycles are called transmission cycles; they just pass the bits in the MSBs of the registers to the LSBs via the active grid units and the memories in the BM block. A transmission cycle operates the same way as a sorting cycle, but the deactivating process with the KILL signal is not necessary.
As shown in Fig. 2, the I/O registers are initially filled with the input elements, gradually replaced by outputs, and finally filled with the sorted elements. Therefore, the SG requires minimal memory space for data, which greatly reduces hardware size.
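The cycle-by-cycle data flow of this subsection can be summarized in a behavioral sketch (software only; it ignores timing, kill signals, and the grid circuitry, and the names are ours). Each sorting cycle stably splits every bin on the current MSB, with the 1s bin taking the left columns, and the tie-break leaves equal keys in input order:

```python
def sorting_grid_order(keys, k):
    """Behavioral sketch of the SG for key-only inputs: k sorting
    cycles, each stably splitting every bin on the current MSB
    (1s bin to the left columns, 0s bin to the right). Returns the
    input indices in output (descending-key) order; stability comes
    for free because every split preserves the order within a bin."""
    bins = [list(range(len(keys)))]          # one bin spans all columns
    for cycle in range(k):
        bit = k - 1 - cycle                  # MSB of the remaining digits
        nxt = []
        for b in bins:
            ones  = [i for i in b if (keys[i] >> bit) & 1]
            zeros = [i for i in b if not (keys[i] >> bit) & 1]
            nxt += [grp for grp in (ones, zeros) if grp]
        bins = nxt
    # tie-break cycle: any bin still holding >1 column keeps input order
    return [i for b in bins for i in b]
```

Running this on the Fig. 2 keys (8, 5, 15, 10, 5, 7, 5, 11) shows the three equal keys (value 5) emerging in their original relative order.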
3. Implementation of The Sorting Grid
Each block of the SG in Fig. 1 is implemented as a VLSI circuit. The circuit of a grid unit is shown in Fig. 3. The principal function of the grid unit is a 1${\times}$2 demultiplexer, with some additional features such as deactivating logic and a dummy path. The behavior of a grid unit is controlled mainly by the values of $\textit{D}$$_{i}$, $\textit{I}$$_{j}$, and its internal latch. $\textit{D}$$_{i}$ is a bit of an input element, which comes from the MSB of the $\textit{i}$-th I/O register. All grid units in the $\textit{i}$-th row are controlled by $\textit{D}$$_{i}$. If $\textit{D}$$_{i}$=1, the ($\textit{i, j}$)-th grid unit ($\textit{G}$$_{ij}$) passes the received ball signal ($\overline{B_{ij}})$ to $\textit{G}$$_{i-1,j}$, while it passes $\overline{B_{ij}}$ to $\textit{G}$$_{i-1,j-1}$ if $\textit{D}$$_{i}$=0. A grid unit can be disabled by its internal latch. If the latch is set, the grid unit is disabled, so it just passes the ball to the upper unit ($\textit{G}$$_{i-1,j}$) regardless of the value of $\textit{D}$$_{i}$. At the beginning of the sorting operation, the internal latches of all grid units are reset by the RESET signal. During the sorting process, grid units are selectively disabled by setting the latch through one of the kill signals (K0 or K1). Once a unit is disabled, it cannot be reactivated until the sorting is finished.
The isolation signal $\textit{I}$$_{j}$ is used to isolate two adjacent columns in the grid array. A column of the grid array, called a ladder, corresponds to an output element in a bin. An activated $\textit{I}$$_{j}$ ($\textit{I}$$_{j}$=1) becomes a border of a bin, so a bin is defined by two activated isolation signals, and the number of elements in the bin is the number of columns between them. The ball signal ($\overline{B_{j}})$ cannot cross any border of a bin. The isolation signal $\textit{I}$$_{j}$ activates an auxiliary ball path $\overline{AB_{j}}$ to provide a path for ball signals instead of crossing the border.
Fig. 3. Circuit of the grid unit.
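The routing rule of the grid unit can be condensed into a small function. This is a behavioral abstraction of Fig. 3, not the circuit; the direction labels are our naming:

```python
def grid_unit(D, disabled, ball_in):
    """One grid unit's ball routing: a disabled unit (latch set) or a
    unit with register bit D=1 passes the ball straight up; a unit
    with D=0 steers it to the upper-left neighbor."""
    if disabled or D == 1:
        return ('up', ball_in)
    return ('up_left', ball_in)
```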
The bin management (BM) block creates bins, deactivates useless grid units, and controls the progress of a cycle. It consists of a series of $\textit{N}$ BM units, as presented in Fig. 4. Each BM unit controls a ladder. If a ball (value 0) arrives at the top of the $\textit{j}$-th column (i.e. $\textit{B}$$_{0j}$=1), the $\textit{j}$-th BM unit divides its bin into two bins by activating $\textit{I}$$_{j+1}$. The ladders to the right of the $\textit{j}$-th ladder become the 0s bin and the other ladders become the 1s bin. After creating the new bins, the memory cells in the 1s (0s) bin are written with 1 (0). The signal $\overline{F_{j}}$ is used to indicate that ladder $\textit{j}$ is ready for the next operation. The control block can check the states of the BM units with the $\overline{F_{j}}$ signals.
Fig. 4. Block diagram of the binmanagement unit.
The flow of the entire sorting process is governed by the control block. The control block generates various asynchronous signals as soon as possible for high-speed operation. The entire sorting operation is performed in $\textit{m}$+1 cycles, but each cycle has a different latency: a small bin can be processed more quickly than a large bin. The number of bins increases while the sizes of the bins decrease over the cycles. Because all bins are sorted in parallel, the processing time for a cycle keeps decreasing. For this reason, the SG avoids using a fixed-period clock. Instead, it uses a self-timed asynchronous clock which flips as soon as possible by checking the states of the BM units. However, it uses a fixed external clock when the I/O block sends/receives data to/from external systems.
Note that the three main blocks, the grid array, the I/O block, and the BM block, have a modular structure: each is made by replicating a basic unit. Therefore, the engine can be easily synthesized if hardware IPs for the grid unit and the BM unit are available, since the shift register is already included as a standard cell in most ASIC tools. The control block can also be easily synthesized using conventional control-logic design methods.
4. Performance Enhancement Scheme
Two speed-enhancement schemes are developed to improve the performance of the SG. A self-timed asynchronous clock is used to avoid the inefficiency of a fixed-period clock, and a bypassing scheme is employed to decrease the ball-propagation delay in the grid block.
$\textit{N}$ $\textit{m}$-bit inputs with a $\textit{k}$-bit key and a $\textit{p}$-bit pointer are sorted in ($\textit{k}$+$\textit{p}$+1) cycles with the SG. If a fixed-period clock is used, the period of the clock, $\textit{T}$$_{clk}$, must be set to the longest of the ($\textit{k}$+$\textit{p}$+1) cycles, and the total sorting delay ($\textit{T}$$_{d}$) becomes ($\textit{m}$+1)$\textit{T}$$_{clk}$. Since the longest cycle time is proportional to $\textit{N}$, the time complexity of the proposed engine would be $\textit{O}$($\textit{mN}$) with a fixed-period clock. However, the execution time of the longest cycle can be tens of times longer than that of the shortest cycle, so a fixed-period clock is quite inefficient in performance.
The execution time of a cycle ($\textit{T}$$_{cycle}$) can be analyzed with 4 components:

$$T_{cycle}=T_{col}+T_{BM}+T_{C}+T_{R},\tag{1}$$

where $\textit{T}$$_{col}$ is the time for a ball to propagate from the bottom to the top of the array, and $\textit{T}$$_{BM}$ is the time for setting the BM units from the ball position to the borders of the bin. $\textit{T}$$_{C}$ is the time spent in the control block, and $\textit{T}$$_{R}$ is the restoring time for the next cycle. The contributions of $\textit{T}$$_{C}$ and $\textit{T}$$_{R}$ to the total sorting time are small, and their variations over the cycles are negligible compared to the other two terms, so they can be regarded as constants. $\textit{T}$$_{col}$ is proportional to the number of grid units in a column, and $\textit{T}$$_{BM}$ is proportional to the bin size, i.e. the number of columns in a bin. The total sorting delay $\textit{T}$$_{d}$ is

$$T_{d}=\sum_{j=1}^{m+1}T_{cycle,j}=\sum_{j=1}^{m+1}\left(N\,t_{g}+b_{\mathrm{max},j}\,t_{B}+\alpha\right),\tag{2}$$
where $\textit{t}$$_{g}$ is the delay of a grid unit, $\textit{b}$$_{\mathrm{max},j}$ is the largest bin size in cycle $\textit{j}$, $\textit{t}$$_{B}$ is the delay of a BM unit, and ${\alpha}$ = $\textit{T}$$_{C}$+$\textit{T}$$_{R}$ is a constant.
The longest $\textit{T}$$_{cycle}$ occurs in the first cycle because the bin size is largest then. Although the number of bins increases over the cycles, the size of each bin decreases. Since all bins are processed simultaneously, $\textit{T}$$_{BM}$ is determined by the biggest bin in that cycle. Because $\textit{b}$$_{\mathrm{max},j}$ decreases monotonically, $\textit{T}$$_{cycle}$ decreases over the cycles, and all bins have only one element after the tie-break (($\textit{k}$+1)-th) cycle. Therefore, the worst-case cycle time, at the first cycle, is

$$T_{cycle,1}=N\,t_{g}+N\,t_{B}+\alpha,\tag{3}$$

while the shortest cycle time, in the transmission cycles, is

$$T_{cycle,\mathrm{min}}=N\,t_{g}+t_{B}+\alpha.\tag{4}$$
Eqs. (3) and (4) show a large difference between the longest and shortest cycle times. Reflecting this, the new sorting engine uses a self-timed asynchronous clock to increase operation speed. The clock flips upon detecting the end of the critical processes, so each cycle starts as soon as possible. By employing the variable-period clock, the time complexity of $\textit{T}$$_{BM}$ can be reduced from $\textit{O}$($\textit{Nm}$) to $\textit{O}$($\textit{N+m}$) in the best case. Assume that the inputs are well balanced so that the balls bisect the bins in every cycle; then $\textit{b}$$_{\mathrm{max},j}$ would be $\textit{N}$/2$^{j}$. The $\textit{b}$$_{\mathrm{max},j}$ decreases until it becomes 1, and the rest of the cycles work with the same $\textit{T}$$_{BM}$. If $\textit{N}$ < 2$^{k}$, approximately,

$$\sum_{j=1}^{m+1}T_{BM,j}\approx\left(\sum_{j=1}^{\log_{2}N}\frac{N}{2^{j}}+(m+1-\log_{2}N)\right)t_{B}\approx\left(N+m-\log_{2}N\right)t_{B},\tag{5}$$
which has $\textit{O}$($\textit{N+m}$) time complexity. The distribution of pseudo-random inputs shows patterns similar to this best case.
Unlike $\textit{T}$$_{BM}$, $\textit{T}$$_{col}$ does not change even though the number of active grid units in a column keeps decreasing. The columns of a bin are localized in the array, but its active grid units are scattered over the entire column. Even if most of the units in a column are disabled, a ball must still pass through them, so a ball passes $\textit{N}$ grid units regardless of the number of inactive units. Although $\textit{T}$$_{BM}$ is much larger than $\textit{T}$$_{col}$ in the early cycles, $\textit{T}$$_{col}$ becomes the dominant component a few cycles later, causing $\textit{T}$$_{d}$ to remain $\textit{O}$($\textit{Nm}$).
To resolve this problem, a bypass scheme is developed as in Fig. 5. The purpose of the bypass scheme is to skip consecutive inactive grid units during ball propagation. A column is divided into several groups such that every group consists of $\textit{q}$ consecutive grid units and has a bypassing line. If all grid units in a group are disabled, the ball does not pass through the grid units in the group; instead, it is sent to the next group through the bypassing line. The bypassing scheme is very effective in the later cycles, when many grid units are disabled.
Fig. 5. Basic grid block with the first-level bypassing circuit.
Fig. 5 shows the basic block of grid units and the implementation of the first-level bypassing scheme. Four grid units with a repeater form a basic grid block, which is used as a basic module of the ASIC library. With the first-level bypassing scheme, the maximum achievable speedup is 4 times. For a greater speedup, a multi-level bypassing scheme can be employed. The number of levels and the number of grid units in a group should be decided considering the application and the hardware overhead.
As with $\textit{T}$$_{BM}$, the time complexity of $\textit{T}$$_{col}$ can be $\textit{O}$($\textit{N+m}$) in the best case, with an optimal input distribution and the maximal-level bypassing scheme. Therefore, the overall latency ($\textit{T}$$_{d}$) of the sorting engine can achieve $\textit{O}$($\textit{N+m}$) time complexity, but it is $\textit{O}$($\textit{Nm}$) for the worst-case distribution.
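Under the balanced-bin assumption above, the benefit of the self-timed clock over a fixed-period clock can be illustrated with a toy timing model. All constants (t_g, t_B, alpha, q) are illustrative placeholders, not measured values, and the bypass effect is modeled only crudely:

```python
def sg_delay_model(n, m, t_g=0.1, t_b=0.1, alpha=0.5, q=4, levels=1):
    """Toy self-timed model: sum per-cycle times with best-case bin
    sizes b_max = N/2^j and a bypass that shortens the ball path once
    most units in a column are disabled."""
    total = 0.0
    for j in range(1, m + 2):                 # m+1 cycles
        b_max = max(n >> j, 1)                # balanced bins halve each cycle
        # crude bypass effect: the ball skips fully disabled groups
        col_units = max(n // (q ** levels), b_max)
        total += col_units * t_g + b_max * t_b + alpha
    return total

def sg_delay_fixed_clock(n, m, t_g=0.1, t_b=0.1, alpha=0.5):
    """Fixed-period clock: every one of the m+1 cycles pays the
    worst-case (first-cycle) time of Eq. (3)."""
    return (m + 1) * (n * t_g + n * t_b + alpha)
```

Comparing the two for, say, n = 1024 and m = 16 shows the self-timed sum well below the fixed-clock product, which is the qualitative effect measured in Section IV.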
IV. PERFORMANCE EVALUATION
The performance of the SG is evaluated through simulations with pseudo-random inputs. The simulations are performed with HSPICE using 1.2 V, 0.13 ${\mu}$m process model parameters. Simulations are performed with several input sizes and key sizes. For each type of input, 10 samples are simulated, and their average values are used as the results. Two groups of inputs are used in the simulation: one group uses an 8-bit key and the other a 16-bit key. Inputs with 16, 64, 256, and 1024 elements are used. Key-only inputs are used for the simulations because transmission cycles just move data with a constant cycle time. A multi-level bypassing scheme is used such that every 4 blocks are grouped as a bypassing unit. For example, a 4-level (maximum-level) bypassing scheme is used in sorting 1024 inputs.
Table 1 shows the simulation results. The proposed sorting engine sorts 1024 8-bit inputs in 1.042 ${\mu}$s, a rate of about one element per nanosecond. Even when the key size is doubled, the sorting time increases by 55% for 16 inputs, by only 29% for 256 inputs, and by only 6% for 1024 inputs. One reason for this striking result is that 1024 inputs with an 8-bit key include many identical inputs, since $\textit{N}$ > 2$^{k}$, while $\textit{N}$ < 2$^{k}$ for the 16-bit-key inputs. The efficiency of the bypassing scheme drops when the inputs include many same-key elements.
As shown in Table 1, the gap in cycle times between the first cycle and the last cycle is very large, and the difference grows with the number of inputs. Fig. 6 shows how the cycle time ($\textit{T}$$_{cycle}$) changes over the cycles. $\textit{T}$$_{cycle}$ decreases rapidly and then saturates. This is because $\textit{T}$$_{col}$ and $\textit{T}$$_{BM}$ keep decreasing but $\textit{T}$$_{C}$+$\textit{T}$$_{R}$ does not; therefore, $\textit{T}$$_{C}$+$\textit{T}$$_{R}$ becomes the dominant factor in $\textit{T}$$_{cycle}$ from that point on.
Table 1. Simulation Results

| # of Inputs | 8-bit key $T_{d}$ (ns) | 8-bit key $T_{cycle}$, First (ns) | 8-bit key $T_{cycle}$, Last (ns) | 16-bit key $T_{d}$ (ns) | 16-bit key $T_{cycle}$, First (ns) | 16-bit key $T_{cycle}$, Last (ns) |
|---|---|---|---|---|---|---|
| 16 | 25.3 | 5.7 | 2.0 | 39.3 | 5.4 | 1.9 |
| 64 | 69.3 | 19.9 | 3.6 | 100.0 | 20.3 | 3.2 |
| 256 | 253.3 | 79.6 | 6.8 | 325.6 | 80.8 | 4.9 |
| 1024 | 1042.2 | 318.4 | 26.2 | 1105.1 | 317.0 | 7.1 |
If $\textit{N}$ > 2$^{k}$, there must be multiple inputs with the same key, as in the case of 1024 inputs with 8-bit keys: such an input set has, on average, 4 elements for each key value, so the efficiency of the bypassing scheme is reduced. Indeed, Fig. 6(a) shows that $\textit{T}$$_{cycle}$ for 1024 inputs decreases more slowly than that for 256 inputs, where only a few inputs can be identical. This implies that the bypassing scheme is more efficient when $\textit{N}$ < 2$^{k}$.
Fig. 6. Change of cycle time: (a) inputs with 8-bit key; (b) inputs with 16-bit key.
The shortest cycle is about 44 times faster than the longest cycle for 1024 inputs with a 16-bit key. If a fixed-period clock were used, the sorting delay ($\textit{T}$$_{d}$) would be 2.25 ${\mu}$s, double the delay of the SG (1.105 ${\mu}$s). The largest difference occurs when sorting 256 inputs with a 16-bit key, where the delay is quadrupled. This shows that employing the variable-period clock makes the engine more than two times faster than using a fixed-period clock.
The speed of the SG is compared to that of other comparison-free sorting engines, as summarized in Table 2. Only the sorting engine in ^{[14]} is implemented using an FPGA, while the others are VLSI designs, so a direct comparison with the other engines is difficult. The number of cycles required to complete the sorting operation is $\textit{k}$+1 for the SG, which is fixed regardless of the number of inputs ($\textit{N}$), while it is proportional to $\textit{N}$ for the other sorting engines. The total sorting time ($\textit{T}$$_{D}$) of the SG is then determined only by the clock cycle time ($\textit{T}$$_{cycle}$), which is why the SG uses the self-timed asynchronous clock to increase the sorting speed.
Table 2. Comparison of SG with other sorting engines

| | SG | in ^{[11]} | in ^{[12]} | in ^{[13]} | in ^{[14]} |
|---|---|---|---|---|---|
| Time complexity | O(N+k) | O(N+k) | O(N+2k) | O(N) | O(N) |
| Restriction | None | None | N > k | N ${\approx}$ 2$^{k}$ | None |
| Technology | 130 nm | 350 nm | 65 nm | 90 nm | FPGA (65 nm) |
| N | 1024 | 63 | 16 | 1024 | 1024 |
| k | 16 | 16 | 16 | 10 | 16 |
| No. of cycles | 17 (k+1) | 78 (N+k-1) | 48 (N+2k) | 2k~3k (2N~3N) | 1024 (N~2N) |
| T$_{cycle}$ | variable | 5 ns | 1.6 ns | 2 ns | 7.8 ns |
| Sorting time (T$_{D}$) | 1105.1 ns | 390 ns | 76.8 ns | 4~6 ${\mu}$s | 8 ${\mu}$s |
| T$_{D}$ / N | 1.08 ns | 6.19 ns | 4.8 ns | 4~6 ns | 7.8 ns |

Because the sorting engines use different key sizes and data sizes in the speed simulations, the sorting time per element ($\textit{T}$$_{D}$/$\textit{N}$) is used to compare their speeds. As shown in Table 2, the SG is the fastest among the comparison-free sorting engines. The sorting rate of the SG is about 1 element per nanosecond, while those of the other engines are 0.13~0.25 elements per nanosecond; thus, the SG is more than 4 times faster than the other engines.