I. INTRODUCTION
               Sorting is a widely used prime function in most of the data-centric applications such
                  as big data [1], image processing [2], database query operations [3], wireless local area network [4], ATM switching [5], etc. Due to the great impact of sorting operations, a lot of software algorithms
                  and hardware implementations have been developed. In many cases, especially for real-time
                  sorting operations [6,7], software-oriented designs are difficult to meet timing constraints. Since hardware
                  implementation has many advantages over software approaches, developing an efficient
                  hardware sorting unit has been a significant challenge to overcome the computational
                  bottleneck of the sorting problems.
               
               Most prior works on hardware sorting designs are implemented based on some modification
                  of traditional sort algorithms or some altered structure of switching network with
                  parallel processing and pipelining stages. Most of these approaches are based on compare-and-swap
                  (CAS) operation which compares the magnitudes of two inputs and switches their positions
                  according to the result. To improve the performance of sorting architectures, parallel
                  data-independent CAS blocks are devised to form sorting network [8-10]. The CAS based sorting architectures recursively move inputs from memory to memory
                  so that they require large memory space and processing time for moving inputs. 
               
               Alternative sorting engines, called comparison-free sorting engines, use some special
                  sorting methods instead of the CAS operation [11-14]. These architectures use a non-comparative sorting algorithm which does not compare
                  between inputs. Due to the absence of recursive moving and swapping of input elements,
                  comparison-free sorting engines can save swapping memories and data load/store time.
               
               The number of comparison-free sorting engines, however, is quite limited compared
                  to CAS based sorting engines, because implementation of existing non-comparative sorting
                  algorithms is much harder than CAS based design. Comparison-free sorting engines use
                  a non-comparative algorithm either by modifying existing algorithm or by developing
                  some special architecture and operations. 
               
               Despite many potential advantages of comparison-free sorting, most of implementations
                  proposed so far are not adequate as a sorting engine for a general use. The main reason
                  is that their architectures use a special operation which requires unusual hardware
                  so that they are not easy to synthesize with usual ASIC design tools. 
               
               In this paper, a comparison-free sorting engine called Sorting Grid is presented.
                  The Sorting Grid (SG) is intended to be used as a general hardware module for ASIC
                  design. The SG is composed of three major blocks, and each block is built by copying
                  a basic unit so that it can be easily synthesized with ASIC tool. It can support stable
                  sorting for inputs with the same key, and even multi-key sorting. Any ASIC chip requiring
                  hardware sorting IP can use this module.  
               
               The relevant hardware sorting implementations are reviewed in Section II. The architecture
                  and operation of the proposed sorting engine are described in Section III. Performance
                  of the new architecture is evaluated by simulations and the results are presented
                  in Section IV and Section V concludes this paper.
               
             
            
                  II. RELATED WORK
               Most of hardware sorting architectures are based on comparative sorting algorithms
                  while only few employ a non-comparative algorithm. 
               
               All hardware sorting engines adopting a comparative sorting algorithm use CAS elements
                  [1-10], and their difference lies in the way of their arrangement in the architectures.
                  Efficient parallel processing and pipelining is the key for high-speed operation.
                  Although some CAS based engines are implemented as an ASIC chip, most of them are
                  implemented with FPGA. Since comparators and local memories are included in most of
                  FPGAs, it is relatively easy to implement CAS based engine with FPGA. The CAS-based
                  architecture needs large processing time and memories for comparing and swapping operations,
                  especially for sorting large number of inputs. 
               
               Some sorting engines employ a non-comparative sorting algorithm to avoid the disadvantage
                  of CAS operation. However, only few comparison-free sorting engines have been developed
                  so far. A binary number sorting engine is presented in [11] which is based on rank-order filter and an efficiently implemented multi-input majority
                  function. Alternatively, the design in [12] uses a rank matrix that assigns relative ranks to the inputs. The rank matrix is
                  updated in accordance with the value of a particular bit in each of the inputs. However,
                  this design could not be used when the number of elements was less than the bit-size
                  of key. One-hot weight matrix is employed for sorting operation in [13]. Each binary input is assigned a unique one-hot-weight and the one-hot-weight matrix
                  is made by the weights of the inputs. Sorting is performed by using the matrix. Due
                  to the inefficiency of the one-hot-weight matrix, however, this engine is competitive
                  only when the key size ($\textit{k}$) is small, and the number of inputs is about
                  2$^{k}$. Largest element detection scheme is used for sorting in [14]. By employing bit-serial scanning of inputs, it finds the largest element at the
                  first scan and returns the address of the element. The second largest element is sorted
                  in the next san, and so on. One element is sorted by one scan so that sorting is performed
                  by repeating the scans $\textit{N}$ times.
               
               Most of comparison-free sorting engines are implemented as an ASIC chip. Generally,
                  comparison-free sorting engines use a special sorting scheme which is not employed
                  in any other ASIC chips. Therefore, most of their components are not supported in
                  commercial FPGA chips so that implementation with FPGA is either impossible or very
                  inefficient. The uniqueness and unfamiliarity of the architecture are the disadvantage
                  of comparison-free sorting engines as a generally usable sorting engine.
               
             
            
                  III. THE SORTING GRID
               A sorting-engine module called Sorting Grid (SG) is developed to be used as a hardware
                  IP for ASIC design. It can be used as an easily synthesizable hardware module for
                  any ASIC chip which requires high-speed sorting operation. The structure and operation
                  of SG are described in this section.
               
               
                     1. Block Diagram of The Sorting Grid 
                  Operation of the SG is based on radix sort algorithm. Radix sort is a well-known sorting
                     algorithm. It is a non-comparative sorting algorithm which avoids comparison by creating
                     and distributing data into buckets according to their radix [15]. From the most significant digit (MSD), this bucketing process is repeated for each
                     digit while preserving the ordering of the prior step, until all digits have been
                     considered.
                  
                  The SG uses a novel architecture to implement radix-2 sort algorithm. Fig. 1 shows the structure of SG. It consists of four main blocks: I/O register block, grid
                     block, bin management block, and control block. The I/O register block is composed
                     of conventional shift registers. To sort $\textit{N}$ of $\textit{m}$-bit binary inputs,
                     $\textit{N}$ of ($\textit{m}$+1)-bit shift registers are used to store both of the
                     inputs and sorted data. One additional bit is automatically inserted when each input
                     element is loaded to the registers. Before sorting, input elements are loaded to the
                     registers, and those are sorted in ($\textit{m}$+1) cycles. The most significant bits
                     (MSB) of the $\textit{N}$-registers are used for sorting operation in each cycle and
                     discarded by shifting at the end of the cycle. The least significant bits (LSB) of
                     the registers are filled with the outputs of the bin management block, as in Fig. 1. By repeating this operation ($\textit{m}$+1) times, the original input elements
                     are expelled from the registers when sorting is completed, and the registers are filled
                     by the sorted elements.
                  
                  
                        Fig. 1. Block diagram of the Sorting Grid.
 
                  The grid block is an array of $\textit{N}$${\times}$$\textit{N}$ grid units. The principal
                     function of the grid unit is a 1${\times}$2 demultiplexer and the array is used to
                     count the number of 1s in a bin. The bin management (BM) block creates bins, deactivates
                     useless grid units, and controls progress of a cycle. The flow of entire sorting process
                     is governed by the control block.
                  
                
               
                     2. Operation of The Sorting Grid 
                  The operation of the SG is explained by using a simple example. The SG supports two
                     types of inputs. The one type is key-only inputs where keys themselves are the objects
                     of sorting. In many applications, raw data are the keys to be sorted. The value of
                     key is everything they need so that no other information is necessary. For this type
                     of inputs, stability of sorting is not necessary because there is no way to distinguish
                     two identical data.
                  
                  The other type of inputs is that key is only a part of input element. Typically, an
                     input element is composed of a key and an object. The key represents a weight of the
                     object. Sorting is performed to determine an order of the objects, although it is
                     done using keys. Various objects can be possible for sorting problems so that a pointer
                     is used to retrieve the object. For this type of inputs, stability is an important
                     property of sorting operation. 
                  
                  For the $\textit{m}$-bit ($\textit{m}$ = $\textit{k}$+$\textit{p}$) inputs with $\textit{k}$-bit
                     key and $\textit{p}$-bit pointer, the I/O block requires ($\textit{m}$+1)-bit shift
                     registers which can be divided by three fields (key, TB, and pointer). Keys must have
                     a fixed size. One bit called tie-break (TB) bit is inserted (with 0) between key field
                     and pointer field. This bit is used to achieve stable sort when multiple inputs have
                     the same key. This bit is inserted/removed by the control block when data is transmitted
                     between I/O block and external systems. For key-only inputs, TB-bit is not necessary,
                     so that $\textit{m}$(=$\textit{k}$)-bit register is sufficient for sorting. The value
                     of $\textit{k}$, and $\textit{p}$ must be given to the control block before input
                     transmission.
                  
                  Fig. 2 presents the sorting process of SG. For example, assume that there are 8 objects
                     ($\textit{D}$$_{1}$, $\textit{D}$$_{2}$, ... $\textit{D}$$_{8}$) of which the weights
                     are marked by 4-bit binary keys (8, 5, 15, 10, 5, 7, 5, 11). Among 8 objects, three
                     have the same key. As shown in Fig. 2(a), 8-bit registers are used to store the input elements. In addition to 4-bit key field,
                     1 bit for stability, and 3-bit for index field are required. Instead of pointer, index
                     of object is used to save area and operation time.
                  
                  Before sorting begins, input elements are loaded to I/O registers, all grid units
                     are activated, and entire array belongs to one bin. At the first cycle, only the right-bottom
                     input of the grid array has 0. Let us call this 0- value as a ball. According to the
                     value of MSBs of the I/O registers, grid units pass the ball to the top of the array
                     as in Fig. 2(a). 
                  
                  
                        Fig. 2. Example for operation of the sorting grid: (a) the first cycle; (b) the second cycle; (c) the third cycle; (d) the fourth cycle; (e) the fifth cycle (tie-break cycle); (f) the 6~8-th cycles (transmission cycle).
 
                  Behavior of a grid unit is controlled mainly by the value MSBs of the I/O registers.
                     A grid unit passes the ball to its upper grid when MSB=1, but to left-up grid when
                     MSB=0. Since the ball moves left only when MSB=0, the ball arrives at the top of $\textit{j}$-th
                     column such that $\textit{n}$-$\textit{j}$ is the same as the number of 0s in the
                     MSB of the registers.
                  
                  As soon as the $\textit{j}$-th BM-unit receives the ball, it divides the array into
                     two bins. The right/left columns become 0s/1s bin. According to the bin-decision the
                     memory cell in each BM-unit is written with proper value as in Fig. 2(a). After bin-decision, useless grid units in the bin are deactivated with kill signals:
                     K-0 and K-1. The K-0 (kill 0) signal activated in 1s bin disables all grid units with
                     MSB=0, while K-1 (kill 1) signal in 0s bin disables grid units with MSB =1. If a grid
                     unit is deactivated, it just passes a ball to upper unit regardless of the value of
                     MSB. As a result, $\textit{j}$${\times}$$\textit{j}$ active grid units are left in
                     1s bin and ($\textit{n}$-$\textit{j}$)${\times}$($\textit{n}$-$\textit{j}$) active
                     units form a new array in 0s bin as in Fig. 2(b). At the end of each cycle, the contents of shift registers are shifted left so that
                     the bits in MSB are discarded and the bits in the next digit move to MSB while the
                     contents in the memories in BM blocks shifted to the LSB of the registers as in Fig. 2(b).
                  
                  The operations in the cycles 2, 3 and 4 are the same as those of in the first cycle
                     except the fact that there are multiple bins, and the operations in the first cycles
                     are applied to every bin simultaneously. At the right-bottom of each bin, a new ball
                     is placed, and it propagates to the top of each bin. Receiving one of the balls, BM-block
                     divides the bin and kills unnecessary grid units in the new bins. Then, the register
                     is shifted with the contents of memory in BM-block. The first 4 cycles ($\textit{k}$
                     cycles in general) are performed with keys so that these cycles are named as sorting
                     cycle.
                  
                  When no identical key exists in input elements, only one active grid-unit should be
                     left in every column of the grid array. However, multiple active units exist in a
                     column when some elements have the same key as in Fig. 2(e). One additional cycle, called tie-break cycle, is required to set an order among
                     those elements. By making their order as the same as their input order, the entire
                     sorting operation can be a stable sort. The tie-break cycle operates as the same as
                     sorting cycle, except that the $\textit{T}$ signal is generated instead of the KILL-signal.
                     The $\textit{T}$ signal deactivates unnecessary grid units.
                  
                  After tie-break cycle, only one active grid unit is left in every column of the array
                     as in Fig. 2(f). The rest of cycles are called transmission cycles, which just pass the bits in MSB
                     of the registers to LSB via active grid units and memories in the BM-block. The transmission
                     cycle operates the same way as the sorting cycle but the deactivating process with
                     KILL signal is not necessary.
                  
                  As shown in Fig. 2, the I/O registers are filled initially with input elements but gradually replaced
                     by outputs and finally filled with sorted elements. Therefore, it requires minimal
                     memory space for data which can greatly reduce hardware size.
                  
                
               
                     3. Implementation of The Sorting Grid
                  Each block of SG in Fig. 1 is implemented as a VLSI circuit. The circuit of a grid unit is shown in Fig. 3. The principal function of the grid unit is a 1${\times}$2 demultiplexer. The grid
                     unit is a 1${\times}$2 demultiplexer with some additional features such as deactivating
                     logics and dummy path. Behavior of a grid unit is controlled mainly by the value $\textit{D}$$_{i}$,
                     $\textit{I}$$_{j}$, and internal latch. $\textit{D}$$_{i}$is a bit from an input element
                     which comes from the MSB of the $\textit{i}$-th I/O register. All grid units in the
                     $\textit{i}$-th row are controlled by $\textit{D}$$_{i}$. If $\textit{D}$$_{i}$=1,
                     the ($\textit{i, j}$)-th grid unit ($\textit{G}$$_{ij}$) passes received ball signal
                     ($\overline{B_{ij}})$ to $\textit{G}$$_{i-}$$_{1}$$_{,j}$ while $\overline{B_{ij}}$
                     to $\textit{G}$$_{i-}$$_{1}$$_{,j-}$$_{1}$ if $\textit{D}$$_{i}$=0. A grid unit can
                     be disabled by its internal latch. If the latch is set, the grid unit is disabled
                     so that it just passes ball to the upper unit ($\textit{G}$$_{i-}$$_{1}$$_{,j}$) regardless
                     of the value $\textit{D}$$_{i}$. At the beginning of the sorting operation, internal
                     latches of all grid units are reset by RESET signal. During sorting process, grid
                     units are selectively disabled by setting the latch through one of kill signals (K-0
                     or K-1). Once a unit is disabled, it cannot be reactivated until the sorting is finished.
                     The isolation signal, $\textit{I}$$_{j}$ is used to isolate two adjacent columns in
                     the grid array. A column of the grid array, called a ladder, corresponds to an output
                     element in a bin. The activated $\textit{I}$$_{j}$, ($\textit{I}$$_{j}$=1) becomes
                     a border of a bin so that a bin is defined by two activated isolation signals, and
                     the number of elements in the bin is the number of columns between them. The ball
                     signal ($\overline{B_{j}})$ cannot cross any border of bins. The isolation signal
                     $\textit{I}$$_{j}$ activates an auxiliary ball path $\overline{AB_{j}}~ $ to provide
                     a path for ball signals instead of crossing the border.
                  
                  
                        Fig. 3. Circuit of the grid unit.
 
                  The bin management (BM) block creates bins, deactivates useless grid units, and controls
                     progress of a cycle. It consists of a series of $\textit{N}$ BM-units as presented
                     in Fig. 4. Each BM-unit controls a ladder. If a ball (value 0) arrives at the top of $\textit{j}$-th
                     column (i,e. $\textit{B}$$_{0j}$=1), the $\textit{j}$-th BM-unit divides its bin into
                     two bins by activating $\textit{I}$$_{j}$$_{+1}$. The ladders in the right of $\textit{j}$-th
                     ladder become 0 s bin and the other ladders become 1 s bin. After creating new bins,
                     the memory cells in the 1 s (0 s) bin are written by 1 (0). The signal $\overline{F_{j}}$
                     is used to inform that the ladder $\textit{j}$ is ready for the next operation. The
                     control block can check the states of BM-units with $\overline{F_{j}}$ signals.
                  
                  
                        Fig. 4. Block diagram of the bin-management unit.
 
                  The flow of entire sorting process is governed by the control block. The control block
                     generates various asynchronous signals as soon as possible for high-speed operation.
                     The entire sorting operation is performed in $\textit{m}$+1 cycles but each cycle
                     is performed with different latencies. A small bin can be processed more quickly than
                     a large bin. The number of bins increases while the sizes of bins decrease along the
                     cycle. Because all bins are sorted in parallel, the processing time for a cycle keeps
                     decreasing. For this reason, the SG avoids using the fixed period clock. Instead,
                     it uses a self-timed asynchronous clock which flips as soon as possible by checking
                     the states of BM-units. However, it uses a fixed external clock when the I/O block
                     sends/receives data to/from external systems.
                  
                  Note that three main blocks, the grid array, the I/O block, and the BM-block, have
                     a modular structure. Each block is made by copying a basic unit. Therefore, the engine
                     could be easily synthesized if hardware IP for the grid unit and the BM-unit are available
                     since shift register is included already as a standard cell for most of ASIC tools.
                     The control block can also be easily synthesized by using the conventional control
                     logic design method.
                  
                
               
                     4. Performance Enhancement Scheme 
                  Two speed enhancement schemes are developed to enhance the performance of the SG.
                     A self-timed asynchronous clock is used to avoid the inefficiency of the fixed period
                     clock, and a bypassing scheme is employed to decrease the delay of ball propagation
                     time in the grid block. 
                  
                  $\textit{N}$ of $\textit{m}$-bit inputs with $\textit{k}$-bit key and $\textit{p}$-bit
                     pointer are sorted in ($\textit{k}$+$\textit{p}$+1) cycles with the SG. If a fixed
                     period clock is used in the operation, the period of the clock, $\textit{T}$$_{clk}$,
                     should be decided as the longest cycle among ($\textit{k}$+$\textit{p}$+1) cycles,
                     and the total delay of sorting ($\textit{T}$$_{d}$) becomes ($\textit{m}$+1)$\textit{T}$$_{clk}$.
                     Since the longest cycle time is proportional to $\textit{N}$, time complexity of the
                     proposed engine would be $\textit{O}$($\textit{mN}$) when the fixed period clock is
                     used. However, the execution time of the longest cycle could be tens of times longer
                     than that of the shortest cycle, so that fixed period clock is quite inefficient in
                     performance.
                  
                  The execution time of a cycle ($\textit{T}$$_{cycle}$) can be analyzed with 4 components.
                  
                  where $\textit{T}$$_{col}$ is the time that a ball propagates from the bottom to the
                     top of the array, and $\textit{T}$$_{BM}$ is the time for setting BM-units from the
                     ball position to the boarders of the bin. $\textit{T}$$_{C}$ is the time spent in
                     the control block and $\textit{T}$$_{R}$ is restoring time for the next cycle. The
                     contributions of $\textit{T}$$_{C}$and $\textit{T}$$_{R}$ to total sorting time are
                     small, and their variations along the cycles are negligible compared to the other
                     two terms, so that they can be regarded as constants. $\textit{T}$$_{col}$ is proportional
                     to the number of grid units in a column and $\textit{T}$$_{BM}$ is proportional to
                     bin size, i.e. the number of columns in a bin. The total delay of sorting $\textit{T}$$_{d}$
                     is
                  
                  
                  where, $\textit{t}$$_{g}$ is the delay of a grid unit, $\textit{b}$$_{\mathrm{max}}$
                     is the largest bin size in cycle $\textit{j}$, $\textit{t}$$_{B}$ is the delay of
                     a BM-unit, and ${\alpha}$ = $\textit{T}$$_{C}$+ $\textit{T}$$_{R}$ is a constant.
                  
                  The longest $\textit{T}$$_{cycle}$ occurs in the first cycle because the bin size
                     is biggest at the first cycle. Although the number of bins increases along the cycle,
                     the size of each bin decreases. Since all bins are processed simultaneously, $\textit{T}$$_{BM}$
                     is determined by the biggest bin in that cycle. Because $\textit{b}$$_{\mathrm{max}}$
                     decreases monotonically, $\textit{T}$$_{cycle}$ decreases along the cycle, and all
                     bins have only one element after the tie-break ($\textit{k}$+1) cycle. Therefore,
                     the worst-case cycle time at the first cycle is
                  
                  
                  while the shortest cycle time in transmission cycles is
                  
                  Eqs. (3) and (4) show a large difference between the longest and shortest cycle times. By reflecting
                     this result, the new sorting engine uses a self-timed asynchronous clock to increase
                     operation speed. The clock flips by detecting the end of critical processes so that
                     each cycle starts as soon as possible. By employing the variable period clock, the
                     time complexity of $\textit{T}$$_{BM}$ can be reduced from $\textit{O}$($\textit{Nm}$)
                     to $\textit{O}$($\textit{N+m}$) in the best case. Assume that inputs are well balanced
                     so that balls bisect bins in every cycle, then $\textit{b}$$_{\mathrm{max,j}}$ would
                     be $\textit{N}$/2$^{j}$. The $\textit{b}$$_{\mathrm{max,j}}$ decreases until it becomes
                     1, and rest of the cycles are working with the same $\textit{T}$$_{BM}$. If $\textit{N}$
                     < 2$^{k}$, approximately,
                  
                  
                  which has $\textit{O}$($\textit{N+m}$) time complexity. The distribution of pseudo-random
                     inputs shows similar patterns as the best case.
                  
                  Unlike $\textit{T}$$_{BM}$, $\textit{T}$$_{col}$ is not changed even though the number
                     of active grid units in a column keeps decreasing. The columns in a bin are localized
                     in the array, but the active grid units are scattered over entire column. Even if
                     most of units in a column are disabled, a ball should pass those units so that a ball
                     must pass $\textit{N}$ grid units regardless of the number of inactive units. Although
                     $\textit{T}$$_{BM}$ is much larger than $\textit{T}$$_{col}$ in early cycles, $\textit{T}$$_{col}$
                     becomes the dominant component from few cycles later causing $\textit{T}$$_{d}$ to
                     remain in $\textit{O}$($\textit{Nm}$).
                  
                  To resolve this problem, a bypass scheme is developed as in Fig. 5. The purpose of the bypass scheme is to skip consecutive inactive grid units in ball-propagation.
                     A column is divided into several groups such that every group consists of consecutive
                     $\textit{q}$ grid units and has a bypassing line. If all grid units in a group are
                     disabled, a ball does not pass the grid units in the group, but the ball is sent to
                     the next group through the bypassing line. The bypassing scheme is very effective
                     in later cycles since lots of grid units are disabled.
                  
                  
                        Fig. 5. Basic grid block with the first level bypassing circuit.
 
                  Fig. 5 shows the basic block of grid units and implementation of the first level bypassing
                     scheme. Four grid units with a repeater form a basic grid-block which is used as a
                     basic module of ASIC library. With the first level bypassing scheme, only 4 times
                     is the maximum achievable speed-up. For greater speed up, multi-level bypassing scheme
                     could be employed. The number of levels and the number of grid units in a group should
                     be decided considering the applications and hardware overheads.
                  
                  Similarly, with $\textit{T}$$_{BM}$, time complexity of $\textit{T}$$_{col}$ can be
                     $\textit{O}$($\textit{N+m}$) in the best case with an optimal input distribution and
                     the maximal level bypassing scheme. Therefore, the overall latency ($\textit{T}$$_{d}$)
                     of the sorting engine can achieve $\textit{O}$($\textit{N+m}$) time-complexity, but
                     $\textit{O}$($\textit{Nm}$) in the worst-case distribution.
                  
                
             
            
                  IV. PERFORMANCE EVALUATION
               The performance of SG is evaluated through simulations with pseudo-random inputs.
                  The simulation is performed by HSPICE with 1.2 V 0.13 ${\mu}$m-process model parameter.
                  Simulations are performed with several input sizes and key sizes. For each type of
                  inputs, 10 samples are simulated, and their average values are used as the results.
               
               Two groups of inputs are used in simulation where one group uses 8-bit key and the
                  other does 16-bit key. The inputs with 16, 64, 256, and 1024 elements are used. Key-only
                  inputs are used for simulations because transmission cycles are just moving data with
                  a constant cycle time. Multi-level bypassing scheme is used such that every 4 block
                  is grouped as a bypassing unit. For example, 4-level (maximum level) bypassing scheme
                  is used in sorting 1024 inputs.
               
               Table 1 shows the simulation results. The proposed sorting engine sorts 8-bit 1024 inputs
                  in 1.042 ${\mu}$s. Its rate is about one element per nano-second. Even when the key
                  size is increased twice, the sorting time increases by 55% for 16 inputs while it
                  increases by 29% only for 256 inputs. For 1024 inputs, the sorting time increases
                  only by 6%. One reason of such an amazing result is that 1024 inputs with 8-bit key
                  include too many identical inputs since $\textit{N}$ > 2$^{k}$, while $\textit{N}$
                  < 2$^{k}$ for 16-bit key inputs. The efficiency of bypassing scheme drops when inputs
                  include many same-key inputs.
               
               As in Table 1, the gap in cycle times between the first cycle and the last cycle is very large.
                  The difference is getting bigger for a larger number of inputs. Fig. 6 shows how the cycle time ($\textit{T}$$_{cycle}$) changes along the cycles. $\textit{T}$$_{cycle}$
                  decreases rapidly along the cycle and is saturated at a point. It is because $\textit{T}$$_{col}$
                  and $\textit{T}$$_{BM}$ keep decreasing but $\textit{T}$$_{C}$+$\textit{T}$$_{R}$
                  does not. Therefore, $\textit{T}$$_{C}$+$\textit{T}$$_{R}$ becomes the dominant factor
                  of $\textit{T}$$_{cycle}$ from the point.
               
               
                     Table 1. Simulation Results
                  
                        
                           
                              | 
                                 
                              								
                               # of Input 
                              							
                            | 
                           
                                 
                              								
                               8-bit Key 
                              							
                            | 
                           
                                 
                              								
                               16-bit Key 
                              							
                            | 
                        
                        
                              | 
                                 
                              								
                               $T_{d}$ (ns) 
                              							
                            | 
                           
                                 
                              								
                               $T_{cycle}$ (ns) 
                              							
                            | 
                           
                                 
                              								
                               $T_{d}$ (ns) 
                              							
                            | 
                           
                                 
                              								
                               $T_{cycle}$ (ns) 
                              							
                            | 
                        
                        
                              | 
                                 
                              								
                               First 
                              							
                            | 
                           
                                 
                              								
                               Last 
                              							
                            | 
                           
                                 
                              								
                               First 
                              							
                            | 
                           
                                 
                              								
                               Last 
                              							
                            | 
                        
                     
                     
                           
                              | 
                                 
                              								
                               16 
                              							
                            | 
                           
                                 
                              								
                               25.3 
                              							
                            | 
                           
                                 
                              								
                               5.7 
                              							
                            | 
                           
                                 
                              								
                               2.0 
                              							
                            | 
                           
                                 
                              								
                               39.3 
                              							
                            | 
                           
                                 
                              								
                               5.4 
                              							
                            | 
                           
                                 
                              								
                               1.9 
                              							
                            | 
                        
                        
                              | 
                                 
                              								
                               64 
                              							
                            | 
                           
                                 
                              								
                               69.3 
                              							
                            | 
                           
                                 
                              								
                               19.9 
                              							
                            | 
                           
                                 
                              								
                               3.6 
                              							
                            | 
                           
                                 
                              								
                               100.0 
                              							
                            | 
                           
                                 
                              								
                               20.3 
                              							
                            | 
                           
                                 
                              								
                               3.2 
                              							
                            | 
                        
                        
                              | 
                                 
                              								
                               256 
                              							
                            | 
                           
                                 
                              								
                               253.3 
                              							
                            | 
                           
                                 
                              								
                               79.6 
                              							
                            | 
                           
                                 
                              								
                               6.8 
                              							
                            | 
                           
                                 
                              								
                               325.6 
                              							
                            | 
                           
                                 
                              								
                               80.8 
                              							
                            | 
                           
                                 
                              								
                               4.9 
                              							
                            | 
                        
                        
                              | 
                                 
                              								
                               1024 
                              							
                            | 
                           
                                 
                              								
                               1042.2 
                              							
                            | 
                           
                                 
                              								
                               318.4 
                              							
                            | 
                           
                                 
                              								
                               26.2 
                              							
                            | 
                           
                                 
                              								
                               1105.1 
                              							
                            | 
                           
                                 
                              								
                               317.0 
                              							
                            | 
                           
                                 
                              								
                               7.1 
                              							
                            | 
                        
                     
                  
                
               If $\textit{N}$ > 2$^{k}$, there must be multiple inputs with a same key as the case
                  of 1024 inputs with 8-bit. This input may have 4 elements for any key value so that
                  the efficiency of the bypassing scheme is reduced. It can be detected in Fig. 6(a) that $\textit{T}$$_{cycle}$ of 1024 inputs decreases slowly than that of 256 inputs
                  where only few inputs may be the same. It implies that the bypassing scheme is more
                  efficient when $\textit{N}$ < 2$^{k}$.
               
               
                     Fig. 6. Change of cycle time: (a) inputs with 8-bit key; (b) inputs with 16-bit key.
 
               The shortest cycle time is about 44 times faster than the longest cycle for 1024 inputs
                  with 16-bit key. If a fixed period clock is used, the sorting delay ($\textit{T}$$_{d}$)
                  would be 2.25 ${\mu}$s (16${\times}$317 ns) which is double the delay of SG (1.105
                  ${\mu}$s). The largest difference occurs when sorting 256 inputs with 16-key, which
                  the delay is quadrupled. It proves that employing the variable period clock makes
                  the engine more than two times faster than using a fixed cycle clock.
               
               The speed of SG is compared to that of other comparison-free sorting engines. The
                  comparison is summarized in Table 2. Only the sorting engine in [14] is implemented using FPGA while the others are done with VLSI design, thus it has
                  difficulties in direct comparison with other engines. The number of cycles required
                  to complete sorting operation is $\textit{k}$+1 for SG which is fixed regardless of
                  the number of inputs ($\textit{N}$), while it is proportional to $\textit{N}$ for
                  the other sorting engines. The total sorting time ($\textit{T}$$_{D}$) of SG is only
                  affected by clock cycle time ($\textit{T}$$_{cycle}$) so that SG uses the self-timed
                  asynchronous clock to increase the sorting speed.
               
               
                     Table 2. Comparison of SG with other sorting engines
                  
                        
                           
                              | 
                                 
                              								
                               
                              							
                             | 
                           
                                 
                              								
                               SG 
                              							
                            | 
                           
                                 
                              								
                               in [11] 
                              							
                            | 
                           
                                 
                              								
                               in [12] 
                              							
                            | 
                           
                                 
                              								
                               in [13] 
                              							
                            | 
                           
                                 
                              								
                               in [14] 
                              							
                            | 
                        
                     
                     
                           
                              | 
                                 
                              								
                               Time complexity 
                              							
                            | 
                           
                                 
                              								
                               O(N+k) 
                              							
                            | 
                           
                                 
                              								
                               O(N+k) 
                              							
                            | 
                           
                                 
                              								
                               O(N+2k) 
                              							
                            | 
                           
                                 
                              								
                               O(N) 
                              							
                            | 
                           
                                 
                              								
                               O(N) 
                              							
                            | 
                        
                        
                              | 
                                 
                              								
                               Restriction 
                              							
                            | 
                           
                                 
                              								
                               None 
                              							
                            | 
                           
                                 
                              								
                               None 
                              							
                            | 
                           
                                 
                              								
                               N > k 
                              							
                            | 
                           
                                 
                              								
                               N${\approx}$2$^{k}$ 
                              							
                            | 
                           
                                 
                              								
                               None 
                              							
                            | 
                        
                        
                              | 
                                 
                              								
                               Technology 
                              							
                            | 
                           
                                 
                              								
                               130 nm 
                              							
                            | 
                           
                                 
                              								
                               350 nm 
                              							
                            | 
                           
                                 
                              								
                               65 nm 
                              							
                            | 
                           
                                 
                              								
                               90 nm 
                              							
                            | 
                           
                                 
                              								
                               FPGA (65 nm) 
                              							
                            | 
                        
                        
                              | 
                                 
                              								
                               N 
                              							
                            | 
                           
                                 
                              								
                               1024 
                              							
                            | 
                           
                                 
                              								
                               63 
                              							
                            | 
                           
                                 
                              								
                               16 
                              							
                            | 
                           
                                 
                              								
                               1024 
                              							
                            | 
                           
                                 
                              								
                               1024 
                              							
                            | 
                        
                        
                              | 
                                 
                              								
                               k 
                              							
                            | 
                           
                                 
                              								
                               16 
                              							
                            | 
                           
                                 
                              								
                               16 
                              							
                            | 
                           
                                 
                              								
                               16 
                              							
                            | 
                           
                                 
                              								
                               10 
                              							
                            | 
                           
                                 
                              								
                               16 
                              							
                            | 
                        
                        
                              | 
                                 
                              								
                               No. of cycles 
                              							
                            | 
                           
                                 
                              								
                               17 
                              							
                            | 
                           
                                 
                              								
                               78 
                              							
                            | 
                           
                                 
                              								
                               48 
                              							
                            | 
                           
                                 
                              								
                               2k~3k 
                              							
                            | 
                           
                                 
                              								
                               1024 
                              							
                            | 
                        
                        
                              | 
                                 
                              								
                               ( k+1 ) 
                              							
                            | 
                           
                                 
                              								
                               ( N+k-1 ) 
                              							
                            | 
                           
                                 
                              								
                               ( N+2k ) 
                              							
                            | 
                           
                                 
                              								
                               (2N~3N) 
                              							
                            | 
                           
                                 
                              								
                               (N~2N) 
                              							
                            | 
                        
                        
                              | 
                                 
                              								
                               T$_{\mathrm{cycle}}$ 
                              							
                            | 
                           
                                 
                              								
                               variable 
                              							
                            | 
                           
                                 
                              								
                               5 ns 
                              							
                            | 
                           
                                 
                              								
                               1.6 ns 
                              							
                            | 
                           
                                 
                              								
                               2 ns 
                              							
                            | 
                           
                                 
                              								
                               7.8 ns 
                              							
                            | 
                        
                        
                              | 
                                 
                              								
                               Sorting Time (T$_{D}$) 
                              							
                            | 
                           
                                 
                              								
                               1105.1 ns 
                              							
                            | 
                           
                                 
                              								
                               390 ns 
                              							
                            | 
                           
                                 
                              								
                               76.8 ns 
                              							
                            | 
                           
                                 
                              								
                               4~6 ms 
                              							
                            | 
                           
                                 
                              								
                               8 ms 
                              							
                            | 
                        
                        
                              | 
                                 
                              								
                               T$_{D}$ / N 
                              							
                            | 
                           
                                 
                              								
                               1.08 ns 
                              							
                            | 
                           
                                 
                              								
                               6.19 ns 
                              							
                            | 
                           
                                 
                              								
                               4.8 ns 
                              							
                            | 
                           
                                 
                              								
                               4~6 ns 
                              							
                            | 
                           
                                 
                              								
                               7.8 ns 
                              							
                            | 
                        
                     
                  
                
               Because the sorting engines use different key sizes and data sizes in speed simulation,
                  the sorting time per element ($\textit{T}$$_{D}$/$\textit{N}$) is used to compare
                  their speeds. As in Table 2, SG is the fastest among all comparison-free sorting engines. The sorting rate of
                  SG is about 1 element per 1 ns while those of other engines are 0.13 ~ 0.25 element
                  per 1 ns, where SG is more than 4 times faster than other engines.