A 675.2 nJ/frame Keypoint Extraction Processor for LiDAR-based SLAM
Kyuho Lee1
-
(School of Electrical and Electronics Engineering, Yonsei University, Seoul, Republic
of Korea)
Copyright © The Institute of Electronics and Information Engineers(IEIE)
Index Terms
Keypoint extraction, LiDAR-based SLAM, LeGO-LOAM, radix sorting
I. INTRODUCTION
Simultaneous Localization and Mapping (SLAM) is widely utilized in robot and autonomous
driving systems for seamless 3D map reconstruction tasks such as object recognition,
path planning, and object tracking [1-
3]. SLAM has been researched using various sensors, including camera-based visual SLAM
using RGB data, and LiDAR-based SLAM utilizing 3D point LiDAR data. Especially, LiDAR-based
SLAM is highlighted for its ability to overcome the limitations of visual SLAM, such
as narrow field-of-view (FOV) and challenges in scenarios with insufficient lighting
conditions.
LiDAR odometry and mapping (LOAM) [4] is a LiDAR-based SLAM method that demonstrates high accuracy with low operation latency.
With the outstanding performance that LOAM shows, various derivative LOAMs are proposed
[5-
7]. In particular, LeGO-LOAM [5] demonstrates real-time operation with underground vehicle (UGV) system with its two-track
keypoint extraction (KE) process.
The KE process in LeGO-LOAM extracts two types of feature points from the 3D point
clouds: 1) edge features are designated to the points with high curvature values which
indicates roughness in its local region. 2) plane features are extracted with the
points having low curvature value.
Fig. 1 shows the KE process in LeGO-LOAM from range map to feature points. Range map consists
of 3D point data captured from LiDAR sensor indicating the distance between the object
and the sensor. The curvature map indicates how flattened or bulged the point is,
calculated from the range map by calculating the difference between the reference
point and adjacent points. After the curvature map is transformed from the range map,
curvature map is sorted, for the extraction of keypoints from the Top-K sorted curvature
data. Extracted keypoints are used for LiDAR odometry and mapping optimization process.
Fig. 1. Flow of LOAM process and keypoint extraction process in detail.
Fig. 2 illustrates the challenges of KE process. In LeGO-LOAM, keypoints are extracted from
a curvature map with dimensions of 1800 $\times$ (# of LiDAR Channels). In case of
VLP-16 LiDAR, which consists of 16 channels, sorts 28.8K of curvature points. Even
though LeGO-LOAM divides each channel into small partitions, named sub-images, sorting
remains as the main bottleneck of KE accounting for 85.9% of the total latency. Additionally,
due to the dynamic surroundings of LiDAR, the number of valid points detected by LiDAR
varies up to 31.9$\times$ in channel-wise, and up to 5.1$\times$ frame-wise, while
the number of keypoints extracted per channel is constant. This results in a variance
in the workload of the sorting process, which causes unnecessary power consumption.
Fig. 2. Challenges of keypoint extraction process due to heavy computation, and workload
variance.
This paper proposed a KE processor to achieve low-latency acceleration of KE with
the following key features: 1) near-keypoint removal, which reduces the total number
of points in the sorting process to alleviate the high computational burden, resulting
in 43.0% reduction in latency, and 2) reconfigurable radix sorting (RRS) to resolve
workload variance problem by selecting high-speed or low-power mode for the sorting
process based on the volume of data to be sorted, enabling energy-efficient KE process.
The rest of this paper is organized as follows. In Section II, the proposed keypoint
extraction core architecture is explained in detail with candidate filtering block
and reconfigurable radix sorting. Implementation and evaluation results are explained
in Section III, followed by a conclusion in Section IV.
II. ARCHITECTURE
Fig. 3 shows the overall architecture of the proposed energy-efficient keypoint extraction
core (KEC). It consists of 8 keypoint extraction (KE) units, curvature transform circuit
(CTC), 50 KB of Range memory (RMEM) for storing range map, 50 KB of Curvature Memory
(CMEM) for storing curvature data which is used in sorting, and KE process.
Fig. 3. Overall architecture of proposed Keypoint Extraction Core.
Curvature transform circuit (CTC) converts the range map into a curvature map. KE
unit consists of two radix sorting buffers (RSB), curvature filtering block (CFB),
and radix controller. RSB stores curvature data based on its value at the corresponding
radix step. In the sorting process of curvature data, CFB excludes the curvature points
that are not considered for extraction as keypoints, thereby reducing the time complexity
of the process. The radix controller operates to determine the radix and radix step
for the reconfigurable radix sorting process.
For keypoints extraction process, KEC stores range map data in the RMEM. The range
data stored in RMEM is transformed into curvature map data through CTC which consists
of shift-based MAC PE, and then the transformed data is stored in CMEM. After conversion,
curvature data undergo a radix sorting process. A single chunk for sorting is stored
in the radix buffer within the radix sorting block based on its digit, and is then
determined to be written back into CMEM or discarded in CFB. Within each radix step,
the number of curvature data is reduced by iterative filtering. After sorting, keypoints
are extracted from the sorted curvature, and used for LiDAR odometry and mapping process.
1. Candidate Filtering for Near-Keypoint Removal
In Fig. 4, the conventional KE process is explained. In the conventional stage, near-keypoint
data that are adjacent to the extracted keypoints are excluded from the feature candidate
after a single keypoint is extracted. This is to prevent the extraction of redundant
keypoints, causing a processing time bottleneck for the sorting process.
Fig. 4. Conventional stage for keypoint extraction process and proposed radix sorting
based process.
Sorting accounts for 85.9% of total KE processing latency, and this latency proportionally
increases with the number of data being sorted. However, only 7.9% of the total curvature
is extracted as keypoints, which means sorting for redundant data that is not extracted
as keypoints brings the main bottleneck for the total process.
In Fig. 4, a new KE process is proposed, which excludes near-keypoint curvature data during
radix sorting step. Also, a candidate filtering block (CFB), to support this new KE
process, is proposed.
In KEC, curvature data in CMEM is loaded into the radix sorting buffer (RSB), and
stored in the buffer based on the digit in the corresponding radix step. After all
data are stored in RSB, CFB determines whether curvature data are written back into
CMEM or discarded. In CFB, input curvature is compared to keypoint candidate data
stored in the local maximum register. The decision unit determines the update of the
local max register and decides whether to write back or discard the curvature data,
referencing the index monitor.
Fig. 5. Near-keypoint removal comparing with local max keypoints, and three cases
for decision unit in candidate filtering block (CFB).
Fig. 5 shows three cases of operation of the decision unit in CFB while the curvature is
being loaded to the block. 1) If the input curvature is larger than the stored local
max data: the local max register is updated with the input curvature, considering
it as a new keypoint candidate. 2) If the input curvature is smaller than the value
stored in the local max register, the local max register is not updated, and the input
curvature is monitored with its index to determine whether it is adjacent to keypoint
candidate in the register. When two data are considered as adjacent data, the input
curvature is regarded as near-keypoint and discarded. 3) If the input curvature is
smaller than the local max and is monitored to be far from the local max data, the
input curvature can still be considered as a keypoint candidate that is written back
to CMEM and included in the next radix step.
2. Reconfigurable Radix Sorting for Workload Variance
Due to dynamic environmental changes near the LiDAR sensor, the number of points used
for the KE process fluctuates frame-/channel-wise. However, the number of keypoints
extracted from each channel is constant, resulting in a variance of workload for KEC.
Proposed KEC, with radix sorting, accesses CMEM each radix step, requiring huge memory
bandwidth resulting in large dynamic power consumption. With 2 radix-based sorting,
35.2 GB/s of memory bandwidth is required for a 16-channel LiDAR operating at 20 Hz.
In this paper, reconfigurable radix sorting is proposed for reduction of power for
sorting process to solve workload variance problem.
Fig. 6. Operation of 2, 4, and 16 radix-based sorting supported by reconfigurable
radix sorting.
Proposed KEC consists of a total of 16 radix buffers. For 2 radix-based sorting, only
2 radix buffers are required, allowing 8 sets of data to be sorted in parallel. Single
set of data is distributed across 2 buffers, which induces fine-grained candidate
filtering. Through parallel processing with fine-grained removal of near-keypoint
data, 2 radix-based sorting can achieve low latency achieving 63.0% data removal,
and a 43.0% reduction in total latency.
Fig. 7. Near keypoint removal results at each radix step in three types of radix-based
sorting by reconfigurable radix sorting, and total latency reduction of each radix
sorting.
When valid curvature data is sparse in the channel, proposed KEC can operate in low
power mode with 16 radix-based sorting. In case of 16 radix-based sorting, 16 sorting
buffers are utilized operating as a single core. Through candidate filtering, 16 radix-based
sorting can eliminate 42.8% of near-keypoint data, which can obtain 19.5% of total
latency reduction. Compared to 2 radix-based sorting, which requires 281.6 GB/s of
memory bandwidth for real-time operation, 16 radix-based sorting requires only 8.8
GB/s of memory bandwidth enabling low-power operation. The proposed KEC can support
reconfigurable sorting of data using 2, 4, and 16 radix-based sorting based on the
amount of data. This processor can switch between high-speed, and low-power modes
depending on the workload.
III. IMPLEMENTATION RESULTS
Proposed KEC can achieve low-latency processing through candidate filtering, and it
can also operate in low-power mode through reconfigurable radix sorting, resulting
in reduced memory bandwidth. Fig. 8 illustrates a trade-off graph depicting the relationship between latency and memory
bandwidth for the sorting process. With a large amounts of sorting data, operation
with 2 radix-based sorting can achieve the greatest reduction in latency due to fine-grained
candidate filtering and highly parallelized processing. However, 2 radix-based sorting
requires 16 iterations of radix step, 13.18 MB/frame which results in high power consumption
of the process.
Fig. 8. Sorting latency and memory bandwidth of each radix sorting process.
For 4 and 16 radix-based sorting, the latency is 1.22$\times$, and 2.82$\times$ higher
latency compared to 2 radix-based sorting adopting candidate filtering respectively.
Even though 2 radix-based sorting requires a huge memory bandwidth, thanks to candidate
filtering, the memory bandwidth is reduced from 13.18 MB/frame to 7.48 MB/frame, representing
a reduction of 39.8%. For 4 radix-based sorting, the process requires 2.31 MB/frame
of memory bandwidth, which is 3.24$\times$ lower than for 2 radix. With a memory bandwidth
of 0.34 MB/frame, operating with 16 radix can achieve the lowest memory power consumption,
which is 32.5 times lower than that of 2 radix.
Fig. 9 shows the chip layout of the proposed KEC processor. The processor is designed in
28 nm CMOS technology, occupying 1.9 mm $\times$ 0.3 mm. Total SRAM of the processor
is 108 KB which consists of CMEM, RMEM, and sorting buffer. The processor operates
under 250 MHz of clock frequency with 1.0 V supplying voltage. It shows 93.0 $\mu$s
of processing time extracting keypoints from 28.8K points under 2 radix-based sorting
achieving 675.2 nJ/frame of energy efficiency. With 4 radix-based and 16 radix-based
sorting, the proposed KEC shows 99.4 $\mu$s and 144.3 $\mu$s of latency with 721.6
nJ/frame, and 1047.6 nJ/frame of energy efficiency, respectively.
Fig. 9. Photograph of the chip layout and specification.
Table 1. Hardware comparison table.
|
|
This work
|
NVIDIA Jetson TX2
|
i7 CPU
|
|
Technology [nm]
|
28
|
16
|
12
|
|
Clock Freq. [MHz]
|
250
|
1,300
|
1,545
|
|
Throughput [FPS]
|
10752
|
120
|
280
|
|
Power [mW]
|
7.26
|
15,000
|
47,000
|
|
Energy per frame [nJ/frame]
|
675.2
|
125.0
|
167.9
|
Proposed KEC shows 7.26 mW of power consumption with 675.2 nJ/frame of energy efficiency.
It achieves ultra-low power operation, compared to NVIDIA Jetson TX2, consuming 15,000
mW, and i7 CPU requiring 47,000 mW. Proposed KEC shows 99.9% lower power consumption
than Jetson TX2 and i7 CPU, achieving extremely higher energy efficiency than Jetson
and CPU.
IV. CONCLUSIONS
A low-latency keypoint extraction process is proposed with near keypoint removal supported
by a candidate filtering block, removing 42.8 to 63.0% of data for sorting, and resulting
in a reduction of 19.5% to 43.0% in total latency. With the help of reconfigurable
radix sorting, proposed KEC can select one of three different (2, 4, 16) radix-based
sorting, to optimize processing time and power consumption due to memory bandwidth.
The processor achieved low power consumption of 7.26 mW and 675.2 nJ/frame of high
energy efficiency with real-time performance.
ACKNOWLEDGMENT
This research was supported by the Yonsei University Research Fund of 2025-22-0142.
The EDA tool and chip fabrication were supported by the IC Design Education Center
(IDEC), Korea.
REFERENCES
L. Rummelhard, A. Paigwar, A. Nègre, C. Laugier, 2017, Ground estimation and
point cloud segmentation using spatiotemporal conditional random field, Proc. of 2017
IEEE Intelligent Vehicles Symposium (IV), pp. 1105-1110

M. Himmelsbach, F. von Hundelshausen, H.-J. Wuensche, 2010, Fast segmentation
of 3D point clouds for ground vehicles, Proc. of 2010 IEEE Intelligent Vehicles Symposium,
pp. 560-565

H. Lim, M. Oh, H. Myung, 2021, Patchwork: Concentric zone-based region-wise
ground segmentation with ground likelihood estimation using a 3D LiDAR sensor, IEEE
Robotics and Automation Letters, Vol. 6, No. 4, pp. 6458-6465

J. Zhang, S. Singh, 2014, LOAM: Lidar odometry and mapping in real-time, Proc.
of Robotics: Science and Systems, Vol. 2, No. 9, pp. 1-9

T. Shan, B. Englot, 2018, LeGO-LOAM: Lightweight and ground optimized LiDAR odometry
and mapping on variable terrain, Proc. of 2018 IEEE/RSJ International Conference on
Intelligent Robots and Systems (IROS), pp. 4758-4765

D.-U. Seo, H. Lim, S. Lee, H. Myung, 2022, PaGO-LOAM: Robust ground-optimized
LiDAR odometry, Proc. of 2022 19th International Conference on Ubiquitous Robots (UR),
pp. 1-7

H. Wang, C. Wang, C.-L. Chen, L. Xie, 2021, F-LOAM: Fast LiDAR odometry and
mapping, Proc. of 2021 IEEE/RSJ International Conference on Intelligent Robots and
Systems (IROS), pp. 4390-4396

Kyuho Jason Lee received his B.S., M.S., and Ph.D. degrees from the School of Electrical
Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon,
South Korea, in 2012, 2014, and 2017, respectively. Dr. Lee worked for Samsung Research
America as a Hardware Designer in 2016. From 2017 to 2018, he was a post-doctoral
researcher with the Information Engineering and Electronics Research Institute, KAIST.
Then, he worked as an Associate Professor with the Department of Electrical Engineering
and the Graduate School of Artificial Intelligence, Ulsan National Institute of Science
and Technology (UNIST), Ulsan, South Korea, from 2018 to 2025. He is currently an
Associate Professor with the School of Electrical and Electronics Engineering, Yonsei
University, Seoul, South Korea. Dr. Lee has been serving as a Guest Editor for IEEE
Journal of Solid-State Circuits (JSSC) and a TPC member for the IEEE Asian Solid-State
Circuits Conference (A-SSCC), the ACM/IEEE Design, Automation and Test in Europe (DATE),
ACM/IEEE Design Automation Conference (DAC), IEEE Custom Integrated Circuits Conference
(CICC), IEEE International Conference on Electronics, Information, and Communication
(ICEIC), IEEE International SoC Design Conference (ISOCC), IEEE Asia and South Pacific
Design Automation Conference (ASP-DAC). He also has been served as an OC member for
the IEEE International Conference on Artificial Intelligence Circuits and Systems
(AICAS), IEEE International Conference on Electronics, Information, and Communication
(ICEIC), IEEE International Conference on Consumer Electronics Asia (ICCE-Asia), and
IEEE Asian Solid-State Circuits Conference (A-SSCC). His research interests include
mixed-mode neuromorphic processor, deep learning accelerator, AI System-on-Chip, Network-on-Chip
architectures, Processing-in-Memory & Computing-in-Memory architectures, and intelligent
computer vision processor for mobile devices and autonomous vehicles. He is a Senior
Member of IEEE.