Leveraging deep learning for robust visual and visual-inertial SLAM
Date
2025
Authors
Publisher
University of Delaware
Abstract
Simultaneous localization and mapping (SLAM) arises in numerous applications ranging from robot navigation to augmented reality (AR) and virtual reality (VR). Despite recent advances in deep learning, most SLAM systems still rely entirely on hand-crafted techniques. In this dissertation, we aim to show how deep learning can benefit SLAM, in certain cases improving accuracy, robustness, and sometimes even efficiency. We show not only that deep learning can benefit SLAM, but also that in many cases SLAM can benefit the deep neural network in return, exposing a mutually beneficial relationship between learned and hand-crafted techniques. The dissertation is divided into four thrusts. In the first thrust, a robust visual object SLAM system is presented, which uses a custom deep semantic keypoint network to provide the monocular SLAM system with metric scale while allowing it to initialize from a single view. In return, the SLAM system provides valuable priors to the network, allowing it to track keypoints on symmetric objects consistently across multiple views, which is typically not possible. In the second thrust, a dense visual-inertial odometry (VIO) system is presented, in which dense geometry is represented as a compact optimizable code that is estimated tightly within the VIO estimator. We show not only that the code estimation can improve the accuracy of the dense geometry, but also that it can improve the accuracy of VIO in return. In the third thrust, a monocular depth-aided visual-inertial initialization pipeline is presented. The learned monocular depth is shown to improve initialization performance in low-excitation scenarios, where the performance of completely hand-crafted initialization degrades. In turn, the inertial information provides the scale to the monocular depth network, allowing it to be used as a compact feature representation.
In the fourth and final thrust, an efficient and robust dense visual-inertial SLAM system called AB-VINS is presented. Instead of estimating each feature position separately, as most SLAM systems do, AB-VINS uses a monocular depth network to represent the geometry, which is shown to improve robustness and efficiency. Again, the monocular depth provides useful priors to the SLAM system, while the inertial information provides the scale to the network. A new hand-crafted technique for pose graph optimization, called the memory tree, is introduced along with AB-VINS. It is shown to greatly improve robustness and efficiency over state-of-the-art methods, allowing AB-VINS to solve pose graph SLAM while relinearizing only a constant number of variables regardless of the number of keyframes. The dissertation concludes with possible future research directions.
Keywords
Deep learning, Navigation, Robustness, SLAM, Virtual reality