Towards robust visual-inertial estimation

Date
2020
Publisher
University of Delaware
Abstract
Visual-inertial navigation systems (VINS), which fuse information from cameras and inertial measurement units (IMUs) to estimate the motion of a moving sensing platform, have seen an explosion in popularity in recent years. This is especially apparent in domains where cost, payload, and computational resources are heavily constrained, such as mobile devices and micro aerial vehicles (MAVs). While the standard VINS paradigm assumes a single IMU-camera (monocular or stereo) pair operating in an unknown but static environment, in this thesis we pursue three main thrusts that seek to improve overall estimation performance and to generalize these assumptions for building more robust visual-inertial systems.

In the first thrust, we focus on improving the standard VINS paradigm (a single IMU-camera pair operating in a static environment). In particular, we seek to improve the accuracy of incorporating inertial measurements into graph-based (i.e., batch-optimization) VINS when utilizing IMU preintegration. We provide two models for the evolution of the inertial measurements between sampling instants, which yield closed-form yet highly accurate solutions to the preintegration equations. We then show that these models offer improved accuracy over standard discrete preintegration methods, and incorporate the proposed models into both indirect (feature-based) and direct (intensity-based) visual-inertial systems.

In the second thrust, we incorporate an arbitrary number of "plug-and-play" sensors into VINS. In particular, we first consider utilizing the information from multiple non-overlapping, asynchronous cameras (i.e., each camera is triggered independently), which provides robustness to poorly textured regions along certain viewing directions and allows feature tracking to be parallelized across cameras. Because this system does not require hardware-synchronized sensors, auxiliary cameras can easily be added. Since the measurements are collected at multiple rates, a naive approach would estimate the pose of the system at every measurement time in order to fuse the information, incurring an increased computational burden. To combat this, and to allow for real-time estimation, we introduce a linear-interpolation scheme that allows poses to be estimated at a reduced rate, and utilize this model to calibrate both the spatial (relative pose) and temporal (time offset) relationships between all involved sensors within the multi-state constraint Kalman filter (MSCKF) framework. We show in real-world experiments that utilizing additional cameras yields improved VINS performance, and demonstrate that our system is able to accurately calibrate the sensor suite from poor initial guesses.
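To illustrate the idea behind such a linear-interpolation scheme, the following is a minimal sketch (not the thesis's implementation) of interpolating an SE(3) pose at an asynchronous camera timestamp lying between two estimated poses: the position is blended linearly, while the orientation is interpolated along the geodesic of the rotation group. All function names here are illustrative.

```python
import numpy as np

def skew(w):
    """3x3 skew-symmetric matrix of a 3-vector."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def so3_exp(w):
    """Rodrigues' formula: rotation vector -> rotation matrix."""
    th = np.linalg.norm(w)
    if th < 1e-10:
        return np.eye(3) + skew(w)  # first-order approximation
    A = skew(w / th)
    return np.eye(3) + np.sin(th) * A + (1.0 - np.cos(th)) * (A @ A)

def so3_log(R):
    """Rotation matrix -> rotation vector (assumes angle well below pi)."""
    th = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    if th < 1e-10:
        return np.zeros(3)
    w_hat = (R - R.T) * (th / (2.0 * np.sin(th)))
    return np.array([w_hat[2, 1], w_hat[0, 2], w_hat[1, 0]])

def interpolate_pose(t, t1, R1, p1, t2, R2, p2):
    """Pose at time t1 <= t <= t2: linear in position,
    geodesic in orientation."""
    lam = (t - t1) / (t2 - t1)
    p = (1.0 - lam) * p1 + lam * p2
    R = R1 @ so3_exp(lam * so3_log(R1.T @ R2))
    return R, p
```

Expressing each camera measurement through two bounding poses in this way is what lets the filter keep a reduced-rate state while still consuming measurements that arrive at arbitrary times.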
Considering that even a sensor suite with many cameras may experience measurement depletion due to poor lighting or fully textureless scenes, we further design a new MSCKF-based estimator that incorporates an arbitrary number of IMUs into its formulation. This improves visual-inertial estimation without relying heavily on environmental conditions, while additionally making the system robust to the failure of a single IMU (due to physical disconnection, high temperatures, or certain vibrational frequencies). In particular, we maintain an estimate for each IMU's state (pose, velocity, and sensor biases). During the filter's prediction phase, each IMU's state and the joint covariance of the combined system are propagated forward. We then derive a novel spatial-temporal constraint on the relative poses between the IMUs, which acts as an update measurement for the filter and allows for the fusion of each IMU's information as well as spatial and temporal calibration between all sensors. Because this system can operate as long as one IMU remains active, our method is robust to IMU sensor failure. We then show in simulated and real-world experiments that the proposed method offers improved localization performance over a single-IMU system while additionally being resilient to IMU failures. Finally, building on these two systems, a multi-IMU multi-camera (MiMc)-VINS is developed to fuse all available information from any number of IMUs and cameras, and is validated in simulations and experiments.

In the third thrust, we focus on the second assumption, namely that the sensor platform operates in a static environment. As the world is often truly dynamic, improper modeling of the surroundings may lead to catastrophic errors as the estimator misinterprets motion of the environment as its own ego-motion. To relax this assumption, we allow some parts of the environment (i.e., 3D feature points detected by the camera) to remain static, while others act as moving rigid-body objects that follow a motion model described by a set of estimated motion parameters. Estimating the motion of such objects not only provides more information to the filter, but may even be the objective of the sensing platform (e.g., a MAV performing target tracking of another vehicle).

To perform this estimation, we include the pose, motion parameters, and local point cloud of the moving object as state variables to be estimated by the filter, along with the standard VINS navigation states. Treating the object as a moving point cloud makes our system robust to viewing the target from multiple directions, while incorporating a motion model allows the future target state to be predicted after loss of sight, a required capability in active tracking scenarios. Several motion models are offered which capture many real-world target behaviors seen in practice, and an extensive observability analysis is performed for each model. We then show in both simulated and real-world experiments that our system provides accurate localization and object-tracking performance, even when losing sight of the target or viewing the object from multiple directions. Lastly, and of practical importance, we investigate in depth the effect of incorrect model selection during tightly-coupled estimation, and design a Schmidt-KF-based estimator to prevent inconsistent motion models from corrupting VINS performance while still properly tracking all correlations.
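As a rough illustration of the Schmidt-KF idea mentioned above, the sketch below is a generic Schmidt update, not the thesis's estimator. It assumes a linearized measurement residual r with Jacobian H and noise covariance R, and a state vector whose first n_active entries are the actively estimated states: the Kalman gain for the nuisance block is zeroed so those states are never corrected, while the Joseph-form covariance update (valid for any gain) keeps their cross-correlations consistent.

```python
import numpy as np

def schmidt_update(x, P, r, H, R, n_active):
    """EKF update that corrects only the first n_active states.
    Nuisance states keep their prior values, but their cross-
    correlations with the active states remain properly tracked."""
    S = H @ P @ H.T + R               # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)    # optimal Kalman gain
    K[n_active:, :] = 0.0             # Schmidt: freeze nuisance block
    x = x + K @ r                     # nuisance entries unchanged
    IKH = np.eye(P.shape[0]) - K @ H
    P = IKH @ P @ IKH.T + K @ R @ K.T # Joseph form: valid for any gain
    return x, P
```

Zeroing the gain rather than marginalizing the nuisance states is what prevents a poorly matched motion model from corrupting the navigation estimate while still accounting for its uncertainty in the covariance.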
Keywords
Localization, Perception, SLAM, Target Tracking, VINS, VIO