From coarse to fine: quickly and accurately obtaining indoor image-based localization under various illuminations
Date
2016
Publisher
University of Delaware
Abstract
The focus of this dissertation is on improving accuracy and efficiency for indoor
image- or video-based localization under different situations. Indoor localization is a
critical topic in computer vision. Image-based localization is mostly used in an outdoor
environment, especially in urban areas where GPS localization is inaccurate. Current
vision-based indoor localization techniques mainly aim to retrieve the best-matching
image to obtain location information, which misses the orientation. The Kinect has
also been used to find correspondences between each pixel in a query image and the
3D points of a known scene; however, the Kinect is only useful over a short distance
range. I introduce the 3D Structure-from-Motion (SfM) reconstruction
model into the indoor image-based localization framework. Feature correspondences
are searched within each descriptor visual word, which reduces the search scope and
accelerates the search process. To make the features more discriminative for
building 2D-to-3D correspondences, I learn a projection matrix through trace-ratio
relevance feedback and project the features into a more discriminative space. I then
replace Euclidean distance with Hellinger distance in descriptor matching, which
improves localization accuracy, and I analyze the localization accuracy obtained
with different distance measures.
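The two ideas above can be sketched in a few lines: restrict the correspondence search to the query descriptor's visual word, and compare descriptors with the Hellinger distance, which for L1-normalized descriptors equals the Euclidean distance between their element-wise square roots (up to a constant). This is a toy illustration, not the dissertation's implementation; the names `word_centers` and `word_index` are hypothetical.

```python
import numpy as np

def hellinger(a, b):
    # Hellinger distance for L1-normalized descriptors; identical to
    # Euclidean distance on the square-rooted vectors, up to 1/sqrt(2).
    return np.linalg.norm(np.sqrt(a) - np.sqrt(b)) / np.sqrt(2.0)

def match_in_word(query_desc, word_centers, word_index):
    # Quantize the query descriptor to its nearest vocabulary center,
    # then search only inside that visual word for the closest
    # 3D-point descriptor, instead of scanning the whole model.
    word = int(np.argmin(np.linalg.norm(word_centers - query_desc, axis=1)))
    candidates = word_index.get(word, [])  # list of (point_id, descriptor)
    if not candidates:
        return None
    best_id, _ = min(candidates, key=lambda c: hellinger(query_desc, c[1]))
    return best_id, word
```

Because the search scope shrinks from the full point cloud to one word's candidate list, each query descriptor is matched in a fraction of the time a linear scan would take.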
In a crowded indoor environment, captured images usually contain people, which
often leads to inaccurate camera pose estimation. In my approach, people are
segmented out of the videos by means of an optical flow technique, and the
background is completed. In place of traditional RANSAC, a novel initial
correspondence selection method based on graph matching is used, which improves
both image registration speed and camera pose estimation accuracy.
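The flavor of graph-based correspondence selection can be illustrated with a toy consistency filter: treat each putative 2D-to-3D match as a graph node, link two nodes when their pairwise 2D and 3D distances agree in ratio, and keep well-supported nodes. This is only a minimal sketch of the general idea, assuming roughly uniform scale; the dissertation's actual graph-matching formulation is more sophisticated.

```python
import numpy as np

def select_initial_matches(pts2d, pts3d, tol=0.1, min_support=2):
    # Node i is the putative match pts2d[i] <-> pts3d[i].  An edge links
    # matches i and j when their 2D/3D pairwise-distance ratio agrees
    # with the median ratio; nodes with enough edges are kept.
    n = len(pts2d)
    ratios = {}
    for i in range(n):
        for j in range(i + 1, n):
            d3 = np.linalg.norm(pts3d[i] - pts3d[j])
            if d3 > 1e-9:
                ratios[(i, j)] = np.linalg.norm(pts2d[i] - pts2d[j]) / d3
    med = np.median(list(ratios.values()))
    support = np.zeros(n, dtype=int)
    for (i, j), r in ratios.items():
        if abs(r - med) / med < tol:  # geometrically consistent pair
            support[i] += 1
            support[j] += 1
    return [i for i in range(n) if support[i] >= min_support]
```

Unlike RANSAC, no random hypothesis-and-verify loop is needed: one pass over match pairs already separates mutually consistent correspondences from outliers, which is where the speed gain comes from.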
In addition to SfM-based methods, I propose another multi-view, image-based
localization method. The 3D SfM model is still applied to provide users with a
clear view of the whole building. I perform image retrieval to roughly obtain the
image location. Since orientation is also essential for localizing an image, I
regard each view direction as a task and perform image retrieval in a multi-task
learning framework. Multi-view image retrieval thus yields the image location and
orientation at the same time. This avoids finding correspondences for each feature
extracted from the 2D image, which accelerates the process of building
correspondences. I linearly search for the nearest neighbor of the query image
within the same location group and task, assign the camera pose of the
best-matching candidate to the query image, and further refine it through bundle
adjustment.
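The final pose-assignment step above amounts to a linear nearest-neighbor search inside one location/task group. A minimal sketch, assuming whole-image descriptors and precomputed candidate poses (the bundle-adjustment refinement is omitted):

```python
import numpy as np

def assign_pose(query_desc, group_descs, group_poses):
    # Linearly scan the query's location/task group for the nearest
    # neighbor in descriptor space and copy that candidate's camera
    # pose to the query image.
    dists = np.linalg.norm(np.asarray(group_descs) - query_desc, axis=1)
    best = int(np.argmin(dists))
    return group_poses[best]
```

Because the scan is confined to one group rather than the whole database, the linear search stays cheap even for large buildings.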
Existing image-based localization systems are based on color images captured
by normal cameras with the assumption that adequate light is supplied. In many
situations, such as in a power outage, light is not available. To cope with this situation,
I implement a localization method based on thermal imaging which captures the object
surface temperature instead of reflected light. Thermal cameras have their own
limitations: considerable time must be invested in focusing and framing a scene.
Because this process is labor-intensive, far fewer thermal samples were captured
than color images. I apply transfer learning to enhance the training of the
thermal image classifier. To select
the most informative samples used in the learning process, I combine active learning
with transfer learning to make the classification model more accurate. Through the
active transfer learning method, the location model can accurately indicate the thermal
image location.
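One common way to pick the most informative unlabeled samples, and a plausible building block for the active-learning step above, is uncertainty sampling: rank pool images by the entropy of the classifier's predicted class distribution. This is a hedged sketch of that generic criterion, not necessarily the dissertation's exact selection rule.

```python
import numpy as np

def select_informative(probs, k):
    # probs: (n_samples, n_classes) predicted class probabilities for
    # the unlabeled thermal pool, e.g. from a classifier transferred
    # from color images.  Returns indices of the k most uncertain
    # samples (highest predictive entropy) to be labeled next.
    p = np.clip(np.asarray(probs), 1e-12, 1.0)
    entropy = -np.sum(p * np.log(p), axis=1)
    return list(np.argsort(-entropy)[:k])
```

Labeling only these high-entropy samples concentrates the scarce annotation budget where the transferred model is least sure, which is exactly why the combination with transfer learning pays off when thermal data are few.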
To better exploit scene geometric attributes, I further use a 3D model
reconstructed from a short video as the query, realizing 3D-to-3D localization
under a multi-task point-retrieval framework. First, the use of a 3D model as the
query enables efficient selection of location candidates. Furthermore, the
reconstruction of the 3D model
exploits the correlation among different images, based on the fact that images captured
from different views for SfM share information through matching features. By exploring
shared information (matching features) across multiple related tasks (images of the
same scene captured from different views), the visual feature’s view-invariance property
can be improved in order to achieve higher point-retrieval accuracy. More specifically, I
use a multi-task point-retrieval framework to explore the relationship between descriptors
and the 3D points, thereby extracting discriminative points for more accurate
3D-to-3D correspondence retrieval. In summary, this dissertation investigates
image- and video-based localization, with a focus on indoor environments, through
methods that are more accurate as well as time- and memory-efficient.
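Once 3D-to-3D correspondences are retrieved, the query model still has to be aligned with the scene model to obtain the pose. Given correspondences, the rigid transform can be recovered in closed form, for instance with the Kabsch algorithm; the sketch below illustrates that standard step, not the dissertation's specific pipeline.

```python
import numpy as np

def kabsch(P, Q):
    # Closed-form rigid alignment: find rotation R and translation t
    # with Q[i] ~= R @ P[i] + t, given matched 3D points P (query
    # model) and Q (scene model).
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)          # cross-covariance of centered sets
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])         # guard against a reflection
    R = Vt.T @ D @ U.T
    t = cq - R @ cp
    return R, t
```

Since the solution is algebraic, it is fast enough to run inside a hypothesis-verification loop over the retrieved correspondences.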