dc.description.abstract |
3D models of buildings are used in many applications such as location recognition,
augmented reality, virtual training and entertainment. Creating models of buildings automatically
is a longstanding goal in computer vision research. Many current applications
rely on manual creation of models using images and a 3D authoring tool. While more automated
approaches exist, they are typically inefficient, requiring dense imagery, additional sensor
data, or frequent manual intervention. The focus of this thesis is to automate and increase
the efficiency of 3D model creation from image collections.
Matching sets of images to each other is a frequent step in 3D model building. In many
applications, image matching must be done hundreds or thousands of times. Thus, any increase
in matching efficiency will be multiplied hundreds or thousands of times when used in
these applications. This dissertation presents a new image matching method that achieves
greater efficiency by exploiting the fact that images taken from similar viewing angles are approximately
related by an affine transformation. An affine transformation models translation,
rotation, shear and non-isotropic scaling between image pairs. When images are related by an
affine transformation, the ratios of areas of corresponding shapes are invariant. The method
uses this invariant to fit an affine transformation model to a set of putative matches and detect incorrect matches. Methods assuming global and local affine transformation models
were created. The first assumes a single global affine transformation between each image
pair. The second imposes a spatial structure on the feature points to cluster features into
local regions and then fits a separate affine model to each cluster. Both methods
were evaluated using sets of synthetic matches with varying percentages of incorrect
matches, localization error and rotation. Additionally, the methods were applied to a large
publicly available image database and the results were compared to several recent model
fitting methods. The results show that the best affine method using local regions maintains
equivalent accuracy and is consistently more efficient than current state-of-the-art methods.
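
To make the area-ratio invariant underlying these methods concrete, the short sketch below (an illustrative check, not the thesis implementation; the matrix, translation and triangles are arbitrary examples) verifies numerically that an affine map x -> Ax + t scales every area by |det A|, so the ratio of the areas of two corresponding shapes is unchanged:

```python
import numpy as np

def tri_area(p):
    """Area of a triangle given as a 3x2 array of vertices."""
    u, v = p[1] - p[0], p[2] - p[0]
    return 0.5 * abs(u[0] * v[1] - u[1] * v[0])

A = np.array([[1.3, 0.4],    # arbitrary invertible linear part
              [-0.2, 0.9]])
t = np.array([5.0, -2.0])    # arbitrary translation

rng = np.random.default_rng(0)
tri1, tri2 = rng.random((3, 2)), rng.random((3, 2))  # two shapes in image one

ratio_before = tri_area(tri1) / tri_area(tri2)
ratio_after = tri_area(tri1 @ A.T + t) / tri_area(tri2 @ A.T + t)
assert np.isclose(ratio_before, ratio_after)  # area ratio survives the map
```

A model-fitting step can therefore flag a putative match as incorrect when the area ratios it induces disagree with those implied by the fitted affine model.
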
When creating and using 3D models, it is often important to predict if images taken
from specific locations will match existing images in the model. Image matching prediction
is used to evaluate image sets for vision-based location recognition and augmented reality
applications. This dissertation presents a new way to predict if images will match by
measuring affine distortion. Distortion is measured by projecting features into a second
image and computing the affine transformation between the corresponding feature regions.
Feature distortion is computed from the skew, stretch and shear of the transformed region.
The distortion measures for all features in an image pair are combined into a distortion vector
describing the pair. Using these distortion vectors and the actual number of matches, a
classifier is trained to predict the confidence that images will match. Results are presented
that compare this method to other published approaches. The results demonstrate that the
affine distortion-based classifier predicts matching confidence more accurately than other
published techniques.
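
The skew, stretch and shear measures above are defined precisely in the dissertation; the sketch below shows one standard way (assumed here for illustration only) to extract comparable terms from the 2x2 linear part of a local affine transform, using a QR decomposition into an orthogonal factor and an upper-triangular scale/shear factor:

```python
import numpy as np

def distortion_terms(A):
    """Decompose A = Q R (Q orthogonal, R upper-triangular) and report
    rotation-invariant distortion terms from R's scale and shear entries."""
    _, R = np.linalg.qr(A)
    sx, sy = abs(R[0, 0]), abs(R[1, 1])   # per-axis scale factors
    stretch = max(sx, sy) / min(sx, sy)   # anisotropic scaling (1.0 = none)
    shear = abs(R[0, 1]) / sx             # normalized shear term
    return stretch, shear

# Linear part of the affine map between one pair of feature regions:
A = np.array([[1.2, 0.3],
              [0.1, 0.8]])
print(distortion_terms(A))               # roughly (1.56, 0.30)
```

Stacking such per-feature terms over every feature in an image pair yields the kind of distortion vector on which the classifier is trained.
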
The classifier is also used to create a spatial model of locations around a building.
The spatial model shows the confidence that a new image taken from a specific location
and pose will match an existing set of images. Using this model, location recognition
applications can determine how well they will work throughout the scene. The approach
presented uses the classifier described above and more realistic location sampling to create
a spatial map that is more accurate than other published approaches. Additionally, as part
of this work, the minimum set of images needed to cover the space around the building is
computed. The approach uses structure from motion to create 3D information about the
scene. Synthetic cameras are then placed at approximate locations and orientations from
which people commonly take pictures.
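
As an illustration of this sampling step (the ring-and-eye-height geometry below is an assumption made for the sketch, not the thesis' actual sampling scheme), synthetic cameras can be placed on rings around the building at roughly eye height, each oriented toward the building:

```python
import numpy as np

def synthetic_cameras(center, radii=(10.0, 20.0), n_per_ring=36, eye_height=1.6):
    """Yield (position, unit view direction) pairs circling `center` (x, y, z),
    mimicking typical locations from which people photograph a building."""
    cx, cy, cz = center
    for r in radii:
        for theta in np.linspace(0.0, 2.0 * np.pi, n_per_ring, endpoint=False):
            pos = np.array([cx + r * np.cos(theta), cy + r * np.sin(theta), eye_height])
            look = np.array([cx, cy, cz]) - pos
            yield pos, look / np.linalg.norm(look)

cameras = list(synthetic_cameras(center=(0.0, 0.0, 5.0)))
print(len(cameras))  # 72 candidate viewpoints for the classifier to score
```
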
The affine distortion-based classifier is applied to compute the confidence that images from
the synthetic cameras will match the existing image set. Results are presented as a spatial
map showing the confidence that new images captured at specific locations and poses will
match the existing image set. Additionally, the minimal set of images needed to maintain
this matching coverage is computed using a greedy set-cover algorithm. The minimal set can
be used to increase efficiency in applications that need to match new images to an existing
image set (e.g., location recognition, augmented reality and 3D modeling).
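
The minimal-set computation can be sketched as the classic greedy set-cover heuristic (the function and image names below are illustrative, not from the thesis), where each image is reduced to the set of sampled locations it covers with sufficient matching confidence:

```python
def greedy_min_image_set(coverage):
    """coverage: dict mapping image id -> set of locations it covers.
    Greedily pick images until every coverable location is covered."""
    uncovered = set().union(*coverage.values())
    chosen = []
    while uncovered:
        # Take the image covering the most still-uncovered locations.
        best = max(coverage, key=lambda img: len(coverage[img] & uncovered))
        chosen.append(best)
        uncovered -= coverage[best]
    return chosen

# Example: five sampled locations, three candidate images.
print(greedy_min_image_set({
    "img_a": {1, 2, 3},
    "img_b": {3, 4},
    "img_c": {4, 5},
}))  # -> ['img_a', 'img_c']
```

The greedy heuristic does not guarantee the true minimum, but it is the standard approximation algorithm for set cover and runs quickly on large image sets.
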
Finally, a process is presented to validate the 3D information computed using structure from
motion. Validation ensures that the data is precise and accurate enough to provide a realistic
3D model of the scene structure.
Results from the process show that the Bundler structure-from-motion software generates
3D information accurately enough to calculate distortion and generate the spatial coverage
map. |
|