

2.2 Structure from Motion

2.2.1 Feature Detection and Matching

The goal is to find and match the same 2D features in the different images of the input set. Two categories of solutions have been proposed. Some solutions adopt marker-based tracking, where artificially designed markers, easy to detect with image processing algorithms, are used to simplify the detection and the creation of 2D correspondences [100][165][140]. Although the detection and tracking of markers are very reliable, in some cases preparing the scene with them is not possible. In such cases, a marker-less solution, based on the natural features of the environment, must be used.

For a marker-less solution, two important elements are needed: a feature detector, which extracts the most salient 2D features of the image, usually points or lines; and a feature descriptor, which associates with each extracted feature descriptive information, usually in the form of a vector, to be used in a matching process. Both must have certain characteristics. A good detector should be repeatable and reliable. Repeatability means that the same feature must be detected in different images. Reliability means that the detected point should be distinctive enough that the number of its matching candidates is small. A descriptor should be invariant to rotation, scaling, and affine transformation, so that the same feature in different images is characterized by almost the same value, and distinctive, to reduce the number of possible matches.

Feature Detector

The Harris corner detector [83] is a well-known point detector that is invariant to rotation and partially invariant to intensity changes. However, it is not scale invariant. The detector is based on a local auto-correlation function that measures the local changes of the image. For each point (x, y), it computes the Harris matrix:

A = \sum_{(u,v) \in W} w(u,v) \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}

where (u, v) is an image point belonging to the window W around (x, y) (a circular weighted window if w(u, v) is a Gaussian), and I_x and I_y are the partial derivatives of the image I. The value of the function R = det(A) - k (tr(A))^2 then enables the classification of the point as a corner (R > 0), a flat region (R ≈ 0), or an edge (R < 0).
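
As an illustration of the response computation described above, a minimal sketch in Python with NumPy/SciPy follows; the window size (set through the Gaussian sigma) and the constant k are illustrative choices, not values prescribed by [83]. Corners would then be taken as local maxima of R above a threshold.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def harris_response(image, sigma=1.0, k=0.04):
    """Harris response R for each pixel of a grayscale image."""
    img = image.astype(np.float64)
    # Partial derivatives I_x and I_y of the image.
    Ix = sobel(img, axis=1)
    Iy = sobel(img, axis=0)
    # Entries of A, accumulated over a Gaussian-weighted window w(u, v).
    Ixx = gaussian_filter(Ix * Ix, sigma)
    Iyy = gaussian_filter(Iy * Iy, sigma)
    Ixy = gaussian_filter(Ix * Iy, sigma)
    # R = det(A) - k (tr(A))^2: R > 0 corner, R ~ 0 flat region, R < 0 edge.
    return (Ixx * Iyy - Ixy ** 2) - k * (Ixx + Iyy) ** 2
```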

A simple and efficient detector, named SUSAN (Smallest Univalue Segment Assimilating Nucleus), was introduced in [179]. It computes the fraction of pixels within a neighborhood that have an intensity similar to that of the center pixel. Corners can then be localized by thresholding this measure and selecting local minima. The FAST (Features from Accelerated Segment Test) detector was proposed in [168]. A point is classified as a corner if one can find a sufficiently large set of pixels on a circle of fixed radius around the point such that these pixels are all significantly brighter than the central point.
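
Both detectors are available in common libraries; as an example, a short usage sketch of OpenCV's FAST implementation follows (the image path and the threshold are placeholders).

```python
import cv2

img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)  # placeholder input image
fast = cv2.FastFeatureDetector_create(threshold=25, nonmaxSuppression=True)
keypoints = fast.detect(img, None)
print(len(keypoints), "FAST corners detected")
```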

Scale-invariant detectors [120][133] search for features over scale space. Lowe [120] searches for local maxima of the difference of Gaussians (DoG) in space and scale. Mikolajczyk et al. [133] use Harris corners to search for features in the spatial domain and then use the Laplacian over scale to select features that are invariant to scale.
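
A rough sketch of the DoG search over scale space is given below; the scale sampling and threshold are illustrative assumptions, and the sketch omits the refinement steps of [120] (sub-pixel localization, edge-response rejection).

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def dog_extrema(image, sigmas=(1.0, 1.6, 2.56, 4.1), threshold=0.02):
    """Return (x, y) locations that are local maxima of the DoG in space and scale."""
    img = image.astype(np.float64) / 255.0
    blurred = [gaussian_filter(img, s) for s in sigmas]
    dog = np.stack([blurred[i + 1] - blurred[i] for i in range(len(blurred) - 1)])
    # Keep points that equal the maximum of their 3x3x3 space-scale neighborhood.
    is_max = maximum_filter(dog, size=3) == dog
    ys, xs = np.where(np.any(is_max & (dog > threshold), axis=0))
    return list(zip(xs, ys))
```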

An affine-invariant detector is defined by Tuytelaars et al. [193]. Starting from a local intensity maximum, it searches along rays through that point to find local intensity extrema. The curve linking those extrema delimits an interest region, which is later approximated by an ellipse. By searching along many rays and using ellipses to represent regions, the detected regions are invariant to affine transformations.

Bay et al. [12] proposed a scale-invariant feature detector based on the Hessian matrix; rather than using a different measure for selecting the location and the scale, the determinant of the Hessian is used for both. More precisely, they detect blob-like structures at locations where the determinant of the Hessian is maximal. The Hessian matrix is roughly approximated using a set of box-type filters.
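
In [12] these box filters are evaluated in constant time by means of an integral image; the following is a minimal NumPy sketch of that building block (function names are illustrative), from which box approximations of the second-order derivatives can be assembled with a few weighted sums.

```python
import numpy as np

def integral_image(img):
    """Cumulative sums so that any rectangular box sum costs four lookups."""
    return np.cumsum(np.cumsum(img.astype(np.float64), axis=0), axis=1)

def box_sum(ii, y0, x0, y1, x1):
    """Sum of the original image over rows [y0, y1) and columns [x0, x1)."""
    total = ii[y1 - 1, x1 - 1]
    if y0 > 0:
        total -= ii[y0 - 1, x1 - 1]
    if x0 > 0:
        total -= ii[y1 - 1, x0 - 1]
    if y0 > 0 and x0 > 0:
        total += ii[y0 - 1, x0 - 1]
    return total
```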

An extensive survey about local feature detection can be found in [194].

Feature Descriptor

One of the most robust and widely used feature descriptors is SIFT (Scale-Invariant Feature Transform), together with its later derivations. The SIFT descriptor [120] is a vector of 128 elements computed on the local image gradient. It uses a regular 4×4 grid around the feature and computes, for each grid cell, a histogram of the image gradient. The eight bin values of each histogram become the values of the feature descriptor. SIFT is invariant to scale, rotation, changes in illumination, and noise, and partially invariant to viewpoint change.
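
As a usage sketch, SIFT can be computed with OpenCV (version 4.4 or later, where it is part of the main module); the image path is a placeholder.

```python
import cv2

img = cv2.imread("view.png", cv2.IMREAD_GRAYSCALE)  # placeholder input image
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)
print(descriptors.shape)  # (number of keypoints, 128): one 128-element vector per feature
```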

Several improvements of SIFT have been proposed. In PCA-SIFT [101], Principal Component Analysis is applied to the local patches of the image gradient to reduce the dimension of the descriptor (typically to 36 elements). The result is a descriptor that is more robust to image deformation and more compact, which reduces the time for feature matching. In GLOH (Gradient Location-Orientation Histogram) [132], the descriptor is computed on a log-polar location grid around the feature and its size is reduced by PCA. In [35], a further modification of SIFT is proposed: the orientation histogram is computed on an irregular grid where the patches partially overlap. This modification increases the robustness against scale variation.
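
PCA-SIFT applies PCA to patches of the image gradient rather than to the final descriptor; still, the dimensionality-reduction step it relies on can be sketched as follows, applied here for simplicity to a matrix of standard 128-element SIFT descriptors (target size 36, as reported for [101]).

```python
import numpy as np

def pca_reduce(descriptors, n_components=36):
    """Project an (N x 128) descriptor matrix onto its first n_components principal axes."""
    mean = descriptors.mean(axis=0)
    centered = descriptors - mean
    # Principal axes from the SVD of the centered descriptor matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T  # (N x n_components) compact descriptors
```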

The SURF (Speeded-Up Robust Features) descriptor [12] is partly inspired by SIFT. It relies on Haar wavelet responses computed for 16 sub-regions centered on the feature. The result is a descriptor of 64 elements that is as robust as SIFT but reduces the time for feature computation and matching.
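
SURF is patented and, in OpenCV, only available through the contrib module when the library is built with the non-free algorithms enabled; under that assumption, a usage sketch is:

```python
import cv2

img = cv2.imread("view.png", cv2.IMREAD_GRAYSCALE)  # placeholder input image
surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)  # default (non-extended) 64-element descriptors
keypoints, descriptors = surf.detectAndCompute(img, None)
print(descriptors.shape)  # (number of keypoints, 64)
```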

In recent years, several descriptors have been proposed to allow comparison and matching that is as fast as possible, using binary vectors. Calonder et al. [25] propose BRIEF (Binary Robust Independent Elementary Features). The descriptor vector is composed of binary comparisons of the intensities of 512 pairs of pixels, after applying a Gaussian smoothing to reduce noise sensitivity. The positions of the pixels are pre-selected randomly according to a Gaussian distribution around the patch center. Rublee et al. [169] propose the Oriented FAST and Rotated BRIEF (ORB) descriptor. Their binary descriptor is invariant to rotation and robust to noise.
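
Binary descriptors are compared with the Hamming distance, which makes matching very fast; a minimal ORB matching sketch with OpenCV follows (image paths and parameters are placeholders).

```python
import cv2

img1 = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)  # placeholder input images
img2 = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=1000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Brute-force matcher with Hamming distance; cross-check keeps mutual best matches only.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
print(len(matches), "putative correspondences")
```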

Similarly, Leutenegger et al. [109] propose a binary descriptor invariant to scale and rotation, called BRISK (Binary Robust Invariant Scalable Keypoints). It is based on a sampling pattern consisting of points lying on appropriately scaled concentric circles. Each point contributes to many pairs in the descriptor, and the pairs are divided into two subsets: short-distance and long-distance. The long-distance subset is used to estimate the direction of the keypoint, while the short-distance subset is used to build the binary descriptor after rotating the sampling pattern. Alahi et al. [6] propose a keypoint binary descriptor inspired by the human visual system, more precisely the retina, named Fast Retina Keypoint (FREAK).
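
BRISK is available in OpenCV's main module (FREAK, being a descriptor only, lives in the contrib package and must be paired with a separate detector); a short BRISK sketch, with a placeholder image path:

```python
import cv2

img = cv2.imread("view.png", cv2.IMREAD_GRAYSCALE)  # placeholder input image
brisk = cv2.BRISK_create()
keypoints, descriptors = brisk.detectAndCompute(img, None)
print(descriptors.shape, descriptors.dtype)  # binary descriptor packed into uint8 bytes
```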

KLT Tracking

One of the most widely used solutions for feature detection and matching in a video sequence is the KLT tracking algorithm [188][176]. The main idea is to exploit the typical coherence between consecutive frames of a video sequence to find the displacement between each pair of corresponding points. It extends the local estimation of the optical flow proposed in [122] to track a template patch under an affine transformation model, with the assumption of small brightness changes between consecutive frames.
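
A minimal sketch of this idea with OpenCV's pyramidal Lucas-Kanade implementation is given below; corners are extracted in the first frame (here with goodFeaturesToTrack, a Shi-Tomasi/Harris-style criterion) and then tracked into the next frame. File names and parameters are placeholders.

```python
import cv2

prev = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)  # placeholder frames
curr = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)

# Corner features in the first frame.
p0 = cv2.goodFeaturesToTrack(prev, maxCorners=500, qualityLevel=0.01, minDistance=7)

# Track each feature into the next frame; status marks the features that were found.
p1, status, err = cv2.calcOpticalFlowPyrLK(prev, curr, p0, None,
                                           winSize=(21, 21), maxLevel=3)
tracked_new = p1[status.flatten() == 1]
tracked_old = p0[status.flatten() == 1]
print(len(tracked_new), "features tracked into the next frame")
```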

Starting from the point features extracted with Harris' algorithm [83], the tracker computes the displacement of each feature in the next frame by shifting a local window around the feature until the similarity measure between the local windows in the two frames is maximized. Because the movement between two consecutive frames is typically very small, the search in the next frame starts from the same position of the feature in the previous frame. Typical similarity measures for these patches are the Sum of Squared Differences (SSD) or the Normalized Cross Correlation (NCC).
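
Written out for two equally sized grayscale patches, the two measures are (a minimal NumPy sketch):

```python
import numpy as np

def ssd(patch_a, patch_b):
    """Sum of Squared Differences: lower means more similar."""
    d = patch_a.astype(np.float64) - patch_b.astype(np.float64)
    return np.sum(d * d)

def ncc(patch_a, patch_b):
    """Normalized Cross Correlation in [-1, 1]: higher means more similar."""
    a = patch_a.astype(np.float64) - patch_a.mean()
    b = patch_b.astype(np.float64) - patch_b.mean()
    return np.sum(a * b) / (np.sqrt(np.sum(a * a) * np.sum(b * b)) + 1e-12)
```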

This type of tracking presents a drift problem due to several causes: image noise; geometric distortion; illumination changes; occlusions; fast camera movements; and 3D features that leave the camera's field of view and reappear later in the sequence.

Different solutions have been proposed to compensate for illumination changes [98] and to merge unconnected feature tracks [32][186]. A further extension of the KLT tracker was proposed by Dame [37], where the SSD is replaced by mutual information (MI) for the feature matching between consecutive frames.