Indoor 3D reconstruction by Lidar and IMU and point cloud segmentation and compression with machine learning

(1)

Indoor 3D reconstruction by Lidar and IMU and point cloud

segmentation and compression with machine learning

Henry Haugsten Hansen

Thesis submitted for the degree of Master in Computational Science

(Geosciences) 60 credits

Department of Informatics

Faculty of mathematics and natural sciences

UNIVERSITY OF OSLO

(2)

(3)

Indoor 3D reconstruction by Lidar and IMU and point cloud

segmentation and compression with machine learning

Henry Haugsten Hansen

(4)

c

2020 Henry Haugsten Hansen

Indoor 3D reconstruction by Lidar and IMU and point cloud segmentation and compression with machine learning

http://www.duo.uio.no/

Printed: Reprosentralen, University of Oslo

(5)

Abstract

A tremendous amount of 3D data is being collected with laser scanners and cameras. This work covers the entire process from data collection to scene understanding. The Lidar data collection part is based on Vetle Sillerud and Fredrik Johanssen’s master thesis [18], and with acknowledgments, challenges and improvements further discussed. Vetle Sillerud and Fredrik Johanssen made a functional working setup with an affordable Lidar and IMU. However, after extensive testing their implementation was found to be unreliable and unscalable. Their implementation is simplified, modified and accelerated without loss of quality. The problem of aligning two point clouds from a real-world Lidar perspective is thoroughly discussed. The method of alignment, registration, of point clouds is a key decision in a Lidar reconstruction pipeline.

The last part of the thesis is dedicated to the state-of-the-art advance- ments in machine learning on point clouds. High quality heightmaps of planar surfaces of a indoor point cloud are generated with the help of machine learning. The heightmaps can be used for poly-reconstruction, as in Edmond Boulet-Gilly’s work ”Point Cloud Room Reconstruction [9]. Ma- chine learning for registration and compression are discussed too.

(6)

(7)

List of Figures

1.1 Principal rotations . . . 9

2.1 Rotation matrix . . . 14

2.3 Euler angles . . . 15

2.4 Angle-axis theorem . . . 16

2.5 3D pose . . . 19

2.6 Newton’s method . . . 21

2.7 Image . . . 23

2.8 Perspective camera model . . . 23

2.9 Perspective camera model world . . . 24

2.10 Perspective camera model mathematical . . . 24

2.11 Laplacian of Gaussian . . . 25

2.12 Homography . . . 26

2.13 Structure from motion . . . 27

2.14 Point cloud . . . 27

2.15 Surface normals . . . 28

2.16 Polygon mesh . . . 29

2.17 Violin plot . . . 30

3.1 Lidar illustration . . . 31

3.2 Gyroscope . . . 32

3.3 Lidar . . . 32

3.4 IMU . . . 33

4.1 ICP algorithm . . . 36

4.2 ICP objective function error . . . 37

4.3 ICP registration example . . . 38

4.4 ICP point-to-point . . . 39

4.5 ICP point-to-plane . . . 39

4.6 Generalized ICP . . . 41

4.7 Symmetric ICP . . . 43

4.8 Voxelization . . . 44

4.9 Moving least squares . . . 44

4.10 Normal space sampling . . . 45

5.1 Reconstruction setup . . . 50

5.2 Tripod transformation . . . 51

(10)

5.3 Electrical scheme . . . 51

5.4 Odometry . . . 55

6.1 Blackbox . . . 60

6.2 Blackbox fragments . . . 60

6.3 Blackbox odometry . . . 61

6.4 Blackbox reconstruction . . . 61

6.7 Fragment error . . . 63

6.8 Fragment horizontal error . . . 63

6.9 Fragment vertical error . . . 63

6.10 Blackbox horizontal error . . . 64

6.11 Blackbox vertical error . . . 65

7.1 Bias-variance . . . 68

7.2 Linear regression . . . 69

7.3 Neural network . . . 71

7.4 Gradient descent . . . 73

7.5 Convolution . . . 75

7.6 Max pooling . . . 75

7.7 Convolutional neural network . . . 76

7.8 Convolutional deconvolution neural network . . . 76

7.9 SEGCloud . . . 77

7.10 PointNet . . . 78

7.11 Multiview CNN . . . 78

7.12 CNN with spherical kernels . . . 79

8.1 Segmentation training loss . . . 83

8.2 Segmentation predictions Blackbox . . . 84

8.3 Wall segmentation Blackbox . . . 85

8.4 Heigtmaps Blackbox . . . 86

8.5 Heightmaps feature matching . . . 89

8.6 Autoencoder . . . 91

8.7 Autoencoder with spherical kernels . . . 91

8.8 Autoencoder predictions Blackbox . . . 92

8.9 Autoencoder Blackbox . . . 92

8.10 Autoencoder training loss . . . 93

(11)

Preface

In this master thesis, Henry Haugsten Hansen, master student in Compu- tational science at Department of Geoscience, covers the process from 3D data collection to reconstruction and segmentation. This was made possible by the work of Vetle Smedbakken Sillerud and Fredrik Kristoffer Jo- hanssen, who studied Cybernetics at Department of Physics at UiO. In their master thesis (2019), they co-operated and developed a data acquisition system for collecting and synchronizing data, as well as a data processing method for generating high-density point clouds with a LiDAR and an IMU.

Also, the research and insights discovered by Edmond Boulet-Gilly (Institut de Recherche en Informatique de Toulouse) were useful and helpful for problem formulations.

I want to thank to my supervisor Carsten Griwodz, for giving me the opportunity to work with such an interesting topic. Thank you for all the meeting arrangements, ideas and guidance. Your deep level of understanding of difficult topics has been motivating.

(12)

(13)

Chapter 1

Introduction

1.1 Motivation

Lidar (LIght Detection And Ranging) is a fast way obtaining high quality 3D data. Lidars gather 3D data by emitting laser pulses. If a laser pulse hits a surface, the pulse is reflected back to the Lidar and the travel time is measured. Based on the direction and the time travelled, a relative 3D point to the Lidar is known. If a large number of points is collected, the resulting point cloud will give a realistic appearance of the surrounding environment. To sufficiently cover the target environment, most likely the Lidar needs to be moved to different positions. A problem not to be taken lightly arises here, because for independent measurements the point clouds generated are not aligned with each other.

Figure 1.1: Independent measurements of the same room taken at different positions. The point clouds are clearly in different coordinate systems.

This problem is almost completely eradicated in outdoor environments by GNSS (Global Navigation Satellite Systems) technology. Anyway, in indoor environments, one approach would be to apply an alignment algorithm directly on the independent measurements. But trying to merge two misaligned point clouds is likely to fail if the two point clouds are far off. A better approach would to keep track of translation and rotation be-

(14)

tween the independent measurements. Arguably the best way to do this is by attaching an IMU (Inertial Measurement Unit) to the Lidar. An IMU keeps track of the unit’s movement by measuring linear acceleration and rotation. Vetle Sillerud and Fredrik Johannsen made a functional working setup with an affordable Lidar and IMU [18]. They neglected the linear acceleration measurements, because it was found to be unreliable. Instead they used the registration algorithm ICP (Iterative Closest Point) to keep track of translation. They stuck to the point-to-plane ICP found in Open3D Python library, but more robust ICPs have been invented. They also locked the z-axis for translation because of drifting.

In his master thesis (Point Cloud Based Room Reconstruction, 2019) [9], Edmond Boulet-Gilly generated heightmaps from a point cloud to construct a low-poly presentation of indoor environments. In order to do so, he used a region-growing algorithm for planar extraction of the floor, walls and ceiling. A graph-approach was implemented to seal the indoor environment. The planar surfaces were turned into heightmaps images based on the depth of the surface. The median was used as an interpolaton method for pixels with multiple points. The heightmaps were divided into regions and by line segmentation with KIPPI (KInetic Polygonal Partitioning of Images). The end result was a high quality reconstruction requiring minimal storage. However, planar extraction by region-growing is a computationally expensive task and usually requires parameter tuning to obtain a satisfactory result.

Figure 1.2: Edmond Boulet-Gilly’s pipeline for a low-poly reconstruction.

The planar extraction problem is usually tackled by planar least squares with RANSAC and line segmentation for wall separation like in ”Automatic reconstruction of parametric building models from indoor point clouds” [45] and ”Detection of Walls, Floors, and Ceilings in Point Cloud Data” [23]. Planar extraction by 3D Hough transform has also been done [14]. Region-growing, Hough transform and planar least squares with RANSAC, all require essential parameter tuning. It’s especially true when the point cloud is noisy. A recent statistical approach to the planar segmentation problem was proposed in ”A robust statistics approach for plane detection in unorganized point clouds” [2]. Planar extraction is achieved by a split, grow and merge strategy, where planes are

(15)

found by a planetary test involving PCA (Principal Components analysis).

Because of the variety of indoor environments, building a general model is demanding. If a model could be build by feeding the computer data of indoor environments, a more general model could be found. This way of constructing a model, where it’s up to the computer to find the parameters of a complex model based on large amount of data is the deep learning branch of the machine learning field. The best parameters are the one’s that minimizes the error in the model. Therefore, it’s an optimization problem essentially.

A deep learning approach to extract the floor, walls and ceiling is presented and implemented the chapter 8. In machine learning, parameter tuning is only an issue during model training. After the model is trained, the model is fixed and it can efficiently be applied on point clouds resembling the training data.

1.2 Purpose and goals

There several goals for this thesis. They are all important and germane topics for point clouds and computer vision. The following points can summarize the general goals:

1. Find a robust and general indoor reconstruction method.

2. Analyze registration algorithms to improve a reconstruction.

3. Generate heightmaps with machine learning.

1.3 Outline

This thesis consists of nine chapters, each divided into subsections, and a appendix:

• Chapter 1, Introduction, presents the motivation and the purpose and goals of this thesis.

• Chapter 2, Background for Lidar and computer vision, presents the theoretical background to ensure sufficient knowledge of some of the methods and techniques utilized in the first chapters.

• Chapter 3, Tools for Lidar and computer vision, presents the utilized data types and software in the first chapters.

• Chapter 4, Registration of Lidar point clouds, an analysis of different registration methods in a Lidar setting.

• Chapter 5, Vetle Sillerud and Frederik Johanssen’s data acquisition system and data processing pipeline for indoor reconstruction described.

(16)

• Chapter 6, Indoor reconstruction++, presents an improved indoor reconstruction, it’s based on Vetle Sillerud and Frederik Johanssen’s work.

• Chapter 7, Background and tools for machine learning, presents some theoretical background and the software used in the next chapter.

• Chapter 8, Machine learning on point clouds and applications, presents a heightmap generation method with machine learning and other machine learning applications.

• Chapter 9, Conclusion, the big takeways from the thesis.

• Appendix

(17)

Chapter 2

Background for Lidar and computer vision

2.1 Notation

The mathematical concepts used in rest of this chapter will have the following notation. It’s based on the TEK 5030 Computer vision course:

Symbol Description

{~_a} Vector basis~_a

F_a ={~a₁,~a₂,~a₃} Frame a with vector basis~a

~x^a Column vector represented in{a}

~r^a_ab Position vector from{b}to{a}in{a}. q_ab Rotation quaternion from{b}to{a} R_ab Rotation matrix from{b}to{a} T_ab Transformation matrix from{b}_to{a}

2.2 Rotations

There are different ways of describing rotations in space. Rotation matrices, Euler angles, axis-angle theorem or quaterions. They all have different strengths and weaknesses. Rotation is also known as orientation.

2.2.1 Rotation matrices

Rotation matrices are square matrices, with real entries. They are also orthogonal matrices with determinant equal to 1. This means a square matrixRis a rotation matrix if and only ifR^T = R⁻¹and det(R) =1. The set of all orthogonal matrices of size n with determinant 1 forms a group known as the special orthogonal group SO(n). For 3D space the rotation group is SO(3):

SO(₃) ={_R^3x3∈ _R|_RR^T = I, detR=₁} _(2.1)

(18)

Figure 2.1: There exists a rotation matrixR_abthat rotates frame{b}to frame {a}[18].

The rotation matrices can be multiplied together to form a sequence of rotations, but the sequence is not commutative:

R_ac =R_abR_bc (2.2)

Rotation matrices can easily perform action on points, but have 9 parameters to be determined.

2.2.2 Euler angles

Euler angles are three angles introduced by Leonhard Euler to describe the orientation of a rigid body with respect to a fixed coordinate system. Any orientation can be decomposed into a sequence of three principal rotations R = R_z(θ₃)R_y(θ₂)R_x(θ₁). The three angles (θ₁)(θ₂)(θ₃)are known as the Euler angles. The sequence of rotations matrices must be understood to perform rotation. All sequences have singularities i.e. where orientations are not unique, and it’s problem for hardware which are based on Euler angles. It’s called gimbal lock.

(19)

Figure 2.2: The three principal rotations. Each rotation is about one of the main axis [38].

The sequenceR=R_z(θ₃)R_y(θ₂)R_x(θ₁)is commonly used in navigation.

The angles(θ₁)(θ₂)(θ₃)are called pitch, roll and yaw, and give an intuitive way of understanding orientation. This sequence is singular when the pitch is ^π₂ and gimbal lock occurs. Whenθ₂ = ^π₂, both roll and yaw describe a rotation around the z-axis and the ability to roll is lost.

But since most vehicles are never meant to experience a pitch of ^π₂, it’s the best choice of sequence.

Figure 2.3: The Euler angles as pitch, roll and yaw give a good way of describing an aircraft’s orientation in space [38].

A downside by using Euler angles is that it’s problematic to recover the Euler angles from a rotation matrix. On the plus side, only four parameters are needed to define an orientation.

(20)

2.2.3 Angle-axis theorem

Euler’s rotation theorem states that any rotation or sequence of rotations in a three-dimensional space is equivalent to a rotation about a single fixed axis. This means a rotation can be represented by a pair of a vector and an angle (v, φ). This representation is intuitive, but not typically used for computation. The corresponding rotation matrix is R_ab = cos(φ)I+ (1− cos(φ))vv^T+sin(φ)vvˆ, wherevˆ is:

vˆ=





0 −v₃ v₂ v₃ 0 −v₁

−v2 v₁ 0



 (2.3)

Figure 2.4: A rotation can be described by only a vector v and an angleφ [38].

2.2.4 Quaternions

Quaternions are four-dimensional complex numbers that can be used to describe rotations in space, just as ordinary complex numbers describe rotations in the plane. For quaternions, rotations are expressed as a rotation angle around a rotation axis. As mentioned, Euler’s rotation theorem states that an arbitrary rotation matrix R_ab can be decided by rotating frame {a} an angle θ about the axis v_b = v^b₁ v^b₂ v^b₃T

(kv^bk = 1). This representation can be converted to quaternions:

q_x =v₁sin θ

2

q_z =v₃sin θ

2

q_y =v₂sin θ

2

q_w =cos θ

2











⇒q²_w+q²_x+q²_y+q²_z =1 (2.4)

Since these parameters represented the rotationR_ab, the same rotation is now the quaternionq_ab:

q_ab =q_w+q_xi+q_yj+q_zk (2.5)

(21)

whereq_w,q_x,q_y,q_z∈ _Randi,jandkhave complex properties such as:

ii =_jj =_kk =−1 ij =−ji =k

jk =−kj =i ki =−ik =j

(2.6)

The complex terms are closely related to the the axis of rotation, and the the real term is closely related to the angle of rotation.

The norm of such a quaternion is defined as:

kqk= q

qw2+qx2+qy2+qz2 (2.7) A quaternionqcan be normalized to a unit quaternionq⁰by:

q⁰ = ^q

kqk ^(2.8)

where now kq⁰k=1.

When the quaternion q_ab is given, it can be converted to a rotation matrixR_ab:

R_ab =





1−2q²_y−2q²_z 2(q_xq_y−q_wq_z) 2(q_xq_z+q_wq_y) 2(qxqy+qwqz) 1−2q²_x−2q²_z 2(qyqz−qwqx) 2(q_xq_z−q_wq_y) 2(q_yq_z+q_wq_x) 1−2q²_x−2q²_y



 (2.9)

Quaternions are efficient, numerical stable and do not suffer from singularities which make them superior in many applications. They do however, have the difficulty of interpretation.

2.3 Transformation and homogeneous coordinates

A transformation describes a mapping, general function, between two spaces:

T:R^m →_Rⁿ (2.10)

A translation is a geometric transformation that moves every point by the same distance in a given direction, while a rotation is a transformation that moves every point by an angle. Neither translation or rotation is changing the distances between the points. They are rigid transformations.

Vectors, translations and rotations can also be represented by homogeneous coordinates. In homogeneous coordinates another dimension is added. This addition of dimension is also known as projective geometry, as opposed to Euclidean geometry. The advantage of this will be clear shortly.

In 3D, a fourth dimension is added, where the fourth dimension is usually set to 1.

(22)

For a vector~q, the homogeneous representation of the vector is:

˜ q=

~q 1

(2.11) The homogeneous representation of a translationtis:

t=

0 t 0 0 0 1

(2.12) The homogeneous representation of a rotation matrixRis:

R=

R 0 0 0 0 1

(2.13) In homogeneous coordinates, the rotation and translation can be embedded into one matrix to define the transformation:

T=

R t 0 0 0 1

(2.14) In this way, q^b in F_b can be transformed to F_a by the homogeneous (4×4)matrix:

T_ab =

R_ab t^a_ab 0 0 0 1

(2.15)

˜

q^a =T_abq˜^b (2.16)

This compact representation is computationally appealing instead of the non-homogeneous:

~q^a =R_ab~q^b+t^a_ab (2.17) Like rotation matrices, it’s possible to represent the transformation matrix as a sequence of multiplied transformation matrices, but again the sequence is not commutative:

Tac =T_cbT_bc (2.18)

The orthogonality property does not hold for transformation matrices, unlike rotation matrices:

T^T 6=T⁻¹ (2.19)

2.4 A pose in 3D

A pose describes the relationship between two different coordinate frames, the necessary transformation to make those two frames coincide. The orientation is the required rotation and the position is the required translation. They are expressed as a transformation with respect to one of the frames, which is called the reference frame. The position vector of a reference frame with respect to the other frame ist.

(23)

A pose can then be put together as a homogeneous transformation matrix:

T=

"

R t

0 0 0 1

#

(2.20)

Figure 2.5: A reference frame F_a and another frameF_bwith the position vectort^a_abbetween them [18].

The translation which brings the origin ofF_b to the origin ofF_b ist^a_ab. The rotation needed to orientateF_btoF_a isR_ab. The transformation which makesF_bcoincide withF_ais then:

T_ab =

R_ab t^a_ab 0 0 0 1

(2.21) This is the pose ofF_bwith respect toF_a.

2.5 Estimation theory

2.5.1 RANSAC (RANdom SAmple Consensus)

RANSAC is an iterative method for estimating the parameters of a mathematical model from a set of observed data that contain outliers.

RANSAC will in the estimation process divide the observed data into inliners and outliers. Therefore, it can be regarded as an outlier rejection method.

It selects a random subset of points, fits the model to those points, points which fits to the model based on some threshold are inliners, updates model parameters if there are more inliners. The size of the random subset is usually the least number of points needed to fit the model.

(24)

Algorithm 1RANSAC

inputmodely= f(x;a), data set S ={x_i}, inliner distance threshold t repeatsteps 1-3 until N models have been tested

1. Determine a test model y = f(x;a_test from n random data points {f(x₁,y₁),(x2,y2), . . . ,(xn,yn)}

2. Check how well the data ponts in S fit with the test model

- Data points with the distance t of the model constitute a set of inliners S_test ⊂S

- The remaining data points are outliers

3. If S_test is larger than all previous set of inliners, we update the RANSAC model f(x;a) = f(x;a_test)and the set of linersS_{I N} =S_test

The number of RANSAC iterations can either be fixed or found by a probabilistic measure. A common one is:

N = ^log(1−p) log(1−(^|^S_|^test_S_|^|)ⁿ)

where p is the desired probability of selecting at least one model that is free from outliers within N iterations. |S_test|is the number of inliers in the current estimated model. |S|is the total number of observations and n is the number of random points. A new N is calculated every time a model is updated.

2.5.2 Non-linear optimization and non-linear least squares Many optimization problems in computer vision are non-linear, and no closed-form solution exists.

Newton’s method is an iterative method invented by Sir Isaac Newton to find the roots of a set of n non-linear functions with n variables. It needs an initial guess of the root to start.

The simplest case would then be when n is equal to one. Take for instance the equation x² = 5 ⇒ f(x) = x²−5 and then setting f(x) = 0.

At each iteration, the root of the function is approximated by linearizing the function at the current guess of x. Linearization is achieved by using of the derivative of f, f’, to create a tangent line:

x_k+1= x_k− ^x^k f(x_k)⁰ where k is the iteration number.

(25)

Figure 2.6: The root c of the function f is estimated by applying Newton’s method [35].

when n is>1, the multivariate Newton’s method is:

x_k+1= _x_k−J(_x_k)⁻_f¹_f(_x_k)

where x is the vector of variables, J(x)_f is the Jacobian of f, a matrix containing the first-order derivatives of all the n functions with respect to all n variables.

To find the minimum or maximum of one function for optimization, Newton’s method is applied on the derivative of the function:

x_k₊₁= x_k− ^f(x_k)⁰ f(x_k)⁰⁰ where f” is the second derivative of f.

When the function has more than one variable, multivariate Newton’s method for optimization is:

x_k+1 =x_k−H(x_k)⁻_f ¹∇f(x_k)

where x is the vector of variables, ∇f(x) is the gradient of f, a column vector containing the first-order derivatives of f with respect to all the variables, and H(x)_f is the Hessian of f, a matrix containing the second- order derivatives of f with respect to all the variables.

However, Newton’s method cannot trivially be defined for optimization with more than one function. Such formulations are usually based on the Quasi-Newton methods.

When there are multiple functions that can be formulated as a problem of sum of squares, a least squares approach is possible if there more functions than variables. The objective function isε(x) =_∑^m_i₌₁r²_i(x). Non- linear least squares is solved with the Gauss-Newton algorithm. Given m

(26)

functions r = (r₁,...,r_m), called residuals, of n variables x = (x₁,...,x_n), with m≥n, the x which minimizesεis sought. The gradient ofεisJ(x)_r^Tr. The HessianH(x)_ris approximated by J(x)^T_rJ(x)_r. Thus we have:

ε(x) =min

∑

m i=1

r²_i(x)

can be solved by iteratively computing:

x_k+1 =x_k−(J(x_k)^T_rJ(x_k)_r)⁻¹J(x_k)_r^Tr(x_k)

When m=n, (J(x)^T_rJ(x)_r)⁻¹J(x)^T_r becomes (J(x)⁻_r¹. This is equivalent to the multivariate version of Newton’s method. More precisely, (J(x)^T_rJ(x)_r)⁻¹J(x)^T_r is the left pseudoinverse of J(x)_r. Therefore, the Gauss-Newton algorithm can be viewed as a generalization of Newton’s method.

The Levenberg-Marquardt algorithm (LM) is a small, but significant modification to Gauss-Newton algorithm, namely (J(x)^T_rJ(x)_r)⁻¹ should become(J(x)^T_rJ(x)_r+λdiag(J(x)^T_rJ(x)_r))⁻¹. This is the Fletcher variant of the Levenberg-Marquardt algorithm. Theλparameter could be initialized to 10⁻⁴ for instance. In Gauss-Newton, a new update can overshoot and increase the cost ε. If that happens, the step is rejected and λ increased to reduce the trust region. But if a new update reduces the cost, it’s accepted and the trust region is increased by decreasing λ. Decreasing the trust region makes the increment rotated more towards the direction of steepest descent and larger movement along the directions where the gradient is smaller. The latter avoids slow convergence in the direction of small gradients.

2.6 Computer vision

2.6.1 Images

An image is a two- or three-dimensional matrix, where each element is within a fixed range or set of values called spatial resolution. A two- dimensional image is usually a grey-scale image, while a three-dimensional image is a color image or RBG image, where the third dimension have three channels, red, green and blue. A common spatial resolution is 8 bits, that means 2⁸, which gives 265 possible values.

An image can be artificially created or captured by a camera. A camera will create an image by processing the received reflected light.

(27)

Figure 2.7: An image is simply a matrix with values. A two-dimensional matrix in this case.

2.6.2 The perspective camera model

The perspective camera model is a mathematical model describing the correspondence between observed points in a scene and pixels in the captured image.

Figure 2.8: Perspective camera model. The scene point x is xn in the normalized image plane and u in the image plane [40].

The optical axis of the camera is the z-axis of the camera frame F_c. When a point x inF_chas a z-coordinate value of 1, it lies in the normalized image plane. The frameF_iis the image plane.

The pointxusually lives in a different coordinate frame than the camera frame. x could be in a world frame that is common for all pertinent cameras.

(28)

In the situation below, figure 2.9, a camera sees world points. Then a transformationTcwis needed to transform points in the world frame to the camera frame.

Figure 2.9: The pose of the world frame relative to the camera frame [40].

The perspective camera model can mathematically be described as:

Figure 2.10: [40]

The world point ˜x^w is transformed to the camera frame by T_cw, projected byΠ0and transformed to the image plane ˜uby the camera matrix K. The camera matrixKcontains the intrinsic camera parameters.

2.6.3 Feature detection and matching

Image matching in computer vision is a well studied problem. The problem can be formulated as: Given a set of images, find the corresponding points in the images. This is done by finding keypoints in the images, extracting features for these keypoints and then compare the features of the keypoints.

There are two types of detectors, corner detectors or blob detectors.

Blob detectors are usually preferred because they are invariant to scale.

Blob detectors find keypoints by applying the scale-normalized Laplacian

(29)

of Gaussian with different scales and at different spaces. Different scales mean different σ’s of the Gaussian, and different spaces mean different image resolutions.

The scale-normalized Laplacian of Gaussian:

∇L²_norm =σ²(^∂g

2

∂x²+ ^∂g

2

∂y²)

Figure 2.11: A 1D example of blob detection. The magnitude of the Laplacian response is maximum at the centre of the blob provided the scale of the Laplacian matches the scale of the blob [37].

For faster computation, the Difference of Gaussians (DoG) is used.

It’s just an approximation of Laplacian of Gaussian. This implies that keypoints also are estimates.

The Laplacian of Gaussian with different scales and at different spaces can produce too many keypoints, and some of them are unstable.

Therefore, points that have low contrast or are poorly localized along an edge are rejected. For all the keypoints remaining, a descriptor is created for the patch surrounding the keypoint. The descriptors are vectors and subject to distance matching to find the best match. A ratio test between the best and the second best match was proposed by Lowe to discard many false-positive matches [27].

The main methods are SIFT, SURF and ORB, which have their similarities and differences. In SIFT, the descriptor is made up by looking at the image gradients in a patch.

2.6.4 Homography

Mathematically, a homography is any invertible matrix in projective space. In computer vision, two different planar surfaces are related by a homography. The homography matrix gives the transformation between these two planar surfaces. The planar surfaces could be image-to-image or image-to-surface.

(30)

In the case of two images viewing a flat surface, the image-to-image homography gives the transformation of corresponding points. In the figure 2.12 below, two images view the same planar surface. The point

˜

uin the normalized image plane of image 1 which corresponds to the point u on the planar scene, can be found in image 2 by ˜u⁰ =_Hu.˜

Figure 2.12: Two images of viewing planar scene. u˜ and ˜u⁰ which corresponds to the same point u on the planar scene, is related by the homographyH.

The homography matrix can be estimated with Direct Linear Transform (DLT). A more robust estimation is done with Normalized DLT and RANSAC. The minimum of four pair of correspondences are needed to estimateH.

The homography matrix can be decomposed into a rotation R and translation t. This could be the rotation and translation between two cameras. The homography matrix and hence the translation t, are only known up to a scale. Therefore, the true scale between the cameras are not known.

2.6.5 Surface from motion (SFM)

Based on point correspondences between images, it’s possible to estimate the corresponding points and the projective transformation for each camera. This means triangulation of 3D points and pose estimation of the cameras at the same time. This optimization procedure is called full bundle adjustment or simply just bundle adjustment. The goal is to minimize the squared sum of reprojection errors for m cameras and n points:

min

∑

m i

∑

n j

d(u˜_ij,P_iX˜_j)²

This function is non-linear and must have more residuals than variables. It can be solved with the Gauss-Newton and Levenberg-Marquardt algorithm. As an iterative procedure, it requires an initial estimate. The

(31)

initial estimate can be found with a sequential SFM. The camera calibration can be solved as part of the bundle adjustment by including the intrinsic camera parameters in the optimization.

Figure 2.13: The point X_j, the projections P₁, P₂ and P₃ and the point correspondences u_1j, u_2j, u_3j give reprojection errors, and the sum of reprojection errors can be minimized to find the optimal projections and point position [39].

2.7 Point clouds

A point cloud is a set of points in space defined by a coordinate system.

In a 3D coordinate system, each point is defined uniquely by a set of x-, y- and z-coordinate. In some cases, point clouds can include additional information such as RGB color, reflective intensity and surface normals.

Figure 2.14: The Stanford bunny as a point cloud [54].

(32)

2.7.1 Surface normals

A surface normal is a vector orthogonal to the surface. For a point cloud the normal does not exist, but can be estimated based on surface spanned by the neighboring points. How large this neighborhood should be for an accurate estimation, is a parameter that needs to set. This is the scale and will influence the final result. The neighborhood is defined either in terms of a radius or the number of points. The surface normal is estimated by solving the eigenproblem on the covariance matrix for the neighboring points. If p = ¹_k _∑^k_i₌₀p_i is the centroid of k neighboring points, then the covariance matrix for the neighborhood is

C= ¹ k

∑

k i=₁

(p_i−p)(p_i−p)^T

The eigenproblem for C is:

C~v_j =λ~v_j,j∈ {0, 1, 2}

The eigenvector which belongs to the smallest eigenvalue is the surface normal. The sign of the surface normal is ambiguous.

Figure 2.15: The Stanford bunny point cloud with surface normals.

2.7.2 Point cloud conversion

Point clouds themselves can be directly rendered and visualized, and used for various applications. However, they can also be converted to surface models as NURBS (Non-Uniform Rational B-Spline), a triangle or polygon mesh or solid CAD models. NURBS are mathematical models of curves and surfaces, and give flexibility and efficiency for editing and design.

Meshes are information of the collection of vertices, edges and faces that define the shape’s surface.

(33)

Figure 2.16: The Stanford bunny as a polygon mesh [53].

There are many techniques for converting a point cloud to a mesh.

Some approaches, like Delaunay triangulation, alpha shapes and ball pivoting, build a network of triangles over the existing points of the point cloud. Other approaches convert the point cloud into a volumetric distance field and reconstruct the implicit surface, defined through a marching cubes algorithm.

None of these approaches will generally work very well for indoor Lidar point clouds, because of the complexity of geometry and the amount of noise.

2.8 Density plot

Density plots provide a visual representation of the shape of a distribution of the data.

2.8.1 Violin plot

A violin plot is a plot that combines the box plot and a density plot into one diagram to display the shape the data. The name violin plot originates from the visual appearance resembling a violin. A violin plot includes the density plot, and for instance the median and the 2.5% and 97.5%

percentiles. The vertical axis ranges between the outermost samples in the distribution. Extreme outliers are usually excluded.

In the figure below, a standard normal distributionN(0, 1), µ= 0 and σ= 1, is drawn and plotted. µis the mean andσis the standard deviation.

The median, which equals the mean for a normal distribution, is indicated by the center line, while the 2.5% and 97.5%percentiles are indicated by the upper and lower lines. For a normal distribution, these percentiles coincide with the 2σrange about the median.

(34)

Figure 2.17: A violin plot of a standard normal distributionN(0, 1).

(35)

Chapter 3

Tools for Lidar and computer vision

3.1 Lidar

Lidar (LIght Detection And Ranging) is a fast way to obtain high quality 3D data. Lidars gather 3D data by emitting laser pulses. If a laser pulse hits a surface, the pulse is reflected back to the Lidar and the travel time is measured. Based on the direction and the time travelled a relative 3D point to the Lidar is known. If enough number points are gathered, the result will be a good 3D representation of the environment. It has terrestrial, airborne, and mobile applications. Lidar is commonly used to make high-resolution 3d data, with applications in geodesy, geomatics, archaeology, geography, geology, geomorphology, seismology, forestry and laser guidance. Lidar equipment is also used in control and navigation for autonomous vehicles.

Figure 3.1: A Lidar emits a laser pulse which is reflected back to the Lidar.

Based on the rotation and distance, the travel time times the speed, a 3D point is known [29].

3.2 IMU

Modern inertial sensors are microelectromechanical systems with components such as accelerometers, magnetometers, and gyroscopes. An inertial

(36)

measurement unit (IMU) is multiple of these sensors integrated into the same device. These sensors enable the IMU to report information about it’s motion in space. The accelerometer calculates linear acceleration by combining gravity and other external forces acting on the sensor. The magne- tometer is used to measure magnetic field strength. In the Earth reference frame, it’s often used in the same manner as a compass, i.e., to determine the heading in space based on the magnetic field of the Earth. The gyroscope is used for measuring angular velocity and rotation based on Earth’s gravity. It consists of a freely rotating disk called a rotor, mounted onto a spinning axis in the centre of a larger and more stable frame.

Figure 3.2: Right: A classical gyroscope. Left: A high-end gyroscope found in IMUs [52].

3.3 Equipment

The low-cost equipment VLP-16 LiDAR and a Tinkerforge IMU Brick 2.0 are available. The VLP-16 has 16 lasers giving a 360^◦ horizontal field of view (FOV) and a 30^◦ vertical FOV,±15^◦ out of the normal surface of the rotation axis, illustrated in figure 3.3

Figure 3.3: Velodyne LiDAR Puck - 16 fixed vertical lasers rotating 360^◦ [29].

A Tinkerforge IMU Brick 2.0, shown in figure 3.4, is a small, lightweight, and relatively accurate IMU, which one can communicate with through a USB. The IMU outputs quaternions, linear acceleration, gravity vector, rotation vector, as well as Euler angles.

(37)

Figure 3.4: Tinkerforge IMU Brick 2.0 [18].

3.4 Software

3.4.1 VelodyneCapture class and VeloView software

The VelodyneCapture class is provided by Sugiura process Velodyne Li- DAR data and is implemented in C++ [49]. The VelodyneCapture class provides the opportunity to retrieve the laser data from Velodyne sensors.

VelodyneCapture retrieves laser information either capturing a 360^◦revo- lution as one scan or a single data packet. Each data packet includes laser information about azimuth angle, vertical angle, distance, intensity, laser ID, and time.

VeloView is a software provided by Velodyne for real-time visualiza- tion, analysis and data recordings from Velodyne sensors. VeloView offers tools to select, measure, and analyze points delivered by the sensor.

3.4.2 Point Cloud Library

Point Cloud Library (PCL) is a standalone open-source library for point cloud processing and 3D geometry. It contains numerous algorithms, including filtering, feature estimation, surface reconstruction, registration, model fitting, and segmentation. The library also has methods for writing and storing point data in PCD files, PCL’s own point cloud format. PCL is a modern C++ library, written with performance and efficiency in mind, where most of the mathematical operations are implemented with the Eigen linear algebra library. The library has been in development since March 2010 and was officially released in May 2011 under the terms of the 3-clause BSD license. PCL is cross-platform and has been deployed on Linux, Mac OS, Windows, Android and iOS [3].

3.4.3 OpenCV

OpenCV (Open source Computer Vision) is a library mainly aimed at image processing and real-time computer vision. It was originally developed by Intel, later supported by Willow Garage and then Itseez, is now acquired by Intel. The library is cross-platform and free for use under the open-source BSD license. OpenCV has tools for egomotion estimation, motion understanding, object identification, segmentation and

(38)

recognition, stereo vision, structure from motion (SFM), motion tracking and augmented reality to name a few. OpenCV is written in C++ and it’s primary interface is in C++. There are bindings in Python, Java and MATLAB/OCTAVE [36].

3.4.4 AliceVision and Meshroom

AliceVision is a cooperation between multiple universities and labs (Czech Technical University, IMAGINE, INPT, Simula Research Laboratory, and Quine) and Mikros Image, a French post-production company, for a photogrammetric computer vision framework. The framework provides 3D reconstruction and camera tracking algorithms. Meshroom is a free, open-source photogrammetry software that is based on the AliceVision framework. The program is built around an easy-to-use node-based workflow that connects all the steps to generate a 3D model. There are binaries for Windows and Linux, but the source code can be build for Windows, Linux and Max OS X [6].

(39)

Chapter 4

Lidar point cloud registration

In a Lidar measurement setting, to sufficiently cover the target environment, the Lidar most likely needs to be moved to different positions. If the Lidar has no information about the displacement, the point cloud measurements generated at different positions will be in different coordinate systems. This problem is almost completely eradicated in outdoor environments by high-end GNSS (Global Navigation Satellite Systems) technology. In indoors environments, combining the Lidar with an Inertial Measurement Unit (IMU) can keep the measurements in the same coordinate system when moving from position to position. An IMU uses an accelerometer and gyroscope to record the rotation and translation. The relia- bility of this setup relies heavily on the quality of the IMU. If there exists no information about the rotation and translation between positions, a point cloud registration method might succeed to align the measurements. Point cloud registration methods tries to solve the problem of aligning two point clouds which partially or completely presents the same scene, but are in different coordinate systems and or in different scales. This is a field of ac- tive research, and the class of registration methods called Iterative Closest Point (ICP) dominate. There are other point cloud registration methods as well, but although they can be mathematically well formulated, they usually assume one-to-one correspondence between the point clouds or they suffer from computational complexity like the Coherent Point Drift (CPD) method [41]. It means they are not suitable for most Lidar applications.

On the other hand, if the initial alignment is poor, ICP can also be computationally expensive and fail to converge to the correct alignment. Point cloud registration can be divided into two groups, rigid and non-rigid. A registration is rigid if the scale between the point clouds is assumed to be constant, which is the case in most Lidar settings. Only rigid registration is taken into account in this chapter.

4.1 Iterative Closest Point

The Iterative Closest Point algorithm tries to solve the problem of aligning two point clouds which partially or completely presents the same scene, but which are in different coordinate systems. The algorithm rotates and

(40)

translates one of the clouds until it is sufficiently close to the other one. The cloud which is transformed is called source, the fixed cloud is called target.

ICP point-to-point

The original ICP algorithm of Besl and McKay was published in 1992. It’s now called the point-to-point version of ICP. Let’s define the source cloud as P consisting of points p, and the target cloud as Q with points q. The objective function of point-to-point is finding the transformation consisting of rotation and translation that minimizes the sum of the squared distances, the error, for all corresponding pairs p_i,q_i (i, . . . ,n), where n represents the number of points in the source cloud. The optimal transformation for the corresponding pairs is:

argmin

T

∑

n i=1

||Tp_i−q_i||²

where Tis the transformation with rotation and translation. The optimal transformation for each iteration exists in closed-form and can be found with the use of the Singular Value Decomposition (SVD).

Figure 4.1: ICP alignment based on the point-to-point error metric [8]. P is the source cloud and Q is the target cloud. An optimal transformation that minimizes the distance between the three corresponding pairs is found.

(41)

Algorithm 2Iterative closest point with point-to-point metric inputP source, Q target, initial alignment, max iterations (Filtering of P and Q)

T←Initial alignment n = 0

whileICP has not converged or n<max iterationsdo forp_iin Pdo

q_i ←find closest point (p_i, Q) end for

(Weighting)

T←argmin_T∑ⁿi=₁||Tp_i−q_i||²(Outlier rejection) Transform P withT

n = n + 1 end while returnP

where∑ⁿ_i=₁||Tp_i−q_i||²is the error metric. Finding this transformation is a non-linear least squares problem, where only the point-to-point metric has a closed-form solution. The optimal transformation depends greatly on the how the error metric is defined. Finding the closest point to p_i in Q is the most expensive part of the ICP algorithm, but it can be done efficiently by a data structure like a KD-tree.

Figure 4.2: An ICP objective function f(d), which is a sum of distance errors, measures the error in the alignment. It decreases with the number of iterations until convergence, when a transformation which reduces the error can no longer be found.

(42)

Figure 4.3: Two Lidar scans of an outdoor scene, each indicated in blue and green. They are obtained from two different positions and registered into one cloud [7].

ICP point-to-plane

In 1991, Chen & Medioni proposed a objective function for the ICP called point-to-plane.

Geometrically, the point-to-plane metric creates a tangent plane at the target point with surface normal as the plane normal. If a source point p_i lies close to the tangent plane formed by the corresponding target pointq_i, the contribution to the objective function is small, and hence will have a small impact on the transformation estimation.

argmin

T

∑

n i=1

[(Tp_i−q_i)·n_q,i]²

It has been shown that point-to-plane has faster convergence than point-to- point, but perhaps narrower convergence basin [50]. But if the point clouds are mostly smooth and piece-wise planar, the point-to-plane algorithm outperforms the point-to-point method. If the geometric structures in the scene are mostly quadratic or polynomial, the traditional ICP point-to- point algorithm yields better results.

(43)

Take a registration problem like the one below in the figure 4.4. The red points belong to the source and the blue points to the target. The optimal point-to-point ICP transformation moves the red points to the situation on the right. Here, the point-to-point ICP would converge, because no transformation exists that would decrease the error. A translation to the left would increase the total error, since the error for the three horizontal correspondences would increase, but only decrease for two correspondences.

Figure 4.4: Left: Two point clouds to be aligned with the point-to-point ICP.

Blue - target, red - source. Right: The final ICP result [32].

Take the same registration problem, but now with the point-to-plane ICP, figure 4.5. This time the ICP wouldn’t converge, because the three horizontal correspondences would still have an error of zero after a translation to the left. This is because the three horizontal red points lie in the tangent lines spanned by the surface normals of the three horizontal blue points. In this case, point-to-plane finds the optimal alignment while point-to-point does not.

Figure 4.5: Left: Two point clouds to be aligned with the point-to-point ICP. Blue - target, red - source. Right top: The result after the first iteration.

Right bottom: The final ICP result [32].

Point-to-plane is a nonlinear least squares problem, but the rotation matrix can be linearized when the rotation angleθis small withsin(θ)≈θ andcos(θ)≈1.

Non-linear ICP

To reduce the number of iterations for ICP, Andrew W. Fitzgibbon, suggested to use the Levenberg-Marquardt optimization [15]. This

(44)

approach can significantly reduce the number of iterations needed before convergence is reached, but with increased computational complexity per iteration. The outlier rejection method Huber loss can easily be set in the error metric in the non-linear ICP.

Non-linear ICP with Levenberg-Marquardt would find the optimal transformation T, given the correspondences, by using the residuals between P and Q.

Minimize:

ε(T) =

∑

n i=1

r_i(T)²

Point-to-point residual:

r_i(T) =||Tp_i−q_i||

Point-to-plane residual:

r_i = (Tp_i−q_i)·n_q,i With the Jacobian:

J_r= ^∂r

∂T

Generalized ICP

In 2009 Segal et al, Stanford University, came up with an ICP with a plane- to-plane objective function [5]. They made a probabilistic framework to include the normal from the source and the target. The surface normal is assumed to be measured or computed with some uncertainty contrary to point-to-plane. This makes Generalized ICP very attractive in Lidar settings where the measurements are usually noisy and inaccurate.

In their probabilistic model every point in P and Q is assumed to be drawn from a normal multivariate distribution. Defined as p_i ∼ N(pˆ_i,C_i^P) and q_i ∼ N(qˆ_i,C_i^Q) andC_i^P where C_i^Q are covariance matrices associated with the measured points. If µ_i andv_i are the respective surface normals for p_i and q_i. The uncertainty of the surface normal’s are computed by rotating the covariance matrix below, so that e represents uncertainty along the surface normal.

C_i^P = R_µ,i





e 0 0 0 1 0 0 0 1



R^T_µ,i

C_i^Q = R_v,i





e 0 0 0 1 0 0 0 1



R^T_v,i

(45)

Figure 4.6: Two point clouds to be aligned with Generalized ICP. Red scan - source, green scan - target [5].

The ellipses surrounding each point is the covariance matrix visualized as a confidence ellipse. The position of a point is known with high confidence along the normal, but the exact location in the plane is more uncer- tain. In this case, all of the points along the vertical section of the green scan are incorrectly associated with a single point in the red scan. But since the surface normals are inconsistent they will not contribute much to the objective function. Putting more emphasis on corresponding points with consistent surface normals.

The assumption of perfect correspondences, the correct transformation would be T^∗, so ˆq_i = T^∗pˆ_i. An arbitrary rigid transformation, T, the distance error vectors are defined as d⁽_i^T⁾ = q_i − Tp_i. Consider the distribution from which d⁽_i^T^∗⁾ is drawn. Since p_i and q_i are assumed to be drawn from independent Gaussians:

d⁽_i^T^∗⁾∼ N(qˆ_i−(_T^∗)pˆ_i,C^Q_i + (_T^∗)C_i^P(_T^∗)^T) =N(_0,C_i^Q+ (_T^∗)C^P_i (_T^∗)^T) The transformation matrix is found by Maximum Likelihood Estima- tion (MLE).

Since the log function is strictly monotonic, argmax_xΠf(x) = argmax_x∑log(f(x)), the MLE becomes:

argmax

T

∏

n i=1

p(d^T_i ) =argmax

T

∑

n i=1

log(p(d^(T)_i ))

Insertingd^(T)_i and using the logarithmic rules:

argmax

(T)

−ⁿ

2log(_det(C_i^Q+TC_i^PT^T))

−¹ 2

∑

n i=1

d⁽_i^T⁾^T(C_i^Q+TC_i^PT^T)⁻¹d⁽_i^T⁾+const

The above can be simplified to:

argmin

(T)

∑

n i=1

d⁽_i^T⁾^T(C_i^Q+TC_i^PT^T)⁻¹d⁽_i^T⁾

The last equation is the key equation of the Generalized ICP. The minimum of this non-linear problem is solved by the Quasi-Newton

Indoor 3D reconstruction by Lidar and IMU and point cloud segmentation and compression with machine learning

Indoor 3D reconstruction by Lidar and IMU and point cloud

segmentation and compression with machine learning

Henry Haugsten Hansen

Thesis submitted for the degree of Master in Computational Science

(Geosciences) 60 credits

Department of Informatics

Faculty of mathematics and natural sciences

UNIVERSITY OF OSLO

Indoor 3D reconstruction by Lidar and IMU and point cloud

segmentation and compression with machine learning

Henry Haugsten Hansen

Abstract

Contents

List of Figures

Preface

Chapter 1

Introduction

1.1 Motivation

1.2 Purpose and goals

1.3 Outline

Chapter 2

Background for Lidar and computer vision

2.1 Notation

2.2 Rotations

2.3 Transformation and homogeneous coordinates

2.4 A pose in 3D

2.5 Estimation theory

∑

2.6 Computer vision

∑

∑

2.7 Point clouds

∑

2.8 Density plot

Chapter 3

Tools for Lidar and computer vision

3.1 Lidar

3.2 IMU

3.3 Equipment

3.4 Software

Chapter 4

Lidar point cloud registration

4.1 Iterative Closest Point

∑

∑

∑

∏

∑

∑

∑