
Depth Estimation and Object Detection using Stereo Vision for Autonomous Ferry

Master's thesis in Cybernetics and Robotics
Supervisor: Edmund Førland Brekke
Co-supervisors: Annette Stahl and Øystein Kaarstad Helgesen
May 2021

Norwegian University of Science and Technology
Faculty of Information Technology and Electrical Engineering
Department of Engineering Cybernetics


Abstract

In this project, several methods for solving the correspondence problem and for object detection in stereo vision are compared and evaluated. Regarding the correspondence problem, the local correspondence method Sum of Absolute Differences and the global Semi-Global Method were tested on real data. Stixel Tesselation and Euclidean Clustering were tested for object detection.

This project was conducted as a part of the NTNU research project Autoferry, and provides improved knowledge on the use of stereo vision on the autonomous ferry milliAmpere.

Autonomous vehicles that move in a space shared with other vehicles depend on accurate and reliable perception data. Detecting and estimating the distance to objects far away is a difficult task. Lidar has been the "go-to" sensor for high-frequency 3D perception at short to medium ranges. However, these sensors can be expensive, have poor resolution and can be sensitive to weather.

To address this, I tested and evaluated different stereo vision algorithms on real data from milliAmpere. The results from these tests have been compared against each other as well as against lidar data to see which is best in terms of accuracy, detection and run time. The findings of this study show that stereo vision can provide valuable information to autonomous vehicles at greater distances than lidar, and that the accuracy is comparable to what the lidar provides. A stereo camera therefore seems like a good sensor for use on milliAmpere. However, the run time of the presented methods is not optimal, and future research should focus on reducing it.


Sammendrag

In this thesis, several methods for solving the correspondence problem and for object detection with a stereo camera are evaluated. For the correspondence problem, the local method Sum of Absolute Differences and the global Semi-Global Method were tested on real data. The methods Stixel Tesselation and Euclidean Clustering were tested for object detection. The work was done as part of the research project Autoferry, where this project investigates the possibilities of using a stereo camera on the autonomous ferry milliAmpere.

Autonomous vehicles that move in an area shared with other vehicles depend on an accurate and robust perception of their surroundings. Detecting and estimating the distance to objects far away is challenging. For a long time, lidar has been the most common sensor for capturing data about the surroundings at short to medium ranges. However, lidar sensors can be expensive, strongly affected by weather conditions, and have poor resolution at long ranges. An alternative to lidar is the stereo camera, which covers many of the same needs but with higher resolution. The drawback of stereo cameras has traditionally been poorer accuracy and high computational demands.

In this study, the different methods are tested and evaluated through experiments conducted with the autonomous ferry milliAmpere in the actual environment milliAmpere is intended to operate in. The methods have been compared against each other as well as against lidar data. The findings of this study indicate that a stereo camera can provide information about the surroundings at a greater range than lidar, with an accuracy comparable to that of lidar. A stereo camera can therefore be a useful sensor on board milliAmpere. However, the run time of the different object detection methods was not optimal, and further work should focus on reducing it.


Preface

This thesis is the completion of a Master's programme in Cybernetics and Robotics at the Norwegian University of Science and Technology (NTNU). Throughout this project I have received valuable support and assistance, and I would like to express my gratitude to all of you who have contributed.

First and foremost, I would like to thank my supervisor, Edmund Førland Brekke, for his valuable feedback and guidance. For sharing their knowledge with me, and providing technical input and valuable feedback on my work, I would like to thank my co-supervisors Annette Stahl and Øystein Kaarstad Helgesen. I also wish to thank my fellow students Martin Rånes Græsdal, Martin Eek Gerhardsen and Thomas Hellum. Your help during the experiments with milliAmpere was valuable and much appreciated.

Thank you, Egil Eide, for organizing the use of milliAmpere, and Håkon Hagen Helgesen for lending me the equipment for logging ground truth during the experiments. Glenn Angell at the ITK workshop at NTNU made two beautiful housings for the cameras, which ensured that the experiments could be conducted regardless of weather conditions.


Contents

Abstract . . . i

Sammendrag . . . iii

Preface. . . v

Contents. . . vii

Figures. . . ix

Tables . . . xiii

Symbols . . . xv

1 Introduction . . . 1

1.1 Background . . . 1

1.2 Thesis outline . . . 2

2 Theory . . . 3

2.1 Camera as a Sensor . . . 3

2.2 Epipolar Geometry . . . 6

2.3 Stereo Calibration . . . 7

2.4 Uncertainty in Stereo Vision Systems . . . 11

3 The Correspondence Problem. . . 19

3.1 Disparity Map . . . 20

3.2 Local Methods - Correlation Based . . . 21

3.3 Local Methods - Feature Based . . . 24

3.4 Global Methods . . . 27

4 Evaluation of Matching Methods . . . 31

4.1 Local Methods - Feature Based . . . 31

4.2 Local Methods - Correlation Based . . . 32

4.3 Global Methods . . . 32

4.4 SGM vs SAD . . . 32

5 Object Detection using Stereo Vision . . . 37

5.1 The Ground Plane . . . 38

5.2 Method 1 - Stixel Tesselation . . . 38

5.3 Method 2 - Digital Elevation Map . . . 42

5.4 Method 3 - Geometry-based Cluster . . . 44

5.5 Method 4 - Direct Planar Hypothesis Testing . . . 46

5.6 Method 5 - Euclidean Clustering . . . 47

5.7 Evaluation of Detection Methods . . . 49

6 System Description. . . 51

6.1 Hardware . . . 51

6.2 Software . . . 54

7 Experiments . . . 59

7.1 Ground truth . . . 59

8 Results . . . 65

8.1 Calibration Results . . . 65

8.2 Distance Estimation . . . 68

8.3 Detection Rate . . . 76

8.4 Run time . . . 83

9 Discussion . . . 85

9.1 Stereo Calibration . . . 85

9.2 Stixel Tesselation . . . 85

9.3 Euclidean Cluster . . . 86

10 Conclusion and Future Work . . . 89

10.1 Conclusion . . . 89

10.2 Future Work . . . 90

Bibliography . . . 91

A Additional Material . . . 95

A.1 Maximum Likelihood Estimation - MLE . . . 95

A.2 Generalized Likelihood Ratio Test - GLRT . . . 95

B Scenario Overview . . . 97

C Point clouds - Euclidean Cluster - Stereo Vision . . . 101

Figures

2.1 Illustration of the principle behind the pinhole camera model . . . 3

2.2 Relation between camera- and image coordinates . . . 4

2.3 Illustration of coordinate systems . . . 5

2.4 Epipolar plane spanned between camera origins and observed 3D point . . . 6

2.7 Baseline impact on overlapping field of view . . . 12

2.8 Uncertainty regions with different camera angles . . . 13

2.9 Uncertainty regions with different baselines . . . 13

2.10 Example of how varying baseline affects estimation error. Fixed focal length (1230[px]) and disparity error (0.1[px]) . . . 14

2.11 Illustration of reprojection error . . . 15

2.12 Example of how reprojection error affects estimation error. Fixed focal length (1230[px]) and baseline (1.8[m]) . . . 15

2.13 Parabolafit to disparity matches . . . 16

3.2 Image frames before and after rectification . . . 20

3.3 Illustration of patch sliding across scanline . . . 21

3.4 Pyramid where each level has half the width and length as the previous level resulting in a quarter of the pixels . . . 24

3.5 SIFT descriptor . . . 25

3.6 SURF Box-Filtering . . . 26

3.7 2D intra-scanline Source:[28] . . . 27

3.8 Aggregation of cost . . . 29

4.1 Keypoint detectors - 1 . . . 32

4.2 Keypoint detectors - 2 . . . 32

4.3 Comparison of disparity maps - 1 . . . 33

4.4 Comparison of disparity maps - 2 . . . 33

4.5 Input image - 1 . . . 34

4.6 Disparity map example - SAD - 1 . . . 34

4.7 Disparity map example - SGM 3WAY - 1 . . . 34

4.8 Disparity map example - SGM 8WAY - 1 . . . 35

4.9 Input image - 2 . . . 35

4.10 Disparity map example - SAD - 2 . . . 35

4.11 Disparity map example - SGM 3WAY - 2 . . . 35

4.12 Disparity map example - SGM 8WAY - 2 . . . 36

5.1 Variations of the occupancy grid . . . 40

5.2 Illustation of freespace . . . 41

5.3 Stixel distribution from scenario 7 . . . 42

5.4 Flow chart of DEM algorithm . . . 43

5.5 Output of DEM algorithm . . . 44

5.6 Double truncated cone . . . 45

5.7 Segmentation process 3-D points classified as obstacle . . . 46

5.8 Cones that visualize allowed deviations in local plane and obstacle . . . 46

6.2 Point of intersection FoV . . . 52

6.3 Sensors mounted on milliAmpere . . . 53

6.4 Communication between units in the system . . . 53

6.5 Flowchart for camera driver . . . 55

6.6 Flowchart forstereo_image_proc . . . 55

6.7 Image before and after resizing . . . 56

6.8 Completeflow chart for stixel implementation . . . 56

6.9 Completeflow chart for Euclidean cluster implementation . . . 57

6.10 Tuning parameters inrqt_reconfigure . . . 57

7.1 Havfruen and milliAmpere . . . 59

7.3 Positional Error in Ground Truth . . . 61

7.4 MilliAmpere during testing . . . 62

7.6 Flowchart of ROS-node for lidar . . . 63

8.1 Overview of area covered during testing . . . 65

8.2 Corners of checkerboard detected . . . 66

8.3 Rectified image with horizontal epipolar lines . . . 66

8.4 Checkerboard positions during calibration . . . 66

8.5 Mean reprojection error for every image . . . 67

8.6 Lower bound for depth error as a result of the calibration results . . . 68

8.7 Illustration of scenario 1 . . . 68

8.8 Distance estimation for scenario 1 . . . 69

8.9 Distance error - Scenario 1 . . . 69

8.10 Illustration of scenario 2 . . . 70

8.11 Distance estimation for scenario 2 . . . 70

8.12 Distance error - Scenario 2 . . . 71

8.13 Illustration of scenario 3 . . . 71

8.14 Distance estimation for scenario 3 . . . 72

8.15 Distance error - Scenario 4 . . . 72

8.16 Illustration of scenario 4 . . . 73

8.17 Distance estimation for scenario 4 . . . 73

8.18 Distance error - Scenario 4 . . . 73

8.19 Illustration of scenario 5 . . . 74

8.20 Distance estimation for scenario 5 . . . 74

8.21 Distance error - Scenario 5 . . . 75

8.22 Detection distribution for stixel method . . . 76

8.23 Detection distribution for Euclidean cluster method - Stereo Vision . . . 77

8.24 Detection distribution for Euclidean cluster method - Lidar . . . 78

8.25 Examples of stixel detection in scenario 1 . . . 78

8.26 Examples of stixel detection in scenario 2 . . . 79

8.27 Examples of stixel detection in scenario 3 . . . 80

8.28 Examples of stixel detection in scenario 4 . . . 81

8.29 Examples of stixel detection in scenario 4 . . . 81

8.30 Examples of stixel detection in scenario 5 . . . 82

9.1 Rotation of left camera causes translation of the right camera . . . 85

9.2 . . . 87

Tables

3.1 Explanation of elements in energy function . . . 29

4.1 Run times for SAD and SGM . . . 33

4.2 Summary of selected matching methods . . . 36

5.1 Summary of selected OD method . . . 50

6.1 Camera specifications . . . 52

6.2 Lens specifications . . . 52

7.1 Technical data for VPL 16 . . . 63

8.1 Extrinsic parameters from camera calibration . . . 67

8.2 Intrinsic parameters from camera calibration . . . 67

8.3 Duration of intervals without detection for different distances in total for all scenarios with the stixel method . . . 76

8.4 Duration of intervals without detection for different distances in total for all scenarios with the Euclidean cluster method - Stereo Vision . . . 77

8.5 Duration of intervals without detection for different distances in total for all scenarios with the Euclidean cluster method - Lidar . . . 77

8.6 Duration of intervals without detection for different distances for scenario 1, Stixel . . . 78

8.7 Duration of intervals without detection for different distances for scenario 1, Cluster stereo . . . 79

8.8 Duration of intervals without detection for different distances for scenario 1, Cluster lidar . . . 79

8.9 Duration of intervals without detection for different distances for scenario 2, Stixel . . . 80

8.10 Duration of intervals without detection for different distances for scenario 2, Cluster stereo . . . 80

8.11 Duration of intervals without detection for different distances for scenario 2, Cluster lidar . . . 80

8.12 Duration of intervals without detection for different distances for scenario 3, Stixel . . . 80

8.13 Duration of intervals without detection for different distances for scenario 3, Cluster stereo . . . 81

8.14 Duration of intervals without detection for different distances for scenario 3, Cluster lidar . . . 81

8.15 Duration of intervals without detection for different distances for scenario 4, Stixel . . . 81

8.16 Duration of intervals without detection for different distances for scenario 4, Cluster stereo . . . 82

8.17 Duration of intervals without detection for different distances for scenario 4, Cluster lidar . . . 82


8.18 Duration of intervals without detection for scenario 5, Stixel . . . 82

8.19 Duration of intervals without detection for scenario 5, Cluster . . . 82

8.20 Duration of intervals without detection for scenario 5, Cluster Lidar . . . 82

8.21 Mean run times for each scenario, stixel . . . 83

8.22 Mean run times for each scenario, stereo cluster . . . 83

8.23 Mean run times for each scenario, lidar cluster . . . 83

9.1 Mean run times for each scenario - Alternative filtering sequence - Stereo . . . 87

Symbols

Coordinates

Xw - World coordinates
Xc - Camera coordinates
(x, y) - Image coordinates
p = (u, v) - Discretized image coordinates
~ - Homogeneous coordinates
^ - Estimate
d - Disparity value
D - Disparity map

Camera Parameters

fu - Horizontal focal length [px]
u0 - Horizontal principal point [px]
fv - Vertical focal length [px]
v0 - Vertical principal point [px]
B - Baseline
f - Focal length [m]
K - Intrinsic matrix
P - Projection matrix
fs - Skew parameter
λ - Scaling factor

Object Detection

vB - Stixel base point
vT - Stixel top point
S - Error function
θ - Maximum likelihood
Hf - Null hypothesis
Ho - Alternative hypothesis
I - Data vector
γ - Decision threshold
α(p) - Intensity bias at p
η(p) - Noise at p
G - Multivariate Gaussian
ξ - Error
Γ - Covariance matrix


Misc.

Ex(.) - Energy function
m - Stereo measurement
E - Edge set
V - Vertex set
G - Graph
A - Transformation matrix
σ - Scale
L - Likelihood function
L - Scale space
Θ - Horizontal field of view
e - Epipole
F - Fundamental matrix
E - Essential matrix
WD - Working distance
β - Inward rotation of camera
H - Homography matrix

1 Introduction

1.1 Background

A crucial part of any autonomous vehicle is an accurate and reliable perception system. For an autonomous system to move safely, it depends on systems that provide information about its surroundings, enabling it to make correct decisions regarding its movement and avoid collisions. These systems also have to be able to detect objects of various sizes and at various ranges in order to plan trajectories early and behave predictably for other vehicles in their vicinity. As robots are now taking their first steps off factory floors and into the real world, these systems are becoming more important than ever.

In order for a robot to sense its surroundings it needs to be equipped with various sensors. Radar has the ability to detect objects at large distances, but its slow update frequency and poor accuracy make it less ideal for close encounters [1]. Lidar is accurate in perceiving depth at closer distances, but it has poor resolution at long ranges and can be heavily affected by weather conditions, especially rain. Stereo cameras have a denser resolution than lidar, but have traditionally lacked the same depth accuracy. However, in recent years stereo vision has grown in popularity due to new and improved algorithms and its affordability compared to lidar. This thesis explores opportunities to implement stereo vision on a robot in the maritime environment.

An Unmanned Surface Vehicle (USV) is a vehicle operating on the sea surface without on-board personnel. A sub-category of these vehicles is the Autonomous Surface Vehicle (ASV), which is able to operate without human interaction. Some research has been conducted on ASVs and their applications [2] in recent years, but while autonomy is becoming more and more common in the automotive industry, it is still a scarce feature in the maritime sector. Nonetheless, a suitable application for ASVs is human transport at sea, namely ferries. According to [3], human error was a factor in more than 70% of the accidents in the study. This leaves room for a large leap in safety if human error can be eliminated with the use of autonomy.

This thesis is a part of the Autoferry research project at NTNU. The main goal of the research project is to "develop groundbreaking new concepts and methods which will enable the development of autonomous passenger ferries for transport of people in urban water channels" [4]. The ferry milliAmpere is going to transport passengers between Ravnkloa and Vestre kanalhavn in the Trondheim canal. The passengers shall be able to call the ferry at the push of a button, and the rest is to be fully autonomous.

Earlier work regarding stereo vision in the Autoferry project includes Lina Theimann and Trine Ødegård's master's thesis [5], where they used a correlation-based local method to obtain a disparity map and a Convolutional Neural Network (CNN) for object detection. A specialization project by the same authors compared different local methods for solving the correspondence problem [6]. In my specialization project in TTK4551 during the autumn of 2020 [7], I tried to take this further by comparing their findings with the more recent Semi-Global Method (SGM) [8]. That project showed that SGM could operate in real time and resulted in a denser point cloud than local methods, but at roughly the same depth accuracy.

The aim of this thesis is to evaluate different stereo vision systems regarding object detection and depth estimation in a way that contributes to a better understanding of how the different methods work and what their strengths and weaknesses are. In addition, the most suitable methods will be chosen for use on milliAmpere. As there are many different methods and algorithms to choose from for the different stages in the process, this thesis tries to combine the findings by Olsen and Theimann in [6] and my specialization project [7], and present a more complete system. This is done by testing the systems in the working environment to help determine their performance in real-life scenarios. The thesis addresses the following tasks:

• General design of a stereo vision system.

• Evaluation of matching methods.

• Evaluation of object detection methods.

• Implementation of a stereo vision system on milliAmpere.

• Evaluation of performance.

1.2 Thesis outline

The thesis is structured as follows: chapter 2 presents relevant background theory. The backbone of stereo vision, namely solving the correspondence problem, is presented in chapter 3 and evaluated in chapter 4. In chapter 5, different methods for generic object detection based on stereo vision are explained, followed by an evaluation of the methods with regard to their use in this thesis. Chapter 6 explains the stereo system used in this thesis and the design choices that were made. Chapter 7 and chapter 8 present the experiments conducted and the corresponding results, respectively.

Lastly, a thorough discussion and conclusion are given in chapter 9 and chapter 10.

2 Theory

In this chapter, relevant theory for understanding a stereo vision system will be presented. How to use a monocular camera as a sensor is fundamental, and it will therefore be explained first. Epipolar geometry is key to understanding why a stereo vision system differs from a setup with two unrelated cameras, and therefore has a central part in this chapter. Finally, the theory behind stereo camera calibration and accuracy in stereo vision systems will be described.

2.1 Camera as a Sensor

In general, the camera conducts a mapping of the 3D world onto a 2D image [9, p. 153]. There is a range of different methods for this type of mapping, but the simplest and most popular is the pinhole camera model. The pinhole model is based on the observation that when a plane with a small aperture is placed in front of a scene, the scene will be projected as a reversed image of reality on the other side of the plane. This phenomenon can be observed in a dark room with a keyhole as the aperture. An illustration of this can be seen in fig. 2.1b.

Figure 2.1: Illustration of the principle behind the pinhole camera model. (a) No barrier causes a blurred image. (b) Barrier with aperture.

In practice this is done by defining a plane at Zc = f, parallel to the camera coordinate system (where f is the focal length of the camera, fig. 2.2). Points in the 3D world are observed by the camera along a line between the origin of the camera coordinate system and the 3D point. This line will intersect the image plane, and the relationship between image and world points can be described by means of similar triangles (fig. 2.2).


Figure 2.2: Relation between camera- and image coordinates

\[
(X_c, Y_c, Z_c)^T \;\longmapsto\; \left( f\frac{X_c}{Z_c},\; f\frac{Y_c}{Z_c},\; f \right)^T
\tag{2.1}
\]

Homogeneous Coordinates

Homogeneous coordinates are a coordinate system for projective spaces. They allow us to express the relation between 2D and 3D coordinates very elegantly by simplifying mathematical derivations [9, p. 27]. For instance, several transformations such as translation, rotation, scaling and affine maps can be expressed by matrices. Transformation between Euclidean and homogeneous space can be done as described in eq. 2.2 and eq. 2.3, where p is Euclidean and \( \tilde{p} \) is homogeneous.

\[
p = \begin{bmatrix} u \\ v \end{bmatrix}
\;\longrightarrow\;
\tilde{p} = \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
\tag{2.2}
\]

\[
\tilde{p} = \begin{bmatrix} u \\ v \\ w \end{bmatrix}
\;\longrightarrow\;
p = \begin{bmatrix} u/w \\ v/w \end{bmatrix}
\tag{2.3}
\]

Homogeneous coordinates are only defined up to scale.

\[
\tilde{p} = \lambda \tilde{p}, \qquad \lambda \neq 0
\tag{2.4}
\]

Another important property is that points infinitely far away in the 3D world can be expressed with finite coordinates.

\[
\tilde{p}_{infinite} = \begin{bmatrix} u \\ v \\ 0 \end{bmatrix}
\tag{2.5}
\]

By utilizing these properties, a series of complex transformations can be described by a single transformation matrix, created by a series of matrix multiplications.

\[
\tilde{p}' = A \tilde{p}, \qquad \text{where } A \text{ is a transformation matrix}
\tag{2.6}
\]
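As a side note, the conversions in eqs. 2.2-2.6 map directly to a few lines of code. The following is a minimal sketch (Python with NumPy, not part of the thesis software); the transformation matrix A below is a hypothetical pixel translation used only for illustration.

```python
import numpy as np

def to_homogeneous(p):
    """Eq. 2.2: (u, v) -> (u, v, 1)."""
    return np.append(np.asarray(p, dtype=float), 1.0)

def from_homogeneous(p_tilde):
    """Eq. 2.3: (u, v, w) -> (u/w, v/w); w must be non-zero."""
    p_tilde = np.asarray(p_tilde, dtype=float)
    return p_tilde[:-1] / p_tilde[-1]

# Hypothetical transformation matrix A (eq. 2.6): translate a pixel by (10, 5).
A = np.array([[1.0, 0.0, 10.0],
              [0.0, 1.0,  5.0],
              [0.0, 0.0,  1.0]])

p = (100.0, 200.0)
p_transformed = from_homogeneous(A @ to_homogeneous(p))   # -> [110., 205.]
```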


From World to Image

To transform the data given by the camera system into usable information describing the real world, a transformation between coordinate systems is required. In a camera system there are often four different coordinate systems in use:

1. Xw - The world coordinate system describing 3D points with origin in an arbitrary location.

2. Xc - The camera coordinate system describing 3D points with origin in the camera center.

3. (x, y) - The image plane coordinate system describing 2D points on the image plane with origin in the center of the image plane.

4. p = (u, v) - The discretized image coordinate system describing 2D points in pixel units with origin in the upper left corner of the image.

Figure 2.3: Illustration of coordinate systems

In most cases it is the transformation from world coordinates (Xw) to discretized image coordinates, p, that is desirable, while the camera coordinate system and the image coordinate system are just intermediate steps required to complete the transformation. The first step in this process is to transform a world point Xw into a camera point Xc. This is done by a rigid body transformation. In practice this means rotating and translating Xw to coincide with Xc.

\[
\begin{bmatrix} X_c \\ Y_c \\ Z_c \\ 1 \end{bmatrix}
=
\begin{bmatrix} R & T \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix}
\tag{2.7}
\]

After transforming Xw to Xc, the point can be projected onto the image plane. As the image plane has coinciding X- and Y-axes with Xc and is only shifted along the Z-axis by the focal length (f), this projection can be described as in eq. 2.8, where Z acts as a scaling factor (cf. λ in eq. 2.9).

\[
Z \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}
=
\begin{bmatrix}
f & 0 & 0 & 0 \\
0 & f & 0 & 0 \\
0 & 0 & 1 & 0
\end{bmatrix}
\begin{bmatrix} X_c \\ Y_c \\ Z_c \\ 1 \end{bmatrix}
\tag{2.8}
\]

The transformation from image coordinates to discretized image coordinates is done by multiplying the coordinates with the camera's intrinsic matrix (K), which transforms the point into pixel units and shifts the origin to the top left corner. This matrix is obtained by camera calibration and is described in detail in section 2.3. The complete transformation can be seen in eq. 2.9, where Π is known as the projective matrix.

\[
\lambda \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
=
\underbrace{\begin{bmatrix}
f_u & f_s & u_0 \\
0 & f_v & v_0 \\
0 & 0 & 1
\end{bmatrix}}_{K}
\underbrace{\begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0
\end{bmatrix}}_{\Pi}
\underbrace{\begin{bmatrix} R & T \\ 0 & 1 \end{bmatrix}}_{[R|T]}
\begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix}
\tag{2.9}
\]

\[
\lambda \tilde{p} = K \Pi [R|T] \tilde{X}_w
\tag{2.10}
\]
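As an illustration of eqs. 2.9-2.10, the sketch below (Python/NumPy, not from the thesis) projects a single world point to pixel coordinates. The intrinsic values are hypothetical, loosely inspired by the focal length used later in the thesis.

```python
import numpy as np

def project_point(Xw, K, R, T):
    """Eq. 2.10 for a single point: world coordinates -> pixel coordinates."""
    Xc = R @ Xw + T            # rigid body transform, eq. 2.7
    uvw = K @ Xc               # intrinsics; the projection Pi drops the fourth column
    return uvw[:2] / uvw[2]    # divide out the scale factor lambda

# Hypothetical camera: f = 1230 px, principal point (640, 480), no skew.
K = np.array([[1230.0,    0.0, 640.0],
              [   0.0, 1230.0, 480.0],
              [   0.0,    0.0,   1.0]])
R, T = np.eye(3), np.zeros(3)

print(project_point(np.array([1.0, 0.5, 20.0]), K, R, T))
```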

2.2 Epipolar Geometry

Epipolar geometry is the geometry between two views [9]. It describes the relative pose between the cameras regardless of what scene is observed. Knowing this relationship between the cameras is essential in stereo vision systems, as it is this knowledge that allows for depth estimation from a single stereo image pair. If a point Xw is viewed from two different views as pl and pr, an epipolar plane π is the plane spanned by these three points. The straight line between the origins of the camera coordinate systems is the baseline.

Figure 2.4: Epipolar plane spanned between camera origins and observed 3D point

Source: Based on fig. 9.1 in [9]

The point of intersection between the baseline and the image plane is known as the epipole. Perhaps the most important feature in epipolar geometry is the epipolar line. This line is the intersection of the epipolar plane and the image plane. In other words: if Xw is observed in the left image view, the same point has to be located along the epipolar line in the right image view, and vice versa.

(a) Epipolar line (Source: Based on fig. 9.1 in [9]). (b) Epipolar plane (Source: Based on fig. 9.2 in [9]).

The algebraic representation of epipolar geometry is known as the Fundamental Matrix [9]. This matrix has some useful properties listed below.

\[
p_r^T F p_l = 0 \qquad \text{for corresponding points } p_r, p_l
\tag{2.11}
\]
\[
l_r = F p_l, \quad l_l = F^T p_r \qquad \text{where } l \text{ is an epipolar line}
\tag{2.12}
\]
\[
F e_l = 0, \quad F^T e_r = 0 \qquad \text{where } e \text{ is an epipole}
\tag{2.13}
\]

The essential matrix is a specialised version of the fundamental matrix [9, p. 257] that can be obtained if the camera intrinsics, K, are known.

\[
\hat{p}_l = K^{-1} p_l
\tag{2.14}
\]
\[
\hat{p}_l = [R\,|\,t]\, X
\tag{2.15}
\]

The epipolar constraint in eq. 2.11 can now be expressed in normalized coordinates:

\[
E = [t]_{\times} R = R\, [R^T t]_{\times}
\tag{2.16}
\]
\[
\hat{p}_r^T E \hat{p}_l = 0
\tag{2.17}
\]
\[
E = K_r^T F K_l
\tag{2.18}
\]
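The two relations most used in practice, eq. 2.11 and eq. 2.18, are simple matrix products. A minimal sketch (Python/NumPy, assuming F, K_l and K_r are already known, e.g. from a calibration):

```python
import numpy as np

def essential_from_fundamental(F, K_l, K_r):
    """Eq. 2.18: E = K_r^T F K_l."""
    return K_r.T @ F @ K_l

def epipolar_residual(F, p_l, p_r):
    """Eq. 2.11: p_r^T F p_l, which should be close to zero for a true match.
    p_l and p_r are pixel coordinates (u, v)."""
    p_l_h = np.append(np.asarray(p_l, dtype=float), 1.0)
    p_r_h = np.append(np.asarray(p_r, dtype=float), 1.0)
    return float(p_r_h @ F @ p_l_h)
```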

2.3 Stereo Calibration

In this section the theory and process of stereo camera calibration will be presented. A proper calibration is needed to minimize the uncertainty in the measurements of a stereo system. As the stereo calibration consists of estimating two different sets of parameters, namely the intrinsic and the relative extrinsic parameters, they will be presented accordingly.

Calibration Introduction

Many of the available tools, such as Matlab's Camera Calibration App and OpenCV's calibration functions, are based on the same method, namely Zhang's method, first presented by Zhang in 2000 [10]. In this method, a set of pictures of a planar object with a known geometric pattern is captured. The most common pattern is a checkerboard pattern, where all the squares have known dimensions. Either the camera or the checkerboard is locked in place while the other changes position, and images of the pattern are captured from various distances and angles. A world coordinate system is defined with origin in the upper left corner of the checkerboard, with the X-, Y- and Z-axes pointing right, down and inward, respectively. This way, all the corners of the checkerboard have known world coordinates. The same corners can easily be detected in an image and given an image coordinate. As the origin of the world coordinate system lies on the checkerboard, Z = 0 for all points on the plane, and the relationship between image points and world coordinates can be described as in eq. 2.19.

\[
\lambda \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
= K \begin{bmatrix} R & t \end{bmatrix}
\begin{bmatrix} X_w \\ Y_w \\ 0 \\ 1 \end{bmatrix}
\tag{2.19}
\]

Rewriting eq. 2.19, using r_i to represent each column of R, a new matrix H ∈ R^{3×3}, known as the homography matrix, can be defined. A 2D homography is a linear transformation of 2D points from one plane to another [9, p. 87]. The matrix H is defined up to a scale factor (λ) and has eight degrees of freedom [11].

\[
\lambda \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
= K \begin{bmatrix} r_1 & r_2 & t \end{bmatrix}
\begin{bmatrix} X_w \\ Y_w \\ 1 \end{bmatrix}
= H \begin{bmatrix} X_w \\ Y_w \\ 1 \end{bmatrix}
\tag{2.20}
\]

\[
\lambda \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
=
\begin{bmatrix}
h_{11} & h_{12} & h_{13} \\
h_{21} & h_{22} & h_{23} \\
h_{31} & h_{32} & h_{33}
\end{bmatrix}
\begin{bmatrix} X_w \\ Y_w \\ 1 \end{bmatrix}
\tag{2.21}
\]

The homogeneous set of equations in eq. 2.21 can be turned into inhomogeneous equations by dividing by the scalar term (λ = h_{31} X_w + h_{32} Y_w + h_{33}).

\[
u = \frac{h_{11} X_w + h_{12} Y_w + h_{13}}{h_{31} X_w + h_{32} Y_w + h_{33}}
\tag{2.22}
\]
\[
v = \frac{h_{21} X_w + h_{22} Y_w + h_{23}}{h_{31} X_w + h_{32} Y_w + h_{33}}
\tag{2.23}
\]

Rewriting eq. 2.22 and eq. 2.23 into vector form we get:

\[
a_u^T h = 0
\tag{2.24}
\]
\[
a_v^T h = 0
\tag{2.25}
\]

Combining this set of equations for every point correspondence and assembling them into one matrix, A, we get:

\[
A =
\begin{bmatrix}
a_{u1}^T \\ a_{v1}^T \\ \vdots \\ a_{un}^T \\ a_{vn}^T
\end{bmatrix},
\qquad 4 \le n
\tag{2.26}
\]

\[
A =
\begin{bmatrix}
-X_w & -Y_w & -1 & 0 & 0 & 0 & u_1 X_w & u_1 Y_w & u_1 \\
0 & 0 & 0 & -X_w & -Y_w & -1 & v_1 X_w & v_1 Y_w & v_1 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
-X_w & -Y_w & -1 & 0 & 0 & 0 & u_n X_w & u_n Y_w & u_n \\
0 & 0 & 0 & -X_w & -Y_w & -1 & v_n X_w & v_n Y_w & v_n
\end{bmatrix},
\qquad 4 \le n
\tag{2.27}
\]

\[
A h = 0
\tag{2.28}
\]

As H has eight degrees of freedom, at least four correspondences are necessary to solve the set of linear equations. Using Singular Value Decomposition (SVD), the values of H can be extracted from the last column of V in eq. 2.29.

\[
A = U \Sigma V^T
\tag{2.29}
\]
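A minimal sketch of the DLT estimate in eqs. 2.26-2.29 (Python/NumPy, illustrative only; it assumes at least four planar world points (X, Y) with Z = 0 and their detected image points (u, v)):

```python
import numpy as np

def estimate_homography(world_pts, image_pts):
    """Build A as in eq. 2.27 and solve A h = 0 with SVD (eq. 2.29).
    world_pts, image_pts: (n, 2) arrays with n >= 4 correspondences."""
    rows = []
    for (X, Y), (u, v) in zip(world_pts, image_pts):
        rows.append([-X, -Y, -1, 0, 0, 0, u * X, u * Y, u])
        rows.append([0, 0, 0, -X, -Y, -1, v * X, v * Y, v])
    A = np.asarray(rows, dtype=float)
    _, _, Vt = np.linalg.svd(A)        # eq. 2.29
    H = Vt[-1].reshape(3, 3)           # null vector = last right-singular vector
    return H / H[2, 2]                 # remove the free scale
```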

Intrinsic Parameters

From eq. 2.20 we know that H contains information about both the intrinsics (K) and the extrinsics (R, t); this can also be seen in eq. 2.30. The next step is therefore to extract K from H. Only the resulting equations will be presented here; for details the reader is referred to the original paper by Zhang [10].

\[
H = \frac{K}{\lambda_H} \begin{bmatrix} R & t \end{bmatrix}
\tag{2.30}
\]

A new matrix B is defined up to a scale factor λ_B.

\[
B = \lambda_B \left( K K^T \right)^{-1}
\tag{2.31}
\]

\[
v_0 = \frac{B_{12} B_{13} - B_{11} B_{23}}{B_{11} B_{22} - B_{12}^2}
\tag{2.32}
\]
\[
\lambda_B = B_{33} - \frac{B_{13}^2 + v_0 \left( B_{12} B_{13} - B_{11} B_{23} \right)}{B_{11}}
\tag{2.33}
\]
\[
f_u = \sqrt{\frac{\lambda_B}{B_{11}}}
\tag{2.34}
\]
\[
f_v = \sqrt{\frac{\lambda_B B_{11}}{B_{11} B_{22} - B_{12}^2}}
\tag{2.35}
\]
\[
f_s = -\frac{B_{12} f_u^2 f_v}{\lambda_B}
\tag{2.36}
\]
\[
u_0 = \frac{f_s v_0}{f_v} - \frac{B_{13} f_u^2}{\lambda_B}
\tag{2.37}
\]

Extrinsic Parameters

As the intrinsic matrix, K, is constant for all the homographies in the calibration, this can be used to extract the extrinsic components R and t, i.e. the rotation and translation between the two cameras. The first step is to rewrite eq. 2.31:

\[
\lambda K^{-1} \begin{bmatrix} h_1 & h_2 & h_3 \end{bmatrix}
= \begin{bmatrix} r_1 & r_2 & t \end{bmatrix}
\tag{2.38}
\]

From eq. 2.38 we get the following equations:

\[
r_1 = \lambda K^{-1} h_1
\tag{2.39}
\]
\[
r_2 = \lambda K^{-1} h_2
\tag{2.40}
\]
\[
r_3 = r_1 \times r_2
\tag{2.41}
\]
\[
t = \lambda K^{-1} h_3
\tag{2.42}
\]

where

\[
\lambda = \frac{1}{\left\| K^{-1} h_1 \right\|} = \frac{1}{\left\| K^{-1} h_2 \right\|}
\tag{2.44}
\]

Resulting in the extrinsic matrix being constructed as:

\[
\begin{bmatrix} R & t \end{bmatrix} = \begin{bmatrix} r_1 & r_2 & r_3 & t \end{bmatrix}
\tag{2.45}
\]

However, due to noise in the data this procedure results in an R that does not satisfy the properties of a rotation matrix [10]. To solve this, the reprojection error can be minimized using a solver for nonlinear optimization problems, such as the Levenberg-Marquardt algorithm. More details about this process can be found in Zhang's original paper [10].
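A sketch of eqs. 2.39-2.45 (Python/NumPy, illustrative only). As noted above, the recovered R is generally not a perfect rotation matrix and would be refined together with the other parameters in the nonlinear optimization step.

```python
import numpy as np

def extrinsics_from_homography(H, K):
    """Eqs. 2.39-2.45: recover an (approximate) [R | t] from H and K."""
    K_inv = np.linalg.inv(K)
    h1, h2, h3 = H[:, 0], H[:, 1], H[:, 2]
    lam = 1.0 / np.linalg.norm(K_inv @ h1)   # eq. 2.44
    r1 = lam * (K_inv @ h1)                  # eq. 2.39
    r2 = lam * (K_inv @ h2)                  # eq. 2.40
    r3 = np.cross(r1, r2)                    # eq. 2.41
    t = lam * (K_inv @ h3)                   # eq. 2.42
    R = np.column_stack([r1, r2, r3])        # only approximately orthonormal
    return R, t
```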


Distortion Coefficient Estimation

Lens distortion is when the shape of the lens or inaccuracies in the placement of the lens cause errors in the image. These errors cause deformation of objects, making the image not a true representation of the real world [12, p. 58]. These deformations make pixels appear in different places than the pinhole camera model expects, and therefore need to be corrected.

The most prominent distortion is known as radial distortion and causes straight lines to appear curved in the image. The distortion can be expressed as a nonlinear function. In eq. 2.46, x_d and y_d are the distorted image coordinates, r is the distance from the principal point (u_0, v_0) to (x, y), and k_n for n ∈ N are constants for each specific case.

\[
\begin{bmatrix} x_d \\ y_d \end{bmatrix}
= \left( 1 + k_1 r^2 + k_2 r^4 + k_3 r^6 \right)
\begin{bmatrix} x - u_0 \\ y - v_0 \end{bmatrix}
+ \begin{bmatrix} u_0 \\ v_0 \end{bmatrix}
\tag{2.46}
\]

Figure 2.6: Radial distortion. (a) No radial distortion. (b) Positive radial distortion. (c) Negative radial distortion. Source: OpenCV documentation.

The second form of distortion is tangential distortion. This has a much smaller impact on the image than the radial distortion and is often ignored altogether. This distortion happens when the lens and the photo sensor are not completely parallel, which results in a compressing effect on the image [13, p. 686].

\[
\begin{bmatrix} x_d \\ y_d \end{bmatrix}
=
\begin{bmatrix}
2 k_4 (x - u_0)(y - v_0) + k_5 \left( r^2 + 2 (x - u_0)^2 \right) \\
k_4 \left( r^2 + 2 (y - v_0)^2 \right) + 2 k_5 (x - u_0)(y - v_0)
\end{bmatrix}
\tag{2.47}
\]

2.4 Uncertainty in Stereo Vision Systems

As the main drawback of stereo vision compared to lidar has been poor depth estimation, it is important to evaluate the accuracy of the estimates. The accuracy of depth estimation in stereo vision systems depends on the geometric structure of the system [14]:

1. Interpixel distance, focal length, baseline, camera angle
2. The accuracy of these parameters


The first point shows that, in order to achieve the best possible results, the stereo vision setup regarding cameras, lenses and stereo parameters has to be suitable for the desired operating range. The accuracy of these parameters is ensured by a calibration of the stereo camera, which is done to estimate the intrinsic and extrinsic parameters. The error in depth is described in eq. 2.48, which is derived in [13]. The depth error ΔZ depends on the depth Z, the mean reprojection error Δp_x, the focal length f_u and the baseline B.

\[
\Delta Z_c = \frac{Z_c^2\, \Delta p_x}{f B}
\tag{2.48}
\]
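Eq. 2.48 is easy to evaluate directly; the sketch below (Python, illustrative only) uses parameter values similar to the examples in figs. 2.10 and 2.12.

```python
def depth_error(Z, f_px, B, d_err):
    """Eq. 2.48: expected depth error for depth Z [m], focal length f_px [px],
    baseline B [m] and disparity/reprojection error d_err [px]."""
    return Z ** 2 * d_err / (f_px * B)

# Example: f = 1230 px, B = 1.8 m, 0.1 px disparity error, at 100 m depth.
print(depth_error(100.0, 1230.0, 1.8, 0.1))   # ~0.45 m
```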

From eq. 2.48 and fig. 2.9 one can see that the error will be reduced by increasing the baseline.

However, increasing the baseline will reduce the amount of overlap (fig. 2.7), and therefore increase the minimum distance in front of the cameras where matching can occur.

Figure 2.7: Baseline impact on overlapping field of view. (a) Small baseline. (b) Large baseline. (c) Large baseline + angled cameras.

Angling the cameras inward will increase the common field of view between the cameras and also lower the uncertainty in the measurements, but at the cost of a lower angular resolution, making it harder to detect details in small objects. Fig. 2.8 and fig. 2.9 illustrate the changes in uncertainty. This means that the baseline has to be chosen as a compromise between the two, which is explained in chapter 6.


Figure 2.8: Uncertainty regions with different camera angles. (a) Small angle. (b) Large angle.

Figure 2.9: Uncertainty regions with different baselines. (a) Small baseline. (b) Large baseline.

In fig. 2.10 the error in depth estimation is shown for varying baselines, with all other parameters in eq. 2.48 fixed. Here it is clear that the error grows quadratically with distance, and that a larger baseline significantly improves the accuracy at long ranges.

Figure 2.10: Example of how varying baseline affects estimation error. Fixed focal length (1230 [px]) and disparity error (0.1 [px]).

The other physical parameter is the focal length, but as this is less practical to change and does not have the same range to vary within, it will not be discussed further.

The second point, regarding the accuracy of the aforementioned parameters, is achieved via calibration. Proper calibration is required as this sets the lower boundary of uncertainty in the measurements. From the results in [14] it is clear that yaw and a non-parallel photo sensor and lens are the two most important contributors to error in depth estimation, followed by pitch, roll and radial distortion. This means that a proper calibration of both intrinsic and extrinsic parameters is vital. The reprojection error is a measure of how good the calibration results are: it measures the error between a measured point and the corresponding point reprojected onto the same image plane given the parameters from the calibration. A large reprojection error indicates that the parameters provided by the calibration do not fit the real world well. The equation for the reprojection error (eq. 2.49) can be split into three parts.

• ||p_il − P_l(p_i)||² - the squared difference between a measured and a projected point in the left image plane.
• ||p_ir − P_r(p_i, R, t)||² - the squared difference between a measured and a projected point in the right image plane, given a rotation and translation.
• Σ_{i=1}^{n} - the sum of the error over all the points in an image pair.

\[
\varepsilon = \sum_{i=1}^{n} \left\| p_{il} - P_l(p_i) \right\|^2 + \left\| p_{ir} - P_r(p_i, R, t) \right\|^2
\tag{2.49}
\]
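A minimal sketch of eq. 2.49 and the per-image mean discussed below (Python/NumPy, illustrative only; the measured and reprojected points are assumed to be given as (n, 2) arrays):

```python
import numpy as np

def reprojection_error(p_l, p_r, proj_l, proj_r):
    """Eq. 2.49: summed squared error over n point pairs, plus the mean per point.
    p_l, p_r: measured pixel coordinates; proj_l, proj_r: reprojected points."""
    err = (np.sum((p_l - proj_l) ** 2, axis=1)
           + np.sum((p_r - proj_r) ** 2, axis=1))
    return float(np.sum(err)), float(np.mean(err))
```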


Figure 2.11: Illustration of reprojection error

As the reprojection error is not constant for all the points in an image, it is common to calculate the mean and use this value for the entire image. In fig. 2.12, the depth error based on eq. 2.48 is shown for rather small changes in the mean reprojection error, while the other parameters are fixed. The examples are all within the accuracy of a single pixel, so-called sub-pixel accuracy, but at longer distances the error becomes significant.

Figure 2.12: Example of how reprojection error affects estimation error. Fixed focal length (1230 [px]) and baseline (1.8 [m]).


Sub-pixel Estimation

A problem regarding estimation in stereo vision systems is that the disparity in pixel units has a physical limitation in resolution. The minimum resolution of one pixel severely affects the accuracy of depth estimation, especially for objects at large distances. The solution to overcome this limitation is sub-pixel estimation. One of the methods presented in [15] is fitting a parabola over the best match and its nearest neighbors.

Figure 2.13: Parabola fit to disparity matches. Source: [15]

The parabola is defined as in eq. 2.50, and its local extremum is found by setting its derivative equal to zero as in eq. 2.51.

\[
y = a x^2 + b x + c
\tag{2.50}
\]
\[
\frac{dy}{dx} = 2 a x + b = 0
\tag{2.51}
\]
\[
x = \frac{-b}{2a}
\tag{2.52}
\]

The values p_0, p_- and p_+ in fig. 2.13 correspond to the matching score of the best match, the score at d − 1 and the score at d + 1, respectively. Applying the pixel positions to eq. 2.50 results in:

\[
y(-1) = p_- = a - b + c, \qquad y(0) = p_0 = c, \qquad y(1) = p_+ = a + b + c
\]

Rewriting eq. 2.51, we can solve for the local extremum using these values.

\[
x = \frac{p_- - p_+}{2 (p_- + p_+) - 4 p_0}
\tag{2.53}
\]


Estimation of corresponding points with sub-pixel accuracy allows for a more precise disparity value to be calculated. The new disparity estimate can now be expressed as in eq. 2.54.

\[
d_{est} = d + x
\tag{2.54}
\]
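The sub-pixel refinement in eqs. 2.53-2.54 amounts to a couple of arithmetic operations per pixel; a minimal sketch (Python, illustrative only):

```python
def subpixel_disparity(d, p_minus, p0, p_plus):
    """Eqs. 2.53-2.54: refine an integer disparity d using the matching scores
    at d - 1 (p_minus), d (p0) and d + 1 (p_plus)."""
    denom = 2.0 * (p_minus + p_plus) - 4.0 * p0
    if denom == 0.0:
        return float(d)                # flat cost curve, no refinement possible
    return d + (p_minus - p_plus) / denom
```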

3 The Correspondence Problem

As described in section 2.2, a point in the world coordinate system can be estimated if seen from two different image views with known relative geometry. Recognizing the same point in both image views is known as the correspondence problem and is a key part of any stereo vision system. This problem can be formalized with two constraints [16]:

1. Uniqueness - each pixel has only one corresponding match.
2. Continuity - a scene is composed of piecewise continuous surfaces.

A stereo system can be divided into two cases: the simplified case and the general case. In the simplified case the two cameras are aligned and identical, creating a simple representation of a system where the epipolar lines are horizontal. This simplifies the search for correspondences, as the matching pixel is bound to be on the same image row in both images. However, in real life there are always some differences between the cameras in the system, either wanted by design, such as angled cameras, or unwanted, as there will always be some inaccuracy in the mechanical setup. This is the general case.

A system in the general case can be transformed into the simplified case by rectification. This is a process where the rigid body transformation between the cameras, acquired by stereo calibration, is applied to the images, remapping them onto a common image plane. Details about this process are described in section 2.3.


(a) Simplified stereo setup. (b) General stereo setup.

Most stereo matching algorithms either require or benefit from having rectified images as input, as this lowers the computational complexity of the problem.

Figure 3.2: Image frames before and after rectification

3.1 Disparity Map

As a result of solving the correspondence problem, the shift in pixel position between the two images is obtained. This shift in pixel position is inversely proportional to the distance to the objects in the image: features in the scene that are closer to the camera will shift more than those that are further away. By mapping the pixels and giving them a disparity value based on how much the pixel shifts between the two images, a disparity map can be created. For two arbitrary pixels in a rectified image pair, matched by solving the correspondence problem, p̃l = (ul, vl, 1)^T and p̃r = (ur, vr, 1)^T, the disparity d is expressed in eq. 3.1.


\[
u_l - u_r = d
\tag{3.1}
\]

By doing this for all the pixels in an image, a disparity map can be created. This is a new image where every pixel is described by p = (u, v, d)^T. After obtaining the disparity map, each pixel can be transformed into an actual depth by solving eq. 3.2, where f_u is the focal length and B is the baseline.

\[
Z_c = \frac{f_u B}{d}
\tag{3.2}
\]

In order to establish a 3D point with origin in the camera center for every pixel in the disparity map, eq. 3.3 can be solved. In this case u_0 and v_0 are the principal point coordinates of the camera (u_0' denoting the horizontal principal point of the right camera), and T_x is the translation of the optical center between the left and right camera along the X-axis.

\[
Q \begin{bmatrix} u \\ v \\ d \\ 1 \end{bmatrix}
=
\begin{bmatrix} X_c \\ Y_c \\ Z_c \\ W_c \end{bmatrix},
\qquad \text{where} \quad
Q =
\begin{bmatrix}
1 & 0 & 0 & -u_0 \\
0 & 1 & 0 & -v_0 \\
0 & 0 & 0 & f_u \\
0 & 0 & -\frac{1}{T_x} & \frac{u_0 - u_0'}{T_x}
\end{bmatrix}
\tag{3.3}
\]

Over the years, as computing power has increased, more and more complex methods for solving the correspondence problem have been presented. The methods are traditionally divided into two groups: local and global [17]. As indicated by their names, the local methods only consider the local area around a pixel, while the global approach takes the entire image into account when calculating the disparity for a pixel.

3.2 Local Methods - Correlation Based

Correlation-based matching methods use an area or matrix around a given pixel to describe that particular part of the image. This part of the image can then be searched for in the matching image in order to establish a match. Often the images are rectified, and epipolar lines are used as horizontal scan lines along which a patch from the origin image can be slid to search for a matching patch. Once a matching area is found, the disparity in pixel position can be extracted.

Figure 3.3: Illustration of patch sliding across scanline


These methods result in dense disparity maps and are generally computationally fast, but yield poor results in certain settings. Listed below are common terms used in the different mathematical equations describing the methods.

IL - Left stereo image
IR - Right stereo image
Wm - Matching window
d - Disparity
I(u, v) - Pixel intensity
C(d) - Cost of matching with disparity d

Sum of Absolute Differences - SAD

The Sum of Absolute Differences approach is the most common stereo matching algorithm [18]. After locating an area of interest in the origin image, a patch around that area is saved and moved along a search line on the matching image. For each pixel along the search line the sum of absolute differences is calculated.

\[
C_{SAD}(d) = \sum_{(u,v) \in W_m} \left| I_L(u,v) - I_R(u - d, v) \right|
\tag{3.4}
\]

In eq. 3.4 the absolute difference between two blocks is calculated. The disparity, d, can be extracted where the function output is lowest. There is a compromise between accuracy and execution time in varying the size of the matching block: increasing the size of the block makes the algorithm less prone to errors, but increases the execution time.
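A brute-force sketch of SAD matching for one pixel (Python/NumPy, illustrative only; it assumes rectified greyscale images and a pixel far enough from the image borders):

```python
import numpy as np

def sad_disparity(left, right, u, v, block=7, max_disp=64):
    """Eq. 3.4 evaluated along the scanline; returns the disparity with lowest cost."""
    half = block // 2
    patch = left[v - half:v + half + 1, u - half:u + half + 1].astype(np.float32)
    best_d, best_cost = 0, np.inf
    for d in range(0, min(max_disp, u - half) + 1):
        cand = right[v - half:v + half + 1,
                     u - d - half:u - d + half + 1].astype(np.float32)
        cost = np.abs(patch - cand).sum()          # eq. 3.4
        if cost < best_cost:
            best_cost, best_d = cost, d
    return best_d
```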

Sum of Squared Differences - SSD

The SSD matching method works by subtracting the values of the extracted patch pixel by pixel along a search line. In theory the matching patch is found when the SSD returns zero, but in practice the lowest return value is accepted [19].

\[
C_{SSD}(d) = \sum_{(u,v) \in W_m} \left[ I_L(u,v) - I_R(u - d, v) \right]^2
\tag{3.5}
\]

This simple principle yields a fast algorithm, but in turn, it fails at depth discontinuities as the whole patch is considered to have the same disparity. For this reason the SSD-method is not suited for images containing many or small objects.

Normalized Cross Correlation - NCC

NCC is both more complex and more robust than SSD. As the values in NCC are normalized, the method handles changes in luminosity between the cameras well [19]. In eq. 3.6, Ī is the average pixel intensity of the patch and is only calculated once for each matching operation.

\[
C_{NCC}(d) = \sum_{(u,v) \in W_m} \hat{I}_L(u,v)\, \hat{I}_R(u - d, v)
\tag{3.6}
\]

where

\[
\hat{I}(u,v) = \frac{I(u,v) - \bar{I}}{\left\| I - \bar{I} \right\|_{W_m}}
\]

Because the NCC is more complex than the SSD, it is difficult to achieve run times that are low enough for real-time operation. Further, the NCC is prone to errors at discontinuities, as are the other methods.

Rank Transform

Rank Transform is a non-parametric algorithm, meaning it does not consider the value of the pixel intensity itself, but rather its relative value. By evaluating the pixels in a patch/block around the center pixel, the rank of that patch can be determined simply as the number of pixels with a lower value [20].

\[
R(P) = \left\| \left\{ P' \in N(P) \;\middle|\; I(P') < I(P) \right\} \right\|
\tag{3.7}
\]

In practice this rank calculation can be described as in eq. 3.8.

\[
\text{Rank}(u,v) = \sum_{(i,j) \in W_m(u,v)} L(i,j),
\qquad
L(i,j) =
\begin{cases}
0 & : I(i,j) < I(u,v) \\
1 & : \text{otherwise}
\end{cases}
\tag{3.8}
\]

After the image has been rank transformed one of the methods presented earlier such as SAD can be used on the transformed images.

\[
C_{RT}(d) = \sum_{(u,v) \in W_m} \left| \text{Rank}_L(u,v) - \text{Rank}_R(u - d, v) \right|
\tag{3.9}
\]

Doing this achieves a lower run time as the rank transform lowers the complexity of the problem [21].

Census Transform

Census Transform is also a non-parametric measure of local intensity [20]. In this case the values surrounding a pixel are set to 0 or 1 based on whether they are higher (0) or lower (1) than the value of the center pixel. These values are then mapped into a bit string. This bit string can then be compared with a bit string originating from the second stereo image using the Hamming distance [20]. Minimizing the Hamming distance establishes the correct match. This method is robust when there is a large change in intensity in a few pixels.

\[
\text{Census}(u,v) = \text{Bitstring}_{(i,j) \in W_m} \left( I(i,j) \ge I(u,v) \right)
\tag{3.10}
\]
\[
C_{CT}(d) = \sum_{(u,v) \in W_m} \text{Hamming}\!\left( \text{Census}_L(u,v),\ \text{Census}_R(u - d, v) \right)
\tag{3.11}
\]
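A sketch of the census transform and Hamming comparison in eqs. 3.10-3.11 for a single pixel (Python/NumPy, illustrative only; the window is assumed to lie fully inside the image):

```python
import numpy as np

def census(img, u, v, win=5):
    """Eq. 3.10: census bit string around (u, v), packed into a Python int."""
    half = win // 2
    centre = img[v, u]
    bits = 0
    for j in range(v - half, v + half + 1):
        for i in range(u - half, u + half + 1):
            if (i, j) == (u, v):
                continue
            bits = (bits << 1) | int(img[j, i] >= centre)
    return bits

def hamming(a, b):
    """Number of differing bits between two census strings (used in eq. 3.11)."""
    return bin(a ^ b).count("1")
```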


3.3 Local Methods - Feature Based

A feature, or keypoint, is an area in an image with a certain uniqueness that can be described and stored. This uniqueness can be corners, edges, patches and other areas that stand out in the image.

A feature detector is the method that searches for these areas and stores the most prominent ones as feature descriptors. There are several types of feature descriptors. Common for all of them is that they are designed to be quite robust, meaning that when searching for a matching feature, a false positive should seldom occur. Feature detection and matching is a key element in almost every computer vision application and is often used in areas such as object recognition and structure from motion.

Scale Invariant Feature Transform - SIFT

Because good features might be located at different scales in an image, it is beneficial to search through a range of scales when identifying a feature. This is the idea behind SIFT, first presented by David G. Lowe in [22], which has since been the benchmark method regarding accuracy and robustness of visual features in computer vision. The method consists of four stages:

Scale-space Extrema Detection

Detecting keypoints in a way that is invariant to scale is done by applying Gaussian blur to the image with varying magnitude, "stacking" the blurred images on top of each other and searching for extreme points. This procedure is done for every level in a scale pyramid. The pyramid is created by consecutively lowering the image width and length; subsampling and smoothing the image pixels is done to achieve the lower resolution.

Figure 3.4: Pyramid where each level has half the width and length of the previous level, resulting in a quarter of the pixels. Source: [23, p. 132]

The scale space of an image is defined in eq. 3.12, where L(u,v,σ) is the scale space, I(u,v) is the input image, G(u,v,σ) is the Gaussian blur, and ∗ is the convolution operator.

\[
L(u,v,\sigma) = G(u,v,\sigma) * I(u,v)
\tag{3.12}
\]

The difference-of-Gaussians (DoG), D(u,v,σ), is calculated by subtracting one scale-space image from another, where a constant k separates them in scale.
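As a practical illustration (not the pipeline used in this thesis), OpenCV ships a SIFT implementation that performs the scale-space detection and description outlined above. A minimal sketch, assuming an OpenCV build with SIFT available and two hypothetical image files left.png and right.png:

```python
import cv2

img_left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)     # hypothetical files
img_right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp_l, des_l = sift.detectAndCompute(img_left, None)
kp_r, des_r = sift.detectAndCompute(img_right, None)

# Brute-force matching with Lowe's ratio test to discard ambiguous matches.
matcher = cv2.BFMatcher(cv2.NORM_L2)
matches = matcher.knnMatch(des_l, des_r, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
print(len(good), "tentative correspondences")
```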
