
Benjamin Palerud

Lidar based object detection for an autonomous race car

Master's thesis in Industrial Cybernetics
Supervisor: Edmund Førland Brekke
June 2020

Norwegian University of Science and Technology

Faculty of Information Technology and Electrical Engineering
Department of Engineering Cybernetics


Abstract

With the development of autonomous technology in recent years, autonomy has now entered the racing scene. As part of the Formula Student competition, teams compete against each other with driverless racing cars. The racing cars run completely autonomously on tracks outlined by cones, where the goal is for the vehicle to maneuver as fast as possible through the racetrack.

This sets high expectations for the detection system, which must locate and classify cones accurately in real time. Lidars for autonomous purposes are thought to be suitable for this task, as they give accurate positional data, along with intensity data that can potentially be used for classification.

In this thesis, a cone detection framework based on lidars is implemented and tested. It consists of a localization system using Euclidean clustering alongside filtering to find cone candidates.

As the method is prone to false positives, the cone candidates are 2D projected to 28×28 images, which are classified by a CNN. This captures the shape of a cone, which is thought to filter out false positives. By introducing intensity to the 2D projected images, the repeatable pattern across the height of a cone is included. This lets the CNN classify color in addition to shape.

Furthermore, a framework for lidar-lidar fusion is introduced. This consists of calibration and synchronization, with the prerequisite of ego-motion compensation.

The results show that CNNs can improve precision and do color recognition with overall accuracies of 98.9% and 97.3% on test data, for the shape classification and the color and shape classification, respectively. Based on stationary scenarios, the localization system manages to find cones up to at least 28 m. The classifiers manage to classify candidates at 15 m and 10 m, based on shape classification, and shape and color classification, respectively. It was found that the combined shape and color classifier is not as good as the shape classifier at filtering out false positives. Although the results are promising, more training data and tests on realistic tracks are necessary. The lidar-lidar fusion had a calibration error, which corrupted the results related to fusion localization. Consequently, re-calibration is necessary. Synchronization and ego-motion compensation should be tested and verified in high-speed scenarios.


Sammendrag

With the development of autonomous technology in recent years, autonomy has now entered the racing scene. As part of the Formula Student competition, teams compete against each other with driverless racing cars. The racing cars drive fully autonomously on tracks defined by cones, where the goal is for the car to maneuver through the racetrack as fast as possible. This places high demands on the detection system, which must locate and classify cones accurately, in real time. Lidars for autonomous purposes are assumed to be suited for this type of task, as they provide accurate positional data along with intensity data that can potentially be used for classification.

In this thesis, a framework for cone detection based on lidar data is implemented and tested. It consists of a localization system that uses Euclidean clustering together with filtering to find cone candidates. Since the method is prone to false positives, the cone candidates are 2D projected to 28×28 images, which are classified by a CNN. This captures the shape of a cone, which is assumed to filter out false positives. By introducing intensity to the 2D projected images, the repeatable pattern across the height of a cone is included. This lets the CNN classify color in addition to shape. Furthermore, a framework for lidar-lidar fusion is introduced. It consists of calibration and synchronization, with ego-motion compensation as a prerequisite.

The results show that CNNs can improve precision and perform color recognition with an overall accuracy of 98.9% and 97.3% on test data, for shape classification and for color and shape classification, respectively. Based on stationary scenarios, the localization system manages to find cones up to at least 28 meters. The classifiers manage to classify candidates at 15 m and 10 m, based on shape classification and on shape and color classification, respectively. It was found that the combined shape and color classifier is not as good as the shape classifier at filtering out false positives. Although the results are promising, more training data and testing on realistic tracks are needed. The lidar-lidar fusion had a calibration error, which corrupted the results related to fusion localization. Consequently, re-calibration is necessary. Synchronization and ego-motion compensation should also be tested and verified in high-speed scenarios.


Preface

This thesis concludes the author's two-year Master's degree in Industrial Cybernetics, completed at the Norwegian University of Science and Technology (NTNU). It was conducted in collaboration with Revolve NTNU, a Formula Student team based in Trondheim, Norway. I would like to thank my formal supervisor Edmund Førland Brekke, for general guidance and tips, while also giving me the opportunity to do this thesis as a part of my engagement in Revolve NTNU.

I would like to thank my co-supervisor Frank Lindseth from the Department of Computer Science, who provided the regular supervision throughout the year, for support and helpful guidance. I would also like to thank the members of my group, Perception and Navigation, for feedback, assistance, and for being great discussion partners. In addition, I would like to thank Revolve NTNU, for providing an innovative environment, and the opportunity to go from theory to practice. Finally, a thank you to friends and family, for support and proofreading.

This thesis is related to the author's specialization project, conducted during the autumn of 2019 [34]. The specialization project investigated how lidar intensity could be used for cone color recognition with machine learning methods. The project concluded that there is potential, with optimistic results using support vector machines and naive Bayes. This thesis aims to extend these results, and the experiences obtained during the project, by using intensity for color recognition in a complete detection framework. Although the specialization project and the thesis are related, they are structured as individual contributions that can be read independently of each other.

Benjamin Palerud Trondheim, June 1, 2020


Contents

Abstract i

Sammendrag ii

Preface iii

Table of Contents vii

List of Tables ix

List of Figures xii

Abbreviations xiii

1 Introduction 1

1.1 Background and Motivation . . . 1

1.2 Goals and Research Questions . . . 2

1.3 Contributions . . . 2

1.4 Report Outline . . . 3

2 Background Theory 5

2.1 Lidar . . . 5

2.2 State Estimation with INS/GNSS . . . 7

2.3 Object Detection . . . 9

2.3.1 Convolutional Neural Networks . . . 9

2.3.2 Evaluation Metrics . . . 12

3 Related Work 15

3.1 Lidar Detection . . . 15

3.2 Lidar Intensity . . . 16

3.3 Choice of Method . . . 16

4 Methodology and Implementation 19

4.1 Software, Hardware and Sensor Integration . . . 20

4.1.1 Software . . . 20

4.1.2 Integration of Sensors and Hardware . . . 21


4.3.1 The Normal Distribution Transform . . . 25

4.3.2 Implementation . . . 26

4.4 Ego-Motion Compensation . . . 26

4.4.1 Proposed Method . . . 27

4.4.2 Implementation . . . 28

4.5 Lidar-Lidar Synchronization . . . 28

4.5.1 Method . . . 28

4.5.2 Implementation . . . 29

4.6 Localization . . . 29

4.6.1 Euclidean Clustering . . . 30

4.6.2 Localization Framework . . . 30

4.7 2D Projection . . . 31

4.7.1 Evaluation of Classification Range . . . 31

4.7.2 Proposed Method . . . 32

4.8 CNN Architectures . . . 33

4.8.1 LeNet-5 . . . 33

4.8.2 Modifications to LeNet-5 . . . 34

4.8.3 Implementation and Datasets . . . 35

5 Experiments and Results 37

5.1 Experiments . . . 37

5.1.1 Case Specific Experiments . . . 37

5.1.2 Multi-Purpose Experiments . . . 38

5.2 Simulated Data . . . 40

5.3 Results . . . 41

5.3.1 Calibration . . . 41

5.3.2 Ego-Motion Compensation . . . 41

5.3.3 Localization . . . 43

5.3.4 Real and Simulated data . . . 44

5.3.5 CNNs . . . 46

5.3.6 Scenario 1 . . . 48

5.3.7 Scenario 3 . . . 53

6 Discussion 59

6.1 Lidar-Lidar Fusion . . . 59

6.1.1 Calibration . . . 59

6.1.2 Ego-Motion Compensation . . . 60

6.1.3 Synchronization . . . 60

6.2 Object Detection . . . 61

6.2.1 Localization . . . 61

6.2.2 2D Projection . . . 62

6.2.3 Classification . . . 63


7 Conclusion and Future Work 65

7.1 Conclusion . . . 65

7.2 Future Work . . . 66

7.2.1 Lidar-Lidar Fusion . . . 66

7.2.2 Lidar Detection . . . 66

7.2.3 Camera-Lidar Fusion . . . 67

Bibliography 68

A Raw Clouds 75

A.1 Stand alone and merged point clouds . . . 75

A.2 Point densities at different distances . . . 76

B Real and Simulated data 77

B.1 2D projection . . . 77

B.2 Simulated data . . . 79

C Scenarios 81

C.1 Scenario 1 . . . 82

C.1.1 Localization . . . 82

C.1.2 Classification . . . 83

C.2 Scenario 2 . . . 84

C.2.1 Localization . . . 84

C.2.2 Classification . . . 88

C.3 Scenario 3 . . . 90

C.3.1 Localization . . . 90

C.3.2 Classification . . . 92


List of Tables

2.1 Algorithm for discrete-time Extended Kalman Filter (EKF) . . . 8

2.2 Confusion matrix . . . 13

4.1 Hesai Pandar40 [57] and Hesai Pandar20B [56] specifications . . . 22

4.2 VN-300 Specifications . . . 22

4.3 Relevant notations . . . 25

4.4 Tuning parameters . . . 31

4.5 Maximum classification distance for various lidar configurations . . . 32

4.6 CNN architecture . . . 35

5.1 Tuning parameters . . . 44

5.2 Confusion matrix: Shape classifier . . . 46

5.3 Precision, recall and overall accuracy for the shape classifier . . . 46

5.4 Confusion matrix: Shape and color classifier . . . 46

5.5 Precision, recall and overall accuracy for the shape and color classifier . . . 47

5.6 Scenario 1: Average processing time for each method . . . 48

C.1 Pandar20 localization data . . . 82

C.2 Pandar40 localization data . . . 82

C.3 Fusion localization data . . . 83

C.4 Shape classification data . . . 83

C.5 Color and shape classification data . . . 84

C.6 Pandar20B localization data . . . 84

C.7 Pandar40 localization data . . . 88

C.8 Fusion localization data . . . 88

C.9 Shape classification data . . . 88

C.10 Color and shape classification . . . 90

C.11 P20 localization data . . . 90

C.12 P40 localization data . . . 91

C.13 Fusion localization data . . . 91

C.14 Shape classification . . . 92

C.15 Color and shape classification . . . 93


List of Figures

1.1 The cones that define the border of a track . . . 2

2.1 Lidar coordinate system . . . 6

2.2 Example of INS/GNSS integration . . . 8

2.3 Illustrations of neural networks . . . 9

2.4 The result of a 2×2 kernel convolution performed on a 4×4 matrix . . . 11

2.5 2×2 max pooling performed on a 4×4 grid . . . 11

2.6 Example of a Convolutional Neural Network (CNN) with two convolutional (Conv) layers, two pooling layers and fully connected layers (FC, Output) . . . 12

4.1 Frameworks . . . 19

4.2 Yellow cone with measurements . . . 21

4.3 Render of Atmos . . . 22

4.4 Connection chart . . . 23

4.5 Frames on the vehicle . . . 24

4.6 Roll, pitch and yaw in body frame, adapted from [2] . . . 24

4.7 Geometric illustration for ego-motion compensation . . . 28

4.8 Transformation from synchronization . . . 29

4.9 Localization flowchart . . . 31

4.10 Illustration of vertical resolution and classification range for different lidar configurations . . . 32

4.11 Illustration of 2D projection . . . 33

4.12 Architecture of LeNet-5, adapted from [28] . . . 34

5.1 Images from case specific and multi-purpose experiments . . . 39

5.2 Examples of cones with impurities . . . 39

5.3 The average intensity image from a subset of gathered data . . . 41

5.4 Effect of calibrations from different views . . . 42

5.5 Effect of ego-motion compensation. Blue cloud: corrected, red cloud: uncorrected. . . . 42

5.6 Different steps in localization . . . 43

5.7 Effect of outlier removal on a vehicle . . . 43

5.8 Examples of 2D projection of real data . . . 45

5.9 Examples of 2D projection of simulated data . . . 45

5.10 The effect of simulated data on the shape classifier . . . 47


5.13 Scenario 1: P40 localization . . . 50

5.14 Scenario 1: Fusion localization . . . 51

5.15 Side view of ID:19 in fusion localization . . . 51

5.16 Scenario 1: Classification . . . 52

5.17 Scenario 3: P20 localization . . . 54

5.18 Scenario 3: P40 localization . . . 55

5.19 Scenario 3: Fusion localization . . . 56

5.20 Scenario 3: Classification . . . 57

A.1 Stand alone and merged lidar point clouds . . . 75

A.2 Cones at different distances . . . 76

B.1 Examples of 2D projection of blue cones . . . 77

B.2 Examples of 2D projection of yellow cones . . . 78

B.3 Examples of 2D projection of not cones with intensity . . . 78

B.4 Examples of simulated cones . . . 79

C.1 Scenario 2: P20 localization . . . 85

C.2 Scenario 2: P40 localization . . . 86

C.3 Scenario 2: Fusion localization . . . 87

C.4 Scenario 2: Classification . . . 89


Abbreviations

CNN    Convolutional Neural Network
ANN    Artificial Neural Network
ReLU   Rectified Linear Unit
GNSS   Global Navigation Satellite Systems
INS    Inertial Navigation System
IMU    Inertial Measurement Unit
KF     Kalman Filter
EKF    Extended Kalman Filter
ROS    Robot Operating System
Lidar  Light Detection and Ranging
PCL    Point Cloud Library
CG     Center of Gravity
ECEF   Earth Centered, Earth Fixed
ICP    Iterative Closest Point
NDT    Normal Distribution Transform
RBF    Radial Basis Function


Chapter 1

Introduction

1.1 Background and Motivation

The development of autonomous vehicles covers several disciplines related to perception, self-awareness, planning, control, and physical hardware, to name some. It is a field under heavy research, where new and exciting findings are frequently emerging. With higher demand and expectations for quality data, better and cheaper technologies are also emerging, which in turn offer new opportunities regarding methods using this technology. This can be seen in the development of processing power and GPUs, which in turn allows for better performing machine learning methods, or in sensor technology with, for instance, lidars and cameras developed for autonomy.

Autonomy is conquering vehicles on land, at sea and in the air, which all contribute different needs and expectations. Within the world of land vehicles, autonomy has emerged in the Formula Student competition in recent years. In the Formula Student competition, student teams compete with formula-style racing cars. Racing cars have different prerequisites compared to normal cars driving on roads, where performance, agility and speed are keywords that describe expectations. Nevertheless, aspects regarding safety, robust performance and good perception are just as important in a race car as in other autonomous vehicles.

One team competing in the Formula Student competition is Revolve NTNU. Revolve NTNU is a voluntary student organization that every year participates in the Formula Student competition with in-house developed racing cars, one electric car for a driver, and one fully autonomous. The autonomous race car is subjected to various events, including acceleration, skidpad, autocross and trackdrive. All events must be run autonomously on tracks defined by cones. This is where the theme for this thesis arises, namely object detection of cones. Due to the lidar's accurate positional information and the possibility to use intensity data, it is believed that methods using this sensor can make positive contributions to the detection of cones.


1.2 Goals and Research Questions

The thesis is one of many contributions to the Revolve NTNU team, where the governing goal is to develop a cone detection framework. There already exists a detection framework using lidars at Revolve NTNU, but this is designed for lower velocity scenarios, without providing information about the class of a cone. It is based on evaluating the geometric size of candidates, and is therefore subject to false positives, as it may consider anything within the given constraints as a potential cone. The fact is that there are several types of cones that define the track, where blue and yellow cones are used to indicate the borders, illustrated in figure 1.1.

(a) Blue cone (b) Yellow cone

Figure 1.1: The cones that define the border of a track

Based on the state of the current localization framework and optimistic results obtained regarding cone color classification from the author’s specialization project [34], this thesis will aim to look at ways to improve the overall detection framework. The thesis will build upon the existing framework, which needs to be adapted to new lidars, be tuned and evaluated. Secondly, new classes will be introduced that aim to improve precision by reducing false positives and adding cone color information to better understand the track.

Since Revolve NTNU currently has two lidars, the thesis will also look into the possibility of merging the data from the two lidars in a lidar-lidar fusion framework, to see if it contributes positively to the localization accuracy. In summary, these aspects can be formulated as the following research questions (RQs) that will be addressed in this thesis:

RQ1: Does lidar-lidar fusion contribute to better object localization compared to a single lidar, or a combination of two lidars with their own localization framework?

RQ2: Can classification with CNN enhance the precision of the localization method, and can it contribute with color recognition of cones?

RQ3: Can simulated data be advantageous with the lack of training data?

RQ4: How does the suggested detection framework perform on realistic tracks?

1.3 Contributions

The contributions from the thesis range from practical implementation and testing of an already existing localization framework to the development and suggestion of classification methods. Through working with these concepts, experience and various datasets have been acquired, which can be used for future work. The contributions can be summarized as follows:



• A framework for lidar-lidar fusion, consisting of calibration and synchronization between two lidars.

• Integration of cone localization using filtering and clustering on lidar data, which is adaptable for several lidar configurations.

• Implementation of a classification framework for cone candidates using CNN on images created by 2D projections and intensity data from the lidar.

• Datasets that can be used to train and validate classification methods for cone color and cone shape recognition. The raw data is stored such that other variations and methods besides those suggested in this thesis can also be tested.

• Datasets from experiments which are used for evaluation and verification. These can be used for future development, as the data was recorded with multiple sensors (lidars, cameras and INS).

1.4 Report Outline

Chapter 1: Introduction introduces the project in a broader sense. It presents the overall goal governing the thesis along with relevant research questions that the project aims to answer. Finally, a list of contributions from the thesis is presented.

Chapter 2: Background Theory explains the theory behind the lidar, followed by an introduction to state estimation with INS and GNSS. Classification with CNNs is explained, with notes on evaluation methods for object detection. The chapter intends to motivate how the lidar can be used for positional data along with intensity extraction, and to give an introduction to concepts used later in the thesis.

Chapter 3: Related Work introduces some relevant work done in the field of lidar detection and explores how lidar intensity data is used in other projects. Based on the relevant work, the method of choice is motivated.

Chapter 4: Methodology and Implementation introduces the hardware and software used during the project along with the practical integration of sensors. Thereafter, the methods used in each framework are presented, along with aspects of implementation and modifications to the methods.

Chapter 5: Experiments and Results introduces the conducted experiments and gives the results attained by using the implemented frameworks from chapter 4.

Chapter 6: Discussion evaluates the results with emphasis on benefits and drawbacks of the methods, within the limitations of the experiments.

Chapter 7: Conclusion and Future Work concludes the project by summarizing what is achieved and gives answers to the research questions. It gives a reflection on potential future work related directly to this thesis, along with notes on camera-lidar fusion.


Chapter 2

Background Theory

In the following chapter, theoretical concepts regarding lidar and relevant methods in object detection will be introduced. The Inertial Navigation System (INS) is presented with the intention of showing how positional and velocity data can be acquired through state estimation. The intention of the chapter is to introduce how the lidar works in terms of detection and intensity extraction, in addition to providing a theoretical basis for the concepts introduced in the next chapters.

2.1 Lidar

The following section about lidar theory is based on section 2.1 in the author's specialization project "Pattern recognition for cone color classification using lidar intensity" [34].

Light Detection and Ranging (Lidar) is an optical remote sensing sensor that transmits light and measures the properties of the returned light which is reflected from objects [51]. The lidar is used in a variety of applications, including atmospheric surveillance, geomatics and autonomy. Although the principles behind the lidars are similar for many of the applications, the following theory will be directed toward the lidars used for autonomous vehicles.

There are several types of lidars; in this project, two 360°, 3D multi-laser lidars are used. This means that the lidar has a horizontal field of view of 360° and receives data in three dimensions. It consists of several pairs of laser emitters and receivers that are attached to a rotating motor inside the lidar. As the motor spins, each laser diode emits a short laser pulse which travels through the surroundings. Upon contact with an object, diffuse reflection of the laser beam occurs. The reflected beams are detected by optical sensors in the lidar. From this, the distance to the object can be calculated by the time-of-flight formula, using the time t between emission and reception, and the speed of light c [57]. The laser beam travels twice the distance to the object in the time t, which means that the resulting formula relating distance, time and speed becomes

d = \frac{1}{2} c t.    (2.1)

The information attained from the lidar is typically the x, y and z coordinates along with the range, reflection and intensity for a given point. The channel number, i.e. which vertical laser emitted and received the signal, is also usually given. The lidar generates this information through the optical detector. The total optical power received at the detector is given by

P_R = P_{sc} \frac{A_{rec}}{R^2} \eta_{atm} \eta_{sys}    (2.2)

where P_{sc} = \rho_\pi I_t A_t is the power per steradian backscattered by the target into the direction of the receiver. Here, \rho_\pi is the target reflection per steradian in the backward direction, I_t is the intensity of the transmitted light at the target location, and A_t is the area of the target. Furthermore, R is the range, A_{rec}/R^2 is the solid angle of the receiver with respect to the target, \eta_{atm} is the atmospheric transmission, and \eta_{sys} is the efficiency of the system [31]. Thus, using equation 2.1 and equation 2.2, one can obtain information regarding range, intensity and reflection. The reflection and intensity vary with the composition of the surface the laser is reflected off. This is how patterns can be recognized using lidar intensity, as darker colored objects absorb more of the light compared to lighter colored objects. The wavelength used in the lidars for this project is 905 nm. This is on the lower spectrum of infrared light, and the reasoning behind this particular wavelength regards safety and water absorption, to name some [58].

The lidar has its own coordinate system, which is defined by the manufacturer. The coordinate system is defined in spherical coordinates, with elevation ω, azimuth α and range R. The angles are illustrated in figure 2.1. This means that for a single point, the lidar keeps track of the azimuth and elevation of the point, and with the range calculated from equation 2.1, one can attain the x, y and z coordinates using the following transformations

x = R cos(ω) sin(α)    (2.3a)
y = R cos(ω) cos(α)    (2.3b)
z = R sin(ω)    (2.3c)

Figure 2.1: Lidar coordinate system
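As a small illustration of the conversion in equations 2.3a-2.3c, the sketch below converts a lidar return from spherical to Cartesian coordinates with NumPy. The function name and array layout are chosen for this example only and are not taken from the thesis code.

```python
import numpy as np

def spherical_to_cartesian(r, elevation, azimuth):
    """Convert lidar measurements (R, omega, alpha), with angles in radians,
    to Cartesian x, y, z following equations 2.3a-2.3c."""
    x = r * np.cos(elevation) * np.sin(azimuth)
    y = r * np.cos(elevation) * np.cos(azimuth)
    z = r * np.sin(elevation)
    return np.stack([x, y, z], axis=-1)

# Example: a single return at 10 m range, 2 deg elevation, 30 deg azimuth.
point = spherical_to_cartesian(10.0, np.deg2rad(2.0), np.deg2rad(30.0))
```

Because the inputs may be arrays, the same function can convert a whole scan of ranges, elevations and azimuths in one vectorized call.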

If the lidar rotates at a frequency f, while scanning at a horizontal resolution of α_0 with N pairs of emitters and receivers spaced vertically, the lidar receives up to N × 360/α_0 data points every 1/f s. One full rotation generates a point cloud that gives information in 360°, which can be represented by

P = \begin{bmatrix} p_{11} & p_{12} & \dots & p_{1j} \\ p_{21} & p_{22} & \dots & p_{2j} \\ \vdots & \vdots & \ddots & \vdots \\ p_{i1} & p_{i2} & \dots & p_{ij} \end{bmatrix},    (2.4)

where i denotes the index of the vertical laser, and j denotes the azimuth index, corresponding to the azimuth α_0 × j. The point p_{ij} gives information about the coordinates x, y, z as well as intensity and range. Which vertical laser a given point p_{ij} relates to can also be given, and is referred to as the channel number or ring number.

2.2 State Estimation with INS/GNSS

For autonomous vehicles, an important aspect is knowing the current position and velocity. This information can, for instance, be used as a tool for locating the vehicle globally, and for placing landmarks seen by a detection system globally. Most autonomous vehicles contain systems to provide this type of information, which has resulted in state estimation being integrated into methods used in other systems; for this project, that relates to ego-motion compensation and synchronization, which are discussed in chapter 4. The process of obtaining this information will be referred to as state estimation. In the following sections, state estimation based on an INS combined with GNSS antennas will be explained, based on theory from T. H. Bryne and T. I. Fossen [10].

An INS is a sensor package that consists of gyroscopes and accelerometers, which measure the angular rate and acceleration along with sensor errors and noise. When combined in a compact form with one sensor type for each axis of motion, they are known as Inertial Measurement Units (IMU). The INS can also be equipped with a barometric pressure sensor and a magnetometer to provide additional measurements related to altitude and heading. The problem with INSs is that they will drift due to the sensor errors, which is why they can beneficially be coupled with external navigation aids such as Global Navigation Satellite Systems (GNSS).

The GNSS uses satellites to provide position measurements. By using Doppler measurements it can also provide velocity estimates, and if a dual antenna system is used, an estimate of heading can be provided. GNSS receivers normally provide measurements at 1-5 Hz, compared to the INS, which provides measurements at a higher rate of 100-1000 Hz. The GNSS and the INS have a range of complementary features that make them suitable for coupling, for instance low bandwidth, good long-term accuracy and poor short-term accuracy from the GNSS, and high bandwidth, poor long-term accuracy and good short-term accuracy from the INS.

There are several frameworks for combining INS with GNSS, where the goal is to couple them such that an output stream with the most accurate estimates can be obtained. This is often implemented with one or several filters, where versions of the Kalman Filter (KF) are popular choices. For this project, the INS uses a built-in EKF to combine accelerometer, gyro and GNSS measurements to estimate the attitude, the position and the velocity, along with sensor biases. The EKF is based on a nonlinear discrete-time model on the form

x[k+1] = f(x[k]) + B_d[k] u[k] + E_d[k] w[k]    (2.5a)
y[k] = h(x[k]) + D_d[k] u[k] + \varepsilon[k],    (2.5b)

where f and h are the nonlinear terms, and w, \varepsilon are the process and measurement noise vectors. By linearizing f at each sample, the matrix A_d can be attained, which combined with B_d and E_d describes the process model in discrete time. When h is linearized, C_d is attained, which combined with D_d describes the measurement model. The linearization at the sample x[k] = \hat{x}[k] is given as

A_d[k] = \frac{\partial f(x[k])}{\partial x[k]} \Big|_{x[k]=\hat{x}[k]}    (2.6a)

C_d[k] = \frac{\partial h(x[k])}{\partial x[k]} \Big|_{x[k]=\hat{x}[k]}.    (2.6b)

Once the model is defined, the predicted states at time step k can be calculated step-wise according to the algorithm defined in table 2.1. For further examples and versions of KFs related to this topic, the reader is referred to [10]. The flowchart for combining INS and GNSS may look something like figure 2.2. In the flowchart, the IMU provides angular velocities w^b_{IMU} and specific forces f^b_{IMU}. If the GNSS is tightly coupled, the raw pseudorange measurements y_i and v_i can be fused directly into the INS filter, or if it is loosely coupled, the raw measurements are first passed through a GNSS EKF, which results in position p^e_{GNSS} and velocities v^e_{GNSS} that are provided to the INS filter. Note that this can vary between different integrations and can be extended with other sensors such as magnetometers, dual GNSS etc.

Design matrices:         Q[k] = Q^T[k] > 0, R[k] > 0
Initial conditions:      \hat{x}[0] = x_0
                         \hat{P}[0] = E[(x[0] - \hat{x}[0])(x[0] - \hat{x}[0])^T] = P_0
Kalman gain matrix:      K[k] = \hat{P}[k] C_d^T[k] (C_d[k] \hat{P}[k] C_d^T[k] + R[k])^{-1}
State corrector:         \hat{x}^+[k] = \hat{x}[k] + K[k](y[k] - h(\hat{x}[k]) - D_d[k]u[k])
Covariance corrector:    \hat{P}^+[k] = (I - K[k]C_d[k]) \hat{P}[k] (I - K[k]C_d[k])^T + K[k]R[k]K^T[k]
State prediction:        \hat{x}[k+1] = f(\hat{x}^+[k]) + B_d[k]u[k]
Covariance prediction:   \hat{P}[k+1] = A_d[k] \hat{P}^+[k] A_d^T[k] + E_d[k]Q[k]E_d^T[k]

Table 2.1: Algorithm for discrete-time EKF

Figure 2.2: Example of INS/GNSS integration
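The corrector and predictor steps of table 2.1 map directly to code. The minimal NumPy sketch below assumes the user supplies the nonlinear functions f, h and their Jacobians; it is only an illustration of the algorithm in the table, not the filter running on the vehicle, and all names are placeholders.

```python
import numpy as np

def ekf_step(x_hat, P_hat, y, u, f, h, jac_f, jac_h, B_d, D_d, E_d, Q, R):
    """One correction + prediction cycle of the discrete-time EKF in table 2.1."""
    C_d = jac_h(x_hat)                          # linearize measurement model (eq. 2.6b)
    S = C_d @ P_hat @ C_d.T + R                 # innovation covariance
    K = P_hat @ C_d.T @ np.linalg.inv(S)        # Kalman gain
    # State and covariance correctors
    x_plus = x_hat + K @ (y - h(x_hat) - D_d @ u)
    I_KC = np.eye(len(x_hat)) - K @ C_d
    P_plus = I_KC @ P_hat @ I_KC.T + K @ R @ K.T
    # State and covariance predictions
    A_d = jac_f(x_plus)                         # linearize process model (eq. 2.6a)
    x_next = f(x_plus) + B_d @ u
    P_next = A_d @ P_plus @ A_d.T + E_d @ Q @ E_d.T
    return x_next, P_next
```

The Joseph-form covariance corrector is used here because it matches the expression in table 2.1 and tends to keep the covariance symmetric in practice.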



2.3 Object Detection

Object detection refers to techniques for locating instances of objects in data and classifying them. The techniques vary from different machine learning methods to more deterministic methods. Methods that use deep learning for object detection in images have shown great potential [46], but with the advances of lidars for autonomous purposes, the development of similar techniques in the point cloud has had its breakthrough. In the following sections, convolutional neural networks and relevant evaluation metrics will be explained.

2.3.1 Convolutional Neural Networks

CNNs are deep learning models widely used in image classification. They began their advance in 1989 with "Backpropagation applied to handwritten zip code recognition" [27] by Yann LeCun et al., with further developments in "Gradient-based learning applied to document recognition" in 1998 [28]. In the same paper, LeCun et al. introduce the MNIST dataset [29], which shows the potential of CNNs for classifying handwritten digits. The potential was further emphasised and popularized in 2012 by Krizhevsky et al., who won the ImageNet 2012 classification benchmark with their convolutional neural network, described in [24]. With the development of datasets, algorithms, and better hardware, several methods based on CNNs have made their breakthrough, for instance R-CNN [17], Fast R-CNN [44] and Faster R-CNN [43] for detection in images, or SECOND [59] and PointRCNN [53] for classification in lidar data.

The basis of a CNN is a neural network, which is built up of neurons that are connected to each other in layers. This will be explained in more detail in the next section, based on theory adapted from "Pattern Recognition and Machine Learning" by C. M. Bishop [9]. Next, some of the key characteristics of CNNs will be explained, based on theory from Goodfellow et al. [18].

Artificial Neural Network

The Artificial Neural Network (ANN) is a network that connects artificial neurons together in a system. The neurons take an input and produce an output that is fed forward in the network. They are arranged in layers, where the layers between the input and output layers are referred to as hidden layers; this is illustrated in figure 2.3a.

Figure 2.3: Illustrations of neural networks. (a) Basic neural network with two hidden layers. (b) Example of a simple neural network.


To explain the flow of information in a neural network, an example of a simple neural network will be used. It consists of an input layer with N variables that can be denoted x_1, ..., x_N, a hidden layer with M neurons that can be denoted z_1, ..., z_M, and an output layer y_1, ..., y_K with K outputs. This is illustrated in figure 2.3b. The neurons in the hidden layer can be described with the activations a_m, using M linear combinations of the input variables on the form

a_m = \sum_{i=1}^{N} w_{mi}^{(1)} x_i + w_{m0}^{(1)}    (2.7)

where m = 1, ..., M, w_{mi}^{(1)} is the weight and w_{m0}^{(1)} is the bias. The superscript (1) indicates that the activations are in the first layer. Each activation is transformed with an activation function to form the output

z_m = h(a_m).    (2.8)

A Rectified Linear Unit (ReLU) is a popular choice for the activation function in the hidden layers, and is given by

z_m = max(0, a_m).    (2.9)

The outputs in the second and final layer can be described by K linear combinations on the form

a_k = \sum_{i=1}^{M} w_{ki}^{(2)} z_i + w_{k0}^{(2)}    (2.10)

where k = 1, ..., K. For the output layer, a typical activation function in classification tasks is the softmax function, which is given as

y_k = \frac{e^{a_k}}{\sum_{k=1}^{K} e^{a_k}}.    (2.11)

When equation 2.7 and equation 2.10 are combined, the result is a model with weights and biases, which is trained using inputs x_n and outputs y_k. The process of training the network consists of finding the correct weights and biases that lead to the inputs being classified with the correct output. This is achieved by training the network on inputs where the class is known. The goal is to minimize the loss function, which is also referred to as the error function. This is done by using an optimizer such as gradient descent to adjust the weights and biases with backpropagation.
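To make equations 2.7-2.11 concrete, the sketch below evaluates the forward pass of the small network in figure 2.3b using NumPy, with ReLU in the hidden layer and softmax at the output. The weights are randomly initialized and the layer sizes are arbitrary; this is an illustration only, not the network used later in the thesis.

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    """Forward pass of a two-layer network: ReLU hidden layer (eq. 2.7-2.9)
    followed by a softmax output layer (eq. 2.10-2.11)."""
    a_hidden = W1 @ x + b1             # eq. 2.7: linear combinations plus bias
    z = np.maximum(0.0, a_hidden)      # eq. 2.9: ReLU activation
    a_out = W2 @ z + b2                # eq. 2.10: output-layer activations
    e = np.exp(a_out - a_out.max())    # subtract max for numerical stability
    return e / e.sum()                 # eq. 2.11: softmax class probabilities

rng = np.random.default_rng(0)
N, M, K = 4, 8, 3                      # inputs, hidden neurons, classes (placeholders)
y = forward(rng.normal(size=N),
            rng.normal(size=(M, N)), np.zeros(M),
            rng.normal(size=(K, M)), np.zeros(K))
```

Training would then adjust W1, b1, W2, b2 by backpropagating the gradient of a loss function through these same operations.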

CNN

A CNN is a type of neural network that specializes in data that has a grid-like topology. Examples of this can be time series, which represent a one-dimensional grid, or a grayscale image, representing a two-dimensional grid. The input in the case of an image with n×m pixels would be an array of size n×m×1 for a grayscale image, or n×m×3 for an RGB image. Like a conventional ANN, a CNN consists of neurons structured in interconnected layers. What sets them apart is that a CNN consists of one or more convolutional layers, and that pooling is performed to downsample the number of neurons in the network.

In traditional neural networks each input unit interacts with each output unit, which can be both unnecessary and expensive. CNNs, on the other hand, use sparse interactions. This allows the network to detect meaningful features in subsections of the image, which reduces memory requirements and improves its efficiency. This is accomplished by convolving the image with a square matrix with given weights, referred to as a kernel. A convolutional layer is made up of a set of kernels that convolve the image and return the dot product of the weights in the kernel and the pixels in the image. An activation function layer is then applied, which performs an element-wise activation function, such as ReLU. This creates an activation map that represents features detected with the kernel. Throughout the network, different kernels are applied to extract information about features in the image, for instance edges and round shapes. Combined, these activation maps represent characteristics of an image. The result of convolution with a 2×2 diagonal kernel is illustrated in figure 2.4. The input is a 4×4 matrix, while the output is a 3×3 matrix. To maintain the size of the input, zero padding can be applied; this process adds zero elements around the input.

Figure 2.4: The result of a 2×2 kernel convolution performed on a 4×4 matrix

Pooling layers are applied to downsample the spatial dimensions of the input. This is accomplished by applying filters that summarize the statistics of nearby values. The filters are usually smaller than the input, and are applied with a stride, which defines how the filter is moved across the input. A popular pooling function is the max pooling operation, which outputs the maximum value within a defined neighborhood. This is illustrated in figure 2.5, where a 2×2 filter is used to transform a 4×4 input to a 2×2 output using max pooling with a stride of 2. Pooling helps make the representation more invariant to small translations of the input, meaning that if the input is slightly translated, max pooling still outputs the same value. This is because pooling is interested in the value itself, not the placement of the value within a neighborhood. Pooling also provides an abstracted form of the representation, which helps against overfitting. Thirdly, it reduces the size of the input, which results in lower computational cost.

Figure 2.5: 2×2 max pooling performed on a 4×4 grid
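A direct, unoptimized NumPy sketch of the two operations in figures 2.4 and 2.5 is given below: a valid 2×2 kernel convolution and 2×2 max pooling with stride 2. Real frameworks implement these far more efficiently; the function names and the example input are illustrative only.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image and return the map of dot products
    (no zero padding), as illustrated in figure 2.4."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2, as illustrated in figure 2.5."""
    h, w = feature_map.shape
    cropped = feature_map[:h - h % 2, :w - w % 2]
    return cropped.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

image = np.arange(16, dtype=float).reshape(4, 4)
features = conv2d_valid(image, np.array([[1.0, 0.0], [0.0, 1.0]]))  # 3x3 output
pooled = max_pool_2x2(image)                                        # 2x2 output
```

The 4×4 input shrinking to a 3×3 activation map, and the 4×4 grid pooling down to 2×2, matches the sizes discussed in the text.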


Figure 2.6 illustrates an example of a typical architecture of a CNN. The network consists of an input, for example an RGB image, two convolutional layers and two pooling layers. The last layer is a fully connected layer, which is similar to the layers in a traditional neural network.

Here, each neuron is connected to all the neurons in the previous layer. There can be several fully connected layers in a CNN. The last fully connected layer uses an activation function to output the class score, for instance the softmax function.

Figure 2.6: Example of a CNN with two convolutional (Conv) layers, two pooling layers and fully connected layers (FC, Output)

2.3.2 Evaluation Metrics

In order to evaluate the performance of the methods used in object detection, a set of evaluation metrics is introduced. For this project, they are divided into metrics to evaluate object localization and metrics to evaluate classification.

For an object to be successfully localized in 3D space, it needs to be seen and placed by the localization algorithm, and the placement should be accurate, without deviations. Recall is a metric that can be used to analyze the ability to extract the true positive objects from a scene, and is defined as

Recall = \frac{tp}{tp + fn}.    (2.12)

Here, tp stands for the number of true positive samples, and fn stands for the number of false negative samples. If the recall is high, the localization manages to find most of the objects in each callback. Standard deviation and variance are metrics that can be used to check whether the physical placement of the object is similar for each callback. The sample variance and standard deviation are defined from

s^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2,    (2.13)

where s is the standard deviation and s^2 is the variance. The sample size is given by N, while the sample mean is given by \bar{x}, defined as the mean of the set of samples x_i.

Relevant metrics to evaluate the performance of a classifier are accuracy, recall and precision. The accuracy is defined as the fraction of correct predictions, and is given by

a = \frac{n_{correct}}{n_{samples}}.    (2.14)

The recall is defined in equation 2.12, and states whether the classification manages to classify all the true positive objects. The precision says something about the ability to not classify a sample as positive when it is negative. Precision is given as

Precision = \frac{tp}{tp + fp},    (2.15)

where fp is the number of false positive samples. The weighted average of precision and recall is called the F1 score, and is defined by

F1 = 2 \cdot \frac{precision \cdot recall}{precision + recall}.    (2.16)

For recall, precision and F1 the best value is 1, while the worst score is 0. Another way of evaluating a classifier is to look at the confusion table. The confusion table summarizes the predictions made on an evaluation set, and is presented as illustrated in table 2.2 for a binary classification problem. TP, TN, FN and FP can be given as fractions of the total evaluation set or as numbers of samples. It can also be extended to multi-class problems.

                            Predicted label
                            Class 1    Class 2
True label    Class 1       TP         FN
              Class 2       FP         TN

Table 2.2: Confusion matrix
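The metrics in equations 2.12 and 2.14-2.16 follow directly from the confusion-matrix counts. A minimal helper is sketched below; the function name and example counts are illustrative only.

```python
def detection_metrics(tp, fp, fn, tn):
    """Recall, precision, F1 and accuracy from confusion-matrix counts
    (equations 2.12, 2.15, 2.16 and 2.14)."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return recall, precision, f1, accuracy

# Example: 90 true positives, 5 false positives, 10 misses, 95 true negatives.
print(detection_metrics(tp=90, fp=5, fn=10, tn=95))
```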


Chapter 3

Related Work

With the advancement in autonomous perception, and seeing the potential of lidar for this field, several methods for object detection based on point cloud data have emerged. The field of autonomy is constantly evolving, which means that new methods are regularly arising. In the following sections, related work regarding object detection will be introduced. This includes methods based on neural networks, clustering and the use of intensity and reflection to enhance classification. Inspired by the findings from investigating related work, and with the goal for this project in mind, a method to proceed with will be chosen.

3.1 Lidar Detection

There are several methods based on using CNNs to do object detection in the point cloud. As mentioned in section 2.3.1, CNNs specialize in grid-like topology. The 3D scenes captured by the lidars used in this project are in the form of sparse and irregular point clouds. This gives rise to the problem of structuring and representing the 3D data in a good and efficient manner. To cope with this problem, grid-based methods have been developed. The methods generally consist of representing the data as 3D voxels, as for instance in [64, 59, 54], or based on 2D maps such as in [11, 60, 33, 62]. The 2D maps can include both bird's eye view and front view to perceive the environment. These methods can be processed by 3D or 2D CNNs to extract features for 3D detection. There also exist point-based methods such as [36, 53], which are based on PointNet [37]. These methods directly extract features from the point cloud for detection. Grid-based methods are generally more computationally efficient, but are more prone to information loss compared to point-based methods. PV-RCNN [52] is a method utilizing features from both grid-based methods and point-based methods.

Clustering is a technique of structuring a finite data set into a set of clusters. The data in a cluster has similar structure, defined by the method used. By evaluating the geometric properties and the number of points in a cluster, it is possible to extract valuable information from them. This information can be used to do object localization in the point cloud, for instance if the geometric constraints of an object are known. There exist several techniques for clustering, where DBSCAN [40] and OPTICS [4] are some notable methods. In recent years, a cone detection technique has been developed at Revolve NTNU. It uses Euclidean clustering with certain conditions together with a series of filtering techniques to extract cone candidates. It utilizes the geometric properties of a cone to determine if a given cluster is a cone candidate. It has quite fast processing time, but is prone to false positives.

3.2 Lidar Intensity

Some of the key attributes of a lidar are that it can provide accurate distance measurements, with an accuracy of a few centimeters. Most of the detection methods mentioned in the previous sections rely only on the positional data of the points, but this may leave potential unused. Many lidars provide intensity and reflectivity information that could be used to enhance the classification. It is not widely used, but the potential is recognized, as seen in the paper by Scaioni et al. [49]. The paper concludes that using intensity data has the potential to improve classification for certain areas, mostly with examples for airborne use. Examples of such methods are given in [38, 26].

A case in which reflection data is used to improve classification for autonomous perception is presented in the paper by L. Zhou and Z. Deng [63]. They use a linear support vector machine to classify traffic signs based on camera and lidar data. The image classification is enhanced by the lidar, which provides 3D position together with reflection data. Another example is given by Hernández et al. [13]. Here, the clustering technique DBSCAN and reflective data from a laser rangefinder are used to find line surfaces for autonomous navigation. Methods for improving segmentation [55] and detection [5] using intensity and reflective data from the lidar have also been proposed.

A lidar manufacturer named Ouster has updated their drivers to output images based on lidar intensity data. It works by mapping the intensity data from the points in the point cloud to a grey-scale image. The result is similar to a conventional image, only that it is possible to extract accurate depth information from the image. Since it resembles conventional images, methods for object detection in images can be used [1].

In the paper by Gosala et al. [19], they present approaches related to perception and state estimation for an autonomous race car. The autonomous race car is designed for the same framework as Revolve NTNU, the Formula Student competition. As part of their perception system, they propose a method for recognizing cone colors based on the intensity data from a lidar. Their method is based on mapping a cone to a 32×32 grey-scale image where each pixel represents an intensity value from the point cloud. The image is classified using a CNN.

3.3 Choice of Method

The method of choice should be able to detect and classify the cones, preferably with cone color. It should have a fast processing time, be accurate and have high precision and recall. Due to the need for labeled training data for the neural network approaches, and since there already exists an implementation of Euclidean clustering, the latter is chosen for localizing cone candidates. As mentioned, it has a fast processing time, but is prone to false positives, giving it low precision.

To ensure higher precision, a method for classifying the cone candidates is suggested. Inspired by the front view approach from the detection methods, the use of the Ouster as a camera, and the cone color classification mentioned by Gosala et al. [19], a method of projecting the cone candidates to images is proposed. This is both to improve precision and to classify color. The overall method is similar to many traditional object detection methods on images, such as R-CNN [17], since the detection system finds regions of interest which are then classified by a CNN.


Chapter 4

Methodology and Implementation

This chapter intends to introduce the software, the hardware and the practical integration of the sensors. Thereafter, relevant coordinate frames and notations will be explained, which is important in order to make transformations between the different frames. Furthermore, a framework for lidar-lidar fusion is introduced, including calibration and synchronization. A proposed method for ego-motion compensation is also introduced, which is important to compensate for distortion in the lidar scan. Finally, the framework for cone detection is explained, which is motivated in section 3.3. The resulting frameworks for this chapter are illustrated in figure 4.1. For each part of the frameworks, the motivation behind it is explained, relevant methods are introduced, and finally some aspects of practical implementation are explained along with modifications and contributions.

Figure 4.1: Frameworks


4.1 Software, Hardware and Sensor Integration

4.1.1 Software

ROS

Robot Operating System (ROS) [39] is an open source project for developing robotics software.

Contrary to its name, it is not an operating system, but rather a framework consisting of several tools and libraries. It is supported in C++ and Python among other languages. ROS provides a communication infrastructure, different features and certain ROS-specific tools.

The communication infrastructure consists of nodes connected to each other in a network by a master node. It is based on a peer-to-peer network where information can flow between relevant nodes. A node is defined as a process that performs certain tasks. A node can listen to certain messages, or in ROS terms, it can subscribe to certain topics. It can also publish topics that other nodes can subscribe to. For instance, the driver for a lidar is a node that publishes point cloud messages on a topic that the detection node is subscribing to. Several nodes can publish and subscribe to the same topic.

It is possible to define your own message types in ROS, but it also contains many standard messages, for instance point cloud messages and navigation messages that can be used for state estimation. The standard point cloud message can be visualized in rviz, which is an integrated visualization tool in ROS. It is also possible to store data in ROS bags, a built-in tool that can subscribe to topics and store the data in .bag files. It allows the user to collect sensor data, which can be processed and analyzed at a later time.
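A minimal rospy node illustrating the publish/subscribe pattern described above is sketched below: it subscribes to a point cloud topic and republishes the cloud unchanged. The node name and topic names are placeholders, not the actual topics used on Atmos or the trolley.

```python
#!/usr/bin/env python
import rospy
from sensor_msgs.msg import PointCloud2

def callback(msg):
    # A real detection node would process the cloud here; this sketch
    # simply forwards the message to illustrate the pub/sub pattern.
    pub.publish(msg)

rospy.init_node("cone_detection_sketch")
pub = rospy.Publisher("/detection/output", PointCloud2, queue_size=1)
rospy.Subscriber("/pandar40/points", PointCloud2, callback)  # placeholder topic name
rospy.spin()
```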

PCL

Point Cloud Library (PCL) [48] is a large-scale open source project for image and point cloud processing. It is a framework that consists of a broad range of algorithms including segmentation, filtering, and registration. It is integrated as a toolkit in ROS, but can also be used as a standalone library in C++ or Python. It also includes visualization tools. Most of the point cloud operations for this project are done with PCL, which covers most of the filtering, transformations, calibration, and clustering.

Tensorflow

Tensorflow [3] is an open source framework for machine learning. It consists of various tools and libraries for deploying and evaluating machine learning methods. It is supported in Python, JavaScript, C++ and Java, where the Python and C++ APIs are relevant for this project.

For classification, it contains methods for building neural networks, and it is possible to define the network structure, for instance the different fully connected or convolutional layers, the type of pooling, or the activation functions used. It also includes support for GPUs and CUDA.
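As an illustration of how such a network can be declared in TensorFlow's Keras API, a small LeNet-style model for 28×28 single-channel images is sketched below. The layer sizes and the number of output classes are placeholders; the actual architecture used in this thesis is defined in chapter 4.

```python
from tensorflow.keras import layers, models

# Minimal LeNet-style CNN for 28x28 single-channel images (illustrative only).
model = models.Sequential([
    layers.Conv2D(6, kernel_size=5, activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D(pool_size=2),
    layers.Conv2D(16, kernel_size=5, activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Flatten(),
    layers.Dense(120, activation="relu"),
    layers.Dense(84, activation="relu"),
    layers.Dense(3, activation="softmax"),   # placeholder number of classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

Training then reduces to calling model.fit on labeled image arrays, with Keras handling the backpropagation and optimizer updates described in section 2.3.1.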



4.1.2 Integration of Sensors and Hardware

Atmos, the trolley and the cones

Atmos is the electrical race car that Revolve NTNU will use for the 2020 season of the Formula Student competition. The vehicle was created as an electrical race car fitted for drivers in the 2018 season of the Formula Student competition, but was adapted the following year to allow for autonomous driving. The vehicle is equipped with a variety of sensors and configurations to allow for autonomous racing. This includes sensors for detection such as lidars and cameras, sensors for state estimation and a processing unit. Atmos can accelerate from 0-100 km/h in 2.2 s with a top speed of 110 km/h. The overall goal for Atmos in the 2020 season is to drive 17 m/s on average on an already mapped track. This gives indications for the performance needed for the autonomous pipeline. A CAD representation of Atmos is given in figure 4.3.

For testing purposes, a trolley was adapted to be fitted with mounts for the lidars and cameras, along with the INS. It can be equipped with a 12V car battery and a DC-AC inverter making it mobile. Since Atmos is under maintenance and due to ease of testing, the trolley is used for the experiments.

There are mainly three types of cones that define the track. The first is the large orange cones that are placed before and after the start and finish lines. These can be used to indicate and count the number of laps the vehicle has run. The other two types are the small yellow and blue cones. The left border of the track is marked with small blue cones, and the right border is marked with small yellow cones. Their geometric dimensions are 228 mm × 228 mm × 325 mm, as illustrated in figure 4.2. Since the orange cones are taller than the small cones, clustering manages to find them without a lot of false positives. Therefore, they will not be a focus area in this project.

Figure 4.2: Yellow cone with measurements

Lidars

For this project, two lidars were chosen. The reasoning behind this is that it is assumed to result in a longer detection range, which is important to be able to perceive and plan sufficiently far ahead when driving fast. The lidars used for this project are the Hesai Pandar40 and the Hesai Pandar20B, abbreviated as Pandar40 or P40, and Pandar20B or P20, in this thesis. Both are created by Hesai Technologies for autonomous driving. They are mechanically rotating lidars that spin with a frequency of up to 20 Hz and provide information about the x, y and z coordinates, along with intensity information. The Hesai Pandar40 has 40 vertical laser receivers and emitters, while the Hesai Pandar20B has 20. For further specifications, see table 4.1. Hesai provides an interface for changing parameters on the lidars, such as the address data is sent to and narrowing the field of view, both horizontally and vertically. Both lidars have their own ROS driver that processes the raw sensor data into the correct coordinate system and publishes it as a point cloud message, making it compatible with the ROS environment.

Figure 4.3: Render of Atmos

                                  Pandar40                        Pandar20B
Channel                           40                              20
Measurement Range                 0.3 m to 200 m                  0.3 m to 200 m
Measurement Accuracy              ±2 cm (0.5 m to 200 m)          ±2 cm (0.5 m to 200 m)
FOV (Horizontal)                  360°                            360°
Angular Resolution (Horizontal)   0.4°                            0.4°
FOV (Vertical)                    -16° to 7°                      -19° to 3°
Angular Resolution (Vertical)     From 0.33° to 1° (nonlinear)    From 0.33° to 5° (nonlinear)
Wavelength                        905 nm                          905 nm

Table 4.1: Hesai Pandar40 [57] and Hesai Pandar20B [56] specifications

Vectornav VN-300

The state estimation system on Atmos is a complex system, combining several sensors to output the most probable state of the vehicle. On the trolley, on the other hand, a simpler state estimation system is integrated. This consists of an INS combined with dual GNSS antennas, with an integrated EKF to output the states. The sensor used is the Vectornav VN-300. Relevant specifications for the sensor are given in table 4.2. It has a ROS driver configured to publish nav_msgs, which is a common ROS message type that is also used on Atmos.

VN-300 SPECIFICATIONS
Horizontal Position Accuracy     2.5 m RMS
Position Resolution              1 mm
Velocity Accuracy                ±0.05 m/s
Velocity Resolution              1 mm/s
Output Rate (Navigation Data)    400 Hz

Table 4.2: VN-300 Specifications


Sensor Integration

The Pandar40 is located in the vehicle's main hoop, and the Pandar20B is placed on the front wing, which is illustrated in figure 4.3. These positions are replicated as closely as possible on the trolley, which is illustrated in figure 5.1 in section 5.1, where the Pandar20B is mounted low in the front and the Pandar40 is placed higher up in the rear. Both lidars are obstructed by aerodynamic elements covering the rear view. They are therefore configured to a horizontal view of 180°.

The connection chart for the trolley setup is illustrated in figure 4.4. The 12 V battery provides power to the processing unit and the lidars, while the processing unit provides power to the INS via USB. The processing unit can be a regular laptop or computer. The criterion is that it runs Ubuntu 16.04 with the required drivers and ROS installed.

Figure 4.4: Connection chart

4.2 Coordinate Frames and Notations

There are several coordinate frames in relation to the vehicle. The two lidars and the state estimation system have their own frames, which are local to the vehicle. There is also a global frame that defines where the vehicle is in world coordinates.

The lidar frame is defined in section 2.1 and is centered in the middle of a lidar. The state estimation frame is defined by the INS, which is placed at the Center of Gravity (CG). The INS also defines the body frame of the vehicle and will be the base for the common local frame; all velocities of the vehicle are defined relative to this frame. The frames are defined by their 6 degrees of freedom, namely translation along the x, y and z-axes and the rotations around each axis: roll, pitch and yaw. The lidar frames and the body frame are illustrated in figure 4.5, and roll, pitch and yaw are illustrated in figure 4.6. To transform all individual frames to a common frame, a set of transformations must be performed.


The transformation matrix from frame a to frame b is defined by a rotation matrix and a translation as

\[
T_a^b =
\begin{bmatrix}
R_a^b & t_{b/a} \\
0 & 1
\end{bmatrix}. \tag{4.1}
\]

The calibration procedure explained in the next section gives the transformation from the P20 frame to the P40 frame, denoted T_{P20}^{P40}. The transformation from the P40 frame to the common body frame, body, is denoted T_{P40}^{body}. Together, these are used to transform all the frames to the common body frame.
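Written out, this chaining amounts to composing the two homogeneous transformations; for example, a point measured in the P20 frame can be expressed in the body frame as

\[
T_{P20}^{body} = T_{P40}^{body}\, T_{P20}^{P40}, \qquad
\tilde{p}^{\,body} = T_{P20}^{body}\, \tilde{p}^{\,P20},
\]

where \tilde{p} denotes the point in homogeneous coordinates.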

The global frame is based on the Earth-Centered, Earth-Fixed (ECEF) frame and has its origin defined where the vehicle starts the race. It is used to synchronize the lidars by relating two global transformations at two given points in time. This is explained in section 4.5. Relevant notations and symbols are given in table 4.3.

Figure 4.5: Frames on the vehicle

Figure 4.6: Roll, pitch and yaw in body frame, adapted from [2]



Description                         Symbols
x, y and z (m)                      (x, y, z)
Velocities in x, y and z (m/s)      (v_x, v_y, v_z)
Roll, Pitch, Yaw (rad)              (φ, θ, ψ)
Roll, Pitch, Yaw rates (rad/s)      (φ̇, θ̇, ψ̇)
Rotation from a to b                R_a^b
Translation from a to b             t_{b/a}
Transformation from a to b          T_a^b
Body frame                          body
Lidar frames                        P20, P40

Table 4.3: Relevant notations

4.3 Lidar-Lidar Calibration

Lidar-lidar calibration is the process where the transformation between each lidar on the vehicle/trolley is found. This is a one-time process that is done for both the vehicle and the trolley, but it can easily be repeated to confirm that the transformation is still valid. There are several methods for calibration, but what they often have in common is that they try to match one point cloud with another based on a common scenery. During the matching process, the transformation between them is found. A group of methods that can be used for this purpose are registration algorithms, such as different versions of the Iterative Closest Point (ICP) [41, 7] and the Normal Distribution Transform (NDT). The latter is selected for this project. It was chosen due to ease of implementation in the PCL, and because it gave better results than the point-to-point ICP.

4.3.1 The Normal Distribution Transform

The NDT is a registration algorithm that optimizes over statistical models of the point cloud to determine the most probable transformation between two scans. It was first introduced by P. Biber and W. Strasser [8] as a 2D scan matching algorithm. The method used for this project is based on ”The Three-Dimensional Normal Distribution Transform - an Efficient Representation for Registration, Surface Analysis, and Loop Detection” by M. Magnusson [30], which is implemented as a method in the PCL. The main parts of the method are described in algorithms 1 and 2.


Algorithm 1: Building the NDT
Result: Probability density
divide space into voxels;
for each voxel do
    collect all points x_{i=1..n} in the voxel;
    calculate the mean q = (1/n) * sum_i x_i;
    calculate the covariance matrix Σ = (1/n) * sum_i (x_i − q)(x_i − q)^T;
end
the probability is modeled as p(x) ≈ exp(−(x − q)^T Σ^{−1} (x − q) / 2);
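As an illustrative sketch of the per-voxel statistics in algorithm 1, the mean, covariance and probability model can be computed with Eigen as follows. This is not the PCL-internal implementation, only a direct transcription of the steps above.

// Illustrative sketch of the per-voxel statistics in algorithm 1, written with
// Eigen. This is not the PCL-internal implementation.
#include <Eigen/Dense>
#include <cmath>
#include <vector>

// Mean of the points in one voxel: q = (1/n) * sum_i x_i
Eigen::Vector3d voxelMean(const std::vector<Eigen::Vector3d>& points)
{
  Eigen::Vector3d q = Eigen::Vector3d::Zero();
  for (const auto& x : points) q += x;
  return q / static_cast<double>(points.size());
}

// Covariance of the points in one voxel: S = (1/n) * sum_i (x_i - q)(x_i - q)^T
Eigen::Matrix3d voxelCovariance(const std::vector<Eigen::Vector3d>& points,
                                const Eigen::Vector3d& q)
{
  Eigen::Matrix3d S = Eigen::Matrix3d::Zero();
  for (const auto& x : points) S += (x - q) * (x - q).transpose();
  return S / static_cast<double>(points.size());
}

// Unnormalized probability of a point under the voxel's normal distribution:
// p(x) ~ exp(-(x - q)^T S^-1 (x - q) / 2). Assumes S is invertible, i.e. the
// voxel contains enough non-coplanar points.
double voxelScore(const Eigen::Vector3d& x, const Eigen::Vector3d& q,
                  const Eigen::Matrix3d& S)
{
  const Eigen::Vector3d d = x - q;
  return std::exp(-0.5 * d.dot(S.inverse() * d));
}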

Algorithm 2: Scan alignment
Result: Transformation between scans
build the NDT for the first scan using algorithm 1;
initialize the estimate of the transform;
while the convergence criterion is not met do
    for each sample of the second scan do
        map the point according to the estimate of the transform;
    end
    for each mapped point do
        determine the normal distribution;
        evaluate the score by evaluating the distribution;
    end
    calculate a new estimate by optimizing the score using Newton's method;
end

4.3.2 Implementation

To implement the NDT algorithm, a set of parameters must be set along with an initial transformation. As shown in algorithm 1, the point cloud should be divided into voxels; the voxel size is user-defined and is chosen to be 0.5 m in each dimension. From algorithm 2, it can be noted that the algorithm is initialized with an estimate of the transformation between the two scans. By measuring the distance between the lidar mounts on the trolley, an initial guess of the translation from P20 to P40 was found as t_{P40/P20} = [−1, 0, −1]. No rotation is assumed. Furthermore, a convergence criterion must be set, which determines how accurate the matching should be; this is set to 0.01 m.

The inputs to the NDT are a point cloud from the Pandar40 lidar, a point cloud from the same instant in time from the Pandar20B lidar, and the initial transformation. The output is the transformation from P20 to P40, T_{P20}^{P40}.
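A minimal sketch of this setup using the NDT implementation in the PCL is given below. The voxel resolution, convergence criterion and initial guess follow the values described above, while the step size and iteration limit are illustrative values that are not taken from the thesis configuration.

// Sketch of the NDT-based lidar-lidar calibration using the PCL implementation.
// Resolution, convergence criterion and initial guess follow the text; the step
// size and iteration limit are illustrative values.
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>
#include <pcl/registration/ndt.h>
#include <Eigen/Dense>

Eigen::Matrix4f calibrateP20ToP40(
    const pcl::PointCloud<pcl::PointXYZ>::Ptr& p20_cloud,  // source (Pandar20B)
    const pcl::PointCloud<pcl::PointXYZ>::Ptr& p40_cloud)  // target (Pandar40)
{
  pcl::NormalDistributionsTransform<pcl::PointXYZ, pcl::PointXYZ> ndt;
  ndt.setResolution(0.5f);             // 0.5 m voxels in each dimension
  ndt.setTransformationEpsilon(0.01);  // 0.01 m convergence criterion
  ndt.setStepSize(0.1);                // illustrative line-search step size
  ndt.setMaximumIterations(50);        // illustrative iteration limit
  ndt.setInputSource(p20_cloud);
  ndt.setInputTarget(p40_cloud);

  // Initial guess: measured translation from P20 to P40, no rotation.
  Eigen::Matrix4f init_guess = Eigen::Matrix4f::Identity();
  init_guess(0, 3) = -1.0f;
  init_guess(1, 3) =  0.0f;
  init_guess(2, 3) = -1.0f;

  pcl::PointCloud<pcl::PointXYZ> aligned;
  ndt.align(aligned, init_guess);

  return ndt.getFinalTransformation();  // transformation from P20 to P40
}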

4.4 Ego-Motion Compensation

The lidar performs a rotation as it scans the surroundings, while the vehicle moves at speeds up to and above 20 m/s, with both rotation and translation. Since the lidar itself does not compensate for this motion, it must be corrected for in order to get a correct view of the surroundings. If the lidar spins at a rate of 20 Hz and the vehicle has a velocity of 20 m/s, this can result in up to a meter of distortion in the scan.


Ego-motion compensation is discussed in the literature, and different methods have been proposed. Merriaux et al. [32] propose a method that uses CAN bus data to obtain angular and translational displacements, which are used to correct the point cloud. J. Rieken and M. Maurer [45] approximate each measurement by parameters such as start time, rotation rate and start/end angle. Both articles make certain assumptions, for instance a constant rotation rate of the lidar and no pitch or roll of the vehicle. Inspired by these, a method for ego-motion compensation has been developed.

4.4.1 Proposed Method

The method uses velocity estimates from state estimation and hardware specifications of the lidar to adjust each point in the point cloud. It works by mapping the time offset ∆t_i of each point i as a function of its azimuth angle α_i. The following assumptions are made for each scan:

• Constant velocities v_x, v_y and ψ̇ during a fraction of the scan

• Constant rotational frequency f of the lidar

• Small deviations in roll and pitch

For this project, a fraction of the scan is 180°, which corresponds to half of a full scan. A full scan is performed at 20 Hz, which means that 180° is covered in a 0.025 s period. Since this time period is so small, the assumption of constant velocity is considered a good approximation. The following explanation uses a 180° scan fraction as an example, but the method can be adjusted to any fraction.

For each point p = [p_x, p_y, p_z], p_z stays constant, as the deviations in roll and pitch are assumed to be small enough to be neglected. The lidar spins clockwise from 0° to 180°, and the latest received points, at 180°, are taken to be correct. For each point in the cloud, the following is calculated, which is also illustrated in figure 4.7,

\[
\alpha = \arctan\left(\frac{p_x}{p_y}\right) \tag{4.2a}
\]
\[
\Delta t = \frac{\pi}{2\pi f}\left(1 - \frac{\alpha}{\pi}\right) = \frac{1}{2f}\left(1 - \frac{\alpha}{\pi}\right). \tag{4.2b}
\]

This is used, together with the velocities gathered from state estimation, to calculate the new final point coordinates p^f from the initial coordinates p^i using

\[
\begin{bmatrix}
p_x^f \\ p_y^f
\end{bmatrix}
=
\begin{bmatrix}
\cos(\dot{\psi}\Delta t) & \sin(\dot{\psi}\Delta t) \\
-\sin(\dot{\psi}\Delta t) & \cos(\dot{\psi}\Delta t)
\end{bmatrix}
\begin{bmatrix}
p_x^i \\ p_y^i
\end{bmatrix}
-
\begin{bmatrix}
v_x \Delta t \\ v_y \Delta t
\end{bmatrix}. \tag{4.3}
\]
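A sketch of how equations (4.2) and (4.3) can be applied to each point of a 180° scan fraction is given below, assuming the velocities have already been obtained from state estimation. The quadrant-aware atan2 form of equation (4.2a) is used here to avoid division by zero; the azimuth is assumed to lie within the scan fraction.

// Sketch of the ego-motion compensation in equations (4.2) and (4.3) applied
// to a 180 degree scan fraction. Velocities are assumed to come from state
// estimation and to be constant over the fraction; p.z is left unchanged.
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>
#include <cmath>

void compensateEgoMotion(pcl::PointCloud<pcl::PointXYZI>& cloud,
                         double vx, double vy, double yaw_rate,
                         double f)  // lidar rotation frequency, e.g. 20 Hz
{
  for (auto& p : cloud.points)
  {
    // Equation (4.2a): azimuth angle, quadrant-aware form of arctan(px / py).
    const double alpha = std::atan2(static_cast<double>(p.x),
                                    static_cast<double>(p.y));

    // Equation (4.2b): time offset of the point within the scan fraction.
    const double dt = (1.0 / (2.0 * f)) * (1.0 - alpha / M_PI);

    // Equation (4.3): rotate by the yaw increment and remove the translation.
    const double c = std::cos(yaw_rate * dt);
    const double s = std::sin(yaw_rate * dt);
    const double x_new =  c * p.x + s * p.y - vx * dt;
    const double y_new = -s * p.x + c * p.y - vy * dt;

    p.x = static_cast<float>(x_new);
    p.y = static_cast<float>(y_new);
  }
}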
