Conclusion and Future Work

7.2 Future Work

7.2.3 Camera-Lidar Fusion

Revolve NTNU uses a stereo camera setup that runs a detection algorithm independently of the lidar detection module on the vehicle. Both the camera module and the lidar module therefore look for cone candidates, but they pass their detections to the localization and mapping algorithm asynchronously. This may leave potential unused, because a detection method that combines camera and lidar may perform better. The two sensors have complementary strengths: the lidar provides accurate depth information, while the camera provides information-rich images. This section explores some of the possibilities of camera-lidar fusion, which can serve as a springboard for future work.

Related Work

There are methods that combine camera and lidar in an integrated deep neural network architecture, for instance Multi-View 3D [11] and AVOD-FPN [25]. Both combine different views of the lidar data with an image input for classification. There are also several methods for camera-based classification, for instance Fast R-CNN [44], Faster R-CNN [43] and YOLO [42]. The last of these, YOLO, is used to classify cones during the 2020 season at Revolve NTNU.

Calibration and Synchronization

To combine data from a lidar and a camera, the two sensors must be calibrated. As with lidar-lidar calibration, the goal is to find the transformation between the lidar and the camera coordinate systems. A relevant method is given by Dhall et al. [14], which has a ROS package for practical implementation.
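Once the transformation is known, a lidar point can be expressed in the camera frame and projected into the image. The sketch below illustrates this with a pinhole camera model and no lens distortion. The rotation, translation, intrinsic parameters and function names are placeholders chosen for illustration; they are not calibration results from the vehicle and not part of the package by Dhall et al.

```python
from typing import Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Sequence[Sequence[float]]


def lidar_to_camera(p_lidar: Vec3, R: Mat3, t: Vec3) -> Vec3:
    """Rigid transform from the lidar frame to the camera frame: p_cam = R p_lidar + t."""
    x, y, z = (
        sum(R[i][j] * p_lidar[j] for j in range(3)) + t[i] for i in range(3)
    )
    return (x, y, z)


def project_to_pixel(p_cam: Vec3, fx: float, fy: float,
                     cx: float, cy: float) -> Optional[Tuple[float, float]]:
    """Pinhole projection without distortion; None if the point is behind the camera."""
    x, y, z = p_cam
    if z <= 0.0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


# Assumed axis conventions (illustrative only): lidar x forward, y left, z up;
# camera z forward, x right, y down. The sensors share an origin here (t = 0).
R_EXAMPLE = [[0.0, -1.0, 0.0],
             [0.0, 0.0, -1.0],
             [1.0, 0.0, 0.0]]
T_EXAMPLE = (0.0, 0.0, 0.0)

# A point 10 m ahead of the lidar and 1 m to its left (placeholder intrinsics).
pixel = project_to_pixel(
    lidar_to_camera((10.0, 1.0, 0.0), R_EXAMPLE, T_EXAMPLE),
    fx=1000.0, fy=1000.0, cx=640.0, cy=360.0,
)
```

With these placeholder values, the point lands left of the image center, as expected for an object to the left of the vehicle. A real setup would replace R and t with the calibrated extrinsics and would also account for lens distortion.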

Synchronization is also important for a camera-lidar fusion framework, to ensure that the fused data relate to the same instant in time. Unlike the lidar, a camera can be triggered externally. This makes it possible to trigger the camera at the same time as the lidar publishes data. There are a few methods related to this, such as TriggerSync [16].
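Where hardware triggering is not available, a simpler software fallback is to pair each lidar scan with the camera frame closest in time and to reject pairs that are too far apart. The sketch below shows this nearest-timestamp matching. It is not TriggerSync and not a triggering scheme; the function name, the timestamps and the tolerance are hypothetical values chosen for illustration.

```python
from bisect import bisect_left
from typing import List, Optional, Sequence, Tuple


def match_nearest(lidar_stamps: Sequence[float],
                  camera_stamps: Sequence[float],
                  max_offset: float) -> List[Tuple[float, Optional[float]]]:
    """Pair each lidar timestamp with the nearest camera timestamp.

    camera_stamps must be sorted in ascending order. A lidar scan is paired
    with None if no camera frame lies within max_offset seconds of it.
    """
    pairs: List[Tuple[float, Optional[float]]] = []
    for ts in lidar_stamps:
        i = bisect_left(camera_stamps, ts)
        # Only the neighbours on either side of the insertion point can be nearest.
        neighbours = camera_stamps[max(i - 1, 0):i + 1]
        best = min(neighbours, key=lambda c: abs(c - ts), default=None)
        if best is not None and abs(best - ts) > max_offset:
            best = None
        pairs.append((ts, best))
    return pairs


# Illustrative timestamps in seconds: a 10 Hz lidar and an unsynchronized camera.
lidar = [0.00, 0.10, 0.20]
camera = [0.01, 0.05, 0.09, 0.16]
matched = match_nearest(lidar, camera, max_offset=0.02)
```

In this example the last scan has no camera frame within 20 ms and is left unpaired. The residual time offset of each accepted pair still translates into a position error that grows with vehicle speed, which is the motivation for triggering the camera directly.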

In this thesis, it is suggested to use the lidar to localize cone candidates, which are then classified from a 2D projection of each candidate. This concept adapts readily to camera-lidar fusion. Instead of projecting the candidates to 2D, the image snippets that correspond to the candidates can be extracted. The snippets can then be classified by a CNN, and they should contain much more information than the 2D-projected lidar data. This can also increase the classification range, since the clustering is able to find candidates at around 30 m.
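The snippet extraction can be sketched as follows, assuming that a candidate has already been projected to a pixel position and that its distance is known from the lidar. Under a pinhole model, an object of height h at depth z spans roughly f h / z pixels, so the crop can shrink as the candidate gets farther away. The function name, the object height, the scale factor and the image size are hypothetical illustration values, not parameters from the thesis.

```python
from math import ceil, floor
from typing import Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


def candidate_roi(u: float, v: float, depth_m: float, focal_px: float,
                  object_height_m: float, scale: float,
                  img_w: int, img_h: int) -> Optional[Box]:
    """Square crop centered on pixel (u, v), sized from the lidar depth.

    The side length is the expected object height in pixels times `scale`,
    which adds a margin around the candidate. The box is clamped to the image;
    None is returned if the depth is invalid or the box is empty after clamping.
    """
    if depth_m <= 0.0:
        return None
    half = 0.5 * scale * focal_px * object_height_m / depth_m
    left = max(0, floor(u - half))
    top = max(0, floor(v - half))
    right = min(img_w, ceil(u + half))
    bottom = min(img_h, ceil(v + half))
    if left >= right or top >= bottom:
        return None  # the candidate projects outside the image
    return (left, top, right, bottom)


# Illustrative values: a 0.25 m tall object at 10 m in a 1280x720 image.
roi = candidate_roi(u=640.0, v=360.0, depth_m=10.0, focal_px=1000.0,
                    object_height_m=0.25, scale=1.5, img_w=1280, img_h=720)
```

The resulting box can be used to slice the image array before passing the snippet to the classifier. At long range the crop contains only a few pixels, so the usable classification range also depends on the camera resolution.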

One possible drawback is that the field of view of a typical camera is narrower than that of the lidar, which means that multiple cameras might be needed.

Another solution is to use image-based classification to localize and classify cones in the images; the lidar can then be used to extract the distance to the cones found by the camera. An advantage of camera-based classification is that a database of around 75,000 labeled cone images already exists, which allows for better training and verification. Weight can also be reduced, since a camera can be smaller and lighter than a lidar, and it can potentially make up for the benefits of a second lidar and so remove the need for two lidars. A combined camera-lidar deep network architecture can also be used, such as AVOD-FPN or Multi-View 3D. The disadvantage of these is that they need training data from both lidar and camera.
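For the camera-first variant, the distance extraction can be sketched as follows, assuming the lidar points have already been projected into the image and each carries its measured range. The lidar points that fall inside a detection box are collected and their median range is taken, which is one simple way to reduce the influence of background points seen past the cone. The function name and all values are hypothetical and chosen for illustration.

```python
from statistics import median
from typing import Iterable, Optional, Tuple

Box = Tuple[float, float, float, float]       # (left, top, right, bottom) in pixels
ProjectedPoint = Tuple[float, float, float]   # (u, v, range in metres)


def range_for_box(box: Box, points: Iterable[ProjectedPoint]) -> Optional[float]:
    """Median lidar range of the projected points that fall inside a detection box.

    Returns None if no lidar point hits the box, for example when the cone
    is beyond the range at which the lidar still returns points from it.
    """
    left, top, right, bottom = box
    ranges = [r for (u, v, r) in points
              if left <= u < right and top <= v < bottom]
    return median(ranges) if ranges else None


# Illustrative data: two returns on the cone, one background return inside
# the box, and one point that belongs to another object outside the box.
detection = (100.0, 100.0, 150.0, 180.0)
projected = [(120.0, 130.0, 8.1), (125.0, 140.0, 8.3),
             (130.0, 150.0, 25.0), (300.0, 200.0, 5.0)]
cone_range = range_for_box(detection, projected)
```

The median only suppresses background points when most returns inside the box come from the cone itself. For distant cones with very few returns, a stricter rule, such as clustering the ranges and taking the nearest cluster, may be more appropriate.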


Appendix A