
Compared to image classification, where the goal is to predict the class of one object in an image, object detection also involves identifying the position of one or more objects in the image. An object detector produces a list of the objects present in the image with corresponding scores, as well as an aligned bounding box indicating the position and scale of each object. Using CNNs for object detection means training two networks: one for classifying the objects in an image, and another for fitting a box around each object. Since this is supervised learning, the training data needs both a truth label for each object and its location in the image in the form of a bounding box. The truth label for the bounding box is called the ground truth box.

2.3.1 Evaluation metrics

In order to know whether a model is well trained, several evaluation metrics are defined based on its predictions. A simple classification task is straightforward to evaluate, but for object detection a confidence score is introduced for each bounding box of a detected object[1].

IoU Intersection over Union is an evaluation metric that quantifies the similarity between the predicted bounding box and the ground truth box (gt) as a value between 0 and 1. The higher the IoU score, the closer the two boxes are to each other. The IoU measures the overlap/intersection of the bounding boxes divided by their union.

Figure 2.10: An example of the IoU between two bounding boxes[1].
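As an illustration, a minimal IoU computation in Python; the (x1, y1, x2, y2) corner format for axis-aligned boxes is an assumption here:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    # Width/height clamp to 0 when the boxes do not overlap at all.
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Example: two partially overlapping 10x10 boxes.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```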

Predictions To decide whether a bounding box prediction is good enough, the IoU is measured against a set threshold: values above the threshold are considered positive predictions, and those below are negative predictions. In the next sections, some evaluation metrics are calculated using true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). A true positive denotes that the object is there and the IoU is above the threshold. A true negative denotes that the object is not there and the model does not detect it. A false positive denotes that the object is there but the IoU is below the threshold. A false negative denotes an occurrence where the object is there but the model does not detect it, meaning there is no predicted bounding box.
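A small Python sketch of this decision rule, following the definitions above; the 0.5 threshold is a common but arbitrary choice, and the final branch (a box predicted where no object exists) is a common convention not stated in the text:

```python
def classify_detection(object_present: bool, box_predicted: bool,
                       iou: float = 0.0, threshold: float = 0.5) -> str:
    """Label one detection outcome as TP, TN, FP or FN."""
    if object_present and box_predicted:
        # Good overlap counts as a hit, poor overlap as a false alarm.
        return "TP" if iou >= threshold else "FP"
    if object_present:
        return "FN"  # object missed entirely: no predicted bounding box
    if box_predicted:
        return "FP"  # box predicted where no object exists (convention)
    return "TN"
```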

Accuracy The accuracy is the number of true positives plus true negatives divided by the total number of predictions. This is often misleading when dealing with imbalanced datasets.

\[ \text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \]

Precision The precision is the fraction of predicted bounding boxes that match an actual ground truth box. This metric is, in other words, the probability that the model is correct when it detects an object.

\[ \text{Precision} = \frac{TP}{TP + FP} \]

Recall The recall is the rate of true positives, often referred to as the sensitivity of the predictions. It measures the probability of ground truth objects being correctly detected, i.e. how many of the actual objects the model detected.

\[ \text{Recall} = \frac{TP}{TP + FN} \]
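Collecting the three formulas above in a small Python helper (the example counts are made up):

```python
def detection_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Accuracy, precision and recall from the four outcome counts."""
    return {
        "accuracy":  (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall":    tp / (tp + fn) if tp + fn else 0.0,
    }

# Example: 80 hits, 5 correct rejections, 10 false alarms, 20 misses.
print(detection_metrics(tp=80, tn=5, fp=10, fn=20))
# {'accuracy': 0.739..., 'precision': 0.888..., 'recall': 0.8}
```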

Average precision AP The average precision is an evaluation metric that measures the performance of the model as a single value that accounts for both precision and recall. The average precision is also known as the area under the curve (AUC) and is the sum, over recall levels $r_n$, of the maximum precision $p(\tilde{r})$ for any recall $\tilde{r} \ge r_{n+1}$, multiplied by the change in recall.

\[ AP = \sum_{n} (r_{n+1} - r_n)\, p_{\text{interp}}(r_{n+1}), \qquad p_{\text{interp}}(r_{n+1}) = \max_{\tilde{r} \ge r_{n+1}} p(\tilde{r}) \]
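A sketch of this computation in Python with NumPy, using all-point interpolation; padding the curve at recall 0 and 1 is a standard convention, not stated in the text:

```python
import numpy as np

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """All-point interpolated AP for one class, following the formula above."""
    # Pad so the curve starts at recall 0 and ends at recall 1.
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))

    # p_interp(r_n): maximum precision at any recall >= r_n (right-to-left sweep).
    for i in range(p.size - 1, 0, -1):
        p[i - 1] = max(p[i - 1], p[i])

    # Sum (r_{n+1} - r_n) * p_interp(r_{n+1}) where the recall actually changes.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```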

Mean average precision mAP The mean average precision averages the AP over all N classes, i.e. it gives the total performance across all classes.
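Written as a formula, with $AP_i$ the average precision of class $i$:

\[ \text{mAP} = \frac{1}{N} \sum_{i=1}^{N} AP_i \]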

The research in object detection is still ongoing. In order to avoid having to train a network from scratch every time, whether on new datasets, when fine-tuning a model towards the same dataset, or when introducing new model architectures, one can use general and robust pretrained networks to help speed up the learning process. Using a pretrained network as a foundation for the new task is called using a backbone. When it comes to object detection in images, as explained in the previous sections, the goal is to fit a bounding box around a classified object. The task of the CNN is to extract features of an image and learn the model on those instead of on the input images directly. Using an already robust and generalized model as a backbone and extracting its most important features lets the new model keep the important features of said backbone, and makes both detection and localization of objects much faster. Networks pretrained on ImageNet[15] are among the most used backbones today, as ImageNet contains over 14 million images covering almost 22 000 classes. Another commonly used backbone is a model trained on the COCO dataset. COCO stands for Common Objects in Context and consists of both training and validation data of over 120 000 images, with multiple bounding boxes per image for around 100 common object classes.
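As a sketch of the backbone idea using PyTorch/torchvision (the library choice and the 10-class head are assumptions for illustration):

```python
import torch
import torchvision

# ResNet-50 pretrained on ImageNet, used here as a backbone.
backbone = torchvision.models.resnet50(
    weights=torchvision.models.ResNet50_Weights.DEFAULT
)

# Keep the convolutional feature extractor, drop the classification head.
features = torch.nn.Sequential(*list(backbone.children())[:-1])

# Freeze the backbone so its general-purpose features are reused as-is.
for param in features.parameters():
    param.requires_grad = False

# New task-specific head (hypothetical 10-class problem).
head = torch.nn.Linear(2048, 10)

x = torch.randn(1, 3, 224, 224)        # one dummy RGB image
logits = head(features(x).flatten(1))  # features: shape (1, 2048) after flatten
```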

The difference between fine-tuning a model and feature extraction is, generally speaking, whether one trains the model further using similar data with corresponding classes, typically a subset of the classes already used in the model, or extracts the important features from the network and uses them as a foundation for training a new network. The first approach is called fine-tuning.

A typical example using the MS COCO dataset is to fine-tune the model on new data to make it better at predicting fewer classes, e.g. buses, instead of all 100 classes. Feature extraction[9] reduces the dimensionality of the original input data so it is more manageable for further training; it is more commonly used when training on new datasets with new classes, so that the model already has a general notion of what to look for, both in terms of classification and localization of the bounding boxes.
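A minimal sketch of the bus example with torchvision (assuming PyTorch; the two-class head, bus plus background, is hypothetical):

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Faster R-CNN pretrained on COCO.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")

# Replace the COCO classification head with a hypothetical 2-class one
# (bus + background), then fine-tune on the new, smaller dataset.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=2)

# For feature extraction instead of full fine-tuning, freeze the backbone
# so only the new head receives gradient updates.
for param in model.backbone.parameters():
    param.requires_grad = False
```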