Deep learning applied to automatic polyp detection in colonoscopy images : master thesis in System Engineering with Embedded Systems

University College of Southeast Norway Faculty of Technology and Maritime Sciences - Master Thesis in System Engineering With Embedded Systems Kongsberg Department of Engineering May 19, 2017

Qinghui LIU

Deep Learning Applied to Automatic Polyp Detection in Colonoscopy Images

Abstract

Deep learning extends the classical neural network with additional computational layers, which allow for higher levels of abstraction and prediction in the data. It has become a leading machine learning tool for general imaging and computer vision. Current trends in research have also demonstrated that deep convolutional neural networks (DCNNs) are very effective in automatically analyzing images. However, the requirement for a large number of annotated samples hinders their wide use in medical image analysis, since collecting and labeling a large amount of data from the medical domain is difficult.

Polyps are known as possible colorectal cancer precursors, and their early detection is of great importance but highly challenging from an image processing standpoint. In this work, we evaluate several state-of-the-art machine learning techniques and deep learning methods in the medical image processing domain and investigate how they can be utilized more efficiently for the automatic detection of polyps in endoscopy and colonoscopy images.

This work proposes an effective transfer learning (TL) framework relying on DCNNs pre-trained on a large collection of natural ImageNet images. This has been achieved by evaluating a range of cutting-edge techniques, including both traditional machine learning methods, with feature-based classifiers trained from scratch, and modern DCNN algorithms with TL and fine-tuning of pre-trained models. We transfer the learned ImageNet weights as initial weights, and then fine-tune this model combined with a new deep classifier called a fully connected network (FCN), using data augmentation and patch extraction of colonoscopy images to automatically detect polyps. When colonoscopy images are scarce, patch-based data augmentation and deep features extracted using the TL strategy can provide sufficient and balanced classification information.

With the proposed TL framework and our optimized hyper-parameters, the system achieved an overall polyp detection precision and sensitivity of 96.00%, outperforming the traditional machine learning classification methods on every defined performance metric. Moreover, the proposed TL framework is scalable and flexible, so it can easily be extended to other types of disease detection in the future and can integrate additional DCNN models to boost its generalization capability.

Acknowledgements

First and foremost, I would like to express my sincere gratitude to my thesis advisors, Dr. António L. L. Ramos at USN, Norway, and Dr. Sergio L. Netto at UFRJ, Brazil, for their guidance and support while conducting this research work as well as in the writing of this thesis. I could not have had better advisors and mentors for my master study and research. I am very thankful to Dr. António L. L. Ramos for giving me this task and to Dr. Sergio L. Netto and his research group, especially Lucas Cinelli and Bruno Afonso, for sharing their expertise and experience in the field of Machine Learning.

I would like to thank Dr. Olaf Hallan Graven, our head of department, for providing all the necessary means and facilities to carry out this task. I also take this opportunity to express my gratitude to my lecturers and faculty members at USN for sharing their expertise, and sincere and valuable guidance and encouragement.

I would also like to thank all my classmates, especially my lab-mates Zhili Shao, Blessing, Vegard, and David, for the stimulating discussions and for all the fun we have had in the past two years. My sincere thanks also go to Mr. Paul Stupkin for reading part of the manuscript and for the valuable comments on the English grammar.

Last but not least, I would especially like to thank my family. Words cannot express how grateful I am to my mother-in-law, father-in-law, my mother and father for all of the sacrifices you have made on my behalf. To my beloved wife, GU Yan, I can’t thank you enough for supporting me in everything, and for encouraging me throughout this experience.

List of Figures

1.1 Gastro-Intestinal Tract Diagram

1.2 Examples of the type of polyp in colonoscopy images

1.3 A polyp appears in different scales and shades

2.1 The architecture of VGG16 model

2.2 The structure of an Inception module

2.3 The residual block for residual learning approach

3.1 Three frameworks for automatic polyp detection

3.2 A scalable framework for computer-aided detection systems

3.3 Performance evaluation of different histogram algorithms

3.4 Performance evaluations of different filters

3.5 Examples of data augmentation results

3.6 Mathematical model of a neuron

3.7 3-layer feed-forward neural network model

3.8 Algorithm for computing Adadelta

3.9 Convolutional operation examples on a 3-channel image

3.10 An example of setting zero-padding and strides

3.11 Max and average pooling examples for subsampling features

3.12 An example of applying dropout to a neural network

3.13 The structure of a residual block for ResNet

3.14 The architecture of 50-layer ResNet

4.1 Patch extraction examples from a frame with a polyp

4.2 Traditional machine learning process for polyp detection

4.3 Classifiers comparison results of visualization

4.4 Learning curve of 5-layer CNNs

4.5 Transfer learning architecture

4.6 Fine-tuning learning rate experimentation

4.7 Dropout rate tuning experimentation

4.8 Tuning decay rate experimentation

4.9 Batch-size experimentation

4.10 10 Models performance comparison

4.11 The impact of different input image sizes

4.12 The learning curve of Model-0

4.19 The learning curve of Model-7

4.20 The learning curve of Model-8

List of Tables

4.1 System configuration requirements

4.2 Patch-balanced dataset after data augmentation

4.3 Definition of performance metrics

4.4 Classifiers comparison results

4.5 The architecture of 50-layer ResNet

4.6 The suggested setting ranges of hyper parameters

4.7 An example of 3-fold cross validation process

4.8 Model-1 to model-5 polyp detection results

4.9 Model-6 to model-8 polyp detection results

List of Abbreviations

AI Artificial Intelligence.

API Application Programming Interface.

BN Batch Normalization.

CAD Computer Aided Diagnosis.

CLBP Completed Local Binary Pattern.

CNN Convolutional Neural Network.

ConvNet Convolutional Network.

CUDA Compute Unified Device Architecture.

cuDNN CUDA Deep Neural Network library.

CV Cross Validation.

CWC Color Wavelet Covariance.

DCNN Deep Convolutional Neural Networks.

DL Deep Learning.

DT Decision Tree.

FCN Fully Connected Network.

GI Gastro-Intestinal.

GP Gaussian Process.

GPU Graphics Processing Units.

ILSVRC ImageNet Large-Scale Visual Recognition Challenge.

KNN K-Nearest Neighbors.

LBP Local Binary Pattern.

ML Machine Learning.

MLP Multi-Layer Perceptron.

NN Neural Network.

PCA Principal Component Analysis.

RBF Radial Basis Function.

ReLU Rectified Linear Unit.

RF Random Forests.

ResNet Residual Network.

RMS Root Mean Squared.

ROI Region Of Interest.

SGD Stochastic Gradient Descent.

SIFT Scale Invariant Feature Transform.

SVM Support Vector Machine.

TL Transfer Learning.

TPE Tree-structured Parzen Estimator.

TSCH Texture Spectrum and Color Histogram.

TSH Texture Spectrum Histogram.

VCE Video Capsule Endoscopy.

VGG Visual Geometry Group.

WCE Wireless Capsule Endoscopy.

Chapter 1

Introduction

1.1 Background

Colorectal cancer is the third most common type of cancer in men and women in the United States of America and the second leading cause of cancer deaths [1]. Early detection of polyps, protrusions from the colon surface, is vital to the prevention of colorectal cancer, which is highly curable when detected early. The cancer often begins as a benign polyp of the tissue lining the colon or rectum and, without proper treatment at an early stage, the polyp can eventually develop into cancer. Therefore, one of the major goals of endoscopy and colonoscopy is the early detection of polyps and cancers.

FIGURE 1.1: Gastro-Intestinal Tract Diagram (Image from Wikimedia Commons).

Endoscopy, colonoscopy and wireless capsule endoscopy

Conventional endoscopy performs a visual inspection of the gastro-intestinal (GI) tract (see Figure 1.1) using an endoscope, a lighted, flexible tube with a tiny video camera at its tip. There are two basic types of endoscopy: upper endoscopy, carried out by inserting a flexible endoscope through the mouth to collect images from the esophagus, stomach, and small intestine; and colonoscopy, performed by inserting the endoscope via the anus to examine the large intestine, including the colon and rectum.

FIGURE 1.2: Examples of the type of polyp in colonoscopy images.

Wireless (video) capsule endoscopy (WCE/VCE) is a technology designed primarily to provide diagnostic imaging of the small intestine in a less invasive manner. The capsule measures 26 by 11 mm, about the size of a large vitamin pill, and is propelled through the small bowel by peristalsis. Wireless capsule endoscopes have also been developed for the esophagus and colon, but their use in those areas is not yet as widespread [54]. Colonoscopy is still the preferred technique for colon cancer screening and prevention.

Appearances of polyps

Polyps appear in different shapes, ranging from flat to pedunculated forms. Flat polyps are often attached to the colon wall by their base, whereas pedunculated polyps are attached via a stem. Figure 1.2 shows some examples of colonic polyps extracted from different colonoscopy videos from CVC-ColonDB.

In addition, a polyp may appear at different scales depending on the distance between the polyp and the colonoscopy camera. This is shown in Figure 1.3, where the same polyp appears at a different scale in each image.

1.2 Problem statement

Colonoscopy is an operator-dependent procedure in which human factors such as fatigue and insufficient attentiveness can lead to missed polyps during long and back-to-back procedures. The average polyp miss rate is estimated at around 4-12%. Patients with missed polyps may later be diagnosed with late-stage colorectal cancer, which has a survival rate of less than 10%. In addition, although WCE is now also used for colon examination, including the identification of polyps, over 50,000 images are captured for analysis after ingestion of the capsule, which is time-consuming for physicians to assess manually.

FIGURE 1.3: A polyp appears in different scales and shades in colonoscopy videos from CVC-ClinicDB.

To reduce the number of polyps missed due to human factors, as well as the cost and time of screening a large number of colonoscopy or WCE frames, many techniques have recently been studied for the automatic detection of polyps in colonic images.

However, computer-aided automatic detection of polyps is still a difficult task due to the variety of shape, size, color, texture and scale in the captured images. Additionally, the complex structure of the GI tract, the similar color of polyp and non-polyp regions, poor image quality, and the variation in images of the same polyp caused by frequent camera angle changes create further challenges.

1.3 Motivation

Current trends in research have demonstrated that deep learning methods, especially deep convolutional neural networks (DCNNs), are very effective for automatic analysis of images. DCNNs have become a leading machine learning tool for general imaging and computer vision. Indeed, recent advances in deep learning frameworks and methods have shown great potential to enhance performance in computer vision applications, owing to their robust learning capabilities [17]. This sparked our curiosity to explore and develop an effective approach based on cutting-edge DL algorithms to solve a real-world problem in medical image analysis. This work, focusing on automatic polyp detection, can potentially be a life saver, and builds upon our initial study presented in [31].

1.4 Objective

The objective of this work is to develop high-performance, scalable and reliable automated polyp detection systems that can tolerate polyp variability. By handling differences such as shape, size, color, and texture, computer-aided automatic polyp detection systems become more feasible in clinical practice.

1.5 Approach

To achieve our objectives and desired outcomes, we chose the Scrum methodology. This choice was based on the type of project to be carried out. The method allowed us to work efficiently in iterative weekly sprints, in which the work done during the previous week was reviewed and new tasks were defined and organized for the following week. We proposed the following sub-tasks:

• Researching existing work on topics related to automatic polyp detection.

• Studying and evaluating image processing algorithms and state-of-the-art machine learning approaches.

• Extensively studying DCNN algorithms and choosing the most appropriate models for automatic polyp detection tasks.

• Developing pre-processing techniques for dataset preparation and performing preliminary experiments on the domain dataset.

• Designing and implementing the DCNN models for the detection of polyps.

• Performing extensive tests and fine-tuning the models to obtain the best performance.

• Performing the final evaluation and suggesting future work.

We first studied the literature on image processing algorithms and machine learning methods for polyp classification. We then built tools to evaluate these techniques. Meanwhile, we extensively studied and investigated the newest DCNN architectures that could be employed in our work. We then developed a scalable transfer learning framework to utilize pre-trained DCNN models for polyp detection tasks. After the proposed DCNN models were implemented, we performed extensive tests and fine-tuning of the models in order to obtain the best performance. Finally, we evaluated all proposed methods using comprehensive performance metrics and suggested an outlook on future work.

1.6 Outline

The remainder of this thesis is organized as follows.

• In Chapter 2, we provide an overview of the literature on topics related to automatic polyp detection. This covers a brief description of traditional texture- and shape-based methods, conventional machine learning classification, and deep learning concepts.

• Chapter 3 presents our methodology and the frameworks for polyp detection using both traditional machine learning methods and new DCNN algorithms. It describes the low-level image processing techniques, the popular classifiers, and the cutting-edge DCNN framework employed in our work.

• In Chapter 4, we present in-depth information about our design and implementation. We discuss the results of the experiments we performed and further evaluate the proposed methods using well-defined performance metrics.

• Chapter 5 provides the conclusions of our work, summarizing our main contributions and achievements, and suggesting an outlook on future work.

Chapter 2

Literature review

This chapter covers general aspects of ML and DL methods in order to build the necessary foundations to understand the scope and results presented in this work. First, an overview of machine learning techniques is given, with a brief discussion of different learning types such as supervised, unsupervised, and reinforcement learning. Subsequently, different low-level feature extraction approaches, which include texture, shape, and the fusion of texture and shape features, are presented separately in the context of automatic polyp detection. Next, the chapter focuses on deep learning methods, which represent the current state of the art and future trends. We first present deep learning concepts in general. Then several cutting-edge DCNN models, including AlexNet, VGG Net, GoogLeNet and ResNet, are described in detail, since DCNN models are an important part of our work. In the subsequent sections, we analyze different deep learning applications for the automatic detection of polyps and publications related to this topic, grouped into two sections according to the methods used, namely CNN-based CAD systems and pre-trained CNNs. Finally, we summarize and evaluate the results against our specific requirements.

2.1 Machine learning

The fields of Machine Learning (ML) and Deep Learning (DL) have been experiencing great progress in recent years and many useful techniques have been developed. These techniques are currently playing an important role in fields such as medical image processing and computer-aided diagnosis (CAD).

2.1.1 Overview

The typical goal of machine learning is to determine a mapping from input patterns to an output value [4]. A machine learning algorithm can be expressed as a function y(x) that takes an input x and generates an output y. The output is usually encoded in the same way as the target vectors [4]. The form of the function y(x) is determined during the training or learning phase, based on a training data set. Once the model is trained, it is then evaluated on new data, referred to as the test set.

Machine learning algorithms are typically classified into three categories, based on the nature of the training signals or feedback to the learning system, as follows [43]:

• Supervised learning: In supervised learning problems, the training data is made up of tuples (x_i, y_i), where x_i is the input and y_i the corresponding target vector [10]. The goal is to learn a general rule, also called a mapping function f : X → Y, that maps inputs to outputs. Supervised learning tasks can be further divided into classification and regression, based on the desired output of the machine-learned system [4]:

– In classification, inputs are grouped into two or more classes or categories, where the output variable is a category as well, e.g., "disease" or "no disease".

– In regression, the output is a continuous real value rather than a discrete category, such as the predicted price of a house.

• Unsupervised learning: In unsupervised learning tasks, no corresponding labels are given to the algorithms during the training process. The algorithms are left to their own devices to find structure in the input data (x). Unsupervised learning problems can be further divided into clustering and density estimation problems, as described below.

– In clustering, the goal is to discover inherent groups of similar examples in the input data, based on measured or perceived similarities, such as grouping polyps by shape.

– In density estimation, the objective is to determine the distribution of the input data in some space.

• Reinforcement learning: This is concerned with tasks that involve interaction with a dynamic environment, in which the learner chooses and performs suitable actions in a given scenario, such as driving a car or playing a game [4].
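As a minimal illustration of the supervised setting described above, the following Python sketch learns a function y(x) from labelled pairs (x_i, y_i) and then evaluates it on a separate test set. The 1-nearest-neighbour rule, the helper name `fit_1nn`, and the toy 2-D feature vectors are assumptions chosen for brevity; they are not the classifiers or data used in this thesis.

```python
import math

def fit_1nn(train):
    """'Training' a 1-nearest-neighbour model only stores the labelled pairs (x_i, y_i)."""
    stored = list(train)

    def y(x):
        # Predict the label of the stored input closest to x (Euclidean distance).
        nearest = min(stored, key=lambda pair: math.dist(pair[0], x))
        return nearest[1]

    return y

# Made-up 2-D feature vectors; labels follow the "disease" / "no disease" example.
train_set = [((0.9, 0.8), "disease"), ((1.0, 1.1), "disease"),
             ((0.1, 0.2), "no disease"), ((0.0, 0.1), "no disease")]
test_set = [((0.8, 1.0), "disease"), ((0.2, 0.0), "no disease")]

y = fit_1nn(train_set)                      # training phase
accuracy = sum(y(x) == target for x, target in test_set) / len(test_set)
print(accuracy)                             # fraction of test samples classified correctly
```

Keeping the test set disjoint from the training set is what makes the measured accuracy an estimate of performance on new data.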

In most machine learning techniques, the main stages are a feature extraction (descriptor) step and a decision-making stage called the classification step. Additional steps, such as image smoothing or noise filtering and region-of-interest (ROI) selection, can be added prior to the feature extraction stage.

There are primarily two types of features, namely shape (geometric) features and texture-color features. Both types have been utilized in the literature for polyp detection in medical images. To further improve the quality of the features and to acquire more information from the images, feature fusion approaches have been employed as well. This is done by combining geometric and textural features of the image to benefit from the information that both provide.

2.1.2 Texture features

The early work of Iakovidis et al. [24] investigated four texture extraction methods for the discrimination of gastric polyps in endoscopic videos, namely Color Wavelet Covariance (CWC), Texture Spectrum Histogram (TSH), Texture Spectrum and Color Histogram (TSCH), and Local Binary Pattern (LBP). The reported results support the feasibility of using texture (and color) feature analysis for the detection of polyps.

A similar comparative study by Li et al. [30] was published in 2012, in which three color spaces, namely RGB, Lab, and HSI, were used to examine the performance. It claims that the Lab color space coupled with CLBP (completed LBP) color features showed the best experimental results, reaching a 77.20% detection rate. Both studies used an SVM as the classifier.
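To make the LBP descriptor mentioned above more concrete, the sketch below computes the basic 8-neighbour LBP code of a pixel and a 256-bin histogram of codes over a grayscale patch. It is a hedged illustration of plain LBP only: the neighbour ordering, the function names, and the toy patch are assumptions, and it is neither the CLBP variant nor the exact implementation used in the cited studies.

```python
def lbp_code(img, r, c):
    """Basic 8-neighbour LBP code of the pixel at (r, c) in a 2-D grayscale image."""
    center = img[r][c]
    # Neighbours visited clockwise, starting at the top-left pixel (assumed ordering).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        # A neighbour at least as bright as the centre sets its bit to 1.
        if img[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code

def lbp_histogram(img):
    """256-bin histogram of LBP codes over all interior pixels (borders are skipped)."""
    hist = [0] * 256
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            hist[lbp_code(img, r, c)] += 1
    return hist

# Made-up 3x3 grayscale patch with a single interior pixel.
patch = [[5, 9, 1],
         [4, 6, 7],
         [2, 3, 8]]
print(lbp_code(patch, 1, 1))  # 26: bits 1, 3 and 4 are set (neighbours 9, 7 and 8)
```

The histogram, rather than the individual codes, is what would typically be passed to a classifier such as an SVM as the texture feature vector.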

However, texture-color based analysis has two major limitations [23]: it uses a fixed-size analysis window, and it relies heavily on an exhaustive training set of images, which makes it very sensitive to parameter tuning.

Nawarathna et al. [37] made use of texton histograms for identifying abnormal regions with different classifiers such as SVM and KNN. An accuracy of 95.27% was obtained for polyp detection by using Schmid filter bank based textons and an SVM classifier. Nawarathna et al. [36] later extended this approach by using a local binary pattern (LBP) feature. In addition, a bigger filter bank (Leung–Malik), which includes Gaussian filters, was proposed for capturing texture more effectively. These approaches only use texture features, without any color or geometrical features. The best performance of 92% accuracy was obtained with the Leung–Malik-LBP filter bank and a KNN classifier.

Yuan and Meng [59] utilized scale invariant feature transform (SIFT) feature vectors with K-means clustering for bag of features representation of polyps. The authors calculated weighted histograms of the visual words by integrating histograms in both saliency and non-saliency regions. These were fed into an SVM classifier and experiments on 872 images with 436 polyp frames showed that 92% detection accuracy was obtained.
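The bag-of-features idea described above can be sketched as follows: each local descriptor is assigned to its nearest cluster centre (visual word), and the image is represented by the resulting, optionally weighted, histogram. This is only a sketch under stated assumptions: the 2-D descriptors and the two-word codebook are made up (real SIFT descriptors are 128-dimensional and the codebook would come from K-means), and the `weights` argument is a hypothetical stand-in for the saliency weighting, whose exact form in [59] is not given here.

```python
import math

def nearest_word(descriptor, codebook):
    """Index of the codebook centre (visual word) closest to the descriptor."""
    return min(range(len(codebook)), key=lambda k: math.dist(descriptor, codebook[k]))

def bag_of_features(descriptors, codebook, weights=None):
    """Normalised histogram of visual words; `weights` optionally weights each descriptor."""
    if weights is None:
        weights = [1.0] * len(descriptors)
    hist = [0.0] * len(codebook)
    for d, w in zip(descriptors, weights):
        hist[nearest_word(d, codebook)] += w
    total = sum(hist)
    return [h / total for h in hist] if total else hist

codebook = [(0.0, 0.0), (1.0, 1.0)]   # made-up cluster centres
descriptors = [(0.1, 0.0), (0.9, 1.2), (1.1, 0.8), (0.0, 0.2)]
print(bag_of_features(descriptors, codebook))  # [0.5, 0.5]
```

The fixed-length histogram is what makes images with different numbers of keypoints comparable by a classifier such as an SVM.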

2.1.3 Shape features

The objective of shape-based methods is to localize the specific appearances that most polyps commonly have in endoscopy frames. One method, suggested by Hwang et al. [23], is to utilize elliptical shape features to detect the shots of polyps, assuming that polyps tend to have an elliptical shape.

Hwang et al. further improved their method in [22] by introducing a watershed segmentation based on Gabor texture features and K-means clustering, prior to identifying polyp candidates by extracting curvature-based geometric information from the resulting segments. However, the authors only tested their method on a small dataset of 128 images containing 64 polyp shots and 64 non-polyp images, which makes the results (100% sensitivity and over 81% specificity) unconvincing, since a small test dataset tends to lead to over-tuning, which may easily create an illusion of good performance.

A more sophisticated shape-based method, introduced by Bernal et al. [7], employed valley information and a region growing approach to find polyps. Bernal et al. called it Sector Accumulation – Depth of Valleys Accumulation (SA-DOVA). The method was further improved and renamed as Window Median DOVA (WM-DOVA).

Further performance evaluations presented in [8] claim to have reduced the number of false positives around vascular structures and specular reflections. In principle, WM-DOVA not only exploited available low-level image processing methods such as valley or edge analysis, but was also the first to introduce a search for the concavity of boundaries and for elements of the scene such as blood vessels and specular highlights [8]. Nonetheless, shape-based approaches tend to mislead a polyp detector towards other polyp-like structures such as fecal content and reflection spots [51].

2.1.4 Texture and shape features

Both texture-based and shape-based methods have benefits and drawbacks. For that reason, more recent systems have combined them in an attempt to obtain improved performance. Mamonov et al. [33] presented an algorithm for polyp detection in colon capsule endoscopy, which is referred to as binary classification with pre-selection. This algorithm relies on geometrical analysis and the texture content of the frame. The authors assumed that polyps are characterized as protrusions that are mostly round in shape, and then considered a best-fit ball radius as the decision parameter of the binary classifier. In addition, they introduced a pre-selection procedure that discards frames with too much or too little texture content. Since the surface of polyps is often highly textured, frames with too little texture are unlikely to contain polyps, while too much texture tends to imply the presence of bubbles or trash liquids. It therefore makes sense to discard both kinds of frames.

Although the above algorithm demonstrated high per-polyp sensitivity (81.2%) and high per-patient specificity (92.2%) in thorough statistical testing with a rich data set, it still has some drawbacks and areas for improvement, as listed below:

• It did not detect the actual location of the polyp in the colon. This problem is particularly exacerbated in capsule colonoscopy.

• Its effectiveness lies partially in the use of a pre-selection criterion; however, the proposed pre-selection approach, while robust in some sense, was not sophisticated enough. It was less effective in filtering out frames with bubbles.

• It only utilized texture and geometry information; the color content was discarded, since the frame had to be converted to grayscale before processing.

• It only used a simple binary classifier; more advanced classification techniques such as support vector machines (SVM) may improve its detection performance.

Another work by Tajbakhsh et al. [51] proposed a hybrid context-shape approach, which utilizes texture context information to remove non-polyp structures and shape information to reliably localize polyps. The proposed system consists of four stages as follows:

• Constructing Edge Maps for input images.

• Refining the edge map using context information.

• Localizing polyp candidates using shape information.

• Placing a band around each polyp candidate.

The suggested system was tested using two public polyp databases containing 300 unique polyps, and achieved a sensitivity of 88.0%. However, the authors also pointed out that the system might fail to detect polyps with faint gradients around their boundaries, resulting in a polyp localization failure. In addition, unsuccessful edge classification could also lead to localization failures.

2.1.5 Classifiers

In most cases that we studied, SVM has been the most widely used classifier in medical image processing. An SVM determines support vectors from the feature space, which define the optimal hyperplane that separates a set of objects with maximum margin [12]. However, no single classification method outperforms all others on all data sets, and there are also other state-of-the-art classifiers such as Random Forests (RF) [32] and KNN. We will evaluate all of them in this work.
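For reference, the maximum-margin idea can be written as the standard hard-margin optimization problem below. The notation is the usual textbook one and is an assumption here rather than a quotation from [12]: the x_i are feature vectors, the y_i ∈ {−1, +1} are class labels, w and b define the separating hyperplane, and the N training samples are assumed to be linearly separable.

```latex
\min_{\mathbf{w},\, b} \; \frac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i\left(\mathbf{w}^{\top}\mathbf{x}_i + b\right) \ge 1, \qquad i = 1, \dots, N
```

Under this formulation the margin between the two classes is 2/‖w‖, so minimizing ‖w‖ maximizes the margin, and the support vectors are the training samples for which the constraint holds with equality.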

2.2 Deep learning

More recently, deep learning (DL) techniques have become state-of-the-art for many image and signal processing tasks. DL is a new branch of ML based on a set of algorithms that model high-level abstractions in data using multiple processing layers, which allows systems to learn complex mapping functions f : X → Y directly from input data. DL is indeed moving ML closer to one of its original goals, Artificial Intelligence (AI), and was acknowledged as one of the top 10 breakthroughs of 2013 [11].

Various deep learning architectures have been extensively studied in recent years, including the deep belief network (DBN) [19], autoencoder [55], deep convolutional neural network (DCNN) [28], recurrent neural network (RNN), and region-based convolutional neural network (R-CNN) [16]. They have been successfully applied in various areas, such as signal processing [21, 44], natural language processing [3, 34, 56], and computer vision [45, 50, 58]. Current trends in research have demonstrated that DCNNs are highly effective in automatically analyzing images, which is why they are nowadays the first choice in complex computer vision applications. We therefore choose to utilize DCNN techniques in our work as well.

2.2.1 Deep architectures

The main power of CNNs lies in their deep architectures [14, 47, 49], which allow for extracting a great number of features at multiple levels of abstraction. Several state-of-the-art deep CNN models have been developed, as presented in the following.

AlexNet [28], developed by Krizhevsky, Sutskever, and Hinton in 2012, was the first model to perform so well on the ImageNet dataset, achieving a top-5 error of 15.4% (top-5 error is the rate at which, given an image, the model does not include the correct answer among its top 5 predictions). AlexNet was composed of 5 convolutional layers along with 3 fully connected layers. It illustrated the power and benefits of CNNs and backed them up with record-breaking performance in the 2012 ILSVRC (ImageNet Large-Scale Visual Recognition Challenge). Moreover, techniques utilized by AlexNet, such as data augmentation and dropout, are still used today.
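The top-5 error defined above can be computed with a few lines of Python. The sketch below is illustrative only: the function name `top_k_error` and the six-class scores are made up, and ties between equal scores are broken by class index, which is an arbitrary choice.

```python
def top_k_error(scores, labels, k=5):
    """Fraction of samples whose true label is not among the k highest-scoring classes."""
    misses = 0
    for class_scores, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(class_scores)),
                        key=lambda i: class_scores[i], reverse=True)
        if label not in ranked[:k]:
            misses += 1
    return misses / len(labels)

# Two made-up samples over six classes, with their true class indices.
scores = [[0.05, 0.10, 0.30, 0.20, 0.25, 0.10],
          [0.40, 0.20, 0.15, 0.10, 0.10, 0.05]]
labels = [4, 5]

print(top_k_error(scores, labels, k=5))  # 0.5: only the second sample misses the top 5
print(top_k_error(scores, labels, k=1))  # 1.0: neither true class is ranked first
```

With k=1 the same function gives the ordinary (top-1) classification error.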

VGG Net [9] was proposed by the Oxford Visual Geometry Group (VGG) for ILSVRC 2014 and is best known for its 7.3% error rate. This model consists of five main groups of convolution operations. Adjacent convolution groups are connected via max-pooling layers. Each group contains a series of 3x3 convolutional layers (i.e. kernels). The number of convolution kernels stays the same within a group and increases from 64 in the first group to 512 in the last one. The total number of learnable layers can be 11, 13, 16, or 19, depending on the number of convolutional layers in each group.

Figure 2.1 illustrates the architecture of the 16-layer VGG net (VGG16). VGG Net is one of the most influential architectures, since it strengthened the intuitive notion that CNNs need a deep stack of layers for the hierarchical representation of visual data to work.
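The layer counts quoted for the VGG variants (11, 13, 16, or 19) follow directly from the number of convolutional layers per group plus the fully connected layers. The short sketch below makes this arithmetic explicit. The per-group counts and the three fully connected layers are taken from the original VGG paper rather than from this text, and the dictionary and function names are illustrative.

```python
# Number of 3x3 convolutional layers in each of the five groups, per VGG variant.
# Assumption: values as reported in the original VGG paper, not stated in this chapter.
VGG_CONV_LAYERS = {
    "VGG11": [1, 1, 2, 2, 2],
    "VGG13": [2, 2, 2, 2, 2],
    "VGG16": [2, 2, 3, 3, 3],
    "VGG19": [2, 2, 4, 4, 4],
}
FC_LAYERS = 3  # fully connected layers that follow the last convolution group

def learnable_layers(name):
    """Total learnable layers: all convolutional layers plus the fully connected ones."""
    return sum(VGG_CONV_LAYERS[name]) + FC_LAYERS

for name in VGG_CONV_LAYERS:
    print(name, learnable_layers(name))
```

Max-pooling layers carry no learnable weights, which is why they do not appear in the count.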

FIGURE 2.1: The architecture of VGG16 model [9].

GoogLeNet [49] was the winner of ILSVRC 2014 with a top-5 error of 6.7%. The authors introduced a novel Inception module, which performs pooling and convolutional operations in parallel. GoogLeNet used 9 Inception modules with over 100 layers in total, but had 12x fewer parameters than AlexNet. It was the first model to introduce the idea that CNN layers with different kernel filters can be stacked up and operated in parallel. With the Inception module, GoogLeNet achieves improved performance and computational efficiency, since it avoids stacking all convolution layers and adding huge numbers of filters sequentially, which would require more computational and memory resources and increase the chance of over-fitting.

ResNets were originally introduced in the paper "Deep Residual Learning for Image Recognition" [18] by He et al. ResNet won ILSVRC 2015 with a new 152-layer convolutional network architecture (ResNet152) trained on an 8-GPU machine for two to three weeks. It achieved a top-5 error of 3.6% and set new records in classification, detection, and localization. ResNet architectures were demonstrated with 50, 101, and 152 layers, and performance improved as the networks got deeper.

The authors of ResNet proposed a residual learning approach to ease the difficulty of training deeper networks by reformulating the layers as residual blocks. Each block contains two branches: one connects the input directly to the output, and the other performs two to three convolutions and computes the residual function with reference to the layer inputs. The outputs of these two branches are then added up, as shown in Figure 2.3.

FIGURE 2.2: The structure of an Inception module of GoogLeNet.

FIGURE 2.3: The residual block for residual learning approach.

2.2.2 CNNs-based CAD systems

With the revival of CNN techniques, the medical image processing field has also been experiencing a new generation of CAD systems with more promising performance.

Wimmer et al. applied CNNs to the computer-assisted diagnosis of celiac disease based on endoscopic images of the duodenum [57]. To evaluate which network configurations are best suited for the classification of celiac disease, the authors trained several CNN models with different numbers of layers and filters and different filter dimensions. The results of the CNNs were compared with those of popular general-purpose image representation methods. The deeper CNN architectures outperformed these comparison approaches, and combining CNNs with linear support vector machines further improved the classification rates by about 3–7%, leading to distinctly better results (up to 97%) than those of the comparison methods.

Jia et al. employed deep CNNs for the detection of GI bleeding in 10,000 Wireless Capsule Endoscopy (WCE) images [25]. WCE is a non-invasive video imaging method for examining small bowel disease. They reported an F-measure of approximately 99%.

Pei et al. mainly focused on evaluating the contraction frequency of the bowel by investigating diameter patterns, and the length of the bowel by measuring temporal information [38].

A popular approach is to extract features automatically from endoscopy images using a CNN [61] and then pass the feature vector to an SVM for classification and detection of gastrointestinal lesions. The proposed system was evaluated on 180 images for lesion detection, and 80% accuracy was reported. A similar hybrid approach was used in [15]: features were extracted quickly using CNN architectures and then passed to an SVM for detection of inflammatory GI disease in WCE videos. The experiments were conducted on 337 annotated inflammatory images and 599 non-inflammatory images of the GI tract. The training set contained 200 normal and 200 abnormal images, while the test set contained 27 normal and 27 abnormal images, and an overall accuracy of up to 90% was obtained.

There are several recent works [41, 52, 53] that have exploited CNN-based methods for automatic detection of polyps in endoscopy and colonoscopy images. Although DL approaches can extract a set of discriminating features at multiple levels of abstraction directly from the input image pixels, they usually require a large amount of training data, which might be quite rare in some medical imaging fields. Ribeiro et al. [40] proposed a method that uses small patches to increase the size of the database and to classify different regions in the same image, and then trains the CNNs on these patches.

In another work, Tajbakhsh et al. proposed a new polyp detection method based on a unique 3-way image representation and CNNs [52]. The 3-way image representation captures the three major types of polyp features, namely (1) color and texture clues, (2) temporal features, and (3) shape in context. The method thus exploits color, texture, shape, and temporal information at multiple scales, which enables more accurate polyp detection.

To train the CNNs, the authors first collected all the generated polyp candidates and grouped them into true and false detections. They then collected the three sets of patches Pc, Pt, and Ps at multiple scales, translations, and orientations. Finally, a total of 400,000 patches were labeled as positive or negative and resized to 32x32 pixels to form the training dataset. Evaluations on a large annotated polyp database showed superior performance, with significantly reduced polyp detection latency and fewer false positives [52]. One drawback noted was that this method did not rely on future frames, although this avoids delayed feedback on the locations of polyps.


2.2.3 Pre-trained CNNs

The above methods need to train CNNs from scratch with a large amount of training data, which might be quite rare in medical fields. The updated work of Tajbakhsh et al. [53] tried to address this problem by making use of pre-trained CNNs, with sufficient fine-tuning, to eliminate the need for training CNNs from scratch.

The authors considered four distinct medical imaging applications (polyp detection, pulmonary embolism detection, colonoscopy frame classification, and intima-media boundary segmentation) in three specialties (radiology, cardiology, and gastroenterology) involving classification, detection, and segmentation, and investigated how the performance of deep CNNs trained from scratch compared with that of pre-trained CNNs fine-tuned in a layer-wise manner. Their experiments demonstrated that [53]:

• Use of a pre-trained CNN with adequate fine-tuning outperformed or, in the worst case, performed as well as a CNN trained from scratch.

• Fine-tuned CNNs were more robust to the size of training sets than CNNs trained from scratch.

• Neither shallow tuning nor deep tuning was the optimal choice for a particular application.

• A layer-wise fine-tuning scheme could offer a practical way to reach the best performance for the application at hand based on the amount of available data.

These results showed that knowledge transfer from natural images to medical images is possible and suggested [53] that layer-wise fine-tuning might offer a practical way to achieve the best performance for a given medical imaging application based on the amount of available data.

2.3 Summary

In summary, we have discussed polyp detection approaches based on machine learning and deep learning techniques, the classifiers utilized, the datasets, and performance details (whenever available). Many improvements have been made in pre-processing techniques, feature extraction algorithms, classification methods, or in all of them, and there is a clear trend toward the use of deep learning frameworks, especially CNN-based architectures. However, these proposed methods are tuned to obtain the best achievable detection accuracy for their corresponding datasets, so we believe that the majority of them suffer to some degree from over-fitting or under-fitting.


Chapter 3

Methodology

In this chapter, we describe in detail the different techniques used for automatic polyp detection. The first section presents our three major frameworks (ML-framework, DL-framework, and TL-framework) for automatically detecting polyps in colonic images. We also describe a scalable framework for computer-aided diagnosis systems, based on the fusion of state-of-the-art techniques, that can generalize and extend our project in the future with versatile capabilities in the medical domain.

The subsequent section analyses various image preprocessing methods that are utilized in our work and are also necessary for most machine learning and deep learning systems. These techniques cover histogram modification, noise filtering, data augmentation, and dimension reduction. Next, the chapter focuses on neural network design methodologies, covering the algorithms needed to build an effective artificial neural network, such as the feed-forward structure, activation functions, softmax functions, loss functions, regularization, gradient descent optimizers, and backpropagation.

Finally, in the last section, we describe the methodologies needed for designing state-of-the-art deep convolutional networks, which include the convolution algorithm with zero-padding and stride, pooling, and dropout techniques. We close with a detailed description of the structure of the 50-layer ResNet (ResNet50), the major deep convolutional network architecture utilized in our project.

3.1 Proposed frameworks

In this work, we propose and test three different methodologies for automatic detection of colorectal polyps, as shown in Figure 3.1. The first detection scheme, named ML-framework, stands for traditional machine learning classification methods based on a set of low-level feature descriptors. The second one, called DL-framework, makes use of deep learning methods (mostly CNN-based architectures) for image classification. The last scheme, called TL-framework, presents transfer learning (TL) strategies utilized for automatic polyp detection. We discuss them in detail in later sections.

In addition, based on the above three proposed detection methods, we lay out a generalized but scalable framework for computer-aided diagnosis (CAD) systems [31], in which a fusion of machine learning algorithms and deep learning techniques is employed to further generalize and boost the system's performance and robustness. This scheme is flexible, and new types of data can easily be added in the future as needed in order to detect or predict other types of diseases. Generally, it consists of four stages: preprocessing, feature extraction, classification, and post-processing, as shown in Figure 3.2. In the figure, the red dashed line represents the process for training the system.

First, the preprocessing stage is important for properly preparing the data by removing noise or unwanted parts of the data. The objective of preprocessing is to improve the quality of the digital images. It can consist of subsampling, enhancing, edge detection, scaling, extracting region-of-interest (ROI) patches, and so on. It has a large impact on the subsequent feature extraction and classification stages.

In the feature extraction phase, the focus is on extracting key characteristics of candidates, such as texture and shape, with a set of low-level image processing algorithms. However, DL techniques such as CNNs have recently been utilized more and more as feature descriptors in medical image analysis. We also took advantage of deep CNN techniques in our work.

In the classification stage, many kinds of classifiers are utilized to discriminate between multiple objects on the basis of the features defined and extracted in the previous phase. Finally, the post-processing stage is needed to properly display the results, formulate diagnosis reports, or localize and annotate the diseases for further evaluation by medical physicians.

The purpose of this suggested CAD architecture is to serve as a roadmap for building versatile CAD systems in the future by reproducing, generalizing, and extending our work on automatic polyp detection systems.

3.2 Image preprocessing

Image preprocessing here refers to the processing of digital images by low-level algorithms on a digital computer, e.g. removing the noise in an image. Preprocessing is a common and necessary step in a machine learning pipeline. For mathematical analysis, an image is defined as a two-dimensional function f(x, y), where x and y are spatial coordinates, and the amplitude of f is called the intensity or gray level of the image at the point with coordinates (x, y). A digital image is composed of a finite number of elements, or pixels, described by x, y, and f. Pixel is the most widely used term to denote the elements of a digital image.

Various algorithms and methodologies, such as contrast and edge enhancement, have been developed in image processing during the past decades. In our work, we evaluated some important algorithms to preprocess our images, including histogram modification, contrast stretching, noise filtering, and PCA.

3.2.1 Histogram modification

The histogram is of great importance in image enhancement because it reflects the characteristics of the image. By modifying the histogram, image characteristics can be modified. One such example is histogram equalization, a nonlinear stretch that redistributes pixel values so that there is approximately the same number of pixels with each value within a range.
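A minimal sketch of histogram equalization for an 8-bit grayscale image is given below. It maps gray levels through the normalized cumulative histogram; the function name `equalize_histogram` and the toy image are assumptions for illustration and do not reproduce the exact routine used in our experiments.

```python
import numpy as np

def equalize_histogram(image, levels=256):
    """Histogram-equalize an 8-bit grayscale image via its cumulative histogram."""
    image = np.asarray(image, dtype=np.uint8)
    hist = np.bincount(image.ravel(), minlength=levels)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]           # cumulative count at the darkest occupied level
    denom = max(cdf[-1] - cdf_min, 1)   # guard against constant images
    # Look-up table that makes the output cumulative histogram roughly linear.
    lut = np.clip(np.round((cdf - cdf_min) / denom * (levels - 1)), 0, levels - 1)
    return lut.astype(np.uint8)[image]

# Hypothetical low-contrast image whose gray levels lie in [100, 119].
low_contrast = np.tile(np.arange(100, 120, dtype=np.uint8), (4, 1))
equalized = equalize_histogram(low_contrast)
print(low_contrast.min(), low_contrast.max(), "->", equalized.min(), equalized.max())
```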

FIGURE 3.1: Three frameworks for automatic polyp detection. ML-framework stands for traditional machine learning methods; DL-framework presents deep learning (mostly CNN-based architectures) classification techniques, and TL-framework is the scheme of utilizing transfer learning strategies on pre-trained deep neural networks such as pre-trained ResNet50, VGG16 etc.

FIGURE 3.2: A generalized framework for computer-aided diagnosis systems which can be extended to detect or predict other types of diseases in future. The framework has further evolved from our previous work [31].

Meanwhile, contrast stretching methods are designed for a frequently encountered situation: some images are homogeneous, i.e., they do not have much change in their gray levels, and their histograms are characterized by very narrow peaks. Different stretching techniques have been developed to stretch this narrow range to the whole of the available dynamic range. Figure 3.3 shows the resulting histograms for three algorithms: contrast stretching, histogram equalization, and adaptive equalization.

FIGURE 3.3: Performance evaluation of different histogram algorithms.

3.2.2 Noise filtering

Noise filtering is used to remove unnecessary information and various types of noise from an image. Various filters, such as mean, median, max, min, Sobel, Prewitt, Canny, and Laplace, are available for our project. We evaluated most of them on our images. Figure 3.4 visualizes the performance of a number of different filtering algorithms.
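As one example of the filters listed above, a median filter can be sketched in a few lines of NumPy. The function name `median_filter`, the odd window size, and the edge-replicating border are assumptions for this illustration; library implementations offer more border modes.

```python
import numpy as np

def median_filter(image, size=3):
    """Median filter for a 2-D grayscale image (size must be odd)."""
    image = np.asarray(image)
    pad = size // 2
    padded = np.pad(image, pad, mode="edge")   # replicate border pixels
    # Collect every size x size neighbourhood as a stack of shifted views.
    windows = [
        padded[dy:dy + image.shape[0], dx:dx + image.shape[1]]
        for dy in range(size)
        for dx in range(size)
    ]
    return np.median(np.stack(windows, axis=0), axis=0).astype(image.dtype)

# Hypothetical flat image with a single impulse ("salt") pixel.
noisy = np.full((5, 5), 10)
noisy[2, 2] = 255
print(median_filter(noisy))
```

The impulse is removed because it never forms the majority of any 3x3 neighbourhood, which is why median filtering suits salt-and-pepper noise.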

3.2.3 Data augmentation

FIGURE 3.4: Performance evaluations of different filters.

Data augmentation is one way to fight over-fitting on a small training dataset. Over-fitting happens when a model exposed to too few examples learns patterns that do not generalize to new data, i.e. when the model starts using irrelevant features for making predictions. In our work, we make use of data augmentation methods to enlarge our training dataset. For example, assume a training set of 100 images of polyps and non-polyps. By rotating, mirroring (horizontal and vertical flip), shifting (width and height), zooming, adjusting contrast, etc., it is possible to generate more than 2000 additional images. Figure 3.5 shows an example of data augmentation from one polyp image to 10 additional images by random rotation, horizontal and vertical flip, shift, and zoom. In many machine learning applications, data augmentation makes it possible to build better models.
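The following sketch generates 10 augmented copies of one image patch using rotation, flips, and shifts. It is a simplified illustration under stated assumptions: rotations are restricted to multiples of 90 degrees, shifts wrap around the border (`np.roll`), zoom and contrast changes are omitted, and the random patch stands in for a real polyp image.

```python
import numpy as np

def augment(image, rng):
    """Return one randomly transformed copy of a square H x W x C image."""
    out = np.rot90(image, k=int(rng.integers(0, 4)))   # rotate by 0/90/180/270 degrees
    if rng.random() < 0.5:
        out = out[:, ::-1]                             # horizontal flip
    if rng.random() < 0.5:
        out = out[::-1, :]                             # vertical flip
    dy, dx = (int(s) for s in rng.integers(-4, 5, size=2))
    out = np.roll(out, shift=(dy, dx), axis=(0, 1))    # shift by up to 4 pixels (wraps around)
    return np.ascontiguousarray(out)

rng = np.random.default_rng(0)
patch = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)  # stand-in for a polyp patch
augmented = [augment(patch, rng) for _ in range(10)]
print(len(augmented), augmented[0].shape)
```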

FIGURE 3.5: Examples of data augmentation results.


3.2.4 Dimension reduction

Dimensionality reduction is a useful way to reduce the running time of machine learning algorithms, since the number of input features affects the runtime.

Principal component analysis (PCA) is an algorithm that can be used for dimensionality reduction. PCA can be summarized in four major steps:

• Normalize the data to have features on the same scale.

• Calculate the covariance matrix to measure how pairs of variables change together.

• Find the eigenvectors of the covariance matrix.

• Translate the data to be in terms of the components.
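The four steps above can be sketched with NumPy as follows. The function name `pca` and the synthetic data are assumptions for illustration; the covariance uses the sample form with an (n − 1) denominator.

```python
import numpy as np

def pca(data, n_components):
    """Project data (rows = samples) onto its leading principal components."""
    data = np.asarray(data, dtype=np.float64)
    # Step 1: normalize each feature to zero mean and unit variance.
    std = data.std(axis=0, ddof=1)
    std[std == 0] = 1.0
    z = (data - data.mean(axis=0)) / std
    # Step 2: sample covariance matrix with the (n - 1) denominator.
    cov = z.T @ z / (z.shape[0] - 1)
    # Step 3: eigen-decomposition; eigh returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Step 4: express the data in terms of the selected components.
    return z @ eigvecs[:, order], eigvals[order]

# Synthetic example: the third feature is almost a copy of the first.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))
x[:, 2] = x[:, 0] + 0.1 * x[:, 1]
projected, variances = pca(x, n_components=2)
print(projected.shape, variances)
```

Each returned eigenvalue equals the variance of the data along the corresponding component, which is the quantity used to decide how many components to keep.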

The covariance between X and Y is given by formula (3.1), and the covariance matrix, shown here for three variables x, y, and z, has the form (3.2):

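\[
\operatorname{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1} \tag{3.1}
\]

\[
C = \begin{bmatrix}
\operatorname{cov}(x, x) & \operatorname{cov}(x, y) & \operatorname{cov}(x, z) \\
\operatorname{cov}(y, x) & \operatorname{cov}(y, y) & \operatorname{cov}(y, z) \\
\operatorname{cov}(z, x) & \operatorname{cov}(z, y) & \operatorname{cov}(z, z)
\end{bmatrix} \tag{3.2}
\]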

3.3 Neural network design

Neural networks are a group of models inspired by biological neural networks. Figure 3.6 shows a general model of a neuron, where $W$ is a weight matrix and $X$ is an input column vector containing all pixel data of an image. For instance, $X$ can be a [32*32*3 x 1] column vector and $W$ a [2 x 32*32*3] matrix, and the output is a vector $Y$ of $N$ class scores (class-1 to class-N; $N = 2$ in this example). That is, $Y(x_i, W, b) = \sigma(W x_i + b)$. The weights $W$ are learnable and control the strength of influence of each input. The activation function $\sigma$ is usually an abstraction representing the firing rate of the cell.

FIGURE 3.6: Mathematical model of a neuron.

3.3.1 Neural networks

Neural networks take inputs and transform them through a series of hidden layers. Figure 3.7 shows a 3-layer feed-forward neural network with 2 hidden layers. Each hidden layer consists of a set of neurons, where each neuron is fully connected to all neurons in the previous layer, and where neurons within a single layer function completely independently and do not share any connections. The last fully-connected layer is called the “output layer”, and in classification settings it represents the class scores.

A feed-forward neural network takes an input, which then "trickles" through the network, and returns an output vector. More formally, call $a^i_j$ the activation output of the $j$-th neuron in the $i$-th layer; for the input layer, $a^i_j$ is simply the $j$-th element of the input vector.

FIGURE 3.7: 3-layer feed-forward neural network model.

Then we can relate each layer's activations to those of the previous layer via the following relation:

\[
a^i_j = \sigma\left( \sum_k w^i_{jk} \, a^{i-1}_k + b^i_j \right) \tag{3.3}
\]

where, in equation (3.3):

• $\sigma$ is the activation function;

• $w^i_{jk}$ is the weight from the $k$-th neuron in the $(i-1)$-th layer to the $j$-th neuron in the $i$-th layer;

• $b^i_j$ is the bias of the $j$-th neuron in the $i$-th layer;

• $a^i_j$ represents the activation value of the $j$-th neuron in the $i$-th layer.

Sometimes we write $z^i_j$ to represent $\sum_k w^i_{jk} \, a^{i-1}_k + b^i_j$, in other words, the value of a neuron before the activation function is applied:

\[
z^i_j = \sum_k w^i_{jk} \, a^{i-1}_k + b^i_j \tag{3.4}
\]
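Equations (3.3) and (3.4) amount to a matrix-vector product followed by an element-wise activation in each layer. The sketch below runs a forward pass through a network with two hidden layers; the layer sizes, random weights, and the choice of a sigmoid activation are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Apply Eqs. (3.4) and (3.3) layer by layer and return all activations."""
    activations = [np.asarray(x, dtype=np.float64)]
    for W, b in zip(weights, biases):
        z = W @ activations[-1] + b       # Eq. (3.4): weighted input of the layer
        activations.append(sigmoid(z))    # Eq. (3.3): activation of the layer
    return activations

rng = np.random.default_rng(0)
sizes = [4, 5, 5, 2]   # input, two hidden layers, output (hypothetical sizes)
weights = [rng.normal(size=(n_out, n_in)) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]
acts = forward(rng.normal(size=4), weights, biases)
print([a.shape for a in acts])
```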


3.3.2 Activation functions

The most important unit in a neural network structure is a scalar-to-scalar function called the “activation function” (also known as the threshold function or transfer function), whose output value is called the “unit’s activation”. An activation function limits the amplitude of the output of a neuron. Its goal is to transform its input into an output that makes binary decisions more separable. The most widely used activation functions are sigmoid, tanh, and the rectified linear unit (ReLU). Sigmoid and tanh saturate for inputs of large magnitude, whereas ReLU does not saturate for positive inputs, which usually makes learning faster.

Sigmoid functions are non-linear and have the mathematical form given in (3.5). They are often used for mathematical convenience because their derivatives are very easy to calculate, which we use to compute the weight updates in training algorithms.

\[
a^i_j = \sigma(z^i_j) = \frac{1}{1 + e^{-z^i_j}} \tag{3.5}
\]

The tanh function, given in (3.6), is linearly related to the sigmoid function and can be seen as a rescaled version of it, so that its output range is between -1 and 1.

\[
a^i_j = \sigma(z^i_j) = \tanh(z^i_j) \tag{3.6}
\]

ReLU functions are the most popular choice for deeper architectures. ReLU can be seen as a ramp function whose range runs from 0 to infinity, and it is much cheaper to compute than the sigmoid function. The biggest benefit of ReLU is that it mitigates the vanishing gradient problem, since its gradient is 1 for all positive inputs.

\[
a^i_j = \sigma(z^i_j) = \max(0, z^i_j) \tag{3.7}
\]
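The three activation functions (3.5) to (3.7) are one-liners in NumPy. The sketch also checks numerically that tanh is a rescaled sigmoid, using the identity tanh(z) = 2σ(2z) − 1; the function names are chosen for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))      # Eq. (3.5), output in (0, 1)

def tanh(z):
    return np.tanh(z)                    # Eq. (3.6), output in (-1, 1)

def relu(z):
    return np.maximum(0.0, z)            # Eq. (3.7), output in [0, inf)

z = np.linspace(-4.0, 4.0, 9)
# tanh is a shifted and rescaled sigmoid.
print(np.allclose(tanh(z), 2.0 * sigmoid(2.0 * z) - 1.0))
print(relu(np.array([-1.0, 0.0, 2.0])))
```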

3.3.3 Softmax functions

In our work, we make use of the softmax function as the output of the classifier, which represents a probability distribution over $C$ classes. In our case, $C = 2$, since we have only 2 classes: polyp and non-polyp. This function is a normalized exponential and is defined as:

\[
y_c = \varsigma(z)_c = \frac{e^{z_c}}{\sum_{d=1}^{C} e^{z_d}} \quad \text{for } c = 1, \dots, C \tag{3.8}
\]

Here the softmax function $\varsigma$ takes as input a $C$-dimensional vector $z$ and outputs a $C$-dimensional vector $y$ of real values between 0 and 1. The denominator $\sum_{d=1}^{C} e^{z_d}$ acts as a normalizer to make sure that $\sum_{c=1}^{C} y_c = 1$.
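A numerically stable sketch of the softmax function is shown below. Subtracting the maximum score before exponentiating is a common implementation trick (an addition for this example, not part of the definition) that leaves the result unchanged while avoiding overflow.

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis of z."""
    z = np.asarray(z, dtype=np.float64)
    shifted = z - z.max(axis=-1, keepdims=True)   # same result, avoids overflow in exp
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

# Two-class case (polyp vs. non-polyp) with hypothetical scores.
probs = softmax([2.0, 0.0])
print(probs, probs.sum())
```

For two classes, the softmax probability of the first class reduces to the sigmoid of the score difference.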

Loss functions

A loss function, or cost function, is used for parameter estimation when training neural networks. The choice of the loss function is an important aspect of designing a deep neural network. In our project, we make use of the cross-entropy loss function, which is defined as:

\[
L(X, Y) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y^{(i)} \ln a(x^{(i)}) + \left(1 - y^{(i)}\right) \ln\left(1 - a(x^{(i)})\right) \right] \tag{3.9}
\]


Here $X = \{x^{(1)}, \dots, x^{(n)}\}$ is the set of input examples in the training dataset, and $Y = \{y^{(1)}, \dots, y^{(n)}\}$ is the corresponding set of labels for those input examples. $a(x)$ is the output of the neural network given input $x$, which is typically restricted to the open interval (0, 1) by using a sigmoid (3.5) or softmax (3.8) output function.
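The cross-entropy loss (3.9) can be sketched as follows. The clipping constant `eps` is an assumption added to avoid taking the logarithm of zero; it is not part of the definition.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean cross-entropy of predictions y_pred in (0, 1) against labels in {0, 1}."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.clip(np.asarray(y_pred, dtype=np.float64), eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))

labels = [1, 0]
print(binary_cross_entropy(labels, [0.5, 0.5]))   # uninformative predictions
print(binary_cross_entropy(labels, [0.9, 0.1]))   # confident, correct predictions
```

Confident correct predictions give a lower loss than uninformative ones, which is the behaviour the optimizer exploits during training.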

Regularization

Regularization is a very important technique in neural network design to prevent over-fitting. Regularization works by extending the loss function with a regularization penalty $R(W)$:

\[
L = \underbrace{L(X, Y)}_{\text{loss function}} + \underbrace{\lambda R(W)}_{\text{regularization penalty}} \tag{3.10}
\]

The penalty is weighted by a hyper-parameter $\lambda$ in order to prevent the weight coefficients from growing too large. The most common regularization penalty is the L2 norm, which is utilized in our design. It is defined as:

\[
R(W) = \sum_k \sum_l W_{k,l}^2 \tag{3.11}
\]
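A minimal sketch of Eqs. (3.10) and (3.11) is given below. Summing the penalty over several weight matrices and the value of λ are assumptions for this illustration.

```python
import numpy as np

def l2_penalty(W):
    """R(W) of Eq. (3.11): the sum of all squared weights."""
    return np.sum(np.square(W))

def regularized_loss(data_loss, weight_matrices, lam):
    """Eq. (3.10), with the penalty summed over all weight matrices (an assumption)."""
    return data_loss + lam * sum(l2_penalty(W) for W in weight_matrices)

W = np.array([[1.0, 2.0], [3.0, 4.0]])
print(l2_penalty(W))                                 # 1 + 4 + 9 + 16
print(regularized_loss(0.5, [W], lam=0.01))
```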

3.3.4 Gradient descent optimizers

Gradient descent is one of the most popular algorithms for optimizing neural networks. Five popular optimization techniques are stochastic gradient descent (SGD), SGD with momentum, Adagrad [13], Adadelta [60], and Adam [27]. These are methods for finding a local optimum (a global one when dealing with a convex problem) of a differentiable function. The gradient descent algorithm is used in every layer to update the weights in the direction of the negative gradient, which is computed by the backpropagation learning algorithm.

In our work, we choose Adadelta as the optimizer of the model. In practice, Adadelta seems to be "safer" because it does not depend so strongly on the setting of the learning rate. In our own experiments, it also always gave us the quickest convergence and performed better than AdaGrad, or SGD with momentum and a decaying learning rate. The full Adadelta algorithm is shown in Figure 3.8.

FIGURE 3.8: Algorithm of computing Adadelta [60].

Although the Adadelta algorithm strives to do away with learning rate tuning, in practice the issue is not completely solved. Setting and tuning the constant $\epsilon$ and the decay rate $\rho$ is still important and necessary in our work to achieve a sound performance curve, while the adaptation can effectively counter the learning rate with its own scaling if the optimization directs it in that direction. The constant $\epsilon$ can be considered the 'learning rate' of Adadelta because it effectively determines the update $\Delta x_t$, since $RMS[\Delta x]_t = \sqrt{E[\Delta x^2]_t + \epsilon}$ and $E[\Delta x^2]_t = \rho E[\Delta x^2]_{t-1} + (1 - \rho) \Delta x_t^2$, where RMS stands for root mean square.
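The update rules can be sketched as a short loop, following the Adadelta algorithm of [60]: accumulate squared gradients, compute the update from the ratio of the two RMS terms, then accumulate squared updates. The values ρ = 0.95 and ε = 1e-6, the function name `adadelta_minimize`, and the quadratic test function are assumptions for illustration, not the settings of our experiments.

```python
import numpy as np

def adadelta_minimize(grad_fn, x0, rho=0.95, eps=1e-6, steps=100):
    """Run a fixed number of Adadelta updates on parameters x."""
    x = np.array(x0, dtype=np.float64)
    avg_sq_grad = np.zeros_like(x)    # E[g^2]
    avg_sq_delta = np.zeros_like(x)   # E[dx^2]
    for _ in range(steps):
        g = grad_fn(x)
        avg_sq_grad = rho * avg_sq_grad + (1.0 - rho) * g ** 2
        # dx_t = -(RMS[dx]_{t-1} / RMS[g]_t) * g_t
        delta = -np.sqrt(avg_sq_delta + eps) / np.sqrt(avg_sq_grad + eps) * g
        avg_sq_delta = rho * avg_sq_delta + (1.0 - rho) * delta ** 2
        x = x + delta
    return x

# Hypothetical objective f(x) = (x - 3)^2 with gradient 2 (x - 3).
grad = lambda x: 2.0 * (x - 3.0)
print(adadelta_minimize(grad, [0.0], steps=1))
print(adadelta_minimize(grad, [0.0], steps=100))
```

Because the accumulator of squared updates starts at zero, the size of the very first step is set by ε alone, which illustrates why ε behaves like a learning rate early in training.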

Backpropagation

Technically, the backpropagation algorithm is a supervised learning method for training the weights in multilayer feed-forward neural networks. The algorithm can be divided into two phases: propagation and weight update.

Propagation covers two steps: first, forward propagation of a training input through the neural network, and then backward propagation of the generated deltas (the errors between the targeted and actual output values). The weight update also follows two steps: first, the weight’s delta and input activation are multiplied to obtain the gradient of the weight, and then a ratio of that gradient (the learning rate) is subtracted from the weight.
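The two phases can be illustrated on the smallest possible case, a single sigmoid output layer. With a sigmoid output and the cross-entropy loss, the output delta is simply the difference between the actual and the targeted output; this single-layer setting, the learning rate, and the function name `train_step` are assumptions for this sketch, and a deeper network would additionally propagate the deltas backward through each hidden layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(W, b, x, y, lr=0.5):
    """One propagation and weight-update cycle for a single sigmoid layer."""
    a = sigmoid(W @ x + b)          # forward propagation
    delta = a - y                   # output delta (sigmoid + cross-entropy)
    grad_W = np.outer(delta, x)     # delta multiplied by the input activation
    W_new = W - lr * grad_W         # subtract a ratio (lr) of the gradient
    b_new = b - lr * delta
    return W_new, b_new, a

W0, b0 = np.zeros((1, 2)), np.zeros(1)
x, y = np.array([1.0, 2.0]), np.array([1.0])
W1, b1, a0 = train_step(W0, b0, x, y)
print(a0, sigmoid(W1 @ x + b1))     # prediction before and after the update
```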

3.4 Convolutional networks

Convolutional networks (ConvNets) [29], also known as convolutional neural networks (CNNs), are a specialized kind of neural network for processing data that has a known, grid-like topology [17]. In principle, ConvNets are very similar to regular neural networks, which consist of neurons with learnable weights and biases. However, ConvNet architectures make the explicit assumption that the inputs are images, which allows us to encode certain properties into the architecture. These make the forward function more efficient to implement and vastly reduce the number of parameters in the network compared with regular neural nets, which do not scale well to full images [26]. For instance, for an input image of size 200x200x3 (200 wide, 200 high, 3 color channels), a single fully-connected neuron in the first hidden layer would have 200*200*3 = 120,000 weights. Moreover, we would almost certainly want many such neurons, so the number of parameters would grow quickly, which would cause over-fitting.

3.4.1 Convolutional layer

The convolutional layer is the core building block of a convolutional network. It convolves the input image with convolution matrices (also known as kernel filters) to generate the output feature maps. Figure 3.9 is an example of a convolution operation with 2 kernel filters (5x5x3x2) on an RGB image of size 28x28x3. The output feature maps are produced by the following form:

\[
(k * im)(x, y) = \sum_{c=0}^{2} \sum_{n=0}^{4} \sum_{m=0}^{4} k(n, m, c) \cdot im(x + n - 2,\; y + m - 2,\; c) \tag{3.12}
\]
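Equation (3.12) can be evaluated at every pixel with the sketch below, which handles one 5x5x3 kernel. Treating pixels outside the image as zero (zero-padding of 2) is an assumption made so that the output has the same spatial size as the input; the function name is chosen for this example.

```python
import numpy as np

def convolve_eq_3_12(im, k):
    """Evaluate Eq. (3.12) for an H x W x C image and one F x F x C kernel."""
    H, W, _ = im.shape
    F = k.shape[0]                     # 5 in Eq. (3.12)
    pad = F // 2                       # the offset 2 in Eq. (3.12)
    padded = np.pad(im, ((pad, pad), (pad, pad), (0, 0)))   # zeros outside the image
    out = np.zeros((H, W))
    for n in range(F):
        for m in range(F):
            # padded[x + n, y + m] corresponds to im[x + n - 2, y + m - 2].
            window = padded[n:n + H, m:m + W, :]
            out += np.tensordot(window, k[n, m, :], axes=([2], [0]))
    return out

im = np.ones((28, 28, 3))
k = np.ones((5, 5, 3))
out = convolve_eq_3_12(im, k)
print(out.shape, out[14, 14], out[0, 0])
```

In the interior all 5*5*3 = 75 products contribute, whereas at a corner only a 3x3 part of the window overlaps the image, giving 27.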

FIGURE 3.9: Convolutional operation examples on a 3-channel image.

Here the input volume size is represented as $H_i \times W_i \times C_i$, and the kernel filter setting is $F \times F \times C_i \times K$, where $F$ stands for the size of the kernel, $C_i$ is the number of channels of the kernel (which must be equal to the number of channels of the input), and $K$ stands for the number of kernel filters. Given a stride of $S$ and a zero-padding of $P$, the volume of the output maps ($H_o \times W_o \times C_o$) is given by:

• $H_o = (H_i - F + 2P)/S + 1$ (the output height)

• $W_o = (W_i - F + 2P)/S + 1$ (the output width)

• $C_o = K$ (the output channels or depth)
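The output-size formulas above translate directly into a helper function. Rounding down when the stride does not divide evenly is an assumption borrowed from common deep learning libraries; the formulas in the text assume an exact division.

```python
def conv_output_shape(h_in, w_in, kernel, num_filters, stride=1, padding=0):
    """Output volume (H_o, W_o, C_o) of a convolutional layer."""
    h_out = (h_in - kernel + 2 * padding) // stride + 1   # floor division (assumption)
    w_out = (w_in - kernel + 2 * padding) // stride + 1
    return h_out, w_out, num_filters

# The 28x28x3 image with two 5x5 kernels, without and with zero-padding of 2.
print(conv_output_shape(28, 28, kernel=5, num_filters=2))
print(conv_output_shape(28, 28, kernel=5, num_filters=2, padding=2))
```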

Stride and padding

As we can see, there are two key hyper-parameters that control the size of the output volume: stride and padding. Stride controls how the filter convolves around the input volume. When the stride is 1, the filters move one pixel at a time. When the stride is 2, the filters jump 2 pixels at a time as we slide them around, which produces spatially smaller output maps. Sometimes it is also necessary to pad the input volume with zeros around the border. Zero-padding allows us to control the spatial size of the output feature maps. Figure 3.10 illustrates an example of a zero-padding and stride setting that produces an output with the same spatial size as the input volume.

3.4.2 Pooling layer

The pooling layer (also known as the subsampling layer) is a popular way to down-sample the input volume spatially and hence to reduce the number of parameters and the amount of computation in the network. It also gives the network more invariance and robustness and helps to control over-fitting.

FIGURE 3.10: An example of setting zero-padding and strides.

In practice, pooling layers are commonly inserted in-between successive convolutional layers in a ConvNet model. The most used pooling method in image processing tasks is max pooling. Max pooling decreases the dimension of the input volume simply by taking only the maximum value from each fixed region, while average pooling takes the average of each region, as shown in Figure 3.11.

FIGURE 3.11: Max and average pooling examples for subsampling features.
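Both pooling variants can be sketched on a single feature map as follows. Non-overlapping windows (stride equal to the window size) and the function name `pool2d` are assumptions for this illustration.

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping max or average pooling of a 2-D feature map."""
    H, W = x.shape
    H2, W2 = H // size, W // size
    # Split the map into H2 x W2 blocks of size x size pixels each.
    blocks = x[:H2 * size, :W2 * size].reshape(H2, size, W2, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    if mode == "avg":
        return blocks.mean(axis=(1, 3))
    raise ValueError("mode must be 'max' or 'avg'")

feature_map = np.arange(16, dtype=np.float64).reshape(4, 4)
print(pool2d(feature_map, mode="max"))
print(pool2d(feature_map, mode="avg"))
```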

In addition to max pooling, average pooling or even L2-norm pooling was often used historically. However, these have recently fallen out of favor compared to max pooling, which has been shown to work better in practice [26]. We nevertheless made use of both max pooling and average pooling in our neural networks, and found that average pooling performed better than max pooling in some situations.


3.4.3 Dropout layer

In our work, we introduced dropout layers to avoid over-fitting. The idea of dropout is simple. A dropout layer “drops out” a random set of activations in that layer by setting them to zero. This forces the network to learn multiple, redundant characteristics of the input examples, so that it can still provide the right output even if some of the activations are dropped out. Figure 3.12 illustrates an example of applying dropout to a neural network.
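A minimal sketch of a dropout layer is given below. It uses the common "inverted dropout" variant, which rescales the surviving activations during training so that nothing needs to change at test time; this variant, the drop rate, and the function name are assumptions for illustration.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Zero a random fraction `rate` of activations (inverted dropout)."""
    if not training or rate == 0.0:
        return activations
    keep = rng.random(activations.shape) >= rate   # True for units that survive
    return activations * keep / (1.0 - rate)       # rescale to keep the expected value

rng = np.random.default_rng(0)
acts = np.ones(1000)
dropped = dropout(acts, rate=0.5, rng=rng)
print((dropped == 0).mean(), dropped.max())
```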

FIGURE 3.12: An example of applying dropout to a neural network.

FIGURE 3.13: The structure of a residual block for ResNet.

3.4.4 ResNet architecture

In our work, we propose to utilize the 50-layer ResNet as our deep learning model. ResNets reformulate the layers as residual blocks. The idea behind a residual block is that the input $x$ goes through some convolution layers, giving the result $f(x)$. That result is then added to the original input $x$; let us call the sum $y(x) = f(x) + x$. In traditional CNNs, $y(x)$ would just be equal to $f(x)$. So instead of computing the transformation from $x$ directly to $f(x)$, in ResNet we compute a term $f(x)$ that is added to the identity $x$, as shown in Figure 3.13.
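The relation y(x) = f(x) + x can be sketched with two linear layers standing in for the convolutions of the residual branch. The dense layers, the omission of any non-linearity after the addition, and the random weights are simplifying assumptions for this illustration, not the actual ResNet50 block.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def residual_block(x, W1, W2):
    """y(x) = f(x) + x, with f made of two linear layers and one ReLU."""
    f = W2 @ relu(W1 @ x)    # residual branch f(x)
    return f + x             # shortcut branch adds the identity

rng = np.random.default_rng(0)
x = rng.normal(size=8)
W1 = rng.normal(size=(8, 8))
# If the residual branch outputs zero, the block reduces to the identity mapping.
y_identity = residual_block(x, W1, np.zeros((8, 8)))
print(np.array_equal(y_identity, x))
```

This shows why extra residual blocks are easy to train: a block only has to drive its residual branch toward zero to pass its input through unchanged.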

The residual network design addresses the problem of vanishing gradients in a simple way; the main challenge in training deeper networks is that accuracy degrades with network depth. Residual learning is a significant innovation and has become one of the most popular ways to build deep convolutional neural networks. At the time of writing, ResNet is among the best single CNN architectures for object detection, which is the main reason we chose this model for our work. Figure 3.14 illustrates a ResNet with 50 layers. ResNets use bottleneck blocks with different numbers of repetitions, converge very fast, and can be trained with hundreds or even thousands of layers.

FIGURE 3.14: The architecture of 50-layer ResNet.