• No results found

Convolutional Neural Networks

N/A
N/A
Protected

Academic year: 2022

Share "Convolutional Neural Networks"

Copied!
66
0
0

Laster.... (Se fulltekst nå)

Fulltekst

(1)

Introduction to

Convolutional Neural Networks

(2)

Motivation

+

pixel 1 pixel 2

+

+

+

‐ ‐

+

+ Coffee Mug

Not Coffee Mug

modified slides originally by Adam Coates

(3)

+

pixel 1 pixel 2

+

+

+

‐ ‐

+

+ Coffee Mug

Not Coffee Mug

+

pixel 1 pixel 2

+

+

+

‐ ‐

+

Is this a Coffee Mug?

Learning Algorithm

Motivation

(4)

Need stronger feature representations!

+

handle?

cylinder?

++

+

+

+ Coffee Mug

Not Coffee Mug

cylinder?

handle?

Is this a Coffee Mug?

Learning Algorithm +

handle?

cylinder?

++

+

+

modified slides originally by Adam Coates

(5)

From “swallow” to “deep” mappings (networks)

Images, shapes, natural language have compositional structure and patterns.

Deep neural networks can learn powerful feature representations capturing those.

(6)

Suppose you want to predict mug or no mug in an image.

Output:

Input:

Classification basics: Logistic Regression

[coffee mug], 0 [no coffee mug]

y = 1 y =

1 2

{ , ,...} [pixel intensities, gradients, SIFT, etc]x x

x

(7)

Suppose you want to predict mug or no mug in an image.

Output:

Input: 

Classification function:

Classification basics: Logistic Regression

( | ) ( ) ( )

where is a ( ) 1

1 exp( )

P

  

    

f w

w weight vector (paramet

x x

er y = 1

x

s

w w

x

x

)

y

( )

w x

[coffee mug], 0 [no coffee mug]

y = 1 y =

1 2

{ , ,...} [pixel intensities, gradients, SIFT, etc]x x

x

(8)

Logistic regression: training

Need to estimate parameters w from training data e.g., images of 

objects xi and given labels yi (mugs/no mugs) (i=1…N training images) Find parameters that maximize probability of training data

[ 1] [ 0]

1

max N ( 1 | ) [1 ( 1 | )]

i

i i

P  P 

yi yi

w y x y x

(9)

Logistic regression: training

[ 1] [ 0]

1

max N ( i) [1 ( )]

i

 i 

yi i

w

w x w x y

Need to estimate parameters w from training data e.g., images of 

objects xi and given labels yi (mugs/no mugs) (i=1…N training images) Find parameters that maximize probability of training data

(10)

Logistic regression: training

[ 1] [ 0]

1

max log{ N ( i) [1 ( i)] }

i

 

yi yi

w w x w x

Need to estimate parameters w from training data e.g., images of 

objects xi and given labels yi (mugs/no mugs) (i=1…N training images) Find parameters that maximize the log probability of training data

(11)

Logistic regression: training

1

max N [ 1]log ( i) [ 0]log(1 ( ))

i

i

 

i i

w y w x y w x

Need to estimate parameters w from training data e.g., images of 

objects xi and given labels yi (mugs/no mugs) (i=1…N training images) Find parameters that maximize the log probability of training data

(12)

Logistic regression: training

1

min N [ 1]log ( ) [ 0]log(1 ( i))

i

i

 i 

w yi w x y w x

Need to estimate parameters w from training data e.g., images of 

objects xi and given labels yi (mugs/no mugs) (i=1…N training images)

Find parameters that minimize the negative log probability of training data

(13)

Logistic regression: training

1

min N [ 1]log ( ) [ 0]log(1 ( i))

i

i

 i 

w yi

L(w)

w x y w x

Need to estimate parameters w from training data e.g., images of 

objects xi and given labels yi (mugs/no mugs) (i=1…N training images) This is called (negative) log likelihood:

(14)

Logistic regression: training

,

( ) i d[ i ( ) ]

i

i d

L x

w y

   

w

w x

(partial derivative for dth parameter)

1

min N [ 1]log ( ) [ 0]log(1 ( i))

i

i

 i 

w yi

L(w)

w x y w x

Need to estimate parameters w from training data e.g., images of 

objects xi and given labels yi (mugs/no mugs) (i=1…N training images) We now have an optimization problem:

(15)

How can we minimize/maximize a function?

Gradient descent: Given a random initialization of 

parameters and a step rate η, update them according to:

newold   L( )

w w w

(16)

Regularization

Overfitting: few training data and number of parameters is large!

Penalize large weights: 

Called ridge regression (or L2 regularization)

min ( ) w

2 d

L   

d

w

w

(17)

Back to our original example…

+

pixel 1 pixel 2

+

+

+

‐ ‐

+

+ Coffee Mug

Not Coffee Mug

+

pixel 1 pixel 2

+

+

+

‐ ‐

+

Is this a Coffee Mug?

Learning Algorithm

classification boundary

(18)

How can we learn better feature representations?

+

handle?

cylinder?

++

+

+

+ Coffee Mug

Not Coffee Mug

cylinder?

handle?

Is this a Coffee Mug?

Learning Algorithm +

handle?

cylinder?

++

+

+

modified slides originally by Adam Coates

classification boundary

(19)

Fixed/engineered descriptors + trained classifier/regressor 

"Traditional" recognition pipeline

car?

"Hand‐engineered" 

Descriptor Extractor e.g. SIFT, bags‐of‐words

Trained

classifier/regressor

(20)

Trained descriptors + trained classifier/regressor 

"New" recognition pipeline

car?

Trained Descriptor

Extractor

Trained

classifier/regressor

modified slides originally by Adam Coates

(21)

From “swallow” to “deep” mappings (networks)

Logistic regression: output is a direct function of inputs. Think of it as a net:

x1 x2 x3 ... xd 1

y y f x( ) (w x )

(22)

Neural network

Introduce latent nodes that play the role of learned feature representations. 

(1)

1 ( 1 )

h w x h2 h2 (w2(1) x) ( (2) )

y w h

x1 x2 x3 ... xd 1

h1 1

y

modified slides originally by Adam Coates

(23)

Neural network

Same as logistic regression but now our output function has multiple  stages ("layers", "modules").

Intermediate representation Prediction

x (W(1)x) h (W(2)h)

( )

( ) ( )

( )

...

m

where

 

1 2

w W w

w

y

(24)

Biological Neurons

Axon

Terminal Branches of Axon Dendrites

(25)

Analogy with biological networks 

Activation Function

(26)

Neural network

Stack up several layers:

x1 x2 x3 ... xd 1 1 1 h1 h2 h3 ... hm

h1' h2' h3' hn' y

modified slides originally by Adam Coates

(27)

Forward propagation

Process to compute output:

x1 x2 x3 ... xd 1

(28)

Forward propagation

x1 x2 x3 ... xd 1 h1 h2 h3 ... hm 1

x (W(1)x) h Process to compute output:

modified slides originally by Adam Coates

(29)

Forward propagation

x1 x2 x3 ... xd 1 1 1 h1 h2 h3 ... hm

h1' h2' h3' hn'

x (W(1)x) h (W(2)h) h' Process to compute output:

(30)

Forward propagation

x1 x2 x3 ... xd 1 1 1 h1 h2 h3 ... hm

h1' h2' h3' hn' y

x (W(1)x) h (W(2)h) h' (W(3) h') y Process to compute output:

modified slides originally by Adam Coates

(31)

Multiple outputs

x (W(1)x) h (W(2)h) h' (W(3) h') y

x1 x2 x3 ... xd 1 1 1 h1 h2 h3 ... hm

h1' h2' h3' hn' y1 y2

(32)

How can you learn the parameters? 

Use a loss function e.g., for classification:

In case of regression i.e., for predicting continuous outputs:

1

( ) [ 1]log ( ) [t 0]log(1 t( ))

i ou

i i

tput t

L f f

 

 

yi,t  x yi,t  x w

,

( ) [ t( )]2

i

i outp

i t

ut t

L w

 

yf x

(33)

Backpropagation

For each training example i (omit index i for clarity):

(3) ( )

t yt f

x

1 1 h1 h2 h3 ... hm

h1' h2' h3' hn'

y1 y2 For each output:

(3) (3)

,

( )

t n

t n

L h

w

w

(3)

w1,1

(3)

w2,1

(34)

x1 x2 x3 ... xd 1 1 1 h1 h2 h3 ... hm

h1' h2' h3' hn' y1 y2

Backpropagation

(2) (2) (3) (3)

'( ) ,

n t n t

t

wn h

w '( ) ( )[1 ( )]

   

Note:

For each training example i (omit index i for clarity):

(2)

w1,1 w2,1(2)

(2) (2)

,

( )

n m

n m

L h

w

w

(35)

(1) (1) (2) (2)

'( ) ,

m m n m n

n

w x

w 1

1 h1 h2 h3 ... hm

h1' h2' h3' hn' y1 y2

Backpropagation

For each training example i (omit index i for clarity):

(1)

w1,1 (1)

w2,1

( ) (1)

L x

w

(36)

Is this magic?

All these are derivatives derived analytically using the chain rule! 

Gradient descent is expressed through backpropagation of messages δ following the structure of the model

(37)

Training algorithm

For each training example [in a batch]

1. Forward propagation to compute outputs per layer 2. Back propagate messages δ from top to bottom layer

3. Multiply messages δ with inputs to compute derivatives per layer 4. Accumulate the derivatives from that training example

(38)

Yet, this does not work so easily...

(39)

Yet, this does not work so easily...

Non‐convex:  Local minima;  convergence criteria.

• Optimization becomes difficult with many layers.

• Hard to diagnose and debug malfunctions.

Many things turn out to matter:

Choice of nonlinearities.

Initialization of parameters.

(40)

Non‐linearities

Choice of functions inside network matters.

Sigmoid function yields highly non‐convex loss functions

Some other choices often used:

1

‐1

1

tanh(∙) ReLu(∙) = max{0, ∙}

“Rectified Linear Unit”

Increasingly popular.

1

abs(∙)

[Nair & Hinton, 2010]

tanh'(∙)= 1 - tanh(∙)2 abs'(∙)= sign(∙) ReLu'(∙)= [ · > 0 ]

(41)

Initialization

Usually small random values.

Try to choose so that typical input to a neuron avoids saturating 

Initialization schemes for weights used as input to a node:

tanh units:  Uniform[‐r, r];  sigmoid:  Uniform[‐4r, 4r].

See [Glorot et al., AISTATS 2010]

1

(42)

Step size

Fixed step‐size

try many, choose the best...

pick size with least test error on a validation set after T iterations

Dynamic step size

decrease after T iterations

if simply the objective is not decreasing much, cut step by half

(43)

Momentum/L2 regularization

Modify stochastic/batch gradient descent: 

“Smooth” estimate of gradient from several steps of gradient descent:

High‐curvature directions cancel out, low‐curvature directions “add up” and  accelerate.

Other techniques: Adagrad, Adadelta…

Before : ( ),

With momentum : previous ( ), L w w

L w w

 

     

       

w

w

w w w

w w w w

Add L2 regularization to the loss function: 

(44)

Yet, things will not still work well!

(45)

Main problem

Extremely large number of connections.

More parameters to train.

Higher computational expense.

(46)

Local connectivity

Reduce parameters with  local connections! 

modified slides originally by Adam Coates

(47)

Neurons as convolution filters

Now think of neurons as convolutional  filters acted on small adjacent 

(possibly overlapping) windows

Window size is called 

“receptive field” size and spacing is called 

“step” or “stride”

(48)

Extract repeated structure

, ( )

hp f g w xf p

Apply the same filter (weights) throughout the image Dramatically reduces the number of parameters

modified slides originally by Adam Coates

(49)

Convolution reminder [animated]

(red numbers are filter values)

(50)

Response per pixel p, per filter f for a transfer function g:

Can have many filters!

modified slides originally by Adam Coates

(51)

Example: multiple 3D filters working on multiple channels

(52)

Pooling

Apart from hidden layers dedicated to convolution, we can have layers  dedicated to extract locally invariant descriptors

[Scherer et al., ICANN 2010]

[Boureau et al., ICML 2010]

', max( )

p f p

h xp

Max pooling:

', ( )

p f p

h avg xp

Mean pooling:

',

p f gaussian

hwxp

Fixed filter (e.g., Gaussian):

Progressively reduce the resolution of the image, so that the next  convolutional filters are applied on larger scales  

(53)

Interchange convolutional and pooling (subsampling) layers. 

In the end, unwrap all feature maps into a single feature vector and pass it  through the classical (fully connected) neural network.

A mini convolutional neural network

(54)

LeNet

Initial architecture from LeCun et al., 1998:

Convolutional layers with tanh non‐linearity Max‐pooling layers

Stochastic gradient descent Applied to digit recognition

(55)

AlexNet

Proposed architecture from Krizhevsky et al., NIPS 2012:

Convolutional layers with Rectified linear units Max‐pooling layers

Stochastic gradient descent on GPU with momentum, L2 regularization, dropout

Applied to image classification (ImageNet competition – top runner & game changer)

(56)

Application: ImageNet classification

Top result in ILSVRC 2012  [~85%, Top‐5 accuracy]

Krizhevsky et al., NIPS 2012

(57)

Learned representations

Think of convolution filters as optimized feature templates capturing  various hierarchical patterns (edges, local structures, sub‐parts, parts…)

(58)

Multi‐view CNNs for shape analysis

CNN1

CNN1

CNN1

CNN1

. . .

CNN1: a ConvNet extracting image features

Image from Hang Su, Subhransu Maji, Evangelos Kalogerakis, Erik Learned-Miller Multi-view Convolutional Neural Networks for 3D Shape Recognition, ICCV 2015

(59)

All image features are combined by view pooling…

CNN1

. . .

View pooling: element‐wise max‐pooling across all views

View  pooling

(60)

… then passed through CNN2 and to generate final prediction

CNN1

. . .

View  pooling

CNN2:       a second ConvNet  producing shape descriptors 

CNN2

softmax

Image from Hang Su, Subhransu Maji, Evangelos Kalogerakis, Erik Learned-Miller Multi-view Convolutional Neural Networks for 3D Shape Recognition, ICCV 2015

(61)

Train on image datasets!

CNN1

. . .

View  pooling

CNN2

CNNs pre‐trained on ImageNet (leverage large image datasets for training shape analysis techniques!)

softmax

(62)

… and then fine‐tune on 3D datasets!

CNN1

. . .

View  pooling

CNN2

Fine‐tuning on shape  databases

softmax

Image from Hang Su, Subhransu Maji, Evangelos Kalogerakis, Erik Learned-Miller Multi-view Convolutional Neural Networks for 3D Shape Recognition, ICCV 2015

(63)

Volumetric CNNs

Key idea: represent a shape as  a volumetric image with binary  voxels. 

Learn filters operating on these volumetric data.

(64)

Volumetric CNNs

Image from Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang and J. Xiao 3D ShapeNets: A Deep Representation for Volumetric Shapes, 2015

(65)

Comparison

Shape retrieval evaluation in ModelNet40: 

(66)

Summary

CNNs can learn highly discriminative, hierarchical, powerful feature  representations for image and shape analysis.

Deep learning and CNNs have revolutionized computer vision, robotics, NLP, 

machine learning: solve hard tasks, achieve performance comparable to humans.

Why do we still use far‐from‐optimal, ‘old‐style’ descriptors in CG?

Deep learning has also shown very promising results in image and shape synthesis  [see Data‐Driven Shape Analysis and Processing, EG’16 STAR report].

Referanser

RELATERTE DOKUMENTER

Our approach provides a visualisation that maps the principal contrasting features of a batch of images to the original image space in a single forward pass of the network.. We

In this sec- tion we want to discuss the test performances of three optimal models from each data distribution in light of pre-processing, hyperparameter tuning, and model

Real-time Standard View Classification in Transthoracic Echocardiography using Convolutional Neural Networks.. Andreas Østvik a , Erik Smistad a,b , Svein Arne Aase c , Bjørn

This paper proposes a novel multi-period probabilistic load and generating forecasting model for distributed energy resources based on convolutional neural networks and

We used cropped microscope images of blood smear to validate our model during training and test it after training.. The images are cropped from larger images captured at

As the question of efficiently using deep Convolutional Neural Networks (CNNs) on 3D data is still a pending issue, we propose a framework which applies CNNs on multiple 2D image

The obtained results show the significantly high performance in the classification rates of the proposed model and its ability to decrease the computational cost by training low

In contrast, as our focus is the embedding of machine readable information on the 3D printed object, we see watermark retrieval as a classic computer vision