Introduction to
Convolutional Neural Networks
Motivation
[Figure: images of mugs (+) and non‐mugs (−) plotted by two raw pixel intensities ("pixel 1" vs. "pixel 2"); the two classes are thoroughly mixed, so no simple boundary separates "Coffee Mug" from "Not Coffee Mug" in raw pixel space.]
Is this a Coffee Mug? → Learning Algorithm
modified slides originally by Adam Coates
Motivation
Need stronger feature representations!
[Figure: the same images plotted by two higher‐level features ("cylinder?" vs. "handle?"); with these features the "Coffee Mug" (+) and "Not Coffee Mug" (−) examples become cleanly separable.]
Is this a Coffee Mug? → Learning Algorithm
From “shallow” to “deep” mappings (networks)
Images, shapes, and natural language have compositional structure and patterns.
Deep neural networks can learn powerful feature representations capturing them.
Classification basics: Logistic Regression
Suppose you want to predict mug or no mug in an image.
Output: y = 1 [coffee mug], y = 0 [no coffee mug]
Input: x = {x_1, x_2, ...} [pixel intensities, gradients, SIFT, etc.]
Classification basics: Logistic Regression
Suppose you want to predict mug or no mug in an image.
Output: y = 1 [coffee mug], y = 0 [no coffee mug]
Input: x = {x_1, x_2, ...} [pixel intensities, gradients, SIFT, etc.]
Classification function: P(y = 1 | x) = f(x) = σ(w · x), where σ(s) = 1 / (1 + exp(−s)) and w is a weight vector (the parameters).
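The classification function above is simple enough to write out directly. A minimal sketch in Python/NumPy, assuming the feature vector carries a trailing constant 1 for the bias term (that convention is an assumption, not stated on the slide):

```python
import numpy as np

def sigmoid(s):
    """Logistic function: squashes any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-s))

def predict_proba(w, x):
    """P(y = 1 | x) = sigma(w . x) for a single feature vector x."""
    return sigmoid(np.dot(w, x))

# Toy usage: 3 features plus a constant 1 appended for the bias term.
w = np.array([0.5, -1.2, 0.3, 0.1])
x = np.array([1.0, 0.2, 2.0, 1.0])
print(predict_proba(w, x))  # probability that the image is a coffee mug
```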
Logistic regression: training
Need to estimate parameters w from training data, e.g., images of objects x_i and given labels y_i (mugs/no mugs), i = 1…N training images.
Find parameters that maximize the probability of the training data:
max_w ∏_{i=1..N} P(y_i = 1 | x_i)^[y_i = 1] · [1 − P(y_i = 1 | x_i)]^[y_i = 0]
Logistic regression: training
Need to estimate parameters w from training data, e.g., images of objects x_i and given labels y_i (mugs/no mugs), i = 1…N training images.
Find parameters that maximize the probability of the training data:
max_w ∏_{i=1..N} σ(w · x_i)^[y_i = 1] · [1 − σ(w · x_i)]^[y_i = 0]
Logistic regression: training
Need to estimate parameters w from training data, e.g., images of objects x_i and given labels y_i (mugs/no mugs), i = 1…N training images.
Find parameters that maximize the log probability of the training data:
max_w log{ ∏_{i=1..N} σ(w · x_i)^[y_i = 1] · [1 − σ(w · x_i)]^[y_i = 0] }
Logistic regression: training
Need to estimate parameters w from training data, e.g., images of objects x_i and given labels y_i (mugs/no mugs), i = 1…N training images.
Find parameters that maximize the log probability of the training data:
max_w Σ_{i=1..N} [y_i = 1] log σ(w · x_i) + [y_i = 0] log(1 − σ(w · x_i))
Logistic regression: training
Need to estimate parameters w from training data, e.g., images of objects x_i and given labels y_i (mugs/no mugs), i = 1…N training images.
Find parameters that minimize the negative log probability of the training data:
min_w − Σ_{i=1..N} [y_i = 1] log σ(w · x_i) + [y_i = 0] log(1 − σ(w · x_i))
Logistic regression: training
Need to estimate parameters w from training data, e.g., images of objects x_i and given labels y_i (mugs/no mugs), i = 1…N training images.
This is called the (negative) log likelihood:
L(w) = − Σ_{i=1..N} [y_i = 1] log σ(w · x_i) + [y_i = 0] log(1 − σ(w · x_i))
min_w L(w)
Logistic regression: training
We now have an optimization problem:
min_w L(w) = − Σ_{i=1..N} [y_i = 1] log σ(w · x_i) + [y_i = 0] log(1 − σ(w · x_i))
Partial derivative for the dth parameter:
∂L(w)/∂w_d = − Σ_i [y_i − σ(w · x_i)] x_{i,d}
How can we minimize/maximize a function?
Gradient descent: given a random initialization of the parameters and a step rate η, update them according to:
w_new = w_old − η ∂L(w)/∂w
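A minimal batch gradient-descent sketch of this training procedure in Python/NumPy (the step rate, iteration count, averaging over N, and the appended bias column are illustrative assumptions):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def train_logistic_regression(X, y, eta=0.1, iters=1000):
    """Minimize the negative log likelihood L(w) by batch gradient descent.

    X: (N, d) array of feature vectors (append a column of 1s for a bias term).
    y: (N,) array of 0/1 labels.
    """
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        p = sigmoid(X @ w)           # sigma(w . x_i) for all i
        grad = -X.T @ (y - p)        # dL/dw_d = -sum_i [y_i - sigma(w . x_i)] x_{i,d}
        w = w - eta * grad / N       # gradient step (here averaged over the N examples)
    return w
```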
Regularization
Overfitting: few training data and the number of parameters is large!
Penalize large weights:
min_w L(w) + λ Σ_d w_d^2
Called ridge regression (or L2 regularization)
Back to our original example…
[Figure: images plotted by raw pixel intensities ("pixel 1" vs. "pixel 2"); the "Coffee Mug" (+) and "Not Coffee Mug" (−) examples overlap, and no learned classification boundary separates them well.]
Is this a Coffee Mug? → Learning Algorithm
How can we learn better feature representations?
[Figure: the same images plotted by higher‐level features ("cylinder?" vs. "handle?"); with these features a simple classification boundary separates "Coffee Mug" (+) from "Not Coffee Mug" (−).]
Is this a Coffee Mug? → Learning Algorithm
"Traditional" recognition pipeline
Fixed/engineered descriptors + trained classifier/regressor:
image → "hand‐engineered" descriptor extractor (e.g., SIFT, bags‐of‐words) → trained classifier/regressor → car?
"New" recognition pipeline
Trained descriptors + trained classifier/regressor:
image → trained descriptor extractor → trained classifier/regressor → car?
From “shallow” to “deep” mappings (networks)
Logistic regression: the output is a direct function of the inputs. Think of it as a net with input nodes x_1, x_2, …, x_d (plus a constant 1 for the bias) feeding a single output node:
y = f(x) = σ(w · x)
Neural network
Introduce latent nodes that play the role of learned feature representations:
h_1 = σ(w_1^(1) · x),  h_2 = σ(w_2^(1) · x),  …
y = σ(w^(2) · h)
[Diagram: inputs x_1 … x_d and a bias node 1 feed hidden nodes h_1, h_2, … (plus a bias node 1), which feed the output y.]
Neural network
Same as logistic regression, but now our output function has multiple stages ("layers", "modules"):
x → σ(W^(1) x) = h (intermediate representation) → σ(W^(2) h) = y (prediction)
where W is the matrix whose rows are the weight vectors w_1, w_2, …, w_m of the individual nodes.
Biological Neurons
[Figure: a biological neuron with dendrites, axon, and terminal branches of the axon; analogy with an artificial neuron and its activation function.]
Neural network
Stack up several layers:
[Diagram: inputs x_1 … x_d (plus bias 1) → hidden layer h_1 … h_m (plus bias 1) → hidden layer h_1' … h_n' → output y.]
Forward propagation
Process to compute the output, layer by layer:
x → σ(W^(1) x) = h
h → σ(W^(2) h) = h'
h' → σ(W^(3) h') = y
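A minimal Python/NumPy sketch of this forward pass (the layer sizes and random weights are placeholders; folding the bias into each weight matrix by appending a 1 to the layer input is an assumption about the convention):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def forward(x, W1, W2, W3):
    """Forward propagation: x -> h -> h' -> y, one sigmoid layer at a time."""
    h  = sigmoid(W1 @ np.append(x, 1.0))   # h  = sigma(W1 [x; 1])
    hp = sigmoid(W2 @ np.append(h, 1.0))   # h' = sigma(W2 [h; 1])
    y  = sigmoid(W3 @ np.append(hp, 1.0))  # y  = sigma(W3 [h'; 1])
    return h, hp, y

# Placeholder shapes: 4 inputs, 5 and 3 hidden units, 2 outputs.
rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.1, (5, 5))
W2 = rng.normal(0, 0.1, (3, 6))
W3 = rng.normal(0, 0.1, (2, 4))
_, _, y = forward(rng.normal(size=4), W1, W2, W3)
```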
Multiple outputs
Same process; the last layer simply has several output nodes y_1, y_2, …:
x → σ(W^(1) x) = h → σ(W^(2) h) = h' → σ(W^(3) h') = y
How can you learn the parameters?
Use a loss function, e.g., for classification:
L(w) = − Σ_i Σ_{t ∈ outputs} [y_{i,t} = 1] log f_t(x_i) + [y_{i,t} = 0] log(1 − f_t(x_i))
In case of regression, i.e., for predicting continuous outputs:
L(w) = Σ_i Σ_{t ∈ outputs} [y_{i,t} − f_t(x_i)]^2
Backpropagation
For each training example i (omit index i for clarity):
For each output t: δ_t^(3) = y_t − f_t(x)
Derivative for the output‐layer weight w_{t,n}^(3) (connecting h_n' to y_t):
∂L(w)/∂w_{t,n}^(3) = − δ_t^(3) h_n'
Backpropagation
(2) (2) (3) (3)
'( ) ,
n t n t
t
wn h
w '( ) ( )[1 ( )]
Note:
For each training example i (omit index i for clarity):
(2)
w1,1 w2,1(2)
(2) (2)
,
( )
n m
n m
L h
w
w
(1) (1) (2) (2)
'( ) ,
m m n m n
n
w x
w 11 h1 h2 h3 ... hm
h1' h2' h3' hn' y1 y2
Backpropagation
For each training example i (omit index i for clarity):
Derivative for the first‐layer weight w_{m,d}^(1) (connecting x_d to h_m):
∂L(w)/∂w_{m,d}^(1) = − δ_m^(1) x_d
Is this magic?
All these derivatives are derived analytically using the chain rule!
Gradient descent is expressed through backpropagation of messages δ, following the structure of the model.
Training algorithm
For each training example [in a batch]:
1. Forward propagation to compute outputs per layer.
2. Back‐propagate messages δ from the top to the bottom layer.
3. Multiply messages δ with inputs to compute derivatives per layer.
4. Accumulate the derivatives from that training example.
A sketch of these steps for a single hidden layer follows below.
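A minimal Python/NumPy sketch of the four steps above for a network with one hidden layer and sigmoid units (the layer sizes, step rate, and bias‐free weight matrices are simplifying assumptions):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def train_step(batch, W1, W2, eta=0.1):
    """One pass over a batch: forward, backpropagate deltas, accumulate, update."""
    dW1, dW2 = np.zeros_like(W1), np.zeros_like(W2)
    for x, y in batch:                                   # x: (d,), y: (k,) 0/1 targets
        # 1. Forward propagation to compute outputs per layer.
        h = sigmoid(W1 @ x)                              # hidden layer
        f = sigmoid(W2 @ h)                              # output layer
        # 2. Backpropagate messages delta from the top to the bottom layer.
        delta_out = y - f                                # delta_t = y_t - f_t(x)
        delta_hid = h * (1 - h) * (W2.T @ delta_out)     # sigma'(.) * sum_t w_{t,n} delta_t
        # 3. Multiply messages delta with layer inputs to get derivatives,
        # 4. and accumulate them over the batch.
        dW2 += -np.outer(delta_out, h)
        dW1 += -np.outer(delta_hid, x)
    # Gradient-descent update with the accumulated derivatives.
    return W1 - eta * dW1, W2 - eta * dW2
```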
Yet, this does not work so easily...
• Non‐convex: Local minima; convergence criteria.
• Optimization becomes difficult with many layers.
• Hard to diagnose and debug malfunctions.
• Many things turn out to matter:
• Choice of nonlinearities.
• Initialization of parameters.
Non‐linearities
• Choice of functions inside the network matters.
• The sigmoid function yields highly non‐convex loss functions.
• Some other choices often used:
  • tanh(∙), with tanh'(∙) = 1 − tanh(∙)^2
  • abs(∙), with abs'(∙) = sign(∙)
  • ReLU(∙) = max{0, ∙}, the "Rectified Linear Unit", increasingly popular [Nair & Hinton, 2010], with ReLU'(∙) = [∙ > 0]
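These non‐linearities and their derivatives are one‐liners; a small Python/NumPy sketch (vectorized over arrays):

```python
import numpy as np

def dtanh(s):
    """tanh'(s) = 1 - tanh(s)^2"""
    return 1.0 - np.tanh(s) ** 2

def dabs(s):
    """abs'(s) = sign(s)"""
    return np.sign(s)

def relu(s):
    """ReLU(s) = max{0, s}"""
    return np.maximum(0.0, s)

def drelu(s):
    """ReLU'(s) = [s > 0]"""
    return (s > 0).astype(float)
```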
Initialization
• Usually small random values.
• Try to choose them so that the typical input to a neuron avoids saturating the non‐linearity.
• Initialization schemes for the weights used as input to a node:
  • tanh units: Uniform[‐r, r]; sigmoid units: Uniform[‐4r, 4r].
  • See [Glorot et al., AISTATS 2010].
Step size
• Fixed step size:
  • try many, choose the best...
  • pick the size with the least error on a validation set after T iterations.
• Dynamic step size:
  • decrease after T iterations;
  • if the objective is simply not decreasing much, cut the step size in half.
Momentum/L2 regularization
Modify stochastic/batch gradient descent:
Before: Δw = −η ∂L(w)/∂w, then w ← w + Δw
With momentum: Δw = μ Δw_previous − η ∂L(w)/∂w, then w ← w + Δw (μ: momentum coefficient)
This gives a "smoothed" estimate of the gradient from several steps of gradient descent:
• High‐curvature directions cancel out; low‐curvature directions "add up" and accelerate.
Other techniques: Adagrad, Adadelta…
Add L2 regularization to the loss function: min_w L(w) + λ Σ_d w_d^2
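A minimal Python/NumPy sketch of the momentum update, with the L2 term folded into the gradient (the coefficients eta, mu, lam and the externally supplied gradient are placeholders):

```python
import numpy as np

def sgd_momentum_step(w, velocity, grad_L, eta=0.01, mu=0.9, lam=1e-4):
    """One parameter update with momentum and L2 (weight-decay) regularization.

    w:        current parameter vector
    velocity: previous update Delta-w (same shape as w)
    grad_L:   gradient of the data loss L(w) at w
    """
    grad = grad_L + 2.0 * lam * w             # d/dw of L(w) + lambda * sum_d w_d^2
    velocity = mu * velocity - eta * grad     # Delta-w = mu * Delta-w_prev - eta * dL/dw
    w = w + velocity                          # w <- w + Delta-w
    return w, velocity
```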
Yet, things still will not work well!
Main problem
• Extremely large number of connections.
• More parameters to train.
• Higher computational expense.
Local connectivity
Reduce parameters with local connections!
Neurons as convolution filters
Now think of neurons as convolutional filters acting on small adjacent (possibly overlapping) windows.
The window size is called the "receptive field" size, and the window spacing is called the "step" or "stride".
This extracts repeated structure: apply the same filter (weights) throughout the image, which dramatically reduces the number of parameters.
Response per pixel p, per filter f, for a transfer function g:
h_{p,f} = g(w_f · x_p)
Can have many filters!
[Figure: convolution reminder (animated); a small filter slides over the image, and the red numbers are the filter values.]
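A minimal Python/NumPy sketch of this shared‐filter response for a single 2D filter (stride 1, no padding, and the identity transfer function g are assumptions):

```python
import numpy as np

def conv2d(image, w, g=lambda s: s):
    """Response h[p] = g(w . x_p): the same filter w is applied at every window x_p."""
    H, W = image.shape
    k, _ = w.shape                           # square k x k filter
    out = np.zeros((H - k + 1, W - k + 1))   # "valid" output size, stride 1
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = image[i:i + k, j:j + k]     # receptive field at position (i, j)
            out[i, j] = g(np.sum(w * window))    # dot product with the shared weights
    return out

# Toy usage: a 3x3 horizontal-edge filter on a random "image".
image = np.random.rand(8, 8)
w = np.array([[1., 1., 1.], [0., 0., 0.], [-1., -1., -1.]])
response = conv2d(image, w)
```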
Example: multiple 3D filters working on multiple channels
Pooling
Apart from hidden layers dedicated to convolution, we can have layers dedicated to extracting locally invariant descriptors [Scherer et al., ICANN 2010] [Boureau et al., ICML 2010]:
Max pooling: h'_{p,f} = max(x_p)
Mean pooling: h'_{p,f} = avg(x_p)
Fixed filter (e.g., Gaussian): h'_{p,f} = w_gaussian · x_p
(here x_p denotes the window of filter‐f responses around position p)
These layers progressively reduce the resolution of the image, so that the next convolutional filters are applied on larger scales.
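A minimal Python/NumPy sketch of max and mean pooling over non‐overlapping 2×2 windows (the window size and non‐overlapping stride are assumptions):

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Downsample a 2D feature map by taking the max or mean over size x size windows."""
    H, W = feature_map.shape
    out = np.zeros((H // size, W // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = feature_map[i * size:(i + 1) * size, j * size:(j + 1) * size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

pooled = pool2d(np.random.rand(8, 8), size=2, mode="max")   # 8x8 -> 4x4
```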
A mini convolutional neural network
Interchange convolutional and pooling (subsampling) layers.
In the end, unwrap all feature maps into a single feature vector and pass it through a classical (fully connected) neural network.
LeNet
Initial architecture from LeCun et al., 1998:
• Convolutional layers with tanh non‐linearity
• Max‐pooling layers
• Stochastic gradient descent
• Applied to digit recognition
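A LeNet‐style network of this kind (alternating convolution + tanh and max‐pooling, then fully connected layers) can be sketched in a few lines of Python using PyTorch; PyTorch is not mentioned in the slides, and the layer sizes here are illustrative rather than the exact 1998 architecture:

```python
import torch
import torch.nn as nn

class MiniLeNet(nn.Module):
    """LeNet-style CNN: conv + tanh and max-pooling layers, then fully connected layers."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(), nn.MaxPool2d(2),   # 28x28 -> 24x24 -> 12x12
            nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(), nn.MaxPool2d(2),  # 12x12 -> 8x8 -> 4x4
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                      # unwrap feature maps into one vector
            nn.Linear(16 * 4 * 4, 120), nn.Tanh(),
            nn.Linear(120, num_classes),       # digit scores
        )

    def forward(self, x):
        return self.classifier(self.features(x))

logits = MiniLeNet()(torch.randn(1, 1, 28, 28))   # one 28x28 grayscale digit
```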
AlexNet
Proposed architecture from Krizhevsky et al., NIPS 2012:
• Convolutional layers with rectified linear units
• Max‐pooling layers
• Stochastic gradient descent on the GPU, with momentum, L2 regularization, and dropout
• Applied to image classification (ImageNet competition – top runner & game changer)
Application: ImageNet classification
Top result in ILSVRC 2012 [~85%, Top‐5 accuracy]
Krizhevsky et al., NIPS 2012
Learned representations
Think of convolution filters as optimized feature templates capturing various hierarchical patterns (edges, local structures, sub‐parts, parts…)
Multi‐view CNNs for shape analysis
A 3D shape is rendered from multiple views; each rendered view is passed through CNN1, a ConvNet extracting image features.
All image features are combined by view pooling: element‐wise max‐pooling across all views.
The pooled features are then passed through CNN2, a second ConvNet producing shape descriptors, followed by a softmax layer to generate the final prediction.
Image from Hang Su, Subhransu Maji, Evangelos Kalogerakis, Erik Learned-Miller, Multi-view Convolutional Neural Networks for 3D Shape Recognition, ICCV 2015
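View pooling itself is just an element‐wise maximum over the per‐view feature vectors; a minimal Python/NumPy sketch (the number of views and the feature dimensionality are placeholders):

```python
import numpy as np

def view_pooling(view_features):
    """Element-wise max-pooling across views.

    view_features: (num_views, feature_dim) array, one CNN1 feature vector per rendered view.
    Returns a single (feature_dim,) descriptor that is fed to CNN2.
    """
    return view_features.max(axis=0)

features = np.random.rand(12, 4096)     # e.g., 12 views, 4096-D CNN1 features
shape_descriptor = view_pooling(features)
```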
Train on image datasets!
The CNNs are pre‐trained on ImageNet (leverage large image datasets for training shape analysis techniques!)
… and then fine‐tune on 3D datasets!
The full pipeline (CNN1 → view pooling → CNN2 → softmax) is then fine‐tuned on shape databases.
Volumetric CNNs
Key idea: represent a shape as a volumetric image with binary voxels.
Learn filters operating on this volumetric data.
Image from Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang and J. Xiao, 3D ShapeNets: A Deep Representation for Volumetric Shapes, 2015
Comparison
[Table: shape retrieval evaluation on ModelNet40.]
Summary
CNNs can learn highly discriminative, hierarchical, powerful feature representations for image and shape analysis.
Deep learning and CNNs have revolutionized computer vision, robotics, NLP, and machine learning: they solve hard tasks and achieve performance comparable to humans.
Why do we still use far‐from‐optimal, ‘old‐style’ descriptors in CG?
Deep learning has also shown very promising results in image and shape synthesis [see Data‐Driven Shape Analysis and Processing, EG’16 STAR report].