
Machine learning techniques can be interpreted through the framework of statistical inference. To view learning in this manner, the models to be trained must be interpreted probabilistically. This can be achieved through the use of so-called ensemble methods [48], which extend standard learning algorithms to principled schedules of sample values for hyperparameters of interest, or through the Bayesian framework. The focus of this work is on the latter, though notes are first presented on ensemble methods to build intuition.

3.5.1 Ensemble Methods

In the interest of robustness, we may be interested in a class of ANN models with distinct parameterizations, as opposed to a single trained network. A set of realizations (i.e., trained networks) from this class is dubbed an ensemble, and by treating each model within this set to the learning task, we can perform inference over model parameters and generate measures of uncertainty over our vector of model outputs ŷ.

Formally speaking, we begin by declaring M = {M_1, M_2, ..., M_m}, where each M_i is characterized by an ANN function f(θ_i, x). The parameter vectors θ_i = {W, b} are considered to be drawn from some distribution Θ. Network outputs ŷ_i = f(θ_i, x) can then be drawn from each network, and an average response ŷ = (1/m) Σ_{i=1}^{m} ŷ_i is calculated as an approximate expectation: a robust estimate of the true response vector (equation 3.17).

E[y] = ∫ y f(y) dy    (3.17)
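As a minimal illustration of this averaging, the sketch below trains a small ensemble on synthetic data and takes the sample mean of the member predictions as the approximate expectation in equation 3.17. The use of scikit-learn's MLPRegressor, the ensemble size m = 10, and the toy sine data are illustrative assumptions only, and member diversity here comes solely from random initialisation rather than from an explicit distribution Θ.

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(200)   # noisy toy responses

m = 10  # ensemble size; how large m should be is discussed below
ensemble = [
    MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=i).fit(X, y)
    for i in range(m)
]

X_new = np.linspace(-3, 3, 100).reshape(-1, 1)
preds = np.stack([net.predict(X_new) for net in ensemble])  # shape (m, 100)
y_hat = preds.mean(axis=0)    # ensemble average, approximating E[y]
spread = preds.std(axis=0)    # member disagreement as a crude uncertainty measure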

How many networks should there be in a well-defined ensemble? Two unsatisfying answers might be as many as we deem appropriate, or as many as are computationally feasible. The former answer corresponds to situations where we are uncertain about a particular set of architecture specifications. We may know that a network would benefit from either 2 layers or 3, but cannot say which of the two is preferable. We might instead wish to define 32, 64 or 128 neurons in each layer, or we may seek to consider networks with different activation functions. Alternatively, we may train an entire table of network specification permutations, one approach to Neural Architecture Search [49].
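As a sketch of how such a table of permutations might be enumerated, the snippet below simply echoes the example choices above (the "tanh" alternative is an assumed second activation for illustration):

from itertools import product

layer_counts = [2, 3]
widths = [32, 64, 128]
activations = ["relu", "tanh"]

# One candidate specification per permutation: 2 * 3 * 2 = 12 architectures in total.
specs = [
    {"hidden_layer_sizes": (width,) * depth, "activation": activation}
    for depth, width, activation in product(layer_counts, widths, activations)
]

Each specification could then be trained as in the sketch above (e.g., MLPRegressor(**spec).fit(X, y)), with the resulting networks either pooled into one ensemble or compared as candidates in an architecture search.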

This discussion of how an optimal ensemble is defined hints towards a Bayesian approach. It might be argued that the optimal ensemble size is the asymptotic limit: an infinite number of networks, over which an expected distribution can be integrated.

3.5.2 Bayesian Neural Networks

Bayesian Neural Networks (BNNs) [29], [45], [50] are non-parametric models structured in the same manner as ANNs, for which probability distributions are placed over the weight and bias matrices corresponding to each layer of the network. A joint posterior distribution of network parameters can then be computed by declaring prior distributions, establishing the appropriate likelihood for the paired observational and response data given the model task, and executing an appropriate approximate inference technique.

A formal definition of a BNN model follows from the definition of an ANN in section 2.2.1 and the components of Bayesian inference in section 3.2. An ℓ-layered neural network framework y = f(x, θ) is declared for fitting responses y to feature vectors x based on parameterizations θ = {W, b}, respectively the weights and biases of the linear transformations z_j = h_j(z_{j-1}, W_j, b_j) for each layer j = 1, ..., ℓ, where z_1 = x. Activation functions g_j(z) are applied after each linear transformation, and will typically be taken to be equivalent across all layers, excluding perhaps the final layer, depending on the nature of the analysis.

The parameter vector θ is assumed to contain components θ_k that are iid random variables, the joint prior distribution for which is declared such that θ ∼ p(θ). A likelihood function for the parameters based on the observed data is selected based on the network's role as a classification or regression model and will be of the form L(θ | x, y). The marginal distribution of the data is discarded, and a posterior distribution of the model parameters is achieved through equation 3.5.
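To make these components concrete, the following sketch assembles an unnormalised log posterior for a small single-hidden-layer network under an assumed zero-centered Gaussian prior and an assumed Gaussian regression likelihood. The single shared prior width sigma_w, the noise scale sigma_y, and the tanh activation are illustrative placeholders rather than choices made elsewhere in this work.

import numpy as np

def forward(theta, X):
    # f(x, theta) for a single-hidden-layer network; theta = (W1, b1, W2, b2).
    W1, b1, W2, b2 = theta
    z = np.tanh(X @ W1 + b1)      # hidden layer with activation g(z) = tanh(z)
    return z @ W2 + b2            # linear output layer

def log_prior(theta, sigma_w=1.0):
    # Zero-centered Gaussian prior over every weight and bias (section 3.5.3), up to a constant.
    return sum(-0.5 * np.sum(p ** 2) / sigma_w ** 2 for p in theta)

def log_likelihood(theta, X, y, sigma_y=0.1):
    # Gaussian likelihood L(theta | x, y) for an interpolation task (section 3.5.4), up to a constant.
    resid = y - forward(theta, X).ravel()
    return -0.5 * np.sum(resid ** 2) / sigma_y ** 2

def log_posterior_unnorm(theta, X, y):
    # Equation 3.5 with the marginal distribution of the data discarded:
    # log p(theta | x, y) = log L(theta | x, y) + log p(theta) + const.
    return log_likelihood(theta, X, y) + log_prior(theta)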

The end result of inference over a BNN is a posterior distribution of network parameterizations. To draw a single sample from this distribution is to generate a single ANN that is considered to have effectively completed its training regimen. The fitted response data as modeled should provide a reasonable estimate of the training labels³. To draw n samples from the posterior would be to generate an ensemble of these ANNs, each of which may contribute estimated labels for the training data as one sample estimate of the true response. More generally, samples contributing to an expectation of the responses y are drawn from the posterior predictive distribution as a means of obtaining point or interval estimates.

³ The distribution of trained networks may not be defined such that every sample will correspond to a network that has been trained "effectively". Low-probability draws from the tails of the distribution (those with a relatively low posterior score) may not achieve optimal results based on appropriate metrics, such as classification accuracy or regression MSE.
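Reusing the forward function and the prediction inputs X_new from the sketches above, and assuming samples holds n posterior draws of θ obtained by some approximate inference technique, point and interval estimates follow directly from the resulting ensemble of networks:

import numpy as np

# samples: list of n posterior draws of theta; X_new: inputs at which to predict.
preds = np.stack([forward(theta, X_new).ravel() for theta in samples])  # shape (n, len(X_new))
y_mean = preds.mean(axis=0)                              # point estimate of the responses
y_lo, y_hi = np.percentile(preds, [2.5, 97.5], axis=0)   # a 95% interval estimate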

The BNN approach offers several advantages over the standard machine learning approach. Inference procedures offer the same benefits as ensemble methods, but these benefits may be thought of as being "built-in" to the inference procedure, and need not be approximated through the addition of auxiliary methods and corresponding hyperparameters. Select advantages are outlined below:

1. Robust, built-in generalization

A known result of BNNs is that the use of zero-centered Gaussian priors over weight parameters (see section 3.5.3) induces an equivalence to L2-regularization via the inclusion of a weight decay penalty [51]; a short derivation is sketched after this list. Such a penalty term is included in the loss function of an ANN as λ‖W‖², where λ > 0 is some coefficient affecting the strength of the penalty on the squared norm of the weight parameters. Networks with smaller weights are therefore favoured in producing a trained ANN model.

For a BNN, the posterior distribution of network parameterizations corresponds to an infinite ensemble of networks that treat the data to all possible functional representations y = f(x), weighted by corresponding posterior scores, for which the contribution of a distribution centered at zero will similarly favour parameter absolute values closer to zero [41].

2. Uncertainty measures built-in

The expectation represents the first-order moment of the posterior distribution; the second-order central moment is referred to as the variance. The variance of each component of the parameter vector provides a measure of the spread of likely values for the corresponding model parameter, indicating how widely distributed random realizations of the marginal posterior may be.

Variance for the posterior distribution allows for statements about the certainty of estimates for network parameters, but does not directly provide predictive insight. Variance for the predictive distribution, however, can be extremely beneficial in representing the limitations of a neural network's predictive capabilities. A large variance associated with a fitted data point can indicate the degree of caution that should be exercised when relying on a network's predictions for real-world applications.

3. Interpretability of network parameters

Referring to the discussion of black box models in section 2.2, it is not always clear what precisely motivates the parameterization of a well-specified non-parametric model. The posterior distribution of parameters facilitates inference about the nature of the model, in that draws from the posterior effectively produce a dataset of parameterizations. Analysis of this dataset may provide insight into the model at hand; we are essentially turning statistical analysis inward on itself.

Treating neural networks to Bayesian inference therefore presents an inherent opportunity to understand such models, in parallel to modelling the data according to predictive motivations as in a classical machine learning approach. The posterior predictive distribution provides a fitted estimate of the data while the posterior distribution lends clarity to the model itself, such that we don't merely return how to represent the data, but also information about why those representations may be reasonable.
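As promised in item 1, the equivalence between a zero-centered Gaussian prior and weight decay can be sketched in one line. For an isotropic Gaussian prior of width σ_w over the weights (an assumed form consistent with section 3.5.3), the negative log prior is

\[
-\log p(W) \;=\; \frac{1}{2\sigma_w^{2}}\,\lVert W \rVert^{2} + \text{const}
\;=\; \lambda\,\lVert W \rVert^{2} + \text{const},
\qquad \lambda = \frac{1}{2\sigma_w^{2}} > 0,
\]

so that maximizing the posterior corresponds to minimizing the usual ANN loss augmented by the weight decay penalty λ‖W‖² [51].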

Essentially, a BNN is a distribution of ANNs which may be learned through one of the three inference techniques previously introduced.

3.5.3 Priors for BNNs

Prior distributions in Bayesian inference allow for the injection of previously acquired domain knowledge into the modelling task, but this is not an immediately intuitive proposition when dealing with non-parametric models [52]. To impose a distribution over the weights and biases of a neural network model is to implicitly define expectations about the nature of these parameters, such as their restricted domain, and the regions of the real number line corresponding to high probability mass for parameter values.

Discussion of BNN priors dates back to MacKay [45], [53], [41] and Neal [29]. The consensus in the literature to date largely agrees with their straightforward approach of employing zero-centered Gaussian distributions for both weights and biases, with variance parameters σ_i² declared for sets of parameters i corresponding to similar roles. Neal, for example, declares separate variances respectively for sets of hidden node weights, hidden node biases, output node weights, and output node biases, scaling each corresponding Gaussian prior to an appropriate width based on the size of the previous layer.
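A sketch of such a grouped prior specification follows, assuming as one concrete choice in the spirit of Neal that each weight group's standard deviation scales as 1/sqrt(fan_in) while bias groups keep a fixed unit scale; both scalings are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)

def draw_layer_prior(fan_in, fan_out, sigma_b=1.0):
    # Zero-centered Gaussians for one layer's weight and bias groups; the weight
    # width shrinks with the size of the previous layer (assumed 1/sqrt(fan_in) scaling).
    sigma_w = 1.0 / np.sqrt(fan_in)
    W = rng.normal(0.0, sigma_w, size=(fan_in, fan_out))
    b = rng.normal(0.0, sigma_b, size=fan_out)
    return W, b

# Separate prior groups for hidden and output layers, each with its own width.
W1, b1 = draw_layer_prior(fan_in=1, fan_out=32)    # hidden node weights and biases
W2, b2 = draw_layer_prior(fan_in=32, fan_out=1)    # output node weights and biases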

Further discussion of BNN priors is included in section 5.1.1.

3.5.4 BNN Likelihoods

The likelihood function must be appropriately specified based on the nature of the analysis for a given BNN model. In this work, BNNs are used for both classification and interpolation tasks. Suitable likelihoods extend from equivalent analyses via Bayesian regression (BR) and Bayesian logistic regression (BLR).

Interpolation Networks

Each fitted value y_i is assumed to be drawn from a normal distribution centered on the network output ŷ_i = f(x_i) with noise term σ_y. This corresponds to the noise terms e_i = y_i − ŷ_i in standard regression, which are assumed to be drawn e_i ∼ N(0, σ_y). The parameter σ_y is often considered to be a hyperparameter; it is marginalized over in computing the predictive distribution.
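Written out under these assumptions (treating σ_y as the noise standard deviation and the observations as conditionally independent, an interpretation rather than a statement made explicitly above), the likelihood takes the form

\[
\mathcal{L}(\theta \mid x, y) \;=\; \prod_{i=1}^{n} \mathcal{N}\!\left(y_i \,\middle|\, \hat{y}_i,\; \sigma_y^{2}\right),
\qquad \hat{y}_i = f(x_i, \theta),
\]

whose logarithm is, up to an additive constant in θ, the negative sum of squared residuals scaled by 1/(2σ_y²).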
