
11.3 Clustering of AT-TPC events

11.3.3 Comparing clustering methods

Both the pre-trained and autoencoder based clustering methods hold promise.

As is the case for classification, the ease of application of the pre-trained methods is a significant boon, but their static nature creates a hard-to-breach cap on their performance. We also observe an impressive purity in the proton clusters, as shown in figure 10.1, for both the filtered and full datasets. Notably, this purity is absent from the MIXAE clustering results. The MIXAE's increase in performance over the VGG16+K-means is then a result of a stronger segmentation of the "other" class of events.

We also observe a link between the clustering and classifying autoencoders in their performance on real AT-TPC data. Recall that for the real data, both the MIXAE and VGG16+K-means approaches showed confusion over very noisy proton events. We observe the same behaviour in the classifying latent spaces of the autoencoder and VGG16 models, illustrated by their t-SNE projections shown in figures 9.2, 9.4 and 9.6. In those figures, the majority of proton events are correctly segmented from the rest of the data. A portion of the proton events does, however, intermingle with the "other" class in a way that is familiar to us from the clustering results.
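The t-SNE projections referenced above can be produced directly from the latent vectors. Below is a minimal sketch using scikit-learn, assuming the latent vectors and labels have already been extracted into NumPy arrays; the array names, shapes, and random data are placeholders, not quantities from the thesis code.

```python
# Minimal sketch: project latent vectors to 2D with t-SNE for visual inspection.
# `latents` and `labels` are synthetic stand-ins for encoder or VGG16 features
# and event-type tags; they are not part of the thesis code base.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)
latents = rng.normal(size=(500, 64))       # stand-in for latent vectors
labels = rng.integers(0, 3, size=500)      # stand-in for proton/carbon/other tags

projection = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(latents)

plt.scatter(projection[:, 0], projection[:, 1], c=labels, s=5, cmap="viridis")
plt.xlabel("t-SNE 1")
plt.ylabel("t-SNE 2")
plt.tight_layout()
plt.savefig("latent_tsne.png")
```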

It remains to be seen if the algorithms explored in this thesis are capable of separating events in future experiments with more similar tracks. A related avenue for future research is then to combine the autoencoder methods discussed here with a duelling decoder objective on Hough-transformed events.

This representation is promising as it can explicitly encode the geometry in the event.
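As a rough illustration of what such a representation involves, the sketch below computes a linear Hough transform of a toy 2D event image with scikit-image. The image and its toy track are fabricated for the example; only the hough_line call reflects an actual library API.

```python
# Minimal sketch: compute a linear Hough transform of a 2D event projection.
# `event_image` is a placeholder for a binarised (x, y) projection of an AT-TPC event.
import numpy as np
from skimage.transform import hough_line

rng = np.random.default_rng(0)
event_image = np.zeros((128, 128), dtype=bool)
rr = np.arange(128)
event_image[rr, rr] = True                     # a toy straight track along the diagonal
event_image |= rng.random((128, 128)) > 0.995  # sprinkle a little noise

# The accumulator encodes track geometry explicitly as (angle, distance) peaks,
# which is the property that makes it attractive as a duelling-decoder target.
accumulator, angles, distances = hough_line(event_image)
print(accumulator.shape, angles.shape, distances.shape)
```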

For on-going experiments with the AT-TPC, our recommendation for clustering is to employ a pre-trained model combined with K-means. Performance validation is, unfortunately, a necessary step. Some positive identifications of events are thus needed to validate the clustering.
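A minimal sketch of this recommended workflow is given below, assuming a feature matrix from a pre-trained network and a small set of positively identified events are already available; all arrays here are synthetic stand-ins for quantities produced elsewhere in the analysis.

```python
# Minimal sketch: cluster pre-trained-network features with K-means and validate
# against a small set of hand-identified events. `features`, `known_idx`, and
# `known_labels` are placeholders.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(1)
features = rng.normal(size=(2000, 512))        # stand-in for VGG16 latent vectors
known_idx = rng.choice(2000, size=100, replace=False)
known_labels = rng.integers(0, 3, size=100)    # stand-in for positively identified events

assignments = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)

# Agreement between cluster assignments and the labelled subset gives a rough
# validation of the clustering without requiring labels for the full dataset.
ari = adjusted_rand_score(known_labels, assignments[known_idx])
print(f"ARI on labelled subset: {ari:.3f}")
```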

In this thesis, we have demonstrated strong performance with the MIXAE algorithm. However, further inquiry is needed to investigate the stability of the algorithm. Other avenues of potential interest are the coupling of clustering with a duelling decoder objective, as well as other autoencoder-based clustering algorithms. Lastly, there is a need for the coupling of unsupervised performance metrics to measures of performance against ground truth labels.

Chapter 12

Conclusions and Future Work

In the present work, we explored the segmentation of active target time-projection chamber (AT-TPC) events in neural network latent spaces. This exploration is necessary because traditional methods are both prohibitively computationally expensive and cannot be applied to events with broken tracks. Specifically, the goal was to implement autoencoder-based models for semi-supervised classification and clustering. We compared these models with a classic pre-trained model from the image analysis community.

Two tasks were proposed to contribute to the exploration of AT-TPC events: a semi-supervised objective, which describes the necessary volume of labelled data, and a clustering objective, which measures the quality of segmentation without labelled data.

To solve the semi-supervised problem, we implemented two autoencoder-based algorithms: a convolutional autoencoder model with the capacity for two different latent space regularisations, and the sequential deep recurrent attentive writer (DRAW) model. We trained these models on three different sets of AT-TPC data and found the following:

• A convolutional autoencoder can linearly segment its latent space by event type when trained on the 46Ar data.

• Good class separability can be achieved both with and without latent regularisation. When regularised, we found better segmentation in models trained with a Gaussian mixture maximum mean discrepancy loss than in those trained with a variational autoencoder loss.

• The recurrent DRAW model does not offer meaningful improvements over the convolutional autoencoder's performance on the semi-supervised task.

Moreover, neither the DRAW algorithm nor the convolutional autoencoder outperformed the pre-trained VGG16 as a function of the available labelled data. This discrepancy indicates that while the reconstruction objective encourages class separability in the latent space, it does not do so to the degree that a classification objective does, even when the classification objective is over a different dataset.
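To make the semi-supervised comparison concrete, the sketch below fits a linear classifier on latent vectors while varying the number of labelled events, which is the kind of evaluation described above. The latent vectors and labels are synthetic stand-ins, and the classifier choice (logistic regression) is an assumption made for illustration only.

```python
# Minimal sketch of the semi-supervised evaluation: fit a logistic regression on
# latent vectors using progressively more labelled events. All arrays are
# synthetic stand-ins for encoder (or VGG16) latent vectors and event labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
latents = rng.normal(size=(5000, 32))
labels = (latents[:, 0] + 0.5 * rng.normal(size=5000) > 0).astype(int)  # toy classes

train_z, test_z, train_y, test_y = train_test_split(
    latents, labels, test_size=0.5, random_state=0)

for n_labelled in (50, 200, 1000):
    clf = LogisticRegression(max_iter=1000).fit(train_z[:n_labelled], train_y[:n_labelled])
    print(n_labelled, f"test accuracy: {clf.score(test_z, test_y):.3f}")
```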

Avenues of academic interest for further research include the construction of new representations for the duelling decoder objective, as well as models that combine autoencoders with generative adversarial networks. Lastly, we analysed two-dimensional projections of AT-TPC events in this work; expanding to include the full three-dimensional data could contribute additional insight.

To address the clustering task, we implemented two algorithms: the deep convolutional embedded clustering (DCEC) algorithm and the mixture of autoencoders (MIXAE) algorithm. As in the semi-supervised objective, we compared these algorithms with the performance of a pre-trained VGG16 network. We showed that the pre-trained network latent space could be combined with a simple K-means algorithm for clustering of AT-TPC events. With the VGG16+K-means algorithm, we achieved convincing results on simulated data, as well as promising segmentation of the full and filtered data. Especially notable was the consistent purity of the proton event cluster. With the autoencoder-based MIXAE algorithm, we found an increase in performance on the clustering task over the VGG16+K-means approach. In conclusion, we found the following for the clustering task:

• Using a K-means algorithm on the VGG16 latent space, we demonstrate strong clustering of simulated data. Additionally, this approach consistently finds high purity proton clusters for both filtered and full 46Ar data.

• Building on the insights from the semi-supervised task, and the failure of the DCEC algorithm, we successfully clustered AT-TPC data with the MIXAE algorithm. We demonstrate that this approach can increase performance on the clustering task compared to the VGG16+K-means algorithm. However, the MIXAE performance is dependent on the loss weights, which we selected based on the performance on the clustering task. We also found challenges with the MIXAE model, as it has significant stability problems.

The contribution of the present work is then twofold: we demonstrate the applicability of pre-trained models in the unsupervised clustering of AT-TPC data, and we show that the MIXAE model can improve upon this performance. Further research is needed to understand the variability in autoencoder-based clustering performance. Additionally, deep clustering is an active field of research, and novel methods might provide additional insight.

In preparation for a publication of these results, we will also be exploring clustering algorithms applied to other pre-trained models.


In summary, we have found promising avenues for research applying both supervised and unsupervised techniques to AT-TPC data, with the latter having major implications for experiments in which researchers are unable to separate event types. However, this research is still at an early stage, and for current experiments we recommend the application of pre-trained models for both supervised and unsupervised tasks.

Part V Appendices


Appendix A

Kullback-Leibler divergence of Gaussian distributions

A multivariate Gaussian distribution in $\mathbb{R}^n$ is defined in terms of its probability density,

$$
p(x) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right),
\tag{A.1}
$$

which is defined in an analogous way to its univariate formulation. The probability density is described in full by the mean vector $\mu$ and covariance matrix $\Sigma$. The Kullback-Leibler divergence between two multivariate Gaussians is defined as

$$
D_{KL}(p_1 \,\|\, p_2) = \langle \log p_1 - \log p_2 \rangle_{p_1}
= \left\langle \frac{1}{2}\log\frac{|\Sigma_2|}{|\Sigma_1|} + \frac{1}{2}\left(-(x-\mu_1)^T \Sigma_1^{-1} (x-\mu_1) + (x-\mu_2)^T \Sigma_2^{-1} (x-\mu_2)\right)\right\rangle_{p_1}.
$$

We use the fact that the exponential factors are quadratic forms, and hence scalars equal to their own trace, to apply a trace operator, and we then manipulate the sequence of operations using the trace operator's invariance under cyclic permutations, i.e. $\operatorname{tr}(x^T B x) = \operatorname{tr}(B x x^T)$. Furthermore, we use the fact that the trace is a linear operator and so commutes with the expectation, i.e. $\langle \operatorname{tr}(B x x^T)\rangle = \operatorname{tr}(B \langle x x^T\rangle)$. We also move the logarithm of the covariance determinants outside of the expectations,

$$
\begin{aligned}
D_{KL}(p_1 \,\|\, p_2) &= \frac{1}{2}\log\frac{|\Sigma_2|}{|\Sigma_1|} + \frac{1}{2}\left\langle -\operatorname{tr}\!\left(\Sigma_1^{-1}(x-\mu_1)(x-\mu_1)^T\right) + \operatorname{tr}\!\left(\Sigma_2^{-1}(x-\mu_2)(x-\mu_2)^T\right)\right\rangle \\
&= \frac{1}{2}\log\frac{|\Sigma_2|}{|\Sigma_1|} + \frac{1}{2}\left(-\operatorname{tr}\!\left(\Sigma_1^{-1}\langle(x-\mu_1)(x-\mu_1)^T\rangle\right) + \operatorname{tr}\!\left(\Sigma_2^{-1}\langle(x-\mu_2)(x-\mu_2)^T\rangle\right)\right).
\end{aligned}
$$

Conveniently, the covariance matrix is defined by the expectation

$$
\Sigma := \langle (x-\mu)(x-\mu)^T \rangle,
\tag{A.2}
$$

giving an evident simplification. For the terms originating from $p_2$ we will use the definitions of the covariance matrix and the mean vector, i.e. $\mu = \langle x \rangle$ and

$$
\Sigma = \langle x x^T - x\mu^T - \mu x^T + \mu\mu^T \rangle = \langle x x^T \rangle - \mu\mu^T.
$$

Returning to the Kullback-Leibler divergence we then have

$$
\begin{aligned}
D_{KL}(p_1 \,\|\, p_2) &= \frac{1}{2}\log\frac{|\Sigma_2|}{|\Sigma_1|} + \frac{1}{2}\left(-\operatorname{tr}\!\left(\Sigma_1^{-1}\Sigma_1\right) + \operatorname{tr}\!\left(\Sigma_2^{-1}\langle x x^T - 2x\mu_2^T + \mu_2\mu_2^T\rangle\right)\right) \\
&= \frac{1}{2}\left[\log\frac{|\Sigma_2|}{|\Sigma_1|} - n + \operatorname{tr}\!\left(\Sigma_2^{-1}\left(\Sigma_1 + \mu_1\mu_1^T - 2\mu_1\mu_2^T + \mu_2\mu_2^T\right)\right)\right].
\end{aligned}
$$

Grouping terms then gives us the final expression for the Kullback-Leibler divergence of two multivariate Gaussians,

$$
D_{KL}(p_1 \,\|\, p_2) = \frac{1}{2}\left[\log\frac{|\Sigma_2|}{|\Sigma_1|} - n + \operatorname{tr}\!\left(\Sigma_2^{-1}\Sigma_1\right) + (\mu_2 - \mu_1)^T \Sigma_2^{-1} (\mu_2 - \mu_1)\right].
\tag{A.3}
$$
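Equation (A.3) is straightforward to implement and check numerically. The following sketch is ours, not part of the thesis code; the function name is arbitrary.

```python
# Minimal numerical sketch of equation (A.3); the function name is ours.
import numpy as np

def gaussian_kl(mu1, sigma1, mu2, sigma2):
    """Kullback-Leibler divergence D_KL(N(mu1, sigma1) || N(mu2, sigma2))."""
    n = mu1.shape[0]
    sigma2_inv = np.linalg.inv(sigma2)
    diff = mu2 - mu1
    log_det_ratio = np.log(np.linalg.det(sigma2) / np.linalg.det(sigma1))
    return 0.5 * (log_det_ratio - n
                  + np.trace(sigma2_inv @ sigma1)
                  + diff @ sigma2_inv @ diff)

# Sanity check: the divergence of a distribution from itself is zero.
mu = np.zeros(3)
cov = np.eye(3)
print(gaussian_kl(mu, cov, mu, cov))                 # 0.0
print(gaussian_kl(np.ones(3), 2 * np.eye(3), mu, cov))  # > 0
```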

Appendix B

The bias-variance decomposition

In approximating functions we observe a relationship between the complexity of our model and how much data we have available to fit on. Quantifying this relationship helps us understand what challenges we face when fitting models to data. We begin by considering the true process which we want to model, decomposed into contributions from the true function we wish to approximate, $\hat{f}$, and a noise term $e$, as

$$
\hat{y}_i = \hat{f}(x_i) + e,
\tag{B.1}
$$

where the recorded data are the tuples $s_i = (\hat{y}_i, x_i)$ and the set of recorded data is denoted as $S = \{s_i\}$. We here assume that the noise is uncorrelated and distributed as $e \sim \mathcal{N}(0, \sigma^2)$.

Furthermore, assume that we have a procedure to fit a model, $g(x_i;\theta)$, with parameters $\theta$ to a dataset $S_k$, giving an estimator for unseen data, $g(x_i;\theta_{S_k})$. The quality of this estimator is measured by the squared error cost function, which has the form

$$
C(S, g(x_i)) = \sum_i \left(\hat{y}_i - g(x_i;\theta_{S_k})\right)^2.
\tag{B.2}
$$

The relationship we wish to describe is known as the bias-variance decomposition. It decomposes the expected error of our modelling procedure on unseen data into three distinct contributions: a bias term, a variance term and a noise term. Mathematically it has the form

$$
\langle C(S, g(x_i)) \rangle_{S,e} = \sum_i \left[\left(\hat{f}(x_i) - \langle g(x_i;\theta_{S_k})\rangle_S\right)^2 + \left\langle \left(g(x_i;\theta_{S_k}) - \langle g(x_i;\theta_{S_k})\rangle_S\right)^2 \right\rangle_S + \sigma^2\right].
\tag{B.3}
$$

Before we derive this expression, we note that the expectation $\langle g(x_i;\theta_{S_k})\rangle_S$ is the expected value of our model, $g$, on an unseen datum, $x_i$, when trained on differing datasets, $S_k$.

The derivation of the relationship starts with the expectation of the cost with respect to the data-selection and noise effects. It has the form

$$
\langle C(S, g(x_i))\rangle_{S,e} = \sum_i \left\langle \left(\hat{y}_i - g(x_i;\theta_{S_k})\right)^2\right\rangle_{S,e}.
\tag{B.4}
$$

We introduce a notational shorthand, $g_{S_k} := g(x_i;\theta_{S_k})$, for the estimator to maintain clarity in the derivation. The derivation begins by adding and subtracting the expected value of our estimator on the unseen data, and we then have that

$$
\begin{aligned}
\langle C(S, g(x_i))\rangle_{S,e}
&= \sum_i \left\langle \left(\hat{y}_i - g_{S_k}\right)^2\right\rangle_{S,e}, && \text{(B.5)}\\
&= \sum_i \left\langle \left(\hat{y}_i - \langle g_{S_k}\rangle_S + \langle g_{S_k}\rangle_S - g_{S_k}\right)^2\right\rangle_{S,e}, && \text{(B.6)}\\
&= \sum_i \Big[ \left\langle \left(\hat{y}_i - \langle g_{S_k}\rangle_S\right)^2\right\rangle_{S,e}
   + \left\langle \left(g_{S_k} - \langle g_{S_k}\rangle_S\right)^2\right\rangle_S
   + 2\left\langle \hat{y}_i - \langle g_{S_k}\rangle_S\right\rangle_{S,e} \cdot \left\langle g_{S_k} - \langle g_{S_k}\rangle_S\right\rangle_S \Big], && \text{(B.7)}
\end{aligned}
$$

where we observe that the cross-term is zero. Further decomposing $\hat{y}_i$, we can write this as

$$
\begin{aligned}
\langle C(S, g(x_i))\rangle_{S,e}
&= \sum_i \left[\left\langle \left(\hat{f}(x_i) + e - \langle g_{S_k}\rangle_S\right)^2\right\rangle_{S,e} + \left\langle \left(g_{S_k} - \langle g_{S_k}\rangle_S\right)^2\right\rangle_S\right], && \text{(B.8)}\\
&= \sum_i \left[\left(\hat{f}(x_i) - \langle g_{S_k}\rangle_S\right)^2 + \left\langle \left(g_{S_k} - \langle g_{S_k}\rangle_S\right)^2\right\rangle_S + \sigma^2\right], && \text{(B.9)}
\end{aligned}
$$

where the cross-term from the last transition is zero as the error has zero mean, by assumption.
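The decomposition can also be illustrated numerically by repeatedly fitting a model to noisy draws from a known function and comparing the averaged test error to the sum of the three terms. The setup below (a cubic polynomial fit to a sine curve) is ours and chosen purely for illustration.

```python
# Minimal numerical illustration of the bias-variance decomposition (B.3):
# repeatedly fit polynomials to noisy samples of a known function and compare
# the averaged squared error to bias^2 + variance + noise.
import numpy as np

rng = np.random.default_rng(3)
true_f = lambda x: np.sin(2 * np.pi * x)
x_grid = np.linspace(0, 1, 50)
sigma = 0.3
degree = 3
n_datasets, n_points = 500, 30

predictions = np.empty((n_datasets, x_grid.size))
for k in range(n_datasets):
    x_k = rng.uniform(0, 1, n_points)
    y_k = true_f(x_k) + rng.normal(0, sigma, n_points)
    coeffs = np.polyfit(x_k, y_k, degree)
    predictions[k] = np.polyval(coeffs, x_grid)

bias_sq = np.mean((true_f(x_grid) - predictions.mean(axis=0)) ** 2)
variance = np.mean(predictions.var(axis=0))
print(f"bias^2 = {bias_sq:.4f}, variance = {variance:.4f}, noise = {sigma**2:.4f}")
print(f"sum    = {bias_sq + variance + sigma**2:.4f}")

# Empirical expected prediction error on fresh noisy targets, for comparison.
test_errors = []
for k in range(n_datasets):
    y_test = true_f(x_grid) + rng.normal(0, sigma, x_grid.size)
    test_errors.append(np.mean((y_test - predictions[k]) ** 2))
print(f"empirical expected error = {np.mean(test_errors):.4f}")
```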

Appendix C

Neural network architectures


Table C.1: Details of the VGG network architectures. Network D, trained on the ImageNet [50] dataset, is the network known as VGG16 and is the one we use in this thesis.
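For reference, a pre-trained VGG16 can be loaded as a fixed feature extractor in a few lines. The sketch below assumes a TensorFlow/Keras backend and an illustrative input shape; it is not necessarily the exact pipeline used in this work.

```python
# Minimal sketch: load ImageNet-pretrained VGG16 as a fixed feature extractor.
# The input shape and the random stand-in images are illustrative only.
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input

# include_top=False drops the ImageNet classifier head; pooling="avg" yields one
# 512-dimensional latent vector per event image.
extractor = VGG16(weights="imagenet", include_top=False, pooling="avg",
                  input_shape=(128, 128, 3))

events = np.random.rand(8, 128, 128, 3) * 255.0   # stand-in for 2D event projections
latents = extractor.predict(preprocess_input(events))
print(latents.shape)  # (8, 512)
```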

Appendix D

Model hyperparameters

The convolutional autoencoder and the deep recurrent attentive writer each have many hyperparameters that need to be specified. We provide a complete listing with descriptions in tables D.1 and D.2.


Table D.1: Detailing the hyperparameters that need to be determined for the convolutional autoencoder. The depth and number of filters strongly influence the number of parameters in the network. For all the search types we follow heuristics common in the field: the network starts with larger kernels and smaller numbers of filters, etc.

Convolutional parameters:

• Number of layers (linear integer): how many convolutional layers to use.
• Kernels (set of linear integers): an array describing the kernel size for each layer.
• Strides (set of linear integers): an array describing the stride for each layer.
• Filters (set of logarithmic integers): an array describing the number of filters for each layer.

Network parameters:

• Activation (multinomial): an activation function, as detailed in section 3.1.3.
• Latent type (multinomial): one of the latent space regularization techniques (KLD, MMD, clustering loss).
• Latent dimension (integer): the dimensionality of the latent space.
• β (logarithmic integer): weighting parameter for the latent term.
• Batchnorm (binary): whether to use batch-normalization in each layer.

Optimizer parameters:

• η (logarithmic float): learning rate, described in section 2.10.
• β1 (linear float): momentum parameter, described in section 2.10.1.
• β2 (linear float): second-moment momentum parameter, described in section 2.10.3.
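As an illustration of how these scales translate into a random search [13], the sketch below draws one candidate configuration. The ranges and names are placeholders, not the ones used in the thesis.

```python
# Minimal sketch: draw one random-search configuration using the scales listed
# in table D.1. All ranges are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(4)

def sample_config():
    n_layers = int(rng.integers(2, 5))                               # linear integer
    return {
        "n_layers": n_layers,
        "kernels": rng.integers(3, 8, size=n_layers).tolist(),       # linear integers per layer
        "strides": rng.integers(1, 3, size=n_layers).tolist(),
        "filters": (2 ** rng.integers(4, 8, size=n_layers)).tolist(),  # logarithmic integers
        "activation": str(rng.choice(["relu", "lrelu", "tanh"])),    # multinomial
        "latent_type": str(rng.choice(["kld", "mmd", "clustering"])),
        "latent_dim": int(rng.integers(8, 129)),
        "beta": float(10.0 ** rng.integers(-2, 3)),                  # logarithmic weight
        "batchnorm": bool(rng.integers(0, 2)),
        "eta": float(10.0 ** rng.uniform(-5, -2)),                   # logarithmic float
        "beta1": float(rng.uniform(0.8, 0.99)),                      # linear float
        "beta2": float(rng.uniform(0.99, 0.9999)),
    }

print(sample_config())
```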


Table D.2: Hyperparameters for the DRAW algorithm as outlined in section 4.5. The implementation of the convolutional read and write functions is a novel contribution to the DRAW algorithm. We investigate which read/write paradigm is most useful for classification and clustering. Additionally, as a measure to ensure the comparability of latent samples, we fix the δ parameter determining the glimpse size. The effect of δ is explored in detail in the paper by Gregor et al. [26] and in the earlier section 4.5.

Recurrent parameters:

• Read/write functions (binary): one of attention or convolutional, describing the way DRAW reads from and adds to the canvas.
• Nodes in recurrent layer (integer): the number of units in the LSTM cells.

Network parameters:

• Latent type (multinomial): one of the latent space regularization techniques (KLD, MMD, clustering loss).
• Latent dimension (integer): the dimensionality of the latent space.
• β (logarithmic integer): weighting parameter for the latent term.

Optimizer parameters:

• η (logarithmic float): learning rate, described in section 2.10.
• β1 (linear float): momentum parameter, described in section 2.10.1.
• β2 (linear float): second-moment momentum parameter, described in section 2.10.3.

Bibliography

[1] K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. In International Conference on Learning Representations, 2015. doi: 10.1.1.740.6937.

[2] N. Chinchor. Evaluation Metrics. In Fourth Message Understanding Conference, page 22, 1992.

[3] L. Hubert and P. Arabie. Comparing Partitions. Journal of Classification, 2:193, 1985. URL https://link.springer.com/content/pdf/10.1007%2FBF01908075.pdf.

[4] D. Silver, T. Hubert, J. Schrittwieser, et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362:1140, 2018. ISSN 10959203. doi: 10.1126/science.aar6404.

[5] H. W. Lin, M. Tegmark, and D. Rolnick. Why Does Deep and Cheap Learning Work So Well? Journal of Statistical Physics, 168:1223, 2017. doi: 10.1007/s10955-017-1836-5.

[6] Y. Wang and M. Kosinski. Deep neural networks are more accurate than humans at detecting sexual orientation from facial images. Journal of Personality and Social Psychology, 114:246, 2018. doi: 10.1037/pspa0000098.

[7] J. Frankle, K. Dziugaite, D. M. Roy, and M. Carbin. Stabilizing the Lottery Ticket Hypothesis. Technical report, 2019. URL https://arxiv.org/pdf/1903.01611.pdf.

[8] J. Frankle and M. Carbin. The lottery ticket hypothesis: finding sparse, trainable neural networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rJl-b3RcF7.

[9] P. Mehta, M. Bukov, C. H. Wang, et al. A high-bias, low-variance introduction to Machine Learning for physicists. Physics Reports, 810:1, 2019. doi: 10.1016/j.physrep.2019.03.001. URL https://linkinghub.elsevier.com/retrieve/pii/S0370157319300766.

[10] S. van der Walt, S. C. Colbert, and G. Varoquaux. The NumPy Array: A Structure for Efficient Numerical Computation. Computing in Science & Engineering, 13:22, 2011. doi: 10.1109/MCSE.2011.37. URL http://ieeexplore.ieee.org/document/5725236/.

[11] A. E. Hoerl and R. W. Kennard. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics, 12:55, 1970. URL https://www.math.arizona.edu/~hzhang/math574m/Read/RidgeRegressionBiasedEstimationForNonorthogonalProblems.pdf.

[12] R. Tibshirani. Regression Shrinkage and Selection via the Lasso. Technical report, 1996. URL https://www.jstor.org/stable/pdf/2346178.pdf?refreqid=excelsior%3A0665690fe41c338bbaa8d3f1883ccb60.

[13] J. Bergstra and Y. Bengio. Random Search for Hyper-Parameter Optimization. Technical report, 2012. URL http://scikit-learn.sourceforge.net.

[14] F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65:19, 1958. URL https://www.ling.upenn.edu/courses/cogs501/Rosenblatt1958.pdf.

[15] A. Karpathy. CS231n Convolutional Neural Networks for Visual Recognition, 2019. URL https://cs231n.github.io/neural-networks-3/.

[16] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. Technical report, 2013. URL http://proceedings.mlr.press/v28/sutskever13.pdf.

[17] S. Ruder. An overview of gradient descent optimization algorithms. Technical report, Insight Centre for Data Analytics, 2016. URL http://caffe.berkeleyvision.org/tutorial/solver.html.

[18] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=H1oyRlYgg.

[19] D. P. Kingma and J. Lei Ba. ADAM: a method for stochastic optimization. In International Conference on Learning Representations, 2014. URL https://arxiv.org/pdf/1412.6980.pdf.

[20] M. P. Kuchera, R. Ramanujan, J. Z. Taylor, et al. Machine learning methods for track classification in the AT-TPC. Nuclear Instruments and Methods in Physics Research, Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, 940:156, 2019. doi: 10.1016/j.nima.2019.05.097. URL https://linkinghub.elsevier.com/retrieve/pii/S0168900219308046.

[21] J. Wishart and J. Neyman. Proceedings of the Berkeley Symposium on Mathematical Statistics and Probability, volume 34. University of California Press, 1950. doi: 10.2307/3610901. URL https://www.jstor.org/stable/3610901?origin=crossref.

[22] S. Marsland. Machine Learning: An Algorithmic Perspective, 2009. ISSN 00368075.

[23] B. Schölkopf, A. Smola, and K.-R. Müller. Component Analysis as a Kernel Eigenvalue Problem. 5(44):1299-1319, 1996. doi: 10.1.1.100.3636. URL https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.100.3636.

[24] E. Fertig, A. Arbabi, and A. A. Alemi. beta-VAEs can retain label information even at high compression. Technical report, 2018. URL http://arxiv.org/abs/1812.02682.

[25] F. Pedregosa, G. Varoquaux, A. Gramfort, et al. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825, 2011. URL http://www.jmlr.org/papers/v12/pedregosa11a.html.

[26] K. Gregor, D. J. Rezende, and D. Wierstra. DRAW: A Recurrent Neural Network For Image Generation. In International Conference on Machine Learning, volume 37, page 1462, 2015.

[27] W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5:115, 1943. ISSN 0007-4985. doi: 10.1007/BF02478259. URL http://link.springer.com/10.1007/BF02478259.

[28] S. Linnainmaa. Taylor expansion of the accumulated rounding error. BIT, 16:146, 1976. doi: 10.1007/BF01931367. URL http://link.springer.com/10.1007/BF01931367.

[29] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In Neural Information Processing Systems, page 1097, 2012. URL https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.

[30] N. Srivastava, G. Hinton, A. Krizhevsky, and R. Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Technical report, 2014. URL http://jmlr.org/papers/volume15/srivastava14a.old/srivastava14a.pdf.

[31] S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In International Conference on Machine Learning, page 11, 2015. URL https://pdfs.semanticscholar.org/b58f/1529c22d682dbe08ae02ec52587c9da7f270.pdf.

[32] V. Dumoulin and F. Visin. A guide to convolution arithmetic for deep learning. 2016. URL http://arxiv.org/abs/1603.07285.

[33] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86:2278, 1998. doi: 10.1109/5.726791. URL http://ieeexplore.ieee.org/document/726791/.

[34] C. Szegedy, W. Liu, P. Sermanet, et al. Going deeper with convolutions. Technical report, 2014. URL https://arxiv.org/pdf/1409.4842.pdf.

[35] B. A. Pearlmutter. Learning State Space Trajectories in Recurrent Neural Networks. Neural Computation, 1:263, 1989. doi: 10.1162/neco.1989.1.2.263. URL http://www.mitpressjournals.org/doi/10.1162/neco.1989.1.2.263.

[36] A. Karpathy. The Unreasonable Effectiveness of Recurrent Neural Networks, 2015. URL https://karpathy.github.io/2015/05/21/rnn-effectiveness/.

[37] S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9:1735, 1997. doi: 10.1162/neco.1997.9.8.1735.

[38] D. P. Kingma and M. Welling. Auto-Encoding Variational Bayes. In International Conference on Learning Representations, 2014. URL http://arxiv.org/abs/1312.6114.

[39] I. Higgins, L. Matthey, A. Pal, et al. β-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=Sy2fzU9gl.

[40] S. Zhao, J. Song, and S. Ermon. InfoVAE: Information Maximizing Variational Autoencoders. In International Conference on Machine Learning, page 24, 2018. URL http://arxiv.org/abs/1706.02262.

[41] M. Hjorth-Jensen. Computational Physics 2, 2019. URL https://compphysics.github.io/ComputationalPhysics2/doc/web/course.

[42] S. Kullback and R. A. Leibler. On Information and Sufficiency. The Annals of Mathematical Statistics, 22:79, 1951. doi: 10.1214/aoms/1177729694. URL http://projecteuclid.org/euclid.aoms/1177729694.

[43] K. P. Burnham and D. R. Anderson. Model selection and multimodel inference: a practical information-theoretic approach. Springer, 2002. ISBN 0387953647. URL https://books.google.no/books?id=fT1Iu-h6E-oC&pg=PA51&redir_esc=y#v=onepage&q&f=false.

[44] B. Seybold, E. Fertig, A. Alemi, and I. Fischer. Dueling Decoders: Regularizing Variational Autoencoder Latent Spaces. 2019. URL http://arxiv.org/abs/1905.07478.

[45] E. Harris, M. Niranjan, and J. Hare. A Biologically Inspired Visual Working Memory for Deep Networks. Technical report, 2019. URL http://arxiv.org/abs/1901.03665.

[46] X. Guo, X. Liu, E. Zhu, and J. Yin. Deep Clustering with Convolutional Autoencoders. In Neural Information Processing Systems, page 373, 2017. doi: 10.1007/978-3-319-70096-0_39.

[47] D. Zhang, Y. Sun, B. Eriksson, and L. Balzano. Deep Unsupervised Clustering Using Mixture of Autoencoders. In International Conference on Neural Information Processing, page 373, 2017. URL https://arxiv.org/pdf/1712.07788.pdf.

[48] L. Van Der Maaten and G. Hinton. Visualizing Data using t-SNE. Technical report, 2008. URL http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf.

[49] J. Xie, R. Girshick, and A. Farhadi. Unsupervised Deep Embedding for Clustering Analysis. Technical report, 2015. URL http://arxiv.org/abs/1511.06335.

[50] O. Russakovsky, J. Deng, H. Su, et al. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115:211, 2015. ISSN 0920-5691. doi: 10.1007/s11263-015-0816-y. URL http://link.springer.com/10.1007/s11263-015-0816-y.
