Compressed Sensing and Deep Learning

(1)

Compressed Sensing and Deep Learning

Investigation of stability properties for linear inverse problems

Andras Filip Plaian

Master’s Thesis, Autumn 2021

(2)

Science, with program optionApplied Mathematics and Risk Analysis, at the Department of Mathematics, University of Oslo. The scope of the thesis is 60 credits.

The front page depicts a section of the root system of the exceptional Lie group E₈, projected into the plane. Lie groups were invented by the Norwegian mathematician Sophus Lie (1842–1899) to express symmetries in differential equations and today they play a central role in various parts of mathematics.

(3)

Abstract

Deep Learning (DL) has already begun to find its way into the computational scientist’s toolkit, yet the amount of material on numerical analysis of these techniques is somewhat lacking. In recent years, an instability phenomenon has been discovered when DL is used to solve certain problems in computational science, namely, in inverse problems in imaging. In this thesis, we give a brief introduction to compressed sensing and neural networks, take a closer look at the instability phenomenon in Neural Networks (NN) used for inverse problems through the eyes of numerical stability, and compare DL methods to the more well-established scientific techniques in image reconstruction, namely, Compressed Sensing (CS).

(4)

(5)

Acknowledgements

First and foremost, I would like to express my gratitude to my advisors Øyvind Ryan and Vegard Antun, for always taking their time to answer my questions, for their advice and for their support during the work with this thesis. Secondly, I wish to thank Annika Rigenholt at the Department of Mathematics for her guidance through the master program, especially during the COVID-19 lockdown.

I also want to thank Ida for her endless love and support, especially the last six months. Finally, I wish to thank my parents for their support and for sparking my interest in the natural sciences!

(6)

(7)

A

. . . 4

2.3 Finding suitable reconstruction algorithms . . . 6

2.4 Basis pursuit . . . 10

2.5 Coherence . . . 16

3 Neural Networks And Deep Learning 19 3.1 Supervised machine learning . . . 19

3.2 Design . . . 20

3.3 The Universal Approximation Theorem . . . 21

3.4 Training . . . 22

3.5 The success of deep learning . . . 24

3.6 Instabilities in deep learning . . . 24

3.7 Consequences of the false structure phenomenon . . . 26

3.8 Reasons to go beyond compressed sensing . . . 26

3.9 Deep learning for linear inverse problems . . . 27

4 Instabilities In Deep Learning for Inverse Problems 29 4.1 Universal instabilities in inverse reconstruction methods . . . 29

4.2 False Positives . . . 31

4.3 False negatives . . . 33

4.4 Stability through kernel awareness . . . 35

5 Conclusions 39

Bibliography 41

(8)

(9)

List of Acronyms

AI Artifical Intelligence CS Compressed Sensing DL Deep Learning ML Machine Learning

MRI Magnetic Resonance Imaging NN Neural Network

NP Nondeterministic Polynomial time NSP Null Space Property

P Polynomial time ReLU Rectified Linear Unit rNSP robust Null Space Property UAT Universal Approximation Theorem

(10)

(11)

CHAPTER 1 Introduction

Reconstruction from measurements is an important task in a wide range of scientific, industrial and medical applications, including, but by no means limited to, electron and fluorescence microscopy, seismic imaging, Magnetic Resonance Imaging (MRI) and X-Ray CT. In later years, a variety of different neural networks has emerged, claiming competitive performance to current state-of-the-art techniques [1][2].

An important part of numerical analysis is to figure out if a given algorithm is numerically stable or not. An algorithm is called numerically stable if an error, whatever its cause, does not grow to be much larger during the calculation [3].

In order to estimate the bounds to these errors, there exist the following notion of a Lipschitz constant:

Definition 1.1(Lipschitz constants).Let (X, d_X) and (Y, d_Y) be metric spaces, and let the function f :X →Y be Lipschitz continuous such that there exist a numberK∈Rthat satisfy

d_Y(f(x), f(y))≤Kd_X(x, y) for all x, y∈X. (1.1) The smallest possible non-negative valuedK that satisfy (1.1), is the said to be the Lipschitz constant off.

As we can see from the definition above, the Lipschitz constant gives an upper bound for how much an error in the input of a function may change the error in the output of the function. Depending on the non-negative value ofK, there are three ways the functionf can change the input error:

• K <1, the error reduces, such a function is called acontraction.

• K= 1, the error stays the same.

• K >1, the error increases.

During the exploding interest of Machine Learning (ML) in the last decades, the community have suffered from some growing pains. As there has been a huge push to publish new methods, in many cases being the first is more important than making sure that all the details are correct [4][5]. However, the goal of science is not about being first, but to gain knowledge and insight.

Sometimes we need to be reminded that with the development of new technology,

(12)

we must make sure that the methods we develop are secure, particularly for applications where the lack of stable and accurate methods may result into seriously unfortunate events. Some of these events, that resulted from unstable numerical methods, include The Sleipner A offshore platform accident in 1991, causing an estimated loss of 1.8 billion NOK [6], The Patriot Missile Failure in 1998, killing 28 and injuring 98 people [7], and The Explosion of the Ariane 5 costing 500 million USD [8].

Thesis outline

• In Chapter 2 we introduce the fundamental concepts of compressed sensing, which will be called upon inChapter 4.

• Chapter 3 begins with an introducing to the typical elements of the neural network paradigm. We then continue with a brief look at its successes in recent history, the existence of instabilities in classification problems, reasons to use DL methods in image reconstruction and finally how NN can be defined to solve certain inverse problems.

• In Chapter 4 we study the instability phenomenon in linear inverse reconstruction methods and look at the important role that kernel awareness plays in these methods.

• Finally, in Chapter 5we summarize and point to future work.

(13)

CHAPTER 2 Compressed Sensing

In an underdetermined system of linear equations there are fewer equations than unknowns. In mathematical terms this can be stated as the matrix equation Ax=y, where A ∈ C^m×N, x ∈ C^N , y ∈ C^m andm < N. From classical linear algebra it follows that this equation is unsolvable in the general case, however, given certain conditions, it is possible to find an exact or an estimate of a solution.

The research area associated with these assumptions is called Compressed Sensing (CS). The goal of this chapter is to introduce the reader to the needed concepts in order to determine the required conditions for finding a stable and robust reconstruction method, which then can be used for finding approximate or exact solutions of the linearinverse problem x=A⁻¹yby exploiting the theory of CS. Most of the material presented in this chapter is based upon the book of Foucart and Rauhut, which is a comprehensive introduction to compressed sensing, thus where details are lacking, the reader is referred to [9].

See also the work of Adcock and Hansen [10], for an in-depth treatment.

After this chapter, roughly speaking, we should be able to have some an- swers to the following questions:

• What properties does Aneed to have for x to be reconstructable from the measurement vectory=Ax?

• Does there even exist feasible methods for solving the linear inverse problem?

2.1 Main assumptions: sparsity and compressibility

The main assumption we have about the vectorx, which we are trying to recover, is that it issparse. Informally, a vector is sparse if most of its components are zero. Although, a change of basis might be needed forx to be sparse in the problem of interest. For instance in image reconstruction, the natural images are sparse in the wavelet domain [9]. An interesting non-mathematical side note is that natural scenes sparsely activate neurons in the primary visual cortex in humans [11].

Before we define sparseness formally, we let [N] denote the set{1,2, ..., N} and recall the definition of thesupportof a vector:

(14)

Definition 2.1.The support of a vectorx∈C^N, is the index set of its non-zero entries, that is:

supp(x) :={i∈[N] :x_i6= 0}.

In compressed sensing literature, it is customary to abuse thel₀ notation to denote the cardinality of the support set, i.e. the number of non-zero entries of a vector.Thek·k0fails to be a norm since it does not satisfy the scaling property of norms. We can now defines-sparse vectors:

Definition 2.2.A vector x∈ C^N, is called s-sparse if it has no more thans non-zero entries, that is if: kxk0≤s.

The notion of sparsity is an ideal one, meaning that in the real world it may very well be that the vector we are trying to recover, is only close to being sparse. In order for CS to handle more problems, we may also be interested in the following notion ofcompressibility:

Definition 2.3. Forp> 0, the measure of a vector’s compressibility is given by thel_p-error of bests-term approximation tox∈C^N defined by

σ(x)_p := inf{kx−zk_p,z∈C^N iss-sparse}

Informally, a vector is compressible if thelp-error decays quickly ins. For instance, in image processing; if we can describe most of the variance in the image data with a small number of components of a vector, then the image is said to be compressible.

2.2 Basic requirements of A

What properties does A need to possess in order to have the possibility to reconstruct everys-sparse vector regardless of the reconstruction method? For instance, what is the minimal number of rows thatAneeds to have such that it would be possible to find everys-sparse vector? We have the following result:

(15)

2.2. Basic requirements of

A

Theorem 2.4. For a matrixA ∈C^m×N, the following statements are equivalent:

a) Every s-sparse vectorx∈C^N is the unique s-sparse solution of Az=Ax, that is, if Ax=Azand both zand xare s-sparse, then z=x.

b) The null space ofA, ker(A), does not contain any 2s-sparse vector other than the zero vector, that is, ker(A)∩ {z ∈ C^N : kzk0 ≤ 2s}={0}.

c) For everyS ⊂[N], with cardinalitycard(S)≤2s, the submatrix AS is injective.

d) Every set of 2s columns ofAis linearly independent.

Proof. b) =⇒a) Letx,zbes-sparse vectors, not necessarily unique, such that Ax = Az. Then x−z has at most 2s number of non-zero components, i.e.

being 2s-sparse, then using the linearity ofA, we get thatA(x−z) = 0. If b) holds, then the kernel ofAcontains only the zero vector, and consequently the only wayA(x−z) =0holds true, is ifx=z, thusx must be the unique s-sparse vector.

a) =⇒b) Assume thatxis the uniques-sparse vector such thatAx=Az. Let v∈ker(A) be 2s-sparse. Sincevis 2s-sparse, there exist a way to constructv fromx−zwherex,zare boths-sparse and such that they have no components in common. Assuming a) holds, we then havex=z=v=0.

b)⇐⇒c)⇐⇒d) For a 2s-sparse vectorvwith S = supp(v), ifa1, ....,aS are the column vectors of the submatrix A_S made from A, we get Av = A_SvS. Note that S = supp(v) may be chosen from all possible subsets of [N] with card(S)≤2s andv may be any 2s-sparse vector. Then we can pickS andv such that d) holds. Also, linear independence d), injectivity c) and that the kernel is trivial b) are all equivalent statements aboutA.

With a little knowledge about how to construct invertible matrices, we can derive the following result from the above theorem:

Corollary 2.5. For any integerN ≥2s,there exists a matrixA∈C^m×N withm= 2srows such that everys-sparse vectorx∈C^N can be recovered from the vectory=Ax∈C^m.

Proof. Lett1, t2, ...., tN be strictly positive numbers and consider the matrix A∈C^m×N withm= 2sdefined by

A=







1 1 · · · 1 t₁ t₂ · · · t_N ... ... · · · ...

t^2s−1₁ t^2s−1₂ · · · t^2s−1_N







(16)

Now define a square submatrixAS ∈C^m×mindexed byS ={i1< ... < im} as

A_S=







1 1 · · · 1 ti₁ ti₂ · · · ti_m

... ... ... ...

t^2s−1_i

1 t^2s−1_i

2 · · · t^2s−1_i

m







The given submatrix AS can be recognized as the transposedVandermonde matrix. A result about Vandermonde matrices, is that its determinant in view ofAS, can be expressed as det(AS) =Q

k < l≤m(ti_l−ti_k). Since thet’s were defined to be strictly positive, it follows that det(AS) > 0, and hence thatAS

is invertible. The Invertible Matrix Theorem then tells us thatAS is injective, which means that condition c) in Theorem 2.4 holds, and we then have the uniques-sparse vectorx∈C^N such thatAz=Ax.

2.3 Finding suitable reconstruction algorithms

Recall that the compressed sensing problem consist in finding as-sparse vector x∈C^N giveny=Ax. Abusing the customaryl₀ notation for the number of non-zero entries, we can reformulate this problem as an optimization problem, that is:

minimizekzk0 subject toAz=y. (P0) A naive approach to solving this could be to look at it as a combinatorial optimization problem, and use brute force to calculate every square system A^∗_SASx = A^∗_Sy, for x ∈ C^S where S runs through all possible subsets of [N] with size s. This might very well work on small sizes of N, but the total amount of subsets the algorithm must go through is determined by the formula ^N_s

. A quick illustration shows that withN = 1000, s= 10 we have

1000 10

≥ ¹⁰⁰⁰₁₀ ¹⁰

= 10²⁰linear systems of size 10×10 to solve. Assuming one could solve each system in 10⁻¹⁰seconds, it would still take 10¹⁰seconds, i.e.

several human lifespans, thus the brute force approach is completely unpractical for sufficiently largeN.

P₀is NP-hard

It’s not only the brute force approach that isn’t bounded by a polynomial expression, it might also very well be that there doesn’t even exist an approach bounded by a polynomial expression at all.¹ In fact, we have the following result:

Theorem 2.6.The l₀-minimization problem of an arbitrary matrix A∈C^m×N and a vectory∈C^mis NP-hard.

Before proving the result, let us recall some terminology:

1UnlessP=N P, which is one of the Millenium Prize Problems, see [12].

(17)

2.3. Finding suitable reconstruction algorithms

• The class of P problems consists of all decision problems for which there exists a polynomial-time algorithm, i.e. an input size bounded by a polynomial expression, for finding a solution.

• The class of NP problems, not to be confused with NP-hard, consists of all decision problems for which there exists a polynomial-time algorithm certifyinga solution.

• The class of NP-hard problems consists of all problems for which the solving algorithmcould be transformed in polynomial time into a solving algorithm for any NP-problem.

This means that if we can show that a problem A, solves another problem B, which belongs to a certain class of problems, then problem A also belongs to that same class of problems where B belongs.

Proof. If we can show that a known NP-hard problem can be reduced, in polynomial time, to thel0-minimization problem, thenl0-minimization itself is NP-hard.

Let The Exact Cover by 3-sets problem be our known NP-hard problem. The Exact Cover by 3-sets problem asks that, given a collection{Ci, i∈ [N]} of 3-element subsets of [m], does there exist an exact partition or cover of the set {1,2, ..., m}= [m]?

Let{Ci, i∈[N]}be the collection of 3-element subsets of [m]. Define vectors a1, ...,aN ∈C^m by

a_ij=

(1 ifj ∈ Ci, 0 ifj /∈ Ci. where j∈[m].

Define a matrixA∈C^m×N and a vectory∈C^m by A=

a1· · ·aN

and y= [1, ...,1]^T.

Since N ≤ ^m₃, this construction can be done in polynomial time since

m 3

= m(m−1)(m−2)

3! which is a 3rd degree polynomial inm.

If a vector z ∈ C^N satisfies Az = y, then all the m components of Az are nonzero and kAzk0 = m. Since each vector ai has exactly 3 nonzero components, the vector Az = PN

j=1zjaj has at most 3 kzk0 nonzero components, kAzk0 ≤3kzk0 and consequently kzk0 ≥ m/3 . Now consider the l0-minimization and letx∈C^N be the output. We separate two cases:

1. If kxk0 = m/3, then the collection {Cj, j ∈ supp(x)} forms an exact cover of [m], for otherwise themcomponents ofAx=PN

j=1xjaj would not all be nonzero.

2. If kxk0 > m/3, then no exact cover {Cj, j ∈ J} can exist, for otherwise the vectorz∈C^N, defined byzj = 1 ifj∈J andzj = 0 ifj6=J, would satisfyAz=yandkzk0=m/3, contradicting thel0-minimality ofx.

(18)

This shows that solving l0-minimization problem enables one to solve the Exact Cover by 3-sets problem, and consequentlyl0-minimization problem is

NP-hard.

Pq, where0 < q < 1, also NP-hard

IfP₀optimization is intractable, wouldPq optimization be a better approach?

I.e.

minimizekzkq subject toAz=y. (Pq) Not according to the following proposition:

Proposition 2.7. lq-minimization for0< q <1is NP-hard.

Proof. The proof goes along the same lines as proving the NP-hardness of l0-minimization, that is if lq-minimization can help to verify that a given NP-hard problem has a solution, that is if the minimum ofkwkq subject to Aw=y, equals n, thenlq-minimization must necessarily also belong to the same class of NP-hard or NP-complete problems.

Thepartition problem consists, given integersa₁, a₂, ...., a_n, in deciding whether there exists two setsI, J⊂[n] such thatI∩J =∅,I∪J = [n] andPa_i

i∈I

=Pa_j

j∈J

. Define

A=







a1 a2 · · · an −a1 −a2 · · · −an

1 0 · · · 0 1 0 · · · 0 0 1 · · · 0 0 1 · · · 0

... ... 0 ... ... 0

0 · · · 0 1 0 · · · 0 1







andy= [0,1,1, ....,1]^T.

Letxandzbe the first and second half of the input vectorwtoArespectively.

ThenAw=yimplies that Pn

i=1aixi =Pn

i=1aizi andxi+zi= 1 for all i. We seek to minimizePn

i=1(x^p_i +z^p_i) under these constraints.

The minimum value for the objective function, when only the constraintsxi+yi

are considered, isn, and this occurs only if 0’s and 1’s are involved.

If the minimumkwkq subject toAw=yisn, then it will also benwhen only thexi+yi= 1 constraints are considered. This means that eitherxi orzi is equal to 1, for alli. LetI be the set ofi’s wherexi=1, andJ be the set ofi’s wherezi=1. We then haveI∩J =∅andI∪J = [n], i.e. a partition on the form we seek.

Defining xi as 1 when i∈ I and defining z similarly, we obtain a vector w, where all the constraints are fulfilled and where the minimumnis obtained.

(19)

2.3. Finding suitable reconstruction algorithms

Pq, whereq > 1

If both l₀ and l_q-minimization for 0 < q < 1 is NP-hard, how about l_q- minimization for q > 1? Unfortunately, this optimization problem fails to recover even 1-sparse vectors. We have the following proposition:

Proposition 2.8. Letq > 1, and letA be am×N matrix withm < N. Then there exists a 1-sparse vector which is not a minimizer ofP_q.

Proof. Sincem < N, we have from linear algebra that the kernel ofAis non- trivial. Hence there must exists av6=0∈ker(A) such thatAv=0. Assume, for the sake of contradiction, that all standard basis vectorsej, are minimizers.

Choose aj so thatv_j 6= 0. Then kej+tvk^q_q =|1 +tvj|^q+X

k6=j

|tvk|^q =|1 +tvj|^q+|t|^qX

k6=j

|vk|^q. Define two functions such that

g₊(t) = (1 +tv_j)^q+t^qX

k6=j

|v_k|^q,

g₋(t) = (1 +tvj)^q+ (−t)^qX

k6=j

|vk|^q.

For |t| < 1/vj, kej+tvk^q_q coincides withg+ for t ≥0, andg₋ for t ≤0. Taking the derivatives ofg+ andg₋ we get

g₊⁰ (t) =qvj(1 +tvj)^q−1+qt^q−1X

k6=j

|vk|^q, g₋⁰ (t) =qv_j(1 +tv_j)^q−1−q(−t)^q−1X

k6=j

|v_k|^q.

Taking the limit of the two derivatives, and then forq >1 we get

t→0+lim g⁰₊(t) = lim

t→0−=g₋⁰ (t) =qvj

This means that near 0,kej+tvk^q_q has derivative arbitrarily nearqv_j 6= 0.

Since a linear function with derivativeqv_j has no minimum near 0, then with t = 0, e_j can’t be a minimizer of P_q, which consequently contradicts our assumption that all standard basis vectorse_j are minimizers ofP_q.

The special caseP_q, whereq = 1

Recall that in an optimization problem, we try to minimize or maximize an objective function subject to constraint functions. If the objective function is convex, and all the constraint functions are also convex functions, then the optimization problem becomes a convex optimization problem. Consequently bothl0 andlq-minimization, where 0< q <1, will be non-convex optimization problems, this can easily be verified by looking at the shape of their respective

(20)

norm-balls. If all the constraint functions are linear functions, then the following convex optimization problem:

minimizekzk1 subject toAz=y, (P1) is also a linear program. Efficient algorithms exist for solving LPs. For instance, the Simplex method, behaves like a polynomial-time algorithm for solving real-life LP problems. l1 -minimization is also known asbasis pursuit, and during the next sections, we shall see that given certain assumptions onA, we can solveP0 by solvingP1, and that the recovered solution will be sparse.

2.4 Basis pursuit

In order to show how basis pursuit can identify the solution to the non-convex optimization problemP0, we need to introduce some definitions and theorems.

A crucial property, which we impose on our matrixA, is that it possesses the so-calledNull Space Property:

Definition 2.9. A matrixA ∈C^m×N is said to satisfy theNull Space Property (NSP)relative to a set S ⊂ {1,2, ..., N}if

kvSk1<kv_Sk1for allv∈kerA\ {0}.

A is said to satisfy theNull Space Property of order s, if it satisfies the null space property relative to any setS ⊂ {1,2, ..., N}with|S| ≤s. The next theorem shows that the NSP of a matrix relative to a set S, is a sufficient condition for basis pursuit to find the unique solution ofP₁.

Theorem 2.10.Firstly, given a matrix A ∈C^m×N, then every vector x

∈ C^N supported on a set S, is the unique solution ofP1 with Ax=y, if and only if A satisfies the NSP relative to S.

Secondly, by letting S ⊂ {1,2, ..., N} be any subset with |S| ≤ s, then every s-sparse vectorx ∈C^N is the unique solution of P₁ with Ax

= y, if and only if A satisfies the NSP of order s.

Proof. For the first part, let S be a fixed index set, and assume that every vectorx∈C^N supported on this set, is the unique minimizer of P1. From the assumption, it follows that forv ∈ker A\ {0}, the vector vS is the unique minimizer of P₁. SinceA(vS+v_S) =0and−vS 6=v_S, from the minimality assumption, we must have thatk−vSk1 <kv_Sk1. This established the NSP relative toS.

Conversely, assume that NSP relative toS holds. Letx∈ C^N be supported onS and a vectorz∈C^N,z6=x, such thatAz=Ax. Following the rules of norms and taking complements of the support of a set , we obtain

kxk1≤ kx−zSk1+kzSk1=kvSk1+kzSk1<kv_Sk1+kzSk1=k−z_Sk1+kzSk1=kzk1.

(21)

2.4. Basis pursuit

Which shows thatxobtains the unique minimum.

To prove the second part of the theorem, we let S be any given subset of [N] with card(S)≤s, and assume that everys-sparse vectorxis found by solvingP1 subject toAx=y. Next, we letzbe the solution toP0 subject to Ax=y, thenkzk0≤ kxk0so that alsoziss-sparse. Since everys-sparse vector is the unique minimizer ofP₁, we have thatx=z, and the result follows.

Minimum number of rows of A for the success of basis pursuit From the results above, and particularly from the proof of the second part of the above theorem, it becomes clear that if a matrix possesses the NSP of order s, then basis pursuit will uniquely solveP1, which in turn is the same unique solution toP0. Next, we will introduce a theorem that can help us identify whenA has the NSP of orders.

Theorem 2.11.Given a matrix A∈C^m×N, then every set of 2s columns of A, is linearly independent if and only if A satisfies the NSP of order s.

Proof. Assume that every 2scolumns ofAis linearly independent, then from The Invertible Matrix Theorem [13], we have that the kernel of A does not contain any other 2s-sparse vector, except for the 2szero vector0. Now letx, andzbes-sparse withAz=Ax. Then A(x−z) =0andx−zis 2s-sparse, but since kerA\ {0}is empty, we must have x=z, but this implies that the NSP of ordersholds.

Conversely, assume that the kernel of A does not contain any other 2s- sparse vector other than0, then for any setS with card(S) = 2s,ASx=0has only the trivial solution and thereby 2slinearly independent columns.

From these results we can derive that :

Corollary 2.12.In order for basis pursuit to find a unique s-sparse solution of P0, the number of rows of the matrix A ∈ C^m×N, has to satisfy:

m≥2s.

Proof. Assume that it is possible to uniquely recover anys-sparse vectorxfrom y=Ax. Then, by Theorem 2.4, statement a) holds, and consequently so does d), that is, that every set of 2s columns ofA is linearly independent. Thus rank(A)≥2s, but we also have that the rank of a matrix cannot be greater than the number of rowsm, so we must have rank(A)≤m. Combining and arranging these two inequalities, we get that

2s≤rank(A)≤m

which implies,m≥2s, the inequality in the corollary.

(22)

Stability

Basis pursuit features another important property, namely stability.² This property tackles sparsity defects, that is, to recover a vectorx∈C^N where the error to the true underlying vector is measured by its distance to a s-sparse vector.

Definition 2.13. A matrixA∈C^m×N is said to satisfy thestable null space property with 0< ρ <1 relative to a setS⊂[N] if

kvSk1≤ρkv_Sk1for allv∈kerA

Furthermore, it satisfies the stable null space of orders if it satisfies the stable null space property relative to any setS⊂[N] with card(S)≤s. If a matrix satisfies the stable null space property of orders, then thel1-error to the true underlying vectorxis given by

kx−x⁰k1≤2(1 +ρ)

1−ρ σs(x)1. (2.1) The next result will give us a condition for when, the inequality (2.1) and the stable null space property will hold.

Theorem 2.14. The matrix A ∈C^m×N satisfies the stable null space property with constant 0< ρ <1 relative toS if and only if

kz−xk₁≤ 1 +ρ

1−ρ(kzk₁− kxk₁+ 2kx_Sk₁) (2.2) for all vectorsx,z,∈C^N withAz=Ax.

Before proving Theorem 2.14, we will show that (2.2) implies (2.1). Let S be the set of theslargest non-negative components ofx, thenkx_Sk1=σs(x)1. Ifx⁰ is a minimizer of (P1), thenkx⁰k1≤ kxk1andAx⁰=Ax. The right-hand side of the inequality (2.2) withz=x⁰ will then be equal to the right hand side of inequality (2.1).

Proof. To prove Theorem 2.14, assume that the matrixAsatisfies inequality (2.2) for all vectorsx,z∈C^N withAz=Ax. Given a vectorv∈ker(A), since Av_S =A(−v_S), we can apply (2.2) withx=−v_S andz=v_S. This gives us

kvk₁≤ 1 +ρ

1−ρ(kv_Sk₁− kv_Sk₁).

Rearranging and cancelling equal terms, this can be written as kvSk1≤ρkv_Sk1,

2’Stability’ is meant to mean ’accuracy’ in this thesis, see notes on terminology at the end of section 2.5.

(23)

2.4. Basis pursuit

which we recognize as the stable NSP with 0< ρ <1.

Conversely, assume that the matrixAsatisfies the stable NSP, with 0< ρ <1, relative toS. Then forx,z∈C^N withAz=Ax, we get thatv=z−x∈ker(A) yields

kvSk1≤ρkv_Sk1. (2.3)

Sincekv_Sk1=k(z−x)_Sk1, we can rewrite (2.3) into kv_Sk1≤ kz1k − kx1k+ρkv_Sk1+ 2kx_Sk1. Withρ <1, this can be rewritten as

kv_Sk1≤ 1

1−ρ(kz1k − kx1k+ 2kx_Sk1). Using (2.3) once again, we chain the inequalities to get,

kvk1=kv_Sk1+kvSk1≤(1 +ρ)kv_Sk1≤1 +ρ

1−ρ(kzk1− kxk1+ 2kx_Sk1), which is (2.2), the desired inequality.

Robustness

Basis pursuit features another important property, namelyrobustness.³ This property tackles perturbations in the input.

Definition 2.15. Givenq≥1, the matrixA∈C^m×N is said to satisfy thel_q-robust null space property (rNSP), with constants 0< ρ <1 and τ >0, relative to a setS⊂[N] if

kvSkq≤ ρ

s^1−1/qkv_Sk1+τkAvk for allv∈C^N.

Furthermore, it satisfies thelq-rNSP of orders, if it satisfies thelq-rNSP relative to any setS⊂[N] with card(S)≤s.

In order to introduce a more general result, that follows from thel_q-rNSP, we need an intermediate result which holds for the ordinary rNSP, that is, settingq= 1.

3See notes on terminology at the end of section 2.5.

(24)

Theorem 2.16.The matrix A∈C^m×N satisfies the robust null space property with constants 0< ρ <1 andτ >0 relative to S if and only if

kz−xk₁≤1 +ρ

1−ρ(kzk₁− kxk₁+ 2kx_Sk₁) + 2τ

1−ρkA(z−x)k (2.4) for all vectorsx,z∈C^N.

Proof. Very similar to the proof of Theorem 2.14, we therefore use results from there when appropriate.

Assume A satisfies (2.4) for all vectors x,z ∈ C^N. Let v ∈ C^N be so that v=z−xwithz=v_S andx=−vS which gives

kvk₁≤ 1 +ρ

1−ρ(kv_Sk₁− kvSk₁) + 2τ

1−ρkAvk.

Multiplying by (1−ρ) on both sides and expandingkvk₁ on the left side gives (1−ρ)(kvSk1+kv_Sk1)≤(1 +ρ)(kv_Sk1− kvSk1) + 2τkAvk.

Expanding, cancelling equal terms and dividing by 2 on both sides, results in kvSk1≤ρkv_Sk1+τkAvk,

which is the rNSP with constants 0< ρ <1 andτ >0 .

For the converse, assume A satisfies rNSP with constants 0 < ρ < 1 and τ >0 relative toS. Forz,x∈C^N withv=z−x, the rNSP gives

kvSk1≤ρkv_Sk1+τkAvk, and

kv_Sk₁≤ kzk₁− kxk₁+kvSk₁+ 2kx_Sk₁.

Combining these two inequalities and using the rNSP again, we arrive at kvk1=kvSk1+kv_Sk1≤(1 +ρ)kvSk1+τkAvk

≤1 +ρ

1−ρ(kzk1− kxk1+ 2kx_Sk1) + 2τ

1−ρkAvk, which is the desired inequality.

Theorem 2.17.Let1≤p≤q, and suppose that the matrix A∈C^m×N satisfies the l_q-robust null space property of order s, with constants 0< ρ <1 andτ >0 . Then, for any x,z∈C^N,

kx−zkp≤ C

s^1−1/p(kzk1− kxk1+ 2σs(x)1) +Ds^1/p−1/qkA(z−x)k, where C:= (1 +ρ)²/(1−ρ)andD:= (3 +ρ)τ /(1−ρ).

(25)

2.4. Basis pursuit

In order to prove the result above, we need the following little lemma:

Lemma 2.18.For any q > p >0, and anyx∈C^N, the inequality σs(x)q ≤ c_p,q

s^1/p−1/qkxkp

holds with

cp,q =hp q

^p/q 1−p

q

1−p/qi^1/p

≤1. Particularly, by settingp= 1 andq=p, we get

σs(x)p≤ 1

s^1−1/pkxk1. Proof. (Theorem 2.17)

We observe that if thelq-rNSP holds, then so does thel1-rNSP and thelp-rNSP (p≤q) in the forms of

kvSk1≤ρkv_Sk1+τ s^1−1/qkAvk, (2.5) kvSkp≤ ρ

s^1−1/qkv_Sk1+τ s^1/p−1/qkAvk, (2.6) for allv∈C^N and all S⊂[N] with card(S)≤s. In view of inequality (2.5), applying Theorem 2.16 withS chosen as an index set ofslargest non-negative components ofx, we get

kz−xk1≤1 +ρ

1−ρ(kzk1− kxk1+ 2σ_s(x)1) + 2τ

1−ρs^1−1/qkA(z−x)k. (2.7) Next, we letSbe an index set ofslargest non-negative components ofz−x, and apply Lemma 2.18 to it, which yields

kz−xkp≤ k(z−x)_Skp+k(z−x)Skp≤ 1

s^1−1/pkz−xk1+k(z−x)Skp. In view of inequality (2.6), we get

kz−xkp≤ 1

s^1−1/pkz−xk1+ ρ

s^1−1/pk(z−x)_Sk1+τ s^1/p−1/qkA(z−x)k

≤ 1 +ρ

s^1−1/pkz−xk1+τ s^1/p−1/qkA(z−x)k. (2.8) Substituting (2.7) into the above inequality (2.8), gives us

kz−xkp≤ (1 +ρ)² (1−ρ)

1

s^1−1/p(kzk1− kxk1+ 2σs(x)₁) +(3 +ρ)

(1−ρ)τ s^1/p−1/qkA(z−x)k.

LettingC:= (1 +ρ)²/(1−ρ) andD:= (3 +ρ)τ /(1−ρ), and substituting for it above, gives us the desired inequality from Theorem 2.17.

(26)

The results from this section will be called upon in the latter part of this thesis.

2.5 Coherence

It’s not easy to verify if a certain matrix is suitable for basis pursuit, that is, to check if the given matrix has the NSP. Thus it would be practical if we could find an easier way to calculate a quantity from which we can guarantee the success of the recovery algorithm. Thecoherence is such a quantity.

Definition 2.19.LetA∈C^m×N be a matrix withl2-normalized columns, a1, ...,aN such that kaik2= 1 for alli∈[N]. Thecoherence µ=µ(A) of the matrix Ais defined as

µ:= max

1≤i6=j≤N|hai,aji|.

Thel₁-coherenceµ₁ is defined fors∈[N−1] as µ1(s) := max

i∈N maxn X

j∈S

|hai,aji|, S⊂[N],card(S) =s, i /∈So .

From the definition of coherence, we have the following result:

Theorem 2.20. LetA∈C^m×N be a matrix with l2-normalized columns.

If

µ1(s) +µ1(s−1)<1,

then every s-sparse vector x∈C^N is exactly recovered from the vector y=Ax via basis pursuit.

Proof. If we can show that for a given matrixA∈C^m×N, withl2-normalized columns satisfyingµ1(s) +µ1(s−1)<1 also possess the NSP of order s, then according to Theorem 2.10, every s-sparse vectorxrecovered via basis pursuit, is unique.

Let a1, ...,aN denote the columns of A. If v ∈ker(A), thenAv =0can be rewritten as the weighted sum of the column vectors ofA, i.e. PN

j=1vjaj=0. Since theµ1(s) coherence function is a statement about the sum of the inner products of pairwise columns ofA, it will be convenient to rewrite a particular weightvi as:

v_i=v_ihai,a_ii=−

N

X

j=1,j6=i

v_jhaj,a_ii.

Splitting [N] into two parts S andS, and taking the absolute value of vi, it follows that

|vi| ≤X

l∈S

|vl||hal,aii|+ X

j∈S,j6=i

|vj||haj,aii|.

(27)

2.5. Coherence

Summing over alli∈S we get kvSk1=X

i∈S

|vi| ≤X

l∈S

|vl|X

i∈S

|hal,aii|+X

j∈S

|vj| X

i∈S,i6=j

|haj,aii|

≤X

l∈S

|v_l|µ₁(s) +X

j∈S

|v_j|µ₁(s−1) =µ₁(s)kv_Sk₁+µ₁(s−1)kv_Sk₁. Rearranging and neglecting the terms between the first term and the last in the equations above, we get

(1−µ1(s−1))kvSk1≤µ1(s)kv_Sk1.

Applying the conditional part of the theorem, that is, µ1(s) +µ1(s−1)<1, we get the desired inequality

kvSk1<kv_Sk1, which concludes the proof.

On terminology in Chapter 2

The way we have defined the stability property in section 2.4 for basis pursuit, we see that we refer to the methods ability to accurately recover a vector x. Similarly, when we talk about basis pursuit’s robustness in section 2.4, we refer to the stability of the method with respect to perturbations of the inputy. This difference in notation, is common in compressed sensing, see [9] [14]. As such, when we talk about instabilities in thesis, we refer to the methods robustness.

(28)

(29)

CHAPTER 3 Neural Networks And Deep Learning

Neural Networks (NNs), a biologically inspired programming paradigm, provide amongst the best solutions to many problems where the neural network has a lot of training data. Consequently, NNs perform well in areas such as image classification, speech recognition, natural language processing, and so forth. However, NNs have reportedly been seen to suffer from instabilities in a multitude of domains, notably, in image reconstruction [15][16], which is the domain of interest for this thesis. Our goal is to get mathematical insight to why these instabilities occur. Most of the material presented in this chapter is based on the upcoming book [10]. We begin by introducing the basic elements of NNs, in the framework of supervised machine learning.

3.1 Supervised machine learning

In supervised machine learning, the objective is to learn a mapping f from an input space to an output space. By varying how we define the input and the output space, we can choose what our function f should model. Two big applications for supervised learning are regression and classification. In regression,xandyare both continuous variables, often some vectors from the spacesRⁿ or Cⁿ. The most common form of regression is linear regression, where one tries to find the line that best fit the data. In classification, the input is often some vectorx, but the outputyis restricted to a set of finite known possibilities, usually referred to aslabels.

For instance, ifytakes the values -1 or 1, thenf can model a binary classification problem or ifytakes three or more discrete values, thenf can model a multi classification problem.

For the general supervised learning problem, let us assume that we are given a data set{(xi,yi)}ⁿ_i=1,xi∈X,yi∈Y, which is referred to as the training set.

We then assume, that there exists an ideal, underlying functionf, such that f(x_i) +_i=y_i

for each samplei, wherei ∼ N(µ, σ²) represents a small error. The goal is

(30)

then to compute a mapping ˜f, that approximatef well, such that f˜(x_i)≈y_i, i∈ {1, ..., n},

wherexi,yi are elements from the training set.

The process of computing the approximation mapping ˜f, is known as training the model. There exist many models in supervised machine learning that compute the approximation mapping ˜f, however in this thesis, we will only look at the neural network model.

3.2 Design

The termneural network covers such a large class of models, that we can only afford to give the most general and traditional description that is sufficient for later analysis. Although there doesn’t exist a unique definition of NNs, the following description covers the typical ones:

Definition 3.1.Let n₀, ..., n_L ∈ N. Let W_l : Rⁿ^l−1 → Rⁿ^l for l = 1,2, ..., L be affine maps. Let ρ₁, ..., ρ_L−1 : R → R be activation functions and let n0 =M and nL = N. Then a map Ψ : R^m →R^N given by

Ψ(y) =WL(ρ_L−1(...ρ1(W1(y))...)) is called aNeural Network.

Figure 3.1: Overly simplified neural network with a single hidden layer illustrating the feedforward graph interpretation.

The type of network defined above, is called a feedforward network. The name emphasizes that the network takes an input y and feeds it forward through the network by an alternating sequence of affine maps and non-linear activation functions. It is common to visualize the network as a graph, where the nodes of the graph represent the artificial neurons and the edges in the graph represent the weights, which are found in the entries ofWl. Figure 3.1, gives a crude illustration of this. A layer is a set of neurons. At each layer, each neuron takes in a sum of linear combination, applies a bias, applies an activation function and then feeds the output forward to the next layer of

(31)

3.3. The Universal Approximation Theorem

neurons.

The architecture characterizes the design of network, and consists of the following:

• The depthL, given by the number of layers,

• the widthn0, ..., nL of each layer,

• and the choice of activation functionσ.

The number of parameters in each network grows rather quickly as a function of the number of layers and the width of each layer. More specifically, the number of parameters in a NN is given by:

d= (n1×n0+n2×n1+...+nL×n_L−1) + (n1+n2+...+nL) (3.1) wheren₀, ..., n_Ldenotes the number of neurons per layer and 0, ..., Lis the number of layers.

The choice of activation function may be important to the performance of the network. Common activation functions in the literature include:

ReLU(x) =

(x ifx >0 0 otherwise tanh(x) = e^2x−1

e^2x+ 1 σ(x) = 1

1 +e^−x

Figure 3.2: Common choices of activation functions.

3.3 The Universal Approximation Theorem

As mentioned earlier, in supervised machine learning we are assuming that there exists an underlying "ground truth" function that we are approximating.

The Universal Approximation Theorem gives justification for the use of NNs to approximate such a function.

(32)

Theorem 3.2.(Universal Approximation Theorem) Let σ∈ C(R) and letK⊂Rⁿ be compact.

Then for anyg∈ C(Rⁿ)and any >0, there exists a set of parameters k ∈ N, c1, ..., ck ∈ R, w1, ...,wk ∈ Rⁿ, b1, ..., bk ∈ R such that f :Rⁿ→Rdefined as

f(x) =

k

X

i=1

ciσ(w^T_ixi+bi) (3.2) satisfies

maxx∈K |f(x)−g(x)|<

for all x∈Rⁿ, if and only if σis not a polynomial.

In other words, the class of real-valued neural networks with one hidden layer are dense inC(Rⁿ), in the topology of uniform convergence on compact sets, if and only if σis not a polynomial.

We can easily see that the activation function σ cannot possibly be a polynomial if the above theorem is to hold. Consider a polynomialσof degree m, then for every choice of x∈Rⁿ andb ∈R, σ(w^Tx+b) is polynomial of total degree at mostm, and thus does not span C(S).

The Universal Approximation Theorem 3.2 states that the class of NNs with one hidden layer, is already extremely rich. However, there are reasons for considering deeper NNs, i.e., where the network architecture consists of several hidden layers, since they may approximate certain functions more efficiently than shallow NNs, with fewer layers.

3.4 Training

The training of a NN is to compute the values for the weights and biases such that the NN performs well on a given data set. To simplify notation, we will letθ be defined as the set of all trainable parameters. For a neural network, this will typically be:

θ=

W1, ...,WL,b1, ...,bL .

We write Ψθ to emphasize that this NN usesθas its parameters.

One way to measure how well the network is trained, is by evaluating its performance on a given set, by acost function.

Cost function

Several cost functions may be defined on the network. A popular choice is the quadratic cost function, also known as themean squared error (MSE), is given by:

C(θ|(y1,x1), ...,(yn,xn)) = 1 2n

n

X

i=1

kΨθ(yi)−xik² (3.3)

(33)

3.4. Training herenis the total number of samples,yi the input sample into the network, andxi the corresponding validation sample.

Next, we want to minimize the cost function, we do this by recasting the problem into an optimization problem, where we try to find values for all the trainable parametersd(recall 3.1), that is to attempt to solve:

θ∈Rmin^d

C(θ) (3.4)

This problem is nonlinear, nonconvex and there is generally no possibility of computing global minimizers of it. Even worse, since a NN needs a huge number of parameters to perform well, and a large set of training data, it is a major computational challenge.

Gradient descent methods

To make the training of the NN’s practical regarding the computational demand, a first-order optimization method is usually used to solve (3.4). These are methods based on the gradient descent. In a gradient descent method we pick a starting pointθ0, then we follow the path of steepest descent, that is, we produce the sequence:

θi+1=θi−τi∇C(θi), i= 0,1, ... (3.5) Calculating the gradient requires one pass through all the training data.

When the amount of training data is large, this ceases to be computationally feasible. However, there is at least one way to reduce the number of computational steps, known asStochastic Gradient Descent. There are several variations of the method, the typical difference is in how they set a step length τ in each iteration, but the main idea they have in common, is to compute the gradient from a smaller subset of the training data and thereby reducing the computation needed at stepi. The common algorithm for computing the gradient∇C, is known asbackpropagation.

If we take a closer look at the sigmoid and the hyperbolic tangent activation functions, presented in Figure 3.2, we observe that for large positive or negative inputs, the value of the derivative of these functions may get arbitrarily close to zero. Consequently, the calculated gradient in the cost function may get so small, that the pointθ_i+1 fails to update its value in eq. (3.5), preventing the network from exploring the optimization landscape effectively. In worst case, it may completely stop the neural network from further training, this is known as thevanishing gradient problem.

Overfitting

The goal of supervised machine learning is to find a map that performs well on a multitude of data. However, the choice of network architecture, cost function, training set and optimization algorithm may yield a network that generalizes poorly. Inoverfitting, the network achieves a small error on the training set but fails to perform on a general set. This may occur when the architecture has too many parameters, causing it to learn details in the training data to the extent that it negatively impacts the performance on new data.

Compressed Sensing and Deep Learning