A Neural Network approach to 3D non-LTE Radiative Transfer

(1)

A Neural Network approach to 3D non-LTE Radiative Transfer

Bruce Chappell

Thesis submitted for the degree of Master of Science in Computational Science

Institute of Theoretical Astrophysics University of Oslo

May 18, 2021

(2)

ii

This work, entitled “A Neural Network approach to 3D non-LTE Radiative Transfer”

is distributed under the terms of the Public Library of Science Open Access License, a copy of which can be found at http://www.publiclibraryofscience.org.

(3)

Abstract

Synthetic spectra calculated from model solar atmospheres are central to our understanding of the connection between the observed intensity and the solar atmosphere.

To obtain synthetic profiles in the general case, we must solve the 3D non-LTE radiative transfer problem for the model atmospheres. The current method used to solve this problem is extremely time intensive and requires access to a supercomputer. We have developed and implemented a convolutional neural network to learn the full 3D mapping between the LTE and non-LTE populations for a model atom. The network is generalized across atom structure and can be adjusted to fit any model atom. In this thesis, the network is built to fit a 6-level model hydrogen atom. Once we have the non-LTE populations, we can easily calculate the intensity they produce at any viewing angle assuming complete redistribution (CRD). The network architecture we propose successfully learns the population mappings for individual atmosphere simulations, allowing us to calculate non-LTE populations for a range of time-steps for a given simulation. Our network is also rather small, allowing it to easily be trained on a standard consumer GPU and resulting in a run time saving of multiple days. A challenge that remains is to make the network general across different simulations.

(4)

iv

(5)

Acknowledgments

First off I want to thank Tiago Pereira for being the best supervisor anyone could ask for over this past year and a half. Coming back to school after 2 years off was a big challenge and I felt completely clueless and out of my league at the start. Your guidance, ability to explain things, and always answering my poorly-timed emails has been invaluable in building my scientific understanding and getting to the finish line with this project.

I also want to thank my entire support network back in the US. Getting to move across the world to go to school has been a privilege that most people are not lucky enough to experience, and it would not be possible without the support from my family. And the biggest shout out of all goes to my fiancée Kendal, thanks for keeping me sane and making the best out of the pandemic year. Now on to the next big adventure.

(6)

vi

(7)

Chapter 1 Introduction

The field of astrophysics generally revolves around studying phenomena and celestial bodies far away from Earth. One of these bodies we all deal with every day is our sun.

The sun is quite literally the driver for all life on Earth and therefore an important body to understand. Additionally, we can use it as a proxy laboratory to study matter under conditions that are not replicable here on Earth. The sun itself is a hot, dense ball of gas so we cannot send man-made probes to its surface to study the conditions and make measurements. Instead, we can use the electromagnetic radiation the sun emits to study its local conditions, particularly the atmosphere. This EM radiation is emitted as streams of photons that propagate through space as both waves and particles. We can then study the characteristics of these streams of photons to draw conclusions about the conditions and dynamics of the sun’s atmosphere.

One of our best probes for studying the solar atmosphere are spectral lines. Atomic species in the atmosphere undergo energy level transitions based on the radiative field and thermodynamic properties present, resulting in photons emitted at specific wavelengths. These photons are what makes up the radiation we observe and we can calculate its intensity for wavelengths of interest using the radiative transfer equation and a 3D simulation of the atmosphere. The resulting synthetic spectral line profiles can then be used to understand how physical features of the atmosphere, such as temperature, velocities, and densities, affect the spectra. The physical understanding we extract from studying the connection between synthetic spectra and their atmospheres can then be applied to interpret observed spectra and the atmospheres that produced them.

Specific atomic energy level transitions are formed in different regions of the solar atmosphere. For example, the Hαline is formed from the photosphere to the chromosphere.

Because of this, the spectral line calculated for the Hα transition can be used as a diagnostic specifically for studying the photosphere and chromosphere regions. In these regions, Local Thermodynamic Equilibrium (LTE) does not hold and the radiative

(11)

2 Introduction

transfer equation must be solved for non-LTE (NLTE) conditions. Solving the radiative transfer problem in NLTE is much more complicated than in LTE as the problem becomes non-linear and non-local. The Multi3D code (Leenaarts & Carlsson 2009) solves the NLTE problem and calculates the spectral line profiles from simulated atmosphere cubes produced by the Bifrost code (Gudiksen et al. 2011). Currently, this method is extremely computationally intensive. Producing one time step, or snapshot, of a simulation has a run-time of multiple days when using 5,000 cores. The computational demand of this problem essentially limits solving capability to people with access to a super computer.

This thesis will develop and present a Neural Network framework called SunnyNet to solve the 3D NLTE radiative transfer problem for a given simulation. Due to its relatively small size, the network can be run on a standard consumer GPU, removing the limitation of needing access to a super computer. We also get a significant time saving when using SunnyNet in place of the traditional iterative methods as the entire process from training to calculating spectra takes around 4 hours. We can therefore reduce the total run time from multiple days using 5,000 cores to a few hours using a single machine, allowing us to synthesize spectra for multiple snapshots and simulations at a low computational cost. Comparing our synthetic spectra with observed spectra gives us a way to study how well our models are performing and whether our understanding of the underlying physics is sound.

The proposed SunnyNet network will learn the 3D mapping from LTE to NLTE populations for a model hydrogen atom. These populations will be taken from Multi3D runs and arranged into training sets with the LTE populations as the input data and NLTE populations as the target data. Since we are training using Multi3D results as our data set, the network is really learning to replicate the results of Multi3D. We are therefore using our network to create a “shortcut” for a widely-used radiative transfer code. Learning the NLTE populations will allow us to calculate spectral line profiles for any transition included in the model atom, and at arbitrary inclination angles. In its calculation of the atomic populations, the network will consider radiation effects from all directions, thus approximating the result of the full 3D NLTE radiative transfer problem.

(12)

Chapter 2 Radiative Transfer

Central to our goal of developing a faster way to calculate emergent intensity is understanding how photons radiate through the atmosphere. The following treatment de- scribing the transport equation and various equilibria is adapted fromRutten (1998a) Chapters 3-7 andStix (2012) Chapter 4.

2.1 Transfer Equation

The radiation emerging from a surface can be described by its intensityI_ν for a given frequency ν. Photons do not decay with time and therefore if the radiation beam is propagating through a vacuum its line of sight intensity will remain constant. When propagating through a medium, interactions between the photons in the beam and the atoms in the medium will affect the intensity. As the photons propagate, they can excite and deexcite electrons to and from specific energy levels, thus emitting and absorbing photons with frequencies corresponding to the energy difference of the levels involved in the transition. The change in the intensity at a specific frequency due to these interactions is given by:

dI_ν

ds =j_ν −α_νI_ν, (2.1)

wherej_ν is the emissivity andα_ν is the extinction coefficient. The emissivity describes how the medium adds photons to the beam and the extinction coefficient describes how the medium removes photons from the beam.

We can reformulate equation (2.1) into a more convenient form by defining two new quantitiesSν and τν:

τ_ν =− Z z0

∞

α_ν(s)ds, (2.2)

S_ν ≡ j_ν

αν. (2.3)

(13)

4 Radiative Transfer

τν is called the optical depth and measures path length of the beam in terms of photon penetration as a function of extinction, instead of the geometric path length s. It is measured against the direction of beam propagation, so it can be thought of as looking at a points=z0 in the medium from a points=∞ far away from the surface of the medium. S_ν is called the source function and describes to total effect on the intensity from interactions with the medium. Sν,αν, and jν can be broken down into their line and continuum components by:

S_ν = α^l_νS_ν^l +α^c_νS_ν^c

α^l_ν+α^c_ν , (2.4)

j_ν ≡j_ν^l +j_ν^c, (2.5)

and

αν ≡α^l_ν+α^c_ν. (2.6)

The line components measure the contributions of bound-bound transitions to the emissivity, extinction, and source function and the continuum components measure the contributions of bound-free and free-free processes to the emissivity, extinction, and source function. The line components for emissivity, extinction, and source function and are given by:

j_ν^l = hν₀

4π n_UA_ULψ(ν−ν₀), (2.7)

α^l_ν = hν0

4π [n_LB_LUφ(ν−ν0)−n_UB_ULχ(ν−ν0)], (2.8) and

S_ν^l ≡ j_ν^l

α^l_ν = nUAULψ(ν−ν0)

n_LB_LUφ(ν−ν0)−n_UB_ULχ(ν−ν0). (2.9) The A’s and B’s above are Einstein coefficients and give the transition rates for discrete

transition processes for a specific element. The units are s⁻¹forA_ULand s⁻¹/[W m⁻² Hz⁻¹ sr⁻¹] for BUL/BUL, and they only depend on the element species, not on the local thermo-

dynamics or radiative field. A_UL gives the transition rate for spontaneous deexcitation, B_UL gives the rate for stimulated emission, and B_LU gives the rate for radiative excitation. The symbolsφ,χ, andψare all profile functions for their corresponding transition processes, andν₀is the central frequency of the profile functions for a specific transition.

In this thesis, we will assume complete redistribution (CRD) and thusφ=χ=ψ. There are many free-free and bound-free processes that contribute to the continuum emissivity and extinction. The specific processes relevant to this thesis will be discussed in Section4.3, where we will discuss our algorithm for calculating intensity. In general terms, if we have a continuum extinction α^c_ν, and a continuum source function S_ν^c =

(14)

2.2 Local Thermodynamic Equilibrium 5

Bν(T), where Bν(T) is the Planck function, continuum emissivity is given by j_ν^c = α^c_νB_ν(T).

Using Sν andτν equation (2.1) becomes:

dIν

dτν

=Iν −Sν. (2.10)

Equation (2.10) tells us how the intensity of the beam changes as optical depth changes.

Integrating this equation gives what is called the formal solution:

Iν(τν) =Iν(τ0ν)e^−(τ^0ν^−τ^ν⁾+ Z τ0ν

τν

Sν(τ_ν⁰)e^−[τ^ν⁰^−τ^ν^]dτ_ν⁰, (2.11) whereτ0ν is the optical depth at a reference point. We can integrate from our observa- tion point just above the Sun’s atmosphere,τ_ν = 0, to a point deep in the sun,τ_0ν =∞ to get the total emergent is intensity:

Iν(0) = Z ∞

0

Sν(τν)e^−τ^νdτν, (2.12) which will account for attenuation from radiative effects along the beam’s path through the whole atmosphere. This is the integral we will ultimately solve to compute the emergent intensities around the Hα transition. We now know from equations (2.8) and (2.7) that we need the populations of the hydrogen energy levels to get these values.

2.2 Local Thermodynamic Equilibrium

In some areas of the solar atmosphere, such as the photosphere, we can make the assumption that the gas particles behave under the conditions of LTE. Under this assumption, the distribution of atoms in various ionization and excitation states is given by the Boltzmann-Saha distribution and the velocities are given by the Maxwell distribution. The radiative field behaves like an isotropic black-body with the source function S_ν = B_ν(T) which is the Planck function. All of these distributions depend solely on the temperature T. In areas where matter is roughly in equilibrium with radiation and collisions dominate, the local T can be sufficient to describe the atomic populations and ratio of emission to absorption. This makes solving the transport equation quite easy as all quantities of interest are determined by the local temperature.

In less dense areas of the atmosphere, this assumption does not hold and we need to consider particles under non-LTE conditions.

2.3 Non Local Thermodynamic Equilibrium

When the convenient LTE assumption fails, we can no longer assume that atomic populations follow the Boltzmann-Saha distribution and that the Planck function accurately

(15)

approximates the radiative field. Instead, we can assume statistical equilibrium. This assumes that the level populations are time independent for all points in the medium.

Thus if there is a transition to one level, there must be a simultaneous transition out of that level. This gives:

dni(~r) dt =

N

X

j6=i

nj(~r)Pji(~r)−ni(~r)

N

X

j6=i

Pij(~r) = 0, (2.13)

whereniis the population of the level of interest,N is the total number of energy levels in the species, and P_ij and P_ji are the transition rates from i to j and j to i in s⁻¹ respectively. The transition rates can be broken down into:

P_ij =

(A_UL+B_ULJ¯_ν₀+C_UL i > j

B_LUJ¯_ν₀ +C_LU i < j, (2.14) where CUL and CLU are the rates for collisional deexcitation and excitation. J¯ν0 is the frequency and angle averaged radiative field. This means the intensity is averaged over the transition line profile centered at ν₀ and the full solid angle. To satisfy the conditions of statistical equilibrium, we also require particle conservation at all times:

N

X

i

n_i =n_total. (2.15)

We can now see that in the NLTE regime the energy level populations and the radiative transport equations are coupled to each other. The extinction (and thus the optical depth) and source function both depend on the energy level populations and thus equation (2.10) for a specific transition becomes a function of populations:

dIν

dτ_ν(n_L, n_U) =Iν−Sν(nL, nU). (2.16) Similarly, the transition ratePij’s for all levels are functions ofJ¯ν0 and thusIν:

N

X

j6=i

n_jP_ji( ¯J_ν₀(I_ν)) =n_i

N

X

j6=i

P_ij( ¯J_ν₀(I_ν)). (2.17) Here we lose the simplicity of the LTE regime where populations depended solely on temperature. Now populations at any point in the atmosphere are being influenced by the average radiation coming from all directions in the atmosphere viaJ¯ν0. The problem is now non local as photons created in one corner of the atmosphere contribute to the

(16)

2.4 Calculating Intensity from Simulations 7

radiative effect on the populations in another corner. Thus to calculate the emergent intensity for a region under NLTE conditions, we need to solve equations (2.16) and (2.17) simultaneously for the entire atmosphere and in all directions. The resulting system of equations is large, non-linear, and non-local, which makes it extremely computationally intensive to solve the full system.

2.4 Calculating Intensity from Simulations

The general process for calculating and studying intensity is the same no matter what method we use to obtain our final atomic populations. We start with 3D simulations of the solar atmosphere produced by the Bifrost code (Gudiksen et al. 2011) as our

“test atmosphere” and calculate the populations and subsequent intensity they would produce for different atomic transitions. The variable we get to chose is the method used to get from the simulated atmosphere to the populations we use in the intensity calculation. Below we will go through two existing methods for calculating intensity from atmosphere simulations.

2.4.1 Iterative Methods

When solving radiation in LTE, we simply use the thermodynamic variables from the Bifrost simulations to calculate the populations, source function, and optical depth. We then can use these quantities to solve the transport equation.

Solving in the NLTE regime involves an iterative process, called the accelerated Λ iteration, which is performed by Multi3D. A high-level summary of the algorithm is given by the following steps:

1. Initialize populations, source function, and optical depth using thermodynamic variables from Bifrost simulation under LTE conditions or the zero radiation approximation

2. Calculate the intensity for all frequencies and angles using the transport equation (equation (2.11))

3. Solve the statistical equilibrium equations (equations (2.13) & (2.15)) to get the new populations

4. Update the source function (equations (2.4) & (2.9)) and optical depth (equation (2.2)) using the new populations

5. Repeat steps 2 through 5 until a certain convergence criterion is met

6. Calculate the transport equation one last time using the final populations to obtain the emergent intensity

(17)

2.4.2 Neural Network methods

In recent years, neural networks have been explored as a possible tool to minimize the computational cost of solving the radiative transfer problem. The RADYNVERSION code (Osborne et al. 2019) develops an accurate invertable neural network (INN) that learns the forward and inverse mapping between simulations of flaring solar atmospheres and the line profiles for the Hα and Ca II λ8542 transitions. In this framework, the atmosphere simulations were used as the input data and intensity profiles calculated by the RADYN code (Carlsson & Stein 1992,1995;Allred et al. 2005,2015) were used as the target data. The network handles each vertical column in the atmosphere independently, and thus considers the problem in 1.5D and effectively leaves out radiative effects from other places in the atmosphere and all sources of oblique radiation. As shown in Leenaarts et al.(2012), 3D radiative transfer is essential in calculating hydrogen lines, and therefore learning the 1.5D problem is not ideal. Neural networks have also been used to learn the mapping from Stokes profiles to atmospheric variables as was done in Asensio Ramos & Díaz Baso(2019) and Milić & Gafeira (2020). These three previous works all have either a partial or full focus on solving the inverse problem, leaving the forward problem largely unexplored.

(18)

Chapter 3 Machine Learning

Machine learning popularity has taken off in the last few years as computing power and GPUs have become more readily available to the public. And while the name is becoming somewhat of a technical “buzzword”, the actual mathematical and statistical concepts that make up the field remain a mystery to most. At its core, machine learning is a branch of artificial intelligence where a program can independently learn and im- prove its performance at predicting specified outputs given corresponding inputs from a data set. This often entails an iterative process of guesses and checks that updates the program based on its prediction error for each round of guesses. After enough guesses and subsequent adjustments, the program should have ideally converged to the lowest possible prediction error.

Within machine learning there are three branches:

• Supervised Learning

• Unsupervised Learning

• Reinforcement Learning

In supervised learning, each input in the data set has a corresponding “true” output, or target. This branch fits the guess, check, adjust description best. Unsupervised learning involves using an input data set without a target set. Reinforcement learning is an entirely different system that takes advantage of an “agent”’s resulting rewards or punishments from interactions with an environment with the objective being to maximize the reward. For a deeper dive into the differences and applications of the different branches seeAurélien(2017).

In this thesis, we will focus specifically on a subsection of supervised learning called Convolutional Neural Networks (CNNs). To understand these network architectures,

(19)

10 Machine Learning

it is advantageous to understand the Multilayer Perceptron (MLP) and then extend to CNNs.

3.1 Multilayer Perceptron

In very general terms, MLPs and all other Neural Networks are used to learn an approximation for a mappingf from a set of inputs X to a set of desired outputs y,

f(X_i,θ) =y_i. (3.1)

The best possible approximation off, calledh, is given by:

h X_i,θˆ

=y˜_i, (3.2)

whereθˆare the parameters learned by the network, called weights, andy˜i is the predicted output of the network. The parameters θˆare learned by minimizing a function L(θ)ˆ that compares the true y_i’s to the predicted y˜_i’s, called the loss function. The general form of this function is given by:

minθˆ

L θˆ

= min

θˆ n

X

i=1

L

θ;ˆ yi,y˜i

. (3.3)

The process of minimizing the loss, called training, is an iterative process. Data points from a training set are fed into the network (Section3.1.1), the loss is calculated (Section 3.1.3), and then the weights are updated (Section 3.1.4). One full training iteration, called an Epoch, is when all of the points in the training data have passed through the network. We will now take a closer look at the different steps when building and training a network.

3.1.1 Forward Pass

All networks start with a forward pass of input data. The general structure is a series of layers of computational nodes inspired by neurons in the brain, as proposed in Rosenblatt (1958). Each node performs a linear transformation to the input data X = (X₁, X₂, ..., X_n) by multiplying each input feature with its corresponding weight and summing the results. This weight value can be thought of as a number that measures the importance of each input feature and scales how much it contributes to the output. A non-linear activation function f, which we will explain in Section 3.1.2, is then applied to get the output afor that node. The forward data flow for one node is shown in Figure3.1.

This process can be expanded to include multiple layers made up of multiple nodes as shown in Figure3.2. When stacking layers the outputs of the previous layer become the

(20)

3.1 Multilayer Perceptron 11

𝑓 "

!"#

$

𝑋_!𝑤_! 𝑋_#

𝑋_%

𝑋_$

𝑤_# 𝑤_%

𝑤^$

𝑎

Figure 3.1: Visualization of forward data flow through a single artificial neuron.

𝑋_#

𝑋_%

𝑋_$

𝑎_#^(#)

𝑎_%^(#)

𝑎_$^(#)

𝑎_#^(%)

𝑎_%^(%)

𝑎_$^(%)

&𝑦_#

&𝑦_%

𝑤^(#) 𝑤^(%)

𝑤₍₍₎

*+)

Figure 3.2: Visualization of forward data flow through a Multilayer Perceptron network.

(21)

12 Machine Learning

inputs to the next layer. The equations for a forward pass through the network become the following:

a⁽¹⁾=f⁽¹⁾

w⁽¹⁾Xinputs

(3.4)

...

a⁽ⁱ⁾=f⁽ⁱ⁾

w⁽ⁱ⁾a⁽ⁱ⁻¹⁾

(3.5)

...

˜

y=f^(Last)

w^(Last)a^(Last−1)

. (3.6)

The bold face characters in equations (3.4), (3.5), and (3.6) indicate that the quantities are now matrices. We now have a full definition for the forward pass of the network, and need to understand how the network handles non-linear problems and self adjusts the weight matrices to make better predictions.

3.1.2 Activation Functions

An activation function is the mathematical operation applied to the output of each layer. The function, f, is applied to introduce non-linearity into the network and to restrict the value of the outputs as they will be the inputs to the next layer. Some commonly used activation functions are:

ReLU(z) =max{0, z}, (3.7)

tanh(z) = e^x−e^−x

e^x+e^−x, (3.8)

Sigmoid(z) = 1

1 +e^−x. (3.9)

Non-linearity is needed because of the complexity of the patterns Neural Networks attempt to learn. If one were to omit activation functions, each layer would reduce to:

a⁽ⁱ⁾=w⁽ⁱ⁾a⁽ⁱ⁻¹⁾, (3.10)

which is a simple linear regression problem and ill-suited for highly complex patterns.

The three functions above also act as a limiting mechanism for the outputs, with tanh and sigmoid “squishing” the output to values between −1 and 1, and ReLU dropping all negative values.

In recent years, the rectified linear unit (ReLU) (equation (3.7) and Figure 3.3) has replaced sigmoid and tanh as the activation function of choice for most neural network architectures (Solberg 2020). The ReLU function avoids the problem of vanishing gradients that its predecessors ran into and is well suited for deep neural networks.

(22)

1.00 0.75 0.50 0.25 0.00 0.25 0.50 0.75 1.00 z

0.0 0.2 0.4 0.6 0.8 1.0

ReLU(z)

Figure 3.3: Plot of the ReLU activation function.

3.1.3 Loss and Backpropagation

After each forward pass of the network, the loss function is calculated to measure how wrong the outputs of the network are. To determine each individual weight’s contribution to the total loss and update them to minimize this loss, we implement the backpropagation algorithm of Rumelhart et al. (1986). This algorithm leverages the chain rule and calculates the partial derivative of the loss with respect to each weight, thus telling us how much each weight contributes to the loss and allowing us to update them accordingly.

For this thesis specifically, the chosen loss function was the Mean Squared Error. The loss for one data point is:

l_i(θ) = (y_i−y˜_i)², (3.11) and the loss over all the data points is:

L(θ) = 1 N

N

X

i=1

li(θ) = 1 N

N

X

i=1

(yi−y˜i)². (3.12)

This loss measures the squared L2 distance (Stanford University 2019) between the targets and predicted outputs.

3.1.4 Gradient Descent

The next step in our process is to backpropagate the loss through the network and collect all the partial derivatives. With these in hand we are ready to update the weights using a method called gradient descent as presented inMurphy (2013).

(23)

14 Machine Learning

𝐿 "𝜃

"𝜃 𝐿 "𝜃

"𝜃 𝜆too small

𝜆too large, unstable

Figure 3.4: The left plot shows a small learning rate that will take too many iterations to converge, increasing computational cost. The right figure shows a large learning rate that results in unstable minimization and never converges.

The collection of partial derivatives ofL θˆ

with respect toθˆconstitutes the gradient of the loss,∇L

θˆ

. The gradient of a function tells us how fast the function is changing at a certain point in the parameter space. The gradient of the loss thus tells us how fast the loss function is changing for a certain collection of weightsθˆ. To reduce the loss we therefore update the weights in a way that moves us to a new point in the parameter space in the negative direction of the loss.

The mathematical definition of gradient descent is given by:

θ_i+1=θ_i−λ∇_θL(θ), (3.13)

withλrepresenting the learning rate. This is typically a user-defined scalar that controls how big of a step in the direction of the negative gradient the algorithm takes. The selection of the learning rate can have a big effect on whether the network ever converges to the minimum of the loss function. Figure3.4 illustrates how λbeing too small can lead to a significant increase in the amount of iterations needed to converge to the minimum and also how λ being too large can result in unstable training that never converges.

(24)

3.1.5 Mini Batch Gradient Descent

To introduce randomness in our training and avoid getting stuck in local minima in the loss function, we will use a modified version of Gradient Descent called Mini Batch Gradient Descent (MBGD). For each training iteration, we randomly group the input data into subgroups called mini batches. Batches are defined by:

B_k, k = 1. . . n/M, (3.14)

wherenis the total number of training samples andM is the number of batches. Batch size is a user defined variable and typically ranges anywhere from 2⁶ to 2¹⁰ samples.

We then randomly select a mini batchB_k, feed it through the network and update the weights by performing a gradient descent step for that batch. The gradient descent step becomes the following:

θi+1 =θi−λX

i∈B_k

∇_θli(θ). (3.15)

See Hjorth-Jensen (2020) for an excellent in-depth explanation of all things gradient descent.

3.1.6 Adam optimizer

The Adam optimizer of Kingma & Ba (2014) makes further improvements to MBGD by having an adaptive learning rate for each parameter in the network. These learning rates are updated with each gradient descent iteration based on the first moment (running average) and second moment (uncentered variance) of the gradient at each step. Adaptively updating the different learning rates allows Adam to converge faster to the minimum by taking large steps where the loss function is steep and small steps where it is flat (Ruder(2016)). The weights update for one mini batch using the Adam optimizer is given by:

θi+1 =θi− λ

√vˆ_t+mˆt, (3.16)

wheremˆ_t is the bias corrected first moment estimate given by:

ˆ

m_t= m_t

1−β₁^t, (3.17)

with,

mt=β1mt−1+ (1−β1)X

i∈B_k

∇_θli(θ), (3.18)

and ˆvt is the bias corrected second moment estimate given by:

ˆ

vt= vt

1−β^t₂, (3.19)

with,

vt=β2vt−1+ (1−β2)



 X

i∈B_k

∇_θli(θ)





2

. (3.20)

(25)

16 Machine Learning

1 2 4 1 1 3 1 5 2 1 6 1 1 3 6 3 1 2 4 1 1 3 1 5 2 1 6 1 1 3 6 3 1 2 4 1 1 3 1 5 2 1 6 1 1 3 6 3 0 0 0

1 1 1 2 2 2

w

X

25 23 29 z

Figure 3.5: The first three steps of a filterwconvolving over a 1 channel data pointX.

3.2 Convolutional Neural Networks

Following the success of AlexNet (Krizhevsky et al. 2012) at the ImageNet Large Scale Visual Recognition Challenge in 2012, convolutional neural networks (CNNs) have ex- ploded in popularity for all things image processing and classification. Instead of learning the weights of fully connected layers, CNNs learn a set of filters that slide, or convolve, over the input data creating output feature maps. The pixel values of the data point are multiplied with the filter values they overlap with and the results are all summed together. The equation for the 2-D convolution operation is given by:

z[p, q] =w∗x=

K

X

r=−K K

X

s=−K

w[r, s]·x[p+r, q+s], (3.21) wherez[p, q]is the corresponding pixel in the outputz,w is the learnable filter, and x is the input data point. This is often hard to conceptualize and is best described with a visual which is shown in Figure3.5. As the network runs and the values in the filters are updated, the filters begin to learn to pick out certain features of the input such as edges, dark and light areas, and contours if the input is an image (Tollef 2019). In one layer, we can pass as many different filters as we want over the input data. The resulting output maps orz’s we create get stacked depth-wise and thus our layer output shape is (filters, z_x, z_y). We can control the shape of the output z’s by tinkering with the stride and padding of the filters. The shape of any of the spatial output dimensions is given by

Nⁱ⁺¹ = Nⁱ−F+ 2P

S + 1 (3.22)

(26)

3.2 Convolutional Neural Networks 17

3 x 24 x 24 input

9 (3x5x5) filters With S=1, P=0

9 x 20 x 20 feature map

9 x 10 x 10 feature map

ReLU activation and (2 x 2) MaxPool With S=2

Figure 3.6: An example of a simple CNN block

where Nⁱ⁺¹ and Nⁱ are the spatial size before and after the filter is applied, F is the spatial filter size, P is the filter padding, and S is the filter stride. Padding is simply applying a frame of zeros around the input data, and stride tells us how many pixels to move the filter in a certain direction as we convolve. Figure3.5shows an instance of no padding and anxandy stride of 1. The parameters can be adjusted to both reduce and expand dimensionality as we iterate through the layers.

While Figure 3.5shows a 2D convolution across one input channel, we are often inter- ested in input data with multiple channels. 2D images tend to have three color channels and therefore have filters of shape (3×L×H). We can extend filter convolutions to any number of dimensions with 1D filters having shape (channels×L) and 3D filters having shape(channels×W ×L×H).

A small note on dimension ordering. All dimensions in this thesis will follow the conven- tion(channels×z×x×y). We choose this to preserve consistency with the dimension ordering in PyTorch.

3.2.1 General CNN Architecture

Typically, CNNs consist of layers of convolutional “blocks”. These blocks can vary greatly in architecture as seen in the famous ResNet (He et al. 2015) and Inception-v4 (Szegedy et al. 2016) architectures. For this thesis we will focus on blocks consisting of a convolution layer(s), an activation function, and a max pooling layer as shown in Figure3.6.

(27)

18 Machine Learning

The max pooling layer simply convolves over the feature map with a given stride and takes the max value within its filter window. This operation is used to pull out dominant features of an input. This is analagous to zooming out of an image. As you zoom father out, the image becomes more fuzzy and the main features begin to dominate smaller details. The equation which describes the output shape of a maxpool layer is given by:

Nⁱ⁺¹ =bNⁱ−F

S c+ 1. (3.23)

We can now stack different shaped convolutional blocks on top of each other to form our network. After the last convolutional block, we include a series of fully connected layers to untangle the resulting feature map from the convolutional blocks and resize it to fit our desired output.

3.3 Training

Once our architecture is decided it is time to train the network. We split our data into a training, validation, and test set typically in sizes of80%,10%, and10%. One training epoch consists of the whole training set passing forwards and backwards through the network followed by a forward pass of the validation set. While parameters are updated for the training set, the validation set is simply used to calculate the loss of the network with non-training data.

As we iterate through epochs, the training loss will continually go down as the network learns the patterns in the training data. At the end of each epoch, we do a forward pass of the validation set through the updated model. This gives us a way to evaluate the current network’s performance on unseen data. Eventually, the model will begin overfitting the training data with training loss continuing to go down, and validation loss beginning to creep up. This is a sign that the network is losing generality by learning patterns too specific to the training set and not the general problem.

To counteract this, we make a temporary save of the model parameters for an epoch if the validation loss is less than for the epoch before. After a certain threshold of n epochs without a new lowest validation loss we break the training loop and permanently save the model we have earmarked with the lowest validation loss. This process is called early stopping and helps us preserve generality in our networks. Figure3.7provides an illustration the tradeoff between training and validation loss as the network is trained.

The crux of Neural Network training is that there are so many user-defined parameters and no set rules for how to initiate them. It is up to the user to decide the number of layers, the number of filters per layer, stride, padding, maxpool size, activation functions, initial learning rate, and early stopping tolerance. As a result, much of the work becomes a process of guessing, training, and testing with different architectures.

(28)

3.3 Training 19

Loss

Epochs

Validation Loss

Training Loss

Best model

Figure 3.7: Training and validation loss as a function of Epochs. The model for the epoch at the minimum of the validation loss curve is the optimal model, as the model begins to overfit the training data after this epoch.

(29)

20 Machine Learning

(30)

Chapter 4 SunnyNet

In this chapter we will go through the structure and implementation of the SunnyNet.

This network helps to solve the 3D NLTE radiative transfer problem by learning the mapping from LTE populations to NLTE populations calculated by Multi3D for snapshots of a given Bifrost simulation. In the breakdown given in Section 2.4, this corresponds to steps 2 through 5. Therefore, the network is giving us a method for attaining the final populations without going through the accelerated Λ iteration in Multi3D.

We can then use the final populations to solve the transport equation and obtain the intensity line profiles for a given transition. This data flow is shown in Figure4.1.

Learning the mapping from LTE to NLTE populations, as opposed to LTE populations to output intensity, gives our results a degree of flexibility. If we were to learn from LTE populations directly to the line profile for a given transition, as is done inOsborne et al.

(2019), we would have to train a new network for each transition. By instead learning the final NLTE populations for a given element, we can use our intensity method to calculate the line profile for any transition present in the model atom. The advantage here is time savings, as training neural networks is the most time intensive part of the process.

LTE SunnyNet Non-

LTE Intensity

Calculation

Figure 4.1: The general data flow of the problem is shown, going from input LTE populations to output intensity for atmosphere cubes.

(31)

22 SunnyNet

4.1 Data Structure and Preparation

The starting point for creating testing and training data sets for the SunnyNet is hydrogen populations from Multi3D runs. Most simulations are run for points in a time series, giving a “snapshot” of the simulation at each time. Each snapshot object contains the atomic level populations for a specific model atom in the atmosphere under LTE conditions, and the final NLTE atomic level populations obtained by Multi3D. The Multi3D populations we will use to build our data sets were calculated for atmospheres composed of theoretical 6-level hydrogen atoms, with the last level being ionized hydrogen. The shape of our starting atmospheres is therefore (6, z, x, y), wherex,y, and zcan be different from simulation to simulation.

4.1.1 Input and Output Shapes

The simplest structure for the input and output data is to consider the 1.5D problem.

This structure is shown in Figure4.2 where each column at a unique X,Y in the LTE atmosphere is treated as an independent input X_i with a corresponding column from the NLTE atmosphere as the output point y_i. As discussed in Section 2.4.2, building our input/target pairs in this way ignores all oblique radiation. Training a network on data pairs with this structure will therefore result in it not considering the full 3D problem.

Z

LTE input 𝑋_!

Z

NLTE target 𝑦_!

Figure 4.2: Illustration showing howXi and yi input/output pairs are constructed for the 1.5D case.

To ensure the network is considering radiative effects from all directions, we construct the data pairs as shown in Figure4.3. Here, a 3x3 group of LTE columns is the input

(32)

4.1 Data Structure and Preparation 23

LTE atmosphere 𝑋_! (LTE) 𝑦_! (NLTE)

Figure 4.3: Input and target pairs for SunnyNet. The yellow arrows represent how radiation can propagate in all directions through the atmosphere.

data and the corresponding NLTE column of the center pixel is the target data. By feeding the network 3x3 input cubes we are ensuring that information from neighboring cells goes into the network when predicting the NLTE populations of the center pixel.

4.1.2 General Data Set Assembly

Whether we are building a data set for training a network or for calculating NLTE populations using an already trained network, the beginning process is always the same.

We start with the full LTE and NLTE population cubes for a given snapshot and break it down into the input/target pairs described in Section4.1.1. This is done by convolving a 3x3 window across all suitable X/Y points of the LTE cube and pairing them with the NLTE column that corresponds with the center of the window. These pairs are then stored in a list with shape [(6, z,3,3),(6, z,1,1)]. Figure 4.4provides a visual for this process for the sake of clarity.

The population values we are using are particle densities in units of particles per m³ and have original values ranging from∼10⁻³ to ∼10²⁴. Due to the huge range in the input and output data, we take thelog₁₀ of the data before putting it into any of the data sets.

4.1.3 Training Set Assembly

When putting together our training sets we need to consider what the goal for our network after training is. If we want a very generalized network we will want to include

(33)

24 SunnyNet

X

Y

Z

Figure 4.4: Top down image (z axis into the page) of convolving across the LTE population cube. Each time the window reaches the edge of the X/Y plane, it jumps down one row and starts the process again from the left. This continues until it has passed over the whole cube.

snapshots from different simulations in our set. This exposes the network to diverse atmosphere conditions and gives it a better chance at learning the population mappings under all conditions. We can also construct a training set of multiple snapshots from the same simulation. In this case the network will learn the population mappings under more specific conditions, potentially increasing accuracy when testing the network on other snapshots from the same simulation.

Once we have decided what snapshots we want to include in our training set, we need to standardize the data. This involves scaling each feature so that its average and standard deviation across the data set is µ = 0 and σ = 1. Doing this puts all the features in a similar numerical scale which helps make training more stable. In our data set, the features are the energy levels and their populations. To standardize, we apply the following transformation to each point in the data set:

Xˆn,z,x,y = X_n,z,x,y−µ_n σn

, (4.1)

whereµn and σn are the mean and standard deviation calculated across the entire set for a specific energy level n. We will do this separately for the inputs and targets of each training set we use. Once we have scaled our training data, we save theµ_n’s and σn’s to apply to future test data, giving it the same scale as the data the network was trained on.

(34)

4.1 Data Structure and Preparation 25

4.1.4 Test Set Assembly

The test set assembly process involves processing just the LTE inputs of a snapshot.

The most significant thing to note is that the snapshot cannot be one that was used in the training set. We need data completely unseen by the network to get an unbiased view of how well it can predict the NLTE populations. After splitting the original data cube into the list of (6, z,3,3) LTE windows we need to standardize the data as described in Section4.1.3 using the same µ_n’s andσ_n’s that were used on the training set the network was trained on.

4.1.5 Column Mass Dimension

So far, we have not made note of what to do when the original Bifrost snapshot objects do not have the same shape. Neural networks are rigid with respect to input and output dimensions, requiring a new network to be trained for any new input or desired output dimension. Our choice to break up the input data into[6, z,3,3]cubes and output data into[6, z,1,1]columns leaves us only having to account for possible differences in thez dimension.

Since radiation is not sensitive to physical height, but rather the density of the atomic species in question, we make a unit transformation along the z axis from distance (m) to horizontally-averaged column mass (kg m⁻²) and interpolate to a fixed range. Doing so allows us to take a snapshot with any arbitrary height scale, and interpolate it to a set, physically relevant column mass scale. The column mass of a(z,1,1)atmospheric column at a depth pointz0 can be calculated by:

σ(z₀) = Z z0

zsurface

ρ(z)dz, (4.2)

where ρ(z) is the mass density for the column in kg m⁻³. Column mass tells us the mass per unit area considering the entire atmospheric column above a given height. To make this transformation we perform the following steps:

1. Collect simulation mass density ρ with shape (x, y, z) from the Multi3DAtmos object for the simulation

2. Take mean of ρ along x and y axis leaving a vector rho_mean of shape (z,1,1) containing the averageρ for each depth pointz_i

3. Cumulatively integrate rho_meanalong the z axis from top to bottom attaining the vector cmass_meanwhich contains the mean column mass for eachz_i

4. Create interpolation functions usingcmass_meanfor the full LTE and NLTE cubes and the originalz vector

5. Use the three interpolation functions to interpolate the quantities to the new scale, cmass_scale, given bynumpy.logspace(-6,2,400)

(35)

26 SunnyNet

We can now interpolate any physical quantity with the simulation’s original z scale to our set column mass scale cmass_scale via cmass_mean, resulting in a 400 point z dimension that evenly spans 10⁻² to 10⁶ kg m⁻² in log space. This eliminates the need to adjust network architecture based on z resolution and lets us train and test our network using a collection of simulations and snapshots that may not originally be the same scale inz. Figure4.5shows the relationship between the original height scale

0.0 2.5 5.0 7.5 10.0 12.5 15.0

z [Mm]

10

¹⁰

10

⁸

10

⁶

10

⁴

Av er ag e C olu m n Ma ss [k g m

2]

(a) Average column mass versus atmosphere height.

0.0 2.5 5.0 7.5 10.0 12.5 15.0

z [Mm]

10

⁹

10

⁷

10

⁵

10

³

mean

[k g m

3

]

(b) Average mass density versus atmosphere height.

Figure 4.5: The two plots show how mass density and column mass change with height in the atmosphere. We see that most of the mass is in the lower part of the atmosphere and that it quickly becomes less dense aroundz= 0.2×10⁷m.

and average column mass and mass density for a snapshot of a simulated atmosphere.

The snapshots we will use in our training and testing sets will all have a similar mass distribution to the example in Figure4.5.

(36)

4.2 Network Structure 27

4.2 Network Structure

Having defined how we transform our data from non-uniform Multi3DOut objects to usable data sets, we will now fully define the structure of the SunnyNet. Typically, input data to CNNs is three-dimensional, with two spatial dimensions and one channel dimension. In Section 3.2, we developed the general 2D convolution used for inputs with two spatial dimensions. Since our input data has three spatial dimensions and one channel dimension, we will need to make us of 3D convolutions.

4.2.1 3D and 1D Convolution Layers

With a base knowledge of 2D convolutions, it is easy to extrapolate how 1D and 3D convolutions work. A 1D convolution is the simplest as the filters only have one spatial dimension and move in one spatial plane. In our network this will be the z direction, giving us filters of shape(channels, z). 3D convolutions follow the same logic with a 3 spatial dimension filter that moves in the x, y, and z planes, giving us a filter shape of (channels, z, x, y). Figure 4.6 shows how 3D convolutions will be applied to our (6,400,3,3)shaped input data. We restrict the spatial(3,3,3)filters to only move in the zdirection, producing an output feature (green cell in Figure4.6) for each convolutional step. These outputs are stacked on top of each other as the filter slides down the input column and are used as the input to the next layer.

Taking the center column of the input data (red column in Figure4.6) to be the points of interest, each convolution step includes information from all the nearest neighbors of the pixel at the center of the filter. We also get information from each channel, or energy level, for the center pixel at each step. The resulting output feature thus contains information from all immediate neighboring cells across all energy levels. Our network can therefore learn the full 3D radiative transfer problem, and not just the 1.5D problem.

4.2.2 Layer Arrangement

We are now ready to construct the full network. The arrangement of layers often requires quite a bit of guess work as each application of neural networks is so unique and there is a limited “best practice” standard. There are undoubtedly many different and complex network structures that would work well for the radiative transfer problem. We found that the architecture shown in Table4.1both performed well and was relatively simple.

Figure4.7provides a visual for what happens from input through the second 1D Conv block of the network as it can be difficult to keep track of what is going on as the input data passes through the layers due to different kernel sizes, strides, and padding. As the input data passes through the network, it is shortened in the spatial dimension z but expanded in terms of channels. It is important to note that, although not shown in Table4.1, each convolutional layer and the first linear are followed by a ReLU activation function.

(37)

28 SunnyNet

N = 1

N = 2 ... 5

Convolution

Addi tion

N = 6, H⁺

Figure 4.6: The yellow(3,3,3)filter starts at the top and slides down the data column multiplying the filter values by the cell values at each step. It then adds the results of the filter calculations for each channel together to get the final output for the convolutional step.

(38)

Layer Type Kernel Shape Output Shape Learnable Parameters

3D Conv (3,3,3) (−1, 32, 400, 1,

1) 5,216

1D Conv (3) (−1, 32, 398) 3,104

1D MaxPool (2) (−1, 32, 199) 0

1D Conv (3) (−1, 64, 197) 6,208

1D MaxPool (2) (−1, 64, 98) 0

1D Conv (3) (−1, 128, 96) 24,704

1D MaxPool (2) (−1, 128, 48) 0

Linear NA (−1, 4,700) 28,881,500

Dropout NA (−1, 4,700) 0

Linear NA (−1, 2,400) 11,282,400

Input Shape: (−1, 6, 400, 1, 1) Total Params: 40,203,132 Params size (MB): 153.36

Table 4.1: Layers of the SunnyNet presented in order from the first 3D convolutional layer to the output linear layer. The stride for 3D Conv, 1D Conv, and 1D MaxPool are 1, 1, and 2 respectively. The 3D Conv layer is also 0-padded in thezdirection. The

−1 dimension in the “Output Shape” column is a placeholder for however many input samples are in each batch.

(39)

30 SunnyNet

C = 32 C = 32

C = 6

C = 32 C = 64

C = 64

Z = 400

Z = 400 Z = 398 Z = 199 Z = 197 Z = 98

3D Co nv, Re

LU

1D Conv, Re LU

1D Ma xPool

1D Co nv, Re

LU

1D Ma xPool

Input Data

Figure 4.7: Visualization of the first three convolutional layers of the network showing how the data changes shape as it passes throught the layers.

(40)

t

Without Dropout With Dropout

Figure 4.8: Standard data flow through two hidden layers versus flow through the same architecture with two nodes turned off.

After the last 1D MaxPool layer, we flatten the output from shape (−1,128,48) to (−1,4700)and feed it through two linear layers to get the final output of shape(−1,2400). In between the linear layers we also include a dropout layer. First proposed in Srivas- tava et al.(2014), this layer randomly “turns off” nodes and their connections in a layer at a given probabilitypduring each training iteration. This helps to prevent over fitting by training with a slightly different “view” of model at each iteration, thus increasing generalization. Figure4.8 shows how turning off nodes effects data flow through linear layers.

The last thing to note is that the final linear layer has the identity activation function given by:

f(x) =ax, (4.3)

which is commonly used in the last layer of networks used to predict numerical values.

The resulting final outputs can then be reshaped to (−1,6,400,1,1) to be compared with the true NLTE target that corresponds to the input data.

4.2.3 Loss Function

As mentioned in Section3.1.3, the base loss function used in SunnyNet is mean squared error, given by:

Loss= 1 N

N

X

i=1

NLTEtruei−NLTEpred_i

2

. (4.4)

(41)

32 SunnyNet

In addition to comparing predicted and true NLTE energy level populations on a pixel by pixel basis, we also need to enforce conservation of mass in our calculation. To do this, we compare the total number of hydrogen atoms at each spatial coordinate before the calculation (LTE) to after (NLTE). We can add a second term our loss function that penalizes total particle number discrepancies giving us:

Loss= 1

N (1−α)

N

X

i=1

NLTEtruei−NLTEpred_i

2

+α

N

X

i=1

Htruei−Hpred_i

2

! , (4.5) where Htruei is the total hydrogen atom populations for each point in the LTE column of interest and Hpred_i is the total hydrogen atom populations for each point in the corresponding NLTE column prediction. α is a user defined hyper-parameter which determines how much each term in the loss function contributes to the total loss. Using equation (4.5) we can now ensure that the network produces results in accordance with true physical constraints.

4.3 Intensity Calculation Method

Now that we have described how to build and train a model to predict the NLTE hydrogen populations, we need a method the calculate intensity from the populations.

Our method is written in Julia and takes advantage of functions from the Transparency Julia package (Pereira 2021), which is a package used to compute the opacity of solar atmospheres.

For this method we will need the line and continuum extinctions and emissivities. j_ν^l andα^l_ν remain as defined in Section2.1. For continuum processes we assume:

S^c_ν = j^c_ν

α^c_ν =Bν(T), (4.6)

and therefore that:

j_ν^c=α^c_ν×Bν(T). (4.7)

In our calculation ofα^c_ν we include the following seven interactions:

1. Free-free extinction from H⁻ ions followingStilley & Callaway (1970)

2. Free-free extinction from hydrogen-like species followingMihalas(1978a) andRut- ten(1998b)

3. Free-free extinction from H⁺2 molecules followingBates(1952) 4. Bound-free extinction from H⁻ ions followingGeltman(1962) 5. Bound-free extinction from H⁺₂ molecules followingBates(1952)

(42)

4.4 Source Code 33

6. Extinction from Thomson Scattering given byσ^T_e ×ne, whereσ_e^T = 6.652×10⁻²⁹ m², and n_e is the electron density

7. Extinction from Rayleigh Scattering from H atoms

In our calculation of our profile function, a part ofα^l_ν andj_ν^l, we include the following three sources of line broadening:

1. van der Waals broadening using Unsöld’s approximation followingMihalas(1978b) 2. Linear Stark broadening followingSutton (1978)

3. Quadratic Stark broadening followingTraving(1960)

With all extinction and emissivity coefficients fully defined, we can now fully define our source function,Sν for any wavelength. To obtain emergent intensity, we loop over wavelengths and calculateSν and optical depth to solve the radiative transfer equation for each wavelength.

4.4 Source Code

The source code for the SunnyNet is released on GitHub for open source use (Chappell 2021). The network itself and the data preparation was written in Python 3.8.5 using PyTorch 1.6.0 (Paszke et al. 2019) for the machine learning and Helita 0.9.0 (GitHub) for handling the Bifrost simulations. The final intensity calculation was written in Julia 1.5.4 using the Transparency package mentioned in Section4.3.

The data flow for training a model goes as follows:

1. 1_build_training_set_multi_sim.py 2. train_model_3d.py

Once we have a trained model, the data flow for testing is:

1. 1_build_solving_set.py 2. 2_predict_populations.py 3. 3_intensity_calc.jl

(43)

34 SunnyNet

(44)

Chapter 5 Results

In this chapter we will explore instances of the SunnyNet trained and tested on different data sets and analyze their results. The data sets were designed with the intention of testing how generalized models could be. The simulations we will build our data sets from include both quiet and flaring regions of the solar atmosphere. The snapshots contained in each simulation are calculated at various steps in the solar time scale, giving us a chronological series of snapshots.

First, we will look at models trained on snapshots from a single simulation. We will then test trained networks on data from simulations not included in the training sets.

We will also examine whether our network can learn to predict intermediate snapshots when trained on a set built from the first and last snapshots. These tests will give us insight into how generalized the models are across different simulations.

All models in our analysis have identical structure, are trained with the same loss function, and use the same training hyperparameters given in Table5.1. The parameters we use work well for all tested simulations and were found using best practices and a bit

Parameters Value

Epochs 50

Batch Size 128

Optimizer Adam

Learning Rate 1×10⁻³

α 0.2

Early Stopping 5

Table 5.1: Hyperparameters used when training models for analysis.

A Neural Network approach to 3D non-LTE Radiative Transfer