LDA Topic Classification - The Text Premium and Stock Returns

In this paper, we use Latent Dirichlet Allocation (LDA), a state-of-the-art method (Yi & Allan, 2009). We prefer it over Latent Semantic Analysis (LSA), which has been widely criticized for its inadequate statistical foundation. The main criticism is that LSA erroneously assumes Gaussian noise on term frequencies, which empirically follow a Poisson distribution. Even the probabilistic extension of LSA (pLSA) is incomplete, as it provides no statistical model at the document level, which may result in overfitting. Consequently, we employ LDA, which addresses this shortcoming by placing a Dirichlet prior on the topic distribution within documents (Blei et al., 2003).

Latent Dirichlet Allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that account for why some parts of the data are similar. We treat the data as observations arising from a probabilistic generative process that includes latent variables, where these variables reflect the thematic structure of the collection.

The ultimate goal of topic classification is to identify the hidden structure that likely generated the observed collection of words.

3.1.1 Formal Notation of LDA

The topics are $\beta_{1:K}$, where each $\beta_k$ is a distribution over the vocabulary. The topic proportions for the $d$th document are $\theta_d$, where $\theta_{d,k}$ is the topic proportion for topic $k$ in document $d$. The topic assignments for the $d$th document are $z_d$, where $z_{d,n}$ is the topic assignment for the $n$th word in document $d$. Lastly, the observed words for document $d$ are $w_d$, where $w_{d,n}$ is the $n$th word in document $d$, which is an element of the fixed vocabulary. The generative process for LDA corresponds to the following joint distribution of the hidden and observed variables:

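$$
p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D}) = \prod_{k=1}^{K} p(\beta_k) \prod_{d=1}^{D} p(\theta_d) \left( \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d) \, p(w_{d,n} \mid \beta_{1:K}, z_{d,n}) \right)
$$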

Figure 2: The graphical model for LDA. Illustration: David M. Blei.

In Figure 2, each node is a random variable and is labeled according to its role in the generative process. The hidden nodes (the topic proportions, assignments, and topics) are unshaded. The observed nodes (the words of the documents) are shaded. The rectangles are “plate” notation, which denotes replication. The N plate denotes the collection of words within documents; the D plate denotes the collection of documents within the collection.
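The generative process described above can be made concrete with a small simulation. The following is a minimal sketch, not the implementation used in this paper: it assumes symmetric Dirichlet priors with hypothetical concentration parameters `alpha` (for the topic proportions) and `eta` (for the topics), a fixed document length, and the illustrative function names `sample_dirichlet` and `generate_corpus`.

```python
import random


def sample_dirichlet(rng, concentration, size):
    """Draw one sample from a symmetric Dirichlet via normalized Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]


def generate_corpus(n_docs, doc_len, n_topics, vocab_size,
                    alpha=0.5, eta=0.5, seed=0):
    """Simulate documents from the LDA generative process.

    alpha and eta are assumed symmetric prior parameters (illustrative values).
    Returns (beta, theta, z, w) as nested lists; words are integer ids.
    """
    rng = random.Random(seed)
    # beta_k: each topic is a distribution over the vocabulary.
    beta = [sample_dirichlet(rng, eta, vocab_size) for _ in range(n_topics)]
    theta, z, w = [], [], []
    for _ in range(n_docs):
        # theta_d: topic proportions for this document.
        theta_d = sample_dirichlet(rng, alpha, n_topics)
        # z_{d,n}: a topic assignment for every word position.
        z_d = rng.choices(range(n_topics), weights=theta_d, k=doc_len)
        # w_{d,n}: each word is drawn from the topic it was assigned to.
        w_d = [rng.choices(range(vocab_size), weights=beta[k])[0] for k in z_d]
        theta.append(theta_d)
        z.append(z_d)
        w.append(w_d)
    return beta, theta, z, w
```

Inference runs this process in reverse: only `w` is observed, and `beta`, `theta` and `z` have to be recovered from it.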

We use a sampling-based algorithm, which collects samples from the posterior in order to approximate it with an empirical distribution. More specifically, we use Gibbs sampling, a Markov Chain Monte Carlo (MCMC) technique, to perform Bayesian inference on the corpus.
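To illustrate how Gibbs sampling can approximate the posterior, the sketch below implements the collapsed variant, in which the topics and topic proportions are integrated out and only the topic assignments are resampled. This is an assumption made for illustration: the text does not state which Gibbs variant, hyperparameters or number of iterations were used, and the function name `gibbs_lda` and its parameters `alpha`, `eta` and `n_iter` are hypothetical.

```python
import random


def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, eta=0.01,
              n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative sketch).

    docs is a list of documents, each a list of integer word ids.
    Returns (theta, beta, z): point estimates taken from the final sample.
    """
    rng = random.Random(seed)
    n_dk = [[0] * n_topics for _ in docs]                # topic counts per document
    n_kw = [[0] * vocab_size for _ in range(n_topics)]   # word counts per topic
    n_k = [0] * n_topics                                 # total words per topic
    z = []
    # Initialize every word with a random topic assignment.
    for d, doc in enumerate(docs):
        z_d = []
        for word in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][word] += 1
            n_k[k] += 1
        z.append(z_d)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for n, word in enumerate(doc):
                # Remove the current assignment from the counts.
                k = z[d][n]
                n_dk[d][k] -= 1
                n_kw[k][word] -= 1
                n_k[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (n_dk[d][j] + alpha) * (n_kw[j][word] + eta)
                    / (n_k[j] + vocab_size * eta)
                    for j in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][n] = k
                n_dk[d][k] += 1
                n_kw[k][word] += 1
                n_k[k] += 1
    # Smoothed estimates of topic proportions and topic-word distributions.
    theta = [[(c + alpha) / (len(doc) + n_topics * alpha) for c in n_dk[d]]
             for d, doc in enumerate(docs)]
    beta = [[(c + eta) / (n_k[k] + vocab_size * eta) for c in n_kw[k]]
            for k in range(n_topics)]
    return theta, beta, z
```

In practice one would discard an initial burn-in period and average over several samples rather than rely on the final state alone; the sketch omits this for brevity.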

Table 1 reports an LDA topic classification of the annual reports of Equinor (formerly known as Statoil, the largest company listed on the Oslo Stock Exchange) for 2010-2017, and Table 2 reports the same for 1980-1985. Tables 1 and 2 suggest that words such as “statoil” and “compani” (the stemmed form of “company”) carry considerable weight, although they do not help explain the topics of the annual reports. This implies that we have to clean the corpus more thoroughly for the LDA topic classification to find words that relate more closely to the underlying topics of the annual reports.
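One way to carry out this additional cleaning is to remove corpus-specific stop words before fitting the model. The sketch below is only an illustration under stated assumptions: the function name `remove_corpus_stopwords`, the optional document-frequency threshold `max_doc_share` and the toy documents are hypothetical and are not taken from the actual preprocessing pipeline.

```python
def remove_corpus_stopwords(docs, extra_stopwords, max_doc_share=None):
    """Drop uninformative tokens from tokenized, stemmed documents.

    extra_stopwords: tokens to remove explicitly (e.g. the company name).
    max_doc_share: if given, also remove tokens that occur in more than
    this share of all documents (an assumed heuristic).
    """
    blocked = set(extra_stopwords)
    if max_doc_share is not None:
        doc_freq = {}
        for doc in docs:
            for token in set(doc):
                doc_freq[token] = doc_freq.get(token, 0) + 1
        limit = max_doc_share * len(docs)
        blocked |= {t for t, count in doc_freq.items() if count > limit}
    return [[t for t in doc if t not in blocked] for doc in docs]


# Toy example built from stemmed tokens that appear in Table 1.
docs = [
    ["statoil", "develop", "technolog", "compani"],
    ["statoil", "gas", "oil", "market"],
    ["compani", "board", "share", "statoil"],
]
cleaned = remove_corpus_stopwords(docs, extra_stopwords={"statoil", "compani"})
```

An explicit list gives full control over what is removed, whereas the frequency threshold catches such words automatically at the risk of also discarding informative terms.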

What is clear is that technology and new developments are considered very important to Equinor. This comes as no surprise, since the world’s largest countries are shifting their focus to renewable sources of energy, the opposite of the oil and gas business in which the company has operated. Moreover, risk management is also one of the most important topics, most likely due to the highly fluctuating commodity prices seen in the last few years. In contrast, a topic classification of the company’s annual reports from 1980-1985 reveals a different picture. This raises the question: are some topics more important than others in explaining a particular aspect of Equinor? Here we limited the number of topics to the eight most important, ranked by probability, although we can easily extend this to 100 topics and track how their relative importance changes over time.
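Selecting the most important topics can be sketched as follows. The text does not state exactly how topics were ranked, so this example assumes a ranking by mean topic proportion across documents; the function name `rank_topics` and the toy proportions are hypothetical.

```python
def rank_topics(theta, top_n=8):
    """Rank topics by their mean proportion across documents.

    theta: one list of topic proportions per document.
    Returns up to top_n (topic_index, mean_proportion) pairs, largest first.
    """
    n_docs = len(theta)
    n_topics = len(theta[0])
    means = [sum(doc[k] for doc in theta) / n_docs for k in range(n_topics)]
    ranked = sorted(range(n_topics), key=lambda k: means[k], reverse=True)
    return [(k, means[k]) for k in ranked[:top_n]]


# Toy example with two documents and three topics.
example_theta = [[0.7, 0.2, 0.1], [0.5, 0.1, 0.4]]
top_two = rank_topics(example_theta, top_n=2)
```

Applying the same function to the topic proportions of each reporting year separately would give a simple way to track how the ranking shifts over time.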

Table 1: LDA topic classification on Equinor’s annual reports, 2010-2017

| Rank | Technology | Exploration | Commodities | Reporting | Financials | Performance | Risk Management | Stock |
|---|---|---|---|---|---|---|---|---|
| 1 | statoil | statoil | gas | statoil | tax | nok | financi | share |
| 2 | develop | field | oil | report | asset | reserv | cash | board |
| 3 | oper | product | market | norwegian | incom | billion | statoil | statoil |
| 4 | busi | develop | natur | state | cost | product | risk | committe |
| 5 | activ | oper | product | financi | estim | increas | usd | corpor |
| 6 | result | well | crude | statement | rate | prove | invest | member |
| 7 | risk | project | statoil | petroleum | loss | oper | liabil | compani |
| 8 | product | north | transport | intern | recognis | incom | interest | execut |
| 9 | project | interest | volum | account | amount | price | nok | sharehold |
| 10 | technolog | sea | price | requir | futur | million | capit | meet |

Table 2: LDA topic classification on Equinor’s annual reports, 1980-1985

| Rank | Operations | Market | Stock | Financials | Management | Exploration | Director | Oil Fields |
|---|---|---|---|---|---|---|---|---|
| 1 | statoil | oil | nok | cost | compani | statoil | manag | field |
| 2 | norwegian | product | million | account | board | block | vice | gas |
| 3 | develop | statoil | consolid | financi | general | norsk | director | oper |
| 4 | activ | market | share | profit | shall | gas | pres | billion |
| 5 | insur | crude | statoil | depreci | meet | oil | statoil | platform |
| 6 | compani | will | compani | liabil | director | hydro | presid | million |
| 7 | work | transport | valu | incom | assembl | explor | engin | develop |
| 8 | project | increas | amount | current | two | licenc | sen | will |
| 9 | oper | price | invest | statement | art | saga | rock | product |
| 10 | year | per | interest | consolid | matter | oper | area | start |
