
In document The Text Premium and Stock Returns (pages 21-29)

4.3 Assessment of the Data

4.3.1 Assessment of U.S. 10-K

The following test was performed on 3M Co's 10-K filing for 2017 and gave the results below.

From Figure 3, we see a clear pattern that terms which belong to a particular topic are highly associated, as well as an indication that there is variation between topics. This suggests that our data can be interpreted and is not just random noise. LDAvis further supports this in Figure 3, which visualizes the fit of an LDA topic model to a corpus of documents. We also made a few interesting observations worth noting, which are shown in Appendix A.

(a) (b) Figure 3: Visualization of 10-K LDA topics

Figure 4: The widths of the gray bars represent the corpus-wide frequencies of each term, and the widths of the red bars represent the topic-specific frequencies of each term

In Figure 3 and Figure 4, two visual features provide a global perspective of the topics. First, the areas of the circles are proportional to the relative prevalences of the topics in the corpus (Sievert & Shirley, 2015). Second, selecting a term reveals its conditional distribution over topics as well as the most relevant terms for that topic. The relevance of term w to topic k given a weight parameter λ is defined as

r(w, k | λ) = λ log(φ_kw) + (1 − λ) log(φ_kw / p_w),

where λ determines the weight given to the probability of term w under topic k relative to its lift, φ_kw / p_w.
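The relevance formula above can be computed directly once the topic-term probabilities φ and the marginal term probabilities p are in hand. The values below are illustrative toy numbers, not estimates from the thesis.

```python
import numpy as np

# Toy inputs (illustrative, not from the thesis):
# phi[k, w] = probability of term w under topic k; p[w] = marginal term prob.
phi = np.array([[0.30, 0.05],
                [0.10, 0.25]])
p = np.array([0.20, 0.15])

def relevance(phi, p, lam):
    # r(w, k | lambda) = lam * log(phi_kw) + (1 - lam) * log(phi_kw / p_w)
    return lam * np.log(phi) + (1 - lam) * np.log(phi / p)

r = relevance(phi, p, lam=0.6)
```

At λ = 1 the measure reduces to the plain topic-term probability log φ_kw, while λ = 0 ranks terms purely by lift; intermediate values trade the two off.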

In our test sample of 3M Co, we see that the first topic comprises 5.5% of the corpus. Initially, however, the first topic comprised only 4.9% of the corpus. This is because it contained many standard, non-specific terms, such as the name of the company, "company," and "include," which are commonly used words in 10-Ks and 10-Qs but are not considered normal English "stop words." This is similar to the LDA classification we did on Equinor in Table 1 and Table 2, where "Statoil" and "compani" are words that consistently appear but do not provide any value for our analysis. Thus, we have applied the LDAvis visualization in order to make sure we remove terms that do not help explain the underlying topics of the corpus.
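The clean-up step described here amounts to maintaining a custom stop list of corpus-specific boilerplate terms flagged via LDAvis and filtering them out before refitting. A minimal sketch follows; the stop list is illustrative, not the thesis's actual list.

```python
# Illustrative custom stop list of non-informative, corpus-specific terms
# (e.g. company names and filing boilerplate flagged via LDAvis).
CUSTOM_STOPWORDS = {"statoil", "compani", "company", "include"}

def strip_boilerplate(tokens):
    """Drop tokens on the custom stop list before refitting the LDA model."""
    return [t for t in tokens if t.lower() not in CUSTOM_STOPWORDS]

tokens = ["statoil", "oil", "production", "compani", "include", "reserve"]
print(strip_boilerplate(tokens))  # ['oil', 'production', 'reserve']
```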

4.3.2 FOMC Reports

The FOMC reports are published twice a year and contain a great deal of macroeconomic information, which we will apply as a proxy for the state of the economy. To make sure these reports are comprehensive enough to serve as such a proxy, we have visualized the report for the second half of 2009. As this was shortly after the financial crisis, we would expect to see topics such as unemployment, inflation, and interest rates. However, we also expect to see measures taken by the Fed to improve the situation, such as increasing federal funding and encouraging private investment. From the visualization in Figure 6, we found that unemployment, inflation, and interest rates were important topics, but also that topics such as encouraging mortgages and consumer spending were essential. This indicates that the FOMC reports are quite comprehensive and that they can be used as a proxy for the state of the economy.

Since much of our data is from the early 1990s, it is essential to assess whether all of it is interpretable, or just a collection of noise. To do this, we visualized all the FOMC reports and found that the reports from before 1999-07-22 consistently had less association between topics and more noise than the later ones. Consequently, we chose the 1999-07-22 report as the appropriate starting point. Figure 5 illustrates why we made this decision: we clearly notice significantly less association between topics and a much larger degree of noise in the 1999-02-23 report. Accordingly, we set the starting point at 1999-07-22 in order to reduce the noise in our sample.

(a) 1999-02-23 (b) 1999-07-22 Figure 5: Visualization of FOMC Reports

Figure 6: FOMC Report Visualized

4.3.3 Financial News

The FOMC reports are only published twice a year and mostly contain macroeconomic information. We would therefore like to extend them with a financial news corpus that reports on more factors than just macroeconomics. Once again, we see a clear pattern that terms which belong to a particular topic are highly associated, as well as an indication of variation between topics. In this particular case of financial news, we see an even clearer pattern than for the 3M Co 10-K. This suggests that the dataset can be interpreted and thus contributes to the analysis rather than generating additional noise.

Assessing the financial news, we also made a few very interesting observations, which are shown in Appendix A.

Figure 7: LDA Topics Financial News - t-SNE

5 Constructing the Portfolios

In this section, we cover the procedure for sorting portfolios based on a similarity measure, which will later be used to investigate whether exposure to common topics can help explain the difference in returns across firms. A rationale behind this idea is that some firms' ex-ante beliefs, predictions, or opinions about what might drive future economic growth do not necessarily fit the ex-post realizations given by the Federal Open Market Committee reports.

The idea is that these disagreements can realize different returns, or at least that firms with a much higher degree of disagreement on aggregate will experience different returns than firms with a much lesser degree of disagreement. Another interpretation is that the similarity with the Federal Open Market Committee report, which we think of as a common text, can reflect different firms' risk attributes.

5.1 The Similarity Measure

The similarity measure must be constructed in a way that captures the notion of disagreement between the Federal Open Market Committee reports and the reports published by each firm.

To recognize a topic, we use the Jaccard distance, restricting each topic to the set of the 25 words that contribute the most to it. If we let Ai be the set of words for company i, and F be the set of words for the FOMC report, then we can calculate the distance as

dJ(Ai, F) = 1 − J(Ai, F) = (|Ai ∪ F| − |Ai ∩ F|) / |Ai ∪ F|.

The problem we faced was that the two texts could be fundamentally different, which makes it hard to obtain meaningful matches across all topics. By construction, dJ(Ai, F) ∈ [0, 1], where dJ(Ai, F) = 0 (1) implies perfect similarity (dissimilarity). To isolate the effect of highly dissimilar topics, we subtract one and take absolute values,

dJ(Ai, F) = |dJ(Ai, F) − 1|,

which leaves us with an adjusted Jaccard distance representing the total amount of match found by the Jaccard distance. For each company filing and corresponding FOMC report, we are then left with a 40×40 matrix of adjusted Jaccard distances across topic pairs. Since we are interested in the similarity between the company filings and each topic of the FOMC report, we fix each FOMC topic column and sum across all rows to construct the similarity score, S, for each company i. Naturally, firms differ in the lengths of their available reports. In the case of missing reports, we disregard them and focus only on the available reports in each period t. For each t we are left with a vector of scores St where each row corresponds to a company. We then sort this vector from smallest to largest and divide it into ten equally sized portfolios.
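The steps above can be sketched end to end: compute the adjusted Jaccard distance between every firm topic and every FOMC topic, sum each FOMC topic column across rows, aggregate to a per-firm score, and sort firms into deciles. The word sets here are random toy data, and the final aggregation of the 40 column sums into a single firm score is an assumption, since the thesis does not spell out that step.

```python
import numpy as np

# Toy setup (assumptions: 40 topics per document, top-25 word sets
# drawn at random from an invented vocabulary).
rng = np.random.default_rng(0)
vocab = [f"word{j}" for j in range(200)]

def jaccard_distance(a, b):
    # d_J(A, F) = (|A ∪ F| - |A ∩ F|) / |A ∪ F|
    a, b = set(a), set(b)
    return (len(a | b) - len(a & b)) / len(a | b)

def topic_sets(n_topics=40, n_words=25):
    # One top-25 word set per topic.
    return [set(rng.choice(vocab, size=n_words, replace=False))
            for _ in range(n_topics)]

fomc = topic_sets()
n_firms = 50
scores = np.empty(n_firms)
for i in range(n_firms):
    firm = topic_sets()
    # 40x40 matrix of adjusted Jaccard distances |d_J - 1|, rows = firm
    # topics, columns = FOMC topics.
    adj = np.array([[abs(jaccard_distance(a, f) - 1) for f in fomc]
                    for a in firm])
    # Fix each FOMC topic column and sum across rows, then aggregate the
    # column sums into one score per firm (aggregation rule assumed).
    scores[i] = adj.sum(axis=0).sum()

# Sort firms from smallest to largest score and split into ten
# equally sized portfolios of firm indices.
order = np.argsort(scores)
portfolios = np.array_split(order, 10)
```

With 50 firms this yields ten portfolios of five firms each, from the least to the most FOMC-similar on aggregate.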
