
In Search of the Present - An indicator comparison: Nowcasting quarterly GDP using Google search data and monthly accounts of GDP


Academic year: 2022



Malin Charlotte Engel Jensen

Thesis submitted for the degree of Master in Economics

30 credits

Department of Economics

Faculty of Social Science

UNIVERSITY OF OSLO


© 2019

Malin Charlotte Engel Jensen

In Search of the Present http://www.duo.uio.no/

Printed: X-press printing house


Abstract

This thesis investigates the role of hard and soft indicators in nowcasting quarterly GDP. Hard indicators may include industrial production or retail sales, while financial series and surveys are examples of soft indicators. I develop a formal method for using high-frequency data to nowcast current-quarter GDP growth.

The approach taken here uses timely search data collected from Google and monthly national accounts of GDP to track the aggregate fluctuations in the economy.

The models derived from Google search data are referred to as Semi-Automatic Soft Indicator (SASI) models, because information is selected semi-automatically from the group of indicators. The models that exploit monthly observations of GDP are referred to as hard indicator models. My findings indicate that the two indicator categories hold valuable information for predicting quarterly GDP in Norway. Specifically, all hard indicator models significantly outperform two benchmark models, an autoregressive model and a random walk model. The hard indicator models also significantly outperform four out of seven SASI models. Three of the SASI models significantly outperform the two benchmark models. In general, these findings highlight the trade-off between predictive accuracy and timely predictions, as the SASI models can be updated on a daily basis, whereas the hard indicator models rely on information released with a substantial lag.


Acknowledgements

The past years at Blindern have been the most enjoyable and challenging years of my life. Blindern has not only been a place where I have acquired new skills and friends, but it has also shaped me into a dedicated and curious social scientist.

In relation to this thesis, there are several people whom I wish to thank. Especially, I want to thank my supervisor, Ragnar Nymoen, for always being optimistic and interested in my thesis and listening attentively to all my questions. It is difficult to express in words the gratitude and credit you deserve for the support you have given me. I also owe a proper thanks to Thomas Von Brasch and Håvard Haugnes at Statistics Norway for the constructive discussions in relation to modelling quarterly GDP by means of monthly national accounts of GDP. Further, I owe a great deal of gratitude to Vegard Wiborg and Silje Jelsness for carefully revising my thesis and giving me valuable comments and constructive feedback. I also want to dedicate a thank-you to my lunch-crew, Simon, Magnus, Eirik, Mikkel and also Vegard, for going to great lengths to support me on my most neurotic days and for putting a smile on my face.

Finally, thank you to all my friends on the 10th floor for valuable discussions over the years and to my professors who never closed their door on me when I came knocking with silly questions. Data is available upon request and any errors in this thesis are my own.

Oslo, May 2019 Malin Jensen


Contents

1 Introduction

2 Literature and Contribution

3 Data

3.1 Target variable

3.2 Google trends for dummies

3.3 Methodology

3.4 Data set reduction

3.4.1 Correlations (targeted predictors)

3.4.2 Principal Component Analysis

3.4.3 Supervised Principal Component Analysis

3.5 Automatic variable selection

3.5.1 Cost of automatic variable selection

3.6 Characteristics of Google Trends

3.7 Transformation of SVIs

4 General evaluation of nowcasting models

4.1 Unbiased forecasts

4.2 Descriptive measure of forecast performance

4.3 A formal test of comparative accuracy: The DM test

5 Specifications on the SASI models

5.1 General specifications

5.2 Complete data set

5.3 Targeted predictors

5.4 PCA

5.5 Supervised PCA

6 Empirical Results of the SASI models

6.1 Model properties

6.1.1 Residual properties

6.1.2 In-sample fit of the SASI models

6.1.3 Typical SASI models

6.1.4 Parameter stability of the SASI models

6.2 Out-of-sample performance of the SASI models

6.2.1 Comparative accuracy of the SASI models

7 The use of monthly GDP

7.1 Data Description

7.2 Finding an indicator of month three

7.2.1 Four candidate indicators of month three

7.2.2 Evaluation and model specification

7.3 Parameter stability of the hard indicator models

7.4 Comparative accuracy of the hard indicator models

8 SASI models vs. Hard indicator models

8.1 Scaling the RMSFE

8.2 In-sample fit of hard and soft indicator models

8.3 Out-of-sample forecasts of hard and soft indicator models

9 Discussion

10 Conclusion

Appendices

A List of SVIs

B SASI models

B.1 In-sample estimation from 2004,Q3-2013,Q4

B.2 In-sample estimation from 2004,Q3-2016,Q4

C Empirical tests

D RMSFE

List of Tables

3.1 Unit root test of ∆QGDP

3.2 Data Description

5.1 Specifications of the SASI models

6.1 In-sample estimation of the SASI models

6.2 A typical SASI model

6.3 Out-of-sample estimation of the SASI models

6.4 Test of comparative accuracy of the SASI models

7.1 Four indicators of M3t/M3t−1

7.2 Bridge equation estimation results of the hard indicator models

7.3 Test of comparative accuracy of the hard indicator models

8.1 In-sample fit of SASI models for extended estimation period

8.2 Out-of-sample performance of hard and soft indicator models

8.3 Test of comparative accuracy between hard indicator models and SASI models

A.1 List of SVIs

A.2 List of targeted predictors

B.1 Model 1, 2

B.2 Model 3, 4

B.3 Model 5, 6

B.4 Model 7

B.5 Model 1, 2

B.6 Model 3, 4

B.7 Model 5, 6

B.8 Model 7

C.1 Diagnostics tests of forecast error

C.2 Diagnostics tests of forecast error

C.3 Diagnostics tests of forecast error

C.4 Test for forecast unbiasedness of the SASI models

C.5 Comparative accuracy test of big data models

Acronyms

ADF Augmented Dickey-Fuller.

AIC Akaike’s information criterion.

AR(1) Autoregressive model of order one.

DGP Data Generating Process.

DM Diebold-Mariano, refers to a test for comparative accuracy.

GDP Gross Domestic Product.

Gets General-to-Specific.

GUM General Unrestricted Model.

LDGP Local Data Generating Process.

OLS Ordinary Least Squares.

PC Principal Component.

PCA Principal Component Analysis.

RMSFE Root Mean Square Forecasting Error.

RW Random Walk.

SASI Semi-Automatic Soft Indicator.

SNL Store Norske Leksikon.

SSB Statistisk Sentralbyrå.

SSR Sum of Squared Residuals.

SVI Search Volume Index.


Chapter 1 Introduction

Unlike a weather nowcast, which is available in real time, measures of the aggregate state of an economy are typically released with a substantial lag. For the purpose of nowcasting, a weatherman is therefore completely redundant because the meteorological out-turn is available for anyone to observe. For an economist, assessing the economic conditions of the present, and even of the near past, is a whole other ballgame. Because most macroeconomic aggregates are released with a substantial lag and are available at different frequencies, both forecasting and assessing current-quarter conditions, i.e. nowcasting, are important for economic decision-making. Governments, central banks, consumers, firms and financial institutions base their decisions on expectations about the future and consequently pay close attention to selected data releases. In particular, one measure of the aggregate state of the economy that attracts substantial attention is GDP.

GDP is one of the most important business cycle indicators and is often directly tied to the monetary policy objective of central banks. However, GDP is usually collected on a quarterly basis and released over a month after the end of the quarter. Therefore, to assess the macroeconomic condition in the meantime, forecasters rely on relevant indicators measured at a higher frequency. Overall, these indicators can be divided into soft and hard indicators. The soft indicators usually reflect market expectations, as opposed to the hard indicators, which measure certain components of GDP directly (see e.g. Banbura et al., 2011). Examples of the former are financial series, Twitter feeds and surveys, while examples of the latter include industrial production, retail sales and car sales. Choosing between hard and soft indicators often poses a trade-off: unlike the hard indicators, the soft data are available in real time, but the hard indicators are often considered to give more precise signals for GDP. In this thesis, I evaluate the two indicator categories against two benchmark models and compare them to each other.

First, my main finding suggests that the hard indicators outperform the soft indicators when used for nowcasting quarterly GDP growth. The hard indicators used in this thesis refer to monthly national accounts of GDP. Unlike the aforementioned examples of hard indicators, monthly observations of GDP do not reflect a component of quarterly GDP, but rather echo the exact same variable, just measured at a higher frequency. I will test the hypothesis that these monthly accounts of GDP provide stable and precise predictions of quarterly GDP.

Since the two measures of GDP share the same level of detail, it may come as no surprise that the hard indicators are directly tied to quarterly GDP by summation. The mixed-frequency nature of the data set is modelled by means of a bridge equation. In the context of nowcasting, the objective is to find an indicator for the third month of the quarter, as this number is released together with the quarterly number. I obtain four different indicators of the third month of the quarter by varying the composition of variables included in the model. The three monthly growth rates are then summed to obtain a quarterly value. The aggregated value is used as a regressor in the bridge equation to obtain a forecast of current-quarter GDP. I use out-of-sample forecasting to assess the performance of these models relative to two benchmark models, an autoregressive model and a random walk model. I find that the four hard indicator models outperform both benchmarks at the 5 % significance level. Finally, the reliability of the hard indicator models is also quantified by comparing them to the soft indicator models. All hard indicator models display lower forecast errors than their soft indicator counterparts. A test of comparative accuracy also reveals that 75 % of the hard indicator models significantly outperform the soft indicator models in four out of seven cases.
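The bridging steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the specification estimated in this thesis: the function names (`ols_fit`, `nowcast_quarter`), the made-up history of aggregated monthly growth and quarterly growth, and the value passed as the third-month indicator are all hypothetical.

```python
from statistics import mean


def ols_fit(x, y):
    """OLS of y on a constant and a single regressor x; returns (intercept, slope)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def nowcast_quarter(history, m1, m2, m3_indicator):
    """Bridge-equation nowcast of quarterly growth.

    history      : list of (aggregated monthly growth, quarterly growth) pairs
    m1, m2       : observed growth rates for months one and two
    m3_indicator : a stand-in value for the not-yet-published third month
    """
    aggregated = [a for a, _ in history]
    quarterly = [q for _, q in history]
    alpha, beta = ols_fit(aggregated, quarterly)
    # Step 1: sum the three monthly growth rates to a quarterly value.
    current_aggregate = m1 + m2 + m3_indicator
    # Step 2: use the aggregate as the regressor in the bridge equation.
    return alpha + beta * current_aggregate


# Made-up numbers, purely for illustration.
history = [(1.0, 1.1), (0.5, 0.6), (-0.2, -0.1), (0.8, 0.9)]
nowcast = nowcast_quarter(history, m1=0.3, m2=0.2, m3_indicator=0.1)
print(round(nowcast, 2))
```

The sketch keeps the two steps separate on purpose, so that different candidate indicators of the third month can be swapped in without touching the bridge regression.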

Together with hard indicators, I use search data collected from Google, often referred to as search volume indices (SVIs), as soft indicators of quarterly GDP. Google SVIs are timely, quantitative and, as pointed out by Choi and Varian (2012), also qualitative, as they generally correlate with economic indicators. In addition, Wu and Brynjolfsson (2015) highlight that the SVIs may reflect information about individuals' intentions to make a decision or an economic transaction. The hypothesis is therefore the following: given that Google search data reveal people's intention to make a decision or an economic transaction, changes in these series can hold valuable information about aggregated intentions followed by a subsequent action, and this information is measured before official statistics are available. In that way, Google SVIs can be good proxies of the future and present fluctuations in quarterly GDP.

A dominant feature of the data set obtained by collecting Google SVIs is the vast number of time series. A classical issue in this context relates to modelling large volumes of data in a parsimonious manner.¹ Therefore, I use a formal methodology to reduce the size of the data set. By exploiting multiple systematic frameworks to shrink the data set, I am also able to extract underlying GDP-related signals in the large volumes of data. The hypothesis is that smaller data sets derived by systematic methods provide a higher signal-to-noise ratio, which will enable me to obtain models that give more precise measures of current GDP growth. Data set reduction includes three approaches: selecting “targeted” predictors that display a correlation coefficient with quarterly GDP growth above 20 %, performing PCA on a large data set and, finally, performing PCA on a smaller data set containing targeted predictors. To select the top predictors to be included in the nowcasting models, I use an automatic search algorithm called Autometrics. These models are labelled Semi-Automatic Soft Indicator (SASI) models, as the information is selected semi-automatically from the group of indicators. By varying which data set I allow to enter Autometrics, combined with different algorithmic specifications, I obtain seven different models.

¹ This problem should not be confused with the bridge model approach taken above. See e.g. Giannone (2008), which allows a large number of series to be modelled by a bridge equation.

My findings indicate that the SASI models contain valuable information to predict the future movements of quarterly GDP growth. The predictive performance of these models is assessed by means of out-of-sample forecasting and compared with the two benchmark models. I find, in short, that the scope of the data set, i.e. the number and variety of series included, matters for forecasting accuracy. First, three out of seven SASI models significantly outperform both benchmark models at the 5 % significance level. These models are systematically derived by extracting principal components or targeted predictors from the large volumes of data. In addition, these models significantly outperform the “big data” models, which are derived by utilizing larger volumes of information. Further, in harmony with the predictions of Boivin and Ng (2006), principal components drawn from a smaller data set containing targeted predictors lead to more accurate model predictions than when the two reduction techniques are performed separately.

The thesis is structured as follows. In Chapter 2, the nowcasting literature and the related Google literature are introduced. Chapter 3 describes the data and the variable selection methodology. In Chapter 4, I introduce measures for evaluation and comparative accuracy. Chapter 5 sets out the algorithmic specifications I need to make. Empirical results of the SASI models are presented in Chapter 6. In Chapter 7, the hard indicator models are presented and evaluated. Chapter 8 compares the performance of the two categories of indicators. Finally, Chapter 9 discusses the findings and methods, while Chapter 10 concludes. Additional information is found in the Appendix.


Chapter 2

Literature and Contribution

This thesis contributes to three branches of the economic literature.

First, the methods used here speak to the literature that uses frequently updated indicators for short-term forecasting. The macroeconomic literature has devoted a substantial amount of time and space to the field of GDP nowcasting over the past decade. Two problems in particular have been emphasized in this context. First, information is released sequentially, which introduces a mixed-frequency problem. Second, policy makers need to assess large volumes of data. However, it is not given that all available information is valuable for predicting the future movements of GDP. This thesis addresses both of these issues.

To address the problem of mixed frequencies, I use a bridge model approach. Bridge equations have typically been a workhorse for short-term forecasting in cases where we use a high-frequency variable to predict a low-frequency variable.¹ Bridge equations are simple, transparent models and are widely used, in particular by central banks. Applications in the literature include, among others, Baffigi et al. (2004) and Foroni (2014). Multiple studies report that the performance of bridge models is usually better than that of benchmark models, especially when used for modelling supply-side measures, such as aggregate GDP, see e.g. Foroni and Marcellino (2018). One example of an application is the use of monthly indicators of industrial production and retail sales to predict quarterly GDP, see e.g. Ingenito et al. (1996). Two steps are taken to bridge the gap between the monthly indicators and the quarterly measure. First, since we have insufficient information on retail sales over the entire quarter, retail sales is predicted over the remainder of the quarter. The monthly indicators are then aggregated to obtain the quarterly value. Second, the aggregated value of retail sales is used as a predictor in the bridge equation to obtain a forecast of quarterly GDP. However, bridge equations, as far as I am aware, have without exception been used in a context where hard indicators are included based on the statistical fact that they contain timely updated information. In this thesis, I instead propose to use monthly observations of GDP that are additively equal to the quarterly GDP number. That is, the monthly observations of GDP reflect the exact variable we want to forecast, just updated more frequently.

¹ An alternative approach to short-term forecasting with mixed-frequency data is based on the MIDAS regression family, formally introduced by Ghysels et al. (2004); for macroeconomic applications the reader is referred to Marcellino and Schumacher (2010).

Unlike the hard indicators discussed above, the soft indicators are based on a large data set structure. Therefore, a more pressing issue in this context is how to model large volumes of data in a parsimonious manner. A variety of modelling approaches are available, among others the mixed-frequency factor model, see e.g. Banbura and Rünstler (2011), and the dynamic factor model, see e.g. Kuzin et al. (2013). In particular, the dynamic factor model approach relies on factor analysis to separate each variable into a common component and an idiosyncratic component. This allows the common dynamics of a large number of time series to be modelled parsimoniously by using a smaller number of latent factors. As one example, Thorsrud (2016) uses a factor model approach to bridge a large panel of daily news topics to quarterly GDP growth. This approach shares many features with the approach taken here. To predict quarterly GDP growth, I use big data in the form of frequently updated soft indicators collected from Google. As a supplement to a factor model approach, I rely on data set subdivision techniques and, in the end, an automatic variable selection approach to select seven unique, parsimonious models.

As a second contribution, this thesis provides a formal method for testing the applicability and efficiency of large volumes of data. Selection of relevant information in large volumes of data is facilitated by dividing the data into multiple, smaller data sets. This is done by implementing a methodical framework in order to reach two objectives, namely, to reduce the sample size and to extract the series that contain the strongest GDP-related signals. I expand upon the methodological framework suggested by Bai and Ng (2008) and let “targeted predictors”, which are characterized by their high correlation with GDP, enter into the data set. PCA is a second approach that reduces the dimensionality of the data set and, in addition, displays the most important patterns that describe the data set. The method has readily been used for dimension reduction in many fields and has proven to be a successful tool in the context of forecasting quarterly GDP with big data, see e.g. Ouysse (2013). A feature stressed in recent applications of PCA is whether the use of large panels improves forecasting precision. Until recently, there has been a tendency among researchers to make use of all available data. However, Bai and Ng (2002) argue that the number of series need not necessarily be large in order for principal components estimators to give precise estimates. For example, Boivin and Ng (2006) showed that factors extracted from 40 series in most cases seem to perform better than factors extracted from 147 series. This hypothesis is tested by extracting principal components from a smaller data set, which contains targeted predictors.
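The idea of targeted predictors can be illustrated with a short Python sketch. It assumes that a series is kept when the absolute value of its Pearson correlation with quarterly GDP growth exceeds the 20 % threshold mentioned in Chapter 1; the use of the absolute value, the function names (`pearson`, `targeted_predictors`) and the two made-up series are assumptions made for illustration only.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant series carries no signal
    return sxy / sqrt(sxx * syy)


def targeted_predictors(svis, gdp_growth, threshold=0.20):
    """Keep the series whose absolute correlation with GDP growth exceeds the threshold."""
    return {
        name: series
        for name, series in svis.items()
        if abs(pearson(series, gdp_growth)) > threshold
    }


# Made-up data: one series that tracks GDP growth and one that does not.
gdp_growth = [1.0, -0.5, 0.8, 0.2, -0.3, 0.6]
svis = {
    "loan": [2.0, -1.0, 1.5, 0.5, -0.5, 1.0],
    "noise": [1.0, -1.0, -1.0, 1.0, 1.0, -1.0],
}
selected = targeted_predictors(svis, gdp_growth)
print(sorted(selected))
```

Principal components could then be extracted from the reduced set in `selected` rather than from the full panel, which is the combination of techniques tested in this thesis.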

The third contribution of this thesis adds to the growing body of literature that uses Google as a data source for nowcasting. Perhaps the most well-known example in the literature is the use of Google search data as a large-scale monitoring device to detect illness outbreaks in the US, see e.g. Polgreen et al. (2008). Many fields have taken an interest in Google Trends for nowcasting, for instance psychology, see e.g. McCarthy (2010), finance, see e.g. Preis (2013), and political science, see e.g. Mellon (2013, April). Choi and Varian (2009) were the first to explore the use of Google search data in economics and suggested that Google Trends was a sufficient tool to nowcast various economic metrics. Since then, several authors have used Google Trends to nowcast macroeconomic variables, such as the unemployment rate, retail sales, private consumption and automobile sales, see e.g. Jon Ellingsen (2017), Vosen and Schmidt (2011) and Carrière-Swallow and Labbe (2013). Still, to my knowledge, few have so far studied the forecasting performance of Google Trends on quarterly GDP. My contribution in this branch relates mostly to the work of Hendry and Castle (2009), as they predict quarterly GDP by means of Google Trends data. However, my thesis departs from theirs as I lean on Google search data alone to forecast quarterly GDP, while they use multiple sources of timely data.


Chapter 3 Data

Data is collected using Google Trends, a public web tool provided by Google. My sample consists of data from January 2004 to February 2019. Google Trends data are published on a monthly basis and date back to 2004. I focus on Norway because it is a small and open economy and thereby representative of many western countries. In addition, Norway is the first country with official statistics for monthly GDP and its main components. In Chapter 7 below, I will give a more detailed account of this data set. The structure of this chapter is as follows. First, in Section 3.1, I describe the target variable, which is the percentage change in quarterly GDP. Second, in Section 3.2, I present an in-depth explanation of the SVIs collected from Google Trends. Section 3.3 describes the methodology I use for selecting search terms. In Section 3.4, methods for data set reduction are introduced. In Section 3.5, I demonstrate how to use an automatic variable selection tool to determine a final model. Finally, in Sections 3.6 and 3.7, I present some typical statistical traits of the SVIs and describe the transformations I will apply.

3.1 Target variable

Gross domestic product (GDP) for mainland Norway in constant 2016 prices (million NOK) is obtained at a quarterly frequency from Statistics Norway (SSB). Quarterly GDP is normally reported with a lag of approximately one month. Note that this variable is not seasonally adjusted. For a more thorough discussion of this matter, see Chapter 9. As pointed out by Nelson and Plosser (1982) and many subsequent authors (see e.g. Flessig et al., 1999; Soytas et al., 2003), it is quite common for macroeconomic data to be characterized as non-stationary. Normally, a series that contains a unit root can easily be transformed to a stationary series by taking the first difference. I multiply the transformed variable by 100 and obtain a measure of the percentage change in quarterly GDP as my new target variable. I use an Augmented Dickey-Fuller (ADF) test to verify that the new target variable is in fact stationary. See Table 3.1 below.

Table 3.1: Unit root test of ∆QGDP

D-lag    t-ADF       t-stat ∆QGDPt−1
3        -6.62**     -0.08
2        -19.71**    11.26
1        -9.71**     3.69
0        -10.55**

Table 3.1 notes: The Augmented Dickey-Fuller test. H0: the percentage change in quarterly GDP contains a unit root. Estimation is performed between 2004,Q1-2018,Q4. The stars indicate that the null hypothesis is rejected at the * 5 % or ** 1 % significance level.

All the ADF tests (for different degrees of dynamic augmentation) are significant at the 1 % significance level (as indicated by **). Hence, the evidence in Table 3.1 gives reason to conclude that the percentage growth rate of quarterly GDP does not contain a unit root. In addition, the percentage change in quarterly GDP is a relevant target variable to forecast, as it is one of the most important coincident indicators of economic performance, closely monitored by the government, the business sector and the media.

To simplify notation, I will refer to the percentage change in quarterly GDP as ∆QGDP.
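The transformation and the unit root check can be sketched as follows. This is a simplified illustration, not the estimation reported in Table 3.1: it assumes that the growth rate is computed as 100 times the first difference of the logarithm of GDP, it implements only the Dickey-Fuller regression without lagged differences (the row D-lag = 0), and the GDP levels are made up. The function names are hypothetical.

```python
from math import log, sqrt


def pct_growth(levels):
    """100 times the first difference of log levels (assumed transformation)."""
    return [100.0 * (log(b) - log(a)) for a, b in zip(levels, levels[1:])]


def dickey_fuller_t(series):
    """t-statistic on rho in: diff(y)_t = c + rho * y_(t-1) + e_t (no lagged differences)."""
    lagged = series[:-1]
    diffs = [b - a for a, b in zip(series, series[1:])]
    n = len(diffs)
    x_bar, d_bar = sum(lagged) / n, sum(diffs) / n
    sxx = sum((x - x_bar) ** 2 for x in lagged)
    sxy = sum((x - x_bar) * (d - d_bar) for x, d in zip(lagged, diffs))
    rho = sxy / sxx
    c = d_bar - rho * x_bar
    ssr = sum((d - c - rho * x) ** 2 for x, d in zip(lagged, diffs))
    std_err = sqrt(ssr / (n - 2) / sxx)
    return rho / std_err


# Made-up quarterly GDP levels with a strong seasonal-looking pattern.
levels = [100, 104, 101, 106, 103, 108, 105, 110, 107, 112]
growth = pct_growth(levels)
t_stat = dickey_fuller_t(growth)
print(round(t_stat, 2))
```

The statistic must be compared with Dickey-Fuller critical values rather than with the usual t-distribution; for a regression with a constant, the 5 % critical value is roughly -2.9, so a value well below that speaks against a unit root.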

3.2 Google trends for dummies

Google Trends reports an index of search activity. These indices, also known as search volume indices (SVIs), display the search interest of a specific query over time. The index is also available at country and municipality level for Norway. The search interest shows how often a specific search term is explored relative to the total search volume, defined over a specific period and country. To obtain an index of search activity over specific queries, the volume of web searches is normalized and scaled. To normalize the search volume, Google divides the query of interest by an unrelated, common web search query. For example, the search volume index for the term “Cappuccino” may be normalized by dividing it by the search volume for the unrelated and common term “Champions league”. Further, the volume of web search queries is scaled such that the SVIs fluctuate between 0 and 100. This scaling procedure enables me to measure the relative change in the interest of a specific SVI over time.¹

As one example, if you are interested in the search term “Iphone”, Google Trends will provide a chart, a time series of the interest in the search term, its geographical distribution, related searches and categories. If the SVI decreases from 100 in September to 50 in October, this means that the percentage of searches that included the search term “Iphone” was twice as high in September as in October. Depending on the date range, Google can provide daily, weekly and monthly interest. For periods longer than 5 years, Google will plot the monthly data. The real-time availability and high frequency of Google Trends make it an attractive indicator for short-term forecasting.

¹ The reader is referred to Stephens-Davidowitz and Varian (2014) for a more detailed introduction on how to use Google Trends for research purposes.
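The scaling step and the “Iphone” example can be made concrete with a small Python sketch. It assumes that the index is built from the share of a query in total search volume, rescaled so that the peak equals 100 and rounded to whole numbers; the exact procedure used by Google is not public in full detail, and the function name `to_svi` and the counts are made up.

```python
def to_svi(query_counts, total_counts):
    """Scale the share of a query in total searches so that its peak equals 100."""
    shares = [q / t for q, t in zip(query_counts, total_counts)]
    peak = max(shares)
    if peak == 0:
        return [0 for _ in shares]  # no search interest in any period
    return [round(100 * share / peak) for share in shares]


# Made-up monthly counts for September, October and November.
query_counts = [200, 100, 150]
total_counts = [10000, 10000, 10000]
svi = to_svi(query_counts, total_counts)
print(svi)
```

Because the index is relative to its own peak, a value of 50 only says that the search share was half of its maximum over the chosen period; it says nothing about the absolute number of searches.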


There are four other aspects of the SVIs worth emphasizing. First, the data treated here is an unbiased sample of Google search data and only a percentage of the searches are used to compile the trends. Second, the trends data are adjusted so that searches made by very few people or with low volume will appear as 0. We typically see this pattern for search inquiries on “Iphone” before 2006. This is mostly because Apple did not launch Iphone before 2008, coupled with the more general fact that the search engine was not as popular in 2004 as it is nowadays. Another note on the early part of the time series is that it may contain an inequality bias, as less affluent Norwegian citizens could not afford a computer or internet access that would enable them to use Google. Third, duplicate searches can bias the trends. Therefore, Google eliminates searches made repeatedly by the same person over a short period of time. Finally, identical search queries requested on different days produce slightly different time series, while queries sent on the same day return identical series. Therefore, all series are downloaded on the same day (31.01.2019).

A more detailed mapping of the target variable, Google SVIs (and the monthly observations of GDP) is found in Table 3.2 at the end of the chapter.

3.3 Methodology

Essentially, Google offers a big data service with an extreme volume. If the 1 trillion unique URLs submitted to Google were typed end to end, these web addresses would stretch a third of the distance to the sun. This exemplifies just how extensive this service is. The extreme scope of URLs and the search terms attached to them underlines the importance of establishing a clear methodological framework for collecting Google SVIs.

There are some challenges attached to this step, as the choice of search terms involves a trade-off between objectivity and validity. On the one hand, one attractive approach may be to handpick a large number of search terms that it makes intuitive sense to google once a person is affected by an unexpected shock, for example when filing for unemployment benefits or wanting to spend this year's bonus on a new car. Selecting search terms in this manner will most likely increase the fit and significance of my results later on. On the other hand, handpicking variables in this fashion implies a high degree of subjectivity. Leaning on a gut feeling alone instead of a systematic framework may bias the sample and can lead to spurious results. I therefore need a methodology to ensure that I draw an unbiased sample.

To find a methodical framework that fits my purposes, I draw on the existing literature. The literature on Google Trends is surprisingly vague about addressing the issue of sample bias. It seems that researchers rely mostly on personal intuition when picking out search terms. See e.g. Seabold et al. (2015), who handpicked a sample of 22 Google SVIs to nowcast price series. Methodical direction is also missing in other fields, such as psychology. Tran et al. (2017) evaluated the validity and findings of earlier studies that used Google SVIs to produce nowcasts of suicide rates. They observed that previous studies suffered from the same methodical shortcomings discussed above; studies select search terms mostly in an ad hoc manner and do not systematically evaluate which search terms provide the most relevant analyzable data.

To identify search terms in an objective manner, I borrow the methodological framework of Da et al. (2014). They use the Harvard IV-4 and the Lasswell Value dictionaries to select economic queries with either a positive or a negative sentiment. Instead, I propose to use the well-known Norwegian encyclopedia Store norske leksikon (SNL) to select relevant queries. SNL publishes over 200 000 articles online separated into 15 different topics. The topic Økonomi og næringsliv (economics and commerce) contains another ten categories from which I will draw the queries. These are categories such as “banking and finance”, “economics” or “private economy”. I collect the list of words that I classify as economic. I draw between 10 and 30 words from the nine most relevant categories. This gives me a total of 185 words. Typical examples include “luggage”, “monetary policy” and “exports”. Further, another 58 subjective words are added, which I consider to be more Google-friendly. I regard these queries as more likely to capture a person's intention when googling. The list of subjective words is to a high degree based on queries used by others performing similar analyses before me, see e.g. Thorsrud (2018, appendix D) and Koop (2013, appendix A). Second, search terms consisting of more than one word are collected both with and without quotation marks. As misspellings or incorrect wording can occur, this serves as an appropriate step to eliminate such sensitivity biases. For example, if you include double quotation marks when searching for “wirte CV”, Google Trends will also include searches like “write CV” or “how to write CV”. Finally, I eliminate search terms with too few valid SVIs. This will typically occur when the interest in a search term is infrequent over time. The selection procedure given above results in a list of 221 search terms.

3.4 Data set reduction

Large volumes of data can be difficult to handle. A large number of time series might make it difficult to distinguish relevant variables from pure noise. To address this issue, I propose to divide the large data set obtained by collecting Google SVIs into multiple, smaller data sets. The advantages of constructing smaller data sets are twofold. First, by allowing smaller data sets to be evaluated, I am able to increase the degrees of freedom when I specify the models later on. See Section 3.5.1 for a thorough description of this issue. Second, by employing a systematic framework that allows a smaller number of (weighted) variables to enter into different data sets, I may be able to pick up a stronger GDP-related signal. These signals are present in the large volumes of data, but as the data sets are reduced, the probability of retrieving them will hopefully increase.

Data set reduction serves as an intermediate step so that I can later decide on an econometric forecasting model by the use of an automatic variable selection method. Figure 3.1 is a simplified visual representation of the procedure I will describe below.


Figure 3.1: Data set map. The map is read from left to right. To the left, we have the complete data set with k = 663 variables. Moving to the right, the data set is subdivided into multiple, smaller data sets. In the final stage, to the right, we use an automated algorithm (Autometrics) for variable selection to arrive at the final forecasting model equations.

I have T = 58 observations of quarterly GDP and 221 explanatory variables collected on a monthly basis. I adopt the terminology of Doornik and Hendry (2015) and label the shape of my data set as fat, since I have “many variables, but not so many observations”. I transform each individual SVI into three new variables, one for each month of the quarter. This transformation has two advantages. First, a mixed-frequency data set is more convenient to handle in this format. Second, it allows me to test whether the third month of the quarter is more important for identifying the future growth in quarterly GDP than the first month of the quarter. This procedure results in a total of N = 3 × 221 = 663 variables. Some estimation methods used for model reduction require T ≥ N, i.e. that the number of observations is larger than or equal to the number of explanatory variables in the general unrestricted model (GUM).

However, the algorithm I employ allows for fat data sets. That said, handling a fat data set where the width (N) is more than 10 times as large as the length (T) leaves little room for flexibility. This thesis will consider an automatic variable selection approach where the user can vary only a few settings, such as the significance level of the tests the algorithm relies on. If the variable-to-observation ratio is high, it will be important to tighten the significance level in order to minimize the number of irrelevant variables that the final model retains. In other words, the room for flexibility shrinks.

Hence, I will target a smaller data set containing no more than 150 variables. The rationale is that a data set of more than 150 variables will lead to a higher probability of type-I error than I am comfortable with, see the discussion in Section 3.5.1.
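The month-by-month split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the monthly SVI is assumed to be a plain list that starts in the first month of a quarter, the function name `split_by_month_in_quarter` and the labels M1, M2 and M3 are chosen for the example, and the index values are made up.

```python
def split_by_month_in_quarter(monthly):
    """Turn one monthly series into three quarterly series (M1, M2, M3).

    Assumes the list starts in the first month of a quarter; an incomplete
    final quarter is dropped.
    """
    n_quarters = len(monthly) // 3
    return {
        f"M{m + 1}": [monthly[3 * q + m] for q in range(n_quarters)]
        for m in range(3)
    }


# Two quarters of hypothetical monthly index values (not real SVI data).
svi = [10, 12, 11, 20, 22, 21]
parts = split_by_month_in_quarter(svi)
print(parts)

# Three monthly variables per search term, as in the text: 3 x 221.
n_variables = 3 * 221
print(n_variables)
```

Each of the three resulting series has one observation per quarter, so it lines up directly with the quarterly target variable.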

3.4.1 Correlations (targeted predictors)

The first technique applied for data set reduction is to look for simple pairwise correlations between each of the N variables and the percentage growth rate of quarterly GDP, see e.g. Boivin et al. (2006). I decide to keep only the regressors whose correlation with ∆QGDP is above 0.2. This amounts to 123 monthly variables and I refer to them as “targeted predictors”.

Each SVI in this data set is listed in the Appendix, Table A.2. “Hard thresholding” such as this poses two potential challenges to the analysis. For one, hard thresholding can be sensitive to small changes in the data. Since each time series changes marginally from day to day as more information is logged into the SVI, the data and the resulting correlations will differ depending on which day the SVI is downloaded. For example, when a smaller data sample is downloaded on 01.04.2019, I find 11 correlation coefficients above 0.2, while data downloaded on 02.04.2019 yield 12 correlation coefficients above the threshold. Another drawback of adopting a hard thresholding rule as selection criterion is that it only considers the bivariate relationship between each SVI and ∆QGDP, without accounting for the information contained in the other regressors. As a result, hard thresholding tends to select highly collinear variables. Multicollinearity may complicate parameter estimation and lead to larger sampling variance and difficulties in estimating the partial effects on the target variable. As one example, I select, among others, the two SVIs “Adecco” and “Jobzone”. These are examples of recruitment platforms in Norway. It is conceivable that these search terms are entered into Google by the same people on approximately the same day. In other words, they are not independent and are likely to correlate highly with one another.
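To make the thresholding rule concrete, the following is a minimal sketch in plain Python. It is not the code used in the thesis: the function names, the toy series and the choice to threshold the raw (rather than the absolute) correlation are assumptions made for illustration, since the text does not state which of the two is used.

```python
import math


def pearson(x, y):
    """Sample Pearson correlation between two equally long lists.

    Raises ZeroDivisionError if either series is constant.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def targeted_predictors(candidates, target, threshold=0.2):
    """Names of the series whose correlation with the target exceeds the threshold."""
    return [name for name, series in candidates.items()
            if pearson(series, target) > threshold]


# Hypothetical toy data: a stand-in for GDP growth and three candidate series.
gdp_growth = [1.0, 2.0, 3.0, 4.0]
candidates = {
    "a": [2.0, 4.0, 6.0, 8.0],    # moves with the target
    "b": [4.0, 3.0, 2.0, 1.0],    # moves against the target
    "c": [1.0, -1.0, -1.0, 1.0],  # uncorrelated with the target
}
selected = targeted_predictors(candidates, gdp_growth)
print(selected)
```

Because each candidate is screened in isolation, two near-duplicate series would both pass the filter, which is exactly the collinearity issue raised above.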

3.4.2 Principal Component Analysis

The method I refer to as PCA is a principal component analysis of the set of indicator variables, followed by a linear regression in which the dependent variable is regressed on the leading principal components of the data set. The method has been widely used for dimension reduction in many fields and has proven to be a successful tool in the context of forecasting quarterly GDP with big data, see e.g. Ouysse (2013). PCA solves three important problems in this thesis: (1) it reduces the dimensions of the data set, (2) it calculates the correlation between my explanatory variables and (3) it displays the explanatory variables that are most significant in describing the full data set that I have at hand. Consequently, such an analysis has the advantage of displaying the most dominant pattern of the data set by attaching higher weights to the most important variables that describe the data set and lower weights to the variables that behave more like noise.

Much like a linear regression, we can think of principal components as an attempt to bisect the scatter plot of data with straight lines. This is a way of summarizing the dependency between the data points. The direction of the line represents the principal component, also known as an eigenvector. The first principal component cleaves the scatter plot of data points with a straight line that follows the dimension with the highest variability. Thus, this component explains most of the variation in the data and therefore points in the most significant direction of the data set. The most significant direction indicates which area of the data set holds the most variation and hence the most information. Eigenvalues are weights attached to the eigenvectors. The eigenvalue corresponding to an eigenvector reveals how dispersed the data are along the straight line. A high eigenvalue corresponds to a higher spread of data points. The principal components are constructed as linear combinations of all 663 variables in the data set. Each linear combination is constructed in such a way as to maximize the variance accounted for within the group of variables. Every variable is assigned a different weight in the respective component, depending on how much the variable contributes to the variation in the data set. Specifically, the weights are constructed by finding the entry of variable k in the rth principal component and dividing it by the standard deviation of the respective variable. Note also that the components are linearly uncorrelated (“orthogonal”) with each other.

By performing PCA on the entire data set, I find that the first principal component explains 9.35 % of the sample variance, while the second and third explain 8.25 % and 7.34 % of the sample variance, respectively. In total, the first 30 principal components cumulatively account for 90 % of the variation in the data set.

There are some aspects one should be aware of when performing a PCA. First, the principal components describe the underlying structure in the data and construct linear combinations of all explanatory variables in the data set; as opposed to a factor analysis, a PCA does not take into account how these variables correlate with the target variable.

Second, we need to choose a threshold for how many components we want to include in our analysis. On the one hand, if we choose too many components, we have not efficiently managed to reduce the dimension of the data set. On the other hand, we might end up with too little information in the data set if we ignore principal components that correspond to smaller eigenvalues but are in fact relevant for explaining the covariance structure of the data set. There is a vast literature on this topic, see e.g. Jolliffe (2002, Ch. 6). The simplest procedure is to set some threshold and stop when the first k components account for a percentage of total variation greater than some targeted percentage.
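The stopping rule in the last sentence can be illustrated with a short Python sketch. It assumes the eigenvalues of the covariance (or correlation) matrix have already been computed elsewhere; the function name and the eigenvalues below are hypothetical, and the sketch stops as soon as the cumulative share reaches the target.

```python
def n_components_for_share(eigenvalues, target_share):
    """Smallest k such that the k largest eigenvalues reach the target share of total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= target_share:
            return k
    return len(ordered)


# Hypothetical eigenvalues; the cumulative shares are 0.4, 0.7, 0.9 and 1.0.
eigenvalues = [4.0, 3.0, 2.0, 1.0]
k = n_components_for_share(eigenvalues, 0.85)
print(k)
```

Raising the target share retains more components, which trades dimension reduction against the risk of discarding relevant variation.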

3.4.3 Supervised Principal Component Analysis

I apply the final data set reduction technique by extracting principal components from a smaller data set than the original one. Boivin and Ng (2006) were among the first to ask whether more data always provide a better basis for extracting principal components. Specifically, if we retain variables with a stronger GDP-related signal when the data set is subdivided, we might be able to enhance the predictive performance of the forecasting model. To draw my second batch of principal components, I use the data set that includes the regressors whose correlation coefficient with ∆QGDP is above 0.2. The advantages of this procedure can be summarized in two points: (1) using a smaller data set might increase the signal-to-noise ratio, as discussed above, and (2) we mitigate one of the deficiencies of PCA discussed earlier, as we are able to impose an indirect relationship between the SVIs and the target variable. By this method, the first, second and third principal components account for 18, 9 and 8 % of the sample variability, respectively.

3.5 Automatic variable selection

The final step is selecting an empirical model. It is neither practical nor feasible to include all variables in a final model, even after trimming the data set by the methods just mentioned. This section will therefore consider the use of automatic search methods to select one final model for a specific data set, see the end station in Figure 3.1.

In terms of automatic variable selection, several methodical approaches are available.

Since this thesis deals with fat data sets, OLS and other related methods are not well suited for these purposes. Following the terminology of Epprecht et al. (2019), there are three competing strategies within the branch of automatic variable selection: specific-to-general, general-to-specific (Gets) and shrinkage. In the context of an OLS estimation, a specific-to-general approach entails starting from a parsimonious model and including new predictors that are significant in explaining the dependent variable. A Gets approach is the opposite: you start from a large, general unrestricted model (GUM) and omit variables depending on whether the information they hold contributes significantly to the model. Some examples of the specific-to-general approach are forward stepwise regression and RETINA. LASSO is a shrinkage method formally introduced by Tibshirani (1996). Finally, Autometrics, an extended version of PcGets, is an example of a general-to-specific approach, see e.g. Doornik (2009) and Hendry et al. (1999). Autometrics has several advantages. First, it can handle more variables than observations, in contrast to, for example, PcGets. Second, it is a well-documented system that uses established econometric terminology and is suitable for econometric applications.

In addition, Autometrics is an objective and easily reproducible tool which is generally not affected by the subjective choices of a modeller. This increases its attractiveness, as it is easy to find documentation on the relative algorithmic performance in a given modelling setting. Finally, it has been shown that Autometrics performs reasonably well when used for nowcasting with Google Trends data. Sloof (2016) evaluated three variable selection methods and found that LASSO and Autometrics produced accurate and robust results when using Google search data to predict flu outbreaks in the US. As pointed out by Marcellino and Ghysels (2018), forecasts from an Autometrics-based final model perform rather well.² The main success criterion of any model specification is to conclude with an empirical model that is a close approximation to the local data generating process (LDGP). We usually use the LDGP to describe the data generating process (DGP) for locally relevant variables that reflect the economic mechanisms that operate in the real world. Therefore, an adequate search algorithm should be able to detect the LDGP by reduction from a GUM.

Performing a Gets search is much like attempting to find the source of the Nile. For example, your intuition may tell you to start in Egypt and find your way from there. Egypt will in this case reflect all the variables and information you think are relevant for finding the LDGP. In other words, Egypt mirrors the GUM, as we start from a general model and work our way to the specific source of the Nile. Just as the Nile has a vast number of rivers and creeks to search, the GUM can be reduced into many different models. In order to find the river that takes you directly to the source with the lowest search costs, one is bound to operate in a systematic manner. Algorithms used to perform a structured Gets search, such as Autometrics, do so in the following manner. First, since the algorithm in this case handles more variables than observations, Autometrics will divide the sample into multiple GUMs, where each GUM has fewer variables than observations. Essentially, this means that the search starts in an even more specific direction, for example by starting multiple searches in the Western desert and near the coast of Egypt. Second, based on the information available, Autometrics pursues each river that is most likely to take us to the source of the Nile. In this process, variables that have insignificant coefficients are removed from the search until one arrives at a final specification, where all variables are significant. However, since the Nile splits into various passages from Egypt, Autometrics will usually find multiple rivers that can take us there. For example, it can find two parallel rivers that are both assessed as equally good candidates. These two rivers may split again and form a third river, a union of the two others. Autometrics would then treat each of these rivers as new starting points and evaluate the two models against the union. If the two models are rejected against the union model, the union will be the terminal model. The terminal model, and equivalently the point where the river search ends, is hopefully the source of the Nile and the best approximation to the LDGP.

² See Castle, Doornik and Hendry (2011) for a general discussion of the properties of Autometrics.

3.5.1 Cost of automatic variable selection

Although automatic Gets may appear like a room with a view, it comes at a price. One inevitable consequence of multiple testing is the accumulation of type-I error.³ The algorithm performs k t-tests, and the probability of falsely rejecting H0 at least once increases with the number of variables we assess. One solution to mitigate this problem, which has proven successful in simulations, is to set a tighter significance level. In Autometrics, the parameter we use to adjust the significance level is called Target size and I will denote it by α. As long as the chosen significance level is set to the lowest floor available in Autometrics, Hendry and Doornik (2014) argue there is little efficiency loss in examining many candidate variables that transpire to be irrelevant.

There are two main success criteria for a search algorithm. First, we want the search algorithm to omit all irrelevant variables. Second, we want it to retain all relevant variables. These two criteria point in opposite directions when deciding on the appropriate significance level, as an algorithm’s tendency to drop variables that belong to the LDGP is inversely related to the chosen significance level. That is, the looser we set the significance level, the more of the relevant, but also (by chance) the more of the irrelevant variables are retained, and vice versa. Therefore, to set the appropriate significance level, I must decide on an acceptable probability of type-I error and an acceptable average number of irrelevant variables (false positives) that I allow to enter the final model.

The probability of type-I error is defined as:

$$P(\text{Type-I error} \mid \beta_j = 0, \forall j) = 1 - (1 - \alpha)^{k_{gum}}, \tag{3.1}$$

where $\beta_j = 0$ refers to the coefficient of variable j under the null hypothesis.

Further, to find the acceptable average number of irrelevant variables, I follow Hendry and Nielsen (2007, Ch. 19.3). They consider the case where the $k_{gum}$ variables are independent⁴ and model the decision to retain an irrelevant variable as a Bernoulli distributed variable. With $k_{gum}$ independent tests, the expected number of retained irrelevant variables will be:

$$k_{irr} = k_{gum} \times \alpha \tag{3.2}$$

For example, by setting α = 0.01 and allowing $k_{gum}$ = 600 variables to enter the GUM, I expect the final model to contain 6 irrelevant variables. This might be considered extreme, but a model holding some irrelevant variables will not necessarily decrease forecasting precision, as these variables might be correlated with excluded, relevant variables.
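The two expressions (3.1) and (3.2) are easy to evaluate numerically. The sketch below is a plain Python illustration under the independence assumption made in the text; the function names are chosen for the example only.

```python
def prob_any_false_retention(alpha, k_gum):
    """Equation (3.1): probability of at least one type-I error in k_gum independent tests."""
    return 1.0 - (1.0 - alpha) ** k_gum


def expected_irrelevant(alpha, k_gum):
    """Equation (3.2): expected number of irrelevant variables retained."""
    return alpha * k_gum


# The example from the text (alpha = 0.01, 600 variables), followed by the
# 150-variable target combined with two illustrative significance levels.
for alpha, k_gum in [(0.01, 600), (0.01, 150), (0.001, 150)]:
    p = prob_any_false_retention(alpha, k_gum)
    k_irr = expected_irrelevant(alpha, k_gum)
    print(f"alpha={alpha}, k_gum={k_gum}: P(any type-I error)={p:.3f}, expected irrelevant={k_irr:.2f}")
```

Tightening α or shrinking the GUM lowers both quantities, which is the motivation for the 150-variable target and for comparing a liberal and a conservative significance level.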

³ On the flip side, we know about the problem and can, by different means, try to limit it.

⁴ Whether the SVIs can be characterized as independent is debatable, as a person googling “unemployed” may subsequently also google “CV” or “NAV”. The GUM contains all these series. For the purpose of this analysis, I will, however, consider the SVIs as independent.


The trade-off between accepting a high probability of type-I error and retaining relevant explanatory variables can to some degree be mitigated by allowing for more than one significance level. I will therefore proceed by comparing two decision strategies: one liberal and one conservative. Each data set will be used to derive two unique models, one model using a loose significance level and another model using a tight significance level.

Finally, note that the accumulation of type-I error is not a problem when principal components enter the GUM. As discussed above, repeated testing may distort selection when the variables in the GUM are dependent on each other. However, when the variables are orthogonal to each other, as in the case of principal components, repeated testing will not distort selection and the cost of search is practically zero. Since we do not have documentation on the Autometrics package, it is difficult to know which tests are used to select the variables in the final model. However, if the variables in the GUM are orthogonal to each other, it is logical to presume that Autometrics ranks each regressor according to its absolute t-value and selects only those variables that have t-values above the cut-off. Thus, there will be no accumulation of type-I error.

3.6 Characteristics of Google Trends

Although econometric modelling with Google Trends has become popular over the last decade, the field might still be somewhat obscure for a practising time series econometrician. It seems reasonable, given these circumstances, to dedicate some space to describing the most frequent patterns that Google Trends variables display. To do this, I have drawn three random samples with nine different variables (separated into three months).

To summarize, all samples display very similar patterns. First, the mean within each variable seems to vary little, but there is a considerable amount of variation between variables. In addition, most variables in this sample are approximately normally distributed, but some of the distributions have a higher kurtosis than the normal distribution. If a distribution has a large amount of mass in its tails, then extreme departures from the variable’s mean are more likely. This indicates that there might be large outliers in the sample. The literature offers some solutions to this problem. For example, Da et al. (2015) winsorized each series at the 5% level, 2.5% in each tail, to limit the effect of spurious outliers. Since the estimated standard errors will most likely underestimate the uncertainty of the regression coefficients, I will use robust standard errors to evaluate significance. Further, a classical assumption in time series modelling is that Cov(εi, εj | Xi, Xj) = 0 for all i ≠ j, i.e. no autocorrelation. When there is autocorrelation in an econometric model, and this is left unaccounted for, it will lead to residual autocorrelation. I find that the majority of the variables considered in this small pilot study are significantly autocorrelated with their first lag. If there is considerable autocorrelation between the residuals, then standard inference tests, such as t- and F-tests, are no longer reliable. This suggests that when narrowing the analysis later on, it will be important to include more lags in the model. See Chapter 5 for a more detailed guide on how these GUMs are specified.

One specific form of a process with autocorrelation, which I highlighted in Section 3.1, is a unit root process. A battery of unit root tests indicates that around 50% of the variables are integrated of order one (X ∼ I(1)). As one example, the SVI “M1, Avtalegiro” (left-hand panel) gives a clear indication of containing a unit root, as displayed in Figure 3.2. This is reflected by the changing slope, which resembles a random walk. Once I difference the SVI, into “dM1, Avtalegiro” (right-hand panel), the trend disappears.

[Figure 3.2, two panels. Left: M1, Avtalegiro. Right: dM1, Avtalegiro.]

Figure 3.2: Example of search value indices from Google Trends. The left panel displays the relative search interest of M1, Avtalegiro in the first month of the quarter, without transformation. The right panel displays the first-differenced variable dM1, Avtalegiro, where d indicates that the variable is differenced once and M1 that it is measured in the first month of the quarter. The graph runs from 2004,Q1 to 2018,Q4.

First differencing appears to be a satisfactory solution to the unit root problem. After differencing once, the augmented Dickey-Fuller tests all reject the null hypothesis of a unit root, indicating stationary series.
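As a rough illustration of the transformation, the sketch below first-differences a series and computes its lag-1 sample autocorrelation in plain Python. It is a descriptive check only and does not implement the augmented Dickey-Fuller test; the function names and the toy series are assumptions made for the example.

```python
def first_difference(series):
    """Return x_t - x_{t-1}; the result is one observation shorter than the input."""
    return [b - a for a, b in zip(series, series[1:])]


def lag1_autocorrelation(series):
    """Lag-1 sample autocorrelation: sum of lagged cross-products over the sum of squares."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    denom = sum(d * d for d in dev)
    if denom == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    return sum(dev[t] * dev[t - 1] for t in range(1, n)) / denom


# Hypothetical trending index values (not real SVI data).
levels = [10, 12, 11, 14, 16, 15, 18, 20, 19, 22]
changes = first_difference(levels)
print(changes)
print(lag1_autocorrelation(levels))
print(lag1_autocorrelation(changes))
```

A persistent series in levels will typically show a lag-1 autocorrelation much closer to one than its differenced counterpart, but a formal conclusion about the order of integration still requires a unit root test.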

3.7 Transformation of SVIs

With reference to the discussion above, all the SVIs are differenced once. I do this with some hesitation as this transformation involves a trade-off. First differencing removes the variance associated with variables that contain unit root trends, which is a good thing, but it also removes some of the variance in the variables that are unit root-free, which we want to keep. Unfortunately, I do not have the capacity or the tools required, to inspect each individual series closely and am therefore not able to determine the order of integration of each variable. Despite the cost associated with taking the first difference of variables that are in fact stationary, the advantages tip the scale in favor of transforming the series in this way.

Second, some of the SVIs contain seasonal components. To name one example, the SVI “feriepenger” (holiday money) spikes every June from 2006 to 2018. Since the target variable is also seasonally unadjusted, I do not necessarily need to transform the SVIs.

Instead, to account for some of the seasonal variability in ∆QGDP, I include the variable Workdays as an independent variable in the regression. Workdays is a calendar measure of how many working days there are in each quarter in a given year. This method is to some extent used by Statistics Norway (see e.g. Foss, Seierstad (2009)) to pre-correct for seasonal components in GDP.

The data used in this thesis are described in Table 3.2 below. I have added the monthly observations of GDP to this table, although this part of the data description will not be relevant for the following chapters. See Chapter 7 for a thorough discussion of how I use monthly observations of GDP to predict current-quarter GDP.

| Variable name | Source | Frequency | Comments |
|---|---|---|---|
| Google SVIs | Google Trends | Monthly | Relative frequency index. Updated daily. Transformation: partitioned into three variables, one for each month; differenced once. Number of SVIs: 663. Data downloaded: 31.01.2019. |
| ∆QGDP | Statistics Norway | Quarterly | Percentage change in quarterly GDP. Fixed 2016 prices, mill. kroner, Fastlands-Norge (mainland Norway). Published approximately 40 days after end of quarter. Extracted: 09.02.2019. |
| ∆MGDP | Statistics Norway | Monthly | Monthly growth rate. Fixed 2016 prices, mill. kroner, Fastlands-Norge (mainland Norway). Published 40 days after end of month. Extracted: 09.02.2019. |

Table 3.2: Data description. I use Google SVI data and quarterly GDP (∆QGDP) data from 2004,Q1 to 2018,Q4. Monthly observations of GDP (∆MGDP) are collected from 2016,Q1 to 2018,Q4.


Chapter 4

General evaluation of nowcasting models

Before I move on to model specifications and empirical results, I first need to address how I evaluate a good sequence of forecasts. This is important, as modelling yet-unknown variables by the laws of statistics will generally produce forecasting errors. The consequences of these errors usually depend on their size and the purpose of the forecast. For example, a weather forecast that underestimates the temperature by one or two degrees is negligible to most people. In other cases, where large forecasting errors are made repeatedly and prevent good policy decisions from being reached, forecast errors can incur severe costs. According to Clements and Hendry (2006), these errors represent the most serious challenges to macroeconomic forecasting. Therefore, this chapter will first introduce a test for forecast unbiasedness and a descriptive measure of forecasting accuracy. Finally, I describe a formal test for comparative accuracy.

4.1 Unbiased forecasts

A forecast is said to be unbiased if

$$E_t\left(y_{t+h|t} - y_{t+h}\right) = 0 \tag{4.1}$$

i.e. if the expected value of the difference between the forecast, $y_{t+h|t}$, and the realization, $y_{t+h}$, is equal to zero. That is, on average the forecast should be correct. It may come as no surprise that a good sequence of forecasts is characterized by the tendency to neither systematically under- nor over-predict the target variable. The test for forecast unbiasedness (Clements and Hendry 1998, p. 6) can be performed via a test of $\beta_0$ in the regression:

$$e_{t+1|t} = \beta_0 + \varepsilon_{t+1} \tag{4.2}$$

where $e_{t+1|t}$ corresponds to the forecast error and $\beta_0$ to an intercept. The test is performed by comparing the t-statistic of the null hypothesis that $\beta_0 = 0$ to the Normal distribution. If the null is rejected, the general interpretation of the test is that the series of forecasts is biased, as a rejection implies that the model is systematically over- or under-shooting the target variable.
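Because regression (4.2) contains only an intercept, the OLS estimate of the intercept is the mean forecast error and its standard error is the sample standard deviation divided by the square root of n. The sketch below computes the resulting t-statistic in plain Python. It uses conventional (non-robust) standard errors and hypothetical forecast errors, so it should be read as an illustration rather than as the exact procedure behind the reported results.

```python
import math


def unbiasedness_t_stat(errors):
    """t-statistic for H0: beta_0 = 0 in a regression of forecast errors on a constant."""
    n = len(errors)
    mean = sum(errors) / n                                    # OLS estimate of beta_0
    variance = sum((e - mean) ** 2 for e in errors) / (n - 1)
    standard_error = math.sqrt(variance / n)
    return mean / standard_error


# Hypothetical 1-step forecast errors (forecast minus realization).
forecast_errors = [0.3, -0.1, 0.2, 0.4, -0.2, 0.1, 0.3, 0.0]
t_stat = unbiasedness_t_stat(forecast_errors)

# 1.96 is the two-sided 5% critical value of the standard Normal distribution.
print(t_stat, abs(t_stat) > 1.96)
```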


4.2 Descriptive measure of forecast performance

To evaluate the predictive accuracy of the forecast, I first need to settle on a descriptive measure of forecast accuracy. I use the root mean-squared forecast error (RMSFE) to compare the nowcasts with the realizations of GDP growth. The RMSFE has several advantages. First, it is a popular and well-known measure of forecasting performance. Practitioners and academicians select the RMSFE more frequently than any other error measure; see e.g. Armstrong and Collopy (1992). Second, the RMSFE is scale-invariant, which will become important later on in the analysis, see Section 8.1. Finally, Armstrong and Collopy (1992) argue that the RMSFE is, compared with other measures, an easily understandable measure for decision makers without an advanced econometric background.

I follow custom and use the RMSFE as the main descriptive measure of forecast performance. For an H-period forecast, it is defined as:

$$RMSFE = \sqrt{\frac{1}{H}\sum_{h=1}^{H}\left(\Delta QGDP_{T+h} - \Delta QGDP^{f}_{T+h|T}\right)^2} = \sqrt{\frac{1}{H}\sum_{h=1}^{H}\left(e^{f}_{T+h|I_T}\right)^2} \tag{4.3}$$

where $\Delta QGDP_{T+h}$ refers to the realization of the percentage change in quarterly GDP in period T+h and $\Delta QGDP^{f}_{T+h|T}$ to the predicted percentage change in quarterly GDP at time T+h given the information available in period T. Further, $e^{f}_{T+h|I_T}$ denotes the forecast error in period T+h of the percentage change in quarterly GDP. In addition to being unbiased, a sequence of good forecasts is also characterized by small forecast errors. This feature shows up as a low RMSFE, which indicates a better predictive performance.
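Equation (4.3) translates directly into a few lines of Python. The sketch below is illustrative only: the function name and the growth rates are hypothetical and do not come from the thesis data.

```python
import math


def rmsfe(realized, forecast):
    """Root mean-squared forecast error over H periods, as in equation (4.3)."""
    if len(realized) != len(forecast) or not realized:
        raise ValueError("need two non-empty series of equal length")
    squared_errors = [(a - f) ** 2 for a, f in zip(realized, forecast)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical quarterly growth rates in percent (not actual GDP data).
realized_growth = [0.5, 0.7, 0.2, 0.9]
nowcast_growth = [0.4, 0.9, 0.3, 0.6]
print(rmsfe(realized_growth, nowcast_growth))
```

Since the errors are squared before averaging, a single large miss weighs heavily on the measure, which is relevant for the outlier discussion that follows.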

The RMSFE is a measure of forecast accuracy, but fails to inform us whether differences between rival forecasts can be attributed to sampling variability or whether these differences disappear once this variability is accounted for. In other words, on average model A might seem to be a better forecasting model than model B, in terms of a lower RMSFE. However, this does not necessarily mean that model A is a better predictive model. Model B could, for example, have one large outlier that decreases its average performance below that of model A. Therefore, I need a different tool to test for the comparative accuracy of two predictive models.

4.3 A formal test of comparative accuracy: The DM test

To evaluate the performance of each model relative to a benchmark, I use the Diebold-Mariano (DM) test of comparative accuracy, as described by Clements (2005, pp. 12-14). The DM test is performed as follows. Let $e^2_{i,t+1|t}$ for $t = 1, 2, \ldots, n$ denote the n observations of the squared 1-step forecast error for model i. The difference between the squared forecast errors of model 1 and model 2 is denoted by:

$$d_{t+1|t} = e^2_{1,t+1|t} - e^2_{2,t+1|t}, \quad t = 1, 2, \ldots, n. \tag{4.4}$$

With reference to Clements (2005, p. 14), the DM test statistic is:

$$DM = \frac{n\bar{d}}{\sqrt{\sum_{t=1}^{n}\left(d_{t+1|t} - \bar{d}\right)^2}} \sim N(0,1) \tag{4.5}$$

where $\bar{d}$ refers to the mean difference between the squared forecast errors of model 1 and model 2:

$$\bar{d} = \frac{1}{n}\sum_{t=1}^{n} d_{t+1|t}$$

We can conclude that the predictive accuracy of the two models differs significantly when the absolute value of the DM test statistic exceeds the critical value of the Normal distribution.
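The statistic in (4.5) can be sketched in plain Python as follows. The implementation follows the 1-step formula given above, which does not correct for serial correlation in the loss differential; the function name and the forecast errors are hypothetical.

```python
import math


def dm_statistic(errors_1, errors_2):
    """DM statistic from equation (4.5) for two series of 1-step forecast errors."""
    # Loss differential from equation (4.4): difference in squared errors.
    d = [e1 ** 2 - e2 ** 2 for e1, e2 in zip(errors_1, errors_2)]
    n = len(d)
    d_bar = sum(d) / n
    sum_sq_dev = sum((x - d_bar) ** 2 for x in d)
    return n * d_bar / math.sqrt(sum_sq_dev)


# Hypothetical forecast errors for a benchmark (model 1) and a rival (model 2).
benchmark_errors = [0.5, -0.4, 0.6, -0.3, 0.7, -0.5]
rival_errors = [0.2, -0.1, 0.3, -0.2, 0.1, -0.3]
dm = dm_statistic(benchmark_errors, rival_errors)

# A positive value means model 1 has the larger average squared error.
print(dm, abs(dm) > 1.96)
```

With squared-error loss, the sign of the statistic shows which model has the lower average loss, and the comparison with the Normal critical value indicates whether the difference is statistically significant.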

To evaluate the relative performance of each model, I compare the out-of-sample¹ predictions of the respective model with those of a first-order autoregressive (AR(1)) model and a random walk. The autoregressive model specifies that the left-hand side variable depends linearly on its own lag and a constant term, whereas the random walk model refers to a model in which the value of the series today is its value yesterday plus an unpredictable change. As noted by D’Agostino, Giannone, and Surico (2006), simple benchmarks like these can hardly be outperformed by more sophisticated models. Hence, it seems like a reasonable exercise to evaluate the performance of my models in the light of these benchmarks.

¹ I refer to the out-of-sample as the part of the sample in which I assess the predictive performance of the model. I regard the holdout sample and the test sample as synonymous with out-of-sample.


Chapter 5

Specifications on the SASI models

In this chapter I describe in some detail how I have applied the method presented above to obtain seven different models by injecting four different data sets into Autometrics and varying the algorithmic specifications. Figure 5.1, at the end of the chapter, gives a visual representation of the process in its entirety.

Model specifications are important because the Autometrics machinery, in many ways, exhibits a butterfly effect. The idea is that one butterfly, or in this case one “small change in the specification”, can have far-reaching ripple effects on the final model. Even a marginal change, like including one lagged variable or excluding a random SVI from the GUM, can result in a completely different final model. Therefore, I go through the specifications thoroughly to assure the reader that the models are well thought through and properly specified. In total, there are two specifications in the algorithm that I can adjust: (1) the level of significance and (2) the number of variables in the GUM. These specifications are altered to attain two distinct models for each data set. With reference to (1) and (2), it is clear that the terminal models that come out of Autometrics are not retained fully automatically, as some degree of human intervention is present. It is therefore natural to label these models Semi-Automatic Soft Indicator (SASI) models rather than “ASI” models.

5.1 General specifications

I split the sample into two periods: the first 38 observations are used for the in-sample estimation of each model, while the remaining 20 observations are reserved for out-of-sample evaluation. Specifically, observations from 2004Q3 to 2013Q4 are used to estimate each model, and the remaining observations, 2014Q1 to 2018Q4, are set aside to test the nowcasting performance of each model. Since the models are evaluated in a nowcasting setting, it is appropriate to set the forecast horizon to h = 1.

Specifically, I obtain 20 one-step-ahead forecasts in the out-of-sample period.
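The sample split can be checked with a small Python sketch. The helper `quarter_labels` and the variable names are hypothetical; only the dates and the 38/20 split come from the text. The (origin, target) pairs follow the $t+1|t$ notation used for the forecast errors and are an illustrative reading of h = 1, not a description of the information set of any particular model.

```python
def quarter_labels(start_year, start_quarter, count):
    """Return `count` consecutive quarter labels such as '2004Q3'."""
    labels = []
    year, quarter = start_year, start_quarter
    for _ in range(count):
        labels.append(f"{year}Q{quarter}")
        quarter += 1
        if quarter > 4:
            year, quarter = year + 1, 1
    return labels


N_IN_SAMPLE = 38      # 2004Q3 to 2013Q4
N_OUT_OF_SAMPLE = 20  # 2014Q1 to 2018Q4

quarters = quarter_labels(2004, 3, N_IN_SAMPLE + N_OUT_OF_SAMPLE)
in_sample = quarters[:N_IN_SAMPLE]
out_of_sample = quarters[N_IN_SAMPLE:]

# One (origin, target) pair per one-step-ahead forecast, i.e. t and t+1.
forecast_pairs = [
    (quarters[i - 1], quarters[i]) for i in range(N_IN_SAMPLE, len(quarters))
]
```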

The econometric approach, injecting $k$ variables into the Autometrics machinery to attain a terminal model, is conducted in two steps, referred to as a sequential selection procedure. The GUM in the first stage consists of all variables in the data set, and a relatively liberal significance level, denoted $\alpha_1$, is applied. In the second stage, selection is undertaken over the subset of predictors retained in the first stage. The model is estimated again, but with a relatively tighter significance level, denoted
