

2.3.5 Stacked Regression

The last model we consider is a Stacked Regression approach based on the SuperLearner algorithm by Van der Laan et al. (2007). Stacking was first introduced by Wolpert (1992), who coined the term Stacked Generalization. The concept was later formalized by Breiman (1996). The fundamental idea of Stacking is to leverage the strengths of several base learners to achieve more accurate predictions (Hastie et al., 2009; Boehmke and Greenwell, 2019). Unlike ensemble methods that combine several weak learners, Stacked methods are designed to combine several strong learners to form optimal predictions (Boehmke and Greenwell, 2019). Stacked models have become increasingly popular in recent years, as they have shown strong performance in machine learning competitions (Kaggle, 2020a).

The SuperLearner algorithm consists of three main steps (Boehmke and Greenwell, 2019).

First, the ensemble is constructed by specifying a list of L base learners and selecting a metalearning algorithm¹⁵ (Boehmke and Greenwell, 2019). To avoid selecting and tuning additional models, we use the models previously described in this chapter as our base learners. The metalearner can be any machine learning algorithm, but a regularized regression model is most commonly used (Boehmke and Greenwell, 2019). In this thesis, we use a regularized linear regression model as the metalearner. A regularized model is chosen to mitigate problems related to overfitting and highly correlated base learners.
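To make the construction step concrete, the following is a minimal sketch in Python of how the first step could look. The specific models, hyperparameters, and use of scikit-learn are illustrative assumptions, not the exact specification used in this thesis.

    # Step 1 (illustrative): specify the L base learners and a regularized linear metalearner.
    from sklearn.linear_model import LinearRegression, ElasticNetCV
    from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

    base_learners = [
        LinearRegression(),                                        # placeholder linear model
        RandomForestRegressor(n_estimators=500, random_state=1),   # placeholder tree ensemble
        GradientBoostingRegressor(random_state=1),                 # placeholder boosting model
    ]

    # Elastic-net metalearner: l1_ratio plays the role of alpha and the penalty strength of lambda;
    # both are selected by cross-validation.
    meta_learner = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5)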

The linear metalearner has two regularization parameters: α and λ. These parameters are closely related to the penalty terms from the Lasso and Ridge regression models, often referred to as L1 and L2, respectively. Although an in-depth description of these models is outside the scope of this section, a discussion of the regularization parameters in the context of Stacking is warranted. In short, α dictates the distribution between the penalty terms of the Lasso and Ridge regression models. Both models are very similar to OLS regression in the sense that the coefficients are estimated such that RSS is minimized. In addition to minimizing RSS, they also include a penalty term λ that performs regularization (James et al., 2013).

¹⁴ See Section 2.3.5 on Stacked Regression for an in-depth explanation of Ridge regularization.

¹⁵ The algorithm that is used to combine the predictions of the base learners into the final ensemble. Hereafter referred to as the metalearner.

Ridge performs regularization by shrinking the coefficients of highly correlated predictors towards zero as λ increases, but it never sets any of them exactly to zero (James et al., 2013).

Lasso works in a similar fashion, except that it can set the coefficients of highly correlated predictors exactly to zero if the penalty term is sufficiently large (James et al., 2013).
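For reference, the two penalized least-squares objectives can be written in the notation of James et al. (2013), with coefficients β_1, ..., β_p, as

\hat{\beta}^{\text{ridge}} = \underset{\beta}{\arg\min}\left\{ \text{RSS} + \lambda \sum_{j=1}^{p} \beta_j^{2} \right\}, \qquad \hat{\beta}^{\text{lasso}} = \underset{\beta}{\arg\min}\left\{ \text{RSS} + \lambda \sum_{j=1}^{p} |\beta_j| \right\}.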

In the metalearner, a high α means that regularization is performed mainly by means of the Lasso penalty, and a low α means that Ridge regularization dominates. In the context of Stacking, this means that a value of α close to 0 will shrink the weights of highly correlated base learners towards zero, whereas a value closer to 1 means that highly correlated base learners can be excluded from the final ensemble if λ is sufficiently large. As such, λ controls the degree to which regularization is performed.
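For concreteness, in the elastic-net parameterization used by common implementations such as glmnet, the combined penalty added to the RSS can be written as

\lambda \left[ \alpha \sum_{j=1}^{p} |\beta_j| + \frac{1-\alpha}{2} \sum_{j=1}^{p} \beta_j^{2} \right],

so that α = 1 recovers the pure Lasso penalty, α = 0 recovers the pure Ridge penalty, and λ scales the overall amount of shrinkage.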

In the second step the ensemble is trained. This involves using k-fold CV to train each of the base learners and collect their respective cross-validated predictions (Boehmke and Greenwell, 2019). The n cross-validated predictions from each model are then combined into an n × L feature matrix, represented by matrix Z in Equation 2.9 (Boehmke and Greenwell, 2019). The feature matrix, along with the original response vector y, makes up the level-one data:

Z = \begin{bmatrix} p_{1,1} & p_{1,2} & \cdots & p_{1,L} \\ p_{2,1} & p_{2,2} & \cdots & p_{2,L} \\ \vdots & \vdots & \ddots & \vdots \\ p_{n,1} & p_{n,2} & \cdots & p_{n,L} \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \tag{2.9}

where p_1, ..., p_L are the cross-validated predictions obtained from each of the L algorithms and n is the number of rows in the training set. The metalearning algorithm is then trained on the level-one data in Equation 2.9, which contains the individual predictions (Boehmke and Greenwell, 2019). The optimal weights for the base learners in the final model are obtained through cross-validation. In the third and final step the ensemble generates out-of-sample predictions. This is achieved by first generating predictions from the base learners, which are subsequently fed into the metalearner to generate the final ensemble predictions (Boehmke and Greenwell, 2019).
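As a brief illustration of steps two and three, the sketch below builds the level-one matrix Z from k-fold cross-validated predictions and then routes new base-learner predictions through the metalearner. The function names and the use of scikit-learn are assumptions made for illustration, not the thesis' actual implementation; the base learners and metalearner are taken as inputs (for example, as specified in the step-one sketch above).

    import numpy as np
    from sklearn.model_selection import KFold, cross_val_predict

    def fit_super_learner(base_learners, meta_learner, X_train, y_train, k=5):
        # Step 2: collect k-fold cross-validated predictions from each base learner
        # and stack them column-wise into the n x L level-one matrix Z.
        kf = KFold(n_splits=k, shuffle=True, random_state=1)
        Z = np.column_stack([cross_val_predict(m, X_train, y_train, cv=kf)
                             for m in base_learners])
        meta_learner.fit(Z, y_train)      # learn the base-learner weights on (Z, y)
        for m in base_learners:           # refit each base learner on the full training set
            m.fit(X_train, y_train)
        return base_learners, meta_learner

    def predict_super_learner(base_learners, meta_learner, X_new):
        # Step 3: base-learner predictions for new data are fed into the metalearner
        # to produce the final ensemble predictions.
        Z_new = np.column_stack([m.predict(X_new) for m in base_learners])
        return meta_learner.predict(Z_new)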


3 Data

In this chapter we describe the data cleaning process and present descriptive statistics for the final data used in the two phases of our analysis. In Phase 1 we replicate the data cleaning process implemented by SSB and construct a data set using the same variables as SSB. This is done to ensure a reliable comparison in the first part of the analysis.

In Phase 2 we leverage additional property-specific and macroeconomic variables in an attempt to optimize the data for predictive modeling. This includes extensive feature engineering and advanced outlier detection.

The housing data used in this thesis is obtained from Eiendomsverdi and contains second-hand residential property transactions from the period January 2005 to September 2020 for five Oslo boroughs. The data covers approximately 121,000 unique transactions across 43 variables, including sales price, sales date and whether the transaction was a market sale. Moreover, the data contains variables that indicate the specific type of housing, such as estate type, estate sub-type and ownership type. The data also includes technical specifications and amenities. Technical specifications include variables such as square meters of primary area, gross total area, site area and build year. Examples of amenities include the number of bedrooms and whether the specific property has a balcony or an elevator. As for geographical variables, we have information about city district, coordinates, distance to coast, altitude and sunlight conditions.

3.1 Phase 1: SSB Replication