Descriptive Statistics - Phase 2: Additional Variables

3.2 Phase 2: Additional Variables

3.2.3 Descriptive Statistics

As in Phase 1, the vast majority (98.2%) of properties in the Phase 2 data set are apartments. To preserve prediction accuracy, we therefore restrict the Phase 2 analysis to apartments as well, since we do not have enough data to produce reliable predictions for other housing types. After feature engineering and data cleaning, we end up with 116,126 observations and 60³⁰ predictor columns in total. Table 3.3 shows descriptive statistics for a subset of the continuous variables from the final sample, and Table 3.4 shows descriptive statistics for a subset of the dummy variables. Complete descriptive statistics for all variables can be found in Appendices A3.1 and A3.2.

Table 3.3 shows that the average apartment in the sample is a two-bedroom apartment with approximately 63 square meters of primary area that lies 51 meters above sea level and roughly 2.1 km from the coastline. Moreover, the building is on average 72 years old, lies in an area with approximately 4,100 units in adjoining squares and has a total site area of roughly 3,700 square meters. Such a property is on average valued at 56,500 NOK per square meter and carries 121,200 NOK in shared debt. The table also shows large variation in some of the variables, as evidenced by the wide gap between minimum and maximum values. The sales price, for instance, averages approximately 56,500 NOK per square meter but ranges from 1,200 to 226,900 NOK per square meter. Because the mean lies far closer to the minimum than to the maximum, this suggests that the distribution of the response variable is right-skewed. Large variation is also found in the area measures (primary and gross area) as well as in total and undeveloped site area.
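The link between a wide range and a skewed distribution can be checked with a few lines of code. The following is a minimal sketch in Python, assuming a hypothetical list of prices per square meter; the values are invented for illustration and are not taken from the data set, and the function names are our own. It computes the mean, minimum and maximum as reported in Table 3.3, and uses the ratio of mean to median as a crude indicator of a long right tail, before and after a natural-log transformation.

```python
import math
import statistics


def describe(values):
    """Return the mean, minimum and maximum of a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "min": min(values),
        "max": max(values),
    }


def mean_median_ratio(values):
    """Ratio of mean to median; values well above 1 hint at a long right tail."""
    return statistics.mean(values) / statistics.median(values)


# Hypothetical prices per square meter (NOK), invented for illustration only.
prices = [30_000, 45_000, 50_000, 55_000, 60_000, 70_000, 220_000]

raw_stats = describe(prices)
log_prices = [math.log(p) for p in prices]

print(raw_stats)
print(mean_median_ratio(prices))      # raw scale: mean pulled up by the outlier
print(mean_median_ratio(log_prices))  # log scale: mean and median much closer
```

In this toy example a single expensive observation pulls the mean well above the median on the raw scale, whereas the two are nearly equal on the log scale. This is only an indication of why a log transformation can be useful for a skewed response, not a description of the actual distribution in our data.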

³⁰ Includes dummy variables.


Table 3.3: Descriptive statistics for a subset of the continuous variables used in Phase 2

This table shows descriptive statistics for a subset of the continuous variables. Mean, Min and Max are the average, minimum and maximum value for each variable, respectively. Price per square meter is reported on its original scale, but is log-transformed before the models are fitted, as described in Chapter 5. All observations with an anomaly score that falls in the 99th percentile (of the anomaly score) or above have been removed from the training set, but not from the test set. Missing values for all variables have been imputed using the grouped median, the median or a default value. Unreliable observations have been removed. The data contains transactions for apartments in the period 2005-2020 and covers properties from Frogner, Grünerløkka, St. Hanshaugen, Sagene and Ullern. N = 116,126. Note that the data was split into a training and a test set before missing values and outliers were handled. The table is constructed by combining the final training and test data after pre-processing, and is only used to illustrate the entire data set in this chapter. The training and test data remain separated for all other purposes.

| Variable | Mean | Min | Max |
|---|---|---|---|
| Price per square meter (NOK) | 56,535.94 | 1,209.68 | 226,851.90 |
| Shared debt in joint ownership (NOK) | 121,240.60 | 0 | 5,201,000 |
| Primary area (m²) | 62.80 | 10 | 386 |
| Gross area (m²) | 68.08 | 10 | 386 |
| Age of building (years) | 71.87 | 0 | 256 |
| Number of bedrooms | 1.69 | 0 | 23 |
| Site area (m²) | 3,671.85 | 0.00 | 61,435.60 |
| Undeveloped site area (m²) | 2,741.13 | 0.20 | 49,273.60 |
| Altitude (meters) | 51.28 | 0 | 190 |
| Distance to coast (meters) | 2,120.45 | 5 | 10,000 |
| Slope of site (degrees) | 3.65 | 0 | 37 |
| Number of units in adjoining squares | 4,146.67 | 258 | 6,653 |
| Consumer price index | 96.94 | 81.20 | 112.90 |
| NIBOR 3-month (%) | 2.17 | 0.23 | 7.91 |
| Brent Spot (USD) | 75.57 | 19.33 | 146.08 |
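The order of operations described in the notes to Table 3.3 (split first, then impute and remove outliers, with outliers removed from the training set only) can be sketched as follows. This is a minimal illustration under stated assumptions, not our actual pipeline: the column names `district`, `area` and `score` are hypothetical, the anomaly score is taken as given, and we assume that the imputation medians are computed on the training set only and then reused for the test set.

```python
import statistics
from collections import defaultdict


def grouped_medians(rows, group_key, value_key):
    """Median of value_key per group, computed from non-missing rows only."""
    groups = defaultdict(list)
    for row in rows:
        if row[value_key] is not None:
            groups[row[group_key]].append(row[value_key])
    return {group: statistics.median(vals) for group, vals in groups.items()}


def impute(rows, group_key, value_key, medians, fallback):
    """Fill missing values with the group median, else with the fallback."""
    return [
        {**row, value_key: medians.get(row[group_key], fallback)}
        if row[value_key] is None else row
        for row in rows
    ]


def drop_top_anomalies(rows, score_key):
    """Drop rows at or above the 99th percentile of the anomaly score."""
    cutoff = statistics.quantiles((r[score_key] for r in rows), n=100)[-1]
    return [r for r in rows if r[score_key] < cutoff]


# Hypothetical data that has already been split into training and test sets.
train = [
    {"district": "Sagene" if i % 2 else "Frogner",
     "area": float(40 + i), "score": float(i)}
    for i in range(100)
]
train[3]["area"] = None  # one missing value in the training set
test = [{"district": "Sagene", "area": None, "score": 500.0}]

# Assumption: imputation statistics come from the training set only.
medians = grouped_medians(train, "district", "area")
overall = statistics.median(r["area"] for r in train if r["area"] is not None)

train = impute(train, "district", "area", medians, overall)
test = impute(test, "district", "area", medians, overall)

# Outliers are removed from the training set, but not from the test set.
train = drop_top_anomalies(train, "score")
```

Keeping the test set untouched by the outlier filter means that the test error reflects the kind of observations a model would meet in practice, including unusual ones.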

Table 3.4 shows the distribution of a subset of the dummy variables. We see that the majority of apartments are block apartments (51.2%), and that a large fraction of the apartments do not have a specific sub-type (47.8%). Moreover, the majority of apartments have a balcony (66.8%), and 38.9% of them have an elevator installed. Furthermore, we see that the vast majority of these apartments are middle-floor apartments (76.8%), whereas top- and ground-floor apartments only account for 5.9% and 17.3%, respectively.

The data indicate that the districts with the highest turnover are Grünerløkka (27.7%), Sagene (25.7%) and Frogner (23.9%). When it comes to ownership, the two most common forms are condominiums³¹ (52.2%) and joint housing cooperatives (37%). Finally, the vast majority of sites are owned by the homeowners (88.4%), and the remaining 11.6% of sites are leased.
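The percentages above are simply the means of the corresponding dummy variables: a dummy takes the value 1 when an apartment belongs to a category and 0 otherwise, so its mean equals the share of apartments in that category. A minimal sketch in Python, assuming a hypothetical ownership column of ten invented observations (the function names are our own):

```python
def one_hot(categories):
    """Encode a categorical column as one 0/1 dummy column per level."""
    levels = sorted(set(categories))
    return {
        level: [1 if c == level else 0 for c in categories]
        for level in levels
    }


def dummy_means(categories):
    """Mean of each dummy column, i.e. the share of rows in each level."""
    return {
        level: sum(column) / len(column)
        for level, column in one_hot(categories).items()
    }


# Hypothetical ownership column, invented for illustration only.
ownership = ["condominium"] * 5 + ["cooperative"] * 4 + ["other"] * 1

means = dummy_means(ownership)
print(means)
```

Because every observation falls in exactly one level, the means of the dummies for a single categorical variable sum to one.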

Table 3.4: Descriptive statistics for a subset of the dummy variables used in Phase 2

This table shows descriptive statistics for a subset of the dummy variables. Mean is the average value for each dummy variable. All observations with an anomaly score that falls in the 99th percentile (of the anomaly score) or higher have been removed from the training set, but not from the test set. Missing values for all variables have been imputed using a default value. The data contains transactions for the period 2005-2020 for apartments only and covers properties from Frogner, Grünerløkka, St. Hanshaugen, Sagene and Ullern. N = 116,126. The data was split into a training and a test set before missing values and outliers were handled. The table is constructed by combining the final training and test data after pre-processing, and is only used to illustrate the entire data set in this chapter. The training and test data remain separated for all other purposes.

Variable Mean

³¹ Independent apartment that is owned solely by the owner.

4 Methodology

In this chapter we outline the approach for our analysis and describe the methodology by which we implement and evaluate our machine learning models. The first section describes the analyses we will perform and the order of actions taken. The second section deals with our model selection process, and will outline how we implement hyperparameter tuning and resampling methods for each model. Finally, the last section describes the metrics used for model assessment.

4.1 Approach

The aim of our analyses is to investigate whether our non-linear machine learning models can achieve increased prediction accuracy for housing prices in the Norwegian housing market compared to linear models. To investigate this, we apply a two-pronged approach where we split the analyses into two separate phases. In Phase 1 of our analyses, we replicate the housing data methodology developed by SSB and compare SSB’s model against our non-linear machine learning models trained on the same data set utilized by SSB. In Phase 2, we implement our machine learning models and the benchmark linear model (LM) on the expanded data set. The expanded data set includes additional property-specific variables obtained from Eiendomsverdi in addition to external macroeconomic variables³². In this manner, Phase 1 isolates the effects of the different algorithms and demonstrates the effects of changing the learning algorithm employed. Phase 2 investigates how much accuracy can be increased by including additional variables and implementing different pre-processing, and how the different algorithms capture this additional information.
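The logic of Phase 1, where the data are held fixed and only the learning algorithm changes, can be illustrated with a toy comparison. The sketch below is a minimal example under stated assumptions: the data are synthetic with a deliberately curved relationship, k-nearest-neighbour regression is used only as a simple stand-in for a non-linear learner, and root mean squared error (RMSE) is an assumed evaluation metric. None of these choices is taken from our actual analysis.

```python
import math


def fit_linear(xs, ys):
    """Ordinary least squares with one predictor; returns a predict function."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_knn(xs, ys, k=2):
    """k-nearest-neighbour regression: average the k closest training targets."""
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / k

    return predict


def rmse(model, xs, ys):
    """Root mean squared error of a fitted model on the given data."""
    return math.sqrt(sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs))


# Synthetic data with a curved relationship (illustration only).
xs = [i / 10 for i in range(41)]
ys = [x ** 2 for x in xs]

# Same training and test split for both models.
train_x, train_y = xs[0::2], ys[0::2]
test_x, test_y = xs[1::2], ys[1::2]

linear_rmse = rmse(fit_linear(train_x, train_y), test_x, test_y)
knn_rmse = rmse(fit_knn(train_x, train_y), test_x, test_y)

print(f"linear RMSE: {linear_rmse:.3f}")
print(f"k-NN RMSE:   {knn_rmse:.3f}")
```

Since both models see exactly the same training and test observations, any difference in test error is attributable to the algorithm. On this curved toy data the non-linear learner has the lower error by construction; whether the same holds for housing data is the empirical question the two phases are designed to answer.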

Source document: Machine learning as a tool for improved housing price prediction: the applicability of machine learning in housing price prediction and the economic implications of improvement to prediction accuracy (pages 42-46)