

6.1 Results - Part 1

After identifying the "optimal" hyperparameter combinations, as well as which variogram model to select for the RK, the models’ ability to generalize to unseen data was evaluated. The final models were fitted on the entire training set, using the hyperparameter values that yielded the best performance during the CV process. Each model then took the predictor matrix of the held-out test set as input and produced predictions of first-year production volumes. The predicted values were compared to the true values, and the performance metrics described in Chapter 4.3 were calculated. Table 6.1 summarizes the selected performance metrics for the different models.


Table 6.1

Summary of the different models’ performance.

Model    RMSE         MASE     Moran’s I
RF_ns    23,683.64    0.6567   0.2354
RF_fe    19,512.52    0.5284   0.1207
RF_xy    17,944.21    0.4794   0.0992
GRF      16,705.98    0.4451   0.0679
RK       18,853.69    0.5110   0.0481
RK_rf    18,500.60    0.4944   0.1045
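The three metrics in Table 6.1 can be illustrated with a minimal NumPy sketch. It is not the implementation used in this study: the naive benchmark passed to `mase` and the inverse-distance weights in `morans_i` are assumptions made for illustration, since the exact definitions are given in Chapter 4.3 and are not reproduced here. All function names are hypothetical.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mase(y_true, y_pred, y_naive):
    """MAE of the model divided by the MAE of a naive benchmark (assumed given)."""
    y_true = np.asarray(y_true, float)
    mae_model = np.mean(np.abs(y_true - np.asarray(y_pred, float)))
    mae_naive = np.mean(np.abs(y_true - np.asarray(y_naive, float)))
    return float(mae_model / mae_naive)


def morans_i(residuals, coords):
    """Global Moran's I with inverse-distance weights (an assumed weighting)."""
    e = np.asarray(residuals, float)
    xy = np.asarray(coords, float)
    # Pairwise Euclidean distances between all locations.
    dist = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)
    w = np.zeros_like(dist)
    off_diag = dist > 0
    w[off_diag] = 1.0 / dist[off_diag]  # zero weight on the diagonal
    z = e - e.mean()
    return float(len(e) / w.sum() * (z @ w @ z) / (z @ z))
```

A value of Moran’s I close to zero indicates little spatial autocorrelation in the residuals, which is how the last column of Table 6.1 is read.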

Table 6.1 reveals that predictive performance improved as the spatial resolution of the RF models increased. The non-spatial RF_ns yielded the highest error metrics in terms of RMSE and MASE, and was also the model with the most spatially autocorrelated errors, revealed by Moran’s I. All three of these metrics steadily decrease as the spatial sophistication of the models gradually increases through the RF_fe, RF_xy, and GRF.

The GRF manages to account for spatial heterogeneity by fusing local sub-models with a global model. In this way, it is capable of extracting local signals (lower bias) while utilizing the global model’s larger data basis (lower variance) (Georganos et al., 2019). As a result, the GRF achieved the lowest RMSE and MASE of all the models. Interestingly, the well-established RK performed slightly worse than RF_xy in terms of RMSE and MASE, but its Moran’s I, the lowest in Table 6.1, signals that its residuals suffered from less spatial autocorrelation. This may imply that RF_xy is better at modeling the relationship between the well-design variables and the response, but slightly worse at modeling the spatial processes. This suggests that there are two sources of bias at play, against which each of the models has its strengths and weaknesses. However, RF_xy’s gains in modeling the well-design relationship seem to outweigh its weaker handling of the spatial processes, resulting in a lower error for RF_xy.

Since RF_xy performed better than RK in terms of RMSE and MASE, and since traditional RK is based on OLS residuals, a kriging model based on the residuals from RF_xy was also constructed as an experiment. The results from this model are summarized in the RK_rf row of Table 6.1. This model performed slightly better than the original RK in terms of RMSE and MASE, but with a higher Moran’s I. However, its performance was worse than that of RF_xy. A potential reason is that the initially fitted RF_xy changed the spatial correlation structure of the residuals, which may have made the kriging ineffective. Indeed, the Moran’s I indicates that more spatial autocorrelation is present in the residuals of RK_rf than in those of the traditional RK. To investigate this hypothesis, the sample variograms were compared. The left pane of Figure 6.1 illustrates the semi-variance of the OLS residuals. A partial sill is reached at a range of about 7.75 kilometers. There is thus autocorrelation that the kriging model can capture, up to distances equal to the detected range. The right pane reveals that the spatial autocorrelation structure was strongly altered in the case of RF_xy’s residuals.

With only a nugget effect and a range of zero, little obvious spatial autocorrelation can be detected. Accordingly, combining kriging with RF could not improve the performance, and since RK_rf was inferior to GRF and RF_xy, it was discarded. This was further motivated by the fact that the main reason for including RK was for it to act as a benchmark model for more traditional geostatistical techniques. The RK_rf would not serve as a sensible benchmark for this, due to its experimental nature.

Figure 6.1. Sample variograms for RK and RK_rf (note that the y-axes differ). Left: Kriging of OLS residuals. Right: Kriging of RF_xy’s residuals.
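The sample variograms in Figure 6.1 can be illustrated with a minimal sketch of the classical (Matheron) estimator, which takes half the mean squared difference between residual pairs within each distance bin. The function name and the binning are hypothetical, and the software actually used to produce the figure is not specified here.

```python
import numpy as np


def sample_variogram(coords, residuals, bin_edges):
    """Semi-variance per distance bin: half the mean squared pair difference."""
    xy = np.asarray(coords, float)
    e = np.asarray(residuals, float)
    i, j = np.triu_indices(len(e), k=1)            # every pair exactly once
    dist = np.linalg.norm(xy[i] - xy[j], axis=1)
    sq_diff = (e[i] - e[j]) ** 2
    which_bin = np.digitize(dist, bin_edges) - 1   # bin index of each pair
    n_bins = len(bin_edges) - 1
    gamma = np.full(n_bins, np.nan)                # NaN marks empty bins
    for b in range(n_bins):
        in_bin = which_bin == b
        if in_bin.any():
            gamma[b] = 0.5 * sq_diff[in_bin].mean()
    return gamma
```

A curve that rises with distance and then levels off, as in the left pane, indicates autocorrelation up to the range; a flat curve, as in the right pane, corresponds to a pure nugget effect.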

Further, a new manipulated test set was introduced to investigate the behavior of the models of different spatial resolution. This test set was generated by using forecasted well-design levels for 2020, and holding the well-design variables constant at these values.

The forecasts were generated in a fairly simple manner, by fitting three linear regression models with each of the well-design variables averaged by year as the response, and the year as predictor:

$$\text{well\_design} = \beta_0 + \beta_1 \cdot \text{year} \tag{6.1}$$

The models were fitted on yearly averages from 2011 to 2018. This yielded the forecasted values for 2020, summarized in Table 6.2.

Table 6.2

Forecasted average well-design levels for year 2020.

proppant [lbs/ft]    lat_length [ft]    frac_fluid [bbl/ft]
1210                 8762               30
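The trend extrapolation in Equation 6.1 can be sketched as follows. The yearly averages below are hypothetical placeholders, not the data behind Table 6.2, and the function name is an assumption. The years are centered before fitting purely for numerical stability; this shifts the intercept but leaves the fitted line unchanged.

```python
import numpy as np


def forecast_linear_trend(years, yearly_means, target_year):
    """Fit mean = b0 + b1 * year by least squares and extrapolate."""
    years = np.asarray(years, float)
    center = years.mean()
    # np.polyfit returns coefficients from highest degree to lowest.
    slope, intercept = np.polyfit(
        years - center, np.asarray(yearly_means, float), deg=1
    )
    return float(intercept + slope * (target_year - center))


# Hypothetical yearly averages (NOT the thesis data), 2011 to 2018.
years = np.arange(2011, 2019)
lat_length_means = 4000.0 + 500.0 * (years - 2011)
forecast_2020 = forecast_linear_trend(years, lat_length_means, 2020)
```

One such model would be fitted per well-design variable, giving three forecasts for 2020.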

Replacing the well-design values in the original test set with these forecasted values created a scenario as if all these locations had been drilled with the predicted average 2020 well-design. Figure 6.2 visualizes the mean predicted production volumes for each of the models (note that the y-axis is truncated, beginning at 80,000 bbl). The figure reveals that the less spatially sophisticated models, on average, predicted substantially higher production volumes under these simulated circumstances. This suggests that these models generate more optimistic forecasts by giving the well-design variables more weight, compared to the models with higher spatial resolution. In other words, when more spatial variability is captured by the model, less weight is assigned to the well-design variables.

Figure 6.2. Predicted first-year production, using forecasted well-design levels for 2020.
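The construction of the manipulated test set can be sketched as below, assuming the test predictors are held in a NumPy array with a known column order. The column names, `fix_well_design`, `mean_prediction`, and the `predict` callable are hypothetical stand-ins for the fitted models; only the constants come from Table 6.2.

```python
import numpy as np

# Forecasted 2020 levels from Table 6.2.
DESIGN_2020 = {"proppant": 1210.0, "lat_length": 8762.0, "frac_fluid": 30.0}


def fix_well_design(X, columns, levels):
    """Return a copy of X with each well-design column set to a constant."""
    X_fixed = np.array(X, dtype=float, copy=True)
    for name, value in levels.items():
        X_fixed[:, columns.index(name)] = value
    return X_fixed


def mean_prediction(predict, X):
    """Average predicted first-year production over the test locations."""
    return float(np.mean(predict(X)))
```

Calling `mean_prediction(model_predict, fix_well_design(X_test, columns, DESIGN_2020))` once per model would give one bar per model, as in Figure 6.2. Location columns are left untouched, so any difference between models reflects how much weight each assigns to well design.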


To further investigate the different models’ behavior, another manipulated test set was constructed. Here, the well-design variables were held constant at the average levels for the first quarter of 2011, for all observations in the test set. These average levels are summarized in Table 6.3.

Table 6.3

Average well-design levels of Q1-2011.

proppant [lbs/ft]    lat_length [ft]    frac_fluid [bbl/ft]
798                  4089               24

The purpose of this was to isolate the impact of spatial effects on well productivity from that of well design, and thereby to illustrate the effect of high-grading practices. Figure 6.3 visualizes the mean predicted production volumes over time for the different models, both when well-design is allowed to vary according to the data (green curves) and when it is held constant at Q1-2011 levels (orange curves). The predictions are indexed to the first quarter of 2011.

Figure 6.3. Comparison of predictions when well-design levels are held constant (orange curves), and when allowed to vary according to the data (green curves).
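The indexing used in Figure 6.3 can be sketched as follows, assuming each test observation carries a quarter label such as "2011Q1" and that the index is expressed relative to a base value of 1. Both the label format and the base value are assumptions made for illustration, as is the function name.

```python
import numpy as np


def indexed_quarterly_means(quarters, predictions, base_quarter):
    """Mean prediction per quarter, divided by the base quarter's mean."""
    quarters = np.asarray(quarters)
    predictions = np.asarray(predictions, float)
    labels = np.unique(quarters)  # sorted, so "2011Q1" < "2011Q2" < ...
    means = np.array([predictions[quarters == q].mean() for q in labels])
    base = means[labels == base_quarter][0]
    return labels, means / base
```

Applying this once to the predictions on the original test set and once to the predictions with well-design fixed at the levels in Table 6.3 would give the green and orange curves for a given model.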


The figure reveals that, gradually, more of the improvements in terms of productivity are attributed to high-grading as the models become more spatially sophisticated. If one were to imagine a scenario where the government implemented regulations prohibiting operators from increasing well-design parameters beyond the levels summarized in Table 6.3, the orange curves would represent the different models’ projections about the development in well-productivity. The models of lower spatial resolution present a far more pessimistic projection than the models of higher spatial resolution.

As a final way of investigating the behavior of the different models, their variable importance measures were computed. As outlined in Chapter 4.2.2, the permutation importance was chosen for this study, due to its general robustness.

Strobl et al. (2008) note that interpreting variable importance scores is only sensible when the number of trees, ntree, is set sufficiently large that the importance scores do not vary systematically with different random seeds. To ensure robust results, the importance scores and their associated mean and standard deviation were computed across ten different seeds. Figure 6.4 summarizes each predictor’s average share of the total importance score, across the ten random seeds, for the different models. The standard deviations are presented underneath each of the fractions and imply robust results.

Figure 6.4. Mean relative importance of the variables in explaining the variation in production volumes.
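The computation of importance shares across seeds can be sketched as below. This is a simplified illustration rather than the procedure used in this study: the model is represented by a generic `predict` callable that is held fixed, so the seed only controls the permutations, whereas in the study the seed would also affect the fitted forest. Clipping negative scores to zero before normalizing is a further assumption, and the function names are hypothetical.

```python
import numpy as np


def permutation_importance(predict, X, y, seed):
    """Increase in MSE when each column is shuffled, one column at a time."""
    rng = np.random.default_rng(seed)
    base_mse = np.mean((y - predict(X)) ** 2)
    scores = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        X_perm = X.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])
        scores[j] = np.mean((y - predict(X_perm)) ** 2) - base_mse
    return scores


def importance_shares(predict, X, y, seeds):
    """Mean and standard deviation of each predictor's share across seeds."""
    shares = []
    for seed in seeds:
        # Assumption: negative scores are treated as zero importance.
        scores = np.clip(permutation_importance(predict, X, y, seed), 0.0, None)
        shares.append(scores / scores.sum())  # requires at least one positive score
    shares = np.array(shares)
    return shares.mean(axis=0), shares.std(axis=0)
```

Small standard deviations relative to the mean shares would correspond to the robustness reported in Figure 6.4.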


As also mentioned in Chapter 4.2.2, importance measures may be biased if predictors are highly correlated, since the importance then gets spread over more than one predictor. In this case, the longitude and latitude variables were quite strongly correlated, with a correlation coefficient of 0.74. Their importance scores were therefore summed into a single score, represented as location in Figure 6.4. For the GRF, the fractions are calculated as the weighted sum of the importance scores from the global and local models. Since the models were weighted 50/50 when fusing the predictions, their importance scores were also weighted equally.
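The two aggregation steps described above amount to simple arithmetic, sketched here with hypothetical predictor names and function names. The equal weighting of the global and local GRF models follows the text; everything else is illustrative.

```python
import numpy as np


def combine_location(shares, names, parts=("longitude", "latitude")):
    """Sum the shares of the coordinate predictors into one 'location' entry."""
    combined = {n: float(s) for n, s in zip(names, shares) if n not in parts}
    combined["location"] = float(
        sum(s for n, s in zip(names, shares) if n in parts)
    )
    return combined


def fuse_grf_shares(global_shares, local_shares, local_weight=0.5):
    """Weighted average of global and local importance shares."""
    g = np.asarray(global_shares, float)
    l = np.asarray(local_shares, float)
    return (1.0 - local_weight) * g + local_weight * l
```

Because both inputs to `fuse_grf_shares` sum to one, the fused shares also sum to one, so the GRF fractions remain comparable with those of the other models.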

Figure 6.4 reveals how the importance of the different well-design variables gradually decreases as the models take location into account in increasingly sophisticated ways. Naturally, the non-spatial model can only attribute importance to the well-design variables, and as presented in Table 6.1, its performance is relatively poor. With increasing spatial resolution, more importance is attributed to location, and this is accompanied by improved performance across all reported metrics, as revealed in Table 6.1. Again, the results imply that the impact of well-design may easily be overestimated when location is not controlled for, or is controlled for inadequately. This is especially true for lateral length, whose average share of importance drops from 0.56 to 0.32 across the spectrum of spatial resolution. Simultaneously, the importance of location increases from 0 to 0.37. This implies that high-grading of geological conditions is an important factor in explaining the variations in well-productivity observed in the Niobrara through the last decade. It should be mentioned that "importance" in this case only implies that permuting a predictor’s values leads to increased model error (Molnar, 2019), and does not directly imply causality.

The fact that location was attributed slightly short of 40% of the importance in explaining the variation in first-year production volumes suggests that a substantial share of the productivity gains from recent years are not directly transferable to new locations. This highlights the importance of selecting favorable locations for future drilling. This, and similar findings from previous research, motivated the second part of the analysis, which involved predicting the prospectivity of undrilled locations with the help of geological variables.