8.7 Influential cases

One question that needs to be considered is whether the model fits the observed data well or is influenced by a small number of cases (Field 2005, p. 162). The identification of influential cases is an important step in interpreting the results of regression analysis (Hair et al. 1996, p. 236).

Therefore, in the final analysis an attempt has been made to identify influential observations, that is, observations that have a disproportionate impact on the regression results (Field 2005, p. 164).

If it is ascertained that a case does not represent the general population, it should be excluded. If a sample contains one or more cases that are not representative, the achievement of generalisable results is hindered (Hair et al. 1996, p. 235). Cases that differ from the main trend of the data can cause the model to be biased because they affect the values of the estimated regression coefficients (Field 2005, p. 162; Hair et al. 1996, p. 205).

Diagnostic statistics should, however, not be used as a way of attaining some desirable change in the regression parameters (e.g. deleting cases that turn a non-significant coefficient into a significant one) (Field 2005, p. 169, referring to Stevens 1992). Checking for outliers and influential cases should be done with the purpose of finding out whether the model fits the sample well or is influenced by a small number of cases (Field 2005, p. 162).

Discretion has to be used when excluding cases identified as influential; otherwise good results are almost guaranteed if the data set is trimmed uncritically, since there are always outliers in any population (Field 2005; Hair et al. 1996, pp. 236-37). The process of identifying potentially influential cases involves judgement and trade-offs, and it is not evident in all situations which cases to delete. The diagnostic statistics put forward by Hair et al. (1996, pp. 221-237) have been used to guide the process of identifying potentially influential cases.

As a starting point, residuals were investigated to identify outliers (Field 2005, p. 162; Hair et al. 1996, p. 206). Residuals are the differences between the values of the outcome predicted by the model and the values of the outcome observed in the sample; a large residual indicates that a case is an outlier (Field 2005, p. 163; Hair et al. 1996, p. 222). A frequently used residual for identifying outliers is the studentized residual (Hair et al. 1996, p. 226), which was computed in SPSS 14.0. This residual corresponds to t values, and the upper and lower limits were set at the 95 percent confidence interval, t = ± 1.96; statistically significant residuals are those falling outside these limits (Hair et al. 1996, p. 226).

Another residual is the studentized deleted residual, which helps identify a single case’s impact on the regression results (Field 2005, p. 165; Hair et al. 1996, p. 229). Also in this case, statistically significant residuals are those falling outside the limits described above.
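The study computed these residuals in SPSS 14.0; as a rough illustration of the same check, the sketch below uses Python with statsmodels and a made-up data set (the data frame, the outcome "y" and the predictors "x1"-"x3" are placeholders, not the study's variables).

```python
# Illustrative sketch (not the study's SPSS procedure): flagging outliers with
# studentized and studentized deleted residuals at the 95 percent limits.
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Placeholder data: 'y' is the outcome, 'x1'-'x3' stand in for the predictors.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(60, 4)), columns=["y", "x1", "x2", "x3"])

X = sm.add_constant(df[["x1", "x2", "x3"]])
influence = sm.OLS(df["y"], X).fit().get_influence()

student = influence.resid_studentized_internal   # studentized residuals
deleted = influence.resid_studentized_external   # studentized deleted residuals

# Cases falling outside t = +/- 1.96 are treated as statistically significant.
outliers = df.index[(np.abs(student) > 1.96) | (np.abs(deleted) > 1.96)]
print("Potential outliers:", list(outliers))
```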

Three further statistics were used to evaluate the influence of a particular case: Cook’s distance, standardized DFBeta, and the covariance ratio (CVR).

Cook’s distance is a measure of the overall influence of a case on the model. It captures the influence of a case from two sources: the size of the changes in the predicted values when the case is excluded (outlying studentized residuals) and the case’s distance from the other cases (leverage). The rule of thumb is that values greater than 1 may be a cause for concern (Field 2005, p. 165; Hair et al. 1996, p. 225). A more conservative threshold is 4/(n-k-1), where n is the number of cases and k is the number of independent variables; this conservative threshold was used in this study. Even if none of the cases exceeds the threshold, attention should be paid to cases with substantially higher values than the remaining cases (Hair et al. 1996, p. 225).
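A minimal sketch of this conservative cut-off, again assuming an illustrative statsmodels model on placeholder data:

```python
# Illustrative sketch: Cook's distance against the conservative cut-off 4/(n-k-1).
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(60, 4)), columns=["y", "x1", "x2", "x3"])

X = sm.add_constant(df[["x1", "x2", "x3"]])
model = sm.OLS(df["y"], X).fit()

cooks_d = model.get_influence().cooks_distance[0]   # one distance per case
n, k = len(df), 3                                    # n cases, k predictors
threshold = 4 / (n - k - 1)                          # conservative cut-off

flagged = df.index[cooks_d > threshold]
print("Cases exceeding 4/(n-k-1):", list(flagged))
```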

The impact of deleting a single case on each regression coefficient is described by DFBeta, which is the relative change in the coefficient when the case is deleted. DFBeta is calculated for every case and for each of the parameters in the model. Standardized DFBetas have been applied, and absolute values above 1 indicate cases that considerably influence the model parameters (Hair et al. 1996, p. 225).
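A corresponding sketch for the standardized DFBetas, with the same placeholder model:

```python
# Illustrative sketch: standardized DFBetas, flagging absolute values above 1.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(60, 4)), columns=["y", "x1", "x2", "x3"])

X = sm.add_constant(df[["x1", "x2", "x3"]])
dfbetas = sm.OLS(df["y"], X).fit().get_influence().dfbetas

# One standardized value per case and per parameter (including the intercept);
# a case is flagged if it shifts any coefficient by more than one standard error.
flagged = df.index[(np.abs(dfbetas) > 1).any(axis=1)]
print("Cases with |standardized DFBeta| > 1:", list(flagged))
```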

CVR estimates the effect of a case on the efficiency of the estimation process. It measures whether a case influences the variance of the regression parameters and considers all parameters collectively (Hair et al. 1996, p. 225). When this ratio is close to 1, the case has very little influence on the variances of the model parameters (Field 2005, pp. 166-67). CVR may act as an indicator of cases that substantially influence the coefficients, both positively and negatively (Hair et al. 1996, p. 225). A threshold can be established at 1 ± [3(k+1)/n]. Values above 1 + [3(k+1)/n] make the estimation process more efficient, which means that deleting the case will damage the precision of some of the model’s parameters. Values below 1 - [3(k+1)/n] detract from the estimation efficiency, and deleting the case will improve the precision of some of the model’s parameters.

A fourth measure of overall fit is the standardized DFFIT (SDFFIT), which reflects the extent to which the fitted values change when the case is deleted. A recommended cut-off value is 2√[(k+1)/(n-k-1)] (Hair et al. 1996, p. 225; Field 2005, p. 167).
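The two thresholds can be sketched as follows, again with statsmodels on placeholder data (the variable names are illustrative only):

```python
# Illustrative sketch: covariance ratio (CVR) and standardized DFFIT (SDFFIT).
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(60, 4)), columns=["y", "x1", "x2", "x3"])

X = sm.add_constant(df[["x1", "x2", "x3"]])
influence = sm.OLS(df["y"], X).fit().get_influence()

n, k = len(df), 3
cvr = influence.cov_ratio        # covariance ratio per case
sdffit = influence.dffits[0]     # standardized DFFIT per case

# CVR outside 1 +/- 3(k+1)/n signals influence on the parameter variances.
cvr_flag = df.index[np.abs(cvr - 1) > 3 * (k + 1) / n]

# SDFFIT beyond 2*sqrt((k+1)/(n-k-1)) signals a large shift in the fitted values.
sdffit_flag = df.index[np.abs(sdffit) > 2 * np.sqrt((k + 1) / (n - k - 1))]

print("CVR-flagged cases:", list(cvr_flag))
print("SDFFIT-flagged cases:", list(sdffit_flag))
```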

Cook’s distance and SDFFIT are measures of overall fit. They should, however, be complemented by an examination of residuals, as explained above, and by an examination of leverage points. One measure, the hat value, has been applied to identify leverage points.

For each case, the leverage is a measure of the distance of the case from the mean centre of all other cases on the independent variables. In addition, large hat values indicate that the case carries a disproportionate weight in determining its predicted value of the dependent variable, thereby minimising its residual. This is an indication of influence, because the regression line must move closer to this case for the small residual to occur. The average leverage value is defined as (k+1)/n. If cases exert no influence, leverage values are expected to be close to this average. Some recommend looking for values twice the average, others for values three times the average (Field 2005, p. 165; Hair et al. 1996, p. 224). In this study, values greater than three times the average were used to identify influential cases.
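A sketch of this leverage check, assuming the same illustrative statsmodels setup:

```python
# Illustrative sketch: hat (leverage) values against three times the average leverage.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(60, 4)), columns=["y", "x1", "x2", "x3"])

X = sm.add_constant(df[["x1", "x2", "x3"]])
leverage = sm.OLS(df["y"], X).fit().get_influence().hat_matrix_diag

n, k = len(df), 3
average_leverage = (k + 1) / n   # expected leverage when no case dominates

# The criterion described above: flag cases above three times the average leverage.
flagged = df.index[leverage > 3 * average_leverage]
print("High-leverage cases:", list(flagged))
```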

Residuals and leverage points help identify potentially influential cases.

Often, an influential case will not be identified as an outlier because it has influenced the regression estimation to such an extent that its residual becomes small (Field 2005, pp. 167-68; Hair et al. 1996, pp. 221-22). The different diagnostics therefore need to be combined to identify influential cases.

The diagnostic measures that were used to identify outliers and influential cases are described in Table 8-23.

Table 8-23: Diagnostic tests for influential cases

Diagnostic measure                  Threshold value specification
Studentized residual                ± 1.96 (95 percent confidence interval)
Studentized deleted residual        ± 1.96 (95 percent confidence interval)
Cook’s distance                     4/(n-k-1)
Standardized DFBeta                 absolute value > 1
Covariance ratio (CVR)              1 ± [3(k+1)/n]
Standardized DFFIT (SDFFIT)         2√[(k+1)/(n-k-1)]
Leverage (hat value)                3(k+1)/n (three times the average leverage)

Note: values calculated for each separate model.

There is no clear-cut method for identifying influential cases or for deciding what action to take when influential cases have been identified. A rule of thumb is that if the diagnostic measures show that a case is unrepresentative of the general trend, it should be eliminated. The method applied here was to identify those cases that were consistently flagged by the diagnostic analysis; those cases are likely to have the most impact on improving the regression equation (Hair et al. 1996, pp. 234-36). The results attained after potentially influential cases had been excluded and new models had been estimated were compared with the models estimated in previous sections in three areas (Hair et al. 1996, p. 236): the overall prediction (R2), the standard error of the estimate, and the statistical significance of the coefficients.

The standard error of the estimate is one measure of predictive accuracy; it measures the variation around the regression line (Hair et al. 1996, p. 199).
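A sketch of this comparison step, assuming an illustrative statsmodels model and an arbitrary placeholder list of flagged cases (the study's actual models and cases are not reproduced here):

```python
# Illustrative sketch: re-estimate the model without the consistently flagged
# cases and compare R2, the standard error of the estimate and the coefficients'
# significance with the original model.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(60, 4)), columns=["y", "x1", "x2", "x3"])

def fit(data):
    X = sm.add_constant(data[["x1", "x2", "x3"]])
    return sm.OLS(data["y"], X).fit()

full = fit(df)

# 'flagged' would hold the cases consistently identified by the diagnostics;
# the two indices below are arbitrary placeholders.
flagged = [5, 17]
trimmed = fit(df.drop(index=flagged))

for label, m in [("all cases", full), ("influential cases excluded", trimmed)]:
    se_estimate = np.sqrt(m.mse_resid)   # standard error of the estimate
    print(f"{label}: R2 = {m.rsquared:.3f}, SE of estimate = {se_estimate:.3f}")
    print(m.pvalues.round(3))            # significance of each coefficient
```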

The direction of the effects, and thus the main findings, remained the same for all regression parameters after the cases identified as potentially influential had been excluded and new models estimated. Most of the relationships that had already proved to be significant were strengthened, including those that were significant at the p < .10 level. To conclude, none of the cases was judged to deviate seriously from the main trend of the data.