
PART VI. RESEARCH QUESTIONS AND METHODOLOGY

10. METHODOLOGY

10.4 Considerations in Choices of Statistical Techniques

This section discusses the statistical techniques applied in this dissertation to establish statistical conclusion validity. According to Cook and Campbell (1979, p. 39), three decisions about covariation have to be made: (1) Is the study sensitive enough to permit reasonable statements about covariation? (2) If it is sensitive enough, is there reasonable evidence from which to infer that the presumed cause and effect covary? And (3) if there is such evidence, how strongly do the two variables covary? The first of these issues concerns statistical power, which is covered in section 10.4.4 and appendix F. This covers both the analysis of the sample size required to detect an effect of a desired magnitude and the computation of the magnitude of effects that could reasonably have been detected in this study. According to Cook and Campbell, “Power analyses are desirable in any report of a study where the major research conclusion is that one variable does not cause another” (1979, p. 13). The most important threats to statistical conclusion validity outlined and covered in this section are (1) low statistical power and (2) violated assumptions of statistical tests. Another classical threat concerns the reliability of measures, which was covered in section 10.3. Three additional threats (the reliability of treatment implementation, random heterogeneity of respondents, and random irrelevancies in the experimental setting) were natural parts of the considerations described in section 10.2, where the design was discussed and outlined.

The guiding principle has been to select the simplest statistical technique that would provide a reasonably valid test of the research questions in accordance with the chosen design. This approach, however, requires that these considerations be examined in detail and dealt with where possible. I will now outline the most important considerations that concern this dissertation, covering t-tests, one-way analysis of variance (ANOVA), and multiple regression. Finally, statistical power and effect size will be discussed.

10.4.1 T-tests

The technique most commonly applied to the chosen design is the t-test. T-tests are used to compare mean scores on a continuous variable, such as before and after a leadership development program. Two main types of t-tests are used in this dissertation. The paired sample t-test, or repeated measures t-test, involves two conditions in which the same subjects participate. We measure the subjects’ behavior in condition 1 and in condition 2. If there is an experimental manipulation after the first condition, for example attending the leadership development program at the RNoNA, we would expect a person’s behavior to be different in condition 2. The difference between conditions 1 and 2 is the manipulation, in this case the leadership development program. Therefore, any difference between the means of the two conditions is probably due to the leadership development program, provided the performance measure is reliable. The samples are ‘related’ because the same people are tested each time. In a repeated measures design, differences between the two conditions can be caused by two things: (1) the manipulation that was carried out on the subjects, or (2) any other factor that might affect the way a person performs from one time to the next. The latter factor is likely to be fairly minor compared to the influence of the experimental manipulation (Field, 2004).
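To make this concrete, the following is a minimal sketch of a paired sample t-test in Python using SciPy; the scores are hypothetical illustrations, not the dissertation’s data:

```python
import numpy as np
from scipy import stats

# Hypothetical pre/post scores from the same six subjects.
pre = np.array([4.1, 3.8, 4.5, 3.9, 4.2, 4.0])   # condition 1: before the program
post = np.array([4.4, 4.0, 4.6, 4.3, 4.5, 4.1])  # condition 2: after the program

# Paired (repeated measures) t-test: the same people are tested each time.
t_stat, p_value = stats.ttest_rel(pre, post)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```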

Independent sample t-tests are used when there are two different (independent) groups of people whose scores we want to compare, again leaving us with two conditions. In this case, we examine how the two groups differ on a given occasion. In an independent design, differences between the two conditions can have one of two causes: (a) the manipulation that was carried out on the subjects or (b) differences between the characteristics of the people allocated to each of the groups. The latter factor in this instance is likely to create considerable random variation both within each condition and between them.
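The independent case is analogous; a minimal sketch with two hypothetical groups:

```python
import numpy as np
from scipy import stats

# Two separate (independent) groups measured on one occasion; values are illustrative.
group_a = np.array([3.9, 4.2, 4.0, 4.4, 3.7, 4.1])
group_b = np.array([4.3, 4.5, 4.1, 4.6, 4.4, 4.2])

# Independent sample t-test; SciPy assumes equal variances by default.
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```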

The paired sample t-test compares sample means, which is adequate for measuring the effect of the leadership development program. However, this might give a misleading picture, especially when it comes to the effect of the larger program. Therefore, an additional analysis was performed: the Reliable Change Index (RCI) (Christensen & Mendoza, 1986; Jacobson & Truax, 1991; Ogles, Lambert, & Masters, 1996).

The Reliable Change Index

This section outlines the Reliable Change Index (RCI). “Individual differences in change” refers to the magnitude of increase or decrease exhibited by each individual over the duration of the study on any given trait. Individual differences in change can be, and often are, unrelated to population indices of change. A given population may demonstrate robust individual differences in change while showing no mean-level changes at all. There can also be meaningful individual-level change even when there is substantial differential consistency at the population level (Kohn, 1980; Roberts & Chapman, 2000). One might find that a large proportion of the population increases substantially whereas an equally large proportion decreases substantially, so that the subgroups effectively cancel each other out, resulting in no change at the population level. An example from this dissertation, shown in Table 11.5, illustrates this: the paired sample t-test on Self-sacrificing yields t(72) = .416, p = .670, indicating no change at all. The corresponding RCI analysis (see Table 11.6), however, shows a significant 11% decrease and an 11% increase, while 78% of the cadets stayed the same.

The RCI was calculated following the suggestions outlined by Christensen and Mendoza (1986), Jacobson and Truax (1991), and Ogles, Lambert, and Masters (1996); the details are given in appendix F. By applying the RCI, it is possible to classify each cadet as having decreased, increased, or stayed the same on the SPGR 12-vector measures and the NEO PI-R as a result of the leadership development program at the RNoNA.
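As an illustration, here is a minimal sketch of the standard Jacobson and Truax (1991) form of the RCI; the reliability value and scores are hypothetical, and the exact computational choices used in this dissertation are those detailed in appendix F:

```python
import numpy as np

def reliable_change(pre, post, reliability, criterion=1.96):
    """Classify each case as 'decreased', 'same', or 'increased'."""
    pre, post = np.asarray(pre, float), np.asarray(post, float)
    # Standard error of measurement from the pretest SD and test reliability.
    se_measurement = pre.std(ddof=1) * np.sqrt(1.0 - reliability)
    # Standard error of the difference between two test scores.
    s_diff = np.sqrt(2.0 * se_measurement ** 2)
    rci = (post - pre) / s_diff
    labels = np.where(rci <= -criterion, "decreased",
             np.where(rci >= criterion, "increased", "same"))
    return rci, labels

# Hypothetical pre/post scores with an assumed reliability of .80.
rci, labels = reliable_change([4.1, 3.8, 4.5, 3.9], [4.4, 2.9, 4.6, 4.8],
                              reliability=0.80)
print(labels)  # tally these to get % decreased / increased / stayed the same
```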

10.4.2 One-Way Analysis of Variance

One-Way Analysis of Variance (ANOVA) was also used as a statistical technique. Here, two or more groups are compared on a continuous variable. The ANOVA produces an F-statistic, or F-ratio, which is similar to the t-statistic in that it compares the amount of systematic variance in the data to the amount of unsystematic variance.

The ANOVA tells whether the groups differ, but not where the significant difference lies. It is therefore necessary, after conducting an ANOVA, to carry out further analyses to find out which groups differ. There are two options: planned comparisons and post hoc comparisons (Field, 2004). The difference between planned comparisons and post hoc tests can be linked to the difference between one- and two-tailed tests in that planned comparisons are done when we have specific hypotheses to test, whereas post hoc tests are done when we have no specific hypotheses. Because of the exploratory approach in answering research question 4, post hoc tests will be applied; the considerations behind this choice are outlined in appendix F.
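A minimal sketch of this two-step logic, using a simple Bonferroni correction as the post hoc procedure (the dissertation’s specific post hoc choice is the one discussed in appendix F; the cohort data here are hypothetical):

```python
from itertools import combinations
from scipy import stats

# Hypothetical scores for three groups.
groups = {
    "cohort_A": [4.1, 3.9, 4.3, 4.0],
    "cohort_B": [4.5, 4.6, 4.4, 4.7],
    "cohort_C": [3.8, 4.0, 3.7, 3.9],
}

# Omnibus one-way ANOVA: do the groups differ at all?
f_stat, p_value = stats.f_oneway(*groups.values())
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")

# Post hoc: pairwise t-tests, Bonferroni-corrected for the number of comparisons.
pairs = list(combinations(groups, 2))
for a, b in pairs:
    t, p = stats.ttest_ind(groups[a], groups[b])
    print(a, "vs", b, f"corrected p = {min(p * len(pairs), 1.0):.3f}")
```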

10.4.3 Multiple Regression

Multiple regression was applied to address the question concerning the types of leadership behavior the RNoNA rewards through its use of the MD grade. Similar analyses were performed for the NEO PI-R data as well. There are several important assumptions concerning this statistical technique: sample size, multicollinearity and singularity, outliers, normality, linearity, homoscedasticity, and independence of residuals (Tabachnick & Fidell, 2001). Of these, only sample size will be discussed here, because sample size strongly influences the ability to generalize: with small samples, the results may not generalize to other samples. Stevens (2002) recommends about 15 subjects per predictor for reliable equations. Tabachnick and Fidell (2001) give the following formula for calculating sample size requirements, taking into account the number of independent variables: N > 50 + 8m, where m is the number of independent variables. Because of these considerations, regression analyses will only be performed for all four cohorts together.
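For illustration, a small sketch that evaluates both guidelines; combining them by taking the stricter of the two is my own illustrative choice, not a rule from either source:

```python
def required_n(m):
    """Minimum N for m predictors: the stricter of Tabachnick & Fidell's
    N > 50 + 8m and Stevens' 15-subjects-per-predictor guideline."""
    return max(50 + 8 * m, 15 * m)

for m in (3, 5, 10):
    print(f"{m} predictors -> at least {required_n(m)} subjects")
```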

10.4.4 Statistical Power and Effect Size

Statistical significance is one of two pillars upon which the process of accepting or rejecting scientific hypotheses rests. The other pillar is statistical power: the probability that statistical significance will be obtained, a probability determined by the size of the effect the experiment is most likely to produce. Experiments must be designed with sufficient power to detect the intervention’s true effect size (ES). Otherwise, statistical significance will not be obtained once the data are collected, and the intervention will be declared ineffective, although a clinically relevant difference might actually have occurred as a result of the intervention.

Statistical power is computed before a study’s final data are collected and estimates how likely the study is to yield a statistically significant result.

“Power” is the probability of obtaining statistical significance in a properly run study when the hypothesized ES is correct, where ES is the standardized measure of the size of the mean difference(s) among the study’s groups or the strength of the relationship(s) among its variables (Bausell & Li, 2002). The primary purpose of a power analysis with a fixed alpha level is to estimate one of the following three parameters: (a) the number of subjects needed, (b) the maximum detectable effect size, or (c) the power available at the design phase. Two of the parameters involved, the acceptable level of power and the significance criterion (alpha), are usually set by convention: almost without exception, the alpha level is set at p ≤ 0.05, and the minimum acceptable power level is most often considered to be 0.80 (Bausell & Li, 2002). A power level of 0.80, as suggested, means that if everything goes as planned, the experiment has an 80% chance of achieving statistical significance and a 20% chance of not doing so.
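A minimal sketch of such a power computation for an independent sample t-test, using the noncentral t distribution rather than the tables in Bausell and Li (2002); it reproduces the conventional figure that roughly 64 subjects per group give power 0.80 for an ES of 0.50 at alpha = 0.05:

```python
from scipy import stats

def power_two_sample(d, n_per_group, alpha=0.05):
    """Power of a two-tailed independent t-test for effect size d."""
    df = 2 * n_per_group - 2
    ncp = d * (n_per_group / 2) ** 0.5          # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)     # two-tailed critical value
    return (1 - stats.nct.cdf(t_crit, df, ncp)
            + stats.nct.cdf(-t_crit, df, ncp))

print(f"power = {power_two_sample(0.50, 64):.3f}")  # about 0.80
```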

Choosing an independent sample t-test with a power level of 0.80 and an alpha level of 0.05 requires an N per group of 64 for a hypothesized ES of 0.50. The paired sample t-test normally yields considerably more power than an independent sample t-test, especially when the correlation between the two paired sets of numbers is relatively high. Cohort 2000 was the first cohort in which the SPGR was tried. Because these data were available, it was possible to derive the Pearson r; the average Pearson’s r for the SPGR Humres was r = .58.

By employing the tables in Bausell and Li (2002), it was estimated that 28 participants would be needed to detect an ES of 0.50 between the pre- and post-intervention means, assuming the measures were correlated at 0.60. Should the correlation be as low as 0.50, 34 participants would be needed. This indicates that there would be enough statistical power for measuring the effect of the leadership development program with the SPGR at the team level, because this is a 360-degree measure: a team consisting of six members yields a total of 36 ratings, which is comparable to 36 participants, more than the number required.
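The logic behind those table values can be sketched directly: the pre/post correlation r converts the between-groups ES into a within-subject ES, d_z = d / √(2(1 − r)), and the smallest n reaching the target power is found by search. This is my reconstruction of the standard computation, not Bausell and Li’s own procedure, and it reproduces their estimates approximately:

```python
from scipy import stats

def paired_n_needed(d, r, power=0.80, alpha=0.05):
    """Smallest n for a two-tailed paired t-test to reach the target power."""
    d_z = d / (2 * (1 - r)) ** 0.5              # within-subject effect size
    for n in range(5, 1000):
        df, ncp = n - 1, d_z * n ** 0.5
        t_crit = stats.t.ppf(1 - alpha / 2, df)
        achieved = (1 - stats.nct.cdf(t_crit, df, ncp)
                    + stats.nct.cdf(-t_crit, df, ncp))
        if achieved >= power:
            return n

print(paired_n_needed(0.50, 0.60))  # approximately 28
print(paired_n_needed(0.50, 0.50))  # approximately 34
```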

The effect size statistic indicates the relative magnitude of the differences between means. It describes the “amount of total variance in the dependent variable that is predictable from the knowledge of the independent variable” (Tabachnick & Fidell, 2001, p. 52). This measure is important because with large samples, even small differences between groups can become statistically significant. In such a case, however, a statistically significant result may not have any practical or theoretical significance because of a high N and low ES.

There are a number of different ES statistics, the most common of which are eta squared (η²) and Cohen’s d; these are outlined in appendix F. I will calculate Cohen’s d for paired sample t-tests according to the formula provided by Dunlap et al. (1996). The discussion in appendix F reveals that there are several approaches to calculating Cohen’s d for the ES of an independent t-test: (a) using means and standard deviations, (b) using t values and df for a separate-groups t-test with equal n in each group, and (c) using t values and df with unequal n’s in each group. Because a gold standard has not been established (Van Etten & Taylor, 1998), I calculate and report Cohen’s d using the t-test value and the degrees of freedom when these parameters are available. Otherwise, it will be calculated from means and standard deviations, and any deviations from this will be footnoted.

This calculation will also be checked against the η² statistic for independent t-tests.
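To illustrate, a minimal sketch of these ES computations, assuming the standard formulas (d = 2t/√df for equal-n independent t-tests, the Dunlap et al. (1996) formula for correlated designs, and η² = t²/(t² + df)); the values passed in are hypothetical:

```python
def cohens_d_equal_n(t, df):
    """Cohen's d from an independent t-test with equal n: d = 2t / sqrt(df)."""
    return 2 * t / df ** 0.5

def cohens_d_paired(t, r, n):
    """Dunlap et al. (1996) for correlated designs: d = t * sqrt(2(1 - r) / n)."""
    return t * (2 * (1 - r) / n) ** 0.5

def eta_squared(t, df):
    """Proportion of total variance explained: eta^2 = t^2 / (t^2 + df)."""
    return t ** 2 / (t ** 2 + df)

# Hypothetical t = 2.5 with df = 126 (i.e., 64 per group).
print(f"d = {cohens_d_equal_n(2.5, 126):.3f}, eta^2 = {eta_squared(2.5, 126):.3f}")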