4 Methods and Research Design

4.5 Research Credibility

compare the three programs through a regular ANOVA. In Article II, Levene's test showed that the variances were not equal for item 2E, and we therefore used Welch's F for the overall comparison and Games-Howell as a post hoc test. We replaced missing data with the series mean (Dong & Peng, 2013). Missing value analysis indicated that no item had 5% or more missing cases. Item 1c had the highest percentage of missing data (1.5%), and items 1g and 1i had the lowest (0.4%).
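The two computational steps named in this paragraph, series-mean replacement and Welch's F, can be illustrated with a minimal sketch. It is not the analysis script used in the study. The function names, the use of `None` to mark a missing response, and the sample responses are all hypothetical. The sketch returns only the test statistic and its degrees of freedom; the p-value and the Games-Howell comparisons are not shown.

```python
from statistics import mean, variance


def missing_share(values):
    """Return the fraction of entries that are missing (marked as None)."""
    return sum(v is None for v in values) / len(values)


def impute_series_mean(values):
    """Replace each missing entry with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def welch_f(groups):
    """Return Welch's F statistic and its two degrees of freedom.

    Each group needs at least two observations and a non-zero variance.
    """
    k = len(groups)
    sizes = [len(g) for g in groups]
    means = [mean(g) for g in groups]
    # Weight each group by n / s^2, so groups with larger variance count for less.
    weights = [n / variance(g) for n, g in zip(sizes, groups)]
    total = sum(weights)
    weighted_mean = sum(w * m for w, m in zip(weights, means)) / total
    between = sum(w * (m - weighted_mean) ** 2
                  for w, m in zip(weights, means)) / (k - 1)
    lam = sum((1 - w / total) ** 2 / (n - 1) for w, n in zip(weights, sizes))
    f_stat = between / (1 + 2 * (k - 2) / (k ** 2 - 1) * lam)
    df1 = k - 1
    df2 = (k ** 2 - 1) / (3 * lam)
    return f_stat, df1, df2


# Hypothetical responses to one survey item; None marks a missing answer.
item = [4, None, 5, 3]
print(missing_share(item))
print(impute_series_mean(item))

# Hypothetical item scores for three programs.
programs = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
print(welch_f(programs))
```

Unlike the regular ANOVA, Welch's F does not pool the group variances, which is why it remains usable when Levene's test rejects equal variances.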

interpretation” (Ary et al., 2010, p. 452). One might argue that, because of its nature, qualitative research should start from the data, rather than from a hypothesis or a conceptual framework like ours (Hammersley, 2008). However, because it is impossible to do research without prior assumptions, any observation is theory-laden to some extent (Hanson, 1958; Kvale & Brinkmann, 2009; Silverman, 2006). Therefore, an important question is the degree to which a research approach described as open and inductive is indeed so. Hammersley (2010) argued that categorizations consist of concepts derived from literature on previous research (theory), from ordinary experience of the world, or through the process of abduction, and will never be derived entirely from the data. Openness about where these categories come from is therefore important, no matter the research approach. Some have criticized the traditional qualitative research approach for not being explicit enough about methods used, data collection, operationalizations, and categorizations (Hammersley, 2008, 2010; Silverman, 2006). Even though theoretically driven categorization carries a risk of reductionism, it is also a way to make the research more transparent and explicit (Creswell, 2013).

Indeed, research implies reducing complexity and systematizing the real world to some extent, but discussing the limitations of this reduction is important. Overall, our categories have been considered through rational assessment and empirical inquiry (Hammersley, 2010; Kleven, 2008). For instance, although we were interested in the dimension “plan for teaching,” lectures about planning (e.g., about learning goals or generative questions) were not included in our description of the dimension and were therefore excluded from our analysis. We wanted to capture the instances where the candidates had opportunities to actually plan for teaching, not only to learn about planning for teaching. Further, after a pilot study, we revised our framework by adding dimension eight to include national and state context and curriculum.

During the main analysis, I looked for instances that ‘fit’ with the dimensions we had identified as well as experiences that did not reflect the dimensions, and we altered the descriptions of the dimensions to ground them in empirical evidence (cf. Darling-Hammond et al., 2010). This led to rounds of re-analysis of the data. For example, we encountered a situation where one course scored low on all dimensions. We re-examined these observation notes to identify other dimensions that could have led to a modification of our analytical framework. We found that this course was characterized by a teacher-led classroom discussion based on a prepared lecture on subject didactical theory displayed on PowerPoint, which gave us no reason to develop new dimensions in our framework.

In addition to examining construct validity and reductionism, we can employ several validation strategies to enhance the quality of qualitative research (Creswell, 2013). One common strategy is triangulation, which I have used in different ways. First, this thesis consists of multiple cases that I looked across to strengthen or weaken the findings. Second, the study drew from multiple data sources. I used survey data in Articles II and III to support my analysis. In Article I, the findings were corroborated by findings in other publications within the CATE study, building on other data sources. Third, I used the same analytical framework to look across these data sources. The scoring of the dimensions in Article I led to the focus dimensions in Articles II and III.

Member checks can also enhance validity of research findings (Creswell, 2013; Hammersley, 2010). Representatives from five of the six programs offered feedback on presentations of findings and on drafts of the articles in this thesis. For example, I discussed my early findings with faculty at UCSB and Stanford. Similarly, we presented these findings to our informants and additional faculty in Oslo on several occasions, and we engaged in new rounds of data collection to continue the development of the program.

Finally, we have subjected research instruments, methodology, and article drafts to expert validation and peer reviews in formal and informal situations, such as conferences, research groups, journal reviews, and the national graduate school of education. My research has also been subject to institutionalized external audits throughout the process of my thesis (Creswell, 2013; Hammersley, 2010).

In summary, in this chapter I have referred to the aims of the study in order to be transparent about the quality of the methodological choices I have made, as validity addresses whether the methods used are suitable to answer the research questions (Creswell, 2013). I believe this transparency conveys that the validity of my inferences should be satisfactory. It is important to take the exploratory nature of this study into account, as the framework and instruments used within this study are under development, and this study serves as a first step in trying out these measures. In the following section, I relate the validity of this study to its reliability.

4.5.2 Reliability

Throughout this thesis, I have offered transparent, detailed information on my research, which is important for the ability to replicate research and to strengthen the reliability and validity of my findings (Creswell, 2013). As discussed above, the definitions of the dimensions in our conceptual framework are important. Additionally, the scores in this study were based upon the coding book developed within the CATE study. Such an instrument can be subject to random measurement errors, which is a threat to the measurement’s reliability (Cronbach, 1975). We thus conducted double coding of 8.7% of our data material to calibrate the scoring. The strength of agreement was “good” (Fleiss, Levin, & Paik, 2003), with Kappa = 0.66. After inter-rater reliability was established, the first author coded all lessons and picked excerpts from the data to illustrate the characteristics of a higher score of the dimensions.
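For readers unfamiliar with the statistic, the sketch below shows how an agreement coefficient of this kind can be computed for two coders. It assumes unweighted Cohen's kappa, which the text does not specify, and the function name and the lesson scores are hypothetical. It is an illustration, not the procedure used in the CATE study.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Return unweighted Cohen's kappa for two raters scoring the same units."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of units")
    n = len(rater_a)
    # Observed agreement: share of units given the same code by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions per code.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[code] * counts_b[code] for code in counts_a) / n ** 2
    if expected == 1:
        raise ValueError("kappa is undefined when both raters use one identical code")
    return (observed - expected) / (1 - expected)


# Hypothetical scores given by two coders to four double-coded lessons.
coder_1 = [1, 1, 2, 2]
coder_2 = [1, 1, 2, 1]
print(cohens_kappa(coder_1, coder_2))
```

Because kappa subtracts the agreement expected by chance, it is lower than the raw share of identical scores whenever the coders use the same few codes often.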

Kappa might be lower than desired, since Kappa tends to increase with the number of codes (Bakeman & Quera, 2011). The unit of our score was the whole lesson, and each dimension received only 3–6 scores in each subject. For this reason, we did not report the Kappa of the individual dimensions. Additionally, a more systematic approach to the double-coding process might have contributed to a stronger Kappa. The two coders discussed how to understand the dimensions and the scores before the first round of scoring. After that first round, the coders continued these conversations to calibrate the scores. I completed the coding of all the data included in this study. Kappa, however, represents the first round of coding; due to time constraints, a second round of double coding was not conducted. One might expect that inter-rater agreement would have yielded a higher Kappa at this second stage and that the coding is thus of higher reliability than the initial Kappa expressed. Further, as one researcher did all coding, the internal consistency should be high. I coded the material several times to increase the stability of the coding and the scoring (Church, 2010). For instance, I scored the data again when our understanding of the dimension “take the pupils’ perspective” changed and when I found new empirical evidence in a program that made me refine the score description on one dimension. Furthermore, the scores and codes were subjected to member checks, peer review, and expert validation (Creswell, 2013; Hammersley, 2010). Finally, the development of the coding book was a first step in developing a sustainable instrument for research on teacher education. Further research is necessary before this instrument is robust enough for large-scale use. In that respect, the use of audio or video data to capture details in real-time talk and the actual timing of events would be advantageous.

Another aspect of reliability is the process of data collection. We had several research assistants collecting the data across nations and cultural contexts. I have already addressed the challenges with equivalence in comparative research (Raivola, 1985), and one might question whether the research assistants and informants understood the framework and the survey in exactly the same way (cf. Dalland, 2011; Walliman, 2011 for a discussion on authentication and credibility). All five research assistants collecting data had undergone common training by the CATE team, and we used the same instruments, developed within the CATE study and adjusted to the different national contexts. For instance, the survey was translated and back-translated (Blömeke & Paine, 2008). In addition, the research assistants collaborated closely with the CATE team, resulting in careful discussions about how to understand different aspects of the data to be collected across all contexts (e.g., how to estimate the hours of practice across the programs). Finally, the CATE team evaluated the data at the end of the collection period to ensure the data were suitable for the study purposes. This resulted in a new data collection at one site, and a re-check with research assistants at another.

4.5.3 Generalization

Some researchers view qualitative research as purely descriptive and concerned only with particular cases. Thus, qualitative research risks becoming too context-bound and too specific, so that generalizations or comparisons are not possible (Hammersley, 2010). Shadish, Cook, and Campbell (2002) emphasized that researchers using case methodology often do not dare to generalize; if they do, they are easily accused of being biased towards their favorite case. Stake (2006) noted that multiple-case studies can guide policy for cases like those studied or that knowledge about the cases can be transferred to other cases.

Sampling is important when it comes to external validity, and as previously stated, purposive modes of sampling are necessary in case-study approaches (Stake, 2006). The strongest generalizations can be made from “most similar” cases that are broadly representative of their population (Flyvbjerg, 2006). Hence, the extent to which our six teacher education programs are representative of all teacher education programs worldwide, or the extent to which each is representative of its region, determines the extent to which we can generalize our findings. We sampled programs that were assumed to have paid attention to teacher education efforts, but they were not “most similar” cases in a strict sense. One should therefore be careful when using the findings of this thesis to make generalizations about the current state of teacher education in general or in the three nations in our sample. However, certain characteristics of my research design are worth mentioning when discussing the generalizability of the results. The cases share characteristics, but also display variety across contexts. According to Stake (2006), this multiple-case approach generates specific ways of drawing conclusions. Unusual situations or findings across the cases limit the generalizability of the answers to the research questions, whereas typical situations across the cases contribute to the descriptions in the conclusions. In that sense, Eisenhardt and Graebner (2007) argued that a multiple-case approach can contribute to theory-building to a greater extent than can a single case study. In that respect, it is important that we have provided systematic and transparent descriptions of the cases in our sample, so that the readers can decide whether the findings are transferable to their own context (Creswell, 2013; Stake, 2006).

Additionally, Silverman (2006, p. 249) argued that social science should not necessarily copy research designs that enable generalizations, but that we should rather see our findings as context-bound and contrast them with generalizations of these findings. Yin (1994) emphasized that case-study research relies on analytical generalizations, where one generalizes the results from cases to a broader theory. Our context-bound findings about opportunities grounded in practice in teacher education coursework could provide temporary generalizations, working hypotheses rather than conclusions, which could be the starting point for further investigations in other contexts. Thus, our findings might contribute to theory-building around practice-based teacher education and to knowledge about what good teacher education might look like, which is of use for policymakers and others.