
3.6 Reliability and Validity

3.6.1 Reliability and validity concerning present study

There are many aspects of reliability and validity to be considered and reflected upon in order to judge whether a research study meets the requirements of quality research. The following text addresses these issues in relation to the present study.

The original purpose of this project was to study and compare beliefs among teachers from the US, Taiwan and Norway, using the same methods and instruments on the topics described in this thesis. During this process I became aware of different ways of conducting Q research (Thorsen, 2006). A consequence of this was a decision to analyze the Q data as originally intended according to the methodology developed by William Stephenson (1935, 1953). This would make comparison with the US data, which had been gathered and analyzed in a different manner, difficult. The study therefore came to focus on Norwegian teachers only; little research has so far been conducted to uncover the beliefs of teachers in daycare and the early school years in Norway. This also led to a stronger focus on the subjective beliefs of these teachers, displayed through their individual configurations in the Q-sorting of the different Q-samples, and on the beliefs, values and priorities that emerged from the data.

As mentioned before, in Q-methodology reliability and validity do not play any role in the conventional sense (McKeown & Thomas, 1988), because “the importance to me” is the measuring unit. There is therefore no external criterion by which to judge internal spontaneous organizations or feeling projections (Brown, 1980). One measures the subjective understanding and the importance of one statement in comparison with all the other statements as a whole. On the other hand, representativeness in the choice of statements and P-set is important to validity and reliability. The rigorous procedure for gathering the statements and choosing a balanced selection of them in the TBQ has been accounted for earlier.

One might ask whether a concourse of statements gathered in the US is representative of what Norwegian teachers think and believe.

First, some of the statements were collected from relevant literature and existing scales. The quality of the statements gathered could have been increased by selecting statements from interviews with the participants as well, but on the other hand, this might not have made much difference.

Teachers have an academic education, and one would expect them to be acquainted with relevant literature on the selected topics, but there are no guarantees. However, follow-up interviews were conducted with six of the Norwegian participants, and the question “Are there issues that have not been referred to well enough in this study?” was asked primarily to uncover any discrepancies in the concourse. The general response was that the statements gave a good outline of the themes in question.

Another point to discuss is the issue of structured versus unstructured selection of statements. While Stephenson (1953) was a proponent of a balanced block design when sampling statements, others do not necessarily follow his lead. For example, Watts and Stenner (2004, 2005) and Corr (2006) do not use the balanced block design, but strive for a Q-sample that is broadly representative of the different opinions in the domain of interest for the research they are conducting. Watts and Stenner (2005) look upon it as a sampling task where the procedure is of little consequence as long as the final Q-sample can be justified as being broadly representative of the relevant domain. Norwegian researchers well acquainted with Q-methodology (Allgood, 1999; Allgood & Kvalsund, 2000; Kvalsund, 1998, 2005) are consistent in their use of a balanced block design when sampling statements, which seems to capture representativeness in a more precise manner. In this light one might conclude that the statements in the present study could have been structured more precisely through a balanced block design, and that additional quality could have been added by gaining natural statements through prior interviews in a Norwegian context.

Another aspect concerning the representativeness of statements and individuals’ possibilities to express their different opinions has to do with the number of statements in question. Both Watts and Stenner (2005) and Brown (1980) point to a Q-sample of 40-80 statements as satisfactory, but fewer statements have been used; for example, 16 statements were used in a study by Wester and Trepal (2004). With too few statements there may be a problem of adequate coverage, while too many may lead to problems with the Q-sorting process. It is wise to generate a large sample of statements which can then be refined and reduced, for example through piloting, but a Q-sample “only needs to contain a representative condensation of information” (Watts & Stenner, 2005, p. 75). In the present study there are 20 statements in each Q-sample theme and a four-by-five forced distribution with a range of five categories from A to E (-2 to +2). This may not seem like much, but it still gives each person a wide range of choice possibilities. Brown (1980) has exemplified this principle in his technical note ‘2. Permutations and Combinations in Q Sorting’ (pp. 265-267), where he calculated the numerous combination possibilities in the Lipset study with 33 statements and a range from -4 to +4.

Although the present study has fewer statements and a narrower range, it still leaves room for sufficient individuality in view of a representative condensation of information, as pointed to above.
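To make this point concrete, the number of distinct ways to complete such a forced distribution can be expressed as a multinomial coefficient: with 20 statements partitioned into five categories of four statements each, there are 20!/(4!)^5 possible configurations. The short sketch below is offered purely as an illustration of this arithmetic; the function and variable names are my own and not taken from the study materials.

```python
from math import factorial

def forced_distribution_count(n_statements, column_sizes):
    """Number of distinct Q-sorts under a forced distribution:
    the multinomial coefficient n! / (n1! * n2! * ... * nk!)."""
    assert sum(column_sizes) == n_statements
    count = factorial(n_statements)
    for size in column_sizes:
        count //= factorial(size)
    return count

# Present study: 20 statements sorted into five categories (A to E,
# i.e. -2 to +2), with four statements in each category.
print(forced_distribution_count(20, [4, 4, 4, 4, 4]))  # 305,540,235,000
```

Even with only 20 statements and five categories, each respondent thus selects one configuration out of roughly 305 billion possibilities, which is the sense in which the grid still gives each person a wide range of choice possibilities.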

In Q, experience has indicated that reliability coefficients of a person with himself normally range from .80 upward (Brown, 1980). In addition, when more individuals define a factor, reliability increases. The higher the reliability coefficient for a factor, the lower the error estimate for that factor’s scores; factors with a high number of defining loadings reduce the factor score error accordingly. In the present study many individuals load highly on the factors that emerged from the data. For example, Subgroup 1 has 30 defining sorts out of 40 possible at p < .01, ranging from .59 to .89, and the equivalent for Subgroup 2 is 32 out of 40 possible at p < .01, with a range from .60 to .93, both on Q1. There are even stronger results for Q3, beliefs about children. Q2, where the theme is group/classroom practices, shows more variation, with two factors for Subgroup 1 (A and B) and three factors (C, D and E) for Subgroup 2. While factors A and C have many defining sorts (28 and 26), factors B, D and E have fewer (9, 4 and 3). A guiding rule for a well-defined factor is to have two or more clearly defining sorts that do not load highly on other factors (Schmolck, 2006b). This should indicate that the data gathered and the results obtained meet the necessary levels of reliability and validity.
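The relation between the number of defining sorts and the reliability of a factor is commonly expressed through a composite, Spearman-Brown style formula of the kind discussed by Brown (1980): with p defining sorts and an assumed average individual reliability of .80, the composite reliability is (.80 × p) / (1 + (p − 1) × .80), and the standard error of the factor scores (in z-score units) falls as this reliability rises. The sketch below illustrates the calculation; it is an illustration of the conventional formula, not output reported in the study, and the function names are my own.

```python
from math import sqrt

def composite_reliability(defining_sorts, person_reliability=0.80):
    """Composite reliability of a factor defined by p sorts, assuming an
    average individual test-retest reliability of r (conventionally .80)."""
    p, r = defining_sorts, person_reliability
    return (r * p) / (1 + (p - 1) * r)

def factor_score_standard_error(defining_sorts, person_reliability=0.80):
    """Standard error of factor scores in z-score units: sqrt(1 - reliability)."""
    return sqrt(1 - composite_reliability(defining_sorts, person_reliability))

# Illustration with the numbers of defining sorts reported for Q1 above:
for p in (30, 32):
    print(p, round(composite_reliability(p), 3),
          round(factor_score_standard_error(p), 3))
# 30 sorts -> reliability ~0.992, SE ~0.091; 32 sorts -> ~0.992, SE ~0.088
```

On this formula even the smaller factors (three or four defining sorts) would still have composite reliabilities above .90, which is consistent with the guiding rule of two or more clearly defining sorts cited above.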

In collecting Q data, a forced distribution is usually used, as was the case in this study. The intent is to get participants to make judgments they might otherwise resist making. The nature of this forced distribution is that one statement is placed in each of the 20 places in the four-by-five rectangular distribution. There were some complaints from pilot groups and interviewees that this took time, but several also expressed that it was interesting because it made them reflect before making judgments.

Many Q studies use a quasi-normal distribution rather than a rectangular shape as in this study. Brown (1971) argues that the same results are obtained regardless of the response distribution, that the ordering of preferences is more influential than the shape of the distribution, and that no important statistical information is lost by using different distribution matrices. Cottle and McKeown (1980) support the view that the matrix for Q-sorting is arbitrary with respect to the results, and that bell-shaped, flat, or matrices with more statements at the extreme ends may be applied without seeming to affect the factor structure. According to Brown (1980, p. 289), “distribution effects are virtually nil”. With a rectangular distribution, however, the psychological significance of the extremes is not as explicit as with a quasi-normal distribution, which has fewer places for statements at each pole and more in the middle. Q is more than a technique; it is “a comprehensive approach to the study of behavior, where man is at issue as a total thinking and behaving being” (Stephenson, 1953, p. 7). Cottle and McKeown (1980, p. 62) are concerned that “technical components should not overshadow the validity of the total methodology”.

As noted previously, reliability has to do with reaching similar results through repeated trials, and also with the accuracy of the measurement procedure. An issue that may be of concern in relation to reliability is the condition of instruction for Q1 and Q2. For Q3 the respondents were instructed to sort the statements into five categories from least to most characteristic of your beliefs about children. This is a straightforward instruction concerning ‘beliefs’ and should be uncomplicated to relate to. However, for Q1 and Q2 the instructions were more complicated. Respondents were asked to sort the statements into five categories from least to most characteristic of your approach or beliefs about discipline and behavior management (Q1), and to sort the statements into five categories from least to most essential and/or characteristic of your teaching (Q2). Since the reliability of responses also depends on the accuracy of how the measurement was carried out, having two issues in the same instruction can be a problem. Did the respondents relate to ‘approach’ or ‘beliefs’, or were these words seen as an integrated part of the instruction? The same concern applies to ‘essential’ and/or ‘characteristic’. Since I used the TBQ (Rimm-Kaufman et al., 2006), I also duplicated their instructions. At the time of my data collection I was not conscious enough of this issue or that it might be a problem. In hindsight I see that the instructions should have been more precise.

The TBQ was developed for use in the USA. I have tried the Norwegian version out in three small pilot studies and in the main study. I cannot say that the reliability coefficient of a person with him- or herself is .80 (Brown, 1980) or higher in the present study, since I have not repeated the investigation with the same people. However, there are factors on which many individuals load highly, which reduces the factor score error. In addition, results from both subgroups showed similarities.

The questionnaire containing data on demographic issues and self-efficacy beliefs was carefully constructed to measure what it was supposed to measure, and it was administered in an appropriate manner to meet standards in quantitative traditions. As noted earlier, the Teacher Self-Efficacy Scale has been used satisfactorily by others (NICHD-ECCRN, 2002) previously. The 10 items represent two components of personal efficacy: (1) instructional self-efficacy (seven items) and (2) disciplinary self-efficacy (three items). In this study Cronbach’s alphas for the two components were .85 and .84, respectively.
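For reference, Cronbach’s alpha for a component is conventionally computed from the number of items, the item variances and the variance of the summed score: alpha = k/(k − 1) × (1 − sum of item variances / variance of the total score). The minimal sketch below assumes item responses arranged as a respondents-by-items array; the example data are invented for illustration only and are not values from the study.

```python
import numpy as np

def cronbach_alpha(item_scores):
    """Cronbach's alpha for a (respondents x items) array of item responses."""
    item_scores = np.asarray(item_scores, dtype=float)
    k = item_scores.shape[1]                               # number of items
    item_variances = item_scores.var(axis=0, ddof=1)       # variance of each item
    total_variance = item_scores.sum(axis=1).var(ddof=1)   # variance of the sum score
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical responses from five teachers to a three-item component
# (illustrative values only).
example = np.array([
    [4, 5, 4],
    [3, 3, 4],
    [5, 5, 5],
    [2, 3, 2],
    [4, 4, 3],
])
print(round(cronbach_alpha(example), 2))  # ~0.91 for this toy data
```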

The follow-up interviews were conducted in line with criteria for qualitative inquiry. One purpose was to make it possible for the person being interviewed to bring me, as the interviewer, into his or her world. The quality of the data is highly dependent upon the interviewer (Patton, 2002). I did my best to make the interviewees comfortable with the situation, and was conscious of the importance of listening and being genuinely interested in what they said. I took notes in addition to using a tape recorder in order to get as much and as accurate information as possible from those being interviewed in the restricted amount of time we had. The interviews were transcribed, checked, and rechecked against the tape-recorded versions to be sure the data were transcribed correctly. The interview data were also important in comparison with the Q data and questionnaire data, to see whether they would confirm or contradict any of the other information. I tried to show the sensitivities and sensibilities required of the researcher as research tool (Marshall & Rossman, 1999), and to deal with the data respectfully and in line with qualitative traditions.

3.6.2 Summary

Different aspects of reliability and validity have been pointed to and discussed, and an attempt has been made to view the present study’s degree of reliability and validity in light of these issues. Historically there have been differences in how reliability and validity are pursued in various methods, and some of these variations still exist, although researchers today tend to look more for similarities and overarching frameworks to guide the work that is needed to seek, uncover, and report on current research. A commonality across all methods is the importance of the inferences we make during the whole research process.

As noted earlier, there is no perfect research study. This applies to the present study as well. The purpose of the study changed: from having a comparative focus on teachers in Taiwan, the US and Norway, only results from Norwegian teachers are presented here, due to methodological differences and the insight gained into Q-methodology. Conventional reliability and validity are not central in Q because the measuring unit is ‘importance to me’. However, both Brown (2006a) and Messick (1989, 1995) stress the importance of representativeness. In this respect the present study could have been improved by also deriving statements from interviews with Norwegian teachers on the topics, and by using a balanced block design when narrowing down the number of statements to apply as a Q-sample. A wider range and the use of a quasi-normal distribution could have nuanced the picture even more and made the extremes of the factors clearer. The use of A to E instead of numbers from -2 to +2 may have had an influence; on the other hand, all five positions were written out in words under the respective letters on each of the answer sheets (Appendix II, III, IV). The condition of instruction in Q1 and Q2 could have been more precise. The Teacher Self-Efficacy Scale has been used satisfactorily by others, and the follow-up interviews were conducted in line with qualitative criteria. As noted, there are issues in this study that decrease reliability and validity to a certain degree, but the overall procedures have been conducted in line with the relevant methodologies and should provide grounds for sufficiently reliable and valid inferences from the data.

Other essential aspects are the ethics and values implicated in our research and the consequences these may have for our respondents and others; these will be considered in the next section.