Can Small Group Tuition Reduce Observed Gender Gaps in Mathematics?
Results from a randomized controlled trial
Anja Ovesen Aigeltinger November 2019
Master thesis in Economics
Department of Economics, University of Oslo
Preface
This thesis was supervised by Astrid Marie Jorde Sandsør and written in collaboration with the Nordic Institute for Studies in Innovation, Research and Education (NIFU). The data analyzed were collected from a randomized controlled trial titled the “1+1-project”. The randomized controlled trial is financed by the Research Council of Norway.
I would like to thank my supervisor, Astrid, for welcoming all my more or less stupid questions with an open door (in both senses of the word). I would also like to thank the project leader, Vibeke Opheim, for welcoming me at NIFU and letting me be a part of the project.
Abstract
This thesis investigates gender differences in mathematics among pupils in Norway. The research question is: are there observed differences in mathematical abilities between girls and boys in second grade? If so, can small group tuition in mathematics help decrease these differences?
Other well-established tests, such as the National tests, detect a gender gap in favor of boys both in primary and in lower secondary education. This thesis exploits some newly developed tests aimed at the younger pupils in primary school to show that the gender gap in favor of boys observed in later years is also present for these tests.
The thesis finds a significant gender gap of -0.074 standard deviations in favor of the boys.
The gender gap seems to be equally present for the children born late in the year as for the early-born children, but heterogeneous across the test distribution. Results show a significant estimate of 0.024 standard deviations in favor of the girls among the below-average performing pupils, and a significant estimate of -0.052 standard deviations in favor of the boys at the top of the distribution. This observation is consistent with the research finding that boys and girls are often distributed unequally in, for example, intelligence tests, where boys to a larger extent than girls are found in the tails of the distribution.
Lastly, this thesis presents mid-way results from a randomized controlled trial, where treatment schools are given one additional teacher, carrying out small group tuition in mathematics. The estimates show a positive intention to treat effect of small group tuition at the end of 2nd grade, which seems to increase towards the end of 3rd grade. However, treatment does not seem to have any particular effect on girls and hence does not decrease gender gaps.
Contents
Preface
Abstract
Introduction
1 Why should we care about gender differences in second-grade mathematics?
1.1.1 International evidence of increasing gender gaps by age
1.1.2 Heckman and the rates of return to human capital investment
1.2.1 Women underrepresented in the STEM occupations
1.2.2 Females choose easier math subjects
1.3.1 Primary school
1.3.2 Lower secondary education
1.3.3 Upper secondary education
2 Data description
2.1.1 RCT - the gold standard for effective research
2.1.2 The “1+1-project”
2.2.1 Reducing Selection Bias
2.2.2 Missing data
2.3.1 Content
2.3.2 Distribution of test scores among girls and boys
2.3.3 Standardizing test scores
3 Estimation Strategy and Results
3.1.1 Balance tests and methodology
3.1.2 Results
3.2.1 Birth Month
3.2.2 Math level
3.3.1 Balance tests and methodology
3.3.2 Results
4 Conclusion
5 References
Appendix
Figures and Tables
Figure 1-1: International tests measuring gender gaps
Figure 1-2: Rates of return to human capital investment
Figure 1-3: Predicted demand for Engineers and other occupations within the Scientific field from 2008-2030
Figure 1-4: Mathematic choice in upper secondary education
Figure 1-5: Choice of mathematic course
Figure 1-6: National test scores in 5th grade
Figure 1-7: Distribution of National test scores in lower secondary education (2018)
Figure 1-8: Distribution of grades in 10th grade (2017)
Figure 1-9: Average grade in all math subjects (2018-2019)
Table 2-1: Data Reduction
Figure 2-1: Distribution of tests scores
Figure 2-2: Distribution Standardized scores
Table 2-2: Comparing Standardized Distributions
Table 3-1: Balance Test in Gender
Table 3-2: Gender Gap Results
Table 3-3: Subsample analysis Birth Month
Figure 3-1: Negative correlation between birth month and test score
Table 3-4: Subsample analysis Math Level
Table 3-5: Balance test in treatment status
Table 3-6: Mean pretest scores for those missing and not missing pretest results
Table 3-7: Can small group tuition decrease the gap among girls and boys?
Introduction
This thesis is about gender differences in mathematics among pupils in Norway. Other well-established tests, such as the National tests, detect a gender gap in favor of boys both in primary and in lower secondary schools. Although research shows that early intervention is important (Cunha & Heckman, 2010), there have until now been few means available to test pupils in earlier grades. This paper exploits some newly developed tests which can be used to detect gender gaps even earlier in primary school.
Some studies show that girls and boys are distributed differently when testing intelligence, i.e. that boys are more often found among both the best and the worst performers while girls tend to score closer to the average (Johnson, Carothers, & Deary, 2008). There is also evidence that children born late in the year perform worse compared to their peers born early in the year, although less is known about whether this differs by gender. This paper investigates both hypotheses.
Having established a gender gap in favor of boys that is heterogeneous across the distribution of test scores, the paper further presents mid-way results from a randomized controlled trial (RCT), where 80 randomly selected schools are given one additional teacher to carry out small group tuition in mathematics. Although small group tuition seems to improve test scores for both genders, the paper does not find any sign of this learning method reducing the gender gap.
The paper is constructed as follows. Section one motivates the scope of this thesis by first demonstrating the importance of early investment in children’s education. Then, more generally it explains why gender differences in mathematics may be problematic in the perspective of income inequalities among men and women. Finally, it systematically goes through observed gender gaps in the Norwegian education system.
Section two explains the data used in the investigation of gender gaps and the intention to treat effect from small group tuition. It starts off by explaining how RCT as a research strategy can be used to detect the effect of small group tuition before it explains the 1+1-RCT-project in more detail. It ends by explaining the sample selection and the construction and distribution of the outcome variable: 1+1 test scores.
Section three presents the results and corresponding estimation strategies for the following research questions: (1) Are there observed differences in mathematical abilities between girls and boys in second grade? If so, are they homogeneous across (2) birth month and/or (3) the distribution of test scores? And (4) could small group tuition potentially reduce these differences?
The data material used in this thesis has restricted access. This thesis can only be made publicly available once the 1+1 project ends (spring 2020). The software Stata is used for all estimations.
1 Why should we care about gender differences in second-grade mathematics?
This section motivates why investigating gender differences in mathematics as early as in second grade is important. First, economic incentives of intervening early are explained in Section 1.1. Then, more generally, Section 1.2 explains why gender differences in mathematics may be problematic in the perspective of income inequalities among men and women. Lastly, Section 1.3 presents the gender gaps that are already observed in the Norwegian education system.
1.1 Economic returns to early intervention
1.1.1 International evidence of increasing gender gaps by age
Norway is not the only country observing gender differences in mathematics in favor of boys.
Figure 1-1 shows gender differences in calculations, measured in standard deviations, for the 1984 cohort from Norway and eleven other OECD countries.
The results come from three different international tests: TIMSS, PISA and PIAAC. TIMSS tests children in fourth grade, PISA in 10th grade and PIAAC tests individuals when they reach the age of 27-28. We see that, in Norway (and in most of the other countries), the gap in favor of the boys in TIMSS is small and insignificant, but this difference increases over time. In 10th grade girls seem to perform just above one standard deviation worse than boys, while at the age of 27-28 this gap has doubled to above 2 standard deviations (measured in PIAAC).
In fact, all countries except Korea show a tendency toward an increasing gender gap in favor of the boys by the age of 27-28. This raises a question: could the gender gap in PIAAC be the result of a small but accumulating gap starting already in fifth grade, or even earlier? If so, there should be economic returns to early intervention.
Figure 1-1: International tests measuring gender gaps
Note: Figure 1-1 shows gender differences in numeracy, measured in standard deviations, through the whole educational course for the 1984 cohort. Statistically significant gender gaps are marked in a darker tone. We see, in Norway (and in most of the other countries), increasing gender gaps as students age. In fifth grade girls seem to perform ¾ standard deviations worse than the boys (measured by TIMSS). In 10th grade this gap has increased to just above one standard deviation (measured by PISA), while at the age of 27-28 the gap has doubled to above 2 standard deviations (measured in PIAAC). (Borgonovi, Ferrara, & Maghnouj, 2018, p. 18)
1.1.2 Heckman and the rates of return to human capital investment
In economic theory, labor is equivalent to human capital. Doyle, Harmon, Heckman, and Tremblay (2009, p. 3) argue that by investing early in humans, the benefits are larger and are enjoyed for longer, which in turn increases the return to investment. The investment referred to in this thesis corresponds to increasing females’ abilities in mathematics. Early investment means focusing on increasing mathematical understanding among girls in primary school or even preschool. Later investment would mean prioritizing mathematics among the post-school population of females.
Why focusing on children yields the greatest rate of returns is explained by the set-up of a multistage technology (Cunha & Heckman, 2010, p. 6). The idea is that teaching children the basics in mathematics raises the productivity of learning more advanced mathematics later.
Cunha and Heckman call this effect self-productivity, which “embodies the idea that skills acquired in one period persist into future periods” (Cunha & Heckman, 2010, p. 7). A second key feature is that when pupils learn the more advanced mathematics more productively, because of the earlier investment, the cost of teaching advanced mathematics falls. This effect is termed dynamic complementarity by the authors. The joint effects of self-productivity and dynamic complementarity are illustrated in Figure 1-2. The rates of return to human capital are decreasing over the life cycle, and when the cost of investment is set equal across all ages, the investment is efficient only in preschool and early in school.
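The two mechanisms can be restated compactly. The following is only a sketch in the spirit of the multistage technology in Cunha and Heckman: the symbols $\theta_t$ (a pupil's stock of mathematical skill at stage $t$), $I_t$ (the investment made at that stage) and $f_t$ (the stage-$t$ production function) are notation assumed here for illustration and are not used elsewhere in this thesis.

```latex
\theta_{t+1} = f_t(\theta_t, I_t), \qquad
\underbrace{\frac{\partial f_t(\theta_t, I_t)}{\partial \theta_t} > 0}_{\text{self-productivity}}, \qquad
\underbrace{\frac{\partial^2 f_t(\theta_t, I_t)}{\partial \theta_t \, \partial I_t} > 0}_{\text{dynamic complementarity}}
```

Under this notation, self-productivity says that skill carried into stage $t$ raises skill at stage $t+1$, while dynamic complementarity says that a higher skill stock raises the payoff to investing at stage $t$. The latter is the sense in which early investment makes later teaching cheaper per unit of skill gained.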
Figure 1-2: Rates of return to human capital investment
Note: Figure 1-2 illustrates how the rates of return to human capital investment, setting investment to be equal across all ages, decrease over the life cycle. We see that the rates of return start the highest in preschool and then decrease. (Carneiro & Heckman, 2003, Appendix)
1.2 Gender differences in mathematics - a factor in explaining income inequalities?
In 2018 the average man was making 6 250 NOK more per month in salary compared to the average woman (SSB, 2019a). At the same time, women are now overrepresented in higher education making up 60% of the student population (SSB, 2019e). It is possible to consider this a paradox. Could gender differences in mathematics be part of the explanation?
1.2.1 Women underrepresented in the STEM occupations
Women are still underrepresented in the fields of Science, Technology, Engineering, Math and Medicine, the so-called STEM occupations (SSB, 2019c). Workers with STEM competences are in high demand in the job market (see Figure 1-3), which in economic theory means a higher salary. Compared to others with 1-4 years of higher education, a person with a background in science, crafts and technical subjects made on average 8 480 NOK more per month five or more years after completing education (SSB, 2019d). Comparing the same two groups but with more than four years of higher education, the gap is 6 560 NOK per month.
In an effort to get more females into these STEM occupations, girls have since 2014 been given extra “gender points” when applying to these fields (Samordna opptak, 2013). This appears to have had limited impact, as only 10 percent of the women in 2018 chose an education within science, crafts and technical subjects (SSB, 2019c).
Figure 1-3: Predicted demand for Engineers and other occupations within the Scientific field from 2008-2030
Note: Figure 1-3 shows the predicted demand for engineers (to the left) and other occupations within the scientific field (to the right) in the period 2008-2030. (Gjefsen, Gjelsvik, Roksvaag, & Stølen, 2012, pp. 58-59)
The STEM occupations often require advanced mathematics from upper secondary education.
In upper secondary education in Norway, mathematics is mandatory in the first two years of the academic track and optional in the last year. Students can, in all years, choose between different mathematics courses that vary in their degree of difficulty (see Figure 1-4).
Figure 1-4: Mathematic choice in upper secondary education
Note: Figure 1-4 illustrates the different mathematical paths available to upper secondary education students. The light grey path corresponds to the simplest track and the darkest to the most difficult track.
[Figure 1-4 content. First grade: 1T (5h), 1P (5h). Second grade: R1 (5h), S1 (5h), 2P (3h). Third grade: R2 (5h), S2 (5h).]
1.2.2 Females choose easier math subjects
Evidence of positive effect of advanced mathematics on earnings
There is a large literature in support of a sizeable effect of advanced mathematics on earnings and also that gender differences in mathematics qualifications may explain a substantial part of the gender gap in income and in career outcomes more broadly.
Joensen and Nielsen (2015) investigate a pilot scheme implemented in some Danish high schools. The pilot scheme relaxed the strictness in course combinations, lowering the utility cost of choosing advanced mathematics as this no longer implied that you had to take other scientific courses as well. The authors study the impact of more females choosing to study advanced mathematics and find that those females choosing advanced mathematics move into more advanced and more mathematics-intensive careers and climb towards the top of the earnings distribution.
The first systematic attempt at measuring the effect of course curriculum was made by Altonji (1992), who exploits the large variation in curriculum across U.S. high schools¹ to measure the effects of specific courses on wage rates and years of higher education. Not able to access individual data, he uses the means for each high school of courses taken in each subject as instrumental variables for the courses chosen by the individuals. He finds that an extra year of math, science and foreign languages increases further education by 0.339 years and income by 0.017 dollars. These estimates are, however, as he points out, likely to be overestimated since the courses provided vary across schools in menu and quality, and curriculum choices are likely to vary across students. In particular, he finds that factors that are favorable to the outcome variables are positively correlated with semester hours of math, science and foreign languages and negatively correlated with semester hours of commercial and industrial arts.
Rose and Betts (2004) conduct a longitudinal study with access to individual-level information on courses. Controlling for demographic information, family and school fixed characteristics and degree of higher education, they use a vector of credits in six different math subjects (vocational math, pre-algebra, algebra/geometry, intermediate algebra, advanced algebra, and calculus) to estimate the effect of specific high school math courses on earnings nearly ten years after graduation. To deal with any upward bias that may come from factors such as motivation and ability, they include the student’s mathematics grade point average (GPA) as a proxy. They also apply an instrumental variables (IV) approach similar to that used in Altonji (1992). The intuition for using IV is that the predicted curriculum is purged of unobserved factors that may explain any deviation from this level, so that the estimated effect will be the effect of curriculum alone rather than the effect of a mix of curriculum, ability, and motivation. The authors conclude that math matters and that advanced math courses have the largest effect on earnings.
¹ Comparable to upper secondary education in Norway.
The positive effect of mathematics on further educational outcomes is also supported by Norwegian research. Falch, Nyhus, and Strøm (2014) exploit the random selection into one compulsory exit exam in either mathematics or languages in the last year of Norwegian lower secondary school. The students are not notified about the exam subject until a few days before the actual exam and so have just a couple of days to prepare. The authors find that, relative to those selected into languages, the students randomly selected into mathematics are less likely to drop out of upper secondary education, have higher enrollment in higher education, and in particular higher enrollment in natural science and technology education programs.
Females choose easier math subjects
Figure 1-5 shows that the choice of mathematics courses in 2018 varied greatly between genders. Most boys chose the most difficult subjects available (R2, R1 and T), while most females chose the easiest subjects (S2 and 2P). The exception was in 1st grade, where most chose T-math. In other words, choosing a “soft” math subject seemed to be the default choice of females and choosing a harder math course seemed to be the default for males.
It is unclear why females choose easier math subjects than males. One possible explanation is that they have a weaker performance in mathematics already in 5th grade (see the TIMSS test in Figure 1-1) that accumulates over the years. The same is also found for the Norwegian National tests in 5th and 8th grade (see Section 1.3).
Figure 1-5: Choice of mathematic course
Note: Figure 1-5 shows that in 2018, 56 percent of the girls and 63 percent of the boys chose 1T (the most advanced course in mathematics in 1st grade). In 2nd grade, almost one half of the girls and 37 percent of the boys preferred continuing with the simplest track in mathematics (2P), while 23 percent of the girls and 35 percent of the boys chose the hardest level of mathematics (R1). In 3rd grade, math is optional; among those continuing with mathematics, 43 percent of the girls and 57 percent of the boys selected R2. (Udir, 2019a)
1.3 Numbers in evidence of gender differences in mathematics
1.3.1 Primary school
In Norway, education is compulsory from the calendar year of the child’s 6th birthday until completion of 10th grade.² National tests are performed in 5th grade in primary school and in one of the first two years of lower secondary school. Comparing the average performance of boys and girls, boys do better by two points (about 0.2 standard deviations, as the test scale has a standard deviation of 10 points).
Figure 1-6 shows the distribution of boys and girls at each level, where level three corresponds to the best and level one to the worst. We see that the distribution is weighted toward level one for females and toward level three for males, illustrating a gender gap in favor of males.
² See https://snl.no/Skole_og_utdanning_i_Norge
[Figure 1-5: share of pupils (0% to 70%) choosing each course (1P, T, 2P, S1, R1, S2, R2), girls and boys.]
Figure 1-6: National test scores in 5th grade
Note: Figure 1-6 shows the distribution for females (first figure) and for males (second figure) among the three levels: level 1: < 43 points, level 2: 43-56 points and level 3: > 57 points. The distribution is weighted toward level one for females and toward level three for males, illustrating a gender gap in favor of males. (Modified version of Figur 10. 5. trinn fordelt på mestringsnivå, for regning, 2018. From "Analyse av nasjonale prøver på 5. trinn, 2018", Udir, 2018a, https://www.udir.no/tall-og-forskning/finn-forskning/tema/nasjonale-prover/analyse-av-nasjonale-prover-pa-5.-trinn-2018/)
1.3.2 Lower secondary education
In addition to observed national test score results, pupils in lower secondary school also receive grades. There is an overall grade in the subject determined by the teacher in mathematics and also (for some pupils) an examination grade the last year.
In the national tests, girls on average score 49 points, two points less than boys (Udir, 2019d). Figure 1-7 shows the 2018 distribution of national test scores among girls and boys in 8th grade. As in fifth grade, we see a larger portion of the girls performing below average and a larger portion of the boys scoring above, indicating a gender gap in favor of boys.
Figure 1-7: Distribution of National test scores in lower secondary education (2018)
Note: Figure 1-7 shows the distribution for females (first figure) and for males (second figure) among the five levels: level 1: < 36 points, level 2: 37-44 points, level 3: 45-54 points, level 4: 55-62 points and level 5: > 63 points. Compared to the boys, four percentage points more girls perform below level three and six percentage points fewer perform above level three, again illustrating a gender gap in favor of males. (Regning - fordeling på mestringsnivåer. From «Nasjonale prøver ungdomstrinn», Udir, 2019d, https://skoleporten.udir.no/rapportvisning/grunnskole/laeringsresultater/nasjonale-proever-ungdomstrinn/nasjonalt?periode=2018-2019&orgaggr=a&trinn=8&sammenstilling=1&fordeling=5)
Figure 1-6 data (percent of pupils at each level):
Boys: level 1: 20.1, level 2: 49.2, level 3: 30.7
Girls: level 1: 26.3, level 2: 53.5, level 3: 20.3
Figure 1-7 data (percent of pupils at each level):
Boys: level 1: 8.2, level 2: 19.6, level 3: 36.6, level 4: 23.9, level 5: 11.7
Girls: level 1: 9.1, level 2: 22.5, level 3: 38.3, level 4: 21.2, level 5: 8.8
However, measured in grades we observe a gender gap in the opposite direction of what is seen in the National tests. Girls perform on average 0.1 points better than boys in the written exam (with 6 as the best grade, the average grade for girls in 2018 was 3.7) and 0.2 points better in the grade determined by the teacher (the average grade being 3.9 for girls) (Udir, 2019b). In Figure 1-8, showing the 2017 distribution of grades among girls and boys in 10th grade, we see that more boys than girls were graded below four and more girls than boys received a grade above four.
Figure 1-8: Distribution of grades in 10th grade (2017)
Note: Figure 1-8 shows the 2017 distribution of grades among girls and boys in 10th grade. We see that more boys than girls, in 2017, received a grade below four and that more girls received a grade above three. (Figur 4.7. Fordeling av standpunktkarakterer i matematikk og norsk hovedmål ved avslutning av grunnskolen i 2017. From "Nye sjanser - bedre læring", NOU, 2019, p. 44)
1.3.3 Upper secondary education
After 10th grade, further education is voluntary, and students can follow an academic track or a vocational track in upper secondary school. In 2018, roughly two percent dropped out after 10th grade (Udir, 2018b). Among the remaining individuals, slightly more than half continue on to the academic track: 60 percent of the girls and 43 percent of the boys (SSB, 2019b).
In upper secondary education there are no National tests, only an overall grade determined by the teacher and grades from oral and written exams (Udir, 2019c). In the overall grade and the oral exam, girls perform between 0.3 and 0.6 points better than boys in all four math subjects. The differences are, however, smaller in the written exam, where they vary between 0.1 and 0.3 in the four subjects. Figure 1-9 depicts the average of these three measures.³ We see here that girls perform better than boys in all four subjects. The largest gap is observed in mathematics for social science (S), where girls get a grade 0.4 points above boys.
Figure 1-9: Average grade in all math subjects (2018-2019)
Note: Figure 1-9 shows that girls, averaged over the measures (overall, written and oral) perform better than boys in all four subjects. The largest gap is observed in mathematics for social science (S), where girls receive a grade 0.4 points above boys. (Udir, 2019c).
Clearly, the average girl performs better than the average boy regardless of the subject. However, generalizing this to say that girls are better in math is more complicated. In Figure 1-5 we saw that choosing a “soft” math subject seemed to be the default choice of females and choosing a harder math course seemed to be the default for males, indicating that the girls and boys in each course are not necessarily comparable. The boys actively choosing “softer” math might do so because they struggle more with math. Consequently, the average boy choosing this “softer” math is expected to perform worse than the average girl choosing this subject. Also, in the harder math courses R1 and R2 one could expect the minority of girls actively choosing these courses to be extra motivated or gifted in the mathematical domain. In sum, one could therefore expect the average girl to perform better than the average boy within each course.
The gender gap goes in opposite directions when comparing grades and National test scores. Regarding the overall grade in the subject and the grade received from the oral exam, a gender gap in favor of the girls could be explained by bias in favor of girls. However, this does not explain why girls perform better in the written exam while boys perform better on the National tests.
³ When applying to higher education, these three grades are weighted equally.
[Figure 1-9: average grade (axis from 0 to 5) for boys and girls in S1, S2, R1 and R2.]
Tests differ in how they are constructed and in which abilities or types of tasks they emphasize. In the international case, TIMSS is based on the participating countries’ national curricula, while PIAAC and PISA are based on the students’ ability to use calculations to solve problems they face in day-to-day life, further education and work life (NOU, 2019, p. 43). The National tests and the grading system in Norway also differ. The National tests put their emphasis on calculations, and although this corresponds to a crucial part of the curriculum in math, the exams measure more (NOU, 2019, p. 42). Thus, differences in test design may lead to tests being biased towards one of the genders.
A third explanation lies in the data sample and the possibility of selecting into tests. Completion of the written exam is mandatory, whereas pupils with special needs or with a non-Norwegian mother tongue can be exempted from the National tests.⁴ Over twice as many boys as girls receive special education (NOU, 2019, p. 55). The exclusion of this under-performing, predominantly male population from the National tests may bias the gender gap in the National tests upwards, in favor of the boys.
⁴ See https://www.udir.no/eksamen-og-prover/prover/nasjonale-prover/administrere-nasjonale-prover2/#fritak
2 Data description
“The most credible and influential research designs use random assignment” (Angrist & Pischke, 2008, p. 11). The 1+1-project is a randomized controlled trial investigating the effect on test scores of learning mathematics in small groups for pupils in 1st to 4th grade in primary school. Section 2.1 explains how RCT as a research strategy can be used to detect the effect of small group tuition. Section 2.2 explains the sample selection, and finally Section 2.3 explains in more detail the construction and distribution of the outcome variable: 1+1 test scores.
2.1 Using the RCT “1+1 project” for identification
2.1.1 RCT - the gold standard for effective research
For there to be any causal interpretation of results, and not merely statistically significant correlation, we need the Stable Unit Treatment Value Assumption (SUTVA) to hold. This assumption requires that individuals have equal probabilities of receiving treatment, that there are no hidden variations in the received small group tuition and that there is no selection on gains, i.e. that Equation (i) holds. The RCT design aims to measure the causal effect of small group tuition by randomly introducing small group tuition to some schools and thereby isolating its effect on test scores.
Let $T_i$ be a dummy for student $i$ that is equal to 1 if the student attends a school offering small group tuition, and let $Y_i$ be the observed test score of that individual. The potential outcome can then be expressed as:
\[
\text{Potential outcome} = Y_{iT} =
\begin{cases}
Y_{i1} & \text{if } T_i = 1 \\
Y_{i0} & \text{if } T_i = 0
\end{cases}
\tag{i}
\]
The observed outcome will always be equal to one of the two potential outcomes, leaving the other potential outcome a counterfactual (not observable in the data set).⁵ This makes the observed outcome a function of the two potential outcomes in (i):
\[
\text{Observed outcome} = Y_i = Y_{i0} + (Y_{i1} - Y_{i0})\, T_i
\tag{ii}
\]
⁵ A student can only attend one school at a time, either a school with small group tuition or a school without.
By OLS we can compare the average outcome between individuals participating in small group tuition and those who do not:
\[
\beta_{OLS} = E[Y_i \mid T_i = 1] - E[Y_i \mid T_i = 0]
\tag{iii}
\]
For the sake of interpretation, this expression can be rewritten in the longer, equivalent form:⁶
\[
\beta_{OLS} =
\underbrace{E[Y_{i1} \mid T_i = 1] - E[Y_{i0} \mid T_i = 1]}_{ATT}
+ \underbrace{E[Y_{i0} \mid T_i = 1] - E[Y_{i0} \mid T_i = 0]}_{\text{Selection bias}}
\tag{iv}
\]
Assuming the treatment is purely random, treated and untreated individuals should asymptotically have identical distributions of (pre-assignment) unobservables and observables and thus equal probabilities of receiving small group tuition: $(Y_{i1}, Y_{i0}) \perp T_i$, or $E(Y_{i0} \mid T_i = 1) = E(Y_{i0} \mid T_i = 0)$. We see that random assignment rules out the problems of selection and/or omitted variable bias in (iv), such that the estimated effect of small group tuition equals the average treatment effect on the treated (ATT).⁷
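The decomposition into ATT and selection bias can be illustrated with a small simulation. This is a minimal sketch under made-up assumptions, not an analysis of the 1+1 data: the function `simulate`, the effect size of 0.2 and the rule by which stronger pupils select into treatment are all hypothetical. Because both potential outcomes are generated for every simulated pupil, the counterfactual mean for the treated can be computed directly, which is never possible with real data.

```python
import random


def simulate(n=20000, effect=0.2, selective=False, seed=1):
    """Simulate pupils and return (difference in means, ATT, selection bias).

    All numbers are illustrative assumptions, not estimates from the 1+1 data.
    """
    rng = random.Random(seed)
    y1_treated, y0_treated, y0_control = [], [], []
    for _ in range(n):
        ability = rng.gauss(0.0, 1.0)
        y0 = ability            # potential score without small group tuition
        y1 = ability + effect   # potential score with small group tuition
        if selective:
            # Hypothetical non-random design: stronger pupils tend to be treated.
            treated = ability + rng.gauss(0.0, 1.0) > 0
        else:
            # Randomized design: a coin flip unrelated to ability.
            treated = rng.random() < 0.5
        if treated:
            y1_treated.append(y1)
            y0_treated.append(y0)  # counterfactual, known only in a simulation
        else:
            y0_control.append(y0)

    def avg(values):
        return sum(values) / len(values)

    # Difference in observed means, i.e. what OLS on a treatment dummy estimates.
    diff_in_means = avg(y1_treated) - avg(y0_control)
    att = avg(y1_treated) - avg(y0_treated)
    selection_bias = avg(y0_treated) - avg(y0_control)
    return diff_in_means, att, selection_bias


if __name__ == "__main__":
    for selective in (False, True):
        diff, att, bias = simulate(selective=selective)
        print(f"selective={selective}: diff={diff:.3f}, ATT={att:.3f}, bias={bias:.3f}")
```

In the randomized case the selection bias term is close to zero, so the difference in means is close to the ATT. In the selective case the difference in means overstates the effect by the amount of the selection bias.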
2.1.2 The “1+1-project”
The 1+1-project was initiated in the fall of 2016 to last for four years, ending in the spring of 2020.
In total 163 schools from ten large municipalities participate, geographically spread from the south-west to the northern region.⁸ The project was concentrated in large municipalities, which are mainly densely populated areas, to ensure a sufficient supply of qualified teachers in mathematics. Moreover, this ensures enough participating schools from each municipality, as randomization is conducted within each municipality.
The randomized controlled trial was conducted by randomly dividing the 163 schools into treatment and control schools. The 80 treatment schools get resources for an extra teacher man-year for teaching mathematics in small groups, while the 83 control schools continue as before.
6 This expression is a result of adding and subtracting the counterfactual 𝐸[𝑌𝑖0|𝑇𝑖= 1].
7 In a perfect RCT framework we have, in addition, that ATT = ATE = ATN.
8 The municipalities are: Asker, Bærum, Bodø, Drammen, Sandefjord, Sarpsborg, Stavanger, Tromsø, Trondheim and Ålesund.
The treatment: one additional teacher-man-year to perform small group tuition
School leaders in treated schools are allocated an additional teacher man-year which they are instructed to use for small group tuition in mathematics in grades specified by the project (see Guidance Letter in Appendix). Schools were instructed to split the teacher-man-year between no more than two teachers. About one third (26) of the schools divided the resources between two teachers.
The group of pupils is constructed using a pull-out strategy where small groups of pupils (maximum six) attend tutorials in separate classrooms. A pull-out-strategy means that a group of students is pulled out of regular teaching to have a period of small group tuition, and then placed back into regular instruction once the period is over.9 The groups vary over the school year so that all pupils should receive two periods of small group tuition in mathematics by the end of the school year. This pull-out strategy generates the necessary leeway for the additional teacher to customize teaching to the pupils (Bonesrønning et al., 2018, p.3).
Treatment consists of sessions, where the pupils meet three to five times each week in a period of four to six weeks. The sessions differ in length, as there are variations in the
schools’ organization of mathematics instruction. While some schools have long sessions (up to 90 minutes), others have shorter sessions. Most often sessions are of 60 or 45 minutes duration (Bonesrønning et al., 2018, p. 4).
Intention to treat estimate (ITT)
The smallest participating schools have 20 pupils per grade and the largest schools have about 70 pupils in each grade. As the pupils in participating schools must share the time of the additional teacher, treatment intensity of small group instruction would depend on the number of pupils. To ensure more homogeneous treatment effects across schools the largest schools are randomized into control and treatment classes or groups, so that every pupil receives a minimum quantity of small group tuition (see Guidance Letter in Appendix). When analysing treatment effects, this thesis only observes received treatment at the school level, making it a measure of the average intention to treat (ITT) estimate and not the actual received treatment effect (ATT).
9 In most of the occasions the group is pulled out of the class, but in some schools, mathematics is taught across different classes. In these cases, the group is pulled out of the math class.
Another problem making the estimate an ITT is that although all treatment schools are
informed about the importance of not mixing the use of the additional “1+1-teacher” with other teacher or assistant resources, some schools may be tempted to move some of the teachers or assistants in the “intervention grades” over to other grades not subject to small groups. If this is the case, then these schools will not be implementing the treatment as intended.
If there are a lot of classes in treatment schools not receiving small group tuition the “true”
effect of treatment is larger than the estimated ITT, thus making the estimated effect of small group tuition downward biased.
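The direction of this bias can be illustrated with a stylized calculation. As assumptions made only for this sketch, let p be the share of pupils in treatment schools who actually receive small group tuition, let pupils who do not receive it be unaffected, and let δ be the average effect among those who do receive it. Neither p nor δ is estimated in this thesis, and the numbers in the comment are hypothetical.

```latex
ITT = p \cdot \delta + (1 - p) \cdot 0 = p\,\delta
\quad\Rightarrow\quad
\delta = \frac{ITT}{p} \ge ITT
\quad \text{for } 0 < p \le 1 \text{ and } \delta \ge 0.
% Hypothetical numbers: ITT = 0.08 and p = 0.8 give \delta = 0.08 / 0.8 = 0.10.
```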
Randomization at the school level and clustered standard errors
The total number of 163 public primary schools permits a stratified randomization at the school level, within each municipality and stratum. By randomizing schools within their respective municipality, one disentangles the treatment effect from a potentially confounding fixed municipality effect, due for example to municipalities participating in different welfare programs, allocating budgets of different sizes or pursuing different education policies. These potential confounding factors are removed when comparing schools within their respective municipality. The randomization into treatment groups was conducted at the school level for two main reasons. First, school leaders may be reluctant to participate in an experiment where similar pupils are treated differently within the school. Second, it is more challenging to keep the control group unaffected by the treatment group if it is within the same school as the pupils receiving the treatment.
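A stratified randomization at the school level can be sketched as follows. This is a minimal illustration and not the project's actual assignment procedure: the function name `assign_within_strata`, the seed, the school and stratum identifiers, and the rule that the extra school in an odd-sized stratum goes to control are all assumptions.

```python
import random
from collections import defaultdict


def assign_within_strata(schools, seed=2016):
    """schools: list of (school_id, stratum_id) pairs.

    Returns {school_id: 1 for treatment, 0 for control}, randomizing
    separately inside every stratum (hypothetical procedure).
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for school_id, stratum_id in schools:
        by_stratum[stratum_id].append(school_id)

    assignment = {}
    for stratum_id in sorted(by_stratum):
        members = sorted(by_stratum[stratum_id])
        rng.shuffle(members)            # random order within the stratum
        n_treated = len(members) // 2   # assumption: odd school goes to control
        for k, school_id in enumerate(members):
            assignment[school_id] = 1 if k < n_treated else 0
    return assignment


# Hypothetical schools in three strata of sizes 4, 4 and 3.
example = [(f"s{i}", "A") for i in range(4)]
example += [(f"s{i}", "B") for i in range(4, 8)]
example += [(f"s{i}", "C") for i in range(8, 11)]
print(assign_within_strata(example))
```

Because assignment happens inside each stratum, treated and control schools are compared within groups that share municipality and prior test score level.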
Sample Selection
The analysis focuses on the youngest pupils, those in second grade. These data are the most interesting with regard to the importance of early intervention, and also because we know much less about younger students than about older students, for whom the data material is quite rich thanks to the National Test database. The thesis uses data from both the 2009- and the 2011-cohort, where we in total have access to four different tests: two pretests (one from each of the cohorts) and two posttests for the 2009-cohort (one from second grade last semester and one from third grade last semester). For the empirical analysis, this thesis uses a panel data set with information about gender, birth month, test scores10, home municipality, education institutions, treatment status and strata ID.
2.2.1 Reducing Selection Bias
The population may suffer from selective migration on the individual and/or school level.
Parents and pupils were informed about the randomization at the beginning of the school year 2016/2017. Thus, pupils enrolled in later school years knew about the treatment status prior to enrollment. The admission system in most municipalities is based on strict neighborhood rules11, meaning that the location of residence determines the school the student is enrolled in.
There are, however, a few municipalities in the sample that allow school choice. In the population file, there are six observations that switch from one school participating in the project to another. To remove any selection bias and include school fixed effects these observations are deleted.
One treatment school did not participate after its treatment status was revealed, reducing the number of schools to 162. Because this school exited after learning its treatment status, treatment is no longer random within its stratum, creating attrition bias. To overcome this problem, the analysis is run without this school’s stratum, further reducing the number of schools to 159 for the RCT analyses.
2.2.2 Missing data
Among the population of second grade pupils, only pupils observed with a gender, an active consent from their guardian/parent and observed with at least one test score are included in the analysis. The gender variable is available for pupils who attended a participating grade and school at the beginning of the project period. There is therefore no gender information for students entering participating schools and classes at a later point in time, although these missing students are likely to be evenly distributed across gender. After these constraints, the pretest for sample one (the 2009-cohort) contains in total 7069 observations and the pretest for sample two (the 2011-cohort) in total 6348 observations (see Table 2-1: Data Reduction).
Pupils are located in ten different municipalities in 159 different schools.
10 For the 2009 cohort, the test scores used are the ones from the fall 2016 (pretest 1), spring 2017 (posttest 1) and spring 2018 (posttest 2). For the 2011 cohort, the test score used is from spring 2018 (pretest 2).
11 See Opplæringsloven §8-1
Table 2-1: Data Reduction
                            N      Reduction   % Reduction
Sample A: Cohort 2009       8287
  Active consent            7421   866         10.45 %
  Pretest 1                 7069   352         4.74 %
  Posttest 1                7137   284         3.83 %
  Posttest 2                6572   849         11.44 %
Sample B: Cohort 2011       8189
  Active consent            6684   1505        18.38 %
  Pretest 2                 6348   336         5.29 %
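The Sample A rows of Table 2-1 can be reproduced with a short calculation. The sketch below assumes that the reduction for each test is measured against the active-consent sample (7421) rather than against the preceding row; the helper name `reduction` is hypothetical.

```python
def reduction(before, after):
    """Return (absolute reduction, percentage reduction relative to `before`)."""
    lost = before - after
    return lost, round(100 * lost / before, 2)


cohort_2009 = 8287
active_consent = 7421
print("Active consent:", reduction(cohort_2009, active_consent))
for name, n in [("Pretest 1", 7069), ("Posttest 1", 7137), ("Posttest 2", 6572)]:
    # Each test is compared with the active-consent sample (assumption).
    print(name + ":", reduction(active_consent, n))
```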
Outcome variable based on 1+1-tests
The dependent variable in this analysis (test score) is based on tests developed specifically for the 1+1 project, hence the name 1+1-tests. They were developed because the two types of standardized tests currently available in mathematics (standardized mapping tests and national tests) do not cover the relevant ability levels or the relevant second and third grade pupils. National tests aim at testing the whole distribution of pupils but are first taken in fifth grade. Mapping tests in mathematics are conducted in the second grade, but as the tests are constructed in a way that only helps to identify the pupils in need of extra learning assistance, they do not measure the whole ability range.
2.3.1 Content
The tests were developed by the research team in cooperation with a teacher in mathematics.
The 1+1 tests are based on both the national test and the standardized mapping test but are adjusted to match grade and ability levels. However, the 1+1-tests are much shorter in length and duration (40 minutes) than the standardized tests.12 The national tests take 90 minutes to solve and the mapping tests take about 60 minutes to run through and can be split up into two parts if this seems fit.
Before taking the test, pupils are informed that: 1) the test shall be done electronically (on PC or iPad); 2) the pupils are given paper and writing material which can be used if needed; 3) pupils are not allowed a phone or calculator; and 4) the pupils have no more than 40 minutes to complete the test. In cases where test participants had trouble reading the questions (not all second-grade pupils are steady readers), the teacher could read the question
12 See Udir.no
out loud. Because of the importance of comparable results between schools, all schools were strictly instructed to complete the test in this way.
The 1+1-tests contain questions regarding numeracy and the four calculation types: addition, subtraction, division and multiplication. The tasks include for instance estimating amounts, counting, comparing number values, expressing the same value in different ways and doubling/halving amounts. Before being used, the tests were piloted. Adjustments were also made during the project period to further improve certain questions and the psychometric properties of the test. The ideal difficulty level for a test measuring the whole ability range is found when the average student can answer about one half of the questions, giving a normal distribution of the score.
2.3.2 Distribution of test scores among girls and boys
The results come from four different tests, two pre-treatment tests and two post-treatment tests. The two pretests are from two different cohorts of second grade pupils, the 2009 cohort pretest (Pretest1) conducted fall 2016 and the 2011-cohort-test (Pretest2) from the spring 2018. The two posttests are both from the 2009 cohort, one conducted in the spring semester of second grade in 2017 (Posttest1) and the other in third grade, spring 2018 (Posttest2).
Figure 2-1 shows the density distributions for the four tests in addition to the non-parametric distributions corresponding to the girls (in red) and to the boys (in green). The red vertical line corresponds to the observed median in the relevant subsample.
Figure 2-1: Distribution of test scores
Notes: Figure 2-1 shows the density distributions for the four tests in addition to the non-parametric distributions corresponding to the girls (in red) and to the boys (in green). The red vertical line corresponds to the observed mean in the relevant subsample.
The aim of having a standard normal distribution in test scores is difficult to achieve. Graphically, Posttest2 (bottom-right) seems to have the most standard-normal-looking distribution. Absent outliers and with density distributions more or less weighted to the right of the distribution, the other tests show signs of being “too easy” when it comes to testing the best pupils. See Section 2.3.3 for a further analysis of the distributional attributes of the four tests. For now, we note that compared to the girls’ distributions, the boys’ distributions have a lower density to the left and a higher density to the right, indicating that boys indeed do better than girls on all tests.
2.3.3 Standardizing test scores
With maximum scores varying from 24 points (in Posttest1) to 41 points (in Pretest1), the four tests are quite different in their scaling, which makes any comparison of results difficult.
A common approach in applied regression is to standardize the test scores. A test score is standardized by subtracting its mean and dividing by its standard deviation:
\[
\text{Standardized score} = \frac{\text{Score} - \text{Mean}(\text{score})}{\text{SD}(\text{score})}
\]
Subtracting the mean typically improves the interpretation of main effects in the presence of interactions, and dividing by the standard deviation puts all the predictors on a common scale (Gelman, 2008). Each coefficient in this standardized model is the expected difference in the outcome, comparing units that differ by one standard deviation in an input variable, with all other inputs fixed at their average values. If the scores are normally distributed, about two thirds of the observations are within ± one standard deviation from the sample mean, and 95 percent of the observations are within ± two standard deviations from the sample mean. See Figure 2-2 for the distributions of standardized scores.
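The standardization formula can be written out as a short sketch. The raw scores below are hypothetical, and the use of the population standard deviation (`pstdev`) is an assumption, since the text does not say which variant was used.

```python
from statistics import mean, pstdev


def standardize(scores):
    """Subtract the mean and divide by the standard deviation of `scores`."""
    m = mean(scores)
    sd = pstdev(scores)  # assumption: population SD; the sample SD is also common
    return [(s - m) / sd for s in scores]


raw = [12, 18, 24, 30, 36]  # hypothetical raw test scores
z = standardize(raw)
print([round(v, 3) for v in z])
print(round(mean(z), 12), round(pstdev(z), 12))
```

After the transformation the scores have mean zero and unit standard deviation within the sample they were standardized on, so tests with different maximum scores are expressed on the same scale.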
Figure 2-2: Distribution Standardized scores
Although standardizing eases the comparison of test scores by removing the different scaling used, standardizing can be misleading when the test scores’ distributions do not resemble each other. Table 2-2 compares the distributional attributes of the standardized versions of the different test score distributions.
Table 2-2: Comparing Standardized Distributions
Standardized test scores   Mean    Skewness   Standard deviation   Kurtosis
Normal distribution        0       0          1.64                 3
Pretest1                   0.019   -0.489     0.995                2.523
Pretest2                   0.053   -0.396     0.991                2.432
Posttest1                  0.027   -0.263     0.993                2.526
Posttest2                  0.021   -0.354     0.992                2.593
The first row shows which attributes the test creators were aiming for when constructing all four tests, namely the attributes of the standard normal distribution. The first two columns measure how the densities are weighted, while the last two measure the spread of observations.
Column one shows that all tests have a mean greater than zero, meaning that pupils on average scored a bit too well compared to the goal of a mean of zero. Skewness, in column two, is a measure of the distributions’ symmetric attributes, where any symmetric distribution (like the standard normal distribution) has a skewness of zero. All test distributions are here more or less negatively skewed, which (as we already pointed out in the analysis of Figure 2-1) means that the distribution is concentrated to the right of the figure, indicating that the tests have been “too easy”.
The standard deviations in column three are all about 0.99 and thus lower than 1.64, meaning again that the distributions seem to be centered toward the mean. This feature is also captured in the last column, where all distributions have a kurtosis below 3, that is, fewer and less extreme outliers, resulting in thinner tails (see Figure 2-2 and Figure 2-1) than the standard normal distribution. This attribute shows that the tests have been “too easy” when it comes to separating the “best” from the good pupils and/or “too hard” when it comes to separating the worst performing pupils from the “bad” performing pupils.
When comparing the distributions of test scores, we see in Figure 2-2 that the posttests are in general closer to the standard normal distribution. Posttest1 (green curve) corresponds to the most symmetric distribution, while Posttest2 (yellow curve) has a wider spread of
observations making the “tailedness” more similar to the normal distribution.
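The attributes reported in Table 2-2 are moment-based statistics and can be computed as in the sketch below. The function name `moments` and the example scores are hypothetical, and the formulas use simple population moments without small-sample corrections, which may differ slightly from what the statistical software behind Table 2-2 reports.

```python
from statistics import mean


def moments(xs):
    """Mean, SD, skewness and (non-excess) kurtosis from population moments."""
    n = len(xs)
    m = mean(xs)
    m2 = sum((x - m) ** 2 for x in xs) / n  # variance
    m3 = sum((x - m) ** 3 for x in xs) / n  # third central moment
    m4 = sum((x - m) ** 4 for x in xs) / n  # fourth central moment
    return {
        "mean": m,
        "sd": m2 ** 0.5,
        "skewness": m3 / m2 ** 1.5,  # 0 for any symmetric distribution
        "kurtosis": m4 / m2 ** 2,    # 3 for the normal distribution
    }


# Hypothetical scores bunched near the maximum, as for a test that is "too easy".
easy_test = [1, 8, 9, 9, 10, 10]
print(moments(easy_test))
```

A negative skewness signals scores bunched toward the top of the scale, and a kurtosis below 3 signals thinner tails than the normal distribution.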
3 Estimation Strategy and Results
Section 3 presents the results and corresponding estimation strategies for the following research questions: Are there observed differences in mathematical abilities between girls and boys in second grade (Section 3.1)? If so, are they homogeneous across birth month and/or the distribution of test scores (Section 3.2)? And could small group tuition potentially reduce these differences (Section 3.3)?
Identifying gender gap
3.1.1 Balance tests and methodology
To not confuse the analysis of gender differences with the effect of small group tuition this section focuses the analysis on the pretests. In these tests none of the pupils had yet received small group tuition resulting in data from both treatment and control schools being suited for the investigation of gender gaps. The gender gap in mathematics is estimated by the following regression:
𝑌𝑖𝑔𝑡= 𝛽𝑔𝑖𝑟𝑙𝑖+ 𝛾𝐵𝑖𝑟𝑡ℎ 𝑚𝑜𝑛𝑡ℎ𝑖 + 𝛼𝑔+ 𝜖𝑖𝑔𝑡 (1)
𝑌𝑖𝑔𝑡 is the test score of pretest 𝑡13 for individual 𝑖, belonging to school 𝑔. It is likely that test scores of pupils in the same class, or even the same grade level, correlate because pupils in the same class share background characteristics and are exposed to the same teacher and classroom environment. A school fixed effect, denoted 𝛼𝑔, is therefore included to control for this within-school correlation. By the same logic, because the error terms (𝜖𝑖𝑔𝑡) are assumed to correlate within (but not between) schools, standard errors are clustered at the school level (Angrist & Pischke, 2008, p. 309). Birth month is a variable taking values from 1 to 12, where 1 corresponds to being born in January and 12 to being born in December.
𝑔𝑖𝑟𝑙𝑖 is a dummy variable that equals one if the pupil is a girl and zero otherwise. 𝛽1 then measures the average within-school difference in Pretest1 results between girls and boys in the 2009-cohort. In the same way, 𝛽2 measures the within-school gender gap in Pretest2 results between girls and boys in the 2011-cohort. The estimate 𝛽 tells us whether there is a gender difference, but not necessarily why. One possibility could be that one or both tests favor one of the
13 𝑡 = 1 for the 2009-cohort-test and 𝑡 = 2 for the 2011-cohort test
genders. Another possibility is that girls and boys are treated differently in school or at home.
Many of these hypotheses require data that are unavailable or outside the scope of this analysis.
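The logic of Equation (1) can be illustrated with a within-school estimator. The sketch below is a simplification under stated assumptions and is not the estimation code behind the results in this thesis: it omits the birth month control, uses hypothetical data and a hypothetical function name (`within_school_gap`), and computes a cluster-robust standard error without any small-sample correction.

```python
from collections import defaultdict
from math import sqrt


def within_school_gap(rows):
    """rows: iterable of (school_id, girl, score) with girl in {0, 1}.

    Returns (beta, clustered_se) for score = beta * girl + school effect + error.
    """
    groups = defaultdict(list)
    for school, girl, score in rows:
        groups[school].append((girl, score))

    # Removing school means is equivalent to including school fixed effects.
    demeaned = []
    for school, obs in groups.items():
        g_bar = sum(g for g, _ in obs) / len(obs)
        y_bar = sum(y for _, y in obs) / len(obs)
        demeaned += [(school, g - g_bar, y - y_bar) for g, y in obs]

    sxx = sum(g * g for _, g, _ in demeaned)
    beta = sum(g * y for _, g, y in demeaned) / sxx

    # Cluster-robust variance: sum regressor-times-residual within each school.
    cluster_sums = defaultdict(float)
    for school, g, y in demeaned:
        cluster_sums[school] += g * (y - beta * g)
    se = sqrt(sum(s * s for s in cluster_sums.values())) / sxx
    return beta, se


# Hypothetical pupils: school A scores higher overall, girls score 0.1 lower in both.
data = [("A", 0, 1.0), ("A", 1, 0.9), ("A", 0, 1.0),
        ("B", 1, -0.1), ("B", 0, 0.0), ("B", 1, -0.1)]
print(within_school_gap(data))
```

In this constructed example a raw comparison of girls and boys would mix the gender gap with the difference between the two schools, whereas the within-school estimator compares pupils only with schoolmates.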
There is however data available concerning birth month and the number of missing observations for each gender. Birth month is an indicator variable from 1 to 12 where 1 is January and 12 is December. Missing is a dummy equal to 1 if the relevant score is missing.
Let Xi be a vector of birth month and missing. Equation (2) then estimates any imbalances between girls and boys in these variables, with results presented in Table 3-1.
𝑋𝑖 = 𝛽𝑔𝑖𝑟𝑙𝑖 + 𝛼𝑔 + 𝜖𝑖𝑔 (2)
Table 3-1: Balance Test in Gender
                          Sample A: Cohort 2009                   Sample B: Cohort 2011
                          (1) Boys   (2) Girls   (3) Difference   (1) Boys   (2) Girls   (3) Difference
Birth month               6.48       6.50        0.011            6.42       6.44        0.022
                                                 (0.083)                                 (0.087)
Likelihood of missing14
  Pretest 1               4.68 %     4.81 %      0.002
                                                 (0.005)
  Pretest 2                                                       5.53 %     4.45 %      -0.005
                                                                                         (0.005)
N                         3890       3531        7421             3562       3122        6684
Notes. Both tests were taken in second grade (pre-treatment): Pretest1 by the 2009-cohort and Pretest2 by the 2011-cohort. Standard errors are clustered at the school level and a school fixed effect is included in all estimations. Significance at the 1%, 5% and 10% levels is indicated by ***, ** and * respectively.
Columns (1) and (2) (in Table 3-1) present average birth month and likelihood of missing the relevant test score separately for boys and girls for the 2009- and 2011-cohort respectively.
We find that the variables are equally distributed among the two genders, which means that any potential endogeneity issues in the estimate β must come from data unavailable for this analysis.
14 A missing score value may be due to missing consent from a parent/guardian, being absent from the test, or having changed school during the period. The figures should be interpreted as the average individual’s likelihood of missing a score.
Table 3-2 presents results from estimating Equation (1) by OLS, where I conduct the following robustness checks:
i. Include strata- or municipality- or school fixed effects.
ii. Control for birth month
iii. Compare results from both pretests.
iv. Give the missing score a synthetic maximum and minimum value
3.1.2 Results
Panels one and two show results from the 2009-cohort pretest (Pretest1) and the 2011-cohort pretest (Pretest2). The first five columns show results only for the sample having an observed test score, corresponding to 7069 observations in Pretest1 and 6348 observations in Pretest2 (see Table 2-1), while the last two columns also include the 352 observations missing the Pretest1 score and the 336 observations missing the Pretest2 score. The bottom of the table indicates which variables and fixed effects are included in each regression.
Column (1) shows results when running the regression and only controlling for gender. We see that girls on average perform 0.086 standard deviations worse in Pretest1 and 0.144 standard deviations worse in Pretest2 (estimates of -0.086 and -0.144). In what follows, we note that the estimate of β never loses significance at the one percent level and is relatively stable across specifications. It is also worth noting that the non-parametric evidence in the density distributions depicted in Figure 2-1 and Figure 2-2 points to such a gender gap.
Columns (2)-(4) show results when running the regression and controlling for gender and either including municipality fixed effects (Column 2), strata fixed effects (Column 3) or school fixed effects (Column 4). A fixed municipality effect controls for characteristics that are stable within each municipality and which are likely to influence test results. Examples of such characteristics could be local school politics or constraints in the school economy. The strata fixed effects in Column (3) group schools, within municipalities, that are similar in earlier observed national test scores; see Bonesrønning et al. (2018) for more details.
We see that estimates of 𝛽 are relatively robust to both municipality (Column 2) and strata fixed effects (Column 3). Including school fixed effects in Column (4), however, increases both estimates by about 0.01. This last result indicates that the first estimate (in Column 1) is partly driven by between-school differences.
Turning to the adjusted R-squared, we see that it increases steadily across the specifications in Columns (1) to (4). As the model specification in Column (4), including school fixed effects, seems to explain most of the variation in the data, further regressions use this specification as the base for comparisons.
Column (5) shows results when running the regression including school fixed effects and controls for gender and birth month. The estimate seems to be robust to the inclusion of birth month, which is consistent with the knowledge we have about birth month being evenly distributed among girls and boys.
Columns (6) and (7) provide robustness checks of the gender gap estimate in Column (5) by assigning the missing values (352 observations in Pretest1 and 336 observations in Pretest2) a synthetic score. In Column (6) this score corresponds to the minimum score received in the relevant test, and in Column (7) to the maximum score received. A confidence interval robust to the math level of those with missing scores can be derived by comparing the two 95 percent confidence intervals in Columns (6) and (7).
The gender gap estimate in Pretest1 (β1) changes marginally, by -0.005 (from -0.074 to -0.079), when assuming that the counterfactual outcomes of the pupils missing test scores correspond to the minimum observed test score, and by 0.003 (to -0.071) when assuming these outcomes correspond to the maximum observed test score. Taking the union of the two 95% confidence intervals, we find that the estimate, whatever value is assigned, is likely to be within the interval [-0.132, -0.023]. This interval comprises the estimate of the gender gap (-0.074) in Column (5). Thus, the gender gap seems to be robust to the value of the missing scores, which again is consistent with what we know about missing scores being evenly distributed among girls and boys. A similar result is observed for the Pretest2 estimate.
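The bounding exercise in Columns (6) and (7) can be sketched as follows. This is a simplified illustration with hypothetical data and hypothetical function names (`gap`, `gap_bounds`): it uses a raw difference in means instead of the regression with school fixed effects and birth month, and it returns only the two point estimates rather than confidence intervals.

```python
from statistics import mean


def gap(rows):
    """rows: list of (girl, score). Returns girls' mean minus boys' mean."""
    girls = [s for g, s in rows if g == 1]
    boys = [s for g, s in rows if g == 0]
    return mean(girls) - mean(boys)


def gap_bounds(rows):
    """Re-estimate the gap with missing scores (None) set to the observed
    minimum and then to the observed maximum."""
    observed = [s for _, s in rows if s is not None]
    lowest, highest = min(observed), max(observed)

    def fill(value):
        return [(g, value if s is None else s) for g, s in rows]

    return gap(fill(lowest)), gap(fill(highest))


# Hypothetical standardized scores; None marks a missing test score.
pupils = [(1, 0.0), (1, None), (0, 1.0), (0, -1.0)]
print(gap_bounds(pupils))
```

If the estimate stays on the same side of zero under both extreme imputations, the sign of the gap does not depend on how the pupils with missing scores would have performed.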
Table 3-2: Gender Gap Results
                        (1)         (2)         (3)         (4)         (5)         (6)                (7)
Pretest1
β1                      -0.086***   -0.088***   -0.086***   -0.076***   -0.074***   -0.079***          -0.071***
                        (0.024)     (0.024)     (0.024)     (0.023)     (0.022)     (0.027)            (0.024)
95% CI                                                                              [-0.132, -0.026]   [-0.118, -0.023]
Adj. R-squared          0.002       0.024       0.038       0.142       0.176       0.162              0.154
N                       7069        7069        7069        7069        7069        7421               7421
Pretest2
β2                      -0.144***   -0.143***   -0.141***   -0.135***   -0.135***   -0.112***          -0.134***
                        (0.027)     (0.028)     (0.027)     (0.027)     (0.026)     (0.027)            (0.028)
95% CI                                                                              [-0.165, -0.059]   [-0.189, -0.079]
Adj. R-squared          0.005       0.015       0.039       0.136       0.173       0.207              0.173
N                       6348        6348        6348        6348        6348        6684               6684
Municipality FE         No          Yes         No          No          No          No                 No
Strata FE               No          No          Yes         No          No          No                 No
School FE               No          No          No          Yes         Yes         Yes                Yes
Birth month FE          No          No          No          No          Yes         Yes                Yes
Assigning synthetic …
value to missing obs    No          No          No          No          Yes         Min                Max
Notes. Both tests were taken in second grade (pre-treatment): Pretest1 by the 2009-cohort and Pretest2 by the 2011-cohort. Standard errors are clustered at the school level. Significance at the 1%, 5% and 10% levels is indicated by ***, ** and * respectively.
Is the gender gap more present in some subgroups than others?
The regression specification above depicts an average disadvantage of being a girl. Section 3.2 investigates whether this disadvantage varies with birth month or math level. Fixed school effects are included in all regressions and standard errors are clustered at the school level.
3.2.1 Birth Month
In the data (see Figure 3-1) there is a downward linear trend in test score across birth month.
We see that pupils born late in the year, on average, score lower compared to pupils born early. The question posed in this subsection is whether the disadvantage of being born late in the calendar year is equal for both genders, or whether observed gender gaps vary among the youngest and oldest pupils. Table 3-3 compares the mean pretest scores among boys (Column 1) and girls (Column 2) born early (Panel 2) and late (Panel 3) in the calendar year. The