
The Cognitive Benefits of Learning Computer Programming:

A Meta-Analysis of Transfer Effects

Ronny Scherer University of Oslo, Norway

Fazilat Siddiq

Nordic Institute for Studies in Innovation, Research and Education (NIFU), Norway Bárbara Sánchez Viveros

Humboldt-Universität zu Berlin, Germany

Author Note

Ronny Scherer, Department of Teacher Education and School Research (ILS) &

Centre for Educational Measurement at the University of Oslo (CEMO), Faculty of Educational Sciences, University of Oslo; Fazilat Siddiq, Nordic Institute for Studies in Innovation, Research and Education (NIFU), Oslo; Bárbara Sánchez Viveros, Faculty of Life Sciences, Humboldt-Universität zu Berlin, Germany.

Correspondence concerning this article should be addressed to Ronny Scherer, Faculty of Educational Sciences, Department of Teacher Education and School Research (ILS), Postbox 1099 Blindern, 0317 Oslo, Norway. E-Mail: ronny.scherer@cemo.uio.no

© 2018, American Psychological Association. This paper is not the copy of record and may not exactly replicate the final, authoritative version of the article. Please do not copy or cite without authors’ permission. The final article will be available, upon publication, via its DOI: 10.1037/edu0000314


Abstract

Does computer programming teach students how to think? Learning to program computers has gained considerable popularity, and educational systems around the world are encouraging students in schools and even children in kindergartens to engage in programming activities. This popularity is based on the claim that learning computer programming improves cognitive skills, including creativity, reasoning, and mathematical skills. In this meta-analysis, we tested this claim by performing a three-level, random-effects meta-analysis on a sample of 105 studies and 539 effect sizes. We found evidence for a moderate, overall transfer effect (g = 0.49, 95 % CI = [0.37, 0.61]), and identified a strong effect for near transfer (g = 0.75, 95 % CI = [0.39, 1.11]) and a moderate effect for far transfer (g = 0.47, 95 % CI = [0.35, 0.59]). Positive transfer existed to situations that required creative thinking, mathematical skills, and metacognition, followed by spatial skills and reasoning. School achievement and literacy, however, benefited the least from learning to program. Moderator analyses revealed significantly larger transfer effects for studies with untreated control groups than for those with treated (active) control groups. Moreover, published studies exhibited larger effects than grey literature. These findings shed light on the cognitive benefits associated with learning computer programming and contribute to the current debate surrounding the conceptualization of computer programming as a form of problem solving.

Keywords: Cognitive skills; computational thinking; computer programming; three-level meta-analysis; transfer of skills

Educational Impact and Implications Statement: In this meta-analysis, we tested the claim that learning how to program a computer improves cognitive skills even beyond programming. The results suggested that students who learned computer programming outperformed those who did not in programming skills and other cognitive skills, such as creative thinking, mathematical skills, metacognition, and reasoning. Learning computer programming has certain cognitive benefits for other domains.


The Cognitive Benefits of Learning Computer Programming:

A Meta-Analysis of Transfer Effects

Computer programming is an activity similar to solving problems in other domains: It requires skills, such as decomposing, abstracting, iterating, and generalizing, that are also required in mathematics and science—in fact, these skills are critical to human cognition (Román-González, Pérez-González, & Jiménez-Fernández, 2017; Shute, Sun, & Asbell- Clarke, 2017). Acknowledging these commonalities between the skills required in programming and the skills required to solve problems in other domains, researchers and computer scientists have claimed that learning to program computers has certain cognitive benefits (Grover & Pea, 2013; Liao & Bright, 1991; Pea & Kurland, 1984). According to this hypothesis, intervention studies that are aimed at fostering programming skills should not only reveal direct training effects but also transfer effects to situations that require other cognitive skills. Yet, the current research abounds in conflicting findings, as there is evidence both for and against the transferability of learning computer programming (Scherer, 2016), and some researchers claimed that far transfer does not exist (Denning, 2017). This

observation is by no means unique to programming: Sala and Gobet (2017a) reviewed several meta-analyses in the domains of chess instruction, music education, and working memory training and concluded that so-called ‘far transfer’ (i.e., the transfer of knowledge or skills between two dissimilar contexts) may not exist. However, does this hold for learning to program computers as well? With the current meta-analysis, we investigated this question by testing the hypothesis that programming interventions have certain cognitive benefits. In particular, we examined (a) the overall transfer effect of learning computer programming, (b) the near transfer effects to situations that require programming skills and the far transfer effects to situations that require skills outside of programming, and (c) the differential far effects computer programming interventions may have in situations that require different


types of cognitive skills. In this meta-analysis, programming skills were defined as the skills to create, modify, and evaluate code and the knowledge about programming concepts and procedures (e.g., objects, algorithms). These two dimensions are referred to as

‘Computational concepts’ and ‘Computational practices’ in the existing frameworks of computational thinking (Lye & Koh, 2014).

The Transfer of Skills

The question whether acquired knowledge and skills can be transferred from one context or problem to another is key to cognitive and educational psychology. In fact, the transfer of learning lies at the very heart of education, as it taps the flexible application of what has been learned (Barnett & Ceci, 2002). Perkins and Salomon (1992) understood transfer as a situation in which learning in one context impacts learning and performance in other, perhaps new contexts. Although researchers agreed on this definition (Bransford &

Schwartz, 1999), some questions remain: Which conditions foster successful transfer? What characterizes “other” or “new” contexts?

In their seminal article, Woodworth and Thorndike (1901) considered improvements in basic cognitive skills and the transfer to situations that require other cognitive skills. Their main proposal for explaining successful transfer is referred to as the ‘Theory of Common Elements’—a theory hypothesizing that the degree of successful transfer depends on the elements two different contexts or problem situations share. The authors argued that the transfer of skills between situations that have less in common (i.e., require only few shared skills or knowledge elements) occurs less often than transfer between closely related situations (see also Bray, 1928). Barnett and Ceci (2002) pointed out that the Theory of Common Elements has led to the distinction between near and far transfer. In this context, near transfer refers to successful transfer between similar contexts, that is, contexts that are closely related and require the performance of similar skills and strategies; far transfer refers


to successful transfer between dissimilar contexts, that is, contexts that are inherently different and may require different skills or strategies (Perkins & Salomon, 1992). In essence, the transfer of skills depends on the similarity and overlap between the contexts and problems in which the skills were acquired and those presented later on (Schunk, 2012). The issue with these definitions lies in the concepts of similarity and difference, both of which are features of the specific problem situations (Bransford et al., 2005). Greeno et al. (1998) emphasized that the transfer of skills to other contexts is highly situation-specific, that is, it largely depends on the situations in which the skills have been acquired previously—in other words, transfer is situated in experience and is influenced by the participation in previous activities (see also Lobato, 2006). Bransford and Schwartz (1999) pointed out that prior knowledge forms an additional prerequisite for successful transfer, in particular the knowledge about the structure of a problem situation, the variables involved, and solution strategies (e.g., Bassok, 1990; Chen &

Klahr, 2008; Cooper & Sweller, 1987). It therefore seems that the acquisition of schemata to solve problems may foster the transfer of learning between problem situations.

Although the existence of far transfer was often denied (Barnett & Ceci, 2002;

Denning, 2017), several studies provided evidence for far transfer, yet to varying degrees (Bransford & Schwartz, 1999). In a recent review paper, Sala and Gobet (2017a) questioned the existence of far transfer and referred to a series of meta-analyses in the domains of chess instruction and music education. Indeed, the meta-analyses the authors referred to provided only limited evidence for far transfer effects—successful transfer could only be found in situations that required skills similar to those trained in the interventions. Melby-Lervåg, Redick, and Hulme (2016) supported this finding by their meta-analysis of working memory training, and so did Sala, Tatlidil, and Gobet (2018) in their meta-analysis of video gaming.

These findings suggest that far transfer may be differentially effective for improving cognitive skills. Overall, our brief review of the existing literature of transfer revealed that (a) transfer is


more likely to occur between closely related contexts or problem situations; (b) the success of transfer depends on schematic knowledge; (c) far transfer may differ across contexts.

The Transfer of Programming Skills

Programming skills are considered critical to the development of “computational thinking”—a concept that “involves solving problems, designing systems, and understanding human behavior, by drawing on the concepts fundamental to computer science” (Wing, 2006, p. 33). In their seminal review, Shute et al. (2017) named five cognitive processes involved in computational thinking: Problem reformulation, recursion, problem decomposition,

abstraction, and systematic testing. These skills defined the concept as a form of problem solving (Lye & Koh, 2014). Despite the close relationship between programming skills and computational thinking, the two concepts are not identical—the latter also entails taking computational perspectives (i.e., students’ understanding of themselves and their interaction with others and with technology; Shute et al., 2017) as an element of computational

participation (Kafai & Burke, 2014). Nevertheless, the processes involved in programming require problem-solving skills, such as decomposing problems, applying algorithms, abstracting, and automatizing, and ultimately aid the acquisition of computational thinking skills (Yadav, Good, Voogt, & Fisser, 2017). Programming may therefore be considered a way of teaching and learning computational thinking (Flórez et al., 2017), a way of assessing computational thinking (Grover & Pea, 2013), and a way of exposing students to

computational thinking by creating computational artefacts, such as source code or computer programs (Lye & Koh, 2014). Barr and Stephenson (2011), as they compared core

computational thinking with the demands of solving problems in STEM domains, concluded that programming skills, computational thinking, and problem solving are intertwined.

In this meta-analysis, we define programming skills as the skills to create, modify, and evaluate code and the conceptual and procedural knowledge needed to apply these skills, for


instance, in order to solve problems—a definition close to that of computational thinking.

This definition includes two key dimensions of computational thinking: computational concepts (i.e., syntactic, semantic, and schematic knowledge) and computational practices (strategic knowledge and problem solving; e.g., Lye & Koh, 2014). Hence, the research on the transfer of programming skills we review here also targets aspects of the transfer of

computational thinking skills.

As learning computer programming engages students in problem solving activities, transfer effects on students’ performance in situations that require problem solving seem likely (Shute et al., 2017). Although Pea and Kurland (1984) doubted the existence of such effects, they still argued that some effects on thinking skills that are close to programming may exist. Part of this argument is the observation that problem solving and programming skills share certain subskills. In a conceptual review of problem solving, creative thinking, and programming skills, Scherer (2016) listed several subskills that are required to successfully solve tasks in these three domains. The author concluded that these commonalities provide sufficient ground to expect a positive transfer between them. Clements (1995) established that creativity plays a role in programming, and Grover and Pea (2013) supported this perspective.

Reviewing further domains and contexts, Clements (1986a) claimed that programming skills can even be assigned to the cognitive dimensions of intelligence frameworks—hence, a transfer of programming skills to intelligence tasks seems likely. The author further suggested considering metacognitive skills as integral parts of programming. Finally, Shute et al. (2017) identified problem solving and modeling as two commonalities between programming and mathematical skills, arguing for the existence of transfer effects. The list of cognitive skills that overlap with programming could be extended even further (for a detailed overview, please refer to Scherer, 2016). However, the selection presented here already points into one


direction: programming skills and other cognitive skills share important subskills, and transfer effects of learning computer programming may therefore exist.

A recently published, cross-sectional study of computational thinking provided some evidence supporting this reasoning: Román-González et al. (2017) developed a performance test of computational thinking and administered it to 1,251 Spanish students in grade levels 5 to 10. The results showed that computational thinking was significantly and positively related to other cognitive skills, including spatial skills (r = .44), reasoning skills (r = .44), and problem-solving skills (r = .67). Drawing from the Cattell-Horn-Carroll [CHC] theory (McGrew, 2009), Román-González et al. (2017) concluded that computational thinking, operationally defined and measured as what we consider programming skills in this meta- analysis, represents a form of problem solving. Although these findings suggest that

programming skills overlap with other cognitive skills, they do not provide evidence for the transferability of programming skills, due to the lack of experimental manipulation.

Previous Meta-Analyses on the Transfer of Programming Skills

Two meta-analyses addressed the transferability of programming skills, both of which resulted in positive and significant effect sizes. The first meta-analysis synthesized 432 effect sizes from 65 studies that presented students with programming activities and administered assessments of cognitive skills (Liao & Bright, 1991). Using a random-effects model, Liao and Bright obtained an overall effect size of d = 0.41 (p < .01) and thus supported the claim that programming skills can be transferred. Liao and Bright further found that this overall transfer effect size was moderated by the type of publication (with largest effects for

published articles in the database ERIC), grade level (with largest effects for college and K-3 students), the programming language used during the intervention (with largest effects for Logo and BASIC), and the duration of the intervention (with largest effects for short-term


interventions). Neither the design of the primary studies nor the year of publication explained variation in the overall effect size.

Although this study provided evidence for the transferability of computer programming based on a large sample of effect sizes, we believe that it has two shortcomings: First, the authors reported an overall effect size for transfer without differentiating between cognitive skills. Existing meta-analyses that examined transfer effects in other domains, however, found that transfer effects vary considerably across cognitive skills (Melby-Lervåg et al., 2016; Sala & Gobet, 2016). In other words, transfer intervention studies may be particularly effective in situations that require cognitive skills close to the trained skills (Sala & Gobet, 2017a). Second, Liao and Bright (1991) included a dataset that comprised 432 effect sizes from 65 studies—a dataset that clearly had a nested structure (i.e., effect sizes were nested in studies). Considering the recent methodological advancements of meta-analyses (M. W.-L. Cheung, 2014), a three-level random-effects modeling approach would have been more appropriate than the random-effects model Liao and Bright specified, as it quantifies both within- and between-study variation.

In the second meta-analysis, Liao (2000) updated the former meta-analysis and included 22 interventions and 86 effect sizes that were published between 1989 and 1999.

Aggregating these effects resulted in a large overall transfer effect of d = 0.76 (p < .05). In contrast to the original meta-analysis, pre-experimental study designs were included (e.g., one-group pretest-posttest designs). Considering that these designs provided the smallest transfer effects (d = 0.45), compared with all other designs (d = 0.56–2.12), their inclusion may have biased the overall effect. Moreover, the reported effects must be interpreted with caution, given the small number of studies and effect sizes. In contrast to Liao and Bright (1991), Liao (2000) tested whether transfer effects differed across cognitive skills. Indeed, the

strongest effects occurred for the near transfer of skills (d = 2.48), whereas the smallest effects


occurred for the far transfer to creative thinking situations (d = -0.13). Other skills such as critical thinking, problem solving, metacognition, and spatial skills benefited from learning computer programming moderately (d = 0.37–0.58).

Uttal et al. (2013) included seven studies that administered programming interventions to enhance students’ spatial skills. Although the authors did not report an overall effect size for this selection of studies, six out of the seven primary effect sizes of these interventions were positive and significant (g = 0.12–0.92, p < .05). This finding indicates that positive transfer of learning computer programming to situations that require the application of spatial skills may exist.

In their review of video gaming, Sala et al. (2018) claimed that “teaching the computer language Logo to improve pupils’ thinking skills has produced unsatisfactory results” (p. 113) and referred to two intervention studies. Although this claim was in line with the authors’

main argument, we believe that it stands on shaky ground, given the plethora of Logo

intervention studies that showed positive far transfer effects (e.g., Clements & Sarama, 1997;

Lye & Koh, 2014; Scherer, 2016; Shute et al., 2017). Nevertheless, we agree with their position that the existing research on far transfer in this area abounds in mixed results—some studies found significant effects, while others failed to provide evidence for far transfer (Palumbo, 1990; Salomon & Perkins, 1987). This controversy motivated the present meta- analysis. Overall, the previous meta-analyses of the transferability of computer programming suggested possible, positive transfer effects. However, we identified several methodological and substantive issues which primarily referred to the specification of meta-analytic models, the differentiation of cognitive skills, and the treatments of control groups.

The Present Meta-Analysis

In this meta-analysis, we synthesize the evidence on the transferability of learning computer programming to situations that require certain cognitive skills. Along with


providing an overall transfer effect, we examine the variation and consistency of effects across studies, types of transfer, and cognitive skills. We believe that the rapid advancements in technology and the development of visual programming languages (e.g., Scratch) next to text-based languages (e.g., Java) necessitate an update of the existing research. Besides, acquiring computational thinking skills through programming has received considerable attention lately: Programming is introduced into school curricula in several educational

systems—this development is largely based on the claim that learning computer programming has certain cognitive benefits in other domains and contexts (Grover & Pea, 2013; Lye &

Koh, 2014). We provide some answers to the question whether learning to program helps to improve cognitive skills and extend the existing research literature on the transfer of skills, which recently focused on chess and music instruction, working memory training, and video gaming, by testing the claims of transfer effects for the domain of computer programming.

More specifically, we focus on the following research questions:

1. Overall transfer effects: (a) Does computer programming training improve performance on cognitive skills tasks, independent of the type of transfer or cognitive skill? (b) To what extent are these effects moderated by study, sample, and measurement characteristics?

2. Near transfer effects: (a) Does computer programming training improve

performance on assessments of computer programming skills? (b) To what extent are these effects moderated by study, sample, and measurement characteristics?

3. Overall far transfer effects: (a) Does computer programming training improve performance on tasks assessing cognitive skills other than computer programming?

(b) To what extent are these effects moderated by study, sample, and measurement characteristics?


4. Far transfer effects by cognitive skills: (a) Does computer programming training improve performance on tasks assessing reasoning, creative thinking,

metacognition, spatial skills, mathematical skills, literacy, and school achievement in domains other than mathematical skills and literacy? (b) To what extent do these far transfer effects differ across the types of cognitive skills and subskills?

First, we examine the overall transfer effects of computer programming training (Research Question 1a). These effects include benefits for programming skills and skills outside of the programming domain. The main purposes of providing answers to this research question are (a) to set a reference for the overall cognitive benefits, and (b) to compare the findings obtained from our meta-analysis with those reported by Liao and Bright (1991), who treated

“cognitive skills”, although measured by several skills, as a univariate outcome. Although the overall transfer effect already provides insights into the cognitive benefits of learning

computer programming, we believe that a further differentiation is needed into the skills that are required in the new situations and contexts. Indeed, the findings of existing meta-analyses examining transfer effects of cognitive skills trainings warranted further differentiation either by the type of transfer or by the cognitive skills (e.g., Bediou et al., 2018; Melby-Lervåg et al., 2016; Sala & Gobet, 2017a).

We add possible moderators to explain variation in the reported effect sizes (Research Question 1b). The key premise for addressing this question is that effect sizes may vary within and between studies—moderating variables can therefore explain variation at the study level or the level of effect sizes. Possible moderators represent the study, sample, and

measurement characteristics, such as the statistical study design, types of control groups, educational level of study participants, programming tools, types of performance tests, and the subskills assessed by performance tests.


Second, we quantify the immediate, near transfer effects of learning computer programming to situations and tasks that require programming skills and explain possible variation within or between studies by the above-mentioned moderators (Research Questions 2a & 2b). Third, we examine the overall far transfer effects and possible moderators thereof (Research Questions 3a & 3b). This study of the overall far transfer is based on measures of skills other than programming and does not differentiate between the different types of

cognitive skills. Finally, we differentiate between different types of cognitive skills to provide more information on the far transfer effects (Research Questions 4a & 4b). These skills represent a range of domain-general and domain-specific skills—skills that show a relative distance to computer programming. To further substantiate the skill- and situation-specificity of far transfer effects, we compare the resultant effect sizes across cognitive skills. This comparison also reveals whether certain cognitive skills benefit from computer

programming training more than others.

Method

Literature Search and Initial Screening

To identify the primary literature relevant to this meta-analysis, we performed searches in literature databases, academic journals, reference lists of existing reviews and meta-analyses, publication lists of scholars, and the informal academic platform

ResearchGate. The database search included ACM Digital Library, IEEE Xplore Digital Library, ERIC, PsycINFO, Learn Tech Library, ProQuest Dissertations and Theses Database, and Google Scholar (the first 100 publications as of January 31, 2017), and focused on publications that had been published between January 1, 1965 and January 31, 2017. The databases ACM Digital Library, IEEE Xplore Digital Library, Learn Tech Library,

ResearchGate, and Google Scholar contained both publications in peer-reviewed academic journals and grey literature. We referred to Adams et al.’s (2017) definition of “grey


literature”, which included dissertations, conference proceedings, working papers, book chapters, technical reports, and other references that have not been published in scholarly journals after peer-review (see also Schmucker et al., 2017).

Whenever Boolean search operators were possible (e.g., ERIC, PsycINFO), the following search terms were used: (Programming OR coding OR code OR Scratch* OR Logo* OR Mindstorm* OR computing OR computational thinking) AND (teach* OR learn*

OR educat* OR student* OR intervention OR training) AND Computer* AND (compar* OR control group* or experimental group* OR treatment). These terms were comprised of three core elements: the concepts of programming and relevant programming tools (e.g., Scratch and Logo), the context of teaching, training, and interventions, and the design of relevant studies (i.e., studies with treatment and control groups). Whenever needed, we adapted them to the search criteria set by the databases (for details, please refer to the Supplementary Material A2). All searches were limited to titles, abstracts, and keywords.

Besides the search in databases, we also hand-searched for publications in relevant academic journals, and reference and citation lists (whenever possible, via the ISI Web of Knowledge) of existing reviews and meta-analyses on the following topics: teaching and learning computer programming, the concept of computational thinking, and the effects of training spatial skills and creativity (see Supplementary Material A2). From the existing meta-analyses (Liao, 2000; Liao & Bright, 1991), however, we could only retrieve the studies and effect sizes reported there to a limited extent, because (a) several publications were no longer available in a readable format given their publication year (before 2000)—we contacted twenty authors directly via email or via the messaging tool implemented in ResearchGate;

five authors responded to our queries and sent us their publications; (b) inclusion and exclusion criteria of the transfer studies differed between the two meta-analyses; (c) pre- experimental designs were included in these meta-analyses. Finally, we reviewed the formal


and informal publication lists of scholars in the field (Bright, Clements, Kazakoff, Liao, Pardamean, Pea, Grover, and Resnick) via Google Scholar and ResearchGate. In August 2017, we received a notification about two additional, empirical studies that had been published that month (Erol & Kurt, 2017; Psycharis & Kallia, 2017)—these studies entered our list of possibly relevant publications. Despite our efforts to retrieve unpublished studies (e.g., in the form of conference presentations or informal communications) from authors and associations in the field, we did not receive any unpublished material.

Overall, our literature search resulted in 5,193 publications (see Figure 1). After removing duplicates and screening titles for content fit (i.e., the studies must concern

computer programming), 708 publications were submitted to an initial screening of abstracts.

We read each abstract and examined whether the publication presented a training of computer programming skills and was of quantitative nature; conceptual papers that presented computer programming tools and theoretical reviews without any quantitative evaluation were

discarded. This initial screening addressed the criteria of relevance, quantitative data sufficiency, and the presence of an intervention, and resulted in 440 eligible abstracts. The results of both the literature search and the initial screening are shown in Figure 1.

Screening and Eligibility Criteria

The extracted publications were further screened based on inclusion and exclusion criteria (Figure 1). As the current meta-analysis focuses on the transfer effects of learning to program as results of an intervention—including near transfer effects (i.e., effects on

performance in programming or computational thinking) and far transfer effects (i.e., effects on performance in related cognitive constructs, such as reasoning skills, creative thinking, spatial skills, or school achievement)—studies with an experimental or quasi-experimental design that reported pretest and posttest performance or posttest performance only were included. In line with existing meta-analyses on transfer effects in other domains (e.g., Melby-


Lervåg, Redick, & Hulme, 2016; Sala & Gobet, 2016), we excluded studies with pre-

experimental designs (e.g., single-group pretest-posttest designs without any control group).

Overall, studies were included in our meta-analysis if they met the following criteria:

1. Accessibility: Full texts or secondary resources that describe the study in sufficient detail must have been available.

2. Study design: The study included a training of computer programming skills with an experimental or a quasi-experimental design and at least one control group (treated or untreated); correlational, ex-post facto studies, or pre-experimental designs (e.g., one- group pretest-posttest designs) were excluded.

3. Transfer effects: The effect of learning computer programming could be isolated;

studies reporting the effects of two or more alternative programming trainings without any non-programming condition were excluded.

4. Reporting of effect sizes: The study reported data that were sufficient to calculate the effect sizes of learning computer programming.

5. Grade levels: Control and treatment group(s) had to include students of the same grade level or age group to achieve sample comparability.

6. Performance orientation: The study had to report on at least one cognitive,

performance-based outcome measure, such as measures of computer programming, reasoning, creative thinking, critical thinking, spatial skills, school achievement, or similar; studies reporting only behavioral (e.g., number and sequence of actions, response times) or self-report measures (i.e., measures of competence beliefs, motivation, or volition) were excluded.

7. Educational context: The study samples comprised children or students enrolled in pre-K to 12, and tertiary education; studies conducted outside of educational settings


were excluded to avoid further sample heterogeneity (a similar reasoning can be found in Naragon-Gainey, McMahon, & Chacko, 2017).

8. Non-clinical sample: Studies involving non-clinical samples were included; studies involving samples of students with specific learning disabilities or clinical conditions were excluded.

9. Language of reporting: Study results had to be reported in English; studies reporting results in other languages without any translation into English were excluded.

In total, 20 % of the studies entering the fine screening (i.e., the application of

inclusion and exclusion criteria) were double-screened by the first and the second author. The overall agreement was high, weighted κ = .97. Disagreement was resolved in a discussion about whether and why specific inclusion and exclusion criteria might or might not apply until consensus was achieved. Applying the inclusion and exclusion criteria resulted in m = 105 studies providing k = 539 effect sizes, as shown in Figure 1 (for more details, please refer to the Supplementary Material A2).

Effect Size Measures

To examine the transfer effects of learning to program on cognitive skills, we extracted the relevant statistics from the eligible studies and transformed them into effect sizes. The resultant effect sizes indicated the degree to which gains in cognitive abilities existed in the treatment group that received a programming intervention, relative to a control group that did not. Hedges’ g was reported as an effect size, because it accounted for possible bias due to differences in sample sizes (Borenstein, Hedges, Higgins, & Rothstein, 2009; Uttal et al., 2013). We calculated Hedges’ g from pretest-posttest experimental or quasi-

experimental and posttest-only designs using the available statistics (e.g., mean scores, standard deviations, Cohen’s d, F-values, and t-values). If studies included multiple control groups, we included the transfer effects obtained from all possible treatment-control group


comparisons. Supplementary Material A2 details these calculations, and Supplementary Material A1 documents the resultant effect sizes. Given that only 43.2 % of the reliability coefficients of the cognitive skills measures were available and considering the current disagreement about the impact of unreliability corrections on effect size estimations (M. W.- L. Cheung, 2015), we did not correct the reported effect sizes for the unreliability of the outcome measures.
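As an illustration (not the authors’ actual extraction script), Hedges’ g and its sampling variance can be computed from posttest summary statistics with the R package ‘metafor’; the numbers below are purely hypothetical:

library(metafor)

# Illustrative posttest summary statistics for two hypothetical comparisons:
# m1i/sd1i/n1i = treatment group, m2i/sd2i/n2i = control group.
dat <- data.frame(
  m1i = c(12.4, 8.7), sd1i = c(3.1, 2.9), n1i = c(30, 28),
  m2i = c(10.9, 8.1), sd2i = c(3.4, 3.0), n2i = c(29, 31)
)

# measure = "SMD" returns the standardized mean difference with the
# small-sample correction, i.e., Hedges' g, and its sampling variance.
dat <- escalc(measure = "SMD",
              m1i = m1i, sd1i = sd1i, n1i = n1i,
              m2i = m2i, sd2i = sd2i, n2i = n2i, data = dat)
dat$yi  # Hedges' g
dat$vi  # sampling variances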

Coding of Studies

To understand the role of contextual variables for the transfer effects, we extracted information about the study design, the content, purpose, and language of programming, the types of outcome variables, the educational level of participants, the length of the

intervention, the publication year and status. These variables were identified as possible moderators explaining variation in effect sizes in previous meta-analyses (Liao, 2000; Liao &

Bright, 1991), and defined the contexts in which programming interventions may or may not succeed (Grover & Pea, 2013; Shute et al., 2017). Considering that transfer effects may vary within and between studies, possible moderators may operate at the study level, the level of effect sizes (or measures), or both levels. Whereas most of the variables listed below served as study-level characteristics (e.g., average age of students, randomization and matching of experimental groups), some of them varied within studies and were thus considered effect-size-level predictors (e.g., statistical study design, treatment of control groups, cognitive skills). To ensure that the coding scheme was reliable, 25 % of the eligible studies were coded independently by the first and the third author. The overall agreement was 94 %; conflicts were resolved during a discussion session until consensus was reached. Supplementary Material A1 documents the coded variables. Categorical moderator variables with more than one category were dummy-coded.
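For illustration (not the authors’ coding script), such dummy coding can be done in R via factor contrasts; the moderator name edu_level below is hypothetical:

# Hypothetical categorical moderator with three categories; the first
# level serves as the reference category in the dummy coding.
dat$edu_level <- factor(dat$edu_level,
                        levels = c("primary", "secondary", "tertiary"))

# model.matrix() expands the factor into 0/1 indicator (dummy) columns.
dummies <- model.matrix(~ edu_level, data = dat)
head(dummies)

# In 'metafor', the same expansion happens automatically when the factor
# is passed to the mods argument of rma.mv(), e.g., mods = ~ edu_level.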


Sample characteristics. To describe the samples involved in the studies, we extracted information about participants’ average age (in years), the educational level the intervention was targeted at (i.e., pre-kindergarten, kindergarten, primary school [1-6], secondary school [7-13], or university/college), and the proportion of female participants in the sample.

Randomization and matching. To supplement the list of study characteristics, we coded whether individuals or pairs were randomly assigned to the treatment and control conditions. Studies assigning entire classrooms (as a cluster) to the conditions were coded as

‘non-random’. If authors failed to communicate the degree of randomization, their study was coded as ‘non-random’, even though authors labelled their design as ‘experimental’. In addition, we coded the matching of the experimental groups with respect to relevant variables (e.g., basic cognitive abilities, computer experience, or sample characteristics including gender, age, and grade level) using the categories ‘matched’ or ‘not matched’.

Type of control group. We coded the type of treatment of the control groups as

‘treated’ or ‘untreated’. Control groups were coded as ‘treated’ (or active) if they received an alternative training that did not involve programming activities yet was aimed at training a certain cognitive skill. For example, Kim, Chung, and Yu (2013) examined the effects of learning programming with the language Scratch on creative thinking. Whereas the treatment group engaged in the programming instruction, the control group followed regular instruction that was not targeted at improving creativity. For this study, we coded the control group as untreated. Hayes and Stewart (2016) examined the effects of learning Scratch programming on reasoning. Given that the control group engaged in an alternative training of reasoning skills, we coded it as treated.

Studies may contain multiple outcome variables and control groups that were treated to support only one of these outcomes (i.e., they were treated considering one outcome

variable, yet untreated considering another outcome variable). The treatment of control groups


is thus a variable measured at the level of effect sizes. At the same time, we tested whether this variable may also explain variation between studies and coded the treatment of control group(s) at the study level as “treated”, “untreated”, or “mixed” as well. Hence, the type of control group(s) served as both an effect size-level and study-level variable.

Student collaboration. A recently published meta-analysis indicated that learning computer programming can be more effective in groups than learning it individually

(Umapathy & Ritzhaupt, 2017). Moreover, the transfer of problem-solving strategies may be more effective when students work in pairs (e.g., Uribe, Klein, & Sullivan, 2003). We therefore coded whether students collaborated during the intervention as another possible moderator (0 = individual work, 1 = collaborative work during the intervention).

Study context. We coded the context in which programming interventions were administered, either as embedded in regular lessons or as extracurricular activities.

Programming language. The programming languages (tools) used during the interventions were reported and categorized as ‘text-based programming languages’ (e.g., Basic, C, and Java) and ‘visual programming languages’ (e.g., Alice, Logo, and Scratch).

Intervention length. The length of interventions was extracted and reported in hours.

In case authors provided the number of school lessons, we assumed an average lesson to last about 45 minutes. This assumption may not reflect the true intervention length but provided an approximation of it in most educational systems. The true time distribution may therefore result in slightly different moderation effects. A lack of reporting the intervention length resulted in missing values.

Cognitive skills. Cognitive skills measures were grouped according to the constructs they measured. These constructs comprised broad and narrow categories both of which are shown in Table 1. Overall, the outcome measures covered programming skills, skills that cannot be assigned to a single domain (i.e., creative thinking, reasoning, spatial skills, and


metacognition), and domain-specific skills (i.e., mathematical skills, literacy, and school achievement in subjects other than mathematics and literacy). Specifically, creative thinking comprised the skills needed to exhibit creative behavior, including originality, fluency, flexibility, and elaboration (Hennessy & Amabile, 2010). Creative thinking was mainly assessed by validated performance tests, such as the Torrance Test of Creative Thinking.

Reasoning skills included not only the skills needed to perform logical (formal) reasoning, which are considered elements of fluid intelligence and problem solving (McGrew, 2009), but also critical thinking skills (i.e., informal reasoning); attention, perception, and memory also fell into this category, due to their close relation to reasoning and

intelligence (Sternberg, 1982). Our classification of these subskills resonated with that

proposed by Sala and Gobet (2018) and Bediou et al. (2018) in their papers on transfer effects of video gaming. Their classification summarized intelligence, attention, memory, and

perception as general cognitive skills surrounding reasoning skills. By and large, reasoning skills were assessed by standardized tests of cognitive abilities and critical thinking (e.g., Cornell’s Critical Thinking Test; see Table 1). Spatial skills included the skills to memorize and understand spatial objects or processes, and to perform reasoning (Uttal et al., 2013).

These skills were mostly assessed by standardized tests of the understanding of two- or three- dimensional objects (Table 1). Metacognition referred to the processes underlying the

monitoring, adaptation, evaluation, and planning of thinking and behavior (Flavell, 1979), and was mostly assessed in conjunction with certain problem-solving tasks. Despite the

dominance of self-report measures of metacognition, the measures used in the selected studies were performance-based and comprised tasks that required, for instance, the representation of a problem, the evaluation of problem situations and strategies, the monitoring of students’

comprehension, and the integration of new information in the presence of old information (Table 1). Mathematical skills comprised mathematical problem solving, modeling,


achievement in general (e.g., measured by course grades), and conceptual knowledge (Voss, Wiley, & Carretero, 1995). Some primary studies used standardized mathematics tests, whereas others relied on self-developed assessments (Table 1). Literacy spanned several knowledge and skills components, including reading, writing, and listening skills. Most primary studies presented students with writing tasks and evaluated the written pieces against certain linguistic criteria; these tasks were often accompanied by reading comprehension tests (Table 1). Finally, school achievement was indicated by performance measures in domains other than mathematics and literacy. These measures assessed students’

achievement in Earth Sciences, Social Sciences, and Engineering, often measured by national or teacher-developed achievement tests in these subjects (Table 1). Although mathematical skills and literacy can also be considered aspects of school achievement, we did not assign them to this category in order to avoid introducing further heterogeneity which may have compromised the comparability of the effect sizes within this category. We further extracted information about how these constructs were measured. This information included the origin of the tests (i.e., standardized test, performance-based test developed by researchers or teachers), along with the available reliability coefficients.

Type of transfer. On the basis of the coding of cognitive skills at the level of effect sizes, we coded whether studies focused on near transfer only (i.e., only programming skills were measured), far transfer only (i.e., only skills outside of programming were measured), or near and far transfer at the same time (i.e., programming skills and skills outside of

programming were measured). This variable operated at the study level and allowed us to examine its possible moderating effects on the overall transfer effect.

Statistical study design. For the included studies, we coded the statistical design underlying the estimation of effect sizes and the overall study design. Several studies included multiple outcome measures, for which pretest and posttest scores were available to a different


extent. Generally, the statistical and the overall (implemented) study designs will agree; yet, in some cases, they may differ, as the following two examples illustrate: (1) Transfer studies with one outcome measure: Although authors reported a pretest-posttest control group design to examine the effects of learning computer programming on mathematical skills, pretest and posttest measured entirely different skills in mathematics, for instance, the skills to deal with variables (pretest) and conceptual understanding of geometric shapes (posttest). Given the non-equivalence of the pretest and posttest, the statistical design was best represented as a posttest-only control group design (Carlson & Schmidt, 1999). In such cases, effect sizes were extracted using the posttest means and standard deviations only. (2) Transfer studies with multiple outcome measures: Statistical study designs sometimes differed within studies, in particular when multiple outcomes were measured. For instance, some authors reported both pretest and posttest scores for one outcome variable, yet only posttest scores for another outcome variable. Whereas the former represents a pretest-posttest design, the latter

represents a posttest-only design. Hence, statistical study designs are primarily placed at the level of effect sizes. In addition to treating the study design as an effect size feature, we also coded the overall study design as a study feature using the categories “pretest-posttest design”, “posttest-only design”, or “mixed”. Comparable to the types of control groups, this variable served as both an effect size- and a study-level moderator.

Publication status. To examine the extent to which the reported effect sizes were moderated by the type of publication, we established publication status as another possible moderating variable. Publication status was thus coded as ‘grey’ or ‘published’. In the current meta-analytic sample, ‘unpublished’ studies did not exist.

Statistical Analyses

Several studies provided multiple effect sizes, either because they included multiple treatments or control groups, or they reported effects on multiple outcome variables. The


reported effect sizes were therefore dependent (Van den Noortgate, López-López, Marín-Martínez, & Sánchez-Meca, 2013). To account for these dependencies, M. W.-L. Cheung (2015) suggested using either multivariate meta-analysis, which models the covariance between multiple effect sizes derived from multiple outcome measures, or three-level random-effects modeling, which quantifies the degree of dependence by adding a variance component at a third level of clustering (Pastor & Lazowski, 2018). The latter is particularly suitable for situations in which the degree of dependence or covariance among multiple outcome variables is unknown (M. W.-L. Cheung, 2014), and returns unbiased estimates of fixed effects (Moeyaert et al., 2017). Considering this and the observation that very few primary studies reported covariances or correlations between multiple outcomes in the current meta-analysis, we decided to account for the clustering of effect sizes in studies by adopting a three-level random-effects modeling approach. For the ith effect size in the jth study, this approach decomposes the effect size y_ij into the average population effect β_0, the components u_(2)ij and u_(3)j with level-specific variances Var(u_(2)ij) = τ²_(2) and Var(u_(3)j) = τ²_(3), and residuals e_ij with the known sampling variance Var(e_ij) = v_ij (M. W.-L. Cheung, 2014):

y_ij = β_0 + u_(2)ij + u_(3)j + e_ij (1)

Model (1) represents a three-level random-effects model which is based on the standard assumptions of multilevel modeling (see M. W.-L. Cheung, 2014, for details). This model quantifies sampling variability (level 1), within-study variability (level 2), and between-study variability (level 3). To establish which variance components (i.e., within and between studies) are statistically significant, we compared four models against each other, using likelihood-ratio tests and information criteria: Model 1 represented a random-effects, three-level model with within- and between-study variances (see equation [1]). Model 2 was a random-effects model with only between-study variance, and Model 3 was a random-effects model assuming only variation between effect sizes. Finally, Model 4 represented a fixed-effects model without any variance component. To quantify the heterogeneity of effect sizes at both levels, we estimated the I² statistics based on the level-2 and level-3 variance estimates as follows (Cheung, 2015):

I²_(2) = 100 % ∙ τ²_(2) / (τ²_(2) + τ²_(3) + ṽ) and I²_(3) = 100 % ∙ τ²_(3) / (τ²_(2) + τ²_(3) + ṽ) (3)

In equation (3), ṽ represents the typical within-study sampling variance proposed by Higgins and Thompson (2002).

If statistically significant variation within or between studies exists, the random-effects Models 1-3 can be extended to mixed-effects models by introducing covariates (i.e., possible moderator variables) at the level of effect sizes and studies. Under the standard assumptions of three-level regression, the mixed-effects model with level-2 and level-3 variances and a covariate x_ij at the level of effect sizes is:

y_ij = β_0 + β_1·x_ij + u_(2)ij + u_(3)j + e_ij (2)

The variance explained by a covariate at the level of effect sizes is estimated by the reduction of level-2 variance when comparing models (1) and (2). We specified all models in the R packages ‘metafor’ (Viechtbauer, 2017) and ‘metaSEM’ (M. W.-L. Cheung, 2018) using restricted maximum likelihood estimation. Supplementary Material A3 contains the R sample code.
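Because the authors’ R code is provided in Supplementary Material A3 and is not reproduced here, the following is a minimal sketch of how such models can be specified in ‘metafor’; the data frame dat and the column names yi, vi, study_id, es_id, and control_treated are illustrative assumptions, not the authors’ variable names:

library(metafor)

# dat: one row per effect size, with Hedges' g (yi), its sampling variance
# (vi), a study identifier (study_id), and an effect-size identifier (es_id).

# Model 1: three-level random-effects model (equation 1), REML estimation;
# sigma^2.1 = between-study variance, sigma^2.2 = within-study variance.
m1 <- rma.mv(yi, vi, random = ~ 1 | study_id / es_id,
             data = dat, method = "REML")
summary(m1)

# Model 3: between-study variance constrained to zero (a conventional
# random-effects model that ignores the clustering of effect sizes).
m3 <- rma.mv(yi, vi, random = ~ 1 | study_id / es_id,
             sigma2 = c(0, NA), data = dat, method = "REML")
anova(m1, m3)  # likelihood-ratio test of the between-study variance

# I^2 at both levels (equation 3), using the typical within-study sampling
# variance of Higgins and Thompson (2002) in the denominator.
W <- diag(1 / dat$vi)
X <- model.matrix(m1)
P <- W - W %*% X %*% solve(t(X) %*% W %*% X) %*% t(X) %*% W
v_tilde <- (nrow(dat) - ncol(X)) / sum(diag(P))
I2 <- 100 * m1$sigma2 / (sum(m1$sigma2) + v_tilde)  # level 3, then level 2

# Mixed-effects extension (equation 2): an effect-size-level moderator,
# e.g., a dummy-coded indicator of treated vs. untreated control groups.
m1_mod <- rma.mv(yi, vi, mods = ~ control_treated,
                 random = ~ 1 | study_id / es_id,
                 data = dat, method = "REML")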

Publication Bias and Sensitivity Analysis

To test the robustness of the obtained transfer effects, we conducted several analyses of publication bias: First, we examined the funnel plot and performed trim-and-fill-analyses (Duval & Tweedie, 2000). Second, we compared the effect sizes obtained from published studies and grey literature (Schmucker et al., 2017). Third, we examined the p-curve that resulted from the statistics underlying the transfer effect sizes (Simonsohn, Nelson, &

Simmons, 2014). If studies had evidential value, the p-curve should have been right-skewed; a left-skewed curve would indicate publication bias (Melby-Lervåg et al., 2016). Fourth, we


performed a fail-safe N analysis on the basis of Rosenberg’s weighted procedure (Rosenberg, 2005). In contrast to other fail-safe N procedures (e.g., Rosenthal’s and Orwin’s procedures), Rosenberg proposed a weighted approach, which is applicable to both fixed- and random- effects models in meta-analysis and might represent the number of unpublished studies better than the alternative approaches. Fifth, we applied Vevea’s and Hedges’ (1995) weight

function procedure that assumes a dependency between the p-value in a study and the probability of publication (linked via a weight function). All approaches to publication bias were performed using the R packages ‘metafor’ (Viechtbauer, 2017) and ‘weightr’ (Coburn & Vevea, 2017), and the ‘P-curve Online App’ (Simonsohn, Nelson, & Simmons, 2017).
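As an illustration of how several of these checks can be run in R (a sketch under the assumption that the effect sizes and sampling variances are stored in columns yi and vi; the funnel-plot-based methods in ‘metafor’ operate on a conventional two-level rma() fit rather than the three-level model, and ‘weightr’ provides one implementation of the Vevea-Hedges model):

library(metafor)
library(weightr)

# Conventional random-effects fit of all effect sizes.
re <- rma(yi, vi, data = dat, method = "REML")

funnel(re)                                    # funnel plot
regtest(re)                                   # Egger-type regression test for asymmetry
trimfill(re)                                  # trim-and-fill adjusted estimate
fsn(yi, vi, data = dat, type = "Rosenberg")   # weighted fail-safe N

# Vevea-Hedges (1995) weight-function model with a single cut point at p = .05.
weightfunct(effect = dat$yi, v = dat$vi, steps = c(0.05, 1))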

We tested the sensitivity of our findings to several factors, including the estimation method, the presence of influential cases, the handling of missing data in moderators, and the different assumptions on the variance components in the main model. For instance, we compared the transfer effects and the existence of possible variation within and between studies between restricted maximum likelihood (REML) and maximum likelihood (ML) estimation. Existing simulation studies indicate that, although both methods may not differ in the estimation of intercepts (i.e., overall effect sizes; Snijders & Bosker, 2012), REML creates less biased between-study variance estimates of random-effects models than ML does

(Veroniki et al., 2016). M. W.-L. Cheung (2013) therefore argued for the use of REML in multilevel situations yet suggested comparing the variance components obtained from both estimation methods for validation purposes (see also M. W.-L. Cheung, 2014). Furthermore, the dataset underlying our meta-analysis may contain influential effect sizes. We therefore compared the results of our meta-analysis with and without influential effect sizes. We identified influential effect sizes using Viechtbauer’s and Cheung’s (2010) diagnostics based on random-effects models in the R package ‘metafor’. These diagnostics included studentized residuals, Cook’s distances, and other leave-one-out deletion measures.
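A minimal sketch of these sensitivity checks (illustrative only; influence() in ‘metafor’ is defined for two-level rma() fits, which is why the leave-one-out diagnostics below use such a fit):

library(metafor)

# Re-fit the three-level model with ML to compare variance components
# against the REML solution.
m_reml <- rma.mv(yi, vi, random = ~ 1 | study_id / es_id,
                 data = dat, method = "REML")
m_ml   <- rma.mv(yi, vi, random = ~ 1 | study_id / es_id,
                 data = dat, method = "ML")
rbind(REML = m_reml$sigma2, ML = m_ml$sigma2)

# Influence diagnostics (studentized deleted residuals, Cook's distances,
# and other leave-one-out measures) on a conventional random-effects fit.
re  <- rma(yi, vi, data = dat, method = "REML")
inf <- influence(re)
plot(inf)

# Re-run the meta-analysis without the flagged influential effect sizes.
dat_trimmed <- dat[!inf$is.infl, ]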


Results

Description of Studies

Table 2 summarizes the distribution of the study design, sample, and publication characteristics among the m = 105 studies and k = 539 effect sizes. Most studies reported effect sizes based on pretest-posttest control group designs and random group assignment but did not match the experimental groups—hence, they were quasi-experimental. Most studies targeted far transfer effects only (87.6 %), and about 70 % of the effect sizes were based on untreated control groups. Interventions were mostly conducted during regular school lessons.

Besides these design features, studies primarily used visual rather than text-based programming tools in their interventions. Study participants used these tools to design

computer games, maneuver robots, or engage in pure programming activities. Control groups, however, did not use programming tools, but attended lectures or other forms of instruction (see Supplementary Material A2). Standardized and unstandardized tests were administered to almost the same extent. These tests measured a variety of cognitive skills, with a clear focus on reasoning, mathematical, and creative thinking skills. Overall, the sample of participants comprised mainly primary and secondary school students in Asia and North America.

The overall sample contained N = 9,139 participants of the primary studies (treatment groups:

NT = 4,544; control groups: NC = 4,595), with an average sample size of 87 (SD = 72,

Mdn = 66, range = 14–416). Considering the central tendencies of sample sizes, treatment and control groups were balanced (treatment groups: M = 43, SD = 37, Mdn = 30; control groups:

M = 44, SD = 43, Mdn = 29). On average, interventions lasted for 25 hours and ranged

between 2 and 120 hours (SD = 20.9, Mdn = 20 hours). Of the study participants, 49.1 % were female. Most publications describing the study results dated back to the 1980s and 1990s, followed by studies published in the 2010s.


Publication Bias

Before quantifying the overall transfer effects, we examined the degree of publication bias in the sample of primary studies. The funnel plot indicated some degree of asymmetry (see Figure 2a), an observation supported by Egger's regression test, t(537) = 4.10, p < .001. Trim-and-fill analysis resulted in an overall transfer effect size of g = 0.43, 95 % CI = [0.37, 0.50], with no additional studies imputed to the left of the mean. Rosenberg's fail-safe N suggested that 77,765 additional effect sizes would be needed to render the overall transfer effect nonsignificant (at p > .01). Finally, p-curve analyses indicated that the observed p-values had evidential value, z = -38.9, p < .0001 (continuous test for a right-skewed curve; Simonsohn et al., 2014), and that the p-curve was right-skewed (see Figure 2b). Vevea and Hedges' (1995) random-effects weight-function model, with a selection function based on p-values with cut points at 0.05 and 1, resulted in an adjusted overall effect size of g = 0.63, 95 % CI = [0.52, 0.74]. The difference between this weighted model and a model containing constant weights (i.e., assuming no publication bias) was significant, χ²(1) = 20.4, p < .001. Hence, the publication of effect sizes could depend on the reported p-value, because the model adjusted for publication bias fitted the data better than the unadjusted model (for more details, please refer to Vevea & Hedges, 1995). Taken together, these findings suggest the presence of some publication bias and small-study effects (Egger's test) in the present data. At the same time, the p-curve analysis did not uncover evidence of p-hacking, and the fail-safe N indicated that it is unlikely that the key results obtained from the main and moderation models are mainly due to publication bias.
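For illustration, the sketch below shows how these checks can be reproduced with standard 'metafor' functions and, for the selection model, the 'weightr' package; the objects 'res' and 'dat' are the assumed placeholders introduced above, and the calls are illustrative rather than the authors' exact script.

```r
# Sketch of the publication-bias checks (assumed objects 'res' and 'dat' as above).
library(metafor)

funnel(res)                                   # funnel plot of effect sizes against standard errors
regtest(res)                                  # Egger-type regression test for asymmetry
trimfill(res)                                 # trim-and-fill adjusted overall estimate
fsn(yi, vi, data = dat, type = "Rosenberg")   # Rosenberg's fail-safe N

# Vevea-Hedges (1995) weight-function model with cut points at p = .05 and 1
library(weightr)
weightfunct(effect = dat$yi, v = dat$vi, steps = c(0.05, 1))
```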

Overall Transfer Effects

To aggregate the transfer effects of learning computer programming on cognitive skills, including programming skills and skills outside of the programming domain, we established a main (baseline) model, which formed the basis for the subsequent moderator analyses of the overall transfer effects.

Main model (Research Question 1a). To identify the main model, we performed a sequence of modeling steps: First, a random-effects three-level model (Model 1) resulted in a positive and moderate transfer effect, g = 0.49 (m = 105, k = 539, 95 % CI = [0.37, 0.61], z = 8.1, p < .001; Model fit: -2LL = 1127.8, df = 3, AIC = 1133.8, BIC = 1146.7). This effect was accompanied by significant heterogeneity (Q[538] = 2985.2, p < .001), which also surfaced in variation of the effect sizes within studies (τ²(2) = 0.204, 95 % CI = [0.164, 0.252], I²(2) = 36.7 %) and between studies (τ²(3) = 0.281, 95 % CI = [0.189, 0.415], I²(3) = 50.7 %). The corresponding profile likelihood plots peaked at both variance estimates, and the log-likelihood decreased for higher values of these variances; thus, both variance components were identifiable (see Supplementary Material A2, Figure S1). The intraclass correlations of the true effects were 0.42 (level 2) and 0.58 (level 3), indicating substantial variation within and across studies.

Second, we specified a model with a constrained level-2 variance (τ²(2) = 0) but a freely estimated level-3 variance (Model 2). This model showed the same transfer effect size as the three-level model (g = 0.49, 95 % CI = [0.38, 0.61]), along with significant level-3 variance, τ²(3) = 0.328, 95 % CI = [0.238, 0.458], I²(3) = 82.4 % (Model fit: -2LL = 1589.0, df = 2, AIC = 1593.0, BIC = 1601.6). In comparison to Model 1, this model degraded model fit significantly, χ²(1) = 461.2, p < .001.

The third model assumed variation at level 2 but not at level 3 (τ²(3) = 0), thus representing a standard random-effects model (Model 3). This model revealed a positive and moderate effect size, which was slightly smaller than that obtained from the three-level model (g = 0.43, m = 105, k = 539, 95 % CI = [0.37, 0.49], z = 13.8, p < .001; Model fit: -2LL = 1266.5, df = 2, AIC = 1270.5, BIC = 1279.1), with significant variation of the effect sizes (τ²(2) = 0.415, 95 % CI = [0.352, 0.490], I²(2) = 85.6 %). Introducing the constraint of zero level-3 variance degraded the model fit significantly, as the results of a likelihood-ratio test comparing Models 1 and 3 indicated, χ²(1) = 138.7, p < .001.

The fourth model constrained both the level-2 and the level-3 variance to zero (τ²(2) = 0, τ²(3) = 0), assuming fixed effects (Model 4). The resultant overall transfer effect amounted to g = 0.35 (m = 105, k = 539, 95 % CI = [0.33, 0.37], z = 31.1, p < .001; Model fit: -2LL = 2740.0, df = 1, AIC = 2742.0, BIC = 2746.3). The three-level random-effects model, however, fitted the data significantly better than this model, χ²(2) = 1612.2, p < .001.

Overall, this sequence of model specifications and comparisons indicated significant level-2 and level-3 variance of the overall transfer effect and the sensitivity of the overall effect size to these variance components. It also showed that the three-level random-effects model represented the data best, g = 0.49, m = 105, k = 539, 95 % CI = [0.37, 0.61].
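A minimal sketch of this model sequence in the 'metaSEM' package follows; the column names 'g', 'v', and 'study' are assumed placeholders for the effect sizes, their sampling variances, and the study identifier.

```r
# Sketch of Models 1-4 with 'metaSEM' (assumed columns: g, v, study in 'dat').
library(metaSEM)

m1 <- meta3(y = g, v = v, cluster = study, data = dat)                        # free tau2 at levels 2 and 3
m2 <- meta3(y = g, v = v, cluster = study, data = dat, RE2.constraints = 0)   # level-2 variance fixed to 0
m3 <- meta3(y = g, v = v, cluster = study, data = dat, RE3.constraints = 0)   # level-3 variance fixed to 0
m4 <- meta3(y = g, v = v, cluster = study, data = dat,
            RE2.constraints = 0, RE3.constraints = 0)                         # fixed-effect model

summary(m1)      # overall g, tau2(2), tau2(3), and I2 at both levels
anova(m1, m2)    # likelihood-ratio tests of the variance constraints
anova(m1, m3)
anova(m1, m4)
```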

Moderator analysis (Research Question 1b). Model 1 formed the basis for further moderator analyses. Table 3 shows the results of these analyses for the categorical moderators. Significantly higher effects occurred for published literature (g = 0.60, 95 % CI = [0.45, 0.75]) than for grey literature (g = 0.34, 95 % CI = [0.15, 0.52]; QM(1) = 4.67, p = .03). Besides the publication status, only the type of treatment that control groups received (i.e., treated vs. untreated) significantly explained level-2 variance, QM(1) = 40.12, p < .001, R²(2) = 16.7 %. More specifically, transfer effect sizes were significantly lower for studies including treated control groups (g = 0.16) than for studies including untreated control groups (g = 0.65). Concerning the z-transformed, continuous moderators at level 3, neither publication year (B = 0.09, SE = 0.06, QM(1) = 2.36, p = .12, R²(3) = 0.0 %), students' average age (B = -0.07, SE = 0.07, QM(1) = 0.86, p = .35, R²(3) = 3.6 %), the proportion of female students in the study samples (B = -0.07, SE = 0.07, QM(1) = 0.86, p = .35, R²(3) = 1.1 %), nor the intervention length (B = 0.00, SE = 0.06, QM(1) = 0.00, p = .98, R²(3) = 0.0 %) affected the overall transfer effect, thus leaving large proportions of the level-2 and level-3 variances unexplained.
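As an illustration, a single categorical or z-standardized continuous moderator can be added to Model 1 via the 'x' argument of meta3(); the column names below ('untreated_control', 'pub_year') are assumed placeholders, not the coding used in the original data set.

```r
# Sketch of the moderator analyses (assumed columns: untreated_control = 0/1 dummy,
# pub_year = publication year).
library(metaSEM)

mod_ctrl <- meta3(y = g, v = v, cluster = study, data = dat,
                  x = untreated_control)       # categorical moderator at the effect-size level
summary(mod_ctrl)                              # slope estimate and R2 at levels 2 and 3
anova(mod_ctrl, m1)                            # omnibus test against the baseline model

dat$year_z <- as.numeric(scale(dat$pub_year))  # z-transform a continuous moderator
mod_year <- meta3(y = g, v = v, cluster = study, data = dat, x = year_z)
summary(mod_year)
```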

Sensitivity analyses. The variance components of the overall transfer effect, obtained from REML, differed only marginally from the ML variances (ML level-2 variance: τ²(2) = 0.203, 95 % CI = [0.160, 0.247], I²(2) = 37.0 %; ML level-3 variance: τ²(3) = 0.277, 95 % CI = [0.169, 0.385], I²(3) = 50.3 %; see Supplementary Material A2, Table S1). Some moderator variables exhibited missing data. Hence, we compared the variance explanations of effect sizes between the maximum likelihood and the full-information maximum likelihood (FIML) approaches. The FIML approach handles missing data within the analysis model by using all observed effect sizes and study characteristics to compensate for the loss of information due to missing values (Little et al., 2014) and is implemented in the R package 'metaSEM' (‘meta3X()’ function; M. W.-L. Cheung, 2018). Overall, the differences in variance explanations between FIML and ML, and between FIML and REML, were only marginal (see Supplementary Material A2, Table S2).
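A sketch of the corresponding FIML specification is given below; 'mean_age_z' is an assumed study-level moderator with missing values, and the x3 argument of meta3X() denotes a study-level (level-3) predictor.

```r
# Sketch of the FIML comparison with meta3X() (assumed column: mean_age_z with NAs).
library(metaSEM)

fit_fiml <- meta3X(y = g, v = v, cluster = study, data = dat,
                   x3 = mean_age_z)   # study-level moderator; FIML retains effect sizes
                                      # whose moderator values are missing
summary(fit_fiml)                     # compare R2 with the listwise ML/REML results
```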

The influence diagnostics flagged ten influential effect sizes that were obtained from five studies (see Supplementary Material A2, Figure S2). These effect sizes ranged between g = 2.10 and g = 8.63 (Mdn = 3.31), with an average of g = 3.99 (SD = 1.99). The studies exhibiting these effects all involved primary school students, used visual programming tools, and examined transfer effects on cognitive skills outside of programming; all other sample and study characteristics differed. After removing these effect sizes, the remaining m = 103 studies comprising k = 529 effect sizes were submitted to the three-level meta-analysis, following the same procedure as for the full data set. Model 1 fitted the data best and revealed a positive, significant, and moderate overall transfer effect size of g = 0.41, 95 % CI = [0.32, 0.50], which was slightly lower than the original effect size (see Supplementary Material A2, Table S3). The moderator analyses supported the finding that studies comprising treated control groups exhibited significantly smaller transfer effects than studies with untreated control groups (see Supplementary Material A2, Table S4). The continuous moderation effects did not change after excluding influential cases (see Supplementary Material A2, Table S5). Nevertheless, two findings contrasted with the previous moderator analyses of the full data set: First, the difference between published literature and grey literature diminished after removing influential cases, suggesting a possible reduction of publication bias in the data. Indeed, the funnel plot indicated improved graphical symmetry, and the p-curve did not provide evidence of further publication bias (see Supplementary Material A2, Figure S3). Second, studies administering standardized tests showed a significantly lower transfer effect size (g = 0.33) than studies administering unstandardized tests (g = 0.49; QM(1) = 4.56, p = .03, R²(3) = 7.8 %). Overall, the sensitivity analyses showed only marginal differences in the overall transfer effects, their variance components, and possible moderation effects between the conditions; substantive causes for the differences could not be identified.
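A sketch of this re-analysis is shown below; 'influential' is an assumed logical flag derived from the diagnostics described above, not a variable from the original data set.

```r
# Sketch: refit the baseline three-level model without the flagged effect sizes
# (assumed logical column 'influential' marking the ten cases).
library(metaSEM)

dat_trim <- subset(dat, !influential)      # m = 103 studies, k = 529 effect sizes
m1_trim  <- meta3(y = g, v = v, cluster = study, data = dat_trim)
summary(m1_trim)                           # overall g of about 0.41, as reported above
```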

Near and Far Transfer Effects

In a second step of our meta-analysis, we analyzed the transfer effects separately for near transfer (i.e., effects on programming skills) and far transfer (i.e., effects on cognitive skills outside of programming). To allow for possible differences in (a) the selection of a main model, (b) the within- and between-study variances, and (c) the moderation effects, we conducted two separate meta-analyses, following the same procedure as for the overall transfer.

Main models (Research Questions 2a & 3a). Comparisons between models with different variance constraints identified a random-effects model with between-study variation of effect sizes (Model 2) as the best-fitting main model for near transfer effects (Table 4; the corresponding forest plot is provided in Supplementary Material A1); for far transfer effects, the random-effects three-level model (Model 1) described the data best (Table 4), indicating significant variance within and between studies. The overall effect size for near transfer was high (g = 0.75, m = 13, k = 19, 95 % CI = [0.39, 1.11], z = 4.1, p < .001) and showed substantial heterogeneity across studies (I²(3) = 85.5 %). In contrast, the overall far transfer effect size was lower (g = 0.47, m = 102, k = 520, 95 % CI = [0.35, 0.59], z = 7.8, p < .001) and showed heterogeneity within (I²(2) = 37.1 %) and between studies (I²(3) = 50.0 %), with intraclass correlations of 0.43 and 0.57, respectively. For both types of transfer, the profile likelihood plots peaked at the estimated variances, testifying to the identification of both variances (see Supplementary Material A2, Figures S4 and S5). Overall, the selection of main models suggested positive and significant near and far transfer effects.
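The two separate analyses can be sketched as follows; 'transfer' is an assumed indicator column coding each effect size as near or far transfer, not the original variable name.

```r
# Sketch of the separate near- and far-transfer meta-analyses
# (assumed column: transfer = "near" or "far").
library(metaSEM)

dat_near <- subset(dat, transfer == "near")
dat_far  <- subset(dat, transfer == "far")

fit_near <- meta3(y = g, v = v, cluster = study, data = dat_near,
                  RE2.constraints = 0)                            # Model 2 was retained for near transfer
fit_far  <- meta3(y = g, v = v, cluster = study, data = dat_far)  # Model 1 for far transfer

summary(fit_near)
summary(fit_far)
```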

Moderator analyses (Research Questions 2b & 3b). The moderator effects differed between near and far transfer (see Tables 5 and 6): Whereas neither publication status nor the treatment of control groups showed significant moderation for near transfer, far transfer effect sizes were significantly lower for treated control groups (g = 0.15) than for untreated control groups (g = 0.64) at the level of effect sizes, and significantly higher for published studies (g = 0.58) than for grey literature (g = 0.43). Studies with random group assignment (g = 0.29) showed lower near transfer effects than those without (g = 0.95). None of the continuous study and sample characteristics moderated the two transfer effects (Table 7).

Notably, the confidence intervals accompanying near transfer effects were large, due to the limited number of studies addressing this type of transfer. Hence, the moderation effects of near transfer must be treated with caution.

Publication bias. As noted earlier, publication status did not explain variance in near transfer effects but did so in far transfer effects, indicating some bias toward published studies in the latter. Moreover, the funnel plots for near and far transfer confirmed this tendency, as they showed some asymmetry only for far transfer (see Supplementary Material A2, Figure S6). Trim-and-fill analyses suggested adding two more effect sizes for near transfer, yet no further effect sizes for far transfer.
