AVAILABILITY AND IMPACT

On the Citation Advantage of Data Sharing in Political Science

Maria Seippel Bineau

Master’s Thesis in Political Science

Department of Political Science

Faculty of Social Sciences

University of Oslo

Spring 2016

Words: 18055

Availability and Impact: On the Citation Advantage of Data Sharing in Political Science Maria Seippel Bineau

http://www.duo.uio.no/

Print: Reprosentralen, University of Oslo

Abstract

This study investigates whether available replication data leads to increased citation impact for political science publications. The research question is part of a broader movement towards scientific openness that we can trace throughout history. Technological advances bring this movement into modern times, with increased possibilities for sharing empirical materials.

Whether this data sharing is beneficial for individual scholars is still largely unanswered, and there are shortcomings with the studies investigating the relationship between available replication data and citation impact. This thesis intends to improve on the existing research.

The starting point is the article “Posting your data: will you be scooped or will you be famous” by Gleditsch, Metelits, and Strand (2003). Their model attracted serious criticism, especially from Abbott (2007), who objected not only to the solidity and validity of their model, but also to the interpretation of the results. Keeping his objections in mind, I attempt to go further and not only explore whether there is a relationship between replication data and citation impact, but also make some observations as to the causes of this relationship. To do this, I anchor the analysis deeper in existing theory and previous research, and thus include more (previously missing) aspects of publishing and citing. Also, because of the criticism of the data used by Gleditsch, Metelits, and Strand, I add a second data set available from Dafoe (2014). The new data allows me to investigate the finer points of data sharing, as the information on replication data is more detailed.

This study includes more data, better methods, and more aspects of publishing than previous research on the same subject, but confirms the findings of this existing literature (Dorch 2012; Gleditsch, Metelits, and Strand 2003; Ioannidis et al. 2009; Piwowar, Day, and Fridsma 2007; Piwowar and Vision 2013). The analyses indicate that there is a relationship between data sharing and citation impact.

Acknowledgements

I would like to thank my excellent supervisors, Håvard Strand and Gunnar Sivertsen, for your help and encouragement, and for discussing problems, big or small, from analysis structure to 18th-century citations. You have been superb! I also need to thank you for introducing me to the topic in the first place. It has been more fun than I could have imagined.

I would like to thank my mother and Robert for proofreading. And last but absolutely not least: thanks to my brilliant father for reading, commenting, and supporting me through it all.

The contents and errors in this thesis are my responsibility.

Maria Seippel Bineau

Oslo, May 17, 2016

Abbreviations

Journal of Peace Research (JPR)
American Political Science Review (APSR)
American Journal of Political Science (AJPS)
Open Access (OA)
Web of Science (WoS)
American Political Science Association (APSA)
Data Access and Research Transparency (DA-RT)
National Science Foundation (NSF)
Research Council of Norway (RCN)
National Institutes of Health (NIH)
International Relations (IR)
Variance Inflation Factor (VIF)
Incident Rate Ratio (IRR)
Log Likelihood (LL)
Akaike Information Criterion (AIC)

List of Tables

Table 1: Replication data, JPR
Table 2: Replication data options, Dafoe
Table 3: Replication data, Dafoe
Table 4: Comparison of journals
Table 5: Replication data variable, Dafoe
Table 6: Analysis, JPR data
Table 7: Analysis, Dafoe data
Table 8: Fame variable as dummies, JPR data (excerpt)
Table 9: Coding of the disciplinary background of authors, Dafoe
Table 10: Model fit, Dafoe data
Table 11: Model fit, JPR data
Table 12: Replication of and comparison to original analysis, JPR data
Table 13: Variations on scientific cooperation, Dafoe
Table 14: Fame variable as dummies, JPR (full)
Table 15: Results from bootstrapping

List of Figures

Figure 1: Number of articles mentioning ‘replication data’ in political science journals
Figure 2: Total number of citations in the data, Dafoe data and JPR data combined
Figure 3: The ‘fame’ variable, JPR data
Figure 4: Articles without replication data, JPR data
Figure 5: Articles with replication data, JPR data
Figure 6: Number of pages

1. Introduction

If I (in an optimistic moment) should hope that this thesis is the start of a scientific career, would it make a difference whether I make my data available to others? What would it mean to science? What would it mean to me? Should I go the extra mile it takes to make my data accessible to other scholars?

One of the main criteria when statistically evaluating science is citation impact. It is important for policy makers, research institutions and individual scholars because the awarding of research grants, jobs, and esteem is more and more dependent on the citation impact of the papers an author has published. So, returning to the previous paragraph, does data sharing increase citation impact? Should I (as an aspiring scientist) publish my data and hope for fame and fortune? Or will I be ‘scooped’? Will someone else take my data, free-ride on my efforts, publish results and get all the glory?

That is what I attempt to find out: Does data sharing lead to increased citation impact?

This question is part of a general movement where the aim is to increase scientific openness. A movement with long roots and many branches. The specific branch that is most important in the following is the replication movement, a movement preaching the benefits of data sharing for science as a whole and for the individual scholar. Studies have focused on the latter before, as I do here, analyzing the relationship between available replication data and citation impact in various disciplines. Focusing on political science, Gleditsch, Metelits, and Strand (2003) published a study claiming to have found the association under scrutiny to be true for international relations (IR) articles published in Journal of Peace Research (JPR). But the article faced criticism, especially from Abbott (2007). He pointed to the distribution of citations in the sample, the low explanatory power of the model, and he disagreed on the interpretation of the results.

This is where we begin.

With the data used by Gleditsch, Metelits, and Strand (2003)¹ I give the question a second go. I add a second data set available from Dafoe (2014), consisting of articles from American Political Science Review (APSR) and American Journal of Political Science (AJPS)². I add new variables to both of the data sets to account for more aspects of publishing and citing. The most notable extension of the data is the inclusion of a ‘fame’ aspect, measured by the number of citations the author(s) had at the time of publication. The original authors suggested this as a possible source of omitted variable bias in their analysis (Gleditsch, Strand, and Nordkvelle 2014: 9). With bigger, better data, and the objections to the original analysis in mind, this study attempts to go further, and not only see whether there exists a relationship between data sharing and citation impact, but also make some observations regarding the cause of this relationship (if there is one). The Dafoe data allows this. The data includes not only a variable showing whether data is actually available (as is the case with the JPR data), but also one showing whether the article explicitly states that replication data exists. There is a significant discrepancy between the two, and this allows us to investigate the finer points of data sharing.

1 From here on referred to as the ‘JPR data’

2 From here on referred to as the ‘Dafoe data’

I start with a background chapter on scientific openness. Then comes a combined theory and literature review chapter that consists of three parts. The first is a summary of the previous literature on my exact research question: the relationship between data sharing and citation impact. The second part is a look into the existing research on Open Access (OA) publishing. This literature has a substantial subset of investigations into the relationship between OA publishing and citation impact. This is close to my research question, only the effect variable is different. The OA literature is more comprehensive than the research concerned with replication data, and it has explicit expectations of what an OA effect on citation impact might look like. This is useful for the interpretation of the results. The last part of the theory and literature review chapter is a look into the general literature on citations. This is to help structure the analyses, and to make an informed choice of which control variables to include.

After this, I present the research design. Here I describe the data, common challenges in citation analysis, and the method I use: a count model. I also list the variables in the analysis with their operationalizations and some descriptive statistics. Chapter 5 presents the analyses and the results, and I go through the variables to look at how they perform. Chapter 6 is the discussion chapter. Here, I interpret the results, look at the robustness of the models, and attempt to answer some of the criticism from Abbott (2007). In the last chapter, I try to draw some conclusions and propose some further research.

2. Background

To appreciate the significance of the replication data controversies, a look into the broader debates on and evolution of scientific openness is important. To understand the debates of the day, we should take a step back. So that is what happens next. First, an introduction into scientific openness in general, before we narrow down to the concrete arguments on replication data of the present, and land on the disagreements concerning data sharing and its consequences for the individual scholar; among them our main question of citation impact.

2.1. Scientific Openness

The idea of scientific openness is not new. The belief that science should be available and accessible, so that others can assess thoughts, theories, inventions and results has been present for a long time. In 1744, Ludvig Holberg wrote in an essay about the Royal Society of London and its Philosophical Transactions, commenting also on the recently established parallel in Copenhagen, that “[Researchers shall] communicate to each other their thoughts, so that everyone lets their new ideas and publications be corrected by the entire scientific community”³ (Holberg 1744: Libr. III, Epigr. 18). This line of thought and the emphasis on this aspect of science – where scientists control and build on each other’s work – is heavily dependent on scholars being open about not just their results, but the entire process leading up to the finished research product. Openness is a focus we find in much of the sociology of science, with prominent scholars such as Karl Popper (1957) and Robert K. Merton in the forefront. The latter with the focus on reliability, and science as a cumulative endeavor (Merton 1968a: 501-503).

To make this accumulation possible, some degree of scientific openness is a prerequisite.

The emphasis on openness follows through to the modern times of the internet, and all the opportunities brought on by technological advances. These developments are increasing the possibilities of openness and sharing of scientific materials. Journal articles available online – to scholars or everyone – and easily accessible data banks with scientific data sets open for reuse and control are now the reality. The latter, the possibility of easy data sharing, is the main precondition for this study.

The new possibilities of sharing bring with them new challenges concerning the limits of openness. Scholars can now easily make their empirical material available through different data sharing websites. But should they? And should they have to? And should they all have to? These questions have sparked discussions. I will now lay out the main arguments in the debates regarding replication data, after a brief explanation of what replication data is, and an overview of the replication data policies in political science.

3 Translated by me. Original: [Forskere skal] «communicere hindanden deres Tanker, saaledes, at enhver lader sine Inventioner og Skrifter see og corrigere af det heele Societet»

2.2. Replication Data

Figure 1: Number of articles mentioning ‘replication data’ in political science journals⁴

2.2.1. What is Replication Data?

A replication data set “include[s] all information necessary to replicate empirical results” (King 1995: 446). What this entails depends on the kind of research. For quantitative research, it would mean the original data, code, information about the computer program used, usually a note with explanations on how to reproduce the published results, etc. For qualitative research, it could mean interview transcripts, selection methods, audiotapes etc. Not all of the information has to be shared, only the information sufficient to reproduce the results (ibid.).

4 Data collected by me from jstor.org. Replication data and code available in Appendix G

2.2.2. Replication Data Policies

Policies on data sharing vary between social science disciplines. Economics has been leading the way by making data sharing the norm, followed by political science (Zenk-Möltgen and Lepthien 2014: 711-712). The American Political Science Association (APSA) has included increased transparency and openness in their guidelines, and appointed the Data Access and Research Transparency (DA-RT) committee to investigate these issues further (Carsey 2014; Lupia and Elman 2014). The focus on replication data is relatively new, and the attention nearly exploded after around 2000 (illustrated by Figure 1).

These new recommended policies are also encouraged and enforced by other actors. Funding agencies such as the National Science Foundation (NSF) and National Institutes of Health (NIH) in the United States, the Research Council of Norway (RCN), and public financing institutions in for example the UK and Germany now demand that data is made available if they are to give financial support (Forskningsrådet 2014; Gherghina and Katsanidou 2013: 335; Zenk-Möltgen and Lepthien 2014: 711). Data sharing is also recommended by the European Union (EU) (European Commission 2010). These new policies are primarily based on a view of publicly financed data as an important public good, beneficial for everyone, but with limitations such as security, privacy, (in some cases) commercial interests and legal concerns (Forskningsrådet 2014: 8).

But practices also differ within political science, explicitly between journals (Gherghina and Katsanidou 2013). Several journals and editors have embraced the new recommended policies, and many publishers now demand or encourage their authors to make their data available, while others are more skeptical. Journal policies are important because scientific journals are the primary arena for scientific publishing, debates, and progress. They are also increasingly important for the career development of individual scholars. The movement of political science as a whole, and the realities of data sharing thus depend on the policies and practices of scientific journals (Gherghina and Katsanidou 2013: 335).

2.2.3. The Replication Debate: Why (Not) Share Replication Data?

The debates about replication data in political science concern both quantitative (Lupia and Alter 2014) and qualitative (Elman and Kapiszewski 2014; Moravcsik 2014) research, and all subdisciplines.

The main argument in favor of data sharing is that researchers should make their empirical materials available to other scholars because science is a cumulative endeavor. We cannot make scientific progress at the same pace and with the same quality individually as we can working together, when authors are inspired by and build on each other. If more scholars get access to the same data, they can do more research based on the same empirical material, and the number of discoveries will increase. Better and more extensive use of the same data benefits everyone and science in general (Lupia and Elman 2014: 22). This is realized through for example increased scrutiny, easier access to science for non-academic actors, facilitation of learning and co-operation between disciplines (ibid.), increased learning potential from each individual study (Carsey 2014: 72), and more potential research projects based on the same data.

But there are weighty objections. We can group most of them into three rough categories: objections to the universality of the recommended data sharing policies, concerns about practicalities, and disagreements about perceived consequences.

Some scholars have relatively fundamental objections, criticizing the universality of the recommended policies. Abbott (1997: 1151) argues that data sharing is a question without an answer; what is lacking in the social sciences is not methodological control and rigor, but good ideas and enthusiasm. The data sharing policies are not promoting the (most) needed positive changes in the social sciences, and they are not equally well suited for all sub-fields. Jeffrey C. Isaac, editor of Perspectives on Politics, similarly objects that these new developments are not for everyone. Not all parts of political science benefit from, or want, data sharing to be a rule they have to adhere to, imposed from above (Isaac 2015: 276).

The second group of objections is concerned with the practicalities of data sharing. One of the major worries is privacy (Abbott 2007: 214). This is a very valid concern, as only very basic information is needed to identify the individuals in for example a quantitative data set based on a survey (King 2011: 721). User agreements are one proposed solution. They are already in place in one of the major data sharing repositories: Gary King and Harvard’s Dataverse project (Crosas 2012; Freese 2007: 221). Whether this is enough is not agreed upon.

Administration, ownership, and resources are other practical concerns without definite solutions to date. Who should administer the data? Who should decide who gets access, and should the use of the data be restricted? Making sense of someone else’s data can be challenging, and many scientists and publishers lack the required knowledge to post data (Sieber 1991: 141). There are also disagreements about where the additional burden associated with data sharing falls, and where it should fall: who should pay the costs of data sharing (Berman and Cerf 2013)? Firebaugh concludes that the benefits of sharing data would likely exceed the costs from a journal perspective, as the quality of research products is likely to improve, and the results are likely to be easier to assess (Firebaugh 2007: 208). There is still a big job left organizing and planning comprehensive data sharing. It is still a work in progress.

The third group of arguments against data sharing is the perceived consequences of such policies. These are consequences on two levels: the consequences for science in general, and for the individual scholar.

Focusing on the former, it is feared that data availability will lead to less primary data collection (Freese 2007: 221). If more or less useable data is already available, scholars will not have the same incentive to collect their own, and the outcome of the research might suffer because of this. In addition, it is hurting science in general, because less primary information is gathered.

On the other hand, reuse of data can make possible projects that were otherwise undesirable or too expensive. This would benefit everyone: the researcher who is able to perform an analysis s/he otherwise would not be able to perform, and the scholar whose data is used and (perhaps) cited.

Increased citation impact is thus one possible consequence of data sharing for the individual scholar, but one can imagine other scenarios that are less positive, and some of the questions they raise are still unanswered. Does the individual scholar carry the weight and pay the price that comes with data sharing? After all, data needs to be documented, formatted and uploaded (Piwowar and Vision 2013: 1). Or does the individual scientist benefit from more available data and thus more possible research projects? And will the scholar who shares his or her data be ‘scooped’? Will others free-ride on his or her hard work, beat him or her to the punch, and publish findings s/he planned on publishing? Or will s/he be famous? Does posting your data really lead to increased citation impact?

3. Literature Review and Theory

In this study, I do not use one specific theoretical framework, and the previous literature and theoretical backdrop are thus closely interwoven. I choose to merge the normally separate chapters, literature review and theory. This turned out more productive, and better structured.

What follows are three parts presenting previous research and the theoretical input I take with me. First, I go through the existing literature concerned with my specific research question. Then follows a section presenting similar research with OA publishing as the interesting effect variable instead of replication data. Finally comes a part laying out the previous research and theory on citations in general. Together, this helps structure the analysis to come, and the interpretation of the results.

3.1. Replication Data and Citations

The association between available replication data and citation impact has been studied in various disciplines. Astrophysics (Dorch 2012), cancer research (Piwowar, Day, and Fridsma 2007; Piwowar and Vision 2013) and IR (Gleditsch, Metelits, and Strand 2003) are scientifically far apart, but have all been subjected to investigations with the same aim – to figure out whether available replication data increases citation impact.

And the answer has so far always been the same: Yes. Data sharing is associated with increased citation impact for publications.

Data sharing can increase citation impact, primarily through increased visibility. This can happen in three ways (Piwowar, Day, and Fridsma 2007: 3). First, making your data set available increases the exposure of the original research project (Carsey 2014: 73). When authors make the data available in databases and on websites, the number of people who encounter the publication increases. Second, if other scholars use the already existing data and publish results based on this empirical material, the visibility of the original project increases. That available data actually is being used has previously been found to be true (Pienta, Alter, and Lyle 2010: 21; Piwowar and Vision 2013: 1). Third, “these re-analyses may spur enthusiasm and synergy around a specific research question, indirectly focusing publications and increasing the citation rate of all participants” (Piwowar, Day, and Fridsma 2007: 3).

The studies investigating the data sharing/citation impact relationship have done so with the use of three methods: data split and comparison, OLS analysis (with a log transformed dependent variable), or count models.

The first method, data split and comparison, comprises analyses where scholars divide their data set into ‘articles with’, and ‘articles without’ available replication data. They then compare the two samples (see Dorch 2012; Henneken and Accomazzi 2011; Ioannidis et al. 2009⁵). This has been done without controlling for any other variables (see Dorch 2012), or a few (see Henneken and Accomazzi 2011; Ioannidis et al. 2009). The results of these investigations are uncertain, in part because of the lack of control variables. This is explicitly stated by Ioannidis et al. (2009: 151-152).
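
To make the split-and-compare approach concrete, the following is a minimal Python sketch. The records are hypothetical and are not taken from any of the cited studies; the function name `split_and_compare` is likewise an assumption made for illustration. The sketch only splits the sample on data availability and summarizes citations in each group, which is exactly why such a comparison cannot rule out other explanations for a difference.

```python
from statistics import mean, median

# Hypothetical records: (citation count, replication data available?)
articles = [
    (0, False), (1, False), (2, False), (3, False), (12, False),
    (1, True), (4, True), (5, True), (9, True), (40, True),
]


def split_and_compare(records):
    """Split the sample on data availability and summarize each group."""
    groups = {True: [], False: []}
    for citations, has_data in records:
        groups[has_data].append(citations)
    return {
        label: {"n": len(values), "mean": mean(values), "median": median(values)}
        for label, values in (
            ("with_data", groups[True]),
            ("without_data", groups[False]),
        )
    }


summary = split_and_compare(articles)
for label, stats in summary.items():
    print(label, stats)
```

Note that nothing in this comparison accounts for journal, publication year, or author characteristics, so a gap between the two groups may reflect those factors rather than data sharing.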

The second method used to study data sharing and citation impact is OLS analysis with a log-transformed dependent variable. This is done by Piwowar, Day, and Fridsma (2007), and Piwowar and Vision (2013). In the latter analysis, several important control variables are included in the model, such as the date of publication, journal impact factor, open access status, number of authors, first and last author publication history, corresponding author country, institution citation history, and study topic (ibid.: 1). The trouble with these analyses is that OLS with a log-transformed variable is ill-suited to count data. I will return to this in section 4.2. Method.
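
One commonly noted difficulty with log-transforming counts can be illustrated with a small Python sketch. This is an illustration under stated assumptions, not a reproduction of the argument in section 4.2: the citation counts are hypothetical, and the helper `back_transformed_mean` and its `offset` parameter are names introduced here. The sketch shows that zero counts cannot be logged at all, that an arbitrary constant must therefore be added, and that the back-transformed average depends on that constant and falls well below the arithmetic mean for skewed counts.

```python
import math

# Hypothetical citation counts, including uncited articles and one outlier
citations = [0, 0, 1, 2, 3, 5, 8, 120]

# The logarithm of zero is undefined, so uncited articles cannot be logged as-is.
try:
    math.log(0)
    log_of_zero_fails = False
except ValueError:
    log_of_zero_fails = True


def back_transformed_mean(counts, offset=1.0):
    """Average log(count + offset), then transform back to the count scale."""
    logs = [math.log(c + offset) for c in counts]
    return math.exp(sum(logs) / len(logs)) - offset


arithmetic_mean = sum(citations) / len(citations)

print("arithmetic mean:", arithmetic_mean)
print("back-transformed, offset 1.0:", back_transformed_mean(citations, 1.0))
print("back-transformed, offset 0.5:", back_transformed_mean(citations, 0.5))
```

A count model avoids both the arbitrary offset and the back-transformation, because it models the expected number of citations directly.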

The third method is more appropriate, given the form of the dependent variable: a count model. This is done by Gleditsch, Metelits, and Strand (2003) in their analysis of JPR articles. But even though their method is an improvement compared to the ones mentioned above, the article still faced serious criticism, and multiple problems have been detected. The criticism came from Abbott (2007), and is primarily based on the distribution of citations in the sample: it is very top heavy. The top one percent of the observations has 20 % of the citations in the sample (ibid.: 211). This means that the five most cited articles have one fifth of the total number of citations.
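
The kind of concentration measure behind this criticism can be sketched in a few lines of Python. The sample below is hypothetical and constructed only so that the top one percent holds one fifth of the citations; it is not the JPR data, and `top_share` is a name introduced for this illustration.

```python
import math


def top_share(citations, fraction):
    """Share of all citations held by the most cited `fraction` of articles."""
    ranked = sorted(citations, reverse=True)
    # Round before taking the ceiling to guard against floating-point noise.
    k = max(1, math.ceil(round(fraction * len(ranked), 9)))
    total = sum(ranked)
    return sum(ranked[:k]) / total if total else 0.0


# Hypothetical sample: 99 articles with 4 citations each, one with 99 citations
sample = [4] * 99 + [99]

print("share held by top 1%:", top_share(sample, 0.01))
print("share held by top 10%:", top_share(sample, 0.10))
```

In a sample this top heavy, a handful of articles can dominate both the mean and any model fitted to the counts, which is why the distribution matters for the interpretation of an estimated effect.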

The focus on the distribution leads Abbott to criticize the original analysis on two points: the interpretation and the robustness of the results. Focusing on the former, he argues that the effect of replication data found by Gleditsch, Metelits, and Strand (2003) only matters for the top range. It is not important to go from two to four citations, what matters is “getting into a small club, the 5 to 8 percent of the articles that take half of the citation pool. That takes something more than just sharing your data” (Abbott 2007: 212). Concerning the robustness of the analysis, he also criticizes the low explanatory power of the model (ibid.). The critique from Abbott needs to be considered when interpreting the results, and I will return to these issues in the discussion chapter.

5 This article is primarily investigating a different research question, but their short (and shallow) analysis of the relationship between available data and citation impact is cited several places as evidence of this association.

3.2. OA Publishing and Citations

OA publishing is perhaps the most direct and obvious way of increasing scientific openness. It entails, broadly speaking, that research products are made available to the public free of cost. There are two main ways to publish OA: gold and green⁶.

The number of OA publications has increased over the last few years, especially since around 2000 (Laakso et al. 2011: 6)⁷. This development has sparked numerous investigations of the new publication form, and the consequences of OA have been studied extensively. This large body of literature has a substantial sub-group consisting of articles studying the relationship between OA and citation impact. This research is in many respects similar to what I do here, with replication data as the interesting effect variable instead of OA. The results of the studies of the relationship between OA and citation impact diverge (for a review of the literature, see Craig et al. 2007). Some find that OA published papers get more citations (Eysenbach 2006; Norris, Oppenheimer, and Rowland 2008), while others find other factors to be the cause of this relationship (Davis et al. 2008; Gaulé and Maystre 2011).

A consequence of the extensive effort that has been put into these studies is that expectations of what an OA effect on citation impact will look like have become fairly nuanced. These analyses do not expect a universal effect. The citation boost, if any, is likely to be dependent on other factors such as research field, the time of publication, and the proportion of OA publishing (Swan 2010: 2). In other words: the effect of OA on citations works with other factors known to affect citation impact, and not alone as an independent boost we are likely to find overall in publications isolated from their other attributes.

This is also likely to be the case when studying the association between citations and data sharing. One should expect the effect of data sharing to vary not only between disciplines, but also over time and with the nature of the discipline in question. We might expect the effect of replication data to be larger for articles that are getting citations already; it might give an extra boost when an article is already getting attention. More people could want to do similar projects and use the existing data, or replicate (or duplicate) the existing article.

6 The gold model is closest to the traditional toll based publication system, only the paying actor is changed. Instead of a subscriber paying to read the paper, the author (or a sponsor) pays to have the article published (Craig et al. 2007: 240). A journal may publish fully through gold OA, or could use a hybrid model, where some articles are toll based (traditional) and some are OA. The green OA model covers cases where the author or an institution makes the paper available online in an electronic archive (as either a pre-print manuscript, or the finished paper). Authors are also increasingly making their research available on personal web pages (ibid.). This version of OA, as opposed to gold OA, operates outside the traditional journal system.

7 More or less coinciding with Figure 1

In other words, the effect should be expected to be dependent on other factors that are known to affect the number of citations, just like with OA. To map these, we need to study the literature on citations: What needs to be taken into account when attempting to study citation impact?

That is what happens next.

3.3. Citation Theory

Scientific achievements have been rewarded in various ways throughout history. The use of ‘eponymies’⁸ is an example of a way of acknowledging scholars whose work is important and influential (Merton 1957: 642-647). Citations are a modern way of giving such rewards. They serve to acknowledge one another, and to recognize scientific work as useful, at the least.

But the relationship between publications and citations is more complex. There is more to the story than scientific excellence. Citing previous research is a scientific norm; we expect citations to be part of a scientific publication. Scholars are expected to pay homage to the authors they build on, and they are expected to build on someone. A scientific publication is not an isolated product standing completely on its own. It is part of a scientific history, and scholars are expected to honor this, in part through citations.

The reasons behind citations vary, and several typologies have been suggested (for a review of the literature, see Case and Higgins 2000). But, in the myriad of publications, why do authors cite one specific paper, perhaps among several possibilities?

Citations are dependent on many things. The author needs to know the paper exists, must deem the citation necessary, and must want to cite the paper. The potential citation will thus be dependent on characteristics of the publication in question, going beyond the scientific quality, but not on these alone. It is also possible to imagine that characteristics of the publisher play a part; one could for example want to base claims on papers published in more esteemed journals. Characteristics of the authors could also play a part, and here many aspects can be important, as will be shown below. Furthermore, times have changed, and the citation impact of an article might vary with the time of publication. In short: the number of citations an article receives might be dependent on characteristics of the paper itself, the author(s) of the paper, the publisher, and on the time of publication: who, what, where and when.

8 Where distinguished scientists are immortalized as the ‘fathers’ (or ‘mothers’) of their discipline (or sub discipline), or you get your name on the theory, hypothesis, body part, etc. you were the first to discover/write about

3.3.1. Who

There is a vast wasteland of publications that are never cited (Aksnes 2003: 159). The increase in scientific publications is making it harder for scholars to stay on top of the current research in their field, even if their interests are few and narrow. To be visible – and to get noticed – in the competitive field that science is, it is often beneficial to have a recognizable name. This is the Matthew effect, put forward by Merton (1968). Fame breeds more fame, and well-known scientists get a disproportionately large amount of recognition for their work, compared to lesser known scientists.

Simply having more authors might also increase citation impact. Several empirical investigations find a positive relationship between the number of authors and citation impact (see for example Peters and van Raan (1994)). Various mechanisms can explain this: there are more potential self-citers, the quality of the research product might improve as a result of the increased combined brain power, and the possible exposure through personal contact with other scholars increases when there are more individuals ready to spread the message (Aksnes 2003: 162).

Following the line of thought where exposure through personal connections might lead to increased citation impact, more arenas (more potential networks) might be important. Many potential arenas can be imagined: authors belonging to different institutions, disciplines or of different nationalities. The latter is shown to be true by, among others, Aksnes (2003: 162).

Two of these aspects might also have an effect through different mechanisms: the nationality of the author, and his or her disciplinary belonging.

The country of the author(s) can affect the number of citations because of the sample of articles available through WoS. This is highly skewed in favor of North American publications. It might therefore be an asset for authors if they are Canadian or from the United States. The evidence on this is not unanimous: Gleditsch, Metelits, and Strand (2003) find this variable to be significantly positive, while Piwowar and Vision (2013) find no significant evidence of such a relationship.

The disciplinary background of the author is likely to affect the number of citations a research product receives because there are differences in the expected number of citations between scientific fields (Aksnes et al. 2011: 629). But the disciplinary belonging of the author(s) could also work through a different mechanism. As many journals are discipline specific, this could benefit scholars from that exact discipline. They might be exposed to a more relevant audience and therefore perhaps expect to get more citations than scholars belonging to different fields (Gleditsch, Metelits, and Strand 2003: 91).

A much-disputed topic in bibliometric research is gender, and the relationship between gender and citation impact has been studied numerous times. The results of these investigations are not unanimous. Studies find diverging results even when they are using the same data (Maliniak, Powers, and Walter 2013; Zigerell 2015). In a large-scale investigation of Norwegian scientists Aksnes et al. (2011) find that female scientists indeed are cited less than men, although the differences are not big. They attribute the differences to publication rates – men are more productive than women are (ibid.: 633).

3.3.2. What

Characteristics of the research product itself are likely to affect the citation impact. This is reflected in the literature, where numerous aspects of the research product have been investigated in association with citation impact.

New scientific discoveries are valuable, and likely to get noticed and appreciated, or at least discussed. Getting priority (claiming to be the first to discover something) has thus sparked heated debate throughout history (Merton 1957). Papers publishing new discoveries can also expect to get more citations than traditional findings expanding on already established research.

Foster, Rzhetsky, and Evans (2015) investigate this, and show how, in chemistry, the nature of the finding affects the number of citations.

It is also likely that the topic of the paper, that is, what the paper is about, is important. Field and subfield will affect citation impact. This is partly caused by differences in publication pattern, and the fact that average citation impact varies between fields and subfields. Writing about a trendy topic, or a topic with many potential citers (more authors/publications), can also increase expected citation impact for an article. This is shown by Maliniak, Powers, and Walter (2013), who include subfield in their analysis of IR papers.

If the topic of the paper is useful for a wide range of authors, it could also increase citation impact. An example of such wide usefulness is methodological papers. This argument is put forward by Peritz (1983), who also found empirical support for the mechanism (ibid.: 217). Methodology can also matter in a different way, where more sophisticated analyses are more appreciated (or admired and copied), leading to a higher citation impact based on the degree of formalization. This aspect is also included in the analyses done by Maliniak, Powers, and Walter (2013), who show that it is important for citation impact.

3.3.3. Where

Characteristics of the publisher are likely to influence citation impact in various ways. In high impact journals, readers might expect the quality of the publications to be higher than average.

The topic of the journal might also be important in the sense that it helps decide the potential audience of a research product. More narrow journals might have a smaller readership, and thus fewer potential citers.

Other attributes of the journal that can have an effect on citation impact are language (where English language journals will reach a broader readership), scope (whether the journal has a national, regional or international focus), and the location of the journal. WoS is skewed in favor of North America, which could imply a bias towards papers published in journals from this region.

3.3.4. When

The number of scholarly publications has increased steadily over the last three centuries (Laakso et al. 2011: 2), and articles using some kind of data are now more frequent than they have been in the past (Gherghina and Katsanidou 2013: 333). The rules of the game have also changed, with replication data becoming more and more common. Together, these changes in publication pattern and publishing norms make time an important aspect to consider when trying to find relationships and correlations among factors in research publications. Time is also difficult to handle, as policies can change swiftly, technological advances can be sudden, and changes can be sharp (as will be seen), all of which can be hard to model.


4. Research Design

4.1. Data

The analyses in this study are executed with the use of two data sets: the JPR data from Gleditsch, Metelits, and Strand (2003), and the Dafoe data (Dafoe 2014). I added new variables to both of the data sets to include more aspects of publishing and citing based on the Literature Review and Theory chapter (for more detailed information on the new variables, see section 4.3. Variables).

The JPR data set consists of 430 articles from Journal of Peace Research (JPR) covering the period between 1990 and 2001. The data set includes all publications from this period. The new variables I added to this data are the ‘fame’ variable and the number of cited references in the article. I also updated the citation variable (the dependent variable).

The second data set is based on somewhat less complete data used in the article “Science Deserves Better: The Imperative to Share Complete Replication Files” (Dafoe 2014). This data set includes articles from American Political Science Review (APSR) and American Journal of Political Science (AJPS). The data set consists of 341 articles, from 2009 until 2012. The Dafoe data set only contains articles with some statistical analysis. This data set needed a bigger update than the JPR data for it to include all the aspects of publishing and citing that I want to cover.

The variables I added are: the number of citations (the dependent variable), the number of cited references in the article, gender of the author(s), whether the article is coauthored, disciplinary background of the author(s), whether the article is a product of interdisciplinary research, whether (all or some of) the author(s) are political scientists, and whether the article is a product of interinstitutional research.

I keep the two data sets separate for two reasons: one is because of the time gap, and the other is because of differences in variables and operationalizations.

4.2. Method

4.2.1 Challenges: Citation Analysis

There are some common issues facing scholars doing research on citations that need to be considered. I have identified five such challenges: (a) what constitutes ‘highly cited papers’; (b) how big the ‘citation window’ should be; (c) WoS coverage; (d) causality; and (e) overdispersion.

a) To investigate what characterizes highly cited papers, a common method is to divide a data set of articles into ‘highly cited’ and ‘not highly cited’ papers, and then compare the two samples (see for example Aksnes 2003)⁹. This is problematic, because the division between the categories will be dependent on sample characteristics and the judgement of the researcher¹⁰. This is an issue comparable to ‘the 41st chair’ – where the one(s) just outside the chosen ‘highly cited’ (the 40 chairs) will be somewhat arbitrarily left out (Merton 1968: 56). The challenge is caused by the kind of data we have. The dependent variable is continuous, without any given distinction (or natural distinction) between ‘high’ and ‘low’.
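To illustrate how much such a split depends on the chosen rule, the following minimal sketch applies an absolute limit and a relative limit to the same citation counts. The counts, the limit of 50, and the factor of 1.5 times the mean are hypothetical values chosen for illustration; they are not taken from the data used in this study.

```python
from statistics import mean

# Hypothetical citation counts for ten articles (not from the thesis data).
citations = [0, 2, 5, 7, 8, 13, 18, 25, 49, 140]


def highly_cited_absolute(counts, limit):
    """Flag articles with at least `limit` citations."""
    return [c >= limit for c in counts]


def highly_cited_relative(counts, factor):
    """Flag articles with at least `factor` times the sample mean."""
    cutoff = factor * mean(counts)
    return [c >= cutoff for c in counts]


absolute = highly_cited_absolute(citations, limit=50)
relative = highly_cited_relative(citations, factor=1.5)

print(sum(absolute), "article(s) count as highly cited under the absolute rule")
print(sum(relative), "article(s) count as highly cited under the relative rule")
```

In this made-up sample, the article with 49 citations is left out by one rule and included by the other, which is the kind of arbitrariness the ‘41st chair’ comparison points to.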

b) The second issue facing many studies of citations is the ‘citation window’: the time period within which the citations an article receives are counted. The choice of citation window can be problematic, as articles have different citation curves; they get citations at different times (Aksnes 2003: 165-166). This challenge is caused by the nature of citations. As citations accumulate over time, at differing pace, it is difficult to set a citation window suitable for all the observations in the data set. It is, like the challenge above, an issue caused by the kind of data we are dealing with.
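The effect of the window length can be sketched as follows. The function name, the publication year, and the two lists of citing years are hypothetical and only serve to show how two articles with the same total number of citations can look very different under a short window.

```python
def citations_in_window(publication_year, citing_years, window):
    """Count citations received within `window` years after publication."""
    last_year = publication_year + window
    return sum(publication_year <= year <= last_year for year in citing_years)


# Two hypothetical articles, both published in 2000 with six citations in total.
early_peak = [2000, 2001, 2001, 2002, 2002, 2003]
late_peak = [2003, 2006, 2007, 2008, 2008, 2009]

for window in (3, 10):
    early = citations_in_window(2000, early_peak, window)
    late = citations_in_window(2000, late_peak, window)
    print(f"window={window}: early peak={early}, late peak={late}")
```

With a three-year window the late-peaking article appears almost uncited, while a ten-year window treats the two articles as equal.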

c) The third problem facing scholars who do research on citations is the source they get information from: WoS or Scopus (or similar data banks). These databases can introduce biases through their inclusion or exclusion of journals and articles. Journal (or article) language, journal (or article) nationality, and scientific discipline are three factors where the source causes biases. Considering language and nationality, WoS (where much of the data from this study is gathered) is skewed in favor of English-language, North-American journals. This could thus introduce a bias, for example if an article is more relevant for and thus more cited from a specific country where the majority of publications are published nationally in a language other than English. The percentage of publications in English within a field will also have consequences for the coverage in WoS and Scopus (for example national law versus international law).

The scientific discipline one wishes to study could also introduce bias in an analysis, because the WoS and Scopus coverage of the natural sciences is better than that of the social sciences and the humanities. This in turn is largely caused by the differing publication patterns of the disciplines. Whereas the natural sciences primarily publish articles, books and monographs are more common in the social sciences and the humanities, and WoS and Scopus have better coverage of journal articles than of other publication forms. Many of the citations in the social sciences and humanities are thus not present in the data from WoS and Scopus, because they come from publications that are not in these databases. The coverage of political science in WoS is 45 % for all publications, and 64 % for journals. Political science articles published in high impact journals and IR are relatively well covered (Sivertsen 2014: 601).

9 This is similar to the method I called ‘data split and comparison’ in the Theory & Literature Review section.

10 What is common is either an absolute limit or a relative one, meaning that the divide is set at some percentage above the mean (or similar).

d) Causality is difficult. Data sharing and citation impact might be associated in various ways, some of which would include a serious problem with endogeneity¹¹. I have identified three ways replication data and citations might be associated.

In a perfect world (considering the analysis that is to follow), researchers would share their data regardless. If everyone did that, no matter the quality and popularity of their research products, we would not have any issues with causality. If we were to find a relationship between available replication data and citation impact, we could be sure that the effect went from replication data to citation impact: if you make your data available, you get (on average) more citations. But, of course, it cannot be that simple.

Making your data available means opening your research up to increased scrutiny. It means that others can assess not just the results, but also (more or less) the entire research process. And it is easy to imagine how authors might be more comfortable with this increased scrutiny if they are confident in what has been done. This could introduce a relationship between the quality of the research product and available replication data, and such a relationship has been found empirically. Wicherts, Bakker, and Molenaar (2011: 1) find that the “[w]illingness to share research data is related to the strength of the evidence and the quality of reporting of statistical results”. This introduces an endogeneity nightmare for analyses like mine. If the papers with available replication data simply are better, and the data availability is just an expression of this quality difference, we will not be able to pick that up. We are in trouble, and this is only the beginning.

If a study is published and gets a lot of attention, it could spark new enthusiasm about the topic, and an urge among fellow scientists to check the results or to use the original data to investigate similar research questions (Piwowar, Day, and Fridsma 2007: 3). If the author has not published the data, this could lead to a bombardment of requests for the original empirical material. One can imagine how a large number of requests could persuade the author to release the data to everyone, if not out of goodwill then perhaps to make the requests stop. This would mean that the authors of an article that has already received much attention (and is thus likely to be highly cited) are also more likely to make their replication data available. High citation impact thus leads to the release of replication data, and our preferred direction of causality is not just unclear but reversed.

11 What leads to what? Does data sharing increase citation impact, or does high citation impact improve the likelihood of data sharing?

The two preceding paragraphs suggest we have a huge challenge in terms of causality. This is, for the most part, not addressed in the previous studies presented in the literature review, but it needs to be taken very seriously. Luckily for us, it can be solved by the replication policies of the journals, if they require all their authors to publish their data. This is the case for parts of our data sets. I will return to this.

e) The last issue I identified is that of potential overdispersion. We have overdispersion in a model when the variance in the sample is greater than the mean. One of the causes of this is that the dependent variable might be ‘producing more of itself’: high values on the dependent variable contribute to even higher values on the same variable (Hilbe 2014: 82). Citations in themselves could lead to more citations, which is known as ‘the Matthew effect’ (Merton 1968b). The effect is disputed, but it nevertheless produces a theoretical expectation of overdispersion.
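A first, rough check for overdispersion is to compare the sample variance of the counts with their mean. The sketch below does this for a small set of hypothetical citation counts (not the thesis data). Note that this is only the unconditional comparison described above; in a regression model, the comparison applies to the variance and mean given the explanatory variables.

```python
from statistics import mean, variance

# Hypothetical citation counts (not from the thesis data).
citations = [0, 0, 1, 2, 3, 5, 8, 13, 40, 110]

sample_mean = mean(citations)
sample_variance = variance(citations)  # sample variance, n - 1 in the denominator

# A ratio well above 1 points towards overdispersion; a ratio near 1 is
# what a Poisson distribution would imply.
dispersion_ratio = sample_variance / sample_mean

print(f"mean = {sample_mean:.2f}")
print(f"variance = {sample_variance:.2f}")
print(f"variance / mean = {dispersion_ratio:.2f}")
```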

Summing up: There are some serious challenges inherent in the data, some of which can be modelled in the analysis, and some of which cannot. The challenges that are caused by the kind of data we are dealing with and the form of the dependent variable can in large part be included. The challenges that we cannot model need to be taken into account in other ways: through careful interpretation and consideration of the generalizability of the results.

4.2.2. Count Models

The method we choose to study the association between data sharing and citation impact needs to take as many as possible of the challenges mentioned above into account, and still use as much of the information that exists in the data as possible. Two of the methods presented in the literature review do not do this to a satisfactory degree. Neither the split data set nor OLS analysis is ideal when we have citations as a dependent variable, because it is a count.

The analyses using the first method, the data split and comparison, do not control for (almost) any variables. When so many aspects of publishing and citation giving are left out, the results of the analysis are uncertain at best. The method also uses very little of the available information, because of the rough split of the data set.


The second method, OLS, is not appropriate when the dependent variable is a count. OLS assumes constant variance and normally distributed errors, which will not always be the case with count data. In addition, OLS might predict negative counts (which is impossible; no one can get fewer than zero citations), and the model will struggle with the zeros, which are likely to be overrepresented (Crawley 2013: 579).
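The problem with negative predictions can be shown with a small, hand-computed least squares fit. The data below are hypothetical (an invented explanatory variable `x` and invented citation counts), and `fit_ols` is an illustrative helper, not part of any analysis in this study.

```python
def fit_ols(x, y):
    """Return (intercept, slope) of a simple least squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data: a skewed count outcome with several small values.
x = [0, 1, 2, 3, 4]
citations = [0, 0, 1, 3, 12]

intercept, slope = fit_ols(x, citations)
predicted_at_zero = intercept + slope * 0

print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
print(f"predicted citations at x = 0: {predicted_at_zero:.2f}")
```

The fitted line predicts a negative number of citations at `x = 0`, even though no observed count is below zero.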

The main reason for the choice of model is (as mentioned) the form of the dependent variable: the number of citations an article has received. This is count data, because the variable is enumerated. Specifically, it means observations that have only nonnegative integer values ranging from zero to some greater undetermined value (Hilbe 2014: 2). In short, things we can count.

There are several count models available to analyze such data, depending on the appropriate distribution. All of these models answer well to some of the challenges mentioned above. As count models are regression models, we do not have to define what constitutes ‘highly cited’ papers, or define a ‘citation window’. Overdispersion can be accommodated in a count model by adding an extra parameter that allows more flexibility. This is the case with a negative binomial specification. If overdispersion is not evident in the data, it does not matter; the extra parameter will have no effect, and the model will in practice reduce to a Poisson model, which is simpler (Long 1997: 230, 237). In other words, the model is not unnecessarily complicated. But even with this model, some of the challenges with citation analysis still apply.
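The way the negative binomial model collapses into the Poisson model can be sketched numerically. The code assumes the common parameterization with mean `mu` and dispersion parameter `alpha`, so that the variance is `mu + alpha * mu**2`; the symbol names and the example values are assumptions made for illustration and are not taken from the sources cited above.

```python
import math


def poisson_pmf(k, mu):
    """Poisson probability of observing count k when the mean is mu."""
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))


def negbin_pmf(k, mu, alpha):
    """Negative binomial probability with mean mu and variance mu + alpha * mu**2."""
    r = 1.0 / alpha
    log_pmf = (
        math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
        + r * math.log(r / (r + mu))
        + k * math.log(mu / (r + mu))
    )
    return math.exp(log_pmf)


mu = 2.0
for alpha in (1.0, 0.1, 0.0001):
    print(f"alpha={alpha}: P(0) = {negbin_pmf(0, mu, alpha):.4f}, "
          f"P(3) = {negbin_pmf(3, mu, alpha):.4f}")
print(f"Poisson:    P(0) = {poisson_pmf(0, mu):.4f}, P(3) = {poisson_pmf(3, mu):.4f}")
```

As `alpha` approaches zero, the negative binomial probabilities approach the Poisson probabilities; with a larger `alpha`, more probability is placed on zero and on very high counts.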

The problems caused by the WoS coverage cannot be included in the analysis. This would be possible with a sufficiently large data set, but our data is too limited and also confined to one scientific discipline. The consequence for this study concerns the generalizability of the results. As expected citation impact varies between fields, and the proportion of the total citations we are able to account for is unique for each discipline, the results will only be valid for political science publications.

Causality is, as mentioned, an issue. The Dafoe data allows us to control for the observations that actually have data, but where this is not stated in the article. These observations are potentially problematic because there is a possibility (as pointed out in alternative three above) that the data has been added after the publication because of requests and attention. For the articles published in AJPS and JPR after 1998, data submission is mandatory. This means that for these observations, a bias caused by endogeneity is less likely.


4.3. Variables

I coded all the new variables (and updated the citation variable in the JPR data) manually through the fall of 2015, and the information was for the most part gathered from WoS (more detailed information for each variable follows). I did all the primary coding in Excel, and some modifications, recoding, and restructuring in R.

As I did all the data collection and coding myself, there might be some wrongly coded units, affecting the reliability of the analysis. There is not much to do about this. I did my best, and inspected all the variables visually to see if there were any obvious mistakes. I also went over the citation variable twice (checked it against WoS again after the initial coding) to make sure (as well as I could) that all the information was correct.

4.3.1. Dependent Variable: Citations

The dependent variable is the number of citations for each article in the data set. This is not a measure of the quality of a research product, but shows the use the scientific community has had of the article (Gleditsch 1993: 445).

Figure 2: Total number of citations in the data, Dafoe data and JPR data combined

I added the citation count for each article to both of the data sets. I gathered the information from WoS, and coded the variables in November 2015: the Dafoe data between the 9th and 10th of November, and the JPR data between the 11th and 12th of November. The distribution of citations differs in the two data sets, and this is in large part caused by differences between the journals, which I explore further below (in section 4.3.4. Control Variables).

4.3.2. Effect Variable 1: Replication Data

Variables showing whether replication data is available were already in both of the data sets, but the operationalization differs.

In the JPR data, the ‘data’-variable is a dichotomy showing whether the articles have available replication data. There are many more articles in the JPR data set without replication data than with: 75 observations with replication data, and 355 without. If we look at the JPR data, and just compare the articles with and without replication data, at first glance it seems that the papers with replication data are more cited than the ones without.

Table 1: Replication data, JPR

Replication data      Yes      No
N                     75       355
Citations, mean       44.17    13.34
Citations, median     25       7

The Dafoe data has two variables showing data availability. The first shows whether the article explicitly states that data is available. The second shows whether the data collectors were able to locate these data. In the process of data gathering, Dafoe and his helpers first searched for data only among the articles that state data availability. This procedure gives the following three options:

Table 2: Replication data options, Dafoe

States availability   No    Yes   Yes
Data really there     -     No    Yes
Options               1     2     3

But, luckily, they did not stop there. After this initial data collection, they also searched for replication data among the articles that do not state that data is available from their analyses.

Put together, this means we get a data variable with four different options. I recoded the four options (based on the two variables in the original Dafoe data) into one categorical variable, as shown in Table 3.

As can be seen from Table 3, the mean and median number of citations is higher for the publications where replication data really is available. The papers where data really is shared, but where it is not stated in the article, have a substantially higher mean and median than the rest. This could be an indication of endogeneity problems (as replication data might come after the publication because the article has gotten much attention).

Table 3: Replication data, Dafoe

States data           No       No       Yes      Yes
Data really there     No       Yes      No       Yes
Given value           0        1        2        3
Citations, mean       23.48    34.57    11.55    17.16
Citations, median     18       25       9        12
N                     135      30       56       121
N, total              342
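The recoding summarized in Table 3 can be sketched as a small function that combines the two indicators into the given values 0 to 3. The function and argument names are hypothetical, as are the five example articles; the group sizes in the last step are the N values reported in Table 3.

```python
from collections import Counter


def recode_replication(states_data, data_found):
    """Combine the two Dafoe indicators into the 0-3 category of Table 3.

    0: does not state data, data not found
    1: does not state data, data found
    2: states data, data not found
    3: states data, data found
    """
    return 2 * int(states_data) + int(data_found)


# Hypothetical articles as (states_data, data_found) pairs.
articles = [(False, False), (False, True), (True, False), (True, True), (True, True)]
codes = [recode_replication(states, found) for states, found in articles]
print(Counter(codes))

# Group sizes as reported in Table 3; they add up to the reported total.
table3_n = {0: 135, 1: 30, 2: 56, 3: 121}
total = sum(table3_n.values())
print("N, total:", total)
```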

4.3.3. Effect Variable 2: Fame

The most notable addition to the JPR data set in this study is a variable showing the ‘fame’, or seniority, of the author. As the explanatory power of their model is low, Gleditsch, Strand, and Nordkvelle (2014: 9) suspected there to be an omitted variable bias. The source of this bias could be the Matthew effect: the fame – or seniority – of the author(s) should be included in the model. They tried to control for it using the author’s age as a proxy, but this did not improve the model fit. Here, I include their preferred operationalization in the analysis of the JPR data.

To create the fame variable, I coded all authors in the data set with the number of citations they had in the year of publication, minus the citations from the article itself. I got the information from WoS. The coding of the variable proved a challenge. When the authors did not have a ResearcherID or a very unusual name (so that all the hits obviously were from the same author), I had to look up their CVs online and narrow the search by research field, coauthors, and journals. Some authors change their names, or publish under different names or with different spellings. I have corrected this for the ones I discovered. Some authors did not have an available online CV, personal university page, Wikipedia page or anything similar. These authors were very challenging, and I had to look up individual papers and narrow the search by removing other authors one by one. I assembled the variable by summing the author citations for each unit: the combined citations of all the authors are the value for each article.
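The construction of the fame variable can be sketched as follows. The sketch reads ‘the citations from the article itself’ as the citations the article in question had already received, which is an assumption; the field names and the example numbers are hypothetical and not taken from the JPR data.

```python
def author_fame(total_citations, citations_to_this_article):
    """Citations an author had in the publication year, excluding the article itself."""
    return total_citations - citations_to_this_article


def article_fame(authors):
    """Sum the fame of all authors of one article."""
    return sum(
        author_fame(a["total_citations"], a["citations_to_this_article"])
        for a in authors
    )


# Hypothetical coauthored article: both authors' totals include 2 citations
# to the article itself, which are subtracted for each author.
coauthored = [
    {"name": "Author A", "total_citations": 120, "citations_to_this_article": 2},
    {"name": "Author B", "total_citations": 35, "citations_to_this_article": 2},
]

# Hypothetical article whose single author had no citations at publication.
newcomer = [
    {"name": "Author C", "total_citations": 0, "citations_to_this_article": 0},
]

print(article_fame(coauthored))
print(article_fame(newcomer))
```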

I only added the fame variable to the JPR data because of the time limitation, and the extensive work that went into making the variable.

The fame variable is extremely skewed. There are 128 articles where none of the authors had any citations at the time of publication. At the other end of the scale is an article by Sagan and Turco (1993), where the combined citation number for the two authors surpasses 8000 (Sagan more than 6000, Turco more than 2000).

Figure 3: The ‘fame’ variable, JPR data

The skewness of the variable needs to be considered in the analysis, especially the Sagan/Turco article, which is an outlier and might have a big influence on the results.

4.3.4. Control Variables

Characteristics of the publisher are important. The journals in the two data sets employed here are similar in many respects: they are all English language high impact political science journals. But, even if there are similarities, the journals differ a great deal in some important aspects: the expected audience, replication policies and citation pattern.

First: the topic of the journal and the audience it attracts are important in this study because there is a potential difference between the two data sets. JPR might attract more readers from poorer countries because the content of the journal is in large part relevant for people and scholars in conflict-ridden parts of the world. These scholars are perhaps from lesser-known research institutions and might have more difficulty publishing in esteemed, English language journals. This could lead to a bias, where citations from an audience that is bigger for JPR than for the other journals in this study are not covered.

Second: replication data policies. Journal of Peace Research (JPR) and American Journal of Political Science (AJPS) demand that authors make their data (and syntax etc.) available on their website in a data archive administered by the journal (American Journal of Political Science 2016; Journal of Peace Research 2014: 9). When the Dafoe data set was assembled, and at the time of publication of the articles in this data set, APSR instructed its authors to make clear where to find the data used to support empirical claims, including syntax etc. This could be done by specifying where the data and additional information could be found online. Exceptions to the rules needed to be well grounded (American Political Science Review 2016b).

The policies today are the same as when the articles in my data set were published for JPR and AJPS (Dafoe 2014: 61). APSR changed their policies in the beginning of 2016. They are now in line with the recommended policies from APSA (and the DA-RT initiative) (American Political Science Review 2016a). This does not influence the following analyses.

Third, the journals differ when we look at their citation patterns. The average number of citations (the mean) is 18.72 in the JPR data, and 20.26 in the Dafoe data. When separating the two journals in the latter data set, it turns out that the average number of citations for AJPS is 17.55, while it is 26.65 for APSR. The median, on the other hand, is 8 in the JPR data, but 14 in the combined Dafoe data. For APSR the median is 18 and for AJPS it is 13. The journals also differ substantially in the number of zeroes. In APSR there is only one article without citations, in AJPS there are three, while JPR has as many as 33 articles without any citations. On the other hand, JPR has by far the most cited article, with 618 citations, compared to 140 in APSR and 110 in AJPS. This suggests that the distribution of citations differs between the two data sets, and between the three journals.

Table 4: Comparison of journals

Journal                             APSR       AJPS       JPR
Citations, mean                     26.65      17.55      18.72
Citations, median                   18         13         8
Highest cited                       140        110        618
Number of zeroes                    1          3          33
Percentage with replication data    29.41 %    49.58 %    17.44 %
N                                   102        240        430
N, total                            772
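The group means reported in the tables can be combined into the overall means given in the text by weighting each group mean by its N. The sketch below does this with the values from Table 4 (APSR and AJPS) and Table 1 (JPR with and without replication data). The helper name is hypothetical, and because the reported means are rounded, the results are only expected to match to two decimals.

```python
def pooled_mean(groups):
    """Weighted mean of group means; `groups` is a list of (n, mean) pairs."""
    total_n = sum(n for n, _ in groups)
    return sum(n * group_mean for n, group_mean in groups) / total_n


# (N, mean citations) per journal in the Dafoe data, from Table 4.
dafoe_by_journal = [(102, 26.65), (240, 17.55)]

# (N, mean citations) with and without replication data in the JPR data, from Table 1.
jpr_by_data = [(75, 44.17), (355, 13.34)]

print(f"Dafoe data, overall mean: {pooled_mean(dafoe_by_journal):.2f}")
print(f"JPR data, overall mean:   {pooled_mean(jpr_by_data):.2f}")
```

Both results agree with the overall means of 20.26 and 18.72 stated in the text.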

To control for developments over time, there are variables showing publication year in both of the data sets. Time is a complicated matter when we look at the JPR data. There is a sudden shift at the journal, caused by a change of norms concerning replication data and/or journal policy (Gleditsch, Strand, and Nordkvelle 2014: 16). After about 1998, the number of articles with available replication data increases sharply, and such a sudden change is difficult to model. This issue must be considered when interpreting the results.

Figure 4: Articles without replication data, JPR

Figure 5: Articles with replication data, JPR
