Final Report of the Task Group on GBIF Data Fitness for Use in Agrobiodiversity

Final version 1.0 published on 15 February 2016

Authors (in alphabetical order)

Elizabeth Arnaud, Bioversity International, France - Task Group Chair

Nora Patricia Castañeda-Álvarez, CIAT, Colombia and University of Birmingham, UK

Jean Ganglo Cossi, University of Abomey-Calavi, Benin

Dag Endresen, GBIF Norway, University of Oslo, Norway

Ebrahim Jahanshiri, Crops for the Future, Malaysia

Yves Vigouroux, Institut de Recherche pour le Développement (IRD), France

GBIF contact

Dmitry Schigel, GBIF Secretariat ([email protected]) - Programme Officer for Content Analysis and Use

Acknowledgements

Many thanks to Abdallah Bari, Ahimsa Campos Arceiz, Alberto Tanzi, Arthur Chapman, Asha Karunaratne, Aoudji Augustin, Aryo Feldman, Axel Diederichsen, Christian Leclerc, Christoph Germeier, Chrystian Camilo Sosa, Colin Khoury, Daniel Callo-Concha, Dro Daniel Tia, Evert Thomas, Fabrizio Celli, Gueye Mathieu, Hannes Gaisberger, Harold Achicanoy, Helmut Knüpffer, Holly Vincent, Igor Loskutov, Joana Magos Brehm, Jose Iriondo, Koffi Kouao Jean, Koura Kourouma, Lee Belbin, Maarten van Zonneveld, Mame Codou Gueye, Marc Deletre, Marcelo Simon, Matija Obreza, Marie-Angelique Laporte, Mauricio Parra Quijano, Nidhi Nagabhatla, Nigel Maxted, Ola Westengen, Peter Desmet, Priscila Ambrosio Moreira, Raymond Sognon Vodouhe, Razlin Azman, Reinhard Simon, Robin Goffaux, Ruth Bastow, Samy F. Gaiji, Sean Mayes, Severin Pohlreich, and Theo van Hintum for participation in the online survey; Sue Walker kindly helped to distribute the information about the survey. Thanks to Andrea Hahn, Mélianie Raymond, and Tim Robertson for comments on the first draft report. Many thanks to the GBIF secretariat and Bioversity International for setting up and hosting this task group.

Arthur Chapman, Lee Belbin, Joana Magos Brehm, Shelagh Kell, and Mauricio Parra Quijano provided valuable feedback on the first draft report. We particularly wish to acknowledge Joana Magos Brehm for providing especially detailed and constructive comments and suggestions.

Special thanks to Dmitry Schigel, Programme Officer for Content Analysis and Use, GBIF, for facilitating the task group work, the survey preparation and meetings.

Document history

Draft version 0.1 released on 2 October 2015

Final version 1.0 published on 15 February 2016

Top recommendations for GBIF Data Fitness for Use for Agrobiodiversity

• The Multi Crop Passport Data (MCPD) is the data exchange standard for describing crop samples held in gene banks. GBIF must index data attributes described with the MCPD terms to stimulate the use of gene bank data and other ABD data published in GBIF. Most of the MCPD terms were mapped to Darwin Core terms (see table 1). Therefore, to enable full compatibility between these standards, only a few terms need to be added to the GBIF data profiles, following the model proposed in the existing Darwin Core germplasm extension. This will be achieved by including the Darwin Core germplasm extension in the GBIF data indexing routines. (Recommendation 6.2.1).

• A more formal agrobiodiversity (ABD) community governance policy is needed for the germplasm extension. The Biodiversity Information Standards (TDWG) could be a suitable platform for implementation of a formal agrobiodiversity community governance policy for the Darwin Core germplasm extension. Darwin Core germplasm extension should be maintained by a TDWG task group to reach a stable standard for germplasm accessions conserved in gene banks (ex situ conservation), and should be expanded to address needs of data on the in situ and on-farm conservation. (Recommendation 6.2.5).

• Authoritative checklists and classification of crop wild relatives, cultivars, landraces and neglected and underutilized crop species, including vernacular names from authoritative lists along with the language and countries where it applies, should be added to GBIF when developed and validated by an international expert group and community. (Recommendation 6.4.1).

• GBIF should seek to support the integration of popular data cleaning tools such as GEOLocate, OpenRefine (formerly Google Refine), and workflow services from BioVeL and other Galaxy or Taverna compliant protocols with data published to the GBIF portal. It is also important to take into account the requirements of the use cases that are being developed by a task group of the TDWG/GBIF data quality interest group. (Recommendation 6.8.2).

• GBIF should improve routines for preliminary quality assessment of data records and datasets (aggregated records) giving levels of confidence to individual data record or datasets and highlight issues to data suppliers. A level of confidence can only be applied within a specific context so a weighting of the scores (possibly ‘weighted completeness’ and ‘weighted issues’) should be proposed in the context of use by ABD community. (Recommendation 6.11.1).

• GBIF should develop or adapt existing tools to: (a) identify quality improvement thresholds based on the decided weighting of scores such as unreliable coordinates; identify issues with taxon names such as completeness of name-strings and up-to-date nomenclature and whether names are backed by publication reference, sequence, or expert; (b) check the completeness of the data (e.g. index of passport data completeness) through possibly two scores: ‘weighted completeness’ and ‘weighted issues’; (c) provide the percentage of records with actual data reported for each attribute (data column), possibly with two scores: ‘weighted completeness’ and ‘weighted issues’. (Part of Recommendation 6.11.2).

• Expand the data attributes made available for search from the GBIF portal. Include the most important agrobiodiversity terms from the MCPD and the corresponding Darwin Core germplasm extension as searchable information attributes (such as gene pool and taxon group concepts, trait information, characterization and evaluation data, pre-breeding and breeding information) (Recommendation 6.12.3).

• Resources for the ABD community such as the global crop wild relative species checklist (http://www.cwrdiversity.org/checklist/) and the Bioversity Collecting Mission database must be published to the GBIF portal; the checklist should be added to the GBIF registry of checklists and integrated into the GBIF taxon backbone. To complement this global list, other crop wild relative (CWR) checklists can be proposed for publishing as taxon checklists in GBIF, starting with the list developed by the Southern African Development Community (SADC)-CWR project, which includes a global list of crop genus names that is a useful tool for compiling national species lists of crop wild relatives. (Recommendations 6.3.1 and 6.6.1)

• Stimulate the digitization of relevant collections (i.e. herbaria, gene banks, published articles, MSc and PhD theses, national and regional projects) related to ABD, and stimulate the publishing of already digitized collections, by providing small grants through competitive calls (Recommendation 6.6.2).

• Train the GBIF Nodes on the value of CWRs, and mobilization of data on crop wild relatives and on species traits useful for crop improvement and for landscape restoration (Recommendation 6.3.4).
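The ‘weighted completeness’ score and the per-attribute percentage of filled records mentioned in the recommendations above are not defined formally in this report. The following is a minimal sketch of one possible reading, assuming that completeness is the weight of the filled attributes divided by the total weight. The function names, attribute names and weights are hypothetical and would have to be agreed by the ABD community.

```python
def is_filled(value):
    """Treat None and blank strings as missing data."""
    return value is not None and str(value).strip() != ""


def weighted_completeness(record, weights):
    """Share of the total attribute weight carried by filled attributes (0.0 to 1.0).

    `weights` maps attribute name -> importance for a given context of use.
    The weights are an assumption of this sketch, not values from the report.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    filled = sum(w for attr, w in weights.items() if is_filled(record.get(attr)))
    return filled / total


def percent_filled(records, attributes):
    """Percentage of records with actual data reported for each attribute."""
    if not records:
        return {attr: 0.0 for attr in attributes}
    return {
        attr: 100.0 * sum(is_filled(r.get(attr)) for r in records) / len(records)
        for attr in attributes
    }
```

Keeping the weights as an input rather than a constant reflects the point made above that a level of confidence only makes sense within a specific context of use: a species distribution modeller would weight coordinates heavily, while a gene bank curator might weight the biological status of the sample.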

1. Abstract

Human wellbeing and food security in a changing climate depend on productive and sustainable agriculture. For this, policies based on analyses and research results are vital to establish conservation priorities of natural resources that underpin the enhancement of sustainable food production. Therefore, data from agrobiodiversity and wider biodiversity sources are required to be available and accessible. Currently, there is a risk that the agrobiodiversity and the wider biodiversity data communities remain separated, with inefficient data aggregation, unless data flow pathways are harmonized. GBIF has a role to play in contributing to the convergence of the two communities. With agrobiodiversity data integrated in GBIF, biodiversity data, in particular on wild relatives of cultivated species, will flow more easily into agrobiodiversity conservation priority assessments and analyses.

The Task Group on Data Fitness for Use in Agrobiodiversity was established by the GBIF Secretariat and Bioversity International to help improve the fit of data related to agrobiodiversity to the variety of important uses required and requested by the community of research and policy. The task group has been looking at the key actions for creating interoperability of data on ex situ, in situ and on-farm conservation of agrobiodiversity, with a focus on plants. A survey and interviews of selected experts and ABD data practitioners were conducted to collect feedback on fitness for use and issues with GBIF-mediated data.

The 53 recommendations of the task group cover the whole data flow, from publishing to data use with a focus on agrobiodiversity, also considering the role of nodes in data mobilization and in promotion and training. Some key recommendations are to (i) promote GBIF to the agrobiodiversity community, (ii) integrate the terms from the long-standing Multi Crop Passport Data standard (MCPD), already used for several decades by agricultural gene banks, into Darwin Core indexed attributes, (iii) install proper governance so that the Darwin Core germplasm extension can be maintained as a stable international standard, (iv) develop agrobiodiversity user profiles on the GBIF data portal to improve the user experience in accessing data of interest, (v) add infraspecific taxonomy levels to ensure adequate publication of agrobiodiversity data, by integrating into the GBIF taxonomic backbone the reference taxonomies used by the community with additional attributes related to crop wild relative species, landraces and cultivars, (vi) publish existing digitized ABD data collections, such as the Bioversity Collecting Mission database¹ and the Crop Wild Relative Global Occurrence dataset², to support capacity building of agrobiodiversity data publishers, (vii) provide quality filtering of the data using only attributes of interest to agrobiodiversity data users, (viii) assign a level of confidence to individual data records, and (ix) channel feedback to data suppliers. Additionally, GBIF needs to provide tools and services to discover, mobilize, or link to additional specialized data sources commonly used by the agrobiodiversity community. Integrated access from GBIF to external sources of key agrobiodiversity data would be an added value for the community.

1 http://bioversity.github.io/geosite/

2 http://www.cwrdiversity.org/checklist/cwr-occurrences.php

The task group identified increasing the knowledge of the nodes about agrobiodiversity data through training as a key step to enable them to play a more prominent role in the mobilization of locally available information resources on ABD.

A priority setting of these recommendations, with the feedback of the ABD community, the GBIF country parties and the expert knowledge of the GBIF secretariat and nodes, is needed.

2. Scope: what is agrobiodiversity and why it matters?

“Agricultural biodiversity [agrobiodiversity or ABD in this report] is the diversity of crops and their wild relatives, trees, animals, microbes and other species that contribute to agricultural production. This diversity exists at the ecosystem, species, and genetic levels and is the result of interactions among people, biodiversity components, and the environment over thousands of years. The use of agricultural biodiversity can help make agricultural ecosystems more resilient and productive; and can contribute to better nutrition, productivity and livelihoods” (Bioversity International)³.

Note of the task group: Given that agrobiodiversity covers a large area of research, and that the focus of this task group is ‘crop diversity’, it is worth acknowledging that a group of livestock experts should be convened to extend the recommendations related to animal diversity.

Agrobiodiversity contributes to farmers’ resilience to climatic events and plant pathologies, and provides options for adaptive strategies to environmental and economic changes; it supports the restoration of ecosystem services and provides a genetic reservoir of new traits and species for farming. An estimated total of 35,000 plant species are cultivated by humans for use in gardening, landscaping and agriculture. Of these, an estimated 7,000 species are cultivated for use in agriculture (Khoshbakht and Hammer 2008). About 1,000 species of cultivated plants are threatened globally (Khoshbakht and Hammer 2007).

Addressing the loss of species and genetic resources is critical for improving crops, coping with pests and diseases, maintaining soil health, global freshwater and pollinators, and adapting to climate change. Pollinators contribute to the production of over 80% of crops traded on the world market and up to 10–16% of global yearly harvests are lost to plant diseases.

The discovery, access and adequate use of primary biodiversity data is critical to inform decision making to achieve sustainable use of agrobiodiversity resources, to secure their availability in the future, and to address many of the world’s key challenges such as feeding a growing human population, and developing more productive and sustainable agriculture under climate change. It is estimated that various agrobiodiversity data portals and institutions (Genesys⁴, EURISCO⁵, GRIN⁶, CIAT, FAO, national and regional gene banks) collectively house some 7.4 million specimens (FAO, 2010)⁷ of cultivated species and cultivated varieties, genetic samples and other important evidence of patterns and trends in global biodiversity. High numbers of species and specimens of crop wild relatives need to become digitally available alongside the data on cultivated species. Only a fraction of this vast databank of species information and genetic material is freely and digitally available.

3 http://www.bioversityinternational.org/why-agricultural-biodiversity-matters-foundation-of-agriculture/

4 https://www.genesys-pgr.org/welcome

5 http://eurisco.ipk-gatersleben.de/

6 http://www.ars-grin.gov/npgs/searchgrin.html

Cultivated crop species for food and agriculture are generally conserved ex situ in gene bank collections. Traditional cultivars or landraces can also be conserved on-farm, in active farming. Species of crop wild relatives (CWRs) are generally conserved in situ in their natural habitat. Currently only a few prioritized and important populations of CWRs have been collected and conserved ex situ in botanical gardens and gene bank collections.

Definition of Crop Wild Relative species (CWR)

"those wild plant taxa more or less closely related to species of direct socio-economic importance including food, fodder and forage crops, medicinal plants, condiments, ornamental and forestry species, as well as those related to crops used for industrial purposes such as oils and fibres" (Maxted et al. 2006). Crop wild relatives are used as a source of genes for plant improvement.

Definition of Landrace

“Landraces have a certain genetic integrity. They are recognizable morphologically; farmers have names for them and different landraces are understood to differ in adaptation to soil type, time of seeding, date of maturity, height, nutritive value, use and other properties. Most important, they are genetically diverse” (Harlan 1975).

"A landrace is a dynamic population(s) of a cultivated plant that has a historical origin, distinct identity and lacks formal crop improvement, as well as often being genetically diverse, locally adapted and associated with traditional farming systems" (Villa et al. 2005).

Neglected and Underutilized Species (NUS)

Also called ‘orphan crops’, NUS are plant species and varieties of importance for the rural communities but to which little or no attention is paid by agricultural researchers, plant breeders and policymakers. NUS are not widely traded (Padulosi et al. 2013) and are represented by wild, semi-domesticated or local varieties and many non-timber forest species, adapted to local and often marginal areas. According to the combined gene pool concept (Harlan and de Wet 1971) and Taxon Group categorization (Maxted et al. 2006), many NUS are also classified as CWR.

7 http://www.fao.org/docrep/013/i1500e/i1500e00.htm

3. Rationale: what can GBIF do for agrobiodiversity data users?

GBIF.org as a global biodiversity data platform can play an important role in the agrobiodiversity landscape by mobilizing and connecting biodiversity datasets that can support research and development for food security and ecosystem services resilience. As part of a broader global strategy on fitness for use of biodiversity data, GBIF and Bioversity International convened a Task Group on Data Fitness for Use in Agrobiodiversity in March 2015. The Task Group identified the need to bridge ecological and agricultural data that are relevant for agrobiodiversity and agroecology uses. In general, the plant ex situ conservation data are in a good state with developed data and metadata solutions.

However, data on in situ and on-farm conservation, use and management in agrobiodiversity are not yet fully available and often unstructured. Although occurrence data from many of the gene bank collections, including landraces and collected CWR resources (conserved ex situ), are already published and made available in the GBIF portal, the agrobiodiversity community does not use GBIF much. The probable reasons are that (a) a number of well-established ABD data portals exist (such as Genesys, EURISCO and a number of crop-specific databases, each serving different subsets of the ABD data) and (b) research on plant genetic resources diversity generates data at the intraspecies level, mainly unstructured and requiring specific attributes to describe data sets that are not yet available through GBIF.

However, GBIF has great additional potential for the ABD community, providing integrated access from one single portal to ABD-related data, including data on crop wild relatives (which are generally under-represented in ex situ gene bank collections, while in situ monitoring data are in general severely under-represented in the ABD data portals), data for landscape restoration, crop improvement, eco-geographic land characterization and other uses. Such information could be linked to other information outside GBIF such as extinction risk, genotype-level trait data and restoration and molecular genetic data (see figure 1). Agrobiodiversity data users often access these different data from different platforms.

Figure 1: GBIF within the landscape of agrobiodiversity-relevant data sources (not an exhaustive list)

4. Objectives of the task group

The task group aims to capture the best available experiences, document limitations in existing GBIF services, and suggest improvements in the functionality of GBIF.org for domain-specific needs.

• To make recommendations on improving data availability and use, mobilization, publishing and processing of data / metadata. Also to deliver a vision of the ideal data, data modifications, cleaning steps, analyses and visualization needs of the agrobiodiversity community.

• To document best practices using agrobiodiversity-related data, and to collect the information on repeatable tools and data management solutions.

• To make recommendations on GBIF.org improvements, and to provide guidance in the development of training and outreach materials for data users, to allow the better interconnection of different platforms and to allow different datasets to be combined.

5. Mode of operation of the task group and outputs

A survey of selected agrobiodiversity experts was launched to capture knowledge, experiences and opinions relating to data used in agrobiodiversity research for food security and agroecosystem resilience (see full survey in Appendix I). Fifty-one respondents answered the survey, which ended on 10 September 2015. To complement the survey results and the ideas and demands captured in the initial phase of its work, the task group subsequently conducted in-depth interviews with experts about data flows and their practices in consulting sources of data. The survey mostly captures the feedback of experts interested in species distribution modelling and genetic diversity analyses of ABD, reflecting the dominant current uses of data accessed through GBIF. The draft report was published online and opened for community feedback until 15 December through the GBIF Community site and through e-mails. Additional comments received at the end of December were also integrated.

The main deliverable of the task group is a set of practical guidelines and recommendations from the agrobiodiversity community around the issues defined under the Terms of Reference (see Appendix 3), summarized into a short, action-oriented report. An interim summary of demands and an early analysis are presented below. Recommendations are directed not only at the GBIF Secretariat but also at GBIF Participant Nodes, so that they can target data mobilization activities informed by the gap analysis from the task group.

Four Skype meetings of the task group members were held on 28 April, 4 June, 21 July and 29 September 2015. Two face-to-face meetings were held, the first at Bioversity in Montpellier from 10 to 11 July and the second at the GBIF Secretariat in Copenhagen from 10 to 11 September. The meeting notes, draft version of the report and the survey results were shared among task group members through an online shared folder.

Feedback from survey respondents and ABD community stakeholders following the initial release of the draft report was incorporated into this final version 1.0 of the task group report.

6. Recommendations

Key recommendations were derived from our analysis of the information received through the survey, from interviews and based on our own experience.

6.1 General use of GBIF by the agrobiodiversity community

A survey was sent to agrobiodiversity experts selected based on their previous experience with GBIF and the results provide a good snapshot of their user experience. Respondents are satisfied by the availability of big datasets and the possibility of downloading large data sets (most survey respondents have downloaded data from GBIF.org) but are generally less satisfied with the quality of the coordinates, outdated taxonomic names, and presence of duplicates. 60% of the respondents report problems in accessing the data they need.

However, most of the mentioned issues are related to the lack of access to different types of external auxiliary data, among which the external environmental and external trait data are the most frequently mentioned. It is worth noting that very few of the respondents have contacted the GBIF helpdesks at the Secretariat and the national nodes to get support. Some nodes provide substantial national helpdesk functions (e.g. France, Spain, Norway), while some other nodes could increase their availability to provide helpdesk services. National pages on GBIF.org should provide the email to the national helpdesk, and/or national mailing lists if nothing else is available at the national level.

There is a clear potential for expanding the use of GBIF in the agrobiodiversity community for publishing and accessing data. The task group recommends that the GBIF Secretariat and GBIF nodes stimulate data mobilization in the agrobiodiversity community by showing the academic and non-academic benefits to potential publishers, with a particular focus on an audience ranging from pure scientists to pure data managers, and on young scientists.

Data mobilization approaches include support for crowdsourcing of observation data generation, and making data available for applications on handheld mobile devices. Citizen scientists should be invited to browse through their national records flagged with data quality issues. GBIF and the task group must promote the importance of agrobiodiversity data to partners (IUCN, EOL, BHL, and others). Researchers, students and teachers need encouragement and clear guidelines on how to publish their existing data. The GBIF Secretariat should encourage and support the organization of workshops at international, regional and national levels through GBIF nodes to build capacity on data types, data mobilization and publication by students, researchers, teachers and node staff members.

Recommendations

6.1.1. A promotional campaign showing that GBIF is a useful resource for agrobiodiversity research (abbreviated ABD) will be required to explain to potential data publishers the academic and non-academic benefits of data sharing through the GBIF portal.

6.1.2. A series of training workshops at international, regional and national levels targeting the ABD research community will be necessary to explain the features of the GBIF portal, including the upload and download of data to/from GBIF and feedback to GBIF. Training materials on the publication and use of biodiversity data should be provided to the agrobiodiversity community. This should be done in conjunction with the training sessions focusing on ABD data, such as best practices and standardized methodology to collect relevant data on crop wild relatives.

6.1.3. Provide training materials and best practices targeting agrobiodiversity users.

6.1.4. The visibility of the helpdesk point of contact for the GBIF Nodes needs to be improved. GBIF.org national pages should provide the email to the national helpdesk, and/or national mailing lists when this is available at the national level.

6.1.5. As a result of the above actions, the ABD community should publish existing digitized ABD data collections through GBIF.

6.2 Integration of relevant data standards for agrobiodiversity

The community of ex situ gene banks, which predates GBIF, has agreed on a set of core Multi-Crop Passport Descriptors (MCPD). The first version was introduced in 1998 (Hazekamp et al. 1998) with the first official version published in 2001 (Alercia et al. 2001). An updated version was published in June 2012 (Alercia et al. 2012), and the current version including terms for specimen-level persistent identifiers was published in December 2015 (Alercia et al. 2015). The MCPD is a long-standing community-specific standard. Darwin Core with the germplasm extension is the standard most often cited by gene bank managers responding to the survey, because it integrates the additional data fields that survey respondents suggest should be directly available at GBIF.org. The MCPD covers the essential core terms for an agrobiodiversity specimen data type level backbone aligned one-to-one with the Darwin Core occurrence data type level. Further refinement of the germplasm terminology and the Darwin Core extension is still needed to improve the support for other agrobiodiversity data types such as the crop wild relatives (Thormann et al. 2013), pre-breeding and breeding data, and characterization & evaluation trait data.

Most of the MCPD terms already have corresponding terms in the Darwin Core standard⁸ (Wieczorek et al. 2012, see table 1). Unless all of the current 41 MCPD descriptors are included in and made available in data downloads from the GBIF index, the agrobiodiversity community will need to continue maintaining parallel occurrence-level data flow pathways and independent data indexing solutions. A remedy to this situation will be the addition of the few currently lacking MCPD terms into the Darwin Core set of terms – following the description in the Darwin Core germplasm extension⁹ (Endresen and Knüpffer 2012) – that are indexed by GBIF procedures for the occurrence-level backbone. Further work will be required to integrate descriptors for in situ conservation of crop wild relatives to the Darwin Core extension (Thormann et al. 2013). There is no alternative to full integration.
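To illustrate how the term correspondence in table 1 could be applied in practice, the following is a minimal sketch that renames the fields of an MCPD passport record to Darwin Core (dwc) and germplasm extension (g) terms. Only a subset of the table is reproduced; the function name and the sample record are hypothetical, and this is not a description of how GBIF indexing is implemented.

```python
# Subset of the MCPD -> Darwin Core / germplasm extension mapping from table 1.
MCPD_TO_DWC = {
    "PUID": "dwc:occurrenceID",
    "INSTCODE": "dwc:institutionCode",
    "ACCENUMB": "dwc:catalogNumber",
    "COLLNUMB": "dwc:recordNumber",
    "COLLCODE": "g:collectingInstituteID",
    "GENUS": "dwc:genus",
    "SPECIES": "dwc:specificEpithet",
    "SUBTAXA": "dwc:infraspecificEpithet",
    "CROPNAME": "dwc:vernacularName",
    "ACQDATE": "g:acquisitionDate",
    "ORIGCTY": "dwc:countryCode",
    "COLLSITE": "dwc:locality",
    "DECLATITUDE": "dwc:decimalLatitude",
    "DECLONGITUDE": "dwc:decimalLongitude",
    "COLLDATE": "dwc:eventDate",
    "SAMPSTAT": "g:biologicalStatus",
    "COLLSRC": "g:acquisitionSource",
    "DONORCODE": "g:donorInstituteID",
    "STORAGE": "g:storageCondition",
    "MLSSTAT": "g:mlsStatus",
}


def mcpd_to_dwc(record):
    """Rename MCPD fields to dwc/g terms; empty and unmapped fields are dropped."""
    out = {}
    for field, value in record.items():
        if value is None or str(value).strip() == "":
            continue
        term = MCPD_TO_DWC.get(field)
        if term is not None:
            out[term] = value
    # Table 1: the authorship comes from SUBTAUTHOR when SUBTAXA is filled,
    # otherwise from SPAUTHOR.
    author = record.get("SUBTAUTHOR") if record.get("SUBTAXA") else record.get("SPAUTHOR")
    if author:
        out["dwc:scientificNameAuthorship"] = author
    return out
```

A record that carries SAMPSTAT, for example, keeps that information under g:biologicalStatus, which is only useful downstream if the germplasm extension terms are also indexed.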

The integration priorities are as follows:

1. The most important term to add is SAMPSTAT (biological status of sample, g:biologicalStatus) describing the type of germplasm material.

2. Another prioritized set of terms includes DONORCODE (g:donorInstituteID), DONORNAME (g:donorInstitute), DONORNUMB (g:donorsIdentifier), ACQDATE (g:acquisitionDate), and COLLSRC (g:acquisitionSource). Germplasm material is living material allowing for living copies to be passed on from one gene bank collection to another. The set of terms for donor institute and the germplasm identifier used by the donor is important to enable tracking regarding the provenance of germplasm. Darwin Core germplasm extension also promotes a persistent identifier (g:donorsID) for the germplasm material held by the donor (this term is also proposed for a future MCPD revision).

3. A similar set of prioritized terms includes BREDCODE (g:breedingInstituteID), BREDNAME (g:breedingInstitute), ACCENAME (g:breedingIdentifier), and ANCEST (g:ancestralData, g:purdyPedigree). The Darwin Core germplasm extension promotes a persistent identifier (g:breedingID) for this type of source material. Germplasm material can be created by a plant breeder through an active and experiment-based crop improvement and research activity. These terms describe the creation of germplasm material in the situations when this material is created through breeding and not collected in situ (or on-farm). Terms for collecting events are already very well covered in Darwin Core.

8 http://rs.tdwg.org/dwc/terms/

9 http://purl.org/germplasm/germplasmTerm#

4. One minor issue here is the recommendation of the MCPD to use the degree- minute-second format for geographic coordinates while Darwin Core prescribes the decimal-degree format.

5. A second issue for the collecting event is the need for a term to describe the FAO WIEWS institute code for the collector (COLLCODE, g:collectingInstituteID).

6. Darwin Core should include information on whether a particular population is cultivated, wild, escaped from cultivation, sub-spontaneous or unknown (Note: dwc:establishmentMeans partly covers cultivated, wild, naturalized etc, while MCPD:SAMPSTAT provides richer information) and the 'basisOfRecord' term should make it possible to distinguish herbarium specimens from gene bank accessions because "preserved specimens" is too ambiguous. The ‘dwc:basisOfRecord’ has potential for improvement and this is also under discussion at TDWG.

7. Include in Darwin Core several administrative fields for the description of the site of observation or collection, rather than the two fields "COLLSITE" and "ORIGCTY" (as in the current MCPD) or the three fields "Country", "County" and "Locality" (as in the GBIF format). Such inclusion should also take place soon in the MCPD format. The suggestion came from GEOQUAL¹⁰, a tool for assessing the quality of georeferencing.

Table 1: Mapping of MCPD (Alercia et al. 2001, 2012, 2015; Hazekamp et al. 1998) to Darwin Core (Wieczorek et al. 2012) using the Darwin Core germplasm extension (Endresen and Knüpffer 2012). 25 terms in MCPD match a corresponding term in Darwin Core. 15 terms from MCPD are not matching terms already described in Darwin Core (highlighted in blue) and 2 terms partly matching (highlighted in grey). [Namespaces, dwc

= http://rs.tdwg.org/dwc/terms/; g = http://purl.org/germplasm/germplasmTerm#]

Term    MCPD (2015)          Darwin Core (dwc), germplasm (g)
        NA (not applicable)  dwc:datasetID
0       PUID                 dwc:occurrenceID
1       INSTCODE             dwc:institutionCode
2       ACCENUMB             dwc:catalogNumber
3       COLLNUMB             dwc:recordNumber
4       COLLCODE             g:collectingInstituteID
4.1     COLLNAME             dwc:recordedBy
4.1.1   COLLINSTADDRESS      (dwc:recordedBy)
4.2     COLLMISSID           dwc:collectionCode
5       GENUS                dwc:genus
6       SPECIES              dwc:specificEpithet
7       SPAUTHOR             dwc:scientificNameAuthorship (if SUBTAXA is empty)
8       SUBTAXA              dwc:infraspecificEpithet
9       SUBTAUTHOR           dwc:scientificNameAuthorship (if SUBTAXA is not empty)
10      CROPNAME             dwc:vernacularName
11      ACCENAME             g:breedingIdentifier
12      ACQDATE              g:acquisitionDate
13      ORIGCTY              dwc:countryCode
14      COLLSITE             dwc:locality
15.1    DECLATITUDE          dwc:decimalLatitude
15.2    LATITUDE             dwc:verbatimLatitude
15.3    DECLONGITUDE         dwc:decimalLongitude
15.4    LONGITUDE            dwc:verbatimLongitude
15.5    COORDUNCERT          dwc:coordinateUncertaintyInMeters
15.6    COORDDATUM           dwc:geodeticDatum
15.7    GEOREFMETH           dwc:georeferenceSources
16      ELEVATION            dwc:minimumElevationInMeters
17      COLLDATE             dwc:eventDate
18      BREDCODE             g:breedingInstituteID
18.1    BREDNAME             g:breedingInstitute
19      SAMPSTAT             g:biologicalStatus
20      ANCEST               g:ancestralData, g:purdyPedigree
21      COLLSRC              g:acquisitionSource
22      DONORCODE            g:donorInstituteID
22.1    DONORNAME            g:donorInstitute
23      DONORNUMB            g:donorsIdentifier
24      OTHERNUMB            dwc:otherCatalogNumbers
25      DUPLSITE             g:safetyDuplicationInstituteID
25.1    DUPLINSTNAME         g:safetyDuplicationInstitute
26      STORAGE              g:storageCondition
27      MLSSTAT              g:mlsStatus
28      REMARKS              (dwc:occurrenceRemarks)

10 http://www.capfitogen.net/en/tools/geoqual/
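The mapping in Table 1 lends itself to a simple lookup when converting passport records. The sketch below is illustrative only: it covers a subset of the table, copies values verbatim without any vocabulary conversion, and uses a hypothetical function name and a made-up example record. It also shows the conditional rule for dwc:scientificNameAuthorship, which draws on SPAUTHOR or SUBTAUTHOR depending on whether SUBTAXA is filled.

```python
from typing import Dict

# Direct one-to-one mappings taken from Table 1 (a subset, for brevity).
MCPD_TO_DWC: Dict[str, str] = {
    "PUID": "dwc:occurrenceID",
    "INSTCODE": "dwc:institutionCode",
    "ACCENUMB": "dwc:catalogNumber",
    "COLLNUMB": "dwc:recordNumber",
    "COLLCODE": "g:collectingInstituteID",
    "GENUS": "dwc:genus",
    "SPECIES": "dwc:specificEpithet",
    "SUBTAXA": "dwc:infraspecificEpithet",
    "CROPNAME": "dwc:vernacularName",
    "ORIGCTY": "dwc:countryCode",
    "COLLSITE": "dwc:locality",
    "DECLATITUDE": "dwc:decimalLatitude",
    "DECLONGITUDE": "dwc:decimalLongitude",
    "COLLDATE": "dwc:eventDate",
    "SAMPSTAT": "g:biologicalStatus",
}


def mcpd_to_darwin_core(record: Dict[str, str]) -> Dict[str, str]:
    """Translate one MCPD passport record into Darwin Core / germplasm terms."""
    # Copy every non-empty value whose descriptor has a direct mapping.
    mapped = {
        MCPD_TO_DWC[field]: value
        for field, value in record.items()
        if field in MCPD_TO_DWC and value
    }
    # Table 1 takes the authorship from SPAUTHOR when SUBTAXA is empty,
    # and from SUBTAUTHOR otherwise.
    author_field = "SUBTAUTHOR" if record.get("SUBTAXA") else "SPAUTHOR"
    if record.get(author_field):
        mapped["dwc:scientificNameAuthorship"] = record[author_field]
    return mapped


# Hypothetical passport record, for illustration only.
example = {
    "INSTCODE": "XXX001",
    "ACCENUMB": "ACC 123",
    "GENUS": "Oryza",
    "SPECIES": "sativa",
    "SPAUTHOR": "L.",
    "SUBTAXA": "",
    "SAMPSTAT": "300",
}
print(mcpd_to_darwin_core(example))
```

Descriptors without an entry in the lookup are silently skipped here; a production converter would more likely report them so that unmapped information is not lost.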

Genesys, the global catalogue of plant germplasm gene bank accessions¹¹, which uses the MCPD to aggregate gene bank data, appeared in many of the responses gathered through the survey. The overall suggestion is to increase the quality of the existing records, reduce duplication, and collaborate with Genesys. Genesys could be invited to form an agrobiodiversity thematic data node in GBIF and to provide an agrobiodiversity data mobilization helpdesk.

The survey revealed some concerns among the respondents regarding modifications and changes applied to the Darwin Core standard. The Darwin Core standard is ratified by the Biodiversity Information Standards (TDWG), and all modifications require ratification by the TDWG community, as described by the Darwin Core namespace policy¹². The Darwin Core standard was first ratified by TDWG in 2009. The Darwin Core decision history¹³ lists all the approved and implemented modifications, while the normative Darwin Core complete historical record¹⁴ lists all the terms, including all historical declarations. Some of the concerns raised by respondents to the survey might relate to the previous versions of Darwin Core that existed prior to the TDWG ratification in 2009. A mapping between the current Darwin Core standard and these older, obsolete versions of Darwin Core is presented by TDWG¹⁵. However, most users of Darwin Core will most likely find the quick reference to the current valid Darwin Core terms¹⁶ to be the most useful presentation.

11 https://www.genesys-pgr.org

12 http://rs.tdwg.org/dwc/terms/namespace/

The Darwin Core germplasm extension follows design principles similar to those set by Darwin Core and the overall design guidelines set by the TDWG Vocabulary Management Task Group (TDWG 2013). The terms are declared as RDF (Resource Description Framework) using the SKOS (Simple Knowledge Organization System)¹⁷ language and organized into class¹⁸ or property¹⁹ terms following current W3C (World Wide Web Consortium) recommendations. However, the terms deliberately have very limited semantic declarations. Some of the survey respondents expressed the need for a more formal semantic description of the germplasm terms. The Darwin Core germplasm extension includes a type vocabulary for controlled element values. Recent updates to the Darwin Core standard have transferred the corresponding Darwin Core type terms into the main Darwin Core namespace. A similar approach could be implemented for the germplasm extension. A more formal agrobiodiversity community governance policy is needed for the germplasm extension, and the Biodiversity Information Standards (TDWG) could be a suitable platform for implementing it. The Darwin Core germplasm extension is available for collaborative management at the TDWG Term Wiki²⁰, but has not been prepared and submitted for formal community review as a TDWG standard.

Recommendations

6.2.1. The Multi-Crop Passport Descriptors (MCPD) list is the data exchange standard for describing crop samples held in gene banks. GBIF has to index data attributes described with the MCPD terms to stimulate the use of gene bank data and other ABD data published in GBIF.

Most of the MCPD terms were mapped to Darwin Core terms (see Table 1). Therefore, to enable full compatibility between these standards, only a few terms need to be added to the GBIF data profiles, following the model proposed in the existing Darwin Core germplasm extension. This will be achieved by including the Darwin Core germplasm extension in the GBIF data indexing routines.

6.2.2. Collaboration with Genesys, the global portal for information on plant genetic resources, is necessary, and the task group recommends studying the feasibility of Genesys becoming a thematic data node within GBIF, providing a helpdesk for agrobiodiversity data mobilization.

13 http://rs.tdwg.org/dwc/terms/history/decisions/

14 http://rs.tdwg.org/dwc/terms/history/

15 http://rs.tdwg.org/dwc/terms/history/versions/

16 http://rs.tdwg.org/dwc/terms/

17 https://www.w3.org/TR/skos-reference/

18 http://www.w3.org/2000/01/rdf-schema#Class

19 http://www.w3.org/1999/02/22-rdf-syntax-ns#Property

20 http://terms.tdwg.org/wiki/Germplasm

6.2.3. Further refinement of the germplasm terminology and the Darwin Core extension with additional attributes (terminology) is needed for describing agrobiodiversity species, such as crop wild relatives, pre-breeding and breeding data, and characterization and evaluation of trait data. These attributes should be added as an extension to the GBIF taxon core data profile and be included in the corresponding GBIF indexing routines. The gene pool and taxon group classifications, along with traits, have the highest priority as extension attributes.

6.2.4. The possibility of accommodating various standards and descriptors used in data sources was mentioned alongside the addition of data generated by predictive characterization using geospatial information. Population data should link to pre-breeding and breeding data.

6.2.5. Indigenous names are needed in addition to vernacular names.

6.2.6. A more formal agrobiodiversity community governance policy is needed for the germplasm extension. The Biodiversity Information Standards (TDWG) could be a suitable platform for the implementation of a formal agrobiodiversity community governance policy for the Darwin Core germplasm extension. The Darwin Core germplasm extension should be maintained to reach a stable standard for germplasm accessions conserved in gene banks (ex situ conservation), and should be expanded to address the needs of data concerning in situ and on-farm conservation.

6.2.7. Technically, the Darwin Core germplasm extension requires a clear distinction between classes and properties. The extension needs to be revised to align with the latest version of Darwin Core. Using a controlled vocabulary for the value of an element should be considered. Ideally, the Darwin Core germplasm extension should be compliant with the DCMI model proposed by Dublin Core²¹.

6.2.8. Expand the Darwin Core germplasm extension with standard terminology to describe in situ conservation of crop wild relatives (to be based on Thormann et al. 2013).

6.2.9. Include in Darwin Core several administrative fields for the description of the site of observation or collection, rather than the two fields "COLLSITE" and "ORIGCTY" (current MCPD) or the three fields "Country", "County" and "Locality" (GBIF format). These fields should also be added to the MCPD format soon.

6.2.10. GBIF should consult livestock experts to adapt the Darwin Core germplasm extension to livestock in order to better cover agrobiodiversity research.

21 http://dublincore.org/documents/interoperability-levels/


6.3 Inventories of crop wild relatives

GBIF needs to integrate agrobiodiversity terms and attributes for Crop Wild Relatives (CWR), for in situ (Moore et al. 2008, Thormann et al. 2013) and on-farm conservation (FAO 2015).

1. Global, regional and national taxon checklists to identify crop wild relatives should be developed, agreed upon and published via GBIF. A global list of crop genera would be an important tool here.

2. Indigenous names are needed in addition to vernacular names to support the identification of the diversity maintained by local communities.

3. At the taxon-backbone-level a new attribute to identify the priority and conservation status for CWR species would be very helpful.

4. Attributes describing the gene pool category status also need to be added at the taxon/checklist level.

5. Information on interactions between agricultural crop and pest species, at the taxon level, is needed to correlate the respective occurrence data available in the GBIF portal (e.g. using the Darwin Core dwc:ResourceRelationship²²).
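To illustrate item 5, the sketch below records a crop-pest interaction using terms from the Darwin Core ResourceRelationship class (resourceID, relatedResourceID, relationshipOfResource, relationshipAccordingTo). The function names, the "pest of" label and the identifiers are hypothetical; no controlled vocabulary for such relationships is implied by this report.

```python
from typing import Dict, List


def pest_relationship(pest_taxon_id: str, crop_taxon_id: str, according_to: str) -> Dict[str, str]:
    """Build one ResourceRelationship-style record linking a pest to a crop."""
    return {
        # Subject of the relationship: the pest taxon.
        "dwc:resourceID": pest_taxon_id,
        # Object of the relationship: the crop taxon.
        "dwc:relatedResourceID": crop_taxon_id,
        # Free-text label; "pest of" is an illustrative value only.
        "dwc:relationshipOfResource": "pest of",
        "dwc:relationshipAccordingTo": according_to,
    }


def pests_of(crop_taxon_id: str, relationships: List[Dict[str, str]]) -> List[str]:
    """Return the identifiers of all taxa recorded as pests of the given crop."""
    return [
        rel["dwc:resourceID"]
        for rel in relationships
        if rel["dwc:relatedResourceID"] == crop_taxon_id
        and rel["dwc:relationshipOfResource"] == "pest of"
    ]


# Placeholder identifiers, for illustration only.
relationships = [
    pest_relationship("taxon:pest-1", "taxon:crop-1", "hypothetical source"),
    pest_relationship("taxon:pest-2", "taxon:crop-1", "hypothetical source"),
    pest_relationship("taxon:pest-3", "taxon:crop-2", "hypothetical source"),
]
print(pests_of("taxon:crop-1", relationships))
```

With relationships stored at the taxon level in this way, the occurrence records of a crop and of its recorded pests could be retrieved together by taxon identifier.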

In situ conservation and on-farm management information systems need to combine basic eco-geographic information (climate variables, water availability, soil type, vegetation type, land cover, latitude, longitude, altitude, spatial distribution of pests and diseases, etc.). This information is critical to allow users of the information system to locate traits of interest (e.g. drought, disease or salinity tolerance) and also to identify sites with similar conditions where the varieties or landraces could perform well (Thormann et al. 2014, 2015).

Classification and identification of crop wild relatives

Harlan and de Wet (1971) propose a classification of crop wild relatives according to the relative crossability between wild and cultivated species as follows:

Gene pools

GP1A   Primary     Cultivated forms of the crop (cultivars and landraces)
GP1B   Primary     Wild or weedy forms of the crop
GP2    Secondary   Species with which gene transfer is possible but difficult
GP3    Tertiary    Species with which gene transfer is impossible by genetic engineering [1]

[1] With the advance of molecular engineering techniques allowing for complex genetic transfers, Hammer et al. (2003) suggested adding a fourth gene pool that includes genetic components of artificial origin (transgenes).

22 http://rs.tdwg.org/dwc/terms/ResourceRelationship


The main assumption behind the proposed definition is that taxonomic distance is positively related to genetic distance (Maxted et al. 2006) and thus provides a pragmatic approximation of potential crossability between taxa (when genetic information is lacking). Taxon groups are defined as follows:

Taxon groups

TG1A   Crop (cultivars and landraces)
TG1B   Same species
TG2    Same series or section
TG3    Same subgenus
TG4    Same genus
TG5    Same tribe but different genus

As a rule of thumb, Maxted et al. (2006) also suggested the following ranking methodology for establishing conservation priorities:

Degree of CWR relatedness   Gene pool   Taxon group   Conservation priority
Close CWR                   GP1B        TG1B, TG2     High priority
Remote CWR                  GP2         TG3, TG4      Low priority
Not CWR                     GP3         TG5           Excluded

Although it remains difficult to differentiate between “close” and “remote” CWR, as the taxonomy is often not given in full and the “series”, “section” and “subgenus” are not recorded in most databases, the combined GP/TG definition is convenient and can easily be implemented in an automated request (Delêtre et al. 2012).
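As a rough illustration of how the combined GP/TG ranking could be automated, the sketch below encodes the ranking table as two lookups. The function name is hypothetical, and the choice to prefer the gene pool category over the taxon group whenever a gene pool is supplied is an assumption, based on the statement that taxon groups approximate crossability when genetic information is lacking.

```python
from typing import Optional, Tuple

# (degree of CWR relatedness, conservation priority), as in the ranking table.
Ranking = Tuple[str, str]

RANKING_BY_GENE_POOL = {
    "GP1B": ("Close CWR", "High priority"),
    "GP2": ("Remote CWR", "Low priority"),
    "GP3": ("Not CWR", "Excluded"),
}

RANKING_BY_TAXON_GROUP = {
    "TG1B": ("Close CWR", "High priority"),
    "TG2": ("Close CWR", "High priority"),
    "TG3": ("Remote CWR", "Low priority"),
    "TG4": ("Remote CWR", "Low priority"),
    "TG5": ("Not CWR", "Excluded"),
}


def conservation_priority(
    gene_pool: Optional[str] = None, taxon_group: Optional[str] = None
) -> Optional[Ranking]:
    """Rank a taxon by gene pool if known, otherwise by taxon group.

    Returns None for categories outside the ranking table, such as
    GP1A and TG1A, which denote the crop itself.
    """
    if gene_pool:
        return RANKING_BY_GENE_POOL.get(gene_pool.upper())
    if taxon_group:
        return RANKING_BY_TAXON_GROUP.get(taxon_group.upper())
    return None


print(conservation_priority(gene_pool="GP1B"))
print(conservation_priority(taxon_group="TG4"))
```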

The global checklist of crop wild relatives²³ (Vincent et al. 2013) and the GRIN Taxonomy provide important information resources for the classification of taxa as crop wild relatives.

Traits of interest for breeding are being added and will complement the species attributes.

The CWR classifications from these and similar checklists should be published to the GBIF checklist bank and integrated into the GBIF taxon backbone. The European Native Seed Conservation Network (ENSCONET) has also developed a database for crop wild relatives.

23 http://www.cwrdiversity.org/checklist/

ABD users should be enabled to filter occurrence data based on the CWR, gene pool and taxon group classifications. National checklists of CWR conservation priority species and their status should also be published into the GBIF checklist bank and made available through the GBIF portal.

Occurrence data from the Global Atlas of crop wild relatives should be published via GBIF (see figure 2). National programmes for the monitoring of CWR species should be promoted by national GBIF nodes and datasets published through GBIF.

Figure 2: Distribution of occurrence data in the global dataset of crop wild relatives (CWRDGC, 2015). This map was prepared with information mobilized via GBIF and other sources (i.e., herbaria, gene bank databases, researchers' archives), and includes crop wild relative occurrence data and cultivated occurrence data, offering a picture of the current availability of ABD occurrence data. Map prepared by Steven Sotelo (CIAT).

Recommendations

6.3.1. The global crop wild relative species checklist (www.cwrdiversity.org/checklist/) has to be published to the GBIF portal registry of checklists and integrated into the GBIF taxonomy backbone. To complement this global list, other crop wild relative checklists may be proposed for publishing as taxon checklists via GBIF, starting with the crop wild relative species list developed by the Southern African Development Community (SADC)-CWR project, which includes a global list of crop genus names, a useful tool for compiling national species lists of crop wild relatives.

6.3.3. New GBIF-indexed attributes have to be added at the taxon-backbone level to:


1. Identify the conservation priority and status of crop wild relative species at the global level (and at the regional and national levels in a taxon-level extension).

2. Provide information on the relationship of crop wild relatives (CWR) and their associated crops (e.g. gene pool and taxon group).

3. Enable conditional filtering of occurrence data in the GBIF portal based on these taxon attributes.
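The conditional filtering described in 6.3.3 amounts to joining taxon-level attributes to occurrence records and selecting on them. The following is a minimal sketch with placeholder taxon names; the attribute name "genePool" and the function name are hypothetical and do not correspond to existing GBIF terms.

```python
from typing import Dict, Iterable, List


def filter_by_gene_pool(
    occurrences: Iterable[Dict[str, str]],
    taxon_attributes: Dict[str, Dict[str, str]],
    gene_pools: Iterable[str],
) -> List[Dict[str, str]]:
    """Keep occurrences whose taxon carries one of the requested gene pool categories."""
    wanted = set(gene_pools)
    selected = []
    for occurrence in occurrences:
        # Look up the taxon-level attributes for this occurrence's name.
        attributes = taxon_attributes.get(occurrence.get("scientificName", ""), {})
        if attributes.get("genePool") in wanted:
            selected.append(occurrence)
    return selected


# Placeholder checklist attributes and occurrences, for illustration only.
taxon_attributes = {
    "Taxon A": {"genePool": "GP1B"},
    "Taxon B": {"genePool": "GP2"},
    "Taxon C": {"genePool": "GP3"},
}
occurrences = [
    {"occurrenceID": "occ-1", "scientificName": "Taxon A"},
    {"occurrenceID": "occ-2", "scientificName": "Taxon B"},
    {"occurrenceID": "occ-3", "scientificName": "Taxon C"},
    {"occurrenceID": "occ-4", "scientificName": "Taxon D"},
]
close_and_remote = filter_by_gene_pool(occurrences, taxon_attributes, ["GP1B", "GP2"])
print([occurrence["occurrenceID"] for occurrence in close_and_remote])
```

Occurrences whose taxon has no checklist entry (such as "Taxon D" here) are excluded, which is one reason the underlying checklists need to be as complete as possible.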

6.3.4. Train GBIF Nodes on the value of CWRs, and mobilization of data on crop wild relatives and on species traits useful for crop improvement and for landscape restoration.

6.4 Mobilizing data on cultivated plants

On their farms, small-scale farmers maintain a large diversity of cultivated species and recognize many different types (‘landraces’ sensu Harlan 1975 and Villa et al. 2008) within each of their crops (Jarvis et al. 2008). Over 200,000²⁴ landraces of rice (Oryza sativa L.) are estimated to exist worldwide and about as many varieties of bread wheat (Triticum aestivum L. subsp. aestivum). There are about 47,000 varieties of sorghum, 30,000 varieties of common bean (Phaseolus vulgaris L.), chickpea (Cicer arietinum L.), and maize (Zea mays L.), about 20,000 varieties of pearl millet, 15,000 varieties of peanut (Arachis hypogaea L.), and between 7,000 and 9,000 varieties of manioc (Manihot esculenta Crantz) (FAO, 1998, 2010, Delêtre et al. 2012).

The growing interest in neglected and underutilized crops (NUS) reflects rising concerns over the increasing reliance on a handful of crops to ensure global food security and economic growth (Padulosi et al. 1999, Stamp et al. 2012). NUS encompass a variety of plant species that are farmed (minor crops), reared (semi-domesticates), or gathered from the wild for a variety of uses, and may contribute to nutrition (food, beverage), medicine, cosmetics, fodder, fibres, fuel, or provide material for building. NUS also include some ornamental plants. Although the promotion and conservation of NUS has been part of the FAO Global Plan of Action for the Conservation and Sustainable Use of Plant Genetic Resources for Food and Agriculture since 1996, NUS have been overlooked by breeders and botanists, and data are lacking on their taxonomy and nomenclature, ecology, distribution, genetic diversity, local uses, and nutritional value. Inadequately described or characterized, NUS are at high risk of cultural and genetic erosion (Vietmeyer 1986).

Information required:

1. Taxonomy and checklists of traditional names in relevant languages.

2. Geospatial distribution information in cultivated areas.

3. Morpho-taxonomy, agronomic traits (farmers and breeders), functional traits, local uses, characterization and evaluation data.

4. Use, agronomic practices, cultural practices, seed conservation and exchange.

24 FAO (2010) estimates that landraces make up 35% of a total of 773,948 rice accessions, which amounts to approximately 270,882 rice landrace accessions.


Recommendation

6.4.1. Authoritative checklists and classifications of crop wild relatives, cultivars, landraces and neglected and underutilized crop species, including vernacular names from authoritative lists along with the languages and countries where they apply, should be added to GBIF once developed and validated by an international expert group and community.

6.5 Interactions between species

ABD scientists study spatial distribution and species interactions in order to predict, for a given area, the possible gene flows between crops and their wild relatives, draw conclusions on the opportunities for the evolution of on-farm diversity and crop adaptation, and identify trade-offs and risks in interventions for conservation, land management or restoration actions. Data are needed about the presence of livestock, pests, diseases, and helpful organisms such as pollinators or the predators of pests. Additionally, the risk of a species turning invasive in a given environment is important information for decision making on landscape restoration, natural resource management, and conservation of threatened species.

Recommendations

6.5.1. Additional attributes are needed at the taxon name level to describe the relation of a species to crops or to other species, such as pest, predator, pollinator, etc.

6.5.2. An additional attribute ‘pathogen’, with the scientific names of the pathogens and the vernacular names of the diseases, should be made available.

6.5.3. The GBIF portal should enable the selection and download of crop occurrences along with occurrences of pests, diseases, pollinators, livestock, etc.

6.5.4. Information on the risk of a species becoming invasive has to be made available through a link between the GBIF portal and other databases holding such information, like the CABI database on invasive species.

6.6 Improving the mobilization of new data sources

Existing CGIAR and other gene bank databases are critical sources of information for the ABD community. However, other sources of ABD information remain in the grey literature and need to be digitized, while others exist in digitized formats but are not yet available through GBIF. Approaches for making this information readily available are listed below.

• Stimulate the digitization of relevant collections (i.e. herbaria, gene banks, published articles, MSc and PhD theses, national and regional projects) related to ABD, by providing small grants in calls for competitive projects, as through the EU-funded Biodiversity Information for Development (BID) call.

• Monitor additional initiatives as they progress with the digitization of herbarium specimens important for both the wider biodiversity and agrobiodiversity communities. Examples of such initiatives include JSTOR Plants, the Beyond the Box digitization competition²⁵, and data repatriation projects such as the “Capture of primary biodiversity data on West African plants”, where images of specimens from large herbaria are being digitized.

• Provide alternative tools that require little technical expertise, including further development of the GBIF spreadsheet template for publishing data to GBIF.org. The current IPT service is demanding in terms of informatics skills. For reporting the mandatory dataset metadata, an offline template and/or an online form should be provided. Alternatively, a solution based on the Global Registry of Biodiversity Repositories (GRBio)²⁶ or similar could be explored for dataset metadata registration.

• Promote the use of citizen data portals such as EarthSky²⁷ and iSpot²⁸ as a means for publishing data through GBIF.org.

• Consider making the following existing digitized ABD data collections available through GBIF, as they contain new information:

1. The Crop Wild Relative Global Occurrence Dataset²⁹,

2. The Bioversity Collecting Mission Database³⁰,

3. Data, including the CWR genera list, from the Southern African Development Community (SADC)-CWR project³¹.

Recommendations

6.6.1. Existing digitized ABD data collections, such as the Bioversity Collecting Mission database³² and the Crop Wild Relative Global Occurrence dataset (see Figure 2), should be published through GBIF.

6.6.2. Stimulate the digitization of relevant collections (i.e. herbaria, gene banks, published articles, MSc and PhD theses, national and regional projects) related to ABD and stimulate the publishing of already digitized collections, by providing small grants through competitive calls.

6.6.3. Support the publishing of occurrences on ABD by making the upload of data records easy, requiring very little technical expertise, and by providing an offline template, an online form or a system (such as GRBio) for reporting the mandatory metadata describing the dataset.

25 https://beyondthebox.aibs.org/

26 http://grbio.org/

27 http://earthsky.org/earth/citizen-scientists-hit-one-million-mark-for-observations-of-nature

28 http://www.ispotnature.org/communities/global

29 http://www.cwrdiversity.org/checklist/cwr-occurrences.php

31 http://www.cropwildrelatives.org/sadc-cwr-project/

6.6.4. Promote the use of citizen data portals such as EarthSky and iSpot as a means for publishing data to GBIF.org (complementary to systems such as iNaturalist that are already publishing citizen scientist observations through GBIF).

6.7 Data Mobilization targets for Nodes

As indicated in the report ‘Agrobiodiversity in perspective’ (Delêtre et al. 2012) commissioned by Sud Experts Plantes and Bioversity International, the input of national experts is essential to create national inventories of CWR, NUS and landraces, to refine species checklists, and to identify and document knowledge gaps. Specific objectives should be to:

1. Revise or complete the taxonomy of lesser known plant genera.

2. Evaluate the species’ genetic diversity.

3. Gather detailed information on species distribution and ecology.

4. Collect ethnobotanical data on folk knowledge and traditional uses.

5. Assess the nutritional value and potential for commercialization of NUS.

With the help of the national agrobiodiversity research community, nodes are encouraged to contact the national agrobiodiversity data holders to improve ABD data availability through GBIF. Together, they can assess and influence national priorities.

However, Nodes should receive guidance and training on the mobilization and cleaning of agrobiodiversity data, and on CWRs and their importance for human food security. Experts could share their knowledge with Nodes by developing and sharing a global and a national checklist of CWR and neglected species so that CWRs can be identified as priorities for mobilization by the Nodes. A recommendation on this could be formulated by the task group on ‘accelerating the discovery of bio-collections data’³³.

Recommendations

6.7.1. GBIF nodes could have a significant role to play if they are properly trained in the identification of data relevant to agrobiodiversity (i.e., crop wild relatives and on-farm diversity). Experts in agrobiodiversity data can provide support and best practices to help Nodes become acquainted with the expected data types (e.g. the data collection methodology developed for the Crop Wild Relative Project of the Southern African Development Community (SADC)³⁴).

6.7.2. A key and simple step is to increase the knowledge of nodes through training so that they can play a more prominent role in the mobilization of locally available information resources on ABD in GBIF.

33 http://www.gbif.org/governance/task-groups

34 http://www.cropwildrelatives.org/sadc-cwr-project/


6.8 Services and tools for data processing and cleaning

Starting an agrobiodiversity study often means creating a reliable checklist of species, subspecies and cultivars that will be used for extracting relevant data from various sources.

Scientists responding to the survey reported that this labour-intensive work is usually done manually. Therefore, GBIF could promote (and potentially integrate into the GBIF portal) tools selected based on popularity to serve the agrobiodiversity community, in particular for the curation of georeferences and taxonomy. An example is GEOLocate³⁵, where GBIF might provide a service to package data extracted from the GBIF portal into a format suitable for upload into this tool, or potentially explore possibilities to integrate the GEOLocate tool into the GBIF portal. Modelling pipelines and workflow services such as BioVeL and other Galaxy- and Taverna-compliant protocols, in conjunction with universal workflow technology, can be used to help with data pre-processing. Additionally, the GBIF helpdesk and tool directory can be improved to support data processing with ample help and manuals for users.

Tools for data processing can be hosted in an online environment for workflow processing, such as Galaxy Toolshed³⁶.

GBIF could become the point of access for the most reliable and up-to-date taxonomy for agrobiodiversity. The USDA GRIN Taxonomy³⁷ checklist (52,577 names of the respective plant agrobiodiversity species) is already integrated into the GBIF backbone taxonomy. Similarly, the taxonomy of Mansfeld's World Database of Agricultural and Horticultural Crops³⁸ (6,100 names, and the most complete checklist for cultivated species) should be closely integrated with additional information attributes (using an extension to the Taxon Core for additional terms such as gene pool status, etc.).

Recommendations

6.8.1. GBIF to become the point of access for the most reliable and up-to-date taxonomy (including cultivars).

6.8.2. GBIF should seek to support the integration of popular data cleaning tools such as GEOLocate, OpenRefine (formerly Google Refine), and workflow services from BioVeL and other Galaxy- or Taverna-compliant protocols with data published through the GBIF portal. It is also important to take into account the requirements from the use cases that are being developed by a task group of the TDWG/GBIF data quality interest group.

6.8.3. GBIF needs to increase the visibility of existing taxonomic name reconciliation tools, such as the Global Names Architecture (GNA)³⁹, on GBIF.org, and provide access to the Plant List⁴⁰ developed by botanical gardens.

6.8.4. GBIF has to implement or improve tools for cross-checking and validating nomenclature of records published by different collections on the GBIF portal using taxonomic authorities such as GRIN Taxonomy, Mansfeld's World Database of Agricultural and Horticultural Crops, IPNI, the Plant List and Tropicos to resolve naming issues.

35 http://www.museum.tulane.edu/geolocate/

36 https://toolshed.g2.bx.psu.edu/, https://wiki.galaxyproject.org/ToolShed

37 http://www.gbif.org/dataset/66dd0960-2d7d-46ee-a491-87b9adcfe7b1

38 http://mansfeld.ipk-gatersleben.de

39 http://globalnames.org/

40 http://www.theplantlist.org/

6.9 De-duplication of occurrence-level data records

The agrobiodiversity community currently uses a number of different data flow mechanisms and portals for germplasm records. They need guidance in retrieving data across this broad landscape of data sources, including GBIF.

There are two types of duplicates: duplicate data records about the same accession (published from different datasets) and duplicate accessions originating from the same collecting event (copies of the same living material conserved in different gene banks). In fact, the very same occurrence of originally collected living plant material can be copied across multiple gene bank collections, leading to a physical duplication of the biological material. All these specimens or accessions are connected to the same original occurrence in nature or in a farmer's field. It is estimated that a total of 7.4 million gene bank accessions conserved in ex situ collections worldwide originate from a total of almost 2.2 million originally collected material samples (FAO 2010). Consequently, in the gene bank world, information about different accessions for each occurrence can be conserved and made available from several different gene bank collections. This “duplication” of occurrence records (across gene bank collections) carries varying richness and very different types of information for the same occurrence record, according to local needs.

The other type of duplication concerns information on the very same accession or specimen included in more than one dataset published via GBIF. Some datasets, such as the Genesys gene bank portal or the Global Inventory of CWRs, provide a meta-catalogue with a mixture of datasets already published in GBIF from the source and other datasets not yet published in GBIF. Other datasets, such as trait experiments by crop researchers and plant breeders, contribute new information not available from the gene bank collection but linked to the accessions in the gene bank collections.

A useful solution is the new feature of the GBIF API that implements occurrenceID as a searchable term for emerging persistent identifiers coming from data holders and data publishers. Having occurrenceID as a searchable term demonstrates to data publishers the value of providing persistent identifiers for their specimens. Occurrence information from multiple data sources (datasets published by different data owners) can more easily be combined/merged with persistent occurrence-level identifiers.
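A lookup by occurrenceID can be sketched as a query URL against the GBIF occurrence search service. The endpoint (https://api.gbif.org/v1/occurrence/search) and the query parameter name "occurrenceId" are assumptions here and should be verified against the current GBIF API documentation; the helper name and the identifier are hypothetical. The sketch only builds the URL and performs no network request.

```python
from urllib.parse import urlencode

# Assumed base URL of the GBIF occurrence search service.
GBIF_OCCURRENCE_SEARCH = "https://api.gbif.org/v1/occurrence/search"


def occurrence_id_query_url(occurrence_id: str, limit: int = 20) -> str:
    """Build a search URL for records sharing a publisher-supplied occurrenceID."""
    # "occurrenceId" is the assumed name of the query parameter.
    query = urlencode({"occurrenceId": occurrence_id, "limit": limit})
    return f"{GBIF_OCCURRENCE_SEARCH}?{query}"


# Hypothetical persistent identifier, for illustration only.
print(occurrence_id_query_url("urn:catalog:XXX001:ACC123"))
```

Records returned for the same identifier from different datasets are the candidates for linking or merging.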

A valuable functionality for the GBIF portal would be an “occurrence-backbone” to merge and combine occurrence information on the same occurrence/specimen provided by different data owners (different datasets but using the same specimen/accession-level persistent identifier) in the same manner in which taxon names from multiple sources are combined to form a GBIF taxon backbone. The same occurrence can be included in different datasets from different data owners for a number of valid reasons such as an interest or focus on different types of attribute information.
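One way to picture such an “occurrence-backbone” is as a grouping of records by their shared persistent identifier, with the contributing datasets kept as provenance. The sketch below is an illustration under stated assumptions, not a description of any existing GBIF feature: the function name, the record layout and the example attribute "plantHeight" are hypothetical, and conflicting values are simply kept side by side rather than resolved.

```python
from typing import Dict, Iterable, List


def build_occurrence_backbone(records: Iterable[Dict[str, str]]) -> Dict[str, dict]:
    """Merge records from several datasets that share the same occurrenceID."""
    backbone: Dict[str, dict] = {}
    for record in records:
        occurrence_id = record.get("occurrenceID")
        if not occurrence_id:
            # Records without a persistent identifier cannot be linked.
            continue
        entry = backbone.setdefault(
            occurrence_id,
            {"occurrenceID": occurrence_id, "sources": [], "attributes": {}},
        )
        # Keep track of which dataset contributed to the merged entry.
        entry["sources"].append(record.get("datasetID"))
        for term, value in record.items():
            if term in ("occurrenceID", "datasetID") or value in (None, ""):
                continue
            values: List[str] = entry["attributes"].setdefault(term, [])
            # Keep every distinct value so that disagreements stay visible.
            if value not in values:
                values.append(value)
    return backbone


# Hypothetical records from two data owners, for illustration only.
records = [
    {"occurrenceID": "id-1", "datasetID": "genebank", "countryCode": "CO"},
    {"occurrenceID": "id-1", "datasetID": "trial", "countryCode": "CO", "plantHeight": "95"},
    {"occurrenceID": "id-2", "datasetID": "genebank", "countryCode": "BJ"},
    {"datasetID": "trial", "countryCode": "MY"},
]
backbone = build_occurrence_backbone(records)
print(backbone["id-1"])
```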

Agrobiodiversity specimens (accessions) are routinely the object of different types of experiments conducted by crop scientists or commercial crop breeding companies, each with their own database systems. Following the implementation of an “occurrence-backbone”, an estimated 600,000 occurrences aggregated by Genesys and a few hundred thousand occurrences from CIAT that are not yet published through GBIF could readily be added. A portion of the records included in the above-mentioned agrobiodiversity datasets originate from GBIF. The best strategy would be to refresh existing datasets using locally unique (currently not globally unique) record identifiers. De-duplication of resources based on GBIF-mediated data mixed with unique new data, and cross-linking to other datasets published in GBIF, would be needed.

Recommendations

6.9.1. Implement as an additional feature to the GBIF portal an ‘occurrence backbone’ by merging and combining information for the same occurrence/specimen linked using persistent identifiers, for datasets provided by different data owners, in the same manner as for taxon names.

6.9.2. Publish additional occurrences from Genesys, the Crop Wild Relative occurrence dataset, and other ABD datasets that contain a mix of new records and records already published via GBIF, using shared or linked specimen-level identifiers so that information about the same specimen/accession can be linked together.

6.9.3. Using unique and semi-unique identifiers to identify “duplicate” records between different datasets could improve the mechanisms for updating and refreshing datasets in the GBIF portal. Cross-linking and de-duplication would be needed when using resources based on existing GBIF-mediated data records mixed with unique new data.
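The de-duplication described in 6.9.3 can be sketched as follows. This is a minimal illustration in Python under stated assumptions: `occurrenceID` is treated as the unique identifier, and the combination of `institutionCode`, `collectionCode` and `catalogNumber` is assumed to serve as the semi-unique fallback. The function names and the matching rules are hypothetical, not a description of how the GBIF portal matches records.

```python
TRIPLET_FIELDS = ("institutionCode", "collectionCode", "catalogNumber")


def record_keys(record):
    """Return the matching keys for a record, strongest identifier first."""
    keys = []
    occ_id = (record.get("occurrenceID") or "").strip()
    if occ_id:
        keys.append(("occurrenceID", occ_id))
    # Assumption: the triplet is compared case-insensitively.
    triplet = tuple((record.get(f) or "").strip().lower() for f in TRIPLET_FIELDS)
    if all(triplet):
        keys.append(("triplet", triplet))
    return keys


def split_new_and_duplicate(existing, incoming):
    """Separate incoming records into new ones and likely duplicates.

    Returns (new, duplicates), where duplicates is a list of
    (incoming_record, matched_existing_record) pairs for cross-linking.
    """
    seen = {}
    for rec in existing:
        for key in record_keys(rec):
            seen.setdefault(key, rec)
    new, duplicates = [], []
    for rec in incoming:
        match = next((seen[k] for k in record_keys(rec) if k in seen), None)
        if match is None:
            new.append(rec)
        else:
            duplicates.append((rec, match))
    return new, duplicates
```

Returning the matched pairs, rather than silently dropping duplicates, keeps the information needed to cross-link an incoming record to the record already published.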

6.10 Agrobiodiversity user profile access

Users accessing the GBIF portal need to register with a user profile before downloading data. The user registration system should be expanded to allow users to choose from a set of predefined user profiles. An ‘agrobiodiversity user profile’ could offer thematically designed search widgets and tools to support the operations this type of user commonly performs on the GBIF portal. The downloaded files could also be offered in a specifically optimized ‘agrobiodiversity-format’. A hierarchy of data profiles will be useful here. The purpose of such user profiles is not to reduce the content by hardcoding a rigid filter that excludes data records outside the ABD domain, but to put thematically designed search tools up front so that ABD users can find the information they need with fewer operations.

The so-called “reference dataset” approach might prove to be less effective here, because the ABD user generally wants to discover new sources of crop and crop wild relative data records rather than find those records previously identified and included in a “reference dataset”. Algorithms to identify newly available data records, not yet discovered as agrobiodiversity relevant data records, would generally be more useful.
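One very simple form such an algorithm could take is sketched below in Python. It assumes that relevance can be approximated by matching the record's genus against a list of crop and crop wild relative genera, and that the reference dataset is available as a set of known occurrence identifiers. Both assumptions, along with the function name `find_new_abd_candidates`, are illustrative only; a real approach would likely need richer taxonomic and trait information.

```python
def find_new_abd_candidates(records, abd_genera, known_ids):
    """Return records that look ABD-relevant but are not yet in the reference set.

    records    -- iterable of dicts with 'genus' and 'occurrenceID' keys
    abd_genera -- genera considered crops or crop wild relatives (assumed list)
    known_ids  -- occurrenceIDs already included in the reference dataset
    """
    genera = {g.lower() for g in abd_genera}
    candidates = []
    for record in records:
        genus = (record.get("genus") or "").lower()
        if genus in genera and record.get("occurrenceID") not in known_ids:
            candidates.append(record)
    return candidates
```

Called with, for instance, `abd_genera=["Oryza", "Triticum", "Zea"]`, the function returns only matching records whose identifiers are absent from `known_ids`, which is the "newly available" subset the paragraph above refers to.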

Recommendations

6.10.1. A hierarchy of data-profiles and user-profiles, thematically designed search widgets, and tools would enable thematic users (such as ABD users) to use the GBIF portal more efficiently to find the range of data they need.
