The Nansen Legacy Template Generator for Darwin Core and CF-NetCDF

1 Introduction

The way we manage scientific data is changing. Services are available that allow people to easily visualise, aggregate, or reuse published data (e.g. Global Biodiversity Information Facility, 2023; SIOS data catalogue, 2023). To build such services efficiently at scale, data must be published in a way that is compliant with the FAIR guiding principles (Wilkinson et al., 2016).

The scientific community is increasingly recognising the importance of the FAIR guiding principles. Scientists are now not only encouraged but, in some cases, required to publish FAIR data. This growing pressure is coming from different sources:

Research institutions (e.g. The University Centre in Svalbard, 2023) and projects (e.g. The Nansen Legacy, 2021a) can have data policies that outline the principles that scientists should adhere to when managing their data.
Projects can have data management plans that outline in detail where, when and in what format datasets will be published and by whom (e.g. The Nansen Legacy, 2021b). Funding bodies can require the projects they fund to provide data management plans, and some state that data created through the projects they fund must follow the FAIR principles where possible (e.g. Research Council of Norway, 2023).
Data centres can encourage (e.g. https://adc.met.no/) or require (e.g. https://www.gbif.org/, https://obis.org/) data be published in file formats that are compliant with the FAIR principles.

File formats that are considered FAIR-compliant must be understandable and usable by both humans and machines. Two suitable file formats that are widely adopted are Darwin Core Archives for biodiversity data and associated data (Darwin Core Community, 2010) and NetCDF files that are compliant with the Climate & Forecast (CF) conventions (Eaton et al., 2022) for geoscientific data. Many scientists lack experience in creating such files, and the learning curve for publishing fully FAIR datasets can be steep. Data managers have an important role to play in providing tools that scientists can use to simplify this process and all aspects of working with FAIR data.

CF-NetCDF files and Darwin Core Archives are suitable formats for publishing and long-term storage of data. However, spreadsheet editors such as Microsoft Excel and LibreOffice Calc are still regularly used by many scientists during data collection, analysis, processing and preparation. Converting customised spreadsheets to a CF-NetCDF file or Darwin Core Archive can be time-consuming, but time can be saved by ensuring that the spreadsheets are well structured and populated in a consistent way. If an individual uses the same spreadsheet template for multiple data collections, they can write code to automate the conversion of their data. If communities develop templates, code can be shared or software can be developed for converting the templates. Data in the templates themselves can also be shared and more easily understood by other members of the community. Both data sharing and conversion to CF-NetCDF or Darwin Core Archives can be further simplified if the CF conventions and/or Darwin Core terms are considered when designing templates.

The Nansen Legacy spreadsheet template generator was initially developed and presented by Ellingsen et al. (2021). They developed the template generator for recording metadata on research cruises as part of the Nansen Legacy project, a multidisciplinary Norwegian research project involving over 200 participants from 10 Norwegian research institutions. The templates were widely adopted across the project and metadata was recorded for almost all of the Nansen Legacy cruise data collected. These templates were fed into a searchable metadata catalogue within a few weeks of each cruise ending, to provide an overview of all the data collected in the Nansen Legacy project (https://sios-svalbard.org/aen/tools). A selection of Darwin Core terms and CF standard names could be included as column headers in the templates – some terms were required. However, the data themselves were not recorded in this metadata catalogue.

In this paper, we present significant updates to the Nansen Legacy template generator. The primary objective of these updates is to enable scientists who prefer working with spreadsheets to create templates for both their metadata and data that are easily convertible to CF-NetCDF files or Darwin Core Archives. Two new configurations have been developed: one tailored to Darwin Core and one tailored to CF-NetCDF. Users can select column headers from a complete list of CF standard names or Darwin Core terms. A separate metadata sheet is included in the CF-NetCDF configuration, where users can record metadata adhering to the Attribute Convention for Data Discovery (ACDD).

Through this work, we aim to address the pressing need for accessible and user-friendly tools that assist scientists in publishing FAIR data. This contributes to advances in how scientific data are managed and used. The template generator is hosted at https://www.nordatanet.no/aen/template-generator.

2 Materials and Methods

The new version of the Nansen Legacy template generator has been developed as a Python Flask application (Figure 1). The code is hosted in a GitHub repository at https://github.com/SIOS-Svalbard/Nansen_Legacy_template_generator.

Figure 1

Overview of the system architecture of the Nansen Legacy template generator.

At the time of writing, the template generator has three different configurations:

CF-NetCDF – to facilitate the creation of CF-NetCDF.
Darwin Core – to facilitate the creation of Darwin Core Archives.
Learnings from Nansen Legacy logging system – aligns with a new version of the Nansen Legacy logging system (Ellingsen et al., 2021), currently under development and outside the scope of this article. The configuration offers various subconfigurations to accommodate the logging of different types of (meta)data collected during marine research expeditions. These templates contribute to a hierarchical metadata catalogue in the form of a PostgreSQL database table, as presented by Ellingsen et al. (2021).

In the remainder of this section, we first describe features and functionality relevant to all of the above configurations, and then describe features specific to either the CF-NetCDF configuration or the Darwin Core configuration.

2.1 Relevant to all configurations

The template generator has a graphical user interface (GUI) that includes checkboxes for selecting various terms (Figure 2a). In many cases, these terms are drawn from controlled vocabularies (Figure 1), and detailed descriptions for each term are accessible via hover-over tooltips. Each term that the user selects will become a column header in their spreadsheet template (Figure 2b).

Figure 2

a) Example of the GUI with checkboxes for term selection. b) Example of the ‘Data’ sheet of the template generated.

Users can also select from the complete catalogues of CF standard names or Darwin Core terms through searchable dropdown menus, accessed by selecting the button for each vocabulary (Figure 3a). Each button opens a macro that includes a searchable list of terms (Figure 3b–c). Descriptions of each term appear when the user hovers over the term. The user can select the checkboxes to add terms to their template. The template generator also includes a selection of other terms that are not from a controlled vocabulary that we deem to be potentially useful in some cases, grouped into categories (Figure 3d).

Figure 3

The template generator includes buttons (a) that open up macros which allow the user to add CF standard names (b), Darwin Core terms (c), or other potentially useful terms (d) to their spreadsheet templates.

The CF standard names and their descriptions are sourced from https://cfconventions.org/Data/cf-standard-names/current/src/cf-standard-name-table.xml. The Darwin Core terms and descriptions are sourced from https://github.com/tdwg/rs.tdwg.org/blob/master/terms/terms.csv. Both of these sources always include the latest version of their vocabulary. We store the terms as JSON files in the local repository so that they do not need to be harvested each time the software is activated. The harvesting can be repeated, thus updating the JSON files, either from a separate page in the software’s GUI or by running a function in a Makefile from the command line.
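
The harvesting step described above can be illustrated with a minimal sketch. It is not the template generator's own code: the function names and the output file layout are assumptions made for illustration, and the inline XML is a tiny placeholder excerpt in the layout of the CF standard name table (one `entry` element per standard name), so the example runs without network access.

```python
import json
import xml.etree.ElementTree as ET

# Placeholder excerpt in the layout of the CF standard name table.
# A real harvest would first download the full XML file from the URL above.
SAMPLE_XML = """
<standard_name_table>
  <entry id="sea_water_temperature">
    <canonical_units>K</canonical_units>
    <description>Placeholder description for illustration.</description>
  </entry>
  <entry id="sea_water_practical_salinity">
    <canonical_units>1</canonical_units>
    <description>Placeholder description for illustration.</description>
  </entry>
</standard_name_table>
"""


def parse_standard_names(xml_text):
    """Return one {id, units, description} dict per <entry> element."""
    root = ET.fromstring(xml_text)
    terms = []
    for entry in root.iter("entry"):
        terms.append({
            "id": entry.get("id"),
            "units": (entry.findtext("canonical_units") or "").strip(),
            "description": (entry.findtext("description") or "").strip(),
        })
    return terms


def save_terms(terms, path):
    """Write the harvested terms to a local JSON file (hypothetical layout)."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(terms, handle, indent=2, ensure_ascii=False)


terms = parse_standard_names(SAMPLE_XML)
print([term["id"] for term in terms])
```

Caching the parsed terms in a JSON file means the application only has to read a small local file at start-up, and the harvest can be rerun whenever a new vocabulary version is released.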

Additionally, users can select from a category of other terms that are not included in a controlled vocabulary but that we deem to be potentially useful in certain cases (Figure 3d). This category includes terms related to identifiers, sampling stations, personnel details, as well as terms specific to marine research cruises, such as those conducted within the Nansen Legacy project. We would like to encourage the exclusive use of terms from controlled vocabularies wherever possible, particularly when users are preparing their data for publication. However, we acknowledge that some terms do not map to existing CF standard names or Darwin Core terms. The development team will remove obsolete custom terms from the template generator as more terms are added to controlled vocabularies.

The user can select a button to generate an XLSX file that can be opened using most commonly used spreadsheet editors. In the template (Figure 2b), when the user selects a cell below the column header, the description of that term is displayed as a note. In some cases, cell restrictions are included to prevent users from entering invalid values. These cell restrictions are hard-coded into the template generator as these cannot be deduced from the source files of the CF standard names or Darwin Core terms currently.

2.2 CF-NetCDF configuration

The CF-NetCDF configuration is tailored to facilitate the creation of CF-compliant NetCDF files. The user is presented with checkboxes of the CF standard names of some commonly used coordinate variables (Figure 2a). In most cases, each selected term becomes a column header in a dedicated ‘Data’ sheet within the spreadsheet template (Figure 2b). However, when data represent a cell rather than a discrete point, users can also select bounds for their coordinate variable, resulting in the generation of two additional columns (minimum and maximum) in the ‘Data’ sheet for each box selected.
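
As a rough illustration of how selecting bounds expands the ‘Data’ sheet, the sketch below builds the list of column headers from the selected terms. The function name and the `_min` and `_max` suffixes are hypothetical; the headers actually written by the template generator may differ.

```python
def build_data_columns(selected_terms, terms_with_bounds=()):
    """Return the ordered column headers for the 'Data' sheet.

    The '_min' and '_max' suffixes are assumptions made for this sketch.
    """
    columns = []
    for term in selected_terms:
        columns.append(term)
        if term in terms_with_bounds:
            # Two extra columns hold the lower and upper edge of each cell.
            columns.append(f"{term}_min")
            columns.append(f"{term}_max")
    return columns


columns = build_data_columns(
    ["time", "depth", "sea_water_temperature"],
    terms_with_bounds={"depth"},
)
print(columns)
# ['time', 'depth', 'depth_min', 'depth_max', 'sea_water_temperature']
```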

Each spreadsheet template also includes a ‘Metadata’ sheet (Figure 4). This contains a list of global attributes that can be included in a CF-NetCDF file, categorised as required, recommended or optional, accompanied by descriptions for each term. These global attributes are harvested from the recommendations of the Arctic Data Centre at https://adc.met.no/node/4, which align with the recommendations of the Svalbard Integrated Arctic Earth Observing System (https://sios-svalbard.org/). The Arctic Data Centre have extended the ACDD conventions (https://wiki.esipfed.org/Attribute_Convention_for_Data_Discovery_1-3) to reduce the degrees of freedom, thereby making it easier to build reliable, machine readable services upon compliant CF-NetCDF files. Also, the ACDD includes recommendations whereas the Arctic Data Centre includes requirements; we believe that minimum requirements are a more effective way to encourage consistency across datasets.

Figure 4

An example of the ‘Metadata’ sheet included in the template created using the CF-NetCDF configuration of the template generator.

2.3 Darwin Core configuration

The Darwin Core configuration (Figure 5) can streamline the creation of a Darwin Core Archive (DwCA). A DwCA consists of one or more CSV files that are zipped together with an EML.xml file and META.xml file. One CSV file serves as the ‘core’ of the DwCA and any other CSV files are extensions that contain data linked to the core. The column headers of each core and extension are terms taken from controlled vocabularies. The META.xml file includes a machine-understandable description of the contents of each core and extension and links column headers to terms within controlled vocabularies. The EML.xml file includes metadata compliant with the Ecological Metadata Language.
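
To make the archive layout concrete, the following sketch assembles a minimal archive in memory using only the Python standard library. It is an illustration under stated assumptions rather than a replacement for the Integrated Publishing Toolkit: the function names are hypothetical, every column header is assumed to belong to the Darwin Core terms namespace, only a core (no extension) is described, and the metadata document is an empty placeholder rather than valid EML. The descriptor and metadata files are written with the lowercase names `meta.xml` and `eml.xml` that archives conventionally use.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

DWC_TERMS = "http://rs.tdwg.org/dwc/terms/"


def build_meta_xml(core_file, row_type, headers):
    """Describe one core CSV file; column 0 is taken as the core identifier."""
    archive = ET.Element(
        "archive",
        {"xmlns": "http://rs.tdwg.org/dwc/text/", "metadata": "eml.xml"},
    )
    core = ET.SubElement(archive, "core", {
        "encoding": "UTF-8",
        "fieldsTerminatedBy": ",",
        "linesTerminatedBy": "\\n",  # literal backslash-n, as used in descriptors
        "fieldsEnclosedBy": '"',
        "ignoreHeaderLines": "1",
        "rowType": DWC_TERMS + row_type,
    })
    files = ET.SubElement(core, "files")
    ET.SubElement(files, "location").text = core_file
    ET.SubElement(core, "id", {"index": "0"})
    for index, header in enumerate(headers):
        # Assumption: every header is a term in the Darwin Core namespace.
        ET.SubElement(core, "field", {"index": str(index), "term": DWC_TERMS + header})
    return ET.tostring(archive, encoding="unicode")


def build_archive(core_file, core_csv_text, meta_xml, eml_xml):
    """Zip the core CSV, descriptor and metadata together; return the bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(core_file, core_csv_text)
        archive.writestr("meta.xml", meta_xml)
        archive.writestr("eml.xml", eml_xml)
    return buffer.getvalue()


headers = ["eventID", "eventDate", "decimalLatitude", "decimalLongitude"]
csv_text = ",".join(headers) + "\nst01,2021-07-14,78.2,15.6\n"  # made-up row
meta_xml = build_meta_xml("event.csv", "Event", headers)
dwca_bytes = build_archive("event.csv", csv_text, meta_xml, "<eml/>")  # placeholder EML
```

The descriptor is what links each column position to a term URI, which is why consistent column headers in the template make the later mapping step straightforward.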

Figure 5

For the Darwin Core configuration of the template generator, a) the graphical user interface of the template generator, b) an example of a sheet entitled ‘Event Core’ in a template generated.

The Darwin Core configuration includes multiple subconfigurations (Sampling Event, Occurrence Core) based on what the user would like the core of the DwCA to be (Figure 5a). Upon selecting a subconfiguration, users can choose from applicable extensions to include alongside the core. The form is divided into multiple sections, with each section corresponding to a core or extension and becoming a separate sheet in the resulting template.

Users are presented with required, recommended, and additional suggested terms for each core or extension, based on the requirements of GBIF’s Integrated Publishing Toolkit (Robertson et al., 2014). The terms and descriptions are harvested from XML files accessible via https://rs.gbif.org/extensions.html and stored as local JSON files to expedite template generation. One limitation is that the URLs to the XML files are version specific. An evolving source file that always includes the latest version does not currently exist. Therefore, to accommodate changes and updates to any cores or extensions, the source URL must be updated as new versions are released.

We have hard-coded which terms are ‘required’ and ‘recommended’ to be consistent with GBIF’s configuration of the Integrated Publishing Toolkit. This information is not included in the source XML files at the time of writing, and different data centres can have different requirements.

3 Discussion

While the template generator has been developed to facilitate the publishing of FAIR data, it is essential to note that the templates themselves are not compliant with the FAIR principles and should not be published. However, recording data in structured templates that consider commonly used standards is good practice for unpublished data too. The Nansen Legacy template generator could therefore be a useful tool for any project, course or field campaign where one is interested in recording data in a consistent manner between multiple data collectors or data collection campaigns.

To create FAIR-compliant data from the templates, users can create CF-NetCDF files or Darwin Core Archives. Below, we suggest tools and tutorials that can assist users in converting the generated templates into these formats. The conversion process will vary depending on the specifics of the dataset.

3.1 Creating a CF-NetCDF file from the template

There are several different ways that one can create a CF-NetCDF file from a populated template:

Using common programming languages: CF-NetCDF files can be generated using widely-used programming languages. For Python users, online tutorials are available to guide you through the process (Marsden, 2024a), while R users can find similar guidance for R programming (Marsden, 2024b).
Leveraging dedicated tools: There are dedicated tools available, such as Rosetta (Hamre et al., 2023), that offer a graphical user interface to convert tabular data (like the data in your template) into CF-NetCDF files compliant with the Attribute Convention for Data Discovery (ACDD). Another tool, MeteoIO, can be used to create a CF-NetCDF file from multiple different input formats and perform statistical quality control on the data (Bavay et al., 2020).

One should always refer to the latest version of the CF conventions documentation on how to encode data in a CF-NetCDF file. In most cases, the column headers should be encoded as the content of the standard_name variable attribute for each variable. One exception is the bounds (minimum and maximum), which are used where the coordinates are more accurately described as cells than discrete points. Version 1.10 of the CF conventions includes relevant guidance on encoding this type of data under ‘7. Data Representative of Cells.’
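
The mapping from column headers to variables can be sketched independently of any NetCDF library. The example below only produces a plan (a dictionary of variable names, dimensions and attributes) that could then be handed to a library of the user's choice. The function name, the `_min`/`_max` column suffixes, the `_bnds` variable suffix and the dimension names `obs` and `nv` are all assumptions made for illustration. What follows the CF conventions is the general pattern: the coordinate variable carries a `bounds` attribute naming a boundary variable, and the boundary variable has one extra trailing dimension of size two.

```python
def plan_cf_variables(headers, dimension="obs"):
    """Map template column headers to a library-independent variable plan."""
    headers = list(headers)
    parents = {}          # coordinate header -> (min column, max column)
    bounds_columns = set()
    for header in headers:
        low, high = f"{header}_min", f"{header}_max"
        if low in headers and high in headers:
            parents[header] = (low, high)
            bounds_columns.update((low, high))

    plan = {}
    for header in headers:
        if header in bounds_columns:
            continue  # stored in the boundary variable of the parent coordinate
        attributes = {"standard_name": header}
        if header in parents:
            bounds_name = f"{header}_bnds"
            attributes["bounds"] = bounds_name
            plan[bounds_name] = {
                "dimensions": (dimension, "nv"),  # nv has length 2
                "source_columns": parents[header],
                # A boundary variable is treated as part of its coordinate's
                # metadata, so no attributes are added to it here.
                "attributes": {},
            }
        plan[header] = {
            "dimensions": (dimension,),
            "source_columns": (header,),
            "attributes": attributes,
        }
    return plan


plan = plan_cf_variables(["depth", "depth_min", "depth_max", "sea_water_temperature"])
for name, spec in plan.items():
    print(name, spec["dimensions"], spec["attributes"])
```

Other variable attributes, such as units, would still need to be added before the file is written; this sketch covers only standard_name and bounds.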

It’s important to note that the specific approach to creating CF-NetCDF files may vary depending on the structure and nature of the dataset being used.

3.2 Creating a Darwin Core Archive from the template

To create a Darwin Core Archive from a completed template, you can follow these steps:

Export CSV files from the template: Ensure that your completed template includes one sheet for each core/extension that you intend to include in the Darwin Core Archive. Export each of these individual sheets as separate CSV files. Optionally, you can delete the header rows, making the hidden row with column headers the first row in each CSV file. Please note that not all cores and extensions are available in the template generator.
Access the Integrated Publishing Toolkit: Identify a data centre that hosts the Integrated Publishing Toolkit (IPT) (Robertson et al., 2014) and request access to it. You can find suitable data centres at https://www.gbif.org/ipt.
Uploading data: Upload each of your CSV files as source data in the IPT. Map the source data to the relevant core or extension. If the spelling of your column headers is consistent, terms should be automatically mapped to terms that the IPT deems applicable for that core/extension. Column headers can alternatively be mapped manually.
Adding metadata: Use the Integrated Publishing Toolkit’s form to add metadata that will be included in the EML.xml file within the Darwin Core Archive.
Publishing: The Integrated Publishing Toolkit will combine your mapped CSV files and metadata to create a Darwin Core Archive when you publish with the data centre.
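
The optional clean-up mentioned in the first step, removing the rows above the row of term names, can be scripted. The sketch below is one possible approach under an assumed layout: it treats the first row whose non-empty cells are all recognised terms as the header row and discards everything above it. The function name, the set of known terms and the example rows are hypothetical.

```python
import csv
import io


def strip_leading_rows(csv_text, known_terms):
    """Drop every row above the first row made up solely of known terms."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    for index, row in enumerate(rows):
        cells = [cell.strip() for cell in row if cell.strip()]
        if cells and all(cell in known_terms for cell in cells):
            output = io.StringIO()
            csv.writer(output, lineterminator="\n").writerows(rows[index:])
            return output.getvalue()
    raise ValueError("no row of recognised terms found")


exported = (
    "Event ID,Event Date\n"       # display names (assumed layout)
    "eventID,eventDate\n"         # row of Darwin Core terms
    "st01,2021-07-14\n"           # made-up data row
)
cleaned = strip_leading_rows(exported, {"eventID", "eventDate"})
print(cleaned)
```

With the term row first, the column headers are spelled exactly as the terms are, which is the condition for automatic mapping described in the uploading step.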

For a visual guide to this process, you can watch a video tutorial available at https://www.youtube.com/watch?v=DbvlwnYXuPU.

3.3 Limitations and future developments

This paper presents the current state of the template generator, though it is also important to acknowledge the ongoing development of this open-source software. We welcome contributions and offers of collaboration from the community to enhance and refine its functionality. The source code is hosted on GitHub at https://github.com/SIOS-Svalbard/Nansen_Legacy_template_generator, and we encourage users to raise issues to request new features or report any bugs they encounter.

We also recognize certain limitations in the current setup that may offer opportunities for future development.

3.3.1 Variable attributes in CF-NetCDF files

For a CF-NetCDF file to be FAIR, each variable must include comprehensive metadata that describes it in the form of variable attributes that adhere to the CF conventions (Eaton et al., 2022). The template generator’s CF-NetCDF configuration encourages users to consider the standard_name variable attribute, but there is room for expansion to include other required variable attributes. Developing this aspect of the template generator would mean that the template could include most, if not all, of the required content of a CF-NetCDF file. However, we envisage that accommodating different variable attribute requirements for various data types and user-specific needs could be challenging.

3.3.2 A template validator

To ensure data and metadata integrity, there may be a need for a template validator. Such a tool could help identify errors in the user’s completed template. These could be, for example, due to accidental overwriting of cell restrictions or the deletion of required data. It is worth noting that the Integrated Publishing Toolkit has a built-in validator for Darwin Core Archives, and there are validators for CF-NetCDF files that also check against the ACDD convention – e.g. https://github.com/ioos/compliance-checker. However, having a validator specific to the templates would allow users to confirm that the content of the template is valid before they proceed further. The development of a validator would be easier if the requirements for each term were encoded into the source files for the controlled vocabularies.
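
As an indication of what such a validator might check, the sketch below reports missing required columns and empty required values in a sheet exported as CSV. It is a hypothetical starting point, not an existing feature of the template generator: the function name, the list of required terms and the example rows are assumptions.

```python
import csv
import io


def validate_rows(csv_text, required_terms):
    """Return a list of problems found; an empty list means none were found."""
    reader = csv.DictReader(io.StringIO(csv_text))
    headers = reader.fieldnames or []
    problems = [
        f"missing required column: {term}"
        for term in required_terms
        if term not in headers
    ]
    # The header is line 1 of the file, so data rows start at line 2.
    for line_number, row in enumerate(reader, start=2):
        for term in required_terms:
            if term in headers and not (row[term] or "").strip():
                problems.append(f"row {line_number}: empty value for {term}")
    return problems


sheet = "eventID,eventDate\nst01,2021-07-14\nst02,\n"  # made-up example rows
problems = validate_rows(sheet, ["eventID", "eventDate", "decimalLatitude"])
for problem in problems:
    print(problem)
# missing required column: decimalLatitude
# row 3: empty value for eventDate
```

Checks on value ranges or formats could be added in the same way once the requirements for each term are available in machine-readable form.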

4 Conclusions

In this paper, we have introduced the Nansen Legacy template generator (Marsden and Schneider, 2023), a tool designed to simplify the process of creating FAIR-compliant CF-NetCDF files or Darwin Core Archives. It allows scientists to record their data and metadata in a consistent manner in a template that considers well established standards – Darwin Core, the Climate and Forecast convention and the Attribute Convention for Data Discovery. Scientists may choose to make the initial recording of their data within their template, or alternatively convert their own recording of the data into a template. As the scientific community increasingly recognizes the importance of FAIR data practices, our software addresses a critical need for researchers, particularly those who work frequently with spreadsheets.

We began by discussing the changing landscape of scientific data management and the growing demand for FAIR data publication. The Nansen Legacy template generator responds to this need, offering scientists an accessible way to prepare their data for publication in a structured and standardised way. Our approach includes a user-friendly graphical interface that encourages researchers to select terms relevant to their datasets from controlled vocabularies. It guides users by outlining which data and metadata terms are required.

Importantly, we clarified that the templates generated are not themselves FAIR-compliant. Therefore, we have provided clear steps for users to follow, enabling them to transform these templates into CF-NetCDF files or Darwin Core Archives that adhere to the FAIR principles. Furthermore, the use of structured templates also allows scientists who use the same template for multiple datasets to automate the conversion to CF-NetCDF or Darwin Core Archive.

The template generator can also be used to record data that the user does not intend to publish. Recording data in a structured manner between different data collectors or data collection campaigns makes it easier to aggregate, analyse and understand the data.

In conclusion, the Nansen Legacy template generator represents a significant contribution to the scientific community’s ongoing efforts to promote FAIR data practices. This template generator simplifies the creation of CF-NetCDF files and Darwin Core Archives, thereby empowering researchers to share their findings more effectively, fostering greater transparency, reproducibility, and collaboration in science. We invite researchers and developers to engage with our open-source project and contribute to its continuous improvement, as we collectively work towards a more FAIR scientific data landscape.

5 Code Availability

The code developed is freely available from GitHub at https://github.com/SIOS-Svalbard/Nansen_Legacy_template_generator and version 1.01 is published by Marsden and Schneider (2023). New releases will be published as the software develops.

Data Science Journal

Practice Papers