3.1 Requirements
For the sake of readability, we grouped requirements into four categories, corresponding to the FAIR principles (Findability, Accessibility, Interoperability, and Reusability) [43]. In so doing, we want to stress one of the main strengths of CLEF, i.e., producing high-quality, reusable data. This aspect is particularly relevant to the scholarly domain, and in recent years it has increasingly been addressed in the Cultural Heritage domain too [28]. Along with each requirement, we propose one or more solutions to drive development. Where applicable, such requirements have been translated into Linked Open Data requirements or functionalities.
– Findability. Data are identified with persistent identifiers (URIs), described with rich metadata, and findable on the Web.
– Discoverability. Allow search engines to leverage structured data for indexing purposes. Every record is served as an HTML5 document including RDFa annotations.
– Exploration. Automatically generate data views to facilitate retrieval. Operations for automatically generating views, such as filtering, grouping, and sorting, are available. These are ontology-driven, i.e., the result of SPARQL queries.
– Accessibility. Data are accessible via the HTTP protocol and are available in the long term via several solutions for programmatic data access.
– Preservation (sources). Request digital preservation of user-specified resources. Rely on established services such as the Wayback Machine for web archiving.
– Preservation (ontologies). Allow direct reuse of up-to-date schemas and data of existing projects. Retrieve information on the user-defined data model from the Linked Open Vocabularies initiative [42].
– Preservation (data). Integrate the system with established data management workflows. Bind changes in Linked Data to commits in GitHub and release versions in Zenodo.
– Persistency. Ensure continuity of services built on top of generated data. Prevent deletion of published records identified with persistent URIs.
– Interoperability. Data are served in standard serializations, include references to standard or popular ontologies, and link to external Linked Open Data sources. While this was not an explicit requirement highlighted by the use cases, it is a natural consequence of using Linked Open Data, which makes data easier to work with (e.g., in data integration over multiple sources).
– Reusability. Data are released as open data with non-restrictive licenses, are associated with detailed provenance information, and follow well-known data sharing policies.
– Enhancement. Generate structured data from natural language texts. Perform Named Entity Recognition over long texts on demand, extract structured data, and reconcile entities to Wikidata.
– Consistency. Ensure interlinking of records and correct usage of terminology. Suggest terms from selected Linked Open Data sources and user-specified controlled vocabularies while creating new records. Allow contradictory information to be recorded as named graphs. Enable peer-review mechanisms to supervise contributions from non-experts and prevent inconsistencies in the final user application. Allow restriction of access and grant privileges to a group of users that share ownership of data on GitHub.
– Accuracy. Allow fine-grained curatorial intervention on crowdsourced data. Represent records as named graphs and annotate graphs with provenance information according to the PROV Ontology (including contributors, dates, and activities/stages in the peer-review process). Update annotations every time a graph changes. Track changes and responsibilities in GitHub commits.
– Validation. Allow automatic validation of data. Along with manual curation, perform schema- and instance-level checks to ensure that created data conform to user-generated (ontology-based) templates.
Moreover, while not in the scope of the FAIR principles, the use cases highlighted that user-friendly interfaces are necessary, or at least highly recommended, to prevent error-prone operations, guarantee high data quality standards, and serve easy-to-find data. Therefore, the provision of easy-to-use interfaces becomes a fundamental user requirement of CLEF to ensure (1) reusability of data and ease of exploration for the final user, and (2) simplicity and error avoidance for editors and administrators.
In summary, the interaction with stakeholders highlighted three important research areas, namely: (1) the need for user interfaces to manage most data management processes, which would otherwise require complex or time-consuming operations to be performed manually (e.g., data reconciliation, data quality validation, data exploration) (UF); (2) the importance of provenance management in the editorial process (PM); and (3) the compliance with reusability and sustainability requirements and the integration with data management workflows for scholarly data (DMI). Managing data natively as Linked Data allows us to address all three aspects and to fully comply with FAIR principles.
3.2 CLEF Overview
CLEF is a highly configurable application that allows digital humanists and domain experts to build their own crowdsourcing platform, to integrate the data management workflow with Linked Open Data standards and popular development and community platforms, and to immediately enjoy high-quality data with exploratory tools. CLEF is a web-based application in which users can describe resources (e.g., real-world entities, concepts, digital resources) via intuitive web forms. To help with entering descriptions, users are offered auto-complete suggestions from vocabularies, existing records, and terms automatically extracted from text. Administrative users have full control over the setup of their CLEF application and over the definition of templates for describing information about their resources. The templating system of CLEF is configurable via a web interface, in which each form field for describing the resource is mapped to an ontology predicate chosen by the user. Templates are the main drivers of the application, since they ensure consistency in data entry and data validation, guide the peer review of records, and are fundamental in retrieval and exploration via actionable filters. It is worth noting that the template setup, in which the ontology mapping is (at present) manually curated, is the only input that requires expert users.
Both authenticated and anonymous contributions can be enabled. A simple peer-review mechanism allows users to curate their records, publish them, and continuously populate the catalogue of contents, which can be immediately browsed via automatically generated interfaces and filters. The tool is particularly suitable for collaborative projects that need to restrict access to members of one or more organizations, share data and code on dissemination repositories, and have an environment in which to discuss project issues. In fact, CLEF is designed to be easily integrated with GitHub and to simplify the data management workflow, naturally supporting several of the FAIR requirements.
Figure 1 presents an overview of the CLEF data management system. In detail, CLEF allows an administrator to configure and customize the application via user-friendly interfaces. In the configuration setup, users can specify information relevant to their dataset, e.g., URI base, prefix, SPARQL endpoint API (with a default configuration), and optional mechanisms for version control and user authentication (via GitHub). The setup of dereferencing mechanisms is delegated to the adopter, who can choose and set up redirection rules by means of their favorite persistent URI provider (e.g., w3id).
For each type of resource to be collected and described, a template is created in the form of a JSON mapping document. This includes form field types (e.g., text box, checkbox, dropdown), expected values (literals, entities), services to be called (e.g., autocomplete based on Wikidata and the catalogue, Named Entity Recognition in long texts), the mapping between fields and ontology terms or controlled vocabularies, and whether the field should be used as a filter to aggregate data in a default web page called Explore.
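As an illustration, a single field entry in such a JSON mapping document might look like the following sketch. All key names and values are hypothetical, since CLEF's actual template schema may use different keys; only the mapped ontology term (dcterms:subject) is a real predicate.

```python
# Hypothetical sketch of one field in a CLEF-like JSON template.
# Key names ("type", "value", "service", ...) are illustrative only.
import json

subject_field = {
    "id": "subject",                        # internal field identifier
    "label": "Subject",                     # label shown in the web form
    "type": "textbox",                      # form widget: textbox, checkbox, dropdown, ...
    "value": "entity",                      # expected value: an entity URI rather than a literal
    "service": ["wikidata", "catalogue"],   # autocomplete sources to query while typing
    "property": "http://purl.org/dc/terms/subject",  # ontology term the field is mapped to
    "filter": True,                         # use this field as a facet in the Explore page
}

print(json.dumps(subject_field, indent=2))
```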
Ontology terms and terms from controlled vocabularies specified by users are managed via the vocabulary module. It is worth noting that, while users can specify their own ontology terms, CLEF fosters the reuse of popular and standard vocabularies. The module updates the CLEF triplestore with user-specified terms (which may be new terms or terms belonging to existing ontologies) and calls the APIs of Linked Open Vocabularies (LOV) [42] to retrieve the original labels and comments associated with reused terms. The resulting data model is shown in a dedicated web page called Data Model, along with the information retrieved from LOV.
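A minimal sketch of such a lookup is shown below. It assumes the public LOV term-search endpoint and response shape as documented at the time of writing, and it is not CLEF's actual code; the `.get()` calls keep the sketch tolerant of differences in the response format.

```python
# Sketch: retrieve metadata for a reused term from the Linked Open
# Vocabularies (LOV) API. Endpoint and response fields are assumptions
# based on the public LOV v2 API, not CLEF's actual implementation.
import requests

LOV_TERM_SEARCH = "https://lov.linkeddata.es/dataset/lov/api/v2/term/search"

def lov_lookup(term_uri: str) -> list[dict]:
    """Search LOV for a reused term and return the raw hits."""
    resp = requests.get(
        LOV_TERM_SEARCH,
        params={"q": term_uri, "type": "property"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json().get("results", [])

for hit in lov_lookup("http://purl.org/dc/terms/subject"):
    # Field names depend on the LOV response format.
    print(hit.get("uri"), hit.get("prefixedName"))
```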
The form for data entry is generated according to the settings specified in templates. While editing (creating, modifying, or reviewing) a record, both the CLEF triplestore and external services such as the Wikidata APIs are called to provide suggestions. Every time a record is created or modified, data are sent to the ingestion module. The latter performs a first validation of the form based on the associated template and calls the mapping module, which transforms data into RDF according to the ontology terms specified in the template and updates the named graph created for the record.
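The sketch below illustrates the general idea of such a mapping step with rdflib: form values are turned into triples according to a field-to-predicate mapping and written into the record's named graph. The base URI, field names, and mapping are hypothetical and do not reproduce CLEF's internals.

```python
# Sketch: turn submitted form data into RDF inside a record-specific
# named graph, following a field-to-predicate mapping from a template.
# URIs, field names, and the mapping itself are hypothetical.
from rdflib import Dataset, URIRef, Literal, Namespace
from rdflib.namespace import RDFS

BASE = Namespace("https://example.org/")          # assumed project URI base
DCT = Namespace("http://purl.org/dc/terms/")

# Mapping taken from the template: form field -> (predicate, value kind)
FIELD_MAP = {
    "name":    (RDFS.label, "literal"),
    "subject": (DCT.subject, "entity"),
}

def record_to_graph(record_id: str, form_data: dict) -> Dataset:
    ds = Dataset()
    graph = ds.graph(URIRef(BASE[record_id + "/"]))   # one named graph per record
    subject = URIRef(BASE[record_id])
    for field, value in form_data.items():
        predicate, kind = FIELD_MAP[field]
        obj = URIRef(value) if kind == "entity" else Literal(value)
        graph.add((subject, predicate, obj))
    return ds

ds = record_to_graph("record-1", {
    "name": "Papers on Leonardo da Vinci",
    "subject": "http://www.wikidata.org/entity/Q762",  # Leonardo da Vinci, used as an example
})
print(ds.serialize(format="trig"))
```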
CLEF supports any compliant SPARQL 1.1 [24] endpoint as back end and is therefore not dependent on a specific implementation. However, current running instances use Blazegraph [41]. In particular, named graphs are extensively used to annotate and retrieve the provenance information needed to manage the peer-review process, and to efficiently serve record-related information in the exploratory interfaces.
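For instance, the published records and their basic provenance could be retrieved with a query along the lines of the following sketch. The endpoint URL and the publication-stage property are assumptions; the PROV terms are standard, and the exact layout of CLEF's provenance triples may differ.

```python
# Sketch: list published records and basic provenance via SPARQL.
# The endpoint URL and the "publicationStage" property are hypothetical;
# prov:wasAttributedTo and prov:generatedAtTime come from PROV-O.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "http://localhost:9999/blazegraph/sparql"   # assumed local endpoint

QUERY = """
PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX base: <https://example.org/>    # assumed project base URI

SELECT ?record ?agent ?time WHERE {
  ?record base:publicationStage "published" ;   # hypothetical stage property
          prov:wasAttributedTo ?agent ;
          prov:generatedAtTime ?time .
}
ORDER BY DESC(?time)
"""

sparql = SPARQLWrapper(ENDPOINT)
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["record"]["value"], row["agent"]["value"], row["time"]["value"])
```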
A module is dedicated to the interaction with GitHub. GitHub was chosen for its popularity as a dissemination platform for versioned code and data, which fosters visibility of project results, and for its services (i.e., APIs for read-write operations and OAuth mechanisms). Users may decide to bind their application to a GitHub repository, which allows them (1) to store a backup of data in a public/private repository, (2) to keep track of every change to data via commits, and (3) to enable user authentication to the web application via GitHub OAuth.
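A sketch of such a read-write backup operation using the GitHub Contents API is shown below; the repository, file path, and token are placeholders, and CLEF's own synchronization logic is not reproduced.

```python
# Sketch: back up a record's RDF serialization to a GitHub repository
# via the Contents API, producing one commit per change.
# Owner, repository, path, and token are placeholders.
import base64
import requests

API = "https://api.github.com/repos/{owner}/{repo}/contents/{path}"

def commit_record(owner: str, repo: str, path: str, rdf_text: str, token: str) -> None:
    url = API.format(owner=owner, repo=repo, path=path)
    headers = {"Authorization": f"token {token}", "Accept": "application/vnd.github+json"}

    # If the file already exists, its current blob SHA must be sent to update it.
    current = requests.get(url, headers=headers, timeout=10)
    sha = current.json().get("sha") if current.status_code == 200 else None

    payload = {
        "message": f"Update {path}",
        "content": base64.b64encode(rdf_text.encode("utf-8")).decode("ascii"),
    }
    if sha:
        payload["sha"] = sha
    requests.put(url, json=payload, headers=headers, timeout=10).raise_for_status()
```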
Lastly, to increase the findability of collections, a few automatically generated web pages serve browsing and search interfaces over the catalogue. Currently, CLEF provides the following templates: a homepage; the back-end controller, from which to access the list of records, the setup configuration form, and the template forms; forms for record creation, modification, review, and publication; a Documentation page with instructions on the usage of forms; the Explore page, where views on collected data are shown and filtered; a template to display records, wherein Linked Open Data are also served as RDFa annotations; a template to display controlled vocabulary terms and statistics on their usage in the catalogue; a Data Model page, collecting ontology terms and definitions from LOV; and a GUI to query the SPARQL endpoint.
The software has been developed in two phases. An initial data management system was developed for ARTchives. In a second phase, the code base was extended and adapted to be customizable and reusable as-is in other crowdsourcing projects. CLEF is developed in Python and based on Webpy, a simple, small-size framework for web applications. The source code of CLEF is available on GitHub and Zenodo [12].
CLEF is a production-ready solution. It is under continuous development to become a flexible tool for a wider range of collaborative scholarly projects. Potential scalability issues have been mitigated in recent SPARQL/quad store implementations, and there is a continuing effort in the community to support performance analysis [29, 38].
The initial version of the system was tested with around 15 cataloguers from the six institutions promoting the ARTchives project, namely: the Federico Zeri Foundation (Bologna), Bibliotheca Hertziana (Rome), the Getty Research Institute (Los Angeles), the Kunsthistorisches Institut in Florenz (Florence), Scuola Normale Superiore (Pisa), and Università Roma Tre (Rome). Currently, user tests are continuously performed by Polifonia project members, who provide new requirements to foster development and research, documented in the musoW repository issue tracker. User tests will soon be performed with users with different profiles and less technical experience.
3.3 The Editorial Process: Provenance Management and User Authentication
In CLEF, every record is formally represented as a named graph [5]. Named graphs enable us to add RDF statements describing those graphs, including their provenance, such as the activities, dates, and agents involved in the creation and modification of a record. Provenance information is described by means of the well-known W3C-endorsed PROV Ontology [33]. Moreover, named graphs allow us to prevent inconsistencies caused by competing descriptions of the same entities, for instance when different cataloguers describe the same creator of multiple collections. While this scenario is allowed, users are informed of potential duplicates when creating a new record, which prevents involuntary inconsistencies.
The editorial process in CLEF comprises three phases: record creation, record modification, and review and publication. When a record is created, the corresponding named graph is annotated with the identifier of the responsible user (an anonymous user if no authentication method is set), the timestamp, and the publication stage unmodified. When a record is modified, additional provenance information is added, including the identifier of the (new) responsible user, the new timestamp, and the new stage (modified). Lastly, when the record is published, the stage changes to published. A published record can be browsed from the Explore page, searched via text search, and retrieved as Linked Open Data from the SPARQL endpoint, via the REST API at <APP-URL>/sparql. While a published record can be modified, and therefore moved back to the stage modified, it cannot be unpublished. While this may be inconvenient in some scenarios, it prevents applications relying on records (and related persistent URIs) from getting unexpected, inconsistent responses.
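A minimal sketch of such an annotation with rdflib is shown below. The publication-stage property is hypothetical; the other terms come from the PROV Ontology, and the actual vocabulary used by CLEF may differ.

```python
# Sketch: annotate a record's named graph with PROV provenance and a
# publication stage. The "publicationStage" property is hypothetical;
# PROV terms are from the W3C PROV Ontology.
from datetime import datetime, timezone
from rdflib import Graph, URIRef, Literal, Namespace
from rdflib.namespace import XSD

PROV = Namespace("http://www.w3.org/ns/prov#")
BASE = Namespace("https://example.org/")       # assumed project base URI

def annotate(record_graph_uri: str, user_id: str, stage: str) -> Graph:
    g = Graph()
    record = URIRef(record_graph_uri)
    g.add((record, PROV.wasAttributedTo, URIRef(BASE["user/" + user_id])))
    g.add((record, PROV.generatedAtTime,
           Literal(datetime.now(timezone.utc).isoformat(), datatype=XSD.dateTime)))
    g.add((record, BASE.publicationStage, Literal(stage)))  # "unmodified" | "modified" | "published"
    return g

print(annotate("https://example.org/record-1/", "anonymous", "unmodified").serialize(format="turtle"))
```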
We chose GitHub to manage user authentication, fine-grained provenance tracking, and version control. In general, CLEF allows both authenticated and anonymous users to create new records; however, records can be modified and published only by accredited users. CLEF is optimized to authenticate users that have a GitHub account. To enable GitHub authentication in the initial setup of CLEF, the admin must perform a one-off operation: they specify (1) their GitHub credentials and (2) a GitHub repository they own, and (3) they must have created an OAuth App connected to that repository, so as to enable read-write operations on the repository and to confirm the identity of collaborators.
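The core of such a flow is sketched below; it relies on the standard GitHub OAuth and REST endpoints, while the client credentials and the bound repository are placeholders and CLEF's own session handling is not reproduced.

```python
# Sketch: verify a contributor via GitHub OAuth and check that they
# collaborate on the project repository. Client credentials and the
# repository name are placeholders.
import requests

def github_user_from_code(code: str, client_id: str, client_secret: str) -> str:
    """Exchange the OAuth callback code for a token and return the GitHub login."""
    token_resp = requests.post(
        "https://github.com/login/oauth/access_token",
        data={"client_id": client_id, "client_secret": client_secret, "code": code},
        headers={"Accept": "application/json"},
        timeout=10,
    )
    token = token_resp.json()["access_token"]
    user_resp = requests.get(
        "https://api.github.com/user",
        headers={"Authorization": f"token {token}"},
        timeout=10,
    )
    return user_resp.json()["login"]

def is_collaborator(owner: str, repo: str, username: str, admin_token: str) -> bool:
    """True if the user is a collaborator on the bound repository (HTTP 204)."""
    resp = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/collaborators/{username}",
        headers={"Authorization": f"token {admin_token}"},
        timeout=10,
    )
    return resp.status_code == 204
```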
Every time a change is made to a record, content data and provenance information are updated in the triplestore via its REST API, on the file system, and, if enabled, also on GitHub. To avoid spamming, only records that have been reviewed are stored on GitHub, thereby initializing the versioning. Every change to a record is identified by a commit on the repository, and it is possible to track which information (i.e., which field of the resource template) has been modified. While such information is currently not stored as Linked Open Data, auxiliary tools such as git2PROV [16] can be used to generate PROV-compliant RDF data. In so doing, we avoid developing from scratch features that are already available on GitHub, and we intertwine the two platforms for a better data management workflow.
In case a user decides not to enable the GitHub synchronization, data are stored in the local triplestore, changes to data are recorded with minimal provenance information (date of changes and publication stage), and only anonymous contributions can be made to the platform. The latter scenario is particularly handy if the application runs only locally (e.g., because contributors cannot run the application on a remote server). Indeed, users may decide to create data via their own private instance of CLEF (which runs as a web application on localhost), store data in their local triplestore, and manage publication as they prefer. Moreover, if the application runs locally and user authentication is not enabled, but local users have a GitHub account and collaborate on a GitHub repository with other users, they may decide to keep working locally with CLEF and back up their data on the shared repository. While publishing a remote instance of CLEF without any user authentication method is discouraged, CLEF implements anti-spamming mechanisms to limit contributions from IP addresses and to disable write operations on the triplestore.
3.4 Support Data Collection: Reconciliation and Enhancement
When creating or modifying a record, contributors are supported in certain tasks relevant to the reusability of their data, namely: (1) data reconciliation, (2) duplicate avoidance, (3) keyword extraction, and (4) data integration.
In detail, when field values address real-world entities or concepts that can appear in other records, autocomplete suggestions are provided by querying selected external sources (live) and the SPARQL endpoint of the project at hand. Suggestions appear as lists of terms, each including a label, a short description (to disambiguate homonyms), and a link to the online record (e.g., the web page of the Wikidata entity or a record already described in the project). If no matches are found, users can create new entities that are added to the knowledge base of their project; these will appear in the list of suggestions for new records. Currently, CLEF is optimized to work with Wikidata, but implementations of entity linking for the Open Library, the Getty AAT, and the Getty ULAN are available too.
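A sketch of such a live lookup against the public Wikidata API is shown below. The wbsearchentities action and its response fields are part of the Wikidata API; the way CLEF merges these suggestions with catalogue results is not reproduced here.

```python
# Sketch: autocomplete suggestions from the public Wikidata API
# (action=wbsearchentities). CLEF's merging with catalogue records
# is not reproduced here.
import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def wikidata_suggest(text: str, limit: int = 10) -> list[dict]:
    params = {
        "action": "wbsearchentities",
        "search": text,
        "language": "en",
        "format": "json",
        "limit": limit,
    }
    resp = requests.get(WIKIDATA_API, params=params, timeout=10)
    resp.raise_for_status()
    return [
        {
            "id": hit["id"],                            # QID, e.g. "Q762"
            "label": hit.get("label", ""),
            "description": hit.get("description", ""),  # disambiguates homonyms
            "url": hit.get("concepturi", ""),           # link to the online record
        }
        for hit in resp.json().get("search", [])
    ]

for suggestion in wikidata_suggest("Leonardo da Vinci"):
    print(suggestion["id"], suggestion["label"], "-", suggestion["description"])
```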
When designing a resource template, users can flag a specific field to be used for disambiguation purposes (e.g., the field title for a book, the field name for a person). When creating a new record, the specified field is bound to a lookup service that alerts the user of potential duplicates already existing in the catalogue. The user may accept or ignore the recommendation.
Some fields may require contributors to enter long free-text descriptions (e.g., historians' biographies, scope and content of collections), which include a wealth of information that would otherwise not be processed as machine-readable data. To prevent such a loss, the spaCy APIs are used to extract entity names (e.g., people, places, subjects) from the text. Extracted entities are reconciled to Wikidata entities and keywords (bound to Wikidata QIDs) and are shown to users to approve or discard. Approved terms are included in the data as machine-readable keywords associated with the subject entity.
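The extraction step is sketched below with spaCy and a Wikidata reconciliation call analogous to the lookup above; the model name is only an example, and the selection of entity types is an assumption rather than CLEF's actual configuration.

```python
# Sketch: extract named entities from a long free-text field with spaCy
# and reconcile them to Wikidata QIDs. The model name is an example and
# the selection of entity types is an assumption.
import requests
import spacy

nlp = spacy.load("en_core_web_sm")   # any installed spaCy model works

def extract_keywords(text: str) -> list[tuple[str, str]]:
    """Return (surface form, Wikidata QID) pairs for recognized entities."""
    keywords = []
    for ent in nlp(text).ents:
        if ent.label_ not in {"PERSON", "GPE", "ORG", "WORK_OF_ART"}:
            continue
        resp = requests.get(
            "https://www.wikidata.org/w/api.php",
            params={"action": "wbsearchentities", "search": ent.text,
                    "language": "en", "format": "json", "limit": 1},
            timeout=10,
        ).json()
        hits = resp.get("search", [])
        if hits:
            keywords.append((ent.text, hits[0]["id"]))
    return keywords

bio = "Federico Zeri was an art historian who worked in Rome and collaborated with the Getty."
print(extract_keywords(bio))   # candidate keywords shown to the user to approve or discard
```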
When Wikidata terms are reused, the system can be configured to query the Wikidata SPARQL endpoint to retrieve and store contextual information in the knowledge base. For instance, in ARTchives, Wikidata entities representing artists, artworks, and artistic periods (recorded as subjects addressed by the contents of archival collections) are automatically enriched with time spans retrieved from the Wikidata SPARQL endpoint and saved in the local triplestore; likewise, Wikidata entities representing historians are enriched with birth and death places. Finally, entities can be geolocalized via the OpenStreetMap APIs.
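A sketch of this enrichment step is shown below. It queries the public Wikidata Query Service using the real properties P19 (place of birth) and P20 (place of death); writing the result back to the local triplestore, and the OpenStreetMap geolocation, are omitted.

```python
# Sketch: enrich a reconciled entity (a Wikidata QID) with birth and
# death places from the public Wikidata Query Service. Writing the
# result back to the local triplestore is omitted.
import requests

WDQS = "https://query.wikidata.org/sparql"

def birth_death_places(qid: str) -> list[dict]:
    query = f"""
    SELECT ?birthPlaceLabel ?deathPlaceLabel WHERE {{
      OPTIONAL {{ wd:{qid} wdt:P19 ?birthPlace . }}   # P19: place of birth
      OPTIONAL {{ wd:{qid} wdt:P20 ?deathPlace . }}   # P20: place of death
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }}
    """
    resp = requests.get(
        WDQS,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "CLEF-enrichment-sketch/0.1"},   # WDQS expects a User-Agent
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["results"]["bindings"]

print(birth_death_places("Q762"))   # Leonardo da Vinci, used for illustration
```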
3.5 Data Sustainability: Ontologies, Data, and Long-term Preservation Strategies
Long-term accessibility of scholarly projects is often hampered by time and resource constraints. A well-known problem is the maintenance of ontologies adopted by small and medium-sized crowdsourcing or scholarly projects [4]. While CLEF does not prevent the creation of new ontology terms, which are stored along with the data, it supports the reuse of external ontologies. Terms from external ontologies can be directly referenced in resource templates to map form fields to predicates and templates themselves to classes. Where reused ontologies are popular or W3C-endorsed, CLEF allows enriching the referenced terms with definitions provided by Linked Open Vocabularies (LOV). Note that reused ontologies are not imported. This design choice has the evident drawback of preventing inference mechanisms, which are not applicable without manually importing the ontologies into the knowledge base created by CLEF. Nonetheless, due to this design choice, we believe CLEF has the merit of complying with another debated requirement in the Semantic Web community [4], namely, the ability to rely on up-to-date information on reused ontologies, as provided by LOV.
Like the projects themselves, the wealth of data produced by scholarly initiatives often becomes unavailable in the mid/long term. To prevent this, CLEF adopts several strategies. First, CLEF is optimized to reuse Wikidata as much as possible, both at the schema level (users can choose classes and properties from the Wikidata data model) and at the instance level (autocomplete suggestions reuse individuals from Wikidata). The idea is to support stakeholders in producing curated metadata that can be exported and imported into Wikidata according to its guidelines for contributors. While Wikidata allows users to also import non-Linked Data into the knowledge base and to manually perform entity matching, CLEF data include entities already matched with Wikidata QIDs, avoiding the need for manual matching. Data can be retrieved via the SPARQL endpoint or via the GitHub repository.
Second, by synchronizing CLEF knowledge graphs with GitHub, it is also possible to synchronize the repository with Zenodo, a certified repository for long-term preservation that is widely recognized in the scientific community. Zenodo has recently offered the possibility to link GitHub repositories to its platform, binding GitHub releases to new versions on Zenodo, each uniquely identified with a DOI.
Lastly, the case studies highlighted the need to access and extract information from online web pages (e.g., the Dictionary of Art Historians, online music resources) and to reference the source in records. Such web pages are cited as sources of information or are described as first-class entities in records, and are likely to be explored by final users of the project catalogues. Ensuring the persistence of such pages in the long term is an important aspect, which contributes to fostering trust in scholarly and cultural heritage projects. While preserving the original web sources along with the data created in CLEF would be impractical for small and medium-sized projects, which cannot afford to archive all the web sources they mention, CLEF allows users to specify which form fields include URLs that should be sent to the Wayback Machine, which in turn takes a snapshot of the webpage and preserves it.
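A sketch of such a preservation request via the Internet Archive's public "Save Page Now" endpoint is shown below; how CLEF schedules these requests and records the resulting snapshot is not reproduced, and the example URL is a placeholder.

```python
# Sketch: ask the Internet Archive's "Save Page Now" service to take a
# snapshot of a user-specified URL. How CLEF schedules these requests
# and stores the snapshot reference is not reproduced here.
import requests

def archive_url(url: str) -> str | None:
    """Trigger a Wayback Machine snapshot and return the archived URL, if any."""
    resp = requests.get(f"https://web.archive.org/save/{url}", timeout=60)
    if resp.ok:
        # The final response URL usually points at the archived snapshot.
        return resp.url
    return None

print(archive_url("https://example.org/cited-source"))   # placeholder for a URL cited in a record
```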