ArCo consists of (i) a network of seven ontologies (in RDF/OWL) modeling the cultural heritage domain (with a focus on cultural assets) at a fine-grained level of detail, and (ii) a Linked Open Data dataset counting \(\sim\)200M triples, which describe \(\sim\)0.8M cultural assets and their catalog records derived from the General Catalog of Italian Cultural Heritage (ICCD), i.e., the institutional database of the Italian cultural heritage, published by the Italian Ministry of Culture (MiC). The ArCo ontology network is openly released under a CC-BY-SA 4.0 license both on GitHub1 and on the official MiC website,2 where data can be browsed and accessed through the SPARQL query language.3

Extracting information from ArCo to generate a dataset for VQA is not free of obstacles. First, ArCo does not give us a measure of which kinds of questions might be interesting for average users in a real scenario. Second, ArCo data need to be suitably transformed and cleaned to produce answers in a usable form, and questions need to be associated with corresponding answers. Third, the dataset we aim to generate is huge, and therefore manual validation of the produced data cannot be performed.
3.1 A Semi-automatic Approach for Generating the VQA Dataset
To create our VQA dataset, we resorted to a semi-automatic approach that involves the collaboration of expert and non-expert users and the use of text processing and natural language processing techniques to obtain an accurate list of question-answer pairs. We considered a scenario where an image is associated with available knowledge either manually (e.g., artworks in a museum can be associated with their descriptions) or by object recognition (e.g., architectural properties identified by taking pictures), and generated a dataset as a list of question-answer pairs, each one associated with an image, a description, and a set of available information items. An instance of a question-answer pair is: “Who is the author?”—“The author of the cultural asset is Pierre François Basan.”
Our semi-automatic approach consisted of two main steps. The first part of the process focused on generating a list of question types with associated verbal forms by considering both expert and non-expert perspectives, the latter assessed through surveys. Then, for each question type, we automatically generated a list of question-answer pairs by combining question forms and associated answer templates with information from relevant cultural assets in ArCo, and carefully cleaning the results. This process was performed by an ad hoc tool, developed following a build-and-evaluate iterative process. At each step, we evaluated a sample of the produced dataset to propose new data cleaning rules for improving the results. The process ended when the desired accuracy was achieved. Eventually, question-answer pairs from different question types were combined. Next, we first detail our question types generation process, then fully describe the generation of question-answer pairs from the question types.
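The build-and-evaluate loop described above can be sketched as follows (function names, the stopping threshold, and the round limit are illustrative, not taken from the paper):

```python
# Hedged sketch of the build-and-evaluate iteration: regenerate the dataset
# with the current cleaning rules, evaluate a sample, and add new rules
# until the desired accuracy is reached.
def build_and_evaluate(generate, evaluate_sample, propose_rules,
                       target_accuracy=0.95, max_rounds=10):
    rules = []
    dataset = []
    for _ in range(max_rounds):
        dataset = generate(rules)              # build the dataset with current rules
        if evaluate_sample(dataset) >= target_accuracy:
            break                              # desired accuracy achieved
        rules.extend(propose_rules(dataset))   # propose new cleaning rules
    return dataset, rules
```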
The question types generation process was based on the following two perspectives, carried out independently: a domain experts’ perspective, represented by a selection of natural language competency questions (CQs) [41] previously considered to model the ArCo ontology network [13], and a user-centered perspective, represented by a set of questions from mostly non-expert (65 of 104) users, collected through five questionnaires on a set of different images of cultural assets belonging to ArCo (five cultural assets per questionnaire). In the questionnaires, the users were asked to formulate a number of questions (minimum 5, maximum 10) that they considered related to each image presented (questions they would ask if they were enjoying the cultural asset in a museum or a cultural site). In this way, we collected 2,920 questions from a very heterogeneous group of users in terms of age (from 24 to 70 years old, with an average age of 42), cultural background, and interests. Subsequently, the questions were semi-automatically analyzed and annotated to recognize their semantics, associate them (when possible) with ArCo’s metadata, and create corresponding SPARQL queries for data extraction.
In the clustering process, we grouped user-produced questions into semantic clusters, named question types, with the purpose of grouping together questions that ask for the same information. Clustering was first performed automatically by text analysis and sentence similarity, then validated and corrected manually. The automatic procedure consisted of the following steps. We initially aggregated sentences that turned out to be identical after tokenization, lemmatization, and stop word removal. Then, for each question, we identified the most semantically similar one in the whole set by Sentence-BERT [43] and aggregated sentences whose similarity was above 84% (we found empirically that this value resulted in a low error rate). Eventually, we performed average-linkage agglomerative clustering with a similarity threshold of 60%. To prepare for manual validation, we extracted a list of question forms, each one associated with a numerical ID representing the cluster it belongs to. Questions in the same cluster (e.g., “Who is the author?” and “Who made it?”) were placed close to each other. After removing identical sentences, we obtained about 1,659 questions, grouped in 126 clusters. Each question was then manually associated with a textual (human-meaningful) ID (e.g., “AUTHOR”) agreed upon by the annotators, and a special “NODATA” ID (about 10%) was introduced for questions that refer to information that is not contained in ArCo. Table 1 gives an overview of the question types generation process, where the effort of users and experts is combined. Each question type is labeled as “Expert” if it comes from the competency questions of the ArCo ontology network and was formulated by the team of experts (counted once in column Mention), “Users” if the question was formulated by non-expert users through the questionnaires, or “Both” if both users and experts proposed such a question (possibly with different verbal forms). At the end of the process, after excluding clusters that refer to unavailable and unusable information, we obtained 43 question types, with 20 of them referred to by both users and experts.
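A minimal sketch of the automatic part of the clustering step is given below; plain Jaccard token overlap stands in for the Sentence-BERT similarity actually used, lemmatization is omitted, and the stop word list, threshold, and greedy strategy are simplifications:

```python
# Simplified sketch of the automatic clustering: normalize each question,
# then greedily group questions whose similarity to a cluster representative
# exceeds a threshold. Jaccard overlap replaces the Sentence-BERT similarity
# used in the actual pipeline.
STOP_WORDS = {"who", "is", "it", "the", "a", "an", "of", "was", "by"}

def normalize(question: str) -> frozenset:
    """Tokenize, lowercase, and remove stop words."""
    tokens = question.lower().strip("?!. ").split()
    return frozenset(t for t in tokens if t not in STOP_WORDS)

def jaccard(a: frozenset, b: frozenset) -> float:
    return len(a & b) / len(a | b) if a | b else 1.0

def cluster(questions, threshold=0.5):
    """Greedy single-pass clustering: a question joins the first cluster
    whose representative token set is similar enough, else starts a new one."""
    clusters = []  # list of (representative_tokens, member_questions)
    for q in questions:
        toks = normalize(q)
        for rep, members in clusters:
            if jaccard(rep, toks) >= threshold:
                members.append(q)
                break
        else:
            clusters.append((toks, [q]))
    return [members for _, members in clusters]
```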
In addition, the experts grouped the question types into three categories based on their nature. Most question types (31) were labeled as “contextual,” since the appropriate answers cannot be found in the images associated with the question type considered (e.g., “DATING”). Instead, eight question types were defined as “visual” (e.g., “BLACKANDWHITE”), since the answers can be inferred from the images associated with the cultural asset, while for four “mixed” question types the answers derive from both visual and contextual information (e.g., “SUBJECT”). Figure 2 depicts all 43 question types split into these three categories, together with some example images of cultural assets (i.e., PAINTING, SCULPTURE, PRINT, FRESCO) to which they are associated. Eventually, the experts defined an answer template and a SPARQL query for each question type.
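As an illustration, a question type could be stored together with its SPARQL query and answer template roughly as below; the prefix and property path are hypothetical placeholders, not the actual ArCo vocabulary:

```python
# Hypothetical pairing of one question type with its SPARQL query and answer
# template; "ex:" terms are illustrative stand-ins for the ArCo ontology.
QUESTION_TYPES = {
    "AUTHOR": {
        "sparql": """
            SELECT ?asset ?authorName WHERE {
              ?asset a ex:CulturalProperty ;
                     ex:hasAuthor/rdfs:label ?authorName .
            }
        """,
        "answer_template": "The author of the cultural asset is {authorName}.",
    },
}
```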
We employed SparqlWrapper4 for executing the SPARQL queries and extracting textual data and pictures from ArCo. We removed cultural assets that have zero or more than one associated picture. For each record of the query results, we generated a question-answer pair by randomly drawing a question verbal form from the set of appropriate verbal forms in the associated question cluster, with the same distribution as the results of the user questionnaires (frequently proposed questions were selected with higher probability), and building the associated answer from the answer template.
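The frequency-weighted drawing of a verbal form can be sketched as follows (the form counts and the dictionary layout are illustrative):

```python
import random

# Hedged sketch: draw a verbal form with probability proportional to how
# often it was proposed in the questionnaires, then fill the answer template
# with the datum extracted by the SPARQL query.
AUTHOR_FORMS = {              # verbal form -> times proposed by users (made up)
    "Who is the author?": 7,
    "Who made it?": 2,
}
ANSWER_TEMPLATE = "The author of the cultural asset is {value}."

def make_qa_pair(record: dict, rng: random.Random) -> tuple[str, str]:
    forms, weights = zip(*AUTHOR_FORMS.items())
    question = rng.choices(forms, weights=weights, k=1)[0]
    answer = ANSWER_TEMPLATE.format(value=record["author"])
    return question, answer
```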
Some question verbal forms are appropriate only for specific types of cultural assets (e.g., “Who was it painted by?” makes sense only for paintings). To establish the appropriate verbal forms for a cultural asset, we mapped both question verbal forms and cultural assets to corresponding macro-categories (we defined nine macro-categories, i.e., SCULPTURE, OBJECT, PHOTO, FRESCO, CHURCH, FIND, PRINT, PAINTING, OTHER). Since this information is not available in ArCo, we considered the available textual description of the cultural asset category to build the mapping. Due to the multitude of categories, we performed a filtering and mapping operation to bring the wide range of types back into a small but explanatory set. As a state-of-the-art reference on Italian cultural heritage, we took into account the controlled vocabularies defined by the ICCD-MiC,5 which also provided the data for the ArCo KG [13]. These controlled vocabularies ensure a standardized terminology for the description and cataloging of cultural heritage and help overcome the semantic heterogeneity that is often present in creating such catalogs. First, we filtered the vocabularies’ elements closest to the types of artworks to which users refer in their questions. We then mapped each textual description of a category to an entry in the controlled vocabularies. As detailed in Reference [12], we used a string matching algorithm that takes as input a list of words from a well-defined taxonomy and a general description in free text, and returns the equivalent term from the reference taxonomy.
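A toy stand-in for this mapping step is shown below; the real pipeline uses the string matching algorithm of Reference [12] over the ICCD-MiC controlled vocabularies, and the vocabulary entries here are invented for illustration:

```python
# Illustrative mapping from a free-text category description to one of the
# nine macro-categories, via a (made-up) controlled vocabulary.
VOCABULARY = {                # controlled-vocabulary term -> macro-category
    "oil painting": "PAINTING",
    "marble statue": "SCULPTURE",
    "etching": "PRINT",
}

def map_to_macro_category(description: str) -> str:
    text = description.lower()
    # Prefer the longest vocabulary term contained in the description.
    matches = [term for term in VOCABULARY if term in text]
    if not matches:
        return "OTHER"
    return VOCABULARY[max(matches, key=len)]
```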
To improve both the form of the answer itself and its rendering in context, we adopted two approaches. First, we applied a set of cleaning rules, such as removing data with errors and changing patterns of verbal forms (e.g., from “Baldin, Luigi” to “Luigi Baldin”).6 Second, we employed pre-trained language models to improve the form of conversational answers by adapting each sentence to its associated datum (e.g., Italian prepositions and articles have to be chosen according to the gender and number of the corresponding nouns or adjectives). To solve this problem, we applied the cloze task of BERT [23] on the generated answers, asking it to infer words whose gender and number depend on the specific datum and cannot be determined in advance.7 Furthermore, we applied a final grammar correction step by automatically translating each sentence from Italian to English and back to Italian by means of a pre-trained language model for translation.8 Eventually, we automatically generated the description of each cultural asset by combining the long answers of all associated question-answer pairs, since this information is not available in ArCo.
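The name-reordering cleaning rule mentioned above can be expressed, for instance, as a small regular-expression rewrite (a sketch of one rule, not the paper's full rule set):

```python
import re

# One illustrative cleaning rule: rewrite catalog-style "Surname, Name"
# values into the conversational "Name Surname" order; values without a
# comma are returned unchanged.
def reorder_name(value: str) -> str:
    match = re.fullmatch(r"\s*([^,]+),\s*(.+?)\s*", value)
    return f"{match.group(2)} {match.group(1).strip()}" if match else value
```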
3.2 A Large and Detailed VQA Dataset for Cultural Heritage
The generated VQA dataset contains 6.49M question-answer pairs covering cultural assets, 43 question types, and 427 verbal forms. The number of question-answer pairs per template ranges from 35 to 576K. Each question-answer pair is associated with the corresponding cultural asset and its information, including its picture, a description, and its URI in ArCo. The number of question types associated with each image depends on the cultural asset’s type and ranges from a minimum of 1 to a maximum of 26 question types for a given cultural asset, as in the example of 26 IDs associated with the “PRINT” depicted in Figure 3.
The final dataset is the largest resource available for training and validating VQA models in the cultural heritage domain. It comprises 6,493,867 question-answer pairs, with associated visual, textual, and structured information. In Table 2, we report these data in comparison with the AQUA [27] dataset statistics. In contrast to AQUA, we consider a new dimension that incorporates mixed (contextual and visual) question types. Additionally, our dataset is two orders of magnitude larger than AQUA.
We associate each cultural asset in our dataset with a set of question-answer pairs, with both a long conversational answer and a short synthetic answer, an image, a natural language description, its URI in ArCo, the reference ontology class and its type. In addition, we provide information on the text span of the answer in the description, when possible.
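An entry of the dataset thus has roughly the following shape (field names, file names, and the URI are hypothetical; the values echo the running example from the text):

```python
# Illustrative shape of one dataset entry; all identifiers below are
# placeholders, not actual ArCo resources.
entry = {
    "question": "Who is the author?",
    "long_answer": "The author of the cultural asset is Pierre François Basan.",
    "short_answer": "Pierre François Basan",
    "question_type": "AUTHOR",
    "macro_category": "PRINT",
    "image": "print_00042.jpg",
    "description": "The author of the cultural asset is Pierre François Basan. ...",
    "arco_uri": "http://example.org/arco/resource/print_00042",
    "ontology_class": "arco:HistoricOrArtisticProperty",
    "answer_span": (36, 57),  # character span of the short answer in the description
}
```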
We make our dataset available on GitHub.9 We also provide two samples, in Italian and English, of 50 question-answer pairs per question type that we manually evaluated. The results show an overall accuracy of the long answers (percentage of correct entries) of 96.6% for the Italian sample, and of 93% for the English one. We also provide statistics that report, for each question type, its usage, the number of associated question forms, the number of question-answer pairs generated, and the accuracy. The distribution of cultural asset types in the dataset is provided in Figure 4. The most common question types are “TYPE,” “TITLE,” and “MATERIALORTECHNIQUE,” while “EVENTSITE,” “PURPOSE,” and “BLACKANDWHITE” have fewer associated cultural assets. Excluding cultural assets not classified in a specific category (“OTHER”), the macro-categories with the most elements are “OBJECT” (26%) and “PAINTING” (13%), while the least populated one is “FRESCO” (<1%).
Furthermore, Table 3 shows the breakdown of the number of question-answer pairs by cultural asset type and question type.