1 Introduction
System requirements are primarily written in
natural language (NL) [
24,
27]. To analyze and manage textual requirements, the
requirements engineering (RE) community has long been interested in applying
natural language processing (NLP) technologies. The emerging research strand,
natural language processing for requirements engineering (NLP4RE), has received considerable attention from both industry practitioners and academic researchers, leading to initiatives such as the NLP4RE workshop series [
15]. To process textual information, NLP4RE tools utilize NLP techniques, which, in turn, rely on a variety of algorithms dominated by
machine learning (ML),
deep learning (DL), and
large language models (LLMs). A recent systematic mapping study [
57] reports that the majority of research papers in NLP4RE (
\(\approx\)84%) involve proposing novel solutions or validating existing technologies. However, only a small fraction of developed tools (
\(\approx\)10%) is made publicly available. Unavailable artifacts can impede the positioning of novel solutions through sound comparisons against existing ones and further hamper the industrial adoption of NLP4RE tools.
Replication is an important aspect of empirical evaluation that involves repeating an experiment under similar conditions using a different subject population [
50,
54]. Replicability is currently regarded as a major quality attribute in
software engineering (SE) research, and it is one of the main pillars of Open Science [
42]. The ACM badge system was introduced at the end of the 2010s
1 to reward, among others, available and replicable research. Several major conferences—e.g., ICSE, ESEC/FSE, and RE—feature an Artifact Evaluation Track, which grants variants of the ACM badges. Two mapping studies, from Da Silva et al. [
12] and from Cruz et al. [
11], cover replications in SE from 1994 to 2010, and from 2013 to 2018, respectively. In both studies, the RE field appears to be among the ones in which replication is most common. However, with a few exceptions (e.g., [
13]), replication does not seem commonly practiced in NLP4RE. This lack of replication can be attributed to various challenges, with one of the most prominent being the incomplete reporting of studies, as noted by Shepperd et al. [
49].
In this article, we propose a new artifact, referred to as ID-card, that fosters the replication of NLP4RE studies. The ID-card is a template composed of 47 questions concerning replication-relevant information, divided into seven topics. These topics characterize: the RE task addressed in the study; the NLP task(s) used to support the RE task and their details; information about raw data, labeled datasets, and the annotation process; implementation details; and information related to the evaluation of the proposed solution. We advocate attaching the ID-card as part of the submission for future NLP4RE papers as well as creating it in retrospect for existing papers. The ID-card can be created and/or used by authors, newcomers to the field, reviewers, and students.
Defining the
ID-card is triggered by our hands-on experience concerning two replication scenarios. The first scenario involves replicating a state-of-the-art baseline [
55] as part of building a benchmark for handling anaphoric ambiguity in textual requirements. The second scenario involves replicating a widely used solution for classifying requirements into functional or non-functional [
38] against which we compare and position a novel solution proposed by a subset of the authors of this article, previously published in [
13]. Replication can be
exact when one follows the original procedure as closely as possible or
differentiated when one adjusts the experimental procedures to fit the replication context [
34]. In our replication experience, exact replication was not possible mainly due to the incomplete reporting of implementation details in the original papers. Deciding on such unknown details during replication can alter the original study to some extent. Minimizing decisions due to unknown details is an essential motivation for the
ID-card. Building on our hands-on experience, we conducted several
focus groups through which we identified the challenges that can be encountered when extracting replication-relevant information from NLP4RE studies. To address these challenges, we defined an initial version of the
ID-card. We then reviewed a representative sample from the NLP4RE literature and iteratively refined the
ID-card according to our findings and observations.
Terminology. Figure
1 provides an overview of the terminology used in this article.
Replication refers to the attempt, conducted by
third-party researchers (different from the
original researchers), to obtain the same results as a specific
original study.
Reuse can be a prerequisite for replication: for replicating a study, third-party researchers may reuse the
NLP4RE artifacts (
NLP4RE solutions or
datasets) that were proposed and released by the original researchers. Note that the figure focuses on reuse in the context of replication, although researchers may also reuse artifacts for other purposes. We refer to the information that can be extracted from the original paper for the purpose of replication as
replication-relevant information. We further use
solution and
tool interchangeably to indicate the automated procedure described in an NLP4RE paper to solve a particular problem. We use the term
reconstruction to denote the (re-)implementation of a solution as explained in the original paper.
Contributions. This article makes the following contributions:
(1)
We identify a total of 16 challenges that can arise in practice during the replication of NLP4RE studies. We do so by relying on our hands-on experience in which we have reconstructed two NLP4RE solutions in addition to performing an in-depth review of existing papers spanning diverse topics in the NLP4RE literature. As replication in NLP4RE often requires creating a dataset, we differentiate between the challenges related to dataset
(re-)annotation—aimed at determining the gold standard against which the solution is evaluated—and challenges concerning tool
reconstruction. The final list of challenges is presented in Section
5.
(2)
We devise an
ID-card that summarizes in a structured way the replication-relevant information in NLP4RE papers. We demonstrate the applicability of the
ID-card by manually creating the equivalent
ID-cards for a total of 46 research papers from the NLP4RE literature. Specifically, we manually extracted the replication-relevant information from these papers and provided the answers to the questions posed in the
ID-card. For 15 out of the 46 papers, we also let the original authors independently fill in the
ID-card. We further asked these authors to assess several aspects of the proposed
ID-card, such as its ease of use. As we discuss in Section
6, the results indicate that the
ID-card can be used as a complementary source to the original paper for facilitating replication. We make all the material publicly available online [
1].
Structure. Section
2 discusses related work about replicability and associated challenges. Section
3 discusses the research method and research questions. Section
4 provides the context for the two replication scenarios considered in our work. Section
5 describes the challenges of annotation and tool reconstruction. Section
6 presents the
ID-Card. Section
7 discusses issues related to the use of the
ID-Card and the limitations of the study. Finally, Section
8 concludes the article and sketches future directions.
2 Related Work
This article focuses on replication in NLP4RE, a sub-field of RE that belongs to the broad area of SE. Only a few replication studies are published in this field (e.g., [
20,
48]), showing the need for methods and tools that can facilitate conducting such studies. Other studies in related domains have also identified some of the issues discussed in our article [
2,
46,
58], although those works build on secondary research (they are
surveys), while our approach heavily relies on our first-hand experience. In the following, we first discuss initiatives in Open Science and replicability in SE to show the relevance of this topic in the current discourse of the SE community and the complexity of ensuring replicability. Then, we refer to relevant works discussing guidelines and issues of study replication to illustrate the addressed research gap.
Open Science and Replicability Initiatives. Open Science in SE is a movement aiming to make research artifacts—including source code, datasets, scripts to analyze the data, and manuscripts—openly available to the research community [
42]. This enables
replicability, i.e., the possibility for other research teams to repeat a study by reusing the artifacts provided by the authors of the original work or by reconstructing them. Moreover, Open Science facilitates verifiability and transparency, which are nowadays critical research drivers for SE in general and RE in particular. For example, ICSE 2024 review criteria define verifiability and transparency as “The extent to which the paper includes sufficient information to understand how an innovation works; to understand how data was obtained, analyzed, and interpreted; and how the paper supports independent verification or
replication of the paper’s claimed contributions”.
2 Major SE outlets have introduced Open Science Policies (e.g., ESEC/FSE 2024
3), requiring authors to make their artifacts available to the reviewers so that research results can be scrutinized during the review phase and made reproducible once published. Moreover, conferences include Artifact Evaluation Tracks aimed at explicitly reviewing artifacts and assigning badges (e.g., the ones defined by ACM
4) to the associated papers. These tracks typically include a separate
Program Committee (PC) highlighting not only the relevance given to Open Science but also the
complexity of replicability assessment, which thus requires a dedicated PC. The guidelines for artifact evaluation have become quite extensive, as shown, for example, by the ICSE 2021 guidelines,
5 which count 13 pages, as the evaluation process needs to ensure the fulfillment of multiple fine-grained aspects. To account for the complexity of replicability and the early degree of maturity of the community, the PC members are typically required to interact with the authors and provide them with feedback to fix the artifacts before these can be evaluated—this is typically the case for software or scripts that cannot be run properly to produce the expected results. In addition to Artifact Evaluation Tracks, the ROSE festival (Rewarding Open Science Replication and Reproduction in SE
6) has been held in a number of top SE venues. The festival includes enlightening talks about replication studies to raise awareness in the community. In this scenario, not only conferences but also journals are catching up with Open Science. The
Empirical Software Engineering Journal (EMSE) introduced the OSI Open-Data badge [
43] and
ACM Transactions on Software Engineering and Methodology (TOSEM) encourages
Replicated Computational Results (RCR) Reports that should complement papers with information to support replications.
7 This landscape of initiatives demonstrates the relevance given to replicability in SE, and the complexity of ensuring verifiability and transparency.
Replicability Issues, Guidelines, and Challenges. From an online questionnaire of the
Nature journal involving 1,576 researchers from different disciplines, it emerged that 70% of the respondents tried and failed to reproduce another scientist’s experiment [
4], indicating that replication is an issue for the wider research community. The main causes, indicated by 60% of the respondents, are pressure to publish and selective (i.e., incomplete) reporting. Furthermore, the absence of methods or code, as well as of raw data, is regarded as relevant by more than 40% of the participants. Among the solutions, the authors propose more academic rewards for replicability-enabling content, as well as journal checklists to ensure that replication-relevant material is included. Concerning the specific field of SE, multiple mapping studies and survey papers have been published on Open Science and replication, identifying the current status, guidelines, and open challenges. Da Silva et al. [
12] cover replications in SE from 1994 to 2010. The study highlights that replications have substantially increased in the period considered and discusses desired characteristics of the replication studies. The study also includes a useful quality checklist to evaluate replication studies based on Carver’s guidelines [
7]. De Magalhães et al. [
19] cover replications from 1994 to 2013. The authors list recommendations oriented to those who perform replications, but also conditions that must be fulfilled by the original papers to be replicated. The latter include the need to provide all necessary replication-relevant details, precise and unambiguous definitions, and the sharing of raw data and tool versions. Gonzalez-Barahona et al. [
30,
31] evaluate the replicability
8 of
Mining Software Repositories (MSR) studies, and propose a set of best practices to make a study replicable and applicable beyond the MSR field. These are mainly focused on features of the replication package, which should include raw data, processed data, results, links to reused data (if any), and source code. In the context of NLP4RE, a secondary study by Ahmed et al. [
2] shows that published results in the area of requirements formalization from natural language are often non-reproducible due to lack of access to tools, data, and critical information not reported in the primary studies. Similarly, another systematic review in the same area [
46] shows that the lack of openly accessible benchmarks hinders requirements formalization research.
In a book chapter, Mendez et al. [
42] characterize Open Science from multiple perspectives, including sharing manuscript preprints and artifacts, and provide detailed guidelines. The chapter is among the few works that discuss challenges related to Open Science. These include the overhead of sharing data, privacy, confidentiality, licensing, appropriate preparation of qualitative data, and anonymity issues—particularly in sharing preprints to be submitted to venues that adopt the double-blind review model. These are mainly challenges from the perspective of the authors of the original study to be replicated rather than from the third-party researchers who want to perform a replication. Finally, Anchundia and Fonseca [
3] provide guidelines that can facilitate replications in SE and informally list challenges that prevent or hamper replications: “reports and packages are neither sufficient to capture all information (e.g., raw material) nor to share tacit knowledge”, “large amount of effort to obtain all necessary data”, “replications do not satisfy professional needs”, “the cost of conducting replications”.
Research Gap. The landscape of Open Science initiatives in SE shows the increasing interest of the SE community in Open Science in general and replicability in particular, underscoring that replication is a complex task that requires further investigation and appropriate tool support. Compared with previous studies, we notice that most of them list criteria to be fulfilled to enable replication [
12,
19,
30,
31], but only two of them [
3,
42] identify
general replication-related challenges. None reports a
context-specific list of issues, specifically considering the viewpoint of third-party researchers who want to replicate a study, as in our case. While some general challenges also apply to our case, the provided list targets the NLP4RE context, which makes the identified issues more concrete. Furthermore, compared with previous work, our main outcome is a practical means to summarize relevant information to enable replications—i.e., the
ID-card, which aims to address the identified challenges.
3 Research Questions and Method
The research goal of this article is to support the extraction and documentation of replication-relevant information from NLP4RE papers. To this end, we propose two main dimensions that characterize replication in the context of NLP4RE: (i) the
datasets with their annotations, as research in NLP inherently relies on datasets for evaluation as well as for developing diverse solutions, e.g., training an ML classifier; and (ii) the reconstruction of the proposed
tools, as most NLP4RE studies describe an automated NLP-based solution that tackles an RE problem [
57], which typically needs to be reconstructed.
To achieve this goal, we devise a set of
research questions (RQs). The first two questions are instrumental in responding to the third one:
RQ1:
What are the challenges of annotating datasets for training and evaluating NLP4RE tools?
As stressed in previous studies [
15,
25,
57], the number of
annotated (i.e., labeled) datasets in NLP4RE is small. Concrete annotation guidelines are essential to ensure the soundness of both reuse and replication. Reusing an existing (available) tool on different datasets (e.g., based on industrial data) requires annotating these datasets following the same annotation procedure as the original dataset. Replicating the entire study also requires applying the same annotation procedure. Missing guidelines can cause unwanted differences in results between the original study and the replicated one. In RQ1, we investigate the challenges of creating labeled datasets while making them available to and reusable by the research community. We also consider the case in which an existing dataset must be re-annotated for various reasons to support replication.
RQ2:
What are the challenges in reconstructing NLP4RE tools?
In RQ2, we study the challenges that third-party researchers encounter when reusing and/or replicating existing tools as part of their NLP4RE research. Since the majority of NLP4RE tools are unavailable [
57], we highlight the process of
reconstructing (i.e., developing) a tool using only the descriptions of the algorithms provided in the research papers in which the tool is presented. More concretely, the notion of reconstruction in our context refers to the cases in which the solution is exclusively described in the research paper and is not complemented with any source code. While such cases can be regarded as extreme, they currently represent the common situation of tool reuse in NLP4RE. Dealing with these extreme cases also encapsulates the challenges pertinent to less demanding replication scenarios.
RQ3:
How can NLP4RE researchers be supported in overcoming the challenges identified in RQ1 and RQ2?
In response to RQ3, we propose developing an artifact that complements the NLP4RE research papers to primarily facilitate replication. The artifact, which we refer to as
ID-card, is presented in Section
6.
To answer RQ1 to RQ3, we follow the research method sketched in Figure
2. The figure also serves as a reading guide for the various sections of this article. To facilitate reading, the method is presented here linearly. However, the process was conducted iteratively.
In order to respond to RQ1 and RQ2, we reconstructed two established NLP4RE tools for which the source code was not available (see Step ① in Figure
2) [
38,
55]. The reconstruction naturally required the (re-)annotation of datasets. The first tool, originally proposed by Yang et al. in 2011 [
55], concerns anaphoric ambiguity handling, a long-standing, highly researched problem in NLP4RE [
57]. The second tool, originally proposed by Kurtanović and Maalej in 2017 [
38], concerns functional and non-functional requirements classification, a very popular RE task that was also selected for the
dataset challenge at the IEEE Requirements Engineering conference in 2017 (RE’17). These two tools were reconstructed by two subgroups of our research team at different times for different purposes, as we elaborate in Section
4.
The rest of the process was mainly driven by focus groups conducted according to the guidelines from Breen [
6]. Applying this research instrument, the cases and the collected data act as a trigger for reflection so that the participants can brainstorm about specific challenges they encountered, recall previously faced problems, and compare their viewpoints with their peers. We refer to Appendix
A for further details on how the focus groups were conducted (structure, participants, timing, analysis, etc.).
The first two focus groups (Step ② and Step ③ in Figure
2) led to identifying a set of challenges related to RQ1 and RQ2. The identified challenges informed the development of the NLP4RE
ID-card (Step ④).
Step ④ included iterative activities consisting of multiple rounds of design and assessment among the authors. The activity was supported by an analysis of the literature (Step ⑤ in Figure
2) in which we read and analyzed 46 representative NLP4RE papers, selected based on the literature review by Zhao et al. [
57]. A pair of researchers then used a similar strategy as the one applied in the mapping study to retrieve 155 additional papers published after 2019—i.e., the date of the mapping study. Finally, we selected 46 papers considering the following criteria: (1) Ensuring a balanced set of highly cited papers (i.e., representing the state-of-the-art) as well as more recent papers. (2) Inspired by the topic categorization of the NLP4RE landscape [
57], our selected papers covered the following nine categories: requirements classification, tracing, defect detection, model generation, test generation, requirements retrieval, information extraction from legal documents, information extraction from requirements documents, and app review analysis. We selected five papers from each category. At least two of the authors of this article independently and manually analyzed a subset of the research papers, and the findings were then cross-checked (
internal assessment). The objective was to assess the completeness, cohesiveness, and coverage of the
ID-card. In the final design iteration, we shared the NLP4RE
ID-card with 15 authors of the NLP4RE papers that we considered in our design. We let them fill in the
ID-card for their papers and further evaluate their own experience by answering a questionnaire that we appended to the
ID-card, as we explain in Section
6.4 (
external assessment). To conclude, we organized a third focus group (Step ⑥) in which we discussed the
ID-card and its use, leading to a number of observations to which we refer in the discussion (Section
7).
The research method of Figure
2 aligns with Wieringa’s design science research, particularly with the design cycle [
51], which comprises problem investigation, treatment design, and treatment validation. In particular,
problem investigation is conducted in Steps ① through ③, wherein the researchers carry out the reconstruction and annotation cases. These activities, complemented by the analysis of the literature conducted in Step ⑤ and consolidated by two focus groups, led to the identification of a set of challenges for RQ1 and RQ2. These challenges inform the
treatment design—the NLP4RE
ID-card (Step ④).
Treatment validation of the
ID-card has been conducted both via the internal and external assessment activities.
In essence, multiple iterations of the design cycle were conducted, as Step ④ has a cyclic nature in Figure
2. The last validation activity was conducted via a focus group (Step ⑥) in which the
ID-card was finalized to its current form.
4 Reconstruction Cases of NLP4RE Solutions
We share our hands-on experience in replicating two state-of-the-art solutions from the NLP4RE literature (Step ① in Figure
2). The first solution (Section
4.1) focuses on detecting anaphoric ambiguity, a specific type of defect in NL requirements, as presented in an early work by Yang et al. [
55]. The second solution (Section
4.2), presented in a more recent work by Kurtanović and Maalej [
38], tackles the classification of non-functional requirements. Both defect detection and requirements classification are among the most common tasks in the NLP4RE field [
57]. These topics and associated papers have garnered substantial recognition, with both papers accumulating over 160 citations according to Google Scholar,
9 making them highly representative of the NLP4RE field.
4.1 Anaphoric Ambiguity Detection
Motivation and overview. Ambiguity in NL requirements is a long-standing research topic in RE [
5,
16,
23,
28,
35,
37]. Pronominal anaphoric ambiguity occurs when a pronoun can refer to multiple preceding noun phrases. The work of Yang et al. [
55] was chosen as a representative approach to anaphoric ambiguity detection. Since neither the dataset nor the source code was publicly available, we decided to reconstruct the work using the details in the original paper and adapt it to the needs of our contest. This annotation activity acted as a trigger for the focus group in Step ②. The RE literature discusses unconscious disambiguation, in which the stakeholders involved in a given project are able to disambiguate the requirements thanks to their domain knowledge [
18,
22,
45,
47]. Automated detection and resolution of different ambiguities is still a valid scenario in RE, since not all stakeholders have the same level of domain knowledge and might hence resolve ambiguities incorrectly. Another scenario is end-to-end automation developed for other purposes; e.g., extracting information from requirements might require resolving referential ambiguity as a prerequisite.
Annotation and Dataset Creation. We used the
PURE (PUblic REquirements) dataset [
26] and randomly selected a subset of 200 requirement statements from seven domains, such as railway and aerospace. Each requirement statement constituted one sentence, typically using the ‘shall’ format. The rationale behind our selection was that the resulting set should be manageable, time- and effort-wise, for annotation within the time frame of our contest. Further, covering diverse domains could be advantageous to assess the performance of a given solution across domains.
Prior to annotating the requirements, we shared guidelines about the ambiguity task. The task was to decide whether a pronoun occurrence in a given requirement is ambiguous or not by investigating the relevant antecedents. Each pronoun occurrence was analyzed by two of the participating annotators. Our annotation process was then performed in multiple rounds. After each round, we held a session to discuss our findings and problematic cases. The annotation process resulted in 103 ambiguous requirements (i.e., requirements containing an ambiguous pronoun occurrence). For all remaining requirements marked as unambiguous, the annotators were asked to identify the antecedent they deemed correct. The outcome of our annotation process is a dataset in which each pronoun occurrence is labeled as ambiguous if it is (i) marked as such by at least one annotator or (ii) interpreted differently by the two annotators. Otherwise, when the two annotators agree on the same interpretation, the pronoun occurrence is labeled unambiguous. We computed the pairwise inter-rater agreement using Cohen's Kappa [
40] on a subset of
\(\approx\)8% of our dataset (16 randomly selected requirements). We obtained an average Kappa of 0.69, suggesting “substantial agreement” between the annotators. Following common practice, we used disagreements as indicators of ambiguity.
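As an aside, pairwise agreement of this kind can be computed directly with scikit-learn; the sketch below uses hypothetical annotator labels (not our actual annotations) purely to illustrate the computation.

```python
from sklearn.metrics import cohen_kappa_score

# Illustrative only: hypothetical labels for a 16-requirement subset;
# 1 = pronoun occurrence judged ambiguous, 0 = unambiguous.
annotator_a = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1]
annotator_b = [1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's Kappa: {kappa:.2f}")  # values in (0.61, 0.80] are commonly read as "substantial agreement"
```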
Tool Reconstruction. The solution proposed by Yang et al. [
55,
56] combines ML and NLP technologies to detect anaphoric ambiguity in requirements, taking a set of textual requirements as input and determining whether each pronoun occurrence is ambiguous or not. We recently reconstructed this solution and publicly released it [
22].
The reconstructed version of the approach includes four components: text preprocessing, pronoun–antecedent pair identification, classification, and anaphoric ambiguity detection. The first component (text preprocessing) parses the textual content of the requirements document and applies an NLP pipeline consisting of four modules: (i) a tokenizer for separating out words from the running text, (ii) a sentence splitter for breaking up the text into separate sentences, (iii) a part-of-speech (POS) tagger for assigning a part of speech to each word in the text (e.g., noun, verb, and pronoun), and (iv) a chunker (or a constituency parser) to delineate phrase boundaries, e.g., noun phrases (NPs).
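A minimal sketch of this four-module pipeline is shown below. It uses NLTK as a stand-in (the library choice is ours; the original paper and our reconstruction may rely on different NLP toolkits) and a simple regular-expression grammar for NP chunking.

```python
import nltk
from nltk import sent_tokenize, word_tokenize, pos_tag, RegexpParser

# One-time setup (assumed): nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

text = ("The system shall send a report to the operator when it detects a fault. "
        "It shall also log the event.")

np_chunker = RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")  # crude noun-phrase grammar

for sentence in sent_tokenize(text):        # (ii) sentence splitting
    tokens = word_tokenize(sentence)        # (i) tokenization
    tagged = pos_tag(tokens)                # (iii) POS tagging
    tree = np_chunker.parse(tagged)         # (iv) NP chunking to delineate phrase boundaries
    print(tree)
```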
The second component (pronoun–antecedent pair identification) extracts all pronoun occurrences from the input requirements, identifies a set of likely antecedents for each pronoun, and finally generates a set of pronoun–antecedent pairs. To simplify our reconstruction scenario, and since coreference resolution was not directly relevant to our contest, we left it out.
The third component (classification) builds an ML-based classifier to classify a given pair of a pronoun p and a relevant antecedent a into YES when p refers to a, NO when p does not refer to a, or QUESTIONABLE when it is unclear whether p refers to a. The classifier is trained over a set of manually crafted language features that characterize the relation between p and a, e.g., whether p and a agree in number or gender. The original paper lists 17 features divided into three categories: 11 syntactic and semantic features, two document-based features, and three corpus-based features. To facilitate the reusability of our reconstructed solution, we dropped four features that are computed using a proprietary library and relate to sequential and structural information.
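To make the classification step concrete, the sketch below shows the general shape of such a feature-based classifier. The two features and the random-forest model are our own simplified stand-ins, not the 17 features or the classifier of the original paper.

```python
from sklearn.ensemble import RandomForestClassifier

def pair_features(pronoun, antecedent_head, distance_in_words):
    """Toy features for a pronoun-antecedent pair (illustrative, not the original feature set)."""
    plural_pronouns = {"they", "them", "their"}
    # Crude number-agreement check: plural pronoun vs. plural-looking antecedent head.
    number_agreement = int((pronoun.lower() in plural_pronouns) == antecedent_head.endswith("s"))
    return [number_agreement, distance_in_words]

# Hypothetical training pairs and their YES/NO/QUESTIONABLE labels.
X_train = [pair_features("it", "system", 5), pair_features("they", "operators", 3),
           pair_features("it", "operators", 7), pair_features("they", "report", 4)]
y_train = ["YES", "YES", "NO", "QUESTIONABLE"]

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print(clf.predict([pair_features("it", "fault", 2)]))
```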
The last component (
ambiguity detection) applies a set of rules over the predictions produced by the ML-based classifier to distinguish the ambiguous cases. The original paper reports two thresholds for detecting correct and incorrect antecedents in unambiguous cases. In our reconstructed solution, we redefined these thresholds with empirically optimized values for our dataset. These thresholds could be generalized beyond the dataset, as supported by empirical evidence in the paper reporting the reconstruction [
22]. The reconstructed code is available in an online repository [
21].
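The sketch below illustrates the shape of such a rule layer over the classifier output: a pronoun occurrence is deemed unambiguous only when exactly one candidate antecedent is clearly supported. The threshold values and the exact rule are placeholders of ours, not those of the original paper or of our reconstruction.

```python
T_CORRECT, T_INCORRECT = 0.7, 0.3  # placeholder thresholds (assumed values)

def is_ambiguous(yes_probabilities):
    """yes_probabilities: P(YES) produced by the classifier for each candidate antecedent of one pronoun."""
    likely = [p for p in yes_probabilities if p >= T_CORRECT]
    unlikely = [p for p in yes_probabilities if p <= T_INCORRECT]
    # Unambiguous only if exactly one clearly correct antecedent and all others clearly incorrect.
    return not (len(likely) == 1 and len(unlikely) == len(yes_probabilities) - 1)

print(is_ambiguous([0.9, 0.1, 0.2]))    # False: a single clear antecedent
print(is_ambiguous([0.8, 0.75, 0.1]))   # True: two competing antecedents
```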
4.2 Functional and Non-functional Requirements Classification
Motivation and Overview. Motivated by the importance of identifying quality aspects in a requirements specification from the early stages of SE, researchers have widely used ML and NLP techniques to propose solutions to several related problems [
44], such as classifying requirements into functional vs. non-functional [
10], which we refer to as the
FR-NFR classification problem. Despite the variety of existing approaches to the FR-NFR classification problem, as of 2019, the most effective reported classifiers [
38] relied on a characterization of the requirements at the word level, e.g., via text n-grams or POS n-grams, resulting in a large number of low-level features (100 or 500 in [
38]). Moreover, for evaluating the classifiers, the great majority of the literature in requirements classification focused on the PROMISE NFR dataset [
9], a collection of 625 requirements from 15 projects created and classified by graduate students.
Solutions relying on a large number of low-level (word-level) features make it hard for analysts to understand why a classifier performs well or poorly and why requirements are classified in a certain way. Aiming at a more interpretable solution, in 2019, a subset of the authors of this article proposed a new NLP approach to the FR-NFR classification problem [
13]. In proposing and evaluating the new solution, we (i) manually annotated a set of 1,500+ requirements from 8 different projects and (ii) reconstructed Kurtanović and Maalej's tool [
38] to compare the proposed solution with the state-of-the-art.
Dataset and Annotation. We manually re-annotated 1,500
\(+\) requirements from the PROMISE dataset [
9] and 7 industrial projects. For the annotation, we followed an approach based on the taxonomy of Li et al. [
41], which allows a requirement to possess both functional and non-functional aspects, as opposed to the original annotation, which allowed a single label per requirement. In particular, we annotated a requirement as possessing functional aspects (F) if it included either a functional goal or a functional constraint, whereas we annotated a requirement as possessing quality aspects (Q) (and thus being a non-functional requirement) if it included a quality goal or a quality constraint. The decision on the functional aspect was independent of the decision on the quality aspect. Thus, a requirement could possess only F aspects, only Q aspects, both aspects (F+Q), or none. In the last case, we considered the requirement as denoting auxiliary information [
52].
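The annotation scheme can be summarized as two independent boolean decisions per requirement; the snippet below is only an illustrative encoding of the four possible outcomes described above.

```python
def categorize(has_functional_aspect, has_quality_aspect):
    """Combine the two independent decisions into one of four outcomes."""
    if has_functional_aspect and has_quality_aspect:
        return "F+Q"
    if has_functional_aspect:
        return "F"
    if has_quality_aspect:
        return "Q"
    return "auxiliary information"

# e.g., a requirement with a functional goal and a quality constraint:
print(categorize(True, True))   # -> "F+Q"
```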
Each dataset was independently manually analyzed by two annotators. Reconciliation meetings were then organized to review the disputes in the annotation. If the annotators failed to convince each other, a third annotator was consulted for the final label. The annotators went over all disagreements and managed to resolve them.
Tool Reconstruction. To provide insights about how our approach compared against the state-of-the-art, we selected a relatively recent approach by Kurtanović and Maalej [
38], described extensively in the original paper, which shows excellent performance. The original solution consists of characterizing the training (PROMISE NFR) dataset using several low-level word features, such as n-grams or POS n-grams. The authors of the original publication consider two cases, with the top (most discriminant) 100 and 500 features. The requirements, represented by the top features, are then used to train a Support Vector Machine (SVM) model with a linear kernel; the trained model is then used to classify previously unseen requirements as functional or non-functional.
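A minimal sketch of this setup (our own simplification: word n-grams only, a tiny hypothetical dataset, and k reduced from the original 100/500) is shown below using scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical requirements and labels (F = functional, NFR = non-functional).
requirements = [
    "The system shall allow users to create a new account.",
    "The system shall respond to any query within 2 seconds.",
    "The administrator shall be able to delete user accounts.",
    "All stored passwords shall be encrypted.",
]
labels = ["F", "NFR", "F", "NFR"]

pipeline = Pipeline([
    ("ngrams", CountVectorizer(ngram_range=(1, 3))),   # word uni-, bi-, and tri-grams
    ("top_k", SelectKBest(chi2, k=10)),                 # top-k most discriminant features (100 or 500 originally)
    ("svm", SVC(kernel="linear")),                      # linear-kernel SVM
])
pipeline.fit(requirements, labels)
print(pipeline.predict(["The tool shall export reports as PDF."]))
```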
Since the original classifier was not publicly available, we reconstructed it from the details provided in the original publication. We also complemented this information with the code of another classifier related to app reviews [39] that is partially available online and was developed by the same research group.
During reconstruction, we applied a few minor modifications to the original version (details in [
13]). For example, to build parse trees, we used a different, better-performing library (Berkeley's
benepar [
36]) than that used in the original publication (the Stanford parser [
8]). We could not reproduce one of the classifier's features, since the explanation in the original paper was insufficient for a correct re-implementation. We also could not use the dataset applied in the original solution to artificially balance the minority class of NFRs, as it was not publicly released. Finally, we released the reconstructed code in an online repository [
14].
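For reference, a minimal sketch of obtaining a constituency parse with benepar through its spaCy integration is shown below; the model names and setup follow benepar's documented usage and may vary across versions, and this is not an excerpt of our released code.

```python
import benepar
import spacy

# One-time setup (assumed): benepar.download("benepar_en3") and a spaCy English model.
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("benepar", config={"model": "benepar_en3"})

doc = nlp("The system shall encrypt all stored passwords.")
sentence = list(doc.sents)[0]
print(sentence._.parse_string)  # bracketed constituency tree, e.g., (S (NP ...) (VP ...))
```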
6 Design and Evaluation of the ID-Card (RQ3)
Once the challenges related to data annotation and tool reconstruction were identified through the responses to RQ1 and RQ2, we conducted Step ④ of our research method and implemented the concept of the ID-card.
6.1 ID-Card Design
We designed the structure of the ID-card following an iterative method composed of three steps: (1) Preliminary Definition, (2) Internal Assessment, and (3) External Assessment.
Preliminary Definition. In the first step, we outlined a list of information items needed for reconstructing an NLP4RE solution based on our own experience in reconstructing the two tools introduced in Section
4, as well as on the challenges gathered while answering RQ1 and RQ2. The researchers worked in pairs, and each pair drafted a set of questions and possible answers for a specific dimension. The considered dimensions included task, dataset, annotators & annotation process, tool, and evaluation. The dimensions were selected based on previous experience and brainstorming among the researchers. After the pairs of researchers drafted the questions, we carried out a one-hour group meeting to consolidate them. At this stage, the researchers listed 56 questions. Another two iterations were carried out during the Internal Assessment (Step 2, described below) to reach a stable set of 47 questions, simplified to be more understandable for the intended readers.
Internal Assessment. We analyzed 46 papers from the NLP4RE literature (5 of which were co-authored by at least one of the authors of this article) and extracted replication-relevant data according to the information items from the first step. We also included in our analysis the two papers that we used in our replication scenarios (see Section
4).
Following the collection of the 46 selected papers, we used the preliminary ID-card (56 closed questions); at least one researcher extracted information from each of the 46 papers. For four papers, at least two researchers independently filled in the ID-card for the same paper to obtain a common manual analysis within the same topic category, which we cross-checked and discussed at a later stage. The goal of this activity was to assess the applicability of the ID-card on a broad set of topics and possibly customize it for different categories. After that, the researchers had a plenary discussion to share the problems that emerged during this initial application of the ID-card and to decide how to mitigate these problems. The outcome of this meeting was an updated list of 47 questions in total, including a combination of 32 closed and 15 open questions.
External Assessment. The ID-card was shared with the authors of the original papers. We asked them to fill in the ID-card for their papers without sharing our previously filled-in cards. We contacted 32 authors, avoiding contacting the same research group for multiple papers, and received filled-in ID-cards from 15 authors for 15 different papers. For each paper, we analyzed the discrepancies between our input and the input of the original authors and provided explicit notes listing the points of disagreement and reflections on possible motivations. In a 1.5-hour plenary session, we discussed the identified discrepancies based on the researchers' notes to identify major sources of disagreement. Most of the disagreements were due to the level of detail provided: answers to some questions in the ID-card were given at different levels of granularity. For example, in the case of multiple datasets used in the same study, some answers focused only on one dataset whereas others provided information about all datasets. Other disagreements concerned the NLP task used in the study. In this case, we acknowledge that multiple NLP technologies can be applied in the same paper for solving the same RE task. The goal of our plenary session was to identify the elements that could lead to misunderstanding, gather external viewpoints on the developed card, and discuss possible residual issues. After this analysis, we rephrased some of the questions according to our observations to make them more concrete and reduce possible disagreements.
6.2 ID-Card Description
The
ID-card resulting from the various iterations consists of 47 questions, divided into 7 sections, described below. A compact version of the
ID-card is displayed in Table
2. The card contains questions that either attempt to fully or partially address the challenges introduced in Section
5 or aim to provide metadata about the research paper, such as questions I.1 and II.1. In an online appendix [
1], we provide the complete
ID-card, validation material, and two filled-in
ID-cards of the reconstructed tools described in Section
4.
I. RE Task.
This section identifies the RE task addressed in the article, e.g., classification, tracing, and defect detection. The ID-card provides several options, from which only one option can be selected. The options include the nine categories listed earlier and an additional option that enables adding an unanticipated RE task. While the majority of NLP4RE papers address one main RE task, this is not always the case. For example, a paper could describe multiple RE tasks, such as generating a model from requirements specifications and then checking the requirements for incompleteness against the generated model. Each RE task can be solved using a combination of NLP tasks such as text classification, named entity recognition, and semantic role labeling. Some papers also introduce different datasets and develop multiple tools. In such cases, the ID-card is intended to be filled in separately for each of the RE tasks. The rationale behind this decision is two-fold. First, we simplify the overall design of the ID-card (e.g., we avoid the reiteration of certain questions for each RE task). Second, by decomposing the work into distinct RE tasks, we facilitate the retrieval of replication-relevant information associated with the paper, which is the main goal of the ID-card. Even when the entire work in such a paper is considered for replication, decomposing the work into distinct RE tasks helps to understand the tools better and replicate them more accurately.
II. NLP Task(s).
This section specifies the NLP tasks used to support the RE task, i.e., classification, translation, information extraction and information retrieval (again, unanticipated NLP tasks may be specified). NLP tasks are distinguished from RE tasks as some NLP tasks could be applied for different purposes. For instance, classification (NLP task) can be used to organize requirements into different categories but also to detect defects or identify trace links (RE tasks). More than one NLP task can be selected since the NLP tasks can be used in combination—e.g., in model generation, one uses information extraction plus translation.
III. NLP Task Details.
This section characterizes the details of the RE task addressed, i.e., the RE task input granularity (e.g., document, word, paragraph) and the output type. We define various output-type options that differ for each NLP task. For example, a classification task requires specifying whether the output is binary single-label (e.g., ambiguous XOR not ambiguous), multi-class multi-label (e.g., feature request OR bug OR praise), or other options in between. One has to further specify the possible labels (i.e., classes) of the output. For translation, the output can be text but also test cases or models. Specific to translation, one has to specify the cardinality, i.e., one input to one output (e.g., one document to one diagram) or many inputs to many outputs (e.g., from many use case descriptions to one class diagram and multiple other diagrams).
IV. Data and Dataset.
This section characterizes the dataset used in the article. One has to provide details about the size of the dataset; the year; the raw data source (e.g., proprietary industrial data, regulatory documents, user-generated content—eight options plus “other”); and the level of abstraction of the data (e.g., user-level, business-level, system-level). In this section, the term data collectively identifies both the possible input and output data of the previously selected NLP task(s). Therefore, the questions support multiple answers. This choice was driven by the need to keep the ID-card well structured and its information easy to retrieve. This section also asks for information concerning the format of the data (use case, “shall” requirements, diagrams, etc.), the degree of rigor (unconstrained NL, restricted grammar, etc.), and the actual language (if applicable). Additional questions are included about the heterogeneity of the dataset—in terms of domain coverage and number of sources from which the data is obtained—as well as about the data licensing and the URL to access the dataset.
V. Annotators and Annotation Process.
This section includes information to characterize the annotators in terms of background knowledge, number, and level of bias in case annotation was carried out on the raw data. In addition, information is collected about the adopted annotation scheme (if any) and the process to measure and resolve disagreements. This section focuses mainly on manual annotation, a common practice for creating datasets in NLP4RE. With the current NLP technologies, researchers are shifting towards creating datasets using automated means. Extending the annotation section to cover the replication of automated annotation is left for future work.
VI. Tool.
This section collects information about any implementation provided along with the article, e.g., scripts, executable programs, application programming interfaces (APIs), collectively designated with the term tool. Specifically, the questions in this section require information about the enabling technology of the implemented NLP solution (e.g., machine learning, rule-based), what has been released (e.g., binary file, source code), and additional information concerning documentation, licensing, dependencies, and other details that can help access and execute the tool.
VII. Evaluation.
This section requires information about the evaluation carried out in the article: the evaluation metrics (precision/recall, Area under the Curve (AUC), etc.), the type of validation process (cross-validation, train-test split, etc.), and the baselines used for comparison (if applicable). Investigating other evaluation alternatives, e.g., the impact of using the tool on the downstream development, is left for future work.
We note that the ID-card provides a comprehensive view of what information is relevant for replication. However, for a particular paper, one might fill in only those sections of the ID-card that are relevant to the paper. For example, if the paper only presents a new dataset without an automated approach, then only the section about annotation might be relevant.
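For illustration, a filled-in ID-card for one RE task could be archived in a simple machine-readable form, as sketched below. The field names are ours and only paraphrase the seven sections (they are not the official question identifiers), and the values loosely mirror the anaphoric ambiguity case of Section 4.1, with some entries left as placeholders.

```python
id_card = {
    "I_re_task": "defect detection",
    "II_nlp_tasks": ["classification"],
    "III_nlp_task_details": {"input_granularity": "sentence",
                             "output_type": "binary single-label",
                             "labels": ["ambiguous", "not ambiguous"]},
    "IV_data_and_dataset": {"size": 200, "source": "public requirements documents (PURE)",
                            "format": "'shall' requirements", "language": "English",
                            "license": None, "url": None},
    "V_annotation": {"annotators": 2, "guidelines_available": True,
                     "agreement_measure": "Cohen's Kappa"},
    "VI_tool": {"enabling_technology": "machine learning", "released": "source code"},
    "VII_evaluation": {"metrics": ["precision", "recall"],   # placeholder values for illustration
                       "validation": None, "baselines": []},
}
```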
6.3 Tracing Challenges to Questions in the ID-card
In the following, we discuss the rationale for the majority of the questions in the
ID-card, in relation to the challenges described in Table
1. We note that the traceability is not a one-to-one mapping between challenges and the
ID-card. Some challenges, such as the little value given by our RE community to replicating studies, cannot be handled by the
ID-card.
Recall that the classification of the RE tasks (Question I.1) was adopted from the recent systematic literature review by Zhao et al. [
57], whereas the NLP task classification (Question II.1) was specifically designed to complement the RE tasks. The
ID-card partially addresses the annotation challenges (see Table
1) as follows:
—
To address challenge Ann1, questions V.5 and V.6 are introduced to highlight the necessity of establishing a clear annotation protocol prior to the annotation process and of making it publicly available afterwards. Question V.10 also points out that communication among annotators is required for achieving a high-quality dataset. Once the annotation protocol is agreed on, the likelihood that it frequently evolves (Ann4) is limited.
—
While including domain experts in the annotation process is not always possible, questions V.1 to V.4 in the ID-card inform researchers interested in reconstructing a dataset about whether domain expertise was involved.
—
With regard to Ann3, Question V.8 indicates the need to mitigate fatigue.
—
Questions IV.10 to IV.13 are concerned with the dataset. The questions spotlight the publicly available datasets along with the licenses under which they are released. By leveraging such datasets in future similar annotation tasks, one can address the challenges Ann5 to Ann7.
—
Question V.7 gives insights about the context that is shared with the annotators during the annotation process. The ID-card can be used to create a common practice to assist researchers in designing new annotation tasks, thus addressing challenge Ann8.
With regard to the tool-reconstruction challenges, the ID-card demands more detailed and precise information concerning the implementation of the tool, which is often omitted from a paper due to space limitations. Specifically, Questions VI.6 and VI.8 are about the libraries employed in the implementation, Question VI.5 is about the available documentation related to the tool, and Questions VI.3 and VI.9 are about what has been publicly released.
6.4 Evaluation of the ID Card
We disseminated a short survey (four Likert-scale questions and two open-ended questions) among the 15 authors who filled in the
ID-card, asking them to provide explicit feedback. Inspired by the questionnaire from the
Technology Acceptance Model (TAM) [
17], we included survey items (see Figure
3) regarding the perceived ease of use and intended use in three different use cases: reuse and replication, literature surveys (as a typical instrument that NLP4RE newcomers use for learning the state-of-the-art), and education. The vast majority of respondents agreed that the
ID-card has the potential to be used for the purposes stated above. Nevertheless, opinions were mixed regarding the card's ease of use. In the open feedback section of the survey, some respondents indicated a preference for the support of multiple datasets and for a more interactive format that supports conditional sections based on previous answers.
We also evaluated the respondents’ perception about the
ID-card's appropriateness and level of detail for the three different use cases (see Figure
4). Although the vast majority agrees that the use of the
ID-card is appropriate across the three use cases, they perceived that more details may be necessary to further facilitate reuse and replication of existing work. None of the respondents indicated which details should be included, but the necessity to strike a balance between detail and usability was recognized, as one of the respondents pointed out:
“For replication purposes, I think more details would be required. On the other hand, putting in more detail will make the ID-card creation more time-consuming (potentially defeating the purpose). There is no easy answer here.” The respondents indicated other possible usages, such as leveraging the information contained in the card for searching and filtering previous literature, e.g., when selecting a baseline to compare against in their own research work, and as a checklist to follow during the planning phase of a new work.
7 Discussion
In the following, we discuss how the
ID-card mitigates the identified challenges, relevant hints for filling in the
ID-card, lessons learned, and limitations of the study based on the discussion carried out in the final focus group (see Step ⑥ in Figure
2).
7.1 Mitigation of Challenges
In Section
5, we identified several challenges concerning tool reconstruction and dataset annotation. To mitigate these challenges or reduce their effects, we suggest the following mitigation actions.
—
Rigorous Annotation. Authors should make the annotation task more rigorous by (i) conducting pilot annotations to set the protocol and to avoid later changes; (ii) maintaining written guidelines, including examples, to guide the annotators as well as for future reference; (iii) performing reconciliation sessions to resolve disagreements; (iv) reminding the annotators about their ethical responsibilities—the annotation should be as reliable as possible, as it may be reused by other researchers in other work.
—
Reward reconstruction. The research community should consider rewarding annotation and tool reconstruction tasks by creating more venues and events where such activities can be published and shared with the community.
—
Clarity. The natural language description of a tool present in a paper is often insufficient and too informal for other researchers to reconstruct the tool unambiguously. Therefore, researchers should allocate sufficient effort to clearly describe how the tool can be reconstructed from scratch.
—
Flexibility. Flexibility is essential and often necessary to properly reconstruct existing tools. This includes simplifying or adapting existing tools to be applicable in the reconstruction scenario, e.g., using different data, filling in missing implementation details, replacing proprietary or legacy libraries.
—
Goal-driven Reconstruction. The rigor of reconstruction depends on the goal of the endeavor. For instance, building a baseline against which to compare a new solution requires more rigor than reconstructing a tool for practical use in a company.
7.2 Guidelines for Filling the ID-card
One of the outcomes of the third focus group held by the authors was the identification of a couple of considerations related to filling in the ID-card. In particular:
—
One ID per task. The first step when filling in the ID-card is to map the proposed NLP solution to the underlying RE task. However, many papers discuss multiple NLP solutions (e.g., extraction and classification) in one pipeline for solving the same RE task (e.g., solving cross-referencing in requirements). To improve reusability and foster reconstruction, we recommend filling in the ID-card by focusing on one RE sub-task, which is likely the main task being solved in the paper. In its current form, the ID-card can, in theory, be filled in multiple times, once per RE sub-task.
—
Degree of detail. To identify the reconstruction-relevant information, it might not be clear for the researcher filling in the
ID-card to what extent such information should be sought beyond the original paper. Although having all possible details is the best option for reconstruction, this poses a pragmatic challenge related to the additional time required for the respondents to fill in the
ID-card. The researcher should address this consideration in line with the motivation of filling in the card. For example, the author of a paper could provide more details for better reuse and replication (see Figure
3(a)). We acknowledge that the
ID-cards are not expected to be identical for the same paper when filled in by two different researchers, yet they would still represent equivalent summaries of the original paper. However, filling in multiple
ID-cards for the same paper is unlikely in future practice when the authors provide an
ID-card along with their publications.
7.3 Lessons Learned on the ID-card Design
Another outcome of the final focus group was several lessons learned during the design of the ID-card, which can be useful for researchers addressing similar endeavors. Among them, we offer the following remarks.
—
Generality of the ID-card. The details required by the card should cover sufficient information about the solution without merely repeating what is in the paper. The ID-card aims to summarize an NLP4RE paper so that the solution and/or dataset can be replicated. Ensuring that it can cover a wide spectrum of the papers in the NLP4RE literature is important, yet challenging. There is a trade-off between the coverage of the paper by the ID-card and the level of detail required to achieve a comprehensive description of the paper. Requesting many details entails more time and effort to fill in the card. To address this consideration, we opted to design the card at a generic enough level to be applicable to many papers with different research foci. The ID-card should help researchers decide whether or not papers are useful for their reconstruction goals.
—
Free text options. Missing details that are relevant to reconstruction must be properly accounted for and incorporated in the ID-card. We address this consideration by adding a free-text option alongside each question to give the possibility to elaborate on the reasons for, or remarks concerning, missing details. We believe that providing a justification can give hints about what to do with these missing details. For example, if the evaluation uses the entire dataset instead of cross-validation, the reason might be that the tool does not require training. In this case, details about the proportion of training data might be missing but also not needed.
7.4 Possible Uses of the ID-card
The ID-card serves several purposes for different actors, as summarized in Table 3. For NLP4RE newcomers, e.g., researchers entering the field and PhD students starting their thesis, the ID-card is an effective instrument to get acquainted with the state of the art. Experienced researchers may use it when conducting their NLP4RE research at several stages, namely design, evaluation, and reporting, as a checklist to ensure that their study covers all aspects required for replicability. In the frequent case that the work is presented in a paper, researchers may submit the ID-card to a public repository as accompanying material. This way, the reviewers of the paper have access to more detailed information, which helps them provide an informed evaluation. Last, we foresee the ID-card as a useful source of information for educators when preparing NLP4RE-related material.
We further envision that the ID-card can be used to create and archive summaries of papers that one reviews at a given time and needs to return to later; for example, researchers performing snowballing for a literature review can quickly look for search seeds in such an archive.
Finally, the ID-card can be used to assess whether a particular solution is documented adequately for reconstruction, without having to read the entire paper, e.g., by spotting an incomplete description or missing implementation details. In doing so, it increases the chances of reconstructing the work described in a paper. In the future, ID-cards can be collected from the authors during the paper submission and review process and stored in a public repository, making them publicly accessible to the RE community. Authors and reviewers can also use the ID-card as a checklist of all the information concerning reconstruction and reusability that should be reported in the paper. Thus, the ID-card can raise awareness of the need to include such information.
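As an illustration of the archiving scenario above, the sketch below searches a local folder of ID-card summaries, assumed to be stored as one JSON file per paper, for candidate snowballing seeds. The directory layout and the field names (re_task, tool_available, paper_title) are hypothetical assumptions made for this example; the ID-card does not prescribe a machine-readable storage format.

```python
import json
from pathlib import Path

# Hypothetical sketch: scan an archive of ID-card summaries
# (one JSON file per paper) for candidate snowballing seeds.
def find_seeds(archive_dir: str, re_task: str) -> list[dict]:
    seeds = []
    for card_file in Path(archive_dir).glob("*.json"):
        card = json.loads(card_file.read_text(encoding="utf-8"))
        # Keep papers that address the RE task of interest and
        # report a publicly available implementation.
        if card.get("re_task") == re_task and card.get("tool_available", False):
            seeds.append({"paper": card.get("paper_title", card_file.stem),
                          "file": card_file.name})
    return seeds

if __name__ == "__main__":
    for seed in find_seeds("idcard-archive", re_task="requirements classification"):
        print(seed["paper"], "->", seed["file"])
```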
7.5 Limitations
This section identifies the limitations of our study and discusses how we mitigate them. We group the limitations into those related to the identification of the challenges and those related to the construction of the ID-card.
The problems of requirements ambiguity detection and requirements classification are classical and widely studied topics in NLP4RE, as shown by the survey of Zhao et al. [57]. However, the specific cases discussed during the focus groups were selected based on the experiences of the authors; thus, personal bias might threaten the validity of the results. To mitigate personal bias, experts in different NLP4RE tasks participated in the focus groups and shared their views. These focus groups included expert NLP4RE researchers who took part in the two presented replications; moreover, all the experts had extensive experience in various NLP4RE tasks. The participants also found commonalities among the challenges raised in the different focus groups, which indicates that the challenges do not merely reflect the personal bias of individual participants. To prevent threats due to social relations and persuasion, each focus group was moderated by a different person and had distinct discussion leaders. During the discussion, each participant was given the opportunity, and was encouraged, to speak up freely. Nevertheless, some participants were more vocal than others.
During the focus groups, different participants might not mentally map the identified challenges to the same category; thus, the understandability of the challenges may be a threat. To mitigate this, each moderator took notes during the focus group and shared them in real time so that the other participants could read and clarify them when necessary. Each moderator also summarized the findings at the end of each question during the focused discussion phase and again at the end of the focus group. The completeness of the challenges identified in the focus groups is another threat. To help achieve completeness, we did not enforce time limits on the focus groups, allowing the participants to discuss the cases in depth and to add new points as they saw fit. To ensure conciseness, we iterated over the identified challenges to check for and reduce redundancy. The challenges were identified based on two distinct, yet representative, NLP4RE cases. It is worth remarking that the two cases served as triggers for the discussion; the actual challenges elicited are based on the analysis of 46 papers and on the substantial expertise of the focus group participants, who have taken part in several artifact evaluation tracks, in NLP4RE tool design, and in study replications. Further experimentation is, however, needed to ensure the generality of our findings. Some challenges may apply to the field of natural language processing for software engineering (NLP4SE), whereas others may be more generally applicable to SE. Nevertheless, as the scope of this study is strictly NLP4RE, more studies are needed to support these claims.
Two main concerns for the ID-card are completeness and pragmatic usability, which may conflict with each other. We tried to strike a healthy balance between the two through multiple iterations of applying and refining the ID-card, although we expect that lightweight versions of the card may be created for specific tasks and uses, e.g., those listed in Table 3. These iterations also increased the understandability of the card as we applied it to multiple articles. After each iteration, we discussed the results to confirm that a shared understanding had been reached. Understandability was further assessed by asking the authors of the original papers to fill in the ID-card. To prevent the personal bias of single authors from affecting the construction of the ID-card, multiple experts in the different NLP4RE tasks participated in its construction. Issues related to the coverage of the ID-card, in terms of the types of alternative answers for each question, were mitigated by assessing the card on 46 papers specifically selected to cover the spectrum of typical RE tasks, based on the taxonomy by Zhao et al. [57], and considering both widely cited seminal papers and relevant recent ones.
8 Conclusions and Outlook
Replication, covering both data (re-)annotation and tool reconstruction, is an important strategy in experimentation and empirical evaluation. In this article, addressing the field of NLP4RE, we investigated the challenges of annotating datasets for training and evaluating NLP4RE tools (RQ1) and the challenges of reconstructing NLP4RE tools (RQ2). To answer these research questions, we conducted focus groups in which we reflected on our first-hand experience in replicating state-of-the-art NLP4RE tools, and we further analyzed 46 papers covering a wide spectrum of the NLP4RE landscape. As a result of our study, we identified 10 challenges concerning data annotation and 5 challenges concerning tool reconstruction. Some of these challenges are specific to NLP4RE, whereas others can be considered applicable to other SE fields. Though we refrain from generalizing our findings, as they stem from an NLP4RE context, we encourage other authors to further investigate our list of challenges and possibly adapt it to their contexts.
Challenges concerning data annotation (RQ1) include the unavailability of theories specific to RE tasks on which to build the annotation task, e.g., concrete definitions of non-functional requirements or of nocuous ambiguity; the need to anticipate additional time and effort to potentially evolve annotation protocols; and issues resulting from dataset imbalance and from annotators' lack of domain knowledge. Challenges in reconstructing tools (RQ2) mostly arise from missing details in the original papers, e.g., the exact library versions used, or from the application of outdated NLP tools that are no longer available.
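As a small illustration of how the missing-library-version challenge can be mitigated at reporting time, the Python sketch below records the exact versions of the libraries a tool depends on so that they can be copied into the ID-card or the paper. The package names listed are placeholders chosen for illustration, not requirements of the ID-card.

```python
import sys
from importlib.metadata import version, PackageNotFoundError

# Placeholder package names; replace them with the libraries
# the tool actually depends on.
PACKAGES = ["spacy", "scikit-learn", "nltk"]

def environment_report() -> dict:
    """Collect the Python and package versions of the current environment."""
    report = {"python": sys.version.split()[0]}
    for name in PACKAGES:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = "not installed"
    return report

if __name__ == "__main__":
    for key, val in environment_report().items():
        print(f"{key}: {val}")
```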
To reduce the effect of these challenges, we investigated how to support NLP4RE researchers (RQ3). We proposed the ID-card as a source complementary to the original papers, summarizing, via a total of 47 questions, the information relevant to replication. The ID-card covers seven topics essential for replication, concerning the RE and NLP tasks, the inputs, the outputs, the annotation process, the tool, and the evaluation details. We assessed the ID-card both internally and externally (with the authors of the original papers) by having different researchers fill it in for the same papers. Though filling in the ID-card requires time and effort (which should be marginal for the authors of a paper), it provides a useful starting point to facilitate replication.
In the future, we would like to promote the ID-card as part of paper submission at different RE venues. The goal is to help reviewers assess how a paper addresses replication and to increase the overall awareness of replication-relevant information. Furthermore, we plan to investigate extending the ID-card to cover other SE-related areas. This is highly needed, as artifact evaluation is not sufficiently mature and needs further improvement [32, 53]. Though the ID-card mainly aims to improve artifact evaluation, this is not its only purpose. It can be used for other objectives, such as literature reviews, as also indicated by the participants in our external assessment. These are hypothetical usage scenarios, which can evolve once the ID-card is used and possibly adapted by the community.