1 Introduction

Automatic question generation (AQG) systems generate natural language questions about a topic, idea, or context from either a paragraph of text or an image. Such systems have become increasingly popular of late, driven by their use in machine reading comprehension, conversational systems, and educational applications. The eventual goal of an AQG system is to generate questions that are syntactically and semantically correct as well as meaningful in the context of the use-case. For instance, in some cases the goal is to generate questions on a topic of interest or based on different spans of text in a passage [1]. In conversational systems, say a question-asking bot, it is imperative to stay consistent with the context of the conversation while at the same time maintaining the interest of the user [2].

AQG has been widely experimented with in educational settings. In [3], an attempt is made to generate questions based on the content of English stories. Questions in five different categories of understanding were framed by extracting syntactic and semantic information from the stories using natural language processing. Their work specifically supported the language learning ability of the learners. The authors compared their generated questions to those asked in their collection of book problems and also evaluated them for semantic correctness. In [4], the concept of self-questioning in the context of reading comprehension is explored. In their approach, instructions are generated that help learners ask questions relevant to the passage. Children's stories were used as the dataset of passages. Rather than generating questions randomly, questions related to the mental states of the characters in the passages are framed, enabling learners to infer connections between key story characters. Ten different categories of modal verbs were used for constructing questions of three different types (what/why/how) with the help of question templates. The generated questions were evaluated for acceptability. About 71% of the generated questions were marked acceptable, but they suffered from parsing and grammatical errors.

Question generation systems fall into one of two domain categories: closed domain or open domain. In closed-domain question generation, questions are generated for a specific domain such as medicine [5, 6] or educational text [7]. Here, the questions usually rely on domain-specific knowledge restricted by an ontology. Open-domain question generation does not depend on any particular domain and allows questions to be generated from arbitrary text, requiring only universal ontologies. The data on which such systems can be trained are readily available and abundant. This type of question generation system does not cater to any specific domain of discourse and can be applied to any domain in general. Two major approaches to open-domain question generation have been researched. The first makes use of the syntactic structure of sentences and other natural language processing operations to produce a question from a specified sentence, while the second is an end-to-end method, similar to neural machine translation, that uses neural networks to generate questions. Significant contributions using both approaches have been made over the years, including constituent and dependency parsing [8], a representation using lexical functional grammar [9], semantic role labeling in a rule-based set-up [10], and neural network-based approaches such as the generation of factoid questions using recurrent neural networks [11], deciding where to focus when generating questions for reading comprehension [12], and generating questions by recognizing the question type [13].

The main focus of this review is to present researchers and practitioners with a comprehensive overview of the research carried out in the field of automatic question generation. The significant contributions this paper makes are listed below:

  1. To provide a detailed overview of automatic question generation methodologies.

  2. To provide the list of datasets available for AQG.

  3. To provide an overview of the challenges and applications in the field of AQG.

For this review, we carefully selected papers published in reputed journals and conferences. We used key phrases such as automatic question generation, question generation, and visual question generation to search for relevant research articles in this field. We then categorized these articles on the basis of the use-case they try to model and applied inclusion criteria to ensure the quality of the content in our survey. We included only papers that exhibit sufficient experimental evidence, with models evaluated on benchmark datasets, papers that introduce state-of-the-art methodology for question generation, and articles that compare their proposed models with existing work. We also included articles that introduce datasets for benchmarking this problem area. We focused more on articles that use machine learning and deep learning-based architectures, and we included articles catering to various application domains to show the usefulness of the AQG task. We excluded articles with incomplete experiments or insufficient evidence of results and those with no comparison among methodologies. We also excluded articles that were not written in English.

Several reviews have been published in the past in the field of automatic question generation. In [14], a review of automatic question generation from text covering the years 2008 to 2018 is presented. However, the field has advanced to a great extent since then with the introduction of recent deep learning architectures. There are also a few reviews specific to particular domains of question generation. For example, the work done on question generation in the educational domain is discussed in [15]. The authors provide a systematic review of contemporary literature with a focus on the quality of question structure, sub-domains in the educational field, and how the research is largely focused on assessment. A similar review is presented in [16], where the authors explore the joint task of question generation and answer assessment. A comprehensive survey of the task of visual question generation is presented in [17]. Our review differs from the earlier ones in that it provides a detailed discussion of the methodologies for question generation. We also categorize question generation techniques broadly into three different use-cases and analyze the datasets and metrics used in question generation for each of the use-cases identified.

This paper is organized as follows: We first formally define the problem of automatic question generation and discuss the various question categories, summarizing the technical idea of such a system. Next, we provide a classification of AQG methodologies based on three distinct use-cases: standalone question generation, visual question generation, and conversational question generation, along with a comparative analysis of these methodologies. We then provide a comprehensive overview of the datasets available for training automatic question generation systems and categorize them by use-case. We also list the metrics, both automatic and human-based, used for rating the performance of question generation models. We finally identify the various research challenges in AQG systems and briefly discuss applications of such systems as seen in various research works.

2 Automatic question generation overview

Automatic question generation systems are realized using varied approaches. Also, the kind of questions that such a system generates is important when choosing an approach. Before we delve into the approaches, we first formally define the problem of automatic question generation and give a short description of the types of questions that can be generated.

2.1 Problem definition

The problem of automatic question generation can be formally described as follows. Consider a given input modality, text or image, based on which a question has to be generated. Let I represent the input, Q represent the labelled (reference) question, and A be the answer relevant to the question. We define the automatic question generation problem as follows:

Find a function

$$f(I, A) = Q^{\prime}$$

such that Q′ is semantically equivalent to Q.

The input I can be represented as a vector of relevant features extracted from either an image or text. The question generation problem is to find a model whose generated question Q′ approximates the labelled question Q. To realize this, the dataset is first pre-processed to make the data available in the desired format. Based on the question type, an appropriate strategy is chosen for question generation, and depending on the type of question generation system, an appropriate dataset, containing text or images, can be selected. The question generator model may be rule-based or neural network-based. For a rule-based AQG system, an appropriate NLP technique is used for generating the questions, while for a deep learning-based strategy, an appropriate representation is chosen for training the model.
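To make the abstraction concrete, the sketch below expresses the f(I, A) = Q′ mapping as a minimal interface; the class names and the toy rule are illustrative placeholders, not part of any system surveyed here.

```python
# Minimal sketch of the f(I, A) = Q' abstraction described above.
# QuestionGenerator and RuleBasedQG are hypothetical names for illustration.
from abc import ABC, abstractmethod


class QuestionGenerator(ABC):
    """Maps an input I (text or image features) and an answer A to a question Q'."""

    @abstractmethod
    def generate(self, context, answer):
        """Return a question string Q' conditioned on the context and answer."""
        ...


class RuleBasedQG(QuestionGenerator):
    def generate(self, context, answer):
        # A toy rule: replace the answer span with a wh-word.
        return context.replace(answer, "what") + "?"


if __name__ == "__main__":
    qg = RuleBasedQG()
    print(qg.generate("Einstein developed the theory of relativity",
                      "the theory of relativity"))
    # -> "Einstein developed what?"
```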

2.2 Question categories

When we consider question categories, various taxonomies have been proposed. An important contribution in this direction is Lehnert's classification [18]. As part of the development of a computational model for question answering, Lehnert classified questions based on the idea of conceptual categorization. According to this idea, for a question to be interpreted correctly it must be placed in the right conceptual category; otherwise it will lead to wrong reasoning. In this sense, the emphasis should be on the context in which the question was asked. Accordingly, Lehnert proposed thirteen such conceptual categories, namely causal antecedent, goal orientation, enablement, causal consequent, verification, disjunctive, instrumental, concept completion, expectational, judgmental, quantification, feature specification and request [18].

A similar classification scheme is given in [19], where questions asked during tutoring sessions were analyzed. Thus, for any question generation system, it is important to identify the types of questions it can generate. Questions can be distinguished by whether they expect a to-the-point answer, an answer spanning several lines, or a fill-in-the-blank completion. Questions may also be characterized by the cognitive level of the expected answer. Other classifications consider whether the questions are extractive or abstractive: extractive questions have answers consisting of words taken from the passage itself, while abstractive questions have meaningful answers worded differently from the passage. Question categorization helps in chalking out the exact use-case that has to be realized. Below, we list a classification of questions based on the various research carried out in the field of question generation.

  (1) Factual questions

    This category consists of simple objective questions that start with what, which, when, who, or how. The expected answer is a word or a group of words from sentences in a paragraph of text. Most of these questions are formed from a single sentence of the paragraph and expect a known fact as an answer. Complex natural language processing is not required for answering factual questions.

  (2) Multiple sentences spanning questions

    Some questions require multiple sentences of a paragraph as the answer, with the relevant facts spread across several sentences. These are again W4H (what/where/when/who/how) questions and can be solved using the same approaches used for factual questions.

  (3) Yes/no type questions

    These are questions that require a Boolean response (yes/no). Answering such questions correctly with a yes or no requires a higher level of reasoning.

  (4) Deep understanding questions

    These are inference-oriented questions that require a proper inference mechanism. Answering them might require deriving a fact from several related facts in a piece of text; they are complex questions that draw on different information from varied parts of the text.

3 Classification of automatic question generation techniques

In this section, we present the different categories of automatic question generation. We make the distinction based on two aspects: the first is the application use-case a system tries to model, and within each use-case we further categorize the different classes of methodologies used.

Broadly, we identify three types of question generation systems: standalone question generation (SQG), visual question generation (VQG), and conversational question generation (CQG) (Fig. 1).

Fig. 1 Classification of automatic question generation

3.1 Standalone question generation

In this type of question generation, the questions are generated independently of each other. This is typically the setting for machine reading comprehension systems, where the only goal is to produce semantically and syntactically correct questions based on a paragraph of text or certain language-modeling rules. However, there is no correlation among the different questions generated.

3.1.1 Rule-based approaches

In [20], the authors address the problem of question generation using an educational learning resource called OpenLearn, which covers a wide range of discourse types authored by various subject experts. For their implementation, the authors convert the material from OpenLearn, represented in XML format, into plain text, and then apply NLP processing to form a syntax tree. The system then uses pattern matching to generate questions from sentences. The patterns are used as part of rules, which match sentences from the text to generate questions and the corresponding answers.

In [21], a rule-based approach is employed for producing questions from declarative sentences. The approach first simplifies the sentence and then applies a transformation technique for question generation. The generated questions are then ranked for quality using logistic regression, and the ranked questions are annotated for acceptance. This ranking approach improved the acceptability of the generated questions as judged by the annotators.
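As an illustration of the simplify-then-transform idea, the following sketch applies a single hand-written rule over a dependency parse; it assumes spaCy and its small English model are installed and is not the exact rule set of [21].

```python
# Illustrative rule-based transformation sketch (one rule only).
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")


def declarative_to_question(sentence: str) -> str:
    """Turn a simple 'subject - verb - object' sentence into a wh-question."""
    doc = nlp(sentence)
    root = next(tok for tok in doc if tok.dep_ == "ROOT")
    subj = next((t for t in root.children if t.dep_ in ("nsubj", "nsubjpass")), None)
    obj = next((t for t in root.children if t.dep_ in ("dobj", "attr")), None)
    if subj is None or obj is None:
        return ""  # pattern did not match; a real system would try other rules
    subj_span = " ".join(t.text for t in subj.subtree)
    # Ask about the object: "Marie Curie discovered radium" -> "What did Marie Curie discover?"
    return f"What did {subj_span} {root.lemma_}?"


print(declarative_to_question("Marie Curie discovered radium"))
```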

A mechanism for generating questions from online text for self-learning is proposed in [22]. The authors focus on what to ask about in a given sentence, i.e., the problem of gap selection. For this task, they use articles from Wikipedia and perform key sentence extraction via automatic text summarization [23, 24]. Multiple question/answer pairs are then generated from a single sentence and passed to a question quality classification model. They use [25] as their text summarization model, apply semantic and syntactic constraints via a constituency parser and a semantic role labeler to generate multiple questions from a sentence, and use crowdsourcing to rate question quality. The aggregated ratings, along with a set of features extracted from the source sentence and the generated question, are then given to a classifier that assesses question quality using L2-regularized logistic regression [26]. The features used for training the classifier fall into categories such as token count, lexical, syntactic, semantic, named entity, and Wikipedia link features. Through their experiments, the authors were able to train a classifier that largely agrees with human judgments of question quality.
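A minimal sketch of this kind of question-quality classification step is shown below: hand-crafted features for a (sentence, question) pair are fed to an L2-regularized logistic regression classifier. The feature set and toy data are illustrative, not those of [22].

```python
# Hedged sketch: question-quality classification with L2-regularized logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression


def extract_features(sentence: str, question: str) -> np.ndarray:
    sent_toks, q_toks = sentence.lower().split(), question.lower().split()
    overlap = len(set(sent_toks) & set(q_toks))
    return np.array([
        len(sent_toks),                 # token count of the source sentence
        len(q_toks),                    # token count of the question
        overlap / max(len(q_toks), 1),  # lexical overlap ratio
        float(q_toks[0] in {"what", "who", "when", "where", "why", "how"}),
    ])


# Toy data: 1 = acceptable question, 0 = not acceptable.
pairs = [
    ("Einstein was born in Ulm.", "Where was Einstein born?", 1),
    ("Einstein was born in Ulm.", "Ulm born was where?", 0),
]
X = np.stack([extract_features(s, q) for s, q, _ in pairs])
y = np.array([label for _, _, label in pairs])

clf = LogisticRegression(penalty="l2", C=1.0).fit(X, y)
print(clf.predict(X))
```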

An approach for high-level question generation from text is discussed in [27]. A combined ontology- and crowd-relevance-based technique on the Wikipedia corpus is proposed for this task. The authors first create an ontology of categories and sections, using Freebase to create categories and, for each category, a set of sections. For example, an article about Albert Einstein falls under the category 'Person' and is further segmented into sections like Early_life, Awards, Political_views, etc. The authors then present this ontologically classified data to crowd workers, who generate questions based on a category-section part of the articles. With these generated questions, the authors train two different models. The first finds the category and section of an unseen article segment, using separate logistic regression classifiers for categories and sections. The second is also a classification model, which predicts whether a question is relevant to a section. The authors conclude their experimentation by reporting recall and precision scores on the end-to-end task of generating questions for an article-segment pair given by the user.

A technique that employs natural language understanding (NLU) for generating questions is proposed in [28]. The technique improves the acceptability ratio of the generated questions. In their approach, the authors first examine the pattern of constituent arrangement to understand what a sentence is trying to communicate and to determine the type of question that should be asked for that sentence, as part of the DeconStructure algorithm they propose. The algorithm works in two phases: the deconstruction phase, in which the sentence is parsed by a dependency parser and a semantic role labeling (SRL) parser, and the structure formation phase, in which the outputs of the parses are combined to recognize the clause components and a label is assigned to represent the function of each clause component. After this step, sentence patterns are classified into relevant categories before proceeding to question generation. Question generation is based on matching approximately 60 templates, with the best-matching template used to generate the question. Subsequently, a ranking mechanism based on the TextRank algorithm [29] for keyword extraction is employed to decide whether a question is acceptable, which helps identify the most important questions. The authors found that they were able to improve the acceptability of questions by 71% for the top-ranked questions in comparison with state-of-the-art systems.

A system for generating questions from Turkish biology text is proposed in [30]. For this, a corpus of high school biology textbooks was created and semantically annotated using SRL. SRL proceeds with POS tagging for predicate identification, argument identification following a set of rules, and argument classification utilizing self-training. After SRL, the system performs automatic question generation using a set of templates and rules: templates are tried first, and if no matching template is found, an appropriate rule based on Turkish sentence structure is used to formulate the question.

Comments on rule-based models Table 1 provides a comparative overview of the models used in rule-based techniques for standalone question generation. As seen from the table, most approaches use Wikipedia as their dataset, and the automatic evaluation metrics include F1-score or precision. Evaluation in terms of a well-defined metric is generally weak, although human-based ratings have been explored. These algorithms often rely on extracted features combined with a classifier model: they extract syntactic and semantic parts of the text and make use of templates to construct questions. They are usually applied to narrow target topics where specific types of questions need to be generated; general texts are not converted effectively into questions when rule-based algorithms are used.

Table 1 Summary of rule-based approaches used for SQG

3.1.2 Neural network-based approaches

With the large number of datasets that have become available in recent years, neural approaches have become very popular for automatic question generation. In this section, we discuss the different neural approaches that have been employed for this problem.

Encoder–decoder (sequence-to-sequence) architectures The encoder–decoder architecture was introduced by Google in [31]. It enables end-to-end learning for tasks that take a sequence of tokens as input and produce a sequence of tokens as output, which makes it very convenient for language processing tasks. Because they map input sequences to output sequences, such models are often called sequence-to-sequence models. The architecture consists of two subparts: an encoder, which encodes the input sequence by passing it through a series of recurrent neural network layers, and a decoder, also a series of recurrent layers, which produces the output sequence. A typical encoder–decoder architecture that can be used for SQG is shown in Fig. 2. When used for SQG, the encoder–decoder model takes the passage (and answer) as input and attempts to produce a question similar to the labelled question.

Fig. 2 Encoder–decoder architecture for automatic question generation
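The sketch below illustrates the encoder–decoder idea of Fig. 2 in PyTorch: an LSTM encoder reads the passage tokens and an LSTM decoder emits the question token by token under teacher forcing. The hyperparameters, tensor shapes, and the absence of attention are simplifications rather than any specific published model.

```python
# Minimal PyTorch sketch of a sequence-to-sequence question generation model.
import torch
import torch.nn as nn


class Seq2SeqQG(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=300, hid_dim=600):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, num_layers=2, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hid_dim, num_layers=2, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, passage_ids, question_ids):
        # Encode the passage into a final hidden state.
        _, (h, c) = self.encoder(self.embed(passage_ids))
        # Decode the (teacher-forced) question conditioned on that state.
        dec_out, _ = self.decoder(self.embed(question_ids), (h, c))
        return self.out(dec_out)  # logits over the vocabulary at each step


model = Seq2SeqQG()
passage = torch.randint(0, 10000, (8, 50))   # batch of 8 passages, 50 tokens each
question = torch.randint(0, 10000, (8, 15))  # teacher-forcing question inputs
logits = model(passage, question)
print(logits.shape)  # torch.Size([8, 15, 10000])
```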

In [32], the authors attempt to generate questions from a paragraph of text for machine reading comprehension using an attention-based sequence-to-sequence model. They use an RNN-based encoder–decoder mechanism with two different encoders: the first encodes sentence-level information, while the second encodes combined sentence- and paragraph-level information. Both encoders are attention-based bidirectional LSTM networks. The authors train their model on the SQuAD dataset with randomly generated training, development, and test sets, using pre-trained 300-dimensional GloVe embeddings for word representation. They use two LSTM layers in both the encoder and decoder networks and train them using stochastic gradient descent. For experimental analysis, the authors consider several baseline models, including IR (information retrieval) [33], MOSES+ [34], H&S [35], and a vanilla Seq2Seq model [31], and perform automatic as well as human evaluation. For automatic evaluation, they use the BLEU, METEOR, and ROUGE metrics, and for human evaluation, the naturalness and difficulty of the generated questions are considered. For human evaluation, a set of 100 randomly sampled question–answer pairs was rated by four professional English speakers on a scale from 1 to 5 (5 being best). The authors observed that both their sentence-level and paragraph-level models performed better than the baselines in both automatic and human-based evaluation. However, the paragraph-level model was not the best on all metrics, so using paragraph-level information more effectively remains a direction for future work.

In [35], the authors propose a neural question generation technique that generates a question about a target aspect of an input piece of text. The idea is motivated by the observation that in a typical conversation questions are seldom asked at random and are usually about some relevant aspect. For this purpose, a sequence-to-sequence framework is used, augmented with a pre-decode mechanism to improve performance. The framework also incorporates an aspect and a question type to improve question quality, using an encoder–decoder architecture with separate encoders for the aspect, the question type, and the answer. To derive aspects from a given sentence, the authors use cosine similarity to identify words in the sentence that are semantically similar to words in the question; using a voting mechanism, candidate words are selected as part of the aspect if their average vote exceeds a threshold. After the aspects are extracted, noise is removed using the pre-decode mechanism, and stop words are removed. For question types, the authors distinguish seven categories (yes/no, W4H, and others) and use the corresponding keywords to identify the question type. Bidirectional LSTMs are used to encode the aspect, the question type, and the answer, another LSTM serves as the decoder, and yet another LSTM acts as a noise filter as part of the pre-decode mechanism that cleans the generated aspect. For their experiments, the authors use the Amazon Question/Answer corpus (AQAD), which contains 1.4 million question–answer pairs about Amazon products and services, divided into training, development, and test sets. Both automatic evaluation (BLEU, METEOR, and ROUGE) and human evaluation showed that their model outperformed the baseline model from [32].
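The aspect-selection step described above can be sketched as a cosine-similarity vote between sentence words and question words; the toy embeddings and threshold below are illustrative, and a real system would use pre-trained vectors such as GloVe.

```python
# Hedged sketch of aspect extraction via cosine-similarity voting.
import numpy as np


def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))


def extract_aspect(sentence_tokens, question_tokens, embeddings, threshold=0.5):
    """embeddings: dict mapping a token to its vector (e.g., GloVe)."""
    aspect = []
    for s_tok in sentence_tokens:
        if s_tok not in embeddings:
            continue
        votes = [cosine(embeddings[s_tok], embeddings[q_tok])
                 for q_tok in question_tokens if q_tok in embeddings]
        # Keep the sentence word if, on average, it is close to the question words.
        if votes and sum(votes) / len(votes) > threshold:
            aspect.append(s_tok)
    return aspect


# Toy 2-dimensional embeddings purely for illustration.
emb = {"battery": np.array([1.0, 0.1]), "life": np.array([0.9, 0.2]),
       "phone": np.array([0.2, 1.0]), "long": np.array([0.8, 0.3])}
print(extract_aspect(["the", "phone", "battery", "life"],
                     ["how", "long", "battery"], emb))
# -> ['battery', 'life']
```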

In [36], question–answer pairs in natural language are generated from a knowledge graph using an RNN-based model. In this approach, a set of keywords is first extracted from the knowledge graph, and a subset of these keywords is then used to generate questions through a sequence-to-sequence RNN. An encoder–decoder architecture is used in which a bidirectional RNN with a hidden layer of 1000 neurons serves as the encoder, and the decoder is constructed similarly. One million questions extracted from WikiAnswers form the dataset for model training. The framework consists of two modules: the first extracts knowledge about the entities from the knowledge graph and is language-independent, and the second generates natural language questions from the question keywords using an RNN. The RNN model achieved higher BLEU-4 scores than the compared baselines, phrase-based machine translation [34] and a template-based method [37].

Generative adversarial network-based approaches Generative adversarial networks (GANs) were introduced in [38]. This class of deep neural networks uses two different networks, a generator and a discriminator, which compete against each other in an adversarial set-up. The role of the generator is to produce samples from the problem domain that are close enough to the labelled data that the discriminator cannot tell whether a sample is fake or real. A typical architecture for a question-generating GAN is shown in Fig. 3. The generator tries to generate fake questions similar to the labelled question for the given answers, while the discriminator tries to correctly identify whether a generated question is fake or real. Training stops when the generator is able to fool the discriminator by generating fake questions that are close enough to the original labelled questions.

Fig. 3 GAN-based general architecture for automatic question generation
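The following PyTorch sketch shows the adversarial set-up of Fig. 3 in its simplest form. For brevity, the generator emits a continuous question representation rather than discrete tokens; the text GANs in the surveyed papers additionally need policy gradients or similar techniques to handle discrete outputs, so this is only an illustration of the training loop, not any specific model.

```python
# Simplified adversarial training loop for question generation (continuous sketch).
import torch
import torch.nn as nn

ctx_dim, q_dim = 128, 128
G = nn.Sequential(nn.Linear(ctx_dim, 256), nn.ReLU(), nn.Linear(256, q_dim))
D = nn.Sequential(nn.Linear(ctx_dim + q_dim, 256), nn.ReLU(),
                  nn.Linear(256, 1), nn.Sigmoid())
bce = nn.BCELoss()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)

for step in range(100):
    ctx = torch.randn(32, ctx_dim)     # encoded passage/answer (placeholder)
    real_q = torch.randn(32, q_dim)    # encoded ground-truth question (placeholder)

    # Discriminator: score real questions as 1, generated questions as 0.
    fake_q = G(ctx).detach()
    d_loss = bce(D(torch.cat([ctx, real_q], dim=1)), torch.ones(32, 1)) + \
             bce(D(torch.cat([ctx, fake_q], dim=1)), torch.zeros(32, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator: try to fool the discriminator into scoring its questions as real.
    g_loss = bce(D(torch.cat([ctx, G(ctx)], dim=1)), torch.ones(32, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```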

The problem of fill-in-the-blank (FITB) question generation is dealt with in [39], where generative adversarial networks are used to create plausible distractors. A FITB question consists of the sentence, the key (correct answer), and the distractor answers. The authors attempt to generate distractors given a sentence and the key. In this approach, the generator of the GAN captures the real data (key) distribution for a given question sentence, and the discriminator tries to estimate whether the key came from the actual data (real) or from the generator (fake). The model is trained on the biology subset of the Wikipedia corpus. The proposed method performed better than existing similarity-based methods.

The introduction of variability in the generated questions and the prediction of question type are incorporated in the GAN framework discussed in [40]. The GAN model accounts for variability using a latent variable, and its discriminator both evaluates the genuineness of the question and predicts the question type. The generator is an encoder–decoder network based on conditional variational autoencoders [41]. The discriminator is modified to act as a classifier over the question types (WHO, WHAT, WHICH, HOW, WHEN, and OTHER) in addition to distinguishing real questions from fake ones. Experiments were conducted on the SQuAD dataset [42], and several variants of the model were compared against baselines from [43] and [44]. Automatic evaluation on BLEU, METEOR, and ROUGE scores was performed along with human-based evaluation, and the proposed model outperformed the baselines in both types of evaluation.

In [45], the authors address the problem of generating questions for a specific domain in the absence of labeled data. For this, they propose a new model based on doubly adversarial networks, which is trained using ground-truth-labeled data from one domain and unlabeled data from the target domain. They used the SQuAD [46] dataset as unlabeled data and the NewsQA [47] dataset as labeled data. Their experiments showed that the model gives better results than existing methods.

In [48], an attempt is made to generate clarification questions on text to extract useful information that captures the context of the text. In this GAN-based approach, the generator is a sequence-to-sequence model that first generates a question on the basis of a context and then generates a hypothetical answer to that question. The question, answer, and context triplet is then given to the discriminator, which uses a utility-based function to compute the usefulness of the question. The evaluation was performed on two datasets: the first combines the Amazon question answering dataset [49] and the Amazon reviews dataset [50], and the second is the Stack Exchange dataset [51] curated from stackexchange.com. The baseline was a Lucene-based information retrieval model from [49], which was compared against several variants of the proposed model, namely GAN-Utility, MaxUtility, and MLE. The models were evaluated using both automatic metrics (BLEU, METEOR, and Diversity [52]) and human evaluation based on relevance, grammar, seeking new information, usefulness, and specificity. Adversarial training was observed to produce better results than both MLE and the reference models.

Deep reinforcement learning architectures In [53], the authors create a framework for generating intelligent questions in the context of conversational systems. Extending the dominant neural approaches, they build a model based on deep reinforcement learning for question generation. The end-to-end model consists of a generator and an evaluator. The generator, which attends to the question's semantics and structure, uses a pointer network to identify the target answers, a copy mechanism to retain contextually important keywords, and a coverage mechanism to remove redundancy in the sentences. The evaluator performs direct optimization based on sentence structure using BLEU, GLEU, and similar scores, and matches the generated questions against an appropriate set of ground-truth sentences. The authors also introduce two new reward functions for evaluating the quality of the generated questions, namely the Question Sentence Overlap Score (QSS) and the Predicted and Encoded Answer Overlap Score (ANSS). They conducted their experiments on the publicly available SQuAD dataset and compared their results with two state-of-the-art QG baselines, L2A [32] and AutoQG [54]. The baselines were compared against eight variants of their model using standard automatic metrics (BLEU, ROUGE-L, and METEOR) as well as human evaluation of syntactic correctness, semantic correctness, and relevance. For human evaluation, 100 randomly selected sentence-question pairs were presented to three judges, who gave a binary response on each quality parameter (syntax, semantics, and relevance); the responses were averaged per parameter for each model. The authors observed that the model variant combining ROUGE, QSS, and ANSS outperformed the two state-of-the-art baselines on automatic evaluation, while in human evaluation the variant using DAS, QSS, and ANSS was best on syntactic and semantic correctness and the one using BLEU, QSS, and ANSS gave the best relevance. The authors conclude that the QG-specific reward metrics they proposed, QSS and ANSS, improved the model significantly and outperformed state-of-the-art methods.
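To illustrate the flavor of such overlap-based rewards, the sketch below approximates both QSS and ANSS as simple unigram-overlap scores; the exact formulations in [53] differ, and the predicted answer would come from a separate QA component rather than being passed in directly.

```python
# Hedged sketch of QSS/ANSS-style overlap rewards (approximation, not the paper's formulas).
def overlap_score(generated: str, reference: str) -> float:
    gen, ref = set(generated.lower().split()), set(reference.lower().split())
    return len(gen & ref) / max(len(gen), 1)


def reward(question: str, source_sentence: str, target_answer: str,
           predicted_answer: str, w_qss: float = 0.5, w_anss: float = 0.5) -> float:
    qss = overlap_score(question, source_sentence)         # question-sentence overlap
    anss = overlap_score(predicted_answer, target_answer)  # predicted vs. target answer overlap
    return w_qss * qss + w_anss * anss


print(reward("what did marie curie discover",
             "marie curie discovered radium in 1898",
             target_answer="radium", predicted_answer="radium"))
```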

In [55], the authors propose refining ill-formed generated questions into well-formed ones by using a reward-based mechanism with reinforcement learning to train a deep learning model. The rewards use the wording of the question as a short-term reward and the correlation between the question and the answer as a long-term reward. They also employ character embeddings and BERT-based embeddings to enrich the representation for question generation. The authors conclude that their approach can produce comparatively readable questions.

A graph-to-sequence architecture guided by deep reinforcement learning is proposed in [56]. The authors make use of a gated bidirectional neural network architecture and train it with a hybrid loss combining cross-entropy and a reinforcement learning objective. They also incorporate answer information in the training process. Two variants of their model, one using syntactic information from the passage to construct a static graph and the other using a semantics-based dynamic graph, are compared with several state-of-the-art question generation models. Evaluated on the SQuAD dataset, their model outperformed the earlier state-of-the-art models on automatic evaluation metrics as well as human evaluation by a substantial margin.

Joint question answering-question generation approaches Some approaches make use of a joint question answering and question generation approach for automatic question generation.

For instance, in [57], the approach used is one where a question is both asked and answered. The model was trained on the joint task of question answering and question generation, yielding a substantial improvement in performance on the SQuAD dataset. An attention-based sequence-to-sequence model was used, with a binary signal to set the learning mode to answer generation or question generation. The joint model was compared to a QA-only model and gave better results.

In another approach to joint QA-QG, the correlation between the QA and QG tasks is exploited to improve model performance. A sequence-to-sequence model is used for QG and a recurrent neural network for QA, and the results are compared on three datasets: MARCO, SQuAD, and WikiQA. The QA model is a bidirectional RNN that uses word embeddings to represent its inputs, the question and a list of candidate answers, and predicts the best answer among the candidates. The input to the QG model is an answer, and its goal is to generate a relevant question. The QG model follows an encoder–decoder approach in which the answer is first encoded and the decoder then generates a question based on the answer representation. The two models are trained jointly to improve the overall performance of both QA and QG on the datasets used [58].

Transformer-based approaches Several approaches based on transformers [59] have been experimented with recently.

In [60], the task of question generation from passages was attempted on the SQuAD dataset using transformers. Word error rate (WER) was used as the metric for comparing the generated questions with the target questions. The authors observed that the generated questions were correct syntactically and were of relevance to the passage. WER was low for the shorter questions, while it increased for longer questions.

In [61], the transformer model is further improved to generate questions on the SQuAD dataset. ELMo (Embeddings from Language Models) representations [62] are employed for the tokens. A placeholding strategy for named entities and a copying mechanism are employed in the different model variants. The models were evaluated with automatic metrics (BLEU, ROUGE) as well as human evaluation of correctness, fluency, soundness, answerability, and relevance; the model employing ELMo + placeholding + copy mechanism gave the best results on SQuAD.

A single pre-trained transformer-based model is used for generating questions from text in [63]. In particular, the authors use the smallest variant of the GPT-2 model [64], fine-tuned for their task. The model depends purely on the context, so answer labeling is not required. Evaluated on the automatic metrics BLEU, METEOR, and ROUGE, it gives average results, which the simplicity of the model accounts for. With more processing resources, larger GPT-2 models, and further tuning, the model may achieve better evaluation scores.

In [65], the authors combine the transformer-based decoder of the GPT-2 model [66] with the transformer encoder of BERT [67]. They train their model on the SQuAD dataset using a joint question answering-generation approach, with each network (encoder and decoder) trained individually for answering and generating questions, respectively. The model is evaluated on quantitative and qualitative metrics, and the authors also propose BLEU QA as a surrogate metric for assessing question quality. Using the proposed semi-supervised approach, the model produces good quality questions with high semantic similarity to the ground-truth answers.

A recurrent BERT-based model is explored in [68]. In their approach, the authors use a BERT model as an encoder and another BERT model as the decoder to generate questions using the SQuAD dataset. In comparison with other models using standard evaluation metrics, their model gave better results on both sentence-level and paragraph-level question generation.

In [69], an attempt is made to handle multiple question types with a single architecture based on pre-trained transformers, namely T5 (text-to-text transfer transformer) and BART (bidirectional and auto-regressive transformers). The authors consider extractive, abstractive, MCQ, and yes-no question types, combining various datasets comprising such questions. They fine-tune the T5 and BART models on their combined dataset, containing passages of text from nine existing datasets. Evaluation of their unified model on both automatic metrics and qualitative parameters gave state-of-the-art results.
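A minimal sketch of using a pre-trained text-to-text transformer for question generation with the HuggingFace transformers library is given below. The base t5-small checkpoint is not fine-tuned for QG, so meaningful output would require fine-tuning first, and the "answer: ... context: ..." prompt format is one common convention rather than a fixed API.

```python
# Hedged sketch: question generation with a pre-trained T5 model (fine-tuning assumed).
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

text = ("answer: radium  "
        "context: Marie Curie discovered radium in 1898 while studying uranium ores.")
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=32, num_beams=4, early_stopping=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```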

Comments on neural-based models Table 2 gives a comparative overview of the models used in neural network-based question generation. The features of the models are listed, and the automatic evaluation scores using the standard metrics of BLEU, METEOR, ROUGE, and others are compared; the models that performed better have their automatic scores highlighted in bold. Adding more information helps: for example, an RNN combined with knowledge graphs performs better than a purely RNN-based approach. In several cases, reinforcement learning, which employs a reward-based training mechanism, also shows promising results. Moreover, the more advanced transformer models such as GPT-2 and BERT gave the best results on the most frequently used dataset (SQuAD).

Table 2 Summary of neural network-based approaches for SQG

3.2 Visual question generation

Such systems are useful as an alternative to image captioning. In image captioning, the goal is to generate a description of the objects seen in an image; visual question generation (VQG), on the other hand, pursues a similar goal by generating questions about the objects in the image.

3.2.1 Methodologies used for VQG

Most of the techniques used for VQG employ neural network architectures in different ways. The task of visual question generation is introduced in [72]. The purpose of VQG is to generate questions that are natural and engaging for the user to answer. The authors also create three datasets for this purpose, ranging from object-centric to event-centric images: one with 5000 images from the MS COCO dataset [73], one with 5000 images from Flickr [74], and a third curated from the Bing search engine queried with 1200 event-centric terms. Together, the three datasets cover a wide range of visual concepts and events. The authors further present different retrieval and generative model architectures for the VQG task. Among the generative models, a maximum entropy language model (MELM) [31, 75, 76], a machine translation (MT) sequence-to-sequence model, and gated recurrent neural networks (GRNN) derived from [77, 78] were constructed and evaluated. Among the retrieval-based models, different variants of the K-nearest neighbor (KNN) model were created and evaluated. Automatic evaluation was performed using BLEU and METEOR, and human evaluation was crowdsourced to three crowd workers who rated the semantic quality of the generated questions on a scale of 1 to 3. This was the first paper to discuss the VQG task; it released three public datasets for the research community and discussed various architectures that could be used to train such models.
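The generative GRNN-style idea can be sketched as a CNN image encoder whose output initializes a GRU decoder over question tokens; the sketch below uses a torchvision ResNet with illustrative dimensions rather than the exact architecture of [72].

```python
# Simplified PyTorch sketch of a CNN-encoder / GRU-decoder VQG model.
import torch
import torch.nn as nn
from torchvision.models import resnet18


class VQGModel(nn.Module):
    def __init__(self, vocab_size=8000, emb_dim=256, hid_dim=512):
        super().__init__()
        cnn = resnet18(weights=None)  # pre-trained weights would normally be used
        cnn.fc = nn.Linear(cnn.fc.in_features, hid_dim)
        self.cnn = cnn
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, images, question_ids):
        h0 = self.cnn(images).unsqueeze(0)        # (1, batch, hid_dim) initial state
        dec_out, _ = self.gru(self.embed(question_ids), h0)
        return self.out(dec_out)                  # per-step vocabulary logits


model = VQGModel()
images = torch.randn(4, 3, 224, 224)
questions = torch.randint(0, 8000, (4, 12))
print(model(images, questions).shape)  # torch.Size([4, 12, 8000])
```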

In [79], the problem of generating goal-centric questions about images is addressed using a deep reinforcement learning approach, applied to the goal-oriented game GuessWhat?!. In their approach, the agent is prompted to ask multiple questions with informative answers until the goal is achieved. Three reward functions that compute intermediate rewards are proposed: a goal-achieved reward for reaching the final goal, a progressive reward that ensures every new question asked by the agent moves it closer to the goal, and an 'informativeness' reward that checks that the agent is not asking useless questions. Using the GuessWhat?! dataset, the authors create several variants of their model with different combinations of the reward functions and compare them with [80] and Soler [81]. The variant with all three rewards surpasses all the compared models, and a human evaluation comparing all variants also showed that their model performed best.

The generation of good informative questions is tackled in [82] by maximizing the mutual information between the generated question, the image, and the labelled answer. The authors use a latent space formed by embedding the target answer and the image, with a variational autoencoder [83] used to reconstruct them. A second latent space is set up that encodes only the image and the answer category; thus, the need for having an answer is eliminated. Using VQA [84] as their dataset, they compare their model against several baselines and observe that it outperforms all the other approaches considered.

In [85], the authors add an exemplar module to an existing deep learning framework for the tasks of VQA and VQG. Two exemplar variants, attention-based and fused, are used for classification. The VQA and VQA2 datasets were used to test the models on the VQA task, while the VQA and VQG-COCO [73] datasets were used for the question generation task. Across the various model variants, they observed improved performance over state-of-the-art methods on VQA and VQG on standard automatic metrics.

Visual question generation in the presence of visual question answering as a dual task is experimented with in [86]. The framework makes use of inverted MUTAN (Multimodal Tucker Fusion for Visual Question Answering) and attention in its design. The experiments, performed on the CLEVR and VQA2 datasets, show that the dual training framework gives better performance than existing methods.

In [87], an attempt is made to guide the VQG system towards generating questions based on objects and categories. The authors employ three distinct model architectures: explicit, implicit, and variational implicit, with experiments on the VQA dataset. In the explicit model, the image is first labeled with objects using an object detection model, and an image captioning model generates captions; these are given to an actor that randomly samples candidates according to the category, and the combination is passed to a text encoder. An image encoder encodes the image features, and the outputs of the image and text encoders are combined in the decoder to generate questions. In implicit guiding, only the image is used as input, and a classifier network tries to predict the category and objects. A variational-encoder-based implicit guiding variant is also explored, where a generative encoder and a variational encoder together produce a discrete vector that is fed into the decoder. The experiments show an improvement over several VQG metrics.

Other visual question generation approaches include using a human in the loop where VQG is used for asking questions to users and collecting their responses to build a dataset for visual question answering [88], use of reinforcement learning along with bi-discriminators using generative adversarial networks [89] and category-wise question generation using latent clustering [90].

Comments on models used for VQG Table 3 gives a comparative overview of the models used in visual question generation. The features of the models are listed, and the automatic evaluation scores using the standard metrics of BLEU-4, METEOR, CIDEr, and others are compared; the models that performed better on various metrics have their scores highlighted in bold. For extracting image features, ResNet variants are used in most techniques, and attention-based models give good results. We observe that reinforcement learning with bi-discriminators provides the best scores on the VQA dataset.

Table 3 Summary of approaches for VQG

3.3 Conversational question generation

The primary objective of a conversational question generation model is to generate questions that are rich in the context of the conversation. The main idea is to generate a series of questions that maintain the conversation. In these systems, care must be taken that the conversation does not get stuck in a loop or become monotonous. The primary use-case of such systems is conversational chatbots.

3.3.1 Methodologies used for CQG

Significant research has been carried out on conversational question generation systems. In [91], a neural-network-based approach is proposed that makes use of coreference alignment while maintaining conversational flow, ensuring that questions generated in consecutive turns are related to each other based on the conversation history. A multi-source encoder along with an attention- and copy-based decoder is employed for this task. The experiments were performed on the CoQA dataset [92] and compared against various baseline models and several ablations of existing models [93, 94] used within the proposed model.

An encoder–decoder-based architecture employing a dynamic reasoning technique using reinforcement learning is explored in [95]. The authors attempt to generate the next question based on the previous few questions on a given passage from the CoQA dataset. They also test their trained model on the SQuAD dataset for multi-turn question–answer-based conversations. The experimental analysis proves that the model gives better results than several compared baselines using automatic and human-based evaluation metrics.

Answer-unaware conversational question generation is explored in [96]. The proposed framework comprises three parts. The first is question focus estimation, which decides which part of the context to focus on for the next question. The second is identification of the question pattern, done via either question generation or classification. These two parts are given as input to the encoder of the proposed model, and the third part, question decoding, is performed by the decoder. The experiments were performed on the CoQA dataset and evaluated using BLEU scores, although the authors suggest developing new metrics for conversational question generation because of the weakness of existing automatic metrics.

An approach that uses question classification based on Lehnert's scheme [97] to tag questions, later used in a conditional neural network-based model, is presented in [98]. The authors introduce a new task called SQUASH (Specificity-controlled Question Answer Hierarchies), which converts text into a hierarchy of question–answer sets that starts with broader "high-level" questions and proceeds to more refined questions down the hierarchy. The authors test their pipeline on three datasets (SQuAD, QuAC, CoQA) and obtain promising results.

The problem of generating informative questions is addressed in [99]. To generate information-seeking questions in the context of a conversation, the authors use reinforcement learning to optimize the information-seeking content. The architecture consists of two parts: the automatic question generation model and a model measuring the informativeness and specificity of the generated question. The experimental analysis is modelled as a teacher–student communication game. A sequence-to-sequence model is used, with the encoder containing the representation of the topic of interest shared between the student and teacher, and the decoder generating the conversational question. Informativeness is measured by how much additional information a student's answer provides that was not present in the history of the conversation so far. In addition to the informativeness metric, the authors propose a specificity reward obtained by training a classifier to distinguish positive from negative samples (in terms of how far a question would diverge from the current topic). The experiments are performed on the conversational QuAC dataset, and the combined informativeness and specificity metrics help direct the conversation towards rationally relevant questions.

An architecture employing flow-propagation-based learning for generating conversational questions is discussed in [2]. In this work, a question is generated based on a given passage, a target answer, and the history of dialogue that has occurred before the current turn in a multi-turn dialogue set-up. The GPT-2 model [66] is used for encoding the answer and for question generation. The authors introduce a flow-propagation-based training mechanism that considers the losses accumulated after n turns in a dialogue sequence, thus improving the flow of conversation. The model outperforms several baselines, including the T5 model and BART-large.
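As a rough sketch of how a GPT-2-based conversational QG model might be fed, the example below concatenates the passage, dialogue history, and target answer into a single prompt. The separators and layout are illustrative and not the exact format of [2], and the base gpt2 checkpoint would need fine-tuning to produce sensible questions.

```python
# Hedged sketch: prompting a GPT-2 model for conversational question generation.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

passage = "Marie Curie discovered radium in 1898 and won two Nobel Prizes."
history = ["Q: Who discovered radium?", "A: Marie Curie"]
answer = "two"

prompt = (f"passage: {passage} history: {' '.join(history)} "
          f"answer: {answer} question:")
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20, num_beams=4,
                         pad_token_id=tokenizer.eos_token_id)
# Decode only the newly generated continuation (the candidate question).
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```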

Comments on models for CQG Table 4 gives a comparative overview of the models used in conversational question generation. The features of the models are listed, and the automatic evaluation scores using the standard metrics of BLEU-4, METEOR, ROUGE, and others are compared. In general, encoder–decoder models give decent results when combined with mechanisms such as coreference alignment, multiple encoders for representing the text involved, or classifiers to improve QG, while reinforcement learning boosts performance when goal-based QG is targeted. The use of transformer-based architectures such as GPT-2 gives the most promising results, because transformers are much better at modeling long sequences than encoder–decoder architectures based on recurrent networks. The models that performed better have their automatic scores highlighted in bold. We observe that GPT-2-based models gave the best results on CoQA, the most used dataset.

Table 4 Summary of approaches for CQG

3.4 Summary of approaches for AQG

We summarize the use-case-based question generation classification in terms of the preprocessing techniques and the methodologies in Fig. 4.

Fig. 4 Summary of automatic question generation techniques

In Table 5, we list the different models used for each use-case of question generation. We also list the strengths and weaknesses of each model and identify the research gaps in them.

Table 5 Overview of techniques used in AQG

4 Evaluation techniques for question quality

The questions generated by a model must be evaluated for quality so that they serve the purpose for which they were generated. Broadly, there are two types of evaluation techniques: automatic evaluation and human-based evaluation. We describe them briefly in this section.

4.1 Automatic evaluation

Several metrics are available for evaluating the output of language generation systems. These metrics can be used to check the closeness of the machine-generated questions to the reference questions. Most automatic metrics build on two basic scores, precision and recall: precision is a measure of specificity, while recall is a measure of sensitivity. A few of the popular metrics are discussed below; a minimal sketch of computing two of them is given after the list.

  • BLEU (BiLingual Evaluation Understudy Score)

In this metric, modified n-gram precision and the best match length are used. n-gram precision is the fraction of n-grams in the candidate text found in one or more of the ground-truth (reference) texts. BLEU modifies this quantity by clipping the count of each n-gram to the maximum number of times it occurs in any reference text. The best match length accounts for sensitivity to the reference lengths: candidates shorter than the references are penalized by a multiplicative brevity factor [100].

  • METEOR (Metric for Evaluation of Translation with Explicit ORdering)

METEOR is an alternative metric for evaluating machine-translated text, designed to correlate better with human judgment. It addresses a drawback of BLEU, whose length statistics are computed over the whole corpus and which therefore scores individual sentences poorly. METEOR instead uses a weighted F-score over unigram matches, together with a penalty function for wrongly ordered words [101].
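
In its original formulation [101], with P and R denoting unigram precision and recall of the alignment and "chunks" counting contiguous matched fragments, the score can be summarized as follows (later METEOR versions make these constants tunable):

$$F_{\text{mean}} = \frac{10\,P\,R}{R + 9P}, \qquad \text{Penalty} = 0.5\left(\frac{\#\text{chunks}}{\#\text{matched unigrams}}\right)^{3}, \qquad \text{METEOR} = F_{\text{mean}}\,(1 - \text{Penalty})$$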

  • ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

ROUGE is an automatic evaluation metric that is primarily recall-oriented. It is commonly used for the evaluation of text summaries. Its variants differ in the feature used to compute recall: ROUGE-N (n-grams), ROUGE-L (longest common subsequence statistics), ROUGE-W (weighted longest common subsequence statistics), and ROUGE-S (skip-bigram co-occurrence) [102].
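
As an illustration, the self-contained sketch below computes ROUGE-L as an F-measure over the longest common subsequence, following the definition in [102]; production implementations add stemming and other refinements.

```python
# Minimal ROUGE-L sketch: F-measure over the longest common subsequence (LCS)
# of candidate and reference token sequences.
def lcs_length(a, b):
    """Length of the longest common subsequence of token lists a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference, beta=1.2):
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)

print(rouge_l("what is the capital of france", "what city is the capital of france"))
```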

  • CIDEr (Consensus-based Image Description Evaluation)

CIDEr is an automatic metric proposed for evaluating the quality of image descriptions. The model-generated sentence is compared with a set of human-written ground-truth sentences. The metric was introduced together with two datasets, PASCAL-50S and ABSTRACT-50S, which provide up to 50 reference sentences per image instead of the usual 5. CIDEr measures how similar a model-generated sentence is to the consensus of the ground-truth sentences for that image, where consensus reflects how often the majority of the sentences describing an image agree. The authors further claim that grammaticality, importance, accuracy and saliency are captured inherently by the metric [103].
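
Schematically, CIDEr computes the average cosine similarity between TF-IDF-weighted n-gram vectors $g^{n}(\cdot)$ of the candidate sentence $c$ and the $m$ reference sentences $s_j$, and then averages over n-gram orders (typically with uniform weights $w_n = 1/N$):

$$\text{CIDEr}_n(c, S) = \frac{1}{m}\sum_{j=1}^{m}\frac{g^{n}(c)\cdot g^{n}(s_j)}{\lVert g^{n}(c)\rVert\,\lVert g^{n}(s_j)\rVert}, \qquad \text{CIDEr}(c, S) = \sum_{n=1}^{N} w_n\,\text{CIDEr}_n(c, S)$$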

Among the automatic metrics, BLEU, METEOR and ROUGE are the most favored for standalone QG and conversational QG. BLEU is also used frequently for visual QG, but for that use-case METEOR and ROUGE have largely been replaced by CIDEr.

4.2 Human-based evaluation

It is observed that most of the automatic techniques used to evaluate the performance of AQG systems are not an effective measure of the quality of the generated questions. Hence, evaluation is also performed using human judges.

An approach in which three crowd-workers rate questions on a scale of 1 to 5 (5 being best) on two parameters, fluency and relevance, is employed in [12, 94]. Other approaches additionally use naturalness [32, 35] and difficulty [35].

In [104], human evaluators were asked to rate the quality of the generated questions based on three factors, syntactic correctness, semantic correctness, and relevance.

4.3 Other evaluation techniques

Over the years, different techniques have been proposed for evaluating the generated questions. One notable contribution is [105], where the authors use human judgments of the answerability of a question to modify existing automatic evaluation metrics so that they account for relevant words, question types, function words, and named entities. The final metric is a weighted combination of the precision and recall of these components, and the authors show that it correlates better with human evaluation scores than the existing metrics.
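
Schematically, and with notation that is illustrative rather than the exact weighting scheme of [105], the combination can be written along the lines of

$$\text{Answerability} = \sum_{t} w_t\,\frac{P_t + R_t}{2}, \qquad \text{Score} = \delta\cdot\text{Answerability} + (1-\delta)\cdot\text{Metric},$$

where t ranges over relevant words, question types, function words and named entities, $P_t$ and $R_t$ are the precision and recall of component t, and Metric is an existing automatic score such as BLEU.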

Figures 5a and b represent the automatic and human-based evaluation metrics used in SQG, VQG and CQG in the research reviewed in this survey.

Fig. 5

a Automatic evaluation metrics for question generation. b Summary of human-based evaluation metrics for question generation

5 Datasets

As a result of constant efforts in this direction, many open datasets have been created to support research in question generation. The choice of dataset depends on various factors, such as whether the question generation is closed domain or open domain, and whether the questions to be generated are independent (one at a time) or related, as in a conversational system.

We categorize the datasets based on the use-cases. For SQG, there are several benchmark datasets in both open-domain and closed-domain QG. Depending on the cognitive level of the questions, one can choose shallow datasets such as SQuAD and NewsQA [46, 47] or deep cognitive-level datasets such as LearningQ and NarrativeQA [106, 107].

For CQG, conversational datasets are used, in which question generation takes place in the form of a conversation of questions and answers. Depending on the application and the type of questions to be generated, appropriate datasets can be chosen.

VQG datasets consist of images, and the questions asked are based on the objects/scenes depicted in them. The majority of these datasets focus on the recognition of objects and scenes, and the questions range from MCQs to yes/no and short-answer questions.

Most datasets are curated by crowdsourcing, and the number of training examples is sufficiently large in datasets such as SQuAD and NewsQA. Some datasets are open domain, e.g., general English passages curated from Wikipedia, while closed-domain datasets use news (NewsQA, DeepMind), educational content (RACE, NarrativeQA), or search queries (WikiQA, MS MARCO).

The details of datasets used for question generation are summarized in Tables 6, 7, and 8.

Table 6 Summary of SQG datasets for AQG systems
Table 7 Summary of CQG datasets for AQG systems
Table 8 Summary of VQG datasets for AQG Systems

We select the most commonly used benchmark dataset for each use-case among those that we surveyed and list the various models used along with the work done in Table 9.

Table 9 Overview of benchmark datasets: models and work done

6 Challenges and future directions

Although combining deep learning techniques with natural language processing has brought tremendous improvements to question generation systems, a few challenges remain. We discuss these challenges and possible future directions below.

6.1 Challenges in AQG

We identify and discuss the various challenges in AQG in this section.

6.1.1 Quality of questions

Generating questions with proper syntax has been accomplished by employing a language model in the loop and using similar language-related features. However, the generated questions are still lacking to some extent in semantics and relevance, as reported through human evaluation in most studies [32, 35, 36, 39, 56, 61]. Generating meaningful questions also remains a challenge, as most existing techniques focus on syntactic aspects rather than on questions that elicit information. Syntax plays an important part, but generating questions that are meaningful for extracting information [99] is an important requirement in many applications and must be explored extensively.

6.1.2 Types of questions

The types of questions to be generated range from the typical short answer-span questions to questions generated from multiple spans of text [42, 47, 106, 107, 112]. In the case of images, most systems try to identify objects placed in various scenes; this task has been solved to quite an extent, although the models able to extract meaningful information from an image are limited to trivial use-cases (refer Table 7). Another direction in which research could progress is generating relevant questions for a text given a topic. Although a few approaches have looked at topic-based questions [99], no existing approach completely solves the problem.

6.1.3 Datasets challenge

Most datasets currently available for training question generation systems are crowd-sourced [42, 47, 84, 92, 114, 115, 119, 120, 122, 123], which largely impacts the quality of generated questions. Also, many of the datasets are very generic rather than tailored to specific domains (refer Tables 6, 7, 8). Domain-specific datasets must be created with the quality of their content in mind. It is also worth noting that only a few datasets are available for conversational question generation, and as a result not much work has been done on this use-case. Building such datasets would be a valuable contribution to the research community, providing models with high-quality data for specific purposes.

6.1.4 Metrics challenge

Another area to work towards is building metrics for a thorough evaluation of the generated questions. Although standard text-generation metrics such as BLEU, METEOR, ROUGE and CIDEr can be used for automatic evaluation of generated questions, more relevant metrics that include other factors, such as the naturalness of the language used and the weightage given to the syntax and grammar of the generated question, can be experimented with. Some research work addresses such factors through human evaluation [2, 28, 31, 32, 35, 43, 57, 65, 67, 75, 84-86, 91-93], but we need efficient metrics that automatically estimate the quality of the generated questions, thus eliminating the need for human evaluation.

6.2 Future research directions

6.2.1 Transfer learning

Most of the datasets available for training are curated from open-domain sources such as Wikipedia, Reddit and other social platforms. Some works have recently focused on transferring training from one domain to another. For instance, in [124], transfer learning is performed by training on non-educational datasets such as SQuAD and NQA (Natural Questions), and evaluation is performed on an author-curated dataset called TQA-A with questions based on educational text and tagged with answers. Several pre-trained BERT-based models were explored, and answer selection was investigated. The study found a significant difference in how answers are selected between educational and non-educational question generation. Transfer learning helps when training data is scarce and pre-trained models exist for similar datasets. With several such pre-trained architectures available, it is very useful to employ transfer learning for question generation.
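
As an illustration of this kind of two-stage set-up (and not the exact pipeline of [124], which focuses on BERT-based answer selection), the sketch below fine-tunes a pre-trained seq2seq model on an open-domain QG dataset and then adapts it to a smaller domain-specific dataset using Hugging Face Transformers. The dataset objects and their column names are assumptions made for the example.

```python
# Illustrative transfer-learning sketch for QG: stage 1 trains on an
# open-domain dataset (e.g., SQuAD-style triples), stage 2 adapts to a small
# domain dataset. `squad_qg` and `domain_qg` are assumed datasets.Dataset
# objects with "context", "answer" and "question" columns.
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def preprocess(batch):
    # Answer-aware input text; the reference question is the target.
    inputs = [f"generate question: answer: {a} context: {c}"
              for a, c in zip(batch["answer"], batch["context"])]
    enc = tokenizer(inputs, truncation=True, max_length=512)
    enc["labels"] = tokenizer(batch["question"], truncation=True,
                              max_length=64)["input_ids"]
    return enc

# A DataCollatorForSeq2Seq would handle padding in a full set-up.
# trainer = Seq2SeqTrainer(model=model,
#                          args=Seq2SeqTrainingArguments("qg-transfer"),
#                          train_dataset=squad_qg.map(preprocess, batched=True))
# trainer.train()                                  # stage 1: open domain
# trainer.train_dataset = domain_qg.map(preprocess, batched=True)
# trainer.train()                                  # stage 2: target domain
```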

6.2.2 Creating corpora of high quality

Most of the datasets for question generation are either crowd-sourced or borrowed from open communities such as Wikipedia and Reddit, and are therefore not well suited to domain-based question generation. Specialized domains such as education and medicine have requirements beyond merely generating relevant questions. For instance, the educational domain may require generating questions at a specific cognitive level or mapped to a specific category, while the medical domain may require questions aimed at correct diagnosis. One effort worthy of mention is [125], where the authors curated a dataset from discharge summaries with the help of 10 medical-domain experts, constructing over 2000 questions related to diagnosis. After analyzing the types of questions, they trained pre-trained transformer models on the dataset and achieved promising results. Hence, high-quality domain-based corpora need to be created, and current approaches could be further improved by working in this direction.

6.2.3 Multimodality for QG

Visual question generation directs research in question generation towards multimodality. Multimodal question generation uses different input modalities, including text, images, and videos, to generate questions. With multiple modalities, more real-world applications can be targeted, including generating questions based on pictures and diagrams in the educational domain [126] and helping the visually challenged identify objects around them [127]. Video QG is an emerging area with some recent works. In [128], joint QA-QG from videos is investigated using an architecture consisting of a generator, which produces a question given a video clip and an answer, and a pre-tester, which tries to answer the generated question. A video encoder extracts features from 20 frames of each video, which are then encoded using Faster R-CNN and ResNet-101. The approach obtained promising results on the TVQA [129] and ActivityNet-QA [130] datasets.

6.2.4 Working on QG centric metrics

Metrics for QG are essential to gauge how meaningful the generated questions are. The metrics currently used to evaluate generated questions are the standard ones for general text generation; QG-specific metrics should also account for grammatical correctness, question well-formedness, and domain-centric criteria depending on the application. Human-based evaluation currently covers these aspects, although some automatic alternatives have been proposed recently. In [131], an evaluation metric called QAScore is proposed which makes use of the pre-trained language model RoBERTa (Robustly Optimized BERT Pre-training Approach). The metric is reference-free and correlates well with human judgments in the experiments performed by the authors. Similar studies would help strengthen the metrics for QG evaluation.
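
The sketch below illustrates the general idea behind such reference-free, masked-LM-based scoring: each token of the generated question is masked in turn and the log-probability RoBERTa assigns to it, given the passage, is accumulated. It is a minimal, assumption-laden illustration, not the exact QAScore formulation of [131].

```python
# Minimal sketch of reference-free scoring with a masked language model:
# mask each question token in turn and average RoBERTa's log-probability for
# the original token given the passage. Not the exact QAScore of [131].
import torch
from transformers import RobertaTokenizer, RobertaForMaskedLM

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base").eval()

def pseudo_log_likelihood(passage, question):
    enc = tokenizer(passage + " " + question, return_tensors="pt", truncation=True)
    input_ids = enc["input_ids"][0]
    # Rough boundary: tokens after the passage belong to the question.
    q_start = len(tokenizer(passage, truncation=True)["input_ids"]) - 1
    total, count = 0.0, 0
    for pos in range(q_start, len(input_ids) - 1):       # skip final </s>
        masked = input_ids.clone()
        masked[pos] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, pos]
        total += torch.log_softmax(logits, dim=-1)[input_ids[pos]].item()
        count += 1
    return total / max(count, 1)

print(pseudo_log_likelihood("Paris is the capital of France.",
                            "What is the capital of France?"))
```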

7 Applications of automatic question generation

There are several applications of question generation systems. A very important use-case is generating questions for passages. This is useful in an educational setup, where the input is passages of text, saving the time and effort required for setting question papers. In [132], desiderata for generating cloze-type and WH-questions are discussed. In [133], a system is discussed that generates simple factoid questions using syntactic rules organized by question type. Classification schemes for questions are presented in [134], while asking students to generate questions themselves to improve meta-cognitive abilities is discussed in [135].

Closed-domain question generation can be applied to healthcare bots, where the bot interviews patients about specific symptoms for the preliminary investigation of diseases. A conversational bot able to generate relevant questions in natural language can thus help speed up the diagnosis of a patient. Visual question generation, for instance, has been used to generate meaningful questions on radiology images [136].

Open-domain generation systems can be applied across use-cases that are not restricted to a limited collection of scenarios. They are useful where it is difficult to lay out an exact flow of events in advance, for example in open-ended conversational agents that need question generation capability.

Table 10 shows an overview of various application-based research work carried out in this field. Across domains, standalone question generation has been applied to education [123], news [124] and social media [126]. Another interesting application of standalone QG is the design of reference-less metrics for summaries. A recent work along these lines [137] uses a reference-less metric for text-to-text evaluation tasks. In this approach, a question generation model is first trained on SQuAD, and synthetic questions are then generated with the trained QG model on a multimodal dataset that pairs structured input with a textual description. This synthetic dataset is then used to train QA-QG models, and the resulting metric is used to compare generated summaries directly with the source text. Conversational question generation has largely seen applications in generating questions from multiple documents, as well as in the medical domain. Visual question generation has been deployed for visual dialogue generation in a few works [131, 133, 134]. Some of these applications used datasets such as SQuAD and MS MARCO, while others curated their own datasets owing to the distinctive application involved [122, 124]. Most works generated factoid-based questions, but yes/no [133], MCQ-type [124] and reasoning-type questions [122] were also generated for some applications. Most works used RNN-based architectures with additional features, and a few used transformer models.

Table 10 Summary of applications of AQG systems

8 Conclusion

In this survey, we presented an overview of the literature on automatic question generation. We classified the methodologies for question generation into three broad use-cases: standalone question generation, visual question generation and conversational question generation. We also reviewed the different datasets used for the task, and discussed and summarized several challenges and applications of such systems. As the survey shows, most question generation systems to date have focused on generating questions from text, and a few aspects are yet to be addressed. For example, generated questions often lack naturalness and are sometimes meaningless from an information-extraction point of view; improvements can be made in generating semantically relevant and information-seeking questions. Justifiable metrics for evaluating the quality of questions are still a work in progress. Multiple input modalities are being considered of late, and the impact of incorporating them is being studied. There is a need to develop models that amalgamate several techniques, considering each of these aspects while remaining relevant to the application being addressed.