1 Introduction

With the rapid development of the Internet and artificial intelligence (AI), text generation has gained significant attention in the field of natural language processing (NLP). One important task in text generation is sentence ordering, which involves organizing a set of sentences into coherent and readable text while ensuring text fluency, topical relevance, chronological order, and the integrity of causal relationships [1].

The development of sentence ordering in the field of NLP has been influenced and driven by several factors. With the advent of the information age, the rapid development of the Internet has produced and stored massive amounts of textual data, creating enormous information processing challenges. In this context, organizing and understanding text content in an automated manner has become particularly important. Meanwhile, the rapid advancement of AI technology, especially in deep learning and natural language processing, provides stronger technical support for sentence ordering, enabling models to understand and generate natural language sequences more accurately. In addition, with the development of dialogue systems and intelligent assistants, the demand for models that can understand and generate natural language is increasing, and sentence ordering technology plays a key role in improving the performance and user experience of dialogue systems. Therefore, the development of sentence ordering is crucial to the field of NLP.

In the late twentieth century, the sentence ordering task first appeared in single-document summarization [2], and it has been widely researched since. In 1999, McKeown et al. [3] applied sentence ordering algorithms to multi-document summarization. Unlike single-document sentence ordering, multi-document sentence ordering extracts and identifies a set of semantically similar sentences from the corpus, and ordering them requires judgments about their temporal coherence and thematic relevance [4], which increases the difficulty of ordering. Therefore, unsupervised learning-based sentence ordering algorithms were more popular in this period: they manually identify the features of each corpus and fit an ordering formula to achieve textual coherence. These algorithms are mainly classified into ordering algorithms based on probabilistic models [5,6,7] and on entity grids [8, 9].

In the early twenty-first century, with the rise of machine learning (ML), many challenges such as data fusion and manual parameter tuning were addressed. As a result, researchers introduced the concept of learning to rank (LTR) [10], significantly improving the efficiency of sentence ordering tasks. Representative algorithms include SortNet [11], RankBoost [12], and RankNet [13].

With the development of deep learning (DL), the need to handle large-scale datasets has emerged, and end-to-end approaches have gained significant traction in the field of sentence ordering. These include sentence ordering algorithms based on convolutional neural networks (CNN) [14], recurrent neural networks (RNN) [15], attention mechanisms [16], and bidirectional encoder representations from transformers (BERT) [17]. These approaches have contributed to the advancement of sentence ordering tasks.

Currently, researchers have proposed numerous concise and efficient algorithms to address sentence ordering problems. However, existing reviews summarize only particular families of ranking algorithms. For example, Xiong et al. [18] reviewed the research progress of the pairwise method in the field of ranking. In short, there is currently no comprehensive literature review specifically focused on sentence ordering algorithms in NLP.

To fill this gap, we conducted a comprehensive search for relevant papers using various sources, including Google Scholar, journal websites, Baidu Scholar, CNKI, and other professional websites. Additionally, we reviewed the references of the relevant papers to identify further sources. When searching on these websites, we used the query keyword 'sentence ordering' for filtering. We selected each paper on the basis of whether it makes constructive suggestions on, or improves, issues such as semantic understanding, long-range dependency, context dependency, and algorithmic efficiency.

The content of this paper is organized as follows: Section 1 presents the research background and significance of the sentence ordering task, briefly overviews the state of the art of related research, and introduces the organizational structure of the paper. Section 2 classifies sentence ordering algorithms based on input data formats and implementation technologies and summarizes typical algorithms and research progress. Section 3 summarizes the datasets used for sentence ordering tasks. Section 4 outlines the evaluation metrics. Finally, Section 5 provides an outlook on the sentence ordering task, and Section 6 concludes the paper. Figure 1 illustrates the framework of this paper.

Fig. 1

This survey of the sentence ordering task is structured into four parts. Part 1 categorizes sentence ordering algorithms based on the presentation of input data formats and implementation technologies. Part 2 classifies datasets as LTR-based or DL-based. Part 3 classifies evaluation metrics as LTR-based or DL-based. Part 4 provides an outlook and summary of the sentence ordering task

2 Sentence ordering algorithms

We analyze and summarize sentence ordering algorithms from two perspectives, namely input data formats and implementation techniques. According to the input data format, they are divided into pointwise, pairwise, and listwise approaches; according to the implementation technique, they are divided into LTR-based and DL-based sentence ordering algorithms.

2.1 Methods categorization based on the input data formats

Sentence ordering algorithms can be categorized into three types based on the different input data methods: pointwise, pairwise, and listwise.

2.1.1 Pointwise

The pointwise approach converts the sentences in a document into feature vectors and then uses a ranking algorithm to score each sentence. Based on the scores, the sentences are output in ranked order. Existing pointwise algorithms can be classified into three categories: classification-based, regression-based, and ordinal regression-based. The core ideas and representative algorithms are summarized in Table 1.

Table 1 The differences between the three pointwise approaches are analyzed in terms of core ideas, inputs and outputs, and representative algorithms

The pointwise approach only considers the absolute relevance of individual sentences and treats each sentence in isolation. Because the ranking function scores sentences independently, it can only find better parameter combinations through repeated testing and cannot directly compare or judge the relationships and relative order between sentences. Additionally, errors in earlier ranking results can significantly affect the accuracy of subsequent ordering. To address these issues, researchers have proposed the pairwise and listwise approaches, which handle the relative order between pairs of sentences and the overall order of the sentence list, respectively.
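As an illustration of the regression-based pointwise idea only, the following minimal sketch (assuming scikit-learn and NumPy; the feature vectors and labels are purely illustrative) scores each sentence independently and outputs the sentences in the order of their predicted scores.

```python
# A minimal pointwise sketch (regression-based): each sentence is scored
# independently from its feature vector, and the sentences are then output
# in the order of their predicted scores. Features and labels are toy
# placeholders, not taken from any specific paper.
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy training data: one feature vector per sentence, label = gold position.
X_train = np.array([[0.9, 0.1], [0.4, 0.6], [0.1, 0.9]])
y_train = np.array([0, 1, 2])

scorer = LinearRegression().fit(X_train, y_train)

# At prediction time, score unseen sentences and order them by score.
X_test = np.array([[0.2, 0.8], [0.8, 0.2]])
scores = scorer.predict(X_test)
predicted_order = np.argsort(scores)  # lower predicted position = earlier
print(predicted_order)
```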

2.1.2 Pairwise

The pairwise approach [12, 13, 17, 22] involves combining any two sentences into pairs and transforming the ordering problem into a binary classification problem. In other words, a binary classifier is trained to determine the relative strength between two sentences, which determines their order. The core ideas and representative algorithms are summarized in Table 2.

Table 2 Representative algorithms for the pairwise and their respective algorithmic characteristics

The pairwise approach places more emphasis on the relative order between two sentences. It is suitable for sentence ordering tasks with a relatively small amount of data and clear pairwise features. It only requires binary classification for all sentence pairs, making it highly interpretable and having a smaller model size compared to listwise. However, it has the following three limitations: First, it only considers the order between two sentences, which may not be suitable if the relationships between all sentences are independent of each other. Second, the efficiency of ordering varies significantly for different query values, and the ordering results may be biased toward queries with more sentence pairs. Third, any two sentences can form a sentence pair for relevance calculation, disregarding the original sequential relationships and resulting in a loss function defined based on individual sentence pairs, lacking global context.
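As a sketch of the pairwise idea only (assuming scikit-learn, with toy feature vectors and the feature-difference pair representation as an illustrative choice), a binary classifier is trained to decide which sentence of a pair should come first, and sentences are then ordered by how many pairwise comparisons they win.

```python
# Minimal pairwise sketch: a binary classifier decides, for each sentence
# pair (a, b), whether a should precede b; sentences are then ordered by
# their number of pairwise "wins". Features are illustrative placeholders.
import numpy as np
from itertools import permutations
from sklearn.linear_model import LogisticRegression

def pair_features(a, b):
    # One common pairwise representation: the difference of feature vectors.
    return a - b

# Toy training data: sentence feature vectors listed in their gold order.
gold = [np.array([0.9, 0.1]), np.array([0.5, 0.5]), np.array([0.1, 0.9])]
X, y = [], []
for i, j in permutations(range(len(gold)), 2):
    X.append(pair_features(gold[i], gold[j]))
    y.append(1 if i < j else 0)  # 1 means the first sentence should precede
clf = LogisticRegression().fit(X, y)

# Order new sentences by how many pairwise comparisons each one wins.
sents = [np.array([0.2, 0.8]), np.array([0.8, 0.2])]
wins = [sum(clf.predict([pair_features(s, t)])[0] for t in sents if t is not s)
        for s in sents]
order = np.argsort(wins)[::-1]  # most wins first
print(order)
```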

2.1.3 Listwise

The listwise approach considers all the sentences to be ordered as a single instance and trains a scoring function based on multiple instances to obtain an optimal solution. Depending on the optimization method, listwise algorithms can be categorized into two types: algorithms that directly optimize the loss function and algorithms that directly optimize the evaluation metrics.

2.1.3.1 Algorithms optimizing the loss function

These algorithms construct loss functions to optimize the sentence ordering list. They design different loss functions for different types of ordering tasks to directly reflect the inconsistencies between the predicted sequence and the true sequence in the sentence ordering algorithm. Representative algorithms can be found in Table 3.

Table 3 Representative algorithms for the algorithms optimizing the loss function method and their respective algorithmic characteristics

The algorithm that directly optimizes the loss function is similar to the handling of the loss function in pointwise and pairwise approaches. The idea behind all these approaches is to minimize the value of the loss function to improve the accuracy of sentence ordering. However, the essence of the sentence ordering task is to predict the overall order of the list of sentences to be ordered. Therefore, algorithms based on listwise direct optimization of the loss function demonstrate significantly better ordering performance compared to the other two types of algorithms.
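To make the listwise idea concrete, the following minimal sketch (pure NumPy, with illustrative scores) computes a ListMLE-style loss: the negative log-likelihood of the gold permutation under a Plackett-Luce model of the predicted sentence scores, so the loss directly reflects inconsistencies between the predicted and true sequences as a whole.

```python
# Minimal sketch of a listwise loss in the ListMLE style: the loss is the
# negative log-likelihood of the gold permutation under a Plackett-Luce
# model of the predicted scores. Scores here are illustrative.
import numpy as np

def listmle_loss(scores, gold_order):
    """scores[i] = predicted score of sentence i; gold_order = indices in gold order."""
    s = np.asarray(scores, dtype=float)[list(gold_order)]
    loss = 0.0
    for k in range(len(s)):
        # log-probability that the k-th gold sentence is picked among the remaining ones
        loss -= s[k] - np.log(np.exp(s[k:]).sum())
    return loss

print(listmle_loss([2.0, 1.0, 0.0], gold_order=[0, 1, 2]))  # well-ordered scores: small loss
print(listmle_loss([0.0, 1.0, 2.0], gold_order=[0, 1, 2]))  # reversed scores: larger loss
```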

2.1.3.2 Algorithms for optimizing evaluation metrics

This type of algorithm first defines the optimization goal of a specific evaluation metric and then selects an appropriate optimization algorithm to learn and construct the optimal ranking model, resulting in the best sentence sequence. Since evaluation metrics are often non-continuous and non-differentiable, they need to be transformed into a continuous and differentiable form. The following categorizes the algorithms that directly optimize evaluation metrics into two types:

  • One type of algorithm minimizes an upper bound of the underlying loss function. For example, Yue et al. [36] proposed the SVMmap algorithm based on the support vector machine (SVM) framework to optimize the mean average precision (MAP) metric. It minimizes a hinge-based loss function, which serves as an upper bound of the basic MAP loss. Xu et al. [37] introduced the AdaRank algorithm based on the AdaBoost framework, which minimizes an exponential loss function that bounds the basic loss. Xu et al. [38] proposed the PermuRank algorithm, which uses a greedy algorithm to minimize the upper bound of the loss function.

    This type of algorithm aims to minimize the upper bound of ordering errors in the evaluation metric, allowing it to find the global optimal solution and achieve the best ordering results. However, the feasibility of these algorithms may decrease as the data volume increases.

  • Another type of algorithm aims to optimize the probability approximation of the sentence ordering evaluation metric. For example, Taylor et al. [39] proposed the SoftRank algorithm, which assumes that the sentence sequence is probabilistic rather than deterministic. It introduces an approximate normalized discounted cumulative gain (NDCG) called Soft NDCG and optimizes Soft NDCG to obtain the sentence ordering sequence.

    These algorithms aim to make the ranking approximation of the evaluation metric as close as possible to being continuous and differentiable. Since they optimize an approximate version of a specific evaluation metric, the resulting ranking list may have some level of approximation.

Compared to pointwise and pairwise approaches, listwise methods not only consider the absolute positional relationship of individual sentences and the relative positional relationship between sentence pairs but also take into account the overall positional and logical relationships among all the sentences in the list. As a result, listwise algorithms tend to achieve higher accuracy in sentence ranking. However, the complexity of the model and the training time greatly depend on the definition of the loss function or optimization objective for the given list of sentences, which can sometimes lead to issues of high complexity.

2.2 Methods categorization based on the implementation technologies

The sentence ordering algorithm can be categorized into two types based on the development of existing technologies: LTR-based algorithms and DL-based algorithms.

2.2.1 LTR-based sentence ordering algorithms

2.2.1.1 Concepts in LTR

During the 1990s, linguists conducted a series of studies on sentence ordering. They applied traditional ordering algorithms based on rules, probabilistic models [5,6,7], and entity grids [8, 9] to model sentence ordering. These methods can achieve sentence ordering at both the syntactic and semantic levels, and their parameters can be adjusted based on experience to obtain optimal results. However, as the number of sentence features to be measured grows rapidly, the number of parameters becomes excessive. This increases the cost of manual tuning, dramatically increases the workload, reduces the practicality of tuning, and raises the risk of overfitting.

In such cases, researchers proposed learning to rank [10], which applies machine learning techniques used for solving classification and regression problems to sentence ordering. The general framework of LTR is illustrated in Fig. 2.

Fig. 2

Framework of learning to rank. The learning system learns a ranking model from the training features, and the ranking system applies the ranking model to new features and outputs the final ranked list

A standard sentence ordering training set consists of n queries \(q_{i}\ (i=1,2,\ldots ,n)\), where each query \(q_{i}\) is associated with a set of document features \(D_{i}=\{d_{i,1},d_{i,2},\ldots ,d_{i,m_{i}}\}\). Unlike the test set, the training set provides the ground-truth ranking list. The learning system then trains on the training set using pointwise, pairwise, or listwise approaches to obtain a ranking model \(f(q,d)\). Finally, the ranking system applies the learned model \(f(q,d)\) to a new query \(q_{n+1}\) for prediction and outputs the final ranked list \(D_{n+1}=\{d_{n+1,1},d_{n+1,2},\ldots ,d_{n+1,m_{n+1}}\}\).

The emergence of LTR has addressed the challenges of data integration, automatic parameter tuning, and overfitting, significantly improving the efficiency of sentence ordering tasks. The core of LTR lies in the ranking model, which can be categorized based on the different techniques used to construct the ranking model. Figure  3 presents a schematic view of the categorization methods.

Fig. 3

Schematic of LTR-based sentence ordering algorithm classification. These categories include SVM-based sentence ordering algorithms, neural network-based sentence ordering algorithms, boosting-based sentence ordering algorithms, and Bayesian-based sentence ordering algorithms

2.2.1.2 SVM-based sentence ordering algorithms

SVM, through supervised learning on a given training set, achieves binary classification of the dataset. The idea behind this category of ranking algorithms is to construct an optimal hyperplane that completely separates the data points with correct sentence ordering from those with incorrect sentence ordering, maximizing the separation margin. These algorithms can be further divided into linear and nonlinear ranking models, as detailed in Table 4.

Table 4 Representative algorithms for sentence ordering algorithm based on SVM, data input methods, and their respective algorithmic characteristics

Joachims et al. [40] proposed RankSVM, a ranking algorithm based on SVM. This algorithm uses linear SVM to train the ranking model. However, it has the disadvantages of slow speed and incomplete training. In 2010, Chapelle et al. [23] introduced the Primal Newton Method to improve the speed of the RankSVM algorithm.

Due to the inability of linear SVM-based sentence ordering algorithms to capture the mutual influence between query-doc pairs, Suzuki et al. [24] extended linear SVM to non-linear SVM and proposed the PKRank algorithm. Building upon RankSVM [22], PKRank introduces pairwise kernel functions, making the algorithm more scalable.

The SVM-based ordering algorithms are widely used in sentence ordering tasks due to their strong generalization capability and ability to handle large amounts of features and data. However, the performance and accuracy of these algorithms are closely related to the selected document features. When the data volume is small, the algorithms tend to exhibit higher accuracy. However, as the data volume increases, the accuracy may decrease.

2.2.1.3 Neural network-based sentence ordering algorithms

The core of these algorithms is to construct an n-dimensional decision boundary that can determine whether the sentence ordering is correct based on the input list of sentences to be ordered. Representative algorithms in this category include RankNet, SortNet, and ListNet, as shown in Table 5.

Table 5 Representative algorithms for sentence ordering algorithm based on neural network, data input methods, and their respective algorithmic characteristics

RankNet [13], proposed by Microsoft, trains a ranking function using a simple probabilistic loss function. It relies on neural networks and gradient descent optimization, making it applicable to any differentiable scoring function and easy to train. Building upon this, Burges et al. [25] introduced the LambdaRank algorithm, which improves the computation of gradients by incorporating the NDCG evaluation metric, bringing the loss function closely in line with the final evaluation metric. Tsai et al. [41] further improved the loss function in RankNet and proposed the FRank algorithm. By employing a fidelity loss function, FRank effectively learns the ranking function while mitigating the impact of extreme data points, addressing the issue that the original loss function is susceptible to the influence of extreme data.
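The following minimal sketch (pure NumPy, with illustrative scores) shows the RankNet-style pairwise probabilistic loss: the score difference of a sentence pair is mapped through a sigmoid to the probability that the first sentence should precede the second, and training minimizes the cross-entropy against the known pair label.

```python
# Minimal sketch of the RankNet-style pairwise probabilistic loss: the score
# difference of a sentence pair gives the modelled probability that the first
# sentence should precede the second, trained with cross-entropy.
import numpy as np

def ranknet_pair_loss(s_i, s_j, target_ij, sigma=1.0):
    """target_ij = 1 if sentence i should precede sentence j, 0 otherwise."""
    p_ij = 1.0 / (1.0 + np.exp(-sigma * (s_i - s_j)))  # modelled P(i before j)
    return -(target_ij * np.log(p_ij) + (1 - target_ij) * np.log(1 - p_ij))

print(ranknet_pair_loss(2.0, 0.5, target_ij=1))  # scores agree with the label: small loss
print(ranknet_pair_loss(0.5, 2.0, target_ij=1))  # scores disagree: larger loss
```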

The aforementioned algorithms, led by RankNet, primarily transform the ordering problem into a binary classification problem. However, Rigutini et al. [11] proposed an alternative neural network-based adaptive ordering algorithm called SortNet, which transforms the ordering problem into a multi-class classification problem. This algorithm uses neural networks to construct comparators and optimizes the cross-entropy between the predictions and the true order. Later, they improved the comparators by using a comparative neural network (CmpNN) [42] to train the ordering function, enabling the global ordering of a set of data.

ListNet-based algorithms [31,32,33,34] are similar to RankNet [13], with the main difference being that ListNet uses lists of sentences as instances and utilizes the listwise loss function, while RankNet uses pairs of sentences as instances and utilizes the pairwise loss function. Experiments have shown that as the number of sentences to be sorted increases, ListNet achieves significantly higher accuracy than RankNet, while also having lower time complexity.

NN-based ordering algorithms have strong learning capabilities and can be applied to various ordering scenarios. However, they require a large amount of data and parameter tuning. The training process can be slow, and there is a risk of overfitting. These algorithms may not be robust to outliers in the dataset, and their interpretability and generalization ability can be limited.

2.2.1.4 Boosting-based sentence ordering algorithms

The main idea of boosting is to combine multiple weak ranking models into a strong ranking model. Representative algorithms include AdaBoost and gradient boosted trees (GBT)-based sentence ordering algorithms. Please refer to Table 6 for more details.

Table 6 Representative algorithms for sentence ordering algorithm based on Boosting, data input methods, and their respective algorithmic characteristics

Freund et al. [12] proposed the RankBoost algorithm, a ranking algorithm based on AdaBoost. It trains multiple weak ranking models and combines them to create a strong ranking model, optimizing the classification errors between sentence pairs. Connamacher et al. [43] introduced a new method for combining weak ranking models, called weighted exponential function, which improves the performance of weak models in RankBoost and shows promising results in both theory and practice. Wu et al. [37] improved the weight adjustment method in RankBoost using Lambda and proposed the LambdaMART algorithm, which provides more accurate estimation of each sentence’s contribution and thus improves the accuracy of the ranking.

The weak ranking models in the mentioned algorithms utilize decision tree-based rankers, which are more suitable for handling nonlinear relationships and missing values. They have strong interpretability. However, when the dataset size increases, they may suffer from overfitting and lack robustness. To address these issues, researchers have employed weak ranking models based on regression trees.

Zheng et al. [44] proposed GBRank, an ordering algorithm based on GBT. They employed the GBT framework, originally designed for regression problems, to tackle the ordering problem of sentence pairs. Mohan et al. [45] extended GBRank by introducing gradient boosted regression trees (GBRT) and combining them with random forests (RF). They used the ordering function learned from RF as the initialization function for GBRT, effectively reducing overfitting.

Compared to ordering algorithms based on NNs, the ordering algorithms discussed in this section have fewer parameters, are less prone to overfitting, and perform well in document feature selection. Since the errors of weak ordering models are typically random, combining multiple weak ordering models can reduce the overall error rate of the model, thereby improving the accuracy of the algorithm. Moreover, these algorithms exhibit good scalability. It is crucial to focus on how to effectively combine weak learners, select appropriate loss functions, and optimize generalization error in this class of algorithms.

2.2.1.5 Bayesian-based sentence ordering algorithms

Bayesian algorithms are a class of algorithms that utilize probability and statistical knowledge for classification. In the context of Bayesian sentence ordering algorithms, the algorithm integrates prior information and data information from the set of sentences to be ordered. It then uses the Bayesian formula to calculate posterior information. Finally, based on the obtained posterior information, it infers the ordering sequence of the set of sentences to be ordered.

One of the most groundbreaking research achievements in this category is the Bayesian Personalized Ranking (BPR) algorithm proposed by Rendle et al. [47]. It combines Bayesian principles with matrix factorization techniques for personalized ranking. Compared to traditional matrix factorization algorithms, BPR can effectively handle missing values in the data. By constructing partial order relations, BPR establishes a complete partial order relation for each document, leading to more accurate ordering lists. Subsequent research has made improvements on the BPR algorithm to enhance its ordering performance. Please refer to Table 7 for related research.

Table 7 Representative algorithms for sentence ordering algorithm based on Bayesian, data input methods, and their respective algorithmic characteristics
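As a sketch of the underlying BPR objective only (a plain NumPy stochastic-gradient step on matrix-factorization parameters; the dimensions, learning rate, and regularization below are illustrative assumptions, not the settings of any cited paper), each observed preference of item i over item j is used to increase \(\ln \sigma (x_{ui}-x_{uj})\) with L2 regularization.

```python
# Minimal sketch of a BPR-style update: for a triple (user u prefers item i
# over item j), take one SGD step that decreases -ln(sigmoid(x_ui - x_uj))
# plus L2 regularization on the matrix-factorization parameters.
# All sizes and hyperparameters below are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 4, 6, 3
U = rng.normal(scale=0.1, size=(n_users, k))  # user factor matrix
V = rng.normal(scale=0.1, size=(n_items, k))  # item factor matrix

def bpr_step(u, i, j, lr=0.05, reg=0.01):
    """One SGD step on a (u prefers i over j) preference triple."""
    x_uij = U[u] @ V[i] - U[u] @ V[j]
    g = 1.0 / (1.0 + np.exp(x_uij))  # = sigmoid(-x_uij), gradient weight
    u_f, v_i, v_j = U[u].copy(), V[i].copy(), V[j].copy()
    U[u] -= lr * (-g * (v_i - v_j) + reg * u_f)
    V[i] -= lr * (-g * u_f + reg * v_i)
    V[j] -= lr * (g * u_f + reg * v_j)

bpr_step(u=0, i=1, j=2)  # user 0 prefers item 1 over item 2
```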

Lerche et al. [48] extended BPR to handle more fine-grained, multi-level preference relations with higher retrieval accuracy. This algorithm, called graded implicit feedback for Bayesian personalized ranking (BPR++), expands the binary decision into a multivariate setting, allowing for more nuanced preferences.

Qiu et al. [49] proposed Bayesian Personalized Ranking for Heterogeneous Implicit Feedback (BPRH), which takes into account the co-occurrence of different types of sentences to quantify their relevance. Based on this analysis, it studies the preference differences in the relevance between different features in the sentences. The preferences of different features are integrated into the BPR model in descending order, forming the BPRH framework.

Wang et al. [50] proposed Adversarial Training-Based Mean Bayesian Personalized Ranking (AT-MBPR), which is based on Mean Bayesian Personalized Ranking (MBPR) and divides the feedback information into three categories. It obtains implicit feedback from the average value of each document and unrelated documents. Finally, it reduces noise through adversarial training.

Bayesian-based ordering algorithms are relatively easy to understand and interpret. They execute quickly, are relatively straightforward to construct, and achieve high classification accuracy. However, these algorithms require the computation of prior probabilities and impose stricter requirements on the representation of the input data.

2.2.2 DL-based sentence ordering algorithms

In recent years, the development of DL has led to significant advancements in the field of NLP, especially with the emergence of pretrained language models such as BERT [51]. These models have brought new breakthroughs to the field of sentence ordering. Based on the training approach for sentences, DL-based algorithms can be categorized into two types: algorithms without pretrained language models and algorithms based on pretrained language models. For a detailed comparison of specific algorithms, evaluation metrics, and datasets, please refer to Table 8.

Table 8 Sentence ordering algorithm based on deep learning, summarizing the differences in data input methods, evaluation metrics, and training datasets among different representative algorithms

2.2.2.1 Sentence ordering algorithms without pretrained language models

Since 2016, sentence ordering algorithms have shifted toward deep learning-based approaches. Most researchers use methods such as CNN or RNN to construct sentence encoders that encode the input sentences. Pointer networks (Ptr-Net) [67] have also been introduced as decoders to predict the sequence of sentences.

Chen et al. [14] proposed an ordering algorithm based on continuous bag of words (CBoW), CNN, and long short-term memory (LSTM) networks. They trained the algorithm using sentence pairs and employed beam search to find the optimal ordering of sentences. Gong et al. [52] extended this approach by introducing Ptr-Net as a decoder. Ptr-Net mitigates the problem of error propagation and utilizes the entire contextual information to determine the likelihood of each sentence's position, resulting in an improved ordering list. Narayan et al. [54] presented an extractive ordering algorithm based on CNN and LSTM, which ensures that the ordered list does not go beyond the original text, but it may not guarantee comprehensive extraction of key information. Logeswaran et al. [15] proposed an end-to-end unsupervised deep learning algorithm using RNN for sentence ordering. They determine the likelihood of a sentence at each position to derive the optimal ordering list.

The use of reinforcement learning (RL) in ordering algorithms initially gained widespread application and achieved good performance in the field of search ranking [68]. Subsequently, researchers utilized RL to dynamically adjust the subsequent ordering based on positional feedback from already ordered sentences, applying it to sentence ordering tasks. Narayan et al. [54] introduced RL to globally optimize the evaluation metric ROUGE. By dynamically adjusting ROUGE, the model was able to better differentiate sentences and order them to obtain high-scoring sentences that are most suitable for summarization. This approach represents a shift from prescoring followed by ordering to dynamic order decision-making.

The mentioned algorithms mostly employ an encoder-decoder structure to learn and predict the sentence ordering. Cui et al. [16] proposed a new structure that consists of a sentence encoder, a paragraph encoder, and a decoder. The sentence encoder utilizes bi-directional long short-term memory networks (BiLSTM) to encode the sentences. The paragraph encoder employs self-attention mechanisms to capture global dependencies between sentences and obtain a reliable representation of the dataset. Finally, Ptr-Net is used to generate the ordered sequence. Based on this model, several variations have been proposed by researchers. Wang et al. [55] combined a hierarchical attention-based encoder with a decoder based on masked self-attention. Yin et al. [28] utilized two paired ordering prediction modules (FUTURE and HISTORY) to enhance the Ptr-Net decoder. Yin et al. [56] employed graph recurrent networks (GRN) to model the relationships between sentences and entities. Lai et al. [57] developed the IRSEGraph ordering algorithm based on the GRN model, which combines pairwise ordering models with sequence models to achieve complementary advantages.
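To illustrate the decoding side of these encoder-decoder models, the following minimal sketch (PyTorch, with random sentence encodings standing in for a real sentence/paragraph encoder) performs pointer-network-style decoding: at each step the decoder state attends over the sentence encodings, already-placed sentences are masked out, and the highest-scoring remaining sentence is emitted next.

```python
# Minimal sketch of one pointer-network decoding pass: additive attention over
# sentence encodings, with already-selected sentences masked out at each step.
# Dimensions, the initial decoder state, and the random encodings are all
# illustrative placeholders, not a specific published architecture.
import torch

n_sents, d = 5, 16
sent_enc = torch.randn(n_sents, d)       # encoder outputs, one vector per sentence
W_q = torch.nn.Linear(d, d, bias=False)  # projection of the decoder state
W_k = torch.nn.Linear(d, d, bias=False)  # projection of the sentence encodings
v = torch.nn.Linear(d, 1, bias=False)    # additive-attention scorer

def point(dec_state, selected):
    """Return the index of the next sentence given the current decoder state."""
    scores = v(torch.tanh(W_q(dec_state) + W_k(sent_enc))).squeeze(-1)
    if selected:                          # mask sentences that are already placed
        mask = torch.zeros(n_sents, dtype=torch.bool)
        mask[list(selected)] = True
        scores = scores.masked_fill(mask, float("-inf"))
    return int(torch.argmax(scores))

order, selected = [], set()
dec_state = sent_enc.mean(dim=0)          # a simple initial decoder state
for _ in range(n_sents):
    idx = point(dec_state, selected)
    order.append(idx)
    selected.add(idx)
    dec_state = sent_enc[idx]             # feed the chosen sentence back as the next state
print(order)
```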

Deep learning plays a crucial role in the development of sentence ordering tasks. Compared to traditional ordering algorithms, deep learning-based sentence ordering algorithms can learn the semantics and contextual information of sentences, enabling better ordering. They can capture the relationships between sentences more accurately, thus improving the accuracy of ordering. They also allow for end-to-end learning, making the training and usage of the algorithms simpler and more efficient. Additionally, deep learning-based sentence ordering algorithms can further enhance the performance through various techniques. For example, the use of attention mechanisms can focus on important parts, thereby improving the accuracy of ordering. Reinforcement learning can enable dynamic ordering decisions, leading to improved overall ordering performance.

2.2.2.2 Sentence ordering algorithms with pretrained language models

With the emergence of issues such as larger model sizes, increasing parameter counts, and expanding application domains, traditional deep learning algorithms face challenges in terms of poor generalization, long training times, and high data requirements.

In response to these challenges, researchers have proposed pretrained language models, with BERT being a prominent example. The introduction of pretrained language models has greatly facilitated advancements in the field of NLP. As a result, researchers have started incorporating pretrained language models into the task of sentence ordering.

Kumar et al. [35], based on the ATTOrderNet algorithm, employed BERT for pretraining. They incorporated transformer networks into both the sentence encoder and the paragraph encoder. Finally, they optimized the loss function using the ListMLE algorithm. Li et al. [59] proposed BERSON, a sentence ordering algorithm within paragraphs based on BERT. They developed a novel relational pointer decoder (RPD) and integrated the relative ordering information into Ptr-Net through a deep relational module (DRM), enhancing both local coherence and global dependencies among sentences. Manku et al. [60], building upon BERSON, utilized a pairwise model for prediction and replaced BERT with ALBERT [69]. This modification led to improved ordering performance.

Although the aforementioned algorithms have achieved excellent ordering performance, they still suffer from large model sizes and slow prediction speeds. In order to address these concerns and improve the overall speed of the algorithm, Golestani et al. [61] proposed the sentence correlation measure (SCM) algorithm. By employing SBERT-WK to obtain sentence embedding vectors, they calculated the similarity scores of sentence pairs using cosine similarity and then performed brute-force ordering. This approach reduces model size and enhances algorithmic efficiency.
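The following minimal sketch illustrates the similarity-then-search idea behind SCM: sentences are embedded, pairwise cosine similarities are computed, and a brute-force search over permutations maximizes adjacent-sentence similarity. A generic SBERT model from the sentence-transformers library is used here purely as a stand-in for SBERT-WK, and the sentences and model name are illustrative assumptions.

```python
# Minimal sketch of the similarity-then-search idea: embed each sentence,
# compute pairwise cosine similarities, and brute-force over permutations to
# maximise adjacent-sentence similarity. The model name below is a generic
# stand-in, not the SBERT-WK model used in the cited work.
from itertools import permutations
import numpy as np
from sentence_transformers import SentenceTransformer

sentences = [
    "He poured the coffee.",
    "She woke up early.",
    "Then she went to the kitchen.",
]
model = SentenceTransformer("all-MiniLM-L6-v2")        # stand-in model, not SBERT-WK
emb = model.encode(sentences, normalize_embeddings=True)
sim = emb @ emb.T                                       # cosine similarity matrix

def coherence(order):
    # Sum of similarities between adjacent sentences in a candidate order.
    return sum(sim[a, b] for a, b in zip(order, order[1:]))

best = max(permutations(range(len(sentences))), key=coherence)
print([sentences[i] for i in best])
```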

Differing from the previous algorithms that individually encoded each sentence, Zhu et al. [62] introduced the BERT4SO algorithm by fine-tuning BERT. They connected all the sentences requiring ordering into a long sequence and utilized the [CLS] token to denote sentence boundaries, aiming to capture the interaction information between sentences. Furthermore, Zhu et al. [63] considered multi-granularity relative ordering information. They analyzed that sentence positions at different distances might rely on different types of information. To address this, they employed a graph neural network (GNN) to integrate the relative ordering information into sentence representations, which were then used for the final ordering prediction.

The sentence ordering algorithm based on SE-Graph [56] has shown promising results for long texts. However, its performance is not satisfactory for short texts. In order to address this issue, Golestani et al. [64], in addition to incorporating BERT as a pretrained language model, also applied pruning techniques to the obtained graph relation network. This approach aims to mitigate the problem of excessively high correlations between sentences in short texts.

In contrast to the aforementioned algorithms that predict the sentence ordering sequence unidirectionally from start to end, Bai et al. [65] proposed a bidirectional ordering algorithm based on Ptr-Net. This algorithm can simultaneously predict the sentence ordering in both forward and backward directions, allowing for mutual influence between the two directions. This approach helps alleviate the issue of error accumulation during the ordering process and has achieved SOTA (state-of-the-art) performance.

In addition to utilizing BERT and its variants [69, 70] for constructing sentence ordering algorithms, Chowdhury et al. [66] proposed the use of the pretrained language model BART (bidirectional and auto-regressive transformers) [71] for sentence ordering tasks, which has also shown impressive performance.

The importance of pretrained language models in sentence ordering tasks is evident. Introducing pretrained language models allows for the utilization of existing language knowledge and data through transfer learning and fine-tuning. This enables better learning of general features in the data and enhances the understanding of semantic and contextual information in the text. Not only does it help avoid overfitting on small-sized datasets, but it also improves ordering accuracy, provides better generalization performance, and facilitates faster training speed.

3 Datasets utilized for the Sentence Ordering Algorithms

For sentence ordering algorithms, obtaining high-quality datasets as a foundation is crucial for achieving desirable ordering results and improving both the training quality and prediction accuracy of the models. Existing sentence ordering datasets can be classified into two categories: public datasets and domain-specific datasets constructed based on specific task requirements. In order to assess the performance of algorithms and conduct comparative experiments, it is generally preferred to prioritize the use of public datasets.

3.1 Public Datasets of LTR

LTR-based sentence ordering algorithms commonly use three public datasets [72], as shown in Table 9.

Table 9 Public dataset for learning to rank. Compare the differences between different public datasets in terms of composition, query, number of documents, number of features, dimension type, and evaluation metrics
  1.

    LETOR Datasets

    Microsoft Research Asia has released LETOR 1.0 to 4.0 versions, with LETOR 3.0 and 4.0 being the most commonly used public datasets in the field of learning to rank. The official website for LETOR is http://research.microsoft.com/en-us/um/beijing/project-s/letor/.

    The LETOR 3.0 dataset [73] is formed by combining four additional datasets with the three existing ones. Each dataset has three versions, and each version is further divided into five parts. From these, three parts are randomly selected as the training set, one as the validation set, and one as the test set.

    The LETOR 4.0 dataset [74] is divided into four categories, with each category consisting of two datasets, making a total of eight datasets. Each dataset is further divided into five parts, and each part contains three subsets: the training set, validation set, and test set. Fivefold cross-validation can be applied to this dataset.

  2.

    Yahoo! Learning-to-Rank Challenge Datasets

    This dataset [75] is sourced from the Learning-to-Rank Challenge organized by Yahoo! Labs. It consists of two datasets, each divided into three groups: the training set, validation set, and test set. The official website for this dataset is http://learningtorankchallenge.yahoo.com/.

  3.

    Microsoft Learning-to-Rank Datasets

    The dataset is released by Microsoft Research Asia and consists of MSLR-WEB30k and randomly sampled MSLR-WEB10k. Each dataset is divided into five parts: S1, S2, S3, S4, and S5, which are used for five-fold cross-validation. You can find more information on the official website at https://www.microsoft.com/en-us/research/project/mslr/.

In addition to having different sources, the three public datasets commonly used in LTR differ in data complexity and application scenarios. The LETOR datasets are usually larger and more complex and are mostly used for academic research and algorithm development. The other two datasets are relatively smaller in size and complexity and are mostly used for practical business research, which is closer to real application scenarios.

3.2 Public Datasets of DL

There are eight commonly used public datasets for sentence ordering algorithms based on DL, as shown in Table 10.

Table 10 Public dataset for deep learning. Compare the differences between different public datasets in terms of composition, average number of sentences, average number of words, and vocabulary
  1.

    Accident, Earthquake

    The accident dataset [8] is sourced from the National Transportation Safety Board’s aviation accident database, while the earthquake dataset [8] is derived from Associated Press articles on the topic of earthquakes. Each dataset consists of 100 documents of comparable length, with average sentence counts of 11.5 and 10.4, respectively. Both datasets include training and testing sets. These two datasets cover different subject areas and have distinct data sources and content. Therefore, they are primarily used to validate and analyze the generalization performance of the models when in use.

  2.

    NIPS abstract, AAN abstract, NSF abstract

    The NIPS abstract dataset [15] is sourced from the abstracts of NIPS papers from 2005 to 2015. The dataset comprises 3,259 abstracts, with training data spanning from 2005 to 2013, validation data from 2014, and testing data from 2015. The dataset contains numerous high-quality paper abstracts in the fields of machine learning and computational neuroscience. The consistent language style and terminology facilitate the learning of sentence ordering models. However, the dataset’s subject matter is limited and cannot be applied to sentence ordering tasks in other fields.

    The AAN abstract dataset [15] is sourced from the ACL Anthology Network (AAN), specifically from the selected papers of the ACL conferences from 2010 to 2013. The dataset consists of 12,157 abstracts, with training data from 2010, validation data from 2011, and testing data from 2012 to 2013. The dataset encompasses a broad spectrum of abstracts from research papers in the field of computational linguistics, making it suitable for various studies on sentence ordering tasks. However, it is important to note that the dataset may contain some noise and inconsistencies and therefore requires additional data cleaning and preprocessing.

    The NSF abstract dataset [15] is sourced from the abstracts of research papers funded by the National Science Foundation (NSF) from 1990 to 2003. The dataset comprises 127,865 abstracts and differs from the previous datasets as it covers various scientific disciplines and contains relatively longer abstracts. The training data span from 1990 to 1999, the validation data is from 2000, and the testing data covers the years 2001 to 2003. Compared to the previous two datasets, the NSF abstracts dataset covers abstracts of research projects across multiple scientific domains and is relatively long, making it suitable for interdisciplinary research and for research on sentence ordering tasks for scientific text analysis. However, the text in the dataset is more diverse, having significant differences between different fields, requiring domain adaptation processing.

  3.

    arXiv abstract

    The arXiv abstract dataset [14] includes abstracts from papers published on the arXiv website in seven fields, including statistics, computer science, and mathematics. Each document consists of 2 to 20 sentences, with an average of 134.65 words per document. The dataset is divided into training, validation, and testing sets: the first 10% of the dataset is used for validation, the last 10% for testing, and the remaining portion for training. The dataset is updated frequently and reflects the latest research progress, which helps train the model to capture current research trends. However, because the arXiv abstract dataset covers a number of subject areas, research in some areas may be more active and produce more papers than others, which may weaken the model's ability to generalize to certain areas on the sentence ordering task.

  4.

    SIND caption, ROCStory

    The SIND caption dataset is sourced from the story texts in the Sequential Image Narrative Dataset (SIND) [76]. The dataset comprises 50,200 stories, with each story composed of 5 sentences and an average word count of approximately 57. Since stories generally exhibit coherence in terms of temporal sequence and causal relationships, they are suitable for sentence ordering algorithm datasets.

    The ROCStory dataset [77] consists of common-sense stories and contains 98,162 stories. Each story is composed of 5 sentences, with an average word count of approximately 50. In comparison to the SIND caption dataset, ROCStory lacks additional images as supplementary information, resulting in stories that adhere more closely to logical coherence. Both the SIND caption and ROCStory datasets are randomly divided into training, validation, and testing sets in an 8:1:1 ratio.

4 Evaluation Metrics

The ultimate goal of sentence ordering tasks is to form readable and semantically coherent texts, so the ordering results need to be evaluated from multiple perspectives. As the sentence ordering task has developed, researchers have proposed various evaluation metrics to measure sentence ordering results. Researchers can select appropriate evaluation metrics based on the characteristics of their algorithms and datasets to judge the effectiveness of a sentence ordering algorithm. This section provides a detailed analysis and summary of the most commonly used evaluation metrics in LTR and DL-based sentence ordering algorithms.

4.1 Evaluation Metrics of LTR

The evaluation metrics of LTR-based sentence ordering algorithms mainly include MAP, NDCG, and expected reciprocal rank (ERR). Among them, MAP focuses on the average quality of the ranking results, NDCG focuses on the relevance and positional relationships of the ranking results, and ERR focuses more on the top-ranked results, where higher-ranked sentences have a greater impact on the overall evaluation. Their applications are shown in Table 9.

4.1.1 MAP

MAP [72] is computed by first calculating the average precision (AP) for each query and then taking the average of all the AP values across queries. It reflects the relationship between sentence relevance and ranking position, where higher MAP values indicate that relevant sentences are ranked higher. Relevance judgments are typically represented as either 0 (non-relevant) or 1 (relevant). AP is calculated as follows:

$$\begin{aligned} AP=\frac{\sum _{i=1}^{n}\left( P_{i}/R_{i}\right) y_{i}}{n} \end{aligned}$$
(1)

where \(P_i\) represents the relative order of relevant sentences in the list of sentences to be ranked, \(y_i\) represents the label value (0 or 1) for each sentence, \(R_i\) represents the order of all sentences in the list to be ranked, and n represents the total number of relevant sentences in the list. Finally, the AP values obtained for all queries are averaged to calculate the MAP:

$$\begin{aligned} MAP={\frac{\sum ^{m}_{j=1}{AP_j}}{m}} \end{aligned}$$
(2)

where m represents the total number of queries.
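As a concrete reading of Eqs. (1) and (2), the following minimal sketch (pure NumPy, with illustrative binary relevance labels) accumulates precision at the rank of each relevant sentence, averages it per query, and then averages across queries.

```python
# Minimal sketch of AP/MAP with binary relevance: precision is taken at the
# rank of each relevant sentence, averaged per query, then averaged across
# queries. The labels below are toy examples.
import numpy as np

def average_precision(labels_in_ranked_order):
    """labels_in_ranked_order[r] = 1 if the sentence ranked at position r + 1 is relevant."""
    labels = np.asarray(labels_in_ranked_order, dtype=float)
    hits = np.cumsum(labels)                        # number of relevant sentences so far
    precisions = hits / (np.arange(len(labels)) + 1)
    return (precisions * labels).sum() / max(labels.sum(), 1.0)

def mean_average_precision(ranked_label_lists):
    return float(np.mean([average_precision(q) for q in ranked_label_lists]))

# Two toy queries with binary relevance labels in predicted rank order.
print(mean_average_precision([[1, 0, 1, 0], [0, 1, 1]]))
```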

4.1.2 DCG and NDCG

NDCG [72] is a comprehensive evaluation metric that takes into account the relationship between the algorithm’s ranking list and the true sequence. It is one of the most commonly used evaluation metrics in LTR. The basic idea is that the higher the relevance of a sentence to a query, the higher its rank should be in the ranking list, and the higher the rank, the higher its position in the ranking sequence. Therefore, a higher NDCG value indicates a better ranking. NDCG consists of four components: Normalized, Discounted, Cumulative, and Gain.

First, the gain values need to be calculated, where higher relevance generally corresponds to larger gain values. Taking the i-th document as an example, let \(y_i\) denote the relevance level of the sentence, where higher relevance corresponds to a higher level value. The gain value \(G_i\) can then be represented as:

$$\begin{aligned} G_i=2^{y_{i}}-1 \end{aligned}$$
(3)

Next, the gain values of all sentences in the query are accumulated to obtain the CG (Cumulative Gain):

$$\begin{aligned} CG=\sum _{i=1}^{n}\left( 2^{y_{i}}-1\right) \end{aligned}$$
(4)

Because CG does not consider the ordering of sentences in the evaluation, its value only indicates the overall quality without judging the effectiveness of the ranking. Therefore, different weight values are assigned based on the position of sentences in the ranking, which forms the foundation of the evaluation metric for NDCG, known as DCG (discounted cumulative gain):

$$\begin{aligned} DCG=\sum _{i=1}^{n}\frac{2^{y_{i}}-1}{\log _{2}\left( 1+i\right) } \end{aligned}$$
(5)

Finally, in order to facilitate the comparison of ranking effectiveness between different queries, normalization is performed. The normalization factor is the most ideal ranking value for each query, referred to as the IDCG (Ideal DCG). Therefore, NDCG is calculated as follows:

$$\begin{aligned} NDCG=\frac{DCG}{IDCG}=\frac{1}{IDCG}\sum _{i=1}^{n}\frac{2^{y_{i}}-1}{\log _{2}\left( 1+i\right) } \end{aligned}$$
(6)

However, NDCG also has certain limitations. For example, it does not explicitly penalize irrelevant sentences.
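The following minimal sketch (pure NumPy, with illustrative relevance grades) computes DCG and NDCG exactly as in Eqs. (5) and (6): gains \(2^{y_i}-1\) are discounted by \(\log _2(1+i)\) and normalized by the ideal ordering.

```python
# Minimal sketch of DCG/NDCG: gains 2^y - 1 discounted by log2(1 + position),
# normalized by the DCG of the ideal ordering. Relevance grades are toy values.
import numpy as np

def dcg(relevances):
    rel = np.asarray(relevances, dtype=float)
    discounts = np.log2(np.arange(len(rel)) + 2)   # log2(1 + i) for positions i = 1..n
    return float(((2 ** rel - 1) / discounts).sum())

def ndcg(relevances):
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Toy graded relevance labels in predicted rank order.
print(ndcg([3, 2, 3, 0, 1]))
```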

4.1.3 ERR

In order to address the limitation that NDCG does not consider the influence of preceding sentences on subsequent ones within the same ranking list, ERR [78] introduces the cascade model, which fully accounts for the positional dependencies among sentences in the same ranking list. If the sentence at a given position satisfies the user, the reading process stops there; otherwise, that sentence is skipped and the process continues to the next one. The probability of stopping at the i-th position can be expressed as \(P_i\):

$$\begin{aligned} P_{i}=\prod _{j=1}^{i-1}\left( 1-R_{j}\right) R_{i} \end{aligned}$$
(7)

where \(R_j\) represents the relevance level function for sentences, defined as \(R_{j}=R\left( g_{j}\right) =\frac{2^{g_{j}}-1}{2^{g_{max}}}, g_{j}\in \left\{ 0,1,\ldots ,g_{max}\right\} \), where \(g_j\) represents the relevance level of the j-th sentence and \(g_{max}\) represents the maximum relevance level among all sentences. Taking the position discount as \(\varphi (i)=1/i\), ERR can be expressed as:

$$\begin{aligned} ERR=\sum _{i=1}^{n}\varphi (i)\prod _{j=1}^{i-1}\left( 1-R_{j}\right) R_{i}=\sum _{i=1}^{n}\frac{1}{i}\prod _{j=1}^{i-1}\left( 1-R_{j}\right) R_{i} \end{aligned}$$
(8)
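The following minimal sketch (pure NumPy, with illustrative relevance grades) evaluates Eq. (8) directly: the user is assumed to scan the list top-down, stopping at position i with probability \(R_i\prod _{j<i}(1-R_j)\), and the reciprocal rank of the stopping position is averaged over this distribution.

```python
# Minimal sketch of ERR under the cascade model: per-position stopping
# relevance R_i, reciprocal-rank discount phi(i) = 1/i. Grades are toy values.
import numpy as np

def err(grades, g_max=None):
    """grades[i] = relevance grade of the sentence at rank i + 1."""
    grades = np.asarray(grades, dtype=float)
    g_max = grades.max() if g_max is None else g_max
    R = (2 ** grades - 1) / (2 ** g_max)           # per-position stopping relevance R_i
    p_not_stopped, score = 1.0, 0.0
    for i, r in enumerate(R, start=1):
        score += (1.0 / i) * p_not_stopped * r     # phi(i) = 1/i
        p_not_stopped *= (1.0 - r)
    return score

print(err([3, 2, 0, 1]))
```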

Unlike MAP, the ERR and NDCG evaluation metrics divide the relevance of sentences into multiple levels (more than two relevance grades), whereas MAP has only two levels (relevant and non-relevant). Another advantage of NDCG and ERR is that they place more weight on positional information when calculating the evaluation scores, giving higher weight to sentences in higher positions.

4.2 Evaluation Metrics of DL

The evaluation metrics of DL-based sentence ordering algorithms mainly include PMR, Accuracy, and the \(\tau \) coefficient. Among them, PMR focuses on the proportion of sentence orderings that are completely correct, Accuracy focuses on the correctness of the absolute position of each sentence, and the \(\tau \) coefficient focuses on the correctness of the relative positions of sentences in the ordering result. Their applications in sentence ordering algorithms are shown in Table 8.

4.2.1 PMR

PMR [14] represents the proportion of documents whose predicted sentence order exactly matches the correct order:

$$\begin{aligned} PMR={\frac{1}{Q}}\sum _{i=1}^{Q}{\mathbb {I}}{\left( {{\hat{o}}^i}=o^i\right) } \end{aligned}$$
(9)

where \(\mathbb {I}(\bullet )\) is an indicator function, \({\hat{o}}^i\) denotes the predicted order of the i-th document, \(o^i\) denotes the correct order of the i-th document, and Q represents the total number of documents to be ordered. PMR is a strict metric: an ordering is counted as correct only if it is exactly the same as the target ordering. It therefore places high demands on the ordering accuracy of the algorithm.

4.2.2 Accuracy

Accuracy considers the rate at which the absolute position of a sentence in ordering is correctly predicted by the algorithm, which can be defined as:

$$\begin{aligned} Accuracy=\frac{N_{correct}}{N_{total}} \end{aligned}$$
(10)

where Accuracy [29] denotes the metric, \(N_{correct}\) represents the number of sentences whose positions are correctly predicted, and \(N_{total}\) represents the total number of sentences to be ordered. Compared to PMR, Accuracy is more tolerant of ordering errors: even if some positions in the result are wrong, Accuracy can still be high as long as most positions are correct.

4.2.3 Kendall’s tau

Kendall’s Tau coefficient [79], also known as \(\tau \) coefficient, is primarily used to compare the ranking list generated by an algorithm with the true ranking list. If the two ranking lists are completely identical, the tau value is 1. If the two ranking lists are unrelated, the tau value is 0. If the two ranking lists are inversely related, the tau coefficient is -1, which can be defined as:

$$\begin{aligned} \tau =1-\frac{2S\left( \pi ,\sigma \right) }{N\left( N-1\right) /2} \end{aligned}$$
(11)

where N represents the number of sentences to be ranked, and \(S\left( \pi ,\sigma \right) \) represents the number of discordant pairs between the two ranking lists. The \(\tau \) coefficient captures the overall trend and relevance of the ordering results, not just the accuracy of the absolute position.
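To tie the three metrics together, the following minimal sketch (pure NumPy, with toy predicted and gold orders) computes PMR, Accuracy, and Kendall's \(\tau \) as defined above; the S term in Eq. (11) is counted as the number of discordant pairs.

```python
# Minimal sketch of the three DL-side metrics: PMR counts fully correct
# orders, Accuracy counts correct absolute positions, and Kendall's tau
# follows Eq. (11) with S = number of discordant pairs. Orders are toy values.
from itertools import combinations
import numpy as np

def pmr(pred_orders, gold_orders):
    """Fraction of documents whose predicted order matches the gold order exactly."""
    return float(np.mean([p == g for p, g in zip(pred_orders, gold_orders)]))

def accuracy(pred, gold):
    """Fraction of sentences placed at their correct absolute position."""
    return float(np.mean([p == g for p, g in zip(pred, gold)]))

def kendall_tau(pred, gold):
    """Eq. (11): 1 - 2 * (# discordant pairs) / (N * (N - 1) / 2)."""
    n = len(gold)
    pos = {s: i for i, s in enumerate(pred)}       # predicted position of each sentence
    discordant = sum(1 for a, b in combinations(gold, 2) if pos[a] > pos[b])
    return 1 - 2 * discordant / (n * (n - 1) / 2)

gold = [0, 1, 2, 3]        # gold sentence order
pred = [0, 2, 1, 3]        # predicted order with one swapped pair
print(pmr([pred], [gold]), accuracy(pred, gold), kendall_tau(pred, gold))
```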

5 Future Work

Sentence ordering algorithms have vast application prospects, ranging from early tasks such as multi-document summarization, collaborative filtering, and document retrieval to a growing number of domains that require ordering algorithms to obtain optimal ranked lists. For example, in machine translation, the ListMLE algorithm [80] can be used to obtain the optimal translation results. The LambdaMART algorithm [81] can be applied to personalized query suggestions shown below the search box as a user types. In shopping websites, the RankSVM algorithm [82] can be employed to analyze each user's search data, calculate personalized ranking lists, and select the top-k items as the final recommended products. In essay generation systems [83, 84], sentences can be ordered to produce a coherent and highly readable essay. In intelligent writing systems, subsequent writing material can be recommended dynamically based on the content the user has entered, and so on. While sentence ordering algorithms have significant applications in fields such as information retrieval, machine translation, and automated writing, reflecting on the past and looking toward the future, there are still many issues that merit further research in the realm of sentence ordering algorithms.

  1.

    The utilization of deep neural network models has elevated the performance of ordering algorithms to new heights. However, these models also encounter challenges such as large model sizes, increased data processing requirements, and slower convergence speeds. A key research focus for future researchers is to explore methods that can reduce model sizes and accelerate prediction speeds while maintaining accuracy.

  2.

    One of the crucial factors determining the accuracy of the final ordering sequence in sentence ordering algorithms is the precise capture of semantic features and contextual relationships within sentences. It is an ongoing research challenge to effectively capture both the semantic features and contextual relationships of sentences simultaneously in order to enhance algorithm efficiency. Additionally, exploring methods to integrate multiple semantic features of sentences together to improve algorithm accuracy, and effectively utilizing these features to achieve better ordering results, are also areas that require further investigation.

  3.

    Currently, sentence-level ordering algorithms have achieved SOTA performance. It is worth considering the transfer of these algorithms to larger-scale ordering tasks, such as paragraph, chapter, and document ordering. However, research in this area is relatively limited at present. Wan et al. [85] used BiLSTM and self-attention to construct a multi-paragraph ordering model, but the performance still lags behind human performance. Transferring existing sentence ordering algorithms to larger-scale text ordering poses different challenges. For example, BERT-based ordering algorithms are only suitable for sentence and paragraph-level ordering tasks and cannot handle document-level ordering tasks. Therefore, there are still many issues to be addressed in transferring sentence ordering algorithms to larger-scale document ordering tasks. Research in this area is a long and challenging journey, but it holds significant research value.

  4.

    Existing public datasets are unable to meet all research and application requirements, which necessitates researchers to construct domain-specific datasets that align with their research needs. High-quality datasets are crucial for sentence ordering algorithms, and currently, domain-specific datasets are often constructed through manual annotation. However, this approach comes with high costs and subjective judgments, posing challenges in obtaining reliable labels. Improving the above-mentioned issues and obtaining trustworthy labels are areas that researchers need to focus on.

  5.

    Currently, sentence ordering algorithms heavily rely on training with large-scale datasets to achieve good performance. However, for certain specific research domains, acquiring a sufficient number of training samples can be challenging. Therefore, future research in sentence ordering algorithms needs to explore methods to achieve good performance on small-sample datasets and improve the utilization efficiency of deep learning models with limited data.

The resolution of the aforementioned issues holds significant importance for the further advancement and practical application of sentence ordering algorithms.

6 Conclusion

The sentence ordering task, as an important research direction in the field of NLP, has been widely applied in various research domains, bringing great convenience to our daily lives. This article mainly analyzes and summarizes the different input data formats and implementation techniques of sentence ordering algorithms. Through the analysis and comparison of these algorithms, it is found that sentence ordering algorithms based on deep learning reduce the complexity of manual feature extraction. By modeling sentences with multi-layer neural networks, they can better capture the semantic information within sentences and thus learn more abstract and complex feature representations. Currently, this approach is also the most widely used. In the future, there will be increasing demands and higher performance requirements for sentence ordering algorithms. Researchers can delve into the future prospects proposed in this article to further promote the development of sentence ordering algorithms, as well as related algorithms for larger-scale ordering, such as paragraphs and chapters.