Introduction

Text classification has become a pressing need for the current generation [1]. The processing capability of new-generation computers has grown exponentially in tandem with the amount of data handled. This is partly because there are now more end users, which calls for efficient data management. To handle data effectively, it must be uploaded and retrieved as quickly as possible: handling data means sending and retrieving it in the shortest possible time without loss, which is exactly the requirement for scientific literature, health records, e-books, web applications, banking details, and digital content. Text classification is now applied in many areas [2,3,4,5]. One of them is data processing and structuring, where a large amount of data is sorted and organized on the basis of relevancy. Opinion mining is an equally essential application, and there are numerous initiatives to expand the use of categorized data for collaborative filtering. Email classification is also important; one use of text classification is identifying spam messages. Given their criticality and precision requirements, these applications challenge the performance of existing models. Recently, researchers have launched several initiatives to address these real-world situations [6].

In the recent past, researchers have taken several remarkable initiatives in response to such real-world situations. Significant attempts are being made to manage large amounts of text, and categorization techniques expose the data to text mining and Natural Language Processing (NLP) with the aim of making the process more effective [7]. The raw data, here the Twitter data used for the present study, is subjected to text classification through a series of processes, namely feature extraction, dimensionality reduction, feature selection, and finally evaluation with metrics that make it easier to measure classification accuracy [8]. The major goal of this study is to create a hybrid deep learning text classification system using an RNN embedded with BiLSTM and GRU layers. The study employs 83,267 rows of Twitter data with this hybrid deep learning technique.

This paper is organized as follows: Section II describes the relevant work in the field of text categorization. Section III outlines the standard and hybrid text categorization algorithms employed in this work. Section IV presents the evaluation criteria used to compare the proposed technique with other algorithms. Section V describes the results of the algorithm comparison. Section VI summarizes the study's findings.

Related Works

Gang Liu and Jiabao Guo proposed a hybrid deep learning approach called attention-based BiLSTM with CNN (AC-BiLSTM). In this model, word embedding vectors feed a convolutional layer, which extracts high-level phrase representations. BiLSTM is then used to obtain both the preceding and the subsequent context representations, and an attention mechanism applies varying degrees of attention to the output of the BiLSTM's hidden layers. The processed contextual information is classified using a Softmax classifier. AC-BiLSTM demonstrates a high degree of classification accuracy when compared with other cutting-edge models [9]. The authors of [10] surveyed over 150 deep learning models developed in the last few years; these models significantly advanced the state of the art in text categorization across different classification challenges. In addition, the authors provided an overview of more than 40 widely used datasets and compared the models on a number of published open benchmarks. For the classification of texts, [11] introduced a new CNN model that focuses on multi-scaling and dense connectivity. Because of the large number of connections, the model is able to generate large n-gram features dynamically from variable small n-gram features, and by attending to several scales of features it can adaptively select task-friendly yet effective features from a large set of multi-scale characteristics for text categorization. The model shows competitive performance on six benchmark datasets, and attention visualization indicates that it can choose the best scale to provide meaningful representations for text classification.

The authors of [12] proposed the Text Graph Convolutional Network (Text GCN) as a text classification method. Text GCN learns embeddings for words and documents jointly, supervised by the pre-defined document labels, after initializing each with a one-hot representation. On a variety of standard datasets, the results showed that Text GCN, without any external word embeddings or knowledge, outperforms cutting-edge text categorization algorithms, and it also learns predictive word and document embeddings. Furthermore, experimental results show that as the percentage of training data is reduced, Text GCN becomes increasingly superior to state-of-the-art comparison techniques, demonstrating that it is robust to limited training data in text classification. For semi-supervised short text categorization, [13] introduced a novel heterogeneous graph neural network based technique that makes use of both small amounts of labelled data and large amounts of unlabelled data. According to [14], convolutional neural networks can be applied to graph-of-words representations of texts, so that non-contiguous and long-distance semantics are captured and the CNN models learn at several semantic levels. The results on the RCV1 and NYTimes datasets indicated that this hierarchical approach was superior to traditional hierarchical text classification and current deep models.

The authors of [15] proposed two text classification methods, NA-CNN-COIF-LSTM and NA-CNN-LSTM, created by combining a CNN with no activation function with an LSTM or with one of its hybrid variants, COIF-LSTM. Comparative tests show that combining a CNN with no activation function and an LSTM or its variants improves performance. In [16], a new text classification algorithm integrates CNN, LSTM, and attention mechanisms: initial features are extracted using convolutional layers, the LSTM then preserves the context history, and the attention mechanism generates semantic codes that represent attentional probability distributions and emphasize the effect of the inputs on the outputs. The work in [17] builds a new model by combining long short-term memory (LSTM) and a convolutional neural network (CNN), which are standard neural network models. Long text sequences benefit from the LSTM's ability to preserve historical information while the CNN structure extracts the text's local attributes; in the hybrid model, a CNN is built on top of the LSTM, and the CNN extracts text feature vectors from those features. In the experiments, the performance of the hybrid model was compared with other models, and the experimental data show that the hybrid model in [18] significantly improves text classification.

The authors of [19] proposed a new hybrid deep learning technique to detect deception by combining recurrent neural networks (RNN) and convolutional neural networks (CNN). The model outperformed state-of-the-art AI and ML models when evaluated on a benchmark dataset. The work in [20] introduced a hybrid technique that improves the reliability and transparency of classification decisions for medical documents. This model classifies medical text using a three-level hybrid approach, combining a gated attention-based BiLSTM (ABLSTM) with a regular expression-based classifier, and goes beyond the state of the art in selecting domain- and topic-related features.

The recent research discussed above covers various deep learning models for text classification, highlighting their architectures and performance metrics. However, several research gaps and potential areas for further exploration can be identified, as listed below.

  • While the authors compare their proposed models with current cutting-edge models, there is a lack of comprehensive benchmarking across a wider range of datasets and classification challenges. Further research should explore the models' performance under diverse conditions to assess their generalizability.

  • The literature discusses attention mechanisms in several models, but there is limited discussion on the interpretability of these mechanisms. Research could delve into understanding how attention is applied and whether it aligns with linguistic or semantic importance, enhancing the models’ interpretability.

  • Although some studies mention robustness to reduced training data, further investigation is needed to assess the models' performance across various domains and genres when data availability is limited.

  • Several researchers introduce hybrid models, but a systematic evaluation comparing the efficacy of the different hybrid architectures is missing.

  • Some researchers briefly touch upon applications in medical document classification, but there is a need for more in-depth exploration of domain-specific challenges and the adaptability of these models to diverse industries beyond the benchmark datasets mentioned.

Addressing these research gaps can contribute to a more comprehensive understanding of the strengths and limitations of the proposed models, facilitate their practical application in real-world scenarios, and clarify the impact of combining various neural network components on classification performance.

Methodology

Flaws are inevitable in neural network models, but they can be minimized by combining different designs. In this paper, we therefore look at how layers from different architectures can be added to classic neural network models and how they affect model performance.

Depending on the architecture adopted, recurrent neural networks (RNNs) can take two forms: the gated recurrent unit (GRU) [20] and long short-term memory (LSTM) [19]. The ability of the LSTM layer (Fig. 1) to maintain long-range relationships is its greatest asset [21], whereas the GRU layer (Fig. 2) does not retain them as well. The LSTM is particularly useful for dealing with the vanishing gradient problem; the GRU, on the other hand, has the advantage that training is faster because the amount of training data it needs is significantly smaller [22]. The hyperbolic tangent activation function is used at the output of both the LSTM and the GRU unit, which makes it possible to extract information even after a long period of time. GRUs are easier to train than LSTMs because they require fewer data points and can deliver comparable or even better performance, and they need less computing space and time. This has led researchers to use GRUs instead of LSTMs when long-range memory is not strictly necessary. A more sophisticated LSTM variant is the Bidirectional LSTM (BiLSTM), which can capture long-term dependencies in both the forward and the backward direction [23]; a classical LSTM is a unidirectional network that captures dependencies in only one direction.
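To illustrate the practical difference between these layer types, the following minimal sketch (assuming TensorFlow/Keras; the layer sizes are illustrative and not the settings used in this study) builds one tiny model per recurrent layer type and prints its parameter count, showing that the GRU is the lightest and the BiLSTM the heaviest of the three.

```python
# Minimal sketch (assuming TensorFlow/Keras): build one tiny model per recurrent
# layer type and compare parameter counts. Sizes are illustrative only.
import tensorflow as tf
from tensorflow.keras import layers


def recurrent_block(kind: str, units: int = 64) -> tf.keras.Model:
    """Return a small model whose only recurrent layer is of the given kind."""
    inputs = layers.Input(shape=(None, 100))                   # (timesteps, embedding_dim)
    if kind == "lstm":
        x = layers.LSTM(units)(inputs)                         # forget/input/output gates
    elif kind == "gru":
        x = layers.GRU(units)(inputs)                          # update/reset gates only
    else:  # "bilstm"
        x = layers.Bidirectional(layers.LSTM(units))(inputs)   # forward + backward pass
    outputs = layers.Dense(1, activation="sigmoid")(x)
    return tf.keras.Model(inputs, outputs)


for kind in ("gru", "lstm", "bilstm"):
    print(kind, recurrent_block(kind).count_params())          # GRU < LSTM < BiLSTM
```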

Fig. 1

Long short-term memory (LSTM) unit and the gradient problem

Fig. 2

Gated recurrent units

An earlier version of this RNN model included two layers of GRU [24] and three layers of LSTM. In this paper, the authors attempt to improve the model by using BiLSTM and GRU layers to build a hybrid RNN model. Figure 3 shows a BiLSTM cell that compares the text with both the next and the previous word, predicting the relationship substantially better than a conventional LSTM cell. The “Green” path represents the backward data flow, whereas the “Red” path represents the forward data flow, as in a standard LSTM cell. The data (x_0, x_1, x_2, ..., x_i) are compared in “red” with the next word in each pass and in “green” with the previous word, and the final result is (y_0, y_1, y_2, ..., y_n). The proposed hybrid model for the current study is RNN + 2-BiLSTM + 2-GRU, which is compared against the RNN + 3-LSTM + 2-GRU, RNN + 4-GRU, and RCNN + 4-LSTM models.

Fig. 3

Bidirectional LSTM (BiLSTM) Units

Figure 4 depicts the enhancement of a typical RNN [25] model by integrating two GRU layers and two Bidirectional LSTM layers. The performance of this hybrid model is compared with three other hybrid models: a traditional RNN model with 2 layers of GRU and 3 layers of LSTM (Fig. 5), a standard RNN model extended with 4 layers of GRU (Fig. 6), and a traditional RCNN model [26] improved by incorporating 4 LSTM layers in addition to the convolutional layers, as shown in Fig. 7. As shown in Fig. 1, the LSTM cells used in the hybrid RNN and RCNN models contain a forget gate together with an input gate and an output gate.
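The following sketch shows one plausible Keras realization of the proposed RNN + 2-BiLSTM + 2-GRU stack of Fig. 4; the vocabulary size, sequence length, layer widths, and number of output classes are illustrative placeholders rather than the exact configuration used in this study.

```python
# A sketch of one plausible Keras realization of the RNN + 2-BiLSTM + 2-GRU hybrid
# (Fig. 4). VOCAB_SIZE, MAX_LEN, layer widths, and NUM_CLASSES are assumptions,
# not the exact configuration reported in this study.
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE, EMBED_DIM, MAX_LEN, NUM_CLASSES = 20000, 100, 60, 3

inputs = layers.Input(shape=(MAX_LEN,), dtype="int32")
x = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inputs)                    # e.g. GloVe-initialized
x = layers.SimpleRNN(64, return_sequences=True)(x)                     # plain RNN layer
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)    # BiLSTM layer 1
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)    # BiLSTM layer 2
x = layers.GRU(64, return_sequences=True)(x)                           # GRU layer 1
x = layers.GRU(64)(x)                                                  # GRU layer 2 (final state)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

The comparison models of Figs. 5, 6 and 7 can be sketched analogously by swapping or removing the recurrent layers.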

Fig. 4

RNN + 2-GRU Layers + 2-BiLSTM Layers

Fig. 5

3-LSTM Layers + RNN + 2-GRU Layers

Fig. 6

RNN + 4-GRU Layers

Fig. 7

RCNN + 4-LSTM Layers

The output gate layer is the most important layer for maintaining long-term reliability. Based on the current cell's input vector, the previous cell's output vector, and the cell's previous state, the forget gate layer computes a value, which is also passed on to the input gate. In the LSTM's forget gate layer, values are generated by a sigmoid function combined with a point-wise multiplication operator. The input gate performs two tasks: the sigmoid activation function is applied to the incoming vector data, and the result is then combined with the value of the hyperbolic tangent activation function.

The previous cell state is then updated with the new values generated by this combination. The final gate, the output gate layer, combines the hyperbolic tangent and sigmoid activation functions, applying them to the updated cell state and the incoming input vector. Unlike the LSTM, a fully gated GRU comprises only an update gate and a reset gate. Although it was introduced more recently, the GRU is ideal in some circumstances since it needs a smaller dataset and takes less time to train. Since the LSTM contains dedicated input, forget, and output gates, it is clearly more complex; the GRU can be used to avoid this complexity, which gives better control over a model that contains GRU units.
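For reference, a standard textbook formulation of the LSTM gates described above (a general form, not reproduced from the cited works) is given below, with $\sigma$ the sigmoid function, $x_{t}$ the current input vector, $h_{t-1}$ the previous cell's output, and $c_{t-1}$ the previous cell state:

$$f_{t}=\sigma\left(W_{f}\left[h_{t-1},x_{t}\right]+b_{f}\right),\quad i_{t}=\sigma\left(W_{i}\left[h_{t-1},x_{t}\right]+b_{i}\right)$$
$$\tilde{c}_{t}=\text{tanh}\left(W_{c}\left[h_{t-1},x_{t}\right]+b_{c}\right),\quad c_{t}=f_{t}\odot c_{t-1}+i_{t}\odot \tilde{c}_{t}$$
$$o_{t}=\sigma\left(W_{o}\left[h_{t-1},x_{t}\right]+b_{o}\right),\quad h_{t}=o_{t}\odot \text{tanh}\left(c_{t}\right)$$

The fully gated GRU replaces these with an update gate and a reset gate acting directly on the hidden state, which is why it has fewer parameters and trains faster.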

On the basis of these considerations, four models with integrated LSTM units, GRUs, and combined BiLSTM-GRU layers are trained and their performance is assessed. The GloVe representation of the Twitter dataset makes it easier to train the models by accurately capturing the syntactic and semantic representation of words. GloVe's [26] flaw is its inability to handle terms that are outside its vocabulary, and deep learning model training requires a lot of data, which escalates memory needs. Although GloVe serves a similar purpose to Word2Vec [27], its training process is not hampered by the weights attached to frequent word pairs. Owing to the advantages mentioned above, the models considered in this study can be trained using vectorized data from the Twitter dataset with GloVe, which offers interesting linear word substructures in vector space.
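A minimal sketch of how GloVe vectors can be turned into an embedding matrix for the models above is shown below; the file name (the publicly released glove.twitter.27B.100d.txt) and the tokenizer word index are assumptions for illustration, not details reported in this study. Out-of-vocabulary words keep a zero vector, mirroring the GloVe limitation mentioned above.

```python
# Sketch: turn pre-trained GloVe vectors into an embedding matrix for the models above.
# The file name and the `word_index` (tokenizer vocabulary) are assumptions.
import numpy as np

EMBED_DIM = 100

def load_glove(path: str) -> dict:
    """Parse the GloVe text format: one token followed by its vector per line."""
    vectors = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return vectors

def build_embedding_matrix(word_index: dict, glove: dict) -> np.ndarray:
    """Row i holds the GloVe vector of the word with tokenizer index i."""
    matrix = np.zeros((len(word_index) + 1, EMBED_DIM), dtype="float32")
    for word, idx in word_index.items():
        if word in glove:
            matrix[idx] = glove[word]      # out-of-vocabulary words keep the zero vector
    return matrix

# Usage (hypothetical paths/objects):
# glove = load_glove("glove.twitter.27B.100d.txt")
# embedding_matrix = build_embedding_matrix(tokenizer.word_index, glove)
```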

Evaluation

This research employs 83,267 rows of Twitter data with the hybrid deep learning technique. Precision and recall are the two factors used to evaluate the models. Precision is the ratio of true positive predictions to all positive predictions (Eq. 1), while recall is the ratio of true positive predictions to the sum of true positive and false negative predictions (Eq. 2). Recall reflects the model's sensitivity, while precision highlights its exactness. Increasing precision reduces the incidence of false positives but tends to produce more false negatives, and a rise in false negatives lowers the recall value. Since precision and recall have opposing tendencies, a model is useful for a given application only when the two are suitably balanced. The F1 score quantifies the area below the precision-recall curve of a model and provides an indication of the combined effect of precision and recall. It is derived from the harmonic mean of precision and recall (Eq. 3) and is an important statistic for evaluating model performance.

$$Precision\ \left(P\right)=\frac{True\ Positive}{True\ Positive+False\ Positive}$$
(1)
$$Recall\ \left(R\right)=\frac{True\ Positive}{True\ Positive+False\ Negative}$$
(2)
$$F_{1}=2\left[\frac{P\times R}{P+R}\right]$$
(3)
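A small sketch that computes Eqs. (1)-(3) directly from raw prediction counts is given below; the counts in the example call are toy values, not results from this study.

```python
# Sketch: compute Eqs. (1)-(3) from raw prediction counts; the counts in the example
# call are toy values, not results from this study.
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple:
    precision = tp / (tp + fp)                                 # Eq. (1)
    recall = tp / (tp + fn)                                    # Eq. (2)
    f1 = 2 * (precision * recall) / (precision + recall)       # Eq. (3), harmonic mean
    return precision, recall, f1

print(precision_recall_f1(tp=70, fp=30, fn=20))                # (0.70, 0.778, 0.737)
```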

Results and Discussions

The graphs below compare the proposed RNN-BiLSTM-GRU hybrid model with the other models. Figures 8 and 9 show the result plots for the four models in terms of precision and recall. The RNN models routinely beat the RCNN model, even though, after 18 iterations, RNN and RCNN provide nearly similar precision and recall values. However, as illustrated in Figs. 8 and 9, compared with the RNN + BiLSTM + GRU and RNN + LSTM + GRU hybrid models, the RNN model embedded with GRU layers requires less time and data to achieve high precision.

Fig. 8

Assessment of precision for Deep Learning Models

Fig. 9

Assessment of recall for Deep Learning Models

The recall-precision curves, shown in Figs. 10, 11, 12 and 13, are extremely important for assessing effectiveness. A polynomial curve fitted with the least-squares approach is used to display the findings, and the area under each curve corresponds to the F1 score, indicating the model with the best balance of recall and precision. Compared with the RNN + 2-BiLSTM + 2-GRU, RNN + 3-LSTM + 2-GRU, and RNN + 4-GRU models, the RCNN + 4-LSTM model spreads over a larger area, indicating a wider range of recall-precision values.
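The following sketch (using NumPy; the sample recall-precision points are placeholders, not the measured values behind Figs. 10, 11, 12 and 13) shows one way to fit such a least-squares polynomial curve and compute the area under it.

```python
# Sketch: summarize a recall-precision curve (as in Figs. 10-13) with a least-squares
# polynomial fit and the area under the fitted curve. The sample points are placeholders.
import numpy as np

recall = np.array([0.10, 0.30, 0.50, 0.70, 0.90])
precision = np.array([0.95, 0.88, 0.80, 0.72, 0.60])

coeffs = np.polyfit(recall, precision, deg=2)          # least-squares polynomial fit
fitted = np.poly1d(coeffs)

grid = np.linspace(recall.min(), recall.max(), 200)
values = fitted(grid)
area = np.sum((values[1:] + values[:-1]) / 2 * np.diff(grid))   # trapezoidal area
print(f"area under fitted precision-recall curve: {area:.3f}")
```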

Fig. 10

Precision-to-recall deviation for the RCNN + 4-LSTM Model

Fig. 11

Precision-to-recall deviation for the RNN + 4-GRU Model

Fig. 12

Precision-to-recall deviation for the RNN + 3-LSTM + 2-GRU Model

Fig. 13

Precision vs. recall for the RNN-BiLSTM-GRU Model

Figure 14 shows the difference more clearly, and Fig. 15 shows the corresponding variation in loss. The average F1 scores for the RNN + 2-BiLSTM + 2-GRU, RNN + 3-LSTM + 2-GRU, RNN + 4-GRU, and RCNN + 4-LSTM models are shown in Table 1. The results show that the RNN + 4-GRU model is 10% more efficient than RCNN + 4-LSTM and 4% more efficient than RNN + 3-LSTM + 2-GRU, while the difference between the RNN + 4-GRU and RNN + 2-BiLSTM + 2-GRU models is 0.3%. The initial slopes of accuracy and loss against normalized time agree well for RNN + 4-GRU and RNN + 2-BiLSTM + 2-GRU, whereas the other models differ significantly from the RNN + 4-GRU curves. The F1 curve of the GRU-based RNN model is higher than those of the RCNN + 4-LSTM and RNN + 3-LSTM + 2-GRU models because its GRU layers require a shorter training time and a smaller dataset.

Fig. 14

Comparison of Accuracy for Deep Learning Models

Fig. 15

Variation of loss in values for Deep Learning Models

The RNN + 4-GRU model cannot resolve long-term dependencies, since it relies solely on GRU layers. This disadvantage is, however, satisfactorily reduced by including BiLSTM layers between the RNN and GRU layers. RNN + 2-BiLSTM + 2-GRU, which combines the advantages of the LSTM and GRU layers, is therefore recommended over RNN-LSTM-GRU where long-term dependencies are crucial, such as in text categorization. RNN-BiLSTM-GRU is also preferable to the RCNN variant: the LSTM layers present in RCNN can resolve long-term dependencies but require more time and data to train.

The relationships between accuracy and normalized time are depicted in Figs. 16 and 17. The RNN + 3-LSTM + 2-GRU model catches up with the RNN + 4-GRU and RNN + 2-BiLSTM + 2-GRU models after a normalized time of 0.75. Both models are almost equally accurate, suggesting that the RNN + 2-BiLSTM + 2-GRU model can replace the RNN + 4-GRU and RNN + 3-LSTM + 2-GRU models, since it retrieves long-term dependencies with a similar level of accuracy. The RNN + 3-LSTM + 2-GRU model has LSTM layers that require more training time, which results in the discrepancy in slopes between normalized times 0 and 0.6.
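One way to obtain accuracy-versus-normalized-time curves such as those in Figs. 16 and 17 is sketched below (assuming Keras); the callback records wall-clock time and training accuracy at each epoch, and the time axis is rescaled to [0, 1] after training.

```python
# Sketch (assuming Keras): record accuracy against wall-clock time during training,
# then normalize time to [0, 1] to produce curves like Figs. 16 and 17.
import time
import tensorflow as tf

class TimedHistory(tf.keras.callbacks.Callback):
    def on_train_begin(self, logs=None):
        self.start = time.time()
        self.times, self.accuracies = [], []

    def on_epoch_end(self, epoch, logs=None):
        self.times.append(time.time() - self.start)      # elapsed seconds
        self.accuracies.append(logs.get("accuracy"))     # training accuracy

# Usage (hypothetical): history = TimedHistory()
# model.fit(x_train, y_train, epochs=20, callbacks=[history])
# normalized_time = [t / history.times[-1] for t in history.times]
```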

Table 1 A comparison of average precision, average recall, and F1 score
Fig. 16

Deep learning model performance evaluation

By matching the slope, the GRU layers compensate for the initial delay when training continues beyond a normalized time of 0.75. The F1 score is determined by the area under the precision-recall curve (Fig. 17). The differences in F1 values for the models are shown in Fig. 18. The RNN + 2-BiLSTM + 2-GRU model consistently improves on the RNN + 4-GRU model, while the RNN + 3-LSTM + 2-GRU and RCNN + 4-LSTM models have average F1 values that are higher than the other models but marginally lower than that of RNN + 4-GRU. The same comparison is depicted graphically in Figs. 16 and 19.

Fig. 17

Comparison of precision-recall for Deep Learning Models

Fig. 18

Variation of F1 Score for Deep Learning Models

Fig. 19

P-values for Text Classification

Figure 19 displays the p-values from the paired t-test for each dataset, along with the significance threshold of 0.05, shown as a red dashed line. Each p-value indicates the statistical significance of the difference between the accuracies of Model 1 and Model 2, as shown in Fig. 20.
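A sketch of the paired t-test behind these p-values is shown below (assuming SciPy); the per-dataset accuracy lists are illustrative placeholders, not the reported results.

```python
# Sketch (assuming SciPy) of the paired t-test behind the reported p-values; the
# per-dataset accuracies below are illustrative placeholders, not the study's results.
from scipy import stats

model_1_acc = [0.81, 0.79, 0.84, 0.78, 0.82]   # accuracy of model 1 per dataset/fold
model_2_acc = [0.76, 0.77, 0.80, 0.74, 0.79]   # accuracy of model 2 on the same splits

t_stat, p_value = stats.ttest_rel(model_1_acc, model_2_acc)    # paired t-test
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, significant: {p_value < 0.05}")
```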

Fig. 20

ROC curve for the Models

Conclusions

The current comprehensive study tackled a hybrid text classification approach by fitting a model and evaluating its performance in terms of F1 score, recall, and accuracy. It was observed that developing a model for a particular purpose requires trade-offs between aspects such as training time, dataset size, and long-term dependency management in order to establish an effective application model. The RNN + 2-BiLSTM + 2-GRU model proved to be more accurate than RNN + 4-GRU and RNN + 3-LSTM + 2-GRU, with the BiLSTM layers handling long-term dependencies and the GRU layers enabling quick model training. The proposed RNN + 2-BiLSTM + 2-GRU hybrid model has an average F1 score of 0.76, the RNN with 4 GRU layers has an average F1 score of 0.77, and the RCNN with 4 LSTM layers has an average F1 score of 0.69. Moreover, thanks to its BiLSTM layers, the RNN + 2-BiLSTM + 2-GRU model is able to maintain long-term dependencies without storing redundant context information, even though it requires a slightly longer training period and a slightly larger dataset.