1 Introduction

The World Health Organization (WHO) declared the COVID-19 infectious disease a global pandemic in March 2020 in view of its exponential rise. The virus crossed geographical boundaries and spread across the world. The spread is all the more concerning because the disease is closely related to respiratory tract infections. Its symptoms include fever, cold and cough, breathlessness and diarrhea. In the worst cases, the disease can cause pneumonia and lead to death [1, 2].

The incubation period of the disease has been estimated at 14 days, although in some cases it can be longer [3]. The disease is contagious: it spreads from person to person through respiratory droplets and contact with people suffering from COVID-19. To date, no definite medicines or vaccines have been developed for it; only preventive and awareness measures such as social distancing, wearing masks, patient isolation and travel restrictions have been made compulsory for everyone in society [4]. The success of these protocols, which have been put in place in almost all countries, depends on people actually following them. Since its first outbreak in China, the disease has spread to different countries irrespective of community, time, climate or geographical region. Because it is spreading exponentially and putting a large number of lives at risk, it demands urgent attention, including prediction of its mortality rate and an assessment of the associated risks. The epidemic has also created various challenges for healthcare workers and public health communities because of the uncertainties associated with the disease, such as the exact transmission medium, its treatment, recovery rates and the prediction of risks [5].

Predicting and modeling the spread, effects, impact and mortality rate of such an epidemic outbreak is a difficult and challenging task. In comparison to traditional modeling approaches, AI-oriented models and deep learning algorithms can serve as an effective tool for predicting patients’ health conditions and the associated risks, and can thus support healthcare decision-making [6]. The advantage of AI-oriented models lies in their competence in handling continuous data involving unpredictable uncertainties, whereas traditional prediction models use only input data. As a result, AI-oriented models can provide coherent results. Efficient and accurate prediction of the mortality rate and the associated risks can help healthcare workers at hospitals to deal with patients on a priority basis [7].

Thus, the main objective of the present research is to employ deep learning algorithms for predicting the mortality rate based on the standard COVID-19 datasets from WHO and Johns Hopkins University. The study carries out a preprocessing step on the datasets to extract valuable features and to handle missing and redundant values. The processed datasets are then analyzed and visualized with SARIMAX, Auto SARIMA, LSTM, Auto ARIMA, the Facebook Prophet model, Holt’s linear model and a rolling window to predict the mortality rate for COVID-19 patients within India. Thereafter, the performance of the models is compared using various error metrics.

The manuscript has been organized into various sections. Section 2 discusses the related work in the domain. The methodology is presented in Sect. 3 discussing various forecasting models. Evaluation scores of various models are presented in Sect. 4 and finally, the conclusion is presented in Sect. 5.

2 Related work

COVID-19 spread across the world like wildfire and was consequently declared a global pandemic by WHO. The spread of COVID-19 has drawn researchers all over the world to work unremittingly to understand the behavioral pattern of the virus. They have employed various approaches in their research. Among these, artificial intelligence (AI) has emerged as a promising solution and has proved its competence. AI has been widely employed to predict the count of COVID-19 cases and can thus be a great aid to medical professionals in sensitive medical decisions [8,9,10,11,12,13]. Consequently, numerous researchers have applied AI and its associated disciplines to COVID-19.

The motive behind selecting AI is the competence of AI and deep learning in healthcare, as established in the literature. Deep learning has been applied in healthcare for many years, although its use has risen exponentially during the past few years. For instance, the authors in [14] note that healthcare involves diverse types of data in the form of electronic health records (EHR), text, imaging, etc. This diversity prevents traditional data mining approaches from extracting efficient features from the data, so they fail to produce efficient prediction models. The latest developments in deep learning make it a strong choice for efficiently modeling complex data. Similarly, the authors in [15] discuss how the advent of deep learning can revolutionize healthcare. They claim that deep learning has immense potential in healthcare owing to the need for process automation. In [15], the authors discuss the importance of deep learning in healthcare from the perspective of the radiologist. They claim that in the coming era, deep learning will act as the basis for augmented radiology, which will aid in improving lifestyle while minimizing healthcare costs. Several other researchers have also tried to establish the role of AI and deep learning in healthcare.

For instance, Benvenuto et al. used the conventional ARIMA forecasting model to predict the mortality rate of COVID-19 using Johns Hopkins epidemiological data on the prevalence of the disease [16]. Similarly, Chakraborty et al. [17] addressed the problem of forecasting the current number of COVID-19 cases. Additionally, the authors in [17] focused on the spread of the novel virus in different countries. They employed a wavelet-based forecasting model to predict the count of COVID-19 cases for a period of 10 days in countries such as Canada, France, India, South Korea and the UK. They also applied regression tree analysis to find the principal factors determining COVID-19 casualties in various nations. The authors in [18] implemented different variations of ARIMA to predict COVID-19 cases for countries such as Italy, France and Spain. The work revealed that the model is suitable for recognizing the effects of COVID-19 so that the epidemiological phase of these nations can be researched further.

Further, Chintalapudi et al. [19] also implemented an ARIMA model to predict COVID-19 cases for the Italian Ministry of Health. The work attained an accuracy of 93.75% and 84.4% for COVID-19 cases and recovered cases respectively for the period from mid-February to the end of March 2020. The model in [19] indicated a 35% reduction in the registration of new cases.

Further, the authors in [20] developed a classifier using data mining and hybrid deep learning, named the deepsense classifier, to classify COVID-19 patients with respect to the health condition of their lungs. Various ML forecasting models for COVID-19 were also simulated by the authors in [21], who implemented ARIMA, CUBIST, RF, RIDGE, SVR and a stacking-ensemble method for time series analysis and prediction. In these simulations, the models showed a prediction error within the range of 0.87–3.51%, 1.02–5.63% and 0.95–6.90% for forecasting periods of 1 day, 3 days and 6 days respectively. Similarly, the authors in [22] adopted the SEIR and regression models to predict COVID-19 cases. The efficacy of the model was evaluated by the root mean squared logarithmic error (RMSLE) and the reproduction number \(\left({R}_{0}\right)\). A similar line of research was carried out by the authors in [23], who conducted a survey with 194,909 COVID-19 patients from March 18, 2020. They adopted a multivariate regression test to trace chronic conditions and habits such as smoking, which is considered a major risk factor for disease progression. The work established that smokers are more prone than non-smokers to abnormal COVID-19 symptoms requiring ICU care and ventilators.

An attempt to understand the symptoms and signs of COVID-19 was also made by the authors in [24] using regression analysis. The research was carried out on 1480 patients who showed influenza-like symptoms. The experiment yielded 58% positive cases and 15% negative cases. The patients who tested positive for COVID-19 were also assessed for taste and odor impairment, which came out to be 68% and 71% respectively. Apart from taste and odor, the study also examined other factors such as age, insomnia and sore throat. The study concluded that sore throat can be dangerous for COVID-19 and loss of olfaction. Additionally, the authors in [25] suggested a prognostic forecasting model based on three indices to predict the mortality rate and associated risks. They further suggested a clinical pathway for recognizing cases according to their severity, which will help doctors to diagnose and identify them in time and thereby minimize the mortality rate.

3 Methodology

This section discusses the various phases of the methodology adopted in the present study. The steps are described below:

3.1 Data collection

The dataset pertaining to COVID-19 is available from various sources. In this study, the dataset is collected from https://covid19.who.int/. The dataset consists of various parameters, viz. date, daily confirmed, total confirmed to date, daily recovered, total recovered to date, daily deceased and total deceased. The considered dataset starts from 30th January and records an entry each day.

3.2 Data pre-processing

The data in the dataset is in raw format, which prevents its direct application. Hence, the data needs to be preprocessed so that it can be recognized by pandas. During preprocessing, the date attribute in the dataset is set as the index column of the DataFrame. In order to apply time series forecasting, stationarity of the time series is required. To check stationarity, the Augmented Dickey-Fuller (ADF) test, one of the most widely used statistical tests for studying stationary sequences, is performed [26, 27]. The ADF test results reveal that the series for daily confirmed cases and total confirmed cases are non-stationary. Further, the autocorrelation function (ACF) and partial autocorrelation function (PACF) of the time series observations are measured. Finally, various forecasting models are applied in order to identify the most efficient one. For this purpose, various error metrics are used, viz. mean absolute error (MAE), mean squared error (MSE) and root mean square error (RMSE).
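The two diagnostics mentioned above can be illustrated with a small dependency-free sketch. It is not the study's code: the series is made up, the helper names `difference` and `acf` are hypothetical, and differencing is shown only as one common remedy for non-stationarity, which the text does not state was used here.

```python
def difference(series, lag=1):
    """Return the lag-differenced series: y[t] - y[t - lag]."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]


def acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    denom = sum(c * c for c in centered)
    return [
        sum(centered[t] * centered[t - k] for t in range(k, n)) / denom
        for k in range(max_lag + 1)
    ]


# Hypothetical cumulative counts with a quadratic trend (illustration only).
total_confirmed = [t * t for t in range(1, 31)]
daily_change = difference(total_confirmed)   # first difference, still trending
second_diff = difference(daily_change)       # second difference, constant here

print(acf(total_confirmed, 3))  # slowly decaying values suggest a trend
print(second_diff[:5])
```

A slowly decaying ACF, as produced by the trending toy series, is a typical sign of non-stationarity; a formal check would still rely on a test such as ADF.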

3.3 Forecasting models

This section discusses various forecasting models used for prediction. Some of these models are as follows:

3.3.1 Averaging model

The averaging model is the most conventional forecasting model, in which the data value at any future instance is predicted from historical values. Multiple variants of the averaging model exist. The most basic is the simple moving average (SMA) model, which predicts a value as the average of recent historical data values. In the SMA, short-term averages react quickly to changes in values, while long-term averages take longer to react. Each historical value contributes equally to the predicted value, since equal weight is given to each historical value. A variation of SMA that assigns different weights to different historical values is known as the weighted moving average (WMA) model. In WMA, the most recent historical value is generally assigned the maximum weight in comparison to older values. WMA is further extended into the exponential moving average (EMA), in which the weights decay exponentially with age, so that the most recent data value is assigned the highest weight and the least recent data value the lowest. Finally, a rolling window, another type of averaging model, calculates a statistic on a fixed number of contiguous previous observations for prediction. In the current manuscript, the authors have illustrated different rolling window sizes in the range of 1 to 24 days for prediction. The rolling window averaging model for total confirmed and daily confirmed cases with window sizes 2, 3 and 7 is demonstrated in Figs. 1 and 2 respectively.
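The three averaging variants described above can be sketched in a few lines of plain Python. This is an illustrative sketch only: the counts are made up, the linear weights in `wma` and the smoothing factor passed to `ema` are assumptions, and the function names are hypothetical rather than taken from the study.

```python
def sma(values, window):
    """Simple moving average: equal weight for each of the last `window` values."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


def wma(values, window):
    """Weighted moving average with linearly increasing weights 1..window."""
    weights = list(range(1, window + 1))   # newest value gets the largest weight
    total = sum(weights)
    return [
        sum(w * v for w, v in zip(weights, values[i - window + 1 : i + 1])) / total
        for i in range(window - 1, len(values))
    ]


def ema(values, alpha):
    """Exponential moving average: s[t] = alpha * y[t] + (1 - alpha) * s[t-1]."""
    smoothed = [values[0]]
    for v in values[1:]:
        smoothed.append(alpha * v + (1 - alpha) * smoothed[-1])
    return smoothed


daily_confirmed = [10, 12, 15, 11, 18, 20, 25, 23, 30, 28]  # made-up counts
for window in (2, 3, 7):
    print(window, sma(daily_confirmed, window)[-1])
```

Shorter windows track the latest values more closely, which matches the observation that short-term averages react faster than long-term ones.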

Fig. 1 Moving average for 2, 3 and 7 days

Fig. 2 Exponentially weighted moving average for 2, 3 and 7 days

3.3.2 Auto ARIMA model

The Autoregressive Integrated Moving Average (ARIMA) model combines three components, viz. autoregression (AR), integration (I) and moving average (MA). Readers can refer to [28] for a detailed understanding of ARIMA. Here, ARIMA(p, d, q) refers to AR(p), I(d) and MA(q). The general forms of \(AR(p)\) and \(MA\left(q\right)\) are as follows [29]:

$$Y_{t} = \alpha_{1} Y_{t - 1} + \alpha_{2} Y_{t - 2} + \alpha_{3} Y_{t - 3} + \ldots + \alpha_{p} Y_{t - p} + \varepsilon_{t}$$
$$Y_{t} = \beta_{1} \varepsilon_{t - 1} + \beta_{2} \varepsilon_{t - 2} + \beta_{3} \varepsilon_{t - 3} + \ldots + \beta_{q} \varepsilon_{t - q} + \varepsilon_{t}$$

where \(\alpha_{i}\) and \(\beta_{j}\) are the parameters of the autoregression and moving average respectively. \({Y}_{t}\) and \({\varepsilon }_{t}\) represent the value and error of the series respectively at instance \(t\).
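As a worked illustration of the two forms, the sketch below evaluates a one-step AR prediction from past values and a one-step MA prediction from past errors, with the unknown current noise term taken as zero. The coefficients and data are arbitrary, and `ar_predict` and `ma_predict` are hypothetical helper names; fitting the coefficients is outside the scope of this sketch.

```python
def ar_predict(history, alphas):
    """One-step AR(p) prediction: sum_i alpha_i * Y[t-i], noise term taken as 0."""
    p = len(alphas)
    recent = history[-p:][::-1]   # Y[t-1], Y[t-2], ..., Y[t-p]
    return sum(a * y for a, y in zip(alphas, recent))


def ma_predict(errors, betas):
    """One-step MA(q) prediction: sum_j beta_j * eps[t-j], current noise taken as 0."""
    q = len(betas)
    recent = errors[-q:][::-1]    # eps[t-1], eps[t-2], ..., eps[t-q]
    return sum(b * e for b, e in zip(betas, recent))


# Arbitrary illustration values, not fitted to any data.
print(ar_predict([1.0, 2.0, 3.0], alphas=[0.5, 0.25]))   # 0.5*3.0 + 0.25*2.0
print(ma_predict([1.0, -2.0], betas=[0.5, 0.25]))        # 0.5*(-2.0) + 0.25*1.0
```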

The best values for the parameters p, d and q are determined by minimizing the Akaike information criterion (AIC) and Bayesian information criterion (BIC) values. The forecast obtained through the ARIMA model is illustrated in Fig. 3. ARIMA is further extended into the seasonal ARIMA (SARIMA) model, which can handle seasonality in the dataset and thus provides efficient support for seasonal univariate time series data. The Auto SARIMA model automatically chooses the best non-seasonal p, d, q parameters; otherwise, the parameters (p, d, q) of SARIMA are chosen using ACF and PACF graphs or the AIC value. The forecast obtained through the SARIMA model is illustrated in Fig. 4. Furthermore, another variant of the ARIMA model, known as SARIMAX, supports exogenous variables. The SARIMAX model for daily confirmed cases is illustrated in Fig. 5, where the training data is shown by the blue line and the actual and predicted values for the test data are shown by the black and pink lines respectively.

Fig. 3 Auto ARIMA forecast of total confirmed cases

Fig. 4 Auto seasonal ARIMA forecast of total confirmed cases

Fig. 5 SARIMAX forecasting of confirmed cases

3.3.3 Holt-Winters model

The authors of the current manuscript also implemented the Holt-Winters forecasting method. The model is primarily characterized by three components, viz. level, trend and seasonality, and hence uses triple exponential smoothing to handle them. Here, the seasonal period indicates the number of time steps in one seasonal cycle. The trend component is either additive or multiplicative. Basically, this model takes a weighted average of past values, assigning exponentially decaying weights to past observations to capture the trend present in the time series [30]. The general form of this model is as follows:

$$\hat{Y}_{T + 1| T} = \alpha Y_{T} + \alpha \left( {1 - \alpha } \right)Y_{T - 1} + \alpha \left( {1 - \alpha } \right)^{2} Y_{T - 2} + \ldots$$

Here, the smoothing parameter \(\alpha\), which lies between 0 and 1, tunes the response of the model. The Holt-Winters model for daily confirmed cases is illustrated in Fig. 6. In this figure, the green line indicates the training data, whereas the actual and forecasted values of the test data are represented by the red and green lines respectively.
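The weighted-sum form of the equation can be checked numerically against the usual recursive update. The sketch below covers only this exponentially weighted level term, not the trend and seasonal components of the full method; the data, the value of the smoothing parameter and the function names are assumptions made for illustration.

```python
def ses_forecast_weighted(values, alpha):
    """Forecast as the explicit weighted sum alpha * (1 - alpha)**k * Y[T-k]."""
    return sum(
        alpha * (1 - alpha) ** k * y
        for k, y in enumerate(reversed(values))
    )


def ses_forecast_recursive(values, alpha, initial=0.0):
    """Same forecast via the recursion s = alpha * y + (1 - alpha) * s."""
    s = initial
    for y in values:
        s = alpha * y + (1 - alpha) * s
    return s


series = [10, 20, 30]   # made-up observations, oldest first
print(ses_forecast_weighted(series, alpha=0.5))    # 0.5*30 + 0.25*20 + 0.125*10
print(ses_forecast_recursive(series, alpha=0.5))
```

With a zero initial state the two forms agree, which shows why the weights on older observations shrink geometrically by a factor of \(1 - \alpha\) per step.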

Fig. 6 Holt-Winters model forecasting of daily confirmed cases

3.3.4 Holt’s linear model

Holt’s linear trend model maps the trend in the series without any assumptions. It primarily uses two parameters, one for smoothing the overall level and the other for smoothing the trend. The forecast obtained through this model is demonstrated in Fig. 7. It is evident from Fig. 7 that this model outperforms other comparative models.
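The two smoothing equations behind this model can be sketched as follows. This uses the standard textbook form of Holt's linear method, which is assumed rather than taken from the study; the initialisation of level and trend, the parameter values and the name `holt_linear_forecast` are illustrative choices.

```python
def holt_linear_forecast(values, alpha, beta, horizon):
    """Holt's linear trend method (double exponential smoothing).

    level_t = alpha * y_t + (1 - alpha) * (level_{t-1} + trend_{t-1})
    trend_t = beta * (level_t - level_{t-1}) + (1 - beta) * trend_{t-1}
    forecast(t + h) = level_t + h * trend_t
    """
    level = values[0]
    trend = values[1] - values[0]   # simple initialisation (an assumption)
    for y in values[1:]:
        previous_level = level
        level = alpha * y + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return [level + h * trend for h in range(1, horizon + 1)]


# A perfectly linear toy series is extrapolated along the same line.
print(holt_linear_forecast([2, 4, 6, 8, 10], alpha=0.5, beta=0.5, horizon=3))
```

Here `alpha` controls how quickly the level follows new observations and `beta` controls how quickly the slope estimate is revised.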

Fig. 7 Holt's linear model forecast of total confirmed cases

3.3.5 Recurrent neural network-LSTM model

The long short-term memory (LSTM) network is a special form of recurrent neural network (RNN) capable of learning order dependence for sequence prediction. In LSTM, data is given to the model in batches, and each layer comprises multiple nodes along with an activation function. The present work uses the tanh and ReLU activation functions, which are shown in Fig. 8. However, the ReLU activation function can cause some nodes to die without learning anything, i.e. the dying ReLU problem. Additionally, it may cause an explosion of activations, since its upper limit is infinite, which leads to unusable nodes. The authors use the Adam optimizer in the current work, which updates the network weights iteratively in response to the training data. Further, mean squared error is used as the loss to guide the iterations. The study also addresses overfitting through dropout regularization.
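The two ReLU issues mentioned above can be seen directly from the functions and their derivatives. The sketch below is a plain-Python illustration with hypothetical helper names; it is not the network used in the study.

```python
import math


def tanh(x):
    """Hyperbolic tangent: output bounded in [-1, 1]."""
    return math.tanh(x)


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


def relu(x):
    """Rectified linear unit: 0 for negative inputs, identity otherwise."""
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU: 0 for non-positive inputs, 1 for positive inputs."""
    return 1.0 if x > 0 else 0.0


for x in (-3.0, 0.5, 1000.0):
    print(x, tanh(x), relu(x), relu_grad(x))
```

For a negative input the ReLU gradient is zero, so a node whose inputs stay negative receives no weight updates (the dying ReLU problem), while for large positive inputs ReLU grows without bound whereas tanh saturates at 1.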

Fig. 8 Activation function graphs (a) Tanh (b) ReLU

3.3.6 Facebook Prophet model

The Facebook Prophet method is an additive regression time-series forecasting algorithm developed by Facebook. It has proven competent in handling time series with strong seasonal effects, missing data and outliers. The model can be written as follows [30]:

$$Y\left( t \right) = g\left( t \right) + s\left( t \right) + h\left( t \right) + \varepsilon \left( t \right)$$

Here, \(g\left(t\right)\), \(s(t)\) and \(h\left(t\right)\) represent the trend, seasonal changes and irregular (holiday) effects respectively, and \(\varepsilon \left(t\right)\) is the error term. Prophet is also resilient to missing data and pattern changes and can efficiently manage outliers. The Prophet model for total confirmed cases is illustrated in Fig. 9. The forecasting graph of the FB Prophet in-built model is illustrated in Fig. 10, where black dots represent the actual values and the blue line represents the forecasted values. The graph in Fig. 11 shows the seasonal decomposition, illustrating how the number of cases increases from Wednesday to Sunday. It also shows that Tuesday has the fewest cases.
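The additive structure of the model can be made concrete with a toy composition of the three components. This is not Prophet's implementation or API: the linear trend, the single 7-day Fourier term, the fixed holiday shift and all numeric values are assumptions chosen for illustration, and the noise term is omitted.

```python
import math


def trend(t, slope=50.0, intercept=1000.0):
    """g(t): a linear trend (Prophet itself fits piecewise trends)."""
    return intercept + slope * t


def weekly_seasonality(t, amplitude=30.0):
    """s(t): one Fourier term with a 7-day period."""
    return amplitude * math.sin(2 * math.pi * t / 7)


def holiday_effect(t, holidays, effect=-100.0):
    """h(t): a fixed shift applied on the listed day indices."""
    return effect if t in holidays else 0.0


def additive_model(t, holidays=frozenset()):
    """Y(t) = g(t) + s(t) + h(t), without the noise term eps(t)."""
    return trend(t) + weekly_seasonality(t) + holiday_effect(t, holidays)


# Day index 3 is treated as a hypothetical holiday.
for day in range(8):
    print(day, round(additive_model(day, holidays={3}), 2))
```

Because the components simply add, each one can be plotted or inspected separately, which is what a seasonal decomposition graph does.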

Fig. 9 Facebook Prophet forecast of total confirmed cases

Fig. 10 Facebook Prophet in-built forecasting graph of total confirmed cases

Fig. 11 Seasonal decomposition graph (a) seasonal trend in the entire data, (b) weekly decomposition graph, (c) monthly decomposition graph

4 Results and discussion

The current research aims to establish the efficiency of various forecasting models with reference to the COVID-19 dataset. For this purpose, the authors have taken the dataset from WHO. The dataset contains one entry for each day starting from Feb 2020. The collected data is split into training and testing data: the data up to Aug 25, 2020 is used for training and the data beyond Aug 25 is used as test data. The forecasting models discussed in the previous section have been implemented on the collected data. Further, as the problem under consideration is a class-balanced problem (the percentages of negative and positive classes are almost similar), the efficacy of these forecasting models is established through various error metrics, viz. RMSE, MAE and mean squared error (MSE).
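A chronological split of this kind might look like the sketch below. The records are made up, the helper name `chronological_split` is hypothetical, and treating the cutoff date itself as part of the training set is an assumption.

```python
from datetime import date, timedelta


def chronological_split(records, cutoff):
    """Split (date, value) records: train up to and including cutoff, test after."""
    train = [r for r in records if r[0] <= cutoff]
    test = [r for r in records if r[0] > cutoff]
    return train, test


# Hypothetical daily series around the cutoff (values are made up).
start = date(2020, 8, 20)
records = [(start + timedelta(days=i), 100 + i) for i in range(10)]

train, test = chronological_split(records, cutoff=date(2020, 8, 25))
print(len(train), len(test))
```

Unlike a random split, a date-based split keeps every test observation later than every training observation, so the models are always evaluated on data from the future relative to what they were fitted on.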

Error metrics are used to measure the performance and accuracy of a forecasting model. Here, error refers to the deviation of the predicted value from the actual value. A forecasting model exhibiting smaller values of the error metrics is considered to outperform the comparative models. The mathematical expressions for the considered error metrics are as follows:

The RMSE metric is used to estimate the accuracy of the prediction model in predicting COVID-19 cases. It represents the standard deviation of the differences between the predicted and actual values. RMSE is calculated as follows:

$$RMSE = \sqrt {\frac{1}{N}\mathop \sum \limits_{i = 1}^{N} \left( {\hat{x}_{i} - x_{i} } \right)^{2} }$$

Here, \({\widehat{x}}_{i}\) and \({x}_{i}\) indicate the predicted value and actual value respectively, and \(N\) is the number of observations.

MAE is the average of the absolute differences between the predicted and actual data values, and thus serves as an efficient measure of the effectiveness of a prediction model. The mathematical expression for MAE is as follows:

$$MAE = \frac{{\mathop \sum \nolimits_{i = 1}^{N} \left| {\hat{x}_{i} - x_{i} } \right|}}{N}$$

Like MAE, MSE averages the errors, but it uses the squared differences between the predicted and actual values. It is calculated as follows:

$$MSE = \frac{1}{N}\mathop \sum \limits_{i = 1}^{N} \left( {\hat{x}_{i} - x_{i} } \right)^{2}$$
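The three metrics can be computed in a few lines. The sketch below assumes the standard definitions, in which RMSE is the square root of MSE; the actual and predicted values are made up and the function names are illustrative.

```python
import math


def mae(actual, predicted):
    """Mean absolute error: average of |predicted - actual|."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean squared error: average of (predicted - actual)^2."""
    return sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error: square root of the MSE."""
    return math.sqrt(mse(actual, predicted))


actual = [100, 110, 120, 130]       # made-up observed counts
predicted = [98, 113, 120, 126]     # made-up forecasts

print(mae(actual, predicted), mse(actual, predicted), rmse(actual, predicted))
```

Because the errors are squared before averaging, MSE and RMSE penalize a few large misses more heavily than MAE does, which is worth keeping in mind when models are ranked by more than one metric.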

The scaled values of the considered error metrics for the different forecasting models are given in Table 1. From Table 1, it is evident that the Auto SARIMA model has the lowest values of MAE, MSE and RMSE. This establishes the superiority of the Auto SARIMA model for forecasting COVID-19 cases. The higher values of the error metrics for the other models indicate that they are less suitable for this forecasting scenario. The prime factor causing the higher error values is the lack of sufficient historical data. Hence, given the lack of historical data, it is established that the Auto SARIMA model can be used accurately and efficiently for forecasting COVID-19.

Table 1 Comparative analysis of various models in terms of error metrics

5 Conclusion

The present research has adopted various forecasting models and proves to be fruitful in predicting the mortality rate of COVID-19 and its associated risks, so that controlling and selective healthcare measures can be taken for critical to severe medical cases. An efficient forecasting model can be a great help to health professionals and governing authorities, allowing them to prepare well in advance. The efficiency of the various forecasting models is evaluated in terms of error metrics. However, the lack of long historical data prevents very high accuracy from being achieved. Still, it is established that Auto SARIMA can efficiently predict COVID-19 cases despite the limited data. Researchers and healthcare professionals across the world have been working continuously to deal with this epidemic outbreak [31, 32]. Moreover, the spread of such an epidemic is closely related to social awareness as well as to the stringent policies put forth by the government. It is worth noting that, above all, each individual has a moral responsibility as a citizen to obey the protocols set by the government so that, united, we can fight back against this pandemic. The current work can be further extended to refine the mortality rate predictions with different hybrid models.