Introduction

Yamuna is the largest tributary river of the Ganga in northern India. It originates from the Yamunotri glacier at a height of 6,387 m on the south western slopes of Banderpooch peaks (38° 59′ N 78° 27′ E) in the lower Himalayas in Uttarakhand. It travels a total length of 1,376 km by crossing several states, Uttarakhand, Haryana, Himachal Pradesh, Delhi, Uttar Pradesh and has a mixing of drainage system of 366,233 km2 before merging with the Ganga at Allahabad i.e., a total of 40.2 % of the entire Ganga basin. The river accounts for more than 70 % of Delhi’s water supplies and about 57 million people depend on river water for their daily usage (CPCB 2006).

Hathnikund is approximately 157 km downstream from Yamunotri and 2 km upstream from Tajewala barrage. Hathnikund barrage is 38 km downstream from Dakpathar and 2 km upstream from Tajewala barrage. Sample location (Hathnikund) provides water quality after joining of the tributaries Tons, Giri and Asan of the lower Himalaya region of Yamuna River. The study of quality of water at Hathnikund is important because after this station Yamuna River enters Delhi (capital of India) and accounts for more than 70 % of Delhi’s water supplies and about 57 million people depend on river water for their daily usage (CPCB 2006).

Pollution in river water is continuously increasing due to urbanization, industrialization etc. and most of the rivers are at dying position which is an alarming signal (Parmar et al. 2009; Phiri et al. 2005). Physico-chemical parameters, trace metals have effects of industrial wastes, municipal sewage, and agricultural runoff on river water quality (Akoto and Adiyiah 2007; Alam et al. 2007; Banu et al. 2007; Juang et al. 2008). The analysis of the simultaneous effect of water pollution and eutrophication on the concentration of dissolved oxygen (DO) in a water body shows that the decrease in the concentration of DO is much more than when only single effect is present in the water body, thus leading to more uncertainty about the survival of DO-dependent species (Kumar and Dua 2009; Shukla et al. 2008). Trihalomethanes compounds were determined in the drinking water samples at consumption sites and treatment plants of Okinawa and Samoa Islands and observed that the chloroform, bromodichloromethane compounds exceed the level of Japan water quality and World Health Organization (WHO) standards (APHA 1995; Imo et al. 2007; WHO 1971). Water quality modeling using hydrochemical data, multiple linear regression, structural equation, predictability, trend and time-series analysis provides major tools for application in water quality management (Attah and Bankole 2012; Chenini and Khemiri 2009; Fang et al. 2010; Singh et al. 2004; Su et al. 2011; Prasad et al. 2013; Seth et al. 2013). Water quality managers use regression equations to estimate constituent concentrations for comparison of current water quality conditions to water quality standards (Joarder et al. 2008; Korashey 2009; Psargaonkar et al. 2008; Singh et al. 2004; Vassilis et al. 2001; Ravikumar et al. 2013).

Climatic dynamic also plays an important role in determining the water quality standards using fractal dimensional analysis, trend and analyzed time-series data of three major dynamic components of the climate i.e., temperature, pressure and precipitation (Bhardwaj and Parmar 2013a, 2013b; Rangarajan 1997; Damodhar and Reddy 2013). It is analyzed that regional climatic models would not be able to predict local climate as it deals with averaged quantities and that precipitation during the southwest monsoon is affected by temperature and pressure variations during the preceding winter (Kahya and Kalayci 2004; McCleary and Hay 1980; Mousavi et al. 2008; Movahed and Hermanisc 2008; Park and Park 2009; Prasad and Narayana 2004; Rangarajan and Ding 2000; Rangarajan and Sant 2004).

Yamuna River is the most vulnerable polluted water body because of the role in carrying municipal, industrial wastes, and run offs from agriculture lands in vast drainage basins. Detailed water quality management research is needed to maintain water quality standards. The quality of Yamuna River water depends upon the quality of water parameters potential of hydrogen (pH), chemical oxygen demand (COD), biochemical oxygen demand (BOD), dissolved oxygen (DO), water temperature (WT), free ammonia (AMM) and total Kjeldahl nitrogen (TKN). In this paper, statistical analysis, time series, auto regressive integrated moving average (ARIMA), stationary R-square, R-square, root mean square error (RMSE), mean absolute percentage error (MAPE), mean absolute error (MAE), normalized Bayesian information criterion (BIC) of these water parameters have been estimated at Hathnikund (Haryana, India) of Yamuna River as shown in Fig. 1.

Fig. 1
figure 1

River Yamuna route map description of sample site

Methodology

The monthly average value of the last 10 years of water quality parameters pH, COD, BOD, AMM, TKN, DO, and WT observed by Central Pollution Control Board (CPCB) at Hathnikund of Yamuna River in Delhi (India) has been considered for the present study.

Statistical analysis

Statistical analysis is used to calculate mean, median, mode, standard deviation, kurtosis, skewness, and coefficient of variation. Mean explains average value; median gives the middle values of an ordered sequence or positional average; mode is defined as the value which occurs the maximum number of times that is having the maximum frequency; standard deviation gives a measure of “spread” or “variability” of the sample; kurtosis refers to the degree of flatness or peakedness in the region about the mode of a frequency curve; skewness describes the symmetry of the data; coefficient of variation gives the relative measure of the sample (Bhardwaj and Parmar 2013a, 2013b).

R-squared and stationary R-square

R-squared is an estimate of proportion of total variation in the series which is explained by the model and measure is useful when the series is stationary. Stationary R-squared is a measure that compares stationary part of the model to a simple mean model and is preferable to ordinary R-squared when there is a trend or seasonal pattern. Stationary R-squared can be negative with a range of negative infinity to 1. Negative values mean that the model under consideration is worse than the baseline model. Positive values mean that the model under consideration is better than the baseline model (Box et al. 2008; DeLurgio 1998; McCleary and Hay 1980).

RMSE

RMSE is a measure of variation of the dependent series from its model-predicted level, expressed in the same units as the dependent series (Box et al. 2008; DeLurgio 1998; McCleary and Hay 1980). RMSE of an estimator \( \hat{\theta } \) with respect to estimator parameter θ is defined as:

$$ {\text{RMSE}}\left( {\hat{\theta }} \right) = \sqrt {E\left[ {\left( {\hat{\theta } - \theta } \right)^{2} } \right]} $$
(1)

MAPE

MAPE is a measure of variation of dependent series from its model-predicted level. It is independent of the units used and can therefore be used to compare series with different units. It usually expresses accuracy as a percentage and is defined as:

$$ M = \frac{100\,\% }{n}\sum\limits_{i = 1}^{n} {\left| {\frac{{A_{i} - F_{i} }}{{A_{i} }}} \right|} $$
(2)

where At is the actual value and Ft is the forecast value. For perfect fit, the value of MAPE is zero, but for upper level the MAPE has no restriction (Box et al. 2008; DeLurgio 1998; McCleary and Hay 1980).

MAE

MAE measures variation of series from its model-predicted level and is reported in the original series units. Also, the MAE is a quantity used to measure variation of forecasts or predictions from the eventual outcome. It is given by

$$ {\text{MAE}} = \frac{1}{n}\sum\limits_{i = 1}^{n} {\left| {F_{i} - A_{\text{t}} } \right|} = \frac{1}{n}\sum\limits_{i = 1}^{n} {\left| {e_{i} } \right|} $$
(3)

where e i is absolute error, F i is prediction and A i is calculated value. It is a common measure of forecast error in time-series analysis (Box et al. 2008; DeLurgio 1998; McCleary and Hay 1980).

Maximum absolute percentage error (MaxAPE)

MaxAPE measures largest forecasted error, expressed as a percentage. This measure is useful for imagining a worst-case scenario for the forecasts (Box et al. 2008; DeLurgio 1998; McCleary and Hay 1980).

Maximum absolute error (MaxAE)

MaxAE measures largest forecasted error, expressed in same units as of dependent series. It is useful for imagining the worst-case scenario for the forecasts. MaxAE and MaxAPE may occur at different series points. When absolute error for large series value is slightly larger than absolute error for small series value, then MaxAE will occur at larger series value and MaxAPE will occur at smaller series value (Box et al. 2008; DeLurgio 1998; McCleary and Hay 1980).

Normalized BIC

Normalized BIC is general measure of overall fit of a model that attempts to account for model complexity. It is a score based upon mean square error and includes a penalty for number of parameters in the model and length of series (Box et al. 2008; DeLurgio 1998; McCleary and Hay 1980).

$$ {\text{BIC}} = \chi^{2} + k.\,\ln \left( n \right) $$
(4)

BIC is used to determine the best model the constant (k). It can measure the efficiency of parameterized model in terms of predicting the data also it penalizes the complexity of the model where complexity refers to the number of parameters in model.

Time series

Time series is a sequence of data points, measured typically at successive times spaced at uniform time intervals. Time-series analysis comprises methods for analyzing time-series data to extract meaningful statistics and other characteristics of data and to forecast future events based on known past events to predict data points before these are measured. Time-series model reflect that observations close together in time one closely related than observations further apart. In addition, time-series models will often make use of natural one-way ordering of time so that values for a given period will be expressed as deriving in some way from past values, rather than from future values (Lu et al. 2013).

ARIMA

ARIMA model of a time series is defined by three terms (p, d, q). Identification of a time series is process of finding integer, usually very small (e.g., 0, 1, or 2), values of p, d, and q model patterns in data. When value is 0, element is not needed in model. The middle element, d, is investigated before p and q. The goal is to determine if process is stationary and, if not, to make it stationary before determining the values of p and q. A stationary process has a constant mean and variance over time period of study. The representation of an auto regressive model in time series (Box et al. 2008; DeLurgio 1998; McCleary and Hay 1980), well known as AR(p), is defined as

$$ Y_{\text{t}} = \alpha_{0} + \alpha_{1} Y_{{{\text{t}} - 1}} + \alpha_{2} Y_{{{\text{t}} - 2}} + \cdots + \alpha_{\text{p}} Y_{{{\text{t}} - {\text{p}}}} + \varepsilon_{\text{t}} $$
(5)

where the term ɛt is source of randomness and is called white noise, αi are constants. It is assumed to have the following characteristics:

$$ \begin{gathered} E\left[ {\varepsilon_{\text{t}} } \right]\, = 0, \hfill \\ E\left[ {\varepsilon_{\text{t}}^{2} } \right]\, = \sigma^{2} ,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, \hfill \\ E\left[ {\varepsilon_{\text{t}} \,\varepsilon_{\text{s}} } \right]\, = 0\,\,\,\,\,\,{\text{for}}\,{\text{all}}\,\,t \ne s\, \hfill \\ \end{gathered} $$

A series may have both auto regressive and moving average components so both types of correlations are required to model the patterns. If both elements are present only at lag 1, to understand this let the linear equation

$$ y_{\text{t}} = \,x_{\text{t}} \beta + \varepsilon_{\text{t}} $$
(6)
$$ \varepsilon_{\text{t}} = \rho \varepsilon_{{{\text{t}} - 1}} + \nu_{\text{t}} $$
(7)

where −1< ρ <1.

where νt is independent and identically distributed (iid) and from above expectation values

$$ E(\nu_{\text{t}} ,\nu_{{{\text{t}} - 1}} ) = 0 $$
(8)

As we learn in a moment, the disturbance or error in this model is said to follow a first order auto regressive (AR1) process. Thus, the current error is part of the previous error plus some shock. So Eq. (6) can be re written as

$$ y_{\text{t}} = \,x_{\text{t}} \beta + \rho \varepsilon_{{{\text{t}} - 1}} + \nu_{\text{t}} $$
(9)

Also, we know that

$$ y_{{{\text{t}} - 1}} = \,x_{{{\text{t}} - 1}} \beta + \varepsilon_{{{\text{t}} - 1}} $$
(10)
$$ \Rightarrow \,\,\,\,\varepsilon_{{{\text{t}} - 1}} \, = \,y_{{{\text{t}} - 1}} - \,x_{{{\text{t}} - 1}} \beta $$
(11)

From Eq. (9) \( y_{\text{t}} = \,x_{\text{t}} \beta + \rho \left( {y_{{{\text{t}} - 1}} - \,x_{{{\text{t}} - 1}} \beta } \right) + \nu_{\text{t}} \).

$$ y_{t} = \,x_{t} \beta + \rho y_{t - 1} - \,\rho x_{t - 1} \beta + \nu_{t} $$
(12)

Results and discussion

Figure 2 and Table 1 give the detail of statistical analysis including mean, median, mode, range, standard deviation, kurtosis, skewness, coefficient of variation for all water quality parameters. Table 2 and Eqs. (1)–(12) depict time-series analysis of ARIMA model, stationary R-squared, R-squared, RMSE, MAPE, MaxAPE, MAE, MaxAE, normalized BIC, Ljung–Box analysis, predicted value, lower confidence limit (LCL), upper confidence limit (UCL), residual for all water quality parameters. Figure 3 plots time series of observed data, best fit, LCL, UCL and ARIMA prediction monthly values for next 5 years for all water parameters. It is observed that for:

Fig. 2
figure 2

Statistical analysis of Yamuna River water at Hathnikund

Table 1 Statistical analysis of water quality parameters
Table 2 Time-series analysis of water quality parameters
Fig. 3
figure 3figure 3figure 3

Time-series (ARIMA model) prediction of Yamuna River water at Hathnikund

pH

Average, positional average and mode value of pH is 7.77, 7.72 and 7.7. These values are close to 7.7 thus data exhibit normal behavior. Standard deviation (SD) is 0.422, skewness approximates to 0 thus pH is symmetrical and values are very close to each other. Curve is platykurtic as kurtosis is less than 3. Stationary R-squared and R-squared values exhibit similar behavior thus model is better than baseline model and RMSE values are low so dependent series is closed with its model-predicted level. For all sites, using Ljung–Box Q(18) model statistics is 23.936, significance is 0.121 and degree of freedom is 17. ARIMA (0, 0, 1) model fitted and boundary lines are at 95 % confidence limits. Predicted, LCL, UCL and residual values of pH are 7.7672, 6.9887, 8.5458, and −0.00017.

COD

Mean, median and mode value is 7.2, 6.0 and 5.0, respectively, thus curve is not normal. Standard deviation value is high (5.6932807) thus values of COD are not close to each other. It is skewed (2.435) and curve is leptokurtic because kurtosis is more than 3. Stationary R-squared and R-squared values exhibit the similar behavior thus model is better than the baseline model and RMSE values are high so dependent series is not closed with its model-predicted level. Using Ljung–Box Q(18) model statistics is 14.645, significance is 0.477 and degree of freedom is 15. ARIMA (1, 0, 0) model fitted and boundary lines at 95 % confidence limits. Predicted, LCL, UCL and residual values of COD are 8.1648, −2.1331, 18.4629, and 0.3341.

BOD

Mean and median value is approximately equal but mode value is different so curve is not normal. Standard deviation value suggests that data are spread out and curve is leptokurtic. Stationary R-squared and R-squared values exhibit the similar behavior thus model is better than the baseline model and RMSE values are low so dependent series is closed with its model-predicted level. Using Ljung–Box Q(18) model statistics is 31.755, significance is 0.011 and degree of freedom is 16. ARIMA simple model fitted and boundary lines at 95 % confidence limits. Predicted, LCL, UCL and residual values of BOD are 1.2477, 0.1189, 2.3765, and 0.0163.

AMM

Average, median and mode value is approximately equal thus data exhibit normal behavior. Standard deviation (0.2879) suggests that data are close to each other. Curve is platykurtic. Stationary R-squared and R-squared values exhibit the similar behavior thus model is better than the baseline model and RMSE values is low so dependent series is closed with its model-predicted level. Using Ljung–Box Q(18) model statistics is 26.689, significance is 0.045 and degree of freedom is 16. ARIMA simple model fitted and boundary lines at 95 % confidence limits. Predicted, LCL, UCL and residual values of AMM are 0.1832, −0.3815, 0.7480 and 0.00038.

TKN

Mean and median value is approximately equal. Standard deviation (0.83) suggests that sample data are close to each other. Skewness value is approximately 0 thus curve is symmetrical and platykurtic. Stationary R-squared and R-squared values exhibit the similar behavior thus model is better than the baseline model and RMSE values are low so dependent series is closed with its model-predicted level. Using Ljung–Box Q(18) model statistics is 16.311, significance is 0.431 and degree of freedom is 16. ARIMA (1, 0, 0) model fitted and boundary lines at 95 % confidence limits. Predicted, LCL, UCL and residual values of TKN are 0.9066, −0.6859, 2.4991, and −0.0414.

DO

Average, median and mode value is equal thus data behave normally. Standard deviation value is 1.344 curve is symmetric and platykurtic. Stationary R-squared and R-squared values exhibit the similar behavior thus model is better than the baseline model and RMSE values are low so dependent series is closed with its model-predicted level. Using Ljung–Box Q(18) model statistics is 16.161, significance is 0.442 and degree of freedom is 16. ARIMA (0, 0, 6) model fitted and boundary lines at 95 % confidence limits. Predicted, LCL, UCL and residual values of DO are 9.3279, 7.2570, 11.3988, and −0.0756.

WT

Mean, median and mode value is approximate to 21 thus data exhibit normal characteristic. Curve is skewed and platykurtic. Stationary R-squared and R-squared values exhibit the similar behavior thus model is better than the baseline model and RMSE values are low so dependent series is closed with its model-predicted level. Using Ljung–Box Q(18) model statistics is 22.148, significance is 0.138 and degree of freedom is 16. ARIMA (1, 0, 13) model fitted and boundary lines at 95 % confidence limits. Predicted, LCL, UCL and residual values of WT are 20.5947, 15.6442, 25.5453, and −0.0136.

Conclusion

Statistical and time-series analysis of water quality parameters monitored at the Hathnikund of Yamuna River in India has been studied. ARIMA model used for the prediction of the monthly values of water quality parameters for next 5 years. It is observed that curve is platykurtic for pH, AMM, TKN, DO, WT; leptokurtic for COD, BOD and normal for pH, AMM, DO, WT.

For all ARIMA model (p,d,q) value of ‘d’ i.e., middle value is zero thus process is stationary and has constant mean and variance. It is also observed that RMSE value is comparatively very low which shows that dependent series is closed with the model-predicted level, thus predictive model is useful at 95 % confidence limits. MAPE, MaxAPE, MAE, MaxAE, normalized BIC are calculated for all parameters and it is observed that all water quality parameters have low value. It concludes that the predicted series is close to the original series thus it is a perfect fit. Five year next predicted, LCL–UCL mean values using time series are given as for pH 7.7672, 6.9887–8.5458 with −0.00017 residual error; for COD 8.1648, −2.1331–18.4629 with 0.3341 residual error; for BOD 1.2477, 0.1189–2.3765 with 0.0163 residual error; for AMM 0.1832, −0.3815–0.7480 with 0.00038 residual error; for TKN 0.9066, –0.6859–2.4991 with −0.0414 residual error; for DO 9.3279, 7.2570–11.3988 with −0.0756 residual error and for WT 20.5947, 15.6442–25.5453 with −0.0136 residual error, respectively.

Therefore, using time series and statistical analysis, it is concluded that all parameters except pH and WT cross the prescribed limits of WHO/EPA and water is not fit for drinking, agriculture and industrial use. River is a natural resource of water, prediction results of ARIMA model indicate the increase in pollution, which is an alarming situation and the preventive measure has to be taken to control the same.