1 Introduction

The assessment of loan default risk is crucial for financial institutions such as banks and lending platforms (Everett 2015). Compared to traditional bank deposits, online peer-to-peer (P2P) lending platforms, such as the Lending Club, enable investors to obtain a higher return on investment. In recent years, the P2P lending market has experienced a substantial increase in popularity. In 2014, P2P lending generated approximately \(\$5.5\) billion worth of loans in the USA according to PricewaterhouseCoopers (2015), and the market may reach \(\$150\) billion or more by 2025 (Calabrese et al. 2019). There is a risk, however, that loans and interest may not be repaid because of information asymmetry between borrowers and lenders, which can lead to adverse selection and moral hazard problems (Lin et al. 2017). For example, according to the UK’s spending watchdog, the UK COVID loan scheme could lose 26 billion pounds to fraud and defaults: “Criminals may have made billions of pounds by exploiting a government loan scheme to help British businesses hit by the coronavirus, while large losses are expected from firms unable to repay.” A rational strategy for reducing such losses is to make appropriate loan decisions based on accurate default risk prediction. Therefore, it is particularly important for lenders to predict the probability of default so as to avoid unsecured borrowers and allocate investment more efficiently (Jagtiani and Lemieux 2019). Meanwhile, this problem has also attracted considerable interest from statisticians in developing feasible credit risk analysis methodologies (Ding et al. 2012; Yuan et al. 2018).

Traditional statistical models for credit risk generate a credit score for each borrower that is indicative of whether the borrower can repay the loan (Babaev et al. 2019; Zheng et al. 2019). For example, Li et al. (2022) use a logistic regression model to estimate the probability of default while accounting for the highly imbalanced distribution of the event of interest (default). Xu et al. (2021) study the estimation of credit risk for small and medium-sized enterprises with logistic regression augmented with information on observed interfirm financial transactions. Logistic regression is also applied in Costa e Silva et al. (2020) to evaluate the default risk of consumer loans. Although these models are popular and useful, they have limitations. First, those based on logistic regression cannot take full advantage of the temporal information in time-to-event data, as they use the indicator of default but ignore a borrower’s time of default. Second, traditional models only generate a risk score, whereas users can be equally interested in the time of default and the default probability over time. In fact, it has been shown that guided early intervention can both effectively minimize arrears and reduce the number of accounts that become bad debts and result in losses (Wang et al. 2018; Li and Chen 2021). Accordingly, more informative predictive outcomes are required for early guidance as well as for improved efficiency in post-loan risk management (Dirick et al. 2017). This has motivated us to develop credit risk models with the aid of survival analysis techniques.

Several studies have examined credit risk assessment using survival analysis techniques (Baesens et al. 2005; Stepanova and Thomas 2002). For example, Hassan et al. (2018) incorporate time-varying macroeconomic variables into their modeling and show that survival analysis has better prediction performance than logistic regression. Jaber et al. (2017) estimate the probability of default by fitting right-censored data from a bank in Jordan with both parametric and non-parametric survival models. To deal with the challenges of high-dimensional predictors, Wan et al. (2019) use the Cox proportional hazards (CPH) model with a Lasso penalty to select important factors for P2P loan repayment risk. However, the assumptions underlying these and other existing approaches, such as proportional hazards, linearity of covariate effects, and independent censoring, may not be tenable in real applications (Wang and Li 2019).

Compared to the aforementioned approaches, machine learning methods, such as neural networks and random forests, are much more powerful owing to their flexibility and high-quality prediction under very weak model assumptions. As pointed out in Murdoch et al. (2019), “Machine-learning models have demonstrated great success in learning complex patterns that enable them to make predictions about unobserved data.” Recently, studies have introduced neural networks to improve the performance of traditional survival models, while partially retaining the assumption of the CPH model, also referred to as the linear proportional hazards condition. For example, Katzman et al. (2018) combine a deep neural network (DNN) with the CPH model to develop DeepSurv, which they use to predict the risk of an event occurrence for patients and provide personalized treatment recommendations. Ching et al. (2018) propose the Cox-nnet algorithm with a two-layer neural network, where one hidden layer implements the dimension reduction of input covariates, and the output layer performs the Cox regression based on the activation levels of the hidden layer. Hao et al. (2021) apply a Cox-based DNN survival prediction model to high- and ultrahigh-dimensional survival data. This Cox-like model can be obtained by replacing the linear combination of covariates in the log-likelihood function with the output layer function of the neural network. These methods are nonlinear extensions of the CPH model, and most of the advantages of the CPH model are preserved.

In applications, however, survival models that retain part of the proportional hazards structure do not fit survival data with nonlinear risk functions well. Some researchers have therefore considered modeling non-proportional hazards with neural networks. One example is Nnet-survival, proposed in Gensheimer and Narasimhan (2019): a discrete-time survival model with neural networks in which both the baseline hazard rate and the effect of the input data on the hazard can vary with follow-up time. Nezhad et al. (2019) propose a deep active survival analysis approach, a framework that combines deep learning and active learning with a new sampling strategy and shows better performance in treatment recommendation for new patients. Lee et al. (2018) introduce an approach called DeepHit, which transforms the problem of learning the distribution of survival time into a discrete-time classification problem and employs a deep neural network to estimate the survival function.

This study is motivated by the appealing flexibility and prediction accuracy of deep learning, which has made extraordinary achievements in many applications in computer science, such as personalized treatment recommender systems and voice/image recognition. As pointed out by Jordan and Mitchell (2015), “Machine learning addresses the question of how to build computers that improve automatically through experience. It is one of today’s most rapidly growing technical fields, lying at the intersection of computer science and statistics, and at the core of artificial intelligence and data science.” However, deep learning has not been well developed for predicting default risk, especially in the framework of survival analysis. The goal of this study is to further expand the paradigm of default risk prediction via penalized deep learning based on time-to-event data. This study contributes beyond the existing literature in the following aspects. First, it differs significantly from traditional classification studies by making full use of the temporal information in time-to-event data. Instead of merely predicting whether default will occur, the proposed method provides more detailed information on the probability of default over time in the framework of survival analysis. Second, this study proposes to improve default risk prediction with survival data via deep learning as opposed to model-based methods. It can also identify important features, which can provide guidance to P2P lending platforms regarding post-loan risk management. Although the CPH model with penalization can be applied effectively to small samples, it no longer has computational superiority in large-scale settings. Third, the network structure in this study differs from that of other neural networks for survival analysis. For example, the output layer has multiple nodes, which can be converted into an estimate of the survival probability. By adding an additional one-to-one layer, we can achieve feature extraction and estimation simultaneously through the incorporation of an \(L_1\)-penalty into the objective function, which is unavailable in Nnet-survival (Gensheimer and Narasimhan 2019) and other deep learning models for survival data. While most research on deep learning has not paid sufficient attention to interpretability, we pay particular attention to features’ statistical importance. Moreover, computation is made more efficient via the minibatch gradient descent algorithm, which splits the training set into smaller batches and implements gradient descent on each one. This makes it possible to accommodate massive data.

The rest of this paper is divided into four sections. Section 2 outlines the proposed approach, called Survival Analysis with Feature Extraction via Deep Neural Network (SAFE-DNN). In Sect. 3, a real-world loan dataset is described, and its modeling results are presented. Additionally, performance of the proposed method is illustrated and compared with the competing methods through extensive simulations in Sect. 4. Section 5 includes conclusions and discussions of future research directions.

2 Methodology

2.1 Data and network structure

Let \((Y_i,\delta _i,\varvec{X}_i), i=1,\ldots , n\), be independent and identically distributed samples, where \(Y_i=\min (T_i,C_i)\) and \(\delta _i=I(T_i\le C_i)\) is the event indicator (Lin and Wei 1989). Here, \(T_i\) is the survival time, which is the time of default in this study. \(C_i\) is the censoring time, meaning that there is no record of default by \(C_i\). \(\varvec{X}_i\) contains the p-dimensional covariates. In survival analysis, the survival function \(S(t)=P(T\ge t)\) is of fundamental interest. Another function of interest is the hazard function, defined as:

$$\begin{aligned} \lambda (t)=\lim _{h\rightarrow 0^{+}}\frac{P(t\le T<t+h\mid T\ge t)}{h}, \end{aligned}$$

which is the instantaneous rate at which a borrower defaults at time t, given that he or she has not defaulted up to that point. For the Cox proportional hazards model, the hazard function is assumed to have the form:

$$\begin{aligned} \lambda (t \mid \varvec{X}_i)=\lambda _0(t)\exp (\varvec{\beta }^T\varvec{X}_i), \end{aligned}$$

where \(\lambda _0(\cdot )\) is the baseline hazard function, and \(\varvec{\beta }\) is the p-dimensional coefficient vector. As stated in Sect. 1, the linear proportional hazards condition can be restrictive in applications. In this study, we instead propose to predict default risk via deep learning.

The proposed structure is a multi-layer network, consisting of one layer of input nodes, one layer of output nodes, and multiple layers of hidden nodes. A schematic presentation of the structure is shown in Fig. 1a. Consider a vector of covariates \(\varvec{X}_i=(x_{i1},\ldots ,x_{ip})^{\top }\) for the i-th borrower as input to the network. The j-th input node takes value \(x_{ij}\) for this input vector (\(j=1,\ldots , p\)). Each input node is connected directly to only one node of the weighted input layer (the one-to-one layer), and a weight \(w_j\) is associated with the connection between input \(x_{ij}\) and the corresponding weighted input node. As suggested in the literature, we consider the scenario where some input features are non-informative for predicting default risk. Correspondingly, the connection coefficients, \(\varvec{w}=(w_1,\ldots , w_p)^{\top }\), are assumed to be sparse with some zero values. In all hidden layers, each node is fully connected to the nodes in the next layer. There are K hidden layers, and the connection weights of the k-th hidden layer are collectively denoted by \(W^{(k)}\). The activation function is set as the popular sigmoid function.
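To make the architecture concrete, a minimal sketch of the forward pass is given below. It uses PyTorch rather than the Theano implementation mentioned in Sect. 2.4, and the class name, hidden-layer sizes, and default number of output nodes are illustrative choices, not the exact settings used in our experiments.

```python
import torch
import torch.nn as nn

class SAFEDNN(nn.Module):
    """Sketch of the SAFE-DNN structure: a one-to-one weighted input layer,
    fully connected hidden layers with sigmoid activations, and m output nodes."""

    def __init__(self, p, hidden_sizes=(64, 32), m=27):
        super().__init__()
        # one-to-one layer: each input x_j is multiplied by its own weight w_j
        self.w = nn.Parameter(torch.ones(p))
        sizes = [p] + list(hidden_sizes)
        self.hidden = nn.ModuleList(
            [nn.Linear(sizes[k], sizes[k + 1]) for k in range(len(hidden_sizes))]
        )
        self.output = nn.Linear(sizes[-1], m)  # one output node per time interval

    def forward(self, x):
        h = x * self.w                     # element-wise weighting (one-to-one layer)
        for layer in self.hidden:
            h = torch.sigmoid(layer(h))    # sigmoid activation in the hidden layers
        return self.output(h)              # raw outputs (y_1, ..., y_m)
```

The raw outputs \((y_1,\ldots ,y_m)\) are converted to interval default probabilities by the softmax transformation given below.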

Fig. 1 Network structure and model framework of SAFE-DNN

For the output layer, there are m nodes, and the output value is \((y_1,\ldots ,y_m)^{\top }\), which is related to the prediction of survival probability. Specifically, we divide the follow-up time into m intervals, with \(t_{1},t_{2},\ldots ,t_{m}\) being their right endpoints. Additionally, the intervals are left-open and right-closed with \(I_1=(0,t_{1}], I_2=(t_{1},t_{2}],\ldots ,I_m=(t_{m-1},t_{m}]\). Let \(p_{k}\) be the probability of default in the k-th interval, which can be estimated by the output vector, i.e.,

$$\begin{aligned} {\hat{p}}_{k} = \frac{\exp (y_{k})}{\sum _{j=1}^{m}\exp (y_{j})},~~k=1,\ldots , m. \end{aligned}$$

Now the survival function, \(S(t)=P(T\ge t)\), of the default time can be estimated by:

$$\begin{aligned} {\hat{S}}(t)=1-\sum _{k=1}^j{\hat{p}}_k, ~~~t \in I_j. \end{aligned}$$

Here, the cut-points for the time intervals can be equally spaced. As suggested in Gensheimer and Narasimhan (2019), another option is to space the time intervals more widely apart, with increasing follow-up time, especially when the frequency of events decreases with follow-up time. Compared with equal intervals, varying intervals ensure that the number of events in the different intervals is more or less comparable, which may help improve estimation. In this study, we use \(m = 20\) to 30 time intervals, and the cut-points are determined by:

$$\begin{aligned} t_{j} = - \frac{\log (1-0.03j)\cdot Y^{*}}{\log (2)},~~ j=1,2,\ldots ,m, \end{aligned}$$

where \(Y^{*} = 0.27Y_{\max }\) and \(Y_{\max }\) is the maximum observed survival time.
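As an illustration, the cut-points above and the resulting survival estimate for one borrower can be computed as follows; the value \(m=27\) and the toy probabilities are placeholders.

```python
import numpy as np

def cut_points(y_max, m=27):
    """Cut-points t_j = -log(1 - 0.03 j) * Y* / log(2), with Y* = 0.27 * Y_max.
    Requires 0.03 * m < 1, which holds for the m = 20 to 30 intervals used here."""
    y_star = 0.27 * y_max
    j = np.arange(1, m + 1)
    return -np.log(1 - 0.03 * j) * y_star / np.log(2)

def survival_curve(p_hat):
    """S_hat(t) = 1 - sum_{k<=j} p_hat_k for t in the j-th interval."""
    return 1.0 - np.cumsum(p_hat)

# toy usage for a single borrower with uniform interval probabilities
t = cut_points(y_max=60.0)
p_hat = np.full(27, 1.0 / 27)
S = survival_curve(p_hat)   # S[j-1] is the estimated survival probability for t in I_j
```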

2.2 Objective function

We set the loss function as:

$$\begin{aligned} L = -\sum _{i=1}^n \log (L_i), \end{aligned}$$

where \(L_i\) is the likelihood of the i-th subject. To specify the likelihood, we consider two cases according to whether the observation is a failure or is censored. First, if observation i is a failure in interval j (i.e., \(Y_i\in I_j\) and \(\delta _i=1\)), the likelihood is the probability of failing in this interval:

$$\begin{aligned} L_{i} = p_{ij}, \end{aligned}$$

where \(p_{ij}\) is the probability of default in interval j for individual i. Second, if individual i is censored, with the censoring time falling in the second half of interval \(j-1\) or the first half of interval j, that is,

$$\begin{aligned} \left( t_{j-2}+t_{j-1} \right) /2\le Y_i< \left( t_{j-1}+t_{j} \right) /2,~~\text{ and }~~\delta _i=0, \end{aligned}$$

the failure time is treated as being in the j-th interval or later. This means that if the subject is censored in the second half of an interval, it is considered to be alive in this interval. By doing this, we can avoid a downward bias of survival estimates (Brown et al. 1997). Hence, the likelihood is the probability of surviving through intervals 1 to \(j-1\):

$$\begin{aligned} L_{i} = 1-\sum _{l=1}^{j-1}p_{il}. \end{aligned}$$

For ease of calculation, we convert the observed \((Y_i,\delta _i),~ i=1,\ldots ,n\), into two vectors, surv and fail, both of length m. Specifically, for an uncensored subject with failure time t, the j-th components of surv and fail, \(j=1,\ldots ,m\), are defined, respectively, as

$$\begin{aligned}&surv(j) = 1,\\&fail(j)=\left\{ \begin{array}{ll} 1, & \text{if } t_{j-1}\le t<t_{j},\\ 0, & \text{otherwise}. \end{array}\right. \end{aligned}$$

For a censored subject with censoring time t, the j-th components of surv and fail, \(j=1,\ldots ,m\), are defined, respectively, as

$$\begin{aligned} surv(j)&=\left\{ \begin{array}{ll} 1, & \text{if } t\ge \frac{1}{2}(t_{j-1}+t_{j}),\\ 0, & \text{otherwise}, \end{array}\right. \\ fail(j)&=0. \end{aligned}$$

Then, for the i-th individual, the log-likelihood function is defined as:

$$\begin{aligned} l_{i} = \log \left( 1-\sum _{j=1}^{m} surv_{i}(j)\cdot p_{ij} +\sum _{j=1}^{m} fail_{i}(j)\cdot p_{ij} \right) . \end{aligned}$$
(1)

More details about formula (1) are elaborated in the Appendix.
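For illustration, the encoding and the per-subject log-likelihood (1) can be computed as in the following sketch, assuming the cut-points \(t_1<\cdots <t_m\) are given (with \(t_0=0\)); the helper names are ours and boundary handling is simplified.

```python
import numpy as np

def encode(y, delta, t):
    """Build the surv and fail vectors for one subject with observed time y and
    event indicator delta, given interval endpoints t = (t_1, ..., t_m), t_0 = 0."""
    m = len(t)
    t_ext = np.concatenate(([0.0], t))
    surv, fail = np.zeros(m), np.zeros(m)
    if delta == 1:
        # failure: surv is all ones, so that 1 - sum_j p_ij + p_ij = p_ij in (1)
        surv[:] = 1.0
        j = np.searchsorted(t_ext, y, side="left")   # index of the interval containing y
        fail[min(max(j, 1), m) - 1] = 1.0
    else:
        # censored: alive in every interval whose midpoint does not exceed y
        mid = 0.5 * (t_ext[:-1] + t_ext[1:])
        surv[y >= mid] = 1.0
    return surv, fail

def log_lik(surv, fail, p):
    """Formula (1): l_i = log(1 - sum_j surv(j) * p_ij + sum_j fail(j) * p_ij)."""
    return np.log(1.0 - np.sum(surv * p) + np.sum(fail * p))
```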

To implement the selection of important features, we first consider the penalized objective function:

$$\begin{aligned} Q_1 =- \sum _{i=1}^{n}l_i+\lambda _{1}\sum _{i=1}^{p} \mid w_{i} \mid , \end{aligned}$$

where \(w_{i}\) is the connection weight of the i-th neuron in the one-to-one layer, and \(\lambda _{1}\) is a tuning parameter that controls the sparsity of the parameters; it can be selected according to the loss function on a validation set. Further, the overfitting issue of deep learning can be addressed by adding a regularization term. In this study, \(L_{2}\) penalization is applied, and the final objective function is:

$$\begin{aligned} Q = -\sum _{i=1}^{n}l_i+ \lambda _{1}\sum _{i=1}^{p} \mid w_{i} \mid + \lambda _{2}\sum _{k=1}^{K+1} \left\| W^{(k)}\right\| ^2, \end{aligned}$$
(2)

where the \(W^{(k)}\)’s are the connection coefficients of the hidden and output layers, and \(\lambda _{2}\) is a tuning parameter that controls model complexity. Here, we use \(L_{2}\) regularization to prevent overfitting, and it yields satisfactory performance, as evidenced by our numerical studies. We note that other techniques, such as early stopping (Li et al. 2020), \(L_{1}\) regularization (Mollalo et al. 2019), and Monte Carlo dropout (Hinton et al. 2012; Gal and Ghahramani 2016), can also be used to address overfitting. It is beyond our scope to evaluate which technique is more efficient.
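For concreteness, the objective in (2) can be assembled as in the short sketch below. This is a minimal illustration, assuming the negative log-likelihood, the one-to-one weights, and the hidden/output weight matrices are passed in as tensors; the function and argument names are ours, not part of the paper’s implementation.

```python
import torch

def objective(neg_log_lik, w, weight_matrices, lam1, lam2):
    """Formula (2): Q = -sum_i l_i + lam1 * sum|w| + lam2 * sum_k ||W^(k)||^2."""
    l1_penalty = lam1 * torch.sum(torch.abs(w))                          # sparsity of the one-to-one layer
    l2_penalty = lam2 * sum(torch.sum(W ** 2) for W in weight_matrices)  # complexity control
    return neg_log_lik + l1_penalty + l2_penalty
```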

2.3 Statistical feature importance

In addition to using neural networks for prediction, interpretation and inference have been attracting increasing attention. However, it is difficult to define a straightforward measure of statistical feature importance because of the complex structure of neural networks (Murdoch et al. 2019). One possible approach is the local interpretable model-agnostic explanation (LIME) method (Sun et al. 2020), which provides a prediction importance for each variable and each subject by perturbing the feature values and evaluating how the prediction results change. However, LIME is computationally demanding. Moreover, it may not be very reasonable to fit a linear regression model between the perturbed feature values and the corresponding output values, especially for the output of SAFE-DNN with multiple nodes.

Intuitively, a feature with a very small \(w_i\) transmits only a weak signal through the rest of the neural network, and its influence on the final output, equivalently the survival probability, is negligible. Hence, we select the features with relatively large weights in the one-to-one layer by penalizing the \(w_i\)’s in the objective function Q. Nonetheless, this does not guarantee that the selected features have clear implications for the final outputs. To rigorously assess the relative importance of features, we adopt the concept of the gradient of a function. Consider the feature vector \(\varvec{X}=(X_1,\ldots ,X_p)^{\top }\) as an input to the SAFE-DNN network. The corresponding output vector is denoted as:

\(\varvec{g}(\varvec{X})=(g_1(\varvec{X}),\ldots ,g_m(\varvec{X}))^{\top }\), where \(g_j(\varvec{X})\) is the output value in the j-th interval, \(j=1,\ldots , m\). Then for interval \(I_j\), the “gradient” of \(g_j(\varvec{X})\) with respect to the i-th feature \(X_i\) is defined as:

$$\begin{aligned} \mathcal {G}^j(X_i)=\frac{g_j(X_1,\ldots ,X_{i-1},X_i+\Delta X_i,X_{i+1},\ldots ,X_p)-g_j(X_1,\ldots ,X_p)}{\Delta X_i}, \end{aligned}$$
(3)

where \(\Delta X_i\) is a small disturbance. Similarly, we can define the “gradient” of \(\varvec{g}(\varvec{X})\) with respect to \(X_i\) as:

$$\begin{aligned} \mathcal {G}(X_i)=\frac{\left\| \varvec{g}(X_1,\ldots ,X_{i-1},X_i+\Delta X_i,X_{i+1},\ldots ,X_p)-\varvec{g}(X_1,\ldots ,X_p)\right\| }{\Delta X_i}, \end{aligned}$$
(4)

where \(\Vert \cdot \Vert \) is the \(L_2\) norm of a vector. The “gradient” \(\mathcal {G}(X_i)\) is nonnegative, measures the change in the output caused by a disturbance of the feature, and can be regarded as a reasonable measure of the relative influence of a feature.

With traditional statistical models, such as CPH, we can compute confidence intervals for the regression coefficients and make inference on statistical feature importance. Similarly, we can compute the “gradient” \(\mathcal {G}(X_i)\) for all observations and use the sample confidence interval for inference. It is interesting to note that the proposed “gradient” measures the influence of a feature on the whole neural network, whereas the weight \(w_i\) reflects its impact on the first layer only. Comparatively, the “gradient” provides a more comprehensive index for measuring a feature’s effect on the survival probability. In this study, we propose to select features by penalizing the weights \(w_i\) instead of the “gradient,” since penalizing the weights is more practical and easier to implement. Moreover, a feature with a small weight will typically have little impact on the subsequent network transmission and hence a relatively small “gradient.” In this sense, penalization of the weights \(w_i\) would not miss features with large “gradient.” Considering the computational complexity of selecting features via the “gradient,” we suggest using the weights \(w_i\) for feature selection. Nonetheless, the proposed “gradient” is a feasible measure for inferring the statistical importance of features.
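The finite-difference “gradient” in (3) and (4) can be computed as in the following sketch, where `predict` is assumed to map a single feature vector to the m output values; the fallback disturbance for zero-valued entries is an assumption of the sketch.

```python
import numpy as np

def gradient_importance(predict, x, i):
    """Finite-difference 'gradient' G(X_i) of formula (4) for one observation x."""
    delta = abs(0.01 * x[i]) if x[i] != 0 else 1e-6   # disturbance; fallback value is an assumption
    x_pert = x.copy()
    x_pert[i] += delta
    diff = predict(x_pert) - predict(x)               # change of the m-dimensional output
    return np.linalg.norm(diff) / delta               # L2 norm of the change over the disturbance
```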

2.4 Minibatch gradient descent algorithm

The framework of the proposed SAFE-DNN is described in Fig. 1b. To train the model efficiently, we propose using the minibatch gradient descent algorithm (Khirirat et al. 2017). The detailed algorithm is described in Algorithm 1.

Algorithm 1

Briefly, the training samples are randomly divided into smaller groups. In our numerical study, the sample size of each group is \(batch\_size=200\), and the number of batches S depends on the training data and the batch size. The model parameters are updated on these small sample sets. Here \(loss^{opt}\) is the optimal loss on the validation set. When the loss on the validation set does not improve persistently, \(n^{not~improved}\) records the current number of such iterations. The maximum number of iterations, E, is set as 10,000. In addition, the number of subsets, B, is chosen as 10, and L equals 30.

Here, the derivatives of the objective function are calculated with the backpropagation algorithm. In the update \(\beta ^{k+1} = \beta ^{k} -\eta ^{k}{Q}'(\beta ^{k})+\alpha (\beta ^{k}-\beta ^{k-1})\), the gradient of the loss function, \({Q}'(\beta ^{k})\), is computed using the Theano module in Python, which avoids deriving the gradient explicitly. Thus, we define the objective function in Theano. The learning rate, \(\eta ^{k}\), and the momentum term, \(\alpha (\beta ^{k}-\beta ^{k-1})\), are tuned to make the minibatch gradient descent algorithm converge faster and more stably. In our numerical implementation, all variables share the same learning rate, with an initial value of 0.005 that is multiplied by 0.8 every 100 iterations; letting it decrease as the iterations proceed improves convergence efficiency. The momentum, \(\beta ^{k}-\beta ^{k-1}\), records the direction of the previous update to help avoid local optima.

The hyperparameters \(\lambda _{1}\) and \(\lambda _2\) are selected by cross-validation. Note that \(\lambda _{2}\) is used to control model complexity and avoid overfitting; a larger value of \(\lambda _2\) corresponds to a simpler model. We also use early stopping to avoid overfitting. In this study, we set the early stopping parameter as \(n^{stop}=300\) to achieve “double safety”: training stops if the loss on the validation set does not decrease.
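A minimal PyTorch sketch of this training loop is given below. It assumes the `SAFEDNN` sketch from Sect. 2.1 (so `model.w` holds the one-to-one weights), a `loss_fn` that converts the raw outputs to interval probabilities and evaluates the negative log-likelihood (1), and standard minibatch loaders of (X, surv, fail); the momentum value of 0.9 and the epoch-level early-stopping check are illustrative assumptions rather than our exact settings.

```python
import torch

def penalty(model, lam1, lam2):
    """L1 penalty on the one-to-one weights w, L2 penalty on the remaining weight matrices."""
    l1 = lam1 * model.w.abs().sum()
    l2 = lam2 * sum((par ** 2).sum() for name, par in model.named_parameters()
                    if name != "w" and par.dim() > 1)
    return l1 + l2

def train(model, loss_fn, train_loader, val_loader, lam1, lam2,
          lr=0.005, momentum=0.9, n_stop=300, max_iter=10000):
    """Minibatch gradient descent with momentum, learning-rate decay, and early stopping."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=100, gamma=0.8)  # lr * 0.8 every 100 steps
    best_loss, n_not_improved, it = float("inf"), 0, 0
    while it < max_iter and n_not_improved < n_stop:
        for xb, sb, fb in train_loader:                  # minibatches of (X, surv, fail)
            opt.zero_grad()
            loss = loss_fn(model(xb), sb, fb) + penalty(model, lam1, lam2)
            loss.backward()                              # gradients via backpropagation
            opt.step()
            sched.step()
            it += 1
        with torch.no_grad():                            # validation loss for early stopping
            val_loss = sum(loss_fn(model(xb), sb, fb).item() for xb, sb, fb in val_loader)
        if val_loss < best_loss:
            best_loss, n_not_improved = val_loss, 0
        else:
            n_not_improved += 1
    return model
```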

2.5 Advancements from existing works

In this study, we focus on solving the practical problem of predicting credit risk with deep learning. Although deep learning has had extensive applications, it has not been fully developed in survival analysis. In SAFE-DNN, the traditional proportional hazards condition is discarded so that the model can better accommodate flexible relationships between features and the hazard function. A network structure with a one-to-one layer is designed to achieve feature selection, which is unavailable in Nnet-survival. Unlike existing deep learning studies, we pay more attention to statistical feature importance and propose a new measure of feature importance that has meaningful interpretations and is much easier to implement than LIME. Meanwhile, there are both similarities and differences between the proposed SAFE-DNN model and traditional survival models.

First, they share the common target of predicting the survival/hazard function based on time-to-event data, but under different model assumptions. In the SAFE-DNN model, some assumptions, such as the proportional hazards assumption, are discarded, and the proposed neural network allows us to learn highly complex and nonlinear relationships between features and the hazard function. The outputs of a SAFE-DNN network are the probabilities of default in the divided intervals, which can be organized into a survival/hazard function, just as traditional statistical models provide. However, the default probabilities in different intervals may be more informative in applications owing to their straightforward interpretations.

Second, maximum likelihood estimation, the technique most commonly used with traditional survival models, is also used here to train the SAFE-DNN network. It is noted that we use the “full” likelihood in the loss function, which can be more efficient than the partial likelihood used for estimating CPH. Meanwhile, we implement feature selection by penalizing the parameters \(w_i, i=1, \ldots , p\) in the one-to-one layer, just like penalizing coefficients in the CPH model. Somewhat differently, the objective function of SAFE-DNN includes a regularization term on the \(W^{(k)}\)’s to overcome the overfitting issue of deep learning.

Third, unlike the coefficients in a CPH model, the parameters in a SAFE-DNN network have no straightforward interpretations, but the “gradient” defined in (4) carries information on statistical feature importance. Thus, we can infer the influence of a feature through its “gradient,” similar to a coefficient in CPH.

Fourth, the minibatch gradient descent algorithm used in SAFE-DNN is a variant of the gradient descent algorithm, which is often used for finding the weights of machine learning algorithms or the coefficients of statistical models. Retaining the essence of gradient descent, the minibatch version splits the training data into small batches that are used to calculate the model error and update the parameters, which avoids having to keep all training data in memory. Thus it provides a computationally more efficient and more robustly convergent optimization process and enables us to handle big data.

Fig. 2 Flowchart of loan data analysis

Table 1 Data analysis: feature extraction results

3 Analysis of loan data from Lending Club

3.1 Data description

The Lending Club is one of the largest P2P lending marketplaces in the U.S. This dataset contains more than 90k records with 280 features, such as borrower credit characteristics, borrower assessments, loan characteristics, borrower indebtedness, credit history, and status of loan (default or non-default). This dataset is publicly available at the Lending Club website (https://www.lendingclub.com).

We analyze data from the first quarter of 2017 and note that a larger-scale analysis is also possible. Before analysis, we delete features with missing rates above 60% or for which more than 95% of subjects share the same value. For continuous variables, missing values are imputed with medians; for categorical variables, missing values are imputed with modes. In addition, categorical variables are recoded with numerical coding or dummy variables. Finally, we have a total of 96,781 observations and 114 explanatory variables. In the analysis, the survival time is defined as the length of the sustained repayment period, which is calculated by dividing the total repayment amount by the amount of each period. If borrowers have fully paid or are still in the repayment period, their event times are right-censored, as default events have not happened. A total of 15,670 subjects have default events by the end of the study.
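The preprocessing described above can be sketched as follows; `df` is assumed to be the raw Lending Club data frame, and the thresholds follow the description in the text.

```python
import pandas as pd

def preprocess(df):
    """Drop uninformative features, impute missing values, and recode categoricals."""
    # drop features with >60% missing or with one value shared by >95% of subjects
    keep = [c for c in df.columns
            if df[c].isna().mean() <= 0.60
            and df[c].value_counts(normalize=True, dropna=False).iloc[0] <= 0.95]
    df = df[keep].copy()
    for c in df.columns:
        if pd.api.types.is_numeric_dtype(df[c]):
            df[c] = df[c].fillna(df[c].median())        # continuous: impute with the median
        else:
            df[c] = df[c].fillna(df[c].mode().iloc[0])  # categorical: impute with the mode
    return pd.get_dummies(df)                           # dummy coding for categorical variables
```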

3.2 Results

The flowchart of data analysis is shown in Fig. 2. For comparison, we consider the following alternatives: Nnet-survival-\(L_1\), which is based on Nnet-survival (Gensheimer and Narasimhan 2019) and revised with an \(L_1\) penalty to accommodate high-dimensional features, and Cox-DAC, which is proposed by Wang et al. (2018). Cox-DAC can deal with massive data where the sample size, n, is exceedingly large and the covariate dimension, p, is not small but \(n\gg p\). Both the SAFE-DNN and Nnet-survival-\(L_1\) are trained with the minibatch gradient descent algorithm, and the tuning parameter of the Cox-DAC method is selected by minimizing the BIC score.

When evaluating the prediction accuracy of different methods, the following measure is used:

C-index The concordance index (C-index) is a rank-based measure that counts concordant pairs between the predicted and observed outcomes (Uno et al. 2011). Its value lies between 0 and 1, with 1 indicating ideal prediction. We use the Kaplan–Meier estimate (Fig. 5 in the Appendix) of the survival function as the true value when calculating the C-index.
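For reference, a plain Harrell-type pairwise C-index can be computed as in the sketch below; this is a simplified illustration (quadratic in the number of subjects and ignoring ties in time), not necessarily the exact time-dependent version of Uno et al. (2011) used in the paper.

```python
import numpy as np

def c_index(time, event, risk):
    """Concordance between observed times and predicted risk scores:
    a pair (i, j) is concordant if i fails earlier and has the higher risk."""
    concordant, comparable = 0.0, 0.0
    n = len(time)
    for i in range(n):
        for j in range(n):
            if event[i] == 1 and time[i] < time[j]:   # comparable pair
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5                 # ties in risk count as half
    return concordant / comparable
```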

To evaluate the prediction and identification performance, the dataset is randomly divided into two parts: 70% of the data is used for training and the remaining 30% for testing. This procedure is repeated 100 times. We calculate the C-index at periods 15, 35, and 55. We find that SAFE-DNN and Cox-DAC share the same C-index value (0.679 at periods 15, 35, and 55), which is larger than that of Nnet-survival-\(L_1\) (0.675 at periods 15 and 35, 0.676 at period 55).

The results of feature extraction are shown in Table 1. The SAFE-DNN method identifies 11 features, and Nnet-survival-\(L_1\) and Cox-DAC identify 11 and 10 features, respectively. All the selected features belong to one of four categories: loan contract contents, asset strength, external credit rating, and external credit situation, which have been suggested to play important roles in credit risk (Dirick et al. 2017). Among them, all three methods select the listed amount of loan applied for by the borrower, the interest rate on the loan, and the number of mortgage accounts, which have been found to be positively related to default risk (Shi et al. 2018). SAFE-DNN identifies the number of inquiries in the past six months, the number of personal finance inquiries, the number of credit inquiries in the past 12 months, and the home ownership status provided by the borrower, which are missed by both Nnet-survival-\(L_1\) and Cox-DAC. A literature search (Hackbarth et al. 2006) suggests that these features may have important implications for credit risk, which lends support to the validity of the proposed approach. In addition, Cox-DAC identifies five features that are absent from the selection results of SAFE-DNN and Nnet-survival-\(L_1\), while Nnet-survival-\(L_1\) identifies more unique features.

3.3 Evaluation

With real data, it is difficult to objectively evaluate identification accuracy. To evaluate the stability of feature extraction, we calculate the observed occurrence index (OOI, Huang and Ma (2010)) for each variable. Specifically, we randomly select 80% of the subjects and run the algorithm to identify features using the selected samples. This procedure is repeated 100 times. For the j-th feature, the OOI is defined as \(o_j=c_j/100\), where \(c_j\) is the number of times the feature is identified. Hence, the OOI measures the relative importance and stability of a selected feature, with a larger OOI indicating higher stability. The results are plotted in Fig. 3, where each row represents one feature and each vertical column represents one repeat; a black spot means the feature is selected in the corresponding repeat, and otherwise the cell is blank. The mean OOI values for SAFE-DNN, Nnet-survival-\(L_1\), and Cox-DAC are 0.825, 0.760, and 0.806, respectively. Combined with the results in Fig. 3, we conclude that the proposed method has the best selection stability, Cox-DAC is slightly worse, and Nnet-survival-\(L_1\) is the least stable of the three.
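The OOI computation can be sketched as follows, where `fit_and_select` is a hypothetical function that trains a model on a subsample and returns the indices of the selected features.

```python
import numpy as np

def observed_occurrence_index(X, y, delta, fit_and_select, n_repeat=100, frac=0.8, seed=0):
    """OOI o_j = c_j / n_repeat, where c_j counts how often feature j is selected."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_repeat):
        idx = rng.choice(n, size=int(frac * n), replace=False)   # random 80% subsample
        selected = fit_and_select(X[idx], y[idx], delta[idx])    # indices of the selected features
        counts[selected] += 1
    return counts / n_repeat
```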

Fig. 3 Data analysis: OOI of the three models

Fig. 4 Data analysis: features’ “gradient”

To assess the relative importance of the selected variables, we calculate the “gradient” based on (4). Specifically, we take the disturbance \(\Delta x_{ij}= \mid 0.01\cdot x_{ij}\mid \). For the selected features in Table 1, we calculate the “gradient” \(\mathcal {G}(X_j), j=1,\ldots , 22\), for all 96,781 subjects; the boxplots and the corresponding 95% confidence intervals are shown in Fig. 4a and b, respectively. Here, we focus on the relative sizes of the “gradient” values rather than their absolute values. As shown in Fig. 4a, the features selected by SAFE-DNN have larger “gradient” values than most of the other features, which means that they have a greater influence on the survival probability. We also note that \(x_{13}\) has a relatively large “gradient” value, suggesting that SAFE-DNN may still miss important features. Nonetheless, it selects most of the features with large “gradient.” A similar conclusion can be reached from Fig. 4b. In addition, we show the “gradient” for the SAFE-DNN selected features and each subject in Fig. 4c, which is plotted in a similar way to the LIME method (Sun et al. 2020). In this figure, each row represents one feature and each vertical column represents one sample; a darker color suggests that the feature is more important for that sample. Among the selected features, \(x_2\) has the greatest influence on the output for almost all samples. We plot the “gradient” of \(x_2\) for all 27 intervals in Fig. 4d. The “gradient” values in most of the intervals are relatively large. \(x_{2}\) represents the interest rate on the loan, which has been shown to play a vital role in loan default (Ntiamoah et al. 2014).

4 Simulation

We conduct simulations to further evaluate performance of the proposed method under settings with different hazard functions, censoring rates, and dimensions of features. We consider two data generating processes, and the corresponding survival times are generated with hazard functions:

$$\begin{aligned}&\text{ s1 }: \lambda (t \mid X) = \exp (h(X,\beta )),\\&\text{ s2 }: \lambda (t \mid X) = \{0.75h(X,\beta )^3+0.75h(X,\beta )+1.2\}t, \end{aligned}$$

where

$$\begin{aligned} h(X,\beta )= & {} X^{\top }\beta +0.8(X^{\top }\beta )^{2}-0.5(X^{\top }\beta )^{3}\\&+0.0001\exp (X^{\top }\beta ). \end{aligned}$$

These hazards represent complicated scenarios for which deep learning and other flexible methods may be especially needed. The censoring times are drawn from the uniform distribution \(U(a,b)\). We obtain 25% and 50% censoring rates (cr) by adjusting a and b. The p-dimensional covariates, X, are generated from the multivariate normal distribution with mean 0 and covariance matrix \(I_{p\times p}\), where \(I_{p\times p}\) is the identity matrix. We consider different dimensions, with \(p = 50, 100, 150\) for both cases, and the coefficients \(\beta \) are set as:

$$\begin{aligned}&\beta _{1}=\cdots =\beta _{\frac{p}{5}}=0.2,~~ \beta _{\frac{p}{5}+1}=\cdots =\beta _{\frac{2p}{5}}=0.1,\\&\beta _{\frac{2p}{5}+1}=\cdots =\beta _{\frac{3p}{5}}=0.05, ~~\beta _{\frac{3p}{5}+1}=\cdots =\beta _{p}=0. \end{aligned}$$

The sample size of the training set is 10,000, and that of the testing set is 5,000. For each scenario, 100 replicates are simulated.
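Survival times under s1 and s2 can be generated by inverting the corresponding cumulative hazards, as in the sketch below. The guard that keeps the multiplier in s2 strictly positive is an assumption made only for this sketch; censoring times from \(U(a,b)\) would then be drawn separately, with a and b tuned to reach the target censoring rates.

```python
import numpy as np

def simulate(n, p, scenario="s1", seed=0):
    """Generate covariates X and survival times T under scenarios s1 and s2
    by inverting the cumulative hazard with unit exponential variates."""
    rng = np.random.default_rng(seed)
    beta = np.concatenate([np.full(p // 5, 0.2), np.full(p // 5, 0.1),
                           np.full(p // 5, 0.05), np.full(p - 3 * (p // 5), 0.0)])
    X = rng.standard_normal((n, p))
    u = X @ beta
    h = u + 0.8 * u ** 2 - 0.5 * u ** 3 + 1e-4 * np.exp(u)
    E = rng.exponential(size=n)
    if scenario == "s1":
        T = E / np.exp(h)                      # constant hazard exp(h): T ~ Exp(rate = exp(h))
    else:
        c = np.maximum(0.75 * h ** 3 + 0.75 * h + 1.2, 1e-3)  # keep the rate positive (sketch only)
        T = np.sqrt(2.0 * E / c)               # hazard c * t, so Lambda(t) = c * t^2 / 2
    return X, T
```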

When comparing different methods, in addition to the C-index, we also consider the following to evaluate prediction accuracy:

Brier score: This can be regarded as the mean squared error between the estimated survival function, \({\hat{S}}_{i}(t)\), and the actual survival function, \(S_{i}(t)\), at time t (DeVogel et al. 2019), defined as

$$\begin{aligned} B(t)=\frac{1}{n}\sum _{i=1}^{n} (S_{i}(t)- {\hat{S}}_{i}(t))^{2}. \end{aligned}$$

A smaller Brier score value implies better performance.
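Given matrices of actual and estimated survival probabilities (subjects in rows, evaluation times in columns), the Brier score curve can be computed as in the following minimal sketch.

```python
import numpy as np

def brier_score(S_true, S_hat):
    """B(t): mean over subjects of (S_i(t) - S_hat_i(t))^2, computed per time column."""
    return np.mean((S_true - S_hat) ** 2, axis=0)
```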

The accuracy of feature extraction is evaluated using Recall, Precision, and their harmonic mean, the F1 score.

We calculate C-index values at \(0.25Y_{max}\), \(0.5Y_{max}\), and \(0.75Y_{max}\), where \(Y_{max}\) is the maximum survival time. The C-index results are presented in Table 2 of the Appendix. Because of the violation of the CPH assumption, Cox-\(L_1\) has the worst prediction accuracy. Overall, SAFE-DNN has superior prediction performance, with the largest C-index in most of the scenarios. Figures 6 and 7 provide the Brier score curves for cases s1 and s2, respectively. SAFE-DNN outperforms the other approaches with the smallest Brier scores. Cox-\(L_1\) performs badly in terms of the Brier score, with a maximum error of 0.14, while the Brier score of SAFE-DNN is no more than 0.04. Although Nnet-survival-\(L_1\) enjoys the advantage of being model-free and has better prediction performance than Cox-\(L_1\), it is inferior to SAFE-DNN in terms of prediction accuracy.

Table 3 presents the results for Recall, Precision, and F1 score. Cox-\(L_1\) has the worst identification performance because of model misspecification. Except in a small number of cases, SAFE-DNN produces larger Recall and Precision values than Nnet-survival-\(L_1\). The F1 scores further indicate that SAFE-DNN outperforms Nnet-survival-\(L_1\) in feature extraction.

The proposed approach is also computationally affordable, making broad practical application possible. Overall, it clearly and significantly outperforms the two most relevant competitors.

5 Conclusions and future work

We have developed a new approach to predicting default risk using penalized deep learning that can be used for loan default analysis with time-to-event data. Rather than predicting whether borrowers will default based on classification methods, SAFE-DNN predicts default risk over time, which provides not only the default outcome but also a forecast of when default will most probably occur. Although several studies have considered risk analysis based on time-to-event data, this study is among the first to conduct penalized deep learning in the context of survival analysis without model structure assumptions. Moreover, we can predict default risk and identify important features simultaneously via a penalized objective function. The minibatch gradient descent algorithm has made it possible to accommodate large-scale data, and numerical studies show the practical superiority of the proposed method. Overall, this study delivers a practical and useful new way to analyze default risk with massive time-to-event data.

Extensive research on survival analysis has shown its appealing advantages and merits in modeling default risk (Caselli et al. 2021; Xia et al. 2021; Blumenstock et al. 2022). As an alternative, one can also consider analysis based on extreme value theory for default events (Molenberghs and Verbeke 2011); developing such an analysis and comparing it with the proposed approach is beyond our scope. In this study, we have considered right-censored data; future work can extend the approach to other data settings, such as truncated data and interval-censored data. Further, when several commercial datasets are available for analysis, it can be important and challenging to recognize and accommodate cross-data heterogeneity. We leave the prediction of default risk in an integrative analysis framework to future research. Although the minibatch gradient descent algorithm makes it possible to deal with massive data, determining how to train the model with parallel computation to improve computational efficiency deserves additional research. In addition, the structure of the proposed neural network is still relatively simple; some studies have considered more advanced neural network structures to improve model performance. In future work, one can examine other deep neural networks or more complex computing algorithms to improve performance.