3.1 Demand Analysis on Smart Healthcare System
The rapid development of the economy and of cultural life directly affects people's health consumption needs. Uneven economic development and consumption power across regions and residents in China has diversified these needs. Taking people's basic medical needs as the starting point, this study provides multi-level, diversified medical services for different service targets and actively creates the conditions needed to meet the medical service needs of different groups. This not only responds to the call for health development in China but can also bring reliable economic benefits [17, 18]. The feasibility analysis of a smart healthcare system is shown in Figure 1 below.
The feasibility analysis of the smart healthcare system covers algorithm feasibility, system feasibility, and economic feasibility. In terms of algorithm feasibility, the constructed system adopts widely used deep learning, machine learning, and data mining algorithms, which are trained and analyzed on data prepared with current data preprocessing techniques. The best-performing model is saved and deployed in the system so that the algorithms can be applied and promoted. In terms of system feasibility, the WeChat applet platform is used as the development platform; it is compatible with various versions of the Android and iPhone operating systems (iOS), which makes multi-platform use convenient. In terms of economic feasibility, the development cost of this system lies mainly in leasing and maintaining the server, and the developers are full-time engineering graduate students, so the cost is relatively small. With the rapid development of information technology, many scholars in China and other countries have applied deep learning, machine learning, and data mining algorithms to the field of smart healthcare. Applying AI technology in the medical field can therefore be regarded as a current research trend.
The smart healthcare system serves several target groups: professional doctors, users with healthcare demands, busy users, users with minor illnesses, and users who need privacy protection. Among medical staff, the main targets are doctors in medical institutions in remote areas and doctors over 55 years of age, since the overall population of rural doctors is aging. For them, the smart healthcare system can supplement daily diagnosis by providing help or tips during the process, so that they can serve patients more accurately and conveniently. Non-professional users who are very concerned about subtle physical discomforts or symptoms may incur excessive costs at a hospital, or obtain inaccurate results from online searches, when such symptoms appear. In the smart healthcare system, these users can describe their physical symptoms or select symptom keywords to obtain reminders or help through an online diagnosis function. Busy users easily ignore minor physical symptoms, or give up going to the hospital for financial or work reasons when symptoms are subtle. The death rate from cardiovascular diseases has continued to rise in recent years, and prevention is far more important than treatment. Using a smart healthcare system for pre-diagnosis therefore helps reveal a tendency toward disease before a hospital examination. Because medical resources are unevenly distributed, the bed utilization rate of many top-tier (Grade III, Class A) hospitals is as high as 99% [19], so people with minor illnesses in large cities cannot get them treated in a timely, effective, and economical way. The smart healthcare system can therefore include a disease encyclopedia function for retrieving the profile, symptoms, etiology, treatment, and nursing information of a disease. In addition, young people with symptoms in private areas can use the smart healthcare system for diagnosis and treatment without registering any personal information [20–22]. In this study, the CNN, a deep learning algorithm, is improved according to the feasibility analysis to construct a smart healthcare system, and the system is divided into functional modules according to the demand analysis of the target population.
3.2 Deep Learning
Deep learning, also called deep structured learning or cascade learning, is a branch of machine learning. Because it can extract data features on its own, deep learning can be applied to the medical field to accelerate the development of smart detection [23]. Deep learning algorithms automatically learn multi-level features from the original data without the participation of human experts in the related fields. This greatly saves manpower, material resources, and time, and the learnt key features can be used to complete the classification task [24]. Deep learning algorithms are closely related to the information processing and communication modes of the biological nervous system. They include many architectures, such as multilayer artificial neural networks (ANN), CNN, and recurrent neural networks (RNN). The power of deep learning algorithms is that the output of a given layer in the network can be taken as a representation of the data, so that features are extracted through the neural network [25]. The architecture and optimization of the deep learning algorithm are shown in Figure 2 below.
In a deep learning algorithm, forward propagation carries the signal from the input layer to the output layer: each neuron in a hidden layer forms a linear combination of its inputs, and the result is passed through a nonlinear activation function. The equation is as follows:
In the above equation, m refers to the activation function, whose argument is the linear combination computed by the neurons of the hidden layer. After the nonlinear calculation of the activation function, the result is passed to the neurons in the next hidden layer. The essence of training a neural network is to make it fit the data distribution better and form a decision boundary [26]. To find the best decision boundary, the parameters have to be optimized, which requires measuring the difference between the real sample value k and the predicted value f(l) of the sample (namely, the loss function). The loss function can be calculated with Equation (2) below:
In the above equation, c refers to the loss. The goal of training is to make the total loss over all training data as small as possible. The CNN, which is composed of neurons with weights, is derived from the fully connected neural network (FCNN), and its main characteristics are sparse connections and weight sharing. Sparse connection means that neurons are connected only locally, exploiting the local spatial correlation among image pixels; weight sharing means that the feature map generated by each convolution kernel from the input image shares the same parameters across all positions [27]. Structurally, the CNN includes the convolutional layer, the pooling layer, and the fully connected layer. The parameters of the convolutional layer consist of a set of convolution kernels (convolution filters), and the feature maps are generated in this layer. The pth feature map is usually denoted as \({m_p}\), and its convolution kernel is determined by the parameters \({V^p}\) and \({s_p}\), so that the equation below can be obtained:
In Equation (3) above, f(.) refers to an activation function, l represents an input feature map, and the parameter V of the hidden layer can be represented as a four-dimensional tensor. The main function of the convolutional layer is to perform convolution calculations, so that features of abstract data are extracted through continuous adjustment of the parameters. During the convolution calculation, all areas of the layer are scanned in turn; at each position the input features are multiplied element-wise by the kernel, summed, and added to the bias, as given below:
In the above equations, s refers to the bias, and \({Z_h}\) and \({Z_{h + 1}}\) represent the convolutional input and output of the (h+1)th layer, respectively; the output is also called a feature map. \({H_{h + 1}}\) refers to the size of \({Z_{h + 1}}\), where the length and width of the feature map are assumed to be equal. Z(i, j) refers to a pixel of the corresponding feature map, and P is the number of channels of the feature map. f, \({b_0}\), and k are the convolutional layer parameters: the size of the convolution kernel, the convolution stride, and the amount of padding, respectively. Thanks to the convolutional layer, the CNN is well suited to data such as images and audio. Pooling layers are placed between successive convolutional layers; their main function is to gradually reduce the size of the feature map, which reduces the number of parameters and computations in the network and helps prevent overfitting. Assuming that the size of the input feature map is W, the filter size is C, the step size is B, and the amount of zero padding added to the boundary is K, the size of the output feature map can be calculated with Equation (6) below:
The fully connected layer is connected in the same way as in the ANN. With the rapid development of the CNN, many variants have been proposed, such as the visual geometry group network (VGGNet) [28], GoogLeNet [29], AlexNet [30], the interleaved group convolution network (IGCNet) [31], and the gated CNN (GCNN) [32].
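To make the feature-map size relation for the pooling layer concrete, the sketch below assumes the standard form floor((W − C + 2K)/B) + 1 for input size W, filter size C, step size B, and zero padding K. The floor division and the function name `output_size` are assumptions made for illustration and are not taken from the study.

```python
def output_size(w, c, b, k=0):
    """Side length of the output feature map.

    w: input feature map size, c: filter size,
    b: step size (stride), k: zero padding added to each boundary.
    Assumes the standard relation floor((w - c + 2k) / b) + 1.
    """
    span = w - c + 2 * k
    if span < 0:
        raise ValueError("filter is larger than the padded input")
    return span // b + 1


if __name__ == "__main__":
    # A 2x2 pooling window with step size 2 halves a 28x28 feature map.
    print(output_size(28, 2, 2))
```

Under this assumed relation, a 3 × 3 filter with step size 1 and one pixel of padding leaves the feature-map size unchanged, which is why that combination is common in stacked convolutional layers.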
3.4 Construction of Interactive Smart Healthcare Prediction and Evaluation Model Based on Deep Learning
People's physical health is affected by many diseases, and the ideal goal is to predict disease risk before a disease is diagnosed, so that potential risks and trends can be discovered and effective preventive and intervention measures can be taken. Because of the particularity of medical data, most AI-based disease risk prediction models proposed so far can only be applied to specific medical data. In contrast, the model constructed in this study uses deep learning to predict multiple diseases. The CNN shows strong generalization ability and versatility, so it is selected as the algorithm for this model. The process for the interactive smart healthcare prediction and evaluation (SHPE) model based on deep learning is shown in Figure 3 below.
The medical data used in this study mainly comes from the Medicare data set and the Healthdata set. The raw, unordered medical data is first processed into a format that can be recognized and processed by a computer or machine learning model; this step includes Chinese word segmentation, loading a dictionary, and dictionary supplementation. Data conversion mainly refers to symptom data conversion and diagnosis data conversion. Finally, the data is summarized and collected. During data preprocessing, the Min-Max method is adopted to normalize the feature data, as below:
The specific process for data preprocessing is given below in Figure 4.
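The Min-Max normalization step can be sketched as follows. This is a minimal illustration of the standard rescaling to [0, 1], (x − min)/(max − min); the function name and the handling of a constant feature (mapped to zeros) are assumptions, not details given in the study.

```python
def min_max_normalize(values):
    """Rescale a list of numbers to the range [0, 1] with the Min-Max method."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature carries no information, so map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


if __name__ == "__main__":
    # Hypothetical feature column, for example an encoded symptom score.
    print(min_max_normalize([2.0, 4.0, 6.0]))
```

In practice the minimum and maximum would be computed on the training data only and reused for new samples, so that prediction inputs are scaled consistently with the data the model was trained on.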
The algorithm design of the SHPE model is mainly based on the CNN. An FCNN cannot quickly capture and filter the key features, which leads to excessive training cost and slower speed. The CNN, one of the most widely used models at present, extracts key features through the computation of its convolution kernels. It is most commonly used in computer vision and image processing, where the combination of convolution kernels and maximum pooling, trained with the backpropagation algorithm, extracts the key contour features of an image. Therefore, the CNN is adopted in this study to extract the key features of a variety of diseases.
First, the processed data is fed into the CNN and convolved with the convolution kernel, as below:
In Equation (8), V is the weight of the convolution kernel, L is the input value, s refers to the bias value, and valid refers to the padding method. After the element-wise multiplication and summation, the output has a different dimension from the input and forms a new feature matrix. The matrix size can be calculated with the equation below:
In Equation (9), v refers to the size of the input matrix, k refers to the size of the pooling layer, p refers to the size of the convolution kernel, and b refers to the step size of the convolution kernel. The features are extracted automatically. To give the convolutional layers better feature selection and dimensionality reduction capabilities, a pooling layer is added after the convolution, so that the features retained by maximum pooling can be used for further feature screening.
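The convolution and maximum pooling steps described above can be sketched with NumPy. This is an illustrative single-channel version, not the study's implementation: the function names are hypothetical, the kernel is flipped as in a true ('valid') convolution, and the pooling window of 2 × 2 with step size 2 follows the setting reported in the text.

```python
import numpy as np


def conv2d_valid(x, kernel, bias=0.0, stride=1):
    """'valid' 2-D convolution (no padding) of a single-channel input."""
    kh, kw = kernel.shape
    flipped = kernel[::-1, ::-1]  # true convolution flips the kernel
    out_h = (x.shape[0] - kh) // stride + 1
    out_w = (x.shape[1] - kw) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = np.sum(x[r:r + kh, c:c + kw] * flipped) + bias
    return out


def max_pool2d(x, size=2, stride=2):
    """Maximum pooling: keep the largest value in each window."""
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = x[r:r + size, c:c + size].max()
    return out


if __name__ == "__main__":
    x = np.arange(36.0).reshape(6, 6)   # toy 6x6 input matrix
    features = conv2d_valid(x, np.ones((3, 3)))
    pooled = max_pool2d(features)
    print(features.shape, pooled.shape)
```

The explicit loops are kept for readability; a deep learning framework would perform the same operations in vectorized form over many channels and kernels.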
In this study, a pooling layer with a size of 2 × 2 and a step size of 2 is adopted to extract the key features of the medical data. In addition, the parameters of the convolution kernels and pooling layers in the experimental CNN are optimized and updated through the backpropagation algorithm. The recursive equations are written as below:
In Equations (11) and (12) above, h refers to the index of the convolutional layer, and \({V^h}\) refers to the weight parameters of that layer, which are updated recursively. \(\delta\), v, and w are in matrix form, rot180() refers to the operation of rotating the input matrix 180° counterclockwise, and ‘full’ refers to the full convolution calculation.
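The role of rot180() and the ‘full’ convolution in backpropagation can be checked numerically. The sketch below is an illustration under stated assumptions, not the study's Equations (11) and (12): it assumes a single-channel layer whose forward pass is a ‘valid’ true convolution, and it verifies that the error term passed back to the layer input equals the ‘full’ convolution of the output error with the 180°-rotated kernel. All function names are hypothetical.

```python
import numpy as np


def rot180(m):
    """Rotate a matrix by 180 degrees."""
    return m[::-1, ::-1]


def conv2d(x, kernel, mode="valid"):
    """True 2-D convolution; 'full' zero-pads x by (kernel size - 1) per side."""
    kh, kw = kernel.shape
    if mode == "full":
        x = np.pad(x, ((kh - 1, kh - 1), (kw - 1, kw - 1)))
    flipped = rot180(kernel)
    out = np.empty((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * flipped)
    return out


def backprop_delta(delta_out, kernel):
    """Error term for the layer input: full convolution with the rotated kernel."""
    return conv2d(delta_out, rot180(kernel), mode="full")


def numerical_input_gradient(x, kernel, delta_out, eps=1e-6):
    """Finite-difference gradient of sum(conv2d(x, kernel) * delta_out) w.r.t. x."""
    base = conv2d(x, kernel)
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        bumped = x.copy()
        bumped[idx] += eps
        grad[idx] = np.sum((conv2d(bumped, kernel) - base) * delta_out) / eps
    return grad


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((5, 5))
    kernel = rng.standard_normal((3, 3))
    delta_out = rng.standard_normal((3, 3))
    analytic = backprop_delta(delta_out, kernel)
    numeric = numerical_input_gradient(x, kernel, delta_out)
    print(np.allclose(analytic, numeric, atol=1e-4))
```

The finite-difference check is only a sanity test of the recursion; training code would use the analytic form, since it costs one convolution instead of one forward pass per input element.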
When the extracted features do not conform to the actual situation, the algorithm of the SHPE system is further improved and the operations are repeated. The improvements mainly aim to reduce the convolution parameters or save memory, and to increase sparsity or alleviate convolution redundancy. Once the reduction is applied, both the number of convolution parameters and the memory requirement are halved compared with ordinary convolution. Finally, the Softmax function is commonly used as the classifier in deep learning, and this study does the same. The probabilities of all labels in the Softmax classifier sum to 1, and the index with the largest probability is selected as the predicted label. The Softmax function can be expressed by the following equations:
In the CNN, \({q_p}\) refers to the activated neuron node in the penultimate layer, \({v_{pi}}\) refers to the weight matrix connecting the penultimate layer and the Softmax layer, \({Z_i}\) is the input of the Softmax layer, and \({k_i}\) refers to the probability of each category. The final predicted disease category i is determined by selecting the largest \({k_i}\).
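The classification step can be sketched as follows, using the symbols defined above: the Softmax input is assumed to be the weighted sum of the penultimate activations, \(Z_i = \sum_p q_p v_{pi}\), and the probability is the standard Softmax, \(k_i = e^{Z_i} / \sum_j e^{Z_j}\). The function names are hypothetical, the bias term is omitted because the text does not mention one, and subtracting the maximum before exponentiation is a common numerical-stability choice added here.

```python
import math


def softmax(z):
    """Standard Softmax; the maximum is subtracted for numerical stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def predict(q, v):
    """q[p]: penultimate-layer activations; v[p][i]: weight from node p to class i.

    Returns the class probabilities k and the index of the largest one.
    """
    n_classes = len(v[0])
    z = [sum(q[p] * v[p][i] for p in range(len(q))) for i in range(n_classes)]
    k = softmax(z)
    best = max(range(n_classes), key=k.__getitem__)
    return k, best


if __name__ == "__main__":
    # Hypothetical activations and weights for three disease categories.
    probabilities, category = predict([1.0, 0.0], [[0.0, 2.0, 1.0], [5.0, 0.0, 0.0]])
    print(probabilities, category)
```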
3.5 Disease Prediction and Tracking by Interactive Smart Healthcare Model
In the process of disease prediction, the system can speed up the feature search by narrowing the search range, so that the target is found quickly within a limited region. In this study, an inter-frame tracking algorithm based on spatio-temporal context learning is adopted. It models the spatio-temporal relationship between the target to be tracked and its local context area within a Bayesian framework, so as to obtain the statistical correlation between the low-level features of the target and its surrounding area. During surgery, the tool tip is related to the surrounding local background, and the first step is to learn a spatial context model from this relationship by solving a deconvolution problem. Second, the spatial context model is used to update the spatio-temporal context model for the next frame. Third, the confidence map of the next frame is computed by convolution under the spatio-temporal context, so that the tip position can be found quickly as the maximum of the confidence map. The confidence map estimates the likelihood of the target location, and is given as follows.
In Equation (16) above, l refers to the location of the target, and o refers to the target object in the disease detection scene.
For disease prediction and tracking, the tip position in the current frame must be known; it is denoted as l*, and the context feature set can be defined by the following equation:
In the above equation, I(z) refers to the pixel value at position z in the image, and \({\Omega _c}({l^*})\) refers to the neighborhood of position l*. After the joint probability \(K( {l,c( z )|o} )\) is marginalized, the likelihood of the target position can be written as follows, based on Equation (16) above:
In Equation (18), \(K( {l|c( z ),o} )\) models the spatial relationship between the target location and its context information, which helps resolve ambiguities when the image measurements allow different interpretations. \(K( {c( z )|o} )\) models the prior probability of the local context content. \(K( {l|c( z ),o} )\) has to be obtained first; that is, the relationship between the target location and its spatial context has to be learned. The conditional probability can be modeled by the following equation:
In the above equation, \({s^{bc}}(l - z)\) is a function of the relative distance and direction between the target position l and its local context position z, and thus encodes the spatial relationship between the target and its context. Because \({s^{bc}}(l - z)\) depends on direction as well as distance, no ambiguity arises from symmetry when the spatial context of the tip position and the surrounding background is considered. After \(K( {l|c( z ),o} )\) is obtained, the context prior probability \(K( {c( z )|o} )\) is simply modeled as follows:
In Equation (20) above, I(z) refers to the gray value at z, which describes the appearance of the context at z, and \({\omega _\sigma }(.)\) represents the weighting function, which can be expressed by the equation below:
In Equation (21) above, n represents a normalization constant that keeps \(K( {c( z )|o} )\) in Equation (18) within the range 0 to 1, so that it satisfies the definition of a probability, and \(\sigma\) refers to the scale parameter. This weighting function is inspired by the attention mechanism of the biological visual system, in which certain image areas receive more focus. Put simply, the shorter the distance between a point and the target, the more attention the point receives. How quickly the attention falls off with distance is determined by \(\sigma\). Next, the confidence map of the target position can be calculated as follows:
\(\beta\) refers to the shape parameter. If it is too large, the confidence map is over-smoothed, and information near the center of the target is lost, so that the positioning becomes ambiguous. If it is too small, the confidence map is too sharp near the center of the target; too little information is then available for modelling the spatial context of the target, which may cause over-fitting and recognition errors. Experiments show strong robustness when \(\beta\) = 1. Moreover, the confidence map is obtained by calculating the likelihood of any point l in the context area given the target position l*. From the equations of the confidence map and the context prior probability, the equation below can be obtained based on the convolution and fast Fourier transform (FFT) operations:
\(\otimes\) refers to the convolution operation. Applying the FFT, Equation (23) can be transformed into the equation below in the frequency domain:
In the above equation, F refers to the FFT, and \(\circ\) refers to element-wise multiplication. Then, \({q^{bc}}(l)\) can be obtained by using the inverse FFT \({F^{ - 1}}\), as follows:
Finally, the spatial context model \(q_t^{bc}(l)\) of the current frame is learned and used to update the spatio-temporal context model \(Q_{t + 1}^{btc}\) of the next frame, which is needed to find the position of the target in the next frame. The equation is as follows:
In the above equation, \(\rho\) refers to the learning rate. After the spatio-temporal context model of the next frame is updated, the confidence map of the next frame can be obtained based on Equations (23) and (24), as follows:
Then, the target position \(l_{t + 1}^*\) of the next frame can be obtained by finding the maximum of the confidence map \({c_{t + 1}}(l)\) of the next frame, as below:
With the above steps, inter-frame positioning and tracking for diseases in the medical system can be completed, which in turn supports the prediction of the disease type.
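The tracking loop described in this section can be sketched with NumPy's FFT routines. This is a simplified illustration under stated assumptions, not the study's implementation: it works on a whole single-channel frame instead of a cropped context window, uses circular (FFT) convolution, sets the normalization constant of the confidence map to 1, and uses the illustrative values `alpha=2.25` and `rho=0.075`, which are not given in the text (only \(\beta\) = 1 is). All function names are hypothetical.

```python
import numpy as np


def distance_grid(shape, center):
    """Euclidean distance of every pixel from `center` (row, col)."""
    rows, cols = np.indices(shape)
    return np.hypot(rows - center[0], cols - center[1])


def confidence_map(shape, center, alpha=2.25, beta=1.0):
    """Confidence map peaked at the target position (normalization set to 1)."""
    return np.exp(-(distance_grid(shape, center) / alpha) ** beta)


def context_prior(image, center, sigma):
    """Gray values weighted by a Gaussian-like attention function around center."""
    weight = np.exp(-distance_grid(image.shape, center) ** 2 / sigma ** 2)
    return image * weight / weight.sum()


def learn_spatial_context(image, center, sigma, alpha=2.25, beta=1.0):
    """Deconvolution in the frequency domain, then inverse FFT."""
    conf_f = np.fft.fft2(confidence_map(image.shape, center, alpha, beta))
    prior_f = np.fft.fft2(context_prior(image, center, sigma))
    return np.real(np.fft.ifft2(conf_f / (prior_f + np.finfo(float).eps)))


def update_model(stc_model, spatial_model, rho=0.075):
    """Blend the new spatial context model into the spatio-temporal model."""
    return (1.0 - rho) * stc_model + rho * spatial_model


def locate(stc_model, image, prev_center, sigma):
    """Confidence map of a frame and the position of its maximum."""
    prior_f = np.fft.fft2(context_prior(image, prev_center, sigma))
    conf = np.real(np.fft.ifft2(np.fft.fft2(stc_model) * prior_f))
    row, col = np.unravel_index(np.argmax(conf), conf.shape)
    return (int(row), int(col)), conf


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = 0.5 + rng.random((32, 32))          # synthetic textured frame
    center = (16, 12)                            # known target position l*
    model = learn_spatial_context(frame, center, sigma=8.0)
    position, _ = locate(model, frame, center, sigma=8.0)
    print(position)
```

For a video, the model learned on the first frame would initialize the spatio-temporal model; each new frame is then passed to `locate` with the previous position, a new spatial model is learned at the estimated position, and `update_model` blends it in with learning rate `rho`.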