
Health Status Prediction with Local-Global Heterogeneous Behavior Graph

Published: 12 November 2021

Abstract

Health management is getting increasing attention all over the world. However, existing health management mainly relies on hospital examination and treatment, which are complicated and untimely. The emergence of mobile devices provides the possibility to manage people’s health status in a convenient and instant way. Estimation of health status can be achieved with various kinds of data streams continuously collected from wearable sensors. However, these data streams are multi-source and heterogeneous, containing complex temporal structures with local contextual and global temporal aspects, which makes feature learning and joint data utilization challenging. We propose to model the behavior-related multi-source data streams with a local-global graph, which contains multiple local context sub-graphs to learn short-term local context information with heterogeneous graph neural networks and a global temporal sub-graph to learn long-term dependency with self-attention networks. Then health status is predicted based on the structure-aware representation learned from the local-global behavior graph. We conduct experiments on the StudentLife dataset, and extensive results demonstrate the effectiveness of our proposed model.

1 Introduction

Health is not only the basic guarantee of human happiness and well-being but also the foundation of economic progress. Staying in good health requires reasonable health management, which has attracted increasing attention from governments and companies all over the world. Existing health management mainly relies on medical examination and specialized patient treatment in hospitals. However, many citizens usually do not consider going to the hospital for a check-up until they have abnormal physical symptoms. Moreover, regular or irregular medical examination with the professional medical equipment in hospitals can only provide discrete measurements of an individual’s health status at specific moments. Both issues pose challenges for early detection and prevention of diseases, which are the key components of effective health management.
According to a study of the World Health Organization (WHO), personal behaviors and lifestyles account for 60% of factors affecting human health [1]. For example, non-smokers with defective genes are much less likely to suffer from lung disease than regular smokers [2]. A healthy diet and adherence to appropriate exercise can greatly reduce the incidence of diabetes and cardiovascular diseases [3]. Therefore, the real-time and continuous analysis and monitoring of personal behavior and health status are helpful for individuals to enhance self-active health awareness and learn disease prevention knowledge, thus improving health management capabilities [47].
The rapid development of smart portable and wearable devices has promoted the widespread use of various low-power sensors, and the advent of the 5G era has made it possible to collect individual health data streams of multiple sensors in real time. As long as people carry their devices, all their daily routines, diets, and activities are recorded automatically and instantly without extra effort. For example, the daily routine can be recorded by a GPS sensor, a picture showing people’s diet can be captured using a phone camera, and the activity information can be recognized using an accelerometer sensor. These real-time data streams can be transferred to a back-end system for behavior analysis and health status estimation and finally help people improve their health. Compared with the patient records in the hospital, these kinds of data streams not only provide long-term signals to fully describe the individual’s daily behavior and lifestyle but also support continuous data transmission and analysis without interfering with individuals’ daily lives and work.
The health data streams collected from various sensors are multi-source and heterogeneous. On the one hand, the sampling frequency of each sensor is different, which makes the co-processing difficult. On the other hand, the collected sensor data are multimodal (e.g., pictures or videos from cameras, motion signals from accelerometers and gyroscopes, and location coordinates from GPS sensors). Though multi-source sensors can provide complementary information, the feature learning and joint utilization of the multimodal data streams remain a challenge for health status estimation.
A number of health status prediction methods based on mobile devices have been proposed. Some rely on non-parametric methods (e.g., K-Means, Mean Shift) with single-source data as input, such as acceleration signals [51], camera data in smartphones [7], and sound from microphones [50]. Other methods are devoted to taking advantage of information from various sources [41, 46]. However, most of the existing methods do not make full use of the structure information of the multi-source data streams, which limits further performance improvement. Practically, the multi-source and heterogeneous health data stream mainly reflects the individual behaviors and lifestyles that contain complex temporal structures with local contextual and global temporal aspects. Local context refers to the behavior in the short term, such as the activities and routines within 1 day. Detailed behavior information such as the activity sequence and location transfer should be considered in the local context to capture the principal characteristics of the daily behavior. For the global temporal aspect, the temporal dependency among local contexts needs to be captured to represent a long-term comprehensive description of individual behaviors.
Recently, Graph Neural Networks (GNNs) [22, 69, 72] have drawn great attention for modeling interactions in structural data. Taking a graph as input, GNNs propagate messages between nodes along the edges and thus learn representations for both nodes and edges. Most GNNs work on the homogeneous case where all nodes in a graph belong to one type [6, 26, 52]. The heterogeneous graph neural network [62], a special case of GNN, addresses the situation where nodes are of different types. It has been successfully applied in [14, 18, 67], where highly competitive performances are obtained. Inspired by this development, it is promising to model the intra-modality structure and inter-modality interaction of the multi-source and heterogeneous health data with heterogeneous graph neural networks.
In this article, we propose to predict daily mental health status based on multi-source wearable sensor data. We build a local-global individual behavior graph (LGIBG) based on the heterogeneous data and then predict the daily health status with the help of heterogeneous graph neural networks. Specifically, we take three kinds of sensor data streams (accelerometer, audio, WiFi) as input and detect middle-level behavior-related concepts (e.g., walking, running, silence) with pretrained backbone models. These concepts are further used to build the local-global individual behavior graph, which consists of multiple local context sub-graphs and a global temporal sub-graph. The local context sub-graphs are created with the concepts detected from daily data streams as heterogeneous nodes that are connected with homogeneous and heterogeneous edges. Next, a densely connected global temporal sub-graph is created on top of the local context sub-graphs. Then we take advantage of the heterogeneous graph neural network to learn the features of local context sub-graphs and get both semantic and structural representations. The representation of the global temporal sub-graph is learned with a self-attention network, and it is finally used to predict the health status.
In summary, the contributions of our work are threefold: (1) To effectively represent the behavior-related multi-source data collected from wearable sensors, we build a local-global graph that consists of multiple local context sub-graphs and a global temporal sub-graph. The local-global graph can well describe the short-term context information of individual behaviors and their long-term temporal dependencies. (2) We learn the short-term semantic and structural representations from local context sub-graphs with heterogeneous graph neural networks and the long-term representation from the global temporal sub-graph with self-attention networks. (3) We demonstrate the effectiveness of the proposed method in health prediction on the public StudentLife dataset.

2 Related Work

2.1 Health Status Prediction

Our task is to predict the health status based on personal behaviors with multi-source sensor data collected in daily life. This task has important medical implications because it can enable early prevention of diseases and complement clinical treatment in hospitals. Most existing health prediction methods can be divided into two categories: Electronic Health Record (EHR)-based methods [45, 48] and mobile-sensor-data-based methods [9, 11]. The EHR-based methods are more relevant to medical studies since they use record data collected during the hospital treatment process with professional medical equipment. However, people have little EHR data before they are diagnosed with diseases, and thus this kind of method, however good, can only provide a piece of the puzzle. More kinds of data about the early-life experiences of patients should also be considered in treatment, which is what the second kind of method does. Sensor-data-based methods focus on personal behavior in daily life and take advantage of mobile devices, which are more convenient for monitoring people’s health and also provide additional useful information for clinical treatment. Below we introduce these two categories in detail.
EHR data are collected during the hospital treatment process and contain nearly all the information of patients, such as diagnoses, medication prescriptions, and clinical notes. In early work, expert-defined rules were adopted to identify diseases based on the EHR data, such as type 2 diabetes [35] and cataracts [53]. Much work based on EHR data has been done with deep learning models, with disease classification being the most common task. Cheng et al. [15] and Acharya et al. [4] train a CNN model to classify normal, preictal, and seizure EEG signals. Che et al. [13] propose a multi-class classification task to predict the different stages of Parkinson’s disease with a Recurrent Neural Network (RNN). Kam and Kim [34] perform binary classification of sepsis by feeding EHR data into a long short-term memory (LSTM) network. In addition to disease classification, future event prediction is another task that has attracted much attention recently, which aims to predict future medical events according to historical records. For example, Futoma et al. [20] and Rajkomar et al. [44] use EHR data from the hospital to predict events such as mortality, readmission, length of stay, and discharge diagnoses with deep feed-forward networks and LSTMs. There has been much work on mental health prediction based on EHR data, among which T1-weighted imaging [45, 48] and functional magnetic resonance imaging (fMRI) [19, 28] are the most commonly used data to study brain structure, with other physiological signals such as electroencephalograms also playing an important role. Recently, machine learning has received more attention for its effect on improving the management of mental health. Costafreda et al. [16] perform depression classification with an SVM using smoothed gray matter voxel-based intensity values. Rosa et al. [58] propose a sparse L1-norm SVM to predict depression with features of region-based functional connectivity. Cai et al. [10] collect electroencephalogram (EEG) signals of participants and use four classification methods (SVM, KNN, DT, and ANN) to distinguish depressed participants from normal controls.
Mobile devices provide another way for health status prediction, where diverse sensors can be used to capture various signals of people, thus making it easier to monitor daily behavior and predict health status. Machado et al. [43] calculate the signal magnitude area of the acceleration signal to recognize activities with several clustering algorithms (e.g., K-Means, Mean Shift). Koenig et al. [37] and Banhalmi et al. [7] use the camera in smartphones to monitor heart rate (HR) and heart rate variability (HRV), which are vital signs of cardiovascular health. Stafford et al. [61] and Goel et al. [27] detect breathing and cough sounds with the microphone in smartphones to assess pulmonary health in a quick and efficient way. Beyond data from only one source, many researchers are devoted to taking advantage of information from various sources [31, 56, 71]. Asselbergs et al. [5] integrate accelerometer data, call history, and short message service patterns to predict mood. Burns et al. [9] predict depression based on GPS, accelerometer, and light sensor data from smartphones. Nag et al. [49] estimate heart health status by combining sensor data from wearable devices with other factors, such as inherent genetic traits, circadian rhythm, and living environmental risks analyzed from cross-modal data, which provides better personalized health insight.
However, none of the above work fully explores the local and global temporal characteristics of daily behavior captured by multi-source wearable sensors.

2.2 Graph Neural Network

Recently, the emergence of structural data, especially structured graphs, has promoted the development of GNNs [23, 69, 72]. As early work on GNNs, recurrent graph neural networks (RecGNNs) [21, 59] apply recurrent architectures to learn node representations, where message passing is done repeatedly with nodes’ neighborhoods until the node representations are stable. Inspired by the success of Convolutional Neural Networks (CNNs), the convolution operation has also been introduced to graph data in both spectral [17, 29, 36] and spatial ways [6, 24, 25]. The spectral approaches adapt spectral graph theory to design a graph convolution. The spatial approaches inherit the message passing idea of RecGNNs but differ in getting node representations by stacking multiple convolutional layers. Besides RecGNNs and ConvGNNs, many other graph architectures have been developed to cope with different scenarios. For example, graph autoencoders (GAEs) [12, 65] are used to learn the graph embedding by reconstructing structural information such as the adjacency matrix of the graph. Spatial-temporal graph neural networks (STGNNs) [32, 42, 60] aim to model both the spatial and temporal dependency of data and learn the representation of the spatial-temporal graph, which have advantages in related tasks such as human action recognition.
Most of the existing GNNs focus on homogeneous graphs where nodes are of the same type and can be processed in the same way. In comparison, heterogeneous graphs contain diverse types of nodes and edges, leading to a more complicated situation in computation. On the one hand, different types of nodes may have different semantic meanings and different feature spaces. On the other hand, the heterogeneous graph represents both homogeneous and heterogeneous relations of data. Recently, some work has been done on heterogeneous graphs. Dong et al. [18] propose the metapath2vec method to learn heterogeneous graph embeddings with a meta-path-based random walk. Chen et al. [14] process different kinds of nodes with several projection matrices used to embed all the nodes into the same space and then do link prediction. Wang et al. [67] further introduce hierarchical attention to heterogeneous graphs to learn attentions for both nodes and meta-paths. Until now, the application of heterogeneous GNNs to individual behavior analysis and health status prediction is yet to be explored.

3 Methods

3.1 Framework Overview

Individual behavior refers to the way a person lives. Our purpose is to model individual behavior over a period based on multi-modal data streams collected by wearable devices and then learn effective representations to predict health status.
As shown in Figure 1, we take multi-source data streams as input and detect behavior-related middle-level concept sequences with pre-trained backbone models. Then, the behavior-related middle-level concept sequences are used to build the behavior graph, which consists of multiple local context sub-graphs and a global temporal sub-graph. Specifically, the concepts are regarded as different types of nodes to build local context sub-graphs. Each local context graph is regarded as a node in the global temporal sub-graph to catch temporal dependency. The representations of the local context sub-graph and global temporal sub-graph are learned by local context modeling and global temporal relation modeling, based on which the final representation of the behavior graph is learned and used to predict the health status.
Fig. 1. Overview of our framework. We take multi-source data streams as input and detect concept sequences to build a behavior graph that consists of a local context sub-graph and global temporal sub-graph.

3.2 Behavior-related Concept Detection

We take three kinds of data streams (i.e., accelerometer, microphone, and WiFi) as input and detect behavior-related middle-level concept sequences with three pre-trained backbone models (i.e., activity, audio, and location detectors). Each concept sequence is denoted as $C^k = \{c^k_1, c^k_2, \dots\}$, where $k$ indexes the concept type and $c^k_t$ denotes the specific concept class (e.g., walking detected by the activity detector). Meanwhile, we also obtain timestamp sequences $S^k = \{s^k_1, s^k_2, \dots\}$, where each timestamp $s^k_t$ is a 2-dimensional vector that represents the start and end time of the detected concept class in the corresponding data stream. More details of the pre-trained backbone models are introduced in Section 4.2, and the notations and their corresponding explanations are shown in Table 1.
Notation | Explanation
$c^k_t$ | concept class for type $k$ at the $t$th moment
$s^k_t$ | a 2D vector of start and end time of $c^k_t$
$a^k_i$ | the attribute of the $i$th node from type $k$
$h^k_i$ | the embedding of the $i$th node from type $k$
$w_{ij}$ | the weight of the edge between node pair $i, j$
$e_{ij}$ | the embedding of the edge between node pair $i, j$
Table 1. Notations and Explanations

3.3 Behavior Graph Building

To capture the temporal structure of individual behavior, we need to build a behavior graph that contains both the local context information and the long-term relationships from the multimodal data streams. However, a huge densely connected graph would increase the computational complexity and hurt the performance. For this reason, we decompose the whole graph into two kinds of sub-graphs: local context graphs to explore the local information of individual behaviors in the short term, and a global temporal graph to capture temporal dependency in the long term. The local context graphs are regarded as nodes of the global temporal graph.

3.3.1 Local Context Sub-Graph.

The local context graph is built based on the daily data streams, which reflect individual behaviors (e.g., activities, audio, locations) from various aspects. Taking these different aspects into consideration, the local context graph is actually a heterogeneous graph. As illustrated in Section 3.2, we have detected three types of concept sequences from the data streams. In the following, we explicitly use activity, audio, and location to denote the type names. For the $t$th time step, we use a sliding time window of 1 day to crop out three concept sub-sequences $C^k_t = \{c^k_{t,1}, \dots, c^k_{t,n_k}\}$ from the original concept sequences. Each concept sub-sequence represents several consecutive concepts detected in 1 day. Here $n_k$ denotes the size of the sliding time window, i.e., the number of timestamps contained in 1 day for the $k$th type of concept sequence. Accordingly, we can obtain the timestamp sub-sequences $S^k_t = \{s^k_{t,1}, \dots, s^k_{t,n_k}\}$. Next, we will introduce how to create the local context graph based on $C^k_t$ and $S^k_t$.
For the $t$th time step, the local context graph can be formally defined as $G_t = (V_t, E_t)$, where $V_t$ is composed of the different types of nodes. Each type corresponds to a specific aspect to describe individual behavior. $E_t$ is the set of edges containing both the homogeneous edges, which connect two nodes of the same type, and the heterogeneous edges, which connect two nodes of different types.
The nodes of the local context graph are composed of all the concept classes in $C^k_t$. Since there are three types of concept classes, the local context graph has three different node types and thus is a heterogeneous graph. Each node has an attribute and an embedding representation, noted as $a^k_i$ and $h^k_i$ for the $i$th node of type $k$. For the attribute $a^k_i$ of a node with concept class $c^k_{t,i}$, we use its corresponding timestamp $s^k_{t,i}$ to compute the time interval that represents the duration of the concept-related behavior. For the node embedding $h^k_i$, we introduce external semantic knowledge to help its learning. Specifically, we extract GloVe embeddings [54] corresponding to the concept class name of each node, which are pre-trained on Wikipedia according to word-by-word co-occurrence. These embeddings have proved effective in capturing the semantic meanings of words in many NLP tasks and therefore provide a reasonable initial representation for the nodes.
As for edges, homogeneous edges and heterogeneous edges are considered in different ways. Two nodes of the same type $k$ are connected with a homogeneous edge if they are temporal neighbors in the concept sequence $C^k_t$. For example, a node with the dormitory concept class from the type location is connected to a node with the library concept class from the same type if the individual moves from the dormitory to the library at consecutive timestamps. The weight of the homogeneous edge is the frequency with which the two nodes appear as neighbors in the concept sequence. With the edge weight, the homogeneous edge captures specific patterns of an individual’s behavior changes. Heterogeneous nodes (e.g., an activity node, an audio node, and a location node) are connected according to their co-occurrences in time, i.e., their timestamps overlap. For example, a node from the type audio and a node from the type location are connected when they are detected from the data streams at the same timestamp. The co-occurrences in time reflect the interactions of heterogeneous nodes, which describe the individual behaviors from more aspects. We do not connect heterogeneous nodes that are temporal neighbors in the concept sequence, because connecting different types of nodes according to their temporal order has no practical significance. The weight of the heterogeneous edge is the frequency of co-occurrence in time. Note that the weights of both homogeneous edges and heterogeneous edges are written as $w_{ij}$ with $i, j$ as node indexes; whether $w_{ij}$ represents a homogeneous edge or a heterogeneous edge depends on the specific types of node $i$ and node $j$.
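To make the construction above concrete, the following is a minimal Python sketch of building one local context sub-graph from a day's concept sub-sequences. The data layout (per-type lists of (class, start, end) tuples) and all names are illustrative assumptions, not the original implementation.

```python
from collections import defaultdict

def build_local_context_graph(day_concepts):
    """Sketch of local context sub-graph construction for one day.

    day_concepts: dict mapping a type name ('activity', 'audio', 'location')
    to a chronologically ordered list of (concept_class, start, end) tuples.
    Returns node attributes (durations), homogeneous and heterogeneous edge weights.
    """
    node_attr = defaultdict(float)   # (type, class) -> total duration (node attribute)
    homo_w = defaultdict(float)      # ((type, c1), (type, c2)) -> neighbor frequency
    hetero_w = defaultdict(float)    # ((type1, c1), (type2, c2)) -> co-occurrence frequency

    # Nodes: every concept class observed in the day; attribute is its total duration.
    for k, seq in day_concepts.items():
        for c, start, end in seq:
            node_attr[(k, c)] += end - start

    # Homogeneous edges: temporally neighboring concepts of the same type.
    for k, seq in day_concepts.items():
        for (c1, _, _), (c2, _, _) in zip(seq, seq[1:]):
            if c1 != c2:
                homo_w[((k, c1), (k, c2))] += 1.0

    # Heterogeneous edges: concepts of different types whose time spans overlap.
    types = list(day_concepts)
    for a in range(len(types)):
        for b in range(a + 1, len(types)):
            for c1, s1, e1 in day_concepts[types[a]]:
                for c2, s2, e2 in day_concepts[types[b]]:
                    if max(s1, s2) < min(e1, e2):   # co-occurrence in time
                        hetero_w[((types[a], c1), (types[b], c2))] += 1.0

    return node_attr, homo_w, hetero_w
```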

3.3.2 Global Temporal Sub-Graph.

The global temporal sub-graph models the long-term time dependency of the daily information and provides the global information for the whole period, which is used to predict the health status. Formally, the global temporal graph is denoted as $G = (V, E)$, where the nodes in $V$ refer to the local context graphs introduced in Section 3.3.1, and $E$ represents the interactions between any two local context graphs.

3.4 Local Context Graph Modeling

As introduced in Section 3.3.1, we have created a heterogeneous local context graph at each time step of multi-source data streams. Now we will introduce how to capture local context information of the short-term individual behavior with the heterogeneous graph neural network. The network contains m layers of node message passing modules and edge embedding learning modules. Here we only introduce these two kinds of modules for one layer. As shown in Figure 2, the node message passing module is used to learn the node embeddings and graph semantic representation, while the edge embedding learning module is used to learn the edge embeddings and graph structural representation. Then the final representation of the local context graph is obtained with the combination of the semantic and the structural representation. It is worth noting that all local context graphs share the same network parameters.
Fig. 2. Local context graph modeling.

3.4.1 Node Message Passing.

We consider the node message passing process in two ways: homogeneous message passing through homogeneous edges and heterogeneous message passing through heterogeneous edges. At first, we multiply each node embedding $h^k_i$ with its attribute $a^k_i$, which reflects the node importance in time. For simplicity, the resulting representation of each node is still denoted by $h^k_i$.
Homogeneous message passing aims to learn information from the same type of nodes according to their edges. For the ith node of type k, its message passing process is done as below:
$$\hat{h}^k_i = W^k_1 h^k_i + \sum_{j \in \mathcal{N}^k(i)} \hat{w}_{ij} W^k_2 h^k_j, \quad (1)$$
where $W^k_1$ and $W^k_2$ are learnable matrices, $i$ and $j$ are node indexes, $\mathcal{N}^k(i)$ is the set of homogeneous neighbors of node $i$, and $\hat{w}_{ij}$ is the normalized value of the homogeneous edge weight between the $i$th node and the $j$th node defined in Section 3.3.1. The calculations for other types of nodes are done in the same way with different projection matrices. By this means, each node gets information from its homogeneous neighborhood according to the connections.
Heterogeneous message passing manages to capture additional semantic meanings from other types of nodes and thus learns a comprehensive representation for individual behaviors. Since each node has more than one type of heterogeneous neighbor, we add embeddings learned from all types of heterogeneous neighbors to do the message passing, shown as follows:
$$\tilde{h}^k_i = W_3 \hat{h}^k_i + \sum_{k' \neq k} \sum_{j \in \mathcal{N}^{k'}(i)} \hat{w}_{ij} W^{k,k'}_4 \hat{h}^{k'}_j, \quad (2)$$
where $W_3$ and $W^{k,k'}_4$ are learnable matrices, $k$ and $k'$ are node type indexes with $k' \neq k$, $i$ and $j$ are node indexes, and $\hat{w}_{ij}$ is the normalized value of the heterogeneous edge weight defined in Section 3.3.1.
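The following PyTorch sketch illustrates one layer of this node message passing in the spirit of Equations (1) and (2). The dense normalized adjacency layout, the per-type projection matrices, and the absence of an explicit non-linearity are illustrative assumptions, not the original implementation.

```python
import torch
import torch.nn as nn

class NodeMessagePassing(nn.Module):
    """One layer of homogeneous + heterogeneous node message passing (sketch of Eqs. (1)-(2))."""

    def __init__(self, dim, types=('activity', 'audio', 'location')):
        super().__init__()
        self.types = types
        # Homogeneous passing: self and neighbor projections per node type (Eq. (1)).
        self.w1 = nn.ModuleDict({k: nn.Linear(dim, dim, bias=False) for k in types})
        self.w2 = nn.ModuleDict({k: nn.Linear(dim, dim, bias=False) for k in types})
        # Heterogeneous passing: self projection plus one projection per type pair (Eq. (2)).
        self.w3 = nn.ModuleDict({k: nn.Linear(dim, dim, bias=False) for k in types})
        self.w4 = nn.ModuleDict({
            f'{k}->{kk}': nn.Linear(dim, dim, bias=False)
            for k in types for kk in types if k != kk
        })

    def forward(self, h, adj_homo, adj_het):
        # h[k]: (n_k, dim) embeddings of type-k nodes, already scaled by their attributes
        # adj_homo[k]: (n_k, n_k) normalized homogeneous edge weights
        # adj_het[(k, kk)]: (n_k, n_kk) normalized heterogeneous edge weights
        h_hat = {k: self.w1[k](h[k]) + adj_homo[k] @ self.w2[k](h[k]) for k in self.types}
        h_out = {}
        for k in self.types:
            msg = sum(adj_het[(k, kk)] @ self.w4[f'{k}->{kk}'](h_hat[kk])
                      for kk in self.types if kk != k)
            h_out[k] = self.w3[k](h_hat[k]) + msg
        return h_out
```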

3.4.2 Edge Embedding Learning.

Compared with the node embeddings, edge embeddings reveal more structural information of the graph. For our heterogeneous graph, edges are also of different types since they connect different types of nodes. Considering that nodes have been embedded in a common semantic space by the node message passing module, we directly concatenate them and use a projection to extract the edge embedding:
$$e_{ij} = W_e \big[\tilde{h}^k_i \,\|\, \tilde{h}^{k'}_j\big], \quad (3)$$
where $k$ and $k'$ are node type indexes and $\|$ denotes concatenation. $W_e$ is a learnable matrix. If $k = k'$, $e_{ij}$ is the embedding of a homogeneous edge; otherwise, it is the embedding of a heterogeneous edge.
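A minimal sketch of this edge embedding step follows, assuming the two endpoint embeddings are stacked per edge; dimensions and names are illustrative.

```python
import torch
import torch.nn as nn

class EdgeEmbedding(nn.Module):
    """Edge embedding by concatenation and projection (sketch of Eq. (3))."""

    def __init__(self, node_dim, edge_dim):
        super().__init__()
        self.proj = nn.Linear(2 * node_dim, edge_dim, bias=False)

    def forward(self, h_i, h_j):
        # h_i, h_j: (num_edges, node_dim) embeddings of the two endpoints of each edge;
        # the same projection is applied to homogeneous and heterogeneous edges.
        return self.proj(torch.cat([h_i, h_j], dim=-1))
```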

3.4.3 Local Context Graph Representations.

For each local context graph, we learn two kinds of representations to capture the short-term behavior information: a semantic representation that reflects semantic meanings and a structural representation that catches information of the graph structure.
We obtain the semantic representation of the local context graph by combining the embeddings of all types of nodes learned with the node message passing module. Although the attribute of a node defined in Section 3.3.1 could reflect its importance, the semantic meaning of the node should also be considered, because a concept that appears only a few times in the concept sequence may still be an important factor for health status prediction. Therefore, we take advantage of the soft-attention mechanism to determine the importance of different nodes and combine node embeddings to get the semantic representation:
$$z_{sem} = \sum_{i} \alpha_i \tilde{h}_i, \quad (4)$$
$$\alpha_i = \frac{\exp(q^{\top} \tilde{h}_i)}{\sum_{j} \exp(q^{\top} \tilde{h}_j)}, \quad (5)$$
where $z_{sem}$ is the semantic representation of the local context graph, $\alpha_i$ is the relevant importance given to each node when blending all nodes together, and $q$ is a trainable vector used as the query. The reason for using softmax in Equation (5) mainly lies in two points: (1) the softmax is differentiable and thus can be easily integrated into the graph neural network for end-to-end training, and (2) the output values of the softmax function are in the range of [0,1] and sum to 1. With the softmax function, $\alpha_i$ can be interpreted as the relevant importance given to node $i$ when blending all nodes together.
For the structural representation, since different edges play various roles for the graph structure, we also use the attention mechanism to calculate their correlations and then combine them to get the graph structural representation. Since the semantic representation contains the global information of the local context graph with semantic meanings of all nodes considered, we treat it as a query vector to help learn more effective attentions for combining edge embeddings:
$$z_{str} = \sum_{(i,j)} \beta_{ij} e_{ij}, \quad (6)$$
$$\beta_{ij} = \frac{\exp\big((W_q z_{sem})^{\top} e_{ij}\big)}{\sum_{(i',j')} \exp\big((W_q z_{sem})^{\top} e_{i'j'}\big)}, \quad (7)$$
where $e_{ij}$ is the embedding of either a homogeneous edge or a heterogeneous edge, $\beta_{ij}$ is the relevant importance given to each edge when blending all edges together, and $W_q$ is a projection matrix for the semantic representation $z_{sem}$.
We get the final representation for each local context graph with the concatenation of its semantic and structural representations as $z = [z_{sem} \,\|\, z_{str}]$.
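The two attention-based readouts and the final concatenation can be sketched as follows; the single trainable query vector and the tensor shapes are assumptions consistent with Equations (4) to (7).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalGraphReadout(nn.Module):
    """Semantic + structural readout of a local context graph (sketch of Eqs. (4)-(7))."""

    def __init__(self, dim):
        super().__init__()
        self.q = nn.Parameter(torch.randn(dim))    # trainable query for node attention
        self.w_q = nn.Linear(dim, dim, bias=False)  # projection of the semantic query

    def forward(self, node_emb, edge_emb):
        # node_emb: (num_nodes, dim), all node embeddings of one local context graph
        # edge_emb: (num_edges, dim), all edge embeddings of the same graph
        alpha = F.softmax(node_emb @ self.q, dim=0)           # Eq. (5): node attention
        z_sem = alpha @ node_emb                               # Eq. (4): semantic representation
        beta = F.softmax(edge_emb @ self.w_q(z_sem), dim=0)    # Eq. (7): edge attention
        z_str = beta @ edge_emb                                # Eq. (6): structural representation
        return torch.cat([z_sem, z_str], dim=-1)               # final local graph representation
```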

3.5 Global Temporal Relation Modeling

The self-attention network (SAN) was first introduced in the Transformer [64], which has a sequence-to-sequence architecture and is widely used in neural machine translation. Taking a token sequence as input, SAN calculates the attention scores between each token and the other tokens with multiple attention heads. Then the token embeddings are updated with the other token embeddings according to their attention scores. From this perspective, SAN can be regarded as a graph neural network in which the tokens form fully connected nodes, while the multi-head attention mechanism is a special message passing method. Inspired by this, we implement the global temporal relation modeling with the self-attention network.
Specifically, we obtain a sequence of local context graph representations $\{z_1, z_2, \dots, z_T\}$ as illustrated in Section 3.4. To encode the temporal information among different local graphs, we adopt the position embeddings in [64] to encode the relative position of each local context graph, noted as $p_i$. Therefore, the representation for the $i$th graph is $g_i = z_i + p_i$. Then the correlations between any two local context graphs can be calculated with the attention scheme:
$$a_{ij} = (W_Q g_i)^{\top} (W_K g_j), \quad (8)$$
where $W_Q$ and $W_K$ are learnable matrices.
The attention scores are scaled and normalized with a softmax function and then used to get an attended representation for each local context graph:
$$\hat{a}_{ij} = \frac{\exp(a_{ij} / \sqrt{d})}{\sum_{j'=1}^{T} \exp(a_{ij'} / \sqrt{d})}, \quad (9)$$
$$\tilde{g}_i = \sum_{j=1}^{T} \hat{a}_{ij} W_V g_j, \quad (10)$$
where $W_V$ is a learnable projection matrix, $d$ is the dimension of $g_i$, $T$ is the number of local context graphs, and $\tilde{g}_i$ is the attended representation for the $i$th local context graph. Finally, we get the structure-aware representation of the global temporal graph by adding the attended representations of all local context graphs.
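A possible sketch of the global temporal relation modeling with a single-head self-attention layer is given below; the learned position embeddings and the summation readout are assumptions (the original adopts the position embeddings of [64] and may use multiple heads).

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalTemporalSelfAttention(nn.Module):
    """Self-attention over local context graph representations (sketch of Eqs. (8)-(10))."""

    def __init__(self, dim, max_len=16):
        super().__init__()
        self.pos = nn.Embedding(max_len, dim)   # learned position embeddings (illustrative)
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_k = nn.Linear(dim, dim, bias=False)
        self.w_v = nn.Linear(dim, dim, bias=False)

    def forward(self, z):
        # z: (T, dim), one representation per local context graph (e.g., per day)
        T, dim = z.shape
        g = z + self.pos(torch.arange(T, device=z.device))          # add position information
        scores = (self.w_q(g) @ self.w_k(g).t()) / math.sqrt(dim)   # Eq. (8), scaled
        attn = F.softmax(scores, dim=-1)                             # Eq. (9)
        attended = attn @ self.w_v(g)                                # Eq. (10)
        return attended.sum(dim=0)   # structure-aware global representation
```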

3.6 Objective Function

The final loss function is written as the sum of a classification loss and a node variance constraint, $\mathcal{L} = \mathcal{L}_{cls} + \lambda \mathcal{L}_{var}$, where $\lambda$ is a trade-off parameter.
Classification Loss: With the representation of the global temporal graph, we predict the health status probabilities $\hat{y}$ by a fully connected layer with softmax activation. Then we calculate the cross-entropy loss:
$$\mathcal{L}_{cls} = -\frac{1}{N} \sum_{n=1}^{N} \sum_{c} y_{n,c} \log \hat{y}_{n,c}, \quad (11)$$
where $N$ is the number of instances used in the training process and $y_{n,c}$ is the ground-truth label of the health status for the $n$th instance.
Node Variance Loss: It is worth noting that many GNNs face the problem of node homogenization after several epochs of node message passing, since all nodes exchange information with their neighbors. In the local context graph modeling illustrated in Section 3.4, the message passing is applied not only to homogeneous nodes but also to heterogeneous nodes, which may make the representations of nodes similar. To alleviate this problem, we add a constraint on the node representations to control the variance of all nodes. Specifically, we concatenate all the node embeddings into a matrix noted as $E \in \mathbb{R}^{N \times d}$, where $N$ is the number of nodes of all types and $d$ is the dimension of the node embedding. Then we calculate the per-dimension variance and get a vector $v \in \mathbb{R}^d$, where each element represents the variance of the corresponding dimension in $E$. Finally, the node variance loss is defined as the average of the elements in the variance vector followed by a sigmoid function:
$$\mathcal{L}_{var} = \mathrm{sigmoid}\Big(\frac{1}{d} \sum_{j=1}^{d} v_j\Big). \quad (12)$$
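A compact sketch of the overall objective follows, assuming the variance term is combined exactly as described above; the trade-off value and function names are illustrative.

```python
import torch
import torch.nn.functional as F

def total_loss(logits, labels, node_embeddings, lam=0.1):
    """Classification loss plus node variance constraint (sketch of Eqs. (11)-(12)).

    logits: (N, 4) predictions for the four PAM classes
    labels: (N,) ground-truth class indices
    node_embeddings: (num_nodes, d) embeddings of all nodes after message passing
    lam: trade-off parameter (value here is illustrative)
    """
    cls_loss = F.cross_entropy(logits, labels)    # Eq. (11)
    var = node_embeddings.var(dim=0)              # per-dimension variance vector
    var_loss = torch.sigmoid(var.mean())          # Eq. (12), following the textual description
    return cls_loss + lam * var_loss
```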

4 Experiments

4.1 Dataset

We evaluate the performance of our method on the StudentLife dataset [66], which was collected from 48 Dartmouth students over a 10-week term. It contains sensor data, Ecological Momentary Assessment (EMA) data (e.g., stress), pre- and post-survey responses (e.g., the PHQ-9 depression scale), and educational data (GPA). During the 10 weeks, students carry their phones throughout the day. Data streams from multiple sensors, including accelerometer, microphone, GPS, Bluetooth, light sensor, phone charge, phone lock, and WiFi, are collected in real time by the mobile phone. Besides, students are asked to respond to various EMA questions and surveys, which are provided by psychologists to measure their mental health status. Educational performance data, such as grades, are also collected.
In our experiment, we use data streams collected by three representative sensors (i.e., accelerometer, microphone, WiFi) as the input, since these three kinds of sensor data contain the most useful information to reflect the individual behaviors. The reason for collecting location information with WiFi instead of GPS is that most students’ activities are in an indoor environment, where the college’s WiFi AP deployment is more effective to accurately infer the location information than the GPS.
For the ground-truth annotations of the multi-source data streams, we use the photographic affect meter (PAM) [55] values in the EMA data. The PAM value is a score between 1 and 16 that is aligned with the Positive and Negative Affect Schedule (PANAS) [68] and reflects the instantaneous mental health status of users. The annotations are collected by a mobile application that captures users’ feelings according to their preference for specific photos. To keep consistent with the conceptualization of PANAS, which ranges from low pleasure and low arousal to high pleasure and high arousal, the PAM score is further divided into four quadrants: negative valence and low arousal with scores 1 to 4, negative valence and high arousal with scores 5 to 8, positive valence and low arousal with scores 9 to 12, and positive valence and high arousal with scores 13 to 16. Following [55], we map the PAM value into the above four classes.
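The mapping from PAM scores to the four PANAS quadrants can be implemented as a simple lookup, as sketched below (class indices are illustrative).

```python
def pam_to_class(pam_score: int) -> int:
    """Map a PAM score (1-16) to one of the four PANAS quadrants used as labels."""
    assert 1 <= pam_score <= 16
    if pam_score <= 4:
        return 0   # negative valence, low arousal
    if pam_score <= 8:
        return 1   # negative valence, high arousal
    if pam_score <= 12:
        return 2   # positive valence, low arousal
    return 3       # positive valence, high arousal
```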
Finally, we use the data streams of 30 students who have valid PAM annotations. Specifically, we get 912 samples in total; each sample consists of 3-day multi-modal data streams collected by the accelerometer, microphone, and WiFi sensors. For each sample, the instantaneous PAM label of the last day is regarded as the ground-truth label of the whole data stream. The sample number for each student and the sample distribution over the four classes are shown in Figure 3. Thus, the mental health status prediction task in our experiment is practically a four-class classification problem. For training and testing, we split our dataset into 10 splits and build 10 tasks on them. Each task takes nine splits for training and the remaining one for testing. We also report the average results over all 10 tasks.
Fig. 3. Dataset information. (a) The number of samples for each student. (b) The distribution over four classes.
Discussion on data selection. In this article, our aim is to model personal behavior and predict health status based on multi-source sensor data collected from people’s daily lives. To the best of our knowledge, StudentLife is the only public dataset that satisfies our requirements. On the one hand, the data should be collected from both healthy and unhealthy people. On the other hand, the sensor data we take advantage of should be continuous and long term; i.e., the data are recorded as long as the people carry their wearable devices, such as mobile phones and smart wristbands. Some work has been done on this task. For example, [11] uses GPS data to predict depression with an SVM classifier. [9] predicts depression based on GPS, accelerometer, and light sensor data from smartphones. However, none of these works release their data. Although some medical datasets, such as MIMIC-III [33] and ADNI [57], have been widely used to predict diseases [13] or other medical-related events [20], they do not satisfy the needs of our task. First, these medical datasets pay more attention to disease analysis, where the studied people are mainly patients. By contrast, our work focuses on personal health, where daily behaviors are considered for both healthy and unhealthy people. Second, these medical datasets are discretely collected from patients during the hospital treatment process with professional medical equipment, such as medical imaging, while the long-term sensor data we use are continuously collected from daily life.

4.2 Implementation Details

As introduced in Section 3, to create the behavior graph that reflects the structure information contained in the multi-source data streams, we first need to detect the behavior-related concepts contained in the data streams. Here, three backbone models proposed in [39] are adopted to get the middle-level semantic concepts from raw sensor data. For the accelerometer, a decision tree model for activity classification is used with features extracted from the accelerometer stream to infer the concept class (i.e., stationary, walking, running, and unknown). For the microphone, the audio data are classified into concept classes (i.e., silence, voice, noise, and other) with an HMM model. As for WiFi, students’ WiFi scan logs are first recorded, and then the location concept classes, such as in[dana-library], are inferred according to the WiFi AP deployment information, which results in 9,037 classes. The location classes follow a long-tail distribution with many classes appearing only a few times, and hence we choose the top 100 most frequent classes, which cover 93% of the location data.
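The long-tail handling of location classes amounts to keeping the most frequent classes, as in the following sketch; the function name and interface are illustrative.

```python
from collections import Counter

def top_k_locations(location_stream, k=100):
    """Keep the k most frequent location classes; the long tail is dropped.

    location_stream: list of location class names inferred from WiFi scan logs.
    Returns the kept classes and the fraction of records they cover.
    """
    counts = Counter(location_stream)
    kept = [cls for cls, _ in counts.most_common(k)]
    coverage = sum(counts[c] for c in kept) / max(len(location_stream), 1)
    return kept, coverage
```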
After obtaining the three kinds of detected concepts, we cut the sequences into days and build the local context graph and the global temporal graph to predict mental health status with the method illustrated in Section 3. We use the metrics of accuracy, precision, recall, and F1-score to evaluate our model. Accuracy is the ratio of correctly predicted samples to total predicted samples. Precision, recall, and F1-score are first calculated in each class and then weighted by the sample number of each class.
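With scikit-learn, the weighted metrics described above can be computed as in this short sketch.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(y_true, y_pred):
    """Accuracy plus precision/recall/F1 weighted by the per-class sample counts."""
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average='weighted', zero_division=0)
    return {'accuracy': acc, 'precision': prec, 'recall': rec, 'f1': f1}
```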
Discussion on data synchronization. In existing health-related systems and methods for analyzing wearable sensor data, such as the risk situation recognition system [70], synchronization of different sensors is a very important issue. Specifically, such a system usually contains several devices that collect different kinds of sensor data as well as a smartphone that receives the data from the sensors. A data synchronization algorithm is necessary since the sensors are on different devices and there exist time differences between the sensor data generated by the sensors and those received by the smartphone. As for our work, the sensors we use are all embedded in the same smartphone [66]; they naturally share a common reference time and do not suffer from time errors when sending and receiving data, and thus synchronization is not essential.

4.3 Compared Methods

Since there are no previous works on PAM prediction on the same dataset, we compare our method with three popularly used conventional machine learning algorithms (i.e., RF [63], KNN [38], and SVM [8]) and two deep learning algorithms (i.e., DNN [40] and LSTM [30]). To apply these baselines to the multi-source data streams collected from wearable devices, we compute a behavior feature by extracting a 108-dimensional feature vector that represents the duration of the three kinds of concept classes in 1 day. Specifically, the activity concept takes four elements of the vector, which represent the duration of stationary, walking, running, and unknown in 1 day. The audio concept takes four elements, which represent the duration of silence, voice, noise, and other in 1 day. The location concept takes 100 elements, which represent the duration of time the student stays at each of the 100 locations in 1 day (a sketch of this feature construction is given after the list below). The behavior feature contains the principal information of the individual’s life, such as where the person went that day and how long he or she communicated with other people. We also compare our method with a recent GNN-based method (i.e., HAN [67]) that can capture the structural information of a graph with different kinds of nodes. Details of the compared baselines and the variants of our method are illustrated as follows.
RF [63]: This method uses Random Forest to do the classification. Specifically, we concatenate the behavior features of 3 days to get a 324-dimensional feature. Then we input it to the Random Forest.
KNN [38]: This method uses K-nearest neighbor. We first get the 324-dimensional feature of continuous 3 days as in RF. Then we train a K-nearest neighbor based on the feature.
SVM [8]: This method trains SVM to do the classification. Features are obtained as in RF, and then an SVM is trained to predict the health status.
DNN [40]: This method uses a two-layer deep neural network with the input feature computed as in RF. Each layer is fully connected and the hidden size is 50, which is determined by cross-validation.
LSTM [30]: This method uses an LSTM with the hidden size of 100 to capture the temporal information of sequences. Specifically, we transform the 3-day data into a sequence with the length of 3. Each element in the sequence is a 108-dimensional behavior feature. Then the sequence is input into LSTM, and the hidden state at the last step is used to predict the health status.
HAN [67]: This method uses the heterogeneous attention network [67] instead of our local context graph modeling method to learn the node embedding and graph representation. Specifically, we get the meta-path-based neighbors for each kind of node according to our local context graph. Then the node-level attention and semantic-level attention are performed as HAN [67] to get the local context graph representation, which is finally input into the self-attention network to predict the health status.
Ours\he: This variant of the proposed method omits the heterogeneous message passing in Section 3.4.1 while keeping the homogeneous message passing in the heterogeneous graph neural network.
Ours\ho: This variant omits the homogeneous message passing in Section 3.4.1 while keeping the heterogeneous message passing in the heterogeneous graph neural network. It is used to compare with Ours\he to illustrate the impact of the homogeneous and heterogeneous message passing.
Ours\s: In this variant, we omit the semantic representation learned by Equation (4) and only use the structural representation of the local context graph. The structural representation is then input into the self-attention network to get the global temporal graph representation.
Ours\t: In this variant, we omit the structural representation learned by Equation (6) and only use the semantic representation of the local context graph. The semantic representation is then input into the self-attention network to get the global temporal graph representation.
Ours\g: In this variant, representations of all local context graphs are directly added to get the final global temporal graph representation without using the self-attention network.
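For reference, the 108-dimensional behavior feature used by the RF, KNN, SVM, DNN, and LSTM baselines above can be sketched as follows; it assumes the per-day concept layout from the graph-building sketch in Section 3.3.1, and the class vocabularies are those listed in Section 4.2.

```python
import numpy as np

# Illustrative vocabularies; the 100 location classes come from the top-100 selection in Section 4.2.
ACTIVITY_CLASSES = ['stationary', 'walking', 'running', 'unknown']
AUDIO_CLASSES = ['silence', 'voice', 'noise', 'other']

def daily_behavior_feature(day_concepts, location_classes):
    """Build the 108-dimensional duration feature (4 activity + 4 audio + 100 location)."""
    feat = np.zeros(4 + 4 + len(location_classes))
    index = {('activity', c): i for i, c in enumerate(ACTIVITY_CLASSES)}
    index.update({('audio', c): 4 + i for i, c in enumerate(AUDIO_CLASSES)})
    index.update({('location', c): 8 + i for i, c in enumerate(location_classes)})
    for k, seq in day_concepts.items():              # same layout as the Section 3.3.1 sketch
        for cls, start, end in seq:
            if (k, cls) in index:
                feat[index[(k, cls)]] += end - start  # accumulate duration within the day
    return feat
```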

4.4 Result Analysis

4.4.1 Performance Comparison.

Here we show both the results on 10 tasks in Table 2 and Table 3 and the average results in Figure 4. It can be seen that our model performs better than baselines on all metrics, and most variants of the proposed method also have good results.
Fig. 4. The average results on the 10 tasks.
Table 2. PAM Prediction Results
Table 3. Ablation Study Results
Compared with the traditional machine learning methods, all the deep-learning-based methods perform better. Baseline-LSTM has better results than Baseline-DNN since it takes the temporal information into consideration. HAN updates the node embeddings and graph representation in a meta-path way with several kinds of adjacency matrices. However, this method does not consider the direct connections between heterogeneous nodes, which ignores the semantic interaction between different kinds of nodes, and thus it performs worse than our method.
As for the ablation study, it can be concluded that each module in our framework plays a significant role in the performance improvement. By comparing the full model with Ours\he as well as Ours\ho, we note that the full model performs better than both variants, which supports the assumption that the message passing module has positive effects on the node embedding learning, and that the homogeneous edges and heterogeneous edges succeed in building the homogeneous and heterogeneous node structures. When comparing the homogeneous message passing variant Ours\he with the heterogeneous message passing variant Ours\ho, it can be seen that the heterogeneous message passing performs better, which may be because the heterogeneous edges help learn more comprehensive embeddings by getting additional information from other types of nodes.
As for the comparison between Ours\t and Ours\s, which use semantic representations and structural representations, respectively, it can be noted that both kinds of representations benefit the prediction process, meaning that they reveal different aspects of the graph information. The semantic representations perform better than the structural representations, which may be because the semantic representation provides more global information about the graph. When we remove the self-attention-based global temporal modeling and simply add the representations of the local context graphs in Ours\g, the performance drops a little, meaning that not only does the local context information have a positive effect on predicting individuals’ health status, but the long-term temporal structure of behaviors also makes a difference.
In Table 2, Table 3, and Figure 4, it is worth noting that the results of all methods are below 50%, which demonstrates that predicting mental health status based on personal behaviors in daily life is an extremely challenging task, especially with limited samples. However, the relative improvement of our method’s average accuracy over the second-best method, HAN, is about 5%, as shown in Figure 4(a), which still shows the advantage of the proposed method. We believe that our model will achieve better performance on larger-scale datasets.

4.4.2 Parameter Analysis.

Here we investigate the influence of two important hyper-parameters m and dp, which represent the number of layers in the local context graph modeling and the dimension of the final local context graph representation, respectively. We vary m from 1 to 5 and keep other settings fixed; the results on PAM prediction are shown in Figure 5. It can be noted that the performance improves at first with the increase of m. However, when the layer number keeps increasing, the performance drops, because too many layers make the node embeddings less discriminative. As for dp, we vary it from 16 to 256 while keeping other settings fixed. The results are shown in Figure 5. We can see that the best performance is achieved at 64, since a low dimension makes it hard to capture useful information, while a high dimension is difficult to train with limited instances.
Fig. 5. Parameter analysis of m, dp.

4.4.3 Visualization.

Here we analyze the attention in learning representations from local context graphs and the global temporal graph to figure out the factors that influence mental health. Figure 6 shows an example of an individual’s 3-day data streams. The PAM is predicted with the global representation, which is learned based on the representations of three local context graphs.
Fig. 6. Visualization of local context graphs and the global temporal graph.
The attention score between two different local context graphs is computed with Equation (9). We show the attention score between any two local graphs by the darkness of the corresponding line. The attention of each graph is represented by the darkness of the text box. We can see that for each local context graph, the attention to itself tends to play a major role, though each one pays attention to all the others. Besides, it can be seen that the third local context graph has a strong influence on all three.
We further visualize the concept sequences of the third day, which are used to build the local context graph of that day. In the concept sequences, each text box represents a concept, and different concept sequences are shown in different colors. The importance of each concept is calculated with the attention mechanism in Equation (5) and represented by the darkness of the text box. It is noted that the concepts of stationary, silence, and classroom play a major role in learning the representation of the local context graph. Homogeneous edges that represent behavior transfers are shown as arrows, while heterogeneous edges that represent behavior co-occurrences are shown as dotted lines. The importance of an edge is also indicated by its darkness. We notice that, in general, heterogeneous edges receive more attention than homogeneous edges although their occurrences are few, which demonstrates the importance of combining different concepts and validates that our attention mechanism successfully finds useful patterns in the multi-source data streams. Besides, edges that involve nodes of the location type tend to get more attention than those involving only nodes of the activity and audio types, because locations reveal much information about daily life.

4.5 Application on Grade Prediction

To better illustrate the effectiveness of our model for learning representations from behavior-related data streams, we propose to apply our model on the grade prediction task. The grade annotation is the GPA, which indicates a student’s overall long-term academic performance in a range of 0 to 4.
First, we take our model, which is well trained on the health status prediction task, to extract the global representation for each student’s data stream and use KNN to do the grade prediction. Then we compare our model with a baseline, which takes the 108-dimensional feature introduced in Section 4.3 as the representation for each day and adds the daily features to get the final feature. The same KNN is used to do the grade prediction.
We use the mean absolute error (MAE), the coefficient of determination (R2), and the Pearson correlation as evaluation metrics for grade prediction. We adopt the leave-one-out protocol to evaluate the performance. The average results are shown in Table 4.
Method | MAE | R2 | Pearson
Hand-crafted feature + KNN | 0.296 | 0.10 | 0.32
Graph representation + KNN | 0.195 | 0.21 | 0.51
Table 4. Grade Prediction Results
As shown, our model outperforms the baseline on three metrics, which demonstrates that our model is effective in learning the structure-aware representation of the individual’s long-term behavior. Moreover, our model has good generalization ability and can be used to extract global behavior-related features in different tasks without model fine-tuning.
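A sketch of the leave-one-out evaluation for grade prediction follows; the use of a KNN regressor and the neighbor count are assumptions, since the text only states that the same KNN is used on both feature types.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from scipy.stats import pearsonr

def loo_grade_prediction(features, gpas, k=3):
    """Leave-one-out KNN regression on per-student behavior representations.

    features: (num_students, dim) global representations (or summed hand-crafted features)
    gpas: (num_students,) GPA values in [0, 4]
    k: number of neighbors (illustrative choice)
    """
    preds = np.zeros(len(gpas))
    for i in range(len(gpas)):
        mask = np.ones(len(gpas), dtype=bool)
        mask[i] = False                      # hold out student i
        knn = KNeighborsRegressor(n_neighbors=k).fit(features[mask], gpas[mask])
        preds[i] = knn.predict(features[i:i + 1])[0]
    return {
        'MAE': mean_absolute_error(gpas, preds),
        'R2': r2_score(gpas, preds),
        'Pearson': pearsonr(gpas, preds)[0],
    }
```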

5 Conclusion

In this article, we propose a local-global graph to model personal behavior and predict daily mental health status based on multi-source wearable sensor data. The graph contains multiple local context sub-graphs and a global temporal sub-graph to capture the short-term context information and long-term temporal dependencies of individual behaviors, respectively. We learn the semantic representation and structural representation of each local context graph with a heterogeneous graph neural network. A self-attention network is designed to learn the representation of the global temporal graph, which is finally used to predict the health status. We perform experiments on the public StudentLife dataset and compare our method with popularly used machine learning and deep learning methods. Our method outperforms all compared methods, which validates its effectiveness. In future work, we will integrate more kinds of data streams to improve the local-global individual graph and try to apply our method to larger-scale multi-source sensor datasets for health prediction.

References

[1]
World Health Organization. 2009. https://www.who.int/mediacentre/multimedia/podcasts/2009/lifestyle-interventions-20090109/en/.
[2]
2017. http://www.sdwsnews.com.cn/a/jiankangtuku/2017/1108/15306.html.
[3]
World Health Organization. 2020. https://www.who.int/.
[4]
U. Rajendra Acharya, Shu Lih Oh, Yuki Hagiwara, Jen Hong Tan, and Hojjat Adeli. 2018. Deep convolutional neural network for the automated detection and diagnosis of seizure using EEG signals. Computers in Biology and Medicine 100 (2018), 270–278.
[5]
Joost Asselbergs, Jeroen Ruwaard, Michal Ejdys, Niels Schrader, Marit Sijbrandij, and Heleen Riper. 2016. Mobile phone-based unobtrusive ecological momentary assessment of day-to-day mood: An explorative study. Journal of Medical Internet Research 18, 3 (2016), e72.
[6]
James Atwood and Don Towsley. 2016. Diffusion-convolutional neural networks. In Advances in Neural Information Processing Systems. 1993–2001.
[7]
András Bánhalmi, János Borbás, Márta Fidrich, Vilmos Bilicki, Zoltán Gingl, and László Rudas. 2018. Analysis of a pulse rate variability measurement using a smartphone camera. Journal of Healthcare Engineering 2018 (2018), 1–15.
[8]
Christopher J. C. Burges. 1998. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2, 2 (1998), 121–167.
[9]
Michelle Nicole Burns, Mark Begale, Jennifer Duffecy, Darren Gergle, Chris J. Karr, Emily Giangrande, and David C. Mohr. 2011. Harnessing context sensing to develop a mobile intervention for depression. Journal of Medical Internet Research 13, 3 (2011), e55.
[10]
Hanshu Cai, Jiashuo Han, Yunfei Chen, Xiaocong Sha, Ziyang Wang, Bin Hu, Jing Yang, Lei Feng, Zhijie Ding, Yiqiang Chen, et al. 2018. A pervasive approach to EEG-based depression detection. Complexity 2018 (2018), 1–13.
[11]
Luca Canzian and Mirco Musolesi. 2015. Trajectories of depression: Unobtrusive monitoring of depressive states by means of smartphone mobility traces analysis. In Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing. 1293–1304.
[12]
Shaosheng Cao, Wei Lu, and Qiongkai Xu. 2016. Deep neural networks for learning graph representations. In 30th AAAI Conference on Artificial Intelligence.
[13]
Chao Che, Cao Xiao, Jian Liang, Bo Jin, Jiayu Zho, and Fei Wang. 2017. An RNN architecture with dynamic temporal matching for personalized predictions of Parkinson’s disease. In Proceedings of the 2017 SIAM International Conference on Data Mining. SIAM, 198–206.
[14]
Hongxu Chen, Hongzhi Yin, Weiqing Wang, Hao Wang, Quoc Viet Hung Nguyen, and Xue Li. 2018. PME: Projected metric embedding on heterogeneous networks for link prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1177–1186.
[15]
Yu Cheng, Fei Wang, Ping Zhang, and Jianying Hu. 2016. Risk prediction with electronic health records: A deep learning approach. In Proceedings of the 2016 SIAM International Conference on Data Mining. SIAM, 432–440.
[16]
Sergi G. Costafreda, Carlton Chu, John Ashburner, and Cynthia H. Y. Fu. 2009. Prognostic and diagnostic potential of the structural neuroanatomy of depression. PloS One 4, 7 (2009), e6353.
[17]
Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems. 3844–3852.
[18]
Yuxiao Dong, Nitesh V. Chawla, and Ananthram Swami. 2017. metapath2vec: Scalable representation learning for heterogeneous networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 135–144.
[19]
Cynthia H. Y. Fu, Janaina Mourao-Miranda, Sergi G. Costafreda, Akash Khanna, Andre F. Marquand, Steve C.R. Williams, and Michael J. Brammer. 2008. Pattern classification of sad facial processing: Toward the development of neurobiological markers in depression. Biological Psychiatry 63, 7 (2008), 656–662.
[20]
Joseph Futoma, Jonathan Morris, and Joseph Lucas. 2015. A comparison of models for predicting early hospital readmissions. Journal of Biomedical Informatics 56 (2015), 229–238.
[21]
Claudio Gallicchio and Alessio Micheli. 2010. Graph echo state networks. In The 2010 International Joint Conference on Neural Networks (IJCNN’10). IEEE, 1–8.
[22]
Junyu Gao, Tianzhu Zhang, and Changsheng Xu. 2018. Watch, think and attend: End-to-end video classification via dynamic knowledge evolution modeling. In Proceedings of the 26th ACM International Conference on Multimedia. ACM, 690–699.
[23]
Junyu Gao, Tianzhu Zhang, and Changsheng Xu. 2019. Graph convolutional tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4649–4659.
[24]
Junyu Gao, Tianzhu Zhang, and Changsheng Xu. 2019. I know the relationships: Zero-shot action recognition via two-stream graph convolutional networks and knowledge graphs. In Proceedings of the AAAI Conference on Artificial Intelligence 33 (2019), 8303–8311.
[25]
Junyu Gao, Tianzhu Zhang, and Changsheng Xu. 2020. Learning to model relationships for zero-shot video classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 99 (2020), 1–1.
[26]
Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. 2017. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR.org, 1263–1272.
[27]
Mayank Goel, Elliot Saba, Maia Stiber, Eric Whitmire, Josh Fromm, Eric C. Larson, Gaetano Borriello, and Shwetak N. Patel. 2016. Spirocall: Measuring lung function over a phone call. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. 5675–5685.
[28]
Tim Hahn, Andre F. Marquand, Ann-Christine Ehlis, Thomas Dresler, Sarah Kittel-Schneider, Tomasz A. Jarczok, Klaus-Peter Lesch, Peter M. Jakob, Janaina Mourao-Miranda, Michael J. Brammer, et al. 2011. Integrating neurobiological markers of depression. Archives of General Psychiatry 68, 4 (2011), 361–368.
[29]
Mikael Henaff, Joan Bruna, and Yann LeCun. 2015. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163 (2015).
[30]
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[31]
Yi Huang, Xiaoshan Yang, Junyu Gao, Jitao Sang, and Changsheng Xu. 2020. Knowledge-driven egocentric multimodal activity recognition. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 16, 4 (2020), 1–133.
[32]
Ashesh Jain, Amir R. Zamir, Silvio Savarese, and Ashutosh Saxena. 2016. Structural-RNN: Deep learning on spatio-temporal graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5308–5317.
[33]
Alistair E. W. Johnson, Tom J. Pollard, Lu Shen, Li-wei H. Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data 3, 1 (2016), 1–9.
[34]
Hye Jin Kam and Ha Young Kim. 2017. Learning representations for the early detection of sepsis with deep neural networks. Computers in Biology and Medicine 89 (2017), 248–255.
[35]
Abel N. Kho, M. Geoffrey Hayes, Laura Rasmussen-Torvik, Jennifer A. Pacheco, William K. Thompson, Loren L. Armstrong, Joshua C. Denny, Peggy L. Peissig, Aaron W. Miller, Wei-Qi Wei, et al. 2012. Use of diverse electronic medical record systems to identify genetic risk for type 2 diabetes within a genome-wide association study. Journal of the American Medical Informatics Association 19, 2 (2012), 212–218.
[36]
Thomas N. Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
[37]
Nicole Koenig, Andrea Seeck, Jens Eckstein, Andreas Mainka, Thomas Huebner, Andreas Voss, and Stefan Weber. 2016. Validation of a new heart rate measurement algorithm for fingertip recording of video signals with smartphones. Telemedicine and e-Health 22, 8 (2016), 631–636.
[38]
Jorma Laaksonen and Erkki Oja. 1996. Classification with learning k-nearest neighbors. In Proceedings of International Conference on Neural Networks (ICNN’96).
[39]
Nicholas D. Lane, Mashfiqui Mohammod, Mu Lin, Xiaochao Yang, Hong Lu, Shahid Ali, Afsaneh Doryab, Ethan Berke, Tanzeem Choudhury, and Andrew Campbell. 2011. BeWell: A smartphone application to monitor, model and promote wellbeing. In 5th International ICST Conference on Pervasive Computing Technologies for Healthcare. 23–26.
[40]
Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature 521, 7553 (2015), 436–444.
[41]
Honggui Li and Maria Trocan. 2019. Deep learning of smartphone sensor data for personal health assistance. Microelectronics Journal 88 (2019), 164–172.
[42]
Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. 2017. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. arXiv preprint arXiv:1707.01926 (2017).
[43]
Inês P. Machado, A. Luisa Gomes, Hugo Gamboa, Vítor Paixão, and Rui M. Costa. 2015. Human activity data discovery from triaxial accelerometer sensor: Non-supervised learning sensitivity to feature extraction parametrization. Information Processing & Management 51, 2 (2015), 204–214.
[44]
Alvin Rajkomar, Eyal Oren, Kai Chen, et al. 2018. Scalable and accurate deep learning with electronic health records. npj Digital Medicine 1, 1 (2018), 18.
[45]
Callie L. McGrath, Mary E. Kelley, Paul E. Holtzheimer, Boadie W. Dunlop, W. Edward Craighead, Alexandre R. Franco, R. Cameron Craddock, and Helen S. Mayberg. 2013. Toward a neuroimaging treatment selection biomarker for major depressive disorder. JAMA Psychiatry 70, 8 (2013), 821–829.
[46]
Jun-Ki Min, Afsaneh Doryab, Jason Wiese, Shahriyar Amini, John Zimmerman, and Jason I. Hong. 2014. Toss’n’turn: Smartphone as sleep and sleep quality detector. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 477–486.
[47]
Weiqing Min, Bing-Kun Bao, Shuhuan Mei, Yaohui Zhu, Yong Rui, and Shuqiang Jiang. 2017. You are what you eat: Exploring rich recipe information for cross-region food analysis. IEEE Transactions on Multimedia 20, 4 (2017), 950–964.
[48]
Benson Mwangi, Keith Matthews, and J. Douglas Steele. 2012. Prediction of illness severity in patients with major depression using structural MR brain scans. Journal of Magnetic Resonance Imaging 35, 1 (2012), 64–71.
[49]
Nitish Nag, Vaibhav Pandey, Preston J. Putzel, Hari Bhimaraju, Srikanth Krishnan, and Ramesh Jain. 2018. Cross-modal health state estimation. In Proceedings of the 26th ACM International Conference on Multimedia. 1993–2002.
[50]
Rajalakshmi Nandakumar, Shyamnath Gollakota, and Nathaniel Watson. 2015. Contactless sleep apnea detection on smartphones. In Proceedings of the 13th Annual International Conference on Mobile Systems, Applications, and Services. 45–57.
[51]
Vincenzo Natale, Maciek Drejak, Alex Erbacci, Lorenzo Tonetti, Marco Fabbri, and Monica Martoni. 2012. Monitoring sleep with a smartphone accelerometer. Sleep and Biological Rhythms 10, 4 (2012), 287–292.
[52]
Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. 2016. Learning convolutional neural networks for graphs. In International Conference on Machine Learning. 2014–2023.
[53]
Peggy L. Peissig, Luke V. Rasmussen, Richard L. Berg, James G. Linneman, Catherine A. McCarty, Carol Waudby, Lin Chen, Joshua C. Denny, Russell A. Wilke, Jyotishman Pathak, et al. 2012. Importance of multi-modal approaches to effectively identify cataract cases from electronic health records. Journal of the American Medical Informatics Association 19, 2 (2012), 225–234.
[54]
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP’14). 1532–1543.
[55]
John P. Pollak, Phil Adams, and Geri Gay. 2011. PAM: A photographic affect meter for frequent, in situ measurement of affect. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 725–734.
[56]
Fan Qi, Xiaoshan Yang, and Changsheng Xu. 2020. Emotion knowledge driven video highlight detection. IEEE Transactions on Multimedia (2020), 1–1.
[57]
Perry G. Ridge, Mark E. Wadsworth, Justin B. Miller, Andrew J. Saykin, Robert C. Green, John S. K. Kauwe, Alzheimer’s Disease Neuroimaging Initiative, et al. 2018. Assembly of 809 whole mitochondrial genomes with clinical, imaging, and fluid biomarker phenotyping. Alzheimer’s & Dementia 14, 4 (2018), 514–519.
[58]
Maria J. Rosa, Liana Portugal, Tim Hahn, Andreas J. Fallgatter, Marta I. Garrido, John Shawe-Taylor, and Janaina Mourao-Miranda. 2015. Sparse network-based models for patient classification using fMRI. Neuroimage 105 (2015), 493–506.
[59]
Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. 2008. The graph neural network model. IEEE Transactions on Neural Networks 20, 1 (2008), 61–80.
[60]
Youngjoo Seo, Michaël Defferrard, Pierre Vandergheynst, and Xavier Bresson. 2018. Structured sequence modeling with graph convolutional recurrent networks. In International Conference on Neural Information Processing. Springer, 362–373.
[61]
Matthew Stafford, Feng Lin, and Wenyao Xu. 2016. Flappy breath: A smartphone-based breath exergame. In 2016 IEEE 1st International Conference on Connected Health: Applications, Systems and Engineering Technologies (CHASE’16). IEEE, 332–333.
[62]
Yizhou Sun and Jiawei Han. 2013. Mining heterogeneous information networks: A structural analysis approach. ACM SIGKDD Explorations Newsletter 14, 2 (2013), 20–28.
[63]
Tin Kam Ho. 1995. Random decision forests. In Proceedings of 3rd International Conference on Document Analysis and Recognition, Vol. 1. 278–282.
[64]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998–6008.
[65]
Daixin Wang, Peng Cui, and Wenwu Zhu. 2016. Structural deep network embedding. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1225–1234.
[66]
Rui Wang, Fanglin Chen, Zhenyu Chen, Tianxing Li, Gabriella Harari, Stefanie Tignor, Xia Zhou, Dror Ben-Zeev, and Andrew T. Campbell. 2014. StudentLife: Assessing mental health, academic performance and behavioral trends of college students using smartphones. In Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing. 3–14.
[67]
Xiao Wang, Houye Ji, Chuan Shi, Bai Wang, Yanfang Ye, Peng Cui, and Philip S. Yu. 2019. Heterogeneous graph attention network. In The World Wide Web Conference. 2022–2032.
[68]
David Watson, Lee Anna Clark, and Auke Tellegen. 1988. Development and validation of brief measures of positive and negative affect: The PANAS scales. Journal of Personality and Social Psychology 54, 6 (1988), 1063.
[69]
Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S. Yu Philip. 2020. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems 32, 1 (2020), 4–24.
[70]
Thinhinane Yebda, Jenny Benois-Pineau, Helene Amieva, and Benjamin Frolicher. 2019. Multi-sensing of fragile persons for risk situation detection: Devices, methods, challenges. In 2019 International Conference on Content-Based Multimedia Indexing (CBMI’19). IEEE, 1–6.
[71]
Weiming Zhang, Yi Huang, Wanting Yu, Xiaoshan Yang, Wei Wang, and Jitao Sang. 2019. Multimodal attribute and feature embedding for activity recognition. In Proceedings of the ACM Multimedia Asia. 1–7.
[72]
Ziwei Zhang, Peng Cui, and Wenwu Zhu. 2020. Deep learning on graphs: A survey. IEEE Transactions on Knowledge and Data Engineering (2020), 1–1. DOI:https://doi.org/10.1109/TKDE.2020.2981333

Cited By

  • (2024) BEANet: An Energy-efficient BLE Solution for High-capacity Equipment Area Network. ACM Transactions on Sensor Networks 20, 3, 1–23. DOI: 10.1145/3641280
  • (2024) Temporal Graph Attention Model for Enhanced Clinical Risk Prediction. 2024 IEEE International Students' Conference on Electrical, Electronics and Computer Science (SCEECS), 1–7. DOI: 10.1109/SCEECS61402.2024.10481970
  • (2023) COFlood: Concurrent Opportunistic Flooding in Asynchronous Duty Cycle Networks. ACM Transactions on Sensor Networks 19, 3, 1–21. DOI: 10.1145/3570163
  • (2023) Analyzing the contribution of different passively collected data to predict Stress and Depression. 2023 11th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW), 1–4. DOI: 10.1109/ACIIW59127.2023.10388089
  • (2022) Underwater Sensor Multi-Parameter Scheduling for Heterogenous Computing Nodes. ACM Transactions on Sensor Networks 18, 3, 1–23. DOI: 10.1145/3476513

    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 17, Issue 4
    November 2021
    529 pages
    ISSN: 1551-6857
    EISSN: 1551-6865
    DOI: 10.1145/3492437

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 12 November 2021
    Accepted: 01 March 2021
    Revised: 01 January 2021
    Received: 01 August 2020
    Published in TOMM Volume 17, Issue 4

    Author Tags

    1. Health status prediction
    2. graph neural networks
    3. individual behavior

    Qualifiers

    • Research-article
    • Refereed

    Funding Sources

    • National Key Research and Development Program of China
    • National Natural Science Foundation of China
