
A Computation Model to Estimate Interaction Intensity through Non-Verbal Behavioral Cues: A Case Study of Intimate Couples under the Impact of Acute Alcohol Consumption

Published: 18 September 2024

Abstract

This work introduced a novel analysis method to estimate interaction intensity, i.e., the level of positivity/negativity of an interaction, for intimate couples (married and heterosexual) under the impact of alcohol, which strongly influences behavioral health. Non-verbal behaviors are critical in interpersonal interactions. However, whether computer vision-detected non-verbal behaviors can effectively estimate the interaction intensity of intimate couples remains unexplored. In this work, we proposed novel measurements and investigated their feasibility for estimating interaction intensities through machine learning regression models. Analyses were conducted on a conflict-resolution conversation video dataset of intimate couples before and after acute alcohol consumption. Results showed that the estimation error was lowest in the no-alcohol state but increased significantly when a model trained using no-alcohol data was applied to after-alcohol data, indicating that alcohol altered the interaction data in the feature space. While training a model using rich after-alcohol data is the ideal way to address this performance decrease, data collection in such a risky state is challenging in real life. Thus, we proposed a new State-Induced Domain Adaptation (SIDA) framework, which improves estimation performance using only a small after-alcohol training dataset, pointing to a future direction for addressing data scarcity issues.

1 Introduction

Interpersonal interactions are conventionally analyzed through manual observations (e.g., manual annotation of research videos) in behavioral research and healthcare studies. The manual research data extraction process requires much human labor and is limited by the availability of annotation experts (e.g., human video coders). Therefore, developing accurate and automatic analysis has great potential to address the current barrier in research as well as rapid assessment and individualized treatment recommendations in clinical settings.
This work focuses on the quantification of an interpersonal interaction session’s positivity and negativity in the context of conflict-resolution interactions in intimate couples before and after acute alcohol consumption. Here, positive interaction refers to prosocial or relationship-enhancing behaviors directed toward the partner, including acceptance, relationship-enhancing attributions, self-disclosure, and humor, while negative interaction refers to hostile or relationship-damaging behaviors directed toward the partner, such as psychological abuse, distress-maintaining attributions, withdrawal, and dysphoric affect [30]. Acute alcohol consumption may change human behaviors [57], have a significant impact on marital/relationship functioning [47], and play an important role in people’s health [46]. Traditionally, manual observations are applied to understand the acute effects of alcohol on dyadic behaviors [18, 57]. The overall level of positivity of an interaction session, named positive interaction intensity in this article for simplicity, can be quantified as the ratio between the number of positive behavior occurrences and the total number of behavior occurrences (positive, negative, neutral, and “other”). Similarly, negative interaction intensity can be quantified as the ratio between the number of negative behavior occurrences and the total occurrences.
While automatically detecting all types of behavior occurrences to completely replicate human annotation using technology is still unrealistic, state-of-the-art computer vision (CV) based human sensing technology enables autonomous and accurate detection and tracking of elementary non-verbal cues [10]. Non-verbal behaviors (e.g., head shaking, facial expressions, and body leaning), the elements of interaction besides the spoken words [11], account for a significant portion of interpersonal interactions [15]. In all forms of interpersonal communication, including clinical communication between patients and healthcare providers, a great amount of information is conveyed through non-verbal behaviors such as body gestures, facial movements, and interpersonal distance [11]. However, there has been no investigation into the feasibility of using these cues to estimate the intensity of interaction, especially under the impact of acute alcohol consumption within intimate couples. If feasible, such technologies have great potential to be used for convenient behavioral health management, such as self-monitoring and assessing behavioral change over time from baseline functioning for physical and mental health evaluation.
The relations between non-verbal cues and positive and negative interactions are non-trivial. For example, smiling usually shows a positive attitude; however, it may also indicate disbelief or derision and thus plays a negative role in certain contexts. Non-verbal cues of interaction are dynamic and function together to convey information [45]. Therefore, although a single cue may not clearly indicate if an interaction is positive or negative by itself, fusing multi-modal cues may improve the certainty [33]. Similarly, a single cue’s level does not necessarily reflect the intensity of an interaction, i.e., how strong the interaction is. By combining multiple cues, we hope to improve the intensity estimation accuracy.
Machine learning has been widely used for data-driven human behavior analysis [21]. Regression models are effective tools to investigate the association between elementary behavioral cues and high-level activity interpretations [19]. Nevertheless, very limited works have been conducted to link non-verbal cues to the intensity of positive and negative interactions. Therefore, as a starting point to fill this gap, we tested the feasibility of interaction intensity estimation through machine learning-based regression using CV-based elementary non-verbal behavioral cues. First, we designed new measurements based on elementary cues as features for regression. Then, regression models were trained using the Couple Conflict Dataset (CCD) [57], which contains conflict-resolution conversation videos specifically recorded to investigate the impact of acute alcohol consumption on intimate couples [24]. We analyzed the feasibility and performance of different regression models. The results confirmed the potential of using non-verbal cues to evaluate the interaction intensities. Analyses also demonstrated that alcohol consumption altered non-verbal behavioral cues and increased estimation errors as a consequence. While the ideal solution is training new models using large after-alcohol datasets, data collection in alcohol consumption states is challenging in real life. Thus, we proposed a domain adaptation method to recycle knowledge discovered in no-alcohol states and emphasize new knowledge learned from a small amount of after-alcohol data to improve estimation accuracy. These results provided important references for both future machine-learning designs and real-world applications.
To the best of our knowledge, this work is the first investigation of using non-verbal behavioral cues to estimate the intensity of interaction within intimate couples and study the impact of alcohol consumption in this context. In summary, this work has four unique contributions:
We proposed a novel set of non-verbal behavioral features to estimate positive and negative interaction intensities in intimate couples’ interactions. More details are in Section 3.2.
This study investigated the relationship between non-verbal behavioral cues and interaction intensity through regression modeling. Analysis showed that a neural network (NN) outperformed other common regression models by limiting the estimation error to 0.11 overall. The results are presented in Section 4.2.1.
We studied how acute alcohol consumption impacted the estimation. Regression errors of the no-alcohol state and those of the after-alcohol state were compared. Results showed that acute alcohol consumption could significantly increase estimation errors. More details are discussed in Sections 4.2.2 and 4.2.3.
We designed a domain adaptation model to improve after-alcohol estimation. Collecting behavioral data in risky states (e.g., alcohol consumption) is difficult; this new method could significantly improve estimation performance without the need to retrain models using a large after-alcohol dataset. This provides a reference for relieving data collection burdens for human behavior analysis in challenging states. Details can be found in Section 4.3.
The rest of this article is organized as follows. Section 2 reviews background work related to primary contributions. Section 3 presents the research methods. Section 4 details the analyses and results. Finally, Section 5 summarizes the study and discusses the contributions, limitations, and future research directions.

2 Related Work

This section reviews existing research related to the main focuses of the current study and presents the motivations of our key innovations. How non-verbal behaviors are detected and used in healthcare and well-being applications is discussed in Section 2.1. Then, previous works on the quantification of positive and negative interactions are introduced in Section 2.2. Finally, the background of domain adaptation applications in human behavior analysis is reviewed in Section 2.3.

2.1 Non-Verbal Behavioral Cues in Healthcare and Well-Being

Non-verbal behavioral cues, such as facial expressions [48, 52], eye movements [36, 65], and body postures [6, 12, 54, 58] are capable of reflecting individuals’ affective states [6, 36, 48, 52, 58, 65] and personality [12, 54]. Non-verbal behavioral cues are good candidates to reflect individuals’ real status as they are more difficult to control or suppress than verbal cues due to their involuntary nature [29]. A growing body of research has been conducted to understand how non-verbal behavioral cues can be used in healthcare and well-being studies. Video-based annotation [49, 57] is a classic quantification method applied to analyze non-verbal behaviors and understand how these cues are impacted by alcohol consumption in romantic relationships [49, 57].
In recent years, advancements in CV and machine learning [5, 23, 53, 60, 66] have allowed the automatic detection of non-verbal behavioral cues from image- or video-based resources, which reduces the burden of time-consuming manual observation-based analyses. For instance, Wang et al. [63] proposed a dual-stream bidirectional convolutional neural network (BCNN) to capture the eye gaze pattern as one of the major features for fatigue detection through GP-BCNN, a BCNN that incorporates Gabor filters and projection vectors. Schultebraucks et al. [53] designed a deep learning-based framework to capture facial features of emotion and their intensities to facilitate the detection of major depressive disorder and posttraumatic stress disorder. Ascari et al. [5] developed a CV-based framework to support interaction for people with motor and speech impairments through gesture (e.g., hand) recognition. Through Support Vector Machine (SVM)-based and Convolutional Neural Network (CNN)-based classifiers, this work demonstrated the feasibility of personalized gestural interactions. Tuyen et al. [59] introduced a generative adversarial network-based framework that included a Context Encoder, Generator, and Discriminator, incorporating context awareness to capture the influence of interacting partners’ non-verbal signals (e.g., eye gaze) on the target individual. Fatima et al. [25] implemented CNN-based frameworks to model their devised concept of dyadic affect context to improve continuous emotion recognition (such as happiness and sadness) in dyadic scenarios. Dindar et al. [22] proposed the Riesz-based volume local binary patterns to classify facial expressions (i.e., positive, negative, and neutral) through an SVM with the linear kernel and validated the feasibility of using emotional mimicry to identify the leader and follower students in collaborative learning settings. Drimalla et al. [23] applied OpenFace 2.0 [7] to automatically analyze facial action units from video sources to assist facial emotion recognition and imitation analysis for individuals with autism. However, very few of these works have been extended to study the impact of alcohol consumption. The only example we are aware of so far was done by Yu et al. [66]. This work took advantage of a CNN-based framework, Deepgaze [42], to extract intimate couples’ head-shaking behaviors and then developed a computation model to dynamically monitor how alcohol consumption impacted disagreement expressions.
Most of the existing work focuses on behavioral cues in one modality, such as facial expression or gesture; however, different non-verbal behavioral cues usually function together to convey information in interactions [45]. Few attempts have been made to investigate the feasibility of applying multi-modal non-verbal behavioral cues to automatically quantify comprehensive human interactions. To address this gap, in the current work, we designed multiple non-verbal behavioral features across modalities (i.e., using head pose, body movements, and facial expressions) based on state-of-the-art CV techniques. These cues are fused to quantify the interaction of couples and build regression models for interaction intensity estimation.

2.2 Quantification of Positive and Negative Interactions

The current study focuses on estimating the level of positivity and negativity during a communicative interaction within a couple. Positive interactions refer to friendly communication, a sense of support and caring, and praise or acknowledgment that enhances intimacy with others [28]. Negative interactions include irritation, conflicts, or criticisms that damage the closeness with others [28]. Evaluating these relationship-enhancing and relationship-damaging behaviors provides cues to study various health and well-being aspects, such as the impact of alcohol consumption [57], how conflict affected intimate relationships [64], and how psychological conditions of one family member impacted other family members [35]. Questionnaires and manual observation (e.g., manual video annotation) are the most commonly used evaluation methods. For instance, the Relationship Questionnaire [28], a self-report questionnaire that includes 12 items, was used to assess positive and negative interactions in couples as factors to understand anxiety and depression symptoms [27]. The Rapid Marital Interaction Coding System (RMICS) [30], which uses subscales to describe positive (i.e., acceptance, relationship-enhancing attributions, self-disclosure, and humor) and negative interaction behaviors (i.e., psychological abuse, distress-maintaining attributions, hostility, dysphoric affect, and withdrawal), has been applied to guide manual video annotation to understand how acute alcohol consumption impacts interactive behaviors in cohabiting couples [57].
Although these classic interaction quantification methods are well-validated and have generated significant scientific discoveries, they are labor-intensive (human observation/annotation is limited by the availability of qualified/trained coders) and cannot be conducted automatically. Therefore, researchers started to explore automatic quantification methods using behavioral or related biological signals. For instance, Ahmed et al. [2] proposed a CNN-based model to automatically extract the asymmetry of different brain regions as a 2D vector from an electroencephalograph for positive and negative emotion classification. Wang et al. [62] devised a fully convolutional framework based on a feature pyramid network [37] to detect interaction key points (i.e., points of interaction between a human and an object (e.g., a computer)) from images and classify positive and negative interactions between human-object pairs. Zhang et al. [67] proposed a unary-pairwise transformer with two stages that captured both unary and pairwise characterizations of humans and objects (e.g., motorcycles) from image sources to recognize positive and negative interactions. Feil-Seifer et al. [26] designed a naïve Bayes classifier based on a Gaussian mixture model to automatically analyze positive and negative interactions between an assistive robot and a child using distance (the proximity between the robot and the child) features. These studies have demonstrated the power of machine learning techniques in addressing the quantification problem. However, existing works usually consider only one aspect of positive or negative interactions, such as emotion or physical proximity, while the classic questionnaire/observation-based methods evaluate comprehensive aspects (e.g., verbal and non-verbal expressions and body postures) of positivity and negativity. In addition, current technologies focus on detecting instantaneous events (e.g., a certain moment when people show a facial expression or a posture), while a comprehensive evaluation over a longer interaction (e.g., a few minutes or longer) [27, 57] is required to interpret more socially meaningful scenarios.
Note that completely automating the classic comprehensive evaluation methods, such as automatically detecting all aspects of a questionnaire using a piece of technology, is still unrealistic. Therefore, in this article, we attempted to bridge the automatic detection of instantaneous behavioral cues and an overall evaluation of a much longer interaction. Machine learning-based regression models were designed to achieve this goal. These models use multiple non-verbal behavioral features in different modalities as inputs to estimate the level of overall positivity and negativity of an interaction. The CCD dataset [57] was used to validate the feasibility.

2.3 Domain Adaptation

CCD [57] is among the largest video datasets available to train such models; however, its sample size (i.e., 152 couples) is still small when compared to other types of human video datasets. The main reason is that collecting behavioral data in a risky state is challenging due to practical risks and ethical considerations. Therefore, it is important to explore how to reduce the amount of training data required to build good estimation models in a challenging state. The simplest attempt would be applying a model trained on data collected in the no-alcohol state. However, behaviors may change after drinking alcohol [57, 66], which means that no-alcohol and after-alcohol data may not share the same distribution in feature space. As a consequence, the performance of an estimation model trained on no-alcohol data may decrease when applied directly to people after drinking.
In this case, domain adaptation could be a good candidate to address the problem. Domain adaptation transfers knowledge derived from one domain to another related domain [1]. In our case, this would be taking advantage of data collected in the no-alcohol state to train a base model that captures behavioral patterns that are stable regardless of alcohol use. Then, a dataset collected in the after-alcohol state is needed to add the new knowledge introduced by alcohol consumption to the base model and make it work better for after-alcohol prediction.
Domain adaptation has boosted human behavior studies over the past years. Qin et al. [44] devised an adaptive spatial-temporal domain adaptation framework to recognize human activities (e.g., standing) across datasets (e.g., the UCI daily and sports datasets [9] and the USC Human Activity Dataset [69]) collected in different conditions, aiming at proper source domain selection and precise knowledge transfer. Luo et al. [38] introduced a method based on non-negative matrix factorization to recognize speech emotions across corpora. It could identify and transfer the feature subspace between the source and the target corpora. An et al. [3] proposed an AdaptNet model, composed of variational autoencoders and Generative Adversarial Networks, which demonstrated robust human activity recognition (e.g., standing rest and walking) from a single triaxial accelerometer via bilateral domain adaptation when limited labeled data was available. Sun et al. [56] pioneered a joint transferable dictionary learning and view adaptation model to recognize human actions (e.g., kicking). Similar sparse features of the same action from multiple views in the source domain were transferred to the target domain to narrow the distribution gap in multi-view human action recognition.
Although domain adaptation has shown significant success, few attempts have been made to understand human behaviors under the impact of substance usage (e.g., alcohol). Previous research has made efforts to analyze alcoholism through transfer learning on brain images [61] or electroencephalography signals [55, 68] to address the data scarcity. However, these methods cannot be extended to daily behavior monitoring, as scanning the brain daily is unrealistic. In addition, most alcohol consumption cases do not reach the level of alcoholism. Nevertheless, the philosophy of these methods, i.e., that many human features are transferable from the no-alcohol state to the after-alcohol state, is likely to hold in behavioral (brain-controlled) studies. Therefore, we proposed a domain adaptation-based framework to recycle the knowledge learned from the no-alcohol state and add new information reflecting features that were altered by alcohol. These two parts were integrated to interpret interaction intensities in the after-alcohol state. To the best of our knowledge, this work is among the first to explore domain adaptation as a tool to understand interaction intensities for couples under the impact of alcohol.

3 Methods

3.1 CCD

The CCD collected by Testa et al. [57], one of the largest video datasets for investigating the effects of administered alcohol on intimate partners’ interactions, was used to conduct this study. The dataset is not publicly available due to ethical and legal restrictions. One hundred fifty-two married and heterosexual couples were recruited and completed two 15-minute discussions about conflicts in their relationship during a laboratory visit. The average length of marriage (or cohabitation) ranged from 0.4 to 22.4 years (\(M=6.11,\textit{SD}=5.20\) years). Videos of the conversations were recorded, as demonstrated in Figure 1 (real images are not presented to protect identifiable information). The first interaction was conducted while all participants were sober (S1) before the second session (S2), when couples were randomly assigned to one of four experimental groups (G1: both drank an alcoholic beverage, n = 40; G2: male drank only, n = 39; G3: female drank only, n = 37; G4: neither drank, n = 36). The alcoholic beverages contained 80-proof vodka mixed with cranberry juice in a 2.22 ml/kg ratio for females and 2.39 ml/kg for males. The dosages targeted a 0.08% breath alcohol content (about 4–6 standard drinks). The no-alcohol beverages were an equivalent amount of juice.
Fig. 1.
Fig. 1. Illustration of the experiment setting for the conflict-resolution paradigm. Females and males were required to work out solutions to their disagreements in daily life. The interactions were recorded. The red rectangles indicate the visual attention on partners. The green and blue dotted lines refer to the back and the forward leanings, respectively. Angles \(\boldsymbol{\sigma}_{\mathbf{1}}\) and \(\boldsymbol{\sigma}_{\mathbf{2}}\) are the thresholds to define the forward and back leaning from the baseline.
Positive, negative, and neutral behavior occurrences of each subject in each CCD session [57] were identified through video annotation using the RMICS [30], a well-validated event-based system to code observed dyadic behaviors. RMICS consists of four positive codes (acceptance, relationship-enhancing attributions, self-disclosure, and humor), five negative codes (psychological abuse, distress-maintaining attributions, hostility, dysphoric affect, and withdrawal), one neutral code (problem description), and one “other” code to refer to scenarios when the couple discussed something off topic (e.g., the experiment itself). The basic coding unit was turn-taking. During each unit, well-trained observers assigned a code to the speaker and listener based on verbal (meaning of utterances), non-verbal (e.g., body and/or facial messages), and para-verbal (voice tone, pitch, and speed of speech) expressions. Observers gave only 1 of the 11 codes to the speaker and the listener within each unit, and the order of code importance was considered (negative codes \({ \gt }\) positive codes \({ \gt }\) neutral code \({ \gt }\) “other” code) when several codeable behaviors were present. Two annotators manually coded the experiment videos. The interrater reliability was acceptable (67%, average Cohen’s kappa: 0.5) [57]. Note that all participants showed both positive and negative behavior occurrences during each session. Therefore, each session should be evaluated regarding both positivity and negativity. Aligned with previous works [31, 57], we defined ratios of the number of positive and negative codes divided by the total of all codes as the positive interaction intensity (\(I\)pos, (1)) and the negative interaction intensity (\(I\)neg, (2)), respectively, as follows.
\begin{align}I_{pos}&=\frac{\#\ of \ positive \ behavior \ occurrences}{\#\ of \ positive, negative, neutral, \textit{``other''}\ \ behavior \ occurrences},\end{align}
(1)
\begin{align}I_{neg}&=\frac{\#\ of \ negative \ behavior \ occurrences}{\#\ of \ positive, negative, neutral, \textit{``other''}\ behavior \ occurrences},\end{align}
(2)
Ipos and Ineg \(\in\) (0, 1), and Ipos + Ineg \({ \lt }\) 1 within each session, because most codes were neutral and “other” codes existed. The average Ipos and Ineg were 0.24 (SD: 0.12) and 0.13 (SD: 0.14) in S1, and 0.25 (SD: 0.013) and 0.15 (SD: 0.16) in S2 for all participants (n = 304, 152 couples) [57]. A higher Ipos/Ineg value indicated that a participant showed more positive/negative behaviors during the interaction.
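To make the two definitions concrete, the following sketch computes both intensities from per-session RMICS code counts; the function name and dictionary keys are ours and purely illustrative.

```python
def interaction_intensities(code_counts):
    """Compute (I_pos, I_neg) for one session, following Equations (1) and (2).

    code_counts: number of RMICS behavior occurrences per code family,
    e.g., {"positive": 12, "negative": 5, "neutral": 30, "other": 3}.
    """
    total = sum(code_counts.get(k, 0) for k in ("positive", "negative", "neutral", "other"))
    if total == 0:
        raise ValueError("Session contains no coded behavior occurrences.")
    return code_counts.get("positive", 0) / total, code_counts.get("negative", 0) / total


# Example: 12 positive, 5 negative, 30 neutral, and 3 "other" codes in a session.
i_pos, i_neg = interaction_intensities({"positive": 12, "negative": 5, "neutral": 30, "other": 3})
# i_pos = 0.24, i_neg = 0.10
```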
For the current study, the CCD [57] was divided into a no-alcohol state (n = 304, 152 couples) and an after-alcohol state (n = 232, 116 couples). Videos were excluded from the current study if over 15% of the recording had the following issues: (1) at least one person’s upper body was not fully captured; (2) at least one person moved out of the camera view. This resulted in 141 couples in the no-alcohol state (age (years): females: mean = 31.51, SD = 6.58; males: mean = 32.83, SD = 6.75) and 106 couples in the after-alcohol state (age (years): females: mean = 31.86, SD = 6.86; males: mean = 33.12, SD = 6.80). Since our previous study showed that females and males demonstrated different behavioral patterns [66], in the following sections, data analysis was conducted mainly in four categories: females in the no-alcohol state (F_NoAlc), males in the no-alcohol state (M_NoAlc), females in the after-alcohol state (F_Alc), and males in the after-alcohol state (M_Alc).

3.2 Design of Non-Verbal Behavioral Features

A few elementary non-verbal behavioral cues, which can be extracted from the CCD [57] videos using computer-vision methods, were used to define a few measurements that might be related to positive and negative interactions. Because these behaviors usually change over the course of a few seconds, recorded videos were down-sampled from 30 frames/s to 1 frame/s to reduce data processing time. Therefore, for each 15-min session, about 900 frames were extracted for the following analysis. Features in Section 3.2.1 were analyzed through sliding window-based techniques, and the remaining ones were estimated from each frame.

3.2.1 Head-Shaking Cues.

Head-shaking may indicate “disagreement” or “refusal” in social contexts across cultures [13]. Head-shaking consists of left-and-right horizontal head rotations [66], as illustrated in Figure 2(a). It can be quantified using yaw degrees through head-pose estimation. Considering the limited resolution of the CCD [57] videos (360\({\times}\)480 pixels), a robust CNN-based framework, Deepgaze [42], was applied due to its good performance on low-resolution images. To avoid unnecessary computation on non-facial areas, a 240\({\times}\)240 pixel area around the participant’s head was extracted from each frame for the estimation. For each session in CCD [57], two cameras were positioned at arbitrary angles to record the female and the male, resulting in varying head-pose angle baselines, which were corrected to 0\({}^{\circ}\) for all participants.
Fig. 2.
Fig. 2. Illustrations of the head-shaking behavior (a) and the positive and negative emotions (b). Positive emotion refers to “happiness”. Negative emotions include “sadness,” “anger,” “fear,” and “disgust” from left to right.
Moderate (yaw rotation angular change of 4\({}^{\circ}\)–11\({}^{\circ}\) per second) and large (yaw rotation angular change over 11\({}^{\circ}\) per second) head shakings may indicate moderate and strong disagreements, which presented different interaction patterns in CCD [66]. For each type of head shaking, the magnitude (the absolute yaw angle change) and the following behavior (the number of dyadic situations in which one person initiated head shaking and the other person shook their head following the initiator within 3–5 seconds) were extracted as features. Our previous work showed that females and males demonstrated distinct patterns in both [66]. For instance, males tended to show fewer head shakings and following behaviors than females.
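The sketch below illustrates how these head-shaking features could be derived from a 1 Hz, baseline-corrected yaw-angle series; the thresholds follow the definitions above, while the function names and the simple per-second event logic are our assumptions rather than the exact implementation.

```python
import numpy as np

MODERATE = (4.0, 11.0)   # yaw change per second (degrees) for moderate shaking
STRONG = 11.0            # yaw change per second (degrees) for strong shaking

def head_shaking_events(yaw_deg):
    """Return per-second (time_index, magnitude, kind) head-shaking events.

    yaw_deg: 1-D array of baseline-corrected yaw angles sampled at 1 frame/s.
    """
    deltas = np.abs(np.diff(np.asarray(yaw_deg)))
    events = []
    for t, mag in enumerate(deltas):
        if MODERATE[0] <= mag <= MODERATE[1]:
            events.append((t, mag, "moderate"))
        elif mag > STRONG:
            events.append((t, mag, "strong"))
    return events

def following_behaviors(initiator_events, partner_events, min_lag=3, max_lag=5):
    """Count dyadic cases where the partner shakes their head 3-5 s after the initiator."""
    partner_times = np.array([t for t, _, _ in partner_events])
    count = 0
    for t, _, _ in initiator_events:
        if partner_times.size and np.any((partner_times >= t + min_lag) & (partner_times <= t + max_lag)):
            count += 1
    return count
```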

3.2.2 Visual Attention.

Visual attention is a person’s focus of attention coinciding with eye movements [39]. Here, it refers to scenarios where an individual’s eye gaze falls on the partner in dyadic interactions. Eye gaze signifies the engagement of both the speaker and the listener [20] and may also serve as an indicator of potential aggression [43]. Since the participants did not wear any gaze-tracking devices and the resolution of the video did not allow us to extract sharp eye images, the eye gaze direction was approximated using the frontal head orientation. Prior research confirmed the use of frontal head pose as a coarse indicator of visual attention estimation [32, 40]. In this study, we defined a person’s visual attention on her/his partner when her/his approximated gaze direction fell within a rectangular area around the partner’s head, as marked in Figure 1.
To determine the range of the gaze area, for each couple in each session, four frames in which the couple had mutual eye contact were collected. The average frontal head orientation in terms of pitch (\(x_{p}\)) and yaw (\(x_{y}\)) angles was calculated to indicate the coarse centroid of the visual attention area (\(x_{p}\), \(x_{y}\)). Then, frames in which a person looked away from her/his partner in the up (\(x_{u}\)), down (\(x_{d}\)), left (\(x_{l}\)), and right (\(x_{r}\)) head-turning directions were sampled to determine the range of visual attention in terms of the pitch and yaw angle deviations from the centroid in each of the four directions. These deviations were further averaged across females and males in each group to reduce individual biases. The whole process can be represented by:
\begin{align}&-\frac{1}{N}\sum_{i=0}^{N}\left|x_{d,i}-x_{p,i}\right|\leq\left(X_{P,i}-x_{p,i}\right)\leq\frac{1}{N}\sum_{i=0}^{N}\left|x_{u,i}-x_{p,i}\right|,\end{align}
(3)
\begin{align}&-\frac{1}{N}\sum_{i=0}^{N}\left|x_{l,i}-x_{y,i}\right|\leq\left(X_{Y,i}-x_{y,i}\right)\leq\frac{1}{N}\sum_{i=0}^{N}\left|x_{r,i}-x_{y,i}\right|,\end{align}
(4)
where \(X_{P,i}\) and \(X_{Y,i}\) are vectors that demonstrate the pitch and yaw angles of head movements for each individual in each session, respectively. Here, N represents the sample size in each state. The percentage of frames where both the pitch and yaw angle deviations fell within the visual attention area was defined as quantified visual attention.
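A minimal sketch of this visual attention quantification is given below; the function signature is ours, and the centroid and deviation bounds are assumed to have been pre-computed from the mutual-contact and look-away frames as explained above.

```python
import numpy as np

def visual_attention_ratio(pitch, yaw, centroid, bounds):
    """Fraction of frames whose gaze (approximated by head pose) falls on the partner.

    pitch, yaw : per-frame head-pose angles (degrees) for one participant.
    centroid   : (x_p, x_y), average pitch/yaw during mutual eye contact.
    bounds     : (down, up, left, right) average absolute deviations, as in (3)-(4).
    """
    pitch, yaw = np.asarray(pitch), np.asarray(yaw)
    x_p, x_y = centroid
    down, up, left, right = bounds
    in_pitch = (-down <= pitch - x_p) & (pitch - x_p <= up)
    in_yaw = (-left <= yaw - x_y) & (yaw - x_y <= right)
    return float(np.mean(in_pitch & in_yaw))
```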

3.2.3 Forward Leaning and Backward Leaning.

Prior evidence suggested that forward torso leaning signaled a relatively positive attitude in interactions, and backward leaning indicated a more negative attitude [50]. These leaning movements can be assessed by measuring the angle of the spine. A state-of-the-art human pose detection program, OpenPose [14] (version 1.7.0), was applied due to its good performance in estimating human body skeleton key points. As shown in Figure 1, forward and backward leaning can be detected by measuring whether the spine angle increased or decreased from a baseline, the natural position held by the person for the majority of the session time. Since the spine angle was dynamic according to the sitting position during the conversation, the baseline angle was computed for each person in three steps. First, a moving average filter (window size = 30) was applied to reduce the impact of large motion (large upper body tilt). Then, the Asymmetric Least Squares Smoothing (ALSS) [41] method was applied to further smooth out abrupt variations. Finally, a baseline angle was computed as the average of the ALSS result (\(X_{ALSS}\)) across the whole session. To reduce the impact of individual differences, a group baseline was calculated using the average of individual baselines within each group (females or males).
Here we focused on apparent leanings that have relatively clear social meanings. An apparent leaning was recorded if the spine angle exceeded one standard deviation (\(\sigma_{1}=\sigma_{2})\) from the baseline. Forward leaning (\(X_{LF}\)) and backward leaning (\(X_{LB}\)) would lead to angle changes in positive and negative values, respectively, as demonstrated in formulas (5) and (6).
\begin{align} X_{LF}&\geq\frac{1}{N}\sum_{i=1}^{N}X_{ALSS,i}+\sigma_{1},\end{align}
(5)
\begin{align}X_{LB}&\leq\frac{1}{N}\sum_{i=1}^{N}X_{ALSS,i}-\sigma_{2},\end{align}
(6)
where N indicates the sample size in each state. The percentages of the session that showed each type of leaning were used as numerical features for regression.
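The sketch below outlines the leaning detection under simplifying assumptions: a plain moving average stands in for the ALSS smoother of [41], and an individual (rather than group-level) baseline is used for brevity.

```python
import numpy as np

def leaning_fractions(spine_angle, window=30):
    """Estimate the fraction of a session spent in apparent forward/backward leaning.

    spine_angle: per-frame spine angle (degrees). A moving average stands in for the
    ALSS smoother used in the article; the baseline is the smoothed mean, and the
    threshold is one standard deviation on each side (sigma_1 = sigma_2).
    """
    spine_angle = np.asarray(spine_angle, dtype=float)
    kernel = np.ones(window) / window
    smoothed = np.convolve(spine_angle, kernel, mode="same")
    baseline = smoothed.mean()
    sigma = smoothed.std()
    forward = spine_angle >= baseline + sigma   # positive deviation: forward leaning
    backward = spine_angle <= baseline - sigma  # negative deviation: backward leaning
    n = len(spine_angle)
    return forward.sum() / n, backward.sum() / n
```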

3.2.4 Overall Body Reposition—Reposition Power.

The CCD [57] videos showed that not only the aforementioned obvious/large movements, which can be clearly defined, but also subtle movements (such as shrugging shoulders, rubbing palms, and sidewise body shaking) changed over the course of the conversation. In addition, in many cases, the large movements and the small movements were blended together. Therefore, it is important to quantify the overall body reposition. We proposed a new measurement named “reposition power.” Twenty-five key points along the skeleton of each subject were tracked using OpenPose. First, the Euclidean distance across all 25 key skeleton points between two consecutive frames was calculated. The reposition power (\(X_{RP}\)) was defined as the average of the distances across all pairs of consecutive frames in a session:
\begin{align}X_{RP}=\frac{1}{N-1}\sum_{i=1}^{N-1}\left(\sqrt{\sum_{k=1}^{25}\left(X_{i+1,k}-X_{i,k}\right)^{2}}\right),\end{align}
(7)
where \(X_{i,k}\) and N indicate the vector consisting of 25 key points for each frame and the number of frames in each session, respectively.
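Equation (7) maps directly to a few lines of array arithmetic; the sketch below assumes the OpenPose key points have already been stacked into an array of shape (frames, 25, 2).

```python
import numpy as np

def reposition_power(keypoints):
    """Average frame-to-frame displacement over 25 OpenPose skeleton key points.

    keypoints: array of shape (n_frames, 25, 2) with (x, y) pixel coordinates.
    Implements Equation (7): the Euclidean distance between consecutive frames,
    averaged over all consecutive frame pairs in the session.
    """
    keypoints = np.asarray(keypoints, dtype=float)
    diffs = np.diff(keypoints, axis=0)                  # (n_frames - 1, 25, 2)
    per_pair = np.sqrt((diffs ** 2).sum(axis=(1, 2)))   # distance per consecutive frame pair
    return float(per_pair.mean())
```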

3.2.5 Positive and Negative Emotion Ratios.

Facial expressions play a substantial role in social interactions [8, 34], such as conveying personal emotions and revealing intentions [51]. A CNN-based framework with global average pooling and depth-wise separable convolutions [4], which can successfully detect emotions from facial expressions in the CCD [57] videos’ configuration (recording from the side view with limited resolution), was applied to detect seven fundamental emotional facial expressions: happiness, sadness, anger, surprise, fear, disgust, and neutral. Considering the nature of these emotions, “happiness” was identified as the positive emotion, while “sadness,” “anger,” “fear,” and “disgust” were grouped as the negative emotions (Figure 2(b)). The positive emotion ratio (PER) and negative emotion ratio (NER) were defined as:
\begin{align}{\rm PER}&=\frac{\#\ of \ \textit{positive emotion}}{\#\ of \ all \ \textit{emotions}},\end{align}
(8)
\begin{align}{\rm NER}&=\frac{\#\ of \ \textit{negative emotions}}{\#\ of \ all \ \textit{emotions}}.\end{align}
(9)
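For completeness, the two ratios can be computed from per-frame emotion labels as sketched below; the label strings follow the seven categories listed above, and the helper name is ours.

```python
from collections import Counter

POSITIVE = {"happiness"}
NEGATIVE = {"sadness", "anger", "fear", "disgust"}

def emotion_ratios(frame_labels):
    """Compute PER and NER (Equations (8)-(9)) from per-frame emotion labels."""
    counts = Counter(frame_labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0, 0.0  # no detected emotions in this session
    per = sum(counts[e] for e in POSITIVE) / total
    ner = sum(counts[e] for e in NEGATIVE) / total
    return per, ner
```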

3.3 Estimation of Positive and Negative Intensities

Positive, negative, and neutral behavior occurrences of each subject in each CCD session [57] were identified through manual video annotation using the RMICS [30] (see Section 3.1). As defined in Section 3.1, the positive intensity (Ipos) of a session is a ratio between the number of positive behavior occurrences and the total number of positive, negative, neutral, and “other” behavior occurrences. Similarly, the negative intensity (Ineg) of a session is the ratio between the number of negative occurrences and the total number of all occurrences. Therefore, each session had both Ipos and Ineg, ranging from 0 to 1. Regression analysis was conducted to estimate Ipos and Ineg (dependent variables) using the aforementioned non-verbal behavioral features (independent variables).
We first measured the correlation between each proposed behavioral feature and Ipos, as well as between each feature and Ineg, for both females and males in all sessions using Pearson’s correlation analysis. No significant correlation (significance level p = 0.05) was found at the moderate (r = 0.5 \(\sim\) 0.7) or strong (r \({ \gt }\) 0.7) level, indicating that the associations between the non-verbal features and Ipos, as well as Ineg, were beyond simple linearity. Therefore, we applied multiple widely used regression algorithms, including both linear and non-linear models, as an initial exploration of such associations. The tested models were Linear Regression (LR), Ridge Regression (RR), Lasso Regression (LaR), ElasticNet Regression (ENR), Bayesian Ridge Regression (BRR), Support Vector Machines with the linear kernel (SVMl) (C = 100), SVM with the RBF kernel (SVMr), SVM with the poly kernel (SVMp) (degree = 2), and an Artificial Neural Network (NN). According to experimental parameter analysis, we configured the NN with four dense layers and set the numbers of units to 5, 5, 5, and 1, respectively, to yield good results. All kernels were initialized with a normal distribution of weights. The ReLU activation function was applied in each layer. The NN was compiled using the Adam optimizer.
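A sketch of the NN regressor configuration described above is shown below, using Keras as one possible implementation (the article does not name a specific library); the loss function and the input dimensionality are our assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_intensity_regressor(n_features=10):
    """NN regressor sketch: four dense layers (5, 5, 5, and 1 units), ReLU in each
    layer, normally distributed initial weights, compiled with the Adam optimizer."""
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        layers.Dense(5, activation="relu", kernel_initializer="random_normal"),
        layers.Dense(5, activation="relu", kernel_initializer="random_normal"),
        layers.Dense(5, activation="relu", kernel_initializer="random_normal"),
        layers.Dense(1, activation="relu", kernel_initializer="random_normal"),
    ])
    model.compile(optimizer="adam", loss="mse")  # loss function is our assumption
    return model
```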

3.4 Domain Adaptation for After-Alcohol Estimation

Collecting human subject data under risk (e.g., alcohol consumption) is difficult due to practical and ethical reasons. This poses a challenge to training machine learning models due to data scarcity. Domain adaptation is a good candidate to solve the problem, in which knowledge learned from a general (e.g., no-alcohol) state with abundant data can be adapted (although this very often requires modification) to a related (e.g., after-alcohol) state where data collection is hard [1]. In this study, we designed a domain adaptation framework to understand whether interaction intensities can be effectively estimated in the after-alcohol state using a regression model that combines knowledge from two resources: (1) a base model trained using no-alcohol data and (2) new knowledge gained from a smaller after-alcohol dataset.
When we have access to no-alcohol data from the same group of people who need after-alcohol estimation, and when the alcohol consumption is moderate, we assume most non-verbal behavioral features remain stable across the states. The CCD [57] resembled this case. These stable features describe the behavioral patterns that reflected interaction intensities regardless of whether people drank alcohol or not. Thus, the model parameters learned from these stable features in the no-alcohol state may be used as a basis for domain adaptation. Note that “stable” does not mean that the individual value of each feature was frozen, but that the distribution of the features’ values did not change significantly across states.
Using RMICS [30], Testa et al. [57] found some behavioral changes after alcohol consumption. Consequently, we assumed a fraction of the non-verbal features would have been changed significantly. These changed features were new information that created the major disparities between the no-alcohol and after-alcohol states. In other words, these changes were the reasons why using the no-alcohol model directly would lead to lower estimation accuracy. Hence, the base model should be updated emphasizing the new information brought by the changed features to effectively estimate interaction in the after-alcohol state. Since the significantly changed features brought salient new information, we anticipated that using a smaller after-alcohol dataset (compared to the no-alcohol dataset) should be sufficient to catch the patterns.
Accordingly, we proposed a novel State-Induced Domain Adaptation (SIDA) framework, as shown in Figure 3. The General Behavior (Gen-Beh) module is built using stable features (i.e., features whose distributions did not change significantly after alcohol consumption). The State-Induced (Sta-Ind) module handles features that changed significantly. The Gen-Beh module is built based on the regression model trained by the no-alcohol data, while the Sta-Ind module is trained entirely using the new data, i.e., data collected in the after-alcohol state. Then, the two modules’ outputs are fused to generate the final regression output.
Fig. 3.
Fig. 3. The scheme of SIDA framework. Stable features were inputted to the Gen-Beh module. Features with significant changes were inputted to the Sta-Ind module. \(\boldsymbol{\theta}_{\boldsymbol{GB}}\) and \(\boldsymbol{\theta}_{\boldsymbol{SI}}\) were learned parameters from the two modules. Their outputs, \(\boldsymbol{Y}_{\boldsymbol{GB}}\) and \(\boldsymbol{Y}_{\boldsymbol{SI}}\), were fused together through \(\boldsymbol{F}^{\boldsymbol{fu}s\boldsymbol{ion}}\) in the fusion module to calculate the final output. SIDA, State-Induced Domain Adaptation; Gen-Beh, General Behavior; Sta-Ind, State-Induced.
Figure 4 illustrates the implementation of the SIDA framework in this study. According to statistical analysis, we identified stable features and unstable features from CCD [57] (see Section 4.3.1). Both the Gen-Beh and the Sta-Ind modules were NNs configured with a 10-unit input layer and two consecutive dense layers with 5 units. Each data sample (i.e., input) was formatted in two parts: (1) a 10-dimensional vector that contained only the values of the stable features while the positions of the unstable features were replaced by 0s, and (2) a 10-dimensional vector that contained only the values of the unstable features while the positions of the stable features were replaced by 0s. In this way, the stable and unstable features were separated without changing the length of the input vectors for each model. The fusion module was also an NN. The outputs of the Gen-Beh module (five-dimensional) and the Sta-Ind module (five-dimensional) were connected through a 10-dimensional concatenation layer and were fed as the input into two fully connected dense layers. Both the Gen-Beh and the Sta-Ind modules used a LeakyReLU (α = 0.3) activation function. The fusion module applied a ReLU as the activation function. Finally, the framework was compiled using the Adam optimizer.
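The sketch below expresses this architecture with the Keras functional API; the sizes of the two fusion dense layers are not stated in the text, so the 5-unit and 1-unit choices, the loss function, and the library itself are our assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_sida(n_features=10):
    """SIDA sketch: Gen-Beh and Sta-Ind branches (10-dimensional inputs, two 5-unit
    dense layers, LeakyReLU), a 10-dimensional concatenation, and a ReLU fusion head."""
    stable_in = keras.Input(shape=(n_features,), name="stable_features")    # unstable positions zeroed
    changed_in = keras.Input(shape=(n_features,), name="changed_features")  # stable positions zeroed

    def branch(x, name):
        x = layers.Dense(5, name=f"{name}_dense1")(x)
        x = layers.LeakyReLU()(x)  # default negative slope 0.3, matching the article
        x = layers.Dense(5, name=f"{name}_dense2")(x)
        return layers.LeakyReLU()(x)

    y_gb = branch(stable_in, "gen_beh")   # Y_GB
    y_si = branch(changed_in, "sta_ind")  # Y_SI

    fused = layers.Concatenate()([y_gb, y_si])                               # 10-dimensional
    fused = layers.Dense(5, activation="relu", name="fusion_dense1")(fused)  # fusion sizes assumed
    output = layers.Dense(1, activation="relu", name="fusion_dense2")(fused)

    model = keras.Model([stable_in, changed_in], output)
    model.compile(optimizer="adam", loss="mse")  # loss function is our assumption
    return model
```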
Fig. 4.
Fig. 4. The implementation of the SIDA framework. Non-verbal behavioral features were formatted into two sets: the stable feature matrix for the Gen-Beh module and the feature matrix with significant changes for the Sta-Ind module, respectively. For each set, the light blue circles represent 0s, and circles with the other color indicate values of the specified features. In phase I, no-alcohol data were applied to learn initial parameters across the whole framework. In phase II, a small set of after-alcohol data was used to update the framework after resetting the Sta-Ind module (circled by the blue dash rectangle).
The SIDA framework was trained in two phases. In phase I, no-alcohol data (N = 141) were used to train the whole framework, which set initial parameters for all three modules to configure a base framework. In phase II, the base framework was updated using a smaller set of after-alcohol data. How the size of the after-alcohol dataset impacted the estimation performance is analyzed in Section 4.3.2. Since the Gen-Beh module captures the patterns in the stable features that did not change significantly across states, the parameters of the base framework (obtained from phase I) were used as initial values in phase II training, which further strengthened the existing patterns captured in phase I. The Sta-Ind module was designed to capture new information from the features that tended to change significantly after alcohol consumption, and thus the outdated patterns learned from phase I should be wiped out. Therefore, in phase II, the parameters of the Sta-Ind module were reset using a normal distribution before passing in the new training data. The fusion module was updated in the same way as the Gen-Beh module since it mainly facilitated the combination of the Gen-Beh and the Sta-Ind modules; therefore, the base framework’s knowledge should be maintained and reinforced by the new data.
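The two-phase procedure can be summarized as in the sketch below, which reuses build_sida from the previous sketch; the weight-reset scale and the number of epochs are placeholders, not values reported in the article.

```python
import numpy as np

def reinitialize_sta_ind(model, std=0.05):
    """Reset the Sta-Ind branch to normally distributed weights before phase II."""
    for layer in model.layers:
        if layer.name.startswith("sta_ind") and layer.weights:
            layer.set_weights([np.random.normal(0.0, std, size=w.shape) for w in layer.get_weights()])

def train_sida_two_phase(no_alc_x, no_alc_y, alc_x, alc_y, epochs=200):
    """Phase I: train on no-alcohol data; phase II: reset Sta-Ind, update with after-alcohol data.

    no_alc_x / alc_x: [stable_feature_matrix, changed_feature_matrix] per state.
    """
    sida = build_sida()
    sida.fit(no_alc_x, no_alc_y, epochs=epochs, verbose=0)  # phase I: base framework
    reinitialize_sta_ind(sida)                              # wipe outdated Sta-Ind patterns
    sida.fit(alc_x, alc_y, epochs=epochs, verbose=0)        # phase II: small after-alcohol set
    return sida
```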

4 Results

We first analyzed whether there was redundancy among the non-verbal measurements (features), i.e., whether multiple measurements presented the same/similar information. Then, the performance of different regression models in estimating Ipos and Ineg is compared. After that, based on the best-performing model, we investigated how the model performed in different training and testing configurations: training and testing within and across states (no-alcohol and after-alcohol). These comparisons helped us understand how alcohol influenced the estimation, for instance, how well people’s no-alcohol behaviors can estimate their interaction after alcohol consumption. Finally, we evaluated the performance of the proposed SIDA framework.

4.1 Relations among Non-Verbal Behavioral Features

To estimate the level of redundancies among the non-verbal features, Pearson’s correlation coefficients were calculated. Figure 5 shows the results for females in the no-alcohol state (F_NoAlc), males in the no-alcohol state (M_NoAlc), females in the after-alcohol state (F_Alc), and males in the after-alcohol state (M_Alc), respectively.
Fig. 5.
Fig. 5. Pearson’s correlation coefficients among non-verbal measurements. F_NoAlc: females in the no-alcohol state. M_NoAlc: males in the no-alcohol state. F_Alc: females in the after-alcohol state. M_Alc: males in the after-alcohol state. HS-Mag-S: magnitudes of strong head shaking; HS-Mag-M: magnitudes of moderate head shaking; HS-FB-S: following behaviors of strong head shaking; HS-FB-M: following behaviors of moderate head shaking; VA: visual attention; RP: reposition power; LF: leaning forward; LB: leaning back; PER: positive emotion ratio; NER: negative emotion ratio.
For all four groups, non-verbal behavioral features were minimally correlated. The only significant correlation found was a moderate-to-strong negative correlation between PER and NER (\(r=-0.62\) to \(-0.73\), \(p \lt 0.01\)). If a person showed more positive emotions, she/he tended to show fewer negative emotions within a limited session time, and vice versa. However, since each person might show a different number of neutral emotions, PER \({+}\) NER \({\neq}\) 1, and thus we cannot simply derive one number from the other. Therefore, both PER and NER were kept along with all other measurements for regression.

4.2 Regression Performance

Each non-verbal behavioral feature was normalized to [0, 1] using the min-max normalization and thus contributed equally to the regression. Due to the limited sample size, a repeated (n = 3) five-fold cross-validation was applied. Root mean square error (RMSE) was used as the outcome measurement.
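The sketch below shows one way to implement this evaluation protocol with scikit-learn; here the min-max scaling is fit inside each training fold, and the regressor argument stands in for whichever model is being evaluated.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RepeatedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

def repeated_cv_rmse(X, y, model=None, n_splits=5, n_repeats=3, seed=0):
    """Repeated (n = 3) five-fold cross-validation with min-max normalization and RMSE."""
    model = model or LinearRegression()
    rmses = []
    splitter = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=seed)
    for train_idx, test_idx in splitter.split(X):
        pipe = make_pipeline(MinMaxScaler(), model)   # scale features to [0, 1] per fold
        pipe.fit(X[train_idx], y[train_idx])
        rmse = np.sqrt(mean_squared_error(y[test_idx], pipe.predict(X[test_idx])))
        rmses.append(rmse)
    return float(np.mean(rmses)), float(np.std(rmses))
```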

4.2.1 Estimation Errors in the No-Alcohol State and the After-Alcohol State.

Figure 6(a) and (b) shows the RMSE of each algorithm in the no-alcohol and after-alcohol categories, respectively. NN achieved the lowest RMSEs (except F_NoAlc_Pos in the no-alcohol state), while all others performed similarly. To simplify the presentation, LR was selected as a comparison baseline to demonstrate NN’s advantages. As shown in Figure 7, the paired two-sided t-test showed that NN’s RMSEs were significantly lower (p \({ \lt }\) 0.05) than LR’s in all comparisons except the positive intensity of females (F_NoAlc_Pos) and males (M_NoAlc_Pos) in the no-alcohol state.
Fig. 6.
Fig. 6. The performances of regression models for the no-alcohol state (a) and the after-alcohol state (b). F: female; M: male; NoAlc: no-alcohol state; Alc: after-alcohol state; Pos: Ipos; Neg: Ineg. Example: F_NoAlc_Pos is the females’ positive intensity of the interaction sessions in the no-alcohol state. LR: Linear Regression; RR: Ridge Regression; LaR: Lasso Regression; ENR: ElasticNet Regression; BRR: Bayesian Ridge Regression; SVMl: Support Vector Machines with linear kernel; SVMr: Support Vector Machines with RBF kernel; SVMp: Support Vector Machines with poly kernel; NN: Neural Network. Grey bar: 50% standard deviation. Horizontal dashed lines serve as visual references for comparing root mean square error values.
Fig. 7.
Fig. 7. Statistical comparisons between LR and NN on the no-alcohol state (a) and the after-alcohol state (b). F: female; M: male; NoAlc: no-alcohol state; Alc: after-alcohol state; Pos: Ipos; Neg: Ineg. Example: M_Alc_Neg is the males’ negative interaction intensity in the after-alcohol state. Green \(*\): significance between Ipos and Ineg. Orange \(*\): significance between LR and NN. \(P\) value is also marked.
In the no-alcohol state, the RMSEs of Ipos were significantly lower than those of Ineg for both females (LR: p \({ \lt }\) 0.01; NN: p \({ \lt }\) 0.01) and males (LR: p \({ \lt }\) 0.01; NN: p \({ \lt }\) 0.01), showing that the association between the measurements and Ipos was less complicated. This pattern was not shown for NN in the after-alcohol state. These indicate that alcohol intake could have changed how non-verbal behaviors were associated with positive and negative interaction intensities. Therefore, in the next section, we conducted a comparison between the no-alcohol state and the after-alcohol state in terms of interaction intensities.

4.2.2 Comparison between the No-Alcohol State and the After-Alcohol State.

Figure 8 shows the comparisons between the no-alcohol state and the after-alcohol state. For both NN and LR, RMSEs in the no-alcohol state were mostly significantly lower than those in the after-alcohol state. Therefore, in general, the results indicate alcohol intake complicates the association between non-verbal behavioral features and interaction intensities. In other words, the data distribution of dependent variables (Ipos and Ineg) and/or independent variables (non-verbal behavioral features) might have changed, and the connections between the dependent and independent variables might have become more complex after alcohol intake. While explaining the biological and psychological causes and mechanisms underlying this phenomenon is beyond the scope of the current work, we can quantify the impact of alcohol through the change of the estimation error on different training and test configurations.
Fig. 8.
Fig. 8. Statistical comparisons between the no-alcohol state and after-alcohol state. F: female; M: male; Pos: Ipos; Neg: Ineg. \({\ast}\): significant difference. P-value is also marked.

4.2.3 Evaluating the Impact of Alcohol in Terms of Estimation Error.

The estimation models have the potential to be applied to real-life solutions, such as helping alcohol consumers self-monitor how alcohol impacts their interactions. Developing a robust and mature model requires a much larger dataset than the CCD. This reflects data scarcity, one of the major challenges of using machine learning for human behavior analysis. While collecting data in a general and normal state (i.e., when people are not impacted by alcohol) to train models may be realistic, it is much harder to collect data in a risky state (i.e., right after consuming alcohol).
To overcome the data scarcity, within the scope of the current study, we first explored training a regression model in the no-alcohol state and applying the trained model in the after-alcohol state. This may decrease the estimation accuracy due to data distribution disparities between the training and test samples; however, if the extra error introduced is acceptable in practice, this method is worthwhile, as it eliminates the need for difficult and time-consuming data collection in alcohol consumption states.
Therefore, we quantified how much extra error would be introduced when using models trained by no-alcohol data to estimate the after-alcohol state. This investigation was done through the comparison among three different training and testing configurations: Configuration 1: both training and testing on no-alcohol data (reference). Configuration 2: both training and testing on after-alcohol data (ideal configuration). Configuration 3: training on no-alcohol data and testing on after-alcohol data (the alternative).
We anticipated that the estimation errors in configurations 1 and 2 would be lower than that in configuration 3, due to the change in the association between non-verbal behavioral cues and the intensities. Figure 9 shows the comparison results. Consistent with the earlier results, NN performed better than LR (lower RMSEs) in general. In all cases for LR and NN, RMSEs were lowest in configuration 1. In two out of the four cases for LR (F_Neg_LR, M_Neg_LR) and all cases for NN, RMSEs were highest in configuration 3. The two exceptions in LR were quite close to the highest. For LR, the average RMSE across both genders and interaction intensities was 0.105 in configuration 1 and 0.153 in configuration 3. For NN, the average RMSE across both genders and interaction intensities was 0.095 in configuration 1 and 0.147 in configuration 3.
Fig. 9.
Fig. 9. Estimation performance in different training and test configurations. Configuration 1: training and testing on no-alcohol data. Configuration 2: training and testing on after-alcohol data. Configuration 3: training on no-alcohol data and testing on after-alcohol data. F: female; M: male; Pos: Ipos; Neg: Ineg. Green \({\ast}\): significance between configuration 1 and configuration 3. Orange \({\ast}\): significance between configuration 2 and configuration 3. P-value is also marked.
All RMSEs in Configuration 2 were higher than those in Configuration 1. In general, this observation was aligned with the result in Section 4.2.2 that alcohol intake increased the estimation error. For LR, the RMSEs in configuration 2 were mostly comparable (except M_Pos_LR) to those in configuration 3, while for NN, in three out of four cases (except M_Neg_NN), RMSEs in configuration 2 were significantly lower than those in configuration 3. For LR, the average RMSE across both genders and interaction intensities was 0.145 in configuration 2 and 0.153 in configuration 3, respectively. For NN, the average RMSE across both genders and interaction intensities was 0.123 in configuration 2 and 0.147 in configuration 3, respectively.
So far, there were three major findings from these data analyses: (1) NN outperformed other regression models; (2) Alcohol intake significantly increased estimation errors; (3) The intensity estimation was better achieved by a model trained using data collected in the same state. These laid a foundation for the domain adaptation work described in the next section.

4.3 Domain Adaptation from the No-Alcohol State to the After-Alcohol State

4.3.1 Distribution Change of Interaction Intensities and Non-Verbal Behavioral Features.

Testa et al. [57] showed that the immediate effects of alcohol consumption on couple interaction behaviors appeared more positive than negative. Therefore, we anticipated that the distribution of Ipos and Ineg could have been changed accordingly. The Kolmogorov–Smirnov test was applied to evaluate the distribution change. As shown by the histograms in Figure 10, we found significant changes in Ipos for females (p \({ \lt }\) 0.01) and males (p \({ \lt }\) 0.01), while the changes in Ineg were non-significant.
Fig. 10. Histograms of positive and negative intensities for females and males. F: female; M: male; NoAlc: no-alcohol state; Alc: after-alcohol state; Pos: Ipos; Neg: Ineg. Example: F_NoAlc_Pos is the females’ positive intensity of the interaction sessions in the no-alcohol state. Red rectangles mark the significant distribution changes.
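For readers who wish to reproduce this kind of comparison, a two-sample Kolmogorov–Smirnov test from SciPy suffices; the sketch below uses hypothetical arrays standing in for one group's Ipos values in the two states, not the actual CCD data.

```python
# Two-sample KS test on positive interaction intensity (hypothetical arrays).
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
ipos_no = rng.beta(2, 5, size=60)   # stand-in for Ipos in the no-alcohol state
ipos_alc = rng.beta(3, 4, size=60)  # stand-in for Ipos in the after-alcohol state

stat, p_value = ks_2samp(ipos_no, ipos_alc)
print(f"KS statistic = {stat:.3f}, p = {p_value:.4f}")
if p_value < 0.01:
    print("Significant distribution change (p < 0.01)")
```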
As discussed in Section 3.4, we anticipated that some non-verbal behavioral features would change significantly due to alcohol, while others would remain stable. The Kolmogorov–Smirnov test showed that the magnitude of moderate head shaking (p < 0.01) and visual attention (p < 0.05) changed significantly in females, and the magnitude of strong head shaking (p < 0.01) changed significantly in males. The distribution changes of the other features were not significant. Figure 11 illustrates these significant distribution differences, which might be the primary sources of the significant Ipos distribution changes and were therefore used as the inputs to the Sta-Ind model in Figure 4.
Fig. 11. Histograms of the moderate head shaking’s magnitude (a) and attention (b) of females and the strong head shaking’s magnitude of males (c). No_Alc: no-alcohol state; Alc: after-alcohol state.
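The screening of state-sensitive features described above can be sketched as a per-feature Kolmogorov–Smirnov test, keeping only the features whose distributions shift significantly as candidate Sta-Ind inputs; the feature names, data, and 0.05 threshold below are illustrative assumptions rather than the exact selection procedure used in this work.

```python
# Screen non-verbal features for significant distribution changes between states
# (illustrative sketch; feature names and data are hypothetical placeholders).
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(2)
feature_names = ["head_shake_moderate_mag", "head_shake_strong_mag", "visual_attention"]
features_no = {name: rng.normal(size=60) for name in feature_names}
features_alc = {name: rng.normal(loc=0.3, size=60) for name in feature_names}

# Features whose distributions shift significantly become candidate Sta-Ind inputs.
sta_ind_candidates = [name for name in feature_names
                      if ks_2samp(features_no[name], features_alc[name]).pvalue < 0.05]
print("Candidate Sta-Ind features:", sta_ind_candidates)
```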

4.3.2 Performances of Domain Adaptation.

Since Ipos changed significantly after alcohol consumption, this section evaluates the proposed SIDA framework on Ipos. To validate the effectiveness and robustness of the SIDA framework, we compared its performance with a baseline, i.e., the NN that achieved the lowest RMSEs in Section 4.2.1. In domain adaptation, for both males and females, the SIDA framework and the NN were trained using all available no-alcohol data and then updated with a portion of the after-alcohol data. The remaining after-alcohol data were used to test the estimation performance. To understand how the amount of after-alcohol training data impacts the performance of the NN and the SIDA framework, the after-alcohol dataset was split in four ways: a 0:10 ratio (0% available for SIDA Phase II training, so SIDA did not apply; 100% for testing), a 2:8 ratio (20% for Phase II training and 80% for testing), a 5:5 ratio (50% for training and 50% for testing), and an 8:2 ratio (80% for training and 20% for testing). For each ratio, the after-alcohol dataset was split randomly 15 times for statistical analysis.
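This split-and-update protocol can be sketched as follows: a regressor is first trained on all no-alcohol data, then updated with a varying fraction of after-alcohol data and tested on the remainder, over repeated random splits. The sketch below illustrates only this evaluation loop with a standard MLP regressor; it does not reproduce the SIDA framework's Sta-Ind/Gen-Beh fusion, and all data arrays are hypothetical placeholders.

```python
# Sketch of the split-and-update evaluation protocol (placeholder data; this is
# only the baseline fine-tuning loop, not the full SIDA architecture).
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)
X_no, y_no = rng.normal(size=(60, 8)), rng.uniform(size=60)    # no-alcohol state
X_alc, y_alc = rng.normal(size=(60, 8)), rng.uniform(size=60)  # after-alcohol state

for train_fraction in (0.2, 0.5, 0.8):           # the 2:8, 5:5, and 8:2 ratio sets
    rmses = []
    for seed in range(15):                       # 15 random splits per ratio
        X_up, X_te, y_up, y_te = train_test_split(
            X_alc, y_alc, train_size=train_fraction, random_state=seed)
        model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000,
                             random_state=seed, warm_start=True)
        model.fit(X_no, y_no)                    # phase I: train on no-alcohol data
        model.fit(X_up, y_up)                    # phase II: update with after-alcohol data
        pred = np.clip(model.predict(X_te), 0.0, 1.0)
        rmses.append(mean_squared_error(y_te, pred) ** 0.5)
    print(f"ratio {train_fraction:.0%}: mean RMSE = {np.mean(rmses):.3f}")
```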
Figure 12 compares the Ipos estimation performance of the NN and the SIDA framework on the different ratio sets. The SIDA framework achieved lower RMSEs than the NN for both females (a) and males (b), and the differences were significant for males. In addition, a decreasing trend was observed for both the NN and the SIDA framework as the ratio increased. This was expected: the more after-alcohol data used to update the model, the better the model could capture the alcohol-induced changes.
Fig. 12. Comparisons of positive interaction intensity estimations between NN and SIDA on different ratio sets for females (a) and males (b). The orange dashed line indicates the descending trend. Grey grid bars indicate that the 0:10 ratio set did not apply to the SIDA framework. *: significance. P-value is also marked. Black bar: 10% standard deviation.

5 Conclusion and Discussion

This study explored the feasibility of estimating interaction intensities using CV-based non-verbal behavioral features in the context of acute alcohol consumption in intimate couples. Results demonstrated that, by fusing multiple cues, common regression models estimated interaction intensities (ranging from 0 to 1) with an average error below 0.16. The NN achieved an average error of around 0.11, which might be sufficient for real-world applications that do not require very high precision, such as tracking daily changes in interaction intensity. Moreover, errors in the after-alcohol state were higher than those in the no-alcohol state, indicating that alcohol consumption may have increased data complexity. Furthermore, as expected, the cross-state estimation results confirmed that models trained on data collected in a particular state performed best when estimating within that same state; using an NN model trained on no-alcohol data to estimate intensities in the after-alcohol state significantly increased the error. Finally, the SIDA framework demonstrated the feasibility of using domain adaptation to improve estimation performance in the after-alcohol state without collecting large datasets, helping relieve data collection burdens in high-risk states by leveraging knowledge learned from low-risk data.
Traditional methods for investigating the impact of alcohol on human behaviors, such as interviews and questionnaires, are time- and labor-intensive. The current work was a step toward machine learning-assisted automatic analysis. Note that the machine learning models do not replicate any existing human annotation protocol; thus, we do not claim the proposed method as a replacement for well-validated human annotation protocols/systems (e.g., RMICS [30]). Instead, our method provides a complementary component that can be used alongside these traditional methods to pursue a more comprehensive understanding of the acute behavioral effects of alcohol consumption.
The proposed method leverages state-of-the-art CV technologies to help eliminate the need for time-consuming manual data annotation and analysis, which is expensive and limited by the availability of qualified video coders. The proposed method thus helps resolve a barrier that has limited the use of experimental procedures to understand the relationship between acute alcohol consumption and behavior change. CV-based methods can extract human behaviors at a much higher resolution and precision than traditional human annotation, offering a perspective that may not be captured by human observation.
In summary, this study has four major contributions: (1) the design and quantification of new CV-based non-verbal behavioral features; (2) the exploration of machine learning-based regression models to reveal the association between non-verbal behavioral features and interaction positivity and negativity; (3) the quantification of how alcohol consumption impacts regression performance; (4) the SIDA framework to enable improved estimation in the after-alcohol state using limited new training data.
Meanwhile, the current study has limitations that should be noted along with corresponding future work. First, as a starting point to demonstrate the feasibility of interaction intensity estimation using regression, this work tested mainly classic regression models. In the future, more advanced regression and deep learning models can be designed to target particular characteristics of alcohol consumption data, such as non-normal feature distributions and automatic feature selection. Second, advanced CV technologies have the potential to detect more complicated, higher-level interaction cues, such as body language, beyond the basic measurements proposed here. These may serve as more deterministic markers of positive and negative interaction intensities and thus offer better estimation accuracy. Third, the proposed SIDA framework presented the idea of fusing Sta-Ind features with Gen-Beh features that are independent of the state. This article shows a basic implementation of the idea, but the SIDA framework is not bound to this particular implementation or to the alcohol consumption context; it is therefore worthwhile to explore improvements and extended applications of SIDA in the future. Finally, all interpretations of this study are based on the CCD dataset. While it is one of the largest video datasets for studying the effect of alcohol on intimate couples, it has limitations, such as low video resolution and angled camera positioning that caused minor face occlusions. In the future, estimation models should be trained and validated on higher-quality data samples when available. Results also showed that the estimation was more accurate when participants were sober than after drinking, which points to the importance of investigating the underlying biological and psychological mechanisms; such investigation may guide more effective designs of behavioral measurements and estimation algorithms.
Nevertheless, to the best of our knowledge, this study was one of the first works to leverage machine learning models to estimate positive and negative interaction intensities using CV-based non-verbal behavioral cues, and it provides technical and practical references for future automatic analysis of alcohol-related behavior. Relationship discord and conflict exacerbate behavioral and physical health conditions, particularly among those already at high risk due to problematic alcohol use [16]. Therefore, CV- and machine learning-aided behavior analysis, piloted by the current research, points toward future ways to inform individualized treatment recommendations and reduce client burden in first-line cognitive-behavioral interventions that target abstinent or alcohol-involved individuals or couples to mitigate the effects of conflict on health [17].

References

[1]
Nidhi Agarwal, Akanksha Sondhi, Khyati Chopra, and Ghanapriya Singh. 2021. Transfer learning: Survey and classification. In Proceedings of the International Conference on Smart Innovations in Communication and Computational Sciences: Proceedings (ICSICCS ’20), 145–155.
[2]
Md Zaved Iqubal Ahmed, Nidul Sinha, Souvik Phadikar, and Ebrahim Ghaderpour. 2022. Automated feature extraction on AsMap for emotion classification using EEG. Sensors 22, 6 (2022), Article 2346.
[3]
Sungtae An, Alessio Medda, Michael N. Sawka, Clayton J. Hutto, Mindy L. Millard-Stafford, Scott Appling, Kristine L. S. Richardson, and Omer T. Inan. 2021. AdaptNet: Human activity recognition via bilateral domain adaptation using semi-supervised deep translation networks. IEEE Sensors Journal 21, 18 (2021), 20398–20411.
[4]
Octavio Arriaga, Matias Valdenegro-Toro, and Paul Plöger. 2017. Real-time convolutional neural networks for emotion and gender classification. arXiv:1710.07557. Retrieved from https://arxiv.org/abs/1710.07557
[5]
Rúbia E. O. Schultz Ascari, Roberto Pereira, and Luciano Silva. 2020. Computer vision-based methodology to improve interaction for people with motor and speech impairment. ACM Transactions on Accessible Computing 13, 4 (2020), 1–33.
[6]
Julia Bachmann, Adam Zabicki, Jörn Munzert, and Britta Krüger. 2020. Emotional expressivity of the observer mediates recognition of affective states from human body movements. Cognition and Emotion 34, 7 (2020), 1370–1381.
[7]
Tadas Baltrusaitis, Amir Zadeh, Yao Chong Lim, and Louis-Philippe Morency. 2018. OpenFace 2.0: Facial behavior analysis toolkit. In Proceedings of the 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG '18). IEEE, 59–66.
[8]
Lisa Feldman Barrett, Ralph Adolphs, Stacy Marsella, Aleix M. Martinez, and Seth D. Pollak. 2019. Emotional expressions reconsidered: Challenges to inferring emotion from human facial movements. Psychological Science in the Public Interest 20, 1 (2019), 1–68.
[9]
Billur Barshan and Murat Cihan Yüksek. 2014. Recognizing daily and sports activities in two open source machine learning environments using body-worn sensor units. The Computer Journal 57, 11 (2014), 1649–1667.
[10]
Djamila Romaissa Beddiar, Brahim Nini, Mohammad Sabokrou, and Abdenour Hadid. 2020. Vision-based human activity recognition: A survey. Multimedia Tools and Applications 79, 41 (2020), 30509–30555.
[11]
Danielle Blanch-Hartigan, Mollie A. Ruben, Judith A. Hall, and Marianne Schmid Mast. 2018. Measuring nonverbal behavior in clinical interactions: A pragmatic guide. Patient Education and Counseling 101, 12 (2018), 2209–2218.
[12]
Simon M Breil, Sarah Osterholz, Steffen Nestler, and Mitja D. Back. 2021. Contributions of nonverbal cues to the accurate judgment of personality traits. In The Oxford Handbook of Accurate Personality Judgment. T. D. Letzring and J. S. Spain (Eds.), Oxford University Press, 195–218.
[13]
Fabian Bross. 2020. Why do we shake our heads?: On the origin of the headshake. Gesture 19, 2–3 (2020), 269–298.
[14]
Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. 2017. Realtime multi-person 2D pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7291–7299.
[15]
Dana R. Carney. 2020. The nonverbal expression of power, status, and dominance. Current Opinion in Psychology 33 (2020), 256–264.
[16]
Ann L. Coker, Paige H. Smith, Lesa Bethea, Melissa R. King, and Robert E. McKeown. 2000. Physical health consequences of physical and psychological intimate partner violence. Archives of Family Medicine 9, 5 (2000), 451.
[17]
Cory A. Crane and Caroline J. Easton. 2017. Integrated treatment options for male perpetrators of intimate partner violence. Drug and Alcohol Review 36, 1 (2017), 24–33.
[18]
Cory A. Crane, Stephanie A. Godleski, Sarahmona M. Przybyla, Robert C. Schlauch, and Maria Testa. 2016. The proximal effects of acute alcohol consumption on male-to-female aggression: A meta-analytic review of the experimental literature. Trauma, Violence, & Abuse 17, 5 (2016), 520–531.
[19]
L Minh Dang, Kyungbok Min, Hanxiang Wang, Md Jalil Piran, Cheol Hee Lee, and Hyeonjoon Moon. 2020. Sensor-based and vision-based human activity recognition: A comprehensive survey. Pattern Recognition 108 (2020), 107561.
[20]
Ziedune Degutyte and Arlene Astell. 2021. The role of eye gaze in regulating turn taking in conversations: a systematized review of methods and findings. Frontiers in Psychology 12 (2021), Article 616471.
[21]
Pegah Dehghan, Hany Alashwal, and Ahmed A. Moustafa. 2022. Applications of machine learning to behavioral sciences: focus on categorical data. Discover Psychology 2, 1 (2022), 1–10.
[22]
Muhterem Dindar, Sanna Järvelä, Sara Ahola, Xiaohua Huang, and Guoying Zhao. 2020. Leaders and followers identified by emotional mimicry during collaborative learning: A facial expression recognition study on emotional valence. IEEE Transactions on Affective Computing 13, 3 (2020), 1390–1400.
[23]
Hanna Drimalla, Irina Baskow, Behnoush Behnia, Stefan Roepke, and Isabel Dziobek. 2021. Imitation and recognition of facial emotions in autism: A computer vision approach. Molecular Autism 12 (2021), 1–15.
[24]
Catharine E. Fairbairn and Maria Testa. 2017. Relationship quality and alcohol-related social reinforcement during couples interaction. Clinical Psychological Science 5, 1 (2017), 74–84.
[25]
Syeda Narjis Fatima and Engin Erzin. 2021. Use of affect context in dyadic interactions for continuous emotion recognition. Speech Communication 132 (2021), 70–82.
[26]
David Feil-Seifer and Maja J. Matarić. 2012. Distance-based computational models for facilitating robot interaction with children. Journal of Human-Robot Interaction 1, 1 (2012), 55–77.
[27]
Bárbara Figueiredo, Catarina Canário, Iva Tendais, Tiago Miguel Pinto, David A. Kenny, and Tiffany Field. 2018. Couples’ relationship affects mothers’ and fathers’ anxiety and depression trajectories over the transition to parenthood. The Journal of Affective Disorders 238 (2018), 204–212.
[28]
Barbara Figueiredo, Tiffany Field, Miguel Diego, Maria Hernandez-Reif, Osvelia Deeds, and Angela Ascencio. 2008. Partner relationships during the transition to parenthood. The Journal of Reproductive and Infant Psychology 26, 2 (2008), 99–107.
[29]
Mark G. Frank and Anne Solbu. 2020. Nonverbal communication: Evolution and today. In Social Intelligence and Nonverbal Communication. R. J. Sternberg and A. Kostić (Eds.), Palgrave Macmillan/Springer Nature, 119–162.
[30]
Richard E. Heyman. 2004. Rapid marital interaction coding system (RMICS). In Couple Observational Coding Systems. Routledge, 81–108.
[31]
Richard E. Heyman, Ashley N. Hunt-Martorano, Jill Malik, and Amy M. Smith Slep. 2009. Desired change in couples: Gender differences and effects on communication. Journal of Family Psychology 23, 4 (2009), 474.
[32]
Sumit Jha and Carlos Busso. 2020. Head pose as an indicator of drivers’ visual attention. In Vehicles, Drivers, and Safety. Huseyin Abut, Kazuya Takeda, Gerhard Schmidt, and John Hansen (Eds.), De Gruyter, Berlin, Boston, 113–132.
[33]
Yingying Jiang, Wei Li, M Shamim Hossain, Min Chen, Abdulhameed Alelaiwi, and Muneer Al-Hammadi. 2020. A snapshot research and implementation of multimodal information fusion for data-driven emotion recognition. Information Fusion 53 (2020), 209–221.
[34]
Gerben A van Kleef and Stéphane Côté. 2022. The social effects of emotions. Annual Review of Psychology 73 (2022), 629–658.
[35]
Marie-Louise J. Kullberg, Renate S. M. Buisman, Charlotte C. van Schie, Katharina Pittner, Marieke Tollenaar, Lisa J. M. van den Berg, Lenneke R. A. Alink, Marian J. Bakermans-Kranenburg, and Bernet M. Elzinga. 2023. Linking internalizing and externalizing problems to warmth and negativity in observed dyadic parent–offspring communication. Family Relations 72, 5 (2023), 2777–2799.
[36]
Sébastien Lallé, Rohit Murali, Cristina Conati, and Roger Azevedo. 2021. Predicting co-occurring emotions from eye-tracking and interaction data in MetaTutor. In Proceedings of the 22nd International Conference on Artificial Intelligence in Education (AIED ’21). Springer, 241–254.
[37]
Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. 2017. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2117–2125.
[38]
Hui Luo and Jiqing Han. 2020. Nonnegative matrix factorization based transfer subspace learning for cross-corpus speech emotion recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28 (2020), 2047–2060.
[39]
Kate T. McKay, Sarah A. Grainger, Sarah P. Coundouris, Daniel P. Skorich, Louise H. Phillips, and Julie D. Henry. 2021. Visual attentional orienting by eye gaze: A meta-analytic review of the gaze-cueing effect. Psychological Bulletin 147, 12 (2021), 1269.
[40]
Guangtao Nie, Akshith Ullal, Zhi Zheng, Amy R Swanson, Amy S. Weitlauf, Zachary E. Warren, and Nilanjan Sarkar. 2021. An immersive computer-mediated caregiver-child interaction system for young children with autism spectrum disorder. IEEE Transactions on Neural Systems and Rehabilitation Engineering 29 (2021), 884–893.
[41]
Sergio Oller-Moreno, Antonio Pardo, Juan Manuel Jiménez-Soto, Josep Samitier, and Santiago Marco. 2014. Adaptive asymmetric least squares baseline estimation for analytical instruments. In Proceedings of the 2014 IEEE 11th International Multi-Conference on Systems, Signals & Devices (SSD14). IEEE, 1–5.
[42]
Massimiliano Patacchiola and Angelo Cangelosi. 2017. Head pose estimation in the wild using convolutional neural networks and adaptive gradient methods. Pattern Recognition 71 (2017), 132–143.
[43]
Raviv Pryluk, Yosef Shohat, Anna Morozov, Dafna Friedman, Aryeh H. Taub, and Rony Paz. 2020. Shared yet dissociable neural codes across eye gaze, valence and expectation. Nature 586, 7827 (2020), 95–100.
[44]
Xin Qin, Yiqiang Chen, Jindong Wang, and Chaohui Yu. 2019. Cross-dataset activity recognition via adaptive spatial-temporal transfer learning. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 3, 4 (2019), 1–25.
[45]
Ronald E. Riggio and Heidi R. Riggio. 2012. Face and body in motion: Nonverbal communication. In Encyclopedia of Body Image and Human Appearance. Elsevier, 425–430.
[46]
Theodore F. Robles, Richard B. Slatcher, Joseph M. Trombello, and Meghan M. McGinn. 2014. Marital quality and health: A meta-analytic review. Psychological Bulletin 140, 1 (2014), 140.
[47]
Lindsey M. Rodriguez, Clayton Neighbors, and C. Raymond Knee. 2014. Problematic alcohol use and marital distress: An interdependence theory perspective. Addiction Research & Theory 22, 4 (2014), 294–312.
[48]
Anas Samara, Leo Galway, Raymond Bond, and Hui Wang. 2019. Affective state detection via facial expression analysis within a human–computer interaction context. Journal of Ambient Intelligence and Humanized Computing 10 (2019), 2175–2184.
[49]
Jennifer A. Samp and Jennifer L. Monahan. 2009. Alcohol-influenced nonverbal behaviors during discussions about a relationship problem. Journal of Nonverbal Behavior 33, 3 (2009), 193–211.
[50]
Gurmeet Singh Sarla. 2021. Non-verbal Communication: Be kind with what you wordlessly say. Practique Clinique et Investigation 4, 1 (2021), 8–11.
[51]
Qosimova Sarvinoz. 2022. Emotional understanding of individuals the role of emotions. ResearchJet Journal of Analysis and Inventions 3, 1 (2022), 12–18.
[52]
Andrea Scarantino, Shlomo Hareli, and Ursula Hess. 2021. Emotional expressions as appeals to recipients. Emotion 22, 8 (2021), 1856–1868.
[53]
Katharina Schultebraucks, Vijay Yadav, Arieh Y. Shalev, George A. Bonanno, and Isaac R. Galatzer-Levy. 2022. Deep learning-based classification of posttraumatic stress disorder and depression following trauma utilizing visual and auditory markers of arousal and mood. Psychological Medicine 52, 5 (2022), 957–967.
[54]
Zhihao Shen, Armagan Elibol, and Nak Young Chong. 2020. Understanding nonverbal communication cues of human personality traits in human-robot interaction. IEEE/CAA Journal of Automatica Sinica 7, 6 (2020), 1465–1477.
[55]
Francisco H. S. Silva, Aldisio G. Medeiros, Elene F. Ohata, and Pedro Pedrosa Reboucas Filho. 2020. Classification of electroencephalogram signals for detecting predisposition to alcoholism using computer vision and transfer learning. In Proceedings of the 2020 IEEE 33rd International Symposium on Computer-Based Medical Systems (CBMS). IEEE, 126–131.
[56]
Bin Sun, Dehui Kong, Shaofan Wang, Lichun Wang, and Baocai Yin. 2021. Joint transferable dictionary learning and view adaptation for multi-view human action recognition. ACM Transactions on Knowledge Discovery from Data 15, 2 (2021), 1–23.
[57]
Maria Testa, Cory A. Crane, Brian M. Quigley, Ash Levitt, and Kenneth E. Leonard. 2014. Effects of administered alcohol on intimate partner interactions in a conflict resolution paradigm. Journal of Studies on Alcohol and Drugs 75, 2 (2014), 249–258.
[58]
Ashwin TS and Ram Mohana Reddy Guddeti. 2020. Automatic detection of students’ affective states in classroom environment using hybrid convolutional neural networks. Education and Information Technologies 25, 2 (2020), 1387–1415.
[59]
Nguyen Tan Viet Tuyen and Oya Celiktutan. 2022. Context-aware human behaviour forecasting in dyadic interactions. In Proceedings of the Understanding Social Behavior in Dyadic and Small Group Interactions. PMLR, 88–106.
[60]
Alexandria K. Vail, Tadas Baltrušaitis, Luciana Pennant, Elizabeth Liebson, Justin Baker, and Louis-Philippe Morency. 2017. Visual attention in schizophrenia: Eye contact and gaze aversion during clinical interactions. In Proceedings of the 7th International Conference on Affective Computing and Intelligent Interaction (ACII). IEEE, 490–497.
[61]
Shui-Hua Wang, Shipeng Xie, Xianqing Chen, David S. Guttery, Chaosheng Tang, Junding Sun, and Yu-Dong Zhang. 2019. Alcoholism identification based on an AlexNet transfer learning model. Frontiers in Psychiatry 10 (2019), 205.
[62]
Tiancai Wang, Tong Yang, Martin Danelljan, Fahad Shahbaz Khan, Xiangyu Zhang, and Jian Sun. 2020. Learning human-object interaction detection using interaction points. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4116–4125.
[63]
Yan Wang, Rui Huang, and Lei Guo. 2019. Eye gaze pattern analysis for fatigue detection based on GP-BCNN with ESM. Pattern Recognition Letters 123 (2019), 61–74.
[64]
Elisa Weber and Gizem Hülür. 2022. The role of relationship conflict for momentary loneliness and affect in the daily lives of older couples. The Journal of Social and Personal Relationships 40, 7 (2022), 2033–2060.
[65]
Xu Yan, Li-Ming Zhao, and Bao-Liang Lu. 2021. Simplifying multimodal emotion recognition with single eye movement modality. In Proceedings of the 29th ACM International Conference on Multimedia, 1057–1063.
[66]
Zhiwei Yu, Cory A. Crane, Maria Testa, and Zhi Zheng. 2021. How moderate alcohol consumption impacts married or cohabiting couples in expressing disagreements: An automatic computation model and analysis. In Proceedings of the 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC). IEEE, 1902–1905.
[67]
Frederic Z. Zhang, Dylan Campbell, and Stephen Gould. 2022. Efficient two-stage detection of human-object interactions with a novel unary-pairwise transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 20104–20112.
[68]
Hongyi Zhang, Francisco H. S. Silva, Elene F. Ohata, Aldisio G. Medeiros, and Pedro P. Rebouças Filho. 2020. Bi-dimensional approach based on transfer learning for alcoholism pre-disposition classification via EEG signals. Frontiers in Human Neuroscience 14 (2020), 365.
[69]
Mi Zhang and Alexander A. Sawchuk. 2012. USC-HAD: A daily activity dataset for ubiquitous activity recognition using wearable sensors. In Proceedings of the 2012 ACM Conference on Ubiquitous Computing, 1036–1043.
