Abstract
Facial expression-based Emotion Recognition (FER) is crucial in human–computer interaction and affective computing, particularly when addressing diverse age groups. This paper introduces the Multi-Scale Vision Transformer with Contrastive Learning (MViT-CnG), an age-adaptive FER approach designed to enhance the accuracy and interpretability of emotion recognition models across different age groups. The MViT-CnG model leverages vision transformers and contrastive learning to capture intricate facial features, ensuring robust performance on diverse and dynamic faces. Contrastive learning also significantly enhances the model's interpretability, which is vital for building trust in automated systems and facilitating human–machine collaboration, and it enriches the model's capacity to discern shared and distinct features within facial expressions, improving its ability to generalize across different age groups. Evaluations on the FER-2013 and CK+ datasets highlight the model's broad generalization capabilities: FER-2013 covers a wide range of emotions across diverse age groups, while CK+ focuses on posed expressions in controlled environments. The MViT-CnG model adapts effectively to both datasets, showcasing its versatility and reliability across distinct data characteristics. Performance results demonstrated that the MViT-CnG model achieved superior accuracy across all emotion recognition labels, with a 99.6% accuracy rate on the FER-2013 dataset and 99.5% on the CK+ dataset, indicating significant improvements in recognizing subtle facial expressions. Comprehensive evaluations revealed that the model's precision, recall, and F1-score are consistently higher than those of existing models, confirming its robustness and reliability in facial emotion recognition tasks.
Data availability
Enquiries about data availability should be directed to the authors.
Funding
There is no funding for this study.
Author information
Authors and Affiliations
Contributions
All the authors have participated in writing the manuscript and have revised the final version. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
This article does not contain any studies with human participants and/or animals performed by any of the authors.
Informed consent
Informed consent was not required for this study.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix
Pre-processing and augmentation
Noisy images can make emotion recognition more challenging. In addition, people express emotions differently, and an individual's facial expressions can vary with cultural, personal, and physiological factors. This subject variability makes it harder to train models that generalize well across individuals. To mitigate these problems, the following pre-processing and augmentation steps are applied.
To ensure uniformity, all images are resized to a consistent resolution, typically square dimensions of 48 \(\times\) 48 pixels. Each pixel in the resized image is mapped to a position in the original image, and bicubic interpolation is performed to calculate the pixel value at the resized image coordinates \(\left( {X^{\prime} ,Y^{\prime} } \right)\) from the surrounding pixel values in the original image. Bicubic interpolation uses a 4 \(\times\) 4 pixel neighbourhood and a bicubic kernel function. The interpolated pixel value \(K^{\prime} \left( {X^{\prime} ,Y^{\prime} } \right)\) is calculated as follows:

$$K^{\prime} \left( {X^{\prime} ,Y^{\prime} } \right) = \sum\limits_{m = - 1}^{2} {\sum\limits_{n = - 1}^{2} {K\left( {X_{m} + m,\,Y_{m} + n} \right)g\left( m \right)g\left( n \right)} }$$
where \(K^{\prime} \left( {X^{\prime} ,Y^{\prime} } \right)\) is the pixel value in the resized image at coordinates \(\left( {X^{\prime} ,Y^{\prime} } \right)\), \(K\left( {X_{m} + m,\,Y_{m} + n} \right)\) are the pixel values in the original image at the surrounding coordinates within the 4 \(\times\) 4 neighbourhood, and \(g\left( m \right)\) and \(g\left( n \right)\) are the bicubic kernel weights, typically defined using a cubic spline function. This resizing method ensures that the original image is proportionally scaled to the desired dimensions, and bicubic interpolation provides a smoother and more accurate result. It also ensures uniformity and reduces the computational complexity of the training process.
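As a concrete illustration of this step, the following minimal Python sketch resizes an image to 48 \(\times\) 48 pixels with bicubic interpolation via OpenCV; the library call and file path are illustrative assumptions rather than the paper's implementation.

```python
import cv2  # OpenCV: pip install opencv-python

def resize_bicubic(image, size=(48, 48)):
    """Resize an image to a fixed resolution using bicubic interpolation.

    cv2.INTER_CUBIC samples a 4x4 pixel neighbourhood around each mapped
    coordinate (X', Y') and weights it with a cubic kernel, matching the
    K'(X', Y') formulation above.
    """
    return cv2.resize(image, size, interpolation=cv2.INTER_CUBIC)

# Example usage (path is hypothetical):
# img = cv2.imread("face.png", cv2.IMREAD_GRAYSCALE)
# img48 = resize_bicubic(img)
```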
Normalization is a crucial pre-processing step in FER to ensure that the input data has consistent scales and distributions. Commonly, normalization involves scaling pixel values to a standard range, such as [0, 1], and subtracting the mean to centre the data. Pixel values are first scaled to the range [0, 1] with min–max scaling so that all values fall within a consistent range, and the mean pixel value is then subtracted to centre the data around zero. Combining the scaling and zero-centring into a single normalization step can be expressed as:

$$K_{norm} \left( {X,Y} \right) = \frac{{K\left( {X,Y} \right) - \mu }}{\sigma }$$
where \(\mu\) is the mean pixel value of the entire image and \(\sigma\) is the standard deviation of the pixel values in the image.
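A minimal NumPy sketch of this combined normalization, assuming 8-bit grayscale input (so min–max scaling reduces to division by 255); the function name and the small epsilon guard are illustrative assumptions:

```python
import numpy as np

def normalize(image):
    """Scale pixel values to [0, 1], then zero-centre and divide by the
    per-image standard deviation, as in the formula above."""
    scaled = image.astype(np.float32) / 255.0  # min-max scaling for 8-bit input
    mu = scaled.mean()                         # mean pixel value of the image
    sigma = scaled.std() + 1e-8                # std, guarded against division by zero
    return (scaled - mu) / sigma
```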
Augmentation helps the model generalize better and become more robust to variations in lighting, pose, and other factors. Here, we use the following augmentation operations, with ranges of values suitable for FER (a short code sketch follows the list):
- Rotation: Rotate the image by a random angle in the range −15° to +15° to simulate variations in head pose.
- Horizontal flip: Flip the image horizontally with a 50% chance (a Boolean choice: either flip or not). This creates a mirror image and helps the model learn from both left- and right-facing faces.
- Brightness and contrast adjustment: Randomly adjust the brightness and contrast in the range −20% to +20% to simulate variations in lighting conditions.
- Translation: Shift the image horizontally and/or vertically by a random amount in the range −10% to +10% of the image width/height to simulate variations in the position of the face within the image.
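The sketch below expresses these four operations with torchvision transforms; the value ranges mirror the list above, while the pipeline composition itself is an illustrative assumption, not the paper's code.

```python
from torchvision import transforms

# Augmentation pipeline mirroring the ranges listed above.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),                 # rotate within [-15°, +15°]
    transforms.RandomHorizontalFlip(p=0.5),                # flip with 50% chance
    transforms.ColorJitter(brightness=0.2, contrast=0.2),  # ±20% brightness/contrast
    transforms.RandomAffine(degrees=0,
                            translate=(0.1, 0.1)),         # shift up to ±10% of width/height
])

# Example usage on a PIL image (variable name hypothetical):
# augmented = augment(pil_image)
```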
Facial region detection
One of the key components of SSD is the use of default anchors, also known as default bounding boxes. These default anchors are predefined bounding boxes of various aspect ratios and scales that are placed at different positions within the image.
This step ensures that the extracted faces are aligned to a consistent position. The anchors come in different sizes (scales) and aspect ratios; for example, there might be small square anchors, tall rectangular anchors, and wide rectangular anchors. ImSSD generates a set of default anchors, represented by their width (W) and height (H), at each spatial location on the feature maps of different convolutional layers; these anchors are determined by the chosen aspect ratios and scales. Anchor box shapes in the ImSSD model are illustrated in Fig. 14. The default anchors are placed at various positions across the spatial dimensions of the feature maps generated by the convolutional layers in the ImSSD network. Each anchor is centred at a grid point within the feature map, represented by its spatial location (x, y), and together the anchors cover different regions of the image in a dense grid pattern.
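To make the anchor layout concrete, here is a NumPy sketch that generates default boxes at every grid cell of a feature map for a set of scales and aspect ratios; the particular scales and ratios are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def default_anchors(fmap_size, scales, aspect_ratios):
    """Generate (cx, cy, w, h) default anchors, normalized to [0, 1].

    Each anchor is centred at a grid point (x, y) of the feature map
    and tiled densely across the image, as described above.
    """
    boxes = []
    for y in range(fmap_size):
        for x in range(fmap_size):
            cx = (x + 0.5) / fmap_size  # centre of the grid cell
            cy = (y + 0.5) / fmap_size
            for s in scales:
                for ar in aspect_ratios:
                    w = s * np.sqrt(ar)  # wider boxes for ar > 1
                    h = s / np.sqrt(ar)  # taller boxes for ar < 1
                    boxes.append((cx, cy, w, h))
    return np.array(boxes)

# Illustrative call: a 6x6 feature map with two scales
# and three aspect ratios per location (216 anchors).
anchors = default_anchors(fmap_size=6, scales=(0.2, 0.4),
                          aspect_ratios=(0.5, 1.0, 2.0))
```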
The region score represents the likelihood that an intricate facial feature is present within the anchor's region; it measures how well the anchor aligns with an actual region. Typically, it is calculated with a sigmoid activation function, so its values are confined to the range 0 to 1, and a high region score indicates a high probability of containing a facial feature. The model predicts adjustments (offsets) to the position and size of each anchor to better fit the actual facial region; these offsets allow the anchor to match the region's location and dimensions more accurately. Anchors with high region scores are matched to actual facial regions based on their overlap, while anchors with low region scores or poor matches are discarded. The remaining anchors, along with their corresponding offsets, are used to generate the final detection boxes, which are the model's predictions for the locations and dimensions of facial regions within the image. By using default anchors of different scales and aspect ratios, the improved SSD detects facial regions of various sizes and shapes, including faces with different orientations and aspect ratios. This makes SSD a versatile choice for facial region detection and other object detection tasks, as it can adapt to a wide range of facial region appearances within an image.
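A short sketch of the decoding step just described, under the standard SSD box parameterization (an assumption; the paper does not spell out its exact offset encoding): predicted offsets adjust each anchor's centre and size, and anchors whose sigmoid region score falls below a threshold are discarded.

```python
import numpy as np

def decode_and_filter(anchors, offsets, logits, score_thresh=0.5):
    """Apply predicted offsets to (cx, cy, w, h) anchors and keep
    only boxes whose region score exceeds the threshold."""
    scores = 1.0 / (1.0 + np.exp(-logits))             # sigmoid region scores in (0, 1)
    cx = anchors[:, 0] + offsets[:, 0] * anchors[:, 2]  # shift centre by offset * anchor size
    cy = anchors[:, 1] + offsets[:, 1] * anchors[:, 3]
    w = anchors[:, 2] * np.exp(offsets[:, 2])           # scale width/height exponentially
    h = anchors[:, 3] * np.exp(offsets[:, 3])
    boxes = np.stack([cx, cy, w, h], axis=1)
    keep = scores > score_thresh                        # discard low-scoring anchors
    return boxes[keep], scores[keep]
```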
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Balachandran, G., Ranjith, S., Chenthil, T.R. et al. Facial expression-based emotion recognition across diverse age groups: a multi-scale vision transformer with contrastive learning approach. J Comb Optim 49, 11 (2025). https://doi.org/10.1007/s10878-024-01241-8