
Facial expression-based emotion recognition across diverse age groups: a multi-scale vision transformer with contrastive learning approach

Journal of Combinatorial Optimization

Abstract

Facial expression-based Emotion Recognition (FER) is crucial in human–computer interaction and affective computing, particularly when addressing diverse age groups. This paper introduces the Multi-Scale Vision Transformer with Contrastive Learning (MViT-CnG), an age-adaptive FER approach designed to enhance the accuracy and interpretability of emotion recognition models across different age groups. The MViT-CnG model leverages vision transformers and contrastive learning to capture intricate facial features, ensuring robust performance despite diverse and dynamic facial appearances. Contrastive learning also significantly enhances the model's interpretability, which is vital for building trust in automated systems and facilitating human–machine collaboration, and it enriches the model's capacity to discern shared and distinct features within facial expressions, improving its ability to generalize across age groups. Evaluations on the FER-2013 and CK+ datasets highlight the model's broad generalization capabilities: FER-2013 covers a wide range of emotions across diverse age groups, while CK+ focuses on posed expressions in controlled environments. The MViT-CnG model adapts effectively to both datasets, showcasing its versatility and reliability across distinct data characteristics. Performance results demonstrated that the model achieved superior accuracy across all emotion recognition labels, with a 99.6% accuracy rate on FER-2013 and 99.5% on CK+, indicating significant improvements in recognizing subtle facial expressions. Comprehensive evaluations revealed that the model's precision, recall, and F1-score are consistently higher than those of existing models, confirming its robustness and reliability in facial emotion recognition tasks.


Data availability

Enquiries about data availability should be directed to the authors.


Funding

There is no funding for this study.

Author information

Contributions

All the authors have participated in writing the manuscript and have revised the final version. All authors read and approved the final manuscript.

Corresponding author

Correspondence to G. Balachandran.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants and/or animals performed by any of the authors.

Informed consent

Informed consent was not required for this study.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Pre-processing and augmentation

Noisy images make emotion recognition more challenging. In addition, people express emotions differently, and an individual's facial expressions can vary with cultural, personal, and physiological factors. This subject variability makes it harder to train models that generalize well across individuals. To mitigate these problems, the following pre-processing and augmentation steps are applied.

To ensure uniformity, all images are resized to a consistent resolution, typically square dimensions of 48 \(\times\) 48 pixels. Each output pixel is mapped back to a position in the original image, and bicubic interpolation calculates the pixel value at the resized image coordinates \(\left( {X^{\prime} ,Y^{\prime} } \right)\) from the surrounding pixel values in the original image. Bicubic interpolation uses a 4 \(\times\) 4 pixel neighbourhood and a bicubic kernel function. The interpolated pixel value \(K^{\prime} \left( {X^{\prime} ,Y^{\prime} } \right)\) is calculated as follows:

$$K^{\prime} \left( {X^{\prime} ,Y^{\prime} } \right) = \sum\limits_{m = - 1}^{2} {\sum\limits_{n = - 1}^{2} {g\left( m \right) \cdot g\left( n \right) \cdot K\left( {X_{0} + m,\,Y_{0} + n} \right)} }$$
(10)

where \(K^{\prime} \left( {X^{\prime} ,Y^{\prime} } \right)\) is the pixel value in the resized image at coordinates \(\left( {X^{\prime} ,Y^{\prime} } \right)\), \(K\left( {X_{0} + m,\,Y_{0} + n} \right)\) are the pixel values in the original image at the surrounding coordinates within the 4 \(\times\) 4 neighbourhood anchored at the mapped position \(\left( {X_{0} ,Y_{0} } \right)\), and \(g\left( m \right)\) and \(g\left( n \right)\) are the bicubic kernel weights, typically defined using a cubic spline function. This resizing method ensures that the original image is proportionally scaled to the desired dimensions, and bicubic interpolation provides a smoother and more accurate result than simpler schemes. It also ensures uniformity and reduces computational complexity during training.
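As a concrete illustration, the following Python sketch implements Eq. (10) directly. The Keys cubic convolution kernel with \(a = -0.5\) and the border-clamping and coordinate-mapping conventions are our assumptions; the paper does not specify them. In practice, a library call such as OpenCV's `cv2.resize` with `interpolation=cv2.INTER_CUBIC` would be used instead.

```python
import numpy as np

def g(t, a=-0.5):
    """Keys bicubic kernel; a = -0.5 is a common cubic-spline choice (assumed)."""
    t = abs(t)
    if t <= 1:
        return (a + 2) * t**3 - (a + 3) * t**2 + 1
    if t < 2:
        return a * (t**3 - 5 * t**2 + 8 * t - 4)
    return 0.0

def resize_bicubic(img, out_h=48, out_w=48):
    """Resize a grayscale image via Eq. (10): a 4x4 weighted neighbourhood."""
    in_h, in_w = img.shape
    out = np.zeros((out_h, out_w))
    for yp in range(out_h):
        for xp in range(out_w):
            # Map the output pixel (X', Y') back to source coordinates.
            x = (xp + 0.5) * in_w / out_w - 0.5
            y = (yp + 0.5) * in_h / out_h - 0.5
            x0, y0 = int(np.floor(x)), int(np.floor(y))
            dx, dy = x - x0, y - y0
            val = 0.0
            for m in range(-1, 3):        # the 4x4 neighbourhood of Eq. (10)
                for n in range(-1, 3):
                    xs = min(max(x0 + m, 0), in_w - 1)   # clamp at borders
                    ys = min(max(y0 + n, 0), in_h - 1)
                    val += g(m - dx) * g(n - dy) * img[ys, xs]
            out[yp, xp] = val
    return np.clip(out, 0.0, 255.0)
```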

Normalization is a crucial pre-processing step in FER that ensures the input data has consistent scales and distributions. Commonly, normalization involves scaling pixel values to a standard range, such as [0, 1], and subtracting the mean to centre the data. Pixel values are first scaled to the range [0, 1] using min–max scaling so that all values fall within a consistent range, and the mean pixel value is then subtracted to centre the data around zero. The scaling and zero-centring are combined into a single normalization step, expressed as:

$$N\left( {X,Y} \right) = \frac{{S\left( {X,Y} \right) - \mu }}{\sigma }$$
(11)

where \(S\left( {X,Y} \right)\) is the scaled pixel value at coordinates \(\left( {X,Y} \right)\), \(\mu\) is the mean pixel value of the entire image, and \(\sigma\) is the standard deviation of the pixel values in the image.
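A minimal sketch of Eq. (11), assuming 8-bit input images; the small epsilon guarding against zero variance is our addition:

```python
import numpy as np

def normalize(img):
    """Min-max scale to [0, 1], then zero-centre and scale per Eq. (11)."""
    s = img.astype(np.float64) / 255.0   # min-max scaling for 8-bit input
    mu, sigma = s.mean(), s.std()        # image-wide mean and std deviation
    return (s - mu) / (sigma + 1e-8)     # epsilon avoids division by zero
```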

Augmentation helps the model generalize better and become more robust to variations in lighting, pose, and other factors. The following augmentation operations, with their value ranges, are applied in FER (a code sketch follows the list):

  • Rotation: rotate the image by a random angle in the range −15° to +15° to simulate variations in head pose.

  • Horizontal flip: flip the image horizontally with a 50% chance (a Boolean choice: either flip or not). This creates a mirror image and helps the model learn from both left- and right-facing faces.

  • Brightness and contrast adjustment: randomly adjust brightness and contrast within −20% to +20% to simulate variations in lighting conditions.

  • Translation: shift the image horizontally and/or vertically by a random offset of −10% to +10% of the image width/height to simulate variations in the position of the face within the image.
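The sketch below assembles these operations with torchvision; the library choice is an assumption, since the paper names no framework, but the parameter ranges match the list above.

```python
from torchvision import transforms

# Augmentation pipeline with the ranges listed above (torchvision is an
# assumed library choice; the paper does not name a framework).
augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),                     # -15° to +15°
    transforms.RandomHorizontalFlip(p=0.5),                    # 50% mirror chance
    transforms.ColorJitter(brightness=0.2, contrast=0.2),      # ±20% lighting
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),  # ±10% shift
])
```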

Facial region detection

One of the key components of SSD is the use of default anchors, also known as default bounding boxes. These default anchors are predefined bounding boxes of various aspect ratios and scales that are placed at different positions within the image.

This step ensures that the extracted faces are aligned to a consistent position. The anchors come in different sizes (scales) and aspect ratios; for example, there might be small square anchors, tall rectangular anchors, and wide rectangular anchors. The improved SSD (ImSSD) generates a set of default anchors, represented by their width (W) and height (H), at each spatial location on the feature maps of different convolutional layers; these anchors are determined by different aspect ratios and scales. The anchor box shapes used in the ImSSD model are illustrated in Fig. 14.

Fig. 14: Illustration of facial region detection using anchor box shapes in the improved SSD model

These default anchors are placed at various positions across the spatial dimensions of the feature maps generated by the convolutional layers in the ImSSD network. Each anchor is centred at a grid point within the feature map, with its spatial location represented as (x, y), covering a different region of the image. The anchors are distributed across the image in a dense grid pattern.
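To make this anchor layout concrete, here is a sketch of default-anchor generation on one feature map. The scales and aspect ratios shown are illustrative placeholders; the paper does not list ImSSD's exact configuration.

```python
import numpy as np

def default_anchors(fmap_h, fmap_w, scales=(0.1, 0.2), ratios=(1.0, 2.0, 0.5)):
    """Return SSD-style default anchors as (cx, cy, W, H), normalised to [0, 1]."""
    anchors = []
    for i in range(fmap_h):
        for j in range(fmap_w):
            cx = (j + 0.5) / fmap_w   # anchor centred on the grid cell (x, y)
            cy = (i + 0.5) / fmap_h
            for s in scales:
                for r in ratios:
                    w = s * np.sqrt(r)   # r > 1 gives wide rectangular anchors
                    h = s / np.sqrt(r)   # r < 1 gives tall rectangular anchors
                    anchors.append((cx, cy, w, h))
    return np.array(anchors)

# A 10x10 feature map with 2 scales and 3 ratios yields 600 default anchors.
boxes = default_anchors(10, 10)
```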

The region score represents the likelihood that an intricate facial feature is present within the anchor's region; it measures how well the anchor aligns with an actual region. It is typically computed with a sigmoid activation function, so its values are confined to the range 0 to 1, and a high region score indicates a high probability of containing a facial feature. The model also predicts adjustments (offsets) to the position and size of each anchor so that it better fits the actual facial region; these offsets let the anchor match the region's location and dimensions more accurately. Anchors with high region scores are matched to actual facial regions based on their overlap, while anchors with low region scores or poor matches are discarded. The remaining anchors, together with their corresponding offsets, generate the final detection boxes: the model's predictions for the locations and dimensions of facial regions within the image. By using default anchors of different scales and aspect ratios, the improved SSD detects facial regions of various sizes and shapes, including faces with different orientations and aspect ratios. This makes SSD a versatile choice for facial region detection and other object detection tasks, as it adapts to a wide range of facial region appearances within an image.
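The following sketch decodes the predicted offsets and filters anchors by region score. It uses the standard SSD box parameterisation (centre shifts scaled by anchor size, log-scale width/height); the parameterisation and the score threshold are assumptions, as the paper does not spell them out.

```python
import numpy as np

def decode_detections(anchors, offsets, logits, score_thresh=0.5):
    """Convert anchors + offsets + region-score logits into detection boxes.

    anchors: (N, 4) default boxes (cx, cy, w, h); offsets: (N, 4) predicted
    (tx, ty, tw, th); logits: (N,) raw region scores.
    """
    scores = 1.0 / (1.0 + np.exp(-logits))               # sigmoid -> (0, 1)
    cx = anchors[:, 0] + offsets[:, 0] * anchors[:, 2]   # shift the centre
    cy = anchors[:, 1] + offsets[:, 1] * anchors[:, 3]
    w = anchors[:, 2] * np.exp(offsets[:, 2])            # rescale width/height
    h = anchors[:, 3] * np.exp(offsets[:, 3])
    keep = scores >= score_thresh                        # drop weak anchors
    boxes = np.stack([cx, cy, w, h], axis=1)[keep]
    return boxes, scores[keep]
```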

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Balachandran, G., Ranjith, S., Chenthil, T.R. et al. Facial expression-based emotion recognition across diverse age groups: a multi-scale vision transformer with contrastive learning approach. J Comb Optim 49, 11 (2025). https://doi.org/10.1007/s10878-024-01241-8
