Enhancing Emotion Recognition through Multimodal Systems and Advanced Deep Learning Techniques

Meena Jindal; Khushwant Kaur

doi:10.32628/CSEIT24103216

Authors

Meena Jindal, Assistant Professor, Sri Guru Gobind Singh College, Chandigarh, India
Khushwant Kaur, Assistant Professor, Sri Guru Gobind Singh College, Chandigarh, India

Keywords:

Sentiment, Multimodal, Deep Learning, Transfer Learning

Abstract

Emotion detection is an important step toward richer human-computer interaction, because it lets systems identify and respond to the emotional state of their users. Multimodal emotion detection systems that fuse auditory and visual information are emerging, since the two modalities carry complementary cues and together give a more robust picture of expressed emotion. Such systems improve the quality of interaction and have applications in diagnosing emotional disorders, monitoring automotive safety, and improving human-robot interaction. Traditional models based on single-modality data have shown low accuracy and high computational cost, which is attributed to the high-dimensional feature space and the dynamic nature of the problem. Multimodal systems, by contrast, exploit the synergy between audio and visual data and infer subtle emotional expressions with better performance and higher accuracy. Recent advances in transfer learning and deep learning have improved these systems further. This research proposal develops a multimodal emotion recognition system that integrates speech and facial information through transfer learning to improve accuracy and robustness. Its objectives are to compare different transfer-learning strategies, including the impact of pre-trained models on speech-based emotion recognition, and to examine the role of voice activity detection in the process. Advanced neural network architectures such as Spatial Transformer Networks and bidirectional LSTMs will also be tested for facial emotion recognition.
Early and late fusion strategies will be compared to find the best way of combining speech and facial data. The research addresses several challenges: data complexity, the balance between model performance and robustness, computational limitations, and the standardization of evaluation. Its aim is a working, robust emotion recognition system that enhances digital interaction and can be applied in practice. The goal is to create a system that overcomes the limitations of single-modality models in emotion detection performance by drawing on state-of-the-art advances in deep learning and transfer learning.
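
The difference between the early and late fusion strategies named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the feature values, weight matrices, emotion labels, the plain linear classifier, and the `speech_share` parameter are all hypothetical, chosen only to show where the two modalities are combined in each strategy.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def linear_scores(features, weights):
    """One raw score per emotion class: dot product of features with each weight row."""
    return [sum(f * w for f, w in zip(features, row)) for row in weights]

def early_fusion(speech_feats, face_feats, joint_weights):
    """Early fusion: concatenate the features of both modalities, then classify once."""
    joint = speech_feats + face_feats
    return softmax(linear_scores(joint, joint_weights))

def late_fusion(speech_feats, face_feats, speech_weights, face_weights, speech_share=0.5):
    """Late fusion: classify each modality separately, then mix the class probabilities."""
    p_speech = softmax(linear_scores(speech_feats, speech_weights))
    p_face = softmax(linear_scores(face_feats, face_weights))
    return [speech_share * s + (1.0 - speech_share) * f
            for s, f in zip(p_speech, p_face)]

# Hypothetical toy inputs: 2 speech features, 2 facial features, 3 emotion classes.
EMOTIONS = ["neutral", "happy", "angry"]
speech = [0.2, 1.5]
face = [1.0, 0.1]
speech_w = [[0.5, 0.1], [0.1, 0.9], [0.3, 0.2]]
face_w = [[0.2, 0.4], [0.8, 0.1], [0.1, 0.3]]
# For the joint classifier, each class row is the speech row followed by the face row.
joint_w = [sw + fw for sw, fw in zip(speech_w, face_w)]

early = early_fusion(speech, face, joint_w)
late = late_fusion(speech, face, speech_w, face_w)
print("early fusion:", EMOTIONS[early.index(max(early))])
print("late fusion: ", EMOTIONS[late.index(max(late))])
```

In a real system the linear classifiers would be replaced by trained networks, and the fusion weights would be learned or tuned on validation data rather than fixed by hand.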
