Journal of information and communication convergence engineering 2022; 20(3): 174-180

Published online September 30, 2022

https://doi.org/10.56977/jicce.2022.20.3.174

© Korea Institute of Information and Communication Engineering

Emotion Recognition Implementation with Multimodalities of Face, Voice and EEG

Miracle Udurume 1, Angela Caliwag1, Wansu Lim 1*, and Gwigon Kim2*

1Department of Aeronautics Mechanical and Electronic Convergence Engineering, Kumoh National Institute of Technology, Gumi 39177, Republic of Korea
2Department of Business Administration, Kumoh National Institute of Technology, Gumi 39177, Republic of Korea

Correspondence to : *Gwigon Kim (E-mail: metheus@kumoh.ac.kr, Tel: +82-54-478-7848)
Department of Business Administration, Kumoh National Institute of Technology, Gumi 39177, Republic of Korea

Received: April 21, 2022; Revised: September 7, 2022; Accepted: September 8, 2022

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Emotion recognition is an essential component of complete interaction between humans and machines. The challenges in emotion recognition arise because emotions are expressed in several forms, such as visual, sound, and physiological signals. Recent advancements in the field show that combining modalities, such as visual, voice, and electroencephalography signals, leads to better results than using single modalities separately. Previous studies have explored the use of multiple modalities for accurate emotion prediction; however, the number of studies on real-time implementation is limited because of the difficulty of simultaneously implementing multiple modalities of emotion recognition. In this study, we propose an emotion recognition system for real-time implementation. Our model is built with a multithreading block that runs each modality on a separate thread for continuous synchronization. First, we achieved emotion recognition for each modality separately before enabling the multithreaded system. To verify the correctness of the results, we compared the accuracy of unimodal and multimodal emotion recognition in real time. The experimental results demonstrate real-time user emotion recognition with the proposed model and confirm the effectiveness of multimodality for emotion recognition: our multimodal model obtained an accuracy of 80.1%, compared with unimodal accuracies of 70.9%, 54.3%, and 63.1%.

Keywords: Emotion recognition, Multimodality, Multithreading, Real-time implementation

I. INTRODUCTION

Emotion recognition plays a significant role in our daily lives and enables software applications to adapt their responses to the emotional state of the user [1-3]. Applications of emotion recognition can be found in various domains, such as fatigue-state monitoring and prediction [4], health monitoring, and communication skills [5]. Emotion recognition draws on various modalities [6]. Emotions are most often expressed through external signals such as visual cues, speech, gestures, body signals, heart rate, physiological signals, the electroencephalogram (EEG), and body temperature [7,8]. Among these, speech and visual cues are widely used in emotion recognition because their datasets can easily be constructed. Recent studies have concentrated on unimodal emotion recognition, such as from text, speech, or images. Although unimodal emotion recognition has achieved many breakthroughs over time, it still faces some problems [9,10]. A single modality cannot fully describe the user's emotion at a given moment, which results in poor accuracy. Hence, using multimodal features together to describe an emotion is more comprehensive and detailed, and multimodality helps increase the accuracy of emotion recognition. However, simultaneously implementing video, audio, and EEG emotion recognition in real time remains difficult because most models rely on recorded data for offline implementation [11]. Recently, various multimodal emotion recognition methods have employed fusion methods that combine unimodal systems by extracting features before fusing them [12]. Busso et al. combined features from audio and video datasets before independently applying a classifier to each feature set, and reported a significant increase in accuracy, from 65% to 89.3% [13]. Guanghui and Xiaoping proposed a multimodal emotion recognition method that fuses correlated speech-visual features, extracted using a two-dimensional convolutional neural network, before applying their feature correlation analysis algorithm for offline multimodal emotion recognition [14]. Xing et al. used machine learning algorithms to exploit EEG signals and audiovisual features for video emotion recognition, achieving classification accuracies of 96.79% for valence and 97.79% for arousal [15]. All of the studies mentioned above exploit multimodal fusion for emotion recognition. Nevertheless, these feature methods are inappropriate for the continuous emotion recognition of audio, face, and EEG, and none provides a real-time approach for multimodal emotion recognition.

Thus, this study addresses the above issues by providing a real-time multimodal emotion recognition system for face, voice, and EEG modalities using a multithreaded system for continuous synchronized execution to improve the performance of emotion recognition in real-time.

The rest of the paper is structured as follows: Section II covers the proposed methodology and overall architecture, Section III presents the results and discussion, and Section IV contains our conclusions.

II. PROPOSED METHOD

This section discusses the proposed methodology for continuous real-time multimodal emotion recognition and describes the overall architecture of the model. Our model focuses on a continuous, synchronized, real-time multimodal emotion recognition implementation for audio, face, and EEG. An overview of the system model is shown in Fig. 1. The system model comprises four layers: A) input devices, B) feature extraction, C) emotion recognition model, and D) multithreading. Each layer of the system architecture is described as follows.

Fig. 1. System overview.

A. Input Devices

The input devices include the hardware devices, such as an ACPI X64 PC, webcam (c922 pro stream), USB microphone, and Daisy module OpenBCI Cyton plus cEEGrid device, used for real-time implementation. The layer offers tools for face, voice, and EEG emotion recognition and first separately extracts the features from each source. Feature extraction, its steps, methodological approach, and the process used to select each feature are explained in the next section. Face-emotion recognition receives real-time facial expressions from the webcam and detects the face. We implemented the same steps for voice and EEG emotion recognition before the multimodal emotion recognition integration.

B. Feature Extraction

This involves extracting useful features to enable emotion recognition from the face, voice, and EEG. The components used for face, voice, and EEG emotion recognition are discussed below.

Face Features

This is the process of extracting facial features from a face and classifying emotions. First, we performed face detection, the process of detecting an individual’s face in real time, using Dlib, an open-source library that provides a facial landmark detector with pretrained models [11]. Dlib estimates the (x, y) coordinates that map the facial points on a face, which enables feature extraction. After face detection, feature points were extracted from the user, with face tracking and landmark detection algorithms used to track the user’s face in real time. Face-landmark detection enables the computer to detect and localize regions of the face, such as the eyes, eyebrows, nose, and mouth.
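Dlib's landmark predictor returns raw pixel coordinates; before classification these are typically normalized so that the features do not depend on where the face sits in the frame. The helper below is an illustrative sketch of that step only, assuming a 68-point landmark array as Dlib produces; the function name and normalization scheme are our own, not the paper's pipeline.

```python
import numpy as np

def landmark_features(points: np.ndarray) -> np.ndarray:
    """points: (68, 2) array of (x, y) facial landmark coordinates."""
    assert points.shape == (68, 2)
    center = points.mean(axis=0)                            # centroid of the face
    scale = np.linalg.norm(points - center, axis=1).mean()  # mean landmark radius
    normalized = (points - center) / scale                  # translation/scale invariant
    return normalized.flatten()                             # 136-dimensional vector

# Usage with synthetic landmarks in a 480-pixel frame:
rng = np.random.default_rng(0)
pts = rng.uniform(0, 480, size=(68, 2))
feat = landmark_features(pts)
print(feat.shape)  # (136,)
```

A vector of this form, rather than raw pixel positions, is what a downstream emotion classifier would consume.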

Voice Features

In this process, we extracted features from voice intonations and used them to classify emotions. As with facial emotion recognition, we detected the user’s voice and then extracted useful features. For voice emotion recognition, features such as duration, channels, rate, chunk size, pitch, spectrum, Mel-frequency cepstral coefficients (MFCC), and zero-crossing rate (ZCR) were extracted from the input speech signal using the Librosa library.

Feature extraction is an important technique in speech emotion recognition. Different features are classified after their extraction. The features extracted in this study are as follows.

  • Energy-Root Mean Square.

  • Zero crossing rate.

  • Mel-Frequency Cepstral Coefficients.

The features were extracted with a frame length of 2048 and a hop length of 512 (similar to CHUNK, a batch of sequential samples processed at once). Our main focus was on the MFCC, as it is the most widely used feature for speech emotion recognition. Every sample (a sequence of 0.2 s) was analyzed and translated into four sequential feature values (2048/512 = 4). MFCC extraction consists of several stages: pre-emphasis, windowing, spectral analysis, filter-bank processing, log-energy computation, and Mel-frequency cepstral computation. The Librosa package was used for all feature extraction.
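As a concrete illustration of this framing, the RMS energy and zero-crossing rate can be computed directly in NumPy with the same frame and hop lengths. Librosa's `librosa.feature.rms` and `librosa.feature.zero_crossing_rate` produce equivalent per-frame values; this sketch only makes the framing explicit and is not the study's exact code.

```python
import numpy as np

FRAME_LENGTH, HOP_LENGTH = 2048, 512  # as used in the paper

def frame_signal(y: np.ndarray) -> np.ndarray:
    """Slice a 1-D signal into overlapping frames: (n_frames, FRAME_LENGTH)."""
    n_frames = 1 + (len(y) - FRAME_LENGTH) // HOP_LENGTH
    idx = np.arange(FRAME_LENGTH)[None, :] + HOP_LENGTH * np.arange(n_frames)[:, None]
    return y[idx]

def rms_energy(frames: np.ndarray) -> np.ndarray:
    return np.sqrt((frames ** 2).mean(axis=1))

def zero_crossing_rate(frames: np.ndarray) -> np.ndarray:
    signs = np.sign(frames)
    return (np.abs(np.diff(signs, axis=1)) > 0).mean(axis=1)

# 1 s test tone at the 24,414 Hz rate used for the voice signal:
y = np.sin(2 * np.pi * 440 * np.arange(24414) / 24414)
frames = frame_signal(y)
print(frames.shape, rms_energy(frames)[0], zero_crossing_rate(frames)[0])
```

For a pure sine the per-frame RMS is close to 1/sqrt(2), a quick sanity check on the framing.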

EEG Features

For EEG emotion recognition, we used the v3 Daisy module OpenBCI Cyton plus cEEGrid device to measure EEG signals. Raw data can be read for post-processing based on the OpenBCI data format and files. We used MATLAB to implement a signal-processing tool for the raw OpenBCI data. The EEG data are composed of 24-bit signed values. A total of eight channels were used for signal acquisition. Additionally, 16-bit signed values were used for storing accelerometer data in the X, Y, and Z directions. The sampling rate of OpenBCI was set to 256 Hz by default and could be tuned using the software provided by the OpenBCI project. The raw OpenBCI output is in the time domain; however, EEG data are usually analyzed in the frequency domain. The component frequencies of the EEG include eight major brain waves: Delta (1-3 Hz), Theta (4-7 Hz), Alpha low (8-9 Hz), Alpha high (10-12 Hz), Beta low (13-17 Hz), Beta high (18-30 Hz), Gamma low (31-40 Hz), and Gamma mid (41-50 Hz). These frequencies represent specific brain states such as high alertness, deep sleep, meditation, and anxiety [13]. The raw data from the OpenBCI board were translated from the time domain into the frequency domain for effective analysis.
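The time-to-frequency translation described above can be sketched as follows: estimate the power in each of the eight listed bands from one channel of raw samples at the default 256 Hz rate. This is a simplified single-channel NumPy sketch, not the MATLAB tool used in the study.

```python
import numpy as np

FS = 256  # Hz, OpenBCI default sampling rate
BANDS = {
    "delta": (1, 3), "theta": (4, 7), "alpha_low": (8, 9),
    "alpha_high": (10, 12), "beta_low": (13, 17), "beta_high": (18, 30),
    "gamma_low": (31, 40), "gamma_mid": (41, 50),
}

def band_powers(x: np.ndarray) -> dict:
    """Sum FFT power over each EEG band for one channel of samples."""
    freqs = np.fft.rfftfreq(len(x), d=1 / FS)
    psd = np.abs(np.fft.rfft(x)) ** 2 / len(x)
    return {name: float(psd[(freqs >= lo) & (freqs <= hi)].sum())
            for name, (lo, hi) in BANDS.items()}

# Synthetic 2 s signal dominated by a 10 Hz (alpha-high) component:
t = np.arange(2 * FS) / FS
x = np.sin(2 * np.pi * 10 * t) + 0.1 * np.sin(2 * np.pi * 25 * t)
p = band_powers(x)
print(max(p, key=p.get))  # 'alpha_high'
```

The resulting per-band powers are the kind of frequency-domain features a classifier would receive instead of the raw time-domain trace.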

C. Emotion Recognition Model

Face Emotion Recognition

For the face network, we used the Xception model, an extension of the Inception architecture. The pretrained model was preferred over training the network from scratch to benefit from the features already learned by the model. The 36 convolutional layers of the Xception architecture form the feature extraction base of the network and are structured into 14 modules, all of which have linear residual connections around them, except for the first and last modules. The Xception architecture is a stack of depthwise separable convolution layers with residual connections.
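The building block of this architecture, a depthwise separable convolution (a per-channel spatial filter followed by a 1x1 pointwise mix) wrapped in a linear residual connection, can be sketched in NumPy with toy shapes. The real model uses pretrained weights and many stacked modules; this sketch only demonstrates the operation itself.

```python
import numpy as np

def depthwise_separable_conv(x, dw, pw):
    """x: (H, W, C); dw: (k, k, C) depthwise filters; pw: (C, C_out) pointwise mix."""
    k, _, C = dw.shape
    H, W, _ = x.shape
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for i in range(H):               # depthwise step: each channel filtered independently
        for j in range(W):
            patch = xp[i:i + k, j:j + k, :]
            out[i, j, :] = (patch * dw).sum(axis=(0, 1))
    return out @ pw                  # pointwise step: 1x1 conv mixes channels

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 4))          # toy feature map
dw = rng.standard_normal((3, 3, 4)) * 0.1   # toy depthwise weights
pw = rng.standard_normal((4, 4)) * 0.1      # toy pointwise weights
y = x + depthwise_separable_conv(x, dw, pw)  # linear residual connection
print(y.shape)  # (8, 8, 4)
```

Splitting the convolution this way uses far fewer multiplications than a full k x k x C x C_out convolution, which is the efficiency argument behind Xception.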

Algorithm 1 Overview of the multithreading system

1 Initialization;
2 Create a thread object for each modality using the threading function:
3 face_thread = myThreads()
4 voice_thread = myThreads()
5 eeg_thread = myThreads()
6 Launch the face, voice, and EEG modalities using the thread function:
7 face_thread = threading.Thread(target=self.record)
8 voice_thread = threading.Thread(target=self.record)
9 eeg_thread = threading.Thread(target=self.record)
10 Define the main function with a start_new_thread method, which creates new threads for the functions passed as arguments and enables thread execution:
11 face_thread.start()
12 voice_thread.start()
13 eeg_thread.start()
14 return output:
15 Output: Emotion predictions from face, voice, and EEG emotion recognition.
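Fleshed out with Python's threading module, the listing above corresponds to a runnable skeleton like the following. The per-modality classifiers are stand-ins that emit a fixed label; in the real system each worker loop would read its sensor and run the corresponding model.

```python
import queue
import threading

# Shared queue through which worker threads hand predictions to the main thread.
results = queue.Queue()

def make_worker(modality: str, fake_prediction: str):
    def record():
        # Stand-in for the modality's capture-and-classify loop:
        # emit a single dummy prediction and exit.
        results.put((modality, fake_prediction))
    return record

# One thread per modality, as in Algorithm 1.
threads = [
    threading.Thread(target=make_worker("face", "happy")),
    threading.Thread(target=make_worker("voice", "neutral")),
    threading.Thread(target=make_worker("eeg", "calm")),
]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait for all modalities to report

predictions = dict(results.get() for _ in range(3))
print(predictions)
```

Using a thread-safe `queue.Queue` avoids explicit locking when the worker threads report back to the main thread.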

Voice Emotion Recognition

A pretrained two-layer long short-term memory (LSTM) model was used to capture the temporal structure of the voice. The first layer contained 256 units and the second 512 units, with a batch size of 32 and a default learning rate of 0.001. The sampling rate of the voice signal was 24,414 Hz. The input features were calculated over a 2 s window advanced in 0.2 s steps; the window size represents the number of samples and the duration of audio analyzed at once. Finally, a dense output layer with four units and softmax activation was added, each unit reflecting one of the predicted emotion categories.
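The windowing described here can be made concrete with a small sliding-window helper: a 2 s analysis window advanced in 0.2 s steps over a signal sampled at 24,414 Hz. This is a sketch with an illustrative silent signal; in the real pipeline each window's features would be fed to the LSTM.

```python
import numpy as np

SR = 24414                # Hz, sampling rate of the voice signal
WINDOW = int(2.0 * SR)    # 48828 samples per 2 s window
STEP = int(0.2 * SR)      # 4882 samples per 0.2 s hop

def sliding_windows(y: np.ndarray):
    """Yield every full 2 s window, advancing 0.2 s at a time."""
    for start in range(0, len(y) - WINDOW + 1, STEP):
        yield y[start:start + WINDOW]

y = np.zeros(5 * SR)      # 5 s of silence as a stand-in signal
n_windows = sum(1 for _ in sliding_windows(y))
print(n_windows)
```

Overlapping windows like these give the LSTM a fresh prediction every 0.2 s while each prediction still sees 2 s of context.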

EEG Emotion Recognition

For EEG emotion recognition, a K-nearest neighbor (KNN) algorithm was used for model training. KNN is a supervised machine learning algorithm that identifies the K samples closest to an unknown sample point and determines the category of the unknown sample from the majority of those K samples. Two types of models were used to describe the general state of emotion: 1) the discrete emotion model, including basic emotions such as sadness, anger, fear, surprise, disgust, and happiness; and 2) the multidimensional emotion model of valence and arousal [16]. Valence represents the degree of delight of the individual and varies from negative to positive, whereas arousal represents the degree of activation of emotions and varies from calm to excited. KNN classifies the emotions into discrete categories to produce the prediction result.
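The voting rule KNN applies can be sketched in a few lines of NumPy. The two synthetic clusters below stand in for two emotion classes; they are illustrative data, not EEG features from the study.

```python
import numpy as np

def knn_predict(train_x, train_y, x, k=3):
    """Return the majority label among the k training samples nearest to x."""
    dists = np.linalg.norm(train_x - x, axis=1)   # distance to every training sample
    nearest = train_y[np.argsort(dists)[:k]]      # labels of the k closest samples
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]              # majority vote

# Two synthetic clusters standing in for two emotion classes:
rng = np.random.default_rng(1)
calm = rng.normal(0.0, 0.3, size=(20, 4))
excited = rng.normal(2.0, 0.3, size=(20, 4))
X = np.vstack([calm, excited])
y = np.array(["calm"] * 20 + ["excited"] * 20)

print(knn_predict(X, y, np.full(4, 1.9)))  # point near the "excited" cluster
```

Because the prediction is a discrete label from a vote, KNN naturally fits the discrete emotion model described above.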

D. Multithreading

Multithreading allows multiple threads to perform the same or different processing at the same time and allows multiple threads to be created within a process. For our system model, we built a multithreaded system enabling synchronized processing of the three modalities, so that all processes execute concurrently. We defined three threads, one for each modality. After the emotion classification of each modality, the parallel classification, that is, the multithreading aspect, was implemented by configuring a thread for each modality. The models were built into thread channels, in which the algorithm appropriately organized the models. In the main thread, a video stream was continuously captured from the user’s webcam using the OpenCV library, whose classifiers were used for frontal-face detection. The frame count was set to four and passed to the second thread, where the voice input was processed every 2 seconds, before being passed to the third thread, where the EEG signals were processed, making the system viable for real-time application. The classification result was then passed back to the main thread, which appended it to a list retaining the four most recently detected emotions, after which the process continued. The algorithm used for multithreading all modalities is given in Algorithm 1.
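The bounded result list kept by the main thread behaves like a fixed-length queue; in Python, `collections.deque` with `maxlen=4` gives exactly this retain-the-last-four behavior (the label stream below is illustrative):

```python
from collections import deque

# Buffer that keeps only the four most recently detected emotions:
recent = deque(maxlen=4)
for label in ["neutral", "happy", "happy", "surprise", "anger"]:
    recent.append(label)  # once full, the oldest entry is dropped automatically

print(list(recent))
```

A `maxlen`-bounded deque drops old entries in O(1) without any manual trimming, which keeps the main thread's bookkeeping cheap.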

III. RESULTS AND DISCUSSION

In this section, we report the results of the study, as shown in Figs. 2 and 3. Real-time implementation was selected for better accuracy and prediction. We developed the application in the Python programming language on the PyCharm platform, building and executing the applications on the graphics processing unit (GPU) rather than the central processing unit (CPU). The implementation was achieved in two stages. First, we recognized emotions from each modality separately before combining all three modalities into a single process. The accuracy of each modality was verified to ensure good results after combining them. A multithreading system was then used to combine all three modalities using three different threads. We first executed the voice, followed by the face, and finally the EEG emotion recognition. For each modality, we were able to display approximately six emotions: anger, fear, happiness, surprise, sadness, and neutrality. The model performed well for all three modalities and operated continuously using the multithreading system. Fig. 2 shows the results of emotion recognition using unimodality, while Fig. 3 shows the results using multimodalities.

Fig. 2. Result of emotion recognition using unimodality: (a) face, (b) voice, (c) EEG.
Fig. 3. Result of emotion recognition using multimodalities: face, voice, and EEG.

A focused prediction result of the emotion recognition for the three modalities is shown on the right-hand side of Fig. 3. The predictions underlined in blue and red represent the voice and EEG emotion recognition, respectively, as displayed on the terminal, while the face emotion recognition is presented on the screen on the left-hand side of the figure. We were able to capture, display, and predict emotions continuously and synchronously for all modalities. The model was accurate in predicting user emotion with a fast processing speed. The results obtained for real-time emotion recognition with unimodality and multimodality are shown in Fig. 4. The individual modalities obtained accuracies of 70.9%, 54.3%, and 63.1%, respectively, whereas combining all three modalities yielded a higher accuracy of approximately 80.1%.

Fig. 4. Accuracy comparison between unimodality and multimodality.

IV. CONCLUSION

This paper proposed a real-time emotion recognition implementation for face, voice, and EEG using a multithreaded system for synchronized continuous execution. We focused on real-time implementation to improve the task of emotion recognition. The results show the continuous implementation of all three modalities and support the idea that multimodality helps increase the accuracy of emotion recognition. We found that multithreading enabled continuous real-time implementation of voice, face, and EEG emotion recognition. Although we focused on real-time implementation for face, voice, and EEG, we would like to explore other ways to implement multimodalities in future work. Recent studies show that multimodal datasets can be useful in increasing the accuracy of emotion recognition. Therefore, we aim to explore multimodal datasets and find means of collecting datasets for use in real-time implementation.

The work reported in this paper was conducted during the sabbatical year of Kumoh National Institute of Technology in 2019.

  1. J. Zhao, X. Mao, and L. Chen, Speech emotion recognition using deep 1D & 2D CNN LSTM networks, Biomedical Signal Processing and Control, vol. 47, pp. 312-323, Jan. 2019. DOI: 10.1016/j.bspc.2018.08.035.
  2. M. Liu and J. Tang, Audio and video bimodal emotion recognition in social networks based on improved AlexNet network and attention mechanism, Journal of Information Processing Systems, vol. 17, pp. 754-771, Aug. 2021. DOI: 10.3745/JIPS.02.0161.
  3. J. N. Njoku, A. C. Caliwag, W. Lim, S. Kim, H. Hwang, and J. Jung, Deep learning based data fusion methods for multimodal emotion recognition, The Journal of Korean Institute of Communications and Information Sciences, vol. 47, no. 1, pp. 79-87, Jan. 2022. DOI: 10.7840/kics.2022.47.1.79.
  4. Q. Ji, Z. Zhu, and P. Lan, Real-time nonintrusive monitoring and prediction of driver fatigue, IEEE Transactions on Vehicular Technology, vol. 53, no. 4, pp. 1052-1068, Jul. 2004. DOI: 10.1109/TVT.2004.830974.
  5. H. Zhao, Z. Wang, S. Qiu, J. Wang, F. Xu, Z. Wang, and Y. Shen, Adaptive gait detection based on foot-mounted inertial sensors and multi-sensor fusion, Information Fusion, vol. 52, pp. 157-166, Dec. 2019. DOI: 10.1016/j.inffus.2019.03.002.
  6. J. Gratch and S. Marsella, Evaluating a computational model of emotion, Autonomous Agents and Multi-Agent Systems, vol. 11, no. 1, pp. 23-43, 2005. DOI: 10.1007/s10458-005-1081-1.
  7. N. Cudlenco, N. Popescu, and M. Leordeanu, Reading into the mind's eye: Boosting automatic visual recognition with EEG signals, Neurocomputing, vol. 386, pp. 281-292, 2020. DOI: 10.1016/j.neucom.2019.12.076.
  8. O. Kwon, I. Jang, C. Ahn, and H. G. Kang, An effective style token weight control technique for end-to-end emotional speech synthesis, IEEE Signal Processing Letters, vol. 26, no. 9, pp. 1383-1387, Jul. 2019. DOI: 10.1109/LSP.2019.2931673.
  9. Wei Wei, Feng Yongli, Gang Chen, and Ming Chu, Multimodal facial expression feature based on deep-neural networks, Journal on Multimodal User Interfaces, vol. 14, pp. 17-23, 2020. DOI: 10.1007/s12193-019-00308-9.
  10. Y. Tian, J. Cheng, Y. Li, and S. Wang, Secondary information aware facial expression recognition, IEEE Signal Processing Letters, vol. 26, no. 12, pp. 1753-1757, Dec. 2019. DOI: 10.1109/LSP.2019.2942138.
  11. G. Castellano, L. Kessous, and G. Caridakis, Emotion recognition through multiple modalities: Face, body gesture, speech, in Affect and Emotion in Human-Computer Interaction, Lecture Notes in Computer Science, pp. 92-103, 2008. DOI: 10.1007/978-3-540-85099-1_8.
  12. Y. Ma, Y. Hao, M. Chen, J. Chen, P. Lu, and A. Kosir, Audio-visual emotion fusion (AVEF): A deep efficient weighted approach, Information Fusion, vol. 46, pp. 184-192, Mar. 2019. DOI: 10.1016/j.inffus.2018.06.003.
  13. C. Busso, Z. Deng, S. Yildirim, M. Bulut, C. M. Lee, A. Kazemzadeh, S. Lee, U. Neumann, and S. Narayanan, Analysis of emotion recognition using facial expressions, speech and multimodal information, in Proceedings of the ACM 6th International Conference on Multimodal Interfaces, New York, NY, USA, pp. 205-211, 2004. DOI: 10.1145/1027933.1027968.
  14. C. Guanghui and Z. Xiaoping, Multi-modal emotion recognition by fusing correlation features of speech-visual, IEEE Signal Processing Letters, vol. 28, pp. 533-537, 2021. DOI: 10.1109/LSP.2021.3055755.
  15. B. Xing, H. Zhang, K. Zhang, L. Zhang, X. Wu, X. Shi, S. Yu, and S. Zhang, Exploiting EEG signals and audiovisual feature fusion for video emotion recognition, IEEE Access, vol. 7, pp. 59844-59861, May 2019. DOI: 10.1109/ACCESS.2019.2914872.
  16. E. Perez, I. Cervantes, E. Duran, G. Bustamante, J. Dizon, Y. Chnag, and H. Lin, Feature extraction and signal processing of open-source brain-computer interface, in Proceedings of the 2nd Annual Undergraduate Research Expo, Dallas, TX, USA, 2016.
  17. C. Y. Park, N. Cha, S. Kang, A. Kim, A. H. Khandoker, L. Hadjileontiadis, A. Oh, and U. Lee, K-EmoCon, a multimodal sensor dataset for continuous emotion recognition in naturalistic conversation, Scientific Data, vol. 7, p. 293, Sep. 2020. DOI: 10.1038/s41597-020-00630.

Miracle Udurume

was born in Kaduna State, Nigeria, in 1996. She received the B.S. degree in Mathematics from Delta State University, Abraka, in 2019. She is currently pursuing the M.S. degree in Electronic Engineering at the Kumoh National Institute of Technology (KIT), Gumi, South Korea, where she has been a Research Assistant with the Future Communications Systems Laboratory since 2021. Her research interests include data generation, emotion recognition, and machine learning.


Angela Caliwag

received the B.S. degree in Electrical Engineering from Mapua Institute of Technology (MIT), Philippines, in 2017, and the M.S. degree in IT convergence engineering from the Kumoh National Institute of Technology (KIT), South Korea, in 2019. Since 2019, she has been a Researcher with the Future Communications Systems Laboratory, KIT. Her research interests include design and analysis of energy storage management system, embedded machine learning, voice-user interface, and data analysis.


Wansu Lim

received his Ph.D. from the Gwangju Institute of Science and Technology (GIST), South Korea, in 2010. He was a Research Fellow (2010-2013) at the University of Hertfordshire, UK, and then a Postdoctoral Researcher (2013-2014) at the Institut National de la Recherche Scientifique (INRS), Canada. Since September 2014, he has been an Assistant Professor at the Kumoh National Institute of Technology (KIT), South Korea. His research interests include statistical analysis, machine learning, and optimization.


Gwigon Kim

is a full-time professor in the Department of Business Administration of Kumoh National Institute of Technology.


Article

Journal of information and communication convergence engineering 2022; 20(3): 174-180

Published online September 30, 2022 https://doi.org/10.56977/jicce.2022.20.3.174

Copyright © Korea Institute of Information and Communication Engineering.

Emotion Recognition Implementation with Multimodalities of Face, Voice and EEG

Miracle Udurume 1, Angela Caliwag1, Wansu Lim 1*, and Gwigon Kim2*

1Department of Aeronautics Mechanical and Electronic Convergence Engineering, Kumoh National Institute of Technology, Gumi 39177, Republic of Korea
2Department of Business Administration, Kumoh National Institute of Technology, Gumi 39177, Republic of Korea

Correspondence to:*Gwigon Kim (E-mail: metheus@kumoh.ac.kr, Tel: +82-54-478-7848)
Department of Business Administration, Kumoh National Institute of Technology, Gumi 39177, Republic of Korea

Received: April 21, 2022; Revised: September 7, 2022; Accepted: September 8, 2022

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Emotion recognition is an essential component of complete interaction between human and machine. The issues related to emotion recognition are a result of the different types of emotions expressed in several forms such as visual, sound, and physiological signal. Recent advancements in the field show that combined modalities, such as visual, voice and electroencephalography signals, lead to better result compared to the use of single modalities separately. Previous studies have explored the use of multiple modalities for accurate predictions of emotion; however the number of studies regarding real-time implementation is limited because of the difficulty in simultaneously implementing multiple modalities of emotion recognition. In this study, we proposed an emotion recognition system for real-time emotion recognition implementation. Our model was built with a multithreading block that enables the implementation of each modality using separate threads for continuous synchronization. First, we separately achieved emotion recognition for each modality before enabling the use of the multithreaded system. To verify the correctness of the results, we compared the performance accuracy of unimodal and multimodal emotion recognitions in real-time. The experimental results showed real-time user emotion recognition of the proposed model. In addition, the effectiveness of the multimodalities for emotion recognition was observed. Our multimodal model was able to obtain an accuracy of 80.1% as compared to the unimodality, which obtained accuracies of 70.9, 54.3, and 63.1%.

Keywords: Emotion recognition, Multimodality, Multithreading, Real-time implementation

I. INTRODUCTION

Emotion recognition plays a very significant role in our daily lives and enables the responses of software applications to adapt to emotional states of the user [1-3]. The application of emotion recognition can be found in various domains, such as for the monitoring and prediction of the fatigue state [4], health monitoring, and communication skills [5]. Emotion recognition influences various levels of modalities [6]. Emotions can most often be expressed through external levels, such as visual, speech, gestures, body signals, heart rate, physiological signals, electroencephalogram (EEG), and body temperature [7,8]. Among these, speech and visual are visual, speech, gestures and body signals, heart rate, physiological signals, electroencephalogram (EEG), body temperatures [7,8], etc. Among these, speech and visual are used widely in emotion recognition because their datasets can easily be constructed. Recent studies have been concentrated on unimodal modalities of emotion recognition, such as text, speech, and images. Although unimodal emotion recognition has made many breakthrough achievements with the passage of time, it still faces some problems [9,10]. The use of unimodality cannot fully describe a certain emotion of the user at the moment, thereby resulting in poor accuracy. Hence, using multimodal features to describe a certain emotion together will be more comprehensive and detailed. Multimodality helps increase the accuracy of emotion recognition. However, the results show some difficulty in simultaneously implementing video, audio, and EEG emotion recognition in real-time because most models require the use of recorded data for offline implementation [11]. Recently, various multimodal emotion recognition methods have employed the use of fusion method for combining the unimodal systems by extracting features before fusing together [12]. Busso et al. 
combined features from audio and video datasets before independently applying a data classifier algorithm to each of the features and reported a significant increase in the accuracy, from 65% to 89.3%. Guanghui and Xiaoping proposed a multimodal emotion recognition method for fusing the correlation features of speech-visual, which were extracted using a twodimensional convolutional neural network before the application of the proposed feature correlation analysis algorithm for offline multimodal emotion recognition [14]. Xing et al. used machine learning algorithms to exploit EEG signals and audiovisual features for video emotion recognition, achieving video emotion classification accuracies of 96.79% for valence and 97.79% for arousal [15]. All of the studies mentioned above exploit multimodal fusion systems for emotion recognition. Nevertheless, these feature methods are inappropriate for the continuous emotion recognition of audio, face, and EEG, and none provide a real-time approach for multimodal emotion recognition.

Thus, this study addresses the above issues by providing a real-time multimodal emotion recognition system for face, voice, and EEG modalities using a multithreaded system for continuous synchronized execution to improve the performance of emotion recognition in real-time.

The rest of the paper is structured as follows: section II covers the proposed methodologies and overall architecture, section III present results and discussion, and section IV contains our conclusion.

II. PROPOSED METHOD

This section discusses the proposed methodology for continuous real-time multimodal emotion recognition, and describes the overall architecture of the model used. Our model focuses on a continuous synchronized real-time multimodal emotion recognition implementation for audio, face, and EEG. An overview of the system model is shown in Fig. 1. The system model includes four layers, which are introduced as A) input devices, B) feature extraction, C) emotion recognition model, and D) multithreading. Each layer of the system architecture is described as follows.

Figure 1. System overview.

A. Input Devices

The input devices include the hardware used for real-time implementation: an ACPI x64 PC, a webcam (C922 Pro Stream), a USB microphone, and the Daisy module OpenBCI Cyton plus cEEGrid device. This layer supplies the inputs for face, voice, and EEG emotion recognition, from which features are first extracted separately for each source. Feature extraction, its steps, methodological approach, and the process used to select each feature are explained in the next section. Face emotion recognition receives real-time facial expressions from the webcam and detects the face. We implemented the same steps for voice and EEG emotion recognition before the multimodal emotion recognition integration.

B. Feature Extraction

This involves extracting useful features to enable emotion recognition from the face, voice, and EEG. The components used for face, voice, and EEG emotion recognition are discussed below.

Face Features

This is the process of extracting facial features from a face and classifying emotions. First, we achieved face detection, the process of detecting an individual's face in real time, using Dlib, an open-source library that provides a landmark-based facial detector with pretrained models [11]. Dlib estimates the (x, y) coordinates that map the facial points on a face, which enables the incorporation of features. After face detection, feature points were extracted from the user, with face-tracking and landmark-detection algorithms used for tracking the user's face in real time. Face-landmark detection enables the computer to detect and localize regions of the face, such as the eyes, eyebrows, nose, and mouth.
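As an illustration of landmark-based features, once the detector returns (x, y) landmark coordinates, simple geometric quantities can be derived from them; the sketch below computes an eye aspect ratio from six eye landmarks. The helper name, point ordering, and sample coordinates are our illustrative assumptions, not part of the paper's pipeline.

```python
import numpy as np

def eye_aspect_ratio(eye):
    """Ratio of vertical eye openness to horizontal eye width.

    `eye` is assumed to follow Dlib's 6-point eye ordering [p1..p6],
    where (p2, p6) and (p3, p5) are vertical pairs and (p1, p4) the
    horizontal pair. Illustrative helper, not from the paper.
    """
    eye = np.asarray(eye, dtype=float)
    v1 = np.linalg.norm(eye[1] - eye[5])   # first vertical distance
    v2 = np.linalg.norm(eye[2] - eye[4])   # second vertical distance
    h = np.linalg.norm(eye[0] - eye[3])    # horizontal distance
    return (v1 + v2) / (2.0 * h)

# Hypothetical coordinates for an open eye (not real Dlib output).
open_eye = [(0, 2), (2, 4), (4, 4), (6, 2), (4, 0), (2, 0)]
ear = eye_aspect_ratio(open_eye)
```

Features of this kind, computed per frame, can then feed the face emotion classifier alongside the raw landmark positions.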

Voice Features

In this process, we extracted features from voice intonations and used them to classify emotions. Similar to facial emotion recognition, we first detected the user's voice and then extracted useful features. For voice emotion recognition, features such as duration, channels, rate, chunk size, pitch, spectrum, Mel-frequency cepstral coefficients (MFCC), and zero-crossing rate (ZCR) were extracted from the input speech signal using the Librosa library package.

Feature extraction is an important technique in speech emotion recognition. Different features are classified after their extraction. The features extracted in this study are as follows.

  • Energy-Root Mean Square.

  • Zero crossing rate.

  • Mel-Frequency Cepstral Coefficients.

The features were extracted with a frame length of 2048 and a hop length of 512 (similar to CHUNK, a batch of sequential samples processed at once). Our main focus was on MFCC, as it is the most popular extracted feature for speech emotion recognition. Each sample (a 0.2 s sequence) was analyzed and translated into four sequential feature values (2048/512 = 4). MFCC computation is grouped into stages: pre-emphasis, windowing, spectral analysis, filter-bank processing, log energy computation, and Mel-frequency cepstral computation. The Librosa package was used for all feature extraction.
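To make the framing arithmetic concrete, the sketch below re-implements two of the listed features (RMS energy and ZCR) in plain NumPy at frame length 2048 and hop length 512. In the actual system these, together with MFCCs, come from Librosa (e.g. `librosa.feature.rms`, `librosa.feature.zero_crossing_rate`, `librosa.feature.mfcc`), so treat this as an illustration of what those calls compute rather than the paper's code.

```python
import numpy as np

FRAME_LENGTH = 2048
HOP_LENGTH = 512

def frame_signal(y, frame_length=FRAME_LENGTH, hop_length=HOP_LENGTH):
    """Slice a 1-D signal into overlapping frames (no padding)."""
    n_frames = 1 + (len(y) - frame_length) // hop_length
    return np.stack([y[i * hop_length : i * hop_length + frame_length]
                     for i in range(n_frames)])

def rms(frames):
    """Root-mean-square energy per frame."""
    return np.sqrt(np.mean(frames ** 2, axis=1))

def zcr(frames):
    """Fraction of consecutive samples whose sign differs, per frame."""
    signs = np.signbit(frames)
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)

# One second of a 440 Hz tone at the paper's 24414 Hz sampling rate.
sr = 24414
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 440 * t)
frames = frame_signal(y)
features = np.stack([rms(frames), zcr(frames)], axis=1)  # one row per frame
```

For a pure tone the RMS column sits near 1/sqrt(2), which is a quick sanity check that the framing and energy computation agree.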

EEG Features

For EEG emotion recognition, we used the v3 Daisy module OpenBCI Cyton plus cEEGrid device to measure EEG signals. Raw data can be read for post-processing based on the OpenBCI data format and files. We used the MATLAB programming language to implement a signal-processing tool for the raw OpenBCI data. The EEG data are composed of 24-bit signed values, and a total of eight channels were used for signal acquisition. Additionally, 16-bit signed values were used for storing accelerometer data in the X, Y, and Z directions. The data sampling rate of OpenBCI was set to 256 Hz by default and could be tuned using the software provided by the OpenBCI project. The raw output of OpenBCI is in the time domain; however, EEG data are usually analyzed in the frequency domain. The component frequencies of the EEG include eight major brain-wave bands: Delta (1-3 Hz), Theta (4-7 Hz), Alpha low (8-9 Hz), Alpha high (10-12 Hz), Beta low (13-17 Hz), Beta high (18-30 Hz), Gamma low (31-40 Hz), and Gamma mid (41-50 Hz). These frequencies represent specific brain states such as high alertness, deep sleep, meditation, and anxiety [13]. The raw data from the OpenBCI board were therefore translated from the time domain into the frequency domain for effective analysis.
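The time-to-frequency translation can be sketched as a plain FFT band-power computation. The band edges below follow the eight ranges listed in the text; the synthetic 10 Hz test signal and function names are our illustrative assumptions (the paper's own processing was done in MATLAB).

```python
import numpy as np

FS = 256  # default OpenBCI sampling rate (Hz)

# The eight bands listed in the text: (name, low Hz, high Hz).
BANDS = [("delta", 1, 3), ("theta", 4, 7), ("alpha_low", 8, 9),
         ("alpha_high", 10, 12), ("beta_low", 13, 17), ("beta_high", 18, 30),
         ("gamma_low", 31, 40), ("gamma_mid", 41, 50)]

def band_powers(x, fs=FS):
    """Mean spectral power of a 1-D EEG segment in each named band."""
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    power = np.abs(np.fft.rfft(x)) ** 2 / len(x)
    return {name: power[(freqs >= lo) & (freqs <= hi)].mean()
            for name, lo, hi in BANDS}

# Synthetic 4-second segment dominated by a 10 Hz (alpha-high) rhythm.
np.random.seed(0)
t = np.arange(4 * FS) / FS
x = np.sin(2 * np.pi * 10 * t) + 0.1 * np.random.randn(len(t))
bp = band_powers(x)
```

Because the test rhythm lies at 10 Hz, the alpha-high band dominates the resulting dictionary, mirroring how band powers expose the brain states described above.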

C. Emotion Recognition Model

Face Emotion Recognition

For the face network, we used the Xception model, an extension of the Inception architecture. A pretrained model was preferred over training the network from scratch to benefit from the features the model had already learned. The Xception architecture is a stack of depthwise separable convolution layers with residual connections: its 36 convolutional layers form the feature-extraction base of the network and are structured into 14 modules, all of which have linear residual connections around them, except for the first and last modules.
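To illustrate the building block the Xception paragraph refers to, the sketch below implements a depthwise separable convolution in NumPy: a per-channel spatial filter followed by a 1x1 pointwise mix. This is a toy re-implementation for intuition only, with names of our choosing; in practice the pretrained Keras Xception layers supply these operations.

```python
import numpy as np

def depthwise_separable_conv(x, depth_k, point_w):
    """Depthwise (per-channel) 2-D convolution, then a 1x1 pointwise mix.

    x:       (H, W, C_in) input feature map
    depth_k: (k, k, C_in) one spatial kernel per input channel
    point_w: (C_in, C_out) 1x1 convolution weights
    Returns a (H-k+1, W-k+1, C_out) map ('valid' padding, stride 1).
    """
    H, W, C = x.shape
    k = depth_k.shape[0]
    out_h, out_w = H - k + 1, W - k + 1
    depth = np.zeros((out_h, out_w, C))
    for c in range(C):                # spatial filtering, channel by channel
        for i in range(out_h):
            for j in range(out_w):
                depth[i, j, c] = np.sum(x[i:i+k, j:j+k, c] * depth_k[:, :, c])
    return depth @ point_w            # 1x1 pointwise: mix channels

x = np.random.randn(8, 8, 3)
y = depthwise_separable_conv(x, np.random.randn(3, 3, 3), np.random.randn(3, 16))
```

Splitting the spatial and cross-channel steps this way is what makes the stacked Xception modules cheaper than full convolutions of the same width.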

Algorithm 1 Overview of the multithreading system

1 Initialization;
2 Create a thread object for each modality using the threading function:
3 face_thread = myThreads()
4 voice_thread = myThreads()
5 eeg_thread = myThreads()
6 Next, launch the face, voice, and EEG modalities using the thread function:
7 face_thread = threading.Thread(target=self.record)
8 voice_thread = threading.Thread(target=self.record)
9 eeg_thread = threading.Thread(target=self.record)
10 Define the main function, which creates a new thread for each function passed as an argument and enables thread execution:
11 face_thread.start()
12 voice_thread.start()
13 eeg_thread.start()
14 return output
15 Output: Emotion predictions from face, voice, and EEG emotion recognition.

Voice Emotion Recognition

A pretrained two-layer long short-term memory (LSTM) model was used to consider the temporal structure of the voice. The first layer contained 256 units and the second 512 units, with a batch size of 32 and a default learning rate of 0.001. The sampling rate of the voice signal was 24414 Hz. The input features were calculated over a window size of 2 s advanced in 0.2 s steps; the window size represents the number of samples and the duration of audio analyzed at once. Finally, an output layer, a dense layer with four units and softmax activation, was added; each unit reflected one of the predicted emotion categories.
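The windowing bookkeeping above can be made concrete: with a 2 s window advanced in 0.2 s steps over a signal sampled at 24414 Hz, the number of analysis windows follows directly. The sketch below only illustrates this arithmetic; the variable names are ours, not from the paper.

```python
import numpy as np

SR = 24414          # voice sampling rate used in the paper (Hz)
WINDOW_S = 2.0      # analysis window length (seconds)
STEP_S = 0.2        # step between consecutive windows (seconds)

def window_starts(n_samples, sr=SR, window_s=WINDOW_S, step_s=STEP_S):
    """Start indices (in samples) of every full analysis window."""
    win = int(window_s * sr)           # 48828 samples per window
    hop = int(step_s * sr)             # samples between window starts
    return np.arange(0, n_samples - win + 1, hop)

# A 10-second recording yields (10 - 2) / 0.2 + 1 = 41 full windows.
starts = window_starts(10 * SR)
```

Each such window becomes one input sequence to the LSTM, which is why the window length fixes both the sample count and the audio duration seen per prediction.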

EEG Emotion Recognition

For EEG emotion recognition, a K-nearest neighbor (KNN) algorithm was used for model training. KNN is a supervised machine learning algorithm that identifies the K samples closest to an unknown sample point and determines the category of the unknown sample from the majority of those K samples. Two types of models are used to describe the general state of emotion: 1) the discrete emotion model, comprising basic emotions such as sadness, anger, fear, surprise, disgust, and happiness; and 2) the multidimensional emotion model of valence and arousal [16]. Valence represents the degree of delight of the individual and varies from negative to positive, whereas arousal represents the degree of activation of emotions and varies from calm to excitement. Here, KNN classifies the emotions as discrete values for the prediction result.
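The majority-vote rule can be shown with a minimal from-scratch KNN in NumPy. The toy 2-D points and emotion labels below are illustrative stand-ins, not EEG features, and the helper name is our own; the paper's classifier may equally come from a library.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Label `x` by majority vote among its k nearest training samples."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]               # indices of the k closest
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]

# Toy 2-D features: two clusters standing in for two discrete emotions.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],    # "sad" cluster
              [2.0, 2.0], [2.1, 1.9], [1.9, 2.2]])   # "happy" cluster
y = np.array(["sad", "sad", "sad", "happy", "happy", "happy"])
pred = knn_predict(X, y, np.array([1.8, 2.0]), k=3)
```

The query point lies inside the second cluster, so all three of its nearest neighbors vote for the same discrete emotion, which is exactly the behavior described above.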

D. Multithreading

Multithreading allows multiple threads to perform the same or different processing at the same time within a single process. For our system model, we built a multithreaded system enabling the synchronized processing of the three modalities, allowing their processes to execute concurrently. We defined three threads, one for each modality. After the emotion classification of each modality, the parallel classification, that is, the multithreading aspect, was implemented by configuring a thread for each modality. The models were built into thread channels, in which the algorithm appropriately organized them. In the main thread, a video stream was continuously captured from the user's webcam using the OpenCV library, which provides different classifiers for frontal-face detection. The frame was set to four and passed on to the second thread, where the voice input was processed every 2 seconds, before being passed to the third thread, where the EEG signals were processed, thereby making the system viable for real-time application. The classification result was then passed back to the main thread, which appended it to a list retaining the four most recently detected emotions, after which the process continued. The algorithm used for multithreading all modalities is given in Algorithm 1.
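The thread layout described above can be sketched with Python's standard `threading` and `queue` modules. The worker here produces placeholder labels (the real system attaches each thread to its webcam, microphone, or EEG capture loop), and all names are our illustrative assumptions.

```python
import threading
import queue

results = queue.Queue()  # classifications flow back to the main thread

def modality_worker(name, predict, n_frames=4):
    """Capture/classify loop for one modality; reports to the main thread."""
    for frame in range(n_frames):
        results.put((name, predict(frame)))

# Placeholder classifier standing in for the face, voice, and EEG models.
fake_model = lambda frame: "neutral"

# One thread per modality, mirroring Algorithm 1.
threads = [threading.Thread(target=modality_worker, args=(m, fake_model))
           for m in ("face", "voice", "eeg")]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The main thread retains only the four most recently detected emotions.
recent = []
while not results.empty():
    recent.append(results.get())
    recent = recent[-4:]
```

A thread-safe queue is the simplest way to keep the three capture loops synchronized with the main thread without explicit locks.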

III. RESULTS AND DISCUSSION

In this section, we report the results of the study, as shown in Figs. 2 and 3. Real-time implementation was selected for better accuracy and prediction. We developed the application in the Python programming language on the PyCharm platform, building and executing the applications on the graphics processing unit (GPU) rather than the central processing unit (CPU). The implementation was achieved in two stages. First, we performed emotion recognition for each modality separately before combining all three modalities into a single process. The accuracy of each modality was verified to ensure good results after combining them. A multithreading system combined all three modalities using three different threads. We first executed the voice, followed by the face, and finally the EEG emotion recognition. For each modality, six emotions were displayed: anger, fear, happiness, surprise, sadness, and neutrality. The model performed well for all three modalities and operated continuously using the multithreading system. Fig. 2 shows the results of emotion recognition using unimodality, while Fig. 3 shows the results of emotion recognition using multimodalities.

Figure 2. Result of emotion recognition using unimodality: (a) face, (b) voice, (c) EEG.
Figure 3. Result of emotion recognition using multimodalities; face, voice, and EEG.

A focused prediction result of the emotion recognition for the three modalities is shown on the right-hand side of Fig. 3. The predictions underlined in blue and red represent those of the voice and EEG emotion recognition, respectively, as displayed on the terminal, while the face emotion recognition is presented on the screen on the left-hand side. We were able to capture, display, and predict emotions continuously and synchronously for all modalities. The model was accurate in predicting user emotion with a fast processing speed. The results obtained for real-time emotion recognition are shown in Fig. 4 for unimodality and multimodalities. For the individual modalities, the accuracies obtained were 70.9, 54.3, and 63.1%, respectively, whereas combining all three modalities yielded a higher accuracy of approximately 80.1%.

Figure 4. Accuracy comparison between unimodality and multimodality.

IV. CONCLUSIONS

This paper proposed a real-time emotion recognition implementation for face, voice, and EEG using a multithreaded system for synchronized continuous implementation. We focused on real-time implementation to improve the task of emotion recognition. The results demonstrate the continuous implementation of all three modalities and support the idea that using multimodality helps increase the accuracy of emotion recognition. We found that multithreading enabled continuous real-time implementation of voice, face, and EEG emotion recognition. Although we focused on real-time implementation for face, voice, and EEG, we would like to explore other ways to implement multimodalities in future work. Recent studies show that multimodal datasets can be useful in increasing the accuracy of emotion recognition. Therefore, we aim to explore multimodal datasets and find means to collect datasets for use in real-time implementation.

ACKNOWLEDGMENTS

The work reported in this paper was conducted during the sabbatical year of Kumoh National Institute of Technology in 2019.


References

  1. J. Zhao, X. Mao, and L. Chen, Speech emotion recognition using deep 1D & 2D CNN LSTM networks, Biomedical Signal Processing and Control, vol. 47, pp. 312-323, Jan. 2019. DOI: 10.1016/j.bspc.2018.08.035.
  2. M. Liu and J. Tang, Audio and video bimodal emotion recognition in social networks based on improved AlexNet network and attention mechanism, Journal of Information Processing Systems, vol. 17, pp. 754-771, Aug. 2021. DOI: 10.3745/JIPS.02.0161.
  3. J. N. Njoku, A. C. Caliwag, W. Lim, S. Kim, H. Hwang, and J. Jung, Deep learning based data fusion methods for multimodal emotion recognition, The Journal of Korean Institute of Communications and Information Sciences, vol. 47, no. 1, pp. 79-87, Jan. 2022. DOI: 10.7840/kics.2022.47.1.79.
  4. Q. Ji, Z. Zhu, and P. Lan, Real-time nonintrusive monitoring and prediction of driver fatigue, IEEE Transactions on Vehicular Technology, vol. 53, no. 4, pp. 1052-1068, Jul. 2004. DOI: 10.1109/TVT.2004.830974.
  5. H. Zhao, Z. Wang, S. Qiu, J. Wang, F. Xu, Z. Wang, and Y. Shen, Adaptive gait detection based on foot-mounted inertial sensors and multi-sensor fusion, Information Fusion, vol. 52, pp. 157-166, Dec. 2019. DOI: 10.1016/j.inffus.2019.03.002.
  6. J. Gratch and S. Marsella, Evaluating a computational model of emotion, Autonomous Agents and Multi-Agent Systems, vol. 11, no. 1, pp. 23-43, 2005. DOI: 10.1007/s10458-005-1081-1.
  7. N. Cudlenco, N. Popescu, and M. Leordeanu, Reading into the mind's eye: Boosting automatic visual recognition with EEG signals, Neurocomputing, vol. 386, pp. 281-292, 2020. DOI: 10.1016/j.neucom.2019.12.076.
  8. O. Kwon, I. Jang, C. Ahn, and H. G. Kang, An effective style token weight control technique for end-to-end emotional speech synthesis, IEEE Signal Processing Letters, vol. 26, no. 9, pp. 1383-1387, Jul. 2019. DOI: 10.1109/LSP.2019.2931673.
  9. Wei Wei, Feng Yongli, Gang Chen, and Ming Chu, Multimodal facial expression feature based on deep-neural networks, Journal on Multimodal User Interfaces, vol. 14, pp. 17-23, 2020. DOI: 10.1007/s12193-019-00308-9.
  10. Y. Tian, J. Cheng, Y. Li, and S. Wang, Secondary information aware facial expression recognition, IEEE Signal Processing Letters, vol. 26, no. 12, pp. 1753-1757, Dec. 2019. DOI: 10.1109/LSP.2019.2942138.
  11. G. Castellano, L. Kessous, and G. Caridakis, Emotion recognition through multiple modalities: Face, body gesture, speech, in Affect and Emotion in Human-Computer Interaction, Lecture Notes in Computer Science, pp. 92-103, 2008. DOI: 10.1007/978-3-540-85099-1_8.
  12. Y. Ma, Y. Hao, M. Chen, J. Chen, P. Lu, and A. Kosir, Audio-visual emotion fusion (AVEF): A deep efficient weighted approach, Information Fusion, vol. 46, pp. 184-192, Mar. 2019. DOI: 10.1016/j.inffus.2018.06.003.
  13. C. Busso, Z. Deng, S. Yildirim, M. Bulut, C. M. Lee, A. Kazemzadeh, S. Lee, U. Neumann, and S. Narayanan, Analysis of emotion recognition using facial expressions, speech and multimodal information, in Proceedings of the ACM 6th International Conference on Multimodal Interfaces, New York, NY, USA, pp. 205-211, 2004. DOI: 10.1145/1027933.1027968.
  14. C. Guanghui and Z. Xiaoping, Multi-modal emotion recognition by fusing correlation features of speech-visual, IEEE Signal Processing Letters, vol. 28, pp. 533-537, 2021. DOI: 10.1109/LSP.2021.3055755.
  15. B. Xing, H. Zhang, K. Zhang, L. Zhang, X. Wu, X. Shi, S. Yu, and S. Zhang, Exploiting EEG signals and audiovisual feature fusion for video emotion recognition, IEEE Access, vol. 7, pp. 59844-59861, May 2019. DOI: 10.1109/ACCESS.2019.2914872.
  16. E. Perez, I. Cervantes, E. Duran, G. Bustamante, J. Dizon, Y. Chnag, and H. Lin, Feature extraction and signal processing of open-source brain-computer interface, in Proceedings of the 2nd Annual Undergraduate Research Expo, Dallas, TX, USA, 2016.
  17. C. Y. Park, N. Cha, S. Kang, A. Kim, A. H. Khandoker, L. Hadjileontiadis, A. Oh, and U. Lee, K-EmoCon, a multimodal sensor dataset for continuous emotion recognition in naturalistic conversation, Scientific Data, vol. 7, p. 293, Sep. 2020. DOI: 10.1038/s41597-020-00630.