Journal of information and communication convergence engineering 2022; 20(3): 174-180
Published online September 30, 2022
https://doi.org/10.56977/jicce.2022.20.3.174
© Korea Institute of Information and Communication Engineering
Correspondence to : *Gwigon Kim (E-mail: metheus@kumoh.ac.kr, Tel: +82-54-478-7848)
Department of Business Administration, Kumoh National Institute of Technology, Gumi 39177, Republic of Korea
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Emotion recognition is an essential component of complete interaction between humans and machines. The challenges of emotion recognition stem from the different forms in which emotions are expressed, such as visual, sound, and physiological signals. Recent advances in the field show that combining modalities, such as visual, voice, and electroencephalography signals, leads to better results than using single modalities separately. Previous studies have explored the use of multiple modalities for accurate emotion prediction; however, the number of studies on real-time implementation is limited because of the difficulty of simultaneously implementing multiple modalities of emotion recognition. In this study, we propose an emotion recognition system designed for real-time implementation. Our model is built around a multithreading block that runs each modality in a separate thread for continuous synchronization. First, we achieved emotion recognition for each modality separately before enabling the multithreaded system. To verify the correctness of the results, we compared the accuracy of unimodal and multimodal emotion recognition in real time. The experimental results demonstrate real-time user emotion recognition with the proposed model and confirm the effectiveness of multimodality for emotion recognition: the multimodal model obtained an accuracy of 80.1%, compared with unimodal accuracies of 70.9, 54.3, and 63.1%.
Keywords Emotion recognition, Multimodality, Multithreading, Real-time implementation
Emotion recognition plays a significant role in our daily lives and enables software applications to adapt their responses to the emotional state of the user [1-3]. Applications of emotion recognition can be found in various domains, such as the monitoring and prediction of fatigue states [4], health monitoring, and communication skills [5]. Emotion recognition draws on various modalities [6]. Emotions are most often expressed through external channels such as visual cues, speech, gestures, body signals, heart rate, physiological signals, the electroencephalogram (EEG), and body temperature [7,8]. Among these, speech and visual cues are widely used in emotion recognition because their datasets can easily be constructed. Recent studies have concentrated on unimodal emotion recognition, such as from text, speech, or images. Although unimodal emotion recognition has achieved many breakthroughs over time, it still faces some problems [9,10]. A single modality cannot fully describe the emotion of the user at a given moment, which results in poor accuracy. Hence, using multimodal features to describe an emotion together is more comprehensive and detailed, and multimodality helps increase the accuracy of emotion recognition. However, simultaneously implementing video, audio, and EEG emotion recognition in real time remains difficult because most models rely on recorded data for offline implementation [11]. Recently, various multimodal emotion recognition methods have employed fusion methods that combine unimodal systems by extracting features before fusing them [12]. Busso et al. combined features from audio and video datasets before independently applying a classifier to each set of features, and reported a significant increase in accuracy, from 65% to 89.3%. Guanghui and Xiaoping proposed a multimodal emotion recognition method that fuses correlated speech-visual features, extracted with a two-dimensional convolutional neural network, before applying their feature correlation analysis algorithm for offline multimodal emotion recognition [14]. Xing et al. used machine learning algorithms to exploit EEG signals and audiovisual features for video emotion recognition, achieving classification accuracies of 96.79% for valence and 97.79% for arousal [15]. All of the studies mentioned above exploit multimodal fusion systems for emotion recognition. Nevertheless, these methods are inappropriate for the continuous emotion recognition of audio, face, and EEG, and none provides a real-time approach to multimodal emotion recognition.
Thus, this study addresses the above issues by providing a real-time multimodal emotion recognition system for face, voice, and EEG modalities using a multithreaded system for continuous synchronized execution to improve the performance of emotion recognition in real-time.
The rest of the paper is structured as follows: Section II covers the proposed methodology and overall architecture, Section III presents the results and discussion, and Section IV contains our conclusions.
This section discusses the proposed methodology for continuous real-time multimodal emotion recognition, and describes the overall architecture of the model used. Our model focuses on a continuous synchronized real-time multimodal emotion recognition implementation for audio, face, and EEG. An overview of the system model is shown in Fig. 1. The system model includes four layers, which are introduced as A) input devices, B) feature extraction, C) emotion recognition model, and D) multithreading. Each layer of the system architecture is described as follows.
The input devices include the hardware devices, such as an ACPI X64 PC, webcam (c922 pro stream), USB microphone, and Daisy module OpenBCI Cyton plus cEEGrid device, used for real-time implementation. The layer offers tools for face, voice, and EEG emotion recognition and first separately extracts the features from each source. Feature extraction, its steps, methodological approach, and the process used to select each feature are explained in the next section. Face-emotion recognition receives real-time facial expressions from the webcam and detects the face. We implemented the same steps for voice and EEG emotion recognition before the multimodal emotion recognition integration.
This layer extracts useful features to enable emotion recognition from the face, voice, and EEG signals. The components used for face, voice, and EEG emotion recognition are discussed below.
This is the process of extracting facial features from a face and classifying emotions. First, we performed face detection, the process of detecting an individual’s face in real time, using Dlib, an open-source library that provides a landmark facial detector with pretrained models [11]. Dlib estimates the (x, y) coordinates that map the facial points on a face, which enables the incorporation of features. After face detection, feature points were extracted from the user, with face tracking and landmark detection algorithms used to track the face of the user in real time. Face-landmark detection enables the computer to detect and localize regions of the face, such as the eyes, eyebrows, nose, and mouth.
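As an illustration of this step, the following minimal Python sketch performs Dlib face detection and 68-point landmark extraction on a single webcam frame; the predictor file name and webcam index are assumptions for illustration, not details taken from the paper.

import cv2
import dlib

# Dlib frontal-face detector and a pretrained 68-point landmark predictor
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # assumed local model file

cap = cv2.VideoCapture(0)                      # webcam stream (device index assumed)
ret, frame = cap.read()
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

for face in detector(gray):                    # one rectangle per detected face
    shape = predictor(gray, face)              # 68 (x, y) landmark points
    points = [(shape.part(i).x, shape.part(i).y) for i in range(shape.num_parts)]
    # points cover the eye, eyebrow, nose, and mouth regions used as features
cap.release()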
In this process, we extracted features from voice intonations and used them to classify emotions. Similar to facial emotion recognition, we detected the voice of the user and then extracted useful features. For voice emotion recognition, features such as duration, channels, rate, chunk size, pitch, spectrum, Mel-frequency cepstral coefficients (MFCC), and zero crossing rate (ZCR) were extracted from the input speech signal using the Librosa library.
Feature extraction is an important technique in speech emotion recognition. Different features are classified after their extraction. The features extracted in this study are as follows.
Energy-Root Mean Square.
Zero crossing rate.
Mel-Frequency Cepstral Coefficients.
The features were extracted with a frame length of 2048 and a hop length of 512 (similar to CHUNK, a batch of sequential samples processed at once). Our main focus was on MFCC, as it is the most widely used feature for speech emotion recognition. Every sample (a sequence of 0.2 s) was analyzed and translated into four sequential feature values (2048/512 = 4). MFCC computation proceeds in stages: pre-emphasis, windowing, spectral analysis, filter bank processing, log energy computation, and Mel-frequency cepstral computation. The Librosa package was used for all feature extraction.
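A minimal sketch of this feature extraction with Librosa is given below, assuming a mono recording; the file name and number of MFCC coefficients are illustrative assumptions, while the frame and hop lengths follow the values above.

import numpy as np
import librosa

y, sr = librosa.load("speech_sample.wav", sr=None)   # assumed input recording

FRAME, HOP = 2048, 512                                # frame length and hop length from the text
rms = librosa.feature.rms(y=y, frame_length=FRAME, hop_length=HOP)              # energy (RMS)
zcr = librosa.feature.zero_crossing_rate(y, frame_length=FRAME, hop_length=HOP)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=FRAME, hop_length=HOP)

# Stack into a (frames, features) matrix for the voice emotion classifier
features = np.vstack([rms, zcr, mfcc]).T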
For EEG emotion recognition, we used the v3 Daisy module OpenBCI Cyton plus cEEGrid device to measure EEG signals. Raw data can be read for post-processing based on the OpenBCI data format and files. We used MATLAB to implement a signal processing tool for the raw OpenBCI data. The EEG data are composed of 24-bit signed values, and a total of eight channels were used for signal acquisition. Additionally, 16-bit signed values were used to store accelerometer data in the X, Y, and Z directions. The sampling rate of OpenBCI was set to 256 Hz by default and could be tuned using the software provided by the OpenBCI project. The raw output of OpenBCI is in the time domain; however, EEG data are usually analyzed in the frequency domain. The component frequencies of the EEG include eight major brain waves: delta (1-3 Hz), theta (4-7 Hz), low alpha (8-9 Hz), high alpha (10-12 Hz), low beta (13-17 Hz), high beta (18-30 Hz), low gamma (31-40 Hz), and mid gamma (41-50 Hz). These frequencies represent specific brain states such as high alertness, deep sleep, meditation, and anxiety [13]. The raw data from the OpenBCI board were translated from the time domain into the frequency domain for effective analysis.
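The paper's frequency-domain analysis was implemented in MATLAB; purely as an illustration of the same time-to-frequency translation, a Python sketch using Welch's method is shown below, with the band edges from the text and a 256 Hz sampling rate assumed.

import numpy as np
from scipy.signal import welch

BANDS = {"delta": (1, 3), "theta": (4, 7), "alpha_low": (8, 9), "alpha_high": (10, 12),
         "beta_low": (13, 17), "beta_high": (18, 30), "gamma_low": (31, 40), "gamma_mid": (41, 50)}

def band_powers(eeg_window, fs=256):
    # Estimate the power spectral density and average it within each band
    freqs, psd = welch(eeg_window, fs=fs, nperseg=fs * 2)
    return {name: psd[(freqs >= lo) & (freqs <= hi)].mean() for name, (lo, hi) in BANDS.items()}

# Example: one 2-second window from each of the eight channels (simulated data)
window = np.random.randn(8, 2 * 256)
powers = [band_powers(channel) for channel in window]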
For the face network, we used the Xception model, an extension of the Inception architecture. A pretrained model was preferred over training the network from scratch in order to benefit from the features the model had already learned. The 36 convolutional layers of the Xception architecture form the feature extraction base of the network and are structured into 14 modules, all of which have linear residual connections around them, except for the first and last modules. The Xception architecture is thus a stack of depthwise separable convolution layers with residual connections.
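A minimal sketch of reusing a pretrained Xception backbone with a small classification head is shown below; the input size, number of emotion classes, and training settings are assumptions, not the authors' exact configuration.

import tensorflow as tf

# Pretrained Xception backbone; the classification head is replaced for emotion recognition
base = tf.keras.applications.Xception(weights="imagenet", include_top=False,
                                      input_shape=(299, 299, 3), pooling="avg")
base.trainable = False                                   # keep the already-learned features

outputs = tf.keras.layers.Dense(6, activation="softmax")(base.output)   # six emotion classes assumed
face_model = tf.keras.Model(inputs=base.input, outputs=outputs)
face_model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])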
Algorithm 1 Overview of the multithreading system
1 | Initialization |
2 | Create one thread object for each modality using the threading function: |
3 | face_thread = myThreads() |
4 | voice_thread = myThreads() |
5 | eeg_thread = myThreads() |
6 | Launch the face, voice, and EEG modalities by binding each thread to its recording routine: |
7 | face_thread = threading.Thread(target=self.record) |
8 | voice_thread = threading.Thread(target=self.record) |
9 | eeg_thread = threading.Thread(target=self.record) |
10 | In the main function, start each thread so that the function passed as target executes in its own thread: |
11 | face_thread.start() |
12 | voice_thread.start() |
13 | eeg_thread.start() |
14 | return output |
15 | Output: emotion predictions from the face, voice, and EEG emotion recognition models. |
A pretrained two-layer long short-term memory (LSTM) model was used to capture the temporal structure of the voice. The first layer contained 256 units and the second 512 units, with a batch size of 32 and a default learning rate of 0.001. The sampling rate of the voice signal was 24,414 Hz. The features for the input signals were calculated over a window size of 2 s in steps of 0.2 s; the window size represents the number of samples and the duration of the audio segment. Finally, an output layer was added as a dense layer with four units and softmax activation, reflecting one of the predicted emotion categories.
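A minimal Keras sketch matching the layer sizes above (256 and 512 LSTM units, a four-unit softmax output, learning rate 0.001) is shown below; the number of time steps and features per step are assumptions for illustration.

import tensorflow as tf

N_STEPS, N_FEATURES = 10, 15     # assumed shape of one 2 s feature window

voice_model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(N_STEPS, N_FEATURES)),
    tf.keras.layers.LSTM(256, return_sequences=True),    # first LSTM layer
    tf.keras.layers.LSTM(512),                            # second LSTM layer
    tf.keras.layers.Dense(4, activation="softmax"),       # one unit per predicted emotion category
])
voice_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                    loss="categorical_crossentropy", metrics=["accuracy"])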
For EEG emotion recognition, a K-nearest neighbor (KNN) algorithm was used for model training. KNN is a supervised machine learning algorithm that identifies the K samples closest to an unknown sample point and determines the category of the unknown sample from the majority of those K samples. Two types of models were used to describe the general state of emotion: 1) the discrete emotion model, which includes basic emotions such as sadness, anger, fear, surprise, disgust, and happiness; and 2) the multidimensional emotion model of valence and arousal [16]. Valence represents the degree of delight of the individual and varies from negative to positive, whereas arousal represents the degree of activation of emotion and varies from calm to excited. KNN classifies the emotions as discrete values for the prediction result.
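A minimal scikit-learn sketch of this classifier is given below; the value of K and the randomly generated feature and label arrays are placeholders, not the study's data.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.random.randn(200, 64)       # placeholder EEG feature windows (e.g., band powers per channel)
y_train = np.random.randint(0, 6, 200)   # placeholder labels for six discrete emotions

knn = KNeighborsClassifier(n_neighbors=5)    # K = 5 assumed
knn.fit(X_train, y_train)

X_new = np.random.randn(1, 64)               # one incoming EEG feature window
predicted_emotion = knn.predict(X_new)[0]    # majority vote of the 5 nearest training samples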
Multithreading allows multiple threads to be created within a process and to run the same or different tasks at the same time. For our system model, we built a multithreaded system that synchronizes the processes of the three modalities so that they execute concurrently. We defined three threads, one per modality. After the emotion classification of each modality was in place, the parallel classification, that is, the multithreading aspect, was implemented by configuring a thread for each modality. The models were built into thread channels, in which the algorithm appropriately organized the models. In the main thread, a video stream was continuously captured from the user's webcam with the OpenCV library, which provides classifiers for frontal-face detection. The frame count was set to four and passed on to the second thread, where the voice input was processed every 2 s, before being passed to the third thread, where the EEG signals were processed, making the system viable for real-time application. The classification results were then passed back to the main thread, which appended them to a list retaining the four most recently detected emotions, after which the process repeated. The algorithm used for the multithreading of all modalities is given in Algorithm 1.
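A minimal Python sketch of this design, with one worker thread per modality pushing predictions to the main thread through a shared queue, is shown below; the placeholder predictors and timing values are assumptions standing in for the actual models.

import queue
import threading
import time

results = queue.Queue()

def run_modality(name, predict_fn, period_s):
    # Continuously run one modality's prediction loop and hand results to the main thread
    while True:
        results.put((name, predict_fn()))
        time.sleep(period_s)

# Placeholder predictors standing in for the face, voice, and EEG models
predictors = {"face": lambda: "neutral", "voice": lambda: "happy", "eeg": lambda: "calm"}

threads = [threading.Thread(target=run_modality, args=(name, fn, 2.0), daemon=True)
           for name, fn in predictors.items()]
for t in threads:
    t.start()

recent = []                                   # main thread keeps the latest predictions
for _ in range(6):                            # bounded loop for illustration
    modality, emotion = results.get()
    recent = (recent + [(modality, emotion)])[-4:]   # retain the four most recent emotions
    print(modality, emotion)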
In this section, we report the results of the study, as shown in Figs. 2 and 3. Real-time implementation was selected for better accuracy and prediction. We developed the application in the Python programming language on the PyCharm platform, building and executing the applications on the graphics processing unit (GPU) rather than the central processing unit (CPU). The implementation was achieved in two stages. First, we recognized emotions from each modality separately before combining all three modalities into one process. The accuracy of each modality was verified to ensure good results after combining them. A multithreading system was used to combine all three modalities using three different threads. We first executed the voice, followed by the face, and finally the EEG emotion recognition. For each modality, six emotions were displayed: anger, fear, happiness, surprise, sadness, and neutrality. The model performed well for all three modalities and ran continuously using the multithreading system. Fig. 2 shows the results of emotion recognition using unimodality, while Fig. 3 shows the results using multimodality.
A focused prediction result of the emotion recognition, presented for the three modalities, is shown on the right-hand side of Fig. 3. The predictions underlined in blue and red represent the voice and EEG emotion recognitions, respectively, as displayed on the terminal, while the face emotion recognition is presented on the screen on the left-hand side of the figure. We were able to capture, display, and predict emotions continuously and synchronously for all modalities. The model was accurate in predicting user emotions with a fast processing speed. The results obtained for real-time emotion recognition are shown in Fig. 4 for unimodality and multimodality. For the individual modalities, the accuracies obtained were 70.9, 54.3, and 63.1%, respectively, whereas combining all three modalities yielded a higher accuracy of approximately 80.1%.
This paper proposed a real-time implementation of face, voice, and EEG emotion recognition using a multithreaded system for synchronized continuous execution. We focused on real-time implementation to improve the task of emotion recognition. The results demonstrate the continuous implementation of all three modalities and support the idea that using multimodality helps increase the accuracy of emotion recognition. We found that multithreading allowed us to perform real-time voice, face, and EEG emotion recognition continuously. Although we focused on real-time implementation for face, voice, and EEG, we would like to explore other ways to implement multimodality in future work. Recent studies show that multimodal datasets can help increase the accuracy of emotion recognition; therefore, we aim to explore multimodal datasets and find ways to collect datasets suitable for real-time implementation.
The work reported in this paper was conducted during the sabbatical year of Kumoh National Institute of Technology in 2019.
was born in Kaduna State, Nigeria, in 1996. She received the B.S. degree in Mathematics from Delta State University, Abraka, in 2019. She is currently pursuing the M.S. degree in Electronic Engineering at the Kumoh National Institute of Technology (KIT), Gumi, South Korea, where she has been a Research Assistant with the Future Communications Systems Laboratory since 2021. Her research interests include data generation, emotion recognition, and machine learning.
received the B.S. degree in Electrical Engineering from Mapua Institute of Technology (MIT), Philippines, in 2017, and the M.S. degree in IT convergence engineering from the Kumoh National Institute of Technology (KIT), South Korea, in 2019. Since 2019, she has been a Researcher with the Future Communications Systems Laboratory, KIT. Her research interests include design and analysis of energy storage management system, embedded machine learning, voice-user interface, and data analysis.
received his Ph.D. from the Gwangju Institute of Science and Technology (GIST), South Korea in 2010. From 2010 to 2014, he was a Research Fellow (2010-2013) at the University of Hertfordshire, UK and then a Postdoctoral Researcher (2013-2014) at the Institut National de la Recherche Scientifique (INRS), Canada. Since Sep. 2014 he has been an Assistant Professor at the Kumoh National Institute of Technology (KIT), South Korea. His research interests include statistical analysis, machine learning, and optimization.
is a full-time professor in the Department of Business Administration of Kumoh National Institute of Technology.
Miracle Udurume1, Angela Caliwag1, Wansu Lim1*, and Gwigon Kim2*
1Department of Aeronautics Mechanical and Electronic Convergence Engineering, Kumoh National Institute of Technology, Gumi 39177, Republic of Korea
2Department of Business Administration, Kumoh National Institute of Technology, Gumi 39177, Republic of Korea