Journal of information and communication convergence engineering 2024; 22(2): 145-152
Published online June 30, 2024
https://doi.org/10.56977/jicce.2024.22.2.145
© Korea Institute of Information and Communication Engineering

Hyeonbin Han 1, Keun Young Lee 2, Seong-Yoon Shin 3, Yoseup Kim 4, Gwanghyun Jo 1, Jihoon Park 5*, and Young-Min Kim 6*

1 Department of Mathematical Data Science, Hanyang University ERICA, Ansan, Republic of Korea
2 Independent scholar, Republic of Korea
3 School of Computer Science and Engineering, Kunsan National University, Gunsan 54150, Republic of Korea
4 Digital Healthcare Research Center, Deltoid Inc., 186 Jagok-ro, Seoul, Republic of Korea
5 Division of Vocal Music, Nicedream Music Academy, 48 Eoun-ro, Daejeon, Republic of Korea
6 Digital Health Research Division, Korea Institute of Oriental Medicine, Daejeon, Republic of Korea
Correspondence to: Jihoon Park (e-mail: nicedreamusic@gmail.com), Division of Vocal Music, Nicedream Music Academy
Young-Min Kim (e-mail: irobo77@kiom.re.kr), Digital Health Research Division, Korea Institute of Oriental Medicine
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Closed quotient (CQ) represents the time ratio for which the vocal folds remain in contact during voice production. Because CQ serves as an important reference point in vocal training for professional singers, it has conventionally been measured mechanically or electrically, either by inverse filtering of airflows captured by a circumferentially vented mask or by post-processing of electroglottography waveforms. In this study, we introduce a novel algorithm that predicts CQ values from audio signals alone, eliminating the need for mechanical or electrical measurement techniques. Our algorithm is based on a gated recurrent unit (GRU)-type neural network. To enhance efficiency, the audio signal is pre-processed using a pitch feature extraction algorithm; GRU layers then extract features, and a dense layer produces the final prediction. The Results section reports the mean squared error between the predicted and measured CQ values, demonstrating the capability of the proposed algorithm to predict CQ.
Keywords: Vocal phonation, GRU, Artificial neural network, Electroglottography
Recently, attempts have been made in the phonetics community to quantitatively analyze the vibratory behavior of the human vocal folds during phonation. A common method employs a windowed Fourier transform (spectrogram) to examine the audio waveform produced by vocal activity; this facilitates direct visualization of voice-quality attributes such as harmonicity, kurtosis, and spectral centroid [1,2]. Additionally, some researchers have used mechanical or electrical devices to study vocal fold dynamics, focusing on metrics such as subglottal pressure and the contact area of the vocal folds. For example, circumferentially vented pneumotach split-flow air masks can measure pressure waveforms [3], enabling the analysis of nasal/oral aerodynamics. Although an air-mask system is a highly effective and direct tool for voice-quality evaluation, it is cumbersome to use. In contrast, electroglottography (EGG) provides a convenient and noninvasive technique for visualizing vocal fold vibrations during voice production [4,5]. By placing two electrodes on the neck near the thyroid cartilage and passing a low-amperage current between them, variations in the vocal fold contact area can be captured over the glottal cycle. This is based on the principle that closed vocal folds allow higher electrical admittance across the larynx, which results in a higher current between the electrodes. EGG therefore simplifies the evaluation of vocal quality by visualizing variations in the contact area of the vocal folds during phonation.
One of the crucial metrics extracted from the EGG signal is the closed quotient (CQ), which represents the time ratio during which the vocal folds remain in contact throughout voice production [6,7]. Various theories have emphasized the significance of CQ in voice analysis. Typically, a higher CQ is associated with voices perceived as stronger or more pressed and is attributed to the increased duration of vocal fold contact during phonation, which yields richer and more vibrant sounds. Consequently, a higher CQ is more prevalent when the chest register is used than when the head register is used. CQ is also useful in clinical diagnostics, because alterations in vocal fold closure patterns owing to lesions or paralysis affect typical CQ values [8]. It is generally acknowledged that CQ decreases as the fundamental frequency increases. In summary, the insights gained from analyzing CQ serve as important reference points in vocal training for professional singers.
Notwithstanding the critical role of CQ values in vocal analysis, obtaining these values conventionally requires mechanical or electrical measurements: either inverse filtering of airflows captured by a circumferentially vented mask or post-processing of EGG waveforms. In this study, we introduce a novel algorithm that predicts CQ values from audio signals only, eliminating the need for mechanical or electrical measurement techniques. Our approach begins by constructing a dataset that pairs vocal audio waveforms with their corresponding CQ values. We then develop a machine-learning algorithm that leverages supervised learning for training. Recently, significant developments have been achieved in the artificial neural network (ANN) community; see, for example, [9,10] for computer vision, [11,12] for reinforcement learning, and [13,14] for natural language processing. In particular, recurrent neural networks (e.g., LSTM [15-17] and GRU [18-20]) have been demonstrated to be effective in handling time-series data. Therefore, we employ neural network architectures that incorporate gated recurrent unit (GRU) layers in the CQ prediction algorithm. To optimize performance, we preprocess the audio input using a pitch feature extraction algorithm [21,22] before feeding it into a GRU-type neural network. In the Results section, we report the performance of the proposed algorithm: for all tests, the mean squared error (MSE) between the predicted and measured CQ values was below 8E-03, indicating the capability of the proposed algorithm to analyze the vibratory behavior of the vocal fold contact area.
The remainder of this paper is organized as follows: In Section 2, we describe the GRU-based neural network algorithm for predicting the CQ values. The results are presented in Section 3. Finally, the conclusions are presented in Section 4.
In this section, we describe the development of a novel algorithm for predicting CQ values. The proposed algorithm uses a GRU-based neural network and is trained on paired audio and EGG signals; after training, the model can predict CQ values from audio signals alone. Subsection A outlines the data collection methodology and the process of extracting the CQ value from an EGG waveform. Subsection B elaborates on the GRU-type neural networks and the pitch feature extraction technique for audio signals. The pitch algorithm reduces the length of the audio signal; GRU layers then extract features from the pitch-reduced signal, and a dense layer produces the prediction. A comprehensive schematic of this process is shown in Fig. 1.
In this subsection, we describe the data collection process and the extraction of CQ values from the EGG signal. Data were collected from vocal productions of the vowel /a/ by 22 individuals, each of whom phonated for approximately 10 s in a relaxed state. The audio signals were captured using a condenser microphone, while the EGG signals were recorded with two electrodes attached to the neck near the vocal folds using the EG2-PCX2 system (Glottal Enterprises). Each individual's audio and EGG recordings were divided into 0.1-s segments, yielding 2,076 samples. (A_i, EGG_i) denotes the audio and EGG waveforms of sample i = 1, ..., 2076. Fig. 2 shows typical examples of A_i and EGG_i. We emphasize that the EG2-PCX2 system records synchronized audio and EGG waveforms during phonation; therefore, the waveforms of A_i and EGG_i in Fig. 2 have identical fundamental frequencies, with almost completely matched temporal locations of the (local) minimum amplitudes in each period.
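For illustration, the following minimal Python sketch segments a synchronized audio/EGG recording into non-overlapping 0.1-s windows; the sampling-rate handling and array names are our assumptions, since these details are not specified in the text.

```python
import numpy as np

def segment_pairs(audio, egg, sr, win_sec=0.1):
    """Return a list of (A_i, EGG_i) windows cut from synchronized recordings."""
    win = int(round(win_sec * sr))             # samples per 0.1-s window
    n = min(len(audio), len(egg)) // win       # number of complete windows
    return [(audio[k * win:(k + 1) * win], egg[k * win:(k + 1) * win]) for k in range(n)]

# Example: a 10-s recording yields roughly 100 (A_i, EGG_i) samples per speaker.
```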
Now, we describe the process of extracting the CQ value from an EGG waveform (see Fig. 3). The CQ value is defined as

CQ = T_closed / T_total,     (1)

where T_closed is the duration of the closed vocal fold region and T_total is the total duration of the analyzed EGG segment. Here, the closed vocal fold region is the portion of the signal in which the EGG waveform exceeds the tolerance value (defined as 50% of the maximum amplitude of the EGG). By the definition in Eq. (1), the CQ value lies between zero and one. For convenience, CQ_i denotes the CQ value extracted from EGG_i. Finally, we define the dataset in the form of samples (X_i, Y_i), where the input variable X_i is the audio signal A_i and the target variable Y_i is CQ_i obtained by the extraction process applied to the EGG waveform.
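As a minimal sketch of Eq. (1) for one analysis window, the function below thresholds the EGG waveform at 50% of its maximum amplitude; applying the threshold over the whole window (rather than cycle by cycle) is a simplifying assumption on our part.

```python
import numpy as np

def closed_quotient(egg, tol_ratio=0.5):
    egg = np.asarray(egg, dtype=float)
    tol = tol_ratio * egg.max()          # tolerance: 50% of the maximum EGG amplitude
    closed = egg > tol                   # closed vocal fold region
    return closed.mean()                 # fraction of time closed, always in [0, 1]
```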
In this subsection, we describe the development of the GRU-based neural network architecture and its training process. To obtain high efficiency, the audio signal X_i is preprocessed using the pitch feature extraction algorithm [17,18]. For completeness, we briefly describe the pitch algorithm.
In general, pitch extraction from an audio signal amounts to locating the peaks of the frequency spectrum obtained from a short-time Fourier transform (STFT). To estimate the peak locations accurately, parabolic interpolation is applied: a quadratic polynomial is fitted near each spectral peak, and the maximum of the quadratic is taken as the peak. Finally, peaks whose magnitudes exceed 10% of the maximum magnitude of the frequency spectrum are retained as pitch values. We use the notation

P_i = Pitch(X_i)     (2)

for the features extracted by the pitch algorithm. Here, P_i is vector-valued data whose length is smaller than that of X_i. Therefore, regarding P_i as a reduced version of the audio signal, we can employ a neural network algorithm efficiently.
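One possible realization of this step is librosa's piptrack, which locates STFT peaks by parabolic interpolation and keeps candidates above 10% of the frame maximum (threshold=0.1). Whether the authors used this routine or a custom implementation is not stated, so the sketch below is illustrative only.

```python
import numpy as np
import librosa

def pitch_features(audio, sr, n_fft=1024, hop_length=256):
    pitches, mags = librosa.piptrack(y=audio, sr=sr, n_fft=n_fft,
                                     hop_length=hop_length, threshold=0.1)
    # Keep the dominant candidate per frame: one pitch value per STFT frame, so the
    # feature P_i is much shorter than the raw waveform X_i.
    idx = mags.argmax(axis=0)
    return pitches[idx, np.arange(pitches.shape[1])].astype(np.float32)
```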
We now describe GRU-based neural networks [16]. Given an input time series x, the hidden state h_t of the GRU layer is obtained by sequentially computing Eqs. (3)-(6) for t = 1, 2, ..., N, where N is the length of x. In a standard formulation, the computations are

r_t = σ(W_ir x_t + W_hr h_{t-1} + b_r),     (3)
z_t = σ(W_iz x_t + W_hz h_{t-1} + b_z),     (4)
n_t = tanh(W_in x_t + r_t ⊙ (W_hn h_{t-1}) + b_n),     (5)
h_t = (1 - z_t) ⊙ n_t + z_t ⊙ h_{t-1}.     (6)

Here, r_t, z_t, and n_t are the so-called reset, update, and new gates, respectively; σ is the sigmoid function; ⊙ denotes element-wise multiplication; and the W and b terms are trainable weights and biases. An important parameter of the GRU layer is the hidden size, which is the number of features in the hidden state h. One advantage of GRU layers is the convenience of stacking: the hidden state of one layer serves as the input to the next. We introduce four types of GRU-based neural network architectures, combined with the pitch algorithm, to predict the CQ value:
Definition 1. GRU-based neural networks.
1) GRU 1L utilizes a single-layer GRU followed by a dense layer for regression.
2) GRU 2L employs two stacked GRU layers followed by a dense layer for regression.
3) BiGRU 1L uses a single-layer bidirectional GRU followed by a dense layer for regression.
4) CONV1D-GRU applies a one-dimensional convolutional layer, followed by a single-layer GRU and a dense layer for regression.
The four types of GRU-based neural networks in Definition 1 have one or two GRU layers, possibly with an accompanying one-dimensional convolutional layer. Here, all the neural networks listed in Definition 1 have a dense layer at the final stage of the CQ regression. Typically, GRU-type neural networks are trained using a backpropagation algorithm.
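For concreteness, the following hedged PyTorch sketches show how the four architectures in Definition 1 can be instantiated. nn.GRU implements Eqs. (3)-(6) internally; the input size (1, for a univariate pitch sequence) and the Conv1d settings are our assumptions, while the hidden size of 10 follows the Results section.

```python
import torch
import torch.nn as nn

class GRURegressor(nn.Module):
    """Covers GRU 1L, GRU 2L, and BiGRU 1L via num_layers and bidirectional."""
    def __init__(self, hidden_size=10, num_layers=1, bidirectional=False):
        super().__init__()
        self.gru = nn.GRU(input_size=1, hidden_size=hidden_size, num_layers=num_layers,
                          batch_first=True, bidirectional=bidirectional)
        out_dim = hidden_size * (2 if bidirectional else 1)
        self.head = nn.Linear(out_dim, 1)              # dense layer for CQ regression

    def forward(self, x):                              # x: (batch, seq_len, 1)
        out, _ = self.gru(x)
        return self.head(out[:, -1, :]).squeeze(-1)    # regress from the last time step

class Conv1dGRURegressor(nn.Module):
    """CONV1D-GRU: a 1-D convolution, then a single-layer GRU and a dense layer."""
    def __init__(self, hidden_size=10, channels=8, kernel_size=5):
        super().__init__()
        self.conv = nn.Conv1d(1, channels, kernel_size, padding=kernel_size // 2)
        self.gru = nn.GRU(input_size=channels, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):                              # x: (batch, seq_len, 1)
        c = torch.relu(self.conv(x.transpose(1, 2)))   # (batch, channels, seq_len)
        out, _ = self.gru(c.transpose(1, 2))           # back to (batch, seq_len, channels)
        return self.head(out[:, -1, :]).squeeze(-1)

# Example instantiations:
# GRU 1L:     GRURegressor(num_layers=1)
# GRU 2L:     GRURegressor(num_layers=2)
# BiGRU 1L:   GRURegressor(num_layers=1, bidirectional=True)
# CONV1D-GRU: Conv1dGRURegressor()
```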
We are now in a position to state the CQ prediction algorithm. Let X_i be a typical audio signal. First, we apply the pitch algorithm to reduce the length of the audio signal, obtaining the pitch feature P_i. Next, after selecting a suitable GRU-based architecture from Definition 1, we obtain the regression result by feeding the pitch features into the network. We use the notation N(·, θ) for the neural network, where θ denotes the collection of trainable parameters of the GRU-type neural network. The CQ prediction algorithm, which we call Pitch-GRU, is summarized below.
Algorithm Pitch-GRU (input: X_i; output: Ŷ_i)
1. Extract pitch features from the input audio sample: P_i = Pitch(X_i).
2. Select one of the following GRU configurations to define the neural network N(P_i, θ):
GRU 1L
GRU 2L
BiGRU 1L
CONV1D-GRU
3. Predict the CQ value by Ŷ_i = N(P_i, θ).
Note that, alternatively, we can feed the raw audio signal X_i directly into N(·, θ):

Algorithm 2 (input: X_i; output: Ŷ_i)
1. Select one of the following GRU configurations to define the neural network N(X_i, θ):
GRU 1L
GRU 2L
BiGRU 1L
CONV1D-GRU
2. Predict the CQ value by Ŷ_i = N(X_i, θ).
Pitch-GRU is the primary proposed method; Algorithm 2 is used for comparison to highlight the performance gain provided by the pitch algorithm.
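To make the two inference paths concrete, the sketch below contrasts Pitch-GRU with Algorithm 2; the helper names (pitch_features, and a trained GRU-type regressor model such as GRURegressor) are illustrative assumptions rather than identifiers from the paper.

```python
import numpy as np
import torch

def predict_cq_pitch_gru(x_i, sr, model, pitch_features):
    """Pitch-GRU: P_i = Pitch(X_i), then Y_hat_i = N(P_i, theta)."""
    p_i = pitch_features(x_i, sr)                                  # step 1: pitch features
    p_i = torch.tensor(p_i, dtype=torch.float32).view(1, -1, 1)    # (batch, seq_len, 1)
    with torch.no_grad():
        return model(p_i).item()                                   # step 2: predicted CQ

def predict_cq_algorithm2(x_i, model):
    """Algorithm 2: feed the raw audio X_i directly, Y_hat_i = N(X_i, theta)."""
    x_i = torch.tensor(np.asarray(x_i), dtype=torch.float32).view(1, -1, 1)
    with torch.no_grad():
        return model(x_i).item()
```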
In this section, we report the performance of the Pitch-GRU algorithm. Note that Algorithm 2 (which does not use pitch extraction) was used for the comparison.
Data were collected from vocal productions of the vowel /a/ by 22 individuals, yielding 2,076 paired audio and EGG samples. These samples were divided into training, validation, and test sets consisting of 1,230, 412, and 434 samples, respectively.
For all the tests, the hidden sizes in the GRU layers were set to 10. The loss was defined as the mean squared error (MSE) between Y_i and Ŷ_i. The GRU-based neural networks were trained using Adam optimization [23] with a learning rate of 0.1. All the tests were conducted on an NVIDIA RTX A5000 GPU.
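A hedged training-loop sketch consistent with this setup (MSE loss, Adam with learning rate 0.1, hidden size 10) is shown below; the number of epochs and the batching of variable-length pitch sequences are assumptions not specified in the text.

```python
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, epochs=100, lr=0.1):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=lr)      # Adam optimizer, lr = 0.1
    loss_fn = nn.MSELoss()                                  # MSE between Y_i and Y_hat_i
    for epoch in range(epochs):
        model.train()
        for p, y in train_loader:                           # p: (batch, seq_len, 1), y: (batch,)
            opt.zero_grad()
            loss = loss_fn(model(p.to(device)), y.to(device))
            loss.backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            val = sum(loss_fn(model(p.to(device)), y.to(device)).item()
                      for p, y in val_loader) / len(val_loader)
        print(f"epoch {epoch}: validation MSE = {val:.2e}")
    return model
```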
Table 1 reports the performance of Pitch-GRU in terms of the number of parameters, losses, and CPU time. It is noteworthy that in all cases the test errors were at most 8.0E-3, indicating a close match between the predicted and actual CQ values. The test loss was smallest when GRU 2L was employed, whereas the CPU time was shortest when GRU 1L was used.
Now, we compare Pitch-GRU and Algorithm 2 (Tables 1 and 2). Because Algorithm 2 does not use pitch feature extraction, its CPU time is longer: in Pitch-GRU, pitch extraction reduces the length of the audio signal, which yields a shorter input sequence for the neural networks. In terms of accuracy, Pitch-GRU yields lower errors than Algorithm 2, showing that the pitch algorithm captures the important features of the audio signals. Therefore, we conclude that the proposed Pitch-GRU algorithm is computationally efficient while achieving high CQ prediction accuracy.
It is reasonable to ask whether other feature extraction algorithms enhance GRU-type neural networks for predicting CQ values. Therefore, we modified the first step of Pitch-GRU by replacing the pitch algorithm with MFCC [24], Chroma [25], ZCR [26], or RMS_E [27] features. The test losses of the resulting algorithms are listed in Table 3 (a short extraction sketch follows the table). We observe that the pitch algorithm yields the most accurate results with GRU-type neural networks, which validates our selection of the pitch algorithm for audio feature extraction.
Table 3. Test losses of GRU-type neural networks with features extracted by the MFCC, Chroma, ZCR, Pitch, and RMS_E algorithms
Model | MFCC | Chroma | ZCR | Pitch | RMS_E |
---|---|---|---|---|---|
GRU 1L | 1.5E-02 | 9.4E-03 | 8.7E-03 | 8.0E-03 | 8.5E-03 |
GRU 2L | 1.3E-01 | 8.8E-03 | 9.1E-03 | 7.6E-03 | 8.7E-03 |
BiGRU 1L | 1.7E-02 | 9.1E-03 | 9.3E-03 | 8.0E-03 | 8.7E-03 |
CONV1D-GRU | 8.9E-03 | 8.7E-03 | 8.3E-03 | 7.6E-03 | 9.2E-03 |
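For reference, the alternative feature extractors compared in Table 3 are available in librosa; the specific parameters used below (e.g., the number of MFCC coefficients) are assumptions, not values from the paper.

```python
import librosa

def alternative_features(audio, sr, kind="mfcc"):
    if kind == "mfcc":
        feats = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)
    elif kind == "chroma":
        feats = librosa.feature.chroma_stft(y=audio, sr=sr)
    elif kind == "zcr":
        feats = librosa.feature.zero_crossing_rate(audio)
    elif kind == "rms_e":
        feats = librosa.feature.rms(y=audio)          # root-mean-square energy
    else:
        raise ValueError(f"unknown feature type: {kind}")
    return feats.T                                    # (n_frames, n_features) for the GRU input
```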
Finally, we compared the performance of Pitch-GRU employed with GRU 2L against that of tree-ensemble-type algorithms. In Table 4, we report the losses and CPU time for Pitch-GRU, random forest [28,29], and XGBoost [30]; a brief baseline sketch follows the table. The test loss of the Pitch-GRU algorithm was lower than those of random forest and XGBoost.
Table 4. Comparison of Pitch-GRU with the random forest and XGBoost algorithms
Model | Training loss | Validation loss | Test Loss | CPU time |
---|---|---|---|---|
Pitch-GRU | 7.7E-3 | 7.7E-3 | 7.6E-3 | 107.4 s |
Random Forest | 3.5E-4 | 7.8E-2 | 7.4E-2 | 6.4 s |
XGBoost | 2.1E-3 | 7.6E-2 | 7.5E-2 | 2.5 s |
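The tree-ensemble baselines in Table 4 can be set up along the following lines; tree models require fixed-length inputs, so padding or truncating the pitch features to a common length is our assumption, and the hyperparameters shown are illustrative defaults rather than values reported in the paper.

```python
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

def fit_baselines(P_train, y_train):
    """P_train: (n_samples, n_features) array of fixed-length pitch features."""
    rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(P_train, y_train)
    xgb = XGBRegressor(n_estimators=100, learning_rate=0.1).fit(P_train, y_train)
    return rf, xgb
```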
In this study, we developed a new method, called Pitch-GRU, to predict CQ from audio signals during vocal phonation. Data were collected from vocal productions of the vowel /a/ by 22 individuals, and the CQ values were extracted from the EGG waveforms. By pairing the audio signals with the CQ values, we trained GRU-based neural networks using supervised learning. To enhance efficiency, the audio signal was preprocessed using the pitch feature extraction algorithm. The results revealed that the MSE between the predicted and real CQ was below 9E-03 in all cases, demonstrating the capability of the proposed algorithm to analyze vocal fold behavior during phonation. We also discussed potential applications of the proposed Pitch-GRU algorithm: because it can predict the CQ in real time, it can serve as a reference tool for educating professional singers or for vocal fold exercises for patients with vocal fold disorders. In future work, we will consider different vowels such as /i/ or /u/.
Table 1. Number of parameters, training/validation/test losses, and CPU time of Pitch-GRU with different neural networks
Model | Parameter | Training Loss | Validation loss | Test Loss | CPU time |
---|---|---|---|---|---|
GRU 1L | 401 | 7.6E-3 | 7.1E-3 | 8.0E-3 | 99.3 s |
GRU 2L | 1061 | 7.7E-3 | 6.8E-3 | 7.6E-3 | 107.4 s |
BiGRU 1L | 801 | 7.7E-3 | 5.1E-3 | 8.0E-3 | 104.8 s |
CONV1D-GRU | 8627 | 7.5E-3 | 7.3E-3 | 7.6E-3 | 146.6 s |
Table 2. Number of parameters, training/validation/test losses, and CPU time of Algorithm 2 with different neural networks (GRU 1L, GRU 2L, BiGRU 1L, CONV1D-GRU)
Model | Parameter | Training Loss | Validation loss | Test Loss | CPU time |
---|---|---|---|---|---|
GRU 1L | 401 | 7.9E-3 | 5.5E-3 | 8.8E-3 | 526.9 s |
GRU 2L | 1061 | 7.8E-3 | 6.6E-3 | 9.0E-3 | 649.4 s |
BiGRU 1L | 801 | 7.8E-3 | 6.9E-3 | 9.0E-3 | 610.7 s |
CONV1D-GRU | 8627 | 7.2E-3 | 9.9E-3 | 9.6E-3 | 240.2 s |
This study was supported by a grant (NRF KSN1824130) from the Korea Institute of Oriental Medicine.
Hyeonbin Han
has been pursuing a B.S. in the Department of Mathematical Data Science, Hanyang University ERICA, since 2023. His research interests include computer vision, deep reinforcement learning, and chatbots.
Keun Young Lee
received his Ph.D. from the Department of Mathematical Sciences, KAIST, in 2009. From 2017 to 2020, he was a faculty member of the Department of Mathematics, Sejong University, Republic of Korea. Since 2020, he has been an independent scholar in the Republic of Korea. His research interests include Banach space theory, machine learning, and fuzzy theory.
Seong-Yoon Shin
received his M.S. and Ph.D. from the Department of Computer Information Engineering, Kunsan National University, Gunsan, Republic of Korea, in 1997 and 2003, respectively. Since 2006, he has been a professor in the School of Computer Science and Engineering, Kunsan National University. His research interests include image processing, computer vision, and virtual reality. He can be contacted at s3397220@kunsan.ac.kr.
Yoseup Kim
received his B.S. in Materials Science and Engineering, Business and Technology Management (double major), and Chemical and Biological Engineering (minor) from KAIST, Daejeon, Republic of Korea, in 2015. He received his M.D. from Yonsei University College of Medicine in 2024 and has been leading several government R&D projects as CEO and principal investigator at Deltoid Inc. since 2020. His research interests include motion and visual analysis in the field of digital healthcare.
Gwanghyun Jo
received his B.S., M.S., and Ph.D. from the Department of Mathematical Sciences, KAIST, Daejeon, Republic of Korea, in 2018. From 2018 to August 2019, he was a postdoctoral researcher at KAIST. He was a faculty member of the Department of Mathematics, Kunsan National University, during 2019-2023. He is currently a faculty member of the Department of Mathematical Data Science, Hanyang University ERICA. His research interests include numerical analysis and the simulation of various fluid problems originating from hemodynamics and petroleum engineering.
Jihoon Park
graduated with a Bachelor of Music in Classic Vocal from the Department of Music, Chungnam University. He also completed the Intermediate Course for Voice Correction Specialist certified by the Korean Vocology Association (KOVA). Since 2014, he has been performing as a solo tenor and actor at prestigious venues such as the Korea National Theater, Daejeon Art Center, Daejeon Observatory, and the Daegu International Musical Festival (DIMF). He has also organized numerous classical and jazz concerts, including "Playing Love" and "The moment". He has held directorial positions at Nicedream Music Academy and Baekyang Studio, and he is an active member of KOVA. His research interests include classical vocal phonation, vocal education, and the scientific analysis of vocal behavior.
Young-Min Kim
received his B.S. in Mechanical Engineering from Yonsei University, Seoul, Republic of Korea, in 1999; his M.S. in Mechanical Engineering from POSTECH, Republic of Korea, in 2001; and his Ph.D. in Mechanical Engineering from KAIST, Republic of Korea, in 2011. From 2002 to 2006, he was a Research Scientist with the Human-Welfare Robotic System Research Center, KAIST, Republic of Korea. Since 2011, he has been a Principal Researcher with the Digital Health Research Division, Korea Institute of Oriental Medicine, Republic of Korea. His research interests include medical devices for personalized healthcare, wearable sensors for daily health monitoring, sophisticated human-robot interface (HRI) technology, and innovative HRI applications.