Journal of information and communication convergence engineering 2023; 21(2): 159-166
Published online June 30, 2023
https://doi.org/10.56977/jicce.2023.21.2.159
© Korea Institute of Information and Communication Engineering
Correspondence to: Dongkeun Kim (E-mail: dkim@smu.ac.kr)
Department of Intelligent Engineering Informatics for Human, College of Convergence Engineering, Sangmyung University, Seoul, 03016, Republic of Korea
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
The COI barcode is a sequence of approximately 650 bp at the 5' terminal of the mitochondrial Cytochrome c Oxidase subunit I (COI) gene. As an effective deoxyribonucleic acid (DNA) barcode, it is widely used for the taxonomic identification and evolutionary analysis of species. We created a CNN-LSTM hybrid model by combining the gene features extracted by a Long Short-Term Memory (LSTM) network with the feature maps obtained by a Convolutional Neural Network (CNN). The model was trained on a training set of 278 samples covering 15 genera from two orders and was compared with K-means clustering, Support Vector Machines (SVM), and a single CNN classification model; the CNN-LSTM hybrid model achieved 94% accuracy on the test set, which contained 118 samples. After we augmented the training set with additional samples and four genera, covering four orders, the classification accuracy on the test set reached 100%. This study also proposes calculating the cosine similarity between the training and test sets to make an initial assessment of the reliability of the predicted results and to help discover new species.
Keywords: COI gene, DNA barcode, CNN-LSTM hybrid, Species classification
Mitochondrial deoxyribonucleic acid (DNA) is the genetic material of mitochondria, which are important organelles that produce energy (adenosine triphosphate) for cells. Because mitochondria are inherited mainly through egg cells, mitochondrial DNA has strong maternal genetic characteristics, which enhances the genetic specificity of the species. As shown in Fig. 1, the Cytochrome c Oxidase subunit I (COI) barcode is a fragment of about 650 bp (a base pair is a basic unit of double-stranded nucleic acids consisting of two nucleobases bound to each other by hydrogen bonds) at the 5' terminal of the COI gene in mitochondrial DNA. The evolutionary rate of the COI gene is high, and the variation between species is generally obvious, whereas the sequence is relatively conserved within a species.
Hebert conducted a series of confirmatory studies [1,2,3]. The first experiment used the COI gene to classify several species into their phyla and orders, and to classify several Lepidoptera insects into their own species. The second experiment selected about 2200 species from 11 animal phyla; after partial sequences of the COI genes were compared within species and between closely related species, more than 90% of the species showed significantly greater interspecific differences than intraspecific differences. The third experiment was performed on North American birds, whose taxonomy is comparatively well studied; most species could be distinguished by comparing their COI gene sequences.
Traditional species identification requires familiarity with the morphological characteristics of multiple groups. Manual classification therefore requires large investments of resources and time. With the development of next-generation sequencing technology, acquisition of the COI gene has become faster and easier, and the COI gene is widely used as an effective DNA barcode for taxonomic identification. It can greatly reduce manpower and, at the same time, performs better [4] in identifying species that are difficult to distinguish, such as small insects, or life stages with inconspicuous morphological features, such as larval stages. This approach will facilitate the development of species identification methods. Many related research projects have been launched, including the All Leps Barcode of Life and the Fish Barcode of Life Initiative.
Statistical methods that construct a phylogenetic tree by genetic comparison can be used to understand the evolutionary history of organisms and to distinguish between species. The neighbor-joining method determines the adjacent taxa that have the closest genetic distance [5]. The maximum likelihood method selects the phylogenetic tree with the highest likelihood value. These methods require extensive computation to establish differentiation systems; therefore, they are suitable only for analyzing a limited amount of data.
With the development of artificial neural networks, classification processes have become faster and more efficient. Tampuu et al. developed the ViraMiner model, which contains two branches based on a Convolutional Neural Network (CNN), to predict the likelihood that an input DNA sequence is a virus [6]. Singh et al. utilized a deep bidirectional Long Short-Term Memory (LSTM) network to predict the origin of replication sequences in organisms [7]. Gunasekaran et al. used a CNN-LSTM hybrid model for nine types of viruses, including COVID, SARS, MERS, dengue, hepatitis, and influenza; the model achieved a high accuracy of 93.13% [8]. These models demonstrate that artificial neural networks perform well in the field of biological genetic information.
We used the GenBank nucleic acid sequence database in the National Center for Biotechnology Information (NCBI) to retrieve relevant genetic information in two orders,
The one-hot encoding method can be used to encode nucleotides [10,11], so we used four types of vectors to represent
The K-means algorithm is a classic partition-based clustering method. The basic steps of the algorithm are as follows: (1) k points in the space are chosen as the initial centroids, (2) each object is assigned to its nearest centroid, and (3) the centroid of each cluster is updated, and steps (2) and (3) are repeated until the clustering result no longer changes. However, clustering does not perform well when the data are unbalanced.
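The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration and not the implementation used in the study: the function name `kmeans`, the choice of the first k points as initial centroids, and the toy 2-D points are all assumptions made for the example.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iter=100):
    """Plain K-means on a list of equal-length numeric tuples."""
    # Step 1 (assumption for illustration): the first k points are the initial centroids.
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each object to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids will not move again
        labels = new_labels
        # Step 3: move each centroid to the mean of the objects assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical toy data: two well-separated groups of 2-D points.
points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0), (1.0, 0.0)]
labels, centroids = kmeans(points, k=2)
```

With unbalanced classes, a large cluster can pull centroids away from small ones, which is one way to picture the weakness noted above.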
The Support Vector Machine (SVM) method is effective for binary classification problems; it creates a decision boundary that is the maximum-margin hyperplane. SVM parameters, such as the kernel and penalty parameters, have a significant influence on the complexity and performance of the prediction models [12]. SVM can also perform nonlinear classification using the kernel method.
A CNN is a multilayer artificial neural network that uses weight sharing and gradient back-propagation to train the model [13]. A CNN mainly consists of an input layer, a convolutional layer that applies kernels to extract features, a Rectified Linear Unit (ReLU) layer, a pooling layer for dimensionality reduction, a fully connected layer that combines local features for classification, and an output layer that uses the softmax activation function to produce confidence scores for the different categories.
The LSTM network [14] can memorize values for an indefinite length of time using four unique gates, as shown in Fig. 2(a). As shown in formula (1), the forget gate limits the impact of the previous state on the present state; as shown in formulas (2) and (3), the input gate introduces new inputs; as shown in formula (4), the cell state is updated; and as shown in formulas (5) and (6), the output gate determines the output value of the unit. In formulas (1)-(6), xt is the input at time t; bf, bi, bc, and bo are the biases of the forget gate, input gate, cell state update, and output gate, respectively; wf, wi, wc, and wo are the network weights of the forget gate, input gate, cell state update, and output gate, respectively; ft, it, and ot are the results of the forget gate, input gate, and output gate at time t, respectively; ct-1, ct, and
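Formulas (1)-(6) are not reproduced in this text. For orientation, the standard LSTM formulation, written with the symbols defined above, is sketched below. The mapping of each line to the formula numbers is an assumption based on the description, and the hidden state h, the candidate cell state \tilde{c}_t, the sigmoid σ, and the element-wise product ⊙ are standard notation rather than symbols taken from the source.

```latex
\begin{align}
f_t &= \sigma\left(w_f \cdot [h_{t-1}, x_t] + b_f\right) \tag{1} \\
i_t &= \sigma\left(w_i \cdot [h_{t-1}, x_t] + b_i\right) \tag{2} \\
\tilde{c}_t &= \tanh\left(w_c \cdot [h_{t-1}, x_t] + b_c\right) \tag{3} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \tag{4} \\
o_t &= \sigma\left(w_o \cdot [h_{t-1}, x_t] + b_o\right) \tag{5} \\
h_t &= o_t \odot \tanh(c_t) \tag{6}
\end{align}
```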
We referred to other studies on gene classification and found that CNNs are highly efficient classifiers. As described in [15], a conventional three-layer CNN model was developed to predict the effects of non-coding variants from genomic sequences only. Gene classification models do not require complex convolutional structures. In our experimental data, the input to the CNN was only a 27 × 27 × 4 matrix; therefore, we decided to use a CNN as our gene classifier. We further optimized the performance of the CNN by adjusting its hyperparameters and achieved an accuracy of 91% on the test set. However, CNN convolutions typically require large amounts of data for feature learning. Given the limited amount of available COI gene data, enhancing the feature-extraction ability of the classification model is critical. Gene expression at the microscopic level determines the morphology of organisms at the macroscopic level. Organisms of the same species often have similar forms, which corresponds to differences in the probability of gene sequence arrangements at the microscopic level. To take advantage of this characteristic, we chose the LSTM network, which performs well in long-series continuous prediction. We concatenated the feature maps of the CNN and LSTM networks and fed them into the dense layer of the CNN for classification. A high accuracy of 94% was achieved on the same test set. Our model differs from traditional statistical methods because it is highly trainable and computationally efficient. In addition, our CNN-LSTM hybrid model achieved better classification performance than the CNN alone, even with a small amount of data and without increasing the number of training samples.
From a biological perspective, we also explained that the mutability of genes can suppress the performance of CNNs, whereas the CNN-LSTM network improved the extraction of gene features by utilizing the differences in the probability of nucleotide arrangement in the genes, thus improving the performance of the classifier.
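The feature-level fusion described above (concatenate the CNN and LSTM features, then classify with a dense layer and softmax) can be illustrated with a small plain-Python sketch. It is not the authors' network: the function names, the feature sizes, and every weight and feature value below are hypothetical toy numbers chosen only to show the data flow.

```python
import math

def dense(features, weights, biases):
    """Fully connected layer: one weight row and one bias per output class."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, biases)]

def softmax(scores):
    """Convert raw scores into confidence scores that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_and_classify(cnn_features, lstm_features, weights, biases):
    """Concatenate both feature vectors, then apply dense + softmax."""
    combined = list(cnn_features) + list(lstm_features)
    return softmax(dense(combined, weights, biases))

# Hypothetical toy values: 3 CNN features, 2 LSTM features, 2 genera.
cnn_features = [0.2, 0.9, 0.1]
lstm_features = [0.7, 0.3]
weights = [[1.0, 0.5, 0.0, 1.0, 0.0],   # weights for genus 0
           [0.0, 0.5, 1.0, 0.0, 1.0]]   # weights for genus 1
biases = [0.0, 0.0]

probs = fuse_and_classify(cnn_features, lstm_features, weights, biases)
predicted = probs.index(max(probs))  # index of the most confident genus
```

Because the dense layer sees both feature vectors side by side, it can weight the sequential cues from the LSTM against the local patterns from the CNN.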
In the K-means algorithm model, 209 samples from the training set were classified correctly and 69 were classified incorrectly (approximately 75% of the 278 training samples were correct). The results indicated that the classification of the training set was not effective. Although inter-genera differences in COI genes are generally greater than intra-genera differences, there is still a certain degree of conserved sequence in the genes of different genera; at the same time, there is a certain rate of variation in the genes within a genus. The inter- and intra-genera differences both had a significant impact on the results of this model. In the following section, we calculate the genetic distance of the genes to discuss the reasons for this in depth.
For the SVM model, the linear kernel showed the best performance; even so, 51 samples were correctly classified and 67 were incorrectly classified in the test set, an accuracy of 43%. Owing to the uneven number of samples from the various classes in the training set, the model overfitted the training set, which rendered the predictions less effective.
We compared the accuracy of the single CNN model with different hyperparameters, as listed in Table 1. The CNN model performed better than the other models, with 91% accuracy. For the experimental result that is genera
The average distances within
Another factor that may affect the accuracy lies in the CNN model itself. During the downsampling process, the extracted features are likely to lose detail. The encoded gene sequences are not similar to an image matrix, which exhibits a strong correlation at the pixel level. As shown in Fig. 4, similar compositions of A+T and C+G bases in
The total number of possible base triplets (codons) is 64, a value well exceeding the number of amino acids (20). This indicates that many amino acids are specified by more than one codon, a phenomenon called degeneracy [16]. At the same time, morphologically close genera will produce closer gene expression, so we believe that the combination of bases is not completely random: the probability of the different base combinations in triplets differs between genera. To take advantage of this characteristic, we used an LSTM network, which performs well in long-series continuous prediction tasks such as text learning and improves the classification ability of the network by extracting long-term sequential features. We built a CNN-LSTM hybrid model, as shown in Fig. 5. The CNN and LSTM networks were trained individually, and the feature map of the LSTM was merged with that of the CNN. Finally, the combined features were passed through the dense layer in the CNN to predict the genera. The results showed that 7 samples of
We then added 17
Table 1. Hyperparameters of the CNN and accuracy on the test set. In each cell, the number on the left is the value of the hyperparameter, and the number on the right is the accuracy of the CNN on the test set. In the original table, the final optimized parameters of the CNN are marked in red.
Hyperparameter | Value, accuracy | Value, accuracy | Value, accuracy | Value, accuracy |
---|---|---|---|---|
Number of kernels | 32, 78% | 64, 88% | 128, 83% | 256, 83% |
Kernel size of convolution layer | 2, 88% | 3, 80% | 4, 78% | 5, 78% |
Kernel size of max-pooling layer | 2, 88% | 3, 89% | 4, 89% | 5, 89% |
Number of convolution layers | 1, 88% | 2, 89% | 3, 78% | 4, 78% |
Coefficient of dropout | 0.6, 84% | 0.5, 90% | 0.4, 91% | 0.3, 89% |
We also tested a non-feature-combining arrangement, in which the features were first obtained through the CNN model and then passed into the LSTM network [17]. The experimental results were very similar to those obtained using a single CNN model; in this arrangement, the LSTM network did not play an effective role. We did not increase the number of layers in the network further because our network is very simple and performs well in classification, which also makes fine-tuning more convenient. In contrast to previous studies, we used the original DNA sequence directly instead of the K-mer method for encoding [18]. This method is very effective for short DNA sequences, such as the COI gene; it preserves the genetic information and avoids complicated pre-processing.
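Encoding the raw sequence directly can be sketched as follows. The 27 × 27 × 4 input size comes from the text; everything else here is an assumption made for illustration, including the base order A, C, G, T, the all-zero vector for padding and ambiguous bases, the row-major folding, and the helper name `encode_sequence`.

```python
# Assumed base-to-vector mapping; the order A, C, G, T is illustrative.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}
PAD = (0, 0, 0, 0)  # assumption: padding and ambiguous bases map to all zeros

def encode_sequence(seq, side=27):
    """One-hot encode a DNA string and fold it into a side x side x 4 grid."""
    length = side * side  # 27 * 27 = 729 positions
    # Truncate to the grid size, then encode each base as a 4-element vector.
    vectors = [ONE_HOT.get(base, PAD) for base in seq.upper()[:length]]
    # Pad short sequences (a COI barcode is about 650 bp) up to 729 positions.
    vectors += [PAD] * (length - len(vectors))
    # Row-major folding: position i goes to row i // side, column i % side.
    return [vectors[r * side:(r + 1) * side] for r in range(side)]

grid = encode_sequence("ACGTN")
```

Unlike K-mer counting, this keeps every base at its original position, so no sequence order is lost before the network sees the data.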
As stated by Alexandrov et al. [19], the overall separation of a set of mutational signatures can be evaluated by examining the distribution of cosine similarities between signatures. Similarly, to understand the model's genus prediction results, we referred to the cosine similarity values in the model. The cosine similarity is given in formula (7), where Ai and Bi are the components of the two vectors and n is the size of the vectors. For each genus, we used the average of the feature vectors of that genus in the training set, as shown in formula (8), where m is the number of training samples. This value can be used to measure the similarity between the test samples and the training set. The closer the value is to 1, the greater the similarity between the two vectors. The values range from 0.733 to 0.999, as shown in Fig. 6. This indicates that the test sequences were highly similar to the training set.
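The calculation can be sketched as below, using the standard definition of cosine similarity (the dot product of two vectors divided by the product of their lengths) and a per-genus mean feature vector. The function names and the three-component feature vectors are hypothetical; the real feature vectors come from the trained model and are not given in the text.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_vector(vectors):
    """Component-wise average of m feature vectors from one genus."""
    m = len(vectors)
    return [sum(col) / m for col in zip(*vectors)]

# Hypothetical feature vectors of two training samples from the same genus.
train_features = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
centroid = mean_vector(train_features)

# A test sample pointing in the same direction as the genus average ...
similar = cosine_similarity([1.0, 1.0, 0.0], centroid)
# ... and one that is orthogonal to it.
different = cosine_similarity([0.0, 0.0, 1.0], centroid)
```

A test sample whose similarity to every genus average is low would be a candidate for closer inspection, which is the idea behind using the value as an initial reliability check.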
In contrast to previous studies that calculated the K-mer frequency [20], the bases of the COI gene were converted to a vector matrix by one-hot encoding, so the model works directly on the sequence, which makes it more direct and convenient. Combining the features extracted separately by the CNN and LSTM networks improved the feature-extraction ability of the model, which achieved 94% accuracy. When 17
We referred to the cosine similarity value to understand the results and initially assess the reliability of the predictions. The test set maintained a high degree of similarity with the training set, with values between 0.733 and 0.999. We calculated the cosine similarity value of the
We constructed a CNN-LSTM hybrid model because the genetic information in the database is unbalanced in quantity, and there is a large deviation in the number of randomly selected samples. Although good classification results have been achieved, future research needs to develop the model using a larger dataset or for species-level classification. Mitochondria may not have a sufficient and stable mutation rate if the species formation time is very short or if mitochondrial gene flow is present between closely related species, which makes it difficult to classify COI genes. Setting a different cosine similarity threshold for every category could help to evaluate the prediction results quantitatively and improve the function of the model.
is a M.S. student in the Department of Artificial Intelligence and Data Engineering at Sangmyung University. She received her B.S. degree in the Department of Biopharmaceutical from JILIN University in 2014. Her research areas include bioinformatics and data mining.
2003 Yonsei University Master's Degree in Medical Informatics
2008 Yonsei University Ph.D. in Biomedical Engineering
2009-present Professor at the Department of Human Intelligence and Information Engineering, Sangmyung University
2017-present Director of the Intelligent Information Technology Research Institute, Sangmyung University
2021-present Head of the Bio-Health Innovation Sharing University Unit, Sangmyung University
Areas of interest: Bio-Health, Biomedical Engineering, Data Mining, and Demand Forecasting
Meijing Li 1 and Dongkeun Kim2*
1Department of Artificial Intelligence and Data Engineering, Sangmyung University, Seoul 03016, Republic of Korea
2Department of Intelligent Engineering Informatics for Human, College of Convergence Engineering, Sangmyung University, Seoul 03016, Republic of Korea