Journal of information and communication convergence engineering 2022; 20(1): 31-40
Published online March 31, 2022
https://doi.org/10.6109/jicce.2022.20.1.31
© Korea Institute of Information and Communication Engineering

Wiharto Wiharto, Esti Suryani, Sigit Setyawan, and Bintang PE Putra
Sebelas Maret University, Surakarta, Indonesia
Coronary heart disease (CHD) is a comorbidity of COVID-19; therefore, routine early diagnosis is crucial. The large number of examination attributes required to diagnose CHD is a distinct obstacle during a pandemic, when the number of health service users is high. A precise machine learning model for diagnosis that uses a minimum number of examination attributes would allow examinations and healthcare actions to be undertaken quickly. This study proposes a CHD diagnosis model based on feature selection, data balancing, and ensemble-based classification methods. In the feature selection stage, a hybrid SVM-GA combined with a fast correlation-based filter (FCBF) is used. The proposed system achieved an accuracy of 94.60% and an area under the curve (AUC) of 97.5% when tested on the z-Alizadeh Sani dataset, using only 8 of the 54 examination attributes. In terms of performance, the proposed model can be placed in the very good category.
Keywords: coronary heart disease, genetic algorithm, feature selection, ensemble learning, support vector machine
The heart and lungs work together to maintain oxygen levels in the body. When the lungs are affected by respiratory diseases, such as the novel coronavirus disease (COVID-19), the heart can be affected as well: it has to work harder to pump blood, which is even more difficult for someone with heart disease. Patients with coronary heart disease (CHD) are at high risk of contracting COVID-19, and in patients infected with COVID-19, CHD can lead to damage to the heart muscle or blood vessels [1]. Strict health protocols also affect heart disease patients by limiting their activity; this was confirmed in a study by Hemphill et al. [2], in which the number of daily steps during the COVID-19 pandemic was lower than before the pandemic. During the pandemic, therefore, in addition to observing health protocols, people must maintain a healthy lifestyle and undergo routine health checks [3], including routine heart health checks.
Cardiac health checks can begin with routine examinations of risk factors, followed by electrocardiogram (ECG) examinations, laboratory examinations, and coronary angiography. With the development of artificial intelligence, all examination results in the diagnosis process can be used by machine learning to draw conclusions [4]. These examinations yield many attributes that must be analyzed, and a large number of attributes allows ambiguity in drawing conclusions. One important process in the development of machine learning is feature selection: selecting the best features, which can improve machine learning performance [5-7]. Feature selection methods include filtering, wrapper, and embedded approaches, and several methods can be combined into what is known as a hybrid method. A hybrid method aims to obtain better features than a single method alone. In developing machine learning for diagnosis, another influential component besides feature selection is the amount of data available for the learning process. Medical records are taken from hospitals, where people tend to seek examination only when symptoms appear, so the recorded diagnoses are mostly positive. As a result, the collected data contain more positive than negative diagnoses, making the available data unbalanced [8, 9].
Several models for the diagnosis of CHD have been developed using feature selection methods. Numerous developments have used computational intelligence algorithms such as genetic algorithms (GAs), artificial bee colonies, and particle swarm optimization [10-13]. This study proposes a CHD diagnosis model that is preceded by feature selection using a hybrid method. The hybrid method used a support vector machine (SVM) with a GA as the search method, and the feature selection stage ended with a filtering method, the fast correlation-based filter (FCBF). To determine the output of the proposed system model, classification was performed using an ensemble learning algorithm, namely, the bagging-logistic model tree (bagging-LMT). To overcome unbalanced training data, oversampling was performed before the learning stage of the bagging-LMT algorithm using the synthetic minority oversampling technique (SMOTE) [14, 15]. The system was validated using k-fold cross-validation on the z-Alizadeh Sani, Cleveland, and Statlog datasets. The performance parameters of the proposed system model include sensitivity (SEN), specificity (SPE), accuracy (ACC), area under the curve (AUC), positive prediction value (PPV), and negative prediction value (NPV).
CHD detection models have been developed along with machine learning [4, 16, 17]. Ghosh et al. [18] developed a diagnosis model using feature selection, namely, the Relief and LASSO techniques, which can improve its performance. Despite the obstacles to implementing machine learning algorithms in clinical practice, algorithms such as convolutional neural networks, boosting, and SVM offer good prospects for the development of diagnostic models [19]. A combination of SVM and extreme gradient boosting can also detect CHD with good performance, namely, an F1 score of 91.86% and an accuracy of 93.86%; however, feature selection was not optimized in that study, so the model still requires many examination attributes [20].
Another study used the CART algorithm to detect CHD [21]. CART is a decision-tree-based algorithm and can therefore perform feature selection simultaneously with training; the resulting model used only five test attributes. When tested on the z-Alizadeh Sani dataset, it provided 98.61% sensitivity and 92.41% accuracy but low specificity of 77.01%; that is, the model is weak at correctly identifying patients who are negative (true negatives), succeeding in only 77.01% of such cases. The CART algorithm can extract knowledge into several rules arranged in a tree diagram. The ability to compose rules, as in decision tree models, can also be achieved using hybrid binary-real particle swarm optimization (PSO) [22]. The hybrid PSO model provided an average accuracy of 84.2% and produced a relatively small number of rules, namely 10. Based on these rules, the use of 13 and 11 attributes was also tested; using 13 attributes performed better than using 11.
In addition to the PSO algorithm, computational intelligence approaches frequently use GAs [11]. A GA combined with an artificial neural network for the diagnosis of CHD can provide excellent performance, with an accuracy of 93.5%. A later development used a neural-network-based algorithm, namely an emotional neural network (ENN) combined with PSO (ENN+PSO), which reduced the attributes from 55 to 22 with an accuracy of 88.34% [12]. PSO alone can also provide good performance for the diagnosis of CHD, with a reported accuracy of 87.097% [13]. A subsequent development combined PSO and GA; this combination also reduced the attributes from 55 to 22, with an accuracy of 93.08% and an F-score of 91.51% [23]. Thus, PSO+GA produces the same number of reduced attributes as PSO+ENN on the z-Alizadeh Sani dataset, but with better performance.
The ability of GAs in the feature selection process was also demonstrated by Karegowda et al. [24], who proposed a combination of GA and correlation-based feature selection (CFS) with a radial basis function (RBF) classification algorithm. This model provided better CHD diagnosis performance than the combination of a decision tree and RBF. A GA combined with SVM, using accuracy as the objective function, also performed well compared with PSO [25]. A similar study by Ephzibah [26] showed that SVM+GA outperformed SVM alone, as indicated by the minimal number of selected features. Research using the genetic fuzzy system-logit-boost (GFS-LB) for the diagnosis of CHD likewise provided better performance than the same system without a GA [27]. Subsequent research used a feature selection model that combined filtering and wrapper methods [28]: the filtering algorithm was conditional mutual information maximization (CMIM) combined with a binary GA (BGA), with accuracy as the GA fitness function. The resulting accuracy was better than that of several other feature selection methods.
In most studies that use GAs for feature selection in machine learning, only the accuracy performance parameter is considered, so the fitness function used is also accuracy [11, 23, 27]. An accuracy fitness function is likewise used with the PSO algorithm [12, 13, 22, 23]. This is inappropriate in the medical field, particularly for screening or early diagnosis of disease. In the screening or diagnosis of CHD, the sensitivity performance parameter is crucial: it measures the extent to which patients who are positive for CHD are also detected as positive by the model. A screening or diagnosis system must therefore minimize cases in which a positive patient is diagnosed as negative by the machine learning model [29]. For this reason, the sensitivity performance parameter should be included in determining feature selection when developing a CHD diagnosis system model.
Choosing the right feature selection method is important, particularly for high-dimensional data. FCBF is an effective feature selection method for high-dimensional data [30]. Sánchez-Maroño et al. [31] confirmed that the accuracy of FCBF is better than that of Relief when there are more than 40 attributes. The FCBF feature selection algorithm combined with the SVM classification algorithm provided better performance than the k-nearest neighbor, random forest, and naive Bayesian algorithms [32]. The z-Alizadeh Sani CHD dataset has more than 40 attributes; therefore, FCBF is well suited to it. The ability of FCBF was also demonstrated by Djellali et al. [25], where FCBF substantially reduced the number of features while maintaining performance in a good category, and FCBF combined with a GA using an accuracy fitness function performed better than FCBF alone.
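For context, FCBF as published by Yu and Liu [30] ranks each feature by its symmetrical uncertainty (SU) with the class, a normalized form of information gain, and removes a feature when an already-selected feature is more strongly correlated with it than the class is:

```latex
SU(X, Y) = 2\,\frac{IG(X \mid Y)}{H(X) + H(Y)}, \qquad
IG(X \mid Y) = H(X) - H(X \mid Y)
```

Here H(·) denotes entropy, so SU ranges from 0 (independent) to 1 (fully mutually predictive); this is a property of the FCBF algorithm itself rather than something specific to this study.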
The problem of developing machine learning is not limited to feature selection. The availability of good data for the learning process determines whether a successful CHD diagnosis model can be built. A common challenge in processing medical data is imbalanced data, which can cause the resulting machine learning model to be poor [8, 33]. The problem of imbalanced data can be overcome using oversampling methods. The development of a prognostic model for patients with heart failure also utilized oversampling: Kim et al. [34] used oversampling algorithms such as SMOTE, borderline-SMOTE, and adaptive synthetic sampling (ADASYN), and Brandt and Lanzén [35] analyzed the performance of SMOTE and ADASYN. The results of the two studies show that SMOTE is better than the alternatives. SMOTE has also been used in the classification of hypertension, where the data are unbalanced, increasing accuracy from 91 to 98% [36]. SMOTE is also effective for high-dimensional datasets [37], such as the z-Alizadeh Sani CHD dataset. The ability of SMOTE has further been demonstrated in predicting compound-protein interactions [38], where it outperformed random under-sampling (RUS), combined over-undersampling (COUS), and Tomek link (T-Link) algorithms with reference to the AUC performance parameter.
In this study, the z-Alizadeh Sani dataset [39-42] was used, along with the Cleveland and Statlog datasets [43] as support. The z-Alizadeh Sani dataset has relatively complete types of examinations, covering demographic, symptom and examination, ECG, and laboratory and echo features. Another advantage of this dataset is that its data are relatively new compared with predecessors such as the Cleveland and Statlog datasets. These datasets can be accessed online, and their distribution is presented in Table 1. This study used the research method shown in Fig. 1, divided into the following stages. The first stage is pre-processing in the form of data normalization. The second stage performs feature selection using the hybrid SVM-GA: the SVM classifies the data, the accuracy and sensitivity are measured, and these two performance parameters are then combined into the fitness function of the GA, as shown in Equation (1). The SVM used is a binary SVM with a radial basis function (RBF) kernel [44]. The SVM works using nonlinear transformations, one of which is the RBF kernel: the transformation maps the input data to a higher-dimensional space, a linear classification is performed in that feature space to build the optimal hyperplane, and the mapping then returns to the original space, where it becomes a nonlinear classification of the input [44, 45]. The GA serves to select a subset of features, using this benchmark as the fitness function; the chromosome representation takes the form of a subset of the features, or attributes, of the CHD examination. The GA performs several steps, as shown in Algorithm-1 [28, 46]; a sketch of the fitness evaluation is given after Table 1.
Table 1. Distribution of the datasets

| Dataset | #Features | #Instances | Ratio Normal:CHD |
|---|---|---|---|
| z-Alizadeh Sani | 54 | 303 | 1:2.50 |
| Cleveland | 13 | 303 | 1:0.85 |
| Statlog | 13 | 270 | 1:0.80 |
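To make the hybrid SVM-GA stage concrete, the following is a minimal sketch of how one fitness evaluation could be implemented with scikit-learn. The exact form of Equation (1) is not reproduced in this text, so the equal-weight combination of accuracy and sensitivity below is an assumption; a GA library would then evolve a population of such binary masks using the parameters reported in the results (population 1000, 100 generations, crossover probability 0.55, mutation probability 0.3).

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

def fitness(mask: np.ndarray, X: np.ndarray, y: np.ndarray) -> float:
    """Fitness of one GA chromosome: a binary mask over the
    examination attributes (1 = attribute kept). `y` is assumed
    to be binary, with 1 = CHD."""
    if not mask.any():                        # an empty subset is invalid
        return 0.0
    X_sub = X[:, mask.astype(bool)]
    clf = SVC(kernel="rbf")                   # binary SVM with an RBF kernel
    y_pred = cross_val_predict(clf, X_sub, y, cv=10)
    acc = accuracy_score(y, y_pred)
    sen = recall_score(y, y_pred)             # sensitivity = recall of the positive class
    # Assumed form of Equation (1): an equal-weight combination of
    # accuracy and sensitivity (the paper's exact formula is not shown here).
    return 0.5 * (acc + sen)
```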
The third stage is the oversampling process used to balance the data. Oversampling was performed using the SMOTE algorithm [14, 36], preceded by resampling; the oversampling percentage refers to the ratio of positive to negative CHD data shown in Table 1. The next stage is the final stage of the feature selection process, namely, filtering using the FCBF algorithm [30, 47]. The FCBF algorithm generates a weight that indicates the relevance of each attribute to the output. The subsequent stage is the classification process using the bagging algorithm, with each bootstrap trained using the LMT algorithm [48], as shown in Algorithm-2 [49, 50]. The system model was also tested using the random forest algorithm, forest by penalizing attributes (forestPA) [51], C4.5, the multilayer perceptron (MLP), and bagging-forestPA.
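As an illustration of this balancing step, the following sketch uses the imbalanced-learn implementation of SMOTE. The paper does not state the exact resampling percentage used, so the 1:1 target below is an assumption, and `X`, `y` are assumed to hold the data after SVM-GA feature selection.

```python
from collections import Counter
from imblearn.over_sampling import SMOTE  # pip install imbalanced-learn

# SMOTE synthesizes new minority-class samples by interpolating between
# each minority sample and its k nearest minority-class neighbors.
smote = SMOTE(sampling_strategy=1.0,  # assumed target: a 1:1 class ratio
              k_neighbors=5,          # imbalanced-learn's default
              random_state=42)
X_bal, y_bal = smote.fit_resample(X, y)
print(Counter(y), "->", Counter(y_bal))   # class counts before and after
```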
Algorithm-1. GA-based feature subset selection
1: Set generation = 0 (the first generation)
2: Initialize the population P(generation) randomly // P(generation) is the population of one generation
3: Evaluate the fitness value of each individual // Equation (1)
4: While the stopping criterion is not satisfied do
5:     generation = generation + 1
6:     Select from the population the parent candidates P'(generation)
7:     Apply crossover to P'(generation)
8:     Apply mutation to P'(generation)
9:     Evaluate the fitness of the offspring // Equation (1)
10:    Form the new population = {P(generation) individuals that survive, P'(generation)}
11: endWhile
12: Return the fittest individual as the selected feature subset
Algorithm-2. Bagging with the LMT base classifier
Input: the training set S; the base machine learning algorithm (LMT); the number of base classifiers T
1: For t = 1 to T do
2:     Create a bootstrap sample St by sampling S with replacement
3:     Use St to learn a base classifier Nt
4: endFor
5: Output: the ensemble N*, which combines N1, ..., NT by majority voting
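Since the logistic model tree (LMT) learner is available in Weka rather than scikit-learn, the following sketch substitutes a decision-tree base learner purely to show the bagging structure of Algorithm-2; the swap is ours, not the authors'. `X_bal`, `y_bal` are the SMOTE-balanced data from the previous sketch.

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # stand-in base learner; LMT in the paper
    n_estimators=10,                     # T bootstrap replicates
    bootstrap=True,                      # sample the training set with replacement
    random_state=42,
)
# 10-fold cross-validation, matching the paper's validation protocol
scores = cross_val_score(bag, X_bal, y_bal, cv=10, scoring="roc_auc")
print("Mean AUC:", scores.mean())
```

Note that the base-learner argument is named `base_estimator` in scikit-learn versions before 1.2.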
The last stage is the evaluation of system performance. The diagnosis system model was developed with feature selection using the hybrid SVM-GA and FCBF, and the resulting performance was measured with reference to the confusion matrix shown in Table 2. The performance parameters used were accuracy (ACC), sensitivity (SEN), specificity (SPE), AUC, positive prediction value (PPV), and negative prediction value (NPV). Referring to Table 2, the performance parameters can be calculated using Equations (2)-(6).
Table 2. Confusion matrix

| Actual Class | Predicted Positive | Predicted Negative |
|---|---|---|
| Positive | TP | FN |
| Negative | FP | TN |
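Equations (2)-(6) themselves are not reproduced in this extraction; the standard confusion-matrix definitions of these parameters are as follows, and the discussion later notes that the AUC can also be taken as the average of sensitivity and specificity [15]:

```latex
\begin{gathered}
ACC = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
SEN = \frac{TP}{TP + FN}, \qquad
SPE = \frac{TN}{TN + FP}, \\
PPV = \frac{TP}{TP + FP}, \qquad
NPV = \frac{TN}{TN + FN}.
\end{gathered}
```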
Testing the CHD diagnosis system model using the hybrid SVM-GA feature selection produced 25 attributes, as shown in Fig. 2. This number of attributes was obtained using GA parameters of a population of 1000, 100 generations, a crossover probability of 0.55, and a mutation probability of 0.3. Fig. 2 shows the feature weights of the 25 attributes resulting from the FCBF process; eight attributes have a high weight in influencing the diagnosis of CHD: typical chest pain, hypertension (HTN), age, diabetes mellitus (DM), regional wall motion abnormality (RWMA) region, T inversion, Q wave, and triglyceride (TG).
These results were obtained with the z-Alizadeh Sani dataset. Subsequent testing of the same model using the Cleveland dataset yielded 10 attributes. These results are shown in Fig. 3, which also shows the weight of each attribute from the FCBF algorithm; the attributes with significant weights in the diagnosis process were thal, cp, slope, ca, age, and restecg. The final test used the Statlog dataset. The test results, shown in Fig. 4, comprise 10 attributes; filtering with FCBF shows that only six attributes had a significant weight in the diagnosis process: thal, cp, oldpeak, thalach, ca, and restecg.
After feature selection and data balancing with SMOTE, the classification results were tested. The classification process used the bagging-LMT algorithm, with the performance parameters in Equations (2)-(6) used for the analysis. The performance of the bagging-LMT algorithm using the eight attributes selected from the z-Alizadeh Sani dataset is shown in Fig. 5, where its AUC is better than that of the other ensemble algorithms. The highest AUC value was 97.5%, and the results for all performance parameters are listed in Table 3. In addition to the results for eight attributes in the z-Alizadeh Sani dataset, performance was also measured across the reduction from 25 to eight attributes. These results are shown in Fig. 6, where the change in the number of attributes does not produce a significant change in performance.
The results of subsequent system testing using the Cleveland dataset are shown in Fig. 7, and those using the Statlog dataset in Fig. 8. The test results on the Cleveland and Statlog datasets show that the bagging-LMT algorithm outperforms the random forest, forestPA, C4.5, MLP, and bagging-forestPA algorithms. However, the AUC of the bagging-LMT algorithm on these datasets was still lower than on the z-Alizadeh Sani dataset. This is because the z-Alizadeh Sani dataset has a high level of data imbalance, as shown in Table 1, so the use of SMOTE is much more effective than on the Cleveland and Statlog datasets. The Cleveland and Statlog datasets have the same attributes; however, their feature selection results differ in two attributes.
Table 3. Performance of the proposed model on each dataset

| Dataset | SEN | SPE | AUC | ACC |
|---|---|---|---|---|
| z-Alizadeh Sani | 0.936 | 0.954 | 0.975 | 0.946 |
| Cleveland | 0.840 | 0.821 | 0.896 | 0.830 |
| Statlog | 0.878 | 0.784 | 0.902 | 0.828 |
The hybrid SVM-GA model combined with FCBF can significantly reduce the number of attributes, particularly for the z-Alizadeh Sani dataset, where the number of attributes decreased from 54 to eight. The reduction occurs in two stages: from 54 to 25 attributes using the hybrid SVM-GA with the accuracy-and-sensitivity fitness function, and from 25 to eight attributes based on the weights generated by the FCBF algorithm. The change in performance between 25 and eight attributes is shown in Fig. 6; with the bagging-LMT classification algorithm and 10-fold cross-validation, the change is not significant. Tests using the Cleveland and Statlog datasets also reduced the number of attributes, from 13 to 10 with the hybrid SVM-GA and to six after combination with FCBF. The resulting performance is not as good as on the z-Alizadeh Sani dataset because the z-Alizadeh Sani dataset has a high level of imbalance, as shown in Table 1, which makes SMOTE especially effective. When tested on the z-Alizadeh Sani dataset, the proposed system achieved an AUC of 97.5%, which is in the very good category [52]. Meanwhile, on the Cleveland dataset the AUC of 89.6% was in the good category, whereas on the Statlog dataset the AUC of 90.2% was in the very good category [52].
The change in performance from 25 to eight features is shown in Fig. 6: as the number of attributes decreases, the AUC performance parameter increases. This shows that with 25 attributes there is ambiguity between attributes that degrades performance, confirming previous studies stating that feature selection can improve performance [5-7]. In the Cleveland and Statlog datasets, the AUC obtained with 10 attributes and with six attributes is relatively the same. The Cleveland and Statlog datasets have the same number and types of attributes; however, their feature selection results differ in two attributes: the Cleveland result includes slope and age, whereas the Statlog result includes oldpeak and thalach. This difference is, of course, caused by the data in each dataset, which lead to different weight rankings in the FCBF feature selection process.
By adding a data-balancing process to the proposed system model, the sensitivity and specificity performance parameters were both high. Many studies have reported high accuracy with high sensitivity but low specificity; a significant difference between sensitivity and specificity was shown in CHD diagnosis models using the CART algorithm [21] and the bagging-SMO, naive Bayesian, SMO, and neural network algorithms [53]. A significant difference between sensitivity and specificity also lowers the AUC when the AUC is calculated as the average of the two [15]. In the biomedical field, an AUC of 97.5% indicates that if 100 patients are positive for CHD, about 97 are correctly diagnosed with CHD by the system, while about three are misdiagnosed. With reference to the AUC parameter, the capability of the proposed system is better than that of some previous studies, such as that of Joloudari et al. [54], which used a random tree algorithm.
Using hybrid SVM-GA feature selection with a fitness function composed of accuracy and sensitivity causes the best feature subset to be determined by both parameters. Using the fitness function in Equation (1) results in better accuracy and a smaller number of attributes than using a fitness function based on accuracy alone [26, 55]; the same holds for the PSO algorithm [12]. A complete comparison with previous studies using GA, PSO, or other algorithms, tested on the z-Alizadeh Sani dataset, is shown in Table 4.
Table 4. Comparison with previous studies on the z-Alizadeh Sani dataset

| Ref | Method | #Features | SEN (%) | SPE (%) | AUC (%) |
|---|---|---|---|---|---|
| [59] | Var-IBLMM | 54 | 85.6 | 73.7 | - |
| [53] | Bagging-SMO | 33 | 95.8 | 87.4 | - |
| [42] | Combined IG for all arteries + SVM | 27 | 86.0 | - | - |
| [11] | ANN+GA | 22 | 97.0 | 92.0 | - |
| [10] | PSO-based FS | 27 | - | - | 98.7 |
| [12] | Hybrid PSO+ENN | 22 | - | - | - |
| [54] | Random tree | 40 | - | - | 96.7 |
| [60] | EHBM-DNN | 54 | 95.8 | 96.5 | - |
| [40] | SVM with feature engineering | 32 | 100.0 | 88.0 | 92.0 |
Using the bagging-LMT algorithm in the proposed diagnosis system model provides better performance than several other algorithms, such as random forest and C4.5. The LMT algorithm is a combination of logistic regression and C4.5; this combination supports pruning and prevents overfitting [56]. The LMT algorithm performs even better when bagging is applied, as bagging resolves the problem of unstable classification. This makes bagging-LMT better than C4.5, and also better than the MLP, which has a serious overfitting problem. The random forest algorithm is a decision-tree-based ensemble algorithm that does not prune the resulting decision trees [57]; the absence of pruning can lead to high prediction errors on new cases, and another weakness is a slow classification process, making it unsuitable for real-time use. The forestPA algorithm is almost the same as random forest, except that it is built using the CART algorithm [58]; its weakness is that incorrectly determined attribute weights affect its performance.
The diagnosis system model using the hybrid SVM-GA and FCBF feature selection method together with the bagging-LMT algorithm is able to provide good performance. The best performance occurs on the z-Alizadeh Sani dataset because this dataset has a higher level of imbalance than the Cleveland and Statlog datasets, which makes the addition of the SMOTE algorithm highly effective. The best performance achieved on the z-Alizadeh Sani dataset was a sensitivity of 93.6%, specificity of 95.4%, AUC of 97.5%, PPV of 94.3%, NPV of 94.9%, and accuracy of 94.6%. This performance also shows that bagging-LMT is better than the random forest, MLP, C4.5, forestPA, and bagging-forestPA algorithms. The attribute reduction was also significant, from 54 to eight attributes. With reference to this performance, the proposed diagnostic system model performs better than several previous studies, making it an alternative for the diagnosis of CHD with minimal examination attributes.
Wiharto is an associate professor of computer science in the Department of Informatics, Sebelas Maret University, Surakarta, Indonesia. He received his Ph.D. from Gadjah Mada University, Indonesia, in 2017. His research interests include artificial intelligence, computational intelligence, expert systems, machine learning, and data mining.
Esti Suryani received a Bachelor of Science (B.S.) from Gadjah Mada University, Yogyakarta, Indonesia, in 2002 and a master's degree in computer science (M.Cs.) from Gadjah Mada University in 2006. Esti Suryani is presently an assistant professor in the Department of Informatics, Faculty of Mathematics and Natural Sciences, Sebelas Maret University, Surakarta, Indonesia, with research interests in image processing and fuzzy logic.
Sigit Setyawan received a Bachelor of Medicine from Sebelas Maret University, Surakarta, Indonesia, in 2005 and a master's degree in medicine (M.Sc.) from Gadjah Mada University, Yogyakarta, Indonesia, in 2015. He is presently an assistant professor in the Department of Medicine, Faculty of Medicine, Sebelas Maret University, Surakarta, Indonesia. His areas of interest are molecular biology, genomics, and health informatics.
Bintang PE Putra is a student in the undergraduate program in Informatics (class of 2018), Faculty of Mathematics and Natural Sciences, Sebelas Maret University, Surakarta, Indonesia. His research areas are image processing, data mining, artificial intelligence, machine learning, and computational intelligence.