Improving accuracy of coronary artery disease diagnosis with biomarker-based machine learning models
A cost-effective and accessible precision medicine approach
We hypothesize that by utilizing CatBoost and Convolutional Neural Networks (CNNs), biomarker- and electrocardiogram-based machine learning models can predict whether a patient has coronary artery disease (CAD) with significantly higher accuracy than current diagnostic techniques.
Coronary Artery Disease
Coronary artery disease (CAD) is the leading cause of mortality, associated with over 8.9 million deaths annually worldwide.1 CAD is typically caused by a buildup of plaque (deposits of cholesterol) in arteries supplying blood to the heart, and symptoms can include pain in the chest and heart attacks. In the U.S., Blacks, non-white Hispanics, and South Asians are at greater risk than white people.2
Standard diagnostic procedures for CAD include electrocardiograms (ECGs), which measure the electrical activity of the heart, blood tests, and CT angiography. However, these procedures are often inaccurate: physicians correctly diagnose CAD in only 68% of cases.3 Diagnostic procedures can also be risky and costly. CT angiography is an invasive technique that requires arterial puncture and exposes the patient to radiation,4 and it has an average cost over $2,500, which is prohibitively expensive for low-income patients.5
Precision Medicine and Machine Learning
Precision medicine is a targeted approach to medicine that has the potential to revolutionize the treatment and diagnosis of CAD. It integrates technology and medicine to diagnose and treat an individual patient's disease6 by holistically considering the patient’s genetics and lifestyle.7 Machine learning (ML) models are an important component of precision medicine. According to the National Institutes of Health, “As researchers continue to unravel the many mysteries of genomics, they require more and more sophisticated technologies to diagnose, monitor, and treat genetic conditions. Artificial intelligence tools, which mimic human intelligence to solve problems, are well-suited to tackle these complex tasks.”8
| # | Model | Study | Data source | Input | Participants | Year | Accuracy | Sensitivity | Specificity |
| 1 | CatBoost | Kim et al. 2022 | Dankook University | Biomarkers | 1312 | 2022 | 0.74 | 0.78 | 0.67 |
| 2 | CatBoost | Zhang et al. 2021 | China | Biomarkers | 2642 | 2021 | 0.82 | 0.75 | 0.76 |
| 3 | Random forest | Muhammad et al. 2021 | Nigeria | Biomarkers | 506 | 2021 | 0.92 | 0.86 | 0.83 |
| 4 | 8-layer deep CNN | Tan et al. 2018 | Cleveland Dataset | ECG | 47 | 2018 | 0.99 | 0.99 | 0.99 |
Biomarker-Based Machine Learning Models
Numerous machine learning models can diagnose CAD from patient biomarker data with accuracies of over 70%. Biomarkers are objective and quantifiable characteristics of biological processes such as systolic blood pressure, diastolic blood pressure, hemoglobin levels, triglyceride levels, and age.
Kim et al. 2022 analyzed the efficacy of 11 different machine learning models in diagnosing obstructive CAD from 12 biomarkers.9 The models were trained on data collected from 1312 participants at Dankook University in South Korea between 2014 and 2016. The study also examined the SHAP values of the biomarkers. SHAP stands for SHapley Additive exPlanations; a SHAP value indicates how much a feature contributes to the model's output, here whether a patient has CAD or not. The features in these models are biomarkers, which are attributes of the individual participants. The paper found that the biomarker with the greatest mean SHAP value was troponin T level, followed by age and sex. Troponin T is a protein released into the bloodstream when cardiac muscle has been damaged, such as in the case of a heart attack or CAD.
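To make the idea of a per-feature contribution concrete, the sketch below computes exact Shapley values for a single prediction by enumerating every subset of features. It is an illustration only, not the procedure used by Kim et al.: the linear `toy_risk` model, its weights, the patient values, and the baseline values are all hypothetical, and replacing absent features with a baseline value is just one simple convention.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one prediction.

    Features outside a coalition are replaced by their baseline value
    (an assumption made for this sketch).
    """
    names = list(x)
    n = len(names)

    def value(coalition):
        mixed = {f: (x[f] if f in coalition else baseline[f]) for f in names}
        return predict(mixed)

    result = {}
    for f in names:
        others = [g for g in names if g != f]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {f}) - value(s))
        result[f] = total
    return result


def toy_risk(p):
    # Hypothetical linear score, not a model from any cited study.
    return 0.5 * p["troponin_t"] + 0.02 * p["age"] + 0.3 * p["male"]


patient = {"troponin_t": 2.0, "age": 60, "male": 1}   # hypothetical values
baseline = {"troponin_t": 1.0, "age": 50, "male": 0}  # hypothetical reference
phi = shapley_values(toy_risk, patient, baseline)
```

For a linear model like this one, each Shapley value reduces to the weight times the feature's difference from the baseline, and the values sum to the gap between the patient's score and the baseline score. Exact enumeration grows exponentially with the number of features, which is why practical tools rely on approximations.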
Accuracy is defined as (TP+TN)/(TP+TN+FP+FN), sensitivity as TP/(TP+FN), and specificity as TN/(TN+FP), where TP is true positives, FP is false positives, TN is true negatives, and FN is false negatives. Sensitivity represents the true positive rate and specificity represents the true negative rate.
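As a quick check on these definitions, the following sketch computes all three metrics from the four counts. The function name and the example counts are assumptions chosen for illustration and do not come from any of the cited studies; the code also assumes that each denominator is nonzero.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Return accuracy, sensitivity, and specificity from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


# Hypothetical counts for 100 patients (not from any cited study).
example = diagnostic_metrics(tp=40, fp=10, tn=45, fn=5)
```

With these counts, accuracy is 85/100, sensitivity is 40/45, and specificity is 45/55, which shows how a model can be fairly accurate overall while still differing in how well it handles positive and negative cases.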
Of the 11 ML models analyzed, the most accurate was CatBoost with a 74.6% accuracy rate, higher than XGBoost, Support Vector Machine (SVM), and 8 other models. The most sensitive model was the Diamond-Forrester with 93.3% sensitivity but only 26.1% specificity (indicating a high rate of false positives). Meanwhile, the most specific model was the CAD consortium basic with a 78.3% specificity rate and a 44.4% sensitivity rate.
CatBoost is a supervised machine learning algorithm that utilizes gradient boosting, a technique that merges weak learning models to inform a more robust and accurate final prediction. CatBoost outperforms XGBoost and SVM in part because its algorithm is versatile and flexible: it can adapt to missing values in datasets and has highly customizable parameters.
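The core idea of gradient boosting can be sketched in a few lines: each round fits a weak learner to the errors left by the previous rounds and adds a scaled copy of it to the ensemble. The code below is a minimal illustration, not CatBoost's algorithm. It uses one-split decision stumps on a single feature, treats 0/1 labels with squared error as a simplification, and assumes the feature takes at least two distinct values. The ages and labels are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single-threshold split that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_value, right_value)


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def gradient_boost(xs, ys, rounds=50, learning_rate=0.3):
    """Fit an additive ensemble of stumps to the residuals, round by round."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(ys)
    for _ in range(rounds):
        # Residuals are the negative gradient of the squared-error loss.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, stumps, learning_rate


def predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


# Hypothetical data: age as the only feature, 1 = CAD, 0 = no CAD.
ages = [35, 42, 50, 58, 63, 70]
labels = [0, 0, 0, 1, 1, 1]
model = gradient_boost(ages, labels)
```

Calling `predict(model, 40)` returns a score near 0 and `predict(model, 65)` a score near 1 on this toy data. Production libraries add many refinements on top of this loop, such as deeper trees, classification losses, and regularization.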
Zhang et al. 2021 supports the efficacy of the CatBoost model using data collected from 2018 to 2019 on 2642 randomly selected patients in China, 717 of whom had CAD.10 The performance of three algorithms, CatBoost, Random Forest, and Logistic Regression, was analyzed with confusion matrices, which are tables that compare predicted and actual results using counts of true and false positives and negatives. One matrix was based on non-laboratory features such as age, body mass index, systolic blood pressure, and history of smoking; the other was based on total features, including all non-laboratory features as well as laboratory features such as fasting blood sugar and total cholesterol. For the model based on total features, CatBoost performed the best by all metrics. It was the most accurate, with an 82.5% accuracy rate, in addition to having the highest sensitivity and specificity rates. CatBoost also had the highest accuracy (77.9%) when considering only the non-laboratory features. In that non-laboratory setting, however, the Random Forest model had the highest specificity, marginally outperforming the Logistic Regression and CatBoost models. Comparison of features reveals that age, total cholesterol, and family history of CHD were the most important risk factors for identifying coronary heart disease.
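A confusion matrix of the kind described above can be tallied directly from paired lists of actual and predicted labels. The sketch below is a minimal illustration; the function name and the eight patient labels are hypothetical and are not taken from Zhang et al.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count TP, FN, FP, TN for binary labels (1 = CAD, 0 = no CAD)."""
    counts = Counter(zip(actual, predicted))
    return {
        "TP": counts[(1, 1)],  # CAD correctly predicted
        "FN": counts[(1, 0)],  # CAD missed by the model
        "FP": counts[(0, 1)],  # healthy patient flagged as CAD
        "TN": counts[(0, 0)],  # healthy patient correctly cleared
    }


# Hypothetical labels for eight patients (illustration only).
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
matrix = confusion_matrix(actual, predicted)
```

Here the tally gives three true positives, one false negative, one false positive, and three true negatives. Comparing such tables across models and feature sets is what allows accuracy, sensitivity, and specificity to be reported side by side.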
Muhammad et al. 2021, like Kim et al., compares the performance of different ML models in predicting CAD from biomarkers.11 However, this study finds that the Random Forest model is the most accurate, suggesting that both Random Forest and CatBoost models can serve as accurate diagnostic models. The dataset was collected between 2003 and 2017 at two general hospitals in Kano State, Nigeria: Murtala Mohammed General Hospital and Abdullahi Wase General Hospital.
ECG-Based Machine Learning Models
The electrocardiogram (ECG) is a test that measures electrical activity in the heart, providing information about the heart’s rhythm, structure, and function. An ECG is performed by placing electrodes on the skin of the chest, arms, and legs; the electrodes detect the electrical signals produced by the heart, which are recorded as a graph.
Alizadehsani et al. 2019 provides a comprehensive overview of AI algorithms used to diagnose CAD using ECG data.12 It analyzes 67 datasets from 18 countries, processed with differing pipelines. More than 90% of the CAD detection algorithms are supervised classification algorithms (algorithms that are trained on a labeled dataset of known CAD/no CAD cases), especially Artificial Neural Networks (ANNs), Decision Trees (DTs), Support Vector Machines (SVMs), Naive Bayes, and K-Nearest Neighbors.
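As a small example of supervised classification, the sketch below implements k-nearest neighbors, one of the algorithm families named above: a new case receives the majority label of the most similar labeled cases. Everything in it is hypothetical, including the two unnamed numeric features per patient and the labels, and it assumes the features are already on comparable scales.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Label a query point by majority vote among its k nearest labeled points."""
    neighbors = sorted(
        zip(train_points, train_labels),
        key=lambda pair: dist(pair[0], query),  # Euclidean distance
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical labeled training data: two numeric features per patient.
train_points = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9),
                (3.0, 3.2), (2.9, 2.7), (3.3, 3.0)]
train_labels = ["no CAD", "no CAD", "no CAD", "CAD", "CAD", "CAD"]

prediction_a = knn_predict(train_points, train_labels, (3.1, 2.9))
prediction_b = knn_predict(train_points, train_labels, (0.9, 1.1))
```

The first query sits among the cases labeled CAD and the second among those labeled no CAD, so the votes are unanimous. On real data, features would normally be standardized first so that no single measurement dominates the distance.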
Alizadehsani et al. also reviews several deep learning (DL) models for CAD diagnosis. Deep learning models are multi-layer neural networks trained on large quantities of data. Successive layers learn increasingly abstract patterns in the data, in a structure loosely modeled on the human brain and its web of connected neurons. Standouts include Acharya et al. 2017 (200 participants, 95.22% accuracy on ECG signals), Tan et al. 2018 (47 participants, 99.85% accuracy on ECG signals), and Allahverdi et al. 2016 (85 participants, 98.05% accuracy on ECG signals). Some non-DL models, including KNNs and Decision Trees, also achieved high accuracy in diagnosing CAD from ECG signals, notably Acharya et al. 2017 (222 participants, 99.55% accuracy) and Sharma et al. 2019 (254 participants, 99.53% accuracy).
Cleveland Dataset Models
The Cleveland UCI dataset (1988), popular among heart disease researchers, has been used to train two of the top-performing models cited in Alizadehsani’s paper. Akella and Akella 2021 compared the performance of six different supervised ML algorithms — a generalized linear model, decision tree, random forest, support vector machine, neural network, and k-Nearest neighbor — trained on the UCI Cleveland dataset.13 The Cleveland dataset includes 14 metrics on 303 patients, including their age, gender, chest pain, blood pressure, cholesterol, blood sugar, etc.
Akella and Akella found that the artificial neural network had the highest accuracy (93%) and sensitivity (93.8%) of the six models, although all models had greater than 79% accuracy. The researchers also found that resting ECG had the greatest importance in the neural network model for predicting CAD outcomes, followed by sex and chest pain (rated on a scale from 1-10).
Joloudari et al. 2020 employs four ML algorithms to diagnose CAD.14 Although angiography is the most common diagnostic technique for heart disease, it has side effects and is expensive. Therefore, ML models including SVM, CHAID, C5.0, and random trees were evaluated with 10-fold cross-validation, in which the data is split into 10 subsets and each subset is held out once for testing while the model is trained on the other nine. The random trees model was the most accurate of the four, consistent with the findings of Muhammad et al.
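The splitting step of 10-fold cross-validation can be sketched as follows. This is a generic illustration and not the authors' code: the function name, the fixed shuffling seed, and the sample size of 303 are assumptions chosen for the example.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]  # k nearly equal folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Example size only; 303 is an assumption for this sketch.
splits = list(k_fold_indices(303, k=10))
```

Each sample lands in exactly one test fold, so every record is used for testing once and for training nine times. A model's accuracy is then typically reported as the average over the 10 held-out folds.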
Machine learning models, particularly CatBoost and convolutional neural network models (CNNs), can aid physicians in accurately diagnosing CAD. Biomarker-based ML models are much more affordable than traditional CT angiography scans and do not require patients to undergo invasive procedures. Several tests on multiple sets of data have shown increased efficacy and high accuracy of biomarker and ECG machine learning models; however, these models have yet to be tested on clinical hospital data. Nonetheless, the high prevalence of CAD and CAD-related deaths worldwide indicates the need for more accurate diagnostic techniques, and machine learning models are a strong candidate.
1. World Health Organization. (2020, December 9). The Top 10 Causes of Death. World Health Organization. https://www.who.int/news-room/fact-sheets/detail/the-top-10-causes-of-death
2. Graham, G. (2015). Disparities in Cardiovascular Disease Risk in the United States. Current Cardiology Reviews, 11(3), 238–245. https://doi.org/10.2174/1573403x11666141122220003
3. Bösner, S., Haasenritter, J., Keller, H., Hani, M. A., Sönnichsen, A. C., Baum, E., & Donner-Banzhoff, N. (2011). The Diagnosis of Coronary Heart Disease in a Low-Prevalence Setting. Deutsches Aerzteblatt Online. https://doi.org/10.3238/arztebl.2011.0445
4. Alizadehsani, R., Abdar, M., Roshanzamir, M., Khosravi, A., Kebria, P. M., Khozeimeh, F., Nahavandi, S., Sarrafzadegan, N., & Acharya, U. R. (2019). Machine learning-based coronary artery disease diagnosis: A comprehensive review. Computers in Biology and Medicine, 111, 103346. https://doi.org/10.1016/j.compbiomed.2019.103346
5. Goehler, A., Mayrhofer, T., Pursnani, A., Ferencik, M., Lumish, H. S., Barth, C., Karády, J., Chow, B., Truong, Q. A., Udelson, J. E., Fleg, J. L., Nagurney, J. T., Gazelle, G. S., & Hoffmann, U. (2020). Long-term health outcomes and cost-effectiveness of coronary CT angiography in patients with suspicion for acute coronary syndrome. Journal of Cardiovascular Computed Tomography, 14(1), 44–54. https://doi.org/10.1016/j.jcct.2019.06.008
6. Akhoon, N. (2021). Precision Medicine: A New Paradigm in Therapeutics. International Journal of Preventive Medicine, 12, 12. https://doi.org/10.4103/ijpvm.IJPVM_375_19
7. Leopold, J. A., & Loscalzo, J. (2018). Emerging Role of Precision Medicine in Cardiovascular Disease. Circulation Research, 122(9), 1302–1315. https://doi.org/10.1161/circresaha.117.310782
8. Soo, S., & Ganguly, P. (2022, December 14). Artificial intelligence tools help scientists decode genomic disorders and communicate genomic risks. NIH. https://www.genome.gov/news/news-release/artificial-intelligence-tools-help-scientists-decode-genomic-disorders-and-communicate-genomic-risks
9. Kim, J., Lee, S. Y., Cha, B. H., Lee, W., Ryu, J., Chung, Y. H., Kim, D., Lim, S., Kang, T. S., Park, B., Lee, M., & Cho, S. (2022). Machine learning models of clinically relevant biomarkers for the prediction of stable obstructive coronary artery disease. Frontiers in Cardiovascular Medicine, 9. https://doi.org/10.3389/fcvm.2022.933803
10. Zhang, X., Wang, M., Wei, W., Xu, Y., Gao, L., Sun, Y., Ma, Z., & Wang, S. (2021). An accurate diagnosis of coronary heart disease by Catboost, with easily accessible data. Journal of Physics: Conference Series, 1955(1), 012027. https://doi.org/10.1088/1742-6596/1955/1/012027
11. Muhammad, L. J., Al-Shourbaji, I., Haruna, A. A., Mohammed, I. A., Ahmad, A., & Jibrin, M. B. (2021). Machine Learning Predictive Models for Coronary Artery Disease. SN Computer Science, 2(5). https://doi.org/10.1007/s42979-021-00731-4
12. Same as #4 (Alizadehsani et al., 2019).
13. Akella, A., & Akella, S. (2021). Machine learning algorithms for predicting coronary artery disease: efforts toward an open source solution. Future Science OA, FSO698. https://doi.org/10.2144/fsoa-2020-0206
14. Joloudari, J. H., Hassannataj Joloudari, E., Saadatfar, H., Ghasemigol, M., Razavi, S. M., Mosavi, A., Nabipour, N., Shamshirband, S., & Nadai, L. (2020). Coronary Artery Disease Diagnosis; Ranking the Significant Features Using a Random Trees Model. International Journal of Environmental Research and Public Health, 17(3), 731. https://doi.org/10.3390/ijerph17030731
About the Authors
Kamryn Chan, Alexandra Kim, and Jeffrey Liu are high school juniors at Polytechnic School in Pasadena, CA. They are interested in the intersection of machine learning, precision medicine, and computational biology and plan to further research applications of machine learning to disease diagnosis.