Genomics: Insight

Enhancing Type 2 Diabetes Prediction with Machine Learning Algorithms

Grace P
April 15, 2025


Hypothesis: Compared to traditional statistical methods, machine learning models will significantly improve the accuracy of Type 2 Diabetes prediction.

Type 2 Diabetes

Diabetes is a disease that affects millions of people worldwide. The number of people diagnosed with diabetes increases rapidly each year, and half of those with diabetes do not know they have the disease. In 2017, diabetes ranked among the top 10 causes of death in adults, causing approximately four million deaths worldwide 1. By 2030, the number of people with diabetes is expected to reach 439 million 2.
The two most common types of diabetes are type 1 diabetes (T1D) and type 2 diabetes (T2D). About 90% of people with diabetes have T2D, which is much more common in adults between the ages of 45 and 64 3. T2D is associated with many factors, including age, gender, weight, lifestyle, and genetics. For instance, T2D in adults can result from a sedentary lifestyle, a high-energy diet, or obesity. These factors contribute to increased cellular resistance to insulin, a hormone produced by the pancreas to regulate blood sugar levels and suppress glucose production 3. Long-term effects of T2D include cardiovascular problems such as heart disease and high blood pressure, kidney disease, nerve damage, cancers, and eye and foot complications 3.
Because of these serious effects, early prediction of T2D can help increase life expectancy by prompting people at risk to adopt healthier lifestyles. Many techniques have been used to build T2D prediction models, including data mining approaches such as pattern recognition, regression models, and screening tools, but they have not been able to predict the disease accurately 4. Over the last few decades, newer techniques such as machine learning have been developed to tackle this problem. This paper reviews the application of machine learning models to the prediction of T2D and compares their performance in terms of accuracy.
 

Machine Learning in T2D Prediction

Machine learning (ML) prediction models help improve the accuracy and efficiency of predicting T2D in a patient by discovering patterns in training data and applying them to new cases 5,6. ML-based models use input variables (features) to make predictions and, compared to traditional statistical methods, have shown superior performance in early diabetes detection and T2D risk assessment 7,8,9. For predicting T2D, ML models often use known risk factors as features to train and optimize their predictive performance. Moa Lugner et al. identified the top ten predictive factors of T2D through ML 10. HbA1c exhibited the strongest predictive power for T2D, followed by BMI, waist circumference, blood glucose levels, and relative diabetes score. Other factors, such as gamma-glutamyltransferase (GGT), waist-hip ratio, HDL level, age, and urate, demonstrated less predictive power.
In addition to these demographic and clinical factors, researchers continue to explore new features to enhance the accuracy of ML models. One such approach incorporates the polygenic risk score (PRS), a numerical value calculated from genetic variants across the genome that are associated with T2D risk. Researchers have found that a risk-prediction model using PRS in addition to demographic and clinical risk factors improved the prediction of T2D in high-risk individuals. For example, in a 10-year prospective cohort study, Seok-Ju Hahn et al. found that incorporating PRS and data on 186 serum metabolites into Random Forest (RF)-based models significantly increased the area under the curve (AUC), a measure of predictive performance, compared to using demographic and clinical risk factors alone 11. Metabolite data complement PRS and clinical factors in T2D prediction because metabolites provide information about metabolic pathways linked to insulin resistance, glucose regulation, and even early disease development.
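To make the polygenic risk score described above more concrete, the short Python sketch below computes a PRS in its most common basic form: a weighted sum of the number of risk alleles a person carries at each variant. The function name, the three variants, and the effect sizes are hypothetical values chosen for illustration. They are not taken from any of the cited studies, which use far more variants and may weight them differently.

```python
def polygenic_risk_score(allele_counts, effect_sizes):
    """Return the weighted sum of risk-allele counts: PRS = sum(beta_i * g_i).

    allele_counts: copies of the risk allele at each variant (0, 1, or 2)
    effect_sizes:  per-variant weights, e.g. log odds ratios (hypothetical here)
    """
    if len(allele_counts) != len(effect_sizes):
        raise ValueError("allele_counts and effect_sizes must have the same length")
    return sum(beta * g for beta, g in zip(effect_sizes, allele_counts))


# Hypothetical weights for three variants (illustrative only).
effect_sizes = [0.25, 0.5, 0.125]

# Two hypothetical people and their risk-allele counts at those variants.
person_a = [0, 1, 2]
person_b = [2, 2, 1]

score_a = polygenic_risk_score(person_a, effect_sizes)  # 0.75
score_b = polygenic_risk_score(person_b, effect_sizes)  # 1.625

print(f"Person A PRS: {score_a}")
print(f"Person B PRS: {score_b}")
```

In this toy example, person B carries more risk alleles at the more heavily weighted variants and therefore receives the higher score. In a prediction model, a score like this would be one more feature alongside clinical measurements such as HbA1c and BMI.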
AUC is a key performance metric in machine learning, particularly for evaluating classification models in medical diagnostics, including T2D prediction. It is typically the area under the receiver operating characteristic (ROC) curve (AUC-ROC), which assesses a model’s ability to distinguish between diabetic and non-diabetic individuals 11. In another study, Yikang Wang et al. also reported a significant increase in the AUC of RF and gradient boosting machine (GBM)-based ML models after including a genetic risk score (GRS) as a feature 12. In a more recent study, Jinjin Li et al. demonstrated that the AUC of a logistic regression model increased from 0.678 to 0.908 when GRS was combined with non-GRS features 13. In these studies, therefore, incorporating genetic risk factors into the features increased the predictive performance of ML models.
Not only the choice of features but also the type of model influences predictive power. Henock M. Deberneh and Intaek Kim compared commonly used ML models, including logistic regression (LR), RF, support vector machine, XGBoost, and ensemble ML models, in predicting diabetes 14. They found that the single models all performed reasonably well, with accuracies ranging from 0.71 to 0.73, and that the ensemble model, which aggregates multiple classifiers, outperformed the single models. Another study demonstrated that the widely used LR model, a linear regression-based method, exhibits lower predictive performance than other advanced ML models such as RF, LightGBM, Glmnet, and XGBoost, which are primarily tree-based ensemble or regularized methods 4. These models are generally better at capturing complex, nonlinear relationships in high-dimensional data, which may explain their superior performance in T2D prediction. Md. Kowsher et al. (2023) recently compared seven machine learning models, including both tree-based algorithms and linear models, with a deep learning method, the artificial neural network (ANN), in detecting T2D. The deep ANN achieved 95.14% accuracy and outperformed all other ML models tested because an ANN is able to capture complex nonlinear relationships within the data 15.
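The idea of an ensemble model that aggregates multiple classifiers can be sketched in a few lines of Python. The example below assumes "soft voting", in which the T2D probabilities predicted by several models are averaged before a decision threshold is applied. The cited studies may aggregate their classifiers differently, and the model outputs shown here are hypothetical numbers, not results from any study.

```python
def soft_vote(probabilities):
    """Average the T2D probabilities that several models give one patient."""
    return sum(probabilities) / len(probabilities)


def ensemble_predict(model_outputs, threshold=0.5):
    """Combine per-model probabilities into one 0/1 prediction per patient.

    model_outputs: one list per model, each holding a probability per patient
    threshold:     averaged probability at or above which T2D is predicted
    """
    n_patients = len(model_outputs[0])
    averaged = [
        soft_vote([model[i] for model in model_outputs])
        for i in range(n_patients)
    ]
    return [1 if p >= threshold else 0 for p in averaged]


# Hypothetical predicted probabilities for four patients from three models.
lr_probs = [0.2, 0.9, 0.6, 0.1]   # logistic regression
rf_probs = [0.4, 0.8, 0.3, 0.2]   # random forest
xgb_probs = [0.3, 0.7, 0.4, 0.1]  # XGBoost

predictions = ensemble_predict([lr_probs, rf_probs, xgb_probs])
print(predictions)  # [0, 1, 0, 0]
```

For the third patient, logistic regression alone would predict T2D (0.6), but the other two models disagree and the averaged probability falls below the threshold. Averaging in this way is one reason an ensemble can be more stable than any single classifier.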
 

Conclusion

Machine learning has introduced new ways to predict, and thereby help prevent, T2D. With early prediction, people at risk can make lifestyle changes before T2D progresses into a chronic condition, ultimately reducing the burden on healthcare systems. Early detection of T2D can lower healthcare costs by preventing complications, reducing hospital admissions, and minimizing long-term treatment expenses. Thus, the development and optimization of machine learning techniques for the prediction of T2D will have a great impact on diabetes research and healthcare. The integration of GRS and ML offers a promising approach to T2D prevention. Continued advances in ML technologies, such as ensemble models with multiple classifiers and ANN methods, will further improve T2D risk prediction and strengthen preventative strategies.
While machine learning has significantly improved T2D prediction, it also has limitations. Many ML models rely on the health records of specific cohorts, which may not fully represent diverse populations, leading to biased predictions. Training complex ML models also requires substantial computational power and is expensive once infrastructure and personnel training are included, which may not be feasible for small clinics or hospitals. Despite these challenges, the continued development of ML techniques in predicting T2D still provides a powerful tool for reducing the global burden of diabetes and improving public health outcomes.


Figure Legends

Figure 1. Top 10 predictive factors for Type 2 Diabetes. Each section of the pie chart represents a predictive factor for T2D identified through machine learning. A larger section indicates stronger predictive power for T2D. These predictive factors help ML models detect patterns and predict T2D in patients. This figure was created by the author based on data synthesized from the referenced literature 9.
 

Top 10 predictive factors for Type 2 Diabetes chart


Table Legends

Table 1. Integration of the Genetic Risk Score (GRS) improves the performance of machine learning models in the prediction of Type 2 diabetes (T2D). The ML model, dataset, features, and algorithm used are listed in columns 1 and 2. Prediction performance was assessed using the area under the receiver operating characteristic (ROC) curve (AUC-ROC). A higher AUC-ROC score indicates higher discriminatory power, where a value of 1.0 represents perfect classification and 0.5 corresponds to random guessing. This table was compiled by the author based on data from the referenced studies.

Integration of the Genetic Risk Score (GRS) improves the performance of machine learning models in the prediction of Type 2 diabetes (T2D) table
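The AUC-ROC values reported in the table can be read as the probability that a model gives a randomly chosen person with T2D a higher risk score than a randomly chosen person without it. The Python sketch below computes AUC directly from that definition by comparing every diabetic and non-diabetic pair, counting ties as half. The labels and scores are hypothetical values chosen to show the two reference points named in the legend, not data from the referenced studies.

```python
def auc_roc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    labels: 1 for a person with T2D, 0 for a person without
    scores: the model's predicted risk for each person
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")

    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0   # pair ranked in the right order
            elif p == n:
                correct += 0.5   # tie counts as half
    return correct / (len(positives) * len(negatives))


# Hypothetical cohort: three people with T2D, three without.
labels = [1, 1, 1, 0, 0, 0]

perfect_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]        # every pair ordered correctly
uninformative_scores = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]  # same score for everyone
partial_scores = [0.9, 0.4, 0.7, 0.6, 0.2, 0.1]        # one pair ordered wrongly

auc_perfect = auc_roc(labels, perfect_scores)              # 1.0
auc_uninformative = auc_roc(labels, uninformative_scores)  # 0.5
auc_partial = auc_roc(labels, partial_scores)              # 8/9, about 0.889

print(auc_perfect, auc_uninformative, round(auc_partial, 3))
```

The pairwise loop is easy to follow but slow for large cohorts. Statistical libraries compute the same quantity more efficiently from the sorted scores.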



References

  1. Pouya Saeedi, et al. (2019) Global and regional diabetes prevalence estimates for 2019 and projections for 2030 and 2045: Results from the International Diabetes Federation Diabetes Atlas, 9th edition. 
    https://pubmed.ncbi.nlm.nih.gov/32061820/
  2. Lei Chen, et al. (2011) The worldwide epidemiology of type 2 diabetes mellitus–present and future perspectives. 
    https://pubmed.ncbi.nlm.nih.gov/22064493/
  3. Yanling Wu, et al. (2014) Risk Factors Contributing to Type 2 Diabetes and Recent Advances in the Treatment and Prevention. 
    https://pubmed.ncbi.nlm.nih.gov/25249787/
  4. Leon Kopitar, et al. (2020) Early detection of type 2 diabetes mellitus using machine learning-based prediction models. 
    https://www.nature.com/articles/s41598-020-68771-z
  5. Luis Fregoso-Aparicio, et al. (2021) Machine learning and deep learning predictive models for type 2 diabetes: a systematic review.
    https://pubmed.ncbi.nlm.nih.gov/34930452/
  6. Micheal O. Olusanya, et al. (2022) Accuracy of Machine Learning Classification Models for the Prediction of Type 2 Diabetes Mellitus: A Systematic Survey and Meta-Analysis Approach. 
    https://pubmed.ncbi.nlm.nih.gov/36361161/
  7. Seong Gyu Choi, et al. (2023) Comparisons of the prediction models for undiagnosed diabetes between machine learning versus traditional statistical methods.
    https://pubmed.ncbi.nlm.nih.gov/37567907/
  8. Shu Wang, et al. (2022) Comparative study on risk prediction model of type 2 diabetes based on machine learning theory: a cross-sectional study.
    https://pmc.ncbi.nlm.nih.gov/articles/PMC10465890/
  9. Yaqian Mao, et al. (2023) Value of machine learning algorithms for predicting diabetes risk: A subset analysis from a real-world retrospective cohort study.
    https://pubmed.ncbi.nlm.nih.gov/36345236/
  10. Moa Lugner, et al. (2024) Identifying top ten predictors of type 2 diabetes through machine learning analysis of UK Biobank data.
    https://www.nature.com/articles/s41598-024-52023-5
  11. Seok-Ju Hahn, et al. (2022) Prediction of type 2 diabetes using genome-wide polygenic risk score and metabolic profiles: A machine learning analysis of population-based 10-year prospective cohort study. 
    https://pubmed.ncbi.nlm.nih.gov/36462406/
  12. Yikang Wang, et al. (2021) Genetic Risk Score Increased Discriminant Efficiency of Predictive Models for Type 2 Diabetes Mellitus Using Machine Learning: Cohort Study. 
    https://pubmed.ncbi.nlm.nih.gov/33681127/
  13. Jinjin Li, et al. (2023) An early prediction model for type 2 diabetes mellitus based on genetic variants and nongenetic risk factors in a Han Chinese cohort.
    https://pubmed.ncbi.nlm.nih.gov/37955008/
  14. Henock M. Deberneh and Intaek Kim (2021) Prediction of Type 2 Diabetes Based on Machine Learning Algorithm.
    https://pubmed.ncbi.nlm.nih.gov/33806973/
  15. Md. Kowsher, et al. (2023) Prognosis and Treatment Prediction of Type-2 Diabetes Using Deep Neural Network and Machine Learning Classifiers. 
    https://arxiv.org/abs/2301.03093
     

About the Author

Grace P

Grace Peng, a junior at Yorba Linda High School, is passionate about predicting and preventing Type 2 Diabetes because of her family history. She is an FBLA officer, leads business initiatives, and competes in FBLA events. Outside of school, she enjoys playing piano and using music to serve her community.

Mentor: Dr. Jian Xu, University of Southern California
Affiliation: Yorba Linda High School