COMPARISON OF K-NEAREST NEIGHBORS, XGBOOST, SUPPORT VECTOR MACHINE, AND RANDOM FOREST IN DIABETES PREDICTION BASED ON RISK FACTORS

    Hibar Taufikurachman (2025) PERBANDINGAN K-NEAREST NEIGHBORS, XGBOOST, SUPPORT VECTOR MACHINE, DAN RANDOM FOREST DALAM PREDIKSI DIABETES BERDASARKAN FAKTOR RISIKO. S1 thesis, Universitas Pendidikan Indonesia.

    Abstract

    Diabetes mellitus is a rapidly growing global health emergency, and millions of undiagnosed cases raise the risk of complications. Machine learning offers one route to early detection. This study develops diabetes detection models with four algorithms, K-Nearest Neighbors (KNN), XGBoost, Support Vector Machine (SVM), and Random Forest, and analyzes the differences in their performance. It also identifies and analyzes the most influential diabetes risk factors. The data come from the National Health and Nutrition Examination Survey (NHANES) for August 2021 to August 2023 and cover 21 features drawn from demographic, clinical, and lifestyle risk factors. Class imbalance in the target was handled with the SMOTE-ENN technique. Model performance was evaluated with accuracy, precision, recall, F1-score, specificity, and ROC-AUC. The results show that the ensemble models significantly outperformed the other models. After hyperparameter tuning, XGBoost achieved the best accuracy (0.9371), precision (0.7194), and F1-score (0.7855), while Random Forest achieved the best recall (0.8773) and ROC-AUC (0.9599). Feature importance analysis identified glycohemoglobin (HbA1c) as the most influential predictor, followed by cholesterol-related risk factors (cholesterol level and cholesterol history) and smoking. Age, race, sleep disorders, and depression also contributed. These findings indicate that ensemble models are highly effective for diabetes prediction: XGBoost gives the best-balanced performance, while Random Forest, with the highest recall and ROC-AUC, is well suited to diabetes screening. Furthermore, the prominence of HbA1c as the primary risk factor agrees with clinical understanding, which supports the relevance of the model.
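    The threshold-based metrics named in the abstract (accuracy, precision, recall, F1-score, specificity) all follow from a binary confusion matrix. The sketch below is illustrative only and is not the thesis code: the names `ConfusionCounts`, `classification_metrics`, and `implied_recall` are hypothetical, and the example counts are invented. ROC-AUC is left out because it needs predicted scores rather than counts. The last helper rearranges F1 = 2PR / (P + R) to recover the recall implied by a reported precision and F1-score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Binary confusion-matrix counts, with diabetes as the positive class."""
    tp: int  # diabetic, predicted diabetic
    fp: int  # non-diabetic, predicted diabetic
    tn: int  # non-diabetic, predicted non-diabetic
    fn: int  # diabetic, predicted non-diabetic


def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(c: ConfusionCounts) -> dict[str, float]:
    """Compute the count-based metrics listed in the abstract."""
    precision = _ratio(c.tp, c.tp + c.fp)
    recall = _ratio(c.tp, c.tp + c.fn)
    return {
        "accuracy": _ratio(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn),
        "precision": precision,
        "recall": recall,
        "f1": _ratio(2 * precision * recall, precision + recall),
        "specificity": _ratio(c.tn, c.tn + c.fp),
    }


def implied_recall(precision: float, f1: float) -> float:
    """Solve F1 = 2PR / (P + R) for R, given P and F1."""
    denominator = 2 * precision - f1
    if denominator <= 0:
        raise ValueError("no positive recall is consistent with these values")
    return f1 * precision / denominator


# Hypothetical counts for illustration; they are NOT results from the thesis.
example = ConfusionCounts(tp=80, fp=30, tn=870, fn=20)
metrics = classification_metrics(example)

# Recall implied by the XGBoost precision and F1-score quoted in the abstract.
xgboost_recall = implied_recall(precision=0.7194, f1=0.7855)

if __name__ == "__main__":
    for name, value in metrics.items():
        print(f"{name}: {value:.4f}")
    print(f"implied XGBoost recall: {xgboost_recall:.3f}")
```

    Under this rearrangement, the XGBoost figures quoted above imply a recall of roughly 0.865, slightly below the 0.8773 reported for Random Forest, which is consistent with the abstract's ranking of the two models on recall.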

    Files:
    S_RPL_1907774_Title.pdf (1MB)
    S_RPL_1907774_Chapter1.pdf (254kB)
    S_RPL_1907774_Chapter2.pdf (909kB, restricted to library staff)
    S_RPL_1907774_Chapter3.pdf (941kB)
    S_RPL_1907774_Chapter4.pdf (1MB, restricted to library staff)
    S_RPL_1907774_Chapter5.pdf (201kB)
    S_RPL_1907774_Appendix.pdf (2MB)
    Official URL: https://repository.upi.edu/
    Item Type: Thesis (S1)
    Additional Information: https://scholar.google.com/citations?user=6rjxcpcAAAAJ&hl=en; SINTA IDs of the thesis supervisors: Mochamad Iqbal Ardimansyah (6658552), Yulia Retnowati (6852573)
    Uncontrolled Keywords: Diabetes Prediction, Machine Learning, XGBoost, Random Forest, Risk Factors.
    Subjects: L Education > L Education (General)
        Q Science > QA Mathematics > QA75 Electronic computers. Computer science
        Q Science > QA Mathematics > QA76 Computer software
        T Technology > T Technology (General)
    Divisions: UPI Kampus Cibiru > S1 Rekayasa Perangkat Lunak (Software Engineering)
    Depositing User: HIBAR TAUFIKURACHMAN
    Date Deposited: 26 Sep 2025 04:00
    Last Modified: 26 Sep 2025 04:00
    URI: http://repository.upi.edu/id/eprint/137929
