Diabetes Prediction Project
This project uses a large clinical dataset to build machine learning models that can predict whether a patient has diabetes.
The Data
The team worked with a dataset of 100,000 patient records sourced from Kaggle, containing 17 features (a mix of numerical and categorical data) such as age, BMI, blood glucose levels, and HbA1c levels. The project had two goals: classification (diabetic vs. non-diabetic) and clustering (grouping patients with similar profiles).
Cleaning & Preparation
Before any analysis, the data was preprocessed: duplicates were removed, null values were replaced with "Unknown" rather than discarding the affected rows, each column was assigned its correct data type, and outliers were detected and removed in key variables such as age, BMI, HbA1c, and blood glucose.
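The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's actual code: the column names, the toy rows, and the choice of the 1.5 × IQR rule for outliers are all assumptions, since the source does not say which outlier method was used.

```python
import statistics

def preprocess(rows, numeric_cols, fill_value="Unknown"):
    """Deduplicate rows, fill missing categorical values, drop IQR outliers."""
    # 1. Remove exact duplicates while keeping the first occurrence.
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(row))

    # 2. Replace missing values with a placeholder instead of dropping rows.
    for row in unique:
        for col, value in row.items():
            if value is None and col not in numeric_cols:
                row[col] = fill_value

    # 3. Compute IQR fences per numeric column (assumes no missing numerics).
    fences = {}
    for col in numeric_cols:
        q1, _, q3 = statistics.quantiles([r[col] for r in unique], n=4)
        spread = 1.5 * (q3 - q1)
        fences[col] = (q1 - spread, q3 + spread)

    # 4. Keep only rows whose numeric values all fall inside the fences.
    return [
        r for r in unique
        if all(lo <= r[c] <= hi for c, (lo, hi) in fences.items())
    ]

# Made-up toy rows (hypothetical column names), for illustration only.
rows = [
    {"gender": "Female", "age": 45, "bmi": 27.1, "smoking_history": "never"},
    {"gender": "Female", "age": 45, "bmi": 27.1, "smoking_history": "never"},  # duplicate
    {"gender": "Male", "age": 52, "bmi": 30.4, "smoking_history": None},       # null value
    {"gender": "Female", "age": 38, "bmi": 24.9, "smoking_history": "former"},
    {"gender": "Male", "age": 61, "bmi": 28.3, "smoking_history": "current"},
    {"gender": "Female", "age": 47, "bmi": 95.0, "smoking_history": "never"},  # BMI outlier
    {"gender": "Male", "age": 55, "bmi": 26.0, "smoking_history": "never"},
]

clean = preprocess(rows, numeric_cols=["age", "bmi"])
for row in clean:
    print(row)
```

On a dataset of 100,000 rows the same steps would more likely be run with a dataframe library; the plain-Python version is only meant to make the order of operations explicit.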
Exploring the Data
Visualization revealed several patterns:
The dataset skews female (roughly 58k female vs. 41k male patients).
HbA1c levels, BMI, and blood glucose all showed clear differences between diabetic and non-diabetic patients.
Hypertension also showed a notable association with diabetes risk.
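A simple way to check group differences like these numerically is to compare per-group means. The sketch below assumes hypothetical column names and uses a handful of made-up records; the numbers are not taken from the project's dataset or results.

```python
from collections import defaultdict
from statistics import mean

# Made-up toy records for illustration only; not values from the dataset.
patients = [
    {"diabetes": 0, "hba1c": 5.2, "bmi": 24.0, "glucose": 95},
    {"diabetes": 0, "hba1c": 5.6, "bmi": 26.5, "glucose": 110},
    {"diabetes": 0, "hba1c": 5.0, "bmi": 22.8, "glucose": 90},
    {"diabetes": 1, "hba1c": 7.1, "bmi": 31.2, "glucose": 180},
    {"diabetes": 1, "hba1c": 6.8, "bmi": 29.7, "glucose": 160},
]

def group_means(records, label, features):
    """Return {label_value: {feature: mean}} for the given features."""
    groups = defaultdict(list)
    for record in records:
        groups[record[label]].append(record)
    return {
        value: {f: mean(r[f] for r in members) for f in features}
        for value, members in groups.items()
    }

summary = group_means(patients, "diabetes", ["hba1c", "bmi", "glucose"])
for value, means in sorted(summary.items()):
    print(value, means)
```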
Building the Models
Three models were built and evaluated:
Random Forest: trained on a dataset balanced with SMOTE (Synthetic Minority Over-sampling Technique) to handle class imbalance, with feature importance analysis used to select only the most relevant predictors. Performance was evaluated using an ROC curve.
Logistic Regression: a simpler, interpretable model used for comparison, also evaluated with an ROC curve.
CLARA (Clustering Large Applications): an unsupervised, sampling-based k-medoids algorithm that grouped patients into clusters. An elbow graph indicated that the optimal number of clusters was 2 or 3.
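To make the clustering step more concrete, here is a minimal sketch of the CLARA idea: run k-medoids (PAM) on several small random samples, then keep the medoids that give the lowest total distance over the full dataset. Everything here is an assumption for illustration, including the function names, the parameters `n_samples` and `sample_size`, and the synthetic (age, BMI) points; features are left unscaled to keep the example short, and the project's own implementation may differ.

```python
import random

def dist(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def total_cost(points, medoids):
    """Sum of distances from each point to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam(sample, k, rng):
    """Basic k-medoids on a small sample: swap medoids while cost improves."""
    medoids = rng.sample(sample, k)
    best = total_cost(sample, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for candidate in sample:
                if candidate in medoids:
                    continue
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                cost = total_cost(sample, trial)
                if cost < best:
                    medoids, best, improved = trial, cost, True
    return medoids

def clara(points, k, n_samples=5, sample_size=40, seed=0):
    """CLARA: run PAM on several random samples, score each on all points."""
    rng = random.Random(seed)
    best_medoids, best_cost = None, float("inf")
    for _ in range(n_samples):
        sample = rng.sample(points, min(sample_size, len(points)))
        medoids = pam(sample, k, rng)
        cost = total_cost(points, medoids)
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost

# Synthetic (age, BMI) points in two well-separated groups; illustration only.
data_rng = random.Random(42)
points = [(data_rng.uniform(25, 35), data_rng.uniform(20, 25)) for _ in range(100)]
points += [(data_rng.uniform(55, 65), data_rng.uniform(30, 35)) for _ in range(100)]

# Total cost for each candidate k; plotting cost against k gives the elbow graph.
costs = {k: clara(points, k)[1] for k in range(1, 5)}
medoids, _ = clara(points, 2)
print(costs)
print(medoids)
```

Because PAM only ever runs on a small sample, the expensive swap search stays cheap even when the full dataset is large, which is the main reason to prefer CLARA over plain k-medoids at this scale.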
The Final Product
Beyond the models themselves, the team built a full web-based user interface where users can interact with the model. The app includes a homepage, a model-building screen, data visualizations, patient result screens, a confusion matrix view, and a clustering model screen.
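As a rough illustration of what could sit behind the confusion matrix view, the sketch below counts the four outcome types for a binary classifier and derives a few common metrics. The labels and predictions are made up, and the function names are hypothetical rather than taken from the app.

```python
def confusion_matrix(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = diabetic)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            counts["tp"] += 1
        elif truth == 0 and pred == 1:
            counts["fp"] += 1
        elif truth == 0 and pred == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts

def summarize(counts):
    """Derive accuracy, precision, and recall from the four counts."""
    total = sum(counts.values())
    return {
        "accuracy": (counts["tp"] + counts["tn"]) / total,
        "precision": counts["tp"] / (counts["tp"] + counts["fp"]),
        "recall": counts["tp"] / (counts["tp"] + counts["fn"]),
    }

# Made-up labels and predictions, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

matrix = confusion_matrix(y_true, y_pred)
metrics = summarize(matrix)
print(matrix)
print(metrics)
```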