Breast Cancer Classification with Logistic Regression
I recently completed a project using the Breast Cancer Wisconsin Diagnostic Dataset to build a predictive model for classifying tumors as Benign (B) or Malignant (M).
Key Steps:
Dataset: 569 samples, 30 numerical features describing tumor cell nuclei.
Preprocessing:
Encoded diagnosis (M=1, B=0)
Removed irrelevant ID column
Standardized all features with StandardScaler
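The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the tiny table, its values, and the column names (id, diagnosis, radius_mean, texture_mean) are assumptions made for the example.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical miniature of the raw table: an ID column, the diagnosis label,
# and two numeric features. All values are invented for illustration.
df = pd.DataFrame({
    "id": [101, 102, 103, 104],
    "diagnosis": ["M", "B", "B", "M"],
    "radius_mean": [17.9, 12.4, 11.8, 20.1],
    "texture_mean": [10.4, 15.7, 18.2, 23.0],
})

# Encode the label: Malignant -> 1, Benign -> 0.
y = df["diagnosis"].map({"M": 1, "B": 0})

# Drop the identifier and the label, keeping only the numeric features.
X = df.drop(columns=["id", "diagnosis"])

# Standardize each feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

print(y.tolist())
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

After scaling, each column has mean 0 and standard deviation 1, which keeps features measured on large scales from dominating the logistic regression coefficients.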
Train-Test Split: 80% train / 20% test (stratified to maintain class balance).
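A stratified 80/20 split might look like the sketch below. It assumes scikit-learn's bundled copy of the dataset (load_breast_cancer) rather than the original file, and the random_state value is an arbitrary choice for reproducibility, not taken from the project.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X = data.data
# scikit-learn encodes malignant as 0; flip it to match the post (M=1, B=0).
y = (data.target == 0).astype(int)

# stratify=y keeps the malignant/benign ratio the same in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42  # seed is an assumption
)

print(X_train.shape, X_test.shape)
print(f"malignant share: train={y_train.mean():.3f}, test={y_test.mean():.3f}")
```

With 569 samples, a 20% test split yields 114 test rows and 455 training rows, which matches the total of the confusion matrix reported in the results.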
Model:
Logistic Regression (with class_weight="balanced" to handle class imbalance).
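One way to train and score such a model is sketched below. Several details are assumptions rather than the project's settings: the data comes from scikit-learn's load_breast_cancer, the seed and max_iter are arbitrary, and the scaler is wrapped in a pipeline so it is fitted on the training rows only, which may differ from how the scaling was done originally. Exact scores depend on the split, so none are quoted here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X, y = data.data, (data.target == 0).astype(int)  # malignant -> 1, benign -> 0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42  # seed is an assumption
)

# class_weight="balanced" reweights samples inversely to class frequency.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(X_train, y_train)

# Compare train and test accuracy to look for signs of overfitting.
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
# AUC uses the predicted probability of the malignant class.
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

print(f"train acc={train_acc:.3f}  test acc={test_acc:.3f}  AUC={test_auc:.3f}")
```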
Results:
Accuracy: 97.4%
AUC (ROC Curve): 0.995
Confusion Matrix:
True Negatives: 71
False Positives: 1
False Negatives: 2
True Positives: 40
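The headline metrics follow directly from these four counts. The short calculation below only does the arithmetic on the reported confusion matrix and also derives recall, precision, and specificity, which were not listed above.

```python
# Counts reported in the confusion matrix above.
tn, fp, fn, tp = 71, 1, 2, 40
total = tn + fp + fn + tp

accuracy = (tp + tn) / total     # share of all test cases classified correctly
recall = tp / (tp + fn)          # share of malignant tumors that were caught
precision = tp / (tp + fp)       # share of malignant predictions that were right
specificity = tn / (tn + fp)     # share of benign tumors correctly cleared

print(f"n={total}")
print(f"accuracy={accuracy:.3f} recall={recall:.3f}")
print(f"precision={precision:.3f} specificity={specificity:.3f}")
```

The 114 test cases give an accuracy of 111/114, or about 97.4%, consistent with the figure above; recall on malignant cases is 40/42, or about 95.2%.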
Insights:
Training (98.7%) and test (97.4%) accuracies were both high and close → little sign of overfitting.
The model performs very well, but the 2 False Negatives (missed malignant cases, out of 42 malignant tumors in the test set) remain the most critical error type in medical diagnosis, as they could delay treatment.
False Positives, while less harmful, may lead to unnecessary stress and additional tests.
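Because the two error types carry different costs, one common option is to lower the decision threshold so that fewer malignant cases are missed, at the price of more false alarms. The sketch below illustrates that trade-off; it was not part of the original project, and the dataset loader, the seed, and the 0.3 threshold are all assumptions chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X, y = data.data, (data.target == 0).astype(int)  # malignant -> 1, benign -> 0
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42  # seed is an assumption
)

model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(X_train, y_train)

# Predicted probability that each test tumor is malignant.
proba = model.predict_proba(X_test)[:, 1]


def count_errors(threshold):
    """Return (false negatives, false positives) at a given threshold."""
    pred = (proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, pred, labels=[0, 1]).ravel()
    return int(fn), int(fp)


fn_default, fp_default = count_errors(0.5)  # the default cut-off
fn_low, fp_low = count_errors(0.3)          # illustrative lower cut-off

print(f"threshold 0.5: FN={fn_default}, FP={fp_default}")
print(f"threshold 0.3: FN={fn_low}, FP={fp_low}")
```

Lowering the threshold can never increase False Negatives, since every case flagged at 0.5 is still flagged at 0.3; whether the extra False Positives are acceptable is a clinical judgment rather than a modeling one.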
This project shows that even a simple, interpretable model like Logistic Regression can deliver strong results on a healthcare classification task.