Data Preparation
Data Loading and Initial Inspection: The dataset "HR_comma_sep.csv" was loaded into a pandas DataFrame. The df.head() and df.info() methods were used to inspect the first few rows and check data types and non-null counts. The dataset contains 14,999 entries with no missing values.
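The loading and inspection step can be sketched as follows. Since the original "HR_comma_sep.csv" file is not reproduced here, the sketch builds a tiny stand-in DataFrame with the same shape of workflow; only the Department, salary, and left columns are named in the report, so the numeric columns are illustrative assumptions.

```python
import pandas as pd

# In the report, the real file is read directly:
#   df = pd.read_csv("HR_comma_sep.csv")
# A small stand-in frame is used here so the sketch runs on its own.
df = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11, 0.72],   # assumed column
    "average_montly_hours": [157, 262, 272, 223],     # assumed column
    "Department": ["sales", "sales", "hr", "technical"],
    "salary": ["low", "medium", "medium", "high"],
    "left": [1, 1, 1, 0],
})

print(df.head())                 # first rows
df.info()                        # dtypes and non-null counts
print(df.isnull().sum().sum())   # 0 -> no missing values
```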
Feature Engineering and Preprocessing:
Categorical Feature Encoding: The Department and salary columns, which are categorical, were converted into numerical values using LabelEncoder.
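A minimal sketch of the encoding step, using a toy frame in place of the real data:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Stand-in for the two categorical columns named in the report
df = pd.DataFrame({
    "Department": ["sales", "hr", "technical", "sales"],
    "salary": ["low", "medium", "high", "low"],
})

# One encoder per column: LabelEncoder maps each distinct category
# to an integer, assigned in sorted (alphabetical) order of the classes.
for col in ["Department", "salary"]:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])

print(df)
```

One caveat worth noting: LabelEncoder imposes an arbitrary integer ordering on nominal categories (here, salary "high" becomes 0 and "low" becomes 1 simply by alphabetical order), so one-hot encoding is a common alternative for linear models.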
Feature and Target Split: The dataset was split into features (X) and the target variable (y). The left column, which indicates whether an employee left (1) or stayed (0), was designated as the target.
Feature Scaling: All features in X were standardized using StandardScaler so that each has a mean of 0 and a standard deviation of 1. This is an important step for scale-sensitive algorithms such as Logistic Regression.
Train-Test Split: The data was divided into training and testing sets using train_test_split with a test_size of 20% and stratify=y to maintain the same proportion of the target class in both sets.
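The split-and-scale steps above can be sketched as one pipeline; a random matrix stands in for the encoded HR features, and the 80/20 stratified split mirrors the parameters in the report.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded feature matrix and 'left' target
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# Standardize features to mean 0, std 1. Note: fitting the scaler on
# the full dataset before splitting (as described above) lets test-set
# statistics leak into training; fitting on the training split alone
# is the more conservative choice.
X_scaled = StandardScaler().fit_transform(X)

# 80/20 split, stratified so both sets keep the same class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```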
Model Training and Evaluation
Three different Logistic Regression models were trained and evaluated:
Baseline Logistic Regression:
A standard LogisticRegression model was trained on the scaled training data.
The model's performance was evaluated on the test set, resulting in an accuracy of 0.771.
The precision was 0.539, the recall was 0.261, and the F1-score was 0.351.
The confusion matrix shows that the model correctly predicted 2127 employees who stayed and 186 employees who left. However, it missed 528 of the 714 employees who actually left (528 false negatives).
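The baseline training and evaluation loop can be sketched as below. A synthetic dataset with roughly the same 76/24 class ratio stands in for the scaled HR features, so the printed metrics are illustrative, not the figures reported above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

# Imbalanced stand-in data: class 0 ("stayed") is ~76% of samples
X, y = make_classification(n_samples=3000, n_features=8,
                           weights=[0.76], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Baseline model with default settings
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("f1       :", f1_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))  # [[TN, FP], [FN, TP]]
```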
SMOTE with Logistic Regression:
To address the class imbalance (11428 stayed vs. 3571 left), the SMOTE (Synthetic Minority Over-sampling Technique) method was applied to the training data. This technique oversamples the minority class (left=1) to balance the dataset.
A new Logistic Regression model was trained on the resampled data.
The model showed an improved recall of 0.78 for the minority class (employees who left), with a trade-off in precision (0.50). This means the model is now much better at identifying employees who are likely to leave. The overall accuracy remained similar at 0.76.
Balanced Class Weight Logistic Regression:
Instead of oversampling, this approach uses the class_weight='balanced' parameter in the Logistic Regression model. This tells the algorithm to automatically adjust the weights inversely proportional to class frequencies, giving more importance to the minority class.
This model achieved a recall of 0.78, a precision of 0.51, and an overall accuracy of 0.77. The results are very similar to the SMOTE model, confirming that addressing class imbalance is crucial for this prediction task.
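The class-weight approach can be sketched as below, again on stand-in data; the only change from the baseline is the class_weight='balanced' argument, so no resampling step is needed.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

# Stand-in for the HR data (majority class ~76%)
X, y = make_classification(n_samples=2000, n_features=8,
                           weights=[0.76], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# class_weight='balanced' reweights each class by
# n_samples / (n_classes * n_class_samples), so errors on the
# minority class cost more during fitting.
plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
weighted = LogisticRegression(class_weight="balanced",
                              max_iter=1000).fit(X_train, y_train)

print("plain recall   :", recall_score(y_test, plain.predict(X_test)))
print("weighted recall:", recall_score(y_test, weighted.predict(X_test)))
```

In practice the weighted model typically predicts the minority class more often, trading precision for recall, which matches the pattern reported above.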