This project involves cleaning, preprocessing, and transforming an HR dataset from an Excel file with multiple sheets. Here's a detailed breakdown of the process:
1. Data Loading and Initial Exploration
The project begins by loading an Excel file (HR_Dataset.xlsx) from a specific path
It reads all sheets from the Excel file using sheet_name=None to get a dictionary of DataFrames
For each sheet, it performs initial exploration by:
Printing sheet information with df.info()
Showing descriptive statistics with df.describe()
Counting duplicated rows with df.duplicated().sum()
Counting null values in each column with df.isnull().sum()
2. Data Cleaning
Handling Missing Values
Different strategies are applied to different sheets:
Employee sheet:
Sick_Days filled with mean value (converted to integer)
Years_At_Company filled with mean value (converted to integer)
Projects sheet:
Projects_Handled filled with median value
Remote work sheet:
Remote_Work_Frequency filled with mode (most frequent value)
Handling Duplicates
For each sheet, duplicated rows are identified and counted
A new dictionary HR_cleaned is created containing each sheet's data with duplicates removed using drop_duplicates()
3. Data Export
The cleaned data is exported to a new Excel file (HR_Dataset(final).xlsx) with two versions of each sheet:
Nulls_Handled_[sheetname]: Contains data with missing values handled
Cleaned_[sheetname]: Contains data with both missing values handled and duplicates removed
4. Data Transformation
Several transformations are performed on the dataset:
Age and Hiring Information
Current year is set as 2023 for calculations
Age column is converted to numeric, handling potential errors
Rows with missing Age values are dropped
Hire_Year is generated based on age groups:
Age > 45: Hired between 2005-2010
Age 40-45: Hired between 2010-2015
Age 30-39: Hired between 2015-2020
Age 20-29: Hired between 2019-2023
Date Generation
Hire_Date is created by combining:
Generated Hire_Year
Random month (1-12)
Random day (1-28)
Format is set as MM-DD-YYYY
Years at Company Calculation
Years_At_Company is calculated as current year (2023) minus Hire_Year
Gender Standardization
Gender values are standardized to 'M' and 'F' (from 'Male' and 'Female')
Final Output
The script prints the transformed data showing the Age, Hire_Date, and Years_At_Company columns to verify the transformations.