HR Dataset Cleaning and Transformation Project

تفاصيل العمل

This project involves cleaning, preprocessing, and transforming an HR dataset from an Excel file with multiple sheets. Here's a detailed breakdown of the process:

1. Data Loading and Initial Exploration

The project begins by loading an Excel file (HR_Dataset.xlsx) from a specific path

It reads all sheets from the Excel file using sheet_name=None to get a dictionary of DataFrames

For each sheet, it performs initial exploration by:

Printing sheet information with df.info()

Showing descriptive statistics with df.describe()

Counting duplicated rows with df.duplicated().sum()

Counting null values in each column with df.isnull().sum()

2. Data Cleaning

Handling Missing Values

Different strategies are applied to different sheets:

Employee sheet:

Sick_Days filled with mean value (converted to integer)

Years_At_Company filled with mean value (converted to integer)

Projects sheet:

Projects_Handled filled with median value

Remote work sheet:

Remote_Work_Frequency filled with mode (most frequent value)

Handling Duplicates

For each sheet, duplicated rows are identified and counted

A new dictionary HR_cleaned is created containing each sheet's data with duplicates removed using drop_duplicates()

3. Data Export

The cleaned data is exported to a new Excel file (HR_Dataset(final).xlsx) with two versions of each sheet:

Nulls_Handled_[sheetname]: Contains data with missing values handled

Cleaned_[sheetname]: Contains data with both missing values handled and duplicates removed

4. Data Transformation

Several transformations are performed on the dataset:

Age and Hiring Information

Current year is set as 2023 for calculations

Age column is converted to numeric, handling potential errors

Rows with missing Age values are dropped

Hire_Year is generated based on age groups:

Age > 45: Hired between 2005-2010

Age 40-45: Hired between 2010-2015

Age 30-39: Hired between 2015-2020

Age 20-29: Hired between 2019-2023

Date Generation

Hire_Date is created by combining:

Generated Hire_Year

Random month (1-12)

Random day (1-28)

Format is set as MM-DD-YYYY

Years at Company Calculation

Years_At_Company is calculated as current year (2023) minus Hire_Year

Gender Standardization

Gender values are standardized to 'M' and 'F' (from 'Male' and 'Female')

Final Output

The script prints the transformed data showing the Age, Hire_Date, and Years_At_Company columns to verify the transformations.

ملفات مرفقة

- RAR
- phase1final.rar
- (27.89KB)

بطاقة العمل

اسم المستقل

Sara A.

عدد الإعجابات

عدد المشاهدات

تاريخ الإضافة

27/04/2025

تاريخ الإنجاز

12/04/2025

المهارات

HR Dataset Cleaning and Transformation Project

تفاصيل العمل

ملفات مرفقة

بطاقة العمل

روابط

تابع مستقل على

وسائل الدفع المتاحة