In this project, I built an ETL pipeline for a dataset of women’s clothing e-commerce reviews. The work included extracting the raw data, applying several transformation steps, and preparing it for analytics and machine learning tasks.
Main Steps
Extract: Loaded the raw dataset from CSV/Notebook files into a pandas DataFrame.
Transform: Cleaned the data by handling missing values, fixing data types, encoding categorical features, and preprocessing the review text (normalization, removing punctuation, etc.). I also created new features to support deeper analysis.
Load: Exported the transformed dataset to CSV files or a database for further use.
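The three steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names (review_text, rating, recommended, department), the inline sample rows, and the helper names extract, transform, load, and clean_text are all assumptions made up for the example.

```python
import io
import string

import pandas as pd

# Hypothetical sample rows standing in for the raw CSV file.
RAW_CSV = """review_text,rating,recommended,department
"Love it!! Fits well.",5,1,Dresses
,3,0,Tops
"Too small...",,0,Tops
"""


def extract(source) -> pd.DataFrame:
    """Read the raw reviews from a CSV path or file-like object."""
    return pd.read_csv(source)


def clean_text(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def transform(raw: pd.DataFrame) -> pd.DataFrame:
    # Missing values: drop rows without a rating or recommendation flag.
    df = raw.dropna(subset=["rating", "recommended"]).copy()
    # A missing review is kept as empty text rather than dropped.
    df["review_text"] = df["review_text"].fillna("")
    # Data types: ratings and the recommendation flag become integers.
    df["rating"] = df["rating"].astype(int)
    df["recommended"] = df["recommended"].astype(int)
    # Categorical encoding: one integer code per department (-1 if missing).
    df["department_code"] = df["department"].astype("category").cat.codes
    # Text preprocessing plus one derived feature (word count).
    df["review_clean"] = df["review_text"].map(clean_text)
    df["review_word_count"] = df["review_clean"].str.split().str.len()
    return df.reset_index(drop=True)


def load(df: pd.DataFrame, target) -> None:
    """Write the transformed data to a CSV path or file-like object."""
    df.to_csv(target, index=False)


clean = transform(extract(io.StringIO(RAW_CSV)))
out = io.StringIO()  # stands in for an output file
load(clean, out)
print(clean[["rating", "department_code", "review_clean", "review_word_count"]])
```

Keeping each step in its own function makes it easier to swap the CSV target for a database writer later, or to wrap the functions in scheduled tasks.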
Tools & Technologies
Python (pandas, NumPy, Matplotlib), Jupyter Notebook, Streamlit (for the dashboard), and optionally a workflow orchestrator such as Apache Airflow.
Results
The outcome was a cleaned, structured dataset ready for analysis. I also created a dashboard to explore key insights such as rating distributions and recommendation patterns.
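As a rough illustration of the kind of aggregates behind such a dashboard, the sketch below computes a rating distribution and the share of recommended reviews per rating. The column names and the eight toy rows are invented for the example and are not results from the project.

```python
import pandas as pd

# Toy data for illustration only; not taken from the real dataset.
reviews = pd.DataFrame({
    "rating": [5, 4, 5, 2, 1, 4, 5, 3],
    "recommended": [1, 1, 1, 0, 0, 1, 1, 0],
})

# Rating distribution: number of reviews per star rating, ordered 1 to 5.
rating_counts = reviews["rating"].value_counts().sort_index()

# Recommendation pattern: fraction of reviews flagged as recommended, per rating.
recommend_rate = reviews.groupby("rating")["recommended"].mean()

print(rating_counts)
print(recommend_rate)
```

In a Streamlit app, a Series like rating_counts could be passed to st.bar_chart to draw the distribution.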
Future Work
Next steps could include automating the ETL pipeline, moving the data into a cloud warehouse, and applying machine learning models for sentiment analysis and recommendations.