# Olist E-Commerce Data Engineering Pipeline with Apache Spark + Hadoop + GCP
This project showcases a full data engineering pipeline built on the Olist e-commerce dataset using **Apache Spark** on **Google Cloud Platform (GCP)** with **Hadoop integration**. The pipeline covers every stage from data ingestion to storage optimization and is designed to be modular, scalable, and cloud-ready.
---
## Project Modules
### 1️⃣ Data Ingestion & Exploration
- Loaded raw Olist CSVs from Google Cloud Storage (GCS)
- Explored datasets using PySpark
- Validated schemas and basic statistics
### 2️⃣ Data Cleaning
- Removed nulls, fixed schema mismatches
- Standardized types and column naming, and added integrity checks
- No Pandas used — entirely Spark-native
### 3️⃣ Data Integration & Aggregation
- Joined relational datasets (customers, orders, products, reviews)
- Generated KPIs and metrics:
  - Revenue per customer
  - Average delivery delay per seller
  - Repeat purchase frequency
### 4️⃣ Data Optimization & Storage Layer
- Repartitioned datasets for parallelism
- Cached intermediate results in Spark memory
- Wrote final outputs as partitioned Parquet files to GCS
- Applied Hadoop-style configuration and Spark tuning
---
## Tech Stack
- Apache Spark (PySpark)
- Hadoop (CLI, FS commands)
- Google Cloud Platform (GCS, Dataproc)
- Parquet Format
- Jupyter / Databricks Notebooks (optional for dev)
---
## Folder Structure