# Olist E-Commerce Data Engineering Pipeline with Apache Spark + Hadoop + GCP

This project showcases a full data engineering pipeline built on the Olist e-commerce dataset using **Apache Spark** on **Google Cloud Platform (GCP)** with **Hadoop integration**. The pipeline covers every stage from data ingestion to storage optimization and is designed to be modular, scalable, and cloud-ready.

---

## Project Modules

### 1️⃣ Data Ingestion & Exploration

- Loaded raw Olist CSVs from Google Cloud Storage (GCS)

- Explored datasets using PySpark

- Validated schemas and basic statistics
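
The ingestion and exploration steps above could look roughly like the sketch below. It is not the project's actual code: the bucket name, the prefix, and the helper names (`gcs_path`, `load_raw_tables`, `explore`) are hypothetical, and the file names are assumed to match the public Olist dataset. The PySpark import is deferred to `main()` so the path helper can be used without a Spark installation.

```python
# Hypothetical bucket and prefix; replace with your own GCS location.
BUCKET = "your-bucket"
RAW_PREFIX = "olist/raw"

# File names assumed to follow the public Olist dataset.
TABLES = {
    "customers": "olist_customers_dataset.csv",
    "orders": "olist_orders_dataset.csv",
    "order_items": "olist_order_items_dataset.csv",
    "products": "olist_products_dataset.csv",
    "reviews": "olist_order_reviews_dataset.csv",
}


def gcs_path(table, bucket=BUCKET, prefix=RAW_PREFIX):
    """Build the gs:// URI for one raw table."""
    return f"gs://{bucket}/{prefix}/{TABLES[table]}"


def load_raw_tables(spark):
    """Read every raw CSV into a Spark DataFrame, keyed by table name."""
    return {
        name: spark.read.csv(gcs_path(name), header=True, inferSchema=True)
        for name in TABLES
    }


def explore(df, name):
    """Print schema, row count, and summary statistics for one table."""
    print(f"== {name} ==")
    df.printSchema()
    print("rows:", df.count())
    df.describe().show(truncate=False)


def main():
    # Imported here because it requires a Spark environment (e.g. Dataproc).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("olist-ingestion").getOrCreate()
    for name, df in load_raw_tables(spark).items():
        explore(df, name)
```

On a Dataproc cluster the GCS connector is normally available, so calling `main()` from a submitted job should be enough to read the `gs://` paths; elsewhere the connector would need to be added to the Spark classpath.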

### 2️⃣ Data Cleaning

- Removed nulls and fixed schema mismatches

- Standardized data types and column names, and ran integrity checks

- No Pandas used; the cleaning is entirely Spark-native
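
A minimal sketch of what such Spark-native cleaning might look like for the orders table follows. The function names are hypothetical and the column names are assumed to match the public Olist schema; the rules the project actually applied are not documented here.

```python
import re


def to_snake_case(name):
    """Normalize a column name: trim, replace non-alphanumerics with '_', lower-case."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return cleaned.strip("_").lower()


def standardize_columns(df):
    """Rename every column of a Spark DataFrame to snake_case."""
    return df.toDF(*[to_snake_case(c) for c in df.columns])


def clean_orders(orders):
    """Drop rows missing key fields, cast timestamps, and de-duplicate orders."""
    # Imported lazily because it requires a Spark environment.
    from pyspark.sql import functions as F

    # Timestamp columns assumed from the public Olist orders table.
    ts_cols = [
        "order_purchase_timestamp",
        "order_delivered_customer_date",
        "order_estimated_delivery_date",
    ]
    df = standardize_columns(orders).dropna(subset=["order_id", "customer_id"])
    for c in ts_cols:
        df = df.withColumn(c, F.to_timestamp(F.col(c)))
    # Integrity check: one row per order_id.
    return df.dropDuplicates(["order_id"])
```

Dropping nulls only on the key columns is a deliberate choice in this sketch: undelivered orders legitimately have a null delivery date, so a blanket `dropna()` would discard them.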

### 3️⃣ Data Integration & Aggregation

- Joined relational datasets (customers, orders, products, reviews)

- Generated KPIs and metrics:

  - Revenue per customer

  - Average delivery delay per seller

  - Repeat purchase frequency
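
The three KPIs could be computed along the lines below. The metric definitions are assumptions made for illustration (revenue as item price plus freight, delay as days past the estimated delivery date, a repeat customer as one with more than one order), and the column names are assumed from the public Olist schema.

```python
def revenue_per_customer(orders, order_items, customers):
    """Total item price plus freight per unique customer (assumed definition)."""
    from pyspark.sql import functions as F  # requires a Spark environment

    return (
        order_items.join(orders, "order_id")
        .join(customers, "customer_id")
        .groupBy("customer_unique_id")
        .agg(F.round(F.sum(F.col("price") + F.col("freight_value")), 2).alias("revenue"))
    )


def avg_delivery_delay_per_seller(orders, order_items):
    """Mean days between estimated and actual delivery per seller; positive means late."""
    from pyspark.sql import functions as F

    delivered = orders.where(F.col("order_delivered_customer_date").isNotNull())
    # One row per (order, seller) so multi-item orders are not over-weighted.
    sellers = order_items.select("order_id", "seller_id").distinct()
    delay = F.datediff("order_delivered_customer_date", "order_estimated_delivery_date")
    return (
        delivered.join(sellers, "order_id")
        .withColumn("delay_days", delay)
        .groupBy("seller_id")
        .agg(F.avg("delay_days").alias("avg_delay_days"))
    )


def repeat_purchase_frequency(orders, customers):
    """Order count per unique customer, with a flag for repeat buyers."""
    from pyspark.sql import functions as F

    return (
        orders.join(customers, "customer_id")
        .groupBy("customer_unique_id")
        .agg(F.countDistinct("order_id").alias("order_count"))
        .withColumn("is_repeat", F.col("order_count") > 1)
    )
```

Grouping by `customer_unique_id` rather than `customer_id` matters in this sketch because, in the public dataset, `customer_id` is issued per order, so it cannot identify returning customers on its own.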

### 4️⃣ Data Optimization & Storage Layer

- Repartitioned datasets for parallelism

- Cached intermediate results in Spark memory

- Wrote final outputs as partitioned Parquet files to GCS

- Applied Hadoop-style configuration and Spark tuning
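
As a rough illustration of this layer, the sketch below builds a tuned session, repartitions and caches a DataFrame, and writes it as partitioned Parquet. Every configuration value, the function names, and the choice of partition column are assumptions; the settings used in the original project are not documented.

```python
# Illustrative settings only; tune for the actual cluster and data volume.
SPARK_CONF = {
    "spark.sql.shuffle.partitions": "64",
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.parquet.compression.codec": "snappy",
    # Keys prefixed with "spark.hadoop." are forwarded to the Hadoop configuration.
    "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": "2",
}


def build_session(app_name="olist-storage", conf=SPARK_CONF):
    """Create a SparkSession with the settings above applied."""
    from pyspark.sql import SparkSession  # requires a Spark environment

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def prepare(df, partition_col, num_partitions=64):
    """Repartition by the given column and cache the result for reuse."""
    return df.repartition(num_partitions, partition_col).cache()


def write_parquet(df, output_path, partition_col):
    """Write the DataFrame as Parquet, one directory per partition value."""
    df.write.mode("overwrite").partitionBy(partition_col).parquet(output_path)
```

A possible call sequence would be `prepared = prepare(orders, "order_status")` followed by `write_parquet(prepared, "gs://your-bucket/olist/curated/orders", "order_status")`, where both the path and the partition column are placeholders. Caching pays off only if `prepared` is reused by several actions; for a single write it can be skipped.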

---

## Tech Stack

- Apache Spark (PySpark)

- Hadoop (CLI, FS commands)

- Google Cloud Platform (GCS, Dataproc)

- Parquet format

- Jupyter / Databricks notebooks (optional, for development)

---

## Folder Structure
