A multi-stage data extraction and transformation pipeline that combines JavaScript and Python to scrape and structure global country profiles.
Rather than a single scraping script, this project separates extraction from transformation: it renders dynamic content in a browser, stores the raw results as structured JSON, and post-processes them in Python.
Key Features & Architecture:
- Dynamic Extraction: Used Playwright with JavaScript to inject scripts into the rendered page and extract raw country profiles (Country Name, Capital, Population, and Total Area).
- Intermediate Storage: Saved the raw extracted data into structured JSON files, so the scraped results are kept on disk before any transformation runs.
- Python Transformation: Built a Python post-processing script using Pandas to read the JSON data, clean it, and reorganize it into a production-ready Excel (.xlsx) file.
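
The Python transformation step above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the JSON keys (`name`, `capital`, `population`, `area`), the function names, and the specific cleaning rules are all assumptions, and the real data may need different handling.

```python
import json
from pathlib import Path

import pandas as pd

# Hypothetical JSON keys mapped to output column names; the real keys may differ.
COLUMNS = {
    "name": "Country",
    "capital": "Capital",
    "population": "Population",
    "area": "Total Area",
}


def clean_profiles(records):
    """Turn raw scraped records (a list of dicts) into a tidy DataFrame."""
    df = pd.DataFrame(records, columns=list(COLUMNS))

    # Scraped text often carries stray whitespace.
    for col in ("name", "capital"):
        df[col] = df[col].astype("string").str.strip()

    # Numbers may arrive as strings with thousands separators;
    # anything that cannot be parsed becomes NaN.
    for col in ("population", "area"):
        as_text = df[col].astype(str).str.replace(",", "", regex=False)
        df[col] = pd.to_numeric(as_text, errors="coerce")

    # Drop rows without a country name, then keep the first row per country.
    df = df.dropna(subset=["name"])
    df = df[df["name"] != ""].drop_duplicates(subset="name")

    return df.rename(columns=COLUMNS).sort_values("Country").reset_index(drop=True)


def json_to_excel(json_path, xlsx_path):
    """Read the intermediate JSON file and write the cleaned table to .xlsx."""
    records = json.loads(Path(json_path).read_text(encoding="utf-8"))
    # Writing .xlsx requires an Excel engine such as openpyxl to be installed.
    clean_profiles(records).to_excel(xlsx_path, index=False)
```

With placeholder file names, a call might look like `json_to_excel("countries.json", "countries.xlsx")`. Keeping the cleaning logic in its own function makes it possible to check it against a few sample records without touching the file system.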
Technologies Used:
- Playwright (JavaScript)
- Python
- Pandas
- JSON & Excel (XLSX)