Advanced Web Scraping API

A powerful, professional-grade web scraping platform built with FastAPI, featuring real-time streaming capabilities, intelligent content extraction, and comprehensive security measures.

Python FastAPI License

Key Features

Multiple Scraping Modes

Single Page Extraction - Precise content extraction from individual pages

Limited Batch Scraping - Controlled scraping with customizable page limits

Real-time Streaming - Live results with immediate page-by-page updates

Unlimited Website Crawling - Comprehensive site-wide content extraction

Enterprise-Grade Security

Advanced duplicate URL detection and normalization

Domain-restricted crawling for security compliance

Intelligent rate limiting and timeout controls

User-agent rotation to avoid bot detection (sketched after this list)

Request validation and sanitization
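
As a rough illustration of the rotation and throttling points above, here is a minimal sketch; the `polite_get` helper, the agent list, and the delay value are illustrative assumptions, not the project's actual code:

```python
# Hypothetical helper combining user-agent rotation with simple throttling;
# the agent list, delay, and function name are illustrative assumptions.
import random
import time

import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]
MIN_DELAY_SECONDS = 1.0  # crude rate limit between consecutive requests

def polite_get(session: requests.Session, url: str, timeout: int = 10) -> requests.Response:
    """Fetch a URL with a rotated User-Agent header after a fixed delay."""
    session.headers["User-Agent"] = random.choice(USER_AGENTS)
    time.sleep(MIN_DELAY_SECONDS)
    return session.get(url, timeout=timeout)
```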

Real-Time Capabilities

Server-Sent Events (SSE) streaming architecture (see the sketch after this list)

Live progress tracking and status updates

Non-blocking asynchronous processing

Interactive web interface with Arabic/RTL support
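
A minimal sketch of how an SSE endpoint of this kind can be built on FastAPI's `StreamingResponse`; the demo path, payload shape, and page count are assumptions for illustration:

```python
# Minimal SSE sketch built on FastAPI's StreamingResponse; the demo path,
# payload shape, and page count here are illustrative assumptions.
import asyncio
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def page_events():
    """Yield one SSE-formatted event per scraped page."""
    for i in range(1, 4):
        payload = {"type": "page", "progress": {"current": i, "total": 3}}
        yield f"data: {json.dumps(payload)}\n\n"
        await asyncio.sleep(0)  # let the event loop flush the chunk

@app.post("/scrape-stream-demo")
async def scrape_stream_demo():
    return StreamingResponse(page_events(), media_type="text/event-stream")
```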

Advanced Content Processing

Enhanced HTML parsing with trafilatura integration (sketched after this list)

WordPress and modern CMS optimized extraction

Elementor and dynamic content support

Smart content filtering and cleanup
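
A sketch of how trafilatura and BeautifulSoup4 might be combined for this kind of extraction (the `extract_content` helper is hypothetical):

```python
# Sketch of the extraction pipeline: trafilatura for the main article text,
# BeautifulSoup as a fallback and for the title. The helper is hypothetical.
import trafilatura
from bs4 import BeautifulSoup

def extract_content(html: str) -> dict:
    """Return the title and main text extracted from raw HTML."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    # trafilatura strips navigation, ads, and boilerplate around the article
    text = trafilatura.extract(html) or soup.get_text(separator=" ", strip=True)
    return {"title": title, "content": text}
```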

Technical Architecture

Backend Stack

Framework: FastAPI with automatic OpenAPI documentation

Language: Python 3.11+ with modern async/await patterns

HTTP Client: Requests library with persistent session management

Content Extraction: BeautifulSoup4 + trafilatura for superior parsing

Data Validation: Pydantic models with automatic serialization

Server: Uvicorn ASGI with hot reload capabilities

Frontend Features

Responsive interface with Arabic/RTL support

Real-time API health monitoring

Advanced form validation with live feedback

Results management with download/export functionality

3D styling and modern animations

Core Design Principles

Scalable Architecture: Object-oriented design with separation of concerns

Performance Optimized: HTTP session pooling and connection reuse

Security First: Comprehensive input validation and domain restrictions

User Experience: Real-time feedback and intuitive interface design

Quick Start

Prerequisites

Python 3.11 or higher

pip package manager

Installation & Setup

Clone the repository

```bash
git clone <repository-url>
cd web-scraping-api
```

Install dependencies

```bash
pip install -r requirements.txt
```

Start the server

```bash
python -m uvicorn main:app --host 0.0.0.0 --port 5000 --reload
```

Access the application

Web Interface: http://localhost:5000

API Documentation: http://localhost:5000/docs

Alternative Docs: http://localhost:5000/redoc

API Endpoints

Core Scraping Endpoints

POST /scrape-pages

Limited scraping with page count control

```json
{
  "url": "https://example.com",
  "max_pages": 50,
  "timeout": 10
}
```
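
For example, the endpoint can be called from Python (assuming the server from the Quick Start is running locally):

```python
# Calling the endpoint from Python with the Requests library.
import requests

resp = requests.post(
    "http://localhost:5000/scrape-pages",
    json={"url": "https://example.com", "max_pages": 50, "timeout": 10},
)
resp.raise_for_status()
print(resp.json())
```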

POST /scrape-single

Extract content from a single page

```json
{
  "url": "https://example.com/article",
  "timeout": 10
}
```

Real-Time Streaming Endpoints

POST /scrape-stream

Real-time streaming with page limits

```json
{
  "url": "https://example.com",
  "max_pages": 100,
  "timeout": 10
}
```

POST /scrape-stream-unlimited

Unlimited website crawling with live updates

```json
{
  "url": "https://example.com",
  "timeout": 10
}
```

Health & Monitoring

GET /health

API health check and status monitoring
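
A quick check from Python (assuming a local server on port 5000):

```python
# Poll the health endpoint; useful for readiness checks and monitoring.
import requests

status = requests.get("http://localhost:5000/health", timeout=5)
print(status.status_code, status.json())
```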

Web Interface Usage

Navigate to the web interface at http://localhost:5000

Enter target URL in the input field

Select scraping mode:

Single page only

Limited pages (specify count)

Real-time streaming

Unlimited website scraping

Configure settings (max pages, timeout)

Start scraping and monitor real-time progress

Download or export results in JSON format

Configuration Options

Scraping Parameters

URL: Target website URL (required)

Max Pages: Page limit for controlled scraping (1-500)

Timeout: Request timeout per page (5-60 seconds)
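
These constraints map naturally onto the Pydantic models mentioned in the backend stack; a minimal sketch, assuming the documented ranges are enforced at the model level (the model name and defaults are illustrative):

```python
# Hypothetical request model enforcing the documented parameter ranges.
from pydantic import BaseModel, Field, HttpUrl

class ScrapeRequest(BaseModel):
    url: HttpUrl                              # required, must be a valid URL
    max_pages: int = Field(50, ge=1, le=500)  # page limit: 1-500
    timeout: int = Field(10, ge=5, le=60)     # per-page timeout: 5-60 seconds
```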

Security Controls

Domain Restriction: Automatic same-domain enforcement

Rate Limiting: Built-in request throttling

Duplicate Prevention: Advanced URL normalization

Content Filtering: Smart content extraction

Advanced Features

URL Normalization & Deduplication

Removes tracking parameters (UTM, fbclid, etc.)

Handles URL fragments and anchors

Case-insensitive domain matching

Path normalization and canonicalization
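
A minimal sketch of such normalization with the standard library; the exact tracking-parameter blocklist is an assumption:

```python
# Sketch of URL canonicalization with urllib.parse; the tracking-parameter
# blocklist below is an assumption, not the project's exact list.
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PARAMS = {
    "utm_source", "utm_medium", "utm_campaign",
    "utm_term", "utm_content", "fbclid", "gclid",
}

def normalize_url(url: str) -> str:
    """Lowercase the host, drop fragments and tracking params, trim trailing slashes."""
    parts = urlsplit(url)
    query = [
        (k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in TRACKING_PARAMS
    ]
    path = parts.path.rstrip("/") or "/"
    # The empty last element drops the #fragment entirely.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, urlencode(query), ""))
```

Two URLs that differ only in tracking parameters, fragments, or host casing then normalize to the same key, which is what makes duplicate detection reliable.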

Content Extraction Intelligence

WordPress-optimized parsing

Modern CMS compatibility

Dynamic content handling

Article structure detection

Performance Optimizations

HTTP connection pooling

Session management and reuse

Async/await processing

Memory-efficient streaming
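
A sketch of session pooling with the Requests library; the pool sizes are illustrative:

```python
# Connection pooling sketch with the Requests library; pool sizes are illustrative.
import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
adapter = HTTPAdapter(pool_connections=10, pool_maxsize=10)
session.mount("https://", adapter)  # reuse TCP connections per host
session.mount("http://", adapter)

# Every call through this session shares the pool:
# session.get("https://example.com", timeout=10)
```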

Response Format

Standard Response

```json
{
  "url": "https://example.com/page",
  "title": "Page Title",
  "content": "Extracted content...",
  "links": ["https://example.com/link1"],
  "timestamp": "2025-08-13T10:30:00Z",
  "word_count": 1250
}
```

Streaming Response Format

```json
{
  "type": "page",
  "data": {
    "url": "https://example.com/page",
    "title": "Page Title",
    "content": "Content...",
    "links": ["..."],
    "timestamp": "2025-08-13T10:30:00Z"
  },
  "progress": {
    "current": 15,
    "total": 100,
    "percentage": 15
  }
}
```
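
A sketch of consuming this stream from Python, assuming events arrive as SSE `data:` lines carrying JSON objects of the shape shown above:

```python
# Read the stream line by line and parse each SSE "data:" event as JSON.
import json
import requests

with requests.post(
    "http://localhost:5000/scrape-stream",
    json={"url": "https://example.com", "max_pages": 100, "timeout": 10},
    stream=True,
) as resp:
    for line in resp.iter_lines(decode_unicode=True):
        if line and line.startswith("data: "):
            event = json.loads(line[len("data: "):])
            if event.get("type") == "page":
                print(event["progress"]["current"], event["data"]["url"])
```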

Security & Compliance

Built-in Protections

Domain validation and restriction

Request rate limiting

Input sanitization

URL validation

Timeout enforcement

Memory usage controls

Best Practices

Always respect robots.txt (see the sketch after this list)

Implement appropriate delays between requests

Monitor server resources during large operations

Use reasonable page limits for batch operations
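
For robots.txt specifically, the standard library covers the check; the user-agent string below is illustrative:

```python
# Check robots.txt before crawling, using the standard library.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

if rp.can_fetch("MyScraperBot", "https://example.com/some-page"):
    print("allowed to fetch")
else:
    print("disallowed by robots.txt")
```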

Error Handling

The API provides comprehensive error responses:

400 Bad Request: Invalid URL or parameters

404 Not Found: Page or resource not accessible

429 Too Many Requests: Rate limit exceeded

500 Internal Server Error: Server processing error
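
On the client side, these codes can be handled explicitly; a sketch assuming the local server from the Quick Start:

```python
# Handle the documented error codes explicitly instead of raising on all of them.
import requests

resp = requests.post(
    "http://localhost:5000/scrape-single",
    json={"url": "not-a-url", "timeout": 10},
)
if resp.status_code == 400:
    print("Bad request:", resp.json())   # invalid URL or parameters
elif resp.status_code == 429:
    print("Rate limited; retry after a delay")
else:
    resp.raise_for_status()              # surfaces 404/500 as exceptions
    print(resp.json())
```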

Performance Metrics

Typical Performance

Single Page: 1-3 seconds per page

Batch Scraping: 50-100 pages per minute

Memory Usage: ~50MB for standard operations

Concurrent Requests: Up to 10 simultaneous streams

Contributing

We welcome contributions! Please follow these steps:

Fork the repository

Create a feature branch (git checkout -b feature/amazing-feature)

Commit your changes (git commit -m 'Add amazing feature')

Push to the branch (git push origin feature/amazing-feature)

Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Credits

Developed by: Eng. Amr Hossam

Architecture: FastAPI + Python 3.11+

Frontend: Modern JavaScript with Arabic RTL support

Links

Documentation: http://localhost:5000/docs

Alternative Docs: http://localhost:5000/redoc

Health Check: http://localhost:5000/health
