Web Scraping Task
The goal of this task was to scrape a target webpage, extract several types of data from it, and store the results in structured formats (CSV and JSON).
- Features
Extract headings (h1, h2)
Extract paragraphs (p)
Extract list items (li)
Extract table data (tr, td), grouped by rows
Extract form field information (field names, input types, default values)
Extract the video link
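
As an illustration of the features above, here is a minimal sketch of how each element type could be pulled out with BeautifulSoup. The sample HTML, the helper names (texts, table_rows, video_link), and the assumption that the video is embedded through an iframe, video, or source tag are all hypothetical; the real target page's markup is not shown here and may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page standing in for the real target page.
SAMPLE_HTML = """
<html><body>
  <h1>Main   Title</h1>
  <h2>Sub heading</h2>
  <p>First paragraph.</p>
  <ul><li>Item one</li><li>Item two</li></ul>
  <table>
    <tr><th>Name</th><th>Price</th></tr>
    <tr><td>Book A</td><td>10</td></tr>
    <tr><td>Book B</td><td>12</td></tr>
  </table>
  <iframe src="https://example.com/video"></iframe>
</body></html>
"""


def texts(soup, *tags):
    """Return the whitespace-normalised text of every matching tag, in page order."""
    return [" ".join(el.get_text().split()) for el in soup.find_all(list(tags))]


def table_rows(soup):
    """Return table data grouped by row: one list of cell texts per <tr>."""
    rows = []
    for tr in soup.find_all("tr"):
        cells = [" ".join(td.get_text().split()) for td in tr.find_all("td")]
        if cells:  # rows made only of <th> cells produce no <td> and are skipped
            rows.append(cells)
    return rows


def video_link(soup):
    """Return the src of the first iframe/video/source tag that has one, else None."""
    for el in soup.find_all(["iframe", "video", "source"]):
        if el.get("src"):
            return el["src"]
    return None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
headings = texts(soup, "h1", "h2")
paragraphs = texts(soup, "p")
list_items = texts(soup, "li")
rows = table_rows(soup)
link = video_link(soup)
```

Keeping one small function per element type makes it easier to adjust a single selector when the page structure changes.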
- Tools & Libraries
Python
requests – fetch webpage content
BeautifulSoup – parse HTML and extract elements
csv – store structured data in CSV files
json – store extracted data in JSON format
- Output Files
CSV Files:
Extract_Text_Data.csv – Combined extracted text in a structured table
Extract_Table_Data.csv – Extracted table data only
JSON Files:
Product_Information.json – Book title, price, stock availability, and button text
Form_Information.json – Field name, input type, and default values
Video_Link.json – Video link
- Approach
Sent an HTTP request using requests to fetch the HTML content.
Parsed the HTML using BeautifulSoup to locate target elements.
Extracted and cleaned the data (removing extra spaces and newlines).
Stored the data in multiple CSV and JSON files.
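
The four steps above can be sketched roughly as follows. This is not the original script: the function names, the ("tag", "text") column layout, and the run() wrapper are assumptions made for illustration, and the URL is left as a parameter because the target page is not named here.

```python
import csv
import json

import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Step 1: download the page and return its HTML, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def clean_text(text):
    """Step 3: collapse runs of spaces, tabs, and newlines into single spaces."""
    return " ".join(text.split())


def extract_text_rows(html):
    """Steps 2 and 3: parse the HTML and return cleaned (tag, text) rows."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for el in soup.find_all(["h1", "h2", "p", "li"]):
        text = clean_text(el.get_text())
        if text:  # drop elements that are empty after cleaning
            rows.append((el.name, text))
    return rows


def save_csv(path, header, rows):
    """Step 4: write a header line followed by the data rows."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


def save_json(path, data):
    """Step 4: write any JSON-serialisable object with readable indentation."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)


def run(url):
    """Hypothetical entry point tying the steps together for the text output."""
    rows = extract_text_rows(fetch_html(url))
    save_csv("Extract_Text_Data.csv", ["tag", "text"], rows)
```

Calling run() with the page URL would produce Extract_Text_Data.csv; the JSON outputs could be written the same way by passing a dict or list to save_json.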
- Challenges Faced
Handling different HTML structures for headings, lists, tables, and forms.
Extracting default values from form fields that may not always have a value.
Grouping table cells by row so that each table row is saved as one record in the CSV/JSON output.
Cleaning and formatting text for consistency.
Managing multiple output files without overwriting data.
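
For the form-field challenge in particular, one possible way to handle missing defaults is to fall back to an empty string instead of failing. The sketch below is only an assumption about how this could be done: the sample form, the form_fields helper, and the output keys ("name", "type", "default") are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical form; several fields deliberately lack a value or type attribute.
FORM_HTML = """
<form>
  <input type="text" name="username" value="guest">
  <input type="email" name="email">
  <input name="nickname">
  <textarea name="comment">Hello</textarea>
  <select name="color">
    <option value="red">Red</option>
    <option value="blue" selected>Blue</option>
  </select>
</form>
"""


def form_fields(soup):
    """Return one dict per form control, using "" when no default is present."""
    fields = []
    for el in soup.find_all(["input", "textarea", "select"]):
        if el.name == "input":
            field_type = el.get("type", "text")  # a missing type means "text" in HTML
            default = el.get("value", "")
        elif el.name == "textarea":
            field_type = "textarea"
            default = el.get_text().strip()
        else:
            field_type = "select"
            # Use the option marked as selected, if any; otherwise record no default.
            chosen = next(
                (o for o in el.find_all("option") if o.has_attr("selected")), None
            )
            default = chosen.get("value", chosen.get_text(strip=True)) if chosen else ""
        fields.append(
            {"name": el.get("name", ""), "type": field_type, "default": default}
        )
    return fields


fields = form_fields(BeautifulSoup(FORM_HTML, "html.parser"))
```

A list of dicts like this can be passed directly to json.dump to produce a file such as Form_Information.json.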