A flexible, Python-based web scraping utility to extract data from a curated list of URLs. This project logs the success and failure of requests, handles exceptions gracefully, and outputs results to a JSON file. Designed for beginners and experienced developers alike.
- Scrapes a list of URLs from a file (`urls.txt`)
- Automatically logs:
  - ✅ Successful scrapes
  - ❌ Failed requests (with error reasons)
- Saves the successfully scraped data into `scraped_data.json`
- Provides a cleaned list of valid URLs via `urls_clean.txt`
- Modular and easily extendable (a rough sketch of the workflow is shown below)
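
The sketch below illustrates the kind of loop these features describe. It is an illustration only, not the exact code in `web-scraper.py`; the function name and file handling here are assumptions.

```python
import json
import logging

import requests

logging.basicConfig(level=logging.INFO)


def scrape_all(input_file="urls.txt"):
    """Illustrative loop: fetch each URL, log the outcome, collect the results."""
    with open(input_file) as f:
        urls = [line.strip() for line in f if line.strip()]

    results, working_urls = [], []
    for url in urls:
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()  # raise on 4xx/5xx responses
            results.append({"url": url, "content": response.text})
            working_urls.append(url)
            logging.info("Scraped %s", url)
        except requests.RequestException as exc:
            logging.error("Failed to fetch %s: %s", url, exc)

    # Persist the successful results and the list of URLs that worked
    with open("scraped_data.json", "w") as f:
        json.dump(results, f, indent=2)
    with open("urls_clean.txt", "w") as f:
        f.write("\n".join(working_urls))
```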
```
Web-Scraper/
├── web-scraper.py       # Main scraping logic
├── urls.txt             # Input URLs to scrape
├── urls_clean.txt       # Output of working URLs (auto-generated)
├── scraped_data.json    # Final scraped content (auto-generated)
├── requirements.txt     # List of dependencies
└── README.md            # Project documentation
```
- Python 3.7+
- `requests`
- `beautifulsoup4`
- `urllib3`
- `logging` (Python standard library)
Install dependencies:
```
pip install -r requirements.txt
```
- Add the URLs you want to scrape into `urls.txt`, one per line (an example file is shown after these steps).
- Run the scraper:

  ```
  python web-scraper.py
  ```

- Check your results:
  - `scraped_data.json`: Scraped HTML or textual content
  - `urls_clean.txt`: Filtered URLs that worked
  - Logs in the console will tell you which URLs failed
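
For reference, `urls.txt` is just a plain text file with one URL per line; the entries below are placeholders:

```
https://example.com
https://www.wikipedia.org
https://news.ycombinator.com
```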
Console Log

```
2025-04-22 23:35:10,039 - INFO - Scraped https://example.com/
2025-04-22 23:35:21,373 - ERROR - Failed to fetch https://www.amazon.com/s?k=laptops: 503 Server Error
```
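
Log lines in that format come from the standard `logging` module. A minimal configuration that produces it (an assumption, not necessarily the exact setup in `web-scraper.py`) is:

```python
import logging

# %(asctime)s defaults to "YYYY-MM-DD HH:MM:SS,mmm", matching the log shown above
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
)
```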
`scraped_data.json`

```
[
  {
    "url": "https://example.com",
    "content": "<!doctype html>..."
  },
  ...
]
```
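
Because the output is plain JSON, it is easy to load back for further processing, for example:

```python
import json

with open("scraped_data.json") as f:
    pages = json.load(f)

for page in pages:
    print(page["url"], len(page["content"]), "characters of content")
```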
Want to scrape specific elements or parse structured data like tables or product listings? Just extend the logic in `web-scraper.py` using BeautifulSoup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, 'html.parser')
title = soup.title.string  # e.g. grab the page title
```
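
For instance, pulling out all links and flattening the first table on a page could look like the following (the element selectors are examples; adapt them to the pages you scrape):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, 'html.parser')

# Collect every hyperlink on the page
links = [a["href"] for a in soup.find_all("a", href=True)]

# Flatten the first table (if present) into a list of rows
rows = []
table = soup.find("table")
if table:
    for tr in table.find_all("tr"):
        rows.append([cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])])
```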
- Some pages (like Amazon) actively block bots and may require headers, user-agent spoofing, or Selenium.
- API URLs that need authentication (e.g. NYT, Coindesk) may return `401 Unauthorized` or `403 Forbidden`.
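
If you need to get past basic bot checks, a common first step is sending browser-like headers with `requests`; the user-agent string below is only an example, and it won't help with sites that require JavaScript rendering or real authentication:

```python
import requests

headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36"
    )
}
response = requests.get("https://example.com", headers=headers, timeout=10)
```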
Built with curiosity, Python, and lots of trial & error 🚀