crawler

This project is a very simple web crawler that fetches URLs and logs the crawl results to the console as the crawl proceeds.

Installation

$ pip install -r requirements.txt

Usage

python crawler.py https://rescale.com

python crawler.py https://rescale.com 100

Arguments

url: The URL to begin crawling from. It is validated to ensure it uses the http or https scheme.
max_crawl: The maximum number of websites to visit (e.g. 100).
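
The scheme check described above could look roughly like the following. This is a minimal sketch using only the standard library; the function name `is_valid_url` is hypothetical and is not taken from crawler.py, which may validate differently.

```python
from urllib.parse import urlparse


def is_valid_url(url: str) -> bool:
    """Return True if the URL has an http/https scheme and a host part.

    Hypothetical helper for illustration; not the function used in crawler.py.
    """
    parsed = urlparse(url)
    # Require an explicit http or https scheme and a non-empty host (netloc).
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


if __name__ == "__main__":
    for candidate in ("https://rescale.com", "ftp://rescale.com", "rescale.com"):
        print(candidate, is_valid_url(candidate))
```

Under this sketch, a bare host such as `rescale.com` is rejected because it has no scheme, so the starting URL would need to be passed with `http://` or `https://` in front.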

Output

Visits and links are logged to the console. A CSV file called data.csv is saved with the following columns:

1. "url": The URL scraped

2. "html": The HTML Response

3. "created_at": The time the URL was scraped, in [ISO format](https://www.iso.org/iso-8601-date-and-time-format.html)

4. "links": The scraped links

5. "success": True if no errors were raised, False if an error was raised
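
As an illustration of the column layout above, a row could be built and written as in the sketch below. The helper names `make_row` and `write_rows` are hypothetical, and storing the links as a stringified Python list and the timestamp in UTC are assumptions; crawler.py may serialize these fields differently.

```python
import csv
from datetime import datetime, timezone

# Column names as listed in this README.
FIELDNAMES = ["url", "html", "created_at", "links", "success"]


def make_row(url, html, links, success):
    """Build one record; created_at is an ISO 8601 timestamp (UTC assumed)."""
    return {
        "url": url,
        "html": html,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "links": links,  # the csv module writes this as str(links)
        "success": success,
    }


def write_rows(rows, fileobj):
    """Write a header line followed by one CSV line per record."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    row = make_row("https://rescale.com", "<html></html>", ["https://rescale.com/about"], True)
    # newline="" is the setting the csv docs recommend for files passed to a writer.
    with open("data.csv", "w", newline="", encoding="utf-8") as f:
        write_rows([row], f)
```

Because the "html" column can contain commas, quotes, and newlines, reading data.csv back with `csv.DictReader` rather than splitting lines by hand keeps each record intact.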
