
crawler

The goal of this project is to build a very simple web crawler which fetches URLs and outputs crawl results to some sort of log or console as the crawl proceeds.
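A minimal sketch of the crawl loop this description implies, assuming requests and beautifulsoup4 are what requirements.txt provides (an assumption; the actual implementation lives in crawler.py):

```python
# Illustrative crawl-loop sketch; not the project's actual code.
# Assumes requests and beautifulsoup4 come from requirements.txt.
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(start_url: str, max_crawl: int = 100) -> None:
    queue = deque([start_url])
    visited = set()
    while queue and len(visited) < max_crawl:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            response = requests.get(url, timeout=10)
            soup = BeautifulSoup(response.text, "html.parser")
            links = [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]
        except requests.RequestException as exc:
            print(f"FAILED  {url}: {exc}")
            continue
        # Log each visit and its discovered links as the crawl proceeds.
        print(f"VISITED {url} -> {len(links)} links")
        queue.extend(links)
```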

Installation

$ pip install -r requirements.txt

Usage

python crawler.py https://rescale.com

python crawler.py https://rescale.com 100

Arguments

  1. url: The URL to begin crawling from. It is validated to ensure it uses the http or https scheme (see the sketch after this list).
  2. max_crawl: The maximum number of pages to visit (e.g., 100).
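The scheme validation mentioned above could look like the following sketch using urllib.parse (an assumption; the project's actual check may differ):

```python
# URL validation sketch; illustrative only.
from urllib.parse import urlparse

def validate_url(url: str) -> str:
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"URL must use the http/https scheme: {url!r}")
    return url
```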

Output

Visits and links are logged to the console. A CSV file named data.csv is saved with the following columns (a minimal writing sketch follows the list):

1. "url": The URL scraped

2. "html": The HTML Response

3. "created_at": The time it was scrapped in [ISO format](https://www.iso.org/iso-8601-date-and-time-format.html)

4. "links": The scraped links

5. "success": True if no errors were raised, False if an error was raised
