
RSS Crawler

RSS Crawler implementation in Python and MySQL.
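
The dependency list below (requests, beautifulsoup4, feedparser) suggests the crawler's core pipeline: fetch a page, discover the feed links it advertises, and parse each feed. The following is a minimal sketch of that idea in Python 3 syntax (the project itself targets Python 2.7); the function name and details are illustrative, not the script's actual implementation.

    # Illustrative sketch of a fetch -> discover -> parse pipeline.
    # The real logic lives in rss_crawler.py and may differ.
    from urllib.parse import urljoin
    import requests
    import feedparser
    from bs4 import BeautifulSoup

    def discover_feeds(url):
        """Collect RSS/Atom feed URLs advertised in a page's <link> tags."""
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        feeds = []
        for link in soup.find_all("link"):
            mime = (link.get("type") or "").lower()
            if mime in ("application/rss+xml", "application/atom+xml"):
                feeds.append(urljoin(url, link.get("href")))
        return feeds

    for feed_url in discover_feeds("https://example.com"):
        parsed = feedparser.parse(feed_url)
        for entry in parsed.entries:
            print(entry.get("title"), entry.get("link"))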

How do I get set up?

  1. Install Python 2.7.x

  2. Install pip, or download https://bootstrap.pypa.io/get-pip.py and run it: python get-pip.py

  3. Install the dependencies with pip: pip install beautifulsoup4 requests feedparser

  4. Configure the MySQL database

crawler_config.py contains the MySQL database configuration for the root user (ROOT_DB_CONFIG) and for the crawler database (CRAWLER_DB_CONFIG). Update these values to match your setup; a hypothetical sketch of their shape follows.
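
For illustration only, the two entries might be shaped like the dictionaries below; the exact keys and defaults are in crawler_config.py, so treat this as a hypothetical sketch.

    # Hypothetical shape of the database settings; check crawler_config.py
    # for the real key names and default values.
    ROOT_DB_CONFIG = {
        "user": "root",
        "password": "secret",    # or no password (see the -p/--pwd flag below)
        "host": "127.0.0.1",
        "port": 3306,            # must match the port configured in my.cnf
    }

    CRAWLER_DB_CONFIG = {
        "user": "crawler",
        "password": "crawler",
        "host": "127.0.0.1",
        "port": 3306,
        "database": "crawler_database",   # overridable with the -d/--db flag
    }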

  5. Start MySQL: mysqld

Remember to check the port in my.cnf and update ROOT_DB_CONFIG accordingly.

  6. Configure a proxy (if any)

If you're running this app behind a proxy, remember to configure the HTTP_PROXY and HTTPS_PROXY environment variables:

 export HTTP_PROXY="http://user:pass@10.10.1.10:3128/"
 export HTTPS_PROXY="http://user:pass@10.10.1.10:3128/"
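
The requests library picks these variables up from the environment automatically, so no code changes are needed. Equivalently, a proxy mapping can be passed per request; a short sketch (the proxy URLs are placeholders):

    import os
    import requests

    # Option 1: set the variables in-process (equivalent to the exports above).
    os.environ["HTTP_PROXY"] = "http://user:pass@10.10.1.10:3128/"
    os.environ["HTTPS_PROXY"] = "http://user:pass@10.10.1.10:3128/"
    requests.get("https://example.com/feed.xml")  # proxy applied automatically

    # Option 2: pass an explicit proxy mapping for a single request.
    proxies = {
        "http": "http://user:pass@10.10.1.10:3128/",
        "https": "http://user:pass@10.10.1.10:3128/",
    }
    requests.get("https://example.com/feed.xml", proxies=proxies)
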
  7. Run: python rss_crawler.py --db crawler_database

    The rss_crawler.py script accepts the following input parameters:

Name           Description
-s, --start    the start URL (overrides the defaults in GLOBAL_CONFIG.start_urls)
-d, --db       the crawler database name (overrides the database name in CRAWLER_DB_CONFIG)
-r, --remove   drops and re-creates the crawler database (by default, if the database already exists, crawled feeds are added to it)
-c, --console  prints logs to the console (by default, LOG_SETTINGS directs logs to a file)
-p, --pwd      MySQL root password ('none' for no password)
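
As a hedged sketch, the table above maps naturally onto argparse (available in the Python 2.7 standard library); the actual option handling in rss_crawler.py may differ.

    import argparse

    # Illustrative parser mirroring the documented flags; not necessarily
    # the exact parser used by rss_crawler.py.
    parser = argparse.ArgumentParser(description="RSS Crawler")
    parser.add_argument("-s", "--start",
                        help="start URL (overrides GLOBAL_CONFIG.start_urls)")
    parser.add_argument("-d", "--db",
                        help="crawler database name (overrides CRAWLER_DB_CONFIG)")
    parser.add_argument("-r", "--remove", action="store_true",
                        help="drop and re-create the crawler database")
    parser.add_argument("-c", "--console", action="store_true",
                        help="print logs to the console instead of a file")
    parser.add_argument("-p", "--pwd",
                        help="MySQL root password ('none' for no password)")
    args = parser.parse_args()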

Configuration guidelines

The rss_crawler.py script supports several configuration options, which can be changed in the crawler_config.py file:

Name                  Description
GLOBAL_CONFIG         'start_urls', 'drop_existing_database', 'log_to_file'
LOG_SETTINGS          'formatters', 'handlers', 'loggers'
MAX_PAGES_PER_DOMAIN  maximum number of pages to check in the current domain
EXCLUDES              domain names to exclude from crawling
RSS_EXCLUDES          keywords that determine whether an RSS feed is excluded
BAD_SUFIXES           deprecated
MAX_CONTENT_LENGTH    maximum size in bytes for a page to be fetched
ROOT_DB_CONFIG        MySQL root user and password
CRAWLER_DB_CONFIG     crawler database: user configuration and database name

Have a look at crawler_config.py for the default values!
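
To make a few of these settings concrete, here is a hedged sketch of how EXCLUDES, RSS_EXCLUDES and MAX_CONTENT_LENGTH could be applied before fetching a page; the values and the helper function are hypothetical, and the real checks are in rss_crawler.py.

    # Hypothetical pre-fetch checks built from the settings above; values are
    # examples only. Uses Python 3's urllib.parse (urlparse in Python 2.7).
    from urllib.parse import urlparse
    import requests

    EXCLUDES = ["facebook.com", "twitter.com"]   # example excluded domains
    RSS_EXCLUDES = ["podcast", "comments"]       # example disqualifying keywords
    MAX_CONTENT_LENGTH = 1024 * 1024             # example 1 MiB page-size cap

    def should_fetch(url):
        domain = urlparse(url).netloc
        if any(excluded in domain for excluded in EXCLUDES):
            return False
        if any(keyword in url.lower() for keyword in RSS_EXCLUDES):
            return False
        # A HEAD request exposes Content-Length without downloading the body.
        head = requests.head(url, allow_redirects=True, timeout=10)
        return int(head.headers.get("Content-Length", 0)) <= MAX_CONTENT_LENGTH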

Contribution guidelines

Feel free to contribute to this repo.

Who do I talk to?

  • Repo owner or admin (flado)

Licence

Licensed under the permissive MIT License.
