RSS Crawler implementation in Python and MySQL.

- Install Python 2.7.x
- Install pip, or download it from https://bootstrap.pypa.io/get-pip.py and run:

  ```
  cd pip; python get-pip.py
  ```

- Install the dependencies using pip:

  ```
  pip install beautifulsoup4 requests feedparser
  ```

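  These three libraries cover the crawler's moving parts: fetching pages, finding feed links in the HTML, and parsing the feeds themselves. As a rough illustration of how they fit together (a minimal sketch, not the project's actual code; the URL is a placeholder):

  ```python
  import feedparser
  import requests
  from bs4 import BeautifulSoup

  # Fetch a page and look for an advertised RSS feed in its markup.
  page = requests.get("http://example.com/", timeout=10)
  soup = BeautifulSoup(page.text, "html.parser")
  link = soup.find("link", rel="alternate", type="application/rss+xml")

  if link is not None:
      # Parse the discovered feed and list its entry titles.
      feed = feedparser.parse(link["href"])
      for entry in feed.entries:
          print(entry.title)
  ```
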
- MySQL database configuration

  `crawler_config.py` contains the MySQL database configurations for the root user (`ROOT_DB_CONFIG`) and for the crawler database (`CRAWLER_DB_CONFIG`). Update this configuration with your own values.

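  Both entries are plain connection settings. A minimal sketch of what they might look like (keys and values below are illustrative, not the shipped defaults):

  ```python
  # Illustrative values only; see crawler_config.py for the real defaults.
  ROOT_DB_CONFIG = {
      'user': 'root',
      'passwd': 'your-root-password',  # or pass it at runtime via -p / --pwd
      'host': '127.0.0.1',
      'port': 3306,                    # must match the port in my.cnf
  }

  CRAWLER_DB_CONFIG = {
      'user': 'crawler',
      'passwd': 'crawler-password',
      'db': 'crawler_database',        # overridden by -d / --db
  }
  ```
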
- Start MySQL:

  ```
  mysqld
  ```

  Remember to check the port in `my.cnf` and update `ROOT_DB_CONFIG` to match.

- Configure a proxy (if any)

  If you're running this app behind a proxy, remember to configure the `HTTP_PROXY` and `HTTPS_PROXY` environment variables:

  ```
  export HTTP_PROXY="http://user:pass@10.10.1.10:3128/"
  export HTTPS_PROXY="http://user:pass@10.10.1.10:3128/"
  ```

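  The requests library (installed above) picks these variables up from the environment automatically, so no code change is needed. You can check what Python sees with:

  ```python
  import urllib

  # Python 2.7: prints the proxy settings read from the environment,
  # e.g. {'http': 'http://user:pass@10.10.1.10:3128/', ...}
  print(urllib.getproxies())
  ```
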
- Run the crawler:

  ```
  python rss_crawler.py --db crawler_database
  ```

The `rss_crawler.py` script accepts the following input parameters:

Name | Description |
---|---|
`-s, --start` | the start URL (overrides the defaults in `GLOBAL_CONFIG.start_urls`) |
`-d, --db` | the crawler database name (overrides the database name in `CRAWLER_DB_CONFIG`) |
`-r, --remove` | drops and re-creates the crawler database (by default, if the database already exists, the crawled feeds are added to it) |
`-c, --console` | prints logs to the console (by default `LOG_SETTINGS` logs to a file) |
`-p, --pwd` | the MySQL root password (`none` for no password) |
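
For example, to start from a custom URL, drop any existing database, and log to the console (the URL is a placeholder):

```
python rss_crawler.py -s http://example.com/ -r -c -p none
```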

The `rss_crawler.py` script also supports several configuration options that can be changed in the `crawler_config.py` file:

Name | Description |
---|---|
`GLOBAL_CONFIG` | `'start_urls'`, `'drop_existing_database'`, `'log_to_file'` |
`LOG_SETTINGS` | `'formatters'`, `'handlers'`, `'loggers'` |
`MAX_PAGES_PER_DOMAIN` | max number of pages to check in the current domain |
`EXCLUDES` | domain names to exclude from crawling |
`RSS_EXCLUDES` | keywords that determine whether an RSS feed will be excluded |
`BAD_SUFIXES` | deprecated |
`MAX_CONTENT_LENGTH` | max size in bytes for a page to be fetched |
`ROOT_DB_CONFIG` | MySQL root user and password |
`CRAWLER_DB_CONFIG` | crawler database: user configuration and database name |
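
Put together, the tunable part of `crawler_config.py` might look roughly like this (illustrative values only, not the shipped defaults):

```python
# Illustrative values only; see crawler_config.py for the real defaults.
GLOBAL_CONFIG = {
    'start_urls': ['http://example.com/'],  # overridden by -s / --start
    'drop_existing_database': False,        # overridden by -r / --remove
    'log_to_file': True,                    # overridden by -c / --console
}

MAX_PAGES_PER_DOMAIN = 100        # stop crawling a domain after this many pages
MAX_CONTENT_LENGTH = 1024 * 1024  # skip pages larger than 1 MB

EXCLUDES = ['facebook.com', 'twitter.com']  # domains never crawled
RSS_EXCLUDES = ['comments', 'login']        # keyword matches exclude a feed
```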

Have a look at `crawler_config.py` for the default values!

Feel free to contribute to this repo.
- Repo owner or admin (flado)

Licensed under the permissive MIT license.