RSS Crawler implementation in Python and MySQL.

- Install Python 2.7.x
- Install pip, or download it from https://bootstrap.pypa.io/get-pip.py and run:

  ```
  cd pip; python get-pip.py
  ```

- Install the dependencies using pip:

  ```
  pip install beautifulsoup4 requests feedparser
  ```

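  These three libraries cover the crawler's moving parts: fetching pages, finding feed links in the HTML, and parsing the feeds themselves. As a rough illustration of how they fit together (a minimal sketch, not the project's actual code; the URL is a placeholder):

  ```python
  import feedparser
  import requests
  from bs4 import BeautifulSoup

  # Fetch a page and look for an advertised RSS feed in its markup.
  page = requests.get("http://example.com/", timeout=10)
  soup = BeautifulSoup(page.text, "html.parser")
  link = soup.find("link", rel="alternate", type="application/rss+xml")

  if link is not None:
      # Parse the discovered feed and list its entry titles.
      feed = feedparser.parse(link["href"])
      for entry in feed.entries:
          print(entry.title)
  ```
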
- MySQL database configuration

  `crawler_config.py` contains the MySQL database configurations for the root user (`ROOT_DB_CONFIG`) and for the crawler database (`CRAWLER_DB_CONFIG`). Update this configuration with your own values.

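  Both entries are plain connection settings. A minimal sketch of what they might look like (keys and values below are illustrative, not the shipped defaults):

  ```python
  # Illustrative values only; see crawler_config.py for the real defaults.
  ROOT_DB_CONFIG = {
      'user': 'root',
      'passwd': 'your-root-password',  # or pass it at runtime via -p / --pwd
      'host': '127.0.0.1',
      'port': 3306,                    # must match the port in my.cnf
  }

  CRAWLER_DB_CONFIG = {
      'user': 'crawler',
      'passwd': 'crawler-password',
      'db': 'crawler_database',        # overridden by -d / --db
  }
  ```
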
- Start MySQL:

  ```
  mysqld
  ```

  Remember to check the port in `my.cnf` and update `ROOT_DB_CONFIG` to match.

- Configure a proxy (if any)

  If you're running this app behind a proxy, remember to configure the `HTTP_PROXY` and `HTTPS_PROXY` environment variables:

  ```
  export HTTP_PROXY="http://user:pass@10.10.1.10:3128/"
  export HTTPS_PROXY="http://user:pass@10.10.1.10:3128/"
  ```

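  The requests library (installed above) picks these variables up from the environment automatically, so no code change is needed. You can check what Python sees with:

  ```python
  import urllib

  # Python 2.7: prints the proxy settings read from the environment,
  # e.g. {'http': 'http://user:pass@10.10.1.10:3128/', ...}
  print(urllib.getproxies())
  ```
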
- Run the crawler:

  ```
  python rss_crawler.py --db crawler_database
  ```

The `rss_crawler.py` script accepts the following input parameters:

Name | Description |
---|---|
`-s, --start` | the start URL (overrides the defaults in `GLOBAL_CONFIG.start_urls`) |
`-d, --db` | the crawler database name (overrides the database name in `CRAWLER_DB_CONFIG`) |
`-r, --remove` | drops and re-creates the crawler database (by default, if the database already exists, the crawled feeds are added to it) |
`-c, --console` | prints logs to the console (by default `LOG_SETTINGS` logs to a file) |
`-p, --pwd` | the MySQL root password (`none` for no password) |
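
For example, to start from a custom URL, drop any existing database, and log to the console (the URL is a placeholder):

```
python rss_crawler.py -s http://example.com/ -r -c -p none
```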

The `rss_crawler.py` script also supports several configuration options that can be changed in the `crawler_config.py` file:

Name | Description |
---|---|
`GLOBAL_CONFIG` | `'start_urls'`, `'drop_existing_database'`, `'log_to_file'` |
`LOG_SETTINGS` | `'formatters'`, `'handlers'`, `'loggers'` |
`MAX_PAGES_PER_DOMAIN` | max number of pages to check in the current domain |
`EXCLUDES` | domain names to exclude from crawling |
`RSS_EXCLUDES` | keywords that determine whether an RSS feed will be excluded |
`BAD_SUFIXES` | deprecated |
`MAX_CONTENT_LENGTH` | max size in bytes for a page to be fetched |
`ROOT_DB_CONFIG` | MySQL root user and password |
`CRAWLER_DB_CONFIG` | crawler database: user configuration and database name |
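
Put together, the tunable part of `crawler_config.py` might look roughly like this (illustrative values only, not the shipped defaults):

```python
# Illustrative values only; see crawler_config.py for the real defaults.
GLOBAL_CONFIG = {
    'start_urls': ['http://example.com/'],  # overridden by -s / --start
    'drop_existing_database': False,        # overridden by -r / --remove
    'log_to_file': True,                    # overridden by -c / --console
}

MAX_PAGES_PER_DOMAIN = 100        # stop crawling a domain after this many pages
MAX_CONTENT_LENGTH = 1024 * 1024  # skip pages larger than 1 MB

EXCLUDES = ['facebook.com', 'twitter.com']  # domains never crawled
RSS_EXCLUDES = ['comments', 'login']        # keyword matches exclude a feed
```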

Have a look at `crawler_config.py` for the default values!

Feel free to contribute to this repo.
- Repo owner or admin (flado)

Licensed under the permissive MIT license.