Scrapy project with spiders to extract article content from the following German news sites (a minimal spider sketch follows the list):
- spiegel.de
- stern.de
- faz.net
- zeit.de
- ftd.de
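The spider implementations are not reproduced in this README; the sketch below only illustrates the general shape of one, with a made-up CSS selector for the article body (the real spiders use site-specific selectors):

import scrapy

class SternSpider(scrapy.Spider):
    # Hypothetical sketch; the project's actual spider may differ.
    name = 'stern'
    allowed_domains = ['stern.de']
    start_urls = ['https://www.stern.de/']

    def parse(self, response):
        # If the page contains an article body, extract its text.
        # The CSS selector here is an assumption, not the real one.
        paragraphs = response.css('div.article-body p::text').getall()
        if paragraphs:
            yield {'url': response.url, 'body': ' '.join(paragraphs)}
        # Follow links to discover further articles; allowed_domains
        # keeps the crawl on stern.de.
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)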
To get a list of all provided spiders:
scrapy list
To start a spider (JOBDIR persists the crawl state, so an interrupted crawl can be resumed by re-running the same command):
scrapy crawl stern -s JOBDIR=crawls/stern
A simhash value is generated for every extracted article text. The similarity.py script then computes the similarity of all downloaded content.
Example:
python similarity.py crawls/stern/data.sqlite
0.327760236037
The value is the similarity as a fraction between 0 and 1. In the example above the downloaded content has a similarity of roughly 33%.
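similarity.py itself is not reproduced here. The following is a plausible sketch of the computation, assuming a 64-bit simhash per article and overall similarity defined as the average fraction of matching bits over all pairs of stored hashes; the actual script may compute it differently:

import re
import sys
import sqlite3
import hashlib
from itertools import combinations

def simhash(text, bits=64):
    # Classic simhash: hash every token, let each bit position vote
    # +1/-1, and keep the sign of each position's total.
    v = [0] * bits
    for token in re.findall(r'\w+', text.lower()):
        h = int(hashlib.md5(token.encode('utf-8')).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def pair_similarity(a, b, bits=64):
    # Fraction of bit positions on which two simhashes agree.
    return 1.0 - bin(a ^ b).count('1') / float(bits)

def average_similarity(db_path):
    # Assumes the simhash column holds a decimal string or integer.
    rows = sqlite3.connect(db_path).execute('select simhash from data').fetchall()
    hashes = [int(r[0]) for r in rows]
    pairs = list(combinations(hashes, 2))
    if not pairs:
        return 1.0
    return sum(pair_similarity(a, b) for a, b in pairs) / len(pairs)

if __name__ == '__main__':
    print(average_similarity(sys.argv[1]))

Note that the all-pairs comparison is quadratic in the number of articles, which is acceptable for crawls of moderate size.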
The content is stored in an SQLite database called 'data.sqlite'. The 'data' table has the following columns:
- url - the downloaded URL
- body - the extracted article content
- crawl_time - time of crawl
- simhash - the generated simhash of the article content
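The storage step is presumably handled by a Scrapy item pipeline. Below is a minimal sketch, assuming the table layout above and that the simhash helper is importable from similarity.py (both assumptions; the project's actual pipeline, including placing the database under the JOBDIR, may differ):

import sqlite3
from datetime import datetime, timezone
from similarity import simhash  # assumption: the simhash helper lives in similarity.py

class SqlitePipeline:
    # Hypothetical item pipeline; path handling for the JOBDIR is omitted.
    def open_spider(self, spider):
        self.conn = sqlite3.connect('data.sqlite')
        self.conn.execute('create table if not exists data '
                          '(url text, body text, crawl_time text, simhash text)')

    def process_item(self, item, spider):
        # Store the item together with its crawl time and simhash.
        self.conn.execute(
            'insert into data values (?, ?, ?, ?)',
            (item['url'], item['body'],
             datetime.now(timezone.utc).isoformat(),
             str(simhash(item['body']))))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()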
To check how many pages have been crawled while a crawl is running, open the database with the sqlite3 shell:
sqlite3 crawls/stern/data.sqlite
select count(url) from data;
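The same query can also be run non-interactively by passing it to sqlite3 as an argument:
sqlite3 crawls/stern/data.sqlite 'select count(url) from data;'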
Execute this to get the 10 simhash values shared by the most pages (i.e. groups of pages whose content is 100% similar):
select simhash, count(simhash) from data group by simhash order by count(simhash) DESC limit 10;