scrapy-german-news

Scrapy project with spiders to extract article content from the following German news sites:

  • spiegel.de
  • stern.de
  • faz.net
  • zeit.de
  • ftd.de

To get a list of all provided spiders:

scrapy list

To start a spider (the JOBDIR setting persists the crawl state, so an interrupted crawl can be resumed with the same command):

scrapy crawl stern -s JOBDIR=crawls/stern
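
The spiders follow the standard Scrapy pattern. As a rough sketch, one of them could be structured like the following; the start URL and CSS selectors here are illustrative assumptions, not taken from the repository:

import scrapy

class SternSpider(scrapy.Spider):
    # Sketch only: the name matches the crawl command above,
    # but the URLs and selectors are assumptions.
    name = "stern"
    allowed_domains = ["stern.de"]
    start_urls = ["https://www.stern.de/"]

    def parse(self, response):
        # Follow links found on the start page (hypothetical selector)
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_article)

    def parse_article(self, response):
        # Extract the article text (hypothetical selector)
        yield {
            "url": response.url,
            "body": " ".join(response.css("article ::text").getall()),
        }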

Similarity

A simhash value is generated for every extracted article text. The similarity.py script then computes the overall similarity of all downloaded content.

Example:

python similarity.py crawls/stern/data.sqlite
0.327760236037

The value is the similarity expressed as a fraction between 0 and 1. In the example above, the downloaded content is roughly 33% similar.
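
As a reference for how such a score could be computed, here is a minimal sketch: a 64-bit simhash over whitespace tokens, with the overall similarity taken as the average fraction of matching bits across all pairs of fingerprints. The hashing scheme and the averaging are assumptions, not necessarily what similarity.py does.

import hashlib
import itertools
import sqlite3

BITS = 64

def simhash(text):
    # Weight each of the 64 bit positions by the token hashes, then
    # keep a 1 wherever the accumulated weight is positive.
    v = [0] * BITS
    for token in text.split():
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        for i in range(BITS):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(BITS) if v[i] > 0)

def pair_similarity(a, b):
    # Fraction of matching bits between two fingerprints (1.0 = identical)
    return 1.0 - bin(a ^ b).count("1") / BITS

def average_similarity(db_path):
    # Average the pairwise similarity over all stored fingerprints
    rows = sqlite3.connect(db_path).execute("select simhash from data").fetchall()
    hashes = [int(r[0]) for r in rows]
    pairs = list(itertools.combinations(hashes, 2))
    return sum(pair_similarity(a, b) for a, b in pairs) / len(pairs)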

Database

The content is stored in an SQLite database called data.sqlite. The data table has the following columns (a short read-back example follows the list):

  • url - the downloaded url
  • body - the extracted article content
  • crawl_time - time of crawl
  • simhash - the generated simhash of the article content
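
The rows can also be read programmatically with Python's built-in sqlite3 module; a short sketch using the columns listed above (the path follows the JOBDIR layout shown earlier):

import sqlite3

# Read the crawled articles back out of the data table
con = sqlite3.connect("crawls/stern/data.sqlite")
for url, body, crawl_time, simhash in con.execute(
    "select url, body, crawl_time, simhash from data"
):
    print(crawl_time, url, simhash)
con.close()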

To check how many pages have been crawled while a crawl is running, open the database and count the rows:

sqlite3 crawls/stern/data.sqlite
select count(url) from data;

Execute this to get the 10 simhash values shared by the most pages (i.e. groups of pages whose content is 100% similar):

select simhash, count(simhash) from data group by simhash order by count(simhash) DESC limit 10;
