Repositories list
64 repositories
cc-citations (Public)

cc-webgraph (Public)
Tools to construct and process webgraphs from Common Crawl data

cc-crawl-statistics (Public)
Statistics of Common Crawl monthly archives mined from URL index files

web-languages (Public)
Crowd-sourced lists of urls to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ for the code

news-crawl (Public)
News crawling with StormCrawler - stores content as WARC

A polite and user-friendly downloader for Common Crawl data

cc-notebooks (Public)
Various Jupyter notebooks about Common Crawl data

cc-pyspark (Public)
Process Common Crawl data with Python and Spark

webarchive-indexing (Public)

uap-core (Public)

web-languages-code (Public)
The code used to generate templates for the web-languages repo https://github.com/commoncrawl/web-languages

nutch (Public)
Common Crawl fork of Apache Nutch

ia-hadoop-tools (Public)

whirlwind-python (Public)

cc-warc-examples (Public)

crawler-commons (Public)

open-data-registry (Public)

cc-index-table (Public)
Index Common Crawl archives in tabular format

language-detection-cld2 (Public)
Natural language detection, Java bindings for CLD2