Common Crawl Foundation

web-languages Public
Crowd-sourced lists of URLs to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ for the code.

Stars: 15 · Forks: 16 · Issues: 0 · Pull requests: 0 · Updated Dec 12, 2024
cc-pyspark Public
Process Common Crawl data with Python and Spark

Language: Python · Stars: 408 · License: MIT · Forks: 86 · Issues: 3 · Pull requests: 2 · Updated Dec 12, 2024
ia-web-commons Public Forked from Aloisius/ia-web-commons
Web archiving utility library

Language: Java · Stars: 9 · License: Apache-2.0 · Forks: 75 · Issues: 6 · Pull requests: 3 · Updated Dec 11, 2024
web-languages-code Public
The code used to generate templates for the web-languages repo: https://github.com/commoncrawl/web-languages

Language: Python · Stars: 1 · License: Apache-2.0 · Forks: 0 · Issues: 0 · Pull requests: 0 · Updated Dec 5, 2024
nutch Public Forked from Aloisius/nutch
Common Crawl fork of Apache Nutch

Language: Java · Stars: 29 · License: Apache-2.0 · Forks: 1,252 · Issues: 8 (1 issue needs help) · Pull requests: 0 · Updated Dec 5, 2024
webarchive-indexing Public Forked from ikreymer/webarchive-indexing
Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system.

Language: Python · Stars: 5 · License: MIT · Forks: 10 · Issues: 0 · Pull requests: 2 · Updated Dec 2, 2024
ia-hadoop-tools Public Forked from Aloisius/ia-hadoop-tools
Web archiving tools on Hadoop

Language: Java · Stars: 3 · Forks: 28 · Issues: 2 · Pull requests: 1 · Updated Nov 30, 2024
crawler-commons Public Forked from crawler-commons/crawler-commons
A set of reusable Java components that implement functionality common to any web crawler

Language: Java · Stars: 0 · License: Apache-2.0 · Forks: 79 · Issues: 0 · Pull requests: 0 · Updated Nov 26, 2024
cc-citations Public
Scientific articles using or citing Common Crawl data

Language: Jupyter Notebook · Stars: 11 · Forks: 3 · Issues: 0 · Pull requests: 0 · Updated Nov 24, 2024
open-data-registry Public Forked from awslabs/open-data-registry
A registry of publicly available datasets on AWS

Language: Python · Stars: 1 · License: Apache-2.0 · Forks: 939 · Issues: 0 · Pull requests: 0 · Updated Nov 24, 2024
