🕷️ The pipeline for the OSCAR corpus
Updated Dec 18, 2023 - Rust
Builds a tantivy index from Common Crawl warc.wet files
A polite and user-friendly downloader for Common Crawl data