🕷️ Ungoliant is a high-performance pipeline that provides tools to build corpus generation pipelines from CommonCrawl. 🕷️
It is currently the generation pipeline for the OSCAR corpus and is the replacement for goclassy.
- Via cargo: `cargo install ungoliant`
- Via git: `cargo install --git https://github.com/oscar-corpus/ungoliant`
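Once installed, you can check that the binary is on your `PATH` (the `-V`/`--version` flag is listed in the help output further down):

```sh
# Check that the installed binary is reachable and print its version
ungoliant --version
```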
Ungoliant has numerous dependencies that are compiled during installation. However, `cmake` and `gcc` may be needed, since the project uses fasttext-rs.
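If the build fails while compiling fasttext-rs, installing a C/C++ toolchain and CMake usually helps. As an example only (package names are an assumption and differ between distributions), on Debian/Ubuntu:

```sh
# Debian/Ubuntu example: provides gcc/g++ (build-essential) and cmake for fasttext-rs
sudo apt-get install build-essential cmake
```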
Use `curl https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin -o lid.176.bin` to download the fastText language identification model.
The usual way of generating corpora is:

- Fetch the `wet.paths.gz` file from the latest CommonCrawl dump and decompress it.
- Download the files using the `download` command.
- Generate the corpus using the `pipeline` command (it may take some time).
- Deduplicate if needed using the `dedup` command.
- Split into smaller files using the `split` command.
- Compress using the `compress` command. :-)
- Package using the `package` command, which creates language-specific folders, moves the relevant files into them, and adds a checksum file.
You can find more information on each command's `--help`.
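For illustration, a full run might look like the sketch below. The subcommand names match the help output that follows, but the argument order, paths, and crawl identifier are assumptions; check `ungoliant <SUBCOMMAND> --help` for the exact arguments and options.

```sh
# Hypothetical end-to-end run; argument order is an assumption, see each subcommand's --help.

# 1. Fetch and decompress the WET paths listing (replace <CRAWL-ID> with a real dump identifier)
curl https://data.commoncrawl.org/crawl-data/<CRAWL-ID>/wet.paths.gz -o wet.paths.gz
gzip -d wet.paths.gz

# 2. Download the WET files listed in wet.paths
ungoliant download wet.paths cc_shards/      # assumed: paths file, then destination directory

# 3. Generate the corpus (may take some time)
ungoliant pipeline cc_shards/ corpus/        # assumed: source directory, then destination directory

# 4. Optional post-processing
ungoliant dedup corpus/ corpus_dedup/        # assumed arguments
ungoliant split corpus_dedup/ corpus_split/  # assumed arguments
ungoliant compress corpus_split/             # assumed arguments
ungoliant package corpus_split/              # assumed arguments
```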
    ungoliant 0.1.0
    corpus generation tool.

    USAGE:
        ungoliant <SUBCOMMAND>

    FLAGS:
        -h, --help       Prints help information
        -V, --version    Prints version information

    SUBCOMMANDS:
        compress    Compress
        dedup       Deduplicate a generated, not split corpus.
        download    Downloading of CommonCrawl
        help        Prints this message or the help of the given subcommand(s)
        package     package
        pipeline    Run pipeline
        split       Split a not split corpus
Ungoliant is not yet on docs.rs: use `cargo doc --bins --open` to open the documentation.
Benchmarks are not (yet) up to date. Use `cargo bench` to run them, and see the results in `target/criterion/report/index.html`.