🕷️ Ungoliant is a high-performance pipeline that provides tools to build corpus generation pipelines from CommonCrawl. 🕷️
It is currently the generation pipeline for the OSCAR corpus and is the replacement for goclassy.
- Via cargo: `cargo install ungoliant`
- Via git: `cargo install --git https://github.com/oscar-corpus/ungoliant`
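Once installed, you can check that the binary is on your `PATH` (the `-V`/`--version` flag is listed in the help output further down):

```sh
# Check that the installed binary is reachable and print its version
ungoliant --version
```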
Ungoliant has numerous dependencies that are compiled during installation. However, `cmake` and `gcc` may be needed, since the project uses fasttext-rs.
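If the build fails while compiling fasttext-rs, installing a C/C++ toolchain and CMake usually helps. As an example only (package names are an assumption and differ between distributions), on Debian/Ubuntu:

```sh
# Debian/Ubuntu example: provides gcc/g++ (build-essential) and cmake for fasttext-rs
sudo apt-get install build-essential cmake
```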
Use `curl https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin -o lid.176.bin` to download the fastText language identification model.
The usual way of generating corpora is:

- Fetch the `wet.paths.gz` file from the latest CommonCrawl dump and decompress it.
- Download the files using the `download` command.
- Generate the corpus using the `pipeline` command (it may take some time).
- Deduplicate if needed using the `dedup` command.
- Split into smaller files using the `split` command.
- Compress using the `compress` command. :-)
- Package using the `package` command, which creates language-specific folders, moves the relevant files into them, and adds a checksum file.
You can find more information on each command's `--help`.
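For illustration, a full run might look like the sketch below. The subcommand names match the help output that follows, but the argument order, paths, and crawl identifier are assumptions; check `ungoliant <SUBCOMMAND> --help` for the exact arguments and options.

```sh
# Hypothetical end-to-end run; argument order is an assumption, see each subcommand's --help.

# 1. Fetch and decompress the WET paths listing (replace <CRAWL-ID> with a real dump identifier)
curl https://data.commoncrawl.org/crawl-data/<CRAWL-ID>/wet.paths.gz -o wet.paths.gz
gzip -d wet.paths.gz

# 2. Download the WET files listed in wet.paths
ungoliant download wet.paths cc_shards/      # assumed: paths file, then destination directory

# 3. Generate the corpus (may take some time)
ungoliant pipeline cc_shards/ corpus/        # assumed: source directory, then destination directory

# 4. Optional post-processing
ungoliant dedup corpus/ corpus_dedup/        # assumed arguments
ungoliant split corpus_dedup/ corpus_split/  # assumed arguments
ungoliant compress corpus_split/             # assumed arguments
ungoliant package corpus_split/              # assumed arguments
```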
    ungoliant 0.1.0
    corpus generation tool.

    USAGE:
        ungoliant <SUBCOMMAND>

    FLAGS:
        -h, --help       Prints help information
        -V, --version    Prints version information

    SUBCOMMANDS:
        compress    Compress
        dedup       Deduplicate a generated, not split corpus.
        download    Downloading of CommonCrawl
        help        Prints this message or the help of the given subcommand(s)
        package     package
        pipeline    Run pipeline
        split       Split a not split corpus
Ungoliant is not yet on docs.rs: use `cargo doc --bins --open` to open the documentation.
Benchmarks are not (yet) up to date. Use `cargo bench` to run them, and see the results in `target/criterion/report/index.html`.