parquet2jsonl-rs Public
Converts Parquet files to JSONL, specifically for dataset tomfoolery
Rust · Updated Dec 9, 2024
token-counter-rs Public
Simple Rust utility to count tokens from tar files of contexts
Rust · Updated Aug 15, 2024
text-subsample-rs Public
Methods for subsampling text datasets (with emphasis on "duplicate aware subsampling")
Rust · Updated Aug 12, 2024
docshuffle-rs Public
Uses the local-cell mapper pattern to fully shuffle a collection of JSONL documents in Rust
Rust · Updated Jul 1, 2024
tokshuf-rust Public
Tokenize/Shuffle tooling written in Rust
reservoir-datastats-rs Public
Multithreaded reservoir sampling for doc-length (also counts tokens globally :D)
Rust · Updated Jun 18, 2024
deduplicate-text-datasets Public
Forked from google-research/deduplicate-text-datasets
for decontamination
Rust · Apache License 2.0 · Updated Jun 18, 2024
rust-exact-dedup Public
Exact deduplication in Rust, with an option to count presence
Rust · Updated Jun 17, 2024
wimbd Public
Forked from allenai/wimbd
What's In My Big Data (WIMBD) - a toolkit for analyzing large text datasets
Python · Apache License 2.0 · Updated May 7, 2024
in-context-pretraining Public
Forked from swj0419/in-context-pretraining
Python · Updated Mar 20, 2024
ray Public
Forked from ray-project/ray
Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
Python · Apache License 2.0 · Updated Oct 24, 2023
open_lm Public
Forked from mlfoundations/open_lm
A repository for research on medium-sized language models.
lipMIP Public
Mixed-integer programming for computing Lipschitz constants of ReLU networks
fastargs Public
Forked from GuillaumeLeclerc/fastargs
Python library for argument and configuration management
Python · MIT License · Updated Feb 7, 2023
geometric-certificates Public
Geometric Certifications of Neural Nets
bit-diffusion Public
Forked from lucidrains/bit-diffusion
Implementation of Bit Diffusion, Hinton's group's attempt at discrete denoising diffusion, in Pytorch
Python · MIT License · Updated Oct 16, 2022
swav-cifar100 Public
Forked from facebookresearch/swav
PyTorch implementation of SwAV https://arxiv.org/abs/2006.09882
Python · Other · Updated Aug 26, 2022