Human Language Processing Laboratory (HLP Lab)
Wuhan, China
Stars
Transformer models from BERT to GPT-4, environments from Hugging Face to OpenAI. Fine-tuning, training, and prompt engineering examples. A bonus section with ChatGPT, GPT-3.5-turbo, GPT-4, and DALL…
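As a quick illustration of the kind of prompt-engineering calls that bonus section deals with, here is a minimal sketch against the OpenAI chat completions API; the model name, system prompt, and example request are placeholders, not taken from the book.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# One-shot prompt: ask the model to act as a translator (illustrative example).
response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # placeholder; any chat-capable model works
    messages=[
        {"role": "system", "content": "You are a concise technical translator."},
        {"role": "user", "content": "Translate to French: 'Attention is all you need.'"},
    ],
)
print(response.choices[0].message.content)
```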
Jupyter notebooks for the Natural Language Processing with Transformers book
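For the Hugging Face side, a minimal sketch of the `pipeline` API those notebooks build on; the example sentence is illustrative.

```python
from transformers import pipeline

# Text-classification pipeline; downloads a default checkpoint on first use.
classifier = pipeline("sentiment-analysis")
print(classifier("Transformers make NLP experiments remarkably easy."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```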
Inspired by Google's C4, a series of colossal clean data-cleaning scripts focused on CommonCrawl processing, including the Chinese data processing and cleaning methods from MassiveText.
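A minimal sketch of the line-level heuristics the C4 paper describes (keep lines ending in terminal punctuation with enough words, drop boilerplate-looking pages); the thresholds and function name are illustrative, not taken from this repository.

```python
def c4_style_filter(page_text: str, min_words: int = 5, min_lines: int = 3) -> str | None:
    """Keep sentence-like lines; drop the page if too little survives."""
    lowered = page_text.lower()
    if "lorem ipsum" in lowered or "{" in page_text:  # likely boilerplate or code
        return None
    kept = []
    for line in page_text.splitlines():
        line = line.strip()
        if line.endswith((".", "!", "?", '"')) and len(line.split()) >= min_words:
            kept.append(line)
    if len(kept) < min_lines:
        return None
    return "\n".join(kept)
```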
PDF scientific paper translation with preserved formats: AI-based full-text bilingual translation of PDF documents that fully preserves the original layout; supports services such as Google/DeepL/Ollama/OpenAI and provides CLI/GUI/MCP/Docker/Zotero interfaces.
Implement a ChatGPT-like LLM in PyTorch from scratch, step by step
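A minimal sketch of the causal self-attention block such a from-scratch GPT is built around; the class name and dimensions are illustrative, not the book's code.

```python
import torch
import torch.nn as nn

class CausalSelfAttention(nn.Module):
    """Single-head causal self-attention: each token attends only to earlier tokens."""
    def __init__(self, d_model: int, max_len: int = 256):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model)
        # Upper-triangular mask hides future positions.
        mask = torch.triu(torch.ones(max_len, max_len, dtype=torch.bool), diagonal=1)
        self.register_buffer("mask", mask)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = q @ k.transpose(-2, -1) / d ** 0.5
        scores = scores.masked_fill(self.mask[:t, :t], float("-inf"))
        weights = torch.softmax(scores, dim=-1)
        return self.out(weights @ v)

x = torch.randn(2, 16, 64)            # (batch, tokens, d_model)
attn = CausalSelfAttention(d_model=64)
print(attn(x).shape)                  # torch.Size([2, 16, 64])
```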
BLEURT is a metric for Natural Language Generation based on transfer learning.
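A minimal scoring sketch following the API shown in the BLEURT README; the checkpoint path is a placeholder for a checkpoint you download separately.

```python
from bleurt import score

checkpoint = "BLEURT-20"  # placeholder: path to a downloaded BLEURT checkpoint
references = ["The cat sat on the mat."]
candidates = ["A cat was sitting on the mat."]

scorer = score.BleurtScorer(checkpoint)
scores = scorer.score(references=references, candidates=candidates)
print(scores)  # one learned quality score per candidate
```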
🚀🚀 [LLM] Train a small 26M-parameter GPT completely from scratch in just 2 hours! 🌏
Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.
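A minimal sketch of the byte-level BPE training loop that algorithm boils down to (repeatedly merge the most frequent adjacent pair into a new token id); function names are illustrative, not taken from the repository.

```python
from collections import Counter

def most_frequent_pair(ids: list[int]) -> tuple[int, int]:
    """Return the most common adjacent pair of ids."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """Replace every occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# Train a tiny vocabulary on raw UTF-8 bytes.
ids = list("aaabdaaabac".encode("utf-8"))
for new_id in range(256, 259):          # three merges
    pair = most_frequent_pair(ids)
    ids = merge(ids, pair, new_id)
    print(f"merged {pair} -> {new_id}: {ids}")
```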
20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.
The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens.
Python GUI Programming Cookbook, Third Edition, Published by Packt
A tool that locates, downloads, and extracts machine translation corpora
NTREX -- News Test References for MT Evaluation
Facebook Low Resource (FLoRes) MT Benchmark
VITS Japanese with Whisper as the data processor (you can train your own VITS even if you only have audio files).
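A minimal sketch of using Whisper to turn raw audio into transcripts for TTS training data; the model size and audio path are placeholders.

```python
import whisper

# Load a pretrained Whisper model and transcribe one audio file.
model = whisper.load_model("base")
result = model.transcribe("speech_sample.wav")  # placeholder path
print(result["text"])              # full transcript
for seg in result["segments"]:     # per-segment timestamps, useful for alignment
    print(seg["start"], seg["end"], seg["text"])
```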
CjangCjengh / vits
Forked from jaywalnut310/vits. VITS implementation for Japanese, Chinese, Korean, Sanskrit and Thai.
PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
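A minimal text-extraction sketch using PyMuPDF's documented open/get_text calls; the file path is a placeholder.

```python
import fitz  # PyMuPDF is imported under the name "fitz"

doc = fitz.open("example.pdf")   # placeholder path
for page in doc:
    print(page.get_text())       # plain text of each page
doc.close()
```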