Starred repositories
Automatically create Faiss knn indices with optimal similarity search parameters.
pycorrector is a toolkit for text error correction. It applies models such as Kenlm, T5, MacBERT, ChatGLM3, and Qwen2.5 to error-correction scenarios and works out of the box.
Library for fast text representation and classification.
Tools to download and clean up Common Crawl data
[CVPR 2025] A Comprehensive Benchmark for Document Parsing and Evaluation
Official code implementation of General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
Official code for "Fox: Focus Anywhere for Fine-grained Multi-page Document Understanding"
[ECCV 2024] Official code implementation of Vary: Scaling Up the Vision Vocabulary of Large Vision Language Models.
Code repository for the paper "Matryoshka Representation Learning"
Label, clean and enrich text datasets with LLMs.
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.
We identify the desiderata for a comprehensive benchmark and propose Visually Rich Document Understanding (VRDU). VRDU contains two datasets that represent several challenges: rich schema including…
A simple screen parsing tool toward a pure-vision-based GUI agent
Implementation of Nougat Neural Optical Understanding for Academic Documents
To try textin document parsing, visit https://cc.co/16YSIy
Extract text from any document. No muss. No fuss.
A Collection of Variational Autoencoders (VAE) in PyTorch.
MedNLI - A Natural Language Inference Dataset For The Clinical Domain
An open-source solution for full parameter fine-tuning of DeepSeek-V3/R1 671B, including complete code and scripts from training to inference, as well as some practical experiences and conclusions.…