Stars
A community-driven AI automation framework that builds upon the incredible work of the open source community. Our goal is to combine language models with specialized tools for tasks like web search…
Python tool for converting files and office documents to Markdown.
A streamlined and customizable framework for efficient large model evaluation and performance benchmarking
LLM notes, including model inference, transformer model structure, and llm framework code analysis notes.
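LLM API management and distribution system supporting mainstream models including OpenAI, Azure, Anthropic Claude, Google Gemini, DeepSeek, ByteDance Doubao, ChatGLM, ERNIE Bot (Wenxin Yiyan), iFlytek Spark, Tongyi Qianwen, 360 Zhinao, and Tencent Hunyuan, with unified API adaptation; usable for key management and redistribution. Ships as a single executable with a Docker image for one-click deployment, ready to use out of the box. LLM API management & k…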
✨✨VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
Official Implementation for "Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition"
Fast and accurate automatic speech recognition (ASR) for edge devices
A Fundamental End-to-End Speech Recognition Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Recognition, Voice Activity Detection, Text Post-processing etc.
Omni SenseVoice: High-Speed Speech Recognition with word timestamps 🗣️🎯
✨✨[CVPR 2025] Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
VILA is a family of state-of-the-art vision language models (VLMs) for diverse multimodal AI tasks across the edge, data center, and cloud.
Long Context Transfer from Language to Vision
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
[CVPR2024 Highlight][VideoChatGPT] ChatGPT with video understanding! And many more supported LMs such as miniGPT4, StableLM, and MOSS.
【EMNLP 2024🔥】Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
Use late-interaction multi-modal models such as ColPali in just a few lines of code.
A high-quality, one-stop open-source data extraction tool that converts PDF to Markdown and JSON.
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
The code used to train and run inference with the ColVision models, e.g. ColPali, ColQwen2, and ColSmol.
Official code implementation of General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
Summarize and perform RAG on PPTX/PPT files
Tesseract Open Source OCR Engine (main repository)
OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding.
Retrieval and Retrieval-augmented LLMs
Disaggregated serving system for Large Language Models (LLMs).
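Currently covers 232 large models, including commercial models such as chatgpt, gpt-4o, o3-mini, Google gemini, Claude3.5, Zhipu GLM-Zero, ERNIE Bot (Wenxin Yiyan), qwen-max, Baichuan, iFlytek Spark, SenseTime senseChat, and minimax, as well as DeepSeek-R1, qwq-32b, deepseek-v3, qwen2.5, llama3.3, phi-4, glm4, gemma3, mistral, Shusheng in…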