Stars
Open-source Multi-agent Poster Generation from Papers
Full system prompts, tools & AI models for v0, Cursor, Manus, Same.dev, Lovable, Devin, Replit Agent, Windsurf Agent, VSCode Agent, Dia Browser, Trae AI, and other open-sourced agents.
Everything about the SmolLM2 and SmolVLM family of models
[CVPR'24 Oral] Official repository of Point Transformer V3 (PTv3)
Lightweight coding agent that runs in your terminal
RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation
🚀 One-stop solution for creating your digital avatar from chat logs. 💡 Fine-tune LLMs on your chat logs to capture your unique style, then bind them to a chatbot to bring your digital self to life.
Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning, achieving state-of-the-art performance on 38 out of 60 public benchmarks.
DeerFlow is a community-driven Deep Research framework, combining language models with tools like web search, crawling, and Python execution, while contributing back to the open-source community.
Kimi-Audio, an open-source audio foundation model excelling in audio understanding, generation, and conversation
Suna - Open Source Generalist AI Agent
A curated collection of resources, tools, and frameworks for developing GUI Agents.
MAGI-1: Autoregressive Video Generation at Scale
A quick-start programming guide to the Model Context Protocol (MCP)
State-of-the-art Image & Video CLIP, Multimodal Large Language Models, and More!
Awesome curated collection of images and prompts generated by GPT-4o and gpt-image-1. Explore AI-generated visuals created with ChatGPT and Sora, showcasing OpenAI's advanced image generation capabilities.
An open protocol enabling communication and interoperability between opaque agentic applications.
Kimi-VL: Mixture-of-Experts Vision-Language Model for Multimodal Reasoning, Long-Context Understanding, and Strong Agent Capabilities
[CVPR 2025 Oral] Official repository for the paper "AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea"
AppAgent: Multimodal Agents as Smartphone Users, an LLM-based multimodal agent framework designed to operate smartphone apps.
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
11 Lessons to Get Started Building AI Agents
Qwen2.5-Omni is an end-to-end multimodal model from the Qwen team at Alibaba Cloud, capable of understanding text, audio, vision, and video, and performing real-time speech generation.
Use PEFT or full-parameter training to run CPT/SFT/DPO/GRPO on 500+ LLMs (Qwen3, Qwen3-MoE, Llama4, InternLM3, DeepSeek-R1, …) and 200+ MLLMs (Qwen2.5-VL, Qwen2.5-Omni, Qwen2-Audio, Ovis2, InternVL3, Llava, GLM4…)
A GUI agent application based on UI-TARS (a vision-language model) that lets you control your computer using natural language.