Tsinghua University
- https://yzd-v.github.io/page/
Stars
Official implementation of "Enhancing Visual Grounding for GUI Agents via Self-Evolutionary Reinforcement Learning"
AKShare is an elegant and simple financial data interface library for Python, built for human beings! An open-source financial data interface library.
[IJCV] Bamboo: 4 times larger than ImageNet; 2 times larger than Objects365; built by active learning.
[ECCV 2024] Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
Quick exploration into fine-tuning Florence-2
[CVPR 2025 Oral & Best Paper Award Candidate] Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
Everything about the SmolLM2 and SmolVLM family of models
A Framework of Small-scale Large Multimodal Models
Strong and Open Vision Language Assistant for Mobile Devices
MAGI-1: Autoregressive Video Generation at Scale
A simple screen parsing tool towards a pure vision-based GUI agent
💻 A curated list of papers and resources for multi-modal Graphical User Interface (GUI) agents.
Your AI Operator for Web, Android, Automation & Testing.
A GUI Agent application based on UI-TARS (Vision-Language Model) that allows you to control your computer using natural language.
Efficient vision foundation models for high-resolution generation and perception.
DINO-X: The World's Top-Performing Vision Model for Open-World Object Detection and Understanding
No fortress, purely open ground. OpenManus is Coming.
🌐 Make websites accessible for AI agents. Automate tasks online with ease.
Ready-to-use OCR with 80+ supported languages and all popular writing scripts, including Latin, Chinese, Arabic, Devanagari, Cyrillic, etc.
Wan: Open and Advanced Large-Scale Video Generative Models
OpenMMLab Rotated Object Detection Toolbox and Benchmark