Stars
OS-ATLAS: A Foundation Action Model For Generalist GUI Agents
Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities
[ICLR 2025] Hallo2: Long-Duration and High-Resolution Audio-driven Portrait Image Animation
Official implementation of Dynamic Frame Avatar with a Non-autoregressive Diffusion Framework for Talking Head Video Generation
[ICLR'25 Oral] UGround: Universal GUI Visual Grounding for GUI Agents
TPI-LLM: Serving 70b-scale LLMs Efficiently on Low-resource Edge Devices
Official Repository of VideoLLaMB: Long Video Understanding with Recurrent Memory Bridges
The model, data and code for the visual GUI Agent SeeClick
ScreenAgent: A Computer Control Agent Driven by Visual Language Large Model (IJCAI-24)
Moshi is a speech-text foundation model and full-duplex spoken dialogue framework. It uses Mimi, a state-of-the-art streaming neural audio codec.
Qwen2.5-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.
Code to accompany "A Method for Animating Children's Drawings of the Human Figure"