Stars
This repository provides a Python script to fetch and summarize research papers from arXiv using the free Gemini API
[RSS 2025] Learning to Act Anywhere with Task-centric Latent Actions
A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations
[CVPR 2025 Highlight] Video Depth Anything: Consistent Depth Estimation for Super-Long Videos
3D Occupancy Prediction Benchmark in Autonomous Driving
Nexus: Decoupled Diffusion Sparks Adaptive Scene Generation
OccSora: 4D Occupancy Generation Models as World Simulators for Autonomous Driving
[ICLR24] Official implementation of the paper “MagicDrive: Street View Generation with Diverse 3D Geometry Control”
Official implementation of the paper “MagicDrive3D: Controllable 3D Generation for Any-View Rendering in Street Scenes”
STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?
Video-R1: Reinforcing Video Reasoning in MLLMs [🔥the first paper to explore R1 for video]
[CVPR2024] ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
GeometryCrafter: Consistent Geometry Estimation for Open-world Videos with Diffusion Priors
[CVPR'25 Oral] MoGe: Unlocking Accurate Monocular Geometry Estimation for Open-Domain Images with Optimal Training Supervision
[CVPR 2025] UniScene: Unified Occupancy-centric Driving Scene Generation
RoboDual: Dual-System for Robotic Manipulation
Qwen2.5-VL is the multimodal large language model series developed by the Qwen team, Alibaba Cloud.
Modified 3D Gaussian rasterizer for latentSplat: Autoencoding Variational Gaussians for Fast Generalizable 3D Reconstruction