- Rio de Janeiro, Brazil
Stars
VLM-driven tool that processes surveillance videos, extracts frames, and generates insightful annotations using a fine-tuned Florence-2 Vision-Language Model. Includes a Gradio-based interface for …
Surveillance Perspective Human Action Recognition Dataset: 7759 Videos from 14 Action Classes, aggregated from multiple sources, all cropped spatio-temporally and filmed from a surveillance-camera …
Fine-tuning & Reinforcement Learning for LLMs. 🦥 Train Qwen3, Llama 4, DeepSeek-R1, Gemma 3, TTS 2x faster with 70% less VRAM.
Recipes for shrinking, optimizing, and customizing cutting-edge vision models. 💜
Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model
GPT4V-level open-source multi-modal model based on Llama3-8B
VideoLLM-online: Online Video Large Language Model for Streaming Video (CVPR 2024)
✨✨Latest Advances on Multimodal Large Language Models
VILA is a family of state-of-the-art vision language models (VLMs) for diverse multimodal AI tasks across the edge, data center, and cloud.
Analyze videos using LLMs, Computer Vision and Automatic Speech Recognition
Python Computer Vision & Video Analytics Framework With Batteries Included
⛹️ Pytorch ReID: A tiny, friendly, strong PyTorch implementation of a person re-id / vehicle re-id baseline. Tutorial 👉 https://github.com/layumi/Person_reID_baseline_pytorch/tree/master/tutorial
Ready-to-use SRT / WebRTC / RTSP / RTMP / LL-HLS media server and media proxy that lets you read, publish, proxy, record, and play back video and audio streams.
A lightweight web application for remotely viewing images from a remote computer through a web browser. 🖼️
Implementation of the paper "Vision Language Model for Interpretable and Fine-grained Detection of Safety Compliance in Diverse Workplaces"
The Amazon Kinesis Video Streams WebRTC SDK lets developers install and customize real-time communication between devices and enables secure streaming of video and audio to Kinesis Video Streams.
superglue automates workflows from natural language. Agents use it to build deterministic workflows across apps, APIs and databases. Humans use it to automate complex workflows with just one prompt.
[NeurIPS 2023] Global Structure-Aware Diffusion Process for Low-Light Image Enhancement
DINO-X: The World's Top-Performing Vision Model for Open-World Object Detection and Understanding
🤖 Autonomous agent framework for Elixir. Built for distributed, autonomous behavior and dynamic workflows.
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
Implementation for the paper "Follow the Rules: Reasoning for Video Anomaly Detection with Large Language Model"
🤖 Chat with your SQL database 📊. Accurate Text-to-SQL Generation via LLMs using RAG 🔄.
Ultimate camera streaming application with support for RTSP, RTMP, HTTP-FLV, WebRTC, MSE, HLS, MP4, MJPEG, HomeKit, FFmpeg, etc.
1.5−3.0× lossless training or pre-training speedup. An off-the-shelf, easy-to-implement algorithm for the efficient training of foundation visual backbones.