I explore how multi-modal large language models (MLLMs) can advance remote sensing tasks.
I work on text-to-image, image-to-image, and text-to-video generation, blending creativity with machine learning.
GeoPix: Multi-Modal Large Language Model for Pixel-level Image Understanding in Remote Sensing (Accepted at IEEE GRSM)
GeoPix is a remote sensing MLLM that extends image understanding capabilities to the pixel level. It integrates a mask predictor into the MLLM, transforming visual features from the vision encoder into masks conditioned on the segmentation token embeddings generated by the LLM.
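A minimal sketch of this idea, assuming a simple dot-product mask head; the module, dimensions, and names below are illustrative, not the actual GeoPix implementation:

```python
import torch
import torch.nn as nn

class SegTokenMaskPredictor(nn.Module):
    """Toy mask predictor: turns vision-encoder features into a mask
    conditioned on a segmentation-token embedding produced by the LLM.
    Hypothetical module, not the GeoPix architecture itself."""

    def __init__(self, vis_dim: int = 1024, llm_dim: int = 4096, hidden_dim: int = 256):
        super().__init__()
        self.vis_proj = nn.Conv2d(vis_dim, hidden_dim, kernel_size=1)  # project visual features
        self.tok_proj = nn.Linear(llm_dim, hidden_dim)                 # project the segmentation token

    def forward(self, vis_feats: torch.Tensor, seg_token: torch.Tensor) -> torch.Tensor:
        # vis_feats: (B, vis_dim, H, W) from the vision encoder
        # seg_token: (B, llm_dim) hidden state of the segmentation token from the LLM
        f = self.vis_proj(vis_feats)                 # (B, hidden, H, W)
        q = self.tok_proj(seg_token)                 # (B, hidden)
        # dot-product between the token query and every spatial location -> mask logits
        logits = torch.einsum("bchw,bc->bhw", f, q)
        return logits                                # upsample + sigmoid downstream

if __name__ == "__main__":
    pred = SegTokenMaskPredictor()
    mask_logits = pred(torch.randn(2, 1024, 32, 32), torch.randn(2, 4096))
    print(mask_logits.shape)  # torch.Size([2, 32, 32])
```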
A Method of Efficient Synthesizing Post-disaster Remote Sensing Image with Diffusion Model and LLM (Accepted at APSIPA ASC 2023)
This work presents a novel method for generating disaster-affected remote sensing images by integrating state-of-the-art models (Stable Diffusion, BLIP, GPT-4) with human-in-the-loop feedback. The pipeline starts with only 97 unlabelled 512×512 remote sensing images. BLIP first generates initial captions, which are then refined into richer prompts through expert feedback and GPT-based semantic rewriting. These refined prompts, paired with the original images, form a synthetic training set.
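As a rough illustration of the captioning-and-rewriting stage (the BLIP checkpoint name, the tile directory, and the `refine_prompt` placeholder are assumptions, not the paper's exact code):

```python
from pathlib import Path
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Illustrative public BLIP checkpoint, not necessarily the one used in the paper.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption(image_path: Path) -> str:
    """Generate an initial caption for one remote sensing tile with BLIP."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=40)
    return processor.decode(out[0], skip_special_tokens=True)

def refine_prompt(raw_caption: str) -> str:
    """Placeholder for GPT-4 / expert-in-the-loop rewriting into a diffusion prompt."""
    # In the real pipeline this step sends the caption to GPT-4 and folds in
    # expert feedback; here we only append disaster-related cues as an example.
    return f"{raw_caption}, post-disaster aerial view, collapsed structures, debris"

pairs = []
for path in sorted(Path("tiles").glob("*.png")):      # the 97 unlabelled 512x512 tiles
    prompt = refine_prompt(caption(path))
    pairs.append({"image": str(path), "prompt": prompt})  # one synthetic training pair
```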
I developed a ControlNet model designed to transform line art into fully colored anime-style images. This model enables precise and high-quality generation by conditioning the diffusion process on clean line drawings, making it easier to create vibrant and consistent anime artwork from simple sketches.
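A minimal usage sketch of line-art-conditioned generation with the diffusers library; the public checkpoint names here are stand-ins for the fine-tuned model described above, and the prompts and file names are illustrative:

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Illustrative public checkpoints, not the custom line-art model itself.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15s2_lineart_anime", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

line_art = Image.open("sketch.png").convert("RGB")   # clean line drawing as the condition
image = pipe(
    prompt="anime character, vibrant colors, detailed shading, best quality",
    negative_prompt="lowres, blurry, monochrome",
    image=line_art,                # conditioning image for the ControlNet branch
    num_inference_steps=25,
    guidance_scale=7.0,
).images[0]
image.save("colored.png")
```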
Main developer of the image generation pipeline for Linky, supporting anime-style, photorealistic, and film-style image stylization, pose editing, and face consistency modeling.
This pipeline covers the following stages (a minimal orchestration sketch follows the list):
- Prompt cleaning and expansion (similar to a prompt helper)
- Image style selection
- Pose extraction and editing
- Face consistency enhancement
- Risk control evaluation
- Compute resource scheduling strategies
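The sketch below chains these stages end to end; every function name, stub body, and checkpoint label is hypothetical and stands in for the production Linky services:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class GenRequest:
    user_prompt: str
    style: str = "anime"                 # "anime" | "real" | "film"
    pose_image: Optional[str] = None     # optional pose reference
    face_image: Optional[str] = None     # optional identity reference
    notes: dict = field(default_factory=dict)

def clean_and_expand_prompt(req: GenRequest) -> GenRequest:
    # Prompt-helper style cleaning/expansion (strip junk, add quality tags).
    req.user_prompt = req.user_prompt.strip() + ", best quality, detailed"
    return req

def select_style_model(req: GenRequest) -> str:
    # Map the requested style to a checkpoint name (labels are placeholders).
    return {"anime": "anime_ckpt", "real": "photoreal_ckpt", "film": "film_ckpt"}[req.style]

def extract_pose(req: GenRequest) -> Optional[str]:
    # Would run a pose estimator on the reference image and return a pose map.
    return f"pose_map({req.pose_image})" if req.pose_image else None

def enforce_face_consistency(req: GenRequest) -> Optional[str]:
    # Would compute identity embeddings from the face reference.
    return f"face_embed({req.face_image})" if req.face_image else None

def risk_check(prompt: str) -> bool:
    # Placeholder safety filter applied before any GPU time is spent.
    return "forbidden" not in prompt.lower()

def schedule_and_generate(ckpt: str, req: GenRequest, pose, face) -> str:
    # Placeholder for queueing the job onto a GPU worker and running diffusion.
    return f"image generated with {ckpt}, pose={pose}, face={face}"

def run_pipeline(req: GenRequest) -> str:
    req = clean_and_expand_prompt(req)
    if not risk_check(req.user_prompt):
        raise ValueError("request rejected by risk control")
    ckpt = select_style_model(req)
    pose = extract_pose(req)
    face = enforce_face_consistency(req)
    return schedule_and_generate(ckpt, req, pose, face)

if __name__ == "__main__":
    print(run_pipeline(GenRequest("a girl walking in the rain", style="film")))
```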