INSPIR (Image-Text Synthesis Pipeline for Intelligent Retrieval and Generation) is a framework for image captioning and retrieval built on an ensemble of state-of-the-art models. It generates candidate captions with BLIP (Bootstrapping Language-Image Pre-training), ViT-GPT2 (a Vision Transformer encoder paired with a GPT-2 decoder), and GIT (Generative Image-to-Text), and uses CLIP (Contrastive Language-Image Pre-training) to rank the candidates by relevance to the image. The top-ranked captions are then passed to Llama 3.1, which produces creative outputs tailored to applications such as social media captions and image notes.

INSPIR also strengthens image retrieval: uploaded images are annotated automatically, and users can run text-based searches over those annotations for efficient access to relevant visual content. By aligning both modalities in a unified semantic space through contrastive learning, this work aims to push image captioning beyond conventional classification tasks, toward models that generalize across the interaction of language and vision. Minimal sketches of each stage follow below.
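The candidate-generation stage runs each captioner independently on the same image and collects one caption per model. Below is a minimal sketch, assuming the public Hugging Face checkpoints listed in `CAPTIONERS` (the text above does not pin exact checkpoints); `generate_candidates` is an illustrative helper rather than INSPIR's actual API.

```python
# Candidate caption generation with the BLIP / ViT-GPT2 / GIT ensemble.
# Checkpoint names are assumptions; swap in whichever variants you use.
from PIL import Image
from transformers import pipeline

CAPTIONERS = {
    "blip": "Salesforce/blip-image-captioning-base",
    "vit-gpt2": "nlpconnect/vit-gpt2-image-captioning",
    "git": "microsoft/git-base-coco",
}

def generate_candidates(image_path: str) -> list[str]:
    """Run every captioning model once and collect its caption."""
    image = Image.open(image_path).convert("RGB")
    captions = []
    for name, checkpoint in CAPTIONERS.items():
        captioner = pipeline("image-to-text", model=checkpoint)
        result = captioner(image)  # [{"generated_text": "..."}]
        captions.append(result[0]["generated_text"])
    return captions
```

In practice the three pipelines would be loaded once and reused across images; they are constructed inside the loop here only to keep the sketch short.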
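The ranking stage scores every candidate caption against the image with CLIP and sorts by similarity. A minimal sketch, assuming the `openai/clip-vit-base-patch32` checkpoint:

```python
# CLIP-based relevance ranking of candidate captions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rank_captions(image: Image.Image, captions: list[str]) -> list[tuple[str, float]]:
    """Return (caption, score) pairs sorted from most to least relevant."""
    inputs = clip_processor(text=captions, images=image,
                            return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = clip_model(**inputs)
    # logits_per_image has shape (1, num_captions): image-text similarity.
    scores = outputs.logits_per_image.softmax(dim=-1).squeeze(0)
    return sorted(zip(captions, scores.tolist()),
                  key=lambda pair: pair[1], reverse=True)
```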
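The top-ranked caption then seeds Llama 3.1 for the creative-output stage. The sketch below assumes the gated `meta-llama/Llama-3.1-8B-Instruct` checkpoint and an illustrative social-media prompt; the model size and prompting used by INSPIR are not specified in the text above.

```python
# Turning the top-ranked caption into a social media caption with Llama 3.1.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # assumption: requires gated access
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto")

best_caption = "a dog running on a beach at sunset"  # e.g. top CLIP-ranked caption

messages = [
    {"role": "system", "content": "You write short, catchy social media captions."},
    {"role": "user", "content": f"Write an Instagram caption for this image: {best_caption}"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output = model.generate(input_ids, max_new_tokens=80,
                        do_sample=True, temperature=0.8)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```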
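For the retrieval side, one straightforward realization is to embed uploaded images with CLIP and match text queries by cosine similarity; the annotation-based search described above could equally match queries against the stored captions. Helper names below are illustrative.

```python
# Text-based image search over CLIP embeddings (illustrative helpers).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(paths: list[str]) -> torch.Tensor:
    """Encode uploaded images into unit-normalized CLIP embeddings."""
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def search(query: str, paths: list[str], index: torch.Tensor, k: int = 5):
    """Return the k images whose embeddings best match the text query."""
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        q = model.get_text_features(**inputs)
    q = q / q.norm(dim=-1, keepdim=True)
    sims = (q @ index.T).squeeze(0)  # cosine similarity per image
    top = sims.topk(min(k, len(paths)))
    return [(paths[i], sims[i].item()) for i in top.indices.tolist()]

# Usage:
# paths = ["photos/a.jpg", "photos/b.jpg"]
# index = embed_images(paths)
# print(search("a dog on the beach", paths, index))
```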
About
A repository where machine learning concepts are implemented from scratch.