Composed image retrieval aims to retrieve a target image that satisfies a user query consisting of a reference image and a modified text. Although existing studies introduce novel multi-modal feature fusion techniques at the global or local level, they overlook the cross-level semantic correspondence between the reference image and the modified text, the recombination of their original and target features, and the problem of modal inconsistency. To address these issues, we propose a CLIP-based Cross-Level Semantic Interaction and Recombination Network (SeIR). Specifically, we first resort to the image and text encoders of the pre-trained CLIP model to narrow the modal gap between image and text and to extract both global and local features. We then introduce a cross-modal attention mechanism that screens out the original and target features by exploring the semantic correlation between cross-level image and text features. Subsequently, to alleviate the modal difference between the generated composed query representation and the target image, we apply an affine transformation to recombine the original and target features from the image and text. Extensive experiments on the FashionIQ and CIRR benchmark datasets demonstrate the competitive performance of the proposed SeIR against state-of-the-art methods.
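The abstract names three stages: CLIP-based extraction of global and local features, cross-modal attention that separates original from target semantics, and an affine recombination of the composed query toward the target image space. The sketch below is only an illustration of how those stages could be wired together in PyTorch; the module name `CrossLevelInteraction`, the feature dimensions, the attention and pooling choices, and the scale/shift predictors are all assumptions and not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossLevelInteraction(nn.Module):
    """Hypothetical sketch of the cross-modal attention and affine
    recombination steps described in the abstract. Dimensions, layer
    choices, and names are assumptions, not the paper's actual code."""

    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        # Cross-modal attention: image patches attend to text tokens and
        # text tokens attend to image patches, to screen out "target"
        # (text-driven) and "original" (image-preserved) semantics.
        self.img_to_txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.txt_to_img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Affine recombination: predict a scale and shift that map the
        # composed query representation toward the target image space.
        self.to_scale = nn.Linear(2 * dim, dim)
        self.to_shift = nn.Linear(2 * dim, dim)

    def forward(self, img_global, img_local, txt_global, txt_local):
        # img_global, txt_global: (B, D) CLIP CLS/EOT features.
        # img_local: (B, P, D) patch features; txt_local: (B, T, D) token features.
        target_sem, _ = self.img_to_txt(img_local, txt_local, txt_local)
        original_sem, _ = self.txt_to_img(txt_local, img_local, img_local)
        target_sem = target_sem.mean(dim=1)      # pool attended text-driven cues
        original_sem = original_sem.mean(dim=1)  # pool attended image-preserved cues

        fused = torch.cat([img_global + original_sem, txt_global + target_sem], dim=-1)
        scale = self.to_scale(fused)
        shift = self.to_shift(fused)
        # Affine transformation of the reference image feature.
        composed = scale * img_global + shift
        return F.normalize(composed, dim=-1)


if __name__ == "__main__":
    B, P, T, D = 2, 49, 16, 512
    model = CrossLevelInteraction(dim=D)
    query = model(torch.randn(B, D), torch.randn(B, P, D),
                  torch.randn(B, D), torch.randn(B, T, D))
    print(query.shape)  # torch.Size([2, 512])
```

In practice the composed query would be compared against CLIP-encoded candidate images with a cosine-similarity retrieval loss; that training objective is not specified in the abstract and is omitted here.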