
CN117909555A - Multi-modal information retrieval method, apparatus, device, readable storage medium and computer program product - Google Patents

Multi-modal information retrieval method, apparatus, device, readable storage medium and computer program product

Info

Publication number
CN117909555A
Authority
CN
China
Prior art keywords
modal
mode
training
sample
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410169465.3A
Other languages
Chinese (zh)
Inventor
王士进
李亚
杨磊
刘权
刘聪
胡国平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN202410169465.3A priority Critical patent/CN117909555A/en
Publication of CN117909555A publication Critical patent/CN117909555A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/907 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/088 Non-supervised learning, e.g. competitive learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a multi-modal information retrieval method, apparatus, device, readable storage medium and computer program product. By leveraging the capability of a multi-modal large model, the method can accurately understand the complex, detailed, rich and varied retrieval intentions of a user and satisfy those intentions in real scenarios. The multi-modal large model is trained to map information of different modalities into the same semantic vector space. In a dual-tower model structure, a first multi-modal large model performs feature encoding on a user request to obtain a query vector representation, and a second multi-modal large model performs feature encoding on each piece of candidate information to obtain its vector representation; the candidate information whose vector representation matches the query vector representation is selected as the retrieval result. Cross-modal information retrieval is thereby realized, and the retrieval results are more accurate than those of the prior art.

Description

Multi-modal information retrieval method, apparatus, device, readable storage medium and computer program product
Technical Field
The present application relates to the field of information retrieval technology, and more particularly, to a multi-modal information retrieval method, apparatus, device, readable storage medium, and computer program product.
Background
With the rapid growth of the information age, the amount of data that users produce and need to process has experienced explosive growth, which has created new challenges and demands for retrieval technology.
Retrieval technology at the current stage mainly performs text retrieval; common text retrieval technologies include statistics-based methods and deep-learning-based methods. Statistics-based methods evaluate the relevance between candidate documents and a user query by computing the frequency and importance of the user's query terms in each candidate document, and then select the candidate documents with high relevance as query results. Such methods cannot truly understand the user's query intent: for example, two query requests with different intents but the same or similar keywords may return identical results, i.e., statistics-based methods tend to produce insufficiently accurate query results.
Deep-learning-based methods use a deep neural network for text retrieval. They are generally trained on labeled scene-specific data for a particular scenario, and the data must be re-labeled and the model re-trained when migrating to other scenarios, so scene generalization is insufficient and the cost of scene migration is high. In addition, limited by model capability, existing deep-learning-based methods lack code understanding, multilingual understanding and the ability to understand complex user retrieval intentions. A retrieval intention is not limited to finding an answer; it may involve finding similar questions, such as "help me find ancient poems similar in meaning to 'I am born with talents that must be of use'", fuzzy-intention retrieval, such as "help me find a piece of prose describing autumn fallen leaves, with a melancholy style, expressing frustration", code retrieval, such as "help me find an optimal implementation of bubble sort", or cross-language retrieval, such as "help me find the French original of this passage". Such retrieval intentions require the model to have external knowledge and a certain reasoning ability, which existing deep neural network models lack; they cannot align with user intentions, and this cannot be solved simply by adding training data. Therefore, existing deep-neural-network-based methods cannot satisfy the complex retrieval intentions of users in complex retrieval scenarios.
Furthermore, users also require cross-modal retrieval. Cross-modal retrieval technology refers to technology that can process and understand different types of data (such as text, images and audio) and retrieve across these heterogeneous data, for example retrieving images by text or retrieving audio by text. In existing schemes, text labels are manually constructed for information in picture or audio/video modalities, such as a descriptive text label for a picture or for audio/video data; a single-modality text retrieval technique then retrieves the text labels matching the user request, and the picture or audio/video data corresponding to those labels is returned. This approach is limited by the accuracy of the text labels and ignores the multi-modal interaction between the user request and the picture or audio/video information, so loss of key semantic information is unavoidable and retrieval accuracy is low. Moreover, such a scheme only supports retrieval from text to other modalities; cross-modal retrieval cannot be performed for user requests in non-text modalities.
Disclosure of Invention
In view of the above problems, the present application provides a multi-modal information retrieval method, apparatus, device, readable storage medium and computer program product, so as to implement cross-modal information retrieval, satisfy the complex retrieval intention of the user in the real scene, and improve the accuracy of the retrieval result. The specific scheme is as follows:
In a first aspect, a multi-modal information retrieval method is provided, including:
Performing feature coding on a user request by using a first multi-mode large model to obtain a query vector representation of the user request, wherein the user request is request information of a text mode, an image mode and/or an audio mode;
Retrieving a target vector representation matched with the query vector representation in a configured vector database, and determining candidate information corresponding to the target vector representation as a retrieval result; the vector database stores vector representations after feature encoding is carried out on each piece of candidate information through a second multi-mode large model, and each piece of candidate information covers a text mode, an image mode and/or an audio mode;
The first multi-modal large model and the second multi-modal large model are each trained to have the ability to map information of different modalities to the same semantic vector space.
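For illustration only, the dual-tower flow of the first aspect can be sketched as follows in Python; the callables encode_query and encode_candidate stand in for the first and second multi-modal large models, and their names are assumptions rather than part of this disclosure.

```python
import numpy as np

def build_vector_database(candidates, encode_candidate):
    """Offline step: feature-encode each piece of candidate information (text,
    image and/or audio) with the second multi-modal large model and keep the
    candidate <-> vector correspondence."""
    vectors = np.stack([np.asarray(encode_candidate(c)) for c in candidates])
    return vectors, list(candidates)

def encode_user_request(user_request, encode_query):
    """Online step: feature-encode the multi-modal user request with the first
    multi-modal large model to obtain the query vector representation."""
    return np.asarray(encode_query(user_request))
```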
In another implementation manner of the first aspect of the embodiment of the present application, the first multi-modal large model and the second multi-modal large model have the same structure, and the model structure includes an image feature extraction module, an audio feature extraction module and a large language model;
The process of processing input information by the first multi-mode large model or the second multi-mode large model comprises the following steps:
For the input information of the text modality, inputting it into the large language model to perform feature coding;
For the input information of the image mode, extracting vector representation of the image information by the image feature extraction module, and sending the vector representation of the image information into the large language model for feature coding;
And extracting vector representation of the audio information by the audio feature extraction module for the input audio mode information, and sending the vector representation of the audio information into the large language model for feature coding.
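A minimal sketch of this per-modality routing is shown below, assuming the input is a dictionary with optional "text", "image" and "audio" fields; the encoder names are placeholders for the modules described above.

```python
def encode_input(sample, image_feature_extractor, audio_feature_extractor, llm_encode):
    """Route each modality through its feature extractor, then let the large
    language model perform the joint feature coding."""
    segments = []
    if sample.get("text") is not None:
        segments.append(("text", sample["text"]))                           # text goes straight to the LLM
    if sample.get("image") is not None:
        segments.append(("img", image_feature_extractor(sample["image"])))  # e.g. a CLIP-style encoder
    if sample.get("audio") is not None:
        segments.append(("spe", audio_feature_extractor(sample["audio"])))  # audio feature network
    return llm_encode(segments)                                             # joint feature coding
```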
In another implementation form of the first aspect of the embodiments of the application, the training process of each of the first multi-modal large model and the second multi-modal large model comprises a first training phase, the first training phase comprising:
Acquiring unsupervised multi-modal training data, wherein the multi-modal training data comprises text modal training data, image-text alignment modal training data and audio-text alignment modal training data;
And performing unsupervised pre-training on the multi-modal large model by utilizing the multi-modal training data in an autoregressive mode to obtain a pre-trained multi-modal large model.
In another implementation form of the first aspect of the embodiments of the application, the training process of each of the first multi-modal large model and the second multi-modal large model further comprises a second training phase, the second training phase comprising:
Acquiring paired training data consisting of input samples and output samples, wherein the paired training data comprises training data of a single text mode and multiple modes;
and performing supervised fine tuning training on the multi-modal large model obtained in the first training stage by utilizing the paired training data in an autoregressive mode to obtain the multi-modal large model after fine tuning training.
In another implementation manner of the first aspect of the embodiment of the present application, in the paired training data, the training data of the single text modality covers multiple natural language processing (NLP) tasks, and the multi-modal training data covers multiple cross-modal tasks.
In another implementation manner of the first aspect of the embodiment of the present application, the training loss function in the second training phase includes:
A first loss function for constraining loss of the autoregressive training;
And a second loss function for constraining the distance between the hidden layer vector representations, extracted by the multi-modal large model, of paired input and output samples to decrease, and the distance between the hidden layer vector representations of unpaired input and output samples to increase.
In another implementation manner of the first aspect of the embodiment of the present application, the first multi-modal large model and the second multi-modal large model have the same structure and each include a large language model, and the large language model adopts a unidirectionally encoded Transformer structure;
and performing supervised fine tuning training on the multi-modal large model obtained in the first training stage by utilizing the paired training data in an autoregressive mode, wherein each iteration of the training process comprises the following steps:
Determining a mask proportion of the current time according to the current training step number s, wherein the mask proportion and the training step number s are in positive correlation, and the mask proportion does not exceed a preset maximum proportion threshold;
Randomly masking input samples in the paired training data according to the masking proportion of the current time, and ensuring that the last token element of the input samples is not masked to obtain masked input samples, and forming new paired training data with paired output samples;
And performing supervised fine tuning training on the multi-modal large model obtained in the first training stage in an autoregressive mode by utilizing the new paired training data.
In another implementation manner of the first aspect of the embodiment of the present application, the first multi-modal large model and the second multi-modal large model are the same multi-modal large model, and the training process of the same multi-modal large model further includes a third training stage, where the training stage includes:
Acquiring training data covering single-mode and cross-mode retrieval intents, wherein the training data comprises user request samples and candidate information sample sets, each candidate information sample in the set is marked with a matching grade relative to the user request samples, and the matching grade represents the matching degree of the candidate information samples and the user request samples;
respectively carrying out feature coding on the user request sample and each candidate information sample by using the same multi-mode large model to obtain a vector representation of the user request sample and a vector representation of each candidate information sample;
And for any candidate information sample, taking every candidate information sample in the candidate information sample set whose matching degree with the user request sample is lower than that of the candidate information sample as a negative example sample; with the goals of minimizing the distance between the vector representations of the user request sample and the candidate information sample and maximizing the distance between the vector representations of the user request sample and each negative example sample, updating the parameters of the same multi-modal large model until a set training end condition is reached, so as to obtain the trained multi-modal large model.
In another implementation manner of the first aspect of the embodiment of the present application, if the first multi-modal large model and the second multi-modal large model are different multi-modal large models, the training process of the first multi-modal large model and the second multi-modal large model further includes a third training phase, where the training phase includes:
Acquiring training data covering single-mode and cross-mode retrieval intents, wherein the training data comprises user request samples and candidate information sample sets, each candidate information sample in the set is marked with a matching grade relative to the user request samples, and the matching grade represents the matching degree of the candidate information samples and the user request samples;
performing feature coding on the user request sample by using the first multi-mode large model to obtain vector representation of the user request sample; performing feature coding on each candidate information sample by using the second multi-mode large model to obtain vector representation of each candidate information sample;
And for any candidate information sample, taking every candidate information sample in the candidate information sample set whose matching degree with the user request sample is lower than that of the candidate information sample as a negative example sample; with the goals of minimizing the distance between the vector representations of the user request sample and the candidate information sample and maximizing the distance between the vector representations of the user request sample and each negative example sample, updating the parameters of the first multi-modal large model and the second multi-modal large model until a set training end condition is reached, so as to obtain a trained first multi-modal large model and a trained second multi-modal large model.
In a second aspect, there is provided a multi-modal information retrieval apparatus comprising:
the user request coding unit is used for carrying out feature coding on a user request by utilizing the first multi-mode large model to obtain a query vector representation of the user request, wherein the user request is request information of a text mode, an image mode and/or an audio mode;
A vector retrieval unit for retrieving a target vector representation matching the query vector representation in a configured vector database, and determining candidate information corresponding to the target vector representation as a retrieval result; the vector database stores vector representations after feature encoding is carried out on each piece of candidate information through a second multi-mode large model, and each piece of candidate information covers a text mode, an image mode and/or an audio mode;
The first multi-modal large model and the second multi-modal large model are each trained to have the ability to map information of different modalities to the same semantic vector space.
In a third aspect, there is provided a multi-modal information retrieval apparatus comprising: a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the multi-mode information retrieval method described in any one of the foregoing first aspects of the present application.
In a fourth aspect, a readable storage medium is provided, on which a computer program is stored, which, when being executed by a processor, implements the multimodal information retrieval method described in any of the preceding first aspects of the application.
In a fifth aspect, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the multimodal information retrieval method described in any of the preceding first aspects of the application.
By means of the above technical scheme, the multi-modal information retrieval method disclosed by the application makes use of the capability of the multi-modal large model: the multi-modal large model can process information of multiple modalities and possesses rich language knowledge, common-sense reasoning and multi-modal information understanding capability, so its great advantages in common-sense reasoning and user-intention alignment are fully exploited to accurately understand the complex, detailed, rich and varied retrieval intentions of the user and to satisfy those intentions in real scenarios. In addition, to adapt to the information retrieval task, the generation capability of the multi-modal large model is converted into an information compression capability, i.e., the multi-modal large model is trained to map information of different modalities to the same semantic vector space. On this basis, multi-modal information retrieval can be realized with a dual-tower model structure: a first multi-modal large model performs feature encoding on the user request to obtain the query vector representation, and a second multi-modal large model performs feature encoding on each piece of candidate information to obtain its vector representation, forming a vector database; a target vector representation matching the query vector representation is then searched for in the vector database, and the candidate information corresponding to the target vector representation is determined as the retrieval result, thereby realizing cross-modal information retrieval. Because the multi-modal large model can process multi-modal information and compress it into a vector representation, the final retrieval result is obtained through similarity matching between vectors; no semantic information is lost in this process, so the retrieval results are more accurate than those of the prior art.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 illustrates a schematic diagram of a dual-tower model structure to which a multi-modal information retrieval scheme is applied;
FIG. 2 illustrates a flow diagram of a multi-modal information retrieval method;
FIG. 3 illustrates a schematic diagram of a cross-modal autoregressive pre-training process;
FIG. 4 illustrates a schematic diagram of a cross-modal autoregressive pre-training process with masked inputs;
FIG. 5 illustrates a vector matrix diagram for ranking candidates under a multi-level contrast learning strategy;
FIG. 6 illustrates a multi-stage training flow diagram of a multi-modal large model;
FIG. 7 illustrates a schematic diagram of a multi-modal information retrieval apparatus;
FIG. 8 illustrates a schematic structural diagram of a multi-modal information retrieval device.
Detailed Description
Before introducing the inventive solution, some basic concepts and knowledge will be explained first:
prompt: an instruction is indicated. When interacting with an AI (such as an artificial intelligence model), the instruction to be sent to the AI can be a text description, such as "please help me recommend a popular music" input when you interact with the AI, or a parameter description according to a certain format, such as making the AI draw according to a certain format, and describing related drawing parameters.
Artificial intelligence model: the model is an artificial intelligent model based on deep learning technology, which consists of hundreds of millions of parameters, and can realize complex tasks such as natural language processing, image recognition, voice recognition and the like through learning and training of a large amount of data. The artificial intelligence model may include a large language model, a large scale pre-training model.
Large language model: (Large language model, LLM) generally refers to a language model with a large number of parameters and capabilities that learns statistical rules and semantic relationships of a language by pre-training on large-scale text data. These models typically use an unsupervised learning method to predict the next word or fill in missing words to capture the context and semantic information of the language. The large language model is capable of generating coherent sentences, answering questions, completing translation tasks, and the like. LLMs are characterized by a large scale, containing billions or more of parameters, which help them learn complex patterns in linguistic data. The emerging capabilities of large language models include context learning, instruction following and progressive reasoning capabilities, etc., with ChatGPT being released, LLM-related research and applications gradually exploded, such as Google's PaLM model, meta's LLaMA model, etc.
The multi-mode large model is based on a large language model, further integrates multi-mode capability, and can process information of multiple modes, such as images, texts, audios and the like.
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
As introduced above for the existing cross-modal retrieval schemes, retrieving information by text alone ignores the multi-modal interaction between the user request and the candidate information (such as candidate pictures or candidate audio/video), which greatly limits the retrieval effect. The present application therefore studies a cross-modal vector retrieval technique that maps data of different modalities into a common semantic space, realizing semantic alignment and interaction between modalities. On this basis, a user can perform cross-modal information retrieval, for example retrieving related text descriptions with a picture, or finding the corresponding images or audio/video content with a text description, which greatly expands the ways people interact when consuming and producing multimedia content.
In addition, to satisfy the rich multi-modal retrieval requirements of users in real scenarios, the user's retrieval intention must be understood accurately, which a conventional deep neural network model cannot achieve. With the rise of large models (large language models and multi-modal large models), which are generative language models with rich language knowledge and strong common-sense understanding, reasoning and generation capabilities, more and more users apply large models to generation tasks. However, because the generation task and the retrieval task differ greatly in form, the prior art offers no solution for applying a large model to the retrieval task.
This application provides a brand-new multi-modal information retrieval scheme that creatively introduces a multi-modal large model into the multi-modal information retrieval process. Relying on the capability of the multi-modal large model, it understands complex retrieval intentions in the user's real scenarios; at the same time, it uses the multi-modal large model to encode and represent inputs of different modalities, unifying them into the same semantic space and realizing alignment and interaction of information across modalities. It fully exploits the great advantages of the multi-modal large model in common-sense reasoning and user-intention alignment, accurately understands the user's complex and detailed retrieval instructions, covers rich and varied retrieval intentions, and satisfies the user's cross-modal information retrieval requirements with high quality.
The method provided by the application can be divided into a training phase and an inference phase: the training phase trains the multi-modal large model, and the inference phase performs multi-modal information retrieval with the trained multi-modal large model. The training phase and the inference phase may be deployed in the same device or in different devices. For example, the training phase may be deployed in a cloud or server, and the inference phase may be deployed in an intelligent terminal such as a mobile phone, tablet, smart car, robot or wearable device.
For ease of understanding, the present application describes the flow of the training phase and the reasoning phase, respectively.
Inference phase
In combination with FIG. 1, the multi-modal information retrieval scheme of the present application may adopt a dual-tower model structure as a whole. The dual-tower structure comprises two models, used respectively to feature-encode the multi-modal user request and the multi-modal candidate information, obtaining the query vector representation of the user request and the vector representation of each piece of candidate information. The vector similarity between the query vector representation and the vector representation of each piece of candidate information is computed, the target vector representation with the highest similarity to the query vector representation is selected, and the candidate information corresponding to the target vector representation is taken as the retrieval result.
In this embodiment, two model structures are defined as a first multi-modal large model and a second multi-modal large model, respectively, that is, the two model structures both adopt multi-modal large models.
It should be noted that, the two models in the dual-tower model structure may or may not share parameters. That is, the first multi-modal large model and the second multi-modal large model may be the same multi-modal large model, or may be two multi-modal large models with the same structure but different network parameters. The double-tower model structure formed by the first multi-mode large model and the second multi-mode large model can be pre-trained, so that the first multi-mode large model and the second multi-mode large model have the capability of mapping information of different modes to the same semantic vector space, and the subsequent vector retrieval is facilitated.
Next, as described in connection with fig. 2, the multi-modal information retrieval method of the present application may include the steps of:
Step S100, performing feature encoding on the multi-modal user request by using the first multi-modal large model to obtain the query vector representation of the user request.
The user request can be request information of any one or more of a text mode, an image mode and an audio mode. Several examples of user requests are illustrated in this embodiment:
1. user requests of a single text modality, such as:
Similar sentence queries, text clustering, retrieving text based on news headlines, knowledge questions and answers, code retrieval, cross-lingual document queries, and the like.
2. Cross-modal user requests, such as:
diagram searching, text searching, audio searching, etc. under different intents.
Taking "search map" as an example, the user request shows, for example: "help me find an image similar to the next image [ img ]", where the user request contains both text information and image information.
The first multi-mode large model and the second multi-mode large model are trained to have the capability of processing input information of different modes and mapping the information of different modes to the same semantic vector space. Therefore, in the step, the first multi-mode large model can be utilized to perform feature coding on the user request, so that the vector representation corresponding to the user request is obtained and used as the query vector representation.
Step S110, searching a target vector representation matched with the query vector representation in a vector database, and determining candidate information corresponding to the target vector representation as a search result.
The vector database stores vector representations after feature encoding of each piece of candidate information through the second multi-mode large model, and each piece of candidate information covers a text mode, an image mode and/or an audio mode.
In a specific retrieval task scenario, candidate information of various modalities can be collected in advance. Each piece of candidate information is then feature-encoded by the second multi-modal large model to obtain its vector representation, which is stored in the vector database together with the correspondence between the candidate information and its vector representation.
After the query vector representation corresponding to the user request is obtained, the target vector representation matching the query vector representation is searched for in the vector database by vector retrieval techniques, for example by calculating the distance (such as the Euclidean distance) between two vector representations: the smaller the distance, the higher the similarity of the two vector representations, and the vector representation with the highest similarity can be selected as the target vector representation matching the query vector representation.
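As an illustration of this matching step, the sketch below performs an exhaustive Euclidean-distance search over the stored vectors; a deployed system might instead use an approximate nearest-neighbour index, and the function names here are assumptions.

```python
import numpy as np

def search(query_vec, candidate_vectors, candidate_infos):
    """Return the candidate whose vector representation is closest to the query
    vector; a smaller Euclidean distance means a higher similarity."""
    dists = np.linalg.norm(candidate_vectors - query_vec, axis=1)
    best = int(np.argmin(dists))            # index of the target vector representation
    return candidate_infos[best]
```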
Each piece of candidate information may be information of a single mode, or may be information containing more than two modes at the same time, for example, the candidate information may be a document of plain text, a document containing text and images at the same time, or may be multimedia content containing images and audio at the same time, and so on.
The multi-modal information retrieval method provided by this embodiment introduces the multi-modal large model into the cross-modal retrieval field for the first time, bringing the emergent intelligent capability of the multi-modal large model to the retrieval field and achieving high-quality multi-modal information understanding and modeling. The overall scheme can adopt a dual-tower model structure: the large model with fused multi-modal capability vectorizes the user request, this vectorized representation is matched against the already-encoded vector representations of the candidate information in the vector database, and the candidate information best fitting the user request is recalled by a relevance scoring function as the retrieval result, which may be information of multiple modalities such as text, audio and images.
This embodiment uses the multi-modal large model to encode inputs of different modalities, unifying and abstracting the information of different modalities and realizing their interaction and alignment. It fully exploits the great advantages of the multi-modal large model in common-sense reasoning and user-intention alignment, accurately understands the user's complex and detailed retrieval instructions, covers rich and varied retrieval intentions, and satisfies the user's cross-modal information retrieval requirements with high quality.
Retrieval intentions that conventional deep neural network models cannot understand include, but are not limited to: finding similar questions, such as "help me find ancient poems similar in meaning to 'I am born with talents that must be of use'"; fuzzy-intention retrieval, such as "help me find a piece of prose describing autumn fallen leaves, with a melancholy style, expressing frustration"; code retrieval, such as "help me find an optimal implementation of bubble sort"; and cross-language retrieval, such as "help me find the French original of this passage".
In some embodiments of the present application, the structure of the first and second multi-modal large models is described.
The model structures of the first and second multi-modal large models may be identical, as shown in connection with fig. 3, including: the system comprises an image feature extraction module, an audio feature extraction module and a large language model.
The image feature extraction module may adopt CLIP (Contrastive Language-Image Pre-training, a multi-modal image+language model), which can extract a vector representation of the input image modality information and feed it into the large language model for feature coding.
The audio feature extraction module may employ a feature extraction network capable of processing audio data that is capable of extracting a vector representation of the input audio modality information and feeding into a large language model for feature encoding.
For the input text modal information, the information can be directly sent to a large language model for feature coding.
The large language model may employ a unidirectionally encoded Transformer structure, or another, bidirectionally encoded network structure. In this embodiment, the large language model is exemplified by a Transformer structure.
The input text modality information can be fed into the large language model in token form and expressed as a token sequence W = [w_1, w_2, …, w_i, …], where w_i denotes the i-th token element. The token sequence of the text modality information can be delimited at the front and back by special tokens such as [text] and [/text], where [text] denotes the beginning symbol of the token sequence of the text modality information and [/text] denotes its ending symbol. Similarly, the vector representations of the image modality information extracted by the image feature extraction module can be delimited by tokens such as [img] and [/img] before being fed into the large language model, and the vector representations of the audio modality information extracted by the audio feature extraction module can be delimited by tokens such as [spe] and [/spe] before being fed into the large language model.
Furthermore, for the multi-mode token sequence of the input large language model, position codes can be added to each token to strengthen the position information of the input token sequence and ensure stronger length extrapolation.
It should be noted that, in the multi-mode large model structure shown in fig. 3, when input information lacks information of a certain mode, the input of the corresponding mode may be set to be empty.
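The delimiter scheme can be pictured with the following sketch; the delimiter tokens match those named above, while the function itself and its list-based inputs are only an assumed illustration.

```python
def build_token_sequence(text_tokens=None, image_vecs=None, audio_vecs=None):
    """Concatenate modality segments with begin/end delimiter tokens; a missing
    modality (None or an empty list) simply contributes nothing."""
    seq = []
    if text_tokens:
        seq += ["[text]", *text_tokens, "[/text]"]
    if image_vecs:
        seq += ["[img]", *image_vecs, "[/img]"]
    if audio_vecs:
        seq += ["[spe]", *audio_vecs, "[/spe]"]
    positions = list(range(len(seq)))        # position codes for the whole multi-modal sequence
    return seq, positions
```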
Training phase
In some embodiments of the present application, the training process of the first and second multi-modal large models is further described.
The training process of each of the first multi-modal large model and the second multi-modal large model can include a first training stage. In the first training stage, a multi-modal autoregressive pre-training strategy can be adopted, i.e., unsupervised pre-training is performed on large-scale unsupervised multi-modal data in an autoregressive manner, endowing the model with understanding, representation and strong generation capability for multi-modal information.
First, unsupervised multimodal training data is obtained, which may include text modality training data, image-text alignment modality training data, audio-text alignment modality training data, and the like.
Wherein, text modality training data such as:
Various articles, news, books, textbooks and other public data available on the network (without restriction on language or format), as well as various course and code data, which endow the model with the ability to process all kinds of text.
Training data for image-text alignment modalities such as:
Rich-text webpage data containing multi-modal information, which contains a large amount of multi-modal alignment information, such as pictures paired with their captions and videos paired with their descriptions and comments.
Training data for audio-text alignment modalities such as:
Audio extracted from video pages together with the corresponding text information.
Further, the multi-modal training data is utilized to perform unsupervised pre-training on the multi-modal large model in an autoregressive mode, and the pre-trained multi-modal large model is obtained.
It should be noted that, in general, the first and second multi-modal large models may have the same structure. In the first-stage pre-training, a single multi-modal large model can be pre-trained with the multi-modal training data, and the resulting model used as both the first and second multi-modal large models; the pre-trained first and second multi-modal large models then have the same structure and the same network parameters, i.e., they are the same multi-modal large model.
If the first and second multi-modal large models have different structures, for example different image feature extraction modules, audio feature extraction modules or large language models, then in the first-stage pre-training the obtained multi-modal training data can be used to pre-train the two differently structured multi-modal large models without supervision, obtaining a pre-trained first multi-modal large model and a pre-trained second multi-modal large model.
Through the pre-training in the first stage, the multi-mode large model can have the understanding, characterization and strong generation capacity of multi-mode information, and can map information of different modes to the same semantic vector space.
Further, on the basis of the first training stage, a second training stage can be added in this embodiment. Through the first-stage pre-training, the multi-modal large model acquires basic multi-modal understanding and common-sense understanding capability as well as a strong generation capability. To further stimulate the knowledge representation capability the model has learned on the large-scale pre-training data, and to compress information into a fixed-size vector representation suited to the retrieval task, this embodiment designs a second training stage.
The second training stage emphasizes the correspondence between input and output, further strengthening the model's learning of the input-to-output mapping; the training data is therefore more targeted than in the first training stage.
First, training data pairs consisting of input samples-output samples are acquired, the training data pairs including single text modality and cross-modality training data.
For training data of a single text modality, examples are:
the integrated tasks in the field of processing of various NLP natural languages are collected, including but not limited to: text classification, clustering, intent recognition, named entity recognition NER, text generation, dialog, etc. And acquiring the training data of each NLP task as training data of a single text mode.
Preferably, the training data of the text modality covers multiple NLP tasks, so that the model gains varied abilities to solve text problems and gradually learns efficient compression of text information.
For cross-modal training data, examples are:
Image captioning, OCR recognition, semantic segmentation, visual question answering, and speech-related tasks such as speech recognition and emotion detection.
Preferably, the cross-modal training data covers as many kinds of cross-modal tasks as possible, so that the model can learn semantic alignment information of different granularities between modalities; for example, semantic segmentation is biased toward local information, while OCR recognition explicitly associates the text information contained in a picture. Training the multi-modal large model with the training data of these multi-modal tasks in the second training stage strengthens the model's capture and understanding of multi-modal information at different granularities and aligns that information to text.
Further, the paired training data collected in the second stage are utilized to conduct supervised fine tuning training on the multi-modal large model obtained in the first training stage in an autoregressive mode, and the multi-modal large model after fine tuning training is obtained.
When the multi-modal large model is fine-tuned in an autoregressive manner, an input sample and its output sample can be converted into a single input sequence for autoregressive training. For example, if the original input-output pair comes from a multiple-choice style question, with the input sample "Is the xxx movie good to watch?" and the output sample "It is good to watch.", the converted autoregressive training sequence can be expressed as: "The xxx movie is good to watch, which is true." For another example, if the original input-output pair comes from a cloze question, with the input sample being the stem "x1x2x2 ( ) x3x4" and the output sample being the text content y that should fill the brackets, the converted autoregressive training sequence can be expressed as: "In the cloze question x1x2x2 ( ) x3x4, the brackets should be filled with y."
After the input sample-output sample is converted into the input sequence of the autoregressive training, the multimodal big model obtained in the previous training stage can be subjected to fine tuning training in an autoregressive mode, and the multimodal big model after the fine tuning training is obtained.
When the first and second multi-modal large models are the same model, the second training stage continues training the single multi-modal large model obtained in the first training stage, and the trained multi-modal large model serves as both the first and second multi-modal large models.
When the first multi-mode large model and the second multi-mode large model are models with different structures, respectively obtaining the first multi-mode large model and the second multi-mode large model after the pre-training through a first training stage. And in the second training stage, the paired training data are adopted to perform fine-tuning training on the first multi-mode large model and the second multi-mode large model respectively, so that the first multi-mode large model and the second multi-mode large model after fine-tuning training are obtained respectively.
In the embodiment of the present application, in order to further adapt to the retrieval task and strengthen the alignment between input and output in the semantic space, a combination of training loss functions is designed for the second training stage. The training loss functions of the second training stage may include:
A first loss function L_lm, used to constrain the loss of the autoregressive training.
And a second loss function L_dis, used to constrain the distance between the hidden-layer vector representations, extracted by the multi-modal large model, of paired input and output samples to decrease, and the distance between the hidden-layer vector representations of unpaired input and output samples to increase.
The expression of the specific loss function can be as follows:
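One plausible form, consistent with the description and assuming a margin-based triplet formulation with margin hyper-parameter m (the exact formulation may differ), is:

$$L_{dis} = \max\bigl(0,\; \mathrm{dis}(h_i, h_{o+}) - \mathrm{dis}(h_i, h_{o-}) + m\bigr)$$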
Where the dis() function computes the distance between two elements, for which the Euclidean distance can be used. h_i denotes the hidden-layer vector representation of the input sample extracted by the last hidden layer of the multi-modal large model, h_{o+} denotes the hidden-layer vector representation, extracted by the last hidden layer of the multi-modal large model, of the output sample paired with the input sample, and h_{o-} denotes the hidden-layer vector representation, extracted by the last hidden layer of the multi-modal large model, of another output sample not paired with the input sample.
Through the constraint of this loss function, the multi-modal large model strengthens the mapping relationship between input and output: the distance between paired input and output samples in the semantic space is shortened, while the distance between unpaired input and output samples is widened, making the model further suited to subsequent retrieval tasks.
When the large language model adopts a unidirectionally encoded Transformer structure, the hidden-layer vector representation of an input sample extracted by the last hidden layer of the multi-modal large model is the hidden-layer vector representation at the last character of the input sample; similarly, the hidden-layer vector representation of an output sample extracted by the last hidden layer of the multi-modal large model is the hidden-layer vector representation at the last character of the output sample.
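Extracting this last-character representation can be sketched as follows, assuming a padded batch with an attention mask; the function name and tensor layout are illustrative assumptions.

```python
import torch

def last_token_representation(hidden_states, attention_mask):
    """hidden_states: (batch, seq_len, dim) from the last hidden layer;
    attention_mask: (batch, seq_len) with 1 for real tokens and 0 for padding."""
    last_idx = attention_mask.sum(dim=1) - 1                     # position of each sample's last token
    return hidden_states[torch.arange(hidden_states.size(0)), last_idx]
```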
In the embodiment of the application, in order to enhance the information compression capability of a large language model with a unidirectional coding structure and compress more information into the vector representation of the last character, a dynamic mask strategy is designed: by continuously and progressively increasing the mask proportion, the model's dependence on earlier input is reduced, and the model is gradually guided to rely more on the feature representation of the last character during generation, without damaging the original generation capability of the model.
As shown in FIG. 4, some tokens of the input token sequence of the large language model may be randomly selected and masked.
The training process of the second training stage may iterate over a number of training steps; in each training step:
S11, determining a mask proportion of the current time according to the current training step number S, wherein the mask proportion and the training step number S are in positive correlation, and the mask proportion does not exceed a preset maximum proportion threshold.
The mask ratio M(s) is determined as follows:
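A form consistent with the description below, i.e. positively correlated with the step number s and capped by the maximum proportion threshold (assumed here to be a simple linear ramp; the exact form may differ), is:

$$M(s) = \min\bigl(\lambda \cdot s,\; \mathrm{max\_prob}\bigr)$$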
where s is the current number of training steps and λ is a preset hyper-parameter; changing λ controls how quickly the generation difficulty increases. Meanwhile, in order to keep the task learnable, a maximum proportion threshold max_prob is set to bound the mask proportion.
S12, randomly masking the input samples in the paired training data according to the mask proportion of the current time determined in the previous step, and ensuring that the last token element of the input samples is not masked, so as to obtain masked input samples, and forming new paired training data with paired output samples.
S13, performing supervised fine tuning training on the multi-modal large model obtained in the first training stage in an autoregressive mode by utilizing the new paired training data obtained in the last step.
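Steps S11 and S12 can be sketched as follows; lam, max_prob and mask_token are assumed parameter names used only for illustration.

```python
import random

def dynamic_mask(input_tokens, step, lam, max_prob, mask_token="[MASK]"):
    """S11: the mask ratio grows with the training step, capped at max_prob.
    S12: randomly mask input tokens, never masking the last token element."""
    ratio = min(lam * step, max_prob)
    return [
        mask_token if (i < len(input_tokens) - 1 and random.random() < ratio) else tok
        for i, tok in enumerate(input_tokens)
    ]
```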
According to the training strategy of the second training stage provided by the embodiment, the knowledge representation capability of the model, which is obtained on the large-scale pre-training data of the first training stage, is further stimulated progressively by dynamically adjusting the mask strategy and introducing the second loss function, and meanwhile, information is compressed into vector representation with a fixed size so as to adapt to a subsequent retrieval task.
Through the training of the first training stage and/or the second training stage, the multi-modal large model has processing and compression capabilities for multi-modal information, and also possesses the various emergent intelligent capabilities of general large models, including but not limited to: common-sense knowledge, reasoning, code reading, multilingual understanding, etc. In this embodiment, in order to make the output tendency of the multi-modal large model on the retrieval task consistent with user requirements, a third training stage may be further added on the basis of the first/second training stages.
In a third training phase, a training strategy is employed that is aligned with the user's intent.
For a double-tower model structure composed of a first multi-mode large model and a second multi-mode large model, two different training flows can exist according to whether the first multi-mode large model and the second multi-mode large model are identical or not, and the two different training flows are respectively described below.
1. The first and second multi-modal large models are the same multi-modal large model (i.e., the two multi-modal large models are identical in structure and share network parameters).
Two implementations of the third training phase are provided in this embodiment:
First, a contrastive learning strategy may be used for the third training stage, which specifically includes:
S21a, training data covering single-mode and cross-mode retrieval intents is obtained.
The training data comprises a user request sample, a positive example candidate information sample matching the user request sample, and other negative example candidate information samples not matching the user request sample. The positive example candidate information sample matching the user request sample can be regarded as the candidate information requested by the user request sample. Taking a text-modality sample as an example, suppose the user request sample is "help me generate a piece of prose describing autumn, expressing loneliness and sadness", and the candidate information set contains multiple pieces of prose; the piece that best matches the user request sample can be selected as the positive example candidate information sample, and the remaining pieces are marked as negative example candidate information samples.
It should be noted that, the training data (including the user request sample and the candidate information sample) obtained in this step may cover the single-mode and cross-mode retrieval intentions, that is, the user request sample may include single-mode and cross-mode data, and the corresponding candidate information sample may also include single-mode and cross-mode data.
S22a, the same multi-mode large model is utilized to respectively perform feature coding on the user request sample and each candidate information sample, and vector representation of the user request sample and vector representation of each candidate information sample are obtained.
Specifically, the user request sample and each candidate information sample are respectively sent into the same multi-mode large model, and the vector representation output by the last hidden layer of the multi-mode large model is taken as the vector representation of the user request sample and the vector representation of each candidate information sample.
S23a, for each user request sample, a positive example sample pair is formed with its positive example candidate information sample, and negative example sample pairs are formed with its negative example candidate information samples. The multi-modal large model is trained with a contrastive learning strategy; during training, the objective is to minimize the distance between the vector representations of each positive example pair and maximize the distance between the vector representations of each negative example pair, until the set training end condition is reached, yielding the trained multi-modal large model.
The contrastive learning loss function can be expressed as:

L = −log( exp(sim(q, c⁺)) / ( exp(sim(q, c⁺)) + Σ_{c⁻} exp(sim(q, c⁻)) ) )

where sim(·,·) computes the similarity of two vector representations (the smaller the distance between the two vector representations, the higher the corresponding similarity), q denotes the vector representation of the user request sample, c⁺ denotes the vector representation of the positive example candidate information sample of q, and c⁻ denotes the vector representation of a negative example candidate information sample of q.
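For reference, a minimal PyTorch sketch of such a contrastive loss is given below; the cosine similarity measure and the temperature hyperparameter are illustrative assumptions rather than details taken from this disclosure.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q, c_pos, c_negs, temperature=0.05):
    """q: (d,) request vector; c_pos: (d,) positive candidate; c_negs: (n, d) negatives.
    The temperature value is an illustrative choice, not specified in the text."""
    sim_pos = F.cosine_similarity(q, c_pos, dim=0) / temperature                 # scalar
    sim_neg = F.cosine_similarity(q.unsqueeze(0), c_negs, dim=1) / temperature   # (n,)
    logits = torch.cat([sim_pos.unsqueeze(0), sim_neg])                          # positive at index 0
    target = torch.zeros(1, dtype=torch.long)
    return F.cross_entropy(logits.unsqueeze(0), target)
```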
In this embodiment, training data conforming to the various retrieval intents of real application scenarios is constructed. Because the positive and negative example candidate information samples in the training data are labeled according to user preference, the cross-modal contrastive learning strategy makes the output tendency of the multi-modal large model consistent with user requirements, so that when the multi-modal large model is subsequently applied to retrieval tasks, the retrieval results it produces also meet user requirements.
Second, the above contrastive learning strategy is coarse-grained: in a single learning pass the model is only forced to distinguish positive examples from negative examples, so the contrastive information is weak. In addition, in some real scenarios there may be more than one candidate information sample that matches a user request sample, the difference being only the degree to which each candidate matches the request. Taking the text-modality example above, the user request sample is "help me generate a prose passage describing autumn that expresses loneliness and sadness", and the candidate information set contains multiple prose passages, 10 of which conform to the user request and differ only in wording. The user can label the matching grade of these 10 passages relative to the user request sample according to his or her own preference; a higher matching grade represents a higher degree of matching with the user request sample. The number of matching grades may be more than 3; the more matching grades there are, the more contrastive information the model can learn in one contrastive learning pass, and the easier it is for the model to learn the user's retrieval preference for different candidates.
It should be noted that, for some user request samples, the corresponding candidate information samples may only be divided into 2 grades, i.e., matching and not matching the user request sample. For example, the user request sample is: "Which city is the provincial capital of Hebei Province?" Only the candidate information sample "Shijiazhuang" in the candidate information sample set matches the user request, and the remaining candidate information does not match the user request.
Referring to fig. 5, for a user request sample q, the candidate information samples in the candidate information sample set may be divided into a set number of grades according to their degree of matching with q. As illustrated in fig. 5, these may include: grade A candidates, grade B candidates, grade C candidates, ..., and other-grade candidates, with each grade decreasing in matching degree in turn.
Based on such labeling of the training data, a multi-order contrastive learning strategy is introduced in this embodiment, so that the model can compare candidate information samples of different strengths within one learning pass, making it easier for the model to learn the user's retrieval preference for different candidates and achieving higher-quality intent alignment. Assume that the candidate information samples are divided into three grades A, B and C, decreasing in turn. The information that can be referenced simultaneously when computing the loss function in the contrastive learning process then includes: A > B, A > C, B > C, where ">" means that the former matches the user request sample better than the latter. The model can thus see multiple candidates of different grades at once and, by contrastive learning to distinguish their differences, learns an evaluation capability.
The third training stage using the multi-order contrastive learning strategy may include the following steps:
S21b, acquiring training data covering single-mode and cross-mode retrieval intents.
The training data comprises user request samples and candidate information sample sets, wherein each candidate information sample in the set is marked with a matching grade relative to the user request samples, and the matching grade represents the matching degree of the candidate information samples and the user request samples.
S22b, respectively carrying out feature coding on the user request sample and each candidate information sample by utilizing a multi-mode large model to obtain a vector representation of the user request sample and a vector representation of each candidate information sample.
S23b, for any candidate information sample, each candidate information sample in the candidate information sample set whose degree of matching with the user request sample is lower than that of this candidate information sample is taken as a negative example sample. With the objectives of minimizing the distance between the vector representations of the user request sample and this candidate information sample and maximizing the distance between the vector representations of the user request sample and each negative example sample, the parameters of the same multi-modal large model are updated until the set training end condition is reached, so as to obtain the trained multi-modal large model.
The multi-order contrastive learning loss function can be expressed as:

L = Σ_{order(c_i) > order(c_j)} −log( exp(sim(q, c_i)) / ( exp(sim(q, c_i)) + exp(sim(q, c_j)) ) )

where sim(·,·) computes the similarity of two vector representations (the smaller the distance between the two vector representations, the higher the corresponding similarity), q denotes the vector representation of the user request sample, c_i and c_j denote the vector representations of the i-th and j-th candidate information samples respectively, and order(c_i) > order(c_j) indicates that, relative to the user request sample q, the matching grade of the i-th candidate information sample is higher than that of the j-th candidate information sample.
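For reference, a minimal PyTorch sketch of such a multi-order contrastive loss is given below; it enumerates all candidate pairs whose matching grades differ, and the cosine similarity measure and temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def multi_order_loss(q, candidates, grades, temperature=0.05):
    """q: (d,) request vector; candidates: (n, d); grades: length-n list of integer
    matching grades (higher = better match). For every pair (i, j) with
    grades[i] > grades[j], candidate i is treated as the positive and candidate j as
    the negative; -log(sigmoid(s_i - s_j)) equals the two-term softmax form above."""
    sims = F.cosine_similarity(q.unsqueeze(0), candidates, dim=1) / temperature  # (n,)
    loss = q.new_zeros(())
    pairs = 0
    for i in range(len(grades)):
        for j in range(len(grades)):
            if grades[i] > grades[j]:
                loss = loss - F.logsigmoid(sims[i] - sims[j])
                pairs += 1
    return loss / max(pairs, 1)
```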
Compared with the former contrastive learning strategy, the multi-order contrastive learning strategy provided in this embodiment achieves finer-grained contrastive learning: in one contrastive learning pass the model sees candidate information samples of different matching grades at once, and by comparing them it more easily learns the user's retrieval preference for different candidate information samples, achieving higher-quality intent alignment and better fitting subsequent retrieval tasks.
2. The first multi-modal large model and the second multi-modal large model are different multi-modal large models (i.e., the two multi-modal large models do not share network parameters).
Similar to the previous embodiment, two implementations of the third training stage for the first and second multi-modal large models are also provided in this embodiment:
First, in this embodiment, a contrastive learning strategy may be used to perform the third training stage. Specifically, the third training stage includes:
S31a, training data covering single-mode and cross-mode retrieval intents is acquired.
The training data comprises a user request sample, positive case candidate information samples matched with the user request sample and other negative case candidate information samples not matched with the user request sample.
Step S31a is the same as step S21a in the previous embodiment, and is described in detail with reference to the foregoing.
S32a, performing feature coding on a user request sample by using a first multi-mode large model to obtain a vector representation of the user request sample; and carrying out feature coding on each candidate information sample by using the second multi-mode large model to obtain vector representation of each candidate information sample.
Specifically, a user request sample is sent into a first multi-mode large model, and a vector representation of the last hidden layer output of the first multi-mode large model is taken as a vector representation of the user request sample. And sending each candidate information sample into the second multi-mode large model, and taking the vector representation output by the last hidden layer of the second multi-mode large model as the vector representation of each candidate information sample.
S33a, for each user request sample, a positive example sample pair is formed with its positive example candidate information sample, and negative example sample pairs are formed with its negative example candidate information samples. The first multi-modal large model and the second multi-modal large model are then trained with a contrastive learning strategy; during training, the objectives are to minimize the distance between the vector representations of each positive example sample pair and to maximize the distance between the vector representations of each negative example sample pair, until the set training end condition is reached, so as to obtain the trained first multi-modal large model and second multi-modal large model.
The contrastive learning loss function can be expressed as:

L = −log( exp(sim(q, c⁺)) / ( exp(sim(q, c⁺)) + Σ_{c⁻} exp(sim(q, c⁻)) ) )

where sim(·,·) computes the similarity of two vector representations (the smaller the distance between the two vector representations, the higher the corresponding similarity), q denotes the vector representation of the user request sample, c⁺ denotes the vector representation of the positive example candidate information sample of q, and c⁻ denotes the vector representation of a negative example candidate information sample of q.
In this embodiment, training data conforming to the various retrieval intents of real application scenarios is constructed. Because the positive and negative example candidate information samples in the training data are labeled according to user preference, the cross-modal contrastive learning strategy makes the output tendencies of the first and second multi-modal large models consistent with user requirements, so that when they are subsequently applied to retrieval tasks, the retrieval results also meet user requirements.
Second, a multi-order contrastive learning strategy is introduced in this embodiment, so that the models can compare candidate information samples of different strengths within one learning pass, making it easier to learn the user's retrieval preference for different candidates and achieving high-quality intent alignment.
The third training stage using the multi-order contrastive learning strategy may include the following steps:
S31b, acquiring training data covering single-mode and cross-mode retrieval intents.
The training data comprises user request samples and candidate information sample sets, wherein each candidate information sample in the set is marked with a matching grade relative to the user request samples, and the matching grade represents the matching degree of the candidate information samples and the user request samples.
S32b, performing feature coding on the user request sample by using the first multi-mode large model to obtain a vector representation of the user request sample; and carrying out feature coding on each candidate information sample by using the second multi-mode large model to obtain vector representation of each candidate information sample.
S33b, for any candidate information sample, each candidate information sample in the candidate information sample set whose degree of matching with the user request sample is lower than that of this candidate information sample is taken as a negative example sample. With the objectives of minimizing the distance between the vector representations of the user request sample and this candidate information sample and maximizing the distance between the vector representations of the user request sample and each negative example sample, the parameters of the first multi-modal large model and the second multi-modal large model are updated until the set training end condition is reached, so as to obtain the trained first multi-modal large model and second multi-modal large model.
The multi-order contrastive learning loss function can be expressed as:

L = Σ_{order(c_i) > order(c_j)} −log( exp(sim(q, c_i)) / ( exp(sim(q, c_i)) + exp(sim(q, c_j)) ) )

where sim(·,·) computes the similarity of two vector representations (the smaller the distance between the two vector representations, the higher the corresponding similarity), q denotes the vector representation of the user request sample, c_i and c_j denote the vector representations of the i-th and j-th candidate information samples respectively, and order(c_i) > order(c_j) indicates that, relative to the user request sample q, the matching grade of the i-th candidate information sample is higher than that of the j-th candidate information sample.
Compared with the former contrastive learning strategy, the multi-order contrastive learning strategy provided in this embodiment achieves finer-grained contrastive learning: in one contrastive learning pass the models see candidate information samples of different matching grades at once, and by comparing them the first and second multi-modal large models more easily learn the user's retrieval preference for different candidate information samples, achieving higher-quality intent alignment and better fitting subsequent retrieval tasks.
The third training stage described in the above embodiments adopts contrastive learning aligned with the user's intent; the training data used in this process should conform to real application scenarios as much as possible and cover various retrieval intents. This embodiment therefore provides a quality evaluation manner for the training data adopted in the third training stage, by which the quality of the training data can be assessed so that high-quality training data is constructed for the third training stage.
The quality of the training data can be measured along several dimensions:
A. Coverage: enumerate as many single-modal and cross-modal retrieval intents as possible according to the function points, and construct and screen various high-quality user request samples and corresponding candidate information samples accordingly.
I. Text single-modal, including but not limited to:
1. Similar sentence query
2. Clustering
3. News headline-text retrieval
4. Knowledge question and answer
5. Code retrieval
6. Cross-language document query
7、……
II. Cross-modal, including but not limited to:
1. Image-to-image retrieval under different intents
2. Text-to-image retrieval
3. Image-to-text retrieval
4. Image-to-audio retrieval
5、……
B. Difficulty: the expressions are consistent with the real user distribution and contain a large number of user request samples with complex intents, such as multi-intent, multi-lingual and multi-modal compound requests.
C. Differentiation: when the candidate information sample set is constructed, candidates are divided into N grades in total according to a designed matching-grade evaluation system, and the candidate information samples corresponding to each user request sample are distributed across the N grades as far as possible, ensuring the multi-modal large model's ability to distinguish candidate information samples of different matching grades. The matching-grade evaluation system may consider content relevance, correctness, timeliness, and the like; specifically, candidate information samples may be labeled manually or by a pre-trained neural network model according to the matching-grade evaluation system.
By constructing high-quality cross-modal training data and training the multi-modal large model by means of contrastive learning, the multi-modal large model can distinguish samples of different matching grades and, taking the user's intent as the first priority, select the candidate information that best conforms to the user's intent.
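For illustration only, the following Python sketch shows one possible data structure for such graded training data, together with a helper for checking how candidates are spread across the N grades; all field names and the 0-based grade indexing are assumptions, not part of this disclosure.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class GradedCandidate:
    content: str        # text, or a path/URI for an image or audio candidate
    modality: str       # "text" | "image" | "audio"
    grade: int          # matching grade relative to the request, 0..n_grades-1, higher = better

@dataclass
class RetrievalSample:
    request: str                        # user request sample (may reference other modalities)
    candidates: List[GradedCandidate]   # ideally spread across the N grades

def grade_distribution(sample: RetrievalSample, n_grades: int) -> List[int]:
    """Count how many candidates fall into each grade, e.g. to check differentiation."""
    counts = [0] * n_grades
    for c in sample.candidates:
        counts[c.grade] += 1
    return counts
```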
Referring to FIG. 6, FIG. 6 illustrates an optional multi-stage training process for the first and second multi-modal large models.
As shown in fig. 6, the training sequence may be divided into a first training stage (multi-modal autoregressive pre-training), a second training stage (adaptive information compression learning), and a third training stage (contrastive learning aligned with the user's intent).
In the first training stage, unsupervised pre-training is performed mainly on large-scale unsupervised multi-modal data in an autoregressive manner, endowing the model with understanding, representation and strong generation capabilities for multi-modal information.
In the second training stage, by dynamically adjusting the masking strategy and introducing a second loss function, the knowledge representation capability obtained by the model through large-scale pre-training is further and progressively stimulated, and information is compressed into a fixed-size vector representation so as to adapt to the retrieval task.
In the third training stage, high-quality training data covering rich and diverse cross-modal retrieval intents is constructed, and a multi-order contrastive learning strategy is designed, so that the model can screen out the most suitable candidate information on the basis of its learned representations.
The specific training manners of the first, second and third training stages may refer to the descriptions of the related embodiments, and are not repeated here.
The multi-modal information retrieval apparatus provided by the embodiment of the present application is described below, and the multi-modal information retrieval apparatus described below and the multi-modal information retrieval method described above may be referred to correspondingly with each other.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a multi-mode information retrieval device according to an embodiment of the present application.
As shown in fig. 7, the apparatus may include:
a user request encoding unit 11, configured to perform feature encoding on a user request by using a first multi-mode large model, so as to obtain a query vector representation of the user request, where the user request is request information of a text mode, an image mode and/or an audio mode;
a vector retrieval unit 12 for retrieving a target vector representation matching the query vector representation in a configured vector database, and determining candidate information corresponding to the target vector representation as a retrieval result; the vector database stores vector representations after feature encoding is carried out on each piece of candidate information through a second multi-mode large model, and each piece of candidate information covers a text mode, an image mode and/or an audio mode;
The first multi-modal large model and the second multi-modal large model are each trained to have the ability to map information of different modalities to the same semantic vector space.
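For illustration only, the retrieval step performed by the vector retrieval unit can be sketched as follows, with a brute-force cosine-similarity search standing in for the configured vector database; a production system would typically use an approximate-nearest-neighbour index instead, and all names here are illustrative.

```python
import numpy as np

def retrieve(query_vec: np.ndarray, db_vecs: np.ndarray, candidates: list, top_k: int = 5):
    """query_vec: (d,) query vector; db_vecs: (n, d) pre-encoded candidate vectors;
    candidates: the n pieces of candidate information in the same order.
    Brute-force cosine similarity stands in for the configured vector database."""
    q = query_vec / np.linalg.norm(query_vec)
    db = db_vecs / np.linalg.norm(db_vecs, axis=1, keepdims=True)
    scores = db @ q                          # cosine similarity against every candidate
    order = np.argsort(-scores)[:top_k]      # indices of the best-matching vectors
    return [(candidates[i], float(scores[i])) for i in order]
```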
Optionally, the first multi-modal large model and the second multi-modal large model have the same structure, and the model structure includes: the system comprises an image feature extraction module, an audio feature extraction module and a large language model;
The process of processing input information by the first multi-mode large model or the second multi-mode large model comprises the following steps:
inputting the input text modal information into the large language model to perform feature coding;
For the input information of the image mode, extracting vector representation of the image information by the image feature extraction module, and sending the vector representation of the image information into the large language model for feature coding;
And extracting vector representation of the audio information by the audio feature extraction module for the input audio mode information, and sending the vector representation of the audio information into the large language model for feature coding.
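For illustration only, the modality routing described above could be organized roughly as in the following PyTorch sketch; the module names and the llm(inputs_embeds=...) interface are assumptions, not the actual model implementation.

```python
import torch

class MultiModalEncoder(torch.nn.Module):
    """Illustrative routing only; module names and the llm(inputs_embeds=...) call
    are assumed interfaces, not the actual model implementation."""
    def __init__(self, image_extractor, audio_extractor, llm, text_embedder):
        super().__init__()
        self.image_extractor = image_extractor   # image -> sequence of feature vectors
        self.audio_extractor = audio_extractor   # audio -> sequence of feature vectors
        self.llm = llm                           # large language model backbone
        self.text_embedder = text_embedder       # token ids -> input embeddings

    def forward(self, modality: str, data) -> torch.Tensor:
        if modality == "text":
            inputs = self.text_embedder(data)        # data: token ids
        elif modality == "image":
            inputs = self.image_extractor(data)      # data: image tensor
        elif modality == "audio":
            inputs = self.audio_extractor(data)      # data: audio features
        else:
            raise ValueError(f"unsupported modality: {modality}")
        hidden = self.llm(inputs_embeds=inputs).last_hidden_state
        return hidden[:, -1]                         # last hidden-layer vector per sample
```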
Optionally, the device of the application can further comprise a model training unit for training the first multi-mode large model and the second multi-mode large model; the first multi-modal large model and the second multi-modal large model each have a pre-training process comprising a first training phase comprising:
Acquiring unsupervised multi-modal training data, wherein the multi-modal training data comprises text modal training data, image-text alignment modal training data and audio-text alignment modal training data;
And performing unsupervised pre-training on the multi-modal large model by utilizing the multi-modal training data in an autoregressive mode to obtain a pre-trained multi-modal large model.
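For reference, a minimal PyTorch sketch of the autoregressive (next-token prediction) pre-training objective is given below; the model interface is an assumption.

```python
import torch
import torch.nn.functional as F

def autoregressive_loss(model, token_ids: torch.Tensor) -> torch.Tensor:
    """token_ids: (batch, seq_len). Standard next-token prediction: every position is
    trained to predict the token that follows it. The model interface is assumed."""
    logits = model(token_ids).logits                 # (batch, seq_len, vocab_size)
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = token_ids[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )
```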
Further, the training process of the model training unit on the first multi-modal large model and the second multi-modal large model further comprises a second training phase, wherein the training phase comprises:
Acquiring paired training data consisting of input samples and output samples, wherein the paired training data comprises training data of a single text mode and multiple modes;
and performing supervised fine tuning training on the multi-modal large model obtained in the first training stage by utilizing the paired training data in an autoregressive mode to obtain the multi-modal large model after fine tuning training.
Optionally, in the paired training data, multiple NLP natural language processing tasks are covered for training data of a single text mode, and multiple cross-mode tasks are covered for training data of multiple modes.
Optionally, the training loss function in the second training phase includes:
A first loss function for constraining loss of the autoregressive training;
And a second loss function for constraining a distance approach between hidden layer vector representations of paired input samples and output samples extracted by the multi-modal large model, and a distance approach between hidden layer vector representations of unpaired input samples and output samples extracted by the multi-modal large model.
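For reference, the following PyTorch sketch shows one way such a second loss could be realized, treating the other samples in a batch as the unpaired combinations; the in-batch formulation, the cosine similarity measure and the temperature are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def pairing_loss(input_vecs: torch.Tensor, output_vecs: torch.Tensor, temperature=0.05):
    """input_vecs, output_vecs: (batch, d) hidden-layer representations of the input
    and output samples; row i of each tensor is a matched pair, and rows with j != i
    serve as unpaired combinations. The in-batch formulation and the temperature are
    assumptions for illustration."""
    sims = F.cosine_similarity(input_vecs.unsqueeze(1), output_vecs.unsqueeze(0), dim=-1)
    sims = sims / temperature                        # (batch, batch) similarity matrix
    targets = torch.arange(sims.size(0))             # matched pairs lie on the diagonal
    return F.cross_entropy(sims, targets)
```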
Optionally, the first multi-modal large model and the second multi-modal large model have the same structure and both include a large language model, and the large language model adopts a unidirectionally encoded Transformer structure. On this basis, the model training unit iterates multiple rounds of supervised fine-tuning training on the multi-modal large model obtained in the first training stage, using the paired training data in an autoregressive manner; each training round includes:
Determining a mask proportion of the current time according to the current training step number s, wherein the mask proportion and the training step number s are in positive correlation, and the mask proportion does not exceed a preset maximum proportion threshold;
Randomly masking input samples in the paired training data according to the masking proportion of the current time, and ensuring that the last token element of the input samples is not masked to obtain masked input samples, and forming new paired training data with paired output samples;
And performing supervised fine tuning training on the multi-modal large model obtained in the first training stage in an autoregressive mode by utilizing the new paired training data.
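For illustration only, the step-dependent mask proportion and the masking routine that always keeps the last token unmasked could be sketched as follows; the ramp length and the 0.4 cap are illustrative values, not taken from the text.

```python
import random

def mask_ratio(step: int, ramp_steps: int = 10000, max_ratio: float = 0.4) -> float:
    """Mask proportion grows with the training step s and is capped; the ramp length
    and the 0.4 cap are illustrative values, not taken from the text."""
    return min(max_ratio, max_ratio * step / ramp_steps)

def mask_input(tokens: list, step: int, mask_token: str = "[MASK]") -> list:
    """Randomly mask input tokens at the current ratio while always keeping the last token."""
    ratio = mask_ratio(step)
    return [
        mask_token if (i < len(tokens) - 1 and random.random() < ratio) else tok
        for i, tok in enumerate(tokens)
    ]
```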
Optionally, the process of training the first and second multi-mode large models by the model training unit may further include a third training stage, where the third training stage has two different implementation manners according to whether the first and second multi-mode large models are the same multi-mode large model:
When the first multi-modal large model and the second multi-modal large model are the same multi-modal large model, the model training unit executes a third training phase comprising:
Acquiring training data covering single-mode and cross-mode retrieval intents, wherein the training data comprises user request samples and candidate information sample sets, each candidate information sample in the set is marked with a matching grade relative to the user request samples, and the matching grade represents the matching degree of the candidate information samples and the user request samples;
respectively carrying out feature coding on the user request sample and each candidate information sample by using the same multi-mode large model to obtain a vector representation of the user request sample and a vector representation of each candidate information sample;
And aiming at any candidate information sample, taking each candidate information sample in the candidate information sample set, which has a matching degree with the user request sample lower than that of the candidate information sample, as a negative example sample, so as to minimize the distance between the user request sample and the vector representation of the candidate information sample, maximize the distance between the user request and the vector representation of each negative example sample, and carrying out parameter updating on the same multi-modal large model until a set training ending condition is reached, thereby obtaining the trained same multi-modal large model.
When the first multi-modal large model and the second multi-modal large model are different multi-modal large models, the model training unit performs a third training phase comprising:
Acquiring training data covering single-mode and cross-mode retrieval intents, wherein the training data comprises user request samples and candidate information sample sets, each candidate information sample in the set is marked with a matching grade relative to the user request samples, and the matching grade represents the matching degree of the candidate information samples and the user request samples;
performing feature coding on the user request sample by using the first multi-mode large model to obtain vector representation of the user request sample; performing feature coding on each candidate information sample by using the second multi-mode large model to obtain vector representation of each candidate information sample;
And aiming at any candidate information sample, taking each candidate information sample in the candidate information sample set, which has a matching degree with the user request sample lower than that of the candidate information sample, as a negative example sample, so as to minimize the distance between the user request sample and the vector representation of the candidate information sample, and maximizing the distance between the user request and the vector representation of each negative example sample as a target, and carrying out parameter updating on the first multi-modal large model and the second multi-modal large model until a set training ending condition is reached, thereby obtaining a trained first multi-modal large model and a trained second multi-modal large model.
The multi-modal information retrieval apparatus provided by the embodiment of the present application can be applied to multi-modal information retrieval devices, such as the cloud, a server, a mobile phone, a tablet, an intelligent vehicle, a robot or a wearable device. Optionally, fig. 8 shows a block diagram of the hardware structure of the multi-modal information retrieval device. Referring to fig. 8, the hardware structure of the multi-modal information retrieval device may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
In the embodiment of the application, the number of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 complete the communication with each other through the communication bus 4;
The processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention, etc.;
the memory 3 may comprise a high-speed RAM memory, and may further comprise a non-volatile memory, such as at least one magnetic disk memory;
wherein the memory stores a program, and the processor can call the program stored in the memory, wherein the program is used for executing each step of the multi-mode information retrieval method.
The embodiment of the application also provides a storage medium, which can store a program suitable for being executed by a processor, and the program is used for executing the steps of the multi-mode information retrieval method.
The embodiment of the application also provides a computer program product, which comprises a computer program, wherein the computer program is executed by a processor to realize each step of the multi-mode information retrieval method.
Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In the present specification, each embodiment is described in a progressive manner, and each embodiment focuses on the difference from other embodiments, and may be combined according to needs, and the same similar parts may be referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (13)

1. A method for multimodal information retrieval, comprising:
Performing feature coding on a user request by using a first multi-mode large model to obtain a query vector representation of the user request, wherein the user request is request information of a text mode, an image mode and/or an audio mode;
Retrieving a target vector representation matched with the query vector representation in a configured vector database, and determining candidate information corresponding to the target vector representation as a retrieval result; the vector database stores vector representations after feature encoding is carried out on each piece of candidate information through a second multi-mode large model, and each piece of candidate information covers a text mode, an image mode and/or an audio mode;
The first multi-modal large model and the second multi-modal large model are each trained to have the ability to map information of different modalities to the same semantic vector space.
2. The method of claim 1, wherein the first multi-modal large model and the second multi-modal large model are identical in structure, the model structure comprising: the system comprises an image feature extraction module, an audio feature extraction module and a large language model;
The process of processing input information by the first multi-mode large model or the second multi-mode large model comprises the following steps:
inputting the input text modal information into the large language model to perform feature coding;
For the input information of the image mode, extracting vector representation of the image information by the image feature extraction module, and sending the vector representation of the image information into the large language model for feature coding;
And extracting vector representation of the audio information by the audio feature extraction module for the input audio mode information, and sending the vector representation of the audio information into the large language model for feature coding.
3. The method of claim 1, wherein the first multi-modal large model and the second multi-modal large model each training process includes a first training phase comprising:
Acquiring unsupervised multi-modal training data, wherein the multi-modal training data comprises text modal training data, image-text alignment modal training data and audio-text alignment modal training data;
And performing unsupervised pre-training on the multi-modal large model by utilizing the multi-modal training data in an autoregressive mode to obtain a pre-trained multi-modal large model.
4. A method according to claim 3, wherein the training process of each of the first and second multi-modal large models further comprises a second training phase comprising:
Acquiring paired training data consisting of input samples and output samples, wherein the paired training data comprises training data of a single text mode and multiple modes;
and performing supervised fine tuning training on the multi-modal large model obtained in the first training stage by utilizing the paired training data in an autoregressive mode to obtain the multi-modal large model after fine tuning training.
5. The method of claim 4, wherein the training data of the pair covers a plurality of NLP natural language processing tasks for training data of a single text modality and a plurality of cross-modality tasks for training data of multiple modalities.
6. The method of claim 4, wherein the training loss function during the second training phase comprises:
A first loss function for constraining loss of the autoregressive training;
And a second loss function for constraining a distance approach between hidden layer vector representations of paired input samples and output samples extracted by the multi-modal large model, and a distance approach between hidden layer vector representations of unpaired input samples and output samples extracted by the multi-modal large model.
7. The method of claim 4, wherein the first multi-modal large model and the second multi-modal large model are identical in structure and each comprise a large language model employing a unidirectionally encoded Transformer structure;
and performing supervised fine tuning training on the multi-mode large model obtained in the first training stage by utilizing the paired training data in an autoregressive mode, wherein each training process comprises the following steps of:
Determining a mask proportion of the current time according to the current training step number s, wherein the mask proportion and the training step number s are in positive correlation, and the mask proportion does not exceed a preset maximum proportion threshold;
Randomly masking input samples in the paired training data according to the masking proportion of the current time, and ensuring that the last token element of the input samples is not masked to obtain masked input samples, and forming new paired training data with paired output samples;
And performing supervised fine tuning training on the multi-modal large model obtained in the first training stage in an autoregressive mode by utilizing the new paired training data.
8. The method according to any of claims 3-7, wherein the first multi-modal large model and the second multi-modal large model are the same multi-modal large model, and the training process of the same multi-modal large model further comprises a third training phase comprising:
Acquiring training data covering single-mode and cross-mode retrieval intents, wherein the training data comprises user request samples and candidate information sample sets, each candidate information sample in the set is marked with a matching grade relative to the user request samples, and the matching grade represents the matching degree of the candidate information samples and the user request samples;
respectively carrying out feature coding on the user request sample and each candidate information sample by using the same multi-mode large model to obtain a vector representation of the user request sample and a vector representation of each candidate information sample;
And aiming at any candidate information sample, taking each candidate information sample in the candidate information sample set, which has a matching degree with the user request sample lower than that of the candidate information sample, as a negative example sample, so as to minimize the distance between the user request sample and the vector representation of the candidate information sample, maximize the distance between the user request and the vector representation of each negative example sample, and carrying out parameter updating on the same multi-modal large model until a set training ending condition is reached, thereby obtaining the trained same multi-modal large model.
9. The method of any of claims 3-7, wherein the first and second multi-modal large models are different multi-modal large models, and the training process of the first and second multi-modal large models further comprises a third training phase comprising:
Acquiring training data covering single-mode and cross-mode retrieval intents, wherein the training data comprises user request samples and candidate information sample sets, each candidate information sample in the set is marked with a matching grade relative to the user request samples, and the matching grade represents the matching degree of the candidate information samples and the user request samples;
performing feature coding on the user request sample by using the first multi-mode large model to obtain vector representation of the user request sample; performing feature coding on each candidate information sample by using the second multi-mode large model to obtain vector representation of each candidate information sample;
And aiming at any candidate information sample, taking each candidate information sample in the candidate information sample set, which has a matching degree with the user request sample lower than that of the candidate information sample, as a negative example sample, so as to minimize the distance between the user request sample and the vector representation of the candidate information sample, and maximizing the distance between the user request and the vector representation of each negative example sample as a target, and carrying out parameter updating on the first multi-modal large model and the second multi-modal large model until a set training ending condition is reached, thereby obtaining a trained first multi-modal large model and a trained second multi-modal large model.
10. A multi-modal information retrieval apparatus, comprising:
the user request coding unit is used for carrying out feature coding on a user request by utilizing the first multi-mode large model to obtain a query vector representation of the user request, wherein the user request is request information of a text mode, an image mode and/or an audio mode;
A vector retrieval unit for retrieving a target vector representation matching the query vector representation in a configured vector database, and determining candidate information corresponding to the target vector representation as a retrieval result; the vector database stores vector representations after feature encoding is carried out on each piece of candidate information through a second multi-mode large model, and each piece of candidate information covers a text mode, an image mode and/or an audio mode;
The first multi-modal large model and the second multi-modal large model are each trained to have the ability to map information of different modalities to the same semantic vector space.
11. A multi-modal information retrieval apparatus, comprising: a memory and a processor;
the memory is used for storing programs;
The processor is configured to execute the program to implement the steps of the multi-modal information retrieval method according to any one of claims 1 to 9.
12. A readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the multimodal information retrieval method according to any of claims 1-9.
13. A computer program product comprising a computer program which, when executed by a processor, implements the steps of the multimodal information retrieval method of any of claims 1 to 9.
CN202410169465.3A 2024-02-06 2024-02-06 Multi-modal information retrieval method, apparatus, device, readable storage medium and computer program product Pending CN117909555A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410169465.3A CN117909555A (en) 2024-02-06 2024-02-06 Multi-modal information retrieval method, apparatus, device, readable storage medium and computer program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410169465.3A CN117909555A (en) 2024-02-06 2024-02-06 Multi-modal information retrieval method, apparatus, device, readable storage medium and computer program product

Publications (1)

Publication Number Publication Date
CN117909555A true CN117909555A (en) 2024-04-19

Family

ID=90683848

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410169465.3A Pending CN117909555A (en) 2024-02-06 2024-02-06 Multi-modal information retrieval method, apparatus, device, readable storage medium and computer program product

Country Status (1)

Country Link
CN (1) CN117909555A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118643196A (en) * 2024-08-14 2024-09-13 阿里云飞天(杭州)云计算技术有限公司 Data processing method, system, device, computer program product and storage medium


Similar Documents

Publication Publication Date Title
CN111753060A (en) Information retrieval method, device, equipment and computer readable storage medium
CN112347268A (en) Text-enhanced knowledge graph joint representation learning method and device
WO2023065617A1 (en) Cross-modal retrieval system and method based on pre-training model and recall and ranking
CN111159485B (en) Tail entity linking method, device, server and storage medium
CN110619051B (en) Question sentence classification method, device, electronic equipment and storage medium
CN111079418B (en) Named entity recognition method, device, electronic equipment and storage medium
CN113392265A (en) Multimedia processing method, device and equipment
CN114461804B (en) Text classification method, classifier and system based on key information and dynamic routing
CN114328807A (en) Text processing method, device, equipment and storage medium
CN114691864A (en) Text classification model training method and device and text classification method and device
CN112699686A (en) Semantic understanding method, device, equipment and medium based on task type dialog system
CN113128431A (en) Video clip retrieval method, device, medium and electronic equipment
CN113705315A (en) Video processing method, device, equipment and storage medium
CN115187910A (en) Video classification model training method and device, electronic equipment and storage medium
CN116821781A (en) Classification model training method, text analysis method and related equipment
CN117132923A (en) Video classification method, device, electronic equipment and storage medium
CN117909555A (en) Multi-modal information retrieval method, apparatus, device, readable storage medium and computer program product
Wu et al. AI for online customer service: Intent recognition and slot filling based on deep learning technology
CN115408488A (en) Segmentation method and system for novel scene text
CN116955579B (en) Chat reply generation method and device based on keyword knowledge retrieval
CN117131155A (en) Multi-category identification method, device, electronic equipment and storage medium
CN117216617A (en) Text classification model training method, device, computer equipment and storage medium
CN114021550A (en) News trend prediction system and method based on graph convolution neural network
CN113990420A (en) Electronic medical record named entity identification method
CN112712056A (en) Video semantic analysis method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination