
WO2023065617A1 - Cross-modal retrieval system and method based on pre-trained models and recall ranking - Google Patents

Cross-modal retrieval system and method based on pre-trained models and recall ranking

Info

Publication number
WO2023065617A1
Authority
WO
WIPO (PCT)
Prior art keywords: cross, retrieval, model, text, module
Prior art date
Application number
PCT/CN2022/087219
Other languages
English (en)
French (fr)
Inventor
欧中洪
田子敬
史明昊
罗中李
宋美娜
钟茂华
梁昊光
Original Assignee
北京邮电大学 (Beijing University of Posts and Telecommunications)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京邮电大学 (Beijing University of Posts and Telecommunications)
Publication of WO2023065617A1

Links

Images

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning

Definitions

  • The disclosure belongs to the field of artificial intelligence, and specifically relates to a cross-modal retrieval system and method based on pre-trained models and recall ranking.
  • Mainstream multimodal retrieval technology can be divided into two types. The first is the cross-encoder model based on matching-function learning: image and text features are fused first and then passed through a hidden layer (a neural network) that learns a cross-modal distance function, finally producing an image-text relevance score. This model mainly focuses on fine-grained attention and cross-features, and its structure is shown in FIG. 3. The second is the vector embedding model based on representation learning: image and text features are computed separately to obtain final top-level embeddings, and an interpretable distance function (cosine, L2, etc.) constrains the image-text relationship. This model pays more attention to how the signals of two different modalities are represented in the same mapping space, and its structure is shown in FIG. 4.
  • In general, the cross-encoder model performs better than the vector embedding model, because combining image and text features provides more cross-feature information to the model's hidden layer; its main problem, however, is that top-level embeddings cannot represent the image and text input signals independently.
  • In a retrieval-recall scenario with N images and M texts, N*M combinations must be fed to the model to obtain image-to-text or text-to-image search results. In addition, online computing performance is a major bottleneck, since the hidden layer must be computed online after feature combination; because the number of cross combinations is very large, the embedding vectors of image-text signals cannot be stored in advance and served from a cache. Therefore, although the cross-encoder model is effective, it is not the mainstream choice in practical applications.
  • The vector embedding model is the current mainstream retrieval structure. Since the image and text signals of the two modalities are separated, their respective top-level embeddings can be computed offline; at serving time, only the distance between the two modal vectors needs to be computed. For relevance filtering of sample pairs, only the cosine/Euclidean distance between two vectors is needed; for online retrieval recall, the embedding set of one modality must be built into a retrieval space in advance and searched with a nearest-neighbor retrieval algorithm (such as ANN).
  • The core of the vector embedding model is to obtain high-quality embeddings.
  • Although the vector embedding model is concise, effective, and widely used, its shortcomings are obvious: the model structure allows essentially no interaction between signals of different modalities, so it is difficult to learn high-quality embeddings that represent signal semantics, and the accuracy of the corresponding metric space/distance needs improvement.
  • Addressing these issues, this proposal presents a cross-modal retrieval system based on pre-trained models and recall ranking, intended to reduce information management costs, improve information search accuracy and efficiency, and support multimodal automated information retrieval for large-scale event consultation and news search.
  • The present disclosure aims to solve, at least to a certain extent, one of the technical problems in the related art.
  • The first purpose of this disclosure is to propose a cross-modal retrieval system based on pre-trained models and recall ranking, used to reduce information management costs, improve information search accuracy and efficiency, and support multimodal automated information retrieval for large-scale event consultation and news search.
  • The second purpose of the present disclosure is to propose a cross-modal retrieval method based on pre-trained models and recall ranking.
  • A third object of the present disclosure is to propose a non-transitory computer-readable storage medium.
  • A fourth object of the present disclosure is to provide an electronic device.
  • A fifth object of the present disclosure is to provide a computer program product.
  • A sixth object of the present disclosure is to propose a computer program.
  • The embodiment of the first aspect of the present disclosure proposes a cross-modal retrieval system based on pre-trained models and recall ranking, including: a multi-dimensional text information extraction module, used to provide text-side information support for the cross-modal retrieval system, expand the semantic representation of text information along different dimensions, and increase the amount of text samples; an intelligent image retrieval module, including a video intelligent frame extraction module and an image search module, where the video intelligent frame extraction module is used to extract from a video several pictures that best represent the video content, and the image search module is used to complete large-scale, efficient image retrieval tasks; and a cross-modal retrieval module, used to generate a roughly relevant candidate set according to a query item, precisely rank the candidate set, and finally return relevant retrieval results.
  • The cross-modal retrieval system based on pre-trained models and recall ranking proposed by the embodiments of the present disclosure addresses the dynamic, multi-source, and multi-modal characteristics of cross-modal retrieval data, as well as the problems of the two current mainstream modeling methods.
  • The two modeling methods are organically combined using the idea of rough recall followed by precise ranking, combining the strengths of the two schemes to achieve efficient and fast cross-modal retrieval.
  • This solution also proposes text query based on inverted-index retrieval and high-dimensional image feature retrieval based on color and texture, realizing fast retrieval across multiple modalities and providing users with a good experience.
  • The multi-dimensional text information extraction module includes:
  • a speech data processing module, used for audio extraction and deep-learning-based speech recognition; and
  • a natural language text extension module, used to obtain semantic descriptions of the current sentence in different word orders and different languages, to expand the existing text data in many ways, and to obtain a large amount of negative-sample data based on fine-grained text analysis.
  • The video intelligent frame extraction module is used to extract from a video several pictures that best represent the video content, specifically including: extracting every frame of the video to obtain a set of pictures; mapping the pictures into a unified LUV color space and computing the absolute distance between each frame and the previous frame; and sorting all extracted frames by that absolute distance, with the top-ranked frames regarded as the pictures that best represent the video content. A minimal sketch of this procedure follows.
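The following is an illustrative sketch only of the frame-scoring idea above, assuming OpenCV and NumPy are available; the function name, the per-pixel sum as the "absolute distance", and reading every frame without subsampling are assumptions for illustration, not details fixed by the disclosure.

```python
import cv2
import numpy as np

def top_k_representative_frames(video_path: str, k: int = 5):
    """Score each frame by its absolute LUV-space distance to the
    previous frame; the largest-change frames are kept as the most
    representative of the video content."""
    cap = cv2.VideoCapture(video_path)
    prev_luv, scored, idx = None, [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Map the frame into the unified LUV color space.
        luv = cv2.cvtColor(frame, cv2.COLOR_BGR2LUV).astype(np.float32)
        if prev_luv is not None:
            # Absolute distance between this frame and the previous one;
            # a larger value means a sharper change in content.
            dist = float(np.abs(luv - prev_luv).sum())
            scored.append((dist, idx, frame))
        prev_luv, idx = luv, idx + 1
    cap.release()
    # The top-ranked frames are taken as the representative pictures.
    scored.sort(key=lambda t: t[0], reverse=True)
    return [frame for _, _, frame in scored[:k]]
```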
  • The image search module is used to complete large-scale, efficient image retrieval tasks, specifically including: extracting image features with a feature extraction technique based on average gray-level comparison gaps; and quickly retrieving identical or similar pictures from the picture database through the fuzzy query function provided by ElasticSearch.
  • The cross-modal retrieval module includes:
  • a rough recall module, which uses a transformer-based multimodal pre-trained model as a sub-model of the vector embedding model to perform fast rough recall; and
  • a precise ranking module, which uses a transformer-based multimodal pre-trained model as a sub-model of the cross-encoder model to perform precise ranking.
  • The embodiment of the second aspect of the present disclosure proposes a cross-modal retrieval method based on pre-trained models and recall ranking, including the following steps: extracting text information, expanding the semantic representation of the text information along different dimensions, and increasing the amount of text samples; extracting image information, extracting from a video several pictures that best represent the video content, and retrieving identical or similar pictures from the database; and generating a roughly relevant candidate set according to the query item, precisely ranking the candidate set, and finally returning relevant retrieval results.
  • The cross-modal retrieval method based on pre-trained models and recall ranking proposed by the embodiments of the present disclosure addresses the dynamic, multi-source, and multi-modal characteristics of cross-modal retrieval data, as well as the problems of the two current mainstream modeling methods.
  • The two modeling methods are organically combined using the idea of rough recall followed by precise ranking, combining the strengths of the two schemes to achieve efficient and fast cross-modal retrieval.
  • This solution also proposes text query based on inverted-index retrieval and high-dimensional image feature retrieval based on color and texture, realizing fast retrieval across multiple modalities and providing users with a good experience.
  • Said extracting text information includes: audio extraction and deep-learning-based speech recognition; and obtaining semantic descriptions of the current sentence in different word orders and different languages, expanding the existing text data in many ways, and obtaining a large amount of negative-sample data based on fine-grained text analysis.
  • Said extracting several pictures that best represent the video content from a video includes: extracting every frame of the video to obtain a set of pictures; mapping the pictures into a unified LUV color space and computing the absolute distance between each frame and the previous frame; and sorting all extracted frames by that absolute distance, with the top-ranked frames regarded as the pictures that best represent the video content.
  • Said retrieving identical or similar pictures from the database includes: extracting image features with a feature extraction technique based on average gray-level comparison gaps; and quickly retrieving identical or similar pictures from the picture database through the fuzzy query function provided by ElasticSearch.
  • Said generating a roughly relevant candidate set according to the query item and precisely ranking the candidate set includes: using a transformer-based multimodal pre-trained model as a sub-model of the vector embedding model to perform fast rough recall; and using a transformer-based multimodal pre-trained model as a sub-model of the cross-encoder model to perform precise ranking.
  • The embodiment of the third aspect of the present disclosure proposes a non-transitory computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the cross-modal retrieval method based on pre-trained models and recall ranking described in any embodiment of the second aspect above.
  • The embodiment of the fourth aspect of the present disclosure proposes an electronic device, including: a memory; a processor; and a computer program stored in the memory and runnable on the processor, where the processor, when executing the computer program, implements the cross-modal retrieval method based on pre-trained models and recall ranking described in any embodiment of the second aspect above.
  • The embodiment of the fifth aspect of the present disclosure proposes a computer program product, including a computer program that, when executed by a processor, implements the cross-modal retrieval method based on pre-trained models and recall ranking described in any embodiment of the second aspect above.
  • The embodiment of the sixth aspect of the present disclosure proposes a computer program, including computer program code that, when run on a computer, causes the computer to execute the cross-modal retrieval method based on pre-trained models and recall ranking described in any embodiment of the second aspect above.
  • FIG. 1 is a schematic flowchart of a cross-modal retrieval system based on pre-trained models and recall ranking provided by an embodiment of the present disclosure.
  • FIG. 2 is a schematic flowchart of a cross-modal retrieval method based on pre-trained models and recall ranking provided by an embodiment of the present disclosure.
  • FIG. 3 is a schematic diagram of a cross-encoder model provided by an embodiment of the present disclosure.
  • FIG. 4 is a schematic diagram of a vector embedding model provided by an embodiment of the present disclosure.
  • FIG. 5 is a schematic diagram of the technical solution provided by an embodiment of the present disclosure.
  • FIG. 6 is a schematic diagram of a speech data processing module provided by an embodiment of the present disclosure.
  • FIG. 7 is a schematic diagram of a natural language text extension module provided by an embodiment of the present disclosure.
  • FIG. 8 is a schematic diagram of a video intelligent frame extraction module provided by an embodiment of the present disclosure.
  • FIG. 9 is a schematic diagram of image feature extraction provided by an embodiment of the present disclosure.
  • FIG. 10 is a schematic diagram of a retrieval architecture provided by an embodiment of the present disclosure.
  • FIG. 11 is a schematic diagram of a rough recall module provided by an embodiment of the present disclosure.
  • FIG. 12 is a schematic diagram of a precise ranking module provided by an embodiment of the present disclosure.
  • FIG. 1 is a schematic flowchart of a cross-modal retrieval system based on a pre-trained model and recall ranking provided by an embodiment of the present disclosure.
  • As shown in FIG. 1, this cross-modal retrieval system based on pre-trained models and recall ranking includes the following modules: a multi-dimensional text information extraction module 10, an intelligent image retrieval module 20, and a cross-modal retrieval module 30.
  • The multi-dimensional text information extraction module 10 is used to provide text-side information support for the cross-modal retrieval system, expand the semantic representation of text information along different dimensions, and increase the amount of text samples.
  • The intelligent image retrieval module 20 includes a video intelligent frame extraction module 201 and an image search module 202, where the video intelligent frame extraction module is used to extract from a video several pictures that best represent the video content, and the image search module is used to complete large-scale, efficient image retrieval tasks.
  • The cross-modal retrieval module 30 is configured to generate a roughly relevant candidate set according to the query item, precisely rank the candidate set, and finally return relevant retrieval results.
  • The processing flow of this solution is shown in FIG. 5.
  • The multi-dimensional text information extraction module 10 includes:
  • a speech data processing module 101, used for audio extraction and deep-learning-based speech recognition; and
  • a natural language text extension module 102, used to obtain semantic descriptions of the current sentence in different word orders and different languages, to expand the existing text data in many ways, and to obtain a large amount of negative-sample data based on fine-grained text analysis.
  • It can be understood that the multi-dimensional text information extraction module provides text-side information support for the multimodal retrieval system, mainly by expanding the semantic representation of text information along different dimensions and increasing the text sample size.
  • This module provides sufficient data support for text single-modal retrieval: on the one hand it enriches the data content of the text modality, and on the other hand it strengthens the associations among multiple modalities.
  • Unlike conventional text information extraction, the multi-dimensional text information extraction module combines text translation with speech recognition to make full use of the advantages of multimodal data. Speech recognition is performed on the audio tracks of videos, as well as on data that is natively audio, to obtain paired training data; the overall text data is then passed through text translation, using textual semantic information to improve overall data quality and expand the amount of paired multimodal associated data. At the same time, multi-dimensional natural language analysis randomly replaces sentence constituents to build a rich negative-sample space and improve the robustness of the model; a toy sketch of this negative-sample construction follows.
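As an illustration only, here is a toy sketch of the random constituent replacement, assuming the sentence has already been tokenized and part-of-speech tagged by the fine-grained text analysis; the `lexicon` mapping and the function name are hypothetical stand-ins, not part of the disclosure.

```python
import random
from typing import Dict, List, Tuple

def make_negative(tagged: List[Tuple[str, str]],
                  lexicon: Dict[str, List[str]],
                  n_swaps: int = 1) -> str:
    """Build a negative text sample by replacing randomly chosen
    constituents with other words of the same part of speech, e.g.
    lexicon = {"NN": ["dog", "car"], "VB": ["runs", "jumps"]}."""
    words = [w for w, _ in tagged]
    # Only positions whose tag has replacement candidates are swappable.
    swappable = [i for i, (_, tag) in enumerate(tagged) if tag in lexicon]
    for i in random.sample(swappable, min(n_swaps, len(swappable))):
        words[i] = random.choice(lexicon[tagged[i][1]])
    return " ".join(words)
```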
  • The multi-dimensional text information extraction module can be subdivided into a speech data processing sub-module and a natural language text extension sub-module.
  • The speech data processing sub-module mainly covers audio extraction and deep-learning-based speech recognition; its structure is shown in FIG. 6.
  • High-dimensional modalities carry a large amount of information, and projecting them to low dimensions can greatly expand the low-dimensional modal data.
  • Converting high-dimensional modalities (such as video and audio) into low-dimensional-modality (text) data provides a large amount of paired associated data content.
  • Audio extraction efficiently strips the audio data out of videos and quickly provides it to subsequent functions.
  • Deep-learning-based speech recognition uses the attention mechanism to achieve end-to-end training, performing unified speech recognition on the audio data obtained from the various modalities to obtain low-dimensional-modality (i.e., text-modality) information.
  • The end-to-end model forms a complete pipeline that provides a large amount of paired data for subsequent text feature extraction.
  • Meanwhile, the audio features obtained during deep learning can support the audio feature content required for the final cross-modal retrieval.
  • The data used to train cross-modal retrieval models are all paired associated data. At present, most such data is obtained through human labeling, and publicly available complete datasets can hardly meet the amount of training data required for deep learning.
  • The multi-dimensional text information extraction module applies deep-learning-based translation to convert natural language text into multilingual text information, obtaining a multi-dimensional semantic representation of the current text data, and then converts it back to the original language, achieving the goal of unifying the training language.
  • The natural language text extension sub-module mainly uses translation results across multiple languages to obtain semantic descriptions of the current sentence in different word orders and different languages, expanding the existing text data in many ways.
  • Natural language processing can also obtain a large amount of negative-sample data based on fine-grained text analysis, making the final cross-modal retrieval model more robust; its structure is shown in FIG. 7. A minimal sketch of the back-translation expansion follows.
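A minimal sketch of the back-translation idea, assuming a generic `translate(text, src, dst)` callable as a placeholder for any deep-learning translation service; the function signature, source language, and pivot languages are hypothetical assumptions, not part of the disclosure.

```python
from typing import Callable, List, Sequence

def back_translate(sentence: str,
                   translate: Callable[[str, str, str], str],
                   src_lang: str = "zh",
                   pivots: Sequence[str] = ("en", "fr", "de")) -> List[str]:
    """Expand one sentence into paraphrases by translating it through
    pivot languages and back to the original language, keeping any
    result that differs from the input."""
    variants = []
    for lang in pivots:
        # src -> pivot -> src yields a paraphrase in the original language.
        restored = translate(translate(sentence, src_lang, lang),
                             lang, src_lang)
        if restored and restored != sentence:
            variants.append(restored)
    return variants
```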
  • The video intelligent frame extraction module 201 is used to extract from a video several pictures that best represent the video content, specifically including: extracting every frame of the video to obtain a set of pictures; mapping the pictures into a unified LUV color space and computing the absolute distance between each frame and the previous frame; and sorting all extracted frames by that absolute distance, with the top-ranked frames regarded as the pictures that best represent the video content.
  • It can be understood that a video is composed of picture frames, so a natural connection exists between video-modality data and picture-modality data; to leap from the video modality to the picture modality, it suffices to extract several representative pictures from the video.
  • The image search module 202 is used to complete large-scale, efficient image retrieval tasks, specifically including: extracting image features with a feature extraction technique based on average gray-level comparison gaps; and quickly retrieving identical or similar pictures from the picture database through the fuzzy query function provided by ElasticSearch.
  • To meet the need of quickly retrieving and returning identical or similar pictures from the database given a user-input picture, search-by-image technology is indispensable.
  • Many current image retrieval technologies suffer from insufficient retrieval speed and limited retrieval range.
  • This solution proposes an image feature extraction method based on average gray-level comparison, together with a search-by-image technique accelerated by the Elasticsearch search engine, to complete large-scale, efficient image retrieval tasks.
  • Retrieval speed greatly affects the retrieval experience, and image retrieval, unlike keyword retrieval, involves a significantly larger amount of computation.
  • To speed up image retrieval, this solution first converts the RGB three-channel picture into a gray picture with 255 gray levels; the picture is then cropped appropriately, cutting away the parts that most likely cannot express the picture's characteristics, yielding a gray picture as shown in FIG. 9.
  • Since picture-to-picture similarity must be computed, the feature extraction method is particularly important: this solution selects 9*9 grid points and their surrounding regions in the picture shown in FIG. 9, computes the average gray-level comparison gaps over these rectangular regions, quantizes the gaps, and stores them as the picture's feature.
  • This feature extraction method represents a picture with only an 81*8 matrix, so picture-to-picture similarity computation is fast; and because a single picture requires little storage space, large-scale search-by-image tasks become feasible. A toy sketch of this feature extractor follows.
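A toy sketch of this feature extractor, assuming OpenCV and NumPy; the crop ratio, the replicate padding at grid edges, and quantizing each comparison gap to its sign are illustrative assumptions, since the disclosure fixes only the 9*9 grid and the 81*8 feature shape.

```python
import cv2
import numpy as np

def gray_grid_feature(img_bgr: np.ndarray) -> np.ndarray:
    """81x8 feature: mean gray level of each cell in a 9x9 grid,
    compared against its 8 neighbours and quantized to a sign."""
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    h, w = gray.shape
    # Crop a border that most likely carries no distinctive content.
    gray = gray[h // 10: h - h // 10, w // 10: w - w // 10]
    h, w = gray.shape
    # Mean gray level of each of the 9*9 rectangular regions.
    means = np.zeros((9, 9), dtype=np.float32)
    for i in range(9):
        for j in range(9):
            cell = gray[i * h // 9:(i + 1) * h // 9,
                        j * w // 9:(j + 1) * w // 9]
            means[i, j] = cell.mean()
    # Compare each region with its 8 neighbours (edges replicate-padded).
    padded = np.pad(means, 1, mode="edge")
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]
    feat = np.zeros((81, 8), dtype=np.int8)
    for k, (di, dj) in enumerate(offsets):
        gap = means - padded[1 + di:10 + di, 1 + dj:10 + dj]
        # Quantize the comparison gap; other quantizations also fit.
        feat[:, k] = np.sign(gap).reshape(81).astype(np.int8)
    return feat
```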
  • To further improve retrieval speed, this solution implements the image retrieval task on ElasticSearch.
  • Using the feature extraction method above, picture features are stored in ElasticSearch to build a picture retrieval database.
  • Unlike a traditional database, the ElasticSearch-based picture database uses an inverted-index mechanism, which greatly improves retrieval speed; when a user inputs a picture, or a picture produced by video intelligent frame extraction is input, its features are extracted first, and the fuzzy query function provided by ElasticSearch then quickly retrieves identical or similar pictures from the picture database. A minimal indexing and query sketch follows.
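A minimal indexing and query sketch using the official Python Elasticsearch client; the index name, the token serialization of the 81*8 feature, and the use of a match query as the similarity search are assumptions for illustration, since the disclosure does not specify how features map onto the inverted index.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local node

def feature_to_tokens(feat) -> str:
    # One token per grid cell, e.g. "p00:+-0++--+", so the inverted
    # index can score pictures by how many cell tokens they share.
    return " ".join(
        f"p{i:02d}:" + "".join("+" if v > 0 else "-" if v < 0 else "0"
                               for v in row)
        for i, row in enumerate(feat))

def index_image(image_id: str, feat) -> None:
    es.index(index="images", id=image_id,
             document={"feature": feature_to_tokens(feat)})

def search_similar(feat, k: int = 10):
    # Identical pictures share all tokens; similar pictures share many.
    resp = es.search(index="images", size=k,
                     query={"match": {"feature": feature_to_tokens(feat)}})
    return [hit["_id"] for hit in resp["hits"]["hits"]]
```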
  • The cross-modal retrieval module 30 includes:
  • a rough recall module 301, which uses a transformer-based multimodal pre-trained model as a sub-model of the vector embedding model to perform fast rough recall; and
  • a precise ranking module 302, which uses a transformer-based multimodal pre-trained model as a sub-model of the cross-encoder model to perform precise ranking.
  • As noted above, both existing mainstream modeling schemes have shortcomings; this solution is the first to organically combine the two, adopting the innovative idea of rough recall followed by precise ranking to improve retrieval efficiency while preserving retrieval quality.
  • This scheme uses the vector embedding model for rough information recall, then uses the cross-encoder model to precisely rank the recalled information, and finally returns the top-ranked options that best satisfy the retrieval requirements.
  • This architecture can use existing cross-modal pre-trained models and shares parameters between the two models to improve parameter efficiency.
  • The retrieval architecture is shown in FIG. 10.
  • The rough recall part uses a transformer-based multimodal pre-trained model, such as OSCAR, as a sub-model of the vector embedding model to perform fast rough recall.
  • As shown in FIG. 11, the vector embedding model contains two pre-trained sub-models that process the text signal and the image signal respectively while sharing parameters.
  • The two sub-models encode the signals of the different modalities separately; the encodings are then mapped into the same high-dimensional multimodal feature space; finally, a standard distance metric, such as Euclidean or cosine distance, computes the similarity between two signals, and the most similar top-k candidates are selected and precisely ranked by the cross-encoder model.
  • To bring the distributions of the input image i and the text caption c closer in the high-dimensional multimodal feature space, corresponding image-text pairs are placed close together in the feature space during training, while unrelated sample pairs are kept apart (at least beyond the margin α). The triplet loss, with cosine distance as the distance metric, is therefore used: L_EMB(i, c) = max(0, cos(i, c') - cos(i, c) + α) + max(0, cos(i', c) - cos(i, c) + α).
  • Here (i, c) is a positive image-text pair from the training corpus, and c' and i' are negative samples drawn from the corpus such that the image-text pairs (i, c') and (i', c) do not appear in it.
  • Since the model encodes text and image signals independently, at retrieval time the query text or image only needs to be mapped into the same feature space for distance computation. Data in the database can therefore be encoded offline, ensuring online retrieval efficiency and making the model applicable to large-scale data retrieval; but because the model is not required to learn fine-grained features of the input, it is used only to quickly recall a candidate set, which the cross-encoder model then precisely ranks. A minimal sketch of the triplet loss above follows.
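A minimal sketch of the triplet loss, assuming PyTorch and batched image/text embeddings of shape (B, D); the margin value and batch averaging are illustrative assumptions.

```python
import torch.nn.functional as F

def triplet_embedding_loss(img, cap, img_neg, cap_neg, alpha: float = 0.2):
    """L_EMB(i,c) = max(0, cos(i,c') - cos(i,c) + a)
                  + max(0, cos(i',c) - cos(i,c) + a), averaged over a batch."""
    pos = F.cosine_similarity(img, cap)        # cos(i, c)
    neg_c = F.cosine_similarity(img, cap_neg)  # cos(i, c')
    neg_i = F.cosine_similarity(img_neg, cap)  # cos(i', c)
    return (F.relu(neg_c - pos + alpha) + F.relu(neg_i - pos + alpha)).mean()
```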
  • The precise ranking part uses a transformer-based multimodal pre-trained model, such as OSCAR, as a sub-model of the cross-encoder model to perform precise ranking.
  • The precise ranking is shown in FIG. 9.
  • The cross-encoder model uses only one pre-trained sub-model: the text and image signals are concatenated, and a neural network then judges their similarity.
  • This scheme uses a binary classifier to judge whether a text and an image are related, trained with the cross-entropy loss L_CE(i, c) = -(y·log p(i, c) + (1 - y)·log(1 - p(i, c))), where p(i, c) is the probability that the combination of input image i and text c is a positive sample (i.e., a correct image-text pairing); y = 1 when (i, c) is a positive pair and y = 0 when it is a negative pair.
  • At retrieval time, the roughly recalled top-k candidates are concatenated with the query item in turn, and the similarity probability of each image-text pair is computed, completing the precise ranking.
  • The overall process of this sub-module is shown in FIG. 12.
  • The vector embedding model first quickly selects top-k roughly relevant candidates according to the user's query item; the cross-encoder model then precisely ranks the candidate set against the query item, and finally the relevant retrieval results are returned to the user.
  • This scheme thus retains both the retrieval efficiency of the vector embedding model on large-scale datasets and the retrieval accuracy of the cross-encoder model; a minimal sketch of this two-stage flow follows.
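A minimal sketch of the two-stage flow, with brute-force cosine recall standing in for an ANN index and a `cross_scorer` callable standing in for the cross-encoder; all names are hypothetical.

```python
import numpy as np

def two_stage_retrieval(query, query_emb, doc_embs, docs, cross_scorer,
                        k: int = 100, top: int = 10):
    """Stage 1: rough recall over offline-encoded embeddings.
    Stage 2: precise ranking of the k candidates with the cross-encoder,
    whose relevance probability p(i, c) is given by cross_scorer."""
    # Cosine similarity of the query against every stored embedding;
    # at scale an ANN index would replace this brute-force scan.
    sims = doc_embs @ query_emb / (
        np.linalg.norm(doc_embs, axis=1) * np.linalg.norm(query_emb) + 1e-9)
    candidates = np.argsort(-sims)[:k]          # top-k rough recall
    # Run the expensive cross-encoder on the k candidates only.
    reranked = sorted(candidates,
                      key=lambda i: cross_scorer(query, docs[i]),
                      reverse=True)
    return [docs[i] for i in reranked[:top]]
```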
  • This solution makes full use of the advantages of multimodal data: it combines text translation with speech recognition to improve overall data quality, the amount of multimodal associated data, and the robustness of the model; it extracts image features with the technique based on average gray-level comparison gaps and pairs it with the ElasticSearch search engine for fast feature retrieval, realizing large-scale, efficient image retrieval; and it combines the respective advantages of the vector embedding model and the cross-encoder model, innovatively adopting a rough-recall-then-precise-ranking strategy to achieve fast and effective cross-modal retrieval on large-scale data.
  • The advantages of this scheme are as follows. First, a joint retrieval framework is proposed that combines the fast retrieval speed of the vector embedding model with the high retrieval quality of the cross-encoder model; by adopting rough recall followed by precise ranking at retrieval time, it achieves fast and effective cross-modal retrieval on large-scale data while sharing parameters between the two models to improve parameter efficiency.
  • The framework is applicable to any cross-modal pre-trained model, so existing models can be used without training from scratch, giving it a wide range of application scenarios.
  • Second, combining multi-dimensional text information extraction with intelligent image retrieval realizes fast single-modal retrieval, remedying the inability of current mainstream cross-modal retrieval models to retrieve information within the same modality.
  • Multi-dimensional text information extraction enriches the information content of the text modality on the one hand and strengthens the associations among multiple modalities on the other, while also realizing speech-to-text conversion.
  • Intelligent image retrieval realizes the conversion from video-modality data to image-modality data; it can extract image features from information such as pixels, color, and texture, and efficiently retrieve identical or highly similar pictures from the database.
  • The cross-modal retrieval system based on pre-trained models and recall ranking proposed by the embodiments of the present disclosure addresses the dynamic, multi-source, and multi-modal characteristics of cross-modal retrieval data, as well as the problems of the two current mainstream modeling methods.
  • The two modeling methods are organically combined using the idea of rough recall followed by precise ranking, integrating the strengths of the two schemes to achieve efficient and fast cross-modal retrieval; in addition, this scheme proposes text query based on inverted-index retrieval and high-dimensional image feature retrieval based on color and texture, realizing fast retrieval across multiple modalities and providing users with a good experience.
  • To implement the above embodiments, the present disclosure also proposes a cross-modal retrieval method based on pre-trained models and recall ranking.
  • FIG. 2 is a schematic diagram of a cross-modal retrieval method based on pre-trained models and recall ranking provided by an embodiment of the present disclosure.
  • As shown in FIG. 2, the cross-modal retrieval method based on pre-trained models and recall ranking includes steps S101 to S103.
  • S101: extract text information, expand the semantic representation of the text information along different dimensions, and increase the amount of text samples.
  • S102: extract image information, extract from a video several pictures that best represent the video content, and retrieve identical or similar pictures from the database.
  • S103: generate a roughly relevant candidate set according to the query item, precisely rank the candidate set, and finally return relevant retrieval results.
  • Said extracting text information includes: audio extraction and deep-learning-based speech recognition; and obtaining semantic descriptions of the current sentence in different word orders and different languages, expanding the existing text data in many ways, and obtaining a large amount of negative-sample data based on fine-grained text analysis.
  • Said extracting several pictures that best represent the video content from a video includes: extracting every frame of the video to obtain a set of pictures; mapping the pictures into a unified LUV color space and computing the absolute distance between each frame and the previous frame; and sorting all extracted frames by that absolute distance, with the top-ranked frames regarded as the pictures that best represent the video content.
  • Said retrieving identical or similar pictures from the database includes: extracting image features with a feature extraction technique based on average gray-level comparison gaps; and quickly retrieving identical or similar pictures from the picture database through the fuzzy query function provided by ElasticSearch.
  • Said generating a roughly relevant candidate set according to the query item and precisely ranking the candidate set includes: using a transformer-based multimodal pre-trained model as a sub-model of the vector embedding model to perform fast rough recall; and using a transformer-based multimodal pre-trained model as a sub-model of the cross-encoder model to perform precise ranking.
  • To implement the above embodiments, the present disclosure also proposes a non-transitory computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the cross-modal retrieval method based on pre-trained models and recall ranking described in any of the above embodiments.
  • The present disclosure also proposes an electronic device, including: a memory; a processor; and a computer program stored in the memory and runnable on the processor, where the processor, when executing the computer program, implements the cross-modal retrieval method based on pre-trained models and recall ranking described in any of the above embodiments.
  • The present disclosure also proposes a computer program product, including a computer program that, when executed by a processor, implements the cross-modal retrieval method based on pre-trained models and recall ranking described in any of the above embodiments.
  • The present disclosure also proposes a computer program, including computer program code that, when run on a computer, causes the computer to execute the cross-modal retrieval method based on pre-trained models and recall ranking described in any of the above embodiments.
  • The terms "first" and "second" are used for descriptive purposes only and cannot be interpreted as indicating or implying relative importance or implicitly specifying the quantity of the indicated technical features.
  • Thus, features defined as "first" and "second" may explicitly or implicitly include at least one such feature.
  • In the description of the present disclosure, "plurality" means at least two, such as two or three, unless otherwise specifically defined.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A cross-modal retrieval system and method based on pre-trained models and recall ranking are proposed. The system includes: a multi-dimensional text information extraction module, used to provide text-side information support for the cross-modal retrieval system, expand the semantic representation of text information along different dimensions, and increase the amount of text samples; an intelligent image retrieval module, including a video intelligent frame extraction module and an image search module, where the video intelligent frame extraction module is used to extract from a video several pictures that best represent the video content, and the image search module is used to complete large-scale, efficient image retrieval tasks; and a cross-modal retrieval module, used to generate a roughly relevant candidate set according to a query item, precisely rank the candidate set, and finally return relevant retrieval results.

Description

Cross-modal retrieval system and method based on pre-trained models and recall ranking
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority to Chinese Patent Application No. 202111229288.6, filed in China on October 21, 2021, the entire contents of which are incorporated herein by reference.
TECHNICAL FIELD
The present disclosure belongs to the field of artificial intelligence, and specifically relates to a cross-modal retrieval system and method based on pre-trained models and recall ranking.
BACKGROUND
With the development of the Internet, information on the network is no longer presented in a single textual form but is becoming increasingly diverse. Besides massive text data, the network now carries no less data in other modalities such as images, video, and audio. Faced with the massive data produced by the rapidly growing Internet industry, quickly and effectively retrieving relevant information across different modalities according to the user's intent has great practical value. Current mainstream multimodal retrieval technology falls into two types. The first is the cross-encoder model based on matching-function learning: its main idea is to fuse image and text features first and then pass them through a hidden layer (a neural network) that learns a cross-modal distance function, finally producing an image-text relevance score. This model mainly focuses on fine-grained attention and cross-features, and its structure is shown in FIG. 3. The second is the vector embedding model based on representation learning: its main idea is to compute image and text features separately to obtain final top-level embeddings, then constrain the image-text relationship with an interpretable distance function (cosine, L2, etc.). This model pays more attention to how the signals of two different modalities are represented in the same mapping space, and its structure is shown in FIG. 4.
In general, the cross-encoder model performs better than the vector embedding model, because combining image and text features provides more cross-feature information to the model's hidden layer; its main problem, however, is that top-level embeddings cannot represent the image and text input signals independently. In a retrieval-recall scenario with N images and M texts, N*M combinations must be fed to the model to obtain image-to-text or text-to-image search results. In addition, online computing performance is a major bottleneck, since the hidden layer must be computed online after feature combination; because the number of cross combinations is very large, the embedding vectors of image-text signals cannot be stored in advance and served from a cache. Therefore, although the cross-encoder model is effective, it is not the mainstream choice in practical applications.
The vector embedding model is the current mainstream retrieval structure. Since the image and text signals of the two modalities are separated, their top-level embeddings can be computed separately offline; once the embeddings are stored, online use only requires computing the distance between the two modal vectors. For relevance filtering of sample pairs, only the cosine/Euclidean distance between two vectors is needed; for online retrieval recall, the embedding set of one modality must be built into a retrieval space in advance and searched with a nearest-neighbor retrieval algorithm (such as ANN). The core of the vector embedding model is obtaining high-quality embeddings. However, although the vector embedding model is concise, effective, and widely used, its shortcomings are obvious: the model structure allows essentially no interaction between signals of different modalities, so it is difficult to learn high-quality embeddings that represent signal semantics, and the accuracy of the corresponding metric space/distance needs improvement.
Addressing the dynamic, multi-source, multi-modal characteristics of data on today's Internet, this proposal presents a cross-modal retrieval system based on pre-trained models and recall ranking, intended to reduce information management costs, improve information search accuracy and efficiency, and support multimodal automated information retrieval for large-scale event consultation and news search.
SUMMARY
The present disclosure aims to solve, at least to a certain extent, one of the technical problems in the related art.
The first purpose of the present disclosure is to propose a cross-modal retrieval system based on pre-trained models and recall ranking, used to reduce information management costs, improve information search accuracy and efficiency, and support multimodal automated information retrieval for large-scale event consultation and news search.
The second purpose of the present disclosure is to propose a cross-modal retrieval method based on pre-trained models and recall ranking.
The third purpose of the present disclosure is to propose a non-transitory computer-readable storage medium.
The fourth purpose of the present disclosure is to propose an electronic device.
The fifth purpose of the present disclosure is to propose a computer program product.
The sixth purpose of the present disclosure is to propose a computer program.
To achieve the above purposes, an embodiment of the first aspect of the present disclosure proposes a cross-modal retrieval system based on pre-trained models and recall ranking, including: a multi-dimensional text information extraction module, used to provide text-side information support for the cross-modal retrieval system, expand the semantic representation of text information along different dimensions, and increase the amount of text samples; an intelligent image retrieval module, including a video intelligent frame extraction module and an image search module, where the video intelligent frame extraction module is used to extract from a video several pictures that best represent the video content, and the image search module is used to complete large-scale, efficient image retrieval tasks; and a cross-modal retrieval module, used to generate a roughly relevant candidate set according to a query item, precisely rank the candidate set, and finally return relevant retrieval results.
The cross-modal retrieval system based on pre-trained models and recall ranking proposed by the embodiments of the present disclosure addresses the dynamic, multi-source, multi-modal characteristics of cross-modal retrieval data and the problems of the two current mainstream modeling methods by organically combining the two methods, adopting the idea of rough recall followed by precise ranking, and integrating the strengths of both schemes to achieve efficient and fast cross-modal retrieval. In addition, this solution proposes text query based on inverted-index retrieval and high-dimensional image feature retrieval based on color and texture, realizing fast retrieval across multiple modalities and providing users with a good experience.
In an embodiment of the present disclosure, the multi-dimensional text information extraction module includes:
a speech data processing module, used for audio extraction and deep-learning-based speech recognition; and
a natural language text extension module, used to obtain semantic descriptions of the current sentence in different word orders and different languages, to expand the existing text data in many ways, and to obtain a large amount of negative-sample data based on fine-grained text analysis.
In an embodiment of the present disclosure, the video intelligent frame extraction module is used to extract from a video several pictures that best represent the video content, specifically including:
extracting every frame of the video to obtain a set of pictures;
mapping the pictures into a unified LUV color space and computing the absolute distance between each frame and the previous frame; and
sorting all extracted frames by the absolute distance, with the top-ranked frames regarded as the pictures that best represent the video content.
In an embodiment of the present disclosure, the image search module is used to complete large-scale, efficient image retrieval tasks, specifically including:
extracting image features with a feature extraction technique based on average gray-level comparison gaps; and
quickly retrieving identical or similar pictures from the picture database through the fuzzy query function provided by ElasticSearch.
In an embodiment of the present disclosure, the cross-modal retrieval module includes:
a rough recall module, which uses a transformer-based multimodal pre-trained model as a sub-model of the vector embedding model to perform fast rough recall; and
a precise ranking module, which uses a transformer-based multimodal pre-trained model as a sub-model of the cross-encoder model to perform precise ranking.
To achieve the above purposes, an embodiment of the second aspect of the present disclosure proposes a cross-modal retrieval method based on pre-trained models and recall ranking, including the following steps: extracting text information, expanding the semantic representation of the text information along different dimensions, and increasing the amount of text samples; extracting image information, extracting from a video several pictures that best represent the video content, and retrieving identical or similar pictures from the database; and generating a roughly relevant candidate set according to a query item, precisely ranking the candidate set, and finally returning relevant retrieval results.
The cross-modal retrieval method based on pre-trained models and recall ranking proposed by the embodiments of the present disclosure addresses the dynamic, multi-source, multi-modal characteristics of cross-modal retrieval data and the problems of the two current mainstream modeling methods by organically combining the two methods, adopting the idea of rough recall followed by precise ranking, and integrating the strengths of both schemes to achieve efficient and fast cross-modal retrieval. In addition, this solution proposes text query based on inverted-index retrieval and high-dimensional image feature retrieval based on color and texture, realizing fast retrieval across multiple modalities and providing users with a good experience.
In an embodiment of the present disclosure, said extracting text information includes:
audio extraction and deep-learning-based speech recognition; and
obtaining semantic descriptions of the current sentence in different word orders and different languages, expanding the existing text data in many ways, and obtaining a large amount of negative-sample data based on fine-grained text analysis.
In an embodiment of the present disclosure, said extracting from a video several pictures that best represent the video content includes:
extracting every frame of the video to obtain a set of pictures;
mapping the pictures into a unified LUV color space and computing the absolute distance between each frame and the previous frame; and
sorting all extracted frames by the absolute distance, with the top-ranked frames regarded as the pictures that best represent the video content.
In an embodiment of the present disclosure, said retrieving identical or similar pictures from the database includes:
extracting image features with a feature extraction technique based on average gray-level comparison gaps; and
quickly retrieving identical or similar pictures from the picture database through the fuzzy query function provided by ElasticSearch.
In an embodiment of the present disclosure, said generating a roughly relevant candidate set according to the query item and precisely ranking the candidate set includes:
using a transformer-based multimodal pre-trained model as a sub-model of the vector embedding model to perform fast rough recall; and
using a transformer-based multimodal pre-trained model as a sub-model of the cross-encoder model to perform precise ranking.
To achieve the above purposes, an embodiment of the third aspect of the present disclosure proposes a non-transitory computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the cross-modal retrieval method based on pre-trained models and recall ranking described in any embodiment of the second aspect above.
To achieve the above purposes, an embodiment of the fourth aspect of the present disclosure proposes an electronic device, including: a memory; a processor; and a computer program stored in the memory and runnable on the processor, where the processor, when executing the computer program, implements the cross-modal retrieval method based on pre-trained models and recall ranking described in any embodiment of the second aspect above.
To achieve the above purposes, an embodiment of the fifth aspect of the present disclosure proposes a computer program product, including a computer program that, when executed by a processor, implements the cross-modal retrieval method based on pre-trained models and recall ranking described in any embodiment of the second aspect above.
To achieve the above purposes, an embodiment of the sixth aspect of the present disclosure proposes a computer program, including computer program code that, when run on a computer, causes the computer to execute the cross-modal retrieval method based on pre-trained models and recall ranking described in any embodiment of the second aspect above.
BRIEF DESCRIPTION OF THE DRAWINGS
The above and/or additional aspects and advantages of the present disclosure will become apparent and easy to understand from the following description of the embodiments in conjunction with the accompanying drawings, in which:
FIG. 1 is a schematic flowchart of a cross-modal retrieval system based on pre-trained models and recall ranking provided by an embodiment of the present disclosure.
FIG. 2 is a schematic flowchart of a cross-modal retrieval method based on pre-trained models and recall ranking provided by an embodiment of the present disclosure.
FIG. 3 is a schematic diagram of a cross-encoder model provided by an embodiment of the present disclosure.
FIG. 4 is a schematic diagram of a vector embedding model provided by an embodiment of the present disclosure.
FIG. 5 is a schematic diagram of the technical solution provided by an embodiment of the present disclosure.
FIG. 6 is a schematic diagram of a speech data processing module provided by an embodiment of the present disclosure.
FIG. 7 is a schematic diagram of a natural language text extension module provided by an embodiment of the present disclosure.
FIG. 8 is a schematic diagram of a video intelligent frame extraction module provided by an embodiment of the present disclosure.
FIG. 9 is a schematic diagram of image feature extraction provided by an embodiment of the present disclosure.
FIG. 10 is a schematic diagram of a retrieval architecture provided by an embodiment of the present disclosure.
FIG. 11 is a schematic diagram of a rough recall module provided by an embodiment of the present disclosure.
FIG. 12 is a schematic diagram of a precise ranking module provided by an embodiment of the present disclosure.
DETAILED DESCRIPTION
Embodiments of the present disclosure are described in detail below; examples of the embodiments are shown in the accompanying drawings, in which identical or similar reference numerals throughout denote identical or similar elements, or elements with identical or similar functions. The embodiments described below with reference to the drawings are exemplary, intended to explain the present disclosure, and should not be construed as limiting it.
The cross-modal retrieval system and method based on pre-trained models and recall ranking of the embodiments of the present disclosure are described below with reference to the drawings.
FIG. 1 is a schematic flowchart of a cross-modal retrieval system based on pre-trained models and recall ranking provided by an embodiment of the present disclosure.
As shown in FIG. 1, the cross-modal retrieval system based on pre-trained models and recall ranking includes the following modules: a multi-dimensional text information extraction module 10, an intelligent image retrieval module 20, and a cross-modal retrieval module 30.
The multi-dimensional text information extraction module 10 is used to provide text-side information support for the cross-modal retrieval system, expand the semantic representation of text information along different dimensions, and increase the amount of text samples.
The intelligent image retrieval module 20 includes a video intelligent frame extraction module 201 and an image search module 202, where the video intelligent frame extraction module is used to extract from a video several pictures that best represent the video content, and the image search module is used to complete large-scale, efficient image retrieval tasks.
The cross-modal retrieval module 30 is used to generate a roughly relevant candidate set according to a query item, precisely rank the candidate set, and finally return relevant retrieval results. The processing flow of this solution is shown in FIG. 5.
In an embodiment of the present disclosure, the multi-dimensional text information extraction module 10 includes:
a speech data processing module 101, used for audio extraction and deep-learning-based speech recognition; and
a natural language text extension module 102, used to obtain semantic descriptions of the current sentence in different word orders and different languages, to expand the existing text data in many ways, and to obtain a large amount of negative-sample data based on fine-grained text analysis.
It can be understood that the multi-dimensional text information extraction module provides text-side information support for the multimodal retrieval system, mainly by expanding the semantic representation of text information along different dimensions and increasing the text sample size. In addition, this module provides sufficient data support for text single-modal retrieval: on the one hand it enriches the data content of the text modality, and on the other hand it strengthens the associations among multiple modalities.
Unlike conventional text information extraction, the multi-dimensional text information extraction module combines text translation with speech recognition to make full use of the advantages of multimodal data. Speech recognition is performed on the audio tracks of videos, as well as on data that is natively audio, to obtain paired training data; the overall text data is then passed through text translation, using textual semantic information to improve overall data quality and expand the amount of paired multimodal associated data. Meanwhile, multi-dimensional natural language analysis randomly replaces sentence constituents to build a rich negative-sample space and improve the robustness of the model.
The multi-dimensional text information extraction module can be subdivided into a speech data processing sub-module and a natural language text extension sub-module.
The speech data processing sub-module mainly covers audio extraction and deep-learning-based speech recognition; its structure is shown in FIG. 6.
High-dimensional modalities carry a large amount of information, and projecting them to low dimensions can greatly expand the low-dimensional modal data. Converting high-dimensional modalities (such as video and audio) into low-dimensional-modality (text) data provides a large amount of paired associated data content. Audio extraction efficiently strips the audio data out of videos and quickly provides it to subsequent functions.
Deep-learning-based speech recognition uses the attention mechanism to achieve end-to-end training, performing unified speech recognition on the audio data obtained from the various modalities to obtain low-dimensional-modality (i.e., text-modality) information; the end-to-end model forms a complete pipeline that provides a large amount of paired data for subsequent text feature extraction. Meanwhile, the audio features obtained during deep learning can support the audio feature content required for the final cross-modal retrieval.
The data used to train cross-modal retrieval models are all paired associated data. At present, most such data is obtained through human labeling, and publicly available complete datasets can hardly meet the amount of training data required for deep learning. The multi-dimensional text information extraction module applies deep-learning-based translation to convert natural language text into multilingual text information, obtaining a multi-dimensional semantic representation of the current text data, and then converts it back to the original language, achieving the goal of unifying the training language.
The natural language text extension sub-module mainly uses the translation results across multiple languages to obtain semantic descriptions of the current sentence in different word orders and different languages, expanding the existing text data in many ways. In addition, natural language processing can obtain a large amount of negative-sample data based on fine-grained text analysis, making the final cross-modal retrieval model more robust; its structure is shown in FIG. 7.
In an embodiment of the present disclosure, the video intelligent frame extraction module 201 is used to extract from a video several pictures that best represent the video content, specifically including:
extracting every frame of the video to obtain a set of pictures;
mapping the pictures into a unified LUV color space and computing the absolute distance between each frame and the previous frame; and
sorting all extracted frames by the absolute distance, with the top-ranked frames regarded as the pictures that best represent the video content.
It can be understood that a video is composed of picture frames, so a natural connection exists between video-modality data and picture-modality data; to leap from the video modality to the picture modality, it suffices to extract several representative pictures from the video.
To perform video intelligent frame extraction, every frame of the video is first extracted to obtain a set of pictures; the pictures are then mapped into a unified LUV color space, and the absolute distance between each frame and the previous frame is computed, a larger distance indicating a sharper change relative to the previous frame; finally, all extracted frames are sorted by the computed absolute distance, and the top-ranked frames are regarded as the pictures that best represent the video content. Video intelligent frame extraction is shown in FIG. 8.
In an embodiment of the present disclosure, the image search module 202 is used to complete large-scale, efficient image retrieval tasks, specifically including:
extracting image features with a feature extraction technique based on average gray-level comparison gaps; and
quickly retrieving identical or similar pictures from the picture database through the fuzzy query function provided by ElasticSearch.
To meet the need of quickly retrieving and returning identical or similar pictures from the database given a user-input picture, search-by-image technology is indispensable. Many current image retrieval technologies suffer from insufficient retrieval speed and limited retrieval range. This solution proposes an image feature extraction method based on average gray-level comparison, together with a search-by-image technique accelerated by the Elasticsearch search engine, to complete large-scale, efficient image retrieval tasks.
Retrieval speed greatly affects the retrieval experience, and image retrieval, unlike keyword retrieval, involves a significantly larger amount of computation. To speed up image retrieval, this solution first converts the RGB three-channel picture into a gray picture with 255 gray levels; the picture is then cropped appropriately, cutting away the parts that most likely cannot express the picture's characteristics, yielding a gray picture as shown in FIG. 9. Since picture-to-picture similarity must be computed, the feature extraction method is particularly important: this solution selects 9*9 grid points and their surrounding regions in the picture shown in FIG. 9, computes the average gray-level comparison gaps over these rectangular regions, quantizes the gaps, and stores them as the picture's feature. This feature extraction method represents a picture with only an 81*8 matrix, so picture-to-picture similarity computation is fast; and because a single picture requires little storage space, large-scale search-by-image tasks become feasible.
To further improve retrieval speed, this solution implements the image retrieval task on ElasticSearch. Using the above feature extraction method, picture features are stored in ElasticSearch to build a picture retrieval database; unlike a traditional database, the ElasticSearch-based picture database uses an inverted-index mechanism, which greatly improves retrieval speed. When a user inputs a picture, or a picture produced by video intelligent frame extraction is input, its features are extracted first, and the fuzzy query function provided by ElasticSearch then quickly retrieves identical or similar pictures from the picture database.
In an embodiment of the present disclosure, the cross-modal retrieval module 30 includes:
a rough recall module 301, which uses a transformer-based multimodal pre-trained model as a sub-model of the vector embedding model to perform fast rough recall; and
a precise ranking module 302, which uses a transformer-based multimodal pre-trained model as a sub-model of the cross-encoder model to perform precise ranking.
As noted above, both existing mainstream modeling schemes have shortcomings. This solution is the first to organically combine the two, adopting the innovative idea of rough recall followed by precise ranking to improve retrieval efficiency while preserving retrieval quality. The vector embedding model is used for rough information recall, the cross-encoder model then precisely ranks the recalled information, and finally the top-ranked options that best satisfy the retrieval requirements are returned. This architecture can use existing cross-modal pre-trained models and shares parameters between the two models to improve parameter efficiency. The retrieval architecture is shown in FIG. 10.
The rough recall part uses a transformer-based multimodal pre-trained model, such as OSCAR, as a sub-model of the vector embedding model to perform fast rough recall.
As shown in FIG. 11, the vector embedding model contains two pre-trained sub-models that process the text signal and the image signal respectively while sharing parameters. The two sub-models encode the signals of the different modalities separately; the encodings are then mapped into the same high-dimensional multimodal feature space; finally, a standard distance metric, such as Euclidean or cosine distance, computes the similarity between two signals, and the most similar top-k candidates are selected and precisely ranked by the cross-encoder model.
To make the distributions of the input image i and the text caption c closer in the high-dimensional multimodal feature space, corresponding image-text pairs are placed close together in the feature space during training, while unrelated sample pairs are placed farther apart (at least beyond the margin α). The triplet loss, with cosine distance as the distance metric, is therefore used:
L_EMB(i, c) = max(0, cos(i, c') - cos(i, c) + α) + max(0, cos(i', c) - cos(i, c) + α)
where (i, c) is a positive image-text pair from the training corpus, and c' and i' are negative samples drawn from the training corpus such that the image-text pairs (i, c') and (i', c) do not appear in the corpus.
Since the model encodes text and image signals independently, at retrieval time the query text or image only needs to be mapped into the same feature space for distance computation. Data in the database can therefore be encoded offline, ensuring online retrieval efficiency and making the model applicable to large-scale data retrieval; but because the model is not required to learn fine-grained features of the input, it is used only to quickly recall a candidate set, which the cross-encoder model then precisely ranks.
The precise ranking part uses a transformer-based multimodal pre-trained model, such as OSCAR, as a sub-model of the cross-encoder model to perform precise ranking. The precise ranking is shown in FIG. 9.
As can be seen from FIG. 9, the cross-encoder model uses only one pre-trained sub-model: the text and image signals are concatenated, and a neural network then judges their similarity. This solution uses a binary classifier to judge whether a text and an image are related, trained with the cross-entropy loss:
L_CE(i, c) = -(y·log p(i, c) + (1 - y)·log(1 - p(i, c)))
where p(i, c) is the probability that the combination of input image i and text c is a positive sample (i.e., a correct image-text pairing); y = 1 when (i, c) is a positive pair, and y = 0 when (i, c) is a negative pair.
At retrieval time, the roughly recalled top-k candidates are concatenated with the query item in turn, and the similarity probability of each image-text pair is computed, completing the precise ranking.
Although this approach generally performs better, learning more from the interaction of the two signals, its computational cost is high, because every combination must pass through the entire network to obtain the similarity score p(i, c); that is, the method uses no precomputed representations during retrieval, making fast retrieval on large-scale data difficult.
Therefore, the overall flow of this sub-module is as shown in FIG. 12: the vector embedding model first quickly selects top-k roughly relevant candidates according to the user's query item, the cross-encoder model then precisely ranks the candidate set against the query item, and finally the relevant retrieval results are returned to the user. This solution retains both the efficiency of the vector embedding model for retrieval over large-scale datasets and the retrieval accuracy of the cross-encoder model.
This solution makes full use of the advantages of multimodal data: it combines text translation with speech recognition to improve the overall data quality, the amount of multimodal associated data, and the robustness of the model; it extracts image features with the technique based on average gray-level comparison gaps and uses the ElasticSearch search engine to retrieve the features quickly, realizing large-scale, efficient image retrieval; and it combines the respective advantages of the vector embedding model and the cross-encoder model, innovatively adopting a rough-recall-then-precise-ranking strategy at retrieval time to achieve fast and effective cross-modal retrieval on large-scale data.
Compared with current mainstream cross-modal retrieval technology, the advantages of this solution are as follows. First, a joint retrieval framework is proposed that combines the fast retrieval speed of the vector embedding model with the high retrieval quality of the cross-encoder model; by adopting rough recall followed by precise ranking at retrieval time, it achieves fast and effective cross-modal retrieval on large-scale data while sharing the parameters of the two models to improve parameter efficiency; the framework is applicable to any cross-modal pre-trained model, so existing models can be used without training from scratch, giving it a wide range of application scenarios. Second, combining multi-dimensional text information extraction with intelligent image retrieval realizes fast single-modal retrieval, remedying the inability of current mainstream cross-modal retrieval models to retrieve information within the same modality: multi-dimensional text information extraction enriches the information content of the text modality on the one hand and strengthens the associations among multiple modalities on the other, while also realizing speech-to-text conversion; intelligent image retrieval realizes the conversion from video-modality data to picture-modality data, can extract picture features from information such as pixels, color, and texture, and efficiently retrieves identical or highly similar pictures from the database.
The cross-modal retrieval system based on pre-trained models and recall ranking proposed by the embodiments of the present disclosure addresses the dynamic, multi-source, multi-modal characteristics of cross-modal retrieval data and the problems of the two current mainstream modeling methods by organically combining the two methods, adopting the idea of rough recall followed by precise ranking, and integrating the strengths of both schemes to achieve efficient and fast cross-modal retrieval; in addition, this solution proposes text query based on inverted-index retrieval and high-dimensional image feature retrieval based on color and texture, realizing fast retrieval across multiple modalities and providing users with a good experience.
To implement the above embodiments, the present disclosure also proposes a cross-modal retrieval method based on pre-trained models and recall ranking.
FIG. 2 is a schematic diagram of a cross-modal retrieval method based on pre-trained models and recall ranking provided by an embodiment of the present disclosure.
As shown in FIG. 2, the cross-modal retrieval method based on pre-trained models and recall ranking includes steps S101 to S103.
S101: extract text information, expand the semantic representation of the text information along different dimensions, and increase the amount of text samples.
S102: extract image information, extract from a video several pictures that best represent the video content, and retrieve identical or similar pictures from the database.
S103: generate a roughly relevant candidate set according to a query item, precisely rank the candidate set, and finally return relevant retrieval results.
In an embodiment of the present disclosure, said extracting text information includes:
audio extraction and deep-learning-based speech recognition; and
obtaining semantic descriptions of the current sentence in different word orders and different languages, expanding the existing text data in many ways, and obtaining a large amount of negative-sample data based on fine-grained text analysis.
In an embodiment of the present disclosure, said extracting from a video several pictures that best represent the video content includes:
extracting every frame of the video to obtain a set of pictures;
mapping the pictures into a unified LUV color space and computing the absolute distance between each frame and the previous frame; and
sorting all extracted frames by the absolute distance, with the top-ranked frames regarded as the pictures that best represent the video content.
In an embodiment of the present disclosure, said retrieving identical or similar pictures from the database includes:
extracting image features with a feature extraction technique based on average gray-level comparison gaps; and
quickly retrieving identical or similar pictures from the picture database through the fuzzy query function provided by ElasticSearch.
In an embodiment of the present disclosure, said generating a roughly relevant candidate set according to the query item and precisely ranking the candidate set includes:
using a transformer-based multimodal pre-trained model as a sub-model of the vector embedding model to perform fast rough recall; and
using a transformer-based multimodal pre-trained model as a sub-model of the cross-encoder model to perform precise ranking.
To implement the above embodiments, the present disclosure also proposes a non-transitory computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the cross-modal retrieval method based on pre-trained models and recall ranking described in any of the above embodiments.
To implement the above embodiments, the present disclosure also proposes an electronic device, including: a memory; a processor; and a computer program stored in the memory and runnable on the processor, where the processor, when executing the computer program, implements the cross-modal retrieval method based on pre-trained models and recall ranking described in any of the above embodiments.
To implement the above embodiments, the present disclosure also proposes a computer program product, including a computer program that, when executed by a processor, implements the cross-modal retrieval method based on pre-trained models and recall ranking described in any of the above embodiments.
To implement the above embodiments, the present disclosure also proposes a computer program, including computer program code that, when run on a computer, causes the computer to execute the cross-modal retrieval method based on pre-trained models and recall ranking described in any of the above embodiments.
It should be noted that the foregoing explanations of the embodiments of the cross-modal retrieval method based on pre-trained models and recall ranking also apply to the non-transitory computer-readable storage medium, computer device, computer program product, and computer program of the above embodiments, and are not repeated here.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "an example", "a specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in conjunction with the embodiment or example is included in at least one embodiment or example of the present disclosure. In this specification, such schematic expressions of the above terms do not necessarily refer to the same embodiment or example, and the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples. Moreover, without mutual contradiction, those skilled in the art may combine different embodiments or examples described in this specification and the features of different embodiments or examples.
In addition, the terms "first" and "second" are used for descriptive purposes only and cannot be interpreted as indicating or implying relative importance or implicitly specifying the quantity of the indicated technical features. Thus, features defined as "first" and "second" may explicitly or implicitly include at least one such feature. In the description of the present disclosure, "plurality" means at least two, such as two or three, unless otherwise specifically defined.
Although the embodiments of the present disclosure have been shown and described above, it can be understood that the above embodiments are exemplary and shall not be construed as limiting the present disclosure; those of ordinary skill in the art may make changes, modifications, substitutions, and variations to the above embodiments within the scope of the present disclosure.

Claims (14)

  1. A cross-modal retrieval system based on pre-trained models and recall ranking, comprising:
    a multi-dimensional text information extraction module, used to provide text-side information support for the cross-modal retrieval system, expand the semantic representation of text information along different dimensions, and increase the amount of text samples;
    an intelligent image retrieval module, comprising a video intelligent frame extraction module and an image search module, wherein the video intelligent frame extraction module is used to extract from a video several pictures that best represent the video content, and the image search module is used to complete large-scale, efficient image retrieval tasks; and
    a cross-modal retrieval module, used to generate a roughly relevant candidate set according to a query item, precisely rank the candidate set, and finally return relevant retrieval results.
  2. The system according to claim 1, wherein the multi-dimensional text information extraction module comprises:
    a speech data processing module, used for audio extraction and deep-learning-based speech recognition; and
    a natural language text extension module, used to obtain semantic descriptions of the current sentence in different word orders and different languages, to expand the existing text data in many ways, and to obtain a large amount of negative-sample data based on fine-grained text analysis.
  3. The system according to claim 1 or 2, wherein the video intelligent frame extraction module is used to extract from a video several pictures that best represent the video content, specifically comprising:
    extracting every frame of the video to obtain a set of pictures;
    mapping the pictures into a unified LUV color space and computing the absolute distance between each frame and the previous frame; and
    sorting all extracted frames by the absolute distance, the top-ranked frames being regarded as the pictures that best represent the video content.
  4. The system according to any one of claims 1 to 3, wherein the image search module is used to complete large-scale, efficient image retrieval tasks, specifically comprising:
    extracting image features with a feature extraction technique based on average gray-level comparison gaps; and
    quickly retrieving identical or similar pictures from the picture database through the fuzzy query function provided by ElasticSearch.
  5. The system according to any one of claims 1 to 4, wherein the cross-modal retrieval module comprises:
    a rough recall module, which uses a transformer-based multimodal pre-trained model as a sub-model of the vector embedding model to perform fast rough recall; and
    a precise ranking module, which uses a transformer-based multimodal pre-trained model as a sub-model of the cross-encoder model to perform precise ranking.
  6. A cross-modal retrieval method based on pre-trained models and recall ranking, comprising:
    extracting text information, expanding the semantic representation of the text information along different dimensions, and increasing the amount of text samples;
    extracting image information, extracting from a video several pictures that best represent the video content, and retrieving identical or similar pictures from the database; and
    generating a roughly relevant candidate set according to a query item, precisely ranking the candidate set, and finally returning relevant retrieval results.
  7. The method according to claim 6, wherein said extracting text information comprises:
    audio extraction and deep-learning-based speech recognition; and
    obtaining semantic descriptions of the current sentence in different word orders and different languages, expanding the existing text data in many ways, and obtaining a large amount of negative-sample data based on fine-grained text analysis.
  8. The method according to claim 6 or 7, wherein said extracting from a video several pictures that best represent the video content comprises:
    extracting every frame of the video to obtain a set of pictures;
    mapping the pictures into a unified LUV color space and computing the absolute distance between each frame and the previous frame; and
    sorting all extracted frames by the absolute distance, the top-ranked frames being regarded as the pictures that best represent the video content.
  9. The method according to any one of claims 6 to 8, wherein said retrieving identical or similar pictures from the database comprises:
    extracting image features with a feature extraction technique based on average gray-level comparison gaps; and
    quickly retrieving identical or similar pictures from the picture database through the fuzzy query function provided by ElasticSearch.
  10. The method according to any one of claims 6 to 9, wherein said generating a roughly relevant candidate set according to the query item and precisely ranking the candidate set comprises:
    using a transformer-based multimodal pre-trained model as a sub-model of the vector embedding model to perform fast rough recall; and
    using a transformer-based multimodal pre-trained model as a sub-model of the cross-encoder model to perform precise ranking.
  11. A non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the cross-modal retrieval method based on pre-trained models and recall ranking according to any one of claims 6 to 10.
  12. An electronic device, comprising:
    a memory;
    a processor; and
    a computer program stored in the memory and runnable on the processor,
    wherein the processor, when executing the computer program, implements the cross-modal retrieval method based on pre-trained models and recall ranking according to any one of claims 6 to 10.
  13. A computer program product, comprising a computer program that, when executed by a processor, implements the cross-modal retrieval method based on pre-trained models and recall ranking according to any one of claims 6 to 10.
  14. A computer program, comprising computer program code that, when run on a computer, causes the computer to execute the cross-modal retrieval method based on pre-trained models and recall ranking according to any one of claims 6 to 10.
PCT/CN2022/087219 2021-10-21 2022-04-15 Cross-modal retrieval system and method based on pre-trained models and recall ranking WO2023065617A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111229288.6 2021-10-21
CN202111229288.6A CN114419387A (zh) 2021-10-21 2021-10-21 Cross-modal retrieval system and method based on pre-trained models and recall ranking

Publications (1)

Publication Number Publication Date
WO2023065617A1 true WO2023065617A1 (zh) 2023-04-27

Family

ID=81266522

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/087219 WO2023065617A1 (zh) 2021-10-21 2022-04-15 Cross-modal retrieval system and method based on pre-trained models and recall ranking

Country Status (2)

Country Link
CN (1) CN114419387A (zh)
WO (1) WO2023065617A1 (zh)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116229332A (zh) * 2023-05-06 2023-06-06 浪潮电子信息产业股份有限公司 Training method, apparatus, and device for a video pre-trained model, and storage medium
CN116523024A (zh) * 2023-07-03 2023-08-01 腾讯科技(深圳)有限公司 Training method, apparatus, and device for a recall model, and storage medium
CN116578693A (zh) * 2023-07-14 2023-08-11 深圳须弥云图空间科技有限公司 Text retrieval method and apparatus
CN117033308A (zh) * 2023-08-28 2023-11-10 中国电子科技集团公司第十五研究所 Multimodal retrieval method and apparatus based on a specific range
CN117312688A (zh) * 2023-11-29 2023-12-29 浙江大学 Cross-source data retrieval method, medium, and device based on a spatio-temporal asset catalog
CN117746344A (zh) * 2024-02-21 2024-03-22 厦门农芯数字科技有限公司 Event analysis method, apparatus, and device for pig-farm surveillance video
CN117953351A (zh) * 2024-03-27 2024-04-30 之江实验室 Decision-making method based on model reinforcement learning
CN118394946A (zh) * 2024-06-28 2024-07-26 中国人民解放军国防科技大学 Retrieval-augmented generation method and system based on multi-view clustering
CN118536606A (zh) * 2024-07-25 2024-08-23 浙江空港数字科技有限公司 Human-computer interaction method, apparatus, and electronic device
CN118709147A (zh) * 2024-08-29 2024-09-27 北京网智天元大数据科技有限公司 Chinese-Tibetan multimodal image-text processing method and processing system

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115329749B (zh) * 2022-10-14 2023-01-10 成都数之联科技股份有限公司 Joint recall-and-ranking training method and system for semantic retrieval

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140324828A1 (en) * 2013-04-30 2014-10-30 Microsoft Corporation Search result tagging
CN110472081A (zh) * 2019-08-23 2019-11-19 大连海事大学 Cross-domain shoe image retrieval method based on metric learning
CN111949806A (zh) * 2020-08-03 2020-11-17 中电科大数据研究院有限公司 Cross-media retrieval method based on a Resnet-Bert network model
CN112035728A (zh) * 2020-08-21 2020-12-04 中国电子科技集团公司电子科学研究院 Cross-modal retrieval method, apparatus, and readable storage medium

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116229332B (zh) * 2023-05-06 2023-08-04 浪潮电子信息产业股份有限公司 Training method, apparatus, and device for a video pre-trained model, and storage medium
CN116229332A (zh) * 2023-05-06 2023-06-06 浪潮电子信息产业股份有限公司 Training method, apparatus, and device for a video pre-trained model, and storage medium
CN116523024B (zh) * 2023-07-03 2023-10-13 腾讯科技(深圳)有限公司 Training method, apparatus, and device for a recall model, and storage medium
CN116523024A (zh) * 2023-07-03 2023-08-01 腾讯科技(深圳)有限公司 Training method, apparatus, and device for a recall model, and storage medium
CN116578693B (zh) * 2023-07-14 2024-02-20 深圳须弥云图空间科技有限公司 Text retrieval method and apparatus
CN116578693A (zh) * 2023-07-14 2023-08-11 深圳须弥云图空间科技有限公司 Text retrieval method and apparatus
CN117033308A (zh) * 2023-08-28 2023-11-10 中国电子科技集团公司第十五研究所 Multimodal retrieval method and apparatus based on a specific range
CN117033308B (zh) * 2023-08-28 2024-03-26 中国电子科技集团公司第十五研究所 Multimodal retrieval method and apparatus based on a specific range
CN117312688A (zh) * 2023-11-29 2023-12-29 浙江大学 Cross-source data retrieval method, medium, and device based on a spatio-temporal asset catalog
CN117312688B (zh) * 2023-11-29 2024-01-26 浙江大学 Cross-source data retrieval method, medium, and device based on a spatio-temporal asset catalog
CN117746344A (zh) * 2024-02-21 2024-03-22 厦门农芯数字科技有限公司 Event analysis method, apparatus, and device for pig-farm surveillance video
CN117746344B (zh) * 2024-02-21 2024-05-14 厦门农芯数字科技有限公司 Event analysis method, apparatus, and device for pig-farm surveillance video
CN117953351A (zh) * 2024-03-27 2024-04-30 之江实验室 Decision-making method based on model reinforcement learning
CN118394946A (zh) * 2024-06-28 2024-07-26 中国人民解放军国防科技大学 Retrieval-augmented generation method and system based on multi-view clustering
CN118536606A (zh) * 2024-07-25 2024-08-23 浙江空港数字科技有限公司 Human-computer interaction method, apparatus, and electronic device
CN118709147A (zh) * 2024-08-29 2024-09-27 北京网智天元大数据科技有限公司 Chinese-Tibetan multimodal image-text processing method and processing system

Also Published As

Publication number Publication date
CN114419387A (zh) 2022-04-29

Similar Documents

Publication Publication Date Title
WO2023065617A1 (zh) Cross-modal retrieval system and method based on pre-trained models and recall ranking
Kaur et al. Comparative analysis on cross-modal information retrieval: A review
Agnese et al. A survey and taxonomy of adversarial neural networks for text‐to‐image synthesis
CN113762322B (zh) Video classification method, apparatus, and device based on multimodal representation, and storage medium
CN113157965B (zh) Audio visualization model training and audio visualization method, apparatus, and device
CN111581401A (zh) Local citation recommendation system and method based on deep relevance matching
CN113672693B (zh) Tag recommendation method for an online question-answering platform based on knowledge graphs and tag association
CN116975615A (zh) Task prediction method and apparatus based on multimodal video information
CN113392265A (zh) Multimedia processing method, apparatus, and device
CN111949824A (zh) Visual question answering method and system based on semantic alignment, and storage medium
CN115203421A (zh) Label generation method, apparatus, and device for long text, and storage medium
CN110866129A (zh) Cross-media retrieval method based on a unified cross-media representation model
CN116362221A (zh) Keyword similarity determination method for aviation literature fusing a multimodal semantic association graph
CN113094534A (zh) Deep-learning-based multimodal image-text recommendation method and device
CN116977992A (zh) Text information recognition method, apparatus, computer device, and storage medium
CN109800435A (zh) Language model training method and apparatus
Liu et al. A multimodal approach for multiple-relation extraction in videos
CN115408488A (zh) Segmentation method and system for novel-scene text
CN114661951A (zh) Video processing method, apparatus, computer device, and storage medium
KR102034668B1 (ko) Apparatus and method for providing a heterogeneous content recommendation model
CN112084788B (zh) Automatic annotation method and system for implicit sentiment tendency of image captions
Tian et al. Research on image classification based on a combination of text and visual features
CN117909555A (zh) Multimodal information retrieval method, apparatus, device, readable storage medium, and computer program product
CN117540039A (zh) Data retrieval method based on an unsupervised cross-modal hashing algorithm
CN116756363A (zh) Strongly-correlated unsupervised cross-modal retrieval method guided by information content

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22882242

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22882242

Country of ref document: EP

Kind code of ref document: A1