CN110866129A - Cross-media retrieval method based on cross-media uniform characterization model - Google Patents
- Publication number
- CN110866129A CN110866129A CN201911061277.4A CN201911061277A CN110866129A CN 110866129 A CN110866129 A CN 110866129A CN 201911061277 A CN201911061277 A CN 201911061277A CN 110866129 A CN110866129 A CN 110866129A
- Authority
- CN
- China
- Prior art keywords
- cross
- media
- data
- retrieval
- original domain
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/40—Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
- G06F16/48—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Library & Information Science (AREA)
- Multimedia (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a cross-media retrieval method based on a cross-media unified representation model, addressing the problem of cross-media retrieval. The method comprises the following steps: (1) constructing a cross-media database: establishing a large cross-media database oriented to the government-affairs news domain; (2) preprocessing cross-media data: preprocessing input data such as text, images, video and audio; (3) extracting original-domain features of the cross-media data: extracting original-domain feature vectors for each medium; (4) uniformly representing the cross-media data: extracting feature vectors of the cross-media data in a common representation space; (5) computing and ranking semantic similarity: computing the semantic similarity between the retrieval target and the data in the cross-media database, ranking by similarity, and outputting the results. The invention provides both a method for mutual retrieval across four types of media data and a unified representation model for multiple media, improves cross-media semantic retrieval precision, and has broad application prospects.
Description
Technical Field
The invention relates to a cross-media retrieval method based on a cross-media unified representation model, belonging to the technical fields of natural language processing, computer vision and cross-media data retrieval. The method comprises extracting original-domain features of multimedia data, uniformly representing the data through a cross-media unified representation model, constructing a cross-media database, and computing and ranking the similarity of cross-media data.
Background
With the advent of the big-data era, data in all industries has grown explosively. Intelligent applications represented by 5G and the Internet of Things generate a large amount of multimedia data at every moment, including massive unstructured data such as text, images, video and audio. How to better organize, retrieve and query cross-media data has become a major challenge and research focus in the field of information retrieval, for example retrieving images, video and audio through text, or retrieving text, audio and other media through video.
For multimedia collections of text, images, video and audio, most retrieval systems still rely on text-keyword search. For example, Google's image and video retrieval is still based on text keywords: the basic flow is to extract keyword labels from the unstructured data, where the labels may come from the text surrounding a picture, file names, data subject tags, object-detection labels and the like, plus a small amount of manual labels from the Internet. Because producers of multimedia information differ in cultural background and professional knowledge, the text associated with a picture is often highly unreliable. Moreover, for multimedia content such as images and video it is generally difficult to give an effective and accurate natural-language description, and the essential content and semantic relationships cannot be fully expressed. Retrieving pictures and videos by their associated text therefore struggles to satisfy users' query needs, and the search accuracy is very low.
To address cross-media data retrieval, semantic-embedding methods based on machine learning and deep learning have become a research focus. The VSE++ model learns a visual-semantic embedding through hard-negative mining and improves cross-media retrieval precision; the ACMR and CM-GANs models train with a generative adversarial idea and achieve good performance on the Wikipedia and NUS-WIDE datasets. Most existing cross-media retrieval methods with good results adopt deep neural network models, which are usually poorly interpretable. Moreover, adversarial models often assume that the transformation of data into the common representation space is a linearly invertible transformation and add an inverse-transformation constraint, which contradicts the nonlinear nature of neural network transformations.
Disclosure of Invention
To solve the above technical problems, the invention provides a cross-media retrieval method based on a cross-media unified representation model; the unified representation model supports retrieval across four types of media data and is used for cross-media data retrieval to improve retrieval precision.
The invention is realized by the following technical scheme.
The invention provides a cross-media retrieval method based on a cross-media uniform representation model, which comprises the following steps:
① constructing a cross-media database: establishing a cross-media database oriented to the government-affairs news domain;
② preprocessing cross-media data: preprocessing the input data of the cross-media database to obtain cross-media data;
③ extracting original-domain features of the cross-media data, namely extracting original-domain feature vectors of the cross-media data;
④ uniformly representing the cross-media data, namely generating, through adversarial training of a deep neural network model, a cross-media unified representation model supporting input of four types of media data, and extracting the common-space feature vectors output by the model;
⑤ computing and ranking retrieval semantic similarity, namely computing the cosine similarity between the common-space feature vector output by the cross-media unified representation model and the feature vectors of the data in the cross-media database, ranking by similarity, and outputting the top K most similar items as the retrieval result.
In step ①, the government-affairs news domain covers government news, political figures and political events, and the cross-media database stores four types of unstructured data: text, images, video and audio.
In step ②, the data formats and dimensions of the multimedia retrieval input data (text, image, video and audio) are preprocessed; the audio data is converted into a spectrogram image as the audio input, and the text is segmented to obtain a word-segmentation array.
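As an illustrative sketch only (not part of the claimed method), the preprocessing of step ② — converting audio to a spectrogram image and segmenting text — could look as follows; the function names `spectrogram` and `segment` are hypothetical, and a real system would use a proper Chinese word segmenter rather than whitespace splitting:

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Short-time Fourier magnitude spectrogram of a 1-D audio signal.
    Returns an array of shape (n_frames, frame_len // 2 + 1) that can be
    treated as a grayscale image input for a convolutional network."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

def segment(text):
    """Toy whitespace 'word segmentation' placeholder."""
    return text.split()

audio = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s, 440 Hz tone
spec = spectrogram(audio)
tokens = segment("Nobel Prize 2019 Physiology Medicine")
```

The spectrogram magnitudes are non-negative and can be rescaled to an image range before being fed to the audio network.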
In step ③, a word2vec model is used to extract original-domain feature vectors from the text data, a deep convolutional network is used to extract original-domain features from the image data, a C3D (three-dimensional convolutional) network is used to extract original-domain features from the video data, and a deep convolutional network is used to extract original-domain feature vectors from the audio data.
The word-segmentation array of the text is obtained through word segmentation.
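A minimal sketch of the text branch of step ③, under the assumption that pretrained word2vec vectors are available (here replaced by a random toy vocabulary for illustration) and that the text feature is obtained by mean-pooling the word vectors — a common choice, though the patent does not specify the pooling:

```python
import numpy as np

# Hypothetical toy vocabulary; a real system would load word2vec vectors
# trained on the government-news corpus (e.g. via gensim).
rng = np.random.default_rng(0)
vocab = {w: rng.standard_normal(8) for w in
         ["nobel", "prize", "physiology", "medicine", "2019"]}

def text_features(tokens, dim=8):
    """Original-domain text feature: mean of the word vectors of the
    tokens present in the vocabulary (zero vector if none match)."""
    vecs = [vocab[t] for t in tokens if t in vocab]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

q1 = text_features(["nobel", "prize", "2019"])
```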
The invention has the beneficial effects that:
1. A method supporting unified representation of four types of media data (text, image, video and audio) is provided; the cross-media unified representation model is trained with a method based on the generative adversarial idea, reducing the semantic gap between the representations of different media;
2. A cross-media data retrieval method based on the cross-media unified representation model is provided, achieving mutual retrieval among the four types of media data.
Drawings
Fig. 1 is a flow chart of the present invention.
Detailed Description
The technical solution of the present invention is further described below, but the claimed scope of protection is not limited to what is described.
As shown in fig. 1, a cross-media retrieval method based on a cross-media uniform characterization model includes the following steps:
① constructing a cross-media database: establishing a cross-media database oriented to the government-affairs news domain;
② preprocessing cross-media data: preprocessing the input data of the cross-media database to obtain cross-media data;
③ extracting original-domain features of the cross-media data, namely extracting original-domain feature vectors of the cross-media data;
④ uniformly representing the cross-media data, namely generating, through adversarial training of a deep neural network model, a cross-media unified representation model supporting input of four types of media data, and extracting the common-space feature vectors output by the model;
⑤ computing and ranking retrieval semantic similarity, namely computing the cosine similarity between the common-space feature vector output by the cross-media unified representation model and the feature vectors of the data in the cross-media database, ranking by similarity, and outputting the top K most similar items as the retrieval result.
In step ①, the government-affairs news domain covers government news, political figures and political events, and the cross-media database stores four types of unstructured data: text, images, video and audio.
In step ②, the data formats and dimensions of the multimedia retrieval input data (text, image, video and audio) are preprocessed; the audio data is converted into a spectrogram image as the audio input, and the text is segmented to obtain a word-segmentation array.
In step ③, a word2vec model is used to extract original-domain feature vectors from the text data, yielding word-vector representations; a deep convolutional network is used to extract original-domain features from the image data; a C3D (three-dimensional convolutional) network is used to extract original-domain features from the video data, i.e., a sequence of images with a fixed number of frames is obtained by sampling the video and the C3D model then extracts the video features; and a deep convolutional network is used to extract original-domain feature vectors from the audio data, i.e., the audio spectrogram image is input to the deep convolutional network for feature extraction.
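The fixed-frame-count video sampling described above can be sketched as follows; this is an illustrative uniform-sampling scheme (the patent does not specify the sampling strategy), and `sample_frames` is a hypothetical name:

```python
import numpy as np

def sample_frames(video, n_frames=16):
    """Uniformly sample a fixed number of frames from a video tensor of
    shape (T, H, W, C), producing the fixed-length clip that a
    C3D-style three-dimensional convolutional network expects."""
    t = video.shape[0]
    idx = np.linspace(0, t - 1, n_frames).round().astype(int)
    return video[idx]

clip = np.zeros((300, 112, 112, 3), dtype=np.uint8)  # 300-frame dummy video
sampled = sample_frames(clip)
```

16 frames at 112x112 matches the input size commonly used with C3D, but both values are configurable.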
And acquiring a word segmentation array of the text through word segmentation.
Specifically, the cross-media unified representation model is trained with an adversarial training method. In training, the modality-discriminator loss is the cross-entropy loss over all samples across modalities:

L_{adv}(\theta_D) = -\frac{1}{n}\sum_{i=1}^{n}\Big(m_i\log D(v_i;\theta_D) + (1-m_i)\log\big(1-D(t_i;\theta_D)\big)\Big)

where D(\cdot;\theta_D) denotes the probability that an image or text sample is discriminated as an image or a text, and m_i is the ground-truth label indicating whether sample i is an image or a text;
the cross-media data characterization loss function is:
Lemd(θV,θTiθimd)=ω1×Limi+ω2×Limd+Lreg
wherein L isimiIs an inter-modal structure invariant loss function, LimdIs an intra-modal data class loss function, LregRegularizing term, ω, for model parameters1、ω2Is a model hyper-parameter;
the model training optimization function based on the generation of the countermeasure train is as follows:
wherein the threshold value of max is thetaD。
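A minimal numerical sketch of these losses, assuming the standard cross-entropy form for L_adv and treating the component losses L_imi, L_imd and L_reg as precomputed scalars (all function names are illustrative, not the patent's implementation):

```python
import numpy as np

def discriminator_loss(p_img, m):
    """Cross-entropy modality-discriminator loss L_adv: p_img is the
    predicted probability that each common-space sample is an image,
    m the true modality label (1 = image, 0 = text)."""
    eps = 1e-12  # numerical guard against log(0)
    return -np.mean(m * np.log(p_img + eps) + (1 - m) * np.log(1 - p_img + eps))

def embedding_loss(l_imi, l_imd, l_reg, w1=1.0, w2=1.0):
    """Combined representation loss L_emd = w1*L_imi + w2*L_imd + L_reg."""
    return w1 * l_imi + w2 * l_imd + l_reg

# The embedding networks minimize (L_emd - L_adv);
# the discriminator maximizes the same quantity over its own parameters.
p = np.array([0.9, 0.2, 0.8, 0.1])  # discriminator outputs
m = np.array([1.0, 0.0, 1.0, 0.0])  # true modality labels
l_adv = discriminator_loss(p, m)
objective = embedding_loss(0.5, 0.3, 0.01) - l_adv
```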
Examples
As described above, a cross-media retrieval method based on a cross-media uniform characterization model includes the following steps:
step 1: cross-media data pre-processing
The text input is: "The 2019 Nobel Prize in Physiology or Medicine was awarded to the American scientists William Kaelin and Gregg Semenza and the British scientist Peter Ratcliffe, in recognition of their contributions in discovering how cells sense and adapt to oxygen availability."
After text word-segmentation preprocessing, the segmentation result is: [2019; Nobel; Physiology; Medicine Prize; awarded; America; scientist; William Kaelin; Gregg Semenza; Britain; scientist; Peter Ratcliffe; recognition; their; discovering; cell; how; sense; adapt; oxygen; availability; aspect; contribution]
Step 2: cross-media data origin domain feature extraction
A text feature vector is obtained with the word2vec model: Q1 = [1, 1, 0, 0, 0, 1, 0, ...];
and step 3: unified characterization across media data
The feature vector Q2 of the text in the common representation space is obtained through the cross-media unified representation model;
and 4, step 4: data retrieval semantic similarity calculation and ranking
The cosine similarity between Q2 and the feature vectors of all cross-media data in the database {V1, V2, V3, ..., T1, T2, ...} is computed; the results are sorted by similarity and the retrieval result is output.
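The similarity computation and ranking of step 4 can be sketched as follows; the vectors below are toy three-dimensional examples rather than real model outputs, and `top_k` is an illustrative name:

```python
import numpy as np

def top_k(query, database, k=3):
    """Rank database items by cosine similarity to the query vector and
    return the indices of the k most similar ones plus all similarities."""
    q = query / np.linalg.norm(query)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    sims = db @ q                      # cosine similarity per database row
    return np.argsort(-sims)[:k], sims

q2 = np.array([1.0, 0.0, 1.0])         # common-space query vector
db = np.array([[1.0, 0.0, 1.0],        # identical direction -> similarity 1
               [0.0, 1.0, 0.0],        # orthogonal -> similarity 0
               [1.0, 0.0, 0.0]])       # partial overlap
idx, sims = top_k(q2, db, k=2)
```

Because all compared vectors live in the common representation space, the same routine ranks text, image, video and audio items together.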
Claims (5)
1. A cross-media retrieval method based on a cross-media uniform characterization model is characterized in that: the method comprises the following steps:
① constructing a cross-media database: establishing a cross-media database oriented to the government-affairs news domain;
② preprocessing cross-media data: preprocessing the input data of the cross-media database to obtain cross-media data;
③ extracting original-domain features of the cross-media data, namely extracting original-domain feature vectors of the cross-media data;
④ uniformly representing the cross-media data, namely generating, through adversarial training of a deep neural network model, a cross-media unified representation model supporting input of four types of media data, and extracting the common-space feature vectors output by the model;
⑤ computing and ranking retrieval semantic similarity, namely computing the cosine similarity between the common-space feature vector output by the cross-media unified representation model and the feature vectors of the data in the cross-media database, ranking by similarity, and outputting the top K most similar items as the retrieval result.
2. The cross-media retrieval method based on the cross-media uniform characterization model according to claim 1, wherein in step ①, the government-affairs news domain covers government news, political figures and political events, and the cross-media database stores four types of unstructured data: text, images, video and audio.
3. The method according to claim 1, wherein in step ②, the data formats and dimensions of the multimedia retrieval input data (text, image, video and audio) are preprocessed; the audio data is converted into a spectrogram image as the audio input, and the text is segmented to obtain a word-segmentation array.
4. The cross-media retrieval method based on the cross-media uniform characterization model according to claim 1, wherein in step ③, a word2vec model is used to extract original-domain feature vectors from the text data, a deep convolutional network is used to extract original-domain features from the image data, a C3D network is used to extract original-domain features from the video data, and a deep convolutional network is used to extract original-domain feature vectors from the audio data.
5. The cross-media retrieval method based on the cross-media uniform characterization model according to claim 3, wherein the word-segmentation array of the text is obtained through word segmentation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911061277.4A CN110866129A (en) | 2019-11-01 | 2019-11-01 | Cross-media retrieval method based on cross-media uniform characterization model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911061277.4A CN110866129A (en) | 2019-11-01 | 2019-11-01 | Cross-media retrieval method based on cross-media uniform characterization model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110866129A true CN110866129A (en) | 2020-03-06 |
Family
ID=69654308
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911061277.4A Pending CN110866129A (en) | 2019-11-01 | 2019-11-01 | Cross-media retrieval method based on cross-media uniform characterization model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110866129A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111813967A (en) * | 2020-07-14 | 2020-10-23 | Institute of Scientific and Technical Information of China | Retrieval method, retrieval device, computer equipment and storage medium |
CN111949806A (en) * | 2020-08-03 | 2020-11-17 | CETC Big Data Research Institute Co., Ltd. | Cross-media retrieval method based on Resnet-Bert network model |
CN112528127A (en) * | 2020-05-30 | 2021-03-19 | Shandong Technology and Business University | Big data-based plane design work matching degree analysis system |
CN112559820A (en) * | 2020-12-17 | 2021-03-26 | Aerospace Information Research Institute, Chinese Academy of Sciences | Sample data set intelligent question setting method, device and equipment based on deep learning |
CN115309941A (en) * | 2022-08-19 | 2022-11-08 | China Unicom Wo Music Culture Co., Ltd. | AI-based intelligent tag retrieval method and system |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104166684A (en) * | 2014-07-24 | 2014-11-26 | Peking University | Cross-media retrieval method based on uniform sparse representation |
CN105701225A (en) * | 2016-01-15 | 2016-06-22 | Peking University | Cross-media search method based on unification association supergraph protocol |
CN108319686A (en) * | 2018-02-01 | 2018-07-24 | Peking University Shenzhen Graduate School | Antagonism cross-media retrieval method based on limited text space |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104166684A (en) * | 2014-07-24 | 2014-11-26 | Peking University | Cross-media retrieval method based on uniform sparse representation |
CN105701225A (en) * | 2016-01-15 | 2016-06-22 | Peking University | Cross-media search method based on unification association supergraph protocol |
CN108319686A (en) * | 2018-02-01 | 2018-07-24 | Peking University Shenzhen Graduate School | Antagonism cross-media retrieval method based on limited text space |
Non-Patent Citations (2)
Title |
---|
WANG B ET AL.: "Adversarial Cross-Modal Retrieval", ACM * |
DONG Jianfeng: "Research on Relevance Computation in Cross-Modal Retrieval", China Doctoral Dissertations Full-text Database, Information Science and Technology Series * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112528127A (en) * | 2020-05-30 | 2021-03-19 | Shandong Technology and Business University | Big data-based plane design work matching degree analysis system |
CN111813967A (en) * | 2020-07-14 | 2020-10-23 | Institute of Scientific and Technical Information of China | Retrieval method, retrieval device, computer equipment and storage medium |
CN111813967B (en) * | 2020-07-14 | 2024-01-30 | Institute of Scientific and Technical Information of China | Retrieval method, retrieval device, computer equipment and storage medium |
CN111949806A (en) * | 2020-08-03 | 2020-11-17 | CETC Big Data Research Institute Co., Ltd. | Cross-media retrieval method based on Resnet-Bert network model |
CN112559820A (en) * | 2020-12-17 | 2021-03-26 | Aerospace Information Research Institute, Chinese Academy of Sciences | Sample data set intelligent question setting method, device and equipment based on deep learning |
CN115309941A (en) * | 2022-08-19 | 2022-11-08 | China Unicom Wo Music Culture Co., Ltd. | AI-based intelligent tag retrieval method and system |
CN115309941B (en) * | 2022-08-19 | 2023-03-10 | China Unicom Wo Music Culture Co., Ltd. | AI-based intelligent tag retrieval method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Kaur et al. | Comparative analysis on cross-modal information retrieval: A review | |
CN110866129A (en) | Cross-media retrieval method based on cross-media uniform characterization model | |
WO2023065617A1 (en) | Cross-modal retrieval system and method based on pre-training model and recall and ranking | |
CN108268600B (en) | AI-based unstructured data management method and device | |
CN111274790B (en) | Chapter-level event embedding method and device based on syntactic dependency graph | |
JP2006510114A (en) | Representation of content in conceptual model space and method and apparatus for retrieving it | |
CN110990597A (en) | Cross-modal data retrieval system based on text semantic mapping and retrieval method thereof | |
CN113392265A (en) | Multimedia processing method, device and equipment | |
CN116821696B (en) | Training method, device, equipment and storage medium for form question-answer model | |
CN116702091A (en) | Multi-mode ironic intention recognition method, device and equipment based on multi-view CLIP | |
CN117173730A (en) | Document image intelligent analysis and processing method based on multi-mode information | |
CN112182273B (en) | Cross-modal retrieval method and system based on semantic constraint matrix decomposition hash | |
CN117688220A (en) | Multi-mode information retrieval method and system based on large language model | |
CN117332103A (en) | Image retrieval method based on keyword extraction and multi-modal feature fusion | |
Pereira et al. | SAPTE: A multimedia information system to support the discourse analysis and information retrieval of television programs | |
CN107633259B (en) | Cross-modal learning method based on sparse dictionary representation | |
Perdana et al. | Instance-based deep transfer learning on cross-domain image captioning | |
CN105069136A (en) | Image recognition method in big data environment | |
Tian et al. | Research on image classification based on a combination of text and visual features | |
Chivadshetti et al. | Content based video retrieval using integrated feature extraction and personalization of results | |
CN109255098B (en) | Matrix decomposition hash method based on reconstruction constraint | |
CN115563311B (en) | Document labeling and knowledge base management method and knowledge base management system | |
CN117851654A (en) | Archives resource retrieval system based on artificial intelligence pronunciation and image recognition | |
Ronghui et al. | Application of Improved Convolutional Neural Network in Text Classification. | |
CN111506754B (en) | Picture retrieval method, device, storage medium and processor |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20200306 ||