
CN113033438B - Data feature learning method for modal imperfect alignment - Google Patents

Data feature learning method for modal imperfect alignment

Info

Publication number
CN113033438B
Authority
CN
China
Prior art keywords
data
aligned
modal
modality
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110345293.7A
Other languages
Chinese (zh)
Other versions
CN113033438A (en)
Inventor
彭玺 (Xi Peng)
杨谋星 (Mouxing Yang)
林义杰 (Yijie Lin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University filed Critical Sichuan University
Priority to CN202110345293.7A priority Critical patent/CN113033438B/en
Publication of CN113033438A publication Critical patent/CN113033438A/en
Application granted granted Critical
Publication of CN113033438B publication Critical patent/CN113033438B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a data feature learning method for modally non-fully aligned data. The method defines a multi-modal data set, processes unaligned data using the information contained in the aligned portion, achieves modality realignment through contrastive learning, and simultaneously learns the features of the realigned data. To this end, the proposed contrastive learning loss function is used to guide neural network training. After training of the neural network model is completed, the unaligned multi-modal data are input into the model; by selecting the closest sample as the realigned sample, category-level realignment of the non-fully aligned multi-modal data is completed while the features of the aligned data are learned at the same time. The invention achieves clear gains on the performance indexes commonly used for clustering and classification tasks, greatly reduces time and memory consumption, and benefits subsequent tasks such as clustering, classification and recognition, or data retrieval.

Description

Data feature learning method for modal imperfect alignment
Technical Field
The invention relates to the field of feature learning, in particular to a data feature learning method for modal imperfect alignment.
Background
At present, multi-modal feature learning technology is widely applied in various fields. In retrieval applications, a picture corresponding to a text description can be retrieved by inputting a piece of text; the core of such applications is cross-modal feature learning. In social network analysis, each person can be regarded as an instance, and the text and accompanying pictures in a person's social application (e.g., WeChat Moments) can be regarded as samples of two modalities; by performing multi-modal feature learning on the text and picture modalities of different people, people with similar interests can be grouped together, enabling further applications such as behavior analysis and personalized recommendation. In semantic navigation, a piece of speech is input to a robot, which analyzes the given description and performs feature learning in combination with visual perception to complete the tasks described. The success of these multi-modal feature learning techniques rests primarily on the existence of high-quality multi-modal data satisfying two assumptions. The first is the data completeness assumption: all samples must exist in all modalities, with no missing data. The second is the modality alignment assumption: there is a correct correspondence between samples of different modalities. In other words, with current technology, to perform feature learning on multi-modal data, the data must be screened and aligned in advance to guarantee sample completeness and alignment. However, in practical scenarios, collecting complete, fully aligned multi-modal data is very difficult due to complexity and incompatibility in time and space. For example, to evaluate the teaching quality of an online course (MOOC), video frames and audio frames need to be input to a multi-modal learning system for joint evaluation, but the video frames and audio frames are not always in a one-to-one alignment (correspondence) relationship, which may significantly degrade the performance of many multi-modal methods.
Although a few related methods for multi-modal data alignment exist at present, they all attempt to recover the alignment relationship between different modalities of the same sample (instance), i.e., alignment at the instance level; the required computation and storage are extremely expensive, and the effect is often poor. For example, when running on an Nvidia 2080Ti GPU, existing methods such as PVC cannot process large-scale data (e.g., the NoisyMNIST dataset with two modalities of 60000 samples each). In addition, even for smaller-scale data, PVC usually takes several hours to perform modality alignment and occupies a large amount of memory, and the data representation obtained after alignment often performs poorly on subsequent tasks such as classification and clustering. Furthermore, when the unaligned modalities also suffer from missing data (for example, some people post only text in their Moments without pictures, so the picture modality is missing for them), instance-level alignment cannot be performed at all. Therefore, in contrast to instance-level alignment, our research and design focuses on performing category-level alignment (i.e., aligning samples of the same class across modalities) and feature learning of the data at the same time. Practice proves that the method can process data of different scales extremely fast with very little storage overhead, and achieves better results in subsequent tasks such as classification and clustering. Moreover, when the unaligned modalities also contain missing data, the method can still handle them. Compared with instance-level alignment methods, the method therefore has a higher application prospect and practical value.
Disclosure of Invention
In order to solve the above problems, the present invention provides a data feature learning method oriented to modality incomplete alignment, which is implemented by the following technical scheme:
a data feature learning method facing modal imperfect alignment comprises the following steps:
s1, defining a video image modal data set and a sound modal data set with incompletely aligned modalities, and selecting any modality as an alignment reference modality, and selecting the other modality as a to-be-aligned modality;
s2, taking the aligned data in the video image modality data set and the sound modality data set as a positive sample pair, and constructing a negative sample pair by taking the selected sample of the aligned reference modality as a reference;
s3, respectively inputting the constructed positive and negative samples into two neural networks with different structures, and calculating the common representation of the positive and negative samples;
s4, calculating a loss function by using the obtained public expression, and training two neural networks by using the calculated loss function;
and S5, inputting the sample data of the unaligned part in the video image mode data set and the sound mode data set into the trained neural network, and correcting the alignment relation of the sample data of the unaligned part to realign the sample data.
The scheme has the advantages that the common problem of modal non-alignment in multi-modal data can be solved, the alignment relation of the multi-modal data is recovered, and the performance of a multi-modal model is guaranteed.
Further, the modality non-fully aligned video image modality data set and sound modality data set defined in S1 are respectively expressed as:
{X^(1)} = {A^(1), U^(1)};
{X^(2)} = {A^(2), U^(2)};
where {X^(1)} is the modality non-fully aligned video image modality data set; A^(1) = {a_1^(1), a_2^(1), …, a_j^(1)} is the data set of the aligned portion of the video image modality, a_i^(1) denotes a data sample in the aligned portion {A^(1)}, and j is the number of aligned samples; U^(1) = {u_1^(1), u_2^(1), …, u_k^(1)} is the data set of the unaligned portion of the video image modality, u_i^(1) denotes a data sample in the unaligned portion {U^(1)}, and k is its number of samples. {X^(2)} is the modality non-fully aligned sound modality data set; A^(2) = {a_1^(2), a_2^(2), …, a_n^(2)} is the data set of the aligned portion of the sound modality, a_i^(2) denotes a data sample in the aligned portion {A^(2)}, and n is the number of aligned samples; U^(2) = {u_1^(2), u_2^(2), …, u_m^(2)} is the data set of the unaligned portion of the sound modality, u_i^(2) denotes a data sample in the unaligned portion {U^(2)}, and m is its number of samples.
The beneficial effect of the above scheme is that the problem of modality imperfect alignment is defined.
Further, the specific method for constructing the negative sample pair in S2 is as follows:
s21, taking the data of the aligned part in the video image mode data set and the sound mode data set as a positive sample pair;
and S22, taking the data of each alignment part in the alignment reference mode as an anchor point, and randomly sampling a plurality of data samples in the data set of the alignment part of the other mode to form a negative sample pair with each anchor point.
The beneficial effect of the above further scheme is that a negative sample for comparison learning can be constructed without the help of additional label information.
Further, the common representations of the positive and negative sample pairs in step S3 are respectively:
Z^(1) = f^(1)(A^(1));
Z^(2) = f^(2)(A^(2));
where f^(1) is the neural network constructed for the video image modality and f^(2) is the neural network constructed for the sound modality; A^(1) is the set of samples in the positive and negative sample pairs belonging to the video image modality, and Z^(1) is the common representation of A^(1) learned through the neural network f^(1); A^(2) is the set of samples in the positive and negative sample pairs belonging to the sound modality, and Z^(2) is the common representation of A^(2) learned through the neural network f^(2).
The further scheme has the advantages that the data characteristics of the constructed positive and negative samples are extracted, and data redundancy is reduced.
Further, the loss function in S4 is specifically expressed as:
L = (1/N) · Σ_{l=1}^{N} [ P_l · L_pos + (1 − P_l) · L_neg ]
where P is a variable taking values 0 and 1, 0 denoting a constructed negative sample pair and 1 a constructed positive sample pair; N is the total number of training sample pairs, and l is the index of a training sample pair;
L_pos = d(f^(1)(a_{j1}^(1)), f^(2)(a_{n1}^(2)))^2, with j1 = n1,
is the loss function for a positive sample pair, where d(f^(1)(a_{j1}^(1)), f^(2)(a_{n1}^(2))) is the distance, in the latent space learned by the neural networks, of the cross-modal positive sample pair from the two modalities, a_{j1}^(1) and a_{n1}^(2) are the members of a cross-modal positive sample pair in the two modalities, and j1 = n1 indicates that cross-modal positive sample pairs share the same index value;
L_neg = max(ρ − d(f^(1)(a_{j1}^(1)), f^(2)(a_{n1}^(2))), 0)^2, with j1 ≠ n1,
is the loss function for a negative sample pair, where ρ is an adaptive boundary value and (a_{j1}^(1), a_{n1}^(2)) with j1 ≠ n1 denotes a cross-modal negative sample pair randomly sampled from the aligned portions of the two modalities, with different index values.
The advantage of this further scheme is that the neural networks are guided to learn a latent space into which the multi-modal data are projected; in this space the features of the data are obtained, features of the same class from different modalities are drawn close to each other, and features of different classes from different modalities are pushed apart, so the features of the data can be learned at the same time as the modality alignment relationship is processed.
Further, the distance, in the latent space learned by the neural networks, of a cross-modal positive sample pair from the two modalities is calculated as:
d(f^(1)(a_{j1}^(1)), f^(2)(a_{n1}^(2))) = ||f^(1)(a_{j1}^(1)) − f^(2)(a_{n1}^(2))||_2
where f^(1) and f^(2) are the neural networks into which a_{j1}^(1) and a_{n1}^(2) are respectively input.
The advantage of this further scheme is that the distance used by the loss function for positive sample pairs is defined.
Further, the S5 specifically includes:
S51, sequentially sending the sample data pairs of the cross-modal unaligned portion in the video image modality and the sound modality into the trained neural network models f^(1) and f^(2), and calculating the feature representation set of the unaligned portion sample data pairs, wherein the input cross-modal unaligned sample data pairs have the same index each time;
S52, calculating a distance matrix of the unaligned portion data according to the feature representation set obtained in step S51;
and S53, for each feature in the feature representation set of the alignment reference modality, searching the feature representation set of the to-be-aligned modality for the sample feature that minimizes the distance to it, and outputting the feature with the minimum distance, together with the feature in the alignment reference modality, as a realigned cross-modal feature pair.
The beneficial effect of the further scheme is that the alignment relation of the modal non-aligned sample is corrected to realign the modal non-aligned sample.
Further, the feature representation sets of the unaligned portion data in the two modalities are expressed as:
Z_U^(1) = f^(1)(U^(1));
Z_U^(2) = f^(2)(U^(2));
where Z_U^(1) is the feature set of the unaligned portion data in the alignment reference modality, and f^(1) is the neural network into which the unaligned portion data of the alignment reference modality are input; Z_U^(2) is the feature set of the unaligned portion data in the to-be-aligned modality, and f^(2) is the neural network into which the unaligned portion data of the to-be-aligned modality are input.
The beneficial effect of the further scheme is that the characteristics of modal non-aligned data are extracted, and redundancy removal of the data is realized.
Further, the formula for calculating the distance matrix of the unaligned portion data in step S52 is:
D_{km} = ||f^(1)(u_k^(1)) − f^(2)(u_m^(2))||_2
where D is the distance matrix of the unaligned portion data.
The further scheme has the advantages that the distance matrix of the characteristics of the unaligned part of the data is calculated, and the alignment relation of the part of the data is convenient to correct.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention. In the drawings:
fig. 1 is a schematic flow chart of the data feature learning method for modal imperfect alignment.
Detailed Description
Hereinafter, the term "comprising" or "may include" used in various embodiments of the present invention indicates the presence of the disclosed function, operation, or element, and does not limit the addition of one or more further functions, operations, or elements. Furthermore, as used in various embodiments of the present invention, the terms "comprises," "comprising," "includes," "including," "has," "having" and their derivatives are intended only to indicate the presence of the specified features, numbers, steps, operations, elements, components, or combinations thereof, and should not be construed as excluding the existence or possible addition of one or more other features, numbers, steps, operations, elements, components, or combinations thereof.
In various embodiments of the invention, the expression "A or B" or "at least one of A or/and B" includes any and all combinations of the words listed together. For example, the expression "A or B" or "at least one of A or/and B" may include A, may include B, or may include both A and B.
Expressions (such as "first", "second", and the like) used in various embodiments of the present invention may modify various constituent elements in various embodiments, but may not limit the respective constituent elements. For example, the above description does not limit the order and/or importance of the elements described. The foregoing description is for the purpose of distinguishing one element from another. For example, the first user device and the second user device indicate different user devices, although both are user devices. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of various embodiments of the present invention.
It should be noted that: if it is described that one constituent element is "connected" to another constituent element, the first constituent element may be directly connected to the second constituent element, and a third constituent element may be "connected" between the first constituent element and the second constituent element. In contrast, when one constituent element is "directly connected" to another constituent element, it is understood that there is no third constituent element between the first constituent element and the second constituent element.
The terminology used in the various embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the various embodiments of the invention. As used herein, the singular forms are intended to include the plural forms as well, unless the context clearly indicates otherwise. Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which various embodiments of the present invention belong. The terms (such as those defined in commonly used dictionaries) should be interpreted as having a meaning that is consistent with their contextual meaning in the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein in various embodiments of the present invention.
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to examples and the accompanying drawings, and the exemplary embodiments and descriptions thereof are only used for explaining the present invention and are not used as limiting the present invention.
Example 1
A method for learning data features oriented to modal imperfect alignment, as shown in fig. 1, comprising the steps of:
s1, defining a video image modality data set and a sound modality data set with non-completely aligned modalities, and selecting any modality as an alignment reference modality, and another modality as a modality to be aligned, specifically, the video image modality data set and the sound modality data set with non-completely aligned modalities defined are respectively expressed as:
{X(1)}={A(1),U(1)};
{X(2)}={A(2),U(2)};
wherein, { X(1)Is a modal incompletely aligned video image modality data set,
Figure BDA0003000614370000061
a data set representing an aligned portion of a video image modality,
Figure BDA0003000614370000062
represents { A(1)Aligning data samples in the partial data set, wherein j is the number of aligned partial samples;
Figure BDA0003000614370000063
a data set representing a misaligned portion of a video image modality,
Figure BDA0003000614370000064
represents { U(1)The data samples in the unaligned partial data set in (j), k being its number of samples; { X(2)Is a modal incompletely aligned sound modality data set,
Figure BDA0003000614370000065
a data set representing aligned portions of the acoustic modality,
Figure BDA0003000614370000066
represents { A(2)Aligning data samples in the partial data set, wherein n is the number of aligned partial samples;
Figure BDA0003000614370000067
a data set representing a misaligned portion of a sound modality,
Figure BDA0003000614370000068
represents { U(2)The data samples in the partial data set are not aligned in (m) is its number of samples.
In the present embodiment, for convenience of illustration and without loss of generality, the scheme is described in detail by taking two modalities, i.e., a video image modality and a sound modality, as an example; when the number of modalities is greater than two, the situation is similar and can be handled analogously. Meanwhile, the video image modality serves as the alignment reference, and the sound modality is aligned to the reference modality. The scheme defines a multi-modal dataset as satisfying class-level alignment if and only if samples of the two modalities u_k^(1) and u_m^(2) belong to the same class, i.e.
C(u_k^(1)) = C(u_m^(2)),
where C(x) denotes the class of sample x. Such class-level alignment can be achieved through a discriminant problem, namely: from {X^(1)}, find a sample u^(1) such that it and the k-th sample u_k^(2) in {X^(2)} satisfy the above objective function.
S2, taking the aligned data in the video image modality data set and the sound modality data set as a positive sample pair, and constructing a negative sample pair by taking the selected sample of the aligned reference modality as a reference;
specifically, the method comprises the following steps:
s21, taking the data of the aligned part in the video image mode data set and the sound mode data set as a positive sample pair;
and S22, taking the data of each alignment part in the alignment reference mode as an anchor point, and randomly sampling a plurality of data samples in the data set of the alignment part of the other mode to form a negative sample pair with each anchor point.
According to this scheme, category-level contrastive learning takes cross-modal sample pairs of the same category as positive pairs and cross-modal sample pairs of different categories as negative pairs, then increases the similarity between positive pairs while decreasing that of negative pairs. In the implementation, the data of the aligned portion are taken as positive sample pairs. Negative sample pairs cannot be constructed directly, because the scheme allows for the more general unsupervised case, i.e., no manually annotated class labels are available. Therefore, to obtain negative sample pairs, for each sample in the alignment reference modality, samples of the other modalities are randomly drawn from the remaining modalities and paired with it as negative pairs. Negative pairs formed this way will contain noisy data, i.e., pairs that should have been positive are mistakenly treated as negative. To prevent these incorrectly labeled pairs from affecting the scheme, the contrastive learning needs to be robust to such noisy data.
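For concreteness, below is a minimal NumPy sketch of this pairing step. It assumes the aligned portions arrive as two equally indexed arrays; every name in it (build_pairs, neg_per_anchor, and so on) is a hypothetical illustration rather than the patent's own notation:

```python
import numpy as np

def build_pairs(a1, a2, neg_per_anchor=1, seed=0):
    """Build positive and (noisy) negative cross-modal pairs.

    a1, a2: aligned portions of the two modalities; row i of a1
    corresponds to row i of a2. Returns (x1, x2, p) where p[l] = 1
    marks a positive pair and p[l] = 0 a constructed negative pair.
    """
    rng = np.random.default_rng(seed)
    n = len(a1)
    x1, x2 = [a1], [a2]              # positive pairs: the aligned data as-is
    p = [np.ones(n)]
    for _ in range(neg_per_anchor):  # each reference sample acts as an anchor
        idx = rng.integers(0, n, size=n)  # random partners from the other
        x1.append(a1)                     # modality's aligned portion; any
        x2.append(a2[idx])                # same-class collision is exactly the
        p.append(np.zeros(n))             # "noisy negative" the loss tolerates
    return np.concatenate(x1), np.concatenate(x2), np.concatenate(p)
```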
S3, respectively inputting the constructed positive and negative sample pairs into two neural networks with different structures, and calculating the common representations of the positive and negative sample pairs.
For ease of illustration and without loss of generality, the scheme is described using two modalities as an example; when there are more than two modalities, the situation is similar and can be handled analogously. For the video image modality and the sound modality, the two corresponding neural networks are f^(1) and f^(2) respectively; the corresponding training data are the data of the aligned portions, A^(1) and A^(2), and the test data are the portions with unaligned relations, U^(1) and U^(2). The neural networks f^(1) and f^(2) use different configurations.
In this scheme, the common representations of the constructed positive and negative sample pairs are obtained through the neural networks respectively as:
Z^(1) = f^(1)(A^(1));
Z^(2) = f^(2)(A^(2));
where f^(1) is the neural network constructed for the video image modality and f^(2) the neural network constructed for the sound modality; A^(1) is the set of samples in the positive and negative sample pairs belonging to the video image modality, and Z^(1) is the common representation of A^(1) learned through the neural network f^(1); A^(2) is the set of samples belonging to the sound modality, and Z^(2) is the common representation of A^(2) learned through the neural network f^(2).
S4, calculating the loss function using the obtained common representations, and training the two neural networks with the calculated loss function.
To achieve the above modality alignment goal through contrastive learning, the scheme uses the two neural networks {f^(1), f^(2)} to learn a common representation of the two modalities, with the loss function defined as follows:
L = (1/N) · Σ_{l=1}^{N} [ P_l · L_pos + (1 − P_l) · L_neg ]
where P is a variable taking values 0 and 1, 0 denoting a constructed negative sample pair and 1 a constructed positive sample pair; N is the total number of training sample pairs, and l is the index of a training sample pair. The term
L_pos = d(f^(1)(a_{j1}^(1)), f^(2)(a_{n1}^(2)))^2, with j1 = n1,
is the loss for a positive pair, where d(·,·) is the distance, in the latent space learned by the neural networks, of the cross-modal positive sample pair from the two modalities, and j1 = n1 indicates that cross-modal positive pairs share the same index value. The term
L_neg = max(ρ − d(f^(1)(a_{j1}^(1)), f^(2)(a_{n1}^(2))), 0)^2, with j1 ≠ n1,
is the loss for a negative pair, where ρ is an adaptive boundary value and (a_{j1}^(1), a_{n1}^(2)) with j1 ≠ n1 denotes a cross-modal negative pair randomly sampled from the aligned portions of the two modalities, with different index values.
Here d(·,·) denotes the distance of two samples in the latent space of the neural networks, i.e.
d(f^(1)(a^(1)), f^(2)(a^(2))) = ||f^(1)(a^(1)) − f^(2)(a^(2))||_2.
Because ρ is an adaptive boundary value, the loss term L_neg can reduce or even eliminate the negative influence of the noisy data in the negative pairs on the training of the neural networks, making the networks robust to noisy data. The designed loss function L is mainly used to guide the neural networks to learn a common subspace of the multi-modal data, in which samples of the same class but from different modalities are close to each other, while samples of different classes and different modalities are far apart.
The designed noise-robust contrastive loss function guides the training of the neural networks: in each iteration, the gradient of the accumulated loss is computed and back-propagated to update the network parameters; training stops when the iterative process converges, yielding the trained neural networks.
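As an illustration, here is a minimal PyTorch sketch of this training step under the margin-based reading of the loss given above; the encoder widths, learning rate, and margin value are assumptions made for the sketch, not values specified by the scheme:

```python
import torch
import torch.nn as nn

def contrastive_loss(z1, z2, p, rho):
    """Noise-robust contrastive loss: p[l] = 1 for positive pairs, 0 for negative."""
    d = torch.norm(z1 - z2, dim=1)                      # latent-space distance
    pos = p * d.pow(2)                                  # pull positives together
    neg = (1 - p) * torch.clamp(rho - d, min=0).pow(2)  # push negatives past margin
    return (pos + neg).mean()

# two differently configured encoders, one per modality (sizes are assumptions)
f1 = nn.Sequential(nn.Linear(784, 1024), nn.ReLU(), nn.Linear(1024, 128))
f2 = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 128))
opt = torch.optim.Adam(list(f1.parameters()) + list(f2.parameters()), lr=1e-3)

def train_step(x1, x2, p, rho=1.0):
    # x1, x2: float tensors of paired samples; p: 1/0 pair labels (float tensor)
    z1, z2 = f1(x1), f2(x2)          # common representations of the pair
    loss = contrastive_loss(z1, z2, p, rho)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```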
And S5, inputting the sample data of the unaligned part in the video image mode data set and the sound mode data set into the trained neural network, and correcting the alignment relation of the sample data of the unaligned part to realign the sample data.
After training of the neural network model is completed, the unaligned multi-modal data are input into the model; by selecting the closest sample as the realigned sample, category-level alignment of the multi-modal data can be completed while the aligned features are learned at the same time. Notably, the loss function designed in this scheme is, to our knowledge, the first contrastive learning loss that is robust against noisy data. Based on the above procedure, the model can be sufficiently trained on the aligned data of the multiple modalities and implicitly learn a common representation of each modality from the alignment information, enabling it to efficiently process unaligned multi-modal data.
The specific method is as follows:
S51, sequentially feeding the cross-modal unaligned sample data pairs of the video image modality and the sound modality into the trained neural network models f^(1) and f^(2), and calculating the feature representation sets of the unaligned sample data pairs, where the input cross-modal unaligned sample data pairs have the same index each time.
In this embodiment, the corresponding samples u_k^(1) of U^(1) and u_m^(2) of U^(2) are sequentially fed into the trained networks f^(1) and f^(2) to obtain the feature representations z_k^(1) = f^(1)(u_k^(1)) and z_m^(2) = f^(2)(u_m^(2)), where each input cross-modal unaligned sample data pair has the same index, i.e., k = m.
S52, calculating the distance matrix of the unaligned portion data from the feature representation sets obtained in step S51.
Specifically, according to the formula D_{km} = ||z_k^(1) − z_m^(2)||_2, the distances between the unaligned data features are calculated to form a matrix; denoting this matrix Q, Q_{11} is the distance between the first sample in the alignment reference modality and the first sample in the to-be-aligned modality, Q_{12} is the distance between the first sample in the alignment reference modality and the second sample in the to-be-aligned modality, and so on.
S53, for each feature in the feature representation set of the alignment reference modality, searching the feature representation set of the to-be-aligned modality for the sample feature that minimizes the distance to it, and outputting the feature with the minimum distance, together with the feature in the alignment reference modality, as a realigned cross-modal feature pair.
That is, for each feature representation z_k^(1) in Z_U^(1), the feature representation z_m^(2) is searched for in Z_U^(2) so as to minimize the distance between the two representations, i.e., to minimize D_{km}; once found, the data are aligned.
Experimental verification
To verify the superiority of the technical scheme, it is first compared with 10 other multi-modal clustering techniques, namely Canonical Correlation Analysis (CCA), Kernel Canonical Correlation Analysis (KCCA), Deep Canonical Correlation Analysis (DCCA), the Deep Canonical Correlation Autoencoder (DCCAE), matrix-decomposition-based multi-modal clustering (MvC-DMF), latent multi-modal subspace clustering (LMSC), self-weighted multi-modal clustering (SwMC), binary multi-modal clustering (BMVC), the autoencoder-in-autoencoder network (AE2-Nets), and partially view-aligned clustering (PVC). Specifically, experimental comparisons are performed on the object picture dataset Caltech-101 and the Reuters news dataset. Because the comparison algorithms cannot process partially aligned data, PCA is first used for dimensionality reduction, the Hungarian algorithm is then used to obtain an alignment order, the data are realigned accordingly, and the respective method is then used for clustering. For a comprehensive comparison, three indexes commonly used to measure clustering quality, namely Normalized Mutual Information (NMI), Accuracy (ACC), and the Adjusted Rand Index (ARI), are used as the quantitative indexes of the experiments. All three indexes range from 0 to 1; larger values indicate better results, and a value of 1 means the algorithm clusters the data perfectly. NMI is computed as:
NMI(Y; C) = 2 · I(Y; C) / (H(Y) + H(C))
where Y is the category information predicted by the method, C is the actual category information of the data, H(·) denotes information entropy, and I(Y; C) denotes mutual information. ARI is computed as:
ARI = (RI − E(RI)) / (max(RI) − E(RI))
where RI is the Rand Index and E(·) is the expectation.
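These indexes can be computed, for example, with scikit-learn; a small sketch on toy label arrays (the arrays are invented purely for illustration):

```python
from sklearn.metrics import (adjusted_rand_score,
                             normalized_mutual_info_score)

y_true = [0, 0, 1, 1, 2, 2]   # actual category information C (toy example)
y_pred = [0, 0, 1, 2, 2, 2]   # categories Y predicted by a clustering method

nmi = normalized_mutual_info_score(y_true, y_pred)  # 2*I(Y;C)/(H(Y)+H(C))
ari = adjusted_rand_score(y_true, y_pred)           # (RI-E(RI))/(max(RI)-E(RI))
# Clustering ACC additionally requires the best cluster-to-label permutation
# (e.g., via the Hungarian algorithm) before accuracy is computed.
print(f"NMI = {nmi:.3f}, ARI = {ari:.3f}")
```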
Experiment one:
The performance of the scheme is evaluated on the Reuters dataset. Reuters is a text dataset consisting of 6 categories and containing text in 5 languages, namely English text and its corresponding translations in French, German, Italian, and Spanish, as shown in Table 1.
TABLE 1
Modality            English   French   German   Italian   Spanish
Number of samples   18758     26648    29953    24039     12342
Each language is treated as one modality. Corresponding samples in the English modality and the French modality are used to construct non-fully aligned multi-modal data for evaluating the scheme. The comparative results are shown in Table 2:
TABLE 2
[Comparative clustering results (ACC, NMI, ARI) on Reuters; the table is rendered as an image in the source and its values are not recoverable.]
As can be seen from the table, compared with other clustering methods, the method provided by this scheme achieves large improvements in the accuracy, normalized mutual information, and adjusted Rand indexes, which means that in practical applications, even without given data label information, the scheme can correctly cluster unaligned language text data, avoiding the dependence on data labels (which require a large amount of human resources). In addition, the scheme needs only 3.48 seconds to realign the non-fully aligned data and learn the data features, while the PVC method needs 30.36 seconds and the Hungarian-algorithm-based methods need 289.82 seconds. The present method therefore greatly improves performance while sharply reducing time overhead.
Experiment two:
The Caltech-101 dataset, containing 9144 pictures from 101 object classes, is used, with six kinds of extracted features serving as 6 modalities (Gabor, WM, CENTRIST, HOG, GIST, and LBP). Because the number of categories is large, only the 20 categories with the most data are listed here; the corresponding category information and sample-count distribution are shown in Table 3:
TABLE 3
Face       Leopard   Motorbike    Telescope   Brain     Camera      Car     Dollar bill     Ferry    Garfield
435        200       798          33          98        50          123     52              67       34
Hedgehog   Tower     Rhinoceros   Snoopy      Stapler   Stop sign   Lotus   Windsor chair   Wrench   Taiji (yin-yang)
54         47        59           35          45        64          37      56              39       60
We evaluate the present solution using corresponding samples in the modality of the HOG feature and the modality of the GIST feature to construct non-fully aligned multi-modal data. The comparative results are shown in table 4:
TABLE 4
[Comparative clustering results (ACC, NMI, ARI) on Caltech-101; the table is rendered as an image in the source and its values are not recoverable.]
Compared with other clustering methods, the method provided by this scheme achieves large improvements in the accuracy, normalized mutual information, and adjusted Rand indexes, which means that in practical applications, even without given data label information, the scheme can correctly cluster unaligned multi-modal data, avoiding the dependence on data labels (which require a large amount of human resources). In addition, the scheme needs only 1.75 seconds to realign the non-fully aligned data and learn the data features, while the PVC method needs 7.2 seconds and the Hungarian-algorithm-based methods need 48.87 seconds. The present method therefore greatly improves performance while sharply reducing time overhead.
To further verify the superiority of the technical scheme, it is also compared on classification tasks with 9 other multi-modal feature learning techniques, namely Canonical Correlation Analysis (CCA), Kernel Canonical Correlation Analysis (KCCA), Deep Canonical Correlation Analysis (DCCA), the Deep Canonical Correlation Autoencoder (DCCAE), matrix-decomposition-based multi-modal clustering (MvC-DMF), latent multi-modal subspace clustering (LMSC), binary multi-modal clustering (BMVC), the autoencoder-in-autoencoder network (AE2-Nets), and partially view-aligned clustering (PVC). Note that self-weighted multi-modal clustering (SwMC) directly learns the data clustering result, so classification experiments cannot be performed with it. Specifically, the experimental comparison is carried out on the object picture dataset Caltech-101 and the Reuters news dataset. Because the comparison algorithms cannot process partially aligned data, PCA is used for dimensionality reduction, the Hungarian algorithm is used to obtain an alignment order, the data are realigned accordingly, the respective method is used for feature learning, and a standard SVM classifier is then trained for classification. Classification Accuracy (ACC) is used as the quantitative index of the experiments to verify the effectiveness of the method, as sketched below.
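A minimal sketch of this evaluation protocol; the feature matrix and labels below are random stand-ins, since the point is only the train/evaluate flow:

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# z: features learned by a multi-modal method, y: ground-truth labels
# (random stand-ins here, just so the sketch runs end to end)
z = np.random.default_rng(0).normal(size=(200, 128))
y = np.random.default_rng(1).integers(0, 6, size=200)

z_tr, z_te, y_tr, y_te = train_test_split(z, y, test_size=0.2, random_state=0)
clf = SVC()                                    # standard SVM classifier
clf.fit(z_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(z_te))  # classification ACC
print(f"ACC = {acc:.3f}")
```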
Experiment three
As in experiment one, the Reuters dataset is used to evaluate the performance of the proposed technical scheme. The experimental results are shown in Table 5:
TABLE 5
[Classification accuracy (ACC) comparison on Reuters; the table is rendered as an image in the source and its values are not recoverable.]
As can be seen from the table, compared with other methods, the method provided by this scheme achieves a large improvement in the classification accuracy index, which means that in practical applications the scheme can correctly classify unaligned language text data, avoiding the large amount of human resources otherwise needed to correct the alignment relations of multi-modal data.
Experiment four
As in experiment two, the Caltech-101 dataset is used to evaluate the performance of the technical scheme. The experimental results are shown in Table 6:
TABLE 6
[Classification accuracy (ACC) comparison on Caltech-101; the table is rendered as an image in the source and its values are not recoverable.]
As can be seen from the table, compared with other methods, the method provided by this scheme achieves a large improvement in the classification accuracy index, which means that in practical applications the scheme can correctly classify unaligned multi-modal data, avoiding the large amount of human resources otherwise needed to correct the alignment relations of multi-modal data.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (9)

1. A data feature learning method facing modal imperfect alignment is characterized by comprising the following steps:
s1, defining a video image modal data set and a sound modal data set with incompletely aligned modalities, and selecting any modality as an alignment reference modality, and selecting the other modality as a to-be-aligned modality;
s2, taking the aligned data in the video image modality data set and the sound modality data set as a positive sample pair, and constructing a negative sample pair by taking the selected sample of the aligned reference modality as a reference;
s3, inputting the constructed positive and negative sample pairs into two neural networks with different structures respectively, specifically, inputting data belonging to the video image modality data set in the positive and negative sample pairs into a first neural network, inputting data belonging to the sound modality data set in the positive and negative sample pairs into a second neural network, and calculating common representations for the video image modality data set and the sound modality data set respectively;
s4, calculating a loss function by using the obtained public expression, and training two neural networks by using the calculated loss function;
and S5, inputting the sample data of the unaligned part in the video image mode data set and the sound mode data set into a trained neural network, and correcting the alignment relation of the sample data of the unaligned part to realign the sample data.
2. The method according to claim 1, wherein the modality non-fully aligned video image modality data set and sound modality data set defined in S1 are respectively expressed as:
{X^(1)} = {A^(1), U^(1)};
{X^(2)} = {A^(2), U^(2)};
where {X^(1)} is the modality non-fully aligned video image modality data set; A^(1) = {a_1^(1), …, a_j^(1)} is the data set of the aligned portion of the video image modality, a_i^(1) denotes a data sample in the aligned portion {A^(1)}, and j is the number of aligned samples; U^(1) = {u_1^(1), …, u_k^(1)} is the data set of the unaligned portion of the video image modality, u_i^(1) denotes a data sample in the unaligned portion {U^(1)}, and k is its number of samples; {X^(2)} is the modality non-fully aligned sound modality data set; A^(2) = {a_1^(2), …, a_n^(2)} is the data set of the aligned portion of the sound modality, a_i^(2) denotes a data sample in the aligned portion {A^(2)}, and n is the number of aligned samples; U^(2) = {u_1^(2), …, u_m^(2)} is the data set of the unaligned portion of the sound modality, u_i^(2) denotes a data sample in the unaligned portion {U^(2)}, and m is its number of samples.
3. The method for learning modal-oriented imperfect alignment data features according to claim 2, wherein the specific method for constructing the negative sample pairs in S2 is:
s21, taking the data of the aligned part in the video image mode data set and the sound mode data set as a positive sample pair;
and S22, taking the data of each alignment part in the alignment reference mode as an anchor point, and randomly sampling a plurality of data samples in the data set of the alignment part in the to-be-aligned mode to form a negative sample pair with each anchor point.
4. The method according to claim 3, wherein the common representations of the positive and negative sample pairs in step S3 are:
Z^(1) = f^(1)(A^(1));
Z^(2) = f^(2)(A^(2));
where f^(1) is the neural network constructed for the video image modality and f^(2) the neural network constructed for the sound modality; A^(1) is the set of samples in the positive and negative sample pairs belonging to the video image modality, and Z^(1) is the common representation of A^(1) learned through the neural network f^(1); A^(2) is the set of samples in the positive and negative sample pairs belonging to the sound modality, and Z^(2) is the common representation of A^(2) learned through the neural network f^(2).
5. The method according to claim 4, wherein the loss function in S4 is specifically expressed as:
L = (1/N) · Σ_{l=1}^{N} [ P_l · L_pos + (1 − P_l) · L_neg ]
where P is a variable taking values 0 and 1, 0 denoting a constructed negative sample pair and 1 a constructed positive sample pair; N is the total number of training sample pairs, and l is the index of a training sample pair;
L_pos = d(f^(1)(a_{j1}^(1)), f^(2)(a_{n1}^(2)))^2, with j1 = n1,
is the loss function for a positive sample pair, where d(f^(1)(a_{j1}^(1)), f^(2)(a_{n1}^(2))) is the distance, in the latent space learned by the neural networks, of the cross-modal positive sample pair from the two modalities, and j1 = n1 indicates that cross-modal positive sample pairs have the same index value;
L_neg = max(ρ − d(f^(1)(a_{j1}^(1)), f^(2)(a_{n1}^(2))), 0)^2, with j1 ≠ n1,
is the loss function for a negative sample pair, where ρ is an adaptive boundary value and (a_{j1}^(1), a_{n1}^(2)) with j1 ≠ n1 denotes a cross-modal negative sample pair randomly sampled from the aligned portions of the two modalities, with different index values.
6. The method according to claim 5, wherein the distance, in the latent space learned by the neural networks, of a cross-modal positive sample pair from the two modalities is calculated as:
d(f^(1)(a_{j1}^(1)), f^(2)(a_{n1}^(2))) = ||f^(1)(a_{j1}^(1)) − f^(2)(a_{n1}^(2))||_2
where f^(1) and f^(2) are the neural networks into which a_{j1}^(1) and a_{n1}^(2) are respectively input.
7. The method according to claim 6, wherein the S5 is specifically:
s51, sequentially sending the sample data pairs of the cross-modal unaligned portion in the video image modality and the sound modality into the trained neural network models f^(1) and f^(2), and calculating the feature representation set of the unaligned portion sample data pairs, wherein the input cross-modal unaligned sample data pairs have the same index each time;
s52, calculating a distance matrix of the unaligned part data according to the feature representation set obtained in the step S51;
and S53, for each feature in the feature representation set of the alignment reference mode, searching for the feature of the sample in the feature representation set of the to-be-aligned mode, minimizing the distance between the searched feature and the feature in the alignment reference mode, and outputting the feature with the minimum distance and the feature in the alignment reference mode as a realigned cross-mode feature pair.
8. The method according to claim 7, wherein the feature representation sets of the unaligned portion data in step S51 are expressed as:
Z_U^(1) = f^(1)(U^(1));
Z_U^(2) = f^(2)(U^(2));
where Z_U^(1) is the feature set of the unaligned portion data in the alignment reference modality, and f^(1) is the neural network into which the unaligned portion data of the alignment reference modality are input; Z_U^(2) is the feature set of the unaligned portion data in the to-be-aligned modality, and f^(2) is the neural network into which the unaligned portion data of the to-be-aligned modality are input.
9. The method according to claim 8, wherein the formula for calculating the unaligned portion data distance matrix in step S52 is:
D_{km} = ||f^(1)(u_k^(1)) − f^(2)(u_m^(2))||_2
where D is the unaligned portion data distance matrix.
CN202110345293.7A 2021-03-31 2021-03-31 Data feature learning method for modal imperfect alignment Active CN113033438B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110345293.7A CN113033438B (en) 2021-03-31 2021-03-31 Data feature learning method for modal imperfect alignment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110345293.7A CN113033438B (en) 2021-03-31 2021-03-31 Data feature learning method for modal imperfect alignment

Publications (2)

Publication Number Publication Date
CN113033438A (en) 2021-06-25
CN113033438B (en) 2022-07-01

Family

ID=76453142

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110345293.7A Active CN113033438B (en) 2021-03-31 2021-03-31 Data feature learning method for modal imperfect alignment

Country Status (1)

Country Link
CN (1) CN113033438B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113486833B (en) * 2021-07-15 2022-10-04 北京达佳互联信息技术有限公司 Multi-modal feature extraction model training method and device and electronic equipment
CN114067233B (en) * 2021-09-26 2023-05-23 四川大学 Cross-mode matching method and system
CN114139641B (en) * 2021-12-02 2024-02-06 中国人民解放军国防科技大学 Multi-modal characterization learning method and system based on local structure transfer
CN117252274B (en) * 2023-11-17 2024-01-30 北京理工大学 Text audio image contrast learning method, device and storage medium
CN117494147B (en) * 2023-12-29 2024-03-22 戎行技术有限公司 Multi-platform virtual user data alignment method based on network space behavior data

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112001437A (en) * 2020-08-19 2020-11-27 四川大学 Modal non-complete alignment-oriented data clustering method
WO2021000664A1 (en) * 2019-07-03 2021-01-07 中国科学院自动化研究所 Method, system, and device for automatic calibration of differences in cross-modal target detection
CN112287126A (en) * 2020-12-24 2021-01-29 中国人民解放军国防科技大学 Entity alignment method and device suitable for multi-mode knowledge graph
CN112434654A (en) * 2020-12-07 2021-03-02 安徽大学 Cross-modal pedestrian re-identification method based on symmetric convolutional neural network
CN112584062A (en) * 2020-12-10 2021-03-30 上海哔哩哔哩科技有限公司 Background audio construction method and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111368123B (en) * 2020-02-17 2022-06-28 同济大学 Three-dimensional model sketch retrieval method based on cross-modal guide network
CN112001438B (en) * 2020-08-19 2023-01-10 四川大学 Multi-mode data clustering method for automatically selecting clustering number

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021000664A1 (en) * 2019-07-03 2021-01-07 中国科学院自动化研究所 Method, system, and device for automatic calibration of differences in cross-modal target detection
CN112001437A (en) * 2020-08-19 2020-11-27 四川大学 Modal non-complete alignment-oriented data clustering method
CN112434654A (en) * 2020-12-07 2021-03-02 安徽大学 Cross-modal pedestrian re-identification method based on symmetric convolutional neural network
CN112584062A (en) * 2020-12-10 2021-03-30 上海哔哩哔哩科技有限公司 Background audio construction method and device
CN112287126A (en) * 2020-12-24 2021-01-29 中国人民解放军国防科技大学 Entity alignment method and device suitable for multi-mode knowledge graph

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Jiwei Wei et al. Universal Weighting Metric Learning for Cross-Modal Matching. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2020. *
Ya Jing et al. Cross-Modal Cross-Domain Moment Alignment Network for Person Search. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2020. *
Sun Yingying et al. A Survey of Multimodal Deep Learning. Computer Engineering and Applications. 2020-09-30; Vol. 56, No. 21; pp. 1-11. *

Also Published As

Publication number Publication date
CN113033438A (en) 2021-06-25

Similar Documents

Publication Publication Date Title
CN113033438B (en) Data feature learning method for modal imperfect alignment
CN113591902B (en) Cross-modal understanding and generating method and device based on multi-modal pre-training model
CN109101537B (en) Multi-turn dialogue data classification method and device based on deep learning and electronic equipment
CN110209823B (en) Multi-label text classification method and system
CN109165291B (en) Text matching method and electronic equipment
CN114330475B (en) Content matching method, apparatus, device, storage medium, and computer program product
CN109086265B (en) Semantic training method and multi-semantic word disambiguation method in short text
CN108959305A (en) A kind of event extraction method and system based on internet big data
CN116304984A (en) Multi-modal intention recognition method and system based on contrast learning
CN113536784B (en) Text processing method, device, computer equipment and storage medium
CN114239612A (en) Multi-modal neural machine translation method, computer equipment and storage medium
CN108287848A (en) Method and system for semanteme parsing
CN114818718A (en) Contract text recognition method and device
CN114925702A (en) Text similarity recognition method and device, electronic equipment and storage medium
Zhang et al. Knowledge Distillation from Multi-Modality to Single-Modality for Person Verification
CN117611845B (en) Multi-mode data association identification method, device, equipment and storage medium
CN113626553B (en) Cascade binary Chinese entity relation extraction method based on pre-training model
CN115358817A (en) Intelligent product recommendation method, device, equipment and medium based on social data
CN113963235A (en) Cross-category image recognition model reusing method and system
CN116450781A (en) Question and answer processing method and device
CN115618968B (en) New idea discovery method and device, electronic device and storage medium
CN117113941B (en) Punctuation mark recovery method and device, electronic equipment and storage medium
Jayavarthini et al. Improved reranking approach for person re-identification system
CN111402012B (en) E-commerce defective product identification method based on transfer learning
Raue et al. Symbol grounding association in multimodal sequences with missing elements

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant