CN113033438B - Data feature learning method for modal imperfect alignment - Google Patents
- Publication number
- CN113033438B (application CN202110345293.7A)
- Authority
- CN
- China
- Prior art keywords
- data
- aligned
- modal
- modality
- data set
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Abstract
The invention discloses a data feature learning method oriented to modal incomplete alignment, which comprises: defining a multi-modal data set, processing unaligned data by using the information contained in the aligned portion of the data, achieving the modality realignment goal through contrastive learning, and simultaneously learning features of the realigned data. To achieve this goal, the scheme guides neural network training with the proposed contrastive learning loss function. After training of the neural network model is completed, the unaligned multi-modal data are input into the model; by selecting the sample with the closest distance as the realigned sample, realignment of the incompletely aligned multi-modal data at the category level can be completed while the features of the aligned data are learned. The invention makes clear progress on the common performance indexes of clustering and classification tasks, greatly reduces time and memory consumption, and benefits subsequent tasks such as clustering, classification and recognition, and data retrieval.
Description
Technical Field
The invention relates to the field of feature learning, and in particular to a data feature learning method oriented to modal incomplete alignment.
Background
At present, multi-modal feature learning technology is widely applied in many fields. In retrieval applications, a picture corresponding to a text description can be retrieved by inputting a passage of text; the core of such retrieval is cross-modal feature learning. In social network analysis, each person can be regarded as an instance, and the text and the accompanying pictures in that person's social application (e.g., WeChat Moments) can be regarded as samples of two modalities; by performing multi-modal feature learning on the text and picture modalities of different people, people with similar interests can be clustered together, enabling further applications such as behavior analysis and personalized recommendation. In semantic navigation, a passage of speech is input to a robot, which parses the given description and performs feature learning combined with visual perception to complete the related tasks given in the description. These multi-modal feature learning techniques succeed primarily because high-quality multi-modal data exist that satisfy two assumptions. The first is the data completeness assumption: every sample must exist in all modalities, and no data may be missing. The second is the modality alignment assumption: there is a correct correspondence between samples in different modalities. In other words, with current technology, to perform feature learning on multi-modal data, the data must be screened and aligned in advance to ensure sample completeness and alignment. However, in practical scenarios, collecting complete, fully aligned multi-modal data is a very difficult task due to the complexity and mismatch of time and space.
For example, to evaluate the teaching quality of an online course, video frames and audio frames need to be input into a multi-modal-learning-based system for joint evaluation, but the video frames and audio frames are not always in a one-to-one aligned (corresponding) relationship, which may significantly reduce the performance of many multi-modal methods.
Although a few related methods for multi-modal data alignment currently exist, they all attempt to recover the alignment relationship between different modalities of the same sample (instance), i.e., alignment at the instance level; the required computation and storage are extremely expensive and the effect is often poor. For example, when running on an Nvidia 2080 Ti GPU, existing methods such as PVC cannot process large-scale data (e.g., the NoisyMNIST dataset with two modalities of 60000 samples each). In addition, even for small-scale data, PVC usually takes several hours to perform modality alignment and occupies a large amount of memory, and the data representation obtained after alignment often performs poorly on subsequent tasks such as classification and clustering. Furthermore, when the misaligned modalities also suffer from missing data (for example, some people post only text in their Moments without an accompanying picture, and thus lack the picture modality), instance-level alignment cannot be performed at all. Therefore, in contrast to alignment at the instance level, our research and design focus on performing category-level alignment (i.e., aligning same-class samples across modalities) and feature learning of the data at the same time. Practice proves that this method can process data of different scales in extremely little time with extremely little storage overhead, and achieves better results in subsequent tasks such as classification and clustering. Meanwhile, when data are also missing in the unaligned modalities, the method can still handle this case. Therefore, compared with instance-level alignment methods, the method has greater application prospects and practical value.
Disclosure of Invention
In order to solve the above problems, the present invention provides a data feature learning method oriented to modality incomplete alignment, which is implemented by the following technical scheme:
a data feature learning method facing modal imperfect alignment comprises the following steps:
s1, defining a video image modality data set and a sound modality data set whose modalities are incompletely aligned, selecting either modality as the alignment reference modality and the other as the to-be-aligned modality;
s2, taking the aligned data in the video image modality data set and the sound modality data set as a positive sample pair, and constructing a negative sample pair by taking the selected sample of the aligned reference modality as a reference;
s3, respectively inputting the constructed positive and negative samples into two neural networks with different structures, and calculating the common representation of the positive and negative samples;
s4, calculating a loss function by using the obtained public expression, and training two neural networks by using the calculated loss function;
and S5, inputting the sample data of the unaligned part in the video image mode data set and the sound mode data set into the trained neural network, and correcting the alignment relation of the sample data of the unaligned part to realign the sample data.
The scheme has the advantages that the common problem of modal non-alignment in multi-modal data can be solved, the alignment relation of the multi-modal data is recovered, and the performance of a multi-modal model is guaranteed.
Further, in the above-mentioned case,
the incompletely aligned video image modality data set and sound modality data set defined in S1 are respectively expressed as:
- {X^(1)} = {A^(1), U^(1)};
- {X^(2)} = {A^(2), U^(2)};
- wherein {X^(1)} is the incompletely aligned video image modality data set; A^(1) = {a_1^(1), ..., a_j^(1)} denotes the data set of the aligned portion of the video image modality, a_j^(1) denotes a data sample in the aligned partial data set {A^(1)}, and j is the number of aligned samples; U^(1) = {u_1^(1), ..., u_k^(1)} denotes the data set of the unaligned portion of the video image modality, u_k^(1) denotes a data sample in the unaligned partial data set {U^(1)}, and k is its number of samples. {X^(2)} is the incompletely aligned sound modality data set; A^(2) = {a_1^(2), ..., a_n^(2)} denotes the data set of the aligned portion of the sound modality, a_n^(2) denotes a data sample in the aligned partial data set {A^(2)}, and n is the number of aligned samples; U^(2) = {u_1^(2), ..., u_m^(2)} denotes the data set of the unaligned portion of the sound modality, u_m^(2) denotes a data sample in the unaligned partial data set {U^(2)}, and m is its number of samples.
The beneficial effect of the above scheme is that the problem of modality imperfect alignment is defined.
Further, the specific method for constructing the negative sample pair in S2 is as follows:
s21, taking the data of the aligned part in the video image mode data set and the sound mode data set as a positive sample pair;
and S22, taking the data of each alignment part in the alignment reference mode as an anchor point, and randomly sampling a plurality of data samples in the data set of the alignment part of the other mode to form a negative sample pair with each anchor point.
The beneficial effect of the above further scheme is that a negative sample for comparison learning can be constructed without the help of additional label information.
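As an illustrative sketch (not the patented implementation), the pair construction of S21 and S22 can be written as follows; the function name `build_pairs` and the `num_negatives` parameter are hypothetical choices:

```python
import numpy as np

def build_pairs(aligned_1, aligned_2, num_negatives=3, seed=0):
    """Construct positive and negative cross-modal pairs from the aligned portion.

    aligned_1, aligned_2: arrays of aligned samples from the two modalities,
    where row i of each array describes the same underlying instance.
    Returns a list of (index_in_modality_1, index_in_modality_2, label) tuples,
    with label 1 for positive pairs and 0 for negative pairs.
    """
    rng = np.random.default_rng(seed)
    n = len(aligned_1)
    pairs = [(i, i, 1) for i in range(n)]        # S21: aligned data form positive pairs
    for anchor in range(n):                      # S22: each aligned reference sample is an anchor
        others = [j for j in range(n) if j != anchor]
        picks = rng.choice(others, size=min(num_negatives, len(others)), replace=False)
        for j in picks:
            pairs.append((anchor, int(j), 0))    # randomly sampled negative pair
    return pairs
```

Because the negatives are drawn without any label information, some of them may in fact be same-class pairs; this is exactly the noisy-pair situation that the loss function in S4 is designed to tolerate.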
Further, the common representations of the positive and negative samples in step S3 are respectively:
- Z^(1) = f^(1)(A_1);
- Z^(2) = f^(2)(A_2);
- wherein f^(1) is the neural network constructed for the video image modality and f^(2) is the neural network constructed for the sound modality; A_1 is the set of samples of the positive and negative pairs belonging to the video image modality, and Z^(1) is the common representation of A_1 learned through the neural network f^(1); A_2 is the set of samples of the positive and negative pairs belonging to the sound modality, and Z^(2) is the common representation of A_2 learned through the neural network f^(2).
The further scheme has the advantages that the data characteristics of the constructed positive and negative samples are extracted, and data redundancy is reduced.
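To make step S3 concrete, the following minimal sketch uses two tiny numpy networks with different structures as stand-ins for f^(1) and f^(2); the `make_mlp` helper, the layer sizes, and the input dimensions are illustrative assumptions, not the networks used by the invention:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mlp(in_dim, hidden, out_dim, rng):
    """A tiny two-layer network f(x) = relu(x W1) W2, standing in for f^(1)/f^(2)."""
    W1 = rng.normal(0, 0.1, (in_dim, hidden))
    W2 = rng.normal(0, 0.1, (hidden, out_dim))
    return lambda X: np.maximum(X @ W1, 0.0) @ W2

# Two networks with different structures, one per modality,
# both projecting into the same 10-dimensional common space.
f1 = make_mlp(in_dim=784, hidden=32, out_dim=10, rng=rng)   # video image modality
f2 = make_mlp(in_dim=128, hidden=64, out_dim=10, rng=rng)   # sound modality

A1 = rng.normal(size=(6, 784))   # samples of the pairs from the video image modality
A2 = rng.normal(size=(6, 128))   # samples of the pairs from the sound modality
Z1, Z2 = f1(A1), f2(A2)          # common representations Z^(1), Z^(2)
```

The key point is only that the two heterogeneous inputs end up in a shared space of equal dimension, so that distances between Z^(1) and Z^(2) are meaningful.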
Further, the loss function in S4 is specifically expressed as:

L = (1/N) Σ_{l=1}^{N} [ P · L_pos + (1 − P) · L_neg ];

wherein P is a variable taking the values 0 and 1: 0 denotes a constructed negative pair and 1 denotes a constructed positive pair; N is the total number of training samples and l is the index of the training sample;

L_pos = d(z_{j1}^(1), z_{n1}^(2)), with j1 = n1, represents the loss function for a positive sample pair, wherein d(z_{j1}^(1), z_{n1}^(2)) represents the distance, in the latent space learned by the neural networks, between a cross-modal positive pair in the two modalities; z_{j1}^(1) and z_{n1}^(2) are respectively the members of the cross-modal positive pair in the two modalities, and j1 = n1 indicates that cross-modal positive pairs in different modalities share the same index value;

L_neg = max(0, ρ − d(z_{j2}^(1), z_{n2}^(2))), with j2 ≠ n2, represents the loss function for a negative sample pair, wherein ρ is an adaptive boundary value and z_{j2}^(1), z_{n2}^(2) represent a cross-modal negative pair randomly sampled from the aligned portions of the two modalities, with different index values.
The further scheme has the advantages that the neural network is guided to learn a potential space, multi-mode data are projected into the potential space, characteristics of the data can be obtained, meanwhile, characteristics of different modes and the same class are close to each other, the characteristics of different modes and the different classes are far away from each other, and therefore the characteristics of the data can be learned simultaneously when the mode alignment relation is processed.
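A minimal per-pair version of such a loss might look as follows; the squared hinge on the negative branch is an assumed common form, and the exact functional form in the invention may differ:

```python
import numpy as np

def contrastive_loss(z1, z2, is_positive, margin):
    """Noise-robust margin contrastive loss for one cross-modal pair.

    z1, z2: latent representations of a cross-modal pair (1-D arrays).
    is_positive: 1 for a constructed positive pair, 0 for a negative pair.
    margin: the boundary value rho.
    Positive pairs are pulled together (their distance drives the loss);
    negative pairs are pushed apart only while they are closer than the
    margin, which caps the influence of false negatives (noisy pairs).
    """
    d = np.linalg.norm(z1 - z2)              # distance in the learned latent space
    if is_positive:
        return d ** 2 / 2
    return max(0.0, margin - d) ** 2 / 2     # hinge: zero once farther than the margin
```

The hinge is what provides the noise robustness described above: once a (possibly mislabeled) negative pair is pushed beyond the margin, it contributes no gradient at all.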
Further, in the above-mentioned case,
the calculation method of the distance of the cross-modal positive sample in the different modes in the latent space learned in the neural network is as follows:
The further solution described above has the advantage that a loss function for positive sample pairs is defined.
Further, the S5 specifically includes:
s51, sequentially sending the sample data pairs of the cross-modal misalignment part in the video image mode and the sound mode into the trained neural network model f(1)And f(2)Calculating a feature representation set of the unaligned partial sample data pairs, wherein the input cross-modal unaligned partial sample data pairs have the same index each time;
s52, calculating a distance matrix of the unaligned part data according to the feature representation set obtained in the step S51;
and S53, for each feature in the feature representation set of the alignment reference modality, searching the feature representation set of the to-be-aligned modality for the sample feature that minimizes the distance-matrix entry with respect to that reference feature, and outputting the feature with the minimum distance as the feature of the realigned modality.
The beneficial effect of the further scheme is that the alignment relation of the modal non-aligned sample is corrected to realign the modal non-aligned sample.
Further, in the above-mentioned case,
the feature representation set of the unaligned part data in the different modalities is represented as:
wherein,to align the feature set of unaligned partial data in the reference modality, f(1)A neural network for inputting data of a part which is not aligned in the alignment reference mode;for feature sets of unaligned partial data in the modality to be aligned, f(2)And inputting the unaligned part of data in the modality to be aligned to the neural network.
The beneficial effect of the further scheme is that the characteristics of modal non-aligned data are extracted, and redundancy removal of the data is realized.
Further, the formula for calculating the distance matrix of the unaligned portion data in step S52 is:

D_{k,m} = || v_k^(1) − v_m^(2) ||_2;

wherein D is the distance matrix of the unaligned portion data, and v_k^(1), v_m^(2) are the features of the kth and mth unaligned samples in V^(1) and V^(2), respectively.
The further scheme has the advantages that the distance matrix of the characteristics of the unaligned part of the data is calculated, and the alignment relation of the part of the data is convenient to correct.
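Steps S51 to S53 amount to a nearest-neighbour search over the pairwise distance matrix; the following numpy sketch, with a hypothetical `realign` function, illustrates this under the assumption of Euclidean distances:

```python
import numpy as np

def realign(V1, V2):
    """Category-level realignment of unaligned samples (sketch of S51-S53).

    V1: features f^(1)(U^(1)) of the alignment reference modality, shape (k, d).
    V2: features f^(2)(U^(2)) of the to-be-aligned modality, shape (m, d).
    Returns, for each reference feature, the index of its nearest feature in V2.
    """
    # S52: pairwise Euclidean distance matrix D, with D[i, j] = ||V1[i] - V2[j]||_2
    D = np.linalg.norm(V1[:, None, :] - V2[None, :, :], axis=-1)
    # S53: the closest sample in the to-be-aligned modality is the realigned partner
    return D.argmin(axis=1)
```

Since the networks were trained to place same-class cross-modal samples close together, the argmin over each row of D recovers a same-class partner rather than the exact original instance, which is precisely category-level alignment.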
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention. In the drawings:
fig. 1 is a schematic flow chart of the data feature learning method oriented to modal incomplete alignment.
Detailed Description
Hereinafter, the term "comprising" or "may include" used in various embodiments of the present invention indicates the presence of the disclosed function, operation, or element, and does not limit the addition of one or more functions, operations, or elements. Furthermore, as used in various embodiments of the present invention, the terms "comprises", "comprising", "includes", "including", "has", "having" and their derivatives are intended only to indicate the presence of the specified features, numbers, steps, operations, elements, components, or combinations of the foregoing, and should not be construed as excluding the presence of, or the possibility of adding, one or more other features, numbers, steps, operations, elements, components, or combinations of the foregoing.
In various embodiments of the invention, the expression "or" or "at least one of A or/and B" includes any and all combinations of the words listed with it. For example, the expression "A or B" or "at least one of A or/and B" may include A, may include B, or may include both A and B.
Expressions (such as "first", "second", and the like) used in various embodiments of the present invention may modify various constituent elements in various embodiments, but may not limit the respective constituent elements. For example, the above description does not limit the order and/or importance of the elements described. The foregoing description is for the purpose of distinguishing one element from another. For example, the first user device and the second user device indicate different user devices, although both are user devices. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of various embodiments of the present invention.
It should be noted that: if it is described that one constituent element is "connected" to another constituent element, the first constituent element may be directly connected to the second constituent element, and a third constituent element may be "connected" between the first constituent element and the second constituent element. In contrast, when one constituent element is "directly connected" to another constituent element, it is understood that there is no third constituent element between the first constituent element and the second constituent element.
The terminology used in the various embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the various embodiments of the invention. As used herein, the singular forms are intended to include the plural forms as well, unless the context clearly indicates otherwise. Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which various embodiments of the present invention belong. The terms (such as those defined in commonly used dictionaries) should be interpreted as having a meaning that is consistent with their contextual meaning in the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein in various embodiments of the present invention.
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to examples and the accompanying drawings, and the exemplary embodiments and descriptions thereof are only used for explaining the present invention and are not used as limiting the present invention.
Example 1
A method for learning data features oriented to modal imperfect alignment, as shown in fig. 1, comprising the steps of:
s1, defining a video image modality data set and a sound modality data set with non-completely aligned modalities, and selecting any modality as an alignment reference modality, and another modality as a modality to be aligned, specifically, the video image modality data set and the sound modality data set with non-completely aligned modalities defined are respectively expressed as:
{X^(1)} = {A^(1), U^(1)};
{X^(2)} = {A^(2), U^(2)};
wherein {X^(1)} is the incompletely aligned video image modality data set; A^(1) = {a_1^(1), ..., a_j^(1)} denotes the data set of the aligned portion of the video image modality, a_j^(1) denotes a data sample in the aligned partial data set {A^(1)}, and j is the number of aligned samples; U^(1) = {u_1^(1), ..., u_k^(1)} denotes the data set of the unaligned portion of the video image modality, u_k^(1) denotes a data sample in the unaligned partial data set {U^(1)}, and k is its number of samples. {X^(2)} is the incompletely aligned sound modality data set; A^(2) = {a_1^(2), ..., a_n^(2)} denotes the data set of the aligned portion of the sound modality, a_n^(2) denotes a data sample in the aligned partial data set {A^(2)}, and n is the number of aligned samples; U^(2) = {u_1^(2), ..., u_m^(2)} denotes the data set of the unaligned portion of the sound modality, u_m^(2) denotes a data sample in the unaligned partial data set {U^(2)}, and m is its number of samples.
In the present embodiment, for convenience of illustration and without loss of generality, the scheme is described in detail taking two modalities, i.e., a video image modality and a sound modality, as an example; when the number of modalities is greater than two, the situation is similar and can be handled analogously. Meanwhile, the video image modality serves as the alignment reference, and the sound modality is aligned to the reference modality. The scheme defines a multi-modal dataset that satisfies class-level alignment if and only if samples of different modalities x_j^(1) and x_k^(2) belong to the same class, i.e., C(x_j^(1)) = C(x_k^(2)), where C(x) represents the class of sample x. Such class-level alignment can be achieved through a discrimination problem, namely: from {X^(1)}, find a sample x_j^(1) such that it and the kth sample x_k^(2) in {X^(2)} satisfy the above objective.
S2, taking the aligned data in the video image modality data set and the sound modality data set as a positive sample pair, and constructing a negative sample pair by taking the selected sample of the aligned reference modality as a reference;
specifically, the method comprises the following steps:
s21, taking the data of the aligned part in the video image mode data set and the sound mode data set as a positive sample pair;
and S22, taking the data of each alignment part in the alignment reference mode as an anchor point, and randomly sampling a plurality of data samples in the data set of the alignment part of the other mode to form a negative sample pair with each anchor point.
The scheme uses cross-modal sample pairs of the same category as positive pairs and cross-modal sample pairs of different categories as negative pairs in category-level contrastive learning, and then increases the similarity between positive pairs while decreasing the similarity of negative pairs. In the implementation of the scheme, the data of the aligned portion are taken as positive pairs. Negative pairs cannot be constructed directly, since the scheme allows for the more general unsupervised case, i.e., without manually annotated class labels. Therefore, to obtain the above negative pairs, for each sample in the alignment reference modality, samples are randomly drawn from each of the remaining modalities and paired with it as negative pairs. Negative pairs formed in this way will contain noisy data, i.e., a pair that should have been a positive pair is mistaken for a negative pair. To prevent these incorrectly labeled pairs from affecting the scheme, the contrastive learning needs to be robust to such noisy data.
S3, respectively inputting the constructed positive and negative samples into two neural networks with different structures, and calculating the common representation of the positive and negative samples;
for ease of illustration and without loss of generality, the scheme is described in detail using two modalities as an example. When there are more than two modalities, the situation is similar and can be analogized accordingly. For the video image mode and the sound mode, the corresponding two neural networks are respectively f(1),f(2)The corresponding training data are the data of the aligned part, respectively A(1)And A(2)The test data are partial data of the unaligned relation and are respectively U(1)And U(2)Wherein the neural network f(1),f(2)Different configurations are used.
In the scheme, the common representations of the constructed positive and negative sample pairs can be obtained through the neural networks respectively; the common representations of the positive and negative samples are respectively:

Z^(1) = f^(1)(A_1);
Z^(2) = f^(2)(A_2);

wherein f^(1) is the neural network constructed for the video image modality and f^(2) is the neural network constructed for the sound modality; A_1 is the set of samples of the positive and negative pairs belonging to the video image modality, and Z^(1) is the common representation of A_1 learned through the neural network f^(1); A_2 is the set of samples of the positive and negative pairs belonging to the sound modality, and Z^(2) is the common representation of A_2 learned through the neural network f^(2).
S4, calculating a loss function by using the obtained public expression, and training two neural networks by using the calculated loss function;
to achieve the above-described mode alignment goal by contrast learning, the present solution utilizes two neural networks to learn a common representation of two modes, i.e., { f }(1),f(2)}, the loss function is defined as follows:
wherein, P is variable 0 and 1, 0 represents negative sample obtained by construction, 1 represents positive sample obtained by construction, N is total number of training samples, and l is index of training sample;
representing a loss function for a positive sample pair, wherein,represents the distance in the latent space learned in the neural network across modal positive sample pairs in different modalities,are respectively cross-modal positive sample pairs, j, in different modes1=n1Representing that cross-modality positive sample pairs in different modalities have the same index value;
representing a loss function for a negative sample pair, where p is an adaptive boundary value,representing pairs of cross-modal negative samples randomly sampled from aligned portions of different modalities, with different index values.
Where d (-) denotes the distance of two samples in the neural network's underlying space, i.e.
Where p is an adaptive boundary value, the loss functionThe negative influence of the noise data in the negative sample pairs on the training of the neural network can be reduced or even eliminated, so that the neural network is robust to the noise data. Designed loss functionThe method is mainly used for guiding the neural network to learn the common subspace of the multi-modal data, wherein samples of the same class but belonging to different modes are close to each other, and samples of different classes and different modes are far away from each other.
The training of the neural networks is guided by the designed noise-robust contrastive loss function: in each iteration, the gradient of the accumulated loss value is computed and back-propagated to update the parameter values of the neural networks; training stops once the iterative process converges, yielding the trained neural networks.
And S5, inputting the sample data of the unaligned part in the video image mode data set and the sound mode data set into the trained neural network, and correcting the alignment relation of the sample data of the unaligned part to realign the sample data.
After training of the neural network model is completed, the unaligned multi-modal data are input into the model; by selecting the nearest sample in the learned common space as the realigned sample, multi-modal data alignment at the category level can be completed while the aligned features are learned at the same time. It is worth noting that the loss function designed by the present scheme is the first loss function in contrastive learning that is robust to noisy data. Based on the above procedure, the model can be sufficiently trained on the aligned data of the multiple modalities and implicitly learn a common representation of each modality together with the alignment information, thereby enabling the model to efficiently process non-aligned multi-modal data.
The specific method comprises the following steps:
S51, sequentially feeding the sample data pairs of the cross-modal unaligned portions of the video image modality and the sound modality into the trained neural network models f(1) and f(2), and calculating a feature representation set of the unaligned partial sample data pairs, wherein the cross-modal unaligned partial sample data pairs input each time have the same index;
In this embodiment, the corresponding samples of U(1) and U(2) are sequentially fed into the trained networks f(1) and f(2) respectively to obtain their feature representations, wherein the cross-modal unaligned partial sample data pairs input each time have the same index, i.e., k = m.
S52, calculating a distance matrix of the unaligned part data according to the feature representation set obtained in the step S51;
Specifically, the distances between the unaligned data features are calculated according to the distance d(·) defined above and assembled into a matrix Q, where Q11 is the distance between the first sample in the alignment reference modality and the first sample in the to-be-aligned modality, Q12 is the distance between the first sample in the alignment reference modality and the second sample in the to-be-aligned modality, and so on.
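A sketch of the distance-matrix computation in S52, assuming d(·) is the Euclidean distance (function and variable names are illustrative):

```python
import numpy as np

def distance_matrix(Z_ref, Z_unaligned):
    # Q[k, m] = d(k-th reference-modality feature, m-th to-be-aligned feature),
    # computed for all pairs at once via broadcasting.
    diff = Z_ref[:, None, :] - Z_unaligned[None, :, :]
    return np.linalg.norm(diff, axis=2)

Q = distance_matrix(np.array([[0.0, 0.0], [3.0, 4.0]]),
                    np.array([[0.0, 0.0]]))
```

Row k of Q holds the distances from the k-th reference feature to every candidate in the to-be-aligned modality.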
S53, for each feature in the feature representation set of the alignment reference modality, searching the feature representation set of the to-be-aligned modality for the sample feature that minimizes the distance to the reference feature, and outputting the minimum-distance feature as the feature of the realigned modality.
For each feature representation in the reference set, the feature representation in the to-be-aligned set that minimizes the distance between the two representations is searched for, i.e., Dkm is minimized; the pair achieving the minimum is then taken as aligned.
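Step S53 then reduces to a row-wise argmin over the distance matrix. The following sketch performs greedy nearest-neighbour matching; whether a strict one-to-one assignment is additionally enforced is not stated in the description, so this is an illustrative assumption:

```python
import numpy as np

def realign(Z_ref, Z_unaligned):
    # For each reference-modality feature, pick the nearest feature in the
    # to-be-aligned modality as its realigned counterpart (greedy matching).
    Q = np.linalg.norm(Z_ref[:, None, :] - Z_unaligned[None, :, :], axis=2)
    match = np.argmin(Q, axis=1)       # index of nearest to-be-aligned sample
    return match, Z_unaligned[match]   # realigned cross-modal feature pairs

m, _ = realign(np.array([[0.0, 0.0], [1.0, 1.0]]),
               np.array([[1.1, 1.0], [0.1, 0.0]]))
```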
Experimental verification
To verify the superiority of the present technical solution, it is first compared with 10 other multi-modal clustering techniques, namely Canonical Correlation Analysis (CCA), Kernel Canonical Correlation Analysis (KCCA), Deep Canonical Correlation Analysis (DCCA), Deep Canonically Correlated Autoencoders (DCCAE), multi-modal clustering based on matrix decomposition (MvC-DMF), latent multi-modal subspace clustering (LMSC), self-weighted multi-modal clustering (SwMC), binary multi-modal clustering (BMVC), autoencoder-in-autoencoder networks (AE2-Nets), and partially view-aligned clustering (PVC). Specifically, we performed experimental comparisons on the object picture dataset Caltech-101 and the Reuters news dataset. Because the comparison algorithms cannot process partially aligned data, PCA is first used for dimensionality reduction, the Hungarian algorithm is then used to obtain an alignment order, the data are realigned accordingly, and the compared methods are then used for clustering. For a comprehensive comparison, three indexes commonly used to measure clustering quality, namely Normalized Mutual Information (NMI), Accuracy (ACC) and the Adjusted Rand Index (ARI), are used as the quantitative indexes of our experiments to verify the effectiveness of the algorithms. The value ranges of the three indexes are all 0-1; a larger value indicates a better result, and a value of 1 indicates that the algorithm clusters the data perfectly. NMI is calculated as follows:
where Y is the category information predicted by the method and C is the actual category information of the data; H(·) denotes information entropy, and I(Y; C) denotes mutual information. ARI is calculated as follows:
where RI is the Rand Index and E(·) denotes expectation.
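For reference, NMI can be computed from the quantities defined above; the arithmetic-mean normalization 2·I(Y;C)/(H(Y)+H(C)) used here is one common convention and is an assumption, since the exact formula is not reproduced in this text:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    # H(.): Shannon entropy of a label assignment (natural log)
    n = len(labels)
    return -sum((c / n) * np.log(c / n) for c in Counter(labels).values())

def mutual_info(Y, C):
    # I(Y;C) = sum_{y,c} p(y,c) * log( p(y,c) / (p(y) * p(c)) )
    n = len(Y)
    pY, pC = Counter(Y), Counter(C)
    mi = 0.0
    for (y, c), cnt in Counter(zip(Y, C)).items():
        mi += (cnt / n) * np.log(cnt * n / (pY[y] * pC[c]))
    return mi

def nmi(Y, C):
    # Assumed normalization: 2 * I(Y;C) / (H(Y) + H(C))
    return 2.0 * mutual_info(Y, C) / (entropy(Y) + entropy(C))
```

In practice, `sklearn.metrics.normalized_mutual_info_score` and `adjusted_rand_score` provide well-tested implementations of NMI and ARI.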
Experiment one:
The performance of the present solution was evaluated using the Reuters dataset. Reuters is a text dataset consisting of 6 categories and containing text in 5 languages, namely English text and its corresponding translations in French, German, Spanish and Italian, as shown in Table 1.
TABLE 1
Modality | English | French | German | Spanish | Italian |
Number of samples | 18758 | 26648 | 29953 | 24039 | 12342 |
Each language is treated as a modality. We evaluate the present solution using the corresponding samples in the English modality and the French modality to construct incompletely aligned multi-modal data. The comparative results are shown in Table 2:
TABLE 2
As can be seen from the table, compared with the other clustering methods, the method provided by the present scheme achieves a large improvement in the accuracy index, the normalized mutual information index and the adjusted Rand index. This means that in practical applications, even without any given data label information, the present scheme can correctly cluster non-aligned language text data, avoiding the dependence on data labels (which would require a large amount of human resources). In addition, the present scheme needs only 3.48 seconds to realign the imperfectly aligned data and learn the data features, whereas the PVC method needs 30.36 seconds and the Hungarian-algorithm-based method needs 289.82 seconds. Thus, the present method greatly reduces time overhead while substantially improving performance.
Experiment two:
This experiment uses the Caltech-101 dataset, which contains 9144 pictures from 101 object classes; six types of extracted features (Gabor, WM, CENTRIST, HOG, GIST and LBP) are used as six modalities. Because the number of categories is too large, only the 20 categories with the most data are listed here; the corresponding category information and sample-number distribution are shown in Table 3:
TABLE 3
Face | Leopard | Motorbike | Binocular | Brain | Camera | Car side | Dollar bill | Ferry | Garfield |
435 | 200 | 798 | 33 | 98 | 50 | 123 | 52 | 67 | 34 |
Hedgehog | Pagoda | Rhino | Snoopy | Stapler | Stop sign | Water lily | Windsor chair | Wrench | Yin-yang |
54 | 47 | 59 | 35 | 45 | 64 | 37 | 56 | 39 | 60 |
We evaluate the present solution using the corresponding samples in the HOG-feature modality and the GIST-feature modality to construct incompletely aligned multi-modal data. The comparative results are shown in Table 4:
TABLE 4
Compared with the other clustering methods, the method provided by the present scheme achieves a large improvement in the accuracy index, the normalized mutual information index and the adjusted Rand index. This means that in practical applications, even without any given data label information, the present scheme can correctly cluster non-aligned multi-modal data, avoiding the dependence on data labels (which would require a large amount of human resources). In addition, the present scheme needs only 1.75 seconds to realign the imperfectly aligned data and learn the data features, whereas the PVC method needs 7.2 seconds and the Hungarian-algorithm-based method needs 48.87 seconds. Thus, the present method greatly reduces time overhead while substantially improving performance.
To further verify the superiority of the present technical solution, we further compare it on classification tasks with 9 other multi-modal feature learning techniques, namely Canonical Correlation Analysis (CCA), Kernel Canonical Correlation Analysis (KCCA), Deep Canonical Correlation Analysis (DCCA), Deep Canonically Correlated Autoencoders (DCCAE), matrix-decomposition-based multi-modal clustering (MvC-DMF), latent multi-modal subspace clustering (LMSC), binary multi-modal clustering (BMVC), autoencoder-in-autoencoder networks (AE2-Nets), and partially view-aligned clustering (PVC). Note that since self-weighted multi-modal clustering (SwMC) directly learns the data clustering result, it cannot be used for classification experiments. Specifically, the experimental comparison is carried out on the object picture dataset Caltech-101 and the Reuters news dataset. Because the comparison algorithms cannot process partially aligned data, PCA is first used for dimensionality reduction, the Hungarian algorithm is then used to obtain an alignment order, the data are realigned accordingly, feature learning is then performed with the compared methods, and a standard SVM classifier is finally trained for classification. We use classification Accuracy (ACC) as the quantitative index of our experiments to verify the effectiveness of the methods.
Experiment three
As in Experiment 1, the Reuters dataset is used to evaluate the performance of the proposed technical solution. The experimental results are shown in Table 5:
TABLE 5
As can be seen from the table, compared with the other methods, the method provided by the present scheme achieves a large improvement in the classification accuracy index, which means that in practical applications the present scheme can correctly classify non-aligned language text data, avoiding the large amount of human resources that would otherwise be needed to correct the alignment relations of the multi-modal data.
Experiment four
As in Experiment 2, the Caltech-101 dataset is used in this experiment to evaluate the performance of the present technical solution. The experimental results are shown in Table 6:
TABLE 6
As can be seen from the table, compared with the other methods, the method provided by the present scheme achieves a large improvement in the classification accuracy index, which means that in practical applications the present scheme can correctly classify non-aligned multi-modal data, avoiding the large amount of human resources that would otherwise be needed to correct the alignment relations of the multi-modal data.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (9)
1. A data feature learning method facing modal imperfect alignment is characterized by comprising the following steps:
s1, defining a video image modal data set and a sound modal data set with incompletely aligned modalities, and selecting any modality as an alignment reference modality, and selecting the other modality as a to-be-aligned modality;
s2, taking the aligned data in the video image modality data set and the sound modality data set as a positive sample pair, and constructing a negative sample pair by taking the selected sample of the aligned reference modality as a reference;
S3, inputting the constructed positive and negative sample pairs into two neural networks with different structures respectively; specifically, inputting the data belonging to the video image modality data set in the positive and negative sample pairs into a first neural network, inputting the data belonging to the sound modality data set in the positive and negative sample pairs into a second neural network, and calculating the common representations of the video image modality data set and the sound modality data set respectively;
S4, calculating a loss function by using the obtained common representations, and training the two neural networks by using the calculated loss function;
and S5, inputting the sample data of the unaligned part in the video image mode data set and the sound mode data set into a trained neural network, and correcting the alignment relation of the sample data of the unaligned part to realign the sample data.
2. The method according to claim 1, wherein the modality-imperfectly aligned video image modality data set and the sound modality data set defined in S1 are respectively expressed as:
{X(1)}={A(1),U(1)};
{X(2)}={A(2),U(2)};
wherein {X(1)} is the modality-incompletely-aligned video image modality data set; {A(1)} denotes the data set of the aligned portion of the video image modality, whose elements are the aligned data samples, with j being the number of aligned samples; {U(1)} denotes the data set of the unaligned portion of the video image modality, whose elements are the unaligned data samples, with k being its number of samples; {X(2)} is the modality-incompletely-aligned sound modality data set; {A(2)} denotes the data set of the aligned portion of the sound modality, whose elements are the aligned data samples, with n being the number of aligned samples; {U(2)} denotes the data set of the unaligned portion of the sound modality, whose elements are the unaligned data samples, with m being its number of samples.
3. The method for learning modal-oriented imperfect alignment data features according to claim 2, wherein the specific method for constructing the negative sample pairs in S2 is:
s21, taking the data of the aligned part in the video image mode data set and the sound mode data set as a positive sample pair;
and S22, taking the data of each alignment part in the alignment reference mode as an anchor point, and randomly sampling a plurality of data samples in the data set of the alignment part in the to-be-aligned mode to form a negative sample pair with each anchor point.
4. The method according to claim 3, wherein the common representations of the positive and negative sample pairs in step S3 are:
Z(1)=f(1)(A1);
Z(2)=f(2)(A2);
wherein f(1) is the neural network constructed for the video image modality and f(2) is the neural network constructed for the sound modality; A1 is the set of samples in the positive and negative sample pairs that belong to the video image modality, and Z(1) is the common representation of A1 learned through the neural network f(1); A2 is the set of samples in the positive and negative sample pairs that belong to the sound modality, and Z(2) is the common representation of A2 learned through the neural network f(2).
5. The method according to claim 4, wherein the loss function in S4 is specifically expressed as:
wherein P is an indicator variable taking the values 0 and 1: 0 denotes a constructed negative sample pair and 1 denotes a constructed positive sample pair; N is the total number of training samples, and l is the index of a training sample;
The positive-pair term represents the loss function for a positive sample pair, wherein the distance term represents the distance, in the latent space learned by the neural networks, between cross-modal positive sample pairs in different modalities; the pair members are respectively the cross-modal positive samples in the different modalities, and j1 = n1 indicates that cross-modal positive sample pairs in different modalities have the same index value;
7. The method according to claim 6, wherein the S5 is specifically:
S51, sequentially feeding the sample data pairs of the cross-modal unaligned portions of the video image modality and the sound modality into the trained neural network models f(1) and f(2), and calculating a feature representation set of the unaligned partial sample data pairs, wherein the cross-modal unaligned partial sample data pairs input each time have the same index;
s52, calculating a distance matrix of the unaligned part data according to the feature representation set obtained in the step S51;
S53, for each feature in the feature representation set of the alignment reference modality, searching the feature representation set of the to-be-aligned modality for the sample feature that minimizes the distance to the reference feature, and outputting the minimum-distance feature together with the feature in the alignment reference modality as a realigned cross-modal feature pair.
8. A method according to claim 7, wherein the feature representation set of the non-aligned partial data in step S51 is expressed as:
wherein the first set is the feature set of the unaligned partial data in the alignment reference modality, obtained by inputting the unaligned partial data of the alignment reference modality into the neural network f(1); the second set is the feature set of the unaligned partial data in the to-be-aligned modality, obtained by inputting the unaligned partial data of the to-be-aligned modality into the neural network f(2).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110345293.7A CN113033438B (en) | 2021-03-31 | 2021-03-31 | Data feature learning method for modal imperfect alignment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110345293.7A CN113033438B (en) | 2021-03-31 | 2021-03-31 | Data feature learning method for modal imperfect alignment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113033438A CN113033438A (en) | 2021-06-25 |
CN113033438B true CN113033438B (en) | 2022-07-01 |
Family
ID=76453142
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110345293.7A Active CN113033438B (en) | 2021-03-31 | 2021-03-31 | Data feature learning method for modal imperfect alignment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113033438B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113486833B (en) * | 2021-07-15 | 2022-10-04 | 北京达佳互联信息技术有限公司 | Multi-modal feature extraction model training method and device and electronic equipment |
CN114067233B (en) * | 2021-09-26 | 2023-05-23 | 四川大学 | Cross-mode matching method and system |
CN114139641B (en) * | 2021-12-02 | 2024-02-06 | 中国人民解放军国防科技大学 | Multi-modal characterization learning method and system based on local structure transfer |
CN117252274B (en) * | 2023-11-17 | 2024-01-30 | 北京理工大学 | Text audio image contrast learning method, device and storage medium |
CN117494147B (en) * | 2023-12-29 | 2024-03-22 | 戎行技术有限公司 | Multi-platform virtual user data alignment method based on network space behavior data |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112001437A (en) * | 2020-08-19 | 2020-11-27 | 四川大学 | Modal non-complete alignment-oriented data clustering method |
WO2021000664A1 (en) * | 2019-07-03 | 2021-01-07 | 中国科学院自动化研究所 | Method, system, and device for automatic calibration of differences in cross-modal target detection |
CN112287126A (en) * | 2020-12-24 | 2021-01-29 | 中国人民解放军国防科技大学 | Entity alignment method and device suitable for multi-mode knowledge graph |
CN112434654A (en) * | 2020-12-07 | 2021-03-02 | 安徽大学 | Cross-modal pedestrian re-identification method based on symmetric convolutional neural network |
CN112584062A (en) * | 2020-12-10 | 2021-03-30 | 上海哔哩哔哩科技有限公司 | Background audio construction method and device |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111368123B (en) * | 2020-02-17 | 2022-06-28 | 同济大学 | Three-dimensional model sketch retrieval method based on cross-modal guide network |
CN112001438B (en) * | 2020-08-19 | 2023-01-10 | 四川大学 | Multi-mode data clustering method for automatically selecting clustering number |
-
2021
- 2021-03-31 CN CN202110345293.7A patent/CN113033438B/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021000664A1 (en) * | 2019-07-03 | 2021-01-07 | 中国科学院自动化研究所 | Method, system, and device for automatic calibration of differences in cross-modal target detection |
CN112001437A (en) * | 2020-08-19 | 2020-11-27 | 四川大学 | Modal non-complete alignment-oriented data clustering method |
CN112434654A (en) * | 2020-12-07 | 2021-03-02 | 安徽大学 | Cross-modal pedestrian re-identification method based on symmetric convolutional neural network |
CN112584062A (en) * | 2020-12-10 | 2021-03-30 | 上海哔哩哔哩科技有限公司 | Background audio construction method and device |
CN112287126A (en) * | 2020-12-24 | 2021-01-29 | 中国人民解放军国防科技大学 | Entity alignment method and device suitable for multi-mode knowledge graph |
Non-Patent Citations (3)
Title |
---|
Jiwei Wei等.Universal Weighting Metric Learning for Cross-Modal Matching.《2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)》.2020, * |
Ya Jing等.Cross-Modal Cross-Domain Moment Alignment Network for Person Search.《2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)》.2020, * |
A survey of multi-modal deep learning; Sun Yingying et al.; Computer Engineering and Applications; 30 September 2020; vol. 56, no. 21; pp. 1-11 *
Also Published As
Publication number | Publication date |
---|---|
CN113033438A (en) | 2021-06-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113033438B (en) | Data feature learning method for modal imperfect alignment | |
CN113591902B (en) | Cross-modal understanding and generating method and device based on multi-modal pre-training model | |
CN109101537B (en) | Multi-turn dialogue data classification method and device based on deep learning and electronic equipment | |
CN110209823B (en) | Multi-label text classification method and system | |
CN109165291B (en) | Text matching method and electronic equipment | |
CN114330475B (en) | Content matching method, apparatus, device, storage medium, and computer program product | |
CN109086265B (en) | Semantic training method and multi-semantic word disambiguation method in short text | |
CN108959305A (en) | A kind of event extraction method and system based on internet big data | |
CN116304984A (en) | Multi-modal intention recognition method and system based on contrast learning | |
CN113536784B (en) | Text processing method, device, computer equipment and storage medium | |
CN114239612A (en) | Multi-modal neural machine translation method, computer equipment and storage medium | |
CN108287848A (en) | Method and system for semanteme parsing | |
CN114818718A (en) | Contract text recognition method and device | |
CN114925702A (en) | Text similarity recognition method and device, electronic equipment and storage medium | |
Zhang et al. | Knowledge Distillation from Multi-Modality to Single-Modality for Person Verification}} | |
CN117611845B (en) | Multi-mode data association identification method, device, equipment and storage medium | |
CN113626553B (en) | Cascade binary Chinese entity relation extraction method based on pre-training model | |
CN115358817A (en) | Intelligent product recommendation method, device, equipment and medium based on social data | |
CN113963235A (en) | Cross-category image recognition model reusing method and system | |
CN116450781A (en) | Question and answer processing method and device | |
CN115618968B (en) | New idea discovery method and device, electronic device and storage medium | |
CN117113941B (en) | Punctuation mark recovery method and device, electronic equipment and storage medium | |
Jayavarthini et al. | Improved reranking approach for person re-identification system | |
CN111402012B (en) | E-commerce defective product identification method based on transfer learning | |
Raue et al. | Symbol grounding association in multimodal sequences with missing elements |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||