
CN113033438B - Data feature learning method for modal imperfect alignment - Google Patents

Data feature learning method for modal imperfect alignment

Info

Publication number
CN113033438B
Authority
CN
China
Prior art keywords
data
aligned
modal
modality
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110345293.7A
Other languages
Chinese (zh)
Other versions
CN113033438A (en)
Inventor
彭玺 (Xi Peng)
杨谋星 (Mouxing Yang)
林义杰 (Yijie Lin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University filed Critical Sichuan University
Priority to CN202110345293.7A priority Critical patent/CN113033438B/en
Publication of CN113033438A publication Critical patent/CN113033438A/en
Application granted granted Critical
Publication of CN113033438B publication Critical patent/CN113033438B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a data feature learning method for modally non-fully aligned data. The method defines a multi-modal data set, processes unaligned data using the information contained in the aligned portion, achieves modality realignment through contrastive learning, and simultaneously learns the features of the realigned data. To this end, the proposed contrastive learning loss function is used to guide neural network training. After training of the neural network model is completed, the unaligned multi-modal data are input into the model; by selecting the closest sample as the realigned sample, category-level realignment of the non-fully aligned multi-modal data is completed while the features of the aligned data are learned at the same time. The invention achieves clear gains on the performance indexes commonly used for clustering and classification tasks, greatly reduces time and memory consumption, and benefits subsequent tasks such as clustering, classification and recognition, or data retrieval.

Description

Data feature learning method for modal imperfect alignment
Technical Field
The invention relates to the field of feature learning, in particular to a data feature learning method for modal imperfect alignment.
Background
At present, multi-modal feature learning technology is widely applied in various fields. In retrieval applications, a picture corresponding to a text description can be retrieved by inputting a piece of text; the core of such applications is cross-modal feature learning. In social network analysis, each person can be regarded as an instance, and the text and accompanying pictures in a person's social application (e.g., WeChat Moments) can be regarded as samples of two modalities; by performing multi-modal feature learning on the text and picture modalities of different people, people with similar interests can be grouped together, enabling further applications such as behavior analysis and personalized recommendation. In semantic navigation, a piece of speech is input to a robot, which analyzes the given description and performs feature learning in combination with visual perception to complete the tasks described. The success of these multi-modal feature learning techniques rests primarily on the existence of high-quality multi-modal data satisfying two assumptions. The first is the data completeness assumption: all samples must exist in all modalities, with no missing data. The second is the modality alignment assumption: there is a correct correspondence between samples of different modalities. In other words, with current technology, to perform feature learning on multi-modal data, the data must be screened and aligned in advance to guarantee sample completeness and alignment. However, in practical scenarios, collecting complete, fully aligned multi-modal data is very difficult due to complexity and incompatibility in time and space. For example, to evaluate the teaching quality of an online course (MOOC), video frames and audio frames need to be input to a multi-modal learning system for joint evaluation, but the video frames and audio frames are not always in a one-to-one alignment (correspondence) relationship, which may significantly degrade the performance of many multi-modal methods.
Although a few related methods for multi-modal data alignment exist at present, they all attempt to recover the alignment relationship between different modalities of the same sample (instance), i.e., alignment at the instance level; the required computation and storage are extremely expensive, and the effect is often poor. For example, when running on an Nvidia 2080Ti GPU, existing methods such as PVC cannot process large-scale data (e.g., the NoisyMNIST dataset with two modalities of 60000 samples each). In addition, even for smaller-scale data, PVC usually takes several hours to perform modality alignment and occupies a large amount of memory, and the data representation obtained after alignment often performs poorly on subsequent tasks such as classification and clustering. Furthermore, when the unaligned modalities also suffer from missing data (for example, some people post only text in their Moments without pictures, so the picture modality is missing for them), instance-level alignment cannot be performed at all. Therefore, in contrast to instance-level alignment, our research and design focuses on performing category-level alignment (i.e., aligning samples of the same class across modalities) and feature learning of the data at the same time. Practice proves that the method can process data of different scales extremely fast with very little storage overhead, and achieves better results in subsequent tasks such as classification and clustering. Moreover, when the unaligned modalities also contain missing data, the method can still handle them. Compared with instance-level alignment methods, the method therefore has a higher application prospect and practical value.
Disclosure of Invention
In order to solve the above problems, the present invention provides a data feature learning method oriented to modality incomplete alignment, which is implemented by the following technical scheme:
a data feature learning method facing modal imperfect alignment comprises the following steps:
s1, defining a video image modal data set and a sound modal data set with incompletely aligned modalities, and selecting any modality as an alignment reference modality, and selecting the other modality as a to-be-aligned modality;
s2, taking the aligned data in the video image modality data set and the sound modality data set as a positive sample pair, and constructing a negative sample pair by taking the selected sample of the aligned reference modality as a reference;
s3, respectively inputting the constructed positive and negative samples into two neural networks with different structures, and calculating the common representation of the positive and negative samples;
s4, calculating a loss function by using the obtained public expression, and training two neural networks by using the calculated loss function;
and S5, inputting the sample data of the unaligned part in the video image mode data set and the sound mode data set into the trained neural network, and correcting the alignment relation of the sample data of the unaligned part to realign the sample data.
The scheme has the advantages that the common problem of modal non-alignment in multi-modal data can be solved, the alignment relation of the multi-modal data is recovered, and the performance of a multi-modal model is guaranteed.
Further, the modality non-fully aligned video image modality data set and sound modality data set defined in S1 are respectively expressed as:
{X^(1)} = {A^(1), U^(1)};
{X^(2)} = {A^(2), U^(2)};
where {X^(1)} is the modality non-fully aligned video image modality data set; A^(1) = {a_1^(1), a_2^(1), …, a_j^(1)} is the data set of the aligned portion of the video image modality, a_i^(1) denotes a data sample in the aligned portion {A^(1)}, and j is the number of aligned samples; U^(1) = {u_1^(1), u_2^(1), …, u_k^(1)} is the data set of the unaligned portion of the video image modality, u_i^(1) denotes a data sample in the unaligned portion {U^(1)}, and k is its number of samples. {X^(2)} is the modality non-fully aligned sound modality data set; A^(2) = {a_1^(2), a_2^(2), …, a_n^(2)} is the data set of the aligned portion of the sound modality, a_i^(2) denotes a data sample in the aligned portion {A^(2)}, and n is the number of aligned samples; U^(2) = {u_1^(2), u_2^(2), …, u_m^(2)} is the data set of the unaligned portion of the sound modality, u_i^(2) denotes a data sample in the unaligned portion {U^(2)}, and m is its number of samples.
The beneficial effect of the above scheme is that the problem of modality imperfect alignment is defined.
Further, the specific method for constructing the negative sample pair in S2 is as follows:
s21, taking the data of the aligned part in the video image mode data set and the sound mode data set as a positive sample pair;
and S22, taking the data of each alignment part in the alignment reference mode as an anchor point, and randomly sampling a plurality of data samples in the data set of the alignment part of the other mode to form a negative sample pair with each anchor point.
The beneficial effect of the above further scheme is that a negative sample for comparison learning can be constructed without the help of additional label information.
Further, the common representations of the positive and negative sample pairs in step S3 are respectively:
Z^(1) = f^(1)(A^(1));
Z^(2) = f^(2)(A^(2));
where f^(1) is the neural network constructed for the video image modality and f^(2) is the neural network constructed for the sound modality; A^(1) is the set of samples in the positive and negative sample pairs belonging to the video image modality, and Z^(1) is the common representation of A^(1) learned through the neural network f^(1); A^(2) is the set of samples in the positive and negative sample pairs belonging to the sound modality, and Z^(2) is the common representation of A^(2) learned through the neural network f^(2).
The further scheme has the advantages that the data characteristics of the constructed positive and negative samples are extracted, and data redundancy is reduced.
Further, the loss function in S4 is specifically expressed as:
L = (1/N) · Σ_{l=1}^{N} [ P_l · L_pos + (1 − P_l) · L_neg ]
where P is a variable taking values 0 and 1, 0 denoting a constructed negative sample pair and 1 a constructed positive sample pair; N is the total number of training sample pairs, and l is the index of a training sample pair;
L_pos = d(f^(1)(a_{j1}^(1)), f^(2)(a_{n1}^(2)))^2, with j1 = n1,
is the loss function for a positive sample pair, where d(f^(1)(a_{j1}^(1)), f^(2)(a_{n1}^(2))) is the distance, in the latent space learned by the neural networks, of the cross-modal positive sample pair from the two modalities, a_{j1}^(1) and a_{n1}^(2) are the members of a cross-modal positive sample pair in the two modalities, and j1 = n1 indicates that cross-modal positive sample pairs share the same index value;
L_neg = max(ρ − d(f^(1)(a_{j1}^(1)), f^(2)(a_{n1}^(2))), 0)^2, with j1 ≠ n1,
is the loss function for a negative sample pair, where ρ is an adaptive boundary value and (a_{j1}^(1), a_{n1}^(2)) with j1 ≠ n1 denotes a cross-modal negative sample pair randomly sampled from the aligned portions of the two modalities, with different index values.
The advantage of this further scheme is that the neural networks are guided to learn a latent space into which the multi-modal data are projected; in this space the features of the data are obtained, features of the same class from different modalities are drawn close to each other, and features of different classes from different modalities are pushed apart, so the features of the data can be learned at the same time as the modality alignment relationship is processed.
Further, the distance, in the latent space learned by the neural networks, of a cross-modal positive sample pair from the two modalities is calculated as:
d(f^(1)(a_{j1}^(1)), f^(2)(a_{n1}^(2))) = ||f^(1)(a_{j1}^(1)) − f^(2)(a_{n1}^(2))||_2
where f^(1) and f^(2) are the neural networks into which a_{j1}^(1) and a_{n1}^(2) are respectively input.
The advantage of this further scheme is that the distance used by the loss function for positive sample pairs is defined.
Further, the S5 specifically includes:
S51, sequentially sending the sample data pairs of the cross-modal unaligned portion in the video image modality and the sound modality into the trained neural network models f^(1) and f^(2), and calculating the feature representation set of the unaligned portion sample data pairs, wherein the input cross-modal unaligned sample data pairs have the same index each time;
S52, calculating a distance matrix of the unaligned portion data according to the feature representation set obtained in step S51;
and S53, for each feature in the feature representation set of the alignment reference modality, searching the feature representation set of the to-be-aligned modality for the sample feature that minimizes the distance to it, and outputting the feature with the minimum distance, together with the feature in the alignment reference modality, as a realigned cross-modal feature pair.
The beneficial effect of the further scheme is that the alignment relation of the modal non-aligned sample is corrected to realign the modal non-aligned sample.
Further, the feature representation sets of the unaligned portion data in the two modalities are expressed as:
Z_U^(1) = f^(1)(U^(1));
Z_U^(2) = f^(2)(U^(2));
where Z_U^(1) is the feature set of the unaligned portion data in the alignment reference modality, and f^(1) is the neural network into which the unaligned portion data of the alignment reference modality are input; Z_U^(2) is the feature set of the unaligned portion data in the to-be-aligned modality, and f^(2) is the neural network into which the unaligned portion data of the to-be-aligned modality are input.
The beneficial effect of the further scheme is that the characteristics of modal non-aligned data are extracted, and redundancy removal of the data is realized.
Further, the formula for calculating the distance matrix of the unaligned portion data in step S52 is:
D_{km} = ||f^(1)(u_k^(1)) − f^(2)(u_m^(2))||_2
where D is the distance matrix of the unaligned portion data.
The further scheme has the advantages that the distance matrix of the characteristics of the unaligned part of the data is calculated, and the alignment relation of the part of the data is convenient to correct.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention. In the drawings:
fig. 1 is a schematic flow chart of the data feature learning method for modal imperfect alignment.
Detailed Description
Hereinafter, the term "comprising" or "may include" used in various embodiments of the present invention indicates the presence of the disclosed function, operation, or element, and does not limit the addition of one or more further functions, operations, or elements. Furthermore, as used in various embodiments of the present invention, the terms "comprises," "comprising," "includes," "including," "has," "having" and their derivatives are intended only to indicate the presence of the specified features, numbers, steps, operations, elements, components, or combinations thereof, and should not be construed as excluding the existence or possible addition of one or more other features, numbers, steps, operations, elements, components, or combinations thereof.
In various embodiments of the invention, the expression "A or B" or "at least one of A or/and B" includes any and all combinations of the words listed together. For example, the expression "A or B" or "at least one of A or/and B" may include A, may include B, or may include both A and B.
Expressions (such as "first", "second", and the like) used in various embodiments of the present invention may modify various constituent elements in various embodiments, but may not limit the respective constituent elements. For example, the above description does not limit the order and/or importance of the elements described. The foregoing description is for the purpose of distinguishing one element from another. For example, the first user device and the second user device indicate different user devices, although both are user devices. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of various embodiments of the present invention.
It should be noted that: if it is described that one constituent element is "connected" to another constituent element, the first constituent element may be directly connected to the second constituent element, and a third constituent element may be "connected" between the first constituent element and the second constituent element. In contrast, when one constituent element is "directly connected" to another constituent element, it is understood that there is no third constituent element between the first constituent element and the second constituent element.
The terminology used in the various embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the various embodiments of the invention. As used herein, the singular forms are intended to include the plural forms as well, unless the context clearly indicates otherwise. Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which various embodiments of the present invention belong. The terms (such as those defined in commonly used dictionaries) should be interpreted as having a meaning that is consistent with their contextual meaning in the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein in various embodiments of the present invention.
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to examples and the accompanying drawings, and the exemplary embodiments and descriptions thereof are only used for explaining the present invention and are not used as limiting the present invention.
Example 1
A method for learning data features oriented to modal imperfect alignment, as shown in fig. 1, comprising the steps of:
s1, defining a video image modality data set and a sound modality data set with non-completely aligned modalities, and selecting any modality as an alignment reference modality, and another modality as a modality to be aligned, specifically, the video image modality data set and the sound modality data set with non-completely aligned modalities defined are respectively expressed as:
{X(1)}={A(1),U(1)};
{X(2)}={A(2),U(2)};
wherein, { X(1)Is a modal incompletely aligned video image modality data set,
Figure BDA0003000614370000061
a data set representing an aligned portion of a video image modality,
Figure BDA0003000614370000062
represents { A(1)Aligning data samples in the partial data set, wherein j is the number of aligned partial samples;
Figure BDA0003000614370000063
a data set representing a misaligned portion of a video image modality,
Figure BDA0003000614370000064
represents { U(1)The data samples in the unaligned partial data set in (j), k being its number of samples; { X(2)Is a modal incompletely aligned sound modality data set,
Figure BDA0003000614370000065
a data set representing aligned portions of the acoustic modality,
Figure BDA0003000614370000066
represents { A(2)Aligning data samples in the partial data set, wherein n is the number of aligned partial samples;
Figure BDA0003000614370000067
a data set representing a misaligned portion of a sound modality,
Figure BDA0003000614370000068
represents { U(2)The data samples in the partial data set are not aligned in (m) is its number of samples.
In the present embodiment, for convenience of illustration and without loss of generality, the scheme is described in detail by taking two modalities, i.e., a video image modality and a sound modality, as an example; when the number of modalities is greater than two, the situation is similar and can be handled analogously. Meanwhile, the video image modality serves as the alignment reference, and the sound modality is aligned to the reference modality. The scheme defines a multi-modal dataset as satisfying class-level alignment if and only if samples of the two modalities u_k^(1) and u_m^(2) belong to the same class, i.e.
C(u_k^(1)) = C(u_m^(2)),
where C(x) denotes the class of sample x. Such class-level alignment can be achieved through a discriminant problem, namely: from {X^(1)}, find a sample u^(1) such that it and the k-th sample u_k^(2) in {X^(2)} satisfy the above objective function.
S2, taking the aligned data in the video image modality data set and the sound modality data set as a positive sample pair, and constructing a negative sample pair by taking the selected sample of the aligned reference modality as a reference;
specifically, the method comprises the following steps:
s21, taking the data of the aligned part in the video image mode data set and the sound mode data set as a positive sample pair;
and S22, taking the data of each alignment part in the alignment reference mode as an anchor point, and randomly sampling a plurality of data samples in the data set of the alignment part of the other mode to form a negative sample pair with each anchor point.
According to this scheme, category-level contrastive learning takes cross-modal sample pairs of the same category as positive pairs and cross-modal sample pairs of different categories as negative pairs, then increases the similarity between positive pairs while decreasing that of negative pairs. In the implementation, the data of the aligned portion are taken as positive sample pairs. Negative sample pairs cannot be constructed directly, because the scheme allows for the more general unsupervised case, i.e., no manually annotated class labels are available. Therefore, to obtain negative sample pairs, for each sample in the alignment reference modality, samples of the other modalities are randomly drawn from the remaining modalities and paired with it as negative pairs. Negative pairs formed this way will contain noisy data, i.e., pairs that should have been positive are mistakenly treated as negative. To prevent these incorrectly labeled pairs from affecting the scheme, the contrastive learning needs to be robust to such noisy data.
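For concreteness, below is a minimal NumPy sketch of this pairing step. It assumes the aligned portions arrive as two equally indexed arrays; every name in it (build_pairs, neg_per_anchor, and so on) is a hypothetical illustration rather than the patent's own notation:

```python
import numpy as np

def build_pairs(a1, a2, neg_per_anchor=1, seed=0):
    """Build positive and (noisy) negative cross-modal pairs.

    a1, a2: aligned portions of the two modalities; row i of a1
    corresponds to row i of a2. Returns (x1, x2, p) where p[l] = 1
    marks a positive pair and p[l] = 0 a constructed negative pair.
    """
    rng = np.random.default_rng(seed)
    n = len(a1)
    x1, x2 = [a1], [a2]              # positive pairs: the aligned data as-is
    p = [np.ones(n)]
    for _ in range(neg_per_anchor):  # each reference sample acts as an anchor
        idx = rng.integers(0, n, size=n)  # random partners from the other
        x1.append(a1)                     # modality's aligned portion; any
        x2.append(a2[idx])                # same-class collision is exactly the
        p.append(np.zeros(n))             # "noisy negative" the loss tolerates
    return np.concatenate(x1), np.concatenate(x2), np.concatenate(p)
```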
S3, respectively inputting the constructed positive and negative sample pairs into two neural networks with different structures, and calculating the common representations of the positive and negative sample pairs.
For ease of illustration and without loss of generality, the scheme is described using two modalities as an example; when there are more than two modalities, the situation is similar and can be handled analogously. For the video image modality and the sound modality, the two corresponding neural networks are f^(1) and f^(2) respectively; the corresponding training data are the data of the aligned portions, A^(1) and A^(2), and the test data are the portions with unaligned relations, U^(1) and U^(2). The neural networks f^(1) and f^(2) use different configurations.
In this scheme, the common representations of the constructed positive and negative sample pairs are obtained through the neural networks respectively as:
Z^(1) = f^(1)(A^(1));
Z^(2) = f^(2)(A^(2));
where f^(1) is the neural network constructed for the video image modality and f^(2) the neural network constructed for the sound modality; A^(1) is the set of samples in the positive and negative sample pairs belonging to the video image modality, and Z^(1) is the common representation of A^(1) learned through the neural network f^(1); A^(2) is the set of samples belonging to the sound modality, and Z^(2) is the common representation of A^(2) learned through the neural network f^(2).
S4, calculating the loss function using the obtained common representations, and training the two neural networks with the calculated loss function.
To achieve the above modality alignment goal through contrastive learning, the scheme uses the two neural networks {f^(1), f^(2)} to learn a common representation of the two modalities, with the loss function defined as follows:
L = (1/N) · Σ_{l=1}^{N} [ P_l · L_pos + (1 − P_l) · L_neg ]
where P is a variable taking values 0 and 1, 0 denoting a constructed negative sample pair and 1 a constructed positive sample pair; N is the total number of training sample pairs, and l is the index of a training sample pair. The term
L_pos = d(f^(1)(a_{j1}^(1)), f^(2)(a_{n1}^(2)))^2, with j1 = n1,
is the loss for a positive pair, where d(·,·) is the distance, in the latent space learned by the neural networks, of the cross-modal positive sample pair from the two modalities, and j1 = n1 indicates that cross-modal positive pairs share the same index value. The term
L_neg = max(ρ − d(f^(1)(a_{j1}^(1)), f^(2)(a_{n1}^(2))), 0)^2, with j1 ≠ n1,
is the loss for a negative pair, where ρ is an adaptive boundary value and (a_{j1}^(1), a_{n1}^(2)) with j1 ≠ n1 denotes a cross-modal negative pair randomly sampled from the aligned portions of the two modalities, with different index values.
Here d(·,·) denotes the distance of two samples in the latent space of the neural networks, i.e.
d(f^(1)(a^(1)), f^(2)(a^(2))) = ||f^(1)(a^(1)) − f^(2)(a^(2))||_2.
Because ρ is an adaptive boundary value, the loss term L_neg can reduce or even eliminate the negative influence of the noisy data in the negative pairs on the training of the neural networks, making the networks robust to noisy data. The designed loss function L is mainly used to guide the neural networks to learn a common subspace of the multi-modal data, in which samples of the same class but from different modalities are close to each other, while samples of different classes and different modalities are far apart.
The designed noise-robust contrastive loss function guides the training of the neural networks: in each iteration, the gradient of the accumulated loss is computed and back-propagated to update the network parameters; training stops when the iterative process converges, yielding the trained neural networks.
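As an illustration, here is a minimal PyTorch sketch of this training step under the margin-based reading of the loss given above; the encoder widths, learning rate, and margin value are assumptions made for the sketch, not values specified by the scheme:

```python
import torch
import torch.nn as nn

def contrastive_loss(z1, z2, p, rho):
    """Noise-robust contrastive loss: p[l] = 1 for positive pairs, 0 for negative."""
    d = torch.norm(z1 - z2, dim=1)                      # latent-space distance
    pos = p * d.pow(2)                                  # pull positives together
    neg = (1 - p) * torch.clamp(rho - d, min=0).pow(2)  # push negatives past margin
    return (pos + neg).mean()

# two differently configured encoders, one per modality (sizes are assumptions)
f1 = nn.Sequential(nn.Linear(784, 1024), nn.ReLU(), nn.Linear(1024, 128))
f2 = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 128))
opt = torch.optim.Adam(list(f1.parameters()) + list(f2.parameters()), lr=1e-3)

def train_step(x1, x2, p, rho=1.0):
    # x1, x2: float tensors of paired samples; p: 1/0 pair labels (float tensor)
    z1, z2 = f1(x1), f2(x2)          # common representations of the pair
    loss = contrastive_loss(z1, z2, p, rho)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```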
And S5, inputting the sample data of the unaligned part in the video image mode data set and the sound mode data set into the trained neural network, and correcting the alignment relation of the sample data of the unaligned part to realign the sample data.
After training of the neural network model is completed, the unaligned multi-modal data are input into the model; by selecting the closest sample as the realigned sample, category-level alignment of the multi-modal data can be completed while the aligned features are learned at the same time. Notably, the loss function designed in this scheme is, to our knowledge, the first contrastive learning loss that is robust against noisy data. Based on the above procedure, the model can be sufficiently trained on the aligned data of the multiple modalities and implicitly learn a common representation of each modality from the alignment information, enabling it to efficiently process unaligned multi-modal data.
The specific method is as follows:
S51, sequentially feeding the cross-modal unaligned sample data pairs of the video image modality and the sound modality into the trained neural network models f^(1) and f^(2), and calculating the feature representation sets of the unaligned sample data pairs, where the input cross-modal unaligned sample data pairs have the same index each time.
In this embodiment, the corresponding samples u_k^(1) of U^(1) and u_m^(2) of U^(2) are sequentially fed into the trained networks f^(1) and f^(2) to obtain the feature representations z_k^(1) = f^(1)(u_k^(1)) and z_m^(2) = f^(2)(u_m^(2)), where each input cross-modal unaligned sample data pair has the same index, i.e., k = m.
S52, calculating the distance matrix of the unaligned portion data from the feature representation sets obtained in step S51.
Specifically, according to the formula D_{km} = ||z_k^(1) − z_m^(2)||_2, the distances between the unaligned data features are calculated to form a matrix; denoting this matrix Q, Q_{11} is the distance between the first sample in the alignment reference modality and the first sample in the to-be-aligned modality, Q_{12} is the distance between the first sample in the alignment reference modality and the second sample in the to-be-aligned modality, and so on.
S53, for each feature in the feature representation set of the alignment reference modality, searching the feature representation set of the to-be-aligned modality for the sample feature that minimizes the distance to it, and outputting the feature with the minimum distance, together with the feature in the alignment reference modality, as a realigned cross-modal feature pair.
That is, for each feature representation z_k^(1) in Z_U^(1), the feature representation z_m^(2) is searched for in Z_U^(2) so as to minimize the distance between the two representations, i.e., to minimize D_{km}; once found, the data are aligned.
Experimental verification
To verify the superiority of the technical scheme, it is first compared with 10 other multi-modal clustering techniques, namely Canonical Correlation Analysis (CCA), Kernel Canonical Correlation Analysis (KCCA), Deep Canonical Correlation Analysis (DCCA), the Deep Canonical Correlation Autoencoder (DCCAE), matrix-decomposition-based multi-modal clustering (MvC-DMF), latent multi-modal subspace clustering (LMSC), self-weighted multi-modal clustering (SwMC), binary multi-modal clustering (BMVC), the autoencoder-in-autoencoder network (AE2-Nets), and partially view-aligned clustering (PVC). Specifically, experimental comparisons are performed on the object picture dataset Caltech-101 and the Reuters news dataset. Because the comparison algorithms cannot process partially aligned data, PCA is first used for dimensionality reduction, the Hungarian algorithm is then used to obtain an alignment order, the data are realigned accordingly, and the respective method is then used for clustering. For a comprehensive comparison, three indexes commonly used to measure clustering quality, namely Normalized Mutual Information (NMI), Accuracy (ACC), and the Adjusted Rand Index (ARI), are used as the quantitative indexes of the experiments. All three indexes range from 0 to 1; larger values indicate better results, and a value of 1 means the algorithm clusters the data perfectly. NMI is computed as:
NMI(Y; C) = 2 · I(Y; C) / (H(Y) + H(C))
where Y is the category information predicted by the method, C is the actual category information of the data, H(·) denotes information entropy, and I(Y; C) denotes mutual information. ARI is computed as:
ARI = (RI − E(RI)) / (max(RI) − E(RI))
where RI is the Rand Index and E(·) is the expectation.
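These indexes can be computed, for example, with scikit-learn; a small sketch on toy label arrays (the arrays are invented purely for illustration):

```python
from sklearn.metrics import (adjusted_rand_score,
                             normalized_mutual_info_score)

y_true = [0, 0, 1, 1, 2, 2]   # actual category information C (toy example)
y_pred = [0, 0, 1, 2, 2, 2]   # categories Y predicted by a clustering method

nmi = normalized_mutual_info_score(y_true, y_pred)  # 2*I(Y;C)/(H(Y)+H(C))
ari = adjusted_rand_score(y_true, y_pred)           # (RI-E(RI))/(max(RI)-E(RI))
# Clustering ACC additionally requires the best cluster-to-label permutation
# (e.g., via the Hungarian algorithm) before accuracy is computed.
print(f"NMI = {nmi:.3f}, ARI = {ari:.3f}")
```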
Experiment one:
The performance of the scheme is evaluated on the Reuters dataset. Reuters is a text dataset consisting of 6 categories and containing text in 5 languages, namely English text and its corresponding translations in French, German, Italian, and Spanish, as shown in Table 1.
TABLE 1
Modality            English   French   German   Italian   Spanish
Number of samples   18758     26648    29953    24039     12342
Each language is treated as one modality. Corresponding samples in the English modality and the French modality are used to construct non-fully aligned multi-modal data for evaluating the scheme. The comparative results are shown in Table 2:
TABLE 2
[Comparative clustering results (ACC, NMI, ARI) on Reuters; the table is rendered as an image in the source and its values are not recoverable.]
As can be seen from the table, compared with other clustering methods, the method provided by this scheme achieves large improvements in the accuracy, normalized mutual information, and adjusted Rand indexes, which means that in practical applications, even without given data label information, the scheme can correctly cluster unaligned language text data, avoiding the dependence on data labels (which require a large amount of human resources). In addition, the scheme needs only 3.48 seconds to realign the non-fully aligned data and learn the data features, while the PVC method needs 30.36 seconds and the Hungarian-algorithm-based methods need 289.82 seconds. The present method therefore greatly improves performance while sharply reducing time overhead.
Experiment two:
The Caltech-101 dataset, containing 9144 pictures from 101 object classes, is used, with six kinds of extracted features serving as 6 modalities (Gabor, WM, CENTRIST, HOG, GIST, and LBP). Because the number of categories is large, only the 20 categories with the most data are listed here; the corresponding category information and sample-count distribution are shown in Table 3:
TABLE 3
Face       Leopard   Motorbike    Telescope   Brain     Camera      Car     Dollar bill     Ferry    Garfield
435        200       798          33          98        50          123     52              67       34
Hedgehog   Tower     Rhinoceros   Snoopy      Stapler   Stop sign   Lotus   Windsor chair   Wrench   Taiji (yin-yang)
54         47        59           35          45        64          37      56              39       60
We evaluate the present solution using corresponding samples in the modality of the HOG feature and the modality of the GIST feature to construct non-fully aligned multi-modal data. The comparative results are shown in table 4:
TABLE 4
[Comparative clustering results (ACC, NMI, ARI) on Caltech-101; the table is rendered as an image in the source and its values are not recoverable.]
Compared with other clustering methods, the method provided by this scheme achieves large improvements in the accuracy, normalized mutual information, and adjusted Rand indexes, which means that in practical applications, even without given data label information, the scheme can correctly cluster unaligned multi-modal data, avoiding the dependence on data labels (which require a large amount of human resources). In addition, the scheme needs only 1.75 seconds to realign the non-fully aligned data and learn the data features, while the PVC method needs 7.2 seconds and the Hungarian-algorithm-based methods need 48.87 seconds. The present method therefore greatly improves performance while sharply reducing time overhead.
To further verify the superiority of the technical scheme, it is also compared on classification tasks with 9 other multi-modal feature learning techniques, namely Canonical Correlation Analysis (CCA), Kernel Canonical Correlation Analysis (KCCA), Deep Canonical Correlation Analysis (DCCA), the Deep Canonical Correlation Autoencoder (DCCAE), matrix-decomposition-based multi-modal clustering (MvC-DMF), latent multi-modal subspace clustering (LMSC), binary multi-modal clustering (BMVC), the autoencoder-in-autoencoder network (AE2-Nets), and partially view-aligned clustering (PVC). Note that self-weighted multi-modal clustering (SwMC) directly learns the data clustering result, so classification experiments cannot be performed with it. Specifically, the experimental comparison is carried out on the object picture dataset Caltech-101 and the Reuters news dataset. Because the comparison algorithms cannot process partially aligned data, PCA is used for dimensionality reduction, the Hungarian algorithm is used to obtain an alignment order, the data are realigned accordingly, the respective method is used for feature learning, and a standard SVM classifier is then trained for classification. Classification Accuracy (ACC) is used as the quantitative index of the experiments to verify the effectiveness of the method, as sketched below.
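A minimal sketch of this evaluation protocol; the feature matrix and labels below are random stand-ins, since the point is only the train/evaluate flow:

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# z: features learned by a multi-modal method, y: ground-truth labels
# (random stand-ins here, just so the sketch runs end to end)
z = np.random.default_rng(0).normal(size=(200, 128))
y = np.random.default_rng(1).integers(0, 6, size=200)

z_tr, z_te, y_tr, y_te = train_test_split(z, y, test_size=0.2, random_state=0)
clf = SVC()                                    # standard SVM classifier
clf.fit(z_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(z_te))  # classification ACC
print(f"ACC = {acc:.3f}")
```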
Experiment three
As in experiment one, the Reuters dataset is used to evaluate the performance of the proposed technical scheme. The experimental results are shown in Table 5:
TABLE 5
[Classification accuracy (ACC) comparison on Reuters; the table is rendered as an image in the source and its values are not recoverable.]
As can be seen from the table, compared with other methods, the method provided by this scheme achieves a large improvement in the classification accuracy index, which means that in practical applications the scheme can correctly classify unaligned language text data, avoiding the large amount of human resources otherwise needed to correct the alignment relations of multi-modal data.
Experiment four
As in experiment two, the Caltech-101 dataset is used to evaluate the performance of the technical scheme. The experimental results are shown in Table 6:
TABLE 6
[Classification accuracy (ACC) comparison on Caltech-101; the table is rendered as an image in the source and its values are not recoverable.]
As can be seen from the table, compared with other methods, the method provided by this scheme achieves a large improvement in the classification accuracy index, which means that in practical applications the scheme can correctly classify unaligned multi-modal data, avoiding the large amount of human resources otherwise needed to correct the alignment relations of multi-modal data.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (9)

1. A data feature learning method facing modal imperfect alignment is characterized by comprising the following steps:
s1, defining a video image modal data set and a sound modal data set with incompletely aligned modalities, and selecting any modality as an alignment reference modality, and selecting the other modality as a to-be-aligned modality;
s2, taking the aligned data in the video image modality data set and the sound modality data set as a positive sample pair, and constructing a negative sample pair by taking the selected sample of the aligned reference modality as a reference;
s3, inputting the constructed positive and negative sample pairs into two neural networks with different structures respectively, specifically, inputting data belonging to the video image modality data set in the positive and negative sample pairs into a first neural network, inputting data belonging to the sound modality data set in the positive and negative sample pairs into a second neural network, and calculating common representations for the video image modality data set and the sound modality data set respectively;
s4, calculating a loss function by using the obtained public expression, and training two neural networks by using the calculated loss function;
and S5, inputting the sample data of the unaligned part in the video image mode data set and the sound mode data set into a trained neural network, and correcting the alignment relation of the sample data of the unaligned part to realign the sample data.
2. The method according to claim 1, wherein the modality non-fully aligned video image modality data set and sound modality data set defined in S1 are respectively expressed as:
{X^(1)} = {A^(1), U^(1)};
{X^(2)} = {A^(2), U^(2)};
where {X^(1)} is the modality non-fully aligned video image modality data set; A^(1) = {a_1^(1), …, a_j^(1)} is the data set of the aligned portion of the video image modality, a_i^(1) denotes a data sample in the aligned portion {A^(1)}, and j is the number of aligned samples; U^(1) = {u_1^(1), …, u_k^(1)} is the data set of the unaligned portion of the video image modality, u_i^(1) denotes a data sample in the unaligned portion {U^(1)}, and k is its number of samples; {X^(2)} is the modality non-fully aligned sound modality data set; A^(2) = {a_1^(2), …, a_n^(2)} is the data set of the aligned portion of the sound modality, a_i^(2) denotes a data sample in the aligned portion {A^(2)}, and n is the number of aligned samples; U^(2) = {u_1^(2), …, u_m^(2)} is the data set of the unaligned portion of the sound modality, u_i^(2) denotes a data sample in the unaligned portion {U^(2)}, and m is its number of samples.
3. The method for learning modal-oriented imperfect alignment data features according to claim 2, wherein the specific method for constructing the negative sample pairs in S2 is:
s21, taking the data of the aligned part in the video image mode data set and the sound mode data set as a positive sample pair;
and S22, taking the data of each alignment part in the alignment reference mode as an anchor point, and randomly sampling a plurality of data samples in the data set of the alignment part in the to-be-aligned mode to form a negative sample pair with each anchor point.
4. The method according to claim 3, wherein the common representations of the positive and negative sample pairs in step S3 are:
Z^(1) = f^(1)(A^(1));
Z^(2) = f^(2)(A^(2));
where f^(1) is the neural network constructed for the video image modality and f^(2) the neural network constructed for the sound modality; A^(1) is the set of samples in the positive and negative sample pairs belonging to the video image modality, and Z^(1) is the common representation of A^(1) learned through the neural network f^(1); A^(2) is the set of samples in the positive and negative sample pairs belonging to the sound modality, and Z^(2) is the common representation of A^(2) learned through the neural network f^(2).
5. The method according to claim 4, wherein the loss function in S4 is specifically expressed as:
L = (1/N) · Σ_{l=1}^{N} [ P_l · L_pos + (1 − P_l) · L_neg ]
where P is a variable taking values 0 and 1, 0 denoting a constructed negative sample pair and 1 a constructed positive sample pair; N is the total number of training sample pairs, and l is the index of a training sample pair;
L_pos = d(f^(1)(a_{j1}^(1)), f^(2)(a_{n1}^(2)))^2, with j1 = n1,
is the loss function for a positive sample pair, where d(f^(1)(a_{j1}^(1)), f^(2)(a_{n1}^(2))) is the distance, in the latent space learned by the neural networks, of the cross-modal positive sample pair from the two modalities, and j1 = n1 indicates that cross-modal positive sample pairs have the same index value;
L_neg = max(ρ − d(f^(1)(a_{j1}^(1)), f^(2)(a_{n1}^(2))), 0)^2, with j1 ≠ n1,
is the loss function for a negative sample pair, where ρ is an adaptive boundary value and (a_{j1}^(1), a_{n1}^(2)) with j1 ≠ n1 denotes a cross-modal negative sample pair randomly sampled from the aligned portions of the two modalities, with different index values.
6. The method according to claim 5, wherein the distance, in the latent space learned by the neural networks, of a cross-modal positive sample pair from the two modalities is calculated as:
d(f^(1)(a_{j1}^(1)), f^(2)(a_{n1}^(2))) = ||f^(1)(a_{j1}^(1)) − f^(2)(a_{n1}^(2))||_2
where f^(1) and f^(2) are the neural networks into which a_{j1}^(1) and a_{n1}^(2) are respectively input.
7. The method according to claim 6, wherein the S5 is specifically:
s51, sequentially sending the sample data pairs of the cross-modal unaligned portion in the video image modality and the sound modality into the trained neural network models f^(1) and f^(2), and calculating the feature representation set of the unaligned portion sample data pairs, wherein the input cross-modal unaligned sample data pairs have the same index each time;
s52, calculating a distance matrix of the unaligned part data according to the feature representation set obtained in the step S51;
and S53, for each feature in the feature representation set of the alignment reference mode, searching for the feature of the sample in the feature representation set of the to-be-aligned mode, minimizing the distance between the searched feature and the feature in the alignment reference mode, and outputting the feature with the minimum distance and the feature in the alignment reference mode as a realigned cross-mode feature pair.
8. The method according to claim 7, wherein the feature representation sets of the unaligned portion data in step S51 are expressed as:
Z_U^(1) = f^(1)(U^(1));
Z_U^(2) = f^(2)(U^(2));
where Z_U^(1) is the feature set of the unaligned portion data in the alignment reference modality, and f^(1) is the neural network into which the unaligned portion data of the alignment reference modality are input; Z_U^(2) is the feature set of the unaligned portion data in the to-be-aligned modality, and f^(2) is the neural network into which the unaligned portion data of the to-be-aligned modality are input.
9. The method according to claim 8, wherein the formula for calculating the unaligned portion data distance matrix in step S52 is:
D_{km} = ||f^(1)(u_k^(1)) − f^(2)(u_m^(2))||_2
where D is the unaligned portion data distance matrix.
CN202110345293.7A 2021-03-31 2021-03-31 Data feature learning method for modal imperfect alignment Active CN113033438B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110345293.7A CN113033438B (en) 2021-03-31 2021-03-31 Data feature learning method for modal imperfect alignment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110345293.7A CN113033438B (en) 2021-03-31 2021-03-31 Data feature learning method for modal imperfect alignment

Publications (2)

Publication Number Publication Date
CN113033438A (en) 2021-06-25
CN113033438B (en) 2022-07-01

Family

ID=76453142

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110345293.7A Active CN113033438B (en) 2021-03-31 2021-03-31 Data feature learning method for modal imperfect alignment

Country Status (1)

Country Link
CN (1) CN113033438B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113486833B (en) * 2021-07-15 2022-10-04 北京达佳互联信息技术有限公司 Multi-modal feature extraction model training method and device and electronic equipment
CN114067233B (en) * 2021-09-26 2023-05-23 四川大学 Cross-mode matching method and system
CN114139641B (en) * 2021-12-02 2024-02-06 中国人民解放军国防科技大学 Multi-modal characterization learning method and system based on local structure transfer
CN117252274B (en) * 2023-11-17 2024-01-30 北京理工大学 Text audio image contrast learning method, device and storage medium
CN117494147B (en) * 2023-12-29 2024-03-22 戎行技术有限公司 Multi-platform virtual user data alignment method based on network space behavior data

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112001437A (en) * 2020-08-19 2020-11-27 四川大学 Modal non-complete alignment-oriented data clustering method
WO2021000664A1 (en) * 2019-07-03 2021-01-07 中国科学院自动化研究所 Method, system, and device for automatic calibration of differences in cross-modal target detection
CN112287126A (en) * 2020-12-24 2021-01-29 中国人民解放军国防科技大学 Entity alignment method and device suitable for multi-mode knowledge graph
CN112434654A (en) * 2020-12-07 2021-03-02 安徽大学 Cross-modal pedestrian re-identification method based on symmetric convolutional neural network
CN112584062A (en) * 2020-12-10 2021-03-30 上海哔哩哔哩科技有限公司 Background audio construction method and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111368123B (en) * 2020-02-17 2022-06-28 同济大学 Three-dimensional model sketch retrieval method based on cross-modal guide network
CN112001438B (en) * 2020-08-19 2023-01-10 四川大学 Multi-mode data clustering method for automatically selecting clustering number

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021000664A1 (en) * 2019-07-03 2021-01-07 中国科学院自动化研究所 Method, system, and device for automatic calibration of differences in cross-modal target detection
CN112001437A (en) * 2020-08-19 2020-11-27 四川大学 Modal non-complete alignment-oriented data clustering method
CN112434654A (en) * 2020-12-07 2021-03-02 安徽大学 Cross-modal pedestrian re-identification method based on symmetric convolutional neural network
CN112584062A (en) * 2020-12-10 2021-03-30 上海哔哩哔哩科技有限公司 Background audio construction method and device
CN112287126A (en) * 2020-12-24 2021-01-29 中国人民解放军国防科技大学 Entity alignment method and device suitable for multi-mode knowledge graph

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Jiwei Wei et al. Universal Weighting Metric Learning for Cross-Modal Matching. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2020. *
Ya Jing et al. Cross-Modal Cross-Domain Moment Alignment Network for Person Search. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2020. *
Sun Yingying et al. A Survey of Multimodal Deep Learning. Computer Engineering and Applications. 2020-09-30; Vol. 56, No. 21; pp. 1-11. *

Also Published As

Publication number Publication date
CN113033438A (en) 2021-06-25

Similar Documents

Publication Publication Date Title
CN113033438B (en) Data feature learning method for modal imperfect alignment
CN113591902B (en) Cross-modal understanding and generating method and device based on multi-modal pre-training model
CN109101537B (en) Multi-turn dialogue data classification method and device based on deep learning and electronic equipment
CN110209823B (en) Multi-label text classification method and system
CN109165291B (en) Text matching method and electronic equipment
CN114330475B (en) Content matching method, apparatus, device, storage medium, and computer program product
CN109086265B (en) Semantic training method and multi-semantic word disambiguation method in short text
CN108959305A (en) A kind of event extraction method and system based on internet big data
CN116304984A (en) Multi-modal intention recognition method and system based on contrast learning
CN113536784B (en) Text processing method, device, computer equipment and storage medium
CN114239612A (en) Multi-modal neural machine translation method, computer equipment and storage medium
CN108287848A (en) Method and system for semanteme parsing
CN114818718A (en) Contract text recognition method and device
CN114925702A (en) Text similarity recognition method and device, electronic equipment and storage medium
Zhang et al. Knowledge Distillation from Multi-Modality to Single-Modality for Person Verification
CN117611845B (en) Multi-mode data association identification method, device, equipment and storage medium
CN113626553B (en) Cascade binary Chinese entity relation extraction method based on pre-training model
CN115358817A (en) Intelligent product recommendation method, device, equipment and medium based on social data
CN113963235A (en) Cross-category image recognition model reusing method and system
CN116450781A (en) Question and answer processing method and device
CN115618968B (en) New idea discovery method and device, electronic device and storage medium
CN117113941B (en) Punctuation mark recovery method and device, electronic equipment and storage medium
Jayavarthini et al. Improved reranking approach for person re-identification system
CN111402012B (en) E-commerce defective product identification method based on transfer learning
Raue et al. Symbol grounding association in multimodal sequences with missing elements

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant