
CN112802445B - Cross-audiovisual information conversion method based on semantic reservation - Google Patents

Cross-audiovisual information conversion method based on semantic reservation

Info

Publication number
CN112802445B
CN112802445B (application CN202110140393.6A)
Authority
CN
China
Prior art keywords
image
cross
sound
network
modal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110140393.6A
Other languages
Chinese (zh)
Other versions
CN112802445A (en)
Inventor
袁媛
宁海龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202110140393.6A priority Critical patent/CN112802445B/en
Publication of CN112802445A publication Critical patent/CN112802445A/en
Application granted granted Critical
Publication of CN112802445B publication Critical patent/CN112802445B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/027 - Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a cross-audio-visual information conversion method based on semantic preservation. The method treats information conversion between the visual and auditory modalities as a similarity-learning problem over low-dimensional representations: semantic features are extracted from images, cross-modal conversion of the features is performed in the low-dimensional space, and the low-dimensional cross-modal features are finally mapped to speech waveforms based on human language. The method addresses the limitation of existing visual-to-auditory cross-modal conversion approaches, which cannot accurately generate human-language speech waveforms in an unconstrained environment; by generating such waveforms for unconstrained environments, the output better matches real-world conditions.

Description

Cross-audiovisual information conversion method based on semantic reservation
Technical Field
The invention belongs to the technical field of machine learning, and particularly relates to a cross-audiovisual information conversion method.
Background
Visual-to-auditory cross-modal information conversion helps visually impaired people better perceive the world around them, and is therefore of strong practical value for this group. However, because of the heterogeneous semantic gap between the audio and visual modalities and the complex data structures within each modality, effective cross-audio-visual information conversion is very difficult to achieve. Relatively little research has addressed visual-to-auditory cross-modal conversion, and the existing work generally follows the same pipeline: first extract semantic features from the visual data, then predict a spectrogram of the auditory data, and finally generate the sound waveform. These studies typically generate instrument sounds, impact (knocking) sounds, or ambient background sounds for a specific environment:
1) Cross-modal conversion based on instrument-sound generation: Chen et al. propose a method based on a conditional generative adversarial network in "L. Chen, S. Srivastava, Z. Duan, and C. Xu, Deep Cross-Modal Audio-Visual Generation, in Proceedings of the Thematic Workshops of ACM Multimedia, 2017, pp. 349-357." The method generates a sound spectrogram from an input instrument image, encodes the spectrogram back into an instrument image, discriminates the generated instrument image to optimize the generated spectrogram, and finally obtains the corresponding instrument sound waveform from the optimized spectrogram.
2) Cross-modal conversion based on impact-sound generation: Owens et al. propose a recurrent-neural-network method for generating the impact sounds of objects struck in a video in "A. Owens, P. Isola, J. McDermott, A. Torralba, E. Adelson, and W. Freeman, Visually Indicated Sounds, in the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2405-2413." The method uses a recurrent neural network to predict sound features from the video and then uses an example-based synthesis method to generate the corresponding sound waveform from those features.
3) Cross-modal conversion based on ambient background-sound generation: Zhou et al. propose a video background-sound generation method based on an encoder-decoder structure in "Y. Zhou, Z. Wang, C. Fang, T. Bui, and T. Berg, Visual to Sound: Generating Natural Sound for Videos in the Wild, in the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3550-3558." The method first encodes video features with frame-to-frame, sequence-to-sequence, and optical-flow-based schemes, and then decodes them with a SampleRNN to generate the sound waveform.
Each of the above methods has its own limitations; for example, cross-modal conversion based on impact-sound generation can only produce regular knocking sounds. It remains difficult to accurately generate human-language speech waveforms in unconstrained environments, which limits the practical usefulness of these methods.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides a cross-audio-visual information conversion method based on semantic preservation. The method treats information conversion between the visual and auditory modalities as a similarity-learning problem over low-dimensional representations: semantic features are extracted from images, cross-modal conversion of the features is performed in the low-dimensional space, and the low-dimensional cross-modal features are finally mapped to speech waveforms based on human language. The invention addresses the limitation of existing visual-to-auditory cross-modal conversion methods, which cannot accurately generate human-language speech waveforms in an unconstrained environment; by generating such waveforms for unconstrained environments, the output better matches real-world conditions.
The technical solution adopted by the invention to solve the above problem comprises the following steps:
Step 1: the image-audio description data set contains N image-audio pairs. Take this data set as the training set and normalize each image in the training set according to formula (1) to obtain the normalized image:
[formula (1): reproduced only as an image in the original publication]
where X_tr denotes an image in the training set, μ and θ are the mean and standard deviation of all images in the training set, respectively, and K is the number of pixels of a single training image;
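As an illustration of step 1, the following Python sketch performs a per-image standardization using the quantities defined above (μ, θ, K). Because formula (1) is reproduced only as an image in the original publication, the exact expression used here, (X_tr - μ) / max(θ, 1/√K), is an assumption consistent with the defined quantities, not necessarily the patented formula.

    import numpy as np

    def normalize_image(x_tr, mu, theta):
        # x_tr  : a training image as a NumPy array
        # mu    : mean of all training images
        # theta : standard deviation of all training images
        # Assumed form of formula (1): (X_tr - mu) / max(theta, 1/sqrt(K)),
        # where K is the number of pixels of the single image.
        k = x_tr.size
        adjusted_std = max(float(theta), 1.0 / np.sqrt(k))
        return (x_tr - mu) / adjusted_std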
Step 2: for the normalized images, learn the image semantic features Γ_i^v through the image coding network, where i is the image-audio pair index, i = 1, 2, ..., N;
The objective function is:
[formula (2): reproduced only as an image in the original publication]
where l_i and l̂_i are, respectively, the true semantic label of the input image and the semantic label predicted by the image coding network;
When objective function (2) reaches its minimum, the image coding network has completed training; the output of the image coding network is the image semantic feature Γ_i^v.
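A minimal PyTorch sketch of the image coding network follows, assuming the preferred embodiment of claim 2 (VGG16 with all fully connected layers replaced by one randomly initialized fully connected layer). Since formula (2) is reproduced only as an image, the cross-entropy classification loss between l_i and the predicted label, and the use of the network output as Γ_i^v, are assumptions; the batch size and number of classes are illustrative.

    import torch
    import torch.nn as nn
    from torchvision.models import vgg16

    class ImageEncoder(nn.Module):
        """Image coding network: VGG16 whose fully connected layers are all
        replaced by a single randomly initialized fully connected layer."""

        def __init__(self, num_classes):
            super().__init__()
            backbone = vgg16()                      # convolutional backbone
            self.features = backbone.features
            self.avgpool = backbone.avgpool
            self.fc = nn.Linear(512 * 7 * 7, num_classes)  # single FC layer

        def forward(self, x):
            h = self.avgpool(self.features(x)).flatten(1)
            return self.fc(h)                       # taken here as Γ_i^v

    encoder = ImageEncoder(num_classes=10)
    criterion = nn.CrossEntropyLoss()               # assumed form of objective (2)
    optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-4)

    images = torch.randn(8, 3, 224, 224)            # toy batch of images
    labels = torch.randint(0, 10, (8,))             # true semantic labels l_i
    loss = criterion(encoder(images), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()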
Step 3: for the sounds of the image-audio description data set, learn the sound semantic features Γ_i^s through the sound coding network.
The sound coding network consists of 6 sequentially connected fully connected layers; training is driven by the error between the input sound and the reconstructed sound, and the output of the 3rd fully connected layer is taken as the sound semantic feature.
The objective function is:
[formula (3): reproduced only as an image in the original publication]
where s_i denotes a sound from the image-audio description data set and ŝ_i denotes the sound reconstructed by the sound coding network;
When objective function (3) reaches its minimum, the sound coding network has completed training; the output of the sound coding network is the sound semantic feature Γ_i^s.
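The sound coding network can be sketched as a 6-layer fully connected autoencoder whose 3rd-layer output serves as Γ_i^s. Formula (3) is reproduced only as an image, so the mean-squared reconstruction error used below is an assumption; all layer widths and the waveform length are illustrative.

    import torch
    import torch.nn as nn

    class SoundEncoder(nn.Module):
        """Sound coding network: 6 sequentially connected fully connected
        layers; the 3rd layer's output is the sound semantic feature."""

        def __init__(self, sound_dim, feat_dim=128):
            super().__init__()
            self.fc1 = nn.Linear(sound_dim, 1024)
            self.fc2 = nn.Linear(1024, 512)
            self.fc3 = nn.Linear(512, feat_dim)      # Γ_i^s is taken here
            self.fc4 = nn.Linear(feat_dim, 512)
            self.fc5 = nn.Linear(512, 1024)
            self.fc6 = nn.Linear(1024, sound_dim)    # reconstructed sound
            self.act = nn.ReLU()

        def forward(self, s):
            h = self.act(self.fc1(s))
            h = self.act(self.fc2(h))
            feat = self.fc3(h)                       # sound semantic feature
            h = self.act(self.fc4(feat))
            h = self.act(self.fc5(h))
            return feat, self.fc6(h)

    net = SoundEncoder(sound_dim=16000)
    s = torch.randn(4, 16000)                        # input sounds s_i
    feat, recon = net(s)
    loss = nn.functional.mse_loss(recon, s)          # assumed form of objective (3)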
Step 4: from the image semantic feature Γ_i^v and the sound semantic feature Γ_i^s, learn the cross-modal feature expression Γ_i through the cross-modal mapping network.
The cross-modal mapping network consists of 2 stacked fully connected layers.
The objective function is:
min Σ_i (1 - Γ_i^s · Γ_i / (||Γ_i^s|| · ||Γ_i||))    (4)
where ||·|| denotes the modulus (norm) of a vector;
When objective function (4) reaches its minimum, the cross-modal mapping network has completed training; the output of the cross-modal mapping network is the cross-modal feature expression Γ_i.
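A sketch of the cross-modal mapping network and of objective (4) follows. Step 6 indicates that the network takes the image semantic feature Γ_i^v as input; the two fully connected layers and the cosine-similarity loss come from the text, while the feature dimensions are illustrative assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CrossModalMapping(nn.Module):
        """Cross-modal mapping network: 2 stacked fully connected layers
        mapping the image semantic feature to the cross-modal expression."""

        def __init__(self, img_feat_dim, feat_dim=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(img_feat_dim, 256),
                nn.ReLU(),
                nn.Linear(256, feat_dim),
            )

        def forward(self, gamma_v):
            return self.net(gamma_v)                 # cross-modal feature Γ_i

    mapper = CrossModalMapping(img_feat_dim=10)
    gamma_v = torch.randn(4, 10)                     # from the image coding network
    gamma_s = torch.randn(4, 128)                    # from the sound coding network
    gamma = mapper(gamma_v)
    # Objective (4): sum over pairs of (1 - cosine similarity of Γ_i^s and Γ_i).
    loss = (1.0 - F.cosine_similarity(gamma_s, gamma, dim=1)).sum()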
Step 5: from the cross-modal feature expression Γ_i, compute the sound waveform through the cross-modal feature network.
The cross-modal feature network consists of 3 stacked residual blocks followed by 2 sequentially connected fully connected layers.
The objective function is:
[formula (5): reproduced only as an image in the original publication]
where x denotes any real number;
When objective function (5) reaches its minimum, the cross-modal feature network has completed training; the output of the cross-modal feature network is the generated sound waveform.
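The cross-modal feature network can be sketched as 3 stacked residual blocks followed by 2 fully connected layers that emit the waveform. Formula (5) is reproduced only as an image; given the clause "x represents any real number", a smooth-L1 (Huber-style) regression loss between generated and reference waveforms is one plausible reading and is used below only as an assumption; layer widths and waveform length are illustrative.

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        """A simple fully connected residual block (sizes are illustrative)."""

        def __init__(self, dim):
            super().__init__()
            self.fc1 = nn.Linear(dim, dim)
            self.fc2 = nn.Linear(dim, dim)
            self.act = nn.ReLU()

        def forward(self, x):
            return self.act(x + self.fc2(self.act(self.fc1(x))))

    class CrossModalFeatureNet(nn.Module):
        """Cross-modal feature network: 3 stacked residual blocks followed by
        2 fully connected layers producing the sound waveform."""

        def __init__(self, feat_dim=128, wave_len=16000):
            super().__init__()
            self.blocks = nn.Sequential(*[ResidualBlock(feat_dim) for _ in range(3)])
            self.fc1 = nn.Linear(feat_dim, 1024)
            self.fc2 = nn.Linear(1024, wave_len)

        def forward(self, gamma):
            return self.fc2(torch.relu(self.fc1(self.blocks(gamma))))

    net = CrossModalFeatureNet()
    gamma = torch.randn(4, 128)                      # cross-modal features Γ_i
    target = torch.randn(4, 16000)                   # reference waveforms
    loss = nn.functional.smooth_l1_loss(net(gamma), target)  # assumed form of (5)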
Step 6: input the image to be tested into the image coding network trained in step 2 to obtain its image semantic features; input these image semantic features into the cross-modal mapping network trained in step 4 to obtain the cross-modal feature expression of the test image; finally, input this cross-modal feature expression into the cross-modal feature network trained in step 5 to obtain the sound waveform converted from the test image.
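Chaining the three trained networks reproduces the step-6 test procedure; the sketch below assumes the illustrative module classes defined above.

    import torch

    @torch.no_grad()
    def image_to_sound(test_image, image_encoder, cross_modal_mapper, feature_net):
        # Step 6: image -> image semantic feature -> cross-modal feature -> waveform.
        gamma_v = image_encoder(test_image)          # image coding network (step 2)
        gamma = cross_modal_mapper(gamma_v)          # cross-modal mapping network (step 4)
        waveform = feature_net(gamma)                # cross-modal feature network (step 5)
        return waveform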
Preferably, the image coding network employs a modified VGG16 network, i.e., all fully connected layers of the VGG16 network are replaced by one randomly initialized fully connected layer.
The beneficial effects of the invention are as follows:
1. Through semantic preservation at the low-dimensional feature level, the invention achieves direct conversion from visual data to auditory data with high robustness and accuracy.
2. By effectively decoding the low-dimensional features, the method can generate speech waveforms based on human language, making it better suited to helping visually impaired people accurately perceive their surroundings.
3. The invention effectively addresses the difficulty of generating human-language speech waveforms from visual data; training is fast and the generated speech is highly intelligible, with a highest STOI value of 0.9682.
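For reference, intelligibility scores such as the STOI value of 0.9682 cited above can be computed with the open-source pystoi package; the snippet below is only an illustration (the package choice, sample rate, and toy signals are not part of the patent).

    import numpy as np
    from pystoi import stoi                          # pip install pystoi

    fs = 16000                                       # assumed sample rate in Hz
    reference = np.random.randn(fs * 2)              # ground-truth speech (toy signal)
    generated = reference + 0.05 * np.random.randn(fs * 2)  # converted waveform (toy)

    score = stoi(reference, generated, fs, extended=False)
    print(f"STOI = {score:.4f}")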
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The invention will be further described with reference to the drawings and examples.
The invention treats information conversion between the visual and auditory modalities as a similarity-learning problem over low-dimensional representations: semantic features are extracted from images, cross-modal conversion of the features is performed in the low-dimensional space, and the low-dimensional cross-modal features are finally mapped to speech waveforms based on human language. The invention addresses the limitation of existing visual-to-auditory cross-modal conversion methods, which cannot accurately generate human-language speech waveforms in an unconstrained environment; by generating such waveforms for unconstrained environments, the output better matches real-world conditions.
As shown in FIG. 1, the cross-audiovisual information conversion method based on semantic reservation includes the following steps:
Step 1: the image-audio description data set contains N image-audio pairs. Take this data set as the training set and normalize each image in the training set according to formula (1) to obtain the normalized image:
[formula (1): reproduced only as an image in the original publication]
where X_tr denotes an image in the training set, μ and θ are the mean and standard deviation of all images in the training set, respectively, and K is the number of pixels of a single training image;
Step 2: for the normalized images, learn the image semantic features Γ_i^v through the image coding network, where i is the image-audio pair index, i = 1, 2, ..., N;
The objective function is:
[formula (2): reproduced only as an image in the original publication]
where l_i and l̂_i are, respectively, the true semantic label of the input image and the semantic label predicted by the image coding network;
When objective function (2) reaches its minimum, the image coding network has completed training; the output of the image coding network is the image semantic feature Γ_i^v.
Step 3: for the sounds of the image-audio description data set, learn the sound semantic features Γ_i^s through the sound coding network.
The sound coding network consists of 6 sequentially connected fully connected layers; training is driven by the error between the input sound and the reconstructed sound, and the output of the 3rd fully connected layer is taken as the sound semantic feature.
The objective function is:
[formula (3): reproduced only as an image in the original publication]
where s_i denotes a sound from the image-audio description data set and ŝ_i denotes the sound reconstructed by the sound coding network;
When objective function (3) reaches its minimum, the sound coding network has completed training; the output of the sound coding network is the sound semantic feature Γ_i^s.
Step 4: from the image semantic feature Γ_i^v and the sound semantic feature Γ_i^s, learn the cross-modal feature expression Γ_i through the cross-modal mapping network.
The cross-modal mapping network consists of 2 stacked fully connected layers.
The objective function is:
min Σ_i (1 - Γ_i^s · Γ_i / (||Γ_i^s|| · ||Γ_i||))    (4)
where ||·|| denotes the modulus (norm) of a vector;
When objective function (4) reaches its minimum, the cross-modal mapping network has completed training; the output of the cross-modal mapping network is the cross-modal feature expression Γ_i.
Step 5: from the cross-modal feature expression Γ_i, compute the sound waveform through the cross-modal feature network.
The cross-modal feature network consists of 3 stacked residual blocks followed by 2 sequentially connected fully connected layers.
The objective function is:
[formula (5): reproduced only as an image in the original publication]
where x denotes any real number;
When objective function (5) reaches its minimum, the cross-modal feature network has completed training; the output of the cross-modal feature network is the generated sound waveform.
Step 6: input the image to be tested into the image coding network trained in step 2 to obtain its image semantic features; input these image semantic features into the cross-modal mapping network trained in step 4 to obtain the cross-modal feature expression of the test image; finally, input this cross-modal feature expression into the cross-modal feature network trained in step 5 to obtain the sound waveform converted from the test image.
Specific examples:
1. simulation conditions
In this embodiment, the simulation is performed with Python and related toolkits on a machine with an Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20 GHz, 500 GB of memory, and the Ubuntu operating system.
The data used in the simulation are image-audio description data sets constructed by adding audio descriptions to existing data sets.
2. Simulation content
Model training and testing were performed on the MNIST, CIFAR-10, and CIFAR-100 audio-description data sets.
To demonstrate the effectiveness of the algorithm, the DCMAVG, CMCGAN, and I2T2A models were chosen for comparison. The DCMAVG model is presented in "L. Chen, S. Srivastava, Z. Duan, and C. Xu, Deep Cross-Modal Audio-Visual Generation, in Proceedings of the Thematic Workshops of ACM Multimedia, 2017, pp. 349-357"; the CMCGAN model is presented in "W. Hao, Z. Zhang, and H. Guan, CMCGAN: A Uniform Framework for Cross-Modal Visual-Audio Mutual Generation, in Thirty-Second AAAI Conference on Artificial Intelligence, 2018, pp. 6886-6893"; the I2T2A model is obtained by generating a text description of the image with the method of "L. Liu, J. Tang, X. Wan, and Z. Guo, Generating Diverse and Descriptive Image Captions Using Visual Paraphrases, in 2019 IEEE International Conference on Computer Vision, 2019, pp. 4239-4248" and converting the text description to audio with text-to-speech software. The comparison results are shown in Table 1.
TABLE 1. Results of the invention (reproduced only as an image in the original publication)
As can be seen from Table 1, the performance indicators of the invention are superior to those of the other methods in most cases. Through semantic preservation at the low-dimensional feature level, the invention reduces the intra-modal semantic gap and the inter-modal heterogeneous gap, accurately achieves direct conversion from visual data to auditory data, and improves the robustness and accuracy of the algorithm. At the same time, the invention can generate speech waveforms based on human language, which is of strong practical value for visually impaired people.

Claims (2)

1. A cross-audiovisual information conversion method based on semantic reservation, characterized by comprising the following steps:
Step 1: the image-audio description data set contains N image-audio pairs. Take this data set as the training set and normalize each image in the training set according to formula (1) to obtain the normalized image:
[formula (1): reproduced only as an image in the original publication]
where X_tr denotes an image in the training set, μ and θ are the mean and standard deviation of all images in the training set, respectively, and K is the number of pixels of a single training image;
Step 2: for the normalized images, learn the image semantic features Γ_i^v through the image coding network, where i is the image-audio pair index, i = 1, 2, ..., N;
The objective function is:
[formula (2): reproduced only as an image in the original publication]
where l_i and l̂_i are, respectively, the true semantic label of the input image and the semantic label predicted by the image coding network;
order of the Chinese medicineWhen the standard function (2) is minimum, the image coding network finishes training; the output of the image coding network is the image semantic feature Γ i v
Step 3: for the sounds of the image-audio description data set, learn the sound semantic features Γ_i^s through the sound coding network.
The sound coding network consists of 6 sequentially connected fully connected layers; training is driven by the error between the input sound and the reconstructed sound, and the output of the 3rd fully connected layer is taken as the sound semantic feature.
The objective function is:
[formula (3): reproduced only as an image in the original publication]
where s_i denotes a sound from the image-audio description data set and ŝ_i denotes the sound reconstructed by the sound coding network;
When objective function (3) reaches its minimum, the sound coding network has completed training; the output of the sound coding network is the sound semantic feature Γ_i^s.
Step 4: from the image semantic feature Γ_i^v and the sound semantic feature Γ_i^s, learn the cross-modal feature expression Γ_i through the cross-modal mapping network.
The cross-modal mapping network consists of 2 stacked fully connected layers.
The objective function is:
min Σ_i (1 - Γ_i^s · Γ_i / (||Γ_i^s|| · ||Γ_i||))    (4)
where ||·|| denotes the modulus (norm) of a vector;
When objective function (4) reaches its minimum, the cross-modal mapping network has completed training; the output of the cross-modal mapping network is the cross-modal feature expression Γ_i.
Step 5: from the cross-modal feature expression Γ_i, compute the sound waveform through the cross-modal feature network.
The cross-modal feature network consists of 3 stacked residual blocks followed by 2 sequentially connected fully connected layers.
The objective function is:
[formula (5): reproduced only as an image in the original publication]
where x denotes any real number;
When objective function (5) reaches its minimum, the cross-modal feature network has completed training; the output of the cross-modal feature network is the generated sound waveform.
Step 6: input the image to be tested into the image coding network trained in step 2 to obtain its image semantic features; input these image semantic features into the cross-modal mapping network trained in step 4 to obtain the cross-modal feature expression of the test image; finally, input this cross-modal feature expression into the cross-modal feature network trained in step 5 to obtain the sound waveform converted from the test image.
2. The semantic reservation-based cross-audiovisual information conversion method according to claim 1, wherein the image coding network adopts a modified VGG16 network, i.e., all fully connected layers of the VGG16 network are replaced by one randomly initialized fully connected layer.
CN202110140393.6A 2021-02-02 2021-02-02 Cross-audiovisual information conversion method based on semantic reservation Active CN112802445B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110140393.6A CN112802445B (en) 2021-02-02 2021-02-02 Cross-audiovisual information conversion method based on semantic reservation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110140393.6A CN112802445B (en) 2021-02-02 2021-02-02 Cross-audiovisual information conversion method based on semantic reservation

Publications (2)

Publication Number Publication Date
CN112802445A (en) 2021-05-14
CN112802445B (en) 2023-06-30

Family

ID=75813564

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110140393.6A Active CN112802445B (en) 2021-02-02 2021-02-02 Cross-audiovisual information conversion method based on semantic reservation

Country Status (1)

Country Link
CN (1) CN112802445B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115154216B (en) * 2022-06-02 2024-10-29 Beijing University of Technology Blind guiding method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018188240A1 (en) * 2017-04-10 2018-10-18 北京大学深圳研究生院 Cross-media retrieval method based on deep semantic space
CN111597298A (en) * 2020-03-26 2020-08-28 浙江工业大学 Cross-modal retrieval method and device based on deep confrontation discrete hash learning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018188240A1 (en) * 2017-04-10 2018-10-18 北京大学深圳研究生院 Cross-media retrieval method based on deep semantic space
CN111597298A (en) * 2020-03-26 2020-08-28 浙江工业大学 Cross-modal retrieval method and device based on deep confrontation discrete hash learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Audio-visual related multimodal concept detection; 奠雨洁; 金琴; Journal of Computer Research and Development (Issue 05); full text *

Also Published As

Publication number Publication date
CN112802445A (en) 2021-05-14

Similar Documents

Publication Publication Date Title
Zhou et al. Vision-infused deep audio inpainting
Song et al. Multimodal sparse transformer network for audio-visual speech recognition
Lin et al. Audiovisual transformer with instance attention for audio-visual event localization
US10679643B2 (en) Automatic audio captioning
Zhao et al. Multi-modal multi-cultural dimensional continues emotion recognition in dyadic interactions
Mun et al. Text-guided attention model for image captioning
JP2023537705A (en) AUDIO-VISUAL EVENT IDENTIFICATION SYSTEM, METHOD AND PROGRAM
CA3175428A1 (en) Multimodal analysis combining monitoring modalities to elicit cognitive states and perform screening for mental disorders
CN112053690A (en) Cross-modal multi-feature fusion audio and video voice recognition method and system
US20220172710A1 (en) Interactive systems and methods
Ma et al. Unpaired image-to-speech synthesis with multimodal information bottleneck
WO2023226239A1 (en) Object emotion analysis method and apparatus and electronic device
Misra et al. A Comparison of Supervised and Unsupervised Pre-Training of End-to-End Models.
CN115658954B (en) Cross-modal search countermeasure method based on prompt learning
Li et al. Detection of multiple steganography methods in compressed speech based on code element embedding, Bi-LSTM and CNN with attention mechanisms
CN112802445B (en) Cross-audiovisual information conversion method based on semantic reservation
Ruan et al. Accommodating audio modality in CLIP for multimodal processing
WO2021028236A1 (en) Systems and methods for sound conversion
Djilali et al. Lip2Vec: Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual to Audio Representation Mapping
Li et al. Robust audio-visual ASR with unified cross-modal attention
CN117370934B (en) Multi-mode data enhancement method of sensitive information discovery model
Hong et al. When hearing the voice, who will come to your mind
CN113743267A (en) Multi-mode video emotion visualization method and device based on spiral and text
Chen et al. Cross-modal dynamic sentiment annotation for speech sentiment analysis
US20230290371A1 (en) System and method for automatically generating a sign language video with an input speech using a machine learning model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant