CN112802445B - Cross-audiovisual information conversion method based on semantic reservation - Google Patents
- Publication number: CN112802445B
- Application number: CN202110140393.6A
- Authority: CN (China)
- Prior art keywords: image, cross, sound, network, modal
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/027—Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Image Processing (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a cross-audiovisual information conversion method based on semantic preservation. It treats visual-to-auditory information conversion as a similarity-learning problem over low-dimensional representations: it extracts semantic features from images, performs the cross-modal conversion of those features in the low-dimensional space, and finally maps the low-dimensional cross-modal features to human-language sound waveforms. The invention addresses the inability of existing visual-to-auditory cross-modal information conversion methods to accurately generate human-language sound waveforms in unconstrained environments. Because it generates human-language sound waveforms for unconstrained environments, its output better matches real-world conditions.
Description
Technical Field
The invention belongs to the technical field of machine learning, and particularly relates to a cross-audiovisual information conversion method.
Background
Visual-to-auditory cross-modal information conversion helps visually impaired people better perceive the world around them, so it is of great practical value to them. However, because of the heterogeneous semantic gap that commonly exists between the audio and visual modalities, and the complex data structures within each modality, effective cross-audiovisual information conversion is very difficult to achieve. Relatively few studies currently address visual-to-auditory cross-modal information conversion, and their workflow is generally as follows: first extract semantic features of the visual data, then predict a spectrogram of the auditory data, and finally generate the sound waveform. These studies typically generate musical instrument sounds, percussive sounds, or ambient background sounds for a particular environment:
1) Cross-modal information conversion based on instrument sound generation. Chen et al., in "L. Chen, S. Srivastava, Z. Duan, and C. Xu, Deep Cross-Modal Audio-Visual Generation, in the Thematic Workshops of ACM Multimedia, 2017, pp. 349-357", propose a method based on conditional generative adversarial networks. The method generates a sound spectrogram from an input instrument image, encodes the spectrogram to generate an instrument image, discriminates the generated instrument image in order to optimize the generated spectrogram, and finally obtains the corresponding instrument sound waveform from the optimized spectrogram.
2) Cross-modal information conversion based on percussive sound generation. Owens et al., in "A. Owens, P. Isola, J. McDermott, A. Torralba, E. Adelson, and W. Freeman, Visually Indicated Sounds, in the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2405-2413", propose a method based on a recurrent neural network for generating the sound of objects being struck in video. The method uses a recurrent neural network to predict sound features from a video and then uses an example-based synthesis method to generate the corresponding sound waveform from those features.
3) Cross-modal information conversion based on ambient background sound generation. Zhou et al., in "Y. Zhou, Z. Wang, C. Fang, T. Bui, and T. Berg, Visual to Sound: Generating Natural Sound for Videos in the Wild, in the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3550-3558", propose a video background sound generation method based on an encoder-decoder structure. The method first encodes video features using purpose-designed frame-to-frame, sequence-to-sequence, and optical-flow-based methods, and then decodes them with a SampleRNN to generate the sound waveform.
Each of the above methods has its own limitations. For example, cross-modal information conversion based on percussive sound generation can only produce regular percussive sounds. It is also difficult for these methods to accurately generate human-language sound waveforms in an unconstrained environment, so their practicability is limited.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides a cross-audiovisual information conversion method based on semantic preservation. It treats visual-to-auditory information conversion as a similarity-learning problem over low-dimensional representations: it extracts semantic features from images, performs the cross-modal conversion of those features in the low-dimensional space, and finally maps the low-dimensional cross-modal features to human-language sound waveforms. The invention addresses the inability of existing visual-to-auditory cross-modal information conversion methods to accurately generate human-language sound waveforms in unconstrained environments. Because it generates human-language sound waveforms for unconstrained environments, its output better matches real-world conditions.
The technical scheme adopted by the invention for solving the technical problems comprises the following steps:
Step 1: The image-audio description data set contains N image-audio pairs. Taking the image-audio description data set as the training set, normalize each image in the training set according to formula (1) to obtain a normalized image, where $X_{tr}$ is an image in the training set, $\mu$ and $\theta$ are respectively the mean and standard deviation of all images in the training set, and K is the number of pixels of a single image in the training set;
Step 2: For each normalized image, learn the image semantic feature $\Gamma_i^v$ through the image coding network, where i is the index of the image-audio pair, i = 1, 2, ..., N;
the objective function is given by formula (2), in which $l_i$ is the true semantic label of the input image and is compared with the semantic label predicted by the image coding network;
when objective function (2) reaches its minimum, the image coding network has completed training; the output of the image coding network is the image semantic feature $\Gamma_i^v$;
Step 3: For the sounds in the image-audio description data set, learn the sound semantic feature $\Gamma_i^s$ through the sound coding network;
the sound coding network consists of 6 sequentially connected fully connected layers; training is driven by the error between the input sound and the reconstructed sound, and the output of the 3rd fully connected layer is taken as the sound semantic feature;
the objective function is given by formula (3), which compares the sound from the image-audio description data set with the sound reconstructed by the sound coding network;
when objective function (3) reaches its minimum, the sound coding network has completed training; the output of the sound coding network is the sound semantic feature $\Gamma_i^s$;
Step 4: From the image semantic feature $\Gamma_i^v$ and the sound semantic feature $\Gamma_i^s$, learn the cross-modal feature expression $\Gamma_i$ through the cross-modal mapping network;
the cross-modal mapping network consists of 2 stacked fully connected layers;
the objective function is:
$$\min \sum_i \left(1 - \frac{\Gamma_i^s \cdot \Gamma_i}{\|\Gamma_i^s\| \, \|\Gamma_i\|}\right) \tag{4}$$
where $\|\cdot\|$ denotes the modulus of a vector;
when objective function (4) reaches its minimum, the cross-modal mapping network has completed training; the output of the cross-modal mapping network is the cross-modal feature expression $\Gamma_i$;
Step 5: From the cross-modal feature expression $\Gamma_i$, compute the sound waveform through the cross-modal feature network;
the cross-modal feature network consists of 3 stacked residual blocks followed by 2 fully connected layers, connected in sequence;
the objective function is given by formula (5), in which x denotes any real number;
when objective function (5) reaches its minimum, the cross-modal feature network has completed training; the output of the cross-modal feature network is the sound waveform;
Step 6: Input the image to be tested into the image coding network trained in step 2 to obtain its image semantic features; input these image semantic features into the cross-modal mapping network trained in step 4 to obtain the cross-modal feature expression of the image; finally, input this cross-modal feature expression into the cross-modal feature network trained in step 5 to obtain the sound waveform converted from the test image.
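The test-time flow of step 6 is a chain of three trained networks. The following is a minimal sketch of that chaining only, not the patented implementation: the function name `convert_image_to_sound` and the three toy stand-in functions are hypothetical, and each stage is assumed to be a plain callable from a feature vector to a feature vector.

```python
from typing import Callable, List

Vector = List[float]
Stage = Callable[[Vector], Vector]


def convert_image_to_sound(
    image: Vector,
    image_encoder: Stage,       # stands in for the network trained in step 2
    cross_modal_mapper: Stage,  # stands in for the network trained in step 4
    waveform_decoder: Stage,    # stands in for the network trained in step 5
) -> Vector:
    """Chain the three trained stages in the order described by step 6."""
    image_features = image_encoder(image)
    cross_modal_features = cross_modal_mapper(image_features)
    return waveform_decoder(cross_modal_features)


# Hypothetical toy stages, for illustration only; real stages would be
# the trained neural networks.
def toy_encoder(image: Vector) -> Vector:
    return [sum(image) / len(image)]


def toy_mapper(features: Vector) -> Vector:
    return [2.0 * f for f in features]


def toy_decoder(features: Vector) -> Vector:
    return [features[0], -features[0]]


waveform = convert_image_to_sound(
    [0.0, 1.0, 2.0], toy_encoder, toy_mapper, toy_decoder
)
```

Note that the sound coding network of step 3 does not appear in this chain: it is only needed during training, where it supplies the target for the cross-modal mapping.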
Preferably, the image coding network is a modified VGG16 network, in which all fully connected layers of VGG16 are replaced by a single randomly initialized fully connected layer.
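Formula (4) in step 4 sums, over all image-audio pairs, one minus the cosine similarity between the sound semantic feature and the cross-modal feature expression. A minimal pure-Python sketch of that quantity follows; the function names and the `eps` guard against zero-length vectors are illustrative assumptions, not part of the patent text.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float], eps: float = 1e-12) -> float:
    """Return 1 - cos(a, b); eps (an assumption) avoids division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / max(norm_a * norm_b, eps)


def cross_modal_objective(
    sound_features: Sequence[Sequence[float]],
    cross_modal_features: Sequence[Sequence[float]],
) -> float:
    """Sum the cosine distance over all pairs, as in formula (4)."""
    return sum(
        cosine_distance(s, g)
        for s, g in zip(sound_features, cross_modal_features)
    )


# One perfectly aligned pair (distance 0) and one orthogonal pair (distance 1).
loss = cross_modal_objective(
    [[1.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.0], [0.0, 1.0]],
)
```

Because the cosine distance ignores vector length, minimizing it only aligns the direction of the cross-modal feature with the sound feature.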
The beneficial effects of the invention are as follows:
1. The invention converts visual data directly into auditory data through semantic preservation at the low-dimensional feature level, with high robustness and accuracy.
2. By effectively decoding the low-dimensional features, the method generates human-language sound waveforms, which makes it better suited to helping visually impaired people accurately perceive their surroundings.
3. The invention effectively addresses the difficulty of generating human-language sound waveforms from visual data; it trains quickly and produces highly intelligible speech, with the highest STOI value reaching 0.9682.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The invention will be further described with reference to the drawings and examples.
The invention treats visual-to-auditory information conversion as a similarity-learning problem over low-dimensional representations: it extracts semantic features from images, performs the cross-modal conversion of those features in the low-dimensional space, and finally maps the low-dimensional cross-modal features to human-language sound waveforms. The invention addresses the inability of existing visual-to-auditory cross-modal information conversion methods to accurately generate human-language sound waveforms in unconstrained environments. Because it generates human-language sound waveforms for unconstrained environments, its output better matches real-world conditions.
As shown in FIG. 1, the cross-audiovisual information conversion method based on semantic preservation includes the following steps:
Step 1: The image-audio description data set contains N image-audio pairs. Taking the image-audio description data set as the training set, normalize each image in the training set according to formula (1) to obtain a normalized image, where $X_{tr}$ is an image in the training set, $\mu$ and $\theta$ are respectively the mean and standard deviation of all images in the training set, and K is the number of pixels of a single image in the training set;
Step 2: For each normalized image, learn the image semantic feature $\Gamma_i^v$ through the image coding network, where i is the index of the image-audio pair, i = 1, 2, ..., N;
the objective function is given by formula (2), in which $l_i$ is the true semantic label of the input image and is compared with the semantic label predicted by the image coding network;
when objective function (2) reaches its minimum, the image coding network has completed training; the output of the image coding network is the image semantic feature $\Gamma_i^v$;
Step 3: For the sounds in the image-audio description data set, learn the sound semantic feature $\Gamma_i^s$ through the sound coding network;
the sound coding network consists of 6 sequentially connected fully connected layers; training is driven by the error between the input sound and the reconstructed sound, and the output of the 3rd fully connected layer is taken as the sound semantic feature;
the objective function is given by formula (3), which compares the sound from the image-audio description data set with the sound reconstructed by the sound coding network;
when objective function (3) reaches its minimum, the sound coding network has completed training; the output of the sound coding network is the sound semantic feature $\Gamma_i^s$;
Step 4: From the image semantic feature $\Gamma_i^v$ and the sound semantic feature $\Gamma_i^s$, learn the cross-modal feature expression $\Gamma_i$ through the cross-modal mapping network;
the cross-modal mapping network consists of 2 stacked fully connected layers;
the objective function is:
$$\min \sum_i \left(1 - \frac{\Gamma_i^s \cdot \Gamma_i}{\|\Gamma_i^s\| \, \|\Gamma_i\|}\right) \tag{4}$$
where $\|\cdot\|$ denotes the modulus of a vector;
when objective function (4) reaches its minimum, the cross-modal mapping network has completed training; the output of the cross-modal mapping network is the cross-modal feature expression $\Gamma_i$;
Step 5: From the cross-modal feature expression $\Gamma_i$, compute the sound waveform through the cross-modal feature network;
the cross-modal feature network consists of 3 stacked residual blocks followed by 2 fully connected layers, connected in sequence;
the objective function is given by formula (5), in which x denotes any real number;
when objective function (5) reaches its minimum, the cross-modal feature network has completed training; the output of the cross-modal feature network is the sound waveform;
Step 6: Input the image to be tested into the image coding network trained in step 2 to obtain its image semantic features; input these image semantic features into the cross-modal mapping network trained in step 4 to obtain the cross-modal feature expression of the image; finally, input this cross-modal feature expression into the cross-modal feature network trained in step 5 to obtain the sound waveform converted from the test image.
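The sound coding network of step 3 behaves like an autoencoder: six fully connected layers reconstruct the input sound, and the output of the 3rd layer is kept as the sound semantic feature. The sketch below illustrates only that forward structure. The layer widths, the tanh activation, the random weights, and the squared-error reconstruction measure are all assumptions made for illustration, since the text specifies neither the layer sizes and activations nor the form of formula (3).

```python
import math
import random
from typing import List, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weights, bias)


def make_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    """Create one fully connected layer with small random weights."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x: List[float], layer: Layer, activate: bool = True) -> List[float]:
    """Apply one fully connected layer; tanh is an assumed activation."""
    weights, bias = layer
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]
    return [math.tanh(o) for o in out] if activate else out


def forward(sound: List[float], layers: List[Layer]) -> Tuple[List[float], List[float]]:
    """Return (output of the 3rd layer, reconstructed sound)."""
    hidden = sound
    semantic_feature: List[float] = []
    for index, layer in enumerate(layers, start=1):
        # The last layer is left linear so the reconstruction is unbounded.
        hidden = dense(hidden, layer, activate=index < len(layers))
        if index == 3:
            semantic_feature = hidden
    return semantic_feature, hidden


def reconstruction_error(sound: List[float], reconstructed: List[float]) -> float:
    """Squared error between input and reconstruction (assumed loss form)."""
    return sum((s - r) ** 2 for s, r in zip(sound, reconstructed))


rng = random.Random(0)
widths = [8, 6, 4, 2, 4, 6, 8]  # assumed sizes: 6 layers, bottleneck after layer 3
layers = [make_layer(a, b, rng) for a, b in zip(widths, widths[1:])]
sound = [rng.uniform(-1.0, 1.0) for _ in range(widths[0])]
semantic_feature, reconstructed = forward(sound, layers)
error = reconstruction_error(sound, reconstructed)
```

Training would adjust the weights to reduce `error`; after training, only `semantic_feature` is passed on to step 4.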
Specific examples:
1. Simulation conditions
In this embodiment, the simulation is run in Python with related toolkits on an Ubuntu operating system, on a machine with an Intel(R) Xeon(R) CPU E5-2650 V4 @ 2.20 GHz and 500 GB of memory.
The data used in the simulation is an image-audio description data set obtained by adding audio descriptions to existing data sets.
2. Simulation content
Model training and testing were performed on the MNIST, CIFAR10 and CIFAR100 audio description data sets.
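Before training on such data sets, step 1 normalizes every image with the mean and standard deviation of the whole training set. The sketch below assumes that formula (1) is ordinary z-score normalization, (pixel - mean) / standard deviation; the formula itself is not reproduced in this text, so the exact form and the role of K are assumptions. The function names and the tiny two-image training set are hypothetical.

```python
import math
from typing import List, Tuple

Image = List[float]  # a flattened image with K pixels


def fit_normalizer(images: List[Image]) -> Tuple[float, float]:
    """Compute the mean and standard deviation over all training pixels."""
    pixels = [p for image in images for p in image]
    mean = sum(pixels) / len(pixels)
    variance = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    return mean, math.sqrt(variance)


def normalize(image: Image, mean: float, std: float) -> Image:
    """Apply the assumed z-score normalization to one image."""
    return [(p - mean) / std for p in image]


# Hypothetical training set of two images with K = 2 pixels each.
training_images = [[0.0, 2.0], [4.0, 6.0]]
mean, std = fit_normalizer(training_images)
normalized_images = [normalize(image, mean, std) for image in training_images]
```

The same `mean` and `std` fitted on the training set would be reused for test images, so that training and test inputs share one scale.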
To demonstrate the effectiveness of the algorithm, the DCMAVG model, the CMCGAN model and the I2T2A model were chosen as comparison algorithms. The DCMAVG model is presented in "L. Chen, S. Srivastava, Z. Duan, C. Xu. Deep cross-modal audio-visual generation, in Proceedings of the Thematic Workshops of ACM Multimedia 2017, pp. 349-357"; the CMCGAN model is presented in "W. Hao, Z. Zhang, H. Guan. CMCGAN: A uniform framework for cross-modal visual-audio mutual generation, in Thirty-Second AAAI Conference on Artificial Intelligence, 2018, pp. 6886-6893"; the I2T2A model generates a text description of an image by the method described in "L. Liu, J. Tang, X. Wan, Z. Guo. Generating diverse and descriptive image captions using visual paraphrases, in 2019 IEEE International Conference on Computer Vision, 2019, pp. 4239-4248" and then converts the text description into audio with text-to-speech software. The comparison results are shown in Table 1.
Table 1. Results of the invention
Table 1 shows that the performance indicators of the invention are in most cases superior to those of the other methods. Through semantic preservation at the low-dimensional feature level, the invention reduces both the semantic gap within each modality and the heterogeneous gap between modalities, so that visual data can be converted directly and accurately into auditory data, improving the robustness and accuracy of the algorithm. The invention also generates human-language sound waveforms, which makes it highly practical for visually impaired people.
Claims (2)
1. A cross-audiovisual information conversion method based on semantic preservation, characterized by comprising the following steps:
Step 1: The image-audio description data set contains N image-audio pairs. Taking the image-audio description data set as the training set, normalize each image in the training set according to formula (1) to obtain a normalized image, where $X_{tr}$ is an image in the training set, $\mu$ and $\theta$ are respectively the mean and standard deviation of all images in the training set, and K is the number of pixels of a single image in the training set;
Step 2: For each normalized image, learn the image semantic feature $\Gamma_i^v$ through the image coding network, where i is the index of the image-audio pair, i = 1, 2, ..., N;
the objective function is given by formula (2), in which $l_i$ is the true semantic label of the input image and is compared with the semantic label predicted by the image coding network;
when objective function (2) reaches its minimum, the image coding network has completed training; the output of the image coding network is the image semantic feature $\Gamma_i^v$;
Step 3: For the sounds in the image-audio description data set, learn the sound semantic feature $\Gamma_i^s$ through the sound coding network;
the sound coding network consists of 6 sequentially connected fully connected layers; training is driven by the error between the input sound and the reconstructed sound, and the output of the 3rd fully connected layer is taken as the sound semantic feature;
the objective function is given by formula (3), which compares the sound from the image-audio description data set with the sound reconstructed by the sound coding network;
when objective function (3) reaches its minimum, the sound coding network has completed training; the output of the sound coding network is the sound semantic feature $\Gamma_i^s$;
Step 4: From the image semantic feature $\Gamma_i^v$ and the sound semantic feature $\Gamma_i^s$, learn the cross-modal feature expression $\Gamma_i$ through the cross-modal mapping network;
the cross-modal mapping network consists of 2 stacked fully connected layers;
the objective function is:
$$\min \sum_i \left(1 - \frac{\Gamma_i^s \cdot \Gamma_i}{\|\Gamma_i^s\| \, \|\Gamma_i\|}\right) \tag{4}$$
where $\|\cdot\|$ denotes the modulus of a vector;
when objective function (4) reaches its minimum, the cross-modal mapping network has completed training; the output of the cross-modal mapping network is the cross-modal feature expression $\Gamma_i$;
Step 5: From the cross-modal feature expression $\Gamma_i$, compute the sound waveform through the cross-modal feature network;
the cross-modal feature network consists of 3 stacked residual blocks followed by 2 fully connected layers, connected in sequence;
the objective function is given by formula (5), in which x denotes any real number;
when objective function (5) reaches its minimum, the cross-modal feature network has completed training; the output of the cross-modal feature network is the sound waveform;
Step 6: Input the image to be tested into the image coding network trained in step 2 to obtain its image semantic features; input these image semantic features into the cross-modal mapping network trained in step 4 to obtain the cross-modal feature expression of the image; finally, input this cross-modal feature expression into the cross-modal feature network trained in step 5 to obtain the sound waveform converted from the test image.
2. The cross-audiovisual information conversion method based on semantic preservation according to claim 1, wherein the image coding network is a modified VGG16 network, in which all fully connected layers of VGG16 are replaced by a single randomly initialized fully connected layer.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110140393.6A CN112802445B (en) | 2021-02-02 | 2021-02-02 | Cross-audiovisual information conversion method based on semantic reservation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110140393.6A CN112802445B (en) | 2021-02-02 | 2021-02-02 | Cross-audiovisual information conversion method based on semantic reservation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112802445A (en) | 2021-05-14 |
CN112802445B (en) | 2023-06-30 |
Family
ID=75813564
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110140393.6A Active CN112802445B (en) | 2021-02-02 | 2021-02-02 | Cross-audiovisual information conversion method based on semantic reservation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112802445B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115154216B (en) * | 2022-06-02 | 2024-10-29 | Beijing University of Technology | Blind guiding method |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018188240A1 (en) * | 2017-04-10 | 2018-10-18 | Peking University Shenzhen Graduate School | Cross-media retrieval method based on deep semantic space |
CN111597298A (en) * | 2020-03-26 | 2020-08-28 | Zhejiang University of Technology | Cross-modal retrieval method and device based on deep confrontation discrete hash learning |
Non-Patent Citations (1)
Title |
---|
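视听相关的多模态概念检测 (Audio-visual correlated multimodal concept detection); 奠雨洁 (Dian Yujie); 金琴 (Jin Qin); 计算机研究与发展 (Journal of Computer Research and Development), No. 05; full text * |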
Also Published As
Publication number | Publication date |
---|---|
CN112802445A (en) | 2021-05-14 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | GR01 | Patent grant | |