WO2024140430A1 - Text classification method based on multimodal deep learning, device, and storage medium - Google Patents
- Publication number: WO2024140430A1 (PCT/CN2023/140831)
- Authority: WO (WIPO PCT)
- Prior art keywords: text, data, image, features, video
Classifications
- G06F16/355: Information retrieval of unstructured textual data; clustering or classification; class or cluster creation or modification
- G06F16/215: Improving data quality; data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
- G06F16/24552: Query execution; database cache management
- G06F16/685: Retrieval of audio data using metadata automatically derived from the content, e.g. automatically derived transcripts of audio data
- G06F16/7834: Retrieval of video data using metadata automatically derived from the content, using audio features
- G06F16/7844: Retrieval of video data using original textual content or text extracted from visual content or transcripts of audio data
- G06F40/284: Natural language analysis; lexical analysis, e.g. tokenisation or collocates
- G06F40/30: Semantic analysis
- G06N3/08: Neural networks; learning methods
- Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management
Definitions
- the method of "inputting the video image of the lip area into a 3D convolutional neural network model for calculation to obtain image features" specifically includes: segmenting the local video data of the lips into continuous lip picture frames; inputting the continuous lip picture frames into a 3D convolutional neural network model for calculation, extracting multiple features, and obtaining the image features.
- the "based on the image lip reading recognition method, inputting the image features into the multi-channel multi-size time deep convolutional neural network model for transcription to obtain the first image text data” specifically includes: inputting the image features into the multi-channel multi-size time deep convolutional neural network for calculation to obtain time series image features; according to the image lip reading recognition method, mapping the time series image features into a pinyin sequence of a pinyin sentence; and then translating the pinyin sequence into a Chinese character sequence corresponding to the Chinese character sentence.
- the self-weight information and/or associated weight information of the words and phrases in the text features of the speech text data and the image text data are distinguished to obtain the weight information of the text semantic features.
- FIG1 is a structural block diagram of a model involved in a text classification method based on multimodal deep learning in one embodiment of the present invention.
- FIG3 is a schematic diagram of the steps of acquiring real-time audio and video data and historical audio and video data in one embodiment of the present invention.
- FIG6 is a schematic diagram of the steps of obtaining a local area video image in the video data and transcribing the video image into image text data in one embodiment of the present invention.
- the embodiment of the present invention is a text classification method based on multimodal deep learning.
- the present application provides the method operation steps as described in the following implementation or flowchart 1, based on routine or no creative labor, the execution order of the steps in the method that do not have a necessary causal relationship in logic is not limited to the execution order provided in the implementation of the present application.
- FIG2 is a schematic diagram of the steps of a text classification method based on multimodal deep learning, including:
- S2 Preprocess the real-time audio and video data and historical audio and video data to obtain valid voice data and video data.
- S4 Acquire a video image of a local area in the valid video data, and transcribe the video image into image text data.
- S5 Acquire context information of the text data and weight information of text semantic features according to the speech text data and the image text data.
- the method provided by the present invention can be used by smart electronic devices to implement real-time interaction or message push functions with users based on the user's real-time audio and video data input.
- a smart refrigerator is taken as an example, and the method is described in combination with a pre-trained deep learning model.
- the smart refrigerator Based on the user's audio and video input, the smart refrigerator classifies the corresponding text content generated by the user's audio and video data, and calculates the text content classification result information to be output based on the classification result information.
- the historical audio and video data stored in the internal memory of the smart refrigerator can be read.
- the historical audio and video data stored in the external storage device configured by the smart refrigerator can also be read.
- the external storage device is a mobile storage device such as a U disk, SD card, etc. By setting an external storage device, the storage space of the smart refrigerator can be further expanded.
- the historical audio and video data stored in a client terminal such as a mobile phone, a tablet computer, or an application software server can be obtained.
- Implementing multi-channel historical audio and video data acquisition channels can greatly increase the amount of historical audio and video data, thereby improving the accuracy of subsequent voice recognition and video image recognition.
- In steps S41 and S42, considering that the sentences recognizable from the video image features of a person's lip region are relatively complex (differing sentence lengths, pause positions, word compositions, and correlations between image features), video processing operations such as cropping and framing can be performed on the valid video data to obtain the lip-region video image, which is then cropped and segmented into multiple consecutive lip picture frames.
- The multiple consecutive lip picture frames are input into the 3D convolutional neural network model; adding information from the time dimension allows more expressive features to be extracted.
- The 3D convolutional neural network model can capture the correlation between multiple pictures: it takes multiple consecutive frames as input and, through the added dimension, captures the motion information in the input frames, thereby obtaining better image features.
- In step S7, after the classification result information is obtained through the above steps, it can be converted into speech and broadcast through the smart refrigerator's built-in sound playback device, converted into text and shown directly on the refrigerator's display device, or converted into an image and shown directly on the refrigerator's large screen.
- The result information can also be transmitted to the client terminal and output there by voice communication.
- The present invention provides a method for classifying text generated from audio and video based on multimodal deep learning: it acquires real-time and historical audio/video data through multiple channels, processes the data, and converts the voice data and video data into corresponding speech text data and image text data.
- The context information of the generated audio/video text is combined with the multi-channel, multi-size deep convolutional neural network model and the multi-channel, multi-size temporal deep convolutional neural network model to fully extract text semantic features, obtain the classification results of the generated text, and output those results through multiple channels.
- The method not only significantly improves the accuracy of generated-text classification but also makes interaction between users and smart refrigerators more convenient and diversified, greatly improving the user experience.
Abstract
The present invention discloses a text classification method based on multimodal deep learning, comprising: acquiring context information of text data and weight information of text semantic features; combining the context information and the weight information of the text semantic features through a fully connected layer; and outputting the combined information to a classifier to calculate scores and obtain classification result information. The method effectively improves the accuracy and generalization ability of classifying text generated from audio and video, thereby improving the user experience.
Description
The present invention relates to the field of computer technology, and in particular to a text classification method, device, and storage medium based on multimodal deep learning.
As multimodal deep learning technology has moved into practical applications, interaction between smart refrigerators and users still relies mostly on voice and text data. Interaction based on video data remains rare, and traditional approaches to refrigerator voice and video suffer from a common problem: inaccurate and insufficient feature extraction, which leads to low speech recognition precision and low accuracy in classifying text derived from video content. This degrades the audio/video experience of the refrigerator and even limits the intelligence and informatization level of high-end refrigerators.
How to build a classification model for text generated from refrigerator audio and video with the help of a multi-channel, multi-size deep convolutional neural network has therefore become the key to improving text classification accuracy. Smart refrigerator interaction is inseparable from multi-source heterogeneous data such as voice, text, and video, yet the industry has not proposed an effective solution for extracting optimal feature information from such data on a multimodal or cross-modal basis, optimizing the accuracy of classifying text generated from refrigerator audio and video, and thereby improving the user experience.
The purpose of the present invention is to provide a text classification method, device, and storage medium based on multimodal deep learning.
The present invention provides a method for classifying generated text based on multimodal deep learning, comprising the steps of:
acquiring real-time audio/video data and historical audio/video data; preprocessing the real-time and historical audio/video data to obtain valid voice data and video data; transcribing the valid voice data into speech text data; acquiring a video image of a local region in the valid video data and transcribing the video image into image text data; acquiring, from the speech text data and the image text data, context information of the text data and weight information of the text semantic features; combining the context information and the weight information of the text semantic features through a fully connected layer, outputting the result to a classifier to calculate scores and obtain classification result information, and determining the category of the text generated from the audio/video data; and outputting the category information of the generated text.
As a further improvement of the present invention, "preprocessing the real-time and historical audio/video data to obtain valid voice data and video data" specifically includes: performing data cleaning, format parsing, format conversion, and data storage on the real-time and historical audio/video data to obtain valid audio/video data; separating voice from video in the valid audio/video data using a script or third-party tool to obtain the voice data and video data; and preprocessing the voice data and video data, including framing and windowing the voice data, and cropping and framing the video data.
As a further improvement of the present invention, "transcribing the valid voice data into speech text data" specifically includes: extracting features from the valid voice data to obtain speech features; inputting the speech features into a speech recognition multi-channel, multi-size deep convolutional neural network model to transcribe them into first speech text data; outputting the alignment between the speech features and the first speech text data based on connectionist temporal classification to obtain second speech text data; obtaining, based on an attention mechanism, the key features of the second speech text data or the weight information of those key features; and combining the second speech text data with its key features or their weight information through a fully connected layer, then computing scores with a classification function to obtain the speech text data.
As a further improvement of the present invention, "extracting features from the valid voice data" specifically includes: extracting the features of the valid voice data to obtain its Mel-frequency cepstral coefficient (MFCC) features.
As a further improvement of the present invention, "acquiring a video image of a local region in the video data and transcribing the video image into image text data" specifically includes: obtaining a video image of the lip region from the valid video data; inputting the lip-region video image into a 3D convolutional neural network model to compute image features; based on an image lip-reading method, inputting the image features into a multi-channel, multi-size temporal deep convolutional neural network model for transcription to obtain first image text data; outputting the alignment between the image feature sequence and the first image text data based on connectionist temporal classification to obtain second image text data; and combining the second image text data through a fully connected layer, then computing scores with a classification function to obtain the image text data.
As a further improvement of the present invention, "inputting the lip-region video image into a 3D convolutional neural network model to compute image features" specifically includes: segmenting the local lip video data into consecutive lip picture frames; and inputting the consecutive lip picture frames into the 3D convolutional neural network model to extract multiple features and obtain the image features.
As a further improvement of the present invention, "based on an image lip-reading method, inputting the image features into the multi-channel, multi-size temporal deep convolutional neural network model for transcription to obtain first image text data" specifically includes: inputting the image features into the multi-channel, multi-size temporal deep convolutional neural network to compute time-series image features; mapping the time-series image features to a pinyin sequence of a pinyin sentence according to the image lip-reading method; and translating the pinyin sequence into the Chinese character sequence of the corresponding Chinese sentence.
As a further improvement of the present invention, "acquiring, from the speech text data and the image text data, context information of the text data and weight information of the text semantic features" specifically includes: converting the speech text data and image text data into speech-text word vectors and image-text word vectors; and inputting the speech-text and image-text word vectors into a bidirectional long short-term memory (BiLSTM) network model to obtain context feature vectors containing the feature information of the speech text data and image text data.
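By way of illustration only, the BiLSTM context encoding described above might be sketched in Python as follows; PyTorch and all dimensions are assumptions of the sketch, since the disclosure names neither a framework nor layer sizes.

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Bidirectional LSTM over word vectors, returning context feature
    vectors that see both left and right context for every word."""
    def __init__(self, embed_dim=128, hidden_dim=64):
        super().__init__()
        self.bilstm = nn.LSTM(embed_dim, hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, word_vectors):             # (batch, seq_len, embed_dim)
        context, _ = self.bilstm(word_vectors)   # (batch, seq_len, 2*hidden_dim)
        return context

# Hypothetical usage: speech-text and image-text word vectors pass through
# the same encoder to yield their respective context feature vectors.
encoder = ContextEncoder()
speech_ctx = encoder(torch.randn(2, 20, 128))
image_ctx = encoder(torch.randn(2, 20, 128))
```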
As a further improvement of the present invention, based on an attention mechanism model, the self-weight information and/or associated-weight information of the words and phrases in the text features of the speech text data and image text data are distinguished to obtain the weight information of the text semantic features.
As a further improvement of the present invention, "based on the attention mechanism model, distinguishing the self-weight information and/or associated-weight information of the words and phrases in the text features of the speech text data and the image text data" specifically includes: inputting the speech-text context feature vector and the image-text context feature vector into a self-attention mechanism and a mutual-attention mechanism, respectively; obtaining a self-weight text attention feature vector containing the self-weight information of the speech-text and image-text semantic features; and obtaining an associated-weight text attention feature vector containing the associated-weight information of the speech-text and image-text semantic features.
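A minimal sketch of the self-attention and mutual-attention step above, again assuming PyTorch and an assumed feature dimension; nn.MultiheadAttention stands in for whatever attention formulation the disclosure actually uses:

```python
import torch
import torch.nn as nn

d = 128  # assumed dimension of the context feature vectors
self_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
cross_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)

speech_ctx = torch.randn(2, 20, d)   # speech-text context feature vectors
image_ctx = torch.randn(2, 20, d)    # image-text context feature vectors

# Self-attention: each modality weights its own words and phrases,
# yielding the self-weight text attention feature vector.
speech_self, _ = self_attn(speech_ctx, speech_ctx, speech_ctx)

# Mutual attention: speech-text features query the image-text features,
# yielding the associated-weight text attention feature vector.
speech_cross, _ = cross_attn(speech_ctx, image_ctx, image_ctx)
```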
As a further improvement of the present invention, "combining the context information and weight information through a fully connected layer, outputting the result to a classifier to calculate scores and obtain classification result information, and determining the category of the text generated from the audio/video data" specifically includes: combining the context feature vector and the text attention weight feature vectors through a fully connected layer, outputting the result to a classification function, and computing the semantic scores of the speech text data and image text data together with their normalized score results to obtain the category information of the generated text.
As a further improvement of the present invention, "transcribing the voice data into speech text data" further includes: obtaining configuration data stored in an external cache, and, based on that configuration data, executing the multi-channel, multi-size deep convolutional neural network computation on the voice data to transcribe the text and extract text features.
The present invention also provides an electrical appliance, comprising: a memory for storing executable instructions; and a processor which, when running the executable instructions stored in the memory, implements the above text classification method based on multimodal deep learning.
The present invention also provides a refrigerator, comprising: a memory for storing executable instructions; and a processor which, when running the executable instructions stored in the memory, implements the above text classification method based on multimodal deep learning.
The present invention also provides a computer-readable storage medium storing executable instructions which, when executed by a processor, implement the above text classification method based on multimodal deep learning.
The beneficial effects of the present invention are as follows. The method completes the recognition and classification of text generated from the acquired audio and video. First, by combining speech text data with text recognized from video images of the speaker's lip region, the method obtains complementary, correlated, and mutually reinforcing text semantic features, making the classification of audio/video-generated text precise. Second, by jointly considering real-time and historical audio/video data and using the historical data as a supplement, it compensates for the limited semantic information in speech text and video text, effectively improving classification accuracy. Finally, constructing a model that fuses a multi-channel, multi-size deep convolutional neural network with a temporal deep convolutional neural network improves the precision of classifying text generated from real-time audio and video; in particular, a convolutional network integrating a context information mechanism, a self-attention mechanism, and a mutual-attention mechanism mines the rich semantic features of the text data more fully. The overall model thus makes full use of real-time and historical audio/video data as well as context data, has excellent semantic representation capability, classifies audio/video-generated text with high accuracy and generalization ability, and improves the user experience.
FIG. 1 is a structural block diagram of the model involved in the text classification method based on multimodal deep learning in one embodiment of the present invention.
FIG. 2 is a schematic diagram of the steps of the text classification method based on multimodal deep learning in one embodiment of the present invention.
FIG. 3 is a schematic diagram of the step of acquiring real-time and historical audio/video data in one embodiment of the present invention.
FIG. 4 is a schematic diagram of the step of preprocessing the real-time and historical audio/video data in one embodiment of the present invention.
FIG. 5 is a schematic diagram of the step of transcribing the valid voice data into speech text data in one embodiment of the present invention.
FIG. 6 is a schematic diagram of the step of acquiring a local-region video image from the video data and transcribing it into image text data in one embodiment of the present invention.
FIG. 7 is a schematic diagram of the step of acquiring context information and weight information of the text data from the speech text data and image text data in one embodiment of the present invention.
The present invention is described in detail below with reference to the specific embodiments shown in the accompanying drawings. These embodiments do not limit the present invention; structural, methodological, or functional changes made by those of ordinary skill in the art based on these embodiments all fall within the scope of protection of the present invention.
It should be noted that the term "comprise" or any variant thereof is intended to cover non-exclusive inclusion, so that a process, method, article, or device comprising a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. In addition, the terms "first", "second", and the like are used for descriptive purposes only and shall not be understood as indicating or implying relative importance.
An embodiment of the present invention is a text classification method based on multimodal deep learning. Although the present application presents method operation steps as in the following embodiments or in the flowchart, for steps that have no necessary logical causal relationship, their execution order is not limited, as a matter of routine and without creative effort, to the order given in the embodiments of the present application.
FIG. 1 shows a structural block diagram of the model involved in the text classification method based on multimodal deep learning provided by the present invention; FIG. 2 shows a schematic diagram of the steps of the method, which include:
S1: Acquire real-time audio/video data and historical audio/video data.
S2: Preprocess the real-time and historical audio/video data to obtain valid voice data and video data.
S3: Transcribe the valid voice data into speech text data.
S4: Acquire a video image of a local region in the valid video data, and transcribe the video image into image text data.
S5: From the speech text data and image text data, acquire context information of the text data and weight information of the text semantic features.
S6: Combine the context information and weight information through a fully connected layer, output the result to a classifier to calculate scores and obtain classification result information, and determine the category of the text generated from the audio/video data.
S7: Output the category information of the generated text.
The method provided by the present invention allows a smart electronic device to implement functions such as real-time interaction with the user or message pushing based on the user's real-time audio/video input. Illustratively, this embodiment takes a smart refrigerator as an example and describes the method in combination with a pre-trained deep learning model. Based on the user's audio/video input, the smart refrigerator classifies the text content generated from the user's audio/video data and, from the classification results, computes the classification result information to be output.
As shown in FIG. 3, step S1 specifically includes:
S11: Acquire the real-time audio/video data captured by a collection device, and/or
acquire the real-time audio/video data transmitted from a client terminal.
S12: Acquire internally stored historical audio/video data, and/or
acquire externally stored historical audio/video data, and/or
acquire historical audio/video data transmitted from a client terminal.
The real-time audio/video data here includes real-time voice data and real-time video data. Real-time voice refers to interrogative or imperative utterances the user currently addresses to the smart electronic device or to a client terminal in communication with it; it may likewise be voice information captured by a voice collection device. In this embodiment, for example, the user may ask questions such as "What vegetables are in the refrigerator today?" or "What beef ingredients are in the refrigerator today?", or issue commands such as "Delete all ingredients". The real-time video data consists of video images captured in real time by the smart electronic device or a connected client terminal; in this embodiment, a camera built into the smart refrigerator captures the user's facial image, from which a lip-region feature image is extracted to recognize the corresponding text content, for example the image text "What vegetables are in the refrigerator today?".
The historical audio/video data here refers to the user's real-time audio/video data from previous use; it may further include historical audio/video data entered by the user. Specifically, in this embodiment, the historical data may include audio/video of instructions issued or questions asked by the user in the past, which contains information related to the current real-time data, as well as explanatory audio/video the user produced about items placed in the refrigerator, such as "There is no milk left in the refrigerator". The acquired historical data can serve as part of the dataset for pre-training and for the prediction model, effectively supplementing the single-voice representation of the real-time data and enriching the semantic features.
As described in step S11, in this embodiment the user's real-time audio and video can be captured by audio/video collection devices such as cameras installed in the smart refrigerator; during use, when the user needs to interact with the refrigerator, speaking to it directly suffices. The real-time audio/video data can also be obtained from a client terminal connected to the refrigerator over a wireless communication protocol. The client terminal is an electronic device with information-sending capability, such as a mobile phone, tablet computer, smart camera, or smart watch, or an app- or Bluetooth-connected smart device; the user speaks to the client terminal, or the refrigerator's built-in camera records directly, and the terminal transmits the captured audio/video to the refrigerator via Wi-Fi, Bluetooth, or another wireless link. This enables multi-channel real-time acquisition: the user is not restricted to speaking directly to the refrigerator but can use any convenient channel, which significantly improves convenience. In other embodiments, one or more of the above acquisition methods may be used, or the real-time audio/video data may be obtained through other channels based on the prior art; the present invention places no specific restriction on this.
As described in step S12, in this embodiment the historical audio/video data stored in the smart refrigerator's internal memory can be read. The data can also be read from an external storage device configured for the refrigerator, such as a USB drive or SD card; an external storage device further expands the refrigerator's storage space. The historical data can likewise be obtained from a client terminal such as a mobile phone or tablet, or from an application server. Multi-channel acquisition of historical audio/video data greatly increases the amount of such data, improving the accuracy of subsequent speech recognition and video image recognition. In other embodiments, one or more of the above methods may be used, or the historical data may be obtained through other channels based on the prior art; the present invention places no specific restriction on this.
Further, in this embodiment the smart refrigerator is provided with an external cache, and at least part of the historical audio/video data is stored there. As usage time grows, the historical data accumulates; storing part of it in the external cache saves the refrigerator's internal storage space, and reading the data directly from the cache during neural network computation improves algorithm efficiency.
Specifically, this embodiment uses a Redis component as the external cache. Redis is a widely used distributed caching system with a key/value storage structure that can serve as a database, cache, and message broker. Other external caches such as Memcached may be used in other embodiments; the present invention places no specific restriction on this.
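As a sketch of how the device might read configuration data from such a cache, using the redis-py client (one common Python choice; the key name and JSON payload are invented for illustration):

```python
import redis

# Connect to the external cache; host and port are placeholders.
cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Store model configuration data under a key, then read it back before
# running the convolutional network computation.
cache.set("asr:model_config", '{"channels": 32, "kernel_size": 3}')
config = cache.get("asr:model_config")
```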
In summary, in steps S11 and S12, real-time and historical audio/video data can be acquired flexibly through multiple channels, improving the user experience while ensuring data volume and effectively raising algorithm efficiency.
As shown in FIG. 4, step S2 specifically includes the steps of:
S21: Clean the real-time and historical audio/video data to obtain valid audio/video data.
S22: Separate the valid audio/video data into voice and video to obtain the voice data and video data.
S23: Preprocess the voice data and video data, including framing and windowing the voice data, and cropping and framing the video data.
In step S21, cleaning the real-time and historical audio/video data specifically includes:
acquiring a certain number of real-time and historical audio/video datasets, which can, for example, be imported into a data cleaning model as files. To prevent import failures, data that does not satisfy the import format is first parsed and format-converted; irrelevant and duplicate data are then removed from the datasets, and outliers and missing values are handled, so that information irrelevant to classification is screened out. The cleaned data are output and saved in a specified format, yielding valid audio/video data.
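A minimal sketch of such a cleaning pass, assuming the collected files are indexed by a hypothetical metadata table (all column names and paths are invented for illustration):

```python
import pandas as pd

records = pd.read_csv("av_records.csv")  # hypothetical metadata of the A/V files

records = records.drop_duplicates()                    # remove duplicate entries
records = records.dropna(subset=["file_path"])         # handle missing values
records = records[records["duration_s"] > 0]           # discard abnormal values
records = records[records["format"].isin(["wav", "mp4"])]  # keep parsable formats

# Save the cleaned records in the specified output format.
records.to_csv("av_records_clean.csv", index=False)
```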
In step S22, a script or a third-party audio/video separation tool is used to separate voice and video from the valid audio/video data, yielding the voice data and video data.
In an embodiment of the present invention, the separation script can be written in the Python language, or a third-party separation tool can be used, to split the input audio/video data and obtain the classified voice and video data.
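One way such a Python separation script might look, delegating the actual demuxing to the ffmpeg command-line tool (an assumed third-party dependency; filenames are placeholders):

```python
import subprocess

def split_audio_video(src, audio_out="speech.wav", video_out="video_only.mp4"):
    """Separate one A/V file into a speech track and a silent video track."""
    # -vn drops the video stream, keeping the audio as 16-bit PCM WAV.
    subprocess.run(["ffmpeg", "-y", "-i", src, "-vn",
                    "-acodec", "pcm_s16le", audio_out], check=True)
    # -an drops the audio stream, copying the video stream unchanged.
    subprocess.run(["ffmpeg", "-y", "-i", src, "-an",
                    "-c:v", "copy", video_out], check=True)

split_audio_video("user_query.mp4")  # placeholder input file
```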
In step S23, the classified speech is segmented by a specified time span or number of samples, completing the framing of the speech to obtain speech signal data; a window function is then applied so that the originally noisy speech signal exhibits strengthened, periodic characteristics, completing the windowing and easing the subsequent extraction of speech feature parameters. Illustratively, step S23 also includes cropping the valid video data into multiple frames. Specifically, a script first loads the video data and reads the video information, then decodes the video accordingly and determines how many pictures the video shows per second, obtaining single-frame image information, including each frame's width and height, and finally saves the video as multiple pictures. After step S23, valid speech data and image data are thus obtained. Other video framing methods, such as third-party video cropping tools, may be used in other embodiments; the present invention places no specific restriction on this.
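The two preprocessing operations of step S23 might be sketched as follows; the 25 ms / 10 ms framing at 16 kHz, the Hamming window, and the use of OpenCV are assumptions, since the embodiment fixes none of them:

```python
import numpy as np
import cv2  # OpenCV, one third-party option for decoding video into frames

def frame_and_window(signal, frame_len=400, hop=160):
    """Split a 1-D speech signal (assumed at least frame_len samples long)
    into overlapping frames and apply a Hamming window to each frame."""
    n = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack([signal[i * hop: i * hop + frame_len] for i in range(n)])
    return frames * np.hamming(frame_len)

def video_to_frames(path):
    """Decode a video file into a list of single-frame images."""
    cap = cv2.VideoCapture(path)
    frames = []
    ok, img = cap.read()
    while ok:
        frames.append(img)
        ok, img = cap.read()
    cap.release()
    return frames
```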
As shown in FIG. 5, step S3 specifically includes:
S31: Extract features from the valid voice data to obtain speech features.
S32: Input the speech features into the speech recognition multi-channel, multi-size deep convolutional neural network model and transcribe them into first speech text data.
S33: Output the alignment between the speech features and the first speech text data based on connectionist temporal classification to obtain second speech text data.
S34: Based on an attention mechanism, obtain the key features of the second speech text data or the weight information of those key features.
S35: Combine the second speech text data with its key features or their weight information through a fully connected layer, then compute scores with a classification function to obtain the speech text data.
In step S31, extracting the valid voice data features specifically includes:
extracting the speech data features to obtain their Mel-frequency cepstral coefficients (MFCC). MFCCs are a discriminative component of a speech signal: cepstral parameters extracted in the Mel-scale frequency domain, where the Mel scale describes the nonlinear frequency response of the human ear. Because MFCC parameters account for the ear's sensitivity to different frequencies, they are particularly suited to speech recognition and speaker identification.
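A minimal sketch of this extraction step using librosa (one common Python choice; the disclosure names no library, and 13 coefficients is an assumption):

```python
import librosa

# Load the valid speech segment (placeholder file, resampled to 16 kHz)
# and compute 13 Mel-frequency cepstral coefficients per frame.
y, sr = librosa.load("speech.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape: (13, n_frames)
```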
In embodiments of the present invention, feature parameters such as perceptual linear predictive (PLP) features or linear predictive coding (LPC) features of the speech data may be obtained through different algorithmic steps in place of MFCC features; the choice can be adjusted to the actual application scenario and the model parameters adopted, and the present invention places no specific restriction on this.
The specific algorithmic steps involved above follow the current prior art in this field and are not described in detail here.
In step S32, the text content of the valid voice data is transcribed through a network model from automatic speech recognition technology, yielding the first speech text data.
在本实施方式中,通过增加网络的宽度途径构建多通道多尺寸深度卷积神经网络模型实现语音转文本的任务,该深度网络模型是由多层深度卷积网络模型构成,深度卷积神经网络模型一般是由若干卷积层加若干全连接层组成,中间包含各种的非线性操作、池化操作,主要用于处理网格结构的数据,因此该模型可以利用滤波器将相邻像素之间的轮廓过滤出来。另外,该模型它是先提出语音特征值,然后再对特征值进行计算而不是对原始语音数据值进行计算。因此,相比于传统的循环神经网络来说,深度卷积神经网络模型具有计算量小、容易刻画局部特征的优势,而且共享权重以及池化层可以赋予该模型更好的时域或频域的不变性,另外更深层的非线性结构也可以让该模型具备强大的表征能力。另外,多通道多尺寸可以从不同的视角去提取语音特征,获取更多的语音特征信息,具有更好的语音识别精度。In this embodiment, a multi-channel multi-size deep convolutional neural network model is constructed by increasing the width of the network to achieve the task of voice-to-text conversion. The deep network model is composed of a multi-layer deep convolutional network model. The deep convolutional neural network model is generally composed of several convolutional layers plus several fully connected layers, and various nonlinear operations and pooling operations are included in the middle. It is mainly used to process grid structured data, so the model can use filters to filter out the contours between adjacent pixels. In addition, the model first proposes speech feature values, and then calculates the feature values instead of calculating the original speech data values. Therefore, compared with the traditional recurrent neural network, the deep convolutional neural network model has the advantages of small computational complexity and easy to characterize local features, and shared weights and pooling layers can give the model better invariance in the time domain or frequency domain. In addition, the deeper nonlinear structure can also give the model a powerful representation ability. In addition, multi-channel and multi-size can extract speech features from different perspectives, obtain more speech feature information, and have better speech recognition accuracy.
Specifically, in this embodiment, the multi-channel multi-size deep convolutional neural network used in step S32 is built from 3*3 convolutional layers with 32 channels and one max-pooling layer.
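As an illustrative sketch only (the patent does not prescribe a framework), such a block can be expressed in PyTorch as follows. The 3*3 kernel and 32-channel width follow the embodiment above, while the 5*5 branch and the input dimensions are assumptions added to show the multi-size idea:

```python
import torch
import torch.nn as nn

class MultiSizeConvBlock(nn.Module):
    """Parallel conv branches of different kernel sizes over MFCC feature maps."""
    def __init__(self, in_channels: int = 1, channels: int = 32):
        super().__init__()
        # Branches with different kernel sizes view the features at different
        # scales; 3x3 matches the embodiment, the 5x5 branch is an assumption.
        self.branch3 = nn.Conv2d(in_channels, channels, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(in_channels, channels, kernel_size=5, padding=2)
        self.pool = nn.MaxPool2d(kernel_size=2)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Concatenate the branch outputs along the channel dimension.
        out = torch.cat([self.act(self.branch3(x)),
                         self.act(self.branch5(x))], dim=1)
        return self.pool(out)

# Example: a batch of 8 single-channel MFCC "images" (13 coefficients x 100 frames).
features = torch.randn(8, 1, 13, 100)
print(MultiSizeConvBlock()(features).shape)  # torch.Size([8, 64, 6, 50])
```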
In step S33, the connectionist temporal classification (CTC) method is used to obtain the alignment between the input speech feature sequence and the output speech text feature sequence.
In this embodiment, it is difficult to construct a precise mapping between the valid voice data and the characters of the first voice text data, which increases the difficulty of subsequent speech recognition. To solve this, the temporal classification method is adopted. It is generally applied after the convolutional network model and provides fully end-to-end acoustic model training: no prior alignment of the data is needed, only an input sequence and an output sequence, with no frame-level alignment or one-to-one labeling, and it directly outputs the probability of the predicted sequence. From this predicted probability, the most likely text output is taken as the second voice text data.
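The following minimal sketch shows how the CTC criterion available in PyTorch trains such an alignment-free model; the vocabulary size, sequence lengths, and blank index are assumptions:

```python
import torch
import torch.nn as nn

vocab_size = 30          # assumed: output characters plus the CTC blank (index 0)
T, N = 50, 8             # input time steps and batch size (placeholders)

ctc = nn.CTCLoss(blank=0)

# Log-probabilities from the CNN acoustic model: (time, batch, classes).
log_probs = torch.randn(T, N, vocab_size, requires_grad=True).log_softmax(dim=2)

# Unaligned target transcripts, concatenated; no frame-level labels needed.
target_lengths = torch.randint(5, 20, (N,))
targets = torch.randint(1, vocab_size, (int(target_lengths.sum()),))
input_lengths = torch.full((N,), T, dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # end-to-end training signal without pre-alignment
```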
Further, in step S34, the attention mechanism guides the deep convolutional neural network to focus on the more critical feature information and suppress non-critical feature information. By introducing the attention mechanism, the local key features or weight information of the second voice text data can therefore be obtained, which further reduces irregular misalignment of sequences during model training.
Here, in step S35, based on the second voice text data and its key features or their weights, a model fusing the self-attention mechanism with fully connected layers assigns the second voice text data its own weight information. This better captures the internal weights of the text's semantic features and emphasizes the relative importance of different parts of that semantic information; finally, a classification function computes the scores that yield the voice text data.
As shown in FIG. 6, step S4 specifically includes:
S41: acquiring a video image of the lip area from the video data.
S42: inputting the video image of the lip area into a 3D convolutional neural network model for computation to obtain image features.
S43: based on the image lip-reading recognition method, inputting the image features into a multi-channel multi-size temporal deep convolutional neural network for transcription to obtain first image text data.
S44: outputting the alignment between the image features and the first image text data based on the connectionist temporal classification method to obtain second image text data.
S45: combining the second image text data through fully connected layers, then computing scores with a classification function to obtain the image text data.
In steps S41 and S42, the sentences recognizable from lip-area video features can be complex: sentences differ in length, in pause positions, and in word composition, and successive image features are correlated. Accordingly, video processing operations such as cropping and framing are applied to the valid video data to obtain the lip-area video, which is then cropped and segmented into multiple consecutive lip picture frames. In this embodiment, these consecutive frames are fed into a 3D convolutional neural network model. By adding information along the time dimension, more expressive features can be extracted: the 3D model takes multiple consecutive frames as input, handles the correlations between them, and captures the motion information in the input frames through the extra dimension, thereby obtaining better image features.
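A minimal sketch of such a 3D convolution over consecutive lip frames, assuming a 16-frame clip and a 64*64 crop (both placeholder sizes), might look like this in PyTorch:

```python
import torch
import torch.nn as nn

# Assumed input: a batch of 4 clips, 3 color channels,
# 16 consecutive lip frames, each cropped to 64x64 pixels.
clips = torch.randn(4, 3, 16, 64, 64)

lip_encoder = nn.Sequential(
    # The depth dimension (16 frames) lets the kernel span time,
    # capturing lip motion as well as spatial shape.
    nn.Conv3d(in_channels=3, out_channels=32, kernel_size=(3, 3, 3), padding=1),
    nn.ReLU(),
    nn.MaxPool3d(kernel_size=(1, 2, 2)),  # pool space, keep temporal resolution
)

features = lip_encoder(clips)
print(features.shape)  # torch.Size([4, 32, 16, 32, 32])
```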
In step S43, the output of the 3D convolutional neural network model from step S42 is fed into the multi-channel multi-size temporal deep convolutional neural network model. After multi-channel, multi-kernel convolution, it outputs as many feature maps as there are convolution kernels: for example, a convolutional layer with a 3-channel input and 2 kernels outputs 2 feature maps. For sentence-level lip reading from video images, this embodiment implements a Chinese lip-reading method in two stages: pinyin sequence recognition (LipPic to Pinyin, P2P) and Chinese character sequence recognition (Pinyin to Chinese-Character, P2CC). Specifically, the temporal image features produced by the network are mapped to the pinyin sequence of a pinyin sentence, the pinyin sequence is then translated into the Chinese character sequence of a Chinese sentence, and the first image text data is finally obtained. Of course, other Chinese lip-reading methods are not specifically restricted; any method that converts the video images into the corresponding text data falls within the scope of protection of the present invention.
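The feature-map count mentioned in the example above (3 input channels, 2 kernels, hence 2 output feature maps) can be checked with a toy sketch; the spatial dimensions are placeholders:

```python
import torch
import torch.nn as nn

# A convolutional layer with 3 input channels and 2 kernels, as in the
# example above: the output has exactly 2 feature maps.
conv = nn.Conv2d(in_channels=3, out_channels=2, kernel_size=3, padding=1)

x = torch.randn(1, 3, 16, 64)   # one 3-channel feature tensor
print(conv(x).shape)            # torch.Size([1, 2, 16, 64]) -> 2 feature maps
```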
Steps S44 and S45 proceed in the same way as the voice data processing above: the continuous temporal classification method establishes the mapping between the valid video data and the characters of the first image text data, yielding the second image text data. A model fusing the self-attention mechanism with fully connected layers then assigns the second image text data its own weight information and/or associated weight information, better capturing the internal and/or associated weights of the text's semantic features and emphasizing the relative importance of its different parts; finally, a classification function computes the scores that yield the image text data. The detailed processing is the same as the voice data processing steps above and is not repeated here.
As shown in FIG. 7, step S5 specifically includes:
S51: converting the voice text data and the image text data into voice text word vectors and image text word vectors.
S52: inputting the voice text word vectors and the image text word vectors into a bidirectional long short-term memory network model to obtain context feature vectors containing the feature information of the voice text data and the image text data.
S53: based on an attention mechanism model, distinguishing the self-weight information and/or associated weight information of the words and/or phrases among the text features of the voice text data and the image text data to obtain the weight information of the text semantic features.
In step S51, to convert the text data into a vectorized form a computer can recognize and process, the voice text data and image text data can be converted into the voice text word vectors and image text word vectors by the Word2Vec algorithm. The word vectors may equally be obtained by other existing algorithms in this field, such as GloVe, and the present invention places no specific restriction on this.
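As an illustrative sketch, such word vectors can be trained with the gensim implementation of Word2Vec; the tokenized sentences and the skip-gram hyperparameters below are placeholder assumptions:

```python
from gensim.models import Word2Vec

# Placeholder tokenized transcripts from the speech and lip-reading branches.
sentences = [
    ["put", "the", "beef", "in", "the", "freezer"],
    ["how", "long", "do", "vegetables", "keep"],
]

# Train small skip-gram embeddings (sizes are illustrative, not prescribed).
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

vector = model.wv["vegetables"]   # 100-dimensional word vector
print(vector.shape)               # (100,)
```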
In step S52, the bidirectional long short-term memory network (BiLSTM) combines a forward long short-term memory network (LSTM) with a backward one. An LSTM captures long-distance semantic dependencies in text, and building on it, a BiLSTM captures bidirectional text semantics. The voice text word vectors and image text word vectors are fed into the BiLSTM model; the forward and backward LSTMs each produce a result vector only after all time steps have been computed, and the two result vectors are then concatenated to output the context feature vector carrying contextual information.
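A minimal PyTorch sketch of this bidirectional encoding, with assumed embedding and hidden sizes, is shown below; the concatenated forward and backward states form the context feature vectors:

```python
import torch
import torch.nn as nn

embed_dim, hidden_dim = 100, 128   # assumed sizes

bilstm = nn.LSTM(input_size=embed_dim, hidden_size=hidden_dim,
                 batch_first=True, bidirectional=True)

# Batch of 8 sentences, 20 word vectors each (e.g. from Word2Vec above).
word_vectors = torch.randn(8, 20, embed_dim)

# `output` concatenates the forward and backward hidden states per time step.
output, (h_n, c_n) = bilstm(word_vectors)
print(output.shape)  # torch.Size([8, 20, 256]) -> context feature vectors
```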
In embodiments of the present invention, neural network models of other structures may also be built to transcribe the voice data and video data into the voice text data and video text data; the specific method is not restricted.
In step S53, to distinguish the self-weights of different words or phrases within the voice text data and image text data, as well as the associated weights between the different text data, the voice text context feature vector and the image text context feature vector are fed into a self-attention mechanism and a mutual attention mechanism, respectively. This yields a self-weight feature vector containing the self-weight information of the voice text and image text semantic features, and an associated-weight feature vector containing their associated weight information. The context of the text transcribed from audio and video is thereby fully exploited, the limitation of any single feature in the speech or video data is compensated, the semantic representation of the text data is enriched, and the subsequent text classification is improved.
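Under assumed dimensions, the self-attention and mutual-attention step might be sketched with PyTorch's multi-head attention module (a single head for simplicity; the patent does not prescribe this particular module):

```python
import torch
import torch.nn as nn

dim = 256  # matches the BiLSTM context vectors above (assumed)

self_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=1, batch_first=True)
cross_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=1, batch_first=True)

speech_ctx = torch.randn(8, 20, dim)  # voice text context vectors
image_ctx = torch.randn(8, 20, dim)   # image text context vectors

# Self-attention: each modality weighs its own words (self-weights).
speech_self, _ = self_attn(speech_ctx, speech_ctx, speech_ctx)

# Mutual attention: speech queries attend over image text (associated weights).
speech_cross, _ = cross_attn(speech_ctx, image_ctx, image_ctx)

print(speech_self.shape, speech_cross.shape)  # both torch.Size([8, 20, 256])
```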
Step S6 specifically includes:
combining the context feature vectors and the weighted text attention feature vectors (including the self-weight text attention feature vectors and the associated-weight text attention feature vectors) through fully connected layers and outputting them to the classification function, which computes the text-semantic scores of the voice text data and the image text data together with their normalized results, thereby obtaining the classification result information.
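A sketch of this final fusion and scoring step, assuming pooled feature vectors and a placeholder class count, might read:

```python
import torch
import torch.nn as nn

dim, num_classes = 256, 10   # feature size and class count are assumptions

classifier = nn.Sequential(
    # The three fused vectors are concatenated before the fully connected layer.
    nn.Linear(3 * dim, dim),
    nn.ReLU(),
    nn.Linear(dim, num_classes),
)

context = torch.randn(8, dim)   # pooled context feature vector
self_w = torch.randn(8, dim)    # self-weight attention feature vector
assoc_w = torch.randn(8, dim)   # associated-weight attention feature vector

logits = classifier(torch.cat([context, self_w, assoc_w], dim=1))
probs = torch.softmax(logits, dim=1)   # normalized scores per category
print(probs.argmax(dim=1))             # predicted text category per sample
```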
In summary, carrying out the above steps in sequence yields the classification method for text generated from audio and video provided by the present invention. Real-time and historical audio-video data are acquired and cleaned, speech and video are separated to produce valid voice data and video data respectively, and both are used as part of the dataset for the pre-trained and predictive models, so that text semantic features are captured more comprehensively. Further, by building a multi-channel, multi-size deep convolutional network model that fuses the connectionist temporal classification method with the attention mechanism, together with a sentence-level video lip-reading method based on the temporal deep convolutional neural network model, richer high-level semantic feature information is mined. Finally, by combining a context-information mechanism over the voice text data and video text data with self-attention and mutual-attention mechanisms, the semantic representation capability is exploited more fully, the limitation of any single feature in the speech or video data is compensated, and the accuracy of classifying text generated from audio and video is improved. In addition, performing the computation with configuration data fetched from external storage improves the model's computational efficiency. The overall model structure represents text semantics well, exhibits good complementarity and correlation in its semantic features, and raises the classification accuracy for text generated from audio and video.
Step S7 specifically includes:
converting the category information of the generated text into speech for output, and/or
converting the category information of the generated text into speech and transmitting it to a client terminal for output, and/or
converting the category information of the generated text into text for output, and/or
converting the category information of the generated text into text and transmitting it to a client terminal for output, and/or
converting the category information of the generated text into an image for output, and/or
converting the category information of the generated text into an image and transmitting it to a client terminal for output.
As described in step S7, in this embodiment, once the classification result information has been obtained through the above steps, it can be converted into speech and broadcast through the smart refrigerator's built-in audio playback device; converted into text and displayed directly on the display device of the smart refrigerator; or converted into an image and shown directly on the refrigerator's large screen. The result information can also be transmitted by voice communication to a client terminal for output, where the client terminal is an electronic device capable of receiving information: the speech can be sent to a mobile phone, smart speaker, or Bluetooth headset for playback, and the classification result can be delivered as text or images by SMS, email, or similar channels to client terminals such as phones and tablets, or to applications installed on them, for the user to review. Classification results can thus be output through multiple channels and in multiple forms, so the user is not limited to obtaining the information near the smart refrigerator. Combined with the multi-channel real-time speech acquisition provided by the present invention, the user can interact with the smart refrigerator remotely, which is highly convenient and greatly improves the user experience. In other embodiments of the present invention, only one or several of the above output modes may be adopted, or the classification results may be output through other channels based on the existing art; the present invention places no specific restriction on this.
In summary, the present invention provides a method for classifying text generated from audio and video based on multimodal deep learning. It acquires real-time and historical audio-video data through multiple channels, processes the data, and converts the voice data and video data into the corresponding voice text data and image text data. Combining the contextual information of the text generated from the audio and video, it fully extracts text semantic features through the multi-channel multi-size deep convolutional neural network model and the multi-channel multi-size temporal deep convolutional neural network model, obtains the classification result for the generated text, and outputs that result through multiple channels. The method not only significantly improves the accuracy of generated-text classification but also makes the interaction between users and the smart refrigerator more convenient and diversified, greatly improving the user experience.
Based on the same inventive concept, the present invention also provides an electrical appliance, comprising:
a memory for storing executable instructions; and
a processor which, when running the executable instructions stored in the memory, implements the above text classification method based on multimodal deep learning.
Based on the same inventive concept, the present invention also provides a refrigerator, comprising:
a memory for storing executable instructions; and
a processor which, when running the executable instructions stored in the memory, implements the above text classification method based on multimodal deep learning.
Based on the same inventive concept, the present invention also provides a computer-readable storage medium storing executable instructions which, when executed by a processor, implement the above text classification method based on multimodal deep learning.
It should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. The specification is written this way merely for clarity; those skilled in the art should treat the specification as a whole, and the technical solutions of the embodiments may be combined as appropriate to form other embodiments that those skilled in the art can understand.
The detailed descriptions set out above are merely specific explanations of feasible embodiments of the present invention and are not intended to limit its scope of protection; all equivalent embodiments or modifications that do not depart from the technical spirit of the present invention shall fall within its scope of protection.
Claims (15)
- A text classification method based on multimodal deep learning, characterized by comprising the steps of: acquiring real-time audio-video data and historical audio-video data; preprocessing the real-time audio-video data and the historical audio-video data to obtain valid voice data and video data; transcribing the valid voice data into voice text data; acquiring video images of a local area in the valid video data and transcribing the video images into image text data; obtaining, from the voice text data and the image text data, the context information of the text data and the weight information of the text semantic features; combining the context information and the weight information through fully connected layers and outputting them to a classifier to compute scores and obtain classification result information, and determining the category information of the text generated from the audio-video data; and outputting the category information of the generated text.
- The text classification method based on multimodal deep learning according to claim 1, wherein said "preprocessing the real-time audio-video data and the historical audio-video data to obtain valid voice data and video data" specifically comprises: performing data cleaning, format parsing, format conversion, and data storage on the real-time audio-video data and the historical audio-video data to obtain valid audio-video data; separating the valid audio-video data into speech and video using a script or a third-party tool to obtain the voice data and the video data; and preprocessing the voice data and the video data, including framing and windowing the voice data, and cropping and framing the video data.
- The text classification method based on multimodal deep learning according to claim 1, wherein said "transcribing the valid voice data into voice text data" specifically comprises: extracting features of the valid voice data to obtain speech features; inputting the speech features into a multi-channel multi-size deep convolutional neural network model for speech recognition to transcribe them into first voice text data; outputting the alignment between the speech features and the first voice text data based on the connectionist temporal classification method to obtain second voice text data; obtaining, based on an attention mechanism, key features of the second voice text data or weight information of the key features; and combining the second voice text data and its key features or key-feature weights through fully connected layers, then computing scores with a classification function to obtain the voice text data.
- The text classification method based on multimodal deep learning according to claim 3, wherein said "extracting features of the valid voice data" specifically comprises: extracting the features of the valid voice data to obtain its Mel-frequency cepstral coefficient features.
- The text classification method based on multimodal deep learning according to claim 1, wherein said "acquiring video images of a local area in the valid video data and transcribing the video images into image text data" specifically comprises: acquiring a video image of the lip area from the video data; inputting the video image of the lip area into a 3D convolutional neural network model for computation to obtain image features; inputting, based on the image lip-reading recognition method, the image features into a multi-channel multi-size temporal deep convolutional neural network model for transcription to obtain first image text data; outputting the alignment between the image features and the first image text data based on the connectionist temporal classification method to obtain second image text data; and combining the second image text data through fully connected layers, then computing scores with a classification function to obtain the image text data.
- The text classification method based on multimodal deep learning according to claim 5, wherein said "inputting the video image of the lip area into a 3D convolutional neural network model for computation to obtain image features" specifically comprises: segmenting the lip-area video data into consecutive lip picture frames; and inputting the consecutive lip picture frames into the 3D convolutional neural network model for computation, extracting multiple kinds of features to obtain the image features.
- The text classification method based on multimodal deep learning according to claim 6, wherein said "inputting, based on the image lip-reading recognition method, the image features into a multi-channel multi-size temporal deep convolutional neural network model for transcription to obtain first image text data" specifically comprises: inputting the image features into the multi-channel multi-size temporal deep convolutional neural network for computation to obtain temporal image features; mapping, according to the image lip-reading recognition method, the temporal image features to the pinyin sequence of a pinyin sentence; and translating the pinyin sequence into the Chinese character sequence of the corresponding Chinese sentence.
- The text classification method based on multimodal deep learning according to claim 1, wherein said "obtaining, from the voice text data and the image text data, the context information of the text data and the weight information of the text semantic features" specifically comprises: converting the voice text data and the image text data into voice text word vectors and image text word vectors; and inputting the voice text word vectors and the image text word vectors into a bidirectional long short-term memory network model to obtain context feature vectors containing the feature information of the voice text data and the image text data.
- The text classification method based on multimodal deep learning according to claim 8, further comprising: distinguishing, based on an attention mechanism model, the self-weight information and/or associated weight information of words and phrases among the text features of the voice text data and the image text data to obtain the weight information of the text semantic features.
- The text classification method based on multimodal deep learning according to claim 9, wherein said "distinguishing, based on an attention mechanism model, the self-weight information and/or associated weight information of words and phrases among the text features of the voice text data and the image text data" specifically comprises: inputting the voice text context feature vector and the image text context feature vector into a self-attention mechanism and a mutual attention mechanism, respectively; obtaining a self-weight text attention feature vector containing the self-weight information of the voice text semantic features and the image text semantic features; and obtaining an associated-weight text attention feature vector containing the associated weight information of the voice text semantic features and the image text semantic features.
- The text classification method based on multimodal deep learning according to claim 10, wherein said "combining the context information and the weight information through fully connected layers and outputting them to a classifier to compute scores and obtain classification result information, and determining the category information of the text generated from the audio-video data" specifically comprises: combining the context feature vectors and the weighted text attention feature vectors through fully connected layers and outputting them to the classification function, computing the text-semantic scores of the voice text data and the image text data and their normalized results, and obtaining the category information of the generated text.
- The text classification method based on multimodal deep learning according to claim 1, wherein said "transcribing the voice data into voice text data" further comprises: obtaining configuration data stored in an external cache, and performing the multi-channel multi-size deep convolutional neural network model computation on the voice data based on the configuration data to transcribe text and extract text features.
- An electrical appliance, characterized by comprising: a memory for storing executable instructions; and a processor for implementing, when running the executable instructions stored in the memory, the text classification method based on multimodal deep learning according to any one of claims 1 to 12.
- A refrigerator, characterized by comprising: a memory for storing executable instructions; and a processor for implementing, when running the executable instructions stored in the memory, the text classification method based on multimodal deep learning according to any one of claims 1 to 12.
- A computer-readable storage medium storing executable instructions, characterized in that the executable instructions, when executed by a processor, implement the text classification method based on multimodal deep learning according to any one of claims 1 to 12.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211734528.2 | 2022-12-31 | ||
CN202211734528.2A CN116108176A (en) | 2022-12-31 | 2022-12-31 | Text classification method, equipment and storage medium based on multi-modal deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024140430A1 true WO2024140430A1 (en) | 2024-07-04 |
Family
ID=86266818
Family Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---|
PCT/CN2023/140831 WO2024140430A1 (en) | 2022-12-31 | 2023-12-22 | Text classification method based on multimodal deep learning, device, and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN116108176A (en) |
WO (1) | WO2024140430A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118675092A (en) * | 2024-08-21 | 2024-09-20 | 南方科技大学 | Multi-mode video understanding method based on large language model |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116108176A (en) * | 2022-12-31 | 2023-05-12 | 青岛海尔电冰箱有限公司 | Text classification method, equipment and storage medium based on multi-modal deep learning |
CN116890786A (en) * | 2023-09-11 | 2023-10-17 | 江西五十铃汽车有限公司 | Vehicle lock control method, device and medium |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104361276A (en) * | 2014-11-18 | 2015-02-18 | 新开普电子股份有限公司 | Multi-mode biometric authentication method and multi-mode biometric authentication system |
CN110276259A (en) * | 2019-05-21 | 2019-09-24 | 平安科技(深圳)有限公司 | Lip reading recognition methods, device, computer equipment and storage medium |
CN112818861A (en) * | 2021-02-02 | 2021-05-18 | 南京邮电大学 | Emotion classification method and system based on multi-mode context semantic features |
CN113408385A (en) * | 2021-06-10 | 2021-09-17 | 华南理工大学 | Audio and video multi-mode emotion classification method and system |
US20210319093A1 (en) * | 2020-04-09 | 2021-10-14 | International Business Machines Corporation | Using multimodal model consistency to detect adversarial attacks |
CN113590769A (en) * | 2020-04-30 | 2021-11-02 | 阿里巴巴集团控股有限公司 | State tracking method and device in task-driven multi-turn dialogue system |
CN114944156A (en) * | 2022-05-20 | 2022-08-26 | 青岛海尔电冰箱有限公司 | Article classification method, device and equipment based on deep learning and storage medium |
CN115062143A (en) * | 2022-05-20 | 2022-09-16 | 青岛海尔电冰箱有限公司 | Voice recognition and classification method, device, equipment, refrigerator and storage medium |
CN116108176A (en) * | 2022-12-31 | 2023-05-12 | 青岛海尔电冰箱有限公司 | Text classification method, equipment and storage medium based on multi-modal deep learning |
Also Published As
Publication number | Publication date |
---|---|
CN116108176A (en) | 2023-05-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021082941A1 (en) | Video figure recognition method and apparatus, and storage medium and electronic device | |
WO2023222088A1 (en) | Voice recognition and classification method and apparatus | |
WO2024140430A1 (en) | Text classification method based on multimodal deep learning, device, and storage medium | |
WO2024140434A1 (en) | Text classification method based on multi-modal knowledge graph, and device and storage medium | |
CN110460872B (en) | Information display method, device and equipment for live video and storage medium | |
CN113408385A (en) | Audio and video multi-mode emotion classification method and system | |
WO2023222089A1 (en) | Item classification method and apparatus based on deep learning | |
WO2023222090A1 (en) | Information pushing method and apparatus based on deep learning | |
WO2024140432A1 (en) | Ingredient recommendation method based on knowledge graph, and device and storage medium | |
CN105512348A (en) | Method and device for processing videos and related audios and retrieving method and device | |
CN110517689A (en) | A kind of voice data processing method, device and storage medium | |
CN117077787A (en) | Text generation method and device, refrigerator and storage medium | |
CN109710799B (en) | Voice interaction method, medium, device and computing equipment | |
CN114138960A (en) | User intention identification method, device, equipment and medium | |
CN115798459B (en) | Audio processing method and device, storage medium and electronic equipment | |
CN111462732B (en) | Speech recognition method and device | |
CN112581937A (en) | Method and device for acquiring voice instruction | |
CN113724689B (en) | Speech recognition method and related device, electronic equipment and storage medium | |
CN113763925B (en) | Speech recognition method, device, computer equipment and storage medium | |
CN114492579A (en) | Emotion recognition method, camera device, emotion recognition device and storage device | |
WO2024188276A1 (en) | Text classification method and refrigeration device system | |
US20230326369A1 (en) | Method and apparatus for generating sign language video, computer device, and storage medium | |
CN112235183B (en) | Communication message processing method and device and instant communication client | |
CN114283493A (en) | Artificial intelligence-based identification system | |
CN118551343B (en) | Multi-mode large model construction method, system, refrigeration equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 23910356; Country of ref document: EP; Kind code of ref document: A1 |