WO2024140430A1 - Text classification method based on multimodal deep learning, device, and storage medium - Google Patents
- Publication number: WO2024140430A1 (PCT/CN2023/140831)
- Authority: WO (WIPO PCT)
- Prior art keywords: text, data, image, features, video
Classifications
- G06F16/355: Information retrieval of unstructured textual data; clustering or classification; class or cluster creation or modification
- G06F16/215: Improving data quality; data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
- G06F16/24552: Query execution; database cache management
- G06F16/685: Retrieval of audio data using metadata automatically derived from the content, e.g. automatically derived transcripts of audio data
- G06F16/7834: Retrieval of video data using metadata automatically derived from the content, using audio features
- G06F16/7844: Retrieval of video data using original textual content or text extracted from visual content or transcripts of audio data
- G06F40/284: Natural language analysis; lexical analysis, e.g. tokenisation or collocates
- G06F40/30: Semantic analysis
- G06N3/08: Neural networks; learning methods
- Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management
Definitions
- the method of "inputting the video image of the lip area into a 3D convolutional neural network model for calculation to obtain image features" specifically includes: segmenting the local video data of the lips into continuous lip picture frames; inputting the continuous lip picture frames into a 3D convolutional neural network model for calculation, extracting multiple features, and obtaining the image features.
- the "based on the image lip reading recognition method, inputting the image features into the multi-channel multi-size time deep convolutional neural network model for transcription to obtain the first image text data” specifically includes: inputting the image features into the multi-channel multi-size time deep convolutional neural network for calculation to obtain time series image features; according to the image lip reading recognition method, mapping the time series image features into a pinyin sequence of a pinyin sentence; and then translating the pinyin sequence into a Chinese character sequence corresponding to the Chinese character sentence.
- the self-weight information and/or associated weight information of the words and phrases in the text features of the speech text data and the image text data are distinguished to obtain the weight information of the text semantic features.
- FIG1 is a structural block diagram of a model involved in a text classification method based on multimodal deep learning in one embodiment of the present invention.
- FIG3 is a schematic diagram of the steps of acquiring real-time audio and video data and historical audio and video data in one embodiment of the present invention.
- FIG6 is a schematic diagram of the steps of obtaining a local area video image in the video data and transcribing the video image into image text data in one embodiment of the present invention.
- the embodiment of the present invention is a text classification method based on multimodal deep learning.
- the present application provides the method operation steps as described in the following implementation or flowchart 1, based on routine or no creative labor, the execution order of the steps in the method that do not have a necessary causal relationship in logic is not limited to the execution order provided in the implementation of the present application.
- FIG2 is a schematic diagram of the steps of a text classification method based on multimodal deep learning, including:
- S2 Preprocess the real-time audio and video data and historical audio and video data to obtain valid voice data and video data.
- S4 Acquire a video image of a local area in the valid video data, and transcribe the video image into image text data.
- S5 Acquire context information of the text data and weight information of text semantic features according to the speech text data and the image text data.
- the method provided by the present invention can be used by smart electronic devices to implement real-time interaction or message push functions with users based on the user's real-time audio and video data input.
- a smart refrigerator is taken as an example, and the method is described in combination with a pre-trained deep learning model.
- the smart refrigerator Based on the user's audio and video input, the smart refrigerator classifies the corresponding text content generated by the user's audio and video data, and calculates the text content classification result information to be output based on the classification result information.
- the historical audio and video data stored in the internal memory of the smart refrigerator can be read.
- the historical audio and video data stored in the external storage device configured by the smart refrigerator can also be read.
- the external storage device is a mobile storage device such as a U disk, SD card, etc. By setting an external storage device, the storage space of the smart refrigerator can be further expanded.
- the historical audio and video data stored in a client terminal such as a mobile phone, a tablet computer, or an application software server can be obtained.
- Implementing multi-channel historical audio and video data acquisition channels can greatly increase the amount of historical audio and video data, thereby improving the accuracy of subsequent voice recognition and video image recognition.
- In steps S41 and S42, considering that the sentences recognizable from the video image features of a person's lip region are relatively complex (differing sentence lengths, pause positions, word compositions, and correlations between image features), video processing operations such as cropping and framing can be performed on the valid video data to obtain the lip-region video image, which is then cropped and segmented into multiple consecutive lip picture frames.
- The multiple consecutive lip picture frames are input into the 3D convolutional neural network model; adding information from the time dimension allows more expressive features to be extracted.
- The 3D convolutional neural network model can capture the correlation between multiple pictures: it takes multiple consecutive frames as input and, through the added dimension, captures the motion information in the input frames, thereby obtaining better image features.
- In step S7, after the classification result information is obtained through the above steps, it can be converted into speech and broadcast through the smart refrigerator's built-in sound playback device, converted into text and shown directly on the refrigerator's display device, or converted into an image and shown directly on the refrigerator's large screen.
- The result information can also be transmitted to the client terminal and output there by voice communication.
- The present invention provides a method for classifying text generated from audio and video based on multimodal deep learning: it acquires real-time and historical audio/video data through multiple channels, processes the data, and converts the voice data and video data into corresponding speech text data and image text data.
- The context information of the generated audio/video text is combined with the multi-channel, multi-size deep convolutional neural network model and the multi-channel, multi-size temporal deep convolutional neural network model to fully extract text semantic features, obtain the classification results of the generated text, and output those results through multiple channels.
- The method not only significantly improves the accuracy of generated-text classification but also makes interaction between users and smart refrigerators more convenient and diversified, greatly improving the user experience.
Abstract
The present invention discloses a text classification method based on multimodal deep learning, comprising: acquiring context information of text data and weight information of text semantic features; combining the context information and the weight information of the text semantic features through a fully connected layer; and outputting the combined information to a classifier to calculate scores and obtain classification result information. The method effectively improves the accuracy and generalization ability of classifying text generated from audio and video, thereby improving the user experience.
Description
The present invention relates to the field of computer technology, and in particular to a text classification method, device, and storage medium based on multimodal deep learning.
As multimodal deep learning technology has moved into practical applications, interaction between smart refrigerators and users still relies mostly on voice and text data. Interaction based on video data remains rare, and traditional approaches to refrigerator voice and video suffer from a common problem: inaccurate and insufficient feature extraction, which leads to low speech recognition precision and low accuracy in classifying text derived from video content. This degrades the audio/video experience of the refrigerator and even limits the intelligence and informatization level of high-end refrigerators.
How to build a classification model for text generated from refrigerator audio and video with the help of a multi-channel, multi-size deep convolutional neural network has therefore become the key to improving text classification accuracy. Smart refrigerator interaction is inseparable from multi-source heterogeneous data such as voice, text, and video, yet the industry has not proposed an effective solution for extracting optimal feature information from such data on a multimodal or cross-modal basis, optimizing the accuracy of classifying text generated from refrigerator audio and video, and thereby improving the user experience.
The purpose of the present invention is to provide a text classification method, device, and storage medium based on multimodal deep learning.
The present invention provides a method for classifying generated text based on multimodal deep learning, comprising the steps of:
acquiring real-time audio/video data and historical audio/video data; preprocessing the real-time and historical audio/video data to obtain valid voice data and video data; transcribing the valid voice data into speech text data; acquiring a video image of a local region in the valid video data and transcribing the video image into image text data; acquiring, from the speech text data and the image text data, context information of the text data and weight information of the text semantic features; combining the context information and the weight information of the text semantic features through a fully connected layer, outputting the result to a classifier to calculate scores and obtain classification result information, and determining the category of the text generated from the audio/video data; and outputting the category information of the generated text.
As a further improvement of the present invention, "preprocessing the real-time and historical audio/video data to obtain valid voice data and video data" specifically includes: performing data cleaning, format parsing, format conversion, and data storage on the real-time and historical audio/video data to obtain valid audio/video data; separating voice from video in the valid audio/video data using a script or third-party tool to obtain the voice data and video data; and preprocessing the voice data and video data, including framing and windowing the voice data, and cropping and framing the video data.
As a further improvement of the present invention, "transcribing the valid voice data into speech text data" specifically includes: extracting features from the valid voice data to obtain speech features; inputting the speech features into a speech recognition multi-channel, multi-size deep convolutional neural network model to transcribe them into first speech text data; outputting the alignment between the speech features and the first speech text data based on connectionist temporal classification to obtain second speech text data; obtaining, based on an attention mechanism, the key features of the second speech text data or the weight information of those key features; and combining the second speech text data with its key features or their weight information through a fully connected layer, then computing scores with a classification function to obtain the speech text data.
As a further improvement of the present invention, "extracting features from the valid voice data" specifically includes: extracting the features of the valid voice data to obtain its Mel-frequency cepstral coefficient (MFCC) features.
As a further improvement of the present invention, "acquiring a video image of a local region in the video data and transcribing the video image into image text data" specifically includes: obtaining a video image of the lip region from the valid video data; inputting the lip-region video image into a 3D convolutional neural network model to compute image features; based on an image lip-reading method, inputting the image features into a multi-channel, multi-size temporal deep convolutional neural network model for transcription to obtain first image text data; outputting the alignment between the image feature sequence and the first image text data based on connectionist temporal classification to obtain second image text data; and combining the second image text data through a fully connected layer, then computing scores with a classification function to obtain the image text data.
As a further improvement of the present invention, "inputting the lip-region video image into a 3D convolutional neural network model to compute image features" specifically includes: segmenting the local lip video data into consecutive lip picture frames; and inputting the consecutive lip picture frames into the 3D convolutional neural network model to extract multiple features and obtain the image features.
As a further improvement of the present invention, "based on an image lip-reading method, inputting the image features into the multi-channel, multi-size temporal deep convolutional neural network model for transcription to obtain first image text data" specifically includes: inputting the image features into the multi-channel, multi-size temporal deep convolutional neural network to compute time-series image features; mapping the time-series image features to a pinyin sequence of a pinyin sentence according to the image lip-reading method; and translating the pinyin sequence into the Chinese character sequence of the corresponding Chinese sentence.
As a further improvement of the present invention, "acquiring, from the speech text data and the image text data, context information of the text data and weight information of the text semantic features" specifically includes: converting the speech text data and image text data into speech-text word vectors and image-text word vectors; and inputting the speech-text and image-text word vectors into a bidirectional long short-term memory (BiLSTM) network model to obtain context feature vectors containing the feature information of the speech text data and image text data.
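By way of illustration only, the BiLSTM context encoding described above might be sketched in Python as follows; PyTorch and all dimensions are assumptions of the sketch, since the disclosure names neither a framework nor layer sizes.

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Bidirectional LSTM over word vectors, returning context feature
    vectors that see both left and right context for every word."""
    def __init__(self, embed_dim=128, hidden_dim=64):
        super().__init__()
        self.bilstm = nn.LSTM(embed_dim, hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, word_vectors):             # (batch, seq_len, embed_dim)
        context, _ = self.bilstm(word_vectors)   # (batch, seq_len, 2*hidden_dim)
        return context

# Hypothetical usage: speech-text and image-text word vectors pass through
# the same encoder to yield their respective context feature vectors.
encoder = ContextEncoder()
speech_ctx = encoder(torch.randn(2, 20, 128))
image_ctx = encoder(torch.randn(2, 20, 128))
```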
As a further improvement of the present invention, based on an attention mechanism model, the self-weight information and/or associated-weight information of the words and phrases in the text features of the speech text data and image text data are distinguished to obtain the weight information of the text semantic features.
As a further improvement of the present invention, "based on the attention mechanism model, distinguishing the self-weight information and/or associated-weight information of the words and phrases in the text features of the speech text data and the image text data" specifically includes: inputting the speech-text context feature vector and the image-text context feature vector into a self-attention mechanism and a mutual-attention mechanism, respectively; obtaining a self-weight text attention feature vector containing the self-weight information of the speech-text and image-text semantic features; and obtaining an associated-weight text attention feature vector containing the associated-weight information of the speech-text and image-text semantic features.
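A minimal sketch of the self-attention and mutual-attention step above, again assuming PyTorch and an assumed feature dimension; nn.MultiheadAttention stands in for whatever attention formulation the disclosure actually uses:

```python
import torch
import torch.nn as nn

d = 128  # assumed dimension of the context feature vectors
self_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
cross_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)

speech_ctx = torch.randn(2, 20, d)   # speech-text context feature vectors
image_ctx = torch.randn(2, 20, d)    # image-text context feature vectors

# Self-attention: each modality weights its own words and phrases,
# yielding the self-weight text attention feature vector.
speech_self, _ = self_attn(speech_ctx, speech_ctx, speech_ctx)

# Mutual attention: speech-text features query the image-text features,
# yielding the associated-weight text attention feature vector.
speech_cross, _ = cross_attn(speech_ctx, image_ctx, image_ctx)
```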
As a further improvement of the present invention, "combining the context information and weight information through a fully connected layer, outputting the result to a classifier to calculate scores and obtain classification result information, and determining the category of the text generated from the audio/video data" specifically includes: combining the context feature vector and the text attention weight feature vectors through a fully connected layer, outputting the result to a classification function, and computing the semantic scores of the speech text data and image text data together with their normalized score results to obtain the category information of the generated text.
As a further improvement of the present invention, "transcribing the voice data into speech text data" further includes: obtaining configuration data stored in an external cache, and, based on that configuration data, executing the multi-channel, multi-size deep convolutional neural network computation on the voice data to transcribe the text and extract text features.
The present invention also provides an electrical appliance, comprising: a memory for storing executable instructions; and a processor which, when running the executable instructions stored in the memory, implements the above text classification method based on multimodal deep learning.
The present invention also provides a refrigerator, comprising: a memory for storing executable instructions; and a processor which, when running the executable instructions stored in the memory, implements the above text classification method based on multimodal deep learning.
The present invention also provides a computer-readable storage medium storing executable instructions which, when executed by a processor, implement the above text classification method based on multimodal deep learning.
The beneficial effects of the present invention are as follows. The method completes the recognition and classification of text generated from the acquired audio and video. First, by combining speech text data with text recognized from video images of the speaker's lip region, the method obtains complementary, correlated, and mutually reinforcing text semantic features, making the classification of audio/video-generated text precise. Second, by jointly considering real-time and historical audio/video data and using the historical data as a supplement, it compensates for the limited semantic information in speech text and video text, effectively improving classification accuracy. Finally, constructing a model that fuses a multi-channel, multi-size deep convolutional neural network with a temporal deep convolutional neural network improves the precision of classifying text generated from real-time audio and video; in particular, a convolutional network integrating a context information mechanism, a self-attention mechanism, and a mutual-attention mechanism mines the rich semantic features of the text data more fully. The overall model thus makes full use of real-time and historical audio/video data as well as context data, has excellent semantic representation capability, classifies audio/video-generated text with high accuracy and generalization ability, and improves the user experience.
FIG. 1 is a structural block diagram of the model involved in the text classification method based on multimodal deep learning in one embodiment of the present invention.
FIG. 2 is a schematic diagram of the steps of the text classification method based on multimodal deep learning in one embodiment of the present invention.
FIG. 3 is a schematic diagram of the step of acquiring real-time and historical audio/video data in one embodiment of the present invention.
FIG. 4 is a schematic diagram of the step of preprocessing the real-time and historical audio/video data in one embodiment of the present invention.
FIG. 5 is a schematic diagram of the step of transcribing the valid voice data into speech text data in one embodiment of the present invention.
FIG. 6 is a schematic diagram of the step of acquiring a local-region video image from the video data and transcribing it into image text data in one embodiment of the present invention.
FIG. 7 is a schematic diagram of the step of acquiring context information and weight information of the text data from the speech text data and image text data in one embodiment of the present invention.
The present invention is described in detail below with reference to the specific embodiments shown in the accompanying drawings. These embodiments do not limit the present invention; structural, methodological, or functional changes made by those of ordinary skill in the art based on these embodiments all fall within the scope of protection of the present invention.
It should be noted that the term "comprise" or any variant thereof is intended to cover non-exclusive inclusion, so that a process, method, article, or device comprising a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. In addition, the terms "first", "second", and the like are used for descriptive purposes only and shall not be understood as indicating or implying relative importance.
An embodiment of the present invention is a text classification method based on multimodal deep learning. Although the present application presents method operation steps as in the following embodiments or in the flowchart, for steps that have no necessary logical causal relationship, their execution order is not limited, as a matter of routine and without creative effort, to the order given in the embodiments of the present application.
FIG. 1 shows a structural block diagram of the model involved in the text classification method based on multimodal deep learning provided by the present invention; FIG. 2 shows a schematic diagram of the steps of the method, which include:
S1: Acquire real-time audio/video data and historical audio/video data.
S2: Preprocess the real-time and historical audio/video data to obtain valid voice data and video data.
S3: Transcribe the valid voice data into speech text data.
S4: Acquire a video image of a local region in the valid video data, and transcribe the video image into image text data.
S5: From the speech text data and image text data, acquire context information of the text data and weight information of the text semantic features.
S6: Combine the context information and weight information through a fully connected layer, output the result to a classifier to calculate scores and obtain classification result information, and determine the category of the text generated from the audio/video data.
S7: Output the category information of the generated text.
The method provided by the present invention allows a smart electronic device to implement functions such as real-time interaction with the user or message pushing based on the user's real-time audio/video input. Illustratively, this embodiment takes a smart refrigerator as an example and describes the method in combination with a pre-trained deep learning model. Based on the user's audio/video input, the smart refrigerator classifies the text content generated from the user's audio/video data and, from the classification results, computes the classification result information to be output.
As shown in FIG. 3, step S1 specifically includes:
S11: Acquire the real-time audio/video data captured by a collection device, and/or
acquire the real-time audio/video data transmitted from a client terminal.
S12: Acquire internally stored historical audio/video data, and/or
acquire externally stored historical audio/video data, and/or
acquire historical audio/video data transmitted from a client terminal.
The real-time audio/video data here includes real-time voice data and real-time video data. Real-time voice refers to interrogative or imperative utterances the user currently addresses to the smart electronic device or to a client terminal in communication with it; it may likewise be voice information captured by a voice collection device. In this embodiment, for example, the user may ask questions such as "What vegetables are in the refrigerator today?" or "What beef ingredients are in the refrigerator today?", or issue commands such as "Delete all ingredients". The real-time video data consists of video images captured in real time by the smart electronic device or a connected client terminal; in this embodiment, a camera built into the smart refrigerator captures the user's facial image, from which a lip-region feature image is extracted to recognize the corresponding text content, for example the image text "What vegetables are in the refrigerator today?".
The historical audio/video data here refers to the user's real-time audio/video data from previous use; it may further include historical audio/video data entered by the user. Specifically, in this embodiment, the historical data may include audio/video of instructions issued or questions asked by the user in the past, which contains information related to the current real-time data, as well as explanatory audio/video the user produced about items placed in the refrigerator, such as "There is no milk left in the refrigerator". The acquired historical data can serve as part of the dataset for pre-training and for the prediction model, effectively supplementing the single-voice representation of the real-time data and enriching the semantic features.
As described in step S11, in this embodiment the user's real-time audio and video can be captured by audio/video collection devices such as cameras installed in the smart refrigerator; during use, when the user needs to interact with the refrigerator, speaking to it directly suffices. The real-time audio/video data can also be obtained from a client terminal connected to the refrigerator over a wireless communication protocol. The client terminal is an electronic device with information-sending capability, such as a mobile phone, tablet computer, smart camera, or smart watch, or an app- or Bluetooth-connected smart device; the user speaks to the client terminal, or the refrigerator's built-in camera records directly, and the terminal transmits the captured audio/video to the refrigerator via Wi-Fi, Bluetooth, or another wireless link. This enables multi-channel real-time acquisition: the user is not restricted to speaking directly to the refrigerator but can use any convenient channel, which significantly improves convenience. In other embodiments, one or more of the above acquisition methods may be used, or the real-time audio/video data may be obtained through other channels based on the prior art; the present invention places no specific restriction on this.
As described in step S12, in this embodiment the historical audio/video data stored in the smart refrigerator's internal memory can be read. The data can also be read from an external storage device configured for the refrigerator, such as a USB drive or SD card; an external storage device further expands the refrigerator's storage space. The historical data can likewise be obtained from a client terminal such as a mobile phone or tablet, or from an application server. Multi-channel acquisition of historical audio/video data greatly increases the amount of such data, improving the accuracy of subsequent speech recognition and video image recognition. In other embodiments, one or more of the above methods may be used, or the historical data may be obtained through other channels based on the prior art; the present invention places no specific restriction on this.
Further, in this embodiment the smart refrigerator is provided with an external cache, and at least part of the historical audio/video data is stored there. As usage time grows, the historical data accumulates; storing part of it in the external cache saves the refrigerator's internal storage space, and reading the data directly from the cache during neural network computation improves algorithm efficiency.
Specifically, this embodiment uses a Redis component as the external cache. Redis is a widely used distributed caching system with a key/value storage structure that can serve as a database, cache, and message broker. Other external caches such as Memcached may be used in other embodiments; the present invention places no specific restriction on this.
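As a sketch of how the device might read configuration data from such a cache, using the redis-py client (one common Python choice; the key name and JSON payload are invented for illustration):

```python
import redis

# Connect to the external cache; host and port are placeholders.
cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Store model configuration data under a key, then read it back before
# running the convolutional network computation.
cache.set("asr:model_config", '{"channels": 32, "kernel_size": 3}')
config = cache.get("asr:model_config")
```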
In summary, in steps S11 and S12, real-time and historical audio/video data can be acquired flexibly through multiple channels, improving the user experience while ensuring data volume and effectively raising algorithm efficiency.
As shown in FIG. 4, step S2 specifically includes the steps of:
S21: Clean the real-time and historical audio/video data to obtain valid audio/video data.
S22: Separate the valid audio/video data into voice and video to obtain the voice data and video data.
S23: Preprocess the voice data and video data, including framing and windowing the voice data, and cropping and framing the video data.
In step S21, cleaning the real-time and historical audio/video data specifically includes:
acquiring a certain number of real-time and historical audio/video datasets, which can, for example, be imported into a data cleaning model as files. To prevent import failures, data that does not satisfy the import format is first parsed and format-converted; irrelevant and duplicate data are then removed from the datasets, and outliers and missing values are handled, so that information irrelevant to classification is screened out. The cleaned data are output and saved in a specified format, yielding valid audio/video data.
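A minimal sketch of such a cleaning pass, assuming the collected files are indexed by a hypothetical metadata table (all column names and paths are invented for illustration):

```python
import pandas as pd

records = pd.read_csv("av_records.csv")  # hypothetical metadata of the A/V files

records = records.drop_duplicates()                    # remove duplicate entries
records = records.dropna(subset=["file_path"])         # handle missing values
records = records[records["duration_s"] > 0]           # discard abnormal values
records = records[records["format"].isin(["wav", "mp4"])]  # keep parsable formats

# Save the cleaned records in the specified output format.
records.to_csv("av_records_clean.csv", index=False)
```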
In step S22, a script or a third-party audio/video separation tool is used to separate voice and video from the valid audio/video data, yielding the voice data and video data.
In an embodiment of the present invention, the separation script can be written in the Python language, or a third-party separation tool can be used, to split the input audio/video data and obtain the classified voice and video data.
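One way such a Python separation script might look, delegating the actual demuxing to the ffmpeg command-line tool (an assumed third-party dependency; filenames are placeholders):

```python
import subprocess

def split_audio_video(src, audio_out="speech.wav", video_out="video_only.mp4"):
    """Separate one A/V file into a speech track and a silent video track."""
    # -vn drops the video stream, keeping the audio as 16-bit PCM WAV.
    subprocess.run(["ffmpeg", "-y", "-i", src, "-vn",
                    "-acodec", "pcm_s16le", audio_out], check=True)
    # -an drops the audio stream, copying the video stream unchanged.
    subprocess.run(["ffmpeg", "-y", "-i", src, "-an",
                    "-c:v", "copy", video_out], check=True)

split_audio_video("user_query.mp4")  # placeholder input file
```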
In step S23, the classified speech is segmented by a specified time span or number of samples, completing the framing of the speech to obtain speech signal data; a window function is then applied so that the originally noisy speech signal exhibits strengthened, periodic characteristics, completing the windowing and easing the subsequent extraction of speech feature parameters. Illustratively, step S23 also includes cropping the valid video data into multiple frames. Specifically, a script first loads the video data and reads the video information, then decodes the video accordingly and determines how many pictures the video shows per second, obtaining single-frame image information, including each frame's width and height, and finally saves the video as multiple pictures. After step S23, valid speech data and image data are thus obtained. Other video framing methods, such as third-party video cropping tools, may be used in other embodiments; the present invention places no specific restriction on this.
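The two preprocessing operations of step S23 might be sketched as follows; the 25 ms / 10 ms framing at 16 kHz, the Hamming window, and the use of OpenCV are assumptions, since the embodiment fixes none of them:

```python
import numpy as np
import cv2  # OpenCV, one third-party option for decoding video into frames

def frame_and_window(signal, frame_len=400, hop=160):
    """Split a 1-D speech signal (assumed at least frame_len samples long)
    into overlapping frames and apply a Hamming window to each frame."""
    n = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack([signal[i * hop: i * hop + frame_len] for i in range(n)])
    return frames * np.hamming(frame_len)

def video_to_frames(path):
    """Decode a video file into a list of single-frame images."""
    cap = cv2.VideoCapture(path)
    frames = []
    ok, img = cap.read()
    while ok:
        frames.append(img)
        ok, img = cap.read()
    cap.release()
    return frames
```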
As shown in FIG. 5, step S3 specifically includes:
S31: Extract features from the valid voice data to obtain speech features.
S32: Input the speech features into the speech recognition multi-channel, multi-size deep convolutional neural network model and transcribe them into first speech text data.
S33: Output the alignment between the speech features and the first speech text data based on connectionist temporal classification to obtain second speech text data.
S34: Based on an attention mechanism, obtain the key features of the second speech text data or the weight information of those key features.
S35: Combine the second speech text data with its key features or their weight information through a fully connected layer, then compute scores with a classification function to obtain the speech text data.
In step S31, extracting the valid voice data features specifically includes:
extracting the speech data features to obtain their Mel-frequency cepstral coefficients (MFCC). MFCCs are a discriminative component of a speech signal: cepstral parameters extracted in the Mel-scale frequency domain, where the Mel scale describes the nonlinear frequency response of the human ear. Because MFCC parameters account for the ear's sensitivity to different frequencies, they are particularly suited to speech recognition and speaker identification.
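A minimal sketch of this extraction step using librosa (one common Python choice; the disclosure names no library, and 13 coefficients is an assumption):

```python
import librosa

# Load the valid speech segment (placeholder file, resampled to 16 kHz)
# and compute 13 Mel-frequency cepstral coefficients per frame.
y, sr = librosa.load("speech.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape: (13, n_frames)
```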
In embodiments of the present invention, feature parameters such as perceptual linear predictive (PLP) features or linear predictive coding (LPC) features of the speech data may be obtained through different algorithmic steps in place of MFCC features; the choice can be adjusted to the actual application scenario and the model parameters adopted, and the present invention places no specific restriction on this.
The specific algorithmic steps involved above follow the current prior art in this field and are not described in detail here.
In step S32, the text content of the valid voice data is transcribed through a network model from automatic speech recognition technology, yielding the first speech text data.
在本实施方式中,通过增加网络的宽度途径构建多通道多尺寸深度卷积神经网络模型实现语音转文本的任务,该深度网络模型是由多层深度卷积网络模型构成,深度卷积神经网络模型一般是由若干卷积层加若干全连接层组成,中间包含各种的非线性操作、池化操作,主要用于处理网格结构的数据,因此该模型可以利用滤波器将相邻像素之间的轮廓过滤出来。另外,该模型它是先提出语音特征值,然后再对特征值进行计算而不是对原始语音数据值进行计算。因此,相比于传统的循环神经网络来说,深度卷积神经网络模型具有计算量小、容易刻画局部特征的优势,而且共享权重以及池化层可以赋予该模型更好的时域或频域的不变性,另外更深层的非线性结构也可以让该模型具备强大的表征能力。另外,多通道多尺寸可以从不同的视角去提取语音特征,获取更多的语音特征信息,具有更好的语音识别精度。In this embodiment, a multi-channel multi-size deep convolutional neural network model is constructed by increasing the width of the network to achieve the task of voice-to-text conversion. The deep network model is composed of a multi-layer deep convolutional network model. The deep convolutional neural network model is generally composed of several convolutional layers plus several fully connected layers, and various nonlinear operations and pooling operations are included in the middle. It is mainly used to process grid structured data, so the model can use filters to filter out the contours between adjacent pixels. In addition, the model first proposes speech feature values, and then calculates the feature values instead of calculating the original speech data values. Therefore, compared with the traditional recurrent neural network, the deep convolutional neural network model has the advantages of small computational complexity and easy to characterize local features, and shared weights and pooling layers can give the model better invariance in the time domain or frequency domain. In addition, the deeper nonlinear structure can also give the model a powerful representation ability. In addition, multi-channel and multi-size can extract speech features from different perspectives, obtain more speech feature information, and have better speech recognition accuracy.
Specifically, in this embodiment, the multi-channel multi-size deep convolutional neural network used in step S32 is built from 3*3 convolutional layers with 32 channels and one max-pooling layer.
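As an illustrative sketch only (the patent does not prescribe a framework), such a block can be expressed in PyTorch as follows. The 3*3 kernel and 32-channel width follow the embodiment above, while the 5*5 branch and the input dimensions are assumptions added to show the multi-size idea:

```python
import torch
import torch.nn as nn

class MultiSizeConvBlock(nn.Module):
    """Parallel conv branches of different kernel sizes over MFCC feature maps."""
    def __init__(self, in_channels: int = 1, channels: int = 32):
        super().__init__()
        # Branches with different kernel sizes view the features at different
        # scales; 3x3 matches the embodiment, the 5x5 branch is an assumption.
        self.branch3 = nn.Conv2d(in_channels, channels, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(in_channels, channels, kernel_size=5, padding=2)
        self.pool = nn.MaxPool2d(kernel_size=2)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Concatenate the branch outputs along the channel dimension.
        out = torch.cat([self.act(self.branch3(x)),
                         self.act(self.branch5(x))], dim=1)
        return self.pool(out)

# Example: a batch of 8 single-channel MFCC "images" (13 coefficients x 100 frames).
features = torch.randn(8, 1, 13, 100)
print(MultiSizeConvBlock()(features).shape)  # torch.Size([8, 64, 6, 50])
```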
In step S33, the connectionist temporal classification (CTC) method is used to obtain the alignment between the input speech feature sequence and the output speech text feature sequence.
In this embodiment, it is difficult to construct a precise mapping between the valid voice data and the characters of the first voice text data, which increases the difficulty of subsequent speech recognition. To solve this, the temporal classification method is adopted. It is generally applied after the convolutional network model and provides fully end-to-end acoustic model training: no prior alignment of the data is needed, only an input sequence and an output sequence, with no frame-level alignment or one-to-one labeling, and it directly outputs the probability of the predicted sequence. From this predicted probability, the most likely text output is taken as the second voice text data.
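The following minimal sketch shows how the CTC criterion available in PyTorch trains such an alignment-free model; the vocabulary size, sequence lengths, and blank index are assumptions:

```python
import torch
import torch.nn as nn

vocab_size = 30          # assumed: output characters plus the CTC blank (index 0)
T, N = 50, 8             # input time steps and batch size (placeholders)

ctc = nn.CTCLoss(blank=0)

# Log-probabilities from the CNN acoustic model: (time, batch, classes).
log_probs = torch.randn(T, N, vocab_size, requires_grad=True).log_softmax(dim=2)

# Unaligned target transcripts, concatenated; no frame-level labels needed.
target_lengths = torch.randint(5, 20, (N,))
targets = torch.randint(1, vocab_size, (int(target_lengths.sum()),))
input_lengths = torch.full((N,), T, dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # end-to-end training signal without pre-alignment
```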
Further, in step S34, the attention mechanism guides the deep convolutional neural network to focus on the more critical feature information and suppress non-critical feature information. By introducing the attention mechanism, the local key features or weight information of the second voice text data can therefore be obtained, which further reduces irregular misalignment of sequences during model training.
Here, in step S35, based on the second voice text data and its key features or their weights, a model fusing the self-attention mechanism with fully connected layers assigns the second voice text data its own weight information. This better captures the internal weights of the text's semantic features and emphasizes the relative importance of different parts of that semantic information; finally, a classification function computes the scores that yield the voice text data.
As shown in FIG. 6, step S4 specifically includes:
S41: acquiring a video image of the lip area from the video data.
S42: inputting the video image of the lip area into a 3D convolutional neural network model for computation to obtain image features.
S43: based on the image lip-reading recognition method, inputting the image features into a multi-channel multi-size temporal deep convolutional neural network for transcription to obtain first image text data.
S44: outputting the alignment between the image features and the first image text data based on the connectionist temporal classification method to obtain second image text data.
S45: combining the second image text data through fully connected layers, then computing scores with a classification function to obtain the image text data.
In steps S41 and S42, the sentences recognizable from lip-area video features can be complex: sentences differ in length, in pause positions, and in word composition, and successive image features are correlated. Accordingly, video processing operations such as cropping and framing are applied to the valid video data to obtain the lip-area video, which is then cropped and segmented into multiple consecutive lip picture frames. In this embodiment, these consecutive frames are fed into a 3D convolutional neural network model. By adding information along the time dimension, more expressive features can be extracted: the 3D model takes multiple consecutive frames as input, handles the correlations between them, and captures the motion information in the input frames through the extra dimension, thereby obtaining better image features.
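A minimal sketch of such a 3D convolution over consecutive lip frames, assuming a 16-frame clip and a 64*64 crop (both placeholder sizes), might look like this in PyTorch:

```python
import torch
import torch.nn as nn

# Assumed input: a batch of 4 clips, 3 color channels,
# 16 consecutive lip frames, each cropped to 64x64 pixels.
clips = torch.randn(4, 3, 16, 64, 64)

lip_encoder = nn.Sequential(
    # The depth dimension (16 frames) lets the kernel span time,
    # capturing lip motion as well as spatial shape.
    nn.Conv3d(in_channels=3, out_channels=32, kernel_size=(3, 3, 3), padding=1),
    nn.ReLU(),
    nn.MaxPool3d(kernel_size=(1, 2, 2)),  # pool space, keep temporal resolution
)

features = lip_encoder(clips)
print(features.shape)  # torch.Size([4, 32, 16, 32, 32])
```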
In step S43, the output of the 3D convolutional neural network model from step S42 is fed into the multi-channel multi-size temporal deep convolutional neural network model. After multi-channel, multi-kernel convolution, it outputs as many feature maps as there are convolution kernels: for example, a convolutional layer with a 3-channel input and 2 kernels outputs 2 feature maps. For sentence-level lip reading from video images, this embodiment implements a Chinese lip-reading method in two stages: pinyin sequence recognition (LipPic to Pinyin, P2P) and Chinese character sequence recognition (Pinyin to Chinese-Character, P2CC). Specifically, the temporal image features produced by the network are mapped to the pinyin sequence of a pinyin sentence, the pinyin sequence is then translated into the Chinese character sequence of a Chinese sentence, and the first image text data is finally obtained. Of course, other Chinese lip-reading methods are not specifically restricted; any method that converts the video images into the corresponding text data falls within the scope of protection of the present invention.
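The feature-map count mentioned in the example above (3 input channels, 2 kernels, hence 2 output feature maps) can be checked with a toy sketch; the spatial dimensions are placeholders:

```python
import torch
import torch.nn as nn

# A convolutional layer with 3 input channels and 2 kernels, as in the
# example above: the output has exactly 2 feature maps.
conv = nn.Conv2d(in_channels=3, out_channels=2, kernel_size=3, padding=1)

x = torch.randn(1, 3, 16, 64)   # one 3-channel feature tensor
print(conv(x).shape)            # torch.Size([1, 2, 16, 64]) -> 2 feature maps
```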
Steps S44 and S45 proceed in the same way as the voice data processing above: the continuous temporal classification method establishes the mapping between the valid video data and the characters of the first image text data, yielding the second image text data. A model fusing the self-attention mechanism with fully connected layers then assigns the second image text data its own weight information and/or associated weight information, better capturing the internal and/or associated weights of the text's semantic features and emphasizing the relative importance of its different parts; finally, a classification function computes the scores that yield the image text data. The detailed processing is the same as the voice data processing steps above and is not repeated here.
As shown in FIG. 7, step S5 specifically includes:
S51: converting the voice text data and the image text data into voice text word vectors and image text word vectors.
S52: inputting the voice text word vectors and the image text word vectors into a bidirectional long short-term memory network model to obtain context feature vectors containing the feature information of the voice text data and the image text data.
S53: based on an attention mechanism model, distinguishing the self-weight information and/or associated weight information of the words and/or phrases among the text features of the voice text data and the image text data to obtain the weight information of the text semantic features.
In step S51, to convert the text data into a vectorized form a computer can recognize and process, the voice text data and image text data can be converted into the voice text word vectors and image text word vectors by the Word2Vec algorithm. The word vectors may equally be obtained by other existing algorithms in this field, such as GloVe, and the present invention places no specific restriction on this.
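As an illustrative sketch, such word vectors can be trained with the gensim implementation of Word2Vec; the tokenized sentences and the skip-gram hyperparameters below are placeholder assumptions:

```python
from gensim.models import Word2Vec

# Placeholder tokenized transcripts from the speech and lip-reading branches.
sentences = [
    ["put", "the", "beef", "in", "the", "freezer"],
    ["how", "long", "do", "vegetables", "keep"],
]

# Train small skip-gram embeddings (sizes are illustrative, not prescribed).
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

vector = model.wv["vegetables"]   # 100-dimensional word vector
print(vector.shape)               # (100,)
```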
In step S52, the bidirectional long short-term memory network (BiLSTM) combines a forward long short-term memory network (LSTM) with a backward one. An LSTM captures long-distance semantic dependencies in text, and building on it, a BiLSTM captures bidirectional text semantics. The voice text word vectors and image text word vectors are fed into the BiLSTM model; the forward and backward LSTMs each produce a result vector only after all time steps have been computed, and the two result vectors are then concatenated to output the context feature vector carrying contextual information.
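A minimal PyTorch sketch of this bidirectional encoding, with assumed embedding and hidden sizes, is shown below; the concatenated forward and backward states form the context feature vectors:

```python
import torch
import torch.nn as nn

embed_dim, hidden_dim = 100, 128   # assumed sizes

bilstm = nn.LSTM(input_size=embed_dim, hidden_size=hidden_dim,
                 batch_first=True, bidirectional=True)

# Batch of 8 sentences, 20 word vectors each (e.g. from Word2Vec above).
word_vectors = torch.randn(8, 20, embed_dim)

# `output` concatenates the forward and backward hidden states per time step.
output, (h_n, c_n) = bilstm(word_vectors)
print(output.shape)  # torch.Size([8, 20, 256]) -> context feature vectors
```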
In embodiments of the present invention, neural network models of other structures may also be built to transcribe the voice data and video data into the voice text data and video text data; the specific method is not restricted.
In step S53, to distinguish the self-weights of different words or phrases within the voice text data and image text data, as well as the associated weights between the different text data, the voice text context feature vector and the image text context feature vector are fed into a self-attention mechanism and a mutual attention mechanism, respectively. This yields a self-weight feature vector containing the self-weight information of the voice text and image text semantic features, and an associated-weight feature vector containing their associated weight information. The context of the text transcribed from audio and video is thereby fully exploited, the limitation of any single feature in the speech or video data is compensated, the semantic representation of the text data is enriched, and the subsequent text classification is improved.
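Under assumed dimensions, the self-attention and mutual-attention step might be sketched with PyTorch's multi-head attention module (a single head for simplicity; the patent does not prescribe this particular module):

```python
import torch
import torch.nn as nn

dim = 256  # matches the BiLSTM context vectors above (assumed)

self_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=1, batch_first=True)
cross_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=1, batch_first=True)

speech_ctx = torch.randn(8, 20, dim)  # voice text context vectors
image_ctx = torch.randn(8, 20, dim)   # image text context vectors

# Self-attention: each modality weighs its own words (self-weights).
speech_self, _ = self_attn(speech_ctx, speech_ctx, speech_ctx)

# Mutual attention: speech queries attend over image text (associated weights).
speech_cross, _ = cross_attn(speech_ctx, image_ctx, image_ctx)

print(speech_self.shape, speech_cross.shape)  # both torch.Size([8, 20, 256])
```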
Step S6 specifically includes:
combining the context feature vectors and the weighted text attention feature vectors (including the self-weight text attention feature vectors and the associated-weight text attention feature vectors) through fully connected layers and outputting them to the classification function, which computes the text-semantic scores of the voice text data and the image text data together with their normalized results, thereby obtaining the classification result information.
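A sketch of this final fusion and scoring step, assuming pooled feature vectors and a placeholder class count, might read:

```python
import torch
import torch.nn as nn

dim, num_classes = 256, 10   # feature size and class count are assumptions

classifier = nn.Sequential(
    # The three fused vectors are concatenated before the fully connected layer.
    nn.Linear(3 * dim, dim),
    nn.ReLU(),
    nn.Linear(dim, num_classes),
)

context = torch.randn(8, dim)   # pooled context feature vector
self_w = torch.randn(8, dim)    # self-weight attention feature vector
assoc_w = torch.randn(8, dim)   # associated-weight attention feature vector

logits = classifier(torch.cat([context, self_w, assoc_w], dim=1))
probs = torch.softmax(logits, dim=1)   # normalized scores per category
print(probs.argmax(dim=1))             # predicted text category per sample
```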
In summary, carrying out the above steps in sequence yields the classification method for text generated from audio and video provided by the present invention. Real-time and historical audio-video data are acquired and cleaned, speech and video are separated to produce valid voice data and video data respectively, and both are used as part of the dataset for the pre-trained and predictive models, so that text semantic features are captured more comprehensively. Further, by building a multi-channel, multi-size deep convolutional network model that fuses the connectionist temporal classification method with the attention mechanism, together with a sentence-level video lip-reading method based on the temporal deep convolutional neural network model, richer high-level semantic feature information is mined. Finally, by combining a context-information mechanism over the voice text data and video text data with self-attention and mutual-attention mechanisms, the semantic representation capability is exploited more fully, the limitation of any single feature in the speech or video data is compensated, and the accuracy of classifying text generated from audio and video is improved. In addition, performing the computation with configuration data fetched from external storage improves the model's computational efficiency. The overall model structure represents text semantics well, exhibits good complementarity and correlation in its semantic features, and raises the classification accuracy for text generated from audio and video.
Step S7 specifically includes:
converting the category information of the generated text into speech for output, and/or
converting the category information of the generated text into speech and transmitting it to a client terminal for output, and/or
converting the category information of the generated text into text for output, and/or
converting the category information of the generated text into text and transmitting it to a client terminal for output, and/or
converting the category information of the generated text into an image for output, and/or
converting the category information of the generated text into an image and transmitting it to a client terminal for output.
As described in step S7, in this embodiment, once the classification result information has been obtained through the above steps, it can be converted into speech and broadcast through the smart refrigerator's built-in audio playback device; converted into text and displayed directly on the display device of the smart refrigerator; or converted into an image and shown directly on the refrigerator's large screen. The result information can also be transmitted by voice communication to a client terminal for output, where the client terminal is an electronic device capable of receiving information: the speech can be sent to a mobile phone, smart speaker, or Bluetooth headset for playback, and the classification result can be delivered as text or images by SMS, email, or similar channels to client terminals such as phones and tablets, or to applications installed on them, for the user to review. Classification results can thus be output through multiple channels and in multiple forms, so the user is not limited to obtaining the information near the smart refrigerator. Combined with the multi-channel real-time speech acquisition provided by the present invention, the user can interact with the smart refrigerator remotely, which is highly convenient and greatly improves the user experience. In other embodiments of the present invention, only one or several of the above output modes may be adopted, or the classification results may be output through other channels based on the existing art; the present invention places no specific restriction on this.
In summary, the present invention provides a method for classifying text generated from audio and video based on multimodal deep learning. It acquires real-time and historical audio-video data through multiple channels, processes the data, and converts the voice data and video data into the corresponding voice text data and image text data. Combining the contextual information of the text generated from the audio and video, it fully extracts text semantic features through the multi-channel multi-size deep convolutional neural network model and the multi-channel multi-size temporal deep convolutional neural network model, obtains the classification result for the generated text, and outputs that result through multiple channels. The method not only significantly improves the accuracy of generated-text classification but also makes the interaction between users and the smart refrigerator more convenient and diversified, greatly improving the user experience.
Based on the same inventive concept, the present invention also provides an electrical appliance, comprising:
a memory for storing executable instructions; and
a processor which, when running the executable instructions stored in the memory, implements the above text classification method based on multimodal deep learning.
Based on the same inventive concept, the present invention also provides a refrigerator, comprising:
a memory for storing executable instructions; and
a processor which, when running the executable instructions stored in the memory, implements the above text classification method based on multimodal deep learning.
Based on the same inventive concept, the present invention also provides a computer-readable storage medium storing executable instructions which, when executed by a processor, implement the above text classification method based on multimodal deep learning.
It should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. The specification is written this way merely for clarity; those skilled in the art should treat the specification as a whole, and the technical solutions of the embodiments may be combined as appropriate to form other embodiments that those skilled in the art can understand.
The detailed descriptions set out above are merely specific explanations of feasible embodiments of the present invention and are not intended to limit its scope of protection; all equivalent embodiments or modifications that do not depart from the technical spirit of the present invention shall fall within its scope of protection.
Claims (15)
- A text classification method based on multimodal deep learning, characterized by comprising the steps of: acquiring real-time audio-video data and historical audio-video data; preprocessing the real-time audio-video data and the historical audio-video data to obtain valid voice data and video data; transcribing the valid voice data into voice text data; acquiring video images of a local area in the valid video data and transcribing the video images into image text data; obtaining, from the voice text data and the image text data, the context information of the text data and the weight information of the text semantic features; combining the context information and the weight information through fully connected layers and outputting them to a classifier to compute scores and obtain classification result information, and determining the category information of the text generated from the audio-video data; and outputting the category information of the generated text.
- The text classification method based on multimodal deep learning according to claim 1, wherein said "preprocessing the real-time audio-video data and the historical audio-video data to obtain valid voice data and video data" specifically comprises: performing data cleaning, format parsing, format conversion, and data storage on the real-time audio-video data and the historical audio-video data to obtain valid audio-video data; separating the valid audio-video data into speech and video using a script or a third-party tool to obtain the voice data and the video data; and preprocessing the voice data and the video data, including framing and windowing the voice data, and cropping and framing the video data.
- The text classification method based on multimodal deep learning according to claim 1, wherein said "transcribing the valid voice data into voice text data" specifically comprises: extracting features of the valid voice data to obtain speech features; inputting the speech features into a multi-channel multi-size deep convolutional neural network model for speech recognition to transcribe them into first voice text data; outputting the alignment between the speech features and the first voice text data based on the connectionist temporal classification method to obtain second voice text data; obtaining, based on an attention mechanism, key features of the second voice text data or weight information of the key features; and combining the second voice text data and its key features or key-feature weights through fully connected layers, then computing scores with a classification function to obtain the voice text data.
- The text classification method based on multimodal deep learning according to claim 3, wherein said "extracting features of the valid voice data" specifically comprises: extracting the features of the valid voice data to obtain its Mel-frequency cepstral coefficient features.
- The text classification method based on multimodal deep learning according to claim 1, wherein said "acquiring video images of a local area in the valid video data and transcribing the video images into image text data" specifically comprises: acquiring a video image of the lip area from the video data; inputting the video image of the lip area into a 3D convolutional neural network model for computation to obtain image features; inputting, based on the image lip-reading recognition method, the image features into a multi-channel multi-size temporal deep convolutional neural network model for transcription to obtain first image text data; outputting the alignment between the image features and the first image text data based on the connectionist temporal classification method to obtain second image text data; and combining the second image text data through fully connected layers, then computing scores with a classification function to obtain the image text data.
- The text classification method based on multimodal deep learning according to claim 5, wherein said "inputting the video image of the lip area into a 3D convolutional neural network model for computation to obtain image features" specifically comprises: segmenting the lip-area video data into consecutive lip picture frames; and inputting the consecutive lip picture frames into the 3D convolutional neural network model for computation, extracting multiple kinds of features to obtain the image features.
- The text classification method based on multimodal deep learning according to claim 6, wherein said "inputting, based on the image lip-reading recognition method, the image features into a multi-channel multi-size temporal deep convolutional neural network model for transcription to obtain first image text data" specifically comprises: inputting the image features into the multi-channel multi-size temporal deep convolutional neural network for computation to obtain temporal image features; mapping, according to the image lip-reading recognition method, the temporal image features to the pinyin sequence of a pinyin sentence; and translating the pinyin sequence into the Chinese character sequence of the corresponding Chinese sentence.
- The text classification method based on multimodal deep learning according to claim 1, wherein said "obtaining, from the voice text data and the image text data, the context information of the text data and the weight information of the text semantic features" specifically comprises: converting the voice text data and the image text data into voice text word vectors and image text word vectors; and inputting the voice text word vectors and the image text word vectors into a bidirectional long short-term memory network model to obtain context feature vectors containing the feature information of the voice text data and the image text data.
- The text classification method based on multimodal deep learning according to claim 8, further comprising: distinguishing, based on an attention mechanism model, the self-weight information and/or associated weight information of words and phrases among the text features of the voice text data and the image text data to obtain the weight information of the text semantic features.
- The text classification method based on multimodal deep learning according to claim 9, wherein said "distinguishing, based on an attention mechanism model, the self-weight information and/or associated weight information of words and phrases among the text features of the voice text data and the image text data" specifically comprises: inputting the voice text context feature vector and the image text context feature vector into a self-attention mechanism and a mutual attention mechanism, respectively; obtaining a self-weight text attention feature vector containing the self-weight information of the voice text semantic features and the image text semantic features; and obtaining an associated-weight text attention feature vector containing the associated weight information of the voice text semantic features and the image text semantic features.
- The text classification method based on multimodal deep learning according to claim 10, wherein said "combining the context information and the weight information through fully connected layers and outputting them to a classifier to compute scores and obtain classification result information, and determining the category information of the text generated from the audio-video data" specifically comprises: combining the context feature vectors and the weighted text attention feature vectors through fully connected layers and outputting them to the classification function, computing the text-semantic scores of the voice text data and the image text data and their normalized results, and obtaining the category information of the generated text.
- The text classification method based on multimodal deep learning according to claim 1, wherein said "transcribing the voice data into voice text data" further comprises: obtaining configuration data stored in an external cache, and performing the multi-channel multi-size deep convolutional neural network model computation on the voice data based on the configuration data to transcribe text and extract text features.
- An electrical appliance, characterized by comprising: a memory for storing executable instructions; and a processor for implementing, when running the executable instructions stored in the memory, the text classification method based on multimodal deep learning according to any one of claims 1 to 12.
- A refrigerator, characterized by comprising: a memory for storing executable instructions; and a processor for implementing, when running the executable instructions stored in the memory, the text classification method based on multimodal deep learning according to any one of claims 1 to 12.
- A computer-readable storage medium storing executable instructions, characterized in that the executable instructions, when executed by a processor, implement the text classification method based on multimodal deep learning according to any one of claims 1 to 12.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211734528.2 | 2022-12-31 | ||
CN202211734528.2A CN116108176A (en) | 2022-12-31 | 2022-12-31 | Text classification method, equipment and storage medium based on multi-modal deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024140430A1 true WO2024140430A1 (en) | 2024-07-04 |
Family
ID=86266818
Family Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---|
PCT/CN2023/140831 WO2024140430A1 (en) | 2022-12-31 | 2023-12-22 | Text classification method based on multimodal deep learning, device, and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN116108176A (en) |
WO (1) | WO2024140430A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118675092A (en) * | 2024-08-21 | 2024-09-20 | 南方科技大学 | Multi-mode video understanding method based on large language model |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116108176A (en) * | 2022-12-31 | 2023-05-12 | 青岛海尔电冰箱有限公司 | Text classification method, equipment and storage medium based on multi-modal deep learning |
CN116890786A (en) * | 2023-09-11 | 2023-10-17 | 江西五十铃汽车有限公司 | Vehicle lock control method, device and medium |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104361276A (en) * | 2014-11-18 | 2015-02-18 | 新开普电子股份有限公司 | Multi-mode biometric authentication method and multi-mode biometric authentication system |
CN110276259A (en) * | 2019-05-21 | 2019-09-24 | 平安科技(深圳)有限公司 | Lip reading recognition methods, device, computer equipment and storage medium |
CN112818861A (en) * | 2021-02-02 | 2021-05-18 | 南京邮电大学 | Emotion classification method and system based on multi-mode context semantic features |
CN113408385A (en) * | 2021-06-10 | 2021-09-17 | 华南理工大学 | Audio and video multi-mode emotion classification method and system |
US20210319093A1 (en) * | 2020-04-09 | 2021-10-14 | International Business Machines Corporation | Using multimodal model consistency to detect adversarial attacks |
CN113590769A (en) * | 2020-04-30 | 2021-11-02 | 阿里巴巴集团控股有限公司 | State tracking method and device in task-driven multi-turn dialogue system |
CN114944156A (en) * | 2022-05-20 | 2022-08-26 | 青岛海尔电冰箱有限公司 | Article classification method, device and equipment based on deep learning and storage medium |
CN115062143A (en) * | 2022-05-20 | 2022-09-16 | 青岛海尔电冰箱有限公司 | Voice recognition and classification method, device, equipment, refrigerator and storage medium |
CN116108176A (en) * | 2022-12-31 | 2023-05-12 | 青岛海尔电冰箱有限公司 | Text classification method, equipment and storage medium based on multi-modal deep learning |
Also Published As
Publication number | Publication date |
---|---|
CN116108176A (en) | 2023-05-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021082941A1 (en) | Video figure recognition method and apparatus, and storage medium and electronic device | |
WO2023222088A1 (en) | Voice recognition and classification method and apparatus | |
WO2024140430A1 (en) | Text classification method based on multimodal deep learning, device, and storage medium | |
WO2024140434A1 (en) | Text classification method based on multi-modal knowledge graph, and device and storage medium | |
CN110460872B (en) | Information display method, device and equipment for live video and storage medium | |
CN113408385A (en) | Audio and video multi-mode emotion classification method and system | |
WO2023222089A1 (en) | Item classification method and apparatus based on deep learning | |
WO2023222090A1 (en) | Information pushing method and apparatus based on deep learning | |
WO2024140432A1 (en) | Ingredient recommendation method based on knowledge graph, and device and storage medium | |
CN105512348A (en) | Method and device for processing videos and related audios and retrieving method and device | |
CN110517689A (en) | A kind of voice data processing method, device and storage medium | |
CN117077787A (en) | Text generation method and device, refrigerator and storage medium | |
CN109710799B (en) | Voice interaction method, medium, device and computing equipment | |
CN114138960A (en) | User intention identification method, device, equipment and medium | |
CN115798459B (en) | Audio processing method and device, storage medium and electronic equipment | |
CN111462732B (en) | Speech recognition method and device | |
CN112581937A (en) | Method and device for acquiring voice instruction | |
CN113724689B (en) | Speech recognition method and related device, electronic equipment and storage medium | |
CN113763925B (en) | Speech recognition method, device, computer equipment and storage medium | |
CN114492579A (en) | Emotion recognition method, camera device, emotion recognition device and storage device | |
WO2024188276A1 (en) | Text classification method and refrigeration device system | |
US20230326369A1 (en) | Method and apparatus for generating sign language video, computer device, and storage medium | |
CN112235183B (en) | Communication message processing method and device and instant communication client | |
CN114283493A (en) | Artificial intelligence-based identification system | |
CN118551343B (en) | Multi-mode large model construction method, system, refrigeration equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 23910356; Country of ref document: EP; Kind code of ref document: A1 |