CN118233706A - Live broadcasting room scene interaction application method, device, equipment and storage medium - Google Patents
Live broadcasting room scene interaction application method, device, equipment and storage medium
- Publication number
- CN118233706A (application CN202410423213.9A)
- Authority
- CN
- China
- Prior art keywords
- live
- output
- scene
- interaction
- generation data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/47—End-user applications
- H04N21/475—End-user interface for inputting end-user data, e.g. personal identification number [PIN], preference data
- H04N21/4758—End-user interface for inputting end-user data, e.g. personal identification number [PIN], preference data for providing answers, e.g. voting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/151—Transformation
- G06F40/16—Automatic learning of transformation rules, e.g. from examples
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/042—Knowledge-based neural networks; Logical representations of neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
- G06N5/041—Abduction
Abstract
The embodiment of the specification provides a live broadcasting room scene interaction application method, device, equipment and storage medium, wherein the method comprises the following steps: acquiring live interaction scene generation data, and preprocessing the live interaction scene generation data; training and optimizing based on the processed live interaction scene generation data to obtain an intention recognition model for live interaction scene classification; for different live interaction scenes, training and optimizing based on the processed live interaction scene generation data respectively to obtain a plurality of large models for outputting different dialogue contents; based on text input by the user, classifying the live interaction scene through the intention recognition model, and based on the classification, calling the corresponding large model to output dialogue content under the corresponding scene to the user. The invention solves the problems that chat robots in the prior art can only give generic replies, audience experience is poor, personalized audience requirements cannot be met, and the live broadcast effect is greatly affected.
Description
Technical Field
The present document relates to the technical field of entertainment services and deep learning, and in particular, to a live broadcast room scene interactive application method, apparatus, device and storage medium.
Background
With the development of internet technology, the live broadcast industry has developed rapidly in the past few years. It covers various fields including entertainment, e-commerce, education, health, etc. However, due to the real-time nature and interactivity of live broadcast, supervision is difficult, and some live platforms carry content that violates laws, regulations and ethics. On the other hand, live platforms also need to introduce more real-time interaction functions, such as voting, lotteries, question-and-answer sessions and the like, so that audience participation is higher and audiences can interact and communicate with the anchor. However, perfecting these areas requires a lot of manpower, time and effort. AI technology has become very popular in recent years; AI can solve various problems by continuously learning to accumulate knowledge and experience and applying what it has learned. Different learning methods and techniques can be applied to different scenarios and problems, providing intelligent solutions. Therefore, a system combining AI technology and live broadcast is urgently needed, so that more personalized and intelligent responses can be provided for users, user experience is improved, improper content can be monitored, and the workload of manual auditing is reduced.
Disclosure of Invention
The invention aims to provide a live broadcasting room scene interactive application method, device, equipment and storage medium, and aims to solve the problems in the prior art.
The invention provides a live broadcasting room scene interaction application method, which comprises the following steps:
acquiring live interaction scene generation data, and preprocessing the live interaction scene generation data;
Training and optimizing based on the processed live broadcast interaction scene generation data to obtain an intention recognition model for live broadcast interaction scene classification;
Aiming at different live broadcast interaction scenes, respectively using the processed live broadcast interaction scene generation data to train and optimize a plurality of ChatGPT-derived interactive scene models so as to adapt to different interaction scenes and generate relevant text replies or dialogue contents;
based on the user input text, classifying the live interaction scene through the intention recognition model, and based on the classification, calling a corresponding ChatGPT derivative interaction scene model to output dialogue content under the corresponding scene to the user.
The invention provides a live broadcasting room scene interaction application device, which comprises:
The preprocessing module is used for acquiring live broadcast interaction scene generation data and preprocessing the live broadcast interaction scene generation data;
The first training module is used for training and optimizing based on the processed live broadcast interaction scene generation data to obtain an intention recognition model for live broadcast interaction scene classification;
The second training module is used for training and optimizing a plurality of ChatGPT-derived interactive scene models by respectively using the processed live broadcast interactive scene generation data aiming at different live broadcast interactive scenes so as to adapt to different interactive scenes and generate relevant text replies or dialogue contents;
And the processing module is used for classifying the live broadcast interaction scene through the intention recognition model based on the text input by the user, and calling a corresponding ChatGPT derivative interaction scene model based on the classification to output the dialogue content under the corresponding scene to the user.
The embodiment of the invention also provides an electronic device, which comprises: a memory, a processor, and a computer program stored in the memory and capable of running on the processor, wherein the computer program, when executed by the processor, implements the steps of the live broadcast room scene interaction application method.
The embodiment of the invention also provides a computer readable storage medium, wherein the computer readable storage medium stores an information transmission implementation program, and the program is executed by a processor to realize the steps of the live broadcasting room scene interactive application method.
According to the technical scheme, the problems that in the prior art a chat robot can only give generic replies, audience experience is poor, individual demands of audiences cannot be met, and the live broadcast effect is greatly affected are solved. An LSTM-A model based on an attention mechanism is provided in the intention recognition module; the model can predict scenes, and the attention mechanism helps the model concentrate on key information when predicting scenes, so that the model can better understand the intention of the audience. Then, the multi-layer ChatGPT-derived interactive scene model is called to generate intelligent replies for specific scenes, and a model library similar to AIGC is trained to enhance the knowledge and capability of the digital human, so that the digital human can understand a wider range of problems and topics, more personalized and intelligent responses can be provided, and viewers can enjoy a more comfortable live broadcast experience.
Drawings
For a clearer description of one or more embodiments of the present description or of the solutions of the prior art, the drawings that are necessary for the description of the embodiments or of the prior art will be briefly described, it being apparent that the drawings in the description that follow are only some of the embodiments described in the description, from which, for a person skilled in the art, other drawings can be obtained without inventive faculty.
FIG. 1 is a flow chart of a live room scene interactive application method according to an embodiment of the invention;
FIG. 2 is a detailed process flow diagram of a live room scene interactive application method according to an embodiment of the invention;
FIG. 3 is a schematic diagram of a novel intent recognition training network architecture in accordance with an embodiment of the present invention;
FIG. 4 is a flow chart of a novel intent recognition model training and evaluation in accordance with an embodiment of the present invention;
FIG. 5 is a diagram illustrating a ChatGPT-derived interactive scene model training and evaluation process according to one embodiment of the present invention;
FIG. 6 is a schematic diagram of a live room scene interactive application device according to an embodiment of the invention;
FIG. 7 is a schematic diagram of an electronic device according to an embodiment of the invention.
Detailed Description
In order to solve the problems, the embodiment of the invention provides a live broadcasting room scene interaction application method based on intention recognition and ChatGPT derived interaction scene model technology, application scene classification is carried out by using intention recognition, a ChatGPT derived interaction scene technology is used for providing a more intelligent and personalized interaction system for a live broadcasting room, and the intention recognition and ChatGPT derived interaction scene model technology are combined to construct better service experience.
In order to enable a person skilled in the art to better understand the technical solutions in one or more embodiments of the present specification, the technical solutions in one or more embodiments of the present specification will be clearly and completely described below with reference to the drawings in one or more embodiments of the present specification, and it is obvious that the described embodiments are only some embodiments of the present specification, not all embodiments. All other embodiments, which can be made by one or more embodiments of the present disclosure without inventive faculty, are intended to be within the scope of the present disclosure.
Method embodiment one
The existing ChatGPT large-model technology is a single general-purpose large model that can simultaneously perform natural language tasks such as intention recognition and text generation, but it is not the best choice for a live broadcast room scene. According to the embodiment of the invention, a large-model framework more suitable for the live broadcast room scene is constructed by combining a pre-model with ChatGPT-derived interactive scene models. The advantages of this include: the task of the pre-model is typically to perform some specific natural language processing tasks, such as intent recognition, entity recognition, emotion analysis, etc. Its function is to extract key information from the user's input and classify, analyze or process it, which allows different components to be optimized for specific tasks, so that the output is more accurate for specific inputs. In contrast, a single integrated model must handle multiple tasks in one model, and may therefore exhibit relatively high ambiguity in some cases. If the data training is inadequate or the weight settings are unreasonable, this may result in a single model showing no significant change in response to different inputs, or being unable to handle a particular task with high accuracy. Furthermore, from the point of view of available training compute, training a single comprehensive model may require more resources, while training the pre-model in a hybrid model requires fewer resources, which directly affects cost and resource requirements. Accordingly, embodiments of the present invention provide a new inventive solution based on the above considerations.
According to an embodiment of the present invention, a live broadcasting room scene interaction application method is provided, fig. 1 is a flowchart of a live broadcasting room scene interaction application method according to an embodiment of the present invention, and as shown in fig. 1, the live broadcasting room scene interaction application method according to an embodiment of the present invention specifically includes:
Step S101, acquiring live interaction scene generation data, and preprocessing the live interaction scene generation data; the method specifically comprises the following steps:
Acquiring live broadcast interaction scene generation data with different live broadcast interaction scene labels, performing a set-to-unknown (UNK) operation on invalid content in the live broadcast interaction scene generation data, performing word segmentation and serial-number processing to convert the text into IDs that a machine can recognize, constructing a data set and storing it.
Step S102, training and optimizing based on the processed live broadcast interaction scene generation data to obtain an intention recognition model for live broadcast interaction scene classification; the method specifically comprises the following steps:
Inputting the processed live broadcast interaction scene generation data into the LSTM to obtain a relation matrix between vectors; calculating the similarity between the hidden state of the current LSTM unit and the hidden states of other positions in the sequence to obtain attention weights; multiplying the output at each moment by the attention weight to be used as the weighted output of the current moment; performing full-connection and activation operations on the relation matrix, obtaining an output result through matrix transformation, performing loss calculation on the output result and the real sample label, and updating and optimizing relevant parameters through reverse gradients according to the loss calculation result, to obtain the trained intention recognition model LSTM-A.
Step S103, respectively using the processed live broadcast interaction scene generation data to train and optimize a plurality of ChatGPT-derived interactive scene models aiming at different live broadcast interaction scenes so as to adapt to different interaction scenes and generate relevant text replies or dialogue contents; the method specifically comprises the following steps:
According to different live interaction scenes, encoding operation is carried out on the processed live interaction scene generation data through an encoder, an obtained encoding result is used as input of a decoder, current output is used as input of the next time according to a circulating structure, model training, loss calculation, reverse gradient updating and parameter optimizing operation are carried out, and a plurality of trained ChatGPT derivative interaction scene models for outputting different dialogue contents are obtained according to the different live interaction scenes.
Step S104, based on the text input by the user, classifying the live interaction scene through the intention recognition model, and based on the classification, calling a corresponding ChatGPT derivative interaction scene model to output dialogue content under the corresponding scene to the user. The method specifically comprises the following steps:
converting the words input by the user into numerical data which can be processed by a computer through word vector conversion, and capturing semantic relations among the words input by the user;
Inputting the word-vector-converted user input text into the intention recognition model and executing Linear and ReLU activation through the intention recognition model; in the fully connected layer of the intention recognition model, each neuron is connected with all neurons of the previous layer, the features of the user input text are linearly combined, and nonlinear transformation is introduced through the activation function; after full connection, the prediction result of the live interaction scene is output; a SoftMax formula is then executed on the prediction result for confidence processing, converting the multi-class output values into a probability distribution whose values lie in the range [0, 1] and sum to 1, and the output value with the largest confidence is selected as the final prediction category result.
Performing ID conversion, filling (padding) conversion and word vector conversion on the user input text; performing LSTM on the matrix obtained after conversion to obtain three outputs: output (h_t), hidden_output (h_t) and cell_output (C_t); performing N layers of transformation on the output (h_t), namely performing the coding operation through the encoder to obtain the content corresponding to the user input text and the correlation information of each content, wherein the N layers of transformation comprise Self-attention formula conversion, residual formula f1 conversion, LayerNorm formula conversion and Linear formula conversion, and the residual formula f1 is:
y = β·f(x) + x    (formula 5);
f(x) = N(u^T σ(W_1·V_i + W_2·Q_{t−1}))    (formula 6);
wherein y represents the output of the layer, β represents a super parameter for adjusting f(x), f(x) is the self-attention transformation performed on the input x, and + represents an element-level addition, i.e. the Add operation; u^T, W_1 and W_2 are vector matrices used to convert the dimensions of the two matrices V_i and Q_{t−1}; V_i and Q_{t−1} are the results obtained from the word vectors through matrix conversion; σ represents the nonlinear activation function, i.e. the sigmoid function; and N represents normal-distribution processing of f(x), the purpose being to select the K largest values;
LayerNorm formula:
wherein x represents a feature value, γ and β represent the scaling factor and the displacement factor, u and σ represent the mean and the standard deviation, e serves to prevent division-by-zero errors, and α represents a super parameter for adjusting the displacement size;
Taking the hidden_output (h_t) and cell_output (C_t) obtained by the encoder as the last-moment input of the decoder and executing the LSTM to obtain the decoder output together with its hidden_output and cell_output; performing attention calculation between the decoder output and the encoded output to obtain the relation between the current information and the previous information; splicing and fully connecting this with the decoder output to produce the current output, which is used as the input of the next round; the same operation is executed again, and the ChatGPT-derived interactive scene model continuously outputs the prediction sequence until a stop is encountered or the maximum output length is reached; the text obtained at that moment is the output text, and the dialogue content under the corresponding scene is output to the user according to the output text;
Modeling a historical dialogue through ChatGPT derived interactive scene models, and optimizing the ChatGPT derived interactive scene models through historical modeling.
In summary, in the embodiment of the present invention, the user speaking is analyzed in real time based on the intention recognition technology, and the intention and the demand of the user are determined. This may help the anchor better understand the audience's questions or feedback. A chat robot assistant can be constructed by utilizing ChatGPT derived interactive scene model technology to provide intelligent answers and interactions for the anchor and audience. The ChatGPT derivative interactive scene model can quickly generate a reasonable and relevant reply when the audience sends a message or question. Live broadcasting room scene interaction based on intention recognition and ChatGPT derived interaction scene model technology can achieve more intelligent and personalized user interaction, provide better service experience and enhance interaction effect between the anchor and audience.
The invention is described in further detail below with reference to the attached drawings and detailed description:
As shown in fig. 2, the method specifically comprises the following steps:
step S1: different scene inputs.
During the live broadcast process, different scenes may be faced, such as a lottery session, a viewer sending gifts, improper speech appearing in viewer comments, malicious screen-flooding, a user on the ranking list entering the live broadcast room, and the like. At these times users have different intentions and demands, so different scenes need to be classified, and the different data generated by the different scenes are constructed and stored.
Step S2: and (5) data processing.
The data of different scenes differ, and the data needs to be processed. On the one hand, invalid content in the data requires an UNK (set-to-unknown) operation so that it becomes content that does not participate in training. On the other hand, the processed text data needs to be converted, through jieba word segmentation and serial-number processing, into IDs that a machine can recognize; a dataset is constructed, and the ID information and model features are stored for convenient later retrieval. Through these data operations, the constructed model performs better.
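As an illustration of this preprocessing step, the following minimal Python sketch uses the jieba segmentation named above; the reserved ID values, padding length and example sentences are illustrative assumptions rather than the patent's actual settings.

```python
import jieba

PAD, UNK = 0, 1          # reserved IDs; the actual values are an assumption
MAX_LEN = 32             # assumed padding length

def build_vocab(texts):
    """Build a word-to-ID vocabulary from labelled live-room texts."""
    vocab = {"<PAD>": PAD, "<UNK>": UNK}
    for text in texts:
        for word in jieba.lcut(text):
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(text, vocab, invalid_words=frozenset()):
    """Segment with jieba, map invalid content to UNK, convert to IDs and pad."""
    ids = [UNK if w in invalid_words else vocab.get(w, UNK)
           for w in jieba.lcut(text)]
    return ids[:MAX_LEN] + [PAD] * max(0, MAX_LEN - len(ids))

# Example: one labelled sample from a lottery-scene dialogue (hypothetical text)
vocab = build_vocab(["主播什么时候开始抽奖", "这个礼物多少钱"])
sample = encode("主播什么时候开始抽奖", vocab)
```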
Step S3: novel intention recognition model training and evaluation. FIG. 3 is a schematic diagram of a novel intent recognition training network architecture in accordance with an embodiment of the present invention.
After the data processing of S2, LSTM-A is executed to obtain a relation matrix among vectors; the specific formulas are as follows:
k = C_t * C_i, i = 1, 2, 3, …    (formula 2);
C_t = C_t + k·C_i, i = 1, 2, 3, …    (formula 3);
wherein C_t is the output at the current time, C_{t−1} is the output at the previous time, f_t represents a weight matrix, k represents the weight matrix obtained after attention calculation, C_i represents the outputs at all times, and the current-time input is represented as follows:
wherein h_{t−1} represents the hidden-layer output at the previous moment, b_c represents the offset, x_t represents the input feature, α represents a super parameter for adjusting the weight, W_c represents a matrix for adjusting the size of the matrix, and tanh represents the tanh function.
Compared with the prior art, the LSTM-A model based on the attention mechanism is proposed for the first time by improving the LSTM formula.
Firstly, for the cell layer, the forget gate and the input gate are coupled, so that the LSTM no longer considers separately what to forget and what information to add, but considers them together; for example, the user usage information at the last moment and the user usage information at the current moment can be considered together, the model can analyze their importance and make a targeted trade-off, and the training speed is greatly accelerated.
Second, LSTM-A uses an attention mechanism to solve the long-term dependency problem of conventional LSTM. For example, for the past usage information of a user, the conventional LSTM model may forget past contents, but the present application makes it easier for LSTM-A to memorize the past usage information of the user and also to automatically pay attention to the context information related to the current information, thereby making adjustments to the dialogue contents.
In a word, through adjustment and optimization, LSTM-A can remember the user and predict user preferences more accurately, make targeted replies and recommendations, and greatly improve the user experience.
And then performing full connection and activation operation on the matrix, obtaining an output result through matrix transformation, and performing loss calculation with a real sample label, and updating and optimizing related parameters in a reverse gradient manner.
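The following PyTorch sketch shows one possible reading of the LSTM-A computation described in this step: the outputs of all time steps are compared with the current output (formula 2), the attention-weighted combination is added back (formula 3), and the result passes through full connection and activation before the loss calculation. The use of a standard LSTM as the base recurrence, the dot-product similarity, the softmax normalisation of k, and all layer sizes and class counts are assumptions, not the patent's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LSTMA(nn.Module):
    """Sketch of an attention-augmented LSTM ("LSTM-A") intent classifier."""
    def __init__(self, vocab_size, emb_dim=128, hidden=256, num_classes=6):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, ids):                      # ids: (batch, seq_len)
        h_all, _ = self.lstm(self.emb(ids))      # C_i for every time step
        c_t = h_all[:, -1:, :]                   # current-time output C_t
        # formula 2: k = C_t * C_i (similarity between current and all steps)
        k = F.softmax((c_t * h_all).sum(-1), dim=-1)          # (batch, seq_len)
        # formula 3: C_t = C_t + k * C_i (attention-weighted update)
        context = torch.bmm(k.unsqueeze(1), h_all).squeeze(1)
        fused = c_t.squeeze(1) + context
        return self.fc(torch.relu(fused))        # full connection + activation

# Training step (loss calculation and reverse-gradient update), as in step S3:
# logits = model(batch_ids); loss = F.cross_entropy(logits, labels); loss.backward()
```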
Step S4: ChatGPT-derived interactive scene model training and evaluation.
Step S3 obtains application scenes of various categories, and trains different ChatGPT derivative interactive scene dialogue models aiming at different scenes. And (3) performing coding operation on the data processed in the step (S2), taking the obtained result as the input of a decoder, and performing training, calculation loss, inverse gradient updating and parameter optimizing operation according to a cyclic structure, namely taking the current output as the input of the next time. And finally, storing the trained model.
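A minimal sketch of this per-scene training procedure is given below: the encoder state initialises the decoder, the previous output serves as the next input, and the loss drives the reverse-gradient update. Teacher forcing, the LSTM layer sizes, and the padding ID are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SceneDialogueModel(nn.Module):
    """Sketch of one per-scene encoder-decoder dialogue model (step S4)."""
    def __init__(self, vocab_size, emb=128, hidden=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb, padding_idx=0)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, src_ids, tgt_ids):
        _, (h, c) = self.encoder(self.emb(src_ids))   # encoding stage
        dec_out, _ = self.decoder(self.emb(tgt_ids), (h, c))
        return self.out(dec_out)                      # (batch, tgt_len, vocab)

def train_step(model, optimizer, src_ids, tgt_ids, pad_id=0):
    """One training iteration: loss calculation and reverse-gradient update."""
    logits = model(src_ids, tgt_ids[:, :-1])          # previous output as next input
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        tgt_ids[:, 1:].reshape(-1),
        ignore_index=pad_id)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```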
Step S5: and (5) model calling.
After the model is trained, the model is used, and different dialogue contents can be output by the model for different application scenes of the live broadcasting room, so that the user experience and participation are enhanced.
Step S6: a novel intent model is executed.
Intent recognition is a natural language processing technique used to understand intent and purpose in user sentences. However, because the application scene of the application is a live broadcast scene, the internal data generated in different live broadcast interaction scenes have larger difference. Therefore, based on the application, a novel intention recognition model is provided, different scenes can be accurately classified, and the speed of calling the follow-up ChatGPT derived interactive scene model and the fault tolerance rate of the follow-up ChatGPT derived interactive scene model are greatly improved.
The intention recognition model of step S6 comprises the following sub-steps:
The novel intention recognition execution flow is shown in FIG. 4.
And step A1, word vector conversion.
The id obtained in step S2 is converted into a word vector that can be recognized by a computer through matrix transformation. The computer cannot directly understand and process the text information, but word vector conversion can convert the words into numerical data which can be processed by the computer, and capture the semantic relationship among the words.
Step A2, BiLSTM (i.e., LSTM-A).
LSTM-A conversion is carried out on the content after word vector conversion. LSTM-A couples the forget gate and the input gate, so that the LSTM no longer considers separately what to forget and what information to add, but considers them together; for example, the user usage information at the last moment and the user usage information at the current moment can be considered together, the model can analyze their importance and make a targeted trade-off, and the training speed is greatly accelerated.
Second, LSTM-A uses an attention mechanism to solve the long-term dependency problem of conventional LSTM. For example, for the past usage information of a user, the conventional LSTM model may forget past contents, but the present application makes it easier for LSTM-A to memorize the past usage information of the user and also to automatically pay attention to the context information related to the current information, thereby making adjustments to the dialogue contents. In a word, through adjustment and optimization, LSTM-A can remember the user and predict user preferences more accurately, make targeted replies and recommendations, and greatly improve the user experience.
Step A3, full-connect.
At this stage, Linear and ReLU activation need to be executed. In the fully connected layer, each neuron is connected to all neurons of the previous layer, and each connection has a weight. Such connections can linearly combine the various features of the input data and introduce nonlinear transformation through the activation function, resulting in richer and more complex representation and feature-extraction capability.
Step A4, softMax.
The prediction results of all classes are output after full connection, and a SoftMax formula needs to be executed on the results for confidence processing. The SoftMax function converts the multi-class output values into a probability distribution whose values lie in [0, 1] and sum to 1, and the result with the largest confidence is selected as the prediction category result.
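A minimal sketch of steps A3-A4 follows: the fully connected outputs are converted by SoftMax into a probability distribution and the class with the largest confidence is taken as the prediction. The scene names shown are illustrative, not the patent's actual categories.

```python
import torch
import torch.nn.functional as F

def predict_scene(logits, scene_names):
    """Convert fully connected outputs into a [0, 1] probability distribution
    that sums to 1, then pick the class with the highest confidence (step A4)."""
    probs = F.softmax(logits, dim=-1)
    conf, idx = probs.max(dim=-1)
    return scene_names[idx.item()], conf.item()

# e.g. predict_scene(model(ids), ["lottery", "gift", "moderation", "chat"])
# (the scene names here are illustrative, not the patent's actual categories)
```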
Step S7: scene prediction.
The result output by the intention recognition model is the currently most probable application scene, and different ChatGPT-derived interactive scene models are called according to the current scene.
Step S8: calling the ChatGPT-derived interactive scene model.
Step S8, calling the ChatGPT-derived interactive scene model according to the different scenes, comprises the following steps:
The execution flow of calling the ChatGPT-derived interactive scene model is shown in FIG. 5.
Step B1: and (3) a coding stage.
Inputting the text from intention recognition, performing ID conversion, filling (padding) conversion and word vector conversion; performing LSTM on the converted matrix to obtain three outputs: output (h_t), hidden_output (h_t) and cell_output (C_t); and performing N Transformer layers on the output layer. The Transformer layers contain Self-attention, f1 (the residual formula), LayerNorm, Linear, etc., wherein the residual formula is expressed as:
y = β·f(x) + x    (formula 5);
f(x) = N(u^T σ(W_1·V_i + W_2·Q_{t−1}))    (formula 6);
where y is the output of the layer, β represents the super parameter for adjusting f(x), f(x) is the self-attention transformation performed on the input x, and + represents the element-level addition, i.e. the Add operation. In formula f(x), u^T, W_1 and W_2 are vector matrices used to conveniently convert the dimensions of the two matrices V_i and Q_{t−1}; V_i and Q_{t−1} are the results obtained by performing matrix conversion on the word vectors; σ represents the sigmoid function, also called the Logistic function, which is a common nonlinear activation function; and N represents normal-distribution processing of f(x), in order to select the K largest values.
Compared with the prior art, the traditional residual error formula is adjusted.
Firstly, the traditional residual formula directly adds the original input features to a certain layer of the network; in the novel residual formula, a super parameter is added to the original input for optimization and adjustment, so as to prevent the current input x from being influenced too much or too little. For example, the original user input information is adjusted by the parameter and used together with the current user input information as the current input, and this adjustment of the novel residual formula prevents the current user information from being overwritten.
Secondly, the original attention calculation formula is adjusted: on the basis of the original weight calculation, normal-distribution processing is carried out, and the K largest values are selected from the processed result. For example, a user history may contain many pieces of information; if the weights of all the information were calculated, the workload and resource consumption would be huge, so K pieces of information need to be selected in a targeted manner for weight optimization. Through this operation, the amount of calculation can be greatly reduced and model calculation is accelerated.
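The following sketch gives one possible reading of formulas 5 and 6: attention scores are produced by σ(W_1·V_i + W_2·Q_{t−1}) projected by u^T, only the K largest scores are kept, and the weighted result is scaled by β and added back to the input. The top-K masking strategy, layer shapes, and the default values of K and β are assumptions.

```python
import torch
import torch.nn as nn

class TopKResidualAttention(nn.Module):
    """Sketch of the adjusted residual block: y = beta * f(x) + x, where f(x)
    keeps only the K largest attention scores (one reading of formulas 5-6)."""
    def __init__(self, dim, k=8, beta=0.5):
        super().__init__()
        self.w1 = nn.Linear(dim, dim, bias=False)   # W1 acting on V_i
        self.w2 = nn.Linear(dim, dim, bias=False)   # W2 acting on Q_{t-1}
        self.u = nn.Linear(dim, 1, bias=False)      # u^T projection
        self.k, self.beta = k, beta

    def forward(self, x, v, q_prev):
        # x: (batch, dim), v: (batch, seq, dim), q_prev: (batch, dim)
        scores = self.u(torch.sigmoid(
            self.w1(v) + self.w2(q_prev).unsqueeze(1))).squeeze(-1)  # (batch, seq)
        # keep only the K largest weights to cut the attention computation
        topk = torch.topk(scores, k=min(self.k, scores.size(-1)), dim=-1)
        mask = torch.zeros_like(scores).scatter(-1, topk.indices, 1.0)
        weights = torch.softmax(scores.masked_fill(mask == 0, float("-inf")), dim=-1)
        f_x = torch.einsum("bs,bsd->bd", weights, v)  # weighted sum over V_i
        return self.beta * f_x + x                    # residual addition (Add)
```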
LayerNorm formula:
wherein x represents a feature value, γ and β represent the scaling factor and the displacement factor, u and σ represent the mean and the standard deviation, e serves to prevent division-by-zero errors, and α represents a super parameter for adjusting the displacement size.
Compared with the prior art, the traditional LayerNorm calculation formula is improved: a super parameter is added to adjust and control the displacement factor. For example, when the user usage information data is inconsistent, LayerNorm (normalization) processing is needed so that the data are more amenable to calculation; this can effectively improve the stability of the model, increase the convergence speed, and enhance the generalization capability of the model.
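A sketch of the adjusted LayerNorm described above is given below; since the exact formula is not reproduced in the text, the placement of the super parameter α (here scaling the displacement factor β) is an assumption based on the parameter description.

```python
import torch
import torch.nn as nn

class AdjustableLayerNorm(nn.Module):
    """LayerNorm variant with a super parameter alpha adjusting the displacement
    factor beta; the placement of alpha is an assumption, not the patent's formula."""
    def __init__(self, dim, alpha=1.0, eps=1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))   # scaling factor
        self.beta = nn.Parameter(torch.zeros(dim))   # displacement factor
        self.alpha = alpha                           # super parameter
        self.eps = eps                               # prevents division by zero

    def forward(self, x):
        u = x.mean(dim=-1, keepdim=True)             # mean
        sigma = x.std(dim=-1, keepdim=True)          # standard deviation
        return self.gamma * (x - u) / (sigma + self.eps) + self.alpha * self.beta
```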
Through the operation of the encoding stage, we obtain the input content of the user and the correlation of each content, and then take the output result as the input of the decoder, and output the output result word by word according to the cyclic structure.
Step B2: and a decoding stage.
Taking the hidden_output and cell_output obtained by the encoder as the last-moment input of the decoder and executing the LSTM likewise yields a decoder output together with its hidden_output and cell_output. Attention is computed between the decoder output and the encoded output to obtain the relation between the current information and the previous information; the result is spliced and fully connected with the decoder output to produce the current output, which is taken as the input of the next round, and the same operation is executed again. The ChatGPT-derived interactive scene model continuously outputs the predicted sequence until a stop is encountered or the maximum output length is reached. The text obtained at this point is the output text.
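A minimal greedy-decoding sketch of this decoding stage follows, reusing the SceneDialogueModel sketch from step S4: the encoder state initialises the decoder, each current output becomes the next input, and generation stops at a stop token or the maximum length. The start/stop token IDs and greedy (argmax) selection are assumptions.

```python
import torch

def generate_reply(model, src_ids, bos_id=2, eos_id=3, max_len=64):
    """Greedy decoding loop for the decoding stage (step B2).
    bos_id/eos_id and greedy selection are illustrative assumptions."""
    with torch.no_grad():
        _, (h, c) = model.encoder(model.emb(src_ids))   # encoder state -> decoder init
        token = torch.tensor([[bos_id]])
        output_ids = []
        for _ in range(max_len):
            dec_out, (h, c) = model.decoder(model.emb(token), (h, c))
            token = model.out(dec_out[:, -1]).argmax(dim=-1, keepdim=True)
            if token.item() == eos_id:                  # stop token encountered
                break
            output_ids.append(token.item())             # current output -> next input
        return output_ids
```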
Step B3: history memory.
In ChatGPT derivative interactive scene model dialogue learning, it models historical dialogues, not just single sentences, by studying past data and events, finding patterns and rules therein, and applying those patterns and rules to future scenarios. Through historical modeling, the dialogue generating capacity and fluency of the model can be better improved.
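One simple way to realise this history memory is to keep a sliding window of past turns and prepend it to the current input; the sketch below assumes such a window, whose size and concatenation format are illustrative, not the patent's modelling method.

```python
from collections import deque

class DialogueHistory:
    """Keeps the last N turns so the scene model can condition replies on
    past dialogue rather than a single sentence (window size is an assumption)."""
    def __init__(self, max_turns=5):
        self.turns = deque(maxlen=max_turns)

    def add(self, user_text, bot_text):
        self.turns.append((user_text, bot_text))

    def as_context(self, current_text):
        past = " ".join(f"用户:{u} 回复:{b}" for u, b in self.turns)
        return f"{past} 用户:{current_text}"   # concatenated context fed to the model
```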
Step S9: and (5) dialog identification.
Through the processing of the novel intention recognition model and the multi-layer ChatGPT derived interactive scene model, the output text is the dialogue content aiming at different application scenes.
In the embodiment of the invention, the foregoing "classifying live interaction scenes based on user input text through the intention recognition model" can be further refined and expanded: the input text can be expanded to the language used in live broadcast, and an algorithm for recognizing specific intentions from the live-broadcast language is developed according to the following steps:
1. Defining intent categories
Identifying an intention type: the intention category to be identified is determined, such as question, feedback, request information, express emotion, etc.
Scene analysis: language patterns and user interactions common in different live scenes (e.g., games, education, interviews) are analyzed.
2. Data preparation and preprocessing
And (3) data collection: language samples in various live scenes are collected, including text, voice and video data. In the process of model construction, multi-modal data fusion needs to be considered, and information of different modalities is integrated into a unified model so as to improve the comprehensive understanding of user intention; appropriate feature extraction and multi-modal data fusion strategies need to be designed. The following are 5 innovative approaches to multi-modal data fusion.
(1) Cross-modal attention mechanisms. As it relates to text, voice and video data, it is necessary to design a cross-modal attention mechanism: the attention mechanism can dynamically adjust the attention degree of the model to different modal data. For example, in a live scene, the anchor's voice and facial expressions may simultaneously provide critical information, so the model may adaptively focus on important information therein through a cross-modal attentiveness mechanism.
This cross-modal attention mechanism enables flexible adjustment of points of interest between different data modalities (e.g., text, voice, video) to effectively capture critical information. This capability is critical to understanding the complex user intent in a live interaction scenario. However, to more fully capture and understand these multidimensional data, we also need to consider how to efficiently integrate the information of these different modalities. This leads to the second innovation, (2) the multi-channel fusion model. This model provides a more comprehensive view of the data by processing the data of each modality independently and then integrating their information in a fusion layer. The method not only maintains the unique characteristics of each modality, but also reveals the complex relationships that may exist between the modalities, further enhancing the model's understanding of and responsiveness to live interaction content.
(2) A multi-channel fusion model. A multi-channel model is constructed that is capable of processing text, speech and video simultaneously. Each channel has its own feature extraction and processing flow, and finally their information is integrated together by a fusion layer. This approach can better capture the complex relationships between the various modalities; a minimal sketch is given after this list of approaches.
Firstly, carrying out modal data processing.
Text channel: text data (e.g., chat, barrage comments) in the live broadcast is processed using NLP technology. This may include using word embedding techniques (such as Word2Vec or BERT) to convert text data into vector form, and applying text classification or emotion analysis models.
Voice channel: and processing the voice data in the live broadcast. This includes speech recognition (converting speech to text), and possibly speech emotion analysis, to identify the emotion and intent of the speaker.
Video channel: the live video stream is analyzed and computer vision techniques (e.g., face recognition, expression analysis) are used to identify and interpret visual information, such as facial expressions and gestures of the host.
The multi-channel data fusion is then performed.
And (3) designing a fusion strategy: an effective fusion strategy is designed to combine the feature vectors of different channels into a comprehensive feature representation. This may involve simple vector concatenation, or more complex fusion techniques (such as weighted averaging or attention mechanisms).
Realization of a fusion layer: the fusion policy is implemented in a fusion layer of the model to ensure that information from different modalities is comprehensively considered.
(3) Timing alignment and synchronization: in multi-modality data fusion, timing alignment and synchronization issues also need to be considered to ensure that the information of the different modalities is corresponding in time to avoid confusion or inconsistencies, which is critical for identifying user intent, especially in fast-changing live scenes.
(4) Cross-modal information enhancement: information of one modality is enhanced by information of the other modality. For example, text information is used to enhance speech recognition, or text emotion analysis is enhanced by emotion expression in video. Such information interleaving may provide more comprehensive and accurate intent recognition.
(5) Dynamic modality selection:
And dynamically selecting the most relevant mode for processing according to the current live broadcast scene and the user behavior. This can be achieved by introducing a dynamic weight or modality selection network to accommodate different live interaction scenarios.
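The sketch below, referenced from approach (2) above, illustrates a fusion layer that projects per-channel features to a common size and combines them with attention-derived modality weights (covering the idea of approach (1) as well). The channel feature sizes and the single-linear attention scorer are assumptions.

```python
import torch
import torch.nn as nn

class MultiChannelFusion(nn.Module):
    """Sketch of the multi-channel fusion layer: per-modality features are
    produced by separate channels, then combined with attention-derived weights.
    The encoders are placeholders; feature sizes are assumptions."""
    def __init__(self, text_dim=768, audio_dim=128, video_dim=256, fused_dim=256):
        super().__init__()
        self.proj = nn.ModuleDict({
            "text": nn.Linear(text_dim, fused_dim),
            "audio": nn.Linear(audio_dim, fused_dim),
            "video": nn.Linear(video_dim, fused_dim),
        })
        self.attn = nn.Linear(fused_dim, 1)    # cross-modal attention scores

    def forward(self, feats):                  # feats: dict of (batch, dim) tensors
        projected = torch.stack(
            [self.proj[name](feats[name]) for name in ("text", "audio", "video")],
            dim=1)                             # (batch, 3, fused_dim)
        weights = torch.softmax(self.attn(projected), dim=1)   # modality weights
        return (weights * projected).sum(dim=1)                # fused representation
```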
In the data preparation and preprocessing steps, the data fusion is completed, and the data marking and preprocessing are further carried out.
And (3) data marking: and labeling the collected data, and determining the intention category of each sample.
Pretreatment: cleaning data, and performing necessary text conversion, such as word segmentation, stop word removal, part-of-speech tagging and the like.
3. Feature extraction
Word embedding: text is converted into numerical vectors using models such as Word2Vec, GloVe, or BERT (a minimal sketch is given after this section).
Contextual characteristics: context information in the live context, such as the relation of front and rear sentences, is considered to improve recognition accuracy. For the multi-channel fusion mentioned in the foregoing (2), independent feature extraction is required: the feature extraction method is designed independently for each channel. For example, a text channel may focus on keywords and semantic structures, a speech channel may focus on intonation and speaking speed, and a video channel may focus on facial expressions and body language.
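As a sketch of the word-embedding step, the snippet below uses a pretrained BERT encoder via the Hugging Face transformers library; the model name and the mean-pooled sentence vector are illustrative choices, and Word2Vec or GloVe could equally be substituted.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# "bert-base-chinese" is an illustrative choice; any Word2Vec/GloVe/BERT
# encoder named in the text could stand in here.
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
encoder = AutoModel.from_pretrained("bert-base-chinese")

def embed(text):
    """Convert a live-room utterance into a contextual vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=64)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state   # (1, seq_len, 768)
    return hidden.mean(dim=1)                          # simple sentence vector
```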
4. Model selection and training
Model construction: an appropriate machine learning model, such as LSTM, GRU or Transformer, is selected.
In this method embodiment, the consideration of using a Transformer model is based on the following:
Performance advantage: the Transformer model is widely recognized as having significant advantages in processing natural language tasks (e.g., text understanding, generation, etc.). Because of its attention mechanism, it can handle long-range dependencies and contextual information more efficiently, which is important for understanding and generating complex live interactive content.
Strong adaptability: the Transformer model is known for its flexibility and adaptability. It can adapt well to data sets of various sizes and types, so it can work effectively in different live interaction scenarios.
Wide application cases: since the introduction of the Transformer model, it has been widely used for various language processing tasks and has achieved leading performance in many benchmark tests. This increases the reliability of its application in live interaction scenarios.
Continuous technical progress: research and development of Transformer models is still progressing rapidly, and new variants and improvements continually emerge. This means that their application in live interactive applications may perform better and offer more functionality as the technology advances.
Efficient parallel processing: the Transformer model is particularly suited to parallel processing, which is important for processing large amounts of real-time data (as is common in live scenes). It can process information faster, providing real-time or near-real-time responses.
In summary, using a Transformer model may bring more processing power, better performance and wider adaptability to live interactive applications, which may be key to raising existing systems to a new level.
Training and verification: training the model by using the marked data, and testing the performance of the model on the verification set.
5. Algorithm optimization
Adjusting super parameters: and adjusting super parameters such as learning rate, layer number, hidden unit number and the like according to the verification result.
Characteristic engineering: different feature combinations and text representation methods are tried to optimize the model performance.
6. Integration and testing
API development: an API interface is developed that enables the live platform to invoke the intent recognition algorithm in real time (a minimal sketch is given after this section).
And (3) testing in real time: and performing real-time testing in a live environment, and evaluating the response time and accuracy of the algorithm.
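A minimal sketch of such an API, using FastAPI as one possible framework, is shown below; the endpoint name, payload fields, scene categories and the placeholder classifier are all illustrative assumptions.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
SCENE_NAMES = ["lottery", "gift", "moderation", "chat"]   # illustrative categories

class IntentRequest(BaseModel):
    room_id: str
    text: str

def classify(text: str):
    """Placeholder for the trained intent recognition model; it would return
    the predicted scene name and its confidence."""
    return SCENE_NAMES[0], 0.97

@app.post("/intent")
def recognize_intent(req: IntentRequest):
    """Real-time intent recognition endpoint the live platform can call."""
    scene, confidence = classify(req.text)
    return {"room_id": req.room_id, "scene": scene, "confidence": confidence}
```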
7. Iteration and feedback
Collecting user feedback: feedback on the performance of the algorithm is collected from the anchor and audience.
And (5) continuously iterating: and continuously iterating the optimization algorithm according to the feedback and the test result.
8. Deployment and monitoring
Algorithm deployment: and deploying the trained model to a live platform.
And (3) performance monitoring: the performance of the algorithm in actual use is continuously monitored, and stability and accuracy are ensured.
Through the above steps, an intention recognition algorithm for live interaction language is developed; the algorithm can adapt to different live scenes and effectively recognize and respond to user intentions.
The invention provides a live broadcasting room scene interaction application method based on intention recognition and ChatGPT derived interaction scene model technology, which has the following beneficial effects:
1. the embodiment of the invention can classify different live broadcasting room scenes based on the intention recognition technology, and the experience of different scene users is also different, so that the scene classification is more beneficial to the targeted reply of the system;
2. According to the embodiment of the invention, a multi-layer ChatGPT-derived interactive scene model technology is used, and a plurality of chat robot assistants are constructed for different scenes; for different chat problems, training yields targeted replies, the interaction effect is enhanced, and the viewing experience of the audience is improved;
3. the embodiment of the invention improves the residual calculation formula, so that a deeper network can be trained and optimized more easily. The introduction of residual error connection is also based on modeling the 'residual error' of the network, and gradually improving the performance of the network by optimizing the residual error;
4. according to the embodiment of the invention, the LayerNorm calculation formula is improved, and the LayerNorm technology is used in the neural network training process, so that the model stability can be effectively improved, the convergence rate can be increased, and the model generalization capability can be enhanced;
5. The embodiment of the invention improves the LSTM calculation formula and provides an LSTM-A formula based on an attention mechanism; the forget gate and the input gate are coupled so that LSTM-A does not consider what to forget and what information to add independently but considers them together, which increases the training speed, and the attention mechanism solves the long-term dependence problem of the traditional LSTM.
6. The invention can handle scenes of text input and can be expanded to text, voice and video information in live broadcast; specific intention categories, such as questions, feedback, information requests, emotional expression, etc., can be identified from this information, and the approach is suitable for the common language patterns and user interactions in different live scenes (such as games, education and interviews).
Device embodiment 1
According to an embodiment of the present invention, a live-room scene interaction application device is provided, and fig. 6 is a schematic diagram of the live-room scene interaction application device according to the embodiment of the present invention, as shown in fig. 6, where the live-room scene interaction application device according to the embodiment of the present invention specifically includes:
the preprocessing module 60 is configured to acquire live interaction scene generation data, and preprocess the live interaction scene generation data;
the first training module 62 is configured to train and optimize based on the processed live interaction scene generation data, to obtain an intention recognition model for live interaction scene classification;
The second training module 64 is configured to train and optimize a plurality of ChatGPT-derived interactive scene models by respectively using the processed live interaction scene generation data for different live interaction scenes, so as to adapt to different interaction scenes and generate relevant text replies or dialogue contents;
The processing module 66 is configured to classify live interaction scenes through the intent recognition model based on text input by a user, and call a corresponding ChatGPT derived interaction scene model based on the classification to output dialogue content in a corresponding scene to the user.
In summary, the embodiment of the application provides a live broadcast room scene interaction application device based on intention recognition and ChatGPT-derived interactive scene model technology. It differs from the prior art, in which a traditional chat robot can only give generic replies, audience experience is poor, the personalized requirements of audiences cannot be met, and the live broadcast effect is greatly affected. In the intention recognition module, an LSTM-A model based on an attention mechanism is creatively provided for the first time; the model can predict the scene, and the attention mechanism helps the model focus on key information when predicting the scene, so that the model can better understand the intention of the audience. Then, the multi-layer ChatGPT-derived interactive scene model is called to generate intelligent replies for specific scenes, and a model library similar to AIGC is trained to enhance the knowledge and capability of the digital human, so that the digital human can understand a wider range of problems and topics, more personalized and intelligent responses can be provided, and viewers can enjoy a more comfortable live broadcast experience.
The embodiment of the present invention is an embodiment of a device corresponding to the embodiment of the method, and specific operations of each module may be understood by referring to descriptions of the embodiment of the method, which are not repeated herein.
Device example two
An embodiment of the present invention provides an electronic device, as shown in fig. 7, including: memory 70, processor 72 and a computer program stored on the memory 70 and executable on the processor 72, which when executed by the processor 72 performs the steps as described in the method embodiments.
Device embodiment 3
Embodiments of the present invention provide a computer-readable storage medium on which a program for implementing information transfer is stored; when executed by a processor, the program carries out the steps described in the method embodiments.
The computer readable storage medium of the present embodiment includes, but is not limited to: ROM, RAM, magnetic or optical disks, etc.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents, and such modifications and substitutions do not depart from the spirit of the invention.
Claims (10)
1. The live broadcasting room scene interaction application method is characterized by comprising the following steps of:
acquiring live interaction scene generation data, and preprocessing the live interaction scene generation data;
Training and optimizing based on the processed live broadcast interaction scene generation data to obtain an intention recognition model for live broadcast interaction scene classification;
Aiming at different live broadcast interaction scenes, respectively using the processed live broadcast interaction scene generation data to train and optimize a plurality of ChatGPT derivative interaction scene models, so as to adapt to different interaction scenes and generate relevant text replies or dialogue content;
based on the user input text, classifying the live interaction scene through the intention recognition model, and based on the classification, calling a corresponding ChatGPT derivative interaction scene model to output dialogue content under the corresponding scene to the user.
2. The method according to claim 1, wherein the step of acquiring live interaction scene generation data and preprocessing the live interaction scene generation data specifically comprises the steps of:
Acquiring live broadcast interaction scene generation data with different live broadcast interaction scene labels, setting invalid content in the live broadcast interaction scene generation data to the unknown token UNK, performing word segmentation and serial-number processing to convert the words into IDs that can be recognized by a machine, and constructing and storing a data set.
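As an illustration of the preprocessing step described in claim 2, the following minimal Python sketch replaces invalid content with the unknown token UNK, performs word segmentation and serial-number (ID) conversion, and stores the resulting data set. The whitespace tokenizer, the invalid-content rule and the JSON storage format are assumptions, not the claimed procedure.

```python
import json
import re

SPECIALS = {"<PAD>": 0, "<UNK>": 1}

def build_dataset(labelled_records, out_path="dataset.json"):
    """labelled_records: list of (text, scene_label) pairs with live-scene labels."""
    vocab = dict(SPECIALS)
    dataset = []
    for text, scene_label in labelled_records:
        # Mark invalid content (here: tokens with no word characters) as the unknown token.
        tokens = [t if re.match(r"\w+", t) else "<UNK>" for t in text.split()]
        ids = []
        for t in tokens:
            if t not in vocab:
                vocab[t] = len(vocab)          # word-segmentation result -> serial number
            ids.append(vocab[t])               # machine-recognizable ID
        dataset.append({"ids": ids, "label": scene_label})
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump({"vocab": vocab, "data": dataset}, f, ensure_ascii=False)
    return vocab, dataset

vocab, data = build_dataset([("which hero should I pick ???", "game"),
                             ("please explain this formula", "education")])
```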
3. The method according to claim 1, wherein the training and optimizing based on the processed live interaction scene generation data to obtain the intention recognition model for live interaction scene classification specifically comprises the steps of:
inputting the processed live broadcast interaction scene generation data into an LSTM to obtain a relation matrix between vectors, calculating the similarity between the hidden state of the current LSTM unit and the hidden states at other positions in the sequence to obtain attention weights, multiplying the output at each moment by the attention weight to serve as the weighted output of the current moment, performing full-connection and activation operations on the relation matrix, obtaining an output result through matrix transformation, performing loss calculation on the output result and the real sample label, and updating and optimizing the model parameters through inverse gradients according to the loss calculation result, to obtain a trained intention recognition model LSTM-A.
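For illustration, a minimal PyTorch sketch of the LSTM-with-attention idea described in claim 3 follows: attention weights are obtained from the similarity between the hidden state at the current position and the hidden states at other positions, each moment's output is weighted accordingly, and the weighted result passes through a fully connected layer and an activation before loss calculation and a gradient update. The layer sizes and the dot-product similarity are assumptions and do not reproduce the claimed LSTM-A model exactly.

```python
import torch
import torch.nn as nn

class LSTMA(nn.Module):
    # Sizes below are illustrative assumptions, not the patented configuration.
    def __init__(self, vocab_size=30000, embed=128, hidden=256, n_scenes=5):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, embed)
        self.lstm = nn.LSTM(embed, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_scenes)

    def forward(self, ids):
        h, _ = self.lstm(self.emb(ids))                    # (B, T, H) hidden states
        sim = torch.bmm(h, h[:, -1:, :].transpose(1, 2))   # similarity to the current hidden state
        attn = torch.softmax(sim, dim=1)                   # attention weight per position
        weighted = (h * attn).sum(dim=1)                   # weight each moment's output
        return self.fc(torch.tanh(weighted))               # full connection + activation

model = LSTMA()
ids = torch.randint(0, 30000, (4, 12))                     # toy batch of token IDs
labels = torch.randint(0, 5, (4,))                         # real sample labels
loss = nn.CrossEntropyLoss()(model(ids), labels)           # loss calculation
loss.backward()                                            # inverse-gradient parameter update step
```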
4. The method of claim 1, wherein the training and optimizing based on the processed live interaction scene generation data according to the difference of the live interaction scenes to obtain a plurality of ChatGPT derivative interaction scene models for outputting different dialogue contents specifically comprises the steps of:
According to different live interaction scenes, encoding operation is carried out on the processed live interaction scene generation data through an encoder, an obtained encoding result is used as input of a decoder, current output is used as input of the next time according to a circulating structure, model training, loss calculation, reverse gradient updating and parameter optimizing operation are carried out, and a plurality of trained ChatGPT derivative interaction scene models for outputting different dialogue contents are obtained according to the different live interaction scenes.
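As an illustration of claim 4, the following minimal PyTorch sketch trains one scene-specific encoder-decoder: the encoding result seeds the decoder, and in the recurrent loop the current output is fed back as the next input, followed by loss calculation and a reverse-gradient parameter update. The dimensions, the greedy feedback and the cross-entropy loss are assumptions, not the ChatGPT derivative interaction scene model itself.

```python
import torch
import torch.nn as nn

V, E, H = 1000, 64, 128                       # assumed vocabulary and layer sizes
emb = nn.Embedding(V, E)
encoder = nn.LSTM(E, H, batch_first=True)
decoder_cell = nn.LSTMCell(E, H)
out_proj = nn.Linear(H, V)
opt = torch.optim.Adam(list(emb.parameters()) + list(encoder.parameters())
                       + list(decoder_cell.parameters()) + list(out_proj.parameters()))

src = torch.randint(0, V, (2, 10))            # user text for this live scene (toy IDs)
tgt = torch.randint(0, V, (2, 8))             # reference reply (toy IDs)

_, (h, c) = encoder(emb(src))                 # encoding result -> decoder initial state
h, c = h.squeeze(0), c.squeeze(0)
inp = torch.zeros(2, dtype=torch.long)        # assumed <BOS> id 0
loss = 0.0
for t in range(tgt.size(1)):                  # cyclic structure: output feeds the next step
    h, c = decoder_cell(emb(inp), (h, c))
    logits = out_proj(h)
    loss = loss + nn.functional.cross_entropy(logits, tgt[:, t])
    inp = logits.argmax(dim=-1)               # current output becomes the next input

opt.zero_grad()
loss.backward()                               # reverse-gradient update
opt.step()                                    # parameter optimisation
```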
5. The method of claim 3, wherein inputting the processed live interaction scene generation data into the LSTM to obtain the relationship matrix between vectors specifically comprises:
The relationship matrix between the vectors is obtained according to formulas 1 to 4:
k = c_t * C_i, i = 1, 2, 3, ... (formula 2);
C_t = C_t + k·C_i, i = 1, 2, 3, ... (formula 3);
wherein C_t is the output at the current moment, C_{t-1} is the output at the previous moment, f_t represents a weight matrix, k represents the weight matrix obtained after attention calculation, C_i represents the outputs at all moments, c_t indicates the current-moment input, h_{t-1} indicates the hidden-layer output at the previous moment, b_c indicates the offset, x_t indicates the input features, α indicates a hyper-parameter for adjusting the weights, W_c indicates the matrix for adjusting the matrix size, and tanh indicates the tanh function.
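A small numeric illustration of formulas 2 and 3, as read from the text above, is given below. Since the original formula images (formulas 1 and 4) are not reproduced in this text, the shapes and the element-wise interpretation of the products are assumptions.

```python
import numpy as np

C_i = np.array([[0.2, 0.5], [0.1, 0.9], [0.7, 0.3]])   # outputs at all previous moments
c_t = np.array([0.4, 0.6])                              # current-moment input
C_t = np.array([0.3, 0.7])                              # output at the current moment

k = c_t * C_i                        # formula 2: attention weight per historical output
C_t = C_t + (k * C_i).sum(axis=0)    # formula 3: weighted correction of the current output
print(k)
print(C_t)
```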
6. The method according to claim 1, wherein the classification of live interaction scenes by the intention recognition model specifically comprises the steps of:
converting the words input by the user into numerical data which can be processed by a computer through word vector conversion, and capturing semantic relations among the words input by the user;
Inputting the user input text after word vector conversion into the intention recognition model, executing Linear and ReLU activations through the intention recognition model, wherein in the fully-connected layer of the intention recognition model each neuron is connected with all neurons of the previous layer, linearly combining all features of the user input text and introducing nonlinear transformation through an activation function, outputting a prediction result of the live broadcast interaction scene after full connection, calculating the prediction result through the Softmax formula and performing confidence processing to convert the multi-class output values into a probability distribution in the range [0,1] summing to 1, and selecting the output value with the highest confidence as the final predicted category result.
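For illustration, the classification step of claim 6 can be sketched in PyTorch as follows: word-vector conversion, fully connected Linear layers with a ReLU activation, Softmax confidence processing, and selection of the highest-confidence scene. The embedding dimension, the layer sizes and the five scene classes are assumptions made for the sketch.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, n_scenes = 30000, 128, 5            # assumed sizes
word_vectors = nn.Embedding(vocab_size, embed_dim)          # word-vector conversion
head = nn.Sequential(
    nn.Linear(embed_dim, 64),   # full connection: every neuron sees all upstream features
    nn.ReLU(),                  # nonlinear transformation
    nn.Linear(64, n_scenes),
)

user_ids = torch.randint(0, vocab_size, (1, 9))              # tokenised user input (toy IDs)
features = word_vectors(user_ids).mean(dim=1)                # pooled sentence representation
probs = torch.softmax(head(features), dim=-1)                # confidences in [0,1], summing to 1
scene = probs.argmax(dim=-1)                                 # highest-confidence scene as result
print(probs, scene)
```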
7. The method according to claim 6, wherein the step of calling the corresponding ChatGPT derivative interaction scene model to output the dialogue content in the corresponding scene to the user based on the classification specifically comprises the steps of:
Performing ID conversion, filling conversion and word vector conversion on the user input text, applying the LSTM to the matrix obtained after conversion to obtain three outputs, output(h_t), hidden_output(h_t) and cell_output(C_t), and performing N layers of conversion on the output(h_t) layer, namely performing an encoding operation through an encoder to obtain the content corresponding to the user input text and the correlation information of each content, wherein the N layers of conversion comprise Self-Attention formula conversion, residual formula f1 conversion, LayerNorm formula conversion and Linear formula conversion, and the residual formula f1 is as follows:
y = β·f(x) + x (formula 5);
f(x) = N(u^T·σ(W_1·V_i + W_2·Q_{t-1})) (formula 6);
wherein y represents the output of the layer, β represents a hyper-parameter for adjusting f(x), f(x) is the self-attention transformation performed on the input x, + represents an element-level addition operation, i.e. the Add operation, u^T, W_1 and W_2 are vector matrices for converting the dimensions of the two matrices V_i and Q_{t-1}, V_i and Q_{t-1} are the results obtained from the word vectors through matrix conversion, σ represents the nonlinear activation function, i.e. the sigmoid function, and N represents taking the normal distribution of f(x), the purpose being to select the K largest values;
LayerNorm formula:
wherein x represents a feature value, γ and β represent the scaling factor and the displacement factor, u and σ represent the mean and the standard deviation, e serves to prevent division-by-zero errors, and α represents a hyper-parameter for adjusting the displacement size;
Taking the hidden_output(h_t) and cell_output(C_t) obtained from the encoder as the last-moment input of the decoder, executing the LSTM to obtain decoder_output, hidden_output and cell_output, performing attention calculation on the decoder_output and the encoded output to obtain the relation between the current information and the previous information, splicing and fully connecting the decoder_output and outputting the current output, which serves as the input of the next round in an iterative operation, continuously outputting a prediction sequence through the ChatGPT derivative interaction scene model until a stop condition is met or the maximum output length is reached, outputting the text obtained at that moment, and outputting the dialogue content in the corresponding scene to the user according to the output text;
Modeling historical dialogues through the ChatGPT derivative interaction scene model, and optimizing the ChatGPT derivative interaction scene model through the historical modeling.
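As an illustration of one of the N conversion layers in claim 7, the following minimal PyTorch sketch combines a self-attention transform f(x) with the input through the scaled residual y = β·f(x) + x and then applies layer normalisation. PyTorch's built-in multi-head attention and LayerNorm stand in for the claimed formulas, so the exact parameterisation (u^T, W_1, W_2 and the α shift) is not reproduced.

```python
import torch
import torch.nn as nn

class ResidualAttentionLayer(nn.Module):
    def __init__(self, dim=128, beta=1.0):   # dim and beta are illustrative assumptions
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.norm = nn.LayerNorm(dim)         # (x - mean) / (std + eps), then scale and shift
        self.beta = beta

    def forward(self, x):
        f_x, _ = self.attn(x, x, x)           # self-attention transform f(x)
        y = self.beta * f_x + x               # formula 5: scaled residual Add
        return self.norm(y)                   # LayerNorm over the summed output

x = torch.randn(2, 10, 128)                   # encoder hidden states for a user utterance
layer = ResidualAttentionLayer()
print(layer(x).shape)                         # torch.Size([2, 10, 128])
```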
8. A live room scene interactive application device, comprising:
The preprocessing module is used for acquiring live broadcast interaction scene generation data and preprocessing the live broadcast interaction scene generation data;
The first training module is used for training and optimizing based on the processed live broadcast interaction scene generation data to obtain an intention recognition model for live broadcast interaction scene classification;
The second training module is used for training and optimizing a plurality of ChatGPT derivative interaction scene models, by respectively using the processed live broadcast interaction scene generation data for different live broadcast interaction scenes, so as to adapt to different interaction scenes and generate relevant text replies or dialogue content;
And the processing module is used for classifying the live broadcast interaction scene through the intention recognition model based on the text input by the user, and calling a corresponding ChatGPT derivative interaction scene model based on the classification to output the dialogue content under the corresponding scene to the user.
9. An electronic device, comprising: memory, a processor and a computer program stored on the memory and executable on the processor, which when executed by the processor, implements the steps of the live room scene interactive application method as claimed in any one of claims 1 to 7.
10. A computer-readable storage medium, wherein a program for implementing information transfer is stored on the computer-readable storage medium, and the program when executed by a processor implements the steps of the live-room scene interactive application method as claimed in any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410423213.9A CN118233706A (en) | 2024-04-09 | 2024-04-09 | Live broadcasting room scene interaction application method, device, equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410423213.9A CN118233706A (en) | 2024-04-09 | 2024-04-09 | Live broadcasting room scene interaction application method, device, equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118233706A true CN118233706A (en) | 2024-06-21 |
Family
ID=91499114
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410423213.9A Pending CN118233706A (en) | 2024-04-09 | 2024-04-09 | Live broadcasting room scene interaction application method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118233706A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN119052521A (en) * | 2024-08-20 | 2024-11-29 | 南京盛克莱智能科技有限公司 | Interactive system applied to multi-scene digital people |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108536679B (en) | Named entity recognition method, device, equipment and computer readable storage medium | |
CN113255755A (en) | Multi-modal emotion classification method based on heterogeneous fusion network | |
CN111966800B (en) | Emotion dialogue generation method and device and emotion dialogue model training method and device | |
CN110321563B (en) | Text Sentiment Analysis Method Based on Mixed Supervision Model | |
Meng et al. | A multi-message passing framework based on heterogeneous graphs in conversational emotion recognition | |
CN111680147A (en) | Data processing method, device, equipment and readable storage medium | |
CN113065344A (en) | Cross-corpus emotion recognition method based on transfer learning and attention mechanism | |
CN117236911B (en) | Interview evaluation method and system based on artificial intelligence | |
CN110795944A (en) | Recommended content processing method and device, and emotion attribute determining method and device | |
CN114911932A (en) | Heterogeneous graph structure multi-conversation person emotion analysis method based on theme semantic enhancement | |
CN113421551B (en) | Speech recognition method, speech recognition device, computer readable medium and electronic equipment | |
CN117149944A (en) | Multi-mode situation emotion recognition method and system based on wide time range | |
CN115393933A (en) | A video face emotion recognition method based on frame attention mechanism | |
CN114118451A (en) | Training method, interaction method, device and equipment of intelligent interaction model | |
CN112069781A (en) | Comment generation method and device, terminal device and storage medium | |
CN117150320B (en) | Dialog digital human emotion style similarity evaluation method and system | |
CN118467703A (en) | An interactive response method and system based on large language model | |
CN111858875A (en) | Intelligent interaction method, device, equipment and storage medium | |
CN111653270B (en) | Voice processing method and device, computer readable storage medium and electronic equipment | |
CN117271745A (en) | Information processing method and device, computing equipment and storage medium | |
CN117453885A (en) | Question information processing method, device, equipment, storage medium and product | |
CN118233706A (en) | Live broadcasting room scene interaction application method, device, equipment and storage medium | |
CN118551004B (en) | A Chinese dialogue knowledge retrieval method and system based on knowledge retrieval graph | |
CN113486174A (en) | Model training, reading understanding method and device, electronic equipment and storage medium | |
CN118820844A (en) | A multimodal conversation dynamic emotion recognition method based on relational subgraph interaction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||