CN112183197B - Working state determining method and device based on digital person and storage medium - Google Patents
- Publication number
- CN112183197B (publication) · CN202010847552A (application)
- Authority
- CN
- China
- Prior art keywords
- target
- working state
- state
- voice
- user
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2218/00—Aspects of pattern recognition specially adapted for signal processing
- G06F2218/08—Feature extraction
Abstract
The application relates to a digital-person-based working state determination method, device, and storage medium. The method comprises the following steps: acquiring target voice of a target user and a target image captured when the target voice is uttered; performing working state analysis based on the face features corresponding to the target image to obtain a first working state corresponding to the target user; performing working state analysis based on the voice features corresponding to the target voice to obtain a second working state corresponding to the target user; and determining a target working state corresponding to the target user by combining the first working state and the second working state. The digital person is presented through an avatar, and the avatar can be controlled to issue working state prompt information according to the target working state corresponding to the target user, so as to remind the user.
Description
Technical Field
The present application relates to the field of information technologies, and in particular, to a method, an apparatus, and a storage medium for determining a working state based on a digital person.
Background
With the development of science and technology, many aspects of a user's daily life can be intelligently monitored or supervised by smart devices, for example how many steps the user has taken in a day or how long the user has been sedentary.
However, in many cases supervision still has to be performed manually, for example the working state of staff is judged by hand from the completion of work objectives, which makes working state determination inefficient.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a method, apparatus, and storage medium for determining an operating state based on a digital person.
A method of determining an operational state based on a digital person, the method comprising: acquiring target voice of a target user and a target image corresponding to the target voice; carrying out working state analysis based on the face features corresponding to the target image to obtain a first working state corresponding to the target user; analyzing the working state based on the voice characteristics corresponding to the target voice to obtain a second working state corresponding to the target user; and determining a target working state corresponding to the target user by combining the first working state and the second working state.
In some embodiments, the performing working state analysis based on the face feature corresponding to the target image, and obtaining the first working state corresponding to the target user includes: acquiring face features corresponding to the target image, and processing the face features by using a trained expression recognition model to obtain a target expression corresponding to the target user; and analyzing the working state according to the target expression to obtain a first working state corresponding to the target user.
In some embodiments, the target image is a plurality of target images, and the working state analysis based on the face features corresponding to the target image, the obtaining the first working state corresponding to the target user includes: acquiring feature point positions corresponding to a plurality of eye key feature points corresponding to the target image, and acquiring a target closed state corresponding to the eyes of the target user in the target image based on the position difference between the feature point positions; sequencing the target closed states according to the acquisition sequences corresponding to the target images to obtain a closed state sequence; and carrying out working state analysis according to the closed state sequence to obtain a first working state corresponding to the target user.
In some embodiments, the performing working state analysis based on the voice feature corresponding to the target voice, and obtaining the second working state corresponding to the target user includes: acquiring voice attribute information corresponding to the target voice, wherein the voice attribute information comprises at least one of speech speed information or intonation change information; and analyzing the working state based on the voice attribute information to obtain a second working state corresponding to the target user.
In some embodiments, the method further comprises: carrying out semantic emotion analysis on the target voice to obtain a target semantic emotion corresponding to the target voice; the step of analyzing the working state based on the voice attribute information to obtain a second working state corresponding to the target user comprises the following steps: and carrying out working state analysis based on the voice attribute information and the target semantic emotion to obtain a second working state corresponding to the target user.
In some embodiments, the first working state is a working state corresponding to a plurality of analysis dimensions respectively, the second working state is a working state corresponding to a plurality of analysis dimensions respectively, and determining the target working state corresponding to the target user by combining the first working state and the second working state includes: calculating the number of states in which the working states are tired in the first working state and the second working state; and when the state quantity is larger than a preset threshold value or the proportion corresponding to the state quantity is larger than a preset proportion, determining that the target working state corresponding to the target user is tired.
In some embodiments, the method further comprises: acquiring a virtual user image corresponding to the target user; and controlling the virtual user image to send out working state prompt information according to the target working state.
In some embodiments, the controlling the avatar to send out the working state prompt information according to the target working state includes: acquiring face adjustment parameters corresponding to the first working state; performing image adjustment on the virtual user image according to the face adjustment parameters; and controlling the adjusted virtual user image to send out working state prompt information according to the target working state.
In some embodiments, the determining, in combination with the first working state and the second working state, the target working state corresponding to the target user includes: and inputting the first working state and the second working state into a comprehensive state determination model to obtain a target working state corresponding to the target user.
In some embodiments, the first working state is obtained by processing the face feature with a first state determination model, and the second working state is obtained by processing the voice feature with a second state determination model, and the method further includes: acquiring a first training sample, wherein the first training sample comprises training face features and a corresponding first state label, training voice features and a corresponding second state label, and a comprehensive state label; inputting the training face features into a first state determination model to be trained to obtain a first prediction state; inputting the training voice features into a second state determination model to be trained to obtain a second prediction state; inputting the first prediction state and the second prediction state into a comprehensive state determination model to be trained to obtain a third prediction state; obtaining a target model loss value based on the state difference between the first state label and the first prediction state, the state difference between the second state label and the second prediction state, and the state difference between the comprehensive state label and the third prediction state; and adjusting model parameters of the first state determination model to be trained, the second state determination model to be trained, and the comprehensive state determination model to be trained based on the target model loss value to obtain the first state determination model, the second state determination model, and the comprehensive state determination model.
A digital person-based operating state determination device, the device comprising: the image and voice acquisition module is used for acquiring target voice of a target user and a target image corresponding to the target voice when the target voice is sent out; the first working state obtaining module is used for carrying out working state analysis based on the face features corresponding to the target image to obtain a first working state corresponding to the target user; the second working state obtaining module is used for carrying out working state analysis based on the voice characteristics corresponding to the target voice to obtain a second working state corresponding to the target user; and the target working state determining module is used for determining the target working state corresponding to the target user by combining the first working state and the second working state.
In some embodiments, the first operating state obtaining module is configured to: acquiring face features corresponding to the target image, and processing the face features by using a trained expression recognition model to obtain a target expression corresponding to the target user; and analyzing the working state according to the target expression to obtain a first working state corresponding to the target user.
In some embodiments, the target image is a plurality of target images, and the first working state obtaining module is configured to: acquiring feature point positions corresponding to a plurality of eye key feature points corresponding to the target image, and acquiring a target closed state corresponding to the eyes of the target user in the target image based on the position difference between the feature point positions; sequencing the target closed states according to the acquisition sequences corresponding to the target images to obtain a closed state sequence; and carrying out working state analysis according to the closed state sequence to obtain a first working state corresponding to the target user.
In some embodiments, the second operating state obtaining module is configured to: acquiring voice attribute information corresponding to the target voice, wherein the voice attribute information comprises at least one of speech speed information or intonation change information; and analyzing the working state based on the voice attribute information to obtain a second working state corresponding to the target user.
In some embodiments, the apparatus further comprises: the semantic emotion analysis module is used for carrying out semantic emotion analysis on the target voice to obtain a target semantic emotion corresponding to the target voice; the second working state obtaining module is used for: and carrying out working state analysis based on the voice attribute information and the target semantic emotion to obtain a second working state corresponding to the target user.
In some embodiments, the first working state is a working state corresponding to a plurality of analysis dimensions, the second working state is a working state corresponding to a plurality of analysis dimensions, and the target working state determining module is configured to: calculating the number of states in which the working states are tired in the first working state and the second working state; and when the state quantity is larger than a preset threshold value or the proportion corresponding to the state quantity is larger than a preset proportion, determining that the target working state corresponding to the target user is tired.
In some embodiments, the apparatus further comprises: the virtual user image acquisition module is used for acquiring the virtual user image corresponding to the target user; and the working state prompt information sending module is used for controlling the virtual user image to send out the working state prompt information according to the target working state.
In some embodiments, the working state prompt information sending module is configured to: acquire face adjustment parameters corresponding to the first working state; perform image adjustment on the virtual user image according to the face adjustment parameters; and control the adjusted virtual user image to send out working state prompt information according to the target working state.
In some embodiments, the target operating state determination module is to: and inputting the first working state and the second working state into a comprehensive state determination model to obtain a target working state corresponding to the target user.
In some embodiments, the first working state is obtained by processing the face feature by using a first state determination model, the second working state is obtained by processing the voice feature by using a second state determination model, and the apparatus further includes a model training module, where the model training module is configured to: acquire a first training sample, wherein the first training sample comprises training face features and a corresponding first state label, training voice features and a corresponding second state label, and a comprehensive state label; input the training face features into a first state determination model to be trained to obtain a first prediction state; input the training voice features into a second state determination model to be trained to obtain a second prediction state; input the first prediction state and the second prediction state into a comprehensive state determination model to be trained to obtain a third prediction state; obtain a target model loss value based on the state difference between the first state label and the first prediction state, the state difference between the second state label and the second prediction state, and the state difference between the comprehensive state label and the third prediction state; and adjust model parameters of the first state determination model to be trained, the second state determination model to be trained, and the comprehensive state determination model to be trained based on the target model loss value to obtain the first state determination model, the second state determination model, and the comprehensive state determination model.
A computer device comprising a memory storing a computer program and a processor which when executing the computer program performs the steps of: acquiring target voice of a target user and a target image corresponding to the target voice; carrying out working state analysis based on the face features corresponding to the target image to obtain a first working state corresponding to the target user; analyzing the working state based on the voice characteristics corresponding to the target voice to obtain a second working state corresponding to the target user; and determining a target working state corresponding to the target user by combining the first working state and the second working state.
In some embodiments, the performing working state analysis based on the face feature corresponding to the target image, and obtaining the first working state corresponding to the target user includes: acquiring face features corresponding to the target image, and processing the face features by using a trained expression recognition model to obtain a target expression corresponding to the target user; and analyzing the working state according to the target expression to obtain a first working state corresponding to the target user.
In some embodiments, the target image is a plurality of target images, and the working state analysis based on the face features corresponding to the target image, the obtaining the first working state corresponding to the target user includes: acquiring feature point positions corresponding to a plurality of eye key feature points corresponding to the target image, and acquiring a target closed state corresponding to the eyes of the target user in the target image based on the position difference between the feature point positions; sequencing the target closed states according to the acquisition sequences corresponding to the target images to obtain a closed state sequence; and carrying out working state analysis according to the closed state sequence to obtain a first working state corresponding to the target user.
In some embodiments, the performing working state analysis based on the voice feature corresponding to the target voice, and obtaining the second working state corresponding to the target user includes: acquiring voice attribute information corresponding to the target voice, wherein the voice attribute information comprises at least one of speech speed information or intonation change information; and analyzing the working state based on the voice attribute information to obtain a second working state corresponding to the target user.
In some embodiments, the computer program further causes the processor to perform the steps of: carrying out semantic emotion analysis on the target voice to obtain a target semantic emotion corresponding to the target voice; the step of analyzing the working state based on the voice attribute information to obtain a second working state corresponding to the target user comprises the following steps: and carrying out working state analysis based on the voice attribute information and the target semantic emotion to obtain a second working state corresponding to the target user.
In some embodiments, the first working state is a working state corresponding to a plurality of analysis dimensions respectively, the second working state is a working state corresponding to a plurality of analysis dimensions respectively, and determining the target working state corresponding to the target user by combining the first working state and the second working state includes: calculating the number of states in which the working states are tired in the first working state and the second working state; and when the state quantity is larger than a preset threshold value or the proportion corresponding to the state quantity is larger than a preset proportion, determining that the target working state corresponding to the target user is tired.
In some embodiments, the computer program further causes the processor to perform the steps of: acquiring a virtual user image corresponding to the target user; and controlling the virtual user image to send out working state prompt information according to the target working state.
In some embodiments, the controlling the avatar to send out the working state prompt information according to the target working state includes: acquiring face adjustment parameters corresponding to the first working state; performing image adjustment on the virtual user image according to the face adjustment parameters; and controlling the adjusted virtual user image to send out working state prompt information according to the target working state.
In some embodiments, the determining, in combination with the first working state and the second working state, the target working state corresponding to the target user includes: and inputting the first working state and the second working state into a comprehensive state determination model to obtain a target working state corresponding to the target user.
In some embodiments, the first working state is obtained by processing the face feature using a first state determination model, the second working state is obtained by processing the voice feature using a second state determination model, and the computer program further causes the processor to perform the steps of: acquiring a first training sample, wherein the first training sample comprises training face features and a corresponding first state label, training voice features and a corresponding second state label, and a comprehensive state label; inputting the training face features into a first state determination model to be trained to obtain a first prediction state; inputting the training voice features into a second state determination model to be trained to obtain a second prediction state; inputting the first prediction state and the second prediction state into a comprehensive state determination model to be trained to obtain a third prediction state; obtaining a target model loss value based on the state difference between the first state label and the first prediction state, the state difference between the second state label and the second prediction state, and the state difference between the comprehensive state label and the third prediction state; and adjusting model parameters of the first state determination model to be trained, the second state determination model to be trained, and the comprehensive state determination model to be trained based on the target model loss value to obtain the first state determination model, the second state determination model, and the comprehensive state determination model.
A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of: acquiring target voice of a target user and a target image corresponding to the target voice; carrying out working state analysis based on the face features corresponding to the target image to obtain a first working state corresponding to the target user; analyzing the working state based on the voice characteristics corresponding to the target voice to obtain a second working state corresponding to the target user; and determining a target working state corresponding to the target user by combining the first working state and the second working state.
In some embodiments, the performing working state analysis based on the face feature corresponding to the target image, and obtaining the first working state corresponding to the target user includes: acquiring face features corresponding to the target image, and processing the face features by using a trained expression recognition model to obtain a target expression corresponding to the target user; and analyzing the working state according to the target expression to obtain a first working state corresponding to the target user.
In some embodiments, the target image is a plurality of target images, and the working state analysis based on the face features corresponding to the target image, the obtaining the first working state corresponding to the target user includes: acquiring feature point positions corresponding to a plurality of eye key feature points corresponding to the target image, and acquiring a target closed state corresponding to the eyes of the target user in the target image based on the position difference between the feature point positions; sequencing the target closed states according to the acquisition sequences corresponding to the target images to obtain a closed state sequence; and carrying out working state analysis according to the closed state sequence to obtain a first working state corresponding to the target user.
In some embodiments, the performing working state analysis based on the voice feature corresponding to the target voice, and obtaining the second working state corresponding to the target user includes: acquiring voice attribute information corresponding to the target voice, wherein the voice attribute information comprises at least one of speech speed information or intonation change information; and analyzing the working state based on the voice attribute information to obtain a second working state corresponding to the target user.
In some embodiments, the computer program further causes the processor to perform the steps of: carrying out semantic emotion analysis on the target voice to obtain a target semantic emotion corresponding to the target voice; the step of analyzing the working state based on the voice attribute information to obtain a second working state corresponding to the target user comprises the following steps: and carrying out working state analysis based on the voice attribute information and the target semantic emotion to obtain a second working state corresponding to the target user.
In some embodiments, the first working state is a working state corresponding to a plurality of analysis dimensions respectively, the second working state is a working state corresponding to a plurality of analysis dimensions respectively, and determining the target working state corresponding to the target user by combining the first working state and the second working state includes: calculating the number of states in which the working states are tired in the first working state and the second working state; and when the state quantity is larger than a preset threshold value or the proportion corresponding to the state quantity is larger than a preset proportion, determining that the target working state corresponding to the target user is tired.
In some embodiments, the computer program further causes the processor to perform the steps of: acquiring a virtual user image corresponding to the target user; and controlling the virtual user image to send out working state prompt information according to the target working state.
In some embodiments, the controlling the avatar to send out the working state prompt information according to the target working state includes: acquiring face adjustment parameters corresponding to the first working state; performing image adjustment on the virtual user image according to the face adjustment parameters; and controlling the adjusted virtual user image to send out working state prompt information according to the target working state.
In some embodiments, the determining, in combination with the first working state and the second working state, the target working state corresponding to the target user includes: and inputting the first working state and the second working state into a comprehensive state determination model to obtain a target working state corresponding to the target user.
In some embodiments, the first working state is obtained by processing the face feature using a first state determination model, the second working state is obtained by processing the voice feature using a second state determination model, and the computer program further causes the processor to perform the steps of: acquiring a first training sample, wherein the first training sample comprises training face features and a corresponding first state label, training voice features and a corresponding second state label, and a comprehensive state label; inputting the training face features into a first state determination model to be trained to obtain a first prediction state; inputting the training voice features into a second state determination model to be trained to obtain a second prediction state; inputting the first prediction state and the second prediction state into a comprehensive state determination model to be trained to obtain a third prediction state; obtaining a target model loss value based on the state difference between the first state label and the first prediction state, the state difference between the second state label and the second prediction state, and the state difference between the comprehensive state label and the third prediction state; and adjusting model parameters of the first state determination model to be trained, the second state determination model to be trained, and the comprehensive state determination model to be trained based on the target model loss value to obtain the first state determination model, the second state determination model, and the comprehensive state determination model.
According to the digital-person-based working state determination method, device, computer equipment, and storage medium, the voice of a target user and the image captured when the voice is uttered can be acquired; working state analysis is performed on the face features corresponding to the target image to obtain a first working state corresponding to the target user; working state analysis is performed on the voice features corresponding to the target voice to obtain a second working state corresponding to the target user; and the target working state corresponding to the target user is determined by combining the first working state and the second working state. The working state of the user can thus be determined accurately, and the efficiency and accuracy of working state determination are improved.
Drawings
FIG. 1 is a diagram of an application environment for a digital person-based working state determination method in some embodiments;
FIG. 2 is a flow chart of a method of determining a digital person-based operating state in some embodiments;
FIG. 3 is a flow chart of the steps of model training for the comprehensive state determination model in some embodiments;
FIG. 4A is a flow chart of a method of determining a digital person-based operating state in some embodiments;
FIG. 4B is a schematic diagram of an interface for a digital person to prompt according to an operating state in some embodiments;
FIG. 5 is a block diagram of a digital person-based operating condition determination device in some embodiments;
FIG. 6 is a block diagram of a digital person-based operating condition determination device in some embodiments;
FIG. 7 is an internal block diagram of a computer device in some embodiments.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The digital-person-based working state determination method provided by the application can be applied to the application environment shown in FIG. 1, in which the terminal 102 communicates with the server 104 via a network. The terminal 102 is located in the area where the target user is, for example a computer used by the target user, and may be equipped with a camera and a recording device such as a microphone. When the user speaks, the terminal 102 can record the sound and capture images, so that the voice and images of the target user are collected in real time and sent to the server 104 in real time; the server executes the digital-person-based working state determination method provided by the embodiments of the application and can send the obtained target working state to the terminal 102. The terminal 102 may be, but is not limited to, various personal computers, notebook computers, smartphones, tablet computers, and portable wearable devices, and the server 104 may be implemented by a stand-alone server or a server cluster composed of a plurality of servers.
It will be appreciated that the digital-person-based working state determination method of the embodiments of the present application may also be performed at the terminal 102. The digital person in the embodiments of the present application is a virtual person that can assist or replace a real person in performing tasks; for example, a developed program can, when executed, assist or replace a real person in monitoring the working state of staff.
In some embodiments, as shown in fig. 2, a method for determining an operating state based on a digital person is provided, and the method is applied to the server in fig. 1 for illustration, and includes the following steps:
step 202, obtaining target voice of a target user and a corresponding target image when the target voice is sent out.
The target user may be any user, for example a user who operates a terminal to perform work, such as an employee using a computer. The server can also control the terminal to acquire a face image of the user, detect the face image, and check whether the acquired face is consistent with the face corresponding to the account logged in on the terminal; if so, the user is determined to be the target user. For example, for office staff, each computer is logged in with one employee's account; the computer can collect face images of the person sitting in the office chair, compare them with the preset face corresponding to the logged-in account, and, when the comparison passes, determine that this user is the target user.
Specifically, when the remote office software is started, the camera and recording device of the terminal can be started automatically. When the target user speaks, the terminal can collect the user's voice information while simultaneously recording a target image of the user speaking, and upload them to the server.
In some embodiments, the image acquisition device may be turned on only when the user's speech is detected, so as to reduce the consumption of terminal resources.
Step 204, performing working state analysis based on the face features corresponding to the target image to obtain a first working state corresponding to the target user.
The working state refers to the user's state while working. It can be measured by the degree of working fatigue, and the quantification of the fatigue degree can be set as required and divided into a plurality of grades, for example three grades: very tired, slightly tired, and energetic. The face features are features related to the face, and may include at least one of features corresponding to the eyes, features corresponding to the mouth, and features corresponding to the nose, or may be a mental-state feature obtained by combining face features. The face features may include the positions of the feature points of the face and may also include features derived from pixel values. The feature corresponding to the eyes may be, for example, at least one of open or closed. The feature corresponding to the nose may be, for example, at least one of inhaling through the nose or exhaling through the nose. The feature corresponding to the mouth may be, for example, at least one of open or closed. The first working state can be obtained according to a preset judgment rule or determined by an artificial intelligence model.
Specifically, after the server obtains the target image, the target image can be subjected to extraction of the face features, the face features are obtained, and the working state analysis is performed according to the face features.
In some embodiments, the face features may be extracted by a face feature extraction model, which may be a deep learning model. A plurality of face feature extraction models may be included, for example, at least one of a model that extracts features corresponding to eyes or a model that extracts features corresponding to the mouth may be included.
In some embodiments, the server may acquire a face feature corresponding to the target image, and process the face feature by using the trained expression recognition model to obtain a target expression corresponding to the target user; and carrying out working state analysis according to the target expression to obtain a first working state corresponding to the target user.
An expression is an emotion or thought expressed on the face, for example depression, excitement, or anger. The face feature extraction model and the expression recognition model can be cascaded and jointly trained. For example, a training image is input into the face feature extraction model to obtain face features, and the face features are input into the expression recognition model to obtain a predicted expression. A model loss value is obtained from the difference between the predicted expression and the actual expression, and the model parameters are adjusted by gradient descent, where the difference between the predicted and actual expressions is positively correlated with the model loss value. In this way the face feature extraction model and the expression recognition model can be obtained quickly through joint training.
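As an illustration of the joint training described above, a minimal PyTorch-style sketch follows; the module definitions, layer sizes, number of expression classes, and dummy data are assumptions for illustration only, not the patent's implementation.

```python
import torch
import torch.nn as nn

# Assumed toy modules: a face feature extractor and an expression classifier, cascaded.
feature_extractor = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 128), nn.ReLU())
expression_head = nn.Linear(128, 5)  # e.g. 5 expression classes (assumed)

criterion = nn.CrossEntropyLoss()  # loss grows with the predicted/actual expression difference
optimizer = torch.optim.SGD(
    list(feature_extractor.parameters()) + list(expression_head.parameters()), lr=0.01
)

def train_step(images, expression_labels):
    """One joint gradient-descent step over both cascaded models."""
    features = feature_extractor(images)          # face features
    logits = expression_head(features)            # predicted expression
    loss = criterion(logits, expression_labels)   # model loss value
    optimizer.zero_grad()
    loss.backward()                               # gradients flow through both models
    optimizer.step()
    return loss.item()

# Usage with dummy data (assumed 64x64 single-channel face crops):
images = torch.randn(8, 1, 64, 64)
labels = torch.randint(0, 5, (8,))
print(train_step(images, labels))
```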
In some embodiments, the correspondence between expressions and working states may be preset; for example, the working state corresponding to an excited expression may be set to energetic, and the working state corresponding to a depressed expression to tired. After the target expression is obtained, the first working state corresponding to the target user can therefore be determined.
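A minimal sketch of such a preset expression-to-working-state mapping; the expression and state names are illustrative assumptions.

```python
# Assumed expression-to-working-state mapping; names are illustrative.
EXPRESSION_TO_STATE = {
    "excited": "energetic",
    "depressed": "tired",
    "angry": "tired",
    "neutral": "normal",
}

def first_working_state_from_expression(target_expression: str) -> str:
    # Fall back to "normal" for expressions without a preset mapping (assumption).
    return EXPRESSION_TO_STATE.get(target_expression, "normal")

print(first_working_state_from_expression("depressed"))  # tired
```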
In some embodiments, the server acquires feature point positions corresponding to a plurality of eye key feature points corresponding to the target image, and obtains a target closed state corresponding to the eyes of the target user in the target image based on the position difference between the feature point positions; sequencing the target closed states according to the acquisition sequences corresponding to the target images to obtain a closed state sequence; and carrying out working state analysis according to the closed state sequence to obtain a first working state corresponding to the target user.
The eye key feature points may be set as needed, and may include, for example, feature points on the upper eyelid and feature points on the lower eyelid. The position difference represents a difference between the positions of feature points and may be expressed as a distance. The target closed state may be open or closed. The distance between a feature point on the upper eyelid and a feature point on the lower eyelid may be obtained, and whether the eye is open or closed determined from this distance: when the distance is greater than a first preset distance the eye is determined to be open, and when the distance is less than a second preset distance the eye is determined to be closed, where the first preset distance is greater than or equal to the second preset distance. The acquisition order is the order in which the target images were captured; an image captured earlier is ordered before an image captured later. Since there are multiple target images, the target closed states can be arranged according to this acquisition order to obtain a closed state sequence. For example, if five images captured in sequence have target closed states of open, closed, closed, closed, and open respectively, the closed state sequence is those states arranged in that order.
The server may analyze the closed state sequence to obtain the first working state corresponding to the target user. For example, the change pattern of the states in the closed state sequence may be determined, and the working state determined from that pattern according to a preset judgment rule. For instance, it may be specified that if the duration of a continuous run of closed states in the sequence exceeds a preset duration the state is tired, and otherwise normal; for example, if the eyes remain closed for more than 1 minute the state is determined to be tired. The number of times the duration of a continuous closed run exceeds the preset duration may also be counted, and if this count exceeds a preset number of times the working state is determined to be tired; for example, if the eyes are closed for more than 1 minute on 5 occasions and the preset number of times is 3, the working state is determined to be tired.
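The following Python sketch illustrates the eye-closure analysis just described; the distance thresholds and sampling interval are assumptions, and the one-minute duration rule follows the example above. It is a sketch, not a definitive implementation.

```python
from typing import List, Tuple

OPEN_DIST = 6.0        # first preset distance (pixels, assumed)
CLOSED_DIST = 3.0      # second preset distance (pixels, assumed)
FRAME_SECONDS = 0.2    # assumed interval between sampled target images
PRESET_DURATION = 60.0 # a closed run longer than this (seconds) counts as tired

def closed_state(upper_lid: Tuple[float, float], lower_lid: Tuple[float, float]) -> str:
    """Classify one image as 'open' or 'closed' from eyelid key-point positions."""
    dist = abs(upper_lid[1] - lower_lid[1])  # vertical distance between eyelid points
    if dist > OPEN_DIST:
        return "open"
    if dist < CLOSED_DIST:
        return "closed"
    return "open"  # in-between distances treated as open (assumption)

def first_working_state(closed_sequence: List[str]) -> str:
    """Tired if any continuous run of 'closed' states exceeds the preset duration."""
    run = 0
    for state in closed_sequence:
        run = run + 1 if state == "closed" else 0
        if run * FRAME_SECONDS > PRESET_DURATION:
            return "tired"
    return "normal"

print(closed_state((10.0, 20.0), (10.0, 27.0)))  # open
sequence = ["open", "closed", "closed", "closed", "open"]
print(first_working_state(sequence))  # normal (closed run too short to exceed the preset duration)
```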
In some embodiments, the working state is determined to be tired when the number of closed states in the closed state sequence is greater than a preset number or their proportion is greater than a preset ratio; otherwise it is normal or energetic.
In some embodiments, the target image may be extracted according to a preset time interval or a preset image interval, for example, a video frame may be extracted from a video obtained by video capturing of the user every 3 video frames, so as to analyze the working state.
Step 206, performing working state analysis based on the voice features corresponding to the target voice to obtain a second working state corresponding to the target user.
A voice feature is a feature used to represent a characteristic of the speech, and may include at least one of intonation or speech rate. Intonation refers to the rise and fall of the voice within a sentence; it may, for example, rise, fall, or rise or fall abruptly. The intonation feature can be obtained from statistics of the variation of the voice frequency. One or more voice features may be acquired and combined to obtain the second working state. The second working state can be obtained according to a preset judgment rule or determined by an artificial intelligence model.
Specifically, the server may perform voice feature recognition on the target voice using natural language processing technology to obtain a voice feature set. For example, the server obtains corresponding voice attribute information based on the target voice, the voice attribute information including at least one of speech rate information or intonation change information. The intonation change information may be computed in units of a preset time length; for example, the average voice frequency within each period of the preset time length may be calculated, and the intonation determined from the change of the average voice frequency between adjacent periods. For example, assuming the preset time length is 1 second, the average voice frequency of the 1st second and the average voice frequency of the 2nd second may be obtained, and when the average frequency of the 2nd second is higher than that of the 1st second, the intonation change information indicates a rising intonation. A correspondence between voice features and working states can be preset, and the second working state determined according to the voice features corresponding to the target voice.
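A minimal numpy sketch of the per-second average-frequency comparison described above; the per-frame pitch values and the frame rate are assumptions for illustration.

```python
import numpy as np

def intonation_change(pitch_hz: np.ndarray, frames_per_second: int) -> str:
    """Average the voice frequency over each 1-second window and compare adjacent windows."""
    n_windows = len(pitch_hz) // frames_per_second
    windows = pitch_hz[: n_windows * frames_per_second].reshape(n_windows, frames_per_second)
    averages = windows.mean(axis=1)  # average voice frequency per second
    diffs = np.diff(averages)        # change between adjacent seconds
    if np.all(diffs > 0):
        return "rising"
    if np.all(diffs < 0):
        return "falling"
    return "mixed"

# Usage: 3 seconds of assumed per-frame pitch estimates (100 frames per second).
pitch = np.concatenate([np.full(100, 180.0), np.full(100, 200.0), np.full(100, 220.0)])
print(intonation_change(pitch, frames_per_second=100))  # rising
```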
In some embodiments, the server may obtain voice attribute information corresponding to the target voice, where the voice attribute information includes at least one of speech speed information or intonation variation information; and carrying out working state analysis based on the voice attribute information to obtain a second working state corresponding to the target user.
Specifically, speech rate information refers to how fast the user speaks. A correspondence between voice attributes and working states can be set, and the second working state of the target user obtained from this correspondence. Alternatively, the voice attributes can be compared with preset voice attributes of the target user, and the second working state determined from the comparison result. For example, the voice attribute information of each target user in each working state may be preset, so that the second working state can be determined from the voice attribute information corresponding to the target voice. For instance, suppose that when the target user is tired the speech rate is below a first speed and the intonation gradually falls; then, when the voice attribute information of the target voice shows a speech rate below the first speed and a gradually falling intonation, the second working state corresponding to the target user is determined to be tired. Analyzing the working state through voice attribute information allows the user's working state to be determined accurately.
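A minimal sketch of this kind of rule; the speech-rate threshold value and the attribute names are assumptions for illustration.

```python
# Assumed threshold for the "first speed", in words per second.
FIRST_SPEED = 3.0

def second_working_state(speech_rate_wps: float, intonation_trend: str) -> str:
    """Tired when speech is slower than the first speed and intonation gradually falls."""
    if speech_rate_wps < FIRST_SPEED and intonation_trend == "falling":
        return "tired"
    return "normal"

print(second_working_state(2.1, "falling"))  # tired
print(second_working_state(4.0, "rising"))   # normal
```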
In some embodiments, the digital person-based operational state determination method further comprises: carrying out semantic emotion analysis on the target voice to obtain a target semantic emotion corresponding to the target voice; based on the voice attribute information, performing working state analysis, and obtaining a second working state corresponding to the target user comprises: and carrying out working state analysis based on the voice attribute information and the target semantic emotion to obtain a second working state corresponding to the target user.
The semantic emotion refers to the emotion conveyed by the meaning expressed in a sentence, and may be a positive emotion or a negative emotion. The recognition of the semantic emotion may be performed, for example, with a semantic emotion recognition model. The semantic emotion recognition model is obtained through supervised training: training sentences and their corresponding labels (semantic emotions) are obtained, the training sentences are input into the semantic emotion recognition model to be trained, the model outputs a predicted semantic emotion, a model loss value is obtained from the difference between the predicted semantic emotion and the label, and the model parameters are adjusted in the direction that decreases the model loss value until the model converges. The convergence condition may be that the model loss value is smaller than a preset threshold; the difference between the predicted semantic emotion and the label is positively correlated with the model loss value, so the larger the difference, the larger the loss value.
Specifically, the server can recognize the target voice to obtain a target sentence, the target sentence is input into a semantic emotion recognition model which is trained, and the semantic emotion recognition model recognizes the semantic emotion of the target sentence to obtain a target semantic emotion. The corresponding relation between the voice attribute information and the semantic emotion and the working state can be set, so that the corresponding second working state can be obtained according to the voice attribute information and the target semantic emotion. For example, it may be provided that when the speech rate is lower than a preset speech rate and the target semantic emotion is negative, the working state is determined to be tired.
Step 208, determining a target working state corresponding to the target user by combining the first working state and the second working state.
Specifically, the target working state is obtained by combining the first working state and the second working state. For example, when the first working state comprises working states corresponding to a plurality of analysis dimensions and the second working state likewise comprises working states corresponding to a plurality of analysis dimensions, the number of these states whose value is tired can be counted; when this number is greater than a preset threshold, or the proportion corresponding to this number is greater than a preset proportion, the target working state corresponding to the target user is determined to be tired.
The proportion corresponding to the number of states is the ratio of the number of working states that are tired to the total number of first and second working states. For example, if there are n first working states, m second working states, and k of them are tired, the proportion is k/(n+m). An analysis dimension is a dimension along which the working state is analyzed; for the face features, the analysis dimensions may include mental state, expression, and eyes, so a first working state for the mental state, a first working state for the expression, and a first working state for the eyes can be obtained. For the voice features, working states for the speech rate, the intonation, and so on can be obtained. The preset threshold and the preset proportion can be set as needed, for example a threshold of 3 and a proportion of 60%. For example, assume the first working state for the mental state is tired, the first working state for the expression is tired, the first working state for the eyes is normal, the second working state for the speech rate is tired, and the second working state for the intonation is tired; the number of tired working states is then 4, which exceeds the preset threshold of 3, so the target working state is tired. In the embodiments of the application, this multi-dimensional analysis improves the accuracy of the determined working state.
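A minimal sketch of the counting rule described above, using the threshold of 3 and proportion of 60% from the example; the dimension names mirror the text, while the function and variable names are assumptions.

```python
from typing import Dict

PRESET_THRESHOLD = 3
PRESET_PROPORTION = 0.6

def target_working_state(first_states: Dict[str, str], second_states: Dict[str, str]) -> str:
    """Tired if the count of tired dimension states exceeds the threshold or the proportion."""
    states = list(first_states.values()) + list(second_states.values())
    tired_count = sum(1 for s in states if s == "tired")
    if tired_count > PRESET_THRESHOLD or tired_count / len(states) > PRESET_PROPORTION:
        return "tired"
    return "normal"

first = {"mental_state": "tired", "expression": "tired", "eyes": "normal"}
second = {"speech_rate": "tired", "intonation": "tired"}
print(target_working_state(first, second))  # tired (4 tired states out of 5)
```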
In some embodiments, the first working state comprises working states corresponding to each of a plurality of analysis dimensions, the second working state comprises working states corresponding to each of a plurality of analysis dimensions, and the server inputs the first working state and the second working state into a comprehensive state determination model to obtain the target working state corresponding to the target user. The comprehensive state determination model is a model trained in advance to determine the target working state by integrating the first working state and the second working state, and may be obtained through supervised training. For example, a set of working states for training the model and a manually annotated working state label can be obtained, the working states in the set are input into the comprehensive state determination model to be trained, the model outputs a predicted comprehensive working state, a model loss value is obtained from the difference between the predicted comprehensive working state and the working state label, and the model parameters are adjusted by a gradient descent algorithm until the model converges, yielding the trained comprehensive state determination model.
In some embodiments, the second working state is obtained by processing voice features by using a second state determination model, the first working state is obtained by processing face features by using a first state determination model, as shown in fig. 3, the working state determination method based on digital people includes a step of performing model training on a comprehensive state determination model, and the step of performing model training on the comprehensive state determination model includes:
Step S302, a first training sample is obtained, wherein the first training sample comprises training voice features, corresponding second state labels, training face features, corresponding first state labels and comprehensive state labels.
The training sample is a sample used for model training. The training voice features in the first training sample are obtained by performing feature extraction on training voice, and the training face features are obtained by performing feature extraction on a training image. Training voice refers to voice used for model training, and a training image is an image used for model training. For the training voice and the training image in the same training sample, the training image is an image of the user acquired when the user utters the training voice. The second state label, the first state label and the comprehensive state label may be manually labeled.
Specifically, the server may obtain the training voice and the corresponding training image acquired when the training voice is uttered, and perform feature extraction on the training voice to obtain the training voice features. Face feature extraction is performed on the training image to obtain the training face features. The server can output the training voice and the training image to a terminal; the terminal receives a state labeling operation and, in response to the state labeling operation, obtains the second state label, the first state label and the comprehensive state label.
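As an illustrative sketch only, assembling one training sample from a piece of training voice, the corresponding training image and the manually labeled state labels might look as follows; the feature extractors shown are simple placeholders (assumptions), since the patent does not prescribe a specific feature extraction method:

```python
import numpy as np

def extract_voice_features(speech_waveform: np.ndarray) -> np.ndarray:
    # Placeholder voice feature extractor (assumption): basic signal statistics.
    # A real system might instead use MFCCs or a pretrained speech encoder.
    return np.array([speech_waveform.mean(), speech_waveform.std(),
                     np.abs(speech_waveform).max()])

def extract_face_features(image: np.ndarray) -> np.ndarray:
    # Placeholder face feature extractor (assumption): per-channel mean intensities.
    # A real system might instead use facial landmarks or a CNN embedding.
    return image.reshape(-1, image.shape[-1]).mean(axis=0)

def build_training_sample(training_speech, training_image,
                          second_state_label, first_state_label,
                          comprehensive_state_label):
    return {
        "voice_features": extract_voice_features(training_speech),
        "face_features": extract_face_features(training_image),
        "second_state_label": second_state_label,        # manually labeled, voice branch
        "first_state_label": first_state_label,          # manually labeled, face branch
        "comprehensive_state_label": comprehensive_state_label,
    }

sample = build_training_sample(
    np.random.randn(16000),        # stand-in for 1 s of training voice at 16 kHz
    np.random.rand(64, 64, 3),     # stand-in for an RGB training image
    second_state_label="tired", first_state_label="tired",
    comprehensive_state_label="tired",
)
```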
Step S304, inputting the training face features into a first state determination model to be trained to obtain a first prediction state.
Specifically, the first state determination model is a model for processing face features, for example a neural network model, and the first prediction state is the working state output by the first state determination model after processing the training face features. The server can input the training face features into the first state determination model to be trained, and the first state determination model processes the training face features by using its model parameters to predict the first prediction state.
Step S306, inputting the training voice characteristics into a second state determination model to be trained to obtain a second prediction state.
Specifically, the second state determination model is a model for processing voice features, for example a neural network model, and the second prediction state is the working state output by the second state determination model after processing the training voice features. The server can input the training voice features into the second state determination model to be trained, and the second state determination model processes the training voice features by using its model parameters to predict the second prediction state.
In some embodiments, there are a plurality of first state determination models and a plurality of second state determination models; for example, the plurality of first state determination models may have different model structures, and the plurality of second state determination models may have different model structures.
Step S308, the first prediction state and the second prediction state are input into the comprehensive state determination model to be trained, and a third prediction state is obtained.
Specifically, when there are a plurality of first state determination models and a plurality of second state determination models, a plurality of first prediction states and a plurality of second prediction states can be obtained. The plurality of first prediction states and the plurality of second prediction states are respectively used as features and input into the comprehensive state determination model, and the comprehensive state determination model processes the input features by using its model parameters to obtain the third prediction state.
Step S310, obtaining a target model loss value based on the state difference between the first state label and the first prediction state, the state difference between the second state label and the second prediction state, and the state difference between the comprehensive state label and the third prediction state.
Specifically, the model loss value is positively correlated with the difference: the larger the difference, the larger the model loss value. The model loss value can be a cross entropy loss value or a mean square error (MSE) loss value. A first model loss value can be obtained according to the state difference between the first state label and the first prediction state, a second model loss value can be obtained according to the state difference between the second state label and the second prediction state, and a third model loss value can be obtained according to the state difference between the comprehensive state label and the third prediction state. The target model loss value is obtained by weighted summation of the first model loss value, the second model loss value and the third model loss value, and the weight corresponding to each model loss value can be set as needed.
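Writing the three branch losses as L1, L2 and L3 with weights w1, w2 and w3 (the weight symbols are assumptions; the patent only states that the weights can be set as needed), the weighted summation described above can be expressed as:

```latex
L_{\text{target}} = w_1 L_1 + w_2 L_2 + w_3 L_3
```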
Step S312, adjusting model parameters of the first state determination model to be trained, the second state determination model to be trained and the comprehensive state determination model to be trained based on the target model loss value to obtain the first state determination model, the second state determination model and the comprehensive state determination model.
Specifically, after the target model loss value is obtained, back propagation is performed according to the target model loss value, and the model parameters of the first state determination model to be trained, the second state determination model to be trained and the comprehensive state determination model to be trained are adjusted, so that the trained first state determination model, the trained second state determination model and the trained comprehensive state determination model are obtained.
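A minimal sketch of one such joint training step is given below, assuming for illustration that the three models are simple linear classifiers implemented in PyTorch and that the predicted logits of the first and second models are fed to the comprehensive model; the feature dimensions, the three-class label set and the loss weights are assumptions introduced for the example:

```python
import torch
import torch.nn as nn

NUM_STATES = 3                      # e.g. tired / normal / energetic (assumed label set)
FACE_DIM, VOICE_DIM = 128, 64       # assumed feature dimensions

first_model = nn.Linear(FACE_DIM, NUM_STATES)                 # face features  -> first state
second_model = nn.Linear(VOICE_DIM, NUM_STATES)               # voice features -> second state
comprehensive_model = nn.Linear(2 * NUM_STATES, NUM_STATES)   # both predictions -> target state

params = list(first_model.parameters()) + list(second_model.parameters()) \
       + list(comprehensive_model.parameters())
optimizer = torch.optim.SGD(params, lr=0.01)
criterion = nn.CrossEntropyLoss()
w1, w2, w3 = 1.0, 1.0, 1.0          # loss weights, set as needed

def training_step(face_feat, voice_feat, first_label, second_label, comp_label):
    first_pred = first_model(face_feat)                        # step S304
    second_pred = second_model(voice_feat)                     # step S306
    comp_input = torch.cat([first_pred, second_pred], dim=-1)
    third_pred = comprehensive_model(comp_input)               # step S308

    loss = (w1 * criterion(first_pred, first_label)            # step S310: weighted sum
            + w2 * criterion(second_pred, second_label)
            + w3 * criterion(third_pred, comp_label))

    optimizer.zero_grad()
    loss.backward()                                            # step S312: back propagation
    optimizer.step()
    return loss.item()
```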
In the embodiment of the application, the first state determination model, the second state determination model and the comprehensive state determination model are obtained through combined training, and the target model loss value is obtained by combining the state difference between the first state label and the first prediction state, the state difference between the second state label and the second prediction state and the state difference between the comprehensive state label and the third prediction state.
According to the above digital person-based working state determining method, the voice of the target user and the corresponding image acquired when the voice is uttered can be obtained; working state analysis is performed according to the face features corresponding to the target image to obtain the first working state corresponding to the target user, and working state analysis is performed according to the voice features corresponding to the voice to obtain the second working state corresponding to the target user; the target working state corresponding to the target user is then determined by combining the first working state and the second working state. The working state corresponding to the user can thus be determined accurately, and the efficiency and accuracy of determining the working state are improved.
In some embodiments, as shown in fig. 4A, the digital person-based working state determining method further includes:
Step S402, obtaining a virtual user image corresponding to the target user.
Specifically, the virtual user image (avatar) is a virtual representation of the user rather than the user's real appearance, for example a cartoon character of the user. The avatar may be generated according to characteristics of the user, for example the hairstyle of the avatar may be determined according to the hairstyle of the target user, so that the avatar better matches the characteristics of the target user. The avatar may also be preset.
In some embodiments, face adjustment parameters corresponding to the first working state may also be obtained; and performing image adjustment on the virtual user image according to the face adjustment parameters, and controlling the virtual user image after image adjustment to send out working state prompt information according to the target working state.
Specifically, the face adjustment parameters are parameters for adjusting the face of the avatar, and a correspondence between working states and face adjustment parameters is preset, so that the virtual user image is adjusted according to the face adjustment parameters. For example, if the first working state is tired, the parameters corresponding to a tired face are acquired, such as a parameter for making the eyelids droop and a parameter for furrowing the eyebrows, so that the face in the avatar is adjusted to have drooping eyelids and furrowed eyebrows. Since the first working state is obtained from the face features, it reflects the working state of the target user as shown by the user's appearance; obtaining the corresponding face adjustment parameters according to the first working state allows the virtual user image to be adjusted with parameters matching the current facial appearance of the user, making the prompt more effective. For example, the prompt information can be played by voice while the image-adjusted virtual user image is displayed.
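As an illustrative sketch, the preset correspondence between working states and face adjustment parameters might be held in a lookup table; the parameter names, value ranges and avatar methods below are hypothetical stand-ins, not an API defined by the patent:

```python
class Avatar:
    """Stand-in avatar object; in a real system this would drive the rendered virtual user image."""
    def apply_face_adjustment(self, params):
        print("adjusting avatar face:", params)
    def say(self, text):
        print("avatar says:", text)

# Preset correspondence between working states and face adjustment parameters (assumed values).
FACE_ADJUSTMENT_PARAMS = {
    "tired":  {"eyelid_droop": 0.7, "eyebrow_furrow": 0.5},
    "normal": {"eyelid_droop": 0.0, "eyebrow_furrow": 0.0},
}

def prompt_with_avatar(avatar, first_working_state, target_working_state):
    # Adjust the avatar face according to the first working state, then issue
    # the working state prompt information according to the target working state.
    avatar.apply_face_adjustment(FACE_ADJUSTMENT_PARAMS.get(first_working_state, {}))
    if target_working_state == "tired":
        avatar.say("You are currently in a tired state, please take a rest.")

prompt_with_avatar(Avatar(), first_working_state="tired", target_working_state="tired")
```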
And step S404, controlling the virtual user image to send out working state prompt information according to the target working state.
Specifically, the working state prompt information can be presented in the form of voice, text or an action. For example, the working state prompt information may be "You are currently in a tired state, please take a rest". The working state prompt information may also be given by the avatar performing an action corresponding to the target working state; for example, when the target working state is tired, the avatar can be controlled to perform a dozing or yawning action.
In the embodiment of the application, the working state prompt information is sent out through the virtual user image corresponding to the target user, so that the current target working state of the target user can be prompted in an image, and the prompting efficiency is improved.
In some embodiments, when the working state of the user is detected to be tired, a rest prompt may be issued to prompt the user to rest. For example, when the eye-closing duration of the target user exceeds a preset duration, or the number of eye closings exceeds a preset number, the working state of the user is determined to be tired.
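A minimal sketch of this eye-closure rule, with assumed threshold values:

```python
# The user is treated as tired when a single eye closure lasts longer than a
# preset duration, or when the number of eye closures exceeds a preset count.
def is_tired_by_eye_closure(closure_durations_s,
                            max_single_closure_s=2.0,   # assumed preset duration
                            max_closure_count=10):      # assumed preset count
    if any(d > max_single_closure_s for d in closure_durations_s):
        return True
    return len(closure_durations_s) > max_closure_count

print(is_tired_by_eye_closure([0.3, 2.5, 0.4]))  # True: one closure longer than 2 s
```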
In some embodiments, when the working state of the user is detected to be energetic, an encouragement prompt may be issued to encourage the user.
In some embodiments, the working time of a staff member can be monitored and reminders issued, with the time during which the face appears in the video taken as the working time of the staff member. If the working time of the staff member has not reached the scheduled office time, a working time prompt can be issued automatically. If the working time of the staff member exceeds a preset duration, for example 10 hours, an off-duty prompt can be issued automatically.
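A sketch of this working-time reminder; the 10-hour limit comes from the example above, while the office-hours value and the prompt wording are assumed illustrations:

```python
def working_time_prompt(face_visible_seconds, office_hours=8.0, overwork_hours=10.0):
    # The time during which the face appears in the video is taken as the working time.
    worked_hours = face_visible_seconds / 3600.0
    if worked_hours < office_hours:
        return "Working time has not yet reached the scheduled office time."
    if worked_hours > overwork_hours:
        return "You have worked more than %.0f hours, please get off work and rest." % overwork_hours
    return None

print(working_time_prompt(11 * 3600))  # off-duty prompt after 11 hours
```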
In some embodiments, the digital person may acquire voice and images over a preset time period, for example 30 minutes; that is, when determining the working state, an analysis may be performed every 30 minutes.
In some embodiments, each time the working state analysis is performed, the working state obtained by the previous analysis (referred to as the forward working state) may be obtained, and the current target working state corresponding to the target user is determined with reference to the forward working state. For example, when the forward working state is tired, the probability that the current state is also tired is larger, so that the result obtained through this multi-level analysis is more accurate.
The forward working state is determined by using the voice preceding the target voice (referred to as the forward voice) and the image corresponding to the forward voice (referred to as the forward image). The number of tired states in the forward working state (referred to as the forward tired count) can be obtained, and when the number of tired states in the current working state is greater than the forward tired count, the target user is determined to be tired. For example, assuming that 3 working states were tired when the working state was determined last time and 5 working states are tired this time, the current working state of the target user is determined to be tired. In the embodiment of the application, the analysis of the current working state is assisted by the previously obtained working state, so that the result is more accurate.
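A sketch of the comparison with the forward tired count described above; the function and state names are illustrative assumptions:

```python
def is_tired_versus_forward(current_states, forward_tired_count):
    # Compare the number of "tired" dimension states in the current analysis
    # with the number obtained in the previous (forward) analysis.
    current_tired_count = sum(1 for s in current_states if s == "tired")
    return current_tired_count > forward_tired_count

# Example from the text: 3 tired states last time, 5 tired states this time -> tired.
print(is_tired_versus_forward(["tired"] * 5 + ["normal"] * 2, forward_tired_count=3))  # True
```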
The scheme in the embodiment of the application can be used for prompting staff about their state and can be applied to remote office supervision to improve the efficiency of staff working remotely. Before remote office supervision, face recognition can be performed for verification, so as to check whether the person working is the staff member himself or herself. At present, the daily working state of staff cannot be guaranteed, and staff may also work too long without resting in time. Therefore, the expression and eye information of the staff member can be detected by using computer vision technology, and the speech-rate and semantic information of the staff member's speech can be obtained by using speech recognition and natural language processing technology. By performing multi-level analysis on this combined information, the working state of the staff member can be judged accurately, and the staff member can be encouraged or reminded to rest according to the working state, thereby improving the efficiency of remote work.
The working state determining method can be executed once every preset time period, for example once every 20 minutes, so that staff can be reminded multiple times during the working day. A summary report can also be generated from the target working states obtained each time; the summary report can show the time periods in which the working state was tired and summarize the working state of the day, which helps staff adjust their working state the next day.
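As an illustrative sketch, the summary report could be generated by grouping the periodic analysis results and listing the tired time periods; the data format is an assumption:

```python
def summarize_tired_periods(results):
    """results: list of (start_time, end_time, working_state) tuples, one per analysis run."""
    tired_periods = [(start, end) for start, end, state in results if state == "tired"]
    return {"tired_periods": tired_periods, "tired_count": len(tired_periods)}

report = summarize_tired_periods([
    ("09:00", "09:20", "normal"),
    ("15:40", "16:00", "tired"),
    ("16:00", "16:20", "tired"),
])
print(report)  # {'tired_periods': [('15:40', '16:00'), ('16:00', '16:20')], 'tired_count': 2}
```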
In some embodiments, the appearance time of the face of the worker in the video can also be calculated, and the working time of the worker is calculated according to the appearance time.
As shown in fig. 4B, which is an interface diagram of the digital person giving a prompt according to the working state in some embodiments, the digital person is represented by the avatar. As shown in the left diagram of fig. 4B, in a normal working state the user can work on the working interface, for example writing or debugging code, and the digital person is hidden so that the user can work normally with less interference. When the user is detected to be in a tired working state, the digital person can be woken up and displayed on the working interface, and the working state prompt information "You are currently in a tired state, please take a rest" is given to remind the user to rest.
It should be understood that, although the steps in the above-described flowcharts are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts described above may include a plurality of steps or stages, which are not necessarily performed at the same time, but may be performed at different times, and the order of execution of the steps or stages is not necessarily sequential, but may be performed in turn or alternately with at least a part of other steps or stages.
In some embodiments, as shown in fig. 5, there is provided a digital person-based operation state determining apparatus, including: an image and voice acquisition module 502, a first working state obtaining module 504, a second working state obtaining module 506, and a target working state determining module 508, wherein:
the image and voice acquiring module 502 is configured to acquire a target voice of a target user and a target image corresponding to the target voice when the target voice is sent out.
The first working state obtaining module 504 is configured to perform working state analysis based on the face feature corresponding to the target image, so as to obtain a first working state corresponding to the target user.
The second working state obtaining module 506 is configured to perform working state analysis based on the voice feature corresponding to the target voice, so as to obtain a second working state corresponding to the target user.
The target working state determining module 508 is configured to determine a target working state corresponding to the target user in combination with the first working state and the second working state.
In some embodiments, the first working state obtaining module is configured to: acquire face features corresponding to the target image, and process the face features by using a trained expression recognition model to obtain a target expression corresponding to the target user; and perform working state analysis according to the target expression to obtain the first working state corresponding to the target user.
In some embodiments, the target image is a plurality of, and the first working state obtaining module is configured to: acquiring feature point positions corresponding to a plurality of eye key feature points corresponding to a target image, and acquiring a target closing state corresponding to eyes of a target user in the target image based on the position difference between the feature point positions; sequencing the target closed states according to the acquisition sequences corresponding to the target images to obtain a closed state sequence; and carrying out working state analysis according to the closed state sequence to obtain a first working state corresponding to the target user.
In some embodiments, the second working state obtaining module is configured to: acquire voice attribute information corresponding to the target voice, wherein the voice attribute information comprises at least one of speech speed information or intonation change information; and perform working state analysis based on the voice attribute information to obtain the second working state corresponding to the target user.
In some embodiments, the apparatus further comprises: the semantic emotion analysis module is used for carrying out semantic emotion analysis on the target voice to obtain a target semantic emotion corresponding to the target voice; the second working state obtaining module is used for: and carrying out working state analysis based on the voice attribute information and the target semantic emotion to obtain a second working state corresponding to the target user.
In some embodiments, the first working state is a working state corresponding to each of the plurality of analysis dimensions, the second working state is a working state corresponding to each of the plurality of analysis dimensions, and the target working state determining module is configured to: calculating the number of states of fatigue in the first working state and the second working state; and when the state quantity is larger than a preset threshold value or the proportion corresponding to the state quantity is larger than a preset proportion, determining that the target working state corresponding to the target user is tired.
In some embodiments, as shown in fig. 6, the digital person-based operation state determining apparatus further includes: an avatar acquisition module 602, configured to acquire an avatar corresponding to the target user; the working state prompt information sending module 604 is configured to control the virtual user image to send the working state prompt information according to the target working state.
In some embodiments, the working state prompt information sending module is configured to: acquiring face adjustment parameters corresponding to a first working state; and performing image adjustment on the virtual user image according to the face adjustment parameters, and controlling the virtual user image after image adjustment to send out working state prompt information according to the target working state.
In some embodiments, the target operating state determination module is to: and inputting the first working state and the second working state into the comprehensive state determination model to obtain a target working state corresponding to the target user.
In some embodiments, the first working state is obtained by processing the face features by using a first state determination model, the second working state is obtained by processing the voice features by using a second state determination model, and the apparatus further comprises a model training module configured to: acquire a first training sample, wherein the first training sample comprises training face features, a corresponding first state label, training voice features, a corresponding second state label, and a comprehensive state label; input the training face features into the first state determination model to be trained to obtain a first prediction state; input the training voice features into the second state determination model to be trained to obtain a second prediction state; input the first prediction state and the second prediction state into the comprehensive state determination model to be trained to obtain a third prediction state; obtain a target model loss value based on the state difference between the first state label and the first prediction state, the state difference between the second state label and the second prediction state, and the state difference between the comprehensive state label and the third prediction state; and adjust the model parameters of the first state determination model to be trained, the second state determination model to be trained and the comprehensive state determination model to be trained based on the target model loss value to obtain the first state determination model, the second state determination model and the comprehensive state determination model.
For the specific definition of the digital person-based working state determining device, reference may be made to the definition of the digital person-based working state determining method hereinabove, and the description thereof will not be repeated. The above-described individual modules in the digital person-based operation state determination device may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In some embodiments, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 7. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is used for storing data determined based on the working state of the digital person. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of determining an operating state based on a digital person.
It will be appreciated by those skilled in the art that the structure shown in FIG. 7 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In some embodiments, a computer device is provided comprising a memory and a processor, the memory having stored therein a computer program, the processor when executing the computer program performing the steps of: acquiring target voice of a target user and a target image corresponding to the target voice when the target voice is sent out; carrying out working state analysis based on the face features corresponding to the target image to obtain a first working state corresponding to the target user; analyzing the working state based on the voice characteristics corresponding to the target voice to obtain a second working state corresponding to the target user; and determining a target working state corresponding to the target user by combining the first working state and the second working state.
In some embodiments, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, performs the steps of: acquiring target voice of a target user and a target image corresponding to the target voice when the target voice is sent out; carrying out working state analysis based on the face features corresponding to the target image to obtain a first working state corresponding to the target user; analyzing the working state based on the voice characteristics corresponding to the target voice to obtain a second working state corresponding to the target user; and determining a target working state corresponding to the target user by combining the first working state and the second working state.
Those skilled in the art will appreciate that implementing all or part of the above-described methods in accordance with the embodiments may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, or the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory. By way of illustration, and not limitation, RAM can be in various forms such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), etc.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description. The above examples illustrate only a few embodiments of the application, which are described in detail and are not to be construed as limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.
Claims (15)
1. A method for determining an operational state based on a digital person, the method comprising:
when the remote office software is started, acquiring target voice of a target user and a target image corresponding to the target voice;
Carrying out working state analysis based on the face features corresponding to the target image to obtain a first working state corresponding to the target user; the first working state is a working state corresponding to a plurality of analysis dimensions respectively;
analyzing the working state based on the voice characteristics corresponding to the target voice to obtain a second working state corresponding to the target user; the second working state is a working state corresponding to each of the plurality of analysis dimensions;
Calculating the number of states in which the working states are tired in the first working state and the second working state; and when the state quantity is larger than a preset threshold value or the proportion corresponding to the state quantity is larger than a preset proportion, determining that the target working state corresponding to the target user is tired.
2. The method of claim 1, wherein the performing working state analysis based on the face feature corresponding to the target image to obtain the first working state corresponding to the target user comprises:
Acquiring face features corresponding to the target image, and processing the face features by using a trained expression recognition model to obtain a target expression corresponding to the target user;
and analyzing the working state according to the target expression to obtain a first working state corresponding to the target user.
3. The method of claim 1, wherein the target image is a plurality of target images, and the performing working state analysis based on the face features corresponding to the target image to obtain the first working state corresponding to the target user includes:
acquiring feature point positions corresponding to a plurality of eye key feature points corresponding to the target image, and acquiring a target closed state corresponding to the eyes of the target user in the target image based on the position difference between the feature point positions;
sequencing the target closed states according to the acquisition sequences corresponding to the target images to obtain a closed state sequence;
and carrying out working state analysis according to the closed state sequence to obtain a first working state corresponding to the target user.
4. The method of claim 1, wherein the performing the working state analysis based on the voice feature corresponding to the target voice to obtain the second working state corresponding to the target user includes:
Acquiring voice attribute information corresponding to the target voice, wherein the voice attribute information comprises at least one of speech speed information or intonation change information;
And analyzing the working state based on the voice attribute information to obtain a second working state corresponding to the target user.
5. The method according to claim 4, wherein the method further comprises:
Carrying out semantic emotion analysis on the target voice to obtain a target semantic emotion corresponding to the target voice;
the step of analyzing the working state based on the voice attribute information to obtain a second working state corresponding to the target user comprises the following steps:
And carrying out working state analysis based on the voice attribute information and the target semantic emotion to obtain a second working state corresponding to the target user.
6. The method according to claim 1, further comprising, prior to the acquiring the target voice of the target user and the target image corresponding to the target voice when the target voice is emitted:
Acquiring a face image of a user, performing detection according to the face image to detect whether the acquired face image is consistent with a face image corresponding to an account logged in on a terminal, and if so, determining the user as the target user.
7. The method according to claim 1, wherein the method further comprises:
acquiring an virtual user image corresponding to the target user;
And controlling the virtual user image to send out working state prompt information according to the target working state.
8. The method of claim 7, wherein controlling the avatar to issue the operation state prompt message according to the target operation state comprises:
acquiring face adjustment parameters corresponding to the first working state;
And performing image adjustment on the virtual user image according to the face adjustment parameters, and controlling the virtual user image after image adjustment to send out working state prompt information according to the target working state.
9. A digital person-based operating condition determining apparatus, the apparatus comprising:
the image and voice acquisition module is used for acquiring target voice of a target user and a target image corresponding to the target voice when the target voice is sent out when the remote office software is started;
The first working state obtaining module is used for carrying out working state analysis based on the face features corresponding to the target image to obtain a first working state corresponding to the target user; the first working state is a working state corresponding to a plurality of analysis dimensions respectively;
the second working state obtaining module is used for carrying out working state analysis based on the voice characteristics corresponding to the target voice to obtain a second working state corresponding to the target user; the second working state is a working state corresponding to each of the plurality of analysis dimensions;
The target working state determining module is used for calculating the number of states of which the working states are tired in the first working state and the second working state; and when the state quantity is larger than a preset threshold value or the proportion corresponding to the state quantity is larger than a preset proportion, determining that the target working state corresponding to the target user is tired.
10. The apparatus of claim 9, wherein the first operating state obtaining module is configured to:
Acquiring face features corresponding to the target image, and processing the face features by using a trained expression recognition model to obtain a target expression corresponding to the target user;
and analyzing the working state according to the target expression to obtain a first working state corresponding to the target user.
11. The apparatus of claim 9, wherein the target image is a plurality of target images, and the first operating state obtaining module is configured to:
acquiring feature point positions corresponding to a plurality of eye key feature points corresponding to the target image, and acquiring a target closed state corresponding to the eyes of the target user in the target image based on the position difference between the feature point positions;
sequencing the target closed states according to the acquisition sequences corresponding to the target images to obtain a closed state sequence;
and carrying out working state analysis according to the closed state sequence to obtain a first working state corresponding to the target user.
12. The apparatus of claim 9, wherein the second operating state obtaining module is configured to:
Acquiring voice attribute information corresponding to the target voice, wherein the voice attribute information comprises at least one of speech speed information or intonation change information;
And analyzing the working state based on the voice attribute information to obtain a second working state corresponding to the target user.
13. The apparatus of claim 12, further comprising a semantic emotion analysis module for:
Carrying out semantic emotion analysis on the target voice to obtain a target semantic emotion corresponding to the target voice;
And carrying out working state analysis based on the voice attribute information and the target semantic emotion to obtain a second working state corresponding to the target user.
14. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 8 when the computer program is executed.
15. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202010847552.1A (CN112183197B) | 2020-08-21 | 2020-08-21 | Working state determining method and device based on digital person and storage medium
Publications (2)
Publication Number | Publication Date
---|---
CN112183197A | 2021-01-05
CN112183197B | 2024-06-25
Family
ID=73924195

Family Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202010847552.1A (CN112183197B, Active) | 2020-08-21 | 2020-08-21 | Working state determining method and device based on digital person and storage medium

Country Status (1)
Country | Link
---|---
CN | CN112183197B (en)
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN110399837A * | 2019-07-25 | 2019-11-01 | 深圳智慧林网络科技有限公司 | User emotion recognition method, device and computer readable storage medium
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
US10522143B2 * | 2018-02-27 | 2019-12-31 | Microsoft Technology Licensing, Llc | Empathetic personal virtual digital assistant
CN109583325B * | 2018-11-12 | 2023-06-27 | 平安科技(深圳)有限公司 | Face sample picture labeling method and device, computer equipment and storage medium
CN109766766A * | 2018-12-18 | 2019-05-17 | 深圳壹账通智能科技有限公司 | Employee work condition monitoring method, device, computer equipment and storage medium
CN111368609B * | 2018-12-26 | 2023-10-17 | 深圳Tcl新技术有限公司 | Speech interaction method based on emotion engine technology, intelligent terminal and storage medium
CN111382648A * | 2018-12-30 | 2020-07-07 | 广州市百果园信息技术有限公司 | Method, device and equipment for detecting dynamic facial expression and storage medium
CN110807388B * | 2019-10-25 | 2021-06-08 | 深圳追一科技有限公司 | Interaction method, interaction device, terminal equipment and storage medium
Legal Events
Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant