WO2021232775A1

WO2021232775A1 - Video processing method and apparatus, and electronic device and storage medium

Info

Publication number: WO2021232775A1
Application number: PCT/CN2020/137690
Authority: WO
Inventors: 孙贺然; 王磊; 白登峰; 夏建明; 曹军
Original assignee: 北京市商汤科技开发有限公司
Priority date: 2020-05-22
Filing date: 2020-12-18
Publication date: 2021-11-25
Also published as: JP2022537475A; TW202145131A; KR20210144658A; CN111553323A

Abstract

The present disclosure relates to a video processing method and apparatus, and an electronic device and a storage medium. The method comprises: acquiring a video, wherein at least some video frames in the video include a target object; according to the video, detecting at least one type of learning behavior conducted by the target object in the process of watching a teaching course; and upon detecting that the target object conducts at least one type of learning behavior, generating learning state information according to at least some video frames which include the at least one type of learning behavior and/or the length of time in which the target object conducts the at least one type of learning behavior.

Description

Video processing method and device, electronic equipment and storage medium

This disclosure claims priority to Chinese patent application No. 202010442733.6, entitled "Video processing method and device, electronic equipment and storage medium", filed with the Chinese Patent Office on May 22, 2020, the entire content of which is incorporated herein by reference.

Technical field

The present disclosure relates to the field of computer vision, and in particular to a video processing method and device, electronic equipment, and storage medium.

Background

In the teaching process, because the teacher needs to concentrate on teaching, it is difficult for the institution or the teacher to grasp the student's listening status, and the parents cannot understand the child's performance in school. Whether students are actually attending the class, whether they are listening to the class carefully, and how well they perform in class interactions cannot be quantitatively evaluated.

Therefore, how to grasp the learning status of each student in the teaching process while ensuring the quality of teaching has become an urgent problem to be solved at present.

Summary of the invention

The present disclosure proposes a video processing solution.

According to an aspect of the present disclosure, there is provided a video processing method, including:

Acquiring a video, wherein at least part of the video frames in the video contain a target object; detecting, according to the video, at least one type of learning behavior of the target object in the process of watching a teaching course; and, in the case where the target object is detected performing at least one type of learning behavior, generating learning state information according to at least part of the video frames containing the at least one type of learning behavior and/or the duration for which the target object performs the at least one type of learning behavior.

According to an aspect of the present disclosure, there is provided a video processing device, including:

A video acquisition module, configured to acquire a video, wherein at least part of the video frames in the video contain the target object;

The detection module is configured to detect at least one type of learning behavior of the target object in the process of watching the teaching course according to the video;

A generating module, configured to, in the case of detecting that the target object performs at least one type of learning behavior, generate learning state information according to at least part of the video frames containing the at least one type of learning behavior and/or the duration for which the target object performs the at least one type of learning behavior.

According to an aspect of the present disclosure, there is provided an electronic device including:

A processor; a memory for storing executable instructions of the processor; wherein the processor is configured to execute the above-mentioned video processing method.

According to an aspect of the present disclosure, there is provided a computer-readable storage medium having computer program instructions stored thereon, and when the computer program instructions are executed by a processor, the foregoing video processing method is implemented.

According to an aspect of the present disclosure, there is provided a computer program including computer-readable code, wherein when the computer-readable code runs in an electronic device, a processor in the electronic device executes the video processing method described above.

In the embodiments of the present disclosure, when it is detected that the target object performs at least one type of learning behavior, the video frames containing the learning behavior can be used to generate intuitive learning state information, and quantified learning state information can be generated according to the duration of the learning behavior. In this way, learning state information with evaluation value can be obtained flexibly, which makes it convenient for teachers, parents, and other relevant personnel and institutions to grasp the learning state of students effectively and accurately.

It should be understood that the above general description and the following detailed description are only exemplary and explanatory, rather than limiting the present disclosure. According to the following detailed description of exemplary embodiments with reference to the accompanying drawings, other features and aspects of the present disclosure will become clear.

Description of the drawings

The drawings herein are incorporated into the specification and constitute a part of the specification. These drawings illustrate embodiments that conform to the present disclosure, and are used together with the specification to explain the technical solutions of the present disclosure.

Fig. 1 shows a flowchart of a video processing method according to an embodiment of the present disclosure.

Fig. 2 shows a block diagram of a video processing device according to an embodiment of the present disclosure.

Fig. 3 shows a schematic diagram of an application example according to the present disclosure.

Fig. 4 shows a block diagram of an electronic device according to an embodiment of the present disclosure.

Fig. 5 shows a block diagram of an electronic device according to an embodiment of the present disclosure.

Detailed description

Various exemplary embodiments, features, and aspects of the present disclosure will be described in detail below with reference to the drawings. The same reference numerals in the drawings indicate elements with the same or similar functions. Although various aspects of the embodiments are shown in the drawings, unless otherwise noted, the drawings are not necessarily drawn to scale.

The dedicated word "exemplary" here means "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" need not be construed as being superior or better than other embodiments.

The term "and/or" herein merely describes an association between associated objects and indicates that three relationships are possible; for example, "A and/or B" can mean: A exists alone, A and B exist at the same time, or B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of the plurality; for example, "including at least one of A, B, and C" can mean including any one or more elements selected from the set formed by A, B, and C.

In addition, in order to better illustrate the present disclosure, numerous specific details are given in the following specific embodiments. Those skilled in the art should understand that the present disclosure can also be implemented without certain specific details. In some instances, the methods, means, elements, and circuits well known to those skilled in the art have not been described in detail, so as to highlight the gist of the present disclosure.

Fig. 1 shows a flowchart of a video processing method according to an embodiment of the present disclosure. The method can be applied to a video processing device, which can be a terminal device, a server, or other processing equipment. The terminal device can be user equipment (UE), a mobile device, a user terminal, a terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, etc. In an example, the method can be applied to a cloud server or a local server, and the cloud server can be a public cloud server or a private cloud server, chosen flexibly according to actual conditions.

In some possible implementations, the video processing method can also be implemented by a processor invoking computer-readable instructions stored in the memory.

As shown in FIG. 1, in a possible implementation manner, the video processing method may include:

Step S11: Obtain a video, where at least part of the video frames in the video contain the target object.

Step S12: According to the video, detect at least one type of learning behavior of the target object in the process of watching the teaching course.

Step S13: In a case where it is detected that the target object performs at least one type of learning behavior, generate learning state information according to at least part of the video frames containing the at least one type of learning behavior and/or the duration for which the target object performs the at least one type of learning behavior.
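Steps S11 to S13 can be illustrated with a minimal Python sketch. It is not the implementation of the disclosure: the `Frame` type, the `LearningState` container, the `generate_learning_state` function, and the per-frame detector callables are all hypothetical names introduced here, and a frame is represented by a plain dictionary of already-computed attributes rather than by image data.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence

Frame = dict  # hypothetical stand-in for a decoded video frame


@dataclass
class LearningState:
    # Frame indices in which each behavior was detected (frame-based information).
    frames: Dict[str, List[int]] = field(default_factory=dict)
    # Seconds for which each behavior was observed (duration-based information).
    seconds: Dict[str, float] = field(default_factory=dict)


def generate_learning_state(
    video: Sequence[Frame],
    detectors: Dict[str, Callable[[Frame], bool]],
    fps: float,
) -> LearningState:
    state = LearningState()
    for behavior, detect in detectors.items():
        # Step S12: detect one type of learning behavior frame by frame.
        hits = [i for i, frame in enumerate(video) if detect(frame)]
        if hits:
            # Step S13: keep the frames and/or the duration of the behavior.
            state.frames[behavior] = hits
            state.seconds[behavior] = len(hits) / fps
    return state


# Step S11: a toy "video" of four frames sampled at 2 frames per second.
video = [
    {"eyes_closed": False},
    {"eyes_closed": True},
    {"eyes_closed": True},
    {"eyes_closed": False},
]
state = generate_learning_state(
    video, {"closing eyes": lambda f: f["eyes_closed"]}, fps=2.0
)
```

Keeping frame indices and durations in separate fields mirrors the "and/or" in step S13: a caller can use either kind of information or both.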

The target object can be any object whose learning state information is to be acquired, that is, an object whose learning state needs to be evaluated, and its specific form can be flexibly determined according to actual conditions. In a possible implementation, the target object may be a student, such as an elementary school student, a middle school student, or a college student; in a possible implementation, the target object may be an adult in further education, such as an adult taking part in vocational education and training or an elderly person studying at a senior college.

In the embodiments of the present disclosure, the video may be a video recorded of the target object while the target object watches the teaching course. The form of the teaching course is not limited: it may be a pre-recorded course video, a live course, a course taught on site by a teacher, etc. At least part of the video frames in the video can contain the target object, that is, whether the target object appears in the recorded video can be flexibly determined according to the actual situation. In a possible implementation, the target object may be in the video at all times. In a possible implementation, the target object may also be absent from the video frames at certain moments or during certain periods of time.

The scene in which the target object watches the teaching course can be flexibly determined according to the actual situation. In a possible implementation, the scene can be an online scene, that is, the target object watches the teaching course through online education such as an online classroom. In a possible implementation, the scene can also be an offline scene, that is, the target object watches a teaching course taught on the spot by a teacher through traditional face-to-face teaching, or watches a teaching course played as video or in another multimedia format in a classroom or other specific teaching place.

The specific form of the video can be flexibly determined according to the application scenario of the video processing method. In a possible implementation, the video can be a real-time video, such as a video recorded in real time while the target object learns in an online classroom, or a video captured in real time by a camera deployed in the classroom while the target object attends class there. In a possible implementation, the video can also be a recorded video, for example a playback video of the target object's learning recorded after the target object has learned through an online classroom, or a complete classroom learning video collected through cameras deployed in the classroom after the target object has attended class there.

For ease of description, the subsequent disclosed embodiments all take the video recorded in real time during the online classroom learning process of the target object as an example to illustrate the video processing process. The video processing process in other application scenarios can be flexibly extended with reference to the subsequent disclosed embodiments, which will not be repeated here.

After the video is obtained in step S11 as described in the above disclosed embodiments, step S12 can be used to detect at least one type of learning behavior of the target object in the process of watching the teaching course. The type and number of learning behaviors detected can be flexibly determined according to actual conditions and are not limited to the following disclosed embodiments. In a possible implementation, the learning behavior performed by the target object may include at least one of the following: performing at least one target gesture, expressing a target emotion, paying attention to the display area of the teaching course, engaging in at least one kind of interaction with other objects, not appearing in at least part of the video frames of the video, closing the eyes, and making eye contact within the display area of the teaching course.

The target gesture can be one of certain preset gestures that the target object may make while watching the teaching course. Its specific form can be flexibly set according to the actual situation; for details, refer to the subsequent disclosed embodiments, which are not expanded here.

The target emotion can be an emotion shown by the target object while watching the teaching course that reflects the target object's true feelings about the course. Its specific form can also be flexibly set according to the actual situation and is not expanded here.

Paying attention to the display area of the teaching course can reflect how attentive the target object is while watching the teaching course. The specific extent of the display area can be flexibly set according to the actual situation and is not limited to the following disclosed embodiments. In a possible implementation, the display area can be the area in which the teaching course video is displayed in the online classroom; for example, when students learn online through terminal devices such as computers, mobile phones, or tablets, the display area can be the screen of the terminal device on which the teaching course is played. In a possible implementation, the display area can be the teacher's teaching area in an offline classroom, such as the podium or the blackboard.

Engaging in at least one kind of interaction with other objects refers to learning-related interaction between the target object and other objects related to the teaching course during the course. The form of the other objects can be flexibly determined according to the actual situation. In a possible implementation, the other objects can be teaching objects, such as teachers; in a possible implementation, the other objects can also be learning objects other than the target object in the teaching process, such as the target object's classmates. The interaction can vary flexibly with the other objects involved. In a possible implementation, when the other objects are teachers, the interaction can include receiving rewards sent by the teacher, for example receiving small red flowers from the teacher or being commended by name; in a possible implementation, when the other objects are teachers, the interaction can include answering the teacher's questions, etc.; in a possible implementation, when the other objects are classmates, the interaction can include group mutual assistance, group discussion, or group study.

Not appearing in at least part of the video frames of the video can mean that the target object has left the teaching course at certain moments or during certain periods of time. For example, during online learning the target object may temporarily leave the current online learning device for personal reasons, or move out of the shooting range of that device.

Closing the eyes refers to the target object closing its eyes while watching the teaching course. Eye contact within the display area of the teaching course refers to the target object looking at the display area of the teaching course; correspondingly, from the eye contact of the target object with the display area in the video, it can further be determined when the target object is not looking at the display area of the teaching course.

With the various learning behaviors mentioned in the above disclosed embodiments, comprehensive and flexible behavior detection can be performed on the learning process of the target object, which improves the comprehensiveness and accuracy of the learning state information obtained from the detection, so that the learning state of the target object can be grasped more flexibly and accurately.

Which type or types of the learning behaviors in the above disclosed embodiments are detected in step S12 can be flexibly set according to actual conditions. In a possible implementation, all of the learning behaviors mentioned in the above disclosed embodiments can be detected at the same time; the specific detection methods and processes are detailed in the following disclosed embodiments and are not expanded here.

In the case where it is detected that the target object performs at least one type of learning behavior, the learning state information can be generated according to at least part of the video frames containing the at least one type of learning behavior and/or the duration for which the target object performs the at least one type of learning behavior. The specific form of the learning state information can be flexibly determined according to the type of learning behavior and the corresponding operation performed. In a possible implementation, where the learning state information is generated from at least part of the video frames containing at least one type of learning behavior, the learning state information may include information composed of video frames; in a possible implementation, where the learning state information is generated from the duration for which the target object performs at least one type of learning behavior, the learning state information may be data in numerical form; in a possible implementation, the learning state information may include both video frame information and numerical data; in a possible implementation, the learning state information may also include other state information. How the learning state information is generated, and the forms it takes, are described in the subsequent disclosed embodiments and are not expanded here.

As mentioned in the above disclosed embodiments, the video can be a video recorded of the target object while watching the teaching course, and the scene in which the target object watches the teaching course can be flexibly determined according to the actual situation. Correspondingly, the way the video is obtained in step S11 can also vary with the scene. In a possible implementation, where the target object watches the teaching course in an online scene, that is, through an online classroom, the video can be obtained as follows: if the video processing device and the device on which the target object learns online are the same device, that device can be used to capture video of the target object watching the teaching course; if they are different devices, the device on which the target object learns online can capture video of the target object watching the teaching course and transmit it to the video processing device in real time and/or non-real time. In a possible implementation, where the target object watches the teaching course in an offline scene, that is, takes part in face-to-face teaching or watches a teaching course video in a specific teaching place, the video can be obtained by capturing video of the target object with image acquisition equipment deployed offline (such as ordinary cameras or shooting devices deployed for security purposes). Further, if the image acquisition equipment deployed offline can perform video processing, that is, it can serve as the video processing device, the video acquisition in step S11 is then complete; if it cannot perform video processing, the video it captures can be transmitted to the video processing device in real time and/or non-real time.

As described in the above disclosed embodiments, the method of performing learning behavior detection on the target object in step S12 can be flexibly determined according to the actual situation. In a possible implementation manner, step S12 may include:

Step S121: Perform target object detection on the video to obtain a video frame containing the target object.

Step S122: Perform at least one type of learning behavior detection on the video frame containing the target object.

It can be seen from the above disclosed embodiments that, in a possible implementation manner, target object detection can be performed on the video to determine the video frame containing the target object in the video. After determining which video frames contain the target object, at least one type of learning behavior detection can be performed on the target object in the video frame containing the target object.
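The two-stage structure of steps S121 and S122 can be sketched as follows. This is a simplified illustration under stated assumptions: the function names `frames_containing_target` and `detect_learning_behaviors`, the `has_target` predicate, and the dictionary-based frames are hypothetical, and the real target and behavior detectors (for example face detection or a neural network) are replaced by simple lookups.

```python
from typing import Callable, Dict, List, Sequence

Frame = dict  # hypothetical stand-in for a decoded video frame


def frames_containing_target(
    frames: Sequence[Frame], has_target: Callable[[Frame], bool]
) -> List[int]:
    """Step S121: indices of the frames in which the target object is found."""
    return [i for i, frame in enumerate(frames) if has_target(frame)]


def detect_learning_behaviors(
    frames: Sequence[Frame],
    target_indices: Sequence[int],
    detectors: Dict[str, Callable[[Frame], bool]],
) -> Dict[str, List[int]]:
    """Step S122: run every behavior detector only on frames with the target."""
    return {
        name: [i for i in target_indices if detect(frames[i])]
        for name, detect in detectors.items()
    }


frames = [
    {"face": True, "gesture": "hand_up"},
    {"face": False, "gesture": "hand_up"},  # no target: skipped in S122
    {"face": True, "gesture": None},
]
target_indices = frames_containing_target(frames, lambda f: f["face"])
behaviors = detect_learning_behaviors(
    frames, target_indices, {"hand raising": lambda f: f["gesture"] == "hand_up"}
)
```

Restricting step S122 to the indices returned by step S121 is what keeps behavior detectors from running on frames in which the target object was not found.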

The method of detecting the target object can be flexibly determined according to the actual situation and is not limited to the following embodiments. In a possible implementation, the target object in the video can be detected by means such as face detection or face tracking. In a possible implementation, face detection or face tracking on a video frame may detect multiple objects; in this case, the detected face images can be further screened, and one or more of the objects selected as the target object. The specific screening method can be flexibly set according to the actual situation and is not limited in the embodiments of the present disclosure.

In a possible implementation manner, after the video frame containing the target object is obtained, step S122 may be used to perform at least one type of learning behavior detection on the video frame containing the target object. The implementation of step S122 can be flexibly changed according to different learning behaviors. For details, refer to the following disclosed embodiments, which will not be expanded here. In the case where multiple types of learning behaviors of the target object need to be detected, multiple methods can be combined to achieve multiple types of learning behavior detection at the same time.

In some possible implementations, detecting the target object in the video can itself complete the detection of a learning behavior of the target object in the process of watching the teaching course. That is, by performing target object detection on the video, the learning behavior of not appearing in at least part of the video frames of the video, mentioned in the above disclosed embodiments, can be determined. The learning state information can then be obtained from the video frames in which the target object is not detected, or the time for which the target object does not appear in at least part of the video frames of the video can be calculated from those video frames and used as the learning state information.
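As an illustration of how the time for which the target object does not appear might be computed, the sketch below groups frames without the target object into intervals and converts the frame count into seconds. The function names, the per-frame boolean input, and the fixed frame rate `fps` are assumptions made for this example and do not come from the disclosure.

```python
from typing import List, Sequence, Tuple


def absence_intervals(target_present: Sequence[bool]) -> List[Tuple[int, int]]:
    """Return (first, last) frame indices of each run of frames without the target."""
    intervals: List[Tuple[int, int]] = []
    start = None  # first frame of the run currently being scanned, if any
    for i, present in enumerate(target_present):
        if not present and start is None:
            start = i
        elif present and start is not None:
            intervals.append((start, i - 1))
            start = None
    if start is not None:
        # The video ends while the target object is still absent.
        intervals.append((start, len(target_present) - 1))
    return intervals


def absence_seconds(target_present: Sequence[bool], fps: float) -> float:
    """Total time, in seconds, for which the target object is not in the frame."""
    absent_frames = sum(
        last - first + 1 for first, last in absence_intervals(target_present)
    )
    return absent_frames / fps


present = [True, False, False, True, True, False]
intervals = absence_intervals(present)
total = absence_seconds(present, fps=30.0)
```

The intervals identify which video frames could be kept as frame-based learning state information, while the total gives the duration-based figure.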

In the embodiments of the present disclosure, target object detection is performed on the video to obtain the video frames containing the target object, and at least one type of learning behavior detection is performed on those video frames. Through this process, target object detection makes the detection of at least one type of learning behavior more targeted and therefore more accurate, which in turn improves the accuracy and reliability of the learning state information obtained subsequently.

As described in the above disclosed embodiments, the implementation of step S122 can be flexibly changed according to different learning behaviors. In a possible implementation manner, the learning behavior may include: performing at least one target gesture;

In this case, performing at least one type of learning behavior detection on the video frame containing the target object may include:

Detecting at least one target gesture on the video frame containing the target object;

In a case where it is detected that the number of consecutive video frames containing at least one target gesture exceeds the first threshold, at least one of the video frames containing the target gesture is recorded as a gesture start frame;

In the video frames after the gesture start frame, when the number of consecutive video frames that do not include the target gesture exceeds the second threshold, record at least one of the video frames that do not include the target gesture as the gesture end frame;

According to the number of gesture start frames and gesture end frames, determine the number of times and/or time for the target object in the video to perform at least one target gesture.

It can be seen from the above disclosed embodiments that, in a case where the learning behavior includes performing at least one target gesture, the learning behavior detection performed on the video frame of the target object may include target gesture detection.

Wherein, which gestures the target gesture specifically includes can be flexibly set according to actual conditions, and is not limited to the following disclosed embodiments. Exemplarily, the target gesture includes one or more of a hand-raising gesture, a thumb-up gesture, an OK gesture, and a victory gesture.

In a possible implementation, the target gesture can include learning-related gestures that the target object makes in response to the lecture while watching the teaching course, such as the hand-raising gesture used to answer a question, a gesture praising the lecture content or the teacher (thumbs up, etc.), the OK gesture expressing understanding or approval of the teaching content, and the victory gesture used to interact with the teacher (such as the "Yeah" gesture).

The method for detecting at least one target gesture in the video frames containing the target object can be flexibly determined according to the actual situation and is not limited to the following disclosed embodiments. In a possible implementation, the target gesture can be detected with gesture recognition algorithms. For example, the hand key points of the target object in the video frame, or the image area corresponding to the hand detection box, can be identified; gesture detection is then performed on the hand key points or on the image area corresponding to the hand detection box, and whether the target object is performing the target gesture is determined from the gesture detection result. In a possible implementation, the target gesture can be detected by a neural network with a gesture detection function, whose specific structure and implementation can be flexibly set according to the actual situation. Where the target gesture includes multiple gestures, in a possible implementation the video frame containing the target object can be input to a neural network that detects multiple gestures at the same time; in a possible implementation, the video frame containing the target object can also be input to multiple neural networks, each with a single gesture detection function, to detect the multiple target gestures.

During target gesture detection by any of the above disclosed embodiments, if the number of consecutive video frames containing at least one target gesture is detected to exceed the first threshold, at least one of these consecutive video frames containing the target gesture can be selected as the gesture start frame. The value of the first threshold can be flexibly set according to the actual situation, and the first thresholds corresponding to different target gestures can be the same or different. For example, the first threshold corresponding to the hand-raising gesture can be set to 6 and the first threshold corresponding to the thumbs-up gesture can be set to 7: if the number of consecutive video frames containing the hand-raising gesture is not less than 6, at least one of those frames can be selected as the gesture start frame of the hand-raising gesture, and if the number of consecutive video frames containing the thumbs-up gesture is not less than 7, at least one of those frames can be selected as the gesture start frame of the thumbs-up gesture. In a possible implementation, to simplify target gesture detection, the first thresholds corresponding to different target gestures can be set to the same value; in an example, the first threshold can be set to 6.

The way the gesture start frame is selected can also be flexibly set according to the actual situation. In a possible implementation, the first of the detected consecutive video frames containing the target gesture can be used as the gesture start frame of the target gesture; in a possible implementation, to reduce gesture detection errors, a frame after the first of the detected consecutive video frames containing the target gesture can also be used as the gesture start frame of the target gesture.

After the gesture start frame is determined, the gesture end frame can be determined from the video frames after the gesture start frame, that is, the time at which the target gesture in the gesture start frame ends can be determined. The specific method can be flexibly selected according to the actual situation and is not limited to the following disclosed embodiments. In a possible implementation, if, among the video frames after the gesture start frame, the number of consecutive video frames that do not contain the target gesture of the gesture start frame is detected to exceed the second threshold, at least one of these consecutive video frames that do not contain the target gesture can be recorded as the gesture end frame. The value of the second threshold can also be flexibly set according to actual conditions, and the second thresholds corresponding to different target gestures can be the same or different; they can be set in the same way as the first threshold, which is not repeated here. In an example, the second thresholds corresponding to different target gestures can be the same, for example 10: if, after the gesture start frame, 10 consecutive frames are detected that do not contain the target gesture of the gesture start frame, the target object is considered to have finished performing the target gesture. In this case, at least one of the consecutive video frames that do not contain the target gesture can be selected as the gesture end frame, in a way similar to the selection of the gesture start frame. In one example, the last of the consecutive video frames that do not contain the target gesture can be used as the gesture end frame; in another example, a frame before the last of the consecutive video frames that do not contain the target gesture can be used as the gesture end frame. In a possible implementation, if one or more video frames after the gesture start frame are detected not to contain the target object, one or more of those video frames that do not contain the target object can also be used as the gesture end frame.

After the gesture start frame and the gesture end frame are determined, the number of gesture start frames and gesture end frames contained in the video can be used to determine the number of times the target object performs one or more target gestures, the duration for which one or more target gestures are performed, and so on. Which content related to the target gesture is determined can be decided flexibly according to the requirements of the learning state information in step S13. For details, please refer to the subsequent disclosed embodiments, which will not be expanded here.
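The start-frame and end-frame logic described above can be illustrated with a minimal Python sketch. It assumes that a per-frame gesture detector has already produced one boolean per video frame; the function names, the default value of the first threshold, and the frame rate are hypothetical and are not specified by this disclosure. The sketch reads "exceeds" strictly, takes the first frame of the qualifying run as the gesture start frame, and takes the last frame of the gesture-free run as the gesture end frame, which is only one of the choices permitted above.

```python
from typing import List, Tuple


def find_gesture_segments(
    has_gesture: List[bool],
    first_threshold: int = 6,    # assumed value; the text only says it is configurable
    second_threshold: int = 10,  # example value given in the text
) -> List[Tuple[int, int]]:
    """Return (gesture start frame, gesture end frame) index pairs for one target gesture."""
    segments: List[Tuple[int, int]] = []
    start = None      # start frame of the gesture currently in progress, if any
    run_with = 0      # length of the current run of frames containing the gesture
    run_without = 0   # length of the current run of frames lacking the gesture
    for i, present in enumerate(has_gesture):
        if start is None:
            run_with = run_with + 1 if present else 0
            if run_with > first_threshold:
                start = i - run_with + 1      # first frame of the qualifying run
                run_without = 0
        else:
            run_without = 0 if present else run_without + 1
            if run_without > second_threshold:
                segments.append((start, i))   # last frame of the gesture-free run
                start, run_with, run_without = None, 0, 0
    if start is not None:
        # The video ended while the gesture was in progress: close at the last frame.
        segments.append((start, len(has_gesture) - 1))
    return segments


def gesture_count_and_seconds(segments: List[Tuple[int, int]], fps: float) -> Tuple[int, float]:
    """Number of gesture occurrences and their total duration in seconds."""
    total_frames = sum(end - start + 1 for start, end in segments)
    return len(segments), total_frames / fps
```

For example, `find_gesture_segments([False] * 3 + [True] * 8 + [False] * 12)` yields a single segment, from which the count and duration follow. If the example "10 consecutive frames" is meant inclusively, the comparisons can be changed from `>` to `>=`.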

By detecting at least one target gesture in the video frames containing the target object, and determining the gesture start frame and the gesture end frame according to the detection result, the number of times and/or the time for which the target object performs at least one target gesture in the video can be further determined. Through the above process, the gestures that the target object gives as feedback on the learning state in the video can be detected comprehensively and accurately, thereby improving the comprehensiveness and accuracy of the subsequent learning state information, so that the learning state of the target object can be grasped accurately.

In a possible implementation, the learning behavior can include: expressing the target emotion. In this case, step S12 may include:

Perform expression detection and/or smile value detection on the video frame containing the target object;

In a case where it is detected that the target object in the video frame exhibits at least one first target expression or the result of the smile value detection exceeds the target smile value, use the detected video frame as the first detection frame;

In a case where it is detected that the number of consecutive first detection frames exceeds the third threshold, it is determined that the target object generates the target emotion.

The target emotion can be any emotion set according to actual needs. For example, it can be a happy emotion, which indicates that the target object is focused on learning, or a bored emotion, which indicates that the target object is in a poor learning state. The following disclosed embodiments are described by taking a happy emotion as the target emotion; cases where the target emotion is another emotion can be derived by analogy with the subsequent disclosed embodiments.

It can be seen from the above disclosed embodiments that, when the learning behavior includes expressing the target emotion, the learning behavior of the target object can be detected through expression detection and/or smile value detection. In a possible implementation manner, the learning behavior of expressing the target emotion can be detected by expression detection alone or by smile value detection alone. In a possible implementation manner, expression detection and smile value detection can be used together to determine whether the target object expresses the target emotion. The subsequent disclosed embodiments are described by taking as an example the case where expression detection and smile value detection are used together to determine whether the target object expresses the target emotion. The remaining implementation manners can be derived by analogy with the subsequent disclosed embodiments, and will not be repeated here.

Expression detection can include detecting the expression displayed by the target object, for example, detecting which kind of expression the target object displays. The specific division of expressions can be flexibly set according to the actual situation. In a possible implementation manner, expressions can be divided into happy, calm, and so on. Smile value detection can include detecting the smile intensity of the target object, for example, detecting how broad the smile of the target object is. The result of the smile value detection can be expressed as a numerical value; for example, the detection result can be set to lie in the range [0, 100], where a higher value indicates a higher intensity or amplitude of the target object's smile. The specific expression detection and smile value detection methods can be flexibly determined according to the actual situation. Any method that can detect the expression or the degree of smile of the target object can be used as the corresponding detection method, and is not limited to the following disclosed embodiments. In a possible implementation manner, the expression detection of the target object can be realized by a facial expression recognition neural network, and in a possible implementation manner, the smile value detection of the target object can be realized by a smile value detection neural network. The structure and implementation of the facial expression recognition neural network and the smile value detection neural network are not limited in the embodiments of the present disclosure. Any neural network that can realize the expression recognition function through training, and any neural network that can realize the smile value detection function through training, can be applied to the embodiments of the present disclosure.
In a possible implementation manner, expression detection and smile value detection can also be realized by detecting the key points of the face and the mouth of the target object in the video.

Which detection results of the expression detection and the smile value detection lead to the determination that the target object produces the target emotion can be flexibly set according to the actual situation. In a possible implementation manner, when it is detected that the target object in a video frame shows at least one first target expression, or that the smile value detection result exceeds the target smile value, the target object in the video frame is considered to show the target emotion. In this case, the video frame can be used as a first detection frame. The specific expression type of the first target expression can be flexibly set according to the actual situation, and is not limited to the following disclosed embodiments. In a possible implementation manner, happy may be used as the first target expression, that is, video frames in which the detected expression of the target object is happy may be used as first detection frames. In a possible implementation manner, both happy and calm can be used as first target expressions, that is, video frames in which the detected expression of the target object is happy or calm can all be used as first detection frames. In the same way, the specific value of the target smile value can also be flexibly set according to the actual situation, and there is no specific limitation here. Therefore, in a possible implementation manner, a video frame whose smile value detection result exceeds the target smile value may also be used as a first detection frame.

In a possible implementation manner, when a video frame is detected to be a first detection frame, it may be determined that the target object produces the target emotion. In a possible implementation, in order to improve the accuracy of detection and reduce the impact of detection errors on the learning behavior detection result, it can be determined that the target object produces the target emotion when the number of consecutive first detection frames exceeds the third threshold. A sequence of consecutive video frames in which every frame is a first detection frame may be regarded as consecutive first detection frames. The value of the third threshold can be flexibly set according to the actual situation, and can be the same as or different from the first threshold or the second threshold. In an example, the third threshold can be 6, that is, when 6 consecutive frames are detected to be first detection frames, the target object can be considered to produce the target emotion.
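The decision rule above can be sketched in Python as follows. The sketch assumes that the expression recognition and smile value detection networks have already produced, for each frame, an expression label and a smile value in [0, 100]; the label strings, the default target smile value of 60, and the function names are hypothetical. "Exceeds" is read strictly, so with a third threshold of 6 the seventh consecutive first detection frame triggers the determination.

```python
from typing import FrozenSet, Iterable, Tuple


def is_first_detection_frame(
    expression: str,
    smile_value: float,
    target_expressions: FrozenSet[str],
    target_smile_value: float,
) -> bool:
    """A frame qualifies if it shows a first target expression OR a large enough smile."""
    return expression in target_expressions or smile_value > target_smile_value


def target_emotion_detected(
    frames: Iterable[Tuple[str, float]],
    target_expressions: FrozenSet[str] = frozenset({"happy"}),  # assumed label
    target_smile_value: float = 60.0,  # assumed value within [0, 100]
    third_threshold: int = 6,          # example value given in the text
) -> bool:
    """True once the run of consecutive first detection frames exceeds the threshold."""
    run = 0
    for expression, smile_value in frames:
        if is_first_detection_frame(
            expression, smile_value, target_expressions, target_smile_value
        ):
            run += 1
            if run > third_threshold:
                return True
        else:
            run = 0  # a non-qualifying frame breaks the consecutive run
    return False
```

Requiring a run of frames rather than a single frame is what suppresses isolated misclassifications; a single interrupted frame resets the count.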

Further, after it is determined that the target object produces the target emotion, a frame can be selected from the consecutive first detection frames as the target emotion start frame. Then, after the target emotion start frame, if the expression of the target object is not detected to be the first target expression for 10 consecutive frames, or the smile value detection result of the target object does not exceed the target smile value for 10 consecutive frames, or the target object cannot be detected in one or several frames, the target emotion end frame can be further determined. Then, according to the target emotion start frame or the target emotion end frame, the number of times and/or the time for which the target object produces the target emotion is determined. The specific process can refer to the corresponding process of the target gesture, which will not be repeated here.

By performing expression detection and/or smile value detection on the video frames containing the target object, and determining the first detection frames according to the results of the expression detection and the smile value detection, it is determined that the target object produces the target emotion when the number of consecutive first detection frames exceeds the third threshold. Through the above process, the emotion of the target object during learning can be flexibly determined based on the expression and the degree of smile of the target object, so that the emotional state of the target object during learning can be perceived more comprehensively and accurately, and more accurate learning state information can be generated.

In a possible implementation, the learning behavior can include: paying attention to the display area of the teaching course. In this case, step S12 may include:

Perform expression detection and face angle detection on video frames containing target objects;

In a case where it is detected that the target object in the video frame shows at least one second target expression and the face angle is within the target face angle range, use the detected video frame as the second detection frame;

When it is detected that the number of consecutive second detection frames exceeds the fourth threshold, it is determined that the target object pays attention to the display area of the teaching course.

Among them, the implementation form of the display area of the teaching course can refer to the above-mentioned disclosed embodiments, which will not be repeated here.

It can be seen from the above disclosed embodiments that, when the learning behavior includes paying attention to the display area of the teaching course, the learning behavior of the target object can be detected through expression detection and face angle detection. In a possible implementation manner, the learning behavior of paying attention to the display area of the teaching course can also be detected through face angle detection alone. The subsequent disclosed embodiments are described by taking as an example the case where expression detection and face angle detection are used together to determine whether the target object pays attention to the display area of the teaching course. The remaining implementation manners can be derived by analogy with the subsequent disclosed embodiments, and will not be repeated here.

The implementation of expression detection can refer to the above disclosed embodiments, which will not be repeated here. Face angle detection can be the detection of the orientation angle of the face. The specific face angle detection method can be flexibly determined according to the actual situation. Any method that can detect the face angle of the target object can be used as the face angle detection method, and is not limited to the following disclosed embodiments. In a possible implementation manner, the face angle detection of the target object can be realized through a face angle detection neural network. The structure and implementation of the face angle detection neural network are not limited in the embodiments of the present disclosure, and any neural network that can realize the face angle detection function through training can be applied to the embodiments of the present disclosure. In a possible implementation manner, the face angle of the target object can also be determined by detecting the key points of the target object's face in the video. The form of the face angle detected by the face angle detection can also be flexibly determined according to the actual situation. In a possible implementation, the face angle of the target object can be determined by detecting the yaw angle and the pitch angle of the target object's face.

Which detection results of the expression detection and the face angle detection lead to the determination that the target object pays attention to the display area of the teaching course can be flexibly set according to the actual situation. In a possible implementation manner, when it is detected that the target object in a video frame shows at least one second target expression and the detected face angle is within the target face angle range, the target object in the video frame can be considered to pay attention to the display area of the teaching course. In this case, the video frame can be used as a second detection frame. The specific expression type of the second target expression can be flexibly set according to the actual situation; it may be the same as or different from the first target expression mentioned in the above disclosed embodiments, and is not limited to the following disclosed embodiments. In a possible implementation manner, calm can be used as the second target expression, that is, a video frame in which the detected expression of the target object is calm and the face angle is within the target face angle range can be used as a second detection frame. In a possible implementation manner, all expressions except the "other" category can be used as second target expressions, that is, a video frame in which the detected face angle of the target object is within the target face angle range and the expression is not "other" can be used as a second detection frame. In the same way, the specific values of the target face angle range can also be flexibly set according to the actual situation, and no specific limitation is made here. In a possible implementation, the target face angle range may be static.
In one example, the overall area that the teacher may move within during the lecture (such as the podium area where the teacher is located in an offline scene) may be taken as the target face angle range; in one example, a fixed area that the target object views during the teaching course (such as the display screen that the target object looks at in an online scene) may be taken as the target face angle range. In a possible implementation, the target face angle range can also be dynamic. In one example, the target face angle range can be flexibly determined according to the current position of the teacher while moving during the lecture, that is, the value of the target face angle range can change dynamically as the teacher moves.
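As an illustration of the per-frame test described above, the following Python sketch represents the target face angle range as yaw and pitch intervals in degrees. The class and function names, the "calm" label, and the 15-degree half width used for the dynamic range are assumptions made for the example only; the disclosure does not prescribe how the range is parameterized.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class FaceAngleRange:
    """Target face angle range expressed as yaw and pitch intervals (degrees)."""
    yaw_min: float
    yaw_max: float
    pitch_min: float
    pitch_max: float

    def contains(self, yaw: float, pitch: float) -> bool:
        return (self.yaw_min <= yaw <= self.yaw_max
                and self.pitch_min <= pitch <= self.pitch_max)


def range_around(yaw_center: float, pitch_center: float,
                 half_width: float = 15.0) -> FaceAngleRange:
    """Dynamic variant: re-center the range on the teacher's current direction."""
    return FaceAngleRange(yaw_center - half_width, yaw_center + half_width,
                          pitch_center - half_width, pitch_center + half_width)


def is_second_detection_frame(
    expression: str,
    yaw: float,
    pitch: float,
    angle_range: FaceAngleRange,
    target_expressions: FrozenSet[str] = frozenset({"calm"}),  # assumed label
) -> bool:
    """A frame qualifies only if BOTH the expression and the face angle match."""
    return expression in target_expressions and angle_range.contains(yaw, pitch)
```

A static range is built once, for instance `FaceAngleRange(-20.0, 20.0, -15.0, 15.0)` for a fixed screen, whereas the dynamic case would call `range_around` for each frame with the teacher's current direction as seen from the target object.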

In a possible implementation manner, when a video frame is detected to be a second detection frame, it can be determined that the target object pays attention to the display area of the teaching course. In a possible implementation manner, in order to improve the accuracy of detection and reduce the impact of detection errors on the learning behavior detection result, it can be determined that the target object pays attention to the display area of the teaching course when the number of consecutive second detection frames exceeds the fourth threshold. A sequence of consecutive video frames in which every frame is a second detection frame may be regarded as consecutive second detection frames. The value of the fourth threshold can be flexibly set according to actual conditions, and can be the same as or different from the first threshold, the second threshold, or the third threshold. In an example, the fourth threshold can be 6, that is, when 6 consecutive frames are all detected to be second detection frames, the target object can be considered to pay attention to the display area of the teaching course.

Further, after it is determined that the target object pays attention to the display area of the teaching course, a frame can also be selected from the consecutive second detection frames as the attention start frame. Then, after the attention start frame, if the expression of the target object is not detected to be the second target expression for 10 consecutive frames, or the face angle of the target object is not within the target face angle range for 10 consecutive frames, or the target object cannot be detected in one or several frames, the attention end frame can be further determined. Then, according to the attention start frame or the attention end frame, the number of times and/or the time for which the target object pays attention to the display area of the teaching course is determined. The specific process can refer to the corresponding processes of the target gesture and the target emotion, which will not be repeated here.

By performing expression detection and face angle detection on the video frames containing the target object, and determining the second detection frames according to the results of the expression detection and the face angle detection, it is determined that the target object pays attention to the display area of the teaching course when the number of consecutive second detection frames exceeds the fourth threshold. Through the above process, whether the target object pays attention to the display area of the teaching course can be flexibly determined based on the expression and the face angle of the target object, so that the degree of concentration of the target object during learning can be perceived more comprehensively and accurately, and more accurate learning state information can be generated.

In a possible implementation manner, the learning behavior may also include: generating at least one interaction behavior with other objects. For the implementation of the interaction behavior, reference may be made to the above disclosed embodiments, which will not be repeated here. In this case, the way of detecting interaction behavior in the video frames containing the target object can be flexibly determined according to the actual situation. In a possible implementation, if the interaction behavior is an online interaction behavior, such as receiving a little red flower sent by the teacher in the online classroom, or speaking in the online classroom after being called on by the teacher, whether the target object generates an interaction behavior can be determined directly from the signals transmitted by the other objects. In a possible implementation, if the interaction behavior is an offline interaction behavior, for example, the target object being called on by the teacher to speak in the classroom, the way of detecting whether the target object generates an interaction behavior can include: recognizing a target action of the target object to determine whether the target object generates an interaction behavior. The target action can be flexibly set according to the actual situation of the interaction behavior. For example, the target action can include speaking after standing up, or facing other objects and speaking for longer than a certain time value.

In a possible implementation manner, the learning behavior may also include not appearing in at least part of the video frames in the video. In this case, step S12 may include:

Perform target object detection on the video to obtain a video frame containing the target object, and use the video frame other than the video frame containing the target object as the video frame where the target object is not detected;

In a case where the number of video frames in which the target object is not detected exceeds a preset number of video frames, determine that the detected learning behavior includes: not appearing in at least part of the video frames in the video.

The method of performing target object detection on the video is detailed in the above disclosed embodiments, and will not be repeated here. In a possible implementation manner, in addition to the video frames that contain the target object, the video may also include video frames that do not contain the target object. Therefore, the video frames that do not contain the target object can be regarded as video frames in which the target object is not detected, and when the number of video frames in which the target object is not detected exceeds the preset number of video frames, it is confirmed that the learning behavior of "not appearing in at least part of the video frames in the video" is detected. The preset number of video frames can be flexibly set according to the actual situation. In a possible implementation, the preset number of video frames can be set to 0, that is, as soon as the video contains any video frame in which the target object is not detected, the learning behavior of not appearing in at least part of the video frames in the video is considered to be detected. In a possible implementation, the preset number of video frames can also be a number greater than 0, and the specific setting can be flexibly decided according to the actual situation.
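The counting rule above amounts to a single comparison, as the following minimal Python sketch shows. It assumes that target object detection has already yielded one boolean per video frame; the function and parameter names are hypothetical.

```python
from typing import Sequence


def absence_behavior_detected(
    target_present: Sequence[bool],
    preset_frame_count: int = 0,  # 0 means any missing frame counts, as in the text
) -> bool:
    """True if the number of frames without the target object exceeds the preset number."""
    missing_frames = sum(1 for present in target_present if not present)
    return missing_frames > preset_frame_count
```

With the default of 0, a single frame without the target object is enough; a larger preset number tolerates brief detection dropouts.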

In a possible implementation, the learning behavior can also include closing the eyes. In this case, the learning behavior can be detected through closed-eye detection. The specific process of closed-eye detection can be flexibly set according to the actual situation. In an example, it can be realized by a neural network with a closed-eye detection function. In one example, whether the target object's eyes are closed can also be determined by detecting key points of the eye and of the eyeball. For example, in the case that key points of the eyeball are detected, it is determined that the target object's eyes are open; in the case that only key points of the eye are detected and key points of the eyeball are not detected, it is determined that the target object's eyes are closed. In a possible implementation manner, the learning behavior can also include making eye contact with the display area of the teaching course. In this case, the learning behavior can be detected with reference to the process of paying attention to the display area of the teaching course in the above disclosed embodiments, and the specific detection method can be flexibly changed. For example, closed-eye detection and face angle detection can be performed on the target object at the same time, and a video frame in which the face angle is within the target face angle range and the eyes are not closed is used as a third detection frame. Then, when the number of third detection frames exceeds a set threshold, it is determined that the target object makes eye contact with the display area of the teaching course.
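The key-point rule for closed eyes and the third detection frame can be sketched in Python as follows. The sketch assumes that a key-point detector reports whether eye key points and eyeball key points were found in a frame, and that the face angle test has already been evaluated; all names are hypothetical. The case where no eye key points are found is not covered by the text, so the sketch returns `None` for it as an assumption.

```python
from typing import Iterable, Optional, Tuple


def eyes_closed(eye_points_found: bool, eyeball_points_found: bool) -> Optional[bool]:
    """Key-point rule: eyeball visible -> open; eye outline only -> closed."""
    if eyeball_points_found:
        return False
    if eye_points_found:
        return True
    return None  # assumption: undetermined when no eye key points are found


def is_third_detection_frame(closed: Optional[bool], angle_in_range: bool) -> bool:
    """Face angle within the target range AND eyes known to be open."""
    return closed is False and angle_in_range


def eye_contact_detected(
    frames: Iterable[Tuple[Optional[bool], bool]],
    frame_threshold: int,  # the "set threshold" in the text; no value is given
) -> bool:
    """True if the number of third detection frames exceeds the threshold."""
    count = sum(
        1 for closed, in_range in frames if is_third_detection_frame(closed, in_range)
    )
    return count > frame_threshold
```

Unlike the first and second detection frames, the text does not require the third detection frames to be consecutive, so the sketch counts them across the whole sequence.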

After at least one type of learning behavior of the target object has been detected through any combination of the implementation manners of the above disclosed embodiments, the learning state information can be generated through step S13 when it is detected that the target object performs at least one type of learning behavior. The specific implementation of step S13 is not limited, can be flexibly changed according to the detected learning behavior, and is not limited to the following disclosed embodiments.

From the content of step S13 in the above disclosed embodiments, it can be seen that learning state information can be generated in step S13 in several ways. For example, the learning state information can be generated according to video frames containing at least one type of learning behavior; or the learning state information can be generated according to the duration for which the target object performs at least one type of learning behavior; or the two can be combined, that is, part of the learning state information is generated according to video frames containing at least one type of learning behavior, and another part of the learning state information is generated according to the duration for which the target object performs at least one type of learning behavior. When the learning state information is generated both according to video frames containing learning behaviors and according to the duration for which the target object performs at least one type of learning behavior, which learning state information is generated from which type of learning behavior, and how the two are mapped, can be flexibly set according to the actual situation. In a possible implementation, positive learning behaviors can be mapped to the process of generating learning state information according to the video frames containing the learning behaviors; for example, when the target object performs at least one target gesture, shows a positive target emotion, pays attention to the display area of the teaching course, or generates at least one interaction behavior with other objects, the learning state information can be generated according to the video frames containing these learning behaviors. In a possible implementation manner, negative learning behaviors can be mapped to the process of generating learning state information according to duration; for example, when the target object does not appear in at least part of the video frames in the video, closes its eyes, or makes no eye contact with the display area of the teaching course, the learning state information can be generated according to the duration of these learning behaviors.

In a possible implementation manner, generating learning state information according to at least some of the video frames containing at least one type of learning behavior may include:

Step S1311: Obtain video frames containing at least one type of learning behavior in the video as a target video frame set;

Step S1312: Perform face quality detection on at least one video frame in the target video frame set, and use a video frame with a face quality greater than a face quality threshold as a target video frame;

Step S1313: Generate learning state information according to the target video frame.

The video frame containing at least one type of learning behavior may be a video frame in which the target object is detected to perform at least one type of learning behavior during learning behavior detection, such as the first detection frame, the second detection frame, and the third detection frame mentioned in the above disclosed embodiments, or a video frame containing the target gesture between the gesture start frame and the gesture end frame.

After the video frames containing at least one type of learning behavior are determined, how to obtain the target video frame set can be flexibly determined. In a possible implementation manner, every video frame containing a given type of learning behavior can be obtained according to the type of learning behavior, so as to form the target video frame set of each type of learning behavior. In a possible implementation manner, only some of the frames containing each type of learning behavior can be obtained according to the type of learning behavior, and the target video frame set of that type of learning behavior is then formed from those frames; which frames are selected, and how they are selected, can be flexibly decided.

After obtaining the target video frame set corresponding to the learning behavior, step S1312 may be used to select and obtain the target video frame from the target video frame set. It can be seen from step S1312 that, in a possible implementation manner, face quality detection may be performed on the video frames in the target video frame set, and then video frames with face quality greater than the face quality threshold are used as the target video frames.

The face quality detection method can be flexibly set according to the actual situation, and is not limited to the following disclosed embodiments. In a possible implementation manner, face recognition can be performed on the face in the video frame to determine the completeness of the face in the video frame, and the face quality is determined from this completeness; in a possible implementation, the face quality can also be determined based on the clarity of the face in the video frame; in a possible implementation, the face quality in the video frame can also be judged comprehensively based on multiple parameters such as the completeness, clarity, and brightness of the face in the video frame; in a possible implementation, the video frame can also be input to a face quality neural network to obtain the face quality in the video frame. The face quality neural network can be obtained by training on a large number of face images annotated with face quality scores. Its specific implementation form can be flexibly selected according to the actual situation, and is not limited in the embodiments of the present disclosure.

The specific value of the face quality threshold can be flexibly determined according to the actual situation, which is not limited in the embodiments of the present disclosure. In a possible implementation manner, a different face quality threshold may be set for each type of learning behavior; in a possible implementation manner, the same face quality threshold may also be set for every type of learning behavior. In a possible implementation, the face quality threshold can also be set to the maximum value of the face quality in the target video frame set. In this case, the video frame with the highest face quality under each type of learning behavior can be used directly as the target video frame.
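Steps S1311 to S1313 and the threshold options above can be illustrated with the following Python sketch. It assumes that each target video frame set is given as a list of (frame index, face quality score) pairs grouped by learning behavior type, with scores produced by some face quality detector; the data layout, the names, and the use of `None` to request the best frame per behavior are choices made for this example, not part of the disclosure.

```python
from typing import Dict, List, Optional, Tuple

ScoredFrame = Tuple[int, float]  # (frame index, face quality score)


def select_target_frames(
    frame_sets: Dict[str, List[ScoredFrame]],
    quality_threshold: Optional[float] = None,
) -> Dict[str, List[int]]:
    """Pick target video frames from each behavior's target video frame set.

    With a numeric threshold, keep frames whose face quality is greater than it.
    With None, keep only the highest-quality frame of each behavior, which
    corresponds to setting the threshold to the maximum quality in the set.
    """
    selected: Dict[str, List[int]] = {}
    for behavior, frames in frame_sets.items():
        if not frames:
            selected[behavior] = []
        elif quality_threshold is None:
            best_index, _ = max(frames, key=lambda item: item[1])
            selected[behavior] = [best_index]
        else:
            selected[behavior] = [
                index for index, quality in frames if quality > quality_threshold
            ]
    return selected
```

A per-behavior threshold could be supported in the same way by passing a dictionary of thresholds instead of a single value.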

In some possible implementation manners, some video frames may contain multiple types of learning behaviors at the same time. In this case, the manner of processing video frames containing multiple types of learning behaviors can be flexibly changed according to actual conditions. In a possible implementation manner, these video frames can be assigned to each type of learning behavior they contain, and the target video frame is then selected in step S1312 from the set of video frames corresponding to each type of learning behavior; in a possible implementation manner, a video frame containing multiple types of learning behaviors at the same time can also be directly selected as the target video frame.

After the target video frame is determined through any of the foregoing embodiments, step S1313 may be used to generate learning state information according to the target video frame. The implementation of step S1313 can be flexibly selected according to the actual situation. For details, please refer to the following disclosed embodiments, which will not be expanded here.

In the embodiments of the present disclosure, the video frames containing at least one type of learning behavior are obtained as the target video frame set, the video frames with higher face quality are selected from the target video frame set of each type of learning behavior as the target video frames, and the learning state information is then generated according to the target video frames. Through the above process, the generated learning state information is based on video frames that both contain learning behaviors and have higher face quality, and is therefore more accurate, so that the learning state of the target object can be grasped more accurately.
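The selection of target video frames by face quality can be illustrated with a minimal sketch. It is not the implementation of the present disclosure: the Frame record, the behavior labels, the face quality scores in the range 0 to 1, and the function names are all hypothetical, and the face quality score is assumed to have been computed beforehand by any of the detection methods described above. A frame containing several learning behaviors is assigned to the set of each of those behaviors.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    index: int            # position of the frame in the video
    behaviors: frozenset  # learning behaviors detected in this frame (hypothetical labels)
    face_quality: float   # assumed precomputed score in [0, 1]


def select_target_frames(frames, thresholds, default_threshold=0.5):
    """Group frames by learning behavior, then keep the frames whose face
    quality is greater than the threshold configured for that behavior."""
    by_behavior = defaultdict(list)
    for frame in frames:
        for behavior in frame.behaviors:
            by_behavior[behavior].append(frame)
    return {
        behavior: [
            f for f in group
            if f.face_quality > thresholds.get(behavior, default_threshold)
        ]
        for behavior, group in by_behavior.items()
    }


def select_best_frames(frames):
    """Variant: keep only the highest-quality frame of each learning behavior."""
    best = {}
    for frame in frames:
        for behavior in frame.behaviors:
            current = best.get(behavior)
            if current is None or frame.face_quality > current.face_quality:
                best[behavior] = frame
    return best
```

Passing a different threshold per behavior in thresholds corresponds to the per-behavior setting, while an empty dictionary with a single default_threshold corresponds to one shared threshold.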

As described in the above disclosed embodiment, the implementation of step S1313 can be flexibly changed. In a possible implementation manner, step S1313 may include:

Use at least one of the target video frames as learning state information; and/or,

Identify the area where the target object is located in at least one frame of the target video frame, and generate learning state information based on the area where the target object is located.

It can be seen from the above disclosed embodiments that, in a possible implementation manner, at least one of the target video frames can be used directly as the learning state information. In one example, a further selection can be made among the obtained target video frames, either at random or subject to certain conditions, and the selected target video frames are then used directly as the learning state information; in one example, all of the obtained target video frames can also be used directly as the learning state information.

In a possible implementation manner, the area where the target object is located in the target video frame may be further identified, so as to generate learning state information according to that area. The method of identifying the area where the target object is located is not limited in the embodiments of the present disclosure. In a possible implementation manner, it can be implemented by the neural network with the target object detection function mentioned in the above disclosed embodiments. After the area of the target object in the target video frame is determined, the target video frame can be further processed accordingly to obtain the learning state information. The processing method can be flexibly determined. In one example, the image of the area where the target object is located in the target video frame can be used as the learning state information; in one example, the background area outside the area where the target object is located in the target video frame can also be rendered, for example by adding stickers, applying a mosaic to the background area, or replacing the image of the background area, so as to obtain learning state information that does not display the actual background of the target object. This better protects the privacy of the target object, and rendering methods such as stickers can also increase the diversity and visual appeal of the learning state information.

By using at least one of the target video frames as the learning state information, and/or generating the learning state information according to the area where the target object is located in the target video frame, the final learning state information can be made more flexible, so that, according to the needs of the target object, learning state information that highlights the target object more, or learning state information that better protects the privacy of the target object, can be obtained.
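The two background treatments mentioned above, mosaic and replacement, can be sketched as follows. This is only an illustrative sketch under simplifying assumptions: the image is a small grayscale grid stored as nested lists, the mask marking the area of the target object (True for target pixels) is assumed to come from a separate detection or segmentation step, and the function names are hypothetical.

```python
def pixelate_background(image, mask, block=2):
    """Apply a mosaic to background pixels only: every block x block tile is
    filled with the mean of its background pixels; target pixels are kept."""
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for top in range(0, height, block):
        for left in range(0, width, block):
            background = [
                (y, x)
                for y in range(top, min(top + block, height))
                for x in range(left, min(left + block, width))
                if not mask[y][x]
            ]
            if not background:
                continue  # the tile lies entirely inside the target area
            mean = sum(image[y][x] for y, x in background) // len(background)
            for y, x in background:
                out[y][x] = mean
    return out


def replace_background(image, mask, template):
    """Replace background pixels with the pixels of a background template
    of the same size; target pixels are kept."""
    return [
        [pixel if is_target else tpl
         for pixel, is_target, tpl in zip(row, mask_row, tpl_row)]
        for row, mask_row, tpl_row in zip(image, mask, template)
    ]
```

Averaging only the background pixels of each tile avoids blending the appearance of the target object into the rendered background.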

The above disclosed embodiments can be combined arbitrarily to obtain learning state information generated based on video frames containing learning behaviors. For example, Table 1 shows a learning state information generation rule according to an embodiment of the present disclosure.

Table 1 Rules for generating learning status information

Among them, M, N, X, Y, and Z are all positive integers, and the specific values can be set according to actual needs. In addition, the parameters such as M in different rows in Table 1 may be the same or different. The above-mentioned parameters such as M are only used as a schematic description, and not as a limitation to the present disclosure.

Here, a wonderful moment is a moment corresponding to a positive learning behavior of the target object. It can be seen from Table 1 that, in an example, when it is detected that the target object performs a target gesture such as raising a hand, expresses the target emotion of happiness, pays attention to the display area of the teaching course, or engages in a learning behavior such as roll-call interaction with the teacher, certain data processing is performed on the video, and after the data processing, further image processing is performed on the video frames to obtain the target video frame as the learning state information.

In a possible implementation manner, generating learning state information according to the duration of the target object performing at least one type of learning behavior may include:

Step S1321, in the case where it is detected that the time for the target object to perform at least one type of learning behavior is not less than the time threshold, record the duration of the at least one type of learning behavior;

In step S1322, the duration corresponding to at least one type of learning behavior is used as the learning state information.

The time threshold can be a value flexibly set according to the actual situation, and the time thresholds of different types of learning behaviors can be the same or different. When it is detected that the target object performs a certain type of learning behavior for a certain period of time, the time for which the target object performs that learning behavior can be counted and fed back to the teacher or parent as learning state information. Which learning behaviors are counted, and under what conditions the time is counted, can be flexibly set according to the actual situation.

In a possible implementation manner, when it is detected that the target object does not appear in the video (for example, no one is in the video frame, someone is in the video frame but it cannot be determined whether that person is the target object, or someone is in the frame but is not the target object), that the target object has closed eyes, or that the target object is not watching the display area of the teaching course, the duration of these learning behaviors can be counted and used as the learning state information.

In the embodiments of the present disclosure, when it is detected that the time for which the target object performs at least one type of learning behavior is not less than the time threshold, the duration of the at least one type of learning behavior is recorded and used as the learning state information. Through the above process, the learning state information is quantified, so that the learning state of the target object can be grasped more intuitively and accurately.
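Steps S1321 and S1322 can be sketched as follows. The sketch assumes, purely for illustration, that the learning behavior has already been detected frame by frame and is available as a list of booleans, and that the frame rate of the video is known; the function name and parameters are hypothetical.

```python
from itertools import groupby


def behavior_durations(flags, fps, time_threshold_s):
    """Return the duration in seconds of every uninterrupted run of frames in
    which the behavior is present, keeping only runs that last at least
    time_threshold_s (i.e. not less than the time threshold)."""
    durations = []
    for present, run in groupby(flags):
        if not present:
            continue
        seconds = sum(1 for _ in run) / fps
        if seconds >= time_threshold_s:
            durations.append(seconds)
    return durations
```

The sum of the returned values gives the total recorded duration of the behavior, which can then be used as learning state information.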

In a possible implementation, the video processing method proposed in the embodiment of the present disclosure may further include:

Rendering the background area in at least part of the video frame in the video, where the background area is an area outside the target object in the video frame.

For the method of segmenting the background area and the method of rendering it, reference may be made to the above disclosed embodiments concerning the identification of the area where the target object is located in the target video frame and the rendering performed after that identification, which will not be repeated here. In the process of rendering the background area, in one example, a universal template preset in the current video processing device can be used for rendering; in one example, other templates or customized templates can also be called from a database outside the video processing device for rendering, for example, other background templates can be called from a cloud server outside the video processing device to render the background area in the video.

By rendering the background area in at least part of the video frames in the video, on the one hand, the privacy of the target object in the video can be protected, reducing the possibility of the privacy of the target object being leaked because no suitable video capture location is available; on the other hand, it can also make the process of watching the teaching course more interesting for the target object.

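In a possible implementation manner, the video processing method proposed in the embodiments of the present disclosure may further include:

Collecting statistics on the learning state information of at least one target object to obtain a statistical result of the at least one target object;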

According to the statistical result of at least one target object, the learning state statistical data is generated.

In the embodiments of the present disclosure, a video may contain one target object or multiple target objects. In addition, the video processing method in the embodiments of the present disclosure may be used to process a single video or to process multiple videos. Correspondingly, the learning state information of one target object can be obtained, and the learning state information of multiple target objects can also be obtained. In this case, statistics can be collected on the learning state information of at least one target object to obtain a statistical result of the at least one target object. The statistical result may include not only the learning state information of the target object, but also other information related to the target object's viewing of the teaching course. For example, in a possible implementation manner, before step S12, that is, before learning behavior detection is performed on the target object, the sign-in data of the target object can also be obtained. The sign-in data of the target object may include the identity information and the sign-in time of the target object. The specific method of acquiring the sign-in data can be flexibly determined according to the actual sign-in method of the target object, which is not limited in the embodiments of the present disclosure.

After the statistical result of at least one target object is obtained, the learning state statistical data can be generated according to the at least one statistical result. Specifically, the generation method and content of the learning state statistical data can be flexibly changed according to the realization form of the statistical result. For details, please refer to the following disclosed embodiments, which will not be expanded here.

In the embodiments of the present disclosure, the statistical result of at least one target object is obtained by collecting statistics on the learning state information of the at least one target object, and the learning state statistical data is generated according to that statistical result. Through the above process, the learning states of multiple target objects can be evaluated effectively and comprehensively, which makes it easier for teachers to grasp the overall learning situation of the entire class, and also allows other relevant personnel to gain a more comprehensive understanding of where the target object currently stands in learning.

In a possible implementation manner, generating the learning state statistical data according to the statistical result of at least one of the target objects includes:

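According to the category to which the at least one target object belongs, obtaining the statistical results of the target objects contained in at least one category, and generating learning state statistical data of the at least one category, wherein the category to which the target object belongs includes at least one of the course the target object participates in, the institution with which the target object is registered, and the device used by the target object; and/or,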

Visualizing the statistical results of the at least one target object to generate statistical data of the learning state of the at least one target object.

The category to which the target object belongs may be a category divided according to the identity of the target object. For example, the category to which the target object belongs may include at least one of the course the target object participates in, the institution with which the target object is registered, and the device used by the target object. The course the target object participates in may be the teaching course watched by the target object mentioned in the above disclosed embodiments; the institution with which the target object is registered may be the educational institution where the target object is enrolled, or the grade or class of the target object; and the device used by the target object may be the terminal device that the target object uses to participate in the online course in an online scenario.

In the embodiments of the present disclosure, the statistical results of the target objects contained in at least one category can be obtained according to the category to which the target object belongs; that is, the statistical results of the target objects under the same category can be summarized to obtain statistical data on the overall learning state of that category. For example, the division can be made by categories such as device, course, and educational institution, so as to obtain, respectively, the statistical results of different target objects using the same device, the statistical results of different target objects in the same course, the statistical results of different target objects in the same educational institution, and so on. In one example, these statistical results can also be presented in the form of a report. In one example, the statistical results of each category in the report can include not only the overall learning state information of each target object, but also the specific learning state information of each target object, such as the length of time spent paying attention to the display area of the teaching course and the length of time spent smiling; in addition, the report can also contain other information related to watching the teaching course, such as the sign-in time of the target object, the number of sign-ins, the match between the target object and the faces in the preset database, the sign-in device, and the sign-in course.
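Summarizing statistical results by category can be sketched as a simple grouping step. The record fields (course, device, focus_s, smiles) and the chosen aggregates are hypothetical and serve only as an illustration; an actual statistical result may contain any of the indicators listed above.

```python
from collections import defaultdict


def summarize_by_category(results, key):
    """Group per-student statistical results by one category field
    (for example 'course', 'institution' or 'device') and aggregate them."""
    groups = defaultdict(list)
    for record in results:
        groups[record[key]].append(record)
    return {
        category: {
            "students": len(records),
            "mean_focus_s": sum(r["focus_s"] for r in records) / len(records),
            "total_smiles": sum(r["smiles"] for r in records),
        }
        for category, records in groups.items()
    }


# Hypothetical statistical results of three target objects.
results = [
    {"student": "A", "course": "math", "device": "pad-1", "focus_s": 1800, "smiles": 5},
    {"student": "B", "course": "math", "device": "pad-2", "focus_s": 1200, "smiles": 1},
    {"student": "C", "course": "art", "device": "pad-1", "focus_s": 600, "smiles": 2},
]
per_course = summarize_by_category(results, "course")
per_device = summarize_by_category(results, "device")
```

The same records can thus be viewed along several dimensions without being recomputed, simply by changing the grouping key.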

In addition, the statistical results of at least one target object can also be visualized to obtain the learning state statistical data of the at least one target object. The visualization method can be flexibly determined according to the actual situation; for example, the data can be organized into forms such as charts or videos. The content contained in the learning state statistical data can also be flexibly determined according to the actual situation; for example, it can include the overall learning state information of the target object, the name of the teaching course watched by the target object, and the specific learning state information of the target object. In one example, the identity of the target object, the name of the teaching course watched by the target object, the length of time the target object pays attention to the display area of the teaching course, the degree of concentration of the target object, the result of comparing the data of the target object with that of other target objects, the number of interactions of the target object, and the emotions of the target object can be organized into a visual report and sent to the target object or to other persons related to the target object, such as the parents of the target object.

In an example, in addition to pictures and videos, the visualized learning state statistical data can contain text content in a form such as "The subject of the class is XX. Student A concentrated for 30 minutes, with a relatively high degree of concentration, higher than that of 10% of the classmates, interacted 3 times, and smiled 5 times. Praise is hereby given; please keep up the good work" or "The subject of the class is XX. Student B has a low degree of concentration and rarely performs gestures such as raising a hand. Parents are advised to pay close attention and adjust the child's study habits in time", and so on.

In the embodiments of the present disclosure, the learning state statistical data of at least one category is generated according to the category to which at least one target object belongs, and/or the statistical result of the at least one target object is visualized to generate the learning state statistical data of the at least one target object. Through the above process, the learning state of the target object can be grasped more intuitively and comprehensively through different statistical methods.

Fig. 2 shows a block diagram of a video processing device according to an embodiment of the present disclosure. As shown in the figure, the video processing device 20 may include:

The video acquisition module 21 is configured to acquire a video, where at least part of the video frames in the video contain the target object;

The detection module 22 is used to detect at least one type of learning behavior of the target object in the process of watching the teaching course according to the video;

The generating module 23 is configured to, when it is detected that the target object performs at least one type of learning behavior, generate learning state information based on at least part of the video frames containing the at least one type of learning behavior and/or the duration for which the target object performs the at least one type of learning behavior.

In a possible implementation manner, the learning behavior includes at least one of the following behaviors: performing at least one target gesture, expressing the target emotion, paying attention to the display area of the teaching course, engaging in at least one interactive behavior with other objects, not appearing in at least part of the video frames in the video, closing eyes, and making eye contact within the display area of the teaching course.

In a possible implementation manner, the detection module is configured to: perform target object detection on the video to obtain a video frame containing the target object; and perform at least one type of learning behavior detection on the video frame containing the target object.

In a possible implementation manner, the learning behavior includes performing at least one target gesture; the detection module is further configured to: perform detection of at least one target gesture on the video frames containing the target object; when the number of consecutive video frames containing the target gesture exceeds a first threshold, record at least one of the video frames containing the target gesture as a gesture start frame; when, among the video frames after the gesture start frame, the number of consecutive video frames that do not contain the target gesture exceeds a second threshold, record at least one of the video frames that do not contain the target gesture as a gesture end frame; and determine, according to the number of gesture start frames and gesture end frames, the number of times and/or the time at which the target object in the video performs the at least one target gesture.
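The gesture start frame and gesture end frame logic can be sketched as a single pass over per-frame detection results. The sketch makes several assumptions that the present disclosure leaves open: the per-frame gesture detection result is already available as a list of booleans, the first frame of a qualifying run is the one recorded as the start or end frame, and a gesture still in progress when the video ends is not counted. The function name is hypothetical.

```python
def gesture_events(has_gesture, first_threshold, second_threshold):
    """Return a list of (start_frame, end_frame) index pairs.

    A start frame is recorded once more than first_threshold consecutive
    frames contain the gesture; after a start, an end frame is recorded once
    more than second_threshold consecutive frames do not contain it."""
    events = []
    start = None
    run_with = run_without = 0
    for i, present in enumerate(has_gesture):
        if present:
            run_with += 1
            run_without = 0
            if start is None and run_with > first_threshold:
                start = i - run_with + 1  # first frame of the qualifying run
        else:
            run_without += 1
            run_with = 0
            if start is not None and run_without > second_threshold:
                events.append((start, i - run_without + 1))
                start = None
    return events
```

The number of returned pairs gives the number of times the gesture was performed, and dividing the frame indices by the frame rate gives the corresponding times. The two thresholds keep single-frame detection noise from starting or ending a gesture.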

In a possible implementation manner, the learning behavior includes expressing the target emotion; the detection module is further configured to: perform expression detection and/or smile value detection on the video frames containing the target object; when it is detected that the target object in a video frame displays at least one first target expression, or that the smile value detection result exceeds the target smile value, take the detected video frame as a first detection frame; and when the number of consecutive first detection frames exceeds a third threshold, determine that the target object expresses the target emotion.

In a possible implementation manner, the learning behavior includes paying attention to the display area of the teaching course; the detection module is further configured to: perform expression detection and face angle detection on the video frames containing the target object; when it is detected that the target object in a video frame displays at least one second target expression and the face angle is within the target face angle range, take the detected video frame as a second detection frame; and when the number of consecutive second detection frames exceeds a fourth threshold, determine that the target object pays attention to the display area of the teaching course.
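The two determinations above share the same pattern: a per-frame condition followed by a check on the number of consecutive qualifying frames. A minimal sketch of that pattern follows. The expression labels, the smile value scale, the representation of the face angle as yaw and pitch in degrees, and all function names are assumptions made for illustration only.

```python
def is_first_detection_frame(expression, smile_value, target_expressions, target_smile):
    """Emotion: a first target expression is shown OR the smile value exceeds the target."""
    return expression in target_expressions or smile_value > target_smile


def is_second_detection_frame(expression, yaw, pitch, target_expressions, max_yaw, max_pitch):
    """Attention: a second target expression is shown AND the face angle is within range."""
    return (
        expression in target_expressions
        and abs(yaw) <= max_yaw
        and abs(pitch) <= max_pitch
    )


def exceeds_consecutive(flags, threshold):
    """True if more than `threshold` consecutive frames satisfy the condition."""
    run = 0
    for flag in flags:
        run = run + 1 if flag else 0
        if run > threshold:
            return True
    return False
```

For example, the per-frame results of is_first_detection_frame can be collected into a list and passed to exceeds_consecutive together with the third threshold, and likewise for the second detection frames and the fourth threshold.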

In a possible implementation manner, the generating module is configured to: obtain the video frames containing at least one type of learning behavior in the video as a target video frame set; perform face quality detection on at least one video frame in the target video frame set, and take the video frames whose face quality is greater than the face quality threshold as target video frames; and generate the learning state information according to the target video frames.

In a possible implementation manner, the generating module is further configured to: use at least one of the target video frames as the learning state information; and/or identify the area where the target object is located in at least one of the target video frames, and generate the learning state information based on the area where the target object is located.

In a possible implementation manner, the detection module is configured to: perform target object detection on the video to obtain the video frames containing the target object, and take the video frames in the video other than the video frames containing the target object as video frames in which the target object is not detected; and when the number of video frames in which the target object is not detected exceeds a preset number of video frames, determine that the detected learning behavior includes: not appearing in at least part of the video frames in the video.

In a possible implementation manner, the generating module is configured to: record the duration of at least one type of learning behavior when it is detected that the time for which the target object performs the at least one type of learning behavior is not less than a time threshold; and use the duration corresponding to the at least one type of learning behavior as the learning state information.

In a possible implementation manner, the device is further configured to: render a background area in at least part of the video frame in the video, where the background area is an area outside the target object in the video frame.

In a possible implementation manner, the device is further configured to: collect statistics on the learning state information of at least one target object to obtain a statistical result of at least one target object; and generate statistical data of the learning state according to the statistical result of at least one target object.

In a possible implementation manner, the device is further configured to: obtain, according to the category to which the at least one target object belongs, the statistical results of the target objects contained in at least one category, and generate learning state statistical data of the at least one category, wherein the category to which the target object belongs includes at least one of the course the target object participates in, the institution with which the target object is registered, and the device used by the target object; and/or visualize the statistical results of at least one target object to generate the learning state statistical data of the at least one target object.

Provided that logic is not violated, different embodiments of the present application can be combined with each other. The description of each embodiment has its own emphasis, and for parts not described in detail in one embodiment, reference may be made to the descriptions of the other embodiments.

In some embodiments of the present disclosure, the functions or modules contained in the device provided in the embodiments of the present disclosure can be used to execute the methods described in the above method embodiments. For the specific implementation and technical effects, reference may be made to the description of the above method embodiments, which, for the sake of brevity, will not be repeated here.

Application scenario example

Students usually learn by listening while the teacher teaches. The classroom lacks interaction and fun, students do not easily become interested in the lesson, and their real-time performance cannot serve as positive motivation for them. At the same time, institutions or teachers cannot grasp how students are following the class, and parents cannot understand their children's performance at school. In particular, affected by the epidemic, students spend a lot of time in online classes, yet whether students are actually attending class, whether they are listening carefully, and how well they perform in class interactions cannot be quantitatively evaluated. Therefore, how to effectively grasp the learning state of students has become an urgent problem to be solved.

The application example of the present disclosure proposes a learning system, which can effectively grasp the learning state of students through the video processing method proposed in the above disclosed embodiments.

Fig. 3 shows a schematic diagram of an application example according to the present disclosure. As shown in the figure, in an example, the learning system can be composed of three parts: the user terminal, the education Software-as-a-Service (SaaS) backend, and the interactive classroom backend. Students watch the teaching courses through the user terminal. The user terminal can include two parts: a hardware device for learning (such as the client with a Windows or iOS system and the SDK installed, as shown in the figure), and the application through which the student logs in to the online classroom (i.e., the user APP in the figure). The education SaaS backend can be a platform built on the server of the educational institution where the student is enrolled, and the interactive classroom backend can be a platform built on a server that aggregates data from different educational institutions and performs data maintenance. Both the education SaaS backend and the interactive classroom backend can exchange data with the user terminal through an API, thereby realizing the generation of learning state information and the generation of learning state statistical data mentioned in the above disclosed embodiments.

In the application example of the present disclosure, the process of generating learning state information may include:

The user terminal collects videos of the students watching the teaching course and processes the collected videos to obtain the learning state information of each student. The education SaaS backend and the interactive classroom backend call, through the API, the learning state information generated in the different user terminals, and perform statistical processing on the learning state information in any manner mentioned in the above disclosed embodiments to generate learning state statistical data.

In an example, the user terminal processes the collected video, and the process of obtaining the learning status information of each student may include:

A. Get the exciting moments of the students in class (that is, the positive learning behavior mentioned in the above-mentioned disclosed embodiment).

In an example, certain rules can be defined to create a collection of a student's wonderful moments. The student's performance can be edited into a short video or a set of highlight pictures and provided to parents, so that parents can evaluate the student's performance in class in a timely manner; if the performance is good, the student may be encouraged to continue participating in related courses.

In an example, the acquisition of the student's wonderful moments can begin after the student signs in successfully, and the videos or pictures of the wonderful moments are then uploaded to the backend or the cloud. At the same time, it is also possible to choose whether the student can see the uploaded wonderful moments in real time. In one example, the highlight definition rules may include the following. Performing at least one target gesture: the target gesture may include raising a hand, giving a thumbs-up, an OK gesture, a Yeah gesture, etc.; if the student is detected performing one of the above gestures within a period of time, pictures or video frames can be extracted from the video containing the gesture. Expressing the target emotion of happiness: if it is detected that the student's expression is happy within a period of time, or that the smile value reaches a certain target smile value (such as 99 points), pictures or video frames can be extracted from the video frames that carry the happy label or reach the target smile value. Paying attention to the display area of the teaching course: if the student's face orientation remains correct throughout a period of time, that is, the headpose is within a certain threshold range, pictures or video frames can be extracted from the video within this period of time.

B. Perform learning situation detection on the students (that is, detect the negative learning behaviors mentioned in the above disclosed embodiments).

In one example, the student may be absent from the picture or may be unfocused. Through learning situation detection, the data can be pushed to the parents in real time, so that the parents can notice the child's situation promptly, correct the child's bad learning habits in time, and provide auxiliary supervision.

In one example, the learning situation detection can be carried out after the student signs in successfully. For example, if for a certain length of time no one appears in front of the camera, or the student does not watch the screen, closes their eyes, etc., it is determined that the student's concentration is low. In this case, the length of time during which the student exhibits the above learning behaviors can be counted and used as the result of the learning situation detection to obtain the corresponding learning state information. For the specific configuration rules of learning situation detection, reference may be made to the above disclosed embodiments, which will not be repeated here.

Through the above disclosed examples, learning state information including wonderful moments and learning situation detection results can be obtained. Further, the process in which the education SaaS backend and the interactive classroom backend call, through the API, the learning state information generated in different user terminals to generate learning state statistical data can include:

C. Report generation (that is, the generation of statistical data of learning status in at least one category in the above disclosed embodiment).

In one example, the backend or cloud API can be used to view student sign-in information and learning state information along different dimensions such as device, course, and institution. The main data indicators can include: sign-in time, number of sign-ins, face database match (that is, the match between the target object and the faces in the preset database in the above disclosed embodiments), sign-in device, sign-in course, length of time focused, length of time smiling, etc.

D. Analysis report (that is, the visualization process in the above disclosed embodiment generates statistics on the learning status of at least one target object).

In one example, the education SaaS backend or the interactive classroom backend can consolidate the students' performance in the online classroom into a complete learning situation analysis report. The report presents the student's class status through a visual graphical interface. Furthermore, the backend can also select the better performances and push them to parents or teachers, so that teachers at the institution can analyze the student's situation and gradually help children improve their learning behavior.

In addition to the above process, the learning system can also perform background segmentation processing on the student's learning video when the student is learning through the user terminal. In one example, the user terminal may provide a background segmentation function for situations where the student does not have a background suitable for live broadcast or is unwilling to display the background image for privacy reasons. In one example, the SDK on the user side can support several different background templates. For example, several general templates can be preset. In one example, students can also call customized templates from the interactive classroom backend through the user side. In one example, the SDK can provide a background template preview interface to the app on the user side, so that students can preview, through the app, the customized templates that can be called; students can also use the background segmentation stickers in the app on the user side to render the live broadcast background. In one example, if the student is not satisfied with a sticker, the student can manually turn it off. The app on the user side can report data on the students' use of stickers to the corresponding backend (education SaaS backend or interactive classroom backend), and the corresponding backend can analyze information such as which background stickers students use and how often they are used, as additional learning status information.
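The background rendering step can be pictured as compositing each frame with a template under a foreground mask. The sketch below uses plain Python lists for readability and assumes the mask has already been produced by a segmentation model that is not shown; the function name `apply_background` and the data layout are illustrative assumptions, not the SDK's interface.

```python
def apply_background(frame, mask, template):
    """Keep frame pixels where the mask is 1 and use template pixels elsewhere.

    frame, template: 2-D lists of pixel values with identical shape.
    mask: 2-D list of 0/1 values, where 1 marks the student (foreground).
    """
    return [
        [pixel if keep else background
         for pixel, keep, background in zip(frame_row, mask_row, template_row)]
        for frame_row, mask_row, template_row in zip(frame, mask, template)
    ]


frame = [[10, 11], [12, 13]]      # toy 2x2 "video frame"
mask = [[1, 0], [0, 1]]           # hypothetical segmentation result
template = [[0, 0], [0, 0]]       # toy background template
composited = apply_background(frame, mask, template)
```

A production implementation would typically operate on image arrays and blend the mask edges; this sketch only shows the selection logic.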

The learning system proposed in the application examples of the present disclosure can not only be applied to online classrooms, but also be extended to other related fields, such as online meetings.

It can be understood that the various method embodiments mentioned in the present disclosure can be combined with each other to form combined embodiments without violating their principles and logic. For brevity, the details are not repeated in this disclosure.

Those skilled in the art can understand that, in the above-mentioned methods of the specific implementations, the order in which the steps are written does not imply a strict execution order or constitute any limitation on the implementation process. The specific execution order of each step should be determined by its function and possible internal logic.

The embodiments of the present disclosure also provide a computer-readable storage medium on which computer program instructions are stored, and the computer program instructions implement the above-mentioned method when executed by a processor. The computer-readable storage medium may be a volatile computer-readable storage medium or a non-volatile computer-readable storage medium.

An embodiment of the present disclosure also provides an electronic device, including: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the above-mentioned method.

An embodiment of the present disclosure also provides a computer program including computer-readable code, wherein, when the computer-readable code runs in an electronic device, a processor in the electronic device executes the above method.

In practical applications, the above-mentioned memory may be a volatile memory, such as RAM; or a non-volatile memory, such as ROM, flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); or a combination of the above types of memory, and it provides instructions and data to the processor.

The foregoing processor may be at least one of ASIC, DSP, DSPD, PLD, FPGA, CPU, controller, microcontroller, and microprocessor. It is understandable that, for different devices, the electronic device used to implement the above-mentioned processor function may also be other, and the embodiment of the present disclosure does not specifically limit it.

The electronic device can be provided as a terminal, server or other form of device.

Based on the same technical concept as the foregoing embodiment, the embodiment of the present disclosure also provides a computer program, which implements the foregoing method when the computer program is executed by a processor.

FIG. 4 is a block diagram of an electronic device 800 according to an embodiment of the present disclosure. For example, the electronic device 800 may be a mobile phone, a computer, a digital broadcasting terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, and other terminals.

Referring to FIG. 4, the electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power supply component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.

The processing component 802 generally controls the overall operations of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to complete all or part of the steps of the foregoing method. In addition, the processing component 802 may include one or more modules to facilitate the interaction between the processing component 802 and other components. For example, the processing component 802 may include a multimedia module to facilitate the interaction between the multimedia component 808 and the processing component 802.

The memory 804 is configured to store various types of data to support operations in the electronic device 800. Examples of these data include instructions for any application or method to operate on the electronic device 800, contact data, phone book data, messages, pictures, videos, etc. The memory 804 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable and Programmable Read Only Memory (EPROM), Programmable Read Only Memory (PROM), Read Only Memory (ROM), Magnetic Memory, Flash Memory, Magnetic Disk or Optical Disk.

The power supply component 806 provides power for various components of the electronic device 800. The power supply component 806 may include a power management system, one or more power supplies, and other components associated with the generation, management, and distribution of power for the electronic device 800.

The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touch, sliding, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure related to the touch or slide operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. When the electronic device 800 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera can be a fixed optical lens system or have focal length and optical zoom capabilities.

The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a microphone (MIC), and when the electronic device 800 is in an operation mode, such as a call mode, a recording mode, and a voice recognition mode, the microphone is configured to receive an external audio signal. The received audio signal may be further stored in the memory 804 or transmitted via the communication component 816. In some embodiments, the audio component 810 further includes a speaker for outputting audio signals.

The I/O interface 812 provides an interface between the processing component 802 and a peripheral interface module. The above-mentioned peripheral interface module may be a keyboard, a click wheel, a button, and the like. These buttons may include, but are not limited to: home button, volume button, start button, and lock button.

The sensor component 814 includes one or more sensors for providing the electronic device 800 with state evaluation in various aspects. For example, the sensor component 814 can detect the on/off status of the electronic device 800 and the relative positioning of components, for example, the display and the keypad of the electronic device 800. The sensor component 814 can also detect a change in position of the electronic device 800 or of a component of the electronic device 800, the presence or absence of contact between the user and the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and a temperature change of the electronic device 800. The sensor component 814 may include a proximity sensor configured to detect the presence of nearby objects when there is no physical contact. The sensor component 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 can access a wireless network based on a communication standard, such as WiFi, 2G, 3G, 4G, or 5G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module can be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

In an exemplary embodiment, the electronic device 800 may be implemented by one or more application-specific integrated circuits (ASIC), digital signal processors (DSP), digital signal processing devices (DSPD), programmable logic devices (PLD), field-programmable gate arrays (FPGA), controllers, microcontrollers, microprocessors, or other electronic components to perform the above methods.

In an exemplary embodiment, there is also provided a non-volatile computer-readable storage medium, such as the memory 804 including computer program instructions, which can be executed by the processor 820 of the electronic device 800 to complete the foregoing method.

FIG. 5 is a block diagram of an electronic device 1900 according to an embodiment of the present disclosure. For example, the electronic device 1900 may be provided as a server. Referring to FIG. 5, the electronic device 1900 includes a processing component 1922, which further includes one or more processors, and a memory resource represented by a memory 1932, for storing instructions executable by the processing component 1922, such as application programs. The application program stored in the memory 1932 may include one or more modules, each corresponding to a set of instructions. In addition, the processing component 1922 is configured to execute instructions to perform the above-described methods.

The electronic device 1900 may also include a power supply component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 can operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™ or the like.

In an exemplary embodiment, a non-volatile computer-readable storage medium is also provided, such as the memory 1932 including computer program instructions, which can be executed by the processing component 1922 of the electronic device 1900 to complete the foregoing method.

The present disclosure may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium loaded with computer-readable program instructions for enabling a processor to implement various aspects of the present disclosure.

The computer-readable storage medium may be a tangible device that can hold and store instructions used by an instruction execution device. The computer-readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of computer-readable storage media include: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disk read-only memory (CD-ROM), digital versatile disks (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in a groove with instructions stored thereon, and any suitable combination of the above. The computer-readable storage medium used here is not to be interpreted as a transient signal itself, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (for example, light pulses through fiber optic cables), or electrical signals transmitted through wires.

The computer-readable program instructions described herein can be downloaded from a computer-readable storage medium to various computing/processing devices, or downloaded to an external computer or external storage device via a network, such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, optical fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network, and forwards the computer-readable program instructions for storage in the computer-readable storage medium in each computing/processing device.

The computer program instructions used to perform the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions can be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, an electronic circuit, such as a programmable logic circuit, a field programmable gate array (FPGA), or a programmable logic array (PLA), is personalized by using state information of the computer-readable program instructions, and the electronic circuit can execute the computer-readable program instructions to implement various aspects of the present disclosure.

Here, various aspects of the present disclosure are described with reference to flowcharts and/or block diagrams of methods, devices (systems) and computer program products according to embodiments of the present disclosure. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer-readable program instructions.

These computer-readable program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, or other programmable data processing device to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing device, produce a device that implements the functions/actions specified in one or more blocks in the flowcharts and/or block diagrams. These computer-readable program instructions can also be stored in a computer-readable storage medium. These instructions make computers, programmable data processing apparatuses, and/or other devices work in a specific manner, so that the computer-readable medium storing the instructions includes an article of manufacture that includes instructions for implementing various aspects of the functions/actions specified in one or more blocks in the flowcharts and/or block diagrams.

It is also possible to load the computer-readable program instructions onto a computer, other programmable data processing device, or other equipment, so that a series of operation steps are executed on the computer, other programmable data processing device, or other equipment to produce a computer-implemented process, so that the instructions executed on the computer, other programmable data processing apparatus, or other equipment realize the functions/actions specified in one or more blocks in the flowcharts and/or block diagrams.

The flowcharts and block diagrams in the accompanying drawings show the possible implementation architecture, functions, and operations of the system, method, and computer program product according to multiple embodiments of the present disclosure. In this regard, each block in the flowchart or block diagram may represent a module, program segment, or part of an instruction, which contains one or more executable instructions for realizing the specified logical function. In some alternative implementations, the functions marked in the blocks may also occur in a different order from the order marked in the drawings. For example, two consecutive blocks can actually be executed substantially in parallel, or they can sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagram and/or flowchart, and the combination of the blocks in the block diagram and/or flowchart, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.

The embodiments of the present disclosure have been described above. The above description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and changes will be obvious to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terms used herein were chosen to best explain the principles of the various embodiments, their practical applications, or their technical improvements over technologies in the market, or to enable others of ordinary skill in the art to understand the various embodiments disclosed herein.

Claims

A video processing method, characterized in that it comprises:

Acquiring a video, where at least part of the video frames in the video contains the target object;

According to the video, detect at least one type of learning behavior of the target object in the process of watching the teaching course;

In the case of detecting that the target object performs at least one type of learning behavior, generating learning state information according to at least part of the video frames containing the at least one type of learning behavior and/or the duration for which the target object performs the at least one type of learning behavior.
The method according to claim 1, wherein the learning behavior includes at least one of the following behaviors: performing at least one target gesture, expressing a target emotion, paying attention to the display area of the teaching course, producing at least one interactive behavior with other objects, not appearing in at least part of the video frames in the video, closing eyes, and making eye contact within the display area of the teaching course.
The method according to claim 1 or 2, wherein the detecting at least one type of learning behavior of the target object according to the video comprises:

Performing target object detection on the video to obtain a video frame containing the target object;

At least one type of learning behavior detection is performed on the video frame containing the target object.
The method according to claim 3, wherein the learning behavior comprises performing at least one target gesture;

The performing at least one type of learning behavior detection on the video frame containing the target object includes:

Detecting at least one target gesture on the video frame containing the target object;

In a case where it is detected that the number of continuous video frames containing at least one of the target gestures exceeds the first threshold, at least one of the video frames containing the target gesture is recorded as a gesture start frame;

In the video frames after the gesture start frame, if the number of consecutive video frames that do not include the target gesture exceeds the second threshold, record at least one of the video frames that do not include the target gesture as the gesture end frame;

According to the number of the gesture start frame and the gesture end frame, determine the number of times and/or time for the target object in the video to perform at least one target gesture.
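As a minimal sketch of the start-frame and end-frame logic recited above, the following Python function scans per-frame gesture flags and pairs each gesture start with the following gesture end. The function name `find_gesture_events`, the boolean input format, and the choice of the first frame of each qualifying run as the start or end frame are assumptions; the claim only requires that at least one frame of the run be recorded.

```python
def find_gesture_events(has_gesture, start_threshold, end_threshold):
    """Return (start_frame, end_frame) index pairs for detected gestures.

    has_gesture: per-frame booleans from a gesture detector (not shown).
    A gesture starts once more than `start_threshold` consecutive frames
    contain it, and ends once more than `end_threshold` consecutive frames
    after the start do not contain it.
    """
    events = []
    start = None       # start frame of the gesture in progress, if any
    run_begin = 0      # index at which the current run of equal flags began
    run_length = 0
    previous = None
    for index, present in enumerate(has_gesture):
        if present != previous:
            run_begin, run_length, previous = index, 0, present
        run_length += 1
        if start is None and present and run_length > start_threshold:
            start = run_begin              # assumed: first frame of the run
        elif start is not None and not present and run_length > end_threshold:
            events.append((start, run_begin))
            start = None
    return events


flags = [False, True, True, True, False, True, False, False, False, True]
events = find_gesture_events(flags, start_threshold=2, end_threshold=2)
gesture_count = len(events)
```

Here the single-frame gap at index 4 does not end the gesture, because it is shorter than the second threshold; the number of recorded pairs gives the gesture count, and the frame indices can be converted to time using the frame rate.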
The method according to claim 3 or 4, wherein the learning behavior includes expressing a target emotion;

The performing at least one type of learning behavior detection on the video frame containing the target object includes:

Performing expression detection and/or smile value detection on the video frame containing the target object;

In a case where it is detected that the target object in the video frame shows at least one first target expression or a smile value detection result exceeds the target smile value, use the detected video frame as the first detection frame;

In a case where it is detected that the number of consecutive first detection frames exceeds a third threshold, it is determined that the target object generates the target emotion.
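The consecutive-frame test recited above can be illustrated as follows. In this Python sketch, each frame is represented by an `(expression, smile_value)` pair produced by detectors that are not shown; the label `"happy"`, the smile value scale, and the function name `shows_target_emotion` are hypothetical.

```python
def shows_target_emotion(frames, target_expressions, target_smile, third_threshold):
    """Return True once more than `third_threshold` consecutive frames qualify.

    A frame qualifies as a first detection frame when its expression is one of
    the target expressions or its smile value exceeds the target smile value.
    """
    consecutive = 0
    for expression, smile_value in frames:
        if expression in target_expressions or smile_value > target_smile:
            consecutive += 1
            if consecutive > third_threshold:
                return True
        else:
            consecutive = 0   # the run of first detection frames is broken
    return False


frames = [("neutral", 0.2), ("happy", 0.5), ("neutral", 0.9),
          ("happy", 0.95), ("neutral", 0.1)]
detected = shows_target_emotion(frames, {"happy"}, target_smile=0.8, third_threshold=2)
```

Requiring a run of frames rather than a single frame is what makes the detection tolerant of one-off misclassifications.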
The method according to any one of claims 3 to 5, wherein the learning behavior includes paying attention to the display area of the teaching course;

The performing at least one type of learning behavior detection on the video frame containing the target object includes:

Performing expression detection and face angle detection on the video frame containing the target object;

In a case where it is detected that the target object in the video frame shows at least one second target expression and the face angle is within the target face angle range, use the detected video frame as the second detection frame;

In a case where it is detected that the number of consecutive second detection frames exceeds a fourth threshold, it is determined that the target object pays attention to the display area of the teaching course.
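One way to picture the attention test above is as a per-frame predicate followed by a run-length check. The sketch below assumes, purely for illustration, that the face angle is given as yaw and pitch in degrees and that the target face angle range is symmetric around zero; the disclosure does not fix the angle representation, and the names used here are hypothetical.

```python
from itertools import groupby


def is_second_detection_frame(expression, yaw, pitch,
                              target_expressions, max_yaw, max_pitch):
    """A frame qualifies when the expression matches AND the face angle is in range."""
    return (expression in target_expressions
            and abs(yaw) <= max_yaw
            and abs(pitch) <= max_pitch)


def pays_attention(frames, target_expressions, max_yaw, max_pitch, fourth_threshold):
    """Return True if some run of qualifying frames is longer than the threshold."""
    flags = [
        is_second_detection_frame(e, y, p, target_expressions, max_yaw, max_pitch)
        for e, y, p in frames
    ]
    longest_run = max(
        (sum(1 for _ in run) for flag, run in groupby(flags) if flag),
        default=0,
    )
    return longest_run > fourth_threshold


frames = [("neutral", 5, 3), ("neutral", 40, 2), ("neutral", 4, 1),
          ("happy", -8, 6), ("neutral", 2, -4)]
attentive = pays_attention(frames, {"neutral", "happy"},
                           max_yaw=20, max_pitch=15, fourth_threshold=2)
```

In this example the second frame is rejected because its yaw of 40 degrees falls outside the assumed range, so the longest qualifying run is the last three frames.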
The method according to any one of claims 1 to 6, wherein the generating learning state information according to a video frame at least partially containing the at least one type of learning behavior comprises:

Acquiring video frames containing at least one type of learning behavior in the video as a target video frame set;

Perform face quality detection on at least one video frame in the target video frame set, and use a video frame with a face quality greater than a face quality threshold as a target video frame;

According to the target video frame, the learning state information is generated.
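The face quality filtering step above amounts to a threshold filter over the target video frame set. The following Python sketch assumes that each frame already carries a face quality score from a quality model that is not shown; the `(frame_id, face_quality)` tuple format and the descending sort are illustrative assumptions.

```python
def select_target_frames(frame_set, quality_threshold):
    """Keep frames whose face quality is greater than the threshold.

    frame_set: list of (frame_id, face_quality) tuples (hypothetical format).
    The result is sorted by quality, best first; the ordering is an assumption
    added for convenience and is not required by the method.
    """
    kept = [(frame_id, quality) for frame_id, quality in frame_set
            if quality > quality_threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


frame_set = [(10, 0.4), (11, 0.9), (12, 0.7)]
target_frames = select_target_frames(frame_set, quality_threshold=0.6)
```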
The method according to claim 7, wherein said generating said learning state information according to said target video frame comprises:

Use at least one of the target video frames as learning state information; and/or,

Identify the area where the target object is located in at least one frame of the target video frame, and generate the learning state information based on the area where the target object is located.
The method according to claim 1 or 2, wherein the detecting at least one type of learning behavior of the target object according to the video comprises:

Performing target object detection on the video to obtain a video frame containing the target object, and using a video frame other than the video frame containing the target object in the video as a video frame in which no target object is detected;

In a case where the number of video frames in which the target object is not detected exceeds the preset number of video frames, detecting the learning behavior includes: not appearing in at least part of the video frames in the video.
The method according to any one of claims 1 to 9, wherein the generating learning state information according to the duration of the target object performing the at least one type of learning behavior comprises:

If it is detected that the time for the target object to perform at least one type of learning behavior is not less than a time threshold, record the duration of at least one type of learning behavior;

The duration corresponding to at least one type of the learning behavior is used as the learning state information.
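The time-threshold rule above can be sketched as a filter over detected behavior intervals. In the following Python sketch, the `(behavior, start_seconds, end_seconds)` tuple format, the behavior names, and the function name `record_durations` are assumptions made for illustration.

```python
def record_durations(intervals, time_threshold):
    """Record durations of behaviors that last at least `time_threshold` seconds.

    intervals: (behavior, start_seconds, end_seconds) tuples from upstream
    detection (hypothetical format). Shorter intervals are ignored.
    """
    learning_state = {}
    for behavior, start, end in intervals:
        duration = end - start
        if duration >= time_threshold:   # "not less than" the time threshold
            learning_state.setdefault(behavior, []).append(duration)
    return learning_state


intervals = [("focus", 0.0, 40.0), ("smile", 41.0, 42.0),
             ("focus", 50.0, 52.0), ("eyes_closed", 60.0, 65.0)]
learning_state = record_durations(intervals, time_threshold=5.0)
```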
The method according to any one of claims 1 to 10, wherein the method further comprises:

Rendering a background area in at least a part of the video frame in the video, where the background area is an area outside the target object in the video frame.
The method according to any one of claims 1 to 11, wherein the method further comprises:

Collecting statistics on the learning state information of at least one of the target objects to obtain a statistical result of at least one of the target objects;

According to the statistical result of at least one of the target objects, the learning state statistical data is generated.
The method according to claim 12, wherein said generating statistical data of learning state according to a statistical result of at least one of said target objects comprises:

According to the category to which at least one of the target objects belongs, the statistical results of the target objects contained in at least one of the categories are obtained, and the learning state statistical data of at least one category is generated, wherein the category to which the target object belongs includes at least one of: the course in which the target object participates, the institution with which the target object is registered, and the device used by the target object; and/or,

Visual processing is performed on the statistical results of at least one of the target objects to generate statistical data of the learning state of at least one of the target objects.
A video processing device, characterized in that it comprises:

A video acquisition module, configured to acquire a video, wherein at least part of the video frames in the video contain the target object;

The detection module is configured to detect at least one type of learning behavior of the target object in the process of watching the teaching course according to the video;

A generating module, configured to, in the case of detecting that the target object performs at least one type of learning behavior, generate learning state information according to at least part of the video frames containing the at least one type of learning behavior and/or the duration for which the target object performs the at least one type of learning behavior.
An electronic device, characterized in that it comprises:

A processor;

A memory for storing processor executable instructions;

Wherein, the processor is configured to call instructions stored in the memory to execute the method according to any one of claims 1-12.
A computer-readable storage medium having computer program instructions stored thereon, wherein the computer program instructions implement the method according to any one of claims 1 to 13 when the computer program instructions are executed by a processor.
A computer program, comprising computer-readable code, when the computer-readable code runs in an electronic device, the processor in the electronic device executes the Methods.