WO2021121302A1 - Video collection control method, electronic device, and computer-readable storage medium - Google Patents
- Publication number
- WO2021121302A1 (PCT/CN2020/137100)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- video frame
- finger
- area
- frame
- video
- Prior art date
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N7/00—Television systems
- H04N7/14—Systems for two-way working
- H04N7/141—Systems for two-way working between two video terminals, e.g. videophone
- H04N7/142—Constructional details of the terminal equipment, e.g. arrangements of the camera and the display
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/02—Input arrangements using manually operated switches, e.g. using keyboards or dials
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/47—End-user applications
- H04N21/478—Supplemental services, e.g. displaying phone caller identification, shopping application
- H04N21/4788—Supplemental services, e.g. displaying phone caller identification, shopping application communicating with other users, e.g. chatting
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N7/00—Television systems
- H04N7/14—Systems for two-way working
Definitions
- This application relates to the field of image processing, and in particular to a video capture control method, electronic equipment, computer-readable storage media, computer program products, and chips.
- The video communication method, video acquisition method, and electronic equipment provided by the present application avoid the distortion of finger proportions during video communication and improve the quality of video communication.
- An embodiment of the present invention provides an electronic device, including a display, a keyboard, a camera, and a processor. The camera is arranged near the keyboard and is used to collect video frames during video communication and send the collected video frames to the processor. The processor is connected to the display, the keyboard, and the camera, and is used to: receive the first video frame from the camera; determine that the first video frame contains content that conforms to a preset finger model, and remove the finger in the first video frame to obtain a second video frame; and send the second video frame to the display for display, and/or send the second video frame to the opposite electronic device for display.
- Because the finger area is recognized automatically by determining whether the first video frame contains content that conforms to the preset finger model, the finger in the first video frame is removed automatically and the second video frame is output without manual removal by the user, which improves the efficiency of finger removal; intelligently identifying and removing the finger in real time also improves the intelligence of human-computer interaction without affecting the fluency of the video communication.
- Removing the finger in the first video frame includes: determining that the first video frame contains content that conforms to the preset finger model and that the finger is located in the bottom area of the first video frame, and then removing the finger in the first video frame; or determining that the first video frame contains content that conforms to the preset finger model and that the finger area does not overlap the position of the face, and then removing the finger in the first video frame; or determining that the first video frame contains content that conforms to the preset finger model and that the finger is located in the bottom area of the first video frame and is connected to a side of the first video frame, and then removing the finger in the first video frame. Based on the above solution, the finger is removed only when it is located in a specific area of the first video frame, so the automatic removal is more accurate and the possibility of removing an undistorted finger is reduced.
- Determining that the first video frame contains content that conforms to a preset finger model and then removing the finger in the first video frame includes: obtaining a keyboard input signal; and, upon determining that the first video frame contains content that conforms to the preset finger model and that the time of the obtained keyboard input signal and the time of obtaining the first video frame satisfy a preset time threshold, removing the finger in the first video frame.
- Because detection of the keyboard input signal is one of the conditions for removing the finger, it can be ensured that the removed finger is a typing finger, so that distorted fingers are removed accurately and fingers are prevented from occluding the video frame picture.
- An embodiment of the present invention provides an electronic device, including: one or more processors; one or more memories; multiple application programs; and one or more computer programs, wherein the one or more computer programs are stored in the one or more memories and include instructions.
- When the instructions are executed, the electronic device performs the following steps: obtain a first video frame and obtain a keyboard input signal; determine that the first video frame contains content that conforms to a preset finger model and that the time of the obtained keyboard input signal and the time of obtaining the first video frame satisfy a preset time threshold, and then remove the finger in the first video frame to obtain a second video frame; and indicate that the second video frame is displayed.
- In this way, the technical problem of finger distortion in the output video frame can be solved, and the video transmitted during video communication can be prevented from being an incomplete picture blocked by a finger.
- Because the finger area is recognized automatically by determining whether the first video frame contains content that conforms to the preset finger model, the finger in the first video frame is removed automatically and the second video frame is output without manual removal by the user, which improves the efficiency of finger removal; intelligently identifying and removing the finger in real time also improves the intelligence of human-computer interaction without affecting the fluency of the video communication.
- Because detection of the keyboard input signal is one of the conditions for removing the finger, the typing finger in the video frame can be removed accurately.
- An embodiment of the present invention provides a video capture control method, applied to an electronic device that includes a keyboard and a camera arranged near the keyboard. The method includes: acquiring a first video frame through the camera; determining that the first video frame contains content that conforms to a preset finger model, and then removing the finger in the first video frame to obtain a second video frame; and displaying the second video frame, and/or sending the second video frame to the opposite electronic device for display.
- Because the finger area is recognized automatically by determining whether the first video frame contains content that conforms to the preset finger model, the finger in the first video frame is removed automatically and the second video frame is output without manual removal by the user, which improves the efficiency of finger removal; intelligently identifying and removing the finger in real time also improves the intelligence of human-computer interaction without affecting the fluency of the video communication.
- An embodiment of the present invention provides a video communication control method, including: obtaining a first video frame and obtaining a keyboard input signal; determining that the first video frame contains content that conforms to a preset finger model and that the time of the obtained keyboard input signal and the time of obtaining the first video frame satisfy a preset time threshold, and then removing the finger in the first video frame to obtain a second video frame; and indicating that the second video frame is displayed.
- Because the finger area is recognized automatically by determining whether the first video frame contains content that conforms to the preset finger model, the finger in the first video frame is removed automatically and the second video frame is output without manual removal by the user, which improves the efficiency of finger removal; intelligently identifying and removing the finger in real time also improves the intelligence of human-computer interaction without affecting the fluency of the video communication.
- Because detection of the keyboard input signal is one of the conditions for removing the finger, the typing finger in the video frame can be removed accurately.
- An embodiment of the present invention provides an electronic device, including: one or more processors; one or more memories; multiple application programs; and one or more computer programs, wherein the one or more computer programs are stored in the one or more memories and include instructions.
- When the instructions are executed, the electronic device performs the following steps: obtain a first video frame; determine that the first video frame contains content that conforms to a preset finger model and that the finger is located in the bottom area of the first video frame, and then remove the finger in the first video frame to obtain a second video frame; and indicate that the second video frame is displayed.
- Because the finger area is recognized automatically by determining whether the first video frame contains content that conforms to the preset finger model, the finger in the first video frame is removed automatically and the second video frame is output without manual removal by the user, which improves the efficiency of finger removal; intelligently identifying and removing the finger in real time also improves the intelligence of human-computer interaction without affecting the fluency of the video communication.
- Because detection of the finger area in the bottom area of the video frame is used as the removal condition, a finger placed in a specific position can be distinguished from fingers in other positions, achieving more accurate finger removal.
- FIG. 1 is a structural diagram of a notebook computer with a prior art camera located in the keyboard area;
- FIG. 2 is a schematic diagram of a picture containing typing fingers acquired based on the front camera of the notebook computer shown in FIG. 1 in the prior art;
- FIG. 3 is a structural diagram of an electronic device according to an embodiment of the present invention;
- FIG. 4 is a software framework diagram of an embodiment of the present invention;
- FIG. 5 is a schematic diagram of another structure of an electronic device introduced in an embodiment of the present invention.
- FIG. 6 is a flowchart and interface comparison diagram of the video control method introduced by the embodiment of the present invention.
- FIG. 7 is a schematic diagram of a finger area determined by an embodiment of the present invention.
- FIG. 8 is a schematic diagram of an implementation manner of replacing a finger area by replacing content in an embodiment of the present invention.
- FIG. 9 is a schematic diagram of another implementation manner of replacing a finger area by replacing content in an embodiment of the present invention.
- FIG. 10 is a schematic diagram of generating prompt information after removing a finger in an embodiment of the present invention.
- FIG. 11 is a schematic diagram of generating prompt information before removing a finger in an embodiment of the present invention.
- FIG. 12 is a flowchart of a method for training a semantic segmentation model in an embodiment of the present invention;
- FIG. 13 is a flowchart of identifying a finger area in an image based on a semantic segmentation model in an embodiment of the present invention;
- FIG. 15A is a schematic diagram of an image of a typing finger collected by a front camera in an embodiment of the present invention;
- FIG. 15B is a schematic diagram of a finger region mask determined by recognizing the image shown in FIG. 15A based on a semantic segmentation model in an embodiment of the present invention;
- FIG. 15C is a schematic diagram of the finger area mask after noise reduction is performed on the finger mask area in an embodiment of the present invention;
- FIG. 16 is a flowchart of the image processing method introduced in the embodiment of the present invention.
- FIG. 17 is a flowchart of a video communication method introduced in an embodiment of the present invention.
- FIGS. 18A-18C are schematic diagrams of video frames collected by a front camera in a video communication method in an embodiment of the present invention, the typing finger area in the video frames, and the image frames output after processing;
- FIG. 19 is a flowchart of a video communication method introduced by another embodiment of the present invention.
- FIG. 20 is a flowchart of determining whether a user is in a typing state in the video communication method introduced by another embodiment of the present invention;
- FIG. 21 is a flowchart of a video processing method introduced in an embodiment of the present invention.
- FIG. 22 shows an interface change diagram of a specific application scenario of the present invention;
- FIG. 23 shows an interface change diagram of another specific application scenario of the present invention.
- The terms “first” and “second” are used only for descriptive purposes and cannot be understood as indicating or implying relative importance or implicitly indicating the number of the indicated technical features. Thus, features defined with “first” and “second” may explicitly or implicitly include one or more of those features. In the description of the embodiments of the present application, unless otherwise specified, “plurality” means two or more.
- Electronic equipment is equipped with cameras, microphones, global positioning system (GPS) chips, various sensors (such as magnetic field sensors, gravity sensors, and gyroscope sensors), and other devices to sense the external environment and user actions. According to the perceived external environment and user actions, the electronic device provides the user with personalized, contextual service experiences. The camera can obtain rich and accurate information, enabling the electronic device to perceive the external environment and user actions.
- The embodiments of the present application provide an electronic device, which can be implemented as any of the following devices: a mobile phone, a tablet computer (pad), a portable game console, a personal digital assistant (PDA), a notebook computer, an ultra-mobile personal computer (UMPC), a handheld computer, a netbook, an in-vehicle media playback device, a wearable electronic device, a virtual reality (VR) terminal device, an augmented reality (AR) terminal device, or another digital display product.
- FIG. 3 shows a schematic diagram of the structure of the electronic device 100.
- The electronic device 100 shown in FIG. 3 is only an example; the electronic device 100 may have more or fewer components than those shown, two or more components may be combined, or the components may be configured differently. The various components shown in the figure may be implemented in hardware, software, or a combination of hardware and software, including one or more signal processing and/or application-specific integrated circuits.
- The electronic device 100 may include: a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone jack 170D, a sensor module 180, buttons 190, a motor 191, an indicator 192, a camera 193, a display 194, a subscriber identification module (SIM) card interface 195, and so on.
- The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, etc.
- The structure illustrated in the embodiment of the present invention does not constitute a specific limitation on the electronic device 100. The electronic device 100 may include more or fewer components than those shown in the figure, combine certain components, split certain components, or arrange the components differently. The illustrated components can be implemented in hardware, software, or a combination of software and hardware.
- The software architecture involved in this application includes: an application layer, the Windows multimedia framework, a control layer, a core layer, a platform layer, a camera driver, and camera hardware. The finger occlusion processing module involved in this application is integrated in the MFT (Media Foundation Transforms) module of the core layer, and this module can also integrate other functions.
- The processed video frames are sent to application software, such as video communication software, through Media Sink. The kernel layer is the layer between hardware and software; it includes at least a display driver, a camera driver, an audio driver, a sensor driver, and a finger occlusion processing component. The finger occlusion processing component integrates the function, introduced in the embodiment of the present invention, of processing video frames that contain preset objects.
- The finger occlusion processing component can identify the finger in a video frame and remove it to obtain a video frame that does not contain the finger; the component can then output the processed video frame to the display. If the processed video frame is to be transmitted to the opposite end, the finger occlusion processing component delivers the processing result to the video application software through the Windows multimedia framework, and it is delivered to the opposite end through the end-to-end connection established by the video application software.
- An embodiment of the present invention provides an electronic device 100; please refer to FIG. 5.
- The electronic device includes:
- The camera 52 is arranged near the keyboard 51. As shown in FIG. 5, the camera 52 can be arranged on the plane to which the keyboard 51 belongs, such as cameras 52a and 52c. The camera 52a is arranged in the area where the keyboard belongs, where this area refers to the rectangle determined by the upper left corner point and the lower right corner point of the keyboard.
- Alternatively, the camera 52a can be set in an area determined by moving this rectangle outward by a preset distance (for example, 0.5 cm, 1 cm, or 2 cm); or the camera 52 can be arranged on the frame 50a of the display 50, for example on the left frame, right frame, or lower frame of the display 50.
- The camera 52 can be arranged in the lower part of the frame, for example within the lower 1/2 or 1/3 of the frame, or at the bottom of the frame, and so on. The camera is set near the keyboard and is used to collect video frames during video communication and send the collected video frames to the processor. The camera 52b shown in FIG. 5 is disposed on the lower part of the frame 50a.
- The processor (not shown in the figure) is connected to the display 50, the keyboard 51, and the camera 52. The processor is used to: receive the first video frame from the camera; determine that the first video frame contains content that conforms to the preset finger model, and remove the finger in the first video frame to obtain a second video frame; and send the second video frame to the display for display, and/or send the second video frame to the opposite electronic device for display.
- The method includes the following steps:
- S610: Determine that the first video frame contains content that meets the preset finger model, and remove the finger in the first video frame to obtain a second video frame; send the second video frame to the display for display, and/or send the second video frame to the opposite electronic device for display.
- The electronic device 100 can perform video collection after detecting a user's video shooting operation (for example, clicking a video shooting button, making a preset gesture, or issuing a voice command); video capture is performed, and the collected video is then sent to the peer electronic device.
- When the electronic device detects the user's video communication operation (or video shooting operation), it generates a video communication instruction; the processor responds to the video communication instruction, starts the video communication software, and sends an instruction to the camera driver to control the camera to perform video capture. The camera driver sends the collected data to the finger occlusion processing component, which performs the subsequent operations.
- Step S610 may be executed by a processor.
- The first video frame may be input to the semantic segmentation model, which determines the mask of the finger area in the first video frame; the finger area in the first video frame is then determined through the mask of the finger area. The mask of the finger area can be used directly as the finger area, or the finger area can be obtained after noise reduction is performed on the mask of the finger area. If the finger area exists, it is determined that the first video frame contains content that conforms to the preset finger model. The semantic segmentation model is obtained by training on sample photos, each of which contains the user's finger and is labeled with the finger area; how the finger area in a video frame is determined through the semantic segmentation model is introduced later and is not repeated here.
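As a concrete illustration of the above, the following Python sketch shows one way the finger area could be derived from a segmentation model's per-pixel output. It is an editor's sketch under stated assumptions, not the patent's implementation: `seg_model` is a hypothetical callable returning a per-pixel finger-confidence map, and the 0.5 threshold is an illustrative choice.

```python
import numpy as np

def finger_area_mask(frame_bgr, seg_model, threshold=0.5):
    """Return a binary finger-area mask, or None when the frame contains
    no content that conforms to the preset finger model."""
    x = frame_bgr.astype(np.float32) / 255.0    # normalize pixel values to 0..1
    prob = seg_model(x)                         # hypothetical model call: HxW confidences
    mask = (prob > threshold).astype(np.uint8)  # binarize into the finger mask
    return mask if mask.any() else None
```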
- Figure 7 is a schematic diagram of the determined finger area.
- Figure 7 contains six small images, Figures 7a to 7f.
- Figure 7b is a schematic diagram of a captured image of the user's fingers; the finger area determined by the semantic segmentation model from the image shown in Figure 7b is shown in Figure 7a.
- In step S610, when it is determined that the first video frame contains content that conforms to the preset finger model, the fingers in the first video frame are removed (that is, all the fingers in the first video frame are removed);
- The preset condition may be any of a variety of conditions; four of them are listed below for introduction.
- In the specific implementation process, the preset condition is not limited to the following four situations. (The following four preset conditions all correspond to the presence of the user's typing finger in the first video frame; the purpose of judging whether a preset condition is met is to remove the finger only when the user's typing finger is present in the first video frame. Fingers that are not typing fingers can be retained, for example when the user raises a finger to show the other party a manicure, or holds up a teacup to drink tea; such fingers are fingers that the user does not want removed. This solution ensures that only the user's typing finger (that is, the finger placed on the keyboard) is removed and other fingers are not; other fingers are usually not close to the camera, so they have no deformation problem. On the one hand this reduces the possibility that a distorted finger blocks the picture, and on the other hand it retains fingers with normal proportions, achieving the technical effect of accurately removing proportion-distorted fingers.)
- The first is to determine that the first video frame contains content that conforms to the preset finger model and that the finger is located in the bottom area of the first video frame; it is then determined that a typing finger is present, and the finger in the first video frame can be removed.
- Taking Figure 7a as an example, it contains two connected areas, namely connected areas 64 and 65, both of which are connected to the bottom; there are therefore two bottom connected areas in Figure 7a. In this case, a typing finger is present in the first video frame, which means the preset condition is met, so the fingers in the first video frame can be removed.
- The second is to determine that the first video frame contains content that meets the preset finger model and that the finger area does not overlap the position of the human face; it is then determined that a typing finger is present, and the finger in the first video frame can be removed. Generally, the area where the face is located is the central area of the video frame. If, for example, the user raises a finger to show the other party, the connected area may be located in the area where the face is; in that case the finger is not a typing finger but a finger being shown to the other party, so the preset condition is not met and there is no need to remove the finger from the first video frame. Only when the finger area does not overlap the face area is the preset condition met and the finger in the first video frame removed. The overlap of the finger area and the area where the face is located may be a partial overlap or a full overlap, which is not limited in the embodiment of the present invention.
- The third is a keyboard input signal: if a keyboard input signal is obtained, it indicates that a typing finger is present in the first video frame, and therefore that the first video frame contains a finger that satisfies the preset condition. The keyboard signal can be collected to determine whether a keyboard input signal exists; specifically, whether a keyboard input signal is detected within a preset time period before or after the first video frame is collected can be used to determine whether the preset condition is met. If a keyboard input signal is detected within the preset time period (for example, 1 second or 2 seconds), the preset condition is deemed met; otherwise it is deemed not met. When the keyboard input signal is used as one of the conditions for triggering removal of the finger in the first video frame, it can be ensured that the removed finger is the finger of a user in the typing state, as sketched below.
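A minimal sketch of this timing check, assuming timestamps in seconds and a keyboard hook that records `last_key_time`; the 1-second window is the example value from the text.

```python
def keyboard_condition_met(frame_time, last_key_time, window_s=1.0):
    """True when a keyboard input signal was detected within the preset
    time period around the capture time of the first video frame."""
    return last_key_time is not None and abs(frame_time - last_key_time) <= window_s
```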
- The fourth is to determine that the first video frame contains content that conforms to the preset finger model and that the finger is located in the bottom area of the first video frame and is connected to a side of the first video frame, and then to remove the finger in the first video frame. When fingers are located in the bottom area they are likely to be typing fingers, but they may also be fingers operating the touchpad. Typing fingers are often located on the two sides of the keyboard, so they are generally connected to the sides of the first video frame; in Figure 7a, the bottom connected area 65 is connected to the right side of the first video frame. This solution can therefore locate typing fingers in the first video frame more accurately. If the finger is located in the bottom area of the first video frame and is connected to a side of the first video frame, it is determined that the preset condition is satisfied; otherwise, it is determined that the preset condition is not satisfied.
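The bottom-area and side-connectivity test can be expressed with connected-component statistics, for example using OpenCV. A sketch under the assumption that `mask` is the binary finger mask from the segmentation model:

```python
import cv2

def has_typing_finger(mask):
    """True if some connected area of the finger mask touches the bottom of
    the frame and is also connected to a side edge (the fourth condition)."""
    h, w = mask.shape
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
    for i in range(1, n):                       # label 0 is the background
        x, y, bw, bh, _area = stats[i]
        touches_bottom = (y + bh) >= h          # a bottom connected area
        touches_side = (x == 0) or ((x + bw) >= w)
        if touches_bottom and touches_side:
            return True
    return False
```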
- When it is determined in S610 that the first video frame contains content that conforms to the preset finger model, and after the finger area has been determined, it can be further determined whether the finger area is in an abnormal state. This check may be performed once it is determined that the preset condition is met: if the finger area is in an abnormal state, the finger in the first video frame is not removed and the first video frame is output directly (for example, displayed on the display or sent to the peer electronic device for display), a step that can be executed by the processor; otherwise, the finger in the first video frame is removed. It is also possible to determine whether the finger area determined based on the preset finger model is in an abnormal state whenever it is detected that the first video frame contains content that conforms to the preset finger model; the embodiment of the present invention does not limit this.
- The abnormal state can take many forms; four are listed below for introduction. Of course, in the specific implementation process it is not limited to the following four situations.
- The first is that the area of the finger area is greater than a preset area threshold (as shown in Figure 7c), for example: greater than 5000 pixels, or greater than 1/4 or 1/3 of the frame. If the finger area is too large, it may not fuse well with other areas after removal. Therefore, to ensure that the finger area of the output video frame (the second video frame) fuses with its background area, the operation of removing the finger in the first video frame is performed only when the area of the finger area is not greater than the preset area threshold.
- The second is that the finger area contains three connected areas: two bottom connected areas (that is, connected areas located at the bottom of the first video frame) and one intermediate connected area that overlaps the area where the face is located (how the face area is determined is introduced later). In this case, one of the user's hands is in the bottom area of the first video frame and the other is not, which usually means the user is typing with one hand and doing something other than typing with the other (for example, drinking water or touching their hair). The finger area is then considered to be in an abnormal state, there is no need to remove the fingers in the first video frame, and the first video frame is output directly.
- The third is that there are at least two bottom connected areas and the distance between them is greater than a preset distance threshold, for example 100 pixels or 150 pixels. This indicates that the distance between the user's two hands in the first video frame is greater than a first preset distance (the first preset distance is equal to, or positively correlated with, the preset distance threshold).
- The fourth is that there is a non-bottom connected area, as shown in Figure 7f. This often means the user is typing with one hand and doing something else with the other. Removing both hands would not meet the user's requirements, while removing only the fingers in the bottom area of the first video frame would make the picture more abrupt, so this is considered an abnormal state and the fingers in the first video frame are not removed.
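The four abnormal states can be screened with the same connected-component statistics. The thresholds below are the example values from the text, and the face-overlap case is folded into the non-bottom check for brevity; this is an illustrative sketch, not the patent's exact logic.

```python
import cv2

def finger_area_abnormal(mask, area_thresh=5000, dist_thresh=100):
    """True when the finger area is in one of the abnormal states above."""
    h, _w = mask.shape
    if int(mask.sum()) > area_thresh:           # state 1: finger area too large
        return True
    n, _labels, stats, cents = cv2.connectedComponentsWithStats(mask, connectivity=8)
    bottom_xs, non_bottom = [], 0
    for i in range(1, n):
        y, bh = stats[i][1], stats[i][3]
        if (y + bh) >= h:
            bottom_xs.append(cents[i][0])       # centroid x of a bottom area
        else:
            non_bottom += 1                     # states 2 and 4: non-bottom area
    if non_bottom:
        return True
    if len(bottom_xs) >= 2:                     # state 3: hands too far apart
        return (max(bottom_xs) - min(bottom_xs)) > dist_thresh
    return False
```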
- In step S610, the finger in the first video frame can be removed in a variety of ways; three of them are listed below for introduction. Of course, in the specific implementation process, removal is not limited to the following three situations.
- The first method is to replace the content of the finger area in the first video frame with replacement content to obtain the second video frame.
- For example, the electronic device responds to a video communication operation, such as clicking the video communication button in the interface of the first contact in the video communication software; it then enters the video communication state and displays the video communication interface. The video communication interface contains a video preview window and a video receiving window: the video preview window is used to display the video frames collected by the current electronic device, and the video receiving window is used to display the video frames received from the opposite electronic device.
- When the electronic device enters the initial stage of video capture, the user's hands are not placed on the keyboard for typing. The electronic device captures a third video frame that does not contain fingers and outputs it, as shown in Figure 8a. At the same time, the electronic device determines whether the third video frame contains the user's finger, which can be determined by the semantic segmentation model introduced in the embodiment of the present invention: if the semantic segmentation model determines that a finger area exists, the third video frame is deemed to contain a finger; otherwise, it is deemed not to contain a finger. If it is determined that the third video frame does not contain a finger, the third video frame is stored as a background frame, as shown in Figure 8b.
- When the electronic device collects the first video frame (as shown in Figure 8c), it determines whether the first video frame contains content that meets the preset finger model: the first video frame is input to the semantic segmentation model, which finally determines the finger area, shown as 90 in Figure 8d. Once the finger area is determined, the first video frame is deemed to contain content that meets the preset finger model. The replacement content corresponding to the finger area is then determined from the background frame, as shown in the figure; the finger area in the first video frame is covered by the replacement content to obtain a second video frame (as shown in Figure 8f), and finally the second video frame is output.
- That is, the method further includes: obtaining a third video frame, the third video frame being a video frame collected before the first video frame; if it is determined that the third video frame does not contain content that conforms to the preset finger model, using the third video frame as the background frame; and determining, in the background frame, the content corresponding to the finger area in the first video frame as the replacement content. This step can be executed by the processor.
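A sketch of this first removal method: the last finger-free frame is kept as the background frame, and its pixels are pasted over the finger area of later frames. `mask` is assumed to be None when no finger content is detected.

```python
import numpy as np

class BackgroundReplacer:
    """Keep a finger-free background frame and use it as replacement content."""
    def __init__(self):
        self.background = None

    def process(self, frame, mask):
        if mask is None:                      # finger-free: store as background frame
            self.background = frame.copy()
            return frame
        if self.background is None:           # no background yet: output unchanged
            return frame
        out = frame.copy()
        idx = mask.astype(bool)
        out[idx] = self.background[idx]       # cover the finger area with replacement
        return out
```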
- The second video frame may contain a specific object, which is often a stationary object in the background area, such as a teacup or a pen. With this method, the coordinates of the specific object in the second video frame are the same as the coordinates of the specific object in the third video frame (or the offset is less than a preset offset, such as 10 or 20 pixels); or the size of the specific object in the second video frame is the same as, or similar to, the size of the specific object in the third video frame (for example, the difference is within 5% or 10%).
- At the start of video communication, if the user places a finger on the keyboard before any finger-free video frame has been collected for use as a background frame, the electronic device outputs a video frame containing the user's finger at this time.
- The second method is to crop the first video frame to obtain a second video frame that does not include the finger area.
- After the electronic device collects a video frame, it can first determine whether the video frame contains content that meets the preset finger model. If not, it outputs the video frame directly: as shown in Figure 9, after the device collects the third video frame, it directly outputs the third video frame. If it is determined that the video frame contains content that meets the preset finger model (for example, as shown in Figure 9c), a cropping frame 91 that does not contain the finger is determined based on the collected video frame (for example, the first video frame), and after the first video frame is cropped by the cropping frame, a second video frame that does not contain the finger is obtained (as shown in Figure 9d).
- The cropping frame can be determined in a variety of ways. For example: assuming the lower left corner of the video frame is the origin, determine the maximum Y-axis value of the finger area in the video frame, use that maximum value as the bottom cropping edge, and use the top of the video frame as the upper cropping edge; the cropping ratio is determined from the heights of the upper and lower cropping edges ((the maximum Y value minus the minimum Y value) / the height of the first video frame). Then determine the center position of the person in the video frame, extend the center position to the left by a first preset distance (1/2 × cropping ratio × the width of the first video frame) to determine the left border, extend the center position to the right by a second preset distance (1/2 × cropping ratio × the width of the first video frame) to determine the right border, and determine the cropping frame accordingly. In the specific implementation process, the cropping frame may also be determined in other ways, which the embodiment of the present invention does not enumerate or limit.
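A sketch of the cropping computation in image coordinates (row 0 at the top, so the "Y-axis maximum" of the text corresponds to the smallest finger row index). `person_cx` is assumed to come from a separate person or face detector.

```python
import numpy as np

def crop_out_finger(frame, mask, person_cx):
    """Crop the first video frame so the finger area falls outside the result."""
    h, w = mask.shape
    finger_rows = np.where(mask.any(axis=1))[0]
    if finger_rows.size == 0:
        return frame
    bottom = int(finger_rows.min())           # bottom cropping edge (above the finger)
    ratio = bottom / float(h)                 # cropping ratio: kept height / frame height
    half = int(0.5 * ratio * w)               # 1/2 * cropping ratio * frame width
    left = max(0, int(person_cx) - half)      # extend left of the person's center
    right = min(w, int(person_cx) + half)     # extend right of the person's center
    return frame[0:bottom, left:right]
```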
- With this method, the offset of the coordinates of the specific object in the second video frame relative to its coordinates in the third video frame may be greater than the preset offset; or the difference between the size of the specific object in the second video frame and its size in the third video frame may be greater than the preset value.
- The third method is to fill the finger area with pixels from the vicinity of the finger area to obtain the second video frame.
- With this method, the coordinates of the specific object in the second video frame are the same as, or differ little from (the offset is less than the preset offset), the coordinates of the specific object in the third video frame; and the size of the specific object in the second video frame is the same as, or differs little from, the size of the specific object in the third video frame.
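For this third method, OpenCV's inpainting is one plausible way to fill the finger area from nearby pixels; the 3-pixel radius below is an illustrative choice, not a value from the text.

```python
import cv2

def fill_finger_area(frame, mask):
    """Fill the finger area using pixels from its vicinity (inpainting)."""
    return cv2.inpaint(frame, mask * 255, 3, cv2.INPAINT_TELEA)
```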
- In an optional embodiment, after the display is controlled to display the second video frame based on S610, in response to determining that a fourth video frame contains content that conforms to the preset finger model and that the finger is in an abnormal state, at least one transition frame is determined from the second video frame and the finger area of the fourth video frame; the display is controlled to display the at least one transition frame; and after the at least one transition frame is displayed, the display is controlled to display the fourth video frame. This step can be executed by the processor.
- For example, the user first types with one hand, and the electronic device collects the first video frame. Since the first video frame contains content that conforms to the preset finger model and the finger area is not in an abnormal state, the electronic device outputs the second video frame, which does not contain the finger. The user then places the other hand on the touchpad, and the electronic device captures the fourth video frame, which still contains content that conforms to the preset finger model; however, the distance between the two connected areas of the finger area is greater than the preset distance threshold, so the finger area is determined to be in an abnormal state. At this time the fourth video frame could be output directly, but switching directly from the second video frame to the fourth video frame would make the fingers of a typing hand appear suddenly, which would make the picture abrupt.
- Therefore, at least one transition frame is added between the second video frame and the fourth video frame to achieve a smooth transition from the second video frame to the fourth video frame. If there is one transition frame, the weights of the second video frame and the fourth video frame are, for example, 0.5 each. If there are multiple transition frames, the weight of the second video frame is gradually reduced while the weight of the fourth video frame is gradually increased.
- For example, the weights of the second video frame and the fourth video frame are (0.8, 0.2) in the first transition frame, (0.6, 0.4) in the second, (0.5, 0.5) in the third, (0.4, 0.6) in the fourth, (0.2, 0.8) in the fifth, and so on. The above weights are only examples, not limitations. For the background area outside the finger area, each transition frame displays the content of the background area of the fourth video frame; the content of the finger area in each transition frame is determined by the above method.
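A sketch of the transition: inside the finger area the second and fourth video frames are blended with shifting weights, while the background area already shows the fourth frame's content. The weight schedule is the example sequence given above.

```python
import cv2

def make_transition_frames(second, fourth, mask4,
                           weights=((0.8, 0.2), (0.6, 0.4), (0.5, 0.5),
                                    (0.4, 0.6), (0.2, 0.8))):
    """Return transition frames that fade from the second to the fourth frame."""
    idx = mask4.astype(bool)
    frames = []
    for w2, w4 in weights:
        blend = cv2.addWeighted(second, w2, fourth, w4, 0.0)
        t = fourth.copy()                     # background area: fourth frame content
        t[idx] = blend[idx]                   # finger area: weighted blend
        frames.append(t)
    return frames
```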
- In an optional embodiment, a prompt message can also be generated to remind the user that the finger in the video frame has been removed and to ask whether the finger needs to be retained. As shown in Figure 10, a prompt box 100 is displayed over the second video frame; the prompt box 100 prompts "The finger has been removed from the video chat; please confirm whether to continue removing." If the user wants to continue removing the finger, the user clicks the confirmation button 110.
- In another optional embodiment, after the electronic device starts the video communication (or captures the user's finger), it can ask the user whether to remove the finger. If the user's confirmation operation is detected, the finger in the video frame is removed; otherwise, the finger in the video frame is not removed.
- After the electronic device collects the first video frame and determines by detection that the video frame contains content that conforms to the preset finger model (that is, a finger), a prompt box 130 is generated with the prompt "A finger has been detected in the video; please confirm whether to remove it." If the user clicks the confirmation button 140, the electronic device removes the finger in the first video frame and displays the second video frame, as shown in Figure 11b; if the user clicks the cancel button 150, the electronic device does not remove the finger in the first video frame but directly outputs the first video frame, as shown in Figure 11c.
- The above scheme can be applied at the video capture end of the video communication process: the electronic device performs finger removal on the captured video frames, so that the video frames sent to the peer electronic device have the finger removed. The above scheme can also be applied at the receiving end of the video communication process: the peer electronic device receives video frames in which the finger has not been removed, and then outputs the video frames after processing them with the video processing method introduced in the embodiment of the present invention. In this case, the video frames in the above steps are not collected video frames but video frames received from the opposite electronic device.
- Embodiments of the present invention provide an electronic device, including: one or more processors; one or more memories; multiple application programs; and one or more computer programs, wherein the one or more computer programs are stored in the one or more memories and include instructions that, when executed by the one or more processors of the electronic device, cause the electronic device to perform the following steps: obtain a first video frame and obtain a keyboard input signal; determine that the first video frame contains content that conforms to a preset finger model and that the time of the obtained keyboard input signal and the time of obtaining the first video frame satisfy a preset time threshold, and then remove the finger in the first video frame to obtain a second video frame; and indicate that the second video frame is displayed.
- Indicating that the second video frame is displayed includes, for example: (1) the second video frame is displayed through the display unit of the electronic device; (2) the second video frame is sent to another display unit for display, either because the electronic device does not have a display unit or because the other display unit is more suitable for display (for example, its display area is larger); (3) the second video frame is sent to the opposite electronic device of the video communication and displayed by the opposite electronic device.
- The embodiment of the present invention provides a method for training a semantic segmentation model. Semantic segmentation is classification at the pixel level: the pixels of each object are assigned to one category (for example, pixels belonging to a person form one category, pixels belonging to a motorcycle form one category, pixels belonging to a puppy form one category, and so on), and background pixels are also assigned to a category of their own.
- The method includes the following steps:
- S1200: Data collection and labeling. Collect photos of the user when typing (or other photos containing the user's hand) and label the mask of the finger (or other preset object) area for training the semantic segmentation model.
- S1210: Model design. The semantic segmentation model is, for example, a convolutional neural network model or a conditional random field model. In an optional embodiment, a dual-branch convolutional neural network model may be used, where the two branches include, for example, a semantic feature branch and an edge feature branch. The semantic feature branch is used to extract the semantic features of the image, where a semantic feature refers to the specific object represented by a pixel, such as a face or a finger; the edge feature branch is used to extract the texture features of the image, where texture features refer to features such as edge information, shape information (for example, corners), and color.
- S1220: Model training. Train and update the parameters of the semantic segmentation model based on the photos labeled in S1200 and the semantic segmentation model designed in S1210, to obtain a semantic segmentation model for finger recognition.
- The input of the model is a picture whose pixel values are normalized to between 0 and 1 (in the calculation process, the picture is a multi-dimensional array stored in memory in a specific order). The semantic feature branch can use a deep convolutional neural network to extract semantic features and determine the finger area, while the edge feature branch can use a shallow neural network to extract texture features and ensure the accuracy of finger edge segmentation. That is, the dual-branch convolutional neural network model extracts semantic features through a deep convolutional neural network and texture features through a shallow neural network; a feature fusion network then performs feature fusion on the extracted semantic features and edge features of each pixel to obtain comprehensive features. The final classification layer takes the comprehensive features as input, calculates the confidence that each pixel belongs to the finger area, and then determines whether each pixel belongs to the finger area. (In the specific implementation, the semantic segmentation model is a collection of operations connected in a specific form; each operation is parameterized by different values, and executing an operation means performing matrix operations on its own parameter values and input arrays and outputting the calculated result arrays.) The output is then compared with the labeled finger area, the difference is measured by a loss function, the loss is propagated back through the semantic segmentation model, and the parameters of the semantic segmentation model are updated.
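For readers who want the dual-branch design in code, here is a compact PyTorch sketch. The channel counts, depths, and loss function are editorial assumptions chosen to illustrate the structure described above (deep semantic branch, shallow edge branch, feature fusion, per-pixel classifier, backpropagated difference loss); they are not the patent's actual network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualBranchSeg(nn.Module):
    def __init__(self):
        super().__init__()
        # deep semantic branch: downsamples, captures what each pixel represents
        self.semantic = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())
        # shallow edge branch: full resolution, preserves edge/texture detail
        self.edge = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        # feature fusion + per-pixel classifier (finger / not finger)
        self.fuse = nn.Conv2d(64 + 16, 32, 3, padding=1)
        self.classify = nn.Conv2d(32, 1, 1)

    def forward(self, x):
        s = self.semantic(x)
        s = F.interpolate(s, size=x.shape[2:], mode='bilinear',
                          align_corners=False)        # back to input resolution
        e = self.edge(x)
        f = F.relu(self.fuse(torch.cat([s, e], dim=1)))
        return self.classify(f)                       # per-pixel finger logit

model = DualBranchSeg()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()                      # compare with the labeled mask

def train_step(images, masks):                        # images in 0..1, masks 0/1
    opt.zero_grad()
    loss = loss_fn(model(images), masks)              # difference loss
    loss.backward()                                   # propagate the loss back
    opt.step()                                        # update model parameters
    return loss.item()
```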
- An embodiment of the present invention also provides a method for recognizing the finger (or other preset object) region in an image (or video) based on the semantic segmentation model. Please refer to FIG. 13, which includes the following steps:
- S1300 Model freezing. Specifically, after training of the semantic segmentation model is completed, the obtained model is used for finger recognition and its parameters no longer change.
- S1310 Data preprocessing. This step obtains the current video frame and performs normalization preprocessing of the data, for example normalizing the pixel values of the video frame to between 0 and 1.
- S1320 Model inference. The model inference may include the following steps: S1400 Input an image, that is, input the image whose pixel values were normalized to between 0 and 1 in S1310 into the semantic segmentation model.
- S1410a Extract the semantic features of the image through the semantic feature branch of the semantic segmentation model, where the semantic branch is, for example, a deep convolutional neural network. The image may also be reduced in size, for example by a factor of 4 or 5.
- S1410b Extract texture features through the edge feature branch, which is, for example, a shallow convolutional neural network.
- S1430 Input the comprehensive features (obtained by fusing the semantic features and the texture features) into the classifier to obtain the finger area mask. The final classifier takes the comprehensive features as input, calculates for each pixel the confidence that it belongs to the finger area and the confidence that it does not, judges accordingly whether each pixel belongs to the finger area, and thereby obtains the mask of the finger area.
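- A hedged sketch of these inference steps follows; it reuses the illustrative `DualBranchSegNet` from the training sketch above and is not the embodiment's actual pipeline.

```python
# S1300: freeze the trained model; S1310: normalize the current frame to
# [0, 1]; S1320: run the two branches and the classifier; S1430: threshold
# the per-pixel finger confidence into a binary finger-area mask.
import numpy as np
import torch

model = DualBranchSegNet()   # in practice, load the trained ("frozen") weights
model.eval()                 # S1300: parameters no longer change

frame = np.random.randint(0, 256, (240, 320, 3), dtype=np.uint8)  # current frame
x = torch.from_numpy(frame).permute(2, 0, 1).float().unsqueeze(0) / 255.0  # S1310

with torch.no_grad():
    logits = model(x)                                   # both branches, fused
    prob = torch.softmax(logits, dim=1)[0, 1]           # "finger" confidence per pixel
    mask = (prob > 0.5).numpy().astype(np.uint8) * 255  # S1430: finger-area mask
```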
- Figure 15A is a photo of the user captured by the front camera of the electronic device, where the front camera is set at the bottom of the display or on the keyboard, so the image captured by the front camera contains fingers. As shown in Figure 15B, the white areas are the finger area mask output by the semantic segmentation model, indicating areas that may be fingers.
- Each white area is called a connected area. Here, the finger area mask determined by the semantic segmentation model includes five connected areas (the number of connected areas determined differs from picture to picture, which the embodiment of the present invention does not limit): connected area 61, connected area 62, connected area 63, connected area 64, and connected area 65. The outer frame of a connected area is called its circumscribed rectangle, and the mask of the finger area may also include the empirical position 66 of the face frame.
- Connected areas 61, 62, and 63 are called non-bottom connected areas because they are not in contact with the bottom of the picture, while connected areas 64 and 65 are called bottom connected areas. In the normal typing state, the areas where the user's hands are located usually belong to the bottom connected areas.
- The finger area mask can be taken directly as the finger area, or noise reduction can first be performed on the finger area mask and the noise-reduced mask used as the finger area.
- The position of the face frame can be determined by performing face recognition on the image, with the recognized face frame taken as the empirical position of the face frame. It is also possible to use an empirical area in the center of the screen of the electronic device as the empirical area of the face frame, or to recognize a large number of chat videos, analyze the position of the face frame in them, and take the aggregated result as the empirical position of the face frame.
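- One possible way to obtain the face frame (an assumption for illustration, not the embodiment's stated method) is OpenCV's bundled Haar cascade; the detected box, or a fallback area around the image centre, can then serve as the empirical position of the face frame used in the filtering below.

```python
# Detect a face box with OpenCV's Haar cascade; file name is hypothetical.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
frame = cv2.imread("current_frame.png")
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

if len(faces) > 0:
    x, y, w, h = faces[0]                  # recognized face frame
    face_center = (x + w // 2, y + h // 2)
else:
    h_img, w_img = gray.shape              # empirical fallback: image centre
    face_center = (w_img // 2, h_img // 2)
```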
- One or more of the following methods can be used to filter out the noise areas (a code sketch follows this list):
- ① Erode and dilate the binary image of the mask to filter out noise holes of very small area, for example holes smaller than 1 pixel, 2 pixels, or 3 pixels, which are often noise at the edge of the finger.
- ② Find all connected areas of the binary image of the mask and preliminarily filter out connected areas whose area is less than a preset threshold, for example 10 pixels or 20 pixels.
- ③ Filter out connected areas whose circumscribed rectangle is larger than a preset area threshold, for example 100 pixels or 200 pixels. Alternatively, calculate the area center of each connected area and determine the region in which the connected area lies based on the relationship between its center and the center of the empirical position of the face frame, for example whether the connected area is above, below, to the left of, or to the right of that center; set a different preset area threshold for each region, and then filter out connected areas whose circumscribed rectangle area is larger than the corresponding threshold.
- For example, if the connected area is to the left (or right) of the face center, the corresponding preset area threshold is, for example, 150 pixels, 200 pixels, or 220 pixels; if the connected area is above the face center, the threshold is, for example, 80 pixels or 100 pixels; and if the connected area is below the face center, the threshold is, for example, 300 pixels or 400 pixels. In other words, the preset area threshold for connected areas below the face center is larger than that for areas to the left (or right) of it, which in turn is larger than that for areas above it.
- ④ (The above three filtering steps can all be used, or only some of them, which the embodiment of the present invention does not limit.) Continue by determining whether the connected area overlaps the empirical area of the face frame. If there is overlap and the connected area does not belong to the bottom connected areas, it is determined that the current state is not typing and no subsequent processing is performed; if there is no overlap, or the connected area belongs to the bottom connected areas, proceed to step ⑤ for further judgment. It is also possible to directly determine that a connected area that does not overlap the face frame, or that belongs to the bottom connected areas, corresponds to the typing state, and to take that connected area as the finger area mask.
- ⑤ Calculate the aspect ratio of the connected area and filter out connected areas whose aspect ratio is below a set threshold, as well as inverted-triangle-shaped areas, because the finger area mask in the normal typing state has a relatively large width and height and generally does not have an inverted triangle shape. This step can also be performed before step ④, which is not limited in the embodiment of the present invention.
- ⑥ Calculate the ratio of the area of the remaining connected area to that of its circumscribed rectangle, and retain the connected area if the ratio is greater than a set threshold, for example 0.5 or 0.6.
- Steps ① to ⑥ can be executed in order; where there is no conflict, they can also be executed separately for each connected area, and whether the area belongs to the finger area mask is then judged based on the result of each step.
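- A minimal OpenCV sketch of noise-reduction steps ① to ③ above: a morphological open (erode, then dilate) removes tiny noise holes, and connected-component statistics drop regions that are too small or whose circumscribed rectangle exceeds an area threshold. The threshold values are the examples from the text; in the full method the rectangle threshold would vary with the region's position relative to the face frame.

```python
import cv2
import numpy as np

mask = cv2.imread("finger_mask.png", cv2.IMREAD_GRAYSCALE)  # hypothetical mask
kernel = np.ones((3, 3), np.uint8)
mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)       # step 1: erode + dilate

n, labels, stats, centroids = cv2.connectedComponentsWithStats(mask, connectivity=8)
clean = np.zeros_like(mask)
for i in range(1, n):                    # label 0 is the background
    x, y, w, h, area = stats[i]
    if area < 20:                        # step 2: too-small regions
        continue
    if w * h > 200:                      # step 3: oversized circumscribed rectangle
        continue
    clean[labels == i] = 255             # keep the candidate finger region
```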
- The remaining connected areas after the above noise reduction operation are shown in FIG. 15C: connected areas 61, 62, and 63 have been filtered out, and only connected areas 64 and 65 remain as real finger areas.
- The above semantic segmentation model training method can also be used to recognize other objects in the image, such as trash cans, backgrounds, ashtrays, folders, stylus pens, palms, arms, and so on, and likewise to identify the areas where these objects are located. For a trash can, the training samples are images of trash cans with the trash can area annotated; for a stylus, they are images of styluses with the stylus area annotated; and so on.
- Besides the semantic segmentation model used above for foreground segmentation (that is, for determining the area of the finger), other foreground segmentation methods can also be used, such as foreground segmentation based on frame difference, foreground segmentation based on motion shape, and so on.
- Using the aforementioned semantic segmentation model, however, segments the finger area in the current video frame more accurately, is not affected by the constantly moving body of the typist, and is fast enough to meet real-time requirements.
- the embodiment of the present invention also provides an image processing method, which is used to determine the replacement content based on the background area of the video frame. Please refer to FIG. 16, including the following steps:
- S700 Perform motion offset estimation based on the background frame and the current frame to obtain the motion offset matrix of the background frame relative to the current frame (in a specific implementation, the motion offset matrix of the current frame relative to the background frame can also be calculated).
- In the initial stage, if the image collected during the video chat does not contain a typing finger, it is used as the background frame; if it does contain a typing finger, the background frame is not determined until an image that does not contain the typing finger is collected, and that image is taken as the background frame.
- Specifically, the feature points of the current frame and the background frame can be detected first, the detected feature points of the two frames matched to find paired feature points, and the perspective transformation matrix then calculated from the paired feature points. The perspective transformation matrix is a matrix that characterizes the amount of motion of the background frame relative to the current frame.
- The feature points of the background frame and the current frame can be determined with feature point detection algorithms such as SIFT (Scale-Invariant Feature Transform), SURF (Speeded Up Robust Features), or ORB (Oriented FAST and Rotated BRIEF), and the feature points that match between the background frame and the current frame can be determined with feature point matching algorithms such as BF (brute-force matching) or FLANN (Fast Library for Approximate Nearest Neighbors).
- Perspective transformation is a mapping relationship between two images: through matrix multiplication, a point in one image is mapped to the corresponding point in another image. It includes perspective matrix calculation and coordinate mapping.
- The perspective transformation matrix is a 3×3 matrix H, set as:

  H = [ h11 h12 h13 ; h21 h22 h23 ; h31 h32 h33 ]

- H is obtained by solving a system of linear equations over the paired feature points and expresses the estimate of the motion offset from image A to image B. For each point (X1, Y1) in image A, the corresponding coordinates (X2, Y2) in the view plane of image B can be calculated by the matrix multiplication [x', y', w]ᵀ = H · [X1, Y1, 1]ᵀ, with X2 = x'/w and Y2 = y'/w.
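- A hedged OpenCV sketch of S700 follows: detect ORB feature points in both frames, brute-force match them, and estimate H from the paired points (SIFT/SURF detection or FLANN matching could be substituted, as the text notes). File names are hypothetical.

```python
import cv2
import numpy as np

background = cv2.imread("background_frame.png", cv2.IMREAD_GRAYSCALE)
current = cv2.imread("current_frame.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=1000)
kp1, des1 = orb.detectAndCompute(background, None)  # feature points, background frame
kp2, des2 = orb.detectAndCompute(current, None)     # feature points, current frame

matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)  # BF: brute-force matching
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)[:100]

src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)  # motion offset matrix H

# Mapping a point (X1, Y1) of the background frame into the current frame:
X1, Y1 = 10.0, 20.0
p = H @ np.array([X1, Y1, 1.0])
X2, Y2 = p[0] / p[2], p[1] / p[2]
```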
- S710 Perform motion compensation on the background frame according to the amount of motion determined in the previous step to obtain a compensated frame (that is, the background frame after motion compensation).
- The purpose of this process is to align the current frame and the background frame, eliminating the fragmentation of the human body region that human motion would otherwise cause in the filled image (in a specific implementation, motion compensation can instead be performed on the current frame based on the amount of motion, which is not limited in the embodiment of the present invention).
- S720 Calculate the background area for filling based on the motion-compensated background frame (i.e., the compensation frame) and the finger area mask, and use the content/image of that background area to fill/replace the finger area mask of the current frame. Specifically, the image at the location of the finger area mask can be taken from the compensated background frame as the background area for filling/replacement, and that background area then overlaid on the finger area mask of the current frame, thereby filling in the finger area mask of the current frame.
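- A minimal sketch of S710 and S720, continuing the homography sketch above: warp the background frame onto the current frame's view plane (motion compensation), then copy the warped background into the finger-area mask of the current frame. H is the matrix from the previous sketch; an identity placeholder is used here so the snippet runs standalone.

```python
import cv2
import numpy as np

background = cv2.imread("background_frame.png")             # hypothetical frames
current = cv2.imread("current_frame.png")
mask = cv2.imread("finger_mask.png", cv2.IMREAD_GRAYSCALE)  # noise-reduced mask

H = np.eye(3)                                               # placeholder for S700's H
h_img, w_img = mask.shape
compensated = cv2.warpPerspective(background, H, (w_img, h_img))  # S710

filled = current.copy()
filled[mask > 0] = compensated[mask > 0]                    # S720: fill finger area
```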
- S730 Use an ambient light rendering method to render the filled area and the surrounding background area so that the brightness of the picture is consistent, eliminating differences in brightness between adjacent video frames caused by the hardware. Step S730 is an optional step.
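- One simple substitute for S730 (an assumption for illustration, not the embodiment's ambient-light rendering) is to scale the luminance of the filled region toward the mean luminance of a surrounding band of background, so that the brightness of the picture stays consistent:

```python
import cv2
import numpy as np

filled = cv2.imread("filled_frame.png")                     # hypothetical inputs
mask = cv2.imread("finger_mask.png", cv2.IMREAD_GRAYSCALE)

ycc = cv2.cvtColor(filled, cv2.COLOR_BGR2YCrCb).astype(np.float32)
ring = cv2.dilate(mask, np.ones((15, 15), np.uint8)) & ~mask   # band around the fill
gain = ycc[..., 0][ring > 0].mean() / max(ycc[..., 0][mask > 0].mean(), 1e-6)
ycc[..., 0][mask > 0] *= gain                               # equalize luminance
out = cv2.cvtColor(np.clip(ycc, 0, 255).astype(np.uint8), cv2.COLOR_YCrCb2BGR)
```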
- The output frame is obtained based on the above processing and is used as the final video frame for video output.
- S740 Use the output frame obtained in step S730 as the new background frame, updating the background frame.
- the embodiment of the present invention provides a video communication method. Please refer to FIG. 8.
- the video communication method includes the following steps:
- S800 Obtain video frames in front of the display screen through the front camera.
- This solution can be applied to any electronic device with a video communication function, and the electronic device may have a built-in or an external camera. The camera is set under the display of the electronic device, or set near an input device of the electronic device, such as the keyboard, mouse, or touchpad. For example, the electronic device is a notebook computer, and the camera is set on the keyboard of the notebook computer or under its display screen.
- user A of the electronic device wants to start a video chat with another user B.
- User A opens the instant chat application of the electronic device, then opens the chat interface with user B, and then clicks the "video call" button.
- When the device detects this operation of user A, it establishes a video communication connection with user B's electronic device and turns on the camera of the electronic device to capture the video sent to user B. The obtained video contains video frames such as the one shown in Figure 9A.
- Alternatively, the user opens the contact interface of the electronic device 100, selects contact B, and then clicks the video call button; when the electronic device detects this operation of user A, it establishes a video communication connection with user B's electronic device.
- the front camera is turned on by default for video communication, but based on user A's selection operation or setting operation, the electronic device may also turn on the rear camera, which is not limited in the embodiment of the present invention.
- Another embodiment of the present invention provides a video processing method. Please refer to FIG. 17, which includes the following steps:
- S810 Read the current keyboard signal through the keyboard signal reading device of the electronic device, and use the current keyboard signal to determine whether there is an input signal. If there is an input signal, it is determined that a finger is typing; if there is no input signal, it is determined that no finger is typing.
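- A hedged illustration of this check: typing is assumed when the most recent key event and the frame capture time fall within a preset time threshold. The threshold value and the helper are assumptions; the embodiment does not specify a driver interface.

```python
import time

TYPING_WINDOW_S = 0.5   # preset time threshold (an assumed example value)

def is_typing(last_key_event_time: float, frame_time: float) -> bool:
    """True when the frame was captured close enough in time to a key event."""
    return abs(frame_time - last_key_event_time) <= TYPING_WINDOW_S

now = time.time()
print(is_typing(now - 0.2, now))   # key pressed 0.2 s before the frame -> True
```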
- In the initial stage, if the image collected during the video chat does not contain a typing finger, it is used as the background frame; if it does contain a typing finger, the background frame is not determined until an image that does not contain the typing finger is collected, and that image is taken as the background frame.
- S820a Determine whether there is a finger area mask through the semantic segmentation model; how it is determined is described above and will not be repeated here. The obtained finger area mask is, for example, shown at 90 in FIG. 18B.
- There is no fixed execution order between S820a and S810. In an optional embodiment, the processing of S820a is performed only when typing is determined, which reduces the amount of data processing of the electronic device.
- The finger area can also be covered with other pictures, or the finger area mask can be filled with the background area of the current image frame; the embodiments of the present invention do not list the options in detail and do not limit them.
- S850 Take the video frame with the finger in the finger area mask removed as a new video frame and output it; the new video frame can be transmitted to user B's electronic device for display, or displayed on user A's electronic device.
- FIG. 19 Another embodiment of the present invention provides a video communication method. Please refer to FIG. 19.
- the method includes the following steps:
- S1000 Obtain the video in front of the display screen through the front camera.
- The specific collection process is similar to that of S800 and will not be repeated here.
- S1010a After obtaining the video collected by the front camera, determine the area where the face is located;
- The area where the face is located can be identified by face recognition technology. In an optional embodiment, an empirical area in the center of the screen of the electronic device can instead be used as the empirical area of the face frame; or a large number of chat videos (or self-portraits) can be recognized, the position of the face frame in them analyzed, and the aggregated result taken as the empirical area of the face frame.
- S1010b Determine the finger region mask through the semantic segmentation model. How to determine the mask has been described above, and will not be repeated here. There is no order of execution between this step and S1010a.
- S1020 Determine whether the user in the current frame is typing through the mask of the face area and the finger area.
- step S1010a is an optional step.
- Step S1020 in FIG. 19 may include the following steps:
- S1100 Perform an erosion and dilation operation on the finger area mask to filter out noise holes of very small area, such as holes smaller than 1 pixel, 2 pixels, or 3 pixels, which are often noise at the edges of the fingers.
- S1110 Find all connected regions of the binary image of the mask and preliminarily filter out connected regions whose area is smaller than a preset threshold, for example 10 pixels or 20 pixels.
- S1120 Preliminarily screen the connected areas: calculate the area center of each connected area and determine the region in which it lies according to the relationship between its center position and the center position of the face frame, for example whether the connected area is above, below, to the left of, or to the right of the face center; then screen the connected areas against the circumscribed-rectangle area threshold of each region, filtering out connected areas whose circumscribed rectangle area is larger than the corresponding preset area threshold.
- For example, if the connected area is to the left (or right) of the face center, the corresponding preset area threshold is, for example, 150 pixels, 200 pixels, or 220 pixels; if the connected area is above the face center, the threshold is, for example, 80 pixels or 100 pixels; and if the connected area is below the face center, the threshold is, for example, 300 pixels or 400 pixels. In other words, the preset area threshold for connected areas below the face center is larger than that for areas to the left (or right) of it, which in turn is larger than that for areas above it.
- S1130 is then executed: determine whether the user in the current frame is typing through the filtered finger area mask and the area where the face is located. The judgment method is similar to the previous description and will not be repeated here. If it is determined based on S1130 that the user is not typing, a non-typing status code can be returned; if it is determined that the user is typing, a typing status code can be returned, or S1140 can be executed: refine the screening of the connected areas, which can include the following methods (sketched in code after the list):
- Method 1: Calculate the aspect ratio of the connected area, and filter out connected areas whose aspect ratio is less than a threshold (for example 0.5, 0.7, or 0.8) as well as inverted-triangle-shaped connected areas, because the width and height of the finger area mask in the normal typing state are relatively large and the mask generally does not form an inverted triangle.
- Method 2: Calculate the ratio of the area of each remaining connected area to that of its circumscribed rectangle. If the ratio is greater than a set threshold (for example 0.5 or 0.6), return typing status information to confirm that the current user is in the typing state, and take the finally screened finger area mask as the effective mask area, that is, the area where the effective fingers are located. In this step it is also possible to directly judge whether the area of the corresponding connected area is larger than a preset area (for example 60,000 pixels or 70,000 pixels); if it is, return typing status information to confirm that the current user is in the typing state, and take the finally screened finger area mask as the effective mask area, that is, the area where the effective fingers are located.
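- A sketch of the refined screening S1140: Method 1 drops connected areas whose width/height ratio is below a threshold, and Method 2 keeps areas whose fill ratio against their circumscribed rectangle exceeds a threshold. Threshold values are the examples from the text; the input file name is hypothetical.

```python
import cv2
import numpy as np

mask = cv2.imread("finger_mask.png", cv2.IMREAD_GRAYSCALE)
n, labels, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)

effective = np.zeros_like(mask)
typing = False
for i in range(1, n):                 # label 0 is the background
    x, y, w, h, area = stats[i]
    if w / h < 0.5:                   # Method 1: too narrow / inverted-triangle-like
        continue
    if area / float(w * h) > 0.5:     # Method 2: region fills its bounding rectangle
        typing = True
        effective[labels == i] = 255  # effective mask area (effective fingers)
```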
- S1040 Output a new video frame, which can be output to the electronic device where user B is located as a video frame for video communication; it can also be output to the electronic device where user A is located (at the same time).
- step S1050 Update the background frame with a new video frame.
- the execution order of step S1040 and step S1050 can be interchanged.
- The steps for judging whether a finger is typing in S810 and S1020 can be used alternatively or in combination; for example, when the judgment cannot be made in S810, S1020 can be used.
- An embodiment of the present invention provides a video processing method, including the following steps:
- the electronic device can collect and obtain video frames.
- The electronic device can perform video capture after detecting the user's video shooting operation (for example, clicking the video shooting button, making a preset gesture, issuing a voice command, etc.); it can also perform video capture when it detects that the user is in video communication with a peer electronic device, and then send the collected video to the peer electronic device.
- the electronic device can also obtain video frames from other electronic devices or the network.
- the embodiments of the invention are not listed in detail, and are not limited.
- When the video frame is captured by the electronic device itself, it can be a video frame captured by the front camera, a video frame captured by the rear camera, or a video frame obtained by merging frames captured by the front and rear cameras. It can also be a video frame collected by another camera connected to the electronic device 100; for example, the electronic device 100 establishes a connection with one or more of a drone, a TV, and a desktop computer, and acquires video frames through these devices.
- When the electronic device detects the user's video communication operation (or video shooting operation), it generates a video communication instruction and sends it to the processor; the processor responds to the instruction by starting the video communication software, which sends instructions to the camera driver to control the camera for video capture. The camera driver sends the collected data to the finger occlusion processing component, which performs the subsequent operations.
- The preset object can be an object set by default by the system, or an object designated by the user. For example, when the user collects video frames through the electronic device 100, because the user holds the electronic device 100 or types on it by hand, the electronic device 100 captures fingers that block the camera lens, fingers that are typing, and so on, all of which are images the user does not want captured. Or, when the user finds trash cans, ashtrays, and the like (preset objects) in the picture while taking a photo, the user can manually select the trash can, ashtray, etc. on the screen, so that these objects are designated as preset objects.
- the preset objects in the video can be removed, so that the video chat can better meet the needs of users and can also protect the privacy of users.
- The preset objects are, for example, the user's hand, fingers, a trash can, an ashtray, and so on. The electronic device can automatically determine the preset object area through the semantic segmentation model (how the area where the preset object is located is automatically determined through the semantic segmentation model is described in conjunction with FIGS. 3-5); it can also receive a user selection operation and determine the area where the preset object is located based on that operation. For example, in the process of shooting a video, the user clicks on the ashtray in the picture; after the electronic device detects the user's click operation, it determines that the user wants to recognize the ashtray, and an image recognition algorithm is therefore used to identify the area where the clicked object is located.
- The replacement content may be the content of the background area corresponding to the video frame; how the content of the background area is determined is described with reference to FIG. 7. It may also be content other than the background area, such as other images (for example, emoticons or icons), the content of the preset object area after mosaicing the video frame, and so on.
- S2120 Fill the area where the preset object is located with the replacement content to remove/replace the preset object, where part or all of the content of the preset object can be removed.
- The area where the preset object is located can be filled by the image processing method introduced above, or directly covered with other objects, such as covering the ashtray area with an emoticon, mosaicing the ashtray area, and so on.
- For example, the preset object area in the video is covered with a preset icon (the preset icon can be a default icon or a randomly changing icon), and when the electronic device detects that the user edits the preset icon, it removes the icon or replaces it with another icon. Or, an edit button is displayed on the video capture interface; after the user clicks the edit button, various editing operations (such as filters, icons, and collages) are displayed; after the user clicks the icon option, various icons are displayed; and then, based on a specific operation of the user (for example, dragging an icon onto the preset object), the icon is overlaid on the preset object. As another example, various icons are displayed directly on the video capture interface, and an icon is overlaid on the preset object based on a specific operation of the user.
- The video frame after the replacement processing can be transmitted to another electronic device to be displayed there; it can also be displayed on the current electronic device for its user, or stored in the electronic device.
- The above solution can also be used for image capture: for example, after detecting the user's image capture operation, the preset object in the image is identified and then removed. The removal method can be the image processing method introduced above, or the object can be covered with other pictures; the embodiments of the present invention do not list the options in detail and do not limit them.
- the display interface of the electronic device displays an interface 220 for real-time chatting with another user.
- the interface displays a video communication button 220a and a voice communication button 220b.
- The user of the electronic device is user A, and user B is standing next to user A. User A's fingers are placed on the keyboard, and user B clicks the video communication button (220a) with the mouse (or user A issues a voice command);
- As shown in FIG. 22b, the display includes a video communication interface 221, which includes a video preview interface 221a and a video display interface 221b. The video preview interface 221a displays the video frames currently collected (or processed) by the electronic device (for example, user A's video frames), and the video display interface 221b displays the video frames of the user of the opposite electronic device.
- Because user A's fingers have been placed on the keyboard and the electronic device has not detected a background frame that does not contain fingers, the electronic device does not trigger the finger removal function and displays the video frames containing the fingers.
- The user places his fingers on the keyboard again to type, and the captured video frames contain the user's fingers; the fingers in the video frames are removed by the method described above, and video frames without fingers are output.
- A prompt box 222 is displayed on the video frame to prompt "The finger in the video chat has been removed, please confirm whether to continue removing". The prompt box 222 includes a confirmation button 222a and a cancel button 222b, and user B clicks the cancel button 222b with the mouse (or user A issues a voice command);
- User A starts the video communication function.
- The user's hands are placed on his knees, and the captured video frame does not contain the user's fingers, as shown in Figure 23a; in this case, the electronic device sets the video frame shown in Figure 23a as the background frame and outputs the video frame, as shown in Figure 23b.
- user A starts typing through the keyboard, and the video frame collected by the electronic device contains the user's finger, as shown in Figure 23c; the electronic device determines that the content of the collected video frame conforms to the preset finger model, and then generates a prompt message.
- The prompt information is, for example, text, voice, or an icon. As shown in Figure 23d, a prompt box 130 is displayed; the prompt box 130 shows "A finger has been detected in the video frame, please confirm whether to remove it", and also displays a confirmation button 140 and a cancel button 150.
- the user A wants to remove the finger and clicks the confirmation button 140, and the electronic device detects that the user has clicked the confirmation button 140.
- The user continues typing on the keyboard, and the electronic device collects video frames containing the typing fingers, as shown in Figure 23e. Because the user has confirmed the removal of the fingers, the electronic device removes the fingers in the video frames based on the method described above and outputs video frames that do not contain fingers, as shown in Figure 23f.
- Prompt information can also be generated on the video frame.
- The prompt information is used to inform the user that the finger removal state is currently active.
- The prompt information can last for a period of time (for example, 1 or 2 seconds) and then disappear, or it can be displayed for as long as the finger removal state lasts. A cancel button can also be generated, and the finger removal mode is exited in response to the user clicking the cancel button; the prompt information and the cancel button can also be integrated into a single prompt button 230.
- If the user does not wish to remove the fingers, then when a video frame containing fingers is collected again, as shown in FIG. 23g, there is no need to judge whether the video frame contains content that conforms to the preset finger model, and the video frame containing the fingers is output directly, as shown in FIG. 23h.
- embodiments of the present invention also provide a computer-readable storage medium, including instructions, which when run on an electronic device, cause the electronic device to execute the method described in any embodiment of the present invention.
- embodiments of the present invention provide a computer program product, the computer program product includes software code, and the software code is used to execute the method described in any embodiment of the present invention.
- an embodiment of the present invention provides a chip containing instructions, which when the chip runs on an electronic device, causes the electronic device to execute the method described in any embodiment of the present invention.
- An embodiment of the present invention provides an electronic device, including a keyboard and a camera arranged near the keyboard. The electronic device further includes: a first collection module, configured to collect a first video frame through the camera; a first determining module, configured to, when it is determined that the first video frame contains content that conforms to a preset finger model, remove the finger in the first video frame to obtain a second video frame; and a display module, configured to display the second video frame and/or send the second video frame to the opposite electronic device for display.
- The first determining module includes: a first determining unit, configured to remove the finger in the first video frame when it is determined that the first video frame contains content conforming to the preset finger model and that the finger is located in the bottom area of the first video frame; a second determining unit, configured to remove the finger in the first video frame when it is determined that the first video frame contains content conforming to the preset finger model and that the finger area does not overlap the position of the human face; and a third determining unit, configured to remove the finger in the first video frame when it is determined that the first video frame contains content conforming to the preset finger model and that the finger is located in the bottom area of the first video frame and connected to a side of the first video frame.
- Alternatively, the first determining module includes: an obtaining unit, configured to obtain a keyboard input signal; and a fourth determining unit, configured to remove the finger in the first video frame when it is determined that the first video frame contains content that conforms to the preset finger model and the time of the obtained keyboard input signal and the time of obtaining the first video frame meet a preset time threshold.
- The first determining module is configured to: replace the content of the finger area in the first video frame with replacement content to obtain the second video frame; or crop the first video frame to obtain a second video frame that does not include the finger area; or fill the finger area with pixels in the vicinity of the finger area to obtain the second video frame.
- The electronic device further includes: a second acquisition module, configured to acquire a third video frame before the first video frame is acquired; a second determining module, configured to use the third video frame as a background frame when it is determined that the third video frame does not contain content that meets the preset finger model; and a third determining module, configured to determine, from the background frame, the content corresponding to the finger area in the first video frame as the replacement content.
- The first determining module is configured to: when it is determined that the first video frame contains content that conforms to the preset finger model and the finger is not in an abnormal state, remove the finger in the first video frame to obtain the second video frame. The abnormal state corresponds to at least one of the following situations: the user's two hands in the first video frame are located at the bottom of the first video frame and the distance between the two hands is greater than a first preset distance; one of the user's hands in the first video frame is located in the bottom area and the other hand is farther than a preset distance from the bottom area; the area of the user's fingers in the first video frame is greater than a preset area threshold; the user's fingers in the first video frame cover the face.
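- A hedged sketch of the abnormal-state test listed above follows. Hand and face boxes are (x, y, w, h) tuples; every threshold value is an illustrative assumption, since the embodiment only names the conditions.

```python
def boxes_overlap(a, b):
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def is_abnormal(hand_boxes, finger_area, face_box, frame_h,
                hand_gap=400, bottom_margin=40, dist_thresh=120,
                area_thresh=70000):
    at_bottom = [b for b in hand_boxes if b[1] + b[3] >= frame_h - bottom_margin]
    if len(hand_boxes) == 2 and len(at_bottom) == 2:
        left, right = sorted(b[0] + b[2] / 2 for b in hand_boxes)
        if right - left > hand_gap:
            return True                  # both hands at the bottom, too far apart
    if len(hand_boxes) == 2 and len(at_bottom) == 1:
        other = next(b for b in hand_boxes if b not in at_bottom)
        if frame_h - (other[1] + other[3]) > dist_thresh:
            return True                  # one hand at the bottom, the other far from it
    if finger_area > area_thresh:
        return True                      # finger area exceeds the preset area threshold
    if face_box is not None and any(boxes_overlap(face_box, b) for b in hand_boxes):
        return True                      # fingers cover the face
    return False
```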
- The electronic device further includes: a fourth determining module, configured to send the first video frame to the display for display when it is determined that the first video frame contains content conforming to the preset finger model and that the finger is in an abnormal state.
- The first determining module is configured to: input the first video frame into a semantic segmentation model, determine the finger area in the first video frame through the semantic segmentation model, and, when the finger area exists, determine that the first video frame contains content that meets the preset finger model; the semantic segmentation model is obtained through training on sample photos, each of which contains the user's fingers and is annotated with the finger area.
- An embodiment of the present invention provides an electronic device, including: an obtaining module, configured to obtain a first video frame and obtain a keyboard input signal; a fifth determining module, configured to remove the finger in the first video frame to obtain a second video frame when it is determined that the first video frame contains content that conforms to a preset finger model and the time of the obtained keyboard input signal and the time of obtaining the first video frame meet a preset time threshold; and an instruction module, configured to instruct display of the second video frame.
- The fifth determining module is configured to: replace the content of the finger area in the first video frame with replacement content to obtain the second video frame; or crop the first video frame to obtain a second video frame that does not include the finger area; or fill the finger area with pixels in the vicinity of the finger area to obtain the second video frame.
- The electronic device further includes: a third acquisition module, configured to acquire a third video frame before the first video frame is acquired; a sixth determining module, configured to use the third video frame as a background frame when it is determined that the third video frame does not contain content that meets the preset finger model; and a seventh determining module, configured to determine, from the background frame, the content corresponding to the finger area in the first video frame as the replacement content.
- The fifth determining module is configured to: when it is determined that the first video frame contains content that conforms to the preset finger model, the finger is not in an abnormal state, and the time of the obtained keyboard input signal and the time of obtaining the first video frame meet the preset time threshold, remove the finger in the first video frame to obtain the second video frame. The abnormal state corresponds to at least one of the following situations: the user's two hands in the first video frame are located at the bottom of the first video frame and the distance between the two hands is greater than a first preset distance; one of the user's hands in the first video frame is located in the bottom area and the other hand is farther than a preset distance from the bottom area; the area of the user's fingers in the first video frame is greater than a preset area threshold; the user's fingers in the first video frame cover the face.
- The electronic device further includes: an eighth determining module, configured to instruct display of the first video frame when it is determined that the first video frame contains content that conforms to the preset finger model and that the finger is in an abnormal state.
- The fifth determining module is configured to: input the first video frame into a semantic segmentation model, determine the finger area in the first video frame through the semantic segmentation model, and, when the finger area exists, determine that the first video frame contains content that meets the preset finger model; the semantic segmentation model is obtained through training on sample photos, each of which contains the user's fingers and is annotated with the finger area.
- the above-mentioned electronic devices and the like include hardware structures and/or software modules corresponding to the respective functions.
- The embodiments of the present application can be implemented in the form of hardware or a combination of hardware and computer software. Whether a certain function is executed by hardware or by computer-software-driven hardware depends on the specific application and the design constraints of the technical solution. Those skilled in the art can use different methods for each specific application to implement the described functions, but such implementations should not be considered to go beyond the scope of the embodiments of the present invention.
- each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module.
- the above-mentioned integrated modules can be implemented in the form of hardware or software functional modules. It should be noted that the division of modules in the embodiment of the present invention is illustrative, and is only a logical function division, and there may be other division methods in actual implementation. The following is an example of dividing each function module corresponding to each function:
- the methods provided in the embodiments of the present application may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
- When implemented by software, the methods can be implemented in whole or in part in the form of a computer program product.
- the computer program product includes one or more computer instructions.
- the computer may be a general-purpose computer, a special-purpose computer, a computer network, a network device, an electronic device, or other programmable devices.
- the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium.
- The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (such as infrared, radio, or microwave) means.
- the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server or a data center integrated with one or more available media.
- the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, and a magnetic tape), an optical medium (for example, a digital video disc (DVD)), or a semiconductor medium (for example, SSD).
- the disclosed system, device, and method can be implemented in other ways.
- The device embodiments described above are merely illustrative; for example, the division of units is only a logical function division, and there may be other divisions in actual implementation: multiple units or components can be combined or integrated into another system, or some features can be ignored or not implemented.
- the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
- the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
- the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- Signal Processing (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Human Computer Interaction (AREA)
- User Interface Of Digital Computer (AREA)
Abstract
Disclosed are a video collection control method, an electronic device, a computer-readable storage medium, a computer program product and a chip, relating to the field of image processing. The electronic device comprises: a display, a keyboard, a camera and a processor, wherein the camera is arranged close to the keyboard, and is used for collecting a video frame during video communication, and sending the collected video frame to the processor; and the processor is connected to the display, the keyboard and the camera, and is used for receiving a first video frame from the camera, then removing, when it is determined that the first video frame includes content conforming to a preset finger model, a finger from the first video frame in order to obtain a second video frame, sending the second video frame to the display for displaying, and/or sending the second video frame to an electronic device at an opposite end for displaying. The technical problem in the prior art of the proportions of a finger being easily distorted during a video call process is solved. The method can be used for an artificial intelligence device. The method is related to technology such as deep learning.
Description
This application claims priority to Chinese patent application No. 201911315367.1, filed with the State Intellectual Property Office of China on December 19, 2019 and entitled "A video capture control method, electronic device, and computer-readable storage medium", the entire content of which is incorporated in this application by reference.
This application relates to the field of image processing, and in particular to a video capture control method, electronic device, computer-readable storage medium, computer program product, and chip.
With the popularization of electronic devices, the forms of electronic products have become more and more diversified. Taking the notebook computer as an example, the camera of a traditional notebook computer is usually located at the top of the machine; in recent years, however, for the protection of personal privacy, many notebook computer cameras have been set below the screen or placed at the top of the keyboard as hidden cameras. As shown in FIG. 1, the camera 11 is set near the keyboard 10 of the notebook computer 1. The low camera angle, however, means that if the user types during a video call, the fingers are likely to block the camera's field of view, so that the proportions of the fingers presented on the screen are distorted; as shown in FIG. 2, this produces an octopus-like effect.
Summary of the invention
The video communication method, video capture method, and electronic device provided by this application avoid distortion of finger proportions during video communication and improve the quality of video communication.
In a first aspect, an embodiment of the present invention provides an electronic device, including a display, a keyboard, a camera, and a processor. The camera is arranged near the keyboard and is used to collect video frames during video communication and send the collected video frames to the processor. The processor is connected to the display, keyboard, and camera, and is used to receive a first video frame from the camera and, when it determines that the first video frame contains content that conforms to a preset finger model, remove the finger in the first video frame to obtain a second video frame, send the second video frame to the display for display, and/or send the second video frame to the opposite end electronic device for display. This solves the technical problem of finger distortion in the output video frames and prevents the video transmitted in video communication from being an incomplete picture blocked by fingers. Moreover, in this embodiment of the present invention, by determining whether the first video frame contains content that conforms to the preset finger model, the finger area is recognized automatically, the finger in the first video frame is removed automatically, and the second video frame is output without manual removal by the user, which improves the efficiency of finger removal; by intelligently recognizing the finger in real time and removing it automatically, the intelligence of human-computer interaction is improved without affecting the fluency of the video communication.
In an optional embodiment, removing the finger in the first video frame when it is determined that the first video frame contains content conforming to the preset finger model includes: determining that the first video frame contains content conforming to the preset finger model and that the finger is located in the bottom area of the first video frame, and removing the finger in the first video frame; or determining that the first video frame contains content conforming to the preset finger model and that the finger area does not overlap the position of the face, and removing the finger in the first video frame; or determining that the first video frame contains content conforming to the preset finger model and that the finger is located in the bottom area of the first video frame and connected to a side of the first video frame, and removing the finger in the first video frame. Based on this solution, the finger is removed only when it is located in a specific region of the first video frame, so the automatic removal is more precise and the possibility of removing an undistorted finger is reduced.
Optionally, determining that the first video frame contains content conforming to the preset finger model and then removing the finger in the first video frame includes: obtaining a keyboard input signal; and, when it is determined that the first video frame contains content conforming to the preset finger model and the time of the obtained keyboard input signal and the time of obtaining the first video frame meet a preset time threshold, removing the finger in the first video frame. Because the detection of a keyboard input signal is one of the conditions for removing the finger, this solution ensures that the removed finger is a typing finger, thereby precisely removing the distorted finger and preventing the finger from blocking the video frame.
In a second aspect, an embodiment of the present invention provides an electronic device, including: one or more processors; one or more memories; multiple application programs; and one or more computer programs, where the one or more computer programs are stored in the one or more memories and include instructions which, when executed by the one or more processors of the electronic device, cause the electronic device to perform the following steps: obtain a first video frame and obtain a keyboard input signal; when it is determined that the first video frame contains content that conforms to a preset finger model and the time of the obtained keyboard input signal and the time of obtaining the first video frame meet a preset time threshold, remove the finger in the first video frame to obtain a second video frame; and instruct display of the second video frame. This solution solves the technical problem of finger distortion in the output video frames, prevents the video transmitted in video communication from being an incomplete picture blocked by fingers, and, by recognizing the finger area automatically and removing the finger without manual operation, improves the efficiency of finger removal and the intelligence of human-computer interaction without affecting the fluency of the video communication. Moreover, because the detection of a keyboard input signal is one of the conditions for removing the finger, the typing finger in the video frame can be removed precisely.
In a third aspect, an embodiment of the present invention provides a video capture control method applied to an electronic device, the electronic device including a keyboard and a camera arranged near the keyboard. The method includes: collecting a first video frame through the camera; when it is determined that the first video frame contains content that conforms to a preset finger model, removing the finger in the first video frame to obtain a second video frame; and displaying the second video frame and/or sending the second video frame to the opposite end electronic device for display. This solution solves the technical problem of finger distortion in the output video frames, prevents the transmitted video from being an incomplete picture blocked by fingers, and, by recognizing the finger area automatically and removing the finger without manual operation, improves the efficiency of finger removal and the intelligence of human-computer interaction without affecting the fluency of the video communication.
In a fourth aspect, an embodiment of the present invention provides a video communication control method, including: obtaining a first video frame, and obtaining a keyboard input signal; when it is determined that the first video frame contains content matching a preset finger model, and the time at which the keyboard input signal is obtained and the time at which the first video frame is obtained satisfy a preset time threshold, removing the finger from the first video frame to obtain a second video frame; and instructing display of the second video frame. This solution solves the technical problem of finger distortion in the output video frame and prevents the video transmitted during video communication from being an incomplete picture occluded by fingers. Moreover, in this embodiment of the present invention, the finger area is recognized automatically by determining whether the first video frame contains content matching the preset finger model, so that the finger in the first video frame is removed automatically and the second video frame is output without manual removal by the user, which improves the efficiency of finger removal; by intelligently recognizing the finger in real time and removing it automatically, the intelligence of human-computer interaction is improved without affecting the fluency of the video communication. In addition, in this solution, detection of a keyboard input signal is one of the conditions for removing the finger, so the typing finger in the video frame can be removed accurately.
In a fifth aspect, an embodiment of the present invention provides an electronic device, including: one or more processors; one or more memories; multiple application programs; and one or more computer programs, where the one or more computer programs are stored in the one or more memories and include instructions which, when executed by the one or more processors of the electronic device, cause the electronic device to perform the following steps: obtaining a first video frame; when it is determined that the first video frame contains content matching a preset finger model and that the finger is located in the bottom area of the first video frame, removing the finger from the first video frame to obtain a second video frame; and instructing display of the second video frame. This solution solves the technical problem of finger distortion in the output video frame and prevents the video transmitted during video communication from being an incomplete picture occluded by fingers. Moreover, in this embodiment of the present invention, the finger area is recognized automatically by determining whether the first video frame contains content matching the preset finger model, so that the finger in the first video frame is removed automatically and the second video frame is output without manual removal by the user, which improves the efficiency of finger removal; by intelligently recognizing the finger in real time and removing it automatically, the intelligence of human-computer interaction is improved without affecting the fluency of the video communication. In addition, in this solution, detecting that the finger area is located in the bottom area of the video frame is used as the condition for removing the finger, so a finger placed at that particular position can be distinguished from fingers at other positions, thereby achieving more accurate finger removal.
FIG. 1 is a structural diagram of a prior-art notebook computer whose camera is located in the keyboard area;
FIG. 2 is a schematic diagram of a prior-art picture containing typing fingers, captured by the front camera of the notebook computer shown in FIG. 1;
FIG. 3 is a structural diagram of an electronic device according to an embodiment of the present invention;
FIG. 4 is a software framework diagram of an embodiment of the present invention;
FIG. 5 is a schematic diagram of another structure of the electronic device introduced in an embodiment of the present invention;
FIG. 6 is a flowchart of the video control method introduced in an embodiment of the present invention, shown alongside the corresponding interfaces;
FIG. 7 is a schematic diagram of the finger area determined in an embodiment of the present invention;
FIG. 8 is a schematic diagram of one implementation of replacing the finger area with replacement content in an embodiment of the present invention;
FIG. 9 is a schematic diagram of another implementation of replacing the finger area with replacement content in an embodiment of the present invention;
FIG. 10 is a schematic diagram of generating prompt information after the finger is removed in an embodiment of the present invention;
FIG. 11 is a schematic diagram of generating prompt information before the finger is removed in an embodiment of the present invention;
FIG. 12 is a flowchart of a method for training the semantic segmentation model in an embodiment of the present invention;
FIG. 13 is a flowchart of identifying the finger area in an image based on the semantic segmentation model in an embodiment of the present invention;
FIG. 14 is a flowchart of the semantic inference performed when the finger area in an image is recognized based on the semantic segmentation model in an embodiment of the present invention;
FIG. 15A is a schematic diagram of an image of typing fingers captured by the front camera in an embodiment of the present invention;
FIG. 15B is a schematic diagram of the finger area mask determined by recognizing the image shown in FIG. 15A based on the semantic segmentation model in an embodiment of the present invention;
FIG. 15C is a schematic diagram of the finger area mask after noise reduction is performed on the finger mask area in an embodiment of the present invention;
FIG. 16 is a flowchart of the image processing method introduced in an embodiment of the present invention;
FIG. 17 is a flowchart of the video communication method introduced in an embodiment of the present invention;
FIGS. 18A-18C are schematic diagrams of a video frame captured by the front camera, the typing-finger area in the video frame, and the processed output image frame in the video communication method of an embodiment of the present invention;
FIG. 19 is a flowchart of a video communication method introduced in another embodiment of the present invention;
FIG. 20 is a flowchart of determining whether a user is in a typing state in the video communication method introduced in another embodiment of the present invention;
FIG. 21 is a flowchart of a video processing method introduced in an embodiment of the present invention;
FIG. 22 shows the interface changes in one specific application scenario of the present invention;
FIG. 23 shows the interface changes in another specific application scenario of the present invention.
The technical solutions in the embodiments of the present application are described below with reference to the accompanying drawings of the embodiments. In the description of the embodiments of the present application, unless otherwise specified, "/" means "or"; for example, A/B may represent A or B. "And/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may represent three cases: only A exists, both A and B exist, and only B exists.
Hereinafter, the terms "first" and "second" are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include one or more such features. In the description of the embodiments of the present application, unless otherwise specified, "multiple" means two or more.
The following describes the application scenarios involved in the embodiments of the present application. An electronic device is equipped with a camera, a microphone, a global positioning system (GPS) chip, various sensors (for example, a magnetic field sensor, a gravity sensor, and a gyroscope sensor), and similar components for sensing the external environment and the user's actions. Based on the sensed external environment and user actions, the electronic device provides the user with a personalized, contextual service experience; among these components, the camera can obtain rich and accurate information that allows the electronic device to perceive the external environment and the user's actions. An embodiment of the present application provides an electronic device, which may be implemented as any of the following: a mobile phone, a tablet computer (pad), a portable game console, a personal digital assistant (PDA), a notebook computer, an ultra-mobile personal computer (UMPC), a handheld computer, a netbook, an in-vehicle media playback device, a wearable electronic device, a virtual reality (VR) terminal device, an augmented reality (AR) terminal device, or a similar product.
First, an exemplary electronic device 100 provided in the following embodiments of the present application is introduced.
FIG. 3 is a schematic diagram of the structure of the electronic device 100.
The embodiment is described in detail below, taking the electronic device 100 as an example. It should be understood that the electronic device 100 shown in FIG. 3 is only an example: the electronic device 100 may have more or fewer components than shown in FIG. 3, may combine two or more components, or may have a different component configuration. The various components shown in the figure may be implemented in hardware, in software, or in a combination of hardware and software including one or more signal-processing and/or application-specific integrated circuits.
The electronic device 100 may include: a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone jack 170D, a sensor module 180, buttons 190, a motor 191, an indicator 192, a camera 193, a display 194, a subscriber identification module (SIM) card interface 195, and so on. The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and so on.
It can be understood that the structure illustrated in this embodiment of the present invention does not constitute a specific limitation on the electronic device 100. In other embodiments of the present application, the electronic device 100 may include more or fewer components than shown, combine certain components, split certain components, or arrange the components differently. The illustrated components may be implemented in hardware, in software, or in a combination of software and hardware. For a detailed description of the structure of the electronic device 100, refer to the earlier patent application CN201910430270.9.
As shown in FIG. 4, the software architecture involved in the present application includes: an application layer, the Windows multimedia framework, a control layer, a core layer, a platform layer, the camera driver, and the camera hardware. The finger occlusion processing module involved in the present application is integrated into the MFT (Media Foundation Transforms) module of the core layer, and this module can also integrate other functions. After the video stream is obtained from the camera driver, it is passed to the Media Source module of the core layer and then, through the Media Source module, to the MFT module; the finger occlusion processing module then processes the input video frames to remove the fingers they contain, and the processed video frames are delivered through the Media Sink to application software, for example video communication software.
The kernel layer is the layer between hardware and software. The kernel layer includes at least a display driver, a camera driver, an audio driver, a sensor driver, and a finger occlusion processing component. The finger occlusion processing component integrates the function, introduced in the embodiments of the present invention, of processing video frames containing a preset object: it can identify the finger in a video frame and remove it to obtain a video frame that contains no fingers. The finger occlusion processing component can then output the processed video frame to the display; if the processed video frame is to be transmitted to the peer device, the finger occlusion processing component delivers the processing result to the video application software with the Windows multimedia framework as an intermediary, and the result is transmitted to the peer through the end-to-end connection established by the video application software.
In a first aspect, an embodiment of the present invention provides an electronic device 100. Referring to FIG. 5, the electronic device includes:
a display 50;
a keyboard 51;
a camera 52, arranged near the keyboard 51. As shown in FIG. 5, the camera 52 may be arranged on the plane where the keyboard 51 is located, as with cameras 52a and 52c; the camera 52a is arranged within the area of the keyboard, where the area of the keyboard refers to the rectangle defined by the top-left and bottom-right corner points of the keyboard. Alternatively, the camera 52a may be arranged within the area obtained by expanding this rectangle outward by a preset distance (for example, 0.5 cm, 1 cm, or 2 cm). Alternatively, the camera 52 may be arranged on the bezel 50a of the display 50, for example on the left, right, or lower bezel of the display 50. As an optional implementation, the camera 52 may also be arranged on the lower part of the bezel, for example within the lower 1/2 or 1/3 of the bezel, or at the bottom of the bezel, and so on. The camera is arranged near the keyboard and is configured to capture video frames during video communication and send the captured video frames to the processor.
The camera 52b shown in FIG. 5 is arranged on the lower part of the bezel 50a.
a processor (not shown in the figure), connected to the display 50, the keyboard 51, and the camera 52. The processor is configured to receive a first video frame from the camera and, when it is determined that the first video frame contains content matching the preset finger model, remove the finger from the first video frame to obtain a second video frame, and send the second video frame to the display for display, and/or send the second video frame to a peer electronic device for display.
The video processing method of the embodiment of the present invention is described below with reference to the above structure. Referring to FIG. 6, the method includes the following steps:
S600: capturing a first video frame through the camera;
S610: when it is determined that the first video frame contains content matching the preset finger model, removing the finger from the first video frame to obtain a second video frame; and sending the second video frame to the display for display, and/or sending the second video frame to the peer electronic device for display.
In S600, in a specific implementation, the electronic device 100 may perform video capture after detecting a video shooting operation of the user (for example, clicking a video shooting button, producing a preset gesture, or issuing a voice command); it may also perform video capture when it detects that the user is in video communication with a peer electronic device, and then send the captured video to the peer electronic device. For example, when the electronic device detects a video communication operation (or a video shooting operation) of the user, it generates a video communication instruction and sends it to the processor; the processor responds to the video communication instruction by starting the video communication software and sending an instruction to the camera driver to control the camera to capture video. The camera driver sends the captured data to the finger occlusion processing component, which performs the subsequent operations.
Step S610 may be executed by the processor.
In S610, the first video frame may be input into the semantic segmentation model, the mask of the finger area in the first video frame is determined through the semantic segmentation model, and the finger area in the first video frame is then determined from the mask of the finger area; the mask of the finger area may be taken directly as the finger area, or the finger area may be obtained after noise reduction is performed on the mask of the finger area. When the finger area exists, it is determined that the first video frame contains content matching the preset finger model. The semantic segmentation model is obtained by training on sample photos, each of which contains the user's fingers and is annotated with the finger area. How the finger area in a video frame is determined through the semantic segmentation model is introduced later and is not repeated here. FIG. 7 is a schematic diagram of the determined finger area and contains six sub-figures, FIGS. 7a to 7f. FIG. 7b is a captured frame containing the user's fingers; for the video frame shown in FIG. 7b, the finger area determined through the semantic segmentation model is, for example, as shown in FIG. 7a.
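For illustration, the following is a minimal sketch of the optional mask noise-reduction step mentioned above, using off-the-shelf morphological filtering; the function names, kernel size, and the use of OpenCV are assumptions made for this sketch and are not specified by the embodiment.

```python
import cv2
import numpy as np

def denoise_finger_mask(mask: np.ndarray, kernel_size: int = 5) -> np.ndarray:
    """Remove speckle noise from, and fill small holes in, a binary finger mask."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    # Opening removes isolated false-positive pixels outside the finger region.
    opened = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    # Closing fills small gaps inside the detected finger region.
    return cv2.morphologyEx(opened, cv2.MORPH_CLOSE, kernel)

def finger_present(mask: np.ndarray) -> bool:
    """The frame is deemed to contain a finger if any mask pixel survives denoising."""
    return bool(np.any(denoise_finger_mask(mask)))
```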
In a specific implementation, in step S610 the finger may be removed from the first video frame as soon as it is determined that the first video frame contains content matching the preset finger model (that is, all fingers in the first video frame are removed). In an optional embodiment, it may first be determined whether a preset condition is satisfied, and the finger is removed from the first video frame only when the preset condition is satisfied. The preset condition can take many forms; four of them are introduced below, although the specific implementation is of course not limited to these four cases. (All four preset conditions correspond to the presence of the user's typing fingers in the first video frame. The purpose of checking the preset condition is to remove fingers only when typing fingers are present in the first video frame, while fingers that are not typing can be retained, for example when the user raises a finger to show the other party a manicure, or lifts a teacup to drink tea; in such cases the fingers are ones the user does not want removed. This ensures that the user's typing fingers (or fingers placed on the keyboard) are removed while other fingers are not; since other fingers are usually not close to the camera, they do not suffer from deformation. Based on the above technical solution, the possibility of severely distorted fingers occluding the picture is reduced on the one hand, while normally proportioned fingers are retained on the other, achieving the technical effect of accurately removing fingers with distorted proportions.)
First, when it is determined that the first video frame contains content matching the preset finger model and that the finger is located in the bottom area of the first video frame, it is determined that a typing finger exists, and the finger can be removed from the first video frame.
As shown in FIG. 7a, the mask contains two connected regions, 64 and 65, both of which touch the bottom; FIG. 7a therefore contains two bottom connected regions. In this case, typing fingers exist in the first video frame, the preset condition is satisfied, and the fingers in the first video frame can be removed.
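A hedged sketch of this check is given below: connected components of the denoised mask are computed, and a component counts as a bottom connected region if it touches the last pixel row. This is one plausible realization, not the embodiment's own code.

```python
import cv2
import numpy as np

def bottom_connected_regions(mask: np.ndarray) -> list:
    """Return the labels of mask components that touch the bottom edge of the frame."""
    num_labels, labels = cv2.connectedComponents(mask.astype(np.uint8))
    bottom_row = labels[-1, :]  # last row of the label image
    return [lbl for lbl in range(1, num_labels) if np.any(bottom_row == lbl)]

def typing_finger_detected(mask: np.ndarray) -> bool:
    """Condition 1: at least one finger component reaches the bottom of the frame."""
    return len(bottom_connected_regions(mask)) > 0
```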
Second, when it is determined that the first video frame contains content matching the preset finger model and that the finger area does not overlap the area where the face is located, it is determined that a typing finger exists, and the finger can be removed from the first video frame.
For example, the area where the face is located is usually the central area of the video frame. If the user raises a finger to show something to the other party, a connected region may fall within the area where the face is located; in that case the finger is not a typing finger but one being shown to the other party, the preset condition is not satisfied, and the finger in the first video frame does not need to be removed. Only when the finger area does not overlap the area where the face is located is the preset condition satisfied and the finger in the first video frame removed. The overlap between the finger area and the area where the face is located may be partial or complete, which is not limited in this embodiment of the present invention.
Third, a keyboard input signal is obtained; when it is determined that the first video frame contains content matching the preset finger model, and the time at which the keyboard input signal is obtained and the time at which the first video frame is obtained satisfy a preset time threshold, the finger in the first video frame is removed. In a specific implementation, if a keyboard input signal is obtained, typing fingers exist in the first video frame, so the preset condition is satisfied and the fingers in the first video frame can be removed. In a specific implementation, whether a keyboard input signal exists can be determined by monitoring the keyboard signal: whether the preset condition is satisfied is judged by whether a keyboard input signal is detected within a preset time period before or after the first video frame is captured. If a keyboard input signal is detected within the preset time period (for example, 1 second or 2 seconds) before or after, the preset condition is deemed satisfied; otherwise it is deemed not satisfied.
Since in the above solution the keyboard input signal is one of the conditions that trigger removal of the finger from the first video frame, it can be ensured that the removed fingers are the fingers of a user who is typing.
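As a minimal sketch of this timing check, assuming frame and keystroke timestamps in seconds (the 1-second default mirrors the example value above; nothing here is mandated by the embodiment):

```python
def keyboard_active_near_frame(frame_ts, keystroke_timestamps, threshold_s=1.0):
    """True if any keyboard input occurred within threshold_s of the frame time."""
    return any(abs(frame_ts - ks) <= threshold_s for ks in keystroke_timestamps)
```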
Fourth, when it is determined that the first video frame contains content matching the preset finger model, and that the finger is located in the bottom area of the first video frame and is connected to a side edge of the first video frame, the finger in the first video frame is removed.
In a specific implementation, when a finger is located in the bottom area it is very likely a typing finger, but it may also be a finger providing input through the touchpad. Typing fingers are usually located on the two sides of the keyboard and therefore generally touch a side edge of the first video frame; for example, in FIG. 7a the bottom connected region 65 touches the right edge of the first video frame. This solution can therefore locate typing fingers more precisely and accurately remove the typing fingers from the first video frame. Accordingly, if the finger is located in the bottom area of the first video frame and is connected to a side edge of the first video frame, the preset condition is deemed satisfied; otherwise it is deemed not satisfied.
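The following sketch illustrates this stricter check, requiring a mask component to touch both the bottom edge and the left or right edge; it reuses the assumed connected-component representation from the earlier sketches.

```python
import cv2
import numpy as np

def touches_bottom_and_side(mask: np.ndarray) -> bool:
    """Condition 4: some component reaches the bottom edge AND a side edge."""
    num_labels, labels = cv2.connectedComponents(mask.astype(np.uint8))
    for lbl in range(1, num_labels):
        region = labels == lbl
        on_bottom = np.any(region[-1, :])
        on_side = np.any(region[:, 0]) or np.any(region[:, -1])
        if on_bottom and on_side:
            return True
    return False
```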
In S610, when it is determined that the first video frame contains content matching the preset finger model and the finger area has been determined, it may further be determined whether the finger area is in an abnormal state. Whether the finger area is in an abnormal state may be determined after it is determined that the preset condition is satisfied: if the finger area is in an abnormal state, the finger in the first video frame is not removed and the first video frame is output directly (the step of determining that the finger area is in an abnormal state and directly outputting the first video frame, for example displaying it on the display or sending it to the peer electronic device for display, may be executed by the processor); otherwise the finger in the first video frame is removed. Alternatively, whether the finger area determined based on the preset finger model is in an abnormal state may be judged as soon as it is detected that the first video frame contains content matching the preset finger model; this is not limited in this embodiment of the present invention.
In a specific implementation, the abnormal state can take many forms; four of them are introduced below, although the specific implementation is of course not limited to these four cases.
In the first case, the area of the finger region is greater than a preset area threshold (as shown in FIG. 7c), for example greater than 5000 pixels, or greater than 1/4 or 1/3 of the total area of the first video frame. In this case, if the fingers in the first video frame were removed, the finger region might not blend with the other regions. Therefore, to ensure that the finger region of the output video frame (the second video frame) blends with its background region, the operation of removing the fingers from the first video frame is performed only when the area of the finger region is not greater than the preset area threshold.
In the second case, the finger area overlaps the area 66 where the face is located. As shown in FIG. 7d, the finger area contains three connected regions: two bottom connected regions (that is, regions located at the bottom of the first video frame) and one middle connected region that overlaps the area where the face is located (how the area where the face is located is determined is introduced later). In this case, one of the user's hands in the first video frame is in the bottom area and the other is not, which usually means the user is typing with one hand while doing something other than typing with the other (for example, drinking water or touching their hair). In this case the user does not want the non-typing fingers removed, and removing only the typing fingers would make the picture look abrupt. Therefore, the finger area is considered to be in an abnormal state, the fingers in the first video frame are not removed, and the first video frame is output directly.
In the third case, there are at least two bottom connected regions and the distance between them is greater than a preset distance threshold, for example 100 pixels or 150 pixels. In this case, the distance between the user's two hands in the first video frame is greater than a first preset distance (the first preset distance is equal to, or positively correlated with, the preset distance threshold). As shown in FIG. 7e, this usually means that in the first video frame the user is typing with one hand while touching the touchpad with the other, and the region where the touchpad is touched often corresponds to the region of the user's neck. Removing the finger area there would make the neck region transition unnaturally, while removing only the typing fingers and keeping the touchpad fingers would make the picture look abrupt; this state can therefore be deemed an abnormal state.
In the fourth case, there is a non-bottom connected region, as shown in FIG. 7f. This usually means the user is typing with one hand and doing something else with the other. Removing both hands would not meet the user's needs, while removing only the fingers in the bottom area of the first video frame would make the picture look abrupt; this is therefore deemed an abnormal state, and the fingers in the first video frame are not removed.
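The four abnormal-state cases above can be combined into a single screening routine. The sketch below is a loose illustration under assumed inputs (per-component area, bounding box, and a bottom-contact flag, plus an optional face box); the thresholds reuse the example values from the text, and none of the names come from the embodiment.

```python
def boxes_overlap(a, b):
    """Axis-aligned overlap test for boxes given as (x0, y0, x1, y1)."""
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

def finger_area_abnormal(components, face_box=None,
                         area_thresh=5000, dist_thresh=100):
    """components: [{'area': int, 'bbox': (x0, y0, x1, y1), 'touches_bottom': bool}]"""
    if sum(c['area'] for c in components) > area_thresh:
        return True                                    # case 1: region too large
    if face_box is not None and any(
            boxes_overlap(c['bbox'], face_box) for c in components):
        return True                                    # case 2: overlaps the face
    bottoms = sorted((c for c in components if c['touches_bottom']),
                     key=lambda c: c['bbox'][0])
    if len(bottoms) >= 2:
        gap = bottoms[1]['bbox'][0] - bottoms[0]['bbox'][2]  # horizontal gap
        if gap > dist_thresh:
            return True                                # case 3: hands far apart
    if len(bottoms) < len(components):
        return True                                    # case 4: non-bottom region
    return False
```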
In step S610, the finger can be removed from the first video frame in a variety of ways; three of them are introduced below, although the specific implementation is of course not limited to these three cases.
First, the content of the finger area in the first video frame is replaced with replacement content to obtain the second video frame. For example, the electronic device enters the video communication state in response to a video communication operation, for example clicking a video communication button on the interface of a first contact in the video communication software. In the video communication state, a video communication interface is displayed, which includes a video preview window and a video receiving window, where the video preview window displays the video frames currently captured by the electronic device and the video receiving window displays the video frames received from the peer electronic device.
Referring to FIG. 8, at the initial stage of video capture, the hands of the user of the electronic device are not yet placed on the keyboard for typing; the electronic device captures a third video frame that contains no fingers and outputs it, as shown in FIG. 8a. At the same time, the electronic device determines whether the third video frame contains the user's fingers, which can be done through the semantic segmentation model introduced in the embodiments of the present invention: if the semantic segmentation model determines that a finger area exists, the third video frame is deemed to contain fingers; otherwise it is deemed not to. If it is determined that the third video frame contains no fingers, the third video frame is stored as a background frame, as shown in FIG. 8b. The electronic device then captures the first video frame (as shown in FIG. 8c) and determines whether it contains content matching the preset finger model by inputting the first video frame into the semantic segmentation model; the finally determined finger area is, for example, shown as 90 in FIG. 8d. Once the finger area has been determined, the frame is deemed to contain content matching the preset finger model; the replacement content corresponding to the finger area is then determined from the background frame, as shown in FIG. 8e, and the finger area in the first video frame is covered with the replacement content to obtain the second video frame (as shown in FIG. 8f), which is finally output.
Based on the above solution, the fingers can be removed from the first video frame without affecting the aspect ratio or layout of the first video frame, so that the output of the video frames is smoother.
As an optional embodiment, the method further includes: obtaining a third video frame, where the third video frame is a video frame captured before the first video frame; when it is determined that the third video frame does not contain content matching the preset finger model, using the third video frame as a background frame; and determining, in the background frame, the content corresponding to the finger area in the first video frame as the replacement content. This step may be executed by the processor.
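A minimal sketch of this background-frame replacement follows, assuming the stored background frame is aligned with the current frame and the finger mask is a single-channel array (the names are illustrative only):

```python
import numpy as np

def replace_with_background(frame, mask, background):
    """Copy background pixels over the finger area to produce the second frame."""
    out = frame.copy()
    out[mask > 0] = background[mask > 0]  # mask: H x W finger-area mask
    return out
```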
The second video frame may contain a specific object, which is usually a stationary object in the background area, for example a teacup or a pen. With this method, the position of the specific object in the second video frame has the same coordinates as in the third video frame (or the offset is less than a preset offset, for example 10 pixels or 20 pixels); or the size of the specific object in the second video frame determined with this method is the same as, or similar to, its size in the third video frame (for example, the difference is within 5% or 10%).
During the startup of the video communication, if the user places their fingers on the keyboard before any finger-free video frame has been captured as a background frame, the electronic device outputs the video frame containing the user's fingers.
Second, the first video frame is cropped to obtain the second video frame, which does not contain the finger area.
For example, referring to FIG. 9, after the electronic device captures a video frame, it may first determine whether the video frame contains content matching the preset finger model; if not, it outputs the video frame directly. As shown in FIG. 9, after capturing the third video frame, the electronic device outputs it directly. If it is determined that the video frame contains content matching the preset finger model (for example, as shown in FIG. 9c), a cropping box 91 that contains no fingers is determined based on the captured video frame (for example, the first video frame); after the first video frame is cropped with this cropping box, a second video frame containing no fingers is obtained (as shown in FIG. 9d). The cropping box can be determined in a variety of ways, for example: (1) Taking the lower-left corner of the video frame as the origin, determine the maximum Y value of the finger area in the video frame, use this maximum Y value as the lower cropping edge, and use the top of the video frame as the upper cropping edge; determine the cropping ratio from the heights of the upper and lower cropping edges ((the maximum Y value minus the minimum Y value) / the height of the first video frame); then determine the center position of the person in the video frame, extend this center position to the left by a first preset distance (1/2 × cropping ratio × the width of the first video frame) to determine the left edge, and extend it to the right by a second preset distance (1/2 × cropping ratio × the width of the first video frame) to determine the right edge, thereby determining the cropping box. (2) Determine a cropping box of a preset size, place it in the central area of the video frame, and then determine whether the finger area overlaps this area; if there is overlap, move the cropping box upward as a whole until the finger area no longer overlaps the cropping box. Of course, in a specific implementation the cropping box can also be determined in other ways, which are not enumerated in detail or limited in this embodiment of the present invention.
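For way (1) above, a rough sketch is shown below. It assumes an image coordinate system with the origin at the top-left (as in most image libraries), so the "maximum Y value" of the finger area corresponds here to the topmost row of the finger mask; the person-center input is assumed to come from a separate detector.

```python
import numpy as np

def crop_without_fingers(frame, mask, person_center_x):
    """Crop away the finger strip at the bottom, centered on the person."""
    h, w = mask.shape
    rows = np.where(mask > 0)[0]          # assumes the mask is non-empty
    crop_h = int(rows.min())              # keep only the rows above the fingers
    crop_ratio = crop_h / h
    half_w = int(0.5 * crop_ratio * w)    # extend left/right of the person center
    left = max(0, person_center_x - half_w)
    right = min(w, person_center_x + half_w)
    return frame[0:crop_h, left:right]
```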
With the above solution, the offset of the coordinates of the specific object in the second video frame relative to its coordinates in the third video frame is greater than the preset offset; or the difference between the size of the specific object in the second video frame and its size in the third video frame is greater than a preset value. With the above solution, finger removal is achieved without a background frame, which reduces the processing burden on the electronic device.
Third, the finger area is filled with pixels from regions adjacent to the finger area to obtain the second video frame. In this case, the coordinates of the specific object in the second video frame are the same as, or differ little from, its coordinates in the third video frame (the offset is less than the preset offset); or the size of the specific object in the second video frame is the same as, or differs little from, its size in the third video frame. With the above solution, this function is achieved without a background frame.
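This third way can be realized, for example, with off-the-shelf image inpainting, which fills the masked area from its surroundings; the embodiment does not prescribe any particular algorithm, so the sketch below is only one possible realization.

```python
import cv2
import numpy as np

def fill_from_neighbors(frame, mask):
    """Fill the finger area from neighboring pixels via OpenCV inpainting."""
    return cv2.inpaint(frame, mask.astype(np.uint8), 3, cv2.INPAINT_TELEA)
```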
As an optional embodiment, after the display is controlled to display the second video frame based on S610, in response to determining that a fourth video frame contains content matching the preset finger model and that the finger area is in an abnormal state, at least one transition frame is determined from the finger area of the second video frame and the finger area of the fourth video frame; the display is controlled to display the at least one transition frame; and after displaying the at least one transition frame, the display is controlled to display the fourth video frame. This step may be executed by the processor.
In practice, during video communication the user first types with one hand, and the electronic device captures the first video frame; since the first video frame contains content matching the preset finger model and the finger area is not in an abnormal state, the electronic device outputs the second video frame, which contains no fingers. The user then places the other hand on the touchpad, and the electronic device captures a fourth video frame; the fourth video frame still contains content matching the preset finger model, but the distance between the two connected regions of its finger area is greater than the preset distance threshold, so the finger area is determined to be in an abnormal state. In this case the fourth video frame could be output directly, but switching directly from the second video frame to the fourth video frame would make the fingers of a typing hand appear suddenly, making the picture look abrupt. To prevent this, at least one transition frame is added between the second video frame and the fourth video frame to achieve a smooth transition from the picture of the second video frame to that of the fourth video frame. If there is a single transition frame, the weights of the second video frame and the fourth video frame may, for example, both be 0.5; if there are multiple transition frames, the weight of the second video frame gradually decreases while the weight of the fourth video frame gradually increases. For example, with five transition frames, the weights of (the second video frame, the fourth video frame) may be (0.8, 0.2) in the first transition frame, (0.6, 0.4) in the second, (0.5, 0.5) in the third, (0.4, 0.6) in the fourth, and (0.2, 0.8) in the fifth. Of course, these weights are merely examples and are not limiting. For the background area outside the finger area, each transition frame displays the content of the background area of the fourth video frame, while the content of the finger area in each transition frame is determined in the manner described above.
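The weighted blending just described can be sketched as follows, blending the second and fourth frames inside the finger area only, while taking the background from the fourth frame; the linear weight ramp approximates the example weights above and is an assumption.

```python
import numpy as np

def make_transition_frames(second, fourth, mask, n=5):
    """Blend n transition frames between the second and fourth video frames."""
    frames = []
    m = (mask > 0)[..., None]             # broadcast the mask over color channels
    second_f = second.astype(np.float32)
    fourth_f = fourth.astype(np.float32)
    for i in range(1, n + 1):
        w4 = i / (n + 1)                  # the fourth frame's weight ramps up
        blended = (1 - w4) * second_f + w4 * fourth_f
        frame = np.where(m, blended, fourth_f)  # background from the fourth frame
        frames.append(frame.astype(np.uint8))
    return frames
```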
In a specific implementation, after the electronic device removes the finger from the first video frame and displays the second video frame, referring to FIG. 10, prompt information may also be generated to inform the user that the finger has been removed from the video frame and ask whether the finger should be kept. As shown in FIG. 10, a prompt box 100 is displayed over the second video frame, prompting "The finger has been removed from the video chat; please confirm whether to continue removing it". If the user wants to continue removing the finger, they click the confirmation button 110; after the electronic device detects the user clicking the confirmation button 110, fingers are still removed from subsequently captured video frames, as shown in FIG. 10b. If the user does not want to continue removing the finger, they click the cancel button 120; after the electronic device detects the user clicking the cancel button, fingers are no longer removed from subsequently captured video frames, as shown in FIG. 10c. Based on the above solution, whether to remove the user's fingers during video communication can be chosen according to the user's needs, achieving more flexible control.
In a specific implementation, after the electronic device starts video communication (or captures a frame containing the user's fingers), it may prompt the user whether to remove the fingers; if the user's confirmation operation is detected, the fingers are removed from the video frames, otherwise they are not.
For example, referring to FIG. 11, after the electronic device captures the first video frame and determines through detection that the video frame contains content matching the preset finger model (that is, a finger), a prompt box 130 is generated, prompting "A finger has been detected in the video; please confirm whether to remove it". If the user of the electronic device clicks the confirmation button 140, the electronic device removes the finger from the first video frame and displays the second video frame, as shown in FIG. 11b; if the user clicks the cancel button 150, the electronic device does not remove the finger from the first video frame and outputs the first video frame directly, as shown in FIG. 11c.
In addition to being applied at the video capture end of a video communication process, where the electronic device performs finger removal on the captured video frames so that the frames sent to the peer electronic device have the fingers removed, the above solution can also be applied at the receiving end of the video communication process. That is, the peer electronic device receives video frames from which the fingers have not been removed, processes them based on the video processing method introduced in the embodiments of the present invention, and then outputs them; in this case, the video frames in the above steps are not captured video frames but video frames received from the peer electronic device.
In another aspect, based on the same inventive concept, an embodiment of the present invention provides an electronic device, including: one or more processors; one or more memories; multiple application programs; and one or more computer programs, where the one or more computer programs are stored in the one or more memories and include instructions which, when executed by the one or more processors of the electronic device, cause the electronic device to perform the following steps: obtaining a first video frame, and obtaining a keyboard input signal; when it is determined that the first video frame contains content matching a preset finger model, and the time at which the keyboard input signal is obtained and the time at which the first video frame is obtained satisfy a preset time threshold, removing the finger from the first video frame to obtain a second video frame; and instructing display of the second video frame.
In a specific implementation, instructing display of the second video frame can cover several cases, for example: (1) displaying the second video frame through a display unit of the electronic device; (2) sending the second video frame to another display unit for display, either because the electronic device has no display unit or because display on the other display unit is more suitable, for example because its display area is larger; (3) sending the second video frame to the peer electronic device of the video communication, to be displayed by the peer electronic device.
下面将结合图12-图14、图17A、图17B介绍步骤S610中具体如何通过语义分割模型确定出手指区域,从而认定第一视频帧中包含符合预设手指模型的内容。The following will describe how to determine the finger area through the semantic segmentation model in step S610 with reference to FIGS. 12-14, 17A, and 17B, so as to determine that the first video frame contains content that conforms to the preset finger model.
本发明实施例提供一种语义分割模型训练方法,语义分割模型是像素级别上 的分类,针对一张照片,每个物体的像素被分成一类(例如:属于人的像素被分成一类、属于摩托车的像素被分成一类、属于小狗的像素被分成一类等等),除此之外背景像素也被分成一类。请参考图12,该方法包括以下步骤:The embodiment of the present invention provides a method for training a semantic segmentation model. The semantic segmentation model is a classification at the pixel level. For a photo, the pixels of each object are divided into one category (for example, pixels belonging to a person are divided into one category, belonging to Motorcycle pixels are divided into one category, pixels belonging to puppies are divided into one category, etc.), in addition to the background pixels are also divided into one category. Please refer to Figure 12, the method includes the following steps:
S1200:数据采集和标注,采集用户打字时的照片(或者其他包含用户的手的照片),并标注出手指(或者其他预设物体)区域的掩膜,用于语义分割模型的训练。S1200: Data collection and labeling, collecting photos of the user when typing (or other photos containing the user's hand), and labeling the mask of the finger (or other preset object) area for the training of the semantic segmentation model.
S1210:模型设计,设计出语义分割模型,该语义分割模型例如为:卷积神经网络模型、条件随机场模型等等。可选的,为了保证分割出的手指区域的准确性和手指边缘的精确性,可以采用双分支卷积神经网络模型,双分支例如包括:语义特征分支和边缘特征分支。其中,语义特征分支用于提取图像的语义特征,语义特征指的像素点所表征的具体物体,例如:人脸、手指等等,边缘特征分支用于提取图像的纹理特征,纹理特征指的是边缘信息、形状信息(例如:拐角)、颜色等特征。S1210: Model design, design a semantic segmentation model, the semantic segmentation model is, for example, a convolutional neural network model, a conditional random field model, and so on. Optionally, in order to ensure the accuracy of the segmented finger region and the accuracy of the edge of the finger, a dual-branch convolutional neural network model may be used. The dual-branch includes, for example, semantic feature branches and edge feature branches. Among them, the semantic feature branch is used to extract the semantic feature of the image, the semantic feature refers to the specific object represented by the pixel point, such as: human face, finger, etc., the edge feature branch is used to extract the texture feature of the image, and the texture feature refers to Features such as edge information, shape information (for example: corners), and color.
S1220:模型训练,用于通过S1200中标注好的照片和S1210中设计好的语义分割模型,训练并更新语义分割模型的参数,获取用于识别手指的语义分割模型。S1220: Model training, used to train and update the parameters of the semantic segmentation model based on the photos marked in S1200 and the semantic segmentation model designed in S1210 to obtain a semantic segmentation model for finger recognition.
在具体实施过程中,该模型的输入是将像素值归一化到0到1之间的图片(在计算过程中,图片是内存中以特定的顺序存储的多维数组),语义特征分支可以采用深层卷积神经网络提取语义特征以确定手指区域,边缘特征分支可以采用浅层神经网络提取纹理特征以保证手指边缘分割的准确性。在训练时,向双分支卷积神经网络模型输入图片之后,双分支卷积神经网络模型通过深层卷积神经网络提取语义特征,通过浅层神经网络提取纹理特征;然后通过特征融合网络对提取出的语义特征和边缘特征进行特征融合计算得到综合特征,最后的分类层以综合特征作为输入,计算每个像素点属于手指区域的置信度,进而判断每个像素点是否属于手指区域(在具体实施过程中,语义分割模型是以特定形式连接的各种操作的集合,每种操作由不同的数值组成,操作的实际表现为用其自身参数值与输入的数组做矩阵运算,并输出计算得到的数组)。然后将其与标注的手指区域进行比对,基于差异损失函数,通过差异损失函数反向传播到语义分割模型,更新语义分割模型的参数。In the specific implementation process, the input of the model is a picture that normalizes the pixel value to between 0 and 1 (in the calculation process, the picture is a multi-dimensional array stored in a specific order in the memory), and the semantic feature branch can be used The deep convolutional neural network extracts semantic features to determine the finger area, and the edge feature branch can use the shallow neural network to extract texture features to ensure the accuracy of finger edge segmentation. During training, after inputting images to the dual-branch convolutional neural network model, the dual-branch convolutional neural network model extracts semantic features through deep convolutional neural networks, and texture features through shallow neural networks; and then extracts through feature fusion network pairs The semantic features and edge features of each pixel are calculated by feature fusion calculation to obtain comprehensive features. The final classification layer takes the comprehensive features as input to calculate the confidence that each pixel belongs to the finger area, and then determines whether each pixel belongs to the finger area (in the specific implementation) In the process, the semantic segmentation model is a collection of various operations connected in a specific form. Each operation is composed of different values. The actual performance of the operation is to use its own parameter values and input arrays to perform matrix operations, and output the calculated results Array). Then it is compared with the labeled finger area, and based on the difference loss function, it is propagated back to the semantic segmentation model through the difference loss function, and the parameters of the semantic segmentation model are updated.
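The dual-branch structure and one training step can be sketched as follows, assuming PyTorch; the layer sizes and scaling factors are illustrative assumptions, not the patent's actual architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualBranchSeg(nn.Module):
    def __init__(self):
        super().__init__()
        # deep semantic branch: downsamples 4x, larger receptive field
        self.semantic = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())
        # shallow edge/texture branch: downsamples 2x, keeps fine detail
        self.edge = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU())
        self.fuse = nn.Conv2d(64 + 16, 2, 1)  # 2 classes: finger / not

    def forward(self, x):
        sem = self.semantic(x)
        edg = self.edge(x)
        # upsample the coarser semantic features to the edge-branch size
        sem = F.interpolate(sem, size=edg.shape[2:], mode='bilinear',
                            align_corners=False)
        logits = self.fuse(torch.cat([sem, edg], dim=1))
        # restore full input resolution for per-pixel classification
        return F.interpolate(logits, size=x.shape[2:], mode='bilinear',
                             align_corners=False)

model = DualBranchSeg()
img = torch.rand(1, 3, 224, 224)           # pixel values already in [0, 1]
mask = torch.randint(0, 2, (1, 224, 224))  # annotated finger mask
loss = F.cross_entropy(model(img), mask)   # difference loss
loss.backward()                            # backpropagate, update parameters
```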
Based on the above training method, an embodiment of the present invention further provides a method for recognizing the finger region (or the region of another preset object) in an image or video using the semantic segmentation model. Referring to FIG. 13, the method includes the following steps:
S1300: Model freezing. Specifically, once training of the semantic segmentation model has finished, the resulting model is used for finger recognition and its parameters no longer change.
S1310: Data preprocessing. This step obtains the current video frame and normalizes its data, for example by normalizing the pixel values of the video frame to the range 0 to 1.
S1320: Model inference. For example, referring to FIG. 14, the inference may include the following steps. S1400: input the image, that is, feed the image normalized in S1310 into the semantic segmentation model. S1410a: extract the semantic features of the image through the semantic feature branch, for example a deep convolutional neural network; when extracting semantic features, the image may also be downscaled, for example by a factor of 4 or 5. S1410b: extract texture features through the edge feature branch, for example a shallow convolutional neural network; likewise, the image may be downscaled, for example by a factor of 2 or 3. S1420: fuse the extracted semantic and edge features through the feature fusion network into composite features. If the two branches use different scaling factors, the lower-resolution features are first upscaled by interpolation so that the semantic and texture features have the same size before fusion (for example, if the semantic features are 8×8 pixels and the edge features 32×32 pixels, the semantic features are the lower-resolution ones). S1430: feed the composite features into the classifier to obtain the finger region mask. The final classifier takes the composite features as input, computes for each pixel the confidence of belonging and of not belonging to the finger region, decides whether each pixel belongs to the finger region, and thereby produces the mask of the finger region.
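The inference path (S1300 to S1430) can be sketched as below, reusing the DualBranchSeg model from the earlier sketch; the 0.5 confidence threshold and frame layout are assumptions:

```python
import numpy as np
import torch

model.eval()                        # S1300: freeze, parameters stop changing
for p in model.parameters():
    p.requires_grad_(False)

def finger_mask(frame_bgr: np.ndarray) -> np.ndarray:
    # S1310: normalize pixel values to [0, 1], reorder HWC -> CHW
    x = torch.from_numpy(frame_bgr).float().permute(2, 0, 1) / 255.0
    with torch.no_grad():
        logits = model(x.unsqueeze(0))             # S1410-S1420 inside the net
        prob = torch.softmax(logits, dim=1)[0, 1]  # confidence of "finger"
    return (prob > 0.5).numpy().astype(np.uint8)   # S1430: binary mask
```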
In a specific implementation, the mask determined through the above steps may contain some noise. To improve the accuracy of the determined finger region, and hence the accuracy of the subsequent filling of that region, noise filtering can be applied to the mask produced by the semantic segmentation model.
Referring to FIG. 15A, which shows a user photo captured by the front camera of the electronic device: because the front camera sits at the bottom of the display or on the keyboard, the captured image contains fingers. The white areas in FIG. 15B are the finger region mask output by the semantic segmentation model, indicating areas that may be fingers. Each white area is called a connected region; in FIG. 15B the mask determined by the model comprises five connected regions (the number of connected regions differs from picture to picture, and the embodiments of the present invention place no limit on it): connected regions 61, 62, 63, 64, and 65. The box around a connected region is called its bounding rectangle, and the mask may also contain the empirical position 66 of the face frame. Connected regions 61, 62, and 63 do not touch the bottom of the picture and are called non-bottom connected regions; connected regions 64 and 65 are called bottom connected regions. In a normal typing posture, the hand region usually belongs to a bottom connected region. After the finger region mask is determined, it can be taken directly as the finger region, or it can first be denoised and the denoised mask used as the finger region.
In a specific implementation, face recognition can be run on the image to locate the face frame, and the recognized location used as the empirical position of the face frame. Alternatively, an empirical area at the center of the screen of the electronic device can serve as the empirical area of the face frame; or a large number of chat videos can be analyzed to find the positions of the face frames in them, and the aggregated results used to derive the empirical position of the face frame.
One or more of the following methods can be used to filter out the noise regions (a simplified code sketch follows the list):
① Using the obtained finger region mask and the empirical position 66 of the face frame, apply erosion and dilation to the binary mask image to filter out noise holes of very small area, for example holes smaller than 1, 2, or 3 pixels, which are typically noise along the finger edges.
② Find all connected regions of the binary mask image and preliminarily filter out connected regions whose area is below a preset threshold, for example 10 pixels or 20 pixels.
③ Keep only connected regions whose bounding rectangle exceeds a preset area threshold, for example 100 pixels or 200 pixels. Alternatively, compute the center of each connected region and, from the relationship between that center and the center of the empirical face frame position, determine where the region lies: above, below, to the left of, or to the right of the face frame center. A different preset area threshold is then set depending on that location, and connected regions whose bounding rectangle area exceeds the applicable threshold are kept. For example, a region to the left or right of the face center may use a threshold of 150, 200, or 220 pixels; a region above the face center may use 80 or 100 pixels; a region below the face center may use 300 or 400 pixels. Typically the threshold for regions below the face center is larger than for regions to the left or right, and the threshold for regions to the left or right is larger than for regions above.
④ For connected regions that satisfy the above requirements (in a specific implementation, all three filtering steps may be used, or only some of them; the embodiments of the present invention place no limit on this), further check whether the region overlaps the empirical face area. If it overlaps and is not a bottom connected region, the current state is judged to be non-typing and no further processing is performed; if it does not overlap, or it is a bottom connected region, proceed to step ⑤. Of course, in a specific implementation, a region that does not overlap the face area or that is a bottom connected region may also be directly judged to indicate typing, with the corresponding connected region taken as the finger region mask.
⑤ If the above requirements are met, compute the aspect ratio of the connected region and filter out regions whose aspect ratio is below a threshold (for example 0.5, 0.7, or 0.8) as well as inverted-triangle regions, because in a normal typing posture the finger region mask has a relatively large aspect ratio and is generally not an inverted triangle. This step may also be performed before step ④; the embodiments of the present invention place no limit on the order.
⑥ Compute the ratio of each remaining connected region's area to the area of its bounding rectangle. If the ratio exceeds a set threshold (for example 0.5 or 0.6), return the typing-state information, conclude that the current user is typing, and take the finally screened finger region mask as the effective mask region, that is, the region where the actual fingers are. In this step, it is also possible to directly check whether the area of the connected region exceeds a preset area (for example 60,000 or 70,000 pixels); if so, return the typing-state information, conclude that the user is typing, and take the screened mask as the effective mask region.
In a specific implementation, steps ① to ⑥ may be executed in order; where they do not conflict, they may also be applied to each connected region separately, with the result of each step used to decide whether the region belongs to the finger region mask. The connected regions remaining after this noise reduction are shown, for example, in FIG. 15C: regions 61, 62, and 63 have been filtered out, and only regions 64 and 65 are retained as the true finger regions.
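A simplified sketch of the filters, assuming Python with OpenCV: the thresholds are the example values quoted above, and the sketch treats only bottom-connected regions as typing fingers, omitting the face-frame overlap test of step ④:

```python
import cv2
import numpy as np

def denoise_finger_mask(mask: np.ndarray) -> np.ndarray:
    h, w = mask.shape
    # step 1: erode then dilate (morphological opening) to drop tiny holes
    kernel = np.ones((3, 3), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    out = np.zeros_like(mask)
    n, labels, stats, centroids = cv2.connectedComponentsWithStats(mask)
    for i in range(1, n):                  # label 0 is the background
        x, y, bw, bh, area = stats[i]
        if area < 20:                      # step 2: tiny regions
            continue
        if bw * bh < 200:                  # step 3: small bounding rectangles
            continue
        if (y + bh) < h:                   # step 4 (simplified): keep only
            continue                       # bottom-connected regions
        if bw / float(bh) < 0.5:           # step 5: aspect-ratio filter
            continue
        if area / float(bw * bh) <= 0.5:   # step 6: area-to-rectangle ratio
            continue
        out[labels == i] = 255
    return out
```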
In a specific implementation, the above training method for the semantic segmentation model can also be used to recognize other objects in an image, for example trash cans, backgrounds, ashtrays, folders, styluses, palms, or arms, and likewise to recognize the regions where those objects are located. It suffices to use different training samples in the training phase: for a trash can, the training samples are pictures containing a trash can together with annotations of the trash can's region; for a stylus, pictures containing a stylus together with annotations of the stylus's region; and so on.
In a specific implementation, besides the semantic segmentation model described above, foreground segmentation (determining the region where the fingers are) can also be done in other ways, for example with a frame-difference-based foreground segmentation method or a motion-shape-based one. The semantic segmentation model, however, segments the finger region of the current video frame more accurately, is not affected by the constantly moving body of the typist, and is fast enough to meet real-time requirements.
An embodiment of the present invention further provides an image processing method for determining replacement content based on the background region of the video frames. Referring to FIG. 16, the method includes the following steps:
S700: Perform motion offset estimation between the background frame and the current frame to obtain the motion offset matrix of the background frame relative to the current frame (in a specific implementation, the motion offset matrix of the current frame relative to the background frame may be computed instead). In the initial phase, if an image captured during the video chat does not contain a typing finger, it is taken as the background frame; if it does contain a typing finger, the background frame is left undetermined until an image without a typing finger is captured and taken as the background frame.
Specifically, feature points of the current frame and the background frame can first be detected, the detected feature points then matched to find corresponding pairs, and a perspective transformation matrix computed from the matched pairs; this matrix characterizes the amount of motion of the background frame relative to the current frame. Feature point detection algorithms such as SIFT (Scale-Invariant Feature Transform), SURF (Speeded-Up Robust Features), or ORB (Oriented FAST and Rotated BRIEF) can be used to find the feature points of the background frame and the current frame, and matching algorithms such as BF (Brute-Force matching) or FLANN (Fast Library for Approximate Nearest Neighbors) can be used to find the matching pairs. Of course, feature points may also be detected and matched in other ways; the embodiments of the present invention place no limit on this.
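One concrete realization of S700, assuming Python with OpenCV: the patent allows SIFT, SURF, or FLANN equally, so the ORB-plus-brute-force choice below is just one possibility:

```python
import cv2
import numpy as np

def estimate_motion(background: np.ndarray, current: np.ndarray):
    orb = cv2.ORB_create(nfeatures=1000)
    kp1, des1 = orb.detectAndCompute(background, None)
    kp2, des2 = orb.detectAndCompute(current, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    # needs at least 4 point pairs; RANSAC rejects bad matches
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return H  # maps background-frame coordinates to current-frame coordinates
```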
Perspective transformation:
A perspective transformation (homography) is a mapping between two images: through matrix multiplication, a point in one image is mapped to the corresponding point in the other. It involves two steps: computing the perspective matrix and mapping coordinates.
Computing the perspective transformation matrix: the perspective transformation matrix is a 3×3 matrix H, written as:
    H = [ h11  h12  h13
          h21  h22  h23
          h31  h32  h33 ]
Given n (n >= 4) pairs of matching points [A1, B1], [A2, B2], [A3, B3], ..., [An, Bn] between image A and image B, the matching points and H form the following system of linear equations:
    B1 = H × A1
    B2 = H × A2
    B3 = H × A3
    ...
    Bn = H × An
Solving this system of linear equations yields H, which expresses the motion offset estimate from image A to image B (that is, the perspective transformation matrix H). Therefore, for every point (X1, Y1) in image A, its corresponding coordinates (X2, Y2) in the view plane of image B can be computed by the following matrix multiplication in homogeneous coordinates:
    [x', y', w']^T = H × [X1, Y1, 1]^T,  with X2 = x'/w', Y2 = y'/w'
S710: Perform motion compensation on the background frame using the motion amount determined in the previous step, obtaining a compensated frame (that is, the motion-compensated background frame). The purpose of this step is to align the current frame and the background frame, so as to eliminate tearing of the human body region in the completed image caused by body motion. (In a specific implementation, motion compensation may instead be applied to the current frame based on the motion amount; the embodiments of the present invention place no limit on this.)
For every point (X1, Y1) in the background frame, its corresponding coordinates (X2, Y2) in the view plane of the current frame can be computed by the same matrix multiplication shown above.
Through this perspective transformation, motion compensation of the background frame relative to the current frame is achieved.
S720: Compute the background region to be used for filling, based on the motion-compensated background frame (that is, the compensated frame) and the finger region mask, and use the content/image of that background region to fill or replace the finger region mask of the current frame.
In a specific implementation, the image at the location of the finger region mask can be taken from the background frame as the background region for filling or replacement, and that region then overlaid onto the finger region mask of the current frame, thereby filling the masked area.
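S710 and S720 together can be sketched as follows, assuming Python with OpenCV and the homography H estimated by the earlier sketch:

```python
import cv2
import numpy as np

def remove_fingers(current, background, finger_mask, H):
    h, w = current.shape[:2]
    # S710: motion-compensate the background into the current frame's plane
    compensated = cv2.warpPerspective(background, H, (w, h))
    out = current.copy()
    out[finger_mask > 0] = compensated[finger_mask > 0]  # S720: fill/replace
    return out
```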
S730: Use an ambient-light rendering method to render the filled foreground region and the surrounding background region so that the brightness of the picture is consistent, eliminating brightness differences between adjacent video frames caused by hardware. In a specific implementation, step S730 is optional. The output frame obtained from the above processing is used as the final video frame for video output.
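The patent does not spell out the ambient-light rendering method; a simple stand-in, sketched below under that assumption, scales the filled region so that its mean luminance matches the surrounding background, which removes visible seams:

```python
import cv2
import numpy as np

def match_brightness(frame, finger_mask):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
    filled = finger_mask > 0
    if not filled.any() or filled.all():
        return frame
    # gain that brings the filled area to the surrounding mean luminance
    gain = gray[~filled].mean() / max(gray[filled].mean(), 1e-6)
    out = frame.astype(np.float32)
    out[filled] *= gain
    return np.clip(out, 0, 255).astype(np.uint8)
```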
S740: Use the output frame obtained in step S730 as the new background frame, updating the background frame.
An embodiment of the present invention provides a video communication method. Referring to FIG. 8, the method includes the following steps:
S800: Capture video frames in front of the display screen through the front camera.
In a specific implementation, this scheme can be applied to any electronic device with a video communication function. The electronic device may have a built-in or an external camera. Optionally, the camera is set below the display of the electronic device, or on an input device of the electronic device (for example a keyboard, mouse, or touchpad). Optionally, the electronic device is a notebook computer, and the camera is set on the notebook's keyboard or below its display.
In a specific implementation, suppose user A of the electronic device wants to start a video chat with another user B. User A opens the instant messaging application of the electronic device, opens the chat interface with user B, and clicks the "video call" button. After detecting this operation by user A, the electronic device establishes a video communication connection with user B's electronic device and turns on its camera to capture the video to be sent to user B; a video frame of the captured video is shown, for example, in FIG. 9A. Alternatively, the user opens the contacts interface of the electronic device 100, selects contact B, and clicks a video call button (for example, the device's native video calling function); after detecting this operation by user A, the electronic device establishes a video communication connection with user B's electronic device. Of course, a video communication connection with user B's electronic device can also be established in other ways, which the embodiments of the present invention neither enumerate in detail nor limit.
Video communication usually enables the front camera by default, but based on a selection or setting operation by user A, the electronic device may also enable the rear camera; the embodiments of the present invention place no limit on this.
Another embodiment of the present invention provides a video processing method. Referring to FIG. 17, it includes the following steps:
S810: After obtaining a video frame input by the front camera, determine whether finger typing is taking place.
In a specific implementation, the current keyboard signal is read through the keyboard signal reading apparatus of the electronic device, and the presence of an input signal is used to decide: if an input signal exists, it is determined that finger typing is taking place; if not, it is determined that no finger typing is taking place.
S820b: If no finger typing is taking place, update the background frame with the current video frame.
In the initial phase, if an image captured during the video chat does not contain a typing finger, it is taken as the background frame; if it does contain a typing finger, the background frame is left undetermined until an image without a typing finger is captured and taken as the background frame.
S820a: Determine through the semantic segmentation model whether a finger region mask exists. How this is determined has been described above and is not repeated here. The obtained finger region mask is shown, for example, at 90 in FIG. 18B. There is no fixed execution order between S820a and S810. Optionally, the processing of S820a is performed only after S810 has determined that finger typing exists, to reduce the data processing load of the electronic device.
S830: If finger typing is taking place and a finger region mask exists, denoise the finger region mask. How the denoising is done has been described above and is not repeated here. This step is optional.
S840: Remove the fingers in the finger region mask. How this is done has been described in the image processing method above and is not repeated here; a video frame after the fingers in the mask have been removed is shown, for example, in FIG. 18C. If step S830 is present, the fingers removed are those of the denoised finger region mask.
In addition to removing the fingers in the mask with the image processing method described above, the finger region can also be covered with another picture, or the masked finger region can be filled with the background region of the current image frame; the embodiments of the present invention neither enumerate these options in detail nor limit them.
S850: Obtain the video frame after finger removal as a new video frame and output it; this frame can be transmitted to user B's electronic device for display and can also be displayed on user A's electronic device.
S860: Update the background frame with the new video frame.
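A hypothetical glue loop for S810 to S860, composing the sketches given earlier (finger_mask, denoise_finger_mask, estimate_motion, remove_fingers); keyboard_active stands in for the device-specific keyboard signal:

```python
background = None  # updated in S820b / S860

def on_new_frame(frame, keyboard_active):
    global background
    mask = finger_mask(frame) if keyboard_active else None   # S810 / S820a
    if mask is not None:
        mask = denoise_finger_mask(mask)                     # S830 (optional)
    if mask is None or not mask.any():
        background = frame              # S820b: no typing finger, update
        return frame
    if background is None:
        return frame                    # no clean background frame yet
    H = estimate_motion(background, frame)                   # S700
    out = remove_fingers(frame, background, mask, H)         # S840
    background = out                    # S860: update with the new frame
    return out                          # S850: output/display this frame
```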
Based on the above scheme, the typing finger region can be segmented and recognized while the human body information is preserved, and the background and body regions occluded by the typing fingers can be completed, solving the technical problem of typing fingers appearing on screen like an "octopus" and greatly improving the user's video chat experience. Moreover, the scheme removes only the fingers the user is typing with, not the user's other fingers, which improves the accuracy of the judgment, reduces misjudgments, and makes the human-computer interaction more intelligent.
Yet another embodiment of the present invention provides a video communication method. Referring to FIG. 19, the method includes the following steps:
S1000: Capture the video in front of the display screen through the front camera. The capture process is similar to S800 and is not repeated here.
S1010a: After obtaining the video captured by the front camera, determine the region where the face is located.
The face region can be identified through face recognition technology. In an optional embodiment, an empirical area at the center of the screen of the electronic device can also be used as the empirical area of the face frame; alternatively, a large number of chat videos (or selfies) can be analyzed to find the positions of the face frames in them, and the aggregated results used to derive the empirical area of the face frame.
S1010b: Determine the finger region mask through the semantic segmentation model. How this is determined has been described above and is not repeated here. There is no fixed execution order between this step and S1010a.
S1020: Determine, from the face region and the finger region mask, whether the user in the current frame is typing.
In a specific implementation, it can be checked whether the connected regions of the finger region mask include a bottom connected region; if so, the state is judged to be typing. If not, it is checked whether a connected region overlaps the face region: if it overlaps, the state is judged to be non-typing; if it does not overlap, the state is judged to be typing.
In a specific implementation, the typing state may also be determined solely from whether the connected regions of the finger region mask include a bottom connected region: for example, if a bottom connected region exists, the state is typing; otherwise it is non-typing. In that case, step S1010a above is optional.
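The S1020 decision can be sketched as follows, assuming Python with OpenCV; face_rect as an (x, y, w, h) tuple is an assumed representation of the face region:

```python
import cv2
import numpy as np

def is_typing(mask: np.ndarray, face_rect) -> bool:
    h, w = mask.shape
    fx, fy, fw, fh = face_rect
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
    for i in range(1, n):
        x, y, bw, bh, area = stats[i]
        if y + bh >= h:                 # bottom-connected region: typing
            return True
        overlaps_face = not (x + bw < fx or fx + fw < x or
                             y + bh < fy or fy + fh < y)
        if not overlaps_face:           # no overlap with the face: typing
            return True
    return False                        # every region overlaps the face
```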
Referring to FIG. 20, step S1020 in FIG. 19 may include the following steps:
S1100: Apply erosion and dilation to the finger region mask to filter out noise holes of very small area, for example holes smaller than 1, 2, or 3 pixels, which are typically noise along the finger edges.
S1110: Find all connected regions of the binary image of the picture and preliminarily filter out connected regions whose area is below a preset threshold, for example 10 pixels or 20 pixels.
S1120: Perform a preliminary screening of the connected regions. Compute the center of each connected region and, from the relationship between that center and the center of the face frame, determine where the region lies, for example above, below, to the left of, or to the right of the face center. Screen the connected regions against the bounding-rectangle area thresholds of the different location areas, keeping regions whose bounding rectangle area exceeds the applicable preset threshold. For example, a region to the left or right of the face center may use a threshold of 150, 200, or 220 pixels; a region above the face center may use 80 or 100 pixels; a region below the face center may use 300 or 400 pixels. Typically the threshold for regions below the face center is larger than for regions to the left or right, and the threshold for regions to the left or right is larger than for regions above.
The above three steps are optional.
Then S1130 is executed, that is, determining from the filtered finger region mask and the face region whether the user in the current frame is typing; the judgment is similar to that described earlier and is not repeated here. If S1130 determines that the user is not typing, a non-typing status code can be returned; if it determines that the user is typing, a typing status code can be returned, or S1140 can be executed: a fine screening of the connected regions, which may include the following approaches:
Method 1: compute the aspect ratio of each connected region and filter out regions whose aspect ratio is below a threshold (for example 0.5, 0.7, or 0.8) as well as inverted-triangle regions, because in a normal typing posture the finger region mask has a relatively large aspect ratio and is generally not an inverted triangle.
Method 2: compute the ratio of each remaining connected region's area to the area of its bounding rectangle. If the ratio exceeds a set threshold (for example 0.5 or 0.6), return the typing-state information, conclude that the current user is typing, and take the finally screened finger region mask as the effective mask region, that is, the region where the actual fingers are. In this step, it is also possible to directly check whether the area of the connected region exceeds a preset area (for example 60,000 or 70,000 pixels); if so, return the typing-state information, conclude that the user is typing, and take the screened mask as the effective mask region.
If connected regions remain after the fine screening, a typing status code is returned, confirming that the user is currently typing; if no connected regions remain, a non-typing status code is returned, confirming that the user is currently not typing.
S1030a: When it is confirmed that no typing is taking place, update the background frame with the current frame.
S1030b: When it is determined that typing is taking place, remove or occlude the fingers of the finger region mask through image processing operations to obtain a new video frame; how this is done has been described above and is not repeated here.
S1040: Output the new video frame. It can be output to the electronic device of user B as a video frame of the video communication, and it can also (at the same time) be output to the electronic device of user A.
S1050: Update the background frame with the new video frame. The execution order of steps S1040 and S1050 is interchangeable.
In a specific implementation, the finger-typing judgments of S810 and S1020 can be used individually or in combination, or S1020 can be used when S810 cannot reach a decision.
Based on the above scheme, the typing finger region can be segmented and recognized while the human body information is preserved, and the background and body regions occluded by the typing fingers can be completed, solving the complaints and problems of typing fingers appearing on screen like an "octopus" and greatly improving the user's video chat experience. Moreover, the scheme removes only the fingers the user is typing with, not the user's other fingers, which improves the accuracy of the judgment, reduces misjudgments, and makes the human-computer interaction more intelligent.
Referring to FIG. 21, an embodiment of the present invention provides a video processing method including the following steps:
S2100: Obtain a video frame.
In a specific implementation, the electronic device may capture the video frame itself: for example, it may start video capture after detecting a video shooting operation by the user (for example, clicking a video shooting button, making a preset gesture, or issuing a voice command); it may also start capture when it detects that the user is in video communication with a peer electronic device and then send the captured video to the peer device. The electronic device may also obtain video frames from other electronic devices or from the network; the embodiments of the present invention neither enumerate these options in detail nor limit them.
If the video frame is captured by the electronic device, it may be a frame captured by the front camera, a frame captured by the rear camera, or a frame obtained by fusing frames captured by the front and rear cameras. It may also be a frame captured by another camera externally connected to the electronic device 100: for example, the electronic device 100 may establish connections with one or more of a drone, a television, or a desktop computer and capture video frames through those devices.
In a specific implementation, when the electronic device detects a video communication operation (or video shooting operation) by the user, it generates a video communication instruction and sends it to the processor; the processor responds to the instruction by starting the video communication software and sending commands to the camera driver to control the camera for video capture. The camera driver sends the captured data to the finger occlusion processing component, which performs the subsequent operations.
S2110: Obtain the region of the preset object in the video frame.
The preset object may be an object set by default by the system or an object designated by the user. For example, when the user captures video frames through the electronic device 100, holding the device or typing on it may cause the device to capture fingers blocking the camera lens or fingers in the middle of typing, which are images the user does not want captured. Alternatively, if the user notices a trash can, an ashtray, or another such object in the picture while shooting, the user can manually select those objects in the picture, designating them as preset objects. The above scheme can remove the preset objects from the video, making video chat better match the user's needs and also protecting the user's privacy.
In a specific implementation, the preset object is, for example, the user's hand, a finger, a trash can, or an ashtray. The region of the preset object can be determined automatically through the semantic segmentation model (how this is done is described in conjunction with FIG. 3 to FIG. 5); it can also be determined from a selection operation received from the user. For example, while shooting a video, the user taps the ashtray in the picture; after detecting the tap, the electronic device concludes that the user wants the ashtray recognized and uses an image recognition algorithm to identify the region of the object corresponding to the tap.
S2120: Determine the replacement content. The replacement content may be the content of the background region corresponding to the video frame (how the content of the background region is determined is described in conjunction with FIG. 7). It may also be content other than the background region, for example other images (such as emoticons or icons), or the content of the preset object's region after the video frame has been mosaicked, and so on.
S2120: Fill the region of the preset object with the replacement content to remove or replace the preset object. Part of the preset object's content may be removed, or all of it.
In a specific implementation, the region of the preset object can be filled using the image processing method described above; it can also be covered directly with other objects, for example covering the ashtray's region with an emoticon or mosaicking the ashtray's region.
For example, during video capture, when the region of the preset object is detected, that region can be filled directly with the background region. Alternatively, when the region of the preset object is detected, a preset icon (which may be a default icon or a randomly changing one) is overlaid on the preset object's region in the video, and when an editing operation by the user on that icon is detected, the icon is removed or replaced with another icon. Alternatively, an edit button is displayed on the video capture interface; after the user clicks it, various editing operations are displayed (for example filters, icons, or collages); after the user taps the icon option, various icons are displayed, and based on a specific user operation (for example dragging an icon onto the preset object), the icon is made to cover the surface of the preset object. As yet another example, various icons are displayed directly on the video capture interface, and an icon is made to cover the preset object's surface based on a specific user operation. A code sketch of the mosaic treatment follows.
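A possible mosaic treatment for the preset object's region, sketched in Python with OpenCV: shrink and re-enlarge the region so it pixelates. The block size of 16 is an arbitrary choice, not fixed by the patent:

```python
import cv2
import numpy as np

def mosaic_region(frame, mask, block=16):
    ys, xs = np.nonzero(mask)           # bounding box of the object's mask
    if len(xs) == 0:
        return frame
    x0, x1 = xs.min(), xs.max() + 1
    y0, y1 = ys.min(), ys.max() + 1
    roi = frame[y0:y1, x0:x1]
    h, w = roi.shape[:2]
    small = cv2.resize(roi, (max(1, w // block), max(1, h // block)),
                       interpolation=cv2.INTER_LINEAR)
    pixelated = cv2.resize(small, (w, h), interpolation=cv2.INTER_NEAREST)
    out = frame.copy()
    region = mask[y0:y1, x0:x1] > 0     # pixelate only the masked pixels
    out[y0:y1, x0:x1][region] = pixelated[region]
    return out
```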
S2130: Replace the content of the preset object's region with the determined replacement content and output the video frame after the replacement processing.
In a specific implementation, the replaced video frame can be transmitted to another electronic device for display there; it can also be displayed on the current electronic device for its user, and it can be stored on the electronic device.
Based on the above scheme, the typing finger region can be segmented and recognized while the human body information is preserved, and the background and body regions occluded by the typing fingers can be completed, solving the technical problem of typing fingers appearing on screen like an "octopus" and greatly improving the user's video chat experience.
Besides video capture, the above scheme can also be used for image capture: for example, after detecting the user's image capture operation, the preset object in the image is recognized and then removed, either with the image processing method described above or by covering it with another picture; the embodiments of the present invention neither enumerate these options in detail nor limit them.
The application of the solutions of the embodiments of the present invention is introduced below in combination with two application scenarios.
Application scenario one:
Referring to FIG. 22a, in the initial phase the display of the electronic device shows an instant chat interface 220 with another user, on which a video communication button 220a and a voice communication button 220b are displayed. The user of the electronic device is user A, and user B stands next to user A. User A's fingers rest on the keyboard, and user B clicks the video communication button 220a with the mouse (or user A issues a voice command).
After detecting the video communication operation, the electronic device jumps to the interface shown in FIG. 22b, which contains a video communication interface 221 comprising a video preview interface 221a and a video display interface 221b. The video preview interface 221a shows the video frames currently captured (or processed) by the electronic device (for example, user A's video frames), and the video display interface 221b shows the video frames of the user of the peer electronic device. At this point user A's fingers have been resting on the keyboard the whole time; since the electronic device has not detected a background frame free of fingers, the finger removal function is not triggered and the video frames containing the fingers are displayed.
Then, as shown in FIG. 22c, user A takes the fingers off the keyboard and rests them on the knees. The electronic device now detects no fingers, so the captured video frame contains none; the frame is output directly and is also taken as the background frame.
Then, as shown in FIG. 22d, the user puts the fingers back on the keyboard to type, and the captured video frames now contain the user's fingers. The fingers are removed from the video frames with the method described above, so the output frames contain no fingers. A prompt box 222 is displayed over the video frame, reading "The finger in the video chat has been removed, please confirm whether to continue removing it", with a confirmation button 222a and a cancel button 222b; user B clicks the cancel button 222b with the mouse (or user A issues a voice command).
Then, as shown in FIG. 22e, the position of user A's fingers has not changed relative to FIG. 22d, but the video frame shown in the video preview interface once again contains the user's fingers.
Application scenario two:
User A starts the video communication function with his hands resting on his knees, so the captured video frame contains no fingers, as shown in Figure 23a. In this case, the electronic device sets the video frame shown in Figure 23a as the background frame and outputs it, as shown in Figure 23b.
Subsequently, user A starts typing on the keyboard, and the video frame captured by the electronic device contains the user's fingers, as shown in Figure 23c. The electronic device determines that the captured video frame contains content matching the preset finger model and generates prompt information, for example text, voice, or an icon. As shown in Figure 23d, a prompt box 130 is displayed, reading "Fingers have been detected in the video frame; please confirm whether to remove them". The prompt box 130 also displays a confirm button 140 and a cancel button 150. User A wants the fingers removed and clicks the confirm button 140, and the electronic device detects this operation. The user keeps typing on the keyboard.
Subsequently, the electronic device captures another video frame containing the typing fingers, as shown in Figure 23e. Since the user has confirmed removal, the electronic device removes the fingers from the frame by the method described above and outputs a frame without fingers, as shown in Figure 23f. Prompt information may also be generated on this frame to inform the user that finger removal is active; it may disappear after a period of time (for example, 1 or 2 seconds) or remain displayed for as long as removal is active. A cancel button may also be generated, and the finger removal mode is exited in response to the user clicking it. As shown in Figure 23f, the prompt information and the cancel button are integrated into a single prompt button 230.
After some time, the user no longer wants the fingers removed and clicks the prompt button 230 in Figure 23f with the mouse. After the electronic device detects this click, when it again captures a video frame containing fingers, as in Figure 23g, it no longer needs to judge whether the frame contains content matching the preset finger model and directly outputs the frame containing the fingers, as shown in Figure 23h.
Based on the same inventive concept, embodiments of the present invention further provide a computer-readable storage medium including instructions which, when run on an electronic device, cause the electronic device to execute the method described in any embodiment of the present invention.
Based on the same inventive concept, embodiments of the present invention provide a computer program product including software code for executing the method described in any embodiment of the present invention.
Based on the same inventive concept, embodiments of the present invention provide a chip containing instructions which, when the chip runs on an electronic device, cause the electronic device to execute the method described in any embodiment of the present invention.
Based on the same inventive concept, embodiments of the present invention provide an electronic device comprising a keyboard and a camera arranged near the keyboard. The electronic device further comprises: a first capture module, configured to capture a first video frame through the camera; a first determining module, configured to, upon determining that the first video frame contains content matching a preset finger model, remove the fingers from the first video frame to obtain a second video frame; and a display module, configured to display the second video frame and/or send the second video frame to a peer electronic device for display.
In an optional implementation, the first determining module comprises: a first determining unit, configured to remove the fingers from the first video frame upon determining that the first video frame contains content matching the preset finger model and that the fingers are located in the bottom area of the first video frame; a second determining unit, configured to remove the fingers from the first video frame upon determining that the first video frame contains content matching the preset finger model and that the finger area does not overlap the position of the face; and a third determining unit, configured to remove the fingers from the first video frame upon determining that the first video frame contains content matching the preset finger model and that the fingers are located in the bottom area of the first video frame and connected to a side of the first video frame.
In an optional implementation, the first determining module comprises: an obtaining unit, configured to obtain a keyboard input signal; and a fourth determining unit, configured to remove the fingers from the first video frame upon determining that the first video frame contains content matching the preset finger model and that the time of the obtained keyboard input signal and the time of obtaining the first video frame satisfy a preset time threshold.
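To make the timing condition concrete, the following is a minimal sketch; the function name, timestamp representation, and threshold value are illustrative assumptions rather than part of the embodiment:

```python
# Hypothetical sketch of the keyboard-signal timing gate.
PRESET_TIME_THRESHOLD_S = 0.5  # assumed value; the embodiment does not fix one

def keyboard_gate(frame_time_s: float, last_key_time_s: float) -> bool:
    """True when the keyboard input signal and the first video frame were
    obtained close enough in time to satisfy the preset time threshold."""
    return abs(frame_time_s - last_key_time_s) <= PRESET_TIME_THRESHOLD_S
```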
In an optional implementation, the first determining module is configured to: replace the content of the finger area in the first video frame with replacement content to obtain the second video frame; or crop the first video frame to obtain the second video frame without the finger area; or fill the finger area with pixels from the area neighboring the finger area to obtain the second video frame.
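As an illustration of the third option (filling the finger area from its neighboring pixels), classical image inpainting is one plausible realization; the sketch below uses OpenCV's Telea inpainting as a stand-in, since the embodiment does not prescribe a specific fill algorithm:

```python
import cv2
import numpy as np

def fill_finger_area(frame: np.ndarray, finger_mask: np.ndarray) -> np.ndarray:
    """Fill the finger area from neighboring pixels (hedged stand-in).

    frame:       H x W x 3 BGR video frame.
    finger_mask: H x W uint8 mask, 255 where fingers were detected.
    """
    # cv2.inpaint propagates the surrounding background into the masked
    # region; the 3-pixel radius is an assumed, not prescribed, parameter.
    return cv2.inpaint(frame, finger_mask, 3, cv2.INPAINT_TELEA)
```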
In an optional implementation, the electronic device further comprises: a second capture module, configured to capture a third video frame before the first video frame is captured; a second determining module, configured to use the third video frame as a background frame upon determining that the third video frame contains no content matching the preset finger model; and a third determining module, configured to determine, in the background frame, the content corresponding to the finger area of the first video frame as the replacement content.
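When a clean background frame is available, the replacement reduces to a masked copy, sketched below under the simplifying assumption that the background frame and the current frame are already aligned (camera motion, handled by the motion compensation recited further below, is ignored here):

```python
import numpy as np

def replace_from_background(frame: np.ndarray,
                            background: np.ndarray,
                            finger_mask: np.ndarray) -> np.ndarray:
    """Replace the finger area of `frame` with the corresponding area of a
    previously stored background frame that contains no fingers."""
    out = frame.copy()
    region = finger_mask.astype(bool)   # True where fingers were detected
    out[region] = background[region]    # masked copy from the background
    return out
```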
In an optional implementation, the first determining module is configured to: upon determining that the first video frame contains content matching the preset finger model and that the fingers are not in an abnormal state, remove the fingers from the first video frame to obtain the second video frame. The abnormal state corresponds to at least one of the following situations: both of the user's hands are located at the bottom of the first video frame and the distance between the two hands is greater than a first preset distance; one of the user's hands is located in the bottom area of the first video frame and the distance between the other hand and the bottom area is greater than a preset distance; the area of the user's fingers in the first video frame is greater than a preset area threshold; the user's fingers in the first video frame occlude the face.
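The four abnormal-state conditions can be expressed as simple geometric checks. The sketch below assumes hand centroids and a "bottom area" cutoff at 80% of the frame height; neither the representation nor the cutoff is fixed by the embodiment:

```python
def is_abnormal(hands, frame_h, finger_area, overlaps_face,
                d1, d2, area_thresh):
    """Heuristic abnormal-state check; all thresholds are assumed inputs.

    hands: list of (cx, cy) hand-region centroids in pixel coordinates.
    """
    bottom = frame_h * 0.8                      # assumed bottom-area cutoff
    if len(hands) == 2:
        (x0, y0), (x1, y1) = hands
        # Case 1: both hands in the bottom area but far apart.
        if y0 > bottom and y1 > bottom and abs(x0 - x1) > d1:
            return True
        # Case 2: one hand in the bottom area, the other far from it.
        if (y0 > bottom) != (y1 > bottom):
            if bottom - min(y0, y1) > d2:
                return True
    # Case 3: finger area too large.  Case 4: fingers occlude the face.
    return finger_area > area_thresh or overlaps_face
```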
In an optional implementation, the electronic device further comprises: a fourth determining module, configured to send the first video frame to the display for display upon determining that the first video frame contains content matching the preset finger model and that the fingers are in an abnormal state.
In an optional implementation, the first determining module is configured to: input the first video frame into a semantic segmentation model, determine the finger area in the first video frame through the semantic segmentation model, and, when the finger area exists, determine that the first video frame contains content matching the preset finger model. The semantic segmentation model is obtained by training on sample photos, each of which contains a photo of the user's fingers and is labeled with the finger area.
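The embodiment does not fix a network architecture for the semantic segmentation model. The sketch below assumes a PyTorch-style model that outputs one per-pixel finger logit, and shows how it would be turned into the binary finger mask used downstream:

```python
import numpy as np
import torch

def finger_mask_from_model(model: torch.nn.Module,
                           frame_bgr: np.ndarray,
                           threshold: float = 0.5) -> np.ndarray:
    """Run a (hypothetical) segmentation model on one frame and return a
    uint8 finger mask, 255 where the model predicts finger pixels."""
    x = torch.from_numpy(frame_bgr[:, :, ::-1].copy())   # BGR -> RGB
    x = x.permute(2, 0, 1).float().unsqueeze(0) / 255.0  # 1 x 3 x H x W
    with torch.no_grad():
        prob = torch.sigmoid(model(x))[0, 0]             # H x W probability
    return (prob.numpy() > threshold).astype(np.uint8) * 255
```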
Based on the same inventive concept, embodiments of the present invention provide an electronic device comprising: an obtaining module, configured to obtain a first video frame and to obtain a keyboard input signal; a fifth determining module, configured to, upon determining that the first video frame contains content matching a preset finger model and that the time of the obtained keyboard input signal and the time of obtaining the first video frame satisfy a preset time threshold, remove the fingers from the first video frame to obtain a second video frame; and an indicating module, configured to indicate display of the second video frame.
In an optional implementation, the fifth determining module is configured to: replace the content of the finger area in the first video frame with replacement content to obtain the second video frame; or crop the first video frame to obtain the second video frame without the finger area; or fill the finger area with pixels from the area neighboring the finger area to obtain the second video frame.
In an optional implementation, the electronic device further comprises: a third capture module, configured to capture a third video frame before the first video frame is captured; a sixth determining module, configured to use the third video frame as a background frame upon determining that the third video frame contains no content matching the preset finger model; and a seventh determining module, configured to determine, in the background frame, the content corresponding to the finger area of the first video frame as the replacement content.
In an optional implementation, the fifth determining module is configured to: upon determining that the first video frame contains content matching the preset finger model, that the fingers are not in an abnormal state, and that the time of the obtained keyboard input signal and the time of obtaining the first video frame satisfy the preset time threshold, remove the fingers from the first video frame to obtain the second video frame. The abnormal state corresponds to at least one of the following situations: both of the user's hands are located at the bottom of the first video frame and the distance between the two hands is greater than a first preset distance; one of the user's hands is located in the bottom area of the first video frame and the distance between the other hand and the bottom area is greater than a preset distance; the area of the user's fingers in the first video frame is greater than a preset area threshold; the user's fingers in the first video frame occlude the face.
In an optional implementation, the electronic device further comprises: an eighth determining module, configured to indicate display of the first video frame upon determining that the first video frame contains content matching the preset finger model and that the fingers are in an abnormal state.
In an optional implementation, the fifth determining module is configured to: input the first video frame into a semantic segmentation model, determine the finger area in the first video frame through the semantic segmentation model, and, when the finger area exists, determine that the first video frame contains content matching the preset finger model. The semantic segmentation model is obtained by training on sample photos, each of which contains a photo of the user's fingers and is labeled with the finger area.
For other content, refer to the descriptions of the related content above; details are not repeated here.
It can be understood that, in order to realize the above functions, the electronic devices described above include hardware structures and/or software modules corresponding to each function. Those skilled in the art will readily appreciate that, in combination with the units and algorithm steps of the examples described in the embodiments disclosed herein, the embodiments of the present application can be implemented in hardware or in a combination of hardware and computer software. Whether a given function is executed by hardware or by computer software driving hardware depends on the specific application and design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each specific application, but such implementations should not be considered to go beyond the scope of the embodiments of the present invention.
The embodiments of the present application may divide the above electronic devices and the like into functional modules according to the above method examples. For example, each functional module may be divided to correspond to one function, or two or more functions may be integrated into one processing module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. It should be noted that the division of modules in the embodiments of the present invention is illustrative and is merely a logical functional division; other division methods are possible in actual implementation. The foregoing takes the division of functional modules corresponding to individual functions as an example.
The methods provided in the embodiments of the present application may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented by software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, a network device, an electronic device, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (for example, infrared, radio, or microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center integrating one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a digital video disc (DVD)), or a semiconductor medium (for example, an SSD).
A person of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in combination with the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are executed by hardware or software depends on the specific application and design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each specific application, but such implementations should not be considered to go beyond the scope of this application.
Those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, devices, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative; the division of units is only a logical functional division, and other divisions are possible in actual implementation, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
The above are only specific implementations of the present application, but the protection scope of the embodiments of the present application is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive of within the technical scope disclosed in the embodiments of the present application shall fall within the protection scope of the embodiments of the present application. Therefore, the protection scope of the embodiments of the present application shall be subject to the protection scope of the claims.
Claims (33)
- An electronic device, characterized by comprising:
a display, a keyboard, a camera, and a processor;
the camera being arranged near the keyboard and configured to capture video frames during video communication and send the captured video frames to the processor;
the processor being connected to the display, the keyboard, and the camera and configured to receive a first video frame from the camera and, upon determining that the first video frame contains content matching a preset finger model, remove the fingers from the first video frame to obtain a second video frame, and send the second video frame to the display for display and/or to a peer electronic device for display.
- The electronic device according to claim 1, characterized in that the removing the fingers from the first video frame upon determining that the first video frame contains content matching the preset finger model comprises:
determining that the first video frame contains content matching the preset finger model and that the fingers are located in the bottom area of the first video frame, and removing the fingers from the first video frame; or
determining that the first video frame contains content matching the preset finger model and that the finger area does not overlap the position of the face, and removing the fingers from the first video frame; or
determining that the first video frame contains content matching the preset finger model and that the fingers are located in the bottom area of the first video frame and connected to a side of the first video frame, and removing the fingers from the first video frame.
- The electronic device according to claim 1, characterized in that the determining that the first video frame contains content matching the preset finger model and then removing the fingers from the first video frame comprises:
obtaining a keyboard input signal; and determining that the first video frame contains content matching the preset finger model and that the time of the obtained keyboard input signal and the time of obtaining the first video frame satisfy a preset time threshold, and removing the fingers from the first video frame.
- The electronic device according to any one of claims 1-3, characterized in that the removing the fingers from the first video frame to obtain a second video frame comprises:
replacing the content of the finger area in the first video frame with replacement content to obtain the second video frame; or
cropping the first video frame to obtain the second video frame without the finger area; or
filling the finger area with pixels from the area neighboring the finger area to obtain the second video frame.
- The electronic device according to claim 4, characterized in that the processor is further configured to: obtain a third video frame, the third video frame being a video frame captured before the first video frame; upon determining that the third video frame contains no content matching the preset finger model, use the third video frame as a background frame; and determine, in the background frame, the content corresponding to the finger area of the first video frame as the replacement content.
- The electronic device according to any one of claims 1-5, characterized in that the determining that the first video frame contains content matching the preset finger model and then removing the fingers from the first video frame to obtain a second video frame comprises:
upon determining that the first video frame contains content matching the preset finger model and that the fingers are not in an abnormal state, removing the fingers from the first video frame to obtain the second video frame;
the abnormal state corresponding to at least one of the following situations:
both of the user's hands are located at the bottom of the first video frame, and the distance between the user's two hands is greater than a first preset distance;
one of the user's hands is located in the bottom area of the first video frame, and the distance between the other hand and the bottom area is greater than a preset distance;
the area of the user's fingers in the first video frame is greater than a preset area threshold;
the user's fingers in the first video frame occlude the face.
- The electronic device according to claim 6, characterized in that:
the processor is further configured to: upon determining that the first video frame contains content matching the preset finger model and that the fingers are in an abnormal state, send the first video frame to the display for display.
- The electronic device according to any one of claims 1-7, characterized in that the determining that the first video frame contains content matching the preset finger model comprises:
inputting the first video frame into a semantic segmentation model and determining the finger area in the first video frame through the semantic segmentation model; when the finger area exists, it is determined that the first video frame contains content matching the preset finger model; the semantic segmentation model is obtained by training on sample photos, each of which contains a photo of the user's fingers and is labeled with the finger area.
- The electronic device according to any one of claims 1-7, characterized in that:
the processor is further configured to, in response to a user's operation of starting video communication, control the display to display a video communication interface, the video communication interface comprising a video preview window and a video receiving window, display the second video frame in the video preview window, and display, in the video receiving window, video frames received from the peer electronic device;
the processor is further configured to: input the first video frame into a semantic segmentation model to obtain a finger area mask; perform noise reduction on the finger area mask to obtain the finger area; judge whether the finger area contains a bottom connected region that adjoins a side of the first video frame and, if the finger area contains a bottom connected region that adjoins a side of the first video frame, remove the fingers from the first video frame; the noise reduction on the finger area mask comprising at least one of the following: performing erosion and dilation operations on the binary image of the finger area mask according to the obtained finger area mask and the area where the face is located; finding all connected regions of the binary image of the finger area mask and filtering out connected regions whose area is smaller than a preset threshold; filtering out connected regions of the finger area mask whose circumscribed rectangle is smaller than or equal to a preset area threshold, the corresponding preset area threshold differing according to the region in which the connected region is located; filtering out, among the connected regions of the finger area mask, regions that do not belong to the bottom connected region; filtering out connected regions of the finger area mask that overlap the area where the face is located; filtering out connected regions of the finger area mask whose aspect ratio is smaller than a second preset threshold; computing the area ratio of each connected region to its circumscribed rectangle and filtering out connected regions whose area ratio is smaller than or equal to a third preset threshold;
the removing the fingers from the first video frame to obtain a second video frame comprising: judging whether the similarity value between a background frame and the first video frame is greater than a preset similarity value and, if so, performing motion compensation on the background frame to obtain a motion-compensated background frame, the background frame being a video frame that does not contain the fingers and was captured before the first video frame; determining the finger area of the motion-compensated background frame as replacement content; replacing the finger area in the first video frame with the replacement content; and rendering the filled-in replacement content and the surrounding background area with an ambient light rendering method to obtain the second video frame; the performing motion compensation on the background frame comprising: performing motion offset estimation based on the background frame and the current frame to obtain a motion offset matrix of the background frame relative to the current frame, the obtaining of the motion offset matrix comprising: detecting feature points of the first video frame and the background frame, matching the feature points of the first video frame and the background frame to find paired feature points, and computing a perspective transformation matrix from the paired feature points, the perspective transformation matrix being the motion offset matrix characterizing the motion of the background frame relative to the current frame; and performing motion compensation on the background frame according to the motion offset matrix to obtain the motion-compensated background frame;
the processor being further configured to, after the second video frame is obtained, use the second video frame as the background frame;
the processor being further configured to, after controlling the display to display the second video frame, upon determining that a received fourth video frame contains content matching the preset finger model and that the fingers are in an abnormal state, determine at least one transition frame according to the finger area of the second video frame and the finger area of the fourth video frame, control the display to display the at least one transition frame, and control the display to display the fourth video frame after the at least one transition frame is displayed.
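A sketch of the motion-offset estimation recited in the preceding claim, using ORB features, brute-force matching, and a RANSAC homography as stand-ins; the claim requires only feature matching and a perspective transformation matrix and does not name these particular algorithms:

```python
import cv2
import numpy as np

def motion_compensate(background: np.ndarray, current: np.ndarray) -> np.ndarray:
    """Warp the background frame toward the current frame using a perspective
    transform estimated from matched feature points (assumed choices)."""
    g_bg = cv2.cvtColor(background, cv2.COLOR_BGR2GRAY)
    g_cur = cv2.cvtColor(current, cv2.COLOR_BGR2GRAY)
    orb = cv2.ORB_create(nfeatures=1000)
    k1, d1 = orb.detectAndCompute(g_bg, None)
    k2, d2 = orb.detectAndCompute(g_cur, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(d1, d2), key=lambda m: m.distance)[:200]
    src = np.float32([k1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([k2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    # The 3x3 homography plays the role of the "motion offset matrix" of the
    # background frame relative to the current frame.
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    h, w = current.shape[:2]
    return cv2.warpPerspective(background, H, (w, h))
```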
- A video capture control method, characterized in that it is applied to an electronic device, the electronic device comprising a keyboard and a camera arranged near the keyboard, and the method comprising:
capturing a first video frame through the camera;
upon determining that the first video frame contains content matching a preset finger model, removing the fingers from the first video frame to obtain a second video frame;
displaying the second video frame, and/or sending the second video frame to a peer electronic device for display.
- The method according to claim 10, characterized in that the removing the fingers from the first video frame upon determining that the first video frame contains content matching the preset finger model comprises:
determining that the first video frame contains content matching the preset finger model and that the fingers are located in the bottom area of the first video frame, and removing the fingers from the first video frame; or
determining that the first video frame contains content matching the preset finger model and that the finger area does not overlap the position of the face, and removing the fingers from the first video frame; or
determining that the first video frame contains content matching the preset finger model and that the fingers are located in the bottom area of the first video frame and connected to a side of the first video frame, and removing the fingers from the first video frame.
- The method according to claim 10, characterized in that the determining that the first video frame contains content matching the preset finger model and then removing the fingers from the first video frame comprises:
obtaining a keyboard input signal; and determining that the first video frame contains content matching the preset finger model and that the time of the obtained keyboard input signal and the time of obtaining the first video frame satisfy a preset time threshold, and removing the fingers from the first video frame.
- The method according to any one of claims 10-12, characterized in that the removing the fingers from the first video frame to obtain a second video frame comprises:
replacing the content of the finger area in the first video frame with replacement content to obtain the second video frame; or
cropping the first video frame to obtain the second video frame without the finger area; or
filling the finger area with pixels from the area neighboring the finger area to obtain the second video frame.
- The method according to claim 13, characterized in that the method further comprises:
capturing a third video frame before the first video frame is captured;
upon determining that the third video frame contains no content matching the preset finger model, using the third video frame as a background frame;
and, before the replacing the content of the finger area in the first video frame with replacement content to obtain the second video frame, the method further comprises: determining, in the background frame, the content corresponding to the finger area of the first video frame as the replacement content.
- The method according to any one of claims 10-14, characterized in that the determining that the first video frame contains content matching the preset finger model and then removing the fingers from the first video frame to obtain a second video frame comprises:
upon determining that the first video frame contains content matching the preset finger model and that the fingers are not in an abnormal state, removing the fingers from the first video frame to obtain the second video frame;
the abnormal state corresponding to at least one of the following situations:
both of the user's hands are located at the bottom of the first video frame, and the distance between the user's two hands is greater than a first preset distance;
one of the user's hands is located in the bottom area of the first video frame, and the distance between the other hand and the bottom area is greater than a preset distance;
the area of the user's fingers in the first video frame is greater than a preset area threshold;
the user's fingers in the first video frame occlude the face.
- The method according to claim 15, characterized in that the method further comprises:
upon determining that the first video frame contains content matching the preset finger model and that the fingers are in an abnormal state, sending the first video frame to the display for display.
- The method according to any one of claims 10-16, characterized in that the determining that the first video frame contains content matching the preset finger model comprises:
inputting the first video frame into a semantic segmentation model and determining the finger area in the first video frame through the semantic segmentation model; when the finger area exists, it is determined that the first video frame contains content matching the preset finger model; the semantic segmentation model is obtained by training on sample photos, each of which contains a photo of the user's fingers and is labeled with the finger area.
- The method according to any one of claims 10-16, characterized in that the method further comprises:
in response to a user's operation of starting video communication, displaying a video communication interface, the video communication interface comprising a video preview window and a video receiving window, the video preview window being used to display video frames generated at the local end and the video receiving window being used to display video frames received from a peer electronic device;
the determining that the first video frame contains content matching the preset finger model and then removing the fingers from the first video frame to obtain a second video frame comprising:
inputting the first video frame into a semantic segmentation model to obtain a finger area mask; performing noise reduction on the finger area mask to obtain the finger area; judging whether the finger area contains a bottom connected region that adjoins a side of the first video frame and, if the finger area contains a bottom connected region that adjoins a side of the first video frame, removing the fingers from the first video frame; the noise reduction on the finger area mask comprising at least one of the following: performing erosion and dilation operations on the binary image of the finger area mask according to the obtained finger area mask and the area where the face is located; finding all connected regions of the binary image of the finger area mask and filtering out connected regions whose area is smaller than a preset threshold; filtering out connected regions of the finger area mask whose circumscribed rectangle is smaller than or equal to a preset area threshold, the corresponding preset area threshold differing according to the region in which the connected region is located; filtering out, among the connected regions of the finger area mask, regions that do not belong to the bottom connected region; filtering out connected regions of the finger area mask that overlap the area where the face is located; filtering out connected regions of the finger area mask whose aspect ratio is smaller than a second preset threshold; computing the area ratio of each connected region to its circumscribed rectangle and filtering out connected regions whose area ratio is smaller than or equal to a third preset threshold;
judging whether the similarity value between a background frame and the first video frame is greater than a preset similarity value and, if so, performing motion compensation on the background frame to obtain a motion-compensated background frame, the background frame being a video frame that does not contain the fingers and was captured before the first video frame; determining the finger area of the motion-compensated background frame as replacement content; replacing the finger area in the first video frame with the replacement content; and rendering the filled-in replacement content and the surrounding background area with an ambient light rendering method to obtain the second video frame; the performing motion compensation on the background frame comprising: performing motion offset estimation based on the background frame and the current frame to obtain a motion offset matrix of the background frame relative to the current frame, the obtaining of the motion offset matrix comprising: detecting feature points of the first video frame and the background frame, matching the feature points of the first video frame and the background frame to find paired feature points, and computing a perspective transformation matrix from the paired feature points, the perspective transformation matrix being the motion offset matrix characterizing the motion of the background frame relative to the current frame; and performing motion compensation on the background frame according to the motion offset matrix to obtain the motion-compensated background frame;
the method further comprising: after the second video frame is obtained, using the second video frame as the background frame;
the method further comprising: after controlling the display to display the second video frame, upon determining that a captured fourth video frame contains content matching the preset finger model and that the fingers are in an abnormal state, determining at least one transition frame according to the finger area of the second video frame and the finger area of the fourth video frame; displaying the at least one transition frame; and, after the at least one transition frame is displayed, displaying the fourth video frame in the video preview window.
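One simple way to realize the transition frames recited in the preceding claim is a linear cross-fade from the finger-removed frame to the raw frame, so the fingers reappear gradually. This interpolation strategy is an assumption, since the claim only requires that the transition frames be derived from the two finger areas:

```python
import numpy as np

def make_transition_frames(clean: np.ndarray, raw: np.ndarray,
                           steps: int = 4) -> list:
    """Blend from the finger-removed frame to the raw frame (assumed
    linear cross-fade; `steps` controls how many transition frames)."""
    frames = []
    for i in range(1, steps + 1):
        a = i / (steps + 1)                    # weight of the raw frame
        blend = (1 - a) * clean.astype(np.float32) + a * raw.astype(np.float32)
        frames.append(blend.astype(np.uint8))
    return frames
```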
- An electronic device, characterized by comprising:
one or more processors;
one or more memories;
multiple application programs;
and one or more computer programs, wherein the one or more computer programs are stored in the one or more memories and comprise instructions which, when executed by the one or more processors of the electronic device, cause the electronic device to execute the following steps:
obtaining a first video frame and obtaining a keyboard input signal; upon determining that the first video frame contains content matching a preset finger model and that the time of the obtained keyboard input signal and the time of obtaining the first video frame satisfy a preset time threshold, removing the fingers from the first video frame to obtain a second video frame; and indicating display of the second video frame.
- The electronic device according to claim 19, characterized in that the removing the fingers from the first video frame to obtain a second video frame comprises:
replacing the content of the finger area in the first video frame with replacement content to obtain the second video frame; or
cropping the first video frame to obtain the second video frame without the finger area; or
filling the finger area with pixels from the area neighboring the finger area to obtain the second video frame.
- The electronic device according to claim 20, characterized in that, when the instructions are executed by the electronic device, the electronic device further executes the following steps:
capturing a third video frame before the first video frame is captured;
upon determining that the third video frame contains no content matching the preset finger model, using the third video frame as a background frame;
determining, in the background frame, the content corresponding to the finger area of the first video frame as the replacement content.
- The electronic device according to claim 19, characterized in that the determining that the first video frame contains content matching the preset finger model and that the time of the obtained keyboard input signal and the time of obtaining the first video frame satisfy a preset time threshold, and then removing the fingers from the first video frame to obtain a second video frame, comprises:
upon determining that the first video frame contains content matching the preset finger model, that the fingers are not in an abnormal state, and that the time of the obtained keyboard input signal and the time of obtaining the first video frame satisfy the preset time threshold, removing the fingers from the first video frame to obtain the second video frame;
the abnormal state corresponding to at least one of the following situations:
both of the user's hands are located at the bottom of the first video frame, and the distance between the user's two hands is greater than a first preset distance;
one of the user's hands is located in the bottom area of the first video frame, and the distance between the other hand and the bottom area is greater than a preset distance;
the area of the user's fingers in the first video frame is greater than a preset area threshold;
the user's fingers in the first video frame occlude the face.
- The electronic device according to claim 22, characterized in that, when the instructions are executed by the electronic device, the electronic device further executes the following step:
upon determining that the first video frame contains content matching the preset finger model and that the fingers are in an abnormal state, indicating display of the first video frame.
- The electronic device according to any one of claims 19-23, characterized in that the determining that the first video frame contains content matching the preset finger model comprises:
inputting the first video frame into a semantic segmentation model and determining the finger area in the first video frame through the semantic segmentation model; when the finger area exists, it is determined that the first video frame contains content matching the preset finger model; the semantic segmentation model is obtained by training on sample photos, each of which contains a photo of the user's fingers and is labeled with the finger area.
- 一种视频通信控制方法,其特征在于,包括:A video communication control method, characterized in that it comprises:获得第一视频帧,且获得键盘输入信号;Obtain the first video frame, and obtain the keyboard input signal;确定所述第一视频帧中包含符合预设手指模型的内容,且获得的键盘输入信号的时间与获得所述第一视频帧的时间满足预设时间阈值,则去除所述第一视频帧中的手指,获得第二视频帧;It is determined that the first video frame contains content that conforms to the preset finger model, and the time when the keyboard input signal is obtained and the time when the first video frame is obtained meet the preset time threshold, then the first video frame is removed Finger to obtain the second video frame;指示显示所述第二视频帧。Instruct to display the second video frame.
- 如权利要求25所述的方法,其特征在于,所述去除所述第一视频帧中的手指,获得第二视频帧,包括:The method of claim 25, wherein the removing a finger in the first video frame to obtain a second video frame comprises:以替换内容替换所述第一视频帧中的手指区域的内容,获得所述第二视频帧;或者Replacing the content of the finger area in the first video frame with replacement content to obtain the second video frame; or对所述第一视频帧进行裁剪以获得不包含所述手指区域的所述第二视频帧;或者,Crop the first video frame to obtain the second video frame that does not include the finger area; or,通过所述手指区域的邻近区域的像素对所述手指区域进行填充,获得所述第二视频帧。Filling the finger area with pixels in the vicinity of the finger area to obtain the second video frame.
- 如权利要求26所述的方法,其特征在于,所述方法还包括:The method of claim 26, wherein the method further comprises:在采集获得所述第一视频帧之前,采集获得第三视频帧;Before acquiring the first video frame, acquiring a third video frame;确定所述第三视频帧中不包含符合预设手指模型的内容,则将所述第三视频帧作为背景帧;Determining that the third video frame does not contain content that meets the preset finger model, then using the third video frame as a background frame;在所述以替换内容替换所述第一视频帧中的手指区域的内容,获得所述第二视频帧之前,所述方法还包括:在所述背景帧中,确定出和所述第一视频帧中的手指区域对应的内容作为所述替换内容。Before the replacing the content of the finger area in the first video frame with the replacement content to obtain the second video frame, the method further includes: determining the connection with the first video in the background frame The content corresponding to the finger area in the frame is used as the replacement content.
- 如权利要求25所述的方法,其特征在于,所述确定所述第一视频帧中包含符合预设手指模型的内容,且获得的键盘输入信号的时间与获得所述第一视频帧的时间满足预设时间阈值,则去除所述第一视频帧中的手指,获得第二视频帧,包括:25. The method of claim 25, wherein the determining that the first video frame contains content conforming to a preset finger model, and the time when the keyboard input signal is obtained is the same as the time when the first video frame is obtained When the preset time threshold is met, removing the finger in the first video frame to obtain the second video frame includes:确定所述第一视频帧中包含符合预设手指模型的内容,且所述手指不处于异 常状态,且且获得的键盘输入信号的时间与获得所述第一视频帧的时间满足预设时间阈值,则去除所述第一视频帧中的手指,获得所述第二视频帧;It is determined that the first video frame contains content that conforms to the preset finger model, the finger is not in an abnormal state, and the time of obtaining the keyboard input signal and the time of obtaining the first video frame meet the preset time threshold , Remove the finger in the first video frame to obtain the second video frame;所述异常状态对应以下至少一种情况:The abnormal state corresponds to at least one of the following conditions:所述第一视频帧中用户的两只手位于所述第一视频帧的底部,且用户的两只手的距离大于第一预设距离;The two hands of the user in the first video frame are located at the bottom of the first video frame, and the distance between the two hands of the user is greater than the first preset distance;所述第一视频帧中用户一只手位于底部区域,另一只手与所述底部区域距离大于预设距离;In the first video frame, one hand of the user is located at the bottom area, and the distance between the other hand and the bottom area is greater than a preset distance;所述第一视频帧中用户的手指的面积大于预设面积阈值;The area of the user's finger in the first video frame is greater than a preset area threshold;所述第一视频帧中用户的手指遮挡住脸部。The user's finger in the first video frame covers the face.
- The method according to claim 28, further comprising: when it is determined that the first video frame contains content conforming to the preset finger model and that the finger is in an abnormal state, instructing display of the first video frame.
- The method according to any one of claims 25-29, wherein the determining that the first video frame contains content conforming to the preset finger model comprises: inputting the first video frame into a semantic segmentation model, and determining the finger area in the first video frame through the semantic segmentation model; when the finger area exists, determining that the first video frame contains content conforming to the preset finger model; wherein the semantic segmentation model is obtained by training on sample photos, each sample photo containing the user's finger and having its finger area labeled. (An inference sketch follows the claims.)
- A computer-readable storage medium comprising instructions, wherein when the instructions are run on an electronic device, the electronic device is caused to execute the method according to any one of claims 16-30.
- A computer program product, wherein the computer program product comprises software code for executing the method according to any one of claims 16-30.
- A chip containing instructions, wherein when the chip runs on an electronic device, the electronic device is caused to execute the method according to any one of claims 16-30.
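The gating behaviour recited in claim 25 can be illustrated with a short sketch. This is a minimal illustration under assumptions, not the claimed implementation: the helper functions `detect_finger_mask` and `remove_finger`, and the 0.5 s threshold, are hypothetical stand-ins for the preset finger model, the removal step, and the preset time threshold.

```python
import numpy as np

TIME_THRESHOLD_S = 0.5  # assumed value for the "preset time threshold"

def process_frame(frame: np.ndarray, frame_ts: float, last_key_ts: float,
                  detect_finger_mask, remove_finger) -> np.ndarray:
    """Return the frame to display: the second video frame with the
    typing fingers removed, or the first video frame unchanged."""
    mask = detect_finger_mask(frame)  # None if no finger-like content found
    recently_typed = abs(frame_ts - last_key_ts) <= TIME_THRESHOLD_S
    if mask is not None and recently_typed:
        return remove_finger(frame, mask)  # the "second video frame"
    return frame  # no finger, or no recent keystroke: display as captured
```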
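Claims 26 and 27 list three ways of obtaining the second video frame: substitute the finger area with content from a previously captured background frame, crop the finger area away, or fill it from neighbouring pixels. A sketch using OpenCV; the bottom-entry assumption in the crop variant and the inpainting radius are choices made for the example, not taken from the source.

```python
import cv2
import numpy as np

def remove_by_background(frame, mask, background):
    # Claim 27: the background frame was captured earlier and contains no
    # finger, so its pixels can stand in for the occluded finger area.
    out = frame.copy()
    out[mask > 0] = background[mask > 0]
    return out

def remove_by_crop(frame, mask):
    # Claim 26, second option: crop so the finger area is excluded
    # (assuming fingers enter from the bottom edge of the frame).
    top = int(np.where(mask > 0)[0].min())
    return frame[:top, :]

def remove_by_inpaint(frame, mask):
    # Claim 26, third option: fill the finger area from adjacent pixels.
    return cv2.inpaint(frame, mask.astype(np.uint8), 5, cv2.INPAINT_TELEA)
```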
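The abnormal-state screening in claim 28 reduces to a few geometric tests on the detected hand and finger regions. In this sketch the hands are represented by normalized centre coordinates, and every threshold is an assumed value; the claims leave the concrete distances and areas to the implementation.

```python
def is_abnormal(hands, finger_area_frac, finger_covers_face,
                bottom_band=0.15, far_apart=0.5,
                far_from_bottom=0.3, max_area_frac=0.25):
    """hands: list of (cx, cy) hand centres, normalized to [0, 1] with the
    origin at the top-left, so cy near 1.0 means near the bottom edge."""
    at_bottom = [cy >= 1.0 - bottom_band for _, cy in hands]
    if len(hands) == 2:
        (cx0, cy0), (cx1, cy1) = hands
        # Case 1: both hands at the bottom but too far apart horizontally.
        if all(at_bottom) and abs(cx0 - cx1) > far_apart:
            return True
        # Case 2: one hand at the bottom, the other far from the bottom area.
        if any(at_bottom) and not all(at_bottom):
            other_cy = cy1 if at_bottom[0] else cy0
            if (1.0 - bottom_band) - other_cy > far_from_bottom:
                return True
    # Case 3: the finger area takes up too much of the frame.
    if finger_area_frac > max_area_frac:
        return True
    # Case 4: the finger occludes the user's face.
    return bool(finger_covers_face)
```

When `is_abnormal` returns True, claim 29 has the device display the first video frame unmodified rather than attempt removal.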
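Claim 30 derives the finger area from a semantic segmentation model trained on photos in which the finger region is labelled. Below is a minimal inference sketch in PyTorch, assuming a trained binary segmentation network exported to TorchScript as `finger_seg.pt`; the file name, 256x256 input size, and 0.5 probability threshold are all assumptions for the example.

```python
from typing import Optional

import cv2
import numpy as np
import torch

model = torch.jit.load("finger_seg.pt").eval()  # assumed TorchScript export

def finger_mask(frame_bgr: np.ndarray) -> Optional[np.ndarray]:
    """Return a binary finger mask at frame resolution, or None if the
    frame contains no content matching the finger model."""
    h, w = frame_bgr.shape[:2]
    rgb = cv2.cvtColor(cv2.resize(frame_bgr, (256, 256)), cv2.COLOR_BGR2RGB)
    x = torch.from_numpy(rgb).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    with torch.no_grad():
        prob = torch.sigmoid(model(x))[0, 0].numpy()  # per-pixel probability
    mask = (cv2.resize(prob, (w, h)) > 0.5).astype(np.uint8) * 255
    return mask if mask.any() else None
```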
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911315367.1 | 2019-12-19 | ||
CN201911315367.1A CN113014846B (en) | 2019-12-19 | 2019-12-19 | Video acquisition control method, electronic equipment and computer readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021121302A1 true WO2021121302A1 (en) | 2021-06-24 |
Family
ID=76382556
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/137100 WO2021121302A1 (en) | 2019-12-19 | 2020-12-17 | Video collection control method, electronic device, and computer-readable storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN113014846B (en) |
WO (1) | WO2021121302A1 (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115914497A (en) * | 2021-08-24 | 2023-04-04 | 北京字跳网络技术有限公司 | Video processing method, device, equipment, medium and program product |
CN114299446A (en) * | 2021-12-17 | 2022-04-08 | 深圳云天励飞技术股份有限公司 | Personnel number identification method and device, electronic equipment and storage medium |
CN116708931B (en) * | 2022-11-14 | 2024-03-15 | 荣耀终端有限公司 | Image processing method and electronic equipment |
CN117041670B (en) * | 2023-10-08 | 2024-04-02 | 荣耀终端有限公司 | Image processing methods and related equipment |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8284249B2 (en) * | 2008-03-25 | 2012-10-09 | International Business Machines Corporation | Real time processing of video frames for triggering an alert |
JP5962249B2 (en) * | 2012-06-21 | 2016-08-03 | 富士通株式会社 | Character input program, information processing apparatus, and character input method |
CN103139547B (en) * | 2013-02-25 | 2016-02-10 | 昆山南邮智能科技有限公司 | The method of pick-up lens occlusion state is judged based on video signal |
US10142522B2 (en) * | 2013-12-03 | 2018-11-27 | Ml Netherlands C.V. | User feedback for real-time checking and improving quality of scanned image |
CN109218748B (en) * | 2017-06-30 | 2020-11-27 | 京东方科技集团股份有限公司 | Video transmission method, device and computer readable storage medium |
CN107909022B (en) * | 2017-11-10 | 2020-06-16 | 广州视睿电子科技有限公司 | A video processing method, apparatus, terminal device and storage medium |
CN109948525A (en) * | 2019-03-18 | 2019-06-28 | Oppo广东移动通信有限公司 | Photographing processing method and device, mobile terminal and storage medium |
2019
- 2019-12-19 CN CN201911315367.1A patent/CN113014846B/en active Active
2020
- 2020-12-17 WO PCT/CN2020/137100 patent/WO2021121302A1/en active Application Filing
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101673141A (en) * | 2008-09-12 | 2010-03-17 | 鸿富锦精密工业(深圳)有限公司 | keyboard |
CN202331363U (en) * | 2011-11-11 | 2012-07-11 | 中国矿业大学 | Keyboard with camera shooting function |
CN103971361A (en) * | 2013-02-06 | 2014-08-06 | 富士通株式会社 | Image processing device and method |
WO2015121981A1 (en) * | 2014-02-14 | 2015-08-20 | 株式会社Pfu | Overhead scanner device, image acquisition method, and program |
US20180040169A1 (en) * | 2016-08-02 | 2018-02-08 | Canon Kabushiki Kaisha | Information processing apparatus, method of controlling information processing apparatus, and storage medium |
CN108257082A (en) * | 2018-02-01 | 2018-07-06 | 北京维山科技有限公司 | Method and apparatus based on fixed area removal image finger |
CN109886981A (en) * | 2019-03-07 | 2019-06-14 | 北京麦哲科技有限公司 | The method and apparatus of finger removal in a kind of scanning of books and periodicals |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116363538A (en) * | 2023-06-01 | 2023-06-30 | 贵州交投高新科技有限公司 | Bridge detection method and system based on unmanned aerial vehicle |
CN116363538B (en) * | 2023-06-01 | 2023-08-01 | 贵州交投高新科技有限公司 | Bridge detection method and system based on unmanned aerial vehicle |
CN118830843A (en) * | 2024-06-20 | 2024-10-25 | 重庆市罗布琳卡科技有限公司 | Physiotherapy equipment control system based on artificial intelligence |
Also Published As
Publication number | Publication date |
---|---|
CN113014846A (en) | 2021-06-22 |
CN113014846B (en) | 2022-07-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021121302A1 (en) | Video collection control method, electronic device, and computer-readable storage medium | |
WO2021115181A1 (en) | Gesture recognition method, gesture control method, apparatuses, medium and terminal device | |
EP3547218B1 (en) | File processing device and method, and graphical user interface | |
US9767359B2 (en) | Method for recognizing a specific object inside an image and electronic device thereof | |
CN112954210B (en) | Photographing method and device, electronic equipment and medium | |
TWI651640B (en) | Organize digital notes on the user interface | |
US12131443B2 (en) | Image processing method and related device | |
CN106648424B (en) | Screenshot method and device | |
CN109951636A (en) | Photographing processing method and device, mobile terminal and storage medium | |
CN108647351B (en) | Text image processing method and device, storage medium and terminal | |
US20230360443A1 (en) | Gesture recognition method and apparatus, electronic device, readable storage medium, and chip | |
WO2017197593A1 (en) | Apparatus, method and computer program product for recovering editable slide | |
WO2017107855A1 (en) | Picture searching method and device | |
US20250039537A1 (en) | Screenshot processing method, electronic device, and computer readable medium | |
CN110463177A (en) | The bearing calibration of file and picture and device | |
CN109033276A (en) | Sticker pushing method and device, storage medium and electronic equipment | |
EP4030343A1 (en) | Facial skin detection method and apparatus | |
WO2022111461A1 (en) | Recognition method and apparatus, and electronic device | |
WO2022088946A1 (en) | Method and apparatus for selecting characters from curved text, and terminal device | |
CN103327251A (en) | Method and device of multimedia shooting processing and terminal device | |
CN110942065B (en) | Text box selection method, text box selection device, terminal equipment and computer readable storage medium | |
CN117132648B (en) | Visual positioning method, electronic equipment and computer readable storage medium | |
CN112381091A (en) | Video content identification method and device, electronic equipment and storage medium | |
CN112822394A (en) | Display control method and device, electronic equipment and readable storage medium | |
CN111079662A (en) | Figure identification method and device, machine readable medium and equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20902393; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 20902393; Country of ref document: EP; Kind code of ref document: A1 |