WO2021121302A1 - Video collection control method, electronic device, and computer-readable storage medium - Google Patents
- Publication number
- WO2021121302A1 (PCT/CN2020/137100)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- video frame
- finger
- area
- frame
- video
- Prior art date
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N7/00—Television systems
- H04N7/14—Systems for two-way working
- H04N7/141—Systems for two-way working between two video terminals, e.g. videophone
- H04N7/142—Constructional details of the terminal equipment, e.g. arrangements of the camera and the display
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/02—Input arrangements using manually operated switches, e.g. using keyboards or dials
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/47—End-user applications
- H04N21/478—Supplemental services, e.g. displaying phone caller identification, shopping application
- H04N21/4788—Supplemental services, e.g. displaying phone caller identification, shopping application communicating with other users, e.g. chatting
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N7/00—Television systems
- H04N7/14—Systems for two-way working
Definitions
- This application relates to the field of image processing, and in particular to a video capture control method, electronic equipment, computer-readable storage media, computer program products, and chips.
- The video communication method, video acquisition method, and electronic equipment provided by the present application avoid the distortion of finger proportions during video communication and improve the quality of video communication.
- An embodiment of the present invention provides an electronic device, including a display, a keyboard, a camera, and a processor. The camera is arranged near the keyboard and is used to collect video frames during video communication and send the collected video frames to the processor. The processor is connected to the display, the keyboard, and the camera, and is used to: receive the first video frame from the camera; determine that the first video frame contains content that conforms to a preset finger model, and remove the finger in the first video frame to obtain a second video frame; and send the second video frame to the display for display, and/or send the second video frame to the opposite electronic device for display.
- Because the finger area is recognized automatically by determining whether the first video frame contains content that conforms to the preset finger model, the finger in the first video frame is removed automatically and the second video frame is output without manual removal by the user, which improves the efficiency of finger removal; intelligently identifying and removing the finger in real time also improves the intelligence of human-computer interaction without affecting the fluency of the video communication.
- Removing the finger in the first video frame includes: determining that the first video frame contains content that conforms to the preset finger model and that the finger is located in the bottom area of the first video frame, and then removing the finger in the first video frame; or determining that the first video frame contains content that conforms to the preset finger model and that the finger area does not overlap the position of the face, and then removing the finger in the first video frame; or determining that the first video frame contains content that conforms to the preset finger model and that the finger is located in the bottom area of the first video frame and is connected to a side of the first video frame, and then removing the finger in the first video frame. Based on the above solution, the finger is removed only when it is located in a specific area of the first video frame, so the automatic removal is more accurate and the possibility of removing an undistorted finger is reduced.
- Determining that the first video frame contains content that conforms to a preset finger model and then removing the finger in the first video frame includes: obtaining a keyboard input signal; and, upon determining that the first video frame contains content that conforms to the preset finger model and that the time of the obtained keyboard input signal and the time of obtaining the first video frame satisfy a preset time threshold, removing the finger in the first video frame.
- Because detection of the keyboard input signal is one of the conditions for removing the finger, it can be ensured that the removed finger is a typing finger, so that distorted fingers are removed accurately and fingers are prevented from occluding the video frame picture.
- An embodiment of the present invention provides an electronic device, including: one or more processors; one or more memories; multiple application programs; and one or more computer programs, wherein the one or more computer programs are stored in the one or more memories and include instructions.
- When the instructions are executed, the electronic device performs the following steps: obtain a first video frame and obtain a keyboard input signal; determine that the first video frame contains content that conforms to a preset finger model and that the time of the obtained keyboard input signal and the time of obtaining the first video frame satisfy a preset time threshold, and then remove the finger in the first video frame to obtain a second video frame; and indicate that the second video frame is displayed.
- In this way, the technical problem of finger distortion in the output video frame can be solved, and the video transmitted during video communication can be prevented from being an incomplete picture blocked by a finger.
- Because the finger area is recognized automatically by determining whether the first video frame contains content that conforms to the preset finger model, the finger in the first video frame is removed automatically and the second video frame is output without manual removal by the user, which improves the efficiency of finger removal; intelligently identifying and removing the finger in real time also improves the intelligence of human-computer interaction without affecting the fluency of the video communication.
- Because detection of the keyboard input signal is one of the conditions for removing the finger, the typing finger in the video frame can be removed accurately.
- An embodiment of the present invention provides a video capture control method, applied to an electronic device that includes a keyboard and a camera arranged near the keyboard. The method includes: acquiring a first video frame through the camera; determining that the first video frame contains content that conforms to a preset finger model, and then removing the finger in the first video frame to obtain a second video frame; and displaying the second video frame, and/or sending the second video frame to the opposite electronic device for display.
- Because the finger area is recognized automatically by determining whether the first video frame contains content that conforms to the preset finger model, the finger in the first video frame is removed automatically and the second video frame is output without manual removal by the user, which improves the efficiency of finger removal; intelligently identifying and removing the finger in real time also improves the intelligence of human-computer interaction without affecting the fluency of the video communication.
- An embodiment of the present invention provides a video communication control method, including: obtaining a first video frame and obtaining a keyboard input signal; determining that the first video frame contains content that conforms to a preset finger model and that the time of the obtained keyboard input signal and the time of obtaining the first video frame satisfy a preset time threshold, and then removing the finger in the first video frame to obtain a second video frame; and indicating that the second video frame is displayed.
- Because the finger area is recognized automatically by determining whether the first video frame contains content that conforms to the preset finger model, the finger in the first video frame is removed automatically and the second video frame is output without manual removal by the user, which improves the efficiency of finger removal; intelligently identifying and removing the finger in real time also improves the intelligence of human-computer interaction without affecting the fluency of the video communication.
- Because detection of the keyboard input signal is one of the conditions for removing the finger, the typing finger in the video frame can be removed accurately.
- An embodiment of the present invention provides an electronic device, including: one or more processors; one or more memories; multiple application programs; and one or more computer programs, wherein the one or more computer programs are stored in the one or more memories and include instructions.
- When the instructions are executed, the electronic device performs the following steps: obtain a first video frame; determine that the first video frame contains content that conforms to a preset finger model and that the finger is located in the bottom area of the first video frame, and then remove the finger in the first video frame to obtain a second video frame; and indicate that the second video frame is displayed.
- Because the finger area is recognized automatically by determining whether the first video frame contains content that conforms to the preset finger model, the finger in the first video frame is removed automatically and the second video frame is output without manual removal by the user, which improves the efficiency of finger removal; intelligently identifying and removing the finger in real time also improves the intelligence of human-computer interaction without affecting the fluency of the video communication.
- Because detection of the finger area in the bottom area of the video frame is used as the removal condition, a finger placed in a specific position can be distinguished from fingers in other positions, achieving more accurate finger removal.
- FIG. 1 is a structural diagram of a notebook computer with a prior art camera located in the keyboard area;
- FIG. 2 is a schematic diagram of a picture containing typing fingers acquired based on the front camera of the notebook computer shown in FIG. 1 in the prior art;
- FIG. 3 is a structural diagram of an electronic device according to an embodiment of the present invention;
- FIG. 4 is a software framework diagram of an embodiment of the present invention;
- FIG. 5 is a schematic diagram of another structure of an electronic device introduced in an embodiment of the present invention.
- FIG. 6 is a flowchart and interface comparison diagram of the video control method introduced by the embodiment of the present invention.
- FIG. 7 is a schematic diagram of a finger area determined by an embodiment of the present invention.
- FIG. 8 is a schematic diagram of an implementation manner of replacing a finger area by replacing content in an embodiment of the present invention.
- FIG. 9 is a schematic diagram of another implementation manner of replacing a finger area by replacing content in an embodiment of the present invention.
- FIG. 10 is a schematic diagram of generating prompt information after removing a finger in an embodiment of the present invention.
- FIG. 11 is a schematic diagram of generating prompt information before removing a finger in an embodiment of the present invention.
- FIG. 12 is a flowchart of a method for training a semantic segmentation model in an embodiment of the present invention;
- FIG. 13 is a flowchart of identifying a finger area in an image based on a semantic segmentation model in an embodiment of the present invention;
- FIG. 15A is a schematic diagram of an image of a typing finger collected by a front camera in an embodiment of the present invention;
- FIG. 15B is a schematic diagram of a finger region mask determined by recognizing the image shown in FIG. 15A based on a semantic segmentation model in an embodiment of the present invention;
- FIG. 15C is a schematic diagram of the finger area mask after noise reduction is performed on the finger mask area in an embodiment of the present invention;
- FIG. 16 is a flowchart of the image processing method introduced in the embodiment of the present invention.
- FIG. 17 is a flowchart of a video communication method introduced in an embodiment of the present invention.
- FIGS. 18A-18C are schematic diagrams of video frames collected by a front camera in a video communication method in an embodiment of the present invention, the typing finger area in the video frames, and the image frames output after processing;
- FIG. 19 is a flowchart of a video communication method introduced by another embodiment of the present invention.
- FIG. 20 is a flowchart of determining whether a user is in a typing state in the video communication method introduced by another embodiment of the present invention;
- FIG. 21 is a flowchart of a video processing method introduced in an embodiment of the present invention.
- FIG. 22 shows an interface change diagram of a specific application scenario of the present invention;
- FIG. 23 shows an interface change diagram of another specific application scenario of the present invention.
- The terms “first” and “second” are used only for descriptive purposes and cannot be understood as indicating or implying relative importance or implicitly indicating the number of the indicated technical features. Thus, features defined with “first” and “second” may explicitly or implicitly include one or more of those features. In the description of the embodiments of the present application, unless otherwise specified, “plurality” means two or more.
- Electronic equipment is equipped with cameras, microphones, global positioning system (GPS) chips, various sensors (such as magnetic field sensors, gravity sensors, and gyroscope sensors), and other devices to sense the external environment and user actions. According to the perceived external environment and user actions, the electronic device provides the user with personalized, contextual service experiences. The camera can obtain rich and accurate information, enabling the electronic device to perceive the external environment and user actions.
- The embodiments of the present application provide an electronic device, which can be implemented as any of the following devices: a mobile phone, a tablet computer (pad), a portable game console, a personal digital assistant (PDA), a notebook computer, an ultra-mobile personal computer (UMPC), a handheld computer, a netbook, an in-vehicle media playback device, a wearable electronic device, a virtual reality (VR) terminal device, an augmented reality (AR) terminal device, or another digital display product.
- FIG. 3 shows a schematic diagram of the structure of the electronic device 100.
- The electronic device 100 shown in FIG. 3 is only an example; the electronic device 100 may have more or fewer components than those shown, two or more components may be combined, or the components may be configured differently. The various components shown in the figure may be implemented in hardware, software, or a combination of hardware and software, including one or more signal processing and/or application-specific integrated circuits.
- The electronic device 100 may include: a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone jack 170D, a sensor module 180, buttons 190, a motor 191, an indicator 192, a camera 193, a display 194, a subscriber identification module (SIM) card interface 195, and so on.
- The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, etc.
- The structure illustrated in the embodiment of the present invention does not constitute a specific limitation on the electronic device 100. The electronic device 100 may include more or fewer components than those shown in the figure, combine certain components, split certain components, or arrange the components differently. The illustrated components can be implemented in hardware, software, or a combination of software and hardware.
- The software architecture involved in this application includes: an application layer, the Windows multimedia framework, a control layer, a core layer, a platform layer, a camera driver, and camera hardware. The finger occlusion processing module involved in this application is integrated in the MFT (Media Foundation Transforms) module of the core layer, and this module can also integrate other functions.
- The processed video frames are sent to application software, such as video communication software, through Media Sink. The kernel layer is the layer between hardware and software; it includes at least a display driver, a camera driver, an audio driver, a sensor driver, and a finger occlusion processing component. The finger occlusion processing component integrates the function, introduced in the embodiment of the present invention, of processing video frames that contain preset objects.
- The finger occlusion processing component can identify the finger in a video frame and remove it to obtain a video frame that does not contain the finger; the component can then output the processed video frame to the display. If the processed video frame is to be transmitted to the opposite end, the finger occlusion processing component delivers the processing result to the video application software through the Windows multimedia framework, and it is delivered to the opposite end through the end-to-end connection established by the video application software.
- An embodiment of the present invention provides an electronic device 100; please refer to FIG. 5.
- The electronic device includes:
- The camera 52 is arranged near the keyboard 51. As shown in FIG. 5, the camera 52 can be arranged on the plane to which the keyboard 51 belongs, such as cameras 52a and 52c. The camera 52a is arranged in the area where the keyboard belongs, where this area refers to the rectangle determined by the upper left corner point and the lower right corner point of the keyboard.
- Alternatively, the camera 52a can be set in an area determined by moving this rectangle outward by a preset distance (for example, 0.5 cm, 1 cm, or 2 cm); or the camera 52 can be arranged on the frame 50a of the display 50, for example on the left frame, right frame, or lower frame of the display 50.
- The camera 52 can be arranged in the lower part of the frame, for example within the lower 1/2 or 1/3 of the frame, or at the bottom of the frame, and so on. The camera is set near the keyboard and is used to collect video frames during video communication and send the collected video frames to the processor. The camera 52b shown in FIG. 5 is disposed on the lower part of the frame 50a.
- The processor (not shown in the figure) is connected to the display 50, the keyboard 51, and the camera 52. The processor is used to: receive the first video frame from the camera; determine that the first video frame contains content that conforms to the preset finger model, and remove the finger in the first video frame to obtain a second video frame; and send the second video frame to the display for display, and/or send the second video frame to the opposite electronic device for display.
- The method includes the following steps:
- S610: Determine that the first video frame contains content that meets the preset finger model, and remove the finger in the first video frame to obtain a second video frame; send the second video frame to the display for display, and/or send the second video frame to the opposite electronic device for display.
- The electronic device 100 can perform video collection after detecting a user's video shooting operation (for example, clicking a video shooting button, making a preset gesture, or issuing a voice command); video capture is performed, and the collected video is then sent to the peer electronic device.
- When the electronic device detects the user's video communication operation (or video shooting operation), it generates a video communication instruction; the processor responds to the video communication instruction, starts the video communication software, and sends an instruction to the camera driver to control the camera to perform video capture. The camera driver sends the collected data to the finger occlusion processing component, which performs the subsequent operations.
- Step S610 may be executed by a processor.
- The first video frame may be input to the semantic segmentation model, which determines the mask of the finger area in the first video frame; the finger area in the first video frame is then determined through the mask of the finger area. The mask of the finger area can be used directly as the finger area, or the finger area can be obtained after noise reduction is performed on the mask of the finger area. If the finger area exists, it is determined that the first video frame contains content that conforms to the preset finger model. The semantic segmentation model is obtained by training on sample photos, each of which contains the user's finger and is labeled with the finger area; how the finger area in a video frame is determined through the semantic segmentation model is introduced later and is not repeated here.
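As a concrete illustration of the above, the following Python sketch shows one way the finger area could be derived from a segmentation model's per-pixel output. It is an editor's sketch under stated assumptions, not the patent's implementation: `seg_model` is a hypothetical callable returning a per-pixel finger-confidence map, and the 0.5 threshold is an illustrative choice.

```python
import numpy as np

def finger_area_mask(frame_bgr, seg_model, threshold=0.5):
    """Return a binary finger-area mask, or None when the frame contains
    no content that conforms to the preset finger model."""
    x = frame_bgr.astype(np.float32) / 255.0    # normalize pixel values to 0..1
    prob = seg_model(x)                         # hypothetical model call: HxW confidences
    mask = (prob > threshold).astype(np.uint8)  # binarize into the finger mask
    return mask if mask.any() else None
```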
- Figure 7 is a schematic diagram of the determined finger area.
- Figure 7 contains six small images, Figures 7a to 7f.
- Figure 7b is a schematic diagram of a captured image of the user's fingers; the finger area determined by the semantic segmentation model from the image shown in Figure 7b is shown in Figure 7a.
- In step S610, when it is determined that the first video frame contains content that conforms to the preset finger model, the fingers in the first video frame are removed (that is, all the fingers in the first video frame are removed);
- The preset condition may be any of a variety of conditions; four of them are listed below for introduction.
- In the specific implementation process, the preset condition is not limited to the following four situations. (The following four preset conditions all correspond to the presence of the user's typing finger in the first video frame; the purpose of judging whether a preset condition is met is to remove the finger only when the user's typing finger is present in the first video frame. Fingers that are not typing fingers can be retained, for example when the user raises a finger to show the other party a manicure, or holds up a teacup to drink tea; such fingers are fingers that the user does not want removed. This solution ensures that only the user's typing finger (that is, the finger placed on the keyboard) is removed and other fingers are not; other fingers are usually not close to the camera, so they have no deformation problem. On the one hand this reduces the possibility that a distorted finger blocks the picture, and on the other hand it retains fingers with normal proportions, achieving the technical effect of accurately removing proportion-distorted fingers.)
- The first is to determine that the first video frame contains content that conforms to the preset finger model and that the finger is located in the bottom area of the first video frame; it is then determined that a typing finger is present, and the finger in the first video frame can be removed.
- Taking Figure 7a as an example, it contains two connected areas, namely connected areas 64 and 65, both of which are connected to the bottom; there are therefore two bottom connected areas in Figure 7a. In this case, a typing finger is present in the first video frame, which means the preset condition is met, so the fingers in the first video frame can be removed.
- The second is to determine that the first video frame contains content that meets the preset finger model and that the finger area does not overlap the position of the human face; it is then determined that a typing finger is present, and the finger in the first video frame can be removed. Generally, the area where the face is located is the central area of the video frame. If, for example, the user raises a finger to show the other party, the connected area may be located in the area where the face is; in that case the finger is not a typing finger but a finger being shown to the other party, so the preset condition is not met and there is no need to remove the finger from the first video frame. Only when the finger area does not overlap the face area is the preset condition met and the finger in the first video frame removed. The overlap of the finger area and the area where the face is located may be a partial overlap or a full overlap, which is not limited in the embodiment of the present invention.
- The third is a keyboard input signal: if a keyboard input signal is obtained, it indicates that a typing finger is present in the first video frame, and therefore that the first video frame contains a finger that satisfies the preset condition. The keyboard signal can be collected to determine whether a keyboard input signal exists; specifically, whether a keyboard input signal is detected within a preset time period before or after the first video frame is collected can be used to determine whether the preset condition is met. If a keyboard input signal is detected within the preset time period (for example, 1 second or 2 seconds), the preset condition is deemed met; otherwise it is deemed not met. When the keyboard input signal is used as one of the conditions for triggering removal of the finger in the first video frame, it can be ensured that the removed finger is the finger of a user in the typing state, as sketched below.
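A minimal sketch of this timing check, assuming timestamps in seconds and a keyboard hook that records `last_key_time`; the 1-second window is the example value from the text.

```python
def keyboard_condition_met(frame_time, last_key_time, window_s=1.0):
    """True when a keyboard input signal was detected within the preset
    time period around the capture time of the first video frame."""
    return last_key_time is not None and abs(frame_time - last_key_time) <= window_s
```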
- The fourth is to determine that the first video frame contains content that conforms to the preset finger model and that the finger is located in the bottom area of the first video frame and is connected to a side of the first video frame, and then to remove the finger in the first video frame. When fingers are located in the bottom area they are likely to be typing fingers, but they may also be fingers operating the touchpad. Typing fingers are often located on the two sides of the keyboard, so they are generally connected to the sides of the first video frame; in Figure 7a, the bottom connected area 65 is connected to the right side of the first video frame. This solution can therefore locate typing fingers in the first video frame more accurately. If the finger is located in the bottom area of the first video frame and is connected to a side of the first video frame, it is determined that the preset condition is satisfied; otherwise, it is determined that the preset condition is not satisfied.
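The bottom-area and side-connectivity test can be expressed with connected-component statistics, for example using OpenCV. A sketch under the assumption that `mask` is the binary finger mask from the segmentation model:

```python
import cv2

def has_typing_finger(mask):
    """True if some connected area of the finger mask touches the bottom of
    the frame and is also connected to a side edge (the fourth condition)."""
    h, w = mask.shape
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
    for i in range(1, n):                       # label 0 is the background
        x, y, bw, bh, _area = stats[i]
        touches_bottom = (y + bh) >= h          # a bottom connected area
        touches_side = (x == 0) or ((x + bw) >= w)
        if touches_bottom and touches_side:
            return True
    return False
```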
- When it is determined in S610 that the first video frame contains content that conforms to the preset finger model, and after the finger area has been determined, it can be further determined whether the finger area is in an abnormal state. This check may be performed once it is determined that the preset condition is met: if the finger area is in an abnormal state, the finger in the first video frame is not removed and the first video frame is output directly (for example, displayed on the display or sent to the peer electronic device for display), a step that can be executed by the processor; otherwise, the finger in the first video frame is removed. It is also possible to determine whether the finger area determined based on the preset finger model is in an abnormal state whenever it is detected that the first video frame contains content that conforms to the preset finger model; the embodiment of the present invention does not limit this.
- The abnormal state can take many forms; four are listed below for introduction. Of course, in the specific implementation process it is not limited to the following four situations.
- The first is that the area of the finger area is greater than a preset area threshold (as shown in Figure 7c), for example: greater than 5000 pixels, or greater than 1/4 or 1/3 of the frame. If the finger area is too large, it may not fuse well with other areas after removal. Therefore, to ensure that the finger area of the output video frame (the second video frame) fuses with its background area, the operation of removing the finger in the first video frame is performed only when the area of the finger area is not greater than the preset area threshold.
- The second is that the finger area contains three connected areas: two bottom connected areas (that is, connected areas located at the bottom of the first video frame) and one intermediate connected area that overlaps the area where the face is located (how the face area is determined is introduced later). In this case, one of the user's hands is in the bottom area of the first video frame and the other is not, which usually means the user is typing with one hand and doing something other than typing with the other (for example, drinking water or touching their hair). The finger area is then considered to be in an abnormal state, there is no need to remove the fingers in the first video frame, and the first video frame is output directly.
- The third is that there are at least two bottom connected areas and the distance between them is greater than a preset distance threshold, for example 100 pixels or 150 pixels. This indicates that the distance between the user's two hands in the first video frame is greater than a first preset distance (the first preset distance is equal to, or positively correlated with, the preset distance threshold).
- The fourth is that there is a non-bottom connected area, as shown in Figure 7f. This often means the user is typing with one hand and doing something else with the other. Removing both hands would not meet the user's requirements, while removing only the fingers in the bottom area of the first video frame would make the picture more abrupt, so this is considered an abnormal state and the fingers in the first video frame are not removed.
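The four abnormal states can be screened with the same connected-component statistics. The thresholds below are the example values from the text, and the face-overlap case is folded into the non-bottom check for brevity; this is an illustrative sketch, not the patent's exact logic.

```python
import cv2

def finger_area_abnormal(mask, area_thresh=5000, dist_thresh=100):
    """True when the finger area is in one of the abnormal states above."""
    h, _w = mask.shape
    if int(mask.sum()) > area_thresh:           # state 1: finger area too large
        return True
    n, _labels, stats, cents = cv2.connectedComponentsWithStats(mask, connectivity=8)
    bottom_xs, non_bottom = [], 0
    for i in range(1, n):
        y, bh = stats[i][1], stats[i][3]
        if (y + bh) >= h:
            bottom_xs.append(cents[i][0])       # centroid x of a bottom area
        else:
            non_bottom += 1                     # states 2 and 4: non-bottom area
    if non_bottom:
        return True
    if len(bottom_xs) >= 2:                     # state 3: hands too far apart
        return (max(bottom_xs) - min(bottom_xs)) > dist_thresh
    return False
```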
- In step S610, the finger in the first video frame can be removed in a variety of ways; three of them are listed below for introduction. Of course, in the specific implementation process, removal is not limited to the following three situations.
- The first method is to replace the content of the finger area in the first video frame with replacement content to obtain the second video frame.
- For example, the electronic device responds to a video communication operation, such as clicking the video communication button in the interface of the first contact in the video communication software; it then enters the video communication state and displays the video communication interface. The video communication interface contains a video preview window and a video receiving window: the video preview window is used to display the video frames collected by the current electronic device, and the video receiving window is used to display the video frames received from the opposite electronic device.
- When the electronic device enters the initial stage of video capture, the user's hands are not placed on the keyboard for typing. The electronic device captures a third video frame that does not contain fingers and outputs it, as shown in Figure 8a. At the same time, the electronic device determines whether the third video frame contains the user's finger, which can be determined by the semantic segmentation model introduced in the embodiment of the present invention: if the semantic segmentation model determines that a finger area exists, the third video frame is deemed to contain a finger; otherwise, it is deemed not to contain a finger. If it is determined that the third video frame does not contain a finger, the third video frame is stored as a background frame, as shown in Figure 8b.
- When the electronic device collects the first video frame (as shown in Figure 8c), it determines whether the first video frame contains content that meets the preset finger model: the first video frame is input to the semantic segmentation model, which finally determines the finger area, shown as 90 in Figure 8d. Once the finger area is determined, the first video frame is deemed to contain content that meets the preset finger model. The replacement content corresponding to the finger area is then determined from the background frame, as shown in the figure; the finger area in the first video frame is covered by the replacement content to obtain a second video frame (as shown in Figure 8f), and finally the second video frame is output.
- That is, the method further includes: obtaining a third video frame, the third video frame being a video frame collected before the first video frame; if it is determined that the third video frame does not contain content that conforms to the preset finger model, using the third video frame as the background frame; and determining, in the background frame, the content corresponding to the finger area in the first video frame as the replacement content. This step can be executed by the processor.
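A sketch of this first removal method: the last finger-free frame is kept as the background frame, and its pixels are pasted over the finger area of later frames. `mask` is assumed to be None when no finger content is detected.

```python
import numpy as np

class BackgroundReplacer:
    """Keep a finger-free background frame and use it as replacement content."""
    def __init__(self):
        self.background = None

    def process(self, frame, mask):
        if mask is None:                      # finger-free: store as background frame
            self.background = frame.copy()
            return frame
        if self.background is None:           # no background yet: output unchanged
            return frame
        out = frame.copy()
        idx = mask.astype(bool)
        out[idx] = self.background[idx]       # cover the finger area with replacement
        return out
```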
- The second video frame may contain a specific object, which is often a stationary object in the background area, such as a teacup or a pen. With this method, the coordinates of the specific object in the second video frame are the same as the coordinates of the specific object in the third video frame (or the offset is less than a preset offset, such as 10 or 20 pixels); or the size of the specific object in the second video frame is the same as, or similar to, the size of the specific object in the third video frame (for example, the difference is within 5% or 10%).
- At the start of video communication, if the user places a finger on the keyboard before any finger-free video frame has been collected for use as a background frame, the electronic device outputs a video frame containing the user's finger at this time.
- The second method is to crop the first video frame to obtain a second video frame that does not include the finger area.
- After the electronic device collects a video frame, it can first determine whether the video frame contains content that meets the preset finger model. If not, it outputs the video frame directly: as shown in Figure 9, after the device collects the third video frame, it directly outputs the third video frame. If it is determined that the video frame contains content that meets the preset finger model (for example, as shown in Figure 9c), a cropping frame 91 that does not contain the finger is determined based on the collected video frame (for example, the first video frame), and after the first video frame is cropped by the cropping frame, a second video frame that does not contain the finger is obtained (as shown in Figure 9d).
- The cropping frame can be determined in a variety of ways. For example: assuming the lower left corner of the video frame is the origin, determine the maximum Y-axis value of the finger area in the video frame, use that maximum value as the bottom cropping edge, and use the top of the video frame as the upper cropping edge; the cropping ratio is determined from the heights of the upper and lower cropping edges ((the maximum Y value minus the minimum Y value) / the height of the first video frame). Then determine the center position of the person in the video frame, extend the center position to the left by a first preset distance (1/2 × cropping ratio × the width of the first video frame) to determine the left border, extend the center position to the right by a second preset distance (1/2 × cropping ratio × the width of the first video frame) to determine the right border, and determine the cropping frame accordingly. In the specific implementation process, the cropping frame may also be determined in other ways, which the embodiment of the present invention does not enumerate or limit.
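A sketch of the cropping computation in image coordinates (row 0 at the top, so the "Y-axis maximum" of the text corresponds to the smallest finger row index). `person_cx` is assumed to come from a separate person or face detector.

```python
import numpy as np

def crop_out_finger(frame, mask, person_cx):
    """Crop the first video frame so the finger area falls outside the result."""
    h, w = mask.shape
    finger_rows = np.where(mask.any(axis=1))[0]
    if finger_rows.size == 0:
        return frame
    bottom = int(finger_rows.min())           # bottom cropping edge (above the finger)
    ratio = bottom / float(h)                 # cropping ratio: kept height / frame height
    half = int(0.5 * ratio * w)               # 1/2 * cropping ratio * frame width
    left = max(0, int(person_cx) - half)      # extend left of the person's center
    right = min(w, int(person_cx) + half)     # extend right of the person's center
    return frame[0:bottom, left:right]
```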
- With this method, the offset of the coordinates of the specific object in the second video frame relative to its coordinates in the third video frame may be greater than the preset offset; or the difference between the size of the specific object in the second video frame and its size in the third video frame may be greater than the preset value.
- The third method is to fill the finger area with pixels from the vicinity of the finger area to obtain the second video frame.
- With this method, the coordinates of the specific object in the second video frame are the same as, or differ little from (the offset is less than the preset offset), the coordinates of the specific object in the third video frame; and the size of the specific object in the second video frame is the same as, or differs little from, the size of the specific object in the third video frame.
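For this third method, OpenCV's inpainting is one plausible way to fill the finger area from nearby pixels; the 3-pixel radius below is an illustrative choice, not a value from the text.

```python
import cv2

def fill_finger_area(frame, mask):
    """Fill the finger area using pixels from its vicinity (inpainting)."""
    return cv2.inpaint(frame, mask * 255, 3, cv2.INPAINT_TELEA)
```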
- In an optional embodiment, after the display is controlled to display the second video frame based on S610, in response to determining that a fourth video frame contains content that conforms to the preset finger model and that the finger is in an abnormal state, at least one transition frame is determined from the second video frame and the finger area of the fourth video frame; the display is controlled to display the at least one transition frame; and after the at least one transition frame is displayed, the display is controlled to display the fourth video frame. This step can be executed by the processor.
- For example, the user first types with one hand, and the electronic device collects the first video frame. Since the first video frame contains content that conforms to the preset finger model and the finger area is not in an abnormal state, the electronic device outputs the second video frame, which does not contain the finger. The user then places the other hand on the touchpad, and the electronic device captures the fourth video frame, which still contains content that conforms to the preset finger model; however, the distance between the two connected areas of the finger area is greater than the preset distance threshold, so the finger area is determined to be in an abnormal state. At this time the fourth video frame could be output directly, but switching directly from the second video frame to the fourth video frame would make the fingers of a typing hand appear suddenly, which would make the picture abrupt.
- Therefore, at least one transition frame is added between the second video frame and the fourth video frame to achieve a smooth transition from the second video frame to the fourth video frame. If there is one transition frame, the weights of the second video frame and the fourth video frame are, for example, 0.5 each. If there are multiple transition frames, the weight of the second video frame is gradually reduced while the weight of the fourth video frame is gradually increased.
- For example, the weights of the second video frame and the fourth video frame are (0.8, 0.2) in the first transition frame, (0.6, 0.4) in the second, (0.5, 0.5) in the third, (0.4, 0.6) in the fourth, (0.2, 0.8) in the fifth, and so on. The above weights are only examples, not limitations. For the background area outside the finger area, each transition frame displays the content of the background area of the fourth video frame; the content of the finger area in each transition frame is determined by the above method.
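A sketch of the transition: inside the finger area the second and fourth video frames are blended with shifting weights, while the background area already shows the fourth frame's content. The weight schedule is the example sequence given above.

```python
import cv2

def make_transition_frames(second, fourth, mask4,
                           weights=((0.8, 0.2), (0.6, 0.4), (0.5, 0.5),
                                    (0.4, 0.6), (0.2, 0.8))):
    """Return transition frames that fade from the second to the fourth frame."""
    idx = mask4.astype(bool)
    frames = []
    for w2, w4 in weights:
        blend = cv2.addWeighted(second, w2, fourth, w4, 0.0)
        t = fourth.copy()                     # background area: fourth frame content
        t[idx] = blend[idx]                   # finger area: weighted blend
        frames.append(t)
    return frames
```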
- In an optional embodiment, a prompt message can also be generated to remind the user that the finger in the video frame has been removed and to ask whether the finger needs to be retained. As shown in Figure 10, a prompt box 100 is displayed over the second video frame; the prompt box 100 prompts "The finger has been removed from the video chat; please confirm whether to continue removing." If the user wants to continue removing the finger, the user clicks the confirmation button 110.
- In another optional embodiment, after the electronic device starts the video communication (or captures the user's finger), it can ask the user whether to remove the finger. If the user's confirmation operation is detected, the finger in the video frame is removed; otherwise, the finger in the video frame is not removed.
- After the electronic device collects the first video frame and determines by detection that the video frame contains content that conforms to the preset finger model (that is, a finger), a prompt box 130 is generated with the prompt "A finger has been detected in the video; please confirm whether to remove it." If the user clicks the confirmation button 140, the electronic device removes the finger in the first video frame and displays the second video frame, as shown in Figure 11b; if the user clicks the cancel button 150, the electronic device does not remove the finger in the first video frame but directly outputs the first video frame, as shown in Figure 11c.
- The above scheme can be applied at the video capture end of the video communication process: the electronic device performs finger removal on the captured video frames, so that the video frames sent to the peer electronic device have the finger removed. The above scheme can also be applied at the receiving end of the video communication process: the peer electronic device receives video frames in which the finger has not been removed, and then outputs the video frames after processing them with the video processing method introduced in the embodiment of the present invention. In this case, the video frames in the above steps are not collected video frames but video frames received from the opposite electronic device.
- Embodiments of the present invention provide an electronic device, including: one or more processors; one or more memories; multiple application programs; and one or more computer programs, wherein the one or more computer programs are stored in the one or more memories and include instructions that, when executed by the one or more processors of the electronic device, cause the electronic device to perform the following steps: obtain a first video frame and obtain a keyboard input signal; determine that the first video frame contains content that conforms to a preset finger model and that the time of the obtained keyboard input signal and the time of obtaining the first video frame satisfy a preset time threshold, and then remove the finger in the first video frame to obtain a second video frame; and indicate that the second video frame is displayed.
- Indicating that the second video frame is displayed includes, for example: (1) the second video frame is displayed through the display unit of the electronic device; (2) the second video frame is sent to another display unit for display, either because the electronic device does not have a display unit or because the other display unit is more suitable for display (for example, its display area is larger); (3) the second video frame is sent to the opposite electronic device of the video communication and displayed by the opposite electronic device.
- The embodiment of the present invention provides a method for training a semantic segmentation model. Semantic segmentation is classification at the pixel level: the pixels of each object are assigned to one category (for example, pixels belonging to a person form one category, pixels belonging to a motorcycle form one category, pixels belonging to a puppy form one category, and so on), and background pixels are also assigned to a category of their own.
- The method includes the following steps:
- S1200: Data collection and labeling. Collect photos of the user when typing (or other photos containing the user's hand) and label the mask of the finger (or other preset object) area for training the semantic segmentation model.
- S1210: Model design. The semantic segmentation model is, for example, a convolutional neural network model or a conditional random field model. In an optional embodiment, a dual-branch convolutional neural network model may be used, where the two branches include, for example, a semantic feature branch and an edge feature branch. The semantic feature branch is used to extract the semantic features of the image, where a semantic feature refers to the specific object represented by a pixel, such as a face or a finger; the edge feature branch is used to extract the texture features of the image, where texture features refer to features such as edge information, shape information (for example, corners), and color.
- S1220: Model training. Train and update the parameters of the semantic segmentation model based on the photos labeled in S1200 and the semantic segmentation model designed in S1210, to obtain a semantic segmentation model for finger recognition.
- The input of the model is a picture whose pixel values are normalized to between 0 and 1 (in the calculation process, the picture is a multi-dimensional array stored in memory in a specific order). The semantic feature branch can use a deep convolutional neural network to extract semantic features and determine the finger area, while the edge feature branch can use a shallow neural network to extract texture features and ensure the accuracy of finger edge segmentation. That is, the dual-branch convolutional neural network model extracts semantic features through a deep convolutional neural network and texture features through a shallow neural network; a feature fusion network then performs feature fusion on the extracted semantic features and edge features of each pixel to obtain comprehensive features. The final classification layer takes the comprehensive features as input, calculates the confidence that each pixel belongs to the finger area, and then determines whether each pixel belongs to the finger area. (In the specific implementation, the semantic segmentation model is a collection of operations connected in a specific form; each operation is parameterized by different values, and executing an operation means performing matrix operations on its own parameter values and input arrays and outputting the calculated result arrays.) The output is then compared with the labeled finger area, the difference is measured by a loss function, the loss is propagated back through the semantic segmentation model, and the parameters of the semantic segmentation model are updated.
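For readers who want the dual-branch design in code, here is a compact PyTorch sketch. The channel counts, depths, and loss function are editorial assumptions chosen to illustrate the structure described above (deep semantic branch, shallow edge branch, feature fusion, per-pixel classifier, backpropagated difference loss); they are not the patent's actual network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualBranchSeg(nn.Module):
    def __init__(self):
        super().__init__()
        # deep semantic branch: downsamples, captures what each pixel represents
        self.semantic = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())
        # shallow edge branch: full resolution, preserves edge/texture detail
        self.edge = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        # feature fusion + per-pixel classifier (finger / not finger)
        self.fuse = nn.Conv2d(64 + 16, 32, 3, padding=1)
        self.classify = nn.Conv2d(32, 1, 1)

    def forward(self, x):
        s = self.semantic(x)
        s = F.interpolate(s, size=x.shape[2:], mode='bilinear',
                          align_corners=False)        # back to input resolution
        e = self.edge(x)
        f = F.relu(self.fuse(torch.cat([s, e], dim=1)))
        return self.classify(f)                       # per-pixel finger logit

model = DualBranchSeg()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()                      # compare with the labeled mask

def train_step(images, masks):                        # images in 0..1, masks 0/1
    opt.zero_grad()
    loss = loss_fn(model(images), masks)              # difference loss
    loss.backward()                                   # propagate the loss back
    opt.step()                                        # update model parameters
    return loss.item()
```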
- An embodiment of the present invention also provides a method for recognizing the finger (or other preset object) region in an image (or video) based on the semantic segmentation model. Please refer to FIG. 13, which includes the following steps:
- S1300 Model freezing. Specifically, after training of the semantic segmentation model is completed, the obtained model is used for finger recognition and its parameters no longer change.
- S1310 Data preprocessing. This step obtains the current video frame and performs normalization preprocessing of the data, for example normalizing the pixel values of the video frame to between 0 and 1.
- S1320 Model inference. The model inference may include the following steps: S1400 Input an image, that is, input the image whose pixel values were normalized to between 0 and 1 in S1310 into the semantic segmentation model.
- S1410a Extract the semantic features of the image through the semantic feature branch of the semantic segmentation model, where the semantic branch is, for example, a deep convolutional neural network. The image may also be reduced in size, for example by a factor of 4 or 5.
- S1410b Extract texture features through the edge feature branch, which is, for example, a shallow convolutional neural network.
- S1430 Input the comprehensive features (obtained by fusing the semantic features and the texture features) into the classifier to obtain the finger area mask. The final classifier takes the comprehensive features as input, calculates for each pixel the confidence that it belongs to the finger area and the confidence that it does not, judges accordingly whether each pixel belongs to the finger area, and thereby obtains the mask of the finger area.
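- A hedged sketch of these inference steps follows; it reuses the illustrative `DualBranchSegNet` from the training sketch above and is not the embodiment's actual pipeline.

```python
# S1300: freeze the trained model; S1310: normalize the current frame to
# [0, 1]; S1320: run the two branches and the classifier; S1430: threshold
# the per-pixel finger confidence into a binary finger-area mask.
import numpy as np
import torch

model = DualBranchSegNet()   # in practice, load the trained ("frozen") weights
model.eval()                 # S1300: parameters no longer change

frame = np.random.randint(0, 256, (240, 320, 3), dtype=np.uint8)  # current frame
x = torch.from_numpy(frame).permute(2, 0, 1).float().unsqueeze(0) / 255.0  # S1310

with torch.no_grad():
    logits = model(x)                                   # both branches, fused
    prob = torch.softmax(logits, dim=1)[0, 1]           # "finger" confidence per pixel
    mask = (prob > 0.5).numpy().astype(np.uint8) * 255  # S1430: finger-area mask
```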
- Figure 15A is a photo of the user captured by the front camera of the electronic device, where the front camera is set at the bottom of the display or on the keyboard, so the image captured by the front camera contains fingers. As shown in Figure 15B, the white areas are the finger area mask output by the semantic segmentation model, indicating areas that may be fingers.
- Each white area is called a connected area. Here, the finger area mask determined by the semantic segmentation model includes five connected areas (the number of connected areas determined differs from picture to picture, which the embodiment of the present invention does not limit): connected area 61, connected area 62, connected area 63, connected area 64, and connected area 65. The outer frame of a connected area is called its circumscribed rectangle, and the mask of the finger area may also include the empirical position 66 of the face frame.
- Connected areas 61, 62, and 63 are called non-bottom connected areas because they are not in contact with the bottom of the picture, while connected areas 64 and 65 are called bottom connected areas. In the normal typing state, the areas where the user's hands are located usually belong to the bottom connected areas.
- The finger area mask can be taken directly as the finger area, or noise reduction can first be performed on the finger area mask and the noise-reduced mask used as the finger area.
- The position of the face frame can be determined by performing face recognition on the image, with the recognized face frame taken as the empirical position of the face frame. It is also possible to use an empirical area in the center of the screen of the electronic device as the empirical area of the face frame, or to recognize a large number of chat videos, analyze the position of the face frame in them, and take the aggregated result as the empirical position of the face frame.
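- One possible way to obtain the face frame (an assumption for illustration, not the embodiment's stated method) is OpenCV's bundled Haar cascade; the detected box, or a fallback area around the image centre, can then serve as the empirical position of the face frame used in the filtering below.

```python
# Detect a face box with OpenCV's Haar cascade; file name is hypothetical.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
frame = cv2.imread("current_frame.png")
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

if len(faces) > 0:
    x, y, w, h = faces[0]                  # recognized face frame
    face_center = (x + w // 2, y + h // 2)
else:
    h_img, w_img = gray.shape              # empirical fallback: image centre
    face_center = (w_img // 2, h_img // 2)
```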
- One or more of the following methods can be used to filter out the noise areas (a code sketch follows this list):
- ① Erode and dilate the binary image of the mask to filter out noise holes of very small area, for example holes smaller than 1 pixel, 2 pixels, or 3 pixels, which are often noise at the edge of the finger.
- ② Find all connected areas of the binary image of the mask and preliminarily filter out connected areas whose area is less than a preset threshold, for example 10 pixels or 20 pixels.
- ③ Filter out connected areas whose circumscribed rectangle is larger than a preset area threshold, for example 100 pixels or 200 pixels. Alternatively, calculate the area center of each connected area and determine the region in which the connected area lies based on the relationship between its center and the center of the empirical position of the face frame, for example whether the connected area is above, below, to the left of, or to the right of that center; set a different preset area threshold for each region, and then filter out connected areas whose circumscribed rectangle area is larger than the corresponding threshold.
- For example, if the connected area is to the left (or right) of the face center, the corresponding preset area threshold is, for example, 150 pixels, 200 pixels, or 220 pixels; if the connected area is above the face center, the threshold is, for example, 80 pixels or 100 pixels; and if the connected area is below the face center, the threshold is, for example, 300 pixels or 400 pixels. In other words, the preset area threshold for connected areas below the face center is larger than that for areas to the left (or right) of it, which in turn is larger than that for areas above it.
- ④ (The above three filtering steps can all be used, or only some of them, which the embodiment of the present invention does not limit.) Continue by determining whether the connected area overlaps the empirical area of the face frame. If there is overlap and the connected area does not belong to the bottom connected areas, it is determined that the current state is not typing and no subsequent processing is performed; if there is no overlap, or the connected area belongs to the bottom connected areas, proceed to step ⑤ for further judgment. It is also possible to directly determine that a connected area that does not overlap the face frame, or that belongs to the bottom connected areas, corresponds to the typing state, and to take that connected area as the finger area mask.
- ⑤ Calculate the aspect ratio of the connected area and filter out connected areas whose aspect ratio is below a set threshold, as well as inverted-triangle-shaped areas, because the finger area mask in the normal typing state has a relatively large width and height and generally does not have an inverted triangle shape. This step can also be performed before step ④, which is not limited in the embodiment of the present invention.
- ⑥ Calculate the ratio of the area of the remaining connected area to that of its circumscribed rectangle, and retain the connected area if the ratio is greater than a set threshold, for example 0.5 or 0.6.
- Steps ① to ⑥ can be executed in order; where there is no conflict, they can also be executed separately for each connected area, and whether the area belongs to the finger area mask is then judged based on the result of each step.
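- A minimal OpenCV sketch of noise-reduction steps ① to ③ above: a morphological open (erode, then dilate) removes tiny noise holes, and connected-component statistics drop regions that are too small or whose circumscribed rectangle exceeds an area threshold. The threshold values are the examples from the text; in the full method the rectangle threshold would vary with the region's position relative to the face frame.

```python
import cv2
import numpy as np

mask = cv2.imread("finger_mask.png", cv2.IMREAD_GRAYSCALE)  # hypothetical mask
kernel = np.ones((3, 3), np.uint8)
mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)       # step 1: erode + dilate

n, labels, stats, centroids = cv2.connectedComponentsWithStats(mask, connectivity=8)
clean = np.zeros_like(mask)
for i in range(1, n):                    # label 0 is the background
    x, y, w, h, area = stats[i]
    if area < 20:                        # step 2: too-small regions
        continue
    if w * h > 200:                      # step 3: oversized circumscribed rectangle
        continue
    clean[labels == i] = 255             # keep the candidate finger region
```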
- The remaining connected areas after the above noise reduction operation are shown in FIG. 15C: connected areas 61, 62, and 63 have been filtered out, and only connected areas 64 and 65 remain as real finger areas.
- The above semantic segmentation model training method can also be used to recognize other objects in the image, such as trash cans, backgrounds, ashtrays, folders, stylus pens, palms, arms, and so on, and likewise to identify the areas where these objects are located. For a trash can, the training samples are images of trash cans with the trash can area annotated; for a stylus, they are images of styluses with the stylus area annotated; and so on.
- Besides the semantic segmentation model used above for foreground segmentation (that is, for determining the area of the finger), other foreground segmentation methods can also be used, such as foreground segmentation based on frame difference, foreground segmentation based on motion shape, and so on.
- Using the aforementioned semantic segmentation model, however, segments the finger area in the current video frame more accurately, is not affected by the constantly moving body of the typist, and is fast enough to meet real-time requirements.
- the embodiment of the present invention also provides an image processing method, which is used to determine the replacement content based on the background area of the video frame. Please refer to FIG. 16, including the following steps:
- S700 Perform motion offset estimation based on the background frame and the current frame to obtain the motion offset matrix of the background frame relative to the current frame (in a specific implementation, the motion offset matrix of the current frame relative to the background frame can also be calculated).
- In the initial stage, if the image collected during the video chat does not contain a typing finger, it is used as the background frame; if it does contain a typing finger, the background frame is not determined until an image that does not contain the typing finger is collected, and that image is taken as the background frame.
- Specifically, the feature points of the current frame and the background frame can be detected first, the detected feature points of the two frames matched to find paired feature points, and the perspective transformation matrix then calculated from the paired feature points. The perspective transformation matrix is a matrix that characterizes the amount of motion of the background frame relative to the current frame.
- The feature points of the background frame and the current frame can be determined with feature point detection algorithms such as SIFT (Scale-Invariant Feature Transform), SURF (Speeded Up Robust Features), or ORB (Oriented FAST and Rotated BRIEF), and the feature points that match between the background frame and the current frame can be determined with feature point matching algorithms such as BF (brute-force matching) or FLANN (Fast Library for Approximate Nearest Neighbors).
- Perspective transformation is a mapping relationship between two images: through matrix multiplication, a point in one image is mapped to the corresponding point in another image. It includes perspective matrix calculation and coordinate mapping.
- The perspective transformation matrix is a 3×3 matrix H, set as:

  H = [ h11 h12 h13 ; h21 h22 h23 ; h31 h32 h33 ]

- H is obtained by solving a system of linear equations over the paired feature points and expresses the estimate of the motion offset from image A to image B. For each point (X1, Y1) in image A, the corresponding coordinates (X2, Y2) in the view plane of image B can be calculated by the matrix multiplication [x', y', w]ᵀ = H · [X1, Y1, 1]ᵀ, with X2 = x'/w and Y2 = y'/w.
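- A hedged OpenCV sketch of S700 follows: detect ORB feature points in both frames, brute-force match them, and estimate H from the paired points (SIFT/SURF detection or FLANN matching could be substituted, as the text notes). File names are hypothetical.

```python
import cv2
import numpy as np

background = cv2.imread("background_frame.png", cv2.IMREAD_GRAYSCALE)
current = cv2.imread("current_frame.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=1000)
kp1, des1 = orb.detectAndCompute(background, None)  # feature points, background frame
kp2, des2 = orb.detectAndCompute(current, None)     # feature points, current frame

matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)  # BF: brute-force matching
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)[:100]

src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)  # motion offset matrix H

# Mapping a point (X1, Y1) of the background frame into the current frame:
X1, Y1 = 10.0, 20.0
p = H @ np.array([X1, Y1, 1.0])
X2, Y2 = p[0] / p[2], p[1] / p[2]
```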
- S710 Perform motion compensation on the background frame according to the amount of motion determined in the previous step to obtain a compensated frame (that is, the background frame after motion compensation).
- The purpose of this process is to align the current frame and the background frame, eliminating the fragmentation of the human body region that human motion would otherwise cause in the filled image (in a specific implementation, motion compensation can instead be performed on the current frame based on the amount of motion, which is not limited in the embodiment of the present invention).
- S720 Calculate the background area for filling based on the motion-compensated background frame (i.e., the compensation frame) and the finger area mask, and use the content/image of that background area to fill/replace the finger area mask of the current frame. Specifically, the image at the location of the finger area mask can be taken from the compensated background frame as the background area for filling/replacement, and that background area then overlaid on the finger area mask of the current frame, thereby filling in the finger area mask of the current frame.
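- A minimal sketch of S710 and S720, continuing the homography sketch above: warp the background frame onto the current frame's view plane (motion compensation), then copy the warped background into the finger-area mask of the current frame. H is the matrix from the previous sketch; an identity placeholder is used here so the snippet runs standalone.

```python
import cv2
import numpy as np

background = cv2.imread("background_frame.png")             # hypothetical frames
current = cv2.imread("current_frame.png")
mask = cv2.imread("finger_mask.png", cv2.IMREAD_GRAYSCALE)  # noise-reduced mask

H = np.eye(3)                                               # placeholder for S700's H
h_img, w_img = mask.shape
compensated = cv2.warpPerspective(background, H, (w_img, h_img))  # S710

filled = current.copy()
filled[mask > 0] = compensated[mask > 0]                    # S720: fill finger area
```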
- S730 Use an ambient light rendering method to render the filled area and the surrounding background area so that the brightness of the picture is consistent, eliminating differences in brightness between adjacent video frames caused by the hardware. Step S730 is an optional step.
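- One simple substitute for S730 (an assumption for illustration, not the embodiment's ambient-light rendering) is to scale the luminance of the filled region toward the mean luminance of a surrounding band of background, so that the brightness of the picture stays consistent:

```python
import cv2
import numpy as np

filled = cv2.imread("filled_frame.png")                     # hypothetical inputs
mask = cv2.imread("finger_mask.png", cv2.IMREAD_GRAYSCALE)

ycc = cv2.cvtColor(filled, cv2.COLOR_BGR2YCrCb).astype(np.float32)
ring = cv2.dilate(mask, np.ones((15, 15), np.uint8)) & ~mask   # band around the fill
gain = ycc[..., 0][ring > 0].mean() / max(ycc[..., 0][mask > 0].mean(), 1e-6)
ycc[..., 0][mask > 0] *= gain                               # equalize luminance
out = cv2.cvtColor(np.clip(ycc, 0, 255).astype(np.uint8), cv2.COLOR_YCrCb2BGR)
```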
- The output frame is obtained based on the above processing and is used as the final video frame for video output.
- S740 Use the output frame obtained in step S730 as the new background frame, updating the background frame.
- the embodiment of the present invention provides a video communication method. Please refer to FIG. 8.
- the video communication method includes the following steps:
- S800 Obtain video frames in front of the display screen through the front camera.
- This solution can be applied to any electronic device with a video communication function, and the electronic device may have a built-in or an external camera. The camera is set under the display of the electronic device, or set near an input device of the electronic device, such as the keyboard, mouse, or touchpad. For example, the electronic device is a notebook computer, and the camera is set on the keyboard of the notebook computer or under its display screen.
- user A of the electronic device wants to start a video chat with another user B.
- User A opens the instant chat application of the electronic device, then opens the chat interface with user B, and then clicks the "video call" button.
- When the device detects this operation of user A, it establishes a video communication connection with user B's electronic device and turns on the camera of the electronic device to capture the video sent to user B. The obtained video contains video frames such as the one shown in Figure 9A.
- Alternatively, the user opens the contact interface of the electronic device 100, selects contact B, and then clicks the video call button; when the electronic device detects this operation of user A, it establishes a video communication connection with user B's electronic device.
- the front camera is turned on by default for video communication, but based on user A's selection operation or setting operation, the electronic device may also turn on the rear camera, which is not limited in the embodiment of the present invention.
- Another embodiment of the present invention provides a video processing method. Please refer to FIG. 17, which includes the following steps:
- S810 Read the current keyboard signal through the keyboard signal reading device of the electronic device, and use the current keyboard signal to determine whether there is an input signal. If there is an input signal, it is determined that a finger is typing; if there is no input signal, it is determined that no finger is typing.
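- A hedged illustration of this check: typing is assumed when the most recent key event and the frame capture time fall within a preset time threshold. The threshold value and the helper are assumptions; the embodiment does not specify a driver interface.

```python
import time

TYPING_WINDOW_S = 0.5   # preset time threshold (an assumed example value)

def is_typing(last_key_event_time: float, frame_time: float) -> bool:
    """True when the frame was captured close enough in time to a key event."""
    return abs(frame_time - last_key_event_time) <= TYPING_WINDOW_S

now = time.time()
print(is_typing(now - 0.2, now))   # key pressed 0.2 s before the frame -> True
```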
- In the initial stage, if the image collected during the video chat does not contain a typing finger, it is used as the background frame; if it does contain a typing finger, the background frame is not determined until an image that does not contain the typing finger is collected, and that image is taken as the background frame.
- S820a Determine whether there is a finger area mask through the semantic segmentation model; how it is determined is described above and will not be repeated here. The obtained finger area mask is, for example, shown at 90 in FIG. 18B.
- There is no fixed execution order between S820a and S810. In an optional embodiment, the processing of S820a is performed only when typing is determined, which reduces the amount of data processing of the electronic device.
- The finger area can also be covered with other pictures, or the finger area mask can be filled with the background area of the current image frame; the embodiments of the present invention do not list the options in detail and do not limit them.
- S850 Take the video frame with the finger in the finger area mask removed as a new video frame and output it; the new video frame can be transmitted to user B's electronic device for display, or displayed on user A's electronic device.
- FIG. 19 Another embodiment of the present invention provides a video communication method. Please refer to FIG. 19.
- the method includes the following steps:
- S1000 Obtain the video in front of the display screen through the front camera.
- The specific collection process is similar to that of S800 and will not be repeated here.
- S1010a After obtaining the video collected by the front camera, determine the area where the face is located;
- The area where the face is located can be identified by face recognition technology. In an optional embodiment, an empirical area in the center of the screen of the electronic device can instead be used as the empirical area of the face frame; or a large number of chat videos (or self-portraits) can be recognized, the position of the face frame in them analyzed, and the aggregated result taken as the empirical area of the face frame.
- S1010b Determine the finger region mask through the semantic segmentation model. How to determine the mask has been described above, and will not be repeated here. There is no order of execution between this step and S1010a.
- S1020 Determine whether the user in the current frame is typing through the mask of the face area and the finger area.
- step S1010a is an optional step.
- Step S1020 in FIG. 19 may include the following steps:
- S1100 Perform an erosion and dilation operation on the finger area mask to filter out noise holes of very small area, such as holes smaller than 1 pixel, 2 pixels, or 3 pixels, which are often noise at the edges of the fingers.
- S1110 Find all connected regions of the binary image of the mask and preliminarily filter out connected regions whose area is smaller than a preset threshold, for example 10 pixels or 20 pixels.
- S1120 Preliminarily screen the connected areas: calculate the area center of each connected area and determine the region in which it lies according to the relationship between its center position and the center position of the face frame, for example whether the connected area is above, below, to the left of, or to the right of the face center; then screen the connected areas against the circumscribed-rectangle area threshold of each region, filtering out connected areas whose circumscribed rectangle area is larger than the corresponding preset area threshold.
- For example, if the connected area is to the left (or right) of the face center, the corresponding preset area threshold is, for example, 150 pixels, 200 pixels, or 220 pixels; if the connected area is above the face center, the threshold is, for example, 80 pixels or 100 pixels; and if the connected area is below the face center, the threshold is, for example, 300 pixels or 400 pixels. In other words, the preset area threshold for connected areas below the face center is larger than that for areas to the left (or right) of it, which in turn is larger than that for areas above it.
- S1130 is then executed: determine whether the user in the current frame is typing through the filtered finger area mask and the area where the face is located. The judgment method is similar to the previous description and will not be repeated here. If it is determined based on S1130 that the user is not typing, a non-typing status code can be returned; if it is determined that the user is typing, a typing status code can be returned, or S1140 can be executed: refine the screening of the connected areas, which can include the following methods (sketched in code after the list):
- Method 1: Calculate the aspect ratio of the connected area, and filter out connected areas whose aspect ratio is less than a threshold (for example 0.5, 0.7, or 0.8) as well as inverted-triangle-shaped connected areas, because the width and height of the finger area mask in the normal typing state are relatively large and the mask generally does not form an inverted triangle.
- Method 2: Calculate the ratio of the area of each remaining connected area to that of its circumscribed rectangle. If the ratio is greater than a set threshold (for example 0.5 or 0.6), return typing status information to confirm that the current user is in the typing state, and take the finally screened finger area mask as the effective mask area, that is, the area where the effective fingers are located. In this step it is also possible to directly judge whether the area of the corresponding connected area is larger than a preset area (for example 60,000 pixels or 70,000 pixels); if it is, return typing status information to confirm that the current user is in the typing state, and take the finally screened finger area mask as the effective mask area, that is, the area where the effective fingers are located.
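- A sketch of the refined screening S1140: Method 1 drops connected areas whose width/height ratio is below a threshold, and Method 2 keeps areas whose fill ratio against their circumscribed rectangle exceeds a threshold. Threshold values are the examples from the text; the input file name is hypothetical.

```python
import cv2
import numpy as np

mask = cv2.imread("finger_mask.png", cv2.IMREAD_GRAYSCALE)
n, labels, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)

effective = np.zeros_like(mask)
typing = False
for i in range(1, n):                 # label 0 is the background
    x, y, w, h, area = stats[i]
    if w / h < 0.5:                   # Method 1: too narrow / inverted-triangle-like
        continue
    if area / float(w * h) > 0.5:     # Method 2: region fills its bounding rectangle
        typing = True
        effective[labels == i] = 255  # effective mask area (effective fingers)
```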
- S1040 Output a new video frame, which can be output to the electronic device where user B is located as a video frame for video communication; it can also be output to the electronic device where user A is located (at the same time).
- step S1050 Update the background frame with a new video frame.
- the execution order of step S1040 and step S1050 can be interchanged.
- The steps for judging whether a finger is typing in S810 and S1020 can be used alternatively or in combination; for example, when the judgment cannot be made in S810, S1020 can be used.
- An embodiment of the present invention provides a video processing method, including the following steps:
- the electronic device can collect and obtain video frames.
- The electronic device can perform video capture after detecting the user's video shooting operation (for example, clicking the video shooting button, making a preset gesture, issuing a voice command, etc.); it can also perform video capture when it detects that the user is in video communication with a peer electronic device, and then send the collected video to the peer electronic device.
- the electronic device can also obtain video frames from other electronic devices or the network.
- the embodiments of the invention are not listed in detail, and are not limited.
- When the video frame is captured by the electronic device itself, it can be a video frame captured by the front camera, a video frame captured by the rear camera, or a video frame obtained by merging frames captured by the front and rear cameras. It can also be a video frame collected by another camera connected to the electronic device 100; for example, the electronic device 100 establishes a connection with one or more of a drone, a TV, and a desktop computer, and acquires video frames through these devices.
- When the electronic device detects the user's video communication operation (or video shooting operation), it generates a video communication instruction and sends it to the processor; the processor responds to the instruction by starting the video communication software, which sends instructions to the camera driver to control the camera for video capture. The camera driver sends the collected data to the finger occlusion processing component, which performs the subsequent operations.
- The preset object can be an object set by default by the system, or an object designated by the user. For example, when the user collects video frames through the electronic device 100, because the user holds the electronic device 100 or types on it by hand, the electronic device 100 captures fingers that block the camera lens, fingers that are typing, and so on, all of which are images the user does not want captured. Or, when the user finds trash cans, ashtrays, and the like (preset objects) in the picture while taking a photo, the user can manually select the trash can, ashtray, etc. on the screen, so that these objects are designated as preset objects.
- the preset objects in the video can be removed, so that the video chat can better meet the needs of users and can also protect the privacy of users.
- The preset objects are, for example, the user's hand, fingers, a trash can, an ashtray, and so on. The electronic device can automatically determine the preset object area through the semantic segmentation model (how the area where the preset object is located is automatically determined through the semantic segmentation model is described in conjunction with FIGS. 3-5); it can also receive a user selection operation and determine the area where the preset object is located based on that operation. For example, in the process of shooting a video, the user clicks on the ashtray in the picture; after the electronic device detects the user's click operation, it determines that the user wants to recognize the ashtray, and an image recognition algorithm is therefore used to identify the area where the clicked object is located.
- The replacement content may be the content of the background area corresponding to the video frame; how the content of the background area is determined is described with reference to FIG. 7. It may also be content other than the background area, such as other images (for example, emoticons or icons), the content of the preset object area after mosaicing the video frame, and so on.
- S2120 Fill the area where the preset object is located with the replacement content to remove/replace the preset object, where part or all of the content of the preset object can be removed.
- The area where the preset object is located can be filled by the image processing method introduced above, or directly covered with other objects, such as covering the ashtray area with an emoticon, mosaicing the ashtray area, and so on.
- For example, the preset object area in the video is covered with a preset icon (the preset icon can be a default icon or a randomly changing icon), and when the electronic device detects that the user edits the preset icon, it removes the icon or replaces it with another icon. Or, an edit button is displayed on the video capture interface; after the user clicks the edit button, various editing operations (such as filters, icons, and collages) are displayed; after the user clicks the icon option, various icons are displayed; and then, based on a specific operation of the user (for example, dragging an icon onto the preset object), the icon is overlaid on the preset object. As another example, various icons are displayed directly on the video capture interface, and an icon is overlaid on the preset object based on a specific operation of the user.
- The video frame after the replacement processing can be transmitted to another electronic device to be displayed there; it can also be displayed on the current electronic device for its user, or stored in the electronic device.
- The above solution can also be used for image capture: for example, after detecting the user's image capture operation, the preset object in the image is identified and then removed. The removal method can be the image processing method introduced above, or the object can be covered with other pictures; the embodiments of the present invention do not list the options in detail and do not limit them.
- the display interface of the electronic device displays an interface 220 for real-time chatting with another user.
- the interface displays a video communication button 220a and a voice communication button 220b.
- The user of the electronic device is user A, and user B is standing next to user A. User A's fingers are placed on the keyboard, and user B clicks the video communication button (220a) with the mouse (or user A issues a voice command);
- As shown in FIG. 22b, the display includes a video communication interface 221, which includes a video preview interface 221a and a video display interface 221b. The video preview interface 221a displays the video frames currently collected (or processed) by the electronic device (for example, user A's video frames), and the video display interface 221b displays the video frames of the user of the opposite electronic device.
- Because user A's fingers have been placed on the keyboard and the electronic device has not detected a background frame that does not contain fingers, the electronic device does not trigger the finger removal function and displays the video frames containing the fingers.
- The user places his fingers on the keyboard again to type, and the captured video frames contain the user's fingers; the fingers in the video frames are removed by the method described above, and video frames without fingers are output.
- A prompt box 222 is displayed on the video frame to prompt "The finger in the video chat has been removed, please confirm whether to continue removing". The prompt box 222 includes a confirmation button 222a and a cancel button 222b, and user B clicks the cancel button 222b with the mouse (or user A issues a voice command);
- User A starts the video communication function.
- The user's hands are placed on his knees, and the captured video frame does not contain the user's fingers, as shown in Figure 23a; in this case, the electronic device sets the video frame shown in Figure 23a as the background frame and outputs the video frame, as shown in Figure 23b.
- user A starts typing through the keyboard, and the video frame collected by the electronic device contains the user's finger, as shown in Figure 23c; the electronic device determines that the content of the collected video frame conforms to the preset finger model, and then generates a prompt message.
- The prompt information is, for example, text, voice, or an icon. As shown in Figure 23d, a prompt box 130 is displayed; the prompt box 130 shows "A finger has been detected in the video frame, please confirm whether to remove it", and also displays a confirmation button 140 and a cancel button 150.
- the user A wants to remove the finger and clicks the confirmation button 140, and the electronic device detects that the user has clicked the confirmation button 140.
- The user continues typing on the keyboard, and the electronic device collects video frames containing the typing fingers, as shown in Figure 23e. Because the user has confirmed the removal of the fingers, the electronic device removes the fingers in the video frames based on the method described above and outputs video frames that do not contain fingers, as shown in Figure 23f.
- Prompt information can also be generated on the video frame.
- The prompt information is used to inform the user that the finger removal state is currently active.
- The prompt information can last for a period of time (for example, 1 or 2 seconds) and then disappear, or it can be displayed for as long as the finger removal state lasts. A cancel button can also be generated, and the finger removal mode is exited in response to the user clicking the cancel button; the prompt information and the cancel button can also be integrated into a single prompt button 230.
- If the user does not wish to remove the fingers, then when a video frame containing fingers is collected again, as shown in FIG. 23g, there is no need to judge whether the video frame contains content that conforms to the preset finger model, and the video frame containing the fingers is output directly, as shown in FIG. 23h.
- embodiments of the present invention also provide a computer-readable storage medium, including instructions, which when run on an electronic device, cause the electronic device to execute the method described in any embodiment of the present invention.
- embodiments of the present invention provide a computer program product, the computer program product includes software code, and the software code is used to execute the method described in any embodiment of the present invention.
- an embodiment of the present invention provides a chip containing instructions, which when the chip runs on an electronic device, causes the electronic device to execute the method described in any embodiment of the present invention.
- An embodiment of the present invention provides an electronic device, including a keyboard and a camera arranged near the keyboard. The electronic device further includes: a first collection module, configured to collect a first video frame through the camera; a first determining module, configured to, when it is determined that the first video frame contains content that conforms to a preset finger model, remove the finger in the first video frame to obtain a second video frame; and a display module, configured to display the second video frame and/or send the second video frame to the opposite electronic device for display.
- The first determining module includes: a first determining unit, configured to remove the finger in the first video frame when it is determined that the first video frame contains content conforming to the preset finger model and that the finger is located in the bottom area of the first video frame; a second determining unit, configured to remove the finger in the first video frame when it is determined that the first video frame contains content conforming to the preset finger model and that the finger area does not overlap the position of the human face; and a third determining unit, configured to remove the finger in the first video frame when it is determined that the first video frame contains content conforming to the preset finger model and that the finger is located in the bottom area of the first video frame and connected to a side of the first video frame.
- Alternatively, the first determining module includes: an obtaining unit, configured to obtain a keyboard input signal; and a fourth determining unit, configured to remove the finger in the first video frame when it is determined that the first video frame contains content that conforms to the preset finger model and the time of the obtained keyboard input signal and the time of obtaining the first video frame meet a preset time threshold.
- The first determining module is configured to: replace the content of the finger area in the first video frame with replacement content to obtain the second video frame; or crop the first video frame to obtain a second video frame that does not include the finger area; or fill the finger area with pixels in the vicinity of the finger area to obtain the second video frame.
- The electronic device further includes: a second acquisition module, configured to acquire a third video frame before the first video frame is acquired; a second determining module, configured to use the third video frame as a background frame when it is determined that the third video frame does not contain content that meets the preset finger model; and a third determining module, configured to determine, from the background frame, the content corresponding to the finger area in the first video frame as the replacement content.
- The first determining module is configured to: when it is determined that the first video frame contains content that conforms to the preset finger model and the finger is not in an abnormal state, remove the finger in the first video frame to obtain the second video frame. The abnormal state corresponds to at least one of the following situations: the user's two hands in the first video frame are located at the bottom of the first video frame and the distance between the two hands is greater than a first preset distance; one of the user's hands in the first video frame is located in the bottom area and the other hand is farther than a preset distance from the bottom area; the area of the user's fingers in the first video frame is greater than a preset area threshold; the user's fingers in the first video frame cover the face.
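- A hedged sketch of the abnormal-state test listed above follows. Hand and face boxes are (x, y, w, h) tuples; every threshold value is an illustrative assumption, since the embodiment only names the conditions.

```python
def boxes_overlap(a, b):
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def is_abnormal(hand_boxes, finger_area, face_box, frame_h,
                hand_gap=400, bottom_margin=40, dist_thresh=120,
                area_thresh=70000):
    at_bottom = [b for b in hand_boxes if b[1] + b[3] >= frame_h - bottom_margin]
    if len(hand_boxes) == 2 and len(at_bottom) == 2:
        left, right = sorted(b[0] + b[2] / 2 for b in hand_boxes)
        if right - left > hand_gap:
            return True                  # both hands at the bottom, too far apart
    if len(hand_boxes) == 2 and len(at_bottom) == 1:
        other = next(b for b in hand_boxes if b not in at_bottom)
        if frame_h - (other[1] + other[3]) > dist_thresh:
            return True                  # one hand at the bottom, the other far from it
    if finger_area > area_thresh:
        return True                      # finger area exceeds the preset area threshold
    if face_box is not None and any(boxes_overlap(face_box, b) for b in hand_boxes):
        return True                      # fingers cover the face
    return False
```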
- The electronic device further includes: a fourth determining module, configured to send the first video frame to the display for display when it is determined that the first video frame contains content conforming to the preset finger model and that the finger is in an abnormal state.
- The first determining module is configured to: input the first video frame into a semantic segmentation model, determine the finger area in the first video frame through the semantic segmentation model, and, when the finger area exists, determine that the first video frame contains content that meets the preset finger model; the semantic segmentation model is obtained through training on sample photos, each of which contains the user's fingers and is annotated with the finger area.
- An embodiment of the present invention provides an electronic device, including: an obtaining module, configured to obtain a first video frame and obtain a keyboard input signal; a fifth determining module, configured to remove the finger in the first video frame to obtain a second video frame when it is determined that the first video frame contains content that conforms to a preset finger model and the time of the obtained keyboard input signal and the time of obtaining the first video frame meet a preset time threshold; and an instruction module, configured to instruct display of the second video frame.
- The fifth determining module is configured to: replace the content of the finger area in the first video frame with replacement content to obtain the second video frame; or crop the first video frame to obtain a second video frame that does not include the finger area; or fill the finger area with pixels in the vicinity of the finger area to obtain the second video frame.
- The electronic device further includes: a third acquisition module, configured to acquire a third video frame before the first video frame is acquired; a sixth determining module, configured to use the third video frame as a background frame when it is determined that the third video frame does not contain content that meets the preset finger model; and a seventh determining module, configured to determine, from the background frame, the content corresponding to the finger area in the first video frame as the replacement content.
- The fifth determining module is configured to: when it is determined that the first video frame contains content that conforms to the preset finger model, the finger is not in an abnormal state, and the time of the obtained keyboard input signal and the time of obtaining the first video frame meet the preset time threshold, remove the finger in the first video frame to obtain the second video frame. The abnormal state corresponds to at least one of the following situations: the user's two hands in the first video frame are located at the bottom of the first video frame and the distance between the two hands is greater than a first preset distance; one of the user's hands in the first video frame is located in the bottom area and the other hand is farther than a preset distance from the bottom area; the area of the user's fingers in the first video frame is greater than a preset area threshold; the user's fingers in the first video frame cover the face.
- The electronic device further includes: an eighth determining module, configured to instruct display of the first video frame when it is determined that the first video frame contains content that conforms to the preset finger model and that the finger is in an abnormal state.
- The fifth determining module is configured to: input the first video frame into a semantic segmentation model, determine the finger area in the first video frame through the semantic segmentation model, and, when the finger area exists, determine that the first video frame contains content that meets the preset finger model; the semantic segmentation model is obtained through training on sample photos, each of which contains the user's fingers and is annotated with the finger area.
- the above-mentioned electronic devices and the like include hardware structures and/or software modules corresponding to the respective functions.
- The embodiments of the present application can be implemented in the form of hardware or a combination of hardware and computer software. Whether a certain function is executed by hardware or by computer-software-driven hardware depends on the specific application and the design constraints of the technical solution. Those skilled in the art can use different methods for each specific application to implement the described functions, but such implementations should not be considered to go beyond the scope of the embodiments of the present invention.
- each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module.
- the above-mentioned integrated modules can be implemented in the form of hardware or software functional modules. It should be noted that the division of modules in the embodiment of the present invention is illustrative, and is only a logical function division, and there may be other division methods in actual implementation. The following is an example of dividing each function module corresponding to each function:
- the methods provided in the embodiments of the present application may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
- When implemented by software, the methods can be implemented in whole or in part in the form of a computer program product.
- the computer program product includes one or more computer instructions.
- the computer may be a general-purpose computer, a special-purpose computer, a computer network, a network device, an electronic device, or other programmable devices.
- the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium.
- The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (such as infrared, radio, or microwave) means.
- the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server or a data center integrated with one or more available media.
- the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, and a magnetic tape), an optical medium (for example, a digital video disc (DVD)), or a semiconductor medium (for example, SSD).
- the disclosed system, device, and method can be implemented in other ways.
- The device embodiments described above are merely illustrative; for example, the division of units is only a logical function division, and there may be other divisions in actual implementation: multiple units or components can be combined or integrated into another system, or some features can be ignored or not implemented.
- the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
- the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
- the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- Signal Processing (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Human Computer Interaction (AREA)
- User Interface Of Digital Computer (AREA)
Abstract
Disclosed are a video collection control method, an electronic device, a computer-readable storage medium, a computer program product and a chip, relating to the field of image processing. The electronic device comprises: a display, a keyboard, a camera and a processor, wherein the camera is arranged close to the keyboard, and is used for collecting a video frame during video communication, and sending the collected video frame to the processor; and the processor is connected to the display, the keyboard and the camera, and is used for receiving a first video frame from the camera, then removing, when it is determined that the first video frame includes content conforming to a preset finger model, a finger from the first video frame in order to obtain a second video frame, sending the second video frame to the display for displaying, and/or sending the second video frame to an electronic device at an opposite end for displaying. The technical problem in the prior art of the proportions of a finger being easily distorted during a video call process is solved. The method can be used for an artificial intelligence device. The method is related to technology such as deep learning.
Description
This application claims priority to Chinese patent application No. 201911315367.1, filed with the State Intellectual Property Office of China on December 19, 2019 and entitled "A video capture control method, electronic device, and computer-readable storage medium", the entire content of which is incorporated in this application by reference.
This application relates to the field of image processing, and in particular to a video capture control method, electronic device, computer-readable storage medium, computer program product, and chip.
With the popularization of electronic devices, the forms of electronic products have become more and more diversified. Taking the notebook computer as an example, the camera of a traditional notebook computer is usually located at the top of the machine; in recent years, however, for the protection of personal privacy, many notebook computer cameras have been set below the screen or placed at the top of the keyboard as hidden cameras. As shown in FIG. 1, the camera 11 is set near the keyboard 10 of the notebook computer 1. The low camera angle, however, means that if the user types during a video call, the fingers are likely to block the camera's field of view, so that the proportions of the fingers presented on the screen are distorted; as shown in FIG. 2, this produces an octopus-like effect.
Summary of the invention
The video communication method, video capture method, and electronic device provided by this application avoid distortion of finger proportions during video communication and improve the quality of video communication.
In a first aspect, an embodiment of the present invention provides an electronic device, including a display, a keyboard, a camera, and a processor. The camera is arranged near the keyboard and is used to collect video frames during video communication and send the collected video frames to the processor. The processor is connected to the display, keyboard, and camera, and is used to receive a first video frame from the camera and, when it determines that the first video frame contains content that conforms to a preset finger model, remove the finger in the first video frame to obtain a second video frame, send the second video frame to the display for display, and/or send the second video frame to the opposite end electronic device for display. This solves the technical problem of finger distortion in the output video frames and prevents the video transmitted in video communication from being an incomplete picture blocked by fingers. Moreover, in this embodiment of the present invention, by determining whether the first video frame contains content that conforms to the preset finger model, the finger area is recognized automatically, the finger in the first video frame is removed automatically, and the second video frame is output without manual removal by the user, which improves the efficiency of finger removal; by intelligently recognizing the finger in real time and removing it automatically, the intelligence of human-computer interaction is improved without affecting the fluency of the video communication.
In an optional embodiment, removing the finger in the first video frame when it is determined that the first video frame contains content conforming to the preset finger model includes: determining that the first video frame contains content conforming to the preset finger model and that the finger is located in the bottom area of the first video frame, and removing the finger in the first video frame; or determining that the first video frame contains content conforming to the preset finger model and that the finger area does not overlap the position of the face, and removing the finger in the first video frame; or determining that the first video frame contains content conforming to the preset finger model and that the finger is located in the bottom area of the first video frame and connected to a side of the first video frame, and removing the finger in the first video frame. Based on this solution, the finger is removed only when it is located in a specific region of the first video frame, so the automatic removal is more precise and the possibility of removing an undistorted finger is reduced.
Optionally, determining that the first video frame contains content conforming to the preset finger model and then removing the finger in the first video frame includes: obtaining a keyboard input signal; and, when it is determined that the first video frame contains content conforming to the preset finger model and the time of the obtained keyboard input signal and the time of obtaining the first video frame meet a preset time threshold, removing the finger in the first video frame. Because the detection of a keyboard input signal is one of the conditions for removing the finger, this solution ensures that the removed finger is a typing finger, thereby precisely removing the distorted finger and preventing the finger from blocking the video frame.
In a second aspect, an embodiment of the present invention provides an electronic device, including: one or more processors; one or more memories; multiple application programs; and one or more computer programs, where the one or more computer programs are stored in the one or more memories and include instructions which, when executed by the one or more processors of the electronic device, cause the electronic device to perform the following steps: obtain a first video frame and obtain a keyboard input signal; when it is determined that the first video frame contains content that conforms to a preset finger model and the time of the obtained keyboard input signal and the time of obtaining the first video frame meet a preset time threshold, remove the finger in the first video frame to obtain a second video frame; and instruct display of the second video frame. This solution solves the technical problem of finger distortion in the output video frames, prevents the video transmitted in video communication from being an incomplete picture blocked by fingers, and, by recognizing the finger area automatically and removing the finger without manual operation, improves the efficiency of finger removal and the intelligence of human-computer interaction without affecting the fluency of the video communication. Moreover, because the detection of a keyboard input signal is one of the conditions for removing the finger, the typing finger in the video frame can be removed precisely.
In a third aspect, an embodiment of the present invention provides a video capture control method applied to an electronic device, the electronic device including a keyboard and a camera arranged near the keyboard. The method includes: collecting a first video frame through the camera; when it is determined that the first video frame contains content that conforms to a preset finger model, removing the finger in the first video frame to obtain a second video frame; and displaying the second video frame and/or sending the second video frame to the opposite end electronic device for display. This solution solves the technical problem of finger distortion in the output video frames, prevents the transmitted video from being an incomplete picture blocked by fingers, and, by recognizing the finger area automatically and removing the finger without manual operation, improves the efficiency of finger removal and the intelligence of human-computer interaction without affecting the fluency of the video communication.
In a fourth aspect, an embodiment of the present invention provides a video communication control method, including: obtaining a first video frame, and obtaining a keyboard input signal; when it is determined that the first video frame contains content matching a preset finger model, and the time at which the keyboard input signal is obtained and the time at which the first video frame is obtained satisfy a preset time threshold, removing the finger from the first video frame to obtain a second video frame; and instructing display of the second video frame. This solution solves the technical problem of finger distortion in the output video frame and prevents the video transmitted during video communication from being an incomplete picture occluded by fingers. Moreover, in this embodiment of the present invention, the finger area is recognized automatically by determining whether the first video frame contains content matching the preset finger model, so that the finger in the first video frame is removed automatically and the second video frame is output without manual removal by the user, which improves the efficiency of finger removal; by intelligently recognizing the finger in real time and removing it automatically, the intelligence of human-computer interaction is improved without affecting the fluency of the video communication. In addition, in this solution, detection of a keyboard input signal is one of the conditions for removing the finger, so the typing finger in the video frame can be removed accurately.
In a fifth aspect, an embodiment of the present invention provides an electronic device, including: one or more processors; one or more memories; multiple application programs; and one or more computer programs, where the one or more computer programs are stored in the one or more memories and include instructions which, when executed by the one or more processors of the electronic device, cause the electronic device to perform the following steps: obtaining a first video frame; when it is determined that the first video frame contains content matching a preset finger model and that the finger is located in the bottom area of the first video frame, removing the finger from the first video frame to obtain a second video frame; and instructing display of the second video frame. This solution solves the technical problem of finger distortion in the output video frame and prevents the video transmitted during video communication from being an incomplete picture occluded by fingers. Moreover, in this embodiment of the present invention, the finger area is recognized automatically by determining whether the first video frame contains content matching the preset finger model, so that the finger in the first video frame is removed automatically and the second video frame is output without manual removal by the user, which improves the efficiency of finger removal; by intelligently recognizing the finger in real time and removing it automatically, the intelligence of human-computer interaction is improved without affecting the fluency of the video communication. In addition, in this solution, detecting that the finger area is located in the bottom area of the video frame is used as the condition for removing the finger, so a finger placed at that particular position can be distinguished from fingers at other positions, thereby achieving more accurate finger removal.
FIG. 1 is a structural diagram of a prior-art notebook computer whose camera is located in the keyboard area;
FIG. 2 is a schematic diagram of a prior-art picture containing typing fingers, captured by the front camera of the notebook computer shown in FIG. 1;
FIG. 3 is a structural diagram of an electronic device according to an embodiment of the present invention;
FIG. 4 is a software framework diagram of an embodiment of the present invention;
FIG. 5 is a schematic diagram of another structure of the electronic device introduced in an embodiment of the present invention;
FIG. 6 is a flowchart of the video control method introduced in an embodiment of the present invention, shown alongside the corresponding interfaces;
FIG. 7 is a schematic diagram of the finger area determined in an embodiment of the present invention;
FIG. 8 is a schematic diagram of one implementation of replacing the finger area with replacement content in an embodiment of the present invention;
FIG. 9 is a schematic diagram of another implementation of replacing the finger area with replacement content in an embodiment of the present invention;
FIG. 10 is a schematic diagram of generating prompt information after the finger is removed in an embodiment of the present invention;
FIG. 11 is a schematic diagram of generating prompt information before the finger is removed in an embodiment of the present invention;
FIG. 12 is a flowchart of a method for training the semantic segmentation model in an embodiment of the present invention;
FIG. 13 is a flowchart of identifying the finger area in an image based on the semantic segmentation model in an embodiment of the present invention;
FIG. 14 is a flowchart of the semantic inference performed when the finger area in an image is recognized based on the semantic segmentation model in an embodiment of the present invention;
FIG. 15A is a schematic diagram of an image of typing fingers captured by the front camera in an embodiment of the present invention;
FIG. 15B is a schematic diagram of the finger area mask determined by recognizing the image shown in FIG. 15A based on the semantic segmentation model in an embodiment of the present invention;
FIG. 15C is a schematic diagram of the finger area mask after noise reduction is performed on the finger mask area in an embodiment of the present invention;
FIG. 16 is a flowchart of the image processing method introduced in an embodiment of the present invention;
FIG. 17 is a flowchart of the video communication method introduced in an embodiment of the present invention;
FIGS. 18A-18C are schematic diagrams of a video frame captured by the front camera, the typing-finger area in the video frame, and the processed output image frame in the video communication method of an embodiment of the present invention;
FIG. 19 is a flowchart of a video communication method introduced in another embodiment of the present invention;
FIG. 20 is a flowchart of determining whether a user is in a typing state in the video communication method introduced in another embodiment of the present invention;
FIG. 21 is a flowchart of a video processing method introduced in an embodiment of the present invention;
FIG. 22 shows the interface changes in one specific application scenario of the present invention;
FIG. 23 shows the interface changes in another specific application scenario of the present invention.
The technical solutions in the embodiments of the present application are described below with reference to the accompanying drawings of the embodiments. In the description of the embodiments of the present application, unless otherwise specified, "/" means "or"; for example, A/B may represent A or B. "And/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may represent three cases: only A exists, both A and B exist, and only B exists.
Hereinafter, the terms "first" and "second" are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include one or more such features. In the description of the embodiments of the present application, unless otherwise specified, "multiple" means two or more.
The following describes the application scenarios involved in the embodiments of the present application. An electronic device is equipped with a camera, a microphone, a global positioning system (GPS) chip, various sensors (for example, a magnetic field sensor, a gravity sensor, and a gyroscope sensor), and similar components for sensing the external environment and the user's actions. Based on the sensed external environment and user actions, the electronic device provides the user with a personalized, contextual service experience; among these components, the camera can obtain rich and accurate information that allows the electronic device to perceive the external environment and the user's actions. An embodiment of the present application provides an electronic device, which may be implemented as any of the following: a mobile phone, a tablet computer (pad), a portable game console, a personal digital assistant (PDA), a notebook computer, an ultra-mobile personal computer (UMPC), a handheld computer, a netbook, an in-vehicle media playback device, a wearable electronic device, a virtual reality (VR) terminal device, an augmented reality (AR) terminal device, or a similar product.
First, an exemplary electronic device 100 provided in the following embodiments of the present application is introduced.
FIG. 3 is a schematic diagram of the structure of the electronic device 100.
The embodiment is described in detail below, taking the electronic device 100 as an example. It should be understood that the electronic device 100 shown in FIG. 3 is only an example: the electronic device 100 may have more or fewer components than shown in FIG. 3, may combine two or more components, or may have a different component configuration. The various components shown in the figure may be implemented in hardware, in software, or in a combination of hardware and software including one or more signal-processing and/or application-specific integrated circuits.
The electronic device 100 may include: a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone jack 170D, a sensor module 180, buttons 190, a motor 191, an indicator 192, a camera 193, a display 194, a subscriber identification module (SIM) card interface 195, and so on. The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and so on.
It can be understood that the structure illustrated in this embodiment of the present invention does not constitute a specific limitation on the electronic device 100. In other embodiments of the present application, the electronic device 100 may include more or fewer components than shown, combine certain components, split certain components, or arrange the components differently. The illustrated components may be implemented in hardware, in software, or in a combination of software and hardware. For a detailed description of the structure of the electronic device 100, refer to the earlier patent application CN201910430270.9.
As shown in FIG. 4, the software architecture involved in the present application includes: an application layer, the Windows multimedia framework, a control layer, a core layer, a platform layer, the camera driver, and the camera hardware. The finger occlusion processing module involved in the present application is integrated into the MFT (Media Foundation Transforms) module of the core layer, and this module can also integrate other functions. After the video stream is obtained from the camera driver, it is passed to the Media Source module of the core layer and then, through the Media Source module, to the MFT module; the finger occlusion processing module then processes the input video frames to remove the fingers they contain, and the processed video frames are delivered through the Media Sink to application software, for example video communication software.
The kernel layer is the layer between hardware and software. The kernel layer includes at least a display driver, a camera driver, an audio driver, a sensor driver, and a finger occlusion processing component. The finger occlusion processing component integrates the function, introduced in the embodiments of the present invention, of processing video frames containing a preset object: it can identify the finger in a video frame and remove it to obtain a video frame that contains no fingers. The finger occlusion processing component can then output the processed video frame to the display; if the processed video frame is to be transmitted to the peer device, the finger occlusion processing component delivers the processing result to the video application software with the Windows multimedia framework as an intermediary, and the result is transmitted to the peer through the end-to-end connection established by the video application software.
In a first aspect, an embodiment of the present invention provides an electronic device 100. Referring to FIG. 5, the electronic device includes:
a display 50;
a keyboard 51;
a camera 52, arranged near the keyboard 51. As shown in FIG. 5, the camera 52 may be arranged on the plane where the keyboard 51 is located, as with cameras 52a and 52c; the camera 52a is arranged within the area of the keyboard, where the area of the keyboard refers to the rectangle defined by the top-left and bottom-right corner points of the keyboard. Alternatively, the camera 52a may be arranged within the area obtained by expanding this rectangle outward by a preset distance (for example, 0.5 cm, 1 cm, or 2 cm). Alternatively, the camera 52 may be arranged on the bezel 50a of the display 50, for example on the left, right, or lower bezel of the display 50. As an optional implementation, the camera 52 may also be arranged on the lower part of the bezel, for example within the lower 1/2 or 1/3 of the bezel, or at the bottom of the bezel, and so on. The camera is arranged near the keyboard and is configured to capture video frames during video communication and send the captured video frames to the processor.
The camera 52b shown in FIG. 5 is arranged on the lower part of the bezel 50a.
a processor (not shown in the figure), connected to the display 50, the keyboard 51, and the camera 52. The processor is configured to receive a first video frame from the camera and, when it is determined that the first video frame contains content matching the preset finger model, remove the finger from the first video frame to obtain a second video frame, and send the second video frame to the display for display, and/or send the second video frame to a peer electronic device for display.
The video processing method of the embodiment of the present invention is described below with reference to the above structure. Referring to FIG. 6, the method includes the following steps:
S600: capturing a first video frame through the camera;
S610: when it is determined that the first video frame contains content matching the preset finger model, removing the finger from the first video frame to obtain a second video frame; and sending the second video frame to the display for display, and/or sending the second video frame to the peer electronic device for display.
In S600, in a specific implementation, the electronic device 100 may perform video capture after detecting a video shooting operation of the user (for example, clicking a video shooting button, producing a preset gesture, or issuing a voice command); it may also perform video capture when it detects that the user is in video communication with a peer electronic device, and then send the captured video to the peer electronic device. For example, when the electronic device detects a video communication operation (or a video shooting operation) of the user, it generates a video communication instruction and sends it to the processor; the processor responds to the video communication instruction by starting the video communication software and sending an instruction to the camera driver to control the camera to capture video. The camera driver sends the captured data to the finger occlusion processing component, which performs the subsequent operations.
Step S610 may be executed by the processor.
In S610, the first video frame may be input into the semantic segmentation model, the mask of the finger area in the first video frame is determined through the semantic segmentation model, and the finger area in the first video frame is then determined from the mask of the finger area; the mask of the finger area may be taken directly as the finger area, or the finger area may be obtained after noise reduction is performed on the mask of the finger area. When the finger area exists, it is determined that the first video frame contains content matching the preset finger model. The semantic segmentation model is obtained by training on sample photos, each of which contains the user's fingers and is annotated with the finger area. How the finger area in a video frame is determined through the semantic segmentation model is introduced later and is not repeated here. FIG. 7 is a schematic diagram of the determined finger area and contains six sub-figures, FIGS. 7a to 7f. FIG. 7b is a captured frame containing the user's fingers; for the video frame shown in FIG. 7b, the finger area determined through the semantic segmentation model is, for example, as shown in FIG. 7a.
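For illustration, the following is a minimal sketch of the optional mask noise-reduction step mentioned above, using off-the-shelf morphological filtering; the function names, kernel size, and the use of OpenCV are assumptions made for this sketch and are not specified by the embodiment.

```python
import cv2
import numpy as np

def denoise_finger_mask(mask: np.ndarray, kernel_size: int = 5) -> np.ndarray:
    """Remove speckle noise from, and fill small holes in, a binary finger mask."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    # Opening removes isolated false-positive pixels outside the finger region.
    opened = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    # Closing fills small gaps inside the detected finger region.
    return cv2.morphologyEx(opened, cv2.MORPH_CLOSE, kernel)

def finger_present(mask: np.ndarray) -> bool:
    """The frame is deemed to contain a finger if any mask pixel survives denoising."""
    return bool(np.any(denoise_finger_mask(mask)))
```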
In a specific implementation, in step S610 the finger may be removed from the first video frame as soon as it is determined that the first video frame contains content matching the preset finger model (that is, all fingers in the first video frame are removed). In an optional embodiment, it may first be determined whether a preset condition is satisfied, and the finger is removed from the first video frame only when the preset condition is satisfied. The preset condition can take many forms; four of them are introduced below, although the specific implementation is of course not limited to these four cases. (All four preset conditions correspond to the presence of the user's typing fingers in the first video frame. The purpose of checking the preset condition is to remove fingers only when typing fingers are present in the first video frame, while fingers that are not typing can be retained, for example when the user raises a finger to show the other party a manicure, or lifts a teacup to drink tea; in such cases the fingers are ones the user does not want removed. This ensures that the user's typing fingers (or fingers placed on the keyboard) are removed while other fingers are not; since other fingers are usually not close to the camera, they do not suffer from deformation. Based on the above technical solution, the possibility of severely distorted fingers occluding the picture is reduced on the one hand, while normally proportioned fingers are retained on the other, achieving the technical effect of accurately removing fingers with distorted proportions.)
First, when it is determined that the first video frame contains content matching the preset finger model and that the finger is located in the bottom area of the first video frame, it is determined that a typing finger exists, and the finger can be removed from the first video frame.
As shown in FIG. 7a, the mask contains two connected regions, 64 and 65, both of which touch the bottom; FIG. 7a therefore contains two bottom connected regions. In this case, typing fingers exist in the first video frame, the preset condition is satisfied, and the fingers in the first video frame can be removed.
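A hedged sketch of this check is given below: connected components of the denoised mask are computed, and a component counts as a bottom connected region if it touches the last pixel row. This is one plausible realization, not the embodiment's own code.

```python
import cv2
import numpy as np

def bottom_connected_regions(mask: np.ndarray) -> list:
    """Return the labels of mask components that touch the bottom edge of the frame."""
    num_labels, labels = cv2.connectedComponents(mask.astype(np.uint8))
    bottom_row = labels[-1, :]  # last row of the label image
    return [lbl for lbl in range(1, num_labels) if np.any(bottom_row == lbl)]

def typing_finger_detected(mask: np.ndarray) -> bool:
    """Condition 1: at least one finger component reaches the bottom of the frame."""
    return len(bottom_connected_regions(mask)) > 0
```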
Second, when it is determined that the first video frame contains content matching the preset finger model and that the finger area does not overlap the area where the face is located, it is determined that a typing finger exists, and the finger can be removed from the first video frame.
For example, the area where the face is located is usually the central area of the video frame. If the user raises a finger to show something to the other party, a connected region may fall within the area where the face is located; in that case the finger is not a typing finger but one being shown to the other party, the preset condition is not satisfied, and the finger in the first video frame does not need to be removed. Only when the finger area does not overlap the area where the face is located is the preset condition satisfied and the finger in the first video frame removed. The overlap between the finger area and the area where the face is located may be partial or complete, which is not limited in this embodiment of the present invention.
Third, a keyboard input signal is obtained; when it is determined that the first video frame contains content matching the preset finger model, and the time at which the keyboard input signal is obtained and the time at which the first video frame is obtained satisfy a preset time threshold, the finger in the first video frame is removed. In a specific implementation, if a keyboard input signal is obtained, typing fingers exist in the first video frame, so the preset condition is satisfied and the fingers in the first video frame can be removed. In a specific implementation, whether a keyboard input signal exists can be determined by monitoring the keyboard signal: whether the preset condition is satisfied is judged by whether a keyboard input signal is detected within a preset time period before or after the first video frame is captured. If a keyboard input signal is detected within the preset time period (for example, 1 second or 2 seconds) before or after, the preset condition is deemed satisfied; otherwise it is deemed not satisfied.
Since in the above solution the keyboard input signal is one of the conditions that trigger removal of the finger from the first video frame, it can be ensured that the removed fingers are the fingers of a user who is typing.
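As a minimal sketch of this timing check, assuming frame and keystroke timestamps in seconds (the 1-second default mirrors the example value above; nothing here is mandated by the embodiment):

```python
def keyboard_active_near_frame(frame_ts, keystroke_timestamps, threshold_s=1.0):
    """True if any keyboard input occurred within threshold_s of the frame time."""
    return any(abs(frame_ts - ks) <= threshold_s for ks in keystroke_timestamps)
```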
Fourth, when it is determined that the first video frame contains content matching the preset finger model, and that the finger is located in the bottom area of the first video frame and is connected to a side edge of the first video frame, the finger in the first video frame is removed.
In a specific implementation, when a finger is located in the bottom area it is very likely a typing finger, but it may also be a finger providing input through the touchpad. Typing fingers are usually located on the two sides of the keyboard and therefore generally touch a side edge of the first video frame; for example, in FIG. 7a the bottom connected region 65 touches the right edge of the first video frame. This solution can therefore locate typing fingers more precisely and accurately remove the typing fingers from the first video frame. Accordingly, if the finger is located in the bottom area of the first video frame and is connected to a side edge of the first video frame, the preset condition is deemed satisfied; otherwise it is deemed not satisfied.
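The following sketch illustrates this stricter check, requiring a mask component to touch both the bottom edge and the left or right edge; it reuses the assumed connected-component representation from the earlier sketches.

```python
import cv2
import numpy as np

def touches_bottom_and_side(mask: np.ndarray) -> bool:
    """Condition 4: some component reaches the bottom edge AND a side edge."""
    num_labels, labels = cv2.connectedComponents(mask.astype(np.uint8))
    for lbl in range(1, num_labels):
        region = labels == lbl
        on_bottom = np.any(region[-1, :])
        on_side = np.any(region[:, 0]) or np.any(region[:, -1])
        if on_bottom and on_side:
            return True
    return False
```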
In S610, when it is determined that the first video frame contains content matching the preset finger model and the finger area has been determined, it may further be determined whether the finger area is in an abnormal state. Whether the finger area is in an abnormal state may be determined after it is determined that the preset condition is satisfied: if the finger area is in an abnormal state, the finger in the first video frame is not removed and the first video frame is output directly (the step of determining that the finger area is in an abnormal state and directly outputting the first video frame, for example displaying it on the display or sending it to the peer electronic device for display, may be executed by the processor); otherwise the finger in the first video frame is removed. Alternatively, whether the finger area determined based on the preset finger model is in an abnormal state may be judged as soon as it is detected that the first video frame contains content matching the preset finger model; this is not limited in this embodiment of the present invention.
In a specific implementation, the abnormal state can take many forms; four of them are introduced below, although the specific implementation is of course not limited to these four cases.
In the first case, the area of the finger region is greater than a preset area threshold (as shown in FIG. 7c), for example greater than 5000 pixels, or greater than 1/4 or 1/3 of the total area of the first video frame. In this case, if the fingers in the first video frame were removed, the finger region might not blend with the other regions. Therefore, to ensure that the finger region of the output video frame (the second video frame) blends with its background region, the operation of removing the fingers from the first video frame is performed only when the area of the finger region is not greater than the preset area threshold.
In the second case, the finger area overlaps the area 66 where the face is located. As shown in FIG. 7d, the finger area contains three connected regions: two bottom connected regions (that is, regions located at the bottom of the first video frame) and one middle connected region that overlaps the area where the face is located (how the area where the face is located is determined is introduced later). In this case, one of the user's hands in the first video frame is in the bottom area and the other is not, which usually means the user is typing with one hand while doing something other than typing with the other (for example, drinking water or touching their hair). In this case the user does not want the non-typing fingers removed, and removing only the typing fingers would make the picture look abrupt. Therefore, the finger area is considered to be in an abnormal state, the fingers in the first video frame are not removed, and the first video frame is output directly.
In the third case, there are at least two bottom connected regions and the distance between them is greater than a preset distance threshold, for example 100 pixels or 150 pixels. In this case, the distance between the user's two hands in the first video frame is greater than a first preset distance (the first preset distance is equal to, or positively correlated with, the preset distance threshold). As shown in FIG. 7e, this usually means that in the first video frame the user is typing with one hand while touching the touchpad with the other, and the region where the touchpad is touched often corresponds to the region of the user's neck. Removing the finger area there would make the neck region transition unnaturally, while removing only the typing fingers and keeping the touchpad fingers would make the picture look abrupt; this state can therefore be deemed an abnormal state.
In the fourth case, there is a non-bottom connected region, as shown in FIG. 7f. This usually means the user is typing with one hand and doing something else with the other. Removing both hands would not meet the user's needs, while removing only the fingers in the bottom area of the first video frame would make the picture look abrupt; this is therefore deemed an abnormal state, and the fingers in the first video frame are not removed.
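The four abnormal-state cases above can be combined into a single screening routine. The sketch below is a loose illustration under assumed inputs (per-component area, bounding box, and a bottom-contact flag, plus an optional face box); the thresholds reuse the example values from the text, and none of the names come from the embodiment.

```python
def boxes_overlap(a, b):
    """Axis-aligned overlap test for boxes given as (x0, y0, x1, y1)."""
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

def finger_area_abnormal(components, face_box=None,
                         area_thresh=5000, dist_thresh=100):
    """components: [{'area': int, 'bbox': (x0, y0, x1, y1), 'touches_bottom': bool}]"""
    if sum(c['area'] for c in components) > area_thresh:
        return True                                    # case 1: region too large
    if face_box is not None and any(
            boxes_overlap(c['bbox'], face_box) for c in components):
        return True                                    # case 2: overlaps the face
    bottoms = sorted((c for c in components if c['touches_bottom']),
                     key=lambda c: c['bbox'][0])
    if len(bottoms) >= 2:
        gap = bottoms[1]['bbox'][0] - bottoms[0]['bbox'][2]  # horizontal gap
        if gap > dist_thresh:
            return True                                # case 3: hands far apart
    if len(bottoms) < len(components):
        return True                                    # case 4: non-bottom region
    return False
```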
In step S610, the finger can be removed from the first video frame in a variety of ways; three of them are introduced below, although the specific implementation is of course not limited to these three cases.
First, the content of the finger area in the first video frame is replaced with replacement content to obtain the second video frame. For example, the electronic device enters the video communication state in response to a video communication operation, for example clicking a video communication button on the interface of a first contact in the video communication software. In the video communication state, a video communication interface is displayed, which includes a video preview window and a video receiving window, where the video preview window displays the video frames currently captured by the electronic device and the video receiving window displays the video frames received from the peer electronic device.
Referring to FIG. 8, at the initial stage of video capture, the hands of the user of the electronic device are not yet placed on the keyboard for typing; the electronic device captures a third video frame that contains no fingers and outputs it, as shown in FIG. 8a. At the same time, the electronic device determines whether the third video frame contains the user's fingers, which can be done through the semantic segmentation model introduced in the embodiments of the present invention: if the semantic segmentation model determines that a finger area exists, the third video frame is deemed to contain fingers; otherwise it is deemed not to. If it is determined that the third video frame contains no fingers, the third video frame is stored as a background frame, as shown in FIG. 8b. The electronic device then captures the first video frame (as shown in FIG. 8c) and determines whether it contains content matching the preset finger model by inputting the first video frame into the semantic segmentation model; the finally determined finger area is, for example, shown as 90 in FIG. 8d. Once the finger area has been determined, the frame is deemed to contain content matching the preset finger model; the replacement content corresponding to the finger area is then determined from the background frame, as shown in FIG. 8e, and the finger area in the first video frame is covered with the replacement content to obtain the second video frame (as shown in FIG. 8f), which is finally output.
Based on the above solution, the fingers can be removed from the first video frame without affecting the aspect ratio or layout of the first video frame, so that the output of the video frames is smoother.
As an optional embodiment, the method further includes: obtaining a third video frame, where the third video frame is a video frame captured before the first video frame; when it is determined that the third video frame does not contain content matching the preset finger model, using the third video frame as a background frame; and determining, in the background frame, the content corresponding to the finger area in the first video frame as the replacement content. This step may be executed by the processor.
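A minimal sketch of this background-frame replacement follows, assuming the stored background frame is aligned with the current frame and the finger mask is a single-channel array (the names are illustrative only):

```python
import numpy as np

def replace_with_background(frame, mask, background):
    """Copy background pixels over the finger area to produce the second frame."""
    out = frame.copy()
    out[mask > 0] = background[mask > 0]  # mask: H x W finger-area mask
    return out
```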
The second video frame may contain a specific object, which is usually a stationary object in the background area, for example a teacup or a pen. With this method, the position of the specific object in the second video frame has the same coordinates as in the third video frame (or the offset is less than a preset offset, for example 10 pixels or 20 pixels); or the size of the specific object in the second video frame determined with this method is the same as, or similar to, its size in the third video frame (for example, the difference is within 5% or 10%).
During the startup of the video communication, if the user places their fingers on the keyboard before any finger-free video frame has been captured as a background frame, the electronic device outputs the video frame containing the user's fingers.
Second, the first video frame is cropped to obtain the second video frame, which does not contain the finger area.
For example, referring to FIG. 9, after the electronic device captures a video frame, it may first determine whether the video frame contains content matching the preset finger model; if not, it outputs the video frame directly. As shown in FIG. 9, after capturing the third video frame, the electronic device outputs it directly. If it is determined that the video frame contains content matching the preset finger model (for example, as shown in FIG. 9c), a cropping box 91 that contains no fingers is determined based on the captured video frame (for example, the first video frame); after the first video frame is cropped with this cropping box, a second video frame containing no fingers is obtained (as shown in FIG. 9d). The cropping box can be determined in a variety of ways, for example: (1) Taking the lower-left corner of the video frame as the origin, determine the maximum Y value of the finger area in the video frame, use this maximum Y value as the lower cropping edge, and use the top of the video frame as the upper cropping edge; determine the cropping ratio from the heights of the upper and lower cropping edges ((the maximum Y value minus the minimum Y value) / the height of the first video frame); then determine the center position of the person in the video frame, extend this center position to the left by a first preset distance (1/2 × cropping ratio × the width of the first video frame) to determine the left edge, and extend it to the right by a second preset distance (1/2 × cropping ratio × the width of the first video frame) to determine the right edge, thereby determining the cropping box. (2) Determine a cropping box of a preset size, place it in the central area of the video frame, and then determine whether the finger area overlaps this area; if there is overlap, move the cropping box upward as a whole until the finger area no longer overlaps the cropping box. Of course, in a specific implementation the cropping box can also be determined in other ways, which are not enumerated in detail or limited in this embodiment of the present invention.
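For way (1) above, a rough sketch is shown below. It assumes an image coordinate system with the origin at the top-left (as in most image libraries), so the "maximum Y value" of the finger area corresponds here to the topmost row of the finger mask; the person-center input is assumed to come from a separate detector.

```python
import numpy as np

def crop_without_fingers(frame, mask, person_center_x):
    """Crop away the finger strip at the bottom, centered on the person."""
    h, w = mask.shape
    rows = np.where(mask > 0)[0]          # assumes the mask is non-empty
    crop_h = int(rows.min())              # keep only the rows above the fingers
    crop_ratio = crop_h / h
    half_w = int(0.5 * crop_ratio * w)    # extend left/right of the person center
    left = max(0, person_center_x - half_w)
    right = min(w, person_center_x + half_w)
    return frame[0:crop_h, left:right]
```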
With the above solution, the offset of the coordinates of the specific object in the second video frame relative to its coordinates in the third video frame is greater than the preset offset; or the difference between the size of the specific object in the second video frame and its size in the third video frame is greater than a preset value. With the above solution, finger removal is achieved without a background frame, which reduces the processing burden on the electronic device.
Third, the finger area is filled with pixels from regions adjacent to the finger area to obtain the second video frame. In this case, the coordinates of the specific object in the second video frame are the same as, or differ little from, its coordinates in the third video frame (the offset is less than the preset offset); or the size of the specific object in the second video frame is the same as, or differs little from, its size in the third video frame. With the above solution, this function is achieved without a background frame.
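This third way can be realized, for example, with off-the-shelf image inpainting, which fills the masked area from its surroundings; the embodiment does not prescribe any particular algorithm, so the sketch below is only one possible realization.

```python
import cv2
import numpy as np

def fill_from_neighbors(frame, mask):
    """Fill the finger area from neighboring pixels via OpenCV inpainting."""
    return cv2.inpaint(frame, mask.astype(np.uint8), 3, cv2.INPAINT_TELEA)
```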
As an optional embodiment, after the display is controlled to display the second video frame based on S610, in response to determining that a fourth video frame contains content matching the preset finger model and that the finger area is in an abnormal state, at least one transition frame is determined from the finger area of the second video frame and the finger area of the fourth video frame; the display is controlled to display the at least one transition frame; and after displaying the at least one transition frame, the display is controlled to display the fourth video frame. This step may be executed by the processor.
In practice, during video communication the user first types with one hand, and the electronic device captures the first video frame; since the first video frame contains content matching the preset finger model and the finger area is not in an abnormal state, the electronic device outputs the second video frame, which contains no fingers. The user then places the other hand on the touchpad, and the electronic device captures a fourth video frame; the fourth video frame still contains content matching the preset finger model, but the distance between the two connected regions of its finger area is greater than the preset distance threshold, so the finger area is determined to be in an abnormal state. In this case the fourth video frame could be output directly, but switching directly from the second video frame to the fourth video frame would make the fingers of a typing hand appear suddenly, making the picture look abrupt. To prevent this, at least one transition frame is added between the second video frame and the fourth video frame to achieve a smooth transition from the picture of the second video frame to that of the fourth video frame. If there is a single transition frame, the weights of the second video frame and the fourth video frame may, for example, both be 0.5; if there are multiple transition frames, the weight of the second video frame gradually decreases while the weight of the fourth video frame gradually increases. For example, with five transition frames, the weights of (the second video frame, the fourth video frame) may be (0.8, 0.2) in the first transition frame, (0.6, 0.4) in the second, (0.5, 0.5) in the third, (0.4, 0.6) in the fourth, and (0.2, 0.8) in the fifth. Of course, these weights are merely examples and are not limiting. For the background area outside the finger area, each transition frame displays the content of the background area of the fourth video frame, while the content of the finger area in each transition frame is determined in the manner described above.
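The weighted blending just described can be sketched as follows, blending the second and fourth frames inside the finger area only, while taking the background from the fourth frame; the linear weight ramp approximates the example weights above and is an assumption.

```python
import numpy as np

def make_transition_frames(second, fourth, mask, n=5):
    """Blend n transition frames between the second and fourth video frames."""
    frames = []
    m = (mask > 0)[..., None]             # broadcast the mask over color channels
    second_f = second.astype(np.float32)
    fourth_f = fourth.astype(np.float32)
    for i in range(1, n + 1):
        w4 = i / (n + 1)                  # the fourth frame's weight ramps up
        blended = (1 - w4) * second_f + w4 * fourth_f
        frame = np.where(m, blended, fourth_f)  # background from the fourth frame
        frames.append(frame.astype(np.uint8))
    return frames
```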
In a specific implementation, after the electronic device removes the finger from the first video frame and displays the second video frame, referring to FIG. 10, prompt information may also be generated to inform the user that the finger has been removed from the video frame and ask whether the finger should be kept. As shown in FIG. 10, a prompt box 100 is displayed over the second video frame, prompting "The finger has been removed from the video chat; please confirm whether to continue removing it". If the user wants to continue removing the finger, they click the confirmation button 110; after the electronic device detects the user clicking the confirmation button 110, fingers are still removed from subsequently captured video frames, as shown in FIG. 10b. If the user does not want to continue removing the finger, they click the cancel button 120; after the electronic device detects the user clicking the cancel button, fingers are no longer removed from subsequently captured video frames, as shown in FIG. 10c. Based on the above solution, whether to remove the user's fingers during video communication can be chosen according to the user's needs, achieving more flexible control.
In a specific implementation, after the electronic device starts video communication (or captures a frame containing the user's fingers), it may prompt the user whether to remove the fingers; if the user's confirmation operation is detected, the fingers are removed from the video frames, otherwise they are not.
For example, referring to FIG. 11, after the electronic device captures the first video frame and determines through detection that the video frame contains content matching the preset finger model (that is, a finger), a prompt box 130 is generated, prompting "A finger has been detected in the video; please confirm whether to remove it". If the user of the electronic device clicks the confirmation button 140, the electronic device removes the finger from the first video frame and displays the second video frame, as shown in FIG. 11b; if the user clicks the cancel button 150, the electronic device does not remove the finger from the first video frame and outputs the first video frame directly, as shown in FIG. 11c.
In addition to being applied at the video capture end of a video communication process, where the electronic device performs finger removal on the captured video frames so that the frames sent to the peer electronic device have the fingers removed, the above solution can also be applied at the receiving end of the video communication process. That is, the peer electronic device receives video frames from which the fingers have not been removed, processes them based on the video processing method introduced in the embodiments of the present invention, and then outputs them; in this case, the video frames in the above steps are not captured video frames but video frames received from the peer electronic device.
In another aspect, based on the same inventive concept, an embodiment of the present invention provides an electronic device, including: one or more processors; one or more memories; multiple application programs; and one or more computer programs, where the one or more computer programs are stored in the one or more memories and include instructions which, when executed by the one or more processors of the electronic device, cause the electronic device to perform the following steps: obtaining a first video frame, and obtaining a keyboard input signal; when it is determined that the first video frame contains content matching a preset finger model, and the time at which the keyboard input signal is obtained and the time at which the first video frame is obtained satisfy a preset time threshold, removing the finger from the first video frame to obtain a second video frame; and instructing display of the second video frame.
In a specific implementation, instructing display of the second video frame can cover several cases, for example: (1) displaying the second video frame through a display unit of the electronic device; (2) sending the second video frame to another display unit for display, either because the electronic device has no display unit or because display on the other display unit is more suitable, for example because its display area is larger; (3) sending the second video frame to the peer electronic device of the video communication, to be displayed by the peer electronic device.
下面将结合图12-图14、图17A、图17B介绍步骤S610中具体如何通过语义分割模型确定出手指区域,从而认定第一视频帧中包含符合预设手指模型的内容。The following will describe how to determine the finger area through the semantic segmentation model in step S610 with reference to FIGS. 12-14, 17A, and 17B, so as to determine that the first video frame contains content that conforms to the preset finger model.
本发明实施例提供一种语义分割模型训练方法,语义分割模型是像素级别上 的分类,针对一张照片,每个物体的像素被分成一类(例如:属于人的像素被分成一类、属于摩托车的像素被分成一类、属于小狗的像素被分成一类等等),除此之外背景像素也被分成一类。请参考图12,该方法包括以下步骤:The embodiment of the present invention provides a method for training a semantic segmentation model. The semantic segmentation model is a classification at the pixel level. For a photo, the pixels of each object are divided into one category (for example, pixels belonging to a person are divided into one category, belonging to Motorcycle pixels are divided into one category, pixels belonging to puppies are divided into one category, etc.), in addition to the background pixels are also divided into one category. Please refer to Figure 12, the method includes the following steps:
S1200:数据采集和标注,采集用户打字时的照片(或者其他包含用户的手的照片),并标注出手指(或者其他预设物体)区域的掩膜,用于语义分割模型的训练。S1200: Data collection and labeling, collecting photos of the user when typing (or other photos containing the user's hand), and labeling the mask of the finger (or other preset object) area for the training of the semantic segmentation model.
S1210:模型设计,设计出语义分割模型,该语义分割模型例如为:卷积神经网络模型、条件随机场模型等等。可选的,为了保证分割出的手指区域的准确性和手指边缘的精确性,可以采用双分支卷积神经网络模型,双分支例如包括:语义特征分支和边缘特征分支。其中,语义特征分支用于提取图像的语义特征,语义特征指的像素点所表征的具体物体,例如:人脸、手指等等,边缘特征分支用于提取图像的纹理特征,纹理特征指的是边缘信息、形状信息(例如:拐角)、颜色等特征。S1210: Model design, design a semantic segmentation model, the semantic segmentation model is, for example, a convolutional neural network model, a conditional random field model, and so on. Optionally, in order to ensure the accuracy of the segmented finger region and the accuracy of the edge of the finger, a dual-branch convolutional neural network model may be used. The dual-branch includes, for example, semantic feature branches and edge feature branches. Among them, the semantic feature branch is used to extract the semantic feature of the image, the semantic feature refers to the specific object represented by the pixel point, such as: human face, finger, etc., the edge feature branch is used to extract the texture feature of the image, and the texture feature refers to Features such as edge information, shape information (for example: corners), and color.
S1220:模型训练,用于通过S1200中标注好的照片和S1210中设计好的语义分割模型,训练并更新语义分割模型的参数,获取用于识别手指的语义分割模型。S1220: Model training, used to train and update the parameters of the semantic segmentation model based on the photos marked in S1200 and the semantic segmentation model designed in S1210 to obtain a semantic segmentation model for finger recognition.
在具体实施过程中,该模型的输入是将像素值归一化到0到1之间的图片(在计算过程中,图片是内存中以特定的顺序存储的多维数组),语义特征分支可以采用深层卷积神经网络提取语义特征以确定手指区域,边缘特征分支可以采用浅层神经网络提取纹理特征以保证手指边缘分割的准确性。在训练时,向双分支卷积神经网络模型输入图片之后,双分支卷积神经网络模型通过深层卷积神经网络提取语义特征,通过浅层神经网络提取纹理特征;然后通过特征融合网络对提取出的语义特征和边缘特征进行特征融合计算得到综合特征,最后的分类层以综合特征作为输入,计算每个像素点属于手指区域的置信度,进而判断每个像素点是否属于手指区域(在具体实施过程中,语义分割模型是以特定形式连接的各种操作的集合,每种操作由不同的数值组成,操作的实际表现为用其自身参数值与输入的数组做矩阵运算,并输出计算得到的数组)。然后将其与标注的手指区域进行比对,基于差异损失函数,通过差异损失函数反向传播到语义分割模型,更新语义分割模型的参数。In the specific implementation process, the input of the model is a picture that normalizes the pixel value to between 0 and 1 (in the calculation process, the picture is a multi-dimensional array stored in a specific order in the memory), and the semantic feature branch can be used The deep convolutional neural network extracts semantic features to determine the finger area, and the edge feature branch can use the shallow neural network to extract texture features to ensure the accuracy of finger edge segmentation. During training, after inputting images to the dual-branch convolutional neural network model, the dual-branch convolutional neural network model extracts semantic features through deep convolutional neural networks, and texture features through shallow neural networks; and then extracts through feature fusion network pairs The semantic features and edge features of each pixel are calculated by feature fusion calculation to obtain comprehensive features. The final classification layer takes the comprehensive features as input to calculate the confidence that each pixel belongs to the finger area, and then determines whether each pixel belongs to the finger area (in the specific implementation) In the process, the semantic segmentation model is a collection of various operations connected in a specific form. Each operation is composed of different values. The actual performance of the operation is to use its own parameter values and input arrays to perform matrix operations, and output the calculated results Array). Then it is compared with the labeled finger area, and based on the difference loss function, it is propagated back to the semantic segmentation model through the difference loss function, and the parameters of the semantic segmentation model are updated.
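The dual-branch structure and one training step can be sketched as follows, assuming PyTorch; the layer sizes and scaling factors are illustrative assumptions, not the patent's actual architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualBranchSeg(nn.Module):
    def __init__(self):
        super().__init__()
        # deep semantic branch: downsamples 4x, larger receptive field
        self.semantic = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())
        # shallow edge/texture branch: downsamples 2x, keeps fine detail
        self.edge = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU())
        self.fuse = nn.Conv2d(64 + 16, 2, 1)  # 2 classes: finger / not

    def forward(self, x):
        sem = self.semantic(x)
        edg = self.edge(x)
        # upsample the coarser semantic features to the edge-branch size
        sem = F.interpolate(sem, size=edg.shape[2:], mode='bilinear',
                            align_corners=False)
        logits = self.fuse(torch.cat([sem, edg], dim=1))
        # restore full input resolution for per-pixel classification
        return F.interpolate(logits, size=x.shape[2:], mode='bilinear',
                             align_corners=False)

model = DualBranchSeg()
img = torch.rand(1, 3, 224, 224)           # pixel values already in [0, 1]
mask = torch.randint(0, 2, (1, 224, 224))  # annotated finger mask
loss = F.cross_entropy(model(img), mask)   # difference loss
loss.backward()                            # backpropagate, update parameters
```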
Based on the above training method, an embodiment of the present invention further provides a method for recognizing the finger region (or the region of another preset object) in an image or video using the semantic segmentation model. Referring to FIG. 13, the method includes the following steps:
S1300: Model freezing. Specifically, once training of the semantic segmentation model has finished, the resulting model is used for finger recognition and its parameters no longer change.
S1310: Data preprocessing. This step obtains the current video frame and normalizes its data, for example by normalizing the pixel values of the video frame to the range 0 to 1.
S1320: Model inference. For example, referring to FIG. 14, the inference may include the following steps. S1400: input the image, that is, feed the image normalized in S1310 into the semantic segmentation model. S1410a: extract the semantic features of the image through the semantic feature branch, for example a deep convolutional neural network; when extracting semantic features, the image may also be downscaled, for example by a factor of 4 or 5. S1410b: extract texture features through the edge feature branch, for example a shallow convolutional neural network; likewise, the image may be downscaled, for example by a factor of 2 or 3. S1420: fuse the extracted semantic and edge features through the feature fusion network into composite features. If the two branches use different scaling factors, the lower-resolution features are first upscaled by interpolation so that the semantic and texture features have the same size before fusion (for example, if the semantic features are 8×8 pixels and the edge features 32×32 pixels, the semantic features are the lower-resolution ones). S1430: feed the composite features into the classifier to obtain the finger region mask. The final classifier takes the composite features as input, computes for each pixel the confidence of belonging and of not belonging to the finger region, decides whether each pixel belongs to the finger region, and thereby produces the mask of the finger region.
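The inference path (S1300 to S1430) can be sketched as below, reusing the DualBranchSeg model from the earlier sketch; the 0.5 confidence threshold and frame layout are assumptions:

```python
import numpy as np
import torch

model.eval()                        # S1300: freeze, parameters stop changing
for p in model.parameters():
    p.requires_grad_(False)

def finger_mask(frame_bgr: np.ndarray) -> np.ndarray:
    # S1310: normalize pixel values to [0, 1], reorder HWC -> CHW
    x = torch.from_numpy(frame_bgr).float().permute(2, 0, 1) / 255.0
    with torch.no_grad():
        logits = model(x.unsqueeze(0))             # S1410-S1420 inside the net
        prob = torch.softmax(logits, dim=1)[0, 1]  # confidence of "finger"
    return (prob > 0.5).numpy().astype(np.uint8)   # S1430: binary mask
```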
In a specific implementation, the mask determined through the above steps may contain some noise. To improve the accuracy of the determined finger region, and hence the accuracy of the subsequent filling of that region, noise filtering can be applied to the mask produced by the semantic segmentation model.
Referring to FIG. 15A, which shows a user photo captured by the front camera of the electronic device: because the front camera sits at the bottom of the display or on the keyboard, the captured image contains fingers. The white areas in FIG. 15B are the finger region mask output by the semantic segmentation model, indicating areas that may be fingers. Each white area is called a connected region; in FIG. 15B the mask determined by the model comprises five connected regions (the number of connected regions differs from picture to picture, and the embodiments of the present invention place no limit on it): connected regions 61, 62, 63, 64, and 65. The box around a connected region is called its bounding rectangle, and the mask may also contain the empirical position 66 of the face frame. Connected regions 61, 62, and 63 do not touch the bottom of the picture and are called non-bottom connected regions; connected regions 64 and 65 are called bottom connected regions. In a normal typing posture, the hand region usually belongs to a bottom connected region. After the finger region mask is determined, it can be taken directly as the finger region, or it can first be denoised and the denoised mask used as the finger region.
In a specific implementation, face recognition can be run on the image to locate the face frame, and the recognized location used as the empirical position of the face frame. Alternatively, an empirical area at the center of the screen of the electronic device can serve as the empirical area of the face frame; or a large number of chat videos can be analyzed to find the positions of the face frames in them, and the aggregated results used to derive the empirical position of the face frame.
One or more of the following methods can be used to filter out the noise regions (a simplified code sketch follows the list):
① Using the obtained finger region mask and the empirical position 66 of the face frame, apply erosion and dilation to the binary mask image to filter out noise holes of very small area, for example holes smaller than 1, 2, or 3 pixels, which are typically noise along the finger edges.
② Find all connected regions of the binary mask image and preliminarily filter out connected regions whose area is below a preset threshold, for example 10 pixels or 20 pixels.
③ Keep only connected regions whose bounding rectangle exceeds a preset area threshold, for example 100 pixels or 200 pixels. Alternatively, compute the center of each connected region and, from the relationship between that center and the center of the empirical face frame position, determine where the region lies: above, below, to the left of, or to the right of the face frame center. A different preset area threshold is then set depending on that location, and connected regions whose bounding rectangle area exceeds the applicable threshold are kept. For example, a region to the left or right of the face center may use a threshold of 150, 200, or 220 pixels; a region above the face center may use 80 or 100 pixels; a region below the face center may use 300 or 400 pixels. Typically the threshold for regions below the face center is larger than for regions to the left or right, and the threshold for regions to the left or right is larger than for regions above.
④ For connected regions that satisfy the above requirements (in a specific implementation, all three filtering steps may be used, or only some of them; the embodiments of the present invention place no limit on this), further check whether the region overlaps the empirical face area. If it overlaps and is not a bottom connected region, the current state is judged to be non-typing and no further processing is performed; if it does not overlap, or it is a bottom connected region, proceed to step ⑤. Of course, in a specific implementation, a region that does not overlap the face area or that is a bottom connected region may also be directly judged to indicate typing, with the corresponding connected region taken as the finger region mask.
⑤ If the above requirements are met, compute the aspect ratio of the connected region and filter out regions whose aspect ratio is below a threshold (for example 0.5, 0.7, or 0.8) as well as inverted-triangle regions, because in a normal typing posture the finger region mask has a relatively large aspect ratio and is generally not an inverted triangle. This step may also be performed before step ④; the embodiments of the present invention place no limit on the order.
⑥ Compute the ratio of each remaining connected region's area to the area of its bounding rectangle. If the ratio exceeds a set threshold (for example 0.5 or 0.6), return the typing-state information, conclude that the current user is typing, and take the finally screened finger region mask as the effective mask region, that is, the region where the actual fingers are. In this step, it is also possible to directly check whether the area of the connected region exceeds a preset area (for example 60,000 or 70,000 pixels); if so, return the typing-state information, conclude that the user is typing, and take the screened mask as the effective mask region.
In a specific implementation, steps ① to ⑥ may be executed in order; where they do not conflict, they may also be applied to each connected region separately, with the result of each step used to decide whether the region belongs to the finger region mask. The connected regions remaining after this noise reduction are shown, for example, in FIG. 15C: regions 61, 62, and 63 have been filtered out, and only regions 64 and 65 are retained as the true finger regions.
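A simplified sketch of the filters, assuming Python with OpenCV: the thresholds are the example values quoted above, and the sketch treats only bottom-connected regions as typing fingers, omitting the face-frame overlap test of step ④:

```python
import cv2
import numpy as np

def denoise_finger_mask(mask: np.ndarray) -> np.ndarray:
    h, w = mask.shape
    # step 1: erode then dilate (morphological opening) to drop tiny holes
    kernel = np.ones((3, 3), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    out = np.zeros_like(mask)
    n, labels, stats, centroids = cv2.connectedComponentsWithStats(mask)
    for i in range(1, n):                  # label 0 is the background
        x, y, bw, bh, area = stats[i]
        if area < 20:                      # step 2: tiny regions
            continue
        if bw * bh < 200:                  # step 3: small bounding rectangles
            continue
        if (y + bh) < h:                   # step 4 (simplified): keep only
            continue                       # bottom-connected regions
        if bw / float(bh) < 0.5:           # step 5: aspect-ratio filter
            continue
        if area / float(bw * bh) <= 0.5:   # step 6: area-to-rectangle ratio
            continue
        out[labels == i] = 255
    return out
```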
In a specific implementation, the above training method for the semantic segmentation model can also be used to recognize other objects in an image, for example trash cans, backgrounds, ashtrays, folders, styluses, palms, or arms, and likewise to recognize the regions where those objects are located. It suffices to use different training samples in the training phase: for a trash can, the training samples are pictures containing a trash can together with annotations of the trash can's region; for a stylus, pictures containing a stylus together with annotations of the stylus's region; and so on.
In a specific implementation, besides the semantic segmentation model described above, foreground segmentation (determining the region where the fingers are) can also be done in other ways, for example with a frame-difference-based foreground segmentation method or a motion-shape-based one. The semantic segmentation model, however, segments the finger region of the current video frame more accurately, is not affected by the constantly moving body of the typist, and is fast enough to meet real-time requirements.
An embodiment of the present invention further provides an image processing method for determining replacement content based on the background region of the video frames. Referring to FIG. 16, the method includes the following steps:
S700: Perform motion offset estimation between the background frame and the current frame to obtain the motion offset matrix of the background frame relative to the current frame (in a specific implementation, the motion offset matrix of the current frame relative to the background frame may be computed instead). In the initial phase, if an image captured during the video chat does not contain a typing finger, it is taken as the background frame; if it does contain a typing finger, the background frame is left undetermined until an image without a typing finger is captured and taken as the background frame.
Specifically, feature points of the current frame and the background frame can first be detected, the detected feature points then matched to find corresponding pairs, and a perspective transformation matrix computed from the matched pairs; this matrix characterizes the amount of motion of the background frame relative to the current frame. Feature point detection algorithms such as SIFT (Scale-Invariant Feature Transform), SURF (Speeded-Up Robust Features), or ORB (Oriented FAST and Rotated BRIEF) can be used to find the feature points of the background frame and the current frame, and matching algorithms such as BF (Brute-Force matching) or FLANN (Fast Library for Approximate Nearest Neighbors) can be used to find the matching pairs. Of course, feature points may also be detected and matched in other ways; the embodiments of the present invention place no limit on this.
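One concrete realization of S700, assuming Python with OpenCV: the patent allows SIFT, SURF, or FLANN equally, so the ORB-plus-brute-force choice below is just one possibility:

```python
import cv2
import numpy as np

def estimate_motion(background: np.ndarray, current: np.ndarray):
    orb = cv2.ORB_create(nfeatures=1000)
    kp1, des1 = orb.detectAndCompute(background, None)
    kp2, des2 = orb.detectAndCompute(current, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    # needs at least 4 point pairs; RANSAC rejects bad matches
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return H  # maps background-frame coordinates to current-frame coordinates
```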
Perspective transformation:
A perspective transformation (homography) is a mapping between two images: through matrix multiplication, a point in one image is mapped to the corresponding point in the other. It involves two steps: computing the perspective matrix and mapping coordinates.
Computing the perspective transformation matrix: the perspective transformation matrix is a 3×3 matrix H, written as:
    H = [ h11  h12  h13
          h21  h22  h23
          h31  h32  h33 ]
Given n (n >= 4) pairs of matching points [A1, B1], [A2, B2], [A3, B3], ..., [An, Bn] between image A and image B, the matching points and H form the following system of linear equations:
    B1 = H × A1
    B2 = H × A2
    B3 = H × A3
    ...
    Bn = H × An
Solving this system of linear equations yields H, which expresses the motion offset estimate from image A to image B (that is, the perspective transformation matrix H). Therefore, for every point (X1, Y1) in image A, its corresponding coordinates (X2, Y2) in the view plane of image B can be computed by the following matrix multiplication in homogeneous coordinates:
    [x', y', w']^T = H × [X1, Y1, 1]^T,  with X2 = x'/w', Y2 = y'/w'
S710: Perform motion compensation on the background frame using the motion amount determined in the previous step, obtaining a compensated frame (that is, the motion-compensated background frame). The purpose of this step is to align the current frame and the background frame, so as to eliminate tearing of the human body region in the completed image caused by body motion. (In a specific implementation, motion compensation may instead be applied to the current frame based on the motion amount; the embodiments of the present invention place no limit on this.)
For every point (X1, Y1) in the background frame, its corresponding coordinates (X2, Y2) in the view plane of the current frame can be computed by the same matrix multiplication shown above.
Through this perspective transformation, motion compensation of the background frame relative to the current frame is achieved.
S720: Compute the background region to be used for filling, based on the motion-compensated background frame (that is, the compensated frame) and the finger region mask, and use the content/image of that background region to fill or replace the finger region mask of the current frame.
In a specific implementation, the image at the location of the finger region mask can be taken from the background frame as the background region for filling or replacement, and that region then overlaid onto the finger region mask of the current frame, thereby filling the masked area.
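S710 and S720 together can be sketched as follows, assuming Python with OpenCV and the homography H estimated by the earlier sketch:

```python
import cv2
import numpy as np

def remove_fingers(current, background, finger_mask, H):
    h, w = current.shape[:2]
    # S710: motion-compensate the background into the current frame's plane
    compensated = cv2.warpPerspective(background, H, (w, h))
    out = current.copy()
    out[finger_mask > 0] = compensated[finger_mask > 0]  # S720: fill/replace
    return out
```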
S730: Use an ambient-light rendering method to render the filled foreground region and the surrounding background region so that the brightness of the picture is consistent, eliminating brightness differences between adjacent video frames caused by hardware. In a specific implementation, step S730 is optional. The output frame obtained from the above processing is used as the final video frame for video output.
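The patent does not spell out the ambient-light rendering method; a simple stand-in, sketched below under that assumption, scales the filled region so that its mean luminance matches the surrounding background, which removes visible seams:

```python
import cv2
import numpy as np

def match_brightness(frame, finger_mask):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
    filled = finger_mask > 0
    if not filled.any() or filled.all():
        return frame
    # gain that brings the filled area to the surrounding mean luminance
    gain = gray[~filled].mean() / max(gray[filled].mean(), 1e-6)
    out = frame.astype(np.float32)
    out[filled] *= gain
    return np.clip(out, 0, 255).astype(np.uint8)
```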
S740: Use the output frame obtained in step S730 as the new background frame, updating the background frame.
An embodiment of the present invention provides a video communication method. Referring to FIG. 8, the method includes the following steps:
S800: Capture video frames in front of the display screen through the front camera.
In a specific implementation, this scheme can be applied to any electronic device with a video communication function. The electronic device may have a built-in or an external camera. Optionally, the camera is set below the display of the electronic device, or on an input device of the electronic device (for example a keyboard, mouse, or touchpad). Optionally, the electronic device is a notebook computer, and the camera is set on the notebook's keyboard or below its display.
In a specific implementation, suppose user A of the electronic device wants to start a video chat with another user B. User A opens the instant messaging application of the electronic device, opens the chat interface with user B, and clicks the "video call" button. After detecting this operation by user A, the electronic device establishes a video communication connection with user B's electronic device and turns on its camera to capture the video to be sent to user B; a video frame of the captured video is shown, for example, in FIG. 9A. Alternatively, the user opens the contacts interface of the electronic device 100, selects contact B, and clicks a video call button (for example, the device's native video calling function); after detecting this operation by user A, the electronic device establishes a video communication connection with user B's electronic device. Of course, a video communication connection with user B's electronic device can also be established in other ways, which the embodiments of the present invention neither enumerate in detail nor limit.
Video communication usually enables the front camera by default, but based on a selection or setting operation by user A, the electronic device may also enable the rear camera; the embodiments of the present invention place no limit on this.
Another embodiment of the present invention provides a video processing method. Referring to FIG. 17, it includes the following steps:
S810: After obtaining a video frame input by the front camera, determine whether finger typing is taking place.
In a specific implementation, the current keyboard signal is read through the keyboard signal reading apparatus of the electronic device, and the presence of an input signal is used to decide: if an input signal exists, it is determined that finger typing is taking place; if not, it is determined that no finger typing is taking place.
S820b: If no finger typing is taking place, update the background frame with the current video frame.
In the initial phase, if an image captured during the video chat does not contain a typing finger, it is taken as the background frame; if it does contain a typing finger, the background frame is left undetermined until an image without a typing finger is captured and taken as the background frame.
S820a: Determine through the semantic segmentation model whether a finger region mask exists. How this is determined has been described above and is not repeated here. The obtained finger region mask is shown, for example, at 90 in FIG. 18B. There is no fixed execution order between S820a and S810. Optionally, the processing of S820a is performed only after S810 has determined that finger typing exists, to reduce the data processing load of the electronic device.
S830: If finger typing is taking place and a finger region mask exists, denoise the finger region mask. How the denoising is done has been described above and is not repeated here. This step is optional.
S840: Remove the fingers in the finger region mask. How this is done has been described in the image processing method above and is not repeated here; a video frame after the fingers in the mask have been removed is shown, for example, in FIG. 18C. If step S830 is present, the fingers removed are those of the denoised finger region mask.
In addition to removing the fingers in the mask with the image processing method described above, the finger region can also be covered with another picture, or the masked finger region can be filled with the background region of the current image frame; the embodiments of the present invention neither enumerate these options in detail nor limit them.
S850: Obtain the video frame after finger removal as a new video frame and output it; this frame can be transmitted to user B's electronic device for display and can also be displayed on user A's electronic device.
S860: Update the background frame with the new video frame.
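A hypothetical glue loop for S810 to S860, composing the sketches given earlier (finger_mask, denoise_finger_mask, estimate_motion, remove_fingers); keyboard_active stands in for the device-specific keyboard signal:

```python
background = None  # updated in S820b / S860

def on_new_frame(frame, keyboard_active):
    global background
    mask = finger_mask(frame) if keyboard_active else None   # S810 / S820a
    if mask is not None:
        mask = denoise_finger_mask(mask)                     # S830 (optional)
    if mask is None or not mask.any():
        background = frame              # S820b: no typing finger, update
        return frame
    if background is None:
        return frame                    # no clean background frame yet
    H = estimate_motion(background, frame)                   # S700
    out = remove_fingers(frame, background, mask, H)         # S840
    background = out                    # S860: update with the new frame
    return out                          # S850: output/display this frame
```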
Based on the above scheme, the typing finger region can be segmented and recognized while the human body information is preserved, and the background and body regions occluded by the typing fingers can be completed, solving the technical problem of typing fingers appearing on screen like an "octopus" and greatly improving the user's video chat experience. Moreover, the scheme removes only the fingers the user is typing with, not the user's other fingers, which improves the accuracy of the judgment, reduces misjudgments, and makes the human-computer interaction more intelligent.
Yet another embodiment of the present invention provides a video communication method. Referring to FIG. 19, the method includes the following steps:
S1000: Capture the video in front of the display screen through the front camera. The capture process is similar to S800 and is not repeated here.
S1010a: After obtaining the video captured by the front camera, determine the region where the face is located.
The face region can be identified through face recognition technology. In an optional embodiment, an empirical area at the center of the screen of the electronic device can also be used as the empirical area of the face frame; alternatively, a large number of chat videos (or selfies) can be analyzed to find the positions of the face frames in them, and the aggregated results used to derive the empirical area of the face frame.
S1010b: Determine the finger region mask through the semantic segmentation model. How this is determined has been described above and is not repeated here. There is no fixed execution order between this step and S1010a.
S1020: Determine, from the face region and the finger region mask, whether the user in the current frame is typing.
In a specific implementation, it can be checked whether the connected regions of the finger region mask include a bottom connected region; if so, the state is judged to be typing. If not, it is checked whether a connected region overlaps the face region: if it overlaps, the state is judged to be non-typing; if it does not overlap, the state is judged to be typing.
In a specific implementation, the typing state may also be determined solely from whether the connected regions of the finger region mask include a bottom connected region: for example, if a bottom connected region exists, the state is typing; otherwise it is non-typing. In that case, step S1010a above is optional.
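The S1020 decision can be sketched as follows, assuming Python with OpenCV; face_rect as an (x, y, w, h) tuple is an assumed representation of the face region:

```python
import cv2
import numpy as np

def is_typing(mask: np.ndarray, face_rect) -> bool:
    h, w = mask.shape
    fx, fy, fw, fh = face_rect
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
    for i in range(1, n):
        x, y, bw, bh, area = stats[i]
        if y + bh >= h:                 # bottom-connected region: typing
            return True
        overlaps_face = not (x + bw < fx or fx + fw < x or
                             y + bh < fy or fy + fh < y)
        if not overlaps_face:           # no overlap with the face: typing
            return True
    return False                        # every region overlaps the face
```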
Referring to FIG. 20, step S1020 in FIG. 19 may include the following steps:
S1100: Apply erosion and dilation to the finger region mask to filter out noise holes of very small area, for example holes smaller than 1, 2, or 3 pixels, which are typically noise along the finger edges.
S1110: Find all connected regions of the binary image of the picture and preliminarily filter out connected regions whose area is below a preset threshold, for example 10 pixels or 20 pixels.
S1120: Perform a preliminary screening of the connected regions. Compute the center of each connected region and, from the relationship between that center and the center of the face frame, determine where the region lies, for example above, below, to the left of, or to the right of the face center. Screen the connected regions against the bounding-rectangle area thresholds of the different location areas, keeping regions whose bounding rectangle area exceeds the applicable preset threshold. For example, a region to the left or right of the face center may use a threshold of 150, 200, or 220 pixels; a region above the face center may use 80 or 100 pixels; a region below the face center may use 300 or 400 pixels. Typically the threshold for regions below the face center is larger than for regions to the left or right, and the threshold for regions to the left or right is larger than for regions above.
The above three steps are optional.
Then S1130 is executed, that is, determining from the filtered finger region mask and the face region whether the user in the current frame is typing; the judgment is similar to that described earlier and is not repeated here. If S1130 determines that the user is not typing, a non-typing status code can be returned; if it determines that the user is typing, a typing status code can be returned, or S1140 can be executed: a fine screening of the connected regions, which may include the following approaches:
Method 1: compute the aspect ratio of each connected region and filter out regions whose aspect ratio is below a threshold (for example 0.5, 0.7, or 0.8) as well as inverted-triangle regions, because in a normal typing posture the finger region mask has a relatively large aspect ratio and is generally not an inverted triangle.
Method 2: compute the ratio of each remaining connected region's area to the area of its bounding rectangle. If the ratio exceeds a set threshold (for example 0.5 or 0.6), return the typing-state information, conclude that the current user is typing, and take the finally screened finger region mask as the effective mask region, that is, the region where the actual fingers are. In this step, it is also possible to directly check whether the area of the connected region exceeds a preset area (for example 60,000 or 70,000 pixels); if so, return the typing-state information, conclude that the user is typing, and take the screened mask as the effective mask region.
If connected regions remain after the fine screening, a typing status code is returned, confirming that the user is currently typing; if no connected regions remain, a non-typing status code is returned, confirming that the user is currently not typing.
S1030a: When it is confirmed that no typing is taking place, update the background frame with the current frame.
S1030b: When it is determined that typing is taking place, remove or occlude the fingers of the finger region mask through image processing operations to obtain a new video frame; how this is done has been described above and is not repeated here.
S1040: Output the new video frame. It can be output to the electronic device of user B as a video frame of the video communication, and it can also (at the same time) be output to the electronic device of user A.
S1050: Update the background frame with the new video frame. The execution order of steps S1040 and S1050 is interchangeable.
In a specific implementation, the finger-typing judgments of S810 and S1020 can be used individually or in combination, or S1020 can be used when S810 cannot reach a decision.
Based on the above scheme, the typing finger region can be segmented and recognized while the human body information is preserved, and the background and body regions occluded by the typing fingers can be completed, solving the complaints and problems of typing fingers appearing on screen like an "octopus" and greatly improving the user's video chat experience. Moreover, the scheme removes only the fingers the user is typing with, not the user's other fingers, which improves the accuracy of the judgment, reduces misjudgments, and makes the human-computer interaction more intelligent.
Referring to FIG. 21, an embodiment of the present invention provides a video processing method including the following steps:
S2100: Obtain a video frame.
In a specific implementation, the electronic device may capture the video frame itself: for example, it may start video capture after detecting a video shooting operation by the user (for example, clicking a video shooting button, making a preset gesture, or issuing a voice command); it may also start capture when it detects that the user is in video communication with a peer electronic device and then send the captured video to the peer device. The electronic device may also obtain video frames from other electronic devices or from the network; the embodiments of the present invention neither enumerate these options in detail nor limit them.
If the video frame is captured by the electronic device, it may be a frame captured by the front camera, a frame captured by the rear camera, or a frame obtained by fusing frames captured by the front and rear cameras. It may also be a frame captured by another camera externally connected to the electronic device 100: for example, the electronic device 100 may establish connections with one or more of a drone, a television, or a desktop computer and capture video frames through those devices.
In a specific implementation, when the electronic device detects a video communication operation (or video shooting operation) by the user, it generates a video communication instruction and sends it to the processor; the processor responds to the instruction by starting the video communication software and sending commands to the camera driver to control the camera for video capture. The camera driver sends the captured data to the finger occlusion processing component, which performs the subsequent operations.
S2110: Obtain the region of the preset object in the video frame.
The preset object may be an object set by default by the system or an object designated by the user. For example, when the user captures video frames through the electronic device 100, holding the device or typing on it may cause the device to capture fingers blocking the camera lens or fingers in the middle of typing, which are images the user does not want captured. Alternatively, if the user notices a trash can, an ashtray, or another such object in the picture while shooting, the user can manually select those objects in the picture, designating them as preset objects. The above scheme can remove the preset objects from the video, making video chat better match the user's needs and also protecting the user's privacy.
In a specific implementation, the preset object is, for example, the user's hand, a finger, a trash can, or an ashtray. The region of the preset object can be determined automatically through the semantic segmentation model (how this is done is described in conjunction with FIG. 3 to FIG. 5); it can also be determined from a selection operation received from the user. For example, while shooting a video, the user taps the ashtray in the picture; after detecting the tap, the electronic device concludes that the user wants the ashtray recognized and uses an image recognition algorithm to identify the region of the object corresponding to the tap.
S2120: Determine the replacement content. The replacement content may be the content of the background region corresponding to the video frame (how the content of the background region is determined is described in conjunction with FIG. 7). It may also be content other than the background region, for example other images (such as emoticons or icons), or the content of the preset object's region after the video frame has been mosaicked, and so on.
S2120: Fill the region of the preset object with the replacement content to remove or replace the preset object. Part of the preset object's content may be removed, or all of it.
In a specific implementation, the region of the preset object can be filled using the image processing method described above; it can also be covered directly with other objects, for example covering the ashtray's region with an emoticon or mosaicking the ashtray's region.
For example, during video capture, when the region of the preset object is detected, that region can be filled directly with the background region. Alternatively, when the region of the preset object is detected, a preset icon (which may be a default icon or a randomly changing one) is overlaid on the preset object's region in the video, and when an editing operation by the user on that icon is detected, the icon is removed or replaced with another icon. Alternatively, an edit button is displayed on the video capture interface; after the user clicks it, various editing operations are displayed (for example filters, icons, or collages); after the user taps the icon option, various icons are displayed, and based on a specific user operation (for example dragging an icon onto the preset object), the icon is made to cover the surface of the preset object. As yet another example, various icons are displayed directly on the video capture interface, and an icon is made to cover the preset object's surface based on a specific user operation. A code sketch of the mosaic treatment follows.
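A possible mosaic treatment for the preset object's region, sketched in Python with OpenCV: shrink and re-enlarge the region so it pixelates. The block size of 16 is an arbitrary choice, not fixed by the patent:

```python
import cv2
import numpy as np

def mosaic_region(frame, mask, block=16):
    ys, xs = np.nonzero(mask)           # bounding box of the object's mask
    if len(xs) == 0:
        return frame
    x0, x1 = xs.min(), xs.max() + 1
    y0, y1 = ys.min(), ys.max() + 1
    roi = frame[y0:y1, x0:x1]
    h, w = roi.shape[:2]
    small = cv2.resize(roi, (max(1, w // block), max(1, h // block)),
                       interpolation=cv2.INTER_LINEAR)
    pixelated = cv2.resize(small, (w, h), interpolation=cv2.INTER_NEAREST)
    out = frame.copy()
    region = mask[y0:y1, x0:x1] > 0     # pixelate only the masked pixels
    out[y0:y1, x0:x1][region] = pixelated[region]
    return out
```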
S2130: Replace the content of the preset object's region with the determined replacement content and output the video frame after the replacement processing.
In a specific implementation, the replaced video frame can be transmitted to another electronic device for display there; it can also be displayed on the current electronic device for its user, and it can be stored on the electronic device.
Based on the above scheme, the typing finger region can be segmented and recognized while the human body information is preserved, and the background and body regions occluded by the typing fingers can be completed, solving the technical problem of typing fingers appearing on screen like an "octopus" and greatly improving the user's video chat experience.
Besides video capture, the above scheme can also be used for image capture: for example, after detecting the user's image capture operation, the preset object in the image is recognized and then removed, either with the image processing method described above or by covering it with another picture; the embodiments of the present invention neither enumerate these options in detail nor limit them.
The application of the solutions of the embodiments of the present invention is introduced below in combination with two application scenarios.
Application scenario one:
Referring to FIG. 22a, in the initial phase the display of the electronic device shows an instant chat interface 220 with another user, on which a video communication button 220a and a voice communication button 220b are displayed. The user of the electronic device is user A, and user B stands next to user A. User A's fingers rest on the keyboard, and user B clicks the video communication button 220a with the mouse (or user A issues a voice command).
After detecting the video communication operation, the electronic device jumps to the interface shown in FIG. 22b, which contains a video communication interface 221 comprising a video preview interface 221a and a video display interface 221b. The video preview interface 221a shows the video frames currently captured (or processed) by the electronic device (for example, user A's video frames), and the video display interface 221b shows the video frames of the user of the peer electronic device. At this point user A's fingers have been resting on the keyboard the whole time; since the electronic device has not detected a background frame free of fingers, the finger removal function is not triggered and the video frames containing the fingers are displayed.
Then, as shown in FIG. 22c, user A takes the fingers off the keyboard and rests them on the knees. The electronic device now detects no fingers, so the captured video frame contains none; the frame is output directly and is also taken as the background frame.
Then, as shown in FIG. 22d, the user puts the fingers back on the keyboard to type, and the captured video frames now contain the user's fingers. The fingers are removed from the video frames with the method described above, so the output frames contain no fingers. A prompt box 222 is displayed over the video frame, reading "The finger in the video chat has been removed, please confirm whether to continue removing it", with a confirmation button 222a and a cancel button 222b; user B clicks the cancel button 222b with the mouse (or user A issues a voice command).
Then, as shown in FIG. 22e, the position of user A's fingers has not changed relative to FIG. 22d, but the video frame shown in the video preview interface once again contains the user's fingers.
Application scenario two:
User A starts the video communication function with his hands resting on his knees, so the captured video frame contains no fingers, as shown in Figure 23a. In this case, the electronic device sets the video frame shown in Figure 23a as the background frame and outputs it, as shown in Figure 23b.
Subsequently, user A starts typing on the keyboard, and the video frame captured by the electronic device contains the user's fingers, as shown in Figure 23c. The electronic device determines that the captured video frame contains content matching the preset finger model and generates prompt information, for example text, voice, or an icon. As shown in Figure 23d, a prompt box 130 is displayed, reading "Fingers have been detected in the video frame; please confirm whether to remove them". The prompt box 130 also displays a confirm button 140 and a cancel button 150. User A wants the fingers removed and clicks the confirm button 140, and the electronic device detects this operation. The user keeps typing on the keyboard.
Subsequently, the electronic device captures another video frame containing the typing fingers, as shown in Figure 23e. Since the user has confirmed removal, the electronic device removes the fingers from the frame by the method described above and outputs a frame without fingers, as shown in Figure 23f. Prompt information may also be generated on this frame to inform the user that finger removal is active; it may disappear after a period of time (for example, 1 or 2 seconds) or remain displayed for as long as removal is active. A cancel button may also be generated, and the finger removal mode is exited in response to the user clicking it. As shown in Figure 23f, the prompt information and the cancel button are integrated into a single prompt button 230.
After some time, the user no longer wants the fingers removed and clicks the prompt button 230 in Figure 23f with the mouse. After the electronic device detects this click, when it again captures a video frame containing fingers, as in Figure 23g, it no longer needs to judge whether the frame contains content matching the preset finger model and directly outputs the frame containing the fingers, as shown in Figure 23h.
Based on the same inventive concept, embodiments of the present invention further provide a computer-readable storage medium including instructions which, when run on an electronic device, cause the electronic device to execute the method described in any embodiment of the present invention.
Based on the same inventive concept, embodiments of the present invention provide a computer program product including software code for executing the method described in any embodiment of the present invention.
Based on the same inventive concept, embodiments of the present invention provide a chip containing instructions which, when the chip runs on an electronic device, cause the electronic device to execute the method described in any embodiment of the present invention.
Based on the same inventive concept, embodiments of the present invention provide an electronic device comprising a keyboard and a camera arranged near the keyboard. The electronic device further comprises: a first capture module, configured to capture a first video frame through the camera; a first determining module, configured to, upon determining that the first video frame contains content matching a preset finger model, remove the fingers from the first video frame to obtain a second video frame; and a display module, configured to display the second video frame and/or send the second video frame to a peer electronic device for display.
In an optional implementation, the first determining module comprises: a first determining unit, configured to remove the fingers from the first video frame upon determining that the first video frame contains content matching the preset finger model and that the fingers are located in the bottom area of the first video frame; a second determining unit, configured to remove the fingers from the first video frame upon determining that the first video frame contains content matching the preset finger model and that the finger area does not overlap the position of the face; and a third determining unit, configured to remove the fingers from the first video frame upon determining that the first video frame contains content matching the preset finger model and that the fingers are located in the bottom area of the first video frame and connected to a side of the first video frame.
In an optional implementation, the first determining module comprises: an obtaining unit, configured to obtain a keyboard input signal; and a fourth determining unit, configured to remove the fingers from the first video frame upon determining that the first video frame contains content matching the preset finger model and that the time of the obtained keyboard input signal and the time of obtaining the first video frame satisfy a preset time threshold.
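To make the timing condition concrete, the following is a minimal sketch; the function name, timestamp representation, and threshold value are illustrative assumptions rather than part of the embodiment:

```python
# Hypothetical sketch of the keyboard-signal timing gate.
PRESET_TIME_THRESHOLD_S = 0.5  # assumed value; the embodiment does not fix one

def keyboard_gate(frame_time_s: float, last_key_time_s: float) -> bool:
    """True when the keyboard input signal and the first video frame were
    obtained close enough in time to satisfy the preset time threshold."""
    return abs(frame_time_s - last_key_time_s) <= PRESET_TIME_THRESHOLD_S
```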
In an optional implementation, the first determining module is configured to: replace the content of the finger area in the first video frame with replacement content to obtain the second video frame; or crop the first video frame to obtain the second video frame without the finger area; or fill the finger area with pixels from the area neighboring the finger area to obtain the second video frame.
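As an illustration of the third option (filling the finger area from its neighboring pixels), classical image inpainting is one plausible realization; the sketch below uses OpenCV's Telea inpainting as a stand-in, since the embodiment does not prescribe a specific fill algorithm:

```python
import cv2
import numpy as np

def fill_finger_area(frame: np.ndarray, finger_mask: np.ndarray) -> np.ndarray:
    """Fill the finger area from neighboring pixels (hedged stand-in).

    frame:       H x W x 3 BGR video frame.
    finger_mask: H x W uint8 mask, 255 where fingers were detected.
    """
    # cv2.inpaint propagates the surrounding background into the masked
    # region; the 3-pixel radius is an assumed, not prescribed, parameter.
    return cv2.inpaint(frame, finger_mask, 3, cv2.INPAINT_TELEA)
```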
In an optional implementation, the electronic device further comprises: a second capture module, configured to capture a third video frame before the first video frame is captured; a second determining module, configured to use the third video frame as a background frame upon determining that the third video frame contains no content matching the preset finger model; and a third determining module, configured to determine, in the background frame, the content corresponding to the finger area of the first video frame as the replacement content.
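When a clean background frame is available, the replacement reduces to a masked copy, sketched below under the simplifying assumption that the background frame and the current frame are already aligned (camera motion, handled by the motion compensation recited further below, is ignored here):

```python
import numpy as np

def replace_from_background(frame: np.ndarray,
                            background: np.ndarray,
                            finger_mask: np.ndarray) -> np.ndarray:
    """Replace the finger area of `frame` with the corresponding area of a
    previously stored background frame that contains no fingers."""
    out = frame.copy()
    region = finger_mask.astype(bool)   # True where fingers were detected
    out[region] = background[region]    # masked copy from the background
    return out
```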
In an optional implementation, the first determining module is configured to: upon determining that the first video frame contains content matching the preset finger model and that the fingers are not in an abnormal state, remove the fingers from the first video frame to obtain the second video frame. The abnormal state corresponds to at least one of the following situations: both of the user's hands are located at the bottom of the first video frame and the distance between the two hands is greater than a first preset distance; one of the user's hands is located in the bottom area of the first video frame and the distance between the other hand and the bottom area is greater than a preset distance; the area of the user's fingers in the first video frame is greater than a preset area threshold; the user's fingers in the first video frame occlude the face.
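The four abnormal-state conditions can be expressed as simple geometric checks. The sketch below assumes hand centroids and a "bottom area" cutoff at 80% of the frame height; neither the representation nor the cutoff is fixed by the embodiment:

```python
def is_abnormal(hands, frame_h, finger_area, overlaps_face,
                d1, d2, area_thresh):
    """Heuristic abnormal-state check; all thresholds are assumed inputs.

    hands: list of (cx, cy) hand-region centroids in pixel coordinates.
    """
    bottom = frame_h * 0.8                      # assumed bottom-area cutoff
    if len(hands) == 2:
        (x0, y0), (x1, y1) = hands
        # Case 1: both hands in the bottom area but far apart.
        if y0 > bottom and y1 > bottom and abs(x0 - x1) > d1:
            return True
        # Case 2: one hand in the bottom area, the other far from it.
        if (y0 > bottom) != (y1 > bottom):
            if bottom - min(y0, y1) > d2:
                return True
    # Case 3: finger area too large.  Case 4: fingers occlude the face.
    return finger_area > area_thresh or overlaps_face
```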
In an optional implementation, the electronic device further comprises: a fourth determining module, configured to send the first video frame to the display for display upon determining that the first video frame contains content matching the preset finger model and that the fingers are in an abnormal state.
In an optional implementation, the first determining module is configured to: input the first video frame into a semantic segmentation model, determine the finger area in the first video frame through the semantic segmentation model, and, when the finger area exists, determine that the first video frame contains content matching the preset finger model. The semantic segmentation model is obtained by training on sample photos, each of which contains a photo of the user's fingers and is labeled with the finger area.
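The embodiment does not fix a network architecture for the semantic segmentation model. The sketch below assumes a PyTorch-style model that outputs one per-pixel finger logit, and shows how it would be turned into the binary finger mask used downstream:

```python
import numpy as np
import torch

def finger_mask_from_model(model: torch.nn.Module,
                           frame_bgr: np.ndarray,
                           threshold: float = 0.5) -> np.ndarray:
    """Run a (hypothetical) segmentation model on one frame and return a
    uint8 finger mask, 255 where the model predicts finger pixels."""
    x = torch.from_numpy(frame_bgr[:, :, ::-1].copy())   # BGR -> RGB
    x = x.permute(2, 0, 1).float().unsqueeze(0) / 255.0  # 1 x 3 x H x W
    with torch.no_grad():
        prob = torch.sigmoid(model(x))[0, 0]             # H x W probability
    return (prob.numpy() > threshold).astype(np.uint8) * 255
```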
Based on the same inventive concept, embodiments of the present invention provide an electronic device comprising: an obtaining module, configured to obtain a first video frame and to obtain a keyboard input signal; a fifth determining module, configured to, upon determining that the first video frame contains content matching a preset finger model and that the time of the obtained keyboard input signal and the time of obtaining the first video frame satisfy a preset time threshold, remove the fingers from the first video frame to obtain a second video frame; and an indicating module, configured to indicate display of the second video frame.
In an optional implementation, the fifth determining module is configured to: replace the content of the finger area in the first video frame with replacement content to obtain the second video frame; or crop the first video frame to obtain the second video frame without the finger area; or fill the finger area with pixels from the area neighboring the finger area to obtain the second video frame.
In an optional implementation, the electronic device further comprises: a third capture module, configured to capture a third video frame before the first video frame is captured; a sixth determining module, configured to use the third video frame as a background frame upon determining that the third video frame contains no content matching the preset finger model; and a seventh determining module, configured to determine, in the background frame, the content corresponding to the finger area of the first video frame as the replacement content.
In an optional implementation, the fifth determining module is configured to: upon determining that the first video frame contains content matching the preset finger model, that the fingers are not in an abnormal state, and that the time of the obtained keyboard input signal and the time of obtaining the first video frame satisfy the preset time threshold, remove the fingers from the first video frame to obtain the second video frame. The abnormal state corresponds to at least one of the following situations: both of the user's hands are located at the bottom of the first video frame and the distance between the two hands is greater than a first preset distance; one of the user's hands is located in the bottom area of the first video frame and the distance between the other hand and the bottom area is greater than a preset distance; the area of the user's fingers in the first video frame is greater than a preset area threshold; the user's fingers in the first video frame occlude the face.
In an optional implementation, the electronic device further comprises: an eighth determining module, configured to indicate display of the first video frame upon determining that the first video frame contains content matching the preset finger model and that the fingers are in an abnormal state.
In an optional implementation, the fifth determining module is configured to: input the first video frame into a semantic segmentation model, determine the finger area in the first video frame through the semantic segmentation model, and, when the finger area exists, determine that the first video frame contains content matching the preset finger model. The semantic segmentation model is obtained by training on sample photos, each of which contains a photo of the user's fingers and is labeled with the finger area.
For other content, refer to the descriptions of the related content above; details are not repeated here.
It can be understood that, in order to realize the above functions, the electronic devices described above include hardware structures and/or software modules corresponding to each function. Those skilled in the art will readily appreciate that, in combination with the units and algorithm steps of the examples described in the embodiments disclosed herein, the embodiments of the present application can be implemented in hardware or in a combination of hardware and computer software. Whether a given function is executed by hardware or by computer software driving hardware depends on the specific application and design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each specific application, but such implementations should not be considered to go beyond the scope of the embodiments of the present invention.
The embodiments of the present application may divide the above electronic devices and the like into functional modules according to the above method examples. For example, each functional module may be divided to correspond to one function, or two or more functions may be integrated into one processing module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. It should be noted that the division of modules in the embodiments of the present invention is illustrative and is merely a logical functional division; other division methods are possible in actual implementation. The foregoing takes the division of functional modules corresponding to individual functions as an example.
The methods provided in the embodiments of the present application may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented by software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, a network device, an electronic device, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (for example, infrared, radio, or microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center integrating one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a digital video disc (DVD)), or a semiconductor medium (for example, an SSD).
A person of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in combination with the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are executed by hardware or software depends on the specific application and design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each specific application, but such implementations should not be considered to go beyond the scope of this application.
Those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, devices, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative; the division of units is only a logical functional division, and other divisions are possible in actual implementation, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
The above are only specific implementations of the present application, but the protection scope of the embodiments of the present application is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive of within the technical scope disclosed in the embodiments of the present application shall fall within the protection scope of the embodiments of the present application. Therefore, the protection scope of the embodiments of the present application shall be subject to the protection scope of the claims.
Claims (33)
- An electronic device, characterized by comprising:
a display, a keyboard, a camera, and a processor;
the camera being arranged near the keyboard and configured to capture video frames during video communication and send the captured video frames to the processor;
the processor being connected to the display, the keyboard, and the camera and configured to receive a first video frame from the camera and, upon determining that the first video frame contains content matching a preset finger model, remove the fingers from the first video frame to obtain a second video frame, and send the second video frame to the display for display and/or to a peer electronic device for display.
- The electronic device according to claim 1, characterized in that the removing the fingers from the first video frame upon determining that the first video frame contains content matching the preset finger model comprises:
determining that the first video frame contains content matching the preset finger model and that the fingers are located in the bottom area of the first video frame, and removing the fingers from the first video frame; or
determining that the first video frame contains content matching the preset finger model and that the finger area does not overlap the position of the face, and removing the fingers from the first video frame; or
determining that the first video frame contains content matching the preset finger model and that the fingers are located in the bottom area of the first video frame and connected to a side of the first video frame, and removing the fingers from the first video frame.
- The electronic device according to claim 1, characterized in that the determining that the first video frame contains content matching the preset finger model and then removing the fingers from the first video frame comprises:
obtaining a keyboard input signal; and determining that the first video frame contains content matching the preset finger model and that the time of the obtained keyboard input signal and the time of obtaining the first video frame satisfy a preset time threshold, and removing the fingers from the first video frame.
- The electronic device according to any one of claims 1-3, characterized in that the removing the fingers from the first video frame to obtain a second video frame comprises:
replacing the content of the finger area in the first video frame with replacement content to obtain the second video frame; or
cropping the first video frame to obtain the second video frame without the finger area; or
filling the finger area with pixels from the area neighboring the finger area to obtain the second video frame.
- The electronic device according to claim 4, characterized in that the processor is further configured to: obtain a third video frame, the third video frame being a video frame captured before the first video frame; upon determining that the third video frame contains no content matching the preset finger model, use the third video frame as a background frame; and determine, in the background frame, the content corresponding to the finger area of the first video frame as the replacement content.
- The electronic device according to any one of claims 1-5, characterized in that the determining that the first video frame contains content matching the preset finger model and then removing the fingers from the first video frame to obtain a second video frame comprises:
upon determining that the first video frame contains content matching the preset finger model and that the fingers are not in an abnormal state, removing the fingers from the first video frame to obtain the second video frame;
the abnormal state corresponding to at least one of the following situations:
both of the user's hands are located at the bottom of the first video frame, and the distance between the user's two hands is greater than a first preset distance;
one of the user's hands is located in the bottom area of the first video frame, and the distance between the other hand and the bottom area is greater than a preset distance;
the area of the user's fingers in the first video frame is greater than a preset area threshold;
the user's fingers in the first video frame occlude the face.
- The electronic device according to claim 6, characterized in that:
the processor is further configured to: upon determining that the first video frame contains content matching the preset finger model and that the fingers are in an abnormal state, send the first video frame to the display for display.
- The electronic device according to any one of claims 1-7, characterized in that the determining that the first video frame contains content matching the preset finger model comprises:
inputting the first video frame into a semantic segmentation model and determining the finger area in the first video frame through the semantic segmentation model; when the finger area exists, it is determined that the first video frame contains content matching the preset finger model; the semantic segmentation model is obtained by training on sample photos, each of which contains a photo of the user's fingers and is labeled with the finger area.
- The electronic device according to any one of claims 1-7, characterized in that:
the processor is further configured to, in response to a user's operation of starting video communication, control the display to display a video communication interface, the video communication interface comprising a video preview window and a video receiving window, display the second video frame in the video preview window, and display, in the video receiving window, video frames received from the peer electronic device;
the processor is further configured to: input the first video frame into a semantic segmentation model to obtain a finger area mask; perform noise reduction on the finger area mask to obtain the finger area; judge whether the finger area contains a bottom connected region that adjoins a side of the first video frame and, if the finger area contains a bottom connected region that adjoins a side of the first video frame, remove the fingers from the first video frame; the noise reduction on the finger area mask comprising at least one of the following: performing erosion and dilation operations on the binary image of the finger area mask according to the obtained finger area mask and the area where the face is located; finding all connected regions of the binary image of the finger area mask and filtering out connected regions whose area is smaller than a preset threshold; filtering out connected regions of the finger area mask whose circumscribed rectangle is smaller than or equal to a preset area threshold, the corresponding preset area threshold differing according to the region in which the connected region is located; filtering out, among the connected regions of the finger area mask, regions that do not belong to the bottom connected region; filtering out connected regions of the finger area mask that overlap the area where the face is located; filtering out connected regions of the finger area mask whose aspect ratio is smaller than a second preset threshold; computing the area ratio of each connected region to its circumscribed rectangle and filtering out connected regions whose area ratio is smaller than or equal to a third preset threshold;
the removing the fingers from the first video frame to obtain a second video frame comprising: judging whether the similarity value between a background frame and the first video frame is greater than a preset similarity value and, if so, performing motion compensation on the background frame to obtain a motion-compensated background frame, the background frame being a video frame that does not contain the fingers and was captured before the first video frame; determining the finger area of the motion-compensated background frame as replacement content; replacing the finger area in the first video frame with the replacement content; and rendering the filled-in replacement content and the surrounding background area with an ambient light rendering method to obtain the second video frame; the performing motion compensation on the background frame comprising: performing motion offset estimation based on the background frame and the current frame to obtain a motion offset matrix of the background frame relative to the current frame, the obtaining of the motion offset matrix comprising: detecting feature points of the first video frame and the background frame, matching the feature points of the first video frame and the background frame to find paired feature points, and computing a perspective transformation matrix from the paired feature points, the perspective transformation matrix being the motion offset matrix characterizing the motion of the background frame relative to the current frame; and performing motion compensation on the background frame according to the motion offset matrix to obtain the motion-compensated background frame;
the processor being further configured to, after the second video frame is obtained, use the second video frame as the background frame;
the processor being further configured to, after controlling the display to display the second video frame, upon determining that a received fourth video frame contains content matching the preset finger model and that the fingers are in an abnormal state, determine at least one transition frame according to the finger area of the second video frame and the finger area of the fourth video frame, control the display to display the at least one transition frame, and control the display to display the fourth video frame after the at least one transition frame is displayed.
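A sketch of the motion-offset estimation recited in the preceding claim, using ORB features, brute-force matching, and a RANSAC homography as stand-ins; the claim requires only feature matching and a perspective transformation matrix and does not name these particular algorithms:

```python
import cv2
import numpy as np

def motion_compensate(background: np.ndarray, current: np.ndarray) -> np.ndarray:
    """Warp the background frame toward the current frame using a perspective
    transform estimated from matched feature points (assumed choices)."""
    g_bg = cv2.cvtColor(background, cv2.COLOR_BGR2GRAY)
    g_cur = cv2.cvtColor(current, cv2.COLOR_BGR2GRAY)
    orb = cv2.ORB_create(nfeatures=1000)
    k1, d1 = orb.detectAndCompute(g_bg, None)
    k2, d2 = orb.detectAndCompute(g_cur, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(d1, d2), key=lambda m: m.distance)[:200]
    src = np.float32([k1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([k2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    # The 3x3 homography plays the role of the "motion offset matrix" of the
    # background frame relative to the current frame.
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    h, w = current.shape[:2]
    return cv2.warpPerspective(background, H, (w, h))
```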
- A video capture control method, characterized in that it is applied to an electronic device, the electronic device comprising a keyboard and a camera arranged near the keyboard, and the method comprising:
capturing a first video frame through the camera;
upon determining that the first video frame contains content matching a preset finger model, removing the fingers from the first video frame to obtain a second video frame;
displaying the second video frame, and/or sending the second video frame to a peer electronic device for display.
- The method according to claim 10, characterized in that the removing the fingers from the first video frame upon determining that the first video frame contains content matching the preset finger model comprises:
determining that the first video frame contains content matching the preset finger model and that the fingers are located in the bottom area of the first video frame, and removing the fingers from the first video frame; or
determining that the first video frame contains content matching the preset finger model and that the finger area does not overlap the position of the face, and removing the fingers from the first video frame; or
determining that the first video frame contains content matching the preset finger model and that the fingers are located in the bottom area of the first video frame and connected to a side of the first video frame, and removing the fingers from the first video frame.
- The method according to claim 10, characterized in that the determining that the first video frame contains content matching the preset finger model and then removing the fingers from the first video frame comprises:
obtaining a keyboard input signal; and determining that the first video frame contains content matching the preset finger model and that the time of the obtained keyboard input signal and the time of obtaining the first video frame satisfy a preset time threshold, and removing the fingers from the first video frame.
- The method according to any one of claims 10-12, characterized in that the removing the fingers from the first video frame to obtain a second video frame comprises:
replacing the content of the finger area in the first video frame with replacement content to obtain the second video frame; or
cropping the first video frame to obtain the second video frame without the finger area; or
filling the finger area with pixels from the area neighboring the finger area to obtain the second video frame.
- The method according to claim 13, characterized in that the method further comprises:
capturing a third video frame before the first video frame is captured;
upon determining that the third video frame contains no content matching the preset finger model, using the third video frame as a background frame;
and, before the replacing the content of the finger area in the first video frame with replacement content to obtain the second video frame, the method further comprises: determining, in the background frame, the content corresponding to the finger area of the first video frame as the replacement content.
- The method according to any one of claims 10-14, characterized in that the determining that the first video frame contains content matching the preset finger model and then removing the fingers from the first video frame to obtain a second video frame comprises:
upon determining that the first video frame contains content matching the preset finger model and that the fingers are not in an abnormal state, removing the fingers from the first video frame to obtain the second video frame;
the abnormal state corresponding to at least one of the following situations:
both of the user's hands are located at the bottom of the first video frame, and the distance between the user's two hands is greater than a first preset distance;
one of the user's hands is located in the bottom area of the first video frame, and the distance between the other hand and the bottom area is greater than a preset distance;
the area of the user's fingers in the first video frame is greater than a preset area threshold;
the user's fingers in the first video frame occlude the face.
- The method according to claim 15, characterized in that the method further comprises:
upon determining that the first video frame contains content matching the preset finger model and that the fingers are in an abnormal state, sending the first video frame to the display for display.
- The method according to any one of claims 10-16, characterized in that the determining that the first video frame contains content matching the preset finger model comprises:
inputting the first video frame into a semantic segmentation model and determining the finger area in the first video frame through the semantic segmentation model; when the finger area exists, it is determined that the first video frame contains content matching the preset finger model; the semantic segmentation model is obtained by training on sample photos, each of which contains a photo of the user's fingers and is labeled with the finger area.
- The method according to any one of claims 10-16, characterized in that the method further comprises:
in response to a user's operation of starting video communication, displaying a video communication interface, the video communication interface comprising a video preview window and a video receiving window, the video preview window being used to display video frames generated at the local end and the video receiving window being used to display video frames received from a peer electronic device;
the determining that the first video frame contains content matching the preset finger model and then removing the fingers from the first video frame to obtain a second video frame comprising:
inputting the first video frame into a semantic segmentation model to obtain a finger area mask; performing noise reduction on the finger area mask to obtain the finger area; judging whether the finger area contains a bottom connected region that adjoins a side of the first video frame and, if the finger area contains a bottom connected region that adjoins a side of the first video frame, removing the fingers from the first video frame; the noise reduction on the finger area mask comprising at least one of the following: performing erosion and dilation operations on the binary image of the finger area mask according to the obtained finger area mask and the area where the face is located; finding all connected regions of the binary image of the finger area mask and filtering out connected regions whose area is smaller than a preset threshold; filtering out connected regions of the finger area mask whose circumscribed rectangle is smaller than or equal to a preset area threshold, the corresponding preset area threshold differing according to the region in which the connected region is located; filtering out, among the connected regions of the finger area mask, regions that do not belong to the bottom connected region; filtering out connected regions of the finger area mask that overlap the area where the face is located; filtering out connected regions of the finger area mask whose aspect ratio is smaller than a second preset threshold; computing the area ratio of each connected region to its circumscribed rectangle and filtering out connected regions whose area ratio is smaller than or equal to a third preset threshold;
judging whether the similarity value between a background frame and the first video frame is greater than a preset similarity value and, if so, performing motion compensation on the background frame to obtain a motion-compensated background frame, the background frame being a video frame that does not contain the fingers and was captured before the first video frame; determining the finger area of the motion-compensated background frame as replacement content; replacing the finger area in the first video frame with the replacement content; and rendering the filled-in replacement content and the surrounding background area with an ambient light rendering method to obtain the second video frame; the performing motion compensation on the background frame comprising: performing motion offset estimation based on the background frame and the current frame to obtain a motion offset matrix of the background frame relative to the current frame, the obtaining of the motion offset matrix comprising: detecting feature points of the first video frame and the background frame, matching the feature points of the first video frame and the background frame to find paired feature points, and computing a perspective transformation matrix from the paired feature points, the perspective transformation matrix being the motion offset matrix characterizing the motion of the background frame relative to the current frame; and performing motion compensation on the background frame according to the motion offset matrix to obtain the motion-compensated background frame;
the method further comprising: after the second video frame is obtained, using the second video frame as the background frame;
the method further comprising: after controlling the display to display the second video frame, upon determining that a captured fourth video frame contains content matching the preset finger model and that the fingers are in an abnormal state, determining at least one transition frame according to the finger area of the second video frame and the finger area of the fourth video frame; displaying the at least one transition frame; and, after the at least one transition frame is displayed, displaying the fourth video frame in the video preview window.
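One simple way to realize the transition frames recited in the preceding claim is a linear cross-fade from the finger-removed frame to the raw frame, so the fingers reappear gradually. This interpolation strategy is an assumption, since the claim only requires that the transition frames be derived from the two finger areas:

```python
import numpy as np

def make_transition_frames(clean: np.ndarray, raw: np.ndarray,
                           steps: int = 4) -> list:
    """Blend from the finger-removed frame to the raw frame (assumed
    linear cross-fade; `steps` controls how many transition frames)."""
    frames = []
    for i in range(1, steps + 1):
        a = i / (steps + 1)                    # weight of the raw frame
        blend = (1 - a) * clean.astype(np.float32) + a * raw.astype(np.float32)
        frames.append(blend.astype(np.uint8))
    return frames
```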
- An electronic device, characterized by comprising:
one or more processors;
one or more memories;
multiple application programs;
and one or more computer programs, wherein the one or more computer programs are stored in the one or more memories and comprise instructions which, when executed by the one or more processors of the electronic device, cause the electronic device to execute the following steps:
obtaining a first video frame and obtaining a keyboard input signal; upon determining that the first video frame contains content matching a preset finger model and that the time of the obtained keyboard input signal and the time of obtaining the first video frame satisfy a preset time threshold, removing the fingers from the first video frame to obtain a second video frame; and indicating display of the second video frame.
- The electronic device according to claim 19, characterized in that the removing the fingers from the first video frame to obtain a second video frame comprises:
replacing the content of the finger area in the first video frame with replacement content to obtain the second video frame; or
cropping the first video frame to obtain the second video frame without the finger area; or
filling the finger area with pixels from the area neighboring the finger area to obtain the second video frame.
- The electronic device according to claim 20, characterized in that, when the instructions are executed by the electronic device, the electronic device further executes the following steps:
capturing a third video frame before the first video frame is captured;
upon determining that the third video frame contains no content matching the preset finger model, using the third video frame as a background frame;
determining, in the background frame, the content corresponding to the finger area of the first video frame as the replacement content.
- The electronic device according to claim 19, characterized in that the determining that the first video frame contains content matching the preset finger model and that the time of the obtained keyboard input signal and the time of obtaining the first video frame satisfy a preset time threshold, and then removing the fingers from the first video frame to obtain a second video frame, comprises:
upon determining that the first video frame contains content matching the preset finger model, that the fingers are not in an abnormal state, and that the time of the obtained keyboard input signal and the time of obtaining the first video frame satisfy the preset time threshold, removing the fingers from the first video frame to obtain the second video frame;
the abnormal state corresponding to at least one of the following situations:
both of the user's hands are located at the bottom of the first video frame, and the distance between the user's two hands is greater than a first preset distance;
one of the user's hands is located in the bottom area of the first video frame, and the distance between the other hand and the bottom area is greater than a preset distance;
the area of the user's fingers in the first video frame is greater than a preset area threshold;
the user's fingers in the first video frame occlude the face.
- The electronic device according to claim 22, characterized in that, when the instructions are executed by the electronic device, the electronic device further executes the following step:
upon determining that the first video frame contains content matching the preset finger model and that the fingers are in an abnormal state, indicating display of the first video frame.
- The electronic device according to any one of claims 19-23, characterized in that the determining that the first video frame contains content matching the preset finger model comprises:
inputting the first video frame into a semantic segmentation model and determining the finger area in the first video frame through the semantic segmentation model; when the finger area exists, it is determined that the first video frame contains content matching the preset finger model; the semantic segmentation model is obtained by training on sample photos, each of which contains a photo of the user's fingers and is labeled with the finger area.
- 一种视频通信控制方法,其特征在于,包括:A video communication control method, characterized in that it comprises:获得第一视频帧,且获得键盘输入信号;Obtain the first video frame, and obtain the keyboard input signal;确定所述第一视频帧中包含符合预设手指模型的内容,且获得的键盘输入信号的时间与获得所述第一视频帧的时间满足预设时间阈值,则去除所述第一视频帧中的手指,获得第二视频帧;It is determined that the first video frame contains content that conforms to the preset finger model, and the time when the keyboard input signal is obtained and the time when the first video frame is obtained meet the preset time threshold, then the first video frame is removed Finger to obtain the second video frame;指示显示所述第二视频帧。Instruct to display the second video frame.
- 如权利要求25所述的方法,其特征在于,所述去除所述第一视频帧中的手指,获得第二视频帧,包括:The method of claim 25, wherein the removing a finger in the first video frame to obtain a second video frame comprises:以替换内容替换所述第一视频帧中的手指区域的内容,获得所述第二视频帧;或者Replacing the content of the finger area in the first video frame with replacement content to obtain the second video frame; or对所述第一视频帧进行裁剪以获得不包含所述手指区域的所述第二视频帧;或者,Crop the first video frame to obtain the second video frame that does not include the finger area; or,通过所述手指区域的邻近区域的像素对所述手指区域进行填充,获得所述第二视频帧。Filling the finger area with pixels in the vicinity of the finger area to obtain the second video frame.
- 如权利要求26所述的方法,其特征在于,所述方法还包括:The method of claim 26, wherein the method further comprises:在采集获得所述第一视频帧之前,采集获得第三视频帧;Before acquiring the first video frame, acquiring a third video frame;确定所述第三视频帧中不包含符合预设手指模型的内容,则将所述第三视频帧作为背景帧;Determining that the third video frame does not contain content that meets the preset finger model, then using the third video frame as a background frame;在所述以替换内容替换所述第一视频帧中的手指区域的内容,获得所述第二视频帧之前,所述方法还包括:在所述背景帧中,确定出和所述第一视频帧中的手指区域对应的内容作为所述替换内容。Before the replacing the content of the finger area in the first video frame with the replacement content to obtain the second video frame, the method further includes: determining the connection with the first video in the background frame The content corresponding to the finger area in the frame is used as the replacement content.
- 如权利要求25所述的方法,其特征在于,所述确定所述第一视频帧中包含符合预设手指模型的内容,且获得的键盘输入信号的时间与获得所述第一视频帧的时间满足预设时间阈值,则去除所述第一视频帧中的手指,获得第二视频帧,包括:25. The method of claim 25, wherein the determining that the first video frame contains content conforming to a preset finger model, and the time when the keyboard input signal is obtained is the same as the time when the first video frame is obtained When the preset time threshold is met, removing the finger in the first video frame to obtain the second video frame includes:确定所述第一视频帧中包含符合预设手指模型的内容,且所述手指不处于异 常状态,且且获得的键盘输入信号的时间与获得所述第一视频帧的时间满足预设时间阈值,则去除所述第一视频帧中的手指,获得所述第二视频帧;It is determined that the first video frame contains content that conforms to the preset finger model, the finger is not in an abnormal state, and the time of obtaining the keyboard input signal and the time of obtaining the first video frame meet the preset time threshold , Remove the finger in the first video frame to obtain the second video frame;所述异常状态对应以下至少一种情况:The abnormal state corresponds to at least one of the following conditions:所述第一视频帧中用户的两只手位于所述第一视频帧的底部,且用户的两只手的距离大于第一预设距离;The two hands of the user in the first video frame are located at the bottom of the first video frame, and the distance between the two hands of the user is greater than the first preset distance;所述第一视频帧中用户一只手位于底部区域,另一只手与所述底部区域距离大于预设距离;In the first video frame, one hand of the user is located at the bottom area, and the distance between the other hand and the bottom area is greater than a preset distance;所述第一视频帧中用户的手指的面积大于预设面积阈值;The area of the user's finger in the first video frame is greater than a preset area threshold;所述第一视频帧中用户的手指遮挡住脸部。The user's finger in the first video frame covers the face.
- The method according to claim 28, further comprising: when it is determined that the first video frame contains content conforming to the preset finger model and that the finger is in an abnormal state, instructing display of the first video frame.
- The method according to any one of claims 25-29, wherein the determining that the first video frame contains content conforming to the preset finger model comprises: inputting the first video frame into a semantic segmentation model, and determining the finger area in the first video frame through the semantic segmentation model; when the finger area exists, determining that the first video frame contains content conforming to the preset finger model; wherein the semantic segmentation model is obtained by training on sample photos, each sample photo containing the user's finger and having its finger area labeled. (An inference sketch follows the claims.)
- A computer-readable storage medium comprising instructions, wherein when the instructions are run on an electronic device, the electronic device is caused to execute the method according to any one of claims 16-30.
- A computer program product, wherein the computer program product comprises software code for executing the method according to any one of claims 16-30.
- A chip containing instructions, wherein when the chip runs on an electronic device, the electronic device is caused to execute the method according to any one of claims 16-30.
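The gating behaviour recited in claim 25 can be illustrated with a short sketch. This is a minimal illustration under assumptions, not the claimed implementation: the helper functions `detect_finger_mask` and `remove_finger`, and the 0.5 s threshold, are hypothetical stand-ins for the preset finger model, the removal step, and the preset time threshold.

```python
import numpy as np

TIME_THRESHOLD_S = 0.5  # assumed value for the "preset time threshold"

def process_frame(frame: np.ndarray, frame_ts: float, last_key_ts: float,
                  detect_finger_mask, remove_finger) -> np.ndarray:
    """Return the frame to display: the second video frame with the
    typing fingers removed, or the first video frame unchanged."""
    mask = detect_finger_mask(frame)  # None if no finger-like content found
    recently_typed = abs(frame_ts - last_key_ts) <= TIME_THRESHOLD_S
    if mask is not None and recently_typed:
        return remove_finger(frame, mask)  # the "second video frame"
    return frame  # no finger, or no recent keystroke: display as captured
```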
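Claims 26 and 27 list three ways of obtaining the second video frame: substitute the finger area with content from a previously captured background frame, crop the finger area away, or fill it from neighbouring pixels. A sketch using OpenCV; the bottom-entry assumption in the crop variant and the inpainting radius are choices made for the example, not taken from the source.

```python
import cv2
import numpy as np

def remove_by_background(frame, mask, background):
    # Claim 27: the background frame was captured earlier and contains no
    # finger, so its pixels can stand in for the occluded finger area.
    out = frame.copy()
    out[mask > 0] = background[mask > 0]
    return out

def remove_by_crop(frame, mask):
    # Claim 26, second option: crop so the finger area is excluded
    # (assuming fingers enter from the bottom edge of the frame).
    top = int(np.where(mask > 0)[0].min())
    return frame[:top, :]

def remove_by_inpaint(frame, mask):
    # Claim 26, third option: fill the finger area from adjacent pixels.
    return cv2.inpaint(frame, mask.astype(np.uint8), 5, cv2.INPAINT_TELEA)
```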
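The abnormal-state screening in claim 28 reduces to a few geometric tests on the detected hand and finger regions. In this sketch the hands are represented by normalized centre coordinates, and every threshold is an assumed value; the claims leave the concrete distances and areas to the implementation.

```python
def is_abnormal(hands, finger_area_frac, finger_covers_face,
                bottom_band=0.15, far_apart=0.5,
                far_from_bottom=0.3, max_area_frac=0.25):
    """hands: list of (cx, cy) hand centres, normalized to [0, 1] with the
    origin at the top-left, so cy near 1.0 means near the bottom edge."""
    at_bottom = [cy >= 1.0 - bottom_band for _, cy in hands]
    if len(hands) == 2:
        (cx0, cy0), (cx1, cy1) = hands
        # Case 1: both hands at the bottom but too far apart horizontally.
        if all(at_bottom) and abs(cx0 - cx1) > far_apart:
            return True
        # Case 2: one hand at the bottom, the other far from the bottom area.
        if any(at_bottom) and not all(at_bottom):
            other_cy = cy1 if at_bottom[0] else cy0
            if (1.0 - bottom_band) - other_cy > far_from_bottom:
                return True
    # Case 3: the finger area takes up too much of the frame.
    if finger_area_frac > max_area_frac:
        return True
    # Case 4: the finger occludes the user's face.
    return bool(finger_covers_face)
```

When `is_abnormal` returns True, claim 29 has the device display the first video frame unmodified rather than attempt removal.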
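Claim 30 derives the finger area from a semantic segmentation model trained on photos in which the finger region is labelled. Below is a minimal inference sketch in PyTorch, assuming a trained binary segmentation network exported to TorchScript as `finger_seg.pt`; the file name, 256x256 input size, and 0.5 probability threshold are all assumptions for the example.

```python
from typing import Optional

import cv2
import numpy as np
import torch

model = torch.jit.load("finger_seg.pt").eval()  # assumed TorchScript export

def finger_mask(frame_bgr: np.ndarray) -> Optional[np.ndarray]:
    """Return a binary finger mask at frame resolution, or None if the
    frame contains no content matching the finger model."""
    h, w = frame_bgr.shape[:2]
    rgb = cv2.cvtColor(cv2.resize(frame_bgr, (256, 256)), cv2.COLOR_BGR2RGB)
    x = torch.from_numpy(rgb).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    with torch.no_grad():
        prob = torch.sigmoid(model(x))[0, 0].numpy()  # per-pixel probability
    mask = (cv2.resize(prob, (w, h)) > 0.5).astype(np.uint8) * 255
    return mask if mask.any() else None
```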
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911315367.1 | 2019-12-19 | ||
CN201911315367.1A CN113014846B (en) | 2019-12-19 | 2019-12-19 | Video acquisition control method, electronic equipment and computer readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021121302A1 true WO2021121302A1 (en) | 2021-06-24 |
Family
ID=76382556
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/137100 WO2021121302A1 (en) | 2019-12-19 | 2020-12-17 | Video collection control method, electronic device, and computer-readable storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN113014846B (en) |
WO (1) | WO2021121302A1 (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115914497A (en) * | 2021-08-24 | 2023-04-04 | 北京字跳网络技术有限公司 | Video processing method, device, equipment, medium and program product |
CN114299446A (en) * | 2021-12-17 | 2022-04-08 | 深圳云天励飞技术股份有限公司 | Personnel number identification method and device, electronic equipment and storage medium |
CN116708931B (en) * | 2022-11-14 | 2024-03-15 | 荣耀终端有限公司 | Image processing method and electronic equipment |
CN117041670B (en) * | 2023-10-08 | 2024-04-02 | 荣耀终端有限公司 | Image processing methods and related equipment |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8284249B2 (en) * | 2008-03-25 | 2012-10-09 | International Business Machines Corporation | Real time processing of video frames for triggering an alert |
JP5962249B2 (en) * | 2012-06-21 | 2016-08-03 | 富士通株式会社 | Character input program, information processing apparatus, and character input method |
CN103139547B (en) * | 2013-02-25 | 2016-02-10 | 昆山南邮智能科技有限公司 | The method of pick-up lens occlusion state is judged based on video signal |
US10142522B2 (en) * | 2013-12-03 | 2018-11-27 | Ml Netherlands C.V. | User feedback for real-time checking and improving quality of scanned image |
CN109218748B (en) * | 2017-06-30 | 2020-11-27 | 京东方科技集团股份有限公司 | Video transmission method, device and computer readable storage medium |
CN107909022B (en) * | 2017-11-10 | 2020-06-16 | 广州视睿电子科技有限公司 | A video processing method, apparatus, terminal device and storage medium |
CN109948525A (en) * | 2019-03-18 | 2019-06-28 | Oppo广东移动通信有限公司 | Photographing processing method and device, mobile terminal and storage medium |
2019
- 2019-12-19 CN CN201911315367.1A patent/CN113014846B/en active Active
2020
- 2020-12-17 WO PCT/CN2020/137100 patent/WO2021121302A1/en active Application Filing
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101673141A (en) * | 2008-09-12 | 2010-03-17 | 鸿富锦精密工业(深圳)有限公司 | keyboard |
CN202331363U (en) * | 2011-11-11 | 2012-07-11 | 中国矿业大学 | Keyboard with camera shooting function |
CN103971361A (en) * | 2013-02-06 | 2014-08-06 | 富士通株式会社 | Image processing device and method |
WO2015121981A1 (en) * | 2014-02-14 | 2015-08-20 | 株式会社Pfu | Overhead scanner device, image acquisition method, and program |
US20180040169A1 (en) * | 2016-08-02 | 2018-02-08 | Canon Kabushiki Kaisha | Information processing apparatus, method of controlling information processing apparatus, and storage medium |
CN108257082A (en) * | 2018-02-01 | 2018-07-06 | 北京维山科技有限公司 | Method and apparatus based on fixed area removal image finger |
CN109886981A (en) * | 2019-03-07 | 2019-06-14 | 北京麦哲科技有限公司 | The method and apparatus of finger removal in a kind of scanning of books and periodicals |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116363538A (en) * | 2023-06-01 | 2023-06-30 | 贵州交投高新科技有限公司 | Bridge detection method and system based on unmanned aerial vehicle |
CN116363538B (en) * | 2023-06-01 | 2023-08-01 | 贵州交投高新科技有限公司 | Bridge detection method and system based on unmanned aerial vehicle |
CN118830843A (en) * | 2024-06-20 | 2024-10-25 | 重庆市罗布琳卡科技有限公司 | Physiotherapy equipment control system based on artificial intelligence |
Also Published As
Publication number | Publication date |
---|---|
CN113014846A (en) | 2021-06-22 |
CN113014846B (en) | 2022-07-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021121302A1 (en) | Video collection control method, electronic device, and computer-readable storage medium | |
WO2021115181A1 (en) | Gesture recognition method, gesture control method, apparatuses, medium and terminal device | |
EP3547218B1 (en) | File processing device and method, and graphical user interface | |
US9767359B2 (en) | Method for recognizing a specific object inside an image and electronic device thereof | |
CN112954210B (en) | Photographing method and device, electronic equipment and medium | |
TWI651640B (en) | Organize digital notes on the user interface | |
US12131443B2 (en) | Image processing method and related device | |
CN106648424B (en) | Screenshot method and device | |
CN109951636A (en) | Photographing processing method and device, mobile terminal and storage medium | |
CN108647351B (en) | Text image processing method and device, storage medium and terminal | |
US20230360443A1 (en) | Gesture recognition method and apparatus, electronic device, readable storage medium, and chip | |
WO2017197593A1 (en) | Apparatus, method and computer program product for recovering editable slide | |
WO2017107855A1 (en) | Picture searching method and device | |
US20250039537A1 (en) | Screenshot processing method, electronic device, and computer readable medium | |
CN110463177A (en) | The bearing calibration of file and picture and device | |
CN109033276A (en) | Sticker pushing method and device, storage medium and electronic equipment | |
EP4030343A1 (en) | Facial skin detection method and apparatus | |
WO2022111461A1 (en) | Recognition method and apparatus, and electronic device | |
WO2022088946A1 (en) | Method and apparatus for selecting characters from curved text, and terminal device | |
CN103327251A (en) | Method and device of multimedia shooting processing and terminal device | |
CN110942065B (en) | Text box selection method, text box selection device, terminal equipment and computer readable storage medium | |
CN117132648B (en) | Visual positioning method, electronic equipment and computer readable storage medium | |
CN112381091A (en) | Video content identification method and device, electronic equipment and storage medium | |
CN112822394A (en) | Display control method and device, electronic equipment and readable storage medium | |
CN111079662A (en) | Figure identification method and device, machine readable medium and equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20902393; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 20902393; Country of ref document: EP; Kind code of ref document: A1 |