WO2024245105A1 - Screen splitting method for a video conference screen and related equipment - Google Patents
Screen splitting method for a video conference screen and related equipment
- Publication number
- WO2024245105A1 (PCT Application No. PCT/CN2024/095004)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- target object
- sub-screen
- target
- speaking
- Prior art date
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N7/00—Television systems
- H04N7/14—Systems for two-way working
- H04N7/15—Conference systems
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N5/00—Details of television systems
- H04N5/222—Studio circuitry; Studio devices; Studio equipment
- H04N5/262—Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
- H04N5/268—Signal distribution or switching
Definitions
- the present disclosure relates to the field of computer technology, and in particular to a screen splitting method for a video conference screen and related equipment.
- the present disclosure proposes a screen splitting method for a video conference screen and related equipment to solve or partially solve the above-mentioned problems.
- a method for splitting a video conference screen comprising:
- a video conference screen splitting device comprising:
- An acquisition module is configured to: acquire a target image acquired by an acquisition unit;
- a detection module is configured to: detect a target object in the target image
- a division module is configured to: divide the video conference screen into at least two sub-screens according to the target object;
- the display module is configured to display the target object in the target image in the at least two sub-pictures accordingly.
- a computer device comprising one or more processors, a memory, and one or more programs, wherein the one or more programs are stored in the memory and executed by the one or more processors, and the programs include instructions for executing the method described in the first aspect.
- a non-volatile computer-readable storage medium containing a computer program is provided; when the computer program is executed by one or more processors, the processors execute the method described in the first aspect.
- a computer program product including computer program instructions.
- when the computer program instructions are executed on a computer, the computer is caused to execute the method according to the first aspect.
- FIG. 1A shows a schematic diagram of an exemplary system provided by an embodiment of the present disclosure.
- FIG. 1B is a schematic diagram showing a video conference screen captured in the scene shown in FIG. 1A .
- FIG. 2 is a schematic diagram showing an exemplary process according to an embodiment of the present disclosure.
- FIG. 3A shows a schematic diagram of an exemplary target image according to an embodiment of the present disclosure.
- FIG. 3B shows a schematic diagram of displaying a detection frame in a target image according to an embodiment of the present disclosure.
- FIG. 3C shows a schematic diagram of an exemplary video conference screen according to an embodiment of the present disclosure.
- FIG. 3D shows a schematic diagram of another exemplary video conference screen according to an embodiment of the present disclosure.
- FIG. 3E shows a schematic diagram of a split-screen mode according to an embodiment of the present disclosure.
- FIG. 3F shows a schematic diagram of another split-screen mode according to an embodiment of the present disclosure.
- FIG. 3G shows a schematic diagram of another exemplary video conference screen according to an embodiment of the present disclosure.
- FIG. 3H shows a schematic diagram of facial key point detection.
- FIG. 3I shows a schematic diagram of a rotating human face.
- FIG. 4 shows a schematic diagram of an exemplary method provided by an embodiment of the present disclosure.
- FIG. 5 shows a schematic diagram of the hardware structure of an exemplary computer device provided in an embodiment of the present disclosure.
- FIG. 6 shows a schematic diagram of an exemplary device provided by an embodiment of the present disclosure.
- FIG. 1A shows a schematic diagram of an exemplary system 100 provided by an embodiment of the present disclosure.
- the system 100 may include at least one terminal device (e.g., terminal devices 102 and 104), a server 106, and a database server 108.
- the terminal devices 102 and 104 may be connected to the server 106 and the database server 108 via a medium providing a communication link, such as a network, which may include various connection types, such as wired or wireless communication links or fiber optic cables.
- Users 110A to 110C can use terminal device 102 to interact with server 106 through the network to receive or send messages, etc.
- user 112 can use terminal device 104 to interact with server 106 through the network to receive or send messages, etc.
- Various applications can be installed on terminal devices 102 and 104, such as video conferencing applications, reading applications, video applications, social applications, payment applications, web browsers, and instant messaging tools.
- users 110A to 110C and user 112 can respectively use video conferencing applications installed on terminal devices 102 and 104 to use video conferencing services provided by server 106, and terminal devices 102 and 104 can capture images 1022 and 1042 through cameras (for example, cameras installed on terminal devices 102 and 104) and can capture live audio through microphones (for example, microphones installed on terminal devices 102 and 104) and upload them to server 106, so that users 110A to 110C and user 112 can respectively view each other's images and hear each other's voices through video conferencing applications on terminal devices 102 and 104.
- the terminal devices 102 and 104 here can be hardware or software.
- the terminal devices 102 and 104 can be various electronic devices with display screens, including but not limited to smart phones, tablet computers, e-book readers, MP3 players, laptop computers (Laptops) and desktop computers (PCs), etc.
- when the terminal devices 102 and 104 are software, they can be installed in the electronic devices listed above. They can be implemented as multiple software or software modules (for example, to provide distributed services), or they can be implemented as a single software or software module. No specific limitation is made here.
- the server 106 may be a server that provides various services, such as a background server that provides support for various applications displayed on the terminal devices 102 and 104.
- the database server 108 may also be a database server that provides various services. It is understood that in the case where the server 106 can implement the relevant functions of the database server 108, the database server 108 may not be set in the system 100.
- the server 106 and the database server 108 can also be hardware or software.
- they can be implemented as a distributed server cluster consisting of multiple servers, or as a single server.
- they can be implemented as multiple software or software modules (for example, to provide distributed services), or as a single software or software module. No specific limitation is made here.
- the screen splitting method of the video conference screen provided in the embodiment of the present application is generally executed by the terminal devices 102 and 104.
- FIG. 1B is a schematic diagram showing a video conference screen 120 captured in the scene shown in FIG. 1A .
- the video conference screen 120 can display the images captured by the two cameras in two sub-screens 1202 and 1204 respectively, wherein the camera on the terminal device 102 side captures the image of the entire conference room, and therefore, the sub-screen 1202 will display the corresponding images of the users 110A to 110C.
- an embodiment of the present disclosure provides a method for splitting a video conference screen, by detecting a target object in a target image captured by an acquisition unit, and then dividing the video conference screen into at least two sub-screens according to the target object, and displaying the target object in the target image in the sub-screen accordingly.
- the video conference screen can be automatically split, thereby helping to enhance the sense of interaction between the participants in the conference room and other online participants during the video conference, thereby enhancing the user experience.
- FIG. 2 shows a flow chart of an exemplary method 200 provided in an embodiment of the present disclosure.
- the method 200 can be used to automatically split the video conference screen.
- the method 200 can be implemented by the terminal devices 102 and 104 of FIG. 1A, or by the server 106 of FIG. 1A. The following is an explanation of the implementation of the method 200 by the server 106.
- the method 200 may further include the following steps.
- the server 106 may obtain a target image captured by a capture unit.
- the capture unit may be a camera provided in the terminal devices 102 and 104, and the target image may be images 1022 and 1042 captured by the camera.
- the terminal devices 102 and 104 may upload the captured image to the server 106 for processing.
- the server 106 may detect the target object in the target image 300.
- an object detection technology may be used to detect the target object in the target image 300.
- a pre-trained object detection model may be used to detect the target objects 302A-302C in the target image 300, and detection frames 304A-304C corresponding to the target objects 302A-302C may be obtained, as shown in FIG. 3B.
- target tracking technology can be used to track the position changes of the target object in real time, and accordingly, the detection frame will also change accordingly. In this way, even if the participant moves during the video conference, the target object can be tracked.
- a pre-trained target tracking model can be used to track the target objects 302A to 302C in the target image 300.
- the target tracking model can be a real-time face frame and face key point detection and tracking model based on deep learning.
- the model structure includes but is not limited to various forms of convolutional neural networks (Convolutional Neural Network) and various forms of Transformer networks.
- the process may proceed to step 206 , where the server 106 may directly determine the split-screen layout according to the target object, and then divide the video conference screen based on the split-screen layout.
- FIG. 3C shows a schematic diagram of an exemplary video conference screen 310 according to an embodiment of the present disclosure.
- the video conference screen 310 is divided into a plurality of sub-screens.
- the video conference screen can be divided according to the total number of target objects in the target images collected by the terminal devices currently participating in the video conference. For example, taking the scene shown in FIG. 1A as an example, the number of sub-screens can be determined according to the total number of target objects in the images collected by the terminal devices 102 and 104, which is 4 in this example. After determining the number of sub-screens, the split-screen layout can be determined.
- in some embodiments, the video conference screen 310 can first be divided into n sub-screens according to the number of terminal devices n, with each terminal device mapped to one of the n sub-screens. Next, the sub-screen corresponding to each terminal device is further divided according to the number of target objects in the image collected by that terminal device, so that the split-screen layout is obtained. As shown in FIG. 3C, there are left and right sub-screens corresponding to the images captured by the two terminal devices 102 and 104, respectively, wherein the left sub-screen further includes three sub-screens corresponding to the target objects 302A to 302C, respectively.
- the right sub-screen displays the screen captured by the terminal device 104, which may include the target object 312. In this way, a complete video conference screen 310 is formed. Moreover, since the sub-screens corresponding to different terminal devices are divided in the screen 310 (for example, divided into equal sizes), the user can know the specific number of terminals participating in the video conference through the split-screen layout.
- FIG. 3D shows a schematic diagram of another exemplary video conference screen 320 according to an embodiment of the present disclosure.
- the video conference screen 320 is divided into four sub-screens of equal size according to the total number of target objects, corresponding to target objects 302A to 302C and target object 312, respectively.
- in this way, all participants, whether they are among multiple participants in a conference room or attending individually, can occupy a sub-screen of equal size, so that each participant can interact with every other participant very clearly.
- different layout methods can be selected according to the number of target objects.
- the screen can be divided into a square array of sub-screens (e.g., N×N sub-screens) or a non-square array of sub-screens (e.g., N×M sub-screens) according to this number during layout.
- for example, when the number of target objects is 3, the screen can be divided into two rows, with one sub-screen in the first row and two sub-screens in the second row, as shown in the left part of FIG. 3C.
- the aforementioned method can also be used for split screen layout.
- for example, when the number of target objects is 7, 3×3 sub-screens can be used to correspond to the target objects, wherein two sub-screens can be left blank, as shown in FIG. 3E.
- the number of sub-screens can be increased in the length direction when performing split-screen layout.
- the maximum number of people supported by a video conference screen is 12.
- for example, when N ≤ 5, the screen can be divided into 1 row and N columns.
- when 5 < N ≤ 8, the screen can be divided into 2 rows; when 8 < N ≤ 12, the screen can be divided into 3 rows, where the first two rows are each divided into N/3 columns (rounded up or down) and the last row into the remaining N − (N/3)×2 columns (rounded down or up accordingly).
- the width of each sub-screen in the last row is consistent with the width of the sub-screen in the previous row, and can be arranged in the center.
- FIG. 3F shows a schematic diagram of another exemplary video conferencing screen according to an embodiment of the present disclosure. As shown in FIG. 3F, when the number of target objects is 7, the above method can be used to split the screen into a 4+3 layout.
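- as a non-limiting illustration of the row and column rules above, the following Python sketch computes one sub-screen rectangle per target object for N up to 12; the function name, the 1920×1080 resolution and the exact rounding choices are assumptions of this sketch rather than part of the disclosure.

```python
import math

def split_screen_layout(n, width=1920, height=1080):
    """Illustrative grid layout for n target objects (1 <= n <= 12).

    Follows the row rules described above: 1 row for n <= 5, 2 rows for
    n <= 8, 3 rows for n <= 12, with a shorter last row centered.
    Returns one (x, y, w, h) rectangle in pixels per target object.
    """
    if not 1 <= n <= 12:
        raise ValueError("this sketch only handles 1 to 12 participants")
    if n <= 5:
        rows = [n]
    elif n <= 8:
        rows = [math.ceil(n / 2), n - math.ceil(n / 2)]   # e.g. 7 -> 4 + 3
    else:
        per_row = math.ceil(n / 3)                        # first two rows
        rows = [per_row, per_row, n - 2 * per_row]
    row_h = height / len(rows)
    cell_w = width / max(rows)          # keep every sub-screen the same width
    rects = []
    for r, cols in enumerate(rows):
        x0 = (width - cols * cell_w) / 2                  # center a shorter row
        for c in range(cols):
            rects.append((x0 + c * cell_w, r * row_h, cell_w, row_h))
    return rects

# Example: 7 target objects produce the 4 + 3 layout of FIG. 3F.
for rect in split_screen_layout(7):
    print([round(v) for v in rect])
```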
- the video conference screen is divided into at least two sub-screens according to the number of target objects, and the corresponding target objects are displayed in the sub-screens.
- the video conference screen can be automatically split, which helps to enhance the sense of interaction between the participants in the conference room and other online participants during the video conference, thereby improving the user experience.
- in method 200, when performing the split-screen layout, in addition to considering the number of target objects, it is also possible to detect whether a participant is speaking, and then perform corresponding processing on the sub-screen corresponding to the detected speaking participant. Therefore, as shown in FIG. 2, method 200 further includes step 208, performing speaker detection on the target object. Optionally, this step can be processed in parallel with the target object detection, thereby increasing the processing speed.
- in some embodiments, an indication mark (for example, a microphone-style icon indicating that the participant is speaking) may be displayed in the sub-screen corresponding to the participant who is speaking, thereby reminding others that the participant is speaking.
- if it is determined according to the detection result that a participant in the video conference is speaking, the video conference screen can be divided into at least two sub-screens according to a first split-screen mode; if it is determined according to the detection result that no participant in the video conference is speaking, the video conference screen can be divided into at least two sub-screens according to a second split-screen mode.
- the second split-screen mode can be any of the split-screen layouts in the aforementioned embodiments.
- the video conference screen is divided into at least two sub-screens according to a first split-screen mode, including: enlarging and displaying a first sub-screen of the at least two sub-screens, and displaying other sub-screens of the at least two sub-screens in parallel on at least one side of the first sub-screen, and the first sub-screen can be used to display the target object who is speaking.
- FIG. 3G shows a schematic diagram of another exemplary video conference screen 330 according to an embodiment of the present disclosure.
- the screen 330 includes four sub-screens, which correspond to the target objects 302A to 302C and the target object 312, respectively, wherein the first sub-screen 3302 is enlarged and corresponds to the target object 302A of the participant 110A who is speaking, and the other sub-screens are displayed side by side on one side of the first sub-screen 3302.
- the sub-screen of the participant who is speaking is arranged in the middle of the screen and occupies a larger screen, and the sub-screen of the non-speaker is arranged on the side and occupies a smaller screen, so as to better improve the interactivity.
- when someone is speaking, the speaker mode (first split-screen mode) is used: the person currently speaking is placed on the largest sub-screen, and the other participants are arranged side by side on at least one side of the largest sub-screen (when there are a large number of participants, they can be placed on two or more sides).
- when no one is speaking, the normal split-screen mode (second split-screen mode) is used, and each sub-screen is the same size.
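- the geometry of the speaker mode can be sketched in the same spirit; the right-hand thumbnail strip and its 25% width in the Python example below are purely illustrative assumptions, not values taken from the disclosure.

```python
def speaker_mode_layout(n_others, width=1920, height=1080, strip_ratio=0.25):
    """Illustrative speaker mode (first split-screen mode) layout.

    The speaker occupies one enlarged sub-screen; the remaining n_others
    participants are stacked in a strip on the right-hand side.
    Returns (speaker_rect, [other_rects]) as (x, y, w, h) tuples.
    """
    strip_w = width * strip_ratio
    speaker_rect = (0, 0, width - strip_w, height)        # enlarged first sub-screen
    thumb_h = height / max(n_others, 1)
    other_rects = [(width - strip_w, i * thumb_h, strip_w, thumb_h)
                   for i in range(n_others)]
    return speaker_rect, other_rects

# Example: one speaker plus three other participants, as in FIG. 3G.
speaker, others = speaker_mode_layout(3)
print(speaker, others)
```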
- video conferencing software can confirm whether someone is speaking in the corresponding video based on the audio stream.
- however, the participants in the same conference room share a single video stream and audio stream.
- as a result, the audio stream collected by the terminal device on the conference room side cannot determine whether the people in the room are speaking, nor can it distinguish who is currently speaking, which reduces the interactivity of the meeting.
- speaker detection is performed by image processing, which can avoid the problem of being unable to distinguish who is currently speaking through the audio stream.
- the server 106 may perform key point detection on each detected target object, and then determine whether the participant corresponding to the target object in the target image is speaking according to the key point detection result.
- FIG. 3H shows a schematic diagram of facial key point detection.
- the key point detection of the human face can adopt a 68-key-point detection method, in which the key points are distributed over various parts of the face: points 0-16 correspond to the jaw line, points 17-21 correspond to the right eyebrow (in the mirrored view, i.e., the right eyebrow of the person in the picture), points 22-26 to the left eyebrow, points 27-35 to the nose, points 36-41 to the right eye, points 42-47 to the left eye, and points 48-67 to the lips.
- the human face can be recognized, and according to the changes of the key points in the target object in multiple consecutive frames, it can be determined whether the corresponding participant is speaking.
- facial key point detection method with 68 key points is only an example. It can be understood that the facial key point detection method can also have other numbers of key points, for example, 21 key points, 29 key points, and so on.
- 106 key points may be used to implement key point detection, thereby obtaining a more accurate detection result.
- the lip height can be determined based on the key points of the target object, and the lip width can be determined based on the key points of the target object; then the lip height-to-width ratio of the target object is obtained based on the lip height and the lip width, and then based on the change information of the lip height-to-width ratio, it is determined whether the target object is speaking.
- speaker detection is achieved by using image processing, which can avoid the problem of not being able to distinguish who is currently speaking through the audio stream.
- the detected key points can be corrected based on the rotation angle of the face.
- FIG. 3I shows a schematic diagram of a rotating human face.
- as shown in FIG. 3I, the rotation of a human face can be described by three angles: the yaw angle (Yaw), the roll angle (Roll), and the pitch angle (Pitch).
- the influence of the Roll rotation can be offset by an affine transformation, and then the influence of the Pitch and Yaw rotation can be offset by the Pitch and Yaw information of the face detection.
- specifically, the detected key points of the target object (e.g., the coordinates of 106 key points) are matched with the key points of a standard (average) face (i.e., the standard key points), so that an affine transformation matrix (mapping relationship) can be obtained.
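- a minimal sketch of this roll-angle correction, assuming the key points are available as a (K, 2) array and that a standard (average) face shape of the same size is known; the use of OpenCV's estimateAffinePartial2D is an implementation choice of this sketch.

```python
import cv2
import numpy as np

def correct_roll(landmarks, standard_landmarks):
    """Map detected landmarks onto a standard (average) face to cancel
    in-plane (Roll) rotation.

    landmarks, standard_landmarks: (K, 2) arrays of key point coordinates
    (e.g. K = 106). Returns the corrected key points and the fitted
    affine (similarity) transformation matrix.
    """
    src = np.asarray(landmarks, dtype=np.float32)
    dst = np.asarray(standard_landmarks, dtype=np.float32)
    # Similarity transform (rotation + scale + translation) fitted over all points;
    # assumes the fit succeeds (matrix is not None) for a valid detection.
    matrix, _ = cv2.estimateAffinePartial2D(src, dst)
    corrected = cv2.transform(src.reshape(-1, 1, 2), matrix).reshape(-1, 2)
    return corrected, matrix
```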
- a plurality of first key points and a plurality of second key points corresponding to the lip height and the lip width respectively can be selected from the corrected plurality of key points, and the pitch angle correction is performed on the plurality of first key points to obtain the corrected lip height, and the yaw angle correction is performed on the plurality of second key points to obtain the corrected lip width.
- the corrected lip height and the corrected lip width offset the influence of the Pitch and Yaw rotations.
- for example, the length of the line segment from key point 98 to key point 102 can be calculated to represent the lip height, and divided by cos(Pitch) to offset the influence of Pitch, thereby obtaining the corrected lip height.
- similarly, the length of the line segment from key point 96 to key point 100 can be calculated to represent the lip width, and divided by cos(Yaw) to offset the influence of Yaw, thereby obtaining the corrected lip width.
- the angle information of Pitch and Yaw can be provided by the face detection module.
- the key point detection result (the detection result for the lip height and the lip width) can be obtained, so that the degree of mouth opening can be expressed by the height-to-width ratio of the lips after correction.
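- a minimal sketch of the corrected lip height-to-width ratio, assuming a 106-point landmark array in which points 98 and 102 span the lip height and points 96 and 100 span the lip width (as in the example above), and that the face detector reports Pitch and Yaw in degrees:

```python
import numpy as np

def lip_aspect_ratio(corrected_pts, pitch_deg, yaw_deg):
    """Corrected lip height-to-width ratio from roll-corrected key points.

    corrected_pts: (106, 2) array of roll-corrected key points;
    pitch_deg / yaw_deg: head pose angles reported by the face detector.
    A larger ratio indicates a more widely opened mouth.
    """
    pitch = np.deg2rad(pitch_deg)
    yaw = np.deg2rad(yaw_deg)
    lip_height = np.linalg.norm(corrected_pts[98] - corrected_pts[102]) / np.cos(pitch)
    lip_width = np.linalg.norm(corrected_pts[96] - corrected_pts[100]) / np.cos(yaw)
    return lip_height / lip_width
```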
- the change of the height-to-width ratio of the lips over a period of time can be maintained, and the variance of the height-to-width ratio during this period of time can be used to determine whether the current subject is speaking.
- the lip height-to-width ratio of the target object may be calculated and stored according to the corrected lip height and the corrected lip width.
- the change information corresponding to the lip height-width ratio of the target object in the target image can be determined in combination with the key point detection result, and whether the participant corresponding to the target object in the target image is speaking can be determined based on the change information. For example, it can be determined whether the participant is speaking based on whether the variance of the lip height-width ratio within a preset time period (for example, within 1 second) is greater than the variance threshold, and if it is greater, it can be determined that the participant is speaking. In this way, when determining whether the participant is speaking, the result can be more accurate.
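- the variance test over a sliding window can be sketched as follows; the window length (roughly one second of frames) and the variance threshold are illustrative assumptions, not values fixed by the disclosure.

```python
from collections import deque

import numpy as np

class LipRatioWindow:
    """Keeps recent lip height-to-width ratios and flags speech-like motion."""

    def __init__(self, window=30, var_threshold=0.003):
        self.ratios = deque(maxlen=window)     # ~1 s of history at 30 fps
        self.var_threshold = var_threshold

    def update(self, ratio):
        """Add this frame's ratio; return True if the recent variance
        exceeds the threshold (i.e. the mouth is moving as in speech)."""
        self.ratios.append(ratio)
        if len(self.ratios) < self.ratios.maxlen:
            return False                       # not enough history yet
        return float(np.var(self.ratios)) > self.var_threshold
```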
- a counter can be maintained to record the number of times the speaker is judged to be speaking in the recent period, and then determine whether the speaker is speaking based on the relationship between the number and the preset counting threshold.
- determining whether the target object is speaking includes: setting a preset time period (for example, within 2 seconds); counting the number of changes in the lip aspect ratio within the preset time period; when the number of changes reaches a preset number, determining that the target object is speaking. In this way, the timing information of the key points is used to enhance the stability of the speaker detection in timing and reduce the fluctuation of the detection state.
- in response to determining according to the change information that the participant corresponding to the target object in the target image is speaking (that is, the target object is determined to be speaking in the current frame), the count value is increased by 1; in response to determining according to the change information that the participant corresponding to the target object in the target image is not speaking (that is, the target object is determined to be not speaking in the current frame), the count value is decreased by 1.
- then, whether the participant corresponding to the target object is speaking can be determined according to the count value within a preset time period (for example, within 2 seconds). For example, when the counter value is greater than a preset count threshold (e.g., 2), the effect that the target object is speaking can be displayed (e.g., enlarging the sub-screen corresponding to the speaker and/or displaying the microphone icon); when the counter value is less than the preset count threshold, the speaking effect can be canceled (e.g., restoring the sub-screen corresponding to the speaker to the same size as other sub-screens and/or hiding the microphone icon).
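- the counting logic above can be sketched as a small hysteresis counter; the count threshold and upper bound in this Python example are assumed values.

```python
class SpeakerCounter:
    """Temporal smoothing of the per-frame speaking decision."""

    def __init__(self, threshold=2, max_count=10):
        self.count = 0
        self.threshold = threshold
        self.max_count = max_count

    def update(self, speaking_this_frame):
        """+1 for a speaking frame, -1 otherwise (clamped); the speaking
        effect is shown only while the count exceeds the threshold."""
        if speaking_this_frame:
            self.count = min(self.count + 1, self.max_count)
        else:
            self.count = max(self.count - 1, 0)
        return self.count > self.threshold

# Example wiring with the sliding-window test above:
# counter = SpeakerCounter()
# show_speaking_effect = counter.update(window.update(ratio))
```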
- the position of the lips is determined by key point detection after the split screen, and the relative position relationship of the lip key points is used to determine whether the current participant is speaking; at the same time, in order to reduce misjudgments in speech detection caused by face movement, rotation and other actions, the key points from face detection are mapped to a standard, non-rotated face before the speech judgment, reducing the impact of face movement.
- the timing information of the key points is used to enhance the stability of speaker detection in timing and reduce the fluctuation of detection status.
- the audio data of the video conference can also be obtained, and then, based on the key point detection results, combined with the audio data of the video conference, it is determined whether the target object in the target image is speaking. In this way, the accuracy of speaker judgment can be further improved by combining the key point detection results and the audio data of the current video conference.
- for example, when the microphone for collecting audio is a dual-channel microphone, the speaker can be located based on the two sets of audio data collected by the dual channels, thereby further improving the accuracy of speaker judgment.
- step 210 may be entered to match the target object with the detection frame according to the detection frame position and the split screen layout.
- the number of split screens can be increased or decreased according to the detection results, and the layout of the split screen can be changed, so that when someone joins or leaves the conference room, the relative position of the original participants can be kept unchanged, and the number of split screens can be increased or decreased in real time.
- the ROI (Region of Interest) corresponding to each participant in the split-screen layout can be determined based on the coordinates of each participant's detection frame in the original image (target image), thereby determining the one-to-one correspondence between the person, the detection frame and the sub-screen according to the position of each person's detection frame and the split-screen layout.
- the content of the detection frame may be matched with the sub-image.
- specifically, the coordinates of the detection frame corresponding to the target object and the coordinates of the sub-screen corresponding to the target object may be determined first; then, according to these coordinates, the image corresponding to the detection frame may be panned and/or zoomed to the sub-screen corresponding to the target object. By zooming in and out on the target object, all participants can clearly see each other, improving the interactivity of the meeting.
- specifically, the detection frame may first be expanded on both sides according to a certain ratio (for example, 20% for height and 40% for width), and then the width or height of the detection frame may be further expanded until its aspect ratio is the same as that of the corresponding sub-screen. If the expanded ROI exceeds the boundary of the picture, it may be translated back to within the range of the picture, thereby matching the content of the detection frame with the sub-picture.
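- a minimal sketch of this detection-frame-to-sub-screen matching, assuming (x, y, w, h) boxes and the example 20% / 40% expansion factors; border handling is simplified and the expanded ROI is assumed to fit inside the image.

```python
def detection_box_to_roi(box, sub_aspect, img_w, img_h,
                         expand_h=0.20, expand_w=0.40):
    """Turn a face detection box into a crop ROI with the sub-screen's aspect ratio.

    box: (x, y, w, h) in the original image; sub_aspect: sub-screen width / height.
    """
    x, y, w, h = box
    cx, cy = x + w / 2, y + h / 2
    w *= 1 + expand_w                     # expand the box on both sides
    h *= 1 + expand_h
    # Grow one dimension until the ROI matches the sub-screen's aspect ratio.
    if w / h < sub_aspect:
        w = h * sub_aspect
    else:
        h = w / sub_aspect
    x, y = cx - w / 2, cy - h / 2
    # Translate the ROI back inside the image if it crosses a border.
    x = min(max(x, 0), img_w - w)
    y = min(max(y, 0), img_h - h)
    return x, y, w, h
```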
- the ROI of the sub-screen corresponding to each detection frame at the current moment can be calculated by linear interpolation, achieving a smooth movement and scaling effect from the current screen to the target face, and realizing a horizontal translation-vertical translation-zoom (pan-tilt-zoom) effect similar to that of a surveillance camera when switching the split-screen layout.
- a time interval Δt may be determined, and then the number of linear interpolations may be determined according to the pan-scaling duration T and the time interval Δt.
- the number of linear interpolations is used to determine the updated coordinates of the sub-picture corresponding to each interpolation according to the original coordinates of the sub-picture and the coordinates of the ROI of the detection frame.
- the updated coordinates can be transformed at equal intervals relative to the coordinates corresponding to the previous interpolation.
- the sub-picture is gradually subjected to horizontal transformation-vertical transformation-scaling processing until the duration reaches T.
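- the interpolation step can be sketched as a simple generator; the 30 fps time interval Δt and the 0.5-second pan-scaling duration T below are placeholder values.

```python
def interpolate_roi(start_roi, target_roi, duration_s=0.5, dt_s=1 / 30):
    """Yield intermediate (x, y, w, h) ROIs for a smooth software pan/zoom
    from start_roi to target_roi."""
    steps = max(int(round(duration_s / dt_s)), 1)   # number of interpolations
    for i in range(1, steps + 1):
        t = i / steps                               # equal-interval progress
        yield tuple(s + (e - s) * t for s, e in zip(start_roi, target_roi))

# Example: move from the full frame to a face ROI over 0.5 s at 30 fps.
for roi in interpolate_roi((0, 0, 1920, 1080), (600, 200, 480, 270)):
    pass  # crop the frame to roi and render it into the sub-screen here
```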
- the clarity of the sub-picture processed in the aforementioned manner may be affected. Therefore, a super-resolution technology may be used to increase the resolution of the sub-picture, thereby improving the clarity.
- in step 214, it may be determined whether the virtual background function is turned on. If the virtual background function is turned on, step 216 is entered to perform semantic segmentation of the current image based on the target object (this may be processed in parallel with face detection to improve processing efficiency). In this way, the virtual background function of each split screen is realized by using the semantic segmentation capability.
- a pre-trained semantic segmentation model may be used to segment the target object from the background image.
- the semantic segmentation model may be a real-time portrait semantic segmentation model based on deep learning, and the model structure includes but is not limited to various forms of convolutional neural networks (Convolutional Neural Network) and various forms of Transformer networks.
- the portrait segmentation can be applied to the entire input image.
- the segmentation result of each sub-screen corresponds to the portion of the full-image segmentation result within the ROI corresponding to that sub-screen.
- the value corresponding to each pixel point in the input image and the pixel value of the new background to be replaced at that pixel point are used to calculate the pixel value of the virtual background result at that pixel point.
- the first value of the segmentation result at the pixel point (normalized to [0,1]) can be multiplied by the value corresponding to the pixel point in the input image, and the second value of the segmentation result at the pixel point (1 minus the first value) multiplied by the pixel value of the new background to be replaced at the pixel point can be added to obtain the pixel value of the virtual background result at the pixel point.
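- a minimal sketch of this per-pixel blend, assuming a normalized H×W alpha map from the portrait segmentation model and same-sized uint8 frame and background images:

```python
import numpy as np

def apply_virtual_background(frame, alpha, background):
    """result = alpha * input + (1 - alpha) * new background, per pixel.

    frame, background: HxWx3 uint8 images; alpha: HxW float array in [0, 1]
    where 1 means the pixel belongs to the person.
    """
    a = alpha.astype(np.float32)[..., None]          # HxWx1 for broadcasting
    blended = a * frame.astype(np.float32) + (1.0 - a) * background.astype(np.float32)
    return blended.astype(np.uint8)
```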
- the process may proceed to step 218 to render the video conference screen.
- the sub-screen can be rendered with the content of the original image ROI area corresponding to each sub-screen. If the virtual background function is turned on, the background replacement is performed in combination with the result of portrait segmentation and the background to be replaced. If any special processing is required for the detected speaker, it is also completed in this step. For example, the target object corresponding to the participant who is speaking is displayed in the first sub-screen 3302. The virtual background is displayed in the second sub-screen 3304.
- the embodiments of the present disclosure adopt an automatic split-screen system for video conferencing.
- the automatic split-screen function of video can enable people sitting in the same conference room to communicate "face to face" with colleagues participating in the meeting remotely.
- speaker detection can easily identify who is speaking in the current conference room and put the speaker's video stream in a prominent position to enhance the video conferencing experience.
- the proportion of the subject in the picture may be very small due to the venue and distance.
- the PTZ function implemented by software can realize the function of automatic lens tracking and focusing without any operation.
- the aforementioned embodiment is described with the server 106 as the execution subject.
- the aforementioned processing steps may not limit the execution subject.
- the terminal devices 102 and 104 may also implement these processing steps. Therefore, it is also possible to use the terminal devices 102 and 104 as the execution subjects of the aforementioned embodiments.
- FIG. 4 shows a flow chart of an exemplary method 400 provided by the embodiment of the present disclosure.
- the method 400 can be applied to the server 106 of FIG. 1A, and can also be applied to the terminal devices 102 and 104 of FIG. 1A. As shown in FIG. 4, the method 400 can further include the following steps.
- in step 402, a target image acquired by an acquisition unit is acquired.
- the acquisition unit may be a camera provided in the terminal devices 102 and 104
- the target image may be images 1022 and 1042 acquired by the camera.
- an object detection technology may be used to detect the target object in the target image 300.
- a pre-trained object detection model may be used to detect the target objects 302A-302C in the target image 300, and detection frames 304A-304C corresponding to the target objects 302A-302C may be obtained, as shown in FIG. 3B.
- in step 406, the video conference screen is divided into at least two sub-screens according to the target object.
- the embodiment of the present disclosure provides a method for splitting the video conference screen.
- the method detects the target object in the target image captured by the acquisition unit, and then divides the video conference screen into at least two sub-screens according to the target object, and displays the corresponding target object in the sub-screen. Therefore, when the captured target image contains multiple participants, the video conference screen can be automatically split, which helps to enhance the sense of interaction between the participants in the conference room and other online participants during the video conference, thereby improving the user experience.
- the video conference screen is divided into at least two sub-screens according to the number of the target objects, including: determining whether the target object in the target image is speaking; in response to determining that the target object in the target image is speaking, the video conference screen is divided into at least two sub-screens according to the first split-screen mode, as shown in Figure 3G.
- that is, when someone is speaking, the spokesperson mode (first split-screen mode) is adopted: the person who is currently speaking is placed on the largest sub-screen, and the remaining participants are arranged side by side on at least one side of the largest sub-screen (when the number is large, they can be placed on two or more sides).
- when no one is speaking, the normal split-screen mode (second split-screen mode) is adopted, and each sub-screen is the same size.
- the video conference screen is divided into at least two sub-screens according to a first split-screen mode, including: enlarging and displaying a first sub-screen (for example, sub-screen 3302 in Figure 3G) of the at least two sub-screens, and displaying other sub-screens of the at least two sub-screens in parallel on at least one side of the first sub-screen, wherein the first sub-screen is used to display the target object who is speaking.
- determining whether the target object in the target image is speaking includes: performing key point detection on the target object; determining whether the target object in the target image is speaking based on the key point detection result, thereby performing speaker detection through image processing, avoiding the problem of being unable to distinguish who is currently speaking through the audio stream.
- performing key point detection on the target object includes: performing roll angle correction on the detected plurality of key points to obtain a plurality of corrected key points; determining the corrected lip height and the corrected lip width from the corrected key points (with pitch angle correction and yaw angle correction, respectively); and obtaining the key point detection result according to the corrected lip height and the corrected lip width.
- the key points of face detection will be mapped to a standard face that has not been rotated before making a speech judgment to reduce the impact of face movement.
- the method further comprises: calculating the lip height-to-width ratio of the target object according to the corrected lip height and the corrected lip width and storing the lip height-to-width ratio;
- Determining whether the target object in the target image is speaking according to the key point detection result includes: determining change information corresponding to the lip aspect ratio of the target object in the target image in combination with the key point detection result, and determining whether the target object in the target image is speaking according to the change information.
- the timing information of key points is used to enhance the timing stability of speaker detection and reduce the fluctuation of detection status.
- determining whether the participant corresponding to the target object in the target image is speaking includes: setting a preset time period; counting the number of changes in the lip height-to-width ratio within the preset time period; and determining that the target object is speaking when the number of changes reaches a preset number. In this way, the timing information of the key points is used to enhance the stability of the speaker detection in timing and reduce the fluctuation of the detection state.
- determining that the target object is speaking includes: determining that the number of changes reaches a preset number; obtaining audio data of a video conference; and determining that the target object is speaking in combination with the audio data of the video conference.
- the accuracy of the speaker judgment can be further improved.
- for example, when the microphone for collecting audio is a dual-channel microphone, the speaker can be located based on the two sets of audio data collected by the dual channels, thereby further improving the accuracy of the speaker judgment.
- detecting the target object in the target image includes: detecting the target object in the target image using a target detection or target tracking technology to obtain a detection frame of the target object;
- Displaying the target object in the target image in the at least two sub-screens accordingly includes: determining the coordinates of the detection frame corresponding to the target object; determining the coordinates of the sub-screen corresponding to the target object; and translating and/or scaling the image corresponding to the detection frame to the sub-screen corresponding to the target object according to the coordinates of the detection frame and the coordinates of the sub-screen.
- the target object in the target image is displayed in the at least two sub-screens accordingly, further comprising: in response to determining that the virtual background function in the sub-screen is turned on, segmenting the target object from the background using a segmentation technology to obtain a segmentation result; based on the segmentation result, displaying the virtual background in the sub-screen, filling the technical gap in the related technology of not displaying the virtual background in the sub-screen.
- correspondingly displaying the target object in the target image in the at least two sub-pictures includes: in response to determining that the virtual background function of the second sub-picture of the at least two sub-pictures is turned on, displaying the virtual background in the second sub-screen (for example, the sub-screen 3304 in FIG. 3G), filling the technical gap of not displaying the virtual background in the sub-screen in the related art.
- the method further comprises: segmenting the target object and the background in the target image using a semantic segmentation technique to obtain a segmentation result;
- Displaying a virtual background in the second sub-screen includes: displaying a virtual background in the second sub-screen according to the segmentation result.
- semantic segmentation technology is used to segment the target object from the actual background, thereby effectively replacing the virtual background.
- the target object in the target image is displayed in the at least two sub-screens accordingly, further comprising: in response to determining that the target object is speaking, an indicator icon is displayed in the sub-screen corresponding to the target object, thereby reminding others that the participant in the sub-screen corresponding to the icon is speaking, thereby improving interactivity.
- the method of the embodiment of the present disclosure can be performed by a single device, such as a computer or a server.
- the method of the present embodiment can also be applied in a distributed scenario and completed by multiple devices cooperating with each other.
- one of the multiple devices may perform only one or more steps in the method of the embodiment of the present disclosure, and the multiple devices will interact with each other to complete the described method.
- FIG. 5 shows a schematic diagram of the hardware structure of an exemplary computer device 500 provided in the embodiment of the present disclosure.
- the computer device 500 can be used to implement the server 106 of FIG. 1A, and can also be used to implement the terminal devices 102 and 104 of FIG. 1A. In some scenarios, the computer device 500 can also be used to implement the database server 108 of FIG. 1A.
- computer device 500 may include: processor 502, memory 504, network interface 506, peripheral interface 508 and bus 510.
- processor 502, memory 504, network interface 506 and peripheral interface 508 are connected to each other in communication within computer device 500 via bus 510.
- Processor 502 may be a central processing unit (CPU), an image processor, a neural network processor (NPU), a microcontroller (MCU), a programmable logic device, a digital signal processor (DSP), an application specific integrated circuit (ASIC), or one or more integrated circuits.
- processor 502 may be used to perform functions related to the technology described in the present disclosure.
- processor 502 may also include multiple processors integrated into a single logical component. For example, as shown in FIG. 5 , processor 502 may include multiple processors 502a, 502b, and 502c.
- the memory 504 may be configured to store data (e.g., instructions, computer codes, etc.). As shown in FIG. 5, the data stored in the memory 504 may include program instructions (e.g., program instructions for implementing the method 200 or 400 of the embodiment of the present disclosure) and data to be processed (e.g., the memory may store configuration files of other modules, etc.). The processor 502 may also access program instructions and data stored in memory 504, and execute the program instructions to operate on the data to be processed.
- Memory 504 may include a volatile storage device or a non-volatile storage device. In some embodiments, memory 504 may include a random access memory (RAM), a read-only memory (ROM), an optical disk, a magnetic disk, a hard disk, a solid-state drive (SSD), a flash memory, a memory stick, etc.
- the network interface 506 can be configured to provide the computer device 500 with communication with other external devices via a network.
- the network can be any wired or wireless network capable of transmitting and receiving data.
- the network can be a wired network, a local wireless network (e.g., Bluetooth, WiFi, near field communication (NFC)), a cellular network, the Internet, or a combination thereof. It is understood that the type of network is not limited to the above specific examples.
- the peripheral interface 508 can be configured to connect the computer device 500 to one or more peripheral devices to achieve information input and output.
- the peripheral devices can include input devices such as a keyboard, a mouse, a touch pad, a touch screen, a microphone, and various sensors, and output devices such as a display, a speaker, a vibrator, and an indicator light.
- the bus 510 may be configured to transmit information between various components of the computer device 500 (e.g., the processor 502, the memory 504, the network interface 506, and the peripheral interface 508), such as an internal bus (e.g., a processor-memory bus), an external bus (USB port, PCI-E bus), etc.
- although the architecture of the above-mentioned computer device 500 only shows the processor 502, the memory 504, the network interface 506, the peripheral interface 508 and the bus 510, in the specific implementation process, the architecture of the computer device 500 may also include other components necessary for normal operation.
- the architecture of the above-mentioned computer device 500 may also only include the components necessary for implementing the embodiments of the present disclosure, and does not necessarily include all the components shown in the figure.
- FIG. 6 shows a schematic diagram of an exemplary device 600 provided by the embodiment of the present disclosure. As shown in Figure 6, the device 600 can be used to implement the method 200 or 400, and can further include the following modules.
- the acquisition module 602 is configured to: acquire the target image acquired by the acquisition unit.
- the acquisition unit may be a camera provided in the terminal devices 102 and 104
- the target image may be images 1022 and 1042 acquired by the camera.
- the detection module 604 is configured to detect a target object (eg, target objects 302A to 302C in FIG. 3A ) in the target image (eg, image 300 in FIG. 3A ).
- an object detection technology may be used to detect the target object in the target image 300.
- a pre-trained object detection model may be used to detect the target objects 302A-302C in the target image 300, and detection frames 304A-304C corresponding to the target objects 302A-302C may be obtained, as shown in FIG. 3B.
- the division module 606 is configured to divide the video conference screen into at least two sub-screens according to the target object.
- the display module 608 is configured to display the target object in the target image in the at least two sub-pictures accordingly.
- the present disclosure provides a method for splitting a video conference screen, which detects a target object in a target image collected by a collection unit, then divides the video conference screen into at least two sub-screens according to the target object, and displays the corresponding target object in the sub-screens accordingly, so that when the captured target image contains multiple participants, the video conference screen can be automatically split, which helps to enhance the interaction between the participants in the conference room and other online participants during the video conference, thereby improving the user experience.
- the division module 606 is configured to: determine whether the target object in the target image is speaking; in response to determining that the target object in the target image is speaking, divide the video conference screen into at least two sub-screens according to the first split-screen mode, as shown in Figure 3G.
- that is, when someone is speaking, the spokesperson mode (first split-screen mode) is adopted; when no one is speaking, the normal split-screen mode (second split-screen mode) is adopted, and each sub-screen is the same size.
- the dividing module 606 is configured to: enlarge and display a first sub-picture (e.g., sub-picture 3302 in FIG. 3G ) of the at least two sub-pictures, and display other sub-pictures of the at least two sub-pictures in parallel on at least one side of the first sub-picture, wherein the first sub-picture is used to display the target object who is speaking;
- the display module 608 is configured to display the target object corresponding to the participant who is speaking in the first sub-screen.
- the sub-screen of the participant who is speaking is arranged in the middle of the screen and occupies a larger screen, and the sub-screen of the non-speaker is arranged on the side and occupies a smaller screen, thereby better improving interactivity.
- the detection module 604 is configured to: detect key points of the target object; determine the lip height based on the key points of the target object; determine the lip width based on the key points of the target object; obtain the lip height-to-width ratio of the target object according to the lip height and the lip width; determine whether the target object is speaking based on the change information of the lip height-to-width ratio, so that speaker detection can be performed through image processing, which can avoid the problem of not being able to distinguish who is currently speaking through the audio stream.
- the detection module 604 is configured to: perform roll angle correction on the detected plurality of key points to obtain a plurality of corrected key points; determine the corrected lip height and the corrected lip width from the corrected key points (with pitch angle correction and yaw angle correction, respectively); and obtain the key point detection result according to the corrected lip height and the corrected lip width.
- the detection module 604 is configured to: calculate the lip height-to-width ratio of the target object according to the corrected lip height and the corrected lip width and store the lip height-to-width ratio;
- change information corresponding to the lip aspect ratio of the target object in the target image is determined, and whether the participant corresponding to the target object in the target image is speaking is determined based on the change information.
- the timing information of key points is used to enhance the timing stability of speaker detection and reduce the fluctuation of detection status.
- the detection module 604 is configured to: set a preset time period; count the number of changes in the lip height-to-width ratio within the preset time period; and determine that the target object is speaking when the number of changes reaches a preset number.
- the timing information of key points is used to enhance the timing stability of speaker detection and reduce the fluctuation of detection status.
- the detection module 604 is configured to: determine that the number of changes reaches the preset number; obtain audio data of the video conference; and determine, in combination with the audio data of the video conference, that the target object is speaking. In this way, combining the key point detection result with the audio data of the current video conference can further improve the accuracy of speaker determination.
- when the microphone for collecting audio is a dual-channel microphone, the speaker can be located based on the two sets of audio data collected by the two channels, thereby further improving the accuracy of speaker determination.
- the detection module 604 is configured to: detect the target object in the target image using target detection or target tracking technology to obtain a detection frame corresponding to the target object;
- the display module 608 is configured to: determine the coordinates of the detection frame corresponding to the target object; determine the coordinates of the sub-screen corresponding to the target object; and translate and/or scale the image corresponding to the detection frame to the sub-screen corresponding to the target object according to the coordinates of the detection frame and the coordinates of the sub-screen.
- the display module 608 is configured to: in response to determining that the virtual background function in the sub-screen is turned on, use segmentation technology to segment the target object from the background to obtain a segmentation result; and, based on the segmentation result, display the virtual background in the sub-screen, filling the gap in the related art in which no virtual background is displayed in sub-screens.
- the display module 608 is configured to: in response to determining that the virtual background function of the second sub-screen among the at least two sub-screens is turned on, display the virtual background in the second sub-screen (for example, the sub-screen 3304 of FIG. 3G), filling the gap in the related art in which no virtual background is displayed in sub-screens.
- the display module 608 is configured to: segment the target object in the target image from the background using semantic segmentation technology to obtain a segmentation result; and display the virtual background in the second sub-screen according to the segmentation result. In this way, semantic segmentation technology separates the target object from the actual background, so that the virtual background replacement is achieved well.
- the display module 608 is configured to: in response to determining that the target object is speaking, display an indicator mark in the sub-screen corresponding to the target object, so as to remind others that the participant in that sub-screen is speaking, thereby improving interactivity.
- for convenience of description, the above apparatus is described by dividing it into various modules according to function.
- when implementing the present disclosure, the functions of the modules may be implemented in the same piece or in multiple pieces of software and/or hardware.
- the device of the above embodiment is used to implement the corresponding method 400 in any of the above embodiments, and has the beneficial effects of the corresponding method embodiment, which will not be described in detail here.
- the present disclosure also provides a non-transitory computer-readable storage medium, wherein the non-transitory computer-readable storage medium stores computer instructions, and the computer instructions are used to enable the computer to execute method 400 described in any of the above embodiments.
- the computer-readable media of this embodiment include permanent and non-permanent, removable and non-removable media, in which information storage can be implemented by any method or technology.
- Information can be computer-readable instructions, data structures, modules of programs, or other data.
- Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
- the computer instructions stored in the storage medium of the above embodiments are used to enable the computer to execute method 200 or 400 as described in any of the above embodiments, and have the beneficial effects of the corresponding method embodiments, which will not be repeated here.
- the present disclosure further provides a computer program product, which includes a computer program.
- the computer program is executable by one or more processors so that the processors execute the method 200 or 400.
- corresponding to the execution subject of each step in the embodiments of the method 200 or 400, the processor that executes a given step may belong to the corresponding execution subject.
- the computer program product of the above embodiment is used to enable the processor to execute the method 400 described in any of the above embodiments, and has the beneficial effects of the corresponding method embodiment, which will not be described in detail here.
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
Abstract
The present disclosure provides a split-screen method for a video conference screen and a related device. The method includes: acquiring a target image captured by a capture unit; detecting a target object in the target image; dividing the video conference screen into at least two sub-screens according to the target object; and correspondingly displaying the target object in the target image in the at least two sub-screens.
Description
Cross-Reference to Related Applications
This application claims priority to Chinese invention patent application No. 202310611376.5, entitled "Split-screen method for video conference screen and related device" and filed on May 26, 2023, which is incorporated herein by reference in its entirety.
The present disclosure relates to the field of computer technology, and in particular to a split-screen method for a video conference screen and a related device.
In a video conference, face-to-face communication and interaction are crucial. However, when multiple people sit in a meeting room for a video conference, most meeting rooms have only one camera and therefore only one captured video conference picture, making it difficult to distinguish the participants and their number from the video conference picture, and in particular impossible to identify the current speaker.
SUMMARY
The present disclosure proposes a split-screen method for a video conference screen and a related device, so as to solve or partially solve the above problem.
In a first aspect, the present disclosure provides a split-screen method for a video conference screen, including:
acquiring a target image captured by a capture unit;
detecting a target object in the target image;
dividing the video conference screen into at least two sub-screens according to the target object;
correspondingly displaying the target object in the target image in the at least two sub-screens.
In a second aspect, the present disclosure provides a split-screen apparatus for a video conference screen, including:
an acquisition module configured to acquire a target image captured by a capture unit;
a detection module configured to detect a target object in the target image;
a division module configured to divide the video conference screen into at least two sub-screens according to the target object;
a display module configured to correspondingly display the target object in the target image in the at least two sub-screens.
In a third aspect, the present disclosure provides a computer device, including one or more processors, a memory, and one or more programs, wherein the one or more programs are stored in the memory and executed by the one or more processors, and the programs include instructions for performing the method according to the first aspect.
In a fourth aspect, the present disclosure provides a non-volatile computer-readable storage medium containing a computer program which, when executed by one or more processors, causes the processors to perform the method according to the first aspect.
In a fifth aspect, the present disclosure provides a computer program product, including computer program instructions which, when run on a computer, cause the computer to perform the method according to the first aspect.
In order to explain the technical solutions of the present disclosure or the related art more clearly, the drawings needed in the description of the embodiments or the related art are briefly introduced below. Obviously, the drawings described below are only embodiments of the present disclosure, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1A is a schematic diagram of an exemplary system provided by an embodiment of the present disclosure.
FIG. 1B is a schematic diagram of a video conference picture captured in the scenario shown in FIG. 1A.
FIG. 2 is a schematic diagram of an exemplary flow according to an embodiment of the present disclosure.
FIG. 3A is a schematic diagram of an exemplary target image according to an embodiment of the present disclosure.
FIG. 3B is a schematic diagram of displaying detection boxes in a target image according to an embodiment of the present disclosure.
FIG. 3C is a schematic diagram of an exemplary video conference picture according to an embodiment of the present disclosure.
FIG. 3D is a schematic diagram of another exemplary video conference picture according to an embodiment of the present disclosure.
FIG. 3E is a schematic diagram of a split-screen mode according to an embodiment of the present disclosure.
FIG. 3F is a schematic diagram of another split-screen mode according to an embodiment of the present disclosure.
FIG. 3G is a schematic diagram of yet another exemplary video conference picture according to an embodiment of the present disclosure.
FIG. 3H is a schematic diagram of facial key point detection.
FIG. 3I is a schematic diagram of a rotated face.
FIG. 4 is a schematic diagram of an exemplary method provided by an embodiment of the present disclosure.
FIG. 5 is a schematic diagram of the hardware structure of an exemplary computer device provided by an embodiment of the present disclosure.
FIG. 6 is a schematic diagram of an exemplary apparatus provided by an embodiment of the present disclosure.
To make the objectives, technical solutions and advantages of the present disclosure clearer, the present disclosure is further described in detail below with reference to specific embodiments and the accompanying drawings.
It should be noted that, unless otherwise defined, the technical or scientific terms used in the embodiments of the present disclosure shall have the ordinary meanings understood by persons of ordinary skill in the art to which the present disclosure belongs. Words such as "first" and "second" used in the embodiments of the present disclosure do not denote any order, quantity or importance, but are only used to distinguish different components. Words such as "comprise" or "include" mean that the element or item preceding the word covers the elements or items listed after the word and their equivalents, without excluding other elements or items. Words such as "connected" or "coupled" are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "Up", "down", "left", "right" and the like are only used to indicate relative positional relationships; when the absolute position of the described object changes, the relative positional relationship may change accordingly.
FIG. 1A is a schematic diagram of an exemplary system 100 provided by an embodiment of the present disclosure.
As shown in FIG. 1A, the system 100 may include at least one terminal device (for example, terminal devices 102 and 104), a server 106 and a database server 108. A medium providing a communication link, for example a network, may be included between the terminal devices 102 and 104 on the one hand and the server 106 and the database server 108 on the other; the network may include various connection types, such as wired or wireless communication links or optical fiber cables.
Users 110A-110C may use the terminal device 102 to interact with the server 106 over the network to receive or send messages, and likewise, user 112 may use the terminal device 104 to interact with the server 106 over the network to receive or send messages. Various applications (APPs) may be installed on the terminal devices 102 and 104, for example video conference applications, reading applications, video applications, social applications, payment applications, web browsers and instant messaging tools. In some embodiments, users 110A-110C and user 112 may respectively use the video conference applications installed on the terminal devices 102 and 104 to use the video conference service provided by the server 106. The terminal devices 102 and 104 may capture images 1022 and 1042 through cameras (for example, cameras provided on the terminal devices 102 and 104), capture live audio through microphones (for example, microphones provided on the terminal devices 102 and 104), and upload them to the server 106, so that users 110A-110C and user 112 can watch each other's picture and hear each other's voice through the video conference applications on the terminal devices 102 and 104.
The terminal devices 102 and 104 here may be hardware or software. When the terminal devices 102 and 104 are hardware, they may be various electronic devices with a display screen, including but not limited to smart phones, tablet computers, e-book readers, MP3 players, laptop computers and desktop computers (PCs). When the terminal devices 102 and 104 are software, they may be installed in the electronic devices listed above and may be implemented as multiple pieces of software or software modules (for example, to provide distributed services) or as a single piece of software or software module, which is not specifically limited here.
The server 106 may be a server providing various services, for example a backend server supporting the various applications displayed on the terminal devices 102 and 104. The database server 108 may also be a database server providing various services. It can be understood that, where the server 106 can implement the relevant functions of the database server 108, the database server 108 may be omitted from the system 100.
The server 106 and the database server 108 may likewise be hardware or software. When they are hardware, they may be implemented as a distributed server cluster composed of multiple servers or as a single server. When they are software, they may be implemented as multiple pieces of software or software modules (for example, to provide distributed services) or as a single piece of software or software module, which is not specifically limited here.
It should be noted that the split-screen method for a video conference screen provided by the embodiments of the present application is generally performed by the terminal devices 102 and 104.
It should be understood that the numbers of terminal devices, users, servers and database servers in FIG. 1A are merely illustrative; there may be any number of terminal devices, users, servers and database servers as required by the implementation.
FIG. 1B is a schematic diagram of a video conference picture 120 captured in the scenario shown in FIG. 1A.
As shown in FIG. 1B, in the scenario shown in FIG. 1A, the video conference picture 120 may display the pictures captured by the two cameras in two sub-screens 1202 and 1204 respectively. The camera on the side of the terminal device 102 captures the picture of the entire meeting room; therefore, the sub-screen 1202 correspondingly displays a picture that includes the target objects (for example, face images) corresponding to users 110A-110C.
It can be seen that, since the position of the camera installed in the meeting room is fixed, when the relative positions of users 110A-110C with respect to the camera differ, the sizes and orientations of the target objects of users 110A-110C in the sub-screen 1202 differ as well. Moreover, since the meeting room camera is usually far from the seats, it is difficult to clearly see each participant in the sub-screen 1202. Therefore, a solution that automatically splits the picture captured by a single camera is of great practical significance for improving the interaction between the meeting room and the online participants in a video conference.
In view of this, an embodiment of the present disclosure provides a split-screen method for a video conference screen, which detects target objects in a target image captured by a capture unit, divides the video conference screen into at least two sub-screens according to the target objects, and correspondingly displays the target objects in the target image in the sub-screens, so that the video conference screen can be split automatically when the captured target image contains multiple participants. This helps to enhance the sense of interaction between the participants in the meeting room and the other online participants during the video conference, thereby improving the user experience.
FIG. 2 is a schematic flowchart of an exemplary method 200 provided by an embodiment of the present disclosure. The method 200 can be used to automatically split a video conference screen. Optionally, the method 200 may be implemented by the terminal devices 102 and 104 of FIG. 1A, or by the server 106 of FIG. 1A. The following description takes the server 106 implementing the method 200 as an example.
As shown in FIG. 2, the method 200 may further include the following steps.
In step 202, the server 106 may acquire a target image captured by a capture unit. Taking FIG. 1A as an example, the capture unit may be a camera provided in the terminal device 102 or 104, and the target image may be the image 1022 or 1042 captured by the camera. After the camera captures an image, the terminal device 102 or 104 may upload the captured image to the server 106 for processing.
FIG. 3A is a schematic diagram of an exemplary target image 300 according to an embodiment of the present disclosure.
The target image 300 may be an image captured by any terminal device participating in the video conference in the system 100. As shown in FIG. 3A, the target image 300 may include target objects 302A-302C of multiple participants.
Next, in step 204, the server 106 may detect the target objects in the target image 300. As an optional embodiment, object detection technology may be used to detect the target objects in the target image 300. Optionally, a pre-trained object detection model may be used to detect the target objects 302A-302C in the target image 300 and obtain detection boxes 304A-304C corresponding to the target objects 302A-302C, as shown in FIG. 3B.
Further, considering that the positions of the participants in the meeting room change constantly, if the positions of the detection boxes are fixed once obtained, they may fail to follow the faces in time when a participant moves. Therefore, in some embodiments, object tracking technology may be used to track position changes of the target objects in real time, with the detection boxes changing accordingly, so that the target objects can still be tracked even if a participant moves during the video conference. Optionally, a pre-trained object tracking model may be used to track the target objects 302A-302C in the target image 300; the object tracking model may be a deep-learning-based real-time detection and tracking model for face boxes and facial key points, and the model structure includes but is not limited to various forms of convolutional neural networks and various forms of Transformer networks. In this way, by detecting and tracking faces in the image, each split sub-screen can follow the face on the screen in real time.
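As a rough illustration of this detection step, the sketch below uses OpenCV's stock Haar-cascade face detector as a stand-in for the pre-trained detection/tracking model described above; the function name and parameter values are illustrative assumptions, not part of the disclosure.

```python
import cv2

def detect_faces(frame_bgr):
    """Detect face bounding boxes in one frame.

    Stand-in for the pre-trained detection/tracking model described in the
    text; returns a list of (x, y, w, h) boxes in pixel coordinates.
    """
    # Haar cascade shipped with OpenCV; a CNN or Transformer detector
    # would normally be used instead for better accuracy and tracking.
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    boxes = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return [tuple(map(int, b)) for b in boxes]
```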
In some embodiments, after the target objects in the target image 300 are detected, the flow may proceed to step 206, in which the server 106 may determine a split-screen layout directly according to the target objects and then divide the video conference screen based on the split-screen layout.
FIG. 3C is a schematic diagram of an exemplary video conference picture 310 according to an embodiment of the present disclosure. As shown in FIG. 3C, the video conference picture 310 is divided into multiple sub-screens. Specifically, when dividing the sub-screens, in some embodiments, considering that a video conference scenario requires multi-party interaction, the video conference picture may be divided according to the total number of target objects in the target images captured by all the terminal devices currently participating in the video conference. For example, taking the scenario shown in FIG. 1A, the number of sub-screens may be determined from the total number of target objects in the images captured by the terminal devices 102 and 104, which is 4 in this example. After the number of sub-screens is determined, the split-screen layout can be determined. Many split-screen layouts are possible. For example, in order to maintain the basic split-screen layout of existing video conferences, the video conference picture 310 may first be divided into n sub-screens according to the number n of terminal devices, with each terminal device corresponding to one of the n sub-screens. Next, the sub-screen corresponding to each terminal device is further divided according to the number of target objects in the picture captured by that terminal device, thereby obtaining the split-screen layout. As shown in FIG. 3C, the left and right sub-screens respectively correspond to the images captured by the two terminal devices 102 and 104; the left picture further includes three sub-screens corresponding to the target objects 302A-302C, and the right sub-screen displays the picture captured by the terminal device 104 and may include the target object 312. This constitutes a complete video conference picture 310. Moreover, since the picture 310 divides the sub-screens corresponding to different terminal devices (for example, into equal sizes), a user can learn from the split-screen layout how many terminals are actually participating in the video conference.
It can be understood that, in addition to the split-screen layout of the foregoing embodiment, other split-screen layouts are possible, for example dividing the sub-screens directly according to the total number of target objects. FIG. 3D is a schematic diagram of another exemplary video conference picture 320 according to an embodiment of the present disclosure. As shown in FIG. 3D, the video conference picture 320 is divided into four sub-screens of equal size according to the total number of target objects, corresponding to the target objects 302A-302C and the target object 312 respectively. In this way, every participant, whether one of the multiple participants in the meeting room or a participant attending alone, occupies a sub-screen of the same size as everyone else, so that each participant can interact clearly with every other participant.
In some embodiments, when performing the split-screen layout, the layout manner may also be chosen differently depending on the number of target objects, for example dividing the picture into an equal-sided sub-screen array (for example, N×N sub-screens) or an unequal-sided sub-screen array (for example, N×M sub-screens) according to the number. For example, as shown in FIG. 3D, when the number is 4, the picture may be divided into 2×2 sub-screens. As another example, when the number is 3, the picture 320 may be divided into two rows, with one sub-screen in the first row and two sub-screens in the second row, as shown in the left picture of FIG. 3C.
It can be understood that, when there are more target objects, the foregoing manner can also be used for the split-screen layout. For example, when the number of target objects is 7, 3×3 sub-screens may be used to correspond to the target objects, with two sub-screens left blank, as shown in FIG. 3E.
In some embodiments, considering that the aspect ratio of a typical display is not square but a rectangle whose length is greater than its width (for example, 16:9), the number of sub-screens in the length direction may be increased when performing the split-screen layout. For example, assuming the video conference picture supports at most 12 people: when N<5, the picture may be divided into 1 row of N columns; when N≤8, the picture may be divided into 2 rows, with N/2 columns in the first row (rounded up or down) and N-N/2 columns in the second row (rounded down or up); when N≤12, the picture may be divided into 3 rows, with N/3 columns (rounded up or down) in each of the first two rows and N-N/3×2 columns (rounded down or up) in the last row. The width of each sub-screen in the last row is the same as that of the sub-screens in the preceding rows, and the last row may be arranged centrally. FIG. 3F is a schematic diagram of yet another exemplary video conference picture according to an embodiment of the present disclosure. As shown in FIG. 3F, when the number of target objects is 7, the above manner produces a 4+3 split-screen layout.
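The following is a minimal sketch of one possible reading of this row/column rule (the text leaves the rounding direction open; rounding up is assumed here).

```python
import math

def split_layout(n):
    """Return the number of sub-screens per row for n participants.

    One possible reading of the layout rule in the description,
    supporting 1 to 12 sub-screens.
    """
    if not 1 <= n <= 12:
        raise ValueError("this layout rule supports 1 to 12 sub-screens")
    if n < 5:
        return [n]                              # one row of n columns
    if n <= 8:
        first = math.ceil(n / 2)
        return [first, n - first]               # two rows
    per_row = math.ceil(n / 3)
    return [per_row, per_row, n - 2 * per_row]  # three rows, last row centred

# e.g. split_layout(7) -> [4, 3], matching the 4+3 layout of FIG. 3F
```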
It can thus be seen that, by detecting the target objects in the target image, dividing the video conference screen into at least two sub-screens according to the number of target objects, and correspondingly displaying the target objects in the sub-screens, the video conference screen can be split automatically when the captured target image contains multiple participants. This helps to enhance the sense of interaction between the participants in the meeting room and the other online participants during the video conference, thereby improving the user experience.
In some embodiments, when performing the split-screen layout, in addition to the number of target objects, whether a participant is speaking may also be detected, and the sub-screen corresponding to the detected speaking participant may then be processed accordingly. Therefore, as shown in FIG. 2, the method 200 further includes step 208, performing speaker detection on the target objects. Optionally, this step may be processed in parallel with the target object detection, thereby increasing the processing speed.
Optionally, when it is determined that a target object in the target image is speaking (that is, the participant corresponding to the target object is speaking), an indicator mark may be displayed in the sub-screen corresponding to the speaking participant, for example an icon indicating that the participant is speaking. As shown in FIG. 3D, a microphone-style icon may be displayed in the sub-screen corresponding to the speaking participant, reminding the others that this participant is speaking.
As an optional embodiment, the split-screen layout may also be changed according to whether a speaking participant is detected.
Optionally, if it is determined from the detection result that a target object in the target image is speaking, the video conference picture may be divided into at least two sub-screens according to a first split-screen mode; if it is determined from the detection result that no participant in the video conference is speaking, the video conference picture is divided into at least two sub-screens according to a second split-screen mode. The second split-screen mode may be any of the split-screen layouts in the foregoing embodiments.
Further, dividing the video conference picture into at least two sub-screens according to the first split-screen mode includes: enlarging a first sub-screen of the at least two sub-screens, and displaying the other sub-screens of the at least two sub-screens side by side on at least one side of the first sub-screen, the first sub-screen being used to display the target object who is speaking.
FIG. 3G is a schematic diagram of yet another exemplary video conference picture 330 according to an embodiment of the present disclosure. As shown in FIG. 3G, the picture 330 includes four sub-screens corresponding to the target objects 302A-302C and the target object 312, where the first sub-screen 3302 is enlarged and corresponds to the target object 302A of the speaking participant 110A, and the other sub-screens are displayed side by side on one side of the first sub-screen 3302. In this way, according to the speaker detection result, the sub-screen of the speaking participant is arranged in the middle of the picture and occupies a larger area, while the sub-screens of the non-speakers are arranged on the side and occupy smaller areas, thereby better improving interactivity.
In this way, when someone is speaking, a speaker mode (the first split-screen mode) is used, in which the person currently speaking is placed in the largest sub-screen and the remaining participants are arranged side by side on at least one side of the largest sub-screen (they can be placed on two or more sides when there are many of them). When nobody is speaking, a normal split-screen mode (the second split-screen mode) is used, in which each sub-screen has the same size. Selecting different split-screen modes according to the speaker detection result can increase the interactivity of the video conference and improve the user experience.
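A minimal sketch of one possible speaker-mode layout computation follows; the 25% side-strip width and the single right-hand column are assumptions, since the description only requires the speaker's sub-screen to be larger and the others to be arranged beside it.

```python
def speaker_mode_layout(n, screen_w, screen_h, strip_ratio=0.25):
    """Compute sub-screen rectangles (x, y, w, h) for speaker mode: one
    enlarged sub-screen for the current speaker plus a column of equally
    sized sub-screens on the right side."""
    strip_w = int(screen_w * strip_ratio)
    main = (0, 0, screen_w - strip_w, screen_h)   # enlarged speaker sub-screen
    others = []
    if n > 1:
        cell_h = screen_h // (n - 1)
        for i in range(n - 1):                     # non-speakers, stacked vertically
            others.append((screen_w - strip_w, i * cell_h, strip_w, cell_h))
    return main, others

# e.g. speaker_mode_layout(4, 1920, 1080) places the speaker in a
# 1440x1080 area and three 480x360 cells on the right.
```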
Generally, the video stream and audio stream of an online participant are independent. In the related art, video conference software can confirm from the audio stream whether someone in the corresponding video is speaking; however, the people attending in a meeting room share one video stream and one audio stream, so the audio stream captured by the terminal device on the meeting room side alone cannot determine whether any of them is speaking, nor distinguish who exactly is currently speaking, which reduces the interactivity of the conference.
In view of this, in some embodiments, speaker detection is performed by means of image processing, which avoids the problem that the audio stream cannot distinguish who exactly is currently speaking.
As an optional embodiment, the server 106 may perform key point detection on each detected target object, and then determine, according to the key point detection result, whether the participant corresponding to the target object in the target image is speaking.
FIG. 3H is a schematic diagram of facial key point detection.
As shown in FIG. 3H, facial key point detection may use a 68-key-point method, with the key points distributed over the parts of the face: points 0-16 correspond to the chin, points 16-21 to the right eyebrow (this is a mirrored image, i.e., the right eyebrow of the person in the figure), points 22-26 to the left eyebrow, points 27-35 to the nose, points 36-41 to the right eye, points 42-47 to the left eye, and points 48-67 to the lips. The face can be recognized by detecting the key points, and whether the corresponding participant is speaking can be determined from the changes of the key points over consecutive frames of the target object.
It should be noted that the 68-point method is only an example; it can be understood that facial key point detection may also use other numbers of key points, for example 21 key points, 29 key points, and so on.
As an optional embodiment, 106 key points may be used for key point detection, so that a more accurate detection result can be obtained.
In some embodiments, after the key points of the target object are detected, the lip height may be determined based on the key points of the target object, and the lip width may be determined based on the key points of the target object; the lip height-to-width ratio of the target object is then obtained from the lip height and the lip width, and whether the target object is speaking is determined from the change information of the lip height-to-width ratio. In this way, speaker detection is performed by means of image processing, avoiding the problem that the audio stream cannot distinguish who exactly is currently speaking.
Further, considering that rotation of the face may change the positions of the lip key points, in some embodiments the detected key points may be corrected based on the rotation angles of the face when performing key point detection.
FIG. 3I is a schematic diagram of a rotated face.
As shown in FIG. 3I, a face has three rotation angles in three-dimensional space: yaw, roll and pitch.
As an optional embodiment, the effect of the roll rotation can be cancelled by an affine transformation, and the effects of the pitch and yaw rotations can then be cancelled using the pitch and yaw information from face detection.
Specifically, multiple key points (for example, the coordinates of 106 key points) can be obtained by detection from the target object. These key points of the target object are then matched to the key points of a standard (average) face (that is, the standard key points), yielding an affine transformation matrix (a mapping relationship). By applying this affine transformation matrix to the multiple key points detected from the target object, roll correction is performed on the multiple key points to obtain multiple corrected key points, i.e., the key point coordinates of the currently detected target object at Roll = 0.
Further, multiple first key points corresponding to the lip height and multiple second key points corresponding to the lip width may be selected from the corrected key points; pitch correction is performed on the multiple first key points to obtain the corrected lip height, and yaw correction is performed on the multiple second key points to obtain the corrected lip width, so that the corrected lip height and corrected lip width cancel the effects of the pitch and yaw rotations. As an optional embodiment, taking 106 key points as an example, the length of the line segment from point 98 to point 102 may be computed to represent the lip height and divided by cos(Pitch) to cancel the effect of pitch, giving the corrected lip height. Similarly, the length of the line segment from point 96 to point 100 is computed to represent the lip width and divided by cos(Yaw) to cancel the effect of yaw, giving the corrected lip width. The pitch and yaw angle information may be provided by the face detection module.
Then, the key point detection result (the detection result for the lip height and the lip width) is obtained from the corrected lip height and the corrected lip width, so that the height-to-width ratio of the straightened lips represents how wide the mouth is open.
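The sketch below illustrates one way to compute the corrected lip height and width from 106 detected key points. The point indices 96, 98, 100 and 102 follow the example above; the use of an OpenCV similarity estimate as the roll-correcting affine matrix and the array shapes are assumptions made only for illustration.

```python
import cv2
import numpy as np

def corrected_lip_metrics(landmarks, std_landmarks, pitch_deg, yaw_deg):
    """Compute roll/pitch/yaw-corrected lip height and width.

    `landmarks` and `std_landmarks` are (106, 2) arrays of detected and
    standard-face key points; pitch/yaw angles are assumed to come from
    the face detection module."""
    # Roll correction: similarity transform mapping the detected points
    # onto the standard (non-rotated) face, applied to the detected points.
    m, _ = cv2.estimateAffinePartial2D(
        landmarks.astype(np.float32), std_landmarks.astype(np.float32))
    if m is None:
        raise ValueError("could not estimate the roll-correcting transform")
    ones = np.ones((len(landmarks), 1), dtype=np.float32)
    corrected = np.hstack([landmarks.astype(np.float32), ones]) @ m.T

    lip_height = np.linalg.norm(corrected[98] - corrected[102])
    lip_width = np.linalg.norm(corrected[96] - corrected[100])

    # Divide by cos(angle) to cancel the foreshortening caused by pitch/yaw.
    lip_height /= np.cos(np.radians(pitch_deg))
    lip_width /= np.cos(np.radians(yaw_deg))
    return lip_height, lip_width
```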
Considering that speaking is a dynamic process, relying solely on the lip height-to-width ratio at the current moment may not accurately determine whether the current participant is speaking. Therefore, in some embodiments, the changes of the lip height-to-width ratio over a period of time may be maintained, and the variance of the ratio over this period is used to determine whether the subject is currently speaking.
As an optional embodiment, the lip height-to-width ratio of the target object may be calculated from the corrected lip height and the corrected lip width and then stored.
Then, when determining from the key point detection result whether the participant corresponding to the target object in the target image is speaking, the change information corresponding to the lip height-to-width ratio of the target object in the target image may be determined in combination with the key point detection result, and whether the participant corresponding to the target object in the target image is speaking is determined from this change information. For example, whether the participant is speaking may be determined by whether the variance of the lip height-to-width ratio within a preset time period (for example, within 1 s) exceeds a variance threshold; if it does, the participant may be determined to be speaking. In this way, the determination of whether someone is speaking becomes more accurate.
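A minimal sketch of this variance-based decision over a sliding window is shown below; the window length and variance threshold are assumed values chosen only for illustration.

```python
from collections import deque
import numpy as np

class LipRatioWindow:
    """Keep recent lip height-to-width ratios and decide "speaking" from
    their variance over a short window."""

    def __init__(self, window_frames=30, var_threshold=0.003):
        self.ratios = deque(maxlen=window_frames)   # roughly 1 s at 30 fps
        self.var_threshold = var_threshold

    def update(self, lip_height, lip_width):
        self.ratios.append(lip_height / max(lip_width, 1e-6))
        if len(self.ratios) < self.ratios.maxlen:
            return False                             # not enough history yet
        return float(np.var(list(self.ratios))) > self.var_threshold
```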
Considering that a speaker may pause intermittently while speaking, in order to make the speech detection more stable, in some embodiments a counter may be maintained to record the number of times the speaker has been determined to be speaking in the most recent period, and whether the participant is speaking is then determined from the relationship between this count and a preset count threshold. Optionally, determining whether the target object is speaking based on the change information of the lip height-to-width ratio includes: setting a preset time period (for example, within 2 s); counting the number of changes in the lip height-to-width ratio within the preset time period; and determining that the target object is speaking when the number of changes reaches a preset number. In this way, the temporal information of the key points is used to enhance the temporal stability of speaker detection and reduce fluctuation of the detection state.
As an optional embodiment, in response to determining from the change information that the participant corresponding to the target object in the target image is speaking (that is, the subject is determined to be speaking in the current frame), the count value is incremented by 1; in response to determining from the change information that the participant corresponding to the target object in the target image is not speaking (that is, the subject is determined not to be speaking in the current frame), the count value is decremented by 1.
Then, whether the participant corresponding to the target object in the target image is speaking can be determined from the count value within a preset time period (for example, within 2 s). For example, when the counter value is greater than a preset count threshold (for example, 2), the effect indicating that the subject is speaking may be displayed (for example, enlarging the speaker's sub-screen and/or displaying the microphone icon); when the counter value is less than the preset count threshold, the effect indicating that the subject is speaking may be removed (for example, restoring the speaker's sub-screen to the same size as the other sub-screens and/or hiding the microphone icon).
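The counter logic described above might be sketched as follows; the threshold of 2 follows the example in the text, while the clamping range is an assumption added to keep the counter bounded.

```python
class SpeakingCounter:
    """Per-frame +1/-1 counter that smooths the speaking decision over
    roughly the most recent couple of seconds."""

    def __init__(self, threshold=2, max_count=60):
        self.count = 0
        self.threshold = threshold
        self.max_count = max_count

    def update(self, speaking_this_frame):
        self.count += 1 if speaking_this_frame else -1
        self.count = max(0, min(self.count, self.max_count))  # keep bounded
        return self.count > self.threshold   # True -> show the speaker effects
```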
In this way, the position of the lips is determined from key point detection on the face after splitting, and whether the current participant is speaking is determined from the relative positional relationships of the lip key points. At the same time, to reduce misjudgments of the speech detection caused by face movement, rotation and other actions, the key points from face detection are mapped onto the non-rotated standard face before the speech determination, reducing the impact of face movement. In addition, the temporal information of the key points is used to enhance the temporal stability of speaker detection and reduce fluctuation of the detection state.
Considering that speaker determination relying purely on image processing may be subject to error, in some embodiments, after it is determined that the number of changes reaches the preset number, the audio data of the video conference may also be acquired, and whether the target object in the target image is speaking is then determined from the key point detection result in combination with the audio data of the video conference. Combining the key point detection result and the audio data of the current video conference can further improve the accuracy of the speaker determination. As an optional embodiment, when the microphone that captures the audio is a dual-channel microphone, the speaker can be located from the two sets of audio data respectively captured by the two channels, which can further improve the accuracy of the speaker determination.
In the related art, some automatic split-screen software for video conferences determines the positions of the participants in the current picture through human body detection and performs cropping and layout according to the body positions, thereby implementing automatic split-screen in software. When a participant enters or leaves the meeting room, it is difficult to change the split screen in real time.
Therefore, in order to increase or decrease the number of split screens in real time, after the split-screen layout is determined, as shown in FIG. 2, the flow may proceed to step 210, in which the target objects are matched to the detection boxes according to the positions of the detection boxes and the split-screen layout.
As described in the foregoing embodiments, object detection or object tracking technology can be used to detect the target objects in the target image and obtain the detection boxes corresponding to the target objects, for example the detection boxes 304A-304C of FIG. 3B. Each detected target object can thus have a corresponding detection box; the position of each target object in the split-screen layout can then be determined according to the split-screen layout, and the detection box of that target object is matched to that position in the split-screen layout. In this way, by detecting the target objects in the target image 300, determining the relative positional relationships of different faces from the positions of the detection boxes, matching the split-screen positions to the face positions, and then matching the detection boxes to the split-screen layout, the number of split screens can be increased or decreased according to the detection result and the split-screen layout can be changed, so that when someone joins or leaves the meeting room, the relative positions of the original participants remain unchanged and the number of split screens can be increased or decreased in real time.
Optionally, the ROI (Region of Interest) corresponding to each participant in the split-screen layout may be determined from the coordinates of each participant's detection box in the original image (the target image), so that a one-to-one correspondence among persons, detection boxes and sub-screens is determined from the position of each person's detection box and the split-screen layout.
Then, as shown in FIG. 2, in step 212, the content of the detection boxes can be matched to the sub-screens. Optionally, the coordinates of the detection box corresponding to the target object and the coordinates of the sub-screen corresponding to the target object may first be determined; then, according to these coordinates, the image corresponding to the detection box is translated and/or scaled into the sub-screen corresponding to the target object. By scaling the target objects, all participants can clearly see the others, improving the interactivity of the conference.
As an optional embodiment, after each detection box is matched to a split-screen sub-screen, the detection box may first be expanded to both sides by a certain proportion (for example, the height expanded by 20% and the width by 40%); on this basis, the width and height of the detection box are expanded to both sides until its aspect ratio equals that of the corresponding sub-screen, and if the expanded ROI exceeds the picture boundary, it is translated back into the picture, so that the content of the detection box is matched to the sub-screen.
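One possible implementation of this box-to-sub-screen matching is sketched below; the 20%/40% padding follows the example above, and the clamping details are assumptions.

```python
def box_to_roi(box, sub_w, sub_h, img_w, img_h,
               pad_w_ratio=0.4, pad_h_ratio=0.2):
    """Expand a detection box (x1, y1, x2, y2) so that it has the same
    aspect ratio as the target sub-screen, then shift it back inside the
    frame if it crosses a border.  Returns the ROI as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    w = (x2 - x1) * (1 + pad_w_ratio)           # pad the box on both sides
    h = (y2 - y1) * (1 + pad_h_ratio)
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2

    # Grow width or height until the ROI matches the sub-screen ratio.
    target = sub_w / sub_h
    if w / h < target:
        w = h * target
    else:
        h = w / target

    # Keep the ROI no larger than the frame, then translate it inside.
    w, h = min(w, img_w), min(h, img_h)
    x1 = min(max(cx - w / 2, 0), img_w - w)
    y1 = min(max(cy - h / 2, 0), img_h - h)
    return x1, y1, x1 + w, y1 + h
```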
In some embodiments, when translating and scaling the image corresponding to the detection box, the ROI of the sub-screen corresponding to each detection box at the current moment may be computed by linear interpolation, giving a smooth movement and scaling effect from the current picture to the target face and producing a pan-tilt-zoom function, similar to that of a surveillance camera, when the split-screen effect is switched.
Specifically, assume the original coordinates of a certain sub-screen, expressed by the upper-left and lower-right vertices of the rectangle, are (x1_0, y1_0, x2_0, y2_0), the duration of the translation and scaling is T, and the coordinates of the ROI of the detection box of the target face, expressed by the upper-left and lower-right vertices of the rectangle, are (x1_T, y1_T, x2_T, y2_T). Then:
First, a time interval Δt may be determined, and the number of linear interpolation steps is determined from the translation-and-scaling duration T and the time interval Δt.
Next, the updated coordinates of the sub-screen corresponding to each interpolation step are determined from the number of linear interpolation steps, the original coordinates of the sub-screen and the coordinates of the ROI of the detection box. For each interpolation step, the updated coordinates may change by equal increments relative to the coordinates of the previous step.
Then, at equal time intervals, the sub-screen is gradually panned, tilted and zoomed according to the updated coordinates of each interpolation step, until the duration reaches T.
In this way, there is a transition time T from enabling the split-screen function to completing the translation and scaling of the sub-screen, so that visually there is a pan-tilt-zoom effect similar to that of a surveillance camera, improving the user experience and addressing the rather abrupt switching in the related art caused by direct switching or simple scaling of the cropped box.
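A minimal sketch of the linear interpolation used for this pan-tilt-zoom transition is shown below; the 30 fps step interval is an assumed value.

```python
def interpolate_roi(start, end, duration_s, dt_s=1 / 30):
    """Yield intermediate ROIs (x1, y1, x2, y2) that move linearly from
    `start` to `end` over `duration_s` seconds, one step every `dt_s`
    seconds, giving a camera-like pan/tilt/zoom transition."""
    steps = max(1, round(duration_s / dt_s))
    for i in range(1, steps + 1):
        t = i / steps                       # fraction of the transition done
        yield tuple(s + (e - s) * t for s, e in zip(start, end))

# e.g. for roi in interpolate_roi((0, 0, 1920, 1080), (400, 200, 1000, 650), 0.5):
#         render_sub_screen(roi)           # hypothetical rendering call
```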
In some embodiments, the clarity of a sub-screen processed in the foregoing manner may be affected; therefore, super-resolution technology may be used to increase the resolution of the sub-screen and improve its clarity.
In the related art, a virtual background function for sub-screens is usually not supported. Embodiments of the present disclosure provide a virtual background function to fill this gap.
Therefore, as shown in FIG. 2, in step 214, whether the virtual background function is enabled may first be determined. If the virtual background function is enabled, the flow proceeds to step 216, in which semantic segmentation based on the target objects is performed on the current image (this can be processed in parallel with face detection to improve efficiency), so that the virtual background function of each split screen is realized using the semantic segmentation capability. Optionally, a pre-trained semantic segmentation model may be used to separate the target objects from the background image. The semantic segmentation model may be a deep-learning-based real-time portrait semantic segmentation model, and the model structure includes but is not limited to various forms of convolutional neural networks and various forms of Transformer networks.
In some embodiments, the portrait segmentation function may be applied to the entire current input image. After the split-screen function is enabled, the segmentation result of each sub-screen corresponds to the full-image segmentation result within the ROI of that sub-screen. Then, for each pixel of each sub-screen, the pixel value of the virtual background result at that pixel is computed from the value of the segmentation result at that pixel (normalized to [0, 1]), the value of the input image at that pixel, and the pixel value of the new background to be substituted at that pixel.
Specifically, for each pixel, the first value of the segmentation result at that pixel (normalized to [0, 1]) may be multiplied by the corresponding value of the input image at that pixel, and the second value of the segmentation result at that pixel (1 minus the first value) multiplied by the pixel value of the new background to be substituted at that pixel may be added, giving the pixel value of the virtual background result at that pixel.
In this way, the processing of replacing the real background with the virtual background is completed.
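The per-pixel blending described above corresponds to standard alpha compositing, sketched below; the array layout (H×W×3 frames and an H×W mask normalized to [0, 1]) is an assumption made for illustration.

```python
import numpy as np

def composite_virtual_background(frame, alpha, background):
    """Blend the input frame with a replacement background using the
    per-pixel segmentation result `alpha` (1 = person, 0 = background)."""
    if alpha.ndim == 2:
        alpha = alpha[..., None]            # broadcast over colour channels
    out = (alpha * frame.astype(np.float32)
           + (1 - alpha) * background.astype(np.float32))
    return out.astype(np.uint8)
```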
Next, the flow may proceed to step 218, in which the video conference picture is rendered.
Optionally, each sub-screen may be rendered with the content of the ROI region of the original image corresponding to that sub-screen. If the virtual background function is enabled, background replacement is performed by combining the portrait segmentation result with the background to be substituted, and any special processing of the detected speaker is also completed in this step. For example, the target object corresponding to the speaking participant is displayed in the first sub-screen 3302, and the virtual background is displayed in the second sub-screen 3304.
The processing of the current frame then ends, the flow moves on, and processing of the next frame can begin.
It can be seen from the above embodiments that the embodiments of the present disclosure provide an automatic split-screen system for video conferences; applying the automatic video split-screen function allows people sitting in the same meeting room to communicate "face to face" with remotely attending colleagues. In some embodiments, speaker detection can conveniently indicate who in the current meeting room is speaking and place that speaker's video stream in a prominent position, improving the video conference experience. In some scenarios, for example outdoor live streaming, the subject being filmed sometimes occupies a very small portion of the picture because of the venue or distance; the PTZ function implemented in software can achieve automatic tracking and focusing of the lens without any manual operation.
It should be noted that the foregoing embodiments are described with the server 106 as the execution subject; in fact, the foregoing processing steps need not be limited to a particular execution subject. For example, the terminal devices 102 and 104 can also implement these processing steps, so the terminal devices 102 and 104 may also serve as the execution subject of the foregoing embodiments.
An embodiment of the present disclosure further provides a split-screen method for a video conference screen. FIG. 4 is a schematic flowchart of an exemplary method 400 provided by an embodiment of the present disclosure. The method 400 may be applied to the server 106 of FIG. 1A or to the terminal devices 102 and 104 of FIG. 1A. As shown in FIG. 4, the method 400 may further include the following steps.
In step 402, a target image captured by a capture unit is acquired.
Taking FIG. 1A as an example, the capture unit may be a camera provided in the terminal device 102 or 104, and the target image may be the image 1022 or 1042 captured by the camera.
In step 404, a target object (for example, the target objects 302A-302C of FIG. 3A) in the target image (for example, the image 300 of FIG. 3A) is detected.
As an optional embodiment, object detection technology may be used to detect the target objects in the target image 300. Optionally, a pre-trained object detection model may be used to detect the target objects 302A-302C in the target image 300 and obtain the detection boxes 304A-304C corresponding to the target objects 302A-302C, as shown in FIG. 3B.
In step 406, the video conference screen is divided into at least two sub-screens according to the target object.
In step 408, the target object in the target image is correspondingly displayed in the at least two sub-screens, as shown in FIG. 3C to FIG. 3G.
The split-screen method for a video conference screen provided by the embodiments of the present disclosure detects the target objects in the target image captured by the capture unit, divides the video conference screen into at least two sub-screens according to the target objects, and correspondingly displays the target objects in the sub-screens, so that the video conference screen can be split automatically when the captured target image contains multiple participants. This helps to enhance the sense of interaction between the participants in the meeting room and the other online participants during the video conference, thereby improving the user experience.
In some embodiments, dividing the video conference screen into at least two sub-screens according to the number of target objects includes: determining whether a target object in the target image is speaking; and in response to determining that the target object in the target image is speaking, dividing the video conference screen into at least two sub-screens according to the first split-screen mode, as shown in FIG. 3G. In this way, when someone is speaking, the speaker mode (the first split-screen mode) is used, in which the person currently speaking is placed in the largest sub-screen and the remaining participants are arranged side by side on at least one side of the largest sub-screen (they can be placed on two or more sides when there are many of them); when nobody is speaking, the normal split-screen mode (the second split-screen mode) is used, in which each sub-screen has the same size. Selecting different split-screen modes according to the speaker detection result can increase the interactivity of the video conference and improve the user experience.
In some embodiments, dividing the video conference screen into at least two sub-screens according to the first split-screen mode includes: enlarging a first sub-screen (for example, the sub-screen 3302 of FIG. 3G) of the at least two sub-screens, and displaying the other sub-screens of the at least two sub-screens side by side on at least one side of the first sub-screen, the first sub-screen being used to display the target object who is speaking.
In this way, according to the speaker detection result, the sub-screen of the speaking participant is arranged in the middle of the picture and occupies a larger area, while the sub-screens of the non-speakers are arranged on the side and occupy smaller areas, thereby better improving interactivity.
In some embodiments, determining whether the target object in the target image is speaking includes: detecting key points of the target object; determining the lip height based on the key points of the target object; determining the lip width based on the key points of the target object; obtaining the lip height-to-width ratio of the target object from the lip height and the lip width; and determining whether the target object is speaking based on the change information of the lip height-to-width ratio. Speaker detection can thus be performed by means of image processing, avoiding the problem that the audio stream cannot distinguish who exactly is currently speaking.
In some embodiments, determining whether the target object in the target image is speaking includes: performing key point detection on the target object; and determining, according to the key point detection result, whether the target object in the target image is speaking. Speaker detection can thus be performed by means of image processing, avoiding the problem that the audio stream cannot distinguish who exactly is currently speaking.
In some embodiments, performing key point detection on the target object includes:
obtaining multiple key points by detection from the target object;
performing roll correction on the multiple key points according to the correspondence between the key points and the standard key points, to obtain multiple corrected key points;
selecting, from the corrected key points, multiple first key points corresponding to the lip height, and performing pitch correction on the multiple first key points to obtain the corrected lip height;
selecting, from the corrected key points, multiple second key points corresponding to the lip width, and performing yaw correction on the multiple second key points to obtain the corrected lip width;
obtaining the key point detection result from the corrected lip height and the corrected lip width.
To reduce misjudgments of the speech detection caused by face movement, rotation and other actions, the key points from face detection are mapped onto the non-rotated standard face before the speech determination, reducing the impact of face movement.
In some embodiments, the method further includes: calculating the lip height-to-width ratio of the target object from the corrected lip height and the corrected lip width and storing the lip height-to-width ratio;
and determining whether the target object in the target image is speaking according to the key point detection result includes: determining, in combination with the key point detection result, the change information corresponding to the lip height-to-width ratio of the target object in the target image, and determining, from the change information, whether the target object in the target image is speaking.
In this way, the temporal information of the key points is used to enhance the temporal stability of speaker detection and reduce fluctuation of the detection state.
In some embodiments, determining whether the participant corresponding to the target object in the target image is speaking includes: setting a preset time period; counting the number of changes in the lip height-to-width ratio within the preset time period; and determining that the target object is speaking when the number of changes reaches a preset number. In this way, the temporal information of the key points is used to enhance the temporal stability of speaker detection and reduce fluctuation of the detection state.
In some embodiments, determining that the target object is speaking when the number of changes reaches the preset number includes: determining that the number of changes reaches the preset number; acquiring audio data of the video conference; and determining, in combination with the audio data of the video conference, that the target object is speaking. Combining the key point detection result and the audio data of the current video conference can further improve the accuracy of the speaker determination. As an optional embodiment, when the microphone that captures the audio is a dual-channel microphone, the speaker can be located from the two sets of audio data respectively captured by the two channels, which can further improve the accuracy of the speaker determination.
In some embodiments, detecting the target object in the target image includes: detecting the target object in the target image using object detection or object tracking technology to obtain the detection box of the target object;
and correspondingly displaying the target object in the target image in the at least two sub-screens includes: determining the coordinates of the detection box corresponding to the target object; determining the coordinates of the sub-screen corresponding to the target object; and translating and/or scaling the image corresponding to the detection box into the sub-screen corresponding to the target object according to the coordinates of the detection box and the coordinates of the sub-screen.
By scaling the target objects, all participants can clearly see the others, improving the interactivity of the conference.
In some embodiments, correspondingly displaying the target object in the target image in the at least two sub-screens further includes: in response to determining that the virtual background function of the sub-screen is enabled, segmenting the target object from the background using segmentation technology to obtain a segmentation result; and displaying the virtual background in the sub-screen according to the segmentation result, filling the gap in the related art in which no virtual background is displayed in sub-screens.
In some embodiments, correspondingly displaying the target object in the target image in the at least two sub-screens includes: in response to determining that the virtual background function of a second sub-screen among the at least two sub-screens is enabled, displaying the virtual background in the second sub-screen (for example, the sub-screen 3304 of FIG. 3G), filling the gap in the related art in which no virtual background is displayed in sub-screens.
In some embodiments, the method further includes: segmenting the target object in the target image from the background using semantic segmentation technology to obtain a segmentation result;
and displaying the virtual background in the second sub-screen includes: displaying the virtual background in the second sub-screen according to the segmentation result.
In this way, semantic segmentation technology separates the target object from the actual background, so that the virtual background replacement is achieved well.
In some embodiments, correspondingly displaying the target object in the target image in the at least two sub-screens further includes: in response to determining that the target object is speaking, displaying an indicator mark in the sub-screen corresponding to the target object, thereby reminding the others that the participant in the sub-screen corresponding to the icon is speaking and improving interactivity.
It should be noted that the method of the embodiments of the present disclosure may be executed by a single device, for example a computer or a server. The method of this embodiment may also be applied in a distributed scenario and completed by multiple devices cooperating with one another. In such a distributed scenario, one of the multiple devices may perform only one or more of the steps of the method of the embodiments of the present disclosure, and the multiple devices interact with one another to complete the method.
It should be noted that some embodiments of the present disclosure have been described above; other embodiments fall within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the above embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In certain implementations, multitasking and parallel processing are also possible or may be advantageous.
An embodiment of the present disclosure further provides a computer device for implementing the above method 200 or 400. FIG. 5 is a schematic diagram of the hardware structure of an exemplary computer device 500 provided by an embodiment of the present disclosure. The computer device 500 may be used to implement the server 106 of FIG. 1A, or the terminal devices 102 and 104 of FIG. 1A. In some scenarios, the computer device 500 may also be used to implement the database server 108 of FIG. 1A.
As shown in FIG. 5, the computer device 500 may include a processor 502, a memory 504, a network interface 506, a peripheral interface 508 and a bus 510, wherein the processor 502, the memory 504, the network interface 506 and the peripheral interface 508 are communicatively connected to one another inside the computer device 500 through the bus 510.
The processor 502 may be a central processing unit (CPU), an image processor, a neural network processing unit (NPU), a microcontroller (MCU), a programmable logic device, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), or one or more integrated circuits. The processor 502 may be used to perform functions related to the technology described in the present disclosure. In some embodiments, the processor 502 may also include multiple processors integrated into a single logical component. For example, as shown in FIG. 5, the processor 502 may include multiple processors 502a, 502b and 502c.
The memory 504 may be configured to store data (for example, instructions, computer code, etc.). As shown in FIG. 5, the data stored in the memory 504 may include program instructions (for example, program instructions for implementing the method 200 or 400 of the embodiments of the present disclosure) and data to be processed (for example, the memory may store configuration files of other modules, etc.). The processor 502 may also access the program instructions and data stored in the memory 504 and execute the program instructions to operate on the data to be processed. The memory 504 may include a volatile storage device or a non-volatile storage device. In some embodiments, the memory 504 may include random access memory (RAM), read-only memory (ROM), an optical disk, a magnetic disk, a hard disk, a solid-state drive (SSD), flash memory, a memory stick, and the like.
The network interface 506 may be configured to provide the computer device 500 with communication with other external devices via a network. The network may be any wired or wireless network capable of transmitting and receiving data. For example, the network may be a wired network, a local wireless network (for example, Bluetooth, WiFi, near field communication (NFC), etc.), a cellular network, the Internet, or a combination of the above. It can be understood that the type of network is not limited to the specific examples above.
The peripheral interface 508 may be configured to connect the computer device 500 with one or more peripheral devices to implement information input and output. For example, the peripheral devices may include input devices such as keyboards, mice, touchpads, touch screens, microphones and various sensors, as well as output devices such as displays, speakers, vibrators and indicator lights.
The bus 510 may be configured to transfer information between the various components of the computer device 500 (for example, the processor 502, the memory 504, the network interface 506 and the peripheral interface 508), such as an internal bus (for example, a processor-memory bus) or an external bus (a USB port, a PCI-E bus), etc.
It should be noted that, although the architecture of the computer device 500 above only shows the processor 502, the memory 504, the network interface 506, the peripheral interface 508 and the bus 510, in a specific implementation the architecture of the computer device 500 may also include other components necessary for normal operation. In addition, those skilled in the art can understand that the architecture of the computer device 500 may also contain only the components necessary for implementing the solutions of the embodiments of the present disclosure, without necessarily containing all the components shown in the figure.
An embodiment of the present disclosure further provides an interaction apparatus. FIG. 6 is a schematic diagram of an exemplary apparatus 600 provided by an embodiment of the present disclosure. As shown in FIG. 6, the apparatus 600 may be used to implement the method 200 or 400 and may further include the following modules.
An acquisition module 602 configured to acquire a target image captured by a capture unit.
Taking FIG. 1A as an example, the capture unit may be a camera provided in the terminal device 102 or 104, and the target image may be the image 1022 or 1042 captured by the camera.
A detection module 604 configured to detect a target object (for example, the target objects 302A-302C of FIG. 3A) in the target image (for example, the image 300 of FIG. 3A).
As an optional embodiment, object detection technology may be used to detect the target objects in the target image 300. Optionally, a pre-trained object detection model may be used to detect the target objects 302A-302C in the target image 300 and obtain the detection boxes 304A-304C corresponding to the target objects 302A-302C, as shown in FIG. 3B.
A division module 606 configured to divide the video conference screen into at least two sub-screens according to the target object.
A display module 608 configured to correspondingly display the target object in the target image in the at least two sub-screens.
The split-screen method for a video conference screen provided by the embodiments of the present disclosure detects the target objects in the target image captured by the capture unit, divides the video conference screen into at least two sub-screens according to the target objects, and correspondingly displays the corresponding target objects in the sub-screens, so that the video conference screen can be split automatically when the captured target image contains multiple participants. This helps to enhance the sense of interaction between the participants in the meeting room and the other online participants during the video conference, thereby improving the user experience.
In some embodiments, the division module 606 is configured to: determine whether the target object in the target image is speaking; and in response to determining that the target object in the target image is speaking, divide the video conference screen into at least two sub-screens according to the first split-screen mode, as shown in FIG. 3G. In this way, when someone is speaking, the speaker mode (the first split-screen mode) is used, in which the person currently speaking is placed in the largest sub-screen and the remaining participants are arranged side by side on at least one side of the largest sub-screen (they can be placed on two or more sides when there are many of them); when nobody is speaking, the normal split-screen mode (the second split-screen mode) is used, in which each sub-screen has the same size. Selecting different split-screen modes according to the speaker detection result can increase the interactivity of the video conference and improve the user experience.
In some embodiments, the division module 606 is configured to: enlarge a first sub-screen (for example, the sub-screen 3302 of FIG. 3G) of the at least two sub-screens, and display the other sub-screens of the at least two sub-screens side by side on at least one side of the first sub-screen, the first sub-screen being used to display the target object who is speaking;
and the display module 608 is configured to: display, in the first sub-screen, the target object corresponding to the speaking participant.
In this way, according to the speaker detection result, the sub-screen of the speaking participant is arranged in the middle of the picture and occupies a larger area, while the sub-screens of the non-speakers are arranged on the side and occupy smaller areas, thereby better improving interactivity.
In some embodiments, the detection module 604 is configured to: detect key points of the target object; determine the lip height based on the key points of the target object; determine the lip width based on the key points of the target object; obtain the lip height-to-width ratio of the target object from the lip height and the lip width; and determine whether the target object is speaking based on the change information of the lip height-to-width ratio. Speaker detection can thus be performed by means of image processing, avoiding the problem that the audio stream cannot distinguish who exactly is currently speaking.
In some embodiments, the detection module 604 is configured to: perform key point detection on the target object; and determine, according to the key point detection result, whether the target object in the target image is speaking. Speaker detection can thus be performed by means of image processing, avoiding the problem that the audio stream cannot distinguish who exactly is currently speaking.
In some embodiments, the detection module 604 is configured to:
obtain multiple key points by detection from the target object;
perform roll correction on the multiple key points according to the correspondence between the key points and the standard key points, to obtain multiple corrected key points;
select, from the corrected key points, multiple first key points corresponding to the lip height, and perform pitch correction on the multiple first key points to obtain the corrected lip height;
select, from the corrected key points, multiple second key points corresponding to the lip width, and perform yaw correction on the multiple second key points to obtain the corrected lip width;
obtain the key point detection result from the corrected lip height and the corrected lip width.
To reduce misjudgments of the speech detection caused by face movement, rotation and other actions, the key points from face detection are mapped onto the non-rotated standard face before the speech determination, reducing the impact of face movement.
In some embodiments, the detection module 604 is configured to: calculate the lip height-to-width ratio of the target object from the corrected lip height and the corrected lip width and store the lip height-to-width ratio;
and, in combination with the key point detection result, determine the change information corresponding to the lip height-to-width ratio of the target object in the target image, and determine, from the change information, whether the participant corresponding to the target object in the target image is speaking.
In this way, the temporal information of the key points is used to enhance the temporal stability of speaker detection and reduce fluctuation of the detection state.
In some embodiments, the detection module 604 is configured to: set a preset time period; count the number of changes in the lip height-to-width ratio within the preset time period; and determine that the target object is speaking when the number of changes reaches a preset number.
In this way, the temporal information of the key points is used to enhance the temporal stability of speaker detection and reduce fluctuation of the detection state.
In some embodiments, the detection module 604 is configured to, when the number of changes reaches the preset number, determine that the target object is speaking, including: determining that the number of changes reaches the preset number; acquiring audio data of the video conference; and determining, in combination with the audio data of the video conference, that the target object is speaking. Combining the key point detection result and the audio data of the current video conference can further improve the accuracy of the speaker determination. As an optional embodiment, when the microphone that captures the audio is a dual-channel microphone, the speaker can be located from the two sets of audio data respectively captured by the two channels, which can further improve the accuracy of the speaker determination.
In some embodiments, the detection module 604 is configured to: detect the target object in the target image using object detection or object tracking technology to obtain the detection box corresponding to the target object;
and the display module 608 is configured to: determine the coordinates of the detection box corresponding to the target object; determine the coordinates of the sub-screen corresponding to the target object; and translate and/or scale the image corresponding to the detection box into the sub-screen corresponding to the target object according to the coordinates of the detection box and the coordinates of the sub-screen.
By scaling the target objects, all participants can clearly see the others, improving the interactivity of the conference.
In some embodiments, the display module 608 is configured to: in response to determining that the virtual background function of the sub-screen is enabled, segment the target object from the background using segmentation technology to obtain a segmentation result; and display the virtual background in the sub-screen according to the segmentation result, filling the gap in the related art in which no virtual background is displayed in sub-screens.
In some embodiments, the display module 608 is configured to: in response to determining that the virtual background function of a second sub-screen among the at least two sub-screens is enabled, display the virtual background in the second sub-screen (for example, the sub-screen 3304 of FIG. 3G), filling the gap in the related art in which no virtual background is displayed in sub-screens.
In some embodiments, the display module 608 is configured to: segment the target object in the target image from the background using semantic segmentation technology to obtain a segmentation result; and display the virtual background in the second sub-screen according to the segmentation result. In this way, semantic segmentation technology separates the target object from the actual background, so that the virtual background replacement is achieved well.
In some embodiments, the display module 608 is configured to: in response to determining that the target object is speaking, display an indicator mark in the sub-screen corresponding to the target object, thereby reminding the others that the participant in the sub-screen corresponding to the icon is speaking and improving interactivity.
For convenience of description, the above apparatus is described by dividing it into various modules according to function. Of course, when implementing the present disclosure, the functions of the modules may be implemented in the same piece or in multiple pieces of software and/or hardware.
The apparatus of the above embodiment is used to implement the corresponding method 400 in any of the foregoing embodiments and has the beneficial effects of the corresponding method embodiment, which are not repeated here.
Based on the same inventive concept, corresponding to the method of any of the above embodiments, the present disclosure further provides a non-transitory computer-readable storage medium storing computer instructions, the computer instructions being used to cause the computer to perform the method 400 according to any of the above embodiments.
The computer-readable media of this embodiment include permanent and non-permanent, removable and non-removable media, in which information storage can be implemented by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
The computer instructions stored in the storage medium of the above embodiment are used to cause the computer to perform the method 200 or 400 according to any of the above embodiments and have the beneficial effects of the corresponding method embodiments, which are not repeated here.
Based on the same inventive concept, corresponding to the method 200 or 400 of any of the above embodiments, the present disclosure further provides a computer program product, which includes a computer program. In some embodiments, the computer program is executable by one or more processors so that the processors perform the method 200 or 400. Corresponding to the execution subject of each step in the embodiments of the method 200 or 400, the processor that executes a given step may belong to the corresponding execution subject.
The computer program product of the above embodiment is used to cause a processor to perform the method 400 according to any of the above embodiments and has the beneficial effects of the corresponding method embodiment, which are not repeated here.
Those of ordinary skill in the art should understand that the discussion of any of the above embodiments is merely exemplary and is not intended to imply that the scope of the present disclosure (including the claims) is limited to these examples; under the idea of the present disclosure, the technical features of the above embodiments or of different embodiments may also be combined, the steps may be implemented in any order, and there are many other variations of the different aspects of the embodiments of the present disclosure as described above, which are not provided in detail for brevity.
In addition, to simplify the description and discussion, and so as not to obscure the embodiments of the present disclosure, well-known power/ground connections to integrated circuit (IC) chips and other components may or may not be shown in the provided drawings. Furthermore, devices may be shown in block diagram form in order to avoid obscuring the embodiments of the present disclosure, and this also takes into account the fact that the details of the implementation of these block-diagram devices are highly dependent on the platform on which the embodiments of the present disclosure are to be implemented (that is, these details should be well within the understanding of those skilled in the art). Where specific details (for example, circuits) are set forth to describe exemplary embodiments of the present disclosure, it will be apparent to those skilled in the art that the embodiments of the present disclosure may be practiced without these specific details or with variations of these specific details. Accordingly, these descriptions should be regarded as illustrative rather than restrictive.
Although the present disclosure has been described in conjunction with specific embodiments thereof, many alternatives, modifications and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, other memory architectures (for example, dynamic RAM (DRAM)) may use the embodiments discussed.
The embodiments of the present disclosure are intended to cover all such alternatives, modifications and variations that fall within the broad scope of the appended claims. Therefore, any omissions, modifications, equivalent substitutions, improvements, etc. made within the spirit and principles of the embodiments of the present disclosure shall be included in the protection scope of the present disclosure.
Claims (13)
- A split-screen method for a video conference screen, comprising: acquiring a target image captured by a capture unit; detecting a target object in the target image; dividing the video conference screen into at least two sub-screens according to the target object; and correspondingly displaying the target object in the target image in the at least two sub-screens.
- The method according to claim 1, wherein dividing the video conference screen into at least two sub-screens according to the target object further comprises: determining whether the target object in the target image is speaking; and in response to determining that the target object is speaking, dividing the video conference screen into at least two sub-screens according to a first split-screen mode.
- The method according to claim 2, wherein dividing the video conference screen into at least two sub-screens according to the first split-screen mode comprises: enlarging a first sub-screen of the at least two sub-screens, the first sub-screen being used to display the target object who is speaking; and displaying the other sub-screens of the at least two sub-screens side by side on at least one side of the first sub-screen.
- The method according to claim 2, wherein determining whether the target object in the target image is speaking comprises: detecting key points of the target object; determining a lip height based on the key points of the target object; determining a lip width based on the key points of the target object; obtaining a lip height-to-width ratio of the target object according to the lip height and the lip width; and determining whether the target object is speaking based on change information of the lip height-to-width ratio.
- The method according to claim 4, wherein determining whether the target object is speaking based on the change information of the lip height-to-width ratio comprises: setting a preset time period; counting the number of changes in the lip height-to-width ratio within the preset time period; and determining that the target object is speaking when the number of changes reaches a preset number.
- The method according to claim 5, wherein determining that the target object is speaking when the number of changes reaches the preset number comprises: determining that the number of changes reaches the preset number; acquiring audio data of the video conference; and determining, in combination with the audio data of the video conference, that the target object is speaking.
- The method according to claim 1, wherein detecting the target object in the target image comprises: detecting the target object in the target image using object detection or object tracking technology to obtain a detection box of the target object; and correspondingly displaying the target object in the target image in the at least two sub-screens comprises: determining coordinates of the detection box corresponding to the target object; determining coordinates of the sub-screen corresponding to the target object; and translating and/or scaling the image corresponding to the detection box into the sub-screen corresponding to the target object according to the coordinates of the detection box and the coordinates of the sub-screen.
- The method according to claim 1, wherein correspondingly displaying the target object in the target image in the at least two sub-screens further comprises: in response to determining that a virtual background function of the sub-screen is enabled, segmenting the target object from the background using segmentation technology to obtain a segmentation result; and displaying a virtual background in the sub-screen according to the segmentation result.
- The method according to claim 1, wherein correspondingly displaying the target object in the target image in the at least two sub-screens further comprises: in response to determining that the target object is speaking, displaying an indicator mark in the sub-screen corresponding to the target object.
- A split-screen apparatus for a video conference screen, comprising: an acquisition module configured to acquire a target image captured by a capture unit; a detection module configured to detect a target object in the target image; a division module configured to divide the video conference screen into at least two sub-screens according to the target object; and a display module configured to correspondingly display the target object in the target image in the at least two sub-screens.
- A computer device, comprising one or more processors, a memory, and one or more programs, wherein the one or more programs are stored in the memory and executed by the one or more processors, the programs comprising instructions for performing the method according to any one of claims 1 to 9.
- A non-volatile computer-readable storage medium containing a computer program which, when executed by one or more processors, causes the processors to perform the method according to any one of claims 1 to 9.
- A computer program product, comprising computer program instructions which, when run on a computer, cause the computer to perform the method according to any one of claims 1 to 9.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310611376.5 | 2023-05-26 | ||
CN202310611376.5A CN116582637A (zh) | 2023-05-26 | 2023-05-26 | 视频会议画面的分屏方法及相关设备 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024245105A1 true WO2024245105A1 (zh) | 2024-12-05 |
Family
ID=87537562
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2024/095004 WO2024245105A1 (zh) | 2023-05-26 | 2024-05-23 | 视频会议画面的分屏方法及相关设备 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN116582637A (zh) |
WO (1) | WO2024245105A1 (zh) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116582637A (zh) * | 2023-05-26 | 2023-08-11 | 北京字跳网络技术有限公司 | 视频会议画面的分屏方法及相关设备 |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101080000A (zh) * | 2007-07-17 | 2007-11-28 | 华为技术有限公司 | 视频会议中显示发言人的方法、系统、服务器和终端 |
CN108933915A (zh) * | 2017-05-26 | 2018-12-04 | 和硕联合科技股份有限公司 | 视频会议装置与视频会议管理方法 |
CN109492506A (zh) * | 2017-09-13 | 2019-03-19 | 华为技术有限公司 | 图像处理方法、装置和系统 |
JP2021034900A (ja) * | 2019-08-26 | 2021-03-01 | 沖電気工業株式会社 | 処理装置、処理プログラム、及び処理方法 |
CN113065534A (zh) * | 2021-06-02 | 2021-07-02 | 全时云商务服务股份有限公司 | 一种基于人像分割精度提升的方法、系统和存储介质 |
CN113676693A (zh) * | 2021-08-19 | 2021-11-19 | 京东方科技集团股份有限公司 | 画面呈现方法、视频会议系统及可读存储介质 |
US20220182578A1 (en) * | 2020-12-04 | 2022-06-09 | Blackberry Limited | Speech Activity Detection Using Dual Sensory Based Learning |
CN116029895A (zh) * | 2023-02-23 | 2023-04-28 | 广州佰锐网络科技有限公司 | 一种ai虚拟背景实现方法、系统及计算机可读存储介质 |
CN116582637A (zh) * | 2023-05-26 | 2023-08-11 | 北京字跳网络技术有限公司 | 视频会议画面的分屏方法及相关设备 |
-
2023
- 2023-05-26 CN CN202310611376.5A patent/CN116582637A/zh active Pending
-
2024
- 2024-05-23 WO PCT/CN2024/095004 patent/WO2024245105A1/zh unknown
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101080000A (zh) * | 2007-07-17 | 2007-11-28 | 华为技术有限公司 | 视频会议中显示发言人的方法、系统、服务器和终端 |
CN108933915A (zh) * | 2017-05-26 | 2018-12-04 | 和硕联合科技股份有限公司 | 视频会议装置与视频会议管理方法 |
CN109492506A (zh) * | 2017-09-13 | 2019-03-19 | 华为技术有限公司 | 图像处理方法、装置和系统 |
JP2021034900A (ja) * | 2019-08-26 | 2021-03-01 | 沖電気工業株式会社 | 処理装置、処理プログラム、及び処理方法 |
US20220182578A1 (en) * | 2020-12-04 | 2022-06-09 | Blackberry Limited | Speech Activity Detection Using Dual Sensory Based Learning |
CN113065534A (zh) * | 2021-06-02 | 2021-07-02 | 全时云商务服务股份有限公司 | 一种基于人像分割精度提升的方法、系统和存储介质 |
CN113676693A (zh) * | 2021-08-19 | 2021-11-19 | 京东方科技集团股份有限公司 | 画面呈现方法、视频会议系统及可读存储介质 |
CN116029895A (zh) * | 2023-02-23 | 2023-04-28 | 广州佰锐网络科技有限公司 | 一种ai虚拟背景实现方法、系统及计算机可读存储介质 |
CN116582637A (zh) * | 2023-05-26 | 2023-08-11 | 北京字跳网络技术有限公司 | 视频会议画面的分屏方法及相关设备 |
Also Published As
Publication number | Publication date |
---|---|
CN116582637A (zh) | 2023-08-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20230132407A1 (en) | Method and device of video virtual background image processing and computer apparatus | |
CN108933915B (zh) | 视频会议装置与视频会议管理方法 | |
KR101800617B1 (ko) | 디스플레이 장치 및 이의 화상 통화 방법 | |
CN107980221B (zh) | 合成并缩放角度分离的子场景 | |
EP3198866B1 (en) | Reconstruction of three-dimensional video | |
WO2020199906A1 (zh) | 人脸关键点检测方法、装置、设备及存储介质 | |
US9930270B2 (en) | Methods and apparatuses for controlling video content displayed to a viewer | |
US11205305B2 (en) | Presentation of three-dimensional video | |
US11527242B2 (en) | Lip-language identification method and apparatus, and augmented reality (AR) device and storage medium which identifies an object based on an azimuth angle associated with the AR field of view | |
WO2014034556A1 (ja) | 画像処理装置及び画像表示装置 | |
CN112243583A (zh) | 多端点混合现实会议 | |
CN112069863B (zh) | 一种面部特征的有效性判定方法及电子设备 | |
CN106470313B (zh) | 影像产生系统及影像产生方法 | |
WO2024245105A1 (zh) | 视频会议画面的分屏方法及相关设备 | |
CN114520888A (zh) | 影像撷取系统 | |
WO2022141651A1 (en) | Visual tracking system for active object | |
CN115334241A (zh) | 对焦控制方法、装置、存储介质及摄像设备 | |
CN115426474A (zh) | 对象显示方法、装置、系统、设备、介质和产品 | |
TWI755938B (zh) | 影像擷取系統 | |
RU2807472C2 (ru) | Способ определения допустимости признака лица и электронное устройство | |
CN118678228B (zh) | 全景会议控制方法、装置、计算机设备 | |
US20230289919A1 (en) | Video stream refinement for dynamic scenes | |
WO2024238701A1 (en) | Displaying video conference participants in alternative display orientation modes | |
WO2023235329A1 (en) | Framework for simultaneous subject and desk capture during videoconferencing | |
CN118644924A (zh) | 考勤方法、装置、设备及可读存储介质 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 24814304 Country of ref document: EP Kind code of ref document: A1 |