
WO2024163006A1 - Methods and systems for constructing facial images using partial images - Google Patents

Methods and systems for constructing facial images using partial images

Info

Publication number
WO2024163006A1
WO2024163006A1 (application PCT/US2023/061550)
Authority
WO
WIPO (PCT)
Prior art keywords
image
autoencoded
patch
facial
composite
Prior art date
Application number
PCT/US2023/061550
Other languages
French (fr)
Inventor
Zheng Chen
Zhiqi ZHANG
Yi Xu
Original Assignee
Innopeak Technology, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Innopeak Technology, Inc. filed Critical Innopeak Technology, Inc.
Priority to PCT/US2023/061550 priority Critical patent/WO2024163006A1/en
Publication of WO2024163006A1 publication Critical patent/WO2024163006A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168: Feature extraction; Face representation
    • G: PHYSICS
    • G02: OPTICS
    • G02B: OPTICAL ELEMENTS, SYSTEMS OR APPARATUS
    • G02B 27/00: Optical systems or apparatus not provided for by any of the groups G02B 1/00 - G02B 26/00, G02B 30/00
    • G02B 27/01: Head-up displays
    • G02B 27/0101: Head-up displays characterised by optical features
    • G02B 2027/0138: Head-up displays characterised by optical features comprising image capture systems, e.g. camera

Definitions

  • Eye contact is important during human social communication. It increases the presence level of a communicator, makes other people feel more comfortable, and promotes the intimacy between the communicators as well. Eye contact is not only needed in the real world, but also in the next generation of communication medium in virtual reality (VR).
  • VR headsets physically block any visual connection between the user and others, which causes a visual disconnect. There are various solutions to address this visual disconnect caused by VR devices, but they have been inadequate, as described below.
  • the present invention is directed to extended reality systems and methods.
  • the present invention provides a method for obtaining an image of a first facial organ and a reference image of a user.
  • the image of the first facial organ is autoencoded into a patch.
  • the patch is inserted into a corresponding location of the reference image to form a composite facial image.
  • Embodiments of the present invention can be implemented in conjunction with existing systems and processes.
  • the present device configuration and its related methods according to the present invention can be used in a wide variety of systems, including virtual reality (VR) systems, mobile devices, and the like.
  • various techniques according to the present invention can be adopted into existing systems via integrated circuit fabrication, operating software, and wireless communication protocols. There are other benefits as well.
  • a system of one or more computers can be configured to perform particular operations or actions by virtue of having software, firmware, hardware, or a combination of them installed on the system that in operation causes or cause the system to perform the actions.
  • One or more computer programs can be configured to perform particular operations or actions by virtue of including instructions that, when executed by data processing apparatus, cause the apparatus to perform the actions.
  • One general aspect includes a method for generating composite facial images. The method also includes obtaining a reference facial image of a person. The method also includes obtaining a first eye image at a first time. The method also includes obtaining a second eye image at a second time. The method also includes processing the first eye image to provide a first autoencoded patch.
  • the method also includes processing the second eye image to provide a second autoencoded patch.
  • the method also includes generating a mask containing a first location for the first autoencoded patch and a second location for the second autoencoded patch.
  • the method also includes overlaying the first autoencoded patch at the first location of the reference facial image.
  • the method also includes overlaying the second autoencoded patch at the second location of the reference facial image.
  • the method also includes providing a first composite image of the person at a third time, the first composite image may include the first autoencoded patch and the second autoencoded patch overlaying the reference facial image.
  • Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.
  • Implementations may include one or more of the following features.
  • the method where the first eye image is captured by a first infrared (IR) camera and the second eye image is captured by a second IR camera.
  • the method may include generating a second composite image of the person at a fourth time, a time interval between the third time and the fourth time satisfying a predetermined frame rate.
  • the method may include obtaining a mouth image, processing the mouth image to provide a third autoencoded patch, and overlaying the third autoencoded patch at a third location of the reference facial image using the mask.
  • the method may include performing gray prediction on the first autoencoded patch and the second autoencoded patch.
  • the method may include displaying the first composite image.
  • the method may include transmitting the first composite image.
  • the extended reality device also includes a housing that may include a frontside and a backside.
  • the device also includes a first eye camera configured on the backside and configured to capture a first eye image of a person wearing the extended reality device.
  • the device also includes a display configured on the backside.
  • the device also includes a storage configured to store a reference image of the person.
  • the device also includes a communication interface configured to transmit a first composite image.
  • the device also includes a processor configured to: process the first eye image to provide a first autoencoded patch; generate a mask containing a first location for the first autoencoded patch; and generate a first composite image at a first time, the first composite image may include the first autoencoded patch overlaying the reference image at the first location.
  • Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.
  • Implementations may include one or more of the following features.
  • the extended reality device where the first eye camera may include a grayscale infrared camera.
  • the extended reality device may include a second eye camera and a mouth camera configured on the backside.
  • the extended reality device may include a pair of front cameras configured on the frontside.
  • the processor is further configured to generate a video that may include the first composite image.
  • the processor may include a neural processing unit for performing autoencoding.
  • One general aspect includes a method for generating composite facial images.
  • the method also includes obtaining a reference facial image of a person.
  • the method also includes obtaining a first eye image at a first time.
  • the method also includes obtaining a mouth image at a second time.
  • the method also includes processing the first eye image to provide a first autoencoded patch.
  • the method also includes processing the mouth image to provide a second autoencoded patch.
  • the method also includes generating a mask containing a first location for the first autoencoded patch and a second location for the second autoencoded patch.
  • the method also includes providing a first composite image of the person at a third time, the first composite image may include the first autoencoded patch and the second autoencoded patch overlaying the reference facial image respectively at the first location and the second location.
  • the method also includes performing autoencoding on the first composite image.
  • Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.
  • Implementations may include one or more of the following features.
  • the method may include obtaining a second eye image, and processing the mouth image to provide a third autoencoded patch.
  • the method may include generating a grayscale image using at least the first autoencoded patch and the second autoencoded patch, applying a discriminator using the first composite image and the grayscale image.
  • the method may include performing gray prediction on the first autoencoded patch.
  • the present invention provides configurations and methods for extended reality devices that allow a user to enable visual connection to others by creating a display of the user’s real human face that is otherwise blocked or partially blocked by the extended reality device.
  • this visual connection technique can enable VR meetings in which participants can see each other’s facial expressions, which maintains visual contact for a comfortable communication experience.
  • the present invention implements machine learning techniques to store user preference data related to the user’s facial display.
  • Figure 1 is a simplified diagram illustrating an extended reality device with cameras capturing eyes and mouth of a wearer according to embodiments of the present invention.
  • Figure 2 is a simplified block diagram illustrating internal components of an extended reality device according to embodiments of the present invention.
  • Figure 3 is a simplified diagram illustrating data flow of forming a composite image according to an embodiment of the present invention.
  • Figure 4 is a simplified flow diagram illustrating a method for generating a composite image according to embodiments of the present invention.
  • FIG. 5 is a simplified flow diagram illustrating an alternative method for generating a composite image according to embodiments of the present invention.
  • the present invention is directed to extended reality systems and methods.
  • the present invention provides a method for obtaining an eye image and a reference image of a user.
  • the eye image is autoencoded into a patch.
  • the patch is inserted into a corresponding location of the reference image to form a composite image.
  • the method applies to obtain other facial features or facial organs from the user, such as a nose, a lip, a mouth, an eyebrow, cheeks, and ears.
  • extended reality devices tend to physically block any visual connection between the user and others.
  • a target of the present invention is to reveal real facial images of a user by using partial observations from the extended reality device worn by the user, which can be applied to a variety of virtual reality (VR) or augmented reality (AR) interaction applications (e.g., meetings, social platforms, gaming, etc.).
  • Examples of the present extended reality device can include headsets, goggles, mobile devices, and the like.
  • the present invention provides configurations and methods for extended reality devices that allow a user to enable visual connection to others by creating a display of the user’s real human face that is otherwise blocked or partially blocked by the extended reality device.
  • this visual connection technique can enable VR meetings in which participants can see each other’s facial expressions, which maintains visual contact for a comfortable communication experience.
  • the present invention implements machine learning techniques to store user preference data related to the user’s facial display.
  • any element in a claim that does not explicitly state “means for” performing a specified function, or “step for” performing a specific function, is not to be interpreted as a “means” or “step” clause as specified in 35 U.S.C. Section 112, Paragraph 6.
  • the use of “step of” or “act of” in the Claims herein is not intended to invoke the provisions of 35 U.S.C. 112, Paragraph 6.
  • the present invention provides a device and method of 2D photo-realistic human face restoration using partial observations for users of extended reality devices whose faces are at least partially blocked while wearing such extended reality devices.
  • the partial observations (e.g., left eye, right eye, mouth, etc.) can be obtained by various image-capturing devices configured within the extended reality device.
  • reference images of the user captured before the user uses the extended reality device can be used to assist the facial restoration process. In this way, the user can maintain a visual connection with another participant or target audience without having to take the extended reality device off.
  • Figure 1 is a simplified diagram illustrating an extended reality device with cameras capturing the eyes and the mouth of a wearer according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications.
  • the extended reality device 100 is configured within a housing 110 having at least a frontside and a backside.
  • the device 100 includes a display 120 (shown within dotted line cutaway section 101) and at least a first eye camera 131 spatially configured on the backside (shown within dotted line cutaway section 102).
  • the first eye camera 131 (e.g., inward-facing) is configured to capture a first eye image of a person/user wearing the extended reality device 100.
  • the device 100 may also include a second eye camera 132 (shown within dotted line cutaway section 103) configured on the housing backside to capture a second eye image of the person, or device 100 may include a mouth camera 134 (shown within dotted line cutaway section 104 and facing downward toward the mouth) configured on the housing backside to capture a mouth image of the person.
  • the device 100 may include one or more image-capturing devices (e.g., camera, image sensor, video recorder, etc.) configured on the housing backside to capture one or more facial feature images or partial facial feature images (e.g., eye, mouth, nose, cheek, chin, brow, hairline, etc.).
  • one or more image-capturing devices 140 may be configured on the housing frontside as well (e.g., a pair of front cameras configured to show the user the outside environment on the display).
  • any of the image-capturing devices can include an infrared (IR) image-capturing device, such as an IR camera or a grayscale IR camera.
  • the image-capturing devices configured on the housing backside may be configured at angles that do not capture the user’s facial features from the desired perspective (e.g., front-facing perspective for VR meeting application).
  • first eye camera 131 and second eye camera 132 may be configured near the sides of the device, which causes the resulting eye images to be from a side angle.
  • the eye images (or any facial feature image) will also require reconstruction to determine what the user’s facial feature should look like from the desired perspective.
  • the present invention also provides for machine learning and/or deep learning techniques to generate a composite facial image of a user at desired perspectives.
  • the device 100 also includes at least a storage, a communication interface, and a processor (see Figure 2).
  • the storage is configured to store at least a reference image of the person wearing the device 100.
  • the processor is configured to process at least the first eye image to provide a first autoencoded patch, to generate a mask containing a first location for the first autoencoded patch, and to generate a first composite image at a first time.
  • This first composite image includes at least the first autoencoded patch overlaying the reference image at the first location.
  • the extended reality device 100 also includes one or more microphones 142 for capturing audio inputs from the person/user.
  • the processor may also be configured to generate a video comprising the first composite image.
  • the video may also include the audio inputs captured by one or more microphones.
  • the processor also includes a neural processing unit (NPU) for performing autoencoding.
  • the communication interface is configured to transmit the first composite image of the person.
  • Figure 2 is a simplified block diagram illustrating internal components of an extended reality device according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications.
  • Figure 2 expands on the internal and external components that can be included in an extended reality device according to the present invention, such as device 100 of Figure 1.
  • these components can include a central processing unit (CPU) 210 coupled to a graphic processing unit (GPU) 212 and a display 220.
  • the display 220 may also be coupled to a screen touch sensor 222 and configured as a touchscreen.
  • the CPU 210 is coupled to the screen touch sensor 222, an auxiliary hand sensor 230 (e.g., headset sensors, remotes, controllers, etc.), and an accelerometer 232, all of which can provide sensor inputs to determine user gestures and device operating modes. Also, the CPU 210 is coupled to a motor 234, which can be configured to provide haptic feedback to the user (e.g., in response to sensor inputs, user gestures, etc.).
  • the GPU 212 can be coupled to the display 220 and one or more image-capturing devices, such as a camera 250. The image-capturing devices can include the IR devices discussed previously (e.g., IR cameras, grayscale IR cameras, etc.).
  • the GPU 212 can be configured to transmit various forms of the user interface to display 220 depending on the operating mode, and the GPU 212 can be configured to transmit various forms of the user’s facial image or video to the display 220 (i.e., showing users how they will look to others).
  • the CPU 210 can also be further configured to update display preference data (e.g., user facial image/video composition, user interface arrangement, etc.) in a storage 260.
  • the display preference data can be updated based on user input received from the touchscreen, sensors configured within the extended reality device, or image-capturing devices.
  • the CPU 210 is also coupled to the neural processing unit (NPU) 214 that is configured to update the display preference data based on user inputs, which can be received via the screen touch sensor 222, the sensors 230, the cameras 250, or the like.
  • NPU 214 can be used to learn the user’s gestures via one or more of the sensors within the device 200 or learn the user’s facial composition via one or more of the cameras within the device 200.
  • Such gesture data and facial composition data can also be stored in storage 260.
  • the storage 260 is coupled to the CPU 210, the GPU 212, the NPU 214, the display 220, and the screen touch sensor 222.
  • the device 200 can further include a communication module 240 coupled to the CPU 210 and configured to transmit display preference data, composite images, or videos to a server.
  • This communication module 240 can be configured for mobile data, Wi-Fi, Bluetooth, etc.
  • the communication module 240 is configured to transmit user facial image data or video data to a target recipient (i.e., showing others the user’s facial image or video).
  • the communication module 240 can also be configured to transmit user profile information generated using one or more inputs from one or more sensor arrays configured within the handheld device. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium. Of course, there can be other variations, modifications, and alternatives to these device elements and their configurations.
  • Figure 3 is a simplified diagram illustrating data flow of forming a composite image according to an embodiment of the present invention.
  • This diagram is merely an example, which should not unduly limit the scope of the claims.
  • One of ordinary skill in the art would recognize many variations, alternatives, and modifications. For example, one or more steps may be added, removed, repeated, replaced, modified, rearranged, and/or overlapped, and they should not limit the scope of the claims.
  • the present invention provides a method to form composite facial images of an extended reality device user whose face is blocked or partially blocked by wearing the extended reality device.
  • Determining the composite facial image can be divided into at least two parts (i.e., a disentangled manner/implementation or a coarse-to-fine process), including but not limited to: (1) determining local geometry transformation using local image patches, and (2) determining global fusion based on local predictions.
  • flow 300 implements a disentangled network architecture to determine a user’s composite facial image using a plurality of processing stages.
  • the first stage includes obtaining raw image patches 310, which can be obtained by image-capturing devices configured within the extended reality device, as discussed previously.
  • the raw image patches 310 can include at least a left eye image patch 312, a right eye image patch 314, and a mouth image patch 316.
  • the captured raw image patches may not be in the ideal perspective or orientation for the composite facial image (e.g., side view or offset angle).
  • each such image patch can be trained to predict the appropriate image patch for the target facial feature (e.g., left eye, right eye, mouth, etc.).
  • each of these image patches can be processed separately by an autoencoder (AE) device 322, denoted here as “AE1” to “AE3”.
  • the prediction process results in predicted local patches for the left eye image 332, the right eye image 334, and the mouth image 336.
  • each AE device 322 includes an encoder and a decoder configured to perform an autoencoding process to generate predictions 330 based on the local patches.
  • the AE diagram 201 shows the autoencoding process in which the encoder (En) automatically encodes data based on input values, the AE performs an activation function on the encoded data, and the decoder (De) decodes the data to produce the resulting output.
  • the image-capturing devices of the extended reality device can include small mounted cameras. By using small mounted cameras to only capture local facial areas, the autoencoding process can exhibit better performance for real-time applications (e.g., VR meetings).
  • raw image patches 310 can include grayscale images obtained using IR cameras for grayscale predictions. Using grayscale image patches from the IR cameras can further optimize the autoencoding process to determine a user’s facial composite image.
  • the second stage includes using the predicted local patches 330 and reference images 340 to synthesize a new dataset (e.g., composite image), where new overlaid images with corrected local patches are generated.
  • This process can include generating a mask 342 with designated locations for overlaying the predicted image patches from the previous stage.
  • the predicted image patches of the left eye 332, right eye 334, and mouth 336 are overlaid on the designated left eye, right eye, and mouth locations of the mask 342, respectively.
  • Overlaying the mask 342 on the reference facial image 340 results in the composite facial image 344.
  • the composite image process can include training local models to improve the local prediction image patches.
  • This training process is shown by the training image patches 350, including a training left eye image patch 352, a training right eye image patch 354, and a training mouth image patch 356.
  • the training process can include applying a mean squared error (MSE) process, a learned perceptual image patch similarity (LPIPS) process, other reconstruction loss processes, and the like, or combinations thereof (see the illustrative training sketch following this list).
  • These training image patches can be used to improve the prediction image patches 330 that are overlaid using mask 342 and the reference image 340 to create modified composite facial images 344.
  • the third stage includes training the global model 360 on the newly generated dataset (i.e., composite facial image 344) to enable global fusion capability.
  • the global model 360 uses an AE device 362 (also denoted as “AE4”) to perform this global autoencoding.
  • the global model training process can include creating a training composite image 372 to improve the prediction composite image 370.
  • the training composite image can be grayscale.
  • Both the prediction and training composite facial images 370, 372 can be provided to discriminator 374 (or a classifier) to determine the difference between these images to train the global model (see the training sketch following this list).
  • a CPU of the extended reality device is configured to form the composite image using the steps discussed previously.
  • An NPU coupled to the CPU can be configured to perform the model training processes and other machine/deep learning related processes.
  • the CPU and NPU can be configured similarly to device 201 shown in Figure 2. Further details of example methods for generating composite images are discussed with reference to Figures 4 and 5.
  • Figure 4 is a simplified flow diagram illustrating a method for generating a composite image according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims.
  • One of ordinary skill in the art would recognize many variations, alternatives, and modifications. For example, one or more steps may be added, removed, repeated, replaced, modified, rearranged, and/or overlapped, and they should not limit the scope of the claims.
  • the method of generating the composite image can be performed by an extended reality device, such as device 100 shown in Figure 1.
  • method 400 includes step 402 of obtaining a reference facial image of a person.
  • This reference image can be obtained by using an integrated image-capturing device, by user upload, network transfer, or the like.
  • the method includes obtaining a first facial organ image (e.g., left eye) at a first time, and obtaining a second facial organ image (e.g., right eye) at a second time, respectively.
  • These eye images can be captured using one or more image-capturing devices, such as cameras, image sensors, video recorders, or the like.
  • an infrared (IR) camera can be configured to capture the facial organ images, or a first IR camera can be configured to capture the first facial organ image while a second IR camera can be configured to capture the second facial organ image.
  • the method can include obtaining a plurality of facial organ images at separate times.
  • the plurality of facial organ images can be captured using one or more image-capturing devices (e.g., IR cameras), as discussed previously.
  • a separate image-capturing device can be configured to capture each of the facial organ images, or any one of the image-capturing devices can be configured to capture multiple facial organ images at separate times.
  • the method includes processing the first facial organ image to provide a first autoencoded patch, and processing the second facial organ image to provide a second autoencoded patch, respectively.
  • the method includes generating a mask containing a first location for the first autoencoded patch and a second location for the second autoencoded patch.
  • the method includes obtaining a third facial organ image (e.g., mouth), processing the third facial organ image to provide a third autoencoded patch, and overlaying the third autoencoded patch at a third location of the reference facial image using the mask.
  • additional facial organ images or partial facial images (e.g., nose, lip, ears, chin, cheeks, eyebrows, hairline, etc.) can also be obtained and processed in the same way.
  • the method includes overlaying the first autoencoded patch at the first location of the reference facial image, and overlaying the second autoencoded patch at the second location of the reference facial image.
  • the third facial organ image and/or other partial facial images can be processed to provide additional autoencoded patches, which can be overlaid at additional locations of the reference facial image using the mask or used to generate the mask and then subsequently overlaid on the additional locations. Further, the method can include performing gray prediction on one or more of the autoencoded patches.
  • the method includes providing a first composite image of the person at a third time.
  • This first composite image includes at least the first autoencoded patch and the second autoencoded patch overlaying the reference facial image using the mask.
  • the first composite image can also include one or more of the additional autoencoded patches provided by processing one or more of the other partial facial images.
  • steps 406, 410 and their associated operations in steps 412, 416 and 418 are optional.
  • the method can also include generating one or more additional composite images of the person at different times.
  • a second composite image of the person can be generated at a fourth time such that a time interval between the third and fourth time satisfies a predetermined frame rate.
  • the times of the additional composite images can be configured such that the time intervals between such times satisfy one or more predetermined frame rates.
  • the additional composite images can also include any number of the autoencoded patches provided from processing the partial facial images.
  • Figure 5 is a simplified flow diagram illustrating an alternative method for generating a composite image according to embodiments of the present invention.
  • This diagram is merely an example, which should not unduly limit the scope of the claims.
  • One of ordinary skill in the art would recognize many variations, alternatives, and modifications. For example, one or more steps may be added, removed, repeated, replaced, modified, rearranged, and/or overlapped, and they should not limit the scope of the claims.
  • the method includes obtaining a first eye image (e.g., left eye) at a first time, and obtaining a mouth image at a second time, respectively.
  • this method 500 uses different partial facial image types to generate the autoencoded patches. As discussed previously, different numbers and combinations of partial facial images can be obtained for further processing.
  • the method includes processing the first eye image to provide a first autoencoded patch, and processing the mouth image to provide a second autoencoded patch. And, in step 512, the method includes generating a mask containing a first location for the first autoencoded patch and a second location for the second autoencoded patch.
  • a plurality of partial facial images can be processed to provide a plurality of autoencoded patches, which can be used to generate the mask including locations for each such patch.
  • the method includes performing autoencoding on the first composite image.
  • additional composite images can be determined at time intervals to satisfy one or more predetermined frame rates.
  • the additional composite images can also include any number of the autoencoded patches provided from processing the partial facial images.
  • the method can include performing autoencoding on these additional composite images as well.
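The training steps named in the bullets above (reconstruction losses for the local patches, and a discriminator for the global stage) can be pictured with the following hedged sketch. The lpips package, the loss weighting, and the binary cross-entropy adversarial formulation are assumptions for illustration; the disclosure itself only names MSE, LPIPS, other reconstruction loss processes, and a discriminator (or classifier).

```python
import torch
import torch.nn as nn
import lpips   # pip install lpips -- assumed available for the LPIPS term

mse = nn.MSELoss()
perceptual = lpips.LPIPS(net="alex")   # learned perceptual image patch similarity

def local_patch_loss(pred_patch, train_patch, w_lpips: float = 0.1):
    """Reconstruction loss for one predicted local patch vs. its training patch.

    pred_patch, train_patch: (N, 1, H, W) grayscale tensors in [0, 1].
    The weight and the LPIPS backbone choice are illustrative assumptions.
    """
    # LPIPS expects 3-channel inputs in [-1, 1]; repeat the grayscale channel.
    p3 = pred_patch.repeat(1, 3, 1, 1) * 2 - 1
    t3 = train_patch.repeat(1, 3, 1, 1) * 2 - 1
    return mse(pred_patch, train_patch) + w_lpips * perceptual(p3, t3).mean()

def global_discriminator_loss(discriminator, pred_composite, train_composite):
    """Adversarial-style objective for the global fusion stage (a sketch only)."""
    bce = nn.BCEWithLogitsLoss()
    real = discriminator(train_composite)   # grayscale training composite 372
    fake = discriminator(pred_composite)    # predicted composite 370
    d_loss = bce(real, torch.ones_like(real)) + bce(fake, torch.zeros_like(fake))
    g_loss = bce(fake, torch.ones_like(fake))
    return d_loss, g_loss
```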

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Image Processing (AREA)

Abstract

A method and device for generating composite facial images. The method includes obtaining a reference facial image of a person and obtaining first and second facial organ images at a first and a second time, respectively. The method also includes processing the first and second facial organ images to provide first and second autoencoded patches, respectively. A mask is generated containing a first location for the first patch and a second location for the second patch. The first and second autoencoded patches are overlaid at the first and second locations of the reference facial image, respectively. The method also includes providing a composite image of the person at a third time including the first and second autoencoded patches overlaying the reference facial image. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.

Description

METHODS AND SYSTEMS FOR CONSTRUCTING FACIAL IMAGES USING PARTIAL IMAGES
BACKGROUND OF THE INVENTION
[0001] Eye contact is important during human social communication. It increases the presence level of a communicator, makes other people feel more comfortable, and promotes the intimacy between the communicators as well. Eye contact is not only needed in the real world, but also in the next generation of communication medium in virtual reality (VR). However, VR headsets physically block any visual connection between the user and others, which causes a visual disconnect. There are various solutions to address this visual disconnect caused by VR devices, but they have been inadequate, as described below.
[0002] Therefore, new and improved systems and methods for operating extended reality devices are desired.
BRIEF SUMMARY OF THE INVENTION
[0003] The present invention is directed to extended reality systems and methods.
According to an embodiment, the present invention provides a method for obtaining an image of a first facial organ and a reference image of a user. The image of the first facial organ is autoencoded into a patch. The patch is inserted into a corresponding location of the reference image to form a composite facial image. There are other embodiments as well.
[0004] Embodiments of the present invention can be implemented in conjunction with existing systems and processes. For example, the present device configuration and its related methods according to the present invention can be used in a wide variety of systems, including virtual reality (VR) systems, mobile devices, and the like. Additionally, various techniques according to the present invention can be adopted into existing systems via integrated circuit fabrication, operating software, and wireless communication protocols. There are other benefits as well.
[0005] A system of one or more computers can be configured to perform particular operations or actions by virtue of having software, firmware, hardware, or a combination of them installed on the system that in operation causes or cause the system to perform the actions. One or more computer programs can be configured to perform particular operations or actions by virtue of including instructions that, when executed by data processing apparatus, cause the apparatus to perform the actions. One general aspect includes a method for generating composite facial images. The method also includes obtaining a reference facial image of a person. The method also includes obtaining a first eye image at a first time. The method also includes obtaining a second eye image at a second time. The method also includes processing the first eye image to provide a first autoencoded patch. The method also includes processing the second eye image to provide a second autoencoded patch. The method also includes generating a mask containing a first location for the first autoencoded patch and a second location for the second autoencoded patch. The method also includes overlaying the first autoencoded patch at the first location of the reference facial image. The method also includes overlaying the second autoencoded patch at the second location of the reference facial image. The method also includes providing a first composite image of the person at a third time, the first composite image may include the first autoencoded patch and the second autoencoded patch overlaying the reference facial image. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.
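As an illustration of the mask-and-overlay step described above, the sketch below assumes the autoencoded patches have already been resized to the reference image's resolution and channel layout, and that the mask reduces to known top-left pixel locations for each patch; the function and variable names are illustrative and not taken from the disclosure.

```python
import numpy as np

def overlay_patches(reference, patches, locations):
    """Overlay autoencoded patches onto a reference facial image.

    reference : (H, W) or (H, W, C) array holding the stored reference image.
    patches   : dict of autoencoded patches with the same channel layout as the
                reference, e.g. {"left_eye": ..., "right_eye": ...}.
    locations : dict of top-left (row, col) pixel coordinates for each patch.
    Returns the composite image and a binary mask marking the replaced pixels.
    """
    composite = reference.copy()
    mask = np.zeros(reference.shape[:2], dtype=bool)
    for name, patch in patches.items():
        r, c = locations[name]
        h, w = patch.shape[:2]
        composite[r:r + h, c:c + w] = patch   # overlay the patch at its mask location
        mask[r:r + h, c:c + w] = True
    return composite, mask
```

Calling overlay_patches with a left-eye and a right-eye patch and their two mask locations yields the first composite image; repeating the call with freshly captured patches produces later composites.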
[0006] Implementations may include one or more of the following features. The method where the first eye image is captured by a first infrared (IR) camera and the second eye image is captured by a second IR camera. The method may include generating a second composite image of the person at a fourth time, a time interval between the third time and the fourth time satisfying a predetermined frame rate. The method may include obtaining a mouth image, processing the mouth image to provide a third autoencoded patch, and overlaying the third autoencoded patch at a third location of the reference facial image using the mask. The method may include performing gray prediction on the first autoencoded patch and the second autoencoded patch. The method may include displaying the first composite image. The method may include transmitting the first composite image. The method may include performing autoencoding on the first composite image. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.
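The frame-rate feature above can be pictured as a simple pacing loop. This is a hedged sketch only: the 30 fps value and the callables make_composite, send, and stop are placeholders rather than interfaces from the disclosure.

```python
import time

TARGET_FPS = 30                      # predetermined frame rate (assumed value)
FRAME_INTERVAL = 1.0 / TARGET_FPS    # maximum spacing between consecutive composites

def stream_composites(make_composite, send, stop):
    """Generate composite images so consecutive frames satisfy the target frame rate."""
    next_deadline = time.monotonic()
    while not stop():
        frame = make_composite()     # e.g. capture eyes/mouth, autoencode, overlay
        send(frame)                  # display and/or transmit the composite
        next_deadline += FRAME_INTERVAL
        time.sleep(max(0.0, next_deadline - time.monotonic()))
```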
[0007] One general aspect includes an extended reality device. The extended reality device also includes a housing that may include a frontside and a backside. The device also includes a first eye camera configured on the backside and configured to capture a first eye image of a person wearing the extended reality device. The device also includes a display configured on the backside. The device also includes a storage configured to store a reference image of the person. The device also includes a communication interface configured to transmit a first composite image. The device also includes a processor configured to: process the first eye image to provide a first autoencoded patch; generate a mask containing a first location for the first autoencoded patch; and generate a first composite image at a first time, the first composite image may include the first autoencoded patch overlaying the reference image at the first location. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.
[0008] Implementations may include one or more of the following features. The extended reality device where the first eye camera may include a grayscale infrared camera. The extended reality device may include a second eye camera and a mouth camera configured on the backside. The extended reality device may include a pair of front cameras configured on the frontside. The processor is further configured to generate a video that may include the first composite image. The processor may include a neural processing unit for performing autoencoding. The extended reality device may include a microphone for capturing audio inputs from the person. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.
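For reference, the recited device components can be summarized in a small configuration structure. This is purely illustrative; every field name and default below is an assumption, not part of the claimed device.

```python
from dataclasses import dataclass

@dataclass
class XRDeviceConfig:
    """Illustrative component summary; names and defaults are ours, not the patent's."""
    backside_cameras: tuple = ("left_eye_ir", "right_eye_ir", "mouth_ir")  # inward-facing grayscale IR cameras
    frontside_cameras: tuple = ("front_left", "front_right")              # optional front camera pair
    has_backside_display: bool = True        # display facing the wearer
    has_microphone: bool = True              # audio for the generated video
    has_npu: bool = True                     # neural processing unit used for autoencoding
    reference_image_path: str = "reference_face.png"  # hypothetical path to the stored reference image
    transport: str = "wifi"                  # communication interface: "wifi", "bluetooth", or "mobile"
```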
[0009] One general aspect includes a method for generating composite facial images. The method also includes obtaining a reference facial image of a person. The method also includes obtaining a first eye image at a first time. The method also includes obtaining a mouth image at a second time. The method also includes processing the first eye image to provide a first autoencoded patch. The method also includes processing the mouth image to provide a second autoencoded patch. The method also includes generating a mask containing a first location for the first autoencoded patch and a second location for the second autoencoded patch. The method also includes providing a first composite image of the person at a third time, the first composite image may include the first autoencoded patch and the second autoencoded patch overlaying the reference facial image respectively at the first location and the second location. The method also includes performing autoencoding on the first composite image. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.
[0010] Implementations may include one or more of the following features. The method may include obtaining a second eye image, and processing the mouth image to provide a third autoencoded patch. The method may include generating a grayscale image using at least the first autoencoded patch and the second autoencoded patch, applying a discriminator using the first composite image and the grayscale image. The method may include performing gray prediction on the first autoencoded patch. The method may include generating a video based at least on the first composite image. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.
[0011] It is to be appreciated that embodiments of the present invention provide many advantages over conventional techniques. Among other things, the present invention provides configurations and methods for extended reality devices that allow a user to enable visual connection to others by creating a display of the user’s real human face that is otherwise blocked or partially blocked by the extended reality device. In an example, this visual connection technique can enable VR meetings in which participants can see each other’s facial expressions, which maintains visual contact for a comfortable communication experience. Additionally, the present invention implements machine learning techniques to store user preference data related to the user’s facial display. There are many other benefits as well.
[0012] The present invention achieves these benefits and others in the context of known technology. However, a further understanding of the nature and advantages of the present invention may be realized by reference to the latter portions of the specification and attached drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
[0013] Figure 1 is a simplified diagram illustrating an extended reality device with cameras capturing eyes and mouth of a wearer according to embodiments of the present invention.
[0014] Figure 2 is a simplified block diagram illustrating internal components of an extended reality device according to embodiments of the present invention.
[0015] Figure 3 is a simplified diagram illustrating data flow of forming a composite image according to an embodiment of the present invention.
[0016] Figure 4 is a simplified flow diagram illustrating a method for generating a composite image according to embodiments of the present invention.
[0017] Figure 5 is a simplified flow diagram illustrating an alternative method for generating a composite image according to embodiments of the present invention.
DETAILED DESCRIPTION OF THE INVENTION
[0018] The present invention is directed to extended reality systems and methods. According to an embodiment, the present invention provides a method for obtaining an eye image and a reference image of a user. The eye image is autoencoded into a patch. The patch is inserted into a corresponding location of the reference image to form a composite image. In some embodiments, the method can be applied to obtain other facial features or facial organs from the user, such as a nose, a lip, a mouth, an eyebrow, cheeks, and ears.
[0019] As described above, extended reality devices tend to physically block any visual connection between the user and others. A target of the present invention is to reveal real facial images of a user by using partial observations from the extended reality device worn by the user, which can be applied to a variety of virtual reality (VR) or augmented reality (AR) interaction applications (e.g., meetings, social platforms, gaming, etc.). Examples of the present extended reality device can include headsets, goggles, mobile devices, and the like.
[0020] Although there have been attempts to solve this visual connection issue with users wearing VR headsets, the results have been inadequate. For example, conventional techniques to determine a 2D photo-realistic facial composition of a user in a head-pose-free manner require complicated device architectures and methods, such as 3D head reconstruction, face alignment and tracking, face synthesis, and eye synthesis. Such techniques also tend to rely heavily on a third-person camera to capture complete facial images. In such cases, the facial reconstruction can fail if any part of the user’s face is blocked in the captured facial image. Conventional techniques also require the implementation of multiple modules, which can result in computation speeds that are too slow to satisfy real-time requirements in frequency-sensitive scenarios (e.g., VR video meetings). Users may have the burden of tuning the parameters of each module according to different environment setups as well.
[0021] It is to be appreciated that embodiments of the present invention provide many advantages over conventional techniques. Among other things, the present invention provides configurations and methods for extended reality devices that allow a user to enable visual connection to others by creating a display of the user’s real human face that is otherwise blocked or partially blocked by the extended reality device. In an example, this visual connection technique can enable VR meetings in which participants can see each other’s facial expressions, which maintains visual contact for a comfortable communication experience. Additionally, the present invention implements machine learning techniques to store user preference data related to the user’s facial display.
[0022] The following description is presented to enable one of ordinary skill in the art to make and use the invention and to incorporate it in the context of particular applications. Various modifications, as well as a variety of uses in different applications will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to a wide range of embodiments. Thus, the present invention is not intended to be limited to the embodiments presented, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
[0023] In the following detailed description, numerous specific details are set forth in order to provide a more thorough understanding of the present invention. However, it will be apparent to one skilled in the art that the present invention may be practiced without necessarily being limited to these specific details. In other instances, well-known structures and devices are shown in block diagram form, rather than in detail, in order to avoid obscuring the present invention.
[0024] The reader’s attention is directed to all papers and documents which are filed concurrently with this specification and which are open to public inspection with this specification, and the contents of all such papers and documents are incorporated herein by reference. All the features disclosed in this specification, (including any accompanying claims, abstract, and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise. Thus, unless expressly stated otherwise, each feature disclosed is one example only of a generic series of equivalent or similar features.
[0025] Furthermore, any element in a claim that does not explicitly state “means for” performing a specified function, or “step for” performing a specific function, is not to be interpreted as a “means” or “step” clause as specified in 35 U.S.C. Section 112, Paragraph 6. In particular, the use of “step of” or “act of” in the Claims herein is not intended to invoke the provisions of 35 U.S.C. 112, Paragraph 6.
[0026] Please note, if used, the labels left, right, front, back, top, bottom, forward, reverse, clockwise and counterclockwise have been used for convenience purposes only and are not intended to imply any particular fixed direction. Instead, they are used to reflect relative locations and/or directions between various portions of an object.
[0027] According to an example, the present invention provides a device and method of 2D photo-realistic human face restoration using partial observations for users of extended reality devices whose faces are at least partially blocked while wearing such extended reality devices. The partial observations (e.g., left eye, right eye, mouth, etc.) can be obtained by various image-capturing devices configured within the extended reality device. In addition, reference images of the user captured before the user uses the extended reality device can be used to assist the facial restoration process. In this way, the user can maintain a visual connection with another participant or target audience without having to take the extended reality device off.
[0028] Figure 1 is a simplified diagram illustrating an extended reality device with cameras capturing the eyes and the mouth of a wearer according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications.
[0029] As shown, the extended reality device 100 is configured within a housing 110 having at least a frontside and a backside. The device 100 includes a display 120 (shown within dotted line cutaway section 101) and at least a first eye camera 131 spatially configured on the backside (shown within dotted line cutaway section 102). The first eye camera 131 (e.g., inward-facing) is configured to capture a first eye image of a person/user wearing the extended reality device 100. In an example, the device 100 may also include a second eye camera 132 (shown within dotted line cutaway section 103) configured on the housing backside to capture a second eye image of the person, or device 100 may include a mouth camera 134 (shown within dotted line cutaway section 104 and facing downward toward the mouth) configured on the housing backside to capture a mouth image of the person.
[0030] According to an example, the device 100 may include one or more image-capturing devices (e.g., camera, image sensor, video recorder, etc.) configured on the housing backside to capture one or more facial feature images or partial facial feature images (e.g., eye, mouth, nose, cheek, chin, brow, hairline, etc.). Depending on the application, one or more image-capturing devices 140 may be configured on the housing frontside as well (e.g., a pair of front cameras configured to show the user the outside environment on the display). Further, any of the image-capturing devices can include an infrared (IR) image-capturing device, such as an IR camera or a grayscale IR camera.
[0031] Depending upon the configuration, the image-capturing devices configured on the housing backside may be configured at angles that do not capture the user’s facial features from the desired perspective (e.g., front-facing perspective for VR meeting application). For example, first eye camera 131 and second eye camera 132 may be configured near the sides of the device, which causes the resulting eye images to be from a side angle. In such cases, the eye images (or any facial feature image) will also require reconstruction to determine what the user’s facial feature should look like from the desired perspective. To address this issue, the present invention also provides for machine learning and/or deep learning techniques to generate a composite facial image of a user at desired perspectives.
[0032] The device 100 also includes at least a storage, a communication interface, and a processor (see Figure 2). The storage is configured to store at least a reference image of the person wearing the device 100. The processor is configured to process at least the first eye image to provide a first autoencoded patch, to generate a mask containing a first location for the first autoencoded patch, and to generate a first composite image at a first time. This first composite image includes at least the first autoencoded patch overlaying the reference image at the first location.
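The autoencoding mentioned in [0032] (and shown as the AE blocks of Figure 3, each with an encoder "En", an activation, and a decoder "De") might look roughly like the following PyTorch sketch. The layer sizes, latent dimension, and the 64x64 single-channel (grayscale IR) input are assumptions for illustration, not details from the disclosure.

```python
import torch
import torch.nn as nn

class PatchAutoencoder(nn.Module):
    """Minimal encoder/decoder for one facial-organ patch (e.g. an eye crop)."""

    def __init__(self, latent_dim: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(                        # "En": encode the raw patch
            nn.Conv2d(1, 32, 4, stride=2, padding=1),        # 64x64 -> 32x32
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1),       # 32x32 -> 16x16
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, latent_dim),
        )
        self.decoder = nn.Sequential(                        # "De": decode to the predicted patch
            nn.Linear(latent_dim, 64 * 16 * 16),
            nn.ReLU(),
            nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),  # 16x16 -> 32x32
            nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),   # 32x32 -> 64x64
            nn.Sigmoid(),
        )

    def forward(self, patch: torch.Tensor) -> torch.Tensor:
        # patch: (N, 1, 64, 64) grayscale IR crop; returns the predicted local patch.
        return self.decoder(torch.relu(self.encoder(patch)))
```

At run time the processor (or its NPU) would feed each cropped eye or mouth image through such a module to obtain the autoencoded patch that is later overlaid on the reference image.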
[0033] In an example, the extended reality device 100 also includes one or more microphones 142 for capturing audio inputs from the person/user. The processor may also be configured to generate a video comprising the first composite image. The video may also include the audio inputs captured by one or more microphones. In a specific example, the processor also includes a neural processing unit (NPU) for performing autoencoding. Further, the communication interface is configured to transmit the first composite image of the person.
[0034] Figure 2 is a simplified block diagram illustrating internal components of an extended reality device according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications.
[0035] As shown, Figure 2 expands on the internal and external components that can be included in an extended reality device according to the present invention, such as device 100 of Figure 1. Configured to a housing 201, these components can include a central processing unit (CPU) 210 coupled to a graphic processing unit (GPU) 212 and a display 220. The display 220 may also be coupled to a screen touch sensor 222 and configured as a touchscreen.
[0036] Here, the CPU 210 is coupled to the screen touch sensor 222, an auxiliary hand sensor 230 (e.g., headset sensors, remotes, controllers, etc.), and an accelerometer 232, all of which can provide sensor inputs to determine user gestures and device operating modes. Also, the CPU 210 is coupled to a motor 234, which can be configured to provide haptic feedback to the user (e.g., in response to sensor inputs, user gestures, etc.). The GPU 212 can be coupled to the display 220 and one or more image-capturing devices, such as a camera 250. The image-capturing devices can include the IR devices discussed previously (e.g., IR cameras, grayscale IR cameras, etc.). Further, the GPU 212 can be configured to transmit various forms of the user interface to display 220 depending on the operating mode, and the GPU 212 can be configured to transmit various forms of the user’s facial image or video to the display 220 (i.e., showing users how they will look to others).
[0037] The CPU 210 can also be further configured to update display preference data (e.g., user facial image/video composition, user interface arrangement, etc.) in a storage 260. The display preference data can be updated based on user input received from the touchscreen, sensors configured within the extended reality device, or image-capturing devices. In a specific example, the CPU 210 is also coupled to the neural processing unit (NPU) 214 that is configured to update the display preference data based on user inputs, which can be received via the screen touch sensor 222, the sensors 230, the cameras 250, or the like. In a specific example, NPU 214 can be used to learn the user’s gestures via one or more of the sensors within the device 200 or learn the user’s facial composition via one or more of the cameras within the device 200. Such gesture data and facial composition data can also be stored in storage 260. Here, the storage 260 is coupled to the CPU 210, the GPU 212, the NPU 214, the display 220, and the screen touch sensor 222.
[0038] The device 200 can further include a communication module 240 coupled to the CPU 210 and configured to transmit display preference data, composite images, or videos to a server. This communication module 240 can be configured for mobile data, Wi-Fi, Bluetooth, etc. In an example, the communication module 240 is configured to transmit user facial image data or video data to a target recipient (i.e., showing others the user’s facial image or video). The communication module 240 can also be configured to transmit user profile information generated using one or more inputs from one or more sensor arrays configured within the extended reality device. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium. Of course, there can be other variations, modifications, and alternatives to these device elements and their configurations.
[0039] Figure 3 is a simplified diagram illustrating data flow of forming a composite image according to an embodiment of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. For example, one or more steps may be added, removed, repeated, replaced, modified, rearranged, and/or overlapped, and they should not limit the scope of the claims.
[0040] According to an example, the present invention provides a method to form composite facial images of an extended reality device user whose face is blocked or partially blocked by wearing the extended reality device. Determining the composite facial image can be divided into at least two parts (i.e., a disentangled manner/implementation or a coarse-to-fine process), including but not limited to: (1) determining local geometry transformation using local image patches, and (2) determining global fusion based on local predictions. By disentangling the composite facial image process and leveraging one or more partial observations of the user’s face, the current model can reconstruct the user’s face quickly and effectively.
[0041] As shown, flow 300 implements a disentangled network architecture to determine a user’s composite facial image using a plurality of processing stages. Here, the first stage includes obtaining raw image patches 310, which can be obtained by image-capturing devices configured within the extended reality device, as discussed previously. The raw image patches 310 can include at least a left eye image patch 312, a right eye image patch 314, and a mouth image patch 316. However, depending on the position of the image-capturing devices, the captured raw image patches may not be in the ideal perspective or orientation for the composite facial image (e.g., side view or offset angle).
[0042] Thus, local models for each such image patch can be trained to predict the appropriate image patch for the target facial feature (e.g., left eye, right eye, mouth, etc.). Using the local model 320, each of these image patches can be processed separately by an autoencoder (AE) device 322, denoted here as “AE1” to “AE3”. In this case, the prediction process results in predicted local patches for the left eye image 332, the right eye image 334, and the mouth image 336.
[0043] In an example, each AE device 322 includes an encoder and a decoder configured to perform an autoencoding process to generate predictions 330 based on the local patches. The AE diagram 201 shows the autoencoding process in which the encoder (En) automatically encodes data based on input values, the AE performs an activation function on the encoded data, and the decoder (De) decodes the data to produce the resulting output.
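For illustration only, the encoder–activation–decoder structure described above can be sketched as a small convolutional autoencoder in PyTorch. The layer sizes, the 64x64 patch resolution, and the ReLU/Sigmoid activation choices are assumptions made for this sketch, not the configuration of the AE1–AE3 devices:

    import torch
    import torch.nn as nn

    class PatchAutoencoder(nn.Module):
        """Illustrative local autoencoder for one grayscale facial patch."""
        def __init__(self, latent_dim: int = 128):
            super().__init__()
            # Encoder (En): downsample a 1x64x64 patch into a latent vector.
            self.encoder = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=4, stride=2, padding=1),   # -> 32x32
                nn.ReLU(),                                               # activation
                nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1),  # -> 16x16
                nn.ReLU(),
                nn.Flatten(),
                nn.Linear(64 * 16 * 16, latent_dim),
            )
            # Decoder (De): reconstruct the patch from the latent vector.
            self.decoder = nn.Sequential(
                nn.Linear(latent_dim, 64 * 16 * 16),
                nn.ReLU(),
                nn.Unflatten(1, (64, 16, 16)),
                nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),  # -> 32x32
                nn.ReLU(),
                nn.ConvTranspose2d(32, 1, kernel_size=4, stride=2, padding=1),   # -> 64x64
                nn.Sigmoid(),
            )

        def forward(self, patch: torch.Tensor) -> torch.Tensor:
            # patch: (N, 1, 64, 64); output has the same shape.
            return self.decoder(self.encoder(patch))

A local model of this form would be trained per facial feature (left eye, right eye, mouth), so that each AE device specializes in predicting one patch type.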
[0044] In an example, the image-capturing devices of the extended reality device can include small mounted cameras. By using small mounted cameras to only capture local facial areas, the autoencoding process can exhibit better performance for real-time applications (e.g., VR meetings). In a specific example, raw image patches 310 can include grayscale images obtained using IR cameras for grayscale predictions. Using grayscale image patches from the IR cameras can further optimize the autoencoding process to determine a user’s facial composite image.
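As a minimal sketch of preparing such a grayscale IR crop as autoencoder input (the 8-bit pixel range and the (1, 1, H, W) tensor layout are assumptions, not requirements of the described embodiment):

    import numpy as np
    import torch

    def to_gray_tensor(ir_crop: np.ndarray) -> torch.Tensor:
        """Convert an HxW 8-bit grayscale IR crop to a (1, 1, H, W) float tensor in [0, 1]."""
        x = torch.from_numpy(ir_crop.astype(np.float32) / 255.0)
        return x.unsqueeze(0).unsqueeze(0)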
[0045] The second stage includes using the predicted local patches 330 and reference images 340 to synthesize a new dataset (e.g., composite image), where new overlaid images with corrected local patches are generated. This process can include generating a mask 342 with designated locations for overlaying the predicted image patches from the previous stage. Here, the predicted image patches of the left eye 332, right eye 334, and mouth 336 are overlayed on the designated left eye, right eye, and mouth locations of the mask 342, respectively. Overlaying the mask 342 on the reference facial image 340 results in the composite facial image 344.
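A simplified sketch of this overlay step is shown below; the patch names, pixel coordinates, and boolean mask representation are illustrative assumptions rather than the mask 342 of the described embodiment:

    from typing import Dict, Tuple
    import numpy as np

    def composite_face(reference: np.ndarray,
                       patches: Dict[str, np.ndarray],
                       locations: Dict[str, Tuple[int, int]]):
        """Overlay predicted patches onto a copy of the reference image.

        `locations` maps a patch name (e.g., "left_eye") to the (row, col) of the
        patch's top-left corner. Returns the composite image and the overlay mask.
        """
        out = reference.copy()
        mask = np.zeros(reference.shape[:2], dtype=bool)
        for name, patch in patches.items():
            r, c = locations[name]
            h, w = patch.shape[:2]
            out[r:r + h, c:c + w] = patch     # paste the predicted patch
            mask[r:r + h, c:c + w] = True     # record the designated location
        return out, mask

    # Hypothetical usage with a grayscale reference image and three predicted patches:
    # face, mask = composite_face(reference,
    #                             {"left_eye": le, "right_eye": re, "mouth": mo},
    #                             {"left_eye": (120, 80), "right_eye": (120, 200), "mouth": (300, 140)})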
[0046] In an example, the composite image process can include training local models to improve the local prediction image patches. This training process is shown by the training image patches 350, including a training left eye image patch 352, a training right eye image patch 354, and a training mouth image patch 356. The training process can include applying a mean squared error (MSE) process, a learned perceptual image patch similarity (LPIPS) process, other reconstruction loss processes, and the like and combinations thereof. These training image patches can be used to improve the prediction image patches 330 that are overlayed using mask 342 and the reference image 340 to create modified composite facial images 344.
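For illustration, a combined reconstruction loss of this kind might look as follows, assuming the publicly available lpips package and an arbitrary 0.1 weighting between the two terms (neither is specified by the described embodiment):

    import torch
    import torch.nn.functional as F
    import lpips  # pip install lpips; assumed available for this sketch

    perceptual = lpips.LPIPS(net="vgg")  # LPIPS expects 3-channel inputs in [-1, 1]

    def local_reconstruction_loss(pred: torch.Tensor,
                                  target: torch.Tensor,
                                  lpips_weight: float = 0.1) -> torch.Tensor:
        """MSE plus LPIPS between a predicted patch and its training patch.

        pred/target: (N, 1, H, W) grayscale patches with values in [0, 1].
        """
        mse = F.mse_loss(pred, target)
        # Repeat the single channel and rescale to [-1, 1] for the LPIPS network.
        p3 = pred.repeat(1, 3, 1, 1) * 2.0 - 1.0
        t3 = target.repeat(1, 3, 1, 1) * 2.0 - 1.0
        return mse + lpips_weight * perceptual(p3, t3).mean()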
[0047] The third stage includes training the global model 360 on the newly generated dataset (i.e., composite facial image 344) to enable global fusion capability. Here, AE device 362, also denoted as “AE4”, can perform an autoencoding process on the composite image 344 to produce a global prediction composite facial image 370. Similar to the local model training process, the global model training process can include creating a training composite image 372 to improve the prediction composite image 370. In an example, the training composite image can be grayscale. Both the prediction and training composite facial images 370, 372 can be provided to discriminator 374 (or a classifier) to determine the difference between these images to train the global model.
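A sketch of such a discriminator and one training update is shown below; the layer sizes, single-channel input, binary cross-entropy objective, and optimizer usage are assumptions for this sketch, not the configuration of AE4 or discriminator 374:

    import torch
    import torch.nn as nn

    class CompositeDiscriminator(nn.Module):
        """Illustrative classifier that scores composite face images as real or predicted."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(64, 1, 4, stride=2, padding=1),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )

        def forward(self, image: torch.Tensor) -> torch.Tensor:
            return self.net(image)  # one logit per image

    def discriminator_step(disc, optimizer, prediction, training_composite):
        """One update: training composites are treated as 'real', predictions as 'fake'."""
        bce = nn.BCEWithLogitsLoss()
        real_logits = disc(training_composite)
        fake_logits = disc(prediction.detach())
        loss = bce(real_logits, torch.ones_like(real_logits)) + \
               bce(fake_logits, torch.zeros_like(fake_logits))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

The discriminator's scores on the prediction composite image 370 versus the training composite image 372 would then drive the global model toward producing composites that fuse the local patches consistently with the reference face.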
[0048] In an example, a CPU of the extended reality device is configured to form the composite image using the steps discussed previously. An NPU coupled to the CPU can be configured to perform the model training processes and other machine/deep learning related processes. The CPU and NPU can be configured similarly to device 200 shown in Figure 2. Further details of example methods for generating composite images are discussed with reference to Figures 4 and 5.
[0049] Figure 4 is a simplified flow diagram illustrating a method for generating a composite image according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. For example, one or more steps may be added, removed, repeated, replaced, modified, rearranged, and/or overlapped, and they should not limit the scope of the claims.
[0050] According to an example, the method of generating the composite image can be performed by an extended reality device, such as device 100 shown in Figure 1. As shown, method 400 includes step 402 of obtaining a reference facial image of a person. This reference image can be obtained by using an integrated image-capturing device, by user upload, network transfer, or the like.
[0051] In steps 404 and 406, the method includes obtaining a first facial organ image (e.g., left eye) at a first time, and obtaining a second facial organ image (e.g., right eye) at a second time, respectively. These eye images can be captured using one or more image-capturing devices, such as cameras, image sensors, video recorders, or the like. In a specific example, an infrared (IR) camera can be configured to capture the facial organ images, or a first IR camera can be configured to capture the first facial organ image while a second IR camera can be configured to capture the second facial organ image.
[0052] In an example, the method can include obtaining a plurality of facial organ images at separate times. The plurality of facial organ images can be captured using one or more image-capturing devices (e.g., IR cameras), as discussed previously. A separate image-capturing device can be configured to capture each of the facial organ images, or any one of the image-capturing devices can be configured to capture multiple facial organ images at separate times.
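For illustration only, capturing one frame per facial-feature camera and tagging it with its capture time might be sketched as follows, assuming OpenCV as the capture backend and hypothetical device indices:

    import time
    import cv2  # OpenCV; assumed capture backend for this sketch

    def capture_organ_images(camera_indices):
        """Capture one grayscale frame per named camera, tagged with its capture time."""
        frames = {}
        for name, idx in camera_indices.items():
            cap = cv2.VideoCapture(idx)   # reopening per frame is simple but not efficient
            ok, frame = cap.read()
            cap.release()
            if ok:
                frames[name] = (time.monotonic(), cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
        return frames

    # Hypothetical usage: capture_organ_images({"left_eye": 0, "right_eye": 1, "mouth": 2})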
[0053] In steps 408 and 410, the method includes processing the first facial organ image to provide a first autoencoded patch, and processing the second facial organ image to provide a second autoencoded patch, respectively. In step 412, the method includes generating a mask containing a first location for the first autoencoded patch and a second location for the second autoencoded patch.
[0054] In an example, the method includes obtaining a third facial organ image (e.g., mouth), processing the third facial organ image to provide a third autoencoded patch, and overlaying the third autoencoded patch at a third location of the reference facial image using the mask. Further, additional facial organ images or partial facial images (e.g., nose, lip, ears, chin, cheeks, eyebrows, hairline, etc.) can be obtained using one or more image-capturing devices.
[0055] In steps 414 and 416, the method includes overlaying the first autoencoded patch at the first location of the reference facial image, and overlaying the second autoencoded patch at the second location of the reference facial image. Additionally, the third facial organ image and/or other partial facial images can be processed to provide additional autoencoded patches, which can be overlayed at additional locations of the reference facial image using the mask or used to generate the mask and then subsequently overlayed on the additional locations. Further, the method can include performing gray prediction on one or more of the autoencoded patches.
[0056] In step 418, the method includes providing a first composite image of the person at a third time. This first composite image includes at least the first autoencoded patch and the second autoencoded patch overlaying the reference facial image using the mask. The first composite image can also include one or more of the additional autoencoded patches provided by processing one or more of the other partial facial images. In some embodiments, steps 406, 410 and their associated operations in steps 412, 416 and 418 are optional.
[0057] The method can also include generating one or more additional composite images of the person at different times. For example, a second composite image of the person can be generated at a fourth time such that a time interval between the third and fourth time satisfies a predetermined frame rate. Depending on the application, the times of the additional composite images can be configured such that the time intervals between such times satisfy one or more predetermined frame rates. The additional composite images can also include any number of the autoencoded patches provided from processing the partial facial images.
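A minimal sketch of pacing composite frames so that consecutive frames satisfy a predetermined frame rate is shown below; generate_composite and send_frame are hypothetical placeholders for the compositing pipeline and the communication interface:

    import time

    def stream_composites(generate_composite, send_frame, frame_rate: float = 30.0):
        """Emit composite frames at intervals that satisfy a target frame rate."""
        interval = 1.0 / frame_rate
        next_time = time.monotonic()
        while True:
            frame = generate_composite()
            send_frame(frame)
            next_time += interval
            # Sleep only for the remainder of the interval, if any.
            time.sleep(max(0.0, next_time - time.monotonic()))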
[0058] Figure 5 is a simplified flow diagram illustrating an alternative method for generating a composite image according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. For example, one or more steps may be added, removed, repeated, replaced, modified, rearranged, and/or overlapped, and they should not limit the scope of the claims.
[0059] Similar to the previous example, the method of generating the composite image can be performed by an extended reality device (e.g., device 100 of Figure 1). Also, details of similar method steps discussed for method 400 also apply to the method steps of method 500. As shown, method 500 includes step 502 of obtaining a reference facial image of a person.
[0060] In steps 504 and 506, the method includes obtaining a first eye image (e.g., left eye) at a first time, and obtaining a mouth image at a second time, respectively. Compared to method 400, this method 500 uses different partial facial image types to generate the autoencoded patches. As discussed previously, different numbers and combinations of partial facial images can be obtained for further processing.
[0061] In steps 508 and 510, the method includes processing the first eye image to provide a first autoencoded patch, and processing the mouth image to provide a second autoencoded patch. In step 512, the method includes generating a mask containing a first location for the first autoencoded patch and a second location for the second autoencoded patch. In other cases, a plurality of partial facial images can be processed to provide a plurality of autoencoded patches, which can be used to generate the mask including locations for each such patch.
[0062] In step 514, the method includes performing autoencoding on the first composite image. As discussed previously, additional composite images can be determined at time intervals to satisfy one or more predetermined frame rates. The additional composite images can also include any number of the autoencoded patches provided from processing the partial facial images. The method can include performing autoencoding on these additional composite images as well.
[0063] While the above is a full description of the specific embodiments, various modifications, alternative constructions and equivalents may be used. Therefore, the above description and illustrations should not be taken as limiting the scope of the present invention which is defined by the appended claims.

Claims

WHAT IS CLAIMED IS:
1. A method for generating composite facial images, the method comprising:
    obtaining a reference facial image of a person;
    obtaining an image of a first facial organ at a first time;
    processing the image of the first facial organ to provide a first autoencoded patch;
    generating a mask containing a first location for the first autoencoded patch;
    overlaying the first autoencoded patch at the first location of the reference facial image; and
    providing a first composite image of the person at a third time, the first composite image comprising the first autoencoded patch overlaying the reference facial image using the mask.
2. The method of claim 1 further comprising:
    obtaining an image of a second facial organ at a second time;
    processing the image of the second facial organ to provide a second autoencoded patch, wherein the mask contains a second location for the second autoencoded patch; and
    overlaying the second autoencoded patch at the second location of the reference facial image;
    wherein the first composite image comprises the second autoencoded patch overlying the reference facial image using the mask.
3. The method of claim 2 wherein the image of the first facial organ is captured by a first infrared (IR) camera and the image of the second facial organ is captured by a second IR camera.
4. The method of claim 2 further comprising generating a second composite image of the person at a fourth time, a time interval between the third time and the fourth time satisfying a predetermined frame rate.
5. The method of claim 2 further comprising: obtaining an image of a third facial organ; processing the image of the third facial organ to provide a third autoencoded patch; and overlaying the third autoencoded patch at a third location of the reference facial image using the mask.
6. The method of claim 2 further comprising performing gray prediction on the first autoencoded patch and the second autoencoded patch.
7. The method of claim 1 further comprising displaying the first composite image.
8. The method of claim 1 further comprising performing autoencoding on the first composite image.
9. An extended reality device comprising:
    a housing comprising a frontside and a backside;
    a first eye camera configured on the backside and configured to capture a first eye image of a person wearing the extended reality device;
    a display configured on the backside;
    a storage configured to store a reference image of the person;
    a communication interface configured to transmit a first composite image; and
    a processor configured to:
        process the first eye image to provide a first autoencoded patch;
        generate a mask containing a first location for the first autoencoded patch; and
        generate a first composite image at a first time, the first composite image comprising the first autoencoded patch overlaying the reference image at the first location.
10. The extended reality device of claim 9 wherein the first eye camera comprises a grayscale infrared camera.
11. The extended reality device of claim 9 further comprising a second eye camera and a mouth camera configured on the backside.
12. The extended reality device of claim 9 further comprising a pair of front cameras configured on the frontside.
13. The extended reality device of claim 9 wherein the processor is further configured to generate a video comprising the first composite image.
14. The extended reality device of claim 9 wherein the processor comprises a neural processing unit for performing autoencoding.
15. The extended reality device of claim 9 wherein the processor is further configured to perform local model training to generate one or more training image patches to improve at least the first autoencoded patch.
16. A method for generating composite facial images, the method comprising:
    obtaining a reference facial image of a person;
    obtaining an image of a first facial organ at a first time;
    obtaining an image of a second facial organ at a second time;
    processing the image of the first facial organ to provide a first autoencoded patch;
    processing the image of the second facial organ to provide a second autoencoded patch;
    generating a mask containing a first location for the first autoencoded patch and a second location for the second autoencoded patch;
    providing a first composite facial image at a third time, the first composite image comprising the first autoencoded patch and the second autoencoded patch overlaying the reference facial image respectively at the first location and the second location; and
    performing autoencoding on the first composite facial image.
17. The method of claim 16 further comprising: obtaining an image of a third facial organ; and processing the image of the third facial organ to provide a third autoencoded patch.
18. The method of claim 16 further comprising: generating a grayscale image using at least the first autoencoded patch and the second autoencoded patch; and applying a discriminator using the first composite facial image and the grayscale image.
19. The method of claim 16 further comprising performing gray prediction on the first autoencoded patch.
20. The method of claim 16 further comprising generating a video based at least on the first composite facial image.