
WO2024163006A1 - Methods and systems for constructing facial images using partial images - Google Patents

Methods and systems for constructing facial images using partial images

Info

Publication number
WO2024163006A1
WO2024163006A1 (application PCT/US2023/061550)
Authority
WO
WIPO (PCT)
Prior art keywords
image
autoencoded
patch
facial
composite
Prior art date
Application number
PCT/US2023/061550
Other languages
French (fr)
Inventor
Zheng Chen
Zhiqi ZHANG
Yi Xu
Original Assignee
Innopeak Technology, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Innopeak Technology, Inc. filed Critical Innopeak Technology, Inc.
Priority to PCT/US2023/061550 priority Critical patent/WO2024163006A1/en
Publication of WO2024163006A1 publication Critical patent/WO2024163006A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168: Feature extraction; Face representation
    • G: PHYSICS
    • G02: OPTICS
    • G02B: OPTICAL ELEMENTS, SYSTEMS OR APPARATUS
    • G02B 27/00: Optical systems or apparatus not provided for by any of the groups G02B 1/00 - G02B 26/00, G02B 30/00
    • G02B 27/01: Head-up displays
    • G02B 27/0101: Head-up displays characterised by optical features
    • G02B 2027/0138: Head-up displays characterised by optical features comprising image capture systems, e.g. camera

Definitions

  • Eye contact is important during human social communication. It increases the presence level of a communicator, makes other people feel more comfortable, and promotes the intimacy between the communicators as well. Eye contact is not only needed in the real world, but also in the next generation of communication medium in virtual reality (VR).
  • VR headsets physically block any visual connection between the user and others, which causes a visual disconnect. There are various solutions to address this visual disconnect caused by VR devices, but they have been inadequate, as described below.
  • the present invention is directed to extended reality systems and methods.
  • the present invention provides a method for obtaining an image of a first facial organ and a reference image of a user.
  • the image of the first facial organ is autoencoded into a patch.
  • the patch is inserted into a corresponding location of the reference image to form a composite facial image.
  • Embodiments of the present invention can be implemented in conjunction with existing systems and processes.
  • the present device configuration and its related methods according to the present invention can be used in a wide variety of systems, including virtual reality (VR) systems, mobile devices, and the like.
  • various techniques according to the present invention can be adopted into existing systems via integrated circuit fabrication, operating software, and wireless communication protocols. There are other benefits as well.
  • a system of one or more computers can be configured to perform particular operations or actions by virtue of having software, firmware, hardware, or a combination of them installed on the system that in operation causes or cause the system to perform the actions.
  • One or more computer programs can be configured to perform particular operations or actions by virtue of including instructions that, when executed by data processing apparatus, cause the apparatus to perform the actions.
  • One general aspect includes a method for generating composite facial images. The method also includes obtaining a reference facial image of a person. The method also includes obtaining a first eye image at a first time. The method also includes obtaining a second eye image at a second time. The method also includes processing the first eye image to provide a first autoencoded patch.
  • the method also includes processing the second eye image to provide a second autoencoded patch.
  • the method also includes generating a mask containing a first location for the first autoencoded patch and a second location for the second autoencoded patch.
  • the method also includes overlaying the first autoencoded patch at the first location of the reference facial image.
  • the method also includes overlaying the second autoencoded patch at the second location of the reference facial image.
  • the method also includes providing a first composite image of the person at a third time, the first composite image may include the first autoencoded patch and the second autoencoded patch overlaying the reference facial image.
  • Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.
  • Implementations may include one or more of the following features.
  • the method where the first eye image is captured by a first infrared (IR) camera and the second eye image is captured by a second IR camera.
  • the method may include generating a second composite image of the person at a fourth time, a time interval between the third time and the fourth time satisfying a predetermined frame rate.
  • the method may include obtaining a mouth image, processing the mouth image to provide a third autoencoded patch, and overlaying the third autoencoded patch at a third location of the reference facial image using the mask.
  • the method may include performing gray prediction on the first autoencoded patch and the second autoencoded patch.
  • the method may include displaying the first composite image.
  • the method may include transmitting the first composite image.
  • the extended reality device also includes a housing that may include a frontside and a backside.
  • the device also includes a first eye camera configured on the backside and configured to capture a first eye image of a person wearing the extended reality device.
  • the device also includes a display configured on the backside.
  • the device also includes a storage configured to store a reference image of the person.
  • the device also includes a communication interface configured to transmit a first composite image.
  • the device also includes a processor configured to: process the first eye image to provide a first autoencoded patch; generate a mask containing a first location for the first autoencoded patch; and generate a first composite image at a first time, the first composite image may include the first autoencoded patch overlaying the reference image at the first location.
  • Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.
  • Implementations may include one or more of the following features.
  • the extended reality device where the first eye camera may include a grayscale infrared camera.
  • the extended reality device may include a second eye camera and a mouth camera configured on the backside.
  • the extended reality device may include a pair of front cameras configured on the frontside.
  • the processor is further configured to generate a video that may include the first composite image.
  • the processor may include a neural processing unit for performing autoencoding.
  • One general aspect includes a method for generating composite facial images.
  • the method also includes obtaining a reference facial image of a person.
  • the method also includes obtaining a first eye image at a first time.
  • the method also includes obtaining a mouth image at a second time.
  • the method also includes processing the first eye image to provide a first autoencoded patch.
  • the method also includes processing the mouth image to provide a second autoencoded patch.
  • the method also includes generating a mask containing a first location for the first autoencoded patch and a second location for the second autoencoded patch.
  • the method also includes providing a first composite image of the person at a third time, the first composite image may include the first autoencoded patch and the second autoencoded patch overlaying the reference facial image respectively at the first location and the second location.
  • the method also includes performing autoencoding on the first composite image.
  • Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.
  • Implementations may include one or more of the following features.
  • the method may include obtaining a second eye image, and processing the mouth image to provide a third autoencoded patch.
  • the method may include generating a grayscale image using at least the first autoencoded patch and the second autoencoded patch, applying a discriminator using the first composite image and the grayscale image.
  • the method may include performing gray prediction on the first autoencoded patch.
  • the present invention provides configurations and methods for extended reality devices that allow a user to enable visual connection to others by creating a display of the user’s real human face that is otherwise blocked or partially blocked by the extended reality device.
  • this visual connection technique can enable VR meetings in which participants can see each other’s facial expressions, which maintains visual contact for a comfortable communication experience.
  • the present invention implements machine learning techniques to store user preference data related to the user’s facial display.
  • Figure 1 is a simplified diagram illustrating an extended reality device with cameras capturing eyes and mouth of a wearer according to embodiments of the present invention.
  • Figure 2 is a simplified block diagram illustrating internal components of an extended reality device according to embodiments of the present invention.
  • Figure 3 is a simplified diagram illustrating data flow of forming a composite image according to an embodiment of the present invention.
  • Figure 4 is a simplified flow diagram illustrating a method for generating a composite image according to embodiments of the present invention.
  • FIG. 5 is a simplified flow diagram illustrating an alternative method for generating a composite image according to embodiments of the present invention.
  • the present invention is directed to extended reality systems and methods.
  • the present invention provides a method for obtaining an eye image and a reference image of a user.
  • the eye image is autoencoded into a patch.
  • the patch is inserted into a corresponding location of the reference image to form a composite image.
  • the method applies to obtain other facial features or facial organs from the user, such as a nose, a lip, a mouth, an eyebrow, cheeks, and ears.
  • extended reality devices tend to physically block any visual connection between the user and others.
  • a target of the present invention is to reveal real facial images of a user by using partial observations from the extended reality device worn by the user, which can be applied to a variety of virtual reality (VR) or augmented reality (AR) interaction applications (e.g., meetings, social platforms, gaming, etc.).
  • Examples of the present extended reality device can include headsets, goggles, mobile devices, and the like.
  • the present invention provides configurations and methods for extended reality devices that allow a user to enable visual connection to others by creating a display of the user’s real human face that is otherwise blocked or partially blocked by the extended reality device.
  • this visual connection technique can enable VR meetings in which participants can see each other’s facial expressions, which maintains visual contact for a comfortable communication experience.
  • the present invention implements machine learning techniques to store user preference data related to the user’s facial display.
  • any element in a claim that does not explicitly state “means for” performing a specified function, or “step for” performing a specific function, is not to be interpreted as a “means” or “step” clause as specified in 35 U.S.C. Section 112, Paragraph 6.
  • the use of “step of” or “act of” in the Claims herein is not intended to invoke the provisions of 35 U.S.C. 112, Paragraph 6.
  • the present invention provides a device and method of 2D photo-realistic human face restoration using partial observations for users of extended reality devices whose faces are at least partially blocked while wearing such extended reality devices.
  • the partial observations (e.g., left eye, right eye, mouth, etc.) can be obtained by various image-capturing devices configured within the extended reality device.
  • reference images of the user captured before the user uses the extended reality device can be used to assist the facial restoration process. In this way, the user can maintain a visual connection with another participant or target audience without having to take the extended reality device off.
  • Figure 1 is a simplified diagram illustrating an extended reality device with cameras capturing the eyes and the mouth of a wearer according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications.
  • the extended reality device 100 is configured within a housing 110 having at least a frontside and a backside.
  • the device 100 includes a display 120 (shown within dotted line cutaway section 101) and at least a first eye camera 131 spatially configured on the backside (shown within dotted line cutaway section 102).
  • the first eye camera 131 (e.g., inward-facing) is configured to capture a first eye image of a person/user wearing the extended reality device 100.
  • the device 100 may also include a second eye camera 132 (shown within dotted line cutaway section 103) configured on the housing backside to capture a second eye image of the person, or device 100 may include a mouth camera 134 (shown within dotted line cutaway section 104 and facing downward toward the mouth) configured on the housing backside to capture a mouth image of the person.
  • the device 100 may include one or more image-capturing devices (e.g., camera, image sensor, video recorder, etc.) configured on the housing backside to capture one or more facial feature images or partial facial feature images (e.g., eye, mouth, nose, cheek, chin, brow, hairline, etc.).
  • one or more image-capturing devices 140 may be configured on the housing frontside as well (e.g., a pair of front cameras configured to show the user the outside environment on the display).
  • any of the image-capturing devices can include an infrared (IR) image-capturing device, such as an IR camera or a grayscale IR camera.
  • the image-capturing devices configured on the housing backside may be configured at angles that do not capture the user’s facial features from the desired perspective (e.g., front-facing perspective for VR meeting application).
  • first eye camera 131 and second eye camera 132 may be configured near the sides of the device, which causes the resulting eye images to be from a side angle.
  • the eye images (or any facial feature image) will also require reconstruction to determine what the user’s facial feature should look like from the desired perspective.
  • the present invention also provides for machine learning and/or deep learning techniques to generate a composite facial image of a user at desired perspectives.
  • the device 100 also includes at least a storage, a communication interface, and a processor (see Figure 2).
  • the storage is configured to store at least a reference image of the person wearing the device 100.
  • the processor is configured to process at least the first eye image to provide a first autoencoded patch, to generate a mask containing a first location for the first autoencoded patch, and to generate a first composite image at a first time.
  • This first composite image includes at least the first autoencoded patch overlaying the reference image at the first location.
  • the extended reality device 100 also includes one or more microphones 142 for capturing audio inputs from the person/user.
  • the processor may also be configured to generate a video comprising the first composite image.
  • the video may also include the audio inputs captured by one or more microphones.
  • the processor also includes a neural processing unit (NPU) for performing autoencoding.
  • the communication interface is configured to transmit the first composite image of the person.
  • Figure 2 is a simplified block diagram illustrating internal components of an extended reality device according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications.
  • Figure 2 expands on the internal and external components that can be included in an extended reality device according to the present invention, such as device 100 of Figure 1.
  • these components can include a central processing unit (CPU) 210 coupled to a graphic processing unit (GPU) 212 and a display 220.
  • the display 220 may also be coupled to a screen touch sensor 222 and configured as a touchscreen.
  • the CPU 210 is coupled to the screen touch sensor 222, an auxiliary hand sensor 230 (e.g., headset sensors, remotes, controllers, etc.), and an accelerometer 232, all of which can provide sensor inputs to determine user gestures and device operating modes. Also, the CPU 210 is coupled to a motor 234, which can be configured to provide haptic feedback to the user (e.g., in response to sensor inputs, user gestures, etc.).
  • the GPU 212 can be coupled to the display 220 and one or more image-capturing devices, such as a camera 250. The image-capturing devices can include the IR devices discussed previously (e.g., IR cameras, grayscale IR cameras, etc.).
  • the GPU 212 can be configured to transmit various forms of the user interface to display 220 depending on the operating mode, and the GPU 212 can be configured to transmit various forms of the user’s facial image or video to the display 220 (i.e., showing users how they will look to others).
  • the CPU 210 can also be further configured to update display preference data (e.g., user facial image/video composition, user interface arrangement, etc.) in a storage 260.
  • the display preference data can be updated based on user input received from the touchscreen, sensors configured within the extended reality device, or image-capturing devices.
  • the CPU 210 is also coupled to the neural processing unit (NPU) 214 that is configured to update the display preference data based on user inputs, which can be received via the screen touch sensor 222, the sensors 230, the cameras 250, or the like.
  • NPU 214 can be used to learn the user’s gestures via one or more of the sensors within the device 200 or learn the user’s facial composition via one or more of the cameras within the device 200.
  • Such gesture data and facial composition data can also be stored in storage 260.
  • the storage 260 is coupled to the CPU 210, the GPU 212, the NPU 214, the display 220, and the screen touch sensor 222.
  • the device 200 can further include a communication module 240 coupled to the CPU 210 and configured to transmit display preference data, composite images, or videos to a server.
  • This communication module 240 can be configured for mobile data, Wi-Fi, Bluetooth, etc.
  • the communication module 240 is configured to transmit user facial image data or video data to a target recipient (i.e., showing others the user’s facial image or video).
  • the communication module 240 can also be configured to transmit user profile information generated using one or more inputs from one or more sensor arrays configured within the handheld device. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium. Of course, there can be other variations, modifications, and alternatives to these device elements and their configurations.
  • Figure 3 is a simplified diagram illustrating data flow of forming a composite image according to an embodiment of the present invention.
  • This diagram is merely an example, which should not unduly limit the scope of the claims.
  • One of ordinary skill in the art would recognize many variations, alternatives, and modifications. For example, one or more steps may be added, removed, repeated, replaced, modified, rearranged, and/or overlapped, and they should not limit the scope of the claims.
  • the present invention provides a method to form composite facial images of an extended reality device user whose face is blocked or partially blocked by wearing the extended reality device.
  • Determining the composite facial image can be divided into at least two parts (i.e., a disentangled manner/implementation or a coarse-to-fine process), including but not limited to: (1) determining local geometry transformation using local image patches, and (2) determining global fusion based on local predictions.
  • flow 300 implements a disentangled network architecture to determine a user’s composite facial image using a plurality of processing stages.
  • the first stage includes obtaining raw image patches 310, which can be obtained by image-capturing devices configured within the extended reality device, as discussed previously.
  • the raw image patches 310 can include at least a left eye image patch 312, a right eye image patch 314, and a mouth image patch 316.
  • the captured raw image patches may not be in the ideal perspective or orientation for the composite facial image (e.g., side view or offset angle).
  • each such image patch can be trained to predict the appropriate image patch for the target facial feature (e.g., left eye, right eye, mouth, etc.).
  • each of these image patches can be processed separately by an autoencoder (AE) device 322, denoted here as “AE1” to “AE3”.
  • the prediction process results in predicted local patches for the left eye image 332, the right eye image 334, and the mouth image 336.
  • each AE device 322 includes an encoder and a decoder configured to perform an autoencoding process to generate predictions 330 based on the local patches.
  • the AE diagram 201 shows the autoencoding process in which the encoder (En) automatically encodes data based on input values, the AE performs an activation function on the encoded data, and the decoder (De) decodes the data to produce the resulting output.
  • the image-capturing devices of the extended reality device can include small mounted cameras. By using small mounted cameras to only capture local facial areas, the autoencoding process can exhibit better performance for real-time applications (e.g., VR meetings).
  • raw image patches 310 can include grayscale images obtained using IR cameras for grayscale predictions. Using grayscale image patches from the IR cameras can further optimize the autoencoding process to determine a user’s facial composite image.
  • the second stage includes using the predicted local patches 330 and reference images 340 to synthesize a new dataset (e.g., composite image), where new overlaid images with corrected local patches are generated.
  • This process can include generating a mask 342 with designated locations for overlaying the predicted image patches from the previous stage.
  • the predicted image patches of the left eye 332, right eye 334, and mouth 336 are overlaid on the designated left eye, right eye, and mouth locations of the mask 342, respectively.
  • Overlaying the mask 342 on the reference facial image 340 results in the composite facial image 344.
  • the composite image process can include training local models to improve the local prediction image patches.
  • This training process is shown by the training image patches 350, including a training left eye image patch 352, a training right eye image patch 354, and a training mouth image patch 356.
  • the training process can include applying a mean squared error (MSE) process, a learned perceptual image patch similarity (LPIPS) process, other reconstruction loss processes, and the like, or combinations thereof (see the illustrative training sketch following this list).
  • These training image patches can be used to improve the prediction image patches 330 that are overlaid using mask 342 and the reference image 340 to create modified composite facial images 344.
  • the third stage includes training the global model 360 on the newly generated dataset (i.e., composite facial image 344) to enable global fusion capability.
  • the global model 360 uses an AE device 362 (also denoted as “AE4”) to perform this global autoencoding.
  • the global model training process can include creating a training composite image 372 to improve the prediction composite image 370.
  • the training composite image can be grayscale.
  • Both the prediction and training composite facial images 370, 372 can be provided to discriminator 374 (or a classifier) to determine the difference between these images to train the global model (see the training sketch following this list).
  • a CPU of the extended reality device is configured to form the composite image using the steps discussed previously.
  • An NPU coupled to the CPU can be configured to perform the model training processes and other machine/deep learning related processes.
  • the CPU and NPU can be configured similarly to device 201 shown in Figure 2. Further details of example methods for generating composite images are discussed with reference to Figures 4 and 5.
  • Figure 4 is a simplified flow diagram illustrating a method for generating a composite image according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims.
  • One of ordinary skill in the art would recognize many variations, alternatives, and modifications. For example, one or more steps may be added, removed, repeated, replaced, modified, rearranged, and/or overlapped, and they should not limit the scope of the claims.
  • the method of generating the composite image can be performed by an extended reality device, such as device 100 shown in Figure 1.
  • method 400 includes step 402 of obtaining a reference facial image of a person.
  • This reference image can be obtained by using an integrated image-capturing device, by user upload, network transfer, or the like.
  • the method includes obtaining a first facial organ image (e.g., left eye) at a first time, and obtaining a second facial organ image (e.g., right eye) at a second time, respectively.
  • These eye images can be captured using one or more image-capturing devices, such as cameras, image sensors, video recorders, or the like.
  • an infrared (IR) camera can be configured to capture the facial organ images, or a first IR camera can be configured to capture the first facial organ image while a second IR camera can be configured to capture the second facial organ image.
  • the method can include obtaining a plurality of facial organ images at separate times.
  • the plurality of facial organ images can be captured using one or more image-capturing devices (e.g., IR cameras), as discussed previously.
  • a separate image-capturing device can be configured to capture each of the facial organ images, or any one of the image-capturing devices can be configured to capture multiple facial organ images at separate times.
  • the method includes processing the first facial organ image to provide a first autoencoded patch, and processing the second facial organ image to provide a second autoencoded patch, respectively.
  • the method includes generating a mask containing a first location for the first autoencoded patch and a second location for the second autoencoded patch.
  • the method includes obtaining a third facial organ image (e.g., mouth), processing the third facial organ image to provide a third autoencoded patch, and overlaying the third autoencoded patch at a third location of the reference facial image using the mask.
  • additional facial organ images or partial facial images (e.g., nose, lip, ears, chin, cheeks, eyebrows, hairline, etc.) can also be obtained and processed in the same way.
  • the method includes overlaying the first autoencoded patch at the first location of the reference facial image, and overlaying the second autoencoded patch at the second location of the reference facial image.
  • the third facial organ image and/or other partial facial images can be processed to provide additional autoencoded patches, which can be overlaid at additional locations of the reference facial image using the mask or used to generate the mask and then subsequently overlaid on the additional locations. Further, the method can include performing gray prediction on one or more of the autoencoded patches.
  • the method includes providing a first composite image of the person at a third time.
  • This first composite image includes at least the first autoencoded patch and the second autoencoded patch overlaying the reference facial image using the mask.
  • the first composite image can also include one or more of the additional autoencoded patches provided by processing one or more of the other partial facial images.
  • steps 406, 410 and their associated operations in steps 412, 416 and 418 are optional.
  • the method can also include generating one or more additional composite images of the person at different times.
  • a second composite image of the person can be generated at a fourth time such that a time interval between the third and fourth time satisfies a predetermined frame rate.
  • the times of the additional composite images can be configured such that the time intervals between such times satisfy one or more predetermined frame rates.
  • the additional composite images can also include any number of the autoencoded patches provided from processing the partial facial images.
  • Figure 5 is a simplified flow diagram illustrating an alternative method for generating a composite image according to embodiments of the present invention.
  • This diagram is merely an example, which should not unduly limit the scope of the claims.
  • One of ordinary skill in the art would recognize many variations, alternatives, and modifications. For example, one or more steps may be added, removed, repeated, replaced, modified, rearranged, and/or overlapped, and they should not limit the scope of the claims.
  • the method includes obtaining a first eye image (e.g., left eye) at a first time, and obtaining a mouth image at a second time, respectively.
  • this method 500 uses different partial facial image types to generate the autoencoded patches. As discussed previously, different numbers and combinations of partial facial images can be obtained for further processing.
  • the method includes processing the first eye image to provide a first autoencoded patch, and processing the mouth image to provide a second autoencoded patch. And, in step 512, the method includes generating a mask containing a first location for the first autoencoded patch and a second location for the second autoencoded patch.
  • a plurality of partial facial images can be processed to provide a plurality of autoencoded patches, which can be used to generate the mask including locations for each such patch.
  • the method includes performing autoencoding on the first composite image.
  • additional composite images can be determined at time intervals to satisfy one or more predetermined frame rates.
  • the additional composite images can also include any number of the autoencoded patches provided from processing the partial facial images.
  • the method can include performing autoencoding on these additional composite images as well.
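The training steps named in the bullets above (reconstruction losses for the local patches, and a discriminator for the global stage) can be pictured with the following hedged sketch. The lpips package, the loss weighting, and the binary cross-entropy adversarial formulation are assumptions for illustration; the disclosure itself only names MSE, LPIPS, other reconstruction loss processes, and a discriminator (or classifier).

```python
import torch
import torch.nn as nn
import lpips   # pip install lpips -- assumed available for the LPIPS term

mse = nn.MSELoss()
perceptual = lpips.LPIPS(net="alex")   # learned perceptual image patch similarity

def local_patch_loss(pred_patch, train_patch, w_lpips: float = 0.1):
    """Reconstruction loss for one predicted local patch vs. its training patch.

    pred_patch, train_patch: (N, 1, H, W) grayscale tensors in [0, 1].
    The weight and the LPIPS backbone choice are illustrative assumptions.
    """
    # LPIPS expects 3-channel inputs in [-1, 1]; repeat the grayscale channel.
    p3 = pred_patch.repeat(1, 3, 1, 1) * 2 - 1
    t3 = train_patch.repeat(1, 3, 1, 1) * 2 - 1
    return mse(pred_patch, train_patch) + w_lpips * perceptual(p3, t3).mean()

def global_discriminator_loss(discriminator, pred_composite, train_composite):
    """Adversarial-style objective for the global fusion stage (a sketch only)."""
    bce = nn.BCEWithLogitsLoss()
    real = discriminator(train_composite)   # grayscale training composite 372
    fake = discriminator(pred_composite)    # predicted composite 370
    d_loss = bce(real, torch.ones_like(real)) + bce(fake, torch.zeros_like(fake))
    g_loss = bce(fake, torch.ones_like(fake))
    return d_loss, g_loss
```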

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Image Processing (AREA)

Abstract

A method and device for generating composite facial images. The method includes obtaining a reference facial image of a person and obtaining first and second facial organ images at a first and a second time, respectively. The method also includes processing the first and second facial organ images to provide first and second autoencoded patches, respectively. A mask is generated containing a first location for the first patch and a second location for the second patch. The first and second autoencoded patches are overlaid at the first and second locations of the reference facial image, respectively. The method also includes providing a composite image of the person at a third time including the first and second autoencoded patches overlaying the reference facial image. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.

Description

METHODS AND SYSTEMS FOR CONSTRUCTING FACIAL IMAGES USING PARTIAL IMAGES
BACKGROUND OF THE INVENTION
[0001] Eye contact is important during human social communication. It increases the presence level of a communicator, makes other people feel more comfortable, and promotes the intimacy between the communicators as well. Eye contact is not only needed in the real world, but also in the next generation of communication medium in virtual reality (VR). However, VR headsets physically block any visual connection between the user and others, which causes a visual disconnect. There are various solutions to address this visual disconnect caused by VR devices, but they have been inadequate, as described below.
[0002] Therefore, new and improved systems and methods for operating extended reality devices are desired.
BRIEF SUMMARY OF THE INVENTION
[0003] The present invention is directed to extended reality systems and methods.
According to an embodiment, the present invention provides a method for obtaining an image of a first facial organ and a reference image of a user. The image of the first facial organ is autoencoded into a patch. The patch is inserted into a corresponding location of the reference image to form a composite facial image. There are other embodiments as well.
[0004] Embodiments of the present invention can be implemented in conjunction with existing systems and processes. For example, the present device configuration and its related methods according to the present invention can be used in a wide variety of systems, including virtual reality (VR) systems, mobile devices, and the like. Additionally, various techniques according to the present invention can be adopted into existing systems via integrated circuit fabrication, operating software, and wireless communication protocols. There are other benefits as well.
[0005] A system of one or more computers can be configured to perform particular operations or actions by virtue of having software, firmware, hardware, or a combination of them installed on the system that in operation causes or cause the system to perform the actions. One or more computer programs can be configured to perform particular operations or actions by virtue of including instructions that, when executed by data processing apparatus, cause the apparatus to perform the actions. One general aspect includes a method for generating composite facial images. The method also includes obtaining a reference facial image of a person. The method also includes obtaining a first eye image at a first time. The method also includes obtaining a second eye image at a second time. The method also includes processing the first eye image to provide a first autoencoded patch. The method also includes processing the second eye image to provide a second autoencoded patch. The method also includes generating a mask containing a first location for the first autoencoded patch and a second location for the second autoencoded patch. The method also includes overlaying the first autoencoded patch at the first location of the reference facial image. The method also includes overlaying the second autoencoded patch at the second location of the reference facial image. The method also includes providing a first composite image of the person at a third time, the first composite image may include the first autoencoded patch and the second autoencoded patch overlaying the reference facial image. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.
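As an illustration of the mask-and-overlay step described above, the sketch below assumes the autoencoded patches have already been resized to the reference image's resolution and channel layout, and that the mask reduces to known top-left pixel locations for each patch; the function and variable names are illustrative and not taken from the disclosure.

```python
import numpy as np

def overlay_patches(reference, patches, locations):
    """Overlay autoencoded patches onto a reference facial image.

    reference : (H, W) or (H, W, C) array holding the stored reference image.
    patches   : dict of autoencoded patches with the same channel layout as the
                reference, e.g. {"left_eye": ..., "right_eye": ...}.
    locations : dict of top-left (row, col) pixel coordinates for each patch.
    Returns the composite image and a binary mask marking the replaced pixels.
    """
    composite = reference.copy()
    mask = np.zeros(reference.shape[:2], dtype=bool)
    for name, patch in patches.items():
        r, c = locations[name]
        h, w = patch.shape[:2]
        composite[r:r + h, c:c + w] = patch   # overlay the patch at its mask location
        mask[r:r + h, c:c + w] = True
    return composite, mask
```

Calling overlay_patches with a left-eye and a right-eye patch and their two mask locations yields the first composite image; repeating the call with freshly captured patches produces later composites.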
[0006] Implementations may include one or more of the following features. The method where the first eye image is captured by a first infrared (IR) camera and the second eye image is captured by a second IR camera. The method may include generating a second composite image of the person at a fourth time, a time interval between the third time and the fourth time satisfying a predetermined frame rate. The method may include obtaining a mouth image, processing the mouth image to provide a third autoencoded patch, and overlaying the third autoencoded patch at a third location of the reference facial image using the mask. The method may include performing gray prediction on the first autoencoded patch and the second autoencoded patch. The method may include displaying the first composite image. The method may include transmitting the first composite image. The method may include performing autoencoding on the first composite image. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.
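The frame-rate feature above can be pictured as a simple pacing loop. This is a hedged sketch only: the 30 fps value and the callables make_composite, send, and stop are placeholders rather than interfaces from the disclosure.

```python
import time

TARGET_FPS = 30                      # predetermined frame rate (assumed value)
FRAME_INTERVAL = 1.0 / TARGET_FPS    # maximum spacing between consecutive composites

def stream_composites(make_composite, send, stop):
    """Generate composite images so consecutive frames satisfy the target frame rate."""
    next_deadline = time.monotonic()
    while not stop():
        frame = make_composite()     # e.g. capture eyes/mouth, autoencode, overlay
        send(frame)                  # display and/or transmit the composite
        next_deadline += FRAME_INTERVAL
        time.sleep(max(0.0, next_deadline - time.monotonic()))
```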
[0007] One general aspect includes an extended reality device. The extended reality device also includes a housing that may include a frontside and a backside. The device also includes a first eye camera configured on the backside and configured to capture a first eye image of a person wearing the extended reality device. The device also includes a display configured on the backside. The device also includes a storage configured to store a reference image of the person. The device also includes a communication interface configured to transmit a first composite image. The device also includes a processor configured to: process the first eye image to provide a first autoencoded patch; generate a mask containing a first location for the first autoencoded patch; and generate a first composite image at a first time, the first composite image may include the first autoencoded patch overlaying the reference image at the first location. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.
[0008] Implementations may include one or more of the following features. The extended reality device where the first eye camera may include a grayscale infrared camera. The extended reality device may include a second eye camera and a mouth camera configured on the backside. The extended reality device may include a pair of front cameras configured on the frontside. The processor is further configured to generate a video that may include the first composite image. The processor may include a neural processing unit for performing autoencoding. The extended reality device may include a microphone for capturing audio inputs from the person. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.
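For reference, the recited device components can be summarized in a small configuration structure. This is purely illustrative; every field name and default below is an assumption, not part of the claimed device.

```python
from dataclasses import dataclass

@dataclass
class XRDeviceConfig:
    """Illustrative component summary; names and defaults are ours, not the patent's."""
    backside_cameras: tuple = ("left_eye_ir", "right_eye_ir", "mouth_ir")  # inward-facing grayscale IR cameras
    frontside_cameras: tuple = ("front_left", "front_right")              # optional front camera pair
    has_backside_display: bool = True        # display facing the wearer
    has_microphone: bool = True              # audio for the generated video
    has_npu: bool = True                     # neural processing unit used for autoencoding
    reference_image_path: str = "reference_face.png"  # hypothetical path to the stored reference image
    transport: str = "wifi"                  # communication interface: "wifi", "bluetooth", or "mobile"
```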
[0009] One general aspect includes a method for generating composite facial images. The method also includes obtaining a reference facial image of a person. The method also includes obtaining a first eye image at a first time. The method also includes obtaining a mouth image at a second time. The method also includes processing the first eye image to provide a first autoencoded patch. The method also includes processing the mouth image to provide a second autoencoded patch. The method also includes generating a mask containing a first location for the first autoencoded patch and a second location for the second autoencoded patch. The method also includes providing a first composite image of the person at a third time, the first composite image may include the first autoencoded patch and the second autoencoded patch overlaying the reference facial image respectively at the first location and the second location. The method also includes performing autoencoding on the first composite image. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.
[0010] Implementations may include one or more of the following features. The method may include obtaining a second eye image, and processing the mouth image to provide a third autoencoded patch. The method may include generating a grayscale image using at least the first autoencoded patch and the second autoencoded patch, applying a discriminator using the first composite image and the grayscale image. The method may include performing gray prediction on the first autoencoded patch. The method may include generating a video based at least on the first composite image. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.
[0011] It is to be appreciated that embodiments of the present invention provide many advantages over conventional techniques. Among other things, the present invention provides configurations and methods for extended reality devices that allow a user to enable visual connection to others by creating a display of the user’s real human face that is otherwise blocked or partially blocked by the extended reality device. In an example, this visual connection technique can enable VR meetings in which participants can see each other’s facial expressions, which maintains visual contact for a comfortable communication experience. Additionally, the present invention implements machine learning techniques to store user preference data related to the user’s facial display. There are many other benefits as well.
[0012] The present invention achieves these benefits and others in the context of known technology. However, a further understanding of the nature and advantages of the present invention may be realized by reference to the latter portions of the specification and attached drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
[0013] Figure 1 is a simplified diagram illustrating an extended reality device with cameras capturing eyes and mouth of a wearer according to embodiments of the present invention.
[0014] Figure 2 is a simplified block diagram illustrating internal components of an extended reality device according to embodiments of the present invention.
[0015] Figure 3 is a simplified diagram illustrating data flow of forming a composite image according to an embodiment of the present invention.
[0016] Figure 4 is a simplified flow diagram illustrating a method for generating a composite image according to embodiments of the present invention.
[0017] Figure 5 is a simplified flow diagram illustrating an alternative method for generating a composite image according to embodiments of the present invention.
DETAILED DESCRIPTION OF THE INVENTION
[0018] The present invention is directed to extended reality systems and methods. According to an embodiment, the present invention provides a method for obtaining an eye image and a reference image of a user. The eye image is autoencoded into a patch. The patch is inserted into a corresponding location of the reference image to form a composite image. In some embodiments, the method can be applied to obtain other facial features or facial organs from the user, such as a nose, a lip, a mouth, an eyebrow, cheeks, and ears.
[0019] As described above, extended reality devices tend to physically block any visual connection between the user and others. A target of the present invention is to reveal real facial images of a user by using partial observations from the extended reality device worn by the user, which can be applied to a variety of virtual reality (VR) or augmented reality (AR) interaction applications (e.g., meetings, social platforms, gaming, etc.). Examples of the present extended reality device can include headsets, goggles, mobile devices, and the like.
[0020] Although there have been attempts to solve this visual connection issue with users wearing VR headsets, the results have been inadequate. For example, conventional techniques to determine a 2D photo-realistic facial composition of a user in a head-pose-free manner require complicated device architectures and methods, such as 3D head reconstruction, face alignment and tracking, face synthesis, and eye synthesis. Such techniques also tend to rely heavily on a third-person camera to capture complete facial images. In such cases, the facial reconstruction can fail if any part of the user’s face is blocked in the captured facial image. Conventional techniques also require the implementation of multiple modules, which can result in computation speeds that are too slow to satisfy real-time requirements in frequency-sensitive scenarios (e.g., VR video meetings). Users may have the burden of tuning the parameters of each module according to different environment setups as well.
[0021] It is to be appreciated that embodiments of the present invention provide many advantages over conventional techniques. Among other things, the present invention provides configurations and methods for extended reality devices that allow a user to enable visual connection to others by creating a display of the user’s real human face that is otherwise blocked or partially blocked by the extended reality device. In an example, this visual connection technique can enable VR meetings in which participants can see each other’s facial expressions, which maintains visual contact for a comfortable communication experience. Additionally, the present invention implements machine learning techniques to store user preference data related to the user’s facial display.
[0022] The following description is presented to enable one of ordinary skill in the art to make and use the invention and to incorporate it in the context of particular applications. Various modifications, as well as a variety of uses in different applications will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to a wide range of embodiments. Thus, the present invention is not intended to be limited to the embodiments presented, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
[0023] In the following detailed description, numerous specific details are set forth in order to provide a more thorough understanding of the present invention. However, it will be apparent to one skilled in the art that the present invention may be practiced without necessarily being limited to these specific details. In other instances, well-known structures and devices are shown in block diagram form, rather than in detail, in order to avoid obscuring the present invention.
[0024] The reader’s attention is directed to all papers and documents which are filed concurrently with this specification and which are open to public inspection with this specification, and the contents of all such papers and documents are incorporated herein by reference. All the features disclosed in this specification, (including any accompanying claims, abstract, and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise. Thus, unless expressly stated otherwise, each feature disclosed is one example only of a generic series of equivalent or similar features.
[0025] Furthermore, any element in a claim that does not explicitly state “means for” performing a specified function, or “step for” performing a specific function, is not to be interpreted as a “means” or “step” clause as specified in 35 U.S.C. Section 112, Paragraph 6. In particular, the use of “step of” or “act of” in the Claims herein is not intended to invoke the provisions of 35 U.S.C. 112, Paragraph 6.
[0026] Please note, if used, the labels left, right, front, back, top, bottom, forward, reverse, clockwise and counterclockwise have been used for convenience purposes only and are not intended to imply any particular fixed direction. Instead, they are used to reflect relative locations and/or directions between various portions of an object.
[0027] According to an example, the present invention provides a device and method of 2D photo-realistic human face restoration using partial observations for users of extended reality devices whose faces are at least partially blocked while wearing such extended reality devices. The partial observations (e.g., left eye, right eye, mouth, etc.) can be obtained by various image-capturing devices configured within the extended reality device. In addition, reference images of the user captured before the user uses the extended reality device can be used to assist the facial restoration process. In this way, the user can maintain a visual connection with another participant or target audience without having to take the extended reality device off.
[0028] Figure 1 is a simplified diagram illustrating an extended reality device with cameras capturing the eyes and the mouth of a wearer according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications.
[0029] As shown, the extended reality device 100 is configured within a housing 110 having at least a frontside and a backside. The device 100 includes a display 120 (shown within dotted line cutaway section 101) and at least a first eye camera 131 spatially configured on the backside (shown within dotted line cutaway section 102). The first eye camera 131 (e.g., inward-facing) is configured to capture a first eye image of a person/user wearing the extended reality device 100. In an example, the device 100 may also include a second eye camera 132 (shown within dotted line cutaway section 103) configured on the housing backside to capture a second eye image of the person, or device 100 may include a mouth camera 134 (shown within dotted line cutaway section 104 and facing downward toward the mouth) configured on the housing backside to capture a mouth image of the person.
[0030] According to an example, the device 100 may include one or more image-capturing devices (e.g., camera, image sensor, video recorder, etc.) configured on the housing backside to capture one or more facial feature images or partial facial feature images (e.g., eye, mouth, nose, cheek, chin, brow, hairline, etc.). Depending on the application, one or more image-capturing devices 140 may be configured on the housing frontside as well (e.g., a pair of front cameras configured to show the user the outside environment on the display). Further, any of the image-capturing devices can include an infrared (IR) image-capturing device, such as an IR camera or a grayscale IR camera.
[0031] Depending upon the configuration, the image-capturing devices configured on the housing backside may be configured at angles that do not capture the user’s facial features from the desired perspective (e.g., front-facing perspective for VR meeting application). For example, first eye camera 131 and second eye camera 132 may be configured near the sides of the device, which causes the resulting eye images to be from a side angle. In such cases, the eye images (or any facial feature image) will also require reconstruction to determine what the user’s facial feature should look like from the desired perspective. To address this issue, the present invention also provides for machine learning and/or deep learning techniques to generate a composite facial image of a user at desired perspectives.
[0032] The device 100 also includes at least a storage, a communication interface, and a processor (see Figure 2). The storage is configured to store at least a reference image of the person wearing the device 100. The processor is configured to process at least the first eye image to provide a first autoencoded patch, to generate a mask containing a first location for the first autoencoded patch, and to generate a first composite image at a first time. This first composite image includes at least the first autoencoded patch overlaying the reference image at the first location.
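The autoencoding mentioned in [0032] (and shown as the AE blocks of Figure 3, each with an encoder "En", an activation, and a decoder "De") might look roughly like the following PyTorch sketch. The layer sizes, latent dimension, and the 64x64 single-channel (grayscale IR) input are assumptions for illustration, not details from the disclosure.

```python
import torch
import torch.nn as nn

class PatchAutoencoder(nn.Module):
    """Minimal encoder/decoder for one facial-organ patch (e.g. an eye crop)."""

    def __init__(self, latent_dim: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(                        # "En": encode the raw patch
            nn.Conv2d(1, 32, 4, stride=2, padding=1),        # 64x64 -> 32x32
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1),       # 32x32 -> 16x16
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, latent_dim),
        )
        self.decoder = nn.Sequential(                        # "De": decode to the predicted patch
            nn.Linear(latent_dim, 64 * 16 * 16),
            nn.ReLU(),
            nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),  # 16x16 -> 32x32
            nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),   # 32x32 -> 64x64
            nn.Sigmoid(),
        )

    def forward(self, patch: torch.Tensor) -> torch.Tensor:
        # patch: (N, 1, 64, 64) grayscale IR crop; returns the predicted local patch.
        return self.decoder(torch.relu(self.encoder(patch)))
```

At run time the processor (or its NPU) would feed each cropped eye or mouth image through such a module to obtain the autoencoded patch that is later overlaid on the reference image.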
[0033] In an example, the extended reality device 100 also includes one or more microphones 142 for capturing audio inputs from the person/user. The processor may also be configured to generate a video comprising the first composite image. The video may also include the audio inputs captured by one or more microphones. In a specific example, the processor also includes a neural processing unit (NPU) for performing autoencoding. Further, the communication interface is configured to transmit the first composite image of the person.
[0034] Figure 2 is a simplified block diagram illustrating internal components of an extended reality device according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications.
[0035] As shown, Figure 2 expands on the internal and external components that can be included in an extended reality device according to the present invention, such as device 100 of Figure 1. Configured to a housing 201, these components can include a central processing unit (CPU) 210 coupled to a graphic processing unit (GPU) 212 and a display 220. The display 220 may also be coupled to a screen touch sensor 222 and configured as a touchscreen.
[0036] Here, the CPU 210 is coupled to the screen touch sensor 222, an auxiliary hand sensor 230 (e.g., headset sensors, remotes, controllers, etc.), and an accelerometer 232, all of which can provide sensor inputs to determine user gestures and device operating modes. Also, the CPU 210 is coupled to a motor 234, which can be configured to provide haptic feedback to the user (e.g., in response to sensor inputs, user gestures, etc.). The GPU 212 can be coupled to the display 220 and one or more image-capturing devices, such as a camera 250. The image-capturing devices can include the IR devices discussed previously (e.g., IR cameras, grayscale IR cameras, etc.). Further, the GPU 212 can be configured to transmit various forms of the user interface to display 220 depending on the operating mode, and the GPU 212 can be configured to transmit various forms of the user’s facial image or video to the display 220 (i.e., showing users how they will look to others).
[0037] The CPU 210 can also be further configured to update display preference data (e.g., user facial image/video composition, user interface arrangement, etc.) in a storage 260. The display preference data can be updated based on user input received from the touchscreen, sensors configured within the extended reality device, or image-capturing devices. In a specific example, the CPU 210 is also coupled to the neural processing unit (NPU) 214 that is configured to update the display preference data based on user inputs, which can be received via the screen touch sensor 222, the sensors 230, the cameras 250, or the like. In a specific example, NPU 214 can be used to learn the user’s gestures via one or more of the sensors within the device 200 or learn the user’s facial composition via one or more of the cameras within the device 200. Such gesture data and facial composition data can also be stored in storage 260. Here, the storage 260 is coupled to the CPU 210, the GPU 212, the NPU 214, the display 220, and the screen touch sensor 222.
[0038] The device 200 can further include a communication module 240 coupled to the CPU 210 and configured to transmit display preference data, composite images, or videos to a server. This communication module 240 can be configured for mobile data, Wi-Fi, Bluetooth, etc. In an example, the communication module 240 is configured to transmit user facial image data or video data to a target recipient (i.e., showing others the user’s facial image or video). The communication module 240 can also be configured to transmit user profile information generated using one or more inputs from one or more sensor arrays configured within the extended reality device. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium. Of course, there can be other variations, modifications, and alternatives to these device elements and their configurations.
[0039] Figure 3 is a simplified diagram illustrating data flow of forming a composite image according to an embodiment of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. For example, one or more steps may be added, removed, repeated, replaced, modified, rearranged, and/or overlapped, and they should not limit the scope of the claims.
[0040] According to an example, the present invention provides a method to form composite facial images of an extended reality device user whose face is blocked or partially blocked by wearing the extended reality device. Determining the composite facial image can be divided into at least two parts (i.e., a disentangled manner/implementation or a coarse-to-fine process), including but not limited to: (1) determining local geometry transformation using local image patches, and (2) determining global fusion based on local predictions. By disentangling the composite facial image process and leveraging one or more partial observations of the user’s face, the current model can reconstruct the user’s face quickly and effectively.
[0041] As shown, flow 300 implements a disentangled network architecture to determine a user’s composite facial image using a plurality of processing stages. Here, the first stage includes obtaining raw image patches 310, which can be obtained by image-capturing devices configured within the extended reality device, as discussed previously. The raw image patches 310 can include at least a left eye image patch 312, a right eye image patch 314, and a mouth image patch 316. However, depending on the position of the image-capturing devices, the captured raw image patches may not be in the ideal perspective or orientation for the composite facial image (e.g., side view or offset angle).
[0042] Thus, local models for each such image patch can be trained to predict the appropriate image patch for the target facial feature (e.g., left eye, right eye, mouth, etc.). Using the local model 320, each of these image patches can be processed separately by an autoencoder (AE) device 322, denoted here as “AE1” to “AE3”. In this case, the prediction process results in predicted local patches for the left eye image 332, the right eye image 334, and the mouth image 336.
[0043] In an example, each AE device 322 includes an encoder and a decoder configured to perform an autoencoding process to generate predictions 330 based on the local patches. The AE diagram 201 shows the autoencoding process in which the encoder (En) automatically encodes data based on input values, the AE performs an activation function on the encoded data, and the decoder (De) decodes the data to produce the resulting output.
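For illustration only, the encoder–activation–decoder structure described above can be sketched as a small convolutional autoencoder in PyTorch. The layer sizes, the 64x64 patch resolution, and the ReLU/Sigmoid activation choices are assumptions made for this sketch, not the configuration of the AE1–AE3 devices:

    import torch
    import torch.nn as nn

    class PatchAutoencoder(nn.Module):
        """Illustrative local autoencoder for one grayscale facial patch."""
        def __init__(self, latent_dim: int = 128):
            super().__init__()
            # Encoder (En): downsample a 1x64x64 patch into a latent vector.
            self.encoder = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=4, stride=2, padding=1),   # -> 32x32
                nn.ReLU(),                                               # activation
                nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1),  # -> 16x16
                nn.ReLU(),
                nn.Flatten(),
                nn.Linear(64 * 16 * 16, latent_dim),
            )
            # Decoder (De): reconstruct the patch from the latent vector.
            self.decoder = nn.Sequential(
                nn.Linear(latent_dim, 64 * 16 * 16),
                nn.ReLU(),
                nn.Unflatten(1, (64, 16, 16)),
                nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),  # -> 32x32
                nn.ReLU(),
                nn.ConvTranspose2d(32, 1, kernel_size=4, stride=2, padding=1),   # -> 64x64
                nn.Sigmoid(),
            )

        def forward(self, patch: torch.Tensor) -> torch.Tensor:
            # patch: (N, 1, 64, 64); output has the same shape.
            return self.decoder(self.encoder(patch))

A local model of this form would be trained per facial feature (left eye, right eye, mouth), so that each AE device specializes in predicting one patch type.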
[0044] In an example, the image-capturing devices of the extended reality device can include small mounted cameras. By using small mounted cameras to only capture local facial areas, the autoencoding process can exhibit better performance for real-time applications (e.g., VR meetings). In a specific example, raw image patches 310 can include grayscale images obtained using IR cameras for grayscale predictions. Using grayscale image patches from the IR cameras can further optimize the autoencoding process to determine a user’s facial composite image.
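As a minimal sketch of preparing such a grayscale IR crop as autoencoder input (the 8-bit pixel range and the (1, 1, H, W) tensor layout are assumptions, not requirements of the described embodiment):

    import numpy as np
    import torch

    def to_gray_tensor(ir_crop: np.ndarray) -> torch.Tensor:
        """Convert an HxW 8-bit grayscale IR crop to a (1, 1, H, W) float tensor in [0, 1]."""
        x = torch.from_numpy(ir_crop.astype(np.float32) / 255.0)
        return x.unsqueeze(0).unsqueeze(0)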
[0045] The second stage includes using the predicted local patches 330 and reference images 340 to synthesize a new dataset (e.g., composite image), where new overlaid images with corrected local patches are generated. This process can include generating a mask 342 with designated locations for overlaying the predicted image patches from the previous stage. Here, the predicted image patches of the left eye 332, right eye 334, and mouth 336 are overlayed on the designated left eye, right eye, and mouth locations of the mask 342, respectively. Overlaying the mask 342 on the reference facial image 340 results in the composite facial image 344.
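A simplified sketch of this overlay step is shown below; the patch names, pixel coordinates, and boolean mask representation are illustrative assumptions rather than the mask 342 of the described embodiment:

    from typing import Dict, Tuple
    import numpy as np

    def composite_face(reference: np.ndarray,
                       patches: Dict[str, np.ndarray],
                       locations: Dict[str, Tuple[int, int]]):
        """Overlay predicted patches onto a copy of the reference image.

        `locations` maps a patch name (e.g., "left_eye") to the (row, col) of the
        patch's top-left corner. Returns the composite image and the overlay mask.
        """
        out = reference.copy()
        mask = np.zeros(reference.shape[:2], dtype=bool)
        for name, patch in patches.items():
            r, c = locations[name]
            h, w = patch.shape[:2]
            out[r:r + h, c:c + w] = patch     # paste the predicted patch
            mask[r:r + h, c:c + w] = True     # record the designated location
        return out, mask

    # Hypothetical usage with a grayscale reference image and three predicted patches:
    # face, mask = composite_face(reference,
    #                             {"left_eye": le, "right_eye": re, "mouth": mo},
    #                             {"left_eye": (120, 80), "right_eye": (120, 200), "mouth": (300, 140)})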
[0046] In an example, the composite image process can include training local models to improve the local prediction image patches. This training process is shown by the training image patches 350, including a training left eye image patch 352, a training right eye image patch 354, and a training mouth image patch 356. The training process can include applying a mean squared error (MSE) process, a learned perceptual image patch similarity (LPIPS) process, other reconstruction loss processes, and the like and combinations thereof. These training image patches can be used to improve the prediction image patches 330 that are overlayed using mask 342 and the reference image 340 to create modified composite facial images 344.
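For illustration, a combined reconstruction loss of this kind might look as follows, assuming the publicly available lpips package and an arbitrary 0.1 weighting between the two terms (neither is specified by the described embodiment):

    import torch
    import torch.nn.functional as F
    import lpips  # pip install lpips; assumed available for this sketch

    perceptual = lpips.LPIPS(net="vgg")  # LPIPS expects 3-channel inputs in [-1, 1]

    def local_reconstruction_loss(pred: torch.Tensor,
                                  target: torch.Tensor,
                                  lpips_weight: float = 0.1) -> torch.Tensor:
        """MSE plus LPIPS between a predicted patch and its training patch.

        pred/target: (N, 1, H, W) grayscale patches with values in [0, 1].
        """
        mse = F.mse_loss(pred, target)
        # Repeat the single channel and rescale to [-1, 1] for the LPIPS network.
        p3 = pred.repeat(1, 3, 1, 1) * 2.0 - 1.0
        t3 = target.repeat(1, 3, 1, 1) * 2.0 - 1.0
        return mse + lpips_weight * perceptual(p3, t3).mean()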
[0047] The third stage includes training the global model 360 on the newly generated dataset (i.e., composite facial image 344) to enable global fusion capability. Here, AE device 362, also denoted as “AE4”, can perform an autoencoding process on the composite image 344 to produce a global prediction composite facial image 370. Similar to the local model training process, the global model training process can include creating a training composite image 372 to improve the prediction composite image 370. In an example, the training composite image can be grayscale. Both the prediction and training composite facial images 370, 372 can be provided to discriminator 374 (or a classifier) to determine the difference between these images to train the global model.
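A sketch of such a discriminator and one training update is shown below; the layer sizes, single-channel input, binary cross-entropy objective, and optimizer usage are assumptions for this sketch, not the configuration of AE4 or discriminator 374:

    import torch
    import torch.nn as nn

    class CompositeDiscriminator(nn.Module):
        """Illustrative classifier that scores composite face images as real or predicted."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(64, 1, 4, stride=2, padding=1),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )

        def forward(self, image: torch.Tensor) -> torch.Tensor:
            return self.net(image)  # one logit per image

    def discriminator_step(disc, optimizer, prediction, training_composite):
        """One update: training composites are treated as 'real', predictions as 'fake'."""
        bce = nn.BCEWithLogitsLoss()
        real_logits = disc(training_composite)
        fake_logits = disc(prediction.detach())
        loss = bce(real_logits, torch.ones_like(real_logits)) + \
               bce(fake_logits, torch.zeros_like(fake_logits))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

The discriminator's scores on the prediction composite image 370 versus the training composite image 372 would then drive the global model toward producing composites that fuse the local patches consistently with the reference face.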
[0048] In an example, a CPU of the extended reality device is configured to form the composite image using the steps discussed previously. An NPU coupled to the CPU can be configured to perform the model training processes and other machine/deep learning related processes. The CPU and NPU can be configured similarly to device 200 shown in Figure 2. Further details of example methods for generating composite images are discussed with reference to Figures 4 and 5.
[0049] Figure 4 is a simplified flow diagram illustrating a method for generating a composite image according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. For example, one or more steps may be added, removed, repeated, replaced, modified, rearranged, and/or overlapped, and they should not limit the scope of the claims.
[0050] According to an example, the method of generating the composite image can be performed by an extended reality device, such as device 100 shown in Figure 1. As shown, method 400 includes step 402 of obtaining a reference facial image of a person. This reference image can be obtained by using an integrated image-capturing device, by user upload, network transfer, or the like.
[0051] In steps 404 and 406, the method includes obtaining a first facial organ image (e.g., left eye) at a first time, and obtaining a second facial organ image (e.g., right eye) at a second time, respectively. These eye images can be captured using one or more image-capturing devices, such as cameras, image sensors, video recorders, or the like. In a specific example, an infrared (IR) camera can be configured to capture the facial organ images, or a first IR camera can be configured to capture the first facial organ image while a second IR camera can be configured to capture the second facial organ image.
[0052] In an example, the method can include obtaining a plurality of facial organ images at separate times. The plurality of facial organ images can be captured using one or more image-capturing devices (e.g., IR cameras), as discussed previously. A separate image-capturing device can be configured to capture each of the facial organ images, or any one of the image-capturing devices can be configured to capture multiple facial organ images at separate times.
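For illustration only, capturing one frame per facial-feature camera and tagging it with its capture time might be sketched as follows, assuming OpenCV as the capture backend and hypothetical device indices:

    import time
    import cv2  # OpenCV; assumed capture backend for this sketch

    def capture_organ_images(camera_indices):
        """Capture one grayscale frame per named camera, tagged with its capture time."""
        frames = {}
        for name, idx in camera_indices.items():
            cap = cv2.VideoCapture(idx)   # reopening per frame is simple but not efficient
            ok, frame = cap.read()
            cap.release()
            if ok:
                frames[name] = (time.monotonic(), cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
        return frames

    # Hypothetical usage: capture_organ_images({"left_eye": 0, "right_eye": 1, "mouth": 2})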
[0053] In steps 408 and 410, the method includes processing the first facial organ image to provide a first autoencoded patch, and processing the second facial organ image to provide a second autoencoded patch, respectively. In step 412, the method includes generating a mask containing a first location for the first autoencoded patch and a second location for the second autoencoded patch.
[0054] In an example, the method includes obtaining a third facial organ image (e.g., mouth), processing the third facial organ image to provide a third autoencoded patch, and overlaying the third autoencoded patch at a third location of the reference facial image using the mask. Further, additional facial organ images or partial facial images (e.g., nose, lip, ears, chin, cheeks, eyebrows, hairline, etc.) can be obtained using one or more image-capturing devices.
[0055] In steps 414 and 416, the method includes overlaying the first autoencoded patch at the first location of the reference facial image, and overlaying the second autoencoded patch at the second location of the reference facial image. Additionally, the third facial organ image and/or other partial facial images can be processed to provide additional autoencoded patches, which can be overlayed at additional locations of the reference facial image using the mask or used to generate the mask and then subsequently overlayed on the additional locations. Further, the method can include performing gray prediction on one or more of the autoencoded patches.
[0056] In step 418, the method includes providing a first composite image of the person at a third time. This first composite image includes at least the first autoencoded patch and the second autoencoded patch overlaying the reference facial image using the mask. The first composite image can also include one or more of the additional autoencoded patches provided by processing one or more of the other partial facial images. In some embodiments, steps 406, 410 and their associated operations in steps 412, 416 and 418 are optional.
[0057] The method can also include generating one or more additional composite images of the person at different times. For example, a second composite image of the person can be generated at a fourth time such that a time interval between the third and fourth time satisfies a predetermined frame rate. Depending on the application, the times of the additional composite images can be configured such that the time intervals between such times satisfy one or more predetermined frame rates. The additional composite images can also include any number of the autoencoded patches provided from processing the partial facial images.
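A minimal sketch of pacing composite frames so that consecutive frames satisfy a predetermined frame rate is shown below; generate_composite and send_frame are hypothetical placeholders for the compositing pipeline and the communication interface:

    import time

    def stream_composites(generate_composite, send_frame, frame_rate: float = 30.0):
        """Emit composite frames at intervals that satisfy a target frame rate."""
        interval = 1.0 / frame_rate
        next_time = time.monotonic()
        while True:
            frame = generate_composite()
            send_frame(frame)
            next_time += interval
            # Sleep only for the remainder of the interval, if any.
            time.sleep(max(0.0, next_time - time.monotonic()))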
[0058] Figure 5 is a simplified flow diagram illustrating an alternative method for generating a composite image according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. For example, one or more steps may be added, removed, repeated, replaced, modified, rearranged, and/or overlapped, and they should not limit the scope of the claims.
[0059] Similar to the previous example, the method of generating the composite image can be performed by an extended reality device (e.g., device 100 of Figure 1). Also, details of similar method steps discussed for method 400 also apply to the method steps of method 500. As shown, method 500 includes step 502 of obtaining a reference facial image of a person.
[0060] In steps 504 and 506, the method includes obtaining a first eye image (e.g., left eye) at a first time, and obtaining a mouth image at a second time, respectively. Compared to method 400, this method 500 uses different partial facial image types to generate the autoencoded patches. As discussed previously, different numbers and combinations of partial facial images can be obtained for further processing.
[0061] In steps 508 and 510, the method includes processing the first eye image to provide a first autoencoded patch, and processing the mouth image to provide a second autoencoded patch. In step 512, the method includes generating a mask containing a first location for the first autoencoded patch and a second location for the second autoencoded patch. In other cases, a plurality of partial facial images can be processed to provide a plurality of autoencoded patches, which can be used to generate the mask including locations for each such patch.
[0062] In step 514, the method includes performing autoencoding on the first composite image. As discussed previously, additional composite images can be determined at time intervals to satisfy one or more predetermined frame rates. The additional composite images can also include any number of the autoencoded patches provided from processing the partial facial images. The method can include performing autoencoding on these additional composite images as well.
[0063] While the above is a full description of the specific embodiments, various modifications, alternative constructions and equivalents may be used. Therefore, the above description and illustrations should not be taken as limiting the scope of the present invention which is defined by the appended claims.

Claims

WHAT IS CLAIMED IS:
1. A method for generating composite facial images, the method comprising:
    obtaining a reference facial image of a person;
    obtaining an image of a first facial organ at a first time;
    processing the image of the first facial organ to provide a first autoencoded patch;
    generating a mask containing a first location for the first autoencoded patch;
    overlaying the first autoencoded patch at the first location of the reference facial image; and
    providing a first composite image of the person at a third time, the first composite image comprising the first autoencoded patch overlaying the reference facial image using the mask.
2. The method of claim 1 further comprising:
    obtaining an image of a second facial organ at a second time;
    processing the image of the second facial organ to provide a second autoencoded patch, wherein the mask contains a second location for the second autoencoded patch; and
    overlaying the second autoencoded patch at the second location of the reference facial image;
    wherein the first composite image comprises the second autoencoded patch overlying the reference facial image using the mask.
3. The method of claim 2 wherein the image of the first facial organ is captured by a first infrared (IR) camera and the image of the second facial organ is captured by a second IR camera.
4. The method of claim 2 further comprising generating a second composite image of the person at a fourth time, a time interval between the third time and the fourth time satisfying a predetermined frame rate.
5. The method of claim 2 further comprising: obtaining an image of a third facial organ; processing the image of the third facial organ to provide a third autoencoded patch; and overlaying the third autoencoded patch at a third location of the reference facial image using the mask.
6. The method of claim 2 further comprising performing gray prediction on the first autoencoded patch and the second autoencoded patch.
7. The method of claim 1 further comprising displaying the first composite image.
8. The method of claim 1 further comprising performing autoencoding on the first composite image.
9. An extended reality device comprising:
    a housing comprising a frontside and a backside;
    a first eye camera configured on the backside and configured to capture a first eye image of a person wearing the extended reality device;
    a display configured on the backside;
    a storage configured to store a reference image of the person;
    a communication interface configured to transmit a first composite image; and
    a processor configured to:
        process the first eye image to provide a first autoencoded patch;
        generate a mask containing a first location for the first autoencoded patch; and
        generate a first composite image at a first time, the first composite image comprising the first autoencoded patch overlaying the reference image at the first location.
10. The extended reality device of claim 9 wherein the first eye camera comprises a grayscale infrared camera.
11. The extended reality device of claim 9 further comprising a second eye camera and a mouth camera configured on the backside.
12. The extended reality device of claim 9 further comprising a pair of front cameras configured on the frontside.
13. The extended reality device of claim 9 wherein the processor is further configured to generate a video comprising the first composite image.
14. The extended reality device of claim 9 wherein the processor comprises a neural processing unit for performing autoencoding.
15. The extended reality device of claim 9 wherein the processor is further configured to perform local model training to generate one or more training image patches to improve at least the first autoencoded patch.
16. A method for generating composite facial images, the method comprising:
    obtaining a reference facial image of a person;
    obtaining an image of a first facial organ at a first time;
    obtaining an image of a second facial organ at a second time;
    processing the image of the first facial organ to provide a first autoencoded patch;
    processing the image of the second facial organ to provide a second autoencoded patch;
    generating a mask containing a first location for the first autoencoded patch and a second location for the second autoencoded patch;
    providing a first composite facial image at a third time, the first composite image comprising the first autoencoded patch and the second autoencoded patch overlaying the reference facial image respectively at the first location and the second location; and
    performing autoencoding on the first composite facial image.
17. The method of claim 16 further comprising: obtaining an image of a third facial organ; and processing the image of the third facial organ to provide a third autoencoded patch.
18. The method of claim 16 further comprising: generating a grayscale image using at least the first autoencoded patch and the second autoencoded patch; and applying a discriminator using the first composite facial image and the grayscale image.
19. The method of claim 16 further comprising performing gray prediction on the first autoencoded patch.
20. The method of claim 16 further comprising generating a video based at least on the first composite facial image.