
WO2021154099A1 - System for image compositing including training with synthetic data - Google Patents

System for image compositing including training with synthetic data

Info

Publication number
WO2021154099A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
depth map
dataset
images
neural network
Application number
PCT/NZ2020/050134
Other languages
French (fr)
Inventor
Tobias B. SCHMIDT
Erik B. EDLUND
Dejan Momcilovic
Josh HARDGRAVE
Original Assignee
Weta Digital Limited
Priority claimed from US17/081,843 (US11710247B2)
Application filed by Weta Digital Limited
Publication of WO2021154099A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 Animation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 15/00 3D [Three Dimensional] image rendering
    • G06T 15/10 Geometric effects
    • G06T 15/20 Perspective computation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 19/00 Manipulating 3D models or images for computer graphics
    • G06T 19/006 Mixed reality
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 23/00 Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N 23/90 Arrangement of cameras or camera modules, e.g. multiple cameras in TV studios or sports stadiums
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00 Details of television systems
    • H04N 5/222 Studio circuitry; Studio devices; Studio equipment
    • H04N 5/2224 Studio circuitry; Studio devices; Studio equipment related to virtual studio applications
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00 Details of television systems
    • H04N 5/222 Studio circuitry; Studio devices; Studio equipment
    • H04N 5/2224 Studio circuitry; Studio devices; Studio equipment related to virtual studio applications
    • H04N 5/2226 Determination of depth image, e.g. for foreground/background separation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00 Details of television systems
    • H04N 5/222 Studio circuitry; Studio devices; Studio equipment
    • H04N 5/262 Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • H04N 5/272 Means for inserting a foreground image in a background image, i.e. inlay, outlay
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N 2013/0074 Stereoscopic image analysis
    • H04N 2013/0081 Depth or disparity estimation from stereoscopic image signals

Definitions

  • Computer system 900 may also include software that enables communications over communication network 952, such as the HTTP, TCP/IP, RTP/RTSP protocols, wireless application protocol (WAP), IEEE 802.11 protocols, and the like.
  • communication network 952 may include a local area network, a wide area network, a wireless network, an Intranet, the Internet, a private network, a public network, a switched network, or any other suitable communication network, such as for example Cloud networks.
  • Communication network 952 may include many interconnected computer systems and any suitable communication links such as hardwire links, optical links, satellite or other wireless communications links such as BLUETOOTH, WIFI, wave propagation links, or any other suitable mechanisms for communication of information.
  • communication network 952 may communicate to one or more mobile wireless devices 956A-N, such as mobile phones, tablets, and the like, via a base station such as wireless transceiver 954.
  • Computer 920 typically includes familiar computer components such as a processor 960, and memory storage devices, such as a memory 970, e.g., random access memory (RAM), storage media 980 (e.g., hard disk drives, floppy disk drives, and the like), and system bus 990 interconnecting the above components.
  • computer 920 is a PC compatible computer having multiple microprocessors, graphics processing units (GPU), and the like. While a computer is shown, it will be readily apparent to one of ordinary skill in the art that many other hardware and software configurations are suitable for use with the present invention.
  • Memory 970 and Storage media 980 are examples of tangible non-transitory computer readable media for storage of data, audio/video files, computer programs, and the like.
  • tangible media include disk drives, solid-state drives, floppy disks, optical storage media and bar codes, semiconductor memories such as flash drives, flash memories, random-access or read-only types of memories, battery-backed volatile memories, networked storage devices, Cloud storage, and the like.
  • An embodiment uses a deep neural network with a training dataset to implement steps of feature extraction, matching, filtering and/or refinement.
  • the training dataset can use images and depth reference data obtained by capturing or scanning the real-world people and objects. For example, the walls, furniture, props, actors, costumes, and other objects and even visual effects can be initially captured (i.e., “digitized”) by using LIDAR, photogrammetry, or other techniques. This results in highly accurate depth and color texture information for objects captured in the images.
  • This “recorded” data can then be used, in turn, to generate “synthetic” data by using the recorded data in computer modeling and rendering programs to change the positions of objects, camera characteristics and placement, environmental effects (e.g., lighting, haze, etc.) within the computer-generated scene and to capture images of the scenes along with the computer-generated depth information for the items in the images.
  • generic datasets may be obtained of unrelated sets or environments. Any one or more of these types of data, or mixtures or combinations of data, can be combined into a “training dataset” used to improve the later real-time depth detection during a live-action shoot, so that digital images can be more accurately composited onto, e.g., a director’s camera viewfinder or an actor’s virtual or augmented reality headset, in order to show what the final, composited, scene will look like.
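  • As an illustration of how recorded, custom synthetic and generic samples might be pooled into a single training dataset, the hypothetical sketch below mixes the three sources with configurable sampling weights; the class and field names are not from the patent.

```python
import random

class MixedDepthDataset:
    """Hypothetical pool of (image pair, ground-truth depth) samples.

    Each source is a list of dicts like {"left": ..., "right": ..., "depth": ...};
    the sampling weights control how much recorded vs. synthetic data the network sees.
    """
    def __init__(self, recorded, synthetic, generic, weights=(0.2, 0.6, 0.2)):
        self.sources = [recorded, synthetic, generic]
        self.weights = weights

    def sample(self):
        source = random.choices(self.sources, weights=self.weights, k=1)[0]
        return random.choice(source)

# dataset = MixedDepthDataset(recorded_scans, synthetic_renders, generic_pairs)
# batch = [dataset.sample() for _ in range(8)]
```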
  • custom synthetic data is obtained by capturing key aspects of the actual set or environment that will be used in an upcoming live action shoot where views of composite CG and live action are desired to be presented in real time.
  • Actors and costumes can be captured in various poses and positions on the set.
  • Other characteristics of the physical set and environment can be captured such as lighting, object positionings, camera view positioning and settings, camera noise, etc.
  • the custom recorded data is imported into a computer graphics rendering program so that the objects may be digitally repositioned. Lighting and noise or other effects can be added or subtracted in the digital images. Actors can be posed and placed along with various props and effects, if desired. Selected images of these synthesized views can be captured along with their depth information.
  • only the synthetic data obtained from custom recorded data is used to form the training dataset.
  • any desired combinations of recorded, custom recorded and/or synthetic data can be used.
  • One embodiment uses semi-synthetic data where one or a few recorded data instances are used to generate many synthetic instances.
  • a dataset may be pre-compiled from recorded data from one or more unrelated sets or environments. This pre-compiled dataset can then be used to train a deep neural network to be used for real time compositing when live-action shooting occurs in a different setting, environment or location.
  • a training dataset is synthesized from custom recorded data from scanning an actual set to be used in a future shoot.
  • the training dataset is then used to train a deep neural network to improve the depth mapping of images in real time when the future shoot is undertaken. Details of known procedures for training using datasets are provided in, e.g., reference (5), cited above.
  • a machine learning training approach includes starting with random weights. Predictions are made by the network. The differences between the predicted and actual depths are computed and the weights are changed to make the prediction closer according to a scoring function. This is repeated until suitable training has been achieved for a threshold number of images.
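  • A minimal PyTorch rendering of that loop (random initial weights, prediction, scoring against ground-truth depth, weight update) is sketched below, assuming a model that maps an image pair to a depth map and a loader yielding (left, right, depth) tensors; the names are illustrative only.

```python
import torch

def train_depth_network(model, loader, epochs=10, lr=1e-4):
    """Generic supervised loop: predict, score against ground truth, update weights."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)   # weights start from random initialization
    loss_fn = torch.nn.SmoothL1Loss()
    for epoch in range(epochs):
        for left, right, gt_depth in loader:
            pred_depth = model(left, right)
            loss = loss_fn(pred_depth, gt_depth)         # difference between predicted and actual depths
            opt.zero_grad()
            loss.backward()                              # adjust weights to bring predictions closer
            opt.step()
    return model
```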
  • the size of the training dataset may vary widely, such as from one or a few to hundreds of thousands or millions of images.
  • training can take from hours up to one or more weeks. Evaluation of the effectiveness of the training can be performed visually by a human operator after an initial automatic evaluation, although in other embodiments the evaluation can be divided differently between manual and automated actions, including wholly manual or wholly automated arrangements.
  • An operator interface is provided to allow a human to change settings. During the live action filming an operator can change settings on the auxiliary cameras (used to capture depth disparity information). Camera positions (distance apart), gain, brightness or other characteristics can be adjusted to improve the depth map generation. Differently trained neural networks can be available for an operator to switch between.
  • Data can be recorded at higher resolution for areas of interest such as human faces, furniture, etc. Information about the actual shoot can be used such as “X’s” placed on the floor where actors will stand. Those areas can be subjected to more dense recording or synthesizing of data. Conversely, if it is known that areas of the set or environment will not be used then those areas can be the subject of less attention, or might be ignored entirely, for the training dataset.
  • One approach allows adding camera noise into the synthesized images in order to better train for the real camera images that will be handled as inputs during the live action shooting. Actual measured noise levels of cameras are used as target levels. Frequency response analysis of camera noise characteristics can be performed and those characteristics matched in the synthetic data for better training.
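  • A hedged sketch of that noise-matching idea: Gaussian noise at the measured camera level, optionally low-pass filtered to roughly shape its frequency content, is added to each synthetic frame. The actual noise model used by the system is not stated; this is an illustrative stand-in.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def add_camera_noise(synthetic_img, noise_sigma=2.5, spatial_sigma=0.8, seed=None):
    """Add noise to a synthetic frame at a level measured from the real camera.

    noise_sigma   : standard deviation measured from real camera frames (8-bit units)
    spatial_sigma : optional blur of the noise field to shape its frequency content
    """
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, noise_sigma, synthetic_img.shape)
    if spatial_sigma > 0:
        sig = (spatial_sigma,) * 2 + (0,) * (noise.ndim - 2)   # blur spatially, not across channels
        noise = gaussian_filter(noise, sigma=sig)
    noisy = synthetic_img.astype(np.float32) + noise
    return np.clip(noisy, 0, 255).astype(np.uint8)

# noisy_frame = add_camera_noise(render, noise_sigma=measured_sigma)
```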
  • the processing time to match a depth map to a live-action frame can be shortened by the use of pre-stored camera parameters so that when a camera setting (e.g., focal length, etc.) is changed in the main picture camera, the corresponding change in the camera’s frame captures can be applied to the depth map.
  • Any suitable programming and/or database retrieval technique may be used.
  • a look-up table is used that includes pre-computed values for the effect of changes in the main camera settings on the resulting captured images.
  • a lookup table entry corresponding to the new focal length is used and applied to the depth map in order that the depth map be modified (“distorted”) in the same way as the captured main images.
  • This approach can similarly be used for changes in other camera parameters.
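  • A simplified, hypothetical illustration of the lookup-table idea above: pre-computed per-setting scale factors are looked up when the picture camera’s focal length changes and applied to the depth map so that it is “distorted” in step with the main image. A real lens table would hold richer distortion data; the values below are made up.

```python
import numpy as np

# Hypothetical pre-computed table: focal length (mm) -> image scale factor.
FOCAL_SCALE_LUT = {24: 1.00, 35: 1.46, 50: 2.08, 85: 3.54}

def rescale_depth_for_focal(depth_map, new_focal_mm, ref_focal_mm=24):
    """Apply the pre-computed effect of a focal-length change to the depth map."""
    scale = FOCAL_SCALE_LUT[new_focal_mm] / FOCAL_SCALE_LUT[ref_focal_mm]
    h, w = depth_map.shape
    # Resample the depth map about its centre by the same scale as the picture image.
    ys = (np.arange(h) - h / 2) / scale + h / 2
    xs = (np.arange(w) - w / 2) / scale + w / 2
    yi = np.clip(np.round(ys).astype(int), 0, h - 1)
    xi = np.clip(np.round(xs).astype(int), 0, w - 1)
    return depth_map[np.ix_(yi, xi)]

# zoomed_depth = rescale_depth_for_focal(dense_depth, new_focal_mm=50)
```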
  • Embodiments may also employ a human operator to visually inspect, in real-time, the depth map “fitting” to the captured main images and to make visual adjustments.
  • the operator can have x, y and z (depth) adjustments and can fit the depth map to the captured image by panning and scrolling and zooming.
  • Other controls can be provided to the operator.
  • a combination of automated and manual matching tools can be provided at an operator interface. These approaches can be used at any one or more of the steps shown in Figures 1 or 2.
  • Routines of particular embodiments can be implemented using any suitable programming language, including C, C++, Java, assembly language, etc.
  • Different programming techniques can be employed such as procedural or object oriented.
  • the routines can execute on a single processing device or multiple processors. Although the steps, operations, or computations may be presented in a specific order, this order may be changed in different particular embodiments. In some particular embodiments, multiple steps shown as sequential in this specification can be performed at the same time.
  • Particular embodiments may be implemented in a computer-readable storage medium for use by or in connection with the instruction execution system, apparatus, system, or device. Particular embodiments can be implemented in the form of control logic in software or hardware or a combination of both. The control logic, when executed by one or more processors, may be operable to perform that which is described in particular embodiments. Some embodiments are implemented as processor-implementable code provided on a computer-readable medium.
  • the computer-readable medium can comprise a non-transient storage medium, such as solid-state memory, a magnetic disk, optical disk etc., or a transient medium such as a signal transmitted over a computer network.
  • Particular embodiments may be implemented by using a programmed general-purpose digital computer, application-specific integrated circuits, programmable logic devices, or field-programmable gate arrays; optical, chemical, biological, quantum or nanoengineered systems, components and mechanisms may also be used.
  • the functions of particular embodiments can be achieved by any means as is known in the art.
  • Distributed, networked systems, components, and/or circuits can be used. Communication, or transfer, of data may be wired, wireless, or by any other means.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Computer Graphics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments allow live action images from an image capture device to be composited with computer generated images in real-time or near real-time. The two types of images (live action and computer generated) are composited accurately by using a depth map. In an embodiment, the depth map includes a "depth value" for each pixel in the live action image. In an embodiment, steps of one or more of feature extraction, matching, filtering or refinement can be implemented, at least in part, with an artificial intelligence (AI) computing approach using a deep neural network with training. A combination of computer-generated ("synthetic") and live-action ("recorded") training data is created and used to train the network so that it can improve the accuracy or usefulness of a depth map so that compositing can be improved.

Description

SYSTEM FOR IMAGE COMPOSITING INCLUDING TRAINING WITH SYNTHETIC DATA
Cross References to Related Applications
This application claims the benefit of U.S. Provisional Patent Application Serial No. 62/968,041, entitled SYSTEM USING ARTIFICIAL INTELLIGENCE TO GENERATE A DEPTH MAP INCLUDING TRAINING WITH SYNTHETIC DATA, filed on January 30, 2020; from U.S. Provisional Patent Application Serial No. 62/968,035, entitled METHOD FOR GENERATING PER PIXEL DEPTH INFORMATION, filed on January 30, 2020; and from U.S. Patent Application Serial No. 17/081,843, entitled SYSTEM FOR IMAGE COMPOSITING INCLUDING TRAINING WITH SYNTHETIC DATA, filed on 27 October 2020; which are each hereby incorporated by reference as if set forth in full in this application for all purposes.
This application is related to the following applications: U.S. Patent Application No. 17/018,943, entitled COMPUTER-GENERATED IMAGE PROCESSING INCLUDING VOLUMETRIC SCENE RECONSTRUCTION, filed September 11, 2020, which claims priority to U.S. Provisional Application No. 62/983,530, entitled COMPUTER GENERATED IMAGE PROCESSING INCLUDING VOLUMETRIC SCENE RECONSTRUCTION, filed February 28, 2020, which are hereby incorporated by reference as if set forth in full in this application for all purposes.
Background
[1] Many visual productions (e.g., movies, video) use a combination of real and digital images. For example, a live actor may be in a scene with a computer-generated (“CG,” or merely “digital”) charging dinosaur. An actor’s face may be rendered as a monster. An actress may be rendered as a younger version of herself, etc. In order to allow the creators (i.e., director, actors) of the live action scenes to better interact with and utilize the digital models it is desirable to provide the live action creators with a close approximation of what the final composited imagery will look like at the time of recording, or “shooting,” the live action scenes.
[2] Since recording live action occurs in real time and often requires many “takes” it is useful to be able to generate the composited imagery in real time, or near real-time, so that an on-set assessment of the recorded takes can be made. This approach also allows the human creators to more accurately interact with and react to the digital imagery.
[3] However, such real-time processing to composite the CG with live action is often difficult because of the large amount of data involved and due to the computing difficulty of accurately matching depth information between the live action and CG images. For example, it is necessary to determine depths (e.g., distance from camera) of elements in a live action scene in order to accurately composite the live action elements with CG images in a realistic way.
[4] It is an object of at least preferred embodiments to address at least some of the aforementioned disadvantages. An additional or alternative object is to at least provide the public with a useful choice.
Summary
[5] One embodiment uses one or more auxiliary, or “depth,” cameras to obtain stereo depth information of live action images. Each auxiliary camera outputs a standard RGB or grayscale image for purposes of comparing the different views to obtain depth information (although other cameras or sensors can be used such as infrared (IR) or RGBIR, time-of-flight, LIDAR, etc.). The depth information is correlated to picture images from a main image capture device (e.g., a main cinema camera sometimes referred to as a “hero” camera or “picture” camera) that captures the same live action as the auxiliary cameras. The raw auxiliary camera images are subjected to various steps such as one or more of pre-processing, disparity detection, feature extraction, matching, reprojection, infilling, filtering, and other steps. The result of the steps is a depth map that is then aligned to the image from the picture camera. In an embodiment, each picture element (pixel) in the picture camera’s image is provided with a depth value. This allows elements or objects in the picture image to be accurately integrated with a CG image. CG elements may be integrated into live action images or vice versa. The resulting composite image is then displayed and shows the live action accurately composited with the CG elements. Although the auxiliary cameras are described as dedicated and distinct from the picture camera, in other embodiments depth information can be computed from any two or more cameras including using the picture camera described herein.
[6] In an embodiment, steps of one or more of feature extraction, matching, filtering or refinement can be implemented, at least in part, with an artificial intelligence (AI) computing approach using a deep neural network with training. A combination of computer-generated (“synthetic”) and live-action (“recorded”) training data is created and used to train the network so that it can improve the accuracy or usefulness of a depth map so that compositing can be improved.
[7] In accordance with an aspect, a tangible processor-readable medium includes instructions executable by one or more processors for: selecting a deep neural network that has been trained using a dataset derived, at least in part, from synthetic data; using the selected deep neural network to process image information from one or more auxiliary cameras to generate a depth map; correlating the depth map with at least a portion of picture elements in at least one picture image obtained from an image capture device; and using the correlated depth map to composite one or more digital elements with one or more picture elements.
[8] In accordance with a further aspect, a method for determining picture element depths comprises: selecting a deep neural network that has been trained using a dataset derived, at least in part, from synthetic data derived from a scene; using the selected deep neural network to process image information of the scene from one or more auxiliary cameras to generate a depth map; and correlating the depth map with at least a portion of picture elements in at least one of the picture images, using the correlated depth map to composite one or more digital elements with one or more picture elements.
[9] In accordance with a further aspect, a method for generating a depth map comprises: receiving a dataset including a plurality of images and depths of objects in a scene; using the dataset to train a deep neural network to assist, at least in part, in generating a depth map for use in real-time compositing of a live action recording taking place in the scene.
[10] In accordance with a further aspect, a processor-readable medium includes instructions executable by one or more digital processors for determining picture element depths. The processor-readable medium comprises one or more instructions executable by the one or more digital processors for: receiving a dataset including a plurality of images and depths of objects in a scene; using the dataset to train a deep neural network to assist, at least in part, in generating a depth map for use in real-time compositing of a live action recording taking place in the scene.
Brief Description of the Drawings
Fig. 1 illustrates basic components and steps of an embodiment;
Fig. 2 shows basic sub-steps in pre-processing;
Fig. 3 illustrates an example of a visual content generation system; and
Fig. 4 shows a block diagram illustrating an example computer system adaptable for use with functions described herein.
Detailed Description of Embodiments
[11] Embodiments allow live action images from a picture camera to be composited with computer generated images in real-time or near real-time. The two types of images (live action and computer generated (“CG”)) are composited accurately by using a depth map. The depth map includes a “depth value” for each pixel in the live action, or picture, image. In an embodiment, the depth value is defined as the distance between the picture camera origin and a plane that is perpendicular to the picture camera viewing direction.
In other embodiments, the depth value can be referenced from a different camera or defined location and calculated to a desired plane or point. In other embodiments, the depth can be with respect to a different reference point. Also, in some embodiments not all of the pixels need be mapped with a depth value. Rather, depth values may only need to be mapped for a region of interest. For example, parts of a scene can be masked out (greenscreen, etc.); the background may be ignored (i.e., distances past a certain value or plane); objects or distance ranges can be identified, etc., so that they do not need to be depth-mapped to the same degree or at all. A degree of tolerance or accuracy may similarly be non-uniform over a picture image, or frame, so that areas of focus (e.g., an actor’s face; an action, etc.) can be provided with heightened depth accuracy over other areas in a frame of the picture camera.
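As a worked illustration of the depth convention just described (a sketch, not text or code from the patent), the depth of a 3D point relative to the picture camera can be computed as the component of the point's offset from the camera origin along the viewing direction; the camera pose values below are hypothetical.

```python
import numpy as np

def plane_depth(points_xyz, cam_origin, view_dir):
    """Depth of each 3D point measured along the camera viewing direction.

    This matches the convention above: the distance from the camera origin to
    the plane through the point that is perpendicular to the viewing direction.
    """
    view_dir = view_dir / np.linalg.norm(view_dir)   # unit viewing direction
    return (points_xyz - cam_origin) @ view_dir      # signed distance along view_dir

# Hypothetical example: camera at the origin looking down +Z.
pts = np.array([[0.0, 1.0, 4.0], [2.0, 0.0, 7.5]])
print(plane_depth(pts, cam_origin=np.array([0.0, 0.0, 0.0]),
                  view_dir=np.array([0.0, 0.0, 1.0])))   # depths: 4.0 and 7.5
```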
[12] In an embodiment, the compositing process is performed in real-time. That is, each frame is composited so that it is ready for display at a standard frame rate being used for playback (e.g., 30 or 24 frames per second, etc.). It is desirable to reduce any delay between an image acquisition and display of a composited image. One embodiment achieves a delay in the range of 2 to 4 frames at a predetermined framerate. This allows the team shooting the live action to be able to view the composited images essentially concurrently with the recording of the live action and enables a director, cinematographer, actors, special effects persons, etc., to coordinate the live action more effectively with the computer-generated images. This approach also allows the composited images, or portions thereof, to be used with standard flat panel monitors, augmented reality, virtual reality, or other types of visual output devices. In other embodiments, frames may be skipped, or dropped, or the compositing modified to be slower than real time while still achieving desired functionality. Various aspects of the features described herein may be useful at other times or places such as in a post production facility.
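For a sense of the timing budget implied by those figures (simple arithmetic on the quoted frame rates and the 2 to 4 frame delay):

```python
# Delay implied by a 2-4 frame lag at common playback rates.
for fps in (24, 30):
    frame_ms = 1000.0 / fps
    print(f"{fps} fps: one frame = {frame_ms:.1f} ms, "
          f"2-4 frame delay = {2 * frame_ms:.0f}-{4 * frame_ms:.0f} ms")
# 24 fps: one frame = 41.7 ms, 2-4 frame delay = 83-167 ms
# 30 fps: one frame = 33.3 ms, 2-4 frame delay = 67-133 ms
```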
[13] In an embodiment, a dataset is received that includes a plurality of images and depths of objects in an environment. The dataset is used to train a deep neural network to assist, at least in part, in generating a depth map for use in real-time compositing of a live action recording taking place in the environment. Functionality described herein can be implemented using various programming techniques unless otherwise indicated. Functionality can be performed by one or more computers or processors executing instructions to control the processors or computers. The instructions may be provided on a machine-readable medium. The processor or computer-readable medium can comprise a non-transient storage medium, such as solid-state memory, a magnetic disk, optical disk etc., or a transient medium such as a signal transmitted over a computer network.
[14] In an embodiment, one or more images from the one or more auxiliary cameras are processed to generate a depth map for elements of a picture image from a camera.
The depth map is correlated with at least a portion of picture elements in at least one picture image received from a picture camera, using the correlated depth map to composite one or more digital elements with one or more picture elements. In a stereo approach, depths of the picture elements are determined by using two or more images from two or more auxiliary cameras to generate a depth map. The depth map is correlated with at least a portion of picture elements in at least one of the picture images, and the correlated depth map is used to composite one or more digital elements with one or more picture elements. The compositing may be performed by one or more processors or computer systems. Processor-implementable instructions to control the processor or computer to perform one or more steps of the method may be provided on a machine (e.g., processor or computer-readable) medium. The computer-readable medium can comprise a non-transient storage medium, such as solid-state memory, a magnetic disk, optical disk etc., or a transient medium such as a signal transmitted over a computer network. In other approaches, depth information may be obtained by any one or more other cameras or other types of sensing devices. For example, multiple pairs of machine-vision cameras can be used at different locations and orientations on a set. The main imaging camera (also called a “hero” camera or a “picture” camera) can include a stereo pair of cameras for 3D filming. Single cameras or other sensors can be used to obtain depth information. Examples of such cameras and sensors are described in, for example, U.S. Patent Application Ser. No. 17/018,943, referenced above.
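In the stereo approach described above, depth is conventionally recovered from the disparity between the rectified auxiliary images using the pinhole relation depth = focal length × baseline / disparity. The sketch below illustrates that relation only; the focal length and baseline values are made up and are not taken from the patent.

```python
import numpy as np

def disparity_to_depth(disparity_px, focal_px, baseline_m, min_disp=0.1):
    """Standard rectified-stereo relation: depth = f * B / d.

    disparity_px : per-pixel disparity between the two auxiliary cameras (pixels)
    focal_px     : focal length expressed in pixels
    baseline_m   : distance between the auxiliary cameras (metres)
    """
    d = np.maximum(disparity_px, min_disp)   # avoid division by zero
    return focal_px * baseline_m / d

# Hypothetical rig: 1400 px focal length, 20 cm baseline.
disp = np.array([[70.0, 35.0], [14.0, 7.0]])
print(disparity_to_depth(disp, focal_px=1400.0, baseline_m=0.2))
# depths in metres: [[4. 8.] [20. 40.]]
```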
[15] Fig. 1 illustrates basic components and steps of a system to perform real-time compositing of live action images with computer-generated images.
[16] Fig. 1 illustrates basic components and steps of a system to perform real-time compositing of live action images with computer-generated images. The term “real-time” as used to describe depth map generation, processing and use in compositing includes “near real-time,” where there is a short delay or lag in processing. Since the depth map generation starts at the same time as, or slightly after, the capture of a picture frame, the depth map will not be available until after the captured frame is available.
[17] In Fig. 1, system 100 includes a live action camera rig 110. Camera rig 110 includes picture camera 112 and left and right auxiliary cameras 114 and 116, respectively. In the system illustrated in Fig. 1, depth information is obtained by using left and right stereo view cameras in order to calculate the depth of each pixel in an image or frame captured by picture camera 112. In an embodiment, the picture camera is at 2K resolution and the auxiliary cameras are at 2K resolution. In other embodiments varying resolutions for the cameras may be used. One approach uses resolutions adequate so that the auxiliary camera frames can be used to compute a depth map for each pixel in a frame of an image from the picture camera. During shooting, all three cameras are maintained in fixed positions with respect to each other. The cameras can be mounted on a common physical structure, for example. Depending on the cinematic needs of the shot, the cameras may be stationary, mounted on a boom or dolly, handheld, etc. In general, any suitable arrangement or configuration of cameras may be used. In other embodiments a fixed arrangement between cameras may not be necessary such as if the relative arrangement of cameras is otherwise known or defined.
[18] In other embodiments, other approaches to obtain depth information may be used. For example, structured light, time-of-flight, photogrammetry, etc. techniques may be employed. One or more auxiliary cameras may be used. Other variations are possible.
[19] In general, the live action camera rig is used to record live action such as moving actors, vehicles or other objects. However, the live action scene need not require movement. Even where the camera changes position within an inanimate setting, or even where the camera and scene are static, the accuracy of compositing is important for the creators of the film or video to have confidence that they have achieved the desired shot.
[20] The picture image and the left and right depth images, also referred to as “frames,” are provided to computer system 130. Computer system 130 is merely a representation of various computing resources that can be used to perform the process actions and steps described below. Any number and type of discrete or integrated hardware and software components may be used. The components may be located local to, or remote from, the cameras as, for example, interlinked by one or more networks.
[21] Calibration data 118 from the camera rig is also sent to the computer system. This data can include the relative positions of the cameras to each other, lens information (focal length, aperture, magnification, etc.), rig position and orientation, or other data useful to calibrate the multiple sets of images being generated.
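One plausible way to carry such calibration data alongside each set of frames is a small record like the sketch below; the structure and field names are hypothetical and simply mirror the items listed in the paragraph above.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class RigCalibration:
    """Hypothetical per-take calibration record for the camera rig."""
    picture_to_left: np.ndarray    # 4x4 extrinsic, picture camera -> left auxiliary camera
    picture_to_right: np.ndarray   # 4x4 extrinsic, picture camera -> right auxiliary camera
    lens: dict = field(default_factory=dict)   # e.g. {"focal_mm": 35.0, "aperture": 2.8, "magnification": 1.0}
    rig_position: np.ndarray = field(default_factory=lambda: np.zeros(3))
    rig_orientation: np.ndarray = field(default_factory=lambda: np.eye(3))

calib = RigCalibration(picture_to_left=np.eye(4), picture_to_right=np.eye(4),
                       lens={"focal_mm": 35.0, "aperture": 2.8})
```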
[22] Computer system 130 can either generate images or retrieve previously stored computer graphic images such as frame 124. Since the CG images are created based on computer models, all of the depth information is already defined for each of their elements. The remaining steps of Fig. 1 are needed to quickly and accurately determine depth information for elements in the picture camera image in order that the live action image can be accurately placed “into” (i.e., composited with) the CG image.
[23] In Fig. 1, steps or acts at 140 are used to generate a depth map that includes depth information for each pixel of the image from the picture camera.
[24] Left image 142 from left auxiliary camera 114, together with right image 144 from right auxiliary camera 116, are processed at 146. This pre-processing compares the differences, or “disparity,” between the images to generate disparity map 148. The disparity processing can use known or future methods based on parallax effects, modeling, training, lighting or other characteristics of the images. Computation can use machine learning approaches such as artificial neural networks. Other techniques can be used. Disparity processing may remove distortions and unwanted camera or lens effects and other image anomalies.
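As one concrete, widely used way to produce a disparity map such as 148 from rectified left and right auxiliary frames, a semi-global block matcher can be applied; this OpenCV sketch is illustrative only and is not asserted to be the method used in the described system.

```python
import cv2

def compute_disparity(left_bgr, right_bgr, num_disparities=128, block_size=5):
    """Disparity of the left auxiliary image with respect to the right one (pixels)."""
    left = cv2.cvtColor(left_bgr, cv2.COLOR_BGR2GRAY)
    right = cv2.cvtColor(right_bgr, cv2.COLOR_BGR2GRAY)
    matcher = cv2.StereoSGBM_create(
        minDisparity=0,
        numDisparities=num_disparities,   # must be divisible by 16
        blockSize=block_size,
        P1=8 * block_size ** 2,           # smoothness penalties
        P2=32 * block_size ** 2,
        uniquenessRatio=10,
        speckleWindowSize=100,
        speckleRange=2,
    )
    # StereoSGBM returns fixed-point disparities scaled by 16.
    return matcher.compute(left, right).astype("float32") / 16.0

# disparity = compute_disparity(cv2.imread("left.png"), cv2.imread("right.png"))
```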
[25] Disparity map 148 is then re-projected onto the picture image using camera calibration data. In this operation, the resulting disparity map may have artifacts, such as “holes,” “gaps,” or other types of discontinuities in its image and depth information, as represented at 150. As a result, corrections processing 152 may be necessary to correct the artifacts. In an embodiment, an artificial intelligence process is used to perform infilling and densification to remove holes.
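As a simple illustration of infilling and densification (one of many possible strategies, and not asserted to be the AI-based process referred to above), invalid depth samples can be replaced with the nearest valid sample:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def fill_holes_nearest(depth, invalid_value=0.0):
    """Replace invalid depth samples with the nearest valid sample."""
    invalid = depth == invalid_value
    if not invalid.any():
        return depth
    # Indices of the nearest valid pixel for every location.
    idx = distance_transform_edt(invalid, return_distances=False, return_indices=True)
    return depth[tuple(idx)]

sparse = np.array([[1.0, 0.0, 3.0],
                   [0.0, 0.0, 3.0],
                   [5.0, 5.0, 0.0]])
print(fill_holes_nearest(sparse))   # zeros replaced by their nearest non-zero neighbours
```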
[26] The result of correcting artifacts (if necessary) is dense depth map 154. In an embodiment, the dense depth map is at the same or higher resolution than the picture image so that it can be mapped to the picture image to provide a depth for each pixel in the picture image. This picture image plus depth map is shown as output 160. The output 160 is then composited with CG image 124 to produce composite image 170 where the live action image is properly placed into the CG image based on the derived depth information from steps 140.
[27] Using the dense depth map, various items in the CG image will be properly placed and masked behind items in the live action image or vice versa. Additional features can be provided in the compositing, such as to allow making objects transparent or semi-transparent in order to see image items that would otherwise be occluded. The correct placement of live action elements in depth can assist in the use of transparency in the CG. Similarly, additional features or effects such as shadowing/lighting (e.g. CG object drops shadow on live action actor) can be generated and composited more realistically.
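The per-pixel placement described above reduces to a depth test: wherever the live-action depth is nearer than the CG depth, the live-action pixel is kept, and otherwise the CG pixel is. The minimal sketch below shows only that test and omits the transparency and shadow effects mentioned above.

```python
import numpy as np

def depth_composite(live_rgb, live_depth, cg_rgb, cg_depth):
    """Per-pixel z-test: the nearer surface occludes the farther one."""
    live_in_front = (live_depth < cg_depth)[..., None]   # broadcast over RGB channels
    return np.where(live_in_front, live_rgb, cg_rgb)

# Hypothetical 1x2 frame: live pixel is in front on the left, behind on the right.
live_rgb = np.array([[[255, 0, 0], [255, 0, 0]]], dtype=np.uint8)
cg_rgb   = np.array([[[0, 0, 255], [0, 0, 255]]], dtype=np.uint8)
live_d   = np.array([[2.0, 9.0]])
cg_d     = np.array([[5.0, 5.0]])
print(depth_composite(live_rgb, live_d, cg_rgb, cg_d))   # left pixel live (red), right pixel CG (blue)
```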
[28] Fig. 2 shows basic sub-steps in pre-processing step 146 of Fig. 1.
[29] In an embodiment, deep neural network techniques are used to implement one or more of the steps of Fig. 2. In other embodiments, other programming techniques may be used instead of, or in addition to, the specifics described herein. For example, other artificial intelligence approaches can be employed such as those known in the field of machine learning, or otherwise. In applications where specific hardware (e.g., graphics processing units (GPUs), application-specific integrated circuits (ASICs), custom or semi-custom processors, etc.), is used to accelerate computation it may be useful to include legacy approaches to problem solving such as procedural or “brute force” techniques. In other embodiments, any of a number of deep learning architectures currently known or yet to be devised, may be employed. For example, deep belief networks, recurrent neural networks, convolutional neural networks, etc., may be used.
[30] In Fig. 2, the pre-processing determines differences among the same parts or features of items in the left and right auxiliary camera images. The features may be large or small depending on the degree of interest or importance to the ultimate compositing, and depending on the image area occupied by the feature. For example, a feature may be a person, an eye, an eyelash, etc. At step 210, feature maps are extracted from images 202 and 204. At step 220, the feature maps are compared to determine the same features in the two images. Step 230 applies convolution filtering to achieve coarse volumetric placement and matching at a low resolution (240).
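A condensed PyTorch sketch of such a coarse stage is shown below: shared feature extraction (step 210), feature comparison via a disparity cost volume (step 220), and convolution filtering with a soft-argmin readout giving a coarse, low-resolution disparity (steps 230/240). Layer widths, the 4x downsampling, and the maximum disparity are illustrative assumptions only, not values from the described embodiment:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseStereo(nn.Module):
    def __init__(self, max_disp: int = 32, feat: int = 16):
        super().__init__()
        self.max_disp = max_disp
        self.features = nn.Sequential(           # shared by both views (step 210)
            nn.Conv2d(3, feat, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.filter3d = nn.Sequential(            # convolution filtering (step 230)
            nn.Conv3d(2 * feat, feat, 3, padding=1), nn.ReLU(),
            nn.Conv3d(feat, 1, 3, padding=1),
        )

    def forward(self, left: torch.Tensor, right: torch.Tensor) -> torch.Tensor:
        fl, fr = self.features(left), self.features(right)
        b, c, h, w = fl.shape
        volume = fl.new_zeros(b, 2 * c, self.max_disp, h, w)
        for d in range(self.max_disp):            # compare features per candidate disparity (step 220)
            volume[:, :c, d, :, d:] = fl[:, :, :, d:]
            volume[:, c:, d, :, d:] = fr[:, :, :, : w - d]
        cost = self.filter3d(volume).squeeze(1)   # B x D x H/4 x W/4
        prob = F.softmax(-cost, dim=1)            # low cost -> high probability
        disp = torch.arange(self.max_disp, device=cost.device, dtype=prob.dtype)
        return (prob * disp.view(1, -1, 1, 1)).sum(dim=1)   # coarse disparity (240)
```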
[31] At step 250, refinement is then performed at high resolution using the original disparity images to check and adjust how the coarsely placed scene model can be positioned more precisely in the depth dimension. Step 260 shows a predicted image that can be used to “train” the system when compared to ground truth mapping 270 (“recorded” or “synthetic” data). The system uses silhouettes or outlines of the objects and encourages correct alignment of those outlines to reduce hops or jumps from frame to frame so that the final rendered sequence is continuous.
[32] Color images and depth maps are used as reference data, such as ground truth 270 data, to compare generated or predicted frames (such as predicted frames at 260) and correct the model so that predicted frames are closer to the training data obtained. Training data can be based on recorded or synthetic data. In one embodiment, synthetic training data is based on LIDAR or photogrammetric scans of actors and objects on the actual set. In other embodiments synthetic data can be obtained in any suitable manner.
[33] The sequence of steps in Fig. 2 for pre-processing to generate an improved disparity map can also be used to improve disparity map with artifacts 150 of Fig. 1. The picture image can be combined with disparity map with artifacts 150. In other words, each of steps 250-270 may be applied to an initial disparity map with artifacts, such as 150 of Fig. 1, to generate an improved disparity map without artifacts.
[34] FIG. 3 is a block diagram of an exemplary computer system 900 for use with implementations described herein. Computer system 900 is merely illustrative and not intended to limit the scope of the claims. One of ordinary skill in the art would recognize other variations, modifications, and alternatives. For example, computer system 900 may be implemented in a distributed client-server configuration having one or more client devices in communication with one or more server systems.
[35] In one exemplary implementation, computer system 900 includes a display device such as a monitor 910, computer 920, a data entry device 930 such as a keyboard, touch device, and the like, a user input device 940, a network communication interface 950, and the like. User input device 940 is typically embodied as a computer mouse, a trackball, a track pad, wireless remote, tablet, touch screen, and the like. Moreover, user input device 940 typically allows a user to select and operate objects, icons, text, characters, and the like that appear, for example, on the monitor 910.
[36] Network interface 950 typically includes an Ethernet card, a modem (telephone, satellite, cable, ISDN), (asynchronous) digital subscriber line (DSL) unit, and the like. Further, network interface 950 may be physically integrated on the motherboard of computer 920, may be a software program, such as soft DSL, or the like.
[37] Computer system 900 may also include software that enables communications over communication network 952 such as the HTTP, TCP/IP, and RTP/RTSP protocols, wireless application protocol (WAP), IEEE 802.11 protocols, and the like. Additionally or alternatively, other communications software and transfer protocols may also be used, for example IPX, UDP or the like. Communication network 952 may include a local area network, a wide area network, a wireless network, an Intranet, the Internet, a private network, a public network, a switched network, or any other suitable communication network, such as for example Cloud networks. Communication network 952 may include many interconnected computer systems and any suitable communication links such as hardwire links, optical links, satellite or other wireless communications links such as BLUETOOTH, WIFI, wave propagation links, or any other suitable mechanisms for communication of information. For example, communication network 952 may communicate to one or more mobile wireless devices 956A-N, such as mobile phones, tablets, and the like, via a base station such as wireless transceiver 954.
[38] Computer 920 typically includes familiar computer components such as a processor 960, and memory storage devices, such as a memory 970, e.g., random access memory (RAM), storage media 980, and system bus 990 interconnecting the above components. In one embodiment, computer 920 is a PC compatible computer having multiple microprocessors, graphics processing units (GPU), and the like. While a computer is shown, it will be readily apparent to one of ordinary skill in the art that many other hardware and software configurations are suitable for use with the present invention. Memory 970 and storage media 980 are examples of tangible non-transitory computer readable media for storage of data, audio/video files, computer programs, and the like. Other types of tangible media include disk drives, solid-state drives, floppy disks, optical storage media and bar codes, semiconductor memories such as flash drives, flash memories, random-access or read-only types of memories, battery-backed volatile memories, networked storage devices, Cloud storage, and the like.
[39] As mentioned above, one or more of the steps illustrated and described in connection with Figs. 1 or 2 may be performed with AI techniques. An embodiment uses a deep neural network with a training dataset to implement steps of feature extraction, matching, filtering and/or refinement. The training dataset can use images and depth reference data obtained by capturing or scanning real-world people and objects. For example, the walls, furniture, props, actors, costumes, and other objects and even visual effects can be initially captured (i.e., “digitized”) by using LIDAR, photogrammetry, or other techniques. This results in highly accurate depth and color texture information for objects captured in the images. This “recorded” data can then be used, in turn, to generate “synthetic” data by using the recorded data in computer modeling and rendering programs to change the positions of objects, camera characteristics and placement, and environmental effects (e.g., lighting, haze, etc.) within the computer-generated scene and to capture images of the scenes along with the computer-generated depth information for the items in the images.
[40] In addition to generating recorded and synthetic datasets from the actual movie set on which the filming is to take place, generic datasets may be obtained of unrelated sets or environments. Any one or more of these types of data, or mixtures or combinations of data, can be combined into a “training dataset” used to improve the later real-time depth detection during a live-action shoot so that digital images can be more accurately composited onto, e.g., a director’s camera viewfinder or an actor’s virtual or augmented reality headset, in order to show what the final, composited, scene will look like.
[41] In an embodiment, custom synthetic data is obtained by capturing key aspects of the actual set or environment that will be used in an upcoming live action shoot where views of composite CG and live action are desired to be presented in real time. Actors and costumes can be captured in various poses and positions on the set. Other characteristics of the physical set and environment can be captured such as lighting, object positionings, camera view positioning and settings, camera noise, etc.
[42] Once captured, the custom recorded data is imported into a computer graphics rendering program so that the objects may be digitally repositioned. Lighting and noise or other effects can be added or subtracted in the digital images. Actors can be posed and placed along with various props and effects, if desired. Selected images of these synthesized views can be captured along with their depth information. In an embodiment, only the synthetic data obtained from custom recorded data is used to comprise the training dataset. However, in other embodiments, any desired combinations of recorded, custom recorded and/or synthetic data can be used. One embodiment uses semi-synthetic data where one or a few recorded data instances are used to generate many synthetic instances.
[43] Although it can be beneficial to create the dataset using data recorded from the actual set to be used (“custom recorded data”), in other embodiments a dataset may be pre-compiled from recorded data from one or more unrelated sets or environments. This pre-compiled dataset can then be used to train a deep neural network to be used for real time compositing when live-action shooting occurs in a different setting, environment or location.
[44] In one embodiment, a training dataset is synthesized from custom recorded data from scanning an actual set to be used in a future shoot. The training dataset is then used to train a deep neural network to improve the depth mapping of images in real time when the future shoot is undertaken. Details of known procedures for training using datasets are provided in, e.g., reference (5), cited above.
[45] A machine learning training approach includes starting with random weights. The network makes predictions, the differences between the predicted and actual depths are computed, and the weights are adjusted to bring the predictions closer to the actual depths according to a scoring function. This is repeated until suitable training has been achieved over a threshold number of images. The size of the training dataset may vary widely, from one or a few images to hundreds of thousands or millions.
[46] In an embodiment, higher importance is assigned to edges or silhouettes of objects.
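One way to realize paragraphs [45] and [46] together is sketched below: a single training step whose per-pixel error is weighted more heavily near edges or silhouettes, with the edge mask derived here from gradients of the ground-truth depth. The model is assumed to map a stereo pair to a depth map at ground-truth resolution; the weight boost and the L1 scoring function are illustrative choices, not requirements of the described system:

```python
import torch
import torch.nn.functional as F

def edge_weights(gt_depth: torch.Tensor, boost: float = 4.0) -> torch.Tensor:
    # gt_depth: B x 1 x H x W. Large depth gradients mark object outlines.
    dy = (gt_depth[:, :, 1:, :] - gt_depth[:, :, :-1, :]).abs()
    dx = (gt_depth[:, :, :, 1:] - gt_depth[:, :, :, :-1]).abs()
    edges = F.pad(dy, (0, 0, 0, 1)) + F.pad(dx, (0, 1, 0, 0))
    return 1.0 + boost * (edges > edges.mean()).float()

def training_step(model, optimizer, left, right, gt_depth) -> float:
    optimizer.zero_grad()
    pred = model(left, right).unsqueeze(1)              # B x 1 x H x W
    weights = edge_weights(gt_depth)
    loss = (weights * (pred - gt_depth).abs()).mean()   # edge-weighted L1 score
    loss.backward()
    optimizer.step()                                    # nudge weights toward ground truth
    return loss.item()
```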
[47] Depending on the resolution of the images, the number of images in the dataset, and other factors, training can take from hours up to one or more weeks. Evaluation of the effectiveness of the training can be performed visually by a human operator after an initial automatic evaluation, although in other embodiments the training evaluation can be arranged differently, from wholly manual to wholly automated. An operator interface is provided to allow a human to change settings. During the live action filming an operator can change settings on the auxiliary cameras (used to capture depth disparity information). Camera positions (distance apart), gain, brightness or other characteristics can be adjusted to improve the depth map generation. Differently trained neural networks can be available for an operator to switch between.
[48] Data can be recorded at higher resolution for areas of interest such as human faces, furniture, etc. Information about the actual shoot can be used such as “X’s” placed on the floor where actors will stand. Those areas can be subjected to more dense recording or synthesizing of data. Conversely, if it is known that areas of the set or environment will not be used then those areas can be the subject of less attention, or might be ignored entirely, for the training dataset.
[49] One approach allows adding camera noise into the synthesized images in order to better train for the real camera images that will be handled as inputs during the live action shooting. Actual measured noise levels of cameras are used as target levels. Frequency response analysis of camera noise characteristics can be performed and those characteristics matched in the synthetic data for better training.
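A simple illustration of such noise injection is given below. It assumes a Gaussian read-noise level and a signal-dependent shot-noise gain measured from the real auxiliary cameras; the embodiment may match richer, frequency-analyzed noise characteristics than this basic model:

```python
import numpy as np

def add_camera_noise(image: np.ndarray,
                     read_noise_std: float = 2.0,     # measured, in 8-bit counts
                     shot_noise_gain: float = 0.05) -> np.ndarray:
    rng = np.random.default_rng()
    img = image.astype(np.float32)
    # Signal-dependent (shot) noise plus signal-independent (read) noise.
    shot = rng.normal(0.0, np.sqrt(np.maximum(img, 0.0) * shot_noise_gain + 1e-6))
    read = rng.normal(0.0, read_noise_std, size=img.shape)
    return np.clip(img + shot + read, 0, 255).astype(np.uint8)
```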
[50] In an embodiment, the processing time to match a depth map to a live-action frame can be shortened by the use of pre-stored camera parameters so that when a camera setting (e.g., focal length, etc.) is changed in the main picture camera, the corresponding change in the camera’s frame captures can be applied to the depth map. Any suitable programming and/or database retrieval technique may be used. In an embodiment, a look-up table is used that includes pre-computed values for the effect of changes in the main camera settings on the resulting captured images. For example, if there is a focal length change at the main camera, a lookup table entry corresponding to the new focal length is used and applied to the depth map so that the depth map is modified (“distorted”) in the same way as the captured main images. This approach can similarly be used for changes in other camera parameters.
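A minimal sketch of the lookup-table idea follows. It assumes the table stores a pre-computed magnification factor per focal length relative to a reference lens setting and applies a simple zoom-about-center to the depth map; a fuller implementation would also store and apply lens distortion and principal-point data per setting. The table values below are hypothetical:

```python
import cv2
import numpy as np

# Hypothetical pre-computed entries: focal length (mm) -> magnification factor.
LENS_TABLE = {35.0: 1.00, 50.0: 1.43, 85.0: 2.43}

def apply_focal_change(depth_map: np.ndarray, focal_length_mm: float) -> np.ndarray:
    zoom = LENS_TABLE[focal_length_mm]            # nearest-entry lookup omitted
    h, w = depth_map.shape[:2]
    # Scale the depth map about the image center so it "zooms" the same way
    # as the main camera's captured frames.
    M = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), 0.0, zoom)
    return cv2.warpAffine(depth_map.astype(np.float32), M, (w, h),
                          flags=cv2.INTER_LINEAR)
```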
[51] Embodiments may also employ a human operator to visually inspect, in real time, the depth map “fitting” to the captured main images and to make visual adjustments. The operator can have x, y and z (depth) adjustments and can fit the depth map to the captured image by panning, scrolling and zooming. Other controls can be provided to the operator. A combination of automated and manual matching tools can be provided at an operator interface. These approaches can be used at any one or more of the steps shown in Figures 1 or 2.
[52] Although the description has been provided with respect to particular embodiments thereof, these particular embodiments are merely illustrative, and not restrictive. Controls can be provided to allow modifying various parameters of the compositing at the time of performing the recordings. For example, the resolution, number of frames, and accuracy of depth position may all be subject to human operator changes or selection.
[53] Any suitable programming language can be used to implement the routines of particular embodiments, including C, C++, Java, assembly language, etc. Different programming techniques can be employed, such as procedural or object-oriented. The routines can execute on a single processing device or multiple processors. Although the steps, operations, or computations may be presented in a specific order, this order may be changed in different particular embodiments. In some particular embodiments, multiple steps shown as sequential in this specification can be performed at the same time.
[54] Particular embodiments may be implemented in a computer-readable storage medium for use by or in connection with the instruction execution system, apparatus, system, or device. Particular embodiments can be implemented in the form of control logic in software or hardware or a combination of both. The control logic, when executed by one or more processors, may be operable to perform that which is described in particular embodiments.
[55] Some embodiments are implemented as processor implementable code provided on a computer-readable medium. The computer-readable medium can comprise a non-transient storage medium, such as solid-state memory, a magnetic disk, optical disk etc., or a transient medium such as a signal transmitted over a computer network.
[56] Particular embodiments may be implemented by using a programmed general purpose digital computer, or by using application specific integrated circuits, programmable logic devices, or field programmable gate arrays; optical, chemical, biological, quantum or nanoengineered systems, components and mechanisms may also be used. In general, the functions of particular embodiments can be achieved by any means as is known in the art. Distributed, networked systems, components, and/or circuits can be used. Communication, or transfer, of data may be wired, wireless, or by any other means.
[57] It will also be appreciated that one or more of the elements depicted in the drawings/figures can also be implemented in a more separated or integrated manner, or even removed or rendered as inoperable in certain cases, as is useful in accordance with a particular application. It is also within the spirit and scope to implement a program or code that can be stored in a machine-readable medium to permit a computer to perform any of the methods described above.
[58] As used in the description herein and throughout the claims that follow, “a”, “an”, and “the” includes plural references unless the context clearly dictates otherwise. Also, as used in the description herein and throughout the claims that follow, the meaning of “in” includes “in” and “on” unless the context clearly dictates otherwise.
[59] The term ‘comprising’ as used in this specification means ‘consisting at least in part of’. When interpreting each statement in this specification that includes the term ‘comprising’, features other than that or those prefaced by the term may also be present. Related terms such as ‘comprise’ and ‘comprises’ are to be interpreted in the same manner.
[60] In this specification where reference has been made to patent specifications, other external documents, or other sources of information, this is generally for the purpose of providing a context for discussing the features of the invention. Unless specifically stated otherwise, reference to such external documents or such sources of information is not to be construed as an admission that such documents or such sources of information, in any jurisdiction, are prior art or form part of the common general knowledge in the art.
[61] Thus, while particular embodiments have been described herein, latitudes of modification, various changes, and substitutions are intended in the foregoing disclosures, and it will be appreciated that in some instances some features of particular embodiments will be employed without a corresponding use of other features without departing from the scope and spirit as set forth. Therefore, many modifications may be made to adapt a particular situation or material to the essential scope and spirit.

Claims

We claim:
1. A tangible processor-readable medium including instructions executable by one or more processors for:
selecting a deep neural network that has been trained using a dataset derived, at least in part, from synthetic data;
using the selected deep neural network to process image information from one or more auxiliary cameras to generate a depth map; and
correlating the depth map with at least a portion of picture elements in at least one picture image obtained from an image capture device; and
using the correlated depth map to composite one or more digital elements with one or more picture elements.
2. The tangible processor-readable medium of claim 1 , wherein the dataset includes recorded data and/or the dataset includes third party recorded data.
3. The tangible processor-readable medium of claim 1, the one or more tangible media further comprising logic for: re-projecting the disparity map into an image from the image capture device.
4. The tangible processor-readable medium of claim 3, the one or more tangible media further comprising logic for: infilling holes in the re-projected disparity map.
5. The tangible processor-readable medium of claim 1, further comprising: a signal interface for receiving image camera information and providing the image camera information to the one or more processors for processing by the deep neural network.
6. The tangible processor-readable medium of claim 5, wherein the image camera information includes a focal length of the image camera, and wherein the operation is in real time.
7. A method for determining picture element depths, the method comprising:
selecting a deep neural network that has been trained using a dataset derived, at least in part, from synthetic data derived from a scene;
using the selected deep neural network to process image information of the scene from one or more auxiliary cameras to generate a depth map; and
correlating the depth map with at least a portion of picture elements in at least one of the picture images
using the correlated depth map to composite one or more digital elements with one or more picture elements.
8. The method of claim 7, wherein the dataset includes recorded data and/or the dataset includes third party recorded data.
9. The method of claim 8, further comprising: re-projecting the disparity map into an image from the image capture device.
10. The method of claim 9, further comprising infilling holes in the re-projected disparity map.
11. The method of claim 10, further comprising receiving image camera information and providing the deep neural network for processing.
12. The method of claim 11, wherein the image camera information includes a focal length of the image camera, and wherein the operation is in real time.
13. A method for generating a depth map, the method comprising:
receiving a dataset including a plurality of images and depths of objects in a scene;
using the dataset to train a deep neural network to assist, at least in part, in generating a depth map for use in real-time compositing of a live action recording taking place in the scene.
14. The method of claim 13, wherein the dataset includes recorded data and/or the dataset includes third party recorded data.
15. A processor-readable medium including instructions executable by one or more digital processors for determining picture element depths, the processor-readable medium comprising one or more instructions executable by the one or more digital processors for:
receiving a dataset including a plurality of images and depths of objects in a scene;
using the dataset to train a deep neural network to assist, at least in part, in generating a depth map for use in real-time compositing of a live action recording taking place in the scene.
PCT/NZ2020/050134 2020-01-30 2020-10-28 System for image compositing including training with synthetic data WO2021154099A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202062968041P 2020-01-30 2020-01-30
US62/968,041 2020-01-30
US17/081,843 2020-10-27
US17/081,843 US11710247B2 (en) 2020-01-30 2020-10-27 System for image compositing including training with synthetic data

Publications (1)

Publication Number Publication Date
WO2021154099A1 true WO2021154099A1 (en) 2021-08-05

Family

ID=77078996

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/NZ2020/050134 WO2021154099A1 (en) 2020-01-30 2020-10-28 System for image compositing including training with synthetic data

Country Status (1)

Country Link
WO (1) WO2021154099A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030043270A1 (en) * 2001-08-29 2003-03-06 Rafey Richter A. Extracting a depth map from known camera and model tracking data
US20030174285A1 (en) * 2002-03-14 2003-09-18 Douglas Trumbull Method and apparatus for producing dynamic imagery in a visual medium
US20150248764A1 (en) * 2014-02-28 2015-09-03 Microsoft Corporation Depth sensing using an infrared camera
US20190102949A1 (en) * 2017-10-03 2019-04-04 Blueprint Reality Inc. Mixed reality cinematography using remote activity stations
US20190289281A1 (en) * 2018-03-13 2019-09-19 Magic Leap, Inc. Image-enhanced depth sensing via depth sensor control
US20190356905A1 (en) * 2018-05-17 2019-11-21 Niantic, Inc. Self-supervised training of a depth estimation system


Similar Documents

Publication Publication Date Title
US11978225B2 (en) Depth determination for images captured with a moving camera and representing moving features
US9426451B2 (en) Cooperative photography
US11710247B2 (en) System for image compositing including training with synthetic data
US7307654B2 (en) Image capture and viewing system and method for generating a synthesized image
CN110874818B (en) Image processing and virtual space construction method, device, system and storage medium
US20160150217A1 (en) Systems and methods for 3d capturing of objects and motion sequences using multiple range and rgb cameras
US20230186434A1 (en) Defocus operations for a virtual display with focus and defocus determined based on camera settings
CN108286945B (en) Three-dimensional scanning system and method based on visual feedback
CN101140661A (en) Real time object identification method taking dynamic projection as background
US20170374256A1 (en) Method and apparatus for rolling shutter compensation
JP2016537901A (en) Light field processing method
JP2020042772A (en) Depth data processing system capable of optimizing depth data by image positioning with respect to depth map
US20180342075A1 (en) Multi-view back-projection to a light-field
US10845188B2 (en) Motion capture from a mobile self-tracking device
CN116168076A (en) Image processing method, device, equipment and storage medium
CA3199128A1 (en) Systems and methods for augmented reality video generation
US11620765B2 (en) Automatic detection of a calibration object for modifying image parameters
CN113870213A (en) Image display method, image display device, storage medium, and electronic apparatus
KR20150101343A (en) Video projection system
KR102561903B1 (en) AI-based XR content service method using cloud server
JP2018116421A (en) Image processing device and image processing method
WO2021154099A1 (en) System for image compositing including training with synthetic data
CN117527993A (en) Device and method for performing virtual shooting in controllable space
US11282233B1 (en) Motion capture calibration
CN116152141A (en) Target object repositioning method and device, storage medium and electronic device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20815966

Country of ref document: EP

Kind code of ref document: A1

DPE1 Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101)
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20815966

Country of ref document: EP

Kind code of ref document: A1