CN116612509A - Biometric task network
- Publication number
- CN116612509A (application CN202310104367.7A)
- Authority
- CN
- China
- Prior art keywords
- neural network
- biometric
- task
- output
- mask
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06V40/168 — Human faces: Feature extraction; Face representation
- G06N3/08 — Neural networks: Learning methods
- G06V10/25 — Image preprocessing: Determination of region of interest [ROI] or a volume of interest [VOI]
- G06V10/56 — Extraction of image or video features relating to colour
- G06V10/82 — Image or video recognition or understanding using pattern recognition or machine learning using neural networks
- G06V40/23 — Movements or behaviour: Recognition of whole body movements, e.g. for sport training
Abstract
The present disclosure provides a "biometric task network". A deep neural network may provide output from a selected biometric analysis task that is one of a plurality of biometric analysis tasks based on an image. The selected biometric analysis task may be performed in a deep neural network that includes a common feature extraction neural network, a plurality of biometric task-specific neural networks, a plurality of segmentation mask neural networks, and an expert pooling neural network, and that performs the plurality of biometric analysis tasks by inputting the image to the common feature extraction neural network to determine latent variables. The latent variables may be input to the plurality of biometric task-specific neural networks to determine a plurality of biometric analysis task outputs. The latent variables may also be input to a segmentation neural network to determine a facial feature segmentation output. The facial feature segmentation output may be input to the plurality of segmentation mask neural networks.
Description
Cross Reference to Related Applications
This patent application claims priority from U.S. provisional patent application No. 63/310,401, filed on February 15, 2022, which application is hereby incorporated by reference in its entirety.
Technical Field
The present disclosure relates to a biometric task network in a vehicle.
Background
The image may be acquired by a sensor and processed using a computer to determine data about objects in the environment surrounding the system. The operation of the sensing system may include acquiring accurate and timely data about objects in the system environment. The computer may acquire images from one or more image sensors, which may be processed to determine data about the object. The computer may use data extracted from the image of the object to operate systems, including vehicles, robots, security and object tracking systems.
Disclosure of Invention
Biometric analysis may be implemented in a computer to determine data about objects (e.g., potential users) in or around a system or machine, such as a vehicle. Based on the data determined from the biometric analysis, the vehicle may be operated, for example. Biometric analysis herein means measuring or calculating data about a user based on physical characteristics of the user. For example, a computing device in a vehicle or a traffic infrastructure system may be programmed to obtain one or more images from one or more sensors included in the vehicle or the traffic infrastructure system and grant a user permission to operate the vehicle based on biometric data determined from the images. Such grant of permission is referred to herein as biometric identification. Biometric identification means determining the identity of a potential user. The determined user identity may be recorded to track which user is accessing the vehicle or compared to a list of authorized users to authenticate the user before permission is granted to the user to operate the vehicle or system. Biometric analysis includes determining one or more physical characteristics, such as user drowsiness, gaze direction, user pose, user liveness, and the like. In addition to vehicles, the biometric analysis tasks may also be applied to other machines or systems. For example, computer systems, robotic systems, manufacturing systems, and security systems may require that acquired images be used to identify potential users prior to granting access to the system or secure area.
Advantageously, the techniques described herein may enhance the ability of computing devices in a vehicle or traffic infrastructure system to perform biometric analysis by recognizing that facial biometric algorithms (such as facial feature recognition) include redundant tasks across different applications. Furthermore, some facial biometric algorithms have sparse or limited training data sets. The techniques described herein include a multi-task network that includes a common feature extraction neural network and a plurality of biometric analysis task neural networks. The deep neural network is configured to include the common feature extraction neural network as a "backbone" and a plurality of biometric analysis task neural networks that receive as input a set of common latent variables generated by the common feature extraction neural network. The deep neural network also includes an expert pooling neural network that enhances training of the deep neural network by sharing results among the plurality of biometric analysis tasks.
Spoof detection for biometric identification may benefit from a multi-task biometric neural network. Spoofing means using deceptive techniques to fool a biometric identification system. For example, an unauthorized user seeking access to a vehicle protected by a biometric identification system may hold a photograph of an authorized user in front of their face. In other examples, an unauthorized user may wear a mask printed with a photograph of an authorized user to fool the biometric system into granting access rights. Spoof detection that works by detecting a photograph or mask is referred to as liveness detection because it attempts to determine whether the facial features presented to the system were obtained from a live person rather than from a photographic rendering. Disclosed herein is an enhanced biometric identification system based on a neural network that combines output from an image segmentation task with a biometric identification task and a skin tone biometric analysis task to enhance biometric identification and liveness detection. Combining the output from the image segmentation task with the biometric identification task and the skin tone biometric analysis task enhances the accuracy of both biometric identification and liveness detection while reducing the computational resources required to determine the two tasks.
Disclosed herein is a method comprising providing output from a selected biometric analysis task that is one of a plurality of biometric analysis tasks based on an image provided from an image sensor, wherein the selected biometric analysis task is performed in a deep neural network comprising a common feature extraction neural network, a plurality of biometric task-specific neural networks, a plurality of segmentation mask neural networks, and an expert pooling neural network that performs the plurality of biometric analysis tasks by inputting the image to the common feature extraction neural network to determine latent variables. The latent variables may be input to the plurality of biometric task-specific neural networks to determine a plurality of biometric analysis task outputs. The latent variables may be input to a segmentation neural network to determine a facial feature segmentation output. The facial feature segmentation output may be input to a plurality of segmentation mask neural networks to determine a plurality of segmentation mask outputs. The plurality of biometric analysis task outputs and the plurality of segmentation mask outputs may be input to the expert pooling neural network to determine a liveness task output, and the plurality of biometric analysis task outputs and the liveness task output may be output. A device may be operated based on output from the deep neural network according to the selected biometric analysis task. The plurality of segmentation mask outputs from one or more segmentation mask neural networks may be stored in one or more memories to determine a temporal segmentation mask output based on a sequence of frames of video data.
The plurality of biometric analysis tasks may include biometric identification, liveness determination, drowsiness determination, gaze determination, pose determination, and facial feature segmentation. The common feature extraction neural network may include a plurality of convolutional layers. The plurality of biometric task-specific neural networks may include a plurality of fully connected layers. The plurality of segmentation mask neural networks may include a plurality of fully connected layers. The expert pooling neural network may include a plurality of fully connected layers. The deep neural network may be trained by: determining a plurality of loss functions for the plurality of biometric analysis task outputs and the liveness task output based on ground truth; combining the plurality of loss functions to determine a joint loss function; and back-propagating the loss functions and the joint loss function through the deep neural network to output a set of weights. The plurality of biometric analysis task outputs and the liveness task output may be input to a plurality of SoftMax functions before being input to the plurality of loss functions. During training, one or more outputs from the plurality of biometric task-specific neural networks and the liveness task output may be set to zero. The deep neural network may be configured to include a subset of the common feature extraction neural network and the biometric task-specific neural networks during inference based on the selected biometric analysis task. The deep neural network may be trained based on a loss function determined from sparse categorical cross entropy statistics. The deep neural network may be trained based on a loss function determined from mean square error statistics.
A computer readable medium storing program instructions for performing some or all of the above method steps is also disclosed. Also disclosed is a computer programmed to perform some or all of the above method steps, the computer comprising a computer device programmed to provide output from a selected biometric analysis task that is one of a plurality of biometric analysis tasks based on an image provided from an image sensor, wherein the selected biometric analysis task is performed in a deep neural network comprising a common feature extraction neural network, a plurality of biometric task-specific neural networks, a plurality of segmentation mask neural networks, and an expert pooling neural network that performs the plurality of biometric analysis tasks by inputting the image to the common feature extraction neural network to determine latent variables. The latent variables may be input to the plurality of biometric task-specific neural networks to determine a plurality of biometric analysis task outputs. The latent variables may be input to a segmentation neural network to determine a facial feature segmentation output. The facial feature segmentation output may be input to a plurality of segmentation mask neural networks to determine a plurality of segmentation mask outputs. The plurality of biometric analysis task outputs and the plurality of segmentation mask outputs may be input to the expert pooling neural network to determine a liveness task output, and the plurality of biometric analysis task outputs and the liveness task output may be output. A device may be operated based on output from the deep neural network according to the selected biometric analysis task. The plurality of segmentation mask outputs from one or more segmentation mask neural networks may be stored in one or more memories to determine a temporal segmentation mask output based on a sequence of frames of video data.
The instructions may include further instructions, wherein the plurality of biometric analysis tasks may include biometric identification, liveness determination, drowsiness determination, gaze determination, pose determination, and facial feature segmentation. The common feature extraction neural network may include a plurality of convolutional layers. The plurality of biometric task-specific neural networks may include a plurality of fully connected layers. The plurality of segmentation mask neural networks may include a plurality of fully connected layers. The expert pooling neural network may include a plurality of fully connected layers. The deep neural network may be trained by: determining a plurality of loss functions for the plurality of biometric analysis task outputs and the liveness task output based on ground truth; combining the plurality of loss functions to determine a joint loss function; and back-propagating the loss functions and the joint loss function through the deep neural network to output a set of weights. The plurality of biometric analysis task outputs and the liveness task output may be input to a plurality of SoftMax functions before being input to the plurality of loss functions. During training, one or more outputs from the plurality of biometric task-specific neural networks and the liveness task output may be set to zero. The deep neural network may be configured to include a subset of the common feature extraction neural network and the biometric task-specific neural networks during inference based on the selected biometric analysis task. The deep neural network may be trained based on a loss function determined from sparse categorical cross entropy statistics. The deep neural network may be trained based on a loss function determined from mean square error statistics.
Drawings
Fig. 1 is a block diagram of an exemplary communication infrastructure system.
Fig. 2 is a diagram of an exemplary biometric image.
Fig. 3 is a diagram of an exemplary biometric system.
Fig. 4 is a diagram of an exemplary multitasking biometric system.
Fig. 5 is a diagram of an exemplary spoofed biometric image.
Fig. 6 is a diagram of an exemplary segmented image.
Fig. 7 is a diagram of an exemplary multi-tasking biometric system including segmented anti-spoofing processing.
Fig. 8 is a diagram of an exemplary multi-tasking biometric system including segmented anti-spoofing processing and memory.
FIG. 9 is a flow chart of an exemplary process for training a deep neural network to perform a biometric analysis task.
FIG. 10 is a flow chart of an exemplary process for training a deep neural network including segmented anti-spoofing processing to perform biometric tasks.
FIG. 11 is a flow chart of an exemplary process for operating a vehicle based on a multi-tasking biometric system.
Detailed Description
Fig. 1 is a diagram of a sensing system 100 that may include a traffic infrastructure node 105 that includes a server computer 120 and a stationary sensor 122. The sensing system 100 includes a vehicle 110 that is operable in an autonomous ("autonomous" itself means "fully autonomous" in the present disclosure) mode, a semi-autonomous mode, and an occupant driving (also referred to as non-autonomous) mode. The computing device 115 of one or more vehicles 110 may receive data regarding the operation of the vehicle 110 from the sensors 116. Computing device 115 may operate vehicle 110 in an autonomous mode, a semi-autonomous mode, or a non-autonomous mode.
The computing device 115 includes a processor and memory such as are known. Further, the memory includes one or more forms of computer-readable media and stores instructions executable by the processor to perform operations including as disclosed herein. For example, the computing device 115 may include one or more of programming to operate vehicle braking, propulsion (i.e., controlling acceleration of the vehicle 110 by controlling one or more of an internal combustion engine, an electric motor, a hybrid engine, etc.), steering, climate control, interior and/or exterior lights, etc., and to determine whether and when the computing device 115 (rather than a human operator) is controlling such operations.
The computing device 115 may include, or be communicatively coupled to, more than one computing device (i.e., a controller included in the vehicle 110 for monitoring and/or controlling various vehicle components, etc. (i.e., the powertrain controller 112, the brake controller 113, the steering controller 114, etc.)), i.e., via a vehicle communication bus as described further below. The computing device 115 is typically arranged for communication over a vehicle communication network (i.e., including a bus in the vehicle 110, such as a Controller Area Network (CAN), etc.); the network of vehicles 110 may additionally or alternatively include, for example, known wired or wireless communication mechanisms, i.e., ethernet or other communication protocols.
The computing device 115 may transmit and/or receive messages to and/or from various devices in the vehicle (i.e., controllers, actuators, sensors (including sensor 116), etc.) via a vehicle network. Alternatively or additionally, where computing device 115 actually includes multiple devices, a vehicle communication network may be used to communicate between devices represented in this disclosure as computing device 115. Further, as mentioned below, various controllers or sensing elements (such as sensors 116) may provide data to the computing device 115 via a vehicle communication network.
In addition, the computing device 115 may be configured to communicate with a remote server computer 120 (i.e., a cloud server) via a network 130 through a vehicle-to-infrastructure (V-to-I) interface 111, which includes hardware, firmware, and software that permit the computing device 115 to communicate with the remote server computer 120 via the network 130, for example the wireless internet or a cellular network, as described below. Thus, the V-to-I interface 111 may include memory, transceivers, and wireless interfaces configured to utilize various wired and/or wireless networking technologies (i.e., cellular, wireless, and wired and/or wireless packet networks). The computing device 115 may be configured to communicate with other vehicles 110 over the V-to-I interface 111 using a vehicle-to-vehicle (V-to-V) network (i.e., according to Dedicated Short Range Communications (DSRC) and/or the like), i.e., formed on a mobile ad hoc network basis between neighboring vehicles 110 or formed over an infrastructure-based network. The computing device 115 also includes non-volatile memory such as is known. The computing device 115 may record data by storing the data in non-volatile memory for later retrieval and transmission via the vehicle communication network and the V-to-I interface 111 to the server computer 120 or the user mobile device 160.
As already mentioned, programming for operating one or more vehicle 110 components (i.e., braking, steering, propulsion, etc.) without human operator intervention is typically included in instructions stored in memory and executable by a processor of computing device 115. Using data received in computing device 115 (i.e., sensor data from sensors 116, data of server computer 120, etc.), computing device 115 may make various determinations and/or control various vehicle 110 components and/or operations to operate vehicle 110 without a driver. For example, the computing device 115 may include programming to adjust vehicle 110 operational behaviors (i.e., physical manifestations of vehicle 110 operation), such as speed, acceleration, deceleration, steering, etc., as well as strategic behaviors (i.e., generally controlling operational behaviors in a manner intended to achieve efficient traversal of a route), such as distance between vehicles and/or amount of time between vehicles, lane changes, minimum clearance between vehicles, left turn cross-path minimum, arrival time to a particular location, and minimum time from arrival to intersection (no signal lights) crossing an intersection.
The term controller as used herein includes computing devices that are typically programmed to monitor and/or control specific vehicle subsystems. Examples include a powertrain controller 112, a brake controller 113, and a steering controller 114. The controller may be, for example, a known Electronic Control Unit (ECU), possibly including additional programming as described herein. The controller is communicatively connected to the computing device 115 and receives instructions from the computing device to actuate the subsystems according to the instructions. For example, brake controller 113 may receive instructions from computing device 115 to operate brakes of vehicle 110.
The one or more controllers 112, 113, 114 for the vehicle 110 may include known Electronic Control Units (ECUs) or the like, including, as non-limiting examples, one or more powertrain controllers 112, one or more brake controllers 113, and one or more steering controllers 114. Each of the controllers 112, 113, 114 may include a respective processor and memory and one or more actuators. The controllers 112, 113, 114 may be programmed and connected to a vehicle 110 communication bus, such as a Controller Area Network (CAN) bus or a Local Interconnect Network (LIN) bus, to receive instructions from the computing device 115 and to control actuators based on the instructions.
The sensors 116 may include a variety of devices known to provide data via a vehicle communication bus. For example, a radar fixed to a front bumper (not shown) of the vehicle 110 may provide a distance from the vehicle 110 to a next vehicle in front of the vehicle 110, or a Global Positioning System (GPS) sensor provided in the vehicle 110 may provide geographic coordinates of the vehicle 110. For example, distances provided by radar and/or other sensors 116 and/or geographic coordinates provided by GPS sensors may be used by computing device 115 to autonomously or semi-autonomously operate vehicle 110.
The vehicle 110 is typically a ground-based vehicle 110 (i.e., passenger vehicle, pickup truck, etc.) capable of autonomous and/or semi-autonomous operation and having three or more wheels. The vehicle 110 includes one or more sensors 116, a V-to-I interface 111, a computing device 115, and one or more controllers 112, 113, 114. The sensors 116 may collect data related to the vehicle 110 and the operating environment of the vehicle 110. By way of example and not limitation, the sensor 116 may include, for example, altimeters, cameras, LIDARs, radars, ultrasonic sensors, infrared sensors, pressure sensors, accelerometers, gyroscopes, temperature sensors, pressure sensors, hall sensors, optical sensors, voltage sensors, current sensors, mechanical sensors (such as switches), and the like. The sensor 116 may be used to sense the operating environment of the vehicle 110, i.e., the sensor 116 may detect phenomena such as weather conditions (rain, outside ambient temperature, etc.), road grade, road location (i.e., using road edges, lane markings, etc.), or the location of a target object (such as a nearby vehicle 110). The sensors 116 may also be used to collect data, including dynamic vehicle 110 data related to the operation of the vehicle 110, such as speed, yaw rate, steering angle, engine speed, brake pressure, oil pressure, power level applied to the controllers 112, 113, 114 in the vehicle 110, connectivity between components, and accurate and timely execution of components of the vehicle 110.
The vehicle may be equipped to operate in both an autonomous mode and an occupant driving mode. Semi-autonomous mode or fully autonomous mode means an operating mode in which the vehicle may be driven partially or fully by a computing device that is part of a system having sensors and a controller. The vehicle may be occupied or unoccupied, but in either case, the vehicle may be driven partially or fully without occupant assistance. For purposes of this disclosure, autonomous mode is defined as a mode in which each of vehicle propulsion (i.e., via a powertrain including an internal combustion engine and/or an electric motor), braking, and steering is controlled by one or more vehicle computers; in semi-autonomous mode, the vehicle computer controls one or more of vehicle propulsion, braking, and steering. In the non-autonomous mode, none of these are controlled by the computer.
The traffic infrastructure node 105 may include a physical structure such as a tower or other support structure (i.e., pole, box mountable to bridge supports, cell phone tower, roadway sign supports, etc.), on which the infrastructure sensor 122 and server computer 120 may be mounted, stored, and/or housed and powered, etc. For ease of illustration, one traffic infrastructure node 105 is shown in fig. 1, but the system 100 may and likely will include tens, hundreds, or thousands of traffic infrastructure nodes 105. The traffic infrastructure nodes 105 are typically stationary, i.e., fixed to a particular geographic location and cannot be moved from that location. Infrastructure sensors 122 may include one or more sensors, such as those described above for vehicle 110 sensors 116, i.e., lidar, radar, cameras, ultrasonic sensors, and the like. Infrastructure sensors 122 are fixed or stationary. That is, each sensor 122 is mounted to an infrastructure node so as to have a field of view that is substantially non-moving and unchanged.
The server computer 120 generally has features in common with the V-to-I interface 111 and the computing device 115 of the vehicle 110, and thus will not be further described to avoid redundancy. Although not shown for ease of illustration, the traffic infrastructure node 105 also includes a power source, such as a battery, solar cell, and/or connection to a power grid. The server computer 120 of the traffic infrastructure node 105 and/or the computing device 115 of the vehicle 110 may receive the data of the sensors 116, 122 to monitor one or more objects. In the context of the present disclosure, an "object" is a physical (i.e., material) structure detected by the vehicle sensors 116 and/or the infrastructure sensors 122. The object may be a biological object, such as a person. The server computer 120 and/or the computing device 115 may perform a biometric analysis on the object data acquired by the sensors 116/122.
Fig. 2 is a diagram of an image acquisition system 200 included in a vehicle 110 or a traffic infrastructure node 105. The image 202 may be acquired by a camera 204, which may be a sensor 116, 122, having a field of view 206. For example, when a user approaches the vehicle 110, an image 202 may be acquired. The computing device 115 or the server computer 120 in the vehicle 110 may perform a biometric analysis task that authenticates the user and grants the user permission to operate the vehicle 110, i.e., unlock the doors to allow the user to enter the vehicle 110. In addition to biometric identification, a spoof-detection biometric analysis task (such as a live detection) may also be used to determine whether the image data presented for user identification is a real image of a real user, i.e., not a photograph of the user or not a mask of the user.
Other biometric analysis tasks based on images of the user as known in the art include drowsiness detection, head pose detection, gaze detection, and emotion detection. Drowsiness detection can generally determine the alertness state of a user by analyzing eyelid position and blink rate. Head pose detection may also determine the alertness and attentiveness of a user to vehicle operation, typically by analyzing the position and orientation of the user's face to detect nodding and head-bowing gestures. Gaze detection may determine the direction in which the user's eyes are looking to determine whether the user is focusing on vehicle operation. Emotion detection may determine an emotional state of the user, i.e., whether the user is agitated, to detect distraction that may affect the operation of the vehicle.
Fig. 3 is a diagram of a biometric analysis task system 300. The biometric analysis task system 300 may be implemented on the server computer 120 or the computing device 115. Biometric analysis tasks including user recognition, fraud detection, drowsiness detection, head pose detection, and gaze detection share a common computing task 304, such as determining the position and orientation of the user's face in the image 302. Other common computing tasks include determining the location and size of facial features such as eyes, mouth, and nose. Fig. 3 shows an image 302 being input to a common computing task 304 to determine common facial feature data (such as, for example, position, orientation, and features). The common facial feature data may be input to the biometric analysis tasks 306, 308, 310 to determine biometric task outputs 312, 314, 316. The biometric task output 312, 314, 316 may be one or more of user identification, fraud detection, drowsiness detection, head pose detection, gaze detection, emotion detection, and the like.
Fig. 4 is a diagram of a biometric analysis task system implemented as a Deep Neural Network (DNN) 400 configured to input an image 402 into a plurality of convolutional layers 404, 406, 408 included in a common feature extraction Convolutional Neural Network (CNN) 410. The biometric analysis task system implemented as DNN 400 may be implemented on the server computer 120 or the computing device 115. The common feature extraction or "backbone" CNN 410 extracts common facial features from the image data and outputs latent variables. The latent variables are the common facial features output by CNN 410 in response to an input image 402 that includes a face. CNN 410 is trained as described herein to process the image 402 to determine the location of a face and to determine facial features indicative of the location, orientation, and size of facial features such as eyes, nose, and mouth. CNN 410 may also determine facial features such as skin tone, texture, and the presence/absence of facial hair, as well as the presence/absence of objects such as eyeglasses and piercings. The facial features are referred to as common facial features because they are output as latent variables common to the multiple biometric analysis task neural networks 412, 414, 416, 418, 420 and the segmentation neural network 422. The biometric analysis task neural networks 412, 414, 416, 418, 420 include a plurality of fully connected layers that can perform tasks such as biometric identification, spoof detection including liveness determination, drowsiness detection, head pose detection, gaze detection, and emotion detection. The segmentation neural network 422 may perform facial feature segmentation by determining regions in an image of a user's face, which may include features such as eyes, teeth, lips, nose, and facial skin, for example.
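The backbone-plus-heads arrangement of DNN 400 can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch layout, not the patent's disclosed architecture: layer sizes, head names, and output counts are assumptions, and the facial feature segmentation head is simplified to per-class scores rather than a per-pixel decoder.

```python
import torch
import torch.nn as nn

class CommonFeatureExtractor(nn.Module):
    """Backbone CNN mapping a face image to shared latent variables (sizes assumed)."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(128, latent_dim)

    def forward(self, image):
        return self.fc(self.conv(image).flatten(1))   # latent variables shared by all heads

class TaskHead(nn.Module):
    """Fully connected biometric analysis task head (identification, drowsiness, ...)."""
    def __init__(self, latent_dim, n_outputs):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                nn.Linear(128, n_outputs))

    def forward(self, latent):
        return self.fc(latent)                         # raw scores; SoftMax applied downstream

class MultiTaskBiometricDNN(nn.Module):
    """Hypothetical multi-task layout: one backbone feeding several task heads."""
    def __init__(self, n_identities=100, n_segment_classes=14):
        super().__init__()
        self.backbone = CommonFeatureExtractor()
        self.heads = nn.ModuleDict({
            "identification": TaskHead(256, n_identities),
            "liveness":       TaskHead(256, 2),
            "drowsiness":     TaskHead(256, 2),
            "head_pose":      TaskHead(256, 3),        # roll, pitch, yaw
            "gaze":           TaskHead(256, 2),
            "segmentation":   TaskHead(256, n_segment_classes),
        })

    def forward(self, image):
        latent = self.backbone(image)
        return {name: head(latent) for name, head in self.heads.items()}
```

Because every head consumes the same latent variables, a single forward pass through the backbone serves all of the biometric analysis tasks.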
Regions in the image determined to be indicative of different facial features are segmented. Segmentation marks regions according to segment categories, where each segment category indicates a different facial feature. Exemplary segment categories include left eye, right eye, teeth, upper lip, lower lip, nose, facial skin, left eyebrow, right eyebrow, and the like. For a typical segmentation neural network 422, the number of face segmentation classes may be equal to 14 or more. Each segment is further specified by its position, shape, and the number of pixels included in the segment. In determining the loss function, segments in the predicted segmented image output by the segmentation neural network 422 may be compared to segments in the ground truth segmented image. Comparing the segments includes determining whether pixels of the predicted segment overlap pixels of the ground truth segment having the same category name. The comparison may be a qualitative comparison based on sparse categorical cross entropy statistics, which requires a percentage (e.g., 50%) of ground truth segment pixels to overlap with predicted segments having the same class name. In other examples, the comparison may be a quantitative comparison based on mean square error statistics, which counts the number of non-overlapping pixels in the ground truth segment and squares the number of non-overlapping pixels.
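The two segment-comparison statistics described above can be contrasted with a small sketch. It assumes per-pixel class logits for 14 hypothetical segment classes; `F.cross_entropy` is used here as the sparse categorical cross entropy, and the mean square error is computed on per-class pixel counts as one plausible reading of the quantitative comparison.

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: per-pixel class logits for 14 facial segment classes and an
# integer ground-truth class index per pixel.
logits = torch.randn(1, 14, 64, 64)          # (batch, classes, height, width)
gt = torch.randint(0, 14, (1, 64, 64))       # (batch, height, width)

# Sparse categorical cross entropy compares per-pixel class probabilities, so its
# magnitude scales with the number of segment classes rather than the pixel count.
ce_loss = F.cross_entropy(logits, gt)

# Mean square error on per-class pixel counts squares the count mismatch, so a poor
# prediction can yield a very large loss value.
pred = logits.argmax(dim=1)
counts_pred = torch.stack([(pred == c).sum() for c in range(14)]).float()
counts_gt = torch.stack([(gt == c).sum() for c in range(14)]).float()
mse_loss = F.mse_loss(counts_pred, counts_gt)

print(ce_loss.item(), mse_loss.item())
```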
The biometric analysis task neural networks 412, 414, 416, 418, 420 and the segmentation neural network 422 are trained to output predictions about the biometric analysis tasks or the segmentation task based on the latent variables output from the common feature extraction CNN 410. The predictions output from the biometric analysis tasks include a biometric identification of the user. The biometric identification prediction is the probability that the input image includes a face that matches a face that DNN 400 was previously trained to recognize. The liveness prediction is the probability that the image 402 includes a live user rather than a photograph or mask of the user. The drowsiness prediction is the probability that the user in the image 402 is experiencing drowsiness. The head pose prediction is an estimate of the roll, pitch, and yaw of the user's head. The emotion prediction is the probability that the user is experiencing a strong emotion, such as anger or fear. The facial feature segmentation prediction is an image with regions indicating the locations of facial features.
DNN 400 may include SoftMax functions 424, 426, 428, 430, 432 on the outputs of the biometric analysis task neural networks 412, 414, 416, 418, 420, respectively. The SoftMax functions 424, 426, 428, 430, 432 convert a vector of K real values into a vector of K real values between 0 and 1 that sum to 1. The outputs from the SoftMax functions 424, 426, 428, 430, 432 may be output as the results of the biometric analysis tasks or used to calculate loss functions. Processing the outputs of the biometric analysis task neural networks 412, 414, 416, 418, 420 with the SoftMax functions 424, 426, 428, 430, 432 allows the outputs to be combined into a joint loss function for training DNN 400. A loss function compares the output from the SoftMax functions 424, 426, 428, 430, 432 connected to the biometric analysis task neural networks 412, 414, 416, 418, 420 to ground truth values to determine how close the biometric analysis task neural network is to the correct result. By limiting the outputs from each of the biometric analysis task neural networks 412, 414, 416, 418, 420 to values between 0 and 1, the SoftMax functions 424, 426, 428, 430, 432 prevent one or more of the outputs from dominating the computation of the joint loss function due to differences in the magnitudes of the outputs. The joint loss function is determined by combining the individual loss functions of each of the biometric analysis task neural networks 412, 414, 416, 418, 420, typically by summing the individual loss functions together.
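A sketch of how the SoftMax outputs can be folded into a joint loss follows; the dictionary layout and the simple unweighted sum are assumptions for illustration, and it applies only to classification-style task outputs.

```python
import torch
import torch.nn.functional as F

def joint_loss(task_logits, ground_truth):
    """Sum per-task losses computed on SoftMax outputs (dict layout is an assumption)."""
    losses = {}
    for name, logits in task_logits.items():
        probs = F.softmax(logits, dim=-1)              # values in (0, 1) summing to 1
        # Negative log-likelihood of the ground-truth class, taken on the SoftMax
        # output; equivalent to cross entropy on the raw logits.
        losses[name] = F.nll_loss(torch.log(probs + 1e-8), ground_truth[name])
    return sum(losses.values()), losses
```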
Training DNN 400 may include determining a training dataset of images 402 and a ground truth for each image 402. Ground truth is data about objects included in image 402 that indicates the results expected when processing the image with DNN 400. For example, if DNN 400 is being trained to determine the identity of a user included in image 402, the ground truth will include the identity of the user. Determining the ground truth may include examining the image 402 in the training dataset and estimating the correct results expected from SoftMax functions 424, 426, 428, 430, 432 connected to the biometric analysis task neural networks 412, 414, 416, 418, 420. During training, the output from the biometric analysis task may be compared to ground truth previously determined for the biometric analysis task to determine a loss function indicative of the accuracy with which DNN 400 processes image 402.
During training, the image 402 may be input to DNN 400 multiple times. Each time the image 402 is input, a loss function is determined based on the ground truth, and the weights or parameters that control the operation of the convolutional layers 404, 406, 408 of CNN 410 and the fully connected layers of the biometric analysis task neural networks 412, 414, 416, 418, 420 may be changed based on the loss function. The loss function may be back-propagated through DNN 400 to select the weights for each layer that result in the lowest joint loss (i.e., the most correct results). Back-propagation is a process for applying the loss function to the weights of the DNN 400 layers, starting at the layer closest to the output and proceeding back to the layer closest to the input. By processing the multiple input images 402 and ground truth included in the training dataset multiple times, a set of weights for the layers of DNN 400 may be selected that converges toward correct results over the entire training dataset. Selecting a set of optimal weights in this way is referred to as training DNN 400.
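A minimal training-loop sketch follows, reusing the `joint_loss` helper from the earlier sketch; the optimizer choice, learning rate, and epoch count are assumptions rather than values disclosed in the patent.

```python
import torch

def train(dnn, loader, epochs=10, lr=1e-4):
    """Repeated forward passes, joint loss, back-propagation, and weight updates."""
    optimizer = torch.optim.Adam(dnn.parameters(), lr=lr)   # optimizer choice is an assumption
    for _ in range(epochs):                 # each training image is seen multiple times
        for images, ground_truth in loader: # ground truth per biometric analysis task
            task_logits = dnn(images)
            loss, _ = joint_loss(task_logits, ground_truth)
            optimizer.zero_grad()
            loss.backward()                 # back-propagate the joint loss through all layers
            optimizer.step()                # adjust weights toward a lower joint loss
    return dnn
```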
Training DNN 400 using the joint loss function may advantageously compensate for differences in the training dataset for each biometric analysis task. The joint loss function is determined by combining the individual loss functions for the biometric analysis tasks, for example by adding the individual loss functions together. The training quality of a biometric analysis task may depend on the number of available images with appropriate ground truth for that task. For example, biometric identification tasks may benefit from large commercial datasets that include ground truth. Other biometric analysis tasks (such as drowsiness determination or gaze determination) may require the user to acquire image data and manually estimate ground truth. Advantageously, training DNN 400 using the joint loss function determined as discussed herein may allow training data to be shared between a biometric analysis task with a large training dataset and a biometric analysis task with a smaller training dataset. For example, the individual loss functions may be weighted when combined to form the joint loss function to give more weight to loss functions for tasks with larger training datasets.
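One way to bias the joint loss toward tasks with larger training datasets is a fixed per-task weight, as in the hypothetical sketch below (the task names and weight values are assumptions).

```python
# Hypothetical per-task weights biased toward tasks with larger training datasets,
# e.g. identification (large commercial dataset) versus drowsiness (small dataset).
TASK_WEIGHTS = {"identification": 1.0, "liveness": 0.5, "drowsiness": 0.2}

def weighted_joint_loss(per_task_losses, weights=TASK_WEIGHTS):
    return sum(weights.get(name, 1.0) * loss for name, loss in per_task_losses.items())
```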
In examples where one or more of the biometric analysis tasks has only a small amount of ground truth data included in the training dataset, training of DNN 400 may be enhanced by branch training isolation. Branch training isolation sets the output of a biometric analysis task neural network to a null value when the training dataset does not include ground truth data for that task for a specific image. Setting the output from the biometric analysis task neural network to a null value also sets the loss function determined for that biometric analysis task neural network to zero. Branch training isolation also freezes the weights included in that biometric analysis task neural network for the image. This allows the biometric analysis task neural network to be available for joint training with the rest of DNN 400 without penalizing biometric analysis tasks that have sparse training datasets. For example, drowsiness detection typically has fewer ground truth images than identification tasks.
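Branch training isolation might be sketched as follows, assuming the `heads` dictionary layout from the earlier multi-task sketch: tasks without ground truth for the current image contribute zero loss and have their head weights frozen for that update.

```python
def isolated_joint_loss(dnn, task_logits, ground_truth, per_task_loss_fn):
    """Branch training isolation sketch: a task with no ground truth for this image
    contributes zero loss and has its head weights frozen (dnn.heads is assumed)."""
    total = 0.0
    for name, logits in task_logits.items():
        head = dnn.heads[name]
        if name not in ground_truth:              # sparse dataset: no label for this task
            for p in head.parameters():
                p.requires_grad_(False)           # freeze this branch for the update
            continue                              # its loss contribution stays zero
        for p in head.parameters():
            p.requires_grad_(True)                # un-freeze branches that do have labels
        total = total + per_task_loss_fn(logits, ground_truth[name])
    return total
```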
Fig. 5 is a diagram of an image acquisition system 500 included in a vehicle 110 or a traffic infrastructure node 105. The spoofed image 502 may be acquired by a camera 504, which may be a sensor 116, 122, having a field of view 508. The image 502 is of a user wearing a mask made of, for example, spandex or the like, acquired in an attempt to appear as an authorized user of the vehicle. The computing device 115 or the server computer 120 in the vehicle 110 may perform a biometric analysis task that authenticates the user and grants the user permission to operate the vehicle 110, i.e., unlocks the doors to allow the user to enter the vehicle 110. In addition to biometric identification, a spoof-detection biometric analysis task (such as liveness detection) may also be used to determine whether the image data presented for user identification is a real image of a real user, i.e., not a photograph of the user or a mask of the user. Without liveness detection, the biometric identification system may be spoofed into granting access to unauthorized users.
Other techniques exist for anti-spoofing detection. For example, a 3D or depth scanner (such as a lidar scanner) may be employed in addition to a camera to acquire data about potential users. The 3D scanner may detect differences between a flat photograph and a face. A thermal or far-infrared camera may detect characteristic thermal emissions from facial skin to distinguish a mask from a face. The techniques discussed herein for merging output from an image segmentation task with a biometric identification task and a skin tone biometric analysis task can enhance the accuracy of both biometric identification and liveness detection while eliminating the need for additional 3D or infrared sensors. By configuring a biometric analysis task neural network for a plurality of biometric tasks as described herein, a training dataset that includes relatively fewer examples for tasks such as liveness determination may be used to produce good training results with fewer resources than would be needed to acquire more training images and ground truth.
Fig. 6 is a diagram of an image 600 of a face 602 processed 604 with an image segmentation task as described herein to produce a segmented image 606. The segmented image 606 includes image segments indicating eyebrows 608, 610, eyes 612, 614, skin tone 616, lips 618, and teeth 620. Light reflection measured in a camera image 600 of a face may be used for anti-spoofing, but only when an area of the image 600 is determined to be human skin. The use of reflectance data to determine the difference between a photograph or mask and a living person is generally accurate only when the area being processed is known to be human skin and the area is not occluded by hair or features such as lips or eyes. Guiding liveness determination using the segmented image 606 may enhance training and inference of the biometric analysis task neural networks.
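As a sketch of how the segmented image can gate a reflectance measurement, the function below averages pixel intensity only over pixels labeled as facial skin; the class index for skin and the use of mean intensity as the reflectance statistic are assumptions.

```python
import torch

def skin_region_reflectance(image, segment_map, skin_class=5):
    """Average pixel intensity over pixels labeled as facial skin (class index assumed)."""
    skin_mask = (segment_map == skin_class)       # boolean (H, W) mask from segmentation
    if skin_mask.sum() == 0:
        return None                               # no visible skin; reflectance test not usable
    gray = image.mean(dim=0)                      # (3, H, W) RGB -> (H, W) intensity
    return gray[skin_mask].mean()                 # statistic passed on to the liveness logic
```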
Fig. 7 is a diagram of a DNN 700 configured to combine the skin mask neural network 724 output, the hair mask neural network 726 output, the eye mask neural network 728 output, and the mouth mask neural network 730 output with the outputs of the biometric identification neural network 706 and the skin tone neural network 708. A segmentation mask or mask image is an image that includes labeled regions indicating the locations of features in the image (in this example, facial features including eyes, nose, mouth, etc.). DNN 700 inputs an image 702 to the common feature extraction CNN 704. The latent variables (i.e., facial features) output from the common feature extraction CNN 704 are output to the biometric identification neural network 706, the skin tone neural network 708, and the segmentation neural network 714. The biometric identification neural network 706, the skin tone neural network 708, and the segmentation neural network 714 are collectively referred to herein as biometric task-specific neural networks. The biometric identification neural network 706 and the skin tone neural network 708 include fully connected layers and produce predictions about the identity and skin tone of the face in the input image 702. The segmentation neural network 714 includes fully connected layers and produces a segmented image 606, as shown in fig. 6.
The facial feature segmentation output from the segmentation neural network 714 is combined with the latent variables from the common feature extraction CNN 704 to generate mask images comprising a skin mask 716, a hair mask 718, an eye mask 720, and a mouth mask 722. The mask images including the skin mask 716, the hair mask 718, the eye mask 720, and the mouth mask 722 are input to the skin mask neural network 724, the hair mask neural network 726, the eye mask neural network 728, and the mouth mask neural network 730, which are collectively referred to as the facial mask neural networks 736. The segmentation mask outputs from the skin mask neural network 724, the hair mask neural network 726, the eye mask neural network 728, and the mouth mask neural network 730 are combined with the prediction outputs from the biometric identification neural network 706 and the skin tone neural network 708 in the expert pooling neural network 732. The expert pooling neural network 732 includes fully connected layers and processes the biometric identification neural network 706 and skin tone neural network 708 predictions to determine a liveness task output, or liveness prediction, based on the skin mask neural network 724 output, the hair mask neural network 726 output, the eye mask neural network 728 output, and the mouth mask neural network 730 output. At inference time, the biometric identification prediction and the liveness prediction are output to the computing device 115 and/or the server computer 120. The biometric identification prediction may be the probability that the face in the input image 702 matches a face used to train DNN 700. The liveness prediction may be the probability that the face in the input image 702 was acquired from a live person. The product of these two probabilities may be, for example, the probability that a live person has been correctly identified, and may be used to determine whether vehicle access should be granted.
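The expert pooling stage and the final access decision can be sketched as below; the feature sizes, hidden width, and decision threshold are assumptions, not values from the patent.

```python
import torch
import torch.nn as nn

class ExpertPooling(nn.Module):
    """Fully connected expert pooling over concatenated mask-branch outputs and the
    identification / skin tone predictions (feature sizes are assumptions)."""
    def __init__(self, in_features=4 * 16 + 2 * 16, hidden=64):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(in_features, hidden), nn.ReLU(),
                                nn.Linear(hidden, 2))        # live vs. spoof scores

    def forward(self, mask_outputs, id_pred, skin_tone_pred):
        x = torch.cat(list(mask_outputs) + [id_pred, skin_tone_pred], dim=-1)
        return self.fc(x)

def grant_access(id_prob, liveness_prob, threshold=0.9):
    """Multiply the identification and liveness probabilities; threshold is an assumption."""
    return (id_prob * liveness_prob) >= threshold
```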
At training time, the biometric identification prediction and the skin tone prediction are output to SoftMax functions 710, 712, respectively, to determine biometric analysis task outputs. Training DNN 700 is typically performed on the server computer 120. The SoftMax functions 710, 712 map the outputs from the biometric identification neural network 706 and the skin tone neural network 708 to the interval between 0 and 1 so that they may be combined with the output of SoftMax function 734, which maps the output from the expert pooling neural network 732 to the interval between 0 and 1. The outputs from the SoftMax functions 710, 712, 734 are combined with ground truth into a joint loss function for back-propagation through DNN 700, as discussed above with respect to training DNN 400 in fig. 4. After training, DNN 700 may be transmitted to the computing device 115 in the vehicle 110 to perform a biometric identification task on image data acquired by the sensors 116 included in the vehicle 110.
Determining the joint loss function may be a function of the loss method, the task complexity, and the fusion method. The loss method is the mathematical technique used to determine the loss. As discussed above with respect to fig. 4, the determination of the loss function for the segmented image may be based on the segment classes or on the differences in segmented pixels. For example, using a mean square error, which computes the difference in pixel counts between the mask segment output and the ground truth, will generate a loss proportional to the square of the number of mismatched pixels in a mask segment class. Because there are potentially a large number of pixels in the segments generated for an input image 702, the loss may be proportional to a large number. For example, the total number of pixels included in all segments may be 1000. If the difference between the ground truth segmented image and the predicted segmented image is large, the difference in pixels may be a large number, e.g., 500. Since the mean square error is based on the square of the difference, the loss can be proportional to 500², which is a very large number. Using sparse categorical cross entropy to determine the loss, which compares the predicted probability of the presence/absence of a mask segment class to the ground truth probability, will produce a loss proportional to the number of facial feature classes. In this example, the facial classes include the left and right eyes, lips, teeth, nose, left and right eyebrows, and the like (e.g., 14 facial classes). Determining the loss function using sparse categorical cross entropy statistics will therefore typically result in a much smaller number than using mean square error statistics. Determining a loss value using a technique that results in a disproportionately large loss value may result in destructive training interference, where one loss value dominates the joint loss function.
The loss values may be normalized to reflect task complexity, i.e., more complex (or more difficult) tasks may result in larger loss function values. In addition, it may be beneficial to bias the loss toward tasks whose outputs are input to other tasks included in DNNs 400, 700. A dynamic loss scheme may be used to address this. In a dynamic loss scheme, each loss may be normalized to a range of 0 to 1 based on the complexity of the loss function and further weighted by training importance. Early in training, the tasks whose outputs feed other tasks (i.e., semantic segmentation and identification) may be prioritized by increasing the weight values applied to their loss functions. Once their validation accuracy improves, the joint loss function may be weighted toward a difficult task (i.e., anti-spoofing) and then, as validation accuracy improves further, toward less complex tasks such as skin tone. Determining the joint loss function according to the loss method, task complexity, and fusion method may enhance training of DNNs 400, 700 by increasing the rate at which training converges on the set of DNN 400, 700 weights that provides the smallest joint loss over the training dataset. Increasing the rate of training convergence reduces the time and computational resources required to train DNNs 400, 700.
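A dynamic loss scheme of this kind might be sketched as a schedule of per-task weights driven by validation accuracy; the thresholds and weight values below are illustrative assumptions.

```python
def dynamic_loss_weights(segmentation_acc, id_acc, spoof_acc):
    """Schedule of per-task loss weights driven by validation accuracy (values assumed)."""
    if segmentation_acc < 0.8 or id_acc < 0.8:
        # Early training: prioritize the tasks whose outputs feed other tasks.
        return {"segmentation": 1.0, "identification": 1.0, "liveness": 0.3, "skin_tone": 0.1}
    if spoof_acc < 0.9:
        # Then shift weight toward the harder anti-spoofing task.
        return {"segmentation": 0.5, "identification": 0.5, "liveness": 1.0, "skin_tone": 0.2}
    # Finally emphasize the less complex tasks such as skin tone.
    return {"segmentation": 0.3, "identification": 0.3, "liveness": 0.7, "skin_tone": 1.0}
```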
In an example of a biometric analysis task determined using DNN 700, DNN 700 may be trained as discussed above. At inference time, portions of DNN 700 including the segmentation neural network 714, the skin mask neural network 724, the hair mask neural network 726, the eye mask neural network 728, the mask neural network 730, and the expert pooling neural network 732 may be removed from DNN 700 without degrading biometric identification accuracy. The biometric identification neural network 706, having been trained together with the segmentation neural network 714, the skin mask neural network 724, the hair mask neural network 726, the eye mask neural network 728, the mask neural network 730, and the expert pooling neural network 732, may determine the biometric identification with the same accuracy as when aided by those networks. Performing biometric identification without the segmentation neural network 714, the skin mask neural network 724, the hair mask neural network 726, the eye mask neural network 728, the mask neural network 730, and the expert pooling neural network 732 may therefore provide a DNN 700 that achieves high accuracy in less time while requiring fewer computing resources.
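The sketch below illustrates the general idea of training with auxiliary heads and then discarding them at inference time, using a toy PyTorch model; the layer sizes, head names, and class counts are assumptions, not the actual architecture of DNN 700.

```python
import torch
import torch.nn as nn

class BiometricDNN(nn.Module):
    """Toy stand-in for DNN 700: a shared backbone plus task heads.
    Layer sizes and head names are assumptions for illustration only."""
    def __init__(self, latent_dim=128, num_identities=100):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                                      nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                      nn.Linear(16, latent_dim))
        self.id_head = nn.Linear(latent_dim, num_identities)
        self.seg_head = nn.Linear(latent_dim, 14)      # auxiliary, training only
        self.pool_head = nn.Linear(latent_dim, 2)      # expert-pooling stand-in

    def forward(self, x):
        z = self.backbone(x)
        return self.id_head(z)

# After training, keep only what inference needs: the backbone and the ID head.
trained = BiometricDNN()
pruned = nn.Sequential(trained.backbone, trained.id_head)
with torch.no_grad():
    logits = pruned(torch.randn(1, 3, 64, 64))   # same ID accuracy, fewer FLOPs
```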
Fig. 8 is a diagram of a DNN 800 for performing biometric analysis. DNN 800 includes the same components as DNN 700 in fig. 7, configured in the same manner, except that memories 802, 804 are included between the outputs from the eye mask 720 and the mask 722 and the inputs to the eye mask neural network 728 and the mask neural network 730. The memories 802, 804 allow data from multiple input data frames to be queued. DNN 800 may receive as input a sequence of frames of video data comprising images 702 of a user's face acquired over a short period of time (e.g., one second or less) and determine a time-segmented mask output. The sequence of video frames may include movements of the user's eyes and/or mouth, such as blinking or lip movement during speech.
Storing temporal image data regarding the movement of the user's eyes and mouth may allow DNN 800 to determine the living prediction with greater accuracy and greater confidence than a determination based on a single static image. For example, DNN 800 may process the temporal data by determining derivatives with respect to the mask images or by calculating optical flow data with respect to the mask images. The eye mask neural network 728 and the mask neural network 730 may be configured to perform 3D convolutions by stacking the temporal image data into a 3D stack and using 3D convolution kernels to determine motion in the temporal image data. The memories 802, 804 may be applied to facial features that can be expected to move during the short time sequence input to DNN 800. Hair characteristics and skin tone, for example, typically do not change over a short period of time and therefore do not generally benefit from temporal processing.
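One way the queued mask frames could be stacked and processed with a 3D convolution is sketched below in PyTorch; the clip length, frame resolution, and channel counts are assumed example values.

```python
import torch
import torch.nn as nn

# Assumed shapes: a one-second clip of 16 single-channel eye-mask frames at 64x64.
frames = [torch.rand(1, 64, 64) for _ in range(16)]     # queued by memory 802

# Stack the queued frames along a depth (time) dimension and add a batch axis.
clip = torch.stack(frames, dim=1).unsqueeze(0)           # shape (1, 1, 16, 64, 64)

# A 3D convolution kernel spans space and time, so blinks and lip motion show up
# as activations along the depth axis.
conv3d = nn.Conv3d(in_channels=1, out_channels=8, kernel_size=(3, 3, 3), padding=1)
motion_features = conv3d(clip)                           # shape (1, 8, 16, 64, 64)
```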
DNN 800, including memories 802, 804, may be trained on video sequences of images 702 that include facial feature motion, together with ground truth data describing the facial feature motion present in the video sequence. For example, video sequences of a user blinking and/or speaking may be included in the training database. At training time, the images 702 of a video sequence are input to DNN 800. The memories 802, 804 store the outputs from the eye mask 720 and the mask 722 until the video clip ends or the limit on the number of frames in memory is reached. While the memories 802, 804 are being filled, the loss functions based on the eye mask neural network 728 and the mask neural network 730 may be set to zero, and the weights of the eye mask neural network 728 and the mask neural network 730 may be frozen. When the video clip ends or the frame storage limit is reached, the temporal data stored in the memories is output to the eye mask neural network 728 and the mask neural network 730 for processing and subsequent output of the predictions used to determine the loss functions. The loss functions may be combined into a joint loss function that may be back-propagated through DNN 800, including the eye mask neural network 728 and the mask neural network 730, to determine weights.
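A simplified sketch of the queue-then-train behavior of the memories 802, 804 is given below; the clip length and the use of a Python list as the queue are assumptions for illustration, and a real implementation would manage batching and optimizer state explicitly.

```python
import torch

def train_step_with_memory(frame, memory, head, max_frames=16):
    """Queue per-frame mask outputs; only run the temporal head on a full clip.

    frame:  mask output for the current video frame (e.g., from eye mask 720).
    memory: list acting as the frame queue (stand-in for memory 802 or 804).
    head:   the temporal mask neural network (e.g., eye mask network 728).
    Returns the head prediction for a completed clip, or None while filling.
    """
    memory.append(frame.detach())
    if len(memory) < max_frames:
        # Queue not yet full: contribute zero loss and keep the head frozen.
        for p in head.parameters():
            p.requires_grad = False
        return None
    for p in head.parameters():
        p.requires_grad = True
    clip = torch.stack(memory, dim=1).unsqueeze(0)   # temporal stack of the clip
    memory.clear()
    return head(clip)                                 # prediction used in the loss
```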
Fig. 9 is a flow chart of a process 900, described with respect to figs. 1-4, for training a DNN 400 including a common feature extraction CNN 410, a plurality of biometric analysis task neural networks 412, 414, 416, 418, 420, and a segmentation neural network 422. The process 900 may be implemented by a processor of the computing device 115 or the server computer 120 executing commands that take image data from the sensors 116, 122 as input and output biometric analysis task predictions. DNN 400 is typically executed on a server computer 120 in the traffic infrastructure node 105 at training time and transmitted to a computing device 115 in a vehicle 110 for operation at inference time. Process 900 includes a plurality of blocks that may be performed in the order shown. Alternatively or additionally, process 900 may include fewer blocks or may include blocks performed in a different order.
Process 900 begins at block 902, where an image 402 is acquired from a training dataset. The image 402 includes ground truth data for one or more biometric analysis tasks, as discussed above with respect to fig. 3. In examples where ground truth data is not available for one or more biometric analysis tasks, the output from the corresponding biometric analysis task neural networks 412, 414, 416, 418, 420 or segmentation neural network 422 may be set to a null value and the weights of those networks may be frozen. Freezing a neural network prevents the weights included in its programming from being updated by back propagation based on the joint loss function determined at block 910.
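The null-output-and-freeze behavior for tasks without ground truth could be expressed as in the following sketch, assuming PyTorch-style task heads; the head and label names are placeholders, not identifiers from this description.

```python
def forward_with_missing_labels(task_heads, latent, ground_truth):
    """Set outputs to null and freeze weights for tasks without ground truth.

    task_heads:   dict task name -> torch.nn.Module head (e.g., drowsiness, gaze).
    latent:       latent-variable tensor from the common feature extraction CNN.
    ground_truth: dict task name -> label tensor, or None when unavailable.
    """
    predictions = {}
    for name, head in task_heads.items():
        has_label = ground_truth.get(name) is not None
        for p in head.parameters():
            # Frozen heads are excluded from the back-propagation update.
            p.requires_grad = has_label
        # A null output contributes nothing to the joint loss for this image.
        predictions[name] = head(latent) if has_label else None
    return predictions
```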
At block 904, the image 402 is input to the common feature extraction CNN 410 to determine facial features to be output as latent variables, as discussed above with respect to fig. 5.
At block 906, the latent variables are input to the plurality of biometric analysis task neural networks 412, 414, 416, 418, 420 and the segmentation neural network 422, which process the latent variables to determine predictions about the input image 402, as discussed above with respect to fig. 4. At inference time, the predictions may be output to the computing device 115 for operating the vehicle 110.
At block 908, the predictions output from the biometric analysis task neural networks 412, 414, 416, 418, 420 are input to the SoftMax functions 424, 426, 428, 430, 432 at training time to scale the output predictions to between 0 and 1. Scaling the output predictions allows them to be combined into a joint loss function at block 910 without one or more output predictions numerically dominating the calculation.
At block 910, the outputs from the SoftMax functions 424, 426, 428, 430, 432 and the segmentation neural network 422 are combined with ground truth to determine a joint loss function for DNN 400 in response to the input image 402.
At block 912, the joint loss function may be back-propagated through the layers of DNN 400 to determine the optimal weights for the layers of DNN 400. The optimal weights are the weights that result in outputs that best match the ground truth included in the training dataset. Training DNN 400 includes inputting an input image 402 multiple times while varying the weights that program the layers of DNN 400, as discussed above with respect to fig. 5. Training DNN 400 includes selecting the weights for the layers of DNN 400 that provide the lowest joint loss function over the largest number of input images 402 in the training dataset. After block 912, the process 900 ends.
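Blocks 904-912 can be summarized as the training-loop sketch below, written with PyTorch; the head and loss bookkeeping, the per-batch weight snapshot, and the data-loader format are simplifying assumptions rather than the actual training procedure.

```python
import torch
import torch.nn.functional as F

def train_epoch(dnn, heads, loader, optimizer):
    """One pass over the training dataset for a multi-head network.

    dnn:    common feature extraction CNN (blocks 902-904).
    heads:  dict task name -> (task head module, loss function) (block 906).
    loader: yields (image, {task name: ground truth label}).
    """
    best_loss, best_state = float("inf"), None
    for image, labels in loader:
        latent = dnn(image)
        joint_loss = 0.0
        for name, (head, loss_fn) in heads.items():
            logits = head(latent)
            probs = F.softmax(logits, dim=-1)         # block 908: scale to [0, 1]
            joint_loss = joint_loss + loss_fn(probs, labels[name])  # block 910
        optimizer.zero_grad()
        joint_loss.backward()                          # block 912: back-propagation
        optimizer.step()
        if joint_loss.item() < best_loss:              # keep lowest-joint-loss weights
            # A real run would also snapshot the heads and score a validation set.
            best_loss, best_state = joint_loss.item(), dnn.state_dict()
    return best_state
```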
Fig. 10 is a flow chart of a process 1000, described with respect to figs. 1-9, for training a DNN 700 including a common feature extraction CNN 704, a biometric identification neural network 706, a skin tone neural network 708, a segmentation neural network 714, and a feature mask neural network 736. Process 1000 may be implemented by a processor of the computing device 115 or the server computer 120 executing commands that take image data from the sensors 116, 122 as input and output biometric analysis task predictions. DNN 700 is typically executed on a server computer 120 in the traffic infrastructure node 105 at training time and transmitted to a computing device 115 in a vehicle 110 for operation at inference time. Process 1000 may alternatively or additionally include fewer blocks or may include blocks performed in a different order.
Process 1000 begins at block 1002, where an image 702 is acquired from a training dataset. Image 702 includes ground truth for one or more biometric analysis tasks as described above with respect to fig. 4, 7, and 8. In examples where the ground truth is not available for one or more biometric analysis tasks, the output from the one or more biometric analysis tasks may be set to a null value and the weights of the one or more biometric analysis tasks may be frozen to prevent them from being updated based on a joint loss function determined from the output from the biometric analysis tasks.
At block 1004, the image 702 is input to the common feature extraction CNN 704 to determine facial features to be output as latent variables, as discussed above with respect to figs. 4 and 5.
At block 1006, the latent variables are input to the biometric identification neural network 706, the skin tone neural network 708, and the segmentation neural network 714, which process the latent variables to determine first predictions for the biometric analysis tasks and the facial feature segmentation task.
At block 1008, the output from the segmentation neural network 714 is combined with the latent variables from the common feature extraction CNN 704 at the skin mask 716, hair mask 718, eye mask 720, and mask 722 to form the feature masks.
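The masking operation at block 1008 amounts to gating the shared latent features with one segmentation channel per facial feature, roughly as sketched below; the tensor shapes and the nearest-neighbor resizing are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

# Assumed shapes: latent feature maps (1, 64, 32, 32) from the common extraction
# CNN and a per-class segmentation output (1, 14, 128, 128) from network 714.
latent = torch.rand(1, 64, 32, 32)
segmentation = torch.rand(1, 14, 128, 128)

def feature_mask(latent, segmentation, class_index):
    """Gate the shared latent features with one facial-feature segment
    (skin, hair, eyes, ...), forming the input to that feature's mask network."""
    mask = segmentation[:, class_index:class_index + 1]          # (1, 1, 128, 128)
    mask = F.interpolate(mask, size=latent.shape[-2:], mode="nearest")
    return latent * mask                                          # broadcast over channels

skin_masked = feature_mask(latent, segmentation, class_index=0)   # e.g., skin mask 716
```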
At block 1010, the feature masks are output to the feature mask neural network 736 to determine feature outputs in DNN 700. In DNN 800, the outputs from the eye mask 720 and the mask 722 are output to the memories 802, 804, respectively, where a plurality of eye mask 720 outputs and mask 722 outputs are stored. The temporal data from the memories 802, 804 is output to the eye mask neural network 728 and the mask neural network 730 at the end of the video sequence of images 702 input to DNN 800.
At block 1012, the outputs from the biometric identification neural network 706 and the skin tone neural network 708 are combined with the output from the feature mask neural network 736 in the expert pooling neural network 732. The living prediction output from the expert pooling neural network 732 and the identification prediction and skin tone prediction output from the biometric identification neural network 706 and the skin tone neural network 708 may be output to the computing device 115 at inference time to operate the vehicle. For example, the output predictions may be used to identify a user and grant access to a device. The output predictions may also be used, for example, to determine the stress the user is experiencing or to determine the alertness of the user.
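The expert pooling step can be pictured as concatenating the per-expert feature vectors and passing them through fully connected layers to produce the living prediction, as in the following sketch; the expert feature sizes and layer widths are assumptions, not the dimensions used in DNN 700.

```python
import torch
import torch.nn as nn

class ExpertPooling(nn.Module):
    """Toy stand-in for expert pooling network 732: fuse the per-expert outputs
    (ID features, skin tone features, per-mask features) into a liveness logit.
    The feature sizes are assumptions for illustration."""
    def __init__(self, expert_dims=(128, 16, 32, 32, 32, 32)):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(sum(expert_dims), 64), nn.ReLU(),
                                nn.Linear(64, 2))   # living vs. spoof

    def forward(self, expert_outputs):
        return self.fc(torch.cat(expert_outputs, dim=-1))

pool = ExpertPooling()
experts = [torch.rand(1, d) for d in (128, 16, 32, 32, 32, 32)]
liveness_logits = pool(experts)                      # shape (1, 2)
```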
At block 1014, at training time, the predictions output from the expert pooling neural network 732, the biometric identification neural network 706, and the skin tone neural network 708 are input to the SoftMax functions 734, 710, 712, respectively. The SoftMax functions scale the output predictions to between 0 and 1.
At block 1016, the outputs from the SoftMax functions 734, 710, 712 are input to loss functions, which are combined to produce a joint loss function. The individual loss functions may be added to form the joint loss function.
At block 1018, the joint loss function may be back-propagated through the layers of DNN 700 to determine the optimal weights for the layers of DNN 700. The optimal weights are the weights that result in outputs that best match the ground truth included in the training dataset. After block 1018, process 1000 ends.
Fig. 11 is a flow chart of a process 1100, described with respect to figs. 1-7, for operating a vehicle based at least in part on a biometric analysis task. Process 1100 performs the biometric identification and in-vivo biometric analysis tasks and image segmentation with DNN 700, which includes the common feature extraction CNN 704, the biometric identification neural network 706, the skin tone neural network 708, and the segmentation neural network 714, to determine biometric identification predictions and living predictions. The process 1100 may be implemented by a processor of the computing device 115 or the server computer 120 executing commands that take image data from the sensors 116, 122 as input and output biometric analysis task predictions. Alternatively or additionally, process 1100 may include fewer blocks or may include blocks performed in a different order.
After training DNN 700 as discussed above with respect to fig. 7, trained DNN 700 may be transmitted to computing device 115, e.g., in a device such as vehicle 110, for inference. As discussed above with respect to fig. 7, DNN 700 may be trained with biometric identification neural network 706 and skin tone neural network 708, as well as segmentation neural network 714. At inference time, DNN 700 may be reduced to include only a biometric analysis task neural network that is relevant to one or more specific tasks. In this example, DNN 700 will be used for biometric identification and will include a biometric identification neural network 706 and a segmentation neural network 714. Reconfiguring the DNN 700 in this manner allows the DNN 700 to be trained with a wide variety of training data sets including training images and ground truth for multiple biometric analysis tasks, while providing a lightweight DNN 700 at inference time that saves computational resources including memory space and execution time.
Process 1100 begins at block 1102, where vehicle 110 acquires image 702 using sensors 116 included in vehicle 110. Process 1100 may also be implemented in a security system, a robotic guidance system, or a handheld device such as a cellular telephone that seeks to determine the identity of a potential user prior to granting access to the device.
At block 1104, the computing device 115 inputs the image 702 to the common feature extraction CNN 704 to determine facial features to be output as latent variables.
At block 1106, latent variables are input to the biometric neural network 706, the skin tone neural network 708, and the segmentation neural network 714 to determine a biometric prediction and a living prediction.
At block 1108, the biometric prediction and the living prediction are output from DNN 700 to computing device 115 in vehicle 110.
At block 1110, the computing device 115 may determine an authentication score by multiplying the biometric prediction with the living prediction.
At block 1112, the computing device 115 tests the authentication score from block 1110. If the authentication score is greater than a user-specified threshold, the user is authenticated and process 1100 moves to block 1114. The threshold may be determined by processing a plurality of spoofed and authentic images 702 to determine the distributions of authentication scores for spoofed and authentic input images 702. A threshold value separating the distributions of the spoofed and authentic input images 702 may then be determined. If the authentication score is below the user-specified threshold, the user fails authentication and process 1100 ends.
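Blocks 1110-1112 reduce to multiplying the two predictions and comparing the result against a threshold chosen from the score distributions, roughly as sketched below; the example scores and the midpoint heuristic for choosing the threshold are assumptions for illustration.

```python
import numpy as np

def authentication_score(biometric_prob, liveness_prob):
    """Block 1110: combine the identification and living predictions."""
    return biometric_prob * liveness_prob

def choose_threshold(genuine_scores, spoof_scores):
    """Pick a threshold separating genuine and spoofed score distributions.
    The midpoint between the distribution means is an assumed heuristic; a
    deployment might instead sweep thresholds for a target false-accept rate."""
    return 0.5 * (np.mean(genuine_scores) + np.mean(spoof_scores))

genuine = np.array([0.91, 0.88, 0.95, 0.90])   # assumed example scores
spoof = np.array([0.12, 0.30, 0.22, 0.18])
threshold = choose_threshold(genuine, spoof)
print(authentication_score(0.93, 0.97) > threshold)   # True -> user authenticated
```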
At block 1114, the user has been authenticated and the user is granted permission to operate the vehicle 110. This may include unlocking the vehicle door to allow access to the vehicle 110 and granting the user permission to operate various vehicle components, such as climate control, infotainment, etc., to name a few. After block 1114, the process 1100 ends.
Computing devices such as those discussed herein typically each include commands that are executable by one or more computing devices such as those identified above and for implementing blocks or steps of the processes described above. For example, the process blocks discussed above may be embodied as computer-executable commands.
Computer-executable commands may be compiled or interpreted from computer programs created using a variety of programming languages and/or technologies, including, without limitation and either alone or in combination, Java™, C, C++, Python, Julia, Scala, Visual Basic, JavaScript, Perl, HTML, etc. In general, a processor (e.g., a microprocessor) receives commands, e.g., from a memory, a computer-readable medium, etc., and executes these commands, thereby performing one or more processes, including one or more of the processes described herein. Such commands and other data may be stored in files and transmitted using a variety of computer-readable media. A file in a computing device is generally a collection of data stored on a computer-readable medium, such as a storage medium, a random access memory, etc.
Computer-readable media (also referred to as processor-readable media) include any non-transitory (e.g., tangible) medium that participates in providing data (e.g., instructions) that may be read by a computer (e.g., by a processor of a computer). Such a medium may take many forms, including, but not limited to, non-volatile media and volatile media. Instructions may be transmitted by one or more transmission media, including fiber optics, wires, and wireless communication, including the internal components that make up a system bus coupled to a processor of a computer. Common forms of computer-readable media include, for example, RAM, a PROM, an EPROM, a FLASH-EEPROM, any other memory chip or cartridge, or any other medium from which a computer can read.
Unless explicitly indicated to the contrary herein, all terms used in the claims are intended to be given their ordinary and customary meaning as understood by those skilled in the art. In particular, the use of singular articles such as "a," "an," "the," and the like are to be construed to recite one or more of the indicated elements unless a claim recites an explicit limitation to the contrary.
The term "exemplary" is used herein in the sense of indicating examples, i.e., references to "exemplary widgets" should be read as referring to only examples of widgets.
The adverb "about" modifying a value or result means that a shape, structure, measurement, value, determination, calculation, etc. may deviate from an exactly described geometry, distance, measurement, value, determination, calculation, etc. because of imperfections in materials, machining, manufacturing, sensor measurements, computations, processing time, communication time, etc.
In the drawings, like reference numerals refer to like elements. Furthermore, some or all of these elements may be changed. With respect to the media, processes, systems, methods, etc., described herein, it should be understood that while the steps or blocks of such processes, etc., have been described as occurring according to a particular ordered sequence, such processes may be practiced by performing the described steps in an order other than that described herein. It should also be understood that certain steps may be performed concurrently, other steps may be added, or certain steps described herein may be omitted. In other words, the description of the processes herein is provided for the purpose of illustrating certain embodiments and should in no way be construed as limiting the claimed invention.
According to the present invention, there is provided a system having: a computer comprising a processor and a memory, the memory comprising instructions executable by the processor to: provide an output from a selected biometric analysis task that is one of a plurality of biometric analysis tasks based on an image provided from an image sensor; wherein the selected biometric analysis tasks are performed in a deep neural network comprising a common feature extraction neural network, a plurality of biometric task specific neural networks, a plurality of segmented mask neural networks, and an expert pooling neural network that performs the plurality of biometric analysis tasks by: inputting the image to the common feature extraction neural network to determine latent variables; inputting the latent variables to the plurality of biometric task-specific neural networks to determine a plurality of biometric analysis task outputs; inputting the latent variables to a segmentation neural network to determine facial feature segmentation outputs; inputting the facial feature segmentation output to a plurality of feature mask neural networks to determine a plurality of segmentation mask outputs; inputting the plurality of biometric analysis task outputs and the plurality of segmented mask outputs to the expert pooling neural network to determine a living task output; and outputting the plurality of biometric analysis task outputs and the living task output.
According to one embodiment, the invention also features a device wherein the instructions include instructions for operating the device based on output from the deep neural network according to the selected biometric analysis task.
According to one embodiment, a plurality of segmented mask outputs from one or more segmented mask neural networks are stored in one or more memories to determine a time-segmented mask output based on a sequence of frames of video data.
According to one embodiment, the plurality of biometric analysis tasks includes biometric identification, living body determination, drowsiness determination, gaze determination, pose determination, and facial feature segmentation.
According to one embodiment, the common feature extraction neural network comprises a plurality of convolutional layers.
According to one embodiment, the plurality of biometric task-specific neural networks includes a plurality of fully connected layers.
According to one embodiment, the plurality of segmented mask neural networks comprises a plurality of fully connected layers.
According to one embodiment, the expert pooled neural network comprises a plurality of fully connected layers.
According to one embodiment, the instructions comprise further instructions for training the deep neural network by: determining a plurality of loss functions for the plurality of biometric analysis task outputs and the living task output based on ground truth; combining the plurality of loss functions to determine a joint loss function; and back-propagating the loss function and the joint loss function through the deep neural network to output a set of weights.
According to one embodiment, the plurality of biometric task outputs and the living task output are input to a plurality of SoftMax functions before being input to the plurality of loss functions.
According to one embodiment, during training, one or more outputs from the plurality of biometric task-specific neural networks and the in-vivo task output are set to zero.
According to the invention, a method comprises: providing an output from a selected biometric analysis task that is one of a plurality of biometric analysis tasks based on an image provided from an image sensor; wherein the selected biometric analysis tasks are performed in a deep neural network comprising a common feature extraction neural network, a plurality of biometric task specific neural networks, a plurality of segmented mask neural networks, and an expert pooling neural network that performs the plurality of biometric analysis tasks by: inputting the image to the common feature extraction neural network to determine latent variables; inputting the latent variables to the plurality of biometric task-specific neural networks to determine a plurality of biometric analysis task outputs; inputting the latent variables to a segmentation neural network to determine facial feature segmentation outputs; inputting the facial feature segmentation output to a plurality of feature mask neural networks to determine a plurality of segmentation mask outputs; inputting the plurality of biometric analysis task outputs and the plurality of segmented mask outputs to the expert pooling neural network to determine a living task output; and outputting the plurality of biometric analysis task outputs and the living task output.
In one aspect of the invention, the method includes a device, wherein the instructions include instructions for operating the device based on output from the deep neural network according to the selected biometric analysis task.
In one aspect of the invention, a plurality of segmented mask outputs from one or more segmented mask neural networks are stored in one or more memories to determine a time-segmented mask output based on a sequence of frames of video data.
In one aspect of the invention, the plurality of biometric analysis tasks includes biometric identification, living body determination, drowsiness determination, gaze determination, pose determination, and facial feature segmentation.
In one aspect of the invention, the common feature extraction neural network includes a plurality of convolutional layers.
In one aspect of the invention, the plurality of biometric task-specific neural networks includes a plurality of fully connected layers.
In one aspect of the invention, the plurality of segmented mask neural networks includes a plurality of fully connected layers.
In one aspect of the invention, the expert pooled neural network includes a plurality of fully connected layers.
In one aspect of the invention, the method comprises: training the deep neural network by: determining a plurality of loss functions for the plurality of biometric analysis task outputs and the living task output based on ground truth; combining the plurality of loss functions to determine a joint loss function; and back-propagating the loss function and the joint loss function through the deep neural network to output a set of weights.
Claims (15)
1. A method, comprising:
providing an output from a selected biometric analysis task that is one of a plurality of biometric analysis tasks based on an image provided from an image sensor;
wherein the selected biometric analysis tasks are performed in a deep neural network comprising a common feature extraction neural network, a plurality of biometric task specific neural networks, a plurality of segmented mask neural networks, and an expert pooling neural network that performs the plurality of biometric analysis tasks by:
inputting the image to the common feature extraction neural network to determine latent variables;
inputting the latent variables to the plurality of biometric task-specific neural networks to determine a plurality of biometric analysis task outputs;
inputting the latent variables to a segmentation neural network to determine facial feature segmentation outputs;
inputting the facial feature segmentation output to a plurality of feature mask neural networks to determine a plurality of segmentation mask outputs;
inputting the plurality of biometric analysis task outputs and the plurality of segmented mask outputs to the expert pooling neural network to determine a living task output; and
outputting the plurality of biometric analysis task outputs and the living task output.
2. The method of claim 1, further comprising a device, wherein the device is operated based on output from the deep neural network according to the selected biometric analysis task.
3. The method of claim 1, wherein a plurality of segmented mask outputs from one or more segmented mask neural networks are stored in one or more memories to determine a time-segmented mask output based on a sequence of frames of video data.
4. The method of claim 1, wherein the plurality of biometric analysis tasks includes biometric identification, living body determination, drowsiness determination, gaze determination, pose determination, and facial feature segmentation.
5. The method of claim 1, wherein the common feature extraction neural network comprises a plurality of convolutional layers.
6. The method of claim 1, wherein the plurality of biometric task-specific neural networks comprises a plurality of fully connected layers.
7. The method of claim 1, wherein the plurality of segmented mask neural networks comprise a plurality of fully connected layers.
8. The method of claim 1, wherein the expert pooled neural network comprises a plurality of fully connected layers.
9. The method of claim 1, further comprising training the deep neural network by:
determining a plurality of loss functions for the plurality of biometric analysis task outputs and the living task output based on ground truth;
combining the plurality of loss functions to determine a joint loss function; and
the loss function and the joint loss function are back propagated through the deep neural network to output a set of weights.
10. The method of claim 9, wherein the plurality of biometric analysis task outputs and the living task output are input to a plurality of SoftMax functions before being input to the plurality of loss functions.
11. The method of claim 9, wherein during training, one or more outputs from the plurality of biometric task-specific neural networks and the in-vivo task output are set to zero.
12. The method of claim 1, wherein the deep neural network is configured to include the common feature extraction network and a subset of the biometric task-specific neural networks during inference based on a selected biometric analysis task.
13. The method of claim 1, wherein the deep neural network is trained based on a loss function determined from sparse classification cross entropy statistics.
14. The method of claim 1, wherein the deep neural network is trained based on a loss function determined from mean square error statistics.
15. A system comprising a computer programmed to perform the method of any one of claims 1 to 14.
Applications Claiming Priority (3)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US63/310,401 | 2022-02-15 | | |
| US17/730,301 (US20230260328A1) | 2022-02-15 | 2022-04-27 | Biometric task network |
| US17/730,301 | 2022-04-27 | | |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN116612509A (en) | 2023-08-18 |
Family

ID=87675250

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202310104367.7A (CN116612509A, pending) | Biological feature task network | | 2023-02-13 |

Country Status (1)

| Country | Link |
|---|---|
| CN (1) | CN116612509A (en) |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |