CN116863521A - Face living body detection method, model training method, device, equipment and medium - Google Patents
Face living body detection method, model training method, device, equipment and medium
- Publication number
- CN116863521A (application number CN202310779602.0A)
- Authority
- CN
- China
- Prior art keywords
- training
- face
- living body detection
- neural network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06V40/161—Human faces: Detection; Localisation; Normalisation
- G06N3/0464—Neural network architectures: Convolutional networks [CNN, ConvNet]
- G06V10/82—Image or video recognition using pattern recognition or machine learning, using neural networks
- G06V40/168—Human faces: Feature extraction; Face representation
- G06V40/172—Human faces: Classification, e.g. identification
- G06V40/45—Spoof detection, e.g. liveness detection: Detection of the body part being alive
Abstract
The disclosure provides a face living body detection method, a model training method, a device, equipment and a medium, relating to the technical field of artificial intelligence, in particular to computer vision, image processing, deep learning and the like, and applicable to scenarios such as living body detection. The specific implementation scheme is as follows: acquiring a face image to be recognized; inputting the face image to be recognized into a pre-trained face living body detection model to obtain a living body detection result; the face living body detection model comprises a plurality of convolutional neural network modules and a plurality of visual self-attention modules, wherein the convolutional neural network modules and the visual self-attention modules are alternately connected.
Description
Technical Field
The disclosure relates to the technical field of artificial intelligence, in particular to the technical fields of computer vision, image processing, deep learning and the like, and can be applied to scenarios such as living body detection. In particular, the disclosure relates to a face living body detection method, a model training method, a device, equipment and a medium.
Background
Face living body detection is used to distinguish whether an image was captured from a real person. It is a basic building block of a face recognition system and can be applied in the face recognition field to applications such as attendance, access control, security, and financial payment.
Ensuring the accuracy of face living body detection helps ensure the security of the face recognition system.
Disclosure of Invention
The disclosure provides a face living body detection method, a model training method, a device, equipment and a medium.
According to a first aspect of the present disclosure, there is provided a face living body detection method, the method comprising:
acquiring a face image to be identified;
inputting the face image to be recognized into a pre-trained face living body detection model to obtain a living body detection result;
the face living body detection model comprises a plurality of convolutional neural network modules and a plurality of visual self-attention modules, wherein the convolutional neural network modules and the visual self-attention modules are alternately connected.
According to a second aspect of the present disclosure, there is provided a training method of a face living body detection model, the method comprising:
acquiring a plurality of training face images and living body detection labels corresponding to the training face images;
training the human face living body detection model according to the training human face image and the living body detection label corresponding to the training human face image;
the face living body detection model comprises a plurality of convolutional neural network modules and a plurality of visual self-attention modules, wherein the convolutional neural network modules and the visual self-attention modules are alternately connected.
According to a third aspect of the present disclosure, there is provided a face living body detection apparatus, the apparatus comprising:
the test data module is used for acquiring a face image to be identified;
the reasoning module is used for inputting the face image to be recognized into a pre-trained face living body detection model to obtain a living body detection result;
the face living body detection model comprises a plurality of convolutional neural network modules and a plurality of visual self-attention modules, wherein the convolutional neural network modules and the visual self-attention modules are alternately connected.
According to a fourth aspect of the present disclosure, there is provided a training apparatus of a face living body detection model, the apparatus comprising:
the training data module is used for acquiring a plurality of training face images and living body detection labels corresponding to the training face images;
the training module is used for training the human face living body detection model according to the training human face image and the living body detection label corresponding to the training human face image;
the face living body detection model comprises a plurality of convolutional neural network modules and a plurality of visual self-attention modules, wherein the convolutional neural network modules and the visual self-attention modules are alternately connected.
According to a fifth aspect of the present disclosure, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the face living body detection method and/or the training method of the face living body detection model.
According to a sixth aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the above-described face living body detection method and/or training method of the face living body detection model.
According to a seventh aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the above-described face living body detection method and/or training method of the face living body detection model.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a schematic flowchart of a face living body detection method according to an embodiment of the disclosure;
fig. 2 is a schematic structural diagram of one implementation of the alternating connection of convolutional neural network modules and visual self-attention modules of a face living body detection model provided by embodiments of the present disclosure;
fig. 3 is a flowchart illustrating partial steps of another face living body detection method according to an embodiment of the present disclosure;
fig. 4 is a network training process diagram of one implementation of model training by freezing module parameters in another face living body detection method according to an embodiment of the present disclosure;
fig. 5 is a flowchart of a training method of a face living body detection model according to an embodiment of the present disclosure;
fig. 6 is a flowchart illustrating partial steps of another training method of a face living body detection model according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of a face living body detection apparatus according to an embodiment of the present disclosure;
fig. 8 is a schematic structural diagram of a training apparatus of a face living body detection model according to an embodiment of the present disclosure;
fig. 9 is a block diagram of an electronic device used to implement the face living body detection method and the training method of the face living body detection model of embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In some related technologies, face living body detection is implemented with traditional hand-crafted face features and classification: face features are extracted with a hand-crafted feature extractor, and the features are then classified with a traditional classifier such as a support vector machine (SVM) to obtain the face living body detection result.
However, this traditional hand-crafted approach has poor robustness when the face pose in a real scene is extreme or the illumination varies widely, so the recognition effect is not ideal.
In other related technologies, deep learning models, such as convolutional neural network models and LSTM (long short-term memory) networks, are used for face feature extraction and classification to obtain the face living body detection result.
However, deep learning models that take a single image as input are highly sensitive to lighting and generalize poorly both to planar attacks, such as photos and videos, and to 3D stereoscopic attacks, such as masks (head covers) and head models, which hurts practical performance.
The embodiment of the disclosure provides a face living body detection method, a training method of a face living body detection model, a face living body detection device, a training device of a face living body detection model, electronic equipment and a computer readable storage medium, and aims to solve at least one of the technical problems in the prior art.
The face living body detection method and the training method of the face living body detection model provided by embodiments of the present disclosure may be executed by an electronic device such as a terminal device or a server. The terminal device may be a vehicle-mounted device, user equipment (UE), a mobile device, a user terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, a wearable device, or the like, and the method may be implemented by a processor invoking computer-readable program instructions stored in a memory. Alternatively, the method may be performed by a server.
Fig. 1 shows a flowchart of a face living body detection method provided by an embodiment of the present disclosure. As shown in fig. 1, the face living body detection method according to the embodiment of the present disclosure may include step S110 and step S120.
In step S110, a face image to be recognized is acquired;
in step S120, inputting a face image to be recognized into a pre-trained face living body detection model to obtain a living body detection result;
the face living body detection model comprises a plurality of convolutional neural network modules and a plurality of visual self-attention modules, wherein the convolutional neural network modules and the visual self-attention modules are alternately connected.
For example, in step S110, the face image to be recognized may be an image containing only a face.
In some possible implementations, the face image to be identified may be an image obtained directly by photographing.
In some possible implementations, the face image to be identified may be an image obtained by preprocessing a person image.
In some possible implementations, preprocessing the person image may include face detection, key point detection, face alignment, affine transformation and normalization. Preprocessing can reduce the influence of lighting and the like on the face image to be identified, improving the accuracy of the living body detection result.
In some possible implementations, the face image to be identified may be an RGB (red green blue color system) image; in some possible implementations, the face image to be identified may be a depth image; in some possible implementations, the face image to be identified may be an infrared image.
In some possible implementations, in step S120, the living body detection result is whether the face image to be identified is a face living body image.
In some possible implementations, the pre-trained face living body detection model may be obtained using a plurality of training face images and the living body detection labels corresponding to the training face images.
In some possible implementations, the training face images may include real face images, planar attack images (e.g., screen replays and printed paper) and stereoscopic attack images (e.g., masks/head covers and head models). The living body detection label of a real face image identifies the image as a live face; the labels of planar and stereoscopic attack images identify the image as not a live face.
In some possible implementations, the training face images may be both RGB images, or depth images, or infrared images.
Under the condition that the training face image is an RGB image, the accuracy of a living body detection result obtained by the face living body detection method provided by the embodiment of the disclosure when the face image to be identified is the RGB image is higher.
Under the condition that the training face image is a depth image, the accuracy of a living body detection result obtained by the face living body detection method provided by the embodiment of the disclosure when the face image to be identified is the depth image is higher.
Under the condition that the training face image is an infrared image, the accuracy of a living body detection result obtained by the face living body detection method provided by the embodiment of the disclosure when the face image to be identified is the infrared image is higher.
In some possible implementations, the face living body detection model includes a plurality of convolutional neural network modules and a plurality of ViT (Vision Transformer) modules that are alternately connected.
The convolutional neural network module consists of one or more convolutional layers and can be used to extract image features: shallow layers extract low-level and local features, and as the number of convolutional layers increases, high-level semantic features and global features can be extracted. The input of the convolutional neural network module can be an image or a feature map, and the output is an image feature map.
The ViT module includes a plurality of connected ViT structures, each of which includes a Transformer Encoder (self-attention encoder). The input of the ViT module can be an image or patches obtained by dividing an image feature map, and the output can likewise be patches of a feature map.
In some possible implementations, the ViT module can also include a Linear Projection of Flattened Patches layer.
The Linear Projection of Flattened Patches layer is an image feature layer mainly used for image patching and token (vector) sequence generation; it feeds the generated sequence into the Transformer Encoder, and it patches the feature map output by a convolutional neural network module to generate the token sequence.
For example, a 224x224 image may be divided into 16x16 patches, yielding 196 patches; each patch is flattened into a vector of length 16x16x3 = 768, producing a 196x768 two-dimensional matrix, i.e., the sequence format required by the Transformer Encoder.
In addition, there is a learnable classification vector, the Class Token, of dimension 1x768; concatenating the Class Token with the matrix generated above yields a matrix of dimension 197x768.
There is also a learnable position embedding of dimension 197x768; an element-wise sum of the position embedding and the 197x768 matrix generated above yields the final input matrix of the Transformer Encoder, also of dimension 197x768.
The Transformer Encoder includes a plurality of Encoder Blocks, each composed of a Multi-Head Attention layer and Add & Norm layers (Add denotes a residual connection, Norm denotes Layer Normalization); it can extract high-level semantic features and global features of an image and learn the Class Token parameters from them. The high-level semantic and global features acquired by the Transformer Encoder are stored in a matrix containing the image features of the patches; splicing the patch features together produces a feature map that can be input into a convolutional neural network module.
In some possible implementations, a ViT module comprising multiple ViT structures may include one Linear Projection of Flattened Patches layer and multiple Transformer Encoders connected in sequence. For example, a ViT module with 3 ViT structures includes one Linear Projection of Flattened Patches layer followed by 3 sequentially connected Transformer Encoders.
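To make the patch arithmetic above concrete, here is a minimal PyTorch sketch of the Linear Projection of Flattened Patches, Class Token, and position embedding. The class and variable names are ours, not from the patent; the stride-16 convolution is a standard equivalent of flattening each patch and applying a shared linear projection.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Sketch of Linear Projection of Flattened Patches + Class Token +
    position embedding, matching the 224x224 / 16x16 / 768-dim arithmetic above."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2      # 14*14 = 196
        # A stride-16 conv flattens each 16x16x3 patch (16*16*3 = 768 values)
        # and applies a shared linear projection in one operation.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))                    # 1x768
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim)) # 197x768

    def forward(self, x):                     # x: (B, 3, 224, 224)
        x = self.proj(x)                      # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)      # (B, 196, 768) token sequence
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)        # concatenate Class Token -> (B, 197, 768)
        return x + self.pos_embed             # element-wise sum with position embedding

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)                           # torch.Size([1, 197, 768])
```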
In some possible implementations, the alternating connection of convolutional neural network modules and visual self-attention modules may be arranged so that the input of visual self-attention module 21 includes the output of convolutional neural network module 11, the input of convolutional neural network module 12 includes the output of visual self-attention module 21, the input of visual self-attention module 22 includes the output of convolutional neural network module 12, and so on until all convolutional neural network modules and visual self-attention modules are connected.
FIG. 2 illustrates a schematic of a structure of one implementation of alternating connection of convolutional neural network modules and visual self-attention modules. As shown in fig. 2, the face living body detection model may include a first convolutional neural network module CM1, a first visual self-attention module VM1, a second convolutional neural network module CM2, a second visual self-attention module VM2, a third convolutional neural network module CM3, and a third visual self-attention module VM3, which are sequentially connected.
The convolutional neural network module and the visual self-attention module are different feature extraction modules: they can be used to extract, respectively, the low-level/local features and the high-level semantic/global features of the face image to be recognized. The low-level and local features help identify planar attacks such as screen and paper attacks, while the high-level semantic and global features help identify stereoscopic attacks such as masks and head models. A face living body detection model built from both kinds of modules can therefore obtain the full range of features needed for face living body detection, improving the model's generalization, robustness against various attacks, and detection accuracy.
Meanwhile, through alternate connection of the convolutional neural network module and the visual self-attention module, the features extracted by the convolutional neural network module and the visual self-attention module can be fully fused, the richness of the obtained features is further improved, the generalization and the robustness of the face living body detection model to various attacks are improved, and the detection accuracy of the face living body detection model is improved.
In some possible implementations, as shown in fig. 2, the first convolutional neural network module CM1 is further connected to a second convolutional neural network module CM2, and the second convolutional neural network module CM2 is further connected to a third convolutional neural network module CM3.
That is, the input of the second convolutional neural network module CM2 further comprises the output of the first convolutional neural network module CM 1; the input of the third convolutional neural network module CM3 further comprises the output of the second convolutional neural network module CM 2.
In some possible implementations, the first convolutional neural network module CM1 may consist of a plurality of convolutional layers, and the second convolutional neural network module CM2 and the third convolutional neural network module CM3 may consist of a plurality of ResNet Blocks (residual network blocks).
In some possible implementations, as shown in fig. 2, the face living body detection model further includes a classification module composed of fully connected layers, configured to obtain the living body detection result from the features extracted by the alternately connected convolutional neural network modules and visual self-attention modules.
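The wiring of fig. 2 can be sketched in PyTorch as below. This is a simplified stand-in, not the patent's exact network: the CNN modules are single conv blocks rather than 2-conv / 4-ResNet-Block stacks, each ViT stage treats feature-map pixels as tokens instead of running the full patch projection, classification pools spatially rather than reading the Class Token, and the channel widths (32/64/128) are our assumptions. What it preserves is the alternating CM -> VM order with additive fusion feeding the next stage.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout, downsample=False):
    # Stand-in for a CNN module (CM); the real CM2/CM3 would use ResNet Blocks.
    stride = 2 if downsample else 1
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride, 1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class ViTStage(nn.Module):
    # Stand-in for a visual self-attention module (VM): tokens -> encoder -> feature map.
    def __init__(self, chans, num_heads=4, depth=3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=chans, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):                     # x: (B, C, H, W)
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)    # (B, H*W, C): per-pixel "patches"
        seq = self.encoder(seq)
        return seq.transpose(1, 2).reshape(b, c, h, w)

class AlternatingBackbone(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        chans = [32, 64, 128]                 # assumed widths
        self.cms = nn.ModuleList([conv_block(3, 32, True),
                                  conv_block(32, 64, True),
                                  conv_block(64, 128, True)])   # CM1..CM3
        self.vms = nn.ModuleList([ViTStage(c) for c in chans])  # VM1..VM3
        self.fc = nn.Linear(chans[-1], num_classes)             # head FC1

    def forward(self, x):
        for cm, vm in zip(self.cms, self.vms):
            c = cm(x)                         # low-level / local features
            v = vm(c)                         # high-level / global features
            x = c + v                         # additive fusion feeds the next stage
        return self.fc(x.mean(dim=(2, 3)))    # pooled features -> liveness logits
```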
In some possible implementations, the face living body detection model may be trained by freezing module parameters, so that the convolutional neural network modules and the visual self-attention modules can each better learn different features of the images.
In the face living body detection method provided by embodiments of the present disclosure, a face living body detection model formed by alternately connected convolutional neural network modules and visual self-attention modules detects the face image to be recognized and produces the living body detection result. Because the two kinds of modules are different feature extractors, they capture different image features, yielding the full range of features needed for face living body detection and improving the model's generalization and robustness against various attacks. At the same time, the alternating connection allows the features extracted by the two kinds of modules to be fully fused, further enriching the obtained features. The living body detection results obtained with the face living body detection method provided by embodiments of the present disclosure are therefore more accurate.
The following specifically describes a face living body detection method provided by an embodiment of the present disclosure.
As described above, in some possible implementations, the face image to be identified may be an image obtained by preprocessing a person image.
Fig. 3 shows a flow chart of the steps of acquiring a face image to be recognized by preprocessing. As shown in fig. 3, acquiring the face image to be recognized through preprocessing may include step S310, step S320, step S330, step S340.
In step S310, a person image to be identified is obtained, face detection is performed on the person image to be identified, and a face region of the person image to be identified is obtained;
in step S320, face key point detection is performed on the face area, and face key point coordinates are obtained;
in step S330, face alignment is performed on the face region according to the face key point coordinates, and a face image with a preset size is obtained through affine transformation;
in step S340, the face image is normalized, and a face image to be identified is obtained.
In some possible implementations, in step S310, the acquired image of the person to be identified may be an image acquired by photographing the person, and the size thereof may not be limited.
In some possible implementations, face detection may be performed on the person image to be identified by a conventional machine learning method, to obtain a face position in the person image to be identified; the face position in the image of the person to be identified can also be obtained through a pre-trained face detection model, and the face area is determined according to the face position.
The face detection model may be any face detection model; embodiments of the present disclosure do not limit its specific composition.
In some possible implementations, in step S320, the face keypoint detection of the acquired face region may be performed by using a pre-trained face keypoint detection model.
The face key point detection model may be any deep learning model capable of realizing face key point detection, and the embodiment of the present disclosure does not limit specific composition of the face key point detection model.
In some possible implementations, there are 72 face keypoints in total.
In some possible implementations, the preset size may be 224x224. In step S330, the target face of the person image to be recognized is aligned according to the coordinate values of the face key points; at the same time, a face image containing only the face region is cropped out via affine transformation and resized to the preset size.
In some possible implementations, normalizing the face image may consist of subtracting 128 from the value of each pixel and dividing by 256, so that each pixel value lies in [-0.5, 0.5].
Preprocessing removes non-face areas from the person image to be identified, reducing their influence on face living body detection; normalization also reduces the influence of lighting and the like, improving the accuracy of the living body detection result.
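A sketch of the alignment and normalization steps, assuming OpenCV. The face detector, the landmark model, and the `template` of canonical landmark positions are placeholders, since the text allows any choice of detector and key-point model.

```python
import cv2
import numpy as np

def preprocess_face(person_img, landmarks, template, size=224):
    """Sketch of steps S330-S340: align via affine transform, crop to the
    preset size, and normalize. `landmarks` are the detected face key points
    (e.g., the 72-point scheme mentioned above) and `template` holds the
    corresponding canonical positions in the size x size output image."""
    # Affine transform mapping detected landmarks onto the canonical template.
    M, _ = cv2.estimateAffinePartial2D(np.float32(landmarks), np.float32(template))
    face = cv2.warpAffine(person_img, M, (size, size))   # preset size 224x224
    # Normalization from the text: (pixel - 128) / 256 -> values in [-0.5, 0.5]
    return (np.float32(face) - 128.0) / 256.0
```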
As mentioned above, in some possible implementations, as shown in fig. 2, the first convolutional neural network module CM1 is further connected to the second convolutional neural network module CM2, and the second convolutional neural network module CM2 is further connected to the third convolutional neural network module CM3.
That is, the input of the second convolutional neural network module CM2 further comprises the output of the first convolutional neural network module CM 1; the input of the third convolutional neural network module CM3 further comprises the output of the second convolutional neural network module CM 2.
The connections between convolutional neural network modules allow low-level feature information extracted by a lower module (such as the first convolutional neural network module CM 1) to be passed to a higher module (such as the third convolutional neural network module CM 3), enriching the features available to the higher module.
In some possible implementations, the first convolutional neural network module CM1 may consist of a plurality of convolutional layers, and the second convolutional neural network module CM2 and the third convolutional neural network module CM3 may consist of a plurality of ResNet Blocks (residual network blocks).
Compared with a plain convolutional layer, a ResNet Block is easier to optimize and alleviates the vanishing-gradient problem that comes with added depth in deep neural networks, making it better suited for the higher convolutional stages.
Experiments showed that the face living body detection model performs best when the first convolutional neural network module CM1 consists of 2 convolutional layers and the second and third convolutional neural network modules CM2 and CM3 each consist of 4 ResNet Blocks.
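For reference, a standard PyTorch ResNet Block of the kind the text relies on (a generic sketch; the patent does not give the block's internal layout, so the two-conv design and projection shortcut are assumptions):

```python
import torch.nn as nn

class ResNetBlock(nn.Module):
    """Standard residual block: out = ReLU(F(x) + shortcut(x))."""
    def __init__(self, cin, cout, stride=1):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(cin, cout, 3, stride, 1, bias=False),
            nn.BatchNorm2d(cout), nn.ReLU(inplace=True),
            nn.Conv2d(cout, cout, 3, 1, 1, bias=False),
            nn.BatchNorm2d(cout))
        # 1x1 projection when the shape changes, so the residual add is valid.
        self.shortcut = (nn.Identity() if stride == 1 and cin == cout else
                         nn.Sequential(nn.Conv2d(cin, cout, 1, stride, bias=False),
                                       nn.BatchNorm2d(cout)))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # The identity shortcut is what eases optimization and mitigates
        # vanishing gradients, per the paragraph above.
        return self.relu(self.f(x) + self.shortcut(x))
```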
In some possible implementations, as shown in fig. 2, the face living body detection model further includes a classification module composed of fully connected layers, configured to obtain the living body detection result from the features extracted by the alternately connected convolutional neural network modules and visual self-attention modules.
In some specific implementations, the process of inputting the face image to be recognized into the pre-trained face living body detection model to obtain the living body detection result may include:
The face image to be recognized is input into the first convolutional neural network module CM1, which consists of 2 convolutional layers; the 2nd convolutional layer of CM1 performs a downsampling operation, and CM1 extracts low-level feature information from the face image to be recognized.
The feature map extracted by the first convolutional neural network module CM1 is then input into the first visual self-attention module VM1, which consists of 3 ViT structures. The features output by CM1 and VM1 are added and input into the second convolutional neural network module CM2, which consists of 4 ResNet Blocks; similarly, the last convolutional layer of CM2 performs a downsampling operation.
The feature map extracted by the second convolutional neural network module CM2 is then input into the second visual self-attention module VM2, which consists of 3 ViT structures. The features obtained by CM2 and VM2 are added and input into the third convolutional neural network module CM3, which consists of 4 ResNet Blocks, with the downsampling operation again performed by the last convolutional layer.
Finally, the feature map extracted by the third convolutional neural network module CM3 is input into the third visual self-attention module VM3, which consists of 3 ViT structures, and the fully connected layer FC1 produces the final living body detection result from the features output by VM3 (specifically, the Class Token).
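As a quick sanity check, the AlternatingBackbone sketch given earlier can be driven end-to-end with a dummy preprocessed face (an illustration of the flow above, not the patent's exact pipeline):

```python
import torch

# Assumes the AlternatingBackbone sketch defined earlier in this document.
model = AlternatingBackbone()
logits = model(torch.randn(1, 3, 224, 224))   # one preprocessed 224x224 face
print(logits.shape)                           # torch.Size([1, 2]): live vs. attack
```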
As described above, in some possible implementations, the face living body detection model may be trained by freezing module parameters, so that the convolutional neural network modules and the visual self-attention modules can each better learn different features of the images.
FIG. 4 is a network training process diagram illustrating one implementation of model training by freezing parameters of a module.
As shown in fig. 4, during training the parameters of all visual self-attention modules (the hatched modules in the first network structure diagram in fig. 4) are frozen first, and the pre-training model corresponding to the face living body detection model is trained with the training face images and their corresponding living body detection labels to obtain a first training model.
The pre-training model has the same main structure as the face living body detection model and likewise comprises a plurality of alternately connected convolutional neural network modules and visual self-attention modules; the difference is that its last convolutional neural network module is connected to a fully connected layer FC2, which produces the living body detection result. The network loss is determined from the living body detection result and the label of the training face image, and the parameters of the pre-training model's convolutional neural network modules are updated according to the network loss.
Next, the parameters of the convolutional neural network modules of the first training model (the hatched modules in the second network structure diagram in fig. 4) are frozen, and the first training model is trained with the training face images and their corresponding living body detection labels to obtain a second training model.
The first training model has the same structure as the face living body detection model. The living body detection result output by its fully connected layer FC1 is obtained, the network loss is determined from that result and the label of the training face image, and the parameters of the first training model's visual self-attention modules are updated according to the network loss.
Finally, the second training model is trained with the training face images and their corresponding living body detection labels to obtain the trained face living body detection model.
The second training model has the same structure as the face living body detection model. The living body detection result output by its fully connected layer FC1 is obtained, the network loss is determined from that result and the label of the training face image, and the parameters of both the convolutional neural network modules and the visual self-attention modules of the second training model are updated according to the network loss.
By freezing the parameters of the convolutional neural network modules and the visual self-attention modules in turn, each kind of module can learn the features of the face image separately, avoiding mutual interference during feature learning and allowing both to learn better features.
In some possible implementations, the pre-training model corresponding to the face living body detection model and the first training model are trained on different data.
The pre-training model corresponding to the face living body detection model is trained with the planar attack images and their corresponding living body detection labels to obtain the first training model.
The first training model is trained with the stereoscopic attack images and their corresponding living body detection labels to obtain the second training model.
Because identifying planar attacks such as screen and paper attacks relies on low-level and local features, training the pre-training model on the planar attack images and their living body detection labels, with the parameters of all visual self-attention modules frozen, lets the convolutional neural network modules learn to extract low-level and local features.
Because identifying stereoscopic attacks such as mask and head-model attacks relies on high-level semantic and global features, training the first training model on the stereoscopic attack images and their living body detection labels, with the parameters of all convolutional neural network modules frozen, lets the visual self-attention modules learn to extract high-level semantic and global features.
Finally, the second training model is trained on all the data, so that the features extracted by the convolutional neural network modules and the features extracted by the visual self-attention modules can be fully fused.
Through this training scheme, the face living body detection model can acquire both low-level features and high-level semantic features, both local features and global features, and fully fuse the acquired features, enriching the extracted features and improving the model's generalization and accuracy against various attacks.
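The three-stage schedule reads naturally as parameter freezing in PyTorch. The sketch below reuses the AlternatingBackbone stand-in and hypothetical data loaders (`planar_loader`, `stereo_loader`, `full_loader` are placeholders); it also simplifies away the patent's swap between the FC2 head in stage 1 and the FC1 head afterwards.

```python
import torch
import torch.nn as nn

def set_trainable(modules, flag):
    """Freeze (flag=False) or unfreeze (flag=True) a list of submodules."""
    for m in modules:
        for p in m.parameters():
            p.requires_grad = flag

def train_stage(model, loader, epochs=1, lr=1e-4):
    # Only parameters left unfrozen are handed to the optimizer and updated.
    opt = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:
            opt.zero_grad()
            loss_fn(model(images), labels).backward()
            opt.step()

model = AlternatingBackbone()   # sketch from earlier; FC2/FC1 head swap omitted

# Stage 1: freeze all visual self-attention modules; train the CNN modules
# on real faces + planar attack images, so they learn low-level/local features.
set_trainable(model.vms, False)
train_stage(model, planar_loader)    # planar_loader: placeholder DataLoader

# Stage 2: unfreeze the ViT modules and freeze the CNN modules; train on
# real faces + stereoscopic attack images for high-level/global features.
set_trainable(model.vms, True)
set_trainable(model.cms, False)
train_stage(model, stereo_loader)    # stereo_loader: placeholder DataLoader

# Stage 3: unfreeze everything and fine-tune on all data to fuse both.
set_trainable(model.cms, True)
train_stage(model, full_loader)      # full_loader: placeholder DataLoader
```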
Fig. 5 illustrates a flowchart of a training method for a face living body detection model provided by an embodiment of the present disclosure, and as illustrated in fig. 5, the training method for a face living body detection model provided by an embodiment of the present disclosure may include step S510 and step S520.
In step S510, a plurality of training face images and living body detection tags corresponding to the training face images are acquired;
in step S520, training the face living body detection model according to the training face image and the living body detection label corresponding to the training face image;
the face living body detection model comprises a plurality of convolutional neural network modules and a plurality of visual self-attention modules, wherein the convolutional neural network modules and the visual self-attention modules are alternately connected.
For example, in step S510, the training face images may include real face images, planar attack images (e.g., screen replays and printed paper) and stereoscopic attack images (e.g., masks/head covers and head models). The living body detection label of a real face image identifies the image as a live face; the labels of planar and stereoscopic attack images identify the image as not a live face.
In some possible implementations, the training face image may be an image taken directly by shooting.
In some possible implementations, the training face image may also be an image obtained by preprocessing the image.
In some possible implementations, preprocessing the image may include face detection, key point detection, face alignment, affine transformation and normalization. Preprocessing can reduce the influence of lighting and the like on the training face images and, by improving the quality of the training data, improve the performance of the resulting face living body detection model.
In some possible implementations, the training face image may be an RGB image; in some possible implementations, the training face image may be a depth image; in some possible implementations, the training face image may be an infrared image.
Under the condition that the training face image is an RGB image, the accuracy of a living body detection result obtained by the face living body detection method provided by the embodiment of the disclosure when the face image to be identified is the RGB image is higher.
Under the condition that the training face image is a depth image, the accuracy of a living body detection result obtained by the face living body detection method provided by the embodiment of the disclosure when the face image to be identified is the depth image is higher.
Under the condition that the training face image is an infrared image, the accuracy of a living body detection result obtained by the face living body detection method provided by the embodiment of the disclosure when the face image to be identified is the infrared image is higher.
In some possible implementations, in step S520, the face living body detection model includes a plurality of convolutional neural network modules and a plurality of ViT (Vision Transformer) modules that are alternately connected.
The convolutional neural network module consists of one or more convolutional layers and can be used to extract image features: shallow layers extract low-level and local features, and as the number of convolutional layers increases, high-level semantic features and global features can be extracted. The input of the convolutional neural network module can be an image or a feature map, and the output is an image feature map.
The ViT module includes a plurality of connected ViT structures, each of which includes a Transformer Encoder (self-attention encoder). The input of the ViT module can be an image or patches obtained by dividing an image feature map, and the output can likewise be patches of a feature map.
In some possible implementations, the ViT module can also include a Linear Projection of Flattened Patches layer.
The Linear Projection of Flattened Patches layer is an image feature layer mainly used for image patching and token (vector) sequence generation; it feeds the generated sequence into the Transformer Encoder, and it patches the feature map output by a convolutional neural network module to generate the token sequence.
For example, a 224x224 image may be divided into 16x16 patches, yielding 196 patches; each patch is flattened into a vector of length 16x16x3 = 768, producing a 196x768 two-dimensional matrix, i.e., the sequence format required by the Transformer Encoder.
In addition, there is a learnable classification vector, the Class Token, of dimension 1x768; concatenating the Class Token with the matrix generated above yields a matrix of dimension 197x768.
There is also a learnable position embedding of dimension 197x768; an element-wise sum of the position embedding and the 197x768 matrix generated above yields the final input matrix of the Transformer Encoder, also of dimension 197x768.
The Transformer Encoder includes a plurality of Encoder Blocks, each composed of a Multi-Head Attention layer and Add & Norm layers (Add denotes a residual connection, Norm denotes Layer Normalization); it can extract high-level semantic features and global features of an image and learn the Class Token parameters from them. The high-level semantic and global features acquired by the Transformer Encoder are stored in a matrix containing the image features of the patches; splicing the patch features together produces a feature map that can be input into a convolutional neural network module.
In some possible implementations, a ViT module comprising multiple ViT structures may include one Linear Projection of Flattened Patches layer and multiple Transformer Encoders connected in sequence. For example, a ViT module with 3 ViT structures includes one Linear Projection of Flattened Patches layer followed by 3 sequentially connected Transformer Encoders.
In some possible implementations, the alternating connection of convolutional neural network modules and visual self-attention modules may be arranged so that the input of visual self-attention module 21 includes the output of convolutional neural network module 11, the input of convolutional neural network module 12 includes the output of visual self-attention module 21, the input of visual self-attention module 22 includes the output of convolutional neural network module 12, and so on until all convolutional neural network modules and visual self-attention modules are connected.
FIG. 2 illustrates a schematic of a structure of one implementation of alternating connection of convolutional neural network modules and visual self-attention modules. As shown in fig. 2, the face living body detection model may include a first convolutional neural network module CM1, a first visual self-attention module VM1, a second convolutional neural network module CM2, a second visual self-attention module VM2, a third convolutional neural network module CM3, and a third visual self-attention module VM3, which are sequentially connected.
The convolutional neural network module and the visual self-attention module are different feature extraction modules: they can be used to extract, respectively, the low-level/local features and the high-level semantic/global features of the training face image. The low-level and local features help identify planar attacks such as screen and paper attacks, while the high-level semantic and global features help identify stereoscopic attacks such as masks and head models. A face living body detection model built from both kinds of modules can therefore obtain the full range of features needed for face living body detection, improving the model's generalization, robustness against various attacks, and detection accuracy.
Meanwhile, through alternate connection of the convolutional neural network module and the visual self-attention module, the features extracted by the convolutional neural network module and the visual self-attention module can be fully fused, the richness of the obtained features is further improved, the generalization and the robustness of the face living body detection model to various attacks are improved, and the detection accuracy of the face living body detection model is improved.
In some possible implementations, as shown in fig. 2, the first convolutional neural network module CM1 is further connected to a second convolutional neural network module CM2, and the second convolutional neural network module CM2 is further connected to a third convolutional neural network module CM3.
That is, the input of the second convolutional neural network module CM2 further comprises the output of the first convolutional neural network module CM 1; the input of the third convolutional neural network module CM3 further comprises the output of the second convolutional neural network module CM 2.
In some possible implementations, the first convolutional neural network module CM1 may consist of a plurality of convolutional layers, and the second convolutional neural network module CM2 and the third convolutional neural network module CM3 may consist of a plurality of ResNet Blocks (residual network blocks).
In some possible implementations, as shown in fig. 2, the face living body detection model further includes a classification module composed of fully connected layers, configured to obtain the living body detection result from the features extracted by the alternately connected convolutional neural network modules and visual self-attention modules.
In some possible implementations, the face biopsy model may be trained by freezing parameters of the module to facilitate different features of the images for better learning by the convolutional neural network module and the visual self-attention module.
In the training method of the human face living body detection model provided by the embodiment of the disclosure, the convolutional neural network module and the visual self-attention module are alternately connected, and as the convolutional neural network module and the visual self-attention module are different feature extraction modules, different image features can be extracted, so that various features required by human face living body detection are obtained, and generalization and robustness of the human face living body detection model to various attacks are improved. Meanwhile, the convolutional neural network module and the visual self-attention module are alternately connected, so that the features extracted by the convolutional neural network module and the visual self-attention module can be fully fused, the richness of the obtained features is further improved, and the generalization and the robustness of the human face living body detection model to various attacks are improved. Therefore, the performance of the human face living body detection model obtained by the training method of the human face living body detection model provided by the embodiment of the disclosure is better.
The following further describes in detail the training of the face living body detection model provided by the embodiments of the present disclosure.
As described above, in some possible implementations, the training face image may be an image obtained by preprocessing a person image.
Acquiring training face images through preprocessing may include:
acquiring a training person image, performing face detection on the training person image, and obtaining the face area of the training person image;
performing face keypoint detection in the face area to obtain the face keypoint coordinates;
performing face alignment on the face area according to the face keypoint coordinates, and obtaining a face image of a preset size through affine transformation;
and performing normalization processing on the face image to obtain the training face image.
In some possible implementations, the acquired training person image may be an image obtained by photographing a person, and its size is not limited.
In some possible implementations, face detection may be performed on the training person image by a conventional machine learning method to obtain the face position in the training person image; alternatively, the face position may be obtained through a pre-trained face detection model, and the face area is then determined according to the face position.
The face detection model may be any model capable of face detection; the embodiments of the present disclosure do not limit its specific composition.
In some possible implementations, the face keypoint detection on the acquired face area may be performed using a pre-trained face keypoint detection model.
The face keypoint detection model may be any deep learning model capable of face keypoint detection; the embodiments of the present disclosure do not limit its specific composition.
In some possible implementations, there are 72 face keypoints in total.
In some possible implementations, the preset size may be 224×224. The target face in the training person image is aligned according to the face keypoint coordinates; meanwhile, a face image containing only the face area is cropped out through affine transformation and resized to the preset size.
In some possible implementations, the normalization of the face image may consist of subtracting 128 from the value of each pixel and dividing the result by 256, so that each pixel value lies in [-0.5, 0.5].
Preprocessing removes the non-face area of the training person image and reduces its influence on face living body detection; meanwhile, normalization reduces the influence of lighting and similar factors on the training face image, improving the accuracy of the obtained living body detection result.
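As a concrete illustration, the following is a minimal sketch of this preprocessing pipeline in Python, assuming OpenCV and NumPy are available. The `detect_face` and `detect_keypoints` callables stand in for the pre-trained face detection and face keypoint detection models discussed above, and the choice of three reference keypoints with fixed canonical positions is purely illustrative; the disclosure does not specify which keypoints anchor the alignment.

```python
import cv2
import numpy as np

def preprocess_face(person_image, detect_face, detect_keypoints, size=224):
    """Crop, align, and normalize a face from a person image.

    `detect_face` returns a bounding box (x, y, w, h); `detect_keypoints`
    returns an array of (x, y) keypoint coordinates (e.g., 72 points).
    Both are placeholders for the pre-trained models described above.
    """
    # Step 1: locate the face region in the person image.
    x, y, w, h = detect_face(person_image)
    face_region = person_image[y:y + h, x:x + w]

    # Step 2: detect face keypoints within the face region.
    keypoints = detect_keypoints(face_region)          # e.g., shape (72, 2)

    # Step 3: align the face with an affine transform that maps three
    # reference keypoints (an illustrative choice) to canonical positions
    # in a size x size template, cropping and resizing in a single warp.
    src = keypoints[:3].astype(np.float32)
    dst = np.float32([[0.3 * size, 0.35 * size],
                      [0.7 * size, 0.35 * size],
                      [0.5 * size, 0.65 * size]])
    M = cv2.getAffineTransform(src, dst)
    aligned = cv2.warpAffine(face_region, M, (size, size))

    # Step 4: normalize pixel values into [-0.5, 0.5]: (p - 128) / 256.
    return (aligned.astype(np.float32) - 128.0) / 256.0
```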
In some possible implementations, the training face images may also be augmented through data enhancement. The embodiments of the present disclosure do not limit the data enhancement method.
As mentioned above, in some possible implementations, as shown in fig. 2, the first convolutional neural network module CM1 is further connected to the second convolutional neural network module CM2, and the second convolutional neural network module CM2 is further connected to the third convolutional neural network module CM3.
That is, the input of the second convolutional neural network module CM2 further comprises the output of the first convolutional neural network module CM1; the input of the third convolutional neural network module CM3 further comprises the output of the second convolutional neural network module CM2.
Through these connections between the convolutional neural network modules, the low-level feature information extracted by a lower convolutional neural network module (such as the first convolutional neural network module CM1) can be passed to a higher convolutional neural network module (such as the third convolutional neural network module CM3), enriching the features obtained by the higher module.
In some possible implementations, the first convolutional neural network module CM1 may be composed of a plurality of convolutional layers, and the second convolutional neural network module CM2 and the third convolutional neural network module CM3 may each be composed of a plurality of ResNet blocks (residual network blocks).
Compared with ordinary convolutional layers, a ResNet block is easier to optimize and alleviates the vanishing-gradient problem that arises as depth is added to a deep neural network; ResNet blocks are therefore better suited than ordinary convolutional layers to serve as the higher-level convolutional neural network modules.
Experiments show that the face living body detection model performs best when the first convolutional neural network module CM1 is composed of 2 convolutional layers and the second and third convolutional neural network modules CM2 and CM3 are each composed of 4 ResNet blocks.
In some possible implementations, as shown in fig. 2, the face living body detection model further includes a classification module composed of fully connected layers, configured to obtain the living body detection result from the features extracted by the alternately connected convolutional neural network modules and visual self-attention modules.
In some specific implementations, the process of training the face living body detection model according to the training face images and their corresponding living body detection labels may include:
inputting a training face image into the first convolutional neural network module CM1, which is composed of 2 convolutional layers and performs a downsampling operation at its 2nd convolutional layer, to extract the low-level feature information of the training face image;
then inputting the feature map extracted by the first convolutional neural network module CM1 into the first visual self-attention module VM1, which is composed of 3 ViT structures; the features output by the first convolutional neural network module CM1 and the first visual self-attention module VM1 are added and input into the second convolutional neural network module CM2, which is composed of 4 ResNet blocks and, similarly, performs a downsampling operation at its last convolutional layer;
the feature map extracted by the second convolutional neural network module CM2 is then input into the second visual self-attention module VM2, which consists of 3 ViT structures; the features obtained by the second convolutional neural network module CM2 and the second visual self-attention module VM2 are added and input into the third convolutional neural network module CM3, which consists of 4 ResNet blocks and performs a downsampling operation at its last convolutional layer;
and finally, the feature map extracted by the third convolutional neural network module CM3 is input into the third visual self-attention module VM3, which consists of 3 ViT structures; the fully connected layer FC1 obtains the final living body detection result from the features output by the third visual self-attention module VM3 (specifically, this may be the class token, ClassToken); the network loss is then determined from the final living body detection result and the living body detection label corresponding to the training face image, and the parameters of the face living body detection model are modified accordingly.
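To make the data flow concrete, here is a hedged PyTorch sketch of a model with this alternating structure. It is not the disclosed implementation: the channel width, the attention-head count, the internal layout of the 3-layer ViT stages (which here operate on full flattened feature maps rather than embedded patches), the placement of the strided downsampling convolution after the ResNet blocks, and the use of mean-pooled tokens in place of the class token are all assumptions made for brevity.

```python
import torch
import torch.nn as nn

class ResNetBlock(nn.Module):
    """Basic residual block: two 3x3 convolutions plus a skip connection."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))

    def forward(self, x):
        return torch.relu(self.body(x) + x)

class ViTStage(nn.Module):
    """A visual self-attention module: 3 transformer encoder layers over the
    flattened feature map, reshaped back to (B, C, H, W) afterwards."""
    def __init__(self, ch, depth=3, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=ch, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        b, c, h, w = x.shape
        tokens = self.encoder(x.flatten(2).transpose(1, 2))  # (B, H*W, C)
        return tokens.transpose(1, 2).reshape(b, c, h, w)

class FaceLivenessModel(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        # CM1: 2 convolutional layers, downsampling at the 2nd layer.
        self.cm1 = nn.Sequential(
            nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.vm1 = ViTStage(ch)
        # CM2 / CM3: 4 ResNet blocks each, followed by a strided convolution
        # serving as the final (downsampling) layer.
        self.cm2 = nn.Sequential(*[ResNetBlock(ch) for _ in range(4)],
                                 nn.Conv2d(ch, ch, 3, stride=2, padding=1))
        self.vm2 = ViTStage(ch)
        self.cm3 = nn.Sequential(*[ResNetBlock(ch) for _ in range(4)],
                                 nn.Conv2d(ch, ch, 3, stride=2, padding=1))
        self.vm3 = ViTStage(ch)
        self.fc1 = nn.Linear(ch, 2)  # FC1: live vs. attack

    def forward(self, x):
        c1 = self.cm1(x)
        c2 = self.cm2(c1 + self.vm1(c1))   # fuse CM1 and VM1 outputs
        c3 = self.cm3(c2 + self.vm2(c2))   # fuse CM2 and VM2 outputs
        v3 = self.vm3(c3)
        # Mean-pooled tokens stand in for the ClassToken described above.
        return self.fc1(v3.flatten(2).mean(dim=2))

# Illustrative usage; a small input keeps the token sequences short.
logits = FaceLivenessModel()(torch.randn(1, 3, 64, 64))   # shape: (1, 2)
```

With the 224×224 inputs described above, the first flattened feature map would contain 112 × 112 tokens, so a practical implementation would likely embed patches or downsample further before the first attention stage.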
As described above, in some possible implementations, the face living body detection model may be trained by freezing the parameters of some modules in turn, so that the convolutional neural network modules and the visual self-attention modules can each better learn different features of the images.
FIG. 6 is a flow diagram illustrating one implementation of model training by freezing module parameters. As shown in FIG. 6, the implementation may include steps S610, S620, and S630.
In step S610, with the parameters of the visual self-attention modules of the face living body detection model frozen, the training face images are input into the pre-training model corresponding to the face living body detection model, and this pre-training model is trained according to the results output by the fully connected layer connected to the convolutional neural network module of the face living body detection model and the living body detection labels corresponding to the training face images, to obtain a first training model;
in step S620, with the parameters of the convolutional neural network modules of the first training model frozen, the training face images are input into the first training model, and the first training model is trained according to the results output by the fully connected layer connected to the visual self-attention module of the first training model and the living body detection labels corresponding to the training face images, to obtain a second training model;
in step S630, the training face images are input into the second training model, and the second training model is trained according to the results output by the fully connected layer connected to the visual self-attention module of the second training model and the living body detection labels corresponding to the training face images, to obtain the trained face living body detection model.
That is, during training, the parameters of all visual self-attention modules are frozen first, and the pre-training model corresponding to the face living body detection model is trained using the training face images and their corresponding living body detection labels, yielding the first training model.
The pre-training model corresponding to the face living body detection model is consistent with the face living body detection model in its main structure and likewise comprises a plurality of alternately connected convolutional neural network modules and visual self-attention modules. The difference is that its last convolutional neural network module is connected to a fully connected layer FC2, which produces the living body detection result; the network loss is determined from this result and the living body detection label corresponding to the training face image, and the parameters of the convolutional neural network modules of the pre-training model are modified according to the network loss.
Next, the parameters of the convolutional neural network modules of the first training model are frozen, and the first training model is trained using the training face images and their corresponding living body detection labels, yielding the second training model.
The first training model is consistent in structure with the face living body detection model. The living body detection result output by the fully connected layer FC1 of the first training model is obtained, the network loss is determined from this result and the living body detection label corresponding to the training face image, and the parameters of the visual self-attention modules of the first training model are modified according to the network loss.
Finally, the second training model is trained using the training face images and their corresponding living body detection labels, yielding the trained face living body detection model.
The second training model is consistent in structure with the face living body detection model. The living body detection result output by the fully connected layer FC1 of the second training model is obtained, the network loss is determined from this result and the living body detection label corresponding to the training face image, and the parameters of both the convolutional neural network modules and the visual self-attention modules of the second training model are modified according to the network loss.
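Extending the PyTorch sketch above, the pre-training variant might share the backbone with the final model while switching classification heads. The `use_cnn_head` flag is an illustrative device, not part of the disclosure; what the sketch preserves is that the pre-training head FC2 reads the output of the last convolutional module CM3 (with the frozen visual self-attention modules still in the forward path), while FC1 reads the output of the last visual self-attention module VM3.

```python
class StagedModel(FaceLivenessModel):
    """FaceLivenessModel plus the pre-training head FC2 on the CM3 output."""
    def __init__(self, ch=64):
        super().__init__(ch)
        self.fc2 = nn.Linear(ch, 2)       # FC2: pre-training head
        self.use_cnn_head = False         # illustrative stage switch

    def forward(self, x):
        c1 = self.cm1(x)
        c2 = self.cm2(c1 + self.vm1(c1))  # VM1 stays in the forward path
        c3 = self.cm3(c2 + self.vm2(c2))  # even when its parameters are frozen
        if self.use_cnn_head:             # pre-training: classify from CM3
            return self.fc2(c3.flatten(2).mean(dim=2))
        return self.fc1(self.vm3(c3).flatten(2).mean(dim=2))
```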
By freezing the parameters of the convolutional neural network modules and the visual self-attention modules in turn, each kind of module can learn the features of the face images separately, avoiding mutual interference between their feature-learning processes and allowing both kinds of module to learn better features.
In some possible implementations, the pre-training model corresponding to the face living body detection model and the first training model are trained on different data.
The pre-training model corresponding to the face living body detection model is trained using the planar attack images and their corresponding living body detection labels, yielding the first training model.
The first training model is trained using the stereoscopic attack images and their corresponding living body detection labels, yielding the second training model.
Because identifying planar attacks such as screen attacks and paper attacks relies on low-level and local features, the pre-training model corresponding to the face living body detection model is trained with the planar attack images and their corresponding living body detection labels while the parameters of all visual self-attention modules are frozen, so that the convolutional neural network modules learn to extract low-level and local features.
Likewise, because identifying three-dimensional attacks such as headgear attacks and head-model attacks relies on high-level semantic and global features, the first training model is trained with the stereoscopic attack images and their corresponding living body detection labels while the parameters of all convolutional neural network modules are frozen, so that the visual self-attention modules learn to extract high-level semantic and global features.
Finally, the second training model is trained using all of the data, so that the features extracted by the convolutional neural network modules and the visual self-attention modules can be fully fused.
Through this training scheme, the face living body detection model can acquire both low-level features and high-level semantic features, both local features and global features, and fully fuse them, enriching the extracted features and improving the generalization and accuracy of the face living body detection model against various attacks.
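Putting the three stages together, the freezing schedule and the per-stage data split might be sketched as follows, reusing `StagedModel` from above. The data loaders, the optimizer choice, and the hyperparameters are assumptions made for illustration; only the freeze/unfreeze order and the data assignment per stage follow the description above.

```python
import torch
import torch.nn as nn

def set_frozen(modules, frozen):
    """Freeze or unfreeze all parameters of the given submodules."""
    for m in modules:
        for p in m.parameters():
            p.requires_grad = not frozen

def train_stage(model, loader, epochs=1, lr=1e-4):
    """Run one training stage, optimizing only the unfrozen parameters."""
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(params, lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            criterion(model(images), labels).backward()
            optimizer.step()

def fake_loader(n=8):
    """Stand-in loader with random data, for illustration only."""
    return [(torch.randn(n, 3, 64, 64), torch.randint(0, 2, (n,)))]

planar_attack_loader = fake_loader()   # would hold planar-attack data
stereo_attack_loader = fake_loader()   # would hold stereo-attack data
all_data_loader = fake_loader()        # would hold all training data

model = StagedModel()
cnn_modules = [model.cm1, model.cm2, model.cm3]
vit_modules = [model.vm1, model.vm2, model.vm3]

# Stage 1 (S610): freeze the visual self-attention modules and train on the
# planar attack data so the CNN modules learn low-level, local features.
set_frozen(vit_modules, True)
model.use_cnn_head = True              # classify via FC2 on the CM3 output
train_stage(model, planar_attack_loader)

# Stage 2 (S620): freeze the CNN modules and train on the stereo attack data
# so the ViT modules learn high-level semantic, global features.
set_frozen(vit_modules, False)
set_frozen(cnn_modules, True)
model.use_cnn_head = False             # classify via FC1 on the VM3 output
train_stage(model, stereo_attack_loader)

# Stage 3 (S630): unfreeze everything and train on all data so the CNN and
# ViT features are fully fused.
set_frozen(cnn_modules, False)
train_stage(model, all_data_loader)
```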
Based on the same principle as the method shown in fig. 1, fig. 7 shows a schematic structural diagram of a face living body detection apparatus provided by an embodiment of the present disclosure, and as shown in fig. 7, the face living body detection apparatus 70 may include:
a test data module 710, configured to obtain a face image to be identified;
an inference module 720, configured to input the face image to be recognized into a pre-trained face living body detection model to obtain a living body detection result;
the face living body detection model comprises a plurality of convolutional neural network modules and a plurality of visual self-attention modules, wherein the convolutional neural network modules and the visual self-attention modules are alternately connected.
In the face living body detection apparatus provided by the embodiments of the present disclosure, the face image to be detected is processed by a face living body detection model formed of alternately connected convolutional neural network modules and visual self-attention modules to obtain the living body detection result. Because the two kinds of modules are different feature extraction modules, they can extract different image features, yielding the various features required for face living body detection and improving the generalization and robustness of the model to various attacks. Meanwhile, the alternate connection allows their extracted features to be fully fused, further enriching the obtained features. The living body detection results obtained with the face living body detection apparatus provided by the embodiments of the present disclosure are therefore more accurate.
It will be appreciated that the above-described modules of the face living body detection apparatus in the embodiments of the present disclosure have the functions of implementing the corresponding steps of the face living body detection method in the embodiment shown in fig. 1. These functions may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the functions described above. The modules may be software and/or hardware, and each module may be implemented separately or by integrating multiple modules. For a functional description of each module of the face living body detection apparatus, reference may be made to the corresponding description of the face living body detection method in the embodiment shown in fig. 1, which is not repeated here.
Based on the same principle as the method shown in fig. 5, fig. 8 shows a schematic structural diagram of a training device for a face living body detection model according to an embodiment of the present disclosure. As shown in fig. 8, the training device 80 for the face living body detection model may include:
the training data module 810 is configured to obtain a plurality of training face images and living body detection labels corresponding to the training face images;
the training module 820 is configured to train the face living body detection model according to the training face image and the living body detection label corresponding to the training face image;
the face living body detection model comprises a plurality of convolutional neural network modules and a plurality of visual self-attention modules, wherein the convolutional neural network modules and the visual self-attention modules are alternately connected.
In the training device for the face living body detection model provided by the embodiments of the present disclosure, the convolutional neural network modules and the visual self-attention modules are connected alternately. Because they are different feature extraction modules, they can extract different image features, yielding the various features required for face living body detection and improving the generalization and robustness of the model to various attacks. Meanwhile, the alternate connection allows the features extracted by the two kinds of modules to be fully fused, further enriching the obtained features. The face living body detection model obtained with the training device provided by the embodiments of the present disclosure therefore performs better.
It can be appreciated that the above-described modules of the training device for the face living body detection model in the embodiments of the present disclosure have the functions of implementing the corresponding steps of the training method of the face living body detection model in the embodiment shown in fig. 5. These functions may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the functions described above. The modules may be software and/or hardware, and each module may be implemented separately or by integrating multiple modules. For a functional description of each module of the training device, reference may be made to the corresponding description of the training method of the face living body detection model in the embodiment shown in fig. 5, which is not repeated here.
In the technical solutions of the present disclosure, the acquisition, storage, and application of the user personal information involved all comply with the relevant laws and regulations and do not violate public order and good morals.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
The electronic device includes: at least one processor; and a memory communicatively connected to the at least one processor; the memory stores instructions executable by the at least one processor, which enable the at least one processor to perform the face living body detection method and the training method of the face living body detection model provided by the embodiments of the present disclosure.
Compared with the prior art, the electronic device detects the face image to be recognized with a face living body detection model formed of alternately connected convolutional neural network modules and visual self-attention modules to obtain the living body detection result. Because the two kinds of modules are different feature extraction modules, they can extract different image features, yielding the various features required for face living body detection and improving the generalization and robustness of the model to various attacks. Meanwhile, the alternate connection allows their extracted features to be fully fused, further enriching the obtained features. The living body detection results obtained by the electronic device provided by the embodiments of the present disclosure are therefore more accurate.
The readable storage medium is a non-transitory computer readable storage medium storing computer instructions for causing a computer to execute the face living body detection method and the training method of the face living body detection model provided by the embodiments of the present disclosure.
Compared with the prior art, the readable storage medium causes the face image to be recognized to be detected with a face living body detection model formed of alternately connected convolutional neural network modules and visual self-attention modules to obtain the living body detection result. Because the two kinds of modules are different feature extraction modules, they can extract different image features, yielding the various features required for face living body detection and improving the generalization and robustness of the model to various attacks. Meanwhile, the alternate connection allows their extracted features to be fully fused, further enriching the obtained features. The living body detection results obtained using the readable storage medium provided by the embodiments of the present disclosure are therefore more accurate.
The computer program product comprises a computer program which, when executed by a processor, implements the face living body detection method and the training method of the face living body detection model provided by the embodiments of the present disclosure.
Compared with the prior art, the computer program product detects the face image to be recognized with a face living body detection model formed of alternately connected convolutional neural network modules and visual self-attention modules to obtain the living body detection result. Because the two kinds of modules are different feature extraction modules, they can extract different image features, yielding the various features required for face living body detection and improving the generalization and robustness of the model to various attacks. Meanwhile, the alternate connection allows their extracted features to be fully fused, further enriching the obtained features. The living body detection results obtained using the computer program product provided by the embodiments of the present disclosure are therefore more accurate.
Fig. 9 shows a schematic block diagram of an example electronic device 900 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the device 900 includes a computing unit 901 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the device 900 can also be stored. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other by a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
Various components in device 900 are connected to I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, or the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, an optical disk, or the like; and a communication unit 909 such as a network card, modem, wireless communication transceiver, or the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunications networks.
The computing unit 901 may be any of a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 901 performs the respective methods and processes described above, for example, the face living body detection method and/or the training method of the face living body detection model. For example, in some embodiments, the face living body detection method and/or the training method of the face living body detection model may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the above-described face living body detection method and/or training method of the face living body detection model may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured to perform the face living body detection method and/or the training method of the face living body detection model in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs, which may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed aspects are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.
Claims (20)
1. A method of face in-vivo detection, comprising:
acquiring a face image to be identified;
inputting the face image to be recognized into a pre-trained face living body detection model to obtain a living body detection result;
the face living body detection model comprises a plurality of convolutional neural network modules and a plurality of visual self-attention modules, wherein the convolutional neural network modules and the visual self-attention modules are alternately connected.
2. The method of claim 1, wherein the face living body detection model comprises a first convolutional neural network module, a first visual self-attention module, a second convolutional neural network module, a second visual self-attention module, a third convolutional neural network module, and a third visual self-attention module, connected in sequence.
3. The method of claim 2, wherein the first convolutional neural network module connects the second convolutional neural network module; the second convolutional neural network module is connected with the third convolutional neural network module.
4. The method of claim 2, wherein the first convolutional neural network module comprises a plurality of convolutional layers; the second convolutional neural network module comprises a plurality of residual network blocks; the third convolutional neural network module comprises a plurality of residual network blocks.
5. The method of claim 2, wherein the third visual self-attention module connects fully connected layers;
inputting the face image to be recognized into a pre-trained face living body detection model to obtain a living body detection result, wherein the method comprises the following steps of:
and inputting the face image to be recognized into the first convolutional neural network module, and determining the living body detection result according to the output of the full-connection layer.
6. The method of claim 1, wherein the face living body detection model is obtained by training a second training model using a plurality of training face images and living body detection labels corresponding to the training face images;
the second training model is obtained by training a first training model using the training face images and the living body detection labels corresponding to the training face images, with the parameters of the convolutional neural network modules frozen;
and the first training model is obtained by training a pre-training model corresponding to the face living body detection model using the training face images and the living body detection labels corresponding to the training face images, with the parameters of the visual self-attention modules frozen.
7. The method of claim 6, wherein the plurality of training face images include planar attack images and stereoscopic attack images;
the second training model is obtained by training the first training model using the stereoscopic attack images and the living body detection labels corresponding to the stereoscopic attack images, with the parameters of the convolutional neural network modules frozen;
and the first training model is obtained by training the pre-training model corresponding to the face living body detection model using the planar attack images and the living body detection labels corresponding to the planar attack images, with the parameters of the visual self-attention modules frozen.
8. The method of claim 1, wherein the acquiring a face image to be identified comprises:
acquiring a person image to be identified, carrying out face detection on the person image to be identified, and acquiring a face area of the person image to be identified;
detecting the key points of the face in the face area to obtain the coordinates of the key points of the face;
carrying out face alignment on the face region according to the face key point coordinates, and obtaining a face image with a preset size through affine transformation;
and carrying out normalization processing on the face image to obtain the face image to be recognized.
9. A training method of a face living body detection model, comprising:
acquiring a plurality of training face images and living body detection labels corresponding to the training face images;
training the face living body detection model according to the training face image and the living body detection label corresponding to the training face image;
The face living body detection model comprises a plurality of convolutional neural network modules and a plurality of visual self-attention modules, wherein the convolutional neural network modules and the visual self-attention modules are alternately connected.
10. The method of claim 9, wherein the face living body detection model comprises a first convolutional neural network module, a first visual self-attention module, a second convolutional neural network module, a second visual self-attention module, a third convolutional neural network module, and a third visual self-attention module, connected in sequence.
11. The method of claim 10, wherein the first convolutional neural network module connects the second convolutional neural network module; the second convolutional neural network module is connected with the third convolutional neural network module.
12. The method of claim 10, wherein the first convolutional neural network module comprises a plurality of convolutional layers; the second convolutional neural network module comprises a plurality of residual network blocks; the third convolutional neural network module comprises a plurality of residual network blocks.
13. The method of claim 10, wherein the third visual self-attention module connects fully connected layers;
The training of the face living body detection model according to the training face image and the living body detection label corresponding to the training face image comprises the following steps:
and inputting the training face image into the first convolutional neural network module, and training the face living body detection model according to the living body detection label corresponding to the training face image and the output of the full-connection layer.
14. The method of claim 9, wherein the training the face living body detection model according to the training face image and the living body detection label corresponding to the training face image comprises:
under the condition that parameters of a vision self-attention module of a pre-training model corresponding to the face living body detection model are frozen, inputting the training face image into the pre-training model corresponding to the face living body detection model, and training the pre-training model corresponding to the face living body detection model according to a result output by a full-connection layer connected with a convolution neural network module of the face living body detection model and a living body detection label corresponding to the training face image, to obtain a first training model;
under the condition of freezing parameters of a convolutional neural network module of the first training model, inputting the training face image into the first training model, and training the first training model according to a living body detection label corresponding to the training face image and a result output by a full connection layer connected with a visual self-attention module of the first training model to obtain a second training model;
and inputting the training face image into the second training model, and training the second training model according to the result output by the full-connection layer connected with the visual self-attention module of the second training model and the living body detection label corresponding to the training face image, to obtain a trained face living body detection model.
15. The method of claim 14, wherein the plurality of training face images includes a planar attack image and a stereo attack image;
under the condition that parameters of a vision self-attention module of a pre-training model corresponding to the face living body detection model are frozen, inputting the training face image into the pre-training model corresponding to the face living body detection model, training the pre-training model corresponding to the face living body detection model according to a result output by a full-connection layer connected with a convolution neural network module of the face living body detection model and a living body detection label corresponding to the training face image, and acquiring a first training model comprises:
under the condition that parameters of a vision self-attention module of a pre-training model corresponding to the face living body detection model are frozen, inputting the planar attack image into the pre-training model corresponding to the face living body detection model, and training the pre-training model corresponding to the face living body detection model according to a result output by a full-connection layer connected with a convolutional neural network module of the face living body detection model and a living body detection label corresponding to the planar attack image, to obtain a first training model;
and under the condition of freezing parameters of a convolutional neural network module of the first training model, inputting the training face image into the first training model, training the first training model according to a result output by a full-connection layer connected with a visual self-attention module of the first training model and a living body detection label corresponding to the training face image, and obtaining a second training model comprises:
under the condition that parameters of a convolutional neural network module of the first training model are frozen, the three-dimensional attack image is input into the first training model, and the first training model is trained according to a result output by a full-connection layer connected with a visual self-attention module of the first training model and a living body detection label corresponding to the three-dimensional attack image, so that a second training model is obtained.
16. A face living body detection apparatus, comprising:
a test data module, configured to acquire a face image to be identified;
an inference module, configured to input the face image to be recognized into a pre-trained face living body detection model to obtain a living body detection result;
the face living body detection model comprises a plurality of convolutional neural network modules and a plurality of visual self-attention modules, wherein the convolutional neural network modules and the visual self-attention modules are alternately connected.
17. A training device for a face living body detection model, comprising:
a training data module, configured to acquire a plurality of training face images and living body detection labels corresponding to the training face images;
a training module, configured to train the face living body detection model according to the training face image and the living body detection label corresponding to the training face image;
the face living body detection model comprises a plurality of convolutional neural network modules and a plurality of visual self-attention modules, wherein the convolutional neural network modules and the visual self-attention modules are alternately connected.
18. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8 or any one of claims 9-15.
19. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-8 or any one of claims 9-15.
20. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-8 or the method according to any one of claims 9-15.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310779602.0A CN116863521A (en) | 2023-06-28 | 2023-06-28 | Face living body detection method, model training method, device, equipment and medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116863521A true CN116863521A (en) | 2023-10-10 |
Family
ID=88233310
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310779602.0A Pending CN116863521A (en) | 2023-06-28 | 2023-06-28 | Face living body detection method, model training method, device, equipment and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116863521A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117746482A (en) * | 2023-12-20 | 2024-03-22 | 北京百度网讯科技有限公司 | Training method, detection method, device and equipment of face detection model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |