
CN111062324A - Face detection method and device, computer equipment and storage medium - Google Patents

Face detection method and device, computer equipment and storage medium

Info

Publication number
CN111062324A
CN111062324A (application CN201911301011.2A)
Authority
CN
China
Prior art keywords
receptive field
face
features
fusion
feature extraction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911301011.2A
Other languages
Chinese (zh)
Inventor
周康明
曹磊磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Eye Control Technology Co Ltd
Original Assignee
Shanghai Eye Control Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Eye Control Technology Co Ltd filed Critical Shanghai Eye Control Technology Co Ltd
Priority to CN201911301011.2A priority Critical patent/CN111062324A/en
Publication of CN111062324A publication Critical patent/CN111062324A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for face detection, a computer device, and a storage medium. The method comprises the following steps: acquiring an image to be detected, extracting the face features of the image to be detected, and performing feature fusion on the extracted face features to obtain face fusion features; extracting the receptive field characteristics of a plurality of scales for the face fusion characteristics according to preset receptive field characteristic extraction parameters to obtain the receptive field characteristics corresponding to each scale; pooling the receptive field characteristics of all scales to generate face output characteristics corresponding to the face fusion characteristics; and performing regression detection on the face output characteristics to obtain a detection result of the face in the image to be detected. By adopting the method, the accuracy of face detection can be improved.

Description

Face detection method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for face detection, a computer device, and a storage medium.
Background
Face detection is the foundation of tasks such as face recognition, facial key point detection, face tracking, and facial expression recognition, and has long drawn attention from both academia and industry. Since deep learning achieved its performance breakthrough, face detection models have likewise advanced by leaps and bounds.
In the traditional approach, a neural network extracts feature maps at different scales to detect faces of different sizes: large-scale (high-resolution) feature maps are used to detect small faces, and small-scale feature maps are used to detect large faces.
However, the shallow large-scale feature maps carry relatively weak semantic information and therefore cannot represent small faces adequately, which lowers the accuracy of face detection.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a face detection method, an apparatus, a computer device and a storage medium, which can improve the accuracy of face detection.
A method of face detection, the method comprising:
acquiring an image to be detected, extracting the face features of the image to be detected, and performing feature fusion on the extracted face features to obtain face fusion features;
extracting the receptive field characteristics of a plurality of scales for the face fusion characteristics according to preset receptive field characteristic extraction parameters to obtain the receptive field characteristics corresponding to each scale;
pooling the receptive field characteristics of all scales to generate face output characteristics corresponding to the face fusion characteristics;
and performing regression detection on the face output characteristics to obtain a detection result of the face in the image to be detected.
In one embodiment, extracting the receptive field features of multiple scales for the face fusion features according to preset receptive field feature extraction parameters to obtain receptive field features corresponding to the scales includes:
respectively inputting the human face fusion characteristics into each convolution layer determined by different receptive field characteristic extraction parameters to extract the receptive field characteristics;
respectively carrying out normalization processing on the feature extraction results output by the convolution layers to obtain normalized feature extraction results;
and carrying out nonlinear mapping on each normalized feature extraction result based on a preset mapping mode to obtain the receptive field features corresponding to each scale.
In one embodiment, extracting the receptive field features of multiple scales for the face fusion features according to preset receptive field feature extraction parameters to obtain receptive field features corresponding to the scales includes:
acquiring a convolution kernel parameter for extracting receptive field characteristics;
obtaining convolution kernel expansion coefficients corresponding to different scale receptive fields;
and determining each convolution layer based on the convolution kernel parameters and the expansion coefficients of each convolution kernel, and extracting the receptive field characteristics of the human face fusion characteristics according to the determined convolution layers to generate the receptive field characteristics of multiple scales.
In one embodiment, the extracting of the receptive field features from the face fusion features according to the determined convolutional layers respectively to generate the receptive field features of multiple scales includes:
and according to the determined convolution layers, extracting the receptive field characteristics of the face fusion characteristics in parallel to generate the receptive field characteristics of multiple scales.
In one embodiment, pooling the receptive field features of each scale to generate face output features corresponding to the face fusion features includes:
and calculating the average value of the receptive field characteristics of a plurality of scales as the human face output characteristics.
An apparatus for face detection, the apparatus comprising:
the fusion feature generation module is used for acquiring an image to be detected, extracting the face feature of the image to be detected, and performing feature fusion on the extracted face feature to obtain a face fusion feature;
the receptive field characteristic extraction module is used for extracting the receptive field characteristics of a plurality of scales for the face fusion characteristics according to preset receptive field characteristic extraction parameters to obtain the receptive field characteristics corresponding to each scale;
the pooling processing module is used for pooling the receptive field characteristics of all scales to generate face output characteristics corresponding to the face fusion characteristics;
and the regression detection module is used for carrying out regression detection on the face output characteristics to obtain a detection result of the face in the image to be detected.
In one embodiment, the receptive field feature extraction module includes:
the first receptive field feature extraction submodule is used for respectively inputting the face fusion features into the convolution layers determined by different receptive field feature extraction parameters to extract the receptive field features;
the normalization processing submodule is used for respectively performing normalization processing on the feature extraction results output by the convolution layers to obtain normalized feature extraction results;
and the nonlinear mapping submodule is used for carrying out nonlinear mapping on each normalized feature extraction result based on a preset mapping mode to obtain the receptive field features corresponding to each scale.
In one embodiment, the receptive field feature extraction module includes:
the convolution kernel parameter acquisition submodule is used for acquiring convolution kernel parameters for extracting receptive field characteristics;
the convolution kernel expansion coefficient acquisition submodule is used for acquiring convolution kernel expansion coefficients corresponding to different scale receptive fields;
and the second receptive field feature extraction submodule determines each convolution layer based on the convolution kernel parameters and each convolution kernel expansion coefficient, and extracts the receptive field features of the face fusion features according to the determined convolution layers to generate the receptive field features of multiple scales.
A computer device comprising a memory storing a computer program and a processor implementing the steps of any of the methods described above when the processor executes the computer program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any of the above.
According to the face detection method, the face detection device, the computer equipment and the storage medium, face feature extraction and feature fusion are carried out on the acquired image to be detected to obtain face fusion features, then, the face fusion features are subjected to multi-scale receptive field feature extraction according to preset receptive field feature extraction parameters to obtain receptive field features corresponding to all scales, then pooling processing is carried out to generate face output features corresponding to the face fusion features, regression detection is carried out, and a detection result of the face in the image to be detected is obtained. Therefore, the face output features are generated based on the receptive field features of multiple scales, so that the face detection result is obtained based on the receptive fields of multiple different scales, and the accuracy of face detection can be improved.
Drawings
FIG. 1 is a diagram of an application scenario of a face detection method in an embodiment;
FIG. 2 is a schematic flow chart of a face detection method according to an embodiment;
FIG. 3 is a diagram illustrating a network architecture of a face detection model in one embodiment;
FIG. 4 is a schematic flow chart of the step of extracting the characteristics of the receptive field in one embodiment;
FIG. 5 is a diagram illustrating an exemplary embodiment of a receptive field pyramid extraction network model;
FIG. 6 is a block diagram of an embodiment of a face detection apparatus;
FIG. 7 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The face detection method provided by the application can be applied to the application environment shown in fig. 1. Wherein the terminal 102 communicates with the server 104 via a network. The terminal 102 may obtain an image to be detected, and send the image to the server 104 through a network, and the server 104 extracts a face feature of the image to be detected after obtaining the image to be detected, and performs feature fusion on the extracted face feature to obtain a face fusion feature. Then, the server 104 extracts the receptive field features of multiple scales for the face fusion features according to the preset receptive field feature extraction parameters, so as to obtain the receptive field features corresponding to each scale. Further, the server 104 performs pooling processing on the receptive field characteristics of each scale to generate face output characteristics and performs regression detection to obtain a detection result of the face in the image to be detected. The terminal 102 may be, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, portable wearable devices, and various video cameras, etc., and the server 104 may be implemented by an independent server or a server cluster formed by a plurality of servers.
In an embodiment, as shown in fig. 2, a face detection method is provided, which is described by taking the application of the method to the server in fig. 1 as an example, and includes the following steps:
step S202, obtaining an image to be detected, extracting the face features of the image to be detected, and performing feature fusion on the extracted face features to obtain face fusion features.
The image to be detected is an image for face recognition, and the image to be detected includes a face, for example, a small face as described in the background art.
Specifically, the server may extract the face features using any of several face detection algorithms, for example S3FD (Single Shot Scale-invariant Face Detector), PyramidBox, SRN (Selective Refinement Network), DSFD (Dual Shot Face Detector), RetinaFace, and the like. These algorithms use deep convolutional neural networks such as VGGNet or ResNet as a backbone network to extract face features at different scales for detecting faces of different sizes. The following description is given with reference to a specific embodiment.
Referring to fig. 3, the server inputs the image to be recognized into the face detection model, and performs extraction of face features of multiple scales to obtain face features C2, face features C3, face features C4, and face features C5 of multiple scales.
Further, the server fuses the face features of adjacent scales layer by layer to obtain face fusion features P2, P3, P4, and P5 of multiple scales: the face fusion feature P5 is obtained directly from the face feature C5; fusing the face feature C4 with P5 yields the face fusion feature P4; fusing C3 with P4 yields P3; and fusing C2 with P3 yields P2.
Optionally, the server may also down-sample the top-most face fusion feature P5 multiple times to obtain the down-sampled face fusion features P6 and P7.
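For concreteness, the following is a minimal sketch of this top-down fusion in PyTorch. The backbone channel widths (256-2048, as in a ResNet), the 1 × 1 lateral convolutions, nearest-neighbor upsampling, and max-pool down-sampling for P6 and P7 are illustrative assumptions; the patent specifies only the fusion order, not these implementation choices.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFusion(nn.Module):
    """Fuses backbone features C2-C5 into pyramid features P2-P7 (a sketch)."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 convolutions project C2-C5 to a common channel width
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels)

    def forward(self, c2, c3, c4, c5):
        p5 = self.lateral[3](c5)  # P5 comes directly from the top-most feature C5
        # each lower level fuses its lateral feature with the upsampled level above
        p4 = self.lateral[2](c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.lateral[1](c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        p2 = self.lateral[0](c2) + F.interpolate(p3, scale_factor=2, mode="nearest")
        p6 = F.max_pool2d(p5, kernel_size=2)  # P6 and P7: repeated down-sampling of P5
        p7 = F.max_pool2d(p6, kernel_size=2)
        return p2, p3, p4, p5, p6, p7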
And S204, extracting the receptive field characteristics of multiple scales for the face fusion characteristics according to preset receptive field characteristic extraction parameters to obtain the receptive field characteristics corresponding to each scale.
The receptive field of a feature in a neural network is the region of the input image that the feature can "see"; that is, the feature's output is affected only by the pixels inside its receptive field. The receptive field feature extraction parameters are the extraction parameters for receptive field features of different scales; receptive fields of different scales use different feature extraction parameters.
In this embodiment, with reference to fig. 3, after obtaining the face fusion features, the server inputs the output face fusion features P2-P7 into the receptive field pyramid model to extract the multi-scale receptive field features of the face fusion features.
And step S206, pooling the receptive field characteristics of all scales to generate face output characteristics corresponding to the face fusion characteristics.
Specifically, after receptive field features of multiple scales have been extracted from the face fusion features, the features need to be integrated and classified. However, if the extracted features were fed directly into a classifier, the server would face a huge amount of computation: for example, convolving a 300 × 300 single-channel input image with 100 convolution kernels of size 3 × 3 yields feature maps of (300 − 3 + 1) × (300 − 3 + 1) = 88,804 elements each, i.e. 8,880,400 values across the 100 kernels.
Therefore, after extracting the receptive field features, the server can input them into a pooling layer, which reduces the dimensionality of the obtained multi-scale receptive field features.
Specifically, common pooling operations include Max Pooling and Average Pooling, as sketched below.
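As an illustration of the computation figures above, a short PyTorch sketch (the tensor sizes are assumptions matching the example in the text):

import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 1, 300, 300)             # 300 x 300 single-channel image
conv = nn.Conv2d(1, 100, kernel_size=3)     # 100 kernels of size 3 x 3
feat = conv(x)                              # (1, 100, 298, 298): 100 x 88,804 values
pooled = F.max_pool2d(feat, kernel_size=2)  # (1, 100, 149, 149): 4x fewer values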
And S208, performing regression detection on the face output characteristics to obtain a detection result of the face in the image to be detected.
Specifically, after obtaining the face output features, the server may perform a classification operation and a regression operation on them to obtain a classification result and a regression result, respectively. The loss value of the classification result can be calculated with a softmax cross-entropy loss function, and the loss value of the regression result with a smooth L1 loss function.
In this embodiment, both the model for face feature extraction and the model for receptive field feature extraction have been trained, verified, and tested in advance. Specifically, the server acquires a training set, a verification set, and a test set of images, trains the constructed model on the training set images, calculates the loss value of the classification result with the softmax cross-entropy loss function and the loss value of the regression result with the smooth L1 loss function, and iteratively updates the parameters of the face detection model based on these two loss values.
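A hedged sketch of the two loss terms named above, assuming PyTorch and illustrative tensor shapes (per-anchor binary classification and 4-dimensional box offsets; the patent does not give these details):

import torch
import torch.nn.functional as F

cls_logits = torch.randn(8, 2)               # face / background scores for 8 anchors
cls_target = torch.randint(0, 2, (8,))       # ground-truth class labels
box_pred = torch.randn(8, 4)                 # predicted box offsets
box_target = torch.randn(8, 4)               # ground-truth box offsets

cls_loss = F.cross_entropy(cls_logits, cls_target)  # softmax cross-entropy loss
reg_loss = F.smooth_l1_loss(box_pred, box_target)   # smooth L1 loss
total_loss = cls_loss + reg_loss                    # joint objective for the update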
Furthermore, the server can verify the training effect of the model through a verification set image while training the constructed model so as to determine the detection precision of the model, and after the training is finished, the trained model is tested through a test set image.
According to the face detection method, face feature extraction and feature fusion are carried out on the acquired image to be detected to obtain face fusion features; then, multi-scale receptive field feature extraction is performed on the face fusion features according to preset receptive field feature extraction parameters to obtain the receptive field features corresponding to each scale; pooling is then applied to generate the face output features corresponding to the face fusion features, and regression detection yields the detection result of the face in the image to be detected. Because the face output features are generated from receptive field features of multiple scales, the face detection result is obtained based on receptive fields of multiple different scales, which improves the accuracy of face detection.
In one embodiment, extracting the receptive field features of multiple scales from the face fusion features according to preset receptive field feature extraction parameters to obtain receptive field features corresponding to the scales may include:
step S402, the human face fusion characteristics are respectively input into the convolution layers determined by different receptive field characteristic extraction parameters for receptive field characteristic extraction.
The convolutional layer is the feature extraction component of the neural network; it consists of several convolution units whose parameters are obtained through iterative training and optimization. As described above, in the present embodiment the network model is trained iteratively on the training set images, verified on the verification set images, and tested on the test set images.
Referring to fig. 5, in this embodiment, after the image to be detected is subjected to face feature extraction and feature fusion, each obtained face fusion feature is input into the receptive field pyramid model shown in fig. 5 to extract the receptive field features.
Specifically, in order to make the output features of the receptive field pyramid be a plurality of features of different scales, the face fusion features are input into the convolutional layers determined by different receptive field feature extraction parameters through different branches, so as to extract the receptive field features of different scales.
Step S404, normalizing the feature extraction results output by each convolution layer, respectively, to obtain normalized feature extraction results.
In the receptive field pyramid network, each convolution layer is followed by a Batch Normalization (BN) layer that normalizes the output of the convolution layer; during training this speeds up convergence and permits a higher network learning rate.
Specifically, the data processing of the BN layer may proceed as follows: compute the data mean, compute the data variance, standardize the data, and then apply a linear transformation with trained parameters to the output values to obtain the new output values.
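A minimal sketch of these four BN steps, assuming PyTorch tensors in (N, C, H, W) layout; gamma and beta are the trained parameters of the linear transformation:

import torch

def batch_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(dim=(0, 2, 3), keepdim=True)                # data mean per channel
    var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)  # data variance per channel
    x_hat = (x - mean) / torch.sqrt(var + eps)                # standardized data
    return gamma * x_hat + beta                               # trained linear transform

x = torch.randn(4, 256, 32, 32)
y = batch_norm(x, torch.ones(1, 256, 1, 1), torch.zeros(1, 256, 1, 1))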
And step S406, based on a preset mapping mode, carrying out nonlinear mapping on each normalized feature extraction result to obtain the receptive field features corresponding to each scale.
In deep learning, each layer of a neural network by itself performs only a linear transformation, and stacking several such layers still yields a linear transformation of the input. Because the expressive power of a linear model is insufficient, a nonlinear factor is introduced into the network: after each BN layer, an excitation (activation) layer is attached.
Specifically, the excitation layer may be implemented with any of several activation functions; commonly used ones include the Sigmoid function, the tanh (hyperbolic tangent) function, the Rectified Linear Unit (ReLU), the Leaky Rectified Linear Unit (Leaky ReLU), the ELU function, the maxout function, and the like. Those skilled in the art will understand that these are merely examples; in practical applications more functions may be used, without limitation.
Specifically, the server introduces a nonlinear factor through the excitation layer's activation function and applies a nonlinear mapping to each normalized feature extraction result, so that the network output contains nonlinear features and can express what a linear model cannot.
In the above embodiment, the face fusion features are respectively input into each convolution layer determined by different reception field feature extraction parameters, so as to perform reception field feature extraction, then normalization processing is performed, and then nonlinear mapping processing is performed, so that the accuracy of the output reception field features can be improved, and the accuracy of face recognition can be improved.
In one embodiment, extracting the receptive field features of multiple scales from the face fusion features according to preset receptive field feature extraction parameters to obtain receptive field features corresponding to the scales may include: acquiring a convolution kernel parameter for extracting receptive field characteristics; obtaining convolution kernel expansion coefficients corresponding to different scale receptive fields; and determining each convolution layer based on the convolution kernel parameters and the expansion coefficients of each convolution kernel, and extracting the receptive field characteristics of the human face fusion characteristics according to the determined convolution layers to generate the receptive field characteristics of multiple scales.
The convolution kernel parameter characterizes the scale of the convolution kernel; for example, in the present embodiment the kernel parameter is 3 × 3, i.e. the convolution kernel is a 3 × 3 matrix.
The convolution kernel expansion (dilation) coefficient expands the kernel to the scale dictated by the coefficient by inserting holes (zeros) between the original kernel entries; a 3 × 3 kernel with dilation coefficient D therefore has an effective size of 3 + 2(D − 1), e.g. 7 × 7 when D = 3.
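The effective-size relation can be checked with a short PyTorch sketch (the input size and channel counts here are arbitrary illustrations):

import torch
import torch.nn as nn

x = torch.randn(1, 1, 32, 32)
for d in (1, 3, 5):
    conv = nn.Conv2d(1, 1, kernel_size=3, dilation=d)  # no padding
    # output side = 32 - (3 + 2*(d-1)) + 1, so the effective kernel grows with d
    print(d, conv(x).shape)  # d=1: 30x30, d=3: 26x26, d=5: 22x22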
Referring to fig. 5, in the present embodiment the receptive field features are extracted from the face fusion feature at three scales, and the face fusion feature is input into the convolutional layers of three different branches. The convolution kernel expansion coefficients of the three branches are set to D = 1, D = 3, and D = 5, respectively, while the three convolution layers share the same convolution kernel parameters.
For example, suppose the server obtains a face fusion feature with 256 channels. The server inputs this 256-channel fusion feature into the three branches of the receptive field pyramid and obtains three receptive field features of identical shape, i.e. three 256-channel receptive field features. The server then passes these three 256-channel receptive field features through a pooling layer, where the pooling operation merges them into a single 256-channel feature, yielding the 256-channel face output feature.
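Putting the pieces together, the following is a hedged sketch of such a receptive field pyramid branch structure in PyTorch: three dilated 3 × 3 convolutions (D = 1, 3, 5) sharing one kernel, each followed by its own BN layer and a ReLU mapping, with the branch outputs averaged into a single 256-channel output. Class and parameter names are illustrative assumptions, not the patent's.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ReceptiveFieldPyramid(nn.Module):
    def __init__(self, channels=256, dilations=(1, 3, 5)):
        super().__init__()
        # one shared 3x3 kernel for all branches keeps the parameter count low
        self.weight = nn.Parameter(torch.empty(channels, channels, 3, 3))
        nn.init.kaiming_normal_(self.weight)
        self.dilations = dilations
        # a separate BN layer per branch, since branch statistics differ
        self.bns = nn.ModuleList(nn.BatchNorm2d(channels) for _ in dilations)

    def forward(self, x):
        branches = []
        for d, bn in zip(self.dilations, self.bns):
            # padding = d keeps the spatial size unchanged for a 3x3 kernel
            y = F.conv2d(x, self.weight, padding=d, dilation=d)
            branches.append(F.relu(bn(y)))   # normalization, then nonlinear mapping
        # averaging across branches yields one 256-channel output feature
        return torch.stack(branches, dim=0).mean(dim=0)

rfp = ReceptiveFieldPyramid()
out = rfp(torch.randn(1, 256, 38, 38))       # out has shape (1, 256, 38, 38)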
In the above embodiment, the convolution kernel parameters and the convolution kernel expansion coefficients corresponding to the receptive fields of different scales are obtained, each convolution layer is determined from them, and the receptive field features are extracted from the face fusion features by the determined convolution layers. Because the convolution layers share a kernel with the same parameters, the parameter volume of the receptive field pyramid network model is reduced, which in turn reduces the computational load on the server.
In one embodiment, the extracting of the receptive field features from the face fusion features according to the determined convolutional layers respectively to generate the receptive field features of multiple scales may include: and according to the determined convolution layers, extracting the receptive field characteristics of the face fusion characteristics in parallel to generate the receptive field characteristics of multiple scales.
Specifically, with continuing reference to fig. 5, when extracting the receptive field features at the three scales, the server may do so in parallel: the three convolution layers with expansion coefficients D = 1, D = 3, and D = 5 process the input face fusion feature simultaneously, so the three receptive field features of different scales are output in parallel.
Further, referring to fig. 3, once feature extraction and feature fusion of the image to be detected have produced face fusion features of multiple scales, performing the multi-scale receptive field feature extraction on each of them in parallel further increases the data processing rate and speeds up face detection.
In the above embodiment, the receptive field features of multiple scales are extracted in parallel, so that the feature extraction process can be performed in parallel, the feature extraction time can be saved, the feature extraction speed is increased, and the face detection speed can be increased.
In one embodiment, pooling the receptive field features of each scale to generate a face output feature corresponding to the face fusion feature may include: and calculating the average value of the receptive field characteristics of a plurality of scales, and taking the average value as the human face output characteristic.
After extracting receptive field features of multiple scales from the face fusion features, the server may calculate the average of these multi-scale receptive field features to obtain the face output features; that is, the face output features are obtained by Average Pooling.
For example, let the receptive field features output at the three scales be $X_1$, $X_2$ and $X_3$, specifically:

$$X_1 \in \mathbb{R}^{H \times W \times C}, \quad X_2 \in \mathbb{R}^{H \times W \times C}, \quad X_3 \in \mathbb{R}^{H \times W \times C}$$

where H, W and C denote the height, width and number of channels of the feature map. The feature average of the multi-scale receptive field features can then be calculated as:

$$X_{\text{avg}} = \frac{1}{3} \sum_{i=1}^{3} X_i$$

thereby obtaining the face output features.
In the above embodiment, the average value of the receptive field features of multiple scales is calculated and output as the face output feature, and the face output feature is generated based on the receptive field features of multiple scales, so that the detection result of the face is obtained based on the receptive fields of multiple different scales, and the accuracy of face detection can be improved.
It should be understood that although the steps in the flowcharts of fig. 2 and fig. 4 are shown in an order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly limited in order and may be performed in other orders. Moreover, at least some of the steps in fig. 2 and fig. 4 may comprise multiple sub-steps or stages that are not necessarily completed at the same moment but may be performed at different times, and these sub-steps or stages are not necessarily executed sequentially but may alternate with other steps or with sub-steps or stages of other steps.
In one embodiment, as shown in fig. 6, there is provided a face detection apparatus including: the fusion feature generation module 100, the receptive field feature extraction module 200, the pooling processing module 300, and the regression detection module 400, wherein:
the fusion feature generation module 100 is configured to acquire an image to be detected, extract a face feature of the image to be detected, and perform feature fusion on the extracted face feature to obtain a face fusion feature.
And the receptive field feature extraction module 200 is configured to extract receptive field features of multiple scales for the face fusion features according to preset receptive field feature extraction parameters, so as to obtain receptive field features corresponding to each scale.
And the pooling processing module 300 is configured to pool the receptive field features of each scale to generate a face output feature corresponding to the face fusion feature.
And the regression detection module 400 is configured to perform regression detection on the face output features to obtain a detection result of the face in the image to be detected.
In one embodiment, the receptive field feature extraction module 200 may include:
and the first receptive field feature extraction submodule is used for respectively inputting the face fusion features into the convolution layers determined by different receptive field feature extraction parameters to extract the receptive field features.
And the normalization processing submodule is used for respectively performing normalization processing on the feature extraction results output by the convolution layers to obtain normalized feature extraction results.
And the nonlinear mapping submodule is used for carrying out nonlinear mapping on each normalized feature extraction result based on a preset mapping mode to obtain the receptive field features corresponding to each scale.
In one embodiment, the receptive field feature extraction module 200 may include:
and the convolution kernel parameter acquisition submodule is used for acquiring convolution kernel parameters for extracting the receptive field characteristics.
And the convolution kernel expansion coefficient acquisition submodule is used for acquiring convolution kernel expansion coefficients corresponding to different scale receptive fields.
And the second receptive field feature extraction submodule determines each convolution layer based on the convolution kernel parameters and each convolution kernel expansion coefficient, and extracts the receptive field features of the face fusion features according to the determined convolution layers to generate the receptive field features of multiple scales.
In one embodiment, the second receptive field feature extraction submodule is configured to perform, according to the determined convolution layers, parallel receptive field feature extraction on the face fusion features to generate receptive field features of multiple scales.
In one embodiment, the pooling processing module 300 is configured to calculate an average of the receptive field features of multiple scales as the face output features.
For specific limitations of the face detection apparatus, reference may be made to the above limitations of the face detection method, and details are not described here. All or part of the modules in the face detection device can be realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 7. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing face detection data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a face detection method.
Those skilled in the art will appreciate that the architecture shown in fig. 7 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, there is provided a computer device comprising a memory storing a computer program and a processor implementing the following steps when the processor executes the computer program: acquiring an image to be detected, extracting the face features of the image to be detected, and performing feature fusion on the extracted face features to obtain face fusion features; extracting the receptive field characteristics of a plurality of scales for the face fusion characteristics according to preset receptive field characteristic extraction parameters to obtain the receptive field characteristics corresponding to each scale; pooling the receptive field characteristics of all scales to generate face output characteristics corresponding to the face fusion characteristics; and performing regression detection on the face output characteristics to obtain a detection result of the face in the image to be detected.
In one embodiment, when the processor executes the computer program, the method performs multiple scales of receptive field feature extraction on the face fusion feature according to preset receptive field feature extraction parameters to obtain receptive field features corresponding to each scale, and may include: respectively inputting the human face fusion characteristics into each convolution layer determined by different receptive field characteristic extraction parameters to extract the receptive field characteristics; respectively carrying out normalization processing on the feature extraction results output by the convolution layers to obtain normalized feature extraction results; and carrying out nonlinear mapping on each normalized feature extraction result based on a preset mapping mode to obtain the receptive field features corresponding to each scale.
In one embodiment, when the processor executes the computer program, the method performs multiple scales of receptive field feature extraction on the face fusion feature according to preset receptive field feature extraction parameters to obtain receptive field features corresponding to each scale, and may include: acquiring a convolution kernel parameter for extracting receptive field characteristics; obtaining convolution kernel expansion coefficients corresponding to different scale receptive fields; and determining each convolution layer based on the convolution kernel parameters and the expansion coefficients of each convolution kernel, and extracting the receptive field characteristics of the human face fusion characteristics according to the determined convolution layers to generate the receptive field characteristics of multiple scales.
In one embodiment, when the processor executes the computer program, the method for extracting the receptive field features of the face fusion features according to the determined convolutional layers respectively to generate receptive field features of multiple scales may include: and according to the determined convolution layers, extracting the receptive field characteristics of the face fusion characteristics in parallel to generate the receptive field characteristics of multiple scales.
In one embodiment, when the processor executes the computer program, the implementing pooling processing on the receptive field features of each scale to generate the face output features corresponding to the face fusion features may include: and calculating the average value of the receptive field characteristics of a plurality of scales as the human face output characteristics.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of: acquiring an image to be detected, extracting the face features of the image to be detected, and performing feature fusion on the extracted face features to obtain face fusion features; extracting the receptive field characteristics of a plurality of scales for the face fusion characteristics according to preset receptive field characteristic extraction parameters to obtain the receptive field characteristics corresponding to each scale; pooling the receptive field characteristics of all scales to generate face output characteristics corresponding to the face fusion characteristics; and performing regression detection on the face output characteristics to obtain a detection result of the face in the image to be detected.
In one embodiment, the implementing, by the processor, the method of extracting the receptor field features of multiple scales for the face fusion feature according to the preset receptor field feature extraction parameters to obtain the receptor field features corresponding to the scales may include: respectively inputting the human face fusion characteristics into each convolution layer determined by different receptive field characteristic extraction parameters to extract the receptive field characteristics; respectively carrying out normalization processing on the feature extraction results output by the convolution layers to obtain normalized feature extraction results; and carrying out nonlinear mapping on each normalized feature extraction result based on a preset mapping mode to obtain the receptive field features corresponding to each scale.
In one embodiment, the implementing, by the processor, the method of extracting the receptor field features of multiple scales for the face fusion feature according to the preset receptor field feature extraction parameters to obtain the receptor field features corresponding to the scales may include: acquiring a convolution kernel parameter for extracting receptive field characteristics; obtaining convolution kernel expansion coefficients corresponding to different scale receptive fields; and determining each convolution layer based on the convolution kernel parameters and the expansion coefficients of each convolution kernel, and extracting the receptive field characteristics of the human face fusion characteristics according to the determined convolution layers to generate the receptive field characteristics of multiple scales.
In one embodiment, the computer program, when executed by the processor, implements the extraction of the receptive field features from the face fusion features according to the determined convolutional layers, and generates the receptive field features of multiple scales, which may include: and according to the determined convolution layers, extracting the receptive field characteristics of the face fusion characteristics in parallel to generate the receptive field characteristics of multiple scales.
In one embodiment, the implementing, by the processor, pooling of the receptive field features of each scale to generate the face output features corresponding to the face fusion features may include: and calculating the average value of the receptive field characteristics of a plurality of scales as the human face output characteristics.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), and direct Rambus dynamic RAM (DRDRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A face detection method, comprising:
acquiring an image to be detected, extracting the face features of the image to be detected, and performing feature fusion on the extracted face features to obtain face fusion features;
according to preset receptive field feature extraction parameters, conducting receptive field feature extraction of multiple scales on the face fusion features to obtain receptive field features corresponding to all scales;
pooling the receptive field characteristics of all scales to generate face output characteristics corresponding to the face fusion characteristics;
and performing regression detection on the face output characteristics to obtain a detection result of the face in the image to be detected.
2. The method according to claim 1, wherein the extracting of the receptive field features of multiple scales from the face fusion features according to preset receptive field feature extraction parameters to obtain the receptive field features corresponding to each scale comprises:
respectively inputting the human face fusion characteristics into each convolution layer determined by different receptive field characteristic extraction parameters to extract receptive field characteristics;
respectively carrying out normalization processing on the feature extraction results output by the convolution layers to obtain normalized feature extraction results;
and carrying out nonlinear mapping on each normalized feature extraction result based on a preset mapping mode to obtain the receptive field features corresponding to each scale.
3. The method according to claim 1, wherein the extracting of the receptive field features of multiple scales from the face fusion features according to preset receptive field feature extraction parameters to obtain the receptive field features corresponding to each scale comprises:
acquiring a convolution kernel parameter for extracting receptive field characteristics;
obtaining convolution kernel expansion coefficients corresponding to different scale receptive fields;
and determining each convolution layer based on the convolution kernel parameters and each convolution kernel expansion coefficient, and extracting the receptive field characteristics of the human face fusion characteristics according to the determined convolution layers to generate the receptive field characteristics of multiple scales.
4. The method according to claim 3, wherein the extracting of the receptive field features from the face fusion features according to the determined convolutional layers respectively to generate receptive field features of multiple scales comprises:
and according to the determined convolution layers, extracting the receptive field characteristics of the face fusion characteristics in parallel to generate the receptive field characteristics of multiple scales.
5. The method according to claim 1, wherein the pooling of the receptive field features of the respective scales to generate the face output features corresponding to the face fusion features comprises:
and calculating the average value of the receptive field characteristics of a plurality of scales as the human face output characteristics.
6. A face detection apparatus, comprising:
the fusion feature generation module is used for acquiring an image to be detected, extracting the face feature of the image to be detected, and performing feature fusion on the extracted face feature to obtain a face fusion feature;
the receptive field feature extraction module is used for extracting the receptive field features of multiple scales from the face fusion features according to preset receptive field feature extraction parameters to obtain the receptive field features corresponding to all scales;
the pooling processing module is used for pooling the receptive field characteristics of all scales to generate face output characteristics corresponding to the face fusion characteristics;
and the regression detection module is used for carrying out regression detection on the face output characteristics to obtain a detection result of the face in the image to be detected.
7. The apparatus of claim 6, wherein the receptive field feature extraction module comprises:
the first receptive field feature extraction submodule is used for respectively inputting the face fusion features into the convolution layers determined by different receptive field feature extraction parameters to extract the receptive field features;
the normalization processing submodule is used for respectively performing normalization processing on the feature extraction results output by the convolution layers to obtain normalized feature extraction results;
and the nonlinear mapping submodule is used for carrying out nonlinear mapping on each normalized feature extraction result based on a preset mapping mode to obtain the receptive field features corresponding to each scale.
8. The apparatus of claim 6, wherein the receptive field feature extraction module comprises:
the convolution kernel parameter acquisition submodule is used for acquiring convolution kernel parameters for extracting receptive field characteristics;
the convolution kernel expansion coefficient acquisition submodule is used for acquiring convolution kernel expansion coefficients corresponding to different scale receptive fields;
and the second receptive field feature extraction submodule determines each convolution layer based on the convolution kernel parameters and the convolution kernel expansion coefficients, extracts the receptive field features of the face fusion features according to the determined convolution layers and generates the receptive field features of multiple scales.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 5 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 5.
CN201911301011.2A 2019-12-17 2019-12-17 Face detection method and device, computer equipment and storage medium Pending CN111062324A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911301011.2A CN111062324A (en) 2019-12-17 2019-12-17 Face detection method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911301011.2A CN111062324A (en) 2019-12-17 2019-12-17 Face detection method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111062324A 2020-04-24

Family

ID=70301851

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911301011.2A Pending CN111062324A (en) 2019-12-17 2019-12-17 Face detection method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111062324A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111680595A (en) * 2020-05-29 2020-09-18 新疆爱华盈通信息技术有限公司 Face recognition method and device and electronic equipment
CN111695430A (en) * 2020-05-18 2020-09-22 电子科技大学 Multi-scale face detection method based on feature fusion and visual receptive field network
CN112085088A (en) * 2020-09-03 2020-12-15 腾讯科技(深圳)有限公司 Image processing method, device, equipment and storage medium
CN112949507A (en) * 2021-03-08 2021-06-11 平安科技(深圳)有限公司 Face detection method and device, computer equipment and storage medium
CN113537021A (en) * 2021-07-08 2021-10-22 壹茹(上海)传媒科技有限公司 3D face model expression automatic generation method based on common video stream
CN113743197A (en) * 2021-07-23 2021-12-03 北京眼神智能科技有限公司 Rapid face detection method and device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016110005A1 (en) * 2015-01-07 2016-07-14 深圳市唯特视科技有限公司 Gray level and depth information based multi-layer fusion multi-modal face recognition device and method
CN109145920A (en) * 2018-08-21 2019-01-04 电子科技大学 A kind of image, semantic dividing method based on deep neural network
CN109598269A (en) * 2018-11-14 2019-04-09 天津大学 A kind of semantic segmentation method based on multiresolution input with pyramid expansion convolution
CN109670452A (en) * 2018-12-20 2019-04-23 北京旷视科技有限公司 Method for detecting human face, device, electronic equipment and Face datection model
CN109815789A (en) * 2018-12-11 2019-05-28 国家计算机网络与信息安全管理中心 Real-time multi-scale face detection method and system and related equipment on CPU
WO2019128646A1 (en) * 2017-12-28 2019-07-04 深圳励飞科技有限公司 Face detection method, method and device for training parameters of convolutional neural network, and medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016110005A1 (en) * 2015-01-07 2016-07-14 深圳市唯特视科技有限公司 Gray level and depth information based multi-layer fusion multi-modal face recognition device and method
WO2019128646A1 (en) * 2017-12-28 2019-07-04 深圳励飞科技有限公司 Face detection method, method and device for training parameters of convolutional neural network, and medium
CN109145920A (en) * 2018-08-21 2019-01-04 电子科技大学 A kind of image, semantic dividing method based on deep neural network
CN109598269A (en) * 2018-11-14 2019-04-09 天津大学 A kind of semantic segmentation method based on multiresolution input with pyramid expansion convolution
CN109815789A (en) * 2018-12-11 2019-05-28 国家计算机网络与信息安全管理中心 Real-time multi-scale face detection method and system and related equipment on CPU
CN109670452A (en) * 2018-12-20 2019-04-23 北京旷视科技有限公司 Method for detecting human face, device, electronic equipment and Face datection model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
叶锋; 赵兴文; 宫恩来; 杭丽君: "Real-time scene small-face detection method based on deep learning" (in Chinese)
朱鹏; 陈虎; 李科; 程宾洋: "A lightweight multi-scale feature face detection method" (in Chinese)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111695430A (en) * 2020-05-18 2020-09-22 电子科技大学 Multi-scale face detection method based on feature fusion and visual receptive field network
CN111695430B (en) * 2020-05-18 2023-06-30 电子科技大学 Multi-scale face detection method based on feature fusion and visual receptive field network
CN111680595A (en) * 2020-05-29 2020-09-18 新疆爱华盈通信息技术有限公司 Face recognition method and device and electronic equipment
CN112085088A (en) * 2020-09-03 2020-12-15 腾讯科技(深圳)有限公司 Image processing method, device, equipment and storage medium
CN112949507A (en) * 2021-03-08 2021-06-11 平安科技(深圳)有限公司 Face detection method and device, computer equipment and storage medium
CN112949507B (en) * 2021-03-08 2024-05-10 平安科技(深圳)有限公司 Face detection method, device, computer equipment and storage medium
CN113537021A (en) * 2021-07-08 2021-10-22 壹茹(上海)传媒科技有限公司 3D face model expression automatic generation method based on common video stream
CN113743197A (en) * 2021-07-23 2021-12-03 北京眼神智能科技有限公司 Rapid face detection method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
US11403876B2 (en) Image processing method and apparatus, facial recognition method and apparatus, and computer device
CN111062324A (en) Face detection method and device, computer equipment and storage medium
CN108304847B (en) Image classification method and device and personalized recommendation method and device
CN110796162B (en) Image recognition method, image recognition training method, image recognition device, image recognition training method, image recognition training device and storage medium
CN110348562B (en) Neural network quantization strategy determination method, image identification method and device
CN111950329A (en) Target detection and model training method and device, computer equipment and storage medium
CN109063742B (en) Butterfly identification network construction method and device, computer equipment and storage medium
CN110135406B (en) Image recognition method and device, computer equipment and storage medium
CN109829506B (en) Image processing method, image processing device, electronic equipment and computer storage medium
CN110378372A (en) Diagram data recognition methods, device, computer equipment and storage medium
CN110163344A (en) Neural network training method, device, equipment and storage medium
CN106780662B (en) Face image generation method, device and equipment
CN110427970A (en) Image classification method, device, computer equipment and storage medium
CN112183295A (en) Pedestrian re-identification method and device, computer equipment and storage medium
CN111832581B (en) Lung feature recognition method and device, computer equipment and storage medium
CN113673530A (en) Remote sensing image semantic segmentation method and device, computer equipment and storage medium
CN112419153A (en) Image super-resolution reconstruction method and device, computer equipment and storage medium
CN111144398A (en) Target detection method, target detection device, computer equipment and storage medium
CN111583184A (en) Image analysis method, network, computer device, and storage medium
CN111274999A (en) Data processing method, image processing method, device and electronic equipment
CN113496150A (en) Dense target detection method and device, storage medium and computer equipment
WO2021218037A1 (en) Target detection method and apparatus, computer device and storage medium
CN112686320A (en) Image classification method and device, computer equipment and storage medium
CN110175975B (en) Object detection method, device, computer readable storage medium and computer equipment
CN112115860B (en) Face key point positioning method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20200424)