CN107463919A - Method for facial expression recognition based on deep 3D convolutional neural networks - Google Patents
Method for facial expression recognition based on deep 3D convolutional neural networks
- Publication number: CN107463919A (application CN201710713962.5A)
- Authority
- CN
- China
- Prior art keywords
- facial
- network
- markers
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/161—Detection; Localisation; Normalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- General Physics & Mathematics (AREA)
- Biomedical Technology (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Biophysics (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Oral & Maxillofacial Surgery (AREA)
- Human Computer Interaction (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
The present invention proposes a method for facial expression recognition based on deep 3D convolutional neural networks. Its main contents include a 3D Inception-ResNet network, facial landmarks, and a long short-term memory network unit. In the process, the convolutional neural network extracts the spatial relationships within face images and the temporal relationships between different frames of a video. Facial landmarks help the network attend to the more important facial components in the feature maps, so the extracted facial landmarks are used as a network input; this improves the ability to recognize subtle changes of facial expression in a sequence, allowing more accurate recognition. The present invention proposes a method that uses a 3D convolutional neural network and a long short-term memory network to extract the temporal relationships between the frames of a video sequence and extracts facial landmarks to emphasize the more expressive facial components, improving the recognition of subtle changes of facial expression and making a further contribution to new designs in the field of lie detection and to innovative solutions in the judicial field.
Description
Technical field
The present invention relates to the field of facial expression recognition, and more particularly to a method for facial expression recognition based on deep 3D convolutional neural networks.
Background art
Facial expression recognition refers to separating a specific emotional state from a given still image or dynamic video sequence, thereby determining the mental state of the identified subject. Enabling computers to understand and recognize human facial expressions fundamentally changes the relationship between humans and computers, leading to better human-computer interaction; it is a prerequisite for computers to understand human emotion and an effective path for exploring and understanding intelligence. Expression recognition therefore has great potential application value in fields such as psychology, intelligent robotics, intelligent surveillance, virtual reality, and digital photography. Specifically, in psychology, a computer analyzes a person's expression information to infer the person's psychological state and ultimately achieve intelligent human-machine interaction; studying changes in human psychological mood by means of facial expression recognition is an important breakthrough of modern science and technology. In intelligent robotics, computers perform facial expression image acquisition, facial expression image preprocessing, expression analysis, and so on, promoting human-machine communication and reaching a higher technological level. In addition, in digital photography, pictures can be captured automatically upon detection of a smiling expression. Although much research has been done on expression recognition, it has not yet been widely adopted in the market because of the complexity and cost of existing methods; moreover, because facial expressions change quickly and some expressions are difficult to capture and identify, improving the expression recognition rate still presents a considerable challenge.
The present invention proposes a method for facial expression recognition based on deep 3D convolutional neural networks. The network structure consists of a 3D Inception-ResNet layer (3DIR) and a long short-term memory (LSTM) network, which extract the spatial relationships within face images and the temporal relationships between different frames of a video. Facial landmarks help the network attend to the more important facial components in the feature maps, so the extracted facial landmarks are used as a network input; this improves the ability to recognize subtle changes of facial expression in a sequence, allowing more accurate recognition. The present invention thus proposes a method that uses a 3D convolutional neural network and a long short-term memory network to extract the temporal relationships between the frames of a video sequence and extracts facial landmarks to emphasize the more expressive facial components, improving the recognition of subtle changes of facial expression and making a further contribution to new designs in intelligent robotics and to innovative solutions in psychology.
Summary of the invention
For expression recognition, a method is proposed that uses a 3D convolutional neural network and a long short-term memory network to extract the temporal relationships between the frames of a video sequence and extracts facial landmarks, improving the ability to recognize subtle changes of facial expression and making a further contribution to new designs in intelligent robotics and to innovative solutions in psychology.
To solve the above problems, the present invention provides a method for facial expression recognition based on deep 3D convolutional neural networks, whose main contents include:
(1) a 3D Inception-ResNet network;
(2) facial landmarks;
(3) a long short-term memory network unit.
The deep 3D convolutional neural network consists of a 3D Inception-ResNet layer (3DIR) and a long short-term memory (LSTM) network. The LSTM follows the 3D Inception-ResNet layers, and together they extract the spatial relationships within face images and the temporal relationships between different frames of a video. Facial landmarks help the network attend to the more important facial components in the feature maps, so the extracted facial landmarks are used as a network input; this improves the ability to recognize subtle changes of facial expression in a sequence, allowing more accurate recognition.
The long short-term memory network (LSTM) provides the memory function and is responsible for recording contextual information over long time spans. It comprises an input gate (i), a forget gate (f), and an output gate (o); at each time step t these three gates are respectively responsible for the rewriting, maintenance, and retrieval of the memory cell c. Let σ(x) = (1 + exp(−x))⁻¹ be the sigmoid function and φ(x) = tanh(x) the hyperbolic tangent, and let x, h, c, W, and b denote the input, output, cell state, parameter matrices, and bias vectors, respectively. Given the inputs x_t, h_{t−1}, and c_{t−1} at time step t, the LSTM update is given by equation (1):

f_t = σ(W_f · [h_{t−1}, x_t] + b_f)
i_t = σ(W_i · [h_{t−1}, x_t] + b_i)
o_t = σ(W_o · [h_{t−1}, x_t] + b_o)
g_t = φ(W_C · [h_{t−1}, x_t] + b_C)
C_t = f_t ∗ C_{t−1} + i_t ∗ g_t
h_t = o_t ∗ φ(C_t)      (1)
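As a concrete illustration of equation (1), the following minimal NumPy sketch performs a single LSTM update. The dictionary layout of the parameter matrices W and bias vectors b is an assumption made for readability and is not specified by the invention:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))      # sigma(x) = (1 + exp(-x))^-1

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM update per equation (1); W and b hold one matrix/vector
    per gate, each acting on the concatenated [h_{t-1}, x_t] vector."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W["f"] @ z + b["f"])   # forget gate: rewriting control
    i_t = sigmoid(W["i"] @ z + b["i"])   # input gate: maintenance control
    o_t = sigmoid(W["o"] @ z + b["o"])   # output gate: retrieval control
    g_t = np.tanh(W["C"] @ z + b["C"])   # candidate memory content
    c_t = f_t * c_prev + i_t * g_t       # update the memory cell C_t
    h_t = o_t * np.tanh(c_t)             # emit the output h_t
    return h_t, c_t
```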
Further, the 3D Inception-ResNet network achieves a higher recognition rate. Its network structure is as follows: the input is a video of size 10 × 299 × 299 × 3, where the frame count is 10, each frame measures 299 × 299, and 3 denotes the color channels, followed by a stem layer. The 3DIR part comprises A, B, and C layers: the 3DIR-A layers reduce the grid size from 38 × 38 to 18 × 18, the 3DIR-B layers reduce the grid size from 18 × 18 to 8 × 8, and the 3DIR-C layers are followed by average pooling; the result is finally output through a fully connected layer.
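For orientation, here is a minimal PyTorch sketch of the backbone's shape flow under the stated input size. The internals of the 3DIR-A/B/C blocks are not disclosed here, so plain strided Conv3d stages stand in for them, and all channel widths and the six-class output are assumptions:

```python
import torch
import torch.nn as nn

class Backbone3DIRSketch(nn.Module):
    """Shape-flow sketch only: 10x299x299x3 input, grid ~38 -> 18 -> 8,
    average pooling, then a fully connected output layer."""
    def __init__(self, num_classes=6):
        super().__init__()
        self.stem = nn.Sequential(                       # stem: 299 -> ~38
            nn.Conv3d(3, 32, 3, stride=(1, 2, 2), padding=1), nn.ReLU(),
            nn.Conv3d(32, 64, 3, stride=(1, 2, 2), padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)))
        # Stand-ins for the 3DIR-A (grid ~38 -> 18) and 3DIR-B (18 -> 8) stages
        self.block_a = nn.Conv3d(64, 128, 3, stride=(1, 2, 2), padding=(1, 0, 0))
        self.block_b = nn.Conv3d(128, 256, 3, stride=(1, 2, 2), padding=(1, 0, 0))
        self.pool = nn.AdaptiveAvgPool3d((10, 1, 1))     # average pooling, 10 frames kept
        self.fc = nn.Linear(256, num_classes)            # fully connected output

    def forward(self, x):                    # x: (N, 3, 10, 299, 299)
        x = self.pool(self.block_b(self.block_a(self.stem(x))))
        return self.fc(x.flatten(2).mean(-1))

print(Backbone3DIRSketch()(torch.randn(1, 3, 10, 299, 299)).shape)  # (1, 6)
```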
Further, facial landmarks are used in the network architecture to distinguish the major facial components from the other parts that contribute less to facial expression. In facial expression recognition, extracting facial landmarks improves the recognition rate. The temporal order of the frames is preserved in the network, and the CNN and LSTM are trained simultaneously in a single network. Building on the original residual network, the facial landmarks are combined with the residual units by replacing the shortcut path: the facial landmarks are multiplied element-wise with the input tensor of the residual unit. To extract the facial landmarks, facial bounding boxes are obtained with the face detector of a cross-platform computer vision library, and 66 facial landmark points are extracted using a face alignment algorithm that regresses local binary features.
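A possible extraction pipeline is sketched below with OpenCV (a cross-platform computer vision library) and its contrib FacemarkLBF model, which regresses local binary features as described; the model file path is a placeholder, and note the stock LBF model returns 68 points rather than the 66 used here:

```python
import cv2

# Haar-cascade face detector shipped with OpenCV.
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

# FacemarkLBF (opencv-contrib) regresses local binary features;
# "lbfmodel.yaml" is an assumed path to a pretrained model file.
facemark = cv2.face.createFacemarkLBF()
facemark.loadModel("lbfmodel.yaml")

def face_landmarks(frame_bgr):
    """Return the landmark array for the first detected face, else None."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    ok, landmarks = facemark.fit(gray, faces)
    return landmarks[0][0] if ok else None   # shape (68, 2), (x, y) points
```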
In the face alignment algorithm, after the facial landmarks of all databases are detected and saved, a facial landmark filter is generated for each sequence in the training stage. Given the facial landmarks of every frame in a sequence, all images in the sequence are resized to match the size of the filters in the network, and weights are assigned to all pixels of every frame in the sequence according to the detected landmark positions: the closer a pixel is to a facial landmark, the larger the weight it receives. Using the Manhattan distance together with a linear weighting function brings the expression recognition rate on the databases to a higher level. The Manhattan distance between a facial landmark and a pixel is the sum of the differences of their corresponding components, and the weighting function that assigns a weight to each feature is a simple linear function of the Manhattan distance, defined as follows:

w(L, P) = 1 − 0.1 d_M(L, P)      (2)

where d_M(L, P) is the Manhattan distance between facial landmark L and pixel P. Each facial landmark position has a peak weight, and the pixels around it receive lower weights in proportion to their distance from the corresponding landmark. To avoid overlap between the weights of two adjacent landmarks, a 7 × 7 window is defined around each facial landmark, and each landmark applies the weighting function over only these 49 pixels. The facial landmarks are added to the network by replacing the shortcut path of the original residual network with the element-wise multiplication of the weighting function w and the input layer x:

y_l = F(x_l) + w ∘ x_l,  x_{l+1} = f(y_l)      (3)

where x_l and x_{l+1} are the input and output of the l-th layer respectively, ∘ is the Hadamard product, F is the residual function, and f is the activation function.
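The weighting of equations (2) and (3) can be made concrete with the short NumPy sketch below; taking the maximum where two landmark windows happen to touch is an assumption used to keep the map well defined:

```python
import numpy as np

def landmark_weight_map(landmarks, height, width):
    """Per-frame weight map per equation (2): each landmark gets a 7x7
    window in which weights fall off linearly with Manhattan distance,
    peaking at 1.0 on the landmark pixel itself (minimum 0.4 at d_M = 6)."""
    w = np.zeros((height, width), dtype=np.float32)
    for lx, ly in np.asarray(landmarks, dtype=int):
        for py in range(max(0, ly - 3), min(height, ly + 4)):
            for px in range(max(0, lx - 3), min(width, lx + 4)):
                d_m = abs(px - lx) + abs(py - ly)        # Manhattan distance
                w[py, px] = max(w[py, px], 1.0 - 0.1 * d_m)
    return w

# Equation (3) then gates the residual unit's shortcut with this map:
#   y = F(x) + w * x  (Hadamard product), followed by x_next = f(y).
```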
After the face is detected, the facial landmark points are extracted by the face alignment algorithm, and the face images are then resized to 299 × 299 pixels. Because larger images and sequences support deeper networks, which can extract more abstract features from the sequences, larger images are chosen as input. All networks use identical settings and are trained on each database separately; the accuracy of the networks is assessed with a subject-independent task and a cross-database task.
In the subject-independent task, each database is divided into a training set and a validation set in a strictly subject-independent manner. On all databases, results are evaluated with 5-fold cross-validation, and the recognition rates are averaged over the 5 folds. For each database and each fold, the proposed network is trained with the settings described above; for comparison, the landmark multiplication unit can be deleted and replaced with a simple shortcut between the input and output of the residual unit. 20% of the subjects are randomly selected as the test group, and the test results for these subjects are reported.
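A strictly subject-independent split can be produced with scikit-learn's GroupKFold, as in this toy sketch (the subject labels are invented for illustration):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

subject_ids = np.repeat(np.arange(10), 2)        # 20 sequences, 10 subjects
X = np.arange(len(subject_ids)).reshape(-1, 1)   # sequence indices

for fold, (tr, va) in enumerate(GroupKFold(n_splits=5).split(X, groups=subject_ids)):
    # No subject may appear in both the training and validation folds.
    assert set(subject_ids[tr]).isdisjoint(subject_ids[va])
    print(f"fold {fold}: validation subjects {sorted(set(subject_ids[va]))}")
```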
In the cross-database task, to test each database, that database is used entirely for testing the network while the remaining databases are used for training. The test results show that the proposed method improves the success rate of expression recognition.
Further, in the long short-term memory network unit, the feature maps obtained from the 3DIR units carry the notion of time across the feature-map sequence. The resulting 3DIR feature maps are vectorized along the sequence dimension to form the sequential input required by the LSTM unit; the vectorized feature maps are fed to the LSTM unit so that the temporal order of the input sequence is preserved as the feature maps are delivered to the LSTM unit. In the training stage, asynchronous stochastic gradient descent is used with a momentum of 0.9, a weight decay of 0.0001, and a learning rate of 0.01, with categorical cross-entropy as the loss function and accuracy as the evaluation metric.
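The stated hyperparameters translate directly into, for example, a PyTorch training step; the placeholder linear model and random batch below are stand-ins for the 3DIR+LSTM network and real data, and the distributed, asynchronous aspect of the gradient descent is not reproduced:

```python
import torch
import torch.nn as nn

model = nn.Linear(256, 6)                     # placeholder for 3DIR+LSTM, 6 classes
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()             # categorical cross-entropy

features = torch.randn(8, 256)                # dummy batch of sequence features
labels = torch.randint(0, 6, (8,))
optimizer.zero_grad()
loss = criterion(model(features), labels)     # loss; accuracy tracked separately
loss.backward()
optimizer.step()
print("training loss:", loss.item())
```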
Brief description of the drawings
Fig. 1 is a system flowchart of the method for facial expression recognition based on deep 3D convolutional neural networks of the present invention.
Fig. 2 is a network framework diagram of the method for facial expression recognition based on deep 3D convolutional neural networks of the present invention.
Fig. 3 is a facial landmark diagram of the method for facial expression recognition based on deep 3D convolutional neural networks of the present invention.
Detailed description of the embodiments
It should be noted that, where no conflict arises, the embodiments of this application and the features in those embodiments may be combined with one another. The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
Fig. 1 is a system flowchart of the method for facial expression recognition based on deep 3D convolutional neural networks of the present invention. The method mainly includes the 3D Inception-ResNet network, facial landmarks, and the long short-term memory network unit.
The deep 3D convolutional neural network consists of a 3D Inception-ResNet layer (3DIR) and a long short-term memory (LSTM) network. The LSTM follows the 3D Inception-ResNet layers, and together they extract the spatial relationships within face images and the temporal relationships between different frames of a video. Facial landmarks help the network attend to the more important facial components in the feature maps, so the extracted facial landmarks are used as a network input; this improves the ability to recognize subtle changes of facial expression in a sequence, allowing more accurate recognition.
The long short-term memory network (LSTM) provides the memory function and is responsible for recording contextual information over long time spans. It comprises an input gate (i), a forget gate (f), and an output gate (o); at each time step t these three gates are respectively responsible for the rewriting, maintenance, and retrieval of the memory cell c. Let σ(x) = (1 + exp(−x))⁻¹ be the sigmoid function and φ(x) = tanh(x) the hyperbolic tangent, and let x, h, c, W, and b denote the input, output, cell state, parameter matrices, and bias vectors, respectively. Given the inputs x_t, h_{t−1}, and c_{t−1} at time step t, the LSTM update is given by equation (1) above.
Fig. 2 is a network framework diagram of the method for facial expression recognition based on deep 3D convolutional neural networks of the present invention. A video sequence is input, and the 3DIR combined with the facial landmarks enhances the facial expression features; the subsequent LSTM network takes the enhanced feature maps produced by the 3DIR layers as input, extracts the temporal information from them, and outputs the result through a fully connected layer with a softmax activation function.
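A minimal sketch of this dataflow after the 3DIR backbone, under the assumptions that each of the 10 frames yields a 256-dimensional feature vector and that six expression classes are used:

```python
import torch
import torch.nn as nn

class ExpressionHead(nn.Module):
    """LSTM over per-frame 3DIR features, then a fully connected layer;
    the softmax is folded into the cross-entropy loss at training time."""
    def __init__(self, feat_dim=256, hidden=128, num_classes=6):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, feats):          # feats: (N, 10, feat_dim), time-ordered
        out, _ = self.lstm(feats)      # temporal information across frames
        return self.fc(out[:, -1])     # logits from the last time step

logits = ExpressionHead()(torch.randn(2, 10, 256))
print(logits.shape)                    # torch.Size([2, 6])
```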
Further, the 3D Inception-ResNet network achieves a higher recognition rate. Its network structure is as follows: the input is a video of size 10 × 299 × 299 × 3, where the frame count is 10, each frame measures 299 × 299, and 3 denotes the color channels, followed by a stem layer. The 3DIR part comprises A, B, and C layers: the 3DIR-A layers reduce the grid size from 38 × 38 to 18 × 18, the 3DIR-B layers reduce the grid size from 18 × 18 to 8 × 8, and the 3DIR-C layers are followed by average pooling; the result is finally output through a fully connected layer.
Fig. 3 is a facial landmark diagram of the method for facial expression recognition based on deep 3D convolutional neural networks of the present invention. Further, facial landmarks are used in the network architecture to distinguish the major facial components from the other parts that contribute less to facial expression. In facial expression recognition, extracting facial landmarks improves the recognition rate. The temporal order of the frames is preserved in the network, and the CNN and LSTM are trained simultaneously in a single network. Building on the original residual network, the facial landmarks are combined with the residual units by replacing the shortcut path: the facial landmarks are multiplied element-wise with the input tensor of the residual unit. To extract the facial landmarks, facial bounding boxes are obtained with the face detector of a cross-platform computer vision library, and 66 facial landmark points are extracted using a face alignment algorithm that regresses local binary features.
In the face alignment algorithm, after the facial landmarks of all databases are detected and saved, a facial landmark filter is generated for each sequence in the training stage. Given the facial landmarks of every frame in a sequence, all images in the sequence are resized to match the size of the filters in the network, and weights are assigned to all pixels of every frame in the sequence according to the detected landmark positions: the closer a pixel is to a facial landmark, the larger the weight it receives. Using the Manhattan distance together with a linear weighting function brings the expression recognition rate on the databases to a higher level. The Manhattan distance between a facial landmark and a pixel is the sum of the differences of their corresponding components, and the weighting function that assigns a weight to each feature is a simple linear function of the Manhattan distance, defined as follows:

w(L, P) = 1 − 0.1 d_M(L, P)      (2)

where d_M(L, P) is the Manhattan distance between facial landmark L and pixel P. Each facial landmark position has a peak weight, and the pixels around it receive lower weights in proportion to their distance from the corresponding landmark. To avoid overlap between the weights of two adjacent landmarks, a 7 × 7 window is defined around each facial landmark, and each landmark applies the weighting function over only these 49 pixels. The facial landmarks are added to the network by replacing the shortcut path of the original residual network with the element-wise multiplication of the weighting function w and the input layer x:

y_l = F(x_l) + w ∘ x_l,  x_{l+1} = f(y_l)      (3)

where x_l and x_{l+1} are the input and output of the l-th layer respectively, ∘ is the Hadamard product, F is the residual function, and f is the activation function.
After the face is detected, the facial landmark points are extracted by the face alignment algorithm, and the face images are then resized to 299 × 299 pixels. Because larger images and sequences support deeper networks, which can extract more abstract features from the sequences, larger images are chosen as input. All networks use identical settings and are trained on each database separately; the accuracy of the networks is assessed with a subject-independent task and a cross-database task.
In the subject-independent task, each database is divided into a training set and a validation set in a strictly subject-independent manner. On all databases, results are evaluated with 5-fold cross-validation, and the recognition rates are averaged over the 5 folds. For each database and each fold, the proposed network is trained with the settings described above; for comparison, the landmark multiplication unit can be deleted and replaced with a simple shortcut between the input and output of the residual unit. 20% of the subjects are randomly selected as the test group, and the test results for these subjects are reported.
In the cross-database task, to test each database, that database is used entirely for testing the network while the remaining databases are used for training. The test results show that the proposed method improves the success rate of expression recognition.
Further, in the long short-term memory network unit, the feature maps obtained from the 3DIR units carry the notion of time across the feature-map sequence. The resulting 3DIR feature maps are vectorized along the sequence dimension to form the sequential input required by the LSTM unit; the vectorized feature maps are fed to the LSTM unit so that the temporal order of the input sequence is preserved as the feature maps are delivered to the LSTM unit. In the training stage, asynchronous stochastic gradient descent is used with a momentum of 0.9, a weight decay of 0.0001, and a learning rate of 0.01, with categorical cross-entropy as the loss function and accuracy as the evaluation metric.
It will be apparent to those skilled in the art that the present invention is not restricted to the details of the above embodiments and can be realized in other specific forms without departing from its spirit and scope. Moreover, those skilled in the art may make various changes and modifications to the present invention without departing from its spirit and scope, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention. The appended claims are therefore intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the invention.
Claims (10)
- 1. A method for facial expression recognition based on deep 3D convolutional neural networks, characterized in that it mainly comprises: a 3D Inception-ResNet network (one); facial landmarks (two); and a long short-term memory network unit (three).
- 2. The deep 3D convolutional neural networks according to claim 1, characterized in that the network consists of a 3D Inception-ResNet layer (3DIR) and a long short-term memory (LSTM) network; the LSTM follows the 3D Inception-ResNet layers, and together they extract the spatial relationships within face images and the temporal relationships between different frames of a video; facial landmarks help the network attend to the more important facial components in the feature maps, so the extracted facial landmarks are used as a network input, improving the ability to recognize subtle changes of facial expression in a sequence and thereby allowing more accurate recognition.
- 3. The long short-term memory network (LSTM) according to claim 2, characterized in that the LSTM provides the memory function and is responsible for recording contextual information over long time spans; it comprises an input gate (i), a forget gate (f), and an output gate (o), which at each time step t are respectively responsible for the rewriting, maintenance, and retrieval of the memory cell c; let σ(x) = (1 + exp(−x))⁻¹ be the sigmoid function and φ(x) = tanh(x) the hyperbolic tangent, and let x, h, c, W, and b denote the input, output, cell state, parameter matrices, and bias vectors, respectively:
f_t = σ(W_f · [h_{t−1}, x_t] + b_f)
i_t = σ(W_i · [h_{t−1}, x_t] + b_i)
o_t = σ(W_o · [h_{t−1}, x_t] + b_o)
g_t = φ(W_C · [h_{t−1}, x_t] + b_C)
C_t = f_t ∗ C_{t−1} + i_t ∗ g_t
h_t = o_t ∗ φ(C_t)      (1)
given the inputs x_t, h_{t−1}, and c_{t−1} at time step t, the LSTM update is given by equation (1).
- 4. The 3D Inception-ResNet network (one) according to claim 1, characterized in that the 3D Inception-ResNet network achieves a higher recognition rate; its network structure is as follows: the input is a video of size 10 × 299 × 299 × 3, where the frame count is 10, each frame measures 299 × 299, and 3 denotes the color channels, followed by a stem layer; the 3DIR part comprises A, B, and C layers, the 3DIR-A layers reducing the grid size from 38 × 38 to 18 × 18 and the 3DIR-B layers reducing it from 18 × 18 to 8 × 8; the 3DIR-C layers are followed by average pooling, and the result is finally output through a fully connected layer.
- 5. The facial landmarks (two) according to claim 2, characterized in that facial landmarks are used in the network architecture to distinguish the major facial components from the other parts that contribute less to facial expression; in facial expression recognition, extracting facial landmarks improves the recognition rate; the temporal order of the frames is preserved in the network, and the CNN and LSTM are trained simultaneously in a single network; building on the original residual network, the facial landmarks are combined with the residual units by replacing the shortcut path, multiplying the facial landmarks element-wise with the input tensor of the residual unit; to extract the facial landmarks, facial bounding boxes are obtained with the face detector of a cross-platform computer vision library, and 66 facial landmark points are extracted using a face alignment algorithm that regresses local binary features.
- 6. The face alignment algorithm according to claim 5, characterized in that after the facial landmarks of all databases are detected and saved, a facial landmark filter is generated for each sequence in the training stage; given the facial landmarks of every frame in a sequence, all images in the sequence are resized to match the size of the filters in the network, and weights are assigned to all pixels of every frame in the sequence according to the detected landmark positions, with pixels closer to a facial landmark receiving larger weights; using the Manhattan distance together with a linear weighting function brings the expression recognition rate on the databases to a higher level; the Manhattan distance between a facial landmark and a pixel is the sum of the differences of their corresponding components, and the weighting function that assigns a weight to each feature is a simple linear function of the Manhattan distance, defined as follows:
w(L, P) = 1 − 0.1 d_M(L, P)      (2)
where d_M(L, P) is the Manhattan distance between facial landmark L and pixel P; each facial landmark position has a peak weight, and the surrounding pixels receive lower weights in proportion to their distance from the corresponding landmark; to avoid overlap between the weights of two adjacent landmarks, a 7 × 7 window is defined around each facial landmark, and each landmark applies the weighting function over only these 49 pixels; the facial landmarks are added to the network by replacing the shortcut path of the original residual network with the element-wise multiplication of the weighting function w and the input layer x:
y_l = F(x_l) + w ∘ x_l,  x_{l+1} = f(y_l)      (3)
where x_l and x_{l+1} are the input and output of the l-th layer respectively, ∘ is the Hadamard product, F is the residual function, and f is the activation function.
- 7. The facial landmark points according to claim 5, characterized in that after the face is detected, the facial landmark points are extracted by the face alignment algorithm and the face images are then resized to 299 × 299 pixels; because larger images and sequences support deeper networks, which can extract more abstract features from the sequences, larger images are chosen as input; all networks use identical settings and are trained on each database separately, and the accuracy of the networks is assessed with a subject-independent task and a cross-database task.
- 8. The subject-independent task according to claim 7, characterized in that each database is divided into a training set and a validation set in a strictly subject-independent manner; on all databases, results are evaluated with 5-fold cross-validation, and the recognition rates are averaged over the 5 folds; for each database and each fold, the proposed network is trained with the settings described above, with the landmark multiplication unit deleted and replaced by a simple shortcut between the input and output of the residual unit for comparison; 20% of the subjects are randomly selected as the test group, and the test results for these subjects are reported.
- 9. The cross-database task according to claim 7, characterized in that in the cross-database task, to test each database, that database is used entirely for testing the network while the remaining databases are used for training; the test results show that the proposed method improves the success rate of expression recognition.
- 10. The long short-term memory network unit (three) according to claim 1, characterized in that the feature maps obtained from the 3DIR units carry the notion of time across the feature-map sequence; the resulting 3DIR feature maps are vectorized along the sequence dimension to form the sequential input required by the LSTM unit; the vectorized feature maps are fed to the LSTM unit so that the temporal order of the input sequence is preserved as the feature maps are delivered to the LSTM unit; in the training stage, asynchronous stochastic gradient descent is used with a momentum of 0.9, a weight decay of 0.0001, and a learning rate of 0.01, with categorical cross-entropy as the loss function and accuracy as the evaluation metric.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201710713962.5A | 2017-08-18 | 2017-08-18 | Method for facial expression recognition based on deep 3D convolutional neural networks
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201710713962.5A | 2017-08-18 | 2017-08-18 | Method for facial expression recognition based on deep 3D convolutional neural networks
Publications (1)
Publication Number | Publication Date
---|---
CN107463919A | 2017-12-12
Family
ID=60550015
Family Applications (1)
Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN201710713962.5A (CN107463919A, withdrawn) | Method for facial expression recognition based on deep 3D convolutional neural networks | 2017-08-18 | 2017-08-18
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107463919A (en) |
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN108062538A * | 2017-12-29 | 2018-05-22 | 成都智宝大数据科技有限公司 | Face recognition method and device
CN108229338A * | 2017-12-14 | 2018-06-29 | 华南理工大学 | Video behavior recognition method based on deep convolutional features
CN108280400A * | 2017-12-27 | 2018-07-13 | 广东工业大学 | Expression recognition method based on a deep residual network
CN108319900A * | 2018-01-16 | 2018-07-24 | 南京信息工程大学 | Basic facial expression classification method
CN108376234A * | 2018-01-11 | 2018-08-07 | 中国科学院自动化研究所 | Emotion recognition system and method for video images
CN108596865A * | 2018-03-13 | 2018-09-28 | 中山大学 | Feature map enhancement system and method for convolutional neural networks
CN108682006A * | 2018-04-25 | 2018-10-19 | 南京农业大学 | Non-contact canned compost maturity judging method
CN108960122A * | 2018-06-28 | 2018-12-07 | 南京信息工程大学 | Expression classification method based on spatio-temporal convolutional features
CN109165573A * | 2018-08-03 | 2019-01-08 | 百度在线网络技术(北京)有限公司 | Method and apparatus for extracting video feature vectors
CN109657716A * | 2018-12-12 | 2019-04-19 | 天津卡达克数据有限公司 | Vehicle appearance damage recognition method based on deep learning
CN109815835A * | 2018-12-29 | 2019-05-28 | 联动优势科技有限公司 | Interactive liveness detection method
CN110046551A * | 2019-03-18 | 2019-07-23 | 中国科学院深圳先进技术研究院 | Method and device for generating a face recognition model
CN110287773A * | 2019-05-14 | 2019-09-27 | 杭州电子科技大学 | Transport hub security inspection image recognition method based on autonomous learning
CN110363129A * | 2019-07-05 | 2019-10-22 | 昆山杜克大学 | Early autism screening system based on the smile paradigm and audio-video behavior analysis
CN110414544A * | 2018-04-28 | 2019-11-05 | 杭州海康威视数字技术股份有限公司 | Target state classification method, apparatus and system
WO2021042372A1 * | 2019-09-06 | 2021-03-11 | 中国医药大学附设医院 | Atrial fibrillation prediction model and prediction system thereof
US11423634B2 | 2018-08-03 | 2022-08-23 | Huawei Cloud Computing Technologies Co., Ltd. | Object detection model training method, apparatus, and device
CN117218422A * | 2023-09-12 | 2023-12-12 | 北京国科恒通科技股份有限公司 | Power grid image recognition method and system based on machine learning
CN117976173A * | 2024-03-28 | 2024-05-03 | 深圳捷工医疗装备股份有限公司 | Signal transmission call management system
2017
- 2017-08-18: CN application CN201710713962.5A filed; published as CN107463919A (status: withdrawn)
Non-Patent Citations (1)
Title
---
BEHZAD HASANI et al.: "Facial Expression Recognition Using Enhanced Deep 3D Convolutional Neural Networks", published online: HTTPS://ARXIV.ORG/ABS/1705.07871V1 *
Cited By (27)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN108229338A * | 2017-12-14 | 2018-06-29 | 华南理工大学 | Video behavior recognition method based on deep convolutional features
CN108280400A * | 2017-12-27 | 2018-07-13 | 广东工业大学 | Expression recognition method based on a deep residual network
CN108062538A * | 2017-12-29 | 2018-05-22 | 成都智宝大数据科技有限公司 | Face recognition method and device
CN108376234A * | 2018-01-11 | 2018-08-07 | 中国科学院自动化研究所 | Emotion recognition system and method for video images
CN108319900A * | 2018-01-16 | 2018-07-24 | 南京信息工程大学 | Basic facial expression classification method
CN108596865B * | 2018-03-13 | 2021-10-26 | 中山大学 | Feature map enhancement system and method for convolutional neural networks
CN108596865A * | 2018-03-13 | 2018-09-28 | 中山大学 | Feature map enhancement system and method for convolutional neural networks
CN108682006A * | 2018-04-25 | 2018-10-19 | 南京农业大学 | Non-contact canned compost maturity judging method
CN108682006B * | 2018-04-25 | 2021-07-20 | 南京农业大学 | Non-contact canned compost maturity judging method
CN110414544A * | 2018-04-28 | 2019-11-05 | 杭州海康威视数字技术股份有限公司 | Target state classification method, apparatus and system
CN108960122A * | 2018-06-28 | 2018-12-07 | 南京信息工程大学 | Expression classification method based on spatio-temporal convolutional features
CN109165573A * | 2018-08-03 | 2019-01-08 | 百度在线网络技术(北京)有限公司 | Method and apparatus for extracting video feature vectors
US11605211B2 | 2018-08-03 | 2023-03-14 | Huawei Cloud Computing Technologies Co., Ltd. | Object detection model training method and apparatus, and device
US11423634B2 | 2018-08-03 | 2022-08-23 | Huawei Cloud Computing Technologies Co., Ltd. | Object detection model training method, apparatus, and device
CN109657716A * | 2018-12-12 | 2019-04-19 | 天津卡达克数据有限公司 | Vehicle appearance damage recognition method based on deep learning
CN109657716B * | 2018-12-12 | 2020-12-29 | 中汽数据(天津)有限公司 | Vehicle appearance damage identification method based on deep learning
CN109815835A * | 2018-12-29 | 2019-05-28 | 联动优势科技有限公司 | Interactive liveness detection method
CN110046551A * | 2019-03-18 | 2019-07-23 | 中国科学院深圳先进技术研究院 | Method and device for generating a face recognition model
CN110287773A * | 2019-05-14 | 2019-09-27 | 杭州电子科技大学 | Transport hub security inspection image recognition method based on autonomous learning
CN110363129B * | 2019-07-05 | 2022-05-27 | 昆山杜克大学 | Early autism screening system based on the smile paradigm and audio-video behavior analysis
CN110363129A * | 2019-07-05 | 2019-10-22 | 昆山杜克大学 | Early autism screening system based on the smile paradigm and audio-video behavior analysis
JP2022523835A | 2019-09-06 | 2022-04-26 | 中國醫藥大學附設醫院 | Atrial fibrillation prediction model and prediction system thereof
WO2021042372A1 * | 2019-09-06 | 2021-03-11 | 中国医药大学附设医院 | Atrial fibrillation prediction model and prediction system thereof
CN117218422A * | 2023-09-12 | 2023-12-12 | 北京国科恒通科技股份有限公司 | Power grid image recognition method and system based on machine learning
CN117218422B * | 2023-09-12 | 2024-04-16 | 北京国科恒通科技股份有限公司 | Power grid image recognition method and system based on machine learning
CN117976173A * | 2024-03-28 | 2024-05-03 | 深圳捷工医疗装备股份有限公司 | Signal transmission call management system
CN117976173B * | 2024-03-28 | 2024-05-28 | 深圳捷工医疗装备股份有限公司 | Signal transmission call management system
Similar Documents
Publication | Publication Date | Title
---|---|---
CN107463919A | | Method for facial expression recognition based on deep 3D convolutional neural networks
Kang et al. | | Deep unsupervised embedding for remotely sensed images based on spatially augmented momentum contrast
CN104217214B | | RGB-D human activity recognition method based on configurable convolutional neural networks
Yu et al. | | Deep learning in remote sensing scene classification: a data augmentation enhanced convolutional neural network framework
Oh et al. | | Approaching the computational color constancy as a classification problem through deep learning
CN109344736B | | Static image crowd counting method based on joint learning
CN111814661B | | Human behavior recognition method based on a residual-recurrent neural network
CN107506740A | | Human behavior recognition method based on a 3D convolutional neural network and a transfer learning model
CN108921822A | | Image object counting method based on convolutional neural networks
CN105469041B | | Facial point detection system based on multi-task regularization and layer-wise supervised neural networks
CN110147743A | | Real-time online pedestrian analysis and counting system and method for complex scenes
CN109697435A | | Pedestrian flow monitoring method, device, storage medium and equipment
CN109376667A | | Object detection method, device and electronic equipment
Baveye et al. | | Deep learning for image memorability prediction: the emotional bias
CN109376747A | | Video flame detection method based on two-stream convolutional neural networks
CN106503687A | | Surveillance video person identification system and method fusing multi-angle facial features
CN107016357A | | Video pedestrian detection method based on temporal convolutional neural networks
CN107403154A | | Gait recognition method based on a dynamic vision sensor
CN105160310A | | Human behavior recognition method based on 3D convolutional neural networks
CN108090403A | | Dynamic face recognition method and system based on 3D convolutional neural networks
CN106529499A | | Gait recognition method based on fused Fourier descriptor and gait energy image features
CN104992223A | | Dense population estimation method based on deep learning
CN109615574A | | Chinese medicine recognition method and system based on GPU and dual-scale image feature comparison
CN106156765A | | Safety detection method based on computer vision
CN104298974A | | Human behavior recognition method based on depth video sequences
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| WW01 | Invention patent application withdrawn after publication | Application publication date: 20171212