
CN102013103A - Method for dynamically tracking lip in real time - Google Patents

Method for dynamically tracking lip in real time

Info

Publication number
CN102013103A
CN102013103A · CN 201010571128 · CN201010571128A
Authority
CN
China
Prior art keywords
lip
pixel points
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 201010571128
Other languages
Chinese (zh)
Other versions
CN102013103B (en)
Inventor
Wang Shilin (王士林)
Li Jianhua (李建华)
Liu Gongshen (刘功申)
Li Xiang (李翔)
Li Shenghong (李生红)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University
Priority to CN 201010571128 (granted as CN102013103B)
Publication of CN102013103A
Application granted
Publication of CN102013103B
Legal status: Expired - Fee Related
Anticipated expiration

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides a method in the technical field of image processing and pattern recognition, in particular a method for dynamically tracking lips in real time. The method comprises the following steps: step one, capturing an image sequence containing the lip region with a digital video camera (DV); step two, classifying every pixel in each image as a lip pixel or a non-lip pixel using a continuous-image lip segmentation method based on fuzzy clustering and Kalman prediction, and outputting the probability that each pixel belongs to the lip class; and step three, obtaining the lip contour in each frame of the lip image sequence from the lip probability map produced in step two, using a 14-point dynamic shape model and Kalman prediction. The method automatically tracks lip movement in an image sequence and offers both high processing speed (ensuring real-time operation) and high recognition accuracy.

Description

Real-time dynamic lip tracking method
Technical Field
The invention relates to a method in the technical field of image processing and pattern recognition, in particular to a real-time dynamic lip tracking method.
Background
In recent years, automatic speech recognition (ASR) technology has advanced greatly and has produced a series of mature products that achieve good recognition results in environments with a high signal-to-noise ratio. However, the performance of these systems is often limited by the level of background noise, and their results are frequently unsatisfactory in heavily noisy environments such as cars, factories and airports. Accordingly, more and more researchers have looked for ways to improve speech recognition using sources other than audio. The McGurk effect reveals an inseparable intrinsic relationship between the audio and visual information produced while a person speaks. It is therefore natural to aid the understanding of speech by introducing visual information about lip movements; this type of speech recognition system is called an automatic lip reading system. In such a system, one of the first and most critical steps is to acquire the lip motion from the video accurately and rapidly, i.e. a real-time lip tracking method. Its accuracy and reliability often directly determine the performance of the lip reading system.
A search of the prior art found "Lip region detection and tracking" (lip detection and tracking), published at the 11th International Conference on Image Analysis and Processing, pages 8-13, which adopts the strength of luminance edges as the criterion for detecting the lip contour and converges the lip boundary to the strongest edge by an iterative method, while constraints from a reasonable lip model ensure that the final contour remains plausible. The technique has the following defects: first, it is a lip tracking technique for grayscale (luminance) images and, lacking chrominance information, is strongly affected by illumination conditions; second, it relies on the luminance edges of the lip image, yet edge information depends on image contrast, and unpainted lips tend to have low contrast, which makes the edge information unstable. For these two reasons, the accuracy and robustness of the technique leave room for improvement.
Disclosure of Invention
The invention aims to overcome the above defects of the prior art and provides a real-time dynamic lip tracking method that acquires and tracks the lip movement of a speaker, achieving high matching accuracy while maintaining real-time processing speed.
The invention is realized by the following technical scheme:
the invention comprises the following steps:
Step one: capture an image sequence containing the lip region with a digital camera. The color space produced by common digital cameras is RGB, which is not a perceptually uniform color space matched to human color-difference vision, so it needs to be converted to the CIE-LAB uniform color space as follows:
\[
\begin{bmatrix} X \\ Y \\ Z \end{bmatrix} =
\begin{bmatrix} 0.490 & 0.310 & 0.200 \\ 0.177 & 0.813 & 0.011 \\ 0.000 & 0.010 & 0.990 \end{bmatrix}
\begin{bmatrix} R \\ G \\ B \end{bmatrix}
\]
\[
L^* = \begin{cases} 116\,(Y')^{1/3} - 16 & \text{if } Y' > 0.008856 \\ 903.3\,Y' & \text{otherwise} \end{cases}
\]
\[
a^* = 500\,(K_1^{1/3} - K_2^{1/3})
\]
\[
b^* = 200\,(K_2^{1/3} - K_3^{1/3})
\]
wherein
\[
K_i = \begin{cases} \Phi_i & \text{if } \Phi_i > 0.008856 \\ 7.787\,\Phi_i + 16/116 & \text{otherwise} \end{cases}
\]
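For reference, the conversion can be sketched in Python as below; the text does not define Y' or Φ_i explicitly, so the sketch assumes they are the X, Y, Z tristimulus values normalized by a reference white point (with Φ_1, Φ_2, Φ_3 corresponding to normalized X, Y, Z), which is the usual CIE-LAB convention.

```python
import numpy as np

# Sketch of the RGB -> CIE-LAB conversion described above (single pixel).
# Assumption: Y' and Phi_1..Phi_3 are the X, Y, Z tristimulus values
# normalized by a reference white point; the text itself does not spell this out.

M = np.array([[0.490, 0.310, 0.200],
              [0.177, 0.813, 0.011],
              [0.000, 0.010, 0.990]])

WHITE = M @ np.ones(3)  # assumed reference white: the XYZ of R = G = B = 1

def rgb_to_lab(rgb):
    """Convert one RGB pixel with components in [0, 1] to (L*, a*, b*)."""
    xyz = M @ np.asarray(rgb, dtype=float)
    phi = xyz / WHITE                    # assumed normalization Phi_i = XYZ_i / white_i
    y_prime = phi[1]

    # L* from Y'
    if y_prime > 0.008856:
        L = 116.0 * y_prime ** (1.0 / 3.0) - 16.0
    else:
        L = 903.3 * y_prime

    # K_i as defined in the text
    def k(p):
        return p if p > 0.008856 else 7.787 * p + 16.0 / 116.0

    a = 500.0 * (k(phi[0]) ** (1.0 / 3.0) - k(phi[1]) ** (1.0 / 3.0))
    b = 200.0 * (k(phi[1]) ** (1.0 / 3.0) - k(phi[2]) ** (1.0 / 3.0))
    return L, a, b
```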
Step two: using a continuous-image lip segmentation method based on fuzzy clustering and Kalman prediction, classify every pixel in the image as a lip pixel or a non-lip pixel, and output the probability that each pixel belongs to the lip class. The specific method is as follows:
For an N × M image I, let X = {x_{1,1}, …, x_{r,s}, …, x_{N,M}} denote the set of color information of all pixels in the image, where x_{r,s} ∈ R^q is the color feature of the pixel at coordinates (r, s). In addition, let d_{i,r,s} be the Euclidean distance between the color feature x_{r,s} and the i-th color center v_i (i = 0 denotes the lip class, i = 1 the non-lip class). The objective function of the fuzzy-clustering-based lip segmentation algorithm is then:
\[
J = \sum_{r=1}^{N}\sum_{s=1}^{M}\sum_{i=0}^{1} u_{i,r,s}^{m}\left(d_{i,r,s}^{2} + gs(i,r,s,p)\right)
\]
subject to
\[
\sum_{i=0}^{1} u_{i,r,s} = 1, \qquad \forall (r,s)\in I.
\]
Here U denotes the fuzzy membership matrix (i.e. the probability that a pixel belongs to a given class), and gs is a positional penalty function: it strengthens the lip membership of pixels inside the lip region and weakens the lip membership of pixels outside it.
Throughout the lip segmentation process, the optimal membership matrix that minimizes the objective function is obtained iteratively by gradient descent. The Kalman prediction of the color centers and of the lip spatial position serves to predict the lip/non-lip color centers and the lip spatial position of the current frame from those of the previous several frames. The final output is the probability that each pixel in the image belongs to the lip class, namely u_{0,r,s}.
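A minimal Python sketch of one way to carry out this segmentation is shown below. The patent only states that the minimum of the objective is found iteratively; the fuzzy-c-means-style alternating update used here, and the array layout and initialization, are assumptions made purely for illustration.

```python
import numpy as np

def fuzzy_lip_segmentation(features, gs_penalty, m=2.0, n_iter=20):
    """Illustrative segmentation of an N x M image into lip / non-lip classes.

    features   : (N, M, q) array of per-pixel color features (e.g. L*a*b*).
    gs_penalty : (2, N, M) array holding the positional penalty gs(i, r, s, p).
    Returns u  : (2, N, M) membership maps; u[0] is the lip-class probability.

    The patent only says the objective is minimized iteratively; the
    fuzzy-c-means-style alternating update below (closed-form memberships,
    membership-weighted color centers) is an assumed concrete realization.
    """
    N, M, q = features.shape
    X = features.reshape(-1, q)
    rng = np.random.default_rng(0)
    centers = X[rng.choice(len(X), size=2, replace=False)]  # lip / non-lip color centers

    for _ in range(n_iter):
        # squared color distances to each center plus the positional penalty
        d2 = ((X[None, :, :] - centers[:, None, :]) ** 2).sum(-1)     # (2, N*M)
        cost = np.maximum(d2 + gs_penalty.reshape(2, -1), 0.0) + 1e-12
        # closed-form membership update under the sum-to-one constraint
        inv = cost ** (-1.0 / (m - 1.0))
        u = inv / inv.sum(axis=0, keepdims=True)
        # update color centers as membership-weighted means
        w = u ** m
        centers = (w @ X) / w.sum(axis=1, keepdims=True)

    return u.reshape(2, N, M)
```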
The Kalman prediction is based on the state-space model:
\[
x_k = A x_{k-1} + w_{k-1}
\]
\[
z_k = H x_k + v_k
\]
where x_k is the current state and w_{k-1} is the noise of the state transition; A is the state transition matrix; z_k is the measurement at the current time (i.e. the color centers and the lip spatial position parameters); v_k is the measurement error, and H is the measurement matrix. The state transition noise and the measurement noise are generally assumed to follow normal distributions: p(w) ~ N(0, Q), p(v) ~ N(0, R). Kalman filter prediction is computed as an iterative, recursive process, as follows:
1) initialize the initial state and the initial estimation error covariance;
2) predict the current state from the state of the previous step, and obtain a predicted measurement from the predicted state through the measurement matrix H; after Kalman-filter correction this measurement is the required result;
3) correct the system model according to the currently observed measurement, feeding the final measurement output of the current frame into the correction step;
4) repeat steps 2) and 3) until the last frame of the lip sequence (a minimal sketch of this recursion is given after this list).
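A minimal sketch of this predict/correct recursion follows. The layout of the state vector and the concrete matrices A, H, Q and R are not specified in the text and are left as choices for the caller.

```python
import numpy as np

class KalmanPredictor:
    """Linear Kalman filter for x_k = A x_{k-1} + w, z_k = H x_k + v,
    with p(w) ~ N(0, Q) and p(v) ~ N(0, R).

    In this method the state would carry the lip/non-lip color centers and
    the lip spatial position parameters; the dimensions and the concrete
    A, H, Q, R are not fixed by the text and are supplied by the caller.
    """

    def __init__(self, A, H, Q, R, x0, P0):
        self.A, self.H, self.Q, self.R = A, H, Q, R
        self.x, self.P = x0, P0          # step 1): initial state and error covariance

    def predict(self):
        # step 2): propagate the state and form the predicted measurement
        self.x = self.A @ self.x
        self.P = self.A @ self.P @ self.A.T + self.Q
        return self.H @ self.x

    def correct(self, z):
        # step 3): fold the measurement observed in the current frame back in
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (z - self.H @ self.x)
        self.P = (np.eye(len(self.x)) - K @ self.H) @ self.P
        return self.x

# Step 4) then amounts to calling predict() and correct() once per frame
# until the last frame of the lip sequence.
```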
Step three: using a 14-point dynamic shape model and Kalman prediction, obtain the lip contour in every frame of the lip image sequence from the lip probability map produced in step two. The details are as follows:
the objective function is defined as:
\[
\max\left\{\, C(\lambda_p) = \prod_{(x,y)\in R_l(\lambda_p)} prob_l(x,y) \prod_{(x,y)\in R_{nl}(\lambda_p)} prob_{nl}(x,y) \,\right\}
\]
where λ_p is the 14-point lip contour parameter, R_l(λ_p) is the lip region and R_nl(λ_p) is the non-lip region; prob_l is the lip-class probability and prob_nl is the non-lip-class probability. The final lip contour model λ_p is obtained through iterative search. The Kalman prediction serves to predict the initial lip model of the current frame from the lip contour points of the previous several frames; the procedure is the same as described in step two, except that the measurement is the 14-point lip contour coordinates.
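For illustration, the objective can be evaluated for a single candidate contour as sketched below. The text does not fix how the 14 points delimit R_l and R_nl; the sketch assumes R_l is the polygon enclosed by the points, R_nl is its complement within the image, prob_nl = 1 - prob_l (consistent with the membership constraint of step two), and the product over pixels is computed in the log domain for numerical stability.

```python
import numpy as np
from matplotlib.path import Path

def contour_log_likelihood(lip_prob, contour_pts):
    """Evaluate the contour objective C(lambda_p) for one candidate contour.

    lip_prob    : (H, W) lip-class probability map u_{0,r,s} from step two.
    contour_pts : (14, 2) array of (x, y) contour coordinates, lambda_p.

    Assumptions (not fixed by the text): R_l is the polygon enclosed by the
    14 points, R_nl is its complement within the image, prob_nl = 1 - prob_l,
    and the product over pixels is taken in the log domain for stability.
    """
    H, W = lip_prob.shape
    ys, xs = np.mgrid[0:H, 0:W]
    inside = Path(contour_pts).contains_points(
        np.column_stack([xs.ravel(), ys.ravel()])).reshape(H, W)

    eps = 1e-12
    log_lip = np.log(lip_prob + eps)
    log_nonlip = np.log(1.0 - lip_prob + eps)
    return log_lip[inside].sum() + log_nonlip[~inside].sum()
```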
Compared with the prior art, the invention has the following beneficial effects: exploiting the facts that lip images have low contrast, that lip pixels tend to cluster spatially, and that a lip sequence is continuous in the time domain, a new lip segmentation and lip contour extraction method is adopted, which outperforms traditional luminance-edge-based methods in both accuracy and robustness. Extensive experiments show that the method tracks and extracts the lip contour accurately while remaining real-time (processing more than 30 frames per second).
Drawings
FIG. 1 is a work flow diagram of the method of the present invention.
Detailed Description
The embodiments of the present invention will be described in detail below with reference to the accompanying drawings: the present embodiment is implemented on the premise of the technical solution of the present invention, and a detailed implementation manner and a specific operation process are given, but the protection scope of the present invention is not limited to the following embodiments.
As shown in fig. 1, the present embodiment includes the following steps:
First, a lip image sequence containing the lip region is acquired with a digital camera at a frame rate of 24 frames per second; each frame is an RGB image with a resolution of 220 × 180. The color space is then converted to the CIE-LAB uniform color space as follows:
\[
\begin{bmatrix} X \\ Y \\ Z \end{bmatrix} =
\begin{bmatrix} 0.490 & 0.310 & 0.200 \\ 0.177 & 0.813 & 0.011 \\ 0.000 & 0.010 & 0.990 \end{bmatrix}
\begin{bmatrix} R \\ G \\ B \end{bmatrix}
\]
\[
L^* = \begin{cases} 116\,(Y')^{1/3} - 16 & \text{if } Y' > 0.008856 \\ 903.3\,Y' & \text{otherwise} \end{cases}
\]
\[
a^* = 500\,(K_1^{1/3} - K_2^{1/3})
\]
\[
b^* = 200\,(K_2^{1/3} - K_3^{1/3})
\]
wherein
\[
K_i = \begin{cases} \Phi_i & \text{if } \Phi_i > 0.008856 \\ 7.787\,\Phi_i + 16/116 & \text{otherwise} \end{cases}
\]
Second, using a continuous-image lip segmentation method based on fuzzy clustering and Kalman prediction, every pixel in the image is classified as a lip pixel or a non-lip pixel, and the probability that each pixel belongs to the lip class is output. The specific method is as follows:
For the 220 × 180 image I, let X = {x_{1,1}, …, x_{r,s}, …, x_{N,M}} denote the set of color information of all pixels in the image, where x_{r,s} ∈ R^q is the three-dimensional Lab color feature of the pixel at coordinates (r, s). In addition, let d_{i,r,s} be the Euclidean distance between the color feature x_{r,s} and the i-th color center v_i (i = 0 denotes the lip class, i = 1 the non-lip class). The objective function of the fuzzy-clustering-based lip segmentation algorithm is then:
\[
J = \sum_{r=1}^{N}\sum_{s=1}^{M}\sum_{i=0}^{1} u_{i,r,s}^{m}\left(d_{i,r,s}^{2} + gs(i,r,s,p)\right)
\]
subject to
\[
\sum_{i=0}^{1} u_{i,r,s} = 1, \qquad \forall (r,s)\in I.
\]
Here U denotes the fuzzy membership matrix (i.e. the probability that a pixel belongs to a given class), and gs is a positional penalty function: it strengthens the lip membership of pixels inside the lip region and weakens the lip membership of pixels outside it. Over the whole lip segmentation process, the optimal membership matrix that minimizes the objective function is obtained iteratively by gradient descent.
The role of the Kalman prediction is to predict the lip/non-lip color centers and the lip spatial position of the current frame from the color centers and lip spatial positions of the previous several frames. The final output is the probability that each pixel in the image belongs to the lip class, namely u_{0,r,s}.
The Kalman prediction of the color centers and of the lip spatial position is based on the state-space model:
\[
x_k = A x_{k-1} + w_{k-1}
\]
\[
z_k = H x_k + v_k
\]
where x_k is the current state and w_{k-1} is the noise of the state transition; A is the state transition matrix; z_k is the measurement at the current time (i.e. the color centers and the lip spatial position parameters); v_k is the measurement error, and H is the measurement matrix. The state transition noise and the measurement noise are generally assumed to follow normal distributions: p(w) ~ N(0, Q), p(v) ~ N(0, R). Kalman filter prediction is computed as an iterative, recursive process, as follows:
1) initialize the initial state and the initial estimation error covariance;
2) predict the current state from the state of the previous step, and obtain a predicted measurement from the predicted state through the measurement matrix H; after Kalman-filter correction this measurement is the required result;
3) correct the system model according to the currently observed measurement, feeding the final measurement output of the current frame into the correction step;
4) repeat steps 2) and 3) until the last frame of the lip sequence.
Third, using a 14-point dynamic shape model and Kalman prediction, the lip contour in every frame of the lip image sequence is obtained from the lip probability map produced in the second step, specifically as follows:
the objective function is defined as:
\[
\max\left\{\, C(\lambda_p) = \prod_{(x,y)\in R_l(\lambda_p)} prob_l(x,y) \prod_{(x,y)\in R_{nl}(\lambda_p)} prob_{nl}(x,y) \,\right\}
\]
where λ_p is the 14-point lip contour parameter, R_l(λ_p) is the lip region and R_nl(λ_p) is the non-lip region; prob_l is the lip-class probability and prob_nl is the non-lip-class probability. The final lip contour model λ_p is obtained through iterative search. The Kalman prediction serves to predict the initial lip model of the current frame from the lip contour points of the previous several frames; the procedure is the same as described in the second step, except that the measurement is the 14-point lip contour coordinates.
The iterative search proceeds as follows:
firstly, initializing a 14-point lip model lambda by using lip class probability distribution obtained by a lip image segmentation algorithmp
Second, according to the objective function, the displacement of each lip contour point is computed and the contour point positions are updated:
\[
\Delta\lambda_p = \{dx_i, dy_i\} = \left\{ -\frac{\partial C}{\partial x_i},\; -\frac{\partial C}{\partial y_i} \right\}, \qquad i = 0, 1, \ldots, 13
\]
\[
\lambda_{p,\mathrm{new}} = \lambda_{p,\mathrm{old}} + w\,\Delta\lambda_p
\]
where w is the step size of each offset, set to 0.05 in this embodiment.
Third, the second step is repeated until the objective function converges; a sketch of this update loop is given below.
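A minimal sketch of this refinement loop follows; the patent does not state how the partial derivatives ∂C/∂x_i and ∂C/∂y_i are evaluated, so central finite differences are assumed, and the objective C is passed in as a callable (for example, a log-domain evaluation of the step-three objective over the probability map).

```python
import numpy as np

def refine_contour(C, points, w=0.05, n_iter=100, delta=0.5, tol=1e-6):
    """Iterative refinement of the 14-point lip contour.

    C      : callable mapping a (14, 2) array of contour points to the scalar
             objective (for example, a log-domain evaluation of C(lambda_p)
             over the lip probability map).
    points : (14, 2) initial contour taken from the segmentation result.
    w      : step size per offset (0.05 in this embodiment).

    The partial derivatives dC/dx_i and dC/dy_i are approximated by central
    finite differences with spacing `delta`, which is an assumption; the sign
    of the displacement follows the formula given above.
    """
    points = np.array(points, dtype=float)
    prev = C(points)
    for _ in range(n_iter):
        grad = np.zeros_like(points)
        for i in range(points.shape[0]):        # the 14 contour points
            for j in range(2):                  # x and y coordinates
                shift = np.zeros_like(points)
                shift[i, j] = delta
                grad[i, j] = (C(points + shift) - C(points - shift)) / (2.0 * delta)
        points += w * (-grad)                   # displacement Delta(lambda_p)
        cur = C(points)
        if abs(cur - prev) < tol:               # stop when the objective converges
            break
        prev = cur
    return points
```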
In tests on more than 2000 speech sequences from 50 speakers, the method of this embodiment tracked the lip contour accurately while maintaining a processing speed above 30 frames per second.

Claims (7)

1. A real-time dynamic lip tracking method is characterized by comprising the following steps:
step one, shooting and acquiring an image sequence including a lip region through a digital camera;
step two, dividing all pixel points in the image into lip pixel points or non-lip pixel points by a continuous image lip segmentation method based on fuzzy clustering and Kalman prediction, and outputting the probability that each pixel point belongs to the lip pixel points;
and step three, acquiring the lip contour in each frame in the lip image sequence on the basis of the lip probability distribution map provided in the step two through a 14-point dynamic shape model and Kalman prediction.
2. The real-time dynamic lip tracking method according to claim 1, wherein when the color space collected by the digital camera is RGB color space, it is converted into CIE-LAB uniform color space, specifically as follows:
\[
\begin{bmatrix} X \\ Y \\ Z \end{bmatrix} =
\begin{bmatrix} 0.490 & 0.310 & 0.200 \\ 0.177 & 0.813 & 0.011 \\ 0.000 & 0.010 & 0.990 \end{bmatrix}
\begin{bmatrix} R \\ G \\ B \end{bmatrix}
\]
\[
L^* = \begin{cases} 116\,(Y')^{1/3} - 16 & \text{if } Y' > 0.008856 \\ 903.3\,Y' & \text{otherwise} \end{cases}
\]
\[
a^* = 500\,(K_1^{1/3} - K_2^{1/3})
\]
\[
b^* = 200\,(K_2^{1/3} - K_3^{1/3})
\]
wherein
\[
K_i = \begin{cases} \Phi_i & \text{if } \Phi_i > 0.008856 \\ 7.787\,\Phi_i + 16/116 & \text{otherwise} \end{cases}.
\]
3. the method of claim 1, wherein the segmentation method comprises:
for an N × M image I, let X = {x_{1,1}, …, x_{r,s}, …, x_{N,M}} denote the color information set of all pixel points in the image, where x_{r,s} ∈ R^q represents the color feature of the pixel point located at coordinates (r, s);
in addition, let d_{i,r,s} be the Euclidean distance between the color feature x_{r,s} and the i-th color center v_i, wherein i = 0 denotes the lip class and i = 1 the non-lip class;
finally, the whole lip segmentation algorithm target function based on the fuzzy clustering technology is as follows:
\[
J = \sum_{r=1}^{N}\sum_{s=1}^{M}\sum_{i=0}^{1} u_{i,r,s}^{m}\left(d_{i,r,s}^{2} + gs(i,r,s,p)\right);
\]
subject to
\[
\sum_{i=0}^{1} u_{i,r,s} = 1, \qquad \forall (r,s)\in I;
\]
here U denotes the fuzzy membership matrix, namely the probability that a pixel belongs to a given class, and the gs function is a positional penalty function, namely the lip-class membership of pixels inside the lip region is strengthened and the lip membership of pixels outside the lip region is weakened.
4. The real-time dynamic lip tracking method according to claim 1, wherein the probability of the lip pixel points is obtained as follows: throughout the lip segmentation process, the optimal membership matrix that minimizes the objective function is obtained iteratively by gradient descent; the Kalman prediction of the color centers and of the lip spatial position predicts the lip/non-lip color centers and the lip spatial position of the current frame from those of the previous several frames; and the final output is the probability that each pixel point in the image belongs to the lip class, namely u_{0,r,s}.
5. The real-time dynamic lip tracking method of claim 4, wherein the Kalman prediction is:
\[
x_k = A x_{k-1} + w_{k-1}
\]
\[
z_k = H x_k + v_k
\]
wherein x_k denotes the current state, w_{k-1} denotes the noise of the state transition, and A is the state transition matrix; z_k denotes the measurement at the current time, namely the color centers and the lip spatial position parameters; v_k denotes the measurement error, and H is the measurement matrix; the state transition noise and the measurement noise follow normal distributions: p(w) ~ N(0, Q); p(v) ~ N(0, R).
6. The method of claim 4, wherein the Kalman prediction calculation is an iterative recursive process, as follows:
1) initializing an initial state and initial estimation error covariance;
2) predicting the current state from the state of the previous step, and obtaining a predicted measurement from the predicted state through the measurement matrix H, the measurement after Kalman-filter correction being the required result;
3) correcting the system model according to the currently observed measurement, the final measurement output of the current frame being fed into the correction step;
4) repeating steps 2) and 3) until the last frame of the lip sequence.
7. The real-time dynamic lip tracking method according to claim 1, wherein the lip contour in each frame of the acquired lip image sequence is obtained with the objective function defined as:
\[
\max\left\{\, C(\lambda_p) = \prod_{(x,y)\in R_l(\lambda_p)} prob_l(x,y) \prod_{(x,y)\in R_{nl}(\lambda_p)} prob_{nl}(x,y) \,\right\}
\]
wherein: λ_p is the 14-point lip contour parameter, R_l is the lip region and R_nl is the non-lip region;
prob_l is the lip-class probability and prob_nl is the non-lip-class probability;
the final lip contour model λ_p is obtained through iterative search;
the Kalman prediction is used to predict the initial lip model of the current frame from the lip contour points of the previous several frames, the method differing from that of step two only in that the measurement is the 14-point lip contour coordinate values.
CN 201010571128 2010-12-03 2010-12-03 Method for dynamically tracking lip in real time Expired - Fee Related CN102013103B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201010571128 CN102013103B (en) 2010-12-03 2010-12-03 Method for dynamically tracking lip in real time

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201010571128 CN102013103B (en) 2010-12-03 2010-12-03 Method for dynamically tracking lip in real time

Publications (2)

Publication Number Publication Date
CN102013103A 2011-04-13
CN102013103B CN102013103B (en) 2013-04-03

Family

ID=43843267

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201010571128 Expired - Fee Related CN102013103B (en) 2010-12-03 2010-12-03 Method for dynamically tracking lip in real time

Country Status (1)

Country Link
CN (1) CN102013103B (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102303614A (en) * 2011-07-29 2012-01-04 秦皇岛港股份有限公司 Automatic detection and alarm device of train dead hook
CN103914699A (en) * 2014-04-17 2014-07-09 厦门美图网科技有限公司 Automatic lip gloss image enhancement method based on color space
CN104409075A (en) * 2014-11-28 2015-03-11 深圳创维-Rgb电子有限公司 Voice identification method and system
CN104766316A (en) * 2015-03-31 2015-07-08 复旦大学 Novel lip segmentation algorithm for traditional Chinese medical inspection diagnosis
CN106446800A (en) * 2016-08-31 2017-02-22 北京云图微动科技有限公司 Tooth identification method, device and system
CN106683065A (en) * 2012-09-20 2017-05-17 上海联影医疗科技有限公司 Lab space based image fusing method
CN106778770A (en) * 2016-11-23 2017-05-31 河池学院 A kind of image-recognizing method of Visual intelligent robot
CN107369449A (en) * 2017-07-14 2017-11-21 上海木爷机器人技术有限公司 A kind of efficient voice recognition methods and device
CN107811735A (en) * 2017-10-23 2018-03-20 广东工业大学 One kind auxiliary eating method, system, equipment and computer-readable storage medium
CN108596992A (en) * 2017-12-31 2018-09-28 广州二元科技有限公司 A kind of quickly real-time lip gloss cosmetic method
CN109816741A (en) * 2017-11-22 2019-05-28 北京展讯高科通信技术有限公司 A kind of generation method and system of adaptive virtual lip gloss

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6219639B1 (en) * 1998-04-28 2001-04-17 International Business Machines Corporation Method and apparatus for recognizing identity of individuals employing synchronized biometrics
CN101046959A (en) * 2007-04-26 2007-10-03 上海交通大学 Identity identification method based on lid speech characteristic
US20080273116A1 (en) * 2005-09-12 2008-11-06 Nxp B.V. Method of Receiving a Multimedia Signal Comprising Audio and Video Frames

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6219639B1 (en) * 1998-04-28 2001-04-17 International Business Machines Corporation Method and apparatus for recognizing identity of individuals employing synchronized biometrics
US20080273116A1 (en) * 2005-09-12 2008-11-06 Nxp B.V. Method of Receiving a Multimedia Signal Comprising Audio and Video Frames
CN101046959A (en) * 2007-04-26 2007-10-03 上海交通大学 Identity identification method based on lid speech characteristic

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Chan, M.T. et al., "Real-time lip tracking and bimodal continuous speech recognition," 1998 IEEE Second Workshop on Multimedia Signal Processing, 1998-12-09, pages 65-70 (relevant to claims 1-7) *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102303614A (en) * 2011-07-29 2012-01-04 秦皇岛港股份有限公司 Automatic detection and alarm device of train dead hook
CN106683065A (en) * 2012-09-20 2017-05-17 上海联影医疗科技有限公司 Lab space based image fusing method
CN103914699B (en) * 2014-04-17 2017-09-19 厦门美图网科技有限公司 A kind of method of the image enhaucament of the automatic lip gloss based on color space
CN103914699A (en) * 2014-04-17 2014-07-09 厦门美图网科技有限公司 Automatic lip gloss image enhancement method based on color space
CN104409075A (en) * 2014-11-28 2015-03-11 深圳创维-Rgb电子有限公司 Voice identification method and system
CN104766316A (en) * 2015-03-31 2015-07-08 复旦大学 Novel lip segmentation algorithm for traditional Chinese medical inspection diagnosis
CN104766316B (en) * 2015-03-31 2017-11-17 复旦大学 New lip partitioning algorithm in tcm inspection
CN106446800A (en) * 2016-08-31 2017-02-22 北京云图微动科技有限公司 Tooth identification method, device and system
CN106446800B (en) * 2016-08-31 2019-04-02 北京贝塔科技股份有限公司 Tooth recognition methods, apparatus and system
CN106778770A (en) * 2016-11-23 2017-05-31 河池学院 A kind of image-recognizing method of Visual intelligent robot
CN107369449A (en) * 2017-07-14 2017-11-21 上海木爷机器人技术有限公司 A kind of efficient voice recognition methods and device
CN107811735A (en) * 2017-10-23 2018-03-20 广东工业大学 One kind auxiliary eating method, system, equipment and computer-readable storage medium
CN107811735B (en) * 2017-10-23 2020-01-07 广东工业大学 Auxiliary eating method, system, equipment and computer storage medium
CN109816741A (en) * 2017-11-22 2019-05-28 北京展讯高科通信技术有限公司 A kind of generation method and system of adaptive virtual lip gloss
CN109816741B (en) * 2017-11-22 2023-04-28 北京紫光展锐通信技术有限公司 Method and system for generating self-adaptive virtual lip gloss
CN108596992A (en) * 2017-12-31 2018-09-28 广州二元科技有限公司 A kind of quickly real-time lip gloss cosmetic method
CN108596992B (en) * 2017-12-31 2021-01-01 广州二元科技有限公司 Rapid real-time lip gloss makeup method

Also Published As

Publication number Publication date
CN102013103B (en) 2013-04-03

Similar Documents

Publication Publication Date Title
CN102013103B (en) Method for dynamically tracking lip in real time
CN108154118B (en) A kind of target detection system and method based on adaptive combined filter and multistage detection
CN106570486B (en) Filtered target tracking is closed based on the nuclear phase of Fusion Features and Bayes&#39;s classification
CN109741356B (en) Sub-pixel edge detection method and system
CN103413120B (en) Tracking based on object globality and locality identification
CN112036254B (en) Moving vehicle foreground detection method based on video image
CN108876820B (en) Moving target tracking method under shielding condition based on mean shift
CN102722891A (en) Method for detecting image significance
CN108537751B (en) Thyroid ultrasound image automatic segmentation method based on radial basis function neural network
CN108804992B (en) Crowd counting method based on deep learning
CN107944354B (en) Vehicle detection method based on deep learning
CN106991686A (en) A kind of level set contour tracing method based on super-pixel optical flow field
CN104537688A (en) Moving object detecting method based on background subtraction and HOG features
CN104599288A (en) Skin color template based feature tracking method and device
CN110245600B (en) Unmanned aerial vehicle road detection method for self-adaptive initial quick stroke width
CN110717900A (en) Pantograph abrasion detection method based on improved Canny edge detection algorithm
CN112926552A (en) Remote sensing image vehicle target recognition model and method based on deep neural network
CN112164093A (en) Automatic person tracking method based on edge features and related filtering
CN109241932B (en) Thermal infrared human body action identification method based on motion variance map phase characteristics
CN110751671B (en) Target tracking method based on kernel correlation filtering and motion estimation
CN106951831B (en) Pedestrian detection tracking method based on depth camera
CN109102520A (en) The moving target detecting method combined based on fuzzy means clustering with Kalman filter tracking
CN117522862A (en) Image processing method and processing system based on CT image pneumonia recognition
CN115063679B (en) Pavement quality assessment method based on deep learning
CN109615617A (en) A kind of image partition method for protecting convex indirect canonical level set

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130403

Termination date: 20151203

EXPY Termination of patent right or utility model