
MXPA00010043A - Face recognition from video images - Google Patents

Face recognition from video images

Info

Publication number
MXPA00010043A
MXPA00010043A MXPA/A/2000/010043A
Authority
MX
Mexico
Prior art keywords
image
recognizing objects
images
beams
nodes
Prior art date
Application number
MXPA/A/2000/010043A
Other languages
Spanish (es)
Inventor
Thomas Maurer
Egor Valerievich Elagin
Luciano Pasquale Agostino Nocera
Johannes Bernhard Steffens
Hartmut Neven
Original Assignee
Eyematic Interfaces Inc
Priority date
Filing date
Publication date
Application filed by Eyematic Interfaces Inc filed Critical Eyematic Interfaces Inc
Publication of MXPA00010043A publication Critical patent/MXPA00010043A/en

Abstract

The present invention is embodied in an apparatus, and related method, for detecting and recognizing an object in an image frame. The object may be, for example, a head having particular facial characteristics. The object detection process uses robust and computationally efficient techniques. The object identification and recognition process uses an image processing technique based on model graphs and bunch graphs that efficiently represent image features as jets. The jets are composed of wavelet transforms and are processed at nodes or landmark locations on an image corresponding to readily identifiable features. The system of the invention is particularly advantageous for recognizing a person over a wide variety of pose angles.

Description

FACE RECOGNITION FROM VIDEO IMAGES
FIELD OF THE INVENTION The present invention relates to vision-based object detection and tracking and, more particularly, to systems for detecting objects in video images, such as human faces, and tracking and identifying the objects in real time.
BACKGROUND OF THE INVENTION Recently developed object and face recognition techniques include the use of elastic bunch graph matching. The bunch graph recognition technique is highly effective for recognizing faces when the image to be analyzed is segmented so that the face occupies a substantial portion of the image. However, the elastic bunch graph technique may not reliably detect objects in large scenes where the object of interest occupies only a small fraction of the scene. Moreover, for real-time use of the elastic bunch graph recognition technique, the image segmentation process must be computationally efficient or many of the operating advantages of the recognition technique are not obtained. Accordingly, there is a significant need for an image processing technique that detects an object in video images and prepares the video image for further processing by a bunch graph matching process in a computationally efficient manner. The present invention satisfies these needs.
BRIEF DESCRIPTION OF THE INVENTION The present invention is embodied in an apparatus, and related method, for detecting and recognizing an object in an image frame. The object detection process uses robust and computationally efficient techniques. The object identification and recognition process uses an image processing technique based on model graphs and bunch graphs that efficiently represent image features as jets. The system of the invention is particularly advantageous for recognizing a person over a wide variety of pose angles. In one embodiment of the invention, the object is detected and a portion of the image frame associated with the object is bounded by a bounding box. The bounded portion of the image frame is transformed using a wavelet transformation to generate a transformed image. Nodes associated with distinguishing features of the object, defined by wavelet jets of a bunch graph generated from a plurality of representative object images, are located on the transformed image. The object is identified based on a similarity between wavelet jets associated with an object image in a gallery of object images and wavelet jets at the nodes on the transformed image. Additionally, the detected object may be sized and centered within the bounded portion of the image so that the detected object has a predetermined size and position within the bounded portion, and background portions of the bounded portion of the image frame that are not associated with the object may be suppressed before identifying the object. Often, the object is a person's head exhibiting a facial region. The bunch graph may be based on a three-dimensional representation of the object. In addition, the wavelet transformation may be performed using phase calculations carried out with a phase representation adapted to the hardware. In an alternative embodiment of the invention, the object is in a sequence of image frames and the detection step further includes tracking the object between image frames based on a trajectory associated with the object. In addition, the step of locating the nodes includes tracking the nodes between image frames and reinitializing a node if the node's position deviates beyond a predetermined position constraint between image frames. Additionally, the image frames may be stereoscopic images, and the detection step may include detecting convex regions associated with head movement. Other features and advantages of the present invention will become apparent from the following description of the preferred embodiments, taken in conjunction with the accompanying drawings, which illustrate, by way of example, the principles of the invention.
BRIEF DESCRIPTION OF THE DRAWINGS Figure 1 is a block diagram of a face recognition process, according to the invention. Figure 2 is a block diagram of a face recognition system, according to the invention.
Figure 3 is a series of images showing the detection, landmark finding and identification stages of the recognition process of Figure 1. Figure 4 is a block diagram of the head detection and tracking process, according to the invention. Figure 5 is a flow diagram, with accompanying images, illustrating disparity-based detection according to the invention. Figure 6 is a schematic diagram of a convex detector, according to the invention. Figure 7 is a flow diagram of a head tracking process, according to the invention. Figure 8 is a flow diagram of a preselector, according to the invention. Figure 9 is a flow diagram, with accompanying photographs, illustrating the landmark finding technique of the face recognition apparatus and system of Figure 1. Figure 10 is a series of images showing processing of a facial image using Gabor wavelets, according to the invention. Figure 11 is a series of graphs showing the construction of a jet, an image graph and a bunch graph using the wavelet processing technique of Figure 10, according to the invention. Figure 12 is a diagram of a model graph, according to the invention, for processing facial images. Figure 13 includes two diagrams showing the use of wavelet processing to locate facial features.
Figure 14 is a diagram of extracted eye and mouth regions illustrating the coarse-to-fine landmark finding technique. Figure 15 is a schematic diagram illustrating circular phase behavior. Figure 16 is a schematic diagram illustrating a two's complement representation of phase having circular behavior, according to the invention. Figure 17 is a flow diagram showing a tracking technique for following the landmarks found by the landmark finding technique of the invention. Figure 18 is a series of facial images showing the tracking of facial features, according to the invention. Figure 19 is a diagram of a Gaussian image pyramid technique illustrating landmark tracking in one dimension. Figure 20 is a series of two facial images, with accompanying graphs of pose angle versus frame number, showing the tracking of facial features over a sequence of 50 image frames. Figure 21 is a flow diagram, with accompanying photographs, illustrating a pose estimation technique of the face recognition apparatus and system of Figure 1. Figure 22 is a diagram of a pinhole camera model showing the orientation of the three-dimensional (3-D) viewing axes. Figure 23 is a perspective view of a three-dimensional camera calibration configuration.
Figure 24 is a schematic diagram of rectification for projecting corresponding pixels of stereoscopic images along the same line numbers. Figure 25 is an image frame showing a correlation matching process between a window of one image frame and a search window of another image frame. Figure 26 is a set of images of a stereoscopic image pair, a disparity map and an image reconstruction, illustrating three-dimensional reconstruction of the image. Figure 27 is a flow diagram of an image identification process, according to the invention. Figure 28 is an image showing the use of background suppression.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS The present invention is embodied in a method, and related apparatus, for detecting and recognizing an object in an image frame. The object may be, for example, a head having particular facial features. The object detection process uses robust and computationally efficient techniques. The object identification and recognition process uses an image processing technique based on model graphs and bunch graphs that efficiently represent image features as jets. The jets are composed of wavelet transforms and are processed at nodes or landmark locations on the image corresponding to readily identifiable features. The system of the invention is particularly advantageous for recognizing a person over a wide variety of pose angles.

An image processing system of the invention is described with reference to Figures 1-3. The object recognition process 10 operates on digitized video image data provided by an image processing system 12. The image data includes an image of an object class, such as a human face. The image data may be a single video image frame or a series of sequential monocular or stereoscopic image frames.

Before a facial image is processed using elastic bunch graph techniques, the head in the image is roughly located, according to the invention, using a head detection and tracking process. Depending on the nature of the image data, the head detection module uses one of several visual pathways based, for example, on motion, color, size (stereo vision), topology or pattern. The head detection process places a bounding box around the detected head, thereby reducing the image region that must be processed by the landmark finding process. Based on the data received from the head detection and tracking process, a preselector process 16 selects the most suitable views of the image material for further analysis and refines the head detection so as to center and scale the head image. The selected head image is provided to a landmark finding process for detecting the individual facial features using the elastic bunch graph technique. Once the facial landmarks have been found in the facial image, a landmark tracking process 20 may be used to track the landmarks. The features extracted at the landmarks are then compared against corresponding features extracted from an image gallery by an identification process.

This division of the image recognition process is advantageous because the landmark finding process is relatively time consuming and often cannot be performed in real time on a series of image frames having a relatively high frame rate. Landmark tracking, on the other hand, can be performed faster than the frame rate. Therefore, while the initial landmark finding is taking place, a buffer can be filled with new incoming image frames. Once the landmarks have been located, landmark tracking is started and the processing system can catch up by processing the buffered images until the buffer is cleared. Note that the preselector and the landmark tracking module may be omitted from the face recognition process.

The screen output of the recognition process for the detection, landmark finding and identification stages is shown in Figure 3.
The upper left image window shows the acquired image with the detected head indicated by a bounding rectangle. The head image is centered, rescaled and provided to the landmark finding process. The upper right image window shows the output of the landmark finding module, with the facial image marked with nodes on the facial landmarks. The marked image is provided to the identification process, which is illustrated in the lower window. The leftmost image represents the selected face provided by the landmark finding process for identification. The three images to its right represent the most similar gallery images, sorted in order of similarity, with the most similar face in the leftmost position. Each gallery image carries a label (i.e., an identification number and the person's name) associated with the image. The system then reports the label associated with the most similar face. The face recognition process may be implemented using a three-dimensional (3D) reconstruction process based on stereoscopic images. The 3D face recognition process provides viewpoint-independent recognition.

The image processing system 12 for implementing the face recognition process of the invention is shown in Figure 2. The processing system receives a person's image from a video source 26, which generates a stream of digital video image frames. The video image frames are transferred to a video random access memory (VRAM) 28 for processing. A satisfactory imaging system is the Matrox Meteor II, available from Matrox (Dorval, Quebec, Canada; www.matrox.com), which digitizes images produced by a conventional CCD camera and transfers them in real time into memory at a frame rate of 30 Hz. A typical resolution for an image frame is 256 by 256 pixels. The image frame is processed by an image processor having a central processing unit (CPU) coupled to the VRAM and to a random access memory (RAM) 32. The RAM stores the program code 34 and the data for implementing the face recognition process of the invention. Alternatively, the image processing process may be implemented in application-specific hardware.

The head detection process is described in more detail with reference to Figure 4. The facial image may be stored in the VRAM 28 as a single image 36, a monocular video stream of images 38 or a binocular video stream of images 40. For a single image, the processing time may not be critical, and elastic bunch graph matching, described in more detail below, may be used to detect a face if the face covers at least 10% of the image and has a diameter of at least 50 pixels. If the face is smaller than 10% of the image, or if multiple faces are present, a neural-network-based face detector may be used, as described in H. A. Rowley, S. Baluja and T. Kanade, "Rotation Invariant Neural Network-Based Face Detection", Proceedings Computer Vision and Pattern Recognition, 1998. If the image includes color information, a skin color detection process may be used to increase the reliability of the face detection. The skin color detection process may be based on a look-up table containing possible skin colors. Confidence values, which indicate the reliability of face detection and which are generated during bunch graph matching or by the neural network, may be increased for image regions having skin color.
A monocular image stream of at least 10 frames per second may be analyzed for image motion, particularly if the image stream includes only a single person moving in front of a stationary background. One technique for locating the head involves using image differencing to determine which regions of the image have moved.
As described in greater detail below with respect to binocular images, head movement often produces an image difference having convex regions within a moving silhouette. Using this moving-silhouette technique, the head can be easily located and head movement can be tracked if the image includes a single person in an upright position in front of a static background. A clustering algorithm groups the moving regions into clusters. The top of the highest cluster that exceeds a minimum threshold size and diameter is considered to be the head and is marked. Another advantageous use of head motion detection invokes graph matching only when a minimum threshold in terms of the number of pixels affected by motion in an image is exceeded. The threshold is selected so that the relatively time-consuming graph matching analysis is performed only if there is sufficient change in the images to justify a renewed analysis. Other techniques for determining convex regions of a moving silhouette may be used, for example as described by Turk et al., "Eigenfaces for Recognition", Journal of Cognitive Neuroscience, vol. 3, no. 1, p. 71, 1991. Optical flow methods, as described in D. J. Fleet, "Measurement of Image Velocity", Kluwer International Series in Engineering and Computer Science, no. 169, 1992, provide an alternative and reliable means of determining which image regions change, but they are computationally more intensive.

With reference to Figure 5, reliable and rapid detection of the face and head is possible using an image stream of binocular stereoscopic video images (block 50). Stereo vision allows differentiation between foreground and background objects and also allows determination of object size for objects of known size, such as heads and hands. Motion is detected between two images of an image series by applying a difference routine to the images in both the right image channel and the left image channel (block 52). A disparity map is computed for the pixels that move in both image channels (block 54). The convex detector then uses disparity histograms (block 56), which plot the number of pixels against disparity. Image regions having a disparity confined to a certain disparity range are selected by inspecting the local maxima of the disparity histogram (block 58). The pixels associated with a local maximum are referred to as motion silhouettes. Motion silhouettes are binary images. Some motion silhouettes may be discarded as too small to have been generated by a person (block 60). The motion silhouette associated with a given depth can differentiate a person from other moving objects (block 62). Convex regions of the motion silhouette (block 64) are detected by a convex detector, as shown in Figure 6. The convex detector analyzes convex regions within the silhouettes. The convex detector checks whether a pixel 68 belonging to a motion silhouette has neighboring pixels within an allowed region 70 on the circumference of the disparity width 72. The connected allowed region may lie anywhere on the circumference. The output of the convex detector is a binary value. Similarly, skin color silhouettes can be used to detect heads and hands. Motion silhouettes, skin color silhouettes, convex detector outputs applied to motion silhouettes and convex detector outputs applied to skin color silhouettes provide four different evidence maps.
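The detection pipeline described above (frame differencing in both channels, a disparity map for the moving pixels, and silhouette selection at local maxima of the disparity histogram) can be illustrated with a minimal sketch. The function names, thresholds and bin counts below are illustrative assumptions, not values taken from the patent.

```python
import numpy as np

def motion_mask(prev, curr, thresh=15):
    """Binary mask of pixels that changed between two gray-value frames."""
    return np.abs(curr.astype(np.int16) - prev.astype(np.int16)) > thresh

def silhouettes_from_disparity(disparity, moving, n_bins=64, min_pixels=200):
    """Group moving pixels by disparity (depth) and return one binary
    motion silhouette per local maximum of the disparity histogram."""
    hist, edges = np.histogram(disparity[moving], bins=n_bins)
    silhouettes = []
    for i in range(1, n_bins - 1):
        # local maximum of the disparity histogram with enough support
        if hist[i] >= hist[i - 1] and hist[i] >= hist[i + 1] and hist[i] > min_pixels:
            lo, hi = edges[i], edges[i + 1]
            sil = moving & (disparity >= lo) & (disparity < hi)
            silhouettes.append(sil)   # silhouettes too small for a person are discarded later
    return silhouettes
```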
An evidence map is a scalar function over the image domain that indicates the evidence that a certain pixel belongs to a face or a hand. Each of the four evidence maps is binary valued. The evidence maps are linearly superimposed for a given disparity and checked for local maxima. Local maxima indicate candidate positions where heads or hands might be found. The expected diameter of a head can then be inferred from the local maximum of the disparity map that gave rise to the evidence map. Head detection as described works well even in the presence of strong background motion. The head tracking process (block 42) generates head position information that can be used to build head trajectories over time. As shown in Figure 7, newly detected head positions (block 78) are compared with existing head trajectories. A thinning step (block 80) replaces multiple neighboring detections with a single representative detection (block 82). The new position is then checked to determine whether it belongs to an already existing trajectory (block 84), assuming spatio-temporal continuity. For each position estimate found for a frame acquired at time t, the algorithm searches (block 86) for the nearest head position estimate determined from the previous frame at time t-1 and links it to the trajectory (block 88). If no sufficiently close estimate can be found, it is assumed that a new head has appeared (block 90) and a new trajectory is started. Only image coordinates are used to link the individual estimates to the trajectories.
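A minimal sketch of the trajectory-linking step just described: each new detection is attached to the nearest trajectory from the previous frame, or a new trajectory is started. The distance threshold and data layout are assumptions made for illustration; the confidence update and hysteresis mechanism are described next in the text.

```python
import numpy as np

def link_detections(trajectories, detections, max_dist=30.0):
    """Attach each new head detection (x, y) to the nearest existing
    trajectory from the previous frame, or start a new trajectory."""
    for det in detections:
        best, best_d = None, max_dist
        for traj in trajectories:
            d = np.hypot(det[0] - traj[-1][0], det[1] - traj[-1][1])
            if d < best_d:
                best, best_d = traj, d
        if best is not None:
            best.append(det)            # spatio-temporal continuity assumed
        else:
            trajectories.append([det])  # no close match: a new head has appeared
    return trajectories
```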
Each trajectory is assigned a confidence, which is updated using a leaky integrator. If the confidence value falls below a predetermined threshold, the trajectory is deleted (block 92). A hysteresis mechanism is used to stabilize trajectory creation and deletion: in order to initiate a trajectory (block 90), a higher confidence value must be reached than is necessary to delete a trajectory. The preselector 16 (Figure 2) operates to select images suitable for recognition from a series of images belonging to the same trajectory. This selection is particularly useful if the computational power of the hardware is not sufficient to analyze each image of a trajectory individually. However, if the available computing power is sufficient to analyze all the faces found, it may not be necessary to use the preselector. The preselector 16 receives input from the head tracking process 14 and provides output to the landmark finding process 18. The input may be:
- a monocular gray value image with a size of 256 x 256 pixels, represented by a two-dimensional array of bytes;
- an integer representing the sequence number of the image (this number is the same for all images belonging to the same sequence);
- four integer values representing the pixel coordinates of the upper left and lower right corners of a square bounding rectangle surrounding the face.
The output of the preselector may be:
- a monocular gray value image selected from the foregoing sequence;
- four integer values representing the pixel coordinates of the upper left and lower right corners of a square bounding rectangle representing the position of the face more precisely than the rectangle the preselector accepted as input.
As shown in Figure 8, the preselector 16 receives a series of face candidates belonging to the same trajectory, as determined by the head tracking process 14 (block 100). Elastic bunch graph matching, described below with respect to landmark finding, is applied (block 102) to this sequence of images containing an object of interest (e.g., a person's head) in order to select the images most suitable for further processing (i.e., landmark finding and recognition). The preselector applies graph matching in order to evaluate each image for its quality. Additionally, the matching result provides more accurate information about the position and size of the face than the head detection module. The confidence values generated by the matching procedure are used as a measure of the suitability of the image. The preselector passes an image on to the next module if its confidence value exceeds the best confidence value measured so far in the current sequence (blocks 104-110). The preselector bounds the detected face with a bounding box and provides the image to the landmark finding process. Subsequent processing begins on each incoming image but is aborted when an image with a higher confidence value (as measured by the preselector) arrives from within the same sequence. This may lead to an increased CPU workload but provides faster preliminary results. In effect, the preselector filters out a set of the most suitable images for further processing.
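A minimal sketch of the preselector's confidence gating as described above; `matcher` stands in for the elastic bunch graph matching routine and is an assumed interface, not part of the patent.

```python
def preselect(candidates, matcher):
    """Forward a face image to the next module only when its bunch-graph
    matching confidence beats the best value seen so far in the sequence."""
    best_conf = -float("inf")
    selected = []
    for image, bbox in candidates:                   # candidates share one trajectory
        conf, refined_bbox = matcher(image, bbox)    # confidence and refined bounding box
        if conf > best_conf:
            best_conf = conf
            selected.append((image, refined_bbox))   # passed on to landmark finding
    return selected
```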
Alternatively, the preselector can evaluate the images as follows:
- Subsequent modules (e.g., landmark finder, identifier) wait until the sequence has finished in order to select the last, and therefore the most promising, image approved by the preselector. This leads to a low CPU workload but implies a time delay until the final result (e.g., recognition) is available.
- Subsequent modules take each image approved by the preselector, evaluate it individually, and leave the final selection to the following modules (e.g., based on recognition confidence). This also provides quick preliminary results. In this case, the final recognition result may change within a sequence, which yields a better final recognition rate. However, this solution requires the largest amount of CPU time among the three evaluation alternatives.
The facial landmarks and features of the head can be located using an elastic graph matching technique shown in Figure 9. In the elastic graph matching technique, a captured image (block 140) is transformed into Gabor space using a wavelet transformation (block 142), which is described in more detail below with respect to Figure 10. The transformed image (block 144) is represented by 40 complex values, representing wavelet components, for each pixel of the original image. Next, a rigid copy of a model graph, which is described in more detail below with respect to Figure 12, is positioned over the transformed image at varying model node positions to locate the position of optimum similarity (block 146). The search for the optimum similarity may be performed by positioning the model graph at the upper left corner of the image, extracting the jets at the nodes, and determining the similarity between the image graph and the model graph. The search continues by sliding the model graph from left to right, starting from the upper left corner of the image (block 148). When an approximate position of the face is found (block 150), the nodes are allowed to move individually, introducing elastic graph distortions (block 152). A phase-insensitive similarity function, discussed below, is used in order to locate a good match (block 154). A phase-sensitive similarity function is then used to locate a jet with precision, because the phase is very sensitive to small jet displacements. The phase-insensitive and phase-sensitive similarity functions are described below with respect to Figures 10-13. Note that although the graphs are shown in Figure 9 with respect to the original image, the model graph movements and matching are actually performed on the transformed image.

The wavelet transform is described with reference to Figure 10. An original image is processed using a Gabor wavelet to generate a convolution result. The Gabor-based wavelet consists of a two-dimensional complex wave field modulated by a Gaussian envelope:

ψ_j(x) = (k_j^2 / σ^2) exp(-k_j^2 x^2 / (2σ^2)) [exp(i k_j · x) - exp(-σ^2 / 2)]   (1)

The wavelet is a plane wave with wave vector k_j, restricted by a Gaussian window whose size relative to the wavelength is parameterized by σ. The term in brackets removes the DC component. The amplitude of the wave vector k_j may be chosen as follows, where ν is related to the desired spatial resolutions:

k_ν = 2^(-(ν+2)/2) π,  ν = 1, 2, ...   (2)

A wavelet, centered at image position x, is used to extract the wavelet component J_j from the image with gray level distribution I(x):

J_j(x) = ∫ I(x') ψ_j(x - x') d²x'   (3)

The space of wave vectors k is typically sampled in a discrete hierarchy of five resolution levels (differing by half octaves) and eight orientations at each resolution level (see, e.g., Figure 13), so that 40 complex values are generated for each sampled image point (the real and imaginary components refer to the cosine and sine phases of the plane wave). The samples in k-space are designated by the index j = 1, ..., 40, and all wavelet components centered on a single image point are considered as a vector, which is called a jet 60. Each jet describes the local features of the area surrounding x. If sampled with sufficient density, the image can be reconstructed from jets within the bandpass covered by the sampled frequencies. Thus, each component of a jet is the filter response of a Gabor wavelet extracted at a point (x, y) of the image.
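The Gabor kernel of equation (1), the frequency spacing of equation (2), and the jet extraction of equation (3) can be sketched as follows. The kernel size, σ = 2π, and the use of scipy's FFT convolution are illustrative assumptions; a practical implementation would convolve each image once per kernel and reuse the results rather than reconvolve per pixel.

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(nu, mu, sigma=2 * np.pi, size=33):
    """Gabor wavelet (eq. 1) for frequency index nu (five levels assumed here)
    and orientation index mu (eight orientations)."""
    k = 2 ** (-(nu + 2) / 2) * np.pi                       # eq. (2)
    kx, ky = k * np.cos(mu * np.pi / 8), k * np.sin(mu * np.pi / 8)
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    envelope = (k ** 2 / sigma ** 2) * np.exp(-k ** 2 * (x ** 2 + y ** 2) / (2 * sigma ** 2))
    carrier = np.exp(1j * (kx * x + ky * y)) - np.exp(-sigma ** 2 / 2)   # DC-free
    return envelope * carrier

def wave_vectors():
    """40x2 array of wave vectors k_j, in the same order as the jet components below."""
    ks = []
    for nu in range(5):
        k = 2 ** (-(nu + 2) / 2) * np.pi
        for mu in range(8):
            ks.append([k * np.cos(mu * np.pi / 8), k * np.sin(mu * np.pi / 8)])
    return np.array(ks)

def extract_jet(image, pos):
    """40 complex wavelet responses (a jet) at pixel position pos = (row, col), eq. (3)."""
    jet = []
    for nu in range(5):
        for mu in range(8):
            resp = fftconvolve(image, gabor_kernel(nu, mu), mode="same")
            jet.append(resp[pos])
    return np.array(jet)
```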
A labeled image graph 162, as shown in Figure 11, is used to describe aspects of an object (in this context, a face). The nodes 164 of the labeled graph refer to points on the object and are labeled by jets 160. The edges 166 of the graph are labeled with distance vectors between the nodes. Nodes and edges define the graph topology. Graphs with equal geometry may be compared. The normalized dot product of the absolute components of two jets defines the jet similarity. This value is independent of changes in illumination and contrast. To compute the similarity between two graphs, the sum of the corresponding jet similarities between the graphs is taken.

A model graph 168 that is particularly designed for finding a human face in an image is shown in Figure 12. The numbered nodes of the graph have the following positions: 0 right pupil, 1 left pupil, 2 top of the nose, 3 right corner of the right eyebrow, 4 left corner of the right eyebrow, 5 right corner of the left eyebrow, 6 left corner of the left eyebrow, 7 right nostril, 8 tip of the nose, 9 left nostril, 10 right corner of the mouth, 11 center of the upper lip, 12 left corner of the mouth, 13 center of the lower lip, 14 bottom of the right ear, 15 top of the right ear, 16 top of the left ear, 17 bottom of the left ear.

To represent a face, a data structure called a bunch graph 170 is used. It is similar to the graph described above, but instead of attaching only a single jet to each node, a whole bunch of jets 172 (a bunch of jets) is attached to each node. Each jet is derived from a different facial image. To form a bunch graph, a collection of facial images (the bunch graph gallery) is marked with node positions at defined positions of the head. These defined positions are called landmarks. When matching a bunch graph to an image, each jet extracted from the image is compared with all jets in the corresponding bunch attached to the bunch graph, and the best-matching one is selected. This matching process is called elastic bunch graph matching. When constructed using a properly selected gallery, a bunch graph covers a great variety of faces that may have significantly different local properties.

In order to find a face in an image frame, the graph is moved and scaled over the image frame until a position is found at which the graph matches best (the best fitting jets within the bunch jets are most similar to the jets extracted from the image at the current node positions). Since face features differ from face to face, the graph is made more general for the task, e.g., each node is assigned jets of the corresponding landmark taken from 10 to 100 individual faces. If the graphs have relative distortion, a second term may be introduced that accounts for the geometric distortions. Two different jet similarity functions are used for two different, or even complementary, tasks. If the components of a jet J are written in the form J_j = a_j exp(i φ_j), with amplitude a_j and phase φ_j, the similarity of two jets J and J' is the normalized scalar product of the amplitude vectors:

S_a(J, J') = Σ_j a_j a'_j / sqrt(Σ_j a_j² · Σ_j a'_j²)   (4)

The other similarity function has the form:

S_φ(J, J') = Σ_j a_j a'_j cos(φ_j - φ'_j - d · k_j) / sqrt(Σ_j a_j² · Σ_j a'_j²)   (5)

This function includes a relative displacement vector d between the image points to which the two jets refer. When comparing two jets during graph matching, the similarity between them is maximized with respect to d, leading to an accurate determination of jet position.
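A minimal sketch of the two jet similarity functions of equations (4) and (5). The 40x2 array of wave vectors, k_vectors, is an assumed input (it can be produced by the wave_vectors helper sketched above).

```python
import numpy as np

def similarity_abs(jet1, jet2):
    """Phase-insensitive similarity, eq. (4): normalized dot product of amplitudes."""
    a1, a2 = np.abs(jet1), np.abs(jet2)
    return np.dot(a1, a2) / np.sqrt(np.dot(a1, a1) * np.dot(a2, a2))

def similarity_phase(jet1, jet2, d, k_vectors):
    """Phase-sensitive similarity, eq. (5), for a relative displacement d between
    the image points the two jets refer to (k_vectors: 40x2 wave vectors)."""
    a1, a2 = np.abs(jet1), np.abs(jet2)
    phase_diff = np.angle(jet1) - np.angle(jet2) - k_vectors @ np.asarray(d)
    return np.sum(a1 * a2 * np.cos(phase_diff)) / np.sqrt(np.sum(a1**2) * np.sum(a2**2))
```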
Both similarity functions are used, with preference often given to the phase-insensitive version (which varies smoothly with relative position) when first matching a graph, and to the phase-sensitive version when the jet is to be positioned accurately. The coarse-to-fine landmark finding approach, shown in Figure 14, uses graphs having fewer nodes and kernels on lower resolution images. After coarse landmark finding has been carried out, higher precision localization may be performed on higher resolution images for accurately finding a particular facial feature. The responses of the Gabor convolutions are complex numbers that are usually stored as absolute (amplitude) and phase values, because comparison of Gabor jets can be performed more efficiently if the values are represented in that domain rather than in the real-imaginary domain. Typically, the absolute and phase values are stored as floating point values, and the calculations are then made using floating point arithmetic. The phase values vary within the range of -π to π, where -π is equivalent to π, so that the number distribution can be visualized on a circular axis, as shown in Figure 15. Whenever a phase value leaves this range, e.g., due to addition or subtraction of a constant phase value, the resulting value must be readjusted to lie within this range, which requires greater computational effort than the simple floating point addition itself.
The integer representation, and the related integer arithmetic, provided by most common processors is two's complement. Because this representation has a finite range, overflow or underflow can occur in addition and subtraction operations. The maximum positive value of a two-byte integer is 32767; adding one yields a number that actually represents -32768. Thus, the wrap-around behavior of two's complement integer arithmetic is very close to the requirements of phase arithmetic, and the phase values can therefore be represented by two-byte integers. The phase values φ are mapped onto integer values I as shown in Figure 16. The value in the range of -π to π is seldom needed during the comparison and matching stages described below, so the mapping between [-π, π] and the two-byte integer range does not have to be computed very often. Phase additions and subtractions, however, occur very frequently. These calculations are much faster using the processor-adapted integer range, so this adaptation technique can significantly improve the computational speed of the processor.

After the facial features and landmarks have been located, the facial features may be tracked over consecutive frames, as illustrated in Figures 17 and 18. The tracking technique of the invention achieves robust tracking over long frame sequences by using a tracking correction scheme that detects whether tracking of a feature or node has been lost and reinitializes the tracking process for that node. The position X_n of a single node in an image I_n of an image sequence is known either from landmark finding on image I_n using the landmark finding method (block 180) described above, or from tracking the node from image I_(n-1) to I_n using the tracking process. The node is then tracked (block 182) to a corresponding position X_(n+1) in the image I_(n+1) by one of several techniques. The tracking methods described below are advantageously adapted to fast motion.

A first tracking technique involves linear motion prediction. The search for the corresponding node position X_(n+1) in the new image I_(n+1) is started at a position generated by a motion estimator. A disparity vector (X_n - X_(n-1)) is calculated that represents the displacement, assuming constant velocity, of the node between the two preceding frames. The disparity or displacement vector D_n can be added to the position X_n to predict the node position X_(n+1). This linear motion model is particularly advantageous for accommodating constant-velocity motion. The linear motion model also provides good tracking if the frame rate is high compared to the acceleration of the objects being tracked. However, the linear motion model performs poorly if the frame rate is too low relative to the acceleration of the objects between frames in the image sequence. Because it is difficult for any motion model to track objects under such conditions, use of a camera having a higher frame rate is recommended.
The linear motion model may generate an estimated motion vector D_n that is too large, which could lead to an accumulation of error in the motion estimation. Therefore, the linear prediction may be damped using a damping factor f_D. The resulting estimated motion vector is D_n = f_D * (X_n - X_(n-1)). A suitable damping factor is 0.9. If there is no previous frame I_(n-1), e.g., for a frame immediately after landmark finding, the estimated motion vector is set to zero (D_n = 0).

Figure 19 illustrates a tracking technique based on a Gaussian image pyramid, applied to one dimension. Rather than using the original image resolution, the image is down-sampled 2-4 times to create a Gaussian image pyramid. An image pyramid of 4 levels results in a distance of 24 pixels on the finest, original resolution level being represented as only 3 pixels on the coarsest level. Jets may be computed and compared at any level of the pyramid.
Tracking of a node over the Gaussian image pyramid is generally performed first at the coarsest level and then proceeding to the finest level. A jet is extracted on the coarsest Gauss level of the actual image frame I_(n+1) at the position X_(n+1) computed using the damped linear motion estimate X_(n+1) = X_n + D_n, as described above, and is compared with the corresponding jet computed on the coarsest Gauss level of the previous image frame. From these two jets the disparity is determined, i.e., the 2D vector R pointing from X_(n+1) to the position that corresponds best to the jet from the previous frame. This new position is assigned to X_(n+1). The disparity calculation is described below in more detail. The position on the next finer Gauss level of the actual image (being 2*X_(n+1)), corresponding to the position X_(n+1) on the coarsest Gauss level, is the starting point for the disparity computation on this next finer level. The jet extracted at this point is compared with the corresponding jet computed on the same Gauss level of the previous image frame. This process is repeated for all Gauss levels until the finest resolution level is reached, or until the Gauss level specified for determining the position of the node corresponding to the previous frame's position is reached.

Two representative levels of the Gaussian image pyramid are shown in Figure 19, a coarser level 194 above and a finer level 196 below. Each jet is assumed to have filter responses for two frequency levels. Starting at position 1 on the coarsest Gauss level, X_(n+1) = X_n + D_n, a first disparity move using only the lowest-frequency jet coefficients leads to position 2. A second disparity move, using all jet coefficients of both frequency levels, leads to position 3, the final position on this Gauss level. Position 1 on the finer Gauss level corresponds to position 3 on the coarser level, with the coordinates doubled. The disparity move sequence is repeated, and position 3 on the finest Gauss level is the final position of the tracked landmark. After the new position of the tracked node in the actual image frame has been determined, the jets on all Gauss levels are computed at this position. A stored array of jets that was computed for the previous frame, representing the tracked node, is then replaced by a new array of jets computed for the current frame.

Use of the Gauss image pyramid has two main advantages: First, node movements are much smaller in terms of pixels on a coarser level than they would be in the original image, which makes tracking possible by performing only a local move instead of an exhaustive search in a large image region. Second, the computation of jet components is much faster for lower frequencies, because the computation is performed with a small kernel window on a down-sampled image, rather than with a large kernel window on the original resolution image. Note that the correspondence level may be chosen dynamically, e.g., in the case of tracking facial features, the correspondence level may be chosen depending on the actual size of the face. Also, the size of the Gauss image pyramid may be altered through the tracking process, i.e., the size may be increased when motion becomes faster and decreased when motion becomes slower.
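A sketch, under simplifying assumptions, of the coarse-to-fine node tracking just described: the node position predicted by the damped linear motion model is refined level by level from the coarsest pyramid level to the finest. Here extract_jet and wave_vectors are the helpers sketched earlier, estimate_displacement is the phase-based disparity routine sketched after the next paragraph, and the pyramid uses crude subsampling rather than proper Gaussian filtering.

```python
import numpy as np

def build_pyramid(image, levels=4):
    """Image pyramid; each level is down-sampled by a factor of two
    (a Gaussian blur would normally precede the subsampling)."""
    pyramid = [np.asarray(image, float)]
    for _ in range(levels - 1):
        pyramid.append(pyramid[-1][::2, ::2])
    return pyramid

def track_node(prev_jets, pyramid, x_pred, k_vectors):
    """Coarse-to-fine tracking of one node: start at the damped linear
    prediction x_pred (finest-level coordinates), refine level by level.
    prev_jets[level] holds the jet stored for this node at that level."""
    x = np.asarray(x_pred, float) / 2 ** (len(pyramid) - 1)     # coarsest-level coords
    for level in reversed(range(len(pyramid))):                 # coarsest first
        jet = extract_jet(pyramid[level], tuple(np.round(x).astype(int)))
        d, _conf = estimate_displacement(prev_jets[level], jet, k_vectors)
        x = x + d
        if level > 0:
            x = 2.0 * x                                          # map to next finer level
    return x   # new node position X_(n+1) at the finest resolution
```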
Typically, the maximum node movement on the coarsest Gauss level is limited to a range of 1 to 4 pixels. Also note that the motion estimation is often performed only on the coarsest level.

The computation of the displacement vector between two given jets on the same Gauss level (the disparity vector) is now described. To compute the displacement between two consecutive frames, a method is used that was originally developed for disparity estimation in stereo images, based on D. J. Fleet and A. D. Jepson, "Computation of component image velocity from local phase information", International Journal of Computer Vision, vol. 5, issue 1, pages 77-104, 1990, and on W. M. Theimer and H. A. Mallot, "Phase-based binocular vergence control and depth reconstruction using active vision", CVGIP: Image Understanding, vol. 60, issue 3, pages 343-358, November 1994. The strong variation of the phases of the complex filter responses is used explicitly to compute the displacement with subpixel accuracy (see Wiskott, L., "Labeled Graphs and Dynamic Link Matching for Face Recognition and Scene Analysis", Verlag Harri Deutsch, Thun-Frankfurt am Main, Reihe Physik 53, PhD Thesis, 1995). By writing the response J_j of the j-th Gabor filter in terms of amplitude a_j and phase φ_j, a similarity function can be defined as in equation (5):

S(J, J', d) = Σ_j a_j a'_j cos(φ_j - φ'_j - d · k_j) / sqrt(Σ_j a_j² · Σ_j a'_j²)

Let J and J' be two jets at positions X and X' = X + d; the displacement d may be found by maximizing the similarity S with respect to d, where the k_j are the wave vectors associated with the filters generating J_j. Because the estimate of d is only precise for small displacements, i.e., for a large overlap of the Gabor jets, large displacement vectors are treated as a first estimate only and the process is repeated in the following manner. First, only the filter responses of the lowest frequency level are used, resulting in a first estimate d_1. Next, this estimate is applied and the jet J is recomputed at the position X_1 = X + d_1, which is closer to the position X' of jet J'. Then, the two lowest frequency levels are used for the estimation of the displacement d_2, and the jet J is recomputed at the position X_2 = X_1 + d_2. This is repeated until the highest frequency level used is reached, and the final disparity d between the two initial jets J and J' is given as the sum d = d_1 + d_2 + .... In this manner, displacements of up to half the wavelength of the kernel with the lowest frequency can be computed (see Wiskott 1995, supra).

Although the displacements are determined using floating point numbers, jets may be extracted (i.e., computed by convolution) at (integer) pixel positions only, resulting in a systematic rounding error. To compensate for this subpixel error Δd, the phases of the complex Gabor filter responses should be shifted according to

Δφ_j = Δd · k_j   (6)

so that the jets appear as if they were extracted at the correct subpixel position. Accordingly, the Gabor jets may be tracked with subpixel accuracy without any further accounting for rounding errors. Note that Gabor jets provide a substantial advantage in image processing because the problem of subpixel accuracy is more difficult to address in most other image processing methods.

A tracking error also may be detected by determining whether a confidence or similarity value is less than a predetermined threshold (block 184 of Figure 17).
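A sketch of the displacement estimation between two jets: the phase-sensitive similarity of equation (5) is maximized over candidate displacements d. A simple grid search is shown here for clarity; the method described in the text instead solves a linearized system, iterates from low to high frequencies, and applies the subpixel phase correction of equation (6), which is faster and more accurate. The search range and step size are assumptions.

```python
import numpy as np

def estimate_displacement(jet_prev, jet_curr, k_vectors, max_shift=2.0, step=0.25):
    """Displacement d that maximizes the phase-sensitive similarity (eq. 5)
    between two jets; returns the displacement and the peak similarity,
    which can double as a tracking confidence."""
    a1, a2 = np.abs(jet_prev), np.abs(jet_curr)
    phi = np.angle(jet_prev) - np.angle(jet_curr)
    norm = np.sqrt(np.sum(a1**2) * np.sum(a2**2))
    best_d, best_s = np.zeros(2), -np.inf
    for dx in np.arange(-max_shift, max_shift + step, step):
        for dy in np.arange(-max_shift, max_shift + step, step):
            d = np.array([dx, dy])
            s = np.sum(a1 * a2 * np.cos(phi - k_vectors @ d)) / norm
            if s > best_s:
                best_s, best_d = s, d
    return best_d, best_s
```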
The similarity (or confidence) value S can be calculated to indicate how well two image regions in the two image frames correspond to each other, simultaneously with the calculation of the displacement of a node between consecutive image frames. Typically, the confidence value is close to 1, indicating good correspondence. If the confidence value is not close to 1, either the corresponding point in the image has not been found (e.g., because the frame rate was too low compared to the velocity of the moving object), or this image region has changed so drastically from one image frame to the next that the correspondence is no longer well defined (e.g., for the node tracking the pupil of the eye when the eyelid has closed). Nodes having a confidence value below a certain threshold may be switched off.

A tracking error also may be detected when certain geometric constraints are violated (block 186). If many nodes are tracked simultaneously, the geometric configuration of the nodes may be checked for consistency. Such geometric constraints may be fairly loose, e.g., when facial features are tracked, the nose must be between the eyes and the mouth. Alternatively, such geometric constraints may be rather accurate, e.g., a model containing precise information about the tracked face. For intermediate accuracy, the constraints may be based on a flat plane model. In the flat plane model, the nodes of the face graph are assumed to lie in a flat plane. For image sequences that start with a frontal view, the tracked node positions may be compared with the corresponding node positions of the frontal graph transformed by an affine transformation to the actual frame. The 6 parameters of the optimal affine transformation are found by minimizing the least-squares error in the node positions. Deviations between the tracked node positions and the transformed node positions are compared with a threshold. Nodes having deviations larger than the threshold are switched off. The parameters of the affine transformation may be used to determine the relative pose and scale (compared to the initial graph) simultaneously (block 188). Thus, this general flat plane model ensures that tracking errors may not grow beyond a predetermined threshold.

If a tracked node is switched off because of a tracking error, the node may be reactivated at the correct position (block 190), advantageously using bunch graphs that include different poses, and tracking is continued from the corrected position (block 192). After a tracked node has been switched off, the system may wait until a predefined pose is reached for which a pose-specific bunch graph exists. Otherwise, if only a frontal bunch graph is stored, the system must wait until the frontal pose is reached to correct any tracking errors. The stored bunch of jets may be compared with the image region surrounding the fit position (e.g., obtained from the flat plane model); this works in the same way as tracking, except that, instead of comparing with the jet of the previous image frame, the comparison is repeated with all jets of the example bunch, and the most similar one is taken. Because the facial features are known (e.g., the actual pose, the scale and even the rough position), graph matching or an exhaustive search in the image and/or pose space is not needed, and node tracking correction can be performed in real time.
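The flat plane model consistency check described above can be sketched as a least-squares affine fit followed by a residual test; the threshold value and array layout are illustrative assumptions.

```python
import numpy as np

def check_planar_model(frontal_nodes, tracked_nodes, max_dev=5.0):
    """Fit the affine transform (6 parameters) that best maps the frontal-graph
    node positions onto the tracked positions (least squares), then switch off
    nodes whose residual exceeds the threshold."""
    F = np.hstack([np.asarray(frontal_nodes, float), np.ones((len(frontal_nodes), 1))])
    T = np.asarray(tracked_nodes, float)
    A, *_ = np.linalg.lstsq(F, T, rcond=None)      # 3x2 matrix: the affine parameters
    residuals = np.linalg.norm(F @ A - T, axis=1)  # deviation per node
    active = residuals <= max_dev                  # nodes that remain active
    return A, active
```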
For tracking correction, bunch graphs are not needed for many different poses and scales, because rotation in the image plane as well as scale may be taken into account by transforming either the local image region or the jets of the bunch graph accordingly, as shown in Figure 20. In addition to the frontal pose, bunch graphs need to be created only for rotations in depth. The speed of the reinitialization process can be increased by taking advantage of the fact that the identity of the tracked person remains the same during an image sequence. Accordingly, in an initial learning session, a first sequence of the person may be taken in which the person shows a full repertoire of frontal facial expressions. This first sequence may be tracked with high accuracy using the tracking and correction scheme described above, based on a large generalized bunch graph that contains knowledge about many different people. This process may be performed offline and generates a new personalized bunch graph. The personalized bunch graph may then be used to track this person at fast speed in real time, because the personalized bunch graph is much smaller than the larger generalized bunch graph.

The speed of the reinitialization process can also be increased by using a partial bunch graph reinitialization. A partial bunch graph contains only a subset of the nodes of a full bunch graph. The subset may be as small as a single node.

A pose estimation bunch graph makes use of a family of two-dimensional bunch graphs defined in the image plane. The different graphs within one family account for different poses and/or scales of the head. The landmark finding process attempts to match each bunch graph from the family to the input image in order to determine the pose or size of the head in the image. An example of such a pose estimation procedure is shown in Figure 21. The first step of the pose estimation is equivalent to regular landmark finding. The image (block 198) is transformed (blocks 200 and 202) in order to use the graph similarity functions. Then, instead of only one, a family of three bunch graphs is used. The first bunch graph contains only frontal-pose faces (equivalent to the frontal view described above), and the other two bunch graphs contain quarter-rotated faces (one representing rotations to the left and the other to the right). As before, the initial positions for each of the graphs is the upper left corner, and the positions of the graphs are scanned over the image; the position and the graph returning the greatest similarity after landmark finding is selected (blocks 204-214).
After the initial matching for each graph, the similarities of the final positions are compared (block 216). The graph that best corresponds to the pose given in the image will have the greatest similarity (block 218). In Figure 21, the graph for the face rotated to the left provides the best fit to the image, as indicated by its similarity. Depending on the resolution and the degree of rotation of the face in the image, the similarity of the correct graph and of the graphs for other poses may vary, becoming very close when the face is approximately halfway between two poses for which graphs have been defined. By creating bunch graphs for more poses, a finer pose estimation procedure may be implemented that can distinguish between more degrees of head rotation, as well as rotations in other directions (e.g., up or down).

In order to robustly find a face at an arbitrary distance from the camera, a similar approach may be used in which two or three bunch graphs, each having a different scale, are employed. The face in the image is then assumed to have the same scale as the bunch graph that most closely matches the facial image.

Three-dimensional (3D) landmark finding techniques related to the technique described above also may use multiple bunch graphs adapted to different poses. However, the 3D approach employs only one bunch graph defined in 3D space. The geometry of the 3D graph reflects an average face or head geometry. By extracting jets from images of the faces of several people at different degrees of rotation, a 3D bunch graph is generated that is analogous to the 2D approach. Each jet is now parameterized with the three rotation angles. As in the 2D approach, the nodes are located at the fiducial points of the head surface. Projections of the 3D graph are then used in the matching process. An important generalization of the 3D approach is that every node has attached a parameterized family of bunch jets adapted to different poses. A second generalization is that the graph may undergo Euclidean transformations in 3D space and not only transformations in the image plane.

The 3D graph matching process may be formulated as a coarse-to-fine approach that first uses graphs with fewer nodes and kernels and, in subsequent steps, uses denser graphs. The coarse-to-fine approach is particularly suitable if high precision localization of the feature points in certain areas of the face is desired. Thus, computational effort is saved by adopting a hierarchical approach in which landmark finding is first performed at a coarser resolution, and graphs adapted to a higher resolution are later used to analyze certain regions in finer detail. Further, the computational workload can easily be split on a multi-processor machine so that, once the coarse regions are found, a few child processes start working in parallel, each on its own part of the whole image. At the end of the child processes, the processes communicate the coordinates of the features they have located to the master process, which scales and combines them appropriately to fit back within the original image, thereby considerably reducing the total computation time.

Numerous methods have been developed for constructing textured 3D models of heads. This section describes an approach based on stereo vision. The stereo-based algorithms are described for the case of fully calibrated cameras.
The algorithms perform an area-based matching of the image pixels and are suitable where dense 3D information is needed; this information can then be used to build a more precise object description. Additional background information regarding stereo imaging and matching can be found in U. R. Dhond and J. K. Aggarwal, "Structure from Stereo: a Review", IEEE Transactions on Systems, Man, and Cybernetics, 19(6), pp. 1489-1510, 1989; or, more recently, in R. Sara and R. Bajcsy, "On Occluding Contour Artifacts in Stereo Vision", Proc. Int. Conf. Computer Vision and Pattern Recognition, IEEE Computer Society, Puerto Rico, 1997; M. Okutomi and T. Kanade, "Multiple-baseline Stereo", IEEE Trans. on Pattern Analysis and Machine Intelligence, 15(4), pp. 353-363, 1993; P. Belhumeur, "A Bayesian Approach to Binocular Stereopsis", Intl. J. of Computer Vision, 19(3), pp. 237-260, 1996; Roy, S. and Cox, I., "Maximum-Flow Formulation of the N-camera Stereo Correspondence Problem", Proc. Int. Conf. Computer Vision, Narosa Publishing House, Bombay, India, 1998; Scharstein, D. and Szeliski, R., "Stereo Matching with Non-Linear Diffusion", Proc. Int. Conf. Computer Vision and Pattern Recognition, IEEE Computer Society, San Francisco, CA, 1996; and Tomasi, C. and Manduchi, R., "Stereo without Search", Proc. European Conf. on Computer Vision, Cambridge, United Kingdom, 1996.

An important issue in stereo vision is known as the correspondence (matching) problem: that is, to recover range data from a binocular stereo image, the corresponding projections of the 3D spatial points must be found in the left and right images. To reduce the dimension of the search space, an epipolar constraint is applied (see S. Maybank and O. Faugeras, "A Theory of Self-Calibration of a Moving Camera", Intl. J. of Computer Vision, 8(2), pp. 123-151, 1992). Stereo can be formulated as a four-step process:
- calibration: compute the camera parameters.
- rectification: the stereo pair is projected so that corresponding features in the two images lie on the same image lines; these lines are called epipolar lines. This step is not absolutely necessary, but it greatly simplifies the algorithm, since the matching process can then be performed as a one-dimensional search along horizontal lines in the rectified images.
- matching: a cost function is computed locally for each position within a search window, and maximum correlation is used to select corresponding pixels in the stereo pair.
- reconstruction: 3-D coordinates are computed from the matched pixel coordinates in the stereo pair.
Post-processing can be added just after matching in order to remove matching errors. The possible errors resulting from matching ambiguities are mainly due to the fact that the matching is performed locally. Various geometric constraints as well as filtering can be applied to reduce the number of false matches. When working with continuous surfaces (for example, a face in frontal position), interpolation can also be used to recover unmatched areas (mainly untextured areas, where the correlation score does not have a clear monomodal maximum). The formalism that leads to the equations used in the rectification and reconstruction processes is called projective geometry and is presented in detail in O. Faugeras, "Three-Dimensional Computer Vision, A Geometric Viewpoint", MIT Press, Cambridge, Massachusetts, 1993. The model used provides significant advantages. Generally, the assumption of a pinhole camera model is made, as shown in Figure 22. If needed, lens distortion can also be computed at calibration time (the most important factor being the radial distortion of the lenses). From a practical point of view, calibration is performed using a calibration aid, that is, an object with a known 3-D structure. Usually, a cube with visible points or a square pattern is used as the calibration aid, as shown in Figure 23. To simplify the matching algorithms, the images of each stereoscopic pair are first rectified (see N. Ayache and C. Hansen, "Rectification of Images for Binocular and Trinocular Stereovision", Proc. of 9th International Conference on Pattern Recognition, 1, pp. 11-16, Italy, 1988), so that corresponding points lie on the same image lines. Then, by definition, corresponding points have coordinates (UL, VL) and (UL - d, VL) in the rectified left and right images, where "d" is known as the disparity. For details regarding the rectification process, reference is made to Faugeras, supra. It is important to choose the rectification plane (the plane used to project the images to obtain the rectified images) appropriately. Usually, this plane is chosen to minimize the distortion of the projected images and so that corresponding pixels are located along the same line number (epipolar lines that are parallel and aligned), as shown in Figure 24. Such a configuration is called standard geometry.
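In the standard geometry just described, recovering depth from disparity reduces to the well-known textbook relation Z = f·B/d (focal length times baseline divided by disparity). The following sketch applies that relation; the parameter names and the handling of zero disparity are illustrative assumptions, not the exact reconstruction procedure of the embodiment.

```python
import numpy as np

# Depth from disparity in standard (rectified, parallel-camera) geometry:
# Z = f * B / d, with focal length f in pixels and baseline B in meters.
# Invalid (zero or negative) disparities are mapped to infinite depth.

def depth_from_disparity(disparity_map, focal_length_px, baseline_m):
    disparity = np.asarray(disparity_map, dtype=np.float64)
    depth = np.full(disparity.shape, np.inf)
    valid = disparity > 0
    depth[valid] = focal_length_px * baseline_m / disparity[valid]
    return depth
```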
With reference to Figure 26, matching is the process of finding corresponding points in the left and right images. Several correlation functions can be used to measure the match; for example, the normalized cross-correlation (see H. Moravec, "Robot Rover Visual Navigation", Computer Science, Artificial Intelligence, pp. 13-15, 105-108, UMI Research Press, 1980/1981) is given by: C(IL, IR) = 2 cov(IL, IR) / (var(IL) + var(IR)) (6) where IL and IR are the rectified left and right images. The correlation function is applied over a rectangular window around the points (UL, VL) and (UR, VR). The cost function C(IL, IR) is calculated, as shown in Figure 25, over a search interval of size 1xN (thanks to the rectification process), where N is the admissible number of disparities. For each pixel (UL, VL) in the left image, matching produces a correlation profile C(UL, VL, d), where "d" is defined as the disparity at the point (UL, VL), that is: du = UR - UL (7) dv = 0 (8) The second equation expresses the fact that the epipolar lines are aligned. As a result of the matching procedure, a disparity map, or disparity image, is obtained that can be superimposed on a base image (here, the left image of the stereoscopic pair). The disparity map tells us "how far to move along the epipolar line to find the corresponding pixel in the right image of the stereoscopic pair". Several refinements can be used at matching time. For example, at each point a list of possible correspondences can be maintained, and constraints such as the visibility constraint, the ordering constraint and the disparity gradient constraint (A. Yuille and T. Poggio, "A Generalized Ordering Constraint for Stereo Correspondence", MIT, Artificial Intelligence Laboratory Memo, No. 777, 1984; Dhond et al., supra; and Faugeras, supra) can be applied to remove impossible configurations (see R. Sara et al., 1997, supra). One can also use cross-matching, where the match is made from left to right and then from right to left, and a candidate (correlation peak) is accepted if both matches lead to the same image pixel, that is, if dLR = UL - UR = -dRL (9) where dLR is the disparity found in the left-to-right match and dRL that found in the right-to-left match. In addition, a pyramid strategy can be used to help the entire matching process by restricting the search interval. This is implemented by carrying out the match at each level of a resolution pyramid, using the estimate from the preceding level. Note that such a hierarchical scheme also promotes surface continuity. Note also that when stereo is used for 2-D segmentation purposes, only a disparity map is needed. One can then avoid using the calibration process described previously and use a result of projective geometry (see Q.-T. Luong, "Fundamental Matrix and Autocalibration in Computer Vision", Ph.D. Thesis, University of Paris Sud, Orsay, France, December 1992) which shows that rectification can be obtained if the fundamental matrix is available. The fundamental matrix can then be used to rectify the images, so that the matching can be carried out as previously described. To refine the 3-D position estimates, a subpixel correction of the integer disparity map is calculated, resulting in a subpixel disparity map. The subpixel disparity can be obtained either by using a second-order interpolation of the correlation scores around the detected maximum, or by using a more general solution, as described in F. Devernay, "Computing Differential Properties of 3-D Shapes from Stereoscopic Images without 3-D
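A minimal sketch of the window-based matching described above follows. It assumes rectified grayscale images stored as NumPy arrays, computes the correlation score of equation (6) over a rectangular window, and performs the one-dimensional disparity search along the epipolar line; the window size and disparity range are illustrative choices, and the left-right cross-check of equation (9) is omitted for brevity.

```python
import numpy as np

# Window-based matching along rectified epipolar lines using the score of
# equation (6): C = 2*cov(IL, IR) / (var(IL) + var(IR)).
# IL and IR are rectified grayscale images of equal size; the pixel (ul, vl)
# is assumed to lie far enough from the image border for the window to fit.

def correlation_score(wl, wr):
    wl = wl.astype(np.float64).ravel()
    wr = wr.astype(np.float64).ravel()
    denom = wl.var() + wr.var()
    if denom == 0.0:
        return 0.0
    cov = np.mean((wl - wl.mean()) * (wr - wr.mean()))
    return 2.0 * cov / denom

def match_pixel(IL, IR, ul, vl, half_window=3, max_disparity=32):
    """Return the integer disparity d maximizing C(ul, vl, d) for one left pixel."""
    h = half_window
    left_win = IL[vl - h:vl + h + 1, ul - h:ul + h + 1]
    best_d, best_score = 0, -np.inf
    for d in range(0, max_disparity + 1):
        ur = ul - d                      # corresponding column in the right image
        if ur - h < 0:
            break
        right_win = IR[vl - h:vl + h + 1, ur - h:ur + h + 1]
        score = correlation_score(left_win, right_win)
        if score > best_score:
            best_d, best_score = d, score
    return best_d, best_score
```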
Models", INRIA, RR-2304, Sophia Antipolis, 1994 (which takes into account the distortion between the left and right correlation windows induced by the perspective projection, assuming that a planar patch of surface is being imaged). The first solution is the fastest, while the second provides a more reliable estimate of the subpixel disparity. To obtain a rapid subpixel estimate while preserving the accuracy of the estimate, one can proceed as follows. Assume that IL and IR are the left and right rectified images, that "e" is the unknown subpixel correction, and that A(u, v) is the transformation mapping the correlation window of the left image onto the right one (for a planar patch it is an affine mapping that preserves the image rows). For corresponding pixels in the left and right images, IR(UL - d + e, VL) = a IL(A(UL, VL)) (10) where the coefficient "a" takes into account possible differences in camera gain. A first-order linear approximation of the previous formula with respect to "e" and "A" provides a linear system in which each coefficient is estimated over the corresponding left and right correlation windows. A least-squares solution of this linear system provides the subpixel correction. Note that when a continuous surface is to be recovered (such as a face in frontal pose), an interpolation scheme can be used over the filtered disparity map. Such a scheme can be derived from the following considerations. Since the assumption is made that the underlying surface is continuous, the interpolated and smoothed disparity map d' must minimize the following functional: min { ∫∫ [ (d' - d)^2 + λ (∇d')^2 ] du dv } (11) where d is the disparity map obtained from matching, λ is a smoothness parameter and the integration is taken over the image (for pixel coordinates u and v). An iterative algorithm is obtained directly by using the Euler equations and an approximation of the Laplacian operator. From the disparity map and the camera calibration, the spatial position of the 3D points is calculated based on triangulation (see Dhond et al., supra). The result of the reconstruction (from a single pair of stereoscopic images) is a list of spatial points. In the case where several images are used (polynocular stereo), a verification step can be used (see R. Sara, "Reconstruction of 3-D Geometry and Topology from Polynocular Stereo", http://cmp.felk.cvut.cz/sara). During this procedure, the set of points reconstructed from all the stereoscopic pairs is reprojected back into the disparity space of all the camera pairs, and it is verified whether the projected points coincide with the position predicted in the other image of each pair. The verification appears to eliminate outliers (especially matching artifacts near occlusions) very effectively. Figure 26 shows a typical result of applying a stereoscopic algorithm to a stereoscopic pair of images obtained by projecting textured light. The upper row of Figure 26 shows the left image, the right image and the color image, taken within a short time interval to ensure that the subject does not move. The lower row shows two views of the reconstructed face model obtained by applying the stereo algorithm to the textured images, with the texture mapped from the color image. Note that interpolation and filtering have been applied to the disparity map, so that the reconstruction of the face is smooth and continuous.
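The first subpixel option mentioned above (second-order interpolation of the correlation scores around the detected maximum) is commonly implemented as a parabolic fit through three samples. The following sketch illustrates that standard refinement under that assumption; it is not necessarily the exact formulation used in the embodiment.

```python
# Subpixel refinement by second-order (parabolic) interpolation of the
# correlation scores around the detected integer maximum. `scores` is the
# correlation profile C(ul, vl, d) for d = 0..N and `d_best` its argmax.

def subpixel_disparity(scores, d_best):
    if d_best <= 0 or d_best >= len(scores) - 1:
        return float(d_best)            # no neighbors available, keep integer value
    c_minus = scores[d_best - 1]
    c_zero = scores[d_best]
    c_plus = scores[d_best + 1]
    denom = c_minus - 2.0 * c_zero + c_plus
    if denom == 0.0:
        return float(d_best)
    # Vertex of the parabola fitted through the three samples.
    offset = 0.5 * (c_minus - c_plus) / denom
    return d_best + offset
```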
Note also that the results are shown as an untreated set of points obtained from the stereo reconstruction; these points can be meshed together to obtain a continuous surface, for example by using a meshing algorithm. The node positions found in an image can be compared with the beams extracted from stored gallery images; either full graphs are compared, as is the case for face recognition applications, or only partial graphs or even individual nodes. Before the beams are extracted for the actual comparison, a number of image normalizations are applied. One such normalization is called background suppression. It is necessary to suppress the influence of the background on the probe image because different backgrounds between the probe and the gallery images decrease the similarity and frequently lead to erroneous classifications. Therefore, the nodes and edges that surround the face are taken as the face boundary. Background pixels undergo a smooth fade toward a constant tone as their distance from the face increases. Each pixel value outside the head is modified as a function of the Euclidean distance of the pixel position from the nearest edge of the graph, fading toward a constant background gray value c with a constant fall-off length; of course, other functional dependencies between pixel values and the distance to the graph boundary are also possible. As shown in Figure 28, automatic background suppression smoothly pulls the gray value toward a constant as the distance from the nearest edge increases. This method still leaves a background region surrounding the visible face, but it avoids the strong artificial edges in the image that would occur if this region were simply filled with a constant gray value. Although the foregoing has been set forth with reference to specific embodiments of the invention, those skilled in the art will appreciate that these are only illustrative and that changes may be made to these embodiments without departing from the principles of the invention, the scope of which is defined by the appended claims.

Claims (24)

1. A process for recognizing objects in an image frame, characterized in that it comprises: detecting an object in the image frame and joining a portion of the image frame associated with the object; transforming the joined portion of the image frame using a wavelet transform to generate a transformed image; locating, on the transformed image, nodes associated with differentiating features of the object defined by the wavelet beams of a group graph generated from a plurality of representative object images; identifying the object based on a similarity between the wave train beams associated with an object image in a gallery of object images and the wave train beams at the nodes of the transformed image.
2. The process for recognizing objects, according to claim 1, characterized in that it further comprises sizing and centering the detected object within the attached portion of the image frame so that the detected object has a predetermined size and location within the attached portion.
3. The process for recognizing objects, according to claim 1, characterized in that it further comprises deleting background portions of the attached portion of the image frame not associated with the object before identifying the object.
4. The process for recognizing objects, according to claim 3, characterized in that the suppressed background portions are gradually suppressed near the edges of the object in the joined portion of the image frame.
5. The process for recognizing objects, according to claim 1, characterized in that the object is the head of a person showing a facial region.
6. The process for recognizing objects, according to claim 1, characterized in that the group graph is based on a three-dimensional representation of the object.
7. The process for recognizing objects, according to claim 1, characterized in that the wave train transformation is performed using phase calculations that are carried out using hardware adapted for phase representation.
8. The process for recognizing objects, according to claim 1, characterized in that the location step is performed using a coarse to fine approach.
9. The process for recognizing objects, according to claim 1, characterized in that the group graph is based on predetermined poses.
10. The process for recognizing objects, according to claim 1, characterized in that the identification step uses a three-dimensional representation of the object.
11. A process for recognizing objects in an image frame sequence, characterized in that it comprises: detecting an object in the image frames and joining a portion of each image frame associated with the object; transforming the attached portion of each image frame using a wavelet transform to generate a transformed image; locating, on the transformed images, nodes associated with differentiating features of the object defined by the wave train beams of a group graph generated from a plurality of representative object images; identifying the object based on a similarity between wave train beams associated with an object image in a gallery of object images and beams of wave trains in the nodes on the transformed images.
12. The process for recognizing objects, according to claim 11, characterized in that the step of detecting an object further comprises tracking the object between image frames based on a path associated with the object.
13. The process for recognizing objects, according to claim 11, characterized in that it further comprises a preselection process that chooses the most suitable view of an object from a sequence of views belonging to a particular path.
14. The process for recognizing objects, according to claim 11, characterized in that the step of locating the nodes includes tracking the nodes between image frames.
15. The process for recognizing objects, according to claim 14, characterized in that it further comprises reinitializing a tracked node if the position of the node deviates beyond a predetermined position restriction between image frames.
16. The process for recognizing objects, according to claim 15, characterized in that the predetermined position restriction is based on a geometric position restriction associated with the relative positions between the node positions.
17. The process for recognizing objects, according to claim 11, characterized in that the image frames are stereoscopic images and the detection step includes generating a disparity histogram and a silhouette image to detect the object.
18. The process for recognizing objects, according to claim 17, characterized in that the disparity histogram and the silhouette image generate convex regions which are associated with the movement of the head and which are detected by a convex detector.
19. The process for recognizing objects, according to claim 11, characterized in that the wave train transformations are carried out using phase calculations that are performed using hardware adapted for phase representation.
20. The process for recognizing objects, according to claim 11, characterized in that the group graph is based on a three-dimensional representation of the object.
21. The process for recognizing objects, according to claim 11, characterized in that the placement step is carried out using a coarse to fine approach.
22. The process for recognizing objects, according to claim 11, characterized in that the group graph is based on predetermined poses.
23. An apparatus for recognizing objects in an image frame, characterized in that it comprises: means for detecting an object in the image frame and joining a portion of the image frame associated with the object; means for transforming the attached portion of the image frame using a wave train transformation to generate a transformed image; means for locating, on the transformed image, nodes associated with differentiating features of the object defined by wave train beams of a group graph generated from a plurality of representative object images; and means for identifying the object based on a similarity between the wave train beams associated with an object image in a gallery of object images and the wave train beams at the nodes on the transformed image.
24. An apparatus for recognizing objects in an image frame sequence, characterized in that it comprises: means for detecting an object in the image frames and joining a portion of each image frame associated with the object; means for transforming the attached portion of each image frame using a wavelet transformation to generate a transformed image; means for locating, on the transformed images, nodes associated with differentiating features of the object defined by beams of wave trains of a group graph generated from a plurality of representative object images; and means for identifying the object based on a similarity between wave train beams associated with an object image in a gallery of object images and the beams of wave trains at the nodes on the transformed images.