
CN113378649A - Identity, position and action recognition method, system, electronic equipment and storage medium - Google Patents

Identity, position and action recognition method, system, electronic equipment and storage medium

Info

Publication number
CN113378649A
CN113378649A
Authority
CN
China
Prior art keywords
identity
person
action
image
location
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110545406.8A
Other languages
Chinese (zh)
Inventor
张雷
杨思佳
张子豪
张宇
董鹏越
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Civil Engineering and Architecture
Original Assignee
Beijing University of Civil Engineering and Architecture
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Civil Engineering and Architecture
Priority to CN202110545406.8A
Publication of CN113378649A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to an identity, position and action recognition method, system, electronic device and storage medium. The identity, position and action recognition system comprises an identity recognition module, a position recognition module and/or an action recognition module. The identity recognition module is configured to recognize the identity of a person based on an image of the person; the position recognition module is configured to recognize the position of the person based on images of the person acquired simultaneously from a plurality of cameras with fixed relative positions; the action recognition module is configured to recognize an action of the person based on a sequence of video frames containing images of the person. With the identity recognition module, the position recognition module and the action recognition module, the system and method provided by the embodiments of the invention can recognize the identity of a person, determine the person's position and recognize the person's actions even in spaces with dense crowds, reducing the effort required of monitoring staff and realizing intelligent, active monitoring.

Description

Identity, position and action recognition method, system, electronic equipment and storage medium
Technical Field
The invention relates to the field of security combined with an intelligent space, in particular to an identity, position and action identification method, an identity, position and action identification system, electronic equipment and a storage medium.
Background
In modern society, many public places such as municipal facilities, industrial areas, hospitals, campuses, gymnasiums, shopping malls, restaurants and commercial office buildings are densely populated and have high foot traffic. To ensure normal operation, a large number of staff are usually employed to maintain order and handle emergencies. With the development of modern information technology, security systems with video monitoring have been installed in many places. However, in existing security systems the video monitoring system and the alarm system are often independent of each other, and the video monitoring system remains in a traditional passive mode that can only record video of abnormal situations.
An intelligent space is a working or living space embedded with computing, information and multi-modal sensing devices, providing a natural and convenient interactive interface so that people can easily obtain the services of a computer system. It can be applied to scenes such as schools, hospitals, administrative institutions, shopping malls, canteens and gymnasiums for functions such as intelligent video monitoring, intelligent attendance and intelligent teaching, fusing physical space with cyberspace. It has broad application prospects in fields such as smart homes, intelligent security and human-computer interaction, and is of practical significance for building digital, networked and intelligent smart cities.
Disclosure of Invention
It is an object of the present invention to provide an identity, location and action recognition method, system, electronic device and storage medium that at least partially solve the problems of the prior art.
Specifically, the embodiment of the invention provides the following technical scheme:
in a first aspect, the present invention provides an identity, location and action recognition system comprising: the system comprises an identity recognition module, a position recognition module and/or an action recognition module;
the identity recognition module is used for recognizing the identity of a person based on an image of the person;
the position identification module is used for identifying the position of the person based on images of the person simultaneously acquired from a plurality of camera devices with fixed relative positions;
the motion recognition module is to recognize a motion of the person based on a sequence of video frames including an image of the person.
Optionally, the identity recognition module includes a face recognition module, and the face recognition module is configured to:
and carrying out feature point capture on a face image in the image of the person by using OpenPose.
Rotating the face image to a horizontal position based on the feature points;
dividing the face image of the horizontal position into five parts and inputting the five parts into TP-GAN to obtain a front face image;
extracting a feature vector from the front face image by using a ResNet-29 network in the Dlib;
and based on the characteristic vector, carrying out similarity judgment by using a database in which image data corresponding to the identity information is stored, and identifying the identity of the person.
Optionally, the rotating the face image to a horizontal position based on the feature points comprises:
calculating the difference between the pixel coordinates of the left eye and the right eye in the human face image to obtain the angle to be rotated;
and rotating the face image by the rotation angle about the left-eye pixel coordinate as the rotation center to obtain the face image in a horizontal position.
Optionally, the dividing of the face image in the horizontal position into five parts for input into the TP-GAN comprises:
cutting out a left eye part, a right eye part, a nose part and a mouth part from the face image;
scaling the face image and the left eye, right eye, nose and mouth portions without distortion as the input to the TP-GAN.
Optionally, the identifying the position of the person based on the images of the person simultaneously acquired from the plurality of imaging devices whose relative positions are fixed includes:
performing two-dimensional pixel coordinate positioning on the image of the person acquired by each of the plurality of image pickup devices by using an OpenPose single-person pose estimation algorithm to obtain two-dimensional pixel coordinates of the human body in each image;
acquiring internal and external parameters of the plurality of camera devices through camera calibration;
and reconstructing three-dimensional position coordinates of the human body based on the two-dimensional pixel coordinates of the human body in each image and the internal and external parameters of the plurality of camera devices, thereby identifying the position of the person.
Optionally, the action recognition module includes a data processing sub-module, a feature extraction sub-module and an action classification sub-module;
the data processing submodule comprises a data preprocessing process and a video frame sampling process;
the feature extraction submodule comprises a residual network incorporating attention;
the action classification submodule comprises two layers of long short-term memory (LSTM) networks.
Optionally, the data preprocessing process employs a data enhancement algorithm that shifts each image in the sequence of video frames in the original order in the horizontal direction by a random unit length and a random direction within a given range.
Optionally, the video frame sampling process employs a video frame sampling algorithm that samples an intermediate segment of the sequence of video frames.
Optionally, the residual portion of the attention-incorporating residual network includes three convolutional layers using 1 × 1, 3 × 3, and 1 × 1 convolution kernels, respectively.
In a second aspect, an embodiment of the present invention provides an identity, location and action recognition method implemented by the identity, location and action recognition system according to the first aspect, including:
capturing a sequence of video frames comprising images of a person by means of a camera device arranged in a fixed relative position in space;
and inputting the video frame sequence into an identity, position and action recognition system, wherein the identity, position and action recognition system recognizes the identity of the person through an identity recognition module, recognizes the position of the person through a position recognition module, and recognizes the action of the person through an action recognition module.
In a third aspect, an embodiment of the present invention further provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the steps of the identity, location, and action recognition method according to the second aspect.
In a fourth aspect, the present invention further provides a non-transitory computer readable storage medium, on which a computer program is stored, the computer program, when executed by a processor, implementing the steps of the identity, location and action recognition method according to the second aspect.
In a fifth aspect, the present invention further provides a computer program product, which includes a computer program that, when being executed by a processor, implements the steps of the identity, location and action recognition method according to the second aspect.
With the identity recognition module, the position recognition module and the action recognition module, the identity, position and action recognition system and method provided by the embodiments of the invention can recognize the identity of a person, determine the person's position and recognize the person's actions even in spaces with dense crowds, reducing the effort required of monitoring staff and realizing intelligent, active monitoring.
Drawings
FIG. 1 is a schematic diagram of the structure of an identity, location and action recognition system provided by an embodiment of the present invention;
FIG. 2 is a flow chart of a data processing process provided by an embodiment of the invention;
FIG. 3 is a structural diagram of a residual network and a conventional network according to an embodiment of the present invention;
FIG. 4 is a block diagram of an improved residual network provided by an embodiment of the present invention;
FIG. 5 is a block diagram of a channel attention module provided by an embodiment of the present invention;
FIG. 6 is a block diagram of a residual network incorporating attention according to an embodiment of the present invention;
FIG. 7 is a flow chart of a method of identity, location and action recognition provided by an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides an identity, position and action recognition method, system, electronic device and storage medium. A RIKIBOT ROS robot carrying a depth camera and a laser radar can be used to perform SLAM positioning and mapping of a space, collect local details and construct a comprehensive monitoring area. A plurality of cameras can be arranged in the space to capture information about persons, and the identity, position and action of each person are recognized in real time by relying on a network server. The space is connected with an alarm system, and when the system detects abnormal behavior of a person, such as physical conflict, an accidental fall or illegal intrusion, an automatic alarm can be triggered.
Referring to fig. 1, fig. 1 is a schematic structural diagram of an identity, location, and action recognition system according to an embodiment of the present invention, where the identity, location, and action recognition system includes an identity recognition module 110, a location recognition module 120, and an action recognition module 130.
The identity recognition module 110 is configured to recognize an identity of a person based on an image of the person;
the position recognition module 120 is configured to recognize a position of the person based on images of the person simultaneously acquired from a plurality of cameras (image pickup devices) whose relative positions are fixed;
the action recognition module 130 is configured to recognize an action of the person based on a sequence of video frames comprising an image of the person.
The identity recognition module 110 comprises a face recognition module. Face recognition, one of the most important biometric techniques, has the significant advantage of being passive compared with techniques that require the cooperation of the subject. Recognition based on facial features can also work with low-cost vision sensors, and is therefore widely used in daily life. A typical face recognition system generally consists of a detection module, an alignment module, a representation module and a matching module. The detection module locates the face region. The alignment module transforms the detected face region with reference to facial feature points. The representation module maps the face image into a vector space for recognition. In the matching module, face descriptor vectors are compared by a similarity measure to decide whether they belong to the same person. For multi-angle face recognition, the main approaches include holistic methods, feature-based methods and the like. In recent years, more and more computer vision problems have been solved well using deep learning algorithms, which have also developed well in target recognition and image generation.
The face recognition module provided by the embodiment of the invention takes video as input and provides a face rotation method and a GAN-based frontal face generation and recognition method. The ResNet-29 network in Dlib is then used as the feature extraction model of the system. The model converts the input face image into a 128-dimensional vector and compares it with a local portrait database that has been vectorized in advance. The system uses the Euclidean distance to judge similarity: when the Euclidean distance between two vectors is less than 0.6, they are regarded as similar, i.e. the two vectors are matched to the identity information of the same person. The identity information contained in the image can then be output.
In most cases, the face captured in the video tends to be tilted. To generate a frontal face using TP-GAN, the system of embodiments of the present invention uses a method of rotating the face based on facial feature points.
The face positioning model in the system is based on OpenPose, a human keypoint estimation method based on part affinity fields. For one person, the face positioning model can detect 70 facial keypoints. A rectangular (X, Y, Z) coordinate system is established with the image plane as the plane of the X and Y axes, and each keypoint has a position (x, y) and a detection confidence C. To rotate the face to the horizontal direction, the difference between the left-eye and right-eye pixel coordinates is calculated, giving the angle θ to be rotated:
θ = arctan((y_reye − y_leye) / (x_reye − x_leye))   (1)

where (x_reye, y_reye) are the pixel coordinates of the right eye and (x_leye, y_leye) are the pixel coordinates of the left eye.
After the angle θ is obtained, the image is rotated to a horizontal position using the facial keypoints mentioned above. In the embodiment of the present invention, the left-eye coordinate (x_leye, y_leye) is taken as the center of rotation. A commonly used image rotation method selects the center of the image as the rotation center, but this is not suitable for face rotation because the face is not always at the center of the image: when the angle θ is large, the face region may be rotated out of the image.
After the image of the face rotated to the horizontal direction is obtained, the coordinates of the keypoints of the rotated face are computed so that the face region can be cropped. Specifically, in order to increase the running speed of the system, the whole rotation can be regarded as a rotation by θ degrees around the Z axis, and a rotation matrix is established, as shown in equation (2):

^A_B R = [  cos θ   sin θ   0
           −sin θ   cos θ   0
              0       0     1 ]   (2)

where (x_leye, y_leye) is the center of rotation, B is the original coordinate system, and A is the rotated coordinate system. The coordinates of the face keypoints of the original image are written in matrix form as ^B p, and the coordinate matrix ^A p of the face keypoints in the rotated image is obtained through the rotation matrix:

^A p = ^A_B R · ^B p   (3)
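As an illustration of the rotation step described by equations (1) to (3), the following is a minimal sketch assuming OpenCV and NumPy are used; the function name, argument layout and the use of cv2.getRotationMatrix2D are assumptions made for this sketch and are not taken from the patent.

```python
import cv2
import numpy as np

def rotate_face_to_horizontal(image, keypoints, left_eye, right_eye):
    """Rotate a face image about the left-eye pixel so that the eyes lie on a horizontal line.

    image     : H x W x 3 face image
    keypoints : N x 2 array of facial keypoint pixel coordinates (e.g. from OpenPose)
    left_eye  : (x_leye, y_leye) pixel coordinates of the left eye (rotation center)
    right_eye : (x_reye, y_reye) pixel coordinates of the right eye
    """
    # Equation (1): angle between the line joining the eyes and the horizontal axis.
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    theta = np.degrees(np.arctan2(dy, dx))

    # Equation (2): 2-D rotation about the left eye (rotation around the image-plane Z axis).
    M = cv2.getRotationMatrix2D((float(left_eye[0]), float(left_eye[1])), theta, 1.0)
    h, w = image.shape[:2]
    rotated = cv2.warpAffine(image, M, (w, h))

    # Equation (3): apply the same transform to the facial keypoints (homogeneous coordinates).
    pts = np.hstack([np.asarray(keypoints, dtype=float), np.ones((len(keypoints), 1))])
    rotated_pts = (M @ pts.T).T
    return rotated, rotated_pts
```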
The image is then divided into five parts to generate a frontal face by TP-GAN. The TP-GAN is a dual path network including global paths and local paths. The former is focused on facial contours and the latter is focused on local textures. The combination of the feature maps corresponding to the two paths will generate a frontal face image.
The identity, position and action recognition system provided by the embodiment of the invention implements TP-GAN using Keras and TensorFlow and is developed in Python on Anaconda. Specifically, the CMU Multi-PIE face database, containing 75,000 images of over 337 subjects captured from 15 viewpoints, was used for training. First, a new classifier is trained using a lightweight CNN model, and the extractor is fine-tuned on Multi-PIE; after fine-tuning, the loss value of the trained model was 0.997. The GAN model was then trained on Google ML Engine. Finally, the TP-GAN provided by the embodiment of the invention achieves a good frontal face generation effect.
To extract features, the image is divided into five parts: a face image of 128 × 128 resolution, left and right eye images of 40 × 40 resolution, a nose image of 40 × 32 resolution, and a mouth image of 40 × 32 resolution. Since the resolution of the desired image is much lower than the resolution of the input image, a suitable scaling can be used to scale without distortion.
First, the face portion of the rotated image is scaled to a 128 × 128-pixel image. The coordinates of the rotated face keypoints obtained by equation (3) above are used to establish the mathematical relationship between the rotated image and the scaled image. Specifically, the boundary of the region required for the face portion is set as f:
f = [x_left, x_right, y_upper, y_lower]   (4)
Then the eye, nose and mouth portions are cropped from the 128 × 128-resolution face image. To obtain crops at the target resolutions, the scale used is

scale = 128 / (x_right − x_left)
The coordinates of the eye, nose and mouth keypoints after rotation are denoted (x_n, y_n), and their positions in the 128 × 128-resolution face image are denoted (u_n, v_n); formula (5) gives:

u_n = scale · (x_n − x_left),   v_n = scale · (y_n − y_upper)   (5)
the height and width of the desired partial image are (h, w). Its boundary can be expressed as:
Figure BDA0003073432350000083
Finally, the four partial images (left eye, right eye, nose and mouth) cropped from the 128 × 128-resolution face image according to equation (6), together with the 128 × 128-resolution face image obtained according to equation (4), are used as the input of TP-GAN to generate a frontal face image.
After the frontal face image is obtained, in order to identify the person, the ResNet-29 network in Dlib is used as the feature extraction model of the system to convert the face image into a 128-dimensional feature vector. The similarity between the generated image and a local data set is measured using the Euclidean distance, and the minimum value is taken as the final result: when the Euclidean distance between the feature vectors of two face images is less than 0.6, the two face images can be regarded as the same person, and identity recognition is completed.
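A minimal sketch of this matching step, assuming the standard Dlib face landmark and face descriptor models are available locally; the file names, helper functions and the in-memory database layout are illustrative assumptions, not part of the patent.

```python
import dlib
import numpy as np

# Dlib's ResNet-based face descriptor model (the "ResNet-29" network referred to above).
detector = dlib.get_frontal_face_detector()
shape_predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")
face_encoder = dlib.face_recognition_model_v1("dlib_face_recognition_resnet_model_v1.dat")

def face_descriptor(image):
    """Return the 128-D descriptor of the first face found in an RGB image, or None."""
    faces = detector(image, 1)
    if not faces:
        return None
    shape = shape_predictor(image, faces[0])
    return np.array(face_encoder.compute_face_descriptor(image, shape))

def identify(frontal_face_image, database, threshold=0.6):
    """database: {identity: 128-D vector}; return the closest identity within the threshold."""
    query = face_descriptor(frontal_face_image)
    if query is None:
        return None
    best_id, best_dist = None, float("inf")
    for identity, vector in database.items():
        dist = np.linalg.norm(query - vector)       # Euclidean distance
        if dist < best_dist:
            best_id, best_dist = identity, dist
    return best_id if best_dist < threshold else None
```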
In the identity recognition module provided by this embodiment, feature points of a video frame containing a face image are captured by OpenPose, the face is rotated to a horizontal position based on the captured feature points, the rotated image is divided into five parts and input into TP-GAN to obtain a frontal face image, the ResNet-29 network in Dlib is used as a feature extraction model to extract feature vectors, and the extracted vectors are compared for similarity with the feature vectors in a database to recognize the identity. The identity recognition module provided by this embodiment can rotate all the faces in a picture containing several persons to the horizontal direction, even if some of the faces are partially covered. It also performs better on large-pose face recognition: compared with the ResNet-29 network in Dlib used without the face rotation method and the GAN-based frontal face generation and recognition method provided by the invention, the identity recognition module of this embodiment can extract features and recognize identity from face images in which the subject's head is deflected beyond minus 60 degrees, and also from face images in which the deflection angle exceeds plus 50 degrees, which the ResNet-29 network in Dlib alone cannot achieve.
The identity recognition module provided by another embodiment of the invention may comprise a gait recognition module. Gait recognition identifies people by their walking posture; it is contactless, works at a distance and is difficult to disguise, giving it obvious advantages in the field of intelligent video monitoring. Since pedestrians differ in muscle strength, tendon and bone length, bone density, center of gravity and so on, a person can be uniquely characterized by these differences; based on this, a human motion model can be built, or features can be extracted directly from the human silhouette, to realize gait recognition. The gait recognition module mainly comprises gait acquisition, gait segmentation, feature extraction and feature comparison sub-modules. Its input is a video image sequence of a person walking, which captures the continuous changes of the pedestrian during walking; the gait recognition algorithm mines the pedestrian's gait features from this sequence and compares the newly acquired gait features with the gait features stored in the data set to complete recognition.
The identity recognition module provided by another embodiment of the invention may comprise a pedestrian re-identification module. In a real-time analysis and alarm system for the identity and behavior of persons in an intelligent space, a high-quality face picture cannot always be obtained because of camera resolution and shooting angle. Therefore, when face recognition fails, pedestrian re-identification (Re-ID) becomes a very important substitute technology.
Pedestrian re-identification (person re-identification, Re-ID) is widely recognized as a sub-problem of image retrieval; it uses computer vision techniques to determine whether a specific pedestrian is present in an image or video. Pedestrian re-identification can compensate for the visual limitations of fixed cameras, can be combined with pedestrian detection and pedestrian tracking technologies, and is applied in fields such as video monitoring and intelligent security.
In the system, when the camera cannot detect a face image (such as face shielding, insufficient light and the like), a pedestrian re-identification technology is adopted as an alternative scheme, the pedestrian re-identification technology is combined with technologies such as motion identification and position identification, and the identity and behavior of indoor personnel are analyzed in real time.
The identity, position and motion recognition system provided by the embodiment of the invention comprises a position recognition module 120, and the identity, position and motion recognition method provided by the embodiment of the invention comprises a human body position acquisition method based on multiple cameras.
In outdoor environments, human body position location is mainly achieved by satellite navigation technology. In indoor environment, technologies such as bluetooth positioning, Wi-Fi positioning, infrared positioning, UWB positioning, RFID positioning and the like are continuously developed and applied to daily life of people, and indoor positioning technologies can position people in real time in classrooms, laboratories and the like. In the field of intelligent security, when a positioning object is detected to be dangerous, an alarm can be sent out immediately and the position of a person can be determined. In emergency situations, such as fire rescue, emergency evacuation, earthquake relief, etc., indoor positioning information is very important.
The position recognition module 120 according to the embodiment of the present invention includes a plurality of cameras fixed in relative positions to acquire human body images, and the plurality of cameras have set internal and external parameters, are focused on the same area, and acquire human body images at the same time. The acquired human body image is used for reconstructing three-dimensional coordinates of human body joint points and calculating the region of the human body by combining internal and external parameters of the camera.
The human body position acquisition method provided by the embodiment of the invention comprises the steps of acquiring a human body image by using a camera; calculating two-dimensional pixel coordinates of a human body in the human body image; obtaining internal and external parameters of a camera through camera calibration; and reconstructing a three-dimensional coordinate of a human joint point based on the two-dimensional pixel coordinate of the human body and the internal and external parameters of the camera, and calculating to obtain the position of the human body based on the three-dimensional coordinate of the human joint point.
For the two-dimensional pixel coordinate positioning of the human body in a single camera, this embodiment uses a single-person pose estimation algorithm based on OpenPose to process the human body image collected by the camera and locate the two-dimensional joint points of the human body in the image. The OpenPose open-source program is a human pose recognition project developed by Carnegie Mellon University in the USA; it is an open-source library developed on the basis of convolutional neural networks and supervised learning, using Caffe as its framework. It can estimate poses of the human body, facial expressions, finger motion and so on, is suitable for single persons and multiple persons, and has excellent robustness.
The internal and external parameters of the cameras are obtained through camera calibration, which is the premise of converting human joint points from two dimensions to three dimensions. For the multi-camera calibration problem, this embodiment first uses Zhang Zhengyou's binocular calibration method to obtain the positional relationship between pairs of cameras, for example cameras 1 and 2, cameras 2 and 3, cameras 3 and 4, and so on; then, through the transitivity of the transformation matrix, the position transformation matrices (external parameters) of cameras 1, 2, 3 and 4 relative to the world coordinate system are calculated, completing the multi-camera calibration and obtaining the internal and external parameters of the cameras.
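A sketch of this pairwise calibration and transform chaining, assuming OpenCV checkerboard-based stereo calibration is used for each adjacent camera pair and that the intrinsics of each camera have already been estimated; all function and variable names are illustrative.

```python
import cv2
import numpy as np

def calibrate_pair(obj_points, img_points_a, img_points_b, K_a, D_a, K_b, D_b, image_size):
    """Stereo-calibrate one camera pair (Zhang-style checkerboard calibration) and return
    the rotation R and translation T of camera b relative to camera a."""
    _, _, _, _, _, R, T, _, _ = cv2.stereoCalibrate(
        obj_points, img_points_a, img_points_b,
        K_a, D_a, K_b, D_b, image_size,
        flags=cv2.CALIB_FIX_INTRINSIC)
    return R, T

def chain_extrinsics(pairwise):
    """Compose pairwise extrinsics (camera 1->2, 2->3, 3->4, ...) so that every camera is
    expressed relative to camera 1, which is taken as the world coordinate system."""
    R_acc, T_acc = np.eye(3), np.zeros((3, 1))
    extrinsics = [(R_acc, T_acc)]                     # camera 1: identity transform
    for R, T in pairwise:
        R_acc = R @ R_acc                             # X_k = R * X_{k-1} + T
        T_acc = R @ T_acc + T
        extrinsics.append((R_acc.copy(), T_acc.copy()))
    return extrinsics
```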
As is known from the imaging principle of a camera, displaying a three-dimensional point in space on the two-dimensional pixel plane of the camera is a conversion of the spatial point from the world coordinate system to the two-dimensional pixel coordinate system: the world coordinate system is converted into the camera coordinate system, the camera coordinate system into the camera physical (image) coordinate system, and finally the physical coordinate system into the pixel coordinate system. For a single camera, the equation of the ray from the camera center to the spatial point (in the world coordinate system) can be calculated from the camera's internal and external parameters and the two-dimensional pixel coordinates of the projected point, so the spatial point must lie on this ray. To compute the three-dimensional coordinates of the spatial point, the corresponding ray from another camera is calculated, and the intersection of the two rays is the spatial point. This is the classical binocular vision positioning problem; however, it requires the human body to be imaged clearly in both cameras, places high demands on the cameras, and cannot meet practical requirements when occlusions are present. To address this, the embodiment of the invention introduces multiple cameras on the basis of binocular vision, turning the three-dimensional reconstruction into the problem of intersecting multiple rays. Because the data inevitably contain noise, the estimated rays deviate from the actual ones and a common intersection of all the rays may not exist; therefore, a least-squares method is used to estimate the target joint point.
In order to accurately estimate the three-dimensional position information of the human body, this embodiment acquires the three-dimensional position coordinates of 25 joint points of the human body, for example the head [x_1 y_1 z_1]^T, shoulder [x_4 y_4 z_4]^T, elbow [x_12 y_12 z_12]^T, wrist [x_25 y_25 z_25]^T, and so on. The spatial region of the human body is then obtained according to the following formula:

x → [min_i x_i, max_i x_i],   y → [min_i y_i, max_i y_i],   z → [min_i z_i, max_i z_i]   (7)

The minimum region in which the human body is located in three-dimensional space, i.e. the acquired position of the human body, can be determined according to equation (7).
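A NumPy sketch of the least-squares (DLT) triangulation and of the bounding region of equation (7); the projection-matrix representation, the joint count of 25 and the function names are assumptions made for illustration.

```python
import numpy as np

def triangulate_point(projections, pixels):
    """Least-squares (DLT) triangulation of one joint observed by several cameras.

    projections : list of 3x4 projection matrices P = K [R | T], one per camera
    pixels      : list of (u, v) pixel coordinates of the same joint in each camera
    """
    rows = []
    for P, (u, v) in zip(projections, pixels):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    _, _, Vt = np.linalg.svd(A)          # least-squares solution of A X = 0
    X = Vt[-1]
    return X[:3] / X[3]                  # inhomogeneous 3-D coordinates

def body_region(projections, joints_2d):
    """Equation (7): axis-aligned region spanned by the 25 reconstructed joint points.

    joints_2d : per-camera list of 25 (u, v) joint coordinates from OpenPose
    """
    joints_3d = np.array([
        triangulate_point(projections, [cam[j] for cam in joints_2d])
        for j in range(25)])
    return joints_3d.min(axis=0), joints_3d.max(axis=0)   # (x, y, z) minima and maxima
```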
The identity, location and action recognition system provided by the embodiment of the present invention includes an action recognition module 130.
The aim of action recognition is to judge the action of a human body in a video. In early action recognition research, scholars at home and abroad designed various hand-crafted features and carried out a large number of experiments, for example with contour silhouettes, human joint points, spatio-temporal interest points and motion trajectories. Because they rely on manual feature extraction, such methods have poor anti-interference and generalization capability and cannot be widely applied. In contrast, deep learning methods can learn data features autonomously, more efficiently and more accurately, so feature extraction based on deep learning has gradually replaced manual feature extraction. Ji et al. first proposed a 3D-CNN algorithm that applies a 3D convolution kernel to video frames along the time axis to capture spatio-temporal information and uses it for human action recognition. Tran et al. proposed the C3D network and used it in action recognition, scene recognition, video similarity analysis and other fields. Carreira et al. inflated 2D convolutions into 3D convolutions, forming the inflated 3D convolution network I3D. Donahue et al. proposed the LRCN model, which uses a Convolutional Neural Network (CNN) to extract features and an LSTM to perform action classification. Using CNN and LSTM for action recognition greatly improves recognition accuracy and reduces workload. However, as the CNN becomes deeper, severe gradient vanishing and network degradation problems appear. To solve this problem, the present application extracts features with an attention residual network composed of a Convolutional Block Attention Module (CBAM) and a residual network (ResNet).
In video motion recognition, the processed data is no longer a single image, but a sequence of images having a temporal order. If each frame in the video is processed as input data, the computational cost of the model is greatly increased. Therefore, 16 frames are taken as samples from each video. Then, the samples are input into the model, and the learning of the network weight is performed. Finally, the actions are classified with a Softmax classifier. The action recognition module 130 provided by the embodiment of the present invention includes three sub-modules: data processing, feature extraction, and action classification.
As shown in fig. 2, the data processing sub-module mainly includes: a data preprocessing process and a video frame sampling process.
Since the original resolution of video data is usually large, the direct use is computationally expensive, so that it needs to be preprocessed. One data preprocessing process is as follows:
parsing the video into a sequence of video frames by using an ffmpeg module;
scaling the original video frame in equal proportion according to the training requirement;
performing center clipping on the zoomed video frame;
converting the cut video frame into a tensor form;
the tensor is regularized.
The above data preprocessing process has two problems: firstly, the center clipping of the video frame can cause the loss of edge information; secondly, the sample capacity of the motion recognition data set is relatively small, and the overfitting problem is easy to occur during training. Therefore, in order to at least partially solve the above problems, the present invention proposes a data enhancement algorithm for video, which is shown as algorithm 1.
Algorithm 1 (video data enhancement by random horizontal translation; given as an image in the original document).

In Algorithm 1, each action video V is represented by a sequence of video frames f_1, f_2, …, f_n, which makes it convenient to process the video data indirectly with image processing methods. Specifically, Algorithm 1 translates each image in the video frame sequence, in the original order, in the horizontal direction within a given range; the unit length and direction of the translation are random (a negative value means translation to the left and a positive value translation to the right). If an action video contains 50 frames and the range for generating random numbers is (−6, 6), the data can be extended by up to 600 times through this enhancement process. At the same time, translating the video frame images in the horizontal direction alleviates the loss of edge information caused by center cropping. Therefore, the data enhancement algorithm is added to the data preprocessing, and the improved data preprocessing process is as follows (a sketch of the enhancement step is given after the list):
parsing the video into a sequence of video frames and applying the data enhancement algorithm;
scaling the video frame sequence according to the training requirement;
performing center clipping on the zoomed video frame;
converting the cut video frame into a tensor form;
the tensor is regularized.
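Since Algorithm 1 appears only as an image in the source, the following NumPy sketch shows one plausible reading of the horizontal-translation enhancement described above; the parameter names, the per-frame random shift and the zero-padding of vacated columns are assumptions.

```python
import random
import numpy as np

def horizontal_shift_enhance(frames, max_shift=6):
    """Translate each frame horizontally by a random number of pixels in
    (-max_shift, max_shift); negative shifts move the frame left, positive shifts move it
    right. Frames keep their original temporal order."""
    enhanced = []
    for frame in frames:                          # frame: H x W x C array
        shift = random.randint(-max_shift, max_shift)
        shifted = np.zeros_like(frame)
        if shift > 0:
            shifted[:, shift:] = frame[:, :-shift]     # shift right, pad left with zeros
        elif shift < 0:
            shifted[:, :shift] = frame[:, -shift:]     # shift left, pad right with zeros
        else:
            shifted = frame.copy()
        enhanced.append(shifted)
    return enhanced
```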
In video action recognition, since the start and stop of motion capture are difficult to synchronize with the start and stop of the action itself, there are usually a large number of redundant frames at the beginning and end of the video. The original video frame sampling method samples over the whole video frame sequence; the specific process is as follows:
randomly generating a number L between (0, R-16), wherein R is the length of the video after being analyzed into a video frame sequence;
starting from the L-th frame, 16 frames of images are sequentially selected as the input of the model.
Although the sampling method solves the problem of the calculation cost of the network model caused by input, the problem that the information amount contained in each time interval is not equal in the whole video frame sequence is not considered. If the randomly generated start frame is located in a period of time where the amount of information in the entire video is low, the input data obtained by the above sampling method may interfere with the model.
The embodiment of the invention improves the sampling method, and improves the quality of input information by sampling the middle section of the video, as shown in algorithm 2.
Algorithm 2 (middle-segment video frame sampling; given as an image in the original document).
In Algorithm 2, when the number of frames in the video is small (n ≤ 48), the measure adopted in this embodiment is to ignore the influence of redundant frames, randomly generate an integer k in the range (0, n − 16), and then select 16 consecutive frames starting from the k-th frame; when the number of frames is larger, the redundant frames in the start and stop periods are removed, an integer k is randomly generated in the range (n/3 − 16, 2n/3 − 16), and 16 consecutive frames are then selected starting from the k-th frame.
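A short sketch of this sampling rule; since Algorithm 2 is only given as an image, inclusive integer ranges and a minimum of 16 frames per video are assumed.

```python
import random

def sample_16_frames(frames):
    """Sample a 16-frame clip in the spirit of Algorithm 2 (assumes len(frames) >= 16).

    For short videos (n <= 48) the start index is drawn from (0, n - 16), ignoring
    redundant frames; for longer videos it is drawn from (n/3 - 16, 2n/3 - 16) so that
    the clip falls in the middle segment of the sequence."""
    n = len(frames)
    if n <= 48:
        k = random.randint(0, n - 16)
    else:
        k = random.randint(n // 3 - 16, 2 * n // 3 - 16)
    return frames[k:k + 16]
```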
For the feature extraction submodule, in the feature extraction process, the embodiment of the invention adopts a residual error network structure (ResNet + CBAM) which is integrated with attention, and the reason is as follows: firstly, a shortcut connection structure (shortcut connection) in a residual network solves the problems of gradient disappearance and network degradation in deep network training; and secondly, a convolution attention module (CBAM) is added into the residual error network, so that the network can extract features more pertinently, and the representation of the network on discriminant features is enhanced.
Compared with a traditional network, a residual network (ResNet) adds shortcut connections on top of the traditional structure; the structures of the two networks are compared in Fig. 3.
In Fig. 3(b), the right curve represents a shortcut connection that passes the input x directly to the output, while the structure in the left dashed box corresponds to the residual part of the residual network, whose output is F(x). The output of the residual network is therefore:

H(x) = F(x) + x   (8)

When F(x) = 0, H(x) = x, which is the identity mapping. If a shallow network has reached saturation accuracy, adding several identity mapping layers behind it makes it possible to increase the network depth without increasing the training error, solving the degradation problem of deep networks. The residual network therefore takes a residual result F(x) close to 0 as the learning objective. In addition, as can be seen from equation (8), when the residual network back-propagates the error, the derivative of the term x is constantly 1, so even if the chained derivative of F(x) approaches 0 the gradient-vanishing problem is effectively avoided, ensuring that the network weights are updated.
In addition, the two convolutional layers in the basic residual structure usually use 3 × 3 convolution kernels, and as the network deepens further, parameter redundancy and a multiplied amount of computation easily occur. To address this, the embodiment of the present invention adopts the improved residual structure shown in Fig. 4, whose residual part consists of three convolutional layers using 1 × 1, 3 × 3 and 1 × 1 convolution kernels, respectively. A 1 × 1 convolution first reduces the channel dimension of the input tensor, so that the 3 × 3 convolution acts on a tensor of relatively small size to improve computational efficiency, and then another 1 × 1 convolution restores the channel dimension. For the shortcut connection in Fig. 4, if the dimensions of x and F(x) differ, x needs to be adjusted with a 1 × 1 convolution.
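A PyTorch sketch of such a bottleneck residual block; the BatchNorm/ReLU placement and channel sizes are conventional choices assumed here rather than details given in the patent.

```python
import torch.nn as nn

class Bottleneck(nn.Module):
    """Improved residual block of Fig. 4: 1x1 -> 3x3 -> 1x1 convolutions, with a 1x1
    convolution on the shortcut when the input and output dimensions differ."""

    def __init__(self, in_channels, mid_channels, out_channels, stride=1):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, kernel_size=3, stride=stride,
                      padding=1, bias=False),
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels))
        # Shortcut connection: identity when shapes match, 1x1 convolution otherwise.
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride,
                          bias=False),
                nn.BatchNorm2d(out_channels))
        else:
            self.shortcut = nn.Identity()
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # H(x) = F(x) + x, equation (8)
        return self.relu(self.residual(x) + self.shortcut(x))
```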
As can be seen from Fig. 5, the channel attention module first compresses the input feature map F using global average pooling and global max pooling, then feeds the two compressed features into a shared Multi-Layer Perceptron (MLP) for dimension reduction and restoration, sums the two vectors output by the MLP, and obtains the channel attention weighting coefficient M_C through a sigmoid function, as shown in equation (9):

M_C(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) = σ(W_1(W_0(F_avg^c)) + W_1(W_0(F_max^c)))   (9)

where σ denotes the sigmoid activation function, W_0 and W_1 are the weight matrices of the multi-layer perceptron MLP, and F_avg^c and F_max^c denote the average-pooled and max-pooled features, respectively.
CBAM multiplies the input feature F by the channel attention weighting coefficient M_C to obtain a new feature F'. Then F' is fed into the spatial attention module to obtain the spatial attention weighting coefficient M_S. Finally, M_S is multiplied by F' to obtain the final attention feature F'', as shown in equations (10) and (11):

F' = M_C(F) ⊗ F   (10)
F'' = M_S(F') ⊗ F'   (11)
As shown in Fig. 6, in the attention-incorporating residual network (ResNet + CBAM) of the embodiment of the present invention, the CBAM module is first used to extract the key information from the input features, and the extracted key information is then fed into the improved residual network to further extract depth features.
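A PyTorch sketch of a CBAM module consistent with equations (9) to (11); the reduction ratio of 16 and the 7 × 7 spatial-attention convolution follow the original CBAM paper and are assumptions here, since the patent does not give these values.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention of equation (9): a shared MLP over globally average-pooled and
    max-pooled features, combined through a sigmoid."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.max_pool = nn.AdaptiveMaxPool2d(1)
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),   # W0: dimension reduction
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))   # W1: dimension restoration

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(self.avg_pool(x).view(b, c))
        mx = self.mlp(self.max_pool(x).view(b, c))
        return torch.sigmoid(avg + mx).view(b, c, 1, 1)   # M_C

class SpatialAttention(nn.Module):
    """Spatial attention: channel-wise average and max maps passed through a convolution."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)
        mx, _ = torch.max(x, dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))   # M_S

class CBAM(nn.Module):
    """Equations (10) and (11): F' = M_C(F) * F, then F'' = M_S(F') * F'."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.channel = ChannelAttention(channels, reduction)
        self.spatial = SpatialAttention()

    def forward(self, f):
        f_prime = self.channel(f) * f
        return self.spatial(f_prime) * f_prime
```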
A conventional Recurrent Neural Network (RNN) can handle time-series problems, but when the input sequence is long, learning fails because the gradient vanishes. To solve this problem, Schmidhuber et al. proposed the Long Short-Term Memory network (LSTM), a special form of RNN that is good at handling long sequences. Therefore, in the action classification part, the embodiment of the invention uses two LSTM layers to learn the temporal relationship between the video frames, and then classifies them into the corresponding action classes with a Softmax classifier.
The basic structure of the LSTM completes the input and output of information through an input gate, a forget gate and an output gate. The input gate, composed of a σ layer, a tanh layer and a pointwise multiplication, determines how much of the input x_t at the current time needs to be saved to the current cell state c_t; the forget gate, composed of a σ layer and a pointwise multiplication, determines how much of the cell state c_(t-1) of the previous time is retained in the current c_t; the output gate, composed of a σ layer and a pointwise multiplication, determines how much of the current cell state c_t is passed to the current output value h_t of the LSTM. The update recursion formulas of the LSTM are as follows:
f_t = σ(W_f h_(t-1) + U_f x_t + b_f)   (12)
i_t = σ(W_i h_(t-1) + U_i x_t + b_i)   (13)
c̃_t = tanh(W_c h_(t-1) + U_c x_t + b_c)   (14)
c_t = f_t · c_(t-1) + i_t · c̃_t   (15)
o_t = σ(W_o h_(t-1) + U_o x_t + b_o)   (16)
h_t = o_t · tanh(c_t)   (17)

where W_f, W_i, W_c, W_o and U_f, U_i, U_c, U_o are the corresponding weight matrices, b_f, b_i, b_c, b_o are the corresponding biases, and σ and tanh are activation functions.
The action recognition module 130 provided by the embodiment of the present invention, with a data processing sub-module whose preprocessing uses the data enhancement algorithm and whose sampling uses the video frame sampling algorithm, a feature extraction sub-module using the attention-incorporating residual network structure, and an action classification sub-module using two LSTM layers, can effectively increase the diversity of the data, improve its quality, strengthen the model's extraction of discriminative features, and ultimately recognize human actions better. Compared with existing methods, this deep learning action recognition method incorporating an attention mechanism extracts the spatial and temporal information of video data more effectively and achieves a better recognition effect.
The action recognition method proposed by the present invention was verified in the following implementation environment: operating system Ubuntu 16.04; deep learning framework PyTorch 1.6.0; general parallel computing architecture CUDA 10.2; deep neural network GPU acceleration library cuDNN 7.6.5; graphics card GeForce RTX 2080 Ti with 11 GB of video memory; graphics card driver NVIDIA 450.80; hard disk 512 GB.
The action recognition method provided by the invention is verified on the UCF YouTube data set and the KTH data set respectively.
UCF YouTube was published by the Center for Research in Computer Vision at the University of Central Florida. It contains 1600 videos classified into 11 action categories, including: shooting, golf swing, riding a bicycle, riding a horse, walking a dog, diving, bumping, playing tennis, jumping on a trampoline, and discharging a ball. Each category contains 25 groups of videos, each group containing at least 4 video clips with a resolution of 320 × 240.
There are 600 videos in the KTH data set, which contain variations in scale, clothing and lighting, with a resolution of 160 × 120. Under 4 different scenes, the data set is formed by executing 6 types of actions by 25 persons, and specifically comprises the following steps: walking, jogging, running, clapping, waving, and boxing.
In order to verify the effectiveness of the proposed method, the UCF YouTube dataset is divided into 60% for training, 20% for validation and 20% for testing. For the KTH dataset, because of the small number of samples, 5-fold cross-validation with averaging is used: each time, 80% of the data is used for training and the remaining 20% for testing.
First, the resolution of the UCF YouTube dataset is 320 × 240, and the UCF YouTube dataset is directly used, which causes memory overflow due to excessive calculation amount, so that it needs to be scaled, while the resolution of the KTH dataset is only 160 × 120, and the KTH dataset can be directly input into the model. Secondly, because the video action recognition has high requirements on the GPU computing power, in order to improve the training efficiency, the transfer learning is applied to the feature extraction part (ResNet + CBAM) of the model, namely: the weights trained on ImageNet by ResNet50 are migrated to the ResNet structure used in this application. To further reduce the risk of network overfitting, Dropout technique is used in all FC layers, i.e. nodes of FC layers are deactivated randomly with a certain probability. Finally, multithreading is used in the data loading part to speed up the reading speed of the model. The experimental parameter settings for this test are shown in table 1.
Table 1: Experimental parameters (given as an image in the original document).
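As an illustration of the transfer-learning and Dropout setup described above, the following PyTorch sketch loads ImageNet-pretrained ResNet-50 weights as the feature extractor and adds Dropout to the fully connected layers; the dropout probability, hidden size and class count are assumptions, since Table 1 is only available as an image.

```python
import torch.nn as nn
from torchvision import models

# Feature extractor: ResNet-50 with weights pretrained on ImageNet (transfer learning);
# the final FC layer is dropped so the backbone outputs a 2048-d feature per frame.
backbone = models.resnet50(pretrained=True)
feature_extractor = nn.Sequential(*list(backbone.children())[:-1], nn.Flatten())

# Classification head with Dropout on the fully connected layers to reduce overfitting.
classifier = nn.Sequential(
    nn.Dropout(p=0.5),
    nn.Linear(2048, 512),
    nn.ReLU(inplace=True),
    nn.Dropout(p=0.5),
    nn.Linear(512, 11))                  # 11 action classes for UCF YouTube

# Multithreaded data loading (torch.utils.data.DataLoader with num_workers > 0)
# speeds up reading, as mentioned above.
```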
After the model training is finished, the recognition rate of the method provided by the application on the UCF YouTube data set reaches 95.45%. The recognition rate of 9 actions reaches more than 90%.
To illustrate the superiority of the method of the present application, the models ResNet + LSTM, ResNet + LSTM + Algorithm 1 + Algorithm 2, and ResNet + CBAM + LSTM + Algorithm 1 + Algorithm 2 were tested on the same UCF YouTube dataset and then compared with existing methods, as shown in Table 2.
Table 2: Comparison with other methods on UCF YouTube (given as an image in the original document).
As can be seen from Table 2, the video action recognition method based on the attention residual network and LSTM proposed by the present application outperforms the existing deep learning methods Deep-Temporal LSTM, DB-LSTM and InceptionV3 + Bi-LSTM-Attention on the UCF YouTube dataset. In addition, on the basis of the ResNet and LSTM models, the effectiveness of Algorithm 1, Algorithm 2 and the CBAM module in improving model performance was verified separately; the experimental results show that, compared with the models before improvement, the recognition rates of the improved models increase by 1.56%, 1.16% and 1.88%, respectively.
To further illustrate the effectiveness of the method presented in this application, experiments were also performed on the KTH dataset. The KTH dataset has 600 videos; first, 120 videos are selected as the test set and the rest as the training set, together called Dataset1, and the first cross-validation is performed. The remaining folds are constructed similarly; 5-fold cross-validation is carried out and the average is finally taken as the experimental result, as shown in Table 3.
Table 3: Cross-validation comparison on the KTH dataset (given as an image in the original document).
As can be seen from Table 3, the average recognition rate of the method on KTH reaches 96.83%, which is 6.06% higher than that of ResNet + LSTM; compared with the corresponding models before improvement, the average recognition rates of the improved variants increase by 3.12%, 1.60% and 1.34%, respectively.
Table 4 compares the proposed method with existing methods on the KTH dataset; it can be seen that the method still performs well, further proving its effectiveness.
Table 4: Comparison with other methods on the KTH dataset (given as an image in the original document).
The application provides a deep learning action identification method integrated with an attention mechanism. The method increases the diversity of data and improves the quality of the data by improving a data preprocessing method and a video frame sampling method; by integrating CBAM into the residual error network, the extraction of discriminant features by the model is enhanced. Compared with the existing method, the method can more effectively extract the spatial information and the time information of the video data, and has better identification effect.
Fig. 7 is a flowchart of an identity, position and action recognition method provided by an embodiment of the present invention. The method includes:

Step 710: capturing a video frame sequence comprising images of persons by a camera arranged in a fixed relative position in space;

Step 720: inputting the video frame sequence into an identity, position and action recognition system, wherein the identity, position and action recognition system recognizes the identity of the person through the identity recognition module, recognizes the position of the person through the position recognition module, and recognizes the action of the person through the action recognition module.
According to the identity, position and action recognition system, through a human body picture captured by a camera, a human face recognition module is used for carrying out identity recognition on a visitor, an action recognition module is used for analyzing and recording the action of the visitor, and a position recognition module is used for detecting, tracking and recording the action position and the action track of the visitor.
According to the identity, position and action recognition method implemented by the identity, position and action recognition system, the behavior patterns and characteristics of a specific individual can be analyzed from the information collected by the system; movement routes, dwell time in each area, activity time periods and the like can be studied and judged, and the data can be displayed visually in graphical form. Behavior detection can also be performed on the video, such as entering or leaving an intrusion-monitored area, rapid movement of persons, taking or placing articles, loitering, and crowd gathering.
Fig. 8 illustrates a physical structure diagram of an electronic device. As shown in Fig. 8, the electronic device may include: a processor 810, a communication interface 820, a memory 830 and a communication bus 840, wherein the processor 810, the communication interface 820 and the memory 830 communicate with each other via the communication bus 840. The processor 810 may invoke logic instructions in the memory 830 to perform the identity, location and action recognition method, which comprises: capturing a video frame sequence comprising images of a person by camera devices arranged in space with fixed relative positions; and inputting the video frame sequence into the identity, location and action recognition system, which recognizes the identity of the person through the identity recognition module, the location of the person through the location recognition module, and the action of the person through the action recognition module.
In addition, the logic instructions in the memory 830 may be implemented as software functional units and, when sold or used as an independent product, stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention, or the portion thereof that contributes to the prior art, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the identity, location and action recognition method provided above, the method comprising: capturing a video frame sequence comprising images of a person by camera devices arranged in space with fixed relative positions; and inputting the video frame sequence into the identity, location and action recognition system, which recognizes the identity of the person through the identity recognition module, the location of the person through the location recognition module, and the action of the person through the action recognition module.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the identity, location and action recognition method provided above, the method comprising: capturing a video frame sequence comprising images of a person by camera devices arranged in space with fixed relative positions; and inputting the video frame sequence into the identity, location and action recognition system, which recognizes the identity of the person through the identity recognition module, the location of the person through the location recognition module, and the action of the person through the action recognition module.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and the parts shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions recorded in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (12)

1. An identity, location and action recognition system, comprising an identity recognition module, a location recognition module and/or an action recognition module;
the identity recognition module is configured to recognize the identity of a person based on an image of the person;
the location recognition module is configured to recognize the location of the person based on images of the person simultaneously acquired from a plurality of camera devices with fixed relative positions;
the action recognition module is configured to recognize an action of the person based on a sequence of video frames including an image of the person.
2. The identity, location and action recognition system of claim 1, wherein the identity recognition module comprises a face recognition module configured to:
carrying out feature point capture on a face image in the image of the person by using OpenPose;
rotating the face image to a horizontal position based on the feature points;
dividing the face image at the horizontal position into n parts and inputting the n parts into a TP-GAN to obtain a frontal face image;
extracting a feature vector from the frontal face image by using the ResNet-29 network in Dlib;
and based on the feature vector, performing similarity judgment by using a database storing image data corresponding to identity information, so as to identify the identity of the person.
3. The identity, location and action recognition system of claim 2, wherein the rotating the facial image to a horizontal position based on the feature points comprises:
calculating the difference between the pixel coordinates of the left eye and the right eye in the face image to obtain an angle to be rotated;
and rotating the face image by the angle to be rotated, with the left-eye pixel coordinate as the rotation center, to obtain the face image at the horizontal position.
4. The identity, location and action recognition system of claim 2, wherein dividing the face image at the horizontal position into n parts and inputting the n parts into the TP-GAN comprises:
cutting out a left eye part, a right eye part, a nose part and a mouth part from the face image;
scaling the face image, the left eye part, the right eye part, the nose part and the mouth part with fidelity, as the input to the TP-GAN.
5. The identity, location and action recognition system of claim 1, wherein the identifying the location of the person based on the images of the person simultaneously acquired from the plurality of camera devices with fixed relative positions comprises:
performing two-dimensional pixel coordinate positioning on the image of the person acquired by each of the plurality of camera devices by using the OpenPose single-person pose estimation algorithm, to obtain two-dimensional pixel coordinates of the human body in each image;
acquiring internal and external parameters of the plurality of camera devices through camera calibration;
and reconstructing three-dimensional position coordinates of the human body based on the two-dimensional pixel coordinates of the human body in each image and the internal and external parameters of the plurality of camera devices, thereby identifying the location of the person.
6. The identity, location and action recognition system of claim 1, wherein the action recognition module comprises a data processing sub-module, a feature extraction sub-module, and an action classification sub-module;
the data processing submodule comprises a data preprocessing process and a video frame sampling process;
the feature extraction submodule comprises a residual network incorporating an attention mechanism;
the action classification submodule comprises a two-layer long short-term memory (LSTM) network.
7. The identity, location and action recognition system of claim 6, wherein the data preprocessing process employs a data enhancement algorithm that applies, to each image in the sequence of video frames, a horizontal translation within a predetermined range.
8. The identity, location and action recognition system of claim 6, wherein the video frame sampling process employs a video frame sampling algorithm to sample an intermediate segment of the sequence of video frames.
9. The identity, location and action recognition system of claim 6, wherein the residual portion of the attention residual network comprises three convolutional layers using 1 x 1, 3 x 3 and 1 x 1 convolution kernels, respectively.
10. An identity, location and action recognition method, comprising:
capturing a sequence of video frames comprising images of a person by means of camera devices arranged in space with fixed relative positions;
inputting the sequence of video frames into an identity, location and action recognition system of any one of claims 1-9, the identity, location and action recognition system recognizing the identity of the person by an identity recognition module, the location of the person by a location recognition module, and the action of the person by an action recognition module.
11. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the identity, location and action recognition method according to claim 10 are performed when the program is executed by the processor.
12. A non-transitory computer readable storage medium, having stored thereon a computer program, characterized in that the computer program, when being executed by a processor, performs the steps of the identity, location and action recognition method according to claim 10.
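For illustration of the feature-extraction and similarity-judgment steps of claim 2, the following sketch uses Dlib's face-recognition ResNet to obtain a 128-dimensional descriptor and matches it against a database of known identities by Euclidean distance. The model file names and the 0.6 threshold are common Dlib defaults rather than values stated in the application, and the OpenPose landmarking and TP-GAN frontalization steps are assumed to have already produced frontal_img.

import numpy as np
import dlib

detector = dlib.get_frontal_face_detector()
shape_predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")
face_encoder = dlib.face_recognition_model_v1("dlib_face_recognition_resnet_model_v1.dat")

def face_descriptor(frontal_img):
    """128-D feature vector for an already frontalized face image."""
    det = detector(frontal_img, 1)[0]
    shape = shape_predictor(frontal_img, det)
    return np.array(face_encoder.compute_face_descriptor(frontal_img, shape))

def identify(frontal_img, database, threshold=0.6):
    """database maps identity -> stored 128-D descriptor; returns the best match or None."""
    query = face_descriptor(frontal_img)
    best_id, best_dist = None, float("inf")
    for identity, stored in database.items():
        dist = np.linalg.norm(query - stored)   # Euclidean distance as dissimilarity
        if dist < best_dist:
            best_id, best_dist = identity, dist
    return best_id if best_dist < threshold else None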
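Claim 3 computes the rotation angle from the pixel coordinates of the two eyes and rotates about the left eye. A minimal OpenCV sketch is given below, assuming left_eye and right_eye are (x, y) tuples obtained from the landmarking step.

import math
import cv2

def rotate_to_horizontal(face_img, left_eye, right_eye):
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    angle = math.degrees(math.atan2(dy, dx))              # angle to be rotated
    h, w = face_img.shape[:2]
    rot = cv2.getRotationMatrix2D(left_eye, angle, 1.0)   # rotate about the left-eye coordinate
    return cv2.warpAffine(face_img, rot, (w, h))          # face image at the horizontal position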
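Claim 5 reconstructs the three-dimensional position of a body keypoint from its two-dimensional pixel coordinates in calibrated views. A hedged two-camera sketch using OpenCV triangulation follows; K1 and K2 are intrinsic matrices, Rt1 and Rt2 the 3 x 4 extrinsic matrices obtained from calibration, and all variable names are illustrative.

import numpy as np
import cv2

def reconstruct_point(K1, Rt1, K2, Rt2, uv1, uv2):
    P1 = K1 @ Rt1                                       # 3x4 projection matrix of camera 1
    P2 = K2 @ Rt2                                       # 3x4 projection matrix of camera 2
    pts1 = np.array(uv1, dtype=float).reshape(2, 1)     # pixel coordinates in image 1
    pts2 = np.array(uv2, dtype=float).reshape(2, 1)     # pixel coordinates in image 2
    X_h = cv2.triangulatePoints(P1, P2, pts1, pts2)     # homogeneous 4x1 result
    return (X_h[:3] / X_h[3]).ravel()                   # 3-D position of the keypoint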
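The two-layer long short-term memory classifier of claim 6 can be sketched in PyTorch as follows; the feature dimension, hidden size and class count are illustrative assumptions (11 matches the UCF YouTube action classes but is not prescribed by the claim).

import torch
import torch.nn as nn

class ActionClassifier(nn.Module):
    def __init__(self, feature_dim=2048, hidden_dim=512, num_classes=11):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, frame_features):        # (batch, time, feature_dim) per-frame features
        out, _ = self.lstm(frame_features)
        return self.fc(out[:, -1])            # logits over the action classes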
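The data-enhancement and frame-sampling steps of claims 7 and 8 amount to shifting every frame of one sequence horizontally by the same randomly chosen offset and drawing frames from the middle segment of the video; the maximum shift, sample count and middle fraction below are assumed values for illustration only.

import random
import numpy as np

def horizontal_translate(frames, max_shift=20):
    """Apply one random horizontal translation (with zero padding) to every frame."""
    shift = random.randint(-max_shift, max_shift)
    out = []
    for f in frames:
        shifted = np.zeros_like(f)
        if shift >= 0:
            shifted[:, shift:] = f[:, :f.shape[1] - shift]
        else:
            shifted[:, :shift] = f[:, -shift:]
        out.append(shifted)
    return out

def sample_middle_segment(frames, num_samples=16, middle_fraction=0.6):
    """Sample num_samples frames evenly from the middle segment of the sequence."""
    n = len(frames)
    start = int(n * (1 - middle_fraction) / 2)
    end = n - start
    idx = np.linspace(start, end - 1, num_samples).astype(int)
    return [frames[i] for i in idx]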
CN202110545406.8A 2021-05-19 2021-05-19 Identity, position and action recognition method, system, electronic equipment and storage medium Pending CN113378649A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110545406.8A CN113378649A (en) 2021-05-19 2021-05-19 Identity, position and action recognition method, system, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110545406.8A CN113378649A (en) 2021-05-19 2021-05-19 Identity, position and action recognition method, system, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113378649A true CN113378649A (en) 2021-09-10

Family

ID=77571310

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110545406.8A Pending CN113378649A (en) 2021-05-19 2021-05-19 Identity, position and action recognition method, system, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113378649A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114579803A (en) * 2022-03-09 2022-06-03 北方工业大学 Video retrieval model based on dynamic convolution and shortcut
CN114579803B (en) * 2022-03-09 2024-04-12 北方工业大学 Video retrieval method, device and storage medium based on dynamic convolution and shortcuts
CN114898342A (en) * 2022-07-15 2022-08-12 深圳市城市交通规划设计研究中心股份有限公司 Method for detecting call receiving and making of non-motor vehicle driver in driving
CN114898342B (en) * 2022-07-15 2022-11-25 深圳市城市交通规划设计研究中心股份有限公司 Method for detecting call receiving and making of non-motor vehicle driver in driving
CN115050101A (en) * 2022-07-18 2022-09-13 四川大学 Gait recognition method based on skeleton and contour feature fusion
CN115050101B (en) * 2022-07-18 2024-03-22 四川大学 Gait recognition method based on fusion of skeleton and contour features
CN116738325A (en) * 2023-08-16 2023-09-12 湖北工业大学 Method and system for identifying lower limb exoskeleton movement pattern based on DenseNet-LSTM network
CN117253330A (en) * 2023-10-11 2023-12-19 安徽凯旋智能停车设备有限公司 Safety control system for three-dimensional parking garage
CN118298348A (en) * 2024-03-23 2024-07-05 武汉体育学院 Basketball shooting result prediction method and basketball shooting result prediction system based on machine learning
CN118415626A (en) * 2024-04-25 2024-08-02 同济大学 Gait analysis and recognition of humanoid robot and robot control method
CN118408557A (en) * 2024-07-04 2024-07-30 和鲸科技(杭州)有限公司 Dynamic intelligent guiding method and system

Similar Documents

Publication Publication Date Title
Nadeem et al. Automatic human posture estimation for sport activity recognition with robust body parts detection and entropy markov model
CN113378649A (en) Identity, position and action recognition method, system, electronic equipment and storage medium
Alam et al. Vision-based human fall detection systems using deep learning: A review
Vishnu et al. Human fall detection in surveillance videos using fall motion vector modeling
Chen et al. Vision-based fall event detection in complex background using attention guided bi-directional LSTM
Planinc et al. Introducing the use of depth data for fall detection
WO2020042419A1 (en) Gait-based identity recognition method and apparatus, and electronic device
Roche et al. A multimodal data processing system for LiDAR-based human activity recognition
CN110135249B (en) Human behavior identification method based on time attention mechanism and LSTM (least Square TM)
US20220180534A1 (en) Pedestrian tracking method, computing device, pedestrian tracking system and storage medium
CN103020606B (en) Pedestrian detection method based on spatio-temporal context information
CN106295568A (en) The mankind's naturalness emotion identification method combined based on expression and behavior bimodal
Kumar et al. Indian sign language recognition using graph matching on 3D motion captured signs
Fei et al. Flow-pose Net: An effective two-stream network for fall detection
Ma et al. Human motion gesture recognition based on computer vision
CN112989889A (en) Gait recognition method based on posture guidance
CN114241379A (en) Passenger abnormal behavior identification method, device and equipment and passenger monitoring system
CN112906520A (en) Gesture coding-based action recognition method and device
Yan et al. Human-object interaction recognition using multitask neural network
Rinchi et al. Lidar technology for human activity recognition: Outlooks and challenges
Li et al. Dynamic long short-term memory network for skeleton-based gait recognition
Waheed et al. An automated human action recognition and classification framework using deep learning
Abedi et al. Modification of deep learning technique for face expressions and body postures recognitions
Yadav et al. Human Illegal Activity Recognition Based on Deep Learning Techniques
CN116824641B (en) Gesture classification method, device, equipment and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination