CN114898466A - Video motion recognition method and system for smart factory - Google Patents
- Publication number
- CN114898466A (application number CN202210521070.6A)
- Authority
- CN
- China
- Prior art keywords
- video
- factory
- position information
- worker
- recognition
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/32—Normalisation of the pattern dimensions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Artificial Intelligence (AREA)
- Software Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Molecular Biology (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Psychiatry (AREA)
- Social Psychology (AREA)
- Human Computer Interaction (AREA)
- Image Analysis (AREA)
- Image Processing (AREA)
Abstract
The invention relates to the technical field of deep learning action recognition, and in particular to a video action recognition method and system for an intelligent factory. The recognition method comprises the following steps: S101, a factory video data segment generation step; S102, a factory worker operation action data set generation step; S103, a factory operation target detection data set generation step; S104, a factory worker action recognition model modeling step; S105, a factory worker position information coding network training step; S106, a factory worker behavior recognition algorithm building step; S107, a behavior recognition input step; and S108, a behavior recognition and output step. The system comprises: a model training program, a label file generation program, model training electronic equipment, a processing and computing center, a server side and a video monitoring terminal. Compared with traditional action recognition methods, which usually express a video with RGB features alone, the method largely eliminates the influence of other information when extracting video features, thereby improving the action recognition effect for factory workers.
Description
Technical Field
The invention relates to the technical field of deep learning action recognition, in particular to a video action recognition method and system for an intelligent factory.
Background
Work on factory worker action recognition mainly focuses on defining worker actions, producing data sets, and modeling action recognition models. The current mainstream methods are as follows: first, image recognition methods, which take a single picture as input and recognize the state of a worker at a given moment in order to judge the worker's action; second, video classification methods, which input a frame sequence into a network to recognize the action; and third, sensor-based methods, which extract action-related information from sensors and make the judgment in combination with a deep learning method.
The prior art has the following defects: (1) many actions cannot be determined from the state at a single moment and must be judged from their temporal relationships, so action recognition methods based on image recognition cannot handle such cases; (2) recognition of an action often relies heavily on the scene and on action-related objects rather than on the motion itself, so when no useful information can be obtained from the scene or the related objects, the recognition result is poor; (3) current deep learning methods model temporal information insufficiently well and cannot accurately model different temporal relationships; (4) most current action recognition research detects actions with supervised learning, which can reach high precision and recall on a data set, but is limited by the cost of labeling video data and cannot cover the massive range of dynamic behaviors in real scenes, so the practical deployment effect is poor.
Disclosure of Invention
In order to solve the above problems, the present invention provides a video motion recognition method and system for an intelligent factory.
A video motion recognition method for an intelligent factory specifically comprises the following steps:
s101, a factory video data segment generation step: processing the videos of factory worker operations by using image preprocessing technology, and converting all original videos into usable factory worker operation data segments;
s102, a factory worker operation action data set generation step: labeling the factory worker operation data segments for classification, and making them into data from which an action recognition model can learn;
s103, a factory operation target detection data set generation step: outputting the worker operation videos as frames, sampling the pictures, and frame-selecting and marking targets such as people, workbenches and operated workpieces;
s104, a factory worker action recognition model modeling step: after frame sampling, cropping and data enhancement, converting the data of the factory worker operation action data set into a standard data sequence acceptable to the model and inputting it into a 3D-ResNet deep neural network suitable for video understanding to train the model;
s105, a factory worker position information coding network training step: after scaling and normalization preprocessing and data enhancement such as flipping, random placement and mosaic, inputting the factory operation target detection data set into a target detection algorithm for training so that it can provide the position information of workers, the operation table and the operated workpiece, and then embedding the position information into a multi-channel matrix and inputting it into the position coding branch for training;
s106, a factory worker behavior recognition algorithm building step: splicing the depth features output at the tails of the trained action recognition model and the position information coding model, so that the action recognition network and the position information coding network respectively form an action recognition branch and a position information coding branch and yield depth features that contain position information codes and reflect worker behavior, inputting these into a fully connected layer, and freezing the previously trained network parameters to obtain a complete worker behavior recognition model;
s107, behavior recognition input step: inputting a video needing to identify the behavior of a worker into a factory worker behavior identification model;
s108, behavior recognition and output: and obtaining a behavior prediction probability vector based on the trained factory worker behavior recognition model, comparing the behavior category vectors to obtain a behavior recognition result, and sending the recognition result to a server in a socket communication mode.
The overall training framework of the algorithm is supervised learning. The core idea of supervised learning is to label the data and, during training, to optimize the model parameters according to the model's output for each input and its label, so that the algorithm finds the correspondence between inputs and labels, learns an optimal model from the data set, and can predict the corresponding label when presented with unlabeled data.
The processing of the videos of factory worker operations in step S101 specifically comprises: preprocessing, labeling and classifying the monitoring video stream data, and converting the monitoring video stream into a worker action recognition data set.
The factory video data segment generation in step S102 is specifically as follows: first, image cropping technology is used to crop the video frames to the worker's working area so as to eliminate the influence of other areas; then, video clipping technology is used to cut the factory worker operation video into segments according to action type, taking the action starting point as the start and the action ending point as the finish.
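A minimal sketch of this segment generation, assuming OpenCV is available and that the crop box and the action start/end frame indices are supplied from the labeling work (file names, coordinates and frame numbers below are hypothetical):

```python
import cv2

def clip_and_crop(video_path, out_path, start_frame, end_frame, crop_box):
    """Cut one action segment [start_frame, end_frame) from a source video
    and crop every frame to the worker's working area (x, y, w, h)."""
    x, y, w, h = crop_box
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    cap.set(cv2.CAP_PROP_POS_FRAMES, start_frame)
    for _ in range(start_frame, end_frame):
        ok, frame = cap.read()
        if not ok:
            break
        writer.write(frame[y:y + h, x:x + w])  # keep only the working area
    cap.release()
    writer.release()

# Hypothetical usage: one action segment, cropped to the workbench region.
# clip_and_crop("cam01_raw.mp4", "assemble_0001.mp4", 1200, 1290, (320, 180, 640, 480))
```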
The labeling specification of the workpiece target detection data in step S103 is as follows: outputting a worker operation video to form a frame, sampling pictures, selecting a workpiece operated by a person, not marking all workpieces in the picture, and only detecting the workpiece operated by the worker to avoid inputting noise information of irrelevant actions to a neural network.
The neural network for recognizing factory worker behaviors in step S104 is composed of two neural network branches. One is a classical deep learning action recognition algorithm based on 3D-ResNet, built from 3D convolution kernels, which can move along the time dimension, extract temporal features, and directly take a continuous frame sequence to recognize the action. The other is a depth position information coding network: the frame sequence position information extracted by the target detection algorithm is embedded into a four-dimensional matrix and then input into the depth position information coding branch. Finally, the action modeling depth features output by the action recognition branch and the depth position codes output by the position information coding branch are spliced and input into a fully connected layer for prediction.
The design of the frame sequence position information feature embedding matrix for the factory worker action recognition target detection features in step S105 is as follows: target detection is performed on the n frames sampled from the video clip to be recognized, and the detection information of each frame is embedded into a k-channel matrix; the number of channels k depends on the number of target types of interest for action recognition, each channel is a matrix of size 1×4 holding the information of a target detection box, and each channel represents the position information of one type of target.
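A sketch of one way to build such an embedding, assuming the detector returns (class_id, x1, y1, x2, y2) tuples per sampled frame; the tensor shape (n × k × max_boxes × 4) and the max_boxes cap are illustrative assumptions, not the patent's exact layout:

```python
import numpy as np

def embed_positions(detections_per_frame, n_frames, n_classes, max_boxes=4):
    """Pack per-frame detections into an (n, k, max_boxes, 4) position matrix.
    detections_per_frame: list of length n_frames; each entry is a list of
    (class_id, x1, y1, x2, y2) boxes, coordinates normalised to [0, 1]."""
    pos = np.zeros((n_frames, n_classes, max_boxes, 4), dtype=np.float32)
    for t, boxes in enumerate(detections_per_frame):
        used = [0] * n_classes                    # next free slot in each class channel
        for cls, x1, y1, x2, y2 in boxes:
            if used[cls] < max_boxes:
                pos[t, cls, used[cls]] = (x1, y1, x2, y2)
                used[cls] += 1
    return pos

# Hypothetical usage: 16 sampled frames, 3 target classes (worker, workbench, workpiece).
# matrix = embed_positions(yolo_outputs, n_frames=16, n_classes=3)
```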
A video motion recognition system for a smart factory comprises:
the model training program is used for inputting a data set file to the action recognition branch to obtain action information depth vectors and inputting the data set file to the YOLO target detection network to obtain a target position information matrix and then inputting the position information coding network to output position information depth codes;
a label file generation program constructs the detailed information of the data set in a dictionary file form so as to facilitate the training module to use at any time;
the model training electronic equipment is used for saving the model parameters obtained by the model training verification cycle as files and outputting training verification data as log files;
the video monitoring terminal is used for acquiring data;
the processing and calculating center is used for processing and identifying the transmitted video data and then transmitting the video data to the next terminal;
and the server is used for storing and using the identification result of the transmitted data.
The model training program specifically comprises:
the video sampling module samples frames from an input video either at equal intervals or at a random position within each equal-length interval (a minimal sampling sketch follows this module list);
the image preprocessing module is used for converting the format of a video frame, cutting a picture, scaling the size, labeling the category and the like, and converting the original video segment into a worker behavior data set for model training;
the action recognition network module is used for passing the input worker behavior video data set through the neural network to convert it into behavior depth feature vectors, providing the appearance and temporal information in the video for the next step of worker behavior recognition;
the target detection network module is used for detecting the types and positions of targets interested by the algorithm in the video frames and providing the target types and positions to the position information coding module of the next step;
the position information coding module is used for embedding the target type and the position information output by the target detection module into a position information matrix and converting the target type and the position information into a position information depth characteristic vector through a position information depth coding network;
and the joint network training module is used for freezing the parameters of designated modules and obtaining the worker action prediction probability from the joint depth information vector formed by splicing the behavior depth feature vector provided by the action recognition module with the position information depth feature vector provided by the position information coding module.
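A minimal sketch of the two sampling strategies mentioned for the video sampling module (equal-interval, or one random frame inside each equal-length interval); the clip length and sample count below are hypothetical:

```python
import random

def sample_frame_indices(total_frames, n_samples, mode="uniform"):
    """Pick n_samples frame indices from a clip of total_frames frames,
    either at equal intervals or at a random offset inside each interval."""
    interval = total_frames / n_samples
    if mode == "uniform":
        return [int(i * interval) for i in range(n_samples)]
    # "random": one random frame per equal-length interval
    return [int(i * interval + random.random() * interval) for i in range(n_samples)]

# Hypothetical usage: pick 16 of 240 frames to feed the 3D-ResNet branch.
# indices = sample_frame_indices(240, 16, mode="random")
```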
The model training electronic equipment comprises a storage for storing model parameters, a data set and a labeling file, a processor for training a model algorithm, a model training algorithm storage and a video recording terminal.
The processing and computing center comprises a behavior recognition network model file library, a computing center, a memory and a server; the identification network model file library stores the parameter data of the deep neural network to be used by a calculation center; the computing center carries out preprocessing, feature extraction and behavior prediction on video data transmitted by the video monitoring terminal, and prediction results are stored in a log file of the memory and are simultaneously transmitted to the server side for use.
The invention has the beneficial effects that: the video action recognition method and system for the intelligent factory address the domain specificity and the difficulty of temporal modeling in factory action recognition, and achieve a worker action recognition accuracy of about 95% on the verification set; compared with traditional action recognition methods, which usually express a video with RGB features alone, the method largely eliminates the influence of noise and redundant information when extracting video features, thereby improving the action recognition effect for factory workers.
Drawings
The invention is further illustrated with reference to the following figures and examples.
FIG. 1 is a block diagram illustrating an overall schematic diagram of a video motion recognition method for an intelligent factory according to an embodiment of the present invention;
FIG. 2 is a block diagram of a method for identifying video motion of an intelligent factory according to an embodiment of the present invention;
FIG. 3 illustrates a video motion recognition network model according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a model training procedure according to an embodiment of the present invention;
FIG. 5 is a schematic structure of a markup dictionary file according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of model training electronics in accordance with an embodiment of the present invention;
FIG. 7 is a schematic diagram of a plant worker action recognition model execution electronics in accordance with an embodiment of the present invention;
fig. 8 is a schematic diagram of detection information and a matrix according to an embodiment of the invention.
Detailed Description
In order to make the technical means, the creation characteristics, the achievement purposes and the effects of the invention easy to understand, the invention is further explained below.
As shown in fig. 1 to 8, a video motion recognition method for an intelligent factory specifically includes the following steps:
s101, a factory video data segment generation step: processing the videos of factory worker operations by using image preprocessing technology, and converting all original videos into usable factory worker operation data segments;
s102, a factory worker operation action data set generation step: labeling the factory worker operation data segments for classification, and making them into data from which an action recognition model can learn;
s103, a factory operation target detection data set generation step: outputting the worker operation videos as frames, sampling the pictures, and frame-selecting and marking targets such as people, workbenches and operated workpieces;
s104, a factory worker action recognition model modeling step: after frame sampling, cropping and data enhancement, converting the data of the factory worker operation action data set into a standard data sequence acceptable to the model and inputting it into a 3D-ResNet deep neural network suitable for video understanding to train the model;
s105, a factory worker position information coding network training step: after scaling and normalization preprocessing and data enhancement such as flipping, random placement and mosaic, inputting the factory operation target detection data set into a target detection algorithm for training so that it can provide the position information of workers, the operation table and the operated workpiece, and then embedding the position information into a multi-channel matrix and inputting it into the position coding branch for training;
s106, a factory worker behavior recognition algorithm building step: splicing the depth features output at the tails of the trained action recognition model and the position information coding model, so that the action recognition network and the position information coding network respectively form an action recognition branch and a position information coding branch and yield depth features that contain position information codes and reflect worker behavior, inputting these into a fully connected layer, and freezing the previously trained network parameters to obtain a complete worker behavior recognition model;
s107, behavior recognition input step: inputting a video needing to identify worker behaviors into a factory worker behavior identification model;
s108, behavior recognition and output: the method comprises the steps of obtaining a behavior prediction probability vector based on a trained factory worker behavior recognition model, comparing behavior category vectors to obtain a behavior recognition result, and meanwhile sending the recognition result to a server in a socket communication mode, so that the real-time performance is higher, and the method is convenient to observe and detect at any time.
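A minimal sketch of the socket hand-off described in S108, assuming the server side simply accepts a JSON payload over TCP; the host, port and message format are hypothetical, not specified by the invention:

```python
import json
import socket

def send_result(host, port, action_label, probabilities):
    """Send one behavior recognition result to the server over a plain TCP socket."""
    payload = json.dumps({"action": action_label, "probs": probabilities}).encode("utf-8")
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(payload)

# Hypothetical usage after one clip has been classified:
# send_result("192.168.1.10", 9000, "assemble_part", [0.02, 0.95, 0.03])
```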
The video action recognition method and system for the intelligent factory address the domain specificity and the difficulty of temporal modeling in factory action recognition, and achieve a worker action recognition accuracy of about 95% on the verification set; compared with traditional action recognition methods, which usually express a video with RGB features alone, the method largely eliminates the influence of noise and redundant information when extracting video features, thereby improving the action recognition effect for factory workers.
The processing of the videos of factory worker operations in step S101 specifically comprises: preprocessing, labeling and classifying the monitoring video stream data, and converting the monitoring video stream into a worker action recognition data set.
The factory video data segment generation in step S102 is specifically as follows: first, image cropping technology is used to crop the video frames to the worker's working area so as to eliminate the influence of other areas; then, video clipping technology is used to cut the factory worker operation video into segments according to action type, taking the action starting point as the start and the action ending point as the finish.
The labeling specification of the workpiece target detection data in step S103 is as follows: the method comprises the steps of outputting a worker operation video to form a frame, sampling pictures, selecting a workpiece operated by a person, not marking all workpieces in the frame, and only detecting the workpiece operated by the worker so as to avoid inputting noise information of irrelevant actions to a neural network.
The neural network for recognizing factory worker behaviors in step S104 is composed of two neural network branches. One is a classical deep learning action recognition algorithm based on 3D-ResNet, built from 3D convolution kernels, which can move along the time dimension, extract temporal features, and directly take a continuous frame sequence to recognize the action. The other is a depth position information coding network: the frame sequence position information extracted by the target detection algorithm is embedded into a four-dimensional matrix and then input into the depth position information coding branch. Finally, the action modeling depth features output by the action recognition branch and the depth position codes output by the position information coding branch are spliced and input into a fully connected layer for prediction.
The design of the frame sequence position information feature embedding matrix for the factory worker action recognition target detection features in step S105 is as follows: target detection is performed on the n frames sampled from the video clip to be recognized, and the detection information of each frame is embedded into a k-channel matrix; the number of channels k depends on the number of target types of interest for action recognition, each channel is a matrix of size 1×4 holding the information of a target detection box, and each channel represents the position information of one type of target, as shown in fig. 8.
Preprocessing the video in steps S104 and S105 makes the recognition more accurate and more efficient, and keeps the method well targeted to the factory scenario.
In step S107, the factory worker action recognition model execution device takes as input the trained model, the video stream transmitted by the video monitoring terminal, and all model parameters, and obtains the worker behavior prediction result for the input factory monitoring video stream based on the trained factory worker action recognition system.
Fig. 3 shows the video action recognition network model. The action recognition neural network model is composed of three sub-neural-network modules, two of which form the two branches. One branch is a classical deep learning action recognition algorithm based on 3D-ResNet, built from 3D convolution kernels, which can move along the time dimension t, extract temporal features, and directly take a continuous frame sequence to recognize the action. The other is a depth position information coding network, in which the frame sequence position information extracted by the YOLOv3 target detection algorithm is embedded into a four-dimensional matrix and then input into the depth position information coding branch. YOLOv3 uses Darknet-53 as its backbone and has very good image recognition performance; the whole network mainly consists of 5 groups of residual blocks, and the sizes of the prior boxes are obtained by K-means clustering. Because the complexity of the position information is far lower than that of the image, the position information coding branch can connect features through fully connected layers, so its temporal receptive field directly covers the whole input sequence, which makes it easy to obtain long-range temporal information and makes temporal modeling easier. Finally, the action modeling depth features output by the action recognition branch and the depth position codes output by the position information coding branch are spliced and input into a fully connected layer for prediction.
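A simplified PyTorch sketch of this two-branch structure, assuming torchvision's r3d_18 as the 3D-ResNet backbone and a small fully connected stack as the position information coding branch; the layer sizes, feature dimensions and number of classes are illustrative assumptions, not the patent's actual configuration:

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

class WorkerBehaviorNet(nn.Module):
    def __init__(self, n_classes, pos_dim, pos_feat=128):
        super().__init__()
        backbone = r3d_18(weights=None)          # torchvision >= 0.13 API assumed
        backbone.fc = nn.Identity()              # keep the 512-d clip feature
        self.action_branch = backbone            # 3D-ResNet action recognition branch
        self.pos_branch = nn.Sequential(         # position information coding branch
            nn.Flatten(),
            nn.Linear(pos_dim, 256), nn.ReLU(),
            nn.Linear(256, pos_feat), nn.ReLU(),
        )
        self.head = nn.Linear(512 + pos_feat, n_classes)  # fused fully connected layer

    def forward(self, clip, pos_matrix):
        # clip: (B, 3, T, H, W); pos_matrix: (B, T, k, max_boxes, 4)
        a = self.action_branch(clip)
        p = self.pos_branch(pos_matrix)
        return self.head(torch.cat([a, p], dim=1))   # splice features, then predict

# Hypothetical usage: 16-frame clips, 3 target classes, up to 4 boxes per class.
# model = WorkerBehaviorNet(n_classes=8, pos_dim=16 * 3 * 4 * 4)
```

The YOLOv3 detector is kept outside this module: its boxes are converted into the position matrix beforehand, matching the separation between target detection and position coding described above.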
As shown in fig. 7, the factory worker action recognition model execution electronic device used in step S108 comprises: a video monitoring terminal; a video stream transmission interface; a behavior category library; a model input interface; a worker action recognition model algorithm processor; a data memory; a display; a recognition result transmission interface; and an action recognition model algorithm memory. After the device is powered on and the processor runs the worker action recognition algorithm program, the following steps are executed:
receiving, through the video stream transmission interface and in real time, the video stream transmitted from the video monitoring terminal for the worker action recognition algorithm;
acquiring the trained models from the model input interface, and sending the models, the behavior category library and all model parameters to the worker action recognition algorithm processor;
the worker action recognition algorithm processor carries out prediction recognition on the actions of factory workers based on a trained action recognition model, a YOLO target detection algorithm and a position information coding network;
the worker action recognition model algorithm processor outputs the prediction result of the worker action to a data memory and the recognition result transmission interface for a server program to take;
the display displays the worker action recognition results, the real-time video stream and the result transmission conditions of the recognition result transmission interface which are stored in the data storage.
A video motion recognition system for a smart factory comprises:
the model training program is used for inputting a data set file to the action recognition branch to obtain action information depth vectors and inputting the data set file to the YOLO target detection network to obtain a target position information matrix and then inputting the position information coding network to output position information depth codes;
a label file generation program, which constructs the detailed information of the data set in the form of a dictionary file to be convenient for a training module to use at any time, wherein a schematic structure of the label dictionary file is given in fig. 5;
the model training electronic equipment is used for saving the model parameters obtained by the model training verification cycle as files and outputting training verification data as log files;
the video monitoring terminal is used for acquiring data;
the processing and calculating center is used for processing and identifying the transmitted video data and then transmitting the processed and identified video data to the next terminal;
and the server is used for storing and using the identification result of the transmitted data.
The model training program specifically comprises:
the video sampling module samples frames from an input video either at equal intervals or at a random position within each equal-length interval;
the image preprocessing module is used for converting the format of a video frame, cutting a picture, scaling the size, labeling the category and the like, and converting the original video segment into a worker behavior data set for model training;
the action recognition network module is used for passing the input worker behavior video data set through the neural network to convert it into behavior depth feature vectors, providing the appearance and temporal information in the video for the next step of worker behavior recognition;
the target detection network module is used for detecting the types and positions of targets interested by the algorithm in the video frames and providing the target types and positions to the position information coding module of the next step;
the position information coding module is used for embedding the target type and the position information output by the target detection module into a position information matrix and converting the position information matrix into a position information depth characteristic vector through a position information depth coding network;
and the joint network training module is used for freezing the parameters of designated modules and obtaining the worker action prediction probability from the joint depth information vector formed by splicing the behavior depth feature vector provided by the action recognition module with the position information depth feature vector provided by the position information coding module.
The model training program is used for executing the model training and verification cycle: the data set file is input into the action recognition branch to obtain the action information depth vector, and into the YOLO target detection network to obtain the target position information matrix, which is then input into the position information coding network to output the position information depth code; the action information depth vector and the position information depth code are spliced and input into a fully connected layer to obtain the output prediction result, the quality of the result is judged by a CrossEntropyLoss discriminator, and the parameters are continuously optimized by an SGD optimizer.
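A compressed sketch of one training pass of that cycle, assuming a PyTorch DataLoader yields (clip, position_matrix, label) batches and a fused two-branch model such as the WorkerBehaviorNet sketched above; all names and hyperparameters are illustrative:

```python
import torch
import torch.nn as nn

def train_one_epoch(model, loader, device, lr=0.01):
    """One pass of the training half of the training/verification cycle."""
    criterion = nn.CrossEntropyLoss()             # judges the quality of the prediction
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    model.train()
    running_loss = 0.0
    for clip, pos, label in loader:
        clip, pos, label = clip.to(device), pos.to(device), label.to(device)
        optimizer.zero_grad()
        logits = model(clip, pos)                 # spliced action + position prediction
        loss = criterion(logits, label)
        loss.backward()
        optimizer.step()                          # SGD parameter update
        running_loss += loss.item()
    return running_loss / max(len(loader), 1)
```

A verification pass would repeat the loop with model.eval() and torch.no_grad(), logging accuracy to the training log file described below.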
And outputting all models and model parameters thereof to a model output interface.
And storing the obtained model parameters of the model training verification cycle as files, and outputting training verification data as log files.
The model training electronic equipment comprises a storage for storing model parameters, a data set and a label file, a processor for training a model algorithm, a model training algorithm storage and a video recording terminal.
After the model training electronic equipment is powered on and the processor runs the program of the algorithm model, the following steps are executed:
executing a model building script, reproducing a model, including an action identification branch and a position information coding branch, and building an initial model structure;
reading a pre-trained model file from the memory, extracting the model parameters and assigning them to the model to obtain an initial action recognition model;
generating a discriminator for discriminating the quality of the output result of the model and the training level of the model;
running a configuration script in a training stage, determining a video preprocessing method in the training stage, a data enhancement mode of a video, loading a data set example, determining the type and parameter setting of an optimizer and the setting type of a training log;
and running a configuration script of the verification stage, determining a video preprocessing method of the verification stage, loading a data set example of the verification stage and setting the verification log.
The processing and computing center communicates with the server and with the external monitoring terminal; the processing center obtains the video data transmitted by the video monitoring terminal, processes and recognizes it, and transmits it to the server side for storing and using the recognition result.
The processing and computing center comprises a behavior recognition network model file library, a computing center, a memory and a server; the identification network model file library stores the parameter data of the deep neural network to be used by a calculation center; the computing center carries out preprocessing, feature extraction and behavior prediction on video data transmitted by the video monitoring terminal, and prediction results are stored in a log file of the memory and are simultaneously transmitted to the server side for use.
Compared with the prior art, the invention has the following three innovation points.
Aiming at the poor training results caused by the particularity of the factory worker action recognition domain and the lack of labeled data, a complete set of specifications and workflows is provided, covering the path from monitoring video stream to data set production as well as the format and specification of the labeling file.
Aiming at the fact that traditional action recognition algorithms generally use only the RGB stream of the video, target position information based on target detection is introduced, providing a new modality of information for action recognition.
The invention provides a mixed neural network model based on deep learning action recognition and target recognition and a video action recognition method facing a smart factory, which comprises an action recognition branch and a target position and type information coding branch.
The foregoing shows and describes the general principles, principal features, and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are merely illustrative of the principles of the invention, but that various changes and modifications may be made without departing from the spirit and scope of the invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.
Claims (10)
1. A video motion recognition method for an intelligent factory is characterized by comprising the following steps: the method specifically comprises the following steps:
s101, a factory video data segment generation step: processing the videos of factory worker operations by using image preprocessing technology, and converting all original videos into usable factory worker operation data segments;
s102, generating a factory worker operation action data set: marking labels on the operation data segments of factory workers for classification, and making the operation data segments of the factory workers into data for learning of an action recognition model;
s103, a factory work target detection data set generation step: outputting a worker operation video to form a frame, sampling a picture, and performing frame selection and marking on a person, a workbench and an operation workpiece type target;
s104, a factory worker action recognition model modeling step: after frame sampling, cropping and data enhancement, converting the data of the factory worker operation action data set into a standard data sequence acceptable to the model and inputting it into a 3D-ResNet deep neural network suitable for video understanding to train the model;
s105, a factory worker position information coding network training step: after scaling and normalization preprocessing and data enhancement such as flipping, random placement and mosaic, inputting the factory operation target detection data set into a target detection algorithm for training so that it can provide the position information of workers, the operation table and the operated workpiece, and then embedding the position information into a multi-channel matrix and inputting it into the position coding branch for training;
s106, building a factory worker behavior recognition algorithm: splicing the trained action recognition model and the depth characteristics output by the tail part of the position information coding model, enabling the action recognition network and the position information coding network to respectively form an action recognition branch and a position information coding branch to form the depth characteristics which contain position information codes and reflect the behaviors of workers, inputting a full connection layer and freezing the network parameters before training to obtain a complete worker behavior recognition model;
s107, behavior recognition input step: inputting a video needing to identify worker behaviors into a factory worker behavior identification model;
s108, behavior recognition and output: and obtaining a behavior prediction probability vector based on the trained factory worker behavior recognition model, comparing the behavior category vectors to obtain a behavior recognition result, and sending the recognition result to a server in a socket communication mode.
2. The method of claim 1, wherein the video motion recognition method for intelligent factory comprises: the processing of the videos of factory worker operations in step S101 specifically includes: preprocessing, labeling and classifying the monitoring video stream data, and converting the monitoring video stream into a worker action recognition data set.
3. The method of claim 1, wherein the video motion recognition method for intelligent factory comprises: the factory video data segment generation in step S102 is specifically as follows: first, image cropping technology is used to crop the video frames to the worker's working area so as to eliminate the influence of other areas; then, video clipping technology is used to cut the factory worker operation video into segments according to action type, taking the action starting point as the start and the action ending point as the finish.
4. The method of claim 1, wherein the video motion recognition method for intelligent factory comprises: the labeling specification of the workpiece target detection data in step S103 is as follows: outputting a worker operation video to form a frame, sampling pictures, selecting a workpiece operated by a person, not marking all workpieces in the picture, and only detecting the workpiece operated by the worker to avoid inputting noise information of irrelevant actions to a neural network.
5. The method of claim 1, wherein the video motion recognition method for intelligent factory comprises: the neural network for recognizing factory worker behaviors in step S104 is composed of two neural network branches; one is a classical deep learning action recognition algorithm based on 3D-ResNet, built from 3D convolution kernels, which can move along the time dimension, extract temporal features, and directly take a continuous frame sequence to recognize the action; the other is a depth position information coding network, in which the frame sequence position information extracted by the target detection algorithm is embedded into a four-dimensional matrix and then input into the depth position information coding branch; finally, the action modeling depth features output by the action recognition branch and the depth position codes output by the position information coding branch are spliced and input into a fully connected layer for prediction.
6. The method of claim 1, wherein the video motion recognition method for intelligent factory comprises: the specific design steps of the frame sequence position information characteristic embedded matrix for the factory worker action recognition target detection characteristics in the step S105 are as follows: the method comprises the steps of carrying out target detection by adopting n frames sampled on a video clip to be detected, embedding detection information on each frame into a matrix of a k channel, wherein the number of k channels depends on the number of types of targets concerned by motion recognition, each channel is a matrix with the size of 1-4, information of each target detection frame is contained in each channel, and each channel represents position information of one type of targets respectively.
7. The system for identifying video motion of intelligent factory facing to any one of claims 1 to 6, wherein: the method comprises the following steps:
the model training program is used for inputting a data set file to the action recognition branch to obtain action information depth vectors and inputting the data set file to the YOLO target detection network to obtain a target position information matrix and then inputting the position information coding network to output position information depth codes;
a label file generation program constructs the detailed information of the data set in a dictionary file form so as to facilitate the training module to use at any time;
the model training electronic equipment is used for saving the model parameters obtained by the model training verification cycle as files and outputting training verification data as log files;
the video monitoring terminal is used for acquiring data;
the processing and calculating center is used for processing and identifying the transmitted video data and then transmitting the processed and identified video data to the next terminal;
and the server is used for storing and using the identification result of the transmitted data.
8. The system of claim 7, wherein: the model training program specifically comprises:
the video sampling module samples frames in an input video at equal intervals or randomly at equal intervals;
the image preprocessing module is used for converting the format of the video frame, cutting the picture, scaling the size, labeling the category and the like, and converting the original video segment into a worker behavior data set for model training;
the action recognition network module is used for passing the input worker behavior video data set through the neural network to convert it into behavior depth feature vectors, providing the appearance and temporal information in the video for the next step of worker behavior recognition;
the target detection network module is used for detecting the types and positions of targets interested by the algorithm in the video frames and providing the target types and positions to the position information coding module of the next step;
the position information coding module is used for embedding the target type and the position information output by the target detection module into a position information matrix and converting the target type and the position information into a position information depth characteristic vector through a position information depth coding network;
and the joint network training module is used for freezing the parameters of designated modules and obtaining the worker action prediction probability from the joint depth information vector formed by splicing the behavior depth feature vector provided by the action recognition module with the position information depth feature vector provided by the position information coding module.
9. The system of claim 7, wherein: the model training electronic equipment comprises a storage for storing model parameters, a data set and a label file, a processor for training a model algorithm, a model training algorithm storage and a video recording terminal.
10. The system of claim 7, wherein: the processing and computing center comprises a behavior recognition network model file library, a computing center, a memory and a server; the identification network model file library stores the parameter data of the deep neural network to be used by a calculation center; the computing center carries out preprocessing, feature extraction and behavior prediction on video data transmitted by the video monitoring terminal, and prediction results are stored in a log file of the memory and are simultaneously transmitted to the server side for use.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210521070.6A CN114898466B (en) | 2022-05-13 | 2022-05-13 | Intelligent factory-oriented video action recognition method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210521070.6A CN114898466B (en) | 2022-05-13 | 2022-05-13 | Intelligent factory-oriented video action recognition method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114898466A true CN114898466A (en) | 2022-08-12 |
CN114898466B CN114898466B (en) | 2024-08-23 |
Family
ID=82722522
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210521070.6A Active CN114898466B (en) | 2022-05-13 | 2022-05-13 | Intelligent factory-oriented video action recognition method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114898466B (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115469627A (en) * | 2022-11-01 | 2022-12-13 | 山东恒远智能科技有限公司 | Intelligent factory operation management system based on Internet of things |
CN116258466A (en) * | 2023-05-15 | 2023-06-13 | 国网山东省电力公司菏泽供电公司 | Multi-mode power scene operation specification detection method, system, equipment and medium |
CN116386148A (en) * | 2023-05-30 | 2023-07-04 | 国网江西省电力有限公司超高压分公司 | Knowledge graph guide-based small sample action recognition method and system |
CN116524395A (en) * | 2023-04-04 | 2023-08-01 | 江苏智慧工场技术研究院有限公司 | Intelligent factory-oriented video action recognition method and system |
WO2024065189A1 (en) * | 2022-09-27 | 2024-04-04 | Siemens Aktiengesellschaft | Method, system, apparatus, electronic device, and storage medium for evaluating work task |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109986560A (en) * | 2019-03-19 | 2019-07-09 | 埃夫特智能装备股份有限公司 | A kind of mechanical arm self-adapting grasping method towards multiple target type |
CN112633378A (en) * | 2020-12-24 | 2021-04-09 | 电子科技大学 | Intelligent detection method and system for multimodal image fetus corpus callosum |
CN113903081A (en) * | 2021-09-29 | 2022-01-07 | 北京许继电气有限公司 | Visual identification artificial intelligence alarm method and device for images of hydraulic power plant |
CN114330503A (en) * | 2021-12-06 | 2022-04-12 | 北京无线电计量测试研究所 | Smoke flame identification method and device |
-
2022
- 2022-05-13 CN CN202210521070.6A patent/CN114898466B/en active Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109986560A (en) * | 2019-03-19 | 2019-07-09 | 埃夫特智能装备股份有限公司 | A kind of mechanical arm self-adapting grasping method towards multiple target type |
CN112633378A (en) * | 2020-12-24 | 2021-04-09 | 电子科技大学 | Intelligent detection method and system for multimodal image fetus corpus callosum |
CN113903081A (en) * | 2021-09-29 | 2022-01-07 | 北京许继电气有限公司 | Visual identification artificial intelligence alarm method and device for images of hydraulic power plant |
CN114330503A (en) * | 2021-12-06 | 2022-04-12 | 北京无线电计量测试研究所 | Smoke flame identification method and device |
Non-Patent Citations (1)
Title |
---|
ZHOU MAN; LIU ZHIYONG; CHEN MENGCHI; ZHAO YUYANG; YANG LUJIANG: "Application of transfer learning based on AlexNet in image recognition for the process industry" (基于AlexNet的迁移学习在流程工业图像识别中的应用), Industrial Control Computer, no. 11, 25 November 2018 (2018-11-25) *
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2024065189A1 (en) * | 2022-09-27 | 2024-04-04 | Siemens Aktiengesellschaft | Method, system, apparatus, electronic device, and storage medium for evaluating work task |
CN115469627A (en) * | 2022-11-01 | 2022-12-13 | 山东恒远智能科技有限公司 | Intelligent factory operation management system based on Internet of things |
CN115469627B (en) * | 2022-11-01 | 2023-04-04 | 山东恒远智能科技有限公司 | Intelligent factory operation management system based on Internet of things |
CN116524395A (en) * | 2023-04-04 | 2023-08-01 | 江苏智慧工场技术研究院有限公司 | Intelligent factory-oriented video action recognition method and system |
CN116524395B (en) * | 2023-04-04 | 2023-11-07 | 江苏智慧工场技术研究院有限公司 | Intelligent factory-oriented video action recognition method and system |
CN116258466A (en) * | 2023-05-15 | 2023-06-13 | 国网山东省电力公司菏泽供电公司 | Multi-mode power scene operation specification detection method, system, equipment and medium |
CN116258466B (en) * | 2023-05-15 | 2023-10-27 | 国网山东省电力公司菏泽供电公司 | Multi-mode power scene operation specification detection method, system, equipment and medium |
CN116386148A (en) * | 2023-05-30 | 2023-07-04 | 国网江西省电力有限公司超高压分公司 | Knowledge graph guide-based small sample action recognition method and system |
CN116386148B (en) * | 2023-05-30 | 2023-08-11 | 国网江西省电力有限公司超高压分公司 | Knowledge graph guide-based small sample action recognition method and system |
Also Published As
Publication number | Publication date |
---|---|
CN114898466B (en) | 2024-08-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114898466B (en) | Intelligent factory-oriented video action recognition method and system | |
CN110147726B (en) | Service quality inspection method and device, storage medium and electronic device | |
CN111400547B (en) | Human-computer cooperation video anomaly detection method | |
CN113365147B (en) | Video editing method, device, equipment and storage medium based on music card point | |
CN112861575A (en) | Pedestrian structuring method, device, equipment and storage medium | |
CN110890102A (en) | Engine defect detection algorithm based on RNN voiceprint recognition | |
WO2020056995A1 (en) | Method and device for determining speech fluency degree, computer apparatus, and readable storage medium | |
CN113642474A (en) | Hazardous area personnel monitoring method based on YOLOV5 | |
CN111931809A (en) | Data processing method and device, storage medium and electronic equipment | |
WO2023241102A1 (en) | Label recognition method and apparatus, and electronic device and storage medium | |
CN117058595B (en) | Video semantic feature and extensible granularity perception time sequence action detection method and device | |
CN112417996A (en) | Information processing method and device for industrial drawing, electronic equipment and storage medium | |
CN111563886B (en) | Unsupervised feature learning-based tunnel steel rail surface disease detection method and device | |
CN111461253A (en) | Automatic feature extraction system and method | |
CN117633613A (en) | Cross-modal video emotion analysis method and device, equipment and storage medium | |
CN110728316A (en) | Classroom behavior detection method, system, device and storage medium | |
Nag et al. | CNN based approach for post disaster damage assessment | |
CN116945258A (en) | Die cutting machine control system and method thereof | |
CN117216264A (en) | Machine tool equipment fault analysis method and system based on BERT algorithm | |
CN116453514A (en) | Multi-view-based voice keyword detection and positioning method and device | |
CN115565008A (en) | Transferable image recognition detection system, method and computer readable storage medium | |
CN111798237B (en) | Abnormal transaction diagnosis method and system based on application log | |
CN114140879A (en) | Behavior identification method and device based on multi-head cascade attention network and time convolution network | |
CN118053107B (en) | Time sequence action detection method and device based on potential action interval feature integration | |
CN118503893B (en) | Time sequence data anomaly detection method and device based on space-time characteristic representation difference |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |