CN103327356B - Video matching method and device
- Publication number
- CN103327356B CN103327356B CN201310268664.1A CN201310268664A CN103327356B CN 103327356 B CN103327356 B CN 103327356B CN 201310268664 A CN201310268664 A CN 201310268664A CN 103327356 B CN103327356 B CN 103327356B
- Authority
- CN
- China
- Prior art keywords
- video
- frame
- characteristic information
- key frame
- human body
- Prior art date
- Legal status
- Expired - Fee Related
Abstract
The present invention is applicable to the field of video technology and provides a video matching method and device. The method comprises: extracting all key frames of a target video and saving the characteristic information of each target key frame; and comparing the characteristic information of the target key frames with the characteristic information of pre-generated source video key frames to obtain the matching degree between the target video and the source video. Because the teaching source video is analyzed in advance, the application software only needs to analyze the learner's target video during learning, which avoids analyzing two video streams simultaneously and cuts the CPU computation consumption roughly in half.
Description
Technical Field
The invention belongs to the technical field of videos, and particularly relates to a video matching method and device.
Background
At present, video learning application software developed by researchers, such as golf learning applications, generally plays a golf teaching film, captures through a camera a video of the learner who is imitating the actions in the teaching film, and guides the learner by comparing the actions in the teaching film with the learner's video image frames frame by frame. However, directly comparing the data of two video frames in this way costs a large amount of Central Processing Unit (CPU) resources, and on the low-end CPUs used in consumer electronics devices such as televisions (TVs) the comparison is relatively slow, which results in a slow response speed of the TV or other consumer electronics device.
Disclosure of Invention
The embodiment of the invention provides a video matching method and device, which aim to solve the prior-art problem that comparing the actions of the teaching film with the learner's video image frames frame by frame consumes a large amount of CPU computation.
In one aspect, a video matching method is provided, and the method includes:
extracting all key frames of a target video, and storing the characteristic information of each target key frame;
and comparing the characteristic information of the target key frame with the characteristic information of the pre-generated source video key frame to obtain the matching degree of the target video and the source video.
Further, the extracting of all key frames of the target video specifically includes:
reading a current video frame of a target video, and taking the current video frame as a first key frame;
acquiring feature information of the first key frame;
reading the next video frame;
acquiring the characteristic information of the next video frame;
comparing the characteristic information in the two video frames to obtain the matching degree;
and if the matching degree is lower than a preset matching threshold, taking the read next video frame as a second key frame; otherwise, continuing to read the next video frame until all video frames in the target video have been read.
Further, the acquiring of the feature information of a video frame includes:
performing background erasing processing on a key frame, leaving a human body model;
horizontally and vertically scanning the human body model to obtain a rectangular area where a human body is located;
dividing the rectangular area into n equal parts from top to bottom to form contour lines;
and acquiring the characteristic information of each contour line, and taking the characteristic information of each contour line as the characteristic information of the key frame.
Further, the acquiring of the feature information of a video frame includes:
performing background erasing processing on a key frame, leaving a human body model;
identifying 5 characteristic points of the human body on the four limbs and the top of the head by using the geometric relationship of the geodesic distances between vertexes of the human body model;
generating 5 skeleton central lines according to the 5 characteristic points;
and determining the positions of the joint points according to the 5 skeleton central lines, and taking the positions of all the joint points as the characteristic information of the key frame.
In another aspect, a video matching apparatus is provided, the apparatus including:
the characteristic information acquisition unit is used for extracting all key frames of the target video and storing the characteristic information of each target key frame;
and the matching degree acquisition unit is used for comparing the characteristic information of the target key frame with the characteristic information of the pre-generated source video key frame to obtain the matching degree of the target video and the source video.
Further, the characteristic information acquisition unit includes:
the video frame reading module is used for reading a current video frame and a next video frame of a target video and taking the current video frame as a first key frame;
the characteristic information acquisition module is used for acquiring the characteristic information of the first key frame and of the next video frame;
and the matching degree acquisition module is used for comparing the characteristic information in the two video frames to obtain the matching degree; if the matching degree is lower than a preset matching threshold, the read next video frame is taken as a second key frame, otherwise the next video frame continues to be read until all the video frames in the target video have been read.
Further, the feature information acquisition module includes:
the first background erasing submodule is used for performing background erasing processing on a key frame and leaving a human body model;
the rectangular area acquisition submodule is used for horizontally and vertically scanning the human body model to obtain a rectangular area where a human body is located;
the equal-height division submodule is used for dividing the rectangular area into n equal parts from top to bottom to form contour lines;
and the contour line information acquisition submodule is used for acquiring the characteristic information of each contour line and taking the characteristic information of each contour line as the characteristic information of the key frame.
Further, the feature information acquisition module includes:
the second background erasing submodule is used for performing background erasing processing on a key frame and leaving a human body model;
the characteristic point acquisition submodule is used for identifying 5 characteristic points of the human body positioned on the four limbs and the top of the head by utilizing the geometric relationship of the geodesic distance between the vertexes of the human body model;
the bone center line generation submodule is used for generating 5 bone center lines according to the 5 characteristic points;
and the joint point position acquisition submodule is used for determining the positions of joint points according to the 5 skeleton central lines and taking the positions of all the joint points as the characteristic information of the key frame.
In the embodiment of the invention, the teaching source video is analyzed in advance, so when a learner learns, the application software only needs to analyze the learner's target video; this avoids analyzing two video streams simultaneously and cuts the CPU computation consumption roughly in half. In addition, the target video and the source video are matched not on every pixel of the video frames but on the feature information in the two video frames, so the matching speed is higher. Moreover, for the source video, the frame sampling interval i is set according to the speed of the human motion in the source video; for a yoga video, for example, the motion is relatively slow, so i can take a relatively large value such as 1 second or even longer, and one video frame is sampled per interval for feature information extraction instead of extracting every frame, which saves the computation time of feature extraction many times over within a certain precision range. Finally, not every pair of sampled video frames is compared; instead, the video frame in which the human body form changes most is extracted as a key frame, and only the key frames are compared, which further reduces the computation time of feature extraction within a certain precision range.
Drawings
Fig. 1 is a flowchart illustrating an implementation of a video matching method according to an embodiment of the present invention;
Fig. 2 is a flowchart of an implementation of extracting all key frames of a target video according to an embodiment of the present invention;
Fig. 3 is a flowchart illustrating an implementation of a method for extracting feature information of a key frame according to an embodiment of the present invention;
Fig. 4 is a flowchart of an implementation of another method for extracting feature information of a key frame according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of a human body model left after background erasing processing is performed on a video frame according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of a rectangular area where a human body is located according to an embodiment of the present invention;
Fig. 7 is a schematic contour diagram of the rectangular area where the human body is located, divided into equal parts from top to bottom, according to an embodiment of the present invention;
Fig. 8 is a block diagram of a video matching apparatus according to a second embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In the embodiment of the invention, key frames of a target video are extracted, and the characteristic information of each target key frame is recorded; and comparing the characteristic information of the target key frame with the characteristic information of the pre-generated source video key frame to obtain the matching condition between the target video frame and the source video frame.
The following detailed description of the implementation of the present invention is made with reference to specific embodiments:
example one
Fig. 1 shows an implementation flow of a video matching method according to an embodiment of the present invention, which is detailed as follows:
in step S101, all key frames of the target video are extracted, and feature information of each target key frame is saved.
The target video is a video of the learner who is learning and imitating the actions in the teaching film. The process of extracting all key frames of the target video is shown in Fig. 2 and specifically includes:
Step S201, reading a current video frame of the target video, and taking the current video frame as a first key frame;
Step S202, acquiring characteristic information of the first key frame;
Step S203, reading the next video frame;
Step S204, acquiring the characteristic information of the next video frame;
Step S205, comparing the characteristic information in the two video frames to obtain a matching degree;
Step S206, if the matching degree is lower than a preset matching threshold, taking the read next video frame as a second key frame; otherwise, continuing to read the next video frame until all video frames in the target video have been read.
In this embodiment, the first video frame appearing in the target video is taken as a key frame, then the next video frame is read, and the feature information in the two frames is compared to obtain a matching degree (if the feature information of the two frames is completely identical, the matching degree is 1). If the matching degree is lower than a preset matching threshold, the human body form in the two frames is considered to have changed greatly and the read next video frame is taken as a new key frame; otherwise, the next video frame continues to be read, until all video frames in the target video have been read. It should be noted that video at 30 frames per second does not need to be read frame by frame: the method can be set to read one frame per second, in which case the next frame refers to the video frame of the next second, or one frame every two seconds. This is not limited here, and within a certain precision range it saves the computation time of feature extraction.
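To make steps S201 to S206 concrete, the following is a minimal Python sketch of the key frame selection loop, assuming OpenCV is available for reading frames. The helpers extract_features and match_degree are placeholders for whichever feature extraction and comparison method described below is used, and the threshold and interval values are illustrative only, not values fixed by the patent.

```python
import cv2  # OpenCV, used here only to read frames from the target video

def extract_key_frames(video_path, extract_features, match_degree,
                       frame_interval_s=1.0, match_threshold=0.8):
    """Sketch of steps S201-S206: keep a frame as a new key frame whenever its
    feature information differs enough from the previous key frame."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(1, int(round(fps * frame_interval_s)))  # e.g. one sampled frame per second

    key_frames = []          # list of (frame_index, feature information)
    prev_features = None
    frame_index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break            # all video frames in the target video have been read
        if frame_index % step == 0:
            features = extract_features(frame)               # placeholder helper
            if prev_features is None:
                key_frames.append((frame_index, features))   # first key frame
                prev_features = features
            elif match_degree(prev_features, features) < match_threshold:
                # the human body form has changed enough: take this frame as a key frame
                key_frames.append((frame_index, features))
                prev_features = features
        frame_index += 1
    cap.release()
    return key_frames
```

With this structure, switching between the two feature representations described below (contour line information or joint positions) only changes the two helper functions, while the key frame selection logic stays the same.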
The extraction of the feature information of a key frame may be implemented by the following steps; the specific flow is shown in Fig. 3:
Step S211, performing background erasing processing on the key frame, leaving a human body model, as shown in Fig. 5.
Step S212, horizontally and vertically scanning the human body model to obtain the rectangular area where the human body is located, and calculating the width-height ratio of the rectangular area.
As shown in Fig. 6, the width-height ratio of the rectangular area is recorded. When comparing with a source video frame later, if the width-height ratio of the corresponding area of the source video frame falls within a preset error range, the comparison continues; otherwise, the target video frame is determined not to match the source video frame. The source video frame is a video frame of the teaching film.
Step S213, dividing the rectangular area into n equal parts from top to bottom to form contour lines.
Step S214, acquiring the characteristic information of each contour line, and taking the characteristic information of each contour line as the characteristic information of the key frame.
As shown in Fig. 7, n = 8. Each contour line is cut by the human body into alternating black (background) and white (human body) segments. For each contour line, the length proportions of the segments are recorded, together with a mark for the color of the first segment (0 for the black background, 1 for the white human body); for example, the information of the topmost contour line in the figure might be (0, 0.5, 0.2, 0.3). In this way the characteristic information of each key frame is obtained. In addition to the contour line information of each key frame, the characteristic information of the key frames also includes the number of key frames and the occurrence time of each key frame.
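The contour line information can be illustrated with a short sketch. The NumPy function below is a hedged example rather than the patent's own code: it assumes the background-erased key frame is available as a binary mask (0 for background, 1 for human body), takes one scan line per equal-height band of the bounding rectangle, and records for each line the color of the first segment plus the length proportions of the alternating segments, as in the (0, 0.5, 0.2, 0.3) example above.

```python
import numpy as np

def contour_line_features(mask, n=8):
    """mask: 2-D array, 0 = background, 1 = human body (after background erasure).
    Returns the width-height ratio of the bounding rectangle and, for each of the
    n contour lines, (color of first segment, list of segment length proportions)."""
    ys, xs = np.nonzero(mask)
    top, bottom = ys.min(), ys.max()
    left, right = xs.min(), xs.max()
    rect = mask[top:bottom + 1, left:right + 1]
    height, width = rect.shape
    aspect_ratio = width / height          # recorded for the quick pre-check against the source frame

    features = []
    for k in range(n):                     # one scan line per equal-height band
        row = rect[int((k + 0.5) * height / n)]
        first_color = int(row[0])          # 0 = background black, 1 = human body white
        # boundaries of the alternating black/white segments along this contour line
        change_points = np.flatnonzero(row[1:] != row[:-1]) + 1
        bounds = np.concatenate(([0], change_points, [width]))
        proportions = list(np.diff(bounds) / width)
        features.append((first_color, proportions))
    return aspect_ratio, features
```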
In addition, the feature information of the key frame may be obtained by the following steps, whose flow is shown in Fig. 4:
Step S221, performing background erasing processing on the key frame, leaving a human body model, as shown in Fig. 5.
Step S222, identifying 5 characteristic points of the human body positioned on the four limbs and the top of the head by using the geometric relationship of the geodesic distances between the vertexes of the human body model.
In this embodiment, starting from an arbitrary point on the human body model, the vertex v with the largest geodesic distance from that point is found and added to the terminal feature point set V as one of the terminal feature points. Then the point in the human body model with the largest sum of geodesic distances to the points in set V is repeatedly selected as a new terminal feature point, until the geodesic distance from the new candidate point to the terminal feature points in set V is smaller than a preset threshold, which is an empirical value. The geodesic distance is the length of the shortest path along the model surface connecting two points (or two point sets), and the points in the terminal feature point set V calculated in this way are the 5 identified feature points.
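As an illustration of this farthest-point search, the sketch below uses Dijkstra shortest paths over an edge graph of the model surface as a stand-in for true geodesic distances; the edge-list representation, the starting vertex and the stopping threshold are assumptions made for the example, not details fixed by the patent.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra

def terminal_feature_points(edges, weights, n_vertices, start=0, min_dist=0.3):
    """edges: (m, 2) vertex index pairs of the human body model surface;
    weights: (m,) edge lengths. Graph shortest paths approximate geodesic distance."""
    edges = np.asarray(edges)
    weights = np.asarray(weights, dtype=float)
    graph = csr_matrix((np.concatenate([weights, weights]),
                        (np.concatenate([edges[:, 0], edges[:, 1]]),
                         np.concatenate([edges[:, 1], edges[:, 0]]))),
                       shape=(n_vertices, n_vertices))

    # first terminal point: the vertex farthest from an arbitrary starting point
    d0 = dijkstra(graph, indices=start)
    terminals = [int(np.argmax(d0))]
    dists = [dijkstra(graph, indices=terminals[0])]

    while True:
        # candidate: the vertex with the largest sum of distances to the set V
        total = np.sum(dists, axis=0)
        candidate = int(np.argmax(total))
        # stop when the candidate is too close to an existing terminal point
        if min(d[candidate] for d in dists) < min_dist:
            break
        terminals.append(candidate)
        dists.append(dijkstra(graph, indices=candidate))
    return terminals   # for a human body model: head, two hands and two feet
```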
Due to the symmetry of the human body model, the geodesic distances from the terminal point of the head to the terminal points of the 2 upper limbs are equal, and the distances to the terminal points of the 2 lower limbs are also equal. Using this property, the embodiment of the invention can automatically pick out the terminal feature point Vhead of the head from the 5 terminal feature points.
Step S223, generating 5 skeleton central lines according to the 5 characteristic points.
In this embodiment, geodesic distance isocurves of N levels are first determined in turn, taking the terminal feature point Vhead of the head as the starting point, with a geodesic distance difference of d between adjacent levels; in this way a geodesic distance isocurve function of N levels over the whole human body model is obtained. As the number of levels increases, when the geodesic distance isocurves of a level first form 3 closed curves (each closed isocurve is approximately elliptical), the two closed curves with the smaller perimeters can be determined as the boundaries between the arms and the trunk. When a level contains 4 closed geodesic distance isocurves, the two closed curves with the larger perimeters can be determined as the boundaries between the legs and the trunk, and the four limbs and the trunk can thereby be distinguished.
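To make this counting rule concrete, here is a hedged sketch that assumes the closed geodesic distance isocurves of each level have already been extracted and are represented only by their perimeters; this data structure is an assumption made for illustration.

```python
def limb_trunk_boundaries(isocurves_per_level):
    """isocurves_per_level: list indexed by level, each entry a list of perimeters of
    the closed geodesic distance isocurves found at that level (starting from Vhead).
    Returns the closed curves taken as arm/trunk and leg/trunk boundaries."""
    arm_boundaries = leg_boundaries = None
    for level, perimeters in enumerate(isocurves_per_level):
        if arm_boundaries is None and len(perimeters) == 3:
            # first level with 3 closed curves: the two smallest perimeters mark the arms
            arm_boundaries = (level, sorted(perimeters)[:2])
        if leg_boundaries is None and len(perimeters) == 4:
            # a level with 4 closed curves: the two largest perimeters mark the legs
            leg_boundaries = (level, sorted(perimeters)[-2:])
        if arm_boundaries and leg_boundaries:
            break
    return arm_boundaries, leg_boundaries
```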
After the extent of the four limbs is determined, geodesic distance isocurves are computed separately for each limb, taking the terminal feature point of that limb as the starting point and increasing the geodesic distance by d per level. In this way the isocurve cross-sections on the limbs are closer to perpendicular to the skeleton central lines of the limbs, and the cross-section contours agree better with anatomical cross-sections, so that the joint positions can be judged more accurately from the circularity of the cross-sections even when the skeleton central line is only slightly bent. Finally, the centers of adjacent geodesic distance isocurves are connected to generate the 5 skeleton central lines, and the positions of the joint points can then be determined on these central lines.
Step S224, determining the positions of the joint points according to the 5 skeleton central lines, and taking the positions of all the joint points as the characteristic information of the key frame.
In this embodiment, after the skeleton central lines are determined, the positions of the joint points can be found by the method of the minimum included angle of the segments of each central line, and the positions of the joint points are used as the feature information of the video frame. Suppose the center of the i-th geodesic distance isocurve is Ci; the included angle considered is the angle at Ci between the line segment joining Ci-t to Ci and the line segment joining Ci+t to Ci. The value of t is determined according to the number of isocurve levels into which the skeleton central line is divided: in general, the more levels there are, the larger t is. The purpose of t is to reduce the influence of local data fluctuation on the calculation of the included angle.
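As a small numerical illustration of this minimum included angle rule, the sketch below assumes the centers Ci of the isocurves along one skeleton central line are already available as 2-D or 3-D points; the offset t and the function name are illustrative choices, not taken from the patent.

```python
import numpy as np

def joint_position(centers, t=2):
    """centers: (N, d) array of isocurve centers C0..C(N-1) along one skeleton
    central line. Returns the index i whose angle between segments C(i-t)C(i)
    and C(i+t)C(i) is smallest, i.e. the most strongly bent point (the joint)."""
    centers = np.asarray(centers, dtype=float)
    best_i, best_angle = None, np.inf
    for i in range(t, len(centers) - t):
        u = centers[i - t] - centers[i]
        v = centers[i + t] - centers[i]
        cos_a = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)
        angle = np.arccos(np.clip(cos_a, -1.0, 1.0))
        if angle < best_angle:          # the smaller the angle, the sharper the bend
            best_i, best_angle = i, angle
    return best_i, best_angle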
In step S102, the feature information of the target key frame is compared with the feature information of the pre-generated key frame of the source video to obtain the matching degree between the target video and the source video.
In this embodiment, the feature information of the source video key frames is extracted by the same extraction method as that used for the feature information of the target video key frames. The feature information of the two key frames is compared; when the difference between the two key frames is greater than a preset difference threshold, it indicates that the learner's action is inconsistent with the action in the teaching film. In addition, a corresponding learning score can be given to the learner according to the difference value.
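A minimal sketch of this comparison step is given below; it assumes joint position features (one point per joint, normalized to the rectangular area of the human body so the two frames are comparable), uses the mean joint distance as the difference value, and maps that value onto a 0-100 learning score. The threshold and the score formula are assumptions made for illustration only.

```python
import numpy as np

def compare_key_frames(target_joints, source_joints, diff_threshold=0.15):
    """target_joints, source_joints: (k, 2) arrays of joint positions, normalized
    to the bounding rectangle of the human body in each frame."""
    diff = float(np.mean(np.linalg.norm(np.asarray(target_joints, dtype=float) -
                                        np.asarray(source_joints, dtype=float), axis=1)))
    consistent = diff <= diff_threshold      # learner's action matches the teaching film
    score = max(0.0, 100.0 * (1.0 - diff))   # simple learning score from the difference value
    return consistent, score
```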
According to this embodiment, the teaching source video is analyzed in advance, so when the learner learns, the application software only needs to analyze the learner's target video; this avoids analyzing two video streams simultaneously and cuts the CPU computation consumption roughly in half. In addition, the target video and the source video are matched not on every pixel of the video frames but on the feature information in the two video frames, so the matching speed is higher. Moreover, for the source video, the frame sampling interval i is set according to the speed of the human motion in the source video; for a yoga video, for example, the motion is relatively slow, so i can take a relatively large value such as 1 second or even longer, and one video frame is sampled per interval for feature information extraction instead of extracting every frame, which saves the computation time of feature extraction many times over within a certain precision range. Finally, in this embodiment not every pair of sampled video frames is compared; instead, the video frames in which the human body form changes most are extracted as key frames according to the change of the feature information of the video frames, and only the key frames are compared, which further reduces the computation time within a certain precision range.
It will be understood by those skilled in the art that all or part of the steps in the method for implementing the embodiments described above may be implemented by using a program to instruct relevant hardware, and the corresponding program may be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk or optical disk.
Example two
Fig. 8 is a block diagram showing the specific structure of a video matching apparatus according to the second embodiment of the present invention; for convenience of description, only the parts relevant to the second embodiment of the present invention are shown. The video matching device may be a software unit, a hardware unit or a combined software and hardware unit arranged in a computer, a television or a mobile terminal, and comprises: a feature information acquisition unit 51 and a matching degree acquisition unit 52.
The feature information acquiring unit 51 is configured to extract all key frames of the target video, and store feature information of each target key frame;
a matching degree obtaining unit 52, configured to compare the feature information of each target key frame stored by the feature information obtaining unit 51 with the feature information of a pre-generated source video key frame, so as to obtain a matching degree between the target video and the source video.
Specifically, the feature information acquiring unit 51 includes:
the video frame reading module is used for reading a current video frame and a next video frame of a target video and taking the current video frame as a first key frame;
the characteristic information acquisition module is used for acquiring the characteristic information of the current video frame read by the video frame reading module and the characteristic information of the next video frame read by the video frame reading module;
and the matching degree acquisition module is used for comparing the characteristic information in the two video frames acquired by the characteristic information acquisition module to obtain the matching degree; if the matching degree is lower than a preset matching threshold, the read next video frame is taken as a second key frame, otherwise the next video frame continues to be read.
In this embodiment, the feature information obtaining unit 51 and the matching degree obtaining unit 52 analyze the teaching source video in advance, and when the learner learns, the application software only needs to analyze the target video of the learner, so that the situation that two paths of videos need to be analyzed simultaneously is avoided, and half of the CPU operation consumption is reduced.
Specifically, as one embodiment, the feature information acquisition module includes:
the first background erasing submodule is used for carrying out background erasing processing on a frame of key frame and leaving a human body model;
the rectangular area acquisition submodule is used for horizontally and vertically scanning a human body model left after the first background erasing submodule carries out background erasing processing on a frame of key frame, so that a rectangular area where a human body is located is obtained;
the equal-height division submodule is used for equally dividing the rectangular area where the human body is located, obtained by the rectangular area obtaining submodule, from top to bottom by n to form equal-height lines;
and the contour line information acquisition sub-module is used for acquiring the characteristic information of each contour line in the contour lines acquired by the contour dividing sub-module and taking the characteristic information of each contour line as the characteristic information of the key frame.
In this embodiment, the feature information acquisition module performs matching of the video frames according to the characteristic information of each contour line of the obtained key frames. When the learner learns, the application software only needs to analyze the characteristic information of each contour line of the key frames of the learner's target video, which avoids analyzing two video streams simultaneously and cuts the CPU computation consumption roughly in half.
As another embodiment, the feature information acquisition module includes:
the second background erasing submodule is used for performing background erasing processing on a key frame and leaving a human body model;
the characteristic point acquisition submodule is used for identifying 5 characteristic points of the human body located on the four limbs and the top of the head by utilizing the geometric relationship of the geodesic distances between the vertexes of the human body model left after the second background erasing submodule performs background erasing processing on the key frame;
the bone central line generation submodule is used for generating 5 bone central lines according to the 5 characteristic points identified by the characteristic point acquisition submodule;
and the joint point position acquisition submodule is used for determining the positions of joint points according to the 5 skeleton center lines generated by the skeleton center line generation submodule, and the positions of all the joint points are used as the characteristic information of the key frame.
In this embodiment, the feature information acquisition module determines the positions of the joint points of the human body according to the 5 characteristic points identified in the key frames. When the learner learns, the application software only needs to analyze the characteristic information in the key frames of the learner's target video, which avoids analyzing two video streams simultaneously and cuts the CPU computation consumption roughly in half.
According to this embodiment, the teaching source video is analyzed in advance, so when the learner learns, the application software only needs to analyze the learner's target video; this avoids analyzing two video streams simultaneously and cuts the CPU computation consumption roughly in half. In addition, the target video and the source video are matched not on every pixel of the video frames but on the feature information in the two video frames, so the matching speed is higher. Moreover, for the source video, the frame sampling interval i is set according to the speed of the human motion in the source video; for a yoga video, for example, the motion is relatively slow, so i can take a relatively large value such as 1 second or even longer, and one video frame is sampled per interval for feature information extraction instead of extracting every frame, which saves the computation time of feature extraction many times over within a certain precision range. Finally, in this embodiment not every pair of sampled video frames is compared; instead, the video frames in which the human body form changes most are extracted as key frames according to the change of the feature information of the video frames, and only the key frames are compared, which further reduces the computation time within a certain precision range.
The video matching device provided by the embodiment of the present invention can be applied to the first corresponding method embodiment, and for details, reference is made to the description of the first embodiment, and details are not repeated here.
It should be noted that, in the above system embodiment, each included unit is only divided according to functional logic, but is not limited to the above division as long as the corresponding function can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.
Claims (4)
1. A method for video matching, the method comprising:
extracting all key frames of a target video, and storing the characteristic information of each target key frame;
comparing the characteristic information of the target key frame with the characteristic information of a pre-generated source video key frame to obtain the matching degree of the target video and the source video;
acquiring the feature information of the key frame comprises the following steps:
performing background erasing processing on a key frame, leaving a human body model;
horizontally and vertically scanning the human body model to obtain a rectangular area where a human body is located;
dividing the rectangular area into n equal parts from top to bottom to form contour lines;
acquiring characteristic information of each contour line, and taking the characteristic information of each contour line as the characteristic information of the key frame;
or,
acquiring the feature information of the key frame comprises the following steps:
performing background erasing processing on a key frame, leaving a human body model;
identifying 5 characteristic points of the human body on the four limbs and the top of the head by using the geometric relationship of the geodesic distances between vertexes of the human body model;
generating 5 skeleton central lines according to the 5 characteristic points;
and determining the positions of the joint points according to the 5 skeleton central lines, and taking the positions of all the joint points as the characteristic information of the key frame.
2. The method of claim 1, wherein said extracting all key frames of the target video specifically comprises:
reading a current video frame of a target video, and taking the current video frame as a first key frame;
acquiring feature information of the first key frame;
reading the next video frame;
acquiring the characteristic information of the next video frame;
comparing the characteristic information in the two video frames to obtain the matching degree;
and if the matching degree is lower than a preset matching threshold, taking the read next video frame as a second key frame; otherwise, continuing to read the next video frame until all video frames in the target video have been read.
3. A video matching apparatus, characterized in that the apparatus comprises:
the characteristic information acquisition unit is used for extracting all key frames of the target video and storing the characteristic information of each target key frame;
the matching degree obtaining unit is used for comparing the characteristic information of the target key frame with the characteristic information of a pre-generated source video key frame to obtain the matching degree of the target video and the source video;
the characteristic information acquisition module includes:
the first background erasing submodule is used for performing background erasing processing on a key frame and leaving a human body model;
the rectangular area acquisition submodule is used for horizontally and vertically scanning the human body model to obtain a rectangular area where a human body is located;
the equal-height division submodule is used for dividing the rectangular area into n equal parts from top to bottom to form contour lines;
the contour line information acquisition submodule is used for acquiring the characteristic information of each contour line and taking the characteristic information of each contour line as the characteristic information of the key frame;
or,
the characteristic information acquisition module includes:
the second background erasing submodule is used for performing background erasing processing on a key frame and leaving a human body model;
the characteristic point acquisition submodule is used for identifying 5 characteristic points of the human body positioned on the four limbs and the top of the head by utilizing the geometric relationship of the geodesic distance between the vertexes of the human body model;
the bone center line generation submodule is used for generating 5 bone center lines according to the 5 characteristic points;
and the joint point position acquisition submodule is used for determining the positions of joint points according to the 5 skeleton central lines and taking the positions of all the joint points as the characteristic information of the key frame.
4. The apparatus of claim 3, wherein the feature information acquiring unit comprises:
the video frame reading module is used for reading a current video frame and a next video frame of a target video and taking the current video frame as a first key frame;
the characteristic information acquisition module is used for acquiring the characteristic information of the first key frame and of the next video frame;
and the matching degree acquisition module is used for comparing the characteristic information in the two video frames to obtain the matching degree; if the matching degree is lower than a preset matching threshold, the read next video frame is taken as a second key frame, otherwise the next video frame continues to be read until all video frames in the target video have been read.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310268664.1A CN103327356B (en) | 2013-06-28 | 2013-06-28 | Video matching method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103327356A (en) | 2013-09-25 |
CN103327356B (en) | 2016-02-24 |
Family
ID=49195846
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310268664.1A Expired - Fee Related CN103327356B (en) | 2013-06-28 | 2013-06-28 | Video matching method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103327356B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104038848A (en) * | 2014-05-30 | 2014-09-10 | 无锡天脉聚源传媒科技有限公司 | Video processing method and video processing device |
CN105809653B (en) * | 2014-12-29 | 2019-01-01 | 深圳Tcl数字技术有限公司 | Image processing method and device |
CN109801193B (en) * | 2017-11-17 | 2020-09-15 | 深圳市鹰硕教育服务股份有限公司 | Follow-up teaching system with voice evaluation function |
CN113678137B (en) * | 2019-08-18 | 2024-03-12 | 聚好看科技股份有限公司 | Display apparatus |
CN113537162B (en) * | 2021-09-15 | 2022-01-28 | 北京拓课网络科技有限公司 | Video processing method and device and electronic equipment |
CN115979350A (en) * | 2023-03-20 | 2023-04-18 | 北京航天华腾科技有限公司 | Data acquisition system of ocean monitoring equipment |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101374234A (en) * | 2008-09-25 | 2009-02-25 | 清华大学 | Method and apparatus for monitoring video copy base on content |
CN101394522A (en) * | 2007-09-19 | 2009-03-25 | 中国科学院计算技术研究所 | Detection method and system for video copy |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR100986223B1 (en) * | 2008-08-07 | 2010-10-08 | 한국전자통신연구원 | Apparatus and method providing retrieval of illegal movies |
US8731292B2 (en) * | 2011-01-07 | 2014-05-20 | Alcatel Lucent | Method and apparatus for comparing videos |
Also Published As
Publication number | Publication date |
---|---|
CN103327356A (en) | 2013-09-25 |
Similar Documents
Publication | Title |
---|---|
CN109117848B (en) | Text line character recognition method, device, medium and electronic equipment |
US11595737B2 | Method for embedding advertisement in video and computer device |
CN103327356B (en) | Video matching method and device |
CN112418216B (en) | Text detection method in complex natural scene image |
CN109960742B (en) | Local information searching method and device |
WO2022001623A1 (en) | Image processing method and apparatus based on artificial intelligence, and device and storage medium |
CN111476284A (en) | Image recognition model training method, image recognition model training device, image recognition method, image recognition device and electronic equipment |
CN111754541A (en) | Target tracking method, device, equipment and readable storage medium |
CN111489357A (en) | Image segmentation method, device, equipment and storage medium |
CN106407891A (en) | Target matching method based on convolutional neural network and device |
WO2019152144A1 (en) | Object detection based on neural network |
WO2023284608A1 (en) | Character recognition model generating method and apparatus, computer device, and storage medium |
US20240312181A1 (en) | Video detection method and apparatus, device, and storage medium |
CN106407978B (en) | Method for detecting salient object in unconstrained video by combining similarity degree |
CN112101344B (en) | Video text tracking method and device |
CN111859852A (en) | Training device and method for Chinese character style migration model |
CN109034032B (en) | Image processing method, apparatus, device and medium |
CN111028195A (en) | Example segmentation based redirected image quality information processing method and system |
US11836839B2 (en) | Method for generating animation figure, electronic device and storage medium |
CN112381118B (en) | College dance examination evaluation method and device |
Cao | Face recognition robot system based on intelligent machine vision image recognition |
CN113011356A (en) | Face feature detection method, device, medium and electronic equipment |
CN113537187A (en) | Text recognition method and device, electronic equipment and readable storage medium |
CN111275012A (en) | Advertisement screen passenger flow volume statistical system and method based on face recognition |
US20240005635A1 | Object detection method and electronic apparatus |
Legal Events
Code | Title | Description |
---|---|---|
C06 | Publication | |
PB01 | Publication | |
C10 | Entry into substantive examination | |
SE01 | Entry into force of request for substantive examination | |
C14 | Grant of patent or utility model | |
GR01 | Patent grant | |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20160224 |