
WO2020111505A1 - Method and system for generating object gt information for image machine learning - Google Patents

Method and system for generating object GT information for image machine learning

Info

Publication number
WO2020111505A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
image
generating
metadata
frame
Prior art date
Application number
PCT/KR2019/013511
Other languages
French (fr)
Korean (ko)
Inventor
양창모
송재종
추유식
Original Assignee
전자부품연구원
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 전자부품연구원
Publication of WO2020111505A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata automatically derived from the content
    • G06F16/7837 Retrieval characterised by using metadata automatically derived from the content, using objects detected or recognised in the video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N99/00 Subject matter not provided for in other groups of this subclass

Definitions

  • The present invention relates to a method for generating Ground Truth (GT) information used in machine learning on images or video, and more particularly to a method and system for generating GT information for objects such as people and cars contained in an image or video frame.
  • Object detection is the basic technique underlying object recognition and tracking in images; the conventional approach extracts the GT of each object manually in an offline step and trains a detector on it to perform object detection.
  • In the present invention, standard GT components and a GT description method are defined so that comprehensive information about the person and vehicle objects present in an image or video frame can be described.
  • A further object of the present invention is to provide a method and system that generate GT information efficiently by automating GT tagging.
  • The method for generating GT information of objects in an image according to the present invention includes automatically analyzing the image to generate GT information for each object it contains according to a predetermined GT structure, correcting the generated GT information, and converting the corrected GT information to generate metadata; the GT information for each object includes the object's type, posture, state, location information, and attributes.
  • The GT information generation system includes an image storage unit that stores images; a GT analysis unit that receives an image file from the image storage unit and analyzes it to generate GT information for each object present in each frame; a GT information correction unit that corrects the GT information generated by the GT analysis unit; and a metadata generation unit that converts the corrected GT information into metadata.
  • Standard GT components and a GT description method can thus be defined to describe comprehensive information about objects such as people and vehicles present in a still image or a video frame.
  • By automating GT generation, GT information can be produced more efficiently than with conventional methods.
  • FIG. 1 is a block diagram of a GT information generation system according to an embodiment of the present invention.
  • FIG. 2 is an overall flowchart of a method for generating GT information according to an embodiment of the present invention.
  • FIG. 3 is a detailed flowchart of the GT analysis step shown in FIG. 2.
  • FIG. 4 is a detailed flowchart of the GT information modification step illustrated in FIG. 2.
  • Each component, functional block, or means may consist of one or more sub-components, and the electrical, electronic, and mechanical functions performed by each component may be implemented with various known elements such as electronic circuits, integrated circuits, and application-specific integrated circuits (ASICs), or with mechanical elements; each may be implemented separately, or two or more may be integrated into one.
  • The present invention tags comprehensive information about image objects such as people and vehicles and organizes it into GT metadata.
  • The image-object GT information defined in the present invention is shown in Table 1 below.
  • The frame number has a single value when a still image is input, and identifies each frame when a video is input.
  • The number of objects is the number of objects detected in the corresponding image or frame.
  • An object ID list is constructed with one entry per object; for each object, the type, posture, state, location information, and attribute information are defined under its object ID.
  • Object types are classified as 'person' or 'car', and the posture of an object as it appears in the image or video frame is expressed in eight directions: 'front', 'back', 'left', 'right', 'front-left', 'front-right', 'back-right', and 'back-left'.
  • The state of an object is classified as 'whole', 'cut', or 'overlapped', depending on whether the entire object is visible in the image or video frame and whether it overlaps another object.
  • The location information of an object gives the four coordinates of the object's bounding box, measured from the (0, 0) origin of the image or video frame.
  • The attribute information of an object is composed differently according to the object's type.
  • When the object is a person, for example, the attributes comprise race, gender, age, height, top color, bottom color, and whether glasses are worn; when the object is a car, they may comprise the car's color, license plate number, manufacturer, model, and model year.
  • The present invention thus provides a method and system that define standard GT components capable of describing comprehensive information about the objects in an image or video frame, and that generate and manage GT information accordingly.
  • FIG. 1 is a configuration diagram of a GT information generation system according to an embodiment of the present invention.
  • The system includes an image storage unit 100, a GT analysis unit 110, a GT information correction unit 120, a GT metadata generation unit 130, and a metadata storage unit 140.
  • The image storage unit 100 stores the images or video frames (hereinafter collectively 'frames') that contain the objects whose GT information is to be analyzed and generated.
  • It may be implemented as non-volatile memory such as a hard disk, volatile memory such as RAM, or a register serving as a buffer that temporarily holds streaming data.
  • The GT analysis unit 110 automatically extracts, for each frame of the image received from the image storage unit 100, the information defined by the GT components described above: the object's type, posture, state, location information, and attributes. Automatic GT analysis of the objects in an image can use a self-developed algorithm, open-source software, or a cloud API.
  • The GT information correction unit 120 presents the extracted GT information to the user as a list so that the user can modify the GT information generated by the GT analysis unit.
  • A list of objects with the GT information structure described above is displayed for each frame number (the image number of a still image or the frame number of a video) so that the user can review it.
  • When the user submits corrections, the GT information is updated to reflect the input.
  • The GT metadata generation unit 130 converts the corrected GT information into metadata, and the metadata storage unit 140 stores the metadata.
  • The metadata storage unit 140 and the image storage unit 100 are distinguished from a logical point of view; in hardware, they may be separate storage devices or may reside in a single physical storage device.
  • The automatic GT analysis step proceeds in more detail as follows.
  • The input video is split into frames, and a frame number is assigned to each frame to be analyzed (S211).
  • In the case of a still image, only one frame exists, so no frame splitting is performed.
  • GT analysis is then run automatically on each split frame image (S214).
  • An in-house algorithm, open-source software, or a cloud API can be used; one example of a cloud API is the Sighthound Cloud API.
  • The GT produced by the analysis is preferably organized, as in Table 1 above, into a data structure containing the frame number, the number of objects in each frame, an object ID for each object, and, per object ID, the object's type, posture, state, location information, and attributes.
  • The GT information produced by the analysis is stored for later correction (S216).
  • When automatic analysis is complete, the GT information correction step (S220) is performed.
  • The GT information correction step (S220) lets the user manually revise the automatically analyzed GT information.
  • The stored GT information is first presented to the user as a list through the information correction unit 120 (S222).
  • The object list is organized by image or video frame number.
  • In step S222, when the user selects an object whose information is to be corrected, the selection input is received (S224) and the selected object is switched into an editable state.
  • An input from the user modifying the selected object's GT information is received and applied (S226). Specifically, the user can modify the object's type, posture, state, location information, and attribute information, and the information correction unit 120 applies each correction input to update the GT information.
  • In step S228, it is determined whether GT information correction has been completed for all frames of the image or video; if any frame has not yet been reviewed, the process returns to step S224, and once every frame has been reviewed, GT information correction is complete.
  • When analysis and correction are complete, the GT information is converted into metadata (S230).
  • The generated metadata is stored in an appropriate store (the metadata storage unit 140).
  • Formats such as XML, EXCEL, JSON, and TEXT can be used for the metadata.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Library & Information Science (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Artificial Intelligence (AREA)
  • Image Analysis (AREA)

Abstract

A method for generating GT information of objects in an image, according to the present invention, comprises the steps of: performing automatic analysis according to a predetermined GT structure so as to generate GT information of each object included in the image; correcting the generated GT information; and generating metadata by converting the corrected GT information, wherein the GT information for each object comprises the type of the object, the posture of the object, the state of the object, the location information of the object, and the attributes of the object. According to the present invention, standard GT components and a GT description method can be defined so as to describe comprehensive information about objects such as humans and vehicles present in a still image or a video frame, and GT information can be generated more efficiently than by conventional methods by automating its generation. Ultimately, by defining standard GT elements for the vehicle and human objects in images and video frames, a common standard GT generation technique for image and video machine learning can be provided and utilized.

Description

Method and system for generating object GT information for image machine learning
The present invention relates to a method for generating Ground Truth (GT) information used in machine learning on images or video, and more particularly to a method and system for generating GT information for objects such as people and cars contained in an image or video frame.
Object detection is the basic technique underlying object recognition and tracking in images. In the conventional object detection approach, the GT of each object is extracted manually in an offline step and a detector is trained on it to perform object detection.
Extraction of GT and generation of GT information are thus fundamental to object recognition. Conventional GT generation methods, however, had no comprehensive representation for describing the objects contained in an image or video frame. As a result, the elements required for image machine learning were described according to each practitioner's own conventions, and the elements used were inevitably very limited.
In addition, object GT information for machine learning has conventionally been produced by a user inspecting images or video frames one by one and tagging them manually. Because the GT information for vast amounts of training footage must be tagged by hand, this approach costs considerable labor and time.
The GT information generation method of the present invention therefore defines standard GT components and a GT description method that can describe comprehensive information about the person and vehicle objects present in an image or video frame.
A further object of the present invention is to provide a method and system that generate GT information efficiently by automating GT tagging.
The method for generating GT information of objects in an image according to the present invention includes automatically analyzing the image to generate GT information for each object it contains according to a predetermined GT structure, correcting the generated GT information, and converting the corrected GT information to generate metadata; the GT information for each object includes the object's type, posture, state, location information, and attributes.
A GT information generation system according to another embodiment of the present invention includes an image storage unit that stores images; a GT analysis unit that receives an image file from the image storage unit and analyzes it to generate GT information for each object present in each frame; a GT information correction unit that corrects the GT information generated by the GT analysis unit; and a metadata generation unit that converts the corrected GT information to generate metadata.
According to the present invention, standard GT components and a GT description method can be defined to describe comprehensive information about objects such as people and vehicles present in a still image or a video frame.
In addition, by automating GT generation, GT information can be produced more efficiently than with conventional methods.
Furthermore, by defining standard GT elements for the vehicle and person objects in images and video frames, a common standard GT generation technique for image and video machine learning can be established and utilized.
FIG. 1 is a configuration diagram of a GT information generation system according to an embodiment of the present invention.
FIG. 2 is an overall flowchart of a method for generating GT information according to an embodiment of the present invention.
FIG. 3 is a detailed flowchart of the GT analysis step shown in FIG. 2.
FIG. 4 is a detailed flowchart of the GT information correction step shown in FIG. 2.
The objects and effects of the present invention are not limited to those mentioned above; the objects and effects, and the technical configurations for achieving them, will become clear with reference to the embodiments described in detail below together with the accompanying drawings.
In describing the present invention, detailed descriptions of known functions or configurations are omitted where it is judged that they would unnecessarily obscure the subject matter of the invention. The present invention is not limited to the embodiments disclosed below and may be implemented in various other forms. The following embodiments are provided to make the disclosure of the invention complete and to convey the scope of the invention fully to those of ordinary skill in the art to which the invention pertains; they are not intended to limit that scope.
Throughout the specification, when a part is said to 'include' or 'comprise' a component, this means that, unless specifically stated otherwise, it may further include other components rather than excluding them. Terms such as '...unit', '...apparatus', '...device', or '...module' used in the specification denote a unit that processes at least one function or operation, and such a unit may be implemented in hardware, in software, or in a combination of hardware and software.
Meanwhile, in each embodiment of the present invention, each component, functional block, or means may consist of one or more sub-components, and the electrical, electronic, and mechanical functions performed by each component may be implemented with various known elements such as electronic circuits, integrated circuits, and application-specific integrated circuits (ASICs), or with mechanical elements; each may be implemented separately, or two or more may be integrated into one.
Hereinafter, the configuration of the present invention is described in detail with reference to the accompanying drawings.
The GT information used for machine learning on conventional image objects comprised extremely limited elements that varied with the party performing the learning; the present invention instead tags comprehensive information about image objects such as people and vehicles and organizes it into GT metadata. The image-object GT information defined in the present invention is shown in Table 1 below.
Table 1
  • Frame number
  • Number of objects
  • Object ID (list)
  • Type of object
  • Posture of object
  • State of object
  • Location information of object
  • Properties of object
The frame number has a single value when a still image is input, and identifies each frame when a video is input. The number of objects is the number of objects detected in the corresponding image or frame. An object ID list is constructed with one entry per object, and for each object the type, posture, state, location information, and attribute information are defined under its object ID.
Object types are classified as 'person' or 'car'. The posture of an object as it appears in the image or video frame is expressed in eight directions: 'front', 'back', 'left', 'right', 'front-left', 'front-right', 'back-right', and 'back-left'.
The state of an object is classified as 'whole', 'cut', or 'overlapped', depending on whether the entire object is visible in the image or video frame and whether it overlaps another object.
The location information of an object gives the four coordinates of the object's bounding box, measured from the (0, 0) origin of the image or video frame.
The attribute information of an object is composed differently according to the object's type. When the object is a person, for example, the attributes comprise race, gender, age, height, top color, bottom color, and whether glasses are worn; when the object is a car, they may comprise the car's color, license plate number, manufacturer, model, and model year.
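To make the structure concrete, the sketch below shows one way the Table 1 record could be represented in code. This is a minimal illustration, assuming Python; the class and field names are ours, not the patent's.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class ObjectGT:
    """GT information for one object, following the Table 1 fields."""
    object_id: int
    object_type: str                        # 'person' or 'car'
    posture: str                            # one of the eight directions, e.g. 'front-left'
    state: str                              # 'whole', 'cut', or 'overlapped'
    bbox: Tuple[int, int, int, int]         # bounding box measured from the (0, 0) frame origin
    attributes: Dict[str, str] = field(default_factory=dict)  # type-dependent properties

@dataclass
class FrameGT:
    """GT information for one still image or one video frame."""
    frame_number: int
    objects: List[ObjectGT] = field(default_factory=list)

    @property
    def object_count(self) -> int:
        # 'Number of objects' in Table 1 is derived from the object ID list.
        return len(self.objects)
```

Here `attributes` holds the type-dependent properties described above (race, gender, age, and so on for a person; color, plate number, model, and so on for a car).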
In this way, the present invention provides a method and system that define standard GT components capable of describing comprehensive information about the objects present in an image or video frame, and that generate and manage GT information accordingly.
FIG. 1 is a configuration diagram of a GT information generation system according to an embodiment of the present invention.
As shown, the system includes an image storage unit 100, a GT analysis unit 110, a GT information correction unit 120, a GT metadata generation unit 130, and a metadata storage unit 140.
The image storage unit 100 stores the images or video frames (hereinafter collectively 'frames') that contain the objects whose GT information is to be analyzed and generated. It may be implemented as non-volatile memory such as a hard disk, volatile memory such as RAM, or a register serving as a buffer that temporarily holds streaming data.
The GT analysis unit 110 automatically extracts, for each frame of the image received from the image storage unit 100, the information defined by the GT components described above: the object's type, posture, state, location information, and attributes. Automatic GT analysis of the objects in an image can use a self-developed algorithm, open-source software, or a cloud API.
The GT information correction unit 120 presents the extracted GT information to the user as a list so that the user can modify the GT information that the GT analysis unit generated through its analysis. Preferably, a list of objects with the GT information structure described above is displayed for each frame number (the image number of a still image or the frame number of a video) so that the user can review it.
When the user submits corrections to an object's type, posture, state, location information, or attribute information, the GT information is updated to reflect the input.
The GT metadata generation unit 130 converts the corrected GT information into metadata, and the metadata storage unit 140 stores the metadata.
The metadata storage unit 140 and the image storage unit 100 are distinguished from a logical point of view; in hardware, they may be separate storage devices or may reside in a single physical storage device.
Hereinafter, a method of generating GT information according to an embodiment of the present invention is described in detail with reference to FIGS. 2 to 4.
After the program starts, the image or video file for which GT information will be generated is opened (S200), and automatic GT analysis of the file is performed (S210).
The automatic GT analysis step proceeds in more detail as follows.
First, the input video is split into frames, and a frame number is assigned to each frame to be analyzed (S211). In the case of a still image, only one frame exists, so no frame splitting is performed.
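The patent does not tie frame splitting to any particular library; as one possible realization, here is a sketch of step S211 using OpenCV (an assumption, chosen only for illustration):

```python
import cv2

def split_into_frames(video_path: str):
    """Step S211 sketch: yield (frame_number, frame) pairs from a video file."""
    capture = cv2.VideoCapture(video_path)
    frame_number = 0
    while True:
        ok, frame = capture.read()     # decode the next frame
        if not ok:                     # no frame returned: end of the video
            break
        yield frame_number, frame      # the assigned frame number travels with the frame
        frame_number += 1
    capture.release()
```

For a still image, a single decode such as `cv2.imread(path)` yields the one frame, so no splitting loop is needed.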
Next, GT analysis is run automatically on each split frame image (S214). For automatic GT analysis of the image, an in-house algorithm, open-source software, or a cloud API may be used; one example of a cloud API is the Sighthound Cloud API.
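Because the patent leaves the detection backend open, the sketch below takes the detector as a parameter and reuses the FrameGT and ObjectGT records sketched earlier; `detect_objects` and the dictionary keys it returns are hypothetical placeholders, not a real API:

```python
def analyze_frame(frame_number, frame, detect_objects):
    """Step S214 sketch: run a detector and pack results into the Table 1 structure.

    `detect_objects` is a hypothetical callable assumed to return one dict per
    detection with 'type', 'posture', 'state', 'bbox', and 'attributes' keys.
    """
    frame_gt = FrameGT(frame_number=frame_number)
    for object_id, detection in enumerate(detect_objects(frame)):
        frame_gt.objects.append(ObjectGT(
            object_id=object_id,                 # object IDs are assigned per frame
            object_type=detection["type"],
            posture=detection["posture"],
            state=detection["state"],
            bbox=detection["bbox"],
            attributes=detection.get("attributes", {}),
        ))
    return frame_gt
```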
The GT produced by the analysis is preferably organized, as in Table 1 above, into a data structure containing the frame number, the number of objects in each frame, an object ID for each object, and, per object ID, the object's type, posture, state, location information, and attributes.
The GT information produced by the analysis is stored for later correction (S216).
When the input is a video, it is determined whether there is a next frame on which to perform automatic GT analysis (S218); if so, the process returns to step S212, and if no frames remain to be analyzed, the analysis step is complete.
When the automatic GT analysis step (S210) is complete, the GT information correction step (S220) proceeds.
The GT information correction step (S220) supports the user in manually revising the automatically analyzed GT information.
To this end, the stored GT information is first presented to the user as a list through the information correction unit 120 (S222). In this step, the object list is organized by image or video frame number.
Then, when the user selects, from the object list presented in step S222, an object whose information is to be corrected, the selection input is received (S224) and the selected object is switched into an editable state.
An input from the user modifying the selected object's GT information is received and the GT information is corrected (S226). Specifically, the user can modify the object's type, posture, state, location information, and attribute information, and the information correction unit 120 applies the user's correction input for each of these attributes to update the GT information.
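In code, step S226 reduces to overwriting fields of the stored record with the user's input. A minimal sketch, again reusing the ObjectGT record assumed above (the set of editable fields is our assumption):

```python
EDITABLE_FIELDS = {"object_type", "posture", "state", "bbox", "attributes"}

def apply_correction(obj, corrections):
    """Step S226 sketch: apply a user's edits to one object's GT record."""
    for name, value in corrections.items():
        if name not in EDITABLE_FIELDS:        # the object ID itself is not editable
            raise ValueError(f"not an editable GT field: {name}")
        setattr(obj, name, value)              # overwrite the stored field
```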
In the subsequent step (S228), it is determined whether GT information correction has been completed for all frames of the image or video; if any frame has not yet been reviewed, the process returns to step S224, and once every frame has been reviewed for correction, GT information correction is complete.
When analysis and correction are complete, the GT information is converted into metadata (S230). The generated metadata is stored in an appropriate store (the metadata storage unit 140). Formats such as XML, EXCEL, JSON, and TEXT can be used for the metadata.
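As one illustration of step S230, conversion to JSON (one of the formats named above) can be a direct serialization of the records; a sketch assuming the FrameGT structure from earlier:

```python
import json
from dataclasses import asdict

def frames_to_json(frames, out_path):
    """Step S230 sketch: write corrected GT records as JSON metadata."""
    payload = [
        {**asdict(frame_gt), "object_count": frame_gt.object_count}  # include the Table 1 count
        for frame_gt in frames
    ]
    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump(payload, fh, ensure_ascii=False, indent=2)
```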
The preferred embodiments of the present invention have been described above. They are used in a general sense merely to explain the technical content of the invention easily and to aid understanding, and are not intended to limit the scope of the invention. It will be apparent to those of ordinary skill in the art to which the invention pertains that modifications based on the technical spirit of the invention can be practiced in addition to the embodiments disclosed herein; the scope of the invention should therefore be determined by the recitation of the claims that follow.

Claims (7)

  1. A method for generating GT information of objects in an image, the method comprising:
    automatically analyzing an image to generate GT information of each object included in the image according to a predetermined GT structure;
    correcting the generated GT information; and
    generating metadata by converting the corrected GT information.
  2. The method of claim 1, wherein the generating step analyzes each object included in each frame (a term collectively denoting a still image or each frame of a video; likewise hereinafter) to generate GT information including the type of the object, the posture of the object, the state of the object, the location information of the object, and the attributes of the object.
  3. The method of claim 1, wherein the correcting step comprises:
    presenting, for each frame, the generated GT information of the objects to a user as a list;
    receiving the user's correction input for the GT information selected by the user; and
    updating the GT information according to the correction input,
    wherein the three steps above are repeated until correction review is complete for the objects in every frame.
  4. The method of claim 1, wherein the generating of metadata comprises converting the corrected GT information into one of the formats XML, EXCEL, JSON, and TEXT.
  5. A system for generating GT information of objects in an image, the system comprising:
    an image storage unit that stores images;
    a GT analysis unit that receives an image file from the image storage unit and analyzes it to generate GT information for each object present in each frame;
    a GT information correction unit that corrects the GT information generated by the GT analysis unit; and
    a metadata generation unit that generates metadata by converting the corrected GT information.
  6. The system of claim 5, wherein the GT analysis unit embeds at least one of a self-developed analysis algorithm, open-source software, and a cloud API, and performs the analysis therewith.
  7. The system of claim 5, further comprising a metadata storage unit that stores the metadata.
PCT/KR2019/013511 2018-11-26 2019-10-15 Method and system for generating object gt information for image machine learning WO2020111505A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR10-2018-0146998 2018-11-26
KR1020180146998A KR20200068043A (en) 2018-11-26 2018-11-26 Ground Truth information generation method and system for image machine learning

Publications (1)

Publication Number Publication Date
WO2020111505A1 true WO2020111505A1 (en) 2020-06-04

Family

ID=70851977

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2019/013511 WO2020111505A1 (en) 2018-11-26 2019-10-15 Method and system for generating object gt information for image machine learning

Country Status (2)

Country Link
KR (1) KR20200068043A (en)
WO (1) WO2020111505A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102313944B1 (en) * 2021-05-14 2021-10-18 주식회사 인피닉 Method for tracking object crossing the boundary of the angle of view, and computer program recorded on record-medium for executing method therefor
KR102313940B1 (en) * 2021-05-14 2021-10-18 주식회사 인피닉 Method for tracking object in continuous 3D data, and computer program recorded on record-medium for executing method therefor
KR102310613B1 (en) * 2021-05-14 2021-10-12 주식회사 인피닉 Method for tracking object in continuous 2D image, and computer program recorded on record-medium for executing method therefor
KR102313938B1 (en) * 2021-06-17 2021-10-18 주식회사 인피닉 Method for tracking object through 3D path inference, and computer program recorded on record-medium for executing method therefor
KR102310611B1 (en) * 2021-06-17 2021-10-13 주식회사 인피닉 Method for tracking object through 2D path inference, and computer program recorded on record-medium for executing method therefor
KR102557136B1 (en) * 2021-11-23 2023-07-20 이인텔리전스 주식회사 Method and device for generating user data set for object and line in front of vehicle


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050234960A1 (en) * 2004-04-14 2005-10-20 Microsoft Corporation Automatic data perspective generation for a target variable
US8774515B2 (en) * 2011-04-20 2014-07-08 Xerox Corporation Learning structured prediction models for interactive image labeling
KR20180118596A (en) * 2015-10-02 2018-10-31 트랙터블 리미티드 Semi-automatic labeling of data sets
KR20180029625A (en) * 2016-09-13 2018-03-21 대구대학교 산학협력단 Ground Truth generation program for performance evaluation of image processing

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CARL VONDRICK: "Video Annotation and Tracking with Active Learning", RESEARCHGATE, January 2011 (2011-01-01), pages 1 - 9 *

Also Published As

Publication number Publication date
KR20200068043A (en) 2020-06-15

Similar Documents

Publication Publication Date Title
WO2020111505A1 (en) Method and system for generating object gt information for image machine learning
CN110705405B (en) Target labeling method and device
WO2017039086A1 (en) Deep learning modularization system on basis of web plug-in and image recognition method using same
CN109635783B (en) Video monitoring method, device, terminal and medium
CN111353555A (en) Label detection method and device and computer readable storage medium
WO2019132589A1 (en) Image processing device and method for detecting multiple objects
WO2012053867A1 (en) Method and apparatus for recognizing an emotion of an individual based on facial action units
EP3172683A1 (en) Method for retrieving image and electronic device thereof
US20200250401A1 (en) Computer system and computer-readable storage medium
WO2022213540A1 (en) Object detecting, attribute identifying and tracking method and system
CN111242083A (en) Text processing method, device, equipment and medium based on artificial intelligence
US11023714B2 (en) Suspiciousness degree estimation model generation device
CN112699758A (en) Sign language translation method and device based on dynamic gesture recognition, computer equipment and storage medium
CN110009038B (en) Training method and device for screening model and storage medium
WO2011093568A1 (en) Method for recognizing layout-based print medium page
CN114359160A (en) Screen detection method and device, electronic equipment and storage medium
WO2024005413A1 (en) Artificial intelligence-based method and device for extracting information from electronic document
CN114821513B (en) Image processing method and device based on multilayer network and electronic equipment
EP4105893A1 (en) Dynamic artifical intelligence camera model update
WO2023095991A1 (en) System for automatically extracting question area and type within content for learning included in electronic document and method therefor
WO2022177069A1 (en) Labeling method and computing device therefor
CN115713621A (en) Cross-modal image target detection method and device by using text information
WO2021118047A1 (en) Method and apparatus for evaluating accident fault in accident image by using deep learning
CN114202719A (en) Video sample labeling method and device, computer equipment and storage medium
CN112131400A (en) Construction method of medical knowledge map for assisting outpatient assistant

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19891015

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19891015

Country of ref document: EP

Kind code of ref document: A1