CN111401342A

CN111401342A - Question type sample manufacturing method based on label automation

Info

Publication number: CN111401342A
Application number: CN202010497093.9A
Authority: CN
Inventors: 田博帆
Original assignee: Nanjing Hongsong Information Technology Co ltd
Current assignee: Nanjing Hongsong Information Technology Co ltd
Priority date: 2020-06-04
Filing date: 2020-06-04
Publication date: 2020-07-10

Abstract

The invention relates to a label-based automated method for making question-type samples, which comprises the following steps: S1, defining custom sample labels; S2, preprocessing the sample image to clearly extract the handwriting sample; S3, labeling characters and question types: each character in the handwritten image is labeled according to the label definitions and requirements, and each question type is cropped independently using its coordinate record to obtain a question-type image; S4, determining the characters within each question type and obtaining the question-type coordinate set of the handwritten image; S5, converting the question-type coordinates: the coordinates of the question-type coordinate set are converted and the coordinates of the circumscribed rectangular frame of the question-type sample are calculated; S6, converting the character coordinates: when the question-type image is cropped independently, the character coordinates are converted, relative to the vertex of the question-type image, into coordinates in the question-type crop; S7, generating pictures and xml files; S8, classifying question types and characters: according to the labeled category labels, the cropped pictures are automatically sorted into different folders.

Description

Question type sample manufacturing method based on label automation

Technical Field

The invention belongs to the technical field of deep-learning sample production, and in particular relates to a method for automatically cropping and producing question-type samples based on annotation processing.

Background

Artificial intelligence and big data are both products of the present era, and the two are closely interlinked. The emergence and development of artificial intelligence cannot be separated from data; all data-centric artificial intelligence is shaped by big data, and together they are quietly changing how people work and live. At present, the big data on which artificial intelligence depends still requires a large investment in manual annotation and data cleaning. To cope with ever-growing data volumes and to improve data-processing efficiency, a data-processing tool capable of automatic cropping and classification is urgently needed. Such a tool not only helps obtain high-quality data samples but also greatly reduces manual cost, allowing artificial intelligence to make full use of the big data it depends on.

Therefore, in the field of homework recognition, the present invention studies the requirements of various question-type data samples and methods for automatic cropping and classification, and proposes a label-based automated method for making question-type samples.

Disclosure of Invention

The object of the invention is to provide a label-based automated method for making question-type samples, which completes the automatic cropping and classification of question types while retaining all character information within the range of each question type.

To solve the above technical problem, the invention adopts the following technical solution: the label-based automated method for making question-type samples specifically comprises the following steps:

Step S1, custom sample labels: determining the category range of the samples, taking stock of the samples to be collected, and defining and expressing labels for the sample categories;

Step S2, sample image preprocessing: preprocessing the sample images whose labels were defined and expressed in step S1, so as to clearly extract the handwriting sample;

Step S3, character and question-type labeling: after the preprocessing step, a handwritten image containing clear handwriting is obtained; each character in the handwritten image is labeled according to the label definitions and requirements, and each question type is cropped independently using the question-type coordinate records to obtain a question-type image;

Step S4, determining the characters within a question type: traversing all annotated labels once to obtain the question-type coordinate set of the handwritten image;

Step S5, question-type coordinate conversion: converting the coordinates of the question-type coordinate set and then calculating the coordinates of the circumscribed rectangular frame of the question-type sample;

Step S6, character coordinate conversion: when the question-type image is cropped independently, converting the character coordinates, relative to the vertex of the question-type image, into coordinates in the question-type crop;

Step S7, generating pictures and xml files: classifying by the category of the labeled objects; the characters within a question type are coordinate-converted to generate the corresponding question-type xml file, and labeled objects of other categories are cropped;

Step S8, question-type and character classification: according to the labeled category labels, automatically sorting the cropped pictures into different folders.

In this technical solution, a clean image is first obtained by image processing, the processed image is then labeled with question types and characters, and finally automatic cropping and classification are performed according to the labeled coordinates and category information. This achieves fast cropping and classification, meeting both the need for efficient operation and the need to obtain high-quality samples in batches.
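The image-cleaning stage can be illustrated with a small sketch. The embodiment mentions separating the channels of an RGB scan to suppress interfering handwriting of a particular color; the example below assumes the interference is red ink, represents the image as nested lists of (r, g, b) tuples, and uses an arbitrary threshold of 128. The function name `suppress_red_ink` and the threshold are illustrative assumptions, not part of the described method.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (r, g, b), each in 0..255


def suppress_red_ink(image: List[List[Pixel]], threshold: int = 128) -> List[List[int]]:
    """Binarize an RGB image using only its red channel.

    Red ink is bright in the red channel, so it is mapped to background (255),
    while dark handwriting is dark in every channel and is mapped to 0.
    Assumption: the interfering marks are red; for another ink color a
    different channel would be chosen.
    """
    return [[0 if r < threshold else 255 for (r, _g, _b) in row] for row in image]


# One row of pixels: white paper, a red correction mark, a dark pencil stroke.
row = [(255, 255, 255), (220, 30, 30), (40, 40, 40)]
print(suppress_red_ink([row]))
```

Only the pencil stroke survives as foreground here; a real scan would normally be processed with an imaging library rather than nested lists.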

As a preferred embodiment of the present invention, the samples in step S1 are collected by scanning and photographing.

As a preferred technical solution of the present invention, in step S3 the open-source LabelImg tool is used to label each character in the handwritten image according to the label definitions and requirements. The annotation information for each character comprises the character frame coordinates and the actual category to which the character belongs; the character frame coordinates consist of the coordinates of the top-left vertex and the coordinates of the bottom-right vertex of the character.
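As an illustration of the annotation records described above, the following sketch reads character frames from an annotation file. It assumes the annotations were saved in the Pascal VOC XML layout that LabelImg writes by default (`object`, `name`, `bndbox`, `xmin`, `ymin`, `xmax`, `ymax`); the `Box` type, the `read_boxes` helper, and the sample file content are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List, NamedTuple


class Box(NamedTuple):
    label: str  # annotated category, e.g. a digit or 'chuhao'
    xmin: int   # top-left vertex
    ymin: int
    xmax: int   # bottom-right vertex
    ymax: int


def read_boxes(xml_text: str) -> List[Box]:
    """Parse Pascal VOC style annotation XML into a list of boxes."""
    root = ET.fromstring(xml_text.strip())
    boxes = []
    for obj in root.iter("object"):
        bnd = obj.find("bndbox")
        boxes.append(Box(
            obj.findtext("name"),
            int(bnd.findtext("xmin")),
            int(bnd.findtext("ymin")),
            int(bnd.findtext("xmax")),
            int(bnd.findtext("ymax")),
        ))
    return boxes


# Hypothetical annotation with two labeled characters.
SAMPLE_XML = """
<annotation>
  <filename>page_001.jpg</filename>
  <object><name>1</name>
    <bndbox><xmin>120</xmin><ymin>80</ymin><xmax>140</xmax><ymax>110</ymax></bndbox>
  </object>
  <object><name>chuhao</name>
    <bndbox><xmin>150</xmin><ymin>85</ymin><xmax>170</xmax><ymax>105</ymax></bndbox>
  </object>
</annotation>
"""

boxes = read_boxes(SAMPLE_XML)
print(boxes)
```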

As a preferred technical solution of the present invention, the annotated character frame coordinate records are combined into a coordinate set, in which n denotes the number of all annotated objects, T denotes the set of question-type coordinate records, and C denotes the set of non-question-type coordinate records; each coordinate record in the set is either an annotated character coordinate record or an annotated question-type coordinate record.

As a preferred technical solution of the present invention, in step S4 all annotated labels are traversed once to obtain the question-type coordinate set of the whole image. This traversal is what allows each question type to be cropped independently.
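A minimal sketch of this single traversal follows, assuming each record is a `(label, xmin, ymin, xmax, ymax)` tuple and that question types are recognized by their label. The two question-type labels are the examples given in the embodiment; the `split_records` helper and the sample coordinates are hypothetical.

```python
from typing import List, Set, Tuple

Record = Tuple[str, int, int, int, int]  # (label, xmin, ymin, xmax, ymax)

# Question-type labels named in the embodiment; a real project would
# substitute its own label list (assumption).
QUESTION_LABELS: Set[str] = {"kousaanti", "shushiti"}


def split_records(records: List[Record]) -> Tuple[List[Record], List[Record]]:
    """Traverse all records once and return (T, C):
    T holds question-type records, C holds all other records."""
    question_types: List[Record] = []
    others: List[Record] = []
    for rec in records:
        if rec[0] in QUESTION_LABELS:
            question_types.append(rec)
        else:
            others.append(rec)
    return question_types, others


records = [
    ("kousaanti", 100, 60, 400, 130),
    ("1", 120, 80, 140, 110),
    ("chuhao", 150, 85, 170, 105),
]
T, C = split_records(records)
print(T)
print(C)
```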

As a preferred technical solution of the present invention, step S4 further includes traversing all character coordinates in the whole image once according to the question-type coordinate records, determining by the point-in-region principle the question-type region to which any vertex of each character annotation frame belongs, and screening out the set of all characters belonging to a given question type. This prevents annotated characters within the range of a question type from being missed.
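The vertex test can be sketched as follows, assuming axis-aligned rectangles stored as `(label, xmin, ymin, xmax, ymax)` tuples and treating points on the border as inside. The helper names and the sample coordinates are hypothetical.

```python
from typing import List, Tuple

Record = Tuple[str, int, int, int, int]  # (label, xmin, ymin, xmax, ymax)


def vertices(rec: Record) -> List[Tuple[int, int]]:
    """Return the four corner points of a record's rectangle."""
    _, xmin, ymin, xmax, ymax = rec
    return [(xmin, ymin), (xmax, ymin), (xmin, ymax), (xmax, ymax)]


def point_in_region(point: Tuple[int, int], region: Record) -> bool:
    """Point-in-region test for an axis-aligned rectangle (border inclusive)."""
    x, y = point
    _, xmin, ymin, xmax, ymax = region
    return xmin <= x <= xmax and ymin <= y <= ymax


def characters_in(question: Record, characters: List[Record]) -> List[Record]:
    """Keep every character with at least one vertex inside the question region."""
    return [c for c in characters
            if any(point_in_region(v, question) for v in vertices(c))]


question = ("kousaanti", 100, 60, 400, 130)
characters = [
    ("1", 120, 80, 140, 110),   # fully inside
    ("7", 390, 100, 410, 135),  # straddles the border, top-left vertex inside
    ("3", 500, 80, 520, 110),   # outside
]
print(characters_in(question, characters))
```

Testing any vertex, rather than requiring the whole frame to be inside, keeps characters such as the "7" above that spill over the manually drawn question-type frame.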

As a preferred embodiment of the present invention, step S5 includes the following steps:

Step S51: adding the coordinate record of the i-th question type to the character set obtained for that question type in step S4;

Step S52: calculating the maximum and minimum values of the x-axis and y-axis coordinates over all coordinate records in the resulting set to obtain the coordinates of the circumscribed rectangular frame of the question type.

This coordinate conversion ensures that the character information of a question type remains complete when the question type is cropped independently.
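Steps S51 and S52 amount to taking coordinate-wise minima and maxima over the question-type frame and the frames of its characters. A minimal sketch, assuming `(label, xmin, ymin, xmax, ymax)` tuples; the `circumscribed_rect` helper and the sample values are hypothetical.

```python
from typing import List, Tuple

Record = Tuple[str, int, int, int, int]  # (label, xmin, ymin, xmax, ymax)


def circumscribed_rect(question: Record, characters: List[Record]) -> Tuple[int, int, int, int]:
    """Return (xmin, ymin, xmax, ymax) enclosing the question frame and all its characters."""
    records = [question] + list(characters)  # S51: add the question record to the set
    return (                                 # S52: extremes along each axis
        min(r[1] for r in records),
        min(r[2] for r in records),
        max(r[3] for r in records),
        max(r[4] for r in records),
    )


question = ("kousaanti", 100, 60, 400, 130)
characters = [("1", 120, 80, 140, 110), ("7", 390, 100, 410, 135)]
print(circumscribed_rect(question, characters))
```

In this example the "7" extends past the manually drawn frame, so the enclosing rectangle grows to the right and downward to keep the character whole.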

As a preferred embodiment of the present invention, step S6 specifically includes the following steps:

Step S61: when the question-type image is cropped independently, the coordinates of the top-left vertex of the circumscribed rectangular frame change from their values in the sample image to (0, 0), and the offsets in the X-axis and Y-axis directions are calculated accordingly, with reference to the actually annotated top-left corner coordinates of the question type;

Step S62: performing one coordinate conversion on all characters within the question-type region, using the offset in the X-axis direction and the offset in the Y-axis direction, to obtain the character coordinates relative to the question-type crop, namely the converted top-left and bottom-right corner coordinates of each character.

This conversion supports the cropping of the various question-type images and the generation of the corresponding xml label files: when a question type is cropped independently, the top-left vertex of the question-type image changes from its coordinates in the whole image to (0, 0).
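The change of origin can be sketched as a simple translation. The example assumes the offsets equal the top-left corner of the circumscribed rectangle, which is what moving that corner to (0, 0) implies; the exact offset formula is not reproduced in this text, so this is an assumption. The helper name and the sample values are hypothetical.

```python
from typing import List, Tuple

Record = Tuple[str, int, int, int, int]  # (label, xmin, ymin, xmax, ymax)


def to_crop_coordinates(rect: Tuple[int, int, int, int],
                        characters: List[Record]) -> List[Record]:
    """Translate character frames from whole-image to crop coordinates.

    rect is the circumscribed rectangle (xmin, ymin, xmax, ymax) in the whole
    image; its top-left corner becomes (0, 0) in the crop.
    """
    dx, dy = rect[0], rect[1]  # assumed offsets along X and Y
    return [(label, x1 - dx, y1 - dy, x2 - dx, y2 - dy)
            for (label, x1, y1, x2, y2) in characters]


rect = (100, 60, 410, 135)
characters = [("1", 120, 80, 140, 110), ("7", 390, 100, 410, 135)]
print(to_crop_coordinates(rect, characters))
```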

As a preferred embodiment of the present invention, step S7 further includes classifying the question types separately, each comprising a question-type picture and a question-type xml file. In step S7 the labeled objects are broadly divided by category into different kinds of characters, question types, and others. Only the characters within a question type need their coordinates converted in order to generate the corresponding question-type xml file; all other labeled objects are cropped using their original coordinates in the whole image.
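A question-type xml file of this kind could be generated as sketched below. The element names follow the Pascal VOC layout, which is an assumption about the target format rather than something the text specifies; `build_annotation`, the file name, and the sample values are hypothetical. Cropping the picture itself would additionally need an imaging library and is omitted here.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

Record = Tuple[str, int, int, int, int]  # (label, xmin, ymin, xmax, ymax)


def build_annotation(filename: str, width: int, height: int,
                     characters: List[Record]) -> str:
    """Serialize crop-relative character frames as a VOC-style XML string."""
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = filename
    size = ET.SubElement(root, "size")
    ET.SubElement(size, "width").text = str(width)
    ET.SubElement(size, "height").text = str(height)
    for label, x1, y1, x2, y2 in characters:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = label
        bnd = ET.SubElement(obj, "bndbox")
        for tag, value in (("xmin", x1), ("ymin", y1), ("xmax", x2), ("ymax", y2)):
            ET.SubElement(bnd, tag).text = str(value)
    return ET.tostring(root, encoding="unicode")


# Characters already converted to crop coordinates; the crop is 310 x 75 pixels.
crop_chars = [("1", 20, 20, 40, 50), ("7", 290, 40, 310, 75)]
xml_out = build_annotation("kousaanti_0001.jpg", 310, 75, crop_chars)
print(xml_out)
```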

Compared with the prior art, the invention has the following beneficial effects. Given a scanned sample picture, there is no need to consider the actual size of the picture or of the handwritten characters, or which question types the picture may contain. It is only necessary to manually box-select the range of each question type to be cropped, label its question-type category, and label all characters related to the question type. The selected objects are not subject to special restrictions such as character length, character size or character type. The technique then automatically crops the question types, converts the coordinates of the related characters, and sorts the results into designated folders to serve as a training sample set for deep learning. This achieves fast cropping and classification, meets the needs for efficient operation and for obtaining high-quality samples in batches, and facilitates the informatization of teaching.

Drawings

The technical solution of the invention is further described below with reference to the accompanying drawings:

FIG. 1 is a flow chart of the label-based automated method for making question-type samples according to the present invention;

FIG. 2 shows the contrast effect of color separation in step S2 of the label-based automated method for making question-type samples according to the present invention;

FIG. 3 shows the character labeling coordinates in step S3 of the label-based automated method for making question-type samples according to the present invention;

FIG. 4 shows the automatic classification result in step S8 of the label-based automated method for making question-type samples according to the present invention.

Detailed Description

For the purpose of enhancing the understanding of the present invention, the present invention will be described in further detail with reference to the accompanying drawings and examples, which are provided for the purpose of illustration only and are not intended to limit the scope of the present invention.

Embodiment: as shown in FIG. 1, the label-based automated method for making question-type samples specifically comprises the following steps:

Step S1, custom sample labels: determining the category range of the samples, taking stock of the samples to be collected, and defining and expressing labels for the sample categories. The samples in step S1 are collected by scanning and photographing. The label definition and expression in step S1 serve two purposes: on the one hand, they avoid cumbersome input during sample labeling and keep entry convenient, fast and user-friendly; on the other hand, they ensure that sample pictures can be stored quickly and accurately on different system platforms, without storage failing because of restrictions on special symbols. Examples of question-type sample labels are the oral arithmetic question ('kousaanti') and the vertical-form arithmetic question ('shushiti'); examples of character sample labels are the division sign ('/'), expressed as 'chuhao', and the multiplication sign ('×'), expressed as 'chenghao';

Step S2, sample image preprocessing: preprocessing the sample images whose labels were defined and expressed in step S1, so as to clearly extract the handwriting sample. In step S2, because the sample images are acquired mainly by scanning and photographing, the collected handwriting samples are affected by the natural background, handwriting color, ink, lighting and so on; after the samples are entered into the system in electronic form, further image preprocessing is required to clearly extract the handwriting. For example, to remove interfering handwriting of a particular color, the image sample is captured in three RGB channels during scanning, the color channels are separated, and the handwritten strokes are retained as far as possible, as shown in FIG. 2;

Step S3, character and question-type labeling: after the preprocessing step, a handwritten image containing clear handwriting is obtained; each character in the handwritten image is labeled according to the label definitions and requirements, and each question type is cropped independently using the question-type coordinate records to obtain a question-type image. In step S3, the open-source LabelImg tool is used to label each character in the handwritten image according to the label definitions and requirements. The annotation information for each character comprises the character frame coordinates and the actual category to which the character belongs; the character frame coordinates consist of the coordinates of the top-left vertex and the coordinates of the bottom-right vertex of the character. FIG. 3 illustrates this for the characters "1000", and the annotation information of all characters is saved in the corresponding xml file. The annotated character frame coordinate records are combined into a coordinate set, in which n denotes the number of all annotated objects, T denotes the set of question-type coordinate records, and C denotes the set of non-question-type coordinate records; each coordinate record in the set is either an annotated character coordinate record or an annotated question-type coordinate record. At the same time, each question type can be cropped independently using its question-type coordinate record;

Step S4, determining the characters within a question type: traversing all annotated labels once to obtain the question-type coordinate set of the handwritten image. In step S4, to ensure that each question type can be cropped independently, all annotated labels are traversed once to obtain the question-type coordinate set. To prevent annotated characters within the range of a question type from being missed, step S4 further includes traversing all character coordinates in the handwritten image once according to the question-type coordinate records, determining by the point-in-region principle the question-type region to which the vertices of each character annotation frame belong, and screening out the set of all characters belonging to a given question type;

Step S5, question-type coordinate conversion: converting the coordinates of the question-type coordinate set and then calculating the coordinates of the circumscribed rectangular frame of the question-type sample. Step S5 includes the following steps:

Step S51: adding the coordinate record of the i-th question type to the character set obtained for that question type in step S4;

Step S52: calculating the maximum and minimum values of the x-axis and y-axis coordinates over all coordinate records in the resulting set to obtain the coordinates of the circumscribed rectangular frame of the question type. This coordinate conversion ensures that the character information of a question type remains complete when the question type is cropped independently;

Step S6, character coordinate conversion: when the question-type image is cropped independently, the character coordinates are converted, relative to the vertex of the question-type image, into coordinates in the question-type crop. Step S6 specifically includes the following steps:

Step S61: when the question-type image is cropped independently, the coordinates of the top-left vertex of the circumscribed rectangular frame change from their values in the sample image to (0, 0), and the offsets in the X-axis and Y-axis directions are calculated accordingly;

Step S62: performing one coordinate conversion on all characters within the question-type region, using the offset in the X-axis direction and the offset in the Y-axis direction, to obtain the character coordinates relative to the question-type crop. To support the cropping of the various question-type images and the generation of the corresponding xml label files, the top-left vertex of the question-type image changes from its coordinates in the whole image to (0, 0) when the question type is cropped independently;

Step S7, generating pictures and xml files: classifying by the category of the labeled objects; the characters within a question type are coordinate-converted to generate the corresponding question-type xml file, and labeled objects of other categories are cropped. Step S7 further includes classifying the question types separately, each comprising a question-type picture and a question-type xml file. In step S7 the labeled objects are broadly divided by category into different kinds of characters, question types, and others. Only the characters within a question type need their coordinates converted in order to generate the corresponding question-type xml file; all other labeled objects are cropped using their original coordinates in the whole image;

Step S8, question-type and character classification: according to the labeled category labels, automatically sorting the cropped pictures into different folders. The question types are the main focus and are classified separately; each comprises a question-type picture and the corresponding xml file, as shown in FIG. 4.
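The folder-based classification of step S8, together with the file-system-safe label names from step S1, can be sketched as follows. The label table contains only the two symbol examples given in the embodiment; the `classify` and `demo` helpers, the file names, and the folder layout are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Iterable, List, Tuple

# Label expressions from the embodiment: symbols that are awkward in file
# names are spelled out. Extending this table is left to the user.
LABEL_NAMES = {"/": "chuhao", "×": "chenghao"}


def classify(crops: Iterable[Tuple[Path, str]], out_root: Path) -> List[Path]:
    """Move each cropped file into out_root/<label name>/ and return the new paths."""
    moved = []
    for path, label in crops:
        folder = out_root / LABEL_NAMES.get(label, label)
        folder.mkdir(parents=True, exist_ok=True)
        target = folder / path.name
        shutil.move(str(path), str(target))
        moved.append(target)
    return moved


def demo() -> List[str]:
    """Create empty placeholder files in a temporary directory and sort them."""
    with tempfile.TemporaryDirectory() as tmp:
        src, out = Path(tmp) / "crops", Path(tmp) / "sorted"
        src.mkdir()
        crops = []
        for name, label in [("q1.jpg", "kousaanti"), ("q1.xml", "kousaanti"), ("c1.jpg", "/")]:
            (src / name).touch()
            crops.append((src / name, label))
        return sorted(p.relative_to(out).as_posix() for p in classify(crops, out))


print(demo())
```

Keeping a question-type picture and its xml file under the same label folder mirrors the separate question-type classification described above.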

It will be obvious to those skilled in the art that the specific implementation of the invention is not limited to the above embodiment. Various insubstantial modifications made using the method concept and technical solution of the invention, or direct application of that concept and solution to other situations without modification, all fall within the scope of protection of the invention.

Claims

1. A question type sample manufacturing method based on label automation is characterized by comprising the following steps:

2. The label-based automated method for making question-type samples according to claim 1, wherein the samples in step S1 are collected by scanning and photographing.

3. The label-based automated method for making question-type samples according to claim 1, wherein in step S3 the open-source LabelImg tool is used to label each character in the handwritten image according to the label definitions and requirements; the annotation information for each character comprises the character frame coordinates and the actual category to which the character belongs, and the character frame coordinates consist of the coordinates of the top-left vertex and the coordinates of the bottom-right vertex of the character.

4. The label-based automated method for making question-type samples according to claim 3, wherein the annotated character frame coordinate records are combined into a coordinate set.

5. The label-based automated method for making question-type samples according to claim 4, wherein in step S4 all annotated labels are traversed once to obtain the question-type coordinate set.

6. The label-based automated method for making question-type samples according to claim 4, wherein step S4 further comprises traversing all character coordinates in the handwritten image once according to the question-type coordinate records, determining by the point-in-region principle the question-type region to which the vertices of each character annotation frame belong, and screening out the set of all characters belonging to a given question type.

7. The label-based automated method for making question-type samples according to claim 6, wherein step S5 comprises the following steps:

Step S51: adding the coordinate record of the i-th question type to the character set obtained for that question type in step S4;

Step S52: calculating the maximum and minimum values of the x-axis and y-axis coordinates over all coordinate records in the resulting set to obtain the coordinates of the circumscribed rectangular frame of the question type.

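8. The label-based automated method for making question-type samples according to claim 6, wherein step S6 comprises the following steps:

Step S61: when the question-type image is cropped independently, the coordinates of the top-left vertex of the circumscribed rectangular frame change from their values in the sample image to (0, 0), and the offsets in the X-axis and Y-axis directions are calculated accordingly;

Step S62: performing one coordinate conversion on all characters within the question-type region, using the offset in the X-axis direction and the offset in the Y-axis direction, to obtain the character coordinates relative to the question-type crop, namely the converted top-left and bottom-right corner coordinates of each character.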

9. The label-based automated method for making question-type samples according to claim 6, wherein step S7 further comprises classifying the question types separately, each comprising a question-type picture and a question-type xml file.