CN116662881A - Automatic data labeling method and system

Info

Publication number: CN116662881A
Application number: CN202310627177.3A
Authority: CN
Inventors: 杨洲鑫; 熊雪菲; 王耀威; 蒋冬梅; 田永鸿
Original assignee: Peng Cheng Laboratory
Current assignee: Peng Cheng Laboratory
Priority date: 2023-05-30
Filing date: 2023-05-30
Publication date: 2023-08-29

Abstract

The invention discloses an automatic data labeling method and system. The method comprises: acquiring data to be labeled, configuring a labeling template based on the data to be labeled, and acquiring a pre-trained model; performing knowledge distillation on the pre-trained model to obtain a backbone model of an automatic labeling model, and training the automatic labeling model based on the backbone model using manually labeled data to obtain a trained automatic labeling model; and automatically labeling data using the trained automatic labeling model, manually auditing the automatic labeling result, and storing the manual auditing result. The method and system can improve the accuracy of a system that applies automatic labeling, support the complete data labeling workflow, reduce manual labeling labor, and greatly improve the efficiency of labeling staff.

Description

Automatic data labeling method and system

Technical Field

The invention relates to the technical field of data labeling, in particular to an automatic data labeling method and system.

Background

On the one hand, managing data sets is cumbersome. Labeled data is currently stored on a mirror server as directories and files, and when there are too many data sets, it takes a long time to find the one that is needed. Many labeling tools are available on the market, but most offer a poor user experience, many open-source labeling tools are not automatic, and the learning cost for labeling personnel is high. On the other hand, as automatic labeling is adopted more deeply, an automatic labeling system must be able to train its automatic labeling model efficiently. Artificial intelligence application scenarios are growing rapidly, and each new scenario brings a large amount of new data. To train an artificial intelligence model that adapts to a new scenario, the user needs to label the new data quickly. Efficient training of the automatic labeling model on new data from new scenarios has therefore become an indispensable function of an automatic labeling system.

Disclosure of Invention

The technical problem to be solved by the invention is to provide an automatic data labeling method and system that address the following problems in the prior art: many open-source labeling tools are not automatic, the learning cost for labeling personnel is high, and, when an automatic labeling model is applied to an actual labeling scenario, computing resources are wasted and training is slow.

In a first aspect, the present invention provides an automatic data labeling method, where the method includes:

acquiring data to be marked, configuring a marking template based on the data to be marked, and acquiring a pre-training model;

distilling knowledge of the pre-trained model to obtain a backbone model of an automatic labeling model, and training the automatic labeling model based on the backbone model by using manually labeled data to obtain a trained automatic labeling model;

and automatically labeling data by using the trained automatic labeling model, manually auditing the automatic labeling result, and storing the manual auditing result.

In one implementation, obtaining the data to be labeled includes:

connecting to a data source, inputting configuration information to connect to a data warehouse, and synchronizing the data in the data warehouse to obtain the data to be labeled;

or selecting local import to obtain the data to be labeled.

In one implementation manner, the configuring the labeling template based on the data to be labeled includes:

acquiring a data type of the data to be annotated, wherein the data type comprises pictures, texts, voices and videos, and determining an annotation scene of the data to be annotated based on the data type;

and determining the annotation template corresponding to the annotation scene based on the annotation scene, wherein the annotation template comprises a code configuration template and a visual configuration template.

In one implementation, the training the automatic labeling model based on the backbone model using the manually labeled data to obtain a trained automatic labeling model includes:

performing knowledge distillation on the pre-training model to obtain a backbone model of an automatic labeling model;

labeling a small number of data sets by using a manual labeling mode;

and training the automatic labeling model based on the backbone model to obtain the trained automatic labeling model.

In one implementation, the method further comprises:

automatically labeling the unlabeled data by using the trained automatic labeling model;

checking and correcting the automatic labeling result by manual auditing;

and iteratively training the automatic labeling model using the checked and corrected data, to improve the accuracy of intelligent labeling.

In a second aspect, an embodiment of the present invention further provides a terminal device, where the terminal device includes a memory, a processor, and a data automatic labeling program stored in the memory and capable of running on the processor, and when the processor executes the data automatic labeling program, the processor implements the steps of the data automatic labeling method according to any one of the foregoing schemes.

In a third aspect, an embodiment of the present invention further provides a computer readable storage medium, where the computer readable storage medium stores an automatic data labeling program, where the automatic data labeling program, when executed by a processor, implements the steps of the automatic data labeling method according to any one of the above schemes.

The beneficial effects are that: compared with the prior art, the automatic data labeling method and system provided by the invention can improve the accuracy of a system applying automatic labeling, facilitate the realization of the whole flow process of data labeling, liberate labeling labor force and greatly improve the labeling efficiency of labeling staff. In addition, the automatic labeling model has high training speed and saves computing resources.

The invention helps users obtain a low-cost, efficient data labeling service. Data sets can be uploaded to and downloaded from the automatic labeling system, which provides a unified data set management page. The system is simple and clear to operate, which greatly reduces the learning cost for labeling personnel, and it provides a real-time preview of labeling progress and results for both manual labeling and intelligent labeling. After a large model is uploaded, the backend automatically performs knowledge distillation to obtain a backbone model, and the automatic labeling model is then trained based on the backbone model. In this way the backbone model is updated iteratively and automatically, with the large model used to train the backbone model.

Drawings

Fig. 1 is a flowchart of a specific implementation of a method for automatically labeling data according to an embodiment of the present invention.

Fig. 2 is a functional framework diagram of a method for automatically labeling data according to an embodiment of the present invention.

Fig. 3 is a functional schematic diagram of an automatic data labeling system according to an embodiment of the present invention.

Fig. 4 is a schematic block diagram of a terminal device according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and effects of the present invention clearer and more specific, the present invention will be described in further detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

This embodiment of the invention provides an automatic data labeling method. In a specific implementation, the embodiment first obtains data to be labeled, configures a labeling template based on the data to be labeled, and obtains a pre-trained model. Knowledge distillation is then performed on the pre-trained model to obtain a backbone model of the automatic labeling model, and the automatic labeling model is trained based on the backbone model using manually labeled data to obtain the trained automatic labeling model. Finally, data is automatically labeled using the trained automatic labeling model, the automatic labeling result is manually audited, and the manual auditing result is stored. The embodiment can improve the accuracy of a system that applies automatic labeling, supports the complete data labeling workflow, reduces manual labeling labor, and greatly improves the efficiency of labeling staff. In addition, the automatic labeling model trains quickly and saves computing resources.

The automatic data labeling method can be applied to terminal equipment, where the terminal equipment includes smart terminals such as computers, smart televisions, and mobile phones. As shown in Fig. 1, the automatic data labeling method of this embodiment specifically includes the following steps:

Step S100: obtaining data to be labeled, configuring a labeling template based on the data to be labeled, and obtaining a pre-trained model.

Specifically, this embodiment first connects to a data source and inputs configuration information to connect to a data warehouse. The data in the data warehouse is then synchronized to obtain the data to be labeled. Alternatively, local import may be selected to obtain the data to be labeled.

In one implementation, this embodiment manages data source connections in a unified way. As shown in Fig. 2, data management includes three modules: data set management, data source management, and labeling project management. Data source management manages external data resources of different types; databases of different types are managed separately according to their access modes and data storage. Data set management applies version management to the accessed data and provides version isolation, so that as data accumulates, the user can manage it far more efficiently. Project management uniformly manages labeling projects of different types and provides user permission management for projects, which ensures project security and improves management efficiency.
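The two acquisition paths described above (synchronizing from a data warehouse, or local import) could be sketched as follows. This is a minimal illustration only: `WarehouseConfig`, `sync_from_warehouse`, and `import_local` are hypothetical names that do not appear in the specification, and the warehouse connection is replaced by rows passed in directly so that the sketch runs without a database.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Iterable


@dataclass(frozen=True)
class WarehouseConfig:
    """Hypothetical connection settings a user would enter."""
    host: str
    port: int
    database: str


def sync_from_warehouse(config: WarehouseConfig, rows: Iterable[dict]) -> list[dict]:
    """Stand-in for synchronizing records from a data warehouse.

    A real system would open a connection described by `config`. Here the
    rows are passed in directly so the sketch runs without a database.
    """
    source = f"{config.host}:{config.port}/{config.database}"
    return [{"source": source, "payload": row} for row in rows]


def import_local(directory: Path, suffixes: tuple[str, ...] = (".txt",)) -> list[dict]:
    """Local import: collect files in `directory` whose suffix is accepted."""
    files = sorted(
        p for p in directory.iterdir() if p.is_file() and p.suffix in suffixes
    )
    return [{"source": "local", "payload": p.name} for p in files]
```

Both functions return records in the same shape, so later steps do not need to know which path produced the data to be labeled.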

Unified management of data source connections in this embodiment covers functions such as data source connection, target connection, and data synchronization, and supports access to batch data that has already been persisted. Such data includes relational (structured) data, non-relational (unstructured) data, distributed storage data, multimodal data such as text and images, and vector features accumulated in the data center. Data access in this embodiment is based on a unified connector interface and an adapter model with active sensing, and implements a unified read-in and write-out paradigm through push or pull. It also provides metadata catalog management, and uniformly encapsulates the data interfaces of structured sources, unstructured sources, distributed file systems, and message middleware, supporting componentized, configurable, and visual management. Data access in this embodiment helps the user synchronize data from applications, APIs, and databases to the warehouse, and supports importing data from different data sources, making data integration simple, secure, and scalable. Users can also upload their own data for use in data labeling.
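One way to read the unified connector interface with adapters, supporting both pull and push, is sketched below. The specification does not define the interface, so `Connector`, `ListAdapter`, `TextAdapter`, `pull_all`, and `push_all` are all assumed names, and the in-memory adapters merely stand in for real structured and unstructured sources.

```python
from abc import ABC, abstractmethod
from typing import Callable, Iterable, Iterator


class Connector(ABC):
    """Hypothetical unified interface that every data-source adapter implements."""

    @abstractmethod
    def read(self) -> Iterator[dict]:
        """Yield records from the underlying source in a common shape."""


class ListAdapter(Connector):
    """Adapter over an in-memory list, standing in for a relational table."""

    def __init__(self, rows: Iterable[dict]) -> None:
        self._rows = list(rows)

    def read(self) -> Iterator[dict]:
        for row in self._rows:
            yield {"kind": "structured", "data": row}


class TextAdapter(Connector):
    """Adapter over raw text lines, standing in for unstructured files."""

    def __init__(self, text: str) -> None:
        self._text = text

    def read(self) -> Iterator[dict]:
        for line in self._text.splitlines():
            if line.strip():  # skip blank lines
                yield {"kind": "unstructured", "data": line.strip()}


def pull_all(connectors: Iterable[Connector]) -> list[dict]:
    """Pull mode: the caller collects records from every connector."""
    records: list[dict] = []
    for connector in connectors:
        records.extend(connector.read())
    return records


def push_all(connectors: Iterable[Connector], sink: Callable[[dict], None]) -> int:
    """Push mode: each record is delivered to `sink`; returns the record count."""
    count = 0
    for connector in connectors:
        for record in connector.read():
            sink(record)
            count += 1
    return count
```

Because every adapter emits the same record shape, adding a new source type only requires a new `Connector` subclass; the pull and push paths stay unchanged.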

Further, this embodiment obtains the data type of the data to be labeled and determines the corresponding labeling scenario based on the data type. The labeling template corresponding to that scenario is then determined. The labeling templates of this embodiment support data types such as pictures, text, voice, and video, and labeling scenarios such as image classification, object detection, audio segmentation, text triples, and video classification. The user can select the required labeling template for manual labeling according to the data type and business scenario. For example, when pictures need to be classified as cats or dogs, the picture classification template under the picture data type can be selected to meet the needs of that business scenario. The labeling template also offers the user two configuration modes, code-based labeling configuration and visual labeling configuration, so that different types of data can be labeled through a concise interface and output in a standardized format. The visual interface makes labeling configuration more convenient, while code-based configuration makes it more flexible.
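The selection of a labeling template from the data type and labeling scenario could look roughly like the sketch below. The grouping of scenarios under each data type is an assumed reading of the examples listed above, and `SCENES_BY_TYPE`, `AnnotationTemplate`, and `select_template` are hypothetical names.

```python
from dataclasses import dataclass

# Assumed grouping of the scenarios named in the text under each data type.
SCENES_BY_TYPE = {
    "picture": ("image_classification", "object_detection"),
    "voice": ("audio_segmentation",),
    "text": ("text_triples",),
    "video": ("video_classification",),
}

CONFIG_MODES = ("code", "visual")


@dataclass(frozen=True)
class AnnotationTemplate:
    """Hypothetical template record produced by the configuration step."""
    data_type: str
    scene: str
    config_mode: str
    labels: tuple[str, ...]


def select_template(data_type, scene, labels, config_mode="visual"):
    """Validate the data type, scenario, and mode, then build a template."""
    if data_type not in SCENES_BY_TYPE:
        raise ValueError(f"unsupported data type: {data_type}")
    if scene not in SCENES_BY_TYPE[data_type]:
        raise ValueError(f"scene {scene!r} is not available for {data_type!r}")
    if config_mode not in CONFIG_MODES:
        raise ValueError(f"unknown configuration mode: {config_mode}")
    return AnnotationTemplate(data_type, scene, config_mode, tuple(labels))
```

For the cat and dog example in the text, the call would be `select_template("picture", "image_classification", ["cat", "dog"])`.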

Next, a large model for intelligent labeling, namely a pre-trained model, may be uploaded to the automatic labeling system. Pre-training is a common means of improving the performance of deep learning algorithms. A pre-trained model can be summarized as a deep learning network architecture together with a set of weights obtained by training that architecture on a massive amount of data. With the architecture and weights, it can serve as the backbone network for a particular visual task and provide initialization parameters. A specific downstream task thus has a better starting point for training, which reduces the search space while achieving better algorithm performance.

Step S200: performing knowledge distillation on the pre-trained model to obtain a backbone model of the automatic labeling model, and training the automatic labeling model based on the backbone model by using manually labeled data to obtain the trained automatic labeling model.

Specifically, this embodiment performs knowledge distillation on the uploaded large model, that is, the pre-trained model, to generate the backbone model of the automatic labeling model. Next, a small number of data sets are labeled manually. Finally, the automatic labeling model, whose backbone model has been obtained through knowledge distillation, is further trained on the small number of manually labeled data sets to obtain the trained automatic labeling model.
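The specification does not state which distillation objective is used. As an illustration only, the sketch below uses the common soft-target formulation: the student (backbone) is trained to match the teacher's (large model's) temperature-softened output distribution, optionally mixed with a cross-entropy term on the manually labeled data. The function names, the temperature, and the weighting factor `alpha` are assumptions, not values from the source.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T squared."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


def cross_entropy(student_logits, label):
    """Cross-entropy of the student's prediction against a manual label index."""
    return -math.log(softmax(student_logits)[label])


def combined_loss(teacher_logits, student_logits, label, alpha=0.5, temperature=2.0):
    """Weighted sum of the distillation term and the hard-label term."""
    soft = distillation_loss(teacher_logits, student_logits, temperature)
    hard = cross_entropy(student_logits, label)
    return alpha * soft + (1.0 - alpha) * hard
```

With `alpha=1.0` only the teacher signal is used, which corresponds to distilling the backbone before any manual labels exist; lowering `alpha` shifts weight toward the small manually labeled set.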

Step S300: automatically labeling data using the trained automatic labeling model, manually auditing the automatic labeling result, and storing the manual auditing result.

As shown in Fig. 2, intelligent labeling in this embodiment includes two modules: intelligent labeling and manual auditing. The intelligent labeling scheme provided by the automatic labeling system can replace manual data labeling by using the trained automatic labeling model, a preset model, or a custom uploaded model. Manual auditing corrects the intelligent labeling results, and the model is automatically updated and iterated according to the audited results, which improves the accuracy of the algorithm. After automatic labeling is finished, the automatic labeling result can be manually audited to obtain a manual auditing result. The manual auditing result is automatically stored in a database, and the system can use the checked and corrected data set to iteratively train the automatic labeling model, improving the accuracy of intelligent labeling.
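The label, audit, store, and retrain cycle described above might be sketched as follows. `ReviewStore` is an in-memory stand-in for the database, and the `model` and `reviewer` callables are hypothetical placeholders for the trained automatic labeling model and the human auditor.

```python
from dataclasses import dataclass, field


@dataclass
class ReviewStore:
    """In-memory stand-in for the database that keeps audited results."""
    records: list = field(default_factory=list)

    def save(self, item_id, predicted, final):
        """Store the model's label, the audited label, and whether they differ."""
        self.records.append({
            "id": item_id,
            "predicted": predicted,
            "final": final,
            "corrected": predicted != final,
        })

    def training_set(self):
        """Audited (id, label) pairs that can feed the next training round."""
        return [(r["id"], r["final"]) for r in self.records]

    def agreement_rate(self):
        """Fraction of automatic labels the auditor left unchanged."""
        if not self.records:
            return 0.0
        kept = sum(1 for r in self.records if not r["corrected"])
        return kept / len(self.records)


def auto_label_and_review(items, model, reviewer, store):
    """Label each item automatically, pass it to the auditor, store the result."""
    for item_id, content in items:
        predicted = model(content)
        final = reviewer(content, predicted)
        store.save(item_id, predicted, final)
    return store.training_set()
```

The list returned by `auto_label_and_review` is the checked and corrected data set; feeding it back into training is what closes the iteration loop, and `agreement_rate` gives one simple way to watch whether accuracy improves between rounds.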

In summary, this embodiment first obtains the data to be labeled, configures a labeling template based on the data to be labeled, and obtains the pre-trained model. Knowledge distillation is then performed on the pre-trained model to obtain a backbone model of the automatic labeling model, and the automatic labeling model is trained based on the backbone model using manually labeled data to obtain the trained automatic labeling model. Finally, data is automatically labeled using the trained automatic labeling model, the automatic labeling result is manually audited, and the manual auditing result is stored. The embodiment can improve the accuracy of a system that applies automatic labeling, supports the complete data labeling workflow, reduces manual labeling labor, and greatly improves the efficiency of labeling staff. In addition, the automatic labeling model trains quickly and saves computing resources.

Exemplary System

Based on the above embodiment, the present invention further provides an automatic data labeling system. As shown in Fig. 3, the system includes a data importing module 10, a template configuration module 20, a model training module 30, and an automatic labeling module 40. Specifically, the data importing module 10 is configured to obtain data to be labeled. The template configuration module 20 is configured to configure a labeling template based on the data to be labeled. The model training module 30 is configured to obtain a pre-trained model, perform knowledge distillation on the pre-trained model to obtain an automatic labeling model, and train the distilled automatic labeling model using manually labeled data to obtain a trained automatic labeling model. The automatic labeling module 40 is configured to automatically label data using the trained automatic labeling model, manually audit the automatic labeling result, and store the manual auditing result.

In one implementation, the data import module 10 includes:

the data synchronization unit is used for connecting to a data source, inputting configuration information to connect to a data warehouse, and synchronizing the data in the data warehouse to obtain the data to be labeled;

or,

and the data importing unit is used for selecting local importing to obtain the data to be marked.

In one implementation, the template configuration module 20 includes:

the scene determining unit is used for obtaining the data type of the data to be annotated, wherein the data type comprises pictures, texts, voices and videos, and the annotation scene of the data to be annotated is determined based on the data type;

and the template configuration unit is used for determining the annotation template corresponding to the annotation scene based on the annotation scene, and the annotation template comprises a code configuration template and a visual configuration template.

In one implementation, the model training module 30 includes:

the knowledge distillation unit is used for carrying out knowledge distillation on the pre-training model to obtain a backbone model of the automatic labeling model;

the manual marking unit is used for marking a small number of data sets in a manual marking mode;

and the training unit is used for training the automatic labeling model based on the backbone model to obtain the trained automatic labeling model.

In one implementation, the system further comprises:

the automatic labeling module is used for automatically labeling the unlabeled data by using the trained automatic labeling model;

the verification and correction module is used for checking and correcting the automatic labeling result by manual auditing;

and the iteration training module is used for iteratively training the automatic labeling model using the checked and corrected data, so as to improve the accuracy of intelligent labeling.

The working principle of each module in the automatic data labeling system of the embodiment is the same as that of each step in the above method embodiment, and will not be described here again.
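As a rough illustration of how the four modules of Fig. 3 relate to the method steps, the sketch below wires them together as plain callables. `AutoLabelingSystem` and its field names are hypothetical, and each callable is a placeholder for the corresponding module rather than an implementation of it.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AutoLabelingSystem:
    """Hypothetical composition of modules 10, 20, 30, and 40."""
    import_data: Callable[[], list]                      # data importing module 10
    configure_template: Callable[[list], dict]           # template configuration module 20
    train_model: Callable[[list, dict], Callable]        # model training module 30
    label_and_review: Callable[[Callable, list], list]   # automatic labeling module 40

    def run(self) -> list:
        """Run steps S100 to S300 in order and return the audited results."""
        data = self.import_data()
        template = self.configure_template(data)
        model = self.train_model(data, template)
        return self.label_and_review(model, data)
```

Keeping each module behind a callable means a different data source, template, or model can be substituted without changing the order of the steps.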

Based on the above embodiment, the present invention also provides a terminal device, and a schematic block diagram of the terminal device may be shown in fig. 4. The terminal device may include one or more processors 100 (only one shown in fig. 4), a memory 101, and a computer program 102, e.g., a data auto-labeling program, stored in the memory 101 and executable on the one or more processors 100. The one or more processors 100, when executing the computer program 102, may implement the functions of the various modules/units in the data automatic annotation system embodiment, without limitation.

In one embodiment, the processor 100 may be a central processing unit (CPU), but may also be another general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.

In one embodiment, the memory 101 may be an internal storage unit of the terminal device, such as a hard disk or a memory of the terminal device. The memory 101 may also be an external storage device of the terminal device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the terminal device. Further, the memory 101 may include both an internal storage unit and an external storage device of the terminal device. The memory 101 is used to store the computer program and other programs and data required by the terminal device. The memory 101 may also be used to temporarily store data that has been output or is to be output.

It will be appreciated by persons skilled in the art that the functional block diagram shown in fig. 4 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the terminal device to which the present inventive arrangements are applied, and that a particular terminal device may include more or fewer components than shown, or may combine some of the components, or may have a different arrangement of components.

Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program, which may be stored on a non-transitory computer-readable storage medium and which, when executed, may carry out the steps of the above-described method embodiments. Any reference to memory, storage, databases, or other media used in the various embodiments provided herein may include non-volatile and volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method for automatically labeling data, the method comprising:

2. The method for automatically labeling data according to claim 1, wherein the obtaining the data to be labeled comprises:

or selecting local import to obtain the data to be marked.

3. The method for automatically labeling data according to claim 1, wherein the configuring the labeling template based on the data to be labeled comprises:

4. The method for automatically labeling data according to claim 1, wherein the training the automatic labeling model based on the backbone model by using the manually labeled data to obtain a trained automatic labeling model comprises:

labeling a small number of data sets by using a manual labeling mode;

5. The method for automatically labeling data according to claim 1, further comprising:

6. An automatic data annotation system, the system comprising:

the data importing module is used for acquiring data to be marked;

the template configuration module is used for configuring an annotation template based on the data to be annotated;

the model training module is used for acquiring a pre-training model, carrying out knowledge distillation on the pre-training model to obtain an automatic labeling model, and training the automatic labeling model subjected to knowledge distillation by using manually labeled data to obtain a trained automatic labeling model;

and the automatic labeling module is used for automatically labeling data by using the trained automatic labeling model, manually auditing the automatic labeling result and storing the manual auditing result.

7. The automatic annotation system of claim 6, wherein the data import module comprises:

or,

8. The automatic annotation system of claim 6, wherein the template configuration module comprises:

9. A terminal device, characterized in that it comprises a memory, a processor and an automatic data labeling program stored in the memory and executable on the processor, said processor implementing the steps of the automatic data labeling method according to any of claims 1-5 when executing said automatic data labeling program.

10. A computer-readable storage medium, wherein a data automatic labeling program is stored on the computer-readable storage medium, and the data automatic labeling program, when executed by a processor, implements the steps of the data automatic labeling method according to any of claims 1-5.