CN108875769A - Data mask method, device and system and storage medium - Google Patents

Info

Publication number
CN108875769A
Authority
CN
China
Prior art keywords
data
unlabeled data
control
unlabeled
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810064918.0A
Other languages
Chinese (zh)
Inventor
谢津
周昕宇
张华翼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Megvii Technology Co Ltd
Beijing Maigewei Technology Co Ltd
Original Assignee
Beijing Maigewei Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Maigewei Technology Co Ltd filed Critical Beijing Maigewei Technology Co Ltd
Priority to CN201810064918.0A priority Critical patent/CN108875769A/en
Publication of CN108875769A publication Critical patent/CN108875769A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

Embodiments of the present invention provide a data annotation method, device, and system, and a storage medium. The data annotation method includes: obtaining a first number of unlabeled data items and their pre-annotation information, where the pre-annotation information is obtained by pre-annotating the first number of unlabeled data items with an annotation model and includes pre-annotation results; displaying the first number of unlabeled data items and their pre-annotation results in a display interface; receiving first feedback information from a user on the first number of unlabeled data items; and determining final annotation results for the first number of unlabeled data items according to the first feedback information. With the data annotation method, device, and system and the storage medium according to embodiments of the present invention, the data annotation system first pre-annotates the unlabeled data, and the unlabeled data and their pre-annotation results can be shown in the display interface, so the user only needs to correct the erroneous pre-annotation results. This can greatly improve annotation efficiency and reduce annotation cost.

Description

Data annotation method, device, and system, and storage medium
Technical field
The present invention relates to the field of computer technology, and more specifically to a data annotation method, device, and system, and a storage medium.
Background
As artificial intelligence has developed, the role of data has become increasingly prominent. Training a good neural network model usually requires data on the order of millions or even hundreds of millions of samples. The annotation cycle and cost of data directly affect the industrial competitiveness of an artificial intelligence company.
On current data annotation platforms, the annotation process is manual and item by item, and the annotation interface built on that process likewise annotates a single data point at a time. Such platforms have the following disadvantages: the annotation mode is manual item-by-item annotation of the data; the annotation cost is generally proportional to the size of the data set, so annotating a very large data set usually requires substantial manpower and a long annotation cycle.
Summary of the invention
The present invention is proposed in view of the above problem. It provides a data annotation method, device, and system, and a storage medium.
According to one aspect of the present invention, a data annotation method is provided. The data annotation method includes: obtaining a first number of unlabeled data items and their pre-annotation information, where the pre-annotation information is obtained by pre-annotating the first number of unlabeled data items with an annotation model and includes pre-annotation results; displaying the first number of unlabeled data items and their pre-annotation results in a display interface; receiving first feedback information from a user on the first number of unlabeled data items; and determining final annotation results for the first number of unlabeled data items according to the first feedback information.
Illustratively, the display interface includes an annotation area and a menu bar area. The first number of unlabeled data items are displayed in the annotation area. The menu bar area includes a mode control for indicating the annotation mode of the data in the annotation area, where the annotation mode is one or more of a high-probability mode, a high-similarity mode, and a boundary mode. Obtaining the first number of unlabeled data items and their pre-annotation information includes: determining the annotation mode selected by the user through the mode control; and, according to the selected annotation mode, pre-annotating a second number of unlabeled data items with the annotation model to obtain pre-annotation information for the second number of unlabeled data items, where the first number of unlabeled data items are at least part of the second number of unlabeled data items.
Illustratively, the mode control includes one or more of a high-probability control, a high-similarity control, and a boundary control arranged at different positions, which respectively indicate the high-probability mode, the high-similarity mode, and the boundary mode.
Illustratively, each of the one or more of the high-similarity control, the high-probability control, and the boundary control includes a positive-example control and a negative-example control. The positive-example control controls the display of unlabeled data belonging to positive examples under the corresponding annotation mode, and the negative-example control controls the display of unlabeled data belonging to negative examples under the corresponding annotation mode.
Illustratively, the mode control is a drop-down list control that provides drop-down list items corresponding to one or more of the high-probability mode, the high-similarity mode, and the boundary mode.
Illustratively, the pre-annotation information further includes a data score. Obtaining the first number of unlabeled data items and their pre-annotation information further includes: selecting, from the second number of unlabeled data items, those whose data score is greater than a first score threshold or less than a second score threshold as the first number of unlabeled data items; or selecting, from the second number of unlabeled data items, a preset number of items with the highest data scores as the first number of unlabeled data items.
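The two selection rules in the preceding paragraph can be sketched as follows. This is a minimal illustration only, not the patent's implementation: the record type PreAnnotated, the function names, the threshold values, and the sample scores are all assumptions made for the example, with the score read as the probability that an image contains a face.

```python
from typing import NamedTuple

class PreAnnotated(NamedTuple):
    item_id: str   # hypothetical identifier of an unlabeled item
    label: str     # pre-annotation result
    score: float   # data score assigned during pre-annotation

def select_by_threshold(items, high=None, low=None):
    """Keep items whose score is above `high` or below `low`."""
    return [
        it for it in items
        if (high is not None and it.score > high)
        or (low is not None and it.score < low)
    ]

def select_top_k(items, k):
    """Keep the k items with the highest data score, best first."""
    return sorted(items, key=lambda it: it.score, reverse=True)[:k]

# Hypothetical batch of pre-annotated items (the "second number").
batch = [
    PreAnnotated("a", "face", 0.97),
    PreAnnotated("b", "face", 0.55),
    PreAnnotated("c", "no_face", 0.08),
    PreAnnotated("d", "face", 0.81),
]
by_threshold = select_by_threshold(batch, high=0.9, low=0.1)
top_two = select_top_k(batch, 2)
```

Either result would then play the role of the first number of unlabeled data items that is shown to the user.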
Illustratively, the pre-annotation information further includes a data score, and in the display interface the first number of unlabeled data items are arranged according to their data scores.
Illustratively, the display interface includes a menu bar area, and the menu bar area includes a random control. The method further includes: when selection information for the random control is received, randomly selecting a third number of unlabeled data items from the unlabeled data set; displaying the third number of unlabeled data items in the display interface; receiving second feedback information from the user on the third number of unlabeled data items; and determining final annotation results for the third number of unlabeled data items according to the second feedback information.
Illustratively, the random control includes a positive-example control and a negative-example control. The positive-example control controls the display of unlabeled data belonging to positive examples in random mode, and the negative-example control controls the display of unlabeled data belonging to negative examples in random mode.
Illustratively, the display interface includes a menu bar area, and the menu bar area includes one or more of an export control, a generate-test-set control, and an initialize-model control. The export control is used to export at least part of the first number of unlabeled data items, together with their final annotation results, as a file in a predetermined format. The generate-test-set control is used to select a predetermined number of labeled data items from the labeled data set to obtain a test set, which is used to test the annotation accuracy of the annotation model. The initialize-model control is used to initialize the parameters of the annotation model.
Illustratively, the display interface further includes an information bar area, which includes areas for displaying one or more of sample information, statistical information, accuracy information, and shortcut key information. The sample information includes samples of unlabeled data belonging to the category currently to be annotated. The statistical information includes one or more of the number of labeled data items, the number of unlabeled data items, the number of labeled data items belonging to positive examples, and the number of labeled data items belonging to negative examples. The accuracy information indicates the accuracy of the annotation model. The shortcut key information indicates preset shortcut keys.
Illustratively, the pre-annotation information further includes a data score, and the display interface includes one or more of an invert control, a filter control, a filter threshold control, and a filter count control. The invert control is used to update the current annotation results of the unlabeled data shown in the display interface from positive to negative or from negative to positive. The filter control is used to filter out part of the unlabeled data pre-annotated by the annotation model, so that the remaining unlabeled data are displayed as the first number of unlabeled data items. The filter threshold control is used to set the score threshold for filtering unlabeled data out of the data pre-annotated by the annotation model. The filter count control is used to set the number of unlabeled data items filtered out of the data pre-annotated by the annotation model.
Illustratively, the filter threshold control is a slider control or an input control.
Illustratively, receiving the first feedback information from the user on the first number of unlabeled data items includes: receiving a toggle command for a specific unlabeled data item. Determining the final annotation results for the first number of unlabeled data items according to the first feedback information includes: updating the current annotation result of the specific unlabeled data item from positive to negative or from negative to positive, where the final annotation results of the first number of unlabeled data items are their current annotation results at the moment annotation ends.
Illustratively, the toggle command includes a left-mouse-button single-click operation on the display area where the specific unlabeled data item is located.
Illustratively, receiving the first feedback information from the user on the first number of unlabeled data items includes: receiving an invalidation command for a specific unlabeled data item. Determining the final annotation results for the first number of unlabeled data items according to the first feedback information includes: marking the specific unlabeled data item as invalid data to obtain its current annotation result, where the final annotation results of the first number of unlabeled data items are their current annotation results at the moment annotation ends.
Illustratively, the invalidation command includes a left-mouse-button double-click operation on the display area where the specific unlabeled data item is located.
Illustratively, the display interface includes a menu bar area, an information bar area, and an annotation area. The menu bar area is the upper area of the display interface, the information bar area is the left part of the lower area of the display interface, and the annotation area is the right part of the lower area.
According to another aspect of the present invention, a data annotation device is provided, including: an obtaining module for obtaining a first number of unlabeled data items and their pre-annotation information, where the pre-annotation information is obtained by pre-annotating the first number of unlabeled data items with an annotation model and includes pre-annotation results; a display module for displaying the first number of unlabeled data items and their pre-annotation results in a display interface; a receiving module for receiving first feedback information from a user on the first number of unlabeled data items; and a result determining module for determining final annotation results for the first number of unlabeled data items according to the first feedback information.
According to another aspect of the present invention, a data annotation system is provided, including a processor and a memory, where the memory stores computer program instructions that, when run by the processor, execute the above data annotation method.
According to another aspect of the present invention, a storage medium is provided. Program instructions are stored on the storage medium, and the program instructions, when run, execute the above data annotation method.
With the data annotation method, device, and system and the storage medium according to embodiments of the present invention, the data annotation system can first pre-annotate the unlabeled data, and the unlabeled data and their pre-annotation results can be shown in the display interface. The user only needs to correct the erroneous pre-annotation results, which can greatly improve annotation efficiency and reduce annotation cost.
Brief description of the drawings
The above and other objects, features, and advantages of the present invention will become more apparent from the more detailed description of embodiments of the present invention in conjunction with the accompanying drawings. The drawings provide a further understanding of the embodiments, constitute a part of the specification, and serve to explain the present invention together with the embodiments; they do not limit the present invention. In the drawings, the same reference numerals generally denote the same components or steps.
Fig. 1 shows a schematic block diagram of an example electronic device for implementing the data annotation method and device according to an embodiment of the present invention;
Fig. 2 shows a schematic flowchart of a data annotation method according to an embodiment of the present invention;
Fig. 3 shows a schematic diagram of a data annotation system according to an embodiment of the present invention;
Fig. 4 shows a schematic diagram of a display interface for displaying unlabeled data and their pre-annotation results according to an embodiment of the present invention;
Fig. 5 shows a schematic block diagram of a data annotation device according to an embodiment of the present invention; and
Fig. 6 shows a schematic block diagram of a data annotation system according to an embodiment of the present invention.
Detailed description of embodiments
To make the objects, technical solutions, and advantages of the present invention more apparent, example embodiments of the present invention are described in detail below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention, and it should be understood that the present invention is not limited by the example embodiments described herein. All other embodiments obtained by those skilled in the art from the embodiments described herein without creative effort shall fall within the protection scope of the present invention.
To solve the above problem, embodiments of the present invention provide a data annotation method, device, and system, and a storage medium. In the intelligent annotation platform provided by embodiments of the present invention, the data annotation system can learn on its own and select data for pre-annotation. The annotator (that is, the user) only needs to correct the erroneous annotations and no longer needs to annotate all data one by one. The data annotation method and device according to embodiments of the present invention can be applied to any field that requires data to be annotated, such as face annotation or card number annotation.
First, an example electronic device 100 for implementing the data annotation method and device according to an embodiment of the present invention is described with reference to Fig. 1.
As shown in Fig. 1, the electronic device 100 includes one or more processors 102 and one or more storage devices 104. Optionally, the electronic device 100 may also include an input device 106, an output device 108, and a data acquisition device 110, and these components are interconnected through a bus system 112 and/or other forms of connection mechanism (not shown). It should be noted that the components and structure of the electronic device 100 shown in Fig. 1 are illustrative rather than restrictive; the electronic device may have other components and structures as needed.
The processor 102 may be a central processing unit (CPU), a graphics processing unit (GPU), or another form of processing unit with data processing and/or instruction execution capability, and may control other components in the electronic device 100 to perform desired functions.
The storage device 104 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile and/or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and/or cache memory. The non-volatile memory may include, for example, read-only memory (ROM), a hard disk, and flash memory. One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 102 may run the program instructions to implement the client functions (implemented by the processor) in the embodiments of the present invention described below and/or other desired functions. Various application programs and various data, such as data used and/or generated by the application programs, may also be stored in the computer-readable storage medium.
The input device 106 may be a device used by the user to input instructions, and may include one or more of a keyboard, a mouse, a microphone, a touch screen, and the like.
The output device 108 may output various information (such as images and/or sounds) to the outside (for example, to the user) and may include a display. Optionally, the output device may also include a speaker or the like. Optionally, the input device 106 and the output device 108 may be integrated and implemented by the same interactive device (such as a touch screen).
The data acquisition device 110 may acquire the required data (including unlabeled data and labeled data) and store the acquired data in the storage device 104 for use by other components. Optionally, the data acquisition device 110 may be an image capture device such as a camera. Optionally, the data acquisition device 110 may be a wired or wireless communication device for obtaining the required data (including unlabeled data and labeled data) from an external device (a server or the cloud).
Illustratively, the example electronic device for implementing the data annotation method and device according to an embodiment of the present invention may be implemented on a device such as a personal computer or a remote server.
Next, a data annotation method according to an embodiment of the present invention is described with reference to Fig. 2. Fig. 2 shows a schematic flowchart of a data annotation method 200 according to an embodiment of the present invention. As shown in Fig. 2, the data annotation method 200 includes the following steps S210, S220, S230, and S240.
In step S210, a first number of unlabeled data items and their pre-annotation information are obtained. The pre-annotation information is obtained by pre-annotating the first number of unlabeled data items with an annotation model, and it includes pre-annotation results.
The data described herein (including unlabeled data and labeled data) may be any type of data, including but not limited to text, images, speech, and video. Note that the data described herein are countable. Illustratively, when the unlabeled data are images, one image can be regarded as one unlabeled data item. Illustratively, when the unlabeled data are videos, a video segment of a specific length can be regarded as one unlabeled data item.
The first number can be any suitable number and can be set as needed; the present invention does not limit this. For example, in a face annotation application, 1000 face images may be chosen from the unlabeled data set as the first number of unlabeled data items (a face image is an image related to face annotation, and may or may not contain a face). Terms such as first and second herein are used only for distinction and do not indicate order.
Illustratively, in a face annotation application, pre-annotation can mark which of the 1000 images contain a face and which do not. Whether each image contains a face is its pre-annotation result. That is, the pre-annotation result can be the classification result of the unlabeled data, and a classification result may also be called a label. Illustratively, the pre-annotation information can also include a data score. For example, the probability that each image contains a face can be regarded as its data score.
Fig. 3 shows a schematic diagram of a data annotation system according to an embodiment of the present invention. As shown in Fig. 3, the data annotation system may consist of the following components.
I. Data pool (Pool): contains the unlabeled data set U and the labeled data set L.
II. Agent: as the core of the system, the agent controls the training of the annotation model, selects unlabeled data for pre-annotation, and so on.
III. Annotation model (Model): trained using the data pool, and used to make predictions on unlabeled data. The annotation model may be trained by supervised learning, semi-supervised learning, or unsupervised learning. Model training can be independent of the annotation process and run continuously in the background.
In addition, the agent communicates with an external annotator (Inspector), who checks the pre-annotation results provided by the agent and corrects the erroneous ones. Note that the annotator can be a person or an inspection system implemented by a machine.
In Fig. 3, the workflow of the data annotation system is shown as steps 0-5, where step 0 is the model training process and steps 1-5 are the annotation process.
0. The Model is trained using data provided by the Agent; these data come from the Pool.
1. The Agent selects a batch of unlabeled data from the Pool.
2. The Agent passes the selected data to the Model for prediction. Example 1: the Model predicts the category of the unlabeled data, for example by outputting the probability that an unlabeled data item belongs to each predetermined category; the Agent can then pre-annotate the item with the most probable category. Example 2: the Model extracts data features from the unlabeled data; the Agent can then compute the similarity between an unlabeled data item and multiple labeled data items, and pre-annotate the unlabeled item with the category of the labeled item most similar to it.
The result provided by the Model can be any output that helps classify the data, such as the probability distribution output by the last layer of the Model or the output of some intermediate layer.
3. After obtaining the output of the Model, the Agent pre-annotates this batch of unlabeled data using one of the strategies below, assigns a data score (score) to each unlabeled data item, and outputs the higher-scoring portion of the data together with its pre-annotation results to the Inspector.
The Agent can use the following strategies:
High-probability strategy: according to the probability distribution output by the Model, the Agent takes the most probable category as the pre-annotation result of the unlabeled data item and the probability of that category as its data score.
High-similarity strategy: from the output of an intermediate layer of the Model, the Agent computes the distance from the unlabeled data item to each labeled data item, takes the category of the labeled item at the smallest distance as the pre-annotation result, and takes the negative of that minimum distance as the data score.
Boundary strategy: the Agent scores the classification of the data using strategies from active learning (Active Learning). These strategies include:
Uncertainty sampling: compute the classification uncertainty of each sample (for example, the entropy of the probability distribution P output by the Model) and use it as the data score.
Query by committee (Query by Committee): train several different sub-models at the same time; these sub-models "vote" on the category of an unlabeled data item, and the Agent measures the disagreement among the votes by some criterion and uses it as the data score.
4. The Inspector corrects the annotation errors and feeds the result back to the Agent.
5. The Agent puts the final annotation results into the Pool and updates U and L.
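One round of annotation steps 1-5 above can be sketched as follows. This is a hedged illustration under several assumptions: the class Pool, the function annotation_round, and the toy predict and inspect callables are hypothetical names introduced here, the batch is chosen at random, and step 0 (model training) is left out.

```python
import random

class Pool:
    """Data pool holding the unlabeled set U and the labeled set L."""
    def __init__(self, unlabeled):
        self.U = list(unlabeled)
        self.L = {}  # item -> final annotation result

def annotation_round(pool, predict, inspect, batch_size, rng):
    # Step 1: the agent picks a batch of unlabeled data from the pool.
    batch = rng.sample(pool.U, min(batch_size, len(pool.U)))
    # Steps 2-3: the model predicts; each prediction is a (label, score) pair.
    pre = {item: predict(item) for item in batch}
    # Step 4: the inspector receives the pre-annotations and returns corrections.
    final = inspect({item: label for item, (label, _score) in pre.items()})
    # Step 5: final results go into the pool; U and L are updated.
    for item, label in final.items():
        pool.L[item] = label
        pool.U.remove(item)
    return final

def toy_predict(x):
    # Hypothetical model that wrongly calls multiples of 3 "even".
    label = "even" if x % 2 == 0 or x % 3 == 0 else "odd"
    return label, 0.9

def toy_inspect(pre_labels):
    # Hypothetical inspector that returns the corrected labels.
    return {x: ("even" if x % 2 == 0 else "odd") for x in pre_labels}

pool = Pool(range(10))
annotation_round(pool, toy_predict, toy_inspect, batch_size=4, rng=random.Random(0))
```

Passing the model and the inspector in as callables reflects the point made earlier that the inspector may be a person or a machine-implemented checking system.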
Illustratively, the annotation model can be any suitable neural network model, such as a conventional convolutional neural network. Illustratively, the annotation model can be used to perform class prediction or feature extraction on each unlabeled data item, and the pre-annotation result of each item is determined from the class prediction or the extracted features.
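The scoring strategies described for the agent (high probability, high similarity, uncertainty sampling, and query by committee) can be sketched as follows. The function names are hypothetical, Euclidean distance is an assumed choice of distance, and vote entropy is only one possible disagreement criterion; the text does not fix either.

```python
import math
from collections import Counter

def high_probability(probs):
    """probs maps class -> probability. Returns (label, data score)."""
    label = max(probs, key=probs.get)
    return label, probs[label]

def high_similarity(feature, labeled):
    """labeled is a list of (feature_vector, label) pairs.

    Euclidean distance is an assumption; the score is the negative of
    the smallest distance, as in the high-similarity strategy.
    """
    dist, label = min((math.dist(feature, f), lab) for f, lab in labeled)
    return label, -dist

def entropy_score(probs):
    """Uncertainty sampling: entropy of the predicted distribution."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0)

def vote_entropy(votes):
    """Query by committee: entropy of the labels voted by the sub-models."""
    n = len(votes)
    return -sum((c / n) * math.log(c / n) for c in Counter(votes).values())

# Hypothetical model outputs for one unlabeled image.
probs = {"face": 0.8, "no_face": 0.2}
prob_result = high_probability(probs)
sim_result = high_similarity(
    (0.0, 0.0),
    [((3.0, 4.0), "face"), ((1.0, 0.0), "no_face")],
)
```

With the two boundary scores, a higher value means the model or the committee is less sure about the item, which is the data a human check is most likely to help with.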
In step S220, the first number of unlabeled data items and their pre-annotation results are displayed in the display interface.
The display interface is shown by a display device. Illustratively, the display device can be any of various displays, such as a liquid crystal display, an organic light-emitting display, or a cathode-ray tube (CRT) display.
In the display interface, the first number of unlabeled data items may be displayed all at once or in batches. When the first number of unlabeled data items are displayed in batches, or when other unlabeled data still need to be annotated after the first number of unlabeled data items, the remaining unlabeled data can be preloaded while the user is checking the unlabeled data currently shown in the display interface. This avoids making the user wait when checking subsequent unlabeled data and further speeds up annotation.
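The preloading idea can be sketched as follows, assuming a background thread is an acceptable way to load the next batch while the current one is being reviewed. The function name iter_batches_with_preload and the load_batch callable are hypothetical; the text does not say how preloading is implemented.

```python
from concurrent.futures import ThreadPoolExecutor

def iter_batches_with_preload(items, batch_size, load_batch):
    """Yield loaded batches in order, loading the next one in the background."""
    chunks = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    if not chunks:
        return
    with ThreadPoolExecutor(max_workers=1) as executor:
        pending = executor.submit(load_batch, chunks[0])
        for next_chunk in chunks[1:]:
            current = pending.result()
            # Start loading the next batch before handing over the current one.
            pending = executor.submit(load_batch, next_chunk)
            yield current
        yield pending.result()

def fake_load(chunk):
    # Hypothetical loader standing in for reading images from disk or network.
    return [x * 10 for x in chunk]

loaded = list(iter_batches_with_preload(list(range(5)), 2, fake_load))
```

While the caller works on one yielded batch, the single worker thread is already loading the following one, so the next batch is usually ready by the time it is requested.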
In the display interface, the pre-annotation results of the first number of unlabeled data items can be displayed in full or in part. In one example, the pre-annotation results of the first number of unlabeled data items are not all the same, for example some are the digit 1 and some are the digit 2; in that case the pre-annotation result of each unlabeled data item can be displayed. In another example, the pre-annotation results of the first number of unlabeled data items are all the same, for example all are the digit 1, so a single pre-annotation result (such as the text "1") can be displayed in the display interface instead of displaying a pre-annotation result beside each unlabeled data item. In addition, the pre-annotation results can be displayed directly or indirectly. For example, if the user wants to annotate the digit 1, the user can click a selection control related to the digit 1 shown in the display interface (such as a button labeled "Digit 1"), after which the display interface shows the unlabeled data items whose pre-annotation result is the digit 1. In this case, the selection control related to the digit 1 can be understood as an indirect way of displaying the pre-annotation results of the first number of unlabeled data items.
Fig. 4 shows a schematic diagram of a display interface for displaying unlabeled data and their pre-annotation results according to an embodiment of the present invention. As shown in Fig. 4, the display interface may include three parts: a menu bar area, an information bar area, and an annotation area. The menu bar area is the upper area of the display interface, the information bar area is the left part of the lower area of the display interface, and the annotation area is the right part of the lower area.
Note that the layout of the display interface described herein can be set as needed and is not limited to the layout shown in Fig. 4. That is, the display interface need not be divided in the manner shown in Fig. 4, and the positions of the menu bar area, information bar area, and annotation area, as well as the content shown in each area, need not be consistent with Fig. 4.
In the embodiment shown in Fig. 4, the first number of unlabeled data items are displayed in the annotation area. They are a number of images, and the purpose of annotation is to judge whether each image contains the digit 1.
In step S230, first feedback information from the user on the first number of unlabeled data items is received.
The first number of unlabeled data items can be shown in the display interface, for example the above 1000 images pre-annotated as containing the digit 1. The user can check these 1000 images. For an image whose pre-annotation result is wrong, for example one that contains not the digit 1 but the digit 7, the user can interact with the data annotation system (for example, the system implemented by the above electronic device 100) through an interactive device such as a mouse, keyboard, or touch screen to correct the wrong annotation. For example, the user can left-click the image whose pre-annotation is wrong so that its pre-annotation result is inverted; the inverted annotation result indicates that the image does not contain the digit 1. Of course, the user can also directly input the annotation result of the image through the interactive device, for example changing the annotation result to the digit 7.
The information that the user inputs to the data annotation system through the interactive device is feedback information (including the first feedback information and the second feedback information described below), and includes but is not limited to the above error correction information. For example, if the user considers the current annotation results of the first number of unlabeled data items to be correct, the user can click a selection control related to submission (such as a button labeled "Submit"). In this case, the first feedback information may include click operation information for the submission-related selection control.
In step S240, the final annotation results of the unlabeled data of the first number are determined according to the first feedback information.
If error correction information input by the user is received, the annotation result of the corresponding unlabeled data can be corrected. It can be understood that each unlabeled data can be corrected repeatedly, and a new annotation result is obtained after each correction. For convenience of description, the annotation result of an unlabeled data at the current time is referred to as its current annotation result. The current annotation result of an unlabeled data may be its pre-annotation result, or a new annotation result obtained after one or more corrections. Finally, when the user confirms that the labeling of the unlabeled data of the first number is complete (for example, the user clicks the "submit" control), the current annotation result of each unlabeled data at that time can be determined as the final annotation result of that unlabeled data.
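The relationship between pre-annotation results, current annotation results and final annotation results can be sketched as follows. This is only an illustrative sketch under simple assumptions: the class name `AnnotationSession`, its methods and the item identifiers are hypothetical and do not come from the embodiment, and annotation results are reduced to a boolean (contains the target class or not).

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class AnnotationSession:
    """Hypothetical holder for the current annotation results of one batch."""
    # item id -> current annotation result (True = contains the target class)
    current: Dict[str, bool] = field(default_factory=dict)
    final: Optional[Dict[str, bool]] = None

    def toggle(self, item_id: str) -> None:
        # A click inverts the current result; an item may be corrected repeatedly.
        self.current[item_id] = not self.current[item_id]

    def submit(self) -> Dict[str, bool]:
        # On "submit", the current results at that moment become the final results.
        self.final = dict(self.current)
        return self.final


# Both images start from the pre-annotation result "contains number 1".
session = AnnotationSession(current={"img_a": True, "img_b": True})
session.toggle("img_b")  # the user notices that img_b actually shows a 7
final_results = session.submit()
```

Only the items the user touches change; every other item keeps its pre-annotation result as its final annotation result.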
In the data labeling method according to an embodiment of the present invention, the data labeling system can first pre-annotate the unlabeled data, and the unlabeled data and its pre-annotation results can be shown in the display interface. The user only needs to change the pre-annotation results that are wrong, which can greatly improve labeling efficiency and reduce labeling cost.
Illustratively, the data labeling method according to an embodiment of the present invention can be implemented in a unit, device or system having a memory and a processor.
The data labeling method according to an embodiment of the present invention can be deployed at a personal terminal, such as a smart phone, a tablet computer or a personal computer.
Alternatively, the data labeling method according to an embodiment of the present invention can be deployed in a distributed manner at a server and a client. For example, the data to be labeled can be obtained at the client (such as acquiring face images at an image acquisition end), the client transmits the obtained data to the server (or cloud), and the server (or cloud) performs the data labeling.
According to another embodiment of the present invention, the display interface may include a tab area and a menu bar region, the unlabeled data of the first number may be displayed in the tab area, and the menu bar region may include a mode control for indicating the annotation mode of the data in the tab area, the annotation mode being one or more of a high probability mode, a high similarity mode and a boundary mode. Step S210 may include: determining the annotation mode selected by the user through the mode control; and, according to the selected annotation mode, pre-annotating the unlabeled data of the second number using the marking model to obtain the pre-annotation information of the unlabeled data of the second number, where the unlabeled data of the first number is at least part of the unlabeled data of the second number.
The unlabeled data of the second number is the data initially obtained from the unlabeled data set. At least part of the unlabeled data (i.e. the unlabeled data of the first number) can be selected from the unlabeled data of the second number, and the selected unlabeled data and its pre-annotation results can be output to a display device for display. In the case where the display interface includes a filter control as described below, the filter control can be used to select at least part of the unlabeled data from the unlabeled data of the second number as the unlabeled data of the first number. The selection can be made according to the class indicated by the pre-annotation result of the unlabeled data and/or the data score of the unlabeled data. For example, in a digit labeling application, the unlabeled data of the second number may be 10000 images, each containing one digit, which may be any of 0 to 9. Of course, some images may contain no digit. Through pre-annotation, the pre-annotation result of each image is obtained, i.e. which digit each image contains. If number 1 currently needs to be labeled, the images whose pre-annotation result is number 1 can be selected from the 10000 images and output to the display device for display. Assuming that 900 images are pre-annotated as containing number 1, the first number is 900.
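Selecting the unlabeled data of the first number by pre-annotation class can be illustrated with a few lines of Python. This is a minimal sketch; the function name `select_by_class`, the dictionary layout and the use of `None` for an image with no digit are assumptions made for illustration only.

```python
def select_by_class(pre_annotations, target_class):
    """Return the ids of items whose pre-annotation result equals target_class."""
    return [item_id for item_id, label in pre_annotations.items()
            if label == target_class]


# Hypothetical pre-annotation results: item id -> predicted digit (None = no digit).
pre_annotations = {"img_0": 1, "img_1": 7, "img_2": 1, "img_3": None}
first_batch = select_by_class(pre_annotations, target_class=1)
```

Here the second number is 4 and the first number is the length of `first_batch`.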
The mode control may be a button control. As shown in Fig. 4, the left half of the menu bar region includes four kinds of controls, "random", "high similarity", "high probability" and "boundary", and each kind is further divided into "positive" and "negative". "Random", "high similarity", "high probability" and "boundary" correspond to four different annotation modes, referred to herein as the random mode, the high similarity mode, the high probability mode and the boundary mode. In Fig. 4, the currently selected control is the positive example control under the high probability mode.
Among the above four annotation modes, the high similarity mode, the high probability mode and the boundary mode correspond to different pre-annotation strategies. When the user clicks a mode control to select the corresponding annotation mode, the data labeling system can pre-annotate the unlabeled data according to the annotation mode selected by the user.
According to another embodiment of the present invention, pre-annotating the unlabeled data of the second number using the marking model according to the selected annotation mode, to obtain the pre-annotation information of the unlabeled data of the second number, may include: in the case where the annotation mode selected by the user is the high probability mode or the boundary mode, for each unlabeled data in the unlabeled data of the second number, inputting the unlabeled data into the marking model for class prediction, the output of the marking model indicating the probability that the unlabeled data belongs to each of at least one predetermined class; and determining the predetermined class with the largest probability among the at least one predetermined class as the pre-annotation result of the unlabeled data.
Illustratively, the marking model can be used to predict the class of the input unlabeled data. At its last layer (i.e. the output layer), the marking model can output the probabilities that the unlabeled data belongs to a variety of different predetermined classes (i.e. a probability distribution), and the predetermined class with the highest probability can then be taken as the pre-annotation result of the unlabeled data. For example, in a digit labeling application, when an image is input into the marking model, the marking model can output an 11-dimensional vector whose 11 values respectively represent the probabilities that the image contains each of the digits 0 to 9 or belongs to the "other" class (i.e. a class other than 0 to 9). The class with the largest probability for each unlabeled data can be determined from the output of the marking model. For example, if, after an image is input into the marking model, the output indicates that the probability of the image containing number 5 is the largest, the pre-annotation result of the image can be determined to be number 5. The above digit labeling example is a multi-class problem. Although some embodiments herein are described using multi-class problems as examples, it can be understood that two-class problems are equally applicable. In addition, those skilled in the art will understand that a multi-class problem can itself be decomposed into multiple two-class problems. For example, in a digit labeling application, the classification over 0 to 9 can be divided into 10 separate two-class problems. For example, the marking model may include multiple submodels, where the first submodel is mainly used to judge whether an image contains number 1, the second submodel is mainly used to judge whether an image contains number 2, and so on. In this case, the output of the first submodel may only indicate the probability that the image contains number 1, the output of the second submodel may only indicate the probability that the image contains number 2, and so on.
In the case where the annotation mode is the high probability mode or the boundary mode, the pre-annotation result of the unlabeled data can be determined in the manner described in the present embodiment.
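Picking the predetermined class with the largest probability from the 11-dimensional output can be sketched as below. The class list `CLASSES`, the function name `pre_annotate` and the probability values are illustrative assumptions; the marking model itself is not shown, only the step applied to its output.

```python
# Assumed class order: digits 0 to 9 followed by an "other" class.
CLASSES = [str(d) for d in range(10)] + ["other"]


def pre_annotate(probabilities):
    """Return (class, probability) for the most probable predetermined class."""
    if len(probabilities) != len(CLASSES):
        raise ValueError("expected one probability per predetermined class")
    best = max(range(len(CLASSES)), key=lambda i: probabilities[i])
    return CLASSES[best], probabilities[best]


# Made-up model output for one image; the largest entry is at index 5.
probs = [0.01, 0.02, 0.01, 0.03, 0.02, 0.80, 0.03, 0.02, 0.03, 0.02, 0.01]
label, score = pre_annotate(probs)
```

The returned probability can also serve as the data score used in the high probability mode.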
According to another embodiment of the present invention, pre-annotating the unlabeled data of the second number using the marking model according to the selected annotation mode, to obtain the pre-annotation information of the unlabeled data of the second number, may include: in the case where the annotation mode selected by the user is the high similarity mode, for each unlabeled data in the unlabeled data of the second number, inputting the unlabeled data into the marking model to extract the data feature of the unlabeled data; calculating the similarity between the unlabeled data and at least one labeled data in the labeled data set according to the data feature of the unlabeled data and the data feature of the at least one labeled data; and determining the class of the labeled data with the largest similarity to the unlabeled data as the pre-annotation result of the unlabeled data.
Illustratively, the similarity between the input unlabeled data and multiple labeled data can be calculated using the marking model, and the class of the labeled data with the highest similarity to the unlabeled data can then be taken as the pre-annotation result of the unlabeled data.
In the present embodiment, the marking model can be a model used to predict the class of the input unlabeled data. In this case, the data feature can be the output of an intermediate layer of the marking model (such as the layer preceding the softmax layer). For example, assuming the unlabeled data is an image, the data feature can be the feature map output by the last convolutional layer of the marking model.
The similarity between two data can be measured with a distance such as the Euclidean distance. The smaller the distance between the data features of two data, the greater the similarity between them. The similarity between an unlabeled data and each labeled data can be calculated according to the distance between the data feature of the unlabeled data and the data feature of that labeled data. Those skilled in the art can understand how similarity is calculated, which is not described in detail here.
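The high similarity mode can be sketched as a nearest-neighbour lookup over data features. This is an illustrative sketch, not the embodiment's implementation: the feature vectors are made up, the function names are hypothetical, and the mapping from distance to similarity, 1 / (1 + distance), is an assumption chosen only so that a smaller distance gives a larger similarity.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equally long feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pre_annotate_by_similarity(feature, labeled):
    """labeled: list of (feature_vector, class). Returns (class, similarity)."""
    best_feature, best_class = min(
        labeled, key=lambda pair: euclidean(feature, pair[0]))
    # Assumed similarity measure: decreases monotonically with distance.
    similarity = 1.0 / (1.0 + euclidean(feature, best_feature))
    return best_class, similarity


# Two labeled items with 2-dimensional features (made-up values).
labeled = [([0.0, 0.0], "1"), ([3.0, 4.0], "7")]
label, similarity = pre_annotate_by_similarity([0.0, 1.0], labeled)
```

In practice the feature vectors would come from an intermediate layer of the marking model rather than being written by hand.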
According to another embodiment of the present invention, the mode control may include one or more of a high probability control, a high similarity control and a boundary control arranged at different positions, the high probability control, the high similarity control and the boundary control being respectively used to indicate the high probability mode, the high similarity mode and the boundary mode.
Referring to Fig. 4, the random control, the high probability control, the high similarity control and the boundary control are shown as four different button controls located at different positions, each controlling the selection of the corresponding annotation mode.
According to another embodiment of the present invention, each of the one or more of the high similarity control, the high probability control and the boundary control may include a positive example control and a negative example control, the positive example control being used to control the display of the unlabeled data belonging to the positive example under the corresponding annotation mode, and the negative example control being used to control the display of the unlabeled data belonging to the negative example under the corresponding annotation mode.
As shown in Fig. 4, the random control, the high similarity control, the high probability control and the boundary control are each divided into two controls, i.e. a positive example control and a negative example control; therefore, the display interface shown in Fig. 4 includes eight button controls related to the annotation mode in total. However, the kinds and number of mode controls shown in Fig. 4 are only an example and do not limit the present invention. For example, any of the random control, the high similarity control, the high probability control and the boundary control may include only a positive example control, without being further divided into two controls.
Positive example (Positive) as described herein refers to the case where the annotation result is the specified class, and negative example (Negative) refers to the case where the annotation result is not the specified class. Positive and negative examples can themselves be understood as two different classes. For example, in a face labeling application, if the user selects the positive example control, the data labeling system can output the unlabeled data (such as face images) whose pre-annotation result is "contains a face"; conversely, if the user selects the negative example control, the data labeling system can output the unlabeled data whose pre-annotation result is "does not contain a face". Further, if the user selects the positive example control under the high probability mode, the data labeling system can pre-annotate the unlabeled data of the second number according to the high probability mode (see the description above) and judge whether each unlabeled data contains a face. In addition, illustratively, the data labeling system can show the unlabeled data of the first number in descending order of the probability of containing a face. Conversely, if the user selects the negative example control under the high probability mode, the data labeling system can likewise pre-annotate the unlabeled data of the second number according to the high probability mode and judge whether each unlabeled data contains a face. In addition, illustratively, the data labeling system can show the unlabeled data of the first number in descending order of the probability of not containing a face (that is, in ascending order of the probability of containing a face).
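The effect of the positive and negative example controls under the high probability mode can be sketched as follows. The function name `items_for_control` and the 0.5 cut-off that separates "contains a face" from "does not contain a face" are assumptions for illustration; the embodiment only states that the pre-annotation result decides which items are shown and that they may be ordered by probability.

```python
def items_for_control(face_prob, positive, threshold=0.5):
    """face_prob: item id -> probability of containing a face.

    Returns the ids to display, most confident pre-annotations first.
    """
    if positive:
        chosen = [i for i, p in face_prob.items() if p >= threshold]
        # Descending probability of containing a face.
        return sorted(chosen, key=lambda i: face_prob[i], reverse=True)
    chosen = [i for i, p in face_prob.items() if p < threshold]
    # Ascending probability of containing a face, i.e. descending
    # probability of not containing a face.
    return sorted(chosen, key=lambda i: face_prob[i])


face_prob = {"a": 0.9, "b": 0.2, "c": 0.6, "d": 0.05}
positive_view = items_for_control(face_prob, positive=True)
negative_view = items_for_control(face_prob, positive=False)
```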
According to another embodiment of the present invention, the mode control may be a drop-down list control, which provides drop-down list items corresponding to one or more of the high probability mode, the high similarity mode and the boundary mode.
The implementation of the mode control shown in Fig. 4 is only an example and not a limitation; the mode control can have other suitable implementations, for example a drop-down list control. When the user clicks the drop-down list control, it can extend downward and show multiple drop-down list items. The user can then click any drop-down list item to select the required annotation mode.
According to another embodiment of the present invention, the pre-annotation information may also include a data score, and step S210 may also include: selecting, from the unlabeled data of the second number, the unlabeled data whose data score is greater than a first score threshold or less than a second score threshold as the unlabeled data of the first number; or selecting, from the unlabeled data of the second number, a preset number of unlabeled data with the highest data scores as the unlabeled data of the first number.
Illustratively, in the case where the annotation mode selected by the user is the high probability mode, for each unlabeled data in the unlabeled data of the first number, the probability that the unlabeled data belongs to the current class to be labeled is the data score of the unlabeled data. For example, in a digit labeling application, if number 1 currently needs to be labeled and the user has selected the positive example control under the high probability mode, the current class to be labeled is the positive example of number 1, i.e. the class "contains number 1". Conversely, if number 1 currently needs to be labeled and the user has selected the negative example control under the high probability mode, the current class to be labeled is the negative example of number 1, i.e. the class "does not contain number 1".
Illustratively, in the case where the annotation mode selected by the user is the high similarity mode, for each unlabeled data in the unlabeled data of the first number, the similarity between the unlabeled data and the labeled data belonging to the current class to be labeled is the data score of the unlabeled data.
For example, the data score can be used to measure the accuracy of the pre-annotation result of the corresponding unlabeled data. For example, the data score of an unlabeled data can be the probability that the unlabeled data belongs to a certain class, or the similarity between the unlabeled data and labeled data belonging to a certain class. For example, assuming that the probability that an image contains number 1 is 0.7, the probability that it contains number 7 is 0.2, and the total probability that it contains other numbers is 0.1, the data score of this image can be considered to be 0.7.
Illustratively, in the case where the annotation mode selected by the user is the boundary mode, for each unlabeled data in the unlabeled data of the first number, the classification uncertainty or the classification disagreement degree of the unlabeled data is the data score of the unlabeled data.
For example, the data score can be used to measure the labeling value of the corresponding unlabeled data. For example, the data score of an unlabeled data can be its classification uncertainty or classification disagreement degree. The higher the classification uncertainty or classification disagreement degree of an unlabeled data, the harder it is to classify. Such data can be understood as lying on a classification boundary and being difficult to assign to a predetermined class. For the user, the pre-annotation results of such data are less accurate, so, compared with sorting by high probability or high similarity, the user needs to spend more time checking whether the data is labeled correctly. However, such hard-to-classify data often carries more information and is very helpful for training the marking model. Therefore, the higher the classification uncertainty or classification disagreement degree of an unlabeled data, the greater its labeling value can be considered to be. Illustratively, the unlabeled data can be sorted in descending order of classification uncertainty or classification disagreement degree and shown in the sorted order.
Illustratively, the classification uncertainty of an unlabeled data can be the entropy of the probabilities that the unlabeled data belongs to the at least one predetermined class. The implementation of class prediction for unlabeled data using the marking model has been described above and is not repeated here. As described above, the marking model can output a probability distribution, such as the above 11-dimensional vector. The entropy of these probabilities can be calculated as the classification uncertainty of the unlabeled data.
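The entropy of a probability distribution $p_1, \dots, p_K$ is $H = -\sum_k p_k \log p_k$, which is small when one class dominates and large when the probabilities are spread out. A minimal sketch of this calculation is given below; the function name `entropy` and the use of the natural logarithm are illustrative choices, and the probability values are made up.

```python
import math


def entropy(probabilities):
    """Shannon entropy (natural log) of a discrete probability distribution."""
    # Terms with p == 0 contribute nothing and are skipped to avoid log(0).
    return -sum(p * math.log(p) for p in probabilities if p > 0)


confident = entropy([0.98, 0.01, 0.01])  # model is nearly certain
ambiguous = entropy([0.4, 0.3, 0.3])     # item lies near a class boundary
```

Under the boundary mode, the item with the larger entropy would be shown earlier.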
Illustratively, the marking model may include multiple submodels. Several different submodels can be trained at the same time, and these submodels "vote" on the class of a given unlabeled data. The degree of disagreement (Disagreement) among these votes can be measured using a certain criterion and used as the data score.
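The embodiment does not name the criterion used to measure disagreement. One common choice, assumed here purely for illustration, is the vote entropy, i.e. the entropy of the distribution of votes cast by the submodels. The function name `vote_entropy` is hypothetical.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Entropy of the vote distribution; 0 when all submodels agree."""
    counts = Counter(votes)
    total = len(votes)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


unanimous = vote_entropy(["1", "1", "1"])  # all three submodels vote "1"
split = vote_entropy(["1", "7", "1"])      # one submodel disagrees
```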
When selecting the unlabeled data of the first number, the unlabeled data whose data score is greater than a predetermined threshold (i.e. the first score threshold) can be selected, or the unlabeled data whose data score is less than a predetermined threshold (i.e. the second score threshold) can be selected. Assume that the data score of an unlabeled data is the probability that it belongs to a certain class. For example, in a face labeling application, the images whose probability of being a face is greater than (or less than) 0.6 can be selected from the unlabeled data of the second number and shown as the unlabeled data of the first number. As another example, the 200 images with the highest (or lowest) probability of being a face can be selected from the unlabeled data of the second number and shown as the unlabeled data of the first number.
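The two selection strategies, by score threshold and by a preset number of top-scoring items, can be sketched as follows. The function names and the score values are illustrative assumptions.

```python
def select_above(scores, threshold):
    """Ids whose data score is greater than the (first) score threshold."""
    return [i for i, s in scores.items() if s > threshold]


def select_below(scores, threshold):
    """Ids whose data score is less than the (second) score threshold."""
    return [i for i, s in scores.items() if s < threshold]


def select_top_k(scores, k):
    """The k ids with the highest data scores, highest first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical data scores: probability that each image is a face.
scores = {"a": 0.9, "b": 0.55, "c": 0.7, "d": 0.1}
likely_faces = select_above(scores, 0.6)
unlikely_faces = select_below(scores, 0.6)
top_two = select_top_k(scores, 2)
```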
According to another embodiment of the present invention, the pre-annotation information further includes a data score, and in the display interface the unlabeled data of the first number is arranged according to the data scores of the unlabeled data of the first number.
The calculation of the data score has been described above and is not repeated here. Illustratively, if the user wants to label number 1, the images can be shown on the display device in descending order of the probability of containing number 1. The images at the front all have high probability, so their pre-annotation results are less likely to be wrong; the images toward the back have lower probability, so their pre-annotation results are more likely to be wrong. The user can focus on checking whether the later images really contain number 1 and correct those that do not. It can be seen that sorting by probability helps the user check quickly. It can be understood that the images can also be shown in ascending order of the probability of containing number 1. That is, when sorting and showing unlabeled data by data score, unlabeled data with high or low data scores can be shown first as needed.
According to an embodiment of the present invention, the display interface may include a menu bar region, and the menu bar region may include a random control. The data labeling method 200 may also include: when selection information for the random control is received, randomly selecting a third number of unlabeled data from the unlabeled data set; showing the unlabeled data of the third number in the display interface; receiving second feedback information from the user on the unlabeled data of the third number; and determining the final annotation results of the unlabeled data of the third number according to the second feedback information.
If the annotation mode selected by the user is the random mode, some unlabeled data can be randomly selected from the unlabeled data set and shown, to be labeled by the user. For example, if the user instructs the data labeling system through the interactive device to perform face labeling and has selected the positive example control under the random mode, several (for example, 1000) images can be randomly selected from the unlabeled image collection and shown in the display interface. The user can pick out the images that do not contain a face and input error correction information to the data labeling system through the interactive device, so that the annotation results of these images are corrected to "does not contain a face". The remaining images, which the user did not correct, are regarded by the data labeling system as containing a face by default.
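The random mode with the positive example control can be sketched as follows: a batch is drawn at random, every item defaults to positive, and only the items the user corrects become negative. The function names, the `seed` parameter and the item identifiers are hypothetical and used only for illustration.

```python
import random


def start_random_batch(unlabeled_ids, batch_size, seed=None):
    """Draw a random batch; every item defaults to the positive example."""
    rng = random.Random(seed)
    batch = rng.sample(unlabeled_ids, min(batch_size, len(unlabeled_ids)))
    return {item_id: True for item_id in batch}


def finalize(batch, corrected_ids):
    """Items the user corrected become negative; the rest stay positive."""
    return {item_id: item_id not in corrected_ids for item_id in batch}


pool = [f"img_{n}" for n in range(10)]
batch = start_random_batch(pool, batch_size=4, seed=0)
no_face = next(iter(batch))          # suppose the user flags this one image
final_results = finalize(batch, {no_face})
```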
According to an embodiment of the present invention, the random control may include a positive example control and a negative example control, the positive example control being used to control the display of the unlabeled data belonging to the positive example under the random mode, and the negative example control being used to control the display of the unlabeled data belonging to the negative example under the random mode.
The function of the positive example control and the negative example control has been described above; the present embodiment can be understood with reference to that description, and details are not repeated here.
According to another embodiment of the present invention, the display interface may include a menu bar region, and the menu bar region may include one or more of an export control, a generate test set control and an initialize model control. The export control is used to control exporting at least part of the unlabeled data of the first number, together with its final annotation results, as a file of a predetermined format. The generate test set control is used to control selecting a predetermined number of labeled data from the labeled data set to obtain a test set, the test set being used to test the labeling accuracy of the marking model. The initialize model control is used to control initializing the parameters of the marking model.
For example, the data labeling method 200 may also include: when selection information for the export control is received, exporting the unlabeled data of a fourth number and the final annotation results of the unlabeled data of the fourth number as a file of a predetermined format, where the unlabeled data of the fourth number is at least part of the unlabeled data of the first number.
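The export step can be sketched with the Python standard library. The embodiment only requires a file of some predetermined format; CSV, the column names and the function name `export_csv` are assumptions chosen for illustration, and an in-memory buffer stands in for a real file.

```python
import csv
import io


def export_csv(final_results, stream, positives_only=True):
    """Write item ids and final annotation results to a CSV stream."""
    writer = csv.writer(stream)
    writer.writerow(["item", "final_result"])
    for item_id, is_positive in final_results.items():
        if positives_only and not is_positive:
            continue  # negative examples are optionally left out of the export
        writer.writerow([item_id, "positive" if is_positive else "negative"])


buffer = io.StringIO()
export_csv({"img_a": True, "img_b": False}, buffer)
exported = buffer.getvalue()
```

Passing `positives_only=False` exports every item in the batch instead of only the positive examples.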
For example, when selection information for the generate test set control is received, a predetermined number of labeled data is selected from the labeled data set to obtain a test set, the test set being used to test the labeling accuracy of the marking model.
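Generating a test set and using it to measure labeling accuracy can be sketched as below. Several details are assumptions rather than statements of the embodiment: the test set is drawn by uniform random sampling, the marking model is represented by a plain `predict` callable, and the function names are hypothetical. The 98% default mirrors the example threshold mentioned elsewhere in this description.

```python
import random


def generate_test_set(labeled, size, seed=None):
    """labeled: list of (item, label) pairs. Draw `size` pairs at random."""
    return random.Random(seed).sample(labeled, min(size, len(labeled)))


def accuracy(predict, test_set):
    """Fraction of test items whose predicted label matches the known label."""
    if not test_set:
        raise ValueError("test set must not be empty")
    correct = sum(1 for item, label in test_set if predict(item) == label)
    return correct / len(test_set)


def can_auto_annotate(predict, test_set, threshold=0.98):
    """True when accuracy exceeds the preset threshold."""
    return accuracy(predict, test_set) > threshold


# Toy labeled set: the "label" of an integer is whether it is even.
labeled = [(n, n % 2 == 0) for n in range(200)]
test_set = generate_test_set(labeled, size=50, seed=1)
perfect_model = lambda n: n % 2 == 0
auto_ok = can_auto_annotate(perfect_model, test_set)
```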
For example, when selection information for the initialize model control is received, the parameters of the marking model are initialized.
As shown in Fig. 4, three controls, "export", "generate test set" and "initialize model", are shown in the right part of the menu bar region. In Fig. 4 these three controls are shown as button controls, which is only an example and not a limitation.
The "export" control is used to control the export of the unlabeled data of the fourth number and its final annotation results. The unlabeled data of the fourth number can be the unlabeled data shown in the display interface at the current time, or it can be the unlabeled data, among the unlabeled data of the first number, whose final annotation result indicates that it belongs to the positive example (such as containing number 1). Invalid data and/or unlabeled data belonging to the negative example (such as not containing number 1), together with its final annotation results, need not be exported. Of course, optionally, all of the unlabeled data of the first number and its final annotation results can be exported. The unlabeled data and its final annotation results can be exported as a file of any suitable format, such as a text document or a spreadsheet document.
The "generate test set" control is used to control the generation of a test set. When the user clicks the control through the interactive device, the data labeling system can automatically generate a test set. The test set includes several labeled data whose actual annotation results are known. The marking model can be used to label the labeled data in the test set to obtain test annotation results. By comparing the test annotation results with the actual annotation results, the labeling accuracy of the marking model can be determined. When the labeling accuracy of the marking model exceeds a preset threshold (such as 98%), the marking model can be used to label unlabeled data automatically. That is, in this case, the annotation results obtained by the marking model for the unlabeled data (i.e. the above pre-annotation results) no longer need to be handed to the user for checking. Doing so can further improve data labeling efficiency and save labeling cost.
The marking model is trained continuously, and the accuracy of pre-annotation using the marking model can improve continuously as labeling proceeds; therefore, after labeling has progressed to a certain extent, the pre-annotation results obtained usually contain only a few wrong labels.
The "initialize model" control is used to control the initialization of the marking model. During the labeling of unlabeled data, the marking model can be trained using the labeled data set and/or the unlabeled data set in the data pool. However, various problems may arise during training, such as overfitting or underfitting, which may make the marking model progressively worse and the pre-annotation error rate progressively higher. The user can use the initialize model control at any time to initialize the parameters of the marking model, returning it to its initial state so that training restarts.
Another embodiment according to the present invention, display interface can also include information bar region, and information bar region may include For showing one or more regions in sample information, statistical information, accuracy rate information and shortcut key information, wherein sample Example information may include the sample for belonging to the unlabeled data of current classification to be marked;Statistical information may include labeled data Number, the number of unlabeled data, belong to positive example labeled data number, belong to negative example labeled data number In it is one or more;Accuracy rate information is used to indicate the accuracy rate of marking model;Shortcut key information is used to indicate preset fast Prompt key.
Referring to fig. 4, left side information bar region the top, show four samples (its be four images).Information Four samples shown in column region are the sample for belonging to the unlabeled data of current classification to be marked.Due to the mode control of selection For the positive example control under high probability mode, therefore current classification to be marked is digital 1 positive example.User can refer to sample information Judge in unlabeled data which belongs to current classification to be marked, which is not belonging to current classification to be marked.
In the intermediate region in information bar region, statistical information is shown.As shown in figure 4, statistical information includes " not yet marking Note ", " mark ", " positive example quantity ", " negative number of cases amount " these four information respectively indicate under current time labeled data Number, the number of unlabeled data, belong to positive example labeled data number, belong to negative example labeled data number.
Shortcut key information is displayed at the bottom of the information bar region. For example, the submit function can be triggered with the space bar, and the jump-up and jump-down functions can be triggered with the w key and the x key, respectively. Jump-up means moving the cursor from the current unlabeled data to the previous unlabeled data in the same column, and jump-down means moving the cursor from the current unlabeled data to the next unlabeled data in the same column. The shortcut keys are not introduced one by one here; those skilled in the art can understand the shortcut key information by reading this description and Fig. 4.
In Fig. 4, the accuracy information is displayed in the menu bar region. It will be appreciated that Fig. 4 is only an example, and the accuracy information may instead be displayed in the information bar region. Referring to Fig. 4, the right part of the menu bar region shows two items, "training error rate" and "Top1% error rate", both of which reflect the accuracy of the marking model. The lower the "training error rate" and the "Top1% error rate", the higher the accuracy of the marking model.
Illustratively, the display interface may include an invert control. The invert control is used to control updating the current annotation results of the unlabeled data displayed in the display interface from positive example to negative example or from negative example to positive example. For example, step S240 may include: when selection information for the invert control is received, updating the current annotation results of all unlabeled data displayed in the display interface from positive example to negative example or from negative example to positive example, wherein the final annotation results of the first number of unlabeled data are the current annotation results of the first number of unlabeled data at the moment annotation ends. Referring to Fig. 4, an "Invert" control is displayed in the upper-left corner of the annotation region. Clicking this control inverts the current annotation results of the unlabeled data in the entire display interface, for example from "contains the digit 1" to "does not contain the digit 1". User oversight, systematic error or other causes may sometimes give a large amount of unlabeled data annotation results opposite to their actual categories. Inverting the annotation results of a large amount of unlabeled data at once with the invert control handles such mislabeling quickly, without the user having to correct the items one by one.
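The effect of the invert control can be illustrated with a short sketch. It assumes the current annotation results are held in a dictionary mapping an item identifier to the string "positive" or "negative"; the function name `invert_all` and this representation are illustrative assumptions, not part of the original disclosure.

```python
def invert_all(current_results):
    """Return a copy of the results with every positive/negative label flipped.

    Any other value (for example "invalid") is left unchanged.
    """
    flipped = {"positive": "negative", "negative": "positive"}
    return {item: flipped.get(label, label) for item, label in current_results.items()}
```

Applying the function twice restores the original results, which matches the behaviour a user would expect from clicking the control twice.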
Illustratively, the display interface may include one or more of a filter control, a filter threshold control and a filter count control. The filter control is used to control filtering out part of the unlabeled data pre-annotated by the marking model, so that the remaining unlabeled data is displayed as the first number of unlabeled data. The filter threshold control is used to control the score threshold used for filtering unlabeled data out of the unlabeled data pre-annotated by the marking model. The filter count control is used to control the number of unlabeled data filtered out of the unlabeled data pre-annotated by the marking model.
For example, step S210 may include: if the filter control is in the folded state, selecting, from the unlabeled data pre-annotated by the marking model, unlabeled data whose data score is greater than a third score threshold or less than a fourth score threshold as the first number of unlabeled data, or selecting, from the unlabeled data pre-annotated by the marking model, the unlabeled data other than a predetermined number of unlabeled data with the highest or lowest data scores as the first number of unlabeled data; and if the filter control is in the unfolded state, taking the unlabeled data pre-annotated by the marking model as the first number of unlabeled data.
For example, the data annotation method 200 may further include: when operation information for the filter threshold control is received, determining the third score threshold or the fourth score threshold according to the operation state of the filter threshold control.
For example, the data annotation method 200 may further include: when operation information for the filter count control is received, determining the predetermined number according to the operation state of the filter count control.
The marking model can be used to pre-annotate a number of unlabeled data (for example, the second number of unlabeled data mentioned above). Then, as indicated by the filter control together with the filter threshold control or the filter count control, at least part of the pre-annotated unlabeled data is selected and displayed as the first number of unlabeled data, while the remaining unlabeled data is filtered out and not displayed.
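The filtering logic of step S210 can be sketched as follows. This is only an illustration under stated assumptions: pre-annotated data is represented as a list of (item, score) pairs, and the function name `select_for_display` and its parameter names are hypothetical. When the filter control is unfolded, nothing is filtered; when it is folded, either the score thresholds or the hide count decides what remains.

```python
def select_for_display(scored, filter_folded, third_threshold=None,
                       fourth_threshold=None, hide_count=None, hide_highest=True):
    """Pick the pre-annotated items to display (the "first number" of data).

    scored -- list of (item_id, data_score) pairs produced by pre-annotation
    """
    if not filter_folded:
        # Unfolded state: every pre-annotated item is displayed.
        return list(scored)
    if third_threshold is not None or fourth_threshold is not None:
        # Keep items scoring above the third or below the fourth threshold.
        return [
            (item, score) for item, score in scored
            if (third_threshold is not None and score > third_threshold)
            or (fourth_threshold is not None and score < fourth_threshold)
        ]
    if hide_count:
        # Drop the hide_count items with the highest (or lowest) scores.
        ranked = sorted(scored, key=lambda pair: pair[1], reverse=hide_highest)
        hidden = {item for item, _ in ranked[:hide_count]}
        return [(item, score) for item, score in scored if item not in hidden]
    return list(scored)
```

The thresholds would come from the filter threshold control and `hide_count` from the filter count control; the original display order of the remaining items is preserved.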
Illustratively, the filter control may be a fold/unfold control: it is in the folded state when it shows "+" (as in Fig. 4) and in the unfolded state when it shows "-". For example, the user can change its state by clicking it with the left mouse button. Illustratively, the filter threshold control may be a slider control or a text input box control. The user can drag the slider control to change the filter threshold (i.e., the third score threshold or the fourth score threshold), or can enter a value directly in the text input box control as the filter threshold. Referring to Fig. 4, a "+" control is displayed in the upper-left corner of the annotation region; this is the filter control. The "1" control displayed to the right of the filter control is the filter threshold control. The current filter threshold shown in Fig. 4 is "1", and the interface is currently in high-probability mode; that is, filtering is performed against a threshold of 1 on the probability of belonging to the positive example (e.g., containing the digit 1). Because the probability that unlabeled data belongs to any category is almost always less than 1, in practice all of the unlabeled data is displayed.
In addition, referring to Fig. 4, a "Hide x" control is also displayed in the upper-left corner of the annotation region, where the hide count x can be entered in a text input box control. A "Submit Hide" button control is displayed to the right of the text input box control for entering the hide count. After the user enters a number x in the text box and clicks the "Submit Hide" control, the data annotation system takes x as the hide count, i.e., the predetermined number used when the filter control filters. The filter count control may comprise the above text input box control for entering the hide count and the "Submit Hide" control on its right. For example, when annotating for the digit 1 in high-probability mode, if the user enters the number 100, the data annotation system can filter out the 100 images with the highest probability of containing the digit 1 and display only the remaining images. In high-probability or high-similarity mode, unlabeled data with high data scores can be filtered out, because the pre-annotation results of such data are highly accurate and may not need to be checked by the user. In boundary mode, unlabeled data with low data scores can be filtered out, because the higher the data score, the more valuable the data is to annotate.
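The mode-dependent direction of hiding described above can be summarized in a few lines. The sketch below assumes three hypothetical mode names ("high_probability", "high_similarity", "boundary") and a list of (item, score) pairs; neither the names nor the function `hide_by_mode` appear in the original disclosure.

```python
def hide_by_mode(scored, mode, hide_count):
    """Hide hide_count items: the highest-scoring ones in high-probability or
    high-similarity mode, the lowest-scoring ones in boundary mode."""
    hide_highest = mode in ("high_probability", "high_similarity")
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=hide_highest)
    hidden = {item for item, _ in ranked[:hide_count]}
    # Keep the remaining items in their original display order.
    return [(item, score) for item, score in scored if item not in hidden]
```

With a hide count of 100 in high-probability mode, this corresponds to the example of hiding the 100 images most likely to contain the digit 1.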
Illustratively, the display interface may further include a submit control. For example, the data annotation method 200 may further include: when selection information for the submit control is received, determining that the current annotation results of the first number of unlabeled data are the final annotation results of the first number of unlabeled data. Referring to Fig. 4, a "Submit" control is displayed in the upper-left corner of the annotation region. After the user clicks the submit control, the data annotation system can take the current annotation results of the first number of unlabeled data as the final annotation results, store at least part of the first number of unlabeled data (a fifth number of unlabeled data) together with its final annotation results in the labeled data set, and remove the fifth number of unlabeled data from the unlabeled data set, thereby updating the data pool.
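The data pool update triggered by the submit control can be sketched as below, assuming the labeled data set is a dictionary from item identifier to final annotation result and the unlabeled data set is a Python set of item identifiers. The function name `submit` and these containers are illustrative assumptions.

```python
def submit(current_results, labeled_set, unlabeled_set):
    """Freeze the current results as final and move the items between sets.

    current_results -- dict item_id -> current annotation result
    labeled_set     -- dict item_id -> final annotation result (updated in place)
    unlabeled_set   -- set of item ids still to be annotated (updated in place)
    """
    for item_id, result in current_results.items():
        labeled_set[item_id] = result      # store item with its final result
        unlabeled_set.discard(item_id)     # remove it from the unlabeled set
    return labeled_set, unlabeled_set
```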
Illustratively, the display interface may further include a display count control. For example, the data annotation method 200 may further include: when operation information for the display count control is received, determining, according to the operation state of the display count control, the number of unlabeled data displayed in the display interface at the current moment. Illustratively, the display count control is a slider control or a text input box control. The user can drag the slider control to change the display count, or can enter a value directly in the text input box control as the display count. Referring to Fig. 4, a slider control is displayed in the upper part of the annotation region, to the right of the "Submit Hide" control; this slider control is the display count control.
Illustratively, the display interface may further include a page scroll control. For example, the data annotation method 200 may further include: when operation information for the page scroll control is received, updating the unlabeled data displayed on the display interface according to the operation state of the page scroll control. Referring to Fig. 4, a scroll bar control is displayed at the far right of the annotation region; this scroll bar control is the page scroll control. When the first number of unlabeled data is too large to be displayed in the display interface all at once, the page scroll control can be used to control which unlabeled data is displayed.
Illustratively, the display interface may further include a previous-batch control and/or a next-batch control. For example, step S220 may include: when selection information for the previous-batch control is received, displaying in the display interface the unlabeled data that was displayed during the previous annotation process; and/or, when selection information for the next-batch control is received, displaying in the display interface the unlabeled data to be displayed in the next annotation process. Referring to Fig. 4, "Previous batch" and "Next batch" controls are displayed in the upper-right corner of the annotation region; both are shown as button controls. When the user clicks the previous-batch control, the display interface can display the previous batch of annotated data, which had already been stored in the labeled data set as labeled data. After the user selects the previous-batch control, this batch of data can be removed from the labeled data set and displayed in the display interface again as unlabeled data, together with the final annotation results it obtained in the previous annotation process, so that the user can check them. An annotation process as described herein refers to the process from the moment the unlabeled data is displayed on the display interface to the moment annotation ends (for example, the moment the user clicks the submit control). When the user clicks the next-batch control, the display interface can display the next batch of unlabeled data to be annotated.
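One possible way to support the previous-batch and next-batch controls is to keep a history of submitted batches. The following is a minimal sketch under that assumption; the class `BatchNavigator`, its attribute names and the fixed batch size are hypothetical and are not taken from the original disclosure.

```python
class BatchNavigator:
    """Tracks submitted batches so that the last one can be re-opened for review."""

    def __init__(self, unlabeled, batch_size):
        self.unlabeled = list(unlabeled)  # item ids still to be annotated
        self.labeled = {}                 # item id -> final annotation result
        self.history = []                 # submitted batches, oldest first
        self.batch_size = batch_size

    def next_batch(self):
        """Items to display in the next annotation process."""
        return self.unlabeled[:self.batch_size]

    def submit(self, results):
        """Store a batch's results as final and remember the batch."""
        self.labeled.update(results)
        self.unlabeled = [i for i in self.unlabeled if i not in results]
        self.history.append(dict(results))

    def previous_batch(self):
        """Move the last submitted batch back to the unlabeled data and return
        its earlier final results so they can be shown for checking."""
        if not self.history:
            return {}
        batch = self.history.pop()
        for item_id in batch:
            self.labeled.pop(item_id, None)
        self.unlabeled = list(batch) + self.unlabeled
        return batch
```

Calling `previous_batch` repeatedly walks further back through earlier batches, one batch per click.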
It should be noted that, in the various embodiments of the present invention, when the user performs an operation on a control, the selection information or operation information of the user for that control can be received. For example, the user may operate or select a control using a mouse, a touch screen, a keyboard, a voice command or the like.
According to another embodiment of the present invention, step S230 may include: receiving an inversion instruction for specific unlabeled data; and step S240 may include: updating the current annotation result of the specific unlabeled data from positive example to negative example or from negative example to positive example, wherein the final annotation results of the first number of unlabeled data are the current annotation results of the first number of unlabeled data at the moment annotation ends.
Illustratively, the inversion instruction may include a left-mouse-button single-click operation on the display area where the specific unlabeled data is located.
The user can click any unlabeled data (for example, any image in Fig. 4) with the left mouse button. If the current annotation result of that unlabeled data is the positive example (for example, the image is annotated as containing the digit 1), the current annotation result can be updated to the negative example (for example, the image is annotated as not containing the digit 1).
The inversion instruction may also be another instruction. For example, referring to Fig. 4, the shortcut key information in the information bar region indicates that the s key on the keyboard implements the invert function. Therefore, if the user rests the cursor on any unlabeled data and presses the s key on the keyboard, the current annotation result of that unlabeled data is likewise inverted.
According to another embodiment of the present invention, step S230 may include: receiving an invalid instruction for specific unlabeled data; and step S240 may include: annotating the specific unlabeled data as invalid data to obtain the current annotation result of the specific unlabeled data, wherein the final annotation results of the first number of unlabeled data are the current annotation results of the first number of unlabeled data at the moment annotation ends.
Invalid data is data that is not suitable for the current classification. For example, if the digit 1 currently needs to be annotated but a face image is mixed into the set of images for digit annotation, the face image can be annotated as invalid data. Since it is completely unrelated to digit annotation, no further digit-related annotation needs to be performed on it.
Illustratively, the invalid instruction includes a left-mouse-button double-click operation on the display area where the specific unlabeled data is located.
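The per-item feedback described above (single click or s key to invert, double click to mark as invalid) can be sketched as a small dispatcher. The event names "left_click", "key_s" and "left_double_click" and the function `apply_feedback` are hypothetical; a real implementation would map them from whatever events its UI toolkit reports.

```python
def apply_feedback(current_results, item_id, event):
    """Update one item's current annotation result in response to a user event."""
    result = current_results[item_id]
    if event in ("left_click", "key_s"):
        # Inversion instruction: flip positive <-> negative.
        if result == "positive":
            current_results[item_id] = "negative"
        elif result == "negative":
            current_results[item_id] = "positive"
    elif event == "left_double_click":
        # Invalid instruction: the item does not fit the current classification.
        current_results[item_id] = "invalid"
    return current_results
```

Whatever value each item holds at the moment annotation ends is then taken as its final annotation result.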
According to another aspect of the present invention, a data annotation device is provided. Fig. 5 shows a schematic block diagram of a data annotation device 500 according to an embodiment of the present invention.
As shown in Fig. 5, the data annotation device 500 according to an embodiment of the present invention includes an obtaining module 510, a display module 520, a receiving module 530 and a result determining module 540. Optionally, the device 500 may further include a display device. The modules can respectively perform the steps/functions of the data annotation method described above in conjunction with Figs. 2-4. Only the main functions of the components of the data annotation device 500 are described below, and details already described above are omitted.
The obtaining module 510 is used to obtain the first number of unlabeled data and its pre-annotation information, where the pre-annotation information is obtained by pre-annotating the first number of unlabeled data with the marking model and includes pre-annotation results. The obtaining module 510 can be implemented by the processor 102 in the electronic device shown in Fig. 1 running program instructions stored in the storage device 104.
The display module 520 is used to display the first number of unlabeled data and its pre-annotation results in the display interface. The display module 520 can be implemented by the processor 102 in the electronic device shown in Fig. 1 running program instructions stored in the storage device 104.
The receiving module 530 is used to receive first feedback information of the user on the first number of unlabeled data. The receiving module 530 can be implemented by the processor 102 in the electronic device shown in Fig. 1 running program instructions stored in the storage device 104.
The result determining module 540 is used to determine the final annotation results of the first number of unlabeled data according to the first feedback information. The result determining module 540 can be implemented by the processor 102 in the electronic device shown in Fig. 1 running program instructions stored in the storage device 104.
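The cooperation of the four modules can be pictured as a simple pipeline. The sketch below is only an illustration: it assumes the marking model is a callable returning a pre-annotation result per item, and that the user's feedback arrives as a dictionary of corrected results. The function `annotate_batch` and its parameters are hypothetical names.

```python
def annotate_batch(unlabeled, marking_model, display, receive_feedback):
    """Run one annotation process over a batch of unlabeled items."""
    # Obtaining module: pre-annotate each item with the marking model.
    pre_results = {item: marking_model(item) for item in unlabeled}
    # Display module: show the items together with their pre-annotation results.
    display(pre_results)
    # Receiving module: collect the user's corrections (item -> corrected result).
    feedback = receive_feedback()
    # Result determining module: corrections override the pre-annotation results.
    final_results = dict(pre_results)
    final_results.update(feedback)
    return final_results
```

Items the user leaves untouched simply keep their pre-annotation result, so the user only has to act on the errors.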
Illustratively, the display interface includes an annotation region and a menu bar region, the first number of unlabeled data is displayed in the annotation region, the menu bar region includes a mode control for indicating the annotation mode of the data in the annotation region, and the annotation mode is one or more of a high-probability mode, a high-similarity mode and a boundary mode. The obtaining module 510 is specifically used to: determine the annotation mode selected by the user through the mode control; and, according to the selected annotation mode, pre-annotate a second number of unlabeled data with the marking model to obtain the pre-annotation information of the second number of unlabeled data, the first number of unlabeled data being at least part of the second number of unlabeled data.
Illustratively, the mode control includes one or more of a high-probability control, a high-similarity control and a boundary control arranged at different positions, which respectively indicate the high-probability mode, the high-similarity mode and the boundary mode.
Illustratively, each of the one or more of the high-similarity control, the high-probability control and the boundary control includes a positive-example control and a negative-example control. The positive-example control is used to control the display of unlabeled data belonging to the positive example under the corresponding annotation mode, and the negative-example control is used to control the display of unlabeled data belonging to the negative example under the corresponding annotation mode.
Illustratively, the mode control is a drop-down list control that provides drop-down list items corresponding to one or more of the high-probability mode, the high-similarity mode and the boundary mode.
Illustratively, the pre-annotation information further includes a data score, and the obtaining module 510 is further specifically used to: select, from the second number of unlabeled data, unlabeled data whose data score is greater than a first score threshold or less than a second score threshold as the first number of unlabeled data, or select, from the second number of unlabeled data, a preset number of unlabeled data with the highest data scores as the first number of unlabeled data.
Illustratively, the pre-annotation information further includes a data score, and in the display interface the first number of unlabeled data is arranged according to the data scores of the first number of unlabeled data.
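The selection and ordering by data score can be sketched as follows, assuming the pre-annotated data is a list of (item, score) pairs. The function name `select_first_number` and its parameters are hypothetical, and sorting in descending order is one possible arrangement; the text only states that the data is arranged according to its data score.

```python
def select_first_number(scored, first_threshold=None, second_threshold=None, top_k=None):
    """Select the items to display from the second number of pre-annotated items.

    Either keep the top_k highest-scoring items, or keep the items scoring above
    first_threshold or below second_threshold. The result is sorted by score.
    """
    if top_k is not None:
        chosen = sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]
    else:
        chosen = [
            (item, score) for item, score in scored
            if (first_threshold is not None and score > first_threshold)
            or (second_threshold is not None and score < second_threshold)
        ]
    # Arrange the displayed items by data score (descending here, as an assumption).
    return sorted(chosen, key=lambda pair: pair[1], reverse=True)
```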
Illustratively, the display interface includes a menu bar region, and the menu bar region includes a random control. The data annotation device 500 further includes a selecting module (not shown) for randomly selecting a third number of unlabeled data from the unlabeled data set when selection information for the random control is received. The display module 520 is further used to display the third number of unlabeled data in the display interface; the receiving module 530 is further used to receive second feedback information of the user on the third number of unlabeled data; and the result determining module 540 is further used to determine the final annotation results of the third number of unlabeled data according to the second feedback information.
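The random selection performed by the selecting module can be sketched with the Python standard library. The function `pick_random` and the optional `seed` parameter are illustrative assumptions; sampling is done without replacement so that no item is displayed twice.

```python
import random


def pick_random(unlabeled_set, third_number, seed=None):
    """Randomly pick up to third_number distinct items from the unlabeled data set."""
    rng = random.Random(seed)          # a seed makes the selection reproducible
    pool = sorted(unlabeled_set)       # fixed order so a given seed is repeatable
    return rng.sample(pool, min(third_number, len(pool)))
```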
Illustratively, the random control includes a positive-example control and a negative-example control. The positive-example control is used to control the display of unlabeled data belonging to the positive example under the random mode, and the negative-example control is used to control the display of unlabeled data belonging to the negative example under the random mode.
Illustratively, the display interface includes a menu bar region, and the menu bar region includes one or more of an export control, a generate-test-set control and an initialize-model control, wherein the export control is used to control exporting at least part of the first number of unlabeled data, together with the final annotation results of that part, as a file of a predetermined format; the generate-test-set control is used to control selecting a predetermined number of labeled data from the labeled data set to obtain a test set, the test set being used to test the annotation accuracy of the marking model; and the initialize-model control is used to control initializing the parameters of the marking model.
Illustratively, the display interface further includes an information bar region, and the information bar region includes regions for displaying one or more of sample information, statistical information, accuracy information and shortcut key information, wherein the sample information includes samples of unlabeled data belonging to the current category to be annotated; the statistical information includes one or more of the number of labeled data, the number of unlabeled data, the number of labeled data belonging to the positive example and the number of labeled data belonging to the negative example; the accuracy information indicates the accuracy of the marking model; and the shortcut key information indicates preset shortcut keys.
Illustratively, the pre-annotation information further includes a data score, and the display interface includes one or more of an invert control, a filter control, a filter threshold control and a filter count control, wherein the invert control is used to control updating the current annotation results of the unlabeled data displayed in the display interface from positive example to negative example or from negative example to positive example; the filter control is used to control filtering out part of the unlabeled data pre-annotated by the marking model, so that the remaining unlabeled data is displayed as the first number of unlabeled data; the filter threshold control is used to control the score threshold used for filtering unlabeled data out of the unlabeled data pre-annotated by the marking model; and the filter count control is used to control the number of unlabeled data filtered out of the unlabeled data pre-annotated by the marking model.
Illustratively, the filter threshold control is a slider control or a text input box control.
Illustratively, the receiving module 530 is specifically used to receive an inversion instruction for specific unlabeled data, and the result determining module 540 is specifically used to update the current annotation result of the specific unlabeled data from positive example to negative example or from negative example to positive example, wherein the final annotation results of the first number of unlabeled data are the current annotation results of the first number of unlabeled data at the moment annotation ends.
Illustratively, the inversion instruction includes a left-mouse-button single-click operation on the display area where the specific unlabeled data is located.
Illustratively, the receiving module 530 is specifically used to receive an invalid instruction for specific unlabeled data, and the result determining module 540 is specifically used to annotate the specific unlabeled data as invalid data to obtain the current annotation result of the specific unlabeled data, wherein the final annotation results of the first number of unlabeled data are the current annotation results of the first number of unlabeled data at the moment annotation ends.
Illustratively, the invalid instruction includes a left-mouse-button double-click operation on the display area where the specific unlabeled data is located.
Illustratively, the display interface includes a menu bar region, an information bar region and an annotation region. The menu bar region is the upper region of the display interface, the information bar region is the left part of the lower region of the display interface, and the annotation region is the right part of the lower region.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each specific application, but such implementations should not be considered to go beyond the scope of the present invention.
Fig. 6 shows a schematic block diagram of a data annotation system 600 according to an embodiment of the present invention. The data annotation system 600 includes a display device 610, a storage device 620 and a processor 630.
The display device 610 is used to display unlabeled data, the pre-annotation results of the unlabeled data and other information.
The storage device 620 stores computer program instructions for implementing the corresponding steps of the data annotation method according to an embodiment of the present invention.
The processor 630 is used to run the computer program instructions stored in the storage device 620 to perform the corresponding steps of the data annotation method according to an embodiment of the present invention.
In one embodiment, the computer program instructions, when run by the processor 630, are used to perform the following steps: obtaining a first number of unlabeled data and its pre-annotation information, where the pre-annotation information is obtained by pre-annotating the first number of unlabeled data with a marking model and includes pre-annotation results; displaying the first number of unlabeled data and its pre-annotation results in a display interface; receiving first feedback information of the user on the first number of unlabeled data; and determining the final annotation results of the first number of unlabeled data according to the first feedback information.
Illustratively, the display interface includes an annotation region and a menu bar region, the first number of unlabeled data is displayed in the annotation region, the menu bar region includes a mode control for indicating the annotation mode of the data in the annotation region, and the annotation mode is one or more of a high-probability mode, a high-similarity mode and a boundary mode. The step of obtaining the first number of unlabeled data and its pre-annotation information, which the computer program instructions perform when run by the processor 630, includes: determining the annotation mode selected by the user through the mode control; and, according to the selected annotation mode, pre-annotating a second number of unlabeled data with the marking model to obtain the pre-annotation information of the second number of unlabeled data, the first number of unlabeled data being at least part of the second number of unlabeled data.
Illustratively, the mode control includes one or more of a high-probability control, a high-similarity control and a boundary control arranged at different positions, which respectively indicate the high-probability mode, the high-similarity mode and the boundary mode.
Illustratively, each of the one or more of the high-similarity control, the high-probability control and the boundary control includes a positive-example control and a negative-example control. The positive-example control is used to control the display of unlabeled data belonging to the positive example under the corresponding annotation mode, and the negative-example control is used to control the display of unlabeled data belonging to the negative example under the corresponding annotation mode.
Illustratively, the mode control is a drop-down list control that provides drop-down list items corresponding to one or more of the high-probability mode, the high-similarity mode and the boundary mode.
Illustratively, the pre-annotation information further includes a data score, and the step of obtaining the first number of unlabeled data and its pre-annotation information, which the computer program instructions perform when run by the processor 630, further includes: selecting, from the second number of unlabeled data, unlabeled data whose data score is greater than a first score threshold or less than a second score threshold as the first number of unlabeled data, or selecting, from the second number of unlabeled data, a preset number of unlabeled data with the highest data scores as the first number of unlabeled data.
Illustratively, the pre-annotation information further includes a data score, and in the display interface the first number of unlabeled data is arranged according to the data scores of the first number of unlabeled data.
Illustratively, the display interface includes a menu bar region, and the menu bar region includes a random control. The computer program instructions, when run by the processor 630, are further used to perform the following steps: when selection information for the random control is received, randomly selecting a third number of unlabeled data from the unlabeled data set; displaying the third number of unlabeled data in the display interface; receiving second feedback information of the user on the third number of unlabeled data; and determining the final annotation results of the third number of unlabeled data according to the second feedback information.
Illustratively, the random control includes a positive-example control and a negative-example control. The positive-example control is used to control the display of unlabeled data belonging to the positive example under the random mode, and the negative-example control is used to control the display of unlabeled data belonging to the negative example under the random mode.
Illustratively, the display interface includes a menu bar region, and the menu bar region includes one or more of an export control, a generate-test-set control and an initialize-model control, wherein the export control is used to control exporting at least part of the first number of unlabeled data, together with the final annotation results of that part, as a file of a predetermined format; the generate-test-set control is used to control selecting a predetermined number of labeled data from the labeled data set to obtain a test set, the test set being used to test the annotation accuracy of the marking model; and the initialize-model control is used to control initializing the parameters of the marking model.
Illustratively, the display interface further includes an information bar region, and the information bar region includes regions for displaying one or more of sample information, statistical information, accuracy information and shortcut key information, wherein the sample information includes samples of unlabeled data belonging to the current category to be annotated; the statistical information includes one or more of the number of labeled data, the number of unlabeled data, the number of labeled data belonging to the positive example and the number of labeled data belonging to the negative example; the accuracy information indicates the accuracy of the marking model; and the shortcut key information indicates preset shortcut keys.
Illustratively, the pre-annotation information further includes a data score, and the display interface includes one or more of an invert control, a filter control, a filter threshold control and a filter count control, wherein the invert control is used to control updating the current annotation results of the unlabeled data displayed in the display interface from positive example to negative example or from negative example to positive example; the filter control is used to control filtering out part of the unlabeled data pre-annotated by the marking model, so that the remaining unlabeled data is displayed as the first number of unlabeled data; the filter threshold control is used to control the score threshold used for filtering unlabeled data out of the unlabeled data pre-annotated by the marking model; and the filter count control is used to control the number of unlabeled data filtered out of the unlabeled data pre-annotated by the marking model.
Illustratively, filtering threshold control is slider control or Input.
Illustratively, the reception user of used execution when the computer program instructions are run by the processor 630 Include to the step of the first feedback information of the unlabeled data of the first number:The reversion received for specific unlabeled data refers to It enables;The computer program instructions when being run by the processor 630 used execution according to the first feedback information determine The step of final annotation results of the unlabeled data of one number includes:By the current annotation results of specific unlabeled data by just Example, which updates, to be negative example or is updated to positive example by negative example, wherein the final annotation results of the unlabeled data of the first number are first Current annotation results of the unlabeled data of number in mark finish time.
Illustratively, toggling command includes that the left mouse button of display area where being directed to specific unlabeled data clicks behaviour Make.
Illustratively, the reception user of used execution when the computer program instructions are run by the processor 630 Include to the step of feedback information of the unlabeled data of the first number:Receive the illegal command for being directed to specific unlabeled data; The computer program instructions when being run by the processor 630 used execution according to the first feedback information determine first number The step of final annotation results of purpose unlabeled data includes:Specific unlabeled data is labeled as invalid data to obtain spy Determine the current annotation results of unlabeled data, wherein the final annotation results of the unlabeled data of the first number are the first number Unlabeled data mark finish time current annotation results.
Illustratively, illegal command includes the left mouse button double-click behaviour for display area where specific unlabeled data Make.
Illustratively, display interface includes menu bar region, information bar region and tab area, and menu bar region is display The upper area at interface, information bar region are the left area in the lower area of display interface, and tab area is lower area In right area.
In addition, according to an embodiment of the present invention, a storage medium is also provided, on which program instructions are stored. When run by a computer or processor, the program instructions are used to execute the corresponding steps of the data annotation method of the embodiment of the present invention and to implement the corresponding modules of the data annotation device according to the embodiment of the present invention. The storage medium may include, for example, a memory card of a smartphone, a storage unit of a tablet computer, a hard disk of a personal computer, read-only memory (ROM), erasable programmable read-only memory (EPROM), portable compact disc read-only memory (CD-ROM), USB memory, or any combination of the above storage media.
In one embodiment, the program instructions, when run by a computer or processor, may cause the computer or processor to implement each functional module of the data annotation device according to the embodiment of the present invention and/or to execute the data annotation method according to the embodiment of the present invention.
In one embodiment, the program instructions are used at runtime to execute the following steps: obtaining a first number of unlabeled data items and their pre-annotation information, where the pre-annotation information is obtained by pre-annotating the first number of unlabeled data items with an annotation model and includes pre-annotation results; displaying the first number of unlabeled data items and their pre-annotation results in a display interface; receiving the user's first feedback information on the first number of unlabeled data items; and determining the final annotation results of the first number of unlabeled data items according to the first feedback information.
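The four steps above (pre-annotate, display, receive feedback, determine final results) can be sketched as a single function. This is only an illustrative sketch, not the implementation described in this document: the `Item` structure, the `annotate_batch` function, and the callback parameters `model`, `show`, and `get_feedback` are all hypothetical names, and the feedback is assumed to arrive as corrected labels keyed by item id.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Item:
    """One unlabeled sample plus its annotation state (hypothetical structure)."""
    item_id: str
    pre_label: Optional[bool] = None    # pre-annotation result from the model
    final_label: Optional[bool] = None  # set once the user has reviewed the batch


def annotate_batch(
    items: List[Item],
    model: Callable[[str], bool],
    show: Callable[[List[Item]], None],
    get_feedback: Callable[[List[Item]], Dict[str, bool]],
) -> List[Item]:
    # Step 1: pre-annotate the batch with the model.
    for item in items:
        item.pre_label = model(item.item_id)
    # Step 2: display the data together with the pre-annotation results.
    show(items)
    # Step 3: receive the user's feedback (assumed: corrected labels keyed by id).
    feedback = get_feedback(items)
    # Step 4: the final result is the user's correction if one was given,
    # otherwise the pre-annotation result is accepted as-is.
    for item in items:
        item.final_label = feedback.get(item.item_id, item.pre_label)
    return items
```

Under these assumptions the user only has to act on the items the model got wrong, which is the point of showing pre-annotation results before asking for feedback.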
Illustratively, the display interface includes an annotation area and a menu bar region. The first number of unlabeled data items are displayed in the annotation area, and the menu bar region includes a mode control for indicating the annotation mode of the data in the annotation area, the annotation mode being one or more of a high-probability mode, a high-similarity mode, and a boundary mode. The step of obtaining the first number of unlabeled data items and their pre-annotation information, executed by the program instructions at runtime, includes: determining the annotation mode selected by the user through the mode control; and, according to the selected annotation mode, pre-annotating a second number of unlabeled data items with the annotation model to obtain pre-annotation information of the second number of unlabeled data items, where the first number of unlabeled data items are at least part of the second number of unlabeled data items.
Illustratively, the mode control includes one or more of a high-probability control, a high-similarity control, and a boundary control arranged at different positions, which respectively indicate the high-probability mode, the high-similarity mode, and the boundary mode.
Illustratively, each of the one or more of the high-similarity control, the high-probability control, and the boundary control includes a positive example control and a negative example control. The positive example control controls the display of unlabeled data belonging to positive examples under the corresponding annotation mode, and the negative example control controls the display of unlabeled data belonging to negative examples under the corresponding annotation mode.
Illustratively, the mode control is a drop-down list control that provides drop-down list items corresponding to one or more of the high-probability mode, the high-similarity mode, and the boundary mode.
Illustratively, the pre-annotation information further includes a data score, and the step of obtaining the first number of unlabeled data items and their pre-annotation information, executed by the program instructions at runtime, further includes: selecting, from the second number of unlabeled data items, those whose data score is greater than a first score threshold or less than a second score threshold as the first number of unlabeled data items; or selecting, from the second number of unlabeled data items, a preset number of unlabeled data items with the highest data scores as the first number of unlabeled data items.
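The two selection rules in the preceding paragraph (score above one threshold or below another, or the highest-scoring preset number of items) can be illustrated with a minimal sketch. The `(item id, score)` tuple representation and the function names `select_by_threshold` and `select_top_k` are assumptions made for illustration only.

```python
from typing import List, Tuple

Scored = Tuple[str, float]  # (item id, data score); hypothetical representation


def select_by_threshold(scored: List[Scored], first: float, second: float) -> List[Scored]:
    """Keep items whose score is above `first` or below `second`."""
    return [(i, s) for i, s in scored if s > first or s < second]


def select_top_k(scored: List[Scored], k: int) -> List[Scored]:
    """Keep the k items with the highest data scores."""
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

For example, with a first threshold of 0.9 and a second threshold of 0.1, only the items the model scores very high or very low would be passed on for display.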
Illustratively, the pre-annotation information further includes a data score, and in the display interface the first number of unlabeled data items are arranged according to their data scores.
Illustratively, the display interface includes a menu bar region, and the menu bar region includes a random control. The program instructions are further used at runtime to execute the following steps: when selection information for the random control is received, randomly selecting a third number of unlabeled data items from the unlabeled data set; displaying the third number of unlabeled data items in the display interface; receiving the user's second feedback information on the third number of unlabeled data items; and determining the final annotation results of the third number of unlabeled data items according to the second feedback information.
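The random selection step can be sketched with Python's standard `random` module. The function name `pick_random_batch`, the optional `seed` parameter, and the choice to cap the batch at the size of the unlabeled set are assumptions for illustration, not details given in this document.

```python
import random
from typing import List, Optional


def pick_random_batch(unlabeled_ids: List[str], count: int, seed: Optional[int] = None) -> List[str]:
    """Randomly pick `count` distinct items from the unlabeled set."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    count = min(count, len(unlabeled_ids))  # never ask for more than exists
    return rng.sample(unlabeled_ids, count)
```

The returned batch would then go through the same display, feedback, and final-result steps as a model-selected batch.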
Illustratively, the random control includes a positive example control and a negative example control. The positive example control controls the display of unlabeled data belonging to positive examples under the random mode, and the negative example control controls the display of unlabeled data belonging to negative examples under the random mode.
Illustratively, the display interface includes a menu bar region, and the menu bar region includes one or more of an export control, a test set generation control, and a model initialization control. The export control is used to export at least part of the first number of unlabeled data items, together with their final annotation results, as a file in a predetermined format. The test set generation control is used to select a predetermined number of labeled data items from the labeled data set to obtain a test set, which is used to test the annotation accuracy of the annotation model. The model initialization control is used to initialize the parameters of the annotation model.
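As a rough illustration of what the export and test set controls might trigger, the sketch below writes final annotation results to CSV text, draws a test set from the labeled ids, and computes accuracy on it. The document only says "a predetermined format", so CSV is an assumption, as are the function names `export_csv`, `make_test_set`, and `accuracy` and the string label values.

```python
import csv
import io
import random
from typing import Dict, List


def export_csv(final_labels: Dict[str, str]) -> str:
    """Serialize item ids and final labels as CSV text (format is an assumption)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["item_id", "label"])
    for item_id, label in final_labels.items():
        writer.writerow([item_id, label])
    return buf.getvalue()


def make_test_set(labeled_ids: List[str], size: int, seed: int = 0) -> List[str]:
    """Select a predetermined number of labeled items as the test set."""
    rng = random.Random(seed)
    return rng.sample(labeled_ids, min(size, len(labeled_ids)))


def accuracy(predicted: Dict[str, str], truth: Dict[str, str]) -> float:
    """Fraction of test items whose predicted label matches the stored label."""
    if not truth:
        return 0.0
    hits = sum(1 for item_id, label in truth.items() if predicted.get(item_id) == label)
    return hits / len(truth)
```

Here `truth` would hold the final annotation results of the test set items and `predicted` the model's output for the same items.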
Illustratively, the display interface further includes an information bar region, which includes one or more regions for displaying sample information, statistical information, accuracy information, and shortcut key information. The sample information includes samples of the unlabeled data belonging to the category currently to be annotated. The statistical information includes one or more of the number of labeled data items, the number of unlabeled data items, the number of labeled data items that are positive examples, and the number of labeled data items that are negative examples. The accuracy information indicates the accuracy of the annotation model. The shortcut key information indicates the preset shortcut keys.
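The statistical information listed above amounts to a handful of counts over the current label state. A minimal sketch, assuming labels are stored in a dictionary whose values are `"positive"`, `"negative"`, or `None` for items not yet labeled (the `summarize` function and this representation are hypothetical):

```python
from collections import Counter
from typing import Dict, Optional


def summarize(labels: Dict[str, Optional[str]]) -> Dict[str, int]:
    """Count labeled, unlabeled, positive, and negative items."""
    counts = Counter(labels.values())
    return {
        "labeled": len(labels) - counts[None],
        "unlabeled": counts[None],
        "positive": counts["positive"],
        "negative": counts["negative"],
    }
```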
Illustratively, the pre-annotation information further includes a data score, and the display interface includes one or more of an inversion control, a filter control, a filtering threshold control, and a filtering number control. The inversion control is used to update the current annotation results of the unlabeled data shown in the display interface from positive example to negative example or from negative example to positive example. The filter control is used to filter out part of the unlabeled data that has been pre-annotated by the annotation model, so that the remaining unlabeled data is displayed as the first number of unlabeled data items. The filtering threshold control sets the score threshold used to filter unlabeled data out of the pre-annotated unlabeled data, and the filtering number control sets the number of unlabeled data items to be filtered out of the pre-annotated unlabeled data.
Illustratively, the filtering threshold control is a slider control or an input box.
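The filtering threshold and filtering number controls can be read as two ways of deciding which pre-annotated items are dropped before display. The sketch below assumes that items scoring below the threshold are the ones filtered out and that the filtering number removes the lowest-scoring items; the document does not fix either direction, and the names `filter_below` and `filter_lowest` are hypothetical.

```python
from typing import List, Tuple

Scored = Tuple[str, float]  # (item id, data score); ids are assumed unique


def filter_below(scored: List[Scored], threshold: float) -> List[Scored]:
    """Drop items scoring below the threshold; keep the rest in display order."""
    return [(i, s) for i, s in scored if s >= threshold]


def filter_lowest(scored: List[Scored], n: int) -> List[Scored]:
    """Drop the n lowest-scoring items; keep the rest in display order."""
    if n <= 0:
        return list(scored)
    dropped = {i for i, _ in sorted(scored, key=lambda pair: pair[1])[:n]}
    return [(i, s) for i, s in scored if i not in dropped]
```

A slider or input box would simply supply the `threshold` value passed to the first function.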
Illustratively, the step of receiving the user's first feedback information on the first number of unlabeled data items, executed by the program instructions at runtime, includes: receiving an inversion instruction for a specific unlabeled data item. The step of determining the final annotation results of the first number of unlabeled data items according to the first feedback information, executed by the program instructions at runtime, includes: updating the current annotation result of the specific unlabeled data item from positive example to negative example or from negative example to positive example, wherein the final annotation results of the first number of unlabeled data items are their current annotation results at the moment annotation ends.
Illustratively, the inversion instruction includes a left mouse button single-click operation on the display area where the specific unlabeled data item is located.
Illustratively, the step of receiving the user's first feedback information on the first number of unlabeled data items, executed by the program instructions at runtime, includes: receiving an invalid instruction for a specific unlabeled data item. The step of determining the final annotation results of the first number of unlabeled data items according to the first feedback information, executed by the program instructions at runtime, includes: marking the specific unlabeled data item as invalid data to obtain its current annotation result, wherein the final annotation results of the first number of unlabeled data items are their current annotation results at the moment annotation ends.
Illustratively, the invalid instruction includes a left mouse button double-click operation on the display area where the specific unlabeled data item is located.
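The single-click and double-click handling described above can be sketched as two small state updates on a dictionary of current annotation results, with the final results being a snapshot taken when annotation ends. The handler names, the string label values, and the choice to leave invalid items unchanged on a single click are assumptions for illustration.

```python
from typing import Dict

POSITIVE, NEGATIVE, INVALID = "positive", "negative", "invalid"


def on_single_click(current: Dict[str, str], item_id: str) -> None:
    """Flip the clicked item between positive and negative."""
    if current[item_id] == POSITIVE:
        current[item_id] = NEGATIVE
    elif current[item_id] == NEGATIVE:
        current[item_id] = POSITIVE
    # Items already marked invalid are left unchanged (an assumption).


def on_double_click(current: Dict[str, str], item_id: str) -> None:
    """Mark the clicked item as invalid data."""
    current[item_id] = INVALID


def finish(current: Dict[str, str]) -> Dict[str, str]:
    """Final results are the current results at the moment annotation ends."""
    return dict(current)
```

In many GUI toolkits a double click also delivers single-click events, so a real handler would likely need to account for that; the sketch ignores it.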
Illustratively, the display interface includes a menu bar region, an information bar region, and an annotation area. The menu bar region is the upper region of the display interface, the information bar region is the left part of the lower region of the display interface, and the annotation area is the right part of the lower region.
Each module in the data annotation system according to the embodiment of the present invention may be implemented by a processor of an electronic device that performs data annotation according to the embodiment of the present invention running computer program instructions stored in a memory, or may be implemented when computer instructions stored in a computer-readable storage medium of a computer program product according to the embodiment of the present invention are run by a computer.
Although example embodiments have been described herein with reference to the accompanying drawings, it should be understood that the above example embodiments are merely exemplary and are not intended to limit the scope of the present invention thereto. Those of ordinary skill in the art can make various changes and modifications therein without departing from the scope and spirit of the present invention. All such changes and modifications are intended to be included within the scope of the present invention as claimed in the appended claims.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled practitioners may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present invention.
In the several embodiments provided in this application, it should be understood that the disclosed devices and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative; the division into units is only a division by logical function, and other divisions are possible in actual implementation. For example, multiple units or components may be combined or integrated into another device, or some features may be omitted or not executed.
Numerous specific details are set forth in the description provided here. It should be understood, however, that embodiments of the present invention can be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, the features of the present invention are sometimes grouped together into a single embodiment, figure, or description thereof in the description of exemplary embodiments. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the corresponding claims reflect, the inventive point lies in that the corresponding technical problem can be solved with fewer than all the features of a single disclosed embodiment. Therefore, the claims following the detailed description are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the present invention.
Those skilled in the art will understand that, except where features are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, equivalent, or similar purpose.
In addition, those skilled in the art will understand that although some embodiments described herein include certain features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the present invention and to form different embodiments. For example, in the claims, any one of the claimed embodiments can be used in any combination.
The various component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some modules in the data annotation device according to the embodiment of the present invention. The present invention may also be implemented as a device program (for example, a computer program and a computer program product) for executing part or all of the method described herein. Such a program implementing the present invention may be stored on a computer-readable medium or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the present invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The present invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second, and third does not indicate any order; these words may be interpreted as names.
The above is merely a description of specific embodiments of the present invention, and the protection scope of the present invention is not limited thereto. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. The protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (21)

1. A data annotation method, comprising:
obtaining a first number of unlabeled data items and their pre-annotation information, wherein the pre-annotation information is obtained by pre-annotating the first number of unlabeled data items with an annotation model, and the pre-annotation information includes pre-annotation results;
displaying the first number of unlabeled data items and their pre-annotation results in a display interface;
receiving a user's first feedback information on the first number of unlabeled data items; and
determining final annotation results of the first number of unlabeled data items according to the first feedback information.
2. The method of claim 1, wherein the display interface includes an annotation area and a menu bar region, the first number of unlabeled data items are displayed in the annotation area, the menu bar region includes a mode control for indicating the annotation mode of the data in the annotation area, and the annotation mode is one or more of a high-probability mode, a high-similarity mode, and a boundary mode,
and wherein obtaining the first number of unlabeled data items and their pre-annotation information includes:
determining the annotation mode selected by the user through the mode control; and
according to the selected annotation mode, pre-annotating a second number of unlabeled data items with the annotation model to obtain pre-annotation information of the second number of unlabeled data items, wherein the first number of unlabeled data items are at least part of the second number of unlabeled data items.
3. The method of claim 2, wherein the mode control includes one or more of a high-probability control, a high-similarity control, and a boundary control arranged at different positions, and the high-probability control, the high-similarity control, and the boundary control respectively indicate the high-probability mode, the high-similarity mode, and the boundary mode.
4. The method of claim 3, wherein each of the one or more of the high-similarity control, the high-probability control, and the boundary control includes a positive example control and a negative example control, the positive example control is used to control the display of unlabeled data belonging to positive examples under the corresponding annotation mode, and the negative example control is used to control the display of unlabeled data belonging to negative examples under the corresponding annotation mode.
5. The method of claim 2, wherein the mode control is a drop-down list control, and the drop-down list control provides drop-down list items corresponding to one or more of the high-probability mode, the high-similarity mode, and the boundary mode.
6. The method of claim 2, wherein the pre-annotation information further includes a data score, and obtaining the first number of unlabeled data items and their pre-annotation information further includes:
selecting, from the second number of unlabeled data items, those whose data score is greater than a first score threshold or less than a second score threshold as the first number of unlabeled data items, or selecting, from the second number of unlabeled data items, a preset number of unlabeled data items with the highest data scores as the first number of unlabeled data items.
7. The method of claim 1, wherein the pre-annotation information further includes a data score, and on the display interface the first number of unlabeled data items are arranged according to their data scores.
8. The method of claim 1, wherein the display interface includes a menu bar region, and the menu bar region includes a random control;
the method further comprising:
when selection information for the random control is received, randomly selecting a third number of unlabeled data items from an unlabeled data set;
displaying the third number of unlabeled data items on the display interface;
receiving the user's second feedback information on the third number of unlabeled data items; and
determining final annotation results of the third number of unlabeled data items according to the second feedback information.
9. The method of claim 8, wherein the random control includes a positive example control and a negative example control, the positive example control is used to control the display of unlabeled data belonging to positive examples under the random mode, and the negative example control is used to control the display of unlabeled data belonging to negative examples under the random mode.
10. The method of claim 1, wherein the display interface includes a menu bar region, and the menu bar region includes one or more of an export control, a test set generation control, and a model initialization control, wherein
the export control is used to export at least part of the first number of unlabeled data items, together with the final annotation results of the at least part of the unlabeled data items, as a file in a predetermined format,
the test set generation control is used to select a predetermined number of labeled data items from a labeled data set to obtain a test set, the test set being used to test the annotation accuracy of the annotation model, and
the model initialization control is used to initialize the parameters of the annotation model.
11. The method of claim 1, wherein the display interface further includes an information bar region, and the information bar region includes one or more regions for displaying sample information, statistical information, accuracy information, and shortcut key information, wherein
the sample information includes samples of the unlabeled data belonging to the category currently to be annotated;
the statistical information includes one or more of the number of labeled data items, the number of unlabeled data items, the number of labeled data items that are positive examples, and the number of labeled data items that are negative examples;
the accuracy information indicates the accuracy of the annotation model; and
the shortcut key information indicates preset shortcut keys.
12. The method of claim 1, wherein the pre-annotation information further includes a data score, and the display interface includes one or more of an inversion control, a filter control, a filtering threshold control, and a filtering number control, wherein
the inversion control is used to update the current annotation results of the unlabeled data shown on the display interface from positive example to negative example or from negative example to positive example,
the filter control is used to filter out part of the unlabeled data pre-annotated by the annotation model, so that the remaining unlabeled data is displayed as the first number of unlabeled data items,
the filtering threshold control is used to set the score threshold for filtering unlabeled data out of the unlabeled data pre-annotated by the annotation model, and
the filtering number control is used to set the number of unlabeled data items to be filtered out of the unlabeled data pre-annotated by the annotation model.
13. The method of claim 12, wherein the filtering threshold control is a slider control or an input box.
14. The method of claim 1, wherein
receiving the user's first feedback information on the first number of unlabeled data items includes:
receiving an inversion instruction for a specific unlabeled data item; and
determining the final annotation results of the first number of unlabeled data items according to the first feedback information includes:
updating the current annotation result of the specific unlabeled data item from positive example to negative example or from negative example to positive example, wherein the final annotation results of the first number of unlabeled data items are the current annotation results of the first number of unlabeled data items at the moment annotation ends.
15. The method of claim 14, wherein the inversion instruction includes a left mouse button single-click operation on the display area where the specific unlabeled data item is located.
16. The method of claim 1, wherein
receiving the user's first feedback information on the first number of unlabeled data items includes:
receiving an invalid instruction for a specific unlabeled data item; and
determining the final annotation results of the first number of unlabeled data items according to the first feedback information includes:
marking the specific unlabeled data item as invalid data to obtain the current annotation result of the specific unlabeled data item, wherein the final annotation results of the first number of unlabeled data items are the current annotation results of the first number of unlabeled data items at the moment annotation ends.
17. The method of claim 16, wherein the invalid instruction includes a left mouse button double-click operation on the display area where the specific unlabeled data item is located.
18. The method of any one of claims 1 to 17, wherein the display interface includes a menu bar region, an information bar region, and an annotation area, the menu bar region is the upper region of the display interface, the information bar region is the left part of the lower region of the display interface, and the annotation area is the right part of the lower region.
19. A data annotation device, comprising:
an obtaining module, for obtaining a first number of unlabeled data items and their pre-annotation information, wherein the pre-annotation information is obtained by pre-annotating the first number of unlabeled data items with an annotation model, and the pre-annotation information includes pre-annotation results;
a display module, for displaying the first number of unlabeled data items and their pre-annotation results in a display interface;
a receiving module, for receiving a user's first feedback information on the first number of unlabeled data items; and
a result determining module, for determining final annotation results of the first number of unlabeled data items according to the first feedback information.
20. A data annotation system, comprising a display device, a processor, and a memory, wherein computer program instructions are stored in the memory, and the computer program instructions, when run by the processor, are used to execute the data annotation method of any one of claims 1 to 18.
21. A storage medium on which program instructions are stored, wherein the program instructions are used at runtime to execute the data annotation method of any one of claims 1 to 18.
CN201810064918.0A 2018-01-23 2018-01-23 Data mask method, device and system and storage medium Pending CN108875769A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810064918.0A CN108875769A (en) 2018-01-23 2018-01-23 Data mask method, device and system and storage medium


Publications (1)

Publication Number Publication Date
CN108875769A true CN108875769A (en) 2018-11-23

Family

ID=64326003


Country Status (1)

Country Link
CN (1) CN108875769A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109710933A (en) * 2018-12-25 2019-05-03 广州天鹏计算机科技有限公司 Acquisition methods, device, computer equipment and the storage medium of training corpus
CN110263853A (en) * 2019-06-20 2019-09-20 杭州睿琪软件有限公司 The method and device of artificial client state is checked using error sample
CN110378396A (en) * 2019-06-26 2019-10-25 北京百度网讯科技有限公司 Sample data mask method, device, computer equipment and storage medium
CN111339325A (en) * 2018-12-19 2020-06-26 财团法人工业技术研究院 Data marking system and data marking method
CN111859872A (en) * 2020-07-07 2020-10-30 中国建设银行股份有限公司 Text labeling method and device
CN112163132A (en) * 2020-09-21 2021-01-01 中国建设银行股份有限公司 Data labeling method and device, storage medium and electronic equipment
CN112446404A (en) * 2019-09-04 2021-03-05 天津职业技术师范大学(中国职业培训指导教师进修中心) Online image sample labeling system based on active learning, labeling method and application thereof
CN113704650A (en) * 2020-05-21 2021-11-26 阿里巴巴集团控股有限公司 Information display method, device, system, equipment and storage medium
CN113839953A (en) * 2021-09-27 2021-12-24 上海商汤科技开发有限公司 Labeling method and device, electronic equipment and storage medium
CN115712745A (en) * 2023-01-09 2023-02-24 荣耀终端有限公司 User annotation data acquisition method and system and electronic equipment
CN116385459A (en) * 2023-03-08 2023-07-04 阿里巴巴(中国)有限公司 Image segmentation method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090076989A1 (en) * 2007-09-14 2009-03-19 Accenture Global Service Gmbh Automated classification algorithm comprising at least one input-invariant part
CN104850832A (en) * 2015-05-06 2015-08-19 中国科学院信息工程研究所 Hierarchical iteration-based large-scale image sample marking method and system
CN107067025A (en) * 2017-02-15 2017-08-18 重庆邮电大学 A kind of data automatic marking method based on Active Learning
CN107153822A (en) * 2017-05-19 2017-09-12 北京航空航天大学 A kind of smart mask method of the semi-automatic image based on deep learning
CN107492135A (en) * 2017-08-21 2017-12-19 维沃移动通信有限公司 A kind of image segmentation mask method, device and computer-readable recording medium

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111339325A (en) * 2018-12-19 2020-06-26 财团法人工业技术研究院 Data marking system and data marking method
CN109710933A (en) * 2018-12-25 2019-05-03 广州天鹏计算机科技有限公司 Acquisition methods, device, computer equipment and the storage medium of training corpus
CN110263853A (en) * 2019-06-20 2019-09-20 杭州睿琪软件有限公司 The method and device of artificial client state is checked using error sample
WO2020253741A1 (en) * 2019-06-20 2020-12-24 杭州睿琪软件有限公司 Method and device for checking status of manual client by using error samples
CN110378396A (en) * 2019-06-26 2019-10-25 北京百度网讯科技有限公司 Sample data mask method, device, computer equipment and storage medium
CN112446404A (en) * 2019-09-04 2021-03-05 天津职业技术师范大学(中国职业培训指导教师进修中心) Online image sample labeling system based on active learning, labeling method and application thereof
CN113704650A (en) * 2020-05-21 2021-11-26 阿里巴巴集团控股有限公司 Information display method, device, system, equipment and storage medium
CN111859872A (en) * 2020-07-07 2020-10-30 中国建设银行股份有限公司 Text labeling method and device
CN112163132A (en) * 2020-09-21 2021-01-01 中国建设银行股份有限公司 Data labeling method and device, storage medium and electronic equipment
CN112163132B (en) * 2020-09-21 2024-05-10 中国建设银行股份有限公司 Data labeling method and device, storage medium and electronic equipment
CN113839953A (en) * 2021-09-27 2021-12-24 上海商汤科技开发有限公司 Labeling method and device, electronic equipment and storage medium
CN115712745A (en) * 2023-01-09 2023-02-24 荣耀终端有限公司 User annotation data acquisition method and system and electronic equipment
CN116385459A (en) * 2023-03-08 2023-07-04 阿里巴巴(中国)有限公司 Image segmentation method and device
CN116385459B (en) * 2023-03-08 2024-01-09 阿里巴巴(中国)有限公司 Image segmentation method and device

Similar Documents

Publication Publication Date Title
CN108875769A (en) Data mask method, device and system and storage medium
CN108875768A (en) Data mask method, device and system and storage medium
CN110750959B (en) Text information processing method, model training method and related device
CN108363790A (en) For the method, apparatus, equipment and storage medium to being assessed
CN107111608A (en) Automatic generation of N-grams and concept relationships from linguistic input data
CN108647205A (en) Fine granularity sentiment analysis model building method, equipment and readable storage medium storing program for executing
CN103870001B (en) A kind of method and electronic device for generating candidates of input method
CN109461157A (en) Image, semantic dividing method based on multi-stage characteristics fusion and Gauss conditions random field
CN109471945A (en) Medical file classification method, device and storage medium based on deep learning
CN110249341A (en) Classifier training
CN109101469A (en) The information that can search for is extracted from digitized document
CN109657204A (en) Use the automatic matching font of asymmetric metric learning
CN103534697B (en) For providing the method and system of statistics dialog manager training
CN107818491A (en) Electronic installation, Products Show method and storage medium based on user's Internet data
CN107609563A (en) Picture semantic describes method and device
Jamalpur et al. Machine learning intersections and challenges in deep learning
CN109816438A (en) Information-pushing method and device
Ma et al. UniTranSeR: A unified transformer semantic representation framework for multimodal task-oriented dialog system
CN109154945A (en) New connection based on data attribute is recommended
CN108536784A (en) Comment information sentiment analysis method, apparatus, computer storage media and server
CN109740515A (en) One kind reading and appraising method and device
CN110309114A (en) Processing method, device, storage medium and the electronic device of media information
CN110399547A (en) For updating the method, apparatus, equipment and storage medium of model parameter
CN112837466B (en) Bill recognition method, device, equipment and storage medium
CN110490237A (en) Data processing method, device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181123