CN115905656A - Label determining method and device, electronic equipment and storage medium
- Publication number: CN115905656A (application CN202211657142.6A)
- Authority: CN (China)
- Legal status: Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application relates to the technical field of data processing, and discloses a label determination method and apparatus, an electronic device and a storage medium. The method comprises: after data to be processed is received, determining model information corresponding to the category information based on the category information of the data to be processed; adjusting an initial classification model based on the corresponding model information to obtain a target classification model corresponding to the category information; and inputting the data to be processed into the target classification model to obtain the label of the data to be processed. In this embodiment, the initial classification model is adjusted according to the model information corresponding to the category information of the data to be processed to obtain the target classification model corresponding to the category information, and the data to be processed is input into the target classification model to obtain the label of the data to be processed that matches the category information. For different service scenarios, different classification standards can therefore be matched simply by adjusting the initial classification model.
Description
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a tag determination method and apparatus, an electronic device, and a storage medium.
Background
In data security, data acquisition is the leading stage of the data security life cycle, and determining data tags is an important part of that stage; the accuracy of the tags affects the security and confidentiality of the data. In different business scenarios, different tag standards are typically established.
In the related art, data tags are determined only for a single service scenario.
However, the classification criteria (category information) used in different service scenarios differ, so the above approach cannot be applied flexibly across different service scenarios.
Disclosure of Invention
The application provides a label determination method, a label determination device, electronic equipment and a storage medium, which are used for accurately determining data labels in different service scenes.
In a first aspect, an embodiment of the present application provides a tag determination method, where the method includes:
after receiving data to be processed, determining model information corresponding to the category information based on the category information of the data to be processed;
adjusting the initial classification model based on the corresponding model information to obtain a target classification model corresponding to the class information;
and inputting the data to be processed into the target classification model to obtain the label of the data to be processed.
According to the above scheme, after the data to be processed is received, the initial classification model is adjusted according to the model information corresponding to the category information of the data to be processed, and the target classification model corresponding to the category information is obtained; the label of the data to be processed that matches the category information can then be obtained by inputting the data to be processed into the target classification model. Thus, for different service scenarios, different classification standards (category information) can be matched simply by adjusting the initial classification model, and the labels under each classification standard can be accurately obtained with a single model.
In some optional embodiments, the model information comprises a target number and target model parameters of the target classification unit; adjusting the initial classification model based on the corresponding model information to obtain a target classification model corresponding to the category information, including:
selecting the target number of classification units from an initial classification model as the target classification units;
and aiming at any target classification unit, adjusting the current model parameters of the target classification unit to corresponding target model parameters to obtain the target classification model.
According to the above scheme, the target number of classification units are selected from the initial classification model to serve as the target classification units, so that the label hierarchy of the category information is satisfied; the current model parameters of each target classification unit are then adjusted to the corresponding target model parameters, thereby obtaining a target classification model that accurately determines each level of label corresponding to the category information.
In some optional embodiments, determining, based on the category information of the data to be processed, model information corresponding to the category information includes:
determining model information corresponding to the category information of the data to be processed based on a preset corresponding relation; the preset corresponding relation comprises a corresponding relation between preset category information and model information.
According to the scheme, the preset corresponding relation between different types of information and the model information is set, and the model information corresponding to the type information of each piece of data to be processed can be accurately determined based on the preset corresponding relation.
In some optional embodiments, the model information comprises a target number and target model parameters of the target classification unit; determining the preset corresponding relationship by:
aiming at any preset category information, determining the target quantity corresponding to the category information based on the hierarchical relation of the sample labels of the category information;
and training the target classification units of the target quantity based on the sample data of the category information and the sample labels to obtain target model parameters corresponding to the category information.
According to the scheme, the target quantity corresponding to each category of information is accurately determined based on the hierarchical relation of the sample labels of each category of information; and training the target classification unit through the sample data and the sample labels of each type of information, and accurately determining target model parameters suitable for each type of information.
In some optional embodiments, training the target classification units of the target number based on the sample data of the category information and the sample labels to obtain target model parameters corresponding to the category information includes:
and aiming at the target classification unit of any level, training the target classification unit based on the sample data of the class information and the sample label corresponding to the level to obtain the target model parameter of the target classification unit of the level.
According to the scheme, due to the fact that label levels of different types of information are different, each level of label needs to be determined through different classification units, and the target classification units of all levels are trained through sample data and sample labels corresponding to all levels, so that target model parameters of the target classification units of all levels are obtained.
In some optional embodiments, if a target classification unit of a previous level exists for the target classification unit, before training the target classification unit based on the sample data of the category information and the sample label corresponding to the level, the method further includes:
after the training of the target classification unit of the previous level is finished, adjusting the initial model parameters of the target classification unit of the level to the target model parameters of the target classification unit of the previous level.
According to the above scheme, parameter migration provides a better training starting point for the target classification unit of the next level, which speeds up model training and improves iteration efficiency.
In some optional embodiments, training the target classification unit based on sample data of the category information and sample labels corresponding to the hierarchy includes:
for any sample label corresponding to the hierarchy, determining the sampling weight of the sample label based on the number of sample data corresponding to the sample label;
and sampling the sample label and corresponding sample data based on the sampling weight, and training the target classification unit based on the sample data obtained by sampling and the sample label.
In a second aspect, an embodiment of the present application provides a tag determination apparatus, including:
the model information determining module is used for determining model information corresponding to the category information based on the category information of the data to be processed after the data to be processed is received;
the model adjusting module is used for adjusting the initial classification model based on the corresponding model information to obtain a target classification model corresponding to the category information;
and the label determining module is used for inputting the data to be processed into the target classification model to obtain the label of the data to be processed.
In a third aspect, an embodiment of the present application provides an electronic device, which includes at least one processor and at least one memory, where the memory stores a computer program, and when the program is executed by the processor, the processor is caused to execute the tag determination method according to any one of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, which stores a computer program executable by an electronic device, and when the program runs on the electronic device, the program causes the electronic device to execute the tag determination method according to any one of the first aspect.
In addition, for technical effects brought by any one implementation manner of the second aspect to the fourth aspect, reference may be made to technical effects brought by different implementation manners of the first aspect, and details are not described herein again.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings required for the description of the embodiments are briefly introduced below. It is obvious that the drawings described below show only some embodiments of the present application, and that those skilled in the art may derive other drawings from them without inventive effort.
Fig. 1 is a schematic flow chart of a first tag determination method provided in an embodiment of the present application;
fig. 2 is a schematic flow chart of a second tag determination method provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of a sorting unit provided in an embodiment of the present application;
fig. 4 is a schematic flow chart of a third tag determination method provided in the embodiment of the present application;
fig. 5 is a schematic flow chart of a first method for determining a preset corresponding relationship provided in the embodiment of the present application;
fig. 6 is a schematic flow chart of a second method for determining a preset corresponding relationship provided in the embodiment of the present application;
fig. 7 is a schematic structural diagram of a tag determination apparatus according to an embodiment of the present application;
fig. 8 is a schematic block diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application clearer, the present application will be described in further detail with reference to the accompanying drawings, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present application, "a plurality" means two or more unless otherwise specified.
In the description of the present application, unless otherwise explicitly stated or limited, the term "coupled" is to be construed broadly and may mean, for example, directly coupled, indirectly coupled through an intermediate medium, or communication between two devices. The specific meaning of the above term in the present application can be understood by those of ordinary skill in the art as appropriate.
In data security, data acquisition is the leading stage of the data security life cycle, and determining data tags is an important part of that stage; the accuracy of the tags affects the security and confidentiality of the data. In different business scenarios, different tag standards are typically established.
In the related art, data tags are determined only for a single service scenario.
However, due to differences between industries and business functional fields, data classification standards need to be formulated according to the actual situation, and different classification standards exist within organizations because of different business scenarios. Therefore, the above approach cannot be applied flexibly across different service scenarios.
Based on this, the embodiment of the application provides a tag determination method, a tag determination device, an electronic device and a storage medium, and the method includes: after receiving data to be processed, determining model information corresponding to the category information based on the category information of the data to be processed; adjusting the initial classification model based on the corresponding model information to obtain a target classification model corresponding to the category information; and inputting the data to be processed into the target classification model to obtain the label of the data to be processed.
According to the above scheme, after the data to be processed is received, the initial classification model is adjusted according to the model information corresponding to the category information of the data to be processed, and the target classification model corresponding to the category information is obtained; the label of the data to be processed that matches the category information can then be obtained by inputting the data to be processed into the target classification model. Thus, for different service scenarios, different classification standards (category information) can be matched simply by adjusting the initial classification model, and the labels under each classification standard can be accurately obtained with a single model.
The following describes the technical solutions of the present application and how to solve the above technical problems in detail with reference to the accompanying drawings and specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments.
An embodiment of the present application provides a first tag determination method, which is applied to an electronic device, and as shown in fig. 1, the method may include:
step S101: after the data to be processed is received, determining model information corresponding to the category information based on the category information of the data to be processed.
In implementation, labels of data need to be determined for different service scenarios, so the category information of the data to be processed is not fixed, and the model parameters and other information to be used differ accordingly; based on this, after the data to be processed is received, the model information corresponding to the category information is determined based on the category information of the data to be processed.
Step S102: and adjusting the initial classification model based on the corresponding model information to obtain a target classification model corresponding to the class information.
In this embodiment, in order to reduce the consumption of limited model resources and increase the flexibility of model use, a separate target classification model is not trained for each piece of category information; instead, the same model is shared across all category information, and only information such as the model parameters is adjusted for different category information.
Based on this, the present embodiment adjusts the initial classification model based on the corresponding model information to obtain the target classification model corresponding to the category information.
Step S103: and inputting the data to be processed into the target classification model to obtain the label of the data to be processed.
According to the above scheme, after the data to be processed is received, the initial classification model is adjusted according to the model information corresponding to the category information of the data to be processed, and the target classification model corresponding to the category information is obtained; the label of the data to be processed that matches the category information can then be obtained by inputting the data to be processed into the target classification model. Thus, for different service scenarios, different classification standards (category information) can be matched simply by adjusting the initial classification model, and the labels under each classification standard can be accurately obtained with a single model.
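Viewed end to end, steps S101 to S103 can be sketched as follows in Python; the names preset_correspondence, adjust_model and the predict method are hypothetical interfaces introduced only for illustration and are not interfaces defined by the application:
```python
from typing import Any, Callable, Dict

# Minimal sketch of steps S101-S103. All names below are illustrative assumptions,
# not taken from the application itself.
def determine_label(
    data_to_process: Any,
    category_info: str,
    preset_correspondence: Dict[str, Dict[str, Any]],
    adjust_model: Callable[[Dict[str, Any]], Any],
) -> Any:
    model_info = preset_correspondence[category_info]   # step S101: category info -> model info
    target_model = adjust_model(model_info)             # step S102: adjust the initial model
    return target_model.predict(data_to_process)        # step S103: obtain the label(s)
```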
In some optional embodiments, the model information comprises a target number and target model parameters of the target classification unit;
correspondingly, an embodiment of the present application provides a second tag determination method, which is applied to an electronic device, and as shown in fig. 2, the method may include:
step S201: after the data to be processed is received, based on the category information of the data to be processed, determining the model information corresponding to the category information.
The specific implementation manner of step S201 may refer to the above embodiments, and is not described herein again.
Step S202: selecting the target number of classification units from an initial classification model as the target classification units.
In implementation, the labels of different category information have different hierarchies; for example, some standards require three levels of labels while others require four, and each level of label needs to be determined by a different classification unit;
Based on this, the model information needs to include the target number, and the target number of classification units are selected from the initial classification model as target classification units, each of which is used, after its parameters are set, to determine the label of the corresponding hierarchy.
The embodiment does not limit the specific implementation of the classification unit; for example, a Gated Recurrent Unit (GRU) may be used.
Referring to fig. 3, the GRU serves as the basic structure for remembering the context of a text document, and an attention mechanism is built on top of it so that the GRU can selectively focus on important elements in the text; optionally a word attention module is used, and the forward and reverse sequences are encoded with a bidirectional GRU. In implementation, the embedded word sequence may be fed into the bidirectional GRU, which produces a forward and a backward hidden state for each time step t. Attention is applied over the GRU states to produce a fixed-dimension vector representation; the max-pooled (Maxpool) and mean-pooled (Meanpool) representations of all GRU hidden states are combined with the attention vector to produce a sequence representation that is fed to the output layer; the output layer is a fully connected layer acting on the GRU state sequence, and its dimension is determined by the number of categories in the classification task.
Step S203: and aiming at any target classification unit, adjusting the current model parameters of the target classification unit to corresponding target model parameters to obtain the target classification model.
As described above, each target classification unit is used for determining the label of the corresponding hierarchy after parameter setting is performed;
based on this, the model information also needs to include target model parameters of each target classification unit, and the current model parameters of each target classification unit are adjusted to corresponding target model parameters, so that a target classification model for accurately determining each level of labels corresponding to the class information is obtained.
Step S204: and inputting the data to be processed into the target classification model to obtain the label of the data to be processed.
The specific implementation manner of step S204 can refer to the above embodiments, and is not described herein again.
According to the above scheme, the target number of classification units are selected from the initial classification model to serve as the target classification units, so that the label hierarchy of the corresponding category information is satisfied; the current model parameters of each target classification unit are adjusted to the corresponding target model parameters, thereby obtaining a target classification model that accurately determines each level of label of the corresponding category information.
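Concretely, steps S202 and S203 could be sketched as follows, under the assumption that the initial classification model keeps its classification units in an nn.ModuleList and that the target model parameters are stored as state dicts; the helper name and storage format are illustrative, not prescribed by the application:
```python
import copy
from typing import Dict, List

import torch
import torch.nn as nn

def build_target_model(initial_units: nn.ModuleList,
                       target_state_dicts: List[Dict[str, torch.Tensor]]) -> nn.ModuleList:
    """Pick the target number of classification units from the initial model and
    load the target model parameters into each of them (sketch)."""
    target_number = len(target_state_dicts)
    target_units = nn.ModuleList(
        [copy.deepcopy(unit) for unit in list(initial_units)[:target_number]]
    )
    for unit, state_dict in zip(target_units, target_state_dicts):
        unit.load_state_dict(state_dict)  # adjust current parameters to the target parameters
    return target_units
```
The data to be processed would then be passed through each target classification unit in turn, each unit outputting the label of its own hierarchy level.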
An embodiment of the present application provides a third tag determination method, which is applied to an electronic device, and as shown in fig. 4, the method may include:
step S401: after the data to be processed is received, determining model information corresponding to the category information of the data to be processed based on a preset corresponding relation.
The preset corresponding relation comprises a corresponding relation between preset category information and model information.
In this embodiment, by setting the preset corresponding relationship between different types of information and model information, the model information corresponding to the type information of each piece of data to be processed can be accurately determined based on the preset corresponding relationship.
Step S402: adjusting the initial classification model based on the corresponding model information to obtain a target classification model corresponding to the class information;
step S403: and inputting the data to be processed into the target classification model to obtain the label of the data to be processed.
The specific implementation manner of steps S402 to S403 may refer to the above embodiments, and details are not described herein.
According to the scheme, the preset corresponding relation between different types of information and the model information is set, and the model information corresponding to the type information of each piece of data to be processed can be accurately determined based on the preset corresponding relation.
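A preset corresponding relationship of this kind might simply be a lookup table keyed by category information; in the sketch below, the standard names, target numbers and checkpoint file names are invented placeholders, not values from the application:
```python
# Illustrative structure for the preset corresponding relationship: each piece of
# category information maps to its model information, i.e. the target number of
# classification units and the target model parameters (stored here as file paths).
PRESET_CORRESPONDENCE = {
    "classification_standard_A": {
        "target_number": 3,
        "target_model_parameters": ["std_a_level1.pt", "std_a_level2.pt", "std_a_level3.pt"],
    },
    "classification_standard_B": {
        "target_number": 4,
        "target_model_parameters": ["std_b_level1.pt", "std_b_level2.pt",
                                    "std_b_level3.pt", "std_b_level4.pt"],
    },
}

# Step S401: determine the model information for the category information of the
# received data to be processed.
model_info = PRESET_CORRESPONDENCE["classification_standard_A"]
```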
Referring to fig. 5, in some alternative embodiments, the preset corresponding relationship is obtained as follows:
step S501: and determining the target quantity corresponding to the category information based on the hierarchical relation of the sample labels of the category information aiming at any preset category information.
As described above, the labels of different types of information have different levels, and each level of label needs to be determined by different classification units;
based on this, when performing model training, the target number (the required number of classification units) corresponding to each piece of category information is determined based on the hierarchical relationship of the sample labels of each piece of category information.
Step S502: and training the target classification units of the target number based on the sample data of the category information and the sample labels to obtain target model parameters corresponding to the category information.
In this embodiment, in order to determine the target model parameters corresponding to each category of information, the target classification unit needs to be trained based on the sample data of each category of information and the corresponding sample label.
For any category information, the sample data of that category information and the corresponding sample labels are taken as input, the predicted labels are taken as output, and the similarity between the sample labels and the predicted labels is taken as the optimization objective; the target classification unit is trained on this basis, and after training is finished the model parameters of the target classification unit are taken as the target model parameters.
The present embodiment does not limit the specific implementation manner of the sample data, such as sample data (positive example) with a sample label and sample data (negative example) without a sample label.
As described above, the classification unit may employ a GRU. Illustratively, a single-layer GRU with 136 hidden units may be used, with attention added on top of the GRU layer; a dropout probability of 0.5 is applied to the GRU output; and training is performed for 10 epochs (one epoch being one full pass over the training data) with the batch size set to 128.
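Under those hyperparameters, training a single target classification unit could be sketched as follows; the Adam optimizer and cross-entropy loss are assumptions made for this sketch, and the unit could be the GRU module sketched earlier:
```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_classification_unit(unit: nn.Module,
                              sample_data: torch.Tensor,
                              sample_labels: torch.Tensor,
                              epochs: int = 10,
                              batch_size: int = 128) -> dict:
    """Train one target classification unit on the sample data of one category
    information and return its trained parameters as the target model parameters."""
    loader = DataLoader(TensorDataset(sample_data, sample_labels),
                        batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adam(unit.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()  # drives the predicted labels toward the sample labels

    unit.train()
    for _ in range(epochs):
        for batch_data, batch_labels in loader:
            optimizer.zero_grad()
            loss = criterion(unit(batch_data), batch_labels)
            loss.backward()
            optimizer.step()
    return unit.state_dict()  # target model parameters for this category information
```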
According to the scheme, the target quantity corresponding to each type of information is accurately determined based on the hierarchical relation of the sample labels of each type of information; and training the target classification unit through the sample data of each category of information and the sample label, and accurately determining target model parameters suitable for each category of information.
The embodiment of the present application provides a second method for determining a preset corresponding relationship, as shown in fig. 6, the method may include:
step S601: and determining the target quantity corresponding to the category information based on the hierarchical relation of the sample labels of the category information aiming at any preset category information.
The specific implementation manner of step S601 may refer to the above embodiments, and is not described herein again.
Step S602: and aiming at the target classification unit of any level, training the target classification unit based on the sample data of the class information and the sample label corresponding to the level to obtain the target model parameter of the target classification unit of the level.
Because the label hierarchies of different category information differ, each level of label needs to be determined by a different classification unit; therefore, the sample labels of the target classification units of each level are different, and so are the corresponding target model parameters;
based on this, in this embodiment, the target classification units of each hierarchy are trained based on the sample data and the sample labels corresponding to each hierarchy, so as to obtain the target model parameters of the target classification units of each hierarchy.
According to the scheme, due to the fact that label levels of different types of information are different, each level of label needs to be determined through different classification units, and the target classification units of all levels are trained through sample data and sample labels corresponding to all levels, so that target model parameters of the target classification units of all levels are obtained.
In some optional embodiments, if a target classification unit of a previous level exists for the target classification unit, before training the target classification unit based on the sample data of the category information and the sample label corresponding to the level, the following steps are further performed:
after the training of the target classification unit of the previous level is finished, adjusting the initial model parameters of the target classification unit of the level to the target model parameters of the target classification unit of the previous level.
In implementation, if each classification unit starts training from randomly initialized parameters, training takes a long time; since labels of adjacent hierarchies have a certain association, the target model parameters of the previous hierarchy can be inherited in a knowledge-distillation-like manner (for example, the initial model parameters of all layers of the target classification unit of the current hierarchy, except the output layer, are set to the target model parameters of the target classification unit of the previous hierarchy), so that iteration converges more quickly and the target model parameters of the current hierarchy are obtained.
Illustratively, the target classification unit of each level is the "parent" of the target classification unit of the next level, and the "child" of the target classification unit of the previous level.
In implementation, a lower learning rate may be applied to the migrated parameters (those inherited from the target classification unit of the previous level), and a higher learning rate may be used for the final fully connected classification (output) layer. With Adam (adaptive moment estimation) as the optimizer, the learning rate of the fully connected layer is set to 0.001 (a high learning rate), because all parameters in this layer are randomly initialized and should be re-adjusted to the best possible values; the learning rates of the other layers are set to values below 0.001 so as to retain the model knowledge of the target classification unit of the previous level. After the target classification unit of the uppermost level has been trained, the embedding layer is frozen to prevent the classifiers of the lower-level categories from overfitting.
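One way to express this parameter migration and the layer-wise learning rates, assuming the GRU unit sketched earlier; the specific low learning rate of 1e-4 is an assumed stand-in for "below 0.001":
```python
import copy

import torch
import torch.nn as nn

def prepare_next_level_unit(previous_unit: nn.Module, next_level_num_classes: int):
    """Parameter migration for the next hierarchy level (sketch): inherit all
    parameters except the output layer, freeze the embedding layer, and give the
    freshly initialized output layer a higher learning rate."""
    unit = copy.deepcopy(previous_unit)  # inherit the previous level's target model parameters
    unit.output = nn.Linear(unit.output.in_features, next_level_num_classes)  # new output layer
    for param in unit.embedding.parameters():
        param.requires_grad = False      # embedding frozen once the uppermost level is trained

    optimizer = torch.optim.Adam([
        {"params": unit.output.parameters(), "lr": 1e-3},   # randomly initialized layer: high lr
        {"params": unit.gru.parameters(), "lr": 1e-4},      # migrated layers: low lr
        {"params": unit.attention.parameters(), "lr": 1e-4},
    ])
    return unit, optimizer
```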
According to the above scheme, parameter migration provides a better training starting point for the target classification unit of the next level, which speeds up model training and improves iteration efficiency.
In some alternative embodiments, the step S602 can be implemented by, but not limited to, the following ways:
for any sample label corresponding to the hierarchy, determining the sampling weight of the sample label based on the quantity of sample data corresponding to the sample label;
and sampling the sample label and the corresponding sample data based on the sampling weight, and training the target classification unit based on the sample data obtained by sampling and the sample label.
Illustratively, since there are many sample data categories (i.e., many sample labels), an imbalance in the amount of sample data between different categories has a great influence on the training difficulty; a weighted sampling method can therefore be adopted, in which different categories are given different weights according to their amounts of sample data (the weight is inversely proportional to the amount) and the sampling frequency is determined by the weight, i.e., the less sample data a sample label has, the more often it is repeatedly sampled.
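A sketch of this weighted sampling, assuming integer label ids and the standard PyTorch sampler (the helper name and data layout are illustrative assumptions):
```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

def make_weighted_loader(sample_data: torch.Tensor,
                         sample_labels: torch.Tensor,
                         batch_size: int = 128) -> DataLoader:
    """Each sample label receives a weight inversely proportional to how many
    samples carry it, so under-represented labels are drawn more often."""
    label_counts = torch.bincount(sample_labels)
    label_weights = 1.0 / label_counts.float()      # weight inversely proportional to the count
    sample_weights = label_weights[sample_labels]   # per-sample sampling weight
    sampler = WeightedRandomSampler(sample_weights,
                                    num_samples=len(sample_labels),
                                    replacement=True)
    return DataLoader(TensorDataset(sample_data, sample_labels),
                      batch_size=batch_size, sampler=sampler)
```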
As shown in fig. 7, based on the same inventive concept, an embodiment of the present application provides a tag determination apparatus 700, including:
the model information determining module 701 is configured to determine, after receiving data to be processed, model information corresponding to category information based on the category information of the data to be processed;
a model adjusting module 702, configured to adjust the initial classification model based on the corresponding model information to obtain a target classification model corresponding to the category information;
a label determining module 703, configured to input the data to be processed into the target classification model, so as to obtain a label of the data to be processed.
In some optional embodiments, the model information comprises a target number and target model parameters of the target classification unit; the model adjustment module 702 is specifically configured to:
selecting the target number of classification units from an initial classification model as the target classification units;
and aiming at any target classification unit, adjusting the current model parameters of the target classification unit to corresponding target model parameters to obtain the target classification model.
In some optional embodiments, the model information determining module 701 is specifically configured to:
determining model information corresponding to the category information of the data to be processed based on a preset corresponding relation; the preset corresponding relation comprises a corresponding relation between preset category information and model information.
In some optional embodiments, the model information comprises a target number and target model parameters of the target classification unit; determining the preset corresponding relationship by:
for any preset category information, determining the target quantity corresponding to the category information based on the hierarchical relation of the sample labels of the category information;
and training the target classification units of the target quantity based on the sample data of the category information and the sample labels to obtain target model parameters corresponding to the category information.
In some optional embodiments, training the target classification units of the target number based on the sample data of the category information and the sample labels to obtain target model parameters corresponding to the category information includes:
and training the target classification units based on the sample data of the class information and the sample labels corresponding to the levels aiming at the target classification units of any level to obtain target model parameters of the target classification units of the levels.
In some optional embodiments, if a target classification unit of a previous level exists for the target classification unit, before training the target classification unit based on the sample data of the category information and the sample label corresponding to the level, the method further includes:
after the training of the target classification unit of the previous level is finished, adjusting the initial model parameters of the target classification unit of the level to the target model parameters of the target classification unit of the previous level.
In some optional embodiments, training the target classification unit based on sample data of the category information and sample labels corresponding to the hierarchies includes:
for any sample label corresponding to the hierarchy, determining the sampling weight of the sample label based on the number of sample data corresponding to the sample label;
and sampling the sample label and corresponding sample data based on the sampling weight, and training the target classification unit based on the sample data obtained by sampling and the sample label.
Since the apparatus is the apparatus in the method in the embodiment of the present application, and the principle of the apparatus for solving the problem is similar to that of the method, the implementation of the apparatus may refer to the implementation of the method, and repeated details are not repeated.
As shown in fig. 8, based on the same inventive concept, an embodiment of the present application provides an electronic device 800, including: a processor 801 and a memory 802;
the memory 802 may be a volatile memory (volatile memory), such as a random-access memory (RAM); the memory 802 may also be a non-volatile memory (non-volatile memory), such as a read-only memory (rom), a flash memory (flash memory), a hard disk (HDD) or a solid-state drive (SSD); or memory 802 is any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to such. The memory 802 may be a combination of the above.
The processor 801 may include one or more Central Processing Units (CPUs), graphics Processing Units (GPUs), or digital Processing units (dsps), among others.
The specific connection medium between the memory 802 and the processor 801 is not limited in the embodiments of the present application. In the embodiment of the present application, the memory 802 and the processor 801 are connected by a bus 803 in fig. 8, the bus 803 is represented by a thick line in fig. 8, and the bus 803 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 8, but that does not indicate only one bus or one type of bus.
Wherein the memory 802 stores program code which, when executed by the processor 801, causes the processor 801 to perform the following:
after receiving data to be processed, determining model information corresponding to the category information based on the category information of the data to be processed;
adjusting the initial classification model based on the corresponding model information to obtain a target classification model corresponding to the category information;
and inputting the data to be processed into the target classification model to obtain the label of the data to be processed.
In some optional embodiments, the model information comprises a target number and target model parameters of the target classification unit; the processor 801 specifically executes:
selecting the target number of classification units from an initial classification model as the target classification units;
and aiming at any target classification unit, adjusting the current model parameters of the target classification unit to corresponding target model parameters to obtain the target classification model.
In some optional embodiments, the processor 801 specifically performs:
determining model information corresponding to the category information of the data to be processed based on a preset corresponding relation; the preset corresponding relation comprises a corresponding relation between preset category information and model information.
In some optional embodiments, the model information comprises a target number and target model parameters of the target classification unit; determining the preset corresponding relationship by:
for any preset category information, determining the target quantity corresponding to the category information based on the hierarchical relation of the sample labels of the category information;
and training the target classification units of the target quantity based on the sample data of the category information and the sample labels to obtain target model parameters corresponding to the category information.
In some optional embodiments, training the target classification units of the target number based on the sample data of the category information and the sample labels to obtain target model parameters corresponding to the category information includes:
and aiming at the target classification unit of any level, training the target classification unit based on the sample data of the class information and the sample label corresponding to the level to obtain the target model parameter of the target classification unit of the level.
In some optional embodiments, if a target classification unit of a previous level exists for the target classification unit, before training the target classification unit based on the sample data of the category information and the sample label corresponding to the level, the method further includes:
after the training of the target classification unit of the previous level is finished, adjusting the initial model parameters of the target classification unit of the level to the target model parameters of the target classification unit of the previous level.
In some optional embodiments, training the target classification unit based on sample data of the category information and sample labels corresponding to the hierarchies includes:
for any sample label corresponding to the hierarchy, determining the sampling weight of the sample label based on the quantity of sample data corresponding to the sample label;
and sampling the sample label and corresponding sample data based on the sampling weight, and training the target classification unit based on the sample data obtained by sampling and the sample label.
Since the electronic device is an electronic device for executing the method in the embodiment of the present application, and the principle of the electronic device for solving the problem is similar to that of the method, reference may be made to the implementation of the method for the implementation of the electronic device, and repeated details are not described again.
Embodiments of the present application provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the above-mentioned tag determination method. The readable storage medium may be a nonvolatile readable storage medium, among others.
The present application is described above with reference to block diagrams and/or flowchart illustrations of methods, apparatus (systems) and/or computer program products according to embodiments of the application. It will be understood that one block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, and/or other programmable apparatus to produce a machine, such that the instructions, which execute via the processor of the computer and/or other programmable apparatus, create means for implementing the functions/acts specified in the block diagrams and/or flowchart block or blocks.
Accordingly, the subject application may also be embodied in hardware and/or in software (including firmware, resident software, micro-code, etc.). Furthermore, the present application may take the form of a computer program product on a computer-usable or computer-readable storage medium having computer-usable or computer-readable program code embodied in the medium for use by or in connection with an instruction execution system. In the context of this application, a computer-usable or computer-readable medium may be any medium that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.
Claims (10)
1. A method for tag determination, the method comprising:
after receiving data to be processed, determining model information corresponding to the category information based on the category information of the data to be processed;
adjusting the initial classification model based on the corresponding model information to obtain a target classification model corresponding to the class information;
and inputting the data to be processed into the target classification model to obtain the label of the data to be processed.
2. The method of claim 1, wherein the model information includes a target number and target model parameters of target classification units; adjusting the initial classification model based on the corresponding model information to obtain a target classification model corresponding to the category information, including:
selecting the target number of classification units from an initial classification model as the target classification units;
and aiming at any target classification unit, adjusting the current model parameters of the target classification unit to corresponding target model parameters to obtain the target classification model.
3. The method of claim 1, wherein determining model information corresponding to the category information based on the category information of the data to be processed comprises:
determining model information corresponding to the category information of the data to be processed based on a preset corresponding relation; the preset corresponding relation comprises a corresponding relation between preset category information and model information.
4. The method of claim 3, wherein the model information includes a number of targets and target model parameters for target classification units; determining the preset corresponding relationship by:
for any preset category information, determining the target quantity corresponding to the category information based on the hierarchical relation of the sample labels of the category information;
and training the target classification units of the target quantity based on the sample data of the category information and the sample labels to obtain target model parameters corresponding to the category information.
5. The method of claim 4, wherein training the target classification units of the target number based on the sample data and the sample labels of the category information to obtain target model parameters corresponding to the category information comprises:
and aiming at the target classification unit of any level, training the target classification unit based on the sample data of the class information and the sample label corresponding to the level to obtain the target model parameter of the target classification unit of the level.
6. The method of claim 5, wherein if there is a target classification unit at a previous level of the target classification unit, before training the target classification unit based on the sample data of the category information and the sample label corresponding to the level, the method further comprises:
after the training of the target classification unit of the previous level is finished, adjusting the initial model parameters of the target classification unit of the level to the target model parameters of the target classification unit of the previous level.
7. The method of claim 5, wherein training the target classification unit based on sample data of the category information and sample labels corresponding to the hierarchy comprises:
for any sample label corresponding to the hierarchy, determining the sampling weight of the sample label based on the number of sample data corresponding to the sample label;
and sampling the sample label and the corresponding sample data based on the sampling weight, and training the target classification unit based on the sample data obtained by sampling and the sample label.
8. A tag determination apparatus, characterized in that the apparatus comprises:
the model information determining module is used for determining model information corresponding to the category information based on the category information of the data to be processed after the data to be processed is received;
the model adjusting module is used for adjusting the initial classification model based on the corresponding model information to obtain a target classification model corresponding to the category information;
and the label determining module is used for inputting the data to be processed into the target classification model to obtain the label of the data to be processed.
9. An electronic device, characterized in that the electronic device comprises at least one processor and at least one memory, wherein the memory stores a computer program that, when executed by the processor, causes the processor to carry out the method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that it stores a computer program executable by an electronic device, which program, when run on the electronic device, causes the electronic device to carry out the method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211657142.6A CN115905656A (en) | 2022-12-22 | 2022-12-22 | Label determining method and device, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211657142.6A CN115905656A (en) | 2022-12-22 | 2022-12-22 | Label determining method and device, electronic equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115905656A true CN115905656A (en) | 2023-04-04 |
Family
ID=86483261
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211657142.6A Pending CN115905656A (en) | 2022-12-22 | 2022-12-22 | Label determining method and device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115905656A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117539828A (en) * | 2023-11-20 | 2024-02-09 | 北京百度网讯科技有限公司 | Path information generation method and device |
CN118113158A (en) * | 2024-03-18 | 2024-05-31 | 北京极溯光学科技有限公司 | Sight line tracking method, device, equipment and storage medium |
- 2022-12-22: CN application CN202211657142.6A (publication CN115905656A), active, Pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |