CN113760484B - Data processing method and device
- Publication number: CN113760484B
- Application number: CN202010605126.7A
- Authority
- CN
- China
- Prior art keywords
- data processing
- attribute
- task
- samples
- characteristic
- Prior art date
- Legal status: Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
Abstract
The invention discloses a data processing method and device, and relates to the technical field of computers. One embodiment of the method comprises the following steps: collecting a plurality of data processing samples, the data processing samples having a plurality of characteristic attributes and corresponding levels, wherein the levels indicate processing priorities of the data processing samples; for each data processing sample, judging whether the attribute value of each characteristic attribute is missing, and if so, filling in the corresponding attribute value for that characteristic attribute; training a classification model with the data processing samples having complete attribute values and their corresponding levels to obtain a task level model; determining the level of a new data processing task by using the task level model; and determining the processing strategy of the new data processing task according to its level. With the method and device, a more accurate task level model can be constructed, so that the processing strategy determined for a data processing task is more reasonable.
Description
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method and an apparatus for data processing.
Background
As data warehouses grow in size, they need to run thousands of data processing tasks each day. Reasonably grading these data processing tasks, distinguishing which are important and which are secondary, and then adopting a corresponding data processing scheme directly affects the stability of the warehouse data and the credibility of the data.
At present, the data processing tasks running in a data warehouse are mainly graded by having a decision maker manually and subjectively judge, based on personal experience, how important each task is, assign it a grade accordingly, and then determine the processing strategy of the task according to that manually determined grade.
In the process of implementing the present invention, the inventor finds that at least the following problems exist in the prior art:
Because this existing way of grading data processing tasks relies mainly on experience, the efficiency and accuracy of determining the grade of a data processing task are low, the processing strategy determined for the task is unreasonable, and user experience is affected.
Disclosure of Invention
In view of this, embodiments of the invention provide a data processing method and device that construct a task level model to effectively improve the efficiency and accuracy of determining the level of a data processing task, so that the processing strategy determined for the data processing task is reasonable and user experience is improved.
To achieve the above object, according to one aspect of an embodiment of the present invention, there is provided a method of data processing, including:
collecting a plurality of data processing samples, the data processing samples having a plurality of characteristic attributes and corresponding ranks, wherein the ranks are indicative of processing priorities of the data processing samples;
for each data processing sample, judging whether the attribute value of each characteristic attribute is missing, and if so, filling in the corresponding attribute value for that characteristic attribute;
Training a classification model by utilizing a data processing sample with a complete attribute value and a corresponding grade to obtain a task grade model;
And determining the grade of a new data processing task by using the task grade model, and determining the processing strategy of the new data processing task according to the grade of the new data processing task.
Preferably, the method of data processing further comprises: dividing a plurality of data processing samples into a complete data set and a missing data set, wherein the data processing samples divided into the complete data set have complete attribute values, and the data processing samples divided into the missing data set have missing attribute values;
and executing the step of judging whether the attribute value of the characteristic attribute is missing or not aiming at each characteristic attribute of each data processing sample in the missing data set.
Preferably, the method of data processing further comprises:
Generating a code with a preset bit number for the data processing sample according to attribute values corresponding to the plurality of characteristic attributes of the data processing sample;
Dividing the complete data set into at least one coding set according to the codes of the data processing samples, wherein the codes corresponding to all the data processing samples belonging to the same coding set are equal;
filling the corresponding attribute value for the characteristic attribute, which comprises the following steps:
matching a target code set corresponding to the data processing sample in the missing data set according to the code corresponding to the data processing sample in the missing data set;
And filling the missing attribute values of the data processing samples in the missing data set according to the attribute values of the data processing samples corresponding to the target coding set.
Preferably, the method comprises the steps of,
Generating a first signature value with a preset bit number for each characteristic attribute according to the attribute value of each characteristic attribute in the data processing sample, wherein each bit value in the first signature value belongs to a value in a preset numerical range;
and calculating codes corresponding to the data processing samples according to the preset weights corresponding to the characteristic attributes and the first signature values corresponding to each characteristic attribute.
Preferably, the step of generating a code of a preset number of bits for the data processing samples comprises:
Generating a signature value with a preset bit number for each characteristic attribute according to the attribute value of each characteristic attribute in the data processing sample;
according to a preset replacement strategy, carrying out numerical replacement on each digit value in the signature value;
And calculating codes corresponding to the data processing samples according to the preset weights corresponding to the characteristic attributes and the numerical value replaced results corresponding to each characteristic attribute.
Preferably, the replacement policy comprises:
The value zero is replaced with a first value and the non-zero value is replaced with a second value.
Preferably, before the step of training the classification model with the data processing samples having complete attribute values and corresponding ranks, further comprising:
calculating the information gain of each characteristic attribute based on the data processing samples processed by the attribute values;
Selecting a plurality of target characteristic attributes according to the information gain of each characteristic attribute;
A step of training a classification model using data processing samples with complete attribute values and corresponding levels, comprising:
The classification model is trained using data processing samples having attribute values for the complete target feature attributes.
Preferably, the step of calculating the information gain corresponding to each of the feature attributes includes:
for each of the characteristic attributes described,
Calculating a first information entropy by utilizing the probability of attribute values corresponding to the characteristic attributes in the plurality of data processing samples;
classifying the data processing samples according to attribute values corresponding to the characteristic attributes;
calculating a second information entropy according to the probability of each category in the classification result and the probability of the attribute value corresponding to the characteristic attribute;
And calculating the information gain corresponding to the characteristic attribute by using the first information entropy and the second information entropy.
Preferably, the method comprises the steps of,
The plurality of target feature attributes includes: any two or three of a service duration for performing the data processing task, a number of recursive subtasks required for performing the data processing task, and a number of uses of the data warehouse model required for performing the data processing task.
Preferably, the method of data processing further comprises: setting a data processing sequence table, wherein the data processing sequence table is used for storing data processing tasks according to the processing sequence;
The step of determining the processing strategy of the new data processing task comprises the following steps:
Inserting the new data processing task into the data processing sequence table according to the grade of the new data processing task;
and processing new data processing tasks stored in the data processing sequence table according to the storage sequence of the data processing sequence table.
Preferably, the method of data processing further comprises: allocating a corresponding monitoring scheme for each grade, wherein the monitoring scheme comprises monitoring time intervals corresponding to different grades;
The step of determining the processing strategy of the new data processing task comprises the following steps:
monitoring the new data processing task according to the monitoring time intervals corresponding to different levels;
And when the monitored result indicates that the new data processing task reaches the processing time limit and/or the processing of the data processing task of the last level corresponding to the new data processing task is completed, processing the new data processing task.
In a second aspect, an embodiment of the present invention provides an apparatus for data processing, including: the system comprises an acquisition unit, a filling unit, a training unit and a processing unit, wherein,
The acquisition unit is used for acquiring a plurality of data processing samples, wherein the data processing samples have a plurality of characteristic attributes and corresponding grades, and the grades indicate the processing priority of the data processing samples;
The filling unit is used for judging whether the attribute value of each characteristic attribute of the data processing sample acquired by the acquisition unit is missing or not, and filling the corresponding attribute value for the characteristic attribute if the attribute value of each characteristic attribute is missing;
The training unit is used for training the classification model by utilizing the data processing sample with the complete attribute value and the corresponding grade acquired by the acquisition unit and the filling unit to acquire a task grade model;
The processing unit is used for determining the grade of a new data processing task by using the task grade model obtained by the training unit, and determining the processing strategy of the new data processing task according to the grade of the new data processing task.
One embodiment of the above application has the following advantages or beneficial effects: in general, the trained task level model can grade data processing tasks by means of computing resources, so compared with manual grading, the scheme provided by the embodiment of the application can effectively improve the grading efficiency of data processing tasks. In addition, because the completeness of the attribute values of the characteristic attributes of the data processing samples used to train the classification model directly affects the accuracy of the trained task level model, the missing attribute values of characteristic attributes are filled in, ensuring that the data processing samples used to train the classification model have complete attribute values. A more accurate task level model is therefore trained, the accuracy of grading data processing tasks is effectively improved, and determining the processing strategy of a data processing task according to its grade makes data processing planning more reasonable and improves user experience.
Further effects of the above-described non-conventional alternatives are described below in connection with the embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic diagram of the main flow of a method of data processing according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a main flow of filling in attribute values missing from a characteristic attribute of a data processing sample according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the main flow of generating a code of a preset number of bits according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a main process of selecting a plurality of target feature attributes according to an embodiment of the invention;
FIG. 5 is a schematic diagram of a task level model portion structure according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of the main flow of determining a processing policy for a new data processing task according to an embodiment of the invention;
FIG. 7 is a schematic diagram of another main flow of determining a processing policy for a new data processing task according to an embodiment of the invention;
FIG. 8 is a schematic diagram of the main units of an apparatus for data processing according to an embodiment of the present invention;
FIG. 9 is an exemplary system architecture diagram in which embodiments of the present invention may be applied;
FIG. 10 is a schematic diagram of a computer system suitable for use with a server implementing an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present invention are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
FIG. 1 is a method of data processing according to an embodiment of the present invention, as shown in FIG. 1, the method of data processing may include the steps of:
S101: collecting a plurality of data processing samples, the data processing samples having a plurality of characteristic attributes and corresponding ranks, wherein the ranks indicate processing priorities of the data processing samples;
S102: For the data processing sample, judging whether the attribute value of each characteristic attribute is missing, if so, executing S103; otherwise, executing S104;
S103: Filling the corresponding attribute value for the characteristic attribute;
S104: Training a classification model by utilizing a data processing sample with a complete attribute value and a corresponding grade to obtain a task grade model;
S105: determining the grade of a new data processing task by using a task grade model;
S106: and determining the processing strategy of the new data processing task according to the grade of the new data processing task.
The plurality of data processing samples collected in step S101 may be received from users through terminals, or may be collected from an existing data warehouse.
Wherein a data processing sample refers to a data processing task with a processing priority or level that has been processed by the data warehouse.
A feature attribute is information capable of characterizing the features of a data processing task, such as the name of the data processing task, the type of the data processing task, the CPU limit of the data processing task, the memory limit of the data processing task, the task version, whether an alarm is raised, the service duration required for executing the data processing task, the number of recursive subtasks required for executing the data processing task, and the number of times the data warehouse model required for executing the data processing task is used.
Wherein, an attribute value is the value of a feature attribute after digitization. The digitization may be performed according to preset digitization rules.
The task level model may be a decision tree obtained through the ID3 algorithm; that is, training the classification model may specifically mean generating a decision tree from the plurality of data processing samples using the ID3 algorithm.
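To make this step concrete, the following is a minimal sketch of how such a task level model could be trained from digitized samples. The feature names, example values, and the use of scikit-learn are illustrative assumptions rather than part of the patent; DecisionTreeClassifier with criterion="entropy" is used as a stand-in for ID3, although it builds a binary CART tree rather than a multiway ID3 tree.

```python
# Sketch only: a stand-in for the ID3-based task level model described above.
from sklearn.tree import DecisionTreeClassifier

# Digitized samples: [service_duration, recursive_subtasks, model_uses] per task,
# with an already-known level label (values are illustrative).
X = [
    [3, 2, 10],   # long service, many subtasks, heavily used
    [1, 0, -1],   # short service, few subtasks, rarely used
    [2, 1, 5],
    [3, 2, 5],
    [1, 1, -1],
]
y = ["L1", "L3", "L2", "L1", "L3"]

# criterion="entropy" mirrors ID3's information-gain splitting.
task_level_model = DecisionTreeClassifier(criterion="entropy").fit(X, y)

# Grading a new data processing task.
print(task_level_model.predict([[2, 0, 10]]))  # e.g. ['L1'], depending on the fitted tree
```

A hand-rolled ID3 implementation would instead split on the information gain described below in connection with formulas (1) to (3).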
In general, the task level model obtained through training can grade tasks by means of computing resources, so compared with manual grading, the scheme provided by the embodiment of the application can effectively improve the grading efficiency of data processing tasks. In addition, because the completeness of the attribute values of the characteristic attributes of the data processing samples used to train the classification model directly affects the accuracy of the trained task level model, the missing attribute values of characteristic attributes are filled in, ensuring that the data processing samples used to train the classification model have complete attribute values. A more accurate task level model is therefore trained, the accuracy of grading data processing tasks is effectively improved, and determining the processing strategy of a data processing task according to its grade makes data processing planning more reasonable and improves user experience.
In an embodiment of the present invention, the data processing method may further include: dividing the plurality of data processing samples into a complete data set and a missing data set, wherein the data processing samples divided into the complete data set have complete attribute values and the data processing samples divided into the missing data set have missing attribute values; and executing the step of judging whether the attribute value corresponding to a characteristic attribute is missing only for each characteristic attribute of each data processing sample in the missing data set. By dividing the data processing samples into a complete data set and a missing data set and supplementing attribute values only for the data processing samples in the missing data set, the workload of the judging process is effectively reduced and computing resources are saved.
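A minimal sketch of this split, assuming each data processing sample is represented as a dictionary of attribute values and that None marks a missing value (both the representation and the sentinel are assumptions):

```python
def split_samples(samples):
    """Divide samples into a complete data set and a missing data set."""
    complete_set, missing_set = [], []
    for sample in samples:
        if any(value is None for value in sample.values()):
            missing_set.append(sample)    # at least one attribute value is missing
        else:
            complete_set.append(sample)   # all attribute values are present
    return complete_set, missing_set

complete_set, missing_set = split_samples([
    {"service_duration": 3.5, "recursive_subtasks": 12, "model_uses": 7},
    {"service_duration": None, "recursive_subtasks": 4, "model_uses": 2},
])
```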
Wherein the plurality of characteristic attributes may be as shown in table 1 below.
TABLE 1
The service duration is the duration of service of the data warehouse for the data processing task; the number of recursive subtasks refers to the number of recursive subtasks that the data warehouse needs to execute in the process of executing the data processing tasks; the number of times a data warehouse model is used refers to the number of times a data warehouse model is used to perform a certain data processing task over a period of time.
In an embodiment of the present invention, as shown in fig. 2, the method for processing data may further include the following steps:
S201: generating a code of a preset bit number for the data processing sample according to attribute values corresponding to a plurality of characteristic attributes of the data processing sample;
S202: Dividing the complete data set into at least one coding set according to the codes of the data processing samples, wherein the codes corresponding to all the data processing samples belonging to the same coding set are equal;
S203: Matching the data processing sample in the missing data set with a corresponding target coding set according to the code corresponding to the data processing sample in the missing data set;
S204: Filling the attribute values missing from the data processing samples in the missing data set according to the attribute values of the data processing samples corresponding to the target coding set.
Wherein the above-described step S203 and step S204 may be performed for each data processing sample in the missing data set.
Step S203 and step S204 are specific implementation manners of filling the corresponding attribute values for the missing attribute value feature attributes.
Wherein, the code refers to a digit string with fixed digits.
There may be two specific implementations of S201:
the first implementation mode:
For a combination of multiple characteristic attributes of a data processing sample, performing: and generating a corresponding code with a preset bit number for the data processing sample by utilizing the attribute value corresponding to the characteristic attribute included by the combination of the plurality of characteristic attributes.
For example, if the data processing sample includes the characteristic attributes A, B, C, D and E, the combinations of characteristic attributes may be, for example: characteristic attribute A combined with characteristic attribute B; characteristic attribute A combined with characteristic attribute C; characteristic attributes A, B and C combined; characteristic attributes A, B, C, D and E combined; and so on. For the combination of characteristic attributes A and B, a corresponding code of a preset number of bits is generated for the data processing sample using the attribute value of characteristic attribute A and the attribute value of characteristic attribute B; for the combination of characteristic attributes A, B and C, a corresponding code of a preset number of bits is generated using the attribute values of characteristic attributes A, B and C; and so on.
The second implementation mode:
And generating a corresponding code with a preset bit number for the data processing sample by utilizing the attribute value corresponding to each of the plurality of characteristic attributes of the data processing sample.
Wherein the second implementation is a special case of the first implementation. For example, if the data processing sample includes the characteristic attributes A, B, C, D and E, a corresponding code of a preset number of bits is generated for the data processing sample using the attribute values corresponding to characteristic attributes A, B, C, D and E of the sample.
The second implementation simplifies the process of generating the code with the preset bit number for the data processing sample, and can effectively reduce the consumption of computing resources.
In a preferred embodiment, the first implementation is selected to generate the corresponding codes of a preset number of bits for a data processing sample, so that the data processing sample corresponds to several codes of a preset number of bits. The codes obtained in this way allow the coding sets of the data processing samples to be divided more precisely according to the characteristic attributes, which helps ensure the accuracy of the subsequent matching results.
The specific embodiments of step S203 may include two types:
In one embodiment, a data processing sample in the missing data set is matched to its corresponding target coding set by finding the coding set whose code is identical to the code of that data processing sample.
This implementation is suited to the first implementation, provided by the above embodiments, of generating a code of a preset number of bits for the data processing samples.
In another embodiment, a data processing sample in the missing data set is matched to its corresponding target coding set by finding the coding set whose code has the smallest difference from the code of that data processing sample.
This implementation is suited to the second implementation, provided by the above embodiments, of generating a code of a preset number of bits for the data processing samples.
By dividing the complete data set into at least one coding set in which the codes corresponding to all data processing samples belonging to the same coding set are equal, similar data processing samples are grouped into the same coding set. Matching a corresponding target coding set for each data processing sample in the missing data set then amounts to finding, for each such sample, the closest coding set, i.e., a coding set whose data processing samples have attribute values very close to those of the data processing sample in the missing data set. Filling the missing attribute values of the characteristic attributes of the data processing sample according to the attribute values of the target coding set therefore effectively improves the accuracy of the filled attribute values.
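The matching step could look like the sketch below, assuming the coding sets are kept in a dictionary keyed by code and that the "minimum coding difference" of the second matching variant is measured bit by bit; both the data layout and the distance measure are assumptions, since the patent does not fix them.

```python
def match_target_code_set(code_sets, sample_code):
    """code_sets: {code tuple: list of complete samples}; sample_code: code tuple."""
    # First matching variant: a coding set whose code is identical to the sample's code.
    if sample_code in code_sets:
        return code_sets[sample_code]
    # Second matching variant: the coding set whose code differs least from the sample's code.
    nearest = min(
        code_sets,
        key=lambda code: sum(abs(a - b) for a, b in zip(code, sample_code)),
    )
    return code_sets[nearest]
```

The missing attribute value of the sample can then be filled from the matched samples, for example by taking the most common or the average value of that attribute within the target coding set; the patent leaves the exact aggregation open.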
In one embodiment of the present invention, generating a code of a preset number of bits for data processing samples may include: generating a first signature value of a preset bit number for each characteristic attribute according to the attribute value of each characteristic attribute in the data processing sample, wherein each bit value in the first signature value belongs to a value in a preset numerical range; and calculating codes corresponding to the data processing samples according to the preset weights corresponding to the characteristic attributes and the first signature values corresponding to each characteristic attribute.
Wherein the preset value range may be [-1, 1]. Limiting values to this range prevents a single characteristic attribute from having an excessive influence, so that all characteristic attributes are balanced.
The generating of the first signature value with the preset number of bits may be implemented by using a hash algorithm. For example, the feature attribute is a service duration, the attribute value corresponding to the service duration is a, and the attribute value a is converted into a signature value (e.g., 0001) with a preset number of bits through a hash algorithm.
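As an illustration of this step, the sketch below derives a fixed-length bit signature from an attribute value with a hash function; the choice of MD5 and of keeping the lowest bits is an assumption, since the text only requires "a hash algorithm".

```python
import hashlib

def signature(attribute_value, bits=4):
    # Hash the attribute value and keep its lowest `bits` bits as the signature.
    digest = int(hashlib.md5(str(attribute_value).encode("utf-8")).hexdigest(), 16)
    return [(digest >> i) & 1 for i in range(bits)]

print(signature("a"))  # a 4-bit signature such as [1, 0, 0, 0]; the exact bits depend on the hash
```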
For example, the code of a data processing sample is calculated using the characteristic attributes A, B, C and E of the sample. The signature values corresponding to the characteristic attributes may be: A → 0, 0, 0, 1; B → 0, 0, 1, 0; C → 0, 0, 0, 1; E → 0, 1, 0, 1. The preset weights corresponding to the characteristic attributes may be: the weight corresponding to characteristic attribute A is 1; the weight corresponding to characteristic attribute B is 3; the weight corresponding to characteristic attribute C is 1; the weight corresponding to characteristic attribute E is 3. Accordingly, the code of the data processing sample is calculated as follows: each bit of the signature value of each characteristic attribute is multiplied by the weight of that attribute to obtain a weighted result, and the values in the same bit position of the weighted results of all the characteristic attributes are summed to obtain the value of the corresponding bit of the code. That is, the values of the first bits of the weighted results of characteristic attributes A, B, C and E are added to give the value of the first bit of the code; the values of the second bits are added to give the value of the second bit of the code; and so on, until the values of the fourth (last) bits are added to give the value of the fourth (last) bit of the code. The calculation can be illustrated with the following specific example:
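The sketch below (the signature values and weights mirror the illustrative ones above and are assumptions, not values fixed by the patent) performs this bit-wise weighting and summation:

```python
signatures = {
    "A": [0, 0, 0, 1],
    "B": [0, 0, 1, 0],
    "C": [0, 0, 0, 1],
    "E": [0, 1, 0, 1],
}
weights = {"A": 1, "B": 3, "C": 1, "E": 3}

def encode(signatures, weights):
    length = len(next(iter(signatures.values())))
    code = [0] * length
    for attribute, bits in signatures.items():
        for i, bit in enumerate(bits):
            code[i] += weights[attribute] * bit   # weight each bit, then sum per position
    return code

print(encode(signatures, weights))  # [0, 3, 3, 5] with the values above
```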
in one embodiment of the present invention, as shown in fig. 3, generating a code of a preset number of bits for data processing samples may include the steps of:
S301: Generating a second signature value with a preset bit number for each characteristic attribute according to the attribute value of each characteristic attribute in the data processing sample;
The generating of the second signature value of the preset number of bits may be implemented using a hash algorithm. For example, the feature attribute is a service duration, the attribute value corresponding to the service duration is a, and the attribute value a is converted into a signature value (e.g., 0001) with a preset number of bits through a hash algorithm.
S302: according to a preset replacement strategy, carrying out numerical replacement on each digit value in the second signature value;
Wherein the replacement policy may include: the value zero is replaced with a first value and a non-zero value is replaced with a second value, and the first and second values can be set by the user. For example, for the signature value 0001 converted from the attribute value a, numerical replacement is performed on the first bit 0, the second bit 0, the third bit 0 and the fourth bit 1 respectively; with the first value being -1 and the second value being 1, the result of the replacement is: -1, -1, -1, 1.
S303: and calculating the codes of the data processing samples according to the preset weights corresponding to the characteristic attributes and the results after numerical value replacement corresponding to each characteristic attribute.
For example, the code of a data processing sample is calculated using the characteristic attributes A, B, C and E of the sample. The signature values obtained in step S301 may be: A → 0, 0, 0, 1; B → 0, 0, 1, 0; C → 0, 0, 0, 1; E → 0, 1, 0, 1. The preset weights corresponding to the characteristic attributes may be: the weight corresponding to characteristic attribute A is 1; the weight corresponding to characteristic attribute B is 3; the weight corresponding to characteristic attribute C is 1; the weight corresponding to characteristic attribute E is 3. Accordingly, the code of the data processing sample is calculated as follows: each bit of the value-replaced result of each characteristic attribute is multiplied by the weight of that attribute to obtain a weighted result, and the values in the same bit position of the weighted results of all the characteristic attributes are summed to obtain the value of the corresponding bit of the code. For example, the value-replaced result (-1, -1, -1, 1) of characteristic attribute A gives the weighted result (-1, -1, -1, 1); the value-replaced result (-1, -1, 1, -1) of characteristic attribute B gives the weighted result (-3, -3, 3, -3); the value-replaced result (-1, -1, -1, 1) of characteristic attribute C gives the weighted result (-1, -1, -1, 1); and the value-replaced result (-1, 1, -1, 1) of characteristic attribute E gives the weighted result (-3, 3, -3, 3). The values of the first bits of the weighted results of characteristic attributes A, B, C and E are then added to give the value of the first bit of the code; the values of the second bits are added to give the value of the second bit of the code; and so on, until the values of the fourth (last) bits are added to give the value of the fourth (last) bit of the code. The implementation of steps S301 to S303 can be illustrated with the following specific example:
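A sketch of steps S301 to S303, using the same illustrative signatures and weights as the previous sketch; the replacement values -1 and 1 follow the example replacement policy described above:

```python
def encode_with_replacement(signatures, weights, first_value=-1, second_value=1):
    length = len(next(iter(signatures.values())))
    code = [0] * length
    for attribute, bits in signatures.items():
        # S302: replacement policy, zero -> first_value, non-zero -> second_value.
        replaced = [first_value if bit == 0 else second_value for bit in bits]
        # S303: weight the replaced values and sum them per bit position.
        for i, value in enumerate(replaced):
            code[i] += weights[attribute] * value
    return code

# Illustrative signatures (S301) and weights, as in the example above.
signatures = {"A": [0, 0, 0, 1], "B": [0, 0, 1, 0], "C": [0, 0, 0, 1], "E": [0, 1, 0, 1]}
weights = {"A": 1, "B": 3, "C": 1, "E": 3}
print(encode_with_replacement(signatures, weights))  # [-8, -2, -2, 2]
```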
through the process of converting the attribute values into the corresponding codes, similar data processing samples can have the same codes, and the code set division is simpler and more convenient.
In an embodiment of the present invention, as shown in fig. 4, before the step of training the level model by using attribute values corresponding to a plurality of feature attributes, the method may further include the following steps:
S401: Calculating the information gain corresponding to each characteristic attribute based on the data processing samples whose attribute values have been processed;
S402: Selecting a plurality of target characteristic attributes according to the information gain corresponding to each characteristic attribute.
Accordingly, on the basis of the above embodiment, the step of training the level model using attribute values corresponding to the plurality of feature attributes may include: training a classification model by using attribute values corresponding to the plurality of target feature attributes.
The resource occupation of the training grade model can be effectively reduced by selecting the target characteristic attribute with relatively large influence, and the model construction cost is reduced while the accuracy of the obtained task grade model is ensured.
Wherein, the step of calculating the information gain corresponding to each characteristic attribute may include: for each feature attribute, performing: calculating a first information entropy by utilizing the probability of the attribute value corresponding to the characteristic attribute; classifying the data processing samples according to attribute values corresponding to the characteristic attributes; calculating a second information entropy according to the proportion occupied by each level in the classification result and the probability of the attribute value corresponding to the characteristic attribute; and calculating the information gain corresponding to the characteristic attribute by using the first information entropy and the second information entropy.
Wherein, the first information entropy can be calculated by the following calculation formula (1):
HSj = -Σ(i=1..n) pij · log2(pij) (1)
Wherein, HSj represents the first information entropy corresponding to the j-th characteristic attribute; pij represents the probability of the i-th attribute value corresponding to the j-th characteristic attribute; and n represents the number of attribute values corresponding to the j-th characteristic attribute.
Wherein, the second information entropy can be calculated by the following calculation formula (2):
HSj′ = Σ(g=1..m) kg · ( -Σ(i=1..ng) pijg · log2(pijg) ) (2)
Wherein, HSj′ represents the second information entropy corresponding to the j-th characteristic attribute; kg represents the proportion of the attribute values of the j-th characteristic attribute that fall in the g-th level of the classification result; pijg represents the probability of the i-th attribute value corresponding to the j-th characteristic attribute in the g-th level; ng represents the number of attribute values corresponding to the j-th characteristic attribute in the g-th level; and m represents the number of levels in the classification result.
Wherein the proportion of the g-th level is the ratio of the number of attribute values of the j-th characteristic attribute that fall in the g-th level (i.e. ng) to the number of all attribute values corresponding to the j-th characteristic attribute (i.e. n), that is, kg = ng/n. For example, if the number of attribute values corresponding to a certain characteristic attribute in the g-th level is 40 and the number of all attribute values corresponding to that characteristic attribute is 100 (i.e., n = 100), the proportion of the g-th level for that characteristic attribute is 0.4, i.e., kg = 0.4.
Accordingly, the information gain corresponding to the characteristic attribute is calculated using the following calculation formula (3).
Calculation formula (3):
Hj=HSj-HSj′ (3)
Wherein Hj characterizes the information gain corresponding to the j-th characteristic attribute.
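A minimal sketch of the information-gain computation of formulas (1) to (3), assuming discrete, already-digitized (or categorical) attribute values; the function and variable names are illustrative:

```python
from collections import Counter
from math import log2

def entropy(values):
    # Shannon entropy of a list of discrete values, as in formula (1).
    total = len(values)
    return -sum((count / total) * log2(count / total) for count in Counter(values).values())

def information_gain(attribute_values, levels):
    # attribute_values[i]: value of the j-th characteristic attribute of sample i;
    # levels[i]: the level (grade) of sample i.
    h_first = entropy(attribute_values)                      # formula (1)
    n = len(attribute_values)
    h_second = 0.0
    for level in set(levels):
        subset = [a for a, l in zip(attribute_values, levels) if l == level]
        h_second += (len(subset) / n) * entropy(subset)      # formula (2), with kg = ng / n
    return h_first - h_second                                # formula (3)

# Example: one characteristic attribute over five samples.
print(information_gain(["high", "low", "medium", "high", "low"],
                       ["L1", "L3", "L2", "L1", "L3"]))  # ≈ 1.52; here the attribute separates the levels perfectly
```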
In one embodiment of the invention, the plurality of target feature attributes may include: any two or three of a service duration for performing the data processing task, a number of recursive subtasks required for performing the data processing task, and a number of uses of the data warehouse model required for performing the data processing task. As shown in FIG. 5, a resulting task level model may include the service duration of the data processing task, the number of recursive subtasks required to perform it, and the number of uses of the data warehouse model required to perform it. In FIG. 5, high, medium and low represent attribute values of the number of recursive subtasks (these attribute values may also be digitized during use, i.e., converted to corresponding values, e.g., high corresponds to 2, medium corresponds to 1, low corresponds to 0); a, b, c and d represent attribute values of the service duration; high, medium and low likewise represent attribute values of the number of uses (e.g., high corresponds to 10, medium corresponds to 5, low corresponds to -1); and L1, L2 and L3 represent data processing task levels, where L1 has a higher processing priority than L2 and L2 has a higher processing priority than L3.
The data processing task grade is determined based on the task grade model with higher accuracy, so that the grade of the data processing task can be determined more accurately, and the data processing tasks with various grades in the data warehouse can be managed better.
For example, based on the partial task level model shown in fig. 5, a new data processing task whose number of recursive subtasks has a low attribute value and whose number of uses has a high attribute value corresponds to the task level L1.
In the embodiment of the invention, two specific implementation manners are possible for determining the processing strategy of the new data processing task.
The implementation mode I is as follows:
The method of data processing may further comprise: setting a data processing sequence table, wherein the data processing sequence table is used for storing data processing tasks according to the processing sequence; accordingly, as shown in FIG. 6, determining a processing policy for a new data processing task may include the steps of:
S601: inserting the new data processing task into the data processing sequence table according to the grade of the new data processing task;
The specific implementation manner of the step can be two types:
first kind: searching the region where other data processing tasks with the grade are located in the data processing sequence table according to the grade of the new data processing task, and adding the new data processing task to the last position in the searched region;
Second kind: searching a region where other data processing tasks with the grade are located in a data processing sequence table according to the grade of the new data processing task, determining two adjacent other data processing tasks if a plurality of other data processing tasks are included in the searched region, wherein the processing deadline of one of the two adjacent other data processing tasks is earlier than the processing deadline of the new data processing task, the processing deadline of the other data processing task is later than the processing deadline of the new data processing task, and inserting the new data processing task between the two adjacent other data processing tasks; if the data processing task comprises one other data processing task, comparing the processing deadline of the new data processing task with the processing deadline of the other data processing task, wherein the comparison result is that the processing deadline of the new data processing task is earlier than the processing deadline of the other data processing task, and the new data processing task is inserted before the other data processing task; the comparison result is that the processing deadline of the new data processing task is later than that of other data processing tasks, and the new data processing task is inserted into the other data processing tasks.
S602: and processing the new data processing task stored in the data processing sequence table according to the storage sequence of the data processing sequence table.
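A sketch of this insertion, assuming each task in the data processing sequence table carries a level and a processing deadline and that L1 is processed before L2 and L2 before L3; the field names are assumptions. It follows the second insertion variant (ordered by deadline within a level); the first variant would simply append at the end of the level's region.

```python
LEVEL_ORDER = {"L1": 0, "L2": 1, "L3": 2}   # smaller value means higher processing priority

def insert_task(sequence_table, new_task):
    # Keep the table ordered first by level and, within a level, by processing deadline.
    key = (LEVEL_ORDER[new_task["level"]], new_task["deadline"])
    for i, task in enumerate(sequence_table):
        if (LEVEL_ORDER[task["level"]], task["deadline"]) > key:
            sequence_table.insert(i, new_task)   # insert before the first later task
            return
    sequence_table.append(new_task)              # otherwise append at the end

table = [{"level": "L1", "deadline": 10}, {"level": "L2", "deadline": 5}]
insert_task(table, {"level": "L1", "deadline": 12})
print([t["level"] for t in table])  # ['L1', 'L1', 'L2']
```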
The implementation mode II is as follows:
The method of data processing may further comprise: allocating a corresponding monitoring scheme for each grade, wherein the monitoring scheme comprises monitoring time intervals corresponding to different grades; accordingly, as shown in FIG. 7, determining a processing policy for a new data processing task may include the steps of:
S701: Monitoring the new data processing task according to the monitoring scheme including the monitoring time intervals corresponding to different levels;
The specific implementation mode of the steps is as follows: and monitoring the new data processing task according to the monitoring time intervals corresponding to the different levels.
S702: and when the monitored result indicates that the new data processing task reaches the processing time limit and/or the processing of the data processing task of the last level corresponding to the new data processing task is completed, processing the new data processing task.
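A sketch of this second implementation, assuming per-level monitoring intervals in seconds and callback functions for the two trigger conditions; the interval values, field names, and callbacks are illustrative:

```python
import time

MONITOR_INTERVALS = {"L1": 60, "L2": 300, "L3": 900}   # illustrative per-level intervals in seconds

def monitor_task(task, deadline_reached, predecessor_done, process):
    # Poll the task at the interval assigned to its level; hand it over for processing
    # once its processing deadline is reached and/or the task of the previous level is done.
    interval = MONITOR_INTERVALS[task["level"]]
    while True:
        if deadline_reached(task) or predecessor_done(task):
            process(task)
            return
        time.sleep(interval)
```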
As shown in fig. 8, an embodiment of the present invention provides a data processing apparatus 800. The data processing apparatus 800 may include: an acquisition unit 801, a filling unit 802, a training unit 803 and a processing unit 804, wherein,
An acquisition unit 801, configured to acquire a plurality of data processing samples, where the data processing samples have a plurality of characteristic attributes and corresponding levels, and the levels indicate processing priorities of the data processing samples;
A filling unit 802, configured to determine, for the data processing samples collected by the collection unit 801, whether the attribute value of each feature attribute is missing, and if so, fill the corresponding attribute value for the feature attribute;
A training unit 803, configured to train the classification model by using the data processing samples with complete attribute values and corresponding levels, acquired by the acquisition unit 801 and completed by the filling unit 802, to obtain a task level model;
A processing unit 804, configured to determine a level of the new data processing task by using the task level model obtained by the training unit 803, and determine a processing policy of the new data processing task according to the level of the new data processing task.
In an embodiment of the present invention, as shown in fig. 8, the apparatus 800 for data processing may further include: the dividing unit 805, wherein,
A dividing unit 805, configured to divide the plurality of data processing samples into a complete data set and a missing data set, where the data processing samples divided into the complete data set have complete attribute values, and the data processing samples divided into the missing data set have missing attribute values;
the filling unit 802 is configured to perform, for each characteristic attribute of each data processing sample in the missing data set divided by the dividing unit 805, the step of determining whether the attribute value corresponding to the characteristic attribute is missing.
In an embodiment of the present invention, as shown in fig. 8, the apparatus 800 for data processing may further include: a conversion unit 806, wherein,
A conversion unit 806, configured to generate a code with a preset number of bits for the data processing sample according to attribute values corresponding to the plurality of feature attributes of the data processing sample; dividing the complete data set into at least one coding set according to the codes of the data processing samples, wherein the codes corresponding to all the data processing samples belonging to the same coding set are equal;
The filling unit 802 is further configured to match, according to the codes corresponding to the data processing samples in the missing data set, the target coding set corresponding to the data processing samples in the missing data set; and to fill the attribute values missing from the data processing samples in the missing data set according to the attribute values of the data processing samples corresponding to the target coding set.
In the embodiment of the present invention, the converting unit 806 is further configured to generate, for each feature attribute in the data processing sample, a first signature value of a preset number of bits according to the attribute value of each feature attribute, where each bit value in the first signature value belongs to a value within a preset numerical range; and calculating codes corresponding to the data processing samples according to the preset weights corresponding to the characteristic attributes and the first signature values corresponding to each characteristic attribute.
In the embodiment of the present invention, the converting unit 806 is further configured to generate, for each feature attribute, a second signature value with a preset number of bits according to the attribute value of each feature attribute in the data processing sample; according to a preset replacement strategy, carrying out numerical replacement on each digit value in the second signature value; and calculating codes corresponding to the data processing samples according to the preset weights corresponding to the characteristic attributes and the results after numerical value replacement corresponding to each characteristic attribute.
In the embodiment of the present invention, the preset replacement policy in the conversion unit 806 may include: the value zero is replaced with a first value and the non-zero value is replaced with a second value.
In the embodiment of the present invention, the training unit 803 is further configured to calculate an information gain of each feature attribute based on the data processing samples processed with the attribute values; selecting a plurality of target characteristic attributes according to the information gain of each characteristic attribute; the classification model is trained using data processing samples having attribute values for the complete target feature attributes.
In the embodiment of the present invention, the training unit 803 is further configured to calculate, for each feature attribute, a first information entropy by using probabilities of attribute values corresponding to the feature attributes in the plurality of data processing samples; classifying the data processing samples according to attribute values corresponding to the characteristic attributes; calculating a second information entropy according to the proportion occupied by each level in the classification result and the probability of the attribute value corresponding to the characteristic attribute; and calculating the information gain corresponding to the characteristic attribute by using the first information entropy and the second information entropy.
In an embodiment of the present invention, the plurality of target feature attributes may include: any two or three of a service duration for performing the data processing task, a number of recursive subtasks required for performing the data processing task, and a number of uses of the data warehouse model required for performing the data processing task.
In the embodiment of the present invention, the processing unit 804 is further configured to set a data processing sequence table, where the data processing sequence table is used to store data processing tasks according to a processing sequence; inserting the new data processing task into the data processing sequence table according to the grade of the new data processing task; and processing the new data processing task stored in the data processing sequence table according to the storage sequence of the data processing sequence table.
In the embodiment of the present invention, the processing unit 804 is further configured to allocate a corresponding monitoring scheme to each level, where the monitoring scheme includes monitoring time intervals corresponding to different levels; monitoring the new data processing task according to the monitoring time intervals corresponding to different levels; and when the monitored result indicates that the new data processing task reaches the processing time limit and/or the processing of the data processing task of the last level corresponding to the new data processing task is completed, processing the new data processing task.
Fig. 9 illustrates an exemplary system architecture 900 of a data processing method or apparatus to which embodiments of the present invention may be applied.
As shown in fig. 9, the system architecture 900 may include terminal devices 901, 902, 903, a network 904, a server 905, and a data warehouse 906. The network 904 is the medium used to provide communication links between the terminal devices 901, 902, 903 and the server 905, and between the server 905 and the data warehouse 906. The network 904 may include various connection types, such as wired or wireless communication links, fiber optic cables, and the like.
The user may interact with the server 905 over the network 904 using the terminal devices 901, 902, 903; for example, the terminal devices 901, 902, 903 may send data processing samples to the server 905, or the server 905 may send the task level model to the terminal devices 901, 902, 903 so that the user can review the structure of the task level model, and so on.
The terminal devices 901, 902, 903 may be various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablets, laptop computers, desktop computers, and the like, on which the user may manually examine the data processing samples.
The data warehouse 906 interacts with the server 905 through the network 904. For example, the data warehouse 906 receives the task level model sent by the server 905 so as to grade its own data processing tasks; alternatively, the data warehouse 906 sends information about a data processing task to the server 905, the server 905 assigns a corresponding level to the data processing task using the constructed task level model and sends the level back to the data warehouse 906, and the data warehouse 906 then marks the data processing task with the corresponding level, and so on.
The server 905 may be a server that provides various services, such as building a task level model based on the data processing samples and grading data processing tasks based on the task level model. Such a server may analyze the plurality of characteristic attributes in the data processing samples and feed the processing result (for example, the task level model or a level, which is merely an example) back to the data warehouse 906.
It should be noted that the method of data processing provided in the embodiment of the present invention is generally performed by the server 905, and accordingly, the apparatus for data processing is generally disposed in the server 905.
It should be understood that the number of terminal devices, networks, servers and data warehouses in fig. 9 is merely illustrative. There may be any number of terminal devices, networks, servers, and data stores, as desired for implementation.
Referring now to FIG. 10, there is illustrated a schematic diagram of a computer system 1000 suitable for implementing a server according to an embodiment of the present invention. The server illustrated in fig. 10 is merely an example and should not impose any limitation on the functionality and scope of use of the embodiments of the present invention.
As shown in fig. 10, the computer system 1000 includes a Central Processing Unit (CPU) 1001, which can execute various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 1002 or a program loaded from a storage section 1008 into a Random Access Memory (RAM) 1003. In the RAM 1003, various programs and data required for the operation of the system 1000 are also stored. The CPU 1001, ROM 1002, and RAM 1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to bus 1004.
The following components are connected to the I/O interface 1005: an input section 1006 including a keyboard, a mouse, and the like; an output portion 1007 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), etc., and a speaker, etc.; a storage portion 1008 including a hard disk or the like; and a communication section 1009 including a network interface card such as a LAN card, a modem, or the like. The communication section 1009 performs communication processing via a network such as the internet. The drive 1010 is also connected to the I/O interface 1005 as needed. A removable medium 1011, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is installed as needed in the drive 1010, so that a computer program read out therefrom is installed as needed in the storage section 1008.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication portion 1009, and/or installed from the removable medium 1011. The above-described functions defined in the system of the present invention are performed when the computer program is executed by a Central Processing Unit (CPU) 1001.
The computer readable medium shown in the present invention may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present invention may be implemented in software or in hardware. The described units may also be provided in a processor, for example, described as: a processor includes an acquisition unit, a filling unit, a training unit, and a processing unit. The names of these units do not, in some cases, constitute a limitation of the unit itself; for example, the acquisition unit may also be described as "a unit that acquires a plurality of data processing samples".
As another aspect, the present invention also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments or may exist separately without being assembled into the apparatus. The computer-readable medium carries one or more programs which, when executed by a device, cause the device to: collect a plurality of data processing samples, the data processing samples having a plurality of characteristic attributes and corresponding levels, wherein the levels indicate the processing priorities of the data processing samples; for each data processing sample, judge whether the attribute value of each characteristic attribute is missing, and if so, fill in the corresponding attribute value for that characteristic attribute; train a classification model using the data processing samples with complete attribute values and the corresponding levels to obtain a task level model; determine the level of a new data processing task using the task level model; and determine the processing strategy of the new data processing task according to the level of the new data processing task.
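To tie the steps together, the following Python sketch outlines one possible end-to-end flow: splitting the samples into a complete set and a missing set, grouping the complete set into coding sets by equal codes, filling missing attribute values from the matching coding set, training a classifier, and grading a new task. It reuses the sample_code helper sketched earlier; the use of scikit-learn's DecisionTreeClassifier as the classification model, the empty-string placeholder for missing values when computing a code, the mode-based fill, and the fallback to the whole complete set are all assumptions, not features stated by the embodiment.

```python
from statistics import mode
from sklearn.tree import DecisionTreeClassifier  # stand-in classification model (assumption)

def fill_missing(samples, attributes, weights):
    complete = [s for s in samples if all(s.get(a) is not None for a in attributes)]
    missing = [s for s in samples if any(s.get(a) is None for a in attributes)]

    # Group the complete data set into coding sets: equal codes share one set.
    coding_sets = {}
    for s in complete:
        coding_sets.setdefault(sample_code({a: s[a] for a in attributes}, weights), []).append(s)

    # Fill each missing attribute value from the target coding set matched by code.
    for s in missing:
        code = sample_code({a: ("" if s.get(a) is None else s[a]) for a in attributes}, weights)
        target = coding_sets.get(code, complete)  # fall back to the whole complete set
        for a in attributes:
            if s.get(a) is None:
                s[a] = mode(t[a] for t in target)
    return complete + missing

def train_task_level_model(samples, attributes):
    # Assumes numeric attribute values after filling.
    features = [[s[a] for a in attributes] for s in samples]
    levels = [s["level"] for s in samples]
    return DecisionTreeClassifier().fit(features, levels)

def grade_new_task(model, task, attributes):
    return model.predict([[task[a] for a in attributes]])[0]
```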
According to the technical solution provided by the embodiments of the present application, the trained task level model can grade data processing tasks using computing resources; compared with manual grading, the solution can effectively improve the grading efficiency of data processing tasks. In addition, because the completeness of the attribute values of the characteristic attributes of the data processing samples used for training the classification model directly affects the accuracy of the trained task level model, filling in the missing attribute values ensures that the data processing samples used for training the classification model have complete attribute values. A more accurate task level model is therefore trained, and the grading accuracy for data processing tasks is effectively improved. The processing strategy of a data processing task is then determined according to its level, so that data processing is planned more reasonably and the user experience is improved.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives can occur depending upon design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.
Claims (13)
1. A method of data processing, comprising:
collecting a plurality of data processing samples, the data processing samples having a plurality of characteristic attributes and corresponding ranks, wherein the ranks are indicative of processing priorities of the data processing samples;
judging whether the attribute value of each characteristic attribute is missing or not according to the data processing sample, if so, filling the corresponding attribute value for the characteristic attribute;
Training a classification model by utilizing a data processing sample with a complete attribute value and a corresponding grade to obtain a task grade model;
Determining the grade of a new data processing task by using the task grade model;
Determining a processing strategy of the new data processing task according to the grade of the new data processing task;
further comprising: dividing a plurality of data processing samples into a complete data set and a missing data set, wherein the data processing samples divided into the complete data set have complete attribute values, and the data processing samples divided into the missing data set have missing attribute values;
Generating a code with a preset bit number for the data processing sample according to attribute values corresponding to the plurality of characteristic attributes of the data processing sample;
Dividing the complete data set into at least one coding set according to the codes of the data processing samples, wherein the codes corresponding to all the data processing samples belonging to the same coding set are equal;
filling the corresponding attribute value for the characteristic attribute, which comprises the following steps:
matching a target code set corresponding to the data processing sample in the missing data set according to the code corresponding to the data processing sample in the missing data set;
And filling the missing attribute values of the data processing samples in the missing data set according to the attribute values of the data processing samples corresponding to the target coding set.
2. A method of data processing according to claim 1, wherein,
the step of judging whether the attribute value of the characteristic attribute is missing or not is executed for each characteristic attribute of each data processing sample in the missing data set.
3. The method of data processing according to claim 1, wherein the step of generating a code of a predetermined number of bits for the data processing samples comprises:
generating a first signature value with a preset number of bits for each characteristic attribute according to the attribute value of each characteristic attribute in the data processing sample, wherein each bit value in the first signature value is a value within a preset numerical range;
and calculating codes corresponding to the data processing samples according to the preset weights corresponding to the characteristic attributes and the first signature values corresponding to each characteristic attribute.
4. The method of data processing according to claim 1, wherein the step of generating a code of a predetermined number of bits for the data processing samples comprises:
Generating a second signature value with a preset bit number for each characteristic attribute according to the attribute value of each characteristic attribute in the data processing sample;
according to a preset replacement strategy, carrying out numerical replacement on each digit value in the second signature value;
and calculating codes corresponding to the data processing samples according to the preset weights corresponding to the characteristic attributes and the value-replaced result corresponding to each characteristic attribute.
5. The method of data processing according to claim 4, wherein the replacement strategy comprises:
The value zero is replaced with a first value and the non-zero value is replaced with a second value.
6. The method of data processing according to any one of claims 1 to 5, further comprising, prior to the step of training the classification model with data processing samples having complete attribute values and corresponding levels:
calculating the information gain of each characteristic attribute based on the data processing samples whose attribute values have been processed;
Selecting a plurality of target characteristic attributes according to the information gain of each characteristic attribute;
the step of training the classification model using data processing samples having complete attribute values and corresponding levels comprises:
training the classification model using data processing samples having complete attribute values for the target characteristic attributes.
7. The method of data processing according to claim 6, wherein the step of calculating the information gain corresponding to each of the characteristic attributes includes:
for each of the characteristic attributes,
Calculating a first information entropy by utilizing the probability of attribute values corresponding to the characteristic attributes in the plurality of data processing samples;
classifying the data processing samples according to attribute values corresponding to the characteristic attributes;
Calculating a second information entropy according to the proportion of each level in the classification result and the probability of the attribute value corresponding to the characteristic attribute;
And calculating the information gain corresponding to the characteristic attribute by using the first information entropy and the second information entropy.
8. A method of data processing according to claim 6, wherein,
The plurality of target feature attributes includes: any two or three of a service duration for performing the data processing task, a number of recursive subtasks required for performing the data processing task, and a number of uses of the data warehouse model required for performing the data processing task.
9. A method of data processing according to any one of claims 1 to 5, 7, 8, characterized in that,
Further comprises: setting a data processing sequence table, wherein the data processing sequence table is used for storing data processing tasks according to the processing sequence;
The step of determining the processing strategy of the new data processing task comprises the following steps:
Inserting the new data processing task into the data processing sequence table according to the grade of the new data processing task;
and processing new data processing tasks stored in the data processing sequence table according to the storage sequence of the data processing sequence table.
10. A method of data processing according to any one of claims 1 to 5, 7, 8, characterized in that,
Further comprises: allocating a corresponding monitoring scheme for each grade, wherein the monitoring scheme comprises monitoring time intervals corresponding to different grades;
The step of determining the processing strategy of the new data processing task comprises the following steps:
monitoring the new data processing task according to the monitoring time intervals corresponding to different levels;
And when the monitored result indicates that the new data processing task reaches the processing time limit and/or the processing of the data processing task of the last level corresponding to the new data processing task is completed, processing the new data processing task.
11. An apparatus for data processing, comprising: the system comprises an acquisition unit, a filling unit, a training unit and a processing unit, wherein,
The acquisition unit is used for acquiring a plurality of data processing samples, wherein the data processing samples have a plurality of characteristic attributes and corresponding grades, and the grades indicate the processing priority of the data processing samples;
The filling unit is used for judging whether the attribute value of each characteristic attribute of the data processing sample acquired by the acquisition unit is missing or not, and filling the corresponding attribute value for the characteristic attribute if the attribute value of each characteristic attribute is missing;
The training unit is used for training the classification model by utilizing the data processing samples, acquired by the acquisition unit and completed by the filling unit, with complete attribute values and corresponding grades, to obtain a task grade model;
the processing unit is used for determining the grade of a new data processing task by utilizing the task grade model obtained by the training unit, and determining the processing strategy of the new data processing task according to the grade of the new data processing task;
the apparatus for data processing further comprises: a dividing unit and a conversion unit, wherein,
The dividing unit is configured to divide the plurality of data processing samples into a complete data set and a missing data set, where the data processing samples divided into the complete data set have complete attribute values, and the data processing samples divided into the missing data set have missing attribute values;
The conversion unit is used for generating codes with a preset number of bits for the data processing samples according to the attribute values corresponding to the plurality of characteristic attributes of the data processing samples; and for dividing the complete data set into at least one coding set according to the codes of the data processing samples, wherein the codes corresponding to all the data processing samples belonging to the same coding set are equal;
The filling unit is further used for matching the data processing samples in the missing data set with corresponding target coding sets according to the codes corresponding to the data processing samples in the missing data set; and filling the missing attribute values of the data processing samples in the missing data set according to the attribute values of the data processing samples corresponding to the target coding set.
12. A data processing electronic device, comprising:
One or more processors;
storage means for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-10.
13. A computer readable medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any of claims 1-10.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010605126.7A CN113760484B (en) | 2020-06-29 | 2020-06-29 | Data processing method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113760484A (en) | 2021-12-07
CN113760484B (en) | 2024-10-22
Family
ID=78785414
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010605126.7A Active CN113760484B (en) | 2020-06-29 | 2020-06-29 | Data processing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113760484B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116881335B (en) * | 2023-07-24 | 2024-06-04 | 郑州华商科技有限公司 | Multi-mode data intelligent analysis system and method |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106919706A (en) * | 2017-03-10 | 2017-07-04 | 广州视源电子科技股份有限公司 | Data updating method and device |
CN110766269A (en) * | 2019-09-02 | 2020-02-07 | 平安科技(深圳)有限公司 | Task allocation method and device, readable storage medium and terminal equipment |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2015187155A1 (en) * | 2014-06-04 | 2015-12-10 | Waterline Data Science, Inc. | Systems and methods for management of data platforms |
CN104035779A (en) * | 2014-06-25 | 2014-09-10 | 中国科学院软件研究所 | Method for handling missing values during data stream decision tree classification |
CN106888237B (en) * | 2015-12-15 | 2020-01-07 | 中国移动通信集团公司 | Data scheduling method and system |
US20170193349A1 (en) * | 2015-12-30 | 2017-07-06 | Microsoft Technology Licensing, Llc | Categorizationing and prioritization of managing tasks |
CN106228398A (en) * | 2016-07-20 | 2016-12-14 | 武汉斗鱼网络科技有限公司 | Specific user's digging system based on C4.5 decision Tree algorithms and method thereof |
CN110674621B (en) * | 2018-07-03 | 2024-06-18 | 北京京东尚科信息技术有限公司 | Attribute information filling method and device |
US10891571B2 (en) * | 2018-08-23 | 2021-01-12 | Capital One Services, Llc | Task management platform |
CN109685526A (en) * | 2018-12-12 | 2019-04-26 | 税友软件集团股份有限公司 | A kind of method for evaluating credit rating of enterprise, device and relevant device |
CN110737805B (en) * | 2019-10-18 | 2022-07-19 | 网易(杭州)网络有限公司 | Method and device for processing graph model data and terminal equipment |
CN111161080A (en) * | 2019-12-10 | 2020-05-15 | 中国建设银行股份有限公司 | Information processing method and device |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106919706A (en) * | 2017-03-10 | 2017-07-04 | 广州视源电子科技股份有限公司 | Data updating method and device |
CN110766269A (en) * | 2019-09-02 | 2020-02-07 | 平安科技(深圳)有限公司 | Task allocation method and device, readable storage medium and terminal equipment |
Also Published As
Publication number | Publication date |
---|---|
CN113760484A (en) | 2021-12-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US12073298B2 (en) | Machine learning service | |
US10621493B2 (en) | Multiple record linkage algorithm selector | |
CN109471783B (en) | Method and device for predicting task operation parameters | |
CN114205690B (en) | Flow prediction method, flow prediction device, model training device, electronic equipment and storage medium | |
US11294930B2 (en) | Resource scaling for distributed database services | |
CN114500339B (en) | Node bandwidth monitoring method and device, electronic equipment and storage medium | |
CN107392259B (en) | Method and device for constructing unbalanced sample classification model | |
CN109389424B (en) | Flow distribution method and device, electronic equipment and storage medium | |
CN113760484B (en) | Data processing method and device | |
CN113377488A (en) | Method, system and equipment for resource migration | |
CN110019193B (en) | Similar account number identification method, device, equipment, system and readable medium | |
CN112016797B (en) | KNN-based resource quota adjustment method and device and electronic equipment | |
EP4134834A1 (en) | Method and apparatus of processing feature information, electronic device, and storage medium | |
CN115801791A (en) | Method for realizing load balance and load balancer | |
CN113626175B (en) | Data processing method and device | |
CN110677463B (en) | Parallel data transmission method, device, medium and electronic equipment | |
CN113362097A (en) | User determination method and device | |
CN112579717B (en) | Method and device for displaying user address on electronic map | |
CN112906723A (en) | Feature selection method and device | |
CN115550259B (en) | Flow distribution method based on white list and related equipment | |
CN112101397B (en) | Method and device for predicting book weight interval | |
CN116451168B (en) | Abnormal power information generation method, device, electronic equipment and readable medium | |
CN118607881A (en) | Method, apparatus, device and computer readable medium for distributing tasks | |
CN112819018B (en) | Method, device, electronic equipment and storage medium for generating samples | |
CN113435623B (en) | Early warning method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |