
CN114528810A - Data code generation method and device, electronic equipment and storage medium - Google Patents

Data code generation method and device, electronic equipment and storage medium Download PDF

Info

Publication number
CN114528810A
Authority
CN
China
Prior art keywords
data
coding
preset
hash
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210154751.3A
Other languages
Chinese (zh)
Inventor
Li Chao (李超)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Himalaya Technology Co ltd
Original Assignee
Shanghai Himalaya Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Himalaya Technology Co ltd filed Critical Shanghai Himalaya Technology Co ltd
Priority to CN202210154751.3A priority Critical patent/CN114528810A/en
Publication of CN114528810A publication Critical patent/CN114528810A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/126Character encoding

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention relates to the technical field of artificial intelligence and provides a data code generation method and device, electronic equipment, and a storage medium. The method comprises the following steps: acquiring any data to be encoded from a preset data set, wherein the value interval determined by the maximum and minimum values of the data in the preset data set is larger than a preset interval; performing an encoding operation on the data to be encoded to obtain a plurality of encoding index values of the data to be encoded; acquiring, according to each encoding index value, a corresponding target encoding row from a preset encoding matrix; and merging the target encoding rows to obtain an encoding result of the data to be encoded, so that model training can be carried out according to the encoding result. The invention effectively reduces the storage space occupied by high-dimensional ID feature data, greatly reduces the cost of model training, and is not limited by the openness of the upstream and downstream links involved in model training.

Description

Data code generation method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a data code generation method and device, electronic equipment and a storage medium.
Background
When training a deep model, the feature data participating in training must first be vectorized (embedded), that is, encoded into a form that meets the model's input requirements. For a single feature with few distinct values or a small value span, the size of the embedding table (also called the dictionary size) can be fixed in advance from the maximum value. For example, a feature such as gender takes only two values in practice, 0 or 1, so the dictionary size is determined by the feature's value space (maximum value) of 1. In contrast, high-dimensional ID features such as advertisement IDs or user IDs, or variable-length text features, can span an enormous range of values, even hundreds of millions or billions; if such features were stored the way single features are, the number of model parameters would be huge, and model training and deployment would be infeasible.
For high-dimensional ID feature data, the prior art rewrites the underlying training framework or supplies training plug-ins that automatically filter and analyze the ID features during model training, shrinking the value span of the high-dimensional ID features and encoding them into a form that meets the deep model's input requirements before storage, so that the features can be used for model training and deployment. Because the upstream and downstream links involved in model training must be modified uniformly, the implementation cost is extremely high; moreover, the links must offer a certain degree of openness, and where that openness is limited the approach is nearly impossible to implement.
Disclosure of Invention
The invention aims to provide a data code generation method and device, an electronic device, and a storage medium that can generate codes for high-dimensional ID feature data without modifying the upstream and downstream links involved in model training, carry out model training according to the generated codes, greatly reduce the cost of model training, and remain unconstrained by the openness of those links.
In order to achieve the above purpose, the embodiment of the present invention adopts the following technical solutions:
in a first aspect, an embodiment of the present invention provides a data encoding generation method, where the method includes:
acquiring any data to be coded in a preset data set, wherein a numerical value interval determined by the maximum value and the minimum value of the data in the preset data set is larger than a preset interval;
performing coding operation on the data to be coded to obtain a plurality of coding index values of the data to be coded;
acquiring a target coding row corresponding to each coding index value from a preset coding matrix according to each coding index value;
and merging the target encoding rows to obtain an encoding result of the data to be encoded, so as to carry out model training according to the encoding result of the data to be encoded.
Further, the step of performing encoding operation on the data to be encoded to obtain a plurality of encoding index values of the data to be encoded includes:
performing hash operation on the data to be coded by using a plurality of different preset hash functions to obtain a plurality of first hash values;
performing a modular operation on each first hash value according to a preset modular value to obtain a modular result of each first hash value;
and taking the modulus result of the first hash values as a plurality of coding index values of the data to be coded.
Further, each of the predetermined hash functions corresponds to one of the predetermined coding matrices, the number of rows of each of the predetermined coding matrices is the same as the predetermined modulus, and the step of obtaining, according to each of the coding index values, a target coding row corresponding to each of the coding index values from the predetermined coding matrices includes:
for any target coding index value, determining a target preset coding matrix corresponding to the target coding index value according to the preset hash function corresponding to the target coding index value;
and taking the target coding index value as a row index, and acquiring a target coding row corresponding to the row index from the target preset coding matrix.
Further, the step of performing encoding operation on the data to be encoded to obtain a plurality of encoding index values of the data to be encoded further includes:
performing a hash operation on the data to be encoded to obtain a second hash value;
sequentially segmenting the second hash value according to a preset segment number to obtain a plurality of hash segments;
determining the coding index value of each hash segment according to a preset modulus value;
and taking the coding index values of the plurality of hash segments as a plurality of coding index values of the data to be coded.
Further, the hash segments are sorted according to their segmentation order, and the step of determining the encoding index value of each hash segment according to the preset modulus comprises:
performing a modular operation on each hash segment according to the preset modular value to obtain a modular result of each hash segment;
and calculating the encoding index value of each hash segment according to the serial number of the hash segment, its modulo result, and the preset modulus.
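The indexing scheme of these sub-steps can be sketched as follows; the concrete formula, index = serial number × modulus + (segment mod modulus), the function name, and the default modulus are illustrative assumptions, since the claims do not fix them:

```python
def segment_index_values(hash_segments, modulus=1000):
    """Turn ordered hash segments into encoding index values.

    The index for the segment with serial number i is
    i * modulus + (segment_i % modulus), so each segment addresses
    a disjoint band of rows in the preset encoding matrix.
    """
    return [i * modulus + (seg % modulus)
            for i, seg in enumerate(hash_segments)]
```

With three segments and a modulus of 1000, for instance, the indices address rows 0 to 2999, which is consistent with a matrix whose row count is the product of the modulus and the number of segments.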
Further, the number of rows of the predetermined coding matrix is a product of the predetermined modulus and the number of hash segments.
Further, the encoding result is represented by a target matrix having one row and the same number of columns as the preset encoding matrix, and the step of merging the target encoding rows to obtain the encoding result of the data to be encoded comprises:
for any target column in the target matrix, determining an element of the target column in the target matrix according to an element of the target column in the target coding rows.
In a second aspect, an embodiment of the present invention provides a data encoding generation apparatus, where the apparatus includes:
the device comprises an acquisition module, a decoding module and a processing module, wherein the acquisition module is used for acquiring any data to be encoded in a preset data set, and a numerical value interval determined by the maximum value and the minimum value of the data in the preset data set is larger than a preset interval;
the operation module is used for carrying out coding operation on the data to be coded to obtain a plurality of coding index values of the data to be coded;
the obtaining module is further configured to obtain, according to each of the code index values, a target code row corresponding to each of the code index values from a preset code matrix;
and a merging module, configured to merge the target encoding rows to obtain an encoding result of the data to be encoded.
In a third aspect, an embodiment of the present invention further provides an electronic device, including a processor and a memory; the memory is used for storing programs; the processor is configured to implement the data encoding generation method in the first aspect when executing the program.
In a fourth aspect, the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the data encoding generation method in the first aspect.
Compared with the prior art, the data code generation method and device, electronic device, and storage medium provided by the embodiments of the invention perform an encoding operation on the data to be encoded to obtain a plurality of encoding index values, acquire, according to each encoding index value, a corresponding target encoding row from a preset encoding matrix, and finally merge the target encoding rows into the encoding result of the data to be encoded. Obtaining several encoding index values per datum and merging the corresponding rows markedly narrows the width of the encoding result, and nothing in the process requires changing the upstream and downstream links involved in model training. The storage space occupied by high-dimensional ID feature data is therefore effectively reduced, the cost of model training is greatly lowered, and the method is not limited by the openness of the links involved.
Drawings
To illustrate the technical solutions of the embodiments of the invention more clearly, the drawings used in the embodiments are briefly described below. It should be understood that the following drawings show only some embodiments of the invention and should not be regarded as limiting its scope; those skilled in the art can derive other related drawings from them without inventive effort.
Fig. 1 shows an example diagram of a prior art coding implementation provided by an embodiment of the present invention.
Fig. 2 is a flowchart illustrating a data encoding generation method according to an embodiment of the present invention.
Fig. 3 is a flowchart illustrating another data encoding generation method according to an embodiment of the present invention.
Fig. 4 is a flowchart illustrating another data encoding generation method according to an embodiment of the present invention.
Fig. 5 is an exemplary diagram illustrating an encoding implementation of multiple hashes according to an embodiment of the present invention.
Fig. 6 is a flowchart illustrating another data encoding generation method according to an embodiment of the present invention.
Fig. 7 is a diagram illustrating an example of a multi-slicing coding implementation according to an embodiment of the present invention.
Fig. 8 is a flowchart illustrating a data encoding generation method according to an embodiment of the present invention.
Fig. 9 is a block diagram illustrating a data encoding generation apparatus according to an embodiment of the present invention.
Fig. 10 is a block diagram of an electronic device provided by an embodiment of the invention.
Icon: 10-an electronic device; 11-a processor; 12-a memory; 13-a bus; 100-data code generating means; 110-an obtaining module; 120-an operation module; 130-merge module.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined or explained in subsequent figures.
In the description of the present invention, it should be noted that terms such as "upper", "lower", "inside", and "outside", if used, indicate orientations or positional relationships based on the drawings or on the product's ordinary use. They are used only for convenience and simplicity of description, do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention.
Furthermore, the appearances of the terms "first," "second," and the like, if any, are only used to distinguish one description from another and are not to be construed as indicating or implying relative importance.
It should be noted that the features of the embodiments of the present invention may be combined with each other without conflict.
To avoid modifying the upstream and downstream links involved in model training, thereby reducing cost while remaining independent of their openness, one prior-art approach reduces dimensionality through hashing, shrinking the value space (i.e., the value span) to between 1/10 and 1/2 of its original size. The specific implementation is: the ID value whose dimension is to be reduced is converted into a long integer by a hash function, a modulus is then taken of that integer (i.e., the remainder after integer division; for example, 10 modulo 3 is 1), and the modulus itself is used as the value space, that is, the dictionary size, for that ID. Referring to fig. 1, which shows an example of this prior-art encoding provided for comparison: the data to be encoded is first hashed to obtain the hash result hashResult1, hashResult1 is then taken modulo the modulus, row 2 of a preset encoding matrix is retrieved according to the modulo result 2, and the vector in row 2 is used as the encoding result of the data to be encoded. All training data are encoded in this manner, and all encoding results are used as training data to train the model.
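A minimal sketch of this prior-art baseline; the hash function (here MD5), the matrix values, and the names are illustrative assumptions, as the text fixes none of them:

```python
import hashlib

def prior_art_encode(data, matrix, modulus):
    """Single-hash baseline: hash the datum, take it modulo the
    dictionary size, and return that row of the encoding matrix."""
    hash_result = int(hashlib.md5(str(data).encode()).hexdigest(), 16)
    return matrix[hash_result % modulus]

# Toy 5-row, 4-column encoding matrix standing in for the matrix in fig. 1.
matrix = [[0.1 * r + 0.01 * c for c in range(4)] for r in range(5)]
vector = prior_art_encode("some_ad_id", matrix, modulus=5)
```

Every datum maps to exactly one of the five rows here, which is precisely why distinct IDs must share rows once the modulus is small relative to the value span.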
Although the above method effectively reduces the value space, models trained on data encoded this way perform poorly. After careful and deep study of the prior art, the inventor found that the poor training results stem from heavy hash collisions during encoding. A hash collision means that different data to be encoded map to the same vector under the above encoding: for example, under modulo 3, the distinct values 4, 7, and 10 all yield the modulo result 1 and therefore all correspond to the vector at index 1, degrading what the model can learn. Investigating the heavy collisions further, the inventor found their cause to be a modulus that is too small, and tried enlarging it. But an overly large modulus, on one hand, inflates the model's parameter count, increases the sample data required for training, slows model training, and blunts the reduction of the value space; on the other hand, if new data enlarge the value interval of the preset data set, the modulus must be updated and the model retrained, raising the cost of model training.
In view of these problems, the inventor also tried counting the frequency of the data to be encoded, filtering out values that do not appear for long periods, building a dictionary from the high-frequency data, and keeping only its top entries. Although this reduces hash collisions, it requires maintaining a fairly large dictionary, which is very unfavorable for online deployment of the model. It also requires ongoing data statistics, frequent maintenance and updates of the dictionary, and retraining of the model after each update, so the maintenance cost is high.
In view of this, embodiments of the present invention provide a data code generation method and apparatus, an electronic device, and a storage medium that require no modification of the upstream and downstream links involved in model training, avoid heavy hash collisions, and markedly reduce the value space. They are described in detail below.
Referring to fig. 2, fig. 2 is a flowchart illustrating a data encoding generating method according to an embodiment of the present invention, where the method includes the following steps:
step S101, any data to be encoded in a preset data set is obtained, wherein a numerical value interval determined by the maximum value and the minimum value of the data in the preset data set is larger than a preset interval.
In this embodiment, training a model usually requires obtaining original samples containing several feature attributes; the samples can be organized as a table in which data in the same column belong to the same feature attribute. The data in the preset data set belong to one feature attribute, and the preset data set may comprise any column of the table; original samples sharing a feature attribute may have the same or different values for it. A value interval can be determined from the maximum and minimum values of the data in the preset data set, either by using them directly as the interval's endpoints or by floating them outward to obtain the interval. For example, for a preset data set containing a gender attribute, the values can be represented by 0 (female) and 1 (male), and the value interval can be [0, 1]; for a preset data set containing an age attribute with minimum value 0 and maximum value 110, the interval can be widened to [0, 150], meaning values in the preset data set may be integers between 0 and 150. The preset interval represents the minimum span required of data in the preset data set: a preset interval of [0, 100000000] means the value range of the data must exceed that interval. The preset interval can be configured for the actual scenario, for example [0, 1 billion] or [0, 10 billion].
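The interval test of step S101 can be sketched as follows; the helper name and the default preset interval are assumptions for illustration:

```python
def exceeds_preset_interval(dataset, preset_interval=(0, 100_000_000)):
    """Check whether the value interval determined by the dataset's
    minimum and maximum is larger than the preset interval, i.e.
    whether this feature column qualifies for the compact encoding."""
    lo, hi = min(dataset), max(dataset)
    return (hi - lo) > (preset_interval[1] - preset_interval[0])
```

A gender column such as [0, 1] fails this test and keeps an ordinary dictionary encoding, while a user-ID column spanning billions of values passes it.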
Step S102, coding operation is carried out on the data to be coded to obtain a plurality of coding index values of the data to be coded.
In this embodiment, one to-be-encoded data corresponds to a plurality of encoding index values, and the plurality of encoding index values can ensure that different to-be-encoded data in the preset data set correspond to different encoding results, and can also ensure that a numerical interval of the preset data set is effectively reduced by encoding, thereby reducing a storage space occupied by the encoding results.
In this embodiment, the plurality of encoding index values may be produced by multiple hash operations: different hash operations are performed on the data to be encoded, and the result of each hash operation is reduced modulo a preset value, finally yielding the plurality of encoding index values.
Step S103, according to each coding index value, a target coding row corresponding to each coding index value is obtained from a preset coding matrix.
In this embodiment, before model training, the preset encoding matrix may be randomly initialized over a chosen value interval; for example, with an initialization interval of [-1, 1], the matrix may contain values such as 0.001, -0.002, and 0.0004. During training, the preset encoding matrix can be updated according to the result of each round, or of every several rounds, of training.
And step S104, merging the target encoding rows to obtain an encoding result of the data to be encoded, and performing model training according to the encoding result of the data to be encoded.
In this embodiment, merging the target encoding rows may combine the data in corresponding columns of the rows, for example by summation or averaging. Given the target encoding rows (1, 1, 1, 1) and (3, 3, 3, 3), column-wise summation yields the encoding result (4, 4, 4, 4), and column-wise averaging yields (2, 2, 2, 2). Other combinations, such as a weighted average, may also be used.
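The merging options named above can be sketched as follows (the function name is an assumption):

```python
def merge_rows(rows, how="sum"):
    """Merge target encoding rows column by column."""
    columns = list(zip(*rows))  # transpose: group values by column
    if how == "sum":
        return [sum(col) for col in columns]
    if how == "mean":
        return [sum(col) / len(col) for col in columns]
    raise ValueError(f"unsupported merge: {how}")
```

`merge_rows([[1, 1, 1, 1], [3, 3, 3, 3]])` gives `[4, 4, 4, 4]`, and with `how="mean"` it gives `[2.0, 2.0, 2.0, 2.0]`, reproducing the numeric example in the text.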
In the method provided by this embodiment of the invention, an encoding operation on the data to be encoded yields a plurality of encoding index values, which are then merged into the final encoding result. The width of the encoding result is markedly reduced, and the upstream and downstream links involved in model training never need to change, so the storage space occupied by high-dimensional ID feature data is effectively reduced, the cost of model training is greatly lowered, and the method is not limited by the openness of those links.
On the basis of fig. 2, an embodiment of the present invention further provides a specific implementation manner for obtaining a plurality of code index values of data to be coded, please refer to fig. 3, fig. 3 shows a flowchart of a data code generating method provided by the embodiment of the present invention, and step S102 includes the following sub-steps:
and a substep S102-10, performing hash operation on the data to be encoded respectively by using a plurality of different preset hash functions to obtain a plurality of first hash values.
In this embodiment, note that the prior art must set the preset modulus large to avoid hash collisions: for example, for a value interval of [0, 10000000], the modulus must be set to 3 to 5 million to avoid obvious collisions, and the larger the modulus, the more model parameters there are and the slower the model converges. In this embodiment of the invention, multiple hashing with several different preset hash functions means the modulus need not be set large to avoid obvious collisions, so the scale of the model parameters is greatly reduced and model convergence is accelerated; at the same time, when the value interval of the data set to be encoded changes, the modulus need not be updated and the model need not be retrained.
In this embodiment, the number of the preset hash functions may be set as needed, and the inventor finds, through tests, that hash collisions can be effectively reduced by setting the number of the preset hash functions to 3.
In this embodiment, the preset hash functions are different, and the obtained first hash values are also different.
And a substep S102-11, performing a modulus operation on each first hash value according to a preset modulus value to obtain a modulus result of each first hash value.
In this embodiment, the preset modulus can be set as required. Because multiple hashing is adopted, the modulus does not need to be set very large, so when the value interval changes, the modulus need not be updated and the model need not be retrained.
And a substep S102-12, taking the modulus result of the plurality of first hash values as a plurality of coding index values of the data to be coded.
In this embodiment, each preset hash function corresponds to one preset encoding matrix, and the number of rows of each preset encoding matrix equals the preset modulus. Building on fig. 3, an embodiment of the present invention further provides a specific implementation for obtaining the target encoding rows; referring to fig. 4, which shows a flowchart of the data encoding generation method, step S103 comprises the following sub-steps:
and a substep S1031, determining a target preset coding matrix corresponding to the target coding index value for any target coding index value according to a preset hash function corresponding to the target coding index value.
In this embodiment, the target encoding index value is any one of a plurality of encoding index values of the data to be encoded, and the processing manner of each of the plurality of encoding index values is the same.
In this embodiment, the preset encoding matrices are independent from each other and do not affect each other, and are initialized and updated in respective manners.
In this embodiment, suppose for example that there are 3 preset hash functions a, b, and c with corresponding preset encoding matrices A, B, and C. The data to be encoded is hashed and reduced modulo the preset modulus by a, b, and c respectively, yielding the encoding index values index1, index2, and index3. If the preset hash function corresponding to the target encoding index value index1 is a, the corresponding target preset encoding matrix is A.
And a substep S1032 of obtaining a target encoding row corresponding to the row index from the target preset encoding matrix by using the target encoding index value as the row index.
For clearer comparison with the prior art, please refer to fig. 5, which shows an example of the multi-hash encoding implementation provided by an embodiment of the present invention. In fig. 5, 3 different preset hash functions are adopted: hash function 1, hash function 2, and hash function 3. Each is applied to the data to be encoded, producing three different first hash values, hashResult1, hashResult2, and hashResult3, and a modulo operation on each first hash value yields the corresponding modulo results: 2, 1, and 2. Each preset hash function corresponds to one preset encoding matrix: preset encoding matrix 1, 2, and 3. According to the modulo result 2 of hashResult1, row 2 of preset encoding matrix 1, (X21, X22, X23, X24), is taken as a target encoding row; likewise, according to the modulo result 1 of hashResult2, row 1 of preset encoding matrix 2, (Y11, Y12, Y13, Y14), is taken as a target encoding row; and according to the modulo result 2 of hashResult3, row 2 of preset encoding matrix 3, (Z21, Z22, Z23, Z24), is taken as a target encoding row. Finally, the 3 target encoding rows are merged to give the final encoding result of the data to be encoded: (XYZ1, XYZ2, XYZ3, XYZ4), where each XYZj combines X2j, Y1j, and Z2j.
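An end-to-end sketch of the fig. 5 pipeline; simulating the 3 different preset hash functions by salting a single MD5 hash, the random matrix sizes, and column-wise summation as the merge are all illustrative assumptions:

```python
import hashlib
import random

def multi_hash_encode(data, matrices, modulus):
    """One preset encoding matrix per hash function: hash, take the
    modulo, fetch that row, then merge the rows column-wise."""
    target_rows = []
    for salt, matrix in enumerate(matrices):
        # Distinct preset hash functions, simulated by salting one hash.
        digest = hashlib.md5(f"{salt}:{data}".encode()).hexdigest()
        first_hash = int(digest, 16)
        target_rows.append(matrix[first_hash % modulus])
    return [sum(col) for col in zip(*target_rows)]  # merge by summation

# Three randomly initialized 3-row, 4-column matrices, as in fig. 5.
random.seed(0)
modulus, dim = 3, 4
matrices = [[[random.uniform(-1, 1) for _ in range(dim)]
             for _ in range(modulus)] for _ in range(3)]
result = multi_hash_encode("user_42", matrices, modulus)
```

Two data collide here only if all three salted hashes collide, which is the mechanism that lets a small modulus stand in for a much larger dictionary.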
The method provided by this embodiment of the invention markedly reduces the hash collision rate while markedly reducing the parameter count. In some extreme scenarios, however, the multi-hash implementation costs some performance, which limits the method where performance requirements are extremely high. To reduce the impact on performance, building on fig. 2, an embodiment of the present invention further provides another specific implementation for obtaining the plurality of encoding index values of the data to be encoded; referring to fig. 6, which shows a flowchart of the data encoding generation method, step S102 further comprises the following sub-steps:
And a substep S102-20, performing a hash operation on the data to be encoded to obtain a second hash value.
In this embodiment, only one hash operation is performed on each piece of data to be encoded in this step, and only one second hash value is obtained.
And a substep S102-21, segmenting the second hash value in sequence according to the preset number of segments to obtain a plurality of hash segments.
In this embodiment, the essence of the multiple-hash approach, in which a plurality of different preset hash functions each perform a hash operation on the data to be encoded, is to guarantee uniqueness through the combination of several smaller bucket spaces: for example, 1,000,000,000 buckets can be covered by three rounds of 1,000 buckets each. After the multiple hash operations, a modulo operation is still performed for bucketing, and because the bits of the second hash value have no correlation with one another, the inventor improved on multiple hashing by splitting the second hash value sequentially into a plurality of hash segments, and finally obtaining a plurality of code index values from those segments. In this embodiment, the preset number of segments may be set as needed; for example, if the preset number of segments is 3, the second hash value is divided into 3 hash segments.
In this embodiment, the segmentation may be performed from front to back or from back to front. As a specific implementation, taking back-to-front segmentation as an example, the splitting may be realized with an AND operation and a shift operation. For example, suppose the second hash value is 1000110010111111 and the preset number of segments is 4: first, the mask 0000000000001111 is used to take the 13th to 16th bits of the second hash value, obtaining 1111; the second hash value is then shifted right by 4 bits, giving 0000100011001011, and the mask 0000000000001111 is applied again to take what were originally the 9th to 12th bits, obtaining 1011; and so on to obtain the 5th to 8th bits and the 1st to 4th bits. The second hash value can thus be split efficiently with only AND and shift operations.
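The mask-and-shift splitting described above can be sketched as a small helper (the function name and parameters are illustrative, not from the embodiment):

```python
def split_hash(second_hash: int, seg_bits: int, num_segs: int):
    """Split a hash value into num_segs segments of seg_bits bits each,
    taking segments from the low end via an AND mask and a right shift."""
    mask = (1 << seg_bits) - 1               # e.g. 0b1111 for 4-bit segments
    segments = []
    for _ in range(num_segs):
        segments.append(second_hash & mask)  # take the low seg_bits bits
        second_hash >>= seg_bits             # expose the next segment
    return segments

# 0b1000110010111111 split back-to-front into four 4-bit segments:
print([bin(s) for s in split_hash(0b1000110010111111, 4, 4)])
# ['0b1111', '0b1011', '0b1100', '0b1000']
```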
And a substep S102-22 of determining a code index value for each hash segment according to a predetermined modulus value.
In this embodiment, the hash segments are numbered in the order of splitting; for example, the hash segment obtained by the first split has sequence number 1, the hash segment obtained by the second split has sequence number 2, and so on. As a specific implementation, the code index value of a hash segment may be calculated as follows:
Firstly, performing a modulo operation on each hash segment according to the preset modulus value to obtain the modulo result of each hash segment.
In this embodiment, the preset modulus value can be set as required.
And secondly, calculating the coding index value of the hash segment according to the serial number of the hash segment, the modulus result and the preset modulus value.
In this embodiment, when the hash segment is numbered from 1, the code index value of the hash segment is calculated as the modulo result of the hash segment + (sequence number of the hash segment-1) × the preset modulus value, and when the hash segment is numbered from 0, the code index value of the hash segment is calculated as the modulo result of the hash segment + sequence number of the hash segment × the preset modulus value.
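Both numbering conventions implement the same offset scheme, which can be sketched as follows (the function name and parameters are illustrative assumptions):

```python
def code_index(seq_num: int, mod_result: int, mod: int, first_seq: int = 1):
    """Code index value of a hash segment: its modulo result offset into the
    segment's own block of `mod` rows, so different segments never collide."""
    return mod_result + (seq_num - first_seq) * mod

# With a preset modulus value of 4096 and 1-based numbering, segment 1
# indexes rows 0..4095 and segment 2 indexes rows 4096..8191:
print(code_index(1, 1, 4096))  # 1
print(code_index(2, 1, 4096))  # 4097
```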
It should be noted that, when the preset modulus value is a power of 2, the modulo operation may also be implemented with shifts and masks: shifting right by one bit is equivalent to dividing by 2, and when the preset modulus value is 2 to the nth power, dividing by it corresponds to shifting right by n bits, while the modulo itself becomes a bit-AND over the low n bits. By performing the bit-AND over the specified length, the splitting and the modulo operation are completed at the same time, which improves conversion efficiency.
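The equivalence relied on here is the standard power-of-two identity, which can be checked directly:

```python
# When the preset modulus value is 2**n, `h % mod` equals `h & (mod - 1)`,
# so taking an n-bit slice of the hash with a bit-AND performs the
# splitting and the modulo operation in a single step.
h = 0b1000110010111111
for n in (2, 4, 12):
    mod = 1 << n
    assert h % mod == h & (mod - 1)
print(h & 0b1111)  # 15, i.e. h % 16
```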
And a substep S102-23 of using the code index values of the plurality of hash segments as a plurality of code index values of the data to be encoded.
In this embodiment, after sub-steps S102-20 to S102-23 obtain the plurality of code index values of the data to be encoded, the target coding rows still need to be obtained from the preset coding matrix according to the code index values. In this case there is only one preset coding matrix, and its number of rows is the product of the preset modulus value and the number of hash segments.
In this embodiment, since each code index value is obtained from the sequence number of its hash segment, the modulo result and the preset modulus value, the code index values generated by different hash segments never conflict with one another, and a lookup can be performed directly on the single preset coding matrix to obtain the corresponding target coding rows.
It should be noted that only one of the two implementations, sub-steps S102-10 to S102-12 and sub-steps S102-20 to S102-23, needs to be adopted; the appropriate implementation may be selected according to actual needs.
To compare more clearly with the prior art and with the multiple-hash encoding method, please refer to fig. 7, which shows an exemplary diagram of a segmentation-based encoding implementation according to an embodiment of the present invention. In fig. 7, the second hash value is split 3 times to obtain 3 hash segments with values r1, r2 and r3, and the preset modulus value is mod. For r1, the modulo result is 2 and the corresponding code index value is 2. For r2, the modulo result is 1 and the corresponding code index value is n = 1 + 1 × mod; if mod is 3, then n is 4. Similarly, the code index value corresponding to r3 is denoted m. The target coding rows are therefore the data in rows 2, n and m of the single preset coding matrix, namely X21 X22 X23 X24, Xn1 Xn2 Xn3 Xn4 and Xm1 Xm2 Xm3 Xm4. Finally, the data of rows 2, n and m are merged to obtain the final encoding result.
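Putting sub-steps S102-20 to S102-23 together, the fig. 7 variant can be sketched end to end. This is a toy configuration (mod = 4, 3 segments, 0-based segment numbering, a power-of-two modulus so that masking doubles as the modulo operation); the concrete values are illustrative only.

```python
def segment_encode(second_hash: int, matrix, mod: int, num_segs: int):
    """One hash, split into segments; segment i selects a row from block i
    of a single preset coding matrix with mod * num_segs rows."""
    seg_bits = mod.bit_length() - 1           # mod assumed a power of 2
    result = []
    for seq in range(num_segs):               # 0-based segment numbering
        mod_result = second_hash & (mod - 1)  # split + modulo in one mask
        second_hash >>= seg_bits
        result.extend(matrix[mod_result + seq * mod])  # target coding row
    return result

# 12 x 4 preset coding matrix (4 modulus values x 3 segments):
mat = [[r * 10 + c for c in range(4)] for r in range(12)]
print(segment_encode(0b100110, mat, mod=4, num_segs=3))
# rows 2, 5 and 10 merged:
# [20, 21, 22, 23, 50, 51, 52, 53, 100, 101, 102, 103]
```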
On the basis of fig. 2, an embodiment of the present invention further provides a specific implementation manner of obtaining an encoding result of data to be encoded, please refer to fig. 8, fig. 8 shows an exemplary flowchart of a data encoding generation method provided by the embodiment of the present invention, and step S104 further includes the following sub-steps:
In sub-step S1041, for any target column in the target matrix, the element of that target column in the target matrix is determined according to the elements of the same column in the target coding rows.
In this embodiment, the encoding result may be represented by an object matrix, where the number of rows of the object matrix is one row, and the number of columns of the object matrix is the same as the number of columns of the preset encoding matrix.
In this embodiment, the element of a target column in the target matrix may be determined from the elements of that column in the target coding rows in several ways: summing the element values of the target column across the target coding rows to obtain the element value of the target column in the target matrix; averaging the element values of the target column across the target coding rows; or directly concatenating the elements of the target column across the target coding rows.
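The three merge strategies can be sketched column-wise (the row values below are toy numbers, not from the embodiment):

```python
# Three target coding rows; the target matrix has one row whose columns
# are derived from the corresponding columns of these rows.
rows = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12]]

summed = [sum(col) for col in zip(*rows)]                # per-column sum
averaged = [sum(col) / len(rows) for col in zip(*rows)]  # per-column mean
concatenated = [x for row in rows for x in row]          # direct merge

print(summed)             # [15, 18, 21, 24]
print(averaged)           # [5.0, 6.0, 7.0, 8.0]
print(len(concatenated))  # 12
```

Sum and mean keep the result the same width as the preset coding matrix; concatenation widens it by the number of target coding rows.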
In this embodiment, in order to better demonstrate the technical effect achieved by the above data code generation method, an embodiment of the present invention further provides test results for the single-hash method adopted in the prior art and the multiple-hash method provided by the embodiment of the present invention. The preset data set contains 5,000,000 randomly generated numbers with values in the interval [0, 100,000,000]. The collision rate of the single-hash method under different preset modulus values, and the collision rate of 3 hashes under a preset modulus value of 4096, are shown in table 1 below, where the collision rate is (4987421 − number of non-repeated hash results) / number of non-repeated hash results:
TABLE 1
As can be seen from table 1, for a single hash to achieve a collision rate similar to that of multiple hashes, the modulus value needs to be extremely large, about 10,000 times that used for multiple hashes. This results in a huge number of subsequent model parameters and many invalid elements in the preset coding matrix, making convergence during model training slow, training efficiency low, and deployment difficult.
The embodiment of the present invention further provides test results for different numbers of hash operations. The preset data set contains 5,000,000 randomly generated numbers with values in the interval [0, 100,000,000], and the collision rates for different numbers of hashes are shown in table 2:
TABLE 2
As can be seen from table 2, the collision rate with 2 hashes is 15%, but the collision rate with 3 hashes is already very small and meets the requirements of most application scenarios; considering the influence of the number of hashes on performance, there is no need to perform more hash operations.
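The trend in table 2 can be reproduced with a toy simulation (a hypothetical setup, much smaller than the embodiment's 5,000,000-number data set; uniform random bucket indices stand in for independent hash functions):

```python
import random

def collision_rate(n_items: int, mod: int, n_hashes: int, seed: int = 0):
    """Fraction of items whose full tuple of bucket indices has already
    been produced by an earlier item."""
    rng = random.Random(seed)
    seen, collisions = set(), 0
    for _ in range(n_items):
        key = tuple(rng.randrange(mod) for _ in range(n_hashes))
        if key in seen:
            collisions += 1
        seen.add(key)
    return collisions / n_items

# One hash into 4096 buckets collides almost always at this scale;
# three hashes (keyspace 4096**3) almost never:
print(collision_rate(100_000, 4096, 1) > 0.9)   # True
print(collision_rate(100_000, 4096, 3) < 0.01)  # True
```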
The embodiment of the present invention further provides the collision rates of 3 hashes under different preset modulus values. For the convenience of bit operations, the preset modulus value is taken as a power of 2; using, for example, 5000 or 10000 as the preset modulus value instead would have little influence on the overall efficiency. The preset data set contains 1,000,000 randomly generated numbers with values in the interval [0, 1,000,000,000], the collision rate is calculated in the same way as above, and the experimental results under different preset modulus values are shown in table 3:
TABLE 3
As can be seen from table 3, the larger the preset modulus value, the smaller the collision rate, but the more model parameters there are and the slower the model converges. Experiments show that a 12-bit modulus value, i.e. 4096, is a moderate choice. The number of bits may be reduced if more collisions can be tolerated, and increased otherwise. Note that the number of bits in table 3 does not affect the performance of the hashing itself; it only affects the parameter quantity of the subsequent model.
The embodiment of the present invention further compares, through experiments, the processing time consumed by the two implementations, sub-steps S102-10 to S102-12 and sub-steps S102-20 to S102-23, where the preset modulus value is 4096 and the preset data set contains 100,000,000 items; the processing times of the two implementations are shown in table 4:
TABLE 4
As can be seen from table 4, the processing time of the method of sub-steps S102-20 to S102-23 is reduced by approximately half compared with that of sub-steps S102-10 to S102-12. In large-scale data processing, the method of sub-steps S102-20 to S102-23 therefore achieves better performance while obtaining the same collision rate.
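The gap in table 4 is intuitive: the splitting variant computes one digest plus a few bit operations instead of several digests. A toy comparison (hashlib-based stand-ins for the two implementations, not the embodiment's measured code):

```python
import hashlib
import timeit

data = b"id_12345"

def three_hashes():  # sub-steps S102-10 to S102-12 style
    return [int.from_bytes(hashlib.sha256(s + data).digest()[:8], "big") % 4096
            for s in (b"a", b"b", b"c")]

def one_hash_split():  # sub-steps S102-20 to S102-23 style
    h = int.from_bytes(hashlib.sha256(data).digest()[:8], "big")
    return [(h >> (12 * i)) & 0xFFF for i in range(3)]  # 4096 = 2**12

t3 = timeit.timeit(three_hashes, number=20_000)
t1 = timeit.timeit(one_hash_split, number=20_000)
print(len(one_hash_split()), t1 < t3)  # 3 indices; the split variant is usually faster
```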
In order to execute the corresponding steps of the data code generation method in the above embodiment and its various possible implementations, an implementation of the data code generating apparatus 100 is given below. Referring to fig. 9, fig. 9 is a block diagram illustrating a data code generating apparatus 100 according to an embodiment of the present invention. It should be noted that the basic principle and the resulting technical effects of the data code generating apparatus 100 provided in this embodiment are the same as those of the above embodiments; for brevity, reference may be made to the corresponding content of the above embodiments for any point not mentioned in this embodiment.
The data code generating device 100 includes an obtaining module 110, an operation module 120, and a merging module 130.
The obtaining module 110 is configured to obtain any data to be coded in a preset data set, where a value interval determined by a maximum value and a minimum value of data in the preset data set is greater than a preset interval.
The operation module 120 is configured to perform encoding operation on the data to be encoded to obtain a plurality of encoding index values of the data to be encoded.
Specifically, the operation module 120 is specifically configured to: respectively carrying out hash operation on data to be encoded by utilizing a plurality of different preset hash functions to obtain a plurality of first hash values; performing a modulus operation on each first hash value according to a preset modulus value to obtain a modulus result of each first hash value; and taking the modulus result of the plurality of first hash values as a plurality of coding index values of the data to be coded.
Specifically, the operation module 120 is further specifically configured to: carrying out hash operation on data to be encoded to obtain a second hash value; sequentially segmenting the second hash value according to the preset segment number to obtain a plurality of hash segments; determining a coding index value of each hash segment according to a preset modulus value; and taking the coding index values of the plurality of hash segments as a plurality of coding index values of the data to be coded.
Specifically, when determining the code index value of each hash segment according to the preset modulus value, the operation module 120 is specifically configured to: perform a modulo operation on each hash segment according to the preset modulus value to obtain the modulo result of each hash segment; and calculate the code index value of each hash segment according to the sequence number of the hash segment, the modulo result and the preset modulus value.
Specifically, the number of rows of the predetermined encoding matrix in the operation module 120 is a product of the predetermined modulus and the number of hash segments.
The obtaining module 110 is further configured to obtain, according to each code index value, a target code row corresponding to each code index value from a preset code matrix.
Specifically, each preset hash function corresponds to a preset coding matrix, the number of rows of each preset coding matrix is the same as the preset modulus, and the obtaining module 110 is specifically configured to: for any target coding index value, determining a target preset coding matrix corresponding to the target coding index value according to a preset hash function corresponding to the target coding index value; and taking the target coding index value as a row index, and acquiring a target coding row corresponding to the row index from a target preset coding matrix.
The merging module 130 is configured to merge the multiple target coding lines to obtain a coding result of the data to be coded.
Specifically, the encoding result is represented by an object matrix, the row number of the object matrix is one row, the column number of the object matrix is the same as the column number of the preset encoding matrix, and the merging module 130 is specifically configured to: for any target column in the target matrix, determining the elements of the target column in the target matrix according to the elements of the target column in the target coding rows.
Referring to fig. 10, fig. 10 is a block diagram illustrating an electronic device 10 according to an embodiment of the present disclosure. The electronic device 10 may be a computer device, for example, any one of a smart phone, a tablet computer, a personal computer, a server, a ground station, a private cloud, a public cloud, and the like, and the above devices may be used to implement the data encoding generation method provided in the foregoing embodiments, and may be determined specifically according to an actual application scenario, and is not limited herein. The electronic device 10 includes a processor 11, a memory 12, and a bus 13, and the processor 11 is connected to the memory 12 through the bus 13.
The memory 12 is used for storing programs, such as the data code generating apparatus 100 shown in fig. 9. The data code generating apparatus 100 includes at least one software functional module which may be stored in the memory 12 in the form of software or firmware, and the processor 11 executes the program after receiving an execution instruction so as to implement the data code generation method disclosed in the above embodiments.
The Memory 12 may include a Random Access Memory (RAM) and may also include a non-volatile Memory (NVM).
The processor 11 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 11. The processor 11 may be a general-purpose processor, and includes a Central Processing Unit (CPU), a Micro Control Unit (MCU), a Complex Programmable Logic Device (CPLD), a Field Programmable Gate Array (FPGA), and an embedded ARM.
The embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by the processor 11, the data code generation method disclosed in the foregoing embodiment is implemented.
In summary, embodiments of the present invention provide a data code generation method and apparatus, an electronic device and a storage medium. The method includes: acquiring any data to be encoded in a preset data set, where the value interval determined by the maximum and minimum values of the data in the preset data set is larger than a preset interval; performing an encoding operation on the data to be encoded to obtain a plurality of code index values of the data to be encoded; acquiring, according to each code index value, the target coding row corresponding to that code index value from a preset coding matrix; and merging the target coding rows to obtain the encoding result of the data to be encoded, so that model training can be performed according to the encoding result. Compared with the prior art, the embodiments of the present invention obtain a plurality of code index values through an encoding operation on the data to be encoded and then merge the corresponding target coding rows into the final encoding result, which significantly narrows the width range of the encoding result, and the whole process requires no changes to the upstream and downstream links involved in model training. The invention therefore effectively reduces the storage space occupied by high-dimensional ID feature data, greatly reduces the cost of model training, and is not limited by the openness of the upstream and downstream links involved in model training.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are also within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method for generating a data code, the method comprising:
acquiring any data to be coded in a preset data set, wherein a numerical value interval determined by the maximum value and the minimum value of the data in the preset data set is larger than a preset interval;
performing coding operation on the data to be coded to obtain a plurality of coding index values of the data to be coded;
acquiring a target coding row corresponding to each coding index value from a preset coding matrix according to each coding index value;
and merging the target coding lines to obtain a coding result of the data to be coded so as to carry out model training according to the coding result of the data to be coded.
2. The data encoding generation method of claim 1, wherein the step of performing an encoding operation on the data to be encoded to obtain a plurality of encoding index values of the data to be encoded comprises:
performing hash operation on the data to be coded by using a plurality of different preset hash functions to obtain a plurality of first hash values;
performing a modular operation on each first hash value according to a preset modular value to obtain a modular result of each first hash value;
and taking the modulus result of the first hash values as a plurality of coding index values of the data to be coded.
3. The method as claimed in claim 2, wherein each of the predetermined hash functions corresponds to one of the predetermined coding matrices, the number of rows of each of the predetermined coding matrices is the same as the predetermined modulus, and the step of obtaining the target coding row corresponding to each of the coding index values from the predetermined coding matrices according to each of the coding index values comprises:
for any target coding index value, determining a target preset coding matrix corresponding to the target coding index value according to the preset hash function corresponding to the target coding index value;
and taking the target coding index value as a row index, and acquiring a target coding row corresponding to the row index from the target preset coding matrix.
4. The data encoding generation method of claim 1, wherein the step of performing an encoding operation on the data to be encoded to obtain a plurality of encoding index values of the data to be encoded further comprises:
carrying out Hash operation on the data to be coded to obtain a second Hash value;
sequentially segmenting the second hash value according to a preset segment number to obtain a plurality of hash segments;
determining the coding index value of each hash segment according to a preset modulus value;
and taking the coding index values of the plurality of hash segments as a plurality of coding index values of the data to be coded.
5. The data encoding generation method of claim 4, wherein the hash segments are sorted in a slicing order, and the step of determining the encoding index value of each hash segment according to the predetermined modulus value comprises:
performing a modular operation on each hash segment according to the preset modular value to obtain a modular result of each hash segment;
and calculating the coding index value of the hash segment according to the serial number of the hash segment, the modulus taking result and the preset modulus value.
6. The data encoding generation method of claim 4 or 5, wherein the number of rows of the predetermined encoding matrix is a product of the predetermined modulus value and the number of hash segments.
7. The data encoding generation method of claim 1, wherein the encoding result is represented by an object matrix, the number of rows of the object matrix is one, the number of columns of the object matrix is the same as the number of columns of the preset encoding matrix, and the step of combining the target encoding rows to obtain the encoding result of the data to be encoded includes:
for any target column in the target matrix, determining an element of the target column in the target matrix according to an element of the target column in the target coding rows.
8. An apparatus for generating a data code, the apparatus comprising:
the device comprises an acquisition module, a decoding module and a processing module, wherein the acquisition module is used for acquiring any data to be encoded in a preset data set, and a numerical value interval determined by the maximum value and the minimum value of the data in the preset data set is larger than a preset interval;
the operation module is used for carrying out coding operation on the data to be coded to obtain a plurality of coding index values of the data to be coded;
the obtaining module is further configured to obtain, according to each of the code index values, a target code row corresponding to each of the code index values from a preset code matrix;
and the merging module is used for merging the target coding lines to obtain a coding result of the data to be coded.
9. An electronic device comprising a processor and a memory; the memory is used for storing programs; the processor is configured to implement the data encoding generation method according to any one of claims 1 to 7 when executing the program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out a data code generation method according to any one of claims 1 to 7.
CN202210154751.3A 2022-02-21 2022-02-21 Data code generation method and device, electronic equipment and storage medium Pending CN114528810A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210154751.3A CN114528810A (en) 2022-02-21 2022-02-21 Data code generation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210154751.3A CN114528810A (en) 2022-02-21 2022-02-21 Data code generation method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114528810A true CN114528810A (en) 2022-05-24

Family

ID=81625449

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210154751.3A Pending CN114528810A (en) 2022-02-21 2022-02-21 Data code generation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114528810A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116192402A (en) * 2023-01-18 2023-05-30 南阳理工学院 Communication method with information coding
CN116192402B (en) * 2023-01-18 2023-10-10 南阳理工学院 Communication method with information coding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination