
CN108536824B - Data processing method and device - Google Patents


Info

Publication number
CN108536824B
CN108536824B CN201810315884.8A CN201810315884A CN108536824B CN 108536824 B CN108536824 B CN 108536824B CN 201810315884 A CN201810315884 A CN 201810315884A CN 108536824 B CN108536824 B CN 108536824B
Authority
CN
China
Prior art keywords
data
key value
value
random number
suffix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810315884.8A
Other languages
Chinese (zh)
Other versions
CN108536824A (en)
Inventor
孟洋
郭会
韩大志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Agricultural Bank Of China Financial Technology Co ltd
Agricultural Bank of China
Original Assignee
Agricultural Bank of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Agricultural Bank of China
Priority to CN201810315884.8A
Publication of CN108536824A
Application granted
Publication of CN108536824B
Legal status: Active
Anticipated expiration


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a data processing method and device. First, while each identical key value in the association field of a left table is being counted, the left table is associated with a table to be associated using a first key value to obtain first data, and using a second key value to obtain second data. Then the second data is associated with the copied table to be associated, using key values to which a second random number suffix has been added, to obtain third data. The set of the first data and the third data is the final required data set. Compared with the prior art, the invention needs only 2 jobs to handle the data skew problem, which reduces the number of jobs and the IO operations in the processing and thereby improves processing efficiency.

Description

Data processing method and device
Technical Field
The present invention relates to the field of data processing, and more particularly, to a data processing method and apparatus.
Background
At present, Hadoop is often adopted to meet the growing demand for processing mass data. The core designs of the Hadoop framework are HDFS (Hadoop Distributed File System) and MapReduce. HDFS provides storage for massive data; it is highly fault tolerant and is designed to be deployed on low-cost hardware, and it provides high-throughput access to application data, which makes it suitable for applications with very large data sets.
MapReduce provides computation for massive data. MapReduce is a programming model for parallel operation on large-scale data sets (greater than 1 TB). Its main ideas, Map and Reduce, are borrowed from functional programming languages, together with features borrowed from vector programming languages. It makes it much easier for programmers to run their programs on a distributed system without knowing distributed parallel programming. In the Map stage, the data is divided into a number of distributed computing tasks according to its size, and local operations are performed wherever possible to exploit the speed of data locality; the data is then partitioned (that is, data with the same key enters the same processing queue), and finally the merge and association operations are performed in the Reduce stage.
When a program performs data association in the Reduce stage, if one association key has far more data records than the other association keys, the Reduce node handling that key processes far more data than the other nodes. As a result, most Reduce nodes have already finished while one or a few Reduce nodes are still running slowly, so that the execution time of the whole program is long. This is called data skew. The data skew problem is illustrated below, assuming that the table structures of table A, table B, and table C are as follows:
Table 1-1 Structure of Table A
Table A field    Field name
TRANS_ID         Transaction code
ACCT_NO          Transaction account number
Table 1-2 Structure of Table B
Table B field    Field name
ACCT_NO          Account number
ACCT_NAME        Account name
Table 1-3 Structure of Table C
Table C field    Field name
ACCT_NO          Account number
ACCT_EMAIL       Account email address
Table A is left-joined to table B and table C, and the association field in both joins is ACCT_NO. In SQL this is written as "A left join B on (A.ACCT_NO = B.ACCT_NO) left join C on (A.ACCT_NO = C.ACCT_NO)". In table B and table C, ACCT_NO is the primary key, so each value occurs only once in each of those tables; in table A, ACCT_NO is not the primary key, and multiple records may have the same ACCT_NO value. A left join is one of the outer joins in SQL: the result set includes all rows of the left table specified in the left join clause, not just the rows whose join column matches. If a row of the left table matches no row of the right table, all select-list columns of the right table are null in the corresponding row of the result set. A data table usually has a column or combination of columns whose value uniquely identifies each row in the table; such a column or combination is called the primary key of the table, and it enforces the entity integrity of the table.
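The left-join semantics described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper `left_join`, the sample rows, and the e-mail address are hypothetical, and the right-hand key is assumed to be unique, as ACCT_NO is in table B and table C.

```python
def left_join(left_rows, right_rows, key):
    """Left outer join of two lists of dicts on `key`.

    Assumes `key` is unique in `right_rows` (a primary key).
    """
    right_by_key = {row[key]: row for row in right_rows}
    # Right-hand columns other than the key, de-duplicated in first-seen order.
    right_cols = list(dict.fromkeys(c for row in right_rows for c in row if c != key))
    joined = []
    for row in left_rows:
        match = right_by_key.get(row[key])
        # Unmatched left rows are kept, with the right-hand columns set to None.
        extra = {c: (match[c] if match else None) for c in right_cols}
        joined.append({**row, **extra})
    return joined


# Hypothetical sample data in the shape of tables A, B and C.
table_a = [
    {"TRANS_ID": "00000001", "ACCT_NO": "111111"},
    {"TRANS_ID": "00000002", "ACCT_NO": "111111"},
    {"TRANS_ID": "00000003", "ACCT_NO": "444444"},
]
table_b = [{"ACCT_NO": "111111", "ACCT_NAME": "Peter"}]
table_c = [{"ACCT_NO": "111111", "ACCT_EMAIL": "peter@example.com"}]

# A left join B left join C, both on ACCT_NO.
result = left_join(left_join(table_a, table_b, "ACCT_NO"), table_c, "ACCT_NO")
```

Every row of the left table survives the join; the row with ACCT_NO 444444 has no partner in either right-hand table, so its ACCT_NAME and ACCT_EMAIL are null (`None`).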
Table 1-4 Distribution of ACCT_NO values in Table A
ACCT_NO value    Number of records
111111           62357842
222222           2
333333           1
As shown in the above table, the number of records with ACCT_NO = 111111 in table A is 62357842, which is much larger than the number of records for the other values. This creates a data skew problem in the Reduce stage. As shown in fig. 1, the Reduce node that processes key 111111 receives far more data records than the other nodes, so its processing time is much longer than theirs. It may also run out of memory, so that the job cannot complete normally, which greatly affects job processing efficiency.
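The imbalance can be made concrete with a small Python sketch that assigns the key counts of Table 1-4 to reduce partitions. The number of reducers and the CRC32-based `partition` function are assumptions chosen for illustration; they stand in for whatever hash partitioner the cluster actually uses.

```python
import zlib
from collections import Counter


def partition(key, num_reducers):
    """Deterministic stand-in for a hash partitioner (assumption: CRC32 mod R)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


# Record counts per ACCT_NO value, taken from Table 1-4.
key_counts = {"111111": 62357842, "222222": 2, "333333": 1}
num_reducers = 4  # hypothetical cluster size

# All records with the same key land on the same reducer.
load = Counter()
for key, count in key_counts.items():
    load[partition(key, num_reducers)] += count

total = sum(key_counts.values())
heaviest = max(load.values())
share = heaviest / total  # fraction of all records handled by one reducer
```

Whichever partition key 111111 falls into, that single reducer receives more than 99% of the records, while the remaining reducers are almost idle.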
The idea of the existing solution is to even out the data distribution and process the data in segments. Specifically, group statistics are computed on the left table by association key value; data records whose key count is higher than a certain threshold are stored in one file, and data records whose key count is lower than the threshold are stored in another file. The part below the threshold is associated with table B and table C in the normal way. For the part above the threshold, a random number (not greater than a certain value RMax) is appended to each key value of the left table to form a new association key value, and the integers from 1 to RMax are appended to each key value of tables B and C to form RMax repeated records, after which the association is performed on the new key values. Finally, the result sets generated from the two data files are combined. However, the existing solution has to count the data distribution in advance, split the data and write the data files to disk, process each data file separately, and merge the data files, which requires 4 jobs in total. Starting each Hadoop job takes considerable time and the corresponding IO operations are also numerous, which results in low processing efficiency.
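The first two steps of the existing solution, counting in advance and then splitting by threshold, can be sketched as follows. The function name `split_by_threshold`, the sample rows, and the threshold value are hypothetical; the sketch only illustrates why a separate counting pass over the left table is needed before any record can be routed.

```python
from collections import Counter


def split_by_threshold(rows, key, threshold):
    """Prior-art style split: one full pass to count, a second pass to route."""
    counts = Counter(row[key] for row in rows)  # pass 1: group statistics
    above = [r for r in rows if counts[r[key]] > threshold]   # pass 2: skewed keys
    below = [r for r in rows if counts[r[key]] <= threshold]  # pass 2: normal keys
    return above, below


# Hypothetical left table: five records for 111111, one for 222222.
rows = [{"TRANS_ID": f"{i:08d}", "ACCT_NO": "111111"} for i in range(1, 6)]
rows.append({"TRANS_ID": "00000006", "ACCT_NO": "222222"})

above, below = split_by_threshold(rows, "ACCT_NO", threshold=3)
```

Because the total count of a key is only known after the first pass, every record of a skewed key ends up in the file above the threshold, and the two files must be written out and processed by separate jobs.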
Disclosure of Invention
In view of the above, the present invention provides a data processing method and apparatus, which aim to achieve the purpose of improving processing efficiency by reducing the number of jobs and IO operations in the processing process on the basis of solving the technical problem of data skew.
In order to achieve the above object, the following solutions are proposed:
A data processing method, comprising:
while counting each identical key value in an association field of a left table, associating the left table with a table to be associated using a first key value to obtain first data, and associating the left table with the table to be associated using a second key value to obtain second data; wherein, for each key value in the association field of the left table, the key value itself is used as the first key value while its count value does not exceed a record number threshold; and once the count value of the key value exceeds the record number threshold, a first random number suffix is added to the key value, for the part exceeding the record number threshold, to obtain the second key value, the first random number suffix being an integer in the range [1, N], where N is an integer greater than 1;
adding a second random number suffix to the key value in the association field of the second data, copying each data record of the table to be associated into N copies, adding an increment value suffix to the key value of each of the N copied data records, and associating the second data with the copied table to be associated using the key value with the second random number suffix added, to obtain third data, wherein the second random number suffix is an integer in the range [1, N], N is an integer greater than 1, the increment value suffixes added to the key values of the N copied data records all differ from one another, and each increment value suffix is an integer in the interval [1, N];
and combining the first data and the third data to obtain the associated data records.
A data processing apparatus comprising:
a first processing unit, configured to, while counting each identical key value in an association field of a left table, associate the left table with a table to be associated using a first key value to obtain first data, and associate the left table with the table to be associated using a second key value to obtain second data; wherein, for each key value in the association field of the left table, the key value itself is used as the first key value while its count value does not exceed a record number threshold; and once the count value of the key value exceeds the record number threshold, a first random number suffix is added to the key value, for the part exceeding the record number threshold, to obtain the second key value, the first random number suffix being an integer in the range [1, N], where N is an integer greater than 1;
a second processing unit, configured to add a second random number suffix to the key value in the association field of the second data, copy each data record of the table to be associated into N copies, add an increment value suffix to the key value of each of the N copied data records, and associate the second data with the copied table to be associated using the key value with the second random number suffix added, to obtain third data, wherein the second random number suffix is an integer in the range [1, N], N is an integer greater than 1, the increment value suffixes added to the key values of the N copied data records all differ from one another, and each increment value suffix is an integer in the interval [1, N];
and a merging unit, configured to merge the first data and the third data to obtain the associated data records.
Compared with the prior art, the technical scheme of the invention has the following advantages:
According to the data processing method and device provided by the above technical solution, first, while each identical key value in the association field of the left table is being counted, the left table is associated with the table to be associated using the first key value to obtain first data, and using the second key value to obtain second data; then the second data is associated with the copied table to be associated, using the key value with the second random number suffix added, to obtain third data. The set of the first data and the third data is the final required data set. Compared with the prior art, the invention needs only 2 jobs to handle the data skew problem, which reduces the number of jobs and the IO operations in the processing and thereby improves processing efficiency.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a schematic diagram of the data skew problem;
FIG. 2 is a flowchart of a data processing method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a first data association process according to an embodiment of the present invention;
FIG. 4 is a flowchart of the first data association according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a second data association process according to an embodiment of the present invention;
FIG. 6 is a flowchart of the second data association according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of another first data association process according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of another second data association process according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of the logical structure of a data processing apparatus according to an embodiment of the present invention.
Detailed Description
The core idea of the invention is that, while the key values in the association field of the left table are being counted by group, the corresponding data records are processed differently in two stages, namely the stage in which the count value does not exceed the record number threshold and the stage in which it does, so that the number of jobs and the IO operations are reduced and the processing efficiency is improved.
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The present embodiment provides a data processing method, as shown in fig. 2, the method includes the steps of:
S11: While counting each identical key value in the association field of the left table, associate the left table with the table to be associated using the first key value to obtain first data, and associate the left table with the table to be associated using the second key value to obtain second data.
The number of occurrences of each identical key value in the association field of the left table may differ, and two cases arise. In the first case, the total number of occurrences of a key value does not exceed the record number threshold, so that while the key value is being counted there is no stage in which the count value exceeds the threshold. In the second case, the total number of occurrences of a key value exceeds the record number threshold, so that while the key value is being counted there is first a stage in which the count value does not exceed the threshold and then a stage in which it does; in these two stages the first key value and the second key value, respectively, are used to associate the corresponding data records of the left table with the table to be associated. The first key value is the key value itself while the count value does not exceed the record number threshold; the second key value is the key value with a first random number suffix added once the count value exceeds the record number threshold. The first random number suffix is an integer in the range [1, N], where N is an integer greater than 1. The process of associating data using the different key values, i.e. the first data association, is illustrated below:
The table structures of table A, table B, and table C are assumed to be as follows:
Table 1-1 Structure of Table A
Table A field    Field name
TRANS_ID         Transaction code
ACCT_NO          Transaction account number
Table 1-2 Structure of Table B
Table B field    Field name
ACCT_NO          Account number
ACCT_NAME        Account name
Table 1-3 Structure of Table C
Table C field    Field name
ACCT_NO          Account number
ACCT_EMAIL       Account email address
Here table A is the left table, table B and table C are the tables to be associated, and the association field of all three tables in the processing is ACCT_NO. Referring to fig. 3, the key values in the association field are 111111, 222222, 333333, and so on. The Input stage inputs table A, table B, and table C into the system. The Splitting stage divides tables A, B, and C into small tables. In the Mapping stage, the Map end maintains a counter that records the number of occurrences of 111111, 222222, and 333333 in the association field of table A. The counts of 222222 and 333333 do not exceed the record number threshold, so the key values 222222 and 333333 in the association field are left unchanged. The count of 111111 exceeds the record number threshold: the occurrences of 111111 in the stage not exceeding the threshold are left unchanged, and a first random number suffix is added to the occurrences of 111111 in the stage exceeding the threshold. In the Shuffling stage, the records of tables A, B, and C are distributed into different groups according to the key values in the association field. The Reducing stage associates the data records of tables A, B, and C that correspond to the same key value. In the Output stage, the associated data records corresponding to key values without the first random number suffix are combined to form a below file, and the data records corresponding to key values with the first random number suffix are combined to form an above file. Because the first random number suffix is added only to the key values in the association field of table A, while the key values in the association fields of tables B and C are not modified, the records of table A whose key value carries the first random number suffix cannot be associated with any data record in tables B and C, and the resulting above file is therefore a part of the original table A.
Since the data records of table A beyond the record number threshold have been spread over different groups by the first random number suffix, each reduce node receives a more even amount of data, so that no data skew occurs. The counter maintained at the Map end can be implemented in memory, in HDFS, or in a local file system.
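The Mapping-stage behaviour described above can be sketched as a Python generator. This is a minimal sketch under several assumptions: a single in-memory counter, a "+" separator between key value and suffix (borrowed from the notation 111111+65 used in fig. 3), and hypothetical names such as `map_left_table`. It is not the Hadoop Mapper implementation itself.

```python
import random
from collections import defaultdict


def map_left_table(rows, key, threshold, n, rng):
    """Emit (output_key, row) pairs for the left table while counting keys."""
    counts = defaultdict(int)  # the Map-end counter, kept in memory here
    for row in rows:
        counts[row[key]] += 1
        if counts[row[key]] <= threshold:
            # Count has not exceeded the threshold: first key value, unchanged.
            yield row[key], row
        else:
            # Count exceeds the threshold: second key value with a random suffix in [1, n].
            yield f"{row[key]}+{rng.randint(1, n)}", row


# Hypothetical left table: five records for 111111, two for 222222.
rows = [{"TRANS_ID": f"{i:08d}", "ACCT_NO": "111111"} for i in range(1, 6)]
rows += [{"TRANS_ID": f"{i:08d}", "ACCT_NO": "222222"} for i in range(6, 8)]

rng = random.Random(42)  # seeded only to make the sketch repeatable
pairs = list(map_left_table(rows, "ACCT_NO", threshold=2, n=4, rng=rng))
keys = [k for k, _ in pairs]
```

With a threshold of 2, the first two records of 111111 and both records of 222222 keep their original key, and only the three records of 111111 beyond the threshold receive a suffix. Counting and routing happen in the same pass, so no separate statistics job is needed.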
Fig. 4 shows a specific flow of the first data association. The process comprises the following steps:
S111: Input the left table and the table to be associated to the Map end.
S112: After the data records of the left table are input to the Map end, count each identical key value in the association field.
S113: For each identical key value in the association field of the left table, judge whether the count value KNum corresponding to the key value exceeds the record number threshold; if not, execute step S114; if so, execute step S117.
S114: Output the key value from the Map end without adding a suffix.
S115: Add the association field to each data record of the table to be associated and output it from the Map end.
Referring to fig. 3, (111111, Peter) is a data record in table B to be associated, and the leading 111111 in [111111, {111111, Peter}] is the added association field.
S116: At the reduce node, use the first key value to associate the corresponding first data record in the left table with the data record in the table to be associated, and take the associated first data record as first data.
The first data record is the data record in the left table corresponding to the first key value. Referring to fig. 3, in the data [111111, {00000001,111111}], 111111 is the first key value, and {00000001,111111} is the first data record corresponding to the first key value.
S117: Add a first random number suffix to the key value and output it from the Map end.
S118: At the reduce node, use the second key value to take the corresponding second data record in the left table as second data.
The second data record is the data record in the left table corresponding to the second key value. Referring to fig. 3, 111111+65 and 111111+125 in the data [111111+65, {19412353,111111}], [111111+125, {03412353,111111}], and so on are key values with a first random number suffix added, i.e. second key values; {19412353,111111} and {03412353,111111} are both second data records.
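The reduce side of steps S116 and S118 can be sketched as follows, using the example records quoted from fig. 3. The function `reduce_first_pass`, the "+" test used to recognise a second key value, and the in-memory grouping that stands in for the shuffle are all assumptions made for illustration.

```python
from collections import defaultdict


def reduce_first_pass(left_pairs, right_rows, key):
    """Group left records by output key, then build first data and second data."""
    right_by_key = {row[key]: row for row in right_rows}
    right_cols = sorted({c for row in right_rows for c in row if c != key})
    groups = defaultdict(list)
    for out_key, row in left_pairs:  # stands in for the shuffle
        groups[out_key].append(row)
    first_data, second_data = [], []
    for out_key, rows in groups.items():
        if "+" in out_key:
            # Second key value: no record in the right table carries a suffix,
            # so these rows pass through unassociated (the above file).
            second_data.extend(rows)
        else:
            # First key value: ordinary left join (the below file).
            match = right_by_key.get(out_key)
            for row in rows:
                extra = {c: (match[c] if match else None) for c in right_cols}
                first_data.append({**row, **extra})
    return first_data, second_data


# Map output for the left table, using the records shown in fig. 3.
left_pairs = [
    ("111111", {"TRANS_ID": "00000001", "ACCT_NO": "111111"}),
    ("111111+65", {"TRANS_ID": "19412353", "ACCT_NO": "111111"}),
    ("111111+125", {"TRANS_ID": "03412353", "ACCT_NO": "111111"}),
]
table_b = [{"ACCT_NO": "111111", "ACCT_NAME": "Peter"}]

first_data, second_data = reduce_first_pass(left_pairs, table_b, "ACCT_NO")
```

The record under the plain key 111111 is joined with Peter's row, while the two suffixed records come out unchanged and form the input of the second association.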
S12: Add a second random number suffix to the key value in the association field of the second data, copy each data record of the table to be associated into N copies, add an increment value suffix to the key value of each of the N copied data records, and associate the second data with the copied table to be associated using the key value with the second random number suffix added, to obtain third data.
The second random number suffix follows the same rule as the first random number suffix. Specifically, the second random number suffix is an integer in the range [1, N], where N is an integer greater than 1. The increment value suffixes added to the key values of the N copied data records all differ from one another, and each is an integer in the interval [1, N]. Because each data record of the table to be associated is copied N times and an increment value suffix is added to the key value in its association field, every data record in the second data can be associated with a data record of the table to be associated, and none is missed. The process of associating data using the key value with the second random number suffix added, i.e. the second data association, is illustrated below:
As shown in fig. 5, the Input stage inputs the above file, table B, and table C into the system. The Splitting stage divides the above file, table B, and table C into small tables. In the Mapping stage, a second random number suffix is added to the key value of each data record in the above file, and each data record in table B and table C is copied into N copies, with an increment value suffix added to the key value of each copy. In the Shuffling stage, the records of the above file, table B, and table C are distributed into different groups according to the key values in the association field. In the Reducing stage, the data records of the above file, table B, and table C that correspond to the same key value are associated. In the Output stage, the associated data records are merged together and combined with the below file from the first association. Since the data records in the above file have been spread over different groups by the second random number suffix, each reduce node receives a more even amount of data, so that no data skew occurs.
Fig. 6 shows a specific flow of the second data association. The process comprises the following steps:
S121: Input the second data and the table to be associated to the Map end.
S122: After the second data is input to the Map end, add a second random number suffix to the key value of the association field and output it from the Map end.
S123: After the table to be associated is input to the Map end, copy each data record of the table to be associated into N copies, add an increment value suffix to the key value of each of the N copied data records, and output them from the Map end.
S124: At the reduce node, use the key value with the second random number suffix added to associate the corresponding data record in the second data with the data record in the copied table to be associated, and take the associated data record as third data. The third data and the first data are combined to form the final associated data records.
The corresponding data record in the second data is the data record in the second data that corresponds to the key value with the second random number suffix added. Referring to fig. 5, 111111+125 in the data [111111+125, {03412353,111111}] is the key value with the second random number suffix added; {03412353,111111} is the data record in the second data corresponding to that key value.
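Steps S121 to S124 can be sketched in Python as below. The sketch assumes a "+" separator between key value and suffix, an in-memory grouping in place of the shuffle, and hypothetical names (`second_association`, the sample rows, N = 3); it illustrates the suffix-and-copy idea rather than the actual MapReduce job.

```python
import random
from collections import defaultdict


def second_association(second_rows, right_rows, key, n, rng):
    """Suffix the second data, copy the right table n times, then join per key."""
    right_cols = sorted({c for row in right_rows for c in row if c != key})
    # S122: second random number suffix in [1, n] on every record of the second data.
    left_groups = defaultdict(list)
    for row in second_rows:
        left_groups[f"{row[key]}+{rng.randint(1, n)}"].append(row)
    # S123: n copies of each right-hand record, with increment suffixes 1..n.
    right_copies = {
        f"{row[key]}+{i}": row for row in right_rows for i in range(1, n + 1)
    }
    # S124: join the records that share the same suffixed key value.
    third_data = []
    for suffixed_key, rows in left_groups.items():
        match = right_copies.get(suffixed_key)
        for row in rows:
            extra = {c: (match[c] if match else None) for c in right_cols}
            third_data.append({**row, **extra})
    return third_data, left_groups, right_copies


# Hypothetical second data: six skewed records for key 111111.
second_rows = [{"TRANS_ID": f"{i:08d}", "ACCT_NO": "111111"} for i in range(1, 7)]
table_b = [{"ACCT_NO": "111111", "ACCT_NAME": "Peter"}]

rng = random.Random(7)  # seeded only to make the sketch repeatable
third_data, left_groups, right_copies = second_association(
    second_rows, table_b, "ACCT_NO", n=3, rng=rng
)
```

Because the right-hand table contains a copy for every suffix from 1 to N, whichever suffix a record of the second data draws, a matching copy exists, so no record is missed while the skewed key is spread over up to N groups.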
S13: and combining the first data and the third data to obtain the associated data record.
Compared with the prior art, the data processing method provided by this embodiment needs only 2 jobs to handle the data skew problem, which reduces the number of jobs and the IO operations in the processing and thereby improves processing efficiency.
In order to further reduce IO operations, it can also be implemented by reducing the data records in the table to be associated that need to be copied. The specific method comprises the following steps:
After the left table and the table to be associated have been associated using the second key value to obtain the second data, the first random number suffix is removed from each second key value, and the resulting key values are deduplicated to obtain fourth data. Referring to fig. 7, the sign file is the fourth data, and the key value contained in the fourth data is 111111.
Before the step of copying each data record of the table to be associated into N copies, the data records of the table to be associated that do not match any key value contained in the fourth data are removed. Referring to fig. 8, the data records corresponding to key values 222222 and 333333 in table B and table C are not copied. This reduces the amount of data transmitted to the reduce nodes and thus further reduces IO operations. This can be implemented by placing the sign file in the distributed cache.
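The effect of the sign file can be sketched as follows. The helper names `build_sign_keys` and `replicate_matching`, the "+" separator, and the sample names other than Peter are hypothetical; the sketch only shows how the fourth data limits which records of the table to be associated are copied.

```python
def build_sign_keys(second_keys):
    """Fourth data: strip the random suffix from each second key value and deduplicate."""
    return {k.rsplit("+", 1)[0] for k in second_keys}


def replicate_matching(right_rows, key, sign_keys, n):
    """Copy n times only those right-hand records whose key appears in the fourth data."""
    return [
        (f"{row[key]}+{i}", row)
        for row in right_rows
        if row[key] in sign_keys
        for i in range(1, n + 1)
    ]


# Second key values as produced by the first association (suffix notation of fig. 3).
second_keys = ["111111+65", "111111+125", "111111+65"]
table_b = [
    {"ACCT_NO": "111111", "ACCT_NAME": "Peter"},
    {"ACCT_NO": "222222", "ACCT_NAME": "Alice"},  # hypothetical row
    {"ACCT_NO": "333333", "ACCT_NAME": "Bob"},    # hypothetical row
]

N = 3
sign_keys = build_sign_keys(second_keys)
filtered_copies = replicate_matching(table_b, "ACCT_NO", sign_keys, N)
unfiltered_count = len(table_b) * N  # copies that would be made without the sign file
```

In this example only the record for 111111 is copied, so 3 records are sent to the reduce nodes instead of 9.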
The values of the record number threshold and N are explained below:
[Formula image not reproduced: calculation of the record number threshold and of N from MSize, NNum, RSize, TNum, and the constant k]
The constant k is greater than or equal to 1 and defaults to 2; k is a reference value, and the actual value can be adjusted according to actual running conditions. If memory is still available while the reduce tasks are running, the constant k in the threshold formula can be reduced appropriately so that the threshold is raised. This reduces the number of second key values in steps S11 and S12 and the number of key values in the fourth data, which in turn reduces the number of records of tables B and C that are copied, further reduces IO operations, and improves processing efficiency. If memory is short while the reduce tasks are running, the constant k in the threshold formula can be increased appropriately so that the threshold is lowered. This reduces the number of first data records sent to each reduce node in step S11, which reduces the memory consumption of the reduce nodes and ensures that the job runs stably. MSize is the memory size of one reduce node, NNum is the number of all map nodes in the Hadoop cluster, RSize is the size of one data record of the left table, and TNum is the number of all data records in the left table.
While, for purposes of simplicity of explanation, the foregoing method embodiments have been described as a series of acts or combination of acts, it will be appreciated by those skilled in the art that the present invention is not limited by the illustrated ordering of acts, as some steps may occur in other orders or concurrently with other steps in accordance with the invention.
The following are embodiments of the apparatus of the present invention that may be used to perform embodiments of the method of the present invention. For details which are not disclosed in the embodiments of the apparatus of the present invention, reference is made to the embodiments of the method of the present invention.
The present embodiment provides a data processing apparatus, as shown in fig. 9, including: a first processing unit 11, a second processing unit 12, and a merging unit 13, wherein:
the first processing unit 11 is configured to, while counting each identical key value in an association field of a left table, associate the left table with a table to be associated using a first key value to obtain first data, and associate the left table with the table to be associated using a second key value to obtain second data; wherein, for each key value in the association field of the left table, the key value itself is used as the first key value while its count value does not exceed a record number threshold; and once the count value of the key value exceeds the record number threshold, a first random number suffix is added to the key value, for the part exceeding the record number threshold, to obtain the second key value, the first random number suffix being an integer in the range [1, N], where N is an integer greater than 1;
the second processing unit 12 is configured to add a second random number suffix to the key value in the association field of the second data, copy each data record of the table to be associated into N copies, add an increment value suffix to the key value of each of the N copied data records, and associate the second data with the copied table to be associated using the key value with the second random number suffix added, to obtain third data, wherein the second random number suffix is an integer in the range [1, N], N is an integer greater than 1, the increment value suffixes added to the key values of the N copied data records all differ from one another, and each increment value suffix is an integer in the interval [1, N];
the merging unit 13 is configured to merge the first data and the third data to obtain the associated data records.
The first processing unit 11 specifically includes a counting subunit, a data dividing subunit, a first association subunit and a second association subunit.
The counting subunit is configured to count each identical key value in the association field after the data records of the left table are input to the Map side;
the data dividing subunit is configured to determine, for each identical key value, whether the count value corresponding to the key value exceeds the record number threshold; if not, the key value is output from the Map side without any suffix; if so, the first random number suffix is added to the key value before it is output from the Map side;
the first association subunit is configured to associate, by using the first key value, the corresponding first data records in the left table with the data records in the table to be associated, and to take the associated records as the first data;
and the second association subunit is configured to take, by using the second key value, the corresponding second data records in the left table as the second data.
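The counting and data dividing steps above can be illustrated with a minimal single-process sketch in Python. It is not the patented implementation: the function names, the `#` separator, the fixed seed, and the in-memory lists standing in for Map input and output are all assumptions made for illustration. Records whose running count for a key stays within the threshold keep the bare key (first key value); later records receive a random integer suffix in [1, N] (second key value).

```python
import random
from collections import defaultdict

SEP = "#"  # hypothetical separator between a key value and its suffix


def split_left_records(left_records, threshold, n, seed=0):
    """Tag left-table records in one pass, as a Map task might.

    The first `threshold` records of each key keep the bare key (first key
    value); every later record of that key gets the key plus a random
    integer suffix in [1, n] (second key value).
    """
    rng = random.Random(seed)
    counts = defaultdict(int)
    first_keyed, second_keyed = [], []
    for key, payload in left_records:
        counts[key] += 1
        if counts[key] <= threshold:
            first_keyed.append((key, payload))
        else:
            salted = f"{key}{SEP}{rng.randint(1, n)}"
            second_keyed.append((salted, payload))
    return first_keyed, second_keyed


def join_first_data(first_keyed, right_records):
    """Plain equi-join of the unsuffixed records with the table to be associated."""
    index = defaultdict(list)
    for key, payload in right_records:
        index[key].append(payload)
    return [(key, lp, rp) for key, lp in first_keyed for rp in index.get(key, [])]


# Toy data: key "a" is skewed (5 records), key "b" is not.
left = [("a", i) for i in range(5)] + [("b", 0)]
right = [("a", "x"), ("b", "y")]
first_keyed, second_keyed = split_left_records(left, threshold=2, n=3)
first_data = join_first_data(first_keyed, right)
```

With a threshold of 2, the first two "a" records and the single "b" record are joined directly as first data, while the remaining three "a" records are set aside under suffixed keys as second data.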
The second processing unit 12 specifically includes a first processing subunit, a second processing subunit and a third association subunit.
The first processing subunit is configured to, after the second data is input to the Map side, add the second random number suffix to the key value of the association field and output the suffixed key value from the Map side;
the second processing subunit is configured to, after the table to be associated is input to the Map side, copy each data record of the table to be associated into N copies, add an increment value suffix to the key value of each of the N copies, and output the suffixed key values from the Map side;
and the third association subunit is configured to associate, by using the key value with the second random number suffix, the corresponding data records in the second data with the data records in the copied table to be associated, and to take the associated data records as the third data.
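The second stage can be sketched in the same hedged way. The Python below assumes that the second data carries the original key value (any first suffix already stripped), that suffixes are appended with a hypothetical `#` separator, and that an in-memory hash join stands in for the Reduce-side association; none of these details are specified by the text. Because every copy of a right-table record gets a different increment value suffix in [1, N], a left record with any random suffix in [1, N] matches exactly one copy, so the result equals that of an ordinary join on the original key.

```python
import random
from collections import defaultdict

SEP = "#"  # hypothetical separator between a key value and its suffix


def salt_second_data(second_data, n, seed=0):
    """Add a random integer suffix in [1, n] to each key of the second data."""
    rng = random.Random(seed)
    return [(f"{key}{SEP}{rng.randint(1, n)}", payload) for key, payload in second_data]


def replicate_right(right_records, n):
    """Copy each right-table record n times, with increment suffixes 1..n."""
    return [
        (f"{key}{SEP}{i}", payload)
        for key, payload in right_records
        for i in range(1, n + 1)
    ]


def join_on_salted_key(salted_left, replicated_right):
    """Equi-join on the suffixed key, reporting the original key in the output."""
    index = defaultdict(list)
    for key, payload in replicated_right:
        index[key].append(payload)
    third = []
    for key, left_payload in salted_left:
        original_key = key.rsplit(SEP, 1)[0]
        for right_payload in index.get(key, []):
            third.append((original_key, left_payload, right_payload))
    return third


second_data = [("a", i) for i in range(4)]
right = [("a", "x"), ("b", "y")]
n = 3
third_data = join_on_salted_key(salt_second_data(second_data, n), replicate_right(right, n))
```

The trade-off is visible in the toy data: the table to be associated grows from 2 to 6 records, in exchange for spreading the skewed key "a" over up to N distinct join keys.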
The first processing unit may further include a third processing subunit, configured to, after the association processing is performed on the left table and the table to be associated by using the second key value to obtain the second data, remove the first random number suffix from the second key value and deduplicate the key values from which the first random number suffix has been removed, to obtain fourth data;
the second processing unit may further include a fourth processing subunit, configured to, before each data record of the table to be associated is copied into N copies, remove from the table to be associated the data records that do not match any key value contained in the fourth data.
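The optional pruning step can be sketched as follows, reading the fourth data as the set of distinct skewed key values with the suffix removed (an interpretation, not something the text states outright). The function names and the `#` separator are hypothetical, and the sketch assumes the separator never occurs inside an original key value. Pruning before replication means only right-table records that can actually match the second data are copied N times.

```python
SEP = "#"  # hypothetical separator; assumed absent from original key values


def build_fourth_data(second_key_values):
    """Strip the trailing suffix from each second key value and deduplicate."""
    return {key.rsplit(SEP, 1)[0] for key in second_key_values}


def prune_right(right_records, fourth_data):
    """Keep only right-table records whose key appears in the fourth data."""
    return [(key, payload) for key, payload in right_records if key in fourth_data]


second_key_values = ["a#2", "a#1", "a#2", "c#3"]
right = [("a", "x"), ("b", "y"), ("c", "z")]
fourth_data = build_fourth_data(second_key_values)
pruned_right = prune_right(right, fourth_data)
```

Here the record with key "b" is dropped before replication because no skewed left-table record carries that key.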
The apparatus embodiments described above are merely illustrative. The units described as separate parts may or may not be physically separate, and the parts shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. A person of ordinary skill in the art can understand and implement it without inventive effort.
In this document, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
The embodiments in the present description are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to one another.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A data processing method, comprising:
while counting each identical key value in an association field of a left table, performing association processing on the left table and a table to be associated by using a first key value to obtain first data, and performing association processing on the left table and the table to be associated by using a second key value to obtain second data; wherein, for each key value in the association field of the left table, the key value is used as the first key value while its count value does not exceed a record number threshold; once the count value of the key value exceeds the record number threshold, a first random number suffix is added to the key value for the part exceeding the record number threshold to obtain the second key value, the first random number suffix being an integer in the range [1, N], and N being an integer greater than 1;
adding a second random number suffix to the key value in the association field of the second data, copying each data record of the table to be associated into N copies, adding an increment value suffix to the key value of each of the N copies, and performing association processing on the second data and the copied table to be associated by using the key value with the second random number suffix to obtain third data, wherein the second random number suffix is an integer in the range [1, N], N is an integer greater than 1, the increment value suffixes added to the key values of the N copies differ from one another, and each increment value suffix is an integer in the range [1, N];
and combining the first data and the third data to obtain the associated data record.
2. The method according to claim 1, wherein performing association processing on the left table and the table to be associated by using the first key value to obtain the first data, and by using the second key value to obtain the second data, while counting each identical key value in the association field of the left table, specifically comprises:
after the data records of the left table are input to the Map side, counting each identical key value in the association field;
for each identical key value, determining whether the count value corresponding to the key value exceeds the record number threshold; if not, outputting the key value from the Map side without any suffix; if so, adding the first random number suffix to the key value and outputting the suffixed key value from the Map side;
associating, by using the first key value, the corresponding first data records in the left table with the data records in the table to be associated, and taking the associated records as the first data;
and taking, by using the second key value, the corresponding second data records in the left table as the second data.
3. The method according to claim 1, wherein adding the second random number suffix to the key value in the association field of the second data, copying each data record of the table to be associated into N copies, adding an increment value suffix to the key value of each of the N copies, and performing association processing on the second data and the copied table to be associated by using the key value with the second random number suffix to obtain the third data, specifically comprises:
after the second data is input to the Map side, adding the second random number suffix to the key value of the association field and outputting the suffixed key value from the Map side;
after the table to be associated is input to the Map side, copying each data record of the table to be associated into N copies, adding an increment value suffix to the key value of each of the N copies, and outputting the suffixed key values from the Map side;
and associating, by using the key value with the second random number suffix, the corresponding data records in the second data with the data records in the copied table to be associated, and taking the associated data records as the third data.
4. The method according to claim 3, wherein after the association processing is performed on the left table and the table to be associated by using the second key value to obtain the second data, the method further comprises: removing the first random number suffix from the second key value, and deduplicating the key values from which the first random number suffix has been removed, to obtain fourth data;
and before the step of copying each data record of the table to be associated into N copies, the method further comprises: removing from the table to be associated the data records that do not match any key value contained in the fourth data.
5. The method according to any one of claims 1 to 4, wherein the calculation formula of the record number threshold is as follows:
Figure FDA0002696402570000021
wherein k is 2, MSize is the memory size on one reduce node, NNum is the number of all map nodes in the Hadoop cluster, and RSize is the size of one data record of the left table.
6. A data processing apparatus, comprising:
a first processing unit, configured to, while counting each identical key value in an association field of a left table, perform association processing on the left table and a table to be associated by using a first key value to obtain first data, and perform association processing on the left table and the table to be associated by using a second key value to obtain second data; wherein, for each key value in the association field of the left table, the key value is used as the first key value while its count value does not exceed a record number threshold; once the count value of the key value exceeds the record number threshold, a first random number suffix is added to the key value for the part exceeding the record number threshold to obtain the second key value, the first random number suffix being an integer in the range [1, N], and N being an integer greater than 1;
a second processing unit, configured to add a second random number suffix to the key value in the association field of the second data, copy each data record of the table to be associated into N copies, add an increment value suffix to the key value of each of the N copies, and perform association processing on the second data and the copied table to be associated by using the key value with the second random number suffix to obtain third data, wherein the second random number suffix is an integer in the range [1, N], N is an integer greater than 1, the increment value suffixes added to the key values of the N copies differ from one another, and each increment value suffix is an integer in the range [1, N];
and a merging unit, configured to merge the first data and the third data to obtain the associated data records.
7. The apparatus according to claim 6, wherein the first processing unit specifically includes:
a counting subunit, configured to count each identical key value in the association field after the data records of the left table are input to the Map side;
a data dividing subunit, configured to determine, for each identical key value, whether the count value corresponding to the key value exceeds the record number threshold; if not, the key value is output from the Map side without any suffix; if so, the first random number suffix is added to the key value before it is output from the Map side;
a first association subunit, configured to associate, by using the first key value, the corresponding first data records in the left table with the data records in the table to be associated, and to take the associated records as the first data;
and a second association subunit, configured to take, by using the second key value, the corresponding second data records in the left table as the second data.
8. The apparatus according to claim 6, wherein the second processing unit specifically includes:
a first processing subunit, configured to, after the second data is input to the Map side, add the second random number suffix to the key value of the association field and output the suffixed key value from the Map side;
a second processing subunit, configured to, after the table to be associated is input to the Map side, copy each data record of the table to be associated into N copies, add an increment value suffix to the key value of each of the N copies, and output the suffixed key values from the Map side;
and a third association subunit, configured to associate, by using the key value with the second random number suffix, the corresponding data records in the second data with the data records in the copied table to be associated, and to take the associated data records as the third data.
9. The apparatus according to claim 8, wherein the first processing unit further comprises: a third processing subunit, configured to, after the association processing is performed on the left table and the table to be associated by using the second key value to obtain the second data, remove the first random number suffix from the second key value and deduplicate the key values from which the first random number suffix has been removed, to obtain fourth data;
and the second processing unit further comprises: a fourth processing subunit, configured to, before each data record of the table to be associated is copied into N copies, remove from the table to be associated the data records that do not match any key value contained in the fourth data.
10. The apparatus according to any one of claims 6 to 9, wherein the calculation formula of the record number threshold is:
Figure FDA0002696402570000041
wherein k is 2, MSize is the memory size on one reduce node, NNum is the number of all map nodes in the Hadoop cluster, and RSize is the size of one data record of the left table.
CN201810315884.8A 2018-04-10 2018-04-10 Data processing method and device Active CN108536824B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810315884.8A CN108536824B (en) 2018-04-10 2018-04-10 Data processing method and device

Publications (2)

Publication Number Publication Date
CN108536824A CN108536824A (en) 2018-09-14
CN108536824B true CN108536824B (en) 2020-11-20

Family

ID=63479767

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810315884.8A Active CN108536824B (en) 2018-04-10 2018-04-10 Data processing method and device

Country Status (1)

Country Link
CN (1) CN108536824B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111966681A (en) * 2020-08-14 2020-11-20 咪咕文化科技有限公司 Data processing method, device, network equipment and storage medium
CN113098840B (en) * 2021-02-25 2022-08-16 鹏城实验室 Efficient and safe linear rectification function operation method based on addition secret sharing technology
CN113821541A (en) * 2021-09-27 2021-12-21 北京沃东天骏信息技术有限公司 Data skew processing method, apparatus, storage medium, and program product
CN118233454B (en) * 2024-05-09 2024-07-16 成都中科合迅科技有限公司 File batch uploading method and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140108459A1 (en) * 2012-10-12 2014-04-17 International Business Machines Corporation Functionality of decomposition data skew in asymmetric massively parallel processing databases
CN105095413A (en) * 2015-07-09 2015-11-25 北京京东尚科信息技术有限公司 Method and apparatus for solving data skew
CN106776026A (en) * 2016-12-19 2017-05-31 北京奇虎科技有限公司 A kind of distributed data processing method and device
CN106874322A (en) * 2016-06-27 2017-06-20 阿里巴巴集团控股有限公司 A kind of data table correlation method and device
CN107066612A (en) * 2017-05-05 2017-08-18 郑州云海信息技术有限公司 A kind of self-adapting data oblique regulating method operated based on SparkJoin

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
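Handling Data-skew Effects in Join Operations using MapReduce; M. Al Hajj Hassan et al.; Procedia Computer Science; 2014-12-31; Vol. 29; full text *
Research on Solutions to Data Skew in MapReduce (MapReduce中数据倾斜解决方法的研究); Wang Gang (王刚) et al.; Computer Technology and Development (《计算机技术与发展》); 2016-09-30; Vol. 26, No. 9; full text *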

Also Published As

Publication number Publication date
CN108536824A (en) 2018-09-14

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20221110

Address after: 100005 No. 69, inner main street, Dongcheng District, Beijing, Jianguomen

Patentee after: AGRICULTURAL BANK OF CHINA

Patentee after: Agricultural Bank of China Financial Technology Co.,Ltd.

Address before: 100005 No. 69, inner main street, Dongcheng District, Beijing, Jianguomen

Patentee before: AGRICULTURAL BANK OF CHINA