CN108536824B

CN108536824B - Data processing method and device

Info

Publication number: CN108536824B
Application number: CN201810315884.8A
Authority: CN
Inventors: 孟洋; 郭会; 韩大志
Original assignee: Agricultural Bank of China
Current assignee: Agricultural Bank Of China Financial Technology Co ltd; Agricultural Bank of China
Priority date: 2018-04-10
Filing date: 2018-04-10
Publication date: 2020-11-20
Anticipated expiration: 2038-04-10
Also published as: CN108536824A

Abstract

The application discloses a data processing method and a data processing device, firstly, when counting each same key value in an association field of a left table, performing association processing on the left table and a table to be associated by using a first key value to obtain first data, and performing association processing on the left table and the table to be associated by using a second key value to obtain second data; and then, carrying out association processing on the second data and the copied table to be associated by using the key value added with the suffix of the second random number to obtain third data. The obtained set of the first data and the third data is the final required data set. Compared with the prior art, the invention only needs 2 jobs for processing the data tilt problem, reduces the jobs number in the processing process, and simultaneously reduces IO operation, thereby improving the processing efficiency.

Description

Data processing method and device

Technical Field

The present invention relates to the field of data processing, and more particularly, to a data processing method and apparatus.

Background

At present, in the face of increasing processing requirements of mass data, a solution of Hadoop is often adopted. The most core designs of the Hadoop framework are HDFS (Hadoop Distributed File System) and MapReduce. The HDFS provides storage for massive data, has the characteristic of high fault tolerance and is used for being deployed on low-cost (low-cost) hardware; and it provides high throughput (high throughput) to access data of applications, suitable for applications with very large data sets.

MapReduce provides calculation for massive data. MapReduce is a programming model for parallel operation of large-scale data sets (greater than 1 TB). Map and Reduce are the main ideas of MapReduce, both from the functional programming language and from the properties of the vector programming language. The method greatly facilitates programmers to operate programs on the distributed system under the condition of no distributed parallel programming. Dividing the data into a plurality of distributed computing tasks according to the size of the data in a Map stage, performing local operation by using the advantage of high speed of data localization as much as possible, then performing partitioning (namely, entering the same data processing column according to the data of the same keyword), and finally performing combined correlation operation in a Reduce stage.

When the execution program performs data association in the Reduce stage, if the number of data of a certain associated keyword is more than that of other associated keywords, the amount of data processed by the Reduce node where the associated keyword is located is more than that of other nodes, so that most of the Reduce nodes are already executed, but one or more Reduce nodes are slow to run, so that the execution time of the whole program is long, which is called data skew. The data skew problem is illustrated below, assuming that the table structures of table a, table B, and table C are as follows:

tables 1-1 Table A Structure

Tables 1-2 Table B Structure

Table B field	Name of field
		ACCT_NO	Account number
ACCT_NAME	Name of a house

Tables 1-3 Table C structures

Table C field	Name of field
		ACCT_NO	Account number
ACCT_EMAIL	Account email box

Table a connects table B and table C to the left, and the associated fields are both ACCT _ NO, and are denoted by SQL as "a left join B on (a.acct _ NO ═ b.acct _ NO) left join C on (a.acct _ NO ═ c.acct _ NO)". In table B and table C, ACCT _ NO is the primary key, and exists only in table B and table C, and ACCT _ NO is not the primary key in table a, and there are a plurality of records with the same ACCT _ NO value. A table left join is one of the outer joins in the SQL statement, the result set includes all rows of the left table specified in the left join clause, not just the row to which the join column matches, if a row of the left table does not match a row in the right table, all select list columns of the right table are null in the associated result set row. Often, a data table has a column or combination of columns whose value uniquely identifies each row in the table, such column or columns being referred to as the primary key of the table by which the physical integrity of the table is enforced.

Tables 1-4 show the distribution of data in ACCT _ NO statistical Table A

ACCT _ NO value	Number of records
		111111	62357842
222222	2
		333333	1

As shown in the above table, the number of records of ACCT _ NO ═ 111111 in table a is 62357842, which is much larger than the number of records of other values. This creates a data skew problem during the Reduce phase. As shown in fig. 1, a Reduce node that processes key 111111 receives data records that are much larger than other nodes, which results in a processing time that is much longer than that of other nodes, and may also cause memory overflow, thereby causing the job not to be completed normally, and greatly affecting the job processing efficiency.

The idea of the existing solution is to average the data and then segment the data. Specifically, grouping statistics is carried out on the left table according to the associated key values, data records with the statistics values higher than a certain threshold value are stored in one file, and data records with the statistics values lower than the threshold value are stored in another file. For the part lower than the threshold, the processing is performed by associating the table B and the table C in the normal processing manner. For the part higher than the threshold value, a random number (the random number is not more than a certain value RMax) is added after each key value of the left table to form a new associated key value, and integers from 1 to RMax are added after each key value of the tables B and C to form RMax repeated records, and then the association processing is carried out with the new key value. And finally combining the result sets generated by the two parts of data files. However, the existing solution needs to count data distribution in advance, split and land data files, process each data file in a classified manner, and merge data files, which requires 4 jobs in total, and each job started by Hadoop requires more time, and corresponding IO operations are also more, which further results in lower processing efficiency.

Disclosure of Invention

In view of the above, the present invention provides a data processing method and apparatus, which aim to achieve the purpose of improving processing efficiency by reducing the number of jobs and IO operations in the processing process on the basis of solving the technical problem of data skew.

In order to achieve the above object, the following solutions are proposed:

a method of data processing, comprising:

while counting each same key value in an association field of a left table, performing association processing on the left table and a table to be associated by using a first key value to obtain first data, and performing association processing on the left table and the table to be associated by using a second key value to obtain second data; for each key value in the associated field of the left table, using the key value as a first key value when the count value does not exceed the threshold of the record number; when the count value of the key value exceeds the record number threshold, adding a first random number suffix on the basis of the key value to obtain a second key value in the part exceeding the record number threshold, wherein the value range of the first random number suffix is [1, N ], the first random number suffix is an integer, and N is an integer greater than 1;

adding a second random number suffix to the key value in the associated field of the second data, copying each data record of the table to be associated into N pieces, adding increment value suffixes to the key values corresponding to the N copied data records respectively, and performing association processing on the second data and the table to be associated after the copying processing by using the key value added with the second random number suffix to obtain third data, wherein the value interval of the second random number suffix is [1, N ], the second random number suffix is an integer, N is an integer greater than 1, the increment value suffixes added to the key values corresponding to the N copied data records are different, and the value of the increment value suffix is an integer in the interval [1, N ];

and combining the first data and the third data to obtain the associated data record.

A data processing apparatus comprising:

the device comprises a first processing unit, a second processing unit and a third processing unit, wherein the first processing unit is used for performing association processing on a left table and a table to be associated by using a first key value to obtain first data while counting each same key value in an associated field of the left table, and performing association processing on the left table and the table to be associated by using a second key value to obtain second data; for each key value in the associated field of the left table, using the key value as a first key value when the count value does not exceed the threshold of the record number; when the count value of the key value exceeds the record number threshold, adding a first random number suffix on the basis of the key value to obtain a second key value in the part exceeding the record number threshold, wherein the value range of the first random number suffix is [1, N ], the first random number suffix is an integer, and N is an integer greater than 1;

a second processing unit, configured to add a second random number suffix to a key value in an associated field of the second data, copy each data record of the table to be associated into N pieces, add increment suffixes to the key values corresponding to the N pieces of copied data records, and perform association processing on the second data and the table to be associated after the copy processing by using the key values added with the second random number suffix to obtain third data, where a value interval of the second random number suffix is [1, N ], the second random number suffix is an integer, N is an integer greater than 1, the increment values added to the key values corresponding to the N pieces of copied data records are different, and a value of the increment value suffix is an integer within the interval [1, N ];

and the merging unit is used for merging the first data and the third data to obtain the associated data record.

Compared with the prior art, the technical scheme of the invention has the following advantages:

according to the data processing method and device provided by the technical scheme, firstly, when the same key value in the associated field of the left table is counted, the first key value is used for performing association processing on the left table and the table to be associated to obtain first data, and the second key value is used for performing association processing on the left table and the table to be associated to obtain second data; and then, carrying out association processing on the second data and the copied table to be associated by using the key value added with the suffix of the second random number to obtain third data. The obtained set of the first data and the third data is the final required data set. Compared with the prior art, the invention only needs 2 jobs for processing the data tilt problem, reduces the jobs number in the processing process, and simultaneously reduces IO operation, thereby improving the processing efficiency.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

FIG. 1 is a schematic diagram of a data skew problem;

fig. 2 is a flowchart of a data processing method according to an embodiment of the present invention;

fig. 3 is a schematic diagram of a first data association process according to an embodiment of the present invention;

FIG. 4 is a flow chart of a first data association according to an embodiment of the present invention;

FIG. 5 is a diagram illustrating a second data association process according to an embodiment of the present invention;

fig. 6 is a flowchart of a second data association according to an embodiment of the present invention;

FIG. 7 is a diagram illustrating another first data association process according to an embodiment of the present invention;

FIG. 8 is a diagram illustrating another second data association process according to an embodiment of the invention;

fig. 9 is a schematic logical structure diagram of a data processing apparatus according to an embodiment of the present invention.

Detailed Description

The core thought of the invention is that while grouping statistics is carried out on the key values in the associated fields of the left table, different processing is carried out on the corresponding data records in two stages that the count value does not exceed the record number threshold value and the count value exceeds the record number threshold value, so that the job number and IO operation are reduced, and the processing efficiency is further improved.

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The present embodiment provides a data processing method, as shown in fig. 2, the method includes the steps of:

s11: and when counting each same key value in the associated field of the left table, performing association processing on the left table and the table to be associated by using the first key value to obtain first data, and performing association processing on the left table and the table to be associated by using the second key value to obtain second data.

The specific number of each identical key value in the associated field of the left table may be different. One is that the total number of the same key value does not exceed the threshold of the record number, when the key value is counted, the stage that the count value exceeds the threshold of the record number does not exist; and in the other situation, when the total number of the same key value exceeds the record number threshold, the key value is counted, the stage that the count value does not exceed the record number threshold is firstly carried out, and then the stage that the count value exceeds the record number threshold is carried out, and the two stages respectively use the first key value and the second key value to associate the corresponding data record in the left table with the table to be associated. The first key value is the key value when the count value does not exceed the record number threshold, the second key value is the key value increased by the first random number suffix when the count value exceeds the record number threshold, the value range of the first random number suffix is [1, N ], the first random number suffix is an integer, and N is an integer greater than 1. The following illustrates the process of data association using different key values, i.e. the first time of data association:

the table structures of table a, table B, and table C are assumed to be as follows:

tables 1-1 Table A Structure

Table a field	Name of field
		TRANS_ID	Transaction coding
ACCT_NO	Transaction account number

Tables 1-2 Table B Structure

Table B field	Name of field
		ACCT_NO	Account number
ACCT_NAME	Name of a house

Tables 1-3 Table C structures

Table C field	Name of field
		ACCT_NO	Account number
ACCT_EMAIL	Account email box

Wherein, the table A is a left table, the table B and the table C are tables to be associated, and the associated fields of the three tables are ACCT _ NO in the processing process. Referring to fig. 3, the key values in the association fields are 111111, 222222, 333333, etc. And the Input stage inputs the table A, the table B and the table C into the system. The Splitting stage divides tables A, B, and C into small tables. In the Mapping stage, the Map terminal maintains a counter and records the number of 111111, 222222 and 333333 in the associated fields of the table A respectively; 222222 and 333333 exceed the threshold of the record number, so the

key values

222222 and 333333 in the associated fields are not processed; the number of 111111 exceeds the record number threshold, the 111111 corresponding to the stage not exceeding the record number threshold is not processed, and a first random number suffix is added to the 111111 corresponding to the stage exceeding the record number threshold. In the Shuffling stage, the tables A, B and C are divided into different tables according to different key values in the associated fields. And the Reducing stage associates the data records corresponding to the same key value in the tables A, B and C. In the Output stage, the associated data records corresponding to the key values without the addition of the first random number suffix are combined together to form a below file; and combining the associated data records corresponding to the key value added with the first random number suffix to form an above file, wherein the key value in the associated field of the table A is added with the first random number suffix, while the key values in the associated fields of the tables B and C are not modified, so that the table A cannot be associated with the data records in the tables B and C by adding the key value of the first random number suffix, and the finally obtained above file is a part of the original table A. Since the data records in table a that exceed the record number threshold have been grouped by adding the first random number suffix, each reduce node receives more average data, so that no data skew occurs. The maintenance counter of the Map terminal can be implemented in a memory, or in an HDFS or a local file system.

Fig. 4 shows a specific flow of the first data association. The process comprises the following steps:

s111: inputting a left table and a table to be associated to a Map end;

s112: after the data record of the left table is input to the Map end, counting each same key value in the associated field;

s113: for each same key value in the associated field of the left table, judging whether a count value KNum corresponding to the key value exceeds a record number threshold value, if not, executing a step S114, and if so, executing a step S117;

s114: the key value is not subjected to postfix value processing and is output by the Map terminal;

s115: outputting from the Map terminal after adding the association fields to the association table;

referring to fig. 3, the data (111111, Peter) is a data record in the table B to be associated, and 111111 in [111111, {111111, Peter } ] is an incremental association field.

S116: associating a corresponding first data record in the left table with a data record in a table to be associated by using a first key value at the reduce node, and taking the associated first data record as first data;

the first data record is a data record corresponding to the first key value in the left table. Referring to fig. 3, in the data [111111, {00000001,111111} ], 111111 is the first key value, and {00000001,111111} is the first data record corresponding to the first key value.

S117: adding a first random number suffix to the key value and then outputting the key value by a Map end;

s118: and using the second key value to take the corresponding second data record in the left table as second data at the reduce node.

And the second data record is the data record corresponding to the second key value in the left table. Referring to fig. 3,111111 +65, 111111+125 in data [111111+65, {19412353,111111} ], [111111+125, {03412353,111111} ] and the like are each a key value added with a first random number suffix, i.e., a second key value; {19412353,111111}, {03412353,111111} are both second data records.

S12: adding a second random number suffix to the key value in the association field of the second data, copying each data record of the table to be associated into N pieces, adding increment value suffixes to the key values corresponding to the N pieces of copied data records, and performing association processing on the second data and the copied table to be associated by using the key value added with the second random number suffix to obtain third data.

The value rule of the second random number suffix is consistent with the value rule of the first random number suffix. Specifically, the value range of the second random number suffix is [1, N ], the second random number suffix is an integer, and N is an integer greater than 1. The increment value suffixes added to the key values corresponding to the N copied data records are different, and the values of the increment value suffixes are integers in an interval [1, N ]. By copying N data records in the table to be associated and adding an incremental value suffix to the key value in the associated field, the data records in the second data can be associated with the data records in the table to be associated, and omission does not occur. The following illustrates the data association process by increasing the key value of the second random number suffix, i.e. the second data association process:

as shown in fig. 5. And in the Input stage, the above file, the table B and the table C are Input into the system. And the Splitting stage divides the above file, the table B and the table C into small tables respectively. In the Mapping stage, adding a second random number suffix to the key value corresponding to each data record in the above file; and copying each data record in the table B and the table C into N, and adding an incremental value suffix to each corresponding key value respectively. And in the Shuffling stage, the above file, the table B and the table C are divided into different tables according to different key values in the associated fields. And in the Reducing stage, the data records corresponding to the same key value in the above file, the table B and the table C are associated together. Combining the associated data records together to form a below file in an Output stage; and merging the associated data records corresponding to the key values added with the first random number suffix. Since the data records in the above file have been grouped by adding the suffix of the second random number, each reduce node receives more average data, so that the data skew condition is not generated.

Fig. 6 shows a specific flow of the second data association. The process comprises the following steps:

s121: inputting the second data and the table to be associated into a Map terminal;

s122: after the second data is input to the Map terminal, adding a second random number suffix to the key value of the associated field and outputting the second random number suffix by the Map terminal;

s123: after the table to be associated is input to the Map end, copying each data record of the table to be associated into N pieces, and respectively adding increment value suffixes to the key values corresponding to the N pieces of copied data records and outputting the increment value suffixes by the Map end;

s124: and associating the corresponding data record in the second data with the data record in the copied table to be associated by using the key value added with the suffix of the second random number at the reduce node, and taking the associated data record as third data. And combining the third data and the first data together to be used as a finally associated data record.

And the corresponding data record in the second data is the data record in the second data corresponding to the key value added with the suffix of the second random number. Referring to FIG. 5, 111111+65 in the data [111111+125, {03412353,111111} ] is the key value increased by the second random number suffix; {03412353,111111} is a data record in the second data corresponding to the key value with the second random number suffix added.

S13: and combining the first data and the third data to obtain the associated data record.

Compared with the prior art, the data processing method provided by the embodiment only needs 2 jobs for processing the data tilt problem, reduces the jobs number in the processing process, and reduces IO operation, thereby improving the processing efficiency.

In order to further reduce IO operations, it can also be implemented by reducing the data records in the table to be associated that need to be copied. The specific method comprises the following steps:

after the left table and the table to be associated are associated by the second key value to obtain second data, removing a first random number suffix from the second key value, and removing the key value without the first random number suffix to obtain fourth data; referring to fig. 7, the sign file is the fourth data, and the key value included in the fourth data is 111111.

And before the step of copying each data record of the table to be associated into N pieces, removing the data records which are not matched in the table to be associated by using a key value contained in the fourth data. Referring to fig. 8, the data records corresponding to

key values

222222 and 333333 in table B and table C are not duplicated. Therefore, the data volume transmitted to the reduce node is reduced, and IO operation is further reduced. Sign files are put into distributed caches for implementation.

The values of the log threshold and N are explained below:

the constant k is larger than or equal to 1, the default is 2, the k is a reference value, and the actual value can be adjusted according to the actual running condition. Under the condition that the memory is still surplus when the reduce task runs, the constant k value in the threshold calculation formula is properly reduced, so that the threshold is improved, the number of second key values in the steps S11 and S12 and the number of key values in fourth data are reduced, the number of records for repeatedly copying the tables B and C is reduced, IO (input/output) operation is further reduced, and the processing efficiency is improved; under the condition of internal memory shortage during reduce task operation, the constant k value in the threshold calculation formula is properly increased, so that the threshold is reducedAnd reducing the number of the first data records reduced to each reduce node in the step S11, thereby reducing the memory consumption of the reduce nodes and ensuring the stable operation of the operation. The MSize is the memory size on one reduce node, NNum is the number of all map nodes in the Hadoop cluster, RSize is the size of one data record of the left table, and TNum is the number of all data records in the left table.

While, for purposes of simplicity of explanation, the foregoing method embodiments have been described as a series of acts or combination of acts, it will be appreciated by those skilled in the art that the present invention is not limited by the illustrated ordering of acts, as some steps may occur in other orders or concurrently with other steps in accordance with the invention.

The following are embodiments of the apparatus of the present invention that may be used to perform embodiments of the method of the present invention. For details which are not disclosed in the embodiments of the apparatus of the present invention, reference is made to the embodiments of the method of the present invention.

The present embodiment provides a data processing apparatus, as shown in fig. 9, including: a first processing unit 11, a second processing unit 12 and a merging unit 13. Wherein,

the first processing unit 11 is configured to, while counting each identical key value in an association field of a left table, perform association processing on the left table and a table to be associated by using a first key value to obtain first data, and perform association processing on the left table and the table to be associated by using a second key value to obtain second data; for each key value in the associated field of the left table, using the key value as a first key value when the count value does not exceed the threshold of the record number; when the count value of the key value exceeds the record number threshold, adding a first random number suffix on the basis of the key value to obtain a second key value in the part exceeding the record number threshold, wherein the value range of the first random number suffix is [1, N ], the first random number suffix is an integer, and N is an integer greater than 1;

a second processing unit 12, configured to add a second random number suffix to a key value in an associated field of the second data, copy each data record of the table to be associated into N pieces, add increment suffixes to the key values corresponding to the N copied data records, and perform association processing on the second data and the table to be associated after the copy processing by using the key values added with the second random number suffix to obtain third data, where a value interval of the second random number suffix is [1, N ], the second random number suffix is an integer, the N is an integer greater than 1, the increment values added to the key values corresponding to the N copied data records are different, and a value of the increment value suffix is an integer within the interval [1, N ];

a merging unit 13, configured to merge the first data and the third data to obtain a correlated data record.

The first processing unit 11 specifically includes: the device comprises a counting subunit, a data dividing subunit, a first association subunit and a second association subunit.

The counting subunit is used for counting each same key value in the associated field after the data record of the left table is input to the Map end;

the data dividing subunit is used for judging whether a count value corresponding to each key value exceeds a record number threshold value or not for each same key value, if not, the key value is not subjected to suffix value processing and is output by the Map terminal, and if yes, the key value is added with the first random number suffix and then is output by the Map terminal;

the first association subunit is configured to associate, by using the first key value, the corresponding first data record in the left table with the data record in the table to be associated, and use the associated first data record as first data;

and the second association subunit is used for taking the corresponding second data record in the left table as second data by using the second key value.

The second processing unit 12 specifically includes: the device comprises a first processing subunit, a second processing subunit and a third association subunit.

The first processing subunit is used for adding a second random number suffix to the key value of the associated field after the second data is input to the Map terminal and outputting the second random number suffix by the Map terminal;

the second processing subunit is configured to copy each data record of the table to be associated into N pieces after the table to be associated is input to the Map terminal, where key values corresponding to the N pieces of copied data records are respectively incremented by increment value suffixes and output by the Map terminal;

and the third association subunit is used for associating the corresponding data record in the second data with the data record in the to-be-associated table after the copying processing by using the key value added with the suffix of the second random number, and taking the associated data record as third data.

The first processing unit may further include: the third processing subunit is configured to, after the association processing is performed on the left table and the table to be associated by using the second key value to obtain second data, remove the first random number suffix from the second key value, and remove the key value from which the first random number suffix is removed to obtain fourth data;

the second processing unit may further include: and the fourth processing subunit is configured to, before the step of copying each data record of the table to be associated into N pieces, remove, by using a key value included in the fourth data, a data record that is not matched in the table to be associated.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A data processing method, comprising:

2. The method according to claim 1, wherein while counting each same key value in the associated field of the left table, the associating processing is performed on the left table and the table to be associated by using a first key value to obtain first data, and the associating processing is performed on the left table and the table to be associated by using a second key value to obtain second data, specifically comprising:

after the data record of the left table is input to the Map end, counting each same key value in the associated field;

for each same key value, judging whether a count value corresponding to the key value exceeds a record number threshold value, if not, performing no suffix value processing on the key value and outputting the key value by the Map end, and if so, adding the first random number suffix to the key value and outputting the key value by the Map end;

associating the corresponding first data record in the left table with the data record in the table to be associated by using the first key value, and taking the associated first data record as first data;

and recording corresponding second data in the left table as second data by using the second key value.

3. The method according to claim 1, wherein adding a second random number suffix to the key value in the association field of the second data, copying each data record of the table to be associated into N pieces, adding an increment value suffix to the key values corresponding to the N copied data records, and associating the second data with the copied table to be associated by using the key value added with the second random number suffix to obtain third data, includes:

after the second data is input to a Map terminal, adding a second random number suffix to the key value of the associated field and outputting the second random number suffix by the Map terminal;

after the table to be associated is input to the Map end, copying each data record of the table to be associated into N pieces, and respectively adding increment value suffixes to the key values corresponding to the N pieces of copied data records and outputting the increment value suffixes by the Map end;

and associating the corresponding data record in the second data with the data record in the copied table to be associated by using the key value added with the suffix of the second random number, and taking the associated data record as third data.

4. The method according to claim 3, wherein after the associating processing is performed on the left table and the table to be associated by using the second key value to obtain second data, the method further comprises: removing the first random number suffix from the second key value, and removing the key value without the first random number suffix to obtain fourth data;

before the step of copying each data record of the table to be associated into N, the method further includes: and removing the data records which are not matched in the table to be associated by using the key value contained in the fourth data.

5. The method according to any one of claims 1 to 4, wherein the calculation formula of the record number threshold is as follows:

wherein k is 2, MSize is the memory size on one reduce node, NNum is the number of all map nodes in the Hadoop cluster, and RSize is the size of one data record of the left table.

6. A data processing apparatus, comprising:

7. The apparatus according to claim 6, wherein the first processing unit specifically includes:

8. The apparatus according to claim 6, wherein the second processing unit specifically includes:

9. The apparatus of claim 8, wherein the first processing unit further comprises: the third processing subunit is configured to, after the association processing is performed on the left table and the table to be associated by using the second key value to obtain second data, remove the first random number suffix from the second key value, and remove the key value from which the first random number suffix is removed to obtain fourth data;

the second processing unit further includes: and the fourth processing subunit is configured to, before the step of copying each data record of the table to be associated into N pieces, remove, by using a key value included in the fourth data, a data record that is not matched in the table to be associated.

10. The apparatus according to any one of claims 6 to 9, wherein the calculation formula of the record number threshold is: