
CN115687286A - Incremental big data calculation method and system based on impala - Google Patents

Incremental big data calculation method and system based on impala Download PDF

Info

Publication number
CN115687286A
Authority
CN
China
Prior art keywords
resource table
file
sub-resource table
files
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211448709.9A
Other languages
Chinese (zh)
Inventor
邱锋兴
李汝山
康华林
宋琦
刘树锋
赵玉洁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Anscen Network Technology Co ltd
Original Assignee
Xiamen Anscen Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Anscen Network Technology Co ltd filed Critical Xiamen Anscen Network Technology Co ltd
Priority to CN202211448709.9A priority Critical patent/CN115687286A/en
Publication of CN115687286A publication Critical patent/CN115687286A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an impala-based incremental big data calculation method, which comprises the following steps: in response to scanning a resource table, reading the corresponding files in the HDFS and saving the information of the files and the resource table; judging whether newly added files exist, and if so, further determining the number and size of the newly added files, otherwise ending; splitting the newly added files into sub-resource tables according to the number and size of the files, with the newly added files distributed evenly across the sub-resource tables; copying the newly added files to each sub-resource table through HDFS commands; and, when the SQL is executed, replacing the resource table with the sub-resource tables and executing in a loop. The method solves the problems that computing resources are insufficient when the amount of data to calculate is large; that an interrupted computing task must be executed again from scratch; and that every calculation requires a full recomputation, with no incremental calculation possible for a specific table.

Description

Incremental big data calculation method and system based on impala
Technical Field
The invention belongs to the technical field of big data calculation, and in particular relates to an impala-based incremental big data calculation method and system.
Background
With the rapid development and popularization of computer and information technology, data generated by industrial applications has grown explosively, far exceeding the processing capacity of existing traditional computing technology and information systems, so finding effective big data processing technologies, methods and means is urgent.
Hadoop implements a distributed file system, one component of which is HDFS (Hadoop Distributed File System). HDFS provides storage for massive data, is highly fault tolerant, is designed to be deployed on low-cost hardware, provides high throughput for data access, and is suitable for applications with huge data volumes.
Impala is an MPP (massively parallel processing) SQL query engine for processing large amounts of data stored in Hadoop clusters; it provides a fast way to access data stored in the Hadoop distributed file system, with high performance and low latency.
Impala is a memory-based computing engine; computations over large data volumes need more memory, and running multiple computing tasks simultaneously easily exhausts it. With hand-written SQL, each execution must process the full data set; during the calculation, memory resources stay occupied and cannot be released, and once the computing task is interrupted it must be executed again from the beginning, which wastes time and hurts efficiency.
In view of this, it is highly worthwhile to provide an impala-based incremental big data calculation method and system.
Disclosure of Invention
To solve the problems of the existing methods, namely that computing resources are insufficient when the amount of data to calculate is large, and that an interrupted computing task must be re-executed in full, wasting time and hurting efficiency, the invention provides an impala-based incremental big data calculation method and system, aimed at the technical defect that incremental calculation cannot be performed on a specific SQL resource table with a large data volume.
In a first aspect, the invention provides an impala-based incremental big data calculation method, which includes:
s1, responding to a scanned resource table, reading a corresponding file in an HDFS, and storing information of the file and the resource table;
s2, judging whether a newly added file exists or not, if so, further judging the number of the newly added file and the size of the file, and if not, ending;
S3, further splitting the files into sub-resource tables according to the number and size of the files, distributing the newly added files evenly across the sub-resource tables;
s4, copying the newly added file to each sub-resource table through an HDFS command;
and S5, replacing the resource table with the sub-resource table when the SQL is executed, and executing circularly.
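The S1–S5 flow above can be sketched as follows. All helper and table names (the `events` table, the `_sub_` suffix) are illustrative assumptions; the patent specifies behaviour rather than an API, and the actual file copy (S4) would go through HDFS commands:

```python
# Minimal sketch of the S1-S5 incremental loop; names are hypothetical.

def split_into_sub_tables(new_files, num_sub_tables):
    """S3: distribute the newly added files evenly across sub-resource tables."""
    buckets = [[] for _ in range(num_sub_tables)]
    for i, f in enumerate(new_files):
        buckets[i % num_sub_tables].append(f)
    return buckets

def incremental_run(sql_template, resource_table, new_files, num_sub_tables):
    """S3-S5: split the new files, then build the SQL once per sub-resource
    table by substituting the sub-table name for the resource table name."""
    executed = []
    for idx, files in enumerate(split_into_sub_tables(new_files, num_sub_tables)):
        if not files:
            continue
        sub_table = f"{resource_table}_sub_{idx}"
        # S4 would copy `files` into sub_table's HDFS path here.
        executed.append(sql_template.replace(resource_table, sub_table))
    return executed

stmts = incremental_run("SELECT count(*) FROM events", "events",
                        ["f1.parquet", "f2.parquet", "f3.parquet"], 2)
```

A real implementation would also have to rewrite the table reference more carefully than a plain string replace (e.g. only in `FROM`/`JOIN` positions).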
Preferably, before S1, the method further comprises: responding to configuration of the SQL execution information, where the execution information includes a calculation SQL select statement, the resource table needing incremental calculation, the table name of the execution result, the number of calculation threads, and the creation time.
Preferably, S1 specifically includes:
s11, inquiring information of a resource table through an impala command to acquire an HDFS path of the resource table;
s12, scanning the HDFS path of the resource table, reading all HDFS files under the resource table, acquiring all file information, and putting the read HDFS file information in a set L1;
and S13, reading the configuration table of the resource table associated file information, and placing the configuration table in the set L2.
Further preferably, after S1 and before S2, the method further comprises: and comparing the file names of the set L1 and the set L2, putting the newly added files into the set L3, and further judging whether incremental data exist in the set L3.
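The comparison of sets L1 and L2 described above amounts to a set difference on file names; a minimal sketch, where the dictionary keys (`name`, `size`) are assumed for illustration:

```python
# Sketch of the new-file check: L1 holds the files currently under the
# table's HDFS path, L2 the files recorded from previous runs; the set
# L3 = L1 - L2 (by file name) is the incremental data.

def find_new_files(current_files, recorded_files):
    recorded_names = {f["name"] for f in recorded_files}
    return [f for f in current_files if f["name"] not in recorded_names]

l1 = [{"name": "a.parquet", "size": 10}, {"name": "b.parquet", "size": 20}]
l2 = [{"name": "a.parquet", "size": 10}]
l3 = find_new_files(l1, l2)  # only b.parquet is new
```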
Preferably, S3 specifically further includes:
S31, calculating the number of sub-resource tables to be created and the number of files of each sub-resource table according to the total number of newly added files, the size of each file and the number of calculation threads;
s32, according to the table structure of the resource table, creating each sub-resource table through impala, and acquiring an HDFS path of the sub-resource table;
and S33, looping over the set L3 according to the number of sub-resource-table files calculated in S31, copying the files to the HDFS paths of the sub-resource tables through HDFS commands, and saving the association relation table between the sub-resource tables and the files.
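S31 does not state an exact formula for the split; the sketch below is one plausible reading under stated assumptions: the sub-resource table count is bounded by the configured thread count and by an assumed per-table data budget, then files are spread evenly:

```python
import math

def plan_sub_tables(file_sizes, thread_count, max_bytes_per_table=1 << 30):
    """One plausible reading of S31 (not the patent's exact formula):
    use up to `thread_count` sub-resource tables, never more than there
    are files, and add tables if one would exceed the data budget."""
    by_size = max(1, math.ceil(sum(file_sizes) / max_bytes_per_table))
    n = min(thread_count, len(file_sizes))       # one table per thread at most
    n = min(max(n, by_size), len(file_sizes))    # respect the size budget
    files_per_table = math.ceil(len(file_sizes) / n)
    return n, files_per_table
```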
Further preferably, S5 further includes:
s51, refreshing the sub-resource table through an impala command to enable data to take effect;
s52, circulating each sub-resource table, replacing the resource table in the SQL with the sub-resource table, executing the SQL by adopting a thread pool according to the configured number of the calculation threads to obtain a sub-result table, and recording the execution condition: resource table, sub-resource table, execution state, start time, end time, message.
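S51–S52 can be sketched with a Python thread pool. Here `run_sql` is a hypothetical stand-in for a real Impala client call (and the `REFRESH` of S51 would also go through such a client); the record fields follow the execution-condition list above:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_sql(stmt):
    """Hypothetical stand-in; a real client would submit `stmt` to impalad."""
    return f"ok: {stmt}"

def execute_per_sub_table(sql, resource_table, sub_tables, thread_count):
    """S52: run the rewritten SQL once per sub-resource table using a
    thread pool sized by the configured calculation-thread count, and
    record one execution row per sub-table."""
    def task(sub):
        start = time.time()
        msg = run_sql(sql.replace(resource_table, sub))
        return {"resource_table": resource_table, "sub_table": sub,
                "state": "SUCCESS", "start": start, "end": time.time(),
                "message": msg}
    with ThreadPoolExecutor(max_workers=thread_count) as pool:
        return list(pool.map(task, sub_tables))
```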
Further preferably, S5 further includes:
s53, judging whether an execution result table in the execution information already exists;
s54, if the result table does not exist, creating a result table through the impala according to the table structure of the sub-result table, and acquiring the HDFS path of the result table;
s55, acquiring the HDFS path of the sub-result table through impala, migrating the HDFS file of the sub-result table to the HDFS path of the execution result table, and refreshing the execution result table through the impala to enable data to take effect;
and S56, after the execution is finished, storing the association relationship between the resource table and the newly-added file of the set L3 into a configuration table of the resource table association file information.
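S54–S56 can be sketched as follows. The file migration and the configuration table are simulated in memory; a real system would move the files with HDFS commands and make the result visible with an Impala refresh, as the steps above describe:

```python
# Sketch of S55-S56: merge the sub-result files into the final result
# table's path, then persist the association between the resource table
# and the newly processed files (set L3) in the configuration table.

def finalize(sub_result_files, result_dir, config_table, resource_table, new_files):
    # S55: "migrate" each sub-result file into the result table's directory.
    moved = [result_dir.rstrip("/") + "/" + f.rsplit("/", 1)[-1]
             for f in sub_result_files]
    # S56: record each newly added file against the resource table.
    for f in new_files:
        config_table.append({"resource_table": resource_table, **f})
    return moved
```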
In a second aspect, the present invention further provides an impala-based incremental big data computing system, which is characterized by including:
a scanning and reading module: used for responding to a scanned resource table, reading the corresponding file in the HDFS, and saving the information of the file and the resource table;
a judging module: the method is used for judging whether a newly added file exists, if so, further judging the number of the newly added file and the size of the file, and if not, ending;
a splitting and distributing module: used for splitting the files into sub-resource tables according to the number and size of the files, distributing the newly added files evenly across the sub-resource tables;
a copying module: used for copying the newly added files to each sub-resource table through an HDFS command;
and a replacement module: used for replacing the resource table with the sub-resource tables when SQL is executed, and executing in a loop.
In a third aspect, an embodiment of the present invention provides an electronic device, including: one or more processors; storage means for storing one or more programs which, when executed by one or more processors, cause the one or more processors to carry out a method as described in any one of the implementations of the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method as described in any implementation manner of the first aspect.
Compared with the prior art, the beneficial effects of the invention are as follows:
the method comprises the steps of reading corresponding files in the HDFS by scanning a resource table, storing information of the files and the resource table, judging whether newly added files exist or not, judging the number and the size of the newly added files if the newly added files exist, splitting the newly added files into sub-resource tables according to the number and the size of the files, averagely distributing the newly added files to each resource table, copying the newly added files to each sub-resource table through an HDFS command, replacing the resource tables with the sub-resource tables when SQL is executed, and performing in a circulating mode, so that the problem that when the calculated data amount is large, the calculated resources are insufficient is solved; once the computing task is interrupted, the computing task needs to be executed again; each calculation needs full calculation, and increment calculation cannot be carried out aiming at a specific table.
Drawings
The accompanying drawings are included to provide a further understanding of the embodiments and are incorporated in and constitute a part of this specification. The drawings illustrate embodiments and together with the description serve to explain the principles of the invention. Other embodiments and many of the intended advantages of embodiments will be readily appreciated as they become better understood by reference to the following detailed description. The elements of the drawings are not necessarily to scale relative to each other. Like reference numerals designate corresponding similar parts.
FIG. 1 is an exemplary device architecture diagram in which an embodiment of the present invention may be employed;
FIG. 2 is a flowchart illustrating an incremental big data calculation method based on impala according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating a method for calculating incremental big data based on impala according to an embodiment of the present invention;
FIG. 4 is a flow diagram of an impala-based incremental big data computing system according to an embodiment of the invention;
FIG. 5 is a schematic block diagram of a computer apparatus suitable for use in implementing an electronic device of an embodiment of the invention.
Detailed Description
In the following detailed description, reference is made to the accompanying drawings, which form a part hereof, and in which is shown by way of illustration specific embodiments in which the invention may be practiced. In this regard, directional terminology, such as "top," "bottom," "left," "right," "up," "down," etc., is used with reference to the orientation of the figures being described. Because components of embodiments can be positioned in a number of different orientations, the directional terminology is used for purposes of illustration and is in no way limiting. It is to be understood that other embodiments may be utilized and logical changes may be made without departing from the scope of the present invention. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the present invention is defined by the appended claims.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for an implementation.
Fig. 1 illustrates an exemplary system architecture 100 of a method for processing information or an apparatus for processing information to which embodiments of the present invention may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
A user may use terminal devices 101, 102, 103 to interact with a server 105 over a network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as a web browser application, a shopping application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like.
The terminal devices 101, 102, 103 may be various electronic devices having communication functions, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 105 may be a server that provides various services, such as a background information processing server that processes verification request information transmitted by the terminal devices 101, 102, 103. The background information processing server may analyze and otherwise process the received verification request information and obtain a processing result (e.g., verification success information indicating that the verification request is a legitimate request).
It should be noted that the method for processing information provided by the embodiment of the present invention is generally executed by the server 105, and accordingly, the apparatus for processing information is generally disposed in the server 105. In addition, the method for sending information provided by the embodiment of the present invention is generally executed by the terminal equipment 101, 102, 103, and accordingly, the apparatus for sending information is generally disposed in the terminal equipment 101, 102, 103.
The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster composed of multiple servers, or may be implemented as a single server. When the server is software, it may be implemented as a plurality of software or software modules (for example, to provide distributed services), or may be implemented as a single software or a plurality of software modules, and is not limited in particular herein.
Impala is a memory-based computing engine; computations over large data volumes need more memory, and running multiple computing tasks simultaneously easily exhausts it. With hand-written SQL, each execution must process the full data set; during the calculation, memory resources stay occupied and cannot be released, and once the computing task is interrupted it must be executed again from the beginning, which wastes time and hurts efficiency.
The aim is to solve the problems of the existing methods: computing resources are insufficient when the amount of data to calculate is large; an interrupted computing task must be re-executed in full, wasting time and hurting efficiency; and incremental calculation cannot be performed for a specific SQL resource table with a large data volume.
In a first aspect, an embodiment of the present invention discloses an incremental big data calculation method based on impala, and as shown in fig. 2, the method includes:
s101, responding to a scanned resource table, reading a corresponding file in the HDFS, and saving information of the file and the resource table;
s102, judging whether a newly added file exists or not, if so, further judging the number of the newly added file and the size of the file, and if not, ending;
S103, further splitting the files into sub-resource tables according to the number and size of the files, distributing the newly added files evenly across the sub-resource tables;
s104, copying the newly added file to each sub-resource table through an HDFS command;
and S105, replacing the resource table with the sub-resource table when the SQL is executed, and executing circularly.
The method reads the corresponding files in the HDFS by scanning the resource table and saves the information of the files and the resource table; judges whether newly added files exist and, if so, determines their number and size; splits the newly added files into sub-resource tables according to the number and size of the files, distributing them evenly across the sub-resource tables; copies the newly added files to each sub-resource table through HDFS commands; and replaces the resource table with the sub-resource tables when SQL is executed, executing in a loop.
Further, in this embodiment, as shown in fig. 3, the specific steps are as follows:
s1, configuring SQL execution information: calculating SQL sentences (select sentences), resource tables needing incremental calculation, table names of execution results, the number of calculation threads and the creation time.
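The execution information configured in S1 might be represented as below. The field names and sample values are illustrative assumptions; the patent only lists which items are stored:

```python
import time

# Hypothetical shape of the S1 execution configuration.
execution_config = {
    "sql": "SELECT uid, count(*) AS cnt FROM events GROUP BY uid",  # select statement
    "resource_table": "events",      # table to be computed incrementally
    "result_table": "events_cnt",    # table name of the execution result
    "thread_count": 4,               # number of calculation threads
    "created_at": time.strftime("%Y-%m-%d %H:%M:%S"),
}
```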
And S2, starting to execute the task.
And S3, inquiring the information of the resource table through the impala command, and acquiring the HDFS path of the resource table.
S4, scanning the HDFS path of the resource table, reading all HDFS files under the resource table, and acquiring all file information, such as: file name, file size. The read HDFS file information is placed in the set L1.
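One way to obtain the file name and size for set L1 is to parse the output of the HDFS `ls` shell command; the sample below assumes the usual `hdfs dfs -ls` column layout (permissions, replicas, owner, group, size, date, time, path):

```python
# Sketch of S4: build set L1 (file name + size) from `hdfs dfs -ls` output.

def parse_hdfs_ls(output):
    files = []
    for line in output.splitlines():
        parts = line.split()
        # Skip the "Found N items" header and anything too short to be a file row.
        if len(parts) >= 8 and not line.startswith("Found"):
            files.append({"name": parts[-1].rsplit("/", 1)[-1],
                          "size": int(parts[4])})
    return files

sample = ("Found 1 items\n"
          "-rw-r--r--   3 hive hive   1048576 2022-11-18 10:00 /warehouse/events/f1.parquet")
l1 = parse_hdfs_ls(sample)
```

In practice one would also record the full path, since different partitions can contain identically named files.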
S5, reading a configuration table of resource table associated file information (the table is a resource table stored in the task history execution and associated information of the HDFS file), and placing the configuration table in the set L2. The configuration table information includes: resource table, file name, file size, retention time, etc.
And S6, comparing file names of the sets L1 and L2, adding a new file (namely incremental data) into the set L3.
And S7, judging whether a newly added file exists or not, namely whether the set L3 has data or not.
And S8, if yes, calculating the number of sub-resource tables to be created and the number of files of each sub-resource table according to the total number of newly added files, the size of each file and the number of calculation threads.
S9, according to the table structure of the resource table, creating each sub-resource table through impala, and acquiring the HDFS path of the sub-resource table.
And S10, looping over the set L3 (namely the incremental data) according to the number of sub-resource-table files calculated in S8, copying the files to the HDFS paths of the sub-resource tables through HDFS commands, and saving the association relation table between the sub-resource tables and the files. The relation table information includes: resource table, sub-resource table, file name, file size, creation time, etc.
S11, refreshing the sub-resource table through an impala command to enable the data to take effect.
S12, circulating each sub-resource table, replacing the resource table in the SQL with the sub-resource table, executing the SQL by adopting a thread pool according to the configured number of the calculation threads to obtain a sub-result table, and recording the execution condition: resource table, sub-resource table, execution state, start time, end time, message.
And S13, judging whether the execution result table of the task information in the S1 already exists.
And S14, if the result table does not exist, creating the result table through impala according to the table structure of the sub-result table, and acquiring the HDFS path of the result table.
S15, acquiring the HDFS path of the sub-result table through the impala, migrating the HDFS file of the sub-result table to the HDFS path of the execution result table, and refreshing the execution result table through the impala to enable the data to take effect.
And S16, after the execution is finished, storing the association relation of the resource table and the newly added file (the set L3) into a configuration table of the resource table association file information.
The invention solves the problems that computing resources are insufficient when the amount of data to calculate is large; that an interrupted computing task must be executed again from scratch; and that every calculation requires a full recomputation, with no incremental calculation possible for a specific table.
In a second aspect, the present invention further provides an impala-based incremental big data computing system, referring to fig. 4, including: a scanning and reading module 41, a judging module 42, a splitting and distributing module 43, a copying module 44 and a replacing module 45. The scanning and reading module 41 is used for responding to a scanned resource table, reading the corresponding file in the HDFS, and saving the information of the file and the resource table. The judging module 42 is used for judging whether newly added files exist and, if so, further judging their number and size, otherwise ending. The splitting and distributing module 43 is used for splitting the files into sub-resource tables according to the number and size of the files, distributing the newly added files evenly across the sub-resource tables. The copying module 44 is used for copying the newly added files to each sub-resource table through an HDFS command. The replacing module 45 is used for replacing the resource table with the sub-resource tables when SQL is executed, and executing in a loop.
Referring now to FIG. 5, a block diagram of a computer apparatus 600 suitable for use with an electronic device (e.g., the server or terminal device shown in FIG. 1) to implement an embodiment of the invention is shown. The electronic device shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 5, the computer apparatus 600 includes a Central Processing Unit (CPU) 601 and a Graphics Processing Unit (GPU) 602, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 603 or a program loaded from a storage section 609 into a Random Access Memory (RAM) 604. In the RAM 604, various programs and data necessary for the operation of the apparatus 600 are also stored. The CPU 601, GPU 602, ROM 603, and RAM 604 are connected to each other via a bus 605. An input/output (I/O) interface 606 is also connected to bus 605.
The following components are connected to the I/O interface 606: an input portion 607 including a keyboard, a mouse, and the like; an output section 608 including a display such as a Liquid Crystal Display (LCD) and a speaker; a storage section 609 including a hard disk and the like; and a communication section 610 including a network interface card such as a LAN card, a modem, or the like. The communication section 610 performs communication processing via a network such as the internet. The drive 611 may also be connected to the I/O interface 606 as needed. A removable medium 612 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 611 as necessary, so that the computer program read out therefrom is mounted into the storage section 609 as necessary.
In particular, the processes described above with reference to the flow diagrams may be implemented as computer software programs, according to embodiments of the present disclosure. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer-readable medium, the computer program comprising program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 610 and/or installed from the removable media 612. The computer programs, when executed by a Central Processing Unit (CPU) 601 and a Graphics Processor (GPU) 602, perform the above-described functions defined in the method of the present invention.
It should be noted that the computer readable medium of the present invention can be a computer readable signal medium or a computer readable medium or any combination of the two. The computer readable medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor device, apparatus, or any combination of the foregoing. More specific examples of the computer readable medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution apparatus, device, or apparatus. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution apparatus, device, or apparatus. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based devices that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or hardware. The modules described may also be provided in a processor.
As another aspect, the present invention also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiments, or may be separate and not incorporated into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: respond to a scanned resource table by reading the corresponding file in the HDFS and saving the information of the file and the resource table; judge whether newly added files exist and, if so, further judge their number and size, otherwise end; split the files into sub-resource tables according to the number and size of the files, distributing the newly added files evenly across the sub-resource tables; copy the newly added files to each sub-resource table through an HDFS command; and, when SQL is executed, replace the resource table with the sub-resource tables and execute in a loop.
The foregoing description covers only preferred embodiments of the invention and illustrates the principles of the technology employed. Those skilled in the art will appreciate that the scope of the invention is not limited to the specific combinations of features described above, but also encompasses other embodiments formed by any combination of those features or their equivalents without departing from the scope defined by the appended claims, for example, technical solutions formed by replacing the above features with features of similar function disclosed in the present invention.

Claims (10)

1. An impala-based incremental big data calculation method, characterized by comprising the following steps:
S1, in response to scanning a resource table, reading the corresponding files in HDFS, and storing information of the files and the resource table;
S2, determining whether newly added files exist; if so, further determining the number and size of the newly added files, otherwise ending;
S3, splitting the resource table into sub-resource tables according to the number and size of the files, and distributing the newly added files evenly across the sub-resource tables;
S4, copying the newly added files to each sub-resource table through an HDFS command;
S5, in response to SQL execution, replacing the resource table with the sub-resource tables and executing in a loop.
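The five steps of claim 1 can be sketched as a single orchestration loop. This is a minimal illustration, not the claimed implementation: the `scan_table`, `copy_files`, and `execute_sql` callables are hypothetical stand-ins for the HDFS and impala commands, injected so the control flow of S1–S5 is visible on its own.

```python
def incremental_compute(resource_table, scan_table, copy_files, execute_sql,
                        known_files, num_sub_tables):
    """Sketch of steps S1-S5; all external interactions are injected stand-ins."""
    current = scan_table(resource_table)                 # S1: read files under the table in HDFS
    new_files = sorted(set(current) - set(known_files))  # S2: detect newly added files
    if not new_files:
        return []                                        # S2: no increment, end
    # S3: distribute the new files evenly across sub-resource tables (round-robin)
    buckets = [new_files[i::num_sub_tables] for i in range(num_sub_tables)]
    results = []
    for idx, bucket in enumerate(b for b in buckets if b):
        sub_table = f"{resource_table}_sub{idx}"         # hypothetical naming scheme
        copy_files(bucket, sub_table)                    # S4: copy via HDFS command (stand-in)
        results.append(execute_sql(sub_table))           # S5: run the SQL per sub-table
    return results
```

Round-robin distribution is one way to read "distributed evenly"; the patent does not fix a concrete assignment strategy.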
2. The impala-based incremental big data calculation method according to claim 1, characterized by further comprising, before S1:
in response to configuration of SQL execution information, the execution information comprising a computation SQL select statement, the resource tables requiring incremental computation, the table name of the execution result, the number of computing threads, and the creation time.
3. The impala-based incremental big data calculation method according to claim 1, wherein S1 specifically comprises:
S11, querying the information of the resource table through an impala command to obtain the HDFS path of the resource table;
S12, scanning the HDFS path of the resource table, reading all HDFS files under the resource table, obtaining all file information, and placing the read HDFS file information in a set L1;
S13, reading the configuration table of file information associated with the resource table, and placing that information in a set L2.
4. The impala-based incremental big data calculation method according to claim 3, further comprising, after S1 and before S2:
comparing the file names of set L1 and set L2, placing the newly added files in a set L3, and further determining from set L3 whether incremental data exist.
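The L1/L2 comparison of claim 4 is a set difference on file names. A minimal sketch, assuming each entry is a dict with a `name` key (the patent does not specify the record layout, so this shape is illustrative):

```python
def find_new_files(l1, l2):
    """Set L3: entries present in the current HDFS listing (L1) but absent
    from the recorded configuration table (L2), compared by file name."""
    recorded = {entry["name"] for entry in l2}
    return [entry for entry in l1 if entry["name"] not in recorded]
```

An empty L3 corresponds to the "otherwise ending" branch of S2: there is no incremental data to process.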
5. The impala-based incremental big data calculation method according to claim 1, wherein S3 specifically comprises:
S31, calculating the number of sub-resource tables to be created and the number of files for each sub-resource table according to the total number and size of the newly added files and the number of computing threads;
S32, creating each sub-resource table through impala according to the table structure of the resource table, and obtaining the HDFS path of each sub-resource table;
S33, looping over set L3 according to the per-sub-resource-table file count calculated in S31, copying the files to the HDFS paths of the sub-resource tables through HDFS commands, and storing the association table between the sub-resource tables and the files.
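Claim 5 leaves the S31 formula open, so the sketch below is one plausible reading, not the claimed computation: cap the number of sub-resource tables at the configured thread count (so each thread gets at most one table) and never create more tables than there are files, then derive the per-table file count by ceiling division.

```python
def plan_sub_tables(total_files, num_threads):
    """Illustrative S31: how many sub-resource tables to create, and how many
    of the newly added files each one receives."""
    num_tables = max(1, min(num_threads, total_files))
    files_per_table = -(-total_files // num_tables)  # ceiling division
    return num_tables, files_per_table
```

A production version would presumably also weigh total file size, as S31 requires; that factor is omitted here because the patent gives no size threshold.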
6. The impala-based incremental big data calculation method according to claim 2, wherein S5 further comprises:
S51, refreshing the sub-resource tables through an impala command so that the data take effect;
S52, looping over each sub-resource table, replacing the resource table in the SQL with the sub-resource table, executing the SQL with a thread pool according to the configured number of computing threads to obtain a sub-result table, and recording the execution details: resource table, sub-resource table, execution state, start time, end time, and message.
7. The impala-based incremental big data calculation method according to claim 6, wherein S5 further comprises:
S53, determining whether the execution result table in the execution information already exists;
S54, if the result table does not exist, creating the result table through impala according to the table structure of the sub-result table, and obtaining the HDFS path of the result table;
S55, obtaining the HDFS path of the sub-result table through impala, migrating the HDFS files of the sub-result table to the HDFS path of the execution result table, and refreshing the execution result table through impala so that the data take effect;
S56, after execution is finished, storing the association between the resource table and the newly added files of set L3 into the configuration table of file information associated with the resource table.
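The per-sub-table execution of S52 can be sketched with Python's standard thread pool; `execute` is a hypothetical stand-in for submitting a query through impala, and the plain string replacement of the table name is a deliberate simplification (a real implementation would need token-aware substitution so that substring matches in other identifiers are not rewritten).

```python
from concurrent.futures import ThreadPoolExecutor

def run_on_sub_tables(sql, resource_table, sub_tables, execute, num_threads):
    """Illustrative S52: substitute each sub-resource table name into the SQL
    and execute the statements concurrently with the configured thread count."""
    statements = [sql.replace(resource_table, sub) for sub in sub_tables]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # pool.map preserves input order, so results line up with sub_tables
        return list(pool.map(execute, statements))
```

The ordered results make it straightforward to record the per-sub-table execution state required by S52 before the S53–S55 merge into the final result table.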
8. An impala-based incremental big data computing system, characterized by comprising:
a scanning and reading module: used for, in response to scanning a resource table, reading the corresponding files in HDFS and storing information of the files and the resource table;
a judgment module: used for determining whether newly added files exist; if so, further determining the number and size of the newly added files, otherwise ending;
a splitting and distribution module: used for splitting the resource table into sub-resource tables according to the number and size of the files, and distributing the newly added files evenly across the sub-resource tables;
a copying module: used for copying the newly added files to each sub-resource table through an HDFS command;
a replacement module: used for, in response to SQL execution, replacing the resource table with the sub-resource tables and executing in a loop.
9. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method recited in any of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.
CN202211448709.9A 2022-11-18 2022-11-18 Incremental big data calculation method and system based on impala Pending CN115687286A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211448709.9A CN115687286A (en) 2022-11-18 2022-11-18 Incremental big data calculation method and system based on impala


Publications (1)

Publication Number Publication Date
CN115687286A (en) 2023-02-03

Family

ID=85053354

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211448709.9A Pending CN115687286A (en) 2022-11-18 2022-11-18 Incremental big data calculation method and system based on impala

Country Status (1)

Country Link
CN (1) CN115687286A (en)

Similar Documents

Publication Publication Date Title
US9323580B2 (en) Optimized resource management for map/reduce computing
CN110019350A (en) Data query method and apparatus based on configuration information
CN112597126B (en) Data migration method and device
CN111125107A (en) Data processing method, device, electronic equipment and medium
CN116303608A (en) Data processing method and device for application service
CN110888972A (en) Sensitive content identification method and device based on Spark Streaming
US20210149746A1 (en) Method, System, Computer Readable Medium, and Device for Scheduling Computational Operation Based on Graph Data
CN111444148B (en) Data transmission method and device based on MapReduce
CN113326305A (en) Method and device for processing data
CN117762898A (en) Data migration method, device, equipment and storage medium
CN112131257B (en) Data query method and device
CN115687286A (en) Incremental big data calculation method and system based on impala
CN111796878B (en) Resource splitting and loading method and device applied to single-page application
CN116204428A (en) Test case generation method and device
CN110825920B (en) Data processing method and device
CN108984426B (en) Method and apparatus for processing data
CN113360494B (en) Wide-table data generation method, updating method and related device
CN114116176B (en) Configuration method and device for fund pool collection down-dialing task, electronic equipment and medium
CN110908993A (en) Method and device for analyzing reasonability of database index
CN113778657B (en) Data processing method and device
CN116339980A (en) Service processing method, device, electronic equipment and computer readable medium
CN113934761A (en) Data processing method and device
CN118796922A (en) Data importing method, device and equipment
CN116842002A (en) Method and device for managing service data
CN117472871A (en) Data analysis method, device, electronic equipment and computer readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination