CN112541816A

CN112541816A - Distributed stream computing processing engine for internet financial consumption credit batch business

Info

Publication number: CN112541816A
Application number: CN202011517524.XA
Authority: CN
Inventors: 冯宇; 罗喜川
Original assignee: Sichuan XW Bank Co Ltd
Current assignee: Sichuan XW Bank Co Ltd
Priority date: 2020-12-21
Filing date: 2020-12-21
Publication date: 2021-03-23

Abstract

The invention relates to the field of stream computing and discloses a distributed stream computing processing engine for internet financial consumption credit batch business, comprising the following steps. Step S1001: reading data from a source database, and sending the source data stream to a Kafka topic partition through a Kafka producer; step S1002: reading the source data messages in Kafka, and creating a Reader as a consumer for each partition in the Kafka topic; step S1003: the Readers collect the messages from the same topic through stream operations and forward the messages to each Segment of GP through the HTTP Connector; step S1004: scheduling, loading and starting, through the scheduler Dispatcher, the JOBs that implement stream data processing in GP-SQL for Java using the stream data processing mode and stream operations, so as to complete specific business functions; step S1005: writing the processed result data into a result database. By means of distributed parallel technology, the invention improves batch processing efficiency and shortens processing time, combining the high-speed stream transmission capability of Kafka with the powerful stream operation execution capability of Greenplum.

Description

Distributed stream computing processing engine for internet financial consumption credit batch business

Technical Field

The invention relates to the field of stream computing processing, in particular to an internet financial consumption credit batch service distributed stream computing processing engine.

Background

Internet consumer finance is in essence online lending: a financial operating model in which a suitably qualified internet finance enterprise, on the basis of big-data credit investigation, provides consumers over the internet with loans for specific consumer products (other than real estate or automobiles) or services. Users can enjoy the convenience of internet consumer finance by logging in to the relevant website and applying there.

Current internet financial consumption credit batch processing runs in single-machine, single-process mode. As the data volume keeps growing, the load on the single-machine batch system becomes very heavy and batch processing takes longer and longer, which affects normal transactions and the data processing of downstream systems.

Therefore, an internet financial consumption credit batch business distributed stream computing processing engine capable of improving batch processing efficiency and shortening processing time is urgently needed.

Disclosure of Invention

Based on the above problems, the invention provides a distributed stream computing processing engine for internet financial consumption credit batch business. By means of distributed parallel technology, the invention improves batch processing efficiency and shortens processing time. It combines the high-speed stream transmission capability of Kafka with the powerful stream operation execution capability of Greenplum to form a distributed stream computing processing engine, and greatly reduces the delay of data processing.

In order to solve the technical problems, the technical scheme adopted by the invention is as follows:

A distributed stream computing processing engine for internet financial consumption credit batch business, comprising the following steps:

step S1001: reading data from a source database, and sending the source data stream to a Kafka topic partition through a Kafka producer;

step S1002: reading the source data messages in Kafka, and creating a Reader as a consumer for each partition in the Kafka topic;

step S1003: the Readers collect the messages from the same topic through stream operations and forward the messages to each Segment of GP through the HTTP Connector;

step S1004: scheduling, loading and starting, through the scheduler Dispatcher, the JOBs that implement stream data processing in GP-SQL for Java using the stream data processing mode and stream operations, so as to complete specific business functions;

step S1005: writing the processed result data into a result database.
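As an illustration only, steps S1001 to S1005 can be sketched end to end in plain Python. This is a minimal in-memory sketch, not the implementation of the invention: Kafka, the HTTP Connector, GP and the result database are replaced by dictionaries and lists, and the field names (`loan_id`, `principal`, `daily_rate`), the partition and segment counts, the CRC32 routing and the `accrue_interest_job` function are all assumptions made for the example.

```python
from collections import defaultdict
from zlib import crc32

NUM_PARTITIONS = 3  # assumed partition count of the topic
NUM_SEGMENTS = 2    # assumed number of GP segments


def produce(source_rows, num_partitions=NUM_PARTITIONS):
    """S1001: route each source row to a topic partition by a hash of its key."""
    partitions = defaultdict(list)
    for row in source_rows:
        partition = crc32(row["loan_id"].encode("utf-8")) % num_partitions
        partitions[partition].append(row)
    return partitions


def read_partition(messages):
    """S1002: one reader per partition consumes that partition's messages in order."""
    for message in messages:
        yield message


def forward_to_segments(partitions, num_segments=NUM_SEGMENTS):
    """S1003: readers collect messages and forward them to the segments.

    The HTTP Connector is replaced here by an in-memory append.
    """
    segments = defaultdict(list)
    for partition in sorted(partitions):
        for message in read_partition(partitions[partition]):
            segment = crc32(message["loan_id"].encode("utf-8")) % num_segments
            segments[segment].append(message)
    return segments


def accrue_interest_job(rows):
    """S1004: hypothetical JOB standing in for a GP-SQL for Java job."""
    return [
        {"loan_id": r["loan_id"],
         "interest": round(r["principal"] * r["daily_rate"], 2)}
        for r in rows
    ]


def dispatch(segments, job):
    """S1004: the Dispatcher starts the job on every segment."""
    results = []
    for segment in sorted(segments):
        results.extend(job(segments[segment]))
    return results


def run_batch(source_rows):
    """Run S1001 to S1005 and return the stand-in result database."""
    result_db = {}  # S1005: stand-in for the result database
    segments = forward_to_segments(produce(source_rows))
    for row in dispatch(segments, accrue_interest_job):
        result_db[row["loan_id"]] = row["interest"]
    return result_db
```

Calling `run_batch` with a list of source rows returns a dictionary keyed by `loan_id`; in a real deployment each stage would run on separate processes or hosts rather than in one function call chain.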

Preferably, in step S1002, the Reader is a Kafka consumer of the stream calculation processing engine, and is responsible for reading the source data stream in Kafka, performing pre-transformation and processing on the data stream, and then sending the data to the segment of the GP through the Connector.

Preferably, the Connector is based on the HTTP protocol and is responsible for connecting Kafka and GP and for sending and receiving data.

Preferably, the Dispatcher is a scheduling component responsible for scheduling, loading, and starting the JOBs, implemented in GP-SQL for Java, that are associated with the GP Segments.

Preferably, a PreHandler operation is performed on the data stream before JOB processing, so as to obtain the data format required by JOB processing.

Preferably, a PostHandler operation is performed on the data stream after JOB processing, so that the data stream meets the format required for saving to the MySQL database, and the processed result data is written into the result database.

The invention has the beneficial effects that:

(1) The invention uses the Kafka stream data platform, which has high-speed data stream transmission capability.

(2) The invention uses Greenplum, which has strong stream operation execution capability.

(3) The invention supports complete stream computing processing modes: event time and processing time, fixed windows and sliding windows. Session windows can be simulated through time windows, and the various powerful data processing functions of Greenplum can be executed on these windows.

(4) The invention makes full use of the distributed parallel technology of Kafka and Greenplum, greatly improving batch processing efficiency and greatly shortening processing time.

(5) The invention also supports horizontal scaling of the processing nodes, greatly improving the scalability, flexibility and fault tolerance of the system.

Drawings

A better understanding of the features and advantages of the present invention will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the invention are utilized, and the accompanying drawings of which:

FIG. 1 is a schematic diagram of a framework shown in accordance with some embodiments of the present description.

Detailed Description

In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings used in the description of the embodiments will be briefly described below. It is obvious that the drawings in the following description are only examples or embodiments of the present description, and that for a person skilled in the art, the present description can also be applied to other similar scenarios on the basis of these drawings without inventive effort.

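Referring to FIG. 1, the distributed stream computing processing engine for internet financial consumption credit batch business includes the following steps:

step S1001: reading data from a source database, and sending the source data stream to a Kafka topic partition through a Kafka producer;

step S1002: reading the source data messages in Kafka, and creating a Reader as a consumer for each partition in the Kafka topic;

step S1003: the Readers collect the messages from the same topic through stream operations and forward the messages to each Segment of GP through the HTTP Connector;

step S1004: scheduling, loading and starting, through the scheduler Dispatcher, the JOBs that implement stream data processing in GP-SQL for Java using the stream data processing mode and stream operations, so as to complete specific business functions;

step S1005: writing the processed result data into a result database.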

To facilitate understanding of the embodiments, we make the following explanations.

1. Greenplum:

Greenplum is an open-source data processing platform built on an MPP version of Postgres. It is feature-rich and performs well, and it can distribute data across different nodes for high-speed parallel processing. It is also referred to as GP in this application.

Greenplum consists mainly of the Master node, the Segment nodes, and the interconnect. The Master is the entry point of the Greenplum database system: it accepts client connections and the SQL statements they submit, and distributes the workload to the other database instances (segment instances), which store and process the data. The interconnect is responsible for communication between the different PostgreSQL instances. Segments are independent PostgreSQL databases, each storing a portion of the data, and most query processing is done by the segments. The Master node does not store any user data; it only performs access control for clients and stores the metadata of the table distribution logic. The Segment nodes are responsible for data storage, and the distribution key can be tuned to make full use of each Segment node's I/O performance and thereby scale the I/O performance of the whole cluster. Different storage modes can be used according to how hot the data is or how it is accessed, and different data in one table can use different physical storage modes: row store, column store, or external table.

The Greenplum architecture has roughly the following characteristics:

1.1 Large-scale data storage

(1) The Greenplum database stores large-scale data by distributing the data over multiple nodes.

(2) Greenplum adopts a divide-and-conquer approach, distributing the data to the nodes according to rules and making full use of the I/O capacity of the Segment hosts, so that the system can reach its maximum I/O capacity.

(3) In Greenplum, each table is distributed across all nodes. The Master node first performs a hash operation on one or more columns of the table, and then distributes the table's data to the Segment nodes according to the hash result. Throughout this process, the Master node does not store any user data; it only performs access control for clients and stores the metadata of the table distribution logic.

(4) Greenplum provides a flexible storage approach called "polymorphic storage". Polymorphic storage can use different storage modes depending on how hot the data is or how it is accessed, and different data in one table may use different physical storage.

The supported storage modes comprise:

Row store: row storage is the common storage mode of traditional databases, characterized by fast access and easy multi-column updates.

Column store: data is stored by column, with the data of different columns kept in different places (usually in different files). This suits cases where only a few fields of a wide table are accessed at a time. Another advantage of column storage is its high compression ratio.

External table: data is stored in other systems such as HDFS, and the database retains only metadata information.
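The hash distribution described in item (3) above can be sketched as follows. This is a simplified illustration, not Greenplum's actual hashing: CRC32 is used only as a stand-in hash function, and `segment_for` and `distribute` are hypothetical helper names.

```python
from zlib import crc32


def segment_for(row, distribution_key, num_segments):
    """Pick a segment from a hash of the distribution key columns."""
    key = "|".join(str(row[column]) for column in distribution_key)
    return crc32(key.encode("utf-8")) % num_segments


def distribute(rows, distribution_key, num_segments):
    """Return {segment_id: [rows]} for one table.

    The caller plays the role of the Master: it keeps only the
    distribution logic, while the rows live on the segments.
    """
    placement = {segment: [] for segment in range(num_segments)}
    for row in rows:
        segment = segment_for(row, distribution_key, num_segments)
        placement[segment].append(row)
    return placement
```

Rows with the same distribution key value always land on the same segment, which is why the choice of distribution key determines how evenly storage and I/O are spread across the cluster.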

1.2 Parallel query planning and execution

(1) The dispatcher (QD) on the Greenplum Master issues the query task to each data node.

(2) After a data node receives the task (the query plan tree), it creates a worker process (QE) to execute the task.

(3) If cross-node data exchange is needed (for example, for a hash join), several worker processes are created on each data node to carry out the task cooperatively. Processes that execute the same task (a slice of the query plan) on different nodes form a process group. Data flows from the bottom up, and the Master finally returns the result to the client.
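The dispatch-and-gather pattern in items (1) to (3) can be illustrated with a small sketch. It assumes a simple aggregate query and uses threads in place of segment worker processes; `segment_task` and `master_query` are hypothetical names, and no cross-node data exchange is modelled.

```python
from concurrent.futures import ThreadPoolExecutor


def segment_task(rows):
    """Work done by one executor (QE): partial sum and count over local rows."""
    amounts = [row["amount"] for row in rows]
    return sum(amounts), len(amounts)


def master_query(segments):
    """Dispatcher (QD) side: send the task to every segment, then merge the partials."""
    with ThreadPoolExecutor(max_workers=max(1, len(segments))) as pool:
        partials = list(pool.map(segment_task, segments))
    total = sum(partial_sum for partial_sum, _ in partials)
    count = sum(partial_count for _, partial_count in partials)
    average = total / count if count else None
    return {"total": total, "count": count, "avg": average}
```

Each segment only ever touches its own rows; the Master combines the small partial results, which is what lets the work scale with the number of segments.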

1.3 Parallel data loading

(1) Parallel loading makes full use of the advantages of distributed computing and distributed storage, so that the I/O resources of every disk are put to use.

(2) Parallel loading is more than 40 to 50 times faster than serial loading.

(3) As Segments are added, the parallel loading speed increases linearly.

2. Kafka:

Kafka is an open-source, high-throughput, distributed publish-subscribe messaging system and streaming data transport platform. Kafka was originally designed as a distributed log system and is widely used as message middleware for streaming data. As more and more extension components have been added, it has gradually evolved into a complete stream data computing platform. The core components of Kafka guarantee its three most important properties: high speed, reliability, and scalability.

In Kafka, a topic is a queue that holds messages. Each topic contains one or more partitions, and the number of partitions determines the degree of parallelism in consumption. A producer generates messages and sends them to a topic, or directly to a specific partition of the topic. Consumers consume messages, and they consume a topic in units of groups (consumer groups). Multiple consumer groups may consume the same topic at the same time, each group generally corresponding to a different piece of application logic. Within one group, the messages of each partition can be consumed by only one consumer at a time; multiple consumers in the same group never consume the same partition. Therefore, increasing the number of producers increases the parallelism of the sending side, but increasing the number of consumers does not necessarily increase the parallelism of the receiving side, because the upper limit on consumer parallelism is set by the number of partitions. Kafka is a distributed cluster composed of Brokers, each of which is a process instance that can provide service independently. Messages are organized by partition and stored on the Brokers. Each partition's data has backups, called replicas, which are stored on other Brokers, so the number of replicas cannot exceed the number of Brokers. The number of partitions is not limited in this way, and one Broker can hold several partitions of the same topic.

Among all the replicas of a partition, one is the Leader, which provides read and write service to the outside; the others are called followers. Only when the Leader fails is a new Leader elected from the remaining replicas to take over. To maximize Kafka's throughput, the stream processing pipeline architecture must be designed to match the design of Kafka itself.
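The limit that the partition count places on consumer parallelism can be shown with a short sketch. The round-robin rule below is a simplification chosen for illustration and is not Kafka's own assignment strategy; `assign_partitions` and `effective_parallelism` are hypothetical names.

```python
def assign_partitions(num_partitions, consumers):
    """Assign each partition to exactly one consumer of a single group.

    Round-robin is an assumed, simplified rule. Consumers beyond the
    partition count end up with an empty list, that is, they sit idle.
    """
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


def effective_parallelism(num_partitions, num_consumers):
    """Number of consumers in one group that can actually receive messages."""
    return min(num_partitions, num_consumers)
```

With three partitions and four readers in one group, the fourth reader receives nothing, which is why the engine creates one Reader per partition rather than an arbitrary number of Readers.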

3. Streaming data:

Stream data is data without boundaries and has two important features.

The first is that there is a delay between the generation of stream data and its processing. Stream data therefore has two time attributes: event time and processing time.

The second is that the application can choose the consistency target for the data according to its actual requirements, such as strong consistency or eventual consistency, or delivery guarantees such as at-most-once or at-least-once.
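Both features can be illustrated together in a short sketch: each event carries its event time, the processing time is read when the event is handled, and duplicates produced by at-least-once delivery are skipped by event id. The `Event` fields and the `process` function are assumptions made for the example, not part of the described engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_id: str      # hypothetical unique id used to detect redeliveries
    event_time: float  # when the event happened, in seconds
    amount: float


def process(deliveries, clock):
    """Apply deliveries idempotently and record the event-to-processing lag.

    `clock` is any zero-argument callable returning the current
    processing time in seconds.
    """
    seen = set()
    total = 0.0
    lags = []
    for event in deliveries:
        processing_time = clock()
        if event.event_id in seen:  # duplicate caused by a redelivery
            continue
        seen.add(event.event_id)
        total += event.amount
        lags.append(processing_time - event.event_time)
    return total, lags
```

Passing `time.time` as `clock` gives real processing times; the lags it records are generally different for every event, which is the non-fixed delay the text refers to.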

4. Stream data processing mode:

The stream data processing mode has three main attributes.

4.1 Time: the time attribute falls into three categories: event time, processing time, and time-independent. In general there is a delay between event time and processing time, and the delay is not fixed.

4.2 Window: a window adds a virtual boundary to the stream data, turning unbounded stream data into bounded data sets; it is the most common way to process unbounded data. The added boundaries are generally of two kinds, time boundaries and event boundaries, which give time windows and session windows respectively. Time windows come in two types, fixed windows and sliding windows. A session window applies when the events to be processed carry definite start and end markers: the series of events between a start marker and an end marker is called a session, and the data is processed in units of sessions.

4.3 Operation: the operation is the processing to be performed on the stream data. From simple to complex, the first is the inner join of stream data, that is, finding the events that two data streams have in common or that are similar; the second is transformation and filtering of data, ranging from simple deduplication and unit conversion to complex encryption and desensitization; the third and most complex is aggregation of stream data, where some aggregation function is executed to identify certain characteristics of the data stream. GP's SQL for Java is a good tool for executing stream operations.
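The three kinds of operation in 4.3 can be sketched on small in-memory batches. In the described engine these would be expressed as GP SQL jobs; the Python below only illustrates the logic, and the record fields (`user`, `amount`, `id_no`) and function names are assumptions.

```python
from collections import defaultdict


def inner_join(left, right, key):
    """Join: pair up records from two streams that share the same key value."""
    index = defaultdict(list)
    for record in right:
        index[record[key]].append(record)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def mask_id(record):
    """Transformation: desensitize an id number, keeping the last four characters."""
    id_no = record["id_no"]
    masked = "*" * (len(id_no) - 4) + id_no[-4:]
    return {**record, "id_no": masked}


def total_by(records, key, value):
    """Aggregation: sum one field per distinct key value."""
    totals = defaultdict(float)
    for record in records:
        totals[record[key]] += record[value]
    return dict(totals)
```

A typical chain is `total_by([mask_id(r) for r in inner_join(loans, users, "user")], "user", "amount")`: join first, then desensitize, then aggregate.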

In summary, a stream data processing mode is a combination of these three attributes, and most stream computing problems fall into one of the resulting categories: for example, processing over a session window based on event time, or transforming time-independent data in a fixed window.
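The window types named in 4.2 can be made concrete with a few helper functions. This is a minimal sketch using integer timestamps; the function names and the `type` field with `start` and `end` markers are assumptions made for illustration.

```python
def fixed_window(ts, size):
    """Start of the single fixed window [start, start + size) containing ts."""
    return ts - ts % size


def sliding_windows(ts, size, slide):
    """Starts of every sliding window [start, start + size) containing ts."""
    starts = []
    start = ts - ts % slide  # latest window start that is not after ts
    while start > ts - size:
        starts.append(start)
        start -= slide
    return sorted(starts)


def sessions(events):
    """Group events lying between explicit 'start' and 'end' markers."""
    result = []
    current = None
    for event in events:
        if event["type"] == "start":
            current = [event]
        elif current is not None:
            current.append(event)
            if event["type"] == "end":
                result.append(current)
                current = None
    return result
```

An event falls into exactly one fixed window but into several sliding windows whenever `slide` is smaller than `size`; events outside any start and end pair are ignored by `sessions` in this sketch.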

5. Stream computing engine:

A stream computing engine is a computing engine designed specifically to handle unbounded data. It has two characteristics. First, it must be able to meet the requirement of strong consistency. Second, it needs to support the stream data processing modes described above.

It should be noted that, in some embodiments, in step S1002, the Reader is a Kafka consumer of the stream computing processing engine, and is responsible for reading the source data stream in Kafka, performing pre-transformation and processing on the data stream, and then sending the data to the Segments of GP through the Connector.

It should be noted that in some embodiments, the Connector is based on the HTTP protocol and is responsible for connecting Kafka and GP and for sending and receiving data.

It should be noted that in some embodiments, the Dispatcher is a scheduling component, which is responsible for scheduling, loading, and starting the JOB implemented in GP-SQL for JAVA associated with GP Segment.
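The three responsibilities of the Dispatcher (loading, scheduling, and starting jobs) can be sketched as a small class. This is a hypothetical interface for illustration only; the specification does not define the Dispatcher's API, and the jobs here are plain Python callables rather than GP-SQL for Java jobs.

```python
class Dispatcher:
    """Hypothetical scheduler that loads named jobs and starts them per segment."""

    def __init__(self):
        self._registry = {}  # job name -> callable taking a segment's rows
        self._queue = []     # job names in the order they were scheduled

    def load(self, name, job):
        """Register a job under a name so it can be scheduled later."""
        self._registry[name] = job

    def schedule(self, name):
        """Queue a loaded job for the next start."""
        if name not in self._registry:
            raise KeyError(f"job {name!r} is not loaded")
        self._queue.append(name)

    def start(self, segments):
        """Run every scheduled job on every segment and return results by job."""
        results = {}
        for name in self._queue:
            job = self._registry[name]
            results[name] = {seg: job(rows) for seg, rows in segments.items()}
        self._queue.clear()
        return results
```

For example, after `load("count", len)` and `schedule("count")`, calling `start` with a mapping of segment ids to row lists returns the row count per segment under the key `"count"`.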

It should be noted that, in some embodiments, a PreHandler operation is performed on the data stream before JOB processing, so as to obtain the data format required by JOB processing.

It should be noted that in some embodiments, a PostHandler operation is performed on the data stream after JOB processing, so that it meets the format required for saving to the MySQL database and the processed result data can be written into the result database.
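The PreHandler, JOB, PostHandler chain can be sketched as three composed functions. Everything specific here is an assumption for illustration: the JSON message layout, the conversion to cents, the 1% fee computed by the stand-in `job`, and the `batch_result` table and its columns.

```python
import json


def pre_handler(raw_message):
    """Convert raw message bytes into the record format the JOB expects."""
    record = json.loads(raw_message.decode("utf-8"))
    amount_cents = round(float(record["amount"]) * 100)
    return {"loan_id": record["loan_id"], "amount_cents": amount_cents}


def job(record):
    """Hypothetical JOB: compute a 1% fee using integer arithmetic."""
    return {**record, "fee_cents": record["amount_cents"] // 100}


def post_handler(result):
    """Shape the JOB output into a parameterized insert for the result table."""
    sql = (
        "INSERT INTO batch_result (loan_id, amount_cents, fee_cents) "
        "VALUES (%s, %s, %s)"
    )
    params = (result["loan_id"], result["amount_cents"], result["fee_cents"])
    return sql, params


def handle(raw_message):
    """Run one message through PreHandler, JOB, and PostHandler."""
    return post_handler(job(pre_handler(raw_message)))
```

The returned statement and parameter tuple are in the form accepted by the `execute` method of common Python MySQL drivers; no database connection is opened in this sketch.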

Having thus described the basic concept, it will be apparent to those skilled in the art that the foregoing detailed disclosure is to be regarded as illustrative only and not as limiting the present specification. Various modifications, improvements and adaptations to the present description may occur to those skilled in the art, although not explicitly described herein. Such modifications, improvements and adaptations are proposed in the present specification and thus fall within the spirit and scope of the exemplary embodiments of the present specification.

Also, the description uses specific words to describe embodiments of the description. Reference throughout this specification to "one embodiment," "an embodiment," and/or "some embodiments" means that a particular feature, structure, or characteristic described in connection with at least one embodiment of the specification is included. Therefore, it is emphasized and should be appreciated that two or more references to "an embodiment" or "one embodiment" or "an alternative embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, some features, structures, or characteristics of one or more embodiments of the specification may be combined as appropriate.

Additionally, the order in which the elements and sequences of the process are recited in the specification, the use of alphanumeric characters, or other designations, is not intended to limit the order in which the processes and methods of the specification occur, unless otherwise specified in the claims. While various presently contemplated embodiments of the invention have been discussed in the foregoing disclosure by way of example, it is to be understood that such detail is solely for that purpose and that the appended claims are not limited to the disclosed embodiments, but, on the contrary, are intended to cover all modifications and equivalent arrangements that are within the spirit and scope of the embodiments herein. For example, although the system components described above may be implemented by hardware devices, they may also be implemented by software-only solutions, such as installing the described system on an existing server or mobile device.

Similarly, it should be noted that in the preceding description of embodiments of the present specification, various features are sometimes grouped together in a single embodiment, figure, or description thereof in order to streamline the disclosure and aid the understanding of one or more of the embodiments. This method of disclosure, however, is not intended to imply that the claimed subject matter requires more features than are expressly recited in a claim. Indeed, an embodiment may have fewer than all of the features of a single embodiment disclosed above.

Finally, it should be understood that the embodiments described herein are merely illustrative of the principles of the embodiments of the present disclosure. Other variations are also possible within the scope of the present description. Thus, by way of example, and not limitation, alternative configurations of the embodiments of the specification can be considered consistent with the teachings of the specification. Accordingly, the embodiments of the present description are not limited to only those embodiments explicitly described and depicted herein.

Claims

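1. A distributed stream computing processing engine for internet financial consumption credit batch business, characterized by comprising the following steps:

step S1001: reading data from a source database, and sending the source data stream to a Kafka topic partition through a Kafka producer;

step S1002: reading the source data messages in Kafka, and creating a Reader as a consumer for each partition in the Kafka topic;

step S1003: the Readers collect the messages from the same topic through stream operations and forward the messages to each Segment of GP through the HTTP Connector;

step S1004: scheduling, loading and starting, through the scheduler Dispatcher, the JOBs that implement stream data processing in GP-SQL for Java using the stream data processing mode and stream operations, so as to complete specific business functions;

step S1005: writing the processed result data into a result database.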

2. The distributed stream computing processing engine for internet financial consumption credit batch business according to claim 1, wherein in step S1002, the Reader is a Kafka consumer of the stream computing processing engine, and is responsible for reading the source data stream in Kafka, performing pre-transformation and processing on the data stream, and then sending the data to the Segments of GP through the Connector.

3. The distributed stream computing processing engine for internet financial consumption credit batch business according to claim 1, wherein the Connector is based on the HTTP protocol and is responsible for connecting Kafka and GP and for sending and receiving data.

4. The distributed stream computing processing engine for internet financial consumption credit batch business according to claim 1, wherein the Dispatcher is a scheduling component responsible for scheduling, loading, and starting the JOBs, implemented in GP-SQL for Java, that are associated with the GP Segments.

5. The distributed stream computing processing engine for internet financial consumption credit batch business according to claim 1, wherein a PreHandler operation is performed on the data stream before JOB processing, so as to obtain the data format required by JOB processing.

6. The distributed stream computing processing engine for internet financial consumption credit batch business according to claim 1, wherein a PostHandler operation is performed on the data stream after JOB processing, so that it meets the format required for saving to the MySQL database and the processed result data is written into the result database.