CN111431955B

CN111431955B - Streaming data processing system and method

Info

Publication number: CN111431955B
Application number: CN201910023008.2A
Authority: CN
Inventors: 毕俊; 张敬亮; 傅伟康; 马银龙; 王千一; 王晓东; 杨光辉; 张家旺
Original assignee: Zhongke Star Map Co ltd
Current assignee: Zhongke Star Map Co ltd
Priority date: 2019-01-10
Filing date: 2019-01-10
Publication date: 2023-03-24
Anticipated expiration: 2039-01-10
Also published as: CN111431955A

Abstract

The invention provides a streaming data processing system and a method, wherein the system comprises: the leading system is used for receiving the data, forwarding the data to the Kafka platform and sending a notice to the scheduling system; the Kafka platform is used for storing the data forwarded by the leading system in pre-allocated topic; the scheduling system is used for generating a scheduling notification according to the information of the Kafka platform and the state information in the etcd system and sending the scheduling notification to the Kubernetes platform after receiving the notification of the leading system; the etcd system is used for updating and storing the state information according to the report of the Kubernets platform; the Kubernetes platform is used to control, generate or adjust the pod according to the scheduling notification and consume the data in the corresponding topic. The invention realizes a uniform API interface for receiving data stream, and the dispatching system can be pertinently started, closed, expanded and contracted according to the load condition of the container, thereby realizing elastic and flexible data processing.

Description

Streaming data processing system and method

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a streaming data processing system and method.

Background

Kubernets, as an opening source container orchestration engine, is used for managing containerized applications on multiple hosts in a cloud platform, and provides mechanisms for application deployment, planning, updating, and maintenance, with the goal of making it simple and efficient to deploy containerized applications. In a Kubernetes-based system, a client sends a Deployment to an ApiServer, and the ApiServer receives a request of a client and stores resource contents into a database (etcd); a Controller component (comprising scheduler, replication and endpoint) monitors resource changes and reacts; the replicase checks database changes and creates a desired number of pod instances; the Scheduler checks the database change again, finds the Pod which is not distributed to a specific execution node (node), then distributes the Pod to the nodes which can operate the Pod according to a group of related rules, updates the database and records the Pod distribution condition; kubelete monitors database changes, manages the lifecycle of subsequent pods, and discovers those pods that are assigned to run on the node where it is located; if a new pod is found, then this new pod will be run on the node; the kuberproxy runs on each host of the cluster and manages network communication, such as service discovery and load balancing. E.g., when data is sent to the host, it is routed to the correct pod or container. For data sent from a host, it may discover the remote server based on the request address and route the data correctly, in some cases using a Round-robin scheduling algorithm (Round-robin) to send the request to multiple instances in the cluster.

However, the Kubernetes-based system described above suffers from the following drawbacks: the pressure at the rear end cannot be dynamically sensed and the elastic expansion and contraction are realized; no API interface accesses the data stream.

Disclosure of Invention

The invention aims to solve at least one of the technical problems in the prior art or the related art and provides an elastic, flexible and efficient data processing scheme.

To this end, according to a first aspect of the present invention, there is provided a streaming data processing system, characterized by comprising:

the leading system is used for receiving the data, forwarding the data to the Kafka platform and sending a notice to the scheduling system;

the Kafka platform is used for storing the data forwarded by the leading system in pre-allocated topic;

the scheduling system is used for generating a scheduling notice according to the information of the Kafka platform and the state information in the distributed service registration (etcd) system and sending the scheduling notice to the Kubernetes platform after receiving the notice of the approach system;

a distributed service registration (etcd) system for updating and storing the state information according to the report of the Kubernets platform;

and the Kubernetes platform is used for controlling, generating or adjusting the pod according to the scheduling notification and consuming the data in the corresponding topic.

Further, the connection system comprises an nginx module and a lua module, wherein the nginx module is used for receiving the data, and the lua module is used for analyzing address information of the data and forwarding the data to a topic corresponding to the Kafka platform according to the address information.

Further, the state information in the etcd system includes kubernets platform running state information, a service process ID, and a Pod ID, the information of the Kafka platform includes topoic information and a statistical count, and the scheduling notification includes data in the topoc indicating which Pod consumes the corresponding scheduling notification.

Further, the status information in the etcd system further includes a load status, the information of the Kafka platform further includes a received data amount, and the scheduling notification further includes a resource indicating to generate a new pod or to indicate which pod to adjust and how to adjust.

Further, the data is transmitted by adopting a udp protocol in the forwarding and consumption of the data.

According to a second aspect of the present invention, there is provided a streaming data processing method, comprising:

the leading system receives the data, forwards the data to the Kafka platform and sends a notice to the dispatching system;

the Kafka platform stores the data forwarded by the leading system in pre-allocated topic;

after receiving the notification of the approach system, the scheduling system acquires the information of the Kafka platform and the state information in a distributed service registration (etcd) system, generates a scheduling notification and sends the scheduling notification to the Kubernetes platform;

and the Kubernetes platform controls, generates or adjusts the pod according to the scheduling notification and consumes the data in the corresponding topic.

Further, the leading system comprises an nginx module and a lua module, wherein the nginx module receives the data, and the lua module analyzes address information of the data and forwards the data to the topic corresponding to the Kafka platform according to the address information.

The invention achieves the function of data parallel access by adding an independent process (leading system), and realizes a uniform API interface for receiving data stream; data are classified and forwarded through a nginx + lua module architecture, and dynamic load balancing is achieved; the state update of the Kubernetes platform is received through the etcd system, and the rear-end pressure can be sensed in time; the container is controlled and adjusted by the scheduling system according to the data change condition in the Kafka platform and the container state in the Kubernetes platform, so that the container can be pertinently started, closed, expanded and contracted according to the load condition of the container, and elastic and flexible data processing is realized.

Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.

Drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 is a block diagram of a streaming data processing system according to the present invention;

fig. 2 is a flow chart of a streaming data processing method according to the present invention.

Detailed Description

In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described herein, and therefore the scope of the present invention is not limited by the specific embodiments disclosed below.

The invention is improved on the basis of the existing Kubernetes platform, and a data access program at the front end is added to access data flow and carry out classification processing, thereby realizing uniform API interface and dynamic load balance; and a scheduling system is added to realize the elastic expansion and contraction of the container in the Kubernetes platform. The following describes the aspects of the present invention in detail with reference to the accompanying drawings.

Referring to fig. 1, there is shown a streaming data processing system according to the present invention, comprising a docking system 11, a Kafka platform 12, a scheduling system 13, a distributed service registration (etcd) system 14, and a kubernets platform 15. The leading system 11 is configured to receive the data, forward the data to the Kafka platform 12, and send a notification to the scheduling system; the Kafka platform 12 is used for storing the data forwarded by the leading system 11 in pre-allocated topic; the scheduling system 13 is configured to generate a scheduling notification according to the information of the Kafka platform 12 and the state information in the etcd system 14 after receiving the notification of the connection system 11, and send the scheduling notification to the kubernets platform 15; the etcd system 14 is configured to update and store the state information according to the report of the kubernets platform 15; the kubernets platform 15 is used to control, generate or adjust the pod according to the scheduling notification, and consume the data in the corresponding topic. Wherein the pod comprises one or more containers for processing the business data; the "consumption" refers to that the pod acquires and processes the data in the corresponding topic.

Optionally, the guidance system 11 includes a nginx module 111 and a lua module 112, where the nginx module 111 is configured to receive the data, and the lua module 112 is configured to analyze header information of the data, classify the data according to IP address information in the header, and forward the data to a topic corresponding to the Kafka platform according to a predefined routing policy. Thus, when different types of data streams are accessed, data of different ip sources can be classified and distributed.

Optionally, the state information in the etcd system 14 includes kubernets platform operation state information, a service process ID, and a Pod ID, the information of the Kafka platform 12 includes topic information and a statistical count, and the scheduling system 13 generates a scheduling notification according to the information, where the notification includes data in the topic indicating which Pod consumes the corresponding Pod, so that the kubernets platform 15 may control the indicated Pod to consume the data in the topic.

Optionally, the state information in the etcd system 14 further includes a load state, the information of the Kafka platform 12 further includes a received data volume, and the scheduling system 13 generates a scheduling notification according to the information, where the notification includes a resource indicating to generate a new pod or to indicate which pod is to be adjusted and how to adjust, so that the kubernets platform 15 may generate a new pod to consume data in the corresponding topic, or close, expand, and shrink the resource of the indicated pod, for example, a CPU, a memory, and a storage space for the pod, so that resource isolation, dynamic elastic scaling, and migration of each application are ensured.

The information related to the kubernets platform 15 will be uploaded to the etcd system 14 periodically or aperiodically, and the status information stored in the etcd system 14 will be updated in real time.

Optionally, the udp protocol is used for transmitting data in the forwarding and consumption of the data, so that the data transmission concurrency is improved, and the transmission capability of 10000 pieces of data per second of each node can be achieved.

Referring to fig. 2, there is shown a streaming data processing method according to the present invention, comprising:

s21, the leading system receives the data, forwards the data to a Kafka platform and sends a notice to a scheduling system;

s22, the Kafka platform stores the data forwarded by the leading system in pre-distributed topic;

s23, the scheduling system acquires the information of the Kafka platform and the state information in the etcd system, generates a scheduling notice and sends the scheduling notice to the Kubernetes platform;

and S24, the Kubernetes platform controls, generates or adjusts the pod according to the scheduling notification, and consumes the data in the corresponding topic.

Optionally, the guidance system includes a nginx module and a lua module, the nginx module receives the data, and the lua module analyzes address information of the data and forwards the data to a topic corresponding to the Kafka platform according to the address information.

Optionally, the state information in the etcd system includes kubernets platform operation state information, a service process ID, and a Pod ID, the information of the Kafka platform includes topic information and a statistical count, and the scheduling notification includes a notification indicating which Pod consumes data in the corresponding topic.

Optionally, the status information in the etcd system further includes a load status, the information of the Kafka platform further includes a received data volume, and the scheduling notification further includes a resource indicating which pod is to be generated or adjusted, and how to adjust.

Optionally, the data is transmitted by using udp protocol in the forwarding and consumption of the data.

Therefore, the invention realizes a uniform API for receiving data stream, and realizes dynamic load balance by classifying and forwarding data; the rear-end pressure can be sensed in time through the arrangement of the etcd system, the scheduling system can be started, closed, expanded and contracted according to the load condition of the container in a targeted manner, and elastic and flexible data processing is realized.

The following illustrates the specific implementation of the present invention by way of an example:

the collector sends streaming data;

the nginx module in the leading-in system receives the data and sends the data to the lua module, the lua module analyzes message header information of the data, IP address information of the data is obtained, topic corresponding to the data of the IP address is determined according to a predefined routing strategy, a Kafka client API is called to forward the data to topic corresponding to a Kafka platform, and a dispatching system is informed;

after receiving the notification, the scheduling system acquires topic information, statistical count and received data volume from the Kafka platform, acquires state information of the Kubernets platform registered in the etcd from the etcd system, wherein the state information comprises running state information of the Kubernets platform, service process ID, pod ID and load state, and generates a scheduling notification according to the information;

if no container is determined to be processed according to the topoc information, the business process ID and the pod ID, indicating generation of a new pod for processing data in the corresponding topoc in the scheduling notification;

if the container resource is determined to be insufficient according to the load state and the received data volume, indicating to expand the resource of the pod in the scheduling notification, for example, increasing the memory of the pod;

if the container resource is determined to be excessive according to the load state and the received data volume, the resource of the shrinkage pod is indicated in the scheduling notification, for example, the CPU occupancy rate of the pod is reduced;

and the Kubernets platform controls, generates or adjusts the pod according to the scheduling notification, and acquires and processes the corresponding service data in the topic through the specified pod.

Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium. The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes will occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A streaming data processing system, comprising:

a distributed service registration (etcd) system, which is used for updating and storing the state information according to the report of a Kubernets platform;

2. The streaming data processing system according to claim 1, wherein the docking system comprises a nginx module and a lua module, the nginx module is configured to receive the data, and the lua module is configured to analyze address information of the data and forward the data to a topic corresponding to the Kafka platform according to the address information.

3. The streaming data processing system of claim 1, wherein the state information in the etcd system includes kubernets platform running state information, a service process ID, and a podID, wherein the Kafka platform information includes topic information and a statistical count, and wherein the scheduling notification includes an indication of which pod consumed the data in the corresponding topic.

4. The streaming data processing system of claim 3, wherein the state information in the etcd system further comprises a load state, wherein the Kafka platform information further comprises a received data volume, and wherein the scheduling notification further comprises a resource indicating which pod to generate a new pod or which pod to adjust and how to adjust.

5. The streaming data processing system of any of claims 1-4, wherein the data is transmitted using udp protocol in the forwarding and consumption of the data.

6. A streaming data processing method, comprising:

7. The method according to claim 6, wherein the docking system comprises a nginx module and a lua module, the nginx module receives the data, and the lua module analyzes address information of the data and forwards the data to a topic corresponding to the Kafka platform according to the address information.

8. The method of claim 6, wherein the status information in the etcd system comprises Kubernets platform running status information, service process IDs and podIDs, wherein the Kafka platform information comprises topoic information and statistical counts, and wherein the scheduling notification comprises information indicating which pod consumed the data in the corresponding topoc.

9. The method of claim 8, wherein the status information in the etcd system further comprises a load status, wherein the Kafka platform information further comprises a received data volume, and wherein the scheduling notification further comprises a resource indicating which pod is to be adjusted or a new pod is to be generated.

10. Method according to any of claims 6-9, wherein the data is transmitted using udp protocol in the forwarding and consumption of the data.