CN110535713B - Monitoring management system and monitoring management method - Google Patents
- Publication number
- CN110535713B (application CN201810509664.9A)
- Authority
- CN
- China
- Prior art keywords
- monitoring
- queue
- message queue
- information
- monitoring information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- H04L41/046—Network management architectures or arrangements comprising network management agents or mobile agents therefor
- H04L41/069—Management of faults, events, alarms or notifications using logs of notifications; Post-processing of notifications
- H04L43/045—Processing captured monitoring data, e.g. for logfile generation for graphical visualisation of monitoring data
- H04L43/06—Generation of reports
- H04L43/16—Threshold monitoring
- H04L67/1044—Group management mechanisms
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Theoretical Computer Science (AREA)
- Debugging And Monitoring (AREA)
Abstract
The application discloses a monitoring management system comprising a message queue cluster, a monitoring management platform and a database. The message queue cluster comprises at least one message queue gateway and a plurality of message queue nodes. A gateway monitoring data acquisition agent program collects monitoring information of the message queue gateway and reports it to the monitoring management platform, and a node monitoring data acquisition agent program collects monitoring information of the message queue node and reports it to the monitoring management platform. The monitoring management platform analyzes the monitoring information, obtains a monitoring result and stores the monitoring information in the database. The database stores the monitoring information provided by the monitoring management platform. With this system, the running condition and health condition of the current message queue cluster are summarized through mechanisms such as information collection and analysis, which greatly improves operation and maintenance assurance for the system.
Description
Technical Field
The application relates to the technical field of computers, in particular to a monitoring management system. The application also relates to a monitoring management method.
Background
The message queue middleware product is a general message queue product and is widely used in scenes such as data distribution, message interaction and the like.
In the prior art, message middleware products (such as IBM MQ) are often used for message interaction, but a complete monitoring management system for monitoring the operation of the message queue cluster is lacking. A main problem in using message queue technology is that when a message queue cluster fails, for example when a queue is full or an MQ cluster goes down, the failure cannot be discovered in time for the related operation and maintenance work to be carried out, which results in low reliability and low maintainability of the system.
The prior art has the problems of low reliability and low maintainability when a message queue middleware product is adopted.
Disclosure of Invention
The application provides a monitoring management system and a monitoring management method, which aim to solve the problems of low reliability and low maintainability existing in the prior art when a message queue middleware product is adopted.
The application provides a monitoring management system, comprising: a message queue cluster, a monitoring management platform and a database;
the message queue cluster comprises at least one message queue gateway and a plurality of message queue nodes;
the message queue gateway is used for distributing messages of a docking application system to the message queue nodes according to the load of the message queue nodes; the message queue gateway runs a gateway monitoring data acquisition agent program, which is used for collecting monitoring information of the message queue gateway and reporting the collected monitoring information of the message queue gateway to the monitoring management platform;
the message queue node is used for receiving the messages of the docking application system provided by the message queue gateway, and for processing those messages or storing them in the form of a message queue; the message queue node runs a node monitoring data acquisition agent program, which is used for collecting monitoring information of the message queue node and reporting the collected monitoring information of the message queue node to the monitoring management platform;
the monitoring management platform is used for obtaining the monitoring information reported by the gateway monitoring data acquisition agent program and the node monitoring data acquisition agent program, analyzing the monitoring information, obtaining a monitoring result and storing the monitoring information in the database;
and the database is used for storing the monitoring information provided by the monitoring management platform.
Optionally, the gateway monitoring data acquisition agent program is specifically configured to collect monitoring information of the message queue gateway at regular intervals and report the collected monitoring information of the message queue gateway to the monitoring management platform.
Optionally, the node monitoring data acquisition agent program is specifically configured to collect monitoring information of the message queue node at regular intervals and report the collected monitoring information of the message queue node to the monitoring management platform.
Optionally, the monitoring management platform includes:
the early warning submodule is used for determining whether early warning is needed or not according to the monitoring result of the monitoring information, and if so, performing early warning processing;
and the statistics and display submodule is used for carrying out multi-dimensional statistics and display on the monitoring information in the database.
Optionally, the early warning sub-module is specifically configured to:
notify a system administrator by mobile phone short message or mail when the monitoring result of the monitoring information reaches the early warning condition threshold set for the monitoring information; or issue an image or sound warning or alarm.
Optionally, the early warning sub-module is further configured to set an early warning level of the monitoring information, where different early warning levels correspond to different early warning condition thresholds set for the monitoring information.
Optionally, the statistics and display sub-module includes:
the MQ cluster system operation condition statistics and display submodule is used for carrying out statistics and display on the MQ cluster system operation condition; or,
the hardware condition counting and displaying submodule is used for counting and displaying the health condition of the hardware; or,
and the queue manager counting and displaying submodule is used for counting and displaying the queue manager.
Optionally, the MQ cluster system operation condition statistics and display submodule is specifically configured to display a topological graph of the docking application service system.
Optionally, the queue manager statistics and display sub-module is specifically configured to:
counting and displaying the data quantity of the queue manager; or,
counting and displaying the queue information in the queue manager; or,
counting the number of data items of the queue manager that were successfully put in, failed to be put in, successfully taken out and failed to be taken out; or,
counting and displaying, for each queue in the queue manager, the data on successful put-ins, failed put-ins, successful take-outs and failed take-outs.
Optionally, the counting and displaying of the queue manager includes at least one of the following modes:
monthly statistics, daily statistics, hourly statistics, minute statistics, historical statistics, custom statistics.
Optionally, the monitoring information includes at least one of the following information:
queue manager state; message channel state; message queue information; error queue information; deadlock queue information; queue statistics.
Optionally, the queue statistics include: queue data flow direction information and/or data traffic information.
The present application further provides a monitoring management method, which is applied to the monitoring management system, and the method includes:
the message queue gateway reports its own monitoring information to the monitoring management platform through a gateway monitoring data acquisition agent program running on the message queue gateway;
the message queue node reports its own monitoring information to the monitoring management platform through a node monitoring data acquisition agent program running on the message queue node;
and the monitoring management platform analyzes the monitoring information, obtains a monitoring result of the monitoring information, and stores the monitoring information in the database.
Optionally, the message queue gateway reporting its own monitoring information to the monitoring management platform through the gateway monitoring data acquisition agent program running on the message queue gateway includes:
the message queue gateway reports its own monitoring information to the monitoring management platform at regular intervals through the gateway monitoring data acquisition agent program running on the message queue gateway.
Optionally, the message queue node reporting its own monitoring information to the monitoring management platform through the node monitoring data acquisition agent program running on the message queue node includes:
the message queue node reports its own monitoring information to the monitoring management platform at regular intervals through the node monitoring data acquisition agent program running on the message queue node.
Optionally, the method further includes:
and determining whether to alarm or pre-warn according to the monitoring result of the monitoring information.
Optionally, determining whether to alarm or perform early warning according to the monitoring result of the monitoring information includes:
judging whether the monitoring result of the monitoring information reaches an alarm condition set for the monitoring information, and if so, carrying out alarm processing; or,
judging whether the monitoring result of the monitoring information reaches an early warning condition threshold set for the monitoring information, and if so, carrying out early warning processing.
Optionally, the method further includes:
and setting early warning levels of the monitoring information, wherein different early warning levels correspond to different early warning condition thresholds set for the monitoring information.
Optionally, the method further includes:
and carrying out multidimensional statistics and display on the monitoring information in the database.
Optionally, the performing multidimensional statistics and display on the monitoring information in the database includes:
counting and displaying the operation condition of the MQ cluster system; or,
counting and displaying the health condition of the hardware; or,
and counting and displaying the queue manager.
Optionally, the counting and displaying the operation condition of the MQ cluster system includes:
displaying a topological graph of the docking application service system.
Optionally, the counting and displaying the queue manager includes:
counting and displaying the data quantity of the queue manager; or,
counting and displaying the queue information in the queue manager; or,
counting the number of data items of the queue manager that were successfully put in, failed to be put in, successfully taken out and failed to be taken out; or,
counting and displaying, for each queue in the queue manager, the data on successful put-ins, failed put-ins, successful take-outs and failed take-outs.
Optionally, the counting and displaying of the queue manager includes at least one of the following modes:
monthly statistics, daily statistics, hourly statistics, minute statistics, historical statistics, custom statistics.
Optionally, the monitoring information includes at least one of the following information:
queue manager state; message channel state; message queue information; error queue information; deadlock queue information; queue statistics.
Optionally, the queue statistics include: queue data flow direction information and/or data traffic information.
Compared with the prior art, the method has the following advantages:
According to the monitoring management system and the monitoring management method provided by the application, the message queue gateway and the message queue nodes report their own monitoring information to the monitoring management platform, and the monitoring management platform analyzes the monitoring information to obtain a monitoring result. Problems in the message queue cluster can therefore be found in time and handled accordingly, and the running state and health state of the current message queue cluster are summarized through mechanisms such as information collection and analysis, which greatly improves operation and maintenance assurance for the system.
Drawings
Fig. 1 is a schematic diagram of a monitoring management system according to a first embodiment of the present application.
Fig. 2 is a schematic diagram of a monitoring agent program acquiring monitoring information and sending the acquired monitoring information to a monitoring management platform according to a first embodiment of the present application.
Fig. 3 is a functional schematic diagram of a monitoring management platform according to a first embodiment of the present application.
Fig. 4 is a schematic diagram illustrating that a monitoring management platform sends warning information to a mailbox of a system administrator according to a first embodiment of the present application.
Fig. 5 is a schematic diagram illustrating statistics and display of the data volume of the queue manager according to the first embodiment of the present application.
Fig. 6 is a schematic diagram of counting and displaying queue messages in a queue manager according to the first embodiment of the present application.
Fig. 7 is a schematic diagram of counting the number of data items of the queue manager that were successfully put in, failed to be put in, successfully taken out and failed to be taken out, according to the first embodiment of the present application.
Fig. 8 is a schematic diagram of the data on successful put-ins, failed put-ins, successful take-outs and failed take-outs for each queue in the queue manager according to the first embodiment of the present application.
Fig. 9 is a schematic diagram showing a topology diagram of a docking application service system according to a first embodiment of the present application.
Fig. 10 is a flowchart of a monitoring management method according to a second embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. However, the application can be implemented in many ways other than those described herein, and those skilled in the art can make similar extensions without departing from its spirit; the application is therefore not limited to the specific implementations disclosed below.
A first embodiment of the present application provides a monitoring management system. The message queue cluster in the first embodiment is described by taking an MQ (IBM MQ) cluster as an example. The following description is made in detail with reference to fig. 1 through fig. 9.
The system comprises: a message queue cluster 101, a monitoring management platform 102 and a database 103.
The message queue cluster 101 is described as follows:
the message queue cluster comprises at least one message queue gateway and a plurality of message queue nodes;
the message queue gateway is used for distributing messages of a docking application system to the message queue nodes according to the load of the message queue nodes; the message queue gateway runs a gateway monitoring data acquisition agent program, which is used for collecting monitoring information of the message queue gateway and reporting the collected monitoring information of the message queue gateway to the monitoring management platform;
the message queue node is used for receiving the messages of the application system provided by the message queue gateway, and for processing those messages or storing them in the form of a message queue; the message queue node runs a node monitoring data acquisition agent program, which is used for collecting monitoring information of the message queue node and reporting the collected monitoring information of the message queue node to the monitoring management platform.
It should be noted that the gateway monitoring data collection agent and the node monitoring data collection agent may adopt the same program or different programs.
As shown in fig. 1, MQ gateway server 1 and MQ gateway server 2 are message queue gateways.
An MQ gateway server is the message queue gateway server in an MQ cluster. It serves as the gateway of the whole MQ cluster, mainly handles connection requests from applications, and distributes the message data of the docking application system to the MQ nodes through a load balancing mechanism.
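The load-based distribution performed by the gateway can be sketched as follows. This is only an illustration: the application does not specify the balancing policy, so the sketch assumes that load is measured as the current queue depth and that the least-loaded node is chosen. The names `MQNode`, `pick_node` and `dispatch` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MQNode:
    """Hypothetical view of a message queue node as seen by the gateway."""
    name: str
    queue_depth: int = 0  # assumed load metric: messages currently queued


def pick_node(nodes):
    """Return the node with the lowest load (ties go to the first listed)."""
    if not nodes:
        raise ValueError("no message queue nodes available")
    return min(nodes, key=lambda n: n.queue_depth)


def dispatch(nodes, messages):
    """Assign each message to the least-loaded node and record the routing."""
    routing = []
    for msg in messages:
        node = pick_node(nodes)
        node.queue_depth += 1  # the chosen node now holds one more message
        routing.append((msg, node.name))
    return routing


nodes = [MQNode("node1", 2), MQNode("node2", 0), MQNode("node3", 1)]
routing = dispatch(nodes, ["m1", "m2", "m3", "m4"])
```

In this example the first two messages go to the initially empty `node2`, after which the load evens out across the three nodes.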
The monitoring information comprises hardware information, such as CPU utilization, disk usage, file size, processes and network status. It also includes the following: queue manager state; message channel state; message queue information; error queue information; deadlock queue information; queue statistics; and so on. The queue statistics include queue data flow direction information and/or data traffic information.
As shown in fig. 1, the gateway monitoring data acquisition agent program (Agent) running on MQ gateway server 1 and MQ gateway server 2 collects monitoring information and reports the collected monitoring information of the message queue gateway to the monitoring management platform. Preferably, the agent program collects and reports the monitoring information at regular intervals, so that the monitoring management platform can handle abnormal conditions in time according to the monitoring information and monitor the system in real time. The gateway monitoring data acquisition agent program is a software program running on the message queue gateway; it can collect various kinds of monitoring information from the message queue gateway and send them to the monitoring management platform. As shown in fig. 2, the agent program collects monitoring information by calling APIs and sends it to the monitoring management platform server. System hardware information (such as memory, CPU and disk) is mainly obtained by querying the operating system's API, and MQ cluster information is obtained through the SDK API provided by IBM MQ. The collected MQ data mainly includes: MQ cluster, MQ queue manager, queue depth, message channel, IP, port, number of messages put in and taken out, time, and other information.
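A periodic collect-and-report loop of this kind might look like the sketch below. It is an assumption-laden illustration, not the agent of the application: hardware data is read with Python's standard `shutil.disk_usage`, the MQ data comes from an injected `mq_probe` callable that stands in for the vendor SDK (whose calls are not shown in the source), and the report format, `run_agent` and `send` are all hypothetical.

```python
import json
import shutil
import time


def collect_hardware(path="/"):
    """Query disk usage through the operating system (standard library only)."""
    usage = shutil.disk_usage(path)
    return {"disk_used_pct": round(100 * usage.used / usage.total, 1)}


def collect_once(host, mq_probe, hw_probe=collect_hardware, clock=time.time):
    """Build one monitoring report from the hardware probe and the MQ probe."""
    return {
        "host": host,
        "timestamp": clock(),
        "hardware": hw_probe(),
        "mq": mq_probe(),  # stand-in for data read through the MQ vendor SDK
    }


def run_agent(host, mq_probe, send, interval_s, rounds,
              sleep=time.sleep, **probe_kwargs):
    """Collect and report `rounds` times, waiting `interval_s` between rounds."""
    for i in range(rounds):
        send(json.dumps(collect_once(host, mq_probe, **probe_kwargs)))
        if i < rounds - 1:
            sleep(interval_s)


# Example with fake probes so the sketch runs without an MQ installation.
sent = []
run_agent(
    "mq-gateway-1",
    mq_probe=lambda: {"queue_manager": "QM1", "queue_depth": 42},
    send=sent.append,              # a real agent would send to the platform
    interval_s=60,
    rounds=2,
    sleep=lambda s: None,          # skip real waiting in the example
    hw_probe=lambda: {"cpu_pct": 12.5},
    clock=lambda: 0,
)
```

Injecting the probes, the clock and the transport keeps the loop itself independent of any particular MQ product or reporting protocol.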
As shown in fig. 1, MQ cluster node 1, MQ cluster node 2, MQ cluster node 3 and MQ cluster node 4 are message queue nodes.
The node monitoring data acquisition agent program (Agent) running on MQ cluster node 1, MQ cluster node 2, MQ cluster node 3 and MQ cluster node 4 collects monitoring information of the message queue node and reports it to the monitoring management platform. Preferably, the agent program collects and reports the monitoring information of the message queue node at regular intervals, so that the monitoring management platform can handle abnormal conditions in time according to the monitoring information and monitor the message queue nodes in real time. The node monitoring data acquisition agent program is a software program running on the message queue node; it can collect various kinds of monitoring information from the message queue node and send them to the monitoring management platform.
The monitoring management platform 102 is configured to obtain the monitoring information reported by the gateway monitoring data acquisition agent program and the node monitoring data acquisition agent program, analyze the monitoring information, obtain a monitoring result, and store the monitoring information in the database. Fig. 3 shows a functional schematic of the monitoring management platform.
The monitoring management platform may include:
the alarm submodule is used for alarming according to the monitoring result of the monitoring information;
the early warning submodule is used for determining whether early warning is needed or not according to the monitoring result of the monitoring information, and if so, performing early warning processing;
and the statistics and display submodule is used for carrying out multi-dimensional statistics and display on the monitoring information in the database.
The alarm submodule raises an alarm when the monitoring information is abnormal, that is, when the monitoring result of the monitoring information meets an alarm condition. For example, an alarm may be raised when the monitoring result satisfies any of the following conditions: the memory usage exceeds 95%; the CPU utilization exceeds 95%; the hard disk utilization exceeds 95%; data exists in the deadlock queue; or a "message channel unavailable" message appears.
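The example alarm conditions can be expressed as a small rule check. This is a minimal sketch: the record layout and field names (`memory_pct`, `cpu_pct`, `disk_pct`, `deadlock_queue_depth`, `channel_status`) are assumptions made for illustration and do not come from the application.

```python
def check_alarms(info):
    """Return the alarm reasons triggered by one monitoring record.

    `info` is a hypothetical dict; missing fields are treated as healthy.
    """
    alarms = []
    usage_rules = (
        ("memory_pct", "memory usage"),
        ("cpu_pct", "CPU utilization"),
        ("disk_pct", "hard disk utilization"),
    )
    for key, label in usage_rules:
        if info.get(key, 0) > 95:  # strictly above 95%, as in the example
            alarms.append(f"{label} exceeds 95%")
    if info.get("deadlock_queue_depth", 0) > 0:
        alarms.append("data exists in the deadlock queue")
    if info.get("channel_status") == "unavailable":
        alarms.append("message channel unavailable")
    return alarms


alarms = check_alarms({
    "memory_pct": 96,
    "cpu_pct": 40,
    "disk_pct": 95,              # exactly 95% does not exceed the limit
    "deadlock_queue_depth": 3,
    "channel_status": "running",
})
```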
The early warning submodule can perform early warning. When the monitoring result of the monitoring information reaches the early warning condition threshold value set for the monitoring information, a system administrator can be notified through a mobile phone short message or a mail; or, an image or sound warning or alarm is issued.
Because the monitoring items corresponding to different monitoring information are different, different early warning condition thresholds can be set for different monitoring information. And when the preset early warning condition threshold is reached, early warning or alarm processing is carried out.
For example, if the early warning condition threshold for a certain message queue is set to 90,000 messages, an early warning is needed when the number of messages in that queue is greater than or equal to 90,000; if the memory early warning condition threshold for a certain MQ node is set to 85% memory occupation, an early warning is needed when memory occupation reaches 85%.
It should be noted that the early warning condition thresholds set for different message queues may differ. For example, for a message queue that is sensitive to real-time performance, the threshold may be set to 10 messages; for a message queue with a large data volume, the threshold may be set to 2000 messages.
Preferably, the early warning sub-module is further configured to set an early warning level of the monitoring information, where different early warning levels correspond to different early warning condition thresholds set for the monitoring information. For example, different early warning condition thresholds can be set for the CPU utilization: thresholds of 70%, 80% and 90% correspond to the first-level, second-level and third-level early warning, respectively.
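The mapping from thresholds to early warning levels can be sketched as a lookup over thresholds sorted from highest to lowest. The function name `warning_level` is hypothetical, and the sketch assumes that a value equal to a threshold counts as having reached it.

```python
# (threshold in percent, early warning level), highest threshold first
CPU_LEVELS = [(90, 3), (80, 2), (70, 1)]


def warning_level(value, levels=CPU_LEVELS):
    """Return the highest early warning level whose threshold is reached.

    Returns 0 when the value is below every threshold (no early warning).
    """
    for threshold, level in levels:
        if value >= threshold:
            return level
    return 0
```

With the example thresholds, a CPU utilization of 85% reaches the 80% threshold but not the 90% one, so `warning_level(85)` gives the second level.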
Preferably, in order to enable a system administrator (including the system administrator of the monitoring management system and each application system administrator) to know the early warning or alarm information in real time, the monitoring management platform may bind a mobile phone number and/or a mailbox address of the system administrator. Fig. 4 is a schematic diagram illustrating that the monitoring management platform sends the warning information to the mailbox of the system administrator.
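Composing a warning mail for a bound administrator might look like the following sketch, which uses Python's standard `email.message.EmailMessage`. The addresses, the `ADMIN_MAILBOXES` table and `build_warning_mail` are hypothetical, and the message is only built, not sent.

```python
from email.message import EmailMessage

# Hypothetical binding of administrators to mailbox addresses.
ADMIN_MAILBOXES = {"mq-admin": "mq-admin@example.com"}


def build_warning_mail(admin, subject, body, sender="monitor@example.com"):
    """Compose (but do not send) a warning e-mail for a bound administrator."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ADMIN_MAILBOXES[admin]
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


mail = build_warning_mail(
    "mq-admin",
    "[level 2] CPU early warning",
    "CPU utilization on mq-node-1 reached 80%.",
)
```

A message built this way could then be handed to `smtplib.SMTP.send_message`; the mail server settings would depend on the deployment.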
Preferably, suggestion information is carried when early warning or alarm processing is performed.
The importance of performing early warning or alarm processing when a set early warning condition threshold is reached is described below with reference to a scenario.
For example, in a data distribution platform in the insurance industry, an MQ cluster is used to receive and forward messages. Assume that a message queue is set up for underwriting policies and that its early warning condition threshold is set to 100,000 messages. If policies pending claim settlement are sent in one concentrated batch every day, the message queue for underwriting policies is likely to be put under heavy load. An early warning is issued when the number of messages in the queue is greater than or equal to 100,000. If early warnings continue for a period of time (for example, 5 days), a capacity expansion suggestion can be carried in the early warning information, so that the system administrator can expand capacity in time, which ensures normal operation of the system and improves its reliability.
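Attaching a capacity expansion suggestion after repeated early warnings can be sketched as follows. The sketch assumes one peak queue depth per day and a default streak of 5 days taken from the example above; `build_warning` and the message wording are hypothetical.

```python
def build_warning(daily_peaks, threshold, streak_days=5):
    """Return the early warning text for the latest day, or None.

    `daily_peaks` holds one peak queue depth per day, oldest first.
    """
    if not daily_peaks or daily_peaks[-1] < threshold:
        return None
    # Count how many of the most recent days in a row reached the threshold.
    streak = 0
    for peak in reversed(daily_peaks):
        if peak < threshold:
            break
        streak += 1
    message = f"queue depth {daily_peaks[-1]} reached threshold {threshold}"
    if streak >= streak_days:
        message += f"; warned {streak} days in a row, consider expanding capacity"
    return message
```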
The statistics and display submodule comprises:
the MQ cluster system operation condition statistics and display submodule is used for carrying out statistics and display on the MQ cluster system operation condition; or,
the hardware condition counting and displaying submodule is used for counting and displaying the health condition of the hardware; or,
and the queue manager counting and displaying submodule is used for counting and displaying the queue manager.
The queue manager statistics and display submodule is specifically configured to:
counting and displaying the data quantity of the queue manager; or,
counting and displaying the queue messages in the queue manager; or,
counting the number of data items of the queue manager that were successfully put in, failed to be put in, successfully taken out and failed to be taken out; or,
counting and displaying, for each queue in the queue manager, the data on successful put-ins, failed put-ins, successful take-outs and failed take-outs.
As shown in fig. 5, the data volume of the queue manager is counted and displayed; for example, the newly held data volume of the queue manager is 420573.
The statistics and display of the queue manager comprises at least one of the following modes:
monthly statistics, daily statistics, hourly statistics, minute statistics, historical statistics, custom statistics. As shown in fig. 6, the queue messages in the queue manager are counted and shown, for example, the queue messages of the queues QAREINS001, and QAREINS001 managed by the queue manager QMGWC may be counted and shown by day or by minute.
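Statistics at different time granularities can be sketched by truncating each message timestamp to the chosen bucket and counting per queue. The event layout, the `count_by` name and the sample timestamps are made up for illustration; only the queue name is taken from the example above.

```python
from collections import Counter
from datetime import datetime

# strftime patterns that truncate a timestamp to the chosen granularity
FORMATS = {
    "month": "%Y-%m",
    "day": "%Y-%m-%d",
    "hour": "%Y-%m-%d %H",
    "minute": "%Y-%m-%d %H:%M",
}


def count_by(events, granularity):
    """Count messages per (time bucket, queue name).

    `events` is an iterable of (datetime, queue_name) pairs.
    """
    fmt = FORMATS[granularity]
    return Counter((ts.strftime(fmt), queue) for ts, queue in events)


events = [
    (datetime(2018, 5, 24, 9, 15), "QAREINS001"),
    (datetime(2018, 5, 24, 9, 15), "QAREINS001"),
    (datetime(2018, 5, 24, 10, 0), "QAREINS001"),
]
per_day = count_by(events, "day")
per_minute = count_by(events, "minute")
```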
Fig. 7 is a schematic diagram of counting the number of data items that the queue manager put in successfully, failed to put in, took out successfully, and failed to take out. As shown in fig. 7, the amount of data that the head office sent to each branch company on a given day and the amount of data received from the branch companies can be counted and displayed. If the count recorded by the data sender is inconsistent with the count recorded by the data receiver, it can be concluded that the services are inconsistent.
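The sender/receiver comparison described above can be sketched as follows. This is only an illustration: the function name, the branch names, and the dictionary layout are assumptions, not part of the described system.

```python
def find_inconsistent_branches(sent_by_head_office, received_by_branch):
    """Return {branch: (sent, received)} for branches whose counts differ.

    Both arguments are hypothetical per-branch daily counters; a branch
    missing from one side is treated as having a count of 0.
    """
    mismatches = {}
    for branch in sorted(set(sent_by_head_office) | set(received_by_branch)):
        sent = sent_by_head_office.get(branch, 0)
        received = received_by_branch.get(branch, 0)
        if sent != received:
            mismatches[branch] = (sent, received)
    return mismatches


# Hypothetical daily counts for two branch companies.
sent = {"branch_a": 1200, "branch_b": 800}
received = {"branch_a": 1200, "branch_b": 790}
print(find_inconsistent_branches(sent, received))
```

A non-empty result would be the signal that sender-side and receiver-side statistics disagree for that day.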
FIG. 8 shows the put-success, put-failure, take-out-success, and take-out-failure data for each queue in the queue manager. From the statistics for a given queue in FIG. 8, it can be determined whether data is being put in and taken out at comparable speeds, and whether data is accumulating in the MQ cluster.
The MQ cluster system operation condition statistics and display submodule is specifically used for displaying a topology graph of the docking application service system. Fig. 9 is a schematic diagram of the topology graph of the docking application service system.
It should be noted that the monitoring information reported by the message queue gateway to the monitoring management platform is not in the data format required by the monitoring management platform. The monitoring management platform may therefore first parse and process the monitoring information, and then store the parsed and processed monitoring information in the database.
The database 104 is configured to store the monitoring information provided by the monitoring management platform.
The database refers to a repository that organizes, stores, and manages data according to a data structure. The database may be a relational database, such as Oracle or SQL Server, from which information can be queried using database query statements. The database and the monitoring management platform can be deployed on the same physical server; to keep the stored data safe, they can also be deployed on different physical servers.
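As a rough illustration of storing monitoring information in a relational database and querying it with a query statement, the sketch below uses Python's built-in `sqlite3` module as a stand-in for Oracle or SQL Server. The table name, column names, and sample rows are assumptions made for the example.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in relational store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE monitoring_info ("
    " host TEXT, metric TEXT, value REAL, collected_at TEXT)"
)

# Hypothetical monitoring records reported by agents.
rows = [
    ("mq-node-1", "cpu_percent", 42.0, "2018-05-24T10:00:00"),
    ("mq-node-1", "cpu_percent", 97.5, "2018-05-24T10:01:00"),
    ("mq-node-2", "cpu_percent", 55.0, "2018-05-24T10:01:00"),
]
conn.executemany("INSERT INTO monitoring_info VALUES (?, ?, ?, ?)", rows)


def latest_value(conn, host, metric):
    """Return the most recent value of a metric for a host, or None."""
    row = conn.execute(
        "SELECT value FROM monitoring_info"
        " WHERE host = ? AND metric = ?"
        " ORDER BY collected_at DESC LIMIT 1",
        (host, metric),
    ).fetchone()
    return None if row is None else row[0]


print(latest_value(conn, "mq-node-1", "cpu_percent"))
```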
This completes the detailed description of an implementation of the monitoring management system according to the first embodiment of the present application. In the first embodiment, monitoring information of the message queue gateway and the message queue nodes is collected and reported to the monitoring management platform, and the monitoring management platform analyzes the monitoring information to obtain a monitoring result, so that problems in the message queue cluster can be found in time and an early warning or alarm can be issued. The monitoring management platform can also count and display the monitoring information. Through mechanisms such as monitoring information collection, statistics and display, and early warning, the running state and health state of the current message queue cluster are collected and summarized in real time, which strengthens operation and maintenance support for the system.
A second embodiment of the present application provides a monitoring management method, which is applied to the monitoring management system of the first embodiment of the present application. The following description will be made in detail with reference to fig. 2, 3, 4, 5, 6, 7, 8, 9 and 10.
As shown in fig. 10, in step S1001, the message queue gateway reports its own monitoring information to the monitoring management platform through the gateway monitoring data collection agent running on the message queue gateway.
The message queue gateway refers to a message queue gateway server in a message queue cluster (e.g., an MQ cluster). It is the gateway of the whole message queue cluster and mainly handles application connection requests: message data from an application system is distributed through the message queue gateway server to the message queue nodes (e.g., MQ nodes) by a load balancing mechanism. As shown in fig. 1, MQ gateway server 1 and MQ gateway server 2 are message queue gateways.
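The text does not specify which load balancing mechanism the gateway uses. As one possible illustration, the sketch below assumes a simple least-loaded policy in which "load" is a per-node count of pending messages; the function names and node names are hypothetical.

```python
def pick_node(node_loads):
    """Return the name of the node with the lowest load.

    Ties are broken by node name so the choice is deterministic.
    """
    if not node_loads:
        raise ValueError("no message queue nodes available")
    return min(sorted(node_loads), key=node_loads.get)


def dispatch(messages, node_loads):
    """Assign each message to the currently least-loaded node."""
    loads = dict(node_loads)  # work on a copy; do not mutate the caller's dict
    assignment = []
    for message in messages:
        node = pick_node(loads)
        assignment.append((message, node))
        loads[node] += 1  # assume each message adds one unit of load
    return assignment


# Hypothetical loads for three MQ nodes.
print(dispatch(["m1", "m2", "m3"], {"node1": 2, "node2": 0, "node3": 1}))
```

Other policies (round robin, weighted distribution) would fit the same description equally well.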
The monitoring information includes hardware information, such as CPU utilization, disk usage, file size, processes, and network status. It also includes the following information: queue manager state; message channel state; message queue information; error queue information; deadlock queue information; queue statistics; and so on. The queue statistics include queue data flow information and/or data traffic information.
The monitoring management platform is a software system, and is used for acquiring monitoring information reported by the message queue gateway and the message queue node to the monitoring management platform, analyzing the monitoring information, acquiring a monitoring result of the monitoring information, and storing the monitoring information in the database.
As shown in fig. 1, the gateway monitoring data collection agent (Agent) running on MQ gateway server 1 and MQ gateway server 2 can collect monitoring information and report the collected monitoring information of the message queue gateway to the monitoring management platform. Preferably, the gateway monitoring data collection agent collects monitoring information at regular intervals and reports it to the monitoring management platform, so that the platform can handle abnormal conditions promptly and monitor the system in real time. The gateway monitoring data collection agent is a software program running on the message queue gateway; it can collect various kinds of monitoring information from the gateway and send them to the monitoring management platform. As shown in fig. 2, the agent collects monitoring information by calling APIs and sends the collected monitoring information to the monitoring management platform server. System hardware information (memory, CPU, disk, and so on) is mainly obtained by querying operating system APIs, and information about the MQ cluster is obtained through the SDK API provided by IBM MQ. The collected MQ data mainly includes: MQ cluster, MQ queue manager, queue depth, message channel, IP, port, number of messages put in and taken out, time, and other information.
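A minimal sketch of such a timed collection loop is shown below, under several assumptions: only disk usage is collected (through the standard library), the report is a JSON string passed to a caller-supplied `report` callback, and the MQ-specific data that the text says comes from the IBM MQ SDK is left out entirely. All function names are hypothetical.

```python
import json
import shutil
import socket
import time


def collect_hardware_info(path="."):
    """Collect a small sample of hardware information from the OS."""
    usage = shutil.disk_usage(path)
    return {
        "host": socket.gethostname(),
        "collected_at": time.time(),
        "disk_used_percent": round(100.0 * usage.used / usage.total, 2),
    }


def run_agent(report, interval_seconds, rounds, collect=collect_hardware_info):
    """Collect and report monitoring information a fixed number of times.

    `report` stands in for whatever transport sends data to the
    monitoring management platform (assumed, not specified in the text).
    """
    for i in range(rounds):
        report(json.dumps(collect()))
        if i < rounds - 1:
            time.sleep(interval_seconds)


# Example: report twice, with no delay, into a local list.
outbox = []
run_agent(outbox.append, interval_seconds=0, rounds=2)
print(len(outbox))
```

A real agent would run indefinitely and would also need to handle transport failures; both are outside this sketch.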
As shown in fig. 10, in step S1002, the message queue node reports its own monitoring information to the monitoring management platform through the node monitoring data collection agent running on the message queue node.
As shown in fig. 1, MQ cluster node 1, MQ cluster node 2, MQ cluster node 3, and MQ cluster node 4 are message queue nodes.
The node monitoring data collection agent (Agent) running on MQ cluster node 1, MQ cluster node 2, MQ cluster node 3, and MQ cluster node 4 can collect monitoring information of the message queue nodes and report it to the monitoring management platform. Preferably, the node monitoring data collection agent collects the monitoring information of the message queue node at regular intervals and reports it to the monitoring management platform, so that the platform can handle abnormal conditions promptly and monitor the message queue nodes in real time. The node monitoring data collection agent is a software program running on the message queue node; it can collect various kinds of monitoring information from the node and send them to the monitoring management platform.
As shown in fig. 10, in step S1003, the monitoring management platform analyzes the monitoring information, obtains a monitoring result of the monitoring information, and stores the monitoring information in a database.
After the monitoring management platform analyzes the monitoring information and obtains the monitoring result of the monitoring information, the monitoring management platform can also determine whether to alarm or give an early warning according to the monitoring result of the monitoring information.
Determining whether to alarm or pre-warn according to the monitoring result of the monitoring information, comprising:
judging whether the monitoring result of the monitoring information reaches an alarm condition set for the monitoring information, and if so, carrying out alarm processing; or
judging whether the monitoring result of the monitoring information reaches an early warning condition threshold value set for the monitoring information, and if so, carrying out early warning processing.
When the monitoring result of the monitoring information meets the alarm condition set for the monitoring information, an alarm is raised. For example, an alarm may be raised when the monitoring result satisfies any of the following conditions: memory usage exceeds 95%; CPU utilization exceeds 95%; hard disk utilization exceeds 95%; data exists in the deadlock queue; or a "message channel unavailable" message appears. The monitoring management platform can notify the system administrator by SMS or email, and can also raise a visual or audible alarm through the monitoring management platform interface.
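The example alarm conditions above can be written as a single check. The sketch below assumes the monitoring result arrives as a dictionary; the key names are hypothetical, while the 95% limits and the two queue/channel conditions come from the text.

```python
def alarm_reasons(info):
    """Return the list of alarm conditions met by one monitoring result."""
    reasons = []
    for key, label in (
        ("memory_percent", "memory usage"),
        ("cpu_percent", "CPU utilization"),
        ("disk_percent", "hard disk utilization"),
    ):
        # "Exceeds 95%" is read as strictly greater than 95.
        if info.get(key, 0) > 95:
            reasons.append(f"{label} exceeds 95%")
    if info.get("deadlock_queue_depth", 0) > 0:
        reasons.append("data exists in the deadlock queue")
    if info.get("channel_available", True) is False:
        reasons.append("message channel unavailable")
    return reasons


print(alarm_reasons({"cpu_percent": 97, "deadlock_queue_depth": 3}))
```

An empty list means no alarm; a non-empty list could be used directly as the body of the SMS or email notification.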
When the monitoring result of the monitoring information reaches the early warning condition threshold set for the monitoring information, the system administrator can be notified by SMS or email; alternatively, a visual or audible early warning or alarm can be issued.
Because the monitoring items corresponding to different monitoring information are different, different early warning condition thresholds can be set for different monitoring information. And when the preset early warning condition threshold is reached, early warning or alarm processing is carried out.
For example, if the early warning condition threshold for a certain message queue is set to 90,000 messages, an early warning is needed when the number of messages in that queue is greater than or equal to 90,000; if the memory early warning condition threshold for a certain MQ node is set to 85% memory usage, an early warning is needed when memory usage reaches 85%.
It should be noted that different message queues can have different early warning condition thresholds. For example, for a message queue that is sensitive to real-time performance, the threshold may be set to 10 messages; for a message queue with a large data volume, the threshold may be set to 2000 messages.
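Per-queue thresholds of this kind can be kept in a simple lookup table. In the sketch below the queue names and the fallback threshold are hypothetical; the values 10 and 2000 are taken from the example above.

```python
# Hypothetical queue names mapped to their early warning thresholds.
QUEUE_WARNING_THRESHOLDS = {
    "realtime_queue": 10,  # queue sensitive to real-time performance
    "bulk_queue": 2000,    # queue with a large data volume
}

# Assumed fallback for queues without an explicit setting.
DEFAULT_THRESHOLD = 90000


def needs_early_warning(queue_name, depth, thresholds=QUEUE_WARNING_THRESHOLDS):
    """Return True when the queue depth has reached its threshold."""
    return depth >= thresholds.get(queue_name, DEFAULT_THRESHOLD)


print(needs_early_warning("realtime_queue", 12))
print(needs_early_warning("bulk_queue", 12))
```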
Preferably, early warning levels can be set for the monitoring information, with different early warning levels corresponding to different early warning condition thresholds. For example, the early warning condition thresholds for CPU utilization can be set to 70%, 80%, and 90%, corresponding to first-level, second-level, and third-level early warnings respectively.
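The mapping from CPU utilization to an early warning level can be sketched as below. The thresholds 70%, 80%, and 90% come from the example; treating a value equal to the threshold as having reached it, and using level 0 for "no warning", are assumptions.

```python
# (threshold in percent, warning level), checked from highest to lowest.
CPU_WARNING_LEVELS = [(90, 3), (80, 2), (70, 1)]


def cpu_warning_level(cpu_percent, levels=CPU_WARNING_LEVELS):
    """Return the early warning level for a CPU utilization value.

    Returns 0 when no early warning threshold has been reached.
    """
    for threshold, level in levels:
        if cpu_percent >= threshold:
            return level
    return 0


for value in (65, 70, 85, 95):
    print(value, cpu_warning_level(value))
```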
Preferably, in order to enable a system administrator (including the system administrator of the monitoring management system and each application system administrator) to know the early warning or alarm information in real time, the monitoring management platform may bind a mobile phone number and/or a mailbox address of the system administrator. Fig. 4 is a schematic diagram illustrating that the monitoring management platform sends the warning information to the mailbox of the system administrator.
Preferably, the advice information is carried during the early warning or alarm processing.
The importance of performing early warning or alarm processing when a set early warning condition threshold is reached is described below with reference to a scene.
For example, a data distribution platform in the insurance industry uses an MQ cluster to receive and forward messages. Assume that a message queue is set up for underwriting policies and that its early warning condition threshold is set to 100,000 messages. If the policies pending claim settlement are sent in a single concentrated batch every day, that batch is likely to put pressure on the underwriting policy message queue. An early warning is issued when the number of messages in the queue is greater than or equal to 100,000, and if early warnings continue for a period of time (for example, 5 days), a capacity expansion suggestion can be carried in the early warning information. The system administrator can then expand capacity in time, which keeps the system running normally and improves its reliability.
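The "continuous early warning leads to a capacity expansion suggestion" logic in this scenario might look like the sketch below. It assumes the platform keeps one peak queue depth per day; the function name, the shape of the returned dictionary, and the sample numbers are hypothetical, and the threshold is passed in rather than fixed.

```python
def build_early_warning(daily_peak_depths, threshold, sustained_days=5):
    """Build early warning info for the most recent day, or return None.

    daily_peak_depths: peak queue depth per day, oldest first.
    A suggestion is attached once the threshold has been reached on
    `sustained_days` consecutive days, counting back from the latest day.
    """
    if not daily_peak_depths or daily_peak_depths[-1] < threshold:
        return None
    streak = 0
    for depth in reversed(daily_peak_depths):
        if depth < threshold:
            break
        streak += 1
    warning = {
        "message": f"queue depth reached {threshold}",
        "streak_days": streak,
    }
    if streak >= sustained_days:
        warning["suggestion"] = "consider expanding capacity"
    return warning


# Hypothetical peaks with an arbitrary threshold of 100 for illustration.
print(build_early_warning([50, 120, 130], threshold=100))
print(build_early_warning([100, 110, 120, 130, 140], threshold=100))
```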
The monitoring management platform can perform multi-dimensional statistics and display on monitoring information in the database besides performing early warning and alarming.
The multidimensional statistics and display of the monitoring information in the database comprises the following steps:
counting and displaying the operation condition of the MQ cluster system; or,
counting and displaying the health condition of the hardware; or,
and counting and displaying the queue manager.
The counting and displaying of the queue manager comprises the following steps:
counting and displaying the data quantity of the queue manager; or,
counting and displaying the queue information in the queue manager; or,
counting the number of data which are successfully put in, failed to put in, successfully taken out and failed to take out of the queue manager; or,
and counting and displaying the data conditions of successful putting, failed putting, successful taking and failed taking of the queue in the queue manager.
As shown in FIG. 5, the number of queue manager data volumes is counted and shown, for example, the number of newly held queue manager data volumes is 420573.
The statistics and display of the queue manager comprises at least one of the following modes:
monthly statistics, daily statistics, hourly statistics, minute statistics, historical statistics, custom statistics.
As shown in fig. 6, the queue messages in the queue manager are counted and shown; for example, the queue messages of the queues QAREINS001, and QAREINS001 managed by the queue manager QMGWC may be counted and shown by day or by minute.
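Counting messages by month, day, hour, or minute amounts to grouping message timestamps into buckets of the chosen size. The sketch below assumes each message carries a `datetime` timestamp; the function name and the sample timestamps are hypothetical, and the historical and custom modes are not covered.

```python
from collections import Counter
from datetime import datetime

# One strftime pattern per statistics mode.
BUCKET_FORMATS = {
    "month": "%Y-%m",
    "day": "%Y-%m-%d",
    "hour": "%Y-%m-%d %H",
    "minute": "%Y-%m-%d %H:%M",
}


def count_messages(timestamps, mode):
    """Count messages per time bucket for the given statistics mode."""
    fmt = BUCKET_FORMATS[mode]
    return dict(Counter(ts.strftime(fmt) for ts in timestamps))


# Hypothetical message timestamps for one queue.
stamps = [
    datetime(2018, 5, 24, 10, 0, 5),
    datetime(2018, 5, 24, 10, 0, 40),
    datetime(2018, 5, 24, 11, 30, 0),
]
print(count_messages(stamps, "day"))
print(count_messages(stamps, "minute"))
```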
Fig. 7 is a schematic diagram of counting the number of data items that the queue manager put in successfully, failed to put in, took out successfully, and failed to take out. As shown in fig. 7, the amount of data that the head office sent to each branch company on a given day and the amount of data received from the branch companies can be counted and displayed. If the amount of data put in is inconsistent with the amount of data taken out, it can be concluded that the data of the docking service systems is inconsistent.
FIG. 8 shows the put-success, put-failure, take-out-success, and take-out-failure data for each queue in the queue manager. From the statistics for a given queue in FIG. 8, it can be determined whether data is being put in and taken out at comparable speeds, and whether data is accumulating in the system.
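One way to turn the per-queue put and take-out counts into an accumulation check is sketched below. The rule used here (puts exceed take-outs in every sampled interval) is an assumption chosen for illustration, as are the function name and the sample data.

```python
def is_accumulating(samples, tolerance=0):
    """Return True if the queue backlog grew in every sampled interval.

    samples: list of (put_success, take_out_success) counts per interval.
    tolerance: growth per interval that is still considered acceptable.
    """
    if not samples:
        return False
    return all(put - taken > tolerance for put, taken in samples)


# Hypothetical counts for two consecutive intervals of one queue.
print(is_accumulating([(100, 80), (120, 90)]))
print(is_accumulating([(100, 100), (120, 90)]))
```

A stricter or looser rule (for example, comparing totals over a day) could be substituted without changing the idea.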
The counting and displaying of the operation condition of the MQ cluster system includes displaying a topology graph of the docking application service system. Fig. 9 is a schematic diagram of the topology graph of the docking application service system.
It should be noted that the monitoring information reported by the message queue gateway to the monitoring management platform is not in the data format required by the monitoring management platform. The monitoring management platform may therefore first parse and process the monitoring information, and then store the parsed and processed monitoring information in the database.
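The text does not say what format the reported monitoring information uses. Purely as an illustration of the parsing step, the sketch below assumes a hypothetical `key=value;` report string and converts it into a typed record that could then be stored in the database.

```python
def parse_report(raw):
    """Parse a hypothetical 'key=value;key=value' report into a record.

    The field names (host, queue, depth) are assumptions for this sketch.
    """
    fields = {}
    for part in raw.strip().split(";"):
        if not part.strip():
            continue  # skip empty segments such as a trailing ';'
        key, _, value = part.partition("=")
        fields[key.strip()] = value.strip()
    return {
        "host": fields["host"],
        "queue": fields["queue"],
        "depth": int(fields["depth"]),  # convert the depth to a number
    }


print(parse_report("host=mq-gw-1; queue=QAREINS001; depth=42;"))
```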
This completes the detailed description of an implementation of the monitoring management method according to the second embodiment of the present application. In the second embodiment, monitoring information of the message queue gateway and the message queue nodes is collected and reported to the monitoring management platform, and the monitoring management platform analyzes the monitoring information to obtain a monitoring result, so that problems in the message queue cluster can be found in time and an early warning or alarm can be issued. The monitoring management platform can also count and display the monitoring information. Through mechanisms such as monitoring information collection, statistics and display, and early warning, the running state and health state of the current message queue cluster are collected and summarized in real time, which strengthens operation and maintenance support for the system.
Although the present invention has been described with reference to the preferred embodiments, it is not intended to be limited thereto, and variations and modifications may be made by those skilled in the art without departing from the spirit and scope of the present invention.
Claims (23)
1. A monitoring management system, comprising: the system comprises a message queue cluster, a monitoring management platform and a database;
the message queue cluster comprises at least one message queue gateway and a plurality of message queue nodes;
the message queue gateway is used for distributing messages of a docking application system to the message queue nodes according to the load of the message queue nodes, the message queue gateway runs a gateway monitoring data collection agent program, and the gateway monitoring data collection agent program is used for collecting monitoring information of the message queue gateway and reporting the collected monitoring information of the message queue gateway to the monitoring management platform;
the message queue node is used for receiving messages of the docking application system provided by the message queue gateway, processing the messages of the docking application system, or storing the messages of the docking application system in a message queue mode, the message queue node runs a node monitoring data collection agent program, and the node monitoring data collection agent program is used for collecting monitoring information of the message queue node and reporting the collected monitoring information of the message queue node to the monitoring management platform;
the monitoring management platform is used for acquiring monitoring information reported by a gateway monitoring data acquisition agent program and a node monitoring data acquisition agent program, analyzing the monitoring information, acquiring a monitoring result and storing the monitoring information into the database;
the database is used for storing the monitoring information provided by the monitoring management platform; the monitoring information comprises at least one of the following information: queue manager state; message channel state; message queue information; error queue information; deadlock queue information; queue statistics.
2. The monitoring management system according to claim 1, wherein the gateway monitoring data collection agent is specifically configured to collect monitoring information of the message queue gateway at regular time, and report the collected monitoring information of the message queue gateway to the monitoring management platform.
3. The monitoring management system according to claim 1, wherein the node monitoring data collection agent is specifically configured to collect monitoring information of the message queue node at regular time, and report the collected monitoring information of the message queue node to the monitoring management platform.
4. The monitoring management system according to claim 1, wherein the monitoring management platform comprises:
the early warning submodule is used for determining whether early warning is needed or not according to the monitoring result of the monitoring information, and if so, performing early warning processing;
and the statistics and display submodule is used for carrying out multi-dimensional statistics and display on the monitoring information in the database.
5. The monitoring management system of claim 4, wherein the early warning sub-module is specifically configured to:
notify a system administrator by SMS or email when the monitoring result of the monitoring information reaches an early warning condition threshold value set for the monitoring information; or issue a visual or audible early warning or alarm.
6. The monitoring management system of claim 4, wherein the early warning sub-module is further configured to set early warning levels of monitoring information, and different early warning levels correspond to different early warning condition thresholds set for the monitoring information.
7. The monitoring management system of claim 4, wherein the statistics and presentation submodule comprises:
the MQ cluster system operation condition statistics and display submodule is used for carrying out statistics and display on the MQ cluster system operation condition; or,
the hardware condition counting and displaying submodule is used for counting and displaying the health condition of the hardware; or,
and the queue manager counting and displaying submodule is used for counting and displaying the queue manager.
8. The monitoring management system according to claim 7, wherein the MQ cluster system operation condition statistics and presentation submodule is specifically configured to present a topology map of the docking application service system.
9. The monitoring management system of claim 7, wherein the queue manager statistics and presentation sub-module is specifically configured to:
counting and displaying the data quantity of the queue manager; or,
counting and displaying the queue information in the queue manager; or,
counting the number of data which are successfully put in, failed to put in, successfully taken out and failed to take out of the queue manager; or,
and counting and displaying the data conditions of successful putting, failed putting, successful taking and failed taking of the queue in the queue manager.
10. The monitoring management system of claim 7, wherein the statistics and presentation of the queue manager includes at least one of the following modes:
monthly statistics, daily statistics, hourly statistics, minute statistics, historical statistics, custom statistics.
11. The monitoring management system of claim 1, wherein the queue statistics comprise: queue data flow information and/or data traffic information.
12. A monitoring management method applied to the monitoring management system of claim 1, the method comprising:
the message queue gateway reports own monitoring information to a monitoring management platform through a gateway monitoring data acquisition agent program running on the message queue gateway;
the message queue node reports own monitoring information to a monitoring management platform through a node monitoring data acquisition agent program running on the message queue node;
the monitoring management platform analyzes the monitoring information, obtains a monitoring result of the monitoring information, and stores the monitoring information into a database; the monitoring information comprises at least one of the following information: queue manager state; message channel state; message queue information; error queue information; deadlock queue information; queue statistics.
13. The method of claim 12, wherein the reporting, by the message queue gateway, of the monitoring information of the message queue gateway to the monitoring management platform through a gateway monitoring data collection agent running on the message queue gateway includes:
and the message queue gateway reports the monitoring information of the message queue gateway to a monitoring management platform at regular time through a gateway monitoring data acquisition agent program running on the message queue gateway.
14. The method of claim 12, wherein the reporting of the monitoring information of the message queue node to the monitoring management platform by the node monitoring data collection agent running on the message queue node comprises:
and the message queue node reports the monitoring information of the message queue node to a monitoring management platform at regular time through a node monitoring data acquisition agent program running on the message queue node.
15. The method of claim 12, further comprising:
and determining whether to alarm or pre-warn according to the monitoring result of the monitoring information.
16. The method of claim 15, wherein determining whether an alarm or pre-warning is required based on the monitoring of the monitoring information comprises:
judging whether the monitoring result of the monitoring information reaches an alarm condition set for the monitoring information, and if so, carrying out alarm processing; or
And judging whether the monitoring result of the monitoring information reaches an early warning condition threshold value set for the monitoring information, and if so, carrying out early warning processing.
17. The method of claim 16, further comprising:
and setting early warning levels of the monitoring information, wherein different early warning levels correspond to different early warning condition thresholds set for the monitoring information.
18. The method of claim 12, further comprising:
and carrying out multidimensional statistics and display on the monitoring information in the database.
19. The method of claim 18, wherein the performing multidimensional statistics and presentation on the monitoring information in the database comprises:
counting and displaying the operation condition of the MQ cluster system; or,
counting and displaying the health condition of the hardware; or,
and counting and displaying the queue manager.
20. The method as claimed in claim 19, wherein the counting and exposing the MQ cluster system operation condition comprises:
and displaying a topology graph of the docking application service system.
21. The method of claim 19, wherein said counting and exposing the queue manager comprises:
counting and displaying the data quantity of the queue manager; or,
counting and displaying the queue information in the queue manager;
counting the number of data which are successfully put in, failed to put in, successfully taken out and failed to take out of the queue manager;
and counting and displaying the data conditions of successful putting, failed putting, successful taking and failed taking of the queue in the queue manager.
22. The method of claim 21, wherein the statistics and presentation of the queue manager comprises at least one of:
monthly statistics, daily statistics, hourly statistics, minute statistics, historical statistics, custom statistics.
23. The method of claim 12, wherein the queue statistics comprise: queue data flow information and/or data traffic information.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810509664.9A CN110535713B (en) | 2018-05-24 | 2018-05-24 | Monitoring management system and monitoring management method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110535713A CN110535713A (en) | 2019-12-03 |
CN110535713B true CN110535713B (en) | 2021-08-03 |
Family
ID=68657435
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810509664.9A Active CN110535713B (en) | 2018-05-24 | 2018-05-24 | Monitoring management system and monitoring management method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110535713B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101965005A (en) * | 2009-07-21 | 2011-02-02 | 中兴通讯股份有限公司 | Distributed access gateway system |
CN102801585A (en) * | 2012-08-24 | 2012-11-28 | 上海和辰信息技术有限公司 | Information monitoring system and method based on cloud computing network environment |
CN107766207A (en) * | 2017-10-20 | 2018-03-06 | 中国人民财产保险股份有限公司 | Distributed automatic monitoring method, system, computer-readable recording medium and terminal device |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9240937B2 (en) * | 2011-03-31 | 2016-01-19 | Microsoft Technology Licensing, Llc | Fault detection and recovery as a service |
Also Published As
Publication number | Publication date |
---|---|
CN110535713A (en) | 2019-12-03 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||