CN105320585B - A kind of method and device for realizing application failure diagnosis - Google Patents
- Publication number
- CN105320585B (application CN201410324069.XA)
- Authority
- CN
- China
- Prior art keywords
- data
- application
- time
- service
- business
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Debugging And Monitoring (AREA)
Abstract
The invention discloses a method and a device for implementing application fault diagnosis, comprising: collecting multidimensional application data; when a business application becomes abnormal, obtaining, from the collected multidimensional application data, the associated diagnostic data involved in the business anomaly, based on the temporal and spatial correlations of the business anomaly and according to the business anomaly type; and comparing the obtained associated diagnostic data with the historical diagnostic data of each item of associated diagnostic data, respectively, to determine the application fault type. By diagnosing business application anomalies from multidimensional application data, the invention avoids the one-sided results caused by diagnosing faults from a single kind of data, determines business faults more comprehensively, and resolves business anomalies.
Description
Technical Field
The present invention relates to the field of computer applications, and in particular, to a method and an apparatus for implementing application fault diagnosis.
Background
With the continuing development of IT applications, the business processes of enterprises are increasingly tied to internet technology, and application information systems composed of servers, databases, middleware, and the like are becoming more and more complex. Even as the skill demanded of technicians keeps rising, troubleshooting continues to grow more difficult. The operating quality of a business application (its ability to complete business, its speed, and its stability) directly determines the level of service an enterprise can provide to its users. Monitoring and managing the performance of key business applications, and analyzing and diagnosing performance problems promptly and effectively, is therefore an urgent requirement for improving the availability of users' business applications.
Currently, monitoring and managing the performance of business applications mainly covers the following aspects: 1. monitoring access to the application; 2. when the performance of the business application is abnormal, judging whether the performance of the network system is abnormal; 3. when access to the business application is abnormal, judging whether the network or the application is under attack. Diagnosing business application faults effectively helps technicians restore the business application promptly.
Existing fault diagnosis for business applications mainly analyzes faults from a single kind of data, such as flow data or monitoring data (for example, application logs). Because the data used for diagnosis is of a single kind, the diagnosis result tends to be one-sided or insufficient, so fault diagnosis has to be completed with considerable manual participation.
Disclosure of Invention
To solve the above technical problem, the invention provides a method and a device for implementing application fault diagnosis, which can diagnose business faults comprehensively from multidimensional data and reduce manual participation.
To achieve the above object, the invention discloses a method for implementing application fault diagnosis, comprising:
collecting multidimensional application data;
when a business application becomes abnormal, obtaining, from the collected multidimensional application data, the associated diagnostic data involved in the business anomaly, based on the temporal and spatial correlations of the business anomaly and according to the business anomaly type;
and comparing the obtained associated diagnostic data involved in the business anomaly with the historical diagnostic data of each item of associated diagnostic data, respectively, to determine the application fault type.
Further, the multidimensional application data includes: monitoring data extracted according to the business application server IP, flow data extracted according to the business application server IP and the destination address, and application performance data extracted according to the business application server IP and the destination address.
Further, the monitoring data includes at least: IP address, and/or monitoring time, and/or CPU utilization, and/or disk input/output (io), and/or memory-related information, and/or swap-space-related information, and/or network-interface-related information, and/or database response time, and/or amount of swap memory brought into memory from disk (si), and/or amount of swap memory moved from memory to disk (so), and/or size written from memory to disk (bo), and/or size read from disk into memory (bi), and/or service status.
Further, the flow data is a session uniquely identified by the same five-tuple, and includes at least: collection time, source/destination addresses, source/destination ports, protocol, number of SYN packets sent (SYN being the handshake signal used when TCP/IP establishes a connection), number of FIN packets sent (FIN being a code-bit field of the TCP header), TCP-related information, number of RSTs sent, and anomaly in the total flow accessing a specified service per unit time.
Further, the application performance data includes at least: source/destination addresses, and/or destination port, and/or request time, and/or server response time, and/or load time, and/or page-related information, and/or HTTP-related information, and/or Tomcat global access speed anomaly, and/or anomaly in database access volume per unit time, and/or anomaly in the WebLogic current session count;
the application performance data is collected from performance data of the HTTP protocol, and/or performance data of an Oracle database service, and/or performance data of a MySQL database server.
Further, comparing the obtained associated diagnostic data involved in the business anomaly with the historical diagnostic data of each item of associated diagnostic data to determine the application fault type specifically includes:
comparing the obtained associated diagnostic data involved in the business anomaly with the historical diagnostic data of each item of associated diagnostic data by means of a periodic baseline or a moving window baseline, and determining the application fault type according to the preset threshold range of each item of associated diagnostic data.
Further, the historical diagnostic data is: monitoring data within a first preset duration, flow data within a second preset duration, and real-time application performance data.
Further, when fault diagnosis yields no result, the method further includes: storing the multidimensional data related to the anomaly, and determining the application fault type after the historical data has been updated.
Further, the method further includes: once the application fault type is determined, providing a fault recovery suggestion from the historical diagnostic data.
In another aspect, the present application further provides a device for implementing application fault diagnosis, including: a collecting unit, an obtaining unit, and a fault diagnosis unit; wherein,
the collecting unit is configured to collect multidimensional application data;
the obtaining unit is configured to, when a business application becomes abnormal, obtain from the collected multidimensional application data the associated diagnostic data involved in the business anomaly, based on the temporal and spatial correlations of the business anomaly and according to the business anomaly type;
and the fault diagnosis unit is configured to compare the obtained associated diagnostic data involved in the business anomaly with the historical diagnostic data of each item of associated diagnostic data, respectively, to determine the application fault type.
Further, the multidimensional application data includes: monitoring data extracted according to the business application server IP, flow data extracted according to the business application server IP and the destination address, and application performance data extracted according to the business application server IP and the destination address.
Further, the monitoring data includes at least: IP address, and/or monitoring time, and/or CPU utilization, and/or disk input/output (io), and/or memory-related information, and/or swap-space-related information, and/or network-interface-related information, and/or database response time, and/or amount of swap memory brought into memory from disk (si), and/or amount of swap memory moved from memory to disk (so), and/or size written from memory to disk (bo), and/or size read from disk into memory (bi), and/or service status.
Further, the flow data is a session uniquely identified by the same five-tuple, and includes at least: collection time, source/destination addresses, source/destination ports, protocol, number of SYN packets sent (SYN being the handshake signal used when TCP/IP establishes a connection), number of FIN packets sent (FIN being a code-bit field of the TCP header), TCP-related information, number of RSTs sent, and anomaly in the total flow accessing a specified service per unit time.
Further, the application performance data includes at least: source/destination addresses, and/or destination port, and/or request time, and/or server response time, and/or load time, and/or page-related information, and/or HTTP-related information, and/or Tomcat global access speed anomaly, and/or anomaly in database access volume per unit time, and/or anomaly in the WebLogic current session count;
the application performance data is collected from performance data of the HTTP protocol, and/or performance data of an Oracle database service, and/or performance data of a MySQL database server.
Further, the fault diagnosis unit is specifically configured to compare the obtained associated diagnostic data involved in the business anomaly with the historical diagnostic data of each item of associated diagnostic data by means of a periodic baseline or a moving window baseline, and to determine the application fault type according to the preset threshold range of each item of associated diagnostic data.
Further, the historical diagnostic data is: monitoring data within a first preset duration, flow data within a second preset duration, and real-time application performance data.
Further, the device also includes a subsequent diagnosis unit, configured to store the multidimensional data related to the anomaly when fault diagnosis yields no result, and to determine the application fault type after the historical data has been updated.
Further, the device includes a recovery suggestion unit, configured to provide a fault recovery suggestion from the historical diagnostic data based on the determined application fault type.
The technical solution of the application includes: collecting multidimensional application data; when a business application becomes abnormal, obtaining, from the collected multidimensional application data, the associated diagnostic data involved in the business anomaly, based on the temporal and spatial correlations of the business anomaly and according to the business anomaly type; and comparing the obtained associated diagnostic data with the historical diagnostic data of each item of associated diagnostic data, respectively, determining the application fault type, and analyzing the fault cause. By diagnosing business application anomalies from multidimensional application data, the invention avoids the one-sided results caused by diagnosing faults from a single kind of data, determines business faults more comprehensively, and resolves business anomalies.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a flow chart of a method of implementing application fault diagnosis in accordance with the present invention;
FIG. 2 is a block diagram of an apparatus for implementing application fault diagnosis according to the present invention.
Detailed Description
FIG. 1 is a flowchart of the method for implementing application fault diagnosis according to the invention. As shown in FIG. 1, the method includes:
Step 100, collecting multidimensional application data.
In this step, the collected multidimensional application data includes: monitoring data extracted according to the business application server IP, flow data extracted according to the business application server IP and the destination address, and application performance data extracted according to the business application server IP and the destination address.
Further, the monitoring data includes at least: IP address, and/or monitoring time, and/or CPU utilization, and/or disk input/output (io), and/or memory-related information, and/or swap-space-related information, and/or network-interface-related information, and/or database response time, and/or amount of swap memory brought into memory from disk (si), and/or amount of swap memory moved from memory to disk (so), and/or size written from memory to disk (bo), and/or size read from disk into memory (bi), and/or service status.
The flow data is a session uniquely identified by the same five-tuple, and includes at least: collection time, source/destination addresses, source/destination ports, protocol, number of SYN packets sent (SYN being the handshake signal used when TCP/IP establishes a connection), number of FIN packets sent (FIN being a code-bit field of the TCP header), TCP-related information, number of RSTs sent, and/or anomaly in the total flow accessing a specified service per unit time. Here, the TCP-related information includes: number of TCP retransmissions, number of TCP checksum errors, number of abnormal TCP connection closures, and the like.
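As a rough illustration of treating flow data as sessions keyed by a five-tuple, the sketch below groups packet observations by five-tuple and counts SYN, FIN, and RST flags per session. The `FiveTuple` type, the `aggregate_sessions` function, and the packet representation (a key plus a set of flag names) are hypothetical names chosen for this example and do not come from the patent.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set, Tuple


class FiveTuple(NamedTuple):
    """Session key: source/destination address, source/destination port, protocol."""
    src_addr: str
    src_port: int
    dst_addr: str
    dst_port: int
    protocol: str


def aggregate_sessions(
    packets: Iterable[Tuple[FiveTuple, Set[str]]]
) -> Dict[FiveTuple, Dict[str, int]]:
    """Group packets by five-tuple and count packets and SYN/FIN/RST flags."""
    sessions: Dict[FiveTuple, Dict[str, int]] = defaultdict(
        lambda: {"packets": 0, "SYN": 0, "FIN": 0, "RST": 0}
    )
    for key, flags in packets:
        stats = sessions[key]
        stats["packets"] += 1
        for flag in ("SYN", "FIN", "RST"):
            if flag in flags:
                stats[flag] += 1
    return dict(sessions)


# Illustrative packets only; addresses and ports are made up.
web = FiveTuple("10.0.0.5", 51000, "10.0.0.9", 80, "TCP")
dns = FiveTuple("10.0.0.5", 53000, "10.0.0.2", 53, "UDP")
sessions = aggregate_sessions([
    (web, {"SYN"}),
    (web, set()),
    (web, {"FIN"}),
    (dns, set()),
])
```

Packets sharing the same five-tuple fall into one session record, so per-session counters such as the SYN count can later be compared against historical values.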
The application performance data includes at least: source/destination addresses, and/or destination port, and/or request time, and/or server response time, and/or load time, and/or page-related information, and/or HTTP-related information, and/or Tomcat global access speed anomaly, and/or anomaly in database access volume per unit time, and/or anomaly in the WebLogic current session count;
here, Tomcat is an existing web application server, and WebLogic is web middleware for Java applications.
The application performance data is collected from performance data of the HTTP protocol, and/or performance data of an Oracle database service, and/or performance data of a MySQL database server.
Here, the page-related information includes: page download time, page slowdown ratio, and the like.
The HTTP-related information includes: HTTP access rate, HTTP error rate, anomaly in HTTP access volume per unit time, and the like.
Step 101, when a business application becomes abnormal, obtaining, from the collected multidimensional application data, the associated diagnostic data involved in the business anomaly, based on the temporal and spatial correlations of the business anomaly and according to the business anomaly type.
The temporal and spatial correlations of a business anomaly mean the following: for time, the associated diagnostic data is obtained from the determinable time information in the multidimensional data according to the time at which the business anomaly occurred; for space, the associated diagnostic data is obtained from the information of the related protocol layers.
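The correlation step can be pictured as a filter over the collected samples. The sketch below is only an illustration under stated assumptions: it approximates the spatial correlation by the business server IP (as Example 1 does) and the temporal correlation by a window around the anomaly time. The `Sample` record, the `correlate` function, and the `before_s`/`after_s` parameters are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    dimension: str    # "monitoring", "flow", or "app_perf"
    server_ip: str
    timestamp: float  # seconds on a shared clock
    metric: str
    value: float


def correlate(samples: List[Sample], server_ip: str, anomaly_time: float,
              before_s: float, after_s: float) -> List[Sample]:
    """Keep samples for one server inside the time window around the anomaly."""
    start = anomaly_time - before_s
    end = anomaly_time + after_s
    return [
        s for s in samples
        if s.server_ip == server_ip and start <= s.timestamp <= end
    ]


# Illustrative data only.
samples = [
    Sample("monitoring", "10.0.0.9", 1000.0, "si", 850.0),
    Sample("monitoring", "10.0.0.9", 100.0, "si", 60.0),     # too early
    Sample("app_perf", "10.0.0.9", 1100.0, "server_response_ms", 3600.0),
    Sample("monitoring", "10.0.0.7", 1000.0, "si", 40.0),    # other server
]
related = correlate(samples, "10.0.0.9", anomaly_time=1000.0,
                    before_s=60.0, after_s=300.0)
```

Only the samples that match both the server and the time window are passed on as associated diagnostic data.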
Because abnormal situations in business applications are complex, those skilled in the art will understand that an exhaustive list cannot be given; to explain the invention clearly, common business application anomalies are listed below and some of the associated diagnostic data is briefly described.
It should be noted that the business anomaly types are a summary obtained by those skilled in the art through empirical analysis. The common business anomaly types and their associated diagnostic data are as follows:
1. Business application service availability anomaly. The associated diagnostic data mainly involved includes: service status (started/stopped), CPU utilization, disk utilization, memory-utilization-related parameters, and the like. The data for this kind of anomaly comes mainly from monitoring data.
2. Business application server response anomaly. The associated diagnostic data mainly involved includes: application request time, application page download time, page slowdown ratio, HTTP access rate, HTTP error rate, server response time, database response time, amount of swap memory brought into memory from disk (si), amount of swap memory moved from memory to disk (so), free memory, size written from memory to disk (bo), size read from disk into memory (bi), CPU utilization, and the like; the first 6 indicators are application performance data, and the last 6 indicators are monitoring data.
3. Business application service access anomaly. The associated diagnostic data mainly involved includes: anomaly in the total flow accessing a specified service per unit time, anomaly in HTTP access volume per unit time, Tomcat global access speed anomaly, anomaly in database access volume per unit time, anomaly in the WebLogic current session count, and the like.
4. Business application flow anomaly. The associated diagnostic data mainly involved includes: protocol proportion anomaly events (abnormal proportion of TCP/UDP/ICMP/IGMP) and flow anomalies (bps, pps, sessions). These indicators come mainly from the flow collector.
5. Business application service performance anomaly. The associated diagnostic data mainly involved includes: service performance monitoring anomalies.
6. Business application service status anomaly. The associated diagnostic data mainly involved includes: service status (started/stopped) and service status monitoring anomalies.
7. Business application anomaly caused by network attack. The associated diagnostic data mainly involved includes: anomaly in the number of SYN packets sent per unit time, anomaly in average packet length, and worm event alarms on the line: Code Red, Hard Disk Killer, SQL Slammer, Shock Wave Killer, Shock Wave, mail worm, WinNuke attack, UDP Fragment Flood. These indicators come mainly from the flow collector.
8. Business application line anomaly. The associated diagnostic data mainly involved includes: layer-2 data flow anomaly, TCP packet retransmission rate, TCP checksum error rate, number of abnormal TCP connection closures, and the like. These indicators come from the flow collector and the application collector.
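One way to picture the list above is as a lookup from anomaly type to the indicators worth pulling from the collected data. The sketch below covers only a few of the listed types; the dictionary keys, metric names, and the `select_associated` function are hypothetical identifiers chosen for illustration, not names defined by the patent.

```python
from typing import Dict, List

# Partial, illustrative mapping from anomaly type to associated indicators.
ASSOCIATED_METRICS: Dict[str, List[str]] = {
    "service_availability": [
        "service_status", "cpu_utilization", "disk_utilization",
        "memory_utilization",
    ],
    "server_response": [
        "request_time", "page_download_time", "page_slowdown_ratio",
        "http_access_rate", "http_error_rate", "server_response_time",
        "database_response_time", "si", "so", "free_memory", "bo", "bi",
        "cpu_utilization",
    ],
    "network_attack": [
        "syn_packets_per_unit_time", "average_packet_length",
        "worm_event_alarm",
    ],
    "line": [
        "layer2_flow", "tcp_retransmission_rate",
        "tcp_checksum_error_rate", "tcp_abnormal_close_count",
    ],
}


def select_associated(anomaly_type: str,
                      collected: Dict[str, float]) -> Dict[str, float]:
    """Return only the collected indicators associated with the anomaly type."""
    wanted = ASSOCIATED_METRICS.get(anomaly_type, [])
    return {name: collected[name] for name in wanted if name in collected}


# Illustrative collected values.
collected = {"si": 850.0, "so": 950.0, "server_response_time": 3600.0,
             "syn_packets_per_unit_time": 12.0}
selected = select_associated("server_response", collected)
```

An unknown anomaly type simply yields an empty selection, which leaves room for the deferred diagnosis described later.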
Step 102, comparing the obtained associated diagnostic data involved in the business anomaly with the historical diagnostic data of each item of associated diagnostic data, respectively, to determine the application fault type.
Specifically, the obtained associated diagnostic data involved in the business anomaly is compared with the historical diagnostic data of each item of associated diagnostic data by means of a periodic baseline or a moving window baseline, and the application fault type is determined according to the preset threshold range of each item of associated diagnostic data.
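A minimal sketch of the two baselines follows, using the definitions the description gives in Example 1 (moving window: average over the most recent short interval; periodic: value at the same time in the previous unit period). The patent does not specify how the preset threshold range is expressed, so the multiplicative `(low, high)` range used here is an assumption, as are all function names.

```python
from statistics import mean
from typing import Dict, Optional, Sequence, Tuple


def moving_window_baseline(history: Sequence[float], window: int) -> float:
    """Average of the most recent `window` values."""
    if window <= 0 or not history:
        raise ValueError("need a positive window and a non-empty history")
    return mean(history[-window:])


def periodic_baseline(history: Dict[int, float], timestamp: int,
                      period_s: int) -> Optional[float]:
    """Value recorded at the same time one period earlier, or None if absent."""
    return history.get(timestamp - period_s)


def outside_threshold(value: float, baseline: float,
                      threshold: Tuple[float, float]) -> bool:
    """True when value lies outside [baseline * low, baseline * high]."""
    low, high = threshold
    return not (baseline * low <= value <= baseline * high)


# Illustrative response times in milliseconds.
recent = [100.0, 120.0, 80.0, 200.0]
mw = moving_window_baseline(recent, window=2)   # mean of 80.0 and 200.0

day = 24 * 60 * 60
pb = periodic_baseline({0: 150.0}, timestamp=day, period_s=day)

abnormal = outside_threshold(3500.0, pb, (0.5, 2.0))
normal = outside_threshold(160.0, pb, (0.5, 2.0))
```

The moving window baseline tracks recent behavior, while the periodic baseline captures regular daily, weekly, or monthly patterns; which one suits a given indicator is left open by the description.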
In this step, the historical diagnostic data is: monitoring data within a first preset duration, flow data within a second preset duration, and real-time application performance data.
Here, because the monitoring data mainly consists of logs and data of the same nature as logs, the first preset duration generally spans several periods of generated monitoring data. The period of the monitoring data depends on the type of monitoring data, which is designed according to the actual anomaly and fault conditions, and generally takes the minute as its minimum unit;
the flow data is compared over a short time span to determine anomalies, so the second preset duration is generally about 20 s.
Of course, the first preset duration and the second preset duration may be adjusted according to the actual application scenario and requirements.
When fault diagnosis yields no result, the method further includes: storing the multidimensional data related to the anomaly, and determining the application fault type after the historical data has been updated.
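The deferred diagnosis can be sketched as a small store of unresolved anomalies that is retried once the historical data changes. This is only an illustration: the `PendingDiagnoses` class, its methods, and the `diagnose` callback (which returns a fault type or `None`) are hypothetical and not described in the patent.

```python
from typing import Callable, Dict, Optional


class PendingDiagnoses:
    """Holds anomaly data for which no fault type could be determined yet."""

    def __init__(self) -> None:
        self._pending: Dict[str, dict] = {}

    def defer(self, anomaly_id: str, data: dict) -> None:
        """Store the multidimensional data related to an undiagnosed anomaly."""
        self._pending[anomaly_id] = data

    def retry(self, diagnose: Callable[[dict], Optional[str]]) -> Dict[str, str]:
        """Re-run diagnosis; resolved anomalies are removed from the store."""
        resolved: Dict[str, str] = {}
        for anomaly_id, data in list(self._pending.items()):
            fault_type = diagnose(data)
            if fault_type is not None:
                resolved[anomaly_id] = fault_type
                del self._pending[anomaly_id]
        return resolved

    def __len__(self) -> int:
        return len(self._pending)


pending = PendingDiagnoses()
pending.defer("a1", {"si": 850})
pending.defer("a2", {"si": 50})

# Before the history is updated, the (stand-in) diagnosis finds nothing.
first = pending.retry(lambda data: None)

# After the update, a stand-in rule can classify one of the anomalies.
second = pending.retry(
    lambda data: "memory_pressure" if data["si"] > 800 else None
)
```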
The method of the invention further includes: providing a fault recovery suggestion from the historical diagnostic data based on the determined application fault type and cause.
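The description does not say how a suggestion is derived from the historical diagnostic data. As one possible reading, the sketch below assumes the history records which recovery action was taken for each past fault and suggests the most frequent action for the same fault type. The record layout, the fault and action names, and `suggest_recovery` are all hypothetical.

```python
from collections import Counter
from typing import List, Optional, Tuple


def suggest_recovery(history: List[Tuple[str, str]],
                     fault_type: str) -> Optional[str]:
    """Return the most frequent past action for this fault type, if any."""
    actions = Counter(
        action for past_type, action in history if past_type == fault_type
    )
    if not actions:
        return None
    return actions.most_common(1)[0][0]


# Illustrative history of (fault type, recovery action) pairs.
history = [
    ("memory_pressure", "increase physical memory"),
    ("memory_pressure", "limit report time range"),
    ("memory_pressure", "increase physical memory"),
    ("disk_io", "enlarge disk cache"),
]
suggestion = suggest_recovery(history, "memory_pressure")
no_suggestion = suggest_recovery(history, "network_attack")
```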
FIG. 2 is a block diagram of the device for implementing application fault diagnosis according to the invention. As shown in FIG. 2, the device includes:
a collecting unit, an obtaining unit, and a fault diagnosis unit; wherein,
the collecting unit is configured to collect multidimensional application data;
here, the multidimensional application data includes: monitoring data extracted according to the business application server IP, flow data extracted according to the business application server IP and the destination address, and application performance data extracted according to the business application server IP and the destination address.
The monitoring data includes at least: IP address, and/or monitoring time, and/or CPU utilization, and/or disk input/output (io), and/or memory-related information, and/or swap-space-related information, and/or network-interface-related information, and/or database response time, and/or amount of swap memory brought into memory from disk (si), and/or amount of swap memory moved from memory to disk (so), and/or size written from memory to disk (bo), and/or size read from disk into memory (bi), and/or service status.
The flow data is a session uniquely identified by the same five-tuple, and includes at least: collection time, source/destination addresses, source/destination ports, protocol, number of SYN packets sent (SYN being the handshake signal used when TCP/IP establishes a connection), number of FIN packets sent (FIN being a code-bit field of the TCP header), TCP-related information, number of RSTs sent, and anomaly in the total flow accessing a specified service per unit time.
The application performance data includes at least: source/destination addresses, destination port, request time, server response time, load time, page (URL) related information, HTTP-related information, Tomcat global access speed anomaly, anomaly in database access volume per unit time, and anomaly in the WebLogic current session count;
the application performance data is collected from performance data of the HTTP protocol, and/or performance data of an Oracle database service, and/or performance data of a MySQL database server.
The obtaining unit is configured to, when a business application becomes abnormal, obtain from the collected multidimensional application data the associated diagnostic data involved in the business anomaly, based on the temporal and spatial correlations of the business anomaly and according to the business anomaly type;
and the fault diagnosis unit is configured to compare the obtained associated diagnostic data involved in the business anomaly with the historical diagnostic data of each item of associated diagnostic data, respectively, to determine the application fault type.
The fault diagnosis unit is specifically configured to compare the obtained associated diagnostic data involved in the business anomaly with the historical diagnostic data of each item of associated diagnostic data by means of a periodic baseline or a moving window baseline, and to determine the application fault type according to the preset threshold range of each item of associated diagnostic data.
The historical diagnostic data is: monitoring data within a first preset duration, flow data within a second preset duration, and real-time application performance data.
The device also includes a subsequent diagnosis unit, configured to store the multidimensional data related to the anomaly when fault diagnosis yields no result, and to determine the application fault type after the historical data has been updated.
The device of the invention further includes a recovery suggestion unit, configured to provide a fault recovery suggestion from the historical diagnostic data based on the determined application fault type.
The invention is described in detail below with reference to a specific example, which is provided to illustrate the content of the invention clearly and is not intended to limit its scope.
Example 1
A certain business application system had been running stably online for a long time. Over a period, data operations in one business data module were gradually found to be sporadically slow, and the anomaly gradually spread until other modules also began to slow down (though to a lesser degree). The cause of the fault was unknown.
The following is the traditional approach to application fault diagnosis, in which the system fault is diagnosed step by step mainly through application logs:
First, by checking application logs, the states and configurations of the switches and routers used by the application were checked, along with data such as the packet loss rate and packet error rate of the equipment, and the network equipment was found to be performing normally; meanwhile, other applications were checked and found to show no obvious slowdown, which ruled out a network problem.
Because the application fault type cannot be diagnosed from the application log alone, the traditional approach has to continue the diagnosis manually, as follows:
The CPU, memory, system cache, and disk io of the host running the application were checked from the command line, and all parameters were found to be normal.
Since no anomaly was found, operation and maintenance personnel further used the command line to check the CPU, memory, system cache, and disk io of the host running the affected application's database. After repeated checks and comparisons, they found that disk io was frequent during the slow periods and clearly higher than during normal operation, and this was listed as a suspicious item.
The operation and maintenance personnel then checked the communication between the application and the database equipment, continuously capturing and analyzing packets with a packet analysis tool, and found that the volume of communication data rose during roughly the first 20-40 minutes of the slowdown; this was also listed as a suspicious item.
After reviewing the two suspicious items, the operation and maintenance personnel suspected that the slowdown was related to the application and asked the application developers to investigate on site.
To pin down the fault, the application run logs and the code were read, and the application host, the database host, and the database operating parameters were monitored continuously. Code reading revealed that the original data was read whenever report data spanning a long time interval was run, and the application fault was thereby resolved.
In this process, a single kind of data could not support effective fault diagnosis, and the diagnosis was completed only with substantial manual participation.
With the application fault diagnosis system, the diagnosis-related data for the first 5 minutes after the system slowed down is analyzed. Here, assuming from the working experience of those skilled in the art that the collection period of the monitoring data is 1 minute, monitoring data for 5 consecutive periods is collected for analysis; in general, when the period is set, an alarm period for system fault anomalies can also be set based on it.
Taking the time of the slow-response fault and the business system IP as the temporal and spatial correlations respectively, the monitoring data is extracted, including memory-related indicators and the like:
wherein, the monitoring data includes: the virtual memory usage rate in the memory related information is more than 70%, and the historical associated data of the virtual memory usage rate is less than 10%.
The work value of the exchange memory used for calling the memory from the disk is larger than 800, and the historical associated data of the exchange memory used for calling the memory from the disk is about 0-120.
The work value of the exchange memory used for calling the disk from the memory is larger than 900, and the history associated data of the exchange memory used for calling the disk from the memory is about 0-100.
The free physical memory is about 80-140M, and the historical associated data is 400-500M.
The size of writes to disk from memory is often greater than 600, while the historical association data is 20-100.
The size of the write from disk to memory often exceeds 600 and the historical associated data is 40-70.
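The comparison of the extracted indicators with their historical associated data can be sketched roughly as below. This is a minimal sketch rather than the patented implementation: the key names and the function `deviations` are hypothetical, the historical ranges are the ones quoted above, and the observed values are single illustrative points chosen inside the ranges the text reports.

```python
# Historical ranges quoted in the text; key names are hypothetical labels.
HISTORICAL = {
    "virtual_memory_usage_pct": (0, 10),
    "swap_in_si": (0, 120),
    "swap_out_so": (0, 100),
    "free_physical_memory_m": (400, 500),
    "blocks_out_bo": (20, 100),
    "blocks_in_bi": (40, 70),
}

# Illustrative values during the slowdown, consistent with the text
# (for example "more than 70%" is represented here by 72).
OBSERVED = {
    "virtual_memory_usage_pct": 72,
    "swap_in_si": 850,
    "swap_out_so": 950,
    "free_physical_memory_m": 110,
    "blocks_out_bo": 650,
    "blocks_in_bi": 640,
}


def deviations(observed, historical):
    """Return the indicators that fall outside their historical range,
    marked as 'above' or 'below' that range."""
    result = {}
    for name, value in observed.items():
        low, high = historical[name]
        if value < low:
            result[name] = "below"
        elif value > high:
            result[name] = "above"
    return result


abnormal = deviations(OBSERVED, HISTORICAL)
```

With these values every indicator is flagged: free physical memory is below its historical range and the other five are above theirs.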
During the slowdown, the number of database accesses per unit time rises markedly, while the access rate in the HTTP-related information does not change significantly.
When the system starts to slow down, the URLs that slow markedly in the HTTP-related information are the operation pages of one particular service (querying the system's URL list identifies them as report operation pages). The server response time for these pages rises gradually from the 50-200 ms of the historical associated data to more than 3500 ms.
Here the historical associated data is the value of the periodic window baseline.
The moving window baseline is the average response time over the most recent short interval, and the periodic baseline is the response value at the same time in the previous unit time period (working day, week or month).
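The two baselines can be illustrated with a short sketch. The function names, the dictionary-based history and the `factor` threshold are assumptions made for this example; the patent only states that the moving window baseline is a recent average and that the periodic baseline is the value at the same time one unit period earlier.

```python
from datetime import datetime, timedelta


def moving_window_baseline(response_times_ms, window):
    """Average of the most recent `window` response times."""
    recent = response_times_ms[-window:]
    return sum(recent) / len(recent)


def periodic_baseline(history, timestamp, period):
    """Value recorded at the same time one `period` earlier, or None if absent.
    `history` maps a datetime to a response time in milliseconds."""
    return history.get(timestamp - period)


def exceeds_baseline(value_ms, baseline_ms, factor=2.0):
    """Hypothetical check: flag a value above `factor` times the baseline."""
    return value_ms > factor * baseline_ms


# Illustrative data: yesterday at 10:00 the report page answered in 120 ms.
now = datetime(2014, 7, 8, 10, 0)
history = {now - timedelta(days=1): 120.0}

daily = periodic_baseline(history, now, timedelta(days=1))
recent = moving_window_baseline([50, 100, 150, 200], window=2)
slow = exceeds_baseline(3500, daily)
```

A working-day, weekly or monthly baseline differs only in the `period` passed in; month lengths vary, so a monthly period would need calendar-aware handling that this sketch omits.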
After the data confirms that the system has slowed down, the response times of the pages of other services are obtained from the application performance data; these page response times have risen to about 1500 ms.
The causes of the application fault are determined to be:
1. Large amounts of disk data are operated on frequently.
2. The disk cache is too small or too fragmented.
3. The physical memory is too small, so physical memory occupancy is too high and data reads are affected.
4. The URL pages associated with the service system are occasionally abnormal, and the abnormality is caused by improper use by the operation and maintenance personnel. (The system can catalogue URLs and map an application URL to the corresponding operation on the application, such as report operation.)
Fault diagnosis suggestions:
1. Reduce the frequency of operations on disk data.
2. Enlarge the disk cache or defragment the disk.
3. Increase the physical memory to reduce the physical memory occupancy rate.
4. Determine whether the operational interference is associated with a particular type of operation, and adjust the event that causes the interference.
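One simple way to pair each diagnosed cause with its recovery suggestion is a lookup table, sketched below. The cause identifiers and the function `recommend` are hypothetical; the suggestion texts paraphrase the four items listed above, and how the real system stores or derives them from historical diagnostic data is not specified here.

```python
# Hypothetical cause identifiers mapped to the suggestions listed in the text.
SUGGESTIONS = {
    "frequent_disk_operations":
        "Reduce the frequency of operations on disk data.",
    "disk_cache_small_or_fragmented":
        "Enlarge the disk cache or defragment the disk.",
    "physical_memory_too_small":
        "Increase the physical memory to reduce physical memory occupancy.",
    "url_page_occasionally_abnormal":
        "Check whether the interference is tied to a particular type of "
        "operation and adjust the event that causes it.",
}


def recommend(causes):
    """Return the suggestions for the diagnosed causes, in the given order.
    Causes without a known suggestion are skipped."""
    return [SUGGESTIONS[c] for c in causes if c in SUGGESTIONS]


advice = recommend(["physical_memory_too_small", "unknown_cause"])
```

Skipping unknown causes keeps the sketch short; a real system might instead store the unresolved case for later diagnosis once more historical data is available.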
As the diagnosis result shows, fault diagnosis by the existing methods is one-sided: with monitoring data alone, only the memory and disk abnormalities can be diagnosed, and with performance data alone, only the occasional abnormality of the URLs and related pages can be diagnosed. Such a one-sided diagnosis result hinders the timely recovery of the service application from the abnormality.
Although the embodiments disclosed in the present application are described above, the descriptions are only for the convenience of understanding the present application, and are not intended to limit the present application. It will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the disclosure as defined by the appended claims.
Claims (10)
1. A method for implementing application fault diagnosis, comprising:
collecting multi-dimensional application data;
when a business application is abnormal, acquiring, from the collected multi-dimensional application data and according to the type of the business abnormality, associated diagnostic data related to the business abnormality through the time and space association relation of the business abnormality, wherein the time and space association relation of the business abnormality means that the associated diagnostic data is acquired from the determinable time information in the multi-dimensional data according to the time at which the business abnormality occurs, and that the associated diagnostic data is acquired from the information of the related protocol layer;
comparing the acquired associated diagnostic data related to the business abnormality with historical diagnostic data of each associated diagnostic data respectively to determine the type of the application fault;
the multi-dimensional application data includes: monitoring data extracted according to the IP of the service application server, flow data extracted according to the IP of the service application server and a destination address, and application performance data extracted according to the IP of the service application server and the destination address;
wherein the monitoring data at least comprises: IP address, and/or monitoring time, and/or CPU utilization, and/or disk input/output io, and/or memory-related information, and/or swap space-related information, and/or network interface-related information, and/or database response time, and/or swap memory usage si brought into memory from disk, and/or swap memory usage so brought into disk from memory, and/or size bo written into disk from memory, and/or size bi written into memory from disk, and/or service status;
the flow data describes sessions, each uniquely identified by one five-tuple, and at least comprises: collection time, and/or source/destination addresses, and/or source/destination ports, and/or protocol, and/or the number of SYN packets sent (the handshake signal used when TCP/IP establishes a connection), and/or the number of FIN packets sent (a code-bit field of the TCP header), and/or TCP-related information, and/or the number of RSTs sent, and/or an abnormality in the total traffic accessing a specified service per unit time;
the application performance data comprises at least: source/destination addresses, and/or destination ports, and/or request time, and/or server response time, and/or load time, and/or page-related information, and/or HTTP-related information, and/or Tomcat global access speed abnormality, and/or abnormality in database access volume per unit time, and/or abnormality in the Weblogic current session number;
the application performance data is collected from performance data of an HTTP protocol, and/or performance data of an ORACLE database service, and/or performance data of a MYSQL database server.
2. The method according to claim 1, wherein comparing the acquired associated diagnostic data related to the business abnormality with the historical diagnostic data of each associated diagnostic data to determine the application fault type specifically comprises:
comparing the acquired associated diagnostic data related to the business abnormality with the historical diagnostic data of each associated diagnostic data through a periodic baseline or a moving window baseline, and determining the application fault type according to the preset threshold range of each associated diagnostic data.
3. The method according to any one of claims 1 to 2, wherein the historical diagnostic data is: monitoring data within a first preset time length; flow data within a second preset time length and real-time application performance data.
4. The method of claim 1, wherein when the fault diagnosis yields no analysis result, the method further comprises: storing the multi-dimensional data related to the abnormality, and further determining the application fault type after the historical data is updated.
5. The method of claim 3, further comprising: upon determining the application failure type, a failure recovery recommendation is provided from the historical diagnostic data.
6. An apparatus for implementing application failure diagnosis, comprising: a collecting unit, an acquiring unit and a fault diagnosis unit; wherein,
the collecting unit is used for collecting multi-dimensional application data;
the acquiring unit is used for acquiring, when a business application is abnormal, from the collected multi-dimensional application data and according to the type of the business abnormality, associated diagnostic data related to the business abnormality through the time and space association relation of the business abnormality, wherein the time and space association relation of the business abnormality means that the associated diagnostic data is acquired from the determinable time information in the multi-dimensional data according to the time at which the business abnormality occurs, and that the associated diagnostic data is acquired from the information of the related protocol layer;
the fault diagnosis unit is used for comparing the acquired associated diagnosis data related to the business abnormality with historical diagnosis data of each associated diagnosis data respectively to determine the type of the application fault;
the multi-dimensional application data includes: monitoring data extracted according to the IP of the service application server, flow data extracted according to the IP of the service application server and a destination address, and application performance data extracted according to the IP of the service application server and the destination address;
wherein the monitoring data at least comprises: IP address, and/or monitoring time, and/or CPU utilization, and/or disk input/output io, and/or memory-related information, and/or swap space-related information, and/or network interface-related information, and/or database response time, and/or swap memory usage si brought into memory from disk, and/or swap memory usage so brought into disk from memory, and/or size bo written into disk from memory, and/or size bi written into memory from disk, and/or service status;
the flow data describes sessions, each uniquely identified by one five-tuple, and at least comprises: collection time, and/or source/destination addresses, and/or source/destination ports, and/or protocol, and/or the number of SYN packets sent (the handshake signal used when TCP/IP establishes a connection), and/or the number of FIN packets sent (a code-bit field of the TCP header), and/or TCP-related information, and/or the number of RSTs sent, and/or an abnormality in the total traffic accessing a specified service per unit time;
the application performance data comprises at least: source/destination addresses, and/or destination ports, and/or request time, and/or server response time, and/or load time, and/or page-related information, and/or HTTP-related information, and/or Tomcat global access speed abnormality, and/or abnormality in database access volume per unit time, and/or abnormality in the Weblogic current session number;
the application performance data is collected from performance data of an HTTP protocol, and/or performance data of an ORACLE database service, and/or performance data of a MYSQL database server.
7. The apparatus according to claim 6, wherein the fault diagnosis unit is specifically configured to compare the acquired associated diagnostic data related to the service anomaly with historical diagnostic data of each associated diagnostic data through a periodic baseline or a moving window baseline, and determine the application fault type according to a preset threshold range of each associated diagnostic data.
8. The apparatus of any one of claims 6 to 7, wherein the historical diagnostic data is: monitoring data within a first preset time length; flow data within a second preset time length and real-time application performance data.
9. The apparatus of claim 6, further comprising a subsequent diagnosis unit for storing the multi-dimensional data related to the abnormality when the fault diagnosis yields no analysis result, and further determining the application fault type after the historical data is updated.
10. The apparatus of claim 8, further comprising a recovery suggestion unit to provide a fault recovery suggestion from historical diagnostic data based on the determined application fault type.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410324069.XA CN105320585B (en) | 2014-07-08 | 2014-07-08 | A kind of method and device for realizing application failure diagnosis |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105320585A | 2016-02-10 |
CN105320585B | 2019-04-02 |
Family
ID=55248005
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410324069.XA (Active, CN105320585B) | A kind of method and device for realizing application failure diagnosis | 2014-07-08 | 2014-07-08 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105320585B (en) |
Families Citing this family (28)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105871638B (en) * | 2016-06-03 | 2019-03-12 | 北京启明星辰信息安全技术有限公司 | A kind of network safety control method and device |
CN106130786B (en) * | 2016-07-26 | 2019-05-07 | 腾讯科技(深圳)有限公司 | A kind of detection method and device of network failure |
CN106452941A (en) * | 2016-08-24 | 2017-02-22 | 重庆大学 | Network anomaly detection method and device |
CN106484555B (en) * | 2016-09-29 | 2019-05-17 | Oppo广东移动通信有限公司 | The method and mobile terminal of abnormality detection and recovery |
CN107995056B (en) * | 2016-10-27 | 2021-04-13 | 中国移动通信集团公司 | Method and device for judging hidden NAT fault of firewall |
CN107342891B (en) * | 2017-06-07 | 2020-09-15 | 厦门金龙旅行车有限公司 | Method for remotely collecting vehicle fault data |
CN108183821B (en) * | 2017-12-26 | 2021-03-30 | 国网山东省电力公司信息通信公司 | Application performance obtaining method and device for power grid service |
CN110362442B (en) * | 2018-04-09 | 2023-09-22 | 创新先进技术有限公司 | Data monitoring method, device and equipment |
CN108508874B (en) * | 2018-05-08 | 2019-12-31 | 网宿科技股份有限公司 | Method and device for monitoring equipment fault |
CN108923952B (en) * | 2018-05-31 | 2021-11-30 | 北京百度网讯科技有限公司 | Fault diagnosis method, equipment and storage medium based on service monitoring index |
CN110602021A (en) * | 2018-06-12 | 2019-12-20 | 蓝盾信息安全技术有限公司 | Safety risk value evaluation method based on combination of HTTP request behavior and business process |
CN108920326B (en) * | 2018-06-14 | 2022-04-29 | 创新先进技术有限公司 | Method and device for determining time-consuming abnormity of system and electronic equipment |
CN109002261B (en) * | 2018-07-11 | 2022-03-22 | 佛山市云端容灾信息技术有限公司 | Method and device for analyzing big data of difference block, storage medium and server |
CN109491844B (en) * | 2018-09-21 | 2022-03-04 | 国网技术学院 | Computer system for identifying abnormal information |
CN109787816B (en) * | 2018-12-28 | 2022-07-08 | 奇安信科技集团股份有限公司 | Service fault positioning method, device, equipment and medium |
CN109828863A (en) * | 2019-01-10 | 2019-05-31 | 网联清算有限公司 | Data disaster tolerance method, apparatus, storage medium and computer equipment |
CN109857431B (en) * | 2019-01-11 | 2022-06-03 | 平安科技(深圳)有限公司 | Code modification method and device, computer readable medium and electronic equipment |
CN111193609B (en) * | 2019-11-20 | 2021-09-28 | 腾讯科技(深圳)有限公司 | Application abnormity feedback method and device and application abnormity monitoring system |
CN112887354B (en) * | 2019-11-29 | 2023-04-21 | 贵州白山云科技股份有限公司 | Performance information acquisition method and device |
CN111371623B (en) * | 2020-03-13 | 2023-02-28 | 杨磊 | Service performance and safety monitoring method and device, storage medium and electronic equipment |
CN114363223A (en) * | 2020-09-27 | 2022-04-15 | 中兴通讯股份有限公司 | Two-layer service state detection method, communication equipment and storage medium |
CN112783718A (en) * | 2020-12-31 | 2021-05-11 | 航天信息股份有限公司 | Management system and method for system abnormity |
CN113064762B (en) * | 2021-04-09 | 2024-02-23 | 上海新炬网络信息技术股份有限公司 | Service self-recovery method based on various detection |
CN113691405B (en) * | 2021-08-25 | 2023-12-01 | 北京知道创宇信息技术股份有限公司 | Access abnormality diagnosis method and device, storage medium and electronic equipment |
CN113722142B (en) * | 2021-09-02 | 2023-08-25 | 北京天融信网络安全技术有限公司 | Method and device for analyzing reasons of insufficient memory, electronic equipment and storage medium |
CN115225462B (en) * | 2022-07-21 | 2024-02-02 | 北京天融信网络安全技术有限公司 | Network fault diagnosis method and device |
CN115696444B (en) * | 2022-09-23 | 2023-09-12 | 中兴通讯股份有限公司 | Time delay detection method, device, data analysis platform and readable storage medium |
US12050506B2 (en) | 2022-10-12 | 2024-07-30 | International Business Machines Corporation | Generating incident explanations using spatio-temporal log clustering |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101848477A (en) * | 2009-03-24 | 2010-09-29 | 亚信科技(中国)有限公司 | Method and system for diagnosing fault |
CN102081623A (en) * | 2009-11-30 | 2011-06-01 | 中国移动通信集团浙江有限公司 | Method and system for detecting database abnormality |
CN102340415A (en) * | 2011-06-23 | 2012-02-01 | 北京新媒传信科技有限公司 | Server cluster system and monitoring method thereof |
CN102761448A (en) * | 2012-08-07 | 2012-10-31 | 中国石油大学(华东) | Cluster monitoring and early warning method |
WO2013086996A1 (en) * | 2011-12-13 | 2013-06-20 | 华为技术有限公司 | Failure processing method, device and system |
CN103412805A (en) * | 2013-07-31 | 2013-11-27 | 交通银行股份有限公司 | IT (information technology) fault source diagnosis method and IT fault source diagnosis system |
CN103532776A (en) * | 2013-09-30 | 2014-01-22 | 广东电网公司电力调度控制中心 | Service flow detection method and system |
CN103532940A (en) * | 2013-09-30 | 2014-01-22 | 广东电网公司电力调度控制中心 | Network security detection method and device |
CN103595584A (en) * | 2013-11-13 | 2014-02-19 | 德科仕通信(上海)有限公司 | Method and system for diagnosing Web application performance problem |
Also Published As
Publication number | Publication date |
---|---|
CN105320585A (en) | 2016-02-10 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||