
CN112445641B - Operation maintenance method and system for big data cluster - Google Patents

Info

Publication number
CN112445641B
Authority
CN
China
Prior art keywords
error
running information
scanning
time interval
process running
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011225109.7A
Other languages
Chinese (zh)
Other versions
CN112445641A (en)
Inventor
李燕
杨雪平
宋彬彬
杨雪
周伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dezhou Vocational and Technical College
Original Assignee
Dezhou Vocational and Technical College
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dezhou Vocational and Technical College
Priority to CN202011225109.7A
Publication of CN112445641A
Application granted
Publication of CN112445641B
Legal status: Active (current)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703: Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706: Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation, the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0709: Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation, the processing taking place on a specific hardware platform or in a specific software environment, in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703: Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0766: Error or fault reporting or storing
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703: Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793: Remedial or corrective actions
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/30: Monitoring
    • G06F11/3003: Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006: Monitoring arrangements specially adapted to the computing system or computing system component being monitored, where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/30: Monitoring
    • G06F11/34: Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3452: Performance evaluation by statistical analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Computer Hardware Design (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Test And Diagnosis Of Digital Computers (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention provides an operation and maintenance method for a big data cluster, comprising the following steps: collecting process information in the big data cluster to obtain process running information of each component in the big data cluster; setting an initial value of a process running information scanning time interval, and adaptively adjusting the scanning time interval according to the results of the running information scans; using the process running information to scan each process of a tested component in the big data cluster for program errors; if a program error exists, extracting the error type corresponding to the program error and carrying out error statistics; querying a preset error code library for the repair strategy corresponding to the error type, and generating a repair instruction; and repairing the program error according to the repair instruction and the repair strategy. The system comprises modules corresponding to the method steps.

Description

Operation maintenance method and system for big data cluster
Technical Field
The invention provides a method and a system for operating and maintaining a big data cluster, and belongs to the technical field of operation and maintenance.
Background
Big data, also called massive data, refers to data sets so large that current mainstream software tools cannot capture, manage, process and organize them within a reasonable time into information that helps an enterprise make better business decisions. Big data processing relies on a multitude of services, such as HDFS (distributed file system), YARN (resource management system), Spark (distributed in-memory computing framework), HBase (distributed column-oriented database) and Hive (Hadoop-based data warehouse tool). Owing to network jitter, unstable voltage, resource preemption, misoperation and other causes, some components may go down. Maintenance personnel therefore need to inspect the running condition of the platform periodically and, once a program error is found, eliminate the abnormality and restart the failed service. If the failed service is not restarted in time, business data may back up and the business itself may be affected, which poses a great challenge to the stable operation of a big data platform. Moreover, because big data platforms are deployed in many places and the same program errors recur with high probability, operation and maintenance personnel have to perform a large amount of repetitive work. In addition, some big data platforms do not allow remote operation because of permission restrictions, which makes inspection and program error repair very inconvenient for operation and maintenance personnel.
Disclosure of Invention
The invention provides an operation and maintenance method and system for a big data cluster, which are used to solve the problems of low operation and maintenance efficiency and insufficient maintenance intensity in existing big data clusters, and adopts the following technical scheme:
An operation and maintenance method for a big data cluster, the method comprising:
collecting process information in the big data cluster to obtain process running information of each component in the big data cluster;
setting an initial value of a process running information scanning time interval, and adaptively adjusting the scanning time interval according to the results of the running information scans;
using the process running information to scan each process of a tested component in the big data cluster for program errors; if a program error exists, extracting the error type corresponding to the program error, and carrying out error statistics;
querying a preset error code library for the repair strategy corresponding to the error type, and generating a repair instruction;
and repairing the program error according to the repair instruction and the repair strategy.
Further, setting an initial value of the process running information scanning time interval, and adaptively adjusting the scanning time interval according to the results of the running information scans, comprises the following steps:
step one, setting an initial value of the process running information scanning time interval, and scanning each process of the tested component in the big data cluster with the process running information at the initial interval;
step two, carrying out three consecutive process scans at the initial value of the scanning time interval; after the three process scans are finished, adjusting the scanning time interval according to the time taken by each single scan and the number of program errors found, to obtain an adaptively adjusted scanning time interval;
step three, scanning each process of the tested component in the big data cluster with the process running information at the adaptively adjusted scanning time interval;
step four, after three consecutive scans at the adaptively adjusted scanning time interval, adjusting the scanning time interval again according to the time taken by each single scan and the number of program errors found, to obtain a newly adjusted scanning time interval; and scanning each process of the tested component in the big data cluster with the process running information at the newly adjusted interval;
step five, repeating steps three and four, so that the scanning time interval is adjusted continuously and each process of the tested component in the big data cluster is scanned at the continuously adjusted interval.
Further, the process running information scanning time interval is adaptively adjusted according to the following formula:
Figure BDA0002763397280000021
wherein T_{i+1} represents the process running information scanning time interval after the (i+1)-th adaptive adjustment, i = 1, 2, 3, ..., n, and n represents the total number of adaptive adjustments of the scanning time interval; T_1 represents the initial value of the process running information scanning time interval; N represents the number of processes scanned in three consecutive scans; N_c represents the number of program errors found in three consecutive scans; T_i represents the scanning time interval after the i-th adaptive adjustment; T_max represents the maximum time taken by a single process scan among the three scans; T_min represents the minimum time taken by a single process scan among the three scans.
Further, using the process running information to scan each process of the tested component in the big data cluster for program errors, and, if a program error exists, extracting the error type corresponding to the program error and carrying out error statistics, comprises:
when a program error is detected in a process of the tested component, locating the error log corresponding to the trigger point of the program error;
determining the error type according to the error log;
applying one error mark to the process in which the program error occurred, and classifying the error type of the program error of the process;
counting the number of error marks of each process of the tested component and the error types of each process, to obtain a statistical result;
and sending the statistical result to an operation and maintenance terminal of the big data cluster for recording.
Further, sending the statistical result to the operation and maintenance terminal of the big data cluster for recording comprises:
after receiving the statistical result, the operation and maintenance terminal compares the statistical result with the error threshold preset in the operation and maintenance terminal for each tested component;
and when either the number of error marks or the number of error types in the statistical result of a tested component exceeds the corresponding index in the error threshold (the error mark count index or the error type count index), the operation and maintenance terminal issues an alarm prompt.
An operation and maintenance system for a big data cluster, the system comprising:
an acquisition module, used for collecting process information in the big data cluster to obtain process running information of each component in the big data cluster;
a setting module, used for setting an initial value of a process running information scanning time interval, and adaptively adjusting the scanning time interval according to the results of the running information scans;
a judging module, used for scanning each process of the tested component in the big data cluster for program errors by using the process running information; and, if a program error exists, extracting the error type corresponding to the program error and carrying out error statistics;
a generating module, used for querying a preset error code library for the repair strategy corresponding to the error type, and generating a repair instruction;
and a repairing module, used for repairing the program error according to the repair instruction and the repair strategy.
Further, the setting module includes:
an initial value setting module, used for setting an initial value of the process running information scanning time interval, and scanning each process of the tested component in the big data cluster with the process running information at the initial interval;
a scanning module I, used for carrying out three consecutive process scans at the initial value of the scanning time interval; after the three process scans are finished, adjusting the scanning time interval according to the time taken by each single scan and the number of program errors found, to obtain an adaptively adjusted scanning time interval; and scanning each process of the tested component in the big data cluster with the process running information at the adaptively adjusted scanning time interval;
an adaptive adjustment module, used for adjusting the scanning time interval again, after three consecutive scans at the adaptively adjusted scanning time interval, according to the time taken by each single scan and the number of program errors found, to obtain a newly adjusted scanning time interval; scanning each process of the tested component in the big data cluster with the process running information at the newly adjusted interval; and continuing to adjust the scanning time interval and to scan each process of the tested component in the big data cluster at the continuously adjusted interval.
Further, the process running information scanning time interval is adaptively adjusted according to the following formula:
Figure BDA0002763397280000031
wherein T_{i+1} represents the process running information scanning time interval after the (i+1)-th adaptive adjustment, i = 1, 2, 3, ..., n, and n represents the total number of adaptive adjustments of the scanning time interval; T_1 represents the initial value of the process running information scanning time interval; N represents the number of processes scanned in three consecutive scans; N_c represents the number of program errors found in three consecutive scans; T_i represents the scanning time interval after the i-th adaptive adjustment; T_max represents the maximum time taken by a single process scan among the three scans; T_min represents the minimum time taken by a single process scan among the three scans.
Further, the judging module comprises:
a locating module, used for locating the error log corresponding to the trigger point of the program error when a program error is detected in a process of the tested component;
a type determining module, used for determining the error type according to the error log;
a marking module, used for applying one error mark to the process in which the program error occurred, and classifying the error type of the program error of the process;
a statistical module, used for counting the number of error marks of each process of the tested component and the error types of each process, to obtain a statistical result;
and a recording module, used for sending the statistical result to an operation and maintenance terminal of the big data cluster for recording.
Further, the recording module includes:
a comparison module, used for controlling the operation and maintenance terminal, after it receives the statistical result, to compare the statistical result with the error threshold preset in the operation and maintenance terminal for each tested component;
and a warning module, used for causing the operation and maintenance terminal to issue an alarm prompt when either the number of error marks or the number of error types in the statistical result of a tested component exceeds the corresponding index in the error threshold (the error mark count index or the error type count index).
The invention has the following beneficial effects:
The operation and maintenance method and system for a big data cluster can effectively improve operation and maintenance efficiency and intensity. By setting the process running information scanning time interval and adjusting it adaptively, the scanning frequency and the data collection frequency are matched more closely to the actual running condition of the big data cluster, so that the collection frequency can be adjusted at any time as that condition changes. The operation and maintenance process of the whole big data cluster thus follows changes in the cluster's actual running condition, which improves operation and maintenance efficiency and intensity, allows operation and maintenance resources to be allocated and applied reasonably, and reduces their waste.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a block diagram of the system of the present invention.
Detailed Description
The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.
As shown in fig. 1, the operation maintenance method for a big data cluster according to an embodiment of the present invention includes:
S1, collecting process information in the big data cluster to obtain process running information of each component in the big data cluster;
S2, setting an initial value of a process running information scanning time interval, and adaptively adjusting the scanning time interval according to the results of the running information scans;
S3, using the process running information to scan each process of the tested component in the big data cluster for program errors; if a program error exists, extracting the error type corresponding to the program error, and carrying out error statistics;
S4, querying a preset error code library for the repair strategy corresponding to the error type, and generating a repair instruction;
S5, repairing the program error according to the repair instruction and the repair strategy.
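As an illustration only, steps S3 to S5 can be sketched in Python as follows. The `ProcessInfo` record, the contents of `ERROR_CODE_LIBRARY`, the instruction strings and the "escalate" fallback for unknown error types are all hypothetical; the embodiment does not specify any of them.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ProcessInfo:
    """Running information for one process (hypothetical structure)."""
    component: str             # e.g. "HDFS", "YARN"
    pid: int
    error_type: Optional[str]  # None when no program error was found


# Hypothetical preset error code library: error type -> repair strategy.
ERROR_CODE_LIBRARY: Dict[str, str] = {
    "PROCESS_DOWN": "restart_service",
    "OUT_OF_MEMORY": "restart_with_larger_heap",
}


def build_repair_instructions(processes: List[ProcessInfo]) -> List[str]:
    """S3: find processes with errors; S4: look up a strategy and build an
    instruction. S5 (executing the instruction) is left to the caller."""
    instructions = []
    for proc in processes:
        if proc.error_type is None:
            continue  # healthy process, nothing to repair
        strategy = ERROR_CODE_LIBRARY.get(proc.error_type)
        if strategy is None:
            # Assumption: unknown error types are escalated to a person.
            instructions.append(
                f"escalate {proc.component} pid={proc.pid}: "
                f"unknown error {proc.error_type}"
            )
        else:
            instructions.append(f"{strategy} {proc.component} pid={proc.pid}")
    return instructions
```

Returning the instructions instead of executing them keeps the lookup testable without touching a real cluster.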
The effect of the above technical scheme is as follows: operation and maintenance efficiency and intensity can be effectively improved. By setting the process running information scanning time interval and adjusting it adaptively, the scanning frequency and the data collection frequency are matched more closely to the actual running condition of the big data cluster, so that the collection frequency can be adjusted at any time as that condition changes. The operation and maintenance process of the whole big data cluster thus follows changes in the cluster's actual running condition, which allows operation and maintenance resources to be allocated and applied reasonably and reduces their waste.
In an embodiment of the present invention, setting an initial value of the process running information scanning time interval, and adaptively adjusting the scanning time interval according to the results of the running information scans, includes:
step one, setting an initial value of the process running information scanning time interval, and scanning each process of the tested component in the big data cluster with the process running information at the initial interval;
step two, carrying out three consecutive process scans at the initial value of the scanning time interval; after the three process scans are finished, adjusting the scanning time interval according to the time taken by each single scan and the number of program errors found, to obtain an adaptively adjusted scanning time interval;
step three, scanning each process of the tested component in the big data cluster with the process running information at the adaptively adjusted scanning time interval;
step four, after three consecutive scans at the adaptively adjusted scanning time interval, adjusting the scanning time interval again according to the time taken by each single scan and the number of program errors found, to obtain a newly adjusted scanning time interval; and scanning each process of the tested component in the big data cluster with the process running information at the newly adjusted interval;
step five, repeating steps three and four, so that the scanning time interval is adjusted continuously and each process of the tested component in the big data cluster is scanned at the continuously adjusted interval.
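The control flow of steps one to five (scan three times, adjust, repeat) can be sketched as below. This is a minimal sketch under stated assumptions: the adjustment rule is passed in as a function, and `placeholder_adjust` is a stand-in that is NOT the patent's formula, which survives in this text only as an image reference. `ScanResult`, the clamp bounds and the injected `sleep` are likewise hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ScanResult:
    duration: float  # seconds taken by one scan
    processes: int   # processes examined in this scan
    errors: int      # program errors found in this scan


def placeholder_adjust(interval: float, batch: List[ScanResult]) -> float:
    """Stand-in rule (not the patent's formula): shorten the interval
    after errors, lengthen it otherwise, within assumed bounds."""
    found_errors = any(result.errors > 0 for result in batch)
    new_interval = interval / 2 if found_errors else interval * 1.5
    return min(max(new_interval, 5.0), 600.0)


def run_adaptive_scanning(
    scan: Callable[[], ScanResult],
    adjust: Callable[[float, List[ScanResult]], float],
    initial_interval: float,
    adjustments: int,
    sleep: Callable[[float], None],
) -> List[float]:
    """Run `adjustments` rounds of three scans each and return every
    interval used, starting with the initial value."""
    interval = initial_interval
    intervals = [interval]
    for _ in range(adjustments):
        batch = []
        for _ in range(3):          # three consecutive scans per round
            batch.append(scan())
            sleep(interval)         # wait one scanning time interval
        interval = adjust(interval, batch)
        intervals.append(interval)
    return intervals
```

In a real deployment `sleep` would be `time.sleep` and `scan` would query the cluster; injecting both keeps the loop itself easy to exercise.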
The process running information scanning time interval is adaptively adjusted according to the following formula:
Figure BDA0002763397280000061
wherein T_{i+1} represents the process running information scanning time interval after the (i+1)-th adaptive adjustment, i = 1, 2, 3, ..., n, and n represents the total number of adaptive adjustments of the scanning time interval; T_1 represents the initial value of the process running information scanning time interval; N represents the number of processes scanned in three consecutive scans; N_c represents the number of program errors found in three consecutive scans; T_i represents the scanning time interval after the i-th adaptive adjustment; T_max represents the maximum time taken by a single process scan among the three scans; T_min represents the minimum time taken by a single process scan among the three scans.
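The formula itself is not reproduced in this text, but the quantities it takes as input are defined above and can be gathered from a batch of three scans. The sketch below shows one possible way to do so; summing the process and error counts over the three scans is an assumption, since the wording "in three consecutive scans" does not say whether the counts are totals, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FormulaInputs:
    process_count: int    # processes scanned across the three scans
    error_count: int      # program errors found across the three scans
    max_scan_time: float  # longest single scan of the three
    min_scan_time: float  # shortest single scan of the three


def formula_inputs(scans: List[Tuple[float, int, int]]) -> FormulaInputs:
    """Each scan is (duration_seconds, processes_scanned, errors_found)."""
    if len(scans) != 3:
        raise ValueError("exactly three consecutive scans are expected")
    durations = [duration for duration, _, _ in scans]
    return FormulaInputs(
        process_count=sum(processes for _, processes, _ in scans),
        error_count=sum(errors for _, _, errors in scans),
        max_scan_time=max(durations),
        min_scan_time=min(durations),
    )
```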
The effect of the above technical scheme is as follows: by setting the process running information scanning time interval and adjusting it adaptively, the scanning frequency and the data collection frequency are matched more closely to the actual running condition of the big data cluster, so that the collection frequency can be adjusted at any time as that condition changes. The operation and maintenance process of the whole big data cluster thus follows changes in the cluster's actual running condition, which improves operation and maintenance efficiency and intensity, allows operation and maintenance resources to be allocated and applied reasonably, and reduces their waste.
Meanwhile, while the tested components of the big data cluster are running, the number of processes differs from one time period to another because the number of tasks executed and the data volume of each task differ, and the time needed to scan the processes differs accordingly. The change in scanning time therefore indirectly reflects the task load of each tested component while it executes tasks. Adaptively adjusting the process running information scanning time interval with the above formula can greatly improve how well the scanning time interval matches the actual running condition of the big data cluster and its program error rate, so that the overall operation and maintenance activity closely matches the actual running condition of each tested component of the big data cluster.
In one embodiment of the present invention, using the process running information to scan each process of the tested component in the big data cluster for program errors, and, if a program error exists, extracting the error type corresponding to the program error and carrying out error statistics, comprises:
S301, when a program error is detected in a process of the tested component, locating the error log corresponding to the trigger point of the program error;
S302, determining the error type according to the error log;
S303, applying one error mark to the process in which the program error occurred, and classifying the error type of the program error of the process;
S304, counting the number of error marks of each process of the tested component and the error types of each process, to obtain a statistical result;
S305, sending the statistical result to an operation and maintenance terminal of the big data cluster for recording.
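Steps S302 to S304 might look like the following sketch, which assumes that the error type is determined by matching the error log line against a keyword table. The patterns in `ERROR_PATTERNS`, the `UNKNOWN` fallback and the shape of the statistical result are hypothetical; the embodiment does not say how the error type is derived from the log.

```python
import re
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical table mapping log patterns to error types.
ERROR_PATTERNS: List[Tuple[str, str]] = [
    (r"OutOfMemoryError", "OUT_OF_MEMORY"),
    (r"Connection refused", "CONNECTION_REFUSED"),
    (r"No space left on device", "DISK_FULL"),
]


def classify_error(log_line: str) -> str:
    """S302: determine the error type from the error log line."""
    for pattern, error_type in ERROR_PATTERNS:
        if re.search(pattern, log_line):
            return error_type
    return "UNKNOWN"


def collect_statistics(
    error_logs: Iterable[Tuple[str, str]]
) -> Dict[str, dict]:
    """S303-S304: one error mark per (process, log line) pair, plus a
    per-process count of each error type."""
    marks: Counter = Counter()
    types: Dict[str, Counter] = defaultdict(Counter)
    for process, log_line in error_logs:
        marks[process] += 1
        types[process][classify_error(log_line)] += 1
    return {
        process: {
            "error_marks": marks[process],
            "error_types": dict(types[process]),
        }
        for process in marks
    }
```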
Sending the statistical result to the operation and maintenance terminal of the big data cluster for recording comprises:
S3051, after receiving the statistical result, the operation and maintenance terminal compares the statistical result with the error threshold preset in the operation and maintenance terminal for each tested component, the error threshold comprising an error mark count index and an error type count index;
S3052, when either the number of error marks or the number of error types in the statistical result of a tested component exceeds the corresponding index in the error threshold, the operation and maintenance terminal issues an alarm prompt.
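The threshold comparison in S3051 and S3052 can be sketched as follows. The sketch reads "any one of ... exceeds" as a logical OR over the two indexes and uses a strict "greater than" comparison; both readings are assumptions, as are the names `ErrorThreshold` and `should_alarm`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorThreshold:
    """Preset per-component threshold (hypothetical field names)."""
    max_error_marks: int  # error mark count index
    max_error_types: int  # error type count index


def should_alarm(
    error_marks: int, error_type_count: int, threshold: ErrorThreshold
) -> bool:
    """Alarm when either statistic exceeds its index in the threshold."""
    return (
        error_marks > threshold.max_error_marks
        or error_type_count > threshold.max_error_types
    )
```

For example, with `ErrorThreshold(max_error_marks=5, max_error_types=2)`, a component with 3 error marks spread over 3 distinct error types would raise an alarm because of the type count alone.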
The effect of the above technical scheme is as follows: through error record statistics, comparison of the statistical result with the error threshold, and alarming, operation and maintenance efficiency and intensity can be effectively improved.
An embodiment of the present invention provides an operation maintenance system for a big data cluster, as shown in fig. 2, the system includes:
the acquisition module is used for acquiring process information in the big data cluster to obtain process running information of each component in the big data cluster;
the setting module is used for setting an initial value of a process running information scanning time interval and adaptively adjusting the process running information scanning time interval according to the results of the running information scanning;
the judging module is used for scanning each process of the tested component in the big data cluster for a program error by using the process running information and, if a program error exists, extracting the error type corresponding to the program error and performing error statistics;
the generating module is used for querying a preset error code library for the repair strategy corresponding to the error type and generating a repair instruction;
and the repairing module is used for repairing the program error according to the repairing instruction and the repairing strategy.
The working principle of the technical scheme is as follows: first, the acquisition module acquires process information in the big data cluster to obtain the process running information of each component in the big data cluster; then, the setting module sets an initial value of the process running information scanning time interval and adaptively adjusts the interval according to the results of the running information scanning; then, the judging module uses the process running information to scan each process of the tested component in the big data cluster for a program error and, if a program error exists, extracts the corresponding error type and performs error statistics; then, the generating module queries a preset error code library for the repair strategy corresponding to the error type and generates a repair instruction; finally, the repair module repairs the program error according to the repair instruction and the repair strategy.
The effect of the above technical scheme is as follows: the efficiency and strength of operation and maintenance of the big data cluster can be effectively improved. By setting the process running information scanning time interval and adaptively adjusting it, the scanning of the cluster and the data acquisition frequency are better matched to the actual running condition of the big data cluster, and the data acquisition frequency can be adjusted at any time during operation and maintenance. The operation and maintenance process of the whole cluster therefore follows changes in the cluster's actual running condition, which allows operation and maintenance resources to be allocated reasonably and reduces their waste.
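As an illustration of how the judging and generating modules could hand over to the repair module, the following Python sketch looks up a repair strategy for each detected error type and builds a repair instruction. The library contents, component labels, error type names, and the shape of the instruction are assumptions made for this example; the patent does not specify them.

```python
from typing import Dict, List, NamedTuple, Optional


class ProcessInfo(NamedTuple):
    component: str             # hypothetical component label
    pid: int
    error_type: Optional[str]  # None when no program error was found


# Hypothetical preset error code library: error type -> repair strategy.
ERROR_CODE_LIBRARY: Dict[str, str] = {
    "ProcessExited": "restart_process",
    "DiskFull": "clean_temp_files",
}


def generate_repair_instructions(processes: List[ProcessInfo]) -> List[dict]:
    """Build one repair instruction per process whose error type is in the library."""
    instructions = []
    for proc in processes:
        if proc.error_type is None:
            continue  # healthy process, nothing to repair
        strategy = ERROR_CODE_LIBRARY.get(proc.error_type)
        if strategy is None:
            continue  # unknown error type: skipped in this sketch
        instructions.append(
            {"component": proc.component, "pid": proc.pid, "strategy": strategy}
        )
    return instructions


# Illustrative process running information for three processes.
collected = [
    ProcessInfo("component-a", 101, None),
    ProcessInfo("component-b", 202, "ProcessExited"),
    ProcessInfo("component-c", 303, "UnknownFault"),
]
instructions = generate_repair_instructions(collected)
```

How error types that are missing from the library should be handled is not stated in the text; the sketch simply skips them.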
In one embodiment of the present invention, the setting module includes:
the initial value setting module is used for setting an initial value of the process running information scanning time interval and scanning each process of the tested component in the big data cluster by using the process running information at the initial scanning time interval;
the scanning module I is used for performing process scans over three consecutive process running information scanning time intervals, that is, three process scans, starting from the initial value of the process running information scanning time interval; after the three process scans are finished, adjusting the process running information scanning time interval according to the time taken by a single scan and the number of program errors found, to obtain the adaptively adjusted process running information scanning time interval; and scanning each process of the tested component in the big data cluster by using the process running information at the adaptively adjusted process running information scanning time interval;
the self-adaptive adjustment module is used for, after three consecutive scans have been performed at the adaptively adjusted process running information scanning time interval, adjusting the process running information scanning time interval again according to the time taken by a single scan and the number of program errors found, to obtain the re-adjusted process running information scanning time interval; scanning each process of the tested component in the big data cluster by using the process running information at the re-adjusted process running information scanning time interval; and continuing to adjust the process running information scanning time interval in this way, scanning each process of the tested component in the big data cluster at each newly adjusted interval.
The process running information scanning time interval is adaptively adjusted according to the following formula:
Figure BDA0002763397280000081
wherein $T_{i+1}$ represents the information scanning time interval after the (i+1)-th adaptive adjustment, i = 1, 2, 3, …, n, and n represents the total number of adaptive adjustments of the information scanning time interval; when i = 1, $T_1$ represents the initial value of the process running information scanning time interval; N represents the number of processes scanned in the three consecutive scans; Nc represents the number of program errors found in the three consecutive scans; $T_i$ represents the information scanning time interval after the i-th adaptive adjustment; $T_{\max}$ represents the maximum time taken by a single process scan among the three scans; $T_{\min}$ represents the minimum time taken by a single process scan among the three scans.
The working principle of the technical scheme is as follows: first, the initial value setting module sets an initial value of the process running information scanning time interval, and each process of the tested component in the big data cluster is scanned by using the process running information at the initial scanning time interval; then, the scanning module I performs process scans over three consecutive process running information scanning time intervals, that is, three process scans, starting from the initial value; after the three process scans are finished, the process running information scanning time interval is adjusted according to the time taken by a single scan and the number of program errors found, giving the adaptively adjusted process running information scanning time interval, and each process of the tested component in the big data cluster is scanned at that interval; finally, after three consecutive scans have been performed at the adaptively adjusted interval, the self-adaptive adjustment module adjusts the process running information scanning time interval again according to the time taken by a single scan and the number of program errors found, giving the re-adjusted process running information scanning time interval; each process of the tested component in the big data cluster is scanned at the re-adjusted interval; and the process running information scanning time interval continues to be adjusted in this way, with each process of the tested component in the big data cluster scanned at each newly adjusted interval.
The effect of the above technical scheme is as follows: by setting the process running information scanning time interval and adaptively adjusting it, the scanning of the cluster and the data acquisition frequency are better matched to the actual running condition of the big data cluster, and the data acquisition frequency can be adjusted at any time during operation and maintenance. The operation and maintenance process of the whole cluster therefore follows changes in the cluster's actual running condition, which improves the efficiency and strength of operation and maintenance, allows operation and maintenance resources to be allocated reasonably, and reduces their waste.
Meanwhile, while the tested components of the big data cluster are running, the number of processes differs from one time period to another because the number of executed tasks and the data volume of a single task differ, and so does the time needed to scan the processes; the change in scanning time therefore indirectly reflects the number of tasks each tested component is executing. Adaptively adjusting the process running information scanning time interval by the above formula greatly improves how well the interval matches the actual running condition of the big data cluster and its program error rate, so that the overall operation and maintenance activity closely matches the actual running condition of each tested component.
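The control flow of the adaptive adjustment (three scans per round, then a new interval computed from the quantities N, Nc, the previous interval, and the maximum and minimum single-scan times) can be sketched in Python as follows. The adjustment formula itself is given only as an image and is not reproduced in the text, so the rule below is a stand-in invented for illustration and is explicitly not the patent's formula. The function names, and the reading of N and Nc as totals over the three scans, are likewise assumptions.

```python
import time
from typing import Callable, List, Tuple

ScanResult = Tuple[int, int]  # (processes scanned, program errors found) for one scan
AdjustRule = Callable[[float, int, int, float, float], float]  # (T_i, N, Nc, T_max, T_min) -> T_{i+1}


def placeholder_rule(t_i: float, n: int, nc: int, t_max: float, t_min: float) -> float:
    """Stand-in rule, NOT the patent's formula: shorten the interval as the
    error ratio Nc/N grows, but never below the slowest single scan.
    t_min is accepted for signature compatibility and unused here."""
    error_ratio = nc / n if n else 0.0
    return max(t_max, t_i * (1.0 - 0.5 * error_ratio))


def run_adjustments(
    scan: Callable[[], ScanResult],
    rule: AdjustRule,
    t_initial: float,
    rounds: int,
    sleep: Callable[[float], None] = time.sleep,
) -> List[float]:
    """Run `rounds` groups of three scans and adjust the interval after each group."""
    intervals = [t_initial]
    for _ in range(rounds):
        durations: List[float] = []
        n = nc = 0
        for _ in range(3):
            start = time.perf_counter()
            scanned, errors = scan()
            durations.append(time.perf_counter() - start)
            n += scanned
            nc += errors
            sleep(intervals[-1])  # wait one scanning interval before the next scan
        intervals.append(rule(intervals[-1], n, nc, max(durations), min(durations)))
    return intervals


# Usage with a fake scan and a no-op sleep so the example finishes immediately.
history = run_adjustments(
    scan=lambda: (10, 2),
    rule=placeholder_rule,
    t_initial=60.0,
    rounds=2,
    sleep=lambda seconds: None,
)
```

Passing the rule as a parameter keeps the loop independent of the formula, so the actual formula from the figure could be substituted without changing the scanning logic.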
In an embodiment of the present invention, the determining module includes:
the locking module is used for locking an error log corresponding to the program error trigger point according to the program error when the program error of the process of the tested assembly is detected;
the type determining module is used for determining the error type according to the error log;
the marking module is used for marking the process with the program error once and classifying the error type of the program error of the process;
the statistical module is used for counting the error marking times of the process of the tested assembly and various error types of the process to obtain a statistical result;
and the recording module is used for sending the statistical result to an operation maintenance terminal of the big data cluster for recording.
The recording module includes:
the comparison module is used for controlling the operation maintenance terminal, after it receives the statistical result, to compare the statistical result with the error threshold preset in the operation maintenance terminal for each tested component;
and the warning module is used for causing the operation maintenance terminal to issue an alarm prompt when either the number of error marks or the number of error types in the statistical result of a tested component exceeds the corresponding error mark count index or error type number index in the error threshold.
The working principle of the technical scheme is as follows: firstly, when detecting that a program error occurs in the process of the tested component through a locking module, locking an error log corresponding to a program error trigger point according to the program error; then, determining the error type according to the error log through a type determination module; then, a marking module is used for marking the process with the program error once, and classifying the error type of the program error of the process; then, adopting a statistical module to perform statistics on the error marking times of the process of the tested component and various error types of the process to obtain a statistical result; and finally, sending the statistical result to an operation maintenance terminal of the big data cluster by adopting a recording module for recording.
The operation process of the recording module is as follows:
first, the comparison module controls the operation maintenance terminal, after it receives the statistical result, to compare the statistical result with the error threshold preset in the operation maintenance terminal for each tested component; then, when either the number of error marks or the number of error types in the statistical result of a tested component exceeds the corresponding error mark count index or error type number index in the error threshold, the warning module causes the operation maintenance terminal to issue an alarm prompt.
The effect of the above technical scheme is as follows: through error record statistics, comparison of the statistical result with the error threshold, and alarming, the efficiency and strength of operation and maintenance of the big data cluster can be effectively improved.
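The first two functions of the judging module, locking onto the error log entry at the trigger point and deriving an error type from it, might look like the following Python sketch. The log line format, the `pid=` field, the pattern table, and the error type names are all hypothetical; the patent does not describe the log format or the classification rules.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical table mapping log text patterns to error types.
ERROR_PATTERNS = [
    ("OutOfMemory", re.compile(r"OutOfMemoryError")),
    ("ConnectionRefused", re.compile(r"Connection refused")),
]


def locate_and_classify(log_lines: List[str], pid: int) -> Optional[Tuple[int, str]]:
    """Return (line index, error type) for the first ERROR entry of process `pid`."""
    pid_pattern = re.compile(rf"\bpid={pid}\b")
    for index, line in enumerate(log_lines):
        if " ERROR " not in line or not pid_pattern.search(line):
            continue  # not an error entry of the process being examined
        for error_type, pattern in ERROR_PATTERNS:
            if pattern.search(line):
                return index, error_type
        return index, "Unclassified"  # error entry found, but no pattern matched
    return None  # no error entry for this process


# Illustrative log in an assumed format.
log = [
    "2020-11-05 10:00:00 INFO pid=42 heartbeat ok",
    "2020-11-05 10:00:05 ERROR pid=42 java.lang.OutOfMemoryError: Java heap space",
    "2020-11-05 10:00:06 ERROR pid=77 Connection refused",
]
result = locate_and_classify(log, 42)
```

The returned error type would then feed the marking and statistics modules described above.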
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (6)

1. A method for operation and maintenance of a big data cluster, the method comprising:
acquiring process information in a big data cluster to obtain process running information of each component in the big data cluster;
setting an initial value of a process running information scanning time interval, and carrying out self-adaptive adjustment on the process running information scanning time interval according to the process running information scanning condition;
scanning whether a program error exists in each process of a tested assembly in the big data cluster by using the process running information; if the program error exists, extracting an error type corresponding to the program error, and carrying out error statistics;
inquiring a corresponding repair strategy in a preset error code library according to the error type, and generating a repair instruction;
repairing the program error according to the repairing instruction and a repairing strategy;
wherein setting the initial value of the process running information scanning time interval, and adaptively adjusting the process running information scanning time interval according to the process running information scanning condition, comprises:
step one, setting an initial value of the process running information scanning time interval, and scanning each process of the tested component in the big data cluster by using the process running information at the initial scanning time interval;
step two, performing process scans over three consecutive process running information scanning time intervals, that is, three process scans, starting from the initial value of the process running information scanning time interval; after the three process scans are finished, adjusting the process running information scanning time interval according to the time taken by a single scan and the number of program errors found, to obtain the adaptively adjusted process running information scanning time interval;
step three, scanning each process of the tested component in the big data cluster by using the process running information at the adaptively adjusted process running information scanning time interval;
step four, after three consecutive scans have been performed at the adaptively adjusted process running information scanning time interval, adjusting the process running information scanning time interval again according to the time taken by a single scan and the number of program errors found, to obtain the re-adjusted process running information scanning time interval; and scanning each process of the tested component in the big data cluster by using the process running information at the re-adjusted process running information scanning time interval;
step five, repeating step three to step four, continuing to adjust the process running information scanning time interval, and scanning each process of the tested component in the big data cluster at each newly adjusted process running information scanning time interval;
the process running information scanning time interval is adaptively adjusted according to the following formula:
Figure FDA0003618444580000011
wherein $T_{i+1}$ represents the information scanning time interval after the (i+1)-th adaptive adjustment, i = 1, 2, 3, …, n, and n represents the total number of adaptive adjustments of the information scanning time interval; when i = 1, $T_1$ represents the initial value of the process running information scanning time interval; N represents the number of processes scanned in the three consecutive scans; Nc represents the number of program errors found in the three consecutive scans; $T_i$ represents the information scanning time interval after the i-th adaptive adjustment; $T_{\max}$ represents the maximum time taken by a single process scan among the three scans; $T_{\min}$ represents the minimum time taken by a single process scan among the three scans.
2. The method of claim 1, wherein the process running information is used to scan whether a program error exists in each process of the tested components in the big data cluster; if the program error exists, extracting an error type corresponding to the program error, and performing error statistics, wherein the error statistics comprises the following steps:
when the program error of the process of the tested component is detected, locking an error log corresponding to a program error trigger point according to the program error;
determining the error type according to the error log;
carrying out primary error marking on the process with the program error, and classifying the error type of the program error of the process;
counting the error marking times of the process of the tested component and various error types of the process to obtain a statistical result;
and sending the statistical result to an operation maintenance terminal of the big data cluster for recording.
3. The method of claim 2, wherein sending the statistical result to an operation and maintenance terminal of the big data cluster for recording comprises:
after receiving the statistical result, the operation maintenance terminal compares the statistical result with the error threshold preset in the operation maintenance terminal for each tested component;
and when either the number of error marks or the number of error types in the statistical result of a tested component exceeds the corresponding error mark count index or error type number index in the error threshold, the operation maintenance terminal issues an alarm prompt.
4. An operation and maintenance system for a big data cluster, the system comprising:
the acquisition module is used for acquiring process information in the big data cluster to obtain process running information of each component in the big data cluster;
the setting module is used for setting an initial value of a process running information scanning time interval and carrying out self-adaptive adjustment on the process running information scanning time interval according to the process running information scanning condition;
the judging module is used for scanning whether a program error exists in each process of the tested assembly in the big data cluster by using the process running information; if the program error exists, extracting an error type corresponding to the program error, and carrying out error statistics;
the generating module is used for inquiring a corresponding repairing strategy in a preset error code library according to the error type and generating a repairing instruction;
the repair module is used for repairing the program error according to the repair instruction and the repair strategy;
wherein the setting module includes: the initial value setting module is used for setting an initial value of a scanning time interval of the process running information and executing the scanning of each process of the tested assembly in the big data cluster by using the process running information according to the initial value of the scanning time interval;
the scanning module I is used for performing process scans over three consecutive process running information scanning time intervals, that is, three process scans, starting from the initial value of the process running information scanning time interval; after the three process scans are finished, adjusting the process running information scanning time interval according to the time taken by a single scan and the number of program errors found, to obtain the adaptively adjusted process running information scanning time interval; and scanning each process of the tested component in the big data cluster by using the process running information at the adaptively adjusted process running information scanning time interval;
the self-adaptive adjustment module is used for adjusting the process running information scanning time interval according to the time used by single scanning and the program error quantity in the process after three times of scanning is continuously carried out at the process running information scanning time interval after self-adaptive adjustment, so as to obtain the process running information scanning time interval after self-adaptive adjustment again; scanning each process of the tested assembly in the big data cluster by using the process running information according to the process running information scanning time interval after the self-adaptive adjustment again; continuously adjusting the process running information scanning time interval, and scanning each process of the tested component in the big data cluster by using the continuously adjusted process running information scanning time interval;
the process running information scanning time interval is adaptively adjusted according to the following formula:
Figure FDA0003618444580000031
wherein $T_{i+1}$ represents the information scanning time interval after the (i+1)-th adaptive adjustment, i = 1, 2, 3, …, n, and n represents the total number of adaptive adjustments of the information scanning time interval; when i = 1, $T_1$ represents the initial value of the process running information scanning time interval; N represents the number of processes scanned in the three consecutive scans; Nc represents the number of program errors found in the three consecutive scans; $T_i$ represents the information scanning time interval after the i-th adaptive adjustment; $T_{\max}$ represents the maximum time taken by a single process scan among the three scans; $T_{\min}$ represents the minimum time taken by a single process scan among the three scans.
5. The system of claim 4, wherein the determining module comprises:
the locking module is used for locking an error log corresponding to a program error trigger point according to the program error when the program error of the process of the tested component is detected;
the type determining module is used for determining the error type according to the error log;
the marking module is used for marking the process with the program error once and classifying the error type of the program error of the process;
the statistical module is used for counting the error marking times of the process of the tested assembly and various error types of the process to obtain a statistical result;
and the recording module is used for sending the statistical result to an operation maintenance terminal of the big data cluster for recording.
6. The system of claim 5, wherein the recording module comprises:
the comparison module is used for controlling the operation maintenance terminal, after it receives the statistical result, to compare the statistical result with the error threshold preset in the operation maintenance terminal for each tested component;
and the warning module is used for causing the operation maintenance terminal to issue an alarm prompt when either the number of error marks or the number of error types in the statistical result of a tested component exceeds the corresponding error mark count index or error type number index in the error threshold.
CN202011225109.7A 2020-11-05 2020-11-05 Operation maintenance method and system for big data cluster Active CN112445641B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011225109.7A CN112445641B (en) 2020-11-05 2020-11-05 Operation maintenance method and system for big data cluster

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011225109.7A CN112445641B (en) 2020-11-05 2020-11-05 Operation maintenance method and system for big data cluster

Publications (2)

Publication Number Publication Date
CN112445641A (en) 2021-03-05
CN112445641B (en) 2022-08-26

Family

ID=74736422

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011225109.7A Active CN112445641B (en) 2020-11-05 2020-11-05 Operation maintenance method and system for big data cluster

Country Status (1)

Country Link
CN (1) CN112445641B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101990228A (en) * 2010-11-17 2011-03-23 中兴通讯股份有限公司 Communication network equipment fault detection method and device
CN109477863A (en) * 2016-06-13 2019-03-15 电网监控有限公司 Method and system for the Dynamic Fault-Detection in power network

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10846165B2 (en) * 2018-05-17 2020-11-24 Micron Technology, Inc. Adaptive scan frequency for detecting errors in a memory system
CN109960690A (en) * 2019-03-18 2019-07-02 新华三大数据技术有限公司 A kind of operation and maintenance method and device of big data cluster
CN111143142B (en) * 2019-12-26 2021-05-04 江南大学 Universal check point and rollback recovery method
CN111104243B (en) * 2019-12-26 2021-05-28 江南大学 Low-delay dual-mode lockstep soft error-tolerant processor system
CN111124720B (en) * 2019-12-26 2021-05-04 江南大学 Self-adaptive check point interval dynamic setting method

Also Published As

Publication number Publication date
CN112445641A (en) 2021-03-05

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant