How to Monitor the $H!T out of Hadoop
Developing a comprehensive open approach to monitoring Hadoop clusters
Relevant Hadoop Information
  • From 3 to 3000 nodes
  • Hardware/software failures are “common”
  • Redundant components: DataNode, TaskTracker
  • Non-redundant components: NameNode, JobTracker, SecondaryNameNode
  • Fast-evolving technology (best practices?)
Monitoring Software
  • Nagios
      • Red / Yellow / Green status
      • Alerts, escalations
      • De facto standard, widely deployed
      • Text-based configuration
      • Web interface
      • Pluggable with shell scripts / external apps
      • A plugin returns 0 for OK
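The "Return 0 - OK" convention can be sketched as follows. Nagios plugins conventionally exit with 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN) and print a one-line status. The helper name `plugin_result` is hypothetical and only illustrates the convention.

```python
from enum import IntEnum
from typing import Tuple


class NagiosStatus(IntEnum):
    """Exit codes that Nagios conventionally expects from a plugin."""
    OK = 0        # shown green
    WARNING = 1   # shown yellow
    CRITICAL = 2  # shown red
    UNKNOWN = 3   # the check itself could not run


def plugin_result(status: NagiosStatus, message: str) -> Tuple[int, str]:
    """Return the (exit code, status line) pair a plugin would emit.

    Hypothetical helper: a real plugin would print the line and then
    call sys.exit() with the code.
    """
    return int(status), f"{status.name} - {message}"
```

A real check script would end with something like `code, line = plugin_result(...)`, then `print(line)` and `sys.exit(code)`.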
Cacti
  • Performance graphing system
  • RRD/RRA front end
  • Slick web interface
  • Template system for graph types
  • Pluggable
      • SNMP input
      • Shell script / external program
hadoop-cacti-jtg
  • JMX fetching code with (kick-off) scripts
  • Cacti templates for Hadoop
  • Premade Nagios check scripts
  • Helper / batch / automation scripts
  • Apache License
Hadoop JMX
Sample Cluster, Part 1
  • NameNode & SecondaryNameNode
      • Hardware RAID
      • 8 GB RAM
      • 1x quad core
      • DerbyDB (Hive) on SecondaryNameNode
  • JobTracker
      • 8 GB RAM
      • 1x quad core
Sample Cluster, Part 2
  • Slave (hadoopdata1-XXXX)
      • JBOD: 8x 1 TB SATA disks
      • 16 GB RAM
      • 2x quad core
Prerequisites
  • Nagios (install): DAG RPMs
  • Cacti (install): several RPMs
  • Liberal network access to the cluster
Alerts & Escalations
  • X nodes * Y services = less sleep
  • Define a policy
      • Wake Me Up’s (SMS)
      • Don’t Wake Me Up’s (email)
      • Review (daily, weekly, monthly)
Wake Me Up’s
  • NameNode
      • Disk full (big, big headache)
      • RAID array issues (failed disk)
  • JobTracker
  • SecondaryNameNode
      • You do not realize it is not working until it is too late
Don’t Wake Me Up’s
  • Or ‘wake someone else up’
  • DataNode
      • Warning: currently a failed disk will take down the DataNode (see Jira)
  • TaskTracker
  • Hardware
      • Bad disk (start RMA)
  • Slaves are expendable (up to a point)
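One way to write the SMS/email policy down is as a small lookup, sketched here. The component-to-channel mapping follows the two slides above. The function names and the 10% cutoff for "up to a point" are assumptions for illustration, not values from the talk.

```python
# Components whose failure should page someone (SMS), per the slides.
WAKE_ME_UP = {"NameNode", "JobTracker", "SecondaryNameNode"}
# Redundant components whose failure can wait for email.
DONT_WAKE_ME_UP = {"DataNode", "TaskTracker"}


def notification_channel(component: str) -> str:
    """Map a failed Hadoop component to a notification channel."""
    if component in WAKE_ME_UP:
        return "sms"
    if component in DONT_WAKE_ME_UP:
        return "email"
    raise ValueError(f"no policy defined for component: {component}")


def slave_failure_channel(failed: int, total: int,
                          max_failed_fraction: float = 0.10) -> str:
    """Escalate slave failures to SMS once too many are down.

    The 10% default is an assumed cutoff; pick one that fits the
    cluster's replication factor and spare capacity.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return "sms" if failed / total > max_failed_fraction else "email"
```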
Monitoring Battle Plan
  • Start with the basics: ping, disk
  • Add Hadoop-specific alarms: check_data_node
  • Add JMX graphing: NameNodeOperations
  • Add JMX-based alarms: FilesTotal > 1,000,000 or LiveNodes < 50%
The Basics: Nagios
  • Nagios (all nodes)
      • Host up (ping check)
      • Disk % full
      • Swap > 85%
  • Load-based alarms are somewhat useless: 389% CPU load is not necessarily a bad thing in Hadoopville
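A "disk % full" check can be sketched in a few lines. The 80% and 90% thresholds and the function names are assumptions (the slide only gives the 85% figure for swap). In practice the stock Nagios disk plugin would normally do this job.

```python
import shutil
from typing import Tuple


def percent_used(used: int, total: int) -> float:
    """Percentage of capacity in use."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * used / total


def classify(percent: float, warn: float, crit: float) -> int:
    """Map a percentage to a Nagios exit code (0 OK, 1 WARNING, 2 CRITICAL)."""
    if percent >= crit:
        return 2
    if percent >= warn:
        return 1
    return 0


def check_disk(path: str = "/", warn: float = 80.0,
               crit: float = 90.0) -> Tuple[int, str]:
    """Check how full the filesystem holding `path` is.

    The warn/crit defaults are assumed values, not from the talk.
    """
    usage = shutil.disk_usage(path)
    pct = percent_used(usage.used, usage.total)
    return classify(pct, warn, crit), f"{path} is {pct:.1f}% full"
```

The same `classify` helper covers the swap rule by calling it with the measured swap percentage and a threshold of 85.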
The Basics: Cacti
  • Cacti (all nodes)
      • CPU (full CPU)
      • RAM/swap
      • Network
      • Disk usage
Disk Utilization
RAID Tools
  • hpacucli: not a Street Fighter move
  • Alerts on RAID events (NameNode)
      • Disk failed
      • Rebuilding
  • JBOD (DataNode)
      • Failed drive
      • Drive errors
  • Dell, Sun, vendor-specific tools
Before You Jump In
  • X nodes * Y checks = lots of work
  • About 3 nodes into the process … Wait!!! I need some interns!!!
  • Solution: S.I.C.C.T.
      • Semi-Intelligent Configuration Cloning Tools
      • (I made that up for this presentation)
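A configuration-cloning tool can be as simple as stamping out one Nagios service block per host, as in this sketch. The service fields mirror the `define service` example in this deck. The command name `check_remote_datanode` and the host names are hypothetical, and this is not the author's actual tooling.

```python
from typing import Iterable

# One Nagios service definition; doubled braces are literal braces.
SERVICE_TEMPLATE = """define service {{
    service_description  {command}
    use                  generic-service
    host_name            {host}
    check_command        {command}!{port}
}}
"""


def clone_services(hosts: Iterable[str], command: str, port: int) -> str:
    """Render one service definition per host and join them."""
    return "\n".join(
        SERVICE_TEMPLATE.format(host=host, command=command, port=port)
        for host in hosts
    )


# Hypothetical host names; 50075 is the DataNode web port.
example_cfg = clone_services(
    ["hadoopdata1", "hadoopdata2"], "check_remote_datanode", 50075
)
```

Writing `example_cfg` to a file under the Nagios configuration directory replaces hand-editing X nodes * Y checks entries.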
Nagios
  • Answers “IS IT RUNNING?”
  • Text-based configuration
Cacti
  • Answers “HOW WELL IS IT RUNNING?”
  • Web-based configuration
  • php-cli tools
Monitoring Battle Plan Thus Far
  • Start with the basics: ping, disk !!!!!!Done!!!!!!
  • Add Hadoop-specific alarms: check_data_node
  • Add JMX graphing: NameNodeOperations
  • Add JMX-based alarms: FilesTotal > 1,000,000 or LiveNodes < 50%
Add Hadoop-Specific Alarms
  • Hadoop components with a web interface
      • NameNode: 50070
      • JobTracker: 50030
      • TaskTracker: 50060
      • DataNode: 50075
  • check_http + regex = simple + effective
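The "check_http + regex" idea is to fetch the component's status page and confirm that an expected string appears in it. The sketch below is a Python stand-in for that logic, not the check_http plugin itself, and the function names are assumptions.

```python
import http.client
import re
import urllib.request
from typing import Tuple


def body_matches(body: str, pattern: str) -> bool:
    """True if the regular expression occurs anywhere in the page body."""
    return re.search(pattern, body) is not None


def check_http_regex(url: str, pattern: str,
                     timeout: float = 10.0) -> Tuple[int, str]:
    """Fetch `url` and return a Nagios-style (exit code, message) pair.

    Any fetch failure or a missing pattern is reported as CRITICAL (2).
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
    except (OSError, http.client.HTTPException) as exc:
        return 2, f"CRITICAL - cannot fetch {url}: {exc}"
    if body_matches(body, pattern):
        return 0, f"OK - pattern '{pattern}' found at {url}"
    return 2, f"CRITICAL - pattern '{pattern}' not found at {url}"
```

For the NameNode this would be called as `check_http_regex("http://hadoopname1:50070/dfshealth.jsp", "NameNode")`, using the host, port, and page named elsewhere in the deck.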
nagios_check_commands.cfg
  • Component failure
  • (Future) Newer Hadoop will have XML status

define command {
    command_name  check_remote_namenode
    command_line  $USER1$/check_http -H $HOSTADDRESS$ -u http://$HOSTADDRESS$:$ARG1$/dfshealth.jsp -p $ARG1$ -r NameNode
}

define service {
    service_description  check_remote_namenode
    use                  generic-service
    host_name            hadoopname1
    check_command        check_remote_namenode!50070
}
Monitoring Battle Plan
  • Start with the basics: ping, disk (Done)
  • Add Hadoop-specific alarms: check_data_node (Done)
  • Add JMX graphing: NameNodeOperations
  • Add JMX-based alarms: FilesTotal > 1,000,000 or LiveNodes < 50%
JMX Graphing
  • Enable JMX
  • Import templates
Standard Java JMX
Monitoring Battle Plan Thus Far
  • Start with the basics: ping, disk !!!!!!Done!!!!!
  • Add Hadoop-specific alarms: check_data_node !Done!
  • Add JMX graphing: NameNodeOperations !Done!
  • Add JMX-based alarms: FilesTotal > 1,000,000 or LiveNodes < 50%
Add JMX-Based Alarms
  • hadoop-cacti-jtg is flexible
      • Extend the fetch classes
      • Don’t call output()
      • Write your own check logic
Quick JMX Base Walkthrough
  • url, user, pass, and object are specified from the CLI
  • wantedVariables and wantedOperations are supplied by inheritance
  • fetch() and output() are provided
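The pattern described above (a base class that provides fetch() and output(), with subclasses that only name the variables they want) can be sketched as follows. This is a Python analogue for illustration only, not hadoop-cacti-jtg's actual Java classes. The attribute reader is injected as a plain function because Python has no built-in JMX client. `BlocksTotal` and the class names are assumed for the example.

```python
from typing import Callable, Dict, List


class JMXBase:
    """Analogue of the base fetcher: subclasses list what they want."""

    wanted_variables: List[str] = []

    def __init__(self, read_attribute: Callable[[str], float]) -> None:
        # read_attribute stands in for a real JMX connection (assumption).
        self._read = read_attribute
        self.values: Dict[str, float] = {}

    def fetch(self) -> Dict[str, float]:
        """Read every wanted variable and remember the values."""
        self.values = {name: self._read(name) for name in self.wanted_variables}
        return self.values

    def output(self) -> str:
        """Format values as space-separated name:value pairs for Cacti."""
        return " ".join(f"{name}:{value}" for name, value in self.values.items())


class NameNodeStatus(JMXBase):
    """Subclass that only declares which attributes to fetch."""

    wanted_variables = ["FilesTotal", "BlocksTotal"]
```

A graphing script calls fetch() and then prints output(). An alarm script calls fetch() only and applies its own logic to `values`.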
Extend for NameNode
Extend for Nagios
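The check logic for the battle plan's two alarm conditions (FilesTotal > 1,000,000 or LiveNodes < 50%) might look like the sketch below. It assumes the values have already been fetched over JMX. The function name and the choice to treat both conditions as CRITICAL are assumptions.

```python
from typing import Tuple


def check_namenode(files_total: int, live_nodes: int, total_nodes: int,
                   max_files: int = 1_000_000,
                   min_live_fraction: float = 0.5) -> Tuple[int, str]:
    """Return a Nagios-style (exit code, message) pair for the NameNode.

    Exit codes: 0 OK, 2 CRITICAL, 3 UNKNOWN.
    """
    if total_nodes <= 0:
        return 3, "UNKNOWN - total node count is not positive"

    problems = []
    if files_total > max_files:
        problems.append(f"FilesTotal {files_total} > {max_files}")
    live_fraction = live_nodes / total_nodes
    if live_fraction < min_live_fraction:
        problems.append(f"LiveNodes {live_nodes}/{total_nodes} below "
                        f"{min_live_fraction:.0%}")

    if problems:
        return 2, "CRITICAL - " + "; ".join(problems)
    return 0, f"OK - FilesTotal {files_total}, LiveNodes {live_nodes}/{total_nodes}"
```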
Monitoring Battle Plan
  • Start with the basics: ping, disk !DONE!
  • Add Hadoop-specific alarms: check_data_node !DONE!
  • Add JMX graphing: NameNodeOperations !DONE!
  • Add JMX-based alarms: FilesTotal > 1,000,000 or LiveNodes < 50% !DONE!
Review
  • File system growth
      • Size
      • Number of files
      • Number of blocks
      • Ratios
  • Utilization
      • CPU/memory
      • Disk
  • Email (nightly)
      • fsck
      • dfsadmin
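The growth and ratio figures in a periodic review are simple arithmetic on the file system counters, sketched here. The function and key names are assumptions, and the sample numbers in the usage note are made up for illustration.

```python
from typing import Dict


def filesystem_ratios(size_bytes: int, files: int, blocks: int) -> Dict[str, float]:
    """Derive review ratios from size, file count, and block count."""
    if files <= 0 or blocks <= 0:
        raise ValueError("files and blocks must be positive")
    return {
        "blocks_per_file": blocks / files,
        "bytes_per_file": size_bytes / files,
        "bytes_per_block": size_bytes / blocks,
    }


def growth_percent(previous: float, current: float) -> float:
    """Percentage change between two review periods."""
    if previous <= 0:
        raise ValueError("previous must be positive")
    return 100.0 * (current - previous) / previous
```

For example, `growth_percent(800_000, 1_000_000)` gives 25.0, meaning the file count grew by a quarter since the last review. A low `bytes_per_file` is a hint that many small files are accumulating.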
The Future
  • JMX coming to JobTracker and TaskTracker (0.21)
  • Collect and graph jobs running
  • Collect and graph maps / reduces per node
  • Profile specific jobs in Cacti?

Hw09 Monitoring Best Practices
