DOI: 10.1145/3526064.3534108
research-article
Open access

Redfish-Nagios: A Scalable Out-of-Band Data Center Monitoring Framework Based on Redfish Telemetry Model

Published: 27 June 2022

Abstract

Current monitoring tools for high-performance computing (HPC) systems often scale poorly and interface poorly with modern data center management APIs, which hampers effective management of modern data center infrastructure. Nagios is a widely used industry-standard tool for data center infrastructure monitoring, which mainly includes monitoring of nodes and their associated hardware and software components. However, current Nagios monitoring has special requirements that introduce several limitations. First, significant human effort is needed to configure monitored nodes in the Nagios server. Second, the Nagios Remote Plugin Executor and the Nagios Service Check Acceptor are required on the Nagios server and on each monitored node for active and passive monitoring, respectively. Third, Nagios monitoring also requires monitoring-specific agents on each monitored node. These shortcomings stem from the in-band nature of Nagios' implementation. To overcome these limitations, we introduce Redfish-Nagios, a scalable out-of-band monitoring tool for modern HPC systems. It integrates the Nagios server with the out-of-band Redfish telemetry model of the Distributed Management Task Force (DMTF), which is implemented in the baseboard management controller (BMC) of each node. This integration eliminates the need for any agent, plugin, hardware component, or configuration on the monitored nodes. It is potentially a paradigm shift in Nagios-based monitoring for two reasons. First, it simplifies communication between the Nagios server and the monitored nodes. Second, it saves computational cost by removing the need to run complex Nagios-native protocols and agents on the monitored nodes. The Redfish-Nagios integration methodology enables monitoring of next-generation HPC systems using the scalable and modern Redfish telemetry model and interface.
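
To make the out-of-band idea concrete, the following is a minimal sketch, not the authors' implementation, of how a check running on the Nagios server might translate a Redfish resource returned by a node's BMC into a Nagios result. It assumes the resource carries the standard Redfish `Status.Health` property (`OK`, `Warning`, `Critical`) and uses the conventional Nagios plugin return codes (0 to 3). The function name `check_health`, the label text, and the sample payload are hypothetical, and the HTTPS request to the BMC is left out so the sketch stays self-contained.

```python
import json

# Conventional Nagios plugin return codes.
NAGIOS_OK, NAGIOS_WARNING, NAGIOS_CRITICAL, NAGIOS_UNKNOWN = 0, 1, 2, 3

# Redfish Status.Health values mapped to Nagios states.
HEALTH_TO_NAGIOS = {
    "OK": (NAGIOS_OK, "OK"),
    "Warning": (NAGIOS_WARNING, "WARNING"),
    "Critical": (NAGIOS_CRITICAL, "CRITICAL"),
}


def check_health(resource, label="BMC health"):
    """Return (exit_code, status_line) for a decoded Redfish resource.

    `resource` is the JSON body of a Redfish resource (for example a
    ComputerSystem) already parsed into a dict. Anything without a
    recognizable Status.Health is reported as UNKNOWN.
    """
    status = resource.get("Status") if isinstance(resource, dict) else None
    health = status.get("Health") if isinstance(status, dict) else None
    if not isinstance(health, str):
        health = None
    code, state = HEALTH_TO_NAGIOS.get(health, (NAGIOS_UNKNOWN, "UNKNOWN"))
    shown = health if health is not None else "not reported"
    return code, f"{state} - {label}: {shown}"


if __name__ == "__main__":
    # Hypothetical payload standing in for a BMC response.
    sample = json.loads(
        '{"Id": "1", "Status": {"State": "Enabled", "Health": "Warning"}}'
    )
    code, line = check_health(sample)
    print(line)
    raise SystemExit(code)
```

Because the mapping runs entirely on the server side, nothing needs to be installed on the monitored node in this sketch; only the BMC's Redfish endpoint has to be reachable.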

Cited By

  • (2023) Real-Time Monitoring and Management of Hardware and Software Resources in Heterogeneous Computer Networks through an Integrated System Architecture. Symmetry 15:6 (1134). DOI: 10.3390/sym15061134. Online publication date: 23-May-2023
  • (2023) Incremental Hibernation for Baseboard Management Controllers. Proceedings of the 2023 International Conference on Research in Adaptive and Convergent Systems (1-7). DOI: 10.1145/3599957.3606223. Online publication date: 6-Aug-2023

          Published In

          SNTA '22: Fifth International Workshop on Systems and Network Telemetry and Analytics
          June 2022
          62 pages
          ISBN:9781450393157
          DOI:10.1145/3526064
          This work is licensed under a Creative Commons Attribution International 4.0 License.

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Author Tags

          1. agent-less monitoring
          2. automation
          3. data center
          4. dmtf redfish telemetry
          5. high-performance computing
          6. in-band monitoring
          7. nagios
          8. out-of-band monitoring

          Qualifiers

          • Research-article

          Conference

          HPDC '22

          Acceptance Rates

          Overall Acceptance Rate 22 of 106 submissions, 21%
