HW_RAS-tutorial-HPCA-2025

As cloud services become increasingly integral to modern computing, the reliability of underlying hardware components is critical to maintaining service continuity and performance. This tutorial delves into the advanced methodologies for predicting hardware failures to cloud service reliability.

We begin with an overview of hardware failures in data centers, followed by machine learning-based failure prediction methods designed for different hardware components. Key topics include hardware metrics for predictive analysis, feature engineering, algorithm selection, evaluation, and production deployment. In addition to knowledge share, the tutorial offers a handson Memory Failure Prediction competition, where participants apply their techniques to real-world problems. This tutorial is beneficial for researchers and engineers from both academia and industry, providing them with useful tools and necessary knowledge to enhance hardware resilience and cloud service reliability.

Name		Name	Last commit message	Last commit date
Latest commit History 64 Commits
.idea		.idea
_data		_data
_includes		_includes
_layouts		_layouts
_portfolio		_portfolio
_sass		_sass
assets		assets
doc		doc
.gitattributes		.gitattributes
.gitignore		.gitignore
404.html		404.html
Gemfile		Gemfile
LICENSE.txt		LICENSE.txt
README.md		README.md
_config.yml		_config.yml
index.md		index.md
jekyll-agency.gemspec		jekyll-agency.gemspec
legal.md		legal.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

HW_RAS-tutorial-HPCA-2025

About

Releases

Packages

Contributors 3

Languages

License

hyxie2023/Hw_RAS-2025.github.io

Folders and files

Latest commit

History

Repository files navigation

HW_RAS-tutorial-HPCA-2025

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages