O'Reilly book - Building Machine Learning Systems with a feature store: batch, real-time, and LLMs
Dashboards for Example ML Systems
| Course | MLOps | LLMs | Feature/Training/Inference | Working AI Systems | Focus |
|--------|-------|------|----------------------------|--------------------|-------|
| Building AI Systems (O'Reilly) | Yes | Fine-Tuning & RAG | Yes | High | Project-based, Software Engineering, Fundamentals |
| Made With ML | No | Yes | No | No | Software Engineering, Model Training |
| 7 Steps MLOps | Yes | Separate Course | Yes | Low | Learning Tools and Project |
This project is part of the ID2223 Machine Learning course at KTH for HT2024. The goal of the lab is to build a serverless AI system that predicts air quality (PM2.5) for a specified location, using Hopsworks, feature engineering, and machine learning model development.
The system ingests air quality and weather data, stores them as feature groups in Hopsworks, trains a machine learning model on those features, and visualizes the results on Hopsworks.
This project uses the following data sources:
- Air Quality Data from AQICN
- Weather Data from Open-Meteo
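
As a rough sketch of how these two sources might be queried, the snippet below builds an Open-Meteo forecast request URL and pulls the PM2.5 reading out of an AQICN-style JSON response. The coordinates, the choice of daily variables, the helper names, and the sample payload are assumptions for illustration and are not taken from this project; check both providers' API documentation before relying on the exact parameters.

```python
from urllib.parse import urlencode

# Assumed public endpoint for Open-Meteo forecasts.
OPEN_METEO_FORECAST = "https://api.open-meteo.com/v1/forecast"


def open_meteo_forecast_url(latitude, longitude, forecast_days=7):
    """Build a request URL for daily weather forecasts (hypothetical helper)."""
    params = {
        "latitude": latitude,
        "longitude": longitude,
        # Assumed daily variables; pick the ones your model uses.
        "daily": "temperature_2m_mean,precipitation_sum,wind_speed_10m_max",
        "forecast_days": forecast_days,
    }
    # Keep commas unescaped so the variable list stays readable.
    return f"{OPEN_METEO_FORECAST}?{urlencode(params, safe=',')}"


def extract_pm25(aqicn_payload):
    """Return the PM2.5 value from an AQICN-style response, or None if absent."""
    if aqicn_payload.get("status") != "ok":
        return None
    reading = aqicn_payload.get("data", {}).get("iaqi", {}).get("pm25")
    return None if reading is None else float(reading["v"])


# Made-up payload shaped like an AQICN feed response.
sample = {"status": "ok", "data": {"iaqi": {"pm25": {"v": 42}}}}
print(open_meteo_forecast_url(59.33, 18.07))
print(extract_pm25(sample))
```

Building the URL separately from sending the request keeps the network call out of the parsing logic, which makes both parts easy to check offline.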
The complete workflow includes data loading, feature engineering, training a model, making batch predictions, and visualization of results.
- Data and Feature Engineering:
  - Daily Feature Pipeline: Automatically gets daily weather and air quality data and stores them as Feature Groups in the Hopsworks Feature Store.
  - Backfill Pipeline: Backfills historical data from the selected sensor to create a large dataset.
- Model Training:
  - A regression model (`XGBRegressor`) is trained using weather and air quality features to predict PM2.5 levels.
  - Features are extracted from a Feature View created in the Hopsworks platform.
- Batch Inference:
  - The trained model is used to predict PM2.5 levels for the next 7-10 days using weather forecast data.
  - The results are visualized using a dashboard.
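
The data preparation and training stages above can be sketched in a few lines. This is a minimal illustration, not the project's actual pipeline: the record fields (`date`, `pm25`, `temperature`, `wind_speed`), the helper names, and the chronological split are assumptions, and the Hopsworks Feature View is replaced by plain Python lists so the example stays self-contained. `xgboost` and `numpy` are imported lazily inside `train_model`, so the data-preparation helpers run without them.

```python
def join_by_date(air_quality_rows, weather_rows):
    """Join air quality and weather records that share the same date."""
    weather_by_date = {row["date"]: row for row in weather_rows}
    joined = []
    for aq in air_quality_rows:
        weather = weather_by_date.get(aq["date"])
        if weather is not None:  # skip days with no weather record
            joined.append({**weather, **aq})
    # ISO date strings sort chronologically.
    return sorted(joined, key=lambda row: row["date"])


def chronological_split(rows, train_fraction=0.8):
    """Split time-ordered rows so the test set is strictly later than training."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def to_xy(rows, feature_names, target="pm25"):
    """Turn row dicts into a feature matrix and a target vector."""
    X = [[row[name] for name in feature_names] for row in rows]
    y = [row[target] for row in rows]
    return X, y


def train_model(X_train, y_train):
    """Fit an XGBRegressor; requires the third-party xgboost package."""
    import numpy as np  # lazy imports, see the note above
    from xgboost import XGBRegressor

    model = XGBRegressor()
    model.fit(np.asarray(X_train), np.asarray(y_train))
    return model


# Made-up records for illustration only.
air = [{"date": "2024-11-02", "pm25": 12.0}, {"date": "2024-11-01", "pm25": 9.0}]
weather = [
    {"date": "2024-11-01", "temperature": 4.1, "wind_speed": 12.0},
    {"date": "2024-11-02", "temperature": 3.5, "wind_speed": 20.0},
]
rows = join_by_date(air, weather)
train_rows, test_rows = chronological_split(rows, train_fraction=0.5)
X_train, y_train = to_xy(train_rows, ["temperature", "wind_speed"])
```

Splitting by time rather than at random keeps future observations out of the training set, which matters when the model is later used to forecast.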
This project is divided into several main components:
- Data Handling: Downloads and processes historical air quality and weather data. Runs daily to get new data from AQICN and Open-Meteo.
- Model Training Pipeline: Trains a machine learning model to predict PM2.5 levels based on features from air quality and weather data.
- Batch Inference: Runs predictions on new data and saves the results for visualization.
- Dashboard: A public dashboard visualizing air quality predictions for a selected location.
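
A minimal sketch of the batch inference step, under assumptions: `model` is any object with a scikit-learn-style `predict` method (a fitted `XGBRegressor` has one), and the forecast rows, field names, and output record layout are hypothetical. The stub model in the usage lines exists only to make the example runnable without a trained model.

```python
def predict_pm25(model, forecast_rows, feature_names):
    """Predict PM2.5 for each forecast day and return records ready to store."""
    features = [[row[name] for name in feature_names] for row in forecast_rows]
    predictions = model.predict(features)
    return [
        {"date": row["date"], "predicted_pm25": float(pred)}
        for row, pred in zip(forecast_rows, predictions)
    ]


class ConstantModel:
    """Stand-in for a trained model; always predicts the same value."""

    def __init__(self, value):
        self.value = value

    def predict(self, features):
        return [self.value for _ in features]


# Made-up weather forecast rows for illustration only.
forecast = [
    {"date": "2024-11-20", "temperature": 2.0, "wind_speed": 15.0},
    {"date": "2024-11-21", "temperature": 1.5, "wind_speed": 9.0},
]
results = predict_pm25(ConstantModel(10.0), forecast, ["temperature", "wind_speed"])
print(results)
```

With a real XGBoost model, you may need to convert `features` to a NumPy array before calling `predict`. Keeping the date on each output record lets the dashboard later line predictions up against observed values.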
To set up and run the system, you need the following:
- A Hopsworks account at hopsworks.ai.
- A GitHub account to manage the code.
- The trained model predicts PM2.5 levels from weather conditions and historical air quality data.
- Model performance is evaluated using metrics such as Mean Squared Error (MSE).
- Hindcast graphs compare past predictions with the values that were later observed.
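
For reference, MSE averages the squared differences between observed and predicted values, and a hindcast pairs each past prediction with the value later observed for the same date. The sketch below shows both with made-up numbers; the function names and data layout are assumptions, not the project's code.

```python
def mean_squared_error(observed, predicted):
    """Average of squared differences between observed and predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


def hindcast_pairs(predictions_by_date, observations_by_date):
    """Pair predictions with observations for dates present in both."""
    shared = sorted(set(predictions_by_date) & set(observations_by_date))
    return [(d, predictions_by_date[d], observations_by_date[d]) for d in shared]


# Made-up values for illustration only.
predicted = {"2024-11-01": 10.0, "2024-11-02": 14.0, "2024-11-03": 8.0}
observed = {"2024-11-01": 12.0, "2024-11-02": 13.0}
pairs = hindcast_pairs(predicted, observed)
mse = mean_squared_error([o for _, _, o in pairs], [p for _, p, _ in pairs])
print(pairs, mse)
```

Dates that have a prediction but no observation yet (here `2024-11-03`) are left out of the comparison, so the score only reflects days that can actually be checked.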