A pure Python machine learning library built entirely from the ground up for educational purposes. Made to understand how ML really works!
Ever wondered what goes on inside those powerful machine learning libraries like Scikit-learn, PyTorch, or TensorFlow? How does a neural network actually learn? How is gradient descent implemented?
SmolML is our answer! It's a fully functional (though simplified) machine learning library built using only pure Python and its basic collections
, random
, and math
modules. No NumPy, no SciPy, no C++ extensions – just Python, all the way down.
The goal isn't to compete with production-grade libraries on speed or features, but to provide a transparent, understandable, and educational implementation of core machine learning concepts.
You can read these guides of the different sections of SmolML in any order, though this list presents the recommended order for learning.
- SmolML - Core: Automatic Differentiation & N-Dimensional Arrays
- SmolML - Regression: Predicting Continuous Values
- SmolML - Neural Networks: Backpropagation to the limit
- SmolML - Tree Models: Decisions, Decisions!
- SmolML - K-Means: Finding Groups in Your Data!
- SmolML - Preprocessing: Make your data meaningful
- SmolML - The utility room!
We believe the best way to truly understand complex topics like machine learning is often to build them yourself. Production libraries are fantastic tools, but their internal complexity and optimizations can sometimes obscure the fundamental principles.
SmolML strips away these layers to focus on the core ideas:
- Learning from First Principles: Every major component is built from scratch, letting you trace the logic from basic operations to complex algorithms.
- Demystifying the Magic: See how concepts like automatic differentiation (autograd), optimization algorithms, and model architectures are implemented in code.
- Minimal Dependencies: Relying only on Python's standard library makes the codebase accessible and easy to explore without external setup hurdles.
- Focus on Clarity: Code is written with understanding, not peak performance, as the primary goal.
In order to learn as much as possible, we recommend reading through the guides, checking the code, and then trying to implement your own versions of these components.
SmolML provides explains the essential building blocks for any ML library:
-
The Foundation: Custom Arrays & Autograd Engine:
- Automatic Differentiation (
Value
): A simple autograd engine that tracks operations and computes gradients automatically – the heart of training neural networks! - N-dimensional Arrays (
MLArray
): A custom array implementation inspired by NumPy, supporting common mathematical operations needed for ML. Extremely inefficient due to being written in Python, but ideal for understanding N-Dimensional Arrays.
- Automatic Differentiation (
-
Essential Preprocessing:
- Scalers (
StandardScaler
,MinMaxScaler
): Fundamental tools to prepare your data, because algorithms tend to perform better when features are on a similar scale.
- Scalers (
-
Build Your Own Neural Networks:
- Activation Functions: Non-linearities like
relu
,sigmoid
,softmax
,tanh
that allow networks to learn complex patterns. (Seesmolml/utils/activation.py
) - Weight Initializers: Smart strategies (
Xavier
,He
) to set initial network weights for stable training. (Seesmolml/utils/initializers.py
) - Loss Functions: Ways to measure model error (
mse_loss
,binary_cross_entropy
,categorical_cross_entropy
). (Seesmolml/utils/losses.py
) - Optimizers: Algorithms like
SGD
,Adam
, andAdaGrad
that update model weights based on gradients to minimize loss. (Seesmolml/utils/optimizers.py
)
- Activation Functions: Non-linearities like
-
Classic ML Models:
- Regression: Implementations of
Linear
andPolynomial
regression. - Neural Networks: A flexible framework for building feed-forward neural networks.
- Tree-Based Models:
Decision Tree
andRandom Forest
implementations for classification and regression. - K-Means:
KMeans
clustering algorithm for grouping similar data points together.
- Regression: Implementations of
- Students: Learning ML concepts for the first time.
- Developers: Curious about the internals of ML libraries they use daily.
- Educators: Looking for a simple, transparent codebase to demonstrate ML principles.
- Anyone: Who enjoys learning by building!
Let's be crystal clear: SmolML is built for learning, not for breaking speed records or handling massive datasets.
- Performance: Being pure Python, it's WAAAY slower than libraries using optimized C/C++/Fortran backends (like NumPy).
- Scale: It's best suited for small datasets and toy problems where understanding the mechanics is more important than computation time.
- Production Use: Do not use SmolML for production applications. Stick to battle-tested libraries like Scikit-learn, PyTorch, TensorFlow, JAX, etc., for real-world tasks.
Think of it as learning to build a go-kart engine from scratch before driving a Formula 1 car. It teaches you the fundamentals in a hands-on way!
The best way to use SmolML is to clone this repository and explore the code and examples (if available).
git clone https://github.com/rodmarkun/SmolML
cd SmolML
# Explore the code in the smolml/ directory!
You can also run the multiple tests in the tests/
folder. Just install the requirements.txt
(this is for comparing SmolML against another standard libraries like TensorFlow, sklearn, etc, and generate plots with matplotlib).
cd tests
pip install -r requirements
Contributions are always welcome! If you're interested in contributing to SmolML, please fork the repository and create a new branch for your changes. When you're done with your changes, submit a pull request to merge your changes into the main branch.
If you want to support SmolML, you can:
- Star ⭐ the project in Github!
- Donate 🪙 to my Ko-fi page!
- Share ❤️ the project with your friends!