This repository tracks my journey through the 100 Days of CUDA Challenge. Each day, I'll be coding CUDA kernels and documenting my progress.
The 100 Days of CUDA Challenge means writing CUDA kernels every day for 100 days without any gaps. It encourages learning and practicing GPU programming on NVIDIA's CUDA platform.
- Programming Massively Parallel Processors (PMPP) - Recommended book for learning CUDA
- Original Challenge Repository
- Reference Repositories and Hamdi's Repository
I'll be developing the code on my laptop, then running and testing it on a Jetson Nano.
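
As a quick sanity check for this setup, a minimal sketch along the following lines can confirm that the toolchain and the board work together before running any of the daily kernels. It prints the device name and compute capability, then runs a small vector addition and verifies the result on the host. The kernel name `vec_add`, the `CUDA_CHECK` macro, and the problem size are illustrative choices, not code taken from this repository, and the use of managed memory is an assumption made to keep the example short.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a readable message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "CUDA error: %s at %s:%d\n",             \
                         cudaGetErrorString(err_), __FILE__, __LINE__);   \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

// Each thread adds one pair of elements; the bounds check covers the last block.
__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

int main() {
    // Report which GPU the program is actually running on.
    cudaDeviceProp prop;
    CUDA_CHECK(cudaGetDeviceProperties(&prop, 0));
    std::printf("Device: %s (compute capability %d.%d)\n",
                prop.name, prop.major, prop.minor);

    const int n = 1 << 20;                 // illustrative problem size
    const size_t bytes = n * sizeof(float);

    // Managed memory (an assumption for brevity): one pointer usable on host and device.
    float *a = nullptr, *b = nullptr, *c = nullptr;
    CUDA_CHECK(cudaMallocManaged(&a, bytes));
    CUDA_CHECK(cudaMallocManaged(&b, bytes));
    CUDA_CHECK(cudaMallocManaged(&c, bytes));

    for (int i = 0; i < n; ++i) {
        a[i] = static_cast<float>(i);
        b[i] = 2.0f * static_cast<float>(i);
    }

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;  // round up to cover all n elements
    vec_add<<<blocks, threads>>>(a, b, c, n);
    CUDA_CHECK(cudaGetLastError());        // catches launch configuration errors
    CUDA_CHECK(cudaDeviceSynchronize());   // wait before touching results on the host

    // Verify on the host against the same float arithmetic.
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (c[i] != a[i] + b[i]) {
            ++mismatches;
        }
    }
    std::printf("Mismatches: %d\n", mismatches);

    CUDA_CHECK(cudaFree(a));
    CUDA_CHECK(cudaFree(b));
    CUDA_CHECK(cudaFree(c));
    return mismatches == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
}
```

Assuming the file is saved as `vec_add.cu` (a hypothetical name) and the board is an original Jetson Nano with compute capability 5.3, it could be built on the device with `nvcc -arch=sm_53 vec_add.cu -o vec_add`. Other Jetson models need a different `-arch` value, which the printed compute capability helps identify.
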
Day | Date | Description | Code |
---|---|---|---|
1 | 2025-03-10 | Getting Started with CUDA - Vector Addition | Link |
2 | 2025-03-11 | Matrix Addition in CUDA | Link |
3 | 2025-03-12 | Matrix Multiplication in CUDA | Link |
4 | 2025-03-13 | Parallel Reduction - Partial Sum | Link |
5 | 2025-03-14 | Layer Normalization in CUDA | Link |
6 | 2025-03-15 | Matrix Transpose with CPU/GPU Benchmarking | Link |
7 | 2025-03-16 | 1D & 2D Convolution in CUDA | Link |
8 | 2025-03-17 | Parallel Prefix Sum - Exclusive Scan | Link |
9 | 2025-03-18 | Flash Attention Forward Pass | Link |
10 | 2025-03-19 | Sparse Matrix-Vector Multiplication (SpMV) | Link |
11 | 2025-03-20 | Merge Sort with CUDA | Link |
12 | 2025-03-21 | Breadth-First Search (BFS) with CUDA | Link |
13 | 2025-03-22 | Optimized BFS with Shared Memory | Link |
14 | 2025-03-23 | Fractional Hausdorff Distance (FHD) for Image Processing | Link |
15 | 2025-03-24 | Convolutional Neural Network (CNN) in CUDA | Link |
16 | 2025-03-25 | Parallel Particle System Simulation | Link |
17 | 2025-03-26 | Naive Bayes Classifier Training | Link |
18 | 2025-03-27 | Matrix Multiplication using cuBLAS | Link |
19 | 2025-03-28 | Fast Fourier Transform (FFT) Implementation | Link |
20 | 2025-03-29 | Monte Carlo Option Pricing with CUDA | Link |
21 | 2025-03-30 | Particle Swarm Optimization (PSO) with CUDA | Link |
22 | 2025-03-31 | CUDA-accelerated Reinforcement Learning (Q-Learning) | Link |
23 | 2025-04-01 | Genetic Algorithm Optimization with CUDA | Link |
24 | 2025-04-02 | Gated Linear Unit (GLU) Implementation | Link |
25 | 2025-04-03 | Parallel Point Cloud PassThrough Filter | Link |
26 | 2025-04-04 | Kernel Density Estimation (KDE) | Link |
27 | 2025-04-05 | Mirror Descent (STE) for Quantization | Link |
28 | 2025-04-06 | Mini-Batch SGD for Linear Regression | Link |
29 | 2025-04-07 | K-Means Assignment Step (File Input) | Link |
30 | 2025-04-08 | Headless Camera Processing (Grayscale + Avg Intensity) | Link |
31 | 2025-04-09 | 2D Heat Simulation (Basic vs Shared Memory) | Link |
32 | 2025-04-10 | CUDA Streams for Overlap (Matrix Multiply) | Link |
33 | 2025-04-11 | Parallel Reduction Optimization (Warp Shuffle) | Link |
34 | 2025-04-12 | Point Cloud Voxel Grid Filter (Atomics) | Link |
35 | 2025-04-13 | Kalman Filter Prediction Step (cuBLAS) | Link |
36 | 2025-04-14 | SpMV with cuSPARSE | Link |
37 | 2025-04-15 | Simple NN Forward Pass (GEMM + Activation) | Link |
38 | 2025-04-16 | Batch Normalization Kernel (Forward Pass) | Link |
39 | 2025-04-17 | Thrust Library Basics | Link |
40 | 2025-04-18 | Image Interpolation (Texture Memory) | Link |
41 | 2025-04-19 | Parallel Radix Sort (Basic Single Pass) | Link |
42 | 2025-04-20 | N-Body Simulation Optimization (Shared Memory) | Link |
43 | 2025-04-21 | Simple cuDNN Convolution (Forward) | Link |
44 | 2025-04-22 | Occupancy Grid Mapping Update | Link |
45 | 2025-04-23 | Optical Flow Gradient Step (Lucas-Kanade) | Link |
46 | 2025-04-24 | Simple Backpropagation Step (Fully Connected Layer) | Link |
47 | 2025-04-25 | Dynamic Parallelism (Simple Example) | Link |
48 | 2025-04-26 | Parallel AABB Collision Detection | Link |
49 | 2025-04-27 | Mini-Project: Perception Pipeline (Grayscale -> Blur -> Sobel -> Reduction) | Link |
50 | 2025-04-28 | Unit Testing CUDA Kernels with Google Test | Link |
51 | 2025-04-29 | Exploring TensorRT (Simple ONNX Inference) | Link |
52 | 2025-04-30 | Minimal GRU (minGRU) with Parallel Scan | Link |
53 | 2025-05-01 | Bidirectional LSTM Implementation | Link |
54 | 2025-05-02 | AdaHessian Optimizer Kernel | Link |
55 | 2025-05-03 | Quantization Comparison (FP32/FP16/SimFP8) | Link |
56 | 2025-05-04 | Mish Activation Function Benchmark | Link |
57 | 2025-05-05 | Conjugate Gradient Method (CGM) using cuBLAS | Link |
58 | 2025-05-06 | Bitonic Sort with Shared Memory Optimization | Link |
59 | 2025-05-07 | Basic Ray Tracing with CUDA | Link |
60 | 2025-05-08 | Muon Optimization - Newton-Schulz Iteration | Link |
61 | 2025-05-09 | Fisher Information Matrix | Link |
62 | 2025-05-10 | Batched Vector L2 Norm (Shared Memory Reduction) | Link |
63 | 2025-05-11 | Parallel Markov Chain Clustering for Robot Localization | Link |
64 | 2025-05-12 | Spectral Normalization in GANs (cuBLAS Power Iteration) | Link |
65 | 2025-05-13 | GEGLU Activation Function Implementation | Link |
66 | 2025-05-14 | GPU-Accelerated MFCC Feature Extraction | Link |
67 | 2025-05-15 | SwiGLU Activation and Gradient Computation | Link |
68 | 2025-05-16 | LoRA Implementation and Benchmarking | Link |
69 | 2025-05-17 | Parallel Password Cracking (FNV-1a) | Link |
70 | 2025-05-18 | Mean Squared Error (MSE) Calculation | Link |
71 | 2025-05-19 | Group Normalization Forward Pass | Link |
72 | 2025-05-20 | Total Variation Distance (TVD) Loss | Link |
73 | 2025-05-21 | 1D Rotary Positional Embedding (RoPE) | Link |
74 | 2025-05-22 | 2D Rotary Positional Embeddings (RoPE-2D) in CUDA | Link |
75 | 2025-05-23 | Fused Linear Transformation and Softmax Cross-Entropy Loss | Link |
76 | 2025-05-24 | Contrastive Loss (Forward & Backward) | Link |
77 | 2025-05-25 | Huber Loss Implementation in CUDA | Link |
78 | 2025-05-26 | Dynamic Tanh (DyT) Operation | Link |
79 | 2025-05-28 | Upper Triangular Matrix Multiplication | Link |
80 | 2025-05-29 | Matrix Multiplication with Swish Activation and Scaling | Link |
81 | 2025-05-30 | Generalized Jensen-Shannon Divergence Loss (Forward & Backward) | Link |
82 | 2025-05-31 | Negative Cosine Similarity (Cosine Distance) | Link |
83 | 2025-05-31 | Minimum Reduction Over a Specific Dimension | Link |
84 | 2025-06-01 | Cumulative Product (Prefix Product / Scan) | Link |
85 | 2025-06-02 | Tensor-Matrix Multiplication | Link |
86 | 2025-06-03 | Hard Sigmoid Activation Function | Link |
87 | 2025-06-03 | Softplus Activation Function | Link |
88 | 2025-06-05 | Warp-Level Programming - Warp Sum Reduction | Link |
89 | 2025-06-05 | Memory Coalescing Demonstration | Link |
90 | 2025-06-06 | Frobenius Norm in CUDA | Link |
91 | 2025-06-07 | Hinge Loss Implementation in CUDA | Link |
92 | 2025-06-08 | ELU Activation Function Implementation in CUDA | Link |
93 | 2025-06-09 | RMS Normalization Implementation in CUDA | Link |
94 | 2025-06-10 | CUDA Implementation of Forward and Simplified Reverse Diffusion Steps | Link |
95 | 2025-06-12 | Barnsley Fern Fractal Generator (PPM Output) | Link |
96 | 2025-06-15 | Product Reduction along a Tensor Dimension (Bugfix & Test) | Link |
- Code CUDA kernels consistently for 100 days without any gaps
- Document what I did each day
- Every 10 days, I can claim a badge from the challenge
- No code, no badge, no challenge