Stars
A tiny GPU that can render only two texture-mapped triangles
This Verilog HDL implementation multiplies two 8x8 matrices using eight MAC (multiply-accumulate) modules.
Verilog implementation of the systolic array architecture used in modern ML acceleration chips.
A minimal Tensor Processing Unit (TPU) inspired by Google's TPUv1.
Verilog HDL implementation of SDRAM controller and SDRAM model
IC implementation of Systolic Array for TPU
Transactional Verilog design and Verilator Testbench for a RISC-V TensorCore Vector co-processor for reproducible linear algebra
Small-scale Tensor Processing Unit built on an FPGA
Implementation of a Tensor Processing Unit for embedded systems and the IoT.
INT8 & FP16 multiplier-accumulator (MAC) design with completed UVM verification.
National Second Prize, 2023 IC Innovation Contest (集创赛). A simple convolution-layer accelerator based on a systolic array; it supports the first convolution layer of yolov3-tiny, and the systolic array structure can be flexibly adjusted to the FPGA's DSP resources to achieve different compute efficiencies.
A detailed implementation of a systolic array in Verilog and SystemVerilog
Matrix multiply-and-accumulate unit written in SystemVerilog
Hardware design of a universal NPU (CNN accelerator) for various convolutional neural networks
Superscalar Out-of-Order NPU Design on FPGA
A matrix multiplication accelerator implemented on the PL side of a Zynq SoC, communicating through an external UART interface.
Exploiting Kernel Sparsity and Entropy for Interpretable CNN Compression
RL-Pruner: Structured Pruning Using Reinforcement Learning for CNN Compression and Acceleration
Implementing a Neural Network from Scratch
A Flexible and Energy-Efficient Accelerator for Sparse Convolutional Neural Networks
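Several of the repos above center on the same idea: a systolic array of MAC units computing a matrix multiply. As a rough illustration only (not taken from any listed repository), here is a minimal Python sketch of an output-stationary systolic schedule, where each processing element (PE) holds one output accumulator and fires one MAC per cycle along a diagonal wavefront:

```python
# Software sketch of an output-stationary systolic-array dataflow:
# PE (i, j) owns accumulator C[i][j]; operands A[i][k] and B[k][j]
# reach it at cycle t = i + j + k, mimicking skewed streaming of A
# from the left and B from the top. Sizes and scheduling here are
# illustrative assumptions, not from any specific repo above.

def systolic_matmul(A, B):
    n = len(A)                       # assume square n x n matrices
    C = [[0] * n for _ in range(n)]  # one accumulator per PE
    for t in range(3 * n - 2):       # total cycles for the wavefront
        for i in range(n):
            for j in range(n):
                k = t - i - j        # which operand pair arrives now
                if 0 <= k < n:
                    C[i][j] += A[i][k] * B[k][j]  # the MAC operation
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(systolic_matmul(A, B))  # matches an ordinary matrix multiply
```

The hardware versions differ mainly in what each PE stores (output-, weight-, or input-stationary) and in operand precision (e.g. the INT8/FP16 MAC design above), but the cycle-by-cycle MAC wavefront is the common core.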