Cited By
View all- Yazdanpanah FAlaei M(2024)An approach for low-power heterogeneous parallel implementation of ALC-PSO algorithm using OmpSs and CUDAParallel Computing10.1016/j.parco.2024.103084120(103084)Online publication date: Jun-2024
This work introduces CLBlast, an open-source BLAS library providing optimized OpenCL routines to accelerate dense linear algebra for a wide variety of devices. It is targeted at machine learning and HPC applications and thus provides a fast matrix-...
Graphics processors (GPUs) provide a vast number of simple, data-parallel, deeply multithreaded cores and high memory bandwidths. GPU architectures are becoming increasingly programmable, offering the potential for dramatic speedups for a variety of ...
We present an interface and an implementation of the General Matrix Multiply (GEMM) routine for multiple small matrices processed simultaneously on NVIDIA graphics processing units (GPUs). We focus on matrix sizes under 16. The implementation can be ...
Association for Computing Machinery
New York, NY, United States
Check if you have access through your login credentials or your institution to get full access on this article.
Sign in