CUDA pHash is a high-performance, GPU-accelerated tool for computing the perceptual hash (pHash) of images.
CUDA pHash outperforms other leading perceptual hash implementations by a wide margin, processing massive datasets via highly-optimized compute pipelines.
| Implementation | Wall-clock time (ms)* ⏱️ | Speedup vs. CUDA ⚡ |
|---|---|---|
| ⚡ CUDA pHash | 000.000 | Baseline 🏆 |
| OpenCV pHash | 000.000 | 0.0× slower 🐢 |
| Python pHash | 000.000 | 0.0× slower 🐢 |
\* pHash computed on COCO 2017 (163,957 images) using a 13th Gen Intel Core i9-13900K and an NVIDIA RTX 3080 over PCIe.
Below is a simple example demonstrating how to instantiate a CUDA pHash object, compute hashes, and compare them using Hamming distance:
```cpp
#include "cuda_phash.cuh"

#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Computes the Hamming distance between two bit-packed hash vectors
int hammingDistance(const std::vector<uint32_t>& hashA, const std::vector<uint32_t>& hashB) {
    int distance = 0;
    for (size_t i = 0; i < hashA.size(); ++i) {
        uint32_t diff = hashA[i] ^ hashB[i]; // XOR to find differing bits
        while (diff) {
            distance += (diff & 1);
            diff >>= 1;
        }
    }
    return distance;
}

int main() {
    // Initialize CUDA pHash with:
    // - hashSize       = 8
    // - highFreqFactor = 4
    // - batchSize      = 5000
    CudaPhash phasher(8, 4, 5000);

    // Image paths to process
    std::vector<std::string> imagePaths = { "image1.jpg", "image2.jpg", "image3.jpg" };

    // Compute perceptual hashes
    std::vector<std::vector<uint32_t>> hashes = phasher.phash(imagePaths);

    // Compute the Hamming distance between image1 and image2
    int dist = hammingDistance(hashes[0], hashes[1]);
    std::cout << "Hamming distance between image1 and image2: " << dist << std::endl;

    return 0;
}
```
The `phash()` function returns a `std::vector<std::vector<uint32_t>>`, where each inner vector is the bit-packed hash of one image. If desired, these can be converted to binary or hexadecimal representations by iterating over the 32-bit words.
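For example, a bit-packed hash can be rendered as a hexadecimal string like so (this `toHex` helper is illustrative, not part of the library):

```cpp
#include <cstdint>
#include <iomanip>
#include <sstream>
#include <string>
#include <vector>

// Renders a bit-packed hash as a hexadecimal string,
// one 8-digit group per 32-bit word.
std::string toHex(const std::vector<uint32_t>& hash) {
    std::ostringstream oss;
    for (uint32_t word : hash) {
        oss << std::hex << std::setw(8) << std::setfill('0') << word;
    }
    return oss.str();
}
```
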
Perceptual Hashing (pHash) is a technique commonly used for identifying visually similar images while being resilient to transformations such as scaling, rotation, and lighting changes. Unlike cryptographic hashes, which produce radically different outputs for even minor input changes, pHash captures an image's structural essence, allowing for robust similarity comparisons.
Despite its widespread adoption in image retrieval and duplicate detection, existing pHash implementations are often limited by computational efficiency, especially when scaling to large datasets. The standard approach relies on the Discrete Cosine Transform (DCT) and mean-thresholding, but traditional CPU-based implementations are computationally expensive. To address these limitations, we propose a highly optimized GPU-accelerated pHash implementation that significantly improves efficiency by leveraging CUDA and cuBLAS.
The core of pHash computation involves transforming an image into the frequency domain using the 2D Discrete Cosine Transform (DCT):

$$F(u,v) = \alpha(u)\,\alpha(v) \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} f(x,y) \cos\!\left[\frac{(2x+1)u\pi}{2N}\right] \cos\!\left[\frac{(2y+1)v\pi}{2N}\right]$$

where $f(x,y)$ is the pixel intensity at $(x,y)$, $N$ is the image dimension, and $\alpha(k) = \sqrt{1/N}$ for $k = 0$ and $\sqrt{2/N}$ otherwise.
By retaining only the low-frequency coefficients (top-left sub-block), we extract structural features that remain invariant under common transformations such as rotation, scaling, and lighting changes.
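Because the 2D DCT is separable, it can be computed as two matrix multiplications, $D = T A T^{\top}$, where $T$ is the DCT basis matrix — the formulation that maps naturally onto GEMM calls. A small CPU sketch of building $T$ (illustrative, not the library's GPU code):

```cpp
#include <cmath>
#include <vector>

// Builds the N x N orthonormal DCT-II basis matrix T (row-major),
// so that the 2D DCT of an image A is T * A * T^T. This separable
// form is what batched GEMM implementations compute on the GPU.
std::vector<double> dctMatrix(int n) {
    const double pi = std::acos(-1.0);
    std::vector<double> t(static_cast<size_t>(n) * n);
    for (int u = 0; u < n; ++u) {
        // Normalization: sqrt(1/N) for row 0, sqrt(2/N) otherwise
        double alpha = (u == 0) ? std::sqrt(1.0 / n) : std::sqrt(2.0 / n);
        for (int x = 0; x < n; ++x) {
            t[u * n + x] = alpha * std::cos((2 * x + 1) * u * pi / (2.0 * n));
        }
    }
    return t;
}
```
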
To generate a compact binary hash:

- An `n × n` subset of DCT coefficients is selected.
- The mean of these coefficients is computed.
- Each coefficient is compared to the mean: `1` if the coefficient is above the mean, `0` otherwise.
- The resulting binary sequence is bit-packed into 32-bit words for memory-efficient storage and rapid comparison.
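The thresholding and bit-packing steps above can be sketched in plain C++ as follows (a simplified CPU reference; the library performs the equivalent work on the GPU):

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Thresholds a flattened n*n block of DCT coefficients against its
// mean and packs the resulting bits into 32-bit words (CPU sketch).
std::vector<uint32_t> packHash(const std::vector<float>& coeffs) {
    double mean = std::accumulate(coeffs.begin(), coeffs.end(), 0.0) / coeffs.size();
    std::vector<uint32_t> words((coeffs.size() + 31) / 32, 0u);
    for (std::size_t i = 0; i < coeffs.size(); ++i) {
        if (coeffs[i] > mean) {
            words[i / 32] |= (1u << (i % 32)); // set bit i
        }
    }
    return words;
}
```
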
Standard CPU-based implementations of pHash suffer from high computational costs, particularly during the DCT and thresholding steps. While a GPU can accelerate DCT computations significantly, a key challenge is the time required to transfer images to the GPU over PCIe, which erodes those efficiency gains. This data-transfer bottleneck is often the primary reason that existing GPU-based pHash implementations have not achieved their full potential.
To address this, we implemented several CUDA-specific optimizations to minimize overhead and maximize performance:
- Batched DCT Computation: We leverage cuBLAS to perform matrix-matrix multiplications in parallel across multiple images, resulting in highly optimized DCT times of less than 40 ms for a batch of 5,000 256×256 images.
- Efficient Memory Management:
- Pinned (page-locked) memory facilitates faster CPU-GPU transfers.
- Shared memory and warp-wide ballot operations reduce redundant computations.
- Asynchronous Execution: We implement overlapping pipelines to concurrently execute image loading, preprocessing, and hashing, effectively eliminating bottlenecks.
- Optimized Data Transfer Strategies:
- Minimized PCIe Transfers: By structuring computations to reduce unnecessary memory copies, we significantly decrease PCIe latency.
- Double Buffering: This technique allows image preprocessing and GPU computation to overlap, hiding data transfer delays.
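The double-buffering pattern in the last bullet can be sketched generically in host C++ (a simplified illustration using `std::thread` in place of CUDA streams; the function names are placeholders, not library API):

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Double-buffering sketch: while one buffer is being processed, the
// next batch is loaded into the other buffer, overlapping the two
// stages. In the real pipeline, "load" is a PCIe transfer into pinned
// memory and "process" is the GPU computation on the resident batch.
template <typename Load, typename Process>
void doubleBuffer(std::size_t numBatches, Load load, Process process) {
    std::vector<int> buffers[2];
    if (numBatches == 0) return;
    load(buffers[0], 0);                      // fill the first buffer up front
    for (std::size_t b = 0; b < numBatches; ++b) {
        std::thread loader;                   // loads batch b+1 concurrently
        if (b + 1 < numBatches) {
            loader = std::thread(load, std::ref(buffers[(b + 1) % 2]), b + 1);
        }
        process(buffers[b % 2], b);           // consume the ready buffer
        if (loader.joinable()) loader.join(); // wait for the next batch
    }
}
```

Because loading writes to one buffer while processing reads the other, the two stages never touch the same memory and can safely run concurrently.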
Empirical evaluations show that our GPU-accelerated approach outperforms traditional CPU implementations by orders of magnitude, particularly for large-scale datasets. By parallelizing DCT computation and optimizing memory access patterns, we achieve substantial speedups while maintaining high accuracy and robustness in perceptual similarity detection.
We present an optimized pHash implementation that harnesses GPU acceleration to dramatically improve efficiency without sacrificing robustness. This approach enables large-scale image processing applications, including duplicate detection, content-based retrieval, and near-duplicate search, to scale effectively with growing data demands. Future work includes extending the framework to support additional perceptual hashing techniques and further optimizing CUDA kernels for specific hardware architectures.
```shell
git clone https://github.com/yourusername/cuda_phash.git
cd cuda_phash
```
Ensure the following are installed:
- CUDA Toolkit (tested with CUDA 11+)
- cuBLAS (included with CUDA)
- OpenCV (for image loading and preprocessing)
- Modern C++ Compiler (e.g., MSVC, GCC, or Clang)
Since no CMake configuration is provided, build the project using your preferred method.

Visual Studio (Windows):

- Create a new Visual C++ project.
- Add all `.cu` and `.cuh` files.
- Configure project settings to enable CUDA (select "CUDA Runtime API" in the VS project properties).
- Link against:
  - `cublas.lib`
  - `cudart.lib`
  - OpenCV libraries (e.g., `opencv_world455.lib`)
Command line: compile using `nvcc` and link the required libraries:

```shell
nvcc -o cuda_phash main.cu cuda_phash.cu -lcublas `pkg-config --cflags --libs opencv4`
```
Adjust paths and library flags to match your system configuration.
Once compiled, include `cuda_phash.cuh` in your project and ensure the CUDA Toolkit and OpenCV are correctly linked at runtime.