[Q & A] Is intercepting the cudaMallocAsync API also suitable for this approach?

Hello, I have read your thesis and code, and I think your idea is great! However, I have a question. CUDA 11.2 introduced the Stream-Ordered Memory Allocator, which provides the cudaMallocAsync and cudaFreeAsync APIs. If an application calls cudaMallocAsync, and that call is also intercepted and replaced with cudaMallocManaged, what impact would this have on the computed results?