Article ID: 21.20240664
In modern systems-on-chip (SoCs), input/output (I/O) devices typically use direct memory access (DMA) and access virtual memory in regular patterns. In addition, the operating system (OS) tends to allocate physical I/O memory contiguously. To reduce the latency overhead of page-table walks, hardware often employs translation lookaside buffer (TLB) prefetch techniques. Recently, TLB coalescing schemes that merge contiguous pages into a single TLB entry have been reported. However, conventional TLB prefetchers operate at the page level and do not effectively exploit contiguous allocation. In this paper, we present a TLB prefetcher for I/O devices that exploits both contiguous allocation and TLB coalescing. The presented prefetcher operates at the block level, exploits contiguity in memory, requires no history tracking, and needs fewer page-table walks than the conventional scheme. Our experiments indicate that the presented scheme can improve both TLB and I/O device performance.
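The difference between page-level and block-level prefetching can be illustrated with a minimal sketch. It is not the design presented in the paper: the names `PAGES_PER_BLOCK`, `CoalescedEntry`, `walk_block`, and `count_walks` are hypothetical, the TLB is unbounded, and the model assumes that a single page-table walk can return the translations of one aligned block of pages. The prefetch rule uses only the current address, so no access history is kept.

```python
from dataclasses import dataclass

PAGES_PER_BLOCK = 8  # hypothetical block size in pages (an assumption)


@dataclass(frozen=True)
class CoalescedEntry:
    """One TLB entry covering a run of contiguous virtual-to-physical pages."""
    vpn_start: int  # first virtual page number covered
    ppn_start: int  # physical page number that vpn_start maps to
    length: int     # number of contiguous pages covered

    def covers(self, vpn: int) -> bool:
        return self.vpn_start <= vpn < self.vpn_start + self.length

    def translate(self, vpn: int) -> int:
        return self.ppn_start + (vpn - self.vpn_start)


def walk_block(page_table: dict[int, int], vpn: int) -> CoalescedEntry:
    """Model one walk that reads the aligned block containing vpn and
    coalesces the contiguous run around vpn into a single entry."""
    base = vpn - vpn % PAGES_PER_BLOCK
    ppn = page_table[vpn]
    lo = vpn
    # Extend downwards while the previous page is physically contiguous.
    while lo > base and page_table.get(lo - 1) == ppn - (vpn - (lo - 1)):
        lo -= 1
    hi = vpn
    # Extend upwards while the next page is physically contiguous.
    while hi + 1 < base + PAGES_PER_BLOCK and page_table.get(hi + 1) == ppn + (hi + 1 - vpn):
        hi += 1
    return CoalescedEntry(lo, page_table[lo], hi - lo + 1)


def count_walks(page_table: dict[int, int], accesses, block_level: bool):
    """Return (page-table walks, demand misses) for an access stream."""
    tlb: list[CoalescedEntry] = []  # unbounded TLB, for simplicity
    walks = 0
    misses = 0

    def hit(vpn: int) -> bool:
        return any(entry.covers(vpn) for entry in tlb)

    def fill(vpn: int) -> None:
        nonlocal walks
        if vpn not in page_table or hit(vpn):
            return  # unmapped, or already translated: no walk needed
        walks += 1
        if block_level:
            tlb.append(walk_block(page_table, vpn))
        else:
            tlb.append(CoalescedEntry(vpn, page_table[vpn], 1))

    for vpn in accesses:
        if not hit(vpn):
            misses += 1
            fill(vpn)
        # History-free prefetch: next page, or first page of the next block.
        if block_level:
            fill(vpn - vpn % PAGES_PER_BLOCK + PAGES_PER_BLOCK)
        else:
            fill(vpn + 1)
    return walks, misses


# 32 virtual pages mapped to physically contiguous pages, read sequentially.
contiguous_table = {vpn: 1000 + vpn for vpn in range(32)}
page_level = count_walks(contiguous_table, range(32), block_level=False)
block_level = count_walks(contiguous_table, range(32), block_level=True)
```

Under these assumptions, the sequential stream over 32 contiguous pages costs 32 walks with page-level prefetching and 4 walks with block-level prefetching, with one demand miss in both cases. The gap shrinks as the physical allocation becomes fragmented, because each coalesced entry then covers fewer pages.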