In high-performance computing it is often said that the cost of a cache-miss is the largest performance penalty for an algorithm. For many years the increase in speed of our processors has greatly outstripped latency gains to main-memory. Bandwidth to main-memory has greatly increased via wider, and multi-channel, buses however the latency has not significantly reduced. To hide this latency our