Over the past few years, attention has become prominent in deep learning, especially in natural language processing (NLP). Its effectiveness and ubiquitous use have motivated us to schedule the data flow of the corresponding computations efficiently onto architectures with many computing units, so that they can be processed in parallel. In this paper, by manually analyzing the optimal scheduling solutions for small instances, obtained with a satisfiability (SAT) solver, we propose a general scheduling solution that parallelizes the processing of the attention layers widely adopted in recent deep learning models. With this solution, the proposed hardware system of m processing elements (PEs) connected in a unidirectional ring can achieve an m-fold speedup. For two specific application schemes of attention, we find that almost 25% and 50% of the original computations, respectively, become redundant. To avoid this unnecessary computation and reduce processing latency accordingly, we devise optimization strategies that lead to two further scheduling solutions. By avoiding the redundancy, the optimized scheduling solutions reduce execution cycles by a further nearly 25% and 50%, respectively, for the two application schemes. To establish the correctness of these solutions, we prove their validity mathematically and also verify them with a SAT solver, adding the solutions themselves as additional constraints to the formulated SAT problems.