The document discusses optimization techniques for deep learning frameworks on Intel CPUs and on architectures such as Fugaku. It introduces oneDNN, a performance library for deep learning operations on Intel CPUs, discusses the limitations of a pure C++ implementation, and shows how just-in-time assembly generation using Xbyak can address them by generating optimal code for the parameters seen at run time. It also introduces Xbyak_aarch64 for generating optimized code for Fugaku's Scalable Vector Extension (SVE) instructions.
Fugaku, the Successes and the Lessons Learned (RCCS / RENKEI)
The document summarizes the successes and lessons learned from Fugaku, Japan's flagship supercomputer. Key points include:
- Fugaku achieved the top performance on all HPC benchmarks in 2020 and 2021, showing high performance across applications, not just traditional HPC workloads.
- While many applications achieved their target performance, some did not due to issues like insufficient parallelism, I/O scalability problems, and compiler vectorization failures.
- Lessons include the need for improved software stacks, application analysis, and adapting to modern applications beyond classic HPC.
- Looking ahead, sustained exascale performance will require data-centric architectures and corresponding system software and algorithms as transistor scaling slows.
16. Top 500, November 2014 (Rmax / Rpeak in TFlop/s)
- Rank 1: National University of Defense Technology, China. System: Tianhe-2 (MilkyWay-2), TH-IVB-FEP Cluster, Intel Xeon E5-2692 12C 2.200GHz, TH Express-2, Intel Xeon Phi 31S1P (NUDT). Cores: 3,120,000. Rmax: 33,862.7. Rpeak: 54,902.4.
- Rank 2: DOE/SC/Oak Ridge National Laboratory, United States. System: Titan, Cray XK7, Opteron 6274 16C 2.200GHz, Cray Gemini interconnect, NVIDIA K20x (Cray Inc.). Cores: 560,640. Rmax: 17,590.0. Rpeak: 27,112.5.
18. CPU: a design oriented toward low latency
- Large caches: convert the long latency of memory accesses into the short latency of cache accesses.
- Sophisticated control logic: branch prediction and speculative execution to reduce branch latency; data prefetching to reduce data latency.
- Powerful arithmetic units: reduce the latency of computation.
[Figure: CPU block diagram showing Control, ALUs, Cache, and DRAM]
49. Part II: The end of the multicore era and the search for what comes next
- Dark silicon and the end of the multicore era
- Optimism and skepticism about Moore's law
- Heterogeneous System Architecture Foundation
- 3D stacking technology
- Silicon Photonics Technology
- Micro servers
- Project Ara as an "evolution" of heterogeneous systems
60. Skepticism about Moore's law
"Compute Power with Energy Efficiency", AFDS 2012, http://bit.ly/1GFr8w3, by ARM
Moore's law is not dead. Some version of Moore's law will probably hold for the next ten years, but its effect will become ever smaller and ever less important.
In the past, process technology and Moore's law gave us improvements in power, performance, and area for free. We can no longer expect that.
63. Skepticism about Moore's law
"Compute Power with Energy Efficiency", AFDS 2012, http://bit.ly/1GFr8w3, ARM's view
So what can we do? We can have more transistors, but we cannot power all of them at the same time. We need to use those extra transistors in new ways:
- multi-core
- many-core
- domain-specific processors
All of these point toward heterogeneous processing, under aggressive power management. Computation should be performed where it is most efficient.
64. Pessimism about Moore's law
"Transitioning from the Era of Multicore to the Era of Specialization", SICS 2014, http://bit.ly/1BOIEuC
Moore's law is coming to an end. Economics is driving the semiconductor ecosystem with ever greater force. The number of vendors with leading-edge fabs is shrinking. The cost of improving performance will grow. Hardware specialization will become a key issue.
[Chart: cost per transistor by process node]
133. Intel Xeon E5-2600 v3
“How to Build Next-Generation Data
Center Infrastructure”
http://intel.ly/1urN2ba
134. New Compute-Optimized EC2
Instances http://amzn.to/1yGqaKm
The new C4 instances are based on the Intel
Xeon E5-2666 v3 (code name Haswell)
processor. This custom processor, designed
specifically for EC2, runs at a base speed of 2.9
GHz, and can achieve clock speeds as high as
3.5 GHz with Turbo boost.
177. FAIR open sources deep-learning modules for Torch
Many research projects on machine learning
and AI at FAIR use Torch, an open source
development environment for numerics,
machine learning, and computer vision, with a
particular emphasis on deep learning and
convolutional nets. Torch is widely used at a
number of academic labs as well as at
Google/DeepMind, Twitter, NVIDIA, AMD, Intel,
and many other companies.
January 2015, http://bit.ly/1DWKgn2
178. FAIR open sources deep-learning modules for Torch
Today, we're open sourcing optimized deep-
learning modules for Torch. These modules are
significantly faster than the default ones in
Torch and have accelerated our research
projects by allowing us to train larger neural
nets in less time.
This release includes GPU-optimized
modules for large convolutional nets
(ConvNets), as well as networks with sparse
activations that are commonly used in Natural
Language Processing applications.
188. IBM, Nvidia team to build even
faster supercomputers
The Department of Energy has awarded a $325
million contract to IBM to create two
supercomputers that will be at least three
times more powerful than any existing systems
in deployment today. IBM's partners in this
endeavor will be Nvidia and Mellanox.
http://bit.ly/1uIeP7o
189. IBM, Nvidia team to build even
faster supercomputers
The current leader is Tianhe-2 (Milky Way 2), a
Chinese supercomputer with a theoretical max
of 55 petaflops built with Xeon E5 processors
and Xeon Phi co-processors. It may or may not
be surpassed when the new Top500
supercomputer list comes out this week. Either
way, a 165 petaFLOP supercomputer is a tall
order.
The DoE supercomputer will use a mix of IBM POWER8 RISC CPUs, Nvidia's Tesla GPUs and NVLink GPU interconnects, and Mellanox's 100Gbit/sec. InfiniBand interconnects. The system is expected to be installed in 2017.
191. "SEATTLE": WHAT IS IT AND WHY?
"Seattle" is AMD's first ARM-based 64-bit processor.
- 8 ARM Cortex-A57 cores
- 2 DDR3/4 DRAM channels
- 10G Ethernet, PCI-Express, SATA
- GlobalFoundries 28nm process
The ARM architecture's move from 32-bit to 64-bit is as important a change for the industry as the x86 move from 32-bit to 64-bit was. AMD intends to play the same leading role in the 64-bit ARM world that it has played in the 64-bit x86 world.
215. We discover that, regardless of CPU
microarchitecture, memcached execution is
remarkably inefficient, saturating neither
network links nor available memory bandwidth.
Instead, we find performance is typically
limited by the per-packet processing overheads
in the NIC and OS kernel— long code paths
limit CPU performance due to poor branch
predictability and instruction fetch bottlenecks.
216. Hence, we argue for an alternate architecture, Thin Servers with Smart Pipes (TSSP), for cost-effective high-performance memcached deployment. TSSP couples an embedded-class low-power core to a memcached accelerator that can process GET requests entirely in hardware, offloading both network handling and data lookup. We demonstrate the potential benefits of our TSSP architecture through an FPGA prototyping platform, and show the potential for a 6X-16X power-performance improvement over conventional server baselines.