🐛 Describe the bug
Performance regression with AMP, static shape, and the CPP wrapper (multi-threaded inference):
suite | name | thread | batch_size_new | speed_up_new | inductor_new | eager_new | compilation_latency_new | batch_size_old | speed_up_old | inductor_old | eager_old | compilation_latency_old | Ratio Speedup(New/old) | Eager Ratio(old/new) | Inductor Ratio(old/new) | Compilation_latency_Ratio(old/new) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
torchbench | functorch_dp_cifar10 | multiple | 64 | 0.868102 | 0.009271905 | 0.008048959274310001 | 10.719749 | 64 | 1.160976 | 0.006782472 | 0.007874287212672 | 11.024352 | 0.75 | 0.98 | 0.73 | 1.03 |
torchbench | opacus_cifar10 | multiple | 64 | 0.838903 | 0.010030994 | 0.008415030959582 | 11.244563 | 64 | 1.18228 | 0.006808455000000001 | 0.0080495001774 | 11.659031 | 0.71 | 0.96 | 0.68 | 1.04 |
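The ratio columns can be re-derived from the raw timings as a sanity check. The following is a minimal sketch, assuming that `speed_up` is the eager time divided by the inductor time (this matches the numbers in the table); the `rows` dictionary and the `ratios` helper are hypothetical names used only for illustration.

```python
# Raw timings copied from the table above.
rows = {
    "functorch_dp_cifar10": {
        "new": {"inductor": 0.009271905, "eager": 0.008048959274310001, "compile": 10.719749},
        "old": {"inductor": 0.006782472, "eager": 0.007874287212672, "compile": 11.024352},
    },
    "opacus_cifar10": {
        "new": {"inductor": 0.010030994, "eager": 0.008415030959582, "compile": 11.244563},
        "old": {"inductor": 0.006808455000000001, "eager": 0.0080495001774, "compile": 11.659031},
    },
}


def ratios(new, old):
    """Recompute the derived columns for one model (assumed definitions)."""
    speedup_new = new["eager"] / new["inductor"]  # assumed: speed_up = eager / inductor
    speedup_old = old["eager"] / old["inductor"]
    return {
        "speedup_ratio_new_over_old": round(speedup_new / speedup_old, 2),
        "eager_ratio_old_over_new": round(old["eager"] / new["eager"], 2),
        "inductor_ratio_old_over_new": round(old["inductor"] / new["inductor"], 2),
        "compile_ratio_old_over_new": round(old["compile"] / new["compile"], 2),
    }


results = {name: ratios(r["new"], r["old"]) for name, r in rows.items()}
for name, r in results.items():
    print(name, r)
```

Under these assumptions the eager ratio stays close to 1 while the inductor ratio drops to about 0.7, so the slowdown appears to come from the compiled path rather than from eager mode.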
The bad commit: 768cb73
/workspace/pytorch# bash inductor_single_run.sh multiple inference performance torchbench functorch_dp_cifar10 amp first static cpp
Testing with cpp wrapper.
Testing with inductor.
multi-threads testing....
loading model: 0it [00:00, ?it/s]
cpu eval functorch_dp_cifar10
skipping cudagraphs due to cpp wrapper enabled
running benchmark: 100%|█████████████████████████████████████████████████████████████████████████████| 50/50 [00:02<00:00, 24.73it/s]
7396
1.139x
WARNING:common:Trying to call the empty_gpu_cache for device: cpu, which is not in list [cuda, xpu]
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles,cudagraph_skips
cpu,functorch_dp_cifar10,64,1.138629,18.863196,38.237371,0.893517,77.563494,86.806938,71,1,0,0,0,0,1
The last good commit: 3c0cbf4
/workspace/pytorch# bash inductor_single_run.sh multiple inference performance torchbench functorch_dp_cifar10 amp first static cpp
Testing with cpp wrapper.
Testing with inductor.
multi-threads testing....
loading model: 0it [00:00, ?it/s]
cpu eval functorch_dp_cifar10
skipping cudagraphs due to cpp wrapper enabled
running benchmark: 100%|█████████████████████████████████████████████████████████████████████████████| 50/50 [00:01<00:00, 27.99it/s]
1.431x
WARNING:common:Trying to call the empty_gpu_cache for device: cpu, which is not in list [cuda, xpu]
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles,cudagraph_skips
cpu,functorch_dp_cifar10,64,1.430911,14.841640,38.689340,0.888515,76.825395,86.464922,71,1,0,0,0,0,1
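To compare the two single runs directly, the CSV summary lines can be parsed and the relevant columns divided. This is a rough sketch using only the Python standard library; the `HEADER`, `BAD`, and `GOOD` strings are copied from the logs above, and the `parse` helper is a hypothetical name.

```python
import csv
import io

# Header and result lines copied verbatim from the two logs above.
HEADER = (
    "dev,name,batch_size,speedup,abs_latency,compilation_latency,"
    "compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,"
    "unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,"
    "autograd_compiles,cudagraph_skips"
)
BAD = "cpu,functorch_dp_cifar10,64,1.138629,18.863196,38.237371,0.893517,77.563494,86.806938,71,1,0,0,0,0,1"
GOOD = "cpu,functorch_dp_cifar10,64,1.430911,14.841640,38.689340,0.888515,76.825395,86.464922,71,1,0,0,0,0,1"


def parse(line):
    """Turn one CSV result line into a dict keyed by the header names."""
    return next(csv.DictReader(io.StringIO(HEADER + "\n" + line)))


bad, good = parse(BAD), parse(GOOD)

# A speedup ratio below 1 or a latency ratio above 1 means the bad commit is slower.
speedup_ratio = float(bad["speedup"]) / float(good["speedup"])
latency_ratio = float(bad["abs_latency"]) / float(good["abs_latency"])

print(f"speedup (bad/good):     {speedup_ratio:.3f}")
print(f"abs_latency (bad/good): {latency_ratio:.3f}")
```

In these two runs the compilation latency and peak memory columns are nearly identical, so the difference shows up mainly in `speedup` and `abs_latency`.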
Versions
SW info
name | target_branch | target_commit | refer_branch | refer_commit |
---|---|---|---|---|
torchbench | main | 373ffb19 | main | 373ffb19 |
torch | main | 53ecb81 | main | 8568dbc |
torchvision | main | 0.19.0a0+d23a6e1 | main | 0.19.0a0+d23a6e1 |
torchtext | main | 0.16.0a0+b0ebddc | main | 0.16.0a0+b0ebddc |
torchaudio | main | 2.6.0a0+1a8f621 | main | 2.6.0a0+ea5de17 |
torchdata | main | 0.7.1a0+0790338 | main | 0.7.1a0+0790338 |
dynamo_benchmarks | main | nightly | main | nightly |
Repro:
inductor_single_run.sh
bash inductor_single_run.sh multiple inference performance torchbench functorch_dp_cifar10 amp first static cpp
Suspected guilty commit: 768cb73
torchbench-functorch_dp_cifar10-inference-amp-static-cpp-multiple-performance-drop_guilty_commit.log