[inductor][cpu] functorch_dp_cifar10 and opacus_cifar10 performance regression in 2025-05-24 nightly release

@chauhang

🐛 Describe the bug

AMP static shape CPP wrapper

suite	name	thread	batch_size_new	speed_up_new	inductor_new	eager_new	compilation_latency_new	batch_size_old	speed_up_old	inductor_old	eager_old	compilation_latency_old	Ratio Speedup(New/old)	Eager Ratio(old/new)	Inductor Ratio(old/new)	Compilation_latency_Ratio(old/new)
torchbench	functorch_dp_cifar10	multiple	64	0.868102	0.009271905	0.008048959274310001	10.719749	64	1.160976	0.006782472	0.007874287212672	11.024352	0.75	0.98	0.73	1.03
torchbench	opacus_cifar10	multiple	64	0.838903	0.010030994	0.008415030959582	11.244563	64	1.18228	0.006808455000000001	0.0080495001774	11.659031	0.71	0.96	0.68	1.04
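For anyone re-checking the ratio columns, here is a minimal sketch of how they appear to be derived from the raw columns. The dictionary keys and the `ratios` helper are hypothetical names for illustration, not part of the benchmark tooling. The numbers are copied from the two rows above, and I am assuming "new" is the 2025-05-24 nightly and "old" is the reference run.

```python
# Raw values copied from the table above.
# Assumption: "new" = 2025-05-24 nightly, "old" = reference run.
rows = {
    "functorch_dp_cifar10": {
        "speed_up_new": 0.868102, "speed_up_old": 1.160976,
        "inductor_new": 0.009271905, "inductor_old": 0.006782472,
        "eager_new": 0.008048959274310001, "eager_old": 0.007874287212672,
        "compile_new": 10.719749, "compile_old": 11.024352,
    },
    "opacus_cifar10": {
        "speed_up_new": 0.838903, "speed_up_old": 1.18228,
        "inductor_new": 0.010030994, "inductor_old": 0.006808455000000001,
        "eager_new": 0.008415030959582, "eager_old": 0.0080495001774,
        "compile_new": 11.244563, "compile_old": 11.659031,
    },
}


def ratios(r):
    """Recompute the four ratio columns, rounded to two decimals like the table."""
    return {
        "speedup_new_over_old": round(r["speed_up_new"] / r["speed_up_old"], 2),
        "eager_old_over_new": round(r["eager_old"] / r["eager_new"], 2),
        "inductor_old_over_new": round(r["inductor_old"] / r["inductor_new"], 2),
        "compile_old_over_new": round(r["compile_old"] / r["compile_new"], 2),
    }


for name, r in rows.items():
    print(name, ratios(r))
```

The eager ratio stays close to 1 for both models while the inductor ratio drops to about 0.7, which suggests the regression is in the compiled path rather than in eager execution.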

The bad commit: 768cb73

/workspace/pytorch# bash inductor_single_run.sh multiple inference performance torchbench functorch_dp_cifar10 amp first static cpp
Testing with cpp wrapper.
Testing with inductor.
multi-threads testing....
loading model: 0it [00:00, ?it/s]
cpu  eval  functorch_dp_cifar10
skipping cudagraphs due to cpp wrapper enabled
running benchmark: 100%|█████████████████████████████████████████████████████████████████████████████| 50/50 [00:02<00:00, 24.73it/s]
1.139x
WARNING:common:Trying to call the empty_gpu_cache for device: cpu, which is not in list [cuda, xpu]
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles,cudagraph_skips
cpu,functorch_dp_cifar10,64,1.138629,18.863196,38.237371,0.893517,77.563494,86.806938,71,1,0,0,0,0,1

The last good commit: 3c0cbf4

/workspace/pytorch# bash inductor_single_run.sh multiple inference performance torchbench functorch_dp_cifar10 amp first static cpp
Testing with cpp wrapper.
Testing with inductor.
multi-threads testing....
loading model: 0it [00:00, ?it/s]
cpu  eval  functorch_dp_cifar10
skipping cudagraphs due to cpp wrapper enabled
running benchmark: 100%|█████████████████████████████████████████████████████████████████████████████| 50/50 [00:01<00:00, 27.99it/s]
1.431x
WARNING:common:Trying to call the empty_gpu_cache for device: cpu, which is not in list [cuda, xpu]
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles,cudagraph_skips
cpu,functorch_dp_cifar10,64,1.430911,14.841640,38.689340,0.888515,76.825395,86.464922,71,1,0,0,0,0,1
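To compare the two single runs directly, here is a small sketch that parses the CSV result line from each log and reports the bad/good ratio for `speedup` and `abs_latency`. The header and rows are copied from the logs above. The `parse` and `bad_over_good` helpers are hypothetical names used only for this illustration.

```python
import csv
import io

# Header and result rows copied verbatim from the two logs above.
HEADER = (
    "dev,name,batch_size,speedup,abs_latency,compilation_latency,"
    "compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,"
    "unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,"
    "autograd_compiles,cudagraph_skips"
)
BAD_ROW = "cpu,functorch_dp_cifar10,64,1.138629,18.863196,38.237371,0.893517,77.563494,86.806938,71,1,0,0,0,0,1"
GOOD_ROW = "cpu,functorch_dp_cifar10,64,1.430911,14.841640,38.689340,0.888515,76.825395,86.464922,71,1,0,0,0,0,1"


def parse(row):
    """Turn one result line into a dict keyed by the CSV header."""
    return next(csv.DictReader(io.StringIO(HEADER + "\n" + row)))


def bad_over_good(field):
    """Ratio of the bad-commit value to the last-good-commit value."""
    return float(parse(BAD_ROW)[field]) / float(parse(GOOD_ROW)[field])


for field in ("speedup", "abs_latency", "compilation_latency"):
    print(f"{field}: bad/good = {bad_over_good(field):.3f}")
```

In this pair of runs the speedup at 768cb73 is roughly 0.8x of the value at 3c0cbf4 and `abs_latency` is roughly 1.27x, while compilation latency is essentially unchanged.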

Versions

SW info

name	target_branch	target_commit	refer_branch	refer_commit
torchbench	main	373ffb19	main	373ffb19
torch	main	`53ecb81`	main	`8568dbc`
torchvision	main	0.19.0a0+d23a6e1	main	0.19.0a0+d23a6e1
torchtext	main	0.16.0a0+b0ebddc	main	0.16.0a0+b0ebddc
torchaudio	main	2.6.0a0+1a8f621	main	2.6.0a0+ea5de17
torchdata	main	0.7.1a0+0790338	main	0.7.1a0+0790338
dynamo_benchmarks	main	nightly	main	nightly

Repro:
inductor_single_run.sh
bash inductor_single_run.sh multiple inference performance torchbench functorch_dp_cifar10 amp first static cpp
Suspected guilty commit: 768cb73
torchbench-functorch_dp_cifar10-inference-amp-static-cpp-multiple-performance-drop_guilty_commit.log

cc @chauhang @penguinwu @chuanqi129
