torchdynamo: [aot_eager] Accuracy error - BigBird
Repro: `python benchmarks/huggingface.py --accuracy --float32 --backend=aot_eager --training --only=BigBird`
My investigation
- I ran the accuracy minifier. The accuracy minifier takes each extracted Fx subgraph, runs it eagerly and then with the compiler_fn (`aot_eager` here), and compares the outputs. In this case, the accuracy minifier did not find any offending subgraph.
- I re-checked with the `eager` backend (`torchdynamo.optimize("eager")`). It passed. So, something is wrong with the `aot_eager` backend.
- At this point, I was thinking the problem must be in the interstitial Python bytecode between Fx subgraphs. BigBird has numpy random calls, so maybe we were changing the numpy rng state. But that was a dead end too, as we reset the numpy rng state before each run. The `eager` backend is passing as well, so the interstitial code is ok.
- So, I just started arbitrarily falling back to eager by adjusting the threshold on the number of ops in the Fx graph.
- After a lot of trial and error, I found that this diff worked:
```diff
diff --git a/torchdynamo/optimizations/training.py b/torchdynamo/optimizations/training.py
index 7f64c86c..b580180e 100644
--- a/torchdynamo/optimizations/training.py
+++ b/torchdynamo/optimizations/training.py
@@ -29,7 +29,8 @@ class AotAutogradStrategy(object):
     @classmethod
     def compile_fn(cls, gm: torch.fx.GraphModule, example_inputs):
-        if count_calls(gm.graph) < 2:
+        if count_calls(gm.graph) <= 2:
+            log.error(f"{gm}")
             return gm.forward  # no point for tiny graphs
         return cls(gm, example_inputs).verified_candidate()
```
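For context, `count_calls` here counts the call nodes in the captured Fx graph, so the change from `< 2` to `<= 2` also makes graphs with exactly two calls fall back to eager. A minimal sketch of that kind of counting (my approximation with a hypothetical helper name, not the actual torchdynamo implementation):

```python
import torch.fx


def count_call_nodes(graph: torch.fx.Graph) -> int:
    # Count only nodes that actually execute an op; placeholder,
    # get_attr, and output nodes are ignored. This approximates what
    # count_calls is used for in the diff above.
    return sum(
        1
        for node in graph.nodes
        if node.op in ("call_function", "call_method", "call_module")
    )
```

The skipped graph shown below has exactly two such calls (`contiguous` and `view`), which is why it is affected by the `<= 2` change.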
And the additional skipped module is:

```python
def forward(self, _stack0_0_ : torch.Tensor):
    contiguous = _stack0_0_.contiguous();  _stack0_0_ = None
    view = contiguous.view(1, 1024, -1);  contiguous = None
    return (view,)
```
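To poke at this tiny graph in isolation, one can compare both the forward output and the gradient reaching its non-parameter input between eager and `aot_eager`. This is only a rough sketch; the input shape and the decorator-style `torchdynamo.optimize` usage are assumptions:

```python
import torch
import torchdynamo


def tiny_graph(x: torch.Tensor):
    # Same ops as the skipped Fx graph above.
    return x.contiguous().view(1, 1024, -1)


def run(fn):
    torch.manual_seed(0)
    # A non-parameter input that still requires grad (hypothetical shape).
    x = torch.randn(1024, 768, requires_grad=True)
    out = fn(x)
    out.sum().backward()
    return out, x.grad


ref_out, ref_grad = run(tiny_graph)
res_out, res_grad = run(torchdynamo.optimize("aot_eager")(tiny_graph))
print(torch.allclose(ref_out, res_out), torch.allclose(ref_grad, res_grad))
```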
So, I have two questions:
- What is the issue? I don't understand why skipping the above subgraph makes it work.
- Why did the accuracy minifier fail here?
About this issue
- State: closed
- Created 2 years ago
- Comments: 16 (16 by maintainers)
Commits related to this issue
- Copy over non parameter grad (#85658) Wow, ugh silly mistake. Fix for https://github.com/pytorch/torchdynamo/issues/1291 not even sure how all the tests passed before this. Pull Request resolved: ht... — committed to pytorch/pytorch by eellison 2 years ago
This inaccuracy also seems pretty high to me. I would imagine the reference and the compiled outputs to be very different here.

In this model, there are many graph breaks, so there are many tens of subgraphs. For each subgraph, the minifier compares eager vs dynamo accuracy. If a subgraph fails, it dumps the offending subgraph and starts minifying it further (dumping at every successful minification step). The issue here is that the accuracy minifier does not find any offending subgraph, even though the final accuracy check fails.
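For reference, the per-subgraph comparison described above is, in spirit, something like this sketch (the function name, tolerances, and tuple-output handling are my assumptions, not the actual minifier code):

```python
import torch


def subgraph_accuracy_ok(gm: torch.fx.GraphModule, example_inputs, compiler_fn) -> bool:
    # Run the captured subgraph eagerly to get a reference result.
    ref = gm(*example_inputs)
    # Compile the same subgraph with the backend under test (e.g. aot_eager)
    # and run it on the same inputs.
    compiled = compiler_fn(gm, example_inputs)
    res = compiled(*example_inputs)
    # Compare element-wise within a tolerance; a failing subgraph would be
    # handed to the minifier for further reduction.
    return all(
        torch.allclose(r, c, rtol=1e-4, atol=1e-4)
        for r, c in zip(ref, res)
    )
```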