triton: MatMul tutorial fails for float32 inputs
I’ve run the third tutorial (matrix multiplication) and the code runs successfully with the following versions:
torch==2.0.0+cu118
triton==2.0.0
I then changed the dtype of the inputs from half precision to single precision, like below:
a = torch.randn((512, 512), device='cuda', dtype=torch.float32)
b = torch.randn((512, 512), device='cuda', dtype=torch.float32)
Running this code then produces the following output:
❌ Triton and Torch differ
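For context, this message comes from the tutorial’s accuracy check. In the triton 2.0 version of 03-matrix-multiplication.py it looks roughly like the sketch below (matmul is the tutorial’s wrapper around the Triton kernel; the exact tolerances may differ from what I show here):

triton_output = matmul(a, b)        # tutorial's Triton kernel wrapper
torch_output = torch.matmul(a, b)   # cuBLAS reference
# elementwise comparison against a fixed absolute tolerance
if torch.allclose(triton_output, torch_output, atol=1e-2, rtol=0):
    print("✅ Triton and Torch match")
else:
    print("❌ Triton and Torch differ")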
Why would making the inputs higher precision cause worse numerical error in the output? I’ve been observing this on other kernels, which is how I ended up trying it on the tutorial as well.
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 24 (6 by maintainers)
Commits related to this issue
- [RUNTIME] Make apis compatible with cuda 11 drivers (#2081) https://github.com/openai/triton/issues/2042 — committed to openai/triton by Jokeren a year ago
It’s not a problem with the CUDA runtime, but rather with libcuda.so (the driver library), so using CUDA toolkit 12 won’t solve the problem. I’m working on solving the compatibility issue.
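If it helps while debugging, one way to see the runtime/driver distinction is to query the driver version directly from libcuda.so and compare it against the toolkit version your torch/triton build targets. This is only an illustrative sketch, assuming a Linux machine with the NVIDIA driver installed; cuDriverGetVersion is part of the CUDA driver API, not of Triton:

import ctypes
import torch

# CUDA toolkit version this torch build was compiled against (runtime side)
print("torch built against CUDA toolkit:", torch.version.cuda)

# Driver version reported by libcuda.so itself (driver side)
libcuda = ctypes.CDLL("libcuda.so.1")
version = ctypes.c_int()
libcuda.cuDriverGetVersion(ctypes.byref(version))  # e.g. 11080 for an 11.8 driver
major, minor = version.value // 1000, (version.value % 1000) // 10
print(f"libcuda driver version: {major}.{minor}")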