triton: MatMul tutorial fails for float32 inputs

I ran the third tutorial on matrix multiplication, and the code runs successfully with the following versions:

torch==2.0.0+cu118
triton==2.0.0

I then changed the dtype of the inputs from half precision to single precision, like below:

a = torch.randn((512, 512), device='cuda', dtype=torch.float32)
b = torch.randn((512, 512), device='cuda', dtype=torch.float32)

Running the tutorial with these inputs then prints ❌ Triton and Torch differ.
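For context, the correctness check in the tutorial is roughly the following (the exact tolerances may vary between versions of the tutorial):

triton_output = matmul(a, b)
torch_output = torch.matmul(a, b)
if torch.allclose(triton_output, torch_output, atol=1e-2, rtol=0):
    print("✅ Triton and Torch match")
else:
    print("❌ Triton and Torch differ")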

Why would making the inputs higher precision cause a larger numerical discrepancy in the output? I have been observing this with other kernels as well, which is how I ended up trying it on the tutorial.
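To get an actual error number instead of just the pass/fail message, I compare both results against a float64 reference. A minimal sketch, assuming the matmul wrapper from the tutorial is in scope:

# Hypothetical error measurement: compare Triton and Torch outputs
# against a float64 reference computed on the same inputs.
ref = (a.double() @ b.double()).float()
triton_err = (matmul(a, b) - ref).abs().max().item()
torch_err = (torch.matmul(a, b) - ref).abs().max().item()
print(f"max abs error vs float64 reference: triton={triton_err}, torch={torch_err}")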

About this issue

  • State: closed
  • Created a year ago
  • Comments: 24 (6 by maintainers)

Most upvoted comments

undefined symbol: cuLaunchKernelEx
It looks like there is a problem with CUDA. Can you try installing CUDA 12?

It’s not a problem with the CUDA runtime, but rather with libcuda.so (the driver-side library), so installing cuda-toolkit 12 won’t solve it. I’m working on fixing the compatibility issue.
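One way to check whether the installed driver actually exposes the symbol is to look it up in libcuda.so directly. A small sketch, assuming libcuda.so.1 is on the loader path (ctypes attribute access performs a dlsym under the hood):

import ctypes

# Load the driver library itself, not the toolkit runtime.
cuda = ctypes.CDLL("libcuda.so.1")
# hasattr triggers a symbol lookup; cuLaunchKernelEx is only present in sufficiently new drivers.
print("cuLaunchKernelEx found:", hasattr(cuda, "cuLaunchKernelEx"))

If this prints False, the driver is too old for the symbol regardless of which CUDA toolkit is installed.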