triton: MatMul tutorial fails for float32 inputs

I ran the third tutorial on matrix multiplication, and the code runs successfully with the following versions:

torch==2.0.0+cu118
triton==2.0.0

I then changed the dtype of the inputs from half precision to single precision, like below:

a = torch.randn((512, 512), device='cuda', dtype=torch.float32)
b = torch.randn((512, 512), device='cuda', dtype=torch.float32)

Running the tutorial with these inputs then prints ❌ Triton and Torch differ.
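For context, the correctness check in the tutorial is roughly the following (the exact tolerances may vary between versions of the tutorial):

triton_output = matmul(a, b)
torch_output = torch.matmul(a, b)
if torch.allclose(triton_output, torch_output, atol=1e-2, rtol=0):
    print("✅ Triton and Torch match")
else:
    print("❌ Triton and Torch differ")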

Why would making the inputs higher precision cause a larger numerical discrepancy in the output? I have been observing this with other kernels as well, which is how I ended up trying it on the tutorial.
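To get an actual error number instead of just the pass/fail message, I compare both results against a float64 reference. A minimal sketch, assuming the matmul wrapper from the tutorial is in scope:

# Hypothetical error measurement: compare Triton and Torch outputs
# against a float64 reference computed on the same inputs.
ref = (a.double() @ b.double()).float()
triton_err = (matmul(a, b) - ref).abs().max().item()
torch_err = (torch.matmul(a, b) - ref).abs().max().item()
print(f"max abs error vs float64 reference: triton={triton_err}, torch={torch_err}")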

About this issue

  • State: closed
  • Created a year ago
  • Comments: 24 (6 by maintainers)

Most upvoted comments

undefined symbol: cuLaunchKernelEx
It looks like there is a problem with CUDA. Can you try installing CUDA 12?

It’s not a problem with the CUDA runtime, but rather with libcuda.so (the driver-side library), so installing cuda-toolkit 12 won’t solve it. I’m working on fixing the compatibility issue.
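One way to check whether the installed driver actually exposes the symbol is to look it up in libcuda.so directly. A small sketch, assuming libcuda.so.1 is on the loader path (ctypes attribute access performs a dlsym under the hood):

import ctypes

# Load the driver library itself, not the toolkit runtime.
cuda = ctypes.CDLL("libcuda.so.1")
# hasattr triggers a symbol lookup; cuLaunchKernelEx is only present in sufficiently new drivers.
print("cuLaunchKernelEx found:", hasattr(cuda, "cuLaunchKernelEx"))

If this prints False, the driver is too old for the symbol regardless of which CUDA toolkit is installed.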