omniperf: omniperf analyze statistics does not match understanding

I have been using omniperf to analyze some applications. I ran a simple 8x8x8 GEMM in BF16 data format using the following command line: omniperf profile -n gemm_m8_k8_n8 -d 1 --device 0 -- ./rocBLAS/build/release/clients/staging/rocblas-bench -m 8 -k 8 -n 8 -f gemm_ex -r bf16_r --compute_type f32_r -i 1 -j 1 --device 0

After running omniperf analyze -p gemm_m8_k8_n8, I get the following output:

The highlighted metric MFMA Flops (BF16) does not make sense. I expect 8x8x8x2 = 1024 flops.

The kernel takes 14.8 us; see below.

So I expect 1024 / (14.8 * 1e-6) = 69.2 million FLOPS ≈ 0.069 GFLOPS.
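The estimate above can be reproduced with a few lines of Python. This is only a sketch of the naive model used in the question (2*M*N*K floating-point operations for one GEMM, divided by the reported kernel time); the variable names are illustrative and not part of omniperf.

```python
# Naive GEMM cost model: each of the M*N outputs needs K multiply-adds,
# and one multiply-add counts as 2 floating-point operations.
m = n = k = 8
flops = 2 * m * n * k            # 8 x 8 x 8 x 2 = 1024

# Kernel duration taken from the question (14.8 microseconds).
kernel_time_s = 14.8e-6

rate = flops / kernel_time_s     # floating-point operations per second
print(f"{flops} FLOP, {rate / 1e6:.1f} MFLOP/s, {rate / 1e9:.4f} GFLOP/s")
```

This model counts only the useful arithmetic of the 8x8x8 problem; it says nothing about how many MFMA instructions the kernel actually issues.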

But I see 4.4 Gflops. How is this calculated?

About this issue

State: closed
Created a year ago
Comments: 20 (5 by maintainers)

Most upvoted comments

Yes, I am trying to generate the dump. I have been bouncing between different places trying to get some help, but haven't had any luck yet. I am currently waiting on the rocBLAS team to respond. If you can help find someone who knows how to dump assembly while using rocBLAS, I would appreciate it!

nishshah0 on Mar 6, 2023

https://github.com/ROCm-Developer-Tools/HIP/blob/develop/docs/markdown/obj_tooling.md

skyreflectedinmirrors on Jan 24, 2023

Attaching the pmc_perf.csv.

This is the command I used to run: omniperf profile -n gemm_m8_k8_n8 -d 1 --device 0 -- ./rocBLAS/build/release/clients/staging/rocblas-bench -m 8 -k 8 -n 8 -f gemm_ex -r bf16_r --compute_type f32_r -i 1 -j 1 --device 0

nishshah0 on Jan 17, 2023

The 512 BF16 flops/instruction value in the MI-200 equation appears to be incorrect, at least according to: https://gpuopen.com/learn/amd-lab-notes/amd-lab-notes-matrix-cores-readme/, but we’ll need to double check.

That said @shaw586, assuming you did a single BF16 operation is probably incorrect. You need to take the value of MFMA-BF16 in the MFMA Arithmetic Instr Mix and multiply it by the number of waves launched to get the total number of BF16 operations. Then take the total number of BF16 ops and multiply by the FLOP/OP count (seemingly 1024 on MI-200) to get the total FLOPs, then normalize by time.
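The recipe in the paragraph above can be sketched in Python as follows. The function name and the example inputs (2 MFMA-BF16 instructions per wave, 4 waves) are hypothetical placeholders, not values from the attached pmc_perf.csv, and the 1024 FLOP/instruction factor is the unconfirmed MI-200 figure mentioned above.

```python
def mfma_gflops(mfma_instr_per_wave, waves, flops_per_instr, kernel_time_s):
    """Estimate MFMA throughput in GFLOP/s from per-wave instruction counts."""
    total_instr = mfma_instr_per_wave * waves      # total MFMA-BF16 instructions
    total_flops = total_instr * flops_per_instr    # FLOPs performed by those instructions
    return total_flops / kernel_time_s / 1e9       # normalize by time

# Hypothetical inputs, for illustration only.
example = mfma_gflops(mfma_instr_per_wave=2, waves=4,
                      flops_per_instr=1024, kernel_time_s=14.8e-6)
print(f"example: {example:.3f} GFLOP/s")

# Working backwards from the numbers in the question: a reported 4.4 GFLOPS
# over a 14.8 us kernel implies roughly this many FLOPs were counted.
implied_flops = 4.4e9 * 14.8e-6
print(f"implied FLOP count: {implied_flops:.0f}")
```

The implied count is far above the 1024 FLOPs of the naive 2*M*N*K estimate, which is consistent with the metric counting issued MFMA instructions at their full per-instruction FLOP count rather than the useful work of the 8x8x8 problem.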

@coleramos425 – I noticed a separate issue where the values in the above section can only be normalized by the # of waves:

https://github.com/AMDResearch/omniperf/blob/main/src/omniperf_analyze/configs/gfx90a/1000_compute-unit-instruction-mix.yaml#L171

Can you open something to track?

skyreflectedinmirrors on Jan 16, 2023