omniperf: omniperf analyze statistics does not match understanding
I have been using omniperf to analyze some applications. I ran a simple 8x8x8 GEMM in the BF16 data format using the following command line:
omniperf profile -n gemm_m8_k8_n8 -d 1 --device 0 -- ./rocBLAS/build/release/clients/staging/rocblas-bench -m 8 -k 8 -n 8 -f gemm_ex -r bf16_r --compute_type f32_r -i 1 -j 1 --device 0
After running omniperf analyze -p gemm_m8_k8_n8, I get the following output:

The highlighted metric MFMA Flops (BF16) does not make sense. I expect 8x8x8x2 = 1024 flops.
The kernel takes 14.8 us, see below:

So I expect 1024 / (14.8 * 1e-6) = 69.2 million FLOPS ~ 0.069 GFLOPS.
But I see 4.4 GFLOPS. How is this calculated?
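For reference, the expectation above can be written out as a short Python sketch. It only reproduces the arithmetic in the question (2*M*N*K FLOPs for a GEMM, divided by the 14.8 us kernel time). The function names are illustrative and are not part of omniperf or rocBLAS.

```python
# Sketch of the arithmetic in the question; the helper names are illustrative only.

def gemm_flops(m: int, n: int, k: int) -> int:
    """Nominal FLOP count of a GEMM: one multiply and one add per (m, n, k) triple."""
    return 2 * m * n * k


def throughput_gflops(flops: float, seconds: float) -> float:
    """Convert a FLOP count and an elapsed time into GFLOP/s."""
    return flops / seconds / 1e9


expected_flops = gemm_flops(8, 8, 8)          # 8x8x8 GEMM from the question
kernel_time_s = 14.8e-6                       # 14.8 us kernel time from the question
expected_rate = throughput_gflops(expected_flops, kernel_time_s)

print(f"{expected_flops} FLOPs, {expected_rate:.4f} GFLOP/s")
```

This gives 1024 FLOPs and roughly 0.069 GFLOP/s, which is the figure being compared against the 4.4 GFLOPS reported by the tool.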
About this issue
- State: closed
- Created a year ago
- Comments: 20 (5 by maintainers)
Yes, I am trying to generate the dump. I have been bouncing between different places to try to get some help, but haven't had any luck yet. I am currently waiting on the rocBLAS team to respond. If you can help find someone who knows how to dump assembly while using rocBLAS, I would appreciate it!
https://github.com/ROCm-Developer-Tools/HIP/blob/develop/docs/markdown/obj_tooling.md
Attaching the pmc_perf.csv.
This is the command I used to run:
omniperf profile -n gemm_m8_k8_n8 -d 1 --device 0 -- ./rocBLAS/build/release/clients/staging/rocblas-bench -m 8 -k 8 -n 8 -f gemm_ex -r bf16_r --compute_type f32_r -i 1 -j 1 --device 0
The 512 BF16 flops/instruction value in the MI-200 equation appears to be incorrect, at least according to: https://gpuopen.com/learn/amd-lab-notes/amd-lab-notes-matrix-cores-readme/, but we’ll need to double check.
That said @shaw586, assuming you did a single BF16 operation is probably incorrect. You need to take the value of MFMA-BF16 in the MFMA Arithmetic Instr Mix and multiply that by the number of waves launched to get the total number of BF16 operations. Then take the total number of BF16 ops, and multiply by the FLOP/OP count (seemingly, 1024 on MI-200) to get total FLOPS, then normalize by time.
@coleramos425, I noticed a separate issue where the values in the above section can only be normalized by the # of waves:
https://github.com/AMDResearch/omniperf/blob/main/src/omniperf_analyze/configs/gfx90a/1000_compute-unit-instruction-mix.yaml#L171
Can you open something to track?
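The calculation described in the reply above (per-wave MFMA-BF16 count, times waves launched, times FLOPs per MFMA op, normalized by time) can be sketched roughly as follows. This is not omniperf's actual implementation. The function name and the example inputs (instructions per wave, wave count) are hypothetical placeholders, and the FLOP/OP value is left as a parameter because the thread itself is unsure whether 512 or 1024 is correct for MI-200.

```python
# Rough sketch of the calculation described in the thread, not omniperf's code.
# All example inputs below are hypothetical placeholders, not values from the profile.

def mfma_bf16_gflops(mfma_bf16_per_wave: float,
                     waves_launched: int,
                     flops_per_op: int,
                     kernel_time_s: float) -> float:
    """Estimate BF16 MFMA throughput in GFLOP/s.

    mfma_bf16_per_wave: MFMA-BF16 value from the instruction mix (normalized per wave)
    waves_launched:     number of wavefronts launched by the kernel
    flops_per_op:       FLOPs credited per MFMA op (512 or 1024 is debated in the thread)
    kernel_time_s:      kernel duration in seconds
    """
    total_ops = mfma_bf16_per_wave * waves_launched   # total BF16 MFMA operations
    total_flops = total_ops * flops_per_op            # total FLOPs credited to them
    return total_flops / kernel_time_s / 1e9          # normalize by time


# Hypothetical inputs: 2 MFMA-BF16 ops per wave, 4 waves, 1024 FLOPs per op, 14.8 us.
estimate = mfma_bf16_gflops(2.0, 4, 1024, 14.8e-6)
print(f"{estimate:.4f} GFLOP/s")
```

The point of the sketch is that the reported figure depends on how many MFMA instructions were actually issued across all waves, not on the nominal 2*M*N*K of the problem, so a padded or tiled kernel can report far more FLOPs than the 1024 expected from the GEMM dimensions.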