iree: Bad dispatch outputs from SDXL VAE

What happened?

I’ve narrowed a numerics issue in a model down to a bad dispatch. The whole model outputs zeros, but this dispatch is producing some NaNs and some Infs.

Steps to reproduce your issue

  1. iree-compile --iree-hal-target-backends=llvm-cpu --iree-input-type=torch --mlir-print-debuginfo --mlir-print-op-on-diagnostic=false --iree-llvmcpu-target-cpu-features=host --iree-llvmcpu-target-triple=x86_64-linux-gnu --iree-stream-resource-index-bits=64 --iree-vm-target-index-bits=64 --iree-opt-const-eval=false --iree-opt-const-expr-hoisting=false --iree-llvmcpu-enable-ukernels=all stable_diffusion_xl_base_1_0_vae.mlir -o cpu_vae.vmfb --iree-flow-trace-dispatch-tensors
  2. Observe the output of dispatch 208
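
When combing through per-dispatch outputs, it can help to reduce each tensor to a few counts (NaN, Inf, exact zero) and stop at the first dispatch that looks wrong. The following is a minimal sketch of that idea, not part of IREE: `summarize_tensor` and `first_bad_dispatch` are hypothetical helper names, and the sketch assumes the traced values have already been loaded into Python as plain sequences or NumPy arrays (parsing the trace output itself is not shown).

```python
import numpy as np


def summarize_tensor(values):
    """Count NaN, +/-Inf, and exact-zero entries in one tensor's values."""
    arr = np.asarray(values, dtype=np.float32)
    return {
        "total": int(arr.size),
        "nan": int(np.isnan(arr).sum()),
        "inf": int(np.isinf(arr).sum()),
        "zero": int((arr == 0).sum()),
    }


def first_bad_dispatch(outputs):
    """Return (name, summary) for the first output containing NaN or Inf.

    `outputs` is an iterable of (dispatch_name, values) pairs in execution
    order (a hypothetical structure, filled in by whatever loads the dumps).
    Returns None if every output is finite.
    """
    for name, values in outputs:
        summary = summarize_tensor(values)
        if summary["nan"] or summary["inf"]:
            return name, summary
    return None
```

Feeding the pairs in execution order matters: a NaN or Inf produced by one dispatch tends to propagate downstream, so the first hit is the interesting one. The same `zero` count can be used to look for the point where outputs become all zeros.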

What component(s) does this issue relate to?

MLIR, Runtime

Version information

da982154aebccb41c1cf9bf5594097a2e6906b19

Additional context

No response

About this issue

  • State: open
  • Created 4 months ago
  • Comments: 34 (20 by maintainers)

Most upvoted comments

whoa, first legit find from the suite and it hasn’t landed yet! high five

All the patches have landed in IREE. @gpetters94, could you help verify whether the issue is addressed?

@monorimet Alright, cool. I’ll keep combing through the dispatches to find where the zeros are coming from. (Do you think I should close this and open another issue?)

No, it’s ok, I’m thinking we might end up narrowing to a very similar dispatch after updating to the attention-retaining IR

I’m setting a few breaks to see where the zeros start. The dispatch graph shows these dispatches getting pretty huge:

v439 [shape = ellipse, label = "%435 = flow.tensor.reshape\ntensor<32x4194304xf32>"];
    v440 [shape = box, label = "%436 = flow.dispatch[]\n@main_dispatch_170::@main_dispatch_170_generic_32x4194304_f32(%435)\n%0 = flow.dispatch.tensor.load %arg0 -> tensor<32x4194304xf32>\l%1 = tensor.empty() : tensor<32xf32> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":310:26 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%2 = linalg.fill ins(%cst : f32) outs(%1 : tensor<32xf32>) -> tensor<32xf32> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":310:26 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%3 = linalg.generic[parallel, reduction] (%0) -> (%2)\l        %5 = arith.addf %in, %out : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        linalg.yield %5 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%4 = linalg.generic[parallel] (%3) -> (%1)\l        %5 = arith.divf %in, %cst_0 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        linalg.yield %5 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at 
\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\lflow.dispatch.tensor.store %4, %arg1\ltensor<32xf32>"];
    v441 [shape = box, label = "%437 = flow.dispatch[]\n@main_dispatch_171::@main_dispatch_171_generic_32x4194304_f32(%435, %436)\n%0 = flow.dispatch.tensor.load %arg0 -> tensor<32x4194304xf32>\l%1 = flow.dispatch.tensor.load %arg1 -> tensor<32xf32>\l%2 = tensor.empty() : tensor<32x4194304xf32> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%3 = linalg.generic[parallel, parallel] (%0, %1) -> (%2)\l        %4 = arith.subf %in, %in_0 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %5 = arith.mulf %4, %4 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        linalg.yield %5 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\lflow.dispatch.tensor.store %3, %arg2\ltensor<32x4194304xf32>"];
    v442 [shape = box, label = "%438 = flow.dispatch[]\n@main_dispatch_172::@main_dispatch_172_generic_32x4194304_f32(%437)\n%0 = flow.dispatch.tensor.load %arg0 -> tensor<32x4194304xf32>\l%1 = tensor.empty() : tensor<32xf32> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":310:26 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%2 = linalg.fill ins(%cst : f32) outs(%1 : tensor<32xf32>) -> tensor<32xf32> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":310:26 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%3 = linalg.generic[parallel, reduction] (%0) -> (%2)\l        %4 = arith.addf %in, %out : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        linalg.yield %4 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\lflow.dispatch.tensor.store %3, %arg1\ltensor<32xf32>"];
    v443 [shape = ellipse, label = "%439 = flow.tensor.reshape\ntensor<32x4194304xf16>"];
    v444 [shape = box, label = "%440 = flow.dispatch[]\n@main_dispatch_173::@main_dispatch_173_generic_32x4194304_f16xf32xf32xf32(%439, %436, %438)\n%0 = flow.dispatch.tensor.load %arg0 -> tensor<32x4194304xf16>\l%1 = flow.dispatch.tensor.load %arg1 -> tensor<32xf32>\l%2 = flow.dispatch.tensor.load %arg2 -> tensor<32xf32>\l%3 = tensor.empty() : tensor<32x4194304xf32> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2104:12 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%4 = linalg.generic[parallel, parallel] (%0, %1, %2) -> (%3)\l        %5 = arith.extf %in : f16 to f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2103:12 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %6 = arith.subf %5, %in_1 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2103:12 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %7 = arith.divf %in_2, %cst : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %8 = arith.addf %7, %cst_0 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2100:12 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %9 = math.rsqrt %8 : f32 
loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2101:12 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %10 = arith.mulf %6, %9 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2104:12 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        linalg.yield %10 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2104:12 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\lflow.dispatch.tensor.store %4, %arg3\ltensor<32x4194304xf32>"];
    v445 [shape = ellipse, label = "%441 = flow.tensor.reshape\ntensor<128x1048576xf32>"];
    v446 [shape = box, label = "%442 = flow.dispatch[]\n@main_dispatch_238::@main_dispatch_238_generic_128x1048576_f32xf16xf16xf16(%441, %cst_121, %cst_120)\n%0 = flow.dispatch.tensor.load %arg0 -> tensor<128x1048576xf32>\l%1 = flow.dispatch.tensor.load %arg1 -> tensor<128xf16>\l%2 = flow.dispatch.tensor.load %arg2 -> tensor<128xf16>\l%3 = tensor.empty() : tensor<128x1048576xf16> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2860:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%4 = linalg.generic[parallel, parallel] (%0, %1, %2) -> (%3)\l        %5 = arith.extf %in_0 : f16 to f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2837:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %6 = arith.mulf %in, %5 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2837:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %7 = arith.extf %in_1 : f16 to f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2839:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %8 = arith.addf %6, %7 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2839:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %9 = arith.truncf %8 : f32 to f16 
loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2841:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %10 = arith.negf %9 : f16 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2860:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %11 = math.exp %10 : f16 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2860:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %12 = arith.addf %11, %cst : f16 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2860:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %13 = arith.divf %cst, %12 : f16 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2860:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %14 = arith.mulf %13, %9 : f16 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2860:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        linalg.yield %14 : f16 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2860:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at 
\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\lflow.dispatch.tensor.store %4, %arg3\ltensor<128x1048576xf16>"];
    v447 [shape = ellipse, label = "%443 = flow.tensor.reshape\ntensor<1x128x1024x1024xf16>"];
    v448 [shape = box, label = "%444 = flow.dispatch[]\n@main_dispatch_239::@main_dispatch_239_slow_memcpy(%443, %421)\n%0 = flow.dispatch.tensor.load %arg0 -> tensor<1x128x1024x1024xf16>\lflow.dispatch.tensor.store %0, %arg1\ltensor<1x128x1026x1026xf16>"];
    v449 [shape = ellipse, label = "%445 = flow.tensor.reshape\ntensor<1x128xf16>"];
    v450 [shape = box, label = "%446 = flow.dispatch[]\n@main_dispatch_252::@main_dispatch_252_conv_2d_nchw_fchw_1x128x1024x1024x128x3x3_f16(%444, %cst_122, %445)\n%0 = flow.dispatch.tensor.load %arg0 -> tensor<1x128x1026x1026xf16>\l%1 = flow.dispatch.tensor.load %arg1 -> tensor<128x128x3x3xf16>\l%2 = flow.dispatch.tensor.load %arg2 -> tensor<1x128xf16>\l%3 = tensor.empty() : tensor<1x128x1024x1024xf16> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2791:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%4 = linalg.fill ins(%cst : f16) outs(%3 : tensor<1x128x1024x1024xf16>) -> tensor<1x128x1024x1024xf16> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2791:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%5 = linalg.conv_2d_nchw_fchw {dilations = dense<1> : vector<2xi64>, strides = dense<1> : vector<2xi64>} ins(%0, %1 : tensor<1x128x1026x1026xf16>, tensor<128x128x3x3xf16>) outs(%4 : tensor<1x128x1024x1024xf16>) -> tensor<1x128x1024x1024xf16> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2992:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%6 = linalg.generic[parallel, parallel, parallel, parallel] (%5, %2) -> (%3)\l        %7 = arith.addf %in, %in_0 : f16 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2992:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        linalg.yield %7 : f16 
loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2992:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\lflow.dispatch.tensor.store %6, %arg3\ltensor<1x128x1024x1024xf16>"];
    v451 [shape = ellipse, label = "%447 = flow.tensor.reshape\ntensor<134217728xf16>"];
    v452 [shape = box, label = "%448 = flow.dispatch[]\n@main_dispatch_169::@main_dispatch_169_generic_134217728_f16xf32(%447)\n%0 = flow.dispatch.tensor.load %arg0 -> tensor<134217728xf16>\l%1 = tensor.empty() : tensor<134217728xf32> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2091:12 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%2 = linalg.generic[parallel] (%0) -> (%1)\l        %3 = arith.extf %in : f16 to f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2091:12 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        linalg.yield %3 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2091:12 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\lflow.dispatch.tensor.store %2, %arg1\ltensor<134217728xf32>"];
    v453 [shape = ellipse, label = "%449 = flow.tensor.reshape\ntensor<32x4194304xf32>"];
    v454 [shape = box, label = "%450 = flow.dispatch[]\n@main_dispatch_170::@main_dispatch_170_generic_32x4194304_f32(%449)\n%0 = flow.dispatch.tensor.load %arg0 -> tensor<32x4194304xf32>\l%1 = tensor.empty() : tensor<32xf32> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":310:26 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%2 = linalg.fill ins(%cst : f32) outs(%1 : tensor<32xf32>) -> tensor<32xf32> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":310:26 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%3 = linalg.generic[parallel, reduction] (%0) -> (%2)\l        %5 = arith.addf %in, %out : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        linalg.yield %5 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%4 = linalg.generic[parallel] (%3) -> (%1)\l        %5 = arith.divf %in, %cst_0 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        linalg.yield %5 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at 
\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\lflow.dispatch.tensor.store %4, %arg1\ltensor<32xf32>"];
    v455 [shape = box, label = "%451 = flow.dispatch[]\n@main_dispatch_171::@main_dispatch_171_generic_32x4194304_f32(%449, %450)\n%0 = flow.dispatch.tensor.load %arg0 -> tensor<32x4194304xf32>\l%1 = flow.dispatch.tensor.load %arg1 -> tensor<32xf32>\l%2 = tensor.empty() : tensor<32x4194304xf32> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%3 = linalg.generic[parallel, parallel] (%0, %1) -> (%2)\l        %4 = arith.subf %in, %in_0 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %5 = arith.mulf %4, %4 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        linalg.yield %5 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\lflow.dispatch.tensor.store %3, %arg2\ltensor<32x4194304xf32>"];
    v456 [shape = box, label = "%452 = flow.dispatch[]\n@main_dispatch_172::@main_dispatch_172_generic_32x4194304_f32(%451)\n%0 = flow.dispatch.tensor.load %arg0 -> tensor<32x4194304xf32>\l%1 = tensor.empty() : tensor<32xf32> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":310:26 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%2 = linalg.fill ins(%cst : f32) outs(%1 : tensor<32xf32>) -> tensor<32xf32> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":310:26 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%3 = linalg.generic[parallel, reduction] (%0) -> (%2)\l        %4 = arith.addf %in, %out : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        linalg.yield %4 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\lflow.dispatch.tensor.store %3, %arg1\ltensor<32xf32>"];
    v457 [shape = ellipse, label = "%453 = flow.tensor.reshape\ntensor<32x4194304xf16>"];
    v458 [shape = box, label = "%454 = flow.dispatch[]\n@main_dispatch_173::@main_dispatch_173_generic_32x4194304_f16xf32xf32xf32(%453, %450, %452)\n%0 = flow.dispatch.tensor.load %arg0 -> tensor<32x4194304xf16>\l%1 = flow.dispatch.tensor.load %arg1 -> tensor<32xf32>\l%2 = flow.dispatch.tensor.load %arg2 -> tensor<32xf32>\l%3 = tensor.empty() : tensor<32x4194304xf32> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2104:12 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%4 = linalg.generic[parallel, parallel] (%0, %1, %2) -> (%3)\l        %5 = arith.extf %in : f16 to f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2103:12 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %6 = arith.subf %5, %in_1 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2103:12 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %7 = arith.divf %in_2, %cst : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %8 = arith.addf %7, %cst_0 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2100:12 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %9 = math.rsqrt %8 : f32 
loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2101:12 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %10 = arith.mulf %6, %9 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2104:12 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        linalg.yield %10 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2104:12 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\lflow.dispatch.tensor.store %4, %arg3\ltensor<32x4194304xf32>"];
    v459 [shape = ellipse, label = "%455 = flow.tensor.reshape\ntensor<128x1048576xf32>"];
    v460 [shape = box, label = "%456 = flow.dispatch[]\n@main_dispatch_238::@main_dispatch_238_generic_128x1048576_f32xf16xf16xf16(%455, %cst_125, %cst_124)\n%0 = flow.dispatch.tensor.load %arg0 -> tensor<128x1048576xf32>\l%1 = flow.dispatch.tensor.load %arg1 -> tensor<128xf16>\l%2 = flow.dispatch.tensor.load %arg2 -> tensor<128xf16>\l%3 = tensor.empty() : tensor<128x1048576xf16> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2860:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%4 = linalg.generic[parallel, parallel] (%0, %1, %2) -> (%3)\l        %5 = arith.extf %in_0 : f16 to f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2837:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %6 = arith.mulf %in, %5 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2837:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %7 = arith.extf %in_1 : f16 to f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2839:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %8 = arith.addf %6, %7 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2839:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %9 = arith.truncf %8 : f32 to f16 
loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2841:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %10 = arith.negf %9 : f16 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2860:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %11 = math.exp %10 : f16 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2860:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %12 = arith.addf %11, %cst : f16 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2860:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %13 = arith.divf %cst, %12 : f16 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2860:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %14 = arith.mulf %13, %9 : f16 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2860:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        linalg.yield %14 : f16 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2860:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at 
\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\lflow.dispatch.tensor.store %4, %arg3\ltensor<128x1048576xf16>"];
    v461 [shape = ellipse, label = "%457 = flow.tensor.reshape\ntensor<1x128x1024x1024xf16>"];
    v462 [shape = box, label = "%458 = flow.dispatch[]\n@main_dispatch_239::@main_dispatch_239_slow_memcpy(%457, %421)\n%0 = flow.dispatch.tensor.load %arg0 -> tensor<1x128x1024x1024xf16>\lflow.dispatch.tensor.store %0, %arg1\ltensor<1x128x1026x1026xf16>"];
    v463 [shape = ellipse, label = "%459 = flow.tensor.reshape\ntensor<1x128xf16>"];
    v464 [shape = box, label = "%460 = flow.dispatch[]\n@main_dispatch_260::@main_dispatch_260_conv_2d_nchw_fchw_1x128x1024x1024x128x3x3_f16(%458, %cst_126, %432, %459)\n%0 = flow.dispatch.tensor.load %arg0 -> tensor<1x128x1026x1026xf16>\l%1 = flow.dispatch.tensor.load %arg1 -> tensor<128x128x3x3xf16>\l%2 = flow.dispatch.tensor.load %arg2 -> tensor<1x128x1024x1024xf16>\l%3 = flow.dispatch.tensor.load %arg3 -> tensor<1x128xf16>\l%4 = tensor.empty() : tensor<1x128x1024x1024xf16> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2791:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%5 = linalg.fill ins(%cst : f16) outs(%4 : tensor<1x128x1024x1024xf16>) -> tensor<1x128x1024x1024xf16> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2791:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%6 = linalg.conv_2d_nchw_fchw {dilations = dense<1> : vector<2xi64>, strides = dense<1> : vector<2xi64>} ins(%0, %1 : tensor<1x128x1026x1026xf16>, tensor<128x128x3x3xf16>) outs(%5 : tensor<1x128x1024x1024xf16>) -> tensor<1x128x1024x1024xf16> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":3082:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%7 = linalg.generic[parallel, parallel, parallel, parallel] (%2, %6, %3) -> (%4)\l        %8 = arith.addf %in_0, %in_1 : f16 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":3082:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        
%9 = arith.addf %in, %8 : f16 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":3084:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        linalg.yield %9 : f16 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":3084:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\lflow.dispatch.tensor.store %7, %arg4\ltensor<1x128x1024x1024xf16>"];
    v465 [shape = ellipse, label = "%461 = flow.tensor.reshape\ntensor<134217728xf16>"];
    v466 [shape = box, label = "%462 = flow.dispatch[]\n@main_dispatch_169::@main_dispatch_169_generic_134217728_f16xf32(%461)\n%0 = flow.dispatch.tensor.load %arg0 -> tensor<134217728xf16>\l%1 = tensor.empty() : tensor<134217728xf32> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2091:12 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%2 = linalg.generic[parallel] (%0) -> (%1)\l        %3 = arith.extf %in : f16 to f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2091:12 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        linalg.yield %3 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2091:12 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\lflow.dispatch.tensor.store %2, %arg1\ltensor<134217728xf32>"];
    v467 [shape = ellipse, label = "%463 = flow.tensor.reshape\ntensor<32x4194304xf32>"];
    v468 [shape = box, label = "%464 = flow.dispatch[]\n@main_dispatch_170::@main_dispatch_170_generic_32x4194304_f32(%463)\n%0 = flow.dispatch.tensor.load %arg0 -> tensor<32x4194304xf32>\l%1 = tensor.empty() : tensor<32xf32> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":310:26 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%2 = linalg.fill ins(%cst : f32) outs(%1 : tensor<32xf32>) -> tensor<32xf32> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":310:26 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%3 = linalg.generic[parallel, reduction] (%0) -> (%2)\l        %5 = arith.addf %in, %out : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        linalg.yield %5 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%4 = linalg.generic[parallel] (%3) -> (%1)\l        %5 = arith.divf %in, %cst_0 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        linalg.yield %5 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at 
\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\lflow.dispatch.tensor.store %4, %arg1\ltensor<32xf32>"];
    v469 [shape = box, label = "%465 = flow.dispatch[]\n@main_dispatch_171::@main_dispatch_171_generic_32x4194304_f32(%463, %464)\n%0 = flow.dispatch.tensor.load %arg0 -> tensor<32x4194304xf32>\l%1 = flow.dispatch.tensor.load %arg1 -> tensor<32xf32>\l%2 = tensor.empty() : tensor<32x4194304xf32> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%3 = linalg.generic[parallel, parallel] (%0, %1) -> (%2)\l        %4 = arith.subf %in, %in_0 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %5 = arith.mulf %4, %4 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        linalg.yield %5 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\lflow.dispatch.tensor.store %3, %arg2\ltensor<32x4194304xf32>"];
    v470 [shape = box, label = "%466 = flow.dispatch[]\n@main_dispatch_172::@main_dispatch_172_generic_32x4194304_f32(%465)\n%0 = flow.dispatch.tensor.load %arg0 -> tensor<32x4194304xf32>\l%1 = tensor.empty() : tensor<32xf32> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":310:26 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%2 = linalg.fill ins(%cst : f32) outs(%1 : tensor<32xf32>) -> tensor<32xf32> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":310:26 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%3 = linalg.generic[parallel, reduction] (%0) -> (%2)\l        %4 = arith.addf %in, %out : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        linalg.yield %4 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\lflow.dispatch.tensor.store %3, %arg1\ltensor<32xf32>"];
    v471 [shape = ellipse, label = "%467 = flow.tensor.reshape\ntensor<32x4194304xf16>"];
    v472 [shape = box, label = "%468 = flow.dispatch[]\n@main_dispatch_173::@main_dispatch_173_generic_32x4194304_f16xf32xf32xf32(%467, %464, %466)\n%0 = flow.dispatch.tensor.load %arg0 -> tensor<32x4194304xf16>\l%1 = flow.dispatch.tensor.load %arg1 -> tensor<32xf32>\l%2 = flow.dispatch.tensor.load %arg2 -> tensor<32xf32>\l%3 = tensor.empty() : tensor<32x4194304xf32> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2104:12 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%4 = linalg.generic[parallel, parallel] (%0, %1, %2) -> (%3)\l        %5 = arith.extf %in : f16 to f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2103:12 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %6 = arith.subf %5, %in_1 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2103:12 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %7 = arith.divf %in_2, %cst : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %8 = arith.addf %7, %cst_0 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2100:12 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %9 = math.rsqrt %8 : f32 
loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2101:12 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %10 = arith.mulf %6, %9 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2104:12 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        linalg.yield %10 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2104:12 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\lflow.dispatch.tensor.store %4, %arg3\ltensor<32x4194304xf32>"];
    v473 [shape = ellipse, label = "%469 = flow.tensor.reshape\ntensor<128x1048576xf32>"];
    v474 [shape = box, label = "%470 = flow.dispatch[]\n@main_dispatch_238::@main_dispatch_238_generic_128x1048576_f32xf16xf16xf16(%469, %cst_129, %cst_128)\n%0 = flow.dispatch.tensor.load %arg0 -> tensor<128x1048576xf32>\l%1 = flow.dispatch.tensor.load %arg1 -> tensor<128xf16>\l%2 = flow.dispatch.tensor.load %arg2 -> tensor<128xf16>\l%3 = tensor.empty() : tensor<128x1048576xf16> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2860:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%4 = linalg.generic[parallel, parallel] (%0, %1, %2) -> (%3)\l        %5 = arith.extf %in_0 : f16 to f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2837:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %6 = arith.mulf %in, %5 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2837:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %7 = arith.extf %in_1 : f16 to f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2839:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %8 = arith.addf %6, %7 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2839:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %9 = arith.truncf %8 : f32 to f16 
loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2841:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %10 = arith.negf %9 : f16 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2860:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %11 = math.exp %10 : f16 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2860:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %12 = arith.addf %11, %cst : f16 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2860:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %13 = arith.divf %cst, %12 : f16 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2860:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %14 = arith.mulf %13, %9 : f16 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2860:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        linalg.yield %14 : f16 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2860:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at 
\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\lflow.dispatch.tensor.store %4, %arg3\ltensor<128x1048576xf16>"];
    v475 [shape = ellipse, label = "%471 = flow.tensor.reshape\ntensor<1x128x1024x1024xf16>"];
    v476 [shape = box, label = "%472 = flow.dispatch[]\n@main_dispatch_239::@main_dispatch_239_slow_memcpy(%471, %421)\n%0 = flow.dispatch.tensor.load %arg0 -> tensor<1x128x1024x1024xf16>\lflow.dispatch.tensor.store %0, %arg1\ltensor<1x128x1026x1026xf16>"];
    v477 [shape = ellipse, label = "%473 = flow.tensor.reshape\ntensor<1x128xf16>"];
    v478 [shape = box, label = "%474 = flow.dispatch[]\n@main_dispatch_252::@main_dispatch_252_conv_2d_nchw_fchw_1x128x1024x1024x128x3x3_f16(%472, %cst_130, %473)\n%0 = flow.dispatch.tensor.load %arg0 -> tensor<1x128x1026x1026xf16>\l%1 = flow.dispatch.tensor.load %arg1 -> tensor<128x128x3x3xf16>\l%2 = flow.dispatch.tensor.load %arg2 -> tensor<1x128xf16>\l%3 = tensor.empty() : tensor<1x128x1024x1024xf16> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2791:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%4 = linalg.fill ins(%cst : f16) outs(%3 : tensor<1x128x1024x1024xf16>) -> tensor<1x128x1024x1024xf16> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2791:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%5 = linalg.conv_2d_nchw_fchw {dilations = dense<1> : vector<2xi64>, strides = dense<1> : vector<2xi64>} ins(%0, %1 : tensor<1x128x1026x1026xf16>, tensor<128x128x3x3xf16>) outs(%4 : tensor<1x128x1024x1024xf16>) -> tensor<1x128x1024x1024xf16> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2992:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%6 = linalg.generic[parallel, parallel, parallel, parallel] (%5, %2) -> (%3)\l        %7 = arith.addf %in, %in_0 : f16 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2992:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        linalg.yield %7 : f16 
loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2992:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\lflow.dispatch.tensor.store %6, %arg3\ltensor<1x128x1024x1024xf16>"];
    v479 [shape = ellipse, label = "%475 = flow.tensor.reshape\ntensor<134217728xf16>"];
    v480 [shape = box, label = "%476 = flow.dispatch[]\n@main_dispatch_169::@main_dispatch_169_generic_134217728_f16xf32(%475)\n%0 = flow.dispatch.tensor.load %arg0 -> tensor<134217728xf16>\l%1 = tensor.empty() : tensor<134217728xf32> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2091:12 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%2 = linalg.generic[parallel] (%0) -> (%1)\l        %3 = arith.extf %in : f16 to f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2091:12 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        linalg.yield %3 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2091:12 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\lflow.dispatch.tensor.store %2, %arg1\ltensor<134217728xf32>"];
    v481 [shape = ellipse, label = "%477 = flow.tensor.reshape\ntensor<32x4194304xf32>"];
    v482 [shape = box, label = "%478 = flow.dispatch[]\n@main_dispatch_170::@main_dispatch_170_generic_32x4194304_f32(%477)\n%0 = flow.dispatch.tensor.load %arg0 -> tensor<32x4194304xf32>\l%1 = tensor.empty() : tensor<32xf32> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":310:26 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%2 = linalg.fill ins(%cst : f32) outs(%1 : tensor<32xf32>) -> tensor<32xf32> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":310:26 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%3 = linalg.generic[parallel, reduction] (%0) -> (%2)\l        %5 = arith.addf %in, %out : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        linalg.yield %5 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%4 = linalg.generic[parallel] (%3) -> (%1)\l        %5 = arith.divf %in, %cst_0 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        linalg.yield %5 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at 
\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\lflow.dispatch.tensor.store %4, %arg1\ltensor<32xf32>"];
    v483 [shape = box, label = "%479 = flow.dispatch[]\n@main_dispatch_171::@main_dispatch_171_generic_32x4194304_f32(%477, %478)\n%0 = flow.dispatch.tensor.load %arg0 -> tensor<32x4194304xf32>\l%1 = flow.dispatch.tensor.load %arg1 -> tensor<32xf32>\l%2 = tensor.empty() : tensor<32x4194304xf32> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%3 = linalg.generic[parallel, parallel] (%0, %1) -> (%2)\l        %4 = arith.subf %in, %in_0 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %5 = arith.mulf %4, %4 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        linalg.yield %5 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\lflow.dispatch.tensor.store %3, %arg2\ltensor<32x4194304xf32>"];
    v484 [shape = box, label = "%480 = flow.dispatch[]\n@main_dispatch_172::@main_dispatch_172_generic_32x4194304_f32(%479)\n%0 = flow.dispatch.tensor.load %arg0 -> tensor<32x4194304xf32>\l%1 = tensor.empty() : tensor<32xf32> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":310:26 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%2 = linalg.fill ins(%cst : f32) outs(%1 : tensor<32xf32>) -> tensor<32xf32> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":310:26 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%3 = linalg.generic[parallel, reduction] (%0) -> (%2)\l        %4 = arith.addf %in, %out : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        linalg.yield %4 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\lflow.dispatch.tensor.store %3, %arg1\ltensor<32xf32>"];
    v485 [shape = ellipse, label = "%481 = flow.tensor.reshape\ntensor<32x4194304xf16>"];
    v486 [shape = box, label = "%482 = flow.dispatch[]\n@main_dispatch_173::@main_dispatch_173_generic_32x4194304_f16xf32xf32xf32(%481, %478, %480)\n%0 = flow.dispatch.tensor.load %arg0 -> tensor<32x4194304xf16>\l%1 = flow.dispatch.tensor.load %arg1 -> tensor<32xf32>\l%2 = flow.dispatch.tensor.load %arg2 -> tensor<32xf32>\l%3 = tensor.empty() : tensor<32x4194304xf32> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2104:12 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%4 = linalg.generic[parallel, parallel] (%0, %1, %2) -> (%3)\l        %5 = arith.extf %in : f16 to f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2103:12 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %6 = arith.subf %5, %in_1 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2103:12 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %7 = arith.divf %in_2, %cst : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %8 = arith.addf %7, %cst_0 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2100:12 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %9 = math.rsqrt %8 : f32 
loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2101:12 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %10 = arith.mulf %6, %9 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2104:12 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        linalg.yield %10 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2104:12 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\lflow.dispatch.tensor.store %4, %arg3\ltensor<32x4194304xf32>"];
    v487 [shape = ellipse, label = "%483 = flow.tensor.reshape\ntensor<128x1048576xf32>"];
    v488 [shape = box, label = "%484 = flow.dispatch[]\n@main_dispatch_238::@main_dispatch_238_generic_128x1048576_f32xf16xf16xf16(%483, %cst_133, %cst_132)\n%0 = flow.dispatch.tensor.load %arg0 -> tensor<128x1048576xf32>\l%1 = flow.dispatch.tensor.load %arg1 -> tensor<128xf16>\l%2 = flow.dispatch.tensor.load %arg2 -> tensor<128xf16>\l%3 = tensor.empty() : tensor<128x1048576xf16> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2860:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%4 = linalg.generic[parallel, parallel] (%0, %1, %2) -> (%3)\l        %5 = arith.extf %in_0 : f16 to f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2837:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %6 = arith.mulf %in, %5 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2837:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %7 = arith.extf %in_1 : f16 to f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2839:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %8 = arith.addf %6, %7 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2839:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %9 = arith.truncf %8 : f32 to f16 
loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2841:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %10 = arith.negf %9 : f16 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2860:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %11 = math.exp %10 : f16 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2860:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %12 = arith.addf %11, %cst : f16 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2860:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %13 = arith.divf %cst, %12 : f16 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2860:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %14 = arith.mulf %13, %9 : f16 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2860:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        linalg.yield %14 : f16 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2860:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at 
\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\lflow.dispatch.tensor.store %4, %arg3\ltensor<128x1048576xf16>"];
    v489 [shape = ellipse, label = "%485 = flow.tensor.reshape\ntensor<1x128x1024x1024xf16>"];
    v490 [shape = box, label = "%486 = flow.dispatch[]\n@main_dispatch_239::@main_dispatch_239_slow_memcpy(%485, %421)\n%0 = flow.dispatch.tensor.load %arg0 -> tensor<1x128x1024x1024xf16>\lflow.dispatch.tensor.store %0, %arg1\ltensor<1x128x1026x1026xf16>"];
    v491 [shape = ellipse, label = "%487 = flow.tensor.reshape\ntensor<1x128xf16>"];
    v492 [shape = box, label = "%488 = flow.dispatch[]\n@main_dispatch_260::@main_dispatch_260_conv_2d_nchw_fchw_1x128x1024x1024x128x3x3_f16(%486, %cst_134, %460, %487)\n%0 = flow.dispatch.tensor.load %arg0 -> tensor<1x128x1026x1026xf16>\l%1 = flow.dispatch.tensor.load %arg1 -> tensor<128x128x3x3xf16>\l%2 = flow.dispatch.tensor.load %arg2 -> tensor<1x128x1024x1024xf16>\l%3 = flow.dispatch.tensor.load %arg3 -> tensor<1x128xf16>\l%4 = tensor.empty() : tensor<1x128x1024x1024xf16> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2791:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%5 = linalg.fill ins(%cst : f16) outs(%4 : tensor<1x128x1024x1024xf16>) -> tensor<1x128x1024x1024xf16> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2791:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%6 = linalg.conv_2d_nchw_fchw {dilations = dense<1> : vector<2xi64>, strides = dense<1> : vector<2xi64>} ins(%0, %1 : tensor<1x128x1026x1026xf16>, tensor<128x128x3x3xf16>) outs(%5 : tensor<1x128x1024x1024xf16>) -> tensor<1x128x1024x1024xf16> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":3082:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%7 = linalg.generic[parallel, parallel, parallel, parallel] (%2, %6, %3) -> (%4)\l        %8 = arith.addf %in_0, %in_1 : f16 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":3082:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        
%9 = arith.addf %in, %8 : f16 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":3084:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        linalg.yield %9 : f16 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":3084:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\lflow.dispatch.tensor.store %7, %arg4\ltensor<1x128x1024x1024xf16>"];
    v493 [shape = ellipse, label = "%489 = flow.tensor.reshape\ntensor<134217728xf16>"];
    v494 [shape = box, label = "%490 = flow.dispatch[]\n@main_dispatch_169::@main_dispatch_169_generic_134217728_f16xf32(%489)\n%0 = flow.dispatch.tensor.load %arg0 -> tensor<134217728xf16>\l%1 = tensor.empty() : tensor<134217728xf32> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2091:12 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%2 = linalg.generic[parallel] (%0) -> (%1)\l        %3 = arith.extf %in : f16 to f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2091:12 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        linalg.yield %3 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2091:12 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\lflow.dispatch.tensor.store %2, %arg1\ltensor<134217728xf32>"];
    v495 [shape = ellipse, label = "%491 = flow.tensor.reshape\ntensor<32x4194304xf32>"];
    v496 [shape = box, label = "%492 = flow.dispatch[]\n@main_dispatch_170::@main_dispatch_170_generic_32x4194304_f32(%491)\n%0 = flow.dispatch.tensor.load %arg0 -> tensor<32x4194304xf32>\l%1 = tensor.empty() : tensor<32xf32> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":310:26 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%2 = linalg.fill ins(%cst : f32) outs(%1 : tensor<32xf32>) -> tensor<32xf32> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":310:26 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%3 = linalg.generic[parallel, reduction] (%0) -> (%2)\l        %5 = arith.addf %in, %out : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        linalg.yield %5 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%4 = linalg.generic[parallel] (%3) -> (%1)\l        %5 = arith.divf %in, %cst_0 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        linalg.yield %5 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at 
\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\lflow.dispatch.tensor.store %4, %arg1\ltensor<32xf32>"];
    v497 [shape = box, label = "%493 = flow.dispatch[]\n@main_dispatch_171::@main_dispatch_171_generic_32x4194304_f32(%491, %492)\n%0 = flow.dispatch.tensor.load %arg0 -> tensor<32x4194304xf32>\l%1 = flow.dispatch.tensor.load %arg1 -> tensor<32xf32>\l%2 = tensor.empty() : tensor<32x4194304xf32> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%3 = linalg.generic[parallel, parallel] (%0, %1) -> (%2)\l        %4 = arith.subf %in, %in_0 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %5 = arith.mulf %4, %4 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        linalg.yield %5 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\lflow.dispatch.tensor.store %3, %arg2\ltensor<32x4194304xf32>"];
    v498 [shape = box, label = "%494 = flow.dispatch[]\n@main_dispatch_172::@main_dispatch_172_generic_32x4194304_f32(%493)\n%0 = flow.dispatch.tensor.load %arg0 -> tensor<32x4194304xf32>\l%1 = tensor.empty() : tensor<32xf32> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":310:26 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%2 = linalg.fill ins(%cst : f32) outs(%1 : tensor<32xf32>) -> tensor<32xf32> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":310:26 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%3 = linalg.generic[parallel, reduction] (%0) -> (%2)\l        %4 = arith.addf %in, %out : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        linalg.yield %4 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\lflow.dispatch.tensor.store %3, %arg1\ltensor<32xf32>"];
    v499 [shape = ellipse, label = "%495 = flow.tensor.reshape\ntensor<32x4194304xf16>"];
    v500 [shape = box, label = "%496 = flow.dispatch[]\n@main_dispatch_173::@main_dispatch_173_generic_32x4194304_f16xf32xf32xf32(%495, %492, %494)\n%0 = flow.dispatch.tensor.load %arg0 -> tensor<32x4194304xf16>\l%1 = flow.dispatch.tensor.load %arg1 -> tensor<32xf32>\l%2 = flow.dispatch.tensor.load %arg2 -> tensor<32xf32>\l%3 = tensor.empty() : tensor<32x4194304xf32> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2104:12 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%4 = linalg.generic[parallel, parallel] (%0, %1, %2) -> (%3)\l        %5 = arith.extf %in : f16 to f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2103:12 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %6 = arith.subf %5, %in_1 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2103:12 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %7 = arith.divf %in_2, %cst : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2097:34 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %8 = arith.addf %7, %cst_0 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2100:12 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %9 = math.rsqrt %8 : f32 
loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2101:12 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %10 = arith.mulf %6, %9 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2104:12 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        linalg.yield %10 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2104:12 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\lflow.dispatch.tensor.store %4, %arg3\ltensor<32x4194304xf32>"];
    v501 [shape = ellipse, label = "%497 = flow.tensor.reshape\ntensor<128x1048576xf32>"];
    v502 [shape = box, label = "%498 = flow.dispatch[]\n@main_dispatch_238::@main_dispatch_238_generic_128x1048576_f32xf16xf16xf16(%497, %cst_137, %cst_136)\n%0 = flow.dispatch.tensor.load %arg0 -> tensor<128x1048576xf32>\l%1 = flow.dispatch.tensor.load %arg1 -> tensor<128xf16>\l%2 = flow.dispatch.tensor.load %arg2 -> tensor<128xf16>\l%3 = tensor.empty() : tensor<128x1048576xf16> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2860:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%4 = linalg.generic[parallel, parallel] (%0, %1, %2) -> (%3)\l        %5 = arith.extf %in_0 : f16 to f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2837:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %6 = arith.mulf %in, %5 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2837:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %7 = arith.extf %in_1 : f16 to f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2839:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %8 = arith.addf %6, %7 : f32 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2839:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %9 = arith.truncf %8 : f32 to f16 
loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2841:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %10 = arith.negf %9 : f16 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2860:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %11 = math.exp %10 : f16 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2860:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %12 = arith.addf %11, %cst : f16 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2860:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %13 = arith.divf %cst, %12 : f16 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2860:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %14 = arith.mulf %13, %9 : f16 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2860:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        linalg.yield %14 : f16 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":2860:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at 
\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\lflow.dispatch.tensor.store %4, %arg3\ltensor<128x1048576xf16>"];
    v503 [shape = ellipse, label = "%499 = flow.tensor.reshape\ntensor<1x128x1024x1024xf16>"];
    v504 [shape = box, label = "%500 = flow.dispatch[]\n@main_dispatch_239::@main_dispatch_239_slow_memcpy(%499, %421)\n%0 = flow.dispatch.tensor.load %arg0 -> tensor<1x128x1024x1024xf16>\lflow.dispatch.tensor.store %0, %arg1\ltensor<1x128x1026x1026xf16>"];
    v505 [shape = ellipse, label = "%501 = flow.tensor.reshape\ntensor<1x3xf16>"];
    v506 [shape = box, label = "%502 = flow.dispatch[]\n@main_dispatch_284::@main_dispatch_284_conv_2d_nchw_fchw_1x3x1024x1024x128x3x3_f16(%500, %cst_138, %501)\n%0 = flow.dispatch.tensor.load %arg0 -> tensor<1x128x1026x1026xf16>\l%1 = flow.dispatch.tensor.load %arg1 -> tensor<3x128x3x3xf16>\l%2 = flow.dispatch.tensor.load %arg2 -> tensor<1x3xf16>\l%3 = tensor.empty() : tensor<1x3x1024x1024xf16> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":3356:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%4 = linalg.fill ins(%cst_1 : f16) outs(%3 : tensor<1x3x1024x1024xf16>) -> tensor<1x3x1024x1024xf16> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":3356:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%5 = linalg.conv_2d_nchw_fchw {dilations = dense<1> : vector<2xi64>, strides = dense<1> : vector<2xi64>} ins(%0, %1 : tensor<1x128x1026x1026xf16>, tensor<3x128x3x3xf16>) outs(%4 : tensor<1x3x1024x1024xf16>) -> tensor<1x3x1024x1024xf16> loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":3356:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l%6 = linalg.generic[parallel, parallel, parallel, parallel] (%5, %2) -> (%3)\l        %7 = arith.addf %in, %in_3 : f16 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":3356:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %8 = arith.divf %7, %cst : f16 
loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":3358:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %9 = arith.addf %8, %cst_0 : f16 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":3361:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %10 = arith.cmpf ult, %9, %cst_1 : f16 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":3364:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %11 = arith.select %10, %cst_1, %9 : f16 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":3364:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %12 = arith.cmpf ugt, %11, %cst_2 : f16 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":3364:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        %13 = arith.select %12, %cst_2, %11 : f16 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":3364:13 at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\l        linalg.yield %13 : f16 loc(callsite(callsite(\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":3364:13 at 
\".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":252:10) at \".\\\\stable_diffusion_xl_base_1_0_1024x1024_fp16_vae_decode.mlir\":250:3))\lflow.dispatch.tensor.store %6, %arg3\ltensor<1x3x1024x1024xf16>"];
    v507 [shape = ellipse, label = "%503 = hal.tensor.export\n!hal.buffer_view"];
    v508 [shape = ellipse, label = " = util.return\n"];
  }
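For readers following the graph, dispatches 170 through 173 look like a per-group normalization split into four steps: mean, squared deviation, sum of squares, then scale by a reciprocal square root. Below is a minimal pure-Python sketch of that reading. The function name `group_norm_rows` is hypothetical, and the `eps` default of 1e-5 is an assumption standing in for `%cst_0` in dispatch 173, whose value is not visible in the dump. Dividing by the row length assumes `%cst` is the reduction size.

```python
import math


def group_norm_rows(rows, eps=1e-5):
    """Normalize each row the way dispatches 170-173 appear to.

    eps is an assumed stand-in for %cst_0 in dispatch 173.
    """
    out = []
    for row in rows:
        n = len(row)
        # Dispatch 170: reduce with addf, then divf by the element count.
        mean = sum(row) / n
        # Dispatch 171: subf the mean, then mulf the difference with itself.
        sq_dev = [(x - mean) * (x - mean) for x in row]
        # Dispatch 172: reduce the squared deviations with addf.
        sum_sq = sum(sq_dev)
        # Dispatch 173: divf, addf eps, rsqrt, then mulf with (x - mean).
        inv_std = 1.0 / math.sqrt(sum_sq / n + eps)
        out.append([(x - mean) * inv_std for x in row])
    return out


if __name__ == "__main__":
    print(group_norm_rows([[1.0, 2.0, 3.0, 4.0]]))
    # One Inf in a group makes every element of that group NaN.
    print(group_norm_rows([[float("inf"), 1.0]]))
```

If this reading is right, a single Inf coming out of the preceding f16 convolution would be enough to turn a whole group into NaNs, because the mean becomes Inf and `Inf - Inf` is NaN.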

The outputs are good at least up to --iree-flow-break-dispatch=@main:242
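The manual breaks can be turned into a binary search over the break ordinal. This is a sketch only: `first_bad_dispatch` and the `is_good` callback are hypothetical names, and the callback is assumed to recompile with the break flag set to the given ordinal, run the module, and report whether the output is free of NaNs, Infs, and unexpected zeros. It also assumes that once the output goes bad it stays bad at every later ordinal.

```python
def first_bad_dispatch(lo, hi, is_good):
    """Return the smallest ordinal in (lo, hi] whose output is bad.

    lo: an ordinal already known to give good output.
    hi: an ordinal already known to give bad output.
    is_good: hypothetical callback taking an ordinal and returning a bool.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(mid):
            lo = mid  # everything up to mid is fine, search the upper half
        else:
            hi = mid  # mid is already bad, search the lower half
    return hi


if __name__ == "__main__":
    # Stand-in predicate with a made-up culprit, for illustration only.
    print(first_bad_dispatch(242, 300, lambda n: n < 260))
```

Starting from the known-good ordinal 242 and any known-bad upper bound, this needs about log2(hi - lo) compile-and-run cycles instead of one per dispatch.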