go: cmd/compile: avoid slow versions of LEA instructions on x86
On newer x86 CPUs (AMD and Intel), 3 operand LEA instructions with base, index and offset have higher latency and lower throughput than 2 operand LEA instructions.
When emitting instructions, the compiler could rewrite slow LEAs into e.g. LEA + ADD where possible (when clobbering flags is OK), similar to how MOV $0 R is rewritten to XOR R R.
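As a rough illustration of why such a split preserves the result, the sketch below models the address arithmetic in plain Go. The names `lea3` and `leaSplit` are hypothetical and not part of the compiler; the only point is that base + index*scale + offset can be computed in two steps with the same wrapping 64-bit result.

```go
package main

import "fmt"

// lea3 models a 3 operand LEA: base + index*scale + offset in one step.
// (Hypothetical helper for illustration only.)
func lea3(base, index uint64, scale uint8, offset int32) uint64 {
	return base + index*uint64(scale) + uint64(int64(offset))
}

// leaSplit models the proposed rewrite: a 2 operand LEA followed by an ADD.
// Unlike LEA, a real ADD clobbers flags, which this model does not capture.
func leaSplit(base, index uint64, scale uint8, offset int32) uint64 {
	t := base + index*uint64(scale)  // LEA (base)(index*scale), t
	return t + uint64(int64(offset)) // ADD $offset, t
}

func main() {
	// Both forms compute the same address, including for negative offsets.
	fmt.Println(lea3(0x1000, 3, 8, 16) == leaSplit(0x1000, 3, 8, 16))
	fmt.Println(lea3(100, 2, 4, -8) == leaSplit(100, 2, 4, -8))
}
```

Because unsigned addition in Go wraps modulo 2^64, the order in which the terms are added does not change the result, which matches the behaviour of address arithmetic in 64-bit mode.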
Intel® 64 and IA-32 Architectures Optimization Reference Manual
3.5.1.3 Using LEA
For LEA instructions with three source operands and some specific situations, instruction latency has increased to 3 cycles, and must dispatch via port 1:
— LEA that has all three source operands: base, index, and offset.
— LEA that uses base and index registers where the base is EBP, RBP, or R13.
...
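The two quoted cases can be written down as a small predicate. This is a sketch only: the `lea` type and `isSlow` function are hypothetical, and the manual excerpt above is truncated, so any further slow cases it lists are not covered here.

```go
package main

import "fmt"

// lea describes the source operands of one LEA instruction.
// An empty Base or Index means that register is absent.
// (Hypothetical type for illustration only.)
type lea struct {
	Base, Index string
	HasOffset   bool
}

// isSlow reports whether l matches one of the two cases quoted above.
func isSlow(l lea) bool {
	hasBase, hasIndex := l.Base != "", l.Index != ""
	// Case 1: all three source operands are present.
	if hasBase && hasIndex && l.HasOffset {
		return true
	}
	// Case 2: base and index, where the base is EBP, RBP, or R13.
	if hasBase && hasIndex {
		switch l.Base {
		case "EBP", "RBP", "R13":
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isSlow(lea{Base: "RAX", Index: "RBX", HasOffset: true}))
	fmt.Println(isSlow(lea{Base: "RAX", Index: "RBX"}))
	fmt.Println(isSlow(lea{Base: "R13", Index: "RBX"}))
}
```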
Relevant LLVM optimization ticket: https://reviews.llvm.org/D32277
About this issue
- State: open
- Created 7 years ago
- Reactions: 6
- Comments: 21 (17 by maintainers)
Commits related to this issue
- cmd/compile: split slow 3 operand LEA instructions into two LEAs go tool objdump ../bin/go | grep "\.go\:" | grep -c "LEA.*0x.*[(].*[(].*" Before: 1012 After: 20 Updates #21735 Benchmarks thanks to... — committed to golang/go by martisch 6 years ago
- cmd/compile: split 3 operand LEA in late lower pass On newer amd64 cpus 3 operand LEA instructions are slow, CL 114655 split them to 2 LEA instructions in genssa. This CL make late lower pass run af... — committed to golang/go by wdvxdr1123 2 years ago
- cmd/compile: split 3 operand LEA in late lower pass On newer amd64 cpus 3 operand LEA instructions are slow, CL 114655 split them to 2 LEA instructions in genssa. This CL make late lower pass run af... — committed to TroutSoftware/go by wdvxdr1123 2 years ago
- cmd/compile: remove broken LEA "optimization" CL 440035 added rewrite rules to simplify "costly" LEA instructions, but the types in the rewrites were wrong and the code would go bad if the wrong-type... — committed to golang/go by dr2chase a year ago
- [release-branch.go1.20] cmd/compile: remove broken LEA "optimization" CL 440035 added rewrite rules to simplify "costly" LEA instructions, but the types in the rewrites were wrong and the code would ... — committed to golang/go by dr2chase a year ago
- [release-branch.go1.20] cmd/compile: remove broken LEA "optimization" CL 440035 added rewrite rules to simplify "costly" LEA instructions, but the types in the rewrites were wrong and the code would ... — committed to tailscale/go by dr2chase a year ago
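The objdump/grep count quoted in the first commit above can be approximated in Go. This is a minimal sketch that reuses the regular expression from that commit message; the `countSlowLEA` helper and the sample disassembly lines are made up for illustration and are not output from a real binary.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// slowLEA is the pattern from the commit message: an LEA with a hex
// displacement followed by two parenthesized register operands.
var slowLEA = regexp.MustCompile(`LEA.*0x.*[(].*[(].*`)

// countSlowLEA counts disassembly lines that mention a .go file and
// match slowLEA, mirroring grep "\.go\:" | grep -c "LEA...".
// (Hypothetical helper for illustration only.)
func countSlowLEA(disasm string) int {
	n := 0
	sc := bufio.NewScanner(strings.NewReader(disasm))
	for sc.Scan() {
		line := sc.Text()
		if strings.Contains(line, ".go:") && slowLEA.MatchString(line) {
			n++
		}
	}
	return n
}

// sample is illustrative text in the style of `go tool objdump` output.
const sample = `  x.go:10  0x1000  488d440b08  LEAQ 0x8(BX)(CX*1), AX
  x.go:11  0x1005  488d040b    LEAQ 0(BX)(CX*1), AX
  x.go:12  0x1009  488d4308    LEAQ 0x8(BX), AX
`

func main() {
	// Only the first sample line has base, index and a hex displacement.
	fmt.Println(countSlowLEA(sample))
}
```

Note that the pattern relies on the displacement being printed in hex, so it is a heuristic over the textual disassembly rather than a decoder of the instruction encoding.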
Looks like I found a specific example of this. After b1df8d6ffa2c4c5be567934bd44432fff8f3c4a7 we convert 32-bit multiplication by 0x101 into a slow LEA, which caused a performance regression on master vs 1.10 for image/draw benchmarks:
Are there objections to waiting till 1.12? The performance improvements are overall very small (0.16% is what I measure pretty consistently, including across a much larger selection of benchmarks), and it’s getting very late in the 1.11 cycle and we seem to be behind schedule.
To summarize the performance changes: across the 256 low-noise benchmarks (out of 488 total; low noise means less than 2% max standard deviation reported across 30 trials), 15 showed >= 1% improvement and 16 showed >= 1% slowdown.
For comparison, the experiment to pretend we are only targeting more recent versions of Intel hardware (CL 117925) had 83 improved and 6 slowed down, and the geomean improvement was 1.28%.
Change https://go.dev/cl/440035 mentions this issue:
cmd/compile: split 3 operand LEA in late lower pass
Even for LEAQ and LEAL my CL doesn't manage to split all 3 operand LEAs. There are some that had aux set and aren't transformed.
@TocarIP did you find a file and line number where a slow LEA is created that affects the draw benchmarks? For 0x101 multiplication it seems a 2 (but not 3) operand LEA is created, since there is no offset/displacement, e.g.: