go: cmd/compile: avoid slow versions of LEA instructions on x86

On newer x86 CPUs (AMD and Intel), 3 operand LEA instructions with base, index, and offset have higher latency and lower throughput than 2 operand LEA instructions.

When emitting instructions, the compiler could rewrite slow LEAs into, e.g., LEA + ADD instructions where possible (that is, where clobbering the flags is OK), similar to how MOV $0 R is rewritten to XOR R R.
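
As a rough illustration of why the rewrite is legal, the sketch below models the address arithmetic in plain Go and checks that a 3 operand LEA computes the same value as a 2 operand LEA followed by an ADD of the displacement. The function names (`lea3`, `lea2`, `addImm`, `splitLEA`) are hypothetical and are not taken from the compiler; the model ignores flags, which is exactly the part that makes the rewrite conditional.

```go
package main

import "fmt"

// lea3 models a 3 operand LEA: dst = base + index*scale + disp.
// Arithmetic wraps modulo 2^64, like 64-bit address computation.
func lea3(base, index, scale uint64, disp int32) uint64 {
	return base + index*scale + uint64(int64(disp))
}

// lea2 models a 2 operand LEA with base and scaled index, no displacement.
func lea2(base, index, scale uint64) uint64 {
	return base + index*scale
}

// addImm models ADD $disp, dst. Unlike LEA, a real ADD clobbers the flags,
// so the rewrite is only valid where the flags are dead.
func addImm(dst uint64, disp int32) uint64 {
	return dst + uint64(int64(disp))
}

// splitLEA is the hypothetical rewritten sequence: LEA (2 operand) + ADD.
func splitLEA(base, index, scale uint64, disp int32) uint64 {
	return addImm(lea2(base, index, scale), disp)
}

func main() {
	base, index := uint64(0x1000), uint64(7)
	for _, disp := range []int32{16, -8, 0x7fffffff} {
		fmt.Println(lea3(base, index, 8, disp) == splitLEA(base, index, 8, disp))
	}
}
```

The two forms agree for negative displacements as well, because the sign-extended displacement is added with wraparound in both cases.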

Intel® 64 and IA-32 Architectures Optimization Reference Manual
3.5.1.3 Using LEA

For LEA instructions with three source operands and some specific situations, instruction latency has increased to 3 cycles, and must dispatch via port 1:
— LEA that has all three source operands: base, index, and offset.
— LEA that uses base and index registers where the base is EBP, RBP, or R13.
...
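
The two conditions quoted above can be sketched as a small predicate. This is only an illustration, assuming a hypothetical `leaOperands` struct; it covers just the two cases quoted here, not whatever the elided part of the manual lists, and it is not how the Go compiler represents instructions.

```go
package main

import "fmt"

// leaOperands is a hypothetical description of a LEA's address operands.
type leaOperands struct {
	Base    string // base register name, "" if absent
	Index   string // index register name, "" if absent
	HasDisp bool   // true if the LEA carries an offset/displacement
}

// isSlowLEA reports whether the LEA falls into one of the two slow cases
// quoted from the optimization manual.
func isSlowLEA(op leaOperands) bool {
	hasBase := op.Base != ""
	hasIndex := op.Index != ""

	// Case 1: all three source operands are present.
	if hasBase && hasIndex && op.HasDisp {
		return true
	}
	// Case 2: base and index, where the base is EBP, RBP, or R13.
	if hasBase && hasIndex {
		switch op.Base {
		case "EBP", "RBP", "R13":
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isSlowLEA(leaOperands{Base: "RAX", Index: "RBX", HasDisp: true}))
	fmt.Println(isSlowLEA(leaOperands{Base: "R13", Index: "RBX"}))
	fmt.Println(isSlowLEA(leaOperands{Base: "RAX", Index: "RBX"}))
}
```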

Relevant LLVM optimization ticket: https://reviews.llvm.org/D32277

/cc @TocarIP @randall77 @josharian

About this issue

State: open
Created 7 years ago
Reactions: 6
Comments: 21 (17 by maintainers)

Commits related to this issue

cmd/compile: split slow 3 operand LEA instructions into two LEAs go tool objdump ../bin/go | grep "\.go\:" | grep -c "LEA.*0x.*[(].*[(].*" Before: 1012 After: 20 Updates #21735 Benchmarks thanks to... — committed to golang/go by martisch 6 years ago
cmd/compile: split 3 operand LEA in late lower pass On newer amd64 cpus 3 operand LEA instructions are slow, CL 114655 split them to 2 LEA instructions in genssa. This CL make late lower pass run af... — committed to golang/go by wdvxdr1123 2 years ago
cmd/compile: split 3 operand LEA in late lower pass On newer amd64 cpus 3 operand LEA instructions are slow, CL 114655 split them to 2 LEA instructions in genssa. This CL make late lower pass run af... — committed to TroutSoftware/go by wdvxdr1123 2 years ago
cmd/compile: remove broken LEA "optimization" CL 440035 added rewrite rules to simplify "costly" LEA instructions, but the types in the rewrites were wrong and the code would go bad if the wrong-type... — committed to golang/go by dr2chase a year ago
[release-branch.go1.20] cmd/compile: remove broken LEA "optimization" CL 440035 added rewrite rules to simplify "costly" LEA instructions, but the types in the rewrites were wrong and the code would ... — committed to golang/go by dr2chase a year ago
[release-branch.go1.20] cmd/compile: remove broken LEA "optimization" CL 440035 added rewrite rules to simplify "costly" LEA instructions, but the types in the rewrites were wrong and the code would ... — committed to tailscale/go by dr2chase a year ago
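
The before/after counts in the first commit above come from grepping `go tool objdump` output. A minimal Go sketch of the same filter is shown below, reusing the regular expression from that commit message; the function name `countThreeOperandLEA` and the second sample line are made up for illustration, and the input is assumed to be objdump text that has already been captured as a string.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// threeOperandLEA is the pattern from the commit message: a LEA with a
// hex displacement followed by two parenthesized registers.
var threeOperandLEA = regexp.MustCompile(`LEA.*0x.*[(].*[(].*`)

// countThreeOperandLEA counts disassembly lines that carry a Go source
// position (".go:") and match the 3 operand LEA pattern.
func countThreeOperandLEA(disasm string) int {
	n := 0
	sc := bufio.NewScanner(strings.NewReader(disasm))
	for sc.Scan() {
		line := sc.Text()
		if strings.Contains(line, ".go:") && threeOperandLEA.MatchString(line) {
			n++
		}
	}
	return n
}

func main() {
	// The first line has no hex displacement and is not counted;
	// the second (a made-up sample) has one and is counted.
	sample := "  draw.go:243\t0x112f36d\t8d0c11\tLEAL 0(CX)(DX*1), CX\n" +
		"  x.go:10\t0x1000\t488d441810\tLEAQ 0x10(AX)(BX*1), AX\n"
	fmt.Println(countThreeOperandLEA(sample))
}
```

Like the grep pipeline, this is a textual heuristic: it relies on objdump printing a non-zero displacement in hex, so it only approximates the set of slow LEAs.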

Most upvoted comments

Looks like I found a specific example of this. After b1df8d6ffa2c4c5be567934bd44432fff8f3c4a7 we convert 32-bit multiplication by 0x101 into a slow LEA, which caused a performance regression on master vs 1.10 in the image/draw benchmarks:

CMYK-8   461µs ± 2%   513µs ± 1%  +11.31%  (p=0.000 n=10+9)

TocarIP on May 10, 2018

Are there objections to waiting till 1.12? The performance improvements are overall very small (0.16% is what I measure pretty consistently, including across a much larger selection of benchmarks), and it’s getting very late in the 1.11 cycle and we seem to be behind schedule.

To summarize the performance changes: across the 256 low-noise benchmarks (out of 488 total; low noise means less than 2% max standard deviation reported across 30 trials), 15 showed >= 1% improvement and 16 showed >= 1% slowdown.

For comparison, the experiment to pretend we are only targeting more recent versions of Intel hardware (CL 117925) had 83 improved and 6 slowed down, and the geomean improvement was 1.28%.

dr2chase on Jun 25, 2018

Change https://go.dev/cl/440035 mentions this issue: cmd/compile: split 3 operand LEA in late lower pass

gopherbot on Oct 7, 2022

Even for LEAQ and LEAL my CL doesn't manage to split all 3 operand LEAs. There are some that had aux set and aren't transformed.

martisch on Oct 22, 2021

@TocarIP did you find a file and line number where a slow LEA is created that affects the draw benchmarks? For multiplication by 0x101 it seems a 2 operand (not 3 operand) LEA is created, since there is no offset/displacement, e.g.:

  draw.go:243		0x112f368		89ca			MOVL CX, DX			
  draw.go:243		0x112f36a		c1e108			SHLL $0x8, CX			
  draw.go:243		0x112f36d		8d0c11			LEAL 0(CX)(DX*1), CX

martisch on May 26, 2018
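
For readers following the disassembly in the comment above, the three instructions can be mirrored in Go to see why they implement multiplication by 0x101: x*0x101 equals (x<<8) + x in 32-bit wraparound arithmetic. The function names below are hypothetical and the comments map each statement to the instruction shown; this is a sketch of the arithmetic, not of the compiler's rewrite rule.

```go
package main

import "fmt"

// mulDirect is the source-level operation: 32-bit multiplication by 0x101.
func mulDirect(x uint32) uint32 {
	return x * 0x101
}

// mulShiftAdd mirrors the emitted sequence step by step.
func mulShiftAdd(x uint32) uint32 {
	d := x       // MOVL CX, DX
	x <<= 8      // SHLL $0x8, CX
	return x + d // LEAL 0(CX)(DX*1), CX: base + index*1, no displacement
}

func main() {
	for _, x := range []uint32{0, 1, 0xff, 0x12345678, 0xffffffff} {
		fmt.Println(mulDirect(x) == mulShiftAdd(x))
	}
}
```

The final LEAL has a base and an index but a zero displacement, which is why it counts as a 2 operand LEA under the definition used in this issue.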