go: cmd/compile: avoid slow versions of LEA instructions on x86

On newer x86 CPUs (AMD and Intel), 3-operand LEA instructions with base, index, and offset have higher latency and lower throughput than 2-operand LEA instructions.

When emitting instructions, the compiler could rewrite slow LEAs into e.g. LEA + ADD where possible (i.e. where clobbering flags is OK), similar to how MOV $0 R is rewritten to XOR R R.
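To make the proposed rewrite concrete, here is a minimal Go sketch that models LEA's address arithmetic and checks that the 3-operand form computes the same value as a 2-operand LEA followed by an ADD. The names `lea3` and `lea2ThenAdd` are hypothetical and only model the arithmetic; this is not compiler code, and it cannot model the fact that a real ADD clobbers flags while LEA does not.

```go
package main

import "fmt"

// lea3 models a 3-operand LEA: disp(base)(index*scale).
// Hypothetical helper for illustration, not part of cmd/compile.
func lea3(base, index, scale, disp uint64) uint64 {
	return base + index*scale + disp
}

// lea2ThenAdd models the split form: a 2-operand LEA computing
// base + index*scale, followed by an ADD of the displacement.
// On real hardware the ADD clobbers flags, so the split is only
// valid where flags are dead.
func lea2ThenAdd(base, index, scale, disp uint64) uint64 {
	r := base + index*scale // LEAQ (base)(index*scale), r
	r += disp               // ADDQ $disp, r
	return r
}

func main() {
	// Both forms wrap around identically in 64-bit arithmetic.
	fmt.Println(lea3(0x1000, 3, 8, 16) == lea2ThenAdd(0x1000, 3, 8, 16))
}
```

The two forms differ only in instruction count and flag behaviour, which is why the rewrite has to be restricted to places where clobbering flags is acceptable.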

Intel® 64 and IA-32 Architectures Optimization Reference Manual
3.5.1.3 Using LEA

For LEA instructions with three source operands and some specific situations, instruction latency has increased to 3 cycles, and must dispatch via port 1:
— LEA that has all three source operands: base, index, and offset.
— LEA that uses base and index registers where the base is EBP, RBP, or R13.
...
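The two quoted conditions can be sketched as a small predicate. This is an illustrative model only: the `addrMode` type and `isSlowLEA` function are hypothetical, register names are plain strings, and the further cases elided by the "..." in the quote are not covered.

```go
package main

import "fmt"

// addrMode is a hypothetical description of an LEA source operand.
type addrMode struct {
	base    string // base register name, "" if absent
	index   string // index register name, "" if absent
	hasDisp bool   // true if an offset/displacement is present
}

// isSlowLEA reports whether m matches one of the two conditions quoted
// from the manual. Cases elided in the quote are not modeled.
func isSlowLEA(m addrMode) bool {
	hasBase, hasIndex := m.base != "", m.index != ""
	// Condition 1: all three source operands are present.
	if hasBase && hasIndex && m.hasDisp {
		return true
	}
	// Condition 2: base and index, where the base is EBP, RBP, or R13.
	if hasBase && hasIndex {
		switch m.base {
		case "EBP", "RBP", "R13":
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isSlowLEA(addrMode{base: "AX", index: "BX", hasDisp: true}))
	fmt.Println(isSlowLEA(addrMode{base: "AX", index: "BX"}))
	fmt.Println(isSlowLEA(addrMode{base: "R13", index: "BX"}))
}
```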

Relevant LLVM optimization review: https://reviews.llvm.org/D32277

/cc @TocarIP @randall77 @josharian

About this issue

  • State: open
  • Created 7 years ago
  • Reactions: 6
  • Comments: 21 (17 by maintainers)

Most upvoted comments

Looks like I found a specific example of this. After b1df8d6ffa2c4c5be567934bd44432fff8f3c4a7 we convert 32-bit multiplication by 0x101 into a slow LEA, which caused a performance regression on master vs 1.10 for image/draw benchmarks:

CMYK-8   461µs ± 2%   513µs ± 1%  +11.31%  (p=0.000 n=10+9)

Are there objections to waiting till 1.12? The performance improvements are overall very small (0.16% is what I measure pretty consistently, including across a much larger selection of benchmarks), and it’s getting very late in the 1.11 cycle and we seem to be behind schedule.

To summarize the performance changes: across the 256 low-noise benchmarks (out of 488 total; low noise means a maximum reported standard deviation below 2% across 30 trials), 15 showed >= 1% improvement and 16 showed >= 1% slowdown.

For comparison, the experiment to pretend we are only targeting more-recent versions of Intel hardware (CL 117925) had 83 improved, 6 slowed down, and the geomean improvement was 1.28%.
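For readers who want to reproduce this kind of summary, here is a rough Go sketch of the counting described above: keep only benchmarks under a noise limit, count those that moved by at least a threshold in either direction, and compute the geometric mean of the new/old time ratio. The `result` type, the `summarize` function, and the sample data are all hypothetical, and the exact definitions used for the numbers in this thread may differ.

```go
package main

import (
	"fmt"
	"math"
)

// result is a hypothetical per-benchmark record.
type result struct {
	name         string
	oldTime      float64 // time/op before the change
	newTime      float64 // time/op after the change
	maxStddevPct float64 // largest reported standard deviation, in percent
}

// summarize filters to low-noise benchmarks (maxStddevPct < noiseLimitPct)
// and counts those whose time changed by at least thresholdPct.
// geomean is the geometric mean of newTime/oldTime over the low-noise set
// (values below 1 mean faster overall); it is 1 if the set is empty.
func summarize(rs []result, noiseLimitPct, thresholdPct float64) (lowNoise, improved, slowed int, geomean float64) {
	logSum := 0.0
	for _, r := range rs {
		if r.maxStddevPct >= noiseLimitPct {
			continue // too noisy to draw conclusions from
		}
		lowNoise++
		ratio := r.newTime / r.oldTime
		logSum += math.Log(ratio)
		deltaPct := (ratio - 1) * 100
		switch {
		case deltaPct <= -thresholdPct:
			improved++
		case deltaPct >= thresholdPct:
			slowed++
		}
	}
	if lowNoise == 0 {
		return 0, 0, 0, 1
	}
	return lowNoise, improved, slowed, math.Exp(logSum / float64(lowNoise))
}

func main() {
	// Made-up sample data, only to exercise the function.
	rs := []result{
		{"A", 100, 95, 0.5},
		{"B", 100, 103, 1.0},
		{"C", 100, 100.2, 0.3},
		{"D", 100, 80, 5.0}, // excluded: too noisy
	}
	fmt.Println(summarize(rs, 2, 1))
}
```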

Change https://go.dev/cl/440035 mentions this issue: cmd/compile: split 3 operand LEA in late lower pass

Even for LEAQ and LEAL my CL doesn't manage to split all 3-operand LEAs. There are some that had aux set and aren't transformed.

@TocarIP did you find a file and line number where a slow LEA is created that affects the draw benchmarks? For 0x101 multiplication it seems a 2-operand (not 3-operand) LEA is created, since there is no offset/displacement, e.g.:

  draw.go:243		0x112f368		89ca			MOVL CX, DX			
  draw.go:243		0x112f36a		c1e108			SHLL $0x8, CX			
  draw.go:243		0x112f36d		8d0c11			LEAL 0(CX)(DX*1), CX
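As a sanity check on what this sequence computes, the three instructions above can be mirrored in Go. The function name `mul257` is hypothetical; the sketch only confirms that a shift by 8 plus an add equals multiplication by 0x101 in 32-bit arithmetic, and says nothing about instruction latency.

```go
package main

import "fmt"

// mul257 mirrors the three instructions above for a 32-bit value:
//
//	MOVL CX, DX           dx := cx
//	SHLL $0x8, CX         cx <<= 8
//	LEAL 0(CX)(DX*1), CX  cx = cx + dx
func mul257(cx uint32) uint32 {
	dx := cx
	cx <<= 8
	cx = cx + dx
	return cx
}

func main() {
	// x*0x101 == x<<8 + x, including on 32-bit wraparound.
	for _, x := range []uint32{0, 1, 0xff, 0xffff, 0xffffffff} {
		fmt.Println(mul257(x) == x*0x101)
	}
}
```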