playwright: [Proposal] The problem with visual comparison

Problem

Pixelmatch (which is currently used to compare snapshots) has an issue, as I described in https://github.com/mapbox/pixelmatch/issues/127.

And that is not the only one. Pixelmatch also has some antialiasing detection, but it is neither configurable nor perfect: I ran a Playwright test 500 times, and 6 of those runs failed due to browser rendering instability. That is a 1.2% flake rate, which is a lot. (Screenshots: 6 failed out of 500, and the resulting diff.) Test code source: https://pastebin.com/eWzMpKBW
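
For reference, this is roughly how a pixelmatch comparison is driven (a minimal sketch, not Playwright's actual integration code; the file names are placeholders, and the require/import style depends on the package version). Its only comparison knobs are threshold and includeAA; the antialiasing heuristic itself cannot be tuned:

    const fs = require('fs');
    const {PNG} = require('pngjs');
    const pixelmatch = require('pixelmatch');

    // Placeholder file names, not Playwright's real snapshot paths.
    const expected = PNG.sync.read(fs.readFileSync('expected.png'));
    const actual = PNG.sync.read(fs.readFileSync('actual.png'));
    const diff = new PNG({width: expected.width, height: expected.height});

    // `threshold` controls color sensitivity; `includeAA` only toggles whether
    // pixels detected as antialiasing are counted as differences.
    const mismatched = pixelmatch(
        expected.data, actual.data, diff.data,
        expected.width, expected.height,
        {threshold: 0.2, includeAA: false}
    );

    fs.writeFileSync('diff.png', PNG.sync.write(diff));
    console.log(`${mismatched} pixels differ`);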

Solution

I can solve both of these problems by integrating our looks-same package into Playwright in place of pixelmatch.

Tools comparison

  • Color comparator: instead of the pixelmatch threshold, looks-same has a configurable CIEDE2000 tolerance (2.3 by default), which takes human color perception into account.

The colors from example 1 (linked in the issue above) would be considered different if tolerance < 33.5, so the diff image (marked in pink) would be: (looks-same diff image)

At the same time, with tolerance > 1.86, looks-same would not mark the difference from example 2 (linked in the issue above).

  • Antialiasing: looks-same has a configurable antialiasingTolerance, which would help with those 6/500 failures.

It can detect the diff from above, but would ignore it with a proper antialiasingTolerance (looks-same antialiasing diff image); a configuration sketch follows below.
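
For illustration, this is roughly how those knobs are passed to looks-same (a minimal sketch; the exact call shape depends on the looks-same version, since recent versions are promise-based, and the file names and the antialiasingTolerance value are placeholders, not recommendations):

    const looksSame = require('looks-same');

    (async () => {
        // tolerance: CIEDE2000 distance below which two colors count as equal.
        // antialiasingTolerance: extra slack for pixels detected as antialiasing.
        const {equal} = await looksSame('expected.png', 'actual.png', {
            tolerance: 2.3,
            antialiasingTolerance: 4,
            ignoreCaret: true,
        });
        console.log(equal ? 'screenshots match' : 'screenshots differ');
    })();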

Benchmarking

Case:

Compare two images (896x5069) and save the diff image to the file system.

Environment:

Apple M1 Pro, 32GB

Currently, looks-same is not that fast, but I managed to improve its performance; these are the results from my branch (an illustrative harness sketch follows after the numbers):

  • equal buffers

    • looks-same: 0.1ms
    • pixelmatch: 260ms
  • indistinguishable difference (23166 / 4541824 pixels have #fefefe color instead of #ffffff)

    • looks-same: 110ms
    • pixelmatch: 260ms
  • distinguishable difference (5 red pixels on white background)

    • looks-same: 130ms
    • pixelmatch: 500ms
  • big difference (500x500 red square in the center)

    • looks-same: 260ms
    • pixelmatch: 540ms
  • huge difference (two 500x500 red squares in the center)

    • looks-same: 380ms
    • pixelmatch: 540ms
  • gigantic difference (four 500x500 red squares in the center)

    • looks-same: 640ms
    • pixelmatch: 580ms
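
For context, a harness along these lines matches the case above: compare two PNGs and write the diff image to disk, timing each tool end to end (an illustrative sketch, not the exact benchmark code; file paths and the highlight color are placeholders):

    const fs = require('fs');
    const {performance} = require('perf_hooks');
    const {PNG} = require('pngjs');
    const pixelmatch = require('pixelmatch');
    const looksSame = require('looks-same');

    // looks-same: createDiff encodes and writes the diff PNG itself.
    async function benchLooksSame(refPath, curPath) {
        const start = performance.now();
        await looksSame.createDiff({
            reference: refPath,
            current: curPath,
            diff: 'looks-same-diff.png',
            highlightColor: '#ff00ff',
        });
        return performance.now() - start;
    }

    // pixelmatch: fills a raw RGBA buffer, so PNG encoding is paid separately.
    function benchPixelmatch(refPath, curPath) {
        const start = performance.now();
        const ref = PNG.sync.read(fs.readFileSync(refPath));
        const cur = PNG.sync.read(fs.readFileSync(curPath));
        const diff = new PNG({width: ref.width, height: ref.height});
        pixelmatch(ref.data, cur.data, diff.data, ref.width, ref.height, {threshold: 0.2});
        fs.writeFileSync('pixelmatch-diff.png', PNG.sync.write(diff));
        return performance.now() - start;
    }

    (async () => {
        console.log('looks-same:', await benchLooksSame('reference.png', 'current.png'), 'ms');
        console.log('pixelmatch:', benchPixelmatch('reference.png', 'current.png'), 'ms');
    })();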

Conclusion

looks-same

  • provides better comparison quality than pixelmatch
  • configurable: ignoreCaret, tolerance, antialiasingTolerance
  • faster than pixelmatch on average, unless the diff is really huge: roughly 130 ms of constant time (for 896x5069 images) plus 130 ms per 250k differing pixels (see the estimate below)
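
As a rough sanity check of that model against the numbers above: the "gigantic difference" case has four 500x500 red squares, i.e. 1,000,000 differing pixels, so the estimate is 130 ms + 4 × 130 ms ≈ 650 ms, close to the measured 640 ms, while pixelmatch stays flat at roughly 540-580 ms. That is why pixelmatch only wins on very large diffs.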

I can complete the work and open a pull request to use looks-same for Playwright image comparison, but first I would like to clarify: are you interested in this?

P.S. If you need benchmark code examples, I could provide them in the PR after polishing the looks-same code.

About this issue

  • State: open
  • Created a year ago
  • Reactions: 17
  • Comments: 16


Most upvoted comments

I’m interested in getting your feedback on using our experimental ssim-cie94

Previously, I did not include the time needed to save the diff image, but that was wrong: looks-same resolves a PNG buffer directly, while pixelmatch and ssim-cie94 still have to transform the raw image buffer into a PNG buffer, and looks-same does this significantly faster.

  • performance (900x5000 PNG images, averaged over 10 function calls):
    • equal buffers (excluding saving to the file system)

      • looks-same: 0.12 ms
      • pixelmatch: 260 ms
      • ssim-cie94: 390 ms
    • indistinguishable difference (960 / 4541824 pixels have #fefefe color instead of #ffffff, excluding saving to the file system)

      • looks-same: 111 ms
      • pixelmatch: 240 ms
      • ssim-cie94: 908 ms
    • distinguishable difference (5 red pixels on white background, including saving to the file system)

      • looks-same: 140 ms
      • pixelmatch: 500 ms
      • ssim-cie94: 1170 ms
    • big difference (500x500 red square in the center, including saving to the file system)

      • looks-same: 250 ms
      • pixelmatch: 530 ms
      • ssim-cie94: 1340 ms
    • huge difference (two 500x500 red squares in the center, including saving to the file system)

      • looks-same: 374 ms
      • pixelmatch: 544 ms
      • ssim-cie94: 1500 ms
    • gigantic difference (four 500x500 red squares in the center, including saving to the file system)

      • looks-same: 630 ms
      • pixelmatch: 600 ms
      • ssim-cie94: 1882 ms

I’m surprised to see CIEDE2000 usage instead of CIE94

CIE94 still has a perceptual uniformity issue, so CIEDE2000 is better.

Example:

  • color1 = [129, 174, 237], color2 = [129, 174, 255]: CIE94 = 4.5, CIEDE2000 = 1.85
  • color3 = [111, 120, 138], color4 = [118, 120, 138]: CIE94 = 2.27, CIEDE2000 = 3.5

CIE94 is almost 2x faster than CIEDE2000 on my machine

On my branch I made some modifications to improve image comparison speed: https://github.com/gemini-testing/looks-same/blob/HERMIONE-1049/index.js#L23. With those improvements, the comparison speed is similar.

So, in most cases we don't really need to calculate CIEDE2000 to be sure that white and red are different colors.

The trick is to calculate the fast CIE76 first, which in most cases is enough (based on my calculations, 0.695 × CIEDE2000 <= CIE76 <= 6.2 × CIEDE2000 for any pair of RGB colors). We only need to calculate the slow CIEDE2000 when, based on CIE76 alone, we cannot be sure whether CIEDE2000 is below or above the tolerance value. (Keep in mind that most of the time the colors are exactly equal and no calculation is needed at all, so on average this is not a problem.)
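
A sketch of that fast path (illustrative only, not the actual looks-same code; it assumes the colors are already converted to CIE Lab, that a ciede2000() helper is available, and it uses the bound constants quoted above):

    // Decide whether two colors differ by more than `tolerance` (in CIEDE2000 units),
    // computing the expensive CIEDE2000 only when the cheap CIE76 bound is inconclusive.
    function colorsDiffer(lab1, lab2, tolerance) {
        // Most pixels are exactly equal, so no distance calculation is needed at all.
        if (lab1.L === lab2.L && lab1.a === lab2.a && lab1.b === lab2.b) {
            return false;
        }

        // CIE76 is plain Euclidean distance in Lab space: cheap to compute.
        const dE76 = Math.hypot(lab1.L - lab2.L, lab1.a - lab2.a, lab1.b - lab2.b);

        // Using 0.695 * dE2000 <= dE76 <= 6.2 * dE2000:
        if (dE76 > 6.2 * tolerance) return true;     // then dE2000 >= dE76 / 6.2 > tolerance
        if (dE76 <= 0.695 * tolerance) return false; // then dE2000 <= dE76 / 0.695 <= tolerance

        // Inconclusive band: fall back to the slow but precise CIEDE2000.
        return ciede2000(lab1, lab2) > tolerance;    // ciede2000() assumed to be provided
    }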