simdjson: MSVC simdjson is slower than g++ on Windows

On the same machine and OS, simdjson compiled with g++ 7.5 under WSL parses at 2.6 GB/s, while simdjson compiled with MSVC 2019 parses at 1.0 GB/s. ClangCL parses at 1.4 GB/s, so link.exe might be part of the problem. My machine is Kaby Lake R (AVX2 but not AVX-512).

After investigation, these seem to be the major contributors:

  • 40%: @TrianglesPCT may be fixing some or all of the largest regression, which was caused by generic SIMD, by removing lambdas.
  • 10%: We need to understand why this did not fully recover the performance we had before this. Either one of them could be the culprit, but it’s probably not anything in between.
  • 10%: We need to understand why we lost another 10% to the stage 1 structural scanner refactor.

Data

g++ 7.5.0 under WSL

jkeiser@JKEISER-THINKPAD:~/simdjson/build$ benchmark/parse ../jsonexamples/twitter.json
number of iterations 200 
                                                     
../jsonexamples/twitter.json

     9867 blocks -     631515 bytes - 55263 structurals (  8.8 %)
special blocks with: utf8      2284 ( 23.1 %) - escape       598 (  6.1 %) - 0 structurals      1287 ( 13.0 %) - 1+ structurals      8581 ( 87.0 %) - 8+ structurals      3272 ( 33.2 %) - 16+ structurals         0 (  0.0 %)
special block flips: utf8      1104 ( 11.2 %) - escape       642 (  6.5 %) - 0 structurals       940 (  9.5 %) - 1+ structurals       940 (  9.5 %) - 8+ structurals      2593 ( 26.3 %) - 16+ structurals         0 (  0.0 %)

All Stages
|    Speed        :  24.3210 ns per block ( 70.04%) -   0.3800 ns per byte -   4.3429 ns per structural -    2.631 GB/s
|- Stage 1
|    Speed        :  11.5728 ns per block ( 33.33%) -   0.1808 ns per byte -   2.0665 ns per structural -    5.530 GB/s
|- Stage 2
|    Speed        :  12.6267 ns per block ( 36.36%) -   0.1973 ns per byte -   2.2547 ns per structural -    5.068 GB/s

3181.7 documents parsed per second

VS 2019 (cl.exe 19.25.28614)

PS C:\Users\john\Source\simdjson\build> .\benchmark\Release\parse.exe ..\jsonexamples\twitter.json
number of iterations 200 

..\jsonexamples\twitter.json

     9867 blocks -     631515 bytes - 55263 structurals (  8.8 %)
special blocks with: utf8      2284 ( 23.1 %) - escape       598 (  6.1 %) - 0 structurals      1287 ( 13.0 %) - 1+ structurals      8581 ( 87.0 %) - 8+ structurals      3272 ( 33.2 %) - 16+ structurals         0 (  0.0 %)
special block flips: utf8      1104 ( 11.2 %) - escape       642 (  6.5 %) - 0 structurals       940 (  9.5 %) - 1+ structurals       940 (  9.5 %) - 8+ structurals      2593 ( 26.3 %) - 16+ structurals         0 (  0.0 %)

All Stages
|    Speed        :  65.5249 ns per block ( 83.29%) -   1.0239 ns per byte -  11.7004 ns per structural -    0.977 GB/s
|- Allocation
|    Speed        :   2.8679 ns per block (  3.65%) -   0.0448 ns per byte -   0.5121 ns per structural -   22.315 GB/s
|- Stage 1
|    Speed        :  32.2862 ns per block ( 41.04%) -   0.5045 ns per byte -   5.7652 ns per structural -    1.982 GB/s
|- Stage 2
|    Speed        :  29.4285 ns per block ( 37.41%) -   0.4598 ns per byte -   5.2549 ns per structural -    2.175 GB/s

1976.0 documents parsed per second

VS 2019 (cl.exe 19.25.28614) with /arch:AVX2

Compiling with /arch:AVX2 gave only a modest improvement of roughly 10% (0.977 GB/s to 1.054 GB/s overall, 1976.0 to 2246.1 documents per second):

PS C:\Users\john\Source\simdjson\build> .\benchmark\Release\parse.exe ..\jsonexamples\twitter.json
number of iterations 200

..\jsonexamples\twitter.json

     9867 blocks -     631515 bytes - 55263 structurals (  8.8 %)
special blocks with: utf8      2284 ( 23.1 %) - escape       598 (  6.1 %) - 0 structurals      1287 ( 13.0 %) - 1+ structurals      8581 ( 87.0 %) - 8+ structurals      3272 ( 33.2 %) - 16+ structurals         0 (  0.0 %)
special block flips: utf8      1104 ( 11.2 %) - escape       642 (  6.5 %) - 0 structurals       940 (  9.5 %) - 1+ structurals       940 (  9.5 %) - 8+ structurals      2593 ( 26.3 %) - 16+ structurals         0 (  0.0 %)

All Stages
|    Speed        :  60.7013 ns per block ( 82.70%) -   0.9485 ns per byte -  10.8391 ns per structural -    1.054 GB/s
|- Allocation
|    Speed        :   2.4726 ns per block (  3.37%) -   0.0386 ns per byte -   0.4415 ns per structural -   25.882 GB/s
|- Stage 1
|    Speed        :  27.1889 ns per block ( 37.04%) -   0.4249 ns per byte -   4.8550 ns per structural -    2.354 GB/s
|- Stage 2
|    Speed        :  29.8135 ns per block ( 40.62%) -   0.4659 ns per byte -   5.3236 ns per structural -    2.147 GB/s

2246.1 documents parsed per second
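
The headline GB/s figures follow directly from the reported ns-per-byte values, since one byte per nanosecond equals 1 GB/s. A minimal Python sketch, using only the "All Stages" numbers copied from the three runs above (the function and dictionary names are illustrative, not part of simdjson's benchmark tool):

```python
# "All Stages" ns-per-byte figures copied from the benchmark output above.
NS_PER_BYTE = {
    "g++ 7.5.0 (WSL)": 0.3800,
    "MSVC 2019": 1.0239,
    "MSVC 2019 /arch:AVX2": 0.9485,
}


def gb_per_second(ns_per_byte: float) -> float:
    """Convert ns/byte to GB/s (1 byte/ns == 1e9 bytes/s == 1 GB/s)."""
    return 1.0 / ns_per_byte


def slowdown(ns_per_byte: float, baseline_ns_per_byte: float) -> float:
    """How many times longer per byte a run takes than the baseline."""
    return ns_per_byte / baseline_ns_per_byte


if __name__ == "__main__":
    baseline = NS_PER_BYTE["g++ 7.5.0 (WSL)"]
    for name, ns in NS_PER_BYTE.items():
        print(f"{name:22s} {gb_per_second(ns):.3f} GB/s  "
              f"{slowdown(ns, baseline):.2f}x the g++ time per byte")
```

On these numbers, the MSVC build takes about 2.7 times as long per byte as the g++ build, and /arch:AVX2 reduces the MSVC time per byte by about 7%.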

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 103 (91 by maintainers)

Most upvoted comments

@pps83 oh! No, I closed this and split it into two separate issues: #847 (MSVC vs. ClangCL) and #848 (ClangCL vs. WSL clang).

Not related to the bug, just out of curiosity: what’s the speed with clang on Windows?

I tried to test my project, where I get the best results with simdjson (vs. other parsers such as rapidjson). With the MS compiler I get roughly 6.500s runtime (50% is spent in JSON parsing). With clang-cl I get roughly 6.050s. That is, for JSON parsing itself, assuming other things are equal, I get 3.500s vs 3.050s, or a 13% speedup with clang.
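
The 13% figure can be reproduced from the commenter's numbers. A minimal Python sketch, assuming (as the comment does) that the non-parsing portion of the run is identical for both compilers; the function name is illustrative:

```python
def relative_speedup(old_seconds: float, new_seconds: float) -> float:
    """Fraction of time saved when going from old_seconds to new_seconds."""
    return (old_seconds - new_seconds) / old_seconds


# Times quoted in the comment above.
total_msvc, total_clang_cl = 6.500, 6.050
parse_msvc, parse_clang_cl = 3.500, 3.050

if __name__ == "__main__":
    print(f"whole program: {relative_speedup(total_msvc, total_clang_cl):.1%}")
    print(f"JSON parsing:  {relative_speedup(parse_msvc, parse_clang_cl):.1%}")
```

The same 0.45 s saving is about 7% of the whole run but about 13% of the parsing time.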