antlr4: [Java] Lexer is >10x slower if input is not in Ascii range
Hi,
Currently the ANTLR lexer runtime uses an edge cache which is limited to character codes up to 127.
Java version:
public static final int MAX_DFA_EDGE = 127; // forces unicode to stay in ATN
This default causes a big slowdown if the parsed document contains character codes greater than 127, which is common in many Latin-script languages (from my tests, up to 3x slowdown for Turkish text, and >10x if most of the characters are out of range). I believe this is a problem for Polish, German, and Scandinavian texts as well.
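To show where the slowdown comes from, this is roughly what the lexer simulator does for every input character (paraphrased and simplified from the Java runtime, not a verbatim copy):

```java
// Paraphrased sketch of LexerATNSimulator's fast-path lookup (simplified).
// Only characters in [MIN_DFA_EDGE, MAX_DFA_EDGE] (i.e. 0..127) can hit the
// cached edge array; anything above 127 returns null here and falls back to
// a full ATN simulation every single time that character is seen.
protected DFAState getExistingTargetState(DFAState s, int t) {
    if (s.edges == null || t < MIN_DFA_EDGE || t > MAX_DFA_EDGE) {
        return null; // not cacheable -> recompute through the ATN
    }
    return s.edges[t - MIN_DFA_EDGE];
}
```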
If this range were extended to 240 (Latin-1), or even better to 368 (Latin Extended A), fast lexing of mixed-content documents would cover many more languages.
I believe the speed gains are worth the price of increasing this cache size a little.
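Concretely, the change proposed here is just bumping that constant, e.g. something like:

```java
public static final int MAX_DFA_EDGE = 368; // e.g. also cover Latin Extended A
```

with the per-state edge arrays growing from 128 to 369 references accordingly.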
About this issue
- Original URL
- State: open
- Created 7 years ago
- Comments: 31 (17 by maintainers)
@parrt I created https://github.com/antlr/antlr4/pull/2015
These are exactly the thoughts I had when I examined these details in the C++ target, where I have already replaced the array with a hash map. In one use case, memory consumption dropped from >600MB to around 300MB with this map in place, so it's a big difference. I also removed the need for a negative special index and the shifting.
@sharwell
My understanding of how the ANTLR runtime works is still very limited, but I had a closer look at the relevant bits of the code and want to share a few observations on this matter and propose a solution.
The current way of handling edges in DFAState objects is quite problematic. a) The “edges” to other states are not encapsulated properly; the field is a raw array managed by other classes. This makes it very fragile, prone to accidents, and difficult to replace with a better abstraction.
b) Using an array of states in DFAState is quite wasteful (in the lexer’s case it acts as a cache for transitions on characters 0-127). Even if the DFAState object has a single edge, it allocates an array of 128 references (512 bytes).
c) As mentioned in other messages in this issue, the performance of the current handling of DFAState transitions on characters >127 is terrible. I guess if my test grammar were for Japanese or Arabic instead of Turkish, its performance would be 100x worse, not 10x.
d) Having a special -1 index, and having the array values shifted, further complicates the logic.
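For (d), the shift shows up on the parser side, where index 0 is reserved for EOF (-1); roughly (again paraphrased, not verbatim):

```java
// Paraphrased sketch from ParserATNSimulator: the edge index is shifted by 1
// so that Token.EOF (-1) can be stored at slot 0; every reader of the array
// has to remember this off-by-one convention.
protected DFAState getExistingTargetState(DFAState previousD, int t) {
    DFAState[] edges = previousD.edges;
    if (edges == null || t + 1 < 0 || t + 1 >= edges.length) {
        return null;
    }
    return edges[t + 1];
}
```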
My solution is to remove the existing array that represents the edges and replace it with a fast, compact hash map. I believe the performance penalty is minuscule for lexer grammars that deal only with ASCII characters, and its memory footprint is much more compact, ~60-80 bytes vs 512 bytes. There is also no special -1 or other corner case, so it will be simpler to read as well.
I already have a clean int-map implementation, but it will probably take some time to replace the existing structure. I also added synchronization; it is still ~10x faster for a 90%/10% mix of ASCII and non-ASCII input with my test grammar.
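To show the shape of the idea, here is a simplified sketch (not the actual code on my branch): a small open-addressing map keyed by the edge character.

```java
import java.util.Arrays;
import org.antlr.v4.runtime.dfa.DFAState;

// Simplified sketch of an int -> DFAState map with open addressing (linear
// probing), meant to replace the fixed 128-slot edge array in DFAState.
// Not the implementation on my branch; resizing/concurrency kept minimal.
final class EdgeMap {
    private static final int EMPTY = Integer.MIN_VALUE; // no real edge key uses this

    private int[] keys;        // character code of the edge, or -1 for EOF
    private DFAState[] values; // target state for that edge
    private int size;

    EdgeMap() {
        keys = new int[8];     // most states only have a handful of edges
        values = new DFAState[8];
        Arrays.fill(keys, EMPTY);
    }

    synchronized DFAState get(int key) {
        final int mask = keys.length - 1;
        for (int i = key & mask; keys[i] != EMPTY; i = (i + 1) & mask) {
            if (keys[i] == key) return values[i];
        }
        return null;
    }

    synchronized void put(int key, DFAState value) {
        if (size * 2 >= keys.length) grow(); // keep load factor below 50%
        final int mask = keys.length - 1;
        int i = key & mask;
        while (keys[i] != EMPTY && keys[i] != key) {
            i = (i + 1) & mask;
        }
        if (keys[i] == EMPTY) {
            keys[i] = key;
            size++;
        }
        values[i] = value;
    }

    private void grow() {
        int[] oldKeys = keys;
        DFAState[] oldValues = values;
        keys = new int[oldKeys.length * 2];
        values = new DFAState[keys.length];
        Arrays.fill(keys, EMPTY);
        size = 0;
        for (int i = 0; i < oldKeys.length; i++) {
            if (oldKeys[i] != EMPTY) put(oldKeys[i], oldValues[i]);
        }
    }
}
```

A map like this keeps a state with only a few edges far smaller than the fixed 128-reference array, and there is no -1 special case for callers to remember.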
You can see the current diff on my branch here: https://github.com/antlr/antlr4/compare/master...mdakin:master
Another note: instead of a single DFAState class, maybe it would have been better to have different implementations for the Lexer and Parser to represent states. But there is probably a reason for this; as I said, I don’t know the theory.
I can replace DFAState edges with my map and change all uses, but it will probably take some time. I can create a separate issue for this. What do you think?