antlr4: [Java] Lexer is >10x slower if input is not in Ascii range
Hi,
Currently the ANTLR lexer runtime uses an edge cache which is limited to character codes up to 127.
Java version:
public static final int MAX_DFA_EDGE = 127; // forces unicode to stay in ATN
This default causes a big slowdown if the parsed document contains character codes greater than 127, which is common in many Latin-script languages (from my tests, up to 3x slowdown for Turkish text, and >10x if most of the characters are out of range). I believe this is a problem for Polish, German, and Scandinavian texts as well.
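To show where the slowdown comes from, this is roughly what the lexer simulator does for every input character (paraphrased and simplified from the Java runtime, not a verbatim copy):

```java
// Paraphrased sketch of LexerATNSimulator's fast-path lookup (simplified).
// Only characters in [MIN_DFA_EDGE, MAX_DFA_EDGE] (i.e. 0..127) can hit the
// cached edge array; anything above 127 returns null here and falls back to
// a full ATN simulation every single time that character is seen.
protected DFAState getExistingTargetState(DFAState s, int t) {
    if (s.edges == null || t < MIN_DFA_EDGE || t > MAX_DFA_EDGE) {
        return null; // not cacheable -> recompute through the ATN
    }
    return s.edges[t - MIN_DFA_EDGE];
}
```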
If this range were extended to 240 (Latin-1), or even better to 368 (Latin Extended A), fast lexing of mixed-content documents would cover many more languages.
I believe the speed gains are worth the price of increasing this cache size a little.
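Concretely, the change proposed here is just bumping that constant, e.g. something like:

```java
public static final int MAX_DFA_EDGE = 368; // e.g. also cover Latin Extended A
```

with the per-state edge arrays growing from 128 to 369 references accordingly.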
About this issue
- Original URL
- State: open
- Created 7 years ago
- Comments: 31 (17 by maintainers)
@parrt I created https://github.com/antlr/antlr4/pull/2015
These are exactly the thoughts I had when I examined these details in the C++ target, where I have already replaced the array with a hash map. In one use case, memory consumption dropped from >600MB to around 300MB with this map in place, so it's a big difference. I also removed the need for a negative special index and the shifting.
@sharwell
My understanding of how the ANTLR runtime works is still very limited, but I had a closer look at the relevant bits of the code and want to share a few observations on this matter and propose a solution.
The current way of handling edges in DFAState objects is quite problematic. a) The “edges” to other states are not encapsulated properly; the field is a raw array managed by other classes. This makes it very fragile, prone to accidents, and difficult to replace with a better abstraction.
b) Using an array of states in DFAState is quite wasteful (in the lexer’s case it acts as a cache for transitions on characters 0-127). Even if the DFAState object has a single edge, it allocates an array of 128 references (512 bytes).
c) As mentioned in other messages in this issue, the performance of the current handling of DFAState transitions on characters >127 is terrible. I guess if my test grammar were for Japanese or Arabic instead of Turkish, its performance would be 100x worse, not 10x.
d) Having a special -1 index, and having the array values shifted, further complicates the logic.
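For (d), the shift shows up on the parser side, where index 0 is reserved for EOF (-1); roughly (again paraphrased, not verbatim):

```java
// Paraphrased sketch from ParserATNSimulator: the edge index is shifted by 1
// so that Token.EOF (-1) can be stored at slot 0; every reader of the array
// has to remember this off-by-one convention.
protected DFAState getExistingTargetState(DFAState previousD, int t) {
    DFAState[] edges = previousD.edges;
    if (edges == null || t + 1 < 0 || t + 1 >= edges.length) {
        return null;
    }
    return edges[t + 1];
}
```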
My solution is to remove the existing array that represents the edges and replace it with a fast, compact hash map. I believe the performance penalty is minuscule for lexer grammars that deal only with ASCII characters, and its memory footprint is much more compact, ~60-80 bytes vs 512 bytes. There is also no special -1 or other corner case, so it will be simpler to read as well.
I already have a clean int-map implementation, but it will probably take some time to replace the existing structure. I also added synchronization; it is still ~10x faster for a 90%/10% mix of ASCII and non-ASCII input with my test grammar.
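To show the shape of the idea, here is a simplified sketch (not the actual code on my branch): a small open-addressing map keyed by the edge character.

```java
import java.util.Arrays;
import org.antlr.v4.runtime.dfa.DFAState;

// Simplified sketch of an int -> DFAState map with open addressing (linear
// probing), meant to replace the fixed 128-slot edge array in DFAState.
// Not the implementation on my branch; resizing/concurrency kept minimal.
final class EdgeMap {
    private static final int EMPTY = Integer.MIN_VALUE; // no real edge key uses this

    private int[] keys;        // character code of the edge, or -1 for EOF
    private DFAState[] values; // target state for that edge
    private int size;

    EdgeMap() {
        keys = new int[8];     // most states only have a handful of edges
        values = new DFAState[8];
        Arrays.fill(keys, EMPTY);
    }

    synchronized DFAState get(int key) {
        final int mask = keys.length - 1;
        for (int i = key & mask; keys[i] != EMPTY; i = (i + 1) & mask) {
            if (keys[i] == key) return values[i];
        }
        return null;
    }

    synchronized void put(int key, DFAState value) {
        if (size * 2 >= keys.length) grow(); // keep load factor below 50%
        final int mask = keys.length - 1;
        int i = key & mask;
        while (keys[i] != EMPTY && keys[i] != key) {
            i = (i + 1) & mask;
        }
        if (keys[i] == EMPTY) {
            keys[i] = key;
            size++;
        }
        values[i] = value;
    }

    private void grow() {
        int[] oldKeys = keys;
        DFAState[] oldValues = values;
        keys = new int[oldKeys.length * 2];
        values = new DFAState[keys.length];
        Arrays.fill(keys, EMPTY);
        size = 0;
        for (int i = 0; i < oldKeys.length; i++) {
            if (oldKeys[i] != EMPTY) put(oldKeys[i], oldValues[i]);
        }
    }
}
```

A map like this keeps a state with only a few edges far smaller than the fixed 128-reference array, and there is no -1 special case for callers to remember.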
You can see the current diff on my branch here: https://github.com/antlr/antlr4/compare/master...mdakin:master
Another note: instead of a single DFAState class, maybe it would have been better to have different implementations for the Lexer and Parser to represent states. But there is probably a reason for this; as I said, I don’t know the theory.
I can replace DFAState edges with my map and change all uses, but it will probably take some time. I can create a separate issue for this. What do you think?