lucenenet: One character is missing in class ASCIIFoldingFilter
I think one character in class `ASCIIFoldingFilter` is missing: Ʀ (Nº 422, UTF-16 01A6).
Source code that might need to be added to method `FoldToASCII(char[] input, int inputPos, char[] output, int outputPos, int length)`:

```csharp
case '\u01A6': // Ʀ [LATIN LETTER YR]
    output[outputPos++] = 'R';
    break;
```
Links about this character:
https://codepoints.net/U+01A6
https://en.wikipedia.org/wiki/%C6%A6
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 15 (7 by maintainers)
Thanks for the report.
As this is a line-by-line port from Java Lucene 4.8.0 (for the most part), we have faithfully reproduced the `ASCIIFoldingFilter` in its entirety. While we have admittedly included some patches from later versions of Lucene where they affect usability (for example, `Lucene.Net.Analysis.Common` all came from 4.8.1), the change you are suggesting isn't even reflected in the `ASCIIFoldingFilter` in the latest commit.

If you wish to pursue adding more characters to `ASCIIFoldingFilter`, I suggest you take it up with the Lucene design team on their dev mailing list.

However, do note this isn't the only filter included in the box that is capable of removing diacritics from characters. Some alternatives:

Note that you can also create a custom folding filter by using a similar approach to the one in the `ICUFoldingFilter` implementation (ported from Lucene 7.1.0). There is a tool you can port to generate a `.nrm` binary file from modified versions of these text files. The `.nrm` file can then be provided to the constructor of `ICU4N.Text.Normalizer2`; more about the data format can be found in the ICU normalization docs. Note that the `.nrm` file is the same binary format used in C++ and Java.

Alternatively, if you wish to extend the `ASCIIFoldingFilter` with your own custom brew of characters, you can simply chain your own filter to `ASCIIFoldingFilter`, as pointed out in this article.

FYI - there is also another demo showing additional ways to build analyzers here: https://github.com/NightOwl888/LuceneNetDemo
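The chaining approach described above might be sketched roughly as follows. This is a hedged sketch against Lucene.NET 4.8: `YrFoldingFilter` and `CustomFoldingAnalyzer` are made-up names, and exact namespaces or attribute details may differ slightly in your version.

```csharp
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Miscellaneous;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

// Hypothetical filter (name is made up) that folds U+01A6 (Ʀ) to 'R'
// in the term buffer before ASCIIFoldingFilter runs.
public sealed class YrFoldingFilter : TokenFilter
{
    private readonly ICharTermAttribute termAtt;

    public YrFoldingFilter(TokenStream input)
        : base(input)
    {
        termAtt = AddAttribute<ICharTermAttribute>();
    }

    public override bool IncrementToken()
    {
        if (!m_input.IncrementToken())
            return false;

        char[] buffer = termAtt.Buffer;
        for (int i = 0; i < termAtt.Length; i++)
        {
            if (buffer[i] == '\u01A6')
                buffer[i] = 'R'; // same length, so no buffer resize needed
        }
        return true;
    }
}

// An analyzer that chains the custom filter ahead of ASCIIFoldingFilter.
public sealed class CustomFoldingAnalyzer : Analyzer
{
    protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
    {
        var tokenizer = new WhitespaceTokenizer(LuceneVersion.LUCENE_48, reader);
        TokenStream stream = new YrFoldingFilter(tokenizer);
        stream = new ASCIIFoldingFilter(stream);
        return new TokenStreamComponents(tokenizer, stream);
    }
}
```

Because the custom filter runs first, any characters it folds are already plain ASCII by the time `ASCIIFoldingFilter` sees them.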
@diegolaz79
Nope, it isn’t valid to use multiple tokenizers in the same Analyzer, as there are strict consuming rules to adhere to.
It would be great to build code analysis components to ensure developers adhere to these tokenizer rules while typing, such as the rule that ensures `TokenStream` classes are sealed or use a sealed `IncrementToken()` method (contributions welcome). It is not likely we will add any additional code analyzers prior to the 4.8.0 release unless they are contributed by the community, though, as these are not blocking the release.

For the time being, the best way to ensure custom analyzers adhere to the rules is to test them with Lucene.Net.TestFramework, which also hits them with multiple threads, random cultures, and random strings of text to ensure they are robust.

I built a demo showing how to set up testing on custom analyzers here: https://github.com/NightOwl888/LuceneNetCustomAnalyzerDemo (as well as showing how the above example fails the tests). The functioning analyzer just uses a `WhitespaceTokenizer` and `ICUFoldingFilter`. Of course, you may wish to add additional test conditions to ensure your custom analyzer meets your expectations, and then you can experiment with different tokenizers and adding or rearranging filters until you find a solution that meets all of your requirements (as well as plays by Lucene's rules). And of course, you can then later add additional conditions as you discover issues.

For other ideas about what test conditions you may use, I suggest having a look at Lucene.Net's extensive analyzer tests, including the ICU tests. You may also refer to the tests to see if you can find a similar use case to yours for building queries (although do note that the tests don't show .NET best practices for disposing objects).
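A test along the lines of the demo above might look like this. This is a hedged sketch: `MyAnalyzer` is a stand-in for your custom analyzer, the expected tokens assume a whitespace tokenizer plus a folding filter, and the exact `BaseTokenStreamTestCase` member names may vary between Lucene.Net.TestFramework versions.

```csharp
using Lucene.Net.Analysis;
using NUnit.Framework;

// Hypothetical test fixture for a custom analyzer.
public class MyAnalyzerTests : BaseTokenStreamTestCase
{
    [Test]
    public void TestFoldsDiacritics()
    {
        Analyzer analyzer = new MyAnalyzer(); // your analyzer under test

        // Check that a specific input produces the expected folded tokens.
        AssertAnalyzesTo(analyzer, "día tras día", new[] { "dia", "tras", "dia" });

        // Hammer the analyzer with random text, threads, and cultures.
        CheckRandomData(Random, analyzer, 1000);
    }
}
```

The random-data check is what catches the subtle state bugs (reuse, end-of-stream handling) that a handful of hand-written cases usually miss.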
Thanks again! Your suggestions helped me a lot!
I’m currently doing it like this
The WhitespaceAnalyzer did not help my case of the code format (“M-12-14”, “B-10-39”, etc.), but I will try other, more suitable ones.
And I'm using the finalAnalyzer for indexing and search.
FYI - There is a generic Spanish stop word list that can be accessed through SpanishAnalyzer.DefaultStopSet.
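For instance, the default Spanish stop set can be passed straight to `SpanishAnalyzer`, and a different analyzer can be applied per field via `PerFieldAnalyzerWrapper`. A sketch against Lucene.NET 4.8; the field name `"code"` is made up for illustration:

```csharp
using System.Collections.Generic;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Es;
using Lucene.Net.Analysis.Miscellaneous;
using Lucene.Net.Util;

var version = LuceneVersion.LUCENE_48;

// Default: full Spanish analysis with the generic stop word set.
// "code" holds values like "M-12-14", so keep each value as one token.
Analyzer perField = new PerFieldAnalyzerWrapper(
    new SpanishAnalyzer(version, SpanishAnalyzer.DefaultStopSet),
    new Dictionary<string, Analyzer>
    {
        ["code"] = new KeywordAnalyzer(),
    });
```

The same `perField` instance can then be handed to both the `IndexWriterConfig` and the query-side parser so indexing and search agree on the per-field analysis.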
`PerFieldAnalyzerWrapper` applies a different analyzer to each field (example). Note you don't necessarily have to use inline analyzers; you can also simply new up pre-constructed analyzers for each field.

If all of the data in the field can be considered a token, there is a `KeywordAnalyzer` that can be used to keep the entire field together.

Just out of curiosity, do all of your use cases work without the `LowerCaseFilter`? Lowercasing is not the same as case folding (which is what `ICUFoldingFilter` does): Case Mapping and Case Folding

While this might not matter for your use case, it is also worth noting that performance will be improved without the `LowerCaseFilter`.

In addition, search performance and accuracy can be improved by using a `StopFilter` with a reasonable stop word set to cover your use cases; the only reason I removed it from the demo was because the question was about removing diacritics.

@diegolaz79
My bad. It looks like the example I pulled was from an older version of Lucene. However, “Answer 2” in this link shows an example from 4.9.0, which is similar enough to 4.8.0.
And of course, the whole idea of the last example is to implement another folding filter named `CustomFoldingFilter`, similar to `ASCIIFoldingFilter`, that adds your own folding rules and is executed before `ASCIIFoldingFilter`.

Alternatively, use `ICUFoldingFilter`, which implements UTR #30 (includes accent removal).
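For reference, a minimal `ICUFoldingFilter`-based analyzer might look like this. A sketch assuming the Lucene.Net.ICU package on Lucene.NET 4.8; the analyzer name is made up:

```csharp
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Icu;
using Lucene.Net.Util;

// Whitespace tokenization followed by ICU folding, which case-folds
// and strips accents in a single step (no LowerCaseFilter needed).
public sealed class IcuFoldingAnalyzer : Analyzer
{
    protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
    {
        var tokenizer = new WhitespaceTokenizer(LuceneVersion.LUCENE_48, reader);
        return new TokenStreamComponents(tokenizer, new ICUFoldingFilter(tokenizer));
    }
}
```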