flashtext: something wrong with Chinese text?
In Python 2.7:
from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()
keyword_processor.add_keyword(u'北京')
keyword_processor.add_keyword(u'欢迎')
keyword_processor.add_keyword(u'你')
keyword_processor.extract_keywords(u'北京欢迎你')
This returns [u'北京', u'你']. Why is u'欢迎' missing?
About this issue
- State: open
- Created 6 years ago
- Comments: 30 (11 by maintainers)
@leepand Hello, I'm also a user in China. You only need to change line 532 of the source code from idx = sequence_end_pos to idx = sequence_end_pos - 1. Code:
from flashtext import KeywordProcessor
if __name__ == '__main__':
    kp = KeywordProcessor()
    kp.add_keyword('北京')
    kp.add_keyword('欢迎')
    kp.add_keyword('你')
    text = '北京欢迎你'
    tl = kp.extract_keywords(text)
    print(tl)
Output: ['北京', '欢迎', '你']
Doesn't it seem ridiculous that a string-matching tool must have a tokenizer?
I suggest (just a suggestion ^_^) designing it as a pure Aho-Corasick (AC) automaton, like https://github.com/WojciechMula/pyahocorasick/, which would be more useful and more feasible. pyahocorasick is written in C, and I'd like to see a pure Python version.
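To make the suggestion concrete, here is a minimal pure-Python sketch of an Aho-Corasick matcher. It is not flashtext's or pyahocorasick's code, and the names build_automaton and find_all are made up for illustration. It works on raw characters with no notion of word boundaries, so it also reports overlapping matches.

```python
from collections import deque


def build_automaton(keywords):
    """Build Aho-Corasick tables: goto transitions, failure links, outputs."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state for the longest proper suffix
    out = [[]]    # out[state] -> keywords that end at this state
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    # Breadth-first pass: depth-1 states keep fail = 0, deeper ones are derived.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit keywords that end at the suffix state.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    """Return (keyword, start, end) for every occurrence, end exclusive."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            matches.append((word, i - len(word) + 1, i + 1))
    return matches


automaton = build_automaton(['北京', '欢迎', '你'])
print(find_all('北京欢迎你', automaton))
# [('北京', 0, 2), ('欢迎', 2, 4), ('你', 4, 5)]
```

Because there is no boundary check, a keyword such as 'he' would also match inside 'she', which may or may not be what you want for English text.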
Maybe you could separate out the tokenizer and let us write our own, like https://whoosh.readthedocs.io/en/latest/analysis.html?
@vi3k6i5 I think the best you can do is separate out the tokenizer, whether the text is English or Chinese. You could let us design our own tokenizer and pass it into flashtext.
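A rough sketch of what a pluggable tokenizer could look like follows. Everything in it is hypothetical: TokenKeywordMatcher, whitespace_tokenizer and char_tokenizer are illustrative names, not part of flashtext. The idea is that keywords and text go through the same tokenizer, and matching happens on tokens instead of on characters and a boundary set.

```python
import re


def whitespace_tokenizer(text):
    """Return (token, start, end) for each run of non-space characters."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def char_tokenizer(text):
    """Treat every non-space character as its own token (for unsegmented text)."""
    return [(ch, i, i + 1) for i, ch in enumerate(text) if not ch.isspace()]


class TokenKeywordMatcher:
    """Hypothetical matcher: not flashtext's API, just an illustration."""

    _END = object()  # marks the end of a keyword in the token trie

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self.trie = {}

    def add_keyword(self, keyword):
        node = self.trie
        for token, _, _ in self.tokenizer(keyword):
            node = node.setdefault(token, {})
        node[self._END] = keyword

    def extract_keywords(self, text):
        tokens = self.tokenizer(text)
        found = []
        i = 0
        while i < len(tokens):
            node, best, j = self.trie, None, i
            # Walk the trie as far as the tokens allow, keeping the longest hit.
            while j < len(tokens) and tokens[j][0] in node:
                node = node[tokens[j][0]]
                j += 1
                if self._END in node:
                    best = (node[self._END], tokens[i][1], tokens[j - 1][2], j)
            if best:
                found.append(best[:3])
                i = best[3]  # continue right after the matched tokens
            else:
                i += 1
        return found


zh = TokenKeywordMatcher(char_tokenizer)
for word in ['北京', '欢迎', '你']:
    zh.add_keyword(word)
print(zh.extract_keywords('北京欢迎你'))
# [('北京', 0, 2), ('欢迎', 2, 4), ('你', 4, 5)]
```

With whitespace_tokenizer the same class behaves like a whole-word matcher for English; a real Chinese word segmenter could be dropped in the same way, as long as it returns (token, start, end) triples.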
Cool, thanks for the suggestion. I will definitely take it into consideration 😃
You can remove the digit characters from the “non word boundaries” set. E.g.
Output:
I'm considering separating Chinese characters with spaces so that they mimic English words, and it seems to work fine (in Python 3.6):
from flashtext import KeywordProcessor
string = '北 京 欢 迎 您 ! 北 京 欢 迎 您 !'
keyword_proc = KeywordProcessor()
keyword_proc.add_keyword('北 京')
keyword_proc.add_keyword('欢 迎')
keyword_proc.add_keyword('您')
keywords = keyword_proc.extract_keywords(string, span_info=True)
Output:
[('北 京', 0, 3), ('欢 迎', 4, 7), ('您', 8, 9), ('北 京', 12, 15), ('欢 迎', 16, 19), ('您', 20, 21)]
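The spans above refer to the spaced-out string, not the original text. If you need offsets into the original, a small helper can convert them. This is only a sketch under the assumption that exactly one space was inserted between every pair of adjacent characters; space_out and to_original_span are made-up names, not flashtext functions.

```python
def space_out(text):
    """Insert one space between every pair of adjacent characters."""
    return " ".join(text)


def to_original_span(start, end):
    """Map a [start, end) span in the spaced string back to the original.

    Original character k sits at index 2 * k of the spaced string, so a span
    that starts on a character and ends right after one maps back by halving.
    """
    return start // 2, (end + 1) // 2


print(space_out('北京欢迎您'))   # 北 京 欢 迎 您
print(to_original_span(0, 3))    # (0, 2)
print(to_original_span(4, 7))    # (2, 4)
print(to_original_span(8, 9))    # (4, 5)
```

The same spacing has to be applied to the keywords before adding them, as in the snippet above.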
The reason is that there are no spaces between Chinese words.
So I removed digits and letters from non_word_boundaries, changing
self.non_word_boundaries = set(string.digits + string.ascii_letters + '_')
to:
self.non_word_boundaries = set('_')
It works well.
A fix has been added in the master branch. Please run
pip install -U git+https://github.com/vi3k6i5/flashtext.git
You can pass an encoding parameter when loading keywords from a file.