flashtext: something wrong with Chinese?

In Python 2.7:

```python
from flashtext import KeywordProcessor

keyword_processor = KeywordProcessor()
keyword_processor.add_keyword(u'北京')
keyword_processor.add_keyword(u'欢迎')
keyword_processor.add_keyword(u'你')
keyword_processor.extract_keywords(u'北京欢迎你')
```

This returns [u'北京', u'你'], missing u'欢迎'?

About this issue

  • Original URL
  • State: open
  • Created 6 years ago
  • Comments: 30 (11 by maintainers)

Most upvoted comments

In Python 2.7:

```python
from flashtext import KeywordProcessor

keyword_processor = KeywordProcessor()
keyword_processor.add_keyword(u'北京')
keyword_processor.add_keyword(u'欢迎')
keyword_processor.add_keyword(u'你')
keyword_processor.extract_keywords(u'北京欢迎你')
```

This returns [u'北京', u'你'], missing u'欢迎'?

@leepand Hi, I am a Chinese user too. You only need to change line 532 of the source from `idx = sequence_end_pos` to `idx = sequence_end_pos - 1`. Code:

```python
if __name__ == '__main__':
    kp = KeywordProcessor()
    kp.add_keyword('北京')
    kp.add_keyword('欢迎')
    kp.add_keyword('你')
    text = '北京欢迎你'
    tl = kp.extract_keywords(text)
    print(tl)
```

Output: ['北京', '欢迎', '你']

Doesn't it seem ridiculous that a string-matching tool must have a tokenizer?

I suggest (just a suggestion ^_^) designing it as a pure Aho-Corasick automaton, like https://github.com/WojciechMula/pyahocorasick/, which would be more useful and more feasible. pyahocorasick is written in C, and I'd like to see a pure Python version.
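For illustration, here is a minimal pure-Python Aho-Corasick sketch (the helper names `build_automaton` and `find_all` are my own, not part of flashtext or pyahocorasick). Since it has no notion of word boundaries, Chinese text needs no tokenizer at all:

```python
from collections import deque

def build_automaton(keywords):
    """Build a minimal Aho-Corasick automaton: a trie plus failure links."""
    trie, fail, out = [{}], [0], [[]]  # per-state: transitions, failure link, matched keywords
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in trie[state]:
                trie.append({}); fail.append(0); out.append([])
                trie[state][ch] = len(trie) - 1
            state = trie[state][ch]
        out[state].append(word)
    # Breadth-first pass to compute failure links (shallower states first).
    queue = deque(trie[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in trie[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in trie[f]:
                f = fail[f]
            fail[nxt] = trie[f].get(ch, 0)
            out[nxt] += out[fail[nxt]]  # inherit matches ending at the fallback state
    return trie, fail, out

def find_all(text, automaton):
    """Return (keyword, start, end) for every occurrence, in one pass over text."""
    trie, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in trie[state]:
            state = fail[state]
        state = trie[state].get(ch, 0)
        for word in out[state]:
            hits.append((word, i - len(word) + 1, i + 1))
    return hits

print(find_all('北京欢迎你', build_automaton(['北京', '欢迎', '你'])))
# [('北京', 0, 2), ('欢迎', 2, 4), ('你', 4, 5)]
```

Unlike flashtext, this reports overlapping matches too, which is the standard Aho-Corasick behavior.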

Maybe you could separate out the tokenizer and allow us to write our own tokenizer?

like https://whoosh.readthedocs.io/en/latest/analysis.html

@vi3k6i5 I think the best you can do is separate out the tokenizer, whether for English or Chinese. You could allow us to design our own tokenizer and pass it into flashtext.
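To sketch what that could look like (a hypothetical wrapper of my own, not flashtext's actual API): a processor that accepts any callable as its tokenizer, so a Chinese user can pass a per-character tokenizer (or jieba, etc.) while an English user passes `str.split`:

```python
class TokenizedKeywordProcessor:
    """Hypothetical sketch: keyword matching over tokens from a pluggable tokenizer."""

    def __init__(self, tokenize):
        self.tokenize = tokenize   # callable: str -> list of tokens
        self.keywords = {}         # token tuple -> original keyword string

    def add_keyword(self, keyword):
        self.keywords[tuple(self.tokenize(keyword))] = keyword

    def extract_keywords(self, text):
        tokens = self.tokenize(text)
        found, i = [], 0
        while i < len(tokens):
            # Greedy longest match starting at token i.
            best = None
            for kw_tokens in self.keywords:
                n = len(kw_tokens)
                if tuple(tokens[i:i + n]) == kw_tokens and (best is None or n > len(best)):
                    best = kw_tokens
            if best:
                found.append(self.keywords[best])
                i += len(best)
            else:
                i += 1
        return found

# With a per-character tokenizer, Chinese works without spaces:
proc = TokenizedKeywordProcessor(list)
proc.add_keyword('北京')
proc.add_keyword('欢迎')
proc.add_keyword('你')
print(proc.extract_keywords('北京欢迎你'))  # ['北京', '欢迎', '你']
```

The linear scan over keywords is only for brevity; a real implementation would keep flashtext's trie, just keyed on tokens instead of characters.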

Cool, thanks for the suggestion. I will definitely take it into consideration 😃

```python
keyword_processor.add_keyword('测试')
keywords_found = keyword_processor.extract_keywords('简单测试')
```

returns ['测试']

```python
keyword_processor.add_keyword('测试')
keywords_found = keyword_processor.extract_keywords('3测试')
```

returns nothing 😦
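A rough way to see why (this regex is my own approximation of flashtext's default word-boundary rule, under which digits, ASCII letters, and `_` are the "non word boundary" characters that extend a word):

```python
import re

def boundary_match(keyword, text):
    """Approximation: a keyword only matches when it is not flanked by
    digits, ASCII letters, or '_', mirroring flashtext's default boundary set."""
    pattern = r'(?<![0-9A-Za-z_])' + re.escape(keyword) + r'(?![0-9A-Za-z_])'
    return [m.group() for m in re.finditer(pattern, text)]

print(boundary_match('测试', '简单测试'))  # ['测试']: '单' is a boundary character
print(boundary_match('测试', '3测试'))     # []: the digit '3' glues onto '测'
```

Chinese characters are boundaries under this rule, so `简单测试` matches, but the digit in `3测试` fuses with the keyword and blocks the match.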

Has the problem of keywords not being recognized when Chinese characters are mixed with digits been solved? The method suggested below does not handle the mixed digit/Chinese case.

There are quite a few pitfalls; it indeed fails to match once digits are added.

You can remove digit characters from the "non word boundaries" set. E.g.

```python
from flashtext import KeywordProcessor

string = '北京3欢迎'

extracter = KeywordProcessor()
extracter.set_non_word_boundaries(set('-'))  # keep only '-' as a non-word-boundary character
extracter.add_keyword('欢迎')
print(extracter.extract_keywords(string))
```

Output:

```
['欢迎']
```

I'm considering adding spaces between Chinese characters to mimic English words, and it seems to work fine (in Python 3.6):

```python
string = '北 京 欢 迎 您 ! 北 京 欢 迎 您 !'

keyword_proc = KeywordProcessor()
keyword_proc.add_keyword('北 京')
keyword_proc.add_keyword('欢 迎')
keyword_proc.add_keyword('您')

keywords = keyword_proc.extract_keywords(string, span_info=True)
```

Output:

```
[('北 京', 0, 3), ('欢 迎', 4, 7), ('您', 8, 9), ('北 京', 12, 15), ('欢 迎', 16, 19), ('您', 20, 21)]
```

The reason is that there are no spaces between Chinese words.

So I removed digits and letters from non_word_boundaries:

```python
self.non_word_boundaries = set(string.digits + string.ascii_letters + '_')
```

changed to:

```python
self.non_word_boundaries = set('_')
```

It works well.

A fix has been added in the master branch:

Please do `pip install -U git+https://github.com/vi3k6i5/flashtext.git`

You can pass an `encoding` parameter when loading keywords into flashtext from a file.