flashtext: something wrong with Chinese?

In Python 2.7:

```python
from flashtext import KeywordProcessor

keyword_processor = KeywordProcessor()
keyword_processor.add_keyword(u'北京')
keyword_processor.add_keyword(u'欢迎')
keyword_processor.add_keyword(u'你')
keyword_processor.extract_keywords(u'北京欢迎你')
```

This returns [u'北京', u'你'], missing u'欢迎'?

About this issue

  • Original URL
  • State: open
  • Created 6 years ago
  • Comments: 30 (11 by maintainers)

Most upvoted comments

In Python 2.7:

```python
from flashtext import KeywordProcessor

keyword_processor = KeywordProcessor()
keyword_processor.add_keyword(u'北京')
keyword_processor.add_keyword(u'欢迎')
keyword_processor.add_keyword(u'你')
keyword_processor.extract_keywords(u'北京欢迎你')
```

This returns [u'北京', u'你'], missing u'欢迎'?

@leepand Hi, I am a Chinese user too. You only need to change line 532 of the source from `idx = sequence_end_pos` to `idx = sequence_end_pos - 1`. Code:

```python
if __name__ == '__main__':
    kp = KeywordProcessor()
    kp.add_keyword('北京')
    kp.add_keyword('欢迎')
    kp.add_keyword('你')
    text = '北京欢迎你'
    tl = kp.extract_keywords(text)
    print(tl)
```

Output: ['北京', '欢迎', '你']

Doesn't it seem ridiculous that a string-matching tool must have a tokenizer?

I suggest (just a suggestion ^_^) designing it as a pure Aho-Corasick automaton, like https://github.com/WojciechMula/pyahocorasick/, which would be more useful and more feasible. pyahocorasick is written in C, and I'd like to see a pure Python version.
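For illustration, here is a minimal pure-Python Aho-Corasick sketch (the helper names `build_automaton` and `find_all` are my own, not part of flashtext or pyahocorasick). Since it has no notion of word boundaries, Chinese text needs no tokenizer at all:

```python
from collections import deque

def build_automaton(keywords):
    """Build a minimal Aho-Corasick automaton: a trie plus failure links."""
    trie, fail, out = [{}], [0], [[]]  # per-state: transitions, failure link, matched keywords
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in trie[state]:
                trie.append({}); fail.append(0); out.append([])
                trie[state][ch] = len(trie) - 1
            state = trie[state][ch]
        out[state].append(word)
    # Breadth-first pass to compute failure links (shallower states first).
    queue = deque(trie[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in trie[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in trie[f]:
                f = fail[f]
            fail[nxt] = trie[f].get(ch, 0)
            out[nxt] += out[fail[nxt]]  # inherit matches ending at the fallback state
    return trie, fail, out

def find_all(text, automaton):
    """Return (keyword, start, end) for every occurrence, in one pass over text."""
    trie, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in trie[state]:
            state = fail[state]
        state = trie[state].get(ch, 0)
        for word in out[state]:
            hits.append((word, i - len(word) + 1, i + 1))
    return hits

print(find_all('北京欢迎你', build_automaton(['北京', '欢迎', '你'])))
# [('北京', 0, 2), ('欢迎', 2, 4), ('你', 4, 5)]
```

Unlike flashtext, this reports overlapping matches too, which is the standard Aho-Corasick behavior.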

Maybe you could separate out the tokenizer and allow us to write our own tokenizer?

like https://whoosh.readthedocs.io/en/latest/analysis.html

@vi3k6i5 I think the best you can do is separate out the tokenizer, whether for English or Chinese. You could allow us to design our own tokenizer and pass it into flashtext.
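To sketch what that could look like (a hypothetical wrapper of my own, not flashtext's actual API): a processor that accepts any callable as its tokenizer, so a Chinese user can pass a per-character tokenizer (or jieba, etc.) while an English user passes `str.split`:

```python
class TokenizedKeywordProcessor:
    """Hypothetical sketch: keyword matching over tokens from a pluggable tokenizer."""

    def __init__(self, tokenize):
        self.tokenize = tokenize   # callable: str -> list of tokens
        self.keywords = {}         # token tuple -> original keyword string

    def add_keyword(self, keyword):
        self.keywords[tuple(self.tokenize(keyword))] = keyword

    def extract_keywords(self, text):
        tokens = self.tokenize(text)
        found, i = [], 0
        while i < len(tokens):
            # Greedy longest match starting at token i.
            best = None
            for kw_tokens in self.keywords:
                n = len(kw_tokens)
                if tuple(tokens[i:i + n]) == kw_tokens and (best is None or n > len(best)):
                    best = kw_tokens
            if best:
                found.append(self.keywords[best])
                i += len(best)
            else:
                i += 1
        return found

# With a per-character tokenizer, Chinese works without spaces:
proc = TokenizedKeywordProcessor(list)
proc.add_keyword('北京')
proc.add_keyword('欢迎')
proc.add_keyword('你')
print(proc.extract_keywords('北京欢迎你'))  # ['北京', '欢迎', '你']
```

The linear scan over keywords is only for brevity; a real implementation would keep flashtext's trie, just keyed on tokens instead of characters.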

Cool, thanks for the suggestion. I will definitely take it into consideration 😃

```python
keyword_processor.add_keyword('测试')
keywords_found = keyword_processor.extract_keywords('简单测试')
```

returns ['测试']

```python
keyword_processor.add_keyword('测试')
keywords_found = keyword_processor.extract_keywords('3测试')
```

returns nothing 😦
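A rough way to see why (this regex is my own approximation of flashtext's default word-boundary rule, under which digits, ASCII letters, and `_` are the "non word boundary" characters that extend a word):

```python
import re

def boundary_match(keyword, text):
    """Approximation: a keyword only matches when it is not flanked by
    digits, ASCII letters, or '_', mirroring flashtext's default boundary set."""
    pattern = r'(?<![0-9A-Za-z_])' + re.escape(keyword) + r'(?![0-9A-Za-z_])'
    return [m.group() for m in re.finditer(pattern, text)]

print(boundary_match('测试', '简单测试'))  # ['测试']: '单' is a boundary character
print(boundary_match('测试', '3测试'))     # []: the digit '3' glues onto '测'
```

Chinese characters are boundaries under this rule, so `简单测试` matches, but the digit in `3测试` fuses with the keyword and blocks the match.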

Has the problem of keywords not being recognized when Chinese characters are mixed with digits been solved? The method suggested below does not handle the mixed digit/Chinese case.

There are quite a few pitfalls; it indeed fails to match once digits are added.

You can remove digit characters from the "non word boundaries" set. E.g.

```python
from flashtext import KeywordProcessor

string = '北京3欢迎'

extracter = KeywordProcessor()
extracter.set_non_word_boundaries(set('-'))  # keep only '-' as a non-word-boundary character
extracter.add_keyword('欢迎')
print(extracter.extract_keywords(string))
```

Output:

```
['欢迎']
```

I'm considering adding spaces between Chinese characters to mimic English words, and it seems to work fine (in Python 3.6):

```python
string = '北 京 欢 迎 您 ! 北 京 欢 迎 您 !'

keyword_proc = KeywordProcessor()
keyword_proc.add_keyword('北 京')
keyword_proc.add_keyword('欢 迎')
keyword_proc.add_keyword('您')

keywords = keyword_proc.extract_keywords(string, span_info=True)
```

Output:

```
[('北 京', 0, 3), ('欢 迎', 4, 7), ('您', 8, 9), ('北 京', 12, 15), ('欢 迎', 16, 19), ('您', 20, 21)]
```

The reason is that there are no spaces between Chinese words.

So I removed digits and letters from non_word_boundaries:

```python
self.non_word_boundaries = set(string.digits + string.ascii_letters + '_')
```

changed to:

```python
self.non_word_boundaries = set('_')
```

It works well.

A fix has been added in the master branch:

Please do `pip install -U git+https://github.com/vi3k6i5/flashtext.git`

You can pass an `encoding` parameter when loading keywords into flashtext from a file.