spaCy: Tokenization not working in v2.1

How to reproduce the behaviour

I found a bug where tokenization is completely broken with version 2.1.0a10 on Python 2.7: every string is split into individual characters. I have reproduced this on three of my machines.

$ conda create -n py27_spacy2 python=2.7
$ source activate py27_spacy2
$ pip install -U spacy-nightly
$ python -m spacy download en_core_web_sm
$ python -c "import spacy; nlp=spacy.load('en_core_web_sm'); doc=nlp(u'hello world'); print ','.join([t.text for t in doc])"
h,e,ll,o,w,o,r,l,d
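
As a temporary workaround (a sketch, not a confirmed fix; it assumes the base English language data is intact and uses the spaCy 2.x Defaults.create_tokenizer API), the model's broken tokenizer can be swapped for one rebuilt from the language defaults:

# Workaround sketch: rebuild the tokenizer from the English language
# defaults instead of using the one deserialized with the model.
import spacy
from spacy.lang.en import English

nlp = spacy.load('en_core_web_sm')
nlp.tokenizer = English.Defaults.create_tokenizer(nlp)

doc = nlp(u'hello world')
print ','.join([t.text for t in doc])  # expected: hello,world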

Your Environment

  • Operating System: Ubuntu
  • Python Version Used: 2.7
  • spaCy Version Used: 2.1.0a10

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 29 (10 by maintainers)

Most upvoted comments

The session below shows that the base English() tokenizer handles the text correctly, while the same text run through the loaded en_core_web_sm model is split into individual characters:

>>> from spacy.lang.en import English
>>> nlp = English()
>>> doc=nlp(u'Well I wonder how this will / shall look after tokenization with the model - ill or not ?')
>>> print ','.join([t.text for t in doc])
Well,I,wonder,how,this,will,/,shall,look,after,tokenization,with,the,model,-,ill,or,not,?
>>> import spacy
>>> nlp=spacy.load('en_core_web_sm')
>>> doc=nlp(u'Well I wonder how this will / shall look after tokenization with the model - ill or not ?')
>>> print ','.join([t.text for t in doc])
W,e,l,l,I,w,o,n,d,e,r,h,o,w,t,h,i,s,w,i,l,l,/,s,h,a,l,l,l,o,o,k,a,f,t,e,r,t,o,k,e,n,i,z,a,t,i,o,n,w,i,t,h,t,h,e,m,o,d,e,l,-,i,ll,o,r,n,o,t,?
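
One way to narrow this down (a diagnostic sketch; the attribute names are from the public spaCy 2.x Tokenizer API) is to compare the loaded model's tokenizer callbacks with those of a freshly constructed pipeline:

# Diagnostic sketch: check whether the model's tokenizer lost its
# prefix/suffix/infix patterns somewhere during deserialization.
import spacy
from spacy.lang.en import English

broken = spacy.load('en_core_web_sm').tokenizer
fresh = English().tokenizer

print broken.prefix_search is not None, fresh.prefix_search is not None
print broken.suffix_search is not None, fresh.suffix_search is not None
print broken.infix_finditer is not None, fresh.infix_finditer is not None

If the model's tokenizer prints False where the fresh one prints True, the serialized tokenizer data is at fault; if both print True, the problem lies deeper, e.g. in the compiled patterns themselves.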