tensorflow: Decoding error during vocabulary loading from TextVectorization layer
The error occurs inside the get_vocabulary() method of TextVectorization when one of the tokens can't be decoded. The exact place in the code is string_lookup's get_vocabulary() method:
return [x.decode(self.encoding) for _, x in sorted(zip(values, keys))]
How to reproduce?
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
text = 'Był to świetny pomysł, bo punktował Prawo i Sprawiedliwość tam, gdzie jest ono najsłabsze, mimo że udaje najsilniejsze. Uderzał w wizerunek państwa dobrobytu, które nikogo nie zostawia z tyłu i wyrównuje szanse. Tutaj mamy pewnego rodzaju déjà vu.'
vectorize_layer = TextVectorization()
vectorize_layer.adapt([text])
print(vectorize_layer.get_vocabulary())
A simpler reproduction:
values = [1]
keys = [b'warszawie\xc2']
[x.decode('utf-8') for _, x in sorted(zip(values, keys))]
Error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 9: unexpected end of data
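For context, the failure is ordinary bytes.decode() behavior: the token ends with 0xc2, the first byte of a two-byte UTF-8 sequence, so strict decoding (the default) raises, while the lenient error handlers succeed. A minimal illustration:

```python
# b'warszawie\xc2' ends with a truncated multi-byte UTF-8 sequence.
token = b'warszawie\xc2'

try:
    token.decode('utf-8')  # errors='strict' is the default
except UnicodeDecodeError as e:
    print(e.reason)  # -> unexpected end of data

print(token.decode('utf-8', errors='ignore'))   # drops the bad byte -> 'warszawie'
print(token.decode('utf-8', errors='replace'))  # substitutes U+FFFD -> 'warszawie\ufffd'
```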
Fix
I was able to work around it by writing my own _get_vocabulary() function that simply ignores decoding errors, which are rare but frustrating:
def _get_vocabulary():
    # reach into the layer's internals for the raw byte tokens and their indices
    keys, values = vectorize_layer._index_lookup_layer._table_handler.data()
    return [x.decode('utf-8', errors='ignore') for _, x in sorted(zip(values, keys))]
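The workaround depends on TensorFlow internals, but the decode-and-sort logic itself can be sketched in pure Python. Here `keys` and `values` are hypothetical stand-ins for what `_table_handler.data()` returns (byte tokens and their vocabulary indices):

```python
# Sketch of the patched lookup: sort byte tokens by their vocabulary
# index and decode leniently, so one malformed token can't crash
# the whole vocabulary export.
def safe_vocabulary(keys, values, encoding='utf-8'):
    return [k.decode(encoding, errors='ignore')
            for _, k in sorted(zip(values, keys))]

keys = [b'warszawie\xc2', b'pomys\xc5\x82']  # second token is valid UTF-8
values = [2, 1]
print(safe_vocabulary(keys, values))  # -> ['pomysł', 'warszawie']
```

errors='replace' would be an alternative if you prefer to keep a visible U+FFFD marker instead of silently dropping bytes.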
Can option to ignore decoding errors be passed to string_lookup?
- Windows 10
- TensorFlow 2.2
- Python 3.8
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 19 (5 by maintainers)
The problem occurs due to the Windows default encoding (UTF-16). Everything works fine until a character that is not handled properly during encoding is encountered in the decoding step inside TextVectorization.get_vocabulary(). I solved this by changing the default encoding for Windows to UTF-8: go to Language Settings -> Administrative Language Settings -> Change system locale -> select "Beta: Use Unicode UTF-8 for worldwide language support". Restart. Done.
@amahendrakar I tried again running this script:
Response is:
2.3.0 UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 4: unexpected end of data
on Python 3.8.5. But I also tested it on Python 3.6.9, which you used, and there it worked. So it looks like the problem occurs with the combination of Python 3.8.x and TensorFlow 2.3.
Could you test why that is, on Colab or your local machine?
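For anyone hitting the Windows-encoding situation described in the comment above, a quick way to check which encodings the interpreter will actually use before resorting to changing the system locale (a sketch, not from the thread):

```python
# Inspect the encodings in play. sys.getdefaultencoding() is always
# 'utf-8' on Python 3; locale.getpreferredencoding() reflects the OS
# locale and is often a legacy code page on Windows.
import locale
import sys

print(sys.getdefaultencoding())       # -> utf-8
print(locale.getpreferredencoding())  # OS-dependent

# On Python 3.7+ you can also force UTF-8 mode process-wide without
# touching the system locale: run `python -X utf8 script.py` or set
# the environment variable PYTHONUTF8=1.
```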