tensorflow: Decoding error during vocabulary loading from TextVectorization layer

The error occurs in the get_vocabulary() method of TextVectorization when one of the tokens cannot be decoded. The exact place in the code is string_lookup's get_vocabulary() method:

return [x.decode(self.encoding) for _, x in sorted(zip(values, keys))]

How to reproduce?

from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

text = 'Był to świetny pomysł, bo punktował Prawo i Sprawiedliwość tam, gdzie jest ono najsłabsze, mimo że udaje najsilniejsze. Uderzał w wizerunek państwa dobrobytu, które nikogo nie zostawia z tyłu i wyrównuje szanse. Tutaj mamy pewnego rodzaju déjà vu.'

vectorize_layer = TextVectorization()
vectorize_layer.adapt([text])
print(vectorize_layer.get_vocabulary())

Simpler code:

values = [1]
keys = [b'warszawie\xc2']
[x.decode('utf-8') for _, x in sorted(zip(values, keys))]

Error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 9: unexpected end of data
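The failure can be reproduced without TensorFlow at all; bytes.decode offers error-handling modes that sidestep it. A minimal illustration using the truncated token above:

```python
raw = b'warszawie\xc2'  # 0xc2 starts a 2-byte UTF-8 sequence that was cut off

# Strict decoding (the default) raises UnicodeDecodeError.
try:
    raw.decode('utf-8')
except UnicodeDecodeError as exc:
    print(exc.reason)  # unexpected end of data

# errors='ignore' drops the dangling byte entirely.
print(raw.decode('utf-8', errors='ignore'))  # warszawie

# errors='replace' substitutes U+FFFD instead of dropping the byte.
print(raw.decode('utf-8', errors='replace'))
```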

Fix

I was able to fix it by writing my own _get_vocabulary() function that simply ignores decoding errors, which are rare but frustrating:

def _get_vocabulary():
    # Reach into the layer's private lookup table and skip undecodable bytes.
    keys, values = vectorize_layer._index_lookup_layer._table_handler.data()
    return [x.decode('utf-8', errors='ignore') for _, x in sorted(zip(values, keys))]

Can an option to ignore decoding errors be passed to string_lookup?
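One possible shape for such an option, sketched as a standalone helper rather than the real string_lookup API (decode_vocab and its signature are hypothetical):

```python
def decode_vocab(keys, values, encoding='utf-8', errors='strict'):
    """Mirror string_lookup.get_vocabulary(), but expose codec error handling.

    keys are the byte tokens and values their indices; errors is passed
    straight through to bytes.decode ('strict', 'ignore', 'replace', ...).
    """
    return [k.decode(encoding, errors=errors) for _, k in sorted(zip(values, keys))]


keys = [b'warszawie\xc2', b'tam']
values = [2, 1]
print(decode_vocab(keys, values, errors='ignore'))  # ['tam', 'warszawie']
```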

  • Windows 10
  • TensorFlow 2.2
  • Python 3.8

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 19 (5 by maintainers)

Most upvoted comments

The problem occurs because the Windows default encoding is not UTF-8. Everything works fine until a character that was not encoded properly is encountered in the decoding step inside TextVectorization.get_vocabulary(). I solved the problem by changing the Windows default encoding to UTF-8: go to Language Settings -> Administrative Language Settings -> Change system locale -> check "Beta: Use Unicode UTF-8 for worldwide language support". Restart. Done.
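A quick way to check which encodings Python actually sees on a given machine: sys.getdefaultencoding() is always 'utf-8' on Python 3, while locale.getpreferredencoding() reflects the system locale (on Windows without the UTF-8 option above, typically a legacy code page such as 'cp1250' or 'cp1252'):

```python
import locale
import sys

# Always 'utf-8' on Python 3, regardless of the OS.
print(sys.getdefaultencoding())

# Reflects the system locale; this is what changes when the
# "Beta: Use Unicode UTF-8" option is toggled on Windows.
print(locale.getpreferredencoding())
```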

@amahendrakar I tried again running this script:

import tensorflow as tf
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

print(tf.__version__)

text = 'Był to świetny pomysł, bo punktował Prawo i Sprawiedliwość tam, gdzie jest ono najsłabsze, mimo że udaje najsilniejsze. Uderzał w wizerunek państwa dobrobytu, które nikogo nie zostawia z tyłu i wyrównuje szanse. Tutaj mamy pewnego rodzaju déjà vu.'

vectorize_layer = TextVectorization()
vectorize_layer.adapt([text])
print(vectorize_layer.get_vocabulary())

Response is:

2.3.0 UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 4: unexpected end of data

on Python 3.8.5. But when I tested it on Python 3.6.9, the version you used, it worked. So it looks like the problem is with the combination:

Python 3.8.X and Tensorflow 2.3

Can you test why that is, on Colab or on your local machine?