mne-bids: Problem with accented characters
Describe the bug
This issue is similar to the bug fixed in #172. In .json files loaded with MNE-BIDS, various fields may contain characters stored in a non-UTF-8 encoding, such as accented characters. These appear in many institution and person names, e.g., French names. For example, MNE-BIDS crashes when importing data from this open-access dataset: https://search.kg.ebrains.eu/instances/Dataset/1e5ec1d6-a17d-46ed-8e8b-05c2673dbc0e It crashes because the .json file contains:
{
...
"InstitutionName": "Hôpital Pierre Wertheimer",
...
}
This problem is not specific to MNE-BIDS. The following code reproduces the error triggered by MNE-BIDS when calling read_raw_bids(...) on this dataset:
import json
json.loads(open("/Users/christianoreilly/Downloads/sub-071_task-MCSE_ieeg.json", encoding='utf-8').read())
produces
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
/var/folders/_r/pjcpg0r91hd_pj072jq4zws00000gn/T/ipykernel_7841/70971748.py in <module>
----> 1 json.loads(open("/Users/christianoreilly/Downloads/sub-071_task-MCSE_ieeg.json", encoding='utf-8').read())
~/opt/anaconda3/lib/python3.8/codecs.py in decode(self, input, final)
320 # decode input (taking the buffer into account)
321 data = self.buffer + input
--> 322 (result, consumed) = self._buffer_decode(data, self.errors, final)
323 # keep undecoded input until the next call
324 self.buffer = data[consumed:]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf4 in position 45: invalid continuation byte
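The offending byte 0xf4 is the Latin-1 (ISO-8859-1) encoding of "ô" (U+00F4); in UTF-8 the same character is the two-byte sequence 0xc3 0xb4. This suggests the sidecar file was saved in a Latin-1-style single-byte encoding rather than UTF-8. A quick check (assuming Latin-1):

print(b"H\xf4pital".decode("latin-1"))  # -> Hôpital
print("Hôpital".encode("utf-8"))        # -> b'H\xc3\xb4pital'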
This issue is likely to be common in non-English-speaking countries. We cannot expect JSON files containing information such as institution names from around the world to be limited to ASCII characters, so MNE-BIDS should probably aim to handle this use case gracefully.
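Until such files are fixed at the source, a caller-side workaround is to try UTF-8 first and fall back to Latin-1, which can decode any byte sequence. The following is a minimal sketch of that idea, not MNE-BIDS's actual reading logic; read_json_fallback is a hypothetical helper:

import json

def read_json_fallback(fname):
    # Try UTF-8 first, since the BIDS specification requires it.
    try:
        with open(fname, encoding="utf-8") as f:
            return json.load(f)
    except UnicodeDecodeError:
        # Fallback assumption: the file was saved as Latin-1.
        with open(fname, encoding="latin-1") as f:
            return json.load(f)

sidecar = read_json_fallback("/Users/christianoreilly/Downloads/sub-071_task-MCSE_ieeg.json")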
Additional information
Platform: macOS-10.16-x86_64-i386-64bit
Python: 3.8.8 (default, Apr 13 2021, 12:59:45) [Clang 10.0.0 ]
Executable: /Users/christianoreilly/opt/anaconda3/bin/python
CPU: i386: 16 cores
Memory: 64.0 GB
mne: 0.23.4
numpy: 1.22.2 {}
scipy: 1.6.2
matplotlib: 3.4.2 {backend=module://matplotlib_inline.backend_inline}
sklearn: 0.24.2
numba: 0.53.1
nibabel: 3.2.1
nilearn: Not found
dipy: 1.4.1
cupy: Not found
pandas: 1.4.1
mayavi: Not found
pyvista: 0.32.1 {OpenGL 4.1 ATI-4.7.103 via AMD Radeon Pro 5500M OpenGL Engine}
vtk: 9.0.3
PyQt5: 5.9.2
About this issue
- State: closed
- Created 2 years ago
- Comments: 15 (9 by maintainers)
Thank you all for identifying and discussing this issue, and in particular @christian-oreilly for letting us at EBRAINS know. We have identified 575 JSON files in this dataset with the wrong encoding and updated them to UTF-8. I think that solves the issue.
Just emailed curation-support@ebrains.eu to let them know about the issue.
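For reference, a bulk repair pass like the one the EBRAINS curators describe above could be sketched as follows; this assumes the mis-encoded files are Latin-1, and convert_json_to_utf8 is a hypothetical helper rather than the actual EBRAINS tooling:

from pathlib import Path

def convert_json_to_utf8(bids_root):
    # Re-encode every sidecar .json file that is not valid UTF-8.
    for path in Path(bids_root).rglob("*.json"):
        raw = path.read_bytes()
        try:
            raw.decode("utf-8")  # already valid UTF-8: leave it alone
        except UnicodeDecodeError:
            text = raw.decode("latin-1")  # assumption: Latin-1 source encoding
            path.write_text(text, encoding="utf-8")
            print(f"Re-encoded {path}")

convert_json_to_utf8("/path/to/bids_dataset")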