mne-bids: Problem with accented characters

Describe the bug

This issue is similar to the bug fixed in #172. In .json files loaded with MNE-BIDS, various fields may contain non-UTF-8 characters, such as accented characters. These appear in many institution and person names, e.g., French names. For example, MNE-BIDS crashes when importing data from this open-access dataset: https://search.kg.ebrains.eu/instances/Dataset/1e5ec1d6-a17d-46ed-8e8b-05c2673dbc0e It crashes because the .json file contains:

  {
  ...
      "InstitutionName": "Hôpital Pierre Wertheimer",
  ...
  }

This is not a problem specific to MNE-BIDS. The following code reproduces the error triggered by MNE-BIDS when calling read_raw_bids(...) on this dataset:

  import json
  json.loads(open("/Users/christianoreilly/Downloads/sub-071_task-MCSE_ieeg.json", encoding='utf-8').read())

produces

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
/var/folders/_r/pjcpg0r91hd_pj072jq4zws00000gn/T/ipykernel_7841/70971748.py in <module>
----> 1 json.loads(open("/Users/christianoreilly/Downloads/sub-071_task-MCSE_ieeg.json", encoding='utf-8').read())

~/opt/anaconda3/lib/python3.8/codecs.py in decode(self, input, final)
    320         # decode input (taking the buffer into account)
    321         data = self.buffer + input
--> 322         (result, consumed) = self._buffer_decode(data, self.errors, final)
    323         # keep undecoded input until the next call
    324         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf4 in position 45: invalid continuation byte
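For context (my own check, not part of the original report): byte 0xf4 is 'ô' in Latin-1 (ISO-8859-1), which matches the "Hôpital" in InstitutionName, so the file was most likely saved in Latin-1 rather than UTF-8. This can be verified directly:

```python
# 'ô' encoded in Latin-1 is the single byte 0xf4 -- the exact byte
# the traceback complains about.
raw = "Hôpital Pierre Wertheimer".encode("latin-1")
assert b"\xf4" in raw

# Decoding as Latin-1 round-trips cleanly...
assert raw.decode("latin-1") == "Hôpital Pierre Wertheimer"

# ...but decoding as UTF-8 fails, because 0xf4 starts a multi-byte
# UTF-8 sequence and the following byte is not a valid continuation.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    print(err.reason)  # e.g. 'invalid continuation byte'
```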

This issue is likely to be common in many non-English-speaking countries. We cannot expect JSON files containing information such as institution names from around the world to be limited to the ASCII character set, so MNE-BIDS should probably aim to support this use case.
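As a possible workaround on the reading side (a hypothetical helper, not part of the MNE-BIDS API), one could try UTF-8 first and fall back to Latin-1, which accepts any byte sequence:

```python
import json


def load_json_lenient(fname):
    """Load a JSON file, falling back to Latin-1 if it is not valid UTF-8.

    Hypothetical sketch: assumes mis-encoded files are Latin-1, which is
    a common case for Western European text but not guaranteed in general.
    """
    with open(fname, "rb") as fid:
        data = fid.read()
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte 0x00-0xff to a code point, so this
        # never raises -- though it may misinterpret other encodings.
        text = data.decode("latin-1")
    return json.loads(text)
```

This keeps strict UTF-8 behavior for well-formed files and only relaxes the decoding when it would otherwise crash.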

Additional information

  Platform:      macOS-10.16-x86_64-i386-64bit
  Python:        3.8.8 (default, Apr 13 2021, 12:59:45)  [Clang 10.0.0 ]
  Executable:    /Users/christianoreilly/opt/anaconda3/bin/python
  CPU:           i386: 16 cores
  Memory:        64.0 GB
  
  mne:           0.23.4
  numpy:         1.22.2 {}
  scipy:         1.6.2
  matplotlib:    3.4.2 {backend=module://matplotlib_inline.backend_inline}
  
  sklearn:       0.24.2
  numba:         0.53.1
  nibabel:       3.2.1
  nilearn:       Not found
  dipy:          1.4.1
  cupy:          Not found
  pandas:        1.4.1
  mayavi:        Not found
  pyvista:       0.32.1 {OpenGL 4.1 ATI-4.7.103 via AMD Radeon Pro 5500M OpenGL Engine}
  vtk:           9.0.3
  PyQt5:         5.9.2

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 15 (9 by maintainers)

Most upvoted comments

Thank you all for identifying and brainstorming this issue, and in particular @christian-oreilly for letting us at EBRAINS know. We have identified 575 JSON files in this dataset with the wrong encoding, and updated those to UTF-8. I think that solves the issue.
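For anyone curating similar datasets, a minimal sketch of such a batch re-encoding (a hypothetical script, not what EBRAINS actually ran, and again assuming the mis-encoded files are Latin-1) could look like:

```python
from pathlib import Path


def reencode_to_utf8(root):
    """Rewrite every JSON file under `root` that is not valid UTF-8,
    assuming the original encoding is Latin-1. Returns the fixed paths.

    Hypothetical sketch -- run on a copy of the data first.
    """
    fixed = []
    for path in Path(root).rglob("*.json"):
        data = path.read_bytes()
        try:
            data.decode("utf-8")  # already valid UTF-8: leave untouched
        except UnicodeDecodeError:
            path.write_bytes(data.decode("latin-1").encode("utf-8"))
            fixed.append(path)
    return fixed
```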

Just emailed curation-support@ebrains.eu to let them know about the issue.