conan: Cannot get MD5 for non-ASCII filenames

(Conan 0.30.3, Python 2.7.12, Ubuntu 16.04)

In conans/util/files.py, I get

UnicodeEncodeError: 'ascii' codec can't encode characters in position 10122-10125: ordinal not in range(128)

on the line content.encode(). This is because content is a unicode-type string containing non-ASCII characters, which Python 2's default ASCII codec cannot encode.

Is there any reason not to do content.encode("utf-8")?

To reproduce, add a file with non-ASCII characters in the filename to the files exported by a library.
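A minimal sketch of the failure and the proposed fix, independent of Conan. The manifest-like string `content` is a made-up example, not the real manifest format. On Python 2, a bare `content.encode()` falls back to the ASCII codec. Python 3 defaults to UTF-8, so the ASCII codec is named explicitly here to show the same error on either version.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Hypothetical manifest-like text containing a non-ASCII file name.
content = u"files/日本語のファイル.txt: 0123abcd\n"

# Encoding with the ASCII codec (the Python 2 default) fails.
try:
    content.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError as exc:
    ascii_ok = False
    print("ascii failed:", exc)

# Naming the codec explicitly handles any valid Unicode text.
utf8_bytes = content.encode("utf-8")
print("utf-8 produced", len(utf8_bytes), "bytes")
```

Passing the codec explicitly also removes the dependency on the interpreter's default encoding, which differs between Python 2 and Python 3.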

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 28 (16 by maintainers)

Most upvoted comments

Strangely enough, I came across a very similar error today. I can build my package locally on Ubuntu 16 + conan 1.21 and also on Ubuntu 18 + conan 1.29.0. On our CI servers we run the same configs, but in Docker containers. On the Docker images, I get

DEBUG :packager.py    [112]: PACKAGE: Creating config files to /home/ubuntu/.conan/data/pkg/1.0/usr/stable/package/5ab84d6acfe1f23c4fae0ab88f26e3a396351ac9 [2020-09-30 23:12:38,585]
Traceback (most recent call last):
  File "conan/conans/client/command.py", line 1947, in run
  File "conan/conans/client/command.py", line 358, in create
  File "conan/conans/client/conan_api.py", line 81, in wrapper
  File "conan/conans/client/conan_api.py", line 364, in create
  File "conan/conans/client/cmd/create.py", line 43, in create
  File "conan/conans/client/cmd/test.py", line 38, in install_build_and_test
  File "conan/conans/client/manager.py", line 68, in deps_install
  File "conan/conans/client/installer.py", line 308, in install
  File "conan/conans/client/installer.py", line 332, in _build
  File "conan/conans/client/installer.py", line 398, in _handle_node_cache
  File "conan/conans/client/installer.py", line 442, in _build_package
  File "conan/conans/client/installer.py", line 212, in build_package
  File "conan/conans/client/installer.py", line 161, in _package
  File "conan/conans/client/packager.py", line 91, in run_package_method
  File "conan/conans/client/packager.py", line 125, in _create_aux_files
  File "conan/conans/model/manifest.py", line 110, in save
  File "conan/conans/util/files.py", line 183, in save
  File "conan/conans/util/files.py", line 201, in to_file_bytes
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 1257517-1257518: surrogates not allowed

ERROR: 'utf-8' codec can't encode characters in position 1257517-1257518: surrogates not allowed

And I can't reproduce it locally or figure out what is going on. We are using Python 3.5 on Ubuntu 16 and 3.6 on Ubuntu 18.

Any suggestions to debug this?


UPDATE: It was related to a locale config. The following did the trick for me.

RUN locale-gen en_US.UTF-8 && update-locale LC_ALL=en_US.UTF-8 \
    LANG=en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LC_ALL en_US.UTF-8

Source: https://www.embeddeduse.com/2019/02/11/using-docker-containers-for-yocto-builds/
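For anyone debugging a similar difference between a local machine and a container, a small diagnostic along these lines may help. It is only a sketch: the helper name `encoding_report` is made up for illustration, and it prints what Python itself reports rather than anything Conan-specific. Run it in both environments and compare the output.

```python
import locale
import os
import sys


def encoding_report():
    """Collect the locale and encoding settings Python sees."""
    return {
        "LANG": os.environ.get("LANG"),
        "LC_ALL": os.environ.get("LC_ALL"),
        # Encoding Python derives from the locale settings.
        "preferred": locale.getpreferredencoding(False),
        # Encoding used to convert file names to and from bytes.
        "filesystem": sys.getfilesystemencoding(),
        "stdout": getattr(sys.stdout, "encoding", None),
    }


report = encoding_report()
for key in sorted(report):
    print("{:<10} {}".format(key, report[key]))
```

If the container shows an ASCII-based encoding where the local machine shows UTF-8, the locale configuration is a likely suspect.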

I believe we should take a look at PEP 0383

On POSIX systems, Python currently applies the locale's encoding to convert the byte data to Unicode, failing for characters that cannot be decoded. With this PEP, non-decodable bytes >= 128 will be represented as lone surrogate codes U+DC80..U+DCFF. Bytes below 128 will produce exceptions; see the discussion below.

To convert non-decodable bytes, a new error handler ([2]) "surrogateescape" is introduced, which produces these surrogates. On encoding, the error handler converts the surrogate back to the corresponding byte. This error handler will be used in any API that receives or produces file names, command line arguments, or environment variables.

Since we’re trying to store and load file names as bytes, we have to use the surrogateescape error handler, as the PEP suggests.
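A minimal sketch of what the PEP describes, using a made-up file name whose bytes are not valid UTF-8. Decoding with `surrogateescape` turns the undecodable byte into a lone surrogate. A strict UTF-8 encode of that string then fails with "surrogates not allowed", and encoding with the same error handler restores the original bytes.

```python
# Hypothetical file name stored as Latin-1 bytes; 0xE9 is not valid UTF-8 here.
raw_name = b"caf\xe9.txt"

# Undecodable bytes become lone surrogates in the range U+DC80..U+DCFF.
name = raw_name.decode("utf-8", errors="surrogateescape")

# A strict encode rejects the lone surrogate.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError as exc:
    strict_ok = False
    print("strict encode failed:", exc)

# The same error handler on encode maps the surrogate back to the original byte.
round_tripped = name.encode("utf-8", errors="surrogateescape")
print(round_tripped == raw_name)
```

This is only an illustration of the error handler, not a claim about how Conan writes its manifest.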

I have not been able to reproduce this in Py3 with conan 1.0. Here’s my test setup:

  • conan data directory is a folder named with Japanese characters
  • conanfile.py lives in a folder named with Japanese characters
  • inside the folder lives a file named “日本語のファイル.txt”
  • the file is UTF-8 encoded with Japanese characters as its contents
  • conanfile.py has exports_sources configured to export that file
  • my build() method has the following code:
        path = os.path.join(self.source_folder, "日本語のファイル.txt")
        self.output.info("MD5    : {}".format(tools.md5sum(path)))
        self.output.info("SHA1   : {}".format(tools.sha1sum(path)))
        self.output.info("SHA256 : {}".format(tools.sha256sum(path)))
  • I run conan create .
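The same check can be sketched outside Conan with only the standard library. This uses `hashlib` as a stand-in for the `tools.md5sum`, `tools.sha1sum` and `tools.sha256sum` calls above and is not Conan's implementation. The helper `file_digest`, the temporary directory and the file contents are assumptions made for illustration.

```python
import hashlib
import os
import shutil
import tempfile


def file_digest(path, algorithm="md5", chunk_size=65536):
    """Hash a file in chunks; the path may contain non-ASCII characters."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


payload = "日本語の内容\n".encode("utf-8")
tmp_dir = tempfile.mkdtemp()
try:
    # Create a UTF-8 encoded file with a Japanese file name.
    path = os.path.join(tmp_dir, "日本語のファイル.txt")
    with open(path, "wb") as handle:
        handle.write(payload)

    digests = {name: file_digest(path, name) for name in ("md5", "sha1", "sha256")}
finally:
    shutil.rmtree(tmp_dir)

for name in sorted(digests):
    print("{:<7}: {}".format(name.upper(), digests[name]))
```

Opening the file in binary mode means the hash depends only on the file's bytes, so the file name encoding matters only for locating the file.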

@masseman have you encountered the same troubles on Py3, or is this only a Py2 problem?