nushell: Files with Non-UTF8 characters are simply ignored in `ls`

Describe the bug

Files with Non-UTF8 characters are simply ignored in ls.

How to reproduce

  1. Create a file named ''$'\001\b\327''@'$'\310\320\f''@8'. The spelling and quoting shown here are how Bash displays this file’s name.
  2. ls with Nu.
  3. See only normal files. This file is not displayed, at all.
  4. ls with Bash.
  5. File is shown.
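
As a rough sketch of step 1 outside Bash, the file can also be created from Python by passing the name as raw bytes. This assumes a Linux filesystem that accepts arbitrary bytes in file names (some platforms reject such names). `RAW_NAME`, `create_and_list`, and `is_valid_utf8` are names invented for this illustration; the byte values are transcribed from the Bash quoting in step 1.

```python
import os
import tempfile

# Bytes of the file name from step 1, transcribed from the Bash quoting:
# \001 \b \327 @ \310 \320 \f @ 8  (assumption: Linux, any-bytes file names)
RAW_NAME = b"\x01\x08\xd7@\xc8\xd0\x0c@8"


def create_and_list(directory: str) -> list:
    """Create an empty file named RAW_NAME in `directory` and list the
    directory, returning the entries as raw bytes."""
    dir_bytes = os.fsencode(directory)
    path = os.path.join(dir_bytes, RAW_NAME)
    with open(path, "wb"):
        pass  # create an empty file
    # Passing bytes to os.listdir makes it return bytes, untouched.
    return os.listdir(dir_bytes)


def is_valid_utf8(name: bytes) -> bool:
    """Return True if `name` decodes as strict UTF-8."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for name in create_and_list(tmp):
            print(name, "valid UTF-8:", is_valid_utf8(name))
```

Running `ls` with Nu and with Bash in the directory that holds such a file should then show the difference described in steps 2 to 5.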

Expected behavior

I can view this file in some form or other. The characters do not even need to be displayed properly; I just want to see the file name as Nu sees it, so that I can rm it.

Screenshots

No response

Configuration

key value
version 0.87.0
branch
commit_hash 77a1c3c7b2f3a110d48bcb792968e6b0d85d4bb7
build_os linux-x86_64
build_target x86_64-unknown-linux-gnu
rust_version rustc 1.71.1 (eb26296b5 2023-08-03)
rust_channel 1.71.1-x86_64-unknown-linux-gnu
cargo_version cargo 1.71.1 (7f1d04c00 2023-07-29)
build_time 2023-11-14 20:18:44 +00:00
build_rust_channel release
allocator mimalloc
features dataframe, default, extra, sqlite, static-link-openssl, trash, which, zip
installed_plugins

Additional context

I accidentally created this file and noticed it in VS Code. Listing it with Nu did not work at all. I then ran stat on the file, switched to Bash, ran ls there and voilà, there was the file. I removed it in Bash. I could not do that in Nu because I could not even list, i.e. “see”, the file, which also meant I could not delete it.

About this issue

  • State: open
  • Created 7 months ago
  • Comments: 17 (3 by maintainers)

Most upvoted comments

I could reproduce this using touch "$(echo -ne "\xff\xff")" in bash. Then ls in bash shows ''$'\377\377', while nu gives warning: get non-utf8 filename "/tmp/test/\xFF\xFF", ignored.

Note that Rust uses OsString to represent paths, which can handle invalid UTF-8.

For comparison, os.listdir(".") in Python returns ['\udcff\udcff'], i.e. it represents each invalid byte as a code point in the Unicode surrogate block. When you want to open '\udcff\udcff', Python maps each surrogate back to its original byte (in effect, it just removes the \udc prefix). This seems to be unambiguous because a surrogate code point does not have a valid UTF-8 representation. So this idea could be an option for nu if we don’t want to introduce a separate datatype for paths.
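
The round trip described above can be sketched with Python's `surrogateescape` error handler, which is, as far as I know, what `os.fsdecode` and `os.fsencode` use on POSIX systems. The helper names `to_text` and `to_bytes` are made up for this example, and the codec is spelled out explicitly so the result does not depend on the locale.

```python
def to_text(raw: bytes) -> str:
    """Decode a file name, mapping each undecodable byte 0xNN to the
    lone surrogate U+DCNN instead of failing."""
    return raw.decode("utf-8", errors="surrogateescape")


def to_bytes(text: str) -> bytes:
    """Reverse of to_text: turn each U+DCNN surrogate back into byte 0xNN."""
    return text.encode("utf-8", errors="surrogateescape")


if __name__ == "__main__":
    raw = b"\xff\xff"            # the name from `touch "$(echo -ne "\xff\xff")"`
    text = to_text(raw)
    assert text == "\udcff\udcff"
    assert to_bytes(text) == raw  # lossless round trip

    # A strict encoder refuses lone surrogates, which is why the escaped
    # form cannot collide with a name that is valid UTF-8.
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        print("strict UTF-8 rejects the escaped name, as expected")
```

With this scheme a shell could keep file names as ordinary strings for display and convert back to the exact bytes only when calling into the operating system. Whether that trade-off suits nu is the open question in this thread.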

Did you try the reproducible way I portrayed in the following comment?

Nope. I’m not motivated enough to install docker or scuba.