nushell: Files with Non-UTF8 characters are simply ignored in `ls`

Describe the bug

Files with Non-UTF8 characters are simply ignored in ls.

How to reproduce

1. Create a file named ''$'\001\b\327''@'$'\310\320\f''@8'. The spelling and quoting here are how Bash displays this file’s name.
2. Run ls in Nu.
3. Only the normal files appear. This file is not displayed at all.
4. Run ls in Bash.
5. The file is shown.
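
The file from the first step can also be created without relying on shell quoting. The following is a minimal Python sketch, assuming a Linux filesystem that accepts arbitrary bytes in file names; the helper name create_and_list is hypothetical, and the byte string is my decoding of the Bash-quoted name above.

```python
import os
import tempfile

# Bytes of the file name from the report, decoded from Bash's $'...' quoting:
# \001 \b \327 @ \310 \320 \f @ 8  (0xD7 followed by '@' is not valid UTF-8)
NAME = b"\x01\x08\xd7@\xc8\xd0\x0c@8"


def create_and_list(directory: bytes) -> list:
    """Create an empty file called NAME in `directory` and list the directory.

    Passing a bytes path makes os.listdir return raw bytes names,
    so no decoding (and no decoding failure) is involved.
    """
    path = os.path.join(directory, NAME)
    with open(path, "wb"):
        pass
    return os.listdir(directory)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        tmp_bytes = os.fsencode(tmp)
        print(create_and_list(tmp_bytes))
        # Remove the file by its raw bytes name before the directory is cleaned up.
        os.remove(os.path.join(tmp_bytes, NAME))
```

Working with bytes paths throughout sidesteps the question of how to display the name, which is the part that differs between Bash and Nu.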

Expected behavior

I can view this file in some form or other. It is not even necessary to display all characters properly; I just want to see the file name as Nu sees it, so I can rm it.

Screenshots

No response

Configuration

key	value
version	0.87.0
branch
commit_hash	77a1c3c7b2f3a110d48bcb792968e6b0d85d4bb7
build_os	linux-x86_64
build_target	x86_64-unknown-linux-gnu
rust_version	rustc 1.71.1 (eb26296b5 2023-08-03)
rust_channel	1.71.1-x86_64-unknown-linux-gnu
cargo_version	cargo 1.71.1 (7f1d04c00 2023-07-29)
build_time	2023-11-14 20:18:44 +00:00
build_rust_channel	release
allocator	mimalloc
features	dataframe, default, extra, sqlite, static-link-openssl, trash, which, zip
installed_plugins

Additional context

I accidentally created this file and saw it in VS Code. Listing it with Nu did not work at all. Then I ran stat on the file, switched to Bash, ran ls there and voilà, there was the file. I removed it in Bash. I wasn’t able to do that with Nu, because I could not even list, i.e. “see”, it, which also meant I wasn’t able to delete it.

About this issue

State: open
Created 7 months ago
Comments: 17 (3 by maintainers)

Most upvoted comments

I could reproduce this using touch "$(echo -ne "\xff\xff")" in Bash. ls in Bash then shows ''$'\377\377', while Nu prints the warning: get non-utf8 filename "/tmp/test/\xFF\xFF", ignored.

Note that Rust uses OsString to represent paths, which can handle invalid UTF-8.

For comparison, os.listdir(".") in Python returns ['\udcff\udcff'], i.e. it represents each invalid byte as a code point in the Unicode surrogate block (byte 0xFF becomes U+DCFF). When you want to open '\udcff\udcff', it just strips the \udc prefix again to recover the original bytes, and this seems to be unambiguous because a surrogate code point does not have a valid UTF-8 representation. So this idea could be an option for Nu if we don’t want to introduce a separate datatype for paths.
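
The Python behavior described here is the surrogateescape error handler. A minimal sketch of the round trip follows; the helper names to_text and to_bytes are hypothetical, and UTF-8 is given explicitly rather than relying on the filesystem encoding that os.fsdecode would use.

```python
def to_text(raw: bytes) -> str:
    """Decode a file name, mapping each undecodable byte 0xNN to U+DCNN."""
    return raw.decode("utf-8", errors="surrogateescape")


def to_bytes(text: str) -> bytes:
    """Reverse to_text: turn each U+DCNN surrogate back into the byte 0xNN."""
    return text.encode("utf-8", errors="surrogateescape")


if __name__ == "__main__":
    raw = b"\xff\xff"              # the name created with touch in the comment above
    text = to_text(raw)
    print(ascii(text))             # the two bytes show up as lone surrogates
    print(to_bytes(text) == raw)   # the original bytes are recovered exactly
```

Valid UTF-8 names pass through unchanged, so only names with invalid bytes carry surrogates. Encoding such a string with the default strict handler raises UnicodeEncodeError, which is why the mapping cannot collide with a real UTF-8 name.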

DonSheddow on Jan 19, 2024

Did you try the reproducible way I portrayed in the following comment?

Nope. I’m not motivated enough to install docker or scuba.

fdncred on Nov 20, 2023