PowerShell: text.encoding Invalid
Steps to reproduce
function UnescapeNonIsoChar($inputString) {
    Try {
        [regex]::replace($inputString, '(?:\\u[0-9a-f]{4})+', {
            param($m)
            $utf8Bytes = (-split ($m.Value -replace '\\u([0-9a-f]{4})', '0x$1 ')).ForEach([byte])
            [text.encoding]::UTF8.GetString($utf8Bytes)
        })
    } Catch {
        [regex]::Unescape($inputString)
    }
}
@'
profile.header.profile=\u00e6\u00aa\u0094\u00e6\u00a1\u0088\u00e5\u0090\u008d\u00e7\u00a8\u00b1
profile.header.customer=\u00e5\u00ae\u00a2\u00e6\u0088\u00b6\u00e5\u0090\u008d\u00e7\u00a8\u00b1
profile.header.account=\u00e5\u00b8\u00b3\u00e8\u0099\u009f/\u00e6\u00a2\u009d\u00e4\u00bb\u00b6\u00e4\u00bb\u00a3\u00e7\u00a2\u00bc
profile.header.description=\u00e6\u008f\u008f\u00e8\u00bf\u00b0
layout.msg.updatePrimaryUsersLayout=Kindly save it as a New Layout as Primary user\u2019s layout cannot be updated.
'@ -split [System.Environment]::NewLine |
    ForEach-Object {
        UnescapeNonIsoChar -inputString $_
    }
Hit Variable breakpoint on '$string' (Write access)
At line:8 char:13
+ $string=[text.encoding]::UTF8.GetString($utf8Bytes)
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[DBG]: PS C:\Users\he123>> [text.encoding]::UTF8.GetString($utf8Bytes)
檔案名稱
[DBG]: PS C:\Users\he123>> c
Hit Variable breakpoint on '$string' (Write access)
At line:8 char:13
+ $string=[text.encoding]::UTF8.GetString($utf8Bytes)
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[DBG]: PS C:\Users\he123>> c
profile.header.profile=æªæ¡å稱
profile.header.customer=客æ¶å稱
profile.header.account=帳è/æ¢ä»¶ä»£ç¢¼
profile.header.description=æè¿°
layout.msg.updatePrimaryUsersLayout=Kindly save it as a New Layout as Primary user’s layout cannot be updated.
Expected behavior
profile.header.profile=檔案名稱
profile.header.customer=客戶名稱
profile.header.account=帳號/條件代碼
profile.header.description=描述
layout.msg.updatePrimaryUsersLayout=Kindly save it as a New Layout as Primary user’s layout cannot be updated.
Actual behavior
profile.header.profile=æªæ¡å稱
profile.header.customer=客æ¶å稱
profile.header.account=帳è/æ¢ä»¶ä»£ç¢¼
profile.header.description=æè¿°
layout.msg.updatePrimaryUsersLayout=Kindly save it as a New Layout as Primary user’s layout cannot be updated.
Error details
æªæ¡å稱
Environment data
Name                           Value
----                           -----
PSVersion                      7.2.5
PSEdition                      Core
GitCommitId                    7.2.5
OS                             Microsoft Windows 10.0.22621
Platform                       Win32NT
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0…}
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1
WSManStackVersion              3.0
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 19 (1 by maintainers)
This is my last try; if you don’t understand what is wrong after this, you are on your own.
Yes, it can be any form, but most languages go for the \uxxxx format, where xxxx is the Unicode codepoint value. Using 檔 as an example again, we can find the Unicode information at https://www.compart.com/en/unicode/U+6A94. It has the following properties:
- Unicode codepoint: U+6a94
- UTF-8 bytes: 0xE6, 0xAA, 0x94
- UTF-16 bytes (big-endian order): 0x6A, 0x94
- UTF-32 bytes (big-endian order): 0x00, 0x00, 0x6A, 0x94
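These byte values can be checked independently. A minimal Python sketch, assuming Python 3.8 or later (for bytes.hex with a separator); the variable names are illustrative and do not come from the original comment:

```python
# Encode the single character U+6A94 in three Unicode encodings.
ch = "\u6a94"

codepoint = f"U+{ord(ch):04X}"
utf8 = ch.encode("utf-8")
utf16_be = ch.encode("utf-16-be")
utf32_be = ch.encode("utf-32-be")

print(codepoint)          # U+6A94
print(utf8.hex(" "))      # e6 aa 94
print(utf16_be.hex(" "))  # 6a 94
print(utf32_be.hex(" "))  # 00 00 6a 94
```

Only the UTF-8 form uses three bytes, and none of those bytes equals the codepoint value itself.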
A few extra things I want to stress:
- U+6a94 looks pretty much like the UTF-16 representation (0x6A94, stored byte-swapped in UTF-16-LE), but that’s just a coincidence.
- Codepoints up to 0x7FFF are encoded in UTF-16 as their codepoint value, which is why they match.
- When I write \uxxxx I am talking about the Unicode codepoint and not the UTF-16-LE byte representation.
What you are doing is taking the UTF-8 byte representation and trying to smuggle it in using escaped Unicode codepoint sequences, so it becomes \u00e6\u00aa\u0094. This is not strictly correct, as this character, when represented by the proper Unicode codepoint, is \u6a94 (because it’s U+6a94). The code you have technically works because you are interpreting \uxxxx as the UTF-8 bytes. Ignoring PowerShell and dotnet, you can see that other languages prove me correct in this statement: to display 檔 I need to use \u6a94 and not \u00e6\u00aa\u0094.
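A minimal Python sketch of this comparison (not the commenter's original snippet; the variable names are illustrative). It assumes only that Latin-1 maps U+0000 to U+00FF one-to-one onto byte values, which is how the three "smuggled" codepoints can be turned back into UTF-8 bytes:

```python
proper = "\u6a94"                # one codepoint: U+6A94
smuggled = "\u00e6\u00aa\u0094"  # three codepoints: U+00E6, U+00AA, U+0094

print(len(proper), len(smuggled))  # 1 3
print(proper == smuggled)          # False

# Reinterpret each codepoint below U+0100 as a single byte, then decode
# the resulting byte sequence as UTF-8.
recovered = smuggled.encode("latin-1").decode("utf-8")
print(recovered == proper)         # True
```

The escaped text only yields the intended character after an extra reinterpretation step that a standard \uxxxx unescape does not perform.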
One thing I do need to mention: U+6a94 is pretty much the equivalent of both of the UTF-16 bytes. This is just how UTF-16 works. Ignore that coincidence for now; UTF-16 was originally designed to map the Unicode codepoint values directly, until it turned out the space wasn’t large enough. Just know that the value of the Unicode codepoint (what starts with U+xxxx) is what we are talking about.
Notice that I am running in Python, which is completely unrelated to PowerShell, and its documentation (https://docs.python.org/3/howto/unicode.html) mentions the \uxxxx format.
PowerShell even has its own standard, using the backtick u escape sequence and the hex value of the Unicode codepoint (not necessarily the UTF-16-LE byte representation).
Notice how yet again only the Unicode codepoint U+6a94 works in the scenario. Trying to use the UTF-8 bytes as the value for each codepoint results in that erroneous string you mentioned.
Hopefully this convinces you why I am saying that using \u00e6\u00aa\u0094 to represent the character 檔 is incorrect. That said, the code you’ve written does correctly unescape the \uxxxx codepoints as UTF-8 bytes, as you can see by just running it against your first line.
Notice how this is the exact same code as what you have and the output is correct. This is even running on Windows to replicate the platform you are on.
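For comparison, the same byte-reinterpreting logic can be sketched in Python. This is a rough port under stated assumptions, not the issue's code: the function name unescape_as_utf8_bytes is hypothetical, and the patterns match lowercase hex only, as in the sample data.

```python
import re

def unescape_as_utf8_bytes(text):
    """Treat each \\uXXXX in a run as one byte and decode the run as UTF-8."""
    def decode_run(match):
        values = [int(h, 16) for h in re.findall(r"\\u([0-9a-f]{4})", match.group(0))]
        # bytes() raises ValueError if any value does not fit in one byte.
        return bytes(values).decode("utf-8")
    return re.sub(r"(?:\\u[0-9a-f]{4})+", decode_run, text)

first_line = (
    r"profile.header.profile="
    r"\u00e6\u00aa\u0094\u00e6\u00a1\u0088\u00e5\u0090\u008d\u00e7\u00a8\u00b1"
)
result = unescape_as_utf8_bytes(first_line)
print(result == "profile.header.profile=\u6a94\u6848\u540d\u7a31")  # True
```

On a line where every escape stays below 0x100, the reinterpretation succeeds and yields the four expected characters.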
Notice what happens when you try and do the first and last line
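A rough Python sketch of this experiment, under stated assumptions: all helper names are hypothetical, the two lines are shortened versions of the first and last input lines, and unescape_as_codepoints only approximates the \uxxxx handling of [regex]::Unescape (the real method processes other escapes as well).

```python
import re

def unescape_as_utf8_bytes(text):
    # The issue's logic: each \uXXXX in a run is taken as one byte.
    def decode_run(match):
        values = [int(h, 16) for h in re.findall(r"\\u([0-9a-f]{4})", match.group(0))]
        return bytes(values).decode("utf-8")  # ValueError if a value > 0xFF
    return re.sub(r"(?:\\u[0-9a-f]{4})+", decode_run, text)

def unescape_as_codepoints(text):
    # Stand-in for the fallback: one escape becomes one codepoint.
    return re.sub(r"\\u([0-9a-fA-F]{4})", lambda m: chr(int(m.group(1), 16)), text)

def unescape_non_iso_char(text):
    # Mirrors the try/catch shape: any failure redoes the WHOLE input.
    try:
        return unescape_as_utf8_bytes(text)
    except ValueError:
        return unescape_as_codepoints(text)

first = r"profile.header.profile=\u00e6\u00aa\u0094"
last = r"user\u2019s layout"

alone = unescape_non_iso_char(first)
together = unescape_non_iso_char(first + "\n" + last)

print(alone.endswith("\u6a94"))          # True: decoded as UTF-8 bytes
print("\u6a94" in together)              # False: the fallback handled everything
print("\u00e6\u00aa\u0094" in together)  # True: three separate codepoints
print("\u2019" in together)              # True: the quotation mark is correct
```

Because the fallback reprocesses the entire input, a single escape above 0xFF changes how every other escape in the same string is interpreted.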
The text that worked before is now failing. This is because the entry \u2019 in the last line causes the [Regex]::Replace call to fail, so the input string falls back to [Regex]::Unescape($inputString). It fails because \u2019 cannot be converted to a single byte.
In fact, what \u2019 represents is the Right Single Quotation Mark. Notice how this codepoint is U+2019 and it matches the value of \u2019 in the text. The presence of this value in the string causes the entire regex replacement to fail, moving it to the [regex]::Unescape($inputString) call in your catch block. This Unescape method unescapes \uxxxx as per the rules I’ve talked about above, which is why you get the different strings.
This is critical to understand: a failure in the [Regex]::Replace method causes it to go through [Regex]::Unescape, which is why you get the wrong output back. When you combine this with the fact that -split [System.Environment]::NewLine isn’t actually splitting your here-string, you can see that the function is being called on the full string rather than line by line. The full string contains the \u2019 entry, which causes the failure with your logic, which in turn causes the \uxxxx entries in the previous lines to be unescaped by their Unicode codepoint value.
To prove that the split doesn’t work, look at the result of this:
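The same effect can be reproduced outside PowerShell. A minimal Python sketch, assuming the here-string holds LF-only line endings while the platform newline is CRLF, as [System.Environment]::NewLine is on Windows; the sample text is a hypothetical stand-in for the real input:

```python
import re

# Hypothetical stand-in for the here-string, with LF-only line endings.
text = "line one\nline two\nline three"
windows_newline = "\r\n"  # assumed value of the platform newline on Windows

by_platform_newline = text.split(windows_newline)
by_regex = re.split(r"\r?\n", text)

print(len(by_platform_newline))  # 1: no CRLF in the text, so nothing is split
print(len(by_regex))             # 3: matches LF with an optional preceding CR
```

A pattern with an optional carriage return handles both line-ending styles, so the result does not depend on how the text was entered.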
Notice how when -split [System.Environment]::NewLine was used the function was only called once (Lines process: 1), whereas when -split "\r?\n" was used it was passed in line by line. If you adjust your code so it does -split "\r?\n", you will notice it works.
The reason why goes back to all my points above. Each line is now being processed one by one (UnescapeNonIsoChar is called per line rather than on the whole thing). Each line is converted using your regex replace logic of treating \uxxxx as UTF-8 bytes. Even the last line, where \u2019 fails to be replaced with your logic, falls back to [Regex]::Unescape, which converts \u2019 using the Unicode codepoint logic. This gives you exactly what you want. The only caveat is a line that contains both kinds of \uxxxx, the UTF-8 byte representation and the Unicode codepoint. This is why I am recommending that you fix up whatever is generating the text to do it properly.
If you don’t believe me after this, I’m not sure what else I can do to convince you. PowerShell is acting sanely here. There is no bug, as other languages treat \uxxxx in the same manner as [Regex]::Unescape and as I am describing. Good luck with your work.
Please, if you can’t do it or don’t understand it, don’t talk nonsense.