PowerShell: text.encoding Invalid

Prerequisites

Steps to reproduce

function UnescapeNonIsoChar($inputString) {
    Try {
        [regex]::replace($inputString, '(?:\\u[0-9a-f]{4})+', { 
            param($m) 
            $utf8Bytes = (-split ($m.Value -replace '\\u([0-9a-f]{4})', '0x$1 ')).ForEach([byte])
            [text.encoding]::UTF8.GetString($utf8Bytes)
            
        })
    } Catch {
        [regex]::Unescape($inputString)
    }
}

@'
profile.header.profile=\u00e6\u00aa\u0094\u00e6\u00a1\u0088\u00e5\u0090\u008d\u00e7\u00a8\u00b1
profile.header.customer=\u00e5\u00ae\u00a2\u00e6\u0088\u00b6\u00e5\u0090\u008d\u00e7\u00a8\u00b1
profile.header.account=\u00e5\u00b8\u00b3\u00e8\u0099\u009f/\u00e6\u00a2\u009d\u00e4\u00bb\u00b6\u00e4\u00bb\u00a3\u00e7\u00a2\u00bc
profile.header.description=\u00e6\u008f\u008f\u00e8\u00bf\u00b0
layout.msg.updatePrimaryUsersLayout=Kindly save it as a New Layout as Primary user\u2019s layout cannot be updated.
'@ -split [System.Environment]::NewLine |
    ForEach-Object {
        UnescapeNonIsoChar -inputString $_
    }
Hit Variable breakpoint on '$string' (Write access)

At line:8 char:13
+             $string=[text.encoding]::UTF8.GetString($utf8Bytes)
+             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[DBG]: PS C:\Users\he123>> [text.encoding]::UTF8.GetString($utf8Bytes)
檔案名稱
[DBG]: PS C:\Users\he123>> c
Hit Variable breakpoint on '$string' (Write access)

At line:8 char:13
+             $string=[text.encoding]::UTF8.GetString($utf8Bytes)
+             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[DBG]: PS C:\Users\he123>> c
profile.header.profile=æªæ¡å稱
profile.header.customer=客æ¶å稱
profile.header.account=帳è/æ¢ä»¶ä»£ç¢¼
profile.header.description=æè¿°
layout.msg.updatePrimaryUsersLayout=Kindly save it as a New Layout as Primary user’s layout cannot be updated.

Expected behavior

profile.header.profile=檔案名稱
profile.header.customer=客戶名稱
profile.header.account=帳號/條件代碼
profile.header.description=描述
layout.msg.updatePrimaryUsersLayout=Kindly save it as a New Layout as Primary user’s layout cannot be updated.

Actual behavior

profile.header.profile=æªæ¡å稱
profile.header.customer=客æ¶å稱
profile.header.account=帳è/æ¢ä»¶ä»£ç¢¼
profile.header.description=æè¿°
layout.msg.updatePrimaryUsersLayout=Kindly save it as a New Layout as Primary user’s layout cannot be updated.

Error details

æªæ¡å稱

Environment data

Name                           Value
----                           -----
PSVersion                      7.2.5
PSEdition                      Core
GitCommitId                    7.2.5
OS                             Microsoft Windows 10.0.22621
Platform                       Win32NT
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0…}
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1
WSManStackVersion              3.0

Visuals

No response

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 19 (1 by maintainers)

Most upvoted comments

This is my last try, if you don’t understand what is wrong after this you are on your own.

Yes, it can take any form, but most languages use the \uxxxx format, where xxxx is the Unicode codepoint value. Using 檔 as an example again, we can find its Unicode information at https://www.compart.com/en/unicode/U+6A94. It has the following properties:

  • The Unicode codepoint is U+6a94
  • The UTF-8 byte representation is 0xE6, 0xAA, 0x94
  • The UTF-16-LE byte representation is 0x94, 0x6A (0x6A, 0x94 in big-endian order)
  • The UTF-32-LE byte representation is 0x94, 0x6A, 0x00, 0x00 (0x00, 0x00, 0x6A, 0x94 in big-endian order)
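
The properties listed above can be cross-checked in any language with a Unicode-aware string type. A minimal sketch in Python, used here only as a neutral checker; the variable names are illustrative and not from the issue:

```python
# Cross-check the representations of U+6A94 in several encodings.
ch = "\u6a94"  # escaped by Unicode codepoint

code_point = f"U+{ord(ch):04X}"
utf8 = ch.encode("utf-8").hex(" ")
utf16_le = ch.encode("utf-16-le").hex(" ")  # little-endian: low byte first
utf16_be = ch.encode("utf-16-be").hex(" ")  # big-endian: high byte first
utf32_le = ch.encode("utf-32-le").hex(" ")

print(code_point)  # U+6A94
print(utf8)        # e6 aa 94
print(utf16_le)    # 94 6a
print(utf16_be)    # 6a 94
print(utf32_le)    # 94 6a 00 00
```

Only the big-endian UTF-16 bytes read the same as the codepoint digits; the little-endian forms store the low byte first.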

A few extra things I want to stress

  • Unicode != UTF-16-LE
    • Microsoft products sometimes lump them together, but they are different things
    • Unicode is the standard
    • UTF-16 is an encoding scheme of how Unicode characters are encoded to bytes
  • The Unicode codepoint U+6a94 has the same value as its UTF-16 code unit (0x6A94), but don't read anything into that
    • By design, UTF-16 encodes every character in the range U+0000 to U+FFFF (outside the surrogate block U+D800 to U+DFFF) as a single 16-bit unit equal to its codepoint, which is why they match; characters above U+FFFF need a surrogate pair
    • You don’t really need to know this fact, just that when I talk about \uxxxx I am talking about the Unicode codepoint and not the UTF-16-LE byte representation
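
To make the distinction between a codepoint and its UTF-16 encoding concrete, here is a small Python sketch. The helper `utf16_code_units` is a hypothetical name introduced for illustration, and U+1F600 is just an arbitrary character above U+FFFF:

```python
def utf16_code_units(text):
    # Encode as big-endian UTF-16 and read the bytes back two at a time,
    # giving the 16-bit code units as integers.
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

bmp = utf16_code_units("\u6a94")         # a character below U+FFFF
astral = utf16_code_units("\U0001F600")  # a character above U+FFFF

print([hex(u) for u in bmp])     # ['0x6a94']
print([hex(u) for u in astral])  # ['0xd83d', '0xde00']
```

For U+6A94 the single code unit equals the codepoint. For U+1F600 the two code units look nothing like the codepoint, which is why \uxxxx should always be read as a codepoint and not as encoded bytes.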

What you are doing is taking the UTF-8 byte representation and smuggling it in as escaped Unicode codepoint sequences, so it becomes \u00e6\u00aa\u0094. This is not strictly correct: the proper escape for this character is \u6a94 (because it is U+6a94). The code you have technically works only because it interprets each \uxxxx as a UTF-8 byte. Ignoring PowerShell and dotnet, other languages show the same thing: to display 檔 I need to use \u6a94 and not \u00e6\u00aa\u0094.
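
A minimal Python sketch of that difference. The variable names are illustrative, and the Latin-1 round trip is my own way of showing what the "smuggled" form really contains; it is not taken from the issue:

```python
proper = "\u6a94"                # one codepoint: U+6A94
smuggled = "\u00e6\u00aa\u0094"  # three codepoints: U+00E6, U+00AA, U+0094

print(len(proper), len(smuggled))  # 1 3
print(proper == smuggled)          # False

# The smuggled form only turns into the intended character if each codepoint
# is first squeezed back into a single byte and the bytes are decoded as UTF-8.
recovered = smuggled.encode("latin-1").decode("utf-8")
print(recovered == proper)         # True
```

Any consumer that follows the usual \uxxxx rules sees three separate characters, which is the garbled output reported above.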

One thing I do need to mention: the hex digits of U+6a94 are the same as the two bytes of its UTF-16 encoding (0x6A and 0x94). This is just how UTF-16 works for characters in this range. Ignore that coincidence for now; 16 bits were originally meant to hold every Unicode codepoint directly, until it turned out the space wasn’t large enough. Just know that the value of the Unicode codepoint (what starts with U+xxxx) is what we are talking about.

Notice that I am running this in Python, which is completely unrelated to PowerShell, and its documentation describes the same \uxxxx format: https://docs.python.org/3/howto/unicode.html

PowerShell even has its own standard: the backtick-u escape sequence, which takes the hex value of the Unicode codepoint (not necessarily the UTF-16-LE byte representation).

"`u{xxxx}"

Notice how yet again only the Unicode codepoint U+6a94 works in the scenario. Trying to use the UTF-8 bytes as the value for each codepoint results in that erroneous string you mentioned.

Hopefully this convinces you why I am saying that using \u00e6\u00aa\u0094 to represent the character 檔 is incorrect. That said, the code you’ve written does correctly unescape the \uxxxx sequences as UTF-8 bytes, as you can see by running it against just your first line:

function UnescapeNonIsoChar($inputString) {
    Try {
        [regex]::replace($inputString, '(?:\\u[0-9a-f]{4})+', { 
            param($m) 
            $utf8Bytes = (-split ($m.Value -replace '\\u([0-9a-f]{4})', '0x$1 ')).ForEach([byte])
            [text.encoding]::UTF8.GetString($utf8Bytes)
            
        })
    } Catch {
        [regex]::Unescape($inputString)
    }
}

UnescapeNonIsoChar -inputString 'profile.header.profile=\u00e6\u00aa\u0094\u00e6\u00a1\u0088\u00e5\u0090\u008d\u00e7\u00a8\u00b1'

Notice how this is the exact same code as what you have and the output is correct. This is even running on Windows to replicate the platform you are on.

Notice what happens when you pass the first and last lines together as one string:

function UnescapeNonIsoChar($inputString) {
    Try {
        [regex]::replace($inputString, '(?:\\u[0-9a-f]{4})+', { 
            param($m) 
            $utf8Bytes = (-split ($m.Value -replace '\\u([0-9a-f]{4})', '0x$1 ')).ForEach([byte])
            [text.encoding]::UTF8.GetString($utf8Bytes)
            
        })
    } Catch {
        [regex]::Unescape($inputString)
    }
}

UnescapeNonIsoChar -inputString @'
profile.header.profile=\u00e6\u00aa\u0094\u00e6\u00a1\u0088\u00e5\u0090\u008d\u00e7\u00a8\u00b1
layout.msg.updatePrimaryUsersLayout=Kindly save it as a New Layout as Primary user\u2019s layout cannot be updated.
'@

The text that worked before is now failing. The \u2019 entry in the last line makes the [Regex]::Replace call throw, so the function falls back to calling [Regex]::Unescape($inputString) on the whole input. It throws because 0x2019 cannot be converted to a single byte.

In fact, \u2019 represents the Right Single Quotation Mark. Notice how its codepoint is U+2019, which matches the \u2019 in the text. The presence of this value in the string causes the entire regex replacement to fail, sending execution to the [regex]::Unescape($inputString) call in your catch block. The Unescape method unescapes \uxxxx as per the rules I’ve talked about above, which is why you get the different strings.

This is critical to understand: a failure in the [Regex]::Replace method sends the whole string through [Regex]::Unescape, which is why you get the wrong output back. Combine this with the fact that -split [System.Environment]::NewLine isn’t actually splitting your here-string, and you can see that the function is being called once on the full string rather than line by line. The full string contains the \u2019 entry, which makes your logic fail, so the \uxxxx entries in the previous lines are unescaped by their Unicode codepoint value instead.
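
The same failure mode can be reproduced outside PowerShell. The following is a loose Python port of the function's logic, not the original code: the names `unescape_non_iso` and `unescape_code_points` are mine, and the codepoint fallback only approximates what [Regex]::Unescape does for \uxxxx sequences.

```python
import re

RUN = re.compile(r"(?:\\u[0-9a-fA-F]{4})+")  # one or more consecutive \uXXXX
ONE = re.compile(r"\\u([0-9a-fA-F]{4})")     # a single \uXXXX

def unescape_code_points(text):
    # Fallback: treat every \uXXXX as a Unicode codepoint.
    return ONE.sub(lambda m: chr(int(m.group(1), 16)), text)

def unescape_non_iso(text):
    # Primary path: treat every \uXXXX as ONE UTF-8 byte.
    def decode_run(match):
        values = [int(h, 16) for h in ONE.findall(match.group(0))]
        # bytes() raises ValueError for any value above 255, e.g. 0x2019.
        return bytes(values).decode("utf-8", errors="replace")
    try:
        return RUN.sub(decode_run, text)
    except ValueError:
        # One bad escape anywhere sends the WHOLE input down the fallback.
        return unescape_code_points(text)

first = r"profile.header.profile=\u00e6\u00aa\u0094\u00e6\u00a1\u0088\u00e5\u0090\u008d\u00e7\u00a8\u00b1"
last = r"layout.msg.updatePrimaryUsersLayout=Kindly save it as a New Layout as Primary user\u2019s layout cannot be updated."

alone = unescape_non_iso(first)
together = unescape_non_iso(first + "\n" + last)

print(alone)                # profile.header.profile=檔案名稱
print("\u6a94" in together) # False: the first line went through the fallback too
```

Called on the first line alone, the byte interpretation succeeds. Called on both lines as one string, the single \u2019 forces the fallback for everything, so the first line comes back as the Latin-1-looking characters.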

To prove that the split doesn’t work, look at the result of this:

Function Test-PerLine {
    [CmdletBinding()]
    param (
        [Parameter(ValueFromPipeline)]
        [string]
        $InputObject
    )
    
    begin { $i = 0 }
    process {
        $i++
        "Line: $InputObject"
    }
    end {
        "Lines processed; $i"
    }
}

@'
line 1
line 2
'@ -split [System.Environment]::NewLine | Test-PerLine

@'
line 1
line 2
'@ -split "\r?\n" | Test-PerLine

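Notice how when -split [System.Environment]::NewLine was used the function was only called once (Lines processed; 1), whereas when -split "\r?\n" was used the input was passed in line by line. A likely reason is that the here-string contains bare LF line endings while [System.Environment]::NewLine is CRLF on Windows, so there is nothing to match. If you adjust your code to use -split "\r?\n", you will notice it works: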
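
The splitting behavior is easy to reproduce in isolation. A minimal Python sketch, under the assumption that the text holds bare LF line endings (the thread does not show the actual bytes of the here-string):

```python
import re

text = "line 1\nline 2"  # assumption: LF only, no carriage return

# Splitting on the literal Windows newline finds nothing to split on.
by_crlf = text.split("\r\n")

# Splitting on the pattern \r?\n accepts both LF and CRLF.
by_pattern = re.split(r"\r?\n", text)
crlf_by_pattern = re.split(r"\r?\n", "line 1\r\nline 2")

print(len(by_crlf))     # 1
print(len(by_pattern))  # 2
print(crlf_by_pattern)  # ['line 1', 'line 2']
```

The pattern form works regardless of how the text was saved or pasted, which is why it is the safer choice for line-by-line processing.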

function UnescapeNonIsoChar($inputString) {
    Try {
        [regex]::replace($inputString, '(?:\\u[0-9a-f]{4})+', { 
            param($m) 
            $utf8Bytes = (-split ($m.Value -replace '\\u([0-9a-f]{4})', '0x$1 ')).ForEach([byte])
            [text.encoding]::UTF8.GetString($utf8Bytes)
            
        })
    } Catch {
        [regex]::Unescape($inputString)
    }
}

@'
profile.header.profile=\u00e6\u00aa\u0094\u00e6\u00a1\u0088\u00e5\u0090\u008d\u00e7\u00a8\u00b1
profile.header.customer=\u00e5\u00ae\u00a2\u00e6\u0088\u00b6\u00e5\u0090\u008d\u00e7\u00a8\u00b1
profile.header.account=\u00e5\u00b8\u00b3\u00e8\u0099\u009f/\u00e6\u00a2\u009d\u00e4\u00bb\u00b6\u00e4\u00bb\u00a3\u00e7\u00a2\u00bc
profile.header.description=\u00e6\u008f\u008f\u00e8\u00bf\u00b0
layout.msg.updatePrimaryUsersLayout=Kindly save it as a New Layout as Primary user\u2019s layout cannot be updated.
'@ -split "\r?\n" |
    ForEach-Object {
        UnescapeNonIsoChar -inputString $_
    }

The reason follows from all my points above. Each line is now processed one by one (UnescapeNonIsoChar is called per line rather than on the whole thing). Each line is converted using your regex replace logic of treating \uxxxx as UTF-8 bytes. The last line, where \u2019 fails your logic, falls back to [Regex]::Unescape, which converts \u2019 using the Unicode codepoint logic. This gives you exactly what you want. The only caveat is a line that contains both conventions, \uxxxx as a UTF-8 byte and \uxxxx as a Unicode codepoint. This is why I am recommending that you fix whatever is generating the text so that it does it properly.
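
To illustrate the mixed-line caveat, here is a hypothetical Python variation that makes the fallback decision per run of escapes instead of per string. This is not what the code in the issue does, and the function names are mine. It is a workaround sketch only and remains ambiguous by nature, since some byte-looking runs are also valid codepoint sequences.

```python
import re

RUN = re.compile(r"(?:\\u[0-9a-fA-F]{4})+")  # one or more consecutive \uXXXX
ONE = re.compile(r"\\u([0-9a-fA-F]{4})")     # a single \uXXXX

def decode_run(match):
    values = [int(h, 16) for h in ONE.findall(match.group(0))]
    try:
        # Strict decoding: a value above 255 or an invalid UTF-8 sequence
        # raises ValueError (UnicodeDecodeError is a subclass of it).
        return bytes(values).decode("utf-8")
    except ValueError:
        # Only THIS run falls back to the codepoint interpretation.
        return "".join(chr(v) for v in values)

def unescape_mixed(text):
    return RUN.sub(decode_run, text)

line = r"name=\u00e6\u00aa\u0094 user\u2019s"
result = unescape_mixed(line)
print(result)  # name=檔 user’s
```

A run that mixes both conventions back to back, with no other character between them, would still fall back as a whole, so fixing the generator remains the reliable answer.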

If you don’t believe me after this I’m not sure what else I can do to convince you. PowerShell is acting sane here. There is no bug here as other languages treat \uxxxx in the same manner as [Regex]::Unescape and what I am telling you. Good luck with your work.

When it is saved as a PS1 file, it works properly. It always uses iso-8859-1 no matter how you tweak it.

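Try saving the ps1 file with a BOM; the encoding itself can be anything. If a ps1 file contains Chinese, Japanese, Korean, or similar characters, it must be saved with a BOM, in any encoding. UTF-8 with a BOM works, and so does UTF-16-LE with a BOM. BOM, BOM: with a BOM there is no garbled text. This also supports Chinese variable names, Korean variable names, parameter names, parameter values, and so on.

Note: your code only needs to be saved with a BOM. It then runs without any problem in PowerShell 5.1 and PowerShell 7.3 preview; I have tested this.

\u00e6\u00aa\u0094\u00e6\u00a1\u0088\u00e5\u0090\u008d\u00e7 ---- to me, things like this are trouble waiting to happen. Take a string in any encoding, convert it to Base64, and pass that along instead, and you avoid the trouble: no encoding conversion and no unrecognized encodings.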
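
The Base64 suggestion above can be sketched as follows in Python. It assumes both ends agree on UTF-8 as the byte encoding, which the comment does not specify, and the variable names are illustrative:

```python
import base64

original = "\u6a94\u6848\u540d\u7a31"  # the four characters 檔案名稱

# Sender: text -> UTF-8 bytes -> ASCII-only Base64 token.
token = base64.b64encode(original.encode("utf-8")).decode("ascii")

# Receiver: Base64 token -> UTF-8 bytes -> text.
restored = base64.b64decode(token).decode("utf-8")

print(token.isascii())       # True
print(restored == original)  # True
```

The token contains only ASCII characters, so it survives any transport that would otherwise mangle non-ASCII text, at the cost of no longer being human-readable.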

Please, if you don't know how or don't understand, don't talk nonsense.