godot: Unicode problem.

Godot version: 3.1 Stable (tested with latest master also)

OS/device including version: Windows 10

Issue description: I am receiving messages through sockets that contain Unicode characters, in my case Greek characters.

  • StreamPeer get_utf8_string gives me Unicode error: invalid skip.

  • PoolByteArray get_string_from_utf8 gives me Unicode error: invalid skip.

  • PoolByteArray get_string_from_ascii gives me the full message, but all Unicode characters are converted to question marks.

I’m a bit frustrated with this one. I searched through the issues but could not find a solution.

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 21 (12 by maintainers)

Most upvoted comments

@zaksnet how do you send the string? Can you provide a minimal reproduction project? How do you detect the end tag?

The string is being sent from another application (whose source code I do not currently have). I am doing:

const ENDTAG = "{end}"
	if is_connected_to_host:
		var bytes = client.get_available_bytes()
		if bytes > 0:
			var message: String = client.get_string(bytes)
			var messages = message.split(ENDTAG)

If the last piece of a split does not end with ENDTAG, I add it to the next message.

  1. I don’t think this is network related. It just happens that the data I get comes from the network (I think the same thing would happen if I loaded it from a file).
  2. I am guessing get_string decodes the message as ASCII, since I get the same result as with get_string_from_ascii. That is, the whole string is fine (I’m pretty sure there is no partial data) except for the Unicode parts, which are shown as question marks.

In conclusion, I think this is a Unicode decoding problem and not network related.
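One way to check the decoding step on its own, with no socket involved, is a round trip on bytes created locally. This is only a minimal sketch for Godot 3.x; the sample text is made up for illustration. If this round trip works but the received bytes still fail to decode, that would suggest the received bytes are not valid UTF-8 in the first place.

```gdscript
extends Node

func _ready():
	# Hypothetical sample text; any Greek string would do.
	var text := "αβγδ"
	# String.to_utf8() returns the UTF-8 bytes as a PoolByteArray.
	var bytes: PoolByteArray = text.to_utf8()
	# Each of these Greek letters takes two bytes in UTF-8.
	print(text.length(), " characters, ", bytes.size(), " bytes")
	# Decoding the same bytes as UTF-8 should give back the original text.
	print(bytes.get_string_from_utf8() == text)
	# Decoding as ASCII treats every byte as one character,
	# so the Greek letters are not recovered.
	print(bytes.get_string_from_ascii() == text)
```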

I actually think UTF-16 is unlikely; I was asking just in case. It may also be a non-Unicode Greek encoding, but I’m assuming you know it’s some kind of Unicode since that’s what the issue title says. If you’re not sure, then I think a non-Unicode encoding is more likely than UTF-16, but I’m not really that familiar with this.

Also you shouldn’t rely on a string after get_available_bytes to be exactly one message. That function just tells you how much data it has received so far, and it’s possible that it has received more or less than one full message. If you read that number of bytes you might not have a full message, or you might have a full message and the beginning of another message. Unicode characters can be multiple bytes, so the partial data might even be split in the middle of a character.
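To illustrate the last point, here is a small sketch (Godot 3.x, sample text made up) that simulates two reads where the boundary falls inside a two-byte character. Joining the raw bytes first and decoding once is the safe order.

```gdscript
extends Node

func _ready():
	# "α" and "β" are each encoded as two bytes in UTF-8, so this is 4 bytes.
	var bytes: PoolByteArray = "αβ".to_utf8()
	# Simulate two reads that split the data after 3 of its 4 bytes.
	# subarray() takes inclusive start and end indices.
	var first_read: PoolByteArray = bytes.subarray(0, 2)   # "α" plus half of "β"
	var second_read: PoolByteArray = bytes.subarray(3, 3)  # the other half of "β"
	# Decoding first_read alone would hit an incomplete character.
	# Concatenate the raw bytes instead, then decode the whole thing.
	var joined := PoolByteArray()
	joined.append_array(first_read)
	joined.append_array(second_read)
	print(joined.get_string_from_utf8() == "αβ")
```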

To handle null-terminated strings you would need to:

  • Keep a PoolByteArray to hold incomplete strings
  • When you read data from the client, use get_data instead of get_utf8_string, because the data can end in the middle of a Unicode character, and then you would have an invalid character and lose that data.
  • If there’s a null character in the data, then add everything before the null character to the PoolByteArray and then turn that whole thing into a string, and then clear the buffer. Put everything after the null character into the cleared PoolByteArray since that is the start of the next string.
  • If there is no null character, just add the whole thing to the buffer because you haven’t completed a string yet.
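The steps above could look roughly like this in Godot 3.x. Treat it as a sketch, not a tested implementation: feed, poll and _pending are hypothetical names, and it assumes the sender really uses UTF-8 with a single 0 byte as the terminator.

```gdscript
extends Node

# Bytes received so far that do not yet form a complete string.
var _pending := PoolByteArray()

# Hypothetical helper: pass in each chunk of raw bytes as it arrives.
# Returns the strings completed by this chunk (possibly none).
func feed(chunk: PoolByteArray) -> Array:
	var completed := []
	for i in range(chunk.size()):
		var b = chunk[i]
		if b == 0:
			# Null terminator: decode everything buffered so far, then clear.
			completed.append(_pending.get_string_from_utf8())
			_pending = PoolByteArray()
		else:
			# Not finished yet: keep the byte for later.
			_pending.append(b)
	return completed

# Hypothetical polling function: reads raw bytes, never decoded text.
func poll(client: StreamPeerTCP) -> Array:
	var available := client.get_available_bytes()
	if available <= 0:
		return []
	# get_data() returns [error, PoolByteArray].
	var result: Array = client.get_data(available)
	if result[0] != OK:
		return []
	return feed(result[1])
```

Appending byte by byte keeps the sketch short; for large messages it would probably be worth copying whole ranges with subarray and append_array instead.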

Sorry, my fault: all messages end with an end tag ({end} in my case); they are not null terminated. Of course, everything you said applies in this case too. I’m just mentioning it because I have already built a mechanism to detect when the full message has been received.
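For an end tag like {end}, the same buffering idea can be applied to the raw bytes before any decoding. The sketch below is an assumption about how that might look in Godot 3.x, not the mechanism described above: find_bytes, feed and _pending are hypothetical names, and the byte search is written by hand. Searching the bytes is safe here because the tag is plain ASCII, and ASCII byte values never occur inside a multi-byte UTF-8 character.

```gdscript
extends Node

const ENDTAG = "{end}"

# Bytes received so far that do not yet contain a complete message.
var _pending := PoolByteArray()

# Hypothetical helper: index where `tag` starts in `data`, or -1 if absent.
func find_bytes(data: PoolByteArray, tag: PoolByteArray) -> int:
	var last_start := data.size() - tag.size()
	for start in range(last_start + 1):
		var matched := true
		for j in range(tag.size()):
			if data[start + j] != tag[j]:
				matched = false
				break
		if matched:
			return start
	return -1

# Hypothetical helper: add a raw chunk, return every completed message.
func feed(chunk: PoolByteArray) -> Array:
	var tag: PoolByteArray = ENDTAG.to_utf8()
	var completed := []
	_pending.append_array(chunk)
	var at := find_bytes(_pending, tag)
	while at != -1:
		if at > 0:
			# Decode only the bytes before the tag (subarray is inclusive).
			completed.append(_pending.subarray(0, at - 1).get_string_from_utf8())
		else:
			completed.append("")
		# Keep whatever follows the tag as the start of the next message.
		var rest_start := at + tag.size()
		if rest_start < _pending.size():
			_pending = _pending.subarray(rest_start, _pending.size() - 1)
		else:
			_pending = PoolByteArray()
		at = find_bytes(_pending, tag)
	return completed
```

The chunks passed to feed would come from get_data rather than get_string, so that nothing is decoded until a whole message is available.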