telego: Bad Request: entity begins in a middle of a UTF-16 symbol at byte offset ...

💬 Telego version

0.22.0

👾 Issue description

entitys := []tu.MessageEntityCollection{}
entitys = append(entitys, tu.Entity("🌗"))
bot.SendMessage(tu.MessageWithEntities(tu.ID(ctm.Chat.ID), entitys...).WithReplyToMessageID(ctm.MessageID))
//[Thu Apr 13 13:07:41 MSK 2023] DEBUG API call to: "https://api.telegram.org/botBOT_TOKEN/sendMessage", with data: {"chat_id":1,"text":"🌗Last quarter","entities":[{"type":"code","offset":1,"length":12}],"reply_to_message_id":1}
//[Thu Apr 13 13:07:41 MSK 2023] DEBUG API response sendMessage: Ok: false, Err: [400 "Bad Request: entity begins in a middle of a UTF-16 symbol at byte offset 4"]

Please look: https://core.telegram.org/bots/api#messageentity

MessageEntity
…
offset Integer Offset in UTF-16 code units to the start of the entity
length Integer Length of the entity in UTF-16 code units
…

https://core.telegram.org/api/entities#entity-length

Computing entity length

Code points in the BMP (U+0000 to U+FFFF) count as 1, because they are encoded into a single UTF-16 code unit
Code points in all other planes count as 2, because they are encoded into two UTF-16 code units (also called surrogate pairs)
…
However, since UTF-8 encodes codepoints in non-BMP planes as a 32-bit code unit starting with 0b11110, a more efficient way to compute the entity length without converting the message to UTF-16 is the following:

If the byte marks the beginning of a 32-bit UTF-8 code unit (all bytes starting with 0b11110) increment the count by 2, otherwise
If the byte marks the beginning of a UTF-8 code unit (all bytes not starting with 0b10) increment the count by 1.

Example:

length := 0
for byte in text {
    if (byte & 0xc0) != 0x80 {
        length += 1 + (byte >= 0xf0)
    }
}

Note: the length of an entity must not include the length of trailing newlines or whitespaces, rtrim entities before computing their length: however, the next offset must include the length of newlines or whitespaces that precede it.

⚡️ Expected behavior

{code 2 12 <nil> }

🧐 Code example

`
entitys := []tu.MessageEntityCollection{}
entitys = append(entitys, tu.Entity("🌗"))
t, es := tu.MessageEntities(entitys...)
stdo.Printf("%U %v %v\n", []rune(t)[0], []byte(t), es) //15:46:48 main.go:471: U+1F317 [240 159 140 151] []
entitys = append(entitys, tu.Entity("Last quarter").Code())
t, es = tu.MessageEntities(entitys...)
stdo.Printf("%s %v\n", t, es) //15:46:48 main.go:474: 🌗Last quarter [{code 1 12  <nil>  }]
bot.SendMessage(tu.MessageWithEntities(tu.ID(ctm.Chat.ID), entitys...).WithReplyToMessageID(ctm.MessageID))
//[Thu Apr 13 13:07:41 MSK 2023] DEBUG API call to: "https://api.telegram.org/botBOT_TOKEN/sendMessage", with data: {"chat_id":1,"text":"🌗Last quarter","entities":[{"type":"code","offset":1,"length":12}],"reply_to_message_id":1}
//[Thu Apr 13 13:07:41 MSK 2023] DEBUG API response sendMessage: Ok: false, Err: [400 "Bad Request: entity begins in a middle of a UTF-16 symbol at byte offset 4"]
`

About this issue

State: closed
Created a year ago
Comments: 21 (21 by maintainers)

Most upvoted comments

We will deal with troubles as they come, or better yet, not worry about them at all and just move on. © Mikhail Zhvanetsky \8^)

abakum on May 4, 2023

If you know a way to count offsets and lengths correctly while still being able to send every message that can be sent right now, please either explain it to me or open a pull request. It will be very welcome.

Telegram is constantly changing and breaks things from one update to the next. If they decide to change the behavior, sure, why not, I will just update Telego as usual. But breaking working functionality in Telego right now to “properly count offsets”, even though it doesn’t matter, for what? To know that it counts correctly, while some messages can no longer be sent as expected?

mymmrac on May 4, 2023

When using bots, you must specify ParseMode for any formatting to work. By default no formatting is applied, so *test* will be sent literally as *test*.

mymmrac on May 4, 2023

"Telegram’s clients do not treat message entities the same as bots"

IMHO the specification is more important than the implementation, since implementation errors are possible: the commander says "Do as I do!", while the commissar says "Do as I say!" \8^)

abakum on May 4, 2023

I investigated a lot and here are some of my findings:

Telegram’s clients do not treat message entities the same as bots.
If, for example, you send a* test *_b_ through a bot, the entities will be: [{offset: 2, length: 5}, {offset: 7, length: 1}], which really doesn’t make sense, because according to the docs * test * should be at offset 1, not 2. With offset 2 it should have length 4, since trailing spaces are trimmed, but it has length 5.
I will not make entity handling match Telegram 100% one to one anyway, because it’s much more complex than it seems (Telegram actually combines identical entities into one and does more with them). The goal is just to get it working in every case, which does not necessarily mean working exactly as Telegram does, as long as the result is the same.

Most importantly, from now on, any issue about entity length, entity count, or anything else related to entities that does not break something or prevent you from achieving a certain result will not be worked on. If you are not able to send something properly, yes, it’s a bug. But if the message is sent as expected and you were expecting a different result from Telegram than what Telego produced, that is how things will work; it’s not a bug.

mymmrac on May 3, 2023

Only the right-hand (trailing) whitespace must be trimmed: rtrim.

abakum on May 1, 2023

Trimmed too much; only left spaces should be trimmed.

mymmrac on May 1, 2023

This issue is partially fixed by #95, but the length of trailing newlines or whitespace is still not counted correctly. I will fix that soon (hopefully).

Btw, thanks a lot for reporting this issue.

mymmrac on Apr 14, 2023