telego: Bad Request: entity begins in a middle of a UTF-16 symbol at byte offset ...
💬 Telego version
0.22.0
👾 Issue description
entitys := []tu.MessageEntityCollection{}
entitys = append(entitys, tu.Entity("🌗"))
bot.SendMessage(tu.MessageWithEntities(tu.ID(ctm.Chat.ID), entitys...).WithReplyToMessageID(ctm.MessageID))
//[Thu Apr 13 13:07:41 MSK 2023] DEBUG API call to: "https://api.telegram.org/botBOT_TOKEN/sendMessage", with data: {"chat_id":1,"text":"🌗Last quarter","entities":[{"type":"code","offset":1,"length":12}],"reply_to_message_id":1}
//[Thu Apr 13 13:07:41 MSK 2023] DEBUG API response sendMessage: Ok: false, Err: [400 "Bad Request: entity begins in a middle of a UTF-16 symbol at byte offset 4"]
Please look:
https://core.telegram.org/bots/api#messageentity
MessageEntity
…
offset Integer Offset in UTF-16 code units to the start of the entity
length Integer Length of the entity in UTF-16 code units
…
https://core.telegram.org/api/entities#entity-length
Computing entity length
Code points in the BMP (U+0000 to U+FFFF) count as 1, because they are encoded into a single UTF-16 code unit
Code points in all other planes count as 2, because they are encoded into two UTF-16 code units (also called surrogate pairs)
…
However, since UTF-8 encodes codepoints in non-BMP planes as a 32-bit code unit starting with 0b11110, a more efficient way to compute the entity length without converting the message to UTF-16 is the following:
If the byte marks the beginning of a 32-bit UTF-8 code unit (all bytes starting with 0b11110) increment the count by 2, otherwise If the byte marks the beginning of a UTF-8 code unit (all bytes not starting with 0b10) increment the count by 1. Example:
length := 0
for byte in text {
    if (byte & 0xc0) != 0x80 {
        length += 1 + (byte >= 0xf0)
    }
}
Note: the length of an entity must not include the length of trailing newlines or whitespaces, rtrim entities before computing their length: however, the next offset must include the length of newlines or whitespaces that precede it.
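The byte-counting rule quoted above can be sketched in Go as follows. This is only an illustration, not Telego's implementation: the function name `utf16Len` is hypothetical, and the input is assumed to be valid UTF-8. The `main` function cross-checks the result against the standard library's `unicode/utf16` encoder.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// utf16Len counts UTF-16 code units by scanning the UTF-8 bytes of s.
// Hypothetical helper; assumes s is valid UTF-8.
func utf16Len(s string) int {
	n := 0
	for i := 0; i < len(s); i++ {
		b := s[i]
		if b&0xC0 == 0x80 {
			continue // continuation byte: its code point was already counted
		}
		n++ // ASCII or lead byte: at least one UTF-16 code unit
		if b >= 0xF0 {
			n++ // lead byte of a 4-byte sequence: surrogate pair, two code units
		}
	}
	return n
}

func main() {
	for _, s := range []string{"🌗", "Last quarter", "🌗Last quarter"} {
		// Both numbers on each line should match.
		fmt.Println(utf16Len(s), len(utf16.Encode([]rune(s))))
	}
}
```

For the text in this report, `🌗` counts as 2 code units, so an entity that follows it starts at offset 2 rather than 1.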
⚡️ Expected behavior
{code 2 12 <nil> }
🧐 Code example
` entitys := []tu.MessageEntityCollection{}
entitys = append(entitys, tu.Entity("🌗"))
t, es := tu.MessageEntities(entitys...)
stdo.Printf("%U %v %v\n", []rune(t)[0], []byte(t), es) //15:46:48 main.go:471: U+1F317 [240 159 140 151] []
entitys = append(entitys, tu.Entity("Last quarter").Code())
t, es = tu.MessageEntities(entitys...)
stdo.Printf("%s %v\n", t, es) //15:46:48 main.go:474: 🌗Last quarter [{code 1 12 <nil> }]
bot.SendMessage(tu.MessageWithEntities(tu.ID(ctm.Chat.ID), entitys...).WithReplyToMessageID(ctm.MessageID))
//[Thu Apr 13 13:07:41 MSK 2023] DEBUG API call to: "https://api.telegram.org/botBOT_TOKEN/sendMessage", with data: {"chat_id":1,"text":"🌗Last quarter","entities":[{"type":"code","offset":1,"length":12}],"reply_to_message_id":1}
//[Thu Apr 13 13:07:41 MSK 2023] DEBUG API response sendMessage: Ok: false, Err: [400 "Bad Request: entity begins in a middle of a UTF-16 symbol at byte offset 4"]
`
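To show where the expected `{code 2 12 <nil> }` comes from, here is a minimal sketch that concatenates text parts and tracks the offset in UTF-16 code units instead of runes. The types `part` and `entity` and the functions `units` and `build` are hypothetical stand-ins for illustration; they are not Telego's API.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// entity is a hypothetical stand-in for a message entity (not Telego's type).
type entity struct {
	Type   string
	Offset int
	Length int
}

// part is a hypothetical text fragment; an empty Type means plain text.
type part struct {
	Text string
	Type string
}

// units returns the length of s in UTF-16 code units.
func units(s string) int {
	return len(utf16.Encode([]rune(s)))
}

// build joins the parts and records each entity's offset and length
// in UTF-16 code units.
func build(parts ...part) (string, []entity) {
	var text string
	var entities []entity
	offset := 0
	for _, p := range parts {
		n := units(p.Text)
		if p.Type != "" {
			entities = append(entities, entity{Type: p.Type, Offset: offset, Length: n})
		}
		text += p.Text
		offset += n
	}
	return text, entities
}

func main() {
	text, entities := build(
		part{Text: "🌗"},
		part{Text: "Last quarter", Type: "code"},
	)
	fmt.Println(text, entities) // 🌗Last quarter [{code 2 12}]
}
```

Counting runes gives offset 1 for the second part, which lands in the middle of the surrogate pair and triggers the 400 error shown in the log.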
About this issue
- State: closed
- Created a year ago
- Comments: 21 (21 by maintainers)
We will worry about troubles as they come, or better yet, not worry at all and just move on. © Mikhail Zhvanetsky \8^)
If you know a way to count offsets and lengths correctly while still being able to send every message that can be sent right now, please either explain it to me or open a pull request; it would be very welcome.
Telegram is constantly changing and breaking things from one update to another. If they decide to change the behavior, sure, why not, I will just update Telego as usual. But breaking working functionality in Telego right now to “properly count offsets”, even though it doesn’t matter, for what? To know that it counts correctly, while you can no longer send some messages as expected?
When using bots, you must specify `ParseMode` for any formatting to work. By default no formatting is applied, so `*test*` will be sent literally as `*test*`.
I investigated a lot and here are some of my findings:
For `a* test *_b_` sent through a bot, the entities will be `[{offset: 2, length: 5}, {offset: 7, length: 1}]`, which really doesn’t make sense, because according to the docs `* test *` should be at offset 1, not 2. With offset 2 it should have length 4, since trailing spaces are trimmed, but it has length 5.
Most importantly, from now on any issue related to entity length, entity count, or anything else about entities that does not break sending or prevent achieving a certain result will not be worked on. If you are not able to send something properly, yes, it’s a bug. But if the message is sent as expected and you were expecting a different result from Telegram than what Telego produced, that is how things will work; it’s not a bug.
Only right spaces must be trimmed (rtrim)
Trimmed too much, only left spaces should be trimmed
This issue is partially fixed by #95, but the length of trailing newlines or whitespace is still not counted correctly. I will fix that soon (hopefully).
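For reference, the documented rule on trailing whitespace (entity length excludes it, the next offset includes it) could look like the sketch below. It follows the note in the Telegram docs quoted earlier; the findings in this thread suggest Telegram's actual behavior may differ. The helpers `units` and `span` are hypothetical, and the set of characters treated as whitespace is an assumption.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf16"
)

// units returns the length of s in UTF-16 code units.
func units(s string) int {
	return len(utf16.Encode([]rune(s)))
}

// span returns the entity length with trailing whitespace excluded, and the
// amount by which the next offset advances with trailing whitespace included.
// Assumption: "whitespace" means space, tab, CR and LF.
func span(s string) (length, advance int) {
	trimmed := strings.TrimRight(s, " \t\r\n")
	return units(trimmed), units(s)
}

func main() {
	length, advance := span("🌗 bold \n")
	fmt.Println(length, advance) // 7 9
}
```

Here the entity covers `🌗 bold` (2 + 1 + 4 = 7 code units), while the next entity would start 9 code units later because the trailing space and newline still occupy positions in the text.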