bergamot-translator: QE for HTML input doesn't work

A quick experiment with the WASM test page for HTML translation shows unexpected byte ranges for words.

An example:

Sentence: "How are  <b>you</b>  doing?"
Translated sentence: "Wie geht <b>es dir</b>?"

"words":["Wie", "geht", "<b", "es d"]
"wordByteRanges":[{"begin":0,"end":3}, {"begin":4,"end":8}, {"begin":9,"end":11}, {"begin":12,"end":16}]

It looks like QE ignores tags completely (treating them as if they did not exist) in the translated sentence when it computes the byte ranges of the words. Is this something that needs to be fixed, or am I doing something wrong on my end?
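The mismatch can be reproduced outside the translator. The sketch below is only an illustration: it assumes the ranges are UTF-8 byte offsets (as the name wordByteRanges suggests), and the helper `sliceBytes` and the naive tag-stripping regex are hypothetical, not part of bergamot-translator. It applies the reported ranges once to the translation with tags and once to the same text with tags removed.

```javascript
// Slice a string by UTF-8 byte offsets (assumed unit of wordByteRanges).
function sliceBytes(text, { begin, end }) {
  const bytes = new TextEncoder().encode(text);
  return new TextDecoder().decode(bytes.subarray(begin, end));
}

const translated = "Wie geht <b>es dir</b>?";
const ranges = [
  { begin: 0, end: 3 },
  { begin: 4, end: 8 },
  { begin: 9, end: 11 },
  { begin: 12, end: 16 },
];

// Naive tag stripping; good enough for this tiny example only.
const stripped = translated.replace(/<[^>]*>/g, "");

// Ranges applied to the HTML output, as the test page does.
const againstHtml = ranges.map((r) => sliceBytes(translated, r));
// Ranges applied to the tag-free text.
const againstText = ranges.map((r) => sliceBytes(stripped, r));

console.log(againstHtml); // [ 'Wie', 'geht', '<b', 'es d' ]
console.log(againstText); // [ 'Wie', 'geht', 'es', 'dir?' ]
```

Applied to the tag-free text, the same ranges land on whole words, which is consistent with the offsets being computed as if the tags were not there.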

Attaching the image for detailed results.

[Image: Screenshot 2022-02-17 at 14 32 35]

cc @kpu @jerinphilip @abarbosa94 @felipesantosk @mfomicheva

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 28 (28 by maintainers)

Most upvoted comments

I’ve added an example to the demo page in that branch showing how the output of #358 can be used. That example treats the input field as plain text, escapes it, and renders the translation output as HTML. Then, with a little bit of JavaScript for thresholding and some CSS, I render sentence-level and word-level quality indicators.

[Image: demo page rendering sentence-level and word-level quality indicators]

(Based on thresholds that have no meaning, I just wanted some thresholds that are met often enough for screenshot purposes.)

@abhi-agg Plain text stays as it was, since it was already working. However, the easiest route for you is to follow the suggestion in https://github.com/browsermt/bergamot-translator/issues/355#issuecomment-1044425813, specifically:

HTML: pick #358 and you get the tags back.
text/plain: encode it, then submit it as HTML. Render the HTML (note that it will be a snippet of HTML, not a full page).

Then you don’t have to handle byte ranges in either case.
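For the text/plain route, the encoding step could look something like the following minimal sketch. The helper name `escapeHtml` and the entity table are illustrative assumptions, not taken from the extension or from bergamot-translator; the call that actually submits the result (marked as HTML) to the translator is deliberately left out because it is not shown in this thread.

```javascript
// Characters that must be escaped so plain text survives as an HTML snippet.
const HTML_ENTITIES = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#39;",
};

// Hypothetical helper: encode plain text before submitting it as HTML.
function escapeHtml(text) {
  return text.replace(/[&<>"']/g, (ch) => HTML_ENTITIES[ch]);
}

const plain = 'Is 3 < 5 & "x" > y?';
const asHtml = escapeHtml(plain);
console.log(asHtml); // Is 3 &lt; 5 &amp; &quot;x&quot; &gt; y?
```

The escaped string is what would be submitted as HTML; the translated snippet that comes back can then be rendered directly.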

All mock-ups seem to assume plain text (e.g. a form input), not HTML. In that case, what is currently in main will be sufficient. #357 won’t hurt, but the offsets weren’t broken to begin with in this scenario.

To clarify a bit further: #357 just fixes the byte offsets in case the input is marked as HTML. If the input was text, there was no issue. So for form translation (which I assume is always plain text unless you’re going to attempt auto translation in contenteditable elements…) it is not necessary.

#358 adds the quality scores as HTML to the HTML output, but this only works if the input was HTML to begin with. Of course, it can also be used with plain text if you encode all entities in the text before sending it to the translator (and mark it as HTML). The output can easily be rendered like the mock-ups. Thresholding can easily be done in the extension using JavaScript or CSS. The added tags that don’t meet the threshold won’t affect rendering. You might need to be a bit creative with horizontal padding to fill in the gaps between consecutive words that do meet the threshold, though: the way I insert the tags leaves the spaces that are not part of a word outside any of the score tags. But that’s not insurmountable.
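A rough sketch of what such thresholding might look like in JavaScript follows. Everything specific in it is an assumption for illustration: the attribute name `x-bergamot-word-score`, the idea that lower scores mean lower quality, the threshold value, and the `low-quality` class name should all be checked against the actual output of #358.

```javascript
// Hypothetical threshold, with no particular meaning.
const WORD_THRESHOLD = -0.5;

// Pure helper: true when a score attribute value falls below the threshold.
// Assumes lower scores indicate lower quality.
function isLowQuality(scoreAttr, threshold = WORD_THRESHOLD) {
  const score = Number.parseFloat(scoreAttr);
  return Number.isFinite(score) && score < threshold;
}

// Browser side: toggle a CSS class on every element that carries a word score.
// The attribute name is an assumption about the markup added by #358.
function applyThreshold(root, threshold = WORD_THRESHOLD) {
  for (const el of root.querySelectorAll("[x-bergamot-word-score]")) {
    const low = isLowQuality(el.getAttribute("x-bergamot-word-score"), threshold);
    el.classList.toggle("low-quality", low);
  }
}
```

In a page, this would be called as `applyThreshold(outputElement)` after the translated snippet has been inserted, with a CSS rule for `.low-quality` supplying the visual indicator. Elements that are not flagged keep their default rendering.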