common-voice: [BUG] Sentences for recording have "?" in place of expected special Latvian characters
Describe the bug
Text encoding in the database seems to be corrupt. The affected sentences appear to have been added via the Common Voice UI; most likely they looked fine when they were validated, but on the recording screen they look corrupt: Latvian characters such as č, ā, ē and similar are displayed as ?.
To Reproduce
Steps to reproduce the behavior:
- Go to https://commonvoice.mozilla.org/lv/speak
- In the developer tools, check the sentences that were loaded for recording (see the detection sketch after the examples below).
Some of the sentences will have ? in the middle of the sentence where special Latvian letters should be. An example:
id: "02ea84f2c2f9944d19bfb369b70f54b4d235c66ffd67c2413978001095a6a832"
text received from the server: "Ku?ier, tu lab?k proti, sit tu!"
text most likely in the sentence: "Kučier, tu labāk proti, sit tu!"
Another example:
id: "3176d6952f89b1ac08acfe756a0f3722979522c12406d72983e2583b2d947129"
text received from the server: "Beidziet flirt?t ar prec?tiem cilv?kiem!"
text most likely in the sentence: "Beidziet flirtēt ar precētiem cilvēkiem!"
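To check programmatically, a minimal sketch along these lines can flag suspicious sentences. The endpoint path and response shape are assumptions based on what the /lv/speak page requests in the network tab, so adjust them to whatever the developer tools actually show:

```typescript
// Scan the sentences served for recording and print the ones with a "?"
// wedged directly between two letters, which is where the corruption shows.
const API_URL = 'https://commonvoice.mozilla.org/api/v1/lv/sentences'; // assumed path

interface Sentence {
  id: string;
  text: string;
}

async function findMangledSentences(): Promise<void> {
  const res = await fetch(API_URL);
  const sentences: Sentence[] = await res.json();

  const mangled = sentences.filter(s => /\p{L}\?\p{L}/u.test(s.text));
  for (const s of mangled) {
    console.log(`${s.id}: ${s.text}`);
  }
}

findMangledSentences().catch(console.error);
```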
Expected behavior
Sentences with the proper characters are loaded for recording, and there is no ? in the middle of the sentences.
About this issue
- Original URL
- State: closed
- Created a year ago
- Reactions: 2
- Comments: 20 (19 by maintainers)
@moz-rotimib is it possible to remove all sentences that were badly encoded? Because these sentences aren’t really pronounceable and will harm STT models.
I do not fully follow the thought process here, but my thinking on this would be:
Are the sentences in the DB with or without the `?` right now? If they already have `?` in the DB, then the data is broken, and the next question would be whether we have a backup or some other way to get the correct data back. If we can think of a way to fix this, let's look for a solution, but if there is no solution, then let's first remove or at least hide the sentences with `?` in the middle of a word from the UI for now, e.g. by setting the `is_used` column for those sentences to `FALSE`. Proper investigation and a fix may take time, so some solution to mitigate the issue would be very welcome.
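A minimal sketch of that mitigation, assuming the table is called `sentences` with `text` and `is_used` columns (names taken from the discussion, not verified against the production schema) and using `mysql2`:

```typescript
import mysql from 'mysql2/promise';

async function hideMangledSentences(): Promise<void> {
  const connection = await mysql.createConnection({
    host: process.env.DB_HOST,
    user: process.env.DB_USER,
    password: process.env.DB_PASS,
    database: process.env.DB_NAME, // assumed CV database
    charset: 'utf8mb4',            // keep the connection itself utf8mb4
  });

  // "? directly between two letters" as a MySQL regexp; on MySQL 8 the
  // ICU-based REGEXP should match non-ASCII letters with [[:alpha:]] too.
  const [result] = await connection.execute(
    `UPDATE sentences
        SET is_used = FALSE
      WHERE text REGEXP '[[:alpha:]][?][[:alpha:]]'`
  );

  console.log('Hidden sentences:', (result as { affectedRows: number }).affectedRows);
  await connection.end();
}

hideMangledSentences().catch(console.error);
```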
I was already checking for `%Zaman%`, `%azal%`, `%yor%`; it's not in the dump, unfortunately.
@HarikalarKutusu The encoding of the database field is correct; as you could see from the screenshot, the characters are encoded correctly. The malformed characters also don't seem to come from the sentence-collector migration, as I stated earlier, since I couldn't find `Zamanın azalıyor` or the other malformed sentences. I will implement some extra logging when sentences are written to the DB and check whether the user client and/or the device is the culprit.
I just entered your special characters to test. Here is a screenshot from the production DB entry:
The characters are stored correctly. So the search continues 😦
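For reference, the extra logging could look roughly like this; the function and call site are hypothetical and would live wherever the API handler writes new sentences:

```typescript
// Hypothetical helper: call it right before the INSERT that stores a new
// sentence, with whatever client metadata is available at that point.
function logSuspiciousSentence(
  text: string,
  clientId: string,
  userAgent: string
): void {
  const looksMangled = /\p{L}\?\p{L}/u.test(text);    // "?" inside a word
  const hasReplacementChar = text.includes('\uFFFD'); // decoding artifact

  if (looksMangled || hasReplacementChar) {
    console.warn('Possibly mangled sentence submitted', {
      clientId,
      userAgent,
      // Raw code points make it possible to tell "?" (3f) and U+FFFD apart.
      codePoints: [...text].map(c => c.codePointAt(0)!.toString(16)),
      text,
    });
  }
}
```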
These sentences are stored inside the database containing question marks. I tried to reproduce the issue by entering sentences with Latvian special characters via desktop and mobile phone browsers and checking how they arrive in the database. So far, I can say that they are stored correctly using the interface. The latest submitted sentences don't contain question marks either.
I first suspected that there might have been an encoding problem during the sentence migration from the old sentence-collector database into the CV database. But I checked a few sentences and they are neither inside the `sentence-collector.txt` in `/data/{locale}` nor inside the old sentence-collector DB dump, which means that they must have come via the interface. But it seems that it doesn't happen all the time, which makes it hard to track down.
I get reports of those malformed sentences from the community. As they are mostly deducible in Turkish, some people record them as they should be, but some use the "Report" button, which does nothing in terms of the workflow. During validation the path is again not well defined: some people use the Report button, and some people down-vote (I tell them to down-vote).
In any case, I have no idea how these can be cleaned across all languages if they get into the dataset release. @moz-dfeller, I think this is an issue of the highest priority in terms of dataset health and validity.
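As a stop-gap for dataset releases, affected rows could at least be flagged by scanning each language's TSV for a ? wedged between letters. The `sentence` column name follows the published Common Voice TSVs, but treat the file layout as an assumption:

```typescript
import { createReadStream } from 'node:fs';
import { createInterface } from 'node:readline';

// Print every row of a Common Voice TSV whose sentence contains a "?"
// directly between two letters.
async function flagMangledRows(tsvPath: string): Promise<void> {
  const rl = createInterface({
    input: createReadStream(tsvPath),
    crlfDelay: Infinity,
  });

  let headerSeen = false;
  let sentenceIdx = -1;
  for await (const line of rl) {
    const cols = line.split('\t');
    if (!headerSeen) {
      headerSeen = true;
      sentenceIdx = cols.indexOf('sentence'); // header row
      continue;
    }
    const sentence = cols[sentenceIdx] ?? '';
    if (/\p{L}\?\p{L}/u.test(sentence)) {
      console.log(line);
    }
  }
}

flagMangledRows('lv/validated.tsv').catch(console.error);
```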
Actually #4111 and #4113 are about the source field, which had a wrong encoding (standard `latin1`). The sentences themselves should be correctly encoded, as they are saved as `utf8mb4`-encoded strings. So I assume that they might have been either written to the database in the wrong format or wrongly decoded when retrieved from the database. I will investigate that issue.
I second @ftyers.
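For reference, the suspected failure mode is easy to demonstrate in isolation: if the connection character set is `latin1`, text gets transcoded on the way in and characters outside latin1 become `?`, even when the column itself is `utf8mb4`. A minimal sketch against a local MySQL instance, with table name and credentials made up for the demo:

```typescript
import mysql from 'mysql2/promise';

// Insert and read back the same sentence over a latin1 and a utf8mb4
// connection; the latin1 round trip loses ē to "?" (the replacement can
// happen client- or server-side, the visible result is the same).
async function roundTrip(charset: 'latin1' | 'utf8mb4'): Promise<string> {
  const conn = await mysql.createConnection({
    host: 'localhost',
    user: 'root',
    password: '',   // adjust credentials for your local MySQL
    database: 'test',
    charset,
  });

  await conn.query(
    `CREATE TABLE IF NOT EXISTS enc_demo (
       id INT AUTO_INCREMENT PRIMARY KEY,
       txt TEXT
     ) CHARACTER SET utf8mb4`
  );
  const [ins] = await conn.execute('INSERT INTO enc_demo (txt) VALUES (?)', [
    'Beidziet flirtēt ar precētiem cilvēkiem!',
  ]);
  const [rows] = await conn.execute('SELECT txt FROM enc_demo WHERE id = ?', [
    (ins as { insertId: number }).insertId,
  ]);

  await conn.end();
  return (rows as { txt: string }[])[0].txt;
}

async function main(): Promise<void> {
  console.log('latin1  :', await roundTrip('latin1'));  // ? where ē was
  console.log('utf8mb4 :', await roundTrip('utf8mb4')); // unchanged
}

main().catch(console.error);
```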
Are they just displayed weirdly, or are they malformed in the database? If malformed, is it a 1-to-1-mapped Unicode character which cannot be displayed, or are they replaced by question marks?
If these are not correctable, they should be removed for sure.
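One quick way to answer that question for a given string is to check whether the odd character is a literal ASCII ? (U+003F, the original data is gone) or U+FFFD (a decoding artifact), e.g.:

```typescript
// Classify the odd characters in a sentence: a plain "?" (U+003F) means the
// original letter was already replaced and the data is lost, while U+FFFD
// means the bytes were decoded with the wrong charset somewhere on the way.
function classifyOddChars(text: string): void {
  for (const ch of text) {
    if (ch === '?') {
      console.log('literal "?" (U+003F) - original character is gone');
    } else if (ch === '\uFFFD') {
      console.log('U+FFFD replacement character - wrong charset while decoding');
    }
  }
}

classifyOddChars('Ku?ier, tu lab?k proti, sit tu!'); // two literal "?" hits
```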
I (in-)validated many sentences (>1000) from imports in the past few days, and I'm afraid they might end up like this. I will not use the interface until it is sorted out.
Maybe it would be best to make an announcement on this so that people don't get frustrated.
Same for special Turkish characters:
But I’m not sure if these are from old sentences or from recently entered ones through the new interface.