common-voice: [BUG] Sentences for recording have "?" in place of expected special Latvian characters
Describe the bug
Text encoding in the database seems to be corrupt. The affected sentences appear to have been added via the Common Voice UI; most likely they looked fine when they were validated, but on the recording screen they look corrupt: Latvian characters such as č, ā, ē and similar are displayed as ?.
To Reproduce
Steps to reproduce the behavior:
- Go to https://commonvoice.mozilla.org/lv/speak
- In the developer tools, check the sentences that were loaded for recording (see the detection sketch after the examples below).
Some of the sentences will have ? in the middle of the sentence where special Latvian letters should be. An example:
id: "02ea84f2c2f9944d19bfb369b70f54b4d235c66ffd67c2413978001095a6a832"
text received from the server: "Ku?ier, tu lab?k proti, sit tu!"
text most likely in the sentence: "Kučier, tu labāk proti, sit tu!"
Another example:
id: "3176d6952f89b1ac08acfe756a0f3722979522c12406d72983e2583b2d947129"
text received from the server: "Beidziet flirt?t ar prec?tiem cilv?kiem!"
text most likely in the sentence: "Beidziet flirtēt ar precētiem cilvēkiem!"
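To check programmatically, a minimal sketch along these lines can flag suspicious sentences. The endpoint path and response shape are assumptions based on what the /lv/speak page requests in the network tab, so adjust them to whatever the developer tools actually show:

```typescript
// Scan the sentences served for recording and print the ones with a "?"
// wedged directly between two letters, which is where the corruption shows.
const API_URL = 'https://commonvoice.mozilla.org/api/v1/lv/sentences'; // assumed path

interface Sentence {
  id: string;
  text: string;
}

async function findMangledSentences(): Promise<void> {
  const res = await fetch(API_URL);
  const sentences: Sentence[] = await res.json();

  const mangled = sentences.filter(s => /\p{L}\?\p{L}/u.test(s.text));
  for (const s of mangled) {
    console.log(`${s.id}: ${s.text}`);
  }
}

findMangledSentences().catch(console.error);
```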
Expected behavior
Sentences with the proper characters are loaded for recording, and there is no ? in the middle of the sentences.
About this issue
- Original URL
- State: closed
- Created a year ago
- Reactions: 2
- Comments: 20 (19 by maintainers)
@moz-rotimib is it possible to remove all sentences that were badly encoded? Because these sentences aren’t really pronounceable and will harm STT models.
I do not fully follow the thought process here, but my thinking on this would be:
Are the sentences in the DB with or without the `?` right now? If they already have `?` in the DB, then the data is broken, and the next question would be whether we have a backup or some other way to get the correct data back. If we can think of a way to fix this, let's look for a solution, but if there is no solution, then let's first remove or at least hide the sentences with `?` in the middle of a word from the UI for now, e.g. by setting the `is_used` column for those sentences to `FALSE`. Proper investigation and a fix may take time, so some solution to mitigate the issue would be very welcome.
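A minimal sketch of that mitigation, assuming the table is called `sentences` with `text` and `is_used` columns (names taken from the discussion, not verified against the production schema) and using `mysql2`:

```typescript
import mysql from 'mysql2/promise';

async function hideMangledSentences(): Promise<void> {
  const connection = await mysql.createConnection({
    host: process.env.DB_HOST,
    user: process.env.DB_USER,
    password: process.env.DB_PASS,
    database: process.env.DB_NAME, // assumed CV database
    charset: 'utf8mb4',            // keep the connection itself utf8mb4
  });

  // "? directly between two letters" as a MySQL regexp; on MySQL 8 the
  // ICU-based REGEXP should match non-ASCII letters with [[:alpha:]] too.
  const [result] = await connection.execute(
    `UPDATE sentences
        SET is_used = FALSE
      WHERE text REGEXP '[[:alpha:]][?][[:alpha:]]'`
  );

  console.log('Hidden sentences:', (result as { affectedRows: number }).affectedRows);
  await connection.end();
}

hideMangledSentences().catch(console.error);
```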
I was already checking for `%Zaman%`, `%azal%`, `%yor%`; it's not in the dump, unfortunately.
@HarikalarKutusu The encoding of the database field is correct; as you could see from the screenshot, the characters are encoded correctly. The malformed characters also don't seem to come from the sentence-collector migration, as I stated earlier, since I couldn't find `Zamanın azalıyor` or the other malformed sentences. I will implement some extra logging when sentences are written to the DB and check whether the user client and/or the device is the culprit.
I just entered your special characters to test. Here is a screenshot from the production DB entry:
The characters are stored correctly. So the search continues 😦
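For reference, the extra logging could look roughly like this; the function and call site are hypothetical and would live wherever the API handler writes new sentences:

```typescript
// Hypothetical helper: call it right before the INSERT that stores a new
// sentence, with whatever client metadata is available at that point.
function logSuspiciousSentence(
  text: string,
  clientId: string,
  userAgent: string
): void {
  const looksMangled = /\p{L}\?\p{L}/u.test(text);    // "?" inside a word
  const hasReplacementChar = text.includes('\uFFFD'); // decoding artifact

  if (looksMangled || hasReplacementChar) {
    console.warn('Possibly mangled sentence submitted', {
      clientId,
      userAgent,
      // Raw code points make it possible to tell "?" (3f) and U+FFFD apart.
      codePoints: [...text].map(c => c.codePointAt(0)!.toString(16)),
      text,
    });
  }
}
```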
These sentences are stored inside the database containing question marks. I tried to reproduce the issue by entering sentences with Latvian special characters via desktop and mobile phone browsers and checking how they arrive in the database. So far, I can say that they are stored correctly using the interface. The latest submitted sentences don't contain question marks either.
I first suspected that there might have been an encoding problem during the sentence migration from the old sentence-collector database into the CV database. But I checked a few sentences and they are neither inside the `sentence-collector.txt` in `/data/{locale}` nor inside the old sentence-collector DB dump, which means that they must have come via the interface. But it seems that it doesn't happen all the time, which makes it hard to track down.
I get reports of those malformed sentences from the community. As they are mostly deducible in Turkish, some people record them as they should be, but some use the "Report" button, which does nothing in terms of the workflow. During validation the path is again not well defined: some people use the Report button, and some people down-vote (I tell them to down-vote).
In any case, I have no idea how these can be cleaned across all languages if they get into the dataset release. @moz-dfeller, I think this is an issue of the highest priority in terms of dataset health and validity.
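As a stop-gap for dataset releases, affected rows could at least be flagged by scanning each language's TSV for a ? wedged between letters. The `sentence` column name follows the published Common Voice TSVs, but treat the file layout as an assumption:

```typescript
import { createReadStream } from 'node:fs';
import { createInterface } from 'node:readline';

// Print every row of a Common Voice TSV whose sentence contains a "?"
// directly between two letters.
async function flagMangledRows(tsvPath: string): Promise<void> {
  const rl = createInterface({
    input: createReadStream(tsvPath),
    crlfDelay: Infinity,
  });

  let headerSeen = false;
  let sentenceIdx = -1;
  for await (const line of rl) {
    const cols = line.split('\t');
    if (!headerSeen) {
      headerSeen = true;
      sentenceIdx = cols.indexOf('sentence'); // header row
      continue;
    }
    const sentence = cols[sentenceIdx] ?? '';
    if (/\p{L}\?\p{L}/u.test(sentence)) {
      console.log(line);
    }
  }
}

flagMangledRows('lv/validated.tsv').catch(console.error);
```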
Actually #4111 and #4113 are about the source field, which had a wrong encoding (standard `latin1`). The sentences themselves should be correctly encoded, as they are saved as `utf8mb4`-encoded strings. So I assume that they might have been either written to the database in the wrong format or wrongly decoded when retrieved from the database. I will investigate that issue.
I second @ftyers.
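For reference, the suspected failure mode is easy to demonstrate in isolation: if the connection character set is `latin1`, text gets transcoded on the way in and characters outside latin1 become `?`, even when the column itself is `utf8mb4`. A minimal sketch against a local MySQL instance, with table name and credentials made up for the demo:

```typescript
import mysql from 'mysql2/promise';

// Insert and read back the same sentence over a latin1 and a utf8mb4
// connection; the latin1 round trip loses ē to "?" (the replacement can
// happen client- or server-side, the visible result is the same).
async function roundTrip(charset: 'latin1' | 'utf8mb4'): Promise<string> {
  const conn = await mysql.createConnection({
    host: 'localhost',
    user: 'root',
    password: '',   // adjust credentials for your local MySQL
    database: 'test',
    charset,
  });

  await conn.query(
    `CREATE TABLE IF NOT EXISTS enc_demo (
       id INT AUTO_INCREMENT PRIMARY KEY,
       txt TEXT
     ) CHARACTER SET utf8mb4`
  );
  const [ins] = await conn.execute('INSERT INTO enc_demo (txt) VALUES (?)', [
    'Beidziet flirtēt ar precētiem cilvēkiem!',
  ]);
  const [rows] = await conn.execute('SELECT txt FROM enc_demo WHERE id = ?', [
    (ins as { insertId: number }).insertId,
  ]);

  await conn.end();
  return (rows as { txt: string }[])[0].txt;
}

async function main(): Promise<void> {
  console.log('latin1  :', await roundTrip('latin1'));  // ? where ē was
  console.log('utf8mb4 :', await roundTrip('utf8mb4')); // unchanged
}

main().catch(console.error);
```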
Are they just displayed weirdly, or are they malformed in the database? If malformed, is it a 1-to-1-mapped Unicode character which cannot be displayed, or are they replaced by question marks?
If these are not correctable, they should be removed for sure.
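One quick way to answer that question for a given string is to check whether the odd character is a literal ASCII ? (U+003F, the original data is gone) or U+FFFD (a decoding artifact), e.g.:

```typescript
// Classify the odd characters in a sentence: a plain "?" (U+003F) means the
// original letter was already replaced and the data is lost, while U+FFFD
// means the bytes were decoded with the wrong charset somewhere on the way.
function classifyOddChars(text: string): void {
  for (const ch of text) {
    if (ch === '?') {
      console.log('literal "?" (U+003F) - original character is gone');
    } else if (ch === '\uFFFD') {
      console.log('U+FFFD replacement character - wrong charset while decoding');
    }
  }
}

classifyOddChars('Ku?ier, tu lab?k proti, sit tu!'); // two literal "?" hits
```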
I (in-)validated many sentences (>1000) from imports in the past few days, and I'm afraid they might end up like this. I will not use the interface until it is sorted out.
Maybe it would be best to make an announcement on this so that people don't get frustrated.
Same for special Turkish characters:
But I’m not sure if these are from old sentences or from recently entered ones through the new interface.