FFmpegInteropX: `AutoCorrectAnsiSubtitles` causes incorrect handling of files without UTF8-BOM
The name `AutoCorrectAnsiSubtitles` is a bit misleading, because it doesn't do anything "automatic": it forces ffmpeg to treat every file without a UTF-8 BOM as text in a fixed code page, even when the file is actually UTF-8.

(By the way, "ANSI" is not quite the right term either: the plain ASCII subset is already valid UTF-8; it's only the extended characters beyond 7-bit ASCII that need special treatment.)

It might make sense to check whether the content is UTF-8 before forcing a code page and setting `sub_charenc`.
Here’s how ffmpeg does it: https://github.com/FFmpeg/FFmpeg/blob/2532e832d2773d9d09574434f84decebcf6f81a1/libavcodec/decode.c#L950
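The core of such a probe is a structural UTF-8 validity check like the one ffmpeg applies in `decode.c`. A minimal sketch of the idea (the helper name `LooksLikeUtf8` is hypothetical, not FFmpegInteropX or ffmpeg API, and it deliberately skips stricter checks such as overlong encodings):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: returns true if the buffer contains only
// well-formed UTF-8 lead/continuation byte sequences.
// Pure ASCII also passes, which is fine: ASCII is valid UTF-8.
static bool LooksLikeUtf8(const uint8_t* data, size_t size)
{
    size_t i = 0;
    while (i < size)
    {
        uint8_t b = data[i];
        size_t len;
        if (b < 0x80)                len = 1; // ASCII
        else if ((b & 0xE0) == 0xC0) len = 2; // 110xxxxx
        else if ((b & 0xF0) == 0xE0) len = 3; // 1110xxxx
        else if ((b & 0xF8) == 0xF0) len = 4; // 11110xxx
        else return false;                    // stray continuation or invalid lead byte

        if (i + len > size) return false;     // sequence truncated at buffer end
        for (size_t j = 1; j < len; ++j)
            if ((data[i + j] & 0xC0) != 0x80) // continuation must be 10xxxxxx
                return false;
        i += len;
    }
    return true;
}
```

Note that a code-page file like Windows-1252 almost always fails this check as soon as an extended character appears, while genuine UTF-8 passes, so forcing `sub_charenc` only on failure would avoid mangling BOM-less UTF-8 files. When probing a fixed-size prefix rather than the whole file, a multi-byte sequence cut off at the chunk boundary would need to be tolerated rather than treated as invalid.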
About this issue
- Original URL
- State: closed
- Created 8 months ago
- Comments: 19 (7 by maintainers)
The lib you linked recommends probing at least 4 KB of data. We currently read files with a fixed chunk size of 16 KB (configurable, but 16 KB is the default). I wonder if it would be enough to check those first 16 KB. Then we could do it inline with the first read call coming from ffmpeg.

The obvious upside is that we would not have to do any seeking at all. We'd just have to make sure that we really read that first chunk completely (the stream might return less than what is requested, if it comes from the web).

The second upside is this: we don't really know what kind of stream is passed to our function. It is very well possible to pass the stream of a full movie here; it would get parsed and all subtitle streams would be added. We certainly do not want to read the full movie into memory just to do a character-encoding check, so we'd definitely need to restrict the amount of data we check.
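The "read the first chunk completely" step can be sketched as a fill loop over a read callback that may return short reads. Everything here is illustrative: `ReadFn`, `kProbeSize`, and `ReadFirstChunk` are hypothetical names, not FFmpegInteropX API.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical read callback: returns bytes actually read,
// 0 on end-of-stream, may return less than requested (e.g. web streams).
using ReadFn = std::function<int(uint8_t* buf, int size)>;

// Cap the probe at the default 16 KB chunk size so we never
// buffer a whole movie just to check the character encoding.
static constexpr size_t kProbeSize = 16 * 1024;

static std::vector<uint8_t> ReadFirstChunk(const ReadFn& read)
{
    std::vector<uint8_t> chunk(kProbeSize);
    size_t total = 0;
    while (total < kProbeSize)
    {
        int n = read(chunk.data() + total,
                     static_cast<int>(kProbeSize - total));
        if (n <= 0) break; // end of stream (or error): probe whatever we got
        total += static_cast<size_t>(n);
    }
    chunk.resize(total);
    return chunk; // probe this buffer, then hand it to ffmpeg as the first read
}
```

The buffer is kept around and served back to ffmpeg on its first read call, so no seeking is needed and the stream is only consumed once.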