[FFmpeg-devel] [PATCH] lavc: support subtitles charset conversion.

Fri Jan 4 16:44:59 CET 2013

On Fri, Jan 04, 2013 at 04:06:02PM +0100, Nicolas George wrote:
> On quintidi 15 Nivôse, year CCXXI, Clement Boesch wrote:
> > I was simply thinking of a flag saying that the pkt->data can be
> > re-coded to UTF-8; basically what we will do for every simple text
> > decoder, except those outputting UTF-8 only (like WebVTT iirc, or even
> > TED, where nothing is needed).
> 
> I still find it a bit ad-hoc and inelegant to put inside the codec
> structure. Setting a flag in the context structure from the codec init
> function seems much more elegant.
> 

I fail to see how it is more elegant; the codec properties sound like the
best place to declare such generalities. Using the context structure is
only necessary if we need on-the-fly changes, which doesn't sound common
at all. And if we find such insanity, I'd suggest fixing that mess in the
decoder or the demuxer itself.

> Also, extending the AVCodec structure or using a bit in the capability field
> has a larger risk of causing compatibility trouble with the fork, whereas we
> already have the framework for this case in AVCodecContext.
> 

We'll create a gap, like we do everywhere:

diff --git a/libavcodec/avcodec.h b/libavcodec/avcodec.h
index 060589b..5e055a9 100644
--- a/libavcodec/avcodec.h
+++ b/libavcodec/avcodec.h
@@ -536,6 +536,10 @@ typedef struct AVCodecDescriptor {
  * Codec supports lossless compression. Audio and video codecs only.
  */
 #define AV_CODEC_PROP_LOSSLESS      (1 << 2)
+/**
+ * Subtitle codec supports character re-encoding of the AVPacket data to UTF-8
+ */
+#define AV_CODEC_PROP_PRE_CHARENC   (1 << 16)
 
 #if FF_API_OLD_DECODE_AUDIO
 /* in bytes */
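
To illustrate the gap: a minimal sketch, assuming only the two bit values
visible in the hunk above. The struct desc_sketch type and the
can_pre_recode() helper are hypothetical names made up for this example;
they are not part of the patch or of lavc.

```c
#include <assert.h>
#include <stddef.h>

/* Existing property bit, as shown in the diff context. */
#define AV_CODEC_PROP_LOSSLESS      (1 << 2)
/* Proposed bit: placed at 16 so the low bits stay free for other additions. */
#define AV_CODEC_PROP_PRE_CHARENC   (1 << 16)

/* Compile-time check that the new bit does not collide with the existing one. */
static_assert((AV_CODEC_PROP_PRE_CHARENC & AV_CODEC_PROP_LOSSLESS) == 0,
              "property bits must not overlap");

/* Hypothetical stand-in for a codec descriptor; only props matters here. */
struct desc_sketch {
    const char *name;
    int         props;
};

/* Return 1 if the descriptor declares that its packet data may be
 * re-encoded to UTF-8 before decoding, 0 otherwise. */
static int can_pre_recode(const struct desc_sketch *d)
{
    return d != NULL && (d->props & AV_CODEC_PROP_PRE_CHARENC) != 0;
}
```

A caller would test can_pre_recode() once per codec; nothing in the context
structure needs to change at runtime.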

> > If a codec supports the several-encoding thing (wtf?), it should be
> > handled inside the decoder itself IMO.
> 
> Yes.
> 
> > Post conversion for teletext? I suggest the decoder should output directly
> > in UTF-8.
> 
> Assuming it can. Teletext is an industrial standard, and as such probably
> badly designed: I would not be surprised if some variant of it uses legacy
> encodings without declaration.
> 

I'd rather hear from someone implementing it than speculate about how
crazy it can be.

It's true that we don't yet have teletext support. OTOH we have quite a
bunch of other subtitle formats, and all of them (except the bitmap
subtitles and the UTF-8 ones where there is nothing to do) need something
more advanced than what is proposed in this patch.

> > Writing garbage? The original packet data is maintained untouched.
> 
> That is exactly the problem: the user asked to convert the encoding, and
> lavc did not. You can expect decoding failures (if the decoder is not
> capable of handling the data), truncated output (if a stray 0 stops the
> decoder) or anything of the sort.

If it fails at any point the original data is used.
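
Roughly, the per-packet fallback looks like the following. This is only a
sketch built on POSIX iconv, not the code from the patch: recode_or_keep()
and its parameters are hypothetical names, and the 4x output bound is an
assumption chosen to be generous.

```c
#include <assert.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: try to recode `in` from `from_charset` to UTF-8.
 * On any failure, return a copy of the original bytes instead.
 * *recoded is set to 1 on success and 0 on fallback; the caller frees
 * the returned buffer. Returns NULL only if allocation fails. */
static char *recode_or_keep(const char *in, size_t in_len,
                            const char *from_charset,
                            size_t *out_len, int *recoded)
{
    char *out = NULL;
    iconv_t cd;

    assert(in && from_charset && out_len && recoded);
    *recoded = 0;

    cd = iconv_open("UTF-8", from_charset);
    if (cd != (iconv_t)-1) {
        size_t cap      = in_len * 4 + 1;   /* assumed upper bound on growth */
        char  *src      = (char *)in;
        size_t src_left = in_len;
        size_t dst_left = cap - 1;
        char  *dst;

        out = malloc(cap);
        dst = out;
        if (out &&
            iconv(cd, &src, &src_left, &dst, &dst_left) != (size_t)-1 &&
            src_left == 0) {
            *out_len = (size_t)(dst - out);
            out[*out_len] = '\0';
            *recoded = 1;
        } else {
            /* Conversion failed part-way: discard the partial output. */
            free(out);
            out = NULL;
        }
        iconv_close(cd);
    }

    if (!out) {
        /* Fallback: hand back the original bytes untouched. */
        out = malloc(in_len + 1);
        if (!out)
            return NULL;
        memcpy(out, in, in_len);
        out[in_len] = '\0';
        *out_len = in_len;
    }
    return out;
}
```

The recoded output flag is what a caller would need in order to notice that
a given packet was passed through unconverted.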

>                                    Worse: since the failure can happen for
> some packets but not all, the output file may mix correctly transcoded
> packets and garbled ones.
> 

I'll make sure to fix the FIXME if no one else does.

[...]

-- 
Clément B.