[FFmpeg-devel] [PATCH] speex in ogg muxer

Sat Oct 10 04:01:58 CEST 2009

David Conrad wrote:

> On Sep 5, 2009, at 8:20 PM, Justin Ruggles wrote:
> 
>> Justin Ruggles wrote:
>>
>>> Justin Ruggles wrote:
>>>
>>>> Justin Ruggles wrote:
>>>>
>>>>> Justin Ruggles wrote:
>>>>>
>>>>>> Justin Ruggles wrote:
>>>>>>
>>>>>> Now I think I know what is going wrong, and there is nothing we  
>>>>>> can do
>>>>>> about it I think.  speexenc does some weird things with granule
>>>>>> positions.  It starts out for a long time with granulepos=0 even  
>>>>>> though
>>>>>> it is encoding audio, then when it starts writing granule  
>>>>>> positions it
>>>>>> is not always in sync with the start of the stream.  Below is a  
>>>>>> little
>>>>>> snippet from a comparison of an original spx file to a copied  
>>>>>> spx file.
>>>>>> Each packet should be 320 samples.
>>>>>>
>>>>>> [...]
>>>>> So... I figured it out, but you may not want to know the answer. ;)
>>>>>
>>>>> The granulepos of the first packet is supposed to be interpreted as
>>>>> smaller than the full frame size by calculating what the  
>>>>> granulepos of
>>>>> the first page would normally be, then subtracting it from what it
>>>>> really is to get the delay.
>>>>>
>>>>>>> From above, this is the last packet in the first page. There  
>>>>>>> are 59
>>>>> packets per page in this stream (the first 2 packets are headers,  
>>>>> hence
>>>>> the packetno of 60).
>>>>>> -00:00:01.171: serialno 1626088319, granulepos 18737, packetno 60
>>>>>> +00:00:01.180: serialno 0000000000, granulepos 18880, packetno 60
>>>>> speexdec interprets the first packet as having a delay of
>>>>> 18880-18737=143 samples.  So the first packet should be 320-143=177
>>>>> samples long, and the decoder discards the first 143 samples of the
>>>>> first frame.
>>>>>
>>>>> None of this is documented except for in the speexenc and speexdec
>>>>> source code.  From analyzing a Speex-in-FLV sample, it appears  
>>>>> that the
>>>>> way Adobe handles this in Flash Media Server is to do like our ogg
>>>>> demuxer does and interpret the first page as if each frame is 320
>>>>> samples, then resync timestamps with the source after the first  
>>>>> page,
>>>>> causing a skip in timestamps after the first page instead of at the
>>>>> beginning of the stream.
>>>>>
>>>>> I'm still not sure what to do about this though...
>>>> This patch makes it so that all the pts and durations are correct  
>>>> for
>>>> Ogg/Speex.  It basically just changes the durations of the first and
>>>> last packets.
>>> nevermind. this doesn't quite work. i'm still working on it. damn ogg
>>> and its craziness!
>> Ok, now this patch should work correctly.
> 
> After some discussion with xiph people, apparently vorbis does this  
> exact same thing. The reasoning behind it is that libvorbis/libspeex  
> generate additional samples to prime the lapped transform. There is  
> apparently nothing in the vorbis/speex bitstream to indicate how many  
> samples this is, so instead ogg granulepos is used to figure out how  
> many samples to skip at the beginning.

Ouch. Is there a way to pre-parse the first page packets before decoding
to determine the correct packet durations?

> This is probably why there's such a high PSNR difference between our  
> decoder and libvorbis despite the output sounding fine. However, I'm  
> not sure how to fix this: since vorbis has no fixed frame_size it  
> requires decoding every packet in the first ogg page, then subtracting  
> the page's granulepos from how many samples were decoded. Which would  
> break other containers.

Then does ogg/vorbis have the same timestamp problem as ogg/speex when
doing stream copy?

>> diff --git a/libavformat/oggparsespeex.c b/libavformat/oggparsespeex.c
>> index cc00dd2..c295970 100644
>> --- a/libavformat/oggparsespeex.c
>> +++ b/libavformat/oggparsespeex.c
>> @@ -53,6 +64,7 @@ static int speex_header(AVFormatContext *s, int  
>> idx) {
>>            byte-aligned. */
>>         st->codec->frame_size = AV_RL32(p + 56);
>>         frames_per_packet     = AV_RL32(p + 64);
>> +        spxp->frame_size      = st->codec->frame_size;
>>         if (frames_per_packet)
>>             st->codec->frame_size *= frames_per_packet;
> 
> [...]
> 
>> +static int speex_packet(AVFormatContext *s, int idx)
>> +{
>> +    struct ogg *ogg = s->priv_data;
>> +    struct ogg_stream *os = ogg->streams + idx;
>> +    struct speex_params *spxp = os->private;
>> +    int frames_per_packet = s->streams[idx]->codec->frame_size /  
>> spxp->frame_size;
> 
> This frame_size / frame_size is confusing; one should be renamed. spxp- 
>  >single_frame_size maybe with a comment reminding that the codec- 
>  >frame_size is per packet with multiple frames?

I've simplified it now.

>> +
>> +    if (os->flags & OGG_FLAG_EOS && os->lastgp != -1 && os->granule  
>>> 0) {
>> +        /* first packet of last page. we have to calculate the last  
>> packet
>> +           duration here because it is the only place we know the  
>> next-to-last
>> +           granule position. */
>> +        spxp->last_packet_duration = os->granule - os->lastgp +
>> +                                     spxp->frame_size *  
>> frames_per_packet *
>> +                                     (1 - ogg_page_packets(os));
>> +    }
>> +
>> +    if (!os->lastgp && os->granule > 0)
> 
> This will set this duration for all speex packets in the first page;  
> shouldn't it only be the first (os->seq == 2)?