[FFmpeg-devel] [PATCH] speex in ogg muxer

Mon Oct 12 05:43:24 CEST 2009

Justin Ruggles wrote:

> David Conrad wrote:
> 
>> On Sep 5, 2009, at 8:20 PM, Justin Ruggles wrote:
>>
>>> Justin Ruggles wrote:
>>>
>>>> Justin Ruggles wrote:
>>>>
>>>>> Justin Ruggles wrote:
>>>>>
>>>>>> Justin Ruggles wrote:
>>>>>>
>>>>>>> Justin Ruggles wrote:
>>>>>>>
>>>>>>> Now I think I know what is going wrong, and there is nothing we can
>>>>>>> do about it I think.  speexenc does some weird things with granule
>>>>>>> positions.  It starts out for a long time with granulepos=0 even
>>>>>>> though it is encoding audio, then when it starts writing granule
>>>>>>> positions it is not always in sync with the start of the stream.
>>>>>>> Below is a little snippet from a comparison of an original spx file
>>>>>>> to a copied spx file.  Each packet should be 320 samples.
>>>>>>>
>>>>>>> [...]
>>>>>> So... I figured it out, but you may not want to know the answer. ;)
>>>>>>
>>>>>> The first packet is supposed to be interpreted as shorter than the
>>>>>> full frame size.  You calculate what the granulepos of the first page
>>>>>> would normally be, then subtract what it really is to get the delay.
>>>>>>
>>>>>> From above, this is the last packet in the first page.  There are 59
>>>>>> packets per page in this stream (the first 2 packets are headers,
>>>>>> hence the packetno of 60).
>>>>>>> -00:00:01.171: serialno 1626088319, granulepos 18737, packetno 60
>>>>>>> +00:00:01.180: serialno 0000000000, granulepos 18880, packetno 60
>>>>>> speexdec interprets the first packet as having a delay of
>>>>>> 18880-18737=143 samples.  So the first packet should be 320-143=177
>>>>>> samples long, and the decoder discards the first 143 samples of the
>>>>>> first frame.
>>>>>>
>>>>>> None of this is documented except for in the speexenc and speexdec
>>>>>> source code.  From analyzing a Speex-in-FLV sample, it appears that
>>>>>> the way Adobe handles this in Flash Media Server is to do like our ogg
>>>>>> demuxer does and interpret the first page as if each frame is 320
>>>>>> samples, then resync timestamps with the source after the first page,
>>>>>> causing a skip in timestamps after the first page instead of at the
>>>>>> beginning of the stream.
>>>>>>
>>>>>> I'm still not sure what to do about this though...
>>>>> This patch makes it so that all the pts and durations are correct for
>>>>> Ogg/Speex.  It basically just changes the durations of the first and
>>>>> last packets.
>>>> Never mind, this doesn't quite work.  I'm still working on it.  Damn ogg
>>>> and its craziness!
>>> Ok, now this patch should work correctly.
>> After some discussion with xiph people, apparently vorbis does this  
>> exact same thing. The reasoning behind it is that libvorbis/libspeex  
>> generate additional samples to prime the lapped transform. There is  
>> apparently nothing in the vorbis/speex bitstream to indicate how many  
>> samples this is, so instead ogg granulepos is used to figure out how  
>> many samples to skip at the beginning.
> 
> Ouch. Is there a way to pre-parse the first page packets before decoding
> to determine the correct packet durations?

To answer my own question... Yes this can be done.  I stripped down the
header parsing code from vorbis_dec.c as much as I could and ended up
with about 400 lines of code added to oggparsevorbis.c to determine each
Vorbis packet duration.
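
For reference, the first-page arithmetic quoted above (18880 - 18737 = 143,
then 320 - 143 = 177) can be sketched roughly as below.  This is only an
illustration of the calculation, not the code from the patch: the function
names are hypothetical, and it assumes a fixed frame size and that the
number of data packets on the first page is already known.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not FFmpeg API): number of samples the decoder
 * should discard from the first frame.  Assumes every data packet
 * nominally holds frame_size samples. */
static int64_t first_page_delay(int64_t page_granulepos,
                                int data_packets, int frame_size)
{
    assert(frame_size > 0 && data_packets > 0);

    /* What the granulepos of the first page would normally be. */
    int64_t nominal = (int64_t)data_packets * frame_size;

    /* The shortfall relative to the nominal value is the delay. */
    int64_t delay = nominal - page_granulepos;

    /* Assumption: anything outside [0, frame_size) is not the pattern
     * described in this thread, so treat it as "no delay". */
    if (delay < 0 || delay >= frame_size)
        return 0;
    return delay;
}

/* Hypothetical helper: duration of the first data packet in samples. */
static int64_t first_packet_duration(int64_t page_granulepos,
                                     int data_packets, int frame_size)
{
    return frame_size - first_page_delay(page_granulepos,
                                         data_packets, frame_size);
}
```

With the numbers from the sample (granulepos 18737, 59 data packets, 320
samples per frame), first_page_delay() gives 143 and
first_packet_duration() gives 177.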

-Justin