[FFmpeg-devel] AVCHD/H.264 decoder: further development/corrections
Sun Jan 25 20:08:06 CET 2009
Hello * (especially h264 maintainers),
in the last few days I tried to find out what has to be done in the
H.264 decoder in order to correctly support AVCHD files from full-HD
camcorders (which is IMHO quite an important use case). I admit, I'm a
bit selfish there, since I want to get my Panasonic HDC-SD9 fully
supported :-). But I'm also willing to invest more time and to fix the
code. However, I need some advice, preferably fairly detailed.
I identified the following problems and potential solutions:
1. Inconsistency between packets returned via av_read_frame() and
actually delivered full frames from avcodec_decode_video()
2. Key frame calculation and seeking
3. Reporting frame type to libavformat
Now the details:
*1. Inconsistency between packets and decoded frames*
For H.264 streams, av_read_frame() returns AVPackets which contain
either a full frame or just a field (half frame). The former case is not
problematic, since decoded frames map 1:1 to returned packets. It is
problematic, though, when the returned packets DO NOT correspond to a
full frame. This is the case for interlaced AVCHD video as produced by
various full-HD camcorders (at least Panasonic, Sony and Canon). The
H.264 standard allows coding by field, so one picture in H.264 terms (as
currently returned as an AVPacket from av_read_frame()) can contain
either a single field, two fields (a frame) or even repeated fields (so
in total 1-3 fields per AVPacket).
I'd concentrate first on H.264 pictures having 1 to 2 fields only, since
the other case (3 fields per picture) is probably not that interesting
now (it is used to stretch the frame rate of original cinema material to
television frame rates).
Although the decoder itself takes this into account, the interface in
libavformat doesn't. Thus, currently only video having full frames per
packet decodes really correctly (and even this only with a
not-yet-applied patch concerning frame types). Reason: av_read_frame()
doesn't return whole frames, although it is documented to do so.
*Potential solution:* For field pictures, delay returning a packet from
h264_parse() until the second field picture has also been read. The
decoder should then take care of decoding both fields correctly and
returning a full frame for each packet.
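To make the idea concrete, here is a rough sketch of the pairing logic
in Python (not FFmpeg code). The Packet class, its fields and
pair_fields() are made-up names for illustration only. It assumes two
consecutive field packets always belong to the same frame; a real parser
would also have to check frame_num and field parity.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class Packet:
    """Hypothetical stand-in for an AVPacket; not the real structure."""
    data: bytes
    pts: Optional[int]  # None when the stream carries no timestamp
    is_field: bool      # True if the packet holds a single field picture


def pair_fields(packets: Iterable[Packet]) -> Iterator[Packet]:
    """Yield one packet per full frame, merging consecutive field packets."""
    pending: Optional[Packet] = None
    for pkt in packets:
        if not pkt.is_field:
            # Frame packet: flush an unpaired first field unchanged, if any.
            if pending is not None:
                yield pending
                pending = None
            yield pkt
        elif pending is None:
            # First field: hold it back until its partner arrives.
            pending = pkt
        else:
            # Second field: append it and keep the first field's timestamp.
            yield Packet(pending.data + pkt.data, pending.pts, False)
            pending = None
    if pending is not None:
        # Stream ended after a lone first field.
        yield pending


packets = [
    Packet(b"T0", 0, True),     # top field, carries the timestamp
    Packet(b"B0", None, True),  # bottom field, no timestamp of its own
    Packet(b"F1", 40, False),   # ordinary frame packet
]
merged = list(pair_fields(packets))
```

With this input, the two field packets come out as a single packet
carrying the PTS of the first field, followed by the frame packet.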
*Alternative solution:* Return the field packet from h264_parse()
immediately, but somehow tell libavformat that the packet does not
represent a full frame and that the second field has to be read as well.
Read it in libavformat, extending the existing packet. Thus,
av_read_frame() then returns a full frame.
*No solution:* Leave libavformat and h264_parse() as-is and take care of
the second half-frame in ffmpeg.c and other libavformat users. This
won't work, as we would need to adjust the API and thus every single
program using ffmpeg to correctly handle field pictures. Further,
libavformat computes wrong DTS/PTS for the second field (equal to the
DTS/PTS of the first field of the _next_ frame instead of in-between,
since the second field doesn't specify DTS/PTS at all), which causes
do_video_out() to drop and duplicate frames, producing very jerky video.
*No solution 2:* Tell libavformat which field of the full frame the
returned packet contains and adjust the DTS/PTS calculation in
compute_pkt_fields() appropriately, returning last_DTS+duration/2 and
last_PTS+duration/2 as DTS/PTS of the second field. Again, this is an
API change, since av_read_frame() would not return full frames. Though
it works in ffmpeg.c, it is unclear whether it works in other programs
using libavformat (probably not).
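Just to illustrate the arithmetic of that second-field timestamp (a
Python sketch, not the compute_pkt_fields() code; the function name and
the example numbers are assumptions). With a 90 kHz time base and 25
frames per second, one frame lasts 3600 ticks, so the second field sits
1800 ticks after the first:

```python
def second_field_timestamps(last_dts: int, last_pts: int, duration: int):
    """Return (dts, pts) of the second field, half a frame after the first.

    last_dts/last_pts are the timestamps of the first field and duration
    is the duration of a full frame, all in the same time base.
    """
    half = duration // 2
    return last_dts + half, last_pts + half


# Assumed example: 90 kHz time base, 25 fps, so duration is 3600 ticks.
dts2, pts2 = second_field_timestamps(last_dts=7200, last_pts=10800,
                                     duration=3600)
```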
Now the question: Which solution is the "right" one? I'd go for the
first one or possibly for the alternative. The first proposed solution
seems to be the most "compatible", since we don't need to extend
AVPacket to address the issue.
Your opinions? Or perhaps a different idea?
*2. Key frame calculation and seeking*
H.264 differs from other video codecs in that it doesn't have fixed
key frames. Instead, several reference pictures from the history can be
used to decode a particular picture. There are IDR pictures, which are
effectively key frames, but these do not seem to be used much. AVCHD
files from camcorders have exactly one IDR frame at the beginning of the
Other than that, the stream provides information (SEI recovery point) on
how many frames need to be decoded before the video synchronizes
starting at the given point. There is already a field,
AVPacket.convergence_duration, which is supposed to address exactly this
(until now unused in h264, though).
My suggestion is to report key frames for IDR pictures and for the
appropriate frames after an SEI recovery point (after counting down the
number of frames given in the recovery point SEI message).
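The countdown could look roughly like this (a Python sketch under
simplifying assumptions, not decoder code). The dictionary keys and
mark_key_frames() are made-up names; recovery_frame_cnt stands for the
count carried in the recovery point SEI message. The sketch ignores the
difference between decoding and output order and ignores a recovery
point that arrives while an earlier countdown is still running.

```python
from typing import List, Optional


def mark_key_frames(frames: List[dict]) -> List[bool]:
    """Return one key-frame flag per frame.

    Each frame is a dict with 'idr' (bool) and 'recovery_frame_cnt'
    (int, or None if the frame has no recovery point SEI message).
    """
    flags = []
    countdown: Optional[int] = None  # frames left until recovery completes
    for frame in frames:
        cnt = frame.get("recovery_frame_cnt")
        if cnt is not None and countdown is None:
            countdown = cnt  # start counting at the recovery point
        key = False
        if frame["idr"]:
            key = True
            countdown = None  # an IDR picture resets everything
        elif countdown is not None:
            if countdown == 0:
                key = True    # recovery complete: report a key frame
                countdown = None
            else:
                countdown -= 1
        flags.append(key)
    return flags


stream = [
    {"idr": True, "recovery_frame_cnt": None},
    {"idr": False, "recovery_frame_cnt": None},
    {"idr": False, "recovery_frame_cnt": 2},   # recovery point, 2 frames
    {"idr": False, "recovery_frame_cnt": None},
    {"idr": False, "recovery_frame_cnt": None},
    {"idr": False, "recovery_frame_cnt": None},
]
flags = mark_key_frames(stream)
```

Here the IDR picture and the frame two positions after the recovery
point are flagged as key frames; a recovery point with a count of 0
would be flagged itself.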
Alternatively, key frames could be reported for IDR pictures and for
pictures having a recovery point. In this case, the application would
have to handle it via AVPacket.convergence_duration. Unfortunately, no
one seems to handle convergence_duration in an application, and I don't
believe anyone would like to. So IMHO this is a no-go.
My suggestion for the current av_seek_frame() would be to do the
following for streams needing convergence_duration (is there a flag for
it already?): When seeking to a certain PTS, seek to the frame with the
given PTS and then roll *backward* until the last frame with a recovery
point whose convergence_duration <= distance is found (how to find it
most efficiently?) and then re-decode all reference frames (i.e.,
leaving out unneeded B-frames) into dummy buffers from this point on
until just before the given PTS. So the next av_read_frame() will read a
key frame, which can be decoded correctly. In this way, the application
doesn't have to handle convergence_duration by itself.
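The backward search part could be sketched as follows (Python, only to
show the condition; find_seek_start() and the tuple layout are
assumptions, and a plain linear scan is used, which does not answer the
question of how to do it efficiently). "distance" is taken to be the
target PTS minus the PTS of the recovery point.

```python
from typing import List, Optional, Tuple

# (pts, byte position, convergence_duration) of a frame carrying a
# recovery point; a hypothetical layout, all times in one time base.
RecoveryPoint = Tuple[int, int, int]


def find_seek_start(points: List[RecoveryPoint],
                    target_pts: int) -> Optional[RecoveryPoint]:
    """Return the latest recovery point from which decoding has converged
    by target_pts, or None if there is none.

    points must be sorted by ascending PTS.
    """
    for point in reversed(points):
        pts, _pos, convergence_duration = point
        distance = target_pts - pts
        if distance >= 0 and convergence_duration <= distance:
            return point
    return None


points = [(0, 0, 0), (1000, 50000, 300), (2000, 98000, 500)]
near = find_seek_start(points, 2200)  # too close to the point at 2000
far = find_seek_start(points, 2500)   # the point at 2000 is usable
```

For a target of 2200 the recovery point at 2000 is too close (it needs
500 ticks to converge), so decoding would have to start from the one at
1000 and run until just before the target.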
Michael suggested a new seeking API. Maybe this should be addressed
there via a flag (either seek to the frame with the recovery point and
use convergence_duration in the application, or let libavformat do the
decoding up to the key frame as described above), but for now, an
alternative needs to be implemented for the current seeking API.
Further, I'd propose keeping a small cache of (PTS, position,
convergence_duration) triples for frames containing an SEI recovery
point message, so that seeking around the "current" location would be
faster.
Reason: video editing software, where we often need to seek one frame
*3. Reporting frame type to libavformat*
This is a minor thing, but still important for correct computation of
PTS/DTS and key frame flags. compute_pkt_fields() relies on having the
information about picture type (I/P/B-frame). However, H.264 doesn't
have strict I/P/B frames; it is even possible to have mixed-type slices
inside one frame. Indeed, in interlaced mode my camcorder produces the
top field as an I-slice and the bottom field as a P-slice referring to the
So my suggestion is to report picture type I-frame for key frames (which
frames are key frames is discussed above) and to report P-frame for all
frames containing only P- and I-slices. Other frames, which also contain
B-slices, will be reported as B-frames.
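As a small sketch of this rule (Python, for illustration only;
picture_type() and the string labels are made-up, and SP/SI slices are
not considered):

```python
from typing import Iterable


def picture_type(slice_types: Iterable[str], is_key_frame: bool) -> str:
    """Map the slice types found in one frame to a reported picture type.

    slice_types holds "I", "P" and/or "B" for the slices of the frame
    (both fields in the interlaced case).
    """
    if is_key_frame:
        return "I"   # key frames are always reported as I-frames
    if "B" in set(slice_types):
        return "B"   # any B-slice makes the whole frame a B-frame
    return "P"       # only P- and I-slices, but not a key frame


# Top field coded as I-slice, bottom field as P-slice, not a key frame.
mixed = picture_type(["I", "P"], is_key_frame=False)
```

Note that under this rule a frame consisting only of I-slices is still
reported as a P-frame unless it is a key frame.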
Thanks in advance.