[FFmpeg-devel] [PATCH 5/5] startcode: Filter out non-startcodes earlier

Andriy Gelman andriy.gelman at gmail.com
Mon Jul 1 02:53:48 EEST 2019


Andreas, 

On Sun, 09. Jun 13:00, Andreas Rheinhardt wrote:
> Up until now, the bitmasks used to initially find out when one needs
> to take a closer look and search for startcodes were rather primitive:
> If a block (of four or eight bytes, depending on the system) contained a
> zero, it was treated as a target for closer inspection.
> This can be improved: Using the same technique of bitmasking as before,
> one can test whether an arbitrary continuous range of bits consists of
> zeros alone (and one can or the results of tests for non-overlapping
> ranges of bits). A simple improvement would consist in simply testing
> for whether every even byte does not vanish. But even better masks are
> possible, in particular for big endian systems:
> If all of the bits called 'a' or all of the bits called 'b' in the
> uint32_t aaaa aaaa aaaa aaax bbbb bbbb bbbb bbby vanish, then this is
> a candidate for closer inspection. If not, then it follows that
> the three least serious bytes can't be the position of the 0x01 byte of
> a 0x00 0x00 0x01 startcode. And if the most significant byte is the
> position of the 0x01 of a startcode and if the same test is performed on
> each block of four bytes in sequence, then all the b bits (as well as
> the y bit) of the previous uint32_t would vanish and it would have been
> considered a candidate for closer inspection. In other words: All
> startcodes can be found with this test. The amount of candidates is
> thereby reduced by a factor of about 256 (from about 4/2^8 to about
> 2/2^15). For eight bytes at a time, the patterns are simply applied to
> the high and low 32 bits at the same time.
> Unfortunately, not so much is possible for little-endian systems because
> continuous blocks of bits in the input byte array won't be continuous
> blocks in the integer read. But one can nevertheless even improve upon
> the "check every even/odd byte" test using the following mask:
> aaaa aaaa aaaa bbbb bbbb bbbb cccc cccc
> If any of these three blocks of bits vanishes, the block is a candidate
> for further inspection. In the byte array, this mask corresponds to the
> following situation:
> cccc cccc bbbb bbbb aaaa bbbb aaaa aaaa
> If a startcode's 0x01 is at the second or third position, the c block
> vanishes; if it is at the fourth position, the b block vanishes and if
> it is at the first position, the a block of the previous four-byte block
> vanishes. (This derivation just made use of the fact that there must be
> two vanishing bytes in sequence; I couldn't find a way to utilize that
> the third byte also has nearly no bits set.) So all startcodes can be
> found with this mask. Doubling this mask yields a mask for eight bytes
> at a time. I haven't found a better mask for eight bytes than this one.
> Using this, one can detect reduce the amount of false positives by about
> 72% compared to the earlier check whether one of the bytes vanishes.
> 
> Being more effective at initial filtering makes the whole process more
> efficient, of course. Here are some benchmarks for a 30.2 Mb/s transport
> stream A (10 iterations of 8192 runs each) and another 7.4 Mb/s transport
> stream B (10 iterations of 131072 runs each) on an x64; except for
> "vanilla" all versions tested didn't return false positives.
> 
>   |   vanilla   |  aligned  |  aligned   |
>   | (unaligned) | bad masks | good masks |
> --|-------------|-----------|------------|
> A |    411578   |   323355  |    233494  |
> B |     55476   |    44147  |     31793  |
> 
> "vanilla" is the version before this patchset, i.e. before the switch to
> aligned reads. "aligned bad masks" is the version before this commit.
> "aligned good masks" is the current version.
> 
> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt at gmail.com>
> ---
>  libavcodec/startcode.c | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/libavcodec/startcode.c b/libavcodec/startcode.c
> index b29c1490c0..fa506a37d1 100644
> --- a/libavcodec/startcode.c
> +++ b/libavcodec/startcode.c
> @@ -95,10 +95,18 @@ int ff_startcode_find_candidate_c(const uint8_t *buf, int size)
>  
>  #if HAVE_FAST_64BIT
>      INITIALIZATION(8);
> -    MAIN_LOOP(64, 0x0101010101010101ULL, 0x8080808080808080ULL);
> +#if HAVE_BIGENDIAN
> +    MAIN_LOOP(64, 0x0002000200020002ULL, 0x8000800080008000ULL);
> +#else
> +    MAIN_LOOP(64, 0x0010010100100101ULL, 0x8008008080080080ULL);
> +#endif
>  #else
>      INITIALIZATION(4);
> -    MAIN_LOOP(32, 0x01010101U, 0x80808080U);
> +#if HAVE_BIGENDIAN
> +    MAIN_LOOP(32, 0x00020002U, 0x80008000U);
> +#else
> +    MAIN_LOOP(32, 0x00100101U, 0x80080080U);
> +#endif
>  #endif
>  
>      /* For the end, buf is the earliest possible position
> -- 
> 2.21.0

Below are the results of this patchset on Armv7 using OdroidXU4. 
OdroidXU4 has 8 cores - 4xARM Cortex A15 and 4xARM Cortex A7. Only the A15 cores were used for this test. 

Test sequences:
Sequence C - 2560x1440, 20.4Mb/s, 8105 frames      
Source: https://www.youtube.com/watch?v=1La4QzGeaaQ

Sequence D - 1920x1080, 4.8Mb/s, 18808 frames  
Source: https://www.youtube.com/watch\?v\=LXb3EKWsInQ

Averaged results:
Sequence C - 10 iterations with 16384  runs
Sequence D - 10 iterations with ~32768 runs

  |   before    | first patch | full patchset |
  |  patchset   |  only - 1/5 |     5/5       | 
  | (decicyles) | (decicycles)|  (decicyles)  |
--|-------------|-------------|---------------|
C |   910181    |   864827    |    861376     |
D |    74798    |    70529    |     69574     |   

I also have a Jetson Nano that uses Armv8. Let me know if you need these simulations on 64bit Arm.

Andriy


More information about the ffmpeg-devel mailing list