[FFmpeg-devel] Conversion of Graphic Subtitles to Text Subtitles

Soft Works softworkz at hotmail.com
Wed Sep 22 23:26:48 EEST 2021


Hi,

since several people have contacted me off-list asking whether converting
graphical subtitles to text works and how well, I wanted to say a few words
about the 'graphicsub2text' filter, which is part of my latest subtitle
filtering patchset.

From previous experience with OCR in general, I had considerable doubts about
whether this would lead to anything really useful when I started working on
this filter. I have been surprised by the results I have been able to
achieve, which exceed my expectations.

In the few tests I have done so far, the recognition accuracy has been about
99%. That may not hold up in all cases, but it shows that the filter is
usable in practice.
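
For anyone who wants to put a number on accuracy for their own material, one
common measure is character-level accuracy against a hand-checked reference.
The following is a minimal Python sketch of that measure; it is not how the
figure above was obtained, and the function names are purely illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum inserts, deletes and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete ca
                           cur[j - 1] + 1,       # insert cb
                           prev[j - 1] + cost))  # substitute (or match)
        prev = cur
    return prev[-1]


def char_accuracy(reference, recognized):
    """1 - (edit distance / reference length), clamped at 0."""
    if not reference:
        return 1.0 if not recognized else 0.0
    errors = edit_distance(reference, recognized)
    return max(0.0, 1.0 - errors / len(reference))
```

With this measure, a single wrong character in a 100-character reference
gives an accuracy of 0.99.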

The current version of the 'graphicsub2text' filter combines the recognized
text from all simultaneously shown bitmap rects as paragraphs of an ASS
subtitle event.
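
To illustrate the idea (this is not the filter's actual code, which is part
of the patchset), here is a rough Python sketch that joins the text of
several rects into one ASS Dialogue line. It assumes the ASS hard line break
\N as the separator and a style named "Default"; the separator the filter
really uses may differ, and the function names are made up for this example.

```python
def ass_timestamp(ms):
    """Format milliseconds as an ASS timestamp (H:MM:SS.cc, centiseconds)."""
    cs = ms // 10
    h, rem = divmod(cs, 360000)
    m, rem = divmod(rem, 6000)
    s, cs = divmod(rem, 100)
    return "%d:%02d:%02d.%02d" % (h, m, s, cs)


def build_dialogue(rect_texts, start_ms, end_ms, style="Default"):
    """Combine the recognized text of all rects into one Dialogue line."""
    paragraphs = []
    for text in rect_texts:
        # OCR output may contain newlines and stray whitespace
        lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
        if lines:
            paragraphs.append("\\N".join(lines))
    body = "\\N".join(paragraphs)
    # Fields: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV,
    # Effect, Text
    return "Dialogue: 0,%s,%s,%s,,0,0,0,,%s" % (
        ass_timestamp(start_ms), ass_timestamp(end_ms), style, body)


print(build_dialogue(["All I ever wanted", "to be was an actress."],
                     1000, 3500))
```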

Possible future improvements:

- Detect the colors of words and apply them as inline styles to the
  ASS text
- Do the same for font sizes; even letter, word and line spacing might
  be possible to replicate
- Take the actual positions of the bitmaps plus the relative positions
  of recognized text blocks within these bitmap rects and use them for
  positioning the ASS text via inline styles
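
As a rough idea of what such inline styles could look like, here is a small
Python sketch that emits the standard ASS override tags for color (\c),
font size (\fs) and position (\pos). None of this exists in the filter yet;
the helper names and the example colors are hypothetical, and \pos assumes
that the script resolution matches the coordinate space of the bitmaps.

```python
def ass_color_tag(r, g, b):
    # ASS color overrides use blue-green-red byte order: {\c&HBBGGRR&}
    return "{\\c&H%02X%02X%02X&}" % (b, g, r)


def ass_size_tag(size):
    # {\fs<size>} sets the font size from this point on
    return "{\\fs%d}" % size


def ass_pos_tag(x, y):
    # {\pos(x,y)} places the event at script coordinates (x, y)
    return "{\\pos(%d,%d)}" % (x, y)


def colorize(words):
    """words: list of (text, (r, g, b)); a tag is emitted on color change."""
    parts, last = [], None
    for text, rgb in words:
        prefix = ass_color_tag(*rgb) if rgb != last else ""
        parts.append(prefix + text)
        last = rgb
    return " ".join(parts)


# Hypothetical input: two speakers shown in yellow and cyan
yellow, cyan = (255, 255, 0), (0, 255, 255)
print(ass_pos_tag(360, 500) + colorize(
    [("Want", yellow), ("one?", yellow), ("Yeah,", cyan), ("please.", cyan)]))
```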

Even in its current state, though, the results are useful.

Below, I'm showing an ffmpeg log from converting a publicly available
test video. If you'd like to take a look at the converted result without
compiling, feel free to contact me.

Regards,
softworkz


Command and log (slightly edited for clarity):

ffmpeg -y -loglevel verbose -i "https://streams.videolan.org/streams/ts/video_subs_ttxt%2Bdvbsub.ts" -filter_complex "[0:13]graphicsub2text=language=eng:ocr_mode=both" -c:v libx265 -c:s ass output.mkv

[tcp] Starting connection attempt to 213.36.253.119 port 443
[tcp] Successfully connected to 213.36.253.119 port 443
[tcp @ 000001567A2188C0] Starting connection attempt to 213.36.253.119 port 443
[tcp @ 000001567A2188C0] Successfully connected to 213.36.253.119 port 443
Input #0, mpegts, from 'https://streams.videolan.org/streams/ts/video_subs_ttxt%2Bdvbsub.ts':
  Duration: 00:02:21.72, start: 458.752189, bitrate: 6010 kb/s
  Program 6301 
    Metadata:
      service_name    : BBC 1 London
      service_provider: BSkyB
  Stream #0:1[0x1388]: Video: mpeg2video (Main), 1 reference frame ([2][0][0][0] / 0x0002), yuv420p(tv, top first, left), 720x576 [SAR 64:45 DAR 16:9], 7980 kb/s, 25 fps, 25 tbr, 90k tbn
    Side data:
      cpb: bitrate max/min/avg: 7980000/0/0 buffer size: 1835008 vbv_delay: N/A
  Stream #0:2[0x1389](eng): Audio: mp2 ([3][0][0][0] / 0x0003), 48000 Hz, stereo, fltp, 256 kb/s
  Stream #0:3[0x138a](NAR): Audio: mp2 ([3][0][0][0] / 0x0003), 48000 Hz, stereo, fltp, 256 kb/s
  Stream #0:4[0x138b](eng,eng): Subtitle: dvb_teletext ([6][0][0][0] / 0x0006)
  Stream #0:5[0x902]: Unknown: none ([5][0][0][0] / 0x0005)
  Stream #0:6[0x903]: Unknown: none ([5][0][0][0] / 0x0005)
  Stream #0:7[0x904]: Unknown: none ([5][0][0][0] / 0x0005)
  Stream #0:8[0x905]: Unknown: none ([5][0][0][0] / 0x0005)
  Stream #0:9[0x907]: Unknown: none ([5][0][0][0] / 0x0005)
  Stream #0:10[0x908]: Unknown: none ([5][0][0][0] / 0x0005)
  Stream #0:11[0x909]: Unknown: none ([5][0][0][0] / 0x0005)
  Stream #0:12[0x90a]: Unknown: none ([5][0][0][0] / 0x0005)
  Stream #0:13[0x138c](eng): Subtitle: dvb_subtitle ([6][0][0][0] / 0x0006)
  Program 6318 
...
  Stream #0:0[0x12]: Data: epg
[Parsed_graphicsub2text_0] Initializing libtesseract, version: 4.1.1
Stream mapping:
  Stream #0:13 (dvbsub) -> graphicsub2text (graph 0)
  graphicsub2text (graph 0) -> Stream #0:0 (ass)
  Stream #0:1 -> #0:1 (mpeg2video (native) -> hevc (libx265))
  Stream #0:2 -> #0:2 (mp2 (native) -> vorbis (libvorbis))
Press [q] to stop, [?] for help
[mpegts @ 000001567A19DBC0] Correcting start time by 29244
[graph_2_in_0_2 @ 000001567A1C5E80] tb:1/48000 samplefmt:s16p samplerate:48000 chlayout:0x3
[format_out_0_2 @ 000001567A1C6240] auto-inserting filter 'auto_aresample_0' between the filter 'Parsed_anull_0' and the filter 'format_out_0_2'
[graph 1 video input from stream 0:1 @ 000001567C6EC800] w:720 h:576 pixfmt:yuv420p tb:1/90000 fr:25/1 sar:64/45
...
Output #0, matroska, to 'output.mkv':
  Metadata:
    encoder         : Lavf59.6.100
  Stream #0:0: Subtitle: ass
    Metadata:
      encoder         : Lavc59.7.103 ass
  Stream #0:1: Video: hevc, 1 reference frame, yuv420p(tv, top coded first (swapped), left), 720x576 (0x0) [SAR 64:45 DAR 16:9], q=2-31, 25 fps, 1k tbn
    Metadata:
      encoder         : Lavc59.7.103 libx265
    Side data:
      cpb: bitrate max/min/avg: 0/0/0 buffer size: 0 vbv_delay: N/A
  Stream #0:2(eng): Audio: vorbis (oV[0][0] / 0x566F), 48000 Hz, stereo, fltp
    Metadata:
      encoder         : Lavc59.7.103 libvorbis
[Parsed_graphicsub2text_0] Initializing libtesseract, version: 4.1.1
subtitle input filter: decoding size 720x576
[graph 0 subtitle input from stream 0:13] graphical subtitles - w:0 h:0 tb:1/90000
[Parsed_graphicsub2text_0] OCR Result: Look, I've got to go, I've got
[Parsed_graphicsub2text_0] OCR Result: to set up a shoot for a rock band
[Parsed_graphicsub2text_0] OCR Result: tonight. So, um... Think about it,
[Parsed_graphicsub2text_0] OCR Result: give us a call any time. See you.
[Parsed_graphicsub2text_0] OCR Result: All I ever wanted
[Parsed_graphicsub2text_0] OCR Result: to be was an actress.
[Parsed_graphicsub2text_0] OCR Result: Theatre and films, you know?
[Parsed_graphicsub2text_0] OCR Result: But it's hard if you haven't got the
[Parsed_graphicsub2text_0] OCR Result: connections to open the right doors.
[Parsed_graphicsub2text_0] OCR Result: You have to stick at it,
[Parsed_graphicsub2text_0] OCR Result: ignore rejections.
[Parsed_graphicsub2text_0] OCR Result: A few years ago,
[Parsed_graphicsub2text_0] OCR Result: I was really short of cash
[Parsed_graphicsub2text_0] OCR Result: and the guy that did my
[Parsed_graphicsub2text_0] OCR Result: publicity photos suggested
[Parsed_graphicsub2text_0] OCR Result: I do some glamour stuff.
[Parsed_graphicsub2text_0] OCR Result: Nothing, you know, pornographic,
[Parsed_graphicsub2text_0] OCR Result: just lingerie shots.
[Parsed_graphicsub2text_0] OCR Result: Anyway, I knew him and he
[Parsed_graphicsub2text_0] OCR Result: seemed genuine enough so I did.
[Parsed_graphicsub2text_0] OCR Result: Just the once.
[Parsed_graphicsub2text_0] OCR Result: And?
[Parsed_graphicsub2text_0] OCR Result: It was great.
[Parsed_graphicsub2text_0] OCR Result: I enjoyed it, after a while.
[Parsed_graphicsub2text_0] OCR Result: He made me laugh.
[Parsed_graphicsub2text_0] OCR Result: We had a few drinks and
[Parsed_graphicsub2text_0] OCR Result: it got me relaxed.
[Parsed_graphicsub2text_0] OCR Result: It was a bit like acting, I suppose.
[Parsed_graphicsub2text_0] OCR Result: Problem was,
[Parsed_graphicsub2text_0] OCR Result: I got a bit carried away.
[Parsed_graphicsub2text_0] OCR Result: I ended up doing some nude stuff.
[Parsed_graphicsub2text_0] OCR Result: He said he wouldn't use them,
[Parsed_graphicsub2text_0] OCR Result: it was just between us,
[Parsed_graphicsub2text_0] OCR Result: just a bit of fun, you know.
[Parsed_graphicsub2text_0] OCR Result: Anyway, I'm not stupid so
[Parsed_graphicsub2text_0] OCR Result: I asked for the negatives.
[Parsed_graphicsub2text_0] OCR Result: I just put them in a drawer
[Parsed_graphicsub2text_0] OCR Result: and forgot about them.
[Parsed_graphicsub2text_0] OCR Result: Until last week.
[Parsed_graphicsub2text_0] OCR Result: There was a nice article
[Parsed_graphicsub2text_0] OCR Result: in the local paper.
[Parsed_graphicsub2text_0] OCR Result: "Letherbridge Girl's Name In
[Parsed_graphicsub2text_0] OCR Result: Lights At Last" kind of thing.
[Parsed_graphicsub2text_0] OCR Result: And the next day, in the post,
[Parsed_graphicsub2text_0] OCR Result: I got a bunch of photos of me with...
[Parsed_graphicsub2text_0] OCR Result: You know. So the photographer had
[Parsed_graphicsub2text_0] OCR Result: kept some of the negatives. Yeah.
[Parsed_graphicsub2text_0] OCR Result: How stupid can you get, eh?
[Parsed_graphicsub2text_0] OCR Result: And now he's asking for money?
[Parsed_graphicsub2text_0] OCR Result: Yeah.
[Parsed_graphicsub2text_0] OCR Result: Or he goes to the tabloids.
[Parsed_graphicsub2text_0] OCR Result: Want one? Yeah, please.
[Parsed_graphicsub2text_0] OCR Result: So what was all that about?
[Parsed_graphicsub2text_0] OCR Result: What?
[mpegts] PES packet size mismatch
[mpegts] Packet corrupt (stream = 2, dts = 54008409).
[mpeg2video] ac-tex damaged at 17 29
[mpeg2video] Warning MVs not available
encoded 3532 frames in 479.91s (7.36 fps), 262.28 kb/s, Avg QP:33.26
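
To review the recognized text without going through the whole log, the
"OCR Result" lines can be pulled out with a few lines of Python. This is
only a sketch based on the line format shown above; since the log was
slightly edited, the pattern also tolerates an optional " @ <address>" part
in the prefix, which is an assumption about the unedited output.

```python
import re

# Matches e.g. "[Parsed_graphicsub2text_0] OCR Result: Just the once."
OCR_LINE = re.compile(
    r"^\[Parsed_graphicsub2text_\d+(?: @ [0-9A-Fa-fx]+)?\] OCR Result: (.*)$")


def extract_ocr_results(log_text):
    """Return the recognized text of every OCR Result line, in order."""
    results = []
    for line in log_text.splitlines():
        match = OCR_LINE.match(line.strip())
        if match:
            results.append(match.group(1).rstrip())
    return results


sample = """\
[Parsed_graphicsub2text_0] OCR Result: Just the once.
[Parsed_graphicsub2text_0] OCR Result: And?
[mpegts] PES packet size mismatch
"""
for text in extract_ocr_results(sample):
    print(text)
```

ffmpeg writes its log to stderr, so the input for this can be captured by
appending "2> log.txt" to the command line.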

