[MPlayer-dev-eng] patch: video filter to OCR subtitles for mplayer

Tue Nov 25 19:33:03 CET 2003

A patch that OCRs subtitles from movies is here:
http://www.ee.oulu.fi/~tuukkat/mplayer/MPlayer-20031114.ocr.patch
Some documentation, also shown below:
http://www.ee.oulu.fi/~tuukkat/mplayer/ocrdoc.txt

I didn't attach the patch, since it's 27 kilobytes and probably not yet
ready for inclusion. The filtering code itself is, I think, finished, but
the interface between mplayer and the filter is clumsy, because I don't
know mplayer internals. For example, where do I get the current frame
number? Currently I just count the frames that pass the filter, but of
course that goes wrong if the user skips some parts (e.g. with cursor keys).
Also, it didn't appear to work with MEncoder; to be investigated...

But I'd like to hear comments.

Automatic Optical Character Recognition of Video Subtitles
----------------------------------------------------------
Or -vf threshold,ocr for MPlayer.

This document briefly describes the two MPlayer filters
named "threshold" and "ocr" whose purpose is to extract and OCR
subtitles from movies.

The filters have a large number of parameters, which should
be tuned properly for optimal performance. The default values
are tuned for Finnish television broadcasts.

threshold
---------
This filter takes in a BGR32 image and binarizes it, setting
subtitles to full white and everything else to black. It
performs several image processing steps to recognize subtitles
and discard everything else. Here are the main operations:
1. Edges in each frame are enhanced using Roberts masks.
   A new image is created based on absolute values of edge strengths.
2. A difference image is created by subtracting the previous image
   from the newest frame and taking the absolute value.
3. The image is thresholded: all colors with distance larger than DIST
   from the specified color (RED,GREEN,BLUE) are set to black,
   other pixels are set to white.
4. The image is labeled, i.e. connected white regions are numbered
   as 1,2,3,... and the pixel values are set based on the region number.
   All pixels in the same region get the same value.
5. A map between the newest and previous labeled image is created based
   on spatially overlapping regions (aka blobs aka segments). For example,
   if there is a white box in the middle of the screen in the newest image,
   it might be labeled as 7. If there were a circle in previous image that
   spatially overlaps with the box and it was labeled as 5, then the map
   would map 7 to 5. There are two special cases: if a region in the newest
   image overlaps with nothing in the previous image, the map is set to
   zero, and if a region overlaps with multiple regions in the previous
   image, the map is set to -1.
6. Several features are computed for each region in the newest image based
   on the results of steps 1-5. These include bounding box, area,
   boundary length, average change of pixel color, number of pixels that
   differ from the corresponding region in the previous image (using the map
   from step 5) divided by total area, and average edge strength, which is
   the sum of pixels in the edge strength image computed at step 1 divided
   by edge length.
7. Based on the features, regions that don't appear to be letters are
   rejected. Specifically, the following tests are used:
   - If a region touches the edges of the whole image, it is rejected.
   - If the area is too LARGE or too SMALL, it is rejected.
   - If the bounding box is too NARROW or too WIDE (horizontally) it is rejected.
   - If the bounding box is too SHORT or too TALL (vertically) it is rejected.
   - If the average pixel color difference is more than DIFFER, it is rejected.
   - If the edge length divided by area is more than EDGERAT, it is rejected.
   - If the edge strength is more than EDGESTRENGTH, it is rejected.
   - If the number of pixels that differ from the previous overlapping
     region is more than TDIFFER, it is rejected.
   The rejected regions are cleared (set to black) from the labeled image.
8. A new BGR32 image is created and passed on in the MPlayer filter chain.
   Each pixel is set to white if it belongs to a non-rejected region and to
   black otherwise. Ideally, characters will be white and everything else
   rejected and black.
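
As an illustration of step 5, here is a minimal C sketch of how such a
map could be built. It is not taken from the patch: the function name
build_region_map and the data layout (one int label per pixel, 0 for
background, labels 1..nlabels) are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not from the patch.
 * cur and prev are label images of n pixels each (0 = background).
 * map needs room for nlabels + 1 entries. After the call:
 *   map[r] == 0   region r overlaps nothing in the previous image
 *   map[r] == -1  region r overlaps several different previous regions
 *   map[r] == p   region r overlaps only previous region p */
void build_region_map(const int *cur, const int *prev, size_t n,
                      int nlabels, int *map)
{
    for (int r = 0; r <= nlabels; r++)
        map[r] = 0;

    for (size_t i = 0; i < n; i++) {
        int c = cur[i], p = prev[i];
        assert(c >= 0 && c <= nlabels);
        if (c == 0 || p == 0)
            continue;           /* background on either side: no overlap */
        if (map[c] == 0)
            map[c] = p;         /* first overlapping region seen */
        else if (map[c] != p)
            map[c] = -1;        /* a second, different region: ambiguous */
    }
}
```

With the box/circle example from step 5, every pixel labeled 7 that lies
on a pixel labeled 5 in the previous image leaves map[7] == 5.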

ocr
---
This filter detects when subtitles appear or disappear and calls an external
program to do the actual character recognition. It writes the extracted
subtitles in MicroDVD format into the file "dumpsub.txt".
For each image, the filter counts the pixels that go from black to white
(on pixels) and from white to black (off pixels) compared to the last frame.
If the total number of changed pixels is more than the given THRESHOLD value,
it is assumed that subtitles either appear (on>off) or disappear.
If they appear, nothing much is done: just the image and the frame number
are saved.
If they disappear, first a cleaned image is created by ANDing the last image
and the image where the subtitles appeared: this removes even more moving
non-characters which were not rejected in the first filter (long-term
cleaning). Then the image is written into a temporary PGM file, gocr is
called, and the OCR'd text is read back. A new line is written into the
subtitle file, containing the begin frame number, the end frame number, and
the OCR'd subtitle.
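
The appear/disappear test and the long-term cleaning described above can
be sketched roughly as follows. This is an illustration only, not the
code in the patch: the names detect_change and and_clean, the enum, and
the one-byte-per-pixel layout (0 = black, nonzero = white) are assumed.

```c
#include <assert.h>
#include <stddef.h>

enum sub_event { SUB_NONE, SUB_APPEAR, SUB_DISAPPEAR };

/* Count on/off pixels between two binarized frames and classify the
 * change. threshold plays the role of the THRESHOLD option. */
enum sub_event detect_change(const unsigned char *prev,
                             const unsigned char *cur,
                             size_t n, long threshold)
{
    long on = 0, off = 0;
    assert(prev && cur);
    for (size_t i = 0; i < n; i++) {
        if (!prev[i] && cur[i])
            on++;               /* black -> white */
        else if (prev[i] && !cur[i])
            off++;              /* white -> black */
    }
    if (on + off <= threshold)
        return SUB_NONE;
    return on > off ? SUB_APPEAR : SUB_DISAPPEAR;
}

/* Long-term cleaning: keep only pixels that are white both in the frame
 * where the subtitle appeared and in the last frame showing it. */
void and_clean(const unsigned char *first, const unsigned char *last,
               unsigned char *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (first[i] && last[i]) ? 255 : 0;
}
```

Anything that moved between the two frames is white in at most one of
them, so the AND removes it while stationary subtitle pixels survive.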

Usage:
------
You should crop the video so that only the subtitles are left, if possible.
Then you should remove noise with a strong hqdn3d filter. I'm using something
like this:

mplayer -vf crop=www:hhh:xxx:yyy,hqdn3d=30:30:30,framestep=5,threshold,ocr,scale

You need to put "scale" at the end of the filter chain, otherwise MPlayer
doesn't know how to display the BGR32 images. You might also need a scale
between framestep and threshold, or run it in two passes like I do.

Don't skip forward/backward when extracting the subtitles, otherwise the
frame numbers written into the subtitle file will be wrong.

Options:
--------
ocr options:
-vf ocr=THRESHOLD:SCALE
where SCALE multiplies frame numbers written to the subtitle file by the
given amount (by default 1).
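
To make the effect of SCALE concrete, here is a small sketch of writing
one MicroDVD line, which has the form {begin}{end}text with "|" marking
line breaks inside a subtitle. The function format_microdvd is
hypothetical, and treating SCALE as an integer is an assumption; the
patch may handle both differently.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format one subtitle as a MicroDVD line into buf.
 * begin/end are frame counts as seen by the filter; both are multiplied
 * by scale. Trailing newlines in the OCR text are dropped and inner
 * newlines become '|'. Returns the line length, or -1 if buf is too
 * small for the frame numbers; overlong text is truncated. */
int format_microdvd(char *buf, size_t size, long begin, long end,
                    long scale, const char *text)
{
    assert(buf && text && begin <= end);
    int len = snprintf(buf, size, "{%ld}{%ld}", begin * scale, end * scale);
    if (len < 0 || (size_t)len + 2 > size)
        return -1;

    size_t n = strlen(text);
    while (n > 0 && text[n - 1] == '\n')
        n--;                        /* strip trailing newlines */

    size_t pos = (size_t)len;
    for (size_t i = 0; i < n && pos + 2 < size; i++)
        buf[pos++] = (text[i] == '\n') ? '|' : text[i];
    buf[pos++] = '\n';
    buf[pos] = '\0';
    return (int)pos;
}
```

For instance, if framestep=5 is in the chain, the filter sees only every
fifth frame, so a scale of 5 would presumably bring the written numbers
back in line with the frame numbers of the original video.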
threshold options:
-vf threshold=/edges:EDGEREJ/small:SMALL/large:LARGE/narrow:NARROW/
               wide:WIDE/short:SHORT/tall:TALL/differ:DIFFER/tdiffer:TDIFFER/
	       red:RED/green:GREEN/blue:BLUE/distance:DIST
where EDGEREJ is nonzero if regions are rejected when they touch image edges.
For other parameters, read the description above and see the default values
in vf_threshold.c.
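
For readers unfamiliar with the /name:VALUE syntax, the following sketch
shows one way such an argument string could be picked apart. It is not
the parser used in vf_threshold.c; get_option is a hypothetical helper
and integer values are assumed.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Look for "/key:VALUE" in args and store VALUE in *value.
 * Returns 1 if the key was found with an integer value, 0 otherwise. */
int get_option(const char *args, const char *key, int *value)
{
    assert(args && key && value);
    size_t klen = strlen(key);

    for (const char *p = args; (p = strchr(p, '/')) != NULL; ) {
        p++;                        /* step past the '/' */
        if (strncmp(p, key, klen) == 0 && p[klen] == ':')
            return sscanf(p + klen + 1, "%d", value) == 1;
    }
    return 0;
}
```

Matching the key only directly after a "/" keeps "differ" from being
confused with "tdiffer".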

Finetuning the options:
-----------------------
As mentioned, the parameters are closely tuned to Finnish TV broadcasts.
If you use other material, chances are that nothing is recognized as
subtitles and you get only a black image with
	mplayer -vf threshold,scale <file.avi>
In that case, play with the options until you get only the subtitles
and nothing else. Then add ocr to the filter chain and speed things
up:
	mplayer -fps 300 -vf threshold,ocr <file.avi> -vo null

Results:
--------
From BTTV-captured TV images, the subtitles are extracted very well. The
problem is more in gocr, which often makes errors, especially with italic
fonts. Normal fonts are recognized fairly well, although there are lots
of extra or missing spaces. It is easy to edit the subtitles by hand later,
especially since the frame numbers are usually correct. It is also
easy to replace gocr with a call to a better OCR program if one exists.

Thoughts:
---------
- GOcr would need some work. And what about Clara/other OCR programs?
- I first tried implementing it as several simpler filters, but it's not easy...
- Looks like it doesn't work with MEncoder?
- Is there a better way to get frame numbers than counting them?
- Better not to run it on compressed video; use uncompressed I420 or similar