Detection & Recognition of Overlay Text in Video

Visual text is ubiquitous in videos, and detecting and recognizing it can provide many important semantic features for video analysis. Visual text falls into two categories: scene text and overlay text. This project aims to accurately detect and recognize the overlay text in video frames using compressed-domain features, statistical fusion of contextual knowledge, and machine learning methods.

Text Detection

For overlay text detection, we use texture and motion features extracted from the DCT coefficients and motion vectors in the MPEG video stream. Multiple hypothetical images are then generated using color space quantization. Finally, layout analysis is carried out to eliminate false alarms. The system has shown superior performance when tested on TREC 2002 movie videos and broadcast news videos. The system is shown in the figure below.
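
The texture cue mentioned above can be illustrated with a small sketch. It is not the project's implementation: the function names, the energy measure (sum of absolute AC coefficients per 8x8 block) and the threshold are all assumptions made for illustration. A real compressed-domain system would read the DCT coefficients directly from the MPEG stream; here they are recomputed from pixels so that the example is self-contained.

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n).reshape(-1, 1)   # frequency index
    i = np.arange(n).reshape(1, -1)   # sample index
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)        # DC row has its own scaling
    return m

def block_texture_energy(gray, block=8):
    """Sum of absolute AC coefficients for each block of a grayscale image.

    Hypothetical texture measure: text regions tend to have strong
    high-frequency content, flat background has almost none.
    """
    h, w = gray.shape
    h, w = h - h % block, w - w % block   # drop incomplete border blocks
    d = dct_matrix(block)
    blocks = (gray[:h, :w].astype(np.float64)
              .reshape(h // block, block, w // block, block)
              .swapaxes(1, 2))            # shape: (rows, cols, block, block)
    coeffs = d @ blocks @ d.T             # 2-D DCT of every block at once
    total = np.abs(coeffs).sum(axis=(2, 3))
    return total - np.abs(coeffs[..., 0, 0])  # remove the DC term

def candidate_text_blocks(gray, threshold):
    """Boolean mask of blocks whose texture energy exceeds the threshold."""
    return block_texture_energy(gray) > threshold
```

The resulting block mask would only be a first-stage candidate map; the motion features, color quantization and layout analysis described above are not part of this sketch.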


Text Recognition

For text recognition, we developed a multimodal method that fuses multiple information sources and language models to enhance recognition performance. Currently, we fuse information from closed caption streams and an external linguistic corpus. The closed caption model is realized by a novel time-dependent probability model that takes into account the time distance between a videotext word and a closed caption word. Experiments showed that fusing these multimodal features increases recognition performance.
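
A minimal sketch of how such a fusion could work is given here. The exact form of the project's time-dependent model is not stated in this description, so everything specific is an assumption: the exponential decay with time distance, the decay constant `tau`, the linear mixing weight `lam` between the caption prior and the corpus prior, and the function names `caption_prob` and `rescore`.

```python
import math

def caption_prob(word, video_time, captions, tau=10.0):
    """Probability of `word` under a time-weighted closed caption model.

    `captions` is a list of (word, time_in_seconds) pairs. Each caption
    word is weighted by exp(-|dt| / tau), so words spoken close to the
    time the videotext appears count more (assumed decay shape).
    """
    weights = [(w, math.exp(-abs(video_time - t) / tau)) for w, t in captions]
    total = sum(wt for _, wt in weights)
    if total == 0.0:
        return 0.0
    return sum(wt for w, wt in weights if w == word) / total

def rescore(ocr_scores, video_time, captions, corpus_prob,
            tau=10.0, lam=0.5, floor=1e-9):
    """Combine OCR likelihoods with caption and corpus priors.

    posterior(w) is proportional to
        ocr(w) * (lam * P_cc(w) + (1 - lam) * P_corpus(w)),
    normalized over the candidate words.
    """
    scores = {}
    for word, likelihood in ocr_scores.items():
        prior = (lam * caption_prob(word, video_time, captions, tau)
                 + (1.0 - lam) * corpus_prob.get(word, floor))
        scores[word] = likelihood * prior
    norm = sum(scores.values())
    return {w: s / norm for w, s in scores.items()} if norm > 0 else scores

# Usage: the OCR slightly prefers a misspelling, but a nearby caption
# word and the corpus prior both favor the correct spelling.
candidates = {"clinton": 0.4, "clinten": 0.6}
captions = [("clinton", 12.0), ("president", 11.0)]
corpus = {"clinton": 1e-4, "clinten": 1e-7}
best = max(rescore(candidates, 10.0, captions, corpus).items(),
           key=lambda kv: kv[1])[0]
```

In this toy setting the caption evidence outweighs the small OCR preference for the misspelled candidate, which is the kind of correction that fusing caption and corpus information is meant to provide.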


We have applied the multimodal statistical fusion technique to detection of caption text in low-resolution news video. The method achieves a significant performance gain, with a word recognition rate of 76.8% and a character recognition rate of 86.7%. By feeding the recognition results back to the detection module, in conjunction with temporal consistency checking, the proposed method improves the detection precision from 70% to 92% while maintaining a high recall rate.
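
Temporal consistency checking can be sketched as follows. This is only one plausible reading, not the project's method: it assumes detections are axis-aligned boxes `(x1, y1, x2, y2)` per frame and keeps a box only if enough neighboring frames contain an overlapping box. The names `iou` and `temporally_consistent` and the parameters `window`, `min_support` and `min_iou` are hypothetical, and the feedback from recognition results is not modeled.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    iw = max(0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def temporally_consistent(frames, window=2, min_support=2, min_iou=0.5):
    """Drop boxes that are not confirmed by nearby frames.

    `frames` is a list with one list of boxes per frame. A box is kept
    if at least `min_support` of the frames within `window` frames on
    either side contain a box overlapping it with IoU >= `min_iou`.
    """
    kept = []
    for i, boxes in enumerate(frames):
        lo, hi = max(0, i - window), min(len(frames), i + window + 1)
        neighbors = [frames[j] for j in range(lo, hi) if j != i]
        kept.append([
            box for box in boxes
            if sum(any(iou(box, other) >= min_iou for other in frame)
                   for frame in neighbors) >= min_support
        ])
    return kept

# Usage: a caption box that persists across frames survives, while a
# one-frame spurious detection is removed.
detections = [
    [(10, 10, 110, 30)],
    [(10, 10, 110, 30), (200, 50, 240, 60)],
    [(12, 10, 112, 30)],
    [],
]
filtered = temporally_consistent(detections)
```

Overlay captions usually stay on screen for many frames, so a filter of this kind tends to remove isolated false alarms at little cost in recall.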


Dongqing Zhang

Prof. Shih-Fu Chang 


