Detection & Recognition of Overlay Text in Video





Visual text is ubiquitous in videos. Detecting and recognizing visual text can provide many important semantic features for video analysis. Visual text falls into two categories: scene text and overlay text. This project aims to accurately detect and recognize overlay text in video frames using compressed-domain features, statistical fusion of contextual knowledge, and machine learning methods.

Text Detection

For overlay text detection, we use texture and motion features extracted from the DCT coefficients and motion vectors of the MPEG video stream. Multiple hypothetical images are then generated using color-space quantization. Finally, layout analysis is carried out to eliminate false alarms. The system showed superior performance when tested on TREC 2002 movie videos and broadcast news videos. The system is shown below.
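The texture cue can be illustrated with a simplified sketch: each 8x8 block is scored by the energy of its AC DCT coefficients, since the dense edges of overlay text produce strong high-frequency content. Note this is only an illustration under assumed parameters; the actual system reads the coefficients directly from the compressed MPEG stream rather than recomputing the DCT, and the threshold below is arbitrary.

```python
import numpy as np
from scipy.fft import dctn

def block_text_energy(frame, block=8):
    """Texture energy per 8x8 block: the sum of absolute AC DCT
    coefficients. High energy suggests dense edges, which is typical
    of overlay text regions. `frame` is a 2-D grayscale array."""
    h, w = frame.shape
    h, w = h - h % block, w - w % block          # crop to full blocks
    energy = np.zeros((h // block, w // block))
    for i in range(0, h, block):
        for j in range(0, w, block):
            c = dctn(frame[i:i + block, j:j + block], norm='ortho')
            c[0, 0] = 0.0                        # drop the DC term
            energy[i // block, j // block] = np.abs(c).sum()
    return energy

def candidate_text_mask(frame, thresh=50.0):
    """Binary mask of blocks whose AC energy exceeds an (assumed,
    illustrative) threshold; a candidate map for layout analysis."""
    return block_text_energy(frame) > thresh
```

In the real pipeline this texture map would be combined with motion-vector features and refined by layout analysis before any block is declared text.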


Text Recognition

For text recognition, we developed a multimodal method that fuses multiple information sources and language models to enhance recognition performance. Currently, we fuse information from closed-caption streams and an external linguistic corpus. The closed-caption model is realized by a novel time-dependent probability model that takes into account the time distance between a videotext word and a closed-caption word. Experiments showed improved performance from fusing the multimodal features.
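The time-dependent weighting can be sketched as follows: each closed-caption occurrence of a candidate word contributes a prior that decays with its time distance from the videotext frame, and the OCR score is mixed with that prior before choosing the best hypothesis. The Gaussian kernel, the linear mixing weight `alpha`, and the data layout are illustrative assumptions, not the exact model of the CVPR 2003 paper.

```python
import math

def caption_prior(word, cc_stream, t, sigma=2.0):
    """Time-dependent closed-caption prior: every occurrence of `word`
    in the caption stream (a list of (word, timestamp) pairs)
    contributes a weight that decays with its time distance from the
    videotext frame at time t. The Gaussian kernel is an assumption."""
    return sum(math.exp(-((t - tc) ** 2) / (2 * sigma ** 2))
               for w, tc in cc_stream if w == word)

def fuse(candidates, cc_stream, t, alpha=0.5):
    """Rescore OCR word hypotheses (a dict of word -> OCR confidence)
    by linearly mixing the OCR score with the caption prior, then
    return the best-scoring word. `alpha` is an assumed mixing weight."""
    scored = {w: (1 - alpha) * p + alpha * caption_prior(w, cc_stream, t)
              for w, p in candidates.items()}
    return max(scored, key=scored.get)
```

For example, an OCR confusion such as "cliuton" vs. "clinton" can be resolved when "clinton" appears in the caption stream near the same timestamp, even if the OCR score alone favors the wrong hypothesis.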


We have applied the multimodal statistical fusion technique to the detection of caption text in low-resolution news video. The method achieves a significant performance gain, with a word recognition rate of 76.8% and a character recognition rate of 86.7%. By feeding the recognition results back to the detection module, in conjunction with temporal consistency checking, the proposed method greatly improves detection precision (from 70% to 92%) while maintaining a high recall rate.
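The temporal consistency idea can be sketched as follows: since overlay text is stationary on screen, a detected box is confirmed only if a well-overlapping box appears in several consecutive frames, which suppresses transient false alarms. The IoU matching, the frame count, and the thresholds below are illustrative assumptions rather than the system's actual parameters.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def temporally_consistent(frames, min_frames=3, iou_thresh=0.7):
    """frames: per-frame lists of detected boxes. A box detected in
    the first frame is confirmed only if a matching box (IoU above
    threshold) appears in each of the next min_frames - 1 frames;
    transient detections are discarded as likely false alarms."""
    return [box for box in frames[0]
            if all(any(iou(box, b) >= iou_thresh for b in f)
                   for f in frames[1:min_frames])]
```

A box that flickers in and out (e.g., a textured background patch) fails the persistence test, while a static caption box survives it.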


Dongqing Zhang

Prof. Shih-Fu Chang 


D. Zhang and S.-F. Chang, "A Bayesian Framework for Fusing Multiple Word Knowledge Models in Videotext Recognition," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2003), Madison, Wisconsin, June 16-22, 2003.

D. Zhang, B. L. Tseng, C.-Y. Lin, and S.-F. Chang, "Accurate Overlay Text Extraction for Digital Video Analysis," Proceedings of the IEEE International Conference on Information Technology: Research and Education (ITRE 2003).

D. Zhang and S.-F. Chang, "Event Detection in Baseball Video Using Superimposed Caption Recognition," ACM Multimedia 2002 (ACM MM 2002), Juan Les Pins, France, December 1-6, 2002.

D. Zhang and S.-F. Chang, "General and Domain-Specific Techniques for Detecting and Recognizing Superimposed Text in Video," Proceedings of the International Conference on Image Processing (ICIP 2002), Rochester, New York, USA.


DVMM Project: Sports Highlight Summarization Using Caption Text Recognition

Caption Text Detection in MPEG Videos


Last updated: June 12, 2002.