Visual text is ubiquitous
in videos. Detection and recognition of visual text can provide many
important semantic features for video analysis. Visual text can
be divided into two categories, scene text and overlay
text. This project aims to accurately detect and recognize
overlay text in video frames using compressed-domain features, statistical
fusion of contextual knowledge, and machine learning methods.
For overlay text
detection, we use texture and motion features extracted from the DCT coefficients
and motion vectors in the MPEG video stream. Multiple hypothesis images
are then generated using color space quantization. Finally, layout
analysis is carried out to eliminate false alarms. The system has shown
superior performance when tested on TREC 2002 movie videos and broadcast
news videos.
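The texture cue can be illustrated with a minimal sketch: text regions produce strong AC energy in the 8x8 DCT blocks, so thresholding a per-block energy map marks candidate text blocks. In the compressed domain the DCT coefficients are read directly from the MPEG stream; here, as a stand-in, they are computed from pixels. The function name, block size, and the use of total AC energy as the texture measure are illustrative assumptions, not the project's exact feature.

```python
import numpy as np

def dct2_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix (rows = frequencies, cols = samples)."""
    k = np.arange(n)
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    m[0] /= np.sqrt(2.0)
    return m

def block_texture_energy(frame: np.ndarray, block: int = 8) -> np.ndarray:
    """Per-block texture energy from DCT AC coefficients.

    Text blocks carry strong high-frequency energy, so the returned
    map can be thresholded to mark candidate overlay-text blocks.
    """
    h = frame.shape[0] - frame.shape[0] % block
    w = frame.shape[1] - frame.shape[1] % block
    m = dct2_matrix(block)
    energy = np.zeros((h // block, w // block))
    for by in range(0, h, block):
        for bx in range(0, w, block):
            b = frame[by:by + block, bx:bx + block].astype(float)
            coeffs = m @ b @ m.T          # 2-D DCT of the block
            coeffs[0, 0] = 0.0            # drop the DC term, keep AC energy
            energy[by // block, bx // block] = np.sum(coeffs ** 2)
    return energy
```

A textured (e.g. checkerboard-like) block yields a much higher energy value than a flat background block, which is the separation the detector exploits.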
For text
recognition, we developed a multimodal method that fuses multiple information
sources and language models to enhance text recognition performance.
Currently, we fuse information from closed-caption streams and an external
linguistic corpus. The closed-caption model is realized by a novel time-dependent
probability model that takes into account the time distance between
a videotext word and a closed-caption word. Experiments showed improved
performance from fusing the multimodal features.
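One way to realize such a time-dependent model is sketched below: each closed-caption word contributes a prior weight that decays exponentially with its time distance from the videotext, and the prior is combined with the OCR word likelihoods. The exponential decay, the constant `tau`, and the log-linear mixing weight `alpha` are illustrative assumptions, not the paper's exact formulation.

```python
import math
from collections import defaultdict

def caption_prior(captions, t_text, tau=5.0):
    """Time-dependent word prior from the closed-caption stream.

    captions: list of (word, timestamp_seconds) pairs.
    A caption word close in time to the videotext at t_text gets a
    large weight exp(-|dt| / tau); distant words decay toward zero.
    """
    weights = defaultdict(float)
    for word, t in captions:
        weights[word.lower()] += math.exp(-abs(t - t_text) / tau)
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()}

def fuse(ocr_scores, prior, alpha=0.5):
    """Combine OCR word likelihoods with the caption prior (log-linear mix)."""
    fused = {w: (s ** alpha) * (prior.get(w, 1e-6) ** (1 - alpha))
             for w, s in ocr_scores.items()}
    z = sum(fused.values())
    return {w: v / z for w, v in fused.items()}
```

With a nearby caption word "yankees", an OCR confusion such as "vankees" scoring slightly higher than "yankees" is corrected by the fused posterior.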
We have applied
the multimodal statistical fusion technique to the detection of caption
text in low-resolution news video. The method achieves a significant
performance gain, with a word recognition rate of 76.8% and a character
recognition rate of 86.7%. By feeding the recognition results back to
the detection module, in conjunction with temporal consistency checking,
the proposed methods greatly improve detection precision (from 70%
to 92%) while maintaining a high recall rate.
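The temporal consistency idea can be sketched as follows: overlay text persists across many frames, so a detected box is kept only if a box at roughly the same position (measured here by intersection-over-union) recurs in enough frames; transient detections are discarded as false alarms. The IoU matching criterion, the thresholds, and the function names are illustrative assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def temporally_consistent(detections, min_frames=3, iou_thresh=0.5):
    """Keep boxes that recur in at least min_frames frames.

    detections: list of per-frame box lists. Overlay text stays on
    screen for many frames, so a box with little support in other
    frames is treated as a false alarm and dropped.
    """
    kept = []
    for i, frame_boxes in enumerate(detections):
        for box in frame_boxes:
            support = sum(
                any(iou(box, other) >= iou_thresh for other in detections[j])
                for j in range(len(detections)) if j != i)
            if support + 1 >= min_frames:
                kept.append((i, box))
    return kept
```

A box that jitters by a pixel or two between frames still matches under the IoU threshold, while a one-frame spurious detection gathers no support and is removed.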
Dongqing Zhang and Shih-Fu Chang, "A
Bayesian Framework for Fusing Multiple Word Knowledge Models in Videotext
Recognition", Proceedings of IEEE Computer Vision and Pattern Recognition
(CVPR 2003), Madison, Wisconsin, June 16-22, 2003.
Dongqing Zhang, Belle L.
Tseng, Ching-Yung Lin and Shih-Fu Chang, "Accurate
Overlay Text Extraction for Digital Video Analysis", Proceedings
of IEEE International Conference on Information Technology: Research
and Education (ITRE 2003).
Dongqing Zhang and Shih-Fu Chang, "Event
Detection in Baseball Video Using Superimposed Caption Recognition",
Proceedings of ACM Multimedia (ACM MM 2002), Juan-les-Pins, France, December 1-6, 2002.
Dongqing Zhang and Shih-Fu Chang, "General
and Domain-specific Techniques for Detecting and Recognizing Superimposed
Text in Video", Proceedings of IEEE International Conference on Image
Processing (ICIP 2002), Rochester, New York, USA.
Project: Sports Highlight Summarization Using Caption Text Recognition
Text Detection in MPEG Videos
Last updated: June 12, 2002.