Extraction of overlaid text such as captions and movie credits from video is complicated by cluttered scene backgrounds, the lack of a consistent color distribution characterizing textual content, and variations in text size. This technology provides a method for identifying regions of interest (ROIs) that may contain text in a video, using a combination of texture and motion analysis. Binary images representing hypothesized text content are generated for several partitions of the video's color space. Character-like blocks in these images are identified with a clustering algorithm, and layout analysis groups the clustered blocks into candidate text lines. Candidate lines are then verified by requiring that they meet a minimum length.
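As a rough illustration, that pipeline could be sketched in Python with OpenCV as below. The intensity-based partitioning, the character-size filters, the baseline-alignment tolerance, and the minimum number of blocks per line are all illustrative assumptions, not parameters of the patented method.

    import cv2
    import numpy as np

    def text_line_candidates(frame_bgr, n_partitions=4, min_line_blocks=3):
        """Hypothesize text lines in one frame from color-space partitions."""
        gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
        lines = []
        # Partition the intensity range; each slice yields one binary
        # hypothesis image of potential text pixels.
        edges = np.linspace(0, 256, n_partitions + 1)
        for lo, hi in zip(edges[:-1], edges[1:]):
            binary = ((gray >= lo) & (gray < hi)).astype(np.uint8) * 255
            # Cluster foreground pixels into blocks via connected components.
            _, _, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
            blocks = [s for s in stats[1:]                 # label 0 is background
                      if 4 <= s[cv2.CC_STAT_HEIGHT] <= 60  # plausible character height
                      and s[cv2.CC_STAT_WIDTH] <= 3 * s[cv2.CC_STAT_HEIGHT]]
            # Layout analysis: group blocks whose tops align horizontally.
            blocks.sort(key=lambda s: s[cv2.CC_STAT_TOP])
            row = []
            for s in blocks:
                if row and abs(int(s[cv2.CC_STAT_TOP]) - int(row[-1][cv2.CC_STAT_TOP])) > 5:
                    if len(row) >= min_line_blocks:  # verify: enough blocks for a line
                        lines.append(union_box(row))
                    row = []
                row.append(s)
            if len(row) >= min_line_blocks:
                lines.append(union_box(row))
        return lines

    def union_box(blocks):
        """Union bounding box (x, y, w, h) over a group of component stats."""
        x0 = min(int(s[cv2.CC_STAT_LEFT]) for s in blocks)
        y0 = min(int(s[cv2.CC_STAT_TOP]) for s in blocks)
        x1 = max(int(s[cv2.CC_STAT_LEFT] + s[cv2.CC_STAT_WIDTH]) for s in blocks)
        y1 = max(int(s[cv2.CC_STAT_TOP] + s[cv2.CC_STAT_HEIGHT]) for s in blocks)
        return (x0, y0, x1 - x0, y1 - y0)

Each color-space partition contributes its own line hypotheses, so text that blends into the background under one partition may still be recovered from another.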
Identifying text captions in video is typically time-consuming and difficult due to background noise, the manual effort existing approaches require, and variations in text size. Partitioning the video color space to obtain multiple hypotheses about textual regions, then combining those hypotheses through grouping, significantly reduces the effect of background disturbance on recognition accuracy. By analyzing both texture and motion energy to localize potential text regions and avoid processing irrelevant parts of the video, the technology can achieve real-time performance. Spurious detections are suppressed by checking that a text box detected in one frame overlaps boxes detected in a sufficient number of previous frames, as in the sketch below.
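The temporal verification step might look like the following sketch. The history length, minimum hit count, and overlap threshold are assumed values chosen for illustration, and overlap is measured here with intersection-over-union, one plausible choice of criterion.

    from collections import deque

    def iou(a, b):
        """Intersection-over-union of two (x, y, w, h) boxes."""
        ax, ay, aw, ah = a
        bx, by, bw, bh = b
        ix0, iy0 = max(ax, bx), max(ay, by)
        ix1, iy1 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
        inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
        union = aw * ah + bw * bh - inter
        return inter / union if union else 0.0

    class TemporalVerifier:
        """Confirm a box only if it overlaps boxes in enough recent frames
        (history, min_hits, and iou_thresh are illustrative defaults)."""
        def __init__(self, history=5, min_hits=3, iou_thresh=0.5):
            self.frames = deque(maxlen=history)  # per-frame detection lists
            self.min_hits = min_hits
            self.iou_thresh = iou_thresh

        def verify(self, boxes):
            """Return boxes matched in >= min_hits prior frames, then
            record this frame's detections in the history."""
            confirmed = []
            for box in boxes:
                hits = sum(any(iou(box, prev) >= self.iou_thresh
                               for prev in frame)
                           for frame in self.frames)
                if hits >= self.min_hits:
                    confirmed.append(box)
            self.frames.append(list(boxes))
            return confirmed

Calling verify() once per frame both filters the current detections and updates the history, so a caption must persist across several frames before it is reported.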
The technology has been implemented and successfully tested on the NIST TREC-2002 benchmark and on American and Taiwanese news videos, demonstrating performance competitive with other text extraction systems.
Patent Issued (US 8,488,682)
Tech Ventures Reference: IR M02-021