Locating and Recognizing Text in WWW Images

Daniel Lopresti¹ & Jiangying Zhou²

Abstract

The explosive growth of the World Wide Web has resulted in a distributed database consisting of hundreds of millions of documents. While existing search engines index a page based on the text that is readily extracted from its HTML encoding, an increasing amount of the information on the Web is embedded in images. This situation presents a new and exciting challenge for the fields of document analysis and information retrieval, as WWW image text is typically rendered in color and at very low spatial resolutions. In this paper, we survey the results of several years of our work in the area. For the problem of locating text in Web images, we describe a procedure based on clustering in color space followed by a connected-components analysis that seems promising. For character recognition, we discuss techniques using polynomial surface fitting and “fuzzy” n-tuple classifiers. Also presented are the results of several experiments that demonstrate where our methods perform well and where more work needs to be done. We conclude with a discussion of topics for further research.
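As an illustration of the text-location step summarized above (clustering in color space followed by a connected-components analysis), the following is a minimal sketch only. It assumes plain k-means in RGB space as a stand-in for the clustering stage and 4-connectivity for the component analysis; the abstract specifies neither choice. The function names (kmeans_colors, connected_components, find_components) and the toy image are hypothetical and chosen for illustration.

```python
from collections import deque

def kmeans_colors(pixels, k, iters=10):
    """Cluster RGB tuples with plain k-means; returns one label per pixel.
    Assumption: k-means stands in for the paper's clustering procedure."""
    # Deterministic initialization: the first k distinct colors encountered.
    centers = []
    for p in pixels:
        if p not in centers:
            centers.append(p)
        if len(centers) == k:
            break
    labels = [0] * len(pixels)
    for _ in range(iters):
        # Assign each pixel to the nearest center (squared Euclidean distance).
        for i, p in enumerate(pixels):
            labels[i] = min(
                range(len(centers)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Move each center to the mean color of its members.
        for c in range(len(centers)):
            members = [p for p, lab in zip(pixels, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(ch) / len(members) for ch in zip(*members))
    return labels

def connected_components(label_grid):
    """Group 4-connected pixels that share a cluster label.
    Returns a list of (label, [(row, col), ...]) pairs."""
    h, w = len(label_grid), len(label_grid[0])
    seen = [[False] * w for _ in range(h)]
    components = []
    for y in range(h):
        for x in range(w):
            if seen[y][x]:
                continue
            lab = label_grid[y][x]
            queue = deque([(y, x)])
            seen[y][x] = True
            pixels = []
            while queue:  # breadth-first flood fill
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and not seen[ny][nx] and label_grid[ny][nx] == lab):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append((lab, pixels))
    return components

def find_components(image, k=2):
    """image: list of rows, each a list of (R, G, B) tuples."""
    w = len(image[0])
    flat = [px for row in image for px in row]
    labels = kmeans_colors(flat, k)
    grid = [labels[r * w:(r + 1) * w] for r in range(len(image))]
    return connected_components(grid)

# Toy image: a slightly noisy light background with two dark vertical strokes.
W, W2, K = (250, 250, 250), (240, 245, 250), (10, 10, 30)
image = [
    [W, K, W2, K, W],
    [W2, K, W, K, W2],
    [W, W2, W, W, W],
]
components = find_components(image, k=2)
print(sorted(len(pixels) for _, pixels in components))
```

On this toy image the two near-white shades fall into one color cluster, so the sketch yields one background component and two stroke components. Deciding which components are actually characters, and recognizing them, would require further steps that this sketch does not attempt.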

Author information

Authors and Affiliations

Bell Laboratories, Lucent Technologies, Inc., 600 Mountain Avenue, Murray Hill, NJ, 07974, USA
Daniel Lopresti
Summus Ltd., Suite 2200, 2000 Center Point Drive, Columbia, SC, 29210, USA
Jiangying Zhou

About this article

Cite this article

Lopresti, D., Zhou, J. Locating and Recognizing Text in WWW Images. Information Retrieval 2, 177–206 (2000). https://doi.org/10.1023/A:1009954710479

Issue Date: May 2000
DOI: https://doi.org/10.1023/A:1009954710479
