Abstract
The explosive growth of the World Wide Web has resulted in a distributed database consisting of hundreds of millions of documents. While existing search engines index a page based on the text that is readily extracted from its HTML encoding, an increasing amount of the information on the Web is embedded in images. This situation presents a new and exciting challenge for the fields of document analysis and information retrieval, as WWW image text is typically rendered in color and at very low spatial resolutions. In this paper, we survey the results of several years of our work in the area. For the problem of locating text in Web images, we describe a procedure based on clustering in color space followed by a connected-components analysis that seems promising. For character recognition, we discuss techniques using polynomial surface fitting and “fuzzy” n-tuple classifiers. Also presented are the results of several experiments that demonstrate where our methods perform well and where more work needs to be done. We conclude with a discussion of topics for further research.
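As a rough illustration of the text-location idea mentioned above (clustering in color space followed by a connected-components analysis), the following minimal sketch may help. It is not the authors' procedure: fixed-step color quantization is swapped in as a crude stand-in for clustering, and the function names `quantize` and `connected_components`, the `step` parameter, and the 4-connectivity choice are all assumptions made for this example.

```python
from collections import deque


def quantize(pixel, step=64):
    """Map an (r, g, b) pixel to a coarse color bin.

    Assumption: fixed-step binning stands in for real color clustering.
    """
    return tuple(channel // step for channel in pixel)


def connected_components(image, step=64):
    """Group 4-connected pixels that fall into the same color bin.

    image: list of rows, each row a list of (r, g, b) tuples.
    Returns a list of (color_bin, [(row, col), ...]) in scan order.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    bins = [[quantize(p, step) for p in row] for row in image]
    seen = [[False] * width for _ in range(height)]
    components = []
    for r in range(height):
        for c in range(width):
            if seen[r][c]:
                continue
            # Breadth-first flood fill from the first unseen pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx]
                            and bins[ny][nx] == bins[r][c]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append((bins[r][c], pixels))
    return components


# Toy image: a reddish vertical stroke on a white background.
W, R = (255, 255, 255), (200, 0, 0)
image = [
    [W, R, W],
    [W, (210, 10, 5), W],
    [W, R, W],
]
components = connected_components(image)
```

In this toy image the slightly different reds fall into one bin, so the stroke comes out as a single component separate from the two background regions. A real system would then have to decide which components are character-like, a step this sketch does not attempt.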
Cite this article
Lopresti, D., Zhou, J. Locating and Recognizing Text in WWW Images. Information Retrieval 2, 177–206 (2000). https://doi.org/10.1023/A:1009954710479