Abstract
Many historical newspapers are being digitized. We aim to support access to them through text analysis of the OCR'd content. However, the OCR output contains many errors, which makes extracting meaningful content difficult. We propose a pipeline of processing steps and describe the first two here: segmentation and genre identification. The segmentation procedure, which is based on headings, was quite successful. Genre identification worked well for easily defined genre categories such as weather reports. We also propose additional techniques that may improve accuracy still further.
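As a rough illustration of the two steps named in the abstract, the sketch below splits OCR'd lines into segments at heading-like lines and then flags segments that look like weather reports. It is not the authors' procedure: the heading heuristic (short, mostly uppercase lines), the keyword list `WEATHER_TERMS`, the thresholds, and all function names are assumptions made for this example only.

```python
import re

# Hypothetical keyword list for illustration; not taken from the paper.
WEATHER_TERMS = {"weather", "forecast", "temperature", "rain",
                 "cloudy", "winds", "barometer"}


def is_heading(line, max_words=8, min_upper_ratio=0.8):
    """Guess whether a line is a heading: short and mostly uppercase.

    Assumed heuristic; the thresholds are arbitrary illustrative values.
    """
    letters = [ch for ch in line if ch.isalpha()]
    if not letters or len(line.split()) > max_words:
        return False
    upper = sum(1 for ch in letters if ch.isupper())
    return upper / len(letters) >= min_upper_ratio


def segment_by_headings(lines):
    """Group lines into (heading, body_lines) pairs, splitting at headings."""
    segments = []
    heading, body = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if is_heading(line):
            # Close the previous segment before starting a new one.
            if heading is not None or body:
                segments.append((heading, body))
            heading, body = line, []
        else:
            body.append(line)
    if heading is not None or body:
        segments.append((heading, body))
    return segments


def looks_like_weather(heading, body, min_hits=2):
    """Flag a segment as a weather report if enough keywords appear."""
    text = " ".join([heading or ""] + body).lower()
    tokens = set(re.findall(r"[a-z]+", text))
    return len(tokens & WEATHER_TERMS) >= min_hits


sample = [
    "THE WEATHER",
    "Forecast for today: cloudy with rain,",
    "light winds from the north.",
    "",
    "CITY COUNCIL MEETS",
    "The council met on Tuesday evening to",
    "discuss the new bridge.",
]

for heading, body in segment_by_headings(sample):
    print(heading, "->", looks_like_weather(heading, body))
```

Exact keyword matching of this kind would be fragile on real OCR output, where words are frequently misrecognized; some form of approximate matching would likely be needed in practice.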
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Allen, R.B., Waldstein, I., Zhu, W. (2008). Automated Processing of Digitized Historical Newspapers: Identification of Segments and Genres. In: Buchanan, G., Masoodian, M., Cunningham, S.J. (eds) Digital Libraries: Universal and Ubiquitous Access to Information. ICADL 2008. Lecture Notes in Computer Science, vol 5362. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-89533-6_49
DOI: https://doi.org/10.1007/978-3-540-89533-6_49
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-89532-9
Online ISBN: 978-3-540-89533-6
eBook Packages: Computer Science (R0)