DOI: 10.1145/2361354.2361383
research-article

Logical segmentation for article extraction in digitized old newspapers

Published: 04 September 2012

Abstract

Newspapers are documents made up of news items and informative articles. They are not meant to be read sequentially: readers can pick items in any order they fancy. Ignoring this structural property, most digitized newspaper archives offer access to their content only by issue or, at best, by page. We have built a digitization workflow that automatically extracts newspaper articles from images, which allows information to be indexed and retrieved at the article level. Our back-end system extracts the logical structure of the page to produce the informative units: the articles. Each image is labelled at the pixel level by a machine-learning-based method; the logical structure of the page is then built up from these labels by detecting structuring entities such as horizontal and vertical separators, titles and text lines. This logical structure is stored in a METS wrapper associated with the ALTO file produced by the system, which includes the OCRed text. Our front-end system provides web-based high-definition visualisation of images, textual indexing and retrieval facilities, and searching and reading at the article level. Article transcriptions can be corrected collaboratively, which in turn allows for better indexing. We are currently testing our system on the archives of the Journal de Rouen, one of France's oldest local newspapers. These 250 years of publication amount to 300,000 pages of highly variable image quality and layout complexity. Test year 1808 can be consulted at plair.univ-rouen.fr.
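As a rough illustration of the last step described in the abstract (turning detected structuring entities into articles), the sketch below groups already-labelled blocks of a single column into articles. It is not the authors' method: the `Block` and `Article` types, the `group_into_articles` function, and the two rules used (a title opens a new article, a horizontal separator closes the open one) are assumptions made for this example, and the pixel-level labelling stage is taken as given.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Block:
    """One detected entity in a column, in top-to-bottom order (hypothetical type)."""
    kind: str          # "title", "text", or "hsep" (horizontal separator)
    text: str = ""


@dataclass
class Article:
    """A logical unit: an optional title followed by its text lines (hypothetical type)."""
    title: Optional[str] = None
    lines: List[str] = field(default_factory=list)


def group_into_articles(blocks: List[Block]) -> List[Article]:
    """Group labelled blocks into articles using two simple, assumed rules:
    a title opens a new article, and a horizontal separator closes the open one."""
    articles: List[Article] = []
    current: Optional[Article] = None
    for block in blocks:
        if block.kind == "hsep":
            current = None                      # separator ends the open article
        elif block.kind == "title":
            current = Article(title=block.text)
            articles.append(current)
        elif block.kind == "text":
            if current is None:                 # text with no title: untitled article
                current = Article()
                articles.append(current)
            current.lines.append(block.text)
        else:
            raise ValueError(f"unknown block kind: {block.kind!r}")
    return articles


column = [
    Block("title", "A"), Block("text", "a1"), Block("text", "a2"),
    Block("hsep"),
    Block("text", "b1"),
    Block("title", "C"), Block("text", "c1"),
]
articles = group_into_articles(column)
```

With this toy column the function yields three articles: one titled "A" with two lines, one untitled article holding "b1", and one titled "C". A real page would also need vertical separators and column ordering, which this sketch deliberately leaves out.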

    Published In

    DocEng '12: Proceedings of the 2012 ACM symposium on Document engineering
    September 2012
    256 pages
    ISBN:9781450311168
    DOI:10.1145/2361354

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. articles extraction in newspapers
    2. conditional random field
    3. document image labelling
    4. information extraction from document images
    5. logical structure
    6. page layout analysis
    7. structural analysis

    Qualifiers

    • Research-article

    Conference

    DocEng '12
    DocEng '12: ACM Symposium on Document Engineering
    September 4 - 7, 2012
    Paris, France

    Acceptance Rates

    Overall Acceptance Rate 194 of 564 submissions, 34%

    Cited By

    • (2025) Looking Back to 1850 in 2025. Impact of Digitalization on Communication Dynamics, pp. 393-420. DOI: 10.4018/979-8-3693-3579-6.ch015. Online publication date: 3-Jan-2025
    • (2024) Digitizing History: Transitioning Historical Paper Documents to Digital Content for Information Retrieval and Mining—A Comprehensive Survey. IEEE Transactions on Computational Social Systems, 11:5, pp. 6151-6180. DOI: 10.1109/TCSS.2024.3378419. Online publication date: Oct-2024
    • (2024) Newspaper elements detection and newspaper pages categorization using CNNs and transformers. International Journal on Document Analysis and Recognition (IJDAR). DOI: 10.1007/s10032-024-00503-9. Online publication date: 7-Oct-2024
    • (2024) Leveraging Transfer Learning for Article Segmentation in Historical Newspapers. Linking Theory and Practice of Digital Libraries, pp. 222-238. DOI: 10.1007/978-3-031-72437-4_13. Online publication date: 26-Sep-2024
    • (2023) STRAS: A Semantic Textual-Cues Leveraged Rule-Based Approach for Article Separation in Historical Newspapers. Leveraging Generative Intelligence in Digital Libraries: Towards Human-Machine Collaboration, pp. 89-105. DOI: 10.1007/978-981-99-8085-7_8. Online publication date: 30-Nov-2023
    • (2023) Benchmarking NAS for Article Separation in Historical Newspapers. Leveraging Generative Intelligence in Digital Libraries: Towards Human-Machine Collaboration, pp. 76-88. DOI: 10.1007/978-981-99-8085-7_7. Online publication date: 30-Nov-2023
    • (2022) Convolutional Neural Network Based Intelligent Advertisement Search Framework for Online English Newspapers. Recent Patents on Engineering, 16:4. DOI: 10.2174/1872212115666210715163919. Online publication date: Jul-2022
    • (2022) Accessible PDFs: Applying Artificial Intelligence for Automated Remediation of STEM PDFs. Proceedings of the 24th International ACM SIGACCESS Conference on Computers and Accessibility, pp. 1-6. DOI: 10.1145/3517428.3550407. Online publication date: 23-Oct-2022
    • (2021) Old Sinhala Newspaper Article Segmentation for Content Recognition Using Image Processing. 2021 From Innovation To Impact (FITI), pp. 1-6. DOI: 10.1109/FITI54902.2021.9833047. Online publication date: 8-Dec-2021
    • (2020) Challenges and strategies for beginners to solve research questions with DH methodologies on a corpus of multilingual Philippine periodicals. Literary Translation in Periodicals, pp. 247-272. DOI: 10.1075/btl.155.10ort. Online publication date: 6-Dec-2020
