DOI: 10.5555/827140.827147

Bibliographic attribute extraction from erroneous references based on a statistical model

Published: 27 May 2003

Abstract

In this paper, we propose a method, based on an extended hidden Markov model, for extracting bibliographic attributes from reference strings captured using Optical Character Recognition (OCR). Bibliographic attribute extraction can be used in two ways. One is reference parsing, in which attribute values are extracted from OCR-processed references for bibliographic matching. The other is reference alignment, in which attribute values are aligned to the bibliographic record to enrich the vocabulary of the bibliographic database. We first propose a statistical model for attribute extraction that represents both the syntactical structure of references and OCR error patterns. We then perform experiments using bibliographic references obtained from scanned images of papers in journals and transactions, and show that useful attribute values can be extracted from OCR-processed references. We also show that the proposed model has an advantage in reducing the cost of preparing training data, a critical problem in rule-based systems.
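
The abstract names an extended hidden Markov model but gives no parameters or structure, so the following is only an illustrative sketch of the general idea of reference parsing with a plain HMM and Viterbi decoding. It is not the authors' model. The state set, every probability, the token feature classes, and the small OCR digit-confusion table are all assumptions chosen by hand for this example; a real system would estimate them from training data.

```python
import math

# Hypothetical attribute states and hand-set probabilities (assumptions,
# not values from the paper).
STATES = ["AUTHOR", "TITLE", "VENUE", "YEAR"]
START = {"AUTHOR": 0.97, "TITLE": 0.01, "VENUE": 0.01, "YEAR": 0.01}
TRANS = {
    "AUTHOR": {"AUTHOR": 0.70, "TITLE": 0.28, "VENUE": 0.01, "YEAR": 0.01},
    "TITLE":  {"AUTHOR": 0.01, "TITLE": 0.75, "VENUE": 0.23, "YEAR": 0.01},
    "VENUE":  {"AUTHOR": 0.01, "TITLE": 0.01, "VENUE": 0.68, "YEAR": 0.30},
    "YEAR":   {"AUTHOR": 0.01, "TITLE": 0.01, "VENUE": 0.01, "YEAR": 0.97},
}
EMIT = {
    "AUTHOR": {"initial": 0.45, "capword": 0.45, "word": 0.05, "yearlike": 0.05},
    "TITLE":  {"initial": 0.05, "capword": 0.30, "word": 0.60, "yearlike": 0.05},
    "VENUE":  {"initial": 0.05, "capword": 0.70, "word": 0.20, "yearlike": 0.05},
    "YEAR":   {"initial": 0.02, "capword": 0.02, "word": 0.02, "yearlike": 0.94},
}

# A few letter-for-digit confusions that OCR commonly produces (assumed list).
OCR_DIGIT_FIXES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5"})


def observe(token):
    """Map a raw token to a coarse feature class that tolerates OCR noise."""
    core = token.strip(".,;:()\"'")
    looks_numeric = core.translate(OCR_DIGIT_FIXES).isdigit()
    if len(core) == 4 and looks_numeric and any(c.isdigit() for c in core):
        return "yearlike"
    if len(core) == 1 and core.isalpha() and core.isupper():
        return "initial"
    if core[:1].isupper():
        return "capword"
    return "word"


def viterbi(tokens):
    """Return (token, state) pairs for the most probable state sequence."""
    if not tokens:
        return []
    obs = [observe(t) for t in tokens]
    # Log-probabilities avoid underflow on long references.
    score = [{s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}]
    back = []
    for o in obs[1:]:
        prev, cur, ptr = score[-1], {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            cur[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][o])
            ptr[s] = best
        score.append(cur)
        back.append(ptr)
    # Trace the best path backwards from the best final state.
    path = [max(STATES, key=lambda s: score[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return list(zip(tokens, path))


# Example reference with a simulated OCR error in the year ("l984").
reference = ("P. Goyal. An investigation of different string coding methods. "
             "Journal of the American Society for Information Science, l984.")
for token, state in viterbi(reference.split()):
    print(f"{state:7s} {token}")
```

In this sketch the OCR tolerance lives entirely in `observe`, which is a simplification: the abstract describes a single statistical model that covers both reference syntax and OCR error patterns, whereas here the error handling is a separate preprocessing step.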



Published In

JCDL '03: Proceedings of the 3rd ACM/IEEE-CS joint conference on Digital libraries
May 2003
393 pages
ISBN: 0769519393

Publisher

IEEE Computer Society

United States

Qualifiers

  • Article

Conference

JCDL03

Acceptance Rates

JCDL '03 Paper Acceptance Rate 54 of 216 submissions, 25%;
Overall Acceptance Rate 415 of 1,482 submissions, 28%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 1
  • Downloads (Last 6 weeks): 0
Reflects downloads up to 24 Dec 2024

Cited By

  • (2012) Web-based citation parsing, correction and augmentation. Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries, pp. 295-304. DOI: 10.1145/2232817.2232872. Online publication date: 10-Jun-2012
  • (2012) Improved bibliographic reference parsing based on repeated patterns. Proceedings of the Second international conference on Theory and Practice of Digital Libraries, pp. 370-382. DOI: 10.1007/978-3-642-33290-6_40. Online publication date: 23-Sep-2012
  • (2011) Semi-supervised bibliographic element segmentation with latent permutations. Proceedings of the 13th international conference on Asia-pacific digital libraries: for cultural heritage, knowledge dissemination, and future creation, pp. 60-69. DOI: 10.5555/2075271.2075285. Online publication date: 24-Oct-2011
  • (2011) Unsupervised Segmentation of Bibliographic Elements with Latent Permutations. International Journal of Organizational and Collective Intelligence 2:2, pp. 49-62. DOI: 10.4018/joci.2011040104. Online publication date: 1-Apr-2011
  • (2010) Unsupervised segmentation of bibliographic elements with latent permutations. Proceedings of the 2010 international conference on Web information systems engineering, pp. 254-267. DOI: 10.5555/2044492.2044517. Online publication date: 12-Dec-2010
  • (2010) Logical Structure Recovery in Scholarly Articles with Rich Document Features. International Journal of Digital Library Systems 1:4, pp. 1-23. DOI: 10.4018/jdls.2010100101. Online publication date: 1-Oct-2010
  • (2010) A citation-based approach to automatic topical indexing of scientific literature. Journal of Information Science 36:6, pp. 798-811. DOI: 10.1177/0165551510388080. Online publication date: 1-Dec-2010
  • (2009) Using web resources for support of online-browsing of research papers. Proceedings of the 10th IEEE international conference on Information Reuse & Integration, pp. 348-353. DOI: 10.5555/1689250.1689313. Online publication date: 10-Aug-2009
  • (2009) CEBBIP. Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries, pp. 73-76. DOI: 10.1145/1555400.1555412. Online publication date: 15-Jun-2009
  • (2008) Automatic metadata extraction from museum specimen labels. Proceedings of the 2008 International Conference on Dublin Core and Metadata Applications, pp. 57-68. DOI: 10.5555/1503418.1503425. Online publication date: 22-Sep-2008
