DOI:10.1145/1166160.1166190
Article

Meta-algorithmic systems for document classification

Published: 10 October 2006

Abstract

To address cost and regulatory concerns, many businesses are converting paper-based elements of their workflows into fully electronic flows that use the content of the documents. Scanning the document contents into workflows, however, is a manual, error-prone, and costly process, especially when the data extraction process requires high accuracy. These manual costs are a primary barrier to widespread adoption of distributed capture solutions for business-critical workflows such as insurance claims, medical records, or loan applications. Software solutions using artificial intelligence and natural language processing techniques are emerging to address these needs, but each has its own strengths and weaknesses, and none has demonstrated a high level of accuracy across the many unstructured document types included in these business-critical workflows. This paper describes how to overcome many of these limitations by intelligently combining multiple approaches for document classification using meta-algorithmic design patterns. These patterns explore the error space in multiple engines, and provide improved and "emergent" results in comparison to voting schemes and to the output of any of the individual engines. This paper considers the results of the individual engines along with traditional combinatorial techniques such as voting, before describing prototype results for a variety of novel meta-algorithmic patterns that reduce individual document error rates by up to 13% and reduce system error rates by up to 38%.
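
The voting baseline named in the abstract, and the idea of using each engine's error behavior when combining outputs, can be pictured with a minimal Python sketch. This is not the paper's implementation: the function names, the two-label example data, and the precision-weighted rule are all illustrative assumptions, chosen only to show how a confusion matrix can change a combined decision.

```python
from collections import Counter, defaultdict

def plurality_vote(predictions):
    """Return the label predicted by the most engines.
    Ties go to the label of the earliest engine in the list."""
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label

def class_precisions(confusion):
    """confusion[true_label][predicted_label] = count on training data.
    Returns, per predicted label, the fraction of those predictions that were correct."""
    column_totals = defaultdict(int)
    for row in confusion.values():
        for predicted, n in row.items():
            column_totals[predicted] += n
    return {
        label: confusion.get(label, {}).get(label, 0) / total
        for label, total in column_totals.items()
        if total > 0
    }

def weighted_vote(predictions, precisions_per_engine):
    """Each engine's vote counts in proportion to its precision for the label it chose."""
    scores = defaultdict(float)
    for label, precisions in zip(predictions, precisions_per_engine):
        scores[label] += precisions.get(label, 0.0)
    return max(scores, key=scores.get)

# Hypothetical training-set confusion matrices for three engines (assumed data).
conf_a = {"claim": {"claim": 19, "loan": 1}, "loan": {"claim": 1, "loan": 19}}
conf_b = {"claim": {"claim": 8, "loan": 12}, "loan": {"claim": 12, "loan": 8}}
conf_c = {"claim": {"claim": 10, "loan": 10}, "loan": {"claim": 10, "loan": 10}}

predictions = ["claim", "loan", "loan"]  # outputs of engines A, B, C for one document
precisions = [class_precisions(c) for c in (conf_a, conf_b, conf_c)]

print(plurality_vote(predictions))             # loan
print(weighted_vote(predictions, precisions))  # claim
```

In this toy example the two weaker engines outvote the stronger one under plurality voting, while the weighted rule sides with the engine whose "claim" predictions have historically been reliable.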

References

[1] Simske, S and Lin, X. "Creating digital libraries: content generation and re-mastering", HP Labs Technical Report 2003-259, posted at: http://www.hpl.americas.hp.net/techreports/2003/HPL-2003-259.pdf.
[2] Mohomine text classifier, subsequently purchased by Kofax (ww.kofax.com) and integrated into Indicius, http://www.kofax.com/products/indicius/index.asp.
[3] "Divmod Reverend" at http://divmod.org/trac/wiki/DivmodReverend.
[4] The University of California, Irvine Knowledge Discovery in Databases Archive, 20 Newsgroups data set, http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html.
[5] Lin, X, Yacoub, S, Burns, J and Simske, S. "Performance analysis of pattern classifier combination by plurality voting". Pattern Recognition Letters 24:1959–1969 (2003).
[6] Sebastiani, F. "Machine learning in automated text categorization". ACM Comput. Surv. 34(1):1–47 (2002).
[7] Ruta, D, Gabrys, B. "An overview of classifier fusion methods." Computing and Information Systems 7(1):1–10 (2000).
[8] Freund, Y, Schapire, RE. "A decision-theoretic generalization of on-line learning and an application to boosting," Proc. Second European Conf Learning Theory, LNCS (1995).
[9] Baoli L, Qin L, Shiwen Y. "An adaptive k-nearest neighbor text categorization strategy." ACM Trans Asian Language Info Proc (TALIP), 3:215–226 (2004).
[10] Zhang, D, Chen, X, Lee, WS. "Text classification with kernels on the multinomial manifold." Proc. 28th Ann. Intl. ACM SIGIR Conf on Res Dev in IR, 266–273 (2005).

Reviews

Amos O Olagunju

Industry is integrating more paperwork activities into electronic information storage and retrieval systems, in response to regulatory and cost-saving mandates. For decades, automatic indexing and document classification efforts have focused on individual techniques [1,2]. Incorporating multiple document classification algorithms and incongruent indexing schemes into one engine is a very tricky problem. When the combined classifiers are less correlated, document classification error rates ought to fall. However, creating a document classification system that simultaneously reduces the classification error rate and diminishes errors in the indexing scheme is not easy. Simske et al. advocate an automated, transcending algorithmic pattern design for minimizing manual human intervention in document processing, to drastically curtail the operational costs of document workflows. The document classifiers investigated include an open source Bayesian spam filter, the Hewlett-Packard engine that normalizes term frequency by the distribution frequency, a commercial neural net engine, and clever combinations of these algorithmic systems. Multifarious blends of statistical techniques were applied to various fusions of document classifier engines, significantly shrinking classification error rates. The authors present a variety of transformed algorithmic patterns that greatly improve the initial and final classifications of sets of documents in the presence of preliminary group errors in the indexing designs. Although the interaction between highly correlated document classifiers with obscure or unknown probability spaces remains an open problem, this paper offers insights for the design and performance evaluation of future multiengine document retrieval systems.

Online Computing Reviews Service
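
The review's remark that less-correlated classifiers ought to lower error rates can be illustrated with a small, idealized Python calculation. It assumes a two-class problem and engines whose errors are fully independent and equally likely, which is a textbook simplification rather than a claim about the engines studied in the paper; the function name and the 20% error rate are hypothetical.

```python
from math import comb

def majority_vote_error(p, n):
    """Probability that a majority of n independent binary classifiers,
    each wrong with probability p, is wrong at the same time (n odd)."""
    k_min = n // 2 + 1  # smallest number of wrong engines that forms a majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))

# Three independent engines, each with a 20% error rate (assumed figure).
print(round(majority_vote_error(0.2, 3), 3))  # 0.104
```

Under these assumptions the combined error is about 10.4%, roughly half the individual rate. If the three engines instead always erred on the same documents (perfect correlation), voting would leave the error at 20%, which is why the correlation between engines matters.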

Published In

DocEng '06: Proceedings of the 2006 ACM symposium on Document engineering
October 2006
232 pages
ISBN:1595935150
DOI:10.1145/1166160
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. confusion matrix
  2. document classification
  3. document indexing
  4. engine combination
  5. meta-algorithmics

Qualifiers

  • Article

Conference

DocEng06: ACM Symposium on Document Engineering
October 10 - 13, 2006
Amsterdam, The Netherlands

Acceptance Rates

Overall Acceptance Rate 194 of 564 submissions, 34%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 5
  • Downloads (Last 6 weeks): 1

Reflects downloads up to 01 Jan 2025

Citations

Cited By

  • (2024) Probabilistic Confusion Matrix: A Novel Method for Machine Learning Algorithm Generalized Performance Analysis. Technologies 12:7 (113). Online publication date: 13-Jul-2024. DOI:10.3390/technologies12070113
  • (2016) Mass Serialization Method for Document Encryption Policy Enforcement. Proceedings of the 2016 ACM Symposium on Document Engineering (193-196). Online publication date: 13-Sep-2016. DOI:10.1145/2960811.2967166
  • (2015) The rationale for ensemble and meta-algorithmic architectures in signal and information processing. APSIPA Transactions on Signal and Information Processing 4. Online publication date: 2-Sep-2015. DOI:10.1017/ATSIP.2015.10
  • (2013) Second‐Order Meta‐Algorithmics and their Applications. Meta‐Algorithmics (272-309). Online publication date: 27-May-2013. DOI:10.1002/9781118626719.ch8
  • (2013) Introduction to Meta‐Algorithmics. Meta‐Algorithmics (175-240). Online publication date: 27-May-2013. DOI:10.1002/9781118626719.ch6
  • (2013) The Future. Meta‐Algorithmics (360-368). Online publication date: 27-May-2013. DOI:10.1002/9781118626719.ch11
  • (2009) Printer-scanner identification via analysis of structured security deterrents. 2009 First IEEE International Workshop on Information Forensics and Security (WIFS) (151-155). Online publication date: Dec-2009. DOI:10.1109/WIFS.2009.5386463
  • (2009) Dynamic biometrics: The case for a real-time solution to the problem of access control, privacy and security. 2009 First IEEE International Conference on Biometrics, Identity and Security (BIdS) (1-10). Online publication date: Sep-2009. DOI:10.1109/BIDS.2009.5507535
  • (2008) An optical character recognition approach to qualifying thresholding algorithms. Proceedings of the eighth ACM symposium on Document engineering (263-266). Online publication date: 16-Sep-2008. DOI:10.1145/1410140.1410197
