DOI: 10.1145/2382936.2382949
research-article

OCR-based image features for biomedical image and article classification: identifying documents relevant to cis-regulatory elements

Published: 07 October 2012

Abstract

Images form a significant and useful source of information in published biomedical articles, one that is still under-utilized in biomedical document classification and retrieval. Much current work on biomedical image retrieval and classification employs simple, standard image features such as gray-scale histograms and edge direction to represent and classify images. We also used such features to classify images in our earlier work [5], where image-class tags were used to represent and classify articles.
In the work presented here we focus on a different literature classification task, motivated by the need to identify articles discussing cis-regulatory elements and modules in the context of understanding complex gene networks. Curators who try to identify such articles in the vast literature use a certain type of image, in which the conserved cis-regulatory region on the DNA is shown, as a major cue. Our experiments show that automatically identifying such images using common image features (like those mentioned above) can be highly error-prone. However, using Optical Character Recognition (OCR) to extract alphabetic characters from images, calculating the character distribution, and using the distribution parameters as image features allows us to form a novel representation of images and to identify DNA content in images with high precision and recall (over 0.9). Utilizing the occurrence of such DNA-rich images within articles, we train a classifier that identifies articles pertaining to cis-regulatory elements with similarly high precision and recall. OCR-based image features have potential beyond the current task, for identifying other types of biomedical sequence-based images showing DNA, RNA, and proteins. Moreover, the ability to automatically identify such images could be widely applicable in other important biomedical document classification tasks.
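The character-distribution idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the OCR step has already produced a plain-text string for each image, and the function names, the choice of A, C, G, and T as the tracked letters, and the 0.8 threshold are all hypothetical choices made for illustration. The abstract does not specify which distribution parameters or classifier were used.

```python
from collections import Counter
import string

# Letters treated as nucleotide symbols (an assumption for this sketch).
DNA_LETTERS = "ACGT"


def _letters(ocr_text):
    """Keep only the ASCII letters of the OCR output, upper-cased."""
    return [c for c in ocr_text.upper() if c in string.ascii_uppercase]


def char_distribution(ocr_text):
    """Relative frequency of each letter A-Z among the recognized letters."""
    letters = _letters(ocr_text)
    counts = Counter(letters)
    total = len(letters)
    return {c: (counts[c] / total if total else 0.0)
            for c in string.ascii_uppercase}


def dna_fraction(ocr_text):
    """Share of recognized letters that are A, C, G, or T (0.0 if none)."""
    letters = _letters(ocr_text)
    if not letters:
        return 0.0
    return sum(1 for c in letters if c in DNA_LETTERS) / len(letters)


def dna_rich_image_count(image_texts, threshold=0.8):
    """Count images whose DNA-letter share meets a hypothetical threshold.

    Such a count could serve as one article-level feature for a classifier.
    """
    return sum(1 for text in image_texts if dna_fraction(text) >= threshold)
```

In this sketch, `char_distribution` yields a 26-value feature vector per image, while `dna_rich_image_count` summarizes an article by how many of its images look sequence-like. A trained classifier would normally replace the fixed threshold.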

References

[1]
Eppig JT, Bult CA, Kadin JA, Richardson JE and Blake JA. 2005. The Mouse Genome Database (MGD): From Genes to Mice --- A Community Resource for Mouse Biology. Nucleic Acids Research, 33, (Database Issue), D471--D475.
[2]
Smith CM, Finger JH, Hayamizu TF, McCright IJ, Eppig JT, Kadin JA, Richardson JE and Ringwald M. 2007. The Mouse Gene Expression Database (GXD): 2007 Update. Nucleic Acids Res, 35, D618--D623.
[3]
Hersh WR, Cohen A, Yang J, Bhuptiraju RT, Roberts P, Hearst M. 2006. TREC 2005 Genomics Track Overview. Proc. of TREC 2005, NIST Special Publication. 14--25.
[4]
Krallinger M, Vazquez M, Leitner F, et al. 2011. The Protein-Protein Interaction tasks of BioCreative III: classification/ranking of articles and linking bio-ontology concepts to full text. BMC Bioinformatics, 12(Suppl 8):S3.
[5]
Shatkay H, Chen N, Blostein D. 2006. Integrating Image Data into Biomedical Text Categorization. Bioinformatics, 22(11), e446--e453.
[6]
Murphy RF, Velliste M, Yao J, Porreca G. 2001. Searching Online Journals for Fluorescence Microscope Images Depicting Protein Subcellular Location Patterns. Proc. of the 2nd IEEE Int. Symp. on Bio-Informatics and Biomedical Engineering (BIBE'01), 119--128.
[7]
Cohen W, Kou Z, Murphy RF. 2003. Extracting Information from Text and Images for Location Proteomics. Proc. of the 3rd ACM SIGKDD Workshop on Data Mining in Bioinformatics (BIOKDD'03), 2--9.
[8]
Qian Y, Murphy RF. 2008. Improved Recognition of Figures containing Fluorescence Microscope Images in Online Journal Articles using Graphical Models. Bioinformatics 24, 569--576.
[9]
SLIF: Subcellular Localization Image Finder. Carnegie Mellon University. http://slif.cbi.cmu.edu.
[10]
Rafkind B, Lee M, Chang S, Yu H. 2006. Exploring Text and Image Features to Classify Images in Bioscience Literature. Proc. of the BioNLP Workshop on Linking Natural Language Processing and Biology at HLT-NAACL.
[11]
ImageCLEF Medical (since 2007). Cross-Language Image Retrieval Evaluation. http://www.imageclef.org/
[12]
Demner-Fushman D, Antani S, Simpson M and Thoma GR. (2009). Annotation and Retrieval of Clinically Relevant Images. International Journal of Medical Informatics: Special Issue on Mining of Clinical and Biomedical Text and Data, 78(12), e59--e67.
[13]
Yu H, Liu FF, Ramesh BP. 2010. Automatic Figure Ranking and User Interfacing for Intelligent Figure Search. PLoS One 5(10), e12983.
[14]
Chen N, Shatkay H, Blostein D. 2006. Exploring a new space of features for document classification: figure clustering. Proc. of the 2006 Conference of the IBM Center for Advanced Studies on Collaborative research. (CASCON'06).
[15]
Xu S, McCusker J, Krauthammer M. 2008. Exploring the use of image text for biomedical literature retrieval. Proc. of the AMIA Annu Symp, 2008, 1186.
[16]
Rodriguez-Esteban R, Iossifov I. 2009. Figure Mining for Biomedical research. Bioinformatics, 25(16), 2082--2084.
[17]
Gonzalez RC, Woods RE. 2002. Digital Image Processing. Prentice-Hall.
[18]
Haralick RM, Shanmugam K, Dinstein I. 1973. Texture features for image classification. IEEE Trans. On Systems, Man and Cybernetics, SMC-3(6), 610--621.
[19]
Jain AK, Vailaya A. 1998. Shape-based retrieval: a case study with trademark image databases. Pattern Recognition, 31(9), 1369--1390.
[20]
Istrail S, Tarpine R, Schutter K, and Aguiar D. 2010. Practical Computational Methods for Regulatory Genomics: A cisGRN-Lexicon and cisGRN-Browser for Gene Regulatory Networks. Methods in Molecular Biology 1, 674, Computational Biology of Transcription Factor Binding, 369--399.
[21]
CYRENE. http://www.brown.edu/Research/Istrail_Lab/pages/cyrene.html
[22]
Annicotte JS, Fayard E, Swift GH et al. 2003. Pancreatic-duodenal homeobox 1 regulates expression of liver receptor homolog 1 during pancreas development. Mol Cell Biol, 23(19), 6713--6724. 12972592
[23]
Regev Y, et al. 2002. Rule-Based Extraction of Experimental Evidence in the Biomedical Domain - the KDD Cup (Task 1). SIGKDD Explorations, 4(2), 90--91.
[24]
Xerox Rossinante. https://pdf2epub.services.open.xerox.com/
[25]
ABBYY FineReader for OCR. http://finereader.abbyy.com/
[26]
Puppina C, Ivan Prestab I, D'Elia AV et al. 2004. Functional interaction among thyroid-specific transcription factors: Pax8 regulates the activity of Hex promoter. Mol Cell Endocrinol, 224(1--2), 117--125. 15062550
[27]
Nishi H, Nakada T, Kyo S et al. 2004. Hypoxia-inducible factor 1 mediates upregulation of telomerase (hTERT). Mol Cell Biol, 24(13), 6076--6083. 15199161
[28]
Witten IH, Frank E. 2005. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann. (Describes Weka: The Waikato Environment for Knowledge Analysis. http://www.cs.waikato.ac.nz/ml/weka.)
[29]
Brady S, Shatkay H. 2008. EpiLoc: a (working) text-based system for predicting protein subcellular location. Proc. of the Pacific Symposium on Biocomputing (PSB'08), 604--615.
[30]
Denroche R, Madupu R, Yooseph S, Sutton G, Shatkay H. 2010. Toward Computer-Assisted Text Curation: Classification Is Easy (Choosing Training Data Can Be Hard...) Linking Literature, Information, and Knowledge for Biology, Lecture Notes in Computer Science, 6004, 33--42.
[31]
Porter MF. 1997. An Algorithm for Suffix Stripping (Reprint). Readings in Information Retrieval, Morgan Kaufmann. http://www.tartarus.org/~martin/PorterStemmer/.

Published In

BCB '12: Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine
October 2012
725 pages
ISBN: 9781450316705
DOI: 10.1145/2382936

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. OCR
  2. biomedical text classification
  3. biomedical articles
  4. biomedical images
  5. cis-regulatory elements
  6. document classification
  7. image classification
  8. image features
  9. information retrieval
  10. optical character recognition

Acceptance Rates

BCB '12 Paper Acceptance Rate 33 of 159 submissions, 21%;
Overall Acceptance Rate 254 of 885 submissions, 29%

Cited By

  • (2020) Integrating image caption information into biomedical document classification in support of biocuration. Database 2020. DOI: 10.1093/database/baaa024. Online publication date: 15-Apr-2020
  • (2019) An effective biomedical document classification scheme in support of biocuration: addressing class imbalance. Database 2019. DOI: 10.1093/database/baz045. Online publication date: 25-Apr-2019
  • (2018) Number Recognition of Parts Book Schematics using Convolutional Recurrent Neural Network. 2018 International Conference on Information and Communication Technology Robotics (ICT-ROBOT), pp. 1-3. DOI: 10.1109/ICT-ROBOT.2018.8549859. Online publication date: Sep-2018
  • (2015) Utilizing image-based features in biomedical document classification. 2015 IEEE International Conference on Image Processing (ICIP), pp. 4451-4455. DOI: 10.1109/ICIP.2015.7351648. Online publication date: Sep-2015
