
Classification of Protein-Protein Interaction Full-Text Documents Using Text and Citation Network Features

Published: 01 July 2010

Abstract

We participated (as Team 9) in the Article Classification Task of the BioCreative II.5 Challenge: binary classification of full-text documents relevant for protein-protein interaction. We used two distinct classifiers for the online and offline challenges: 1) the lightweight Variable Trigonometric Threshold (VTT) linear classifier, which we successfully introduced in BioCreative 2 for binary classification of abstracts, and 2) a novel Naive Bayes classifier using features from the citation network of the relevant literature. We supplemented the supplied training data with full-text documents from the MIPS database. The lightweight VTT classifier was very competitive in this new full-text scenario: it was a top-performing submission in this task when judged by the rank product of four performance measures (Area Under the interpolated precision and recall Curve, Accuracy, Balanced F-Score, and Matthews Correlation Coefficient). The citation network classifier, which is novel for the biomedical text mining domain, was not a top-performing classifier in the challenge but performed above the central tendency of all submissions, and therefore indicates a promising new avenue for further investigation in bibliome informatics.
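To make the evaluation criterion concrete, the following is a minimal sketch of how three of the four measures named in the abstract (Accuracy, Balanced F-Score, Matthews Correlation Coefficient) can be computed from a binary confusion matrix, and how a rank product can combine them across submissions. It is not the challenge's official scoring code: the function names, the tie handling, and the confusion-matrix counts for the three hypothetical runs are all assumptions made for illustration, and the area under the interpolated precision and recall curve is left out because it requires ranked outputs rather than counts.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Accuracy, balanced F-score (F1), and MCC from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "f1": f1, "mcc": mcc}


def rank_product(scores):
    """Multiply each submission's per-measure ranks (1 = best, higher score wins).

    scores maps a submission name to a dict of measure name -> value.
    Tied values share the best rank among them (an assumption of this sketch).
    """
    measures = list(next(iter(scores.values())).keys())
    products = {name: 1 for name in scores}
    for measure in measures:
        ordered = sorted((s[measure] for s in scores.values()), reverse=True)
        for name, s in scores.items():
            products[name] *= ordered.index(s[measure]) + 1
    return products


# Hypothetical confusion-matrix counts for three made-up submissions.
runs = {
    "run_a": binary_metrics(tp=40, fp=10, tn=45, fn=5),
    "run_b": binary_metrics(tp=30, fp=5, tn=50, fn=15),
    "run_c": binary_metrics(tp=20, fp=30, tn=25, fn=25),
}
products = rank_product(runs)
best = min(products, key=products.get)  # the lowest rank product ranks first
print(products, best)
```

A lower rank product means a submission ranked consistently well across all measures, so a run that is strong on one measure but weak on the others does not come out on top.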



    Published In

    IEEE/ACM Transactions on Computational Biology and Bioinformatics, Volume 7, Issue 3
    July 2010
    192 pages

    Publisher

    IEEE Computer Society Press
    Washington, DC, United States


    Author Tags

    1. text mining
    2. binary classification
    3. citation network
    4. literature mining
    5. protein-protein interaction
