DOI: 10.1145/2806416.2806550
research-article

External Knowledge and Query Strategies in Active Learning: a Study in Clinical Information Extraction

Published: 17 October 2015

Abstract

This paper presents a new active learning query strategy for information extraction, called Domain Knowledge Informativeness (DKI). Active learning is often used to reduce the annotation effort required to obtain training data for machine learning algorithms. A key component of an active learning approach is the query strategy, which iteratively selects samples for annotation. Knowledge resources have been used in information extraction to derive additional features for sample representation. DKI is, however, the first query strategy that exploits such resources to inform sample selection. To evaluate the merits of DKI, in particular the reduction in annotation effort that it achieves, we conduct a comprehensive empirical comparison of active learning query strategies for information extraction within the clinical domain. The clinical domain was chosen for this work because of the availability of extensive structured knowledge resources, which have often been exploited for feature generation. In addition, the clinical domain offers a compelling use case for active learning because of the high costs and hurdles associated with obtaining annotations in this domain. Our experimental findings demonstrate that (1) amongst existing query strategies, the ones based on the classification model's confidence are a better choice for clinical data, as they perform equally well with a much lighter computational load, and (2) significant reductions in annotation effort are achievable by exploiting knowledge resources within active learning query strategies, with up to 14% fewer tokens and concepts to manually annotate than with state-of-the-art query strategies.
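
The iterative sample selection described in the abstract can be illustrated with a minimal sketch of pool-based active learning using a confidence-based query strategy (least confidence). This is not the paper's DKI strategy, whose scoring is not given in the abstract. The names `select_batch`, `active_learning`, `train` and `annotate` are hypothetical, and the model is assumed to be any callable that returns the probability of its most likely labelling for a sample.

```python
from typing import Callable, Dict, List, Tuple

Sample = str
Label = str
# Assumption: a model is a callable returning P(most likely labelling | sample).
Model = Callable[[Sample], float]


def select_batch(model: Model, pool: List[Sample], batch_size: int) -> List[Sample]:
    """Least-confidence query strategy: pick the samples the model is least sure of."""
    # Informativeness = 1 - P(best labelling); a higher value means more uncertainty.
    ranked = sorted(pool, key=lambda s: 1.0 - model(s), reverse=True)
    return ranked[:batch_size]


def active_learning(
    train: Callable[[Dict[Sample, Label]], Model],  # hypothetical trainer
    annotate: Callable[[Sample], Label],            # hypothetical human annotator
    seed: Dict[Sample, Label],
    pool: List[Sample],
    batch_size: int,
    rounds: int,
) -> Tuple[Model, Dict[Sample, Label]]:
    """Train, query the most informative samples, annotate them, and retrain."""
    labelled = dict(seed)
    unlabelled = [s for s in pool if s not in labelled]
    model = train(labelled)
    for _ in range(rounds):
        if not unlabelled:
            break  # nothing left to annotate
        for sample in select_batch(model, unlabelled, batch_size):
            labelled[sample] = annotate(sample)  # the costly manual step
        unlabelled = [s for s in unlabelled if s not in labelled]
        model = train(labelled)  # retrain on the enlarged labelled set
    return model, labelled
```

A knowledge-informed strategy would replace only the ranking key inside `select_batch`; the surrounding loop stays the same, which is why query strategies can be compared on equal terms by the amount of annotation each one needs.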

    Published In

    CIKM '15: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management
    October 2015
    1998 pages
    ISBN:9781450337946
    DOI:10.1145/2806416
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. active learning
    2. clinical free text
    3. concept extraction
    4. conditional random fields
    5. domain knowledge

    Conference

    CIKM'15
    Acceptance Rates

    CIKM '15 Paper Acceptance Rate 165 of 646 submissions, 26%;
    Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

    Cited By

    • (2021) From Limited Annotated Raw Material Data to Quality Production Data. Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 4114-4124. DOI: 10.1145/3459637.3481921. Online publication date: 26-Oct-2021
    • (2019) Quantum speedup for pool-based active learning. Quantum Information Processing, 18:11, 1-11. DOI: 10.1007/s11128-019-2460-x. Online publication date: 1-Nov-2019
    • (2019) The Scholarly Impact and Strategic Intent of CLEF eHealth Labs from 2012 to 2017. Information Retrieval Evaluation in a Changing World, 333-363. DOI: 10.1007/978-3-030-22948-1_14. Online publication date: 14-Aug-2019
    • (2018) Active learning for classifying long‐duration audio recordings of the environment. Methods in Ecology and Evolution, 9:9, 1948-1958. DOI: 10.1111/2041-210X.13042. Online publication date: 5-Jul-2018
    • (2018) Introduction. Managing Data From Knowledge Bases: Querying and Extraction, 1-18. DOI: 10.1007/978-3-319-94935-2_1. Online publication date: 1-Aug-2018
    • (2017) Active Learning with Adaptive Density Weighted Sampling for Information Extraction from Scientific Papers. Artificial Intelligence and Natural Language, 77-90. DOI: 10.1007/978-3-319-71746-3_7. Online publication date: 28-Nov-2017
    • (2017) Clinical information extraction using small data. Journal of the Association for Information Science and Technology, 68:11, 2543-2556. DOI: 10.1002/asi.23936. Online publication date: 1-Nov-2017
