DOI: 10.1145/3209280.3209532
research-article

Active High-Recall Information Retrieval from Domain-Specific Text Corpora based on Query Documents

Published: 28 August 2018 Publication History

Abstract

In this paper, we propose a high-recall active document retrieval system for a class of applications in which queries are whole documents rather than key terms and the corpora are domain-specific. The system outputs a list of documents retrieved on the basis of domain-expert feedback collected during training. The model uses a modified Bag-of-Words (BoW) representation and a semantic ranking module based on Google n-grams. The core of the system is a binary document classification model trained through a continuous active learning strategy. Finding or constructing training data for this type of problem is generally very difficult, either because the data are confidential or because labeling requires domain-expert time. Our experimental results on retrieving Calls for Papers based on a manuscript demonstrate the efficacy of the system for this application and its performance compared to other candidate models.
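The continuous active learning loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the plain bag-of-words features, the logistic-regression classifier trained by SGD, the simulated `oracle` standing in for the domain expert, and all function names and toy data are assumptions made for illustration. The paper's modified BoW representation and Google n-gram semantic ranking module are not reproduced here.

```python
import math
from collections import Counter

def bow(text):
    """Plain bag-of-words term counts (assumption: vanilla BoW, not the paper's modified one)."""
    return Counter(text.lower().split())

def score(weights, features):
    """Logistic relevance score in (0, 1) for a sparse feature vector."""
    z = sum(weights.get(term, 0.0) * count for term, count in features.items())
    z = max(-30.0, min(30.0, z))  # clamp to keep math.exp well-behaved
    return 1.0 / (1.0 + math.exp(-z))

def train(labelled, epochs=50, lr=0.5):
    """Fit logistic-regression weights with plain SGD on (features, label) pairs."""
    weights = {}
    for _ in range(epochs):
        for features, label in labelled:
            err = label - score(weights, features)
            for term, count in features.items():
                weights[term] = weights.get(term, 0.0) + lr * err * count
    return weights

def continuous_active_learning(query_doc, corpus, oracle, budget):
    """Show the expert the top-scored unreviewed document, record the label, retrain, repeat."""
    labelled = [(bow(query_doc), 1)]       # the query document seeds the positive class
    unreviewed = dict(enumerate(corpus))   # index -> text of documents not yet judged
    relevant = []
    for _ in range(min(budget, len(corpus))):
        weights = train(labelled)
        best = max(unreviewed, key=lambda i: score(weights, bow(unreviewed[i])))
        text = unreviewed.pop(best)
        label = oracle(text)               # hypothetical stand-in for expert feedback
        labelled.append((bow(text), label))
        if label:
            relevant.append(best)
    return relevant

# Toy data, invented for this sketch.
query = "active learning for document retrieval"
corpus = [
    "call for papers on computer graphics and rendering",
    "call for papers on information retrieval and active learning",
    "workshop on marine biology",
    "symposium on document retrieval systems",
]
oracle = lambda text: int("retrieval" in text)

print(continuous_active_learning(query, corpus, oracle, budget=len(corpus)))
```

The budget caps how many documents the expert reviews. In a high-recall setting the loop would normally run until few or no new relevant documents surface, which is a stopping rule this sketch leaves out.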

Cited By

  • (2022) Information retrieval from scientific abstract and citation databases. Expert Systems with Applications: An International Journal 199:C. DOI: 10.1016/j.eswa.2022.116967. Online publication date: 23-May-2022
  • (2019) Multi-Objective GP Strategies for Topical Search Integrating Wikipedia Concepts. Proceedings of the ACM Symposium on Document Engineering 2019 (1-10). DOI: 10.1145/3342558.3345402. Online publication date: 23-Sep-2019

Information & Contributors

Information

Published In

DocEng '18: Proceedings of the ACM Symposium on Document Engineering 2018
August 2018
311 pages
ISBN: 9781450357692
DOI: 10.1145/3209280
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

In-Cooperation

  • SIGDOC: ACM Special Interest Group on Systems Documentation

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Active Learning
  2. Information Retrieval
  3. Recommendation Systems
  4. Semantic Similarity

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

DocEng '18: ACM Symposium on Document Engineering 2018
August 28 - 31, 2018
Halifax, NS, Canada

Acceptance Rates

Overall Acceptance Rate 194 of 564 submissions, 34%
