Evaluation of Full-Text Retrieval System Using Collection of Serially Evolved Documents

Authors:

Kwang-Nam Choi

ICIBE '17: Proceedings of the 3rd International Conference on Industrial and Business Engineering

Pages 40 - 45

https://doi.org/10.1145/3133811.3133817

Published: 17 August 2017

Abstract

Finding a document that is similar to a specified query document within a large document database is one of the important problems of the Big Data era, as most available data is in the form of unstructured text. Our test collection consists of two parts. In the first part, texts were produced manually, using an artificial plagiarism approach carried out as a linear, pipelined procedure. In the second part, texts were generated by software that inserts, deletes, and substitutes certain parts of a target document to make a similar document from an input document. This document set is known as the Serially Evolved Documents (SED). We propose two new measures, Order Preserving Precision (OPP) and Order Preserving Recall (OPR), to compute how well the evolutionary order is preserved among the output documents returned by the IR system under test. Using these test texts, we evaluated KONAN, a document retrieval system for Korean documents.
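The software-generated part of the collection can be pictured with a minimal Python sketch. It assumes word-level edits, a fixed number of edits per generation, and a small replacement vocabulary; none of these details come from the paper, and the names `evolve`, `make_sed_chain`, and `order_agreement` are hypothetical. The abstract does not define OPP or OPR, so `order_agreement` is only a simple pairwise stand-in for the idea of checking whether a ranking keeps the evolutionary order. It is not the paper's measure.

```python
import random


def evolve(words, rng, n_edits=3, vocab=("alpha", "beta", "gamma", "delta")):
    """Return a copy of `words` after random insert/delete/substitute edits.

    Assumption: edits are applied at word level; the paper may edit
    sentences, passages, or other units.
    """
    out = list(words)
    for _ in range(n_edits):
        op = rng.choice(("insert", "delete", "substitute"))
        if op == "insert" or not out:
            # Insert at any position, including the end of the document.
            out.insert(rng.randint(0, len(out)), rng.choice(vocab))
        elif op == "delete":
            del out[rng.randrange(len(out))]
        else:
            out[rng.randrange(len(out))] = rng.choice(vocab)
    return out


def make_sed_chain(seed_text, generations=5, seed=0):
    """Build a chain in which each document is derived from the previous one."""
    rng = random.Random(seed)
    chain = [seed_text.split()]
    for _ in range(generations):
        chain.append(evolve(chain[-1], rng))
    return [" ".join(doc) for doc in chain]


def order_agreement(ranked_generations):
    """Fraction of ranked pairs that appear in ascending generation order.

    Hypothetical stand-in, NOT the paper's OPP or OPR.
    """
    pairs = [
        (a, b)
        for i, a in enumerate(ranked_generations)
        for b in ranked_generations[i + 1:]
    ]
    if not pairs:
        return 1.0
    return sum(a < b for a, b in pairs) / len(pairs)
```

With generation 0 as the query, a retrieval system that ranks results by similarity would ideally return the later generations in ascending order, which gives an `order_agreement` of 1.0; a fully reversed ranking gives 0.0.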

Index Terms

Evaluation of Full-Text Retrieval System Using Collection of Serially Evolved Documents
1. Information systems
  1. Information retrieval
    1. Evaluation of retrieval results
      1. Retrieval effectiveness
  2. Information systems applications
    1. Digital libraries and archives

Information

Published In

ICIBE '17: Proceedings of the 3rd International Conference on Industrial and Business Engineering

August 2017

107 pages

ISBN:9781450353519

DOI:10.1145/3133811

© 2017 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

In-Cooperation

Waseda University

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

ICIBE 2017: 2017 3rd International Conference on Industrial and Business Engineering

August 17 - 19, 2017

Sapporo, Japan
