An evaluation of a large, operational full-text document-retrieval system (containing roughly 350,000 pages of text) shows the system to be retrieving less than 20 percent of the documents relevant to a particular search. The findings are discussed in terms of the theory and practice of full-text document retrieval.
References
[1] Blair, D.C. Searching biases in large interactive document retrieval systems. J. Am. Soc. Inf. Sci. 31 (July 1980), 271-277.
Resnikoff, H.L. The national need for research in information science. STI Issues and Options Workshop, House Subcommittee on Science, Research and Technology, Washington, D.C., Nov. 3, 1976.
Swanson, D.R. Searching natural language text by computer. Science 132, 3434 (Oct. 1960), 1099-1104.
[7] Swanson, D.R. Information retrieval as a trial and error process. Libr. Q. 47, 2 (1977), 128-148.
The notion of automatic full-text retrieval is clearly attractive; retrieval is based on automatic searching of documents for those embodying certain subject content. Such a system may involve automatic preprocessing of documents to form indexes or other structures to facilitate retrieval, but it excludes human intervention such as the manual indexing of documents. Many commercial retrieval systems provide access to databases that have not been manually indexed, so the idea of full-text retrieval is not simply a research issue.
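As a rough illustration of the kind of automatic preprocessing and searching described above, the following sketch builds an inverted index from raw text and answers a Boolean AND query against it. It is a minimal Python illustration, not a description of how STAIRS itself is implemented; the sample documents and function names are invented.

# Minimal sketch (illustrative only, not how STAIRS works): build an inverted
# index from raw text and answer a Boolean AND query against it.
from collections import defaultdict

def build_inverted_index(documents):
    """Map each word to the set of document ids whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def boolean_and_search(index, terms):
    """Return the ids of documents containing every query term."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()

docs = {1: "accident report for the defective valve",
        2: "memorandum about the valve recall",
        3: "minutes of the safety committee meeting"}
index = build_inverted_index(docs)
print(boolean_and_search(index, ["valve", "recall"]))   # prints {2}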
This paper describes a large-scale search and retrieval experiment aimed at evaluating the effectiveness of full-text retrieval. IBM's STAIRS, a fast, large-capacity, full-text retrieval system, was used for the study. The database consisted of just under 40,000 documents, representing roughly 350,000 pages of hard-copy text, related to the defense of a large corporate lawsuit. Two attorneys participated in the experiment, along with two paralegals who were familiar with the case and experienced with the STAIRS system. A total of 51 retrieval requests were processed. Precision and Recall were chosen as the measures of retrieval effectiveness. Overall, these aspects of the experimental design were thoughtfully done. The scope of the study is impressive: it involved two researchers and six support staff members, took six months, and cost almost half a million dollars.
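For readers unfamiliar with the two measures, Precision is the fraction of retrieved documents that are relevant, and Recall is the fraction of relevant documents that are retrieved. The short sketch below computes both from document sets; the numbers are invented for illustration and are not figures from the study.

# Standard definitions of the two effectiveness measures (invented numbers):
def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def recall(retrieved, relevant):
    """Fraction of the relevant documents that are retrieved."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

retrieved = set(range(100))        # 100 documents returned for a query
relevant  = set(range(80, 180))    # 100 documents judged relevant
print(precision(retrieved, relevant))  # 0.2
print(recall(retrieved, relevant))     # 0.2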
The most significant reported result of the experiment is that the average value of Recall was 20 percent, a level clearly unacceptable to the lawyers, who had specified a need for at least 75 percent Recall. The authors discuss the reasons for these results and suggest a theoretical basis for them. While it is encouraging to see such a large-scale experiment, the paper is disappointing in several areas.
The authors apparently recognize the difference between what are referred to as “data retrieval” and “information retrieval.” However, they ignore the fact that they are using a data retrieval system (that happens to handle text better than most) to do information retrieval. Ascribing Recall levels to STAIRS (“This meant that, on average, STAIRS could be used to retrieve only 20 percent of the relevant documents. . .”), as they do, is wrong. The authors seemingly respond to this comment when they write, “An objection that might be made to our evaluation of STAIRS is that the low Recall observed was not due to STAIRS but rather to query-formulation error.” They describe the difficulty facing the user who is trying to express a request to STAIRS such that all (and only) the relevant documents will be retrieved. This difficulty certainly exists, but it does not follow that the poor performance should therefore be ascribed to the retrieval step. One can equally argue that STAIRS performs at a level of 100 percent in response to any request and that, therefore, however difficult the task may be, it is query formulation that is being evaluated. However, the problem goes deeper than that, which leads to a further point.
Clearly the authors recognize only two steps in the retrieval process: Boolean query formulation and Boolean retrieval using STAIRS. And indeed, those are the steps in their experimental procedure. But running such a simple system does not warrant the sweeping conclusions that they draw: “The retrieval problems we describe would be problems with any large-scale, full-text retrieval system.” In a 1970 paper by Salton [1] that they cite, easily ten different tools to aid in automatic full-text retrieval are referred to or described. The authors incorporate none of these, nor any of those proposed in the intervening years. That Recall is low, given their approach, is not news.
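To make concrete what such an aid might look like, the sketch below expands a query with synonyms from a thesaurus before performing the Boolean search, reusing the inverted index from the earlier sketch. The thesaurus and terms are invented; this is merely one illustration of the family of aids the review alludes to, not a reconstruction of any particular technique from Salton's paper.

# Illustrative sketch of one such aid: synonym (thesaurus) expansion of a
# Boolean query, reusing the inverted index built in the earlier sketch.
# The thesaurus below is invented for illustration.
THESAURUS = {
    "accident": {"accident", "mishap", "incident"},
    "defective": {"defective", "faulty", "flawed"},
}

def expand_term(term):
    """Return the synonyms of a term (or just the term itself if none known)."""
    return THESAURUS.get(term.lower(), {term.lower()})

def expanded_and_search(index, terms):
    """AND across concepts, OR within each concept's synonym set."""
    result = None
    for term in terms:
        hits = set().union(*(index.get(s, set()) for s in expand_term(term)))
        result = hits if result is None else result & hits
    return result if result is not None else set()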
Interesting examples are given of problems they encountered in the way words were used in their database. Because the data included personal correspondence, memoranda, and verbatim minutes of meetings, the problems are particularly severe and intriguing. The authors do not speculate on how much this particular factor contributed to the poor performance. However, it is clear that such documents inhibit Recall in ways that scientific journal articles, for example, would not.
Another area of concern is the authors' discussion of why Recall must decrease as file size increases. For example, they propose to show how Recall is calculated for a two-term search, based on the probability of each term occurring in a relevant document and the probability of a searcher using that term in a query. But their analysis is embarrassingly simplistic and incomplete: it holds only under the assumption that every other query results in a Recall of zero, which is clearly not the case. Their claim that their study "shows that full-text retrieval does not operate at satisfactory levels and that there are sound theoretical reasons to expect this to be so" is simply not validated in this paper.
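For concreteness, the sketch below reproduces the shape of the calculation being criticized: under independence assumptions, the expected Recall of a single two-term conjunctive query is the product of the four probabilities involved. The probability values are invented for illustration. The review's objection is that equating this figure with the Recall of the whole search implicitly assumes that no other query the searcher issues retrieves any additional relevant documents.

# Sketch of the shape of the criticized calculation (all numbers invented):
p1, p2 = 0.7, 0.6   # probability that each term occurs in a relevant document
q1, q2 = 0.8, 0.5   # probability that the searcher includes each term in the query

# Expected Recall of the single query "term1 AND term2", assuming the four
# events are independent:
single_query_recall = p1 * q1 * p2 * q2
print(single_query_recall)   # 0.168

# The review's point: treating this as the Recall of the whole search assumes
# that every other formulation the searcher tries adds no relevant documents.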
Finally, it may be noted that they have traded one experimental problem for another. Much research in information retrieval has indeed suffered from the small numbers of documents used in experiments. Yet, with small size came the advantage of being able to do multitudinous comparative studies. So, for example, previous work that these authors cite involved comparisons between manual indexing and automatic full-text retrieval. Their study, while done on a large number of documents, provides no such comparison.