Abstract
Vocabulary incompatibilities arise when the terms used to index a document collection are largely unknown, or at least not well known, to the users who eventually search the collection. No matter how comprehensive or well structured the indexing vocabulary, it is of little use if it is not used effectively in query formulation. This paper demonstrates that techniques for mapping user queries into the controlled indexing vocabulary have the potential to radically improve document retrieval performance. We also show how a controlled indexing vocabulary can be used to achieve performance gains in collection selection. Finally, we demonstrate the potential benefit of combining these two techniques in an interactive retrieval environment. Given a user query and a list of controlled vocabulary terms suggested by a system, our evaluation approach simulates the human user's choice of terms for query augmentation. This strategy lets us evaluate interactive strategies without the need for human subjects.
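The simulated term choice described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the selection rule, so the one used here (a suggested term is "chosen" if it indexes at least one document already judged relevant to the query) is an assumption for illustration only, and the names `simulate_term_selection`, `augment_query`, and `max_terms` are hypothetical.

```python
from typing import Dict, List, Set


def simulate_term_selection(
    suggested_terms: List[str],
    relevant_doc_terms: Dict[str, Set[str]],
    max_terms: int = 5,
) -> List[str]:
    """Stand in for a human picking terms from a system's suggestion list.

    Assumption (not taken from the paper): a suggested term is kept if it
    is an index term of at least one document judged relevant to the query.
    """
    # Pool the controlled-vocabulary terms of all known-relevant documents.
    relevant_vocab: Set[str] = set().union(*relevant_doc_terms.values())
    # Preserve the system's suggestion order and cap the number of picks.
    chosen = [term for term in suggested_terms if term in relevant_vocab]
    return chosen[:max_terms]


def augment_query(query: str, chosen_terms: List[str]) -> str:
    """Append the chosen controlled-vocabulary terms to the free-text query."""
    return " ".join([query, *chosen_terms])


# Toy data: three suggested terms, two documents judged relevant.
suggested = ["Hypertension", "Diet", "Aspirin"]
relevant = {"d1": {"Hypertension", "Sodium"}, "d2": {"Diet"}}

chosen = simulate_term_selection(suggested, relevant)
augmented = augment_query("high blood pressure salt", chosen)
```

Because the choice is derived from existing relevance judgments rather than from a live participant, the same query set can be re-run under different suggestion strategies and compared directly.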
Cite this article
French, J.C., Powell, A.L., Gey, F. et al. Exploiting Manual Indexing to Improve Collection Selection and Retrieval Effectiveness. Information Retrieval 5, 323–351 (2002). https://doi.org/10.1023/A:1020447427581
DOI: https://doi.org/10.1023/A:1020447427581