Structured queries, language modeling, and relevance modeling in cross-language information retrieval

Published: 01 May 2005

Abstract

Two probabilistic approaches to cross-lingual retrieval are in wide use today: those based on probabilistic models of relevance, as exemplified by INQUERY, and those based on language modeling. INQUERY, as a query net model, allows the easy incorporation of query operators, including a synonym operator, which has proven to be extremely useful in cross-language information retrieval (CLIR), in an approach often called structured query translation. In contrast, language models incorporate translation probabilities into a unified framework. We compare the two approaches on Arabic and Spanish data sets, using two kinds of bilingual dictionaries: one derived from a conventional dictionary, and one derived from a parallel corpus. We find that structured query processing gives slightly better results when queries are not expanded. On the other hand, when queries are expanded, language modeling gives better results, but only when using a probabilistic dictionary derived from a parallel corpus.

We pursue two additional issues inherent in the comparison of structured query processing with language modeling. The first concerns query expansion, and the second is the role of translation probabilities. We compare conventional expansion techniques (pseudo-relevance feedback) with relevance modeling, a new IR approach which fits into the formal framework of language modeling. We find that relevance modeling and pseudo-relevance feedback achieve comparable levels of retrieval and that good translation probabilities confer a small but significant advantage.
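The contrast between the two approaches can be illustrated with a minimal Python sketch. It is not the implementation evaluated in the article. The function names, the toy data, the mixing weight `lam`, and the exact smoothing formula are assumptions made for illustration. The first function treats every dictionary translation of a query term as one synonym class, which is the core idea behind structured query translation. The second weights each translation by a probability inside a smoothed query-likelihood score, which is one common way to fold translation probabilities into a language model.

```python
import math
from collections import Counter


def synonym_tf(doc_tokens, translations):
    """Synonym-operator style count: all translations of a query term are
    treated as occurrences of a single term, so their counts are summed."""
    counts = Counter(doc_tokens)
    return sum(counts[t] for t in translations)


def translation_lm_score(query, doc_tokens, collection_tokens, trans_probs, lam=0.5):
    """Log query likelihood with translation probabilities (illustrative form):

        log P(Q|D) = sum over q of
            log( sum over t of P(q|t) * (lam * P(t|D) + (1 - lam) * P(t|C)) )

    trans_probs[q][t] is an assumed translation weight P(q|t) linking a
    query-language term q to a document-language term t.
    """
    doc_counts = Counter(doc_tokens)
    coll_counts = Counter(collection_tokens)
    doc_len = len(doc_tokens)
    coll_len = len(collection_tokens)

    score = 0.0
    for q in query:
        p_q = 0.0
        for t, p_q_given_t in trans_probs.get(q, {}).items():
            p_t_doc = doc_counts[t] / doc_len if doc_len else 0.0
            p_t_coll = coll_counts[t] / coll_len if coll_len else 0.0
            # Smooth the document model with the collection model.
            p_q += p_q_given_t * (lam * p_t_doc + (1 - lam) * p_t_coll)
        if p_q == 0.0:
            # No translation of q occurs anywhere in the collection.
            return float("-inf")
        score += math.log(p_q)
    return score


# Toy English-to-Spanish example (hypothetical data).
doc1 = ["la", "paz", "mundial"]
doc2 = ["el", "mercado", "mundial"]
collection = doc1 + doc2
trans_probs = {"peace": {"paz": 0.9, "tranquilidad": 0.1}}

score1 = translation_lm_score(["peace"], doc1, collection, trans_probs)
score2 = translation_lm_score(["peace"], doc2, collection, trans_probs)
```

In this sketch the synonym operator ignores how likely each translation is, while the language-model score lets a high-probability translation contribute more than a rare one. That difference is what the abstract refers to as the role of translation probabilities.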

Published In

Information Processing and Management: an International Journal  Volume 41, Issue 3
Special issue: Cross-language information retrieval
May 2005
303 pages

Publisher

Pergamon Press, Inc.

United States

Author Tags

  1. cross-lingual information retrieval
  2. language modeling
  3. query expansion
  4. structured query translation

Qualifiers

  • Article

Cited By

  • (2011) Expanding queries with term and phrase translations in patent retrieval. Proceedings of the Second international conference on Multidisciplinary information retrieval facility (16-29). DOI: 10.5555/2018142.2018147. Online publication date: 6-Jun-2011
  • (2010) Applying an intelligent notification mechanism to blogging systems utilizing a genetic-based information retrieval approach. Expert Systems with Applications: An International Journal 37:1 (705-715). DOI: 10.1016/j.eswa.2009.05.094. Online publication date: 1-Jan-2010
  • (2009) Towards enhancing retrieval effectiveness of search engines for diacritisized Arabic documents. Information Retrieval 12:3 (300-323). DOI: 10.1007/s10791-008-9081-9. Online publication date: 1-Jun-2009
  • (2008) An enhanced genetic approach to optimizing auto-reply accuracy of an e-learning system. Computers & Education 51:1 (337-353). DOI: 10.1016/j.compedu.2007.05.014. Online publication date: 1-Aug-2008
  • (2007) Frequency-based identification of correct translation equivalents (FITE) obtained through transformation rules. ACM Transactions on Information Systems 26:1 (2-es). DOI: 10.1145/1292591.1292593. Online publication date: 1-Nov-2007
  • (2006) FITE-TRT. Proceedings of the 2006 ACM symposium on Applied computing (1043-1049). DOI: 10.1145/1141277.1141525. Online publication date: 23-Apr-2006
  • (2004) Cross-lingual information extraction system evaluation. Proceedings of the 20th international conference on Computational Linguistics (882-es). DOI: 10.3115/1220355.1220482. Online publication date: 23-Aug-2004
  • (2004) Language-specific models in multilingual topic tracking. Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval (402-409). DOI: 10.1145/1008992.1009061. Online publication date: 25-Jul-2004
