Abstract
This paper presents an approach for categorizing documents according to their implicit locational relevance. We report a thorough evaluation of several classifiers designed for this task, built using support vector machines with multiple alternative feature vectors. Experimental results show that feature vectors combining document terms and URL n-grams with simple features related to the locality of the document (e.g., the total count of place references) lead to high accuracy. The paper also discusses how the proposed categorization approach can help improve tasks such as document retrieval and online contextual advertisement.
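To make the combined feature vector described above more concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. The toy gazetteer, the tokenizer, the n-gram length of 4, the feature-name prefixes, and the example URL are all hypothetical choices made for illustration; the abstract does not specify them.

```python
import re
from collections import Counter

# Hypothetical toy gazetteer (assumption): the place-name resource
# actually used in the paper is not described in the abstract.
GAZETTEER = {"lisbon", "porto", "portugal"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def url_ngrams(url, n=4):
    """Return all character n-grams of the lowercased URL."""
    s = url.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def build_features(text, url, n=4):
    """Combine term counts, URL n-gram counts, and one locality feature
    (total count of place references) into a single sparse vector."""
    tokens = tokenize(text)
    features = Counter()
    for token in tokens:
        features["term=" + token] += 1
    for gram in url_ngrams(url, n):
        features["url=" + gram] += 1
    # Locality feature: how many tokens match a known place name.
    features["loc:place_refs"] = sum(1 for t in tokens if t in GAZETTEER)
    return dict(features)


# Illustrative document and URL (both made up for this sketch).
features = build_features(
    "Restaurants in Lisbon and Porto",
    "http://example.pt/lisbon",
)
```

Each document becomes a sparse dictionary of named features. A dictionary of this shape could then be vectorized and passed to a linear support vector machine from any standard library; that training step is omitted here because the abstract gives no details about kernels or parameters.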
This work was partially supported by the FCT (Portugal), through project grant PTDC/EIA/73614/2006 (GREASE-II).
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Anastácio, I., Martins, B., Calado, P. (2009). Classifying Documents According to Locational Relevance. In: Lopes, L.S., Lau, N., Mariano, P., Rocha, L.M. (eds) Progress in Artificial Intelligence. EPIA 2009. Lecture Notes in Computer Science, vol 5816. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04686-5_49
DOI: https://doi.org/10.1007/978-3-642-04686-5_49
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-04685-8
Online ISBN: 978-3-642-04686-5
eBook Packages: Computer Science (R0)