
Using chi-square statistics to measure similarities for text categorization

Published: 01 April 2011

Abstract

In this paper, we propose using chi-square statistics to measure similarities, and chi-square tests to determine the homogeneity of two random samples of term vectors, for text categorization. We first study the properties of chi-square tests for text categorization. One advantage of the chi-square test is that its significance level approximates the miss rate, which provides a foundation for a theoretical performance guarantee on the miss rate. A classifier using cosine similarities with TF*IDF weighting generally performs reasonably well in text categorization, but its performance may fluctuate even near the optimal threshold value. To address this limitation, we propose the combined usage of chi-square statistics and cosine similarities. Extensive experimental results verify the properties of chi-square tests and the performance of the combined usage.
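The two measures named in the abstract can be illustrated with a minimal sketch. It assumes each document is a vector of raw term counts over a shared vocabulary and treats the pair as a 2 x k contingency table for the standard Pearson chi-square test of homogeneity. The function names, the skipping of terms absent from both vectors, and the AND-style decision rule in `same_category` are assumptions made for illustration; the abstract does not specify how the paper combines the two measures.

```python
import math


def chi_square_homogeneity(a, b):
    """Pearson chi-square statistic and degrees of freedom for two count vectors.

    The two vectors are treated as the rows of a 2 x k contingency table.
    Terms that occur in neither vector are skipped (an assumption of this sketch).
    """
    if len(a) != len(b):
        raise ValueError("vectors must share the same vocabulary")
    total_a, total_b = sum(a), sum(b)
    if total_a == 0 or total_b == 0:
        raise ValueError("each vector needs at least one term occurrence")
    n = total_a + total_b
    statistic = 0.0
    used_terms = 0
    for count_a, count_b in zip(a, b):
        column_total = count_a + count_b
        if column_total == 0:
            continue  # term absent from both samples
        used_terms += 1
        for observed, row_total in ((count_a, total_a), (count_b, total_b)):
            expected = row_total * column_total / n
            statistic += (observed - expected) ** 2 / expected
    return statistic, used_terms - 1


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def same_category(a, b, cosine_threshold, chi2_critical):
    """Hypothetical combined rule: accept only if both measures agree.

    chi2_critical is the critical value for the chosen significance level
    and degrees of freedom, supplied by the caller.
    """
    statistic, _ = chi_square_homogeneity(a, b)
    return cosine_similarity(a, b) >= cosine_threshold and statistic <= chi2_critical


# Same term proportions: the statistic is zero, so homogeneity is not rejected.
print(chi_square_homogeneity([10, 20, 30], [20, 40, 60]))
# Disjoint vocabularies: large statistic and zero cosine similarity.
print(chi_square_homogeneity([10, 0], [0, 10]), cosine_similarity([10, 0], [0, 10]))
```

In this sketch the critical value plays the role of the significance level: for example, 3.841 corresponds to a 0.05 level with one degree of freedom, and 5.991 to a 0.05 level with two degrees of freedom.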

Published In

Expert Systems with Applications: An International Journal, Volume 38, Issue 4
April 2011
1738 pages

Publisher

Pergamon Press, Inc., United States

Author Tags

1. Machine learning
2. Nonparametric statistics
3. Text mining