
On developing robust models for favourability analysis: Model choice, feature sets and imbalanced data

Published: 01 November 2012

Abstract

Locating documents carrying positive or negative favourability is an important application within media analysis. This article presents empirical results on the challenges facing a machine-learning approach to this kind of opinion mining. These challenges include the often considerable imbalance in the distribution of positive and negative samples, changes in the documents over time, and the need for effective training and evaluation procedures for the models. We report results on three data sets generated by a media-analysis company, classifying documents in two ways: detecting the presence of favourability, and assessing negative vs. positive favourability. We describe our experiments in developing a machine-learning approach to automate the classification process. We explore the effect of using five different types of features, the robustness of the models when tested on data taken from a later time period, and the effect of balancing the input data by undersampling. We find that the optimum classifier, feature set and training strategy vary depending on the task and data set.
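The abstract does not say how the undersampling was carried out. As a rough illustration of the general idea only, the sketch below assumes simple random undersampling, in which samples from the larger classes are discarded at random until every class is as small as the rarest one. The function name `undersample`, the toy document identifiers and the label strings are all hypothetical and are not taken from the article.

```python
import random
from collections import defaultdict


def undersample(samples, labels, seed=0):
    """Randomly drop samples so every class is as small as the rarest one.

    Assumption: plain random undersampling; the article may have used
    a different balancing procedure.
    """
    rng = random.Random(seed)

    # Group the samples by their class label.
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    # The size of the rarest class is the target size for every class.
    target = min(len(group) for group in by_class.values())

    # Draw `target` samples without replacement from each class.
    balanced = []
    for label, group in by_class.items():
        for sample in rng.sample(group, target):
            balanced.append((sample, label))

    # Shuffle so the classes are not grouped together in the output.
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data: 8 positive documents and 2 negative ones.
docs = [f"doc{i}" for i in range(10)]
tags = ["positive"] * 8 + ["negative"] * 2
balanced = undersample(docs, tags)
```

With the toy data above, `balanced` holds two documents from each class. Undersampling of this kind would normally be applied to the training data only, so that evaluation still reflects the original class distribution.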


Published In

Decision Support Systems, Volume 53, Issue 4
November 2012
229 pages

Publisher

Elsevier Science Publishers B. V., Netherlands

Author Tags

1. Bayesian models
2. Favourability analysis
3. Imbalanced data
4. Machine learning
5. Sentiment analysis
6. Support-vector machines
