article

On developing robust models for favourability analysis: Model choice, feature sets and imbalanced data

Authors:

Peter C. R. Lane,

Paul Hender

Decision Support Systems, Volume 53, Issue 4

Pages 712 - 718

https://doi.org/10.1016/j.dss.2012.05.028

Published: 01 November 2012

Abstract

Locating documents carrying positive or negative favourability is an important application within media analysis. This article presents some empirical results on the challenges facing a machine-learning approach to this kind of opinion mining. Some of the challenges include the often considerable imbalance in the distribution of positive and negative samples, changes in the documents over time, and effective training and evaluation procedures for the models. This article presents results on three data sets generated by a media-analysis company, classifying documents in two ways: detecting the presence of favourability, and assessing negative vs. positive favourability. We describe our experiments in developing a machine-learning approach to automate the classification process. We explore the effect of using five different types of features, the robustness of the models when tested on data taken from a later time period, and the effect of balancing the input data by undersampling. We find varying choices for the optimum classifier, feature set and training strategy depending on the task and data set.
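
The balancing step mentioned in the abstract can be illustrated with a minimal sketch of random undersampling. The abstract does not say how the authors performed undersampling, so this is a generic version under stated assumptions: each class is randomly reduced to the size of the rarest class. The function name `undersample`, the `seed` parameter, and the toy documents and labels are all hypothetical.

```python
import random
from collections import defaultdict


def undersample(samples, labels, seed=0):
    """Randomly drop samples so every class is as small as the rarest class.

    Hypothetical helper for illustration; not the procedure from the article.
    """
    rng = random.Random(seed)

    # Group the samples by their class label.
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    # The rarest class sets the number of samples kept per class.
    target = min(len(items) for items in by_class.values())

    balanced = []
    for label, items in by_class.items():
        # rng.sample draws without replacement, so no document is repeated.
        for sample in rng.sample(items, target):
            balanced.append((sample, label))

    # Shuffle so the classes are not grouped together in the output.
    rng.shuffle(balanced)
    return balanced


# Toy data (assumed): 2 negative documents against 8 positive ones.
docs = [f"doc{i}" for i in range(10)]
labels = ["negative"] * 2 + ["positive"] * 8
balanced = undersample(docs, labels)
```

Because the minority class is kept whole, only majority-class documents are discarded. The cost is a smaller training set, which may be one reason a balancing strategy helps on some tasks and data sets but not others.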

On developing robust models for favourability analysis: Model choice, feature sets and imbalanced data
1. Computing methodologies
  1. Artificial intelligence
  2. Machine learning
    1. Learning paradigms
      1. Supervised learning
2. Hardware
  1. Power and energy
    1. Power estimation and optimization

Published In

Decision Support Systems, Volume 53, Issue 4

November 2012

229 pages

ISSN: 0167-9236

Copyright © 2012 Elsevier B.V.

Publisher

Elsevier Science Publishers B. V.

Netherlands
