
An Empirical Study of the Effectiveness of an Ensemble of Stand-alone Sentiment Detection Tools for Software Engineering Datasets

Published: 09 April 2022 Publication History

Abstract

Sentiment analysis in software engineering (SE) has shown promise for analyzing and supporting diverse development activities. Several tools have recently been proposed to detect sentiment in software artifacts. While these tools improve accuracy over off-the-shelf tools, recent research shows that their performance can still be unsatisfactory. A more accurate sentiment detector for SE can help reduce noise in the analysis of software scenarios where sentiment analysis is required. Combinations, i.e., hybrids, of stand-alone classifiers have recently been found to outperform the stand-alone classifiers for fault detection; however, we are aware of no such approach for sentiment detection in software artifacts. We report the results of an empirical study that we conducted to determine the feasibility of developing an ensemble engine by combining the polarity labels of stand-alone SE-specific sentiment detectors. Our study has two phases. In the first phase, we pick five SE-specific sentiment detection tools from two recently published papers by Lin et al. [29, 30], who first reported negative results with stand-alone sentiment detectors and then proposed an improved SE-specific sentiment detector, POME [29]. We report study results on 17,581 units (sentences/documents) drawn from six currently available sentiment benchmarks for software engineering. We find that the existing tools can be complementary to each other in 85–95% of the cases, i.e., one is wrong while another is right. However, a majority voting-based ensemble of those tools fails to improve the accuracy of sentiment detection. We therefore develop Sentisead, a supervised tool that combines the polarity labels and bags of words as features. Sentisead improves the performance (F1-score) of the individual tools by 4% (over Senti4SD [5]) to 100% (over POME [29]). The initial development of Sentisead occurred before we observed the use of deep learning models for SE-specific sentiment detection.
In particular, recent papers show the superiority of advanced pre-trained transformer models (PTMs) over rule-based and shallow learning models. Consequently, in a second phase, we compare and improve the Sentisead infrastructure using PTMs. We find that a Sentisead infrastructure with RoBERTa as the ensemble of the five stand-alone rule-based and shallow learning SE-specific tools from Lin et al. [29, 30] offers the best F1-score of 0.805 across the six datasets, while a stand-alone RoBERTa shows an F1-score of 0.801.
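The majority-voting baseline that the abstract reports as unsuccessful can be sketched as follows. This is a minimal illustration only; the label set and the example predictions are hypothetical, not outputs of the actual detectors studied in the paper:

```python
from collections import Counter

def majority_vote(labels):
    """Return the most common polarity label among detector outputs.

    Ties (no strict majority winner) fall back to 'neutral', one of
    several reasonable tie-breaking policies.
    """
    top = Counter(labels).most_common()
    if len(top) > 1 and top[0][1] == top[1][1]:
        return "neutral"
    return top[0][0]

# Hypothetical predictions from five stand-alone detectors for one sentence.
predictions = ["negative", "neutral", "neutral", "positive", "neutral"]
print(majority_vote(predictions))  # neutral
```

Note how this baseline discards all information except the label counts; a supervised ensemble like Sentisead can instead learn which detector to trust for which kind of input, which is why it can outperform plain voting even when the individual tools are complementary.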

References

[1]
Toufique Ahmed, Amiangshu Bosu, Anindya Iqbal, and Shahram Rahimi. 2017. SentiCR: A customized sentiment analysis tool for code review interactions. In Proceedings of the 32nd International Conference on Automated Software Engineering. 106–111.
[2]
Ikram El Asri, Noureddine Kerzazi, Gias Uddin, Foutse Khomh, and M. A. Janati Idrissi. 2019. An empirical study of sentiments in code reviews. Information and Software Technology 114 (2019), 37–54.
[3]
Eeshita Biswas, Mehmet Efruz Karabulut, Lori Pollock, and K. Vijay-Shanker. 2020. Achieving reliable sentiment analysis in the software engineering domain using BERT. In IEEE International Conference on Software Maintenance and Evolution. 162–173.
[4]
Eeshita Biswas, K. Vijay-Shanker, and Lori Pollock. 2019. Exploring word embedding techniques to improve sentiment analysis of software engineering texts. In Proceedings of the 16th International Conference on Mining Software Repositories. 68–78.
[5]
Fabio Calefato, Filippo Lanubile, Federico Maiorano, and Nicole Novielli. 2017. Sentiment polarity detection for software development. Empirical Software Engineering (2017), 2543–2584.
[6]
Fabio Calefato, Filippo Lanubile, and Nicole Novielli. 2017. EmoTxt: A toolkit for emotion recognition from text. In Proc. 7th Affective Computing and Intelligent Interaction. 2.
[7]
Cássio Castaldi Araujo Blaz and Karin Becker. 2016. Sentiment analysis in tickets for IT support. In IEEE/ACM 13th Working Conference on Mining Software Repositories. 235–246.
[8]
Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. 2002. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16, 1 (2002), 321–357.
[9]
Zhenpeng Chen, Yanbin Cao, Xuan Lu, Qiaozhu Mei, and Xuanzhe Liu. 2019. SEntiMoji: An emoji-powered learning approach for sentiment analysis in software engineering. In 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 841–852.
[10]
Jacob Cohen, Stephen G. West, Leona Aiken, and Patricia Cohen. 2002. Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences (3rd ed.). Lawrence Erlbaum Associates.
[11]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Technical Report. https://arxiv.org/abs/1810.04805.
[12]
Pedro P. Balage Filho, Lucas Avanco, Thiago A. S. Pardo, and Maria G. V. Nunes. 2014. NILC_USP: An improved hybrid system for sentiment analysis in Twitter messages. In Proceedings of the 8th International Workshop on Semantic Evaluation. 428–432.
[13]
Daviti Gachechiladze, F. Lanubile, Nicole Novielli, and Alexander Serebrenik. 2017. Anger and its direction in collaborative software development. In 39th International Conference on Software Engineering: New Ideas and Emerging Results Track. 11–14.
[14]
Baljinder Ghotra, Shane McIntosh, and Ahmed E. Hassan. 2015. Revisiting the impact of classification techniques on the performance of defect prediction models. In Proceedings of the 37th International Conference on Software Engineering. 789–800.
[15]
Pollyanna Goncalves, Matheus Araujo, Fabricio Benevenuto, and Meeyoung Cha. 2013. Comparing and combining sentiment analysis methods. In Proceedings of the First ACM Conference on Online Social Networks. 27–38.
[16]
Emitza Guzman, Rana Alkadhi, and Norbert Seyff. 2016. A needle in a haystack: What do Twitter users say about software? In 24th IEEE International Requirements Engineering Conference. 96–105.
[17]
Emitza Guzman, David Azócar, and Yang Li. 2014. Sentiment analysis of commit comments in GitHub: An empirical study. In Proceedings of the 11th Working Conference on Mining Software Repositories. 352–355.
[18]
Emitza Guzman and Bernd Bruegge. 2013. Towards emotional awareness in software development teams. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering. 671–674.
[19]
Ahmed E. Hassan. 2009. Predicting faults using the complexity of code changes. In Proc. 31st International Conference on Software Engineering. 78–89.
[20]
Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 168–177.
[21]
Md Rakibul Islam and Minhaz F. Zibran. 2017. Leveraging automated sentiment analysis in software engineering. In Proc. 14th International Conference on Mining Software Repositories. 203–214.
[22]
Md Rakibul Islam and Minhaz F. Zibran. 2018. DEVA: Sensing emotions in the valence arousal space in software engineering text. In 33rd Annual ACM Symposium on Applied Computing. 1536–1543.
[23]
Robbert Jongeling, Subhajit Datta, and Alexandar Serebrenik. 2015. Choosing your weapons: On sentiment analysis tools for software engineering research. In Proceedings of the 31st International Conference on Software Maintenance and Evolution.
[24]
Robbert Jongeling, Proshanta Sarkar, Subhajit Datta, and Alexander Serebrenik. 2017. On negative results when using sentiment analysis tools for software engineering research. Empirical Software Engineering 22, 5 (2017), 2543–2584.
[25]
Foutse Khomh, Brian Chan, Ying Zou, and Ahmed E. Hassan. 2011. An entropy evaluation approach for triaging field crashes: A case study of Mozilla Firefox. In Proceedings of the 2011 18th Working Conference on Reverse Engineering. 261–270.
[26]
Andrew J. Ko, Brad A. Myers, and Duen Horng Chau. 2005. A linguistic analysis of how people describe software problems. In 2005 IEEE Symposium on Visual Languages and Human-Centric Computing. 127–134.
[27]
DISA Lab. 2021. An Empirical Study of the Effectiveness of an Ensemble of Stand-alone Sentiment Detection Tools for Software Engineering Datasets (Online Appendix). https://github.com/disa-lab/HybridSESentimentTOSEM. Last accessed 21 April 2021.
[28]
Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. Technical Report. https://arxiv.org/abs/1909.11942.
[29]
Bin Lin, Fiorella Zampetti, Gabriele Bavota, Massimiliano Di Penta, and Michele Lanza. 2019. Pattern-based mining of opinions in Q&A websites. In Proc. 41st International Conference on Software Engineering. 548–559.
[30]
Bin Lin, Fiorella Zampetti, Gabriele Bavota, Massimiliano Di Penta, Michele Lanza, and Rocco Oliveto. 2018. Sentiment analysis for software engineering: How far can we go? In Proc. 40th International Conference on Software Engineering. 94–104.
[31]
Bing Liu. 2012. Sentiment Analysis and Opinion Mining (1st ed.). Morgan & Claypool Publishers.
[32]
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. Technical Report. https://arxiv.org/abs/1907.11692.
[33]
Walid Maalej, Zijad Kurtanović, Hadeer Nabil, and Christoph Stanik. 2016. On the automatic classification of app reviews. In International Requirements Engineering Conference. 311–331.
[34]
Rungroj Maipradit, Hideaki Hata, and Kenichi Matsumoto. 2019. Sentiment classification using N-Gram inverse document frequency and automated machine learning. IEEE Software 36, 5 (2019), 65–70.
[35]
Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2009. An Introduction to Information Retrieval. Cambridge University Press.
[36]
Mika Mäntylä, Bram Adams, Giuseppe Destefanis, Daniel Graziotin, and Marco Ortu. 2016. Mining valence, arousal, and dominance – possibilities for detecting burnout and productivity?. In Proceedings of the 13th Working Conference on Mining Software Repositories. 247–258.
[37]
Ayse Tosun Misirli, Ayse Basar Bener, and Burak Turhan. 2011. An industrial case study of classifier ensembles for locating software defects. Software Quality Journal 19, 3 (2011), 515–536.
[38]
Alessandro Murgia, Parastou Tourani, Bram Adams, and Marco Ortu. 2014. Do developers feel emotions? An exploratory analysis of emotions in software artifacts. In Proceedings of the 11th Working Conference on Mining Software Repositories.
[39]
Nicole Novielli, Andrew Begel, and Walid Maalej. 2019. Introduction to the special issue on affect awareness in software engineering. Journal of Systems and Software 148 (2019), 180–182.
[40]
Nicole Novielli, Fabio Calefato, Davide Dongiovanni, Daniela Girardi, and Filippo Lanubile. 2020. Can we use SE-specific sentiment analysis tools in a cross-platform setting? In 17th International Conference on Mining Software Repositories. 11.
[41]
Nicole Novielli, Fabio Calefato, and Filippo Lanubile. 2015. The challenges of sentiment detection in the social programmer ecosystem. In Proceedings of the 7th International Workshop on Social Software Engineering. 33–40.
[42]
Nicole Novielli, Fabio Calefato, and Filippo Lanubile. 2018. A gold standard for emotion annotation in stack overflow. In Proceedings of the 15th International Conference on Mining Software Repositories (Data Showcase). 4.
[43]
Nicole Novielli, Daniela Girardi, and Filippo Lanubile. 2018. A benchmark study on sentiment analysis for software engineering research. In Proceedings of the 15th International Conference on Mining Software Repositories. 12.
[44]
Nicole Novielli and Alexander Serebrenik. 2019. Sentiment and emotion in software engineering. IEEE Software 36, 5 (2019), 6–23.
[45]
Dario Di Nucci, Fabio Palomba, Rocco Oliveto, and Andrea De Lucia. 2017. Dynamic selection of classifiers in bug prediction: An adaptive method. IEEE Transactions on Emerging Topics in Computational Intelligence 1, 3 (2017), 202–212.
[46]
Marco Ortu, Bram Adams, Giuseppe Destefanis, Parastou Tourani, Michele Marchesi, and Roberto Tonelli. 2015. Are bullies more productive? Empirical study of affectiveness vs. issue fixing time. In Proceedings of the 12th Working Conference on Mining Software Repositories. 303–313.
[47]
Marco Ortu, Alessandro Murgia, Giuseppe Destefanis, Parastou Tourani, Roberto Tonelli, Michele L. Marchesi, and Bram Adams. 2016. The emotional side of software developers in JIRA. In 13th International Conference on Mining Software Repositories. 480–483.
[48]
Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment classification using machine learning techniques. In Conference on Empirical Methods in Natural Language Processing. 79–86.
[49]
Sebastiano Panichella, Andrea Di Sorbo, Emitza Guzman, Corrado A. Visaggio, Gerardo Canfora, and Harald C. Gall. 2015. How can I improve my app? Classifying user reviews for software maintenance and evolution. In IEEE International Conf. on Software Maintenance and Evolution. 281–290.
[50]
Jean Petric, David Bowes, Tracy Hall, Bruce Christianson, and Nathan Baddoo. 2016. Building an ensemble for software defect prediction based on diversity selection. In Proceedings of the 10th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement. Article No. 46.
[51]
Daniel Pletea, Bogdan Vasilescu, and Alexander Serebrenik. 2014. Security and emotion: Sentiment analysis of security discussions on GitHub. In Proceedings of the 11th Working Conference on Mining Software Repositories. 348–351.
[52]
scikit-learn. 2017. Machine Learning in Python. http://scikit-learn.org/stable/index.html.
[53]
Fabrizio Sebastiani. 2002. Machine learning in automated text categorization. ACM Computing Surveys 34, 1 (2002), 1–47.
[54]
C. E. Shannon. 1948. A mathematical theory of communication. The Bell System Technical Journal 27, 3 (1948), 379–423.
[55]
Vinayak Sinha, Alina Lazar, and Bonita Sharif. 2016. Analyzing developer sentiment in commit logs. In 13th International Conference on Mining Software Repositories. 281–290.
[56]
Richard Socher, Alex Perelygin, Jean Wu, Christopher Manning, Andrew Ng, and Jason Chuang. 2013. Recursive models for semantic compositionality over a sentiment treebank. In Proc. Conference on Empirical Methods in Natural Language Processing (EMNLP). 12.
[57]
Mike Thelwall, Kevan Buckley, Georgios Paltoglou, Di Cai, and Arvid Kappas. 2010. Sentiment strength detection in short informal text. Journal of the American Society for Information Science and Technology 61, 12 (2010), 2544–2558.
[58]
Patanamon Thongtanunam, Shane McIntosh, Ahmed E. Hassan, and Hajimu Iida. 2017. Review participation in modern code review: An empirical study of the Android, Qt, and OpenStack projects. Empirical Software Engineering 22, 2 (2017), 768–817.
[59]
Gias Uddin, Olga Baysal, Latifa Guerroj, and Foutse Khomh. 2019. Understanding how and why developers seek and analyze API related opinions. IEEE Transactions on Software Engineering (2019), 40.
[60]
Gias Uddin and Foutse Khomh. 2017. Automatic summarization of API reviews. In Proc. 32nd IEEE/ACM International Conference on Automated Software Engineering. 159–170.
[61]
Gias Uddin and Foutse Khomh. 2017. Mining API Aspects in API Reviews. Technical Report. https://swat.polymtl.ca/data/opinionvalue-technical-report.pdf.
[62]
Gias Uddin and Foutse Khomh. 2017. Opiner: A search and summarization engine for API reviews. In Proc. 32nd IEEE/ACM International Conference on Automated Software Engineering. 978–983.
[63]
Gias Uddin and Foutse Khomh. 2019. Automatic opinion mining from API reviews from stack overflow. IEEE Transactions on Software Engineering (2019), 35.
[64]
Gias Uddin, Foutse Khomh, and Chanchal K. Roy. 2020. Automatic API usage scenario documentation from technical Q&A sites. ACM Transactions on Software Engineering and Methodology (2020), 43.
[65]
Gias Uddin, Foutse Khomh, and Chanchal K. Roy. 2020. Automatic mining of API usage scenarios from stack overflow. Information and Software Technology (IST) (2020), 16.
[66]
Gias Uddin and Martin P. Robillard. 2015. Resolving API Mentions in Forum Texts. Technical Report. McGill University.
[67]
Marat Valiev, Bogdan Vasilescu, and James Herbsleb. 2018. Ecosystem-level determinants of sustained activity in open-source projects: A case study of the PyPI ecosystem. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 644–655.
[68]
Lorenzo Villarroel, Gabriele Bavota, Barbara Russo, Rocco Oliveto, and Massimiliano Di Penta. 2016. Release planning of mobile apps based on user reviews. In Proceedings of the 38th International Conference on Software Engineering. 14–24.
[69]
Sida Wang and Christopher D. Manning. 2012. Baselines and bigrams: Simple, good sentiment and topic classification. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics. 90–94.
[70]
Amy Beth Warriner, Victor Kuperman, and Marc Brysbaert. 2013. Norms of valence, arousal, and dominance for 13,915 English lemmas. Behavior Research Methods 45, 4 (2013), 1191–1207.
[71]
Claes Wohlin, Per Runeson, Martin Höst, Magnus C. Ohlsson, Björn Regnell, and Anders Wesslén. 2000. Experimentation in Software Engineering: An Introduction. Kluwer Academic Publishers, Norwell, MA, USA.
[72]
Xinli Yang, David Lo, Xin Xia, and Jianling Sun. 2017. TLEL: A two-layer ensemble learning approach for just-in-time defect prediction. Information and Software Technology 87, 2 (2017), 206–220.
[73]
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2020. XLNet: Generalized Autoregressive Pretraining for Language Understanding. Technical Report. https://arxiv.org/abs/1906.08237.
[74]
Ting Zhang, Bowen Xu, Ferdian Thung, Stefanus Agus Haryono, David Lo, and Lingxiao Jiang. 2020. Sentiment analysis for software engineering: How far can pre-trained transformer models go?. In IEEE International Conference on Software Maintenance and Evolution. 70–80.
[75]
Yun Zhang, David Lo, Xin Xia, and Jianling Sun. 2018. Combined classifier for cross-project defect prediction: An extended empirical study. Frontiers of Computer Science 22, 2 (2018), 280–296.



      Published In

      ACM Transactions on Software Engineering and Methodology  Volume 31, Issue 3
      July 2022
      912 pages
      ISSN:1049-331X
      EISSN:1557-7392
      DOI:10.1145/3514181
      • Editor:
      • Mauro Pezzè

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 09 April 2022
      Online AM: 31 January 2022
      Accepted: 01 October 2021
      Revised: 01 September 2021
      Received: 01 September 2020
      Published in TOSEM Volume 31, Issue 3


      Author Tags

      1. Sentiment analysis
      2. machine learning
      3. ensemble classifier

      Qualifiers

      • Research-article
      • Refereed

      Bibliometrics

      Article Metrics

• Downloads (last 12 months): 108
• Downloads (last 6 weeks): 2
      Reflects downloads up to 30 Dec 2024

      Cited By

• (2024) Semantic Web Approaches in Stack Overflow. International Journal on Semantic Web & Information Systems 20, 1, 1–61. DOI: 10.4018/IJSWIS.358617. Online publication date: 9 Nov 2024.
• (2024) Fuzzy ensemble of fined tuned BERT models for domain-specific sentiment analysis of software engineering dataset. PLOS ONE 19, 5, e0300279. DOI: 10.1371/journal.pone.0300279. Online publication date: 28 May 2024.
• (2024) An Empirical Evaluation of the Zero-Shot, Few-Shot, and Traditional Fine-Tuning Based Pretrained Language Models for Sentiment Analysis in Software Engineering. IEEE Access 12, 109714–109734. DOI: 10.1109/ACCESS.2024.3439450. Online publication date: 2024.
• (2024) Transformers and meta-tokenization in sentiment analysis for software engineering. Empirical Software Engineering 29, 4. DOI: 10.1007/s10664-024-10468-2. Online publication date: 3 Jun 2024.
• (2024) What is Needed to Apply Sentiment Analysis in Real Software Projects: A Feasibility Study in Industry. Human-Centered Software Engineering, 105–129. DOI: 10.1007/978-3-031-64576-1_6. Online publication date: 8 Jul 2024.
• (2023) Emotion Analysis in Software Ecosystems. Software Ecosystems, 105–127. DOI: 10.1007/978-3-031-36060-2_5. Online publication date: 26 May 2023.
• (2022) On the Limitations of Combining Sentiment Analysis Tools in a Cross-Platform Setting. Product-Focused Software Process Improvement, 108–123. DOI: 10.1007/978-3-031-21388-5_8. Online publication date: 21 Nov 2022.
