Abstract
The rise in popularity of mobile devices has led to a parallel growth in the size of the app store market, motivating several research studies and commercial platforms on mining app stores. App store reviews are used to analyze different aspects of app development and evolution. However, app users’ feedback does not only exist on the app store. In fact, despite the large quantity of posts made daily on social media, the importance and value of these discussions remain mostly unused in the context of mobile app development. In this paper, we study how Twitter can provide complementary information to support mobile app development. By analyzing a total of 30,793 apps over a period of six weeks, we found strong correlations between the number of reviews and tweets for most apps. Moreover, by applying machine learning classifiers, topic modeling, and subsequent crowdsourcing, we successfully mined 22.4% additional feature requests and 12.89% additional bug reports from Twitter. We also found that 52.1% of all feature requests and bug reports were discussed in both tweets and reviews. In addition to identifying the information that Twitter and the app store have in common and the information unique to each, we performed sentiment and content analysis for 70 randomly selected apps. From this, we found that tweets provided more critical and objective views on apps than reviews from the app store. These results show that app store review mining is indeed not enough; other information sources ultimately provide added value and information for app developers.
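As an illustration of the correlation analysis referred to above, the following minimal sketch computes a Spearman rank correlation between per-app daily review and tweet counts using SciPy. The counts are hypothetical placeholders; this is not the study's actual data collection or analysis pipeline.

```python
# Illustrative sketch only (hypothetical counts, not the study's pipeline):
# correlate per-app daily counts of app store reviews and tweets.
from scipy.stats import spearmanr

# Daily counts over a study window would normally come from app store and
# Twitter crawlers; the numbers below are made up for illustration.
daily_review_counts = [12, 15, 9, 22, 30, 18, 25]
daily_tweet_counts = [40, 52, 31, 70, 95, 60, 80]

rho, p_value = spearmanr(daily_review_counts, daily_tweet_counts)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```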
Acknowledgments
We would like to thank Homayoon Farrahi and Ada Lee for their help with this study. We thank all the anonymous reviewers and the associate editor for their valuable comments and suggestions. This research was partially supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant 250343-12.
Additional information
Communicated by: Yasutaka Kamei
Appendix: Crowdsourced evaluation of RQ2 and RQ3
In Section 5 we discussed why and how we used crowdsourcing:

- First, we used crowdsourcing to confirm the results of the similarity analysis (cosine similarity) between tweet topics and review topics, i.e., to verify (i) whether assigning a tweet topic to a review topic was correct and (ii) whether we missed assigning a tweet topic to a review topic (RQ2); a simplified sketch of this similarity computation follows the list.
- Second, we used crowdsourcing to compare the degree of specification and understandability of tweets and reviews (RQ3).
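As an illustration of the cosine-similarity matching mentioned in the first item, the following sketch compares two topics represented as bags of weighted words. It is a simplified illustration with hypothetical topic weights, not the implementation used in the study.

```python
# Simplified sketch (not the study's implementation) of matching a tweet
# topic to a review topic by cosine similarity over shared vocabulary.
# Topic word weights below are hypothetical.
import math

def cosine_similarity(topic_a, topic_b):
    """Cosine similarity between two {word: weight} topic representations."""
    common = set(topic_a) & set(topic_b)
    dot = sum(topic_a[w] * topic_b[w] for w in common)
    norm_a = math.sqrt(sum(v * v for v in topic_a.values()))
    norm_b = math.sqrt(sum(v * v for v in topic_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

tweet_topic = {"crash": 0.30, "update": 0.25, "login": 0.20, "fix": 0.25}
review_topic = {"crash": 0.35, "login": 0.25, "freeze": 0.20, "fix": 0.20}

# A tweet topic would be assigned to the review topic with the highest
# similarity; any acceptance threshold shown elsewhere is a design choice.
print(f"similarity = {cosine_similarity(tweet_topic, review_topic):.2f}")
```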
“The crowd” consists of workers who are not personally known to the authors. To assess the validity of the crowdsourcing results, we hired three developers known to the authors from similar earlier work and asked them to perform the same tasks as the crowd. Across RQ2 and RQ3 and across the different tasks, the average Fleiss’ Kappa among the three developers was 0.84, which indicates almost perfect agreement. We then compared the results achieved by the crowd with the results achieved by the three developers.
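For readers unfamiliar with the agreement measure, Fleiss’ Kappa for three raters can be computed roughly as follows. This is an illustrative sketch using statsmodels with made-up labels, not the study’s actual rating data.

```python
# Illustrative sketch of computing Fleiss' Kappa for three raters
# (hypothetical labels, not the study's data).
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Each row is one item (e.g., a topic pair), each column is one developer;
# values are categorical labels (0 = "different", 1 = "similar").
ratings = np.array([
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
    [1, 1, 1],
])

# aggregate_raters turns rater-per-column labels into per-category counts.
table, _categories = aggregate_raters(ratings)
print(f"Fleiss' Kappa = {fleiss_kappa(table):.2f}")
```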
Evaluating crowdsourced results in RQ2
We randomly selected 500 tweets and 500 review topics. Among them, 250 topics had been marked as similar and 250 as different by the crowd. We asked three app developers we had formerly worked with to manually label these 500 topics; they answered the same questions as the crowd (Fig. 5). Comparing the results, we found that the developers classified 99.4% of the topics in the same way as the crowd did.
We then compared the results achieved by the crowd with the results achieved by the three developers, as presented in Table 3.
Evaluating crowdsourced results in RQ3
We randomly selected 250 tweets and 250 reviews and asked the three developers to judge both the degree of specification and the degree of understandability. Tables 4 and 5 compare the results of this task as performed by the crowd versus the three developers.
17.2% of the reviews and 16.4% of the tweets were classified into a different specification category by the developers compared with the crowd. With the same setup, we asked the developers to evaluate the degree of understandability and compared the results with those received from the crowd:
11.2% of the reviews and 17.6% of the tweets were classified into a different understandability category by the developers compared with the crowd.
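The disagreement percentages reported here reduce to a simple per-item comparison of crowd and developer labels; a minimal sketch with hypothetical category labels is shown below.

```python
# Minimal sketch (hypothetical labels): fraction of items that developers
# placed in a different category than the crowd did.
def disagreement_rate(crowd_labels, developer_labels):
    """Return the share of items labeled differently by the two groups."""
    assert len(crowd_labels) == len(developer_labels)
    differing = sum(1 for c, d in zip(crowd_labels, developer_labels) if c != d)
    return differing / len(crowd_labels)

# Category names here are placeholders, not the study's labels.
crowd = ["specific", "generic", "specific", "generic", "specific"]
developers = ["specific", "specific", "specific", "generic", "generic"]

print(f"{disagreement_rate(crowd, developers):.1%} classified differently")
```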
About this article
Cite this article
Nayebi, M., Cho, H. & Ruhe, G. App store mining is not enough for app improvement. Empir Software Eng 23, 2764–2794 (2018). https://doi.org/10.1007/s10664-018-9601-1