The rise in popularity of mobile devices has led to a parallel growth in the size of the app store market, prompting several research studies and commercial platforms to mine app stores. App store reviews are used to analyze different aspects of app development and evolution. However, app users’ feedback does not exist only on the app store. In fact, despite the large quantity of posts that are made daily on social media, the value of these discussions remains mostly unused in the context of mobile app development. In this paper, we study how Twitter can provide complementary information to support mobile app development. By analyzing a total of 30,793 apps over a period of six weeks, we found strong correlations between the number of reviews and tweets for most apps. Moreover, by applying machine learning classifiers, topic modeling, and subsequent crowdsourcing, we successfully mined 22.4% additional feature requests and 12.89% additional bug reports from Twitter. We also found that 52.1% of all feature requests and bug reports were discussed in both tweets and reviews. In addition to finding common and unique information from Twitter and the app store, we performed sentiment and content analysis for 70 randomly selected apps. From this, we found that tweets provided more critical and objective views on apps than reviews from the app store. These results show that app store review mining alone is not enough; other information sources provide added value and information for app developers.
We would like to thank Homayoon Farrahi and Ada Lee for their help on this study. We thank all the anonymous reviewers and the Associate editor for their valuable comments and suggestions. This research was partially supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant 250343-12.
Appendix: Crowdsourced evaluation of RQ2 and RQ3
In Section 5 we discussed why and how we used crowdsourcing:
- First, we used crowdsourcing to confirm the results of the similarity analysis (cosine similarity) between tweet topics and review topics, checking (i) whether each assignment of a tweet topic to a review topic was correct and (ii) whether we had failed to assign a tweet topic to a matching review topic (RQ2).
- Second, we used crowdsourcing to compare the degree of specification and understandability of tweets and reviews (RQ3).
“The crowd” is composed of workers who are not personally known to the authors. To assess the validity of the crowdsourcing results, we hired three developers known to the authors from similar earlier work and asked them to perform the same tasks as the crowd. Across RQ2 and RQ3, the average Fleiss' kappa among the three developers over the different tasks was 0.84, which indicates almost perfect agreement. We then compared the results obtained from the crowd with those obtained from the three developers.
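For readers unfamiliar with the agreement statistic, the following is a minimal sketch of how Fleiss' kappa can be computed for a fixed number of raters per item. It is an illustration only, not the script used in this study; the function name `fleiss_kappa`, the category labels, and the toy ratings are assumptions made for the example.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for items that were each labeled by the same number of raters.

    ratings:    list of items; each item is a list with one label per rater.
    categories: list of all labels that raters could choose from.
    (Both argument names are illustrative, not taken from the study.)
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])

    # counts[i][j] = number of raters who assigned item i to category j
    counts = [[Counter(item)[c] for c in categories] for item in ratings]

    # Share of all assignments that fell into each category
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]

    # Observed agreement per item, then averaged over items
    p_i = [
        (sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items

    # Agreement expected by chance
    p_e = sum(p * p for p in p_j)
    if p_e == 1.0:
        # Every rating is in one category; treat as full agreement
        return 1.0
    return (p_bar - p_e) / (1 - p_e)


# Toy data: three hypothetical raters labeling four topic pairs
toy = [
    ["similar", "similar", "similar"],
    ["different", "different", "different"],
    ["similar", "similar", "different"],
    ["different", "different", "different"],
]
kappa = fleiss_kappa(toy, ["similar", "different"])
```

Values above 0.8 are conventionally read as almost perfect agreement, which is the interpretation applied to the reported average of 0.84.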
Evaluating crowdsourced results in RQ2
We randomly selected 500 tweets and 500 review topics. Among them, 250 topics had been marked as similar and 250 topics as different by the crowd. We asked three app developers we had previously worked with to manually label these 500 topics. They answered the same questions as the crowd (Fig. 5). We compared the results and found that our developers classified 99.4% of the topics in the same way as we had based on the crowd’s evaluation.
We then compared the results achieved by the crowd with the results achieved by the three developers, as presented in Table 3.
Evaluating crowdsourced results in RQ3
We randomly selected 250 tweets and 250 reviews and asked the three developers to judge both their degree of specification and their degree of understandability. Tables 4 and 5 compare the results of this task as performed by the crowd and by the three developers.
The developers classified 17.2% of the reviews and 16.4% of the tweets in a different specification category than the crowd did. With the same setup, we asked the developers to evaluate the degree of understandability and compared the results with those received from the crowd:
The developers classified 11.2% of the reviews and 17.6% of the tweets in a different understandability category than the crowd did.
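The percentages above are disagreement rates between crowd labels and developer labels. The sketch below shows one way such a rate could be computed. It assumes that the three developers' judgments are combined by majority vote, which the text does not state; the function names and the category labels are likewise hypothetical.

```python
from collections import Counter


def majority_label(labels):
    """Most frequent label among the raters.

    Assumption: ties (e.g. three raters, three different labels) are
    resolved in favor of the label that appears first in the list.
    """
    return Counter(labels).most_common(1)[0][0]


def disagreement_rate(crowd_labels, developer_labels):
    """Fraction of items whose crowd label differs from the developers' majority label.

    crowd_labels:     one label per item.
    developer_labels: one list of developer labels per item, in the same order.
    """
    if len(crowd_labels) != len(developer_labels):
        raise ValueError("both inputs must cover the same items")
    differing = sum(
        1
        for crowd, devs in zip(crowd_labels, developer_labels)
        if crowd != majority_label(devs)
    )
    return differing / len(crowd_labels)


# Toy data with hypothetical category names
crowd = ["high", "low", "medium", "low"]
developers = [
    ["high", "high", "low"],
    ["low", "low", "low"],
    ["low", "low", "medium"],
    ["low", "medium", "low"],
]
rate = disagreement_rate(crowd, developers)
```

Under this reading, a rate of 17.2% over 250 reviews corresponds to 43 reviews placed in a different category.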
