DOI: 10.1109/ASE.2015.12

Development emails content analyzer: intention mining in developer discussions

Published: 09 November 2015

Abstract

Written development communication (e.g., mailing lists, issue trackers) constitutes a precious source of information to build recommenders for software engineers, for example aimed at suggesting experts, or at re-documenting existing source code. In this paper we propose a novel, semi-supervised approach named DECA (Development Emails Content Analyzer) that uses Natural Language Parsing to classify the content of development emails according to their purpose (e.g., feature request, opinion asking, problem discovery, solution proposal, information giving, etc.), identifying email elements that can be used for specific tasks. A study based on data from Qt and Ubuntu highlights a high precision (90%) and recall (70%) of DECA in classifying email content, outperforming traditional machine learning strategies. Moreover, we successfully used DECA for re-documenting source code of Eclipse and Lucene, improving the recall, while keeping high precision, of a previous approach based on ad-hoc heuristics.
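
To make the classification task and its evaluation concrete, the following is a minimal sketch and not DECA's implementation. It assumes simple regular expressions as a stand-in for the natural language parsing the abstract describes. The pattern lists and the names `PATTERNS`, `classify`, and `precision_recall` are hypothetical and chosen only for illustration; the category names are the ones listed in the abstract.

```python
import re

# Hypothetical surface patterns, one list per category named in the abstract.
# DECA relies on natural language parsing; these regexes are only a stand-in.
PATTERNS = {
    "feature request": [r"\bwould be (nice|great|useful)\b", r"\bplease add\b"],
    "opinion asking": [r"\bwhat do you think\b", r"\bany (thoughts|opinions)\b"],
    "problem discovery": [r"\b(crash(es|ed)?|fails?|does not work|doesn't work)\b"],
    "solution proposal": [r"\b(you|we) (could|should|can) (try|use)\b", r"\bi suggest\b"],
    "information giving": [r"\b(fyi|note that|for your information)\b"],
}


def classify(sentence):
    """Return the first category whose pattern matches, or None."""
    text = sentence.lower()
    for category, patterns in PATTERNS.items():
        if any(re.search(p, text) for p in patterns):
            return category
    return None


def precision_recall(predicted, expected, category):
    """Precision and recall for one category over parallel label lists."""
    pairs = list(zip(predicted, expected))
    tp = sum(p == category and e == category for p, e in pairs)
    fp = sum(p == category and e != category for p, e in pairs)
    fn = sum(p != category and e == category for p, e in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

In this sketch, precision is the share of sentences assigned to a category that truly belong to it, and recall is the share of sentences truly in that category that were found. Those are the two measures the abstract reports as 90% and 70%.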

    Information & Contributors

    Information

    Published In

    ASE '15: Proceedings of the 30th IEEE/ACM International Conference on Automated Software Engineering
    November 2015
    935 pages
    ISBN:9781509000241

    Sponsors

    In-Cooperation

    • IEEE CS

    Publisher

    IEEE Press

    Author Tags

    1. empirical study
    2. natural language processing
    3. unstructured data mining

    Qualifiers

    • Research-article

    Conference

    ASE '15

    Acceptance Rates

    Overall Acceptance Rate 82 of 337 submissions, 24%

    Cited By

    • (2024) DRMiner: Extracting Latent Design Rationale from Jira Issue Logs. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, pp. 468-480. DOI: 10.1145/3691620.3695019. Online publication date: 27-Oct-2024
    • (2024) Can GitHub Issues Help in App Review Classifications? ACM Transactions on Software Engineering and Methodology 33:8, pp. 1-42. DOI: 10.1145/3678170. Online publication date: 18-Jul-2024
    • (2024) Analyzing and Detecting Information Types of Developer Live Chat Threads. ACM Transactions on Software Engineering and Methodology 33:5, pp. 1-32. DOI: 10.1145/3643677. Online publication date: 4-Jun-2024
    • (2023) Summary of the 2nd Natural Language-based Software Engineering Workshop (NLBSE 2023). ACM SIGSOFT Software Engineering Notes 48:4, pp. 60-63. DOI: 10.1145/3617946.3617957. Online publication date: 17-Oct-2023
    • (2023) A Vision on Intentions in Software Engineering. Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 2117-2121. DOI: 10.1145/3611643.3613087. Online publication date: 30-Nov-2023
    • (2023) Summary of the 1st Natural Language-based Software Engineering Workshop (NLBSE 2022). ACM SIGSOFT Software Engineering Notes 48:1, pp. 101-104. DOI: 10.1145/3573074.3573101. Online publication date: 17-Jan-2023
    • (2023) Automated Identification and Qualitative Characterization of Safety Concerns Reported in UAV Software Platforms. ACM Transactions on Software Engineering and Methodology 32:3, pp. 1-37. DOI: 10.1145/3564821. Online publication date: 26-Apr-2023
    • (2023) APIRO: A Framework for Automated Security Tools API Recommendation. ACM Transactions on Software Engineering and Methodology 32:1, pp. 1-42. DOI: 10.1145/3512768. Online publication date: 13-Feb-2023
    • (2022) BugListener. Proceedings of the 44th International Conference on Software Engineering, pp. 299-311. DOI: 10.1145/3510003.3510108. Online publication date: 21-May-2022
    • (2022) Predictive Models in Software Engineering: Challenges and Opportunities. ACM Transactions on Software Engineering and Methodology 31:3, pp. 1-72. DOI: 10.1145/3503509. Online publication date: 9-Apr-2022