DOI: 10.1145/2566486.2567972
research-article

What makes a good biography?: multidimensional quality analysis based on wikipedia article feedback data

Published: 07 April 2014

Abstract

With more than 22 million articles, the largest collaborative knowledge resource never sleeps, experiencing several article edits every second. Over one fifth of these articles describe individual people, the majority of whom are still alive. Such articles are, by their nature, prone to corruption and vandalism. Manual quality assurance by experts can barely cope with this massive amount of data. Can it be effectively replaced by feedback from the crowd? Can we provide meaningful support for quality assurance with automated text processing techniques? Which properties of the articles should then play a key role in the machine learning algorithms, and why? In this paper, we study the user-perceived quality of Wikipedia articles based on a novel Wikipedia user feedback dataset. In contrast to previous work on quality assessment, which mostly relied on judgements of active Wikipedia authors, we analyze ratings of ordinary Wikipedia users along four quality dimensions (Complete, Well written, Trustworthy and Objective). We first present an empirical analysis of the novel dataset with over 36 million Wikipedia article ratings. We then select a subset of biographical articles and perform classification experiments to predict their quality ratings along each of the dimensions, exploring multiple linguistic, surface and network properties of the rated articles. Additionally, we study the classification performance and differences for the biographies of living and dead people as well as those for men and women. We demonstrate the effectiveness of our approach with F-scores of 0.94, 0.89, 0.73, and 0.73 for the dimensions Complete, Well written, Trustworthy, and Objective, respectively. Based on the results, we believe that the quality assessment of big textual data can be effectively supported by current text classification and language processing tools.
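
The abstract reports one F-score per quality dimension. As a rough illustration of what such a number measures, the sketch below computes a binary F1 score (the harmonic mean of precision and recall) for each dimension. It assumes a high/low labelling of articles and plain F1; the abstract does not state which F-score variant or averaging the authors used, and the function name `f1_score` and the toy labels are hypothetical, not data from the paper.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined), so F1 is 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels (1 = rated high, 0 = rated low); NOT the paper's data.
toy = {
    "Complete": ([1, 1, 0, 0, 1, 0], [1, 1, 0, 0, 1, 1]),
    "Objective": ([1, 0, 1, 0], [0, 0, 1, 1]),
}

# One score per quality dimension, mirroring how the abstract reports results.
scores = {dim: f1_score(truth, pred) for dim, (truth, pred) in toy.items()}

for dim, score in scores.items():
    print(f"{dim}: F1 = {score:.2f}")
```

Because F1 combines precision and recall, a classifier that labels every article "high" cannot score well unless nearly all articles really are rated high, which is one reason it is a common choice for reporting classification quality.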

      Published In

      WWW '14: Proceedings of the 23rd international conference on World wide web
      April 2014
      926 pages
      ISBN:9781450327442
      DOI:10.1145/2566486

      Sponsors

      • IW3C2: International World Wide Web Conference Committee

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. collaborative knowledge resources
      2. document classification
      3. machine learning
      4. natural language processing
      5. neutrality
      6. quality assessment
      7. trustworthiness
      8. wikipedia

      Qualifiers

      • Research-article

      Conference

      WWW '14
      Sponsor:
      • IW3C2

      Acceptance Rates

      WWW '14 Paper Acceptance Rate 84 of 645 submissions, 13%;
      Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

      Article Metrics

      • Downloads (last 12 months): 22
      • Downloads (last 6 weeks): 1
      Reflects downloads up to 12 Dec 2024

      Cited By

      • (2023) Automatic Quality Assessment of Wikipedia Articles—A Systematic Literature Review. ACM Computing Surveys 56:4 (1-37). DOI: 10.1145/3625286. Online publication date: 10-Nov-2023
      • (2021) From Symbols to Embeddings: A Tale of Two Representations in Computational Social Science. Journal of Social Computing 2:2 (103-156). DOI: 10.23919/JSC.2021.0011. Online publication date: Jun-2021
      • (2021) Topical Classification of Text Fragments Accounting for Their Nearest Context. Automation and Remote Control 81:12 (2262-2276). DOI: 10.1134/S0005117920120097. Online publication date: 10-Feb-2021
      • (2020) Modeling Popularity and Reliability of Sources in Multilingual Wikipedia. Information 11:5 (263). DOI: 10.3390/info11050263. Online publication date: 13-May-2020
      • (2019) Multilingual Ranking of Wikipedia Articles with Quality and Popularity Assessment in Different Topics. Computers 8:3 (60). DOI: 10.3390/computers8030060. Online publication date: 14-Aug-2019
      • (2019) Measures for Quality Assessment of Articles and Infoboxes in Multilingual Wikipedia. Business Information Systems Workshops (619-633). DOI: 10.1007/978-3-030-04849-5_53. Online publication date: 3-Jan-2019
      • (2018) History-Based Article Quality Assessment on Wikipedia. 2018 IEEE International Conference on Big Data and Smart Computing (BigComp) (1-8). DOI: 10.1109/BigComp.2018.00010. Online publication date: Jan-2018
      • (2016) Web Content Classification Using Distributions of Subjective Quality Evaluations. ACM Transactions on the Web 10:4 (1-30). DOI: 10.1145/2994132. Online publication date: 15-Nov-2016
      • (2016) Measuring Quality of Collaboratively Edited Documents: The Case of Wikipedia. 2016 IEEE 2nd International Conference on Collaboration and Internet Computing (CIC) (266-275). DOI: 10.1109/CIC.2016.044. Online publication date: Nov-2016
      • (2015) Wikipedia, sociology, and the promise and pitfalls of Big Data. Big Data & Society 2:2 (205395171561433). DOI: 10.1177/2053951715614332. Online publication date: 1-Dec-2015
