Eliciting Informative Text Evaluations with Large Language Models

Published: 17 December 2024

Abstract

Peer prediction mechanisms motivate high-quality feedback with provable guarantees. However, current methods apply only to simple report formats, such as multiple-choice answers or scalar numbers. We aim to broaden these techniques to the much larger domain of text-based reports, drawing on recent developments in large language models (LLMs). This vastly expands the applicability of peer prediction mechanisms, since textual feedback is the norm across a wide variety of feedback channels: peer reviews, e-commerce customer reviews, and comments on social media.
We introduce two mechanisms, the Generative Peer Prediction Mechanism (GPPM) and the Generative Synopsis Peer Prediction Mechanism (GSPPM). These mechanisms use LLMs as predictors, mapping one agent's report to a prediction of her peer's report. Theoretically, we show that when the LLM prediction is sufficiently accurate, our mechanisms can incentivize high effort and truth-telling as an (approximate) Bayesian Nash equilibrium. Empirically, we confirm the efficacy of our mechanisms through experiments on two real datasets: the Yelp review dataset and the ICLR OpenReview dataset. We highlight that on the ICLR dataset, our mechanisms can differentiate three quality levels (human-written reviews, GPT-4-generated reviews, and GPT-3.5-generated reviews) in terms of expected scores. Additionally, GSPPM penalizes LLM-generated reviews more effectively than GPPM.
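The prediction-based scoring described above can be illustrated with a small sketch: reward an agent by how much her report raises the estimated log-probability of her peer's report over a no-information baseline. This is a toy illustration only; the function names and the keyword-overlap "predictor" below are stand-in assumptions, whereas the actual mechanisms use LLM-estimated probabilities and the paper's specific scoring details.

```python
import math

def gppm_style_score(report_i, report_j, cond_logprob, base_logprob):
    """Sketch of a generative-peer-prediction-style score: agent i is paid
    the log-probability lift her report gives to an estimate of peer j's
    report. `cond_logprob` and `base_logprob` stand in for LLM estimates."""
    return cond_logprob(report_j, report_i) - base_logprob(report_j)

# Toy stand-in "predictor": an (unnormalized) likelihood of the peer's
# report that grows with keyword overlap -- NOT a real LLM.
def toy_cond_logprob(report_j, report_i):
    overlap = len(set(report_j.split()) & set(report_i.split()))
    return math.log(0.1 + 0.1 * overlap)

def toy_base_logprob(report_j):
    # No-information baseline: a fixed prior probability for any report.
    return math.log(0.1)

informative = gppm_style_score(
    "great food slow service", "great service overall",
    toy_cond_logprob, toy_base_logprob)
uninformative = gppm_style_score(
    "no comment", "great service overall",
    toy_cond_logprob, toy_base_logprob)
```

Under this sketch, a report that genuinely correlates with the peer's report earns a positive score, while an uninformative report earns the baseline score of zero, which is the incentive structure the mechanisms aim for.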


Cited By

  • (2024) Evaluating research quality with Large Language Models: An analysis of ChatGPT's effectiveness with different settings and inputs. Journal of Data and Information Science. DOI: 10.2478/jdis-2025-0011. Online publication date: 20 Dec 2024.


    Published In

    EC '24: Proceedings of the 25th ACM Conference on Economics and Computation
    July 2024, 1340 pages
    ISBN: 9798400707049
    DOI: 10.1145/3670865

    Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. information elicitation
    2. peer prediction
    3. large language models (LLMs)


    Acceptance Rates

    Overall Acceptance Rate: 664 of 2,389 submissions, 28%

