Eliciting Informative Text Evaluations with Large Language Models

Published: 17 December 2024

Abstract

Peer prediction mechanisms motivate high-quality feedback with provable guarantees. However, current methods apply only to simple report formats, such as multiple-choice answers or scalar numbers. We aim to broaden these techniques to the much larger domain of text-based reports, drawing on recent developments in large language models (LLMs). This vastly expands the applicability of peer prediction mechanisms, since textual feedback is the norm across a wide variety of feedback channels: peer reviews, e-commerce customer reviews, and comments on social media.
We introduce two mechanisms, the Generative Peer Prediction Mechanism (GPPM) and the Generative Synopsis Peer Prediction Mechanism (GSPPM). These mechanisms use LLMs as predictors, mapping one agent's report to a prediction of her peer's report. Theoretically, we show that when the LLM prediction is sufficiently accurate, our mechanisms can incentivize high effort and truth-telling as an (approximate) Bayesian Nash equilibrium. Empirically, we confirm the efficacy of our mechanisms through experiments on two real datasets: the Yelp review dataset and the ICLR OpenReview dataset. We highlight that on the ICLR dataset, our mechanisms can differentiate three quality levels (human-written reviews, GPT-4-generated reviews, and GPT-3.5-generated reviews) in terms of expected scores. Additionally, GSPPM penalizes LLM-generated reviews more effectively than GPPM.
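The prediction-based scoring described above can be illustrated with a small sketch: reward an agent by how much her report raises the estimated log-probability of her peer's report over a no-information baseline. This is a toy illustration only; the function names and the keyword-overlap "predictor" below are stand-in assumptions, whereas the actual mechanisms use LLM-estimated probabilities and the paper's specific scoring details.

```python
import math

def gppm_style_score(report_i, report_j, cond_logprob, base_logprob):
    """Sketch of a generative-peer-prediction-style score: agent i is paid
    the log-probability lift her report gives to an estimate of peer j's
    report. `cond_logprob` and `base_logprob` stand in for LLM estimates."""
    return cond_logprob(report_j, report_i) - base_logprob(report_j)

# Toy stand-in "predictor": an (unnormalized) likelihood of the peer's
# report that grows with keyword overlap -- NOT a real LLM.
def toy_cond_logprob(report_j, report_i):
    overlap = len(set(report_j.split()) & set(report_i.split()))
    return math.log(0.1 + 0.1 * overlap)

def toy_base_logprob(report_j):
    # No-information baseline: a fixed prior probability for any report.
    return math.log(0.1)

informative = gppm_style_score(
    "great food slow service", "great service overall",
    toy_cond_logprob, toy_base_logprob)
uninformative = gppm_style_score(
    "no comment", "great service overall",
    toy_cond_logprob, toy_base_logprob)
```

Under this sketch, a report that genuinely correlates with the peer's report earns a positive score, while an uninformative report earns the baseline score of zero, which is the incentive structure the mechanisms aim for.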


Cited By

  • (2024) Evaluating research quality with Large Language Models: An analysis of ChatGPT's effectiveness with different settings and inputs. Journal of Data and Information Science. DOI: 10.2478/jdis-2025-0011. Online publication date: 20 Dec 2024.


    Published In

    EC '24: Proceedings of the 25th ACM Conference on Economics and Computation
    July 2024, 1340 pages
    ISBN: 9798400707049
    DOI: 10.1145/3670865

    Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. information elicitation
    2. peer prediction
    3. large language models (LLMs)


    Acceptance Rates

    Overall Acceptance Rate: 664 of 2,389 submissions, 28%

