Abstract
Automatic methods for evaluating machine-generated texts are increasingly important as the applications of generative systems expand. Conventional methods tend to lack explainability, issuing a solitary numerical score as the assessment outcome. Recent advancements have sought to mitigate this limitation by incorporating large language models (LLMs) to offer more detailed error analyses, yet their applicability remains constrained, particularly in industrial contexts where comprehensive error coverage and swift detection are paramount. To alleviate these challenges, we introduce DEE, a Dual-stage Explainable Evaluation method for estimating the quality of text generation. Built upon Llama 2, DEE follows a dual-stage principle guided by stage-specific instructions: it efficiently identifies errors in generated texts in the initial stage and then provides comprehensive diagnostic reports in the second stage. DEE is fine-tuned on our elaborately assembled dataset AntEval, which comprises 15K instances from 4 real-world applications of Alipay that employ generative systems. The dataset covers newly emerged issues such as hallucination and toxicity, thereby broadening the scope of DEE's evaluation criteria. Experimental results affirm DEE's superiority over existing evaluation methods, with significant improvements in both human correlation and efficiency.
S. Zhang and Y. Li—Equal Contributors.
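The dual-stage principle described in the abstract lends itself to a simple control flow: a fast first pass flags whether the generated text contains errors, and the more expensive diagnostic pass runs only when it does. The following Python sketch illustrates that flow; the instruction wording, the `generate` callable standing in for the fine-tuned Llama-2-based evaluator, and the output format are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch of a dual-stage evaluation loop in the spirit of DEE.
# The instructions below and the `generate` interface are hypothetical;
# the paper's actual stage-specific prompts and output schema may differ.

STAGE1_INSTRUCTION = (
    "Quickly decide whether the generated text contains errors "
    "(e.g., hallucination, toxicity). Answer 'yes' or 'no'."
)
STAGE2_INSTRUCTION = (
    "List each error in the generated text and explain why it is an error, "
    "producing a detailed diagnostic report."
)

def evaluate(generate, source: str, output: str) -> dict:
    """Two-stage evaluation: fast error detection, then optional diagnosis.

    `generate(prompt) -> str` stands in for a call to the fine-tuned
    evaluator (e.g., a model served behind an inference endpoint).
    """
    stage1_prompt = f"{STAGE1_INSTRUCTION}\n\nSource:\n{source}\n\nOutput:\n{output}"
    has_errors = generate(stage1_prompt).strip().lower().startswith("yes")

    report = None
    if has_errors:  # only pay for a long diagnostic generation when needed
        stage2_prompt = f"{STAGE2_INSTRUCTION}\n\nSource:\n{source}\n\nOutput:\n{output}"
        report = generate(stage2_prompt)

    return {"has_errors": has_errors, "report": report}
```

The early-exit structure is what gives the method its efficiency claim: texts that pass the cheap first stage never trigger the longer report generation.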
Acknowledgement
This work is partially supported by the Ant Group Research Fund and the Natural Science Foundation of China (Grant No. U21A20488). We thank the Big Data Computing Center of Southeast University for providing facility support for the numerical calculations in this paper.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Zhang, S. et al. (2024). DEE: Dual-Stage Explainable Evaluation Method for Text Generation. In: Onizuka, M., et al. Database Systems for Advanced Applications. DASFAA 2024. Lecture Notes in Computer Science, vol 14856. Springer, Singapore. https://doi.org/10.1007/978-981-97-5575-2_29