CN118035807A

CN118035807A - Evaluation method and device of large language model, electronic equipment and storage medium

Info

Publication number: CN118035807A
Application number: CN202311837538.3A
Authority: CN
Inventors: 赫甲帅; 黄赟贺
Original assignee: Beijing SoundAI Technology Co Ltd
Current assignee: Beijing SoundAI Technology Co Ltd
Priority date: 2023-12-28
Filing date: 2023-12-28
Publication date: 2024-05-14

Abstract

The present invention provides a method, device, electronic device and storage medium for evaluating a large language model, and relates to the field of artificial intelligence technology. The method includes: obtaining interaction information to be tested corresponding to the large language model to be tested, and a first manual score of the user scoring the large language model to be tested based on a manual evaluation template; the manual evaluation template is a scoring template determined based on at least two subjective indicators; the interaction information to be tested is input into at least two evaluation models, and a first objective score of each evaluation model scoring the large language model to be tested is output; the loss function of each evaluation model is determined based on a weighted average of at least two objective indicators; based on the first manual score and each first objective score, a global score corresponding to the large language model to be tested is determined. The present invention can improve the objectivity and comprehensiveness of the evaluation of the large language model to be tested, and thus improve the reliability of the evaluation result of the large language model to be tested.

Description

Large language model evaluation method, device, electronic device and storage medium

Technical Field

The present invention relates to the field of artificial intelligence technology, and in particular to an evaluation method and device for a large language model, an electronic device, and a storage medium.

Background Art

With the continuous development of deep learning and natural language processing, a large number of large language models (LLMs) have emerged. Evaluating a large language model assesses its performance on specific tasks and identifies its shortcomings, which facilitates targeted improvements to existing models and thereby improves their performance.

In the prior art, a large language model typically interacts with users, and the users rate the model's performance during the interaction on each scoring item, so that the model is evaluated through manual scoring. However, manual scoring is relatively subjective, which lowers the objectivity of the evaluation and, in turn, the reliability of the evaluation results.

Summary of the Invention

The present invention provides an evaluation method and device for a large language model, an electronic device, and a storage medium, to overcome the low reliability of large language model evaluation results in the prior art.

The present invention provides an evaluation method for a large language model, comprising:

acquiring interaction information to be tested corresponding to a large language model to be tested, and a first manual score given by a user to the large language model to be tested based on a manual evaluation template, the manual evaluation template being a scoring template determined based on at least two subjective indicators;

inputting the interaction information to be tested into at least two evaluation models, and outputting a first objective score given by each evaluation model to the large language model to be tested, the loss function of each evaluation model being determined based on a weighted average of at least two objective indicators; and

determining a global score corresponding to the large language model to be tested based on the first manual score and each first objective score.

According to the evaluation method provided by the present invention, inputting the interaction information to be tested into at least two evaluation models and outputting the first objective score given by each evaluation model to the large language model to be tested comprises:

for each evaluation model, inputting the interaction information to be tested into the evaluation model and determining prediction information corresponding to the interaction information to be tested, the prediction information characterizing the correctness of the interaction information to be tested;

determining, based on the prediction information, an indicator score corresponding to each objective indicator; and

determining, based on each indicator score and the scoring weight corresponding to each indicator score, the first objective score given by the evaluation model to the large language model to be tested.

According to the evaluation method provided by the present invention, determining the global score corresponding to the large language model to be tested based on the first manual score and each first objective score comprises:

obtaining a score weight corresponding to each evaluation model, the score weight being determined based on the ratio of the amount of sub-training data corresponding to that evaluation model to the total amount of training data corresponding to all evaluation models;

determining a second objective score corresponding to the large language model to be tested based on each score weight and the first objective score of the evaluation model corresponding to that score weight; and

determining the global score corresponding to the large language model to be tested based on a weighted sum of the first manual score and the second objective score.
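The three steps above can be sketched in Python as follows. This is only an illustrative sketch, not the patented implementation: the function names are hypothetical, the manual and objective scores are assumed to be on the same scale, and the two final weights are assumed to be supplied by the caller.

```python
from typing import Sequence


def score_weights(sub_training_counts: Sequence[int]) -> list[float]:
    """Score weight of each evaluation model: its share of the total training data."""
    total = sum(sub_training_counts)
    if total <= 0:
        raise ValueError("total training data amount must be positive")
    return [count / total for count in sub_training_counts]


def second_objective_score(first_objective_scores: Sequence[float],
                           sub_training_counts: Sequence[int]) -> float:
    """Combine the per-model first objective scores using the score weights."""
    if len(first_objective_scores) != len(sub_training_counts):
        raise ValueError("one training data amount is needed per evaluation model")
    weights = score_weights(sub_training_counts)
    return sum(w * s for w, s in zip(weights, first_objective_scores))


def global_score(first_manual_score: float, second_objective: float,
                 manual_weight: float, objective_weight: float) -> float:
    """Weighted sum of the first manual score and the second objective score."""
    return manual_weight * first_manual_score + objective_weight * second_objective


# Hypothetical example: two evaluation models trained on 300 and 100 samples.
objective = second_objective_score([80.0, 60.0], [300, 100])
overall = global_score(70.0, objective, manual_weight=0.5, objective_weight=0.5)
```

Under this reading, an evaluation model trained on more data contributes proportionally more to the second objective score.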

According to the evaluation method provided by the present invention, determining the global score corresponding to the large language model to be tested based on the weighted sum of the first manual score and the second objective score comprises:

performing semantic analysis on the interaction information to be tested to determine target domain information corresponding to the interaction information to be tested;

determining, based on the target domain information and from a preset mapping relationship, a target manual score weight corresponding to the first manual score and a target objective score weight corresponding to the second objective score, the preset mapping relationship including a mapping between domain information, manual score weights, and objective score weights; and

determining the global score corresponding to the large language model to be tested based on the first manual score, the target manual score weight, the second objective score, and the target objective score weight.
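A minimal sketch of the domain-dependent weighting is given below. The domain labels, the weight values, and the fallback weights are all hypothetical; the semantic analysis step is not shown and the domain label is assumed to be already available.

```python
# Hypothetical preset mapping: domain -> (manual score weight, objective score weight).
DOMAIN_WEIGHTS: dict[str, tuple[float, float]] = {
    "medical": (0.3, 0.7),
    "literary": (0.7, 0.3),
}

# Assumption: equal weights when the domain is not in the mapping.
DEFAULT_WEIGHTS: tuple[float, float] = (0.5, 0.5)


def global_score_for_domain(domain: str, first_manual_score: float,
                            second_objective_score: float) -> float:
    """Look up the target weights for the domain and form the weighted sum."""
    manual_weight, objective_weight = DOMAIN_WEIGHTS.get(domain, DEFAULT_WEIGHTS)
    return (manual_weight * first_manual_score
            + objective_weight * second_objective_score)
```

The design intent illustrated here is that a domain where human judgment matters more (for example, literary text) can give the manual score a larger weight, while a domain with clearer right and wrong answers can favor the objective score.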

According to the evaluation method provided by the present invention, the objective indicator carrying the largest scoring weight in the loss function differs from one evaluation model to another.

According to the evaluation method provided by the present invention, obtaining the first manual score given by the user to the large language model to be tested based on the manual evaluation template comprises:

obtaining a second subjective score given by the user for each subjective indicator;

determining the sum of all second subjective scores; and

determining the first manual score based on the ratio of the sum of all second subjective scores to the number of subjective indicators.

According to the evaluation method provided by the present invention, the method further comprises:

for each subjective indicator, determining the number of scoring values corresponding to the subjective indicator;

determining, based on the number of scoring values, the degree of the subjective indicator corresponding to each scoring value;

constructing a sub-evaluation template corresponding to the subjective indicator based on all scoring values and the degree of the subjective indicator corresponding to each scoring value;

constructing the manual evaluation template based on the sub-evaluation templates corresponding to all subjective indicators; and

sending the manual evaluation template and the interaction information to be tested to a terminal corresponding to the user.
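One possible way to represent the template construction described above is sketched below. The data layout (plain dictionaries keyed by scoring value) and the function names are assumptions made for illustration; sending the template to the user's terminal is not shown.

```python
def build_sub_template(indicator: str, degree_descriptions: list[str]) -> dict:
    """Sub-evaluation template: scoring values 0..n-1 mapped to degree descriptions.

    The number of scoring values equals the number of descriptions supplied.
    """
    if len(degree_descriptions) < 2:
        raise ValueError("at least two scoring values are needed")
    return {
        "indicator": indicator,
        "levels": dict(enumerate(degree_descriptions)),
    }


def build_manual_template(sub_templates: list[dict]) -> dict:
    """Manual evaluation template: one sub-evaluation template per subjective indicator."""
    if len(sub_templates) < 2:
        raise ValueError("the template is based on at least two subjective indicators")
    return {t["indicator"]: t["levels"] for t in sub_templates}


# Hypothetical indicators with three scoring values each.
template = build_manual_template([
    build_sub_template("relevance", ["unrelated", "partly related", "fully related"]),
    build_sub_template("correctness", ["wrong", "partly correct", "correct"]),
])
```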

The present invention also provides an evaluation device for a large language model, comprising:

an acquisition module, configured to acquire interaction information to be tested corresponding to a large language model to be tested, and a first manual score given by a user to the large language model to be tested based on a manual evaluation template, the manual evaluation template being a scoring template determined based on at least two subjective indicators;

an output module, configured to input the interaction information to be tested into at least two evaluation models and output a first objective score given by each evaluation model to the large language model to be tested, the loss function of each evaluation model being determined based on a weighted average of at least two objective indicators; and

a determination module, configured to determine a global score corresponding to the large language model to be tested based on the first manual score and each first objective score.

The present invention also provides an electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements any of the evaluation methods for a large language model described above.

The present invention also provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements any of the evaluation methods for a large language model described above.

With the evaluation method, device, electronic device, and storage medium provided by the present invention, after the interaction information to be tested and the user's first manual score for the large language model to be tested are obtained, each evaluation model makes a prediction on the interaction information to be tested and outputs its first objective score for the large language model to be tested. The first manual score and the first objective scores are then combined into a global score for the large language model to be tested. Combining subjective and objective indicators improves the objectivity and comprehensiveness of the evaluation, and thereby the reliability of the evaluation results.

Brief Description of the Drawings

To illustrate the technical solutions of the present invention or the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

FIG. 1 is a schematic flowchart of the evaluation method for a large language model provided by an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of the evaluation device for a large language model provided by an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of the electronic device provided by an embodiment of the present invention.

Detailed Description

To make the purpose, technical solutions, and advantages of the present invention clearer, the technical solutions of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the scope of protection of the present invention.

To address the low reliability of large language model evaluation results in the prior art, an embodiment of the present invention provides an evaluation method for a large language model. FIG. 1 is a schematic flowchart of this method. As shown in FIG. 1, the method includes:

Step 110: Acquire interaction information to be tested corresponding to the large language model to be tested, and a first manual score given by a user to the large language model to be tested based on a manual evaluation template; the manual evaluation template is a scoring template determined based on at least two subjective indicators.

Optionally, the interaction information to be tested may be the historical interaction text of multiple rounds of interaction between the large language model to be tested and the user, in the same domain or in different domains. The historical interaction text includes the historical question text entered by the user and the historical reply text generated by the large language model based on that question text. It should be noted that the interaction information to be tested is obtained with the user's authorization.

Optionally, the domain may include the medical, academic, technology, art, literature, and financial domains, among others, which is not limited in the embodiments of the present invention.

Optionally, the user who scores the large language model to be tested based on the manual evaluation template may be a real user who interacts with the model, or an evaluation expert, which is not limited in the embodiments of the present invention.

Optionally, the at least two subjective indicators in the manual evaluation template can be understood as the scoring items for manual scoring. The subjective indicators may include basic general capability, vertical domain capability, distribution hit rate, relevance, irrelevance, correctness, interactivity, and model performance, where: basic general capability can be understood as the relevance of the reply text to the question text and the fluency of its sentences; vertical domain capability can be understood as the relevance of the reply text to the question text and the domain-specific keywords it involves; distribution hit rate can be understood as the number of keywords hit in the reply text; relevance can be understood as the semantic relevance of the reply text and the number of keywords that meet the requirements; irrelevance is the opposite of relevance; correctness is the degree to which the user judges the reply text to be correct; interactivity can be understood as the relevance of the reply text generated after a preset number of follow-up questions; and model performance can be understood as the ratio of the model's actual performance to its expected performance, which may cover model response speed, interface return speed, and QPS (Queries Per Second), among others.

In addition, the subjective indicators may also include the degree to which the reply text takes into account the emotion expressed by the questioning user in the question text, or the degree to which the reply text takes into account the identity data of the questioning user. The identity data may include the questioning user's age, gender, education, social status, and similar data, which is not limited in the embodiments of the present invention.

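Further, the method also includes: for each subjective indicator, determining the number of scoring values corresponding to the subjective indicator; determining, based on the number of scoring values, the degree of the subjective indicator corresponding to each scoring value; constructing a sub-evaluation template corresponding to the subjective indicator based on all scoring values and the degree corresponding to each scoring value; constructing the manual evaluation template based on the sub-evaluation templates corresponding to all subjective indicators; and sending the manual evaluation template and the interaction information to be tested to a terminal corresponding to the user.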

For example, after the above subjective indicators are determined, the scoring range may first be set to 0-5 points and then divided at 1-point intervals into six scoring values: 0, 1, 2, 3, 4, and 5 points. From 0 points to 5 points, the degree to which the subjective indicator is satisfied increases in sequence; that is, 0 points indicates the lowest degree of satisfaction and 5 points the highest. After the scoring values are determined, for each subjective indicator, the degree of satisfying the indicator is divided according to the number of scoring values, and the sub-evaluation template of the indicator is constructed by combining the scoring values with the degree of satisfaction corresponding to each scoring value.

Taking basic general capability as the subjective indicator: 0 points means the reply text of the large language model to be tested is completely unrelated to the question text; 1 point means the reply text is unrelated to the question text, but its sentences are slightly fluent; 2 points means the reply text is unrelated to the question text, but its sentences are fluent; 3 points means the reply text is slightly related to the question text and its sentences are fluent; 4 points means the reply text is related to the question text and its sentences are fluent; 5 points means the reply text is completely related to the question text and its sentences are fluent and logical. Combining all subjective indicators with their corresponding sub-evaluation templates yields the manual evaluation template. The manual evaluation template is shown in Table 1, in which, apart from the first row of key fields, each row represents one subjective indicator and its sub-evaluation template.

Table 1

Further, obtaining the first manual score given by the user to the large language model to be tested based on the manual evaluation template includes: obtaining a second subjective score given by the user for each subjective indicator; determining the sum of all second subjective scores; and determining the first manual score based on the ratio of that sum to the number of subjective indicators.

Specifically, within each subjective indicator, different scoring values correspond to different second subjective scores. For example, 0 points corresponds to a score of 0, 1 point to 20, 2 points to 40, 3 points to 60, 4 points to 80, and 5 points to 100. That is, once the scoring value selected by the user is determined, multiplying it by 20 gives the second subjective score for that subjective indicator. After the second subjective score of each subjective indicator is obtained, the sum of all second subjective scores is calculated and then divided by the number of subjective indicators, which gives a first manual score in the range of 0-100.
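The arithmetic of this example can be sketched as follows. The function name and the input validation are assumptions added for illustration; the factor of 20 and the 0-5 range come from the example above.

```python
def first_manual_score(selected_values: list[int], scale: int = 20) -> float:
    """Average of the second subjective scores over all subjective indicators.

    Each selected scoring value (0-5) is multiplied by `scale` to obtain the
    second subjective score (0-100) of its subjective indicator.
    """
    if not selected_values:
        raise ValueError("at least one subjective indicator must be scored")
    if any(v < 0 or v > 5 for v in selected_values):
        raise ValueError("scoring values must lie in the range 0-5")
    second_subjective_scores = [v * scale for v in selected_values]
    return sum(second_subjective_scores) / len(second_subjective_scores)


# A user selects 5, 4, and 3 points on three subjective indicators:
# second subjective scores are 100, 80, and 60, so the average is 80.
score = first_manual_score([5, 4, 3])
```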

Step 120: Input the interaction information to be tested into at least two evaluation models, and output a first objective score given by each evaluation model to the large language model to be tested; the loss function of each evaluation model is determined based on a weighted average of at least two objective indicators.

Specifically, relying on the first manual score alone would make the evaluation result of the large language model to be tested less reliable. Therefore, in the embodiment of the present invention, after the interaction information to be tested is obtained, it is input into each of multiple evaluation models. Each evaluation model classifies the interaction information to be tested to obtain its prediction information, and the first objective score is determined from that prediction information.

Optionally, the objective indicators may include accuracy, precision, recall, and the F1 score, where: accuracy measures the proportion of all samples that the model predicts correctly; precision measures the proportion of samples predicted as positive by the model that are truly positive; recall measures the proportion of truly positive samples that the model predicts as positive; and the F1 score is the harmonic mean of precision and recall, calculated as shown in formula (1):

$$F1\_score = \frac{2 \times Precision \times Recall}{Precision + Recall} \quad (1)$$

where $F1\_score$ denotes the F1 score, $Precision$ the precision, and $Recall$ the recall.
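For reference, the four objective indicators can be computed from the counts of a binary confusion matrix as sketched below. The function name and the convention of returning 0.0 when a denominator is zero are assumptions; the formulas are the standard definitions.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Accuracy, precision, recall, and F1 score from binary confusion counts.

    tp/fp: true/false positives, tn/fn: true/false negatives.
    A metric whose denominator is zero is reported as 0.0 (an assumed convention).
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1_score": f1}


# Hypothetical counts: 8 true positives, 2 false positives,
# 6 true negatives, 4 false negatives.
metrics = classification_metrics(tp=8, fp=2, tn=6, fn=4)
```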

It should be noted that the objective indicator carrying the largest scoring weight in the loss function differs from one evaluation model to another. That is, all evaluation models evaluate the same objective indicators, but each emphasizes a different one. For example, suppose the objective indicators are accuracy, precision, recall, and the F1 score, and there are four evaluation models:

1) Evaluation model 1 focuses on accuracy. Its loss function is given by formula (2):

$$Loss_1 = \omega_{a1} \cdot Accuracy + \omega_{b1} \cdot Precision + \omega_{c1} \cdot Recall + \omega_{d1} \cdot F1\_score \quad (2)$$

where $Loss_1$ denotes the loss function of evaluation model 1; $\omega_{a1}$, $\omega_{b1}$, $\omega_{c1}$, and $\omega_{d1}$ denote the scoring weights of accuracy, precision, recall, and the F1 score, respectively, in evaluation model 1; $\omega_{a1} + \omega_{b1} + \omega_{c1} + \omega_{d1} = 1$; and $\omega_{a1}$ is greater than each of $\omega_{b1}$, $\omega_{c1}$, and $\omega_{d1}$.

2) Evaluation model 2 focuses on precision. Its loss function is given by formula (3):

$$Loss_2 = \omega_{a2} \cdot Accuracy + \omega_{b2} \cdot Precision + \omega_{c2} \cdot Recall + \omega_{d2} \cdot F1\_score \quad (3)$$

where $Loss_2$ denotes the loss function of evaluation model 2; $\omega_{a2}$, $\omega_{b2}$, $\omega_{c2}$, and $\omega_{d2}$ denote the scoring weights of accuracy, precision, recall, and the F1 score, respectively, in evaluation model 2; $\omega_{a2} + \omega_{b2} + \omega_{c2} + \omega_{d2} = 1$; and $\omega_{b2}$ is greater than each of $\omega_{a2}$, $\omega_{c2}$, and $\omega_{d2}$.

3) Evaluation model 3 focuses on recall. Its loss function is given by formula (4):

$$Loss_3 = \omega_{a3} \cdot Accuracy + \omega_{b3} \cdot Precision + \omega_{c3} \cdot Recall + \omega_{d3} \cdot F1\_score \quad (4)$$

where $Loss_3$ denotes the loss function of evaluation model 3; $\omega_{a3}$, $\omega_{b3}$, $\omega_{c3}$, and $\omega_{d3}$ denote the scoring weights of accuracy, precision, recall, and the F1 score, respectively, in evaluation model 3; $\omega_{a3} + \omega_{b3} + \omega_{c3} + \omega_{d3} = 1$; and $\omega_{c3}$ is greater than each of $\omega_{a3}$, $\omega_{b3}$, and $\omega_{d3}$.

4) Evaluation model 4 focuses on the F1 score. Its loss function is given by formula (5):

$$Loss_4 = \omega_{a4} \cdot Accuracy + \omega_{b4} \cdot Precision + \omega_{c4} \cdot Recall + \omega_{d4} \cdot F1\_score \quad (5)$$

where $Loss_4$ denotes the loss function of evaluation model 4; $\omega_{a4}$, $\omega_{b4}$, $\omega_{c4}$, and $\omega_{d4}$ denote the scoring weights of accuracy, precision, recall, and the F1 score, respectively, in evaluation model 4; $\omega_{a4} + \omega_{b4} + \omega_{c4} + \omega_{d4} = 1$; and $\omega_{d4}$ is greater than each of $\omega_{a4}$, $\omega_{b4}$, and $\omega_{c4}$.
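The two constraints on the scoring weights (they sum to 1 within each evaluation model, and each evaluation model has a different dominant indicator) can be checked with a sketch such as the following. The concrete weight values and the function names are hypothetical; the source does not give numeric weights.

```python
def dominant_indicator(weights: dict[str, float], tol: float = 1e-9) -> str:
    """Return the indicator with the strictly largest scoring weight.

    Raises ValueError if the weights do not sum to 1 or the maximum is tied.
    """
    if abs(sum(weights.values()) - 1.0) > tol:
        raise ValueError("scoring weights must sum to 1")
    ranked = sorted(weights, key=weights.get, reverse=True)
    if len(ranked) > 1 and weights[ranked[0]] - weights[ranked[1]] <= tol:
        raise ValueError("the largest scoring weight must be unique")
    return ranked[0]


def have_distinct_focus(models: list[dict[str, float]]) -> bool:
    """True if every evaluation model emphasizes a different objective indicator."""
    focuses = [dominant_indicator(w) for w in models]
    return len(set(focuses)) == len(focuses)


# Hypothetical weights for the four evaluation models in the example.
MODEL_WEIGHTS = [
    {"accuracy": 0.4, "precision": 0.2, "recall": 0.2, "f1_score": 0.2},
    {"accuracy": 0.2, "precision": 0.4, "recall": 0.2, "f1_score": 0.2},
    {"accuracy": 0.2, "precision": 0.2, "recall": 0.4, "f1_score": 0.2},
    {"accuracy": 0.2, "precision": 0.2, "recall": 0.2, "f1_score": 0.4},
]
```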

Optionally, when the evaluation models are trained, historical conversation data from different domains and the labels corresponding to that data may be obtained, and each evaluation model is trained under supervision. After the predicted category output by each evaluation model is obtained, the parameters of each evaluation model are updated through its own loss function described above, so that the evaluation models differ in evaluation performance. The label can be understood as the correctness of the historical conversation data and is either a positive category or a negative category: the positive category indicates that the reply text of the large language model is relevant to the corresponding question text and correct, and the negative category indicates that the reply text is irrelevant to the corresponding question text and wrong.

Optionally, the amounts of sub-training data used to train the individual evaluation models may be the same or different, which is not limited in the embodiments of the present invention.

进一步的，所述将所述待测交互信息输入至少两个评测模型，输出各所述评测模型对所述待测大语言模型进行评分的第一客观评分，包括：Furthermore, the step of inputting the interaction information to be tested into at least two evaluation models and outputting a first objective score of the large language model to be tested by each evaluation model includes:

Specifically, after the evaluation models have been trained, the interaction information to be tested is input into each evaluation model to obtain the prediction information corresponding to that model. The prediction information can be understood as the predicted category indicating whether the interaction information to be tested is correct. Once the prediction information is obtained, the indicator score for each objective indicator of the interaction information to be tested can be calculated, and the first objective score corresponding to each evaluation model is calculated from the indicator scores and their respective scoring weights. For example, the first objective score determined from the prediction information output by evaluation model 1 may be: Score_1-1 = ω_a1 * Accuracy₁ + ω_b1 * Precision₁ + ω_c1 * Recall₁ + ω_d1 * F1_score₁, where Score_1-1 denotes the first objective score corresponding to evaluation model 1, and Accuracy₁, Precision₁, Recall₁ and F1_score₁ denote the accuracy, precision, recall and F1 score corresponding to evaluation model 1, respectively.
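The calculation above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: it assumes binary predictions (1 for the positive category, 0 for the negative category) and assumes that reference labels are available for the interaction information to be tested. The function names and the example weights are hypothetical.

```python
import math
from typing import Dict, Sequence


def indicator_scores(labels: Sequence[int], preds: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall and F1 for binary predictions."""
    if not labels or len(labels) != len(preds):
        raise ValueError("labels and preds must be non-empty and of equal length")
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    accuracy = (tp + tn) / len(labels)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def first_objective_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of the indicator scores; the scoring weights must sum to 1."""
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("scoring weights must sum to 1")
    return sum(weights[name] * scores[name] for name in weights)


if __name__ == "__main__":
    # Hypothetical reference labels and predictions from evaluation model 1.
    labels = [1, 1, 1, 0, 0, 1, 0, 1]
    preds = [1, 0, 1, 0, 1, 1, 0, 1]
    # Hypothetical weights in which accuracy carries the largest weight.
    weights_model_1 = {"accuracy": 0.4, "precision": 0.2, "recall": 0.2, "f1": 0.2}
    print(first_objective_score(indicator_scores(labels, preds), weights_model_1))
```

The zero-denominator guards return 0.0 by convention when a model predicts no positives or the data contains none; a different convention could be chosen.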

Step 130: Determine a global score corresponding to the large language model to be tested based on the first manual score and each of the first objective scores.

Specifically, after the first manual score and the first objective score corresponding to each evaluation model are determined, the global score corresponding to the large language model to be tested may be calculated as a weighted sum of the first manual score and the first objective scores.

Further, determining the global score corresponding to the large language model to be tested based on the first manual score and each of the first objective scores includes:

Specifically, after the first objective scores are determined, the score weight corresponding to each first objective score can be determined from the ratio of the amount of sub-training data in the training data corresponding to each evaluation model to the total amount of all training data. For example, suppose there are 10,000 training samples in total. If the training data corresponding to evaluation model 1 contains 3,000 samples, the score weight of evaluation model 1 is 0.3; if the training data corresponding to evaluation model 2 contains 2,000 samples, the score weight of evaluation model 2 is 0.2; if the training data corresponding to evaluation model 3 contains 2,000 samples, the score weight of evaluation model 3 is 0.2; and if the training data corresponding to evaluation model 4 contains 3,000 samples, the score weight of evaluation model 4 is 0.3. After the score weight of each evaluation model is determined, the second objective score of the large language model to be tested is calculated as the sum of the first objective scores, each weighted by the score weight of its evaluation model. The second objective score and the first manual score are then combined in a weighted sum to obtain the global score corresponding to the large language model to be tested.
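The worked example above can be reproduced with a short sketch. The sub-training data amounts are those given in the example; the first objective scores and all function and key names are hypothetical values chosen only for illustration.

```python
from typing import Dict


def score_weights(sub_training_counts: Dict[str, int]) -> Dict[str, float]:
    """Score weight of each evaluation model = its sub-training data amount / total amount."""
    total = sum(sub_training_counts.values())
    if total <= 0:
        raise ValueError("total training data amount must be positive")
    return {model: count / total for model, count in sub_training_counts.items()}


def second_objective_score(first_scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Sum of the first objective scores, each weighted by its model's score weight."""
    return sum(weights[model] * first_scores[model] for model in weights)


if __name__ == "__main__":
    # Sub-training data amounts from the example in the description.
    counts = {"model_1": 3000, "model_2": 2000, "model_3": 2000, "model_4": 3000}
    weights = score_weights(counts)
    # Hypothetical first objective scores output for the four evaluation models.
    first_scores = {"model_1": 0.78, "model_2": 0.80, "model_3": 0.70, "model_4": 0.90}
    print(weights)
    print(second_objective_score(first_scores, weights))
```

Because the weights are ratios of a common total, they sum to 1 by construction, so the second objective score stays on the same scale as the first objective scores.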

Further, determining the global score corresponding to the large language model to be tested based on the weighted sum of the first manual score and the second objective score includes:

Specifically, after the first manual score and the second objective score are determined, keyword extraction and semantic analysis can be performed on the interaction information to be tested to determine the target domain information corresponding to it. After the preset mapping relationship is obtained, the target domain information is matched against each item of domain information in the preset mapping relationship. The manual scoring weight corresponding to the matched domain information is taken as the target manual scoring weight for the first manual score, and the objective scoring weight corresponding to the matched domain information is taken as the target objective scoring weight for the second objective score. A first product of the first manual score and the target manual scoring weight and a second product of the second objective score and the target objective scoring weight are then calculated, and their sum is the global score corresponding to the large language model to be tested. Combining the manual score with the machine score in this way improves the objectivity of the evaluation of the large language model to be tested and thus the reliability of its evaluation result.
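The domain lookup and the final weighted sum can be sketched as below. This is only an illustration under stated assumptions: the keyword table is a naive stand-in for the keyword extraction and semantic analysis described above, the weights in the preset mapping are hypothetical, a "default" entry is added for text that matches no domain, and both scores are assumed to be normalized to the same range.

```python
from typing import Dict, Tuple

# Hypothetical keyword lists; a real system would use proper semantic analysis.
DOMAIN_KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "medical": ("diagnosis", "symptom", "dosage"),
    "art": ("painting", "poem", "novel"),
}

# Hypothetical preset mapping: domain -> (manual scoring weight, objective scoring weight).
PRESET_MAPPING: Dict[str, Tuple[float, float]] = {
    "medical": (0.3, 0.7),  # accuracy-oriented domain: objective score weighs more
    "art": (0.7, 0.3),      # judgment-oriented domain: manual score weighs more
    "default": (0.5, 0.5),  # assumed fallback, not part of the description
}


def detect_domain(interaction_text: str) -> str:
    """Pick the domain whose keywords occur most often in the text."""
    lowered = interaction_text.lower()
    hits = {
        domain: sum(keyword in lowered for keyword in keywords)
        for domain, keywords in DOMAIN_KEYWORDS.items()
    }
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else "default"


def global_score(first_manual: float, second_objective: float, domain: str) -> float:
    """First product (manual) plus second product (objective) for the matched domain."""
    manual_weight, objective_weight = PRESET_MAPPING.get(domain, PRESET_MAPPING["default"])
    return first_manual * manual_weight + second_objective * objective_weight


if __name__ == "__main__":
    domain = detect_domain("What dosage is safe for this symptom?")
    print(domain, global_score(0.9, 0.804, domain))
```

Keeping the two weights of each domain summing to 1 keeps the global score on the same scale as its inputs, although the description does not state this as a requirement.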

It should be noted that, because working requirements differ between domains, the degree to which the manual score should influence the global score also differs when the large language model to be tested is scored manually. For example, domains such as art and literature rely more on subjective judgment and accumulated experience, whereas domains such as medicine, science and technology, and academia are more specialized and complex and place more emphasis on accuracy. Therefore, if the interaction information to be tested belongs to a domain such as art or literature, the first manual score should influence the global score more than the second objective score does, and the manual scoring weight corresponding to the first manual score is greater than the objective scoring weight corresponding to the second objective score. If the interaction information to be tested belongs to a domain such as medicine, science and technology, or academia, the first manual score should influence the global score less than the second objective score does, and the manual scoring weight corresponding to the first manual score is less than the objective scoring weight corresponding to the second objective score. The preset mapping relationship is constructed on this basis and includes the mapping among domain information, manual scoring weights and objective scoring weights.

In the large language model evaluation method provided by the embodiments of the present invention, after the interaction information to be tested and the first manual score given by the user to the large language model to be tested are obtained, each evaluation model makes a prediction on the interaction information to be tested and outputs a first objective score for the large language model to be tested, and the first manual score and the first objective scores are combined to obtain the global score of the large language model to be tested. Combining subjective and objective indicators improves the objectivity and comprehensiveness of the evaluation of the large language model to be tested, and thus the reliability of its evaluation result. In addition, in the embodiments of the present invention, the target domain information of the interaction information to be tested is determined and used to adjust the degree to which the manual score influences the global score, which further improves the reliability of the evaluation result.

The large language model evaluation device provided by the present invention is described below. The evaluation device described below and the evaluation method described above may be read with reference to each other.

An embodiment of the present invention further provides a large language model evaluation device. FIG. 2 is a schematic structural diagram of the large language model evaluation device provided by the embodiment of the present invention. As shown in FIG. 2, the large language model evaluation device 200 includes an acquisition module 210, an output module 220 and a determination module 230, wherein:

the acquisition module 210 is configured to obtain interaction information to be tested corresponding to a large language model to be tested, and a first manual score given by a user to the large language model to be tested based on a manual evaluation template; the manual evaluation template is a scoring template determined based on at least two subjective indicators;

the output module 220 is configured to input the interaction information to be tested into at least two evaluation models and output a first objective score with which each evaluation model scores the large language model to be tested; the loss function of each evaluation model is determined based on a weighted average of at least two objective indicators;

the determination module 230 is configured to determine a global score corresponding to the large language model to be tested based on the first manual score and each of the first objective scores.
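The division of the device into three modules can be sketched as a small class. This is a structural illustration only: the class name, field names and the callables passed in are hypothetical, and each module is reduced to a plain function so that the data flow between modules 210, 220 and 230 is visible.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class EvaluationDevice:
    # Acquisition module 210: returns (interaction information, first manual score).
    acquire: Callable[[], Tuple[str, float]]
    # Output module 220: one scoring callable per evaluation model.
    evaluators: Sequence[Callable[[str], float]]
    # Determination module 230: combines the manual score and the objective scores.
    determine: Callable[[float, List[float]], float]

    def run(self) -> float:
        """Run acquisition, scoring and determination in order; return the global score."""
        if len(self.evaluators) < 2:
            raise ValueError("at least two evaluation models are required")
        interaction, first_manual = self.acquire()
        first_objective = [evaluate(interaction) for evaluate in self.evaluators]
        return self.determine(first_manual, first_objective)


if __name__ == "__main__":
    # Hypothetical stand-ins for the three modules.
    device = EvaluationDevice(
        acquire=lambda: ("Q: example question A: example reply", 0.8),
        evaluators=[lambda text: 0.6, lambda text: 1.0],
        determine=lambda manual, scores: 0.5 * manual + 0.5 * sum(scores) / len(scores),
    )
    print(device.run())
```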

In the large language model evaluation device provided by the embodiments of the present invention, after the interaction information to be tested and the first manual score given by the user to the large language model to be tested are obtained, each evaluation model makes a prediction on the interaction information to be tested and outputs a first objective score for the large language model to be tested, and the first manual score and the first objective scores are combined to obtain the global score of the large language model to be tested. Combining subjective and objective indicators improves the objectivity and comprehensiveness of the evaluation of the large language model to be tested, and thus the reliability of its evaluation result. In addition, in the embodiments of the present invention, the target domain information of the interaction information to be tested is determined and used to adjust the degree to which the manual score influences the global score, which further improves the reliability of the evaluation result.

Optionally, the acquisition module 210 is specifically configured to:

Optionally, the output module 220 is specifically configured to:

Optionally, the determination module 230 is specifically configured to:

Optionally, the objective indicator corresponding to the largest scoring weight differs between the loss functions of the evaluation models.

Optionally, the large language model evaluation device 200 further includes a construction module, which is specifically configured to:

determine, based on the number of scoring values, the degree of the subjective indicator corresponding to each scoring value;

FIG. 3 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention. As shown in FIG. 3, the electronic device may include a processor 310, a communications interface 320, a memory 330 and a communication bus 340, wherein the processor 310, the communications interface 320 and the memory 330 communicate with one another through the communication bus 340. The processor 310 may call the logic instructions in the memory 330 to execute the large language model evaluation method, the method including:

In addition, the logic instructions in the memory 330 may be implemented in the form of software functional units and, when sold or used as an independent product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.

In another aspect, the present invention further provides a computer program product. The computer program product includes a computer program that can be stored on a non-transitory computer-readable storage medium; when the computer program is executed by a processor, the computer can execute the large language model evaluation method provided by the methods above, the method including:

In yet another aspect, the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the large language model evaluation method provided by the methods above, the method including:

The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.

From the description of the above embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by software together with a necessary general-purpose hardware platform, or, of course, by hardware. Based on this understanding, the above technical solution in essence, or the part of it that contributes to the prior art, may be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk or an optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to execute the methods described in the embodiments or in certain parts of the embodiments.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. An evaluation method of a large language model, comprising:

acquiring interaction information to be tested corresponding to a large language model to be tested, and a first manual score given by a user to the large language model to be tested based on a manual evaluation template; the manual evaluation template is a scoring template determined based on at least two subjective indicators;

inputting the interaction information to be tested into at least two evaluation models, and outputting a first objective score with which each evaluation model scores the large language model to be tested; the loss function of each evaluation model is determined based on a weighted average of at least two objective indicators;

and determining a global score corresponding to the large language model to be tested based on the first manual score and each of the first objective scores.

2. The method for evaluating a large language model according to claim 1, wherein inputting the interaction information to be tested into at least two evaluation models, outputting a first objective score for each evaluation model to score the large language model to be tested, comprises:

for each evaluation model, inputting the interaction information to be tested into the evaluation model, and determining prediction information corresponding to the interaction information to be tested; the prediction information is used to represent the correctness of the interaction information to be tested;

determining an indicator score corresponding to each objective indicator based on the prediction information;

and determining the first objective score with which the evaluation model scores the large language model to be tested based on each indicator score and the scoring weight corresponding to each indicator score.

3. The method for evaluating a large language model according to claim 1, wherein determining a global score corresponding to the large language model to be tested based on the first manual score and each of the first objective scores comprises:

obtaining a score weight corresponding to each evaluation model, wherein the score weight is determined based on the ratio of the amount of sub-training data corresponding to each evaluation model to the total amount of training data corresponding to all the evaluation models;

determining a second objective score corresponding to the large language model to be tested based on each score weight and the first objective score of the evaluation model corresponding to that score weight;

and determining the global score corresponding to the large language model to be tested based on a weighted sum of the first manual score and the second objective score.

4. The method for evaluating a large language model according to claim 3, wherein determining the global score corresponding to the large language model to be tested based on a weighted sum of the first manual score and the second objective score comprises:

performing semantic analysis on the interaction information to be tested, and determining target domain information corresponding to the interaction information to be tested;

determining, from a preset mapping relationship and based on the target domain information, a target manual scoring weight corresponding to the first manual score and a target objective scoring weight corresponding to the second objective score; the preset mapping relationship comprises a mapping among domain information, manual scoring weights and objective scoring weights;

and determining the global score corresponding to the large language model to be tested based on the first manual score, the target manual scoring weight, the second objective score and the target objective scoring weight.

5. The method for evaluating a large language model according to any one of claims 1 to 4, wherein objective indicators corresponding to maximum scoring weights in the loss function of each of the evaluation models are different.

6. The method for evaluating a large language model according to any one of claims 1 to 4, wherein obtaining the first manual score given by the user to the large language model to be tested based on the manual evaluation template comprises:

obtaining second subjective scores given by the user for the respective subjective indicators;

determining a sum of all second subjective scores;

and determining the first manual score given by the user to the large language model to be tested based on the manual evaluation template, based on the ratio of the sum of all second subjective scores to the number of subjective indicators.

7. The method for evaluating a large language model according to any one of claims 1 to 4, further comprising:

for each subjective indicator, determining the number of scoring values corresponding to the subjective indicator;

determining the degree of the subjective indicator corresponding to each scoring value based on the number of scoring values;

constructing a sub-evaluation template corresponding to the subjective indicator based on all the scoring values and the degree of the subjective indicator corresponding to each scoring value;

constructing the manual evaluation template based on the sub-evaluation templates corresponding to all subjective indicators;

and sending the manual evaluation template and the interaction information to be tested to a terminal corresponding to the user.

8. An evaluation device for a large language model, comprising:

an acquisition module, configured to acquire interaction information to be tested corresponding to a large language model to be tested, and a first manual score given by a user to the large language model to be tested based on a manual evaluation template; the manual evaluation template is a scoring template determined based on at least two subjective indicators;

an output module, configured to input the interaction information to be tested into at least two evaluation models and output a first objective score with which each evaluation model scores the large language model to be tested; the loss function of each evaluation model is determined based on a weighted average of at least two objective indicators;

and a determination module, configured to determine a global score corresponding to the large language model to be tested based on the first manual score and each of the first objective scores.

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method for evaluating a large language model according to any one of claims 1-7 when executing the program.

10. A non-transitory computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements a method of evaluating a large language model according to any one of claims 1-7.