CN115391522A

CN115391522A - Text topic modeling method and system based on social platform metadata

Info

Publication number: CN115391522A
Application number: CN202210921496.0A
Authority: CN
Inventors: 高金华; 赵鑫; 沈华伟; 王永庆; 庞亮; 孟剑; 程学旗
Original assignee: Institute of Computing Technology of CAS
Current assignee: Institute of Computing Technology of CAS
Priority date: 2022-08-02
Filing date: 2022-08-02
Publication date: 2022-11-25

Abstract

The present invention proposes a text topic modeling method and system based on social platform metadata. The method comprises: constructing a bag-of-words representation of text data from the keywords of the text data; training an attribute value prediction task for each metadata category of the text data so as to fine-tune a pre-trained semantic extraction model into a target semantic extraction model, and using the target semantic extraction model to extract a text semantic representation of the text data; and constructing a semantic constraint objective from the text semantic representation and, guided by that objective and with the bag-of-words representation as both input and reconstruction target, training a neural topic model based on a variational autoencoder to obtain a topic extraction model, from which the topic-keyword distribution and topic embedding representations are derived. The method and system can perform topic modeling on the short text messages that are widespread in mobile applications, extract topic keywords, and learn topic embedding representations.

Description

A text topic modeling method and system based on social platform metadata

Technical field

The invention applies to the field of big data analysis for mobile applications. It relates to a topic modeling method and system oriented to mobile application metadata, and in particular to a topic classification method and system oriented to the metadata of social application platforms.

Background

Topic modeling performs probabilistic modeling on a corpus to discover a set of latent topics. The resulting topics can be used in user profiling, public opinion analysis and tracking, human-computer dialogue, and similar fields. Each topic describes an interpretable semantic concept and corresponds to a probability distribution over the vocabulary. Given a document, a topic model can also infer that document's topic distribution. As a powerful unsupervised text analysis technique, topic modeling can extract the topics discussed in massive text collections and cluster or classify the texts by their topic distributions.

Latent Dirichlet Allocation (LDA) is a Bayesian probabilistic topic model proposed in 2003 that infers the topic distribution of documents by modeling the document generation process. As shown in Figure 5, the LDA topic model has M document-topic Dirichlet prior distributions, corresponding to M document-topic multinomial posterior distributions, so that α→θ_d→z_d forms a Dirichlet-multinomial conjugate structure. Gibbs sampling is then used to obtain the Dirichlet-based document-topic posterior distribution.

Although the LDA model has been very successful at topic modeling, its inference is too inefficient for large-scale analysis scenarios. To address this efficiency problem, neural topic models based on variational autoencoders use a decoder and an encoder built from deep neural networks to fit the document generation and topic inference processes. The decoder and encoder are trained by maximizing the evidence lower bound (ELBO) of the marginal likelihood of the input data, and are then used for subsequent topic inference tasks:

where p denotes the posterior probability distribution, q denotes the prior probability distribution, x denotes the word distribution of the document, and z denotes the topic distribution of the document.
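The formula referred to above is not reproduced in this text. As a point of reference only, the evidence lower bound commonly used for variational-autoencoder topic models has the form below. The notation is the conventional one and is an assumption: q(z|x) is the encoder's approximate posterior over topics, p(x|z) is the decoder's likelihood of the words, and p(z) is the prior over z, so the roles of p and q here may not match the labels attached to the patent's own formula.

```latex
\log p(x) \;\ge\; \mathrm{ELBO}(x)
  = \mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big]
  - \mathrm{KL}\big(q(z \mid x)\,\|\,p(z)\big)
```

The first term rewards accurate reconstruction of the bag-of-words input, and the second keeps the inferred topic distribution close to the prior.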

Short text messages on social platforms have sparse word distributions and take many forms, and existing topic modeling methods still have several shortcomings when applied to them.

First, in terms of computational efficiency, traditional topic modeling methods represented by LDA must infer the topics of a new document by sampling. This is inefficient and does not suit large-scale data analysis scenarios, especially streaming social media data. The emergence of neural topic modeling methods has alleviated this shortcoming to some extent.

Second, existing neural topic models mainly take the bag-of-words representation of a document as input. In short-text scenarios, however, the word distribution of a single document is very sparse, and it is difficult to infer the document's topics accurately from discrete word distribution information alone. To address this, some existing models add constraints to the topic distribution of short texts, for example constraining all words in a short text to belong to the same topic. Such methods alleviate the problems caused by corpus sparsity to some extent, but they also greatly reduce the model's modeling capacity.

In addition, existing neural topic models still rely mainly on text content as the basis for topic modeling. In mobile applications, and social applications in particular, short texts often carry rich attribute information such as hashtags and URLs, which can provide important clues for topic modeling. Some existing models do introduce covariates and label information closely associated with the corpus documents, but they require the auxiliary information of every document to be complete, which greatly limits where the topic model can be used. In practice, much of the attribute information in short texts is missing, and applying existing models directly introduces a large amount of noise and causes topic modeling to fail.

In summary, existing topic modeling methods have two main problems: (1) the bag-of-words representation cannot adequately express the semantics and topics of sparse short texts; (2) existing methods cannot make effective use of the rich attribute information carried by the short texts themselves.

Summary of the invention

To solve the problems that existing topic models, which rely on the bag-of-words representation, cannot adequately express the semantics of short texts and cannot make effective use of the multi-attribute information of short texts, the present invention proposes a topic modeling method and system oriented to mobile application metadata. The method and system can perform topic modeling on the short text messages that are widespread in mobile applications, extract topic keywords, and learn topic embedding representations. They can also effectively fuse the sparse multi-attribute information of short texts to further improve topic modeling, and they support analysis of how each topic's attribute values are distributed over each attribute.

In view of the deficiencies of the prior art, the present invention proposes a text topic modeling method based on social platform metadata, comprising:

Step 1: obtaining, from a social platform, the text data to be topic-modeled and the metadata of the text data;

Step 2: constructing a bag-of-words representation of the text data based on the keywords of the text data;

Step 3: based on the categories of the metadata, training an attribute value prediction task for each category so as to fine-tune a pre-trained semantic extraction model and obtain a target semantic extraction model, and using the target semantic extraction model to extract a text semantic representation of the text data;

Step 4: constructing a semantic constraint objective based on the text semantic representation and, guided by the semantic constraint objective and with the bag-of-words representation as both input and reconstruction target, training a neural topic model based on a variational autoencoder to obtain a topic extraction model, and deriving the topic-keyword distribution and the topic embedding representations from the model;

Step 5: inputting the topic embedding representations into the attribute value prediction tasks to obtain each topic's attribute value distribution over the corresponding attribute, merging identical topics according to the attribute value distribution, the topic-keyword distribution, and the topic embedding representations, and taking the merged result as the topic model of the text data.

In the text topic modeling method based on social platform metadata, step 3 comprises:

classifying the attributes of the metadata into discrete attributes, continuous attributes, and text attributes;

for each discrete attribute, counting the attribute values that occur in the corpus and, following the same procedure used to build a vocabulary, taking the attribute values whose occurrence count exceeds a preset threshold to form an attribute value set, constructing on that set a classification task that predicts the attribute value, and using cross-entropy as the loss function of the classification task;

for each continuous attribute, transforming its attribute values to a distribution with mean 0 and variance 1, constructing a regression task that predicts the transformed attribute value, and using mean squared error (MSE) as the loss function of the regression task;

for each text attribute, concatenating the text data with the attribute to obtain a concatenated text, and inputting the concatenated text into the pre-trained semantic extraction model to produce a text semantic vector;

constructing an adversarial classification task that determines the attribute category of the text semantic vector, using cross-entropy as the loss function.
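The handling of discrete and continuous attributes described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the patent's implementation; the function names `build_value_set` and `standardize`, the use of `None` to mark a missing attribute, and the sample data are all assumptions made for the example.

```python
from collections import Counter
from statistics import fmean, pstdev


def build_value_set(values, min_count):
    """Map attribute values seen more than `min_count` times to class IDs."""
    counts = Counter(v for v in values if v is not None)
    kept = sorted(v for v, c in counts.items() if c > min_count)
    return {v: i for i, v in enumerate(kept)}


def standardize(values):
    """Rescale observed values to mean 0 and variance 1; missing stays None."""
    observed = [v for v in values if v is not None]
    mean = fmean(observed)
    std = pstdev(observed) or 1.0  # guard against a constant attribute
    return [None if v is None else (v - mean) / std for v in values]


# Hypothetical sample: hashtags of seven posts, one of them missing.
hashtags = ["#a", "#b", "#a", None, "#a", "#b", "#c"]
value_ids = build_value_set(hashtags, min_count=1)  # "#c" occurs once, dropped

# Hypothetical continuous attribute with one missing value.
scores = standardize([1.0, 2.0, None, 3.0])
```

The class IDs in `value_ids` would serve as labels for the classification task, and the standardized values as targets for the regression task.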

In the text topic modeling method based on social platform metadata, step 5 comprises: constructing, from the attribute value distribution, an attribute value list for each attribute of a topic; and constructing, from the topic-keyword distribution, a keyword list;

when merging topics, using the Jaccard coefficient to measure the similarity between the topics' keyword lists and between their attribute value lists, yielding a first similarity and a second similarity respectively, and using cosine similarity to measure the similarity between the topics' embedding representations, yielding a third similarity; taking the weighted average of the first, second, and third similarities as the final similarity between topics; and merging the topics whose final similarity is greater than a preset value.

In the text topic modeling method based on social platform metadata, the metadata includes: posting time, posting user ID, posting user's profile, @User, #Tag, and URL.

The present invention also proposes a text topic modeling system based on social platform metadata, comprising:

an initial module, configured to obtain, from a social platform, the text data to be topic-modeled and the metadata of the text data, and to construct a bag-of-words representation of the text data based on the keywords of the text data;

a fine-tuning module, configured to train, according to the categories of the metadata, an attribute value prediction task for each category so as to fine-tune a pre-trained semantic extraction model and obtain a target semantic extraction model, and to use the target semantic extraction model to extract a text semantic representation of the text data;

an extraction module, configured to construct a semantic constraint objective from the text semantic representation and, guided by the semantic constraint objective and with the bag-of-words representation as both input and reconstruction target, to train a neural topic model based on a variational autoencoder to obtain a topic extraction model, and to derive the topic-keyword distribution and the topic embedding representations from the model;

a merging module, configured to input the topic embedding representations into the attribute value prediction tasks to obtain each topic's attribute value distribution over the corresponding attribute, to merge identical topics according to the attribute value distribution, the topic-keyword distribution, and the topic embedding representations, and to take the merged result as the topic model of the text data.

In the text topic modeling system based on social platform metadata, the fine-tuning module is specifically configured to:

In the text topic modeling system based on social platform metadata, the merging module is configured to: construct, from the attribute value distribution, an attribute value list for each attribute of a topic; and construct, from the topic-keyword distribution, a keyword list;

In the text topic modeling system based on social platform metadata, the metadata includes: posting time, posting user ID, posting user's profile, @User, #Tag, and URL.

The present invention also proposes a storage medium for storing a program that executes any of the above text topic modeling methods based on social platform metadata.

The present invention also proposes a client for use with any of the above text topic modeling systems based on social platform metadata.

As can be seen from the above scheme, the advantages of the present invention are:

1. To address the problem that the bag-of-words representation used by neural topic models cannot adequately express the semantics of short texts, the present invention proposes a constraint method based on the text semantic representation of a pre-trained model. It introduces external knowledge to guide topic model learning, which can effectively improve the modeling quality of the neural topic model, and it simultaneously learns a semantic representation of each topic, projecting the topics into the semantic representation space of the pre-trained model.

2. To address the problem that neural topic models cannot make effective use of the multi-attribute information of short texts, the present invention proposes a set of preprocessing schemes for different types of attribute data. Through multi-task learning and an auxiliary adversarial task, the multi-attribute information is fused into the text semantic representation, so that the text's own metadata guides topic model learning. This can effectively improve the modeling quality of the neural topic model and allows the attribute value distribution of each topic over each attribute to be modeled and analyzed.

3. The present invention builds a text topic analysis system oriented to mobile application metadata. It implements a complete pipeline: preprocessing of short text content in mobile applications (word segmentation, part-of-speech tagging, entity recognition, vocabulary construction, conversion to the bag-of-words representation); preprocessing of metadata (conversion of discrete, continuous, and text attributes); fine-tuning of the pre-trained model's semantic representation; extraction of topic keywords (subjects such as persons, institutions, and organizational entities; actions such as verbs and gerunds; time; place; and others); export of topic embedding representations; and analysis of each topic's attribute value distribution over each attribute. The system can comprehensively and efficiently mine and analyze the keywords, embedding representations, and attribute value distributions of the topics of short texts in mobile applications.

Description of the drawings

Figure 1 is a schematic diagram of the topic modeling method based on semantic constraints;

Figure 2 is a schematic diagram of the topic modeling method that fuses metadata information;

Figure 3 is the basic flowchart of the topic modeling method oriented to mobile application metadata;

Figure 4 is a schematic diagram of the business logic of the topic modeling system oriented to mobile application metadata;

Figure 5 is a system block diagram of the prior-art LDA model.

Detailed description

To achieve the above technical effects, the present invention includes the following key technical points:

Key point 1: To address the problem that the bag-of-words representation used by existing neural topic models cannot adequately express the semantics of short texts, a neural topic modeling method based on the text semantic constraints of a pre-trained model is proposed. It significantly improves the topic modeling quality of the neural topic model and learns embedding representations of the topics.

Key point 2: To address the problem that existing neural topic models cannot make effective use of the multi-attribute information of short texts, a topic modeling scheme conditioned on multi-attribute information and based on multi-task learning is proposed. It overcomes the adverse effect of missing attribute values on topic modeling, significantly improves the topic modeling quality of neural topic models on attribute-rich short texts, and supports analysis of each topic's attribute value distribution over each attribute.

Key point 3: A text topic analysis system oriented to mobile application metadata is built. It implements a complete pipeline: preprocessing of short text content in mobile applications (word segmentation, part-of-speech tagging, entity recognition, vocabulary construction, conversion to the bag-of-words representation); preprocessing of metadata (conversion of discrete, continuous, and text attributes); fine-tuning of the pre-trained model's semantic representation; extraction of topic keywords (subjects such as persons, institutions, and organizational entities; actions such as verbs and gerunds; time; place; and so on); export of topic embedding representations; and analysis of each topic's attribute value distribution over each attribute. The system can comprehensively and efficiently mine and analyze the keywords, embedding representations, and attribute value distributions of the topics of online short texts. Topic keywords can be used to build user-profile tag data, to generate titles for public opinion events, and so on, while topic embedding representations can be used to compute the similarity of user interest profiles, thereby better serving business applications such as recommendation, search, and intelligent dialogue.

To make the above features and effects of the present invention clearer and easier to understand, embodiments are described in detail below with reference to the accompanying drawings.

To address the problem that the bag-of-words representation cannot adequately express the semantics of short texts, the present invention treats the fixed-dimension text semantic vector produced by a pre-trained model as the semantic representation of a document and integrates it into the encoding and decoding process of the neural topic model. In scenarios where the word distributions of documents in a short-text corpus are sparse, the neural topic model is thus constrained by a dense semantic representation that carries word order and context information. This overcomes the limitations of the bag-of-words representation and allows the topics of short texts to be inferred and modeled more accurately. The principle is shown in Figure 1.

To address the problem that neural topic models cannot make effective use of the multi-attribute information of short texts, the inventors observed that, in real online short-text scenarios, short texts discussing the same topic exhibit locality in the distribution of their attribute values, such as publisher profile, posting time, hashtag, URL address, and user mention. That is, short texts with similar attribute information sent by similar publishers within a short period of time tend to discuss the same topic. The present invention therefore proposes to construct multiple tasks that predict the attribute values of short texts from their sparse multi-attribute information, and to fine-tune the pre-trained model through multi-task learning so that it produces text semantic vectors more relevant to topic information as the semantic representation. The multi-attribute information of short texts is then fused indirectly through the semantic constraint, which avoids the noise introduced by a large number of missing attribute values. The principle is shown in Figure 2.
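One way to picture how attribute prediction tasks can be combined while tolerating missing values is the per-example loss below. It is a sketch under stated assumptions rather than the patent's training procedure: skipping a task whose label is `None` and weighting the remaining tasks equally are illustrative choices, the attribute names are hypothetical, and the auxiliary adversarial task is left out.

```python
import math


def cross_entropy(logits, target_id):
    """Softmax cross-entropy for one example, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_id]


def multitask_loss(predictions, labels):
    """Average the per-attribute losses over the attributes that are present.

    `predictions` maps attribute name -> list of logits (discrete attribute)
    or float (continuous attribute). `labels` maps attribute name -> class ID,
    float, or None when the attribute is missing for this short text.
    """
    losses = []
    for name, label in labels.items():
        if label is None:  # missing attribute: contributes no loss
            continue
        pred = predictions[name]
        if isinstance(pred, list):  # discrete attribute: classification
            losses.append(cross_entropy(pred, label))
        else:  # continuous attribute: squared error
            losses.append((pred - label) ** 2)
    return sum(losses) / len(losses) if losses else 0.0


# Hypothetical example: the user ID is missing for this post.
preds = {"hashtag": [2.0, 0.0], "post_time": 0.5, "user_id": [0.0, 0.0, 0.0]}
labels = {"hashtag": 0, "post_time": 0.0, "user_id": None}
loss = multitask_loss(preds, labels)
```

Because a missing attribute simply drops out of the sum, a short text with few attributes still contributes a training signal from the attributes it does have.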

A topic modeling method oriented to mobile application metadata comprises the following processing flow:

1. Clean the text content, identify the entity words and predicate words in the text, build a vocabulary, and convert the text content into a bag-of-words representation.

2. Clean and transform the metadata of the text separately for each attribute type and construct the corresponding attribute value prediction tasks. Through multi-task learning and an auxiliary adversarial task, fine-tune the semantic representation of the pre-trained model so that it incorporates the multi-attribute information. Use the fine-tuned pre-trained model to convert the concatenation of the text content and the text-type attribute values into a text semantic representation.

3. Construct a semantic constraint objective based on the text semantic representation to guide the learning of the neural topic model, and after training export the learned topic-word distribution and topic embedding representations. The topic-word distribution differs from the document-word bag-of-words representation above. The input to the topic modeling task is a document collection containing multiple documents, and the task is to discover which topics the collection contains. The topic-word distribution describes a topic with a few keywords. The topic embedding representation is a representation commonly used in deep learning that describes a topic with a real-valued vector.

4. Input the topic embedding representations into the corresponding attribute value prediction networks to obtain each topic's attribute value distribution over each attribute. A topic is a semantic concept: it can be described by a set of keywords, by the users who take part in it, or by the hashtags users apply when discussing it. Analyzing the attribute distribution of a topic describes the topic along more dimensions, helping users better understand its related text keywords, main participating users, prevailing hashtags, the time span it covers, and so on.
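The bag-of-words conversion in item 1 above can be illustrated with a short sketch. It assumes the documents have already been cleaned and reduced to lists of entity and predicate words; the helper names `build_vocab` and `to_bow` and the sample documents are hypothetical.

```python
from collections import Counter


def build_vocab(tokenized_docs, min_count=1):
    """Keep words that occur at least `min_count` times across the corpus."""
    counts = Counter(w for doc in tokenized_docs for w in doc)
    kept = sorted(w for w, c in counts.items() if c >= min_count)
    return {w: i for i, w in enumerate(kept)}


def to_bow(doc, vocab):
    """Count vector over the vocabulary; out-of-vocabulary words are ignored."""
    vec = [0] * len(vocab)
    for w in doc:
        if w in vocab:
            vec[vocab[w]] += 1
    return vec


# Hypothetical pre-tokenized short texts.
docs = [["team", "wins", "final"], ["team", "loses", "final", "final"]]
vocab = build_vocab(docs, min_count=2)  # only "final" and "team" survive
bows = [to_bow(d, vocab) for d in docs]
```

Each vector in `bows` is what the neural topic model would receive as input and be asked to reconstruct.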

具体流程如图3所示：The specific process is shown in Figure 3:

1.数据获取：通过数据采集/数据查询等手段，收集特定社交平台一段时间内发布的文本数据及其对应的元数据，包括但不限于：发布时间、发布用户ID、发布用户个人简介、@User、#Tag、URL等元数据。1. Data acquisition: through data collection/data query and other means, collect text data and corresponding metadata published by a specific social platform within a period of time, including but not limited to: release time, release user ID, release user profile, @ Metadata such as User, #Tag, URL, etc.

2.文本预处理：清洗文本内容，识别文本中的实体词(人物、机构、组织、时间、地点、其他实体等)和谓语词(事件触发词)，构建词表，将文本内容转化为其对应的词袋表示。2. Text preprocessing: clean the text content, identify entity words (person, institution, organization, time, place, other entities, etc.) and predicate words (event trigger words) in the text, build a vocabulary, and convert the text content into its The corresponding bag-of-words representation.

3.元数据预处理：将元数据划分为三类：离散型属性、连续型属性和文本型属性，分别进行处理。3. Metadata preprocessing: Divide metadata into three categories: discrete attributes, continuous attributes and text attributes, which are processed separately.

1)离散型属性：比如发布用户ID、#Tag、URL等。对于离散型属性，分别基于语料集中出现过的属性值计数，按照构造词表的过程，取出现次数超过设定阈值的属性值构成属性值集合。特别的，对于URL地址属性值，使用URL地址中的Host字段代替原URL地址作为URL地址属性的属性值，代表在线短文本的信息源。1) Discrete attributes: such as posting user ID, #Tag, URL, etc. For discrete attributes, based on the counts of attribute values that have appeared in the corpus, according to the process of constructing the vocabulary, the attribute values whose occurrence times exceed the set threshold are taken to form an attribute value set. In particular, for the attribute value of the URL address, the Host field in the URL address is used instead of the original URL address as the attribute value of the URL address attribute, representing the information source of the online short text.

2)连续型属性：比如发布时间、评价打分等。对于连续型属性，将其属性值标准化，转换为均值为0，方差为1的分布。特别的，对于时间类属性值，先转换为时间戳再进行标准化操作。2) Continuous attributes: such as release time, evaluation score, etc. For continuous attributes, the attribute values are standardized and transformed into distributions with a mean of 0 and a variance of 1. In particular, for time class attribute values, first convert to timestamp and then standardize.

3)文本型属性：比如发布用户个人简介等。对于文本型属性，将文本内容与其拼接，作为获取预训练模型文本语义表示的输入。3) Text-type attributes: such as publishing a user profile, etc. For text-type attributes, the text content is concatenated with it as the input for obtaining the text semantic representation of the pre-trained model.

4.元数据融合：将预处理后的元数据属性值作为标签，构造多个属性值预测任务，通过多任务学习微调预训练模型融合元数据信息。每种离散型属性对应一个预测属性值ID的分类任务，采用交叉熵作为分类任务的损失函数。每种连续型属性对应一个预测转换后属性值的回归任务，采用MSE作为回归任务的损失函数。为了避免微调后的预训练模型产生的文本语义向量与某个属性高度耦合，使预训练模型共享参数更倾向于捕获多属性间的共现规律，学习到与主题信息更相关的语义表示，本发明还添加了一个辅助对抗分类任务：即判定预训练模型产生的文本语义向量对应于哪种属性，采用交叉熵作为损失函数。4. Metadata fusion: use the preprocessed metadata attribute values as labels, construct multiple attribute value prediction tasks, and fine-tune the pre-training model through multi-task learning to fuse metadata information. Each discrete attribute corresponds to a classification task of predicting the attribute value ID, and cross-entropy is used as the loss function of the classification task. Each continuous attribute corresponds to a regression task that predicts the transformed attribute value, and MSE is used as the loss function of the regression task. In order to avoid the high coupling between the text semantic vector generated by the fine-tuned pre-training model and a certain attribute, make the shared parameters of the pre-training model more inclined to capture the co-occurrence rules between multiple attributes, and learn semantic representations that are more relevant to topic information. The invention also adds an auxiliary adversarial classification task: to determine which attribute the text semantic vector generated by the pre-training model corresponds to, and use cross-entropy as the loss function.

5.主题建模：将融合元数据信息的文本语义表示作为语义约束，指导神经主题模型训练，并导出建模好的主题-关键词分布和主题嵌入表示。5. Topic modeling: The text semantic representation fused with metadata information is used as a semantic constraint to guide the training of the neural topic model, and derive the modeled topic-keyword distribution and topic embedding representation.

6. Post-processing: the topic embedding representation is input into the attribute value prediction task networks of step 4 to obtain the distribution of the topic over the values of each corresponding attribute, and the Top K values are taken as the attribute value list of the topic for that attribute. The topic-keyword distribution of the topic is divided by keyword category, namely entity words (persons, institutions, organizations, times, places, other entities, etc.) and predicate words (event trigger words), and the Top K words in each category are taken as the topic's keyword list for that category. Similar topics are merged based on the keyword lists, attribute value lists, and topic embedding representations.
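The multi-task fine-tuning objective of step 4 can be sketched in Python as follows. The description names the per-task losses (cross-entropy for discrete attributes, MSE for continuous attributes, cross-entropy for the adversarial attribute classifier) but does not state how they are combined. The unweighted sum of task losses, the subtraction of the adversarial term, the `adv_weight` parameter, and all function names are assumptions made for illustration.

```python
import math


def cross_entropy(probs, target):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[target])


def mse(preds, targets):
    """Mean squared error between predicted and target values."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)


def encoder_objective(cls_losses, reg_losses, adv_loss, adv_weight=0.1):
    """Assumed combination: the encoder minimizes the attribute prediction
    losses while making the attribute discriminator's job harder, so the
    adversarial loss enters with a negative sign (a common minimax form)."""
    return sum(cls_losses) + sum(reg_losses) - adv_weight * adv_loss


# Hypothetical values: one discrete attribute, one continuous attribute.
cls = cross_entropy([0.5, 0.5], target=0)
reg = mse([1.0, 2.0], [1.0, 4.0])
adv = cross_entropy([0.25, 0.75], target=1)
total = encoder_objective([cls], [reg], adv, adv_weight=0.5)
```

In a real training loop these quantities would be tensors computed by a deep learning framework, and the adversarial term is often implemented with a gradient reversal layer rather than an explicit subtraction.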

Similar topics are merged because multiple documents may describe the same topic. The purpose of topic modeling is to compress a text collection and identify the main events it describes. The goal of topic modeling is therefore to obtain a description of each topic, such as the topic-keyword distribution used here.

A common form of the topic-keyword distribution describes a topic by its K most important keywords. For example, the keyword distribution under a geopolitical conflict topic could be "country XX, country ZZ, leader of country XX, head of state of country ZZ, third-party participating organization, military action, sanctions".
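Taking the Top K keywords of a topic within each keyword category, as in the post-processing step, can be sketched as follows. The dictionary layout, the category labels, and the sample probabilities are hypothetical, and ties are broken alphabetically purely for determinism.

```python
def top_k_by_category(topic_word_probs, word_category, k):
    """Return the k highest-probability words of a topic for each category."""
    buckets = {}
    for word, prob in topic_word_probs.items():
        category = word_category.get(word)
        if category is None:
            continue  # skip words without a known category
        buckets.setdefault(category, []).append((prob, word))
    return {
        category: [w for _, w in sorted(items, key=lambda x: (-x[0], x[1]))[:k]]
        for category, items in buckets.items()
    }


# Hypothetical topic-keyword distribution and category labels.
probs = {"country_x": 0.4, "sanctions": 0.3, "leader_x": 0.2, "strike": 0.1}
categories = {
    "country_x": "entity",
    "leader_x": "entity",
    "sanctions": "predicate",
    "strike": "predicate",
}
keyword_lists = top_k_by_category(probs, categories, k=1)
```

The same selection applies to attribute value distributions: sort the values of one attribute by predicted probability and keep the first K.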

When merging two topics, the Jaccard coefficient is used to measure the similarity between the topics' keyword lists and between their attribute value lists, and cosine similarity is used to measure the similarity between their embedding representations. The three similarity scores are then combined by a weighted average to obtain the final similarity between the two topics. Topics whose similarity exceeds a specified threshold are merged.
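The merge criterion can be sketched as follows. The topic dictionary layout, the equal default weights, and the threshold of 0.6 are assumptions. The description specifies only the use of Jaccard, cosine similarity, a weighted average, and a threshold.

```python
import math


def jaccard(a, b):
    """Jaccard coefficient between two lists, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def topic_similarity(t1, t2, weights=(1.0, 1.0, 1.0)):
    """Weighted average of keyword, attribute value, and embedding similarity."""
    scores = (
        jaccard(t1["keywords"], t2["keywords"]),
        jaccard(t1["attr_values"], t2["attr_values"]),
        cosine(t1["embedding"], t2["embedding"]),
    )
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


def should_merge(t1, t2, threshold=0.6, weights=(1.0, 1.0, 1.0)):
    """Merge when the final similarity exceeds the (assumed) threshold."""
    return topic_similarity(t1, t2, weights) > threshold


# Hypothetical topics for illustration.
topic_a = {"keywords": ["x", "y", "z"], "attr_values": ["#t1", "#t2"], "embedding": [1.0, 0.0]}
topic_b = {"keywords": ["x", "y", "w"], "attr_values": ["#t1", "#t2"], "embedding": [1.0, 0.0]}
similarity = topic_similarity(topic_a, topic_b)
```

Here the keyword lists share two of four distinct words (Jaccard 0.5), while the attribute value lists and embeddings coincide, so the equal-weight average is 2.5 / 3.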

To infer the topic distribution of a given text, the bag-of-words representation and semantic representation of the text can be input directly into the topic model to obtain its topic distribution, and the topic corresponding to the dimension with the largest value in that distribution is taken as the topic of the text.
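The final assignment step is an argmax over the inferred topic distribution. A minimal sketch, assuming the distribution has already been produced by the topic model and is given as a list of probabilities (the function name is hypothetical):

```python
def assign_topic(topic_distribution):
    """Return the index of the topic with the largest probability."""
    return max(range(len(topic_distribution)), key=topic_distribution.__getitem__)


# Hypothetical distribution over three topics.
topic_index = assign_topic([0.1, 0.7, 0.2])
```

If several topics share the maximum value, this sketch returns the lowest index among them.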

The following are system embodiments corresponding to the above method embodiments; this embodiment can be implemented in cooperation with the above embodiments. The relevant technical details mentioned in the above embodiments remain valid in this embodiment and are not repeated here. Correspondingly, the relevant technical details mentioned in this embodiment can also be applied to the above embodiments.

A text topic analysis system for mobile application metadata, whose business logic is shown in Figure 4, includes at least the following modules:

1. Preprocessing module: mainly responsible for reading the text data to be analyzed from a file, database, or data warehouse, and performing the following preprocessing operations: (1) cleaning the text content; (2) identifying the entity words and predicate words in the text and building the vocabulary; (3) extracting the values of the various attributes in the text, then cleaning and converting them according to attribute type.

2. Topic modeling learning module: mainly responsible for training the neural topic model and outputting the keywords, embedding representations, and attribute value distributions of the modeled topics. This includes building the bag-of-words representation for topic modeling, converting and loading the pre-trained semantic representations, training the neural topic model, exporting the keywords, embedding representations, and attribute value distributions of the modeled topics, and inferring the topic distribution of texts.

3. Post-processing module: mainly responsible for processing and converting the exported topic keywords, embedding representations, and attribute value distributions. This includes merging similar topics, extracting topic titles and descriptions, and assigning a cluster label to each text.

4. Data storage module: mainly responsible for saving the output results in a database. This includes creating topic, keyword, and text objects, handling the foreign key relationships among them, connecting to the database, and writing the data into the corresponding tables.

5. Metadata fusion module: mainly responsible for fine-tuning the pre-trained model so that it produces semantic representations more relevant to topic information. This includes converting and loading the text data and multi-attribute data for the fine-tuning tasks, fine-tuning the pre-trained model, and exporting and saving the fine-tuned model parameters.
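The data storage module (module 4 above) can be illustrated with a small relational schema. The description states only that topic, keyword, and text objects are stored with foreign key relationships; the table names, columns, and the use of SQLite here are assumptions made for the sketch.

```python
import sqlite3

# Hypothetical schema: keywords and texts each reference the topic they belong to.
SCHEMA = """
CREATE TABLE topic (
    id INTEGER PRIMARY KEY,
    title TEXT
);
CREATE TABLE keyword (
    id INTEGER PRIMARY KEY,
    topic_id INTEGER NOT NULL REFERENCES topic(id),
    word TEXT NOT NULL,
    weight REAL
);
CREATE TABLE text_item (
    id INTEGER PRIMARY KEY,
    topic_id INTEGER REFERENCES topic(id),
    content TEXT NOT NULL
);
"""


def save_topic(conn, title, keywords, texts):
    """Insert one topic with its (word, weight) keywords and assigned texts."""
    cur = conn.execute("INSERT INTO topic (title) VALUES (?)", (title,))
    topic_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO keyword (topic_id, word, weight) VALUES (?, ?, ?)",
        [(topic_id, word, weight) for word, weight in keywords],
    )
    conn.executemany(
        "INSERT INTO text_item (topic_id, content) VALUES (?, ?)",
        [(topic_id, content) for content in texts],
    )
    conn.commit()
    return topic_id


conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript(SCHEMA)
topic_id = save_topic(conn, "example topic", [("sanctions", 0.3)], ["sample post"])
```

A production deployment would more likely target a server database and could add a many-to-many link table if keywords are shared across topics.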

Claims

1. A text topic modeling method based on social platform metadata, characterized by comprising the following steps:

step 1, obtaining text data to be topic-modeled and metadata of the text data from a social platform;

step 2, constructing a bag-of-words representation of the text data based on the keywords of the text data;

step 3, training attribute value prediction tasks of the corresponding categories based on the categories of the metadata so as to fine-tune a pre-trained semantic extraction model and obtain a target semantic extraction model, and extracting a text semantic representation of the text data by using the target semantic extraction model;

step 4, constructing a semantic constraint target based on the text semantic representation, and, guided by the semantic constraint target and taking the bag-of-words representation as both the input and the reconstruction target, training a neural topic model based on a variational autoencoder to obtain a topic extraction model, and deriving a topic-keyword distribution and a topic embedding representation from the model;

step 5, inputting the topic embedding representation into the attribute value prediction task to obtain the attribute value distribution of the topic on the corresponding attribute, merging identical topics according to the attribute value distribution, the topic-keyword distribution and the topic embedding representation, and taking the merging result as the topic model of the text data.

2. The method of claim 1, wherein the step 3 comprises:

classifying attributes of the metadata into a discrete type attribute, a continuous type attribute and a text type attribute;

for the discrete attributes, counting, for each attribute, the attribute values appearing in the corpus, selecting the attribute values whose number of occurrences exceeds a preset threshold to form an attribute value set in the same manner as constructing a vocabulary, constructing a classification task for predicting the attribute values based on the attribute value set, and adopting cross-entropy as the loss function of the classification task;

for the continuous attributes, converting the attribute values into a distribution with a mean of 0 and a variance of 1; constructing a regression task for predicting the converted attribute values based on the continuous attributes, and adopting MSE as the loss function of the regression task;

concatenating the text data with the text-type attribute to obtain a concatenated text, and inputting the concatenated text into the pre-trained semantic extraction model to generate a text semantic vector;

and constructing an adversarial classification task for determining the attribute category to which the text semantic vector corresponds, with cross-entropy as the loss function.

3. The method of claim 1, wherein the step 5 comprises: constructing an attribute value list for each attribute of the topic according to the attribute value distribution; constructing a keyword list according to the topic-keyword distribution;

when topics are merged, the Jaccard coefficient is used to measure the similarity between the keyword lists and between the attribute value lists of the topics to obtain a first similarity and a second similarity, and the cosine similarity is used to measure the similarity between the embedding representations of the topics to obtain a third similarity; the first similarity, the second similarity and the third similarity are combined by a weighted average to obtain the final similarity between the topics, and topics whose final similarity is greater than a preset value are merged.

4. The social platform metadata based text topic modeling method of claim 1, wherein the metadata comprises: publication time, publishing user ID, publishing user profile, @User, #Tag, and URL.

5. A text topic modeling system based on social platform metadata, comprising:

an initial module, configured to acquire text data to be topic-modeled and metadata of the text data from a social platform, and to construct a bag-of-words representation of the text data based on the keywords of the text data;

a fine-tuning module, configured to train attribute value prediction tasks of the corresponding categories according to the categories of the metadata so as to fine-tune a pre-trained semantic extraction model and obtain a target semantic extraction model, and to extract a text semantic representation of the text data by using the target semantic extraction model;

an extraction module, configured to construct a semantic constraint target according to the text semantic representation, and, guided by the semantic constraint target and taking the bag-of-words representation as both the input and the reconstruction target, to train a neural topic model based on a variational autoencoder to obtain a topic extraction model, and to derive a topic-keyword distribution and a topic embedding representation from the model;

and a merging module, configured to input the topic embedding representation into the attribute value prediction task to obtain the attribute value distribution of the topic on the corresponding attribute, to merge identical topics according to the attribute value distribution, the topic-keyword distribution and the topic embedding representation, and to take the merging result as the topic model of the text data.

6. The social platform metadata based text topic modeling system of claim 5, wherein the fine-tuning module is specifically configured to:

7. The social platform metadata based text topic modeling system of claim 5, wherein the merging module is configured to: construct an attribute value list for each attribute of the topic according to the attribute value distribution; construct a keyword list according to the topic-keyword distribution;

8. The social platform metadata based text topic modeling system of claim 5, wherein the metadata comprises: publication time, publishing user ID, publishing user profile, @User, #Tag, and URL.

9. A storage medium storing a program for executing the text topic modeling method based on social platform metadata according to any one of claims 1 to 4.

10. A client for use in the social platform metadata based text topic modeling system of any one of claims 5 to 8.