CN114090759A

CN114090759A - E-commerce live broadcast real-time question-answering system and method based on knowledge base

Info

Publication number: CN114090759A
Application number: CN202210058425.2A
Authority: CN
Inventors: 梁晨阳
Original assignee: Beijing Zhongke Shenzhi Technology Co ltd
Current assignee: Beijing Zhongke Shenzhi Technology Co ltd
Priority date: 2022-01-19
Filing date: 2022-01-19
Publication date: 2022-02-25

Abstract

The invention discloses a knowledge-base-based e-commerce live broadcast real-time question-answering system and method. The system comprises a data cleaning module, a knowledge base building module, a word segmentation processing module, an intent defining module and a result query module. The data cleaning module acquires data from a data source and stores it in a structured database field by field. The knowledge base building module stores entity words together with their near-synonyms in a graph database. The word segmentation processing module obtains all minimum-granularity words and their near-synonyms. The intent defining module links each intent to the name of the object it targets. The result query module queries the database for the corresponding result and returns it to the user. The invention obtains structured knowledge data from metadata and, by preprocessing the structured knowledge base, extracts its effective features, so that the knowledge base is used efficiently during question answering and intelligent question answering is realized.

Description

E-commerce live broadcast real-time question-answering system and method based on knowledge base

Technical Field

The invention belongs to the technical field of natural language processing, and particularly relates to a knowledge base-based E-commerce live broadcast real-time question answering system and method.

Background

Since the industrial revolution, machines have taken over much manual labor from humans, but replacing human mental labor with machines has not been fully realized. In recent years, with the rapid rise of big data, attention has naturally turned to areas where machines can replace human mental labor, and intelligent question answering is a typical one. Intelligent question answering can be subdivided by professional field, for example intelligent customer service and chatbots. Live-stream selling has also grown rapidly in recent years and attracts a large amount of traffic; meanwhile, with the development of artificial intelligence, intelligent anchors are becoming increasingly popular. All of this drives the rapid development of e-commerce live broadcast question-answering systems.

Different from other traditional question-answering systems, the live E-commerce question-answering system has the following characteristics:

1. The first is real-time performance. Because the system serves a live broadcast, questions must be answered in real time to ensure a good experience for live broadcast users.

2. The second is higher accuracy. Unlike an open-domain chat question-answering system, an e-commerce question-answering system covers a narrow domain with precise content, so higher accuracy is required of an e-commerce live broadcast question-answering system.

3. The third is extensibility. For live broadcast services, knowledge is constantly updated in real time: yesterday's story may well become today's selling point. The e-commerce live broadcast question-answering system therefore needs good extensibility.

Although many question-answering systems exist, it is difficult to find one that fully satisfies all three requirements. First, traditional question-answering systems based on sentence pairs struggle to cover the complexity of language, so their accuracy is a serious concern. In recent years, with the development of deep learning, question-answering systems built on BERT-family models have been widely applied. However, the huge number of parameters in BERT-family models greatly reduces real-time capability, and the many-to-one question-answering pattern of a live broadcast system makes large-parameter models even more daunting. Knowledge distillation can mitigate this problem, but it affects model accuracy and still does not provide good extensibility. Since Google proposed the concept of the knowledge graph, knowledge-graph-based question answering has become a trend, and knowledge graphs have been built in many fields; this improves accuracy to a certain extent and offers good extensibility. However, constructing a knowledge graph requires significant investment in labor and time. In theory, therefore, a knowledge-graph-based question-answering system can achieve good accuracy and extensibility, but actually building one is difficult. Consequently, a question-answering system based purely on a knowledge graph can hardly meet the requirements of e-commerce live broadcast question answering.

Therefore, how to provide a knowledge-base-based e-commerce live broadcast real-time question answering system and method becomes a problem to be solved urgently by those skilled in the art.

Disclosure of Invention

In view of the above, the present invention provides a knowledge base-based e-commerce live broadcast real-time question-answering system and method, which can obtain structured knowledge data from metadata, and can effectively extract effective features of a structured knowledge base by preprocessing the structured knowledge base, so that the knowledge base can be efficiently utilized when performing question-answering, and intelligent question answering is realized.

In order to achieve the purpose, the invention adopts the following technical scheme:

A knowledge-base-based e-commerce live broadcast real-time question-answering system comprises a data cleaning module, a knowledge base building module, a word segmentation processing module, an intent defining module and a result query module, wherein:

the data cleaning module acquires data from a data source, designs a schema according to the specific service, cleans the data according to the schema, identifies the data to form structured data, and stores the structured data in a structured database field by field;

the knowledge base building module performs word segmentation once on the structured data, prepares a lexicon with pre-trained word vectors, sets a threshold to screen out near-synonyms, and stores the entity words together with their near-synonyms in a graph database;

the word segmentation processing module performs word segmentation when processing the mined structured data, obtains all minimum-granularity words with a word segmenter, and obtains the near-synonyms of those minimum-granularity words;

the intent defining module clusters the historical corpus and selects representative questions; defines intents according to the knowledge base; after the intents are defined, stores the intent names in the graph database and links each intent name to its corresponding entity with a relationship edge; each intent is likewise linked by a relationship edge to the name of the object it targets;

the result query module, when a user inputs a query, first performs full word segmentation on the query; after all segmentation results are obtained, it queries the graph database for the minimum words and retrieves all possibly related long words by following the element relationships that link minimum words to long words; it then matches the long words that meet the requirement by checking whether all elements of each candidate long word are contained in the question; after the long word to be queried is obtained, it queries all intents predefined for that long word, processes the query, and inputs it into the corresponding model; the model outputs a value for each predefined intent, and the intent with the maximum value is taken as the matched user intent; finally, the corresponding result is looked up in the database and returned to the user.

Further, the schema includes corresponding entities, attributes, intents, and relationships therebetween.

Further, the near-synonyms are screened as follows: the cosine similarity between every mined word and the words in the lexicon is computed, and candidates above a set threshold are selected; each candidate is then substituted for the original word in the original corpus, and a pre-trained n-gram language model and a BERT mask model are used as scorers, with a threshold, to screen the candidates further; the synonyms of the mined words are thereby determined.

Further, the word segmentation processing is performed as follows: a long word is first split into several minimum words by a word segmenter; an original minimum word is replaced with one of its synonyms to obtain a new long word; the cosine similarity between the BERT vectors of the long word before and after replacement is computed and compared with a threshold, and if the similarity is below the threshold, the candidate is considered not substitutable at that position and is excluded there; the mask score of each character is then computed with the BERT mask model, a threshold is set, and the synonyms of the minimum word that satisfy the condition are screened out;

after the near-synonyms of each minimum word have been screened, any minimum word for which the empty string remains a valid replacement is omitted; each remaining minimum word, together with its near-synonyms, forms a group called an element of the long word;

finally, each long word and its elements are stored in a database.

A knowledge base-based E-commerce live broadcast real-time question and answer method comprises the following steps:

cleaning data: acquiring data from a data source, designing a schema according to the specific service, cleaning the data according to the schema, identifying the data to form structured data, and storing the structured data in a structured database field by field;

constructing a knowledge base: performing word segmentation once on the structured data, preparing a lexicon with pre-trained word vectors, setting a threshold to screen out near-synonyms, and storing the entity words together with their near-synonyms in a graph database;

word segmentation processing: performing word segmentation when processing the mined structured data, obtaining all minimum-granularity words with a word segmenter, and obtaining the near-synonyms of those minimum-granularity words;

defining intents: clustering the historical corpus and selecting representative questions; defining intents according to the knowledge base; after the intents are defined, storing the intent names in the graph database and linking each intent name to its corresponding entity with a relationship edge; each intent is likewise linked by a relationship edge to the name of the object it targets;

result query: when a user inputs a query, first performing full word segmentation on the query; after all segmentation results are obtained, querying the graph database for the minimum words and retrieving all possibly related long words by following the element relationships that link minimum words to long words; then matching the long words that meet the requirement by checking whether all elements of each candidate long word are contained in the question; after the long word to be queried is obtained, querying all intents predefined for that long word, processing the query, and inputting it into the corresponding model; the model outputs a value for each predefined intent, and the intent with the maximum value is taken as the matched user intent; finally, looking up the corresponding result in the database and returning it to the user.

The invention has the beneficial effects that:

The invention first obtains structured knowledge data from metadata and then, by preprocessing the structured knowledge base, extracts its effective features, so that the knowledge base is used efficiently during question answering and intelligent question answering is realized. The system organizes the data into a knowledge dictionary and stores it in a database. When a user inputs a query, the system first analyzes the query, classifies the questions the user may want to ask, and finally looks up the answer in the knowledge dictionary and returns it to the user.

Drawings

To illustrate the technical solutions of the present invention or of the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely embodiments of the invention; those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flow chart of a method for cleansing data according to the present invention.

FIG. 2 is a flow chart of a method of building a knowledge base according to the present invention.

FIG. 3 is a flow chart of a method of word segmentation processing in accordance with the present invention.

FIG. 4 is a flowchart of a method for defining intent according to the present invention.

FIG. 5 is a flow chart of a method for result query according to the present invention.

FIG. 6 is a diagram of obtaining corpora from a data source and cleansing corpora according to embodiment 3 of the present invention.

FIG. 7 is a partial knowledge graph centered on potato chips and tomato-flavored potato chips in embodiment 3 of the present invention.

FIG. 8 is a diagram of defining intents according to the corpus and the graph in embodiment 3 of the present invention.

FIG. 9 is a query graph in embodiment 3 of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below in conjunction with the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Example 1

The invention provides a knowledge base-based E-commerce live broadcast real-time question-answering system, which comprises: the system comprises a data cleaning module, a knowledge base building module, a word segmentation processing module, an intention defining module and a result query module; wherein,

In this embodiment, the schema includes corresponding entities, attributes, intents, and relationships therebetween.

In this embodiment, near-synonyms are screened as follows: the cosine similarity between every mined word and the words in the lexicon is computed, and candidates above a set threshold are selected; each candidate is then substituted for the original word in the original corpus, and a pre-trained n-gram language model and a BERT mask model are used as scorers, with a threshold, to screen the candidates further; the synonyms of the mined words are thereby determined.

In this embodiment, the word segmentation processing is performed as follows: a long word is first split into several minimum words by a word segmenter; an original minimum word is replaced with one of its synonyms to obtain a new long word; the cosine similarity between the BERT vectors of the long word before and after replacement is computed and compared with a threshold, and if the similarity is below the threshold, the candidate is considered not substitutable at that position and is excluded there; the mask score of each character is then computed with the BERT mask model, a threshold is set, and the synonyms of the minimum word that satisfy the condition are screened out;

finally, each long word and its elements are stored in a database.

The invention mainly comprises two parts: a knowledge base and an intent recognition algorithm. The knowledge base is built to make fuller use of prior knowledge about the domain, which improves the speed and accuracy of the question-answering system, and it offers good extensibility. The knowledge-base-based intent recognition algorithm flexibly uses the knowledge in the knowledge base to analyze the query input by the user and return an answer.

The invention provides a relatively ideal knowledge-base-based e-commerce live broadcast question-answering system. First, through data preprocessing, a suitable data storage scheme is designed according to the characteristics of e-commerce live broadcast data, and knowledge related to the live broadcast is stored in the corresponding knowledge base. When the user inputs a query, the intent recognition algorithm works with the knowledge base to match an answer quickly and return it to the user. The question-answering system satisfies the three requirements of the e-commerce live broadcast question-answering field described in the Background, namely real-time performance, accuracy and extensibility.

Example 2

The embodiment provides a knowledge base-based E-commerce live broadcast real-time question answering method, which comprises the following steps:

Step A: cleaning data. Data is obtained from data sources such as user queries, product descriptions and encyclopedias. A schema is designed according to the specific service; it comprises the corresponding entities, attributes and intents, and the relationships among them. According to the designed schema, the required entities, attributes, intents, relationships and other data are identified to form structured data; the techniques involved include entity recognition, attribute recognition and relationship extraction. The identified data is stored in a structured database field by field for further use. The specific flow is shown in FIG. 1.
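The field-by-field storage in step A can be illustrated with a minimal sketch. Everything named here (`Record`, `SCHEMA_FIELDS`, `store_by_field`, the sample records) is hypothetical and not taken from the patent; the recognition steps themselves (entity recognition, relationship extraction) are assumed to have already produced the records.

```python
from collections import defaultdict
from dataclasses import dataclass

# Assumed schema fields; the patent only says the schema covers entities,
# attributes, intents and their relationships.
SCHEMA_FIELDS = {"entity", "attribute", "intent", "relation"}


@dataclass(frozen=True)
class Record:
    field: str   # which schema field the recognizer assigned
    value: str   # the recognized text
    source: str  # e.g. "user query", "product description", "encyclopedia"


def store_by_field(records):
    """Clean recognized records and group them by schema field."""
    store = defaultdict(list)
    for rec in records:
        if rec.field not in SCHEMA_FIELDS:
            continue  # cleaning: drop anything the schema does not describe
        value = rec.value.strip()
        if value and value not in store[rec.field]:
            store[rec.field].append(value)  # de-duplicate within a field
    return dict(store)


records = [
    Record("entity", " tomato-flavored potato chips ", "product description"),
    Record("attribute", "price", "user query"),
    Record("entity", "tomato-flavored potato chips", "encyclopedia"),
    Record("comment", "nice stream!", "user query"),
]
structured = store_by_field(records)
```

In a real deployment the dictionary would be replaced by tables in the structured database; the sketch only shows the grouping and cleaning logic.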

Step B: constructing a knowledge base. The data mined in the previous step is first segmented, and all minimum-granularity words are obtained with a word segmenter. The entities, attributes, intents and their relationships are stored in the corresponding databases. Meanwhile, a lexicon of pre-trained word vectors is prepared. For every mined word (including every minimum word), candidate near-synonyms are selected by computing the cosine similarity between the word and the words in the lexicon and applying a threshold. Each candidate near-synonym is then substituted for the original word in the original corpus, and a pre-trained n-gram language model and a BERT mask model are used as scorers, with a threshold, to screen the candidates further. This determines the synonyms of the mined words. Finally, the entity words and their near-synonyms are stored together in a graph database. The specific flow is shown in FIG. 2.
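The two-stage screening in step B might look like the following sketch. The vectors, corpus, thresholds and the `toy_score` function are all made up for illustration; in particular `toy_score` is only a stand-in for the pre-trained n-gram language model and BERT mask model that the patent names as scorers.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def candidate_synonyms(word, vectors, threshold=0.9):
    """Stage 1: keep lexicon words whose vector is close to the word's."""
    target = vectors[word]
    return sorted(
        other for other, vec in vectors.items()
        if other != word and cosine(target, vec) >= threshold
    )


def rescore(word, candidates, corpus, score_sentence, min_score):
    """Stage 2: substitute each candidate into the corpus and keep it only
    if every rewritten sentence still scores at least min_score."""
    kept = []
    for cand in candidates:
        swapped = [s.replace(word, cand) for s in corpus if word in s]
        if swapped and min(score_sentence(s) for s in swapped) >= min_score:
            kept.append(cand)
    return kept


# Toy data (assumed): 3-dimensional "word vectors" and a one-line corpus.
VECTORS = {
    "chips": [0.9, 0.1, 0.0],
    "crisps": [0.88, 0.15, 0.02],
    "tomato": [0.1, 0.9, 0.2],
    "ketchup": [0.3, 0.8, 0.4],
}
CORPUS = ["tomato chips are crunchy"]
KNOWN_BIGRAMS = {
    ("tomato", "chips"), ("chips", "are"), ("are", "crunchy"),
    ("tomato", "crisps"), ("crisps", "are"),
}


def toy_score(sentence):
    """Stand-in scorer: fraction of adjacent word pairs seen before."""
    words = sentence.split()
    pairs = list(zip(words, words[1:]))
    return sum(p in KNOWN_BIGRAMS for p in pairs) / len(pairs) if pairs else 0.0


stage1 = candidate_synonyms("chips", VECTORS)
stage2 = rescore("chips", stage1, CORPUS, toy_score, min_score=0.8)
```

With this toy data, "ketchup" is close to "tomato" in vector space but produces an unfamiliar sentence when substituted, which is the kind of candidate the second stage is meant to reject.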

Step C: word segmentation processing. When the data mined in the previous step is processed, word segmentation is performed once: all minimum-granularity words are obtained with the word segmenter, together with their near-synonyms. Specifically, a long word is first split into several minimum words by the word segmenter. Replacing an original minimum word with one of its synonyms (the empty string is included as a candidate) yields a new long word. The cosine similarity between the BERT vectors of the long word before and after replacement is computed and compared with a threshold; if the similarity is below the threshold, the candidate is considered not substitutable at that position and is excluded there. Because the empty string is a special case, its threshold may be set slightly higher. Then the mask score of each character is computed with the BERT mask model, a threshold is set, and the synonyms of the minimum word that satisfy the condition are screened out.

After the near-synonyms of each minimum word have been screened, if the empty string remains a valid replacement for a minimum word, that word is optional in the original long word and can be omitted. Each remaining minimum word, together with its near-synonyms, forms a group called an element of the long word.

Finally, each long word and its elements are stored in the graph database, where each element relationship is an edge between nodes. Because different groups of minimum words belong to different elements, the edges are given distinct relationship names: element 1, element 2, element 3, and so on. The specific flow is shown in FIG. 3.
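As a rough illustration of how elements and their edges could be derived once screening is done, the sketch below assumes the screened synonyms are already available as a dictionary. The function names, the triple layout `(long word, relationship name, member)` and the sample words are assumptions; the patent does not specify the edge direction or storage format.

```python
def build_elements(minimal_words, synonyms):
    """Group each required minimum word with its screened near-synonyms.

    A minimum word whose screened synonyms include the empty string is
    treated as optional in the long word and produces no element.
    """
    elements = []
    for word in minimal_words:
        alternatives = synonyms.get(word, [])
        if "" in alternatives:
            continue  # empty string survived screening: the word is optional
        elements.append([word] + [alt for alt in alternatives if alt])
    return elements


def element_edges(long_word, elements):
    """Emit one (long word, 'element N', member) triple per group member."""
    return [
        (long_word, f"element {i}", member)
        for i, group in enumerate(elements, start=1)
        for member in group
    ]


LONG_WORD = "tomato-flavored potato chips"
elements = build_elements(
    ["tomato", "flavored", "potato chips"],
    {"flavored": [""], "potato chips": ["crisps"]},
)
edges = element_edges(LONG_WORD, elements)
```

Here "flavored" is dropped as optional, so the long word ends up with two elements, and a question only needs to mention one member of each element to match it.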

Step D: defining intents. The historical corpus is clustered and representative questions are selected. Intents are then defined according to the related entities, attributes, relationships and other information in the knowledge base. After the intents are defined, the intent names are stored in the graph database and linked to the corresponding entities with relationship edges. Some of the defined intents target a class of entities; for these, a corresponding node is created in the database, the above steps are applied to it, and the resulting elements are computed and stored in the database. Finally, a relationship edge links each intent to the name of the object it targets.

After the intents and their target objects are determined, historical user corpora are collected to train the classification model. During training, the targeted object in the model input is replaced with a special symbol so that the model is not disturbed by the object name and focuses on the intent rather than the object name; the relationship between object names and intents is already stored in the knowledge base. For the model output, softmax is replaced with sigmoid, and training drives the sigmoid output for the correct label toward 1 and the others toward 0.

When a new intent is defined, the above steps are repeated: the relationship between the intent and the entity is established, and a model is then trained. There are therefore multiple intent classification models, and the graph database records which model corresponds to each intent. The specific flow is shown in FIG. 4.
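The object masking and the sigmoid training target described above can be sketched without any deep learning framework. The `[OBJ]` symbol, the function names and the per-intent binary cross-entropy loss are assumptions made for illustration; the patent states only that a special symbol replaces the object and that sigmoid outputs are pushed toward 1 for the correct label and 0 for the others.

```python
import math

MASK = "[OBJ]"  # hypothetical special symbol; the patent does not name one


def mask_object(query, object_names):
    """Replace any known object name in the query with the mask symbol.

    Longer names are replaced first so that a short name nested inside a
    longer one does not leave a partial match behind.
    """
    for name in sorted(object_names, key=len, reverse=True):
        query = query.replace(name, MASK)
    return query


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce_loss(logits, target_index):
    """Mean binary cross-entropy over independent per-intent sigmoids.

    The target is 1 for the correct intent and 0 for every other intent,
    so the outputs are not forced to sum to 1 as they would be with softmax.
    """
    loss = 0.0
    for i, z in enumerate(logits):
        p = sigmoid(z)
        y = 1.0 if i == target_index else 0.0
        loss -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return loss / len(logits)


masked = mask_object(
    "how much are the tomato-flavored potato chips",
    {"potato chips", "tomato-flavored potato chips"},
)
good = bce_loss([4.0, -4.0, -4.0], target_index=0)
bad = bce_loss([-4.0, 4.0, -4.0], target_index=0)
```

One plausible reason for the sigmoid choice is that each intent then receives an independent score, so scores from different models or different subsets of intents can be compared by magnitude at query time.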

Step E: result query. When the user inputs a query, full word segmentation is first performed on it. After all segmentation results are obtained, the graph database is queried for the minimum words, and all possibly related long words are retrieved by following the element relationships that link minimum words to long words. The long words that meet the requirement are then matched by checking whether all elements of each candidate long word are contained in the question.

After the long word to be queried is obtained, all intents predefined for that long word are queried, and the query is then processed and input into the corresponding model. The model outputs a value for each predefined intent, and the intent with the maximum value is taken as the matched user intent. Finally, the corresponding result is looked up in the database and returned to the user. The specific flow is shown in FIG. 5.
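The long-word matching described above can be sketched with plain dictionaries standing in for the graph database. The data, the function names and the decision to return every fully matched long word are assumptions; the patent does not say how ties between a general and a more specific long word are resolved.

```python
# Hypothetical stand-in for the graph: long word -> list of element groups.
LONG_WORD_ELEMENTS = {
    "tomato-flavored potato chips": [["tomato"], ["potato chips", "crisps"]],
    "potato chips": [["potato chips", "crisps"]],
}


def build_element_index(long_word_elements):
    """Invert the element relation: member word -> long words that use it."""
    index = {}
    for long_word, groups in long_word_elements.items():
        for group in groups:
            for member in group:
                index.setdefault(member, set()).add(long_word)
    return index


def match_long_words(query_tokens, element_index, long_word_elements):
    """Return long words whose every element is covered by the query."""
    tokens = set(query_tokens)
    # Step 1: collect candidate long words reachable from any query token.
    candidates = set()
    for token in tokens:
        candidates.update(element_index.get(token, ()))
    # Step 2: keep candidates for which each element group shares at least
    # one member (the word itself or a near-synonym) with the query.
    return sorted(
        long_word for long_word in candidates
        if all(tokens & set(group) for group in long_word_elements[long_word])
    )


INDEX = build_element_index(LONG_WORD_ELEMENTS)
# Tokens are assumed to come from full segmentation of the user's question.
matches = match_long_words(
    ["how", "much", "are", "the", "tomato", "crisps"],
    INDEX,
    LONG_WORD_ELEMENTS,
)
```

Because matching works on elements and not on the literal product name, the question "tomato crisps" still reaches "tomato-flavored potato chips" even though neither "flavored" nor "potato chips" appears in it.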
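The final selection step might be sketched as follows. `toy_model`, the intent names, and the answer table are hypothetical placeholders: the real system would call the trained classification model recorded for the intent in the graph database and look the answer up in the structured database.

```python
# Hypothetical knowledge: intents predefined per long word, and stored answers.
ENTITY_INTENTS = {
    "tomato-flavored potato chips": ["ask_price", "ask_taste"],
}
ANSWERS = {
    ("tomato-flavored potato chips", "ask_price"): "placeholder price answer",
    ("tomato-flavored potato chips", "ask_taste"): "placeholder taste answer",
}


def toy_model(masked_query):
    """Stand-in for a trained classifier: one sigmoid-like score per intent."""
    return {
        "ask_price": 0.9 if "how much" in masked_query else 0.1,
        "ask_taste": 0.8 if "taste" in masked_query else 0.05,
    }


def answer_query(masked_query, long_word, entity_intents, model, answers):
    """Score the intents predefined for the long word and return the best."""
    scores = model(masked_query)
    allowed = entity_intents[long_word]
    # Take the intent with the maximum output value as the matched intent.
    best = max(allowed, key=lambda intent: scores.get(intent, 0.0))
    return best, answers[(long_word, best)]


intent, reply = answer_query(
    "how much are the [OBJ]",
    "tomato-flavored potato chips",
    ENTITY_INTENTS,
    toy_model,
    ANSWERS,
)
```

Restricting the comparison to the intents linked to the matched long word keeps the model from returning an intent that has no stored answer for that product.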

Example 3

1. The corpus is obtained from the data source and cleaned, as shown in FIG. 6.

2. The schema is defined according to the corpus, and data mining is then performed. First, the entities, attributes and their relationships are obtained. Then a lexicon is constructed, and the near-synonyms of each word and the minimum-word elements of each long word are obtained. Finally, a knowledge graph is obtained; FIG. 7 shows a partial graph centered on potato chips and tomato-flavored potato chips.

3. Intents are defined according to the corpus and the graph. Corpora are collected, an intent classification model is trained, and the trained intents are finally linked to the corresponding entities, as shown in FIG. 8.

4. The user inputs a query. All entities that can be matched are first detected by intelligent matching, and the possible intents of those entities are found by querying the graph. The model then computes a score for each possible intent, the scores are compared, and the result for the intent with the maximum score is returned, as shown in FIG. 9.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A knowledge-base-based e-commerce live broadcast real-time question-answering system, characterized by comprising: a data cleaning module, a knowledge base building module, a word segmentation processing module, an intent defining module and a result query module; wherein,

2. The knowledge-base-based e-commerce live broadcast real-time question answering system according to claim 1, wherein the schema comprises corresponding entities, attributes, intentions and relationships among the entities, attributes and intentions.

3. The knowledge-base-based e-commerce live broadcast real-time question-answering system according to claim 2, wherein the near-synonyms are screened as follows: the cosine similarity between every mined word and the words in the lexicon is computed, and candidates above a set threshold are selected; each candidate is then substituted for the original word in the original corpus, and a pre-trained n-gram language model and a BERT mask model are used as scorers, with a threshold, to screen the candidates further; the synonyms of the mined words are thereby determined.

4. The knowledge-base-based e-commerce live broadcast real-time question-answering system according to claim 3, wherein the word segmentation processing is performed as follows: a long word is first split into several minimum words by a word segmenter; an original minimum word is replaced with one of its synonyms to obtain a new long word; the cosine similarity between the BERT vectors of the long word before and after replacement is computed and compared with a threshold, and if the similarity is below the threshold, the minimum word is considered not replaceable in that case; the mask score of each character is then computed with the BERT mask model, a threshold is set, and the synonyms of the minimum word that satisfy the condition are screened out;

finally, each long word and its elements are stored in a database.

5. A knowledge base-based E-commerce live broadcast real-time question answering method is characterized by comprising the following steps: