
CN114254645A - Artificial intelligence auxiliary writing system - Google Patents

Info

Publication number
CN114254645A
Authority
CN
China
Prior art keywords
sentence
matrix
vector
module
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011002905.4A
Other languages
Chinese (zh)
Inventor
艾浒
张楠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Bailing Internet Technology Co ltd
Original Assignee
Beijing Bailing Internet Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Bailing Internet Technology Co ltd
Priority to CN202011002905.4A
Publication of CN114254645A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an artificial intelligence auxiliary writing system comprising a writing system, wherein the writing system comprises an information processing module, a word vector semantic module, a sentence vector semantic module and a sentence vector matrix module. The word vector semantic module comprises a CBOW model neural network training module; the information processing module comprises an information collecting module, a text box input module and a text box output module; the sentence vector semantic module comprises a sentence vector combination algorithm; and the sentence vector matrix module comprises a semantic matrix association algorithm. By introducing a new sentence-meaning algorithm, the invention converts a passage of text or a sentence into data that a computer can store and compute on, which suits the task better than conventional word-meaning calculation. Using similarity operations between sentence meanings, it outputs texts similar to the text a user inputs, thereby assisting writing and giving the user material for checking and comparing their own text.

Description

Artificial intelligence auxiliary writing system
Technical Field
The invention relates to the field of machine learning, in particular to an artificial intelligence auxiliary writing system.
Background
Complex natural language tasks were first modeled with probabilistic techniques, but learning the joint probability function of a language model suffers from a fatal curse of dimensionality. If the vocabulary of the language model has 100000 words and one-hot encoding is used to represent the joint distribution of 10 consecutive words, the model may have on the order of $10^{50}$ parameters, and the number of samples needed for a model with sufficient confidence grows exponentially accordingly. To solve this problem, Hinton et al. proposed the distributed representation in 1986, whose basic idea is to represent words as n-dimensional continuous real-valued vectors. Distributed representations have strong representational power: an n-dimensional vector with k possible values in each dimension can represent $k^n$ features. Commonly available open-source pre-trained word vector models typically have n in the hundreds or even thousands. A common way of training word vectors is CBOW (Continuous Bag-of-Words model).
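The parameter count quoted above can be checked with a few lines of arithmetic. This is a minimal sketch; the variable names are illustrative, and the values chosen for `k` and `n` are assumptions made only for the example.

```python
# Joint distribution of 10 consecutive words over a 100000-word vocabulary:
# a full table needs one entry per possible word sequence.
vocab_size = 100_000
window = 10
joint_table_size = vocab_size ** window
print(joint_table_size == 10 ** 50)  # True

# A distributed representation with n dimensions and k possible values per
# dimension can distinguish k**n different feature combinations.
k, n = 2, 300  # illustrative values, not taken from the source
distinct_codes = k ** n
print(distinct_codes > joint_table_size)  # True
```

Even with only two values per dimension, a 300-dimensional vector distinguishes far more combinations than the one-hot joint table has entries, while storing just 300 numbers per word.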
Word vectors are the basis of deep learning research in NLP, because semantically similar words tend to appear in similar contexts. During training, the vectors therefore capture the neighboring features of words and thereby learn the similarities between them. Unlike raw characters, word vectors are computable, so the similarity between words can be measured by cosine distance, Euclidean distance and the like. Word vectors alone, however, cannot measure the semantic similarity between sentences or between articles.
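The two measures mentioned above can be sketched in plain Python. The three-dimensional vectors below are made up for illustration and do not come from any trained model; `cosine_similarity` and `euclidean_distance` are hypothetical helper names.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    """Straight-line distance between two vectors (0.0 means identical)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical word vectors, invented for this example.
king = [0.8, 0.6, 0.1]
queen = [0.7, 0.7, 0.2]
apple = [0.1, 0.2, 0.9]

print(cosine_similarity(king, queen) > cosine_similarity(king, apple))    # True
print(euclidean_distance(king, queen) < euclidean_distance(king, apple))  # True
```

Note the opposite orientation of the two measures: a larger cosine value means more similar, while a smaller Euclidean distance means more similar.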
In addition, existing natural language understanding technology (NLP) does not understand and memorize literary works and other texts, cannot perform associative calculation on the text a user inputs, and does not return high-quality texts with similar semantics, so it cannot help the user with association, citing classics, or writing. For example, Baidu's ERNIE 2.0, its latest natural language understanding technology as of 2020, comprehensively and significantly surpassed the world's leading techniques on 16 public data sets covering sentiment analysis, text matching, natural language inference, lexical analysis, reading comprehension, intelligent question answering and so on. Yet it does not perform associative calculation on the text input by the user, cannot return high-quality texts with similar semantics, and is expensive to train.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides an artificial intelligence auxiliary writing system.
In order to solve the technical problems, the invention provides the following technical scheme:
The invention discloses an artificial intelligence auxiliary writing system comprising a writing system, wherein the writing system comprises an information processing module, a word vector semantic module, a sentence vector semantic module and a sentence vector matrix module; the word vector semantic module comprises a CBOW model neural network training module; the information processing module comprises an information collecting module, a text box input module and a text box output module; the sentence vector semantic module comprises a sentence vector combination algorithm; and the sentence vector matrix module comprises a semantic matrix association algorithm. The writing system specifically operates through the following steps:
A. collecting a large number of literary works through the information processing module, segmenting them, and converting the text into character strings to form a text paragraph library;
B. processing the text paragraphs acquired in step A through the word vector semantic module: the paragraphs are first segmented into words, the words are then processed in turn by the CBOW model neural network training module to obtain a word vector for every word, and all the word vectors are combined to form a phrase vector library;
C. placing the phrase vector library from step B as a whole in the sentence vector semantic module, and outputting the word vectors as a sentence vector through the sentence vector combination algorithm, so that each sentence of a text paragraph is expressed by its sentence vector;
D. passing each paragraph in the text paragraph library generated in step A through steps B and C to obtain a sentence feature vector for each text paragraph, expressing the sentence feature vectors as floating-point numbers, and combining all sentence feature vectors to form a literary work matrix library;
E. a user inputs a target text through the text box input module of the information processing module; after the text is converted into a character string, a target sentence vector is formed through steps B and C;
F. processing the target sentence vector and the literary work matrix library from step D through the semantic matrix association algorithm of the sentence vector matrix module to obtain a set of similar sentence vectors, and outputting that set to the text box output module of the information processing module, with the similar sentences arranged in ascending order of similarity value.
As a preferred technical scheme of the invention, the information processing module comprises web crawler technology or an external interface to a web API platform, and is mainly used for extracting literary work information.
As a preferred technical solution of the present invention, the CBOW model neural network training module is mainly used under the word2vec bag-of-words algorithm model. Its training process extracts some literary sentences from a large number of literary sentences as training data, takes a word w(t) from each sentence, and predicts w(t) from the context words w(t-2), w(t-1), w(t+1) and w(t+2). The trained CBOW model neural network training module can convert word strings into numeric vectors. Training comprises the following steps:
(1) the one-hot encodings of the context words of the current word are fed to the input layer, each one-hot encoding having dimension 1 × V; a matrix W1 of dimension V × N is set, where V is the total number of words contained in the dictionary and N is a user-defined dimension;
(2) each context word is multiplied by the same matrix W1 to obtain its own 1 × N vector; these 1 × N vectors are averaged into a single 1 × N vector; finally, the averaged 1 × N vector is multiplied by the matrix W2, whose dimension is N × V, to give a 1 × V vector;
(3) the 1 × V vector is normalized to give the probability of each word, and the word with the largest probability is taken as the predicted word w(t); the error between the predicted word and the actual expected word w(t) is calculated, and back-propagation with gradient descent adjusts the values of W1 and W2; the final values of W1 form the word vector library of the literary sentences.
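The forward pass described in steps (1) to (3) can be sketched as follows. This is an illustrative toy, not the patent's implementation: the vocabulary, the weight values, and the helper names (`one_hot`, `vec_mat`, `softmax`, `cbow_forward`) are all assumptions, and softmax is assumed as the normalization.

```python
import math

def one_hot(index, size):
    """1 x V one-hot row vector for the word at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def vec_mat(vec, mat):
    """Multiply a 1 x R row vector by an R x C matrix, giving a 1 x C vector."""
    cols = len(mat[0])
    return [sum(vec[r] * mat[r][c] for r in range(len(vec))) for c in range(cols)]

def softmax(scores):
    """Normalize scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cbow_forward(context_ids, W1, W2):
    V, N = len(W1), len(W1[0])
    # Step (2): each context one-hot times W1 gives a 1 x N vector.
    hidden_rows = [vec_mat(one_hot(i, V), W1) for i in context_ids]
    # Average the 1 x N vectors into a single 1 x N vector.
    hidden = [sum(row[j] for row in hidden_rows) / len(hidden_rows) for j in range(N)]
    # Multiply by W2 (N x V) to get a 1 x V score vector, then normalize (step 3).
    return softmax(vec_mat(hidden, W2))

vocab = ["moon", "river", "night", "boat", "wind"]           # V = 5 (hypothetical)
W1 = [[0.1, 0.3], [0.2, -0.1], [0.0, 0.4], [-0.3, 0.2], [0.5, 0.1]]  # V x N, N = 2
W2 = [[0.2, -0.4, 0.1, 0.3, 0.0], [0.1, 0.5, -0.2, 0.0, 0.3]]        # N x V

# Context w(t-2), w(t-1), w(t+1), w(t+2) given as vocabulary indices.
probs = cbow_forward([0, 1, 3, 4], W1, W2)
predicted = vocab[probs.index(max(probs))]
print(predicted, probs)
```

Multiplying a one-hot vector by W1 simply selects one row of W1, which is why the rows of the trained W1 serve as the word vectors.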
As a preferred technical solution of the present invention, the sentence vector combination algorithm is calculated on the basis of the CBOW model neural network training module, and a sentence vector is formed from the word vectors obtained by that module, specifically as follows. Let the target sentence A contain n words, each represented by an m-dimensional word vector from the word vector library. The set of word vectors contained in sentence A is $X = (X_1, X_2, \ldots, X_n)$, where each word vector may be represented as:
$$X_1 = [X_{11}, X_{12}, \ldots, X_{1m}]$$
$$X_2 = [X_{21}, X_{22}, \ldots, X_{2m}]$$
$$\ldots$$
$$X_n = [X_{n1}, X_{n2}, \ldots, X_{nm}]$$
If the semantic feature vector of sentence A is Avec, then Avec is computed as:
$$Avec = \left[\frac{X_{11}+X_{21}+\cdots+X_{n1}}{n},\ \frac{X_{12}+X_{22}+\cdots+X_{n2}}{n},\ \ldots,\ \frac{X_{1m}+X_{2m}+\cdots+X_{nm}}{n}\right]$$
To simplify the notation, let:
$$Y_1 = (X_{11}+X_{21}+\cdots+X_{n1})/n$$
$$Y_2 = (X_{12}+X_{22}+\cdots+X_{n2})/n$$
$$\ldots$$
$$Y_m = (X_{1m}+X_{2m}+\cdots+X_{nm})/n$$
The semantic feature vector of sentence A is then $Avec = [Y_1, Y_2, \ldots, Y_m]$, where each $Y$ is a floating-point number. After a number of sentence vectors have been collected, with the total number of sentences set to S, the floating-point matrix built from the sentence vectors, one row per sentence, is expressed as:
$$\begin{bmatrix} Y_{11} & Y_{12} & \cdots & Y_{1m} \\ Y_{21} & Y_{22} & \cdots & Y_{2m} \\ \vdots & \vdots & & \vdots \\ Y_{S1} & Y_{S2} & \cdots & Y_{Sm} \end{bmatrix}$$
The output matrices are combined to form the literary work matrix library G.
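The averaging rule above is short enough to show directly. In this minimal sketch the word vector library is a hand-written dictionary with invented values, and `sentence_vector` and `build_matrix_library` are hypothetical names; in the described system the vectors would come from the trained W1 matrix.

```python
def sentence_vector(word_vectors):
    """Average n word vectors of dimension m into one m-dimensional Avec."""
    n = len(word_vectors)
    m = len(word_vectors[0])
    return [sum(vec[j] for vec in word_vectors) / n for j in range(m)]

def build_matrix_library(segmented_sentences, word_vector_library):
    """Stack one sentence vector per row to form the matrix library G (S x m)."""
    return [
        sentence_vector([word_vector_library[word] for word in words])
        for words in segmented_sentences
    ]

# Hypothetical 2-dimensional word vectors, invented for the example.
word_vector_library = {
    "bright": [1.0, 0.0],
    "moon": [0.0, 1.0],
    "river": [2.0, 2.0],
}

G = build_matrix_library([["bright", "moon"], ["moon", "river"]], word_vector_library)
print(G)  # [[0.5, 0.5], [1.0, 1.5]]
```

Because the result always has m components regardless of sentence length, sentences of different lengths end up as rows of the same matrix.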
As a preferred technical scheme of the invention, the semantic association algorithm mainly computes on the target text together with the literary work matrix library, uses the Euclidean distance formula, and comprises the following steps:
the target text input by the user in step E is taken as the X text, and steps B and C give its n-dimensional feature vector $X = (X_1, X_2, \ldots, X_n)$. Here X denotes the sentence vector of the target X text and is distinct from the word vector set X defined earlier. With a comparison sentence $Y = (Y_1, Y_2, \ldots, Y_n)$, the multi-dimensional distance formula is:
$$d(X, Y) = \sqrt{\sum_{i=1}^{n} (X_i - Y_i)^2}$$
The distance between the X text and each sentence feature vector is calculated in turn, that is, against each of the millions of sentence feature vectors stored by the program; this distance is the similarity between sentences, and the similar sentences are finally sorted by it.
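The sequential version of this search, one distance calculation per stored sentence, can be sketched as below. The stored vectors are invented for the example and `rank_sequential` is a hypothetical helper name.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def rank_sequential(target, library):
    """Compare the target with every stored vector in turn and sort ascending.

    Returns (distance, row_index) pairs, closest sentence first.
    """
    scored = [(euclidean_distance(target, vec), idx) for idx, vec in enumerate(library)]
    scored.sort()
    return scored

library = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]]   # three stored sentence vectors
ranking = rank_sequential([1.0, 2.0], library)
print([idx for _, idx in ranking])  # [2, 0, 1]
```

The loop performs one distance calculation per stored sentence, which is the cost the matrix formulation later in the text is meant to remove.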
As a preferred technical solution of the present invention, the semantic association algorithm includes an algorithm simplification process, comprising the following steps:
first, a transformation matrix C with m rows is defined:
$$C = \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}$$
Multiplying the m-row transformation matrix by the sentence vector of the X text gives:
$$X' = C X = \begin{bmatrix} X_1 & X_2 & \cdots & X_n \\ X_1 & X_2 & \cdots & X_n \\ \vdots & \vdots & & \vdots \\ X_1 & X_2 & \cdots & X_n \end{bmatrix}$$
All memorized sentence feature vectors are combined into a matrix G with m rows and n columns; that is, the algorithm has memorized m sentences and each sentence feature vector has n dimensions;
$$D = X' - G$$
where the X text input by the user is a matrix of one row and n columns that the transformation matrix C converts into the matrix X' of m rows and n columns; taking the difference between X' and the matrix G gives the matrix D, which holds the differences between the sentence vector of the X text and all sentence vectors in the literary work matrix library;
$$E = D \odot D$$
where the operator ⊙ is the Hadamard product, a matrix operation: if $A = (a_{ij})$ and $B = (b_{ij})$ are two matrices of the same order and $c_{ij} = a_{ij} \times b_{ij}$, then the matrix $C = (c_{ij})$ is called the Hadamard product, or element-wise product, of A and B. In the formula, E is therefore the Hadamard product of D with itself, that is, every element of D is squared;
finally $F = E^{T} C$, where $E^{T}$ is the transpose of E and C is the transformation matrix. The resulting F is a matrix with m rows and one column whose values are the similarities between the X text and each sentence; sorting them in ascending order gives the list of sentences most similar to the X text. The square root of the original Euclidean formula is not taken, so the final formula is:
associative distance
$$d(X, Y) = \sum_{i=1}^{n} (X_i - Y_i)^2$$
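The matrix steps above can be traced on a tiny example in plain Python. One point is an assumption of this sketch rather than something the text states: for F to have one value per stored sentence (m rows), each row of E has to be summed over its n columns, which corresponds to multiplying E by an n-row column of ones, so that is what the code does. The helper names are hypothetical.

```python
def replicate(x, m):
    """X' = C X, where C is an m x 1 column of ones and x is a 1 x n row vector."""
    return [list(x) for _ in range(m)]

def associative_distances(x, G):
    """Squared Euclidean distance from x to every row of G, as matrix steps."""
    m = len(G)
    X_prime = replicate(x, m)                                    # m x n
    D = [[xp - g for xp, g in zip(row_x, row_g)]                 # D = X' - G
         for row_x, row_g in zip(X_prime, G)]
    E = [[d * d for d in row] for row in D]                      # E = D ⊙ D
    # Assumed reduction: sum each row of E so that F has m entries.
    F = [sum(row) for row in E]
    return F

G = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]]     # m = 3 stored sentences, n = 2
F = associative_distances([1.0, 2.0], G)
order = sorted(range(len(F)), key=F.__getitem__)  # ascending: most similar first
print(F, order)  # [5.0, 8.0, 1.0] [2, 0, 1]
```

In an array library running on a GPU, the same three steps (subtract, square element-wise, sum along each row) would each be a single whole-matrix operation instead of nested Python loops.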
As a preferred technical solution of the present invention, the semantic matrix association algorithm is mainly run on a GPU.
Compared with the prior art, the invention has the following beneficial effects:
1: By introducing a new sentence-meaning algorithm, the invention converts a passage of text or a sentence into data that a computer can store and compute on, which suits the task better than conventional word-meaning calculation. Using similarity operations between sentence meanings, it outputs texts similar to the text a user inputs, thereby assisting writing and giving the user material for checking and comparing their own text.
2: The invention replaces single-threaded sequential computation with matrix computation. Through efficient matrix operations on the GPU, the semantic calculation is reduced from m separate operations to a single one, which greatly improves the efficiency of sentence-meaning operations.
3: With the matrix operation mode in place, the deep learning cost incurred by single-threaded operation is reduced and machine learning efficiency is improved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a schematic diagram of the system architecture of the present invention;
FIG. 2 is a schematic flow diagram of the present invention;
FIG. 3 is a schematic diagram of the target text output of the present invention;
Detailed Description
The following description of the preferred embodiments of the present invention is provided for the purpose of illustration and description, and is in no way intended to limit the invention.
Example 1
As shown in FIGS. 1-3, the present invention provides an artificial intelligence auxiliary writing system comprising a writing system, wherein the writing system comprises an information processing module, a word vector semantic module, a sentence vector semantic module and a sentence vector matrix module; the word vector semantic module comprises a CBOW model neural network training module; the information processing module comprises an information collecting module, a text box input module and a text box output module; the sentence vector semantic module comprises a sentence vector combination algorithm; and the sentence vector matrix module comprises a semantic matrix association algorithm. The system specifically operates through the following steps:
A. collecting a large number of literary works through the information processing module, segmenting them, and converting the text into character strings to form a text paragraph library;
B. processing the text paragraphs acquired in step A through the word vector semantic module: the paragraphs are first segmented into words, the words are then processed in turn by the CBOW model neural network training module to obtain a word vector for every word, and all the word vectors are combined to form a phrase vector library;
C. placing the phrase vector library from step B as a whole in the sentence vector semantic module, and outputting the word vectors as a sentence vector through the sentence vector combination algorithm, so that each sentence of a text paragraph is expressed by its sentence vector;
D. passing each paragraph in the text paragraph library generated in step A through steps B and C to obtain a sentence feature vector for each text paragraph, expressing the sentence feature vectors as floating-point numbers, and combining all sentence feature vectors to form a literary work matrix library;
E. a user inputs a target text through the text box input module of the information processing module; after the text is converted into a character string, a target sentence vector is formed through steps B and C;
F. processing the target sentence vector and the literary work matrix library from step D through the semantic matrix association algorithm of the sentence vector matrix module to obtain a set of similar sentence vectors, and outputting that set to the text box output module of the information processing module, with the similar sentences arranged in ascending order of similarity value.
Furthermore, the information processing module comprises web crawler technology or an external interface to a web API platform, and is mainly used for extracting literary work information.
The CBOW model neural network training module is mainly used under the word2vec bag-of-words algorithm model. Its training process extracts some literary sentences from a large number of literary sentences as training data, takes a word w(t) from each sentence, and predicts w(t) from the context words w(t-2), w(t-1), w(t+1) and w(t+2), so that word strings and words can be converted into numeric vectors. Training comprises the following steps:
(1) the one-hot encodings of the context words of the current word are fed to the input layer, each one-hot encoding having dimension 1 × V; a matrix W1 of dimension V × N is set, where V is the total number of words contained in the dictionary and N is a user-defined dimension;
(2) each context word is multiplied by the same matrix W1 to obtain its own 1 × N vector; these 1 × N vectors are averaged into a single 1 × N vector; finally, the averaged 1 × N vector is multiplied by the matrix W2, whose dimension is N × V, to give a 1 × V vector;
(3) the 1 × V vector is normalized to give the probability of each word, and the word with the largest probability is taken as the predicted word w(t); the error between the predicted word and the actual expected word w(t) is calculated, and back-propagation with gradient descent adjusts the values of W1 and W2; the final values of W1 form the word vector library of the literary sentences.
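The weight update in step (3) can be illustrated with a single training example. This sketch assumes a softmax normalization with cross-entropy loss and plain gradient descent, which is the usual CBOW setup but is not spelled out in the text; the weights, learning rate, and the name `cbow_step` are invented for the example.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cbow_step(context_ids, target_id, W1, W2, lr=0.1):
    """One gradient-descent update of W1 (V x N) and W2 (N x V), in place.

    Returns the cross-entropy loss measured before the update.
    """
    N, V = len(W2), len(W2[0])
    count = len(context_ids)
    # Hidden vector: average of the W1 rows selected by the context words.
    h = [sum(W1[c][j] for c in context_ids) / count for j in range(N)]
    scores = [sum(h[j] * W2[j][v] for j in range(N)) for v in range(V)]
    probs = softmax(scores)
    loss = -math.log(probs[target_id])
    # Error at the output: predicted probabilities minus the one-hot target.
    dscores = [p - (1.0 if v == target_id else 0.0) for v, p in enumerate(probs)]
    # Back-propagate to the hidden vector using W2 before it is changed.
    dh = [sum(W2[j][v] * dscores[v] for v in range(V)) for j in range(N)]
    for j in range(N):
        for v in range(V):
            W2[j][v] -= lr * h[j] * dscores[v]
    # Each context row of W1 receives an equal share of the hidden gradient.
    for c in context_ids:
        for j in range(N):
            W1[c][j] -= lr * dh[j] / count
    return loss

W1 = [[0.1, 0.3], [0.2, -0.1], [0.0, 0.4], [-0.3, 0.2], [0.5, 0.1]]  # V = 5, N = 2
W2 = [[0.2, -0.4, 0.1, 0.3, 0.0], [0.1, 0.5, -0.2, 0.0, 0.3]]

# Repeat the update on one (context, target) pair and watch the error shrink.
losses = [cbow_step([0, 1, 3, 4], 2, W1, W2) for _ in range(100)]
print(losses[0], losses[-1])
```

A real training run would loop over many (context, target) pairs drawn from the literary sentences rather than repeating one pair.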
The sentence vector combination algorithm is calculated on the basis of the CBOW model neural network training module, and a sentence vector is formed from the word vectors obtained by that module, specifically as follows. Let the target sentence A contain n words, each represented by an m-dimensional word vector from the word vector library. The set of word vectors contained in sentence A is $X = (X_1, X_2, \ldots, X_n)$, where each word vector may be represented as:
$$X_1 = [X_{11}, X_{12}, \ldots, X_{1m}]$$
$$X_2 = [X_{21}, X_{22}, \ldots, X_{2m}]$$
$$\ldots$$
$$X_n = [X_{n1}, X_{n2}, \ldots, X_{nm}]$$
If the semantic feature vector of sentence A is Avec, then Avec is computed as:
$$Avec = \left[\frac{X_{11}+X_{21}+\cdots+X_{n1}}{n},\ \frac{X_{12}+X_{22}+\cdots+X_{n2}}{n},\ \ldots,\ \frac{X_{1m}+X_{2m}+\cdots+X_{nm}}{n}\right]$$
To simplify the notation, let:
$$Y_1 = (X_{11}+X_{21}+\cdots+X_{n1})/n$$
$$Y_2 = (X_{12}+X_{22}+\cdots+X_{n2})/n$$
$$\ldots$$
$$Y_m = (X_{1m}+X_{2m}+\cdots+X_{nm})/n$$
The semantic feature vector of sentence A is then $Avec = [Y_1, Y_2, \ldots, Y_m]$, where each $Y$ is a floating-point number. After a number of sentence vectors have been collected, with the total number of sentences set to S, the floating-point matrix built from the sentence vectors, one row per sentence, is expressed as:
$$\begin{bmatrix} Y_{11} & Y_{12} & \cdots & Y_{1m} \\ Y_{21} & Y_{22} & \cdots & Y_{2m} \\ \vdots & \vdots & & \vdots \\ Y_{S1} & Y_{S2} & \cdots & Y_{Sm} \end{bmatrix}$$
The output matrices are combined to form the literary work matrix library G.
The semantic association algorithm mainly computes on the target text together with the literary work matrix library, uses the Euclidean distance formula, and comprises the following steps:
the target text input by the user in step E is taken as the X text, and steps B and C give its n-dimensional feature vector $X = (X_1, X_2, \ldots, X_n)$. Here X denotes the sentence vector of the target X text and is distinct from the word vector set X defined earlier. With a comparison sentence $Y = (Y_1, Y_2, \ldots, Y_n)$, the multi-dimensional distance formula is:
$$d(X, Y) = \sqrt{\sum_{i=1}^{n} (X_i - Y_i)^2}$$
The distance between the X text and each sentence feature vector is calculated in turn, that is, against each of the millions of sentence feature vectors stored by the program; this distance is the similarity between sentences, and the similar sentences are finally sorted by it.
The semantic association algorithm comprises an algorithm simplification process, comprising the following steps:
first, a transformation matrix C with m rows is defined:
$$C = \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}$$
Multiplying the m-row transformation matrix by the sentence vector of the X text gives:
$$X' = C X = \begin{bmatrix} X_1 & X_2 & \cdots & X_n \\ X_1 & X_2 & \cdots & X_n \\ \vdots & \vdots & & \vdots \\ X_1 & X_2 & \cdots & X_n \end{bmatrix}$$
All memorized sentence feature vectors are combined into a matrix G with m rows and n columns; that is, the algorithm has memorized m sentences and each sentence feature vector has n dimensions;
$$D = X' - G$$
where the X text input by the user is a matrix of one row and n columns that the transformation matrix C converts into the matrix X' of m rows and n columns; taking the difference between X' and the matrix G gives the matrix D, which holds the differences between the sentence vector of the X text and all sentence vectors in the literary work matrix library;
$$E = D \odot D$$
where the operator ⊙ is the Hadamard product, a matrix operation: if $A = (a_{ij})$ and $B = (b_{ij})$ are two matrices of the same order and $c_{ij} = a_{ij} \times b_{ij}$, then the matrix $C = (c_{ij})$ is called the Hadamard product, or element-wise product, of A and B. In the formula, E is therefore the Hadamard product of D with itself, that is, every element of D is squared;
finally $F = E^{T} C$, where $E^{T}$ is the transpose of E and C is the transformation matrix. The resulting F is a matrix with m rows and one column whose values are the similarities between the X text and each sentence; sorting them in ascending order gives the list of sentences most similar to the X text. The square root of the original Euclidean formula is not taken, so the final formula is:
associative distance
$$d(X, Y) = \sum_{i=1}^{n} (X_i - Y_i)^2$$
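Omitting the square root leaves the ranking unchanged, because the square root is strictly increasing on non-negative numbers. A short derivation follows; the symbol $d_{\text{assoc}}$ and the example vectors are introduced here only for illustration.

```latex
d(X,Y) = \sqrt{\sum_{i=1}^{n}(X_i - Y_i)^2},
\qquad
d_{\text{assoc}}(X,Y) = \sum_{i=1}^{n}(X_i - Y_i)^2 = d(X,Y)^2

d(X,Y) \le d(X,Z)
\iff
d(X,Y)^2 \le d(X,Z)^2
\iff
d_{\text{assoc}}(X,Y) \le d_{\text{assoc}}(X,Z)

% Worked example with X = (1, 2), Y = (1, 1), Z = (0, 0):
d_{\text{assoc}}(X,Y) = 0^2 + 1^2 = 1, \qquad d_{\text{assoc}}(X,Z) = 1^2 + 2^2 = 5
d(X,Y) = 1, \qquad d(X,Z) = \sqrt{5} \approx 2.24
```

In both measures Y is closer to X than Z is, so sorting by the associative distance returns the same order while saving one square root per stored sentence.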
The semantic matrix association algorithm is mainly run on a GPU.
Specifically, the present application mainly provides a sentence-meaning algorithm for text. It converts a target text into a sentence vector that a program can process, assembles literary works into a database according to their sentence vectors, and uses the association algorithm to compare a single sentence vector against the literary work database in order to find the most similar sentences. The sentence vector is built on existing word vectors. The CBOW model neural network training module, based on the word2vec algorithm, converts the semantics of a sentence into the topics, topic weights and keywords of its phrases; the phrases are indexed into the sentence by keyword to form a phrase matrix of word vectors for the target sentence, and this phrase matrix is then converted into a sentence vector. In other words, each phrase in the sentence is converted into a word vector and the meanings of the word vectors are superposed. Because each word vector is a single 1 × N vector, the superposed sentence matrix can likewise be reduced to a single 1 × N vector, so the meanings of several phrases are superposed into one sentence vector that expresses the combination of those phrases. Finally, the sentence vectors carrying semantic features are combined into a floating-point matrix, and the matrix is stored on a server together with the original texts. This sentence feature matrix is the literary work matrix library. It mainly encodes the semantic features of the sentences, so the computer works with the meaning of the original text through the sentence-meaning matrix and, at output time, retrieves the corresponding original text through that matrix.
For associative output, the text to be associated is first converted into a sentence vector by the steps described above. The association is based mainly on the Euclidean association formula: the distance (also called similarity) between the X text and each of the millions of stored sentences, that is, the sentence feature vectors in the literary work matrix library described above, is calculated in turn. The sentences are then sorted by this value to see which are closest to the X text, which identifies the sentences whose semantics are most similar and keeps the output sentences consistent in meaning with the input. As shown in FIG. 3, the literary sentences output for an input sentence about "known sound" contain no phrase matching "known sound". The match is therefore not conventional "keyword matching" or "character string matching"; it is produced mainly from the matching relationship between sentence vectors. This also shows that the invention can understand and memorize "declarative knowledge", and that it uses the matrix association algorithm to simulate the associative mechanism of the human brain in order to understand and associate on user semantics.
If the literary work matrix library holds tens of millions of sentences, traversing them one by one takes a long time. To improve the efficiency of sentence-meaning operations, the single-loop association operation is therefore recast as a matrix operation, reducing the number of operations from m to one. This relies on the semantic matrix association algorithm described in the invention: the X text is converted into a matrix with the same dimensions as the literary work matrix library, the difference of the two matrices is taken, and the formula then yields the final single-column matrix. The values in this matrix are the similarities between the X text and the sentence vectors in the literary work matrix library, and sorting them in ascending order quickly gives the output shown in FIG. 3. The matrix association algorithm simulates the associative mechanism of the human brain to understand and associate on user semantics, and can quickly extend from a single text to a number of texts with similar but different semantics. Because the output is drawn from the language of famous literary works collected in the literary work matrix library, the output sentences have a strong literary quality and can inspire authors as they write. The system can be used in application fields such as drama, literary creation and self-media writing, and is highly practical and general.
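The last stage, from a target sentence vector back to stored original texts, can be sketched as follows. The stored texts and vectors are placeholders invented for the example, and `associate` with its `top_k` parameter is a hypothetical interface, not one described in the source.

```python
import heapq

def squared_distance(x, y):
    """Associative distance: Euclidean distance without the square root."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def associate(target_vector, matrix_library, original_texts, top_k=3):
    """Return the top_k stored texts whose sentence vectors are closest.

    matrix_library[i] is the sentence vector of original_texts[i].
    """
    scored = (
        (squared_distance(target_vector, row), idx)
        for idx, row in enumerate(matrix_library)
    )
    best = heapq.nsmallest(top_k, scored)   # ascending by distance
    return [(original_texts[idx], dist) for dist, idx in best]

# Placeholder library: each row pairs with the text at the same index.
original_texts = ["text A", "text B", "text C"]
matrix_library = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]]

results = associate([1.0, 2.0], matrix_library, original_texts, top_k=2)
print(results)  # [('text C', 1.0), ('text A', 5.0)]
```

Keeping the row index of each vector aligned with its original text is what lets the system return readable sentences rather than vectors.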
Example 2
The method can be combined with an image recognition algorithm that recognizes objects in a picture, such as the sun, the sky and fog. The phrase text output by the recognition is then processed by the steps of Example 1, and corresponding literary sentences are output, so that text passages matching the picture are obtained and rapid creation is achieved.
Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (7)

1. An artificial intelligence auxiliary writing system, characterized in that the writing system comprises an information processing module, a word vector semantic module, a sentence vector semantic module and a sentence vector matrix module, wherein the word vector semantic module comprises a CBOW model neural network training module, the information processing module comprises an information collecting module, a text box input module and a text box output module, the sentence vector semantic module comprises a sentence vector combination algorithm, and the sentence vector matrix module comprises a semantic matrix association algorithm, the system operating according to the following steps:
A. collecting a large number of literary works through the information processing module, segmenting them into paragraphs, and converting the text into character strings to form a text paragraph library;
B. processing the text paragraphs obtained in step A through the word vector semantic module: first segmenting the text paragraphs into words, then processing the words in turn through the CBOW model neural network training module to obtain a word vector for every word, and then combining all the word vectors to form a phrase vector library;
C. passing the phrase vector library of step B into the sentence vector semantic module, and converting the word vectors into sentence vectors through the sentence vector combination algorithm, so that each sentence of the text paragraphs is represented by a sentence vector;
D. after each paragraph in the text paragraph library generated in step A has passed through steps B and C, obtaining a sentence feature vector for each text paragraph, representing the sentence feature vectors as floating-point numbers, and combining all the sentence feature vectors to form a literary work matrix library;
E. a user inputs a target text through the text box input module of the information processing module, and after the text is converted into a character string, a target sentence vector is formed through step B and step C;
F. processing the target sentence vector and the literary work matrix library of step D through the semantic matrix association algorithm of the sentence vector matrix module to obtain a set of similar sentence vectors, outputting the set to the text box output module of the information processing module, and arranging the similar sentence vectors in ascending order of the similarity value.
2. The artificial intelligence aided authoring system of claim 1, wherein the information processing module comprises a web crawler technology or a web API platform external port, and is mainly used for extracting literary work information.
3. The artificial intelligence aided writing system of claim 1, wherein the CBOW model neural network training module is based on the CBOW (continuous bag-of-words) model of word2vec; in its training process, some literary sentences are extracted from a large number of literary sentences as training data, a phrase w(t) is taken from each sentence, and w(t) is predicted from the context words w(t-2), w(t-1), w(t+1) and w(t+2); the trained CBOW model neural network training module can convert word strings into vectors, and the training comprises the following steps:
(1) inputting the one-hot encodings of the context words of the current word into an input layer, each one-hot encoding having dimension 1 × V, and setting a matrix W1 of dimension V × N, where V is the total number of phrases contained in the dictionary and N is a user-defined dimension;
(2) multiplying each context word by the same matrix W1 to obtain a 1 × N vector for each context word, averaging these 1 × N vectors into a single 1 × N vector, and finally multiplying the averaged vector by a matrix W2 of dimension N × V to obtain a 1 × V vector;
(3) normalizing the 1 × V vector to obtain the probability of each word, taking the word with the maximum probability as the predicted word w(t), calculating the error between the predicted word w(t) and the actual expected word w(t), and adjusting the values of W1 and W2 by back-propagation with gradient descent; the W1 matrix finally obtained is the word vector library of the literary sentences.
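One training step of the kind described in claim 3 might look like the sketch below. It assumes softmax as the normalization and cross-entropy as the error, neither of which the claim names explicitly; the function names, learning rate and toy sizes are likewise illustrative. Multiplying a one-hot row by W1 is equivalent to selecting a row of W1, which is what the indexing does.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cbow_step(W1, W2, context_ids, target_id, lr=0.1):
    """One CBOW forward/backward pass; updates W1 (V x N) and W2 (N x V) in place.

    Returns the cross-entropy loss measured before the update.
    """
    h = W1[context_ids].mean(axis=0)   # average of the context rows, shape (N,)
    probs = softmax(h @ W2)            # predicted distribution, shape (V,)
    loss = -np.log(probs[target_id])
    err = probs.copy()
    err[target_id] -= 1.0              # gradient of the loss w.r.t. the scores
    grad_W2 = np.outer(h, err)         # shape (N, V)
    grad_h = W2 @ err                  # shape (N,), computed before W2 changes
    W2 -= lr * grad_W2
    # Each context row receives an equal share of the hidden-layer gradient.
    np.subtract.at(W1, context_ids, lr * grad_h / len(context_ids))
    return float(loss)

# Toy setup: dictionary of V = 6 words, N = 4 dimensions (assumed sizes).
rng = np.random.default_rng(0)
V, N = 6, 4
W1 = rng.normal(scale=0.1, size=(V, N))
W2 = rng.normal(scale=0.1, size=(N, V))
# Context word ids 0, 1, 3, 4 predict the centre word id 2, repeated 50 times.
losses = [cbow_step(W1, W2, [0, 1, 3, 4], 2) for _ in range(50)]
```

After training, row i of `W1` plays the role of the word vector for dictionary entry i.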
4. The artificial intelligence aided writing system of claim 2, wherein the sentence vector combination algorithm operates on the output of the CBOW model neural network training module, forming sentence vectors from the word vectors obtained by that module as follows: let the target sentence A contain n words, each represented by an m-dimensional word vector from the word vector library, so that the set of word vectors contained in sentence A is $X = (X_1, X_2, \ldots, X_n)$, where each word vector can be expressed as:
$X_1 = [X_{11}, X_{12}, \ldots, X_{1m}]$
$X_2 = [X_{21}, X_{22}, \ldots, X_{2m}]$
……
$X_n = [X_{n1}, X_{n2}, \ldots, X_{nm}]$
if the semantic feature vector of sentence A is Avec, then Avec is calculated as:
$\mathrm{Avec} = [(X_{11} + X_{21} + \cdots + X_{n1})/n,\ (X_{12} + X_{22} + \cdots + X_{n2})/n,\ \ldots,\ (X_{1m} + X_{2m} + \cdots + X_{nm})/n]$
to simplify the notation, let:
$Y_1 = (X_{11} + X_{21} + \cdots + X_{n1})/n$
$Y_2 = (X_{12} + X_{22} + \cdots + X_{n2})/n$
……
$Y_m = (X_{1m} + X_{2m} + \cdots + X_{nm})/n$
the semantic feature vector of sentence A is then $\mathrm{Avec} = [Y_1, Y_2, \ldots, Y_m]$, which gives the sentence vector Avec, where each $Y_j$ is a floating-point number; after multiple sentence vectors have been collected, with the total number of sentences set to S, the floating-point matrix obtained from the sentence vectors is expressed as:
Figure FDA0002694937550000031
the output matrixes are combined to form a literary work matrix library G.
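The averaging in claim 4 can be sketched as below. The dictionary `word_vectors`, the helper names and the two-dimensional toy values are hypothetical; the sketch also assumes every word of a sentence is present in the word vector library (a missing word raises `KeyError`).

```python
import numpy as np

def sentence_vector(words, word_vectors):
    """Average the m-dimensional vectors of the words in one sentence."""
    if not words:
        raise ValueError("sentence has no words")
    rows = np.array([word_vectors[w] for w in words], dtype=np.float64)  # (n, m)
    return rows.mean(axis=0)  # (m,): Y_j = (X_1j + ... + X_nj) / n

def build_library(sentences, word_vectors):
    """Stack one sentence vector per segmented sentence into an S x m float matrix."""
    return np.vstack([sentence_vector(s, word_vectors) for s in sentences])

# Toy 2-dimensional word vectors (hypothetical values).
word_vectors = {"moon": [1.0, 0.0], "river": [0.0, 1.0], "night": [1.0, 1.0]}
G = build_library([["moon", "river"], ["moon", "river", "night"]], word_vectors)
```

Each row of `G` is one Avec, so `G` has S rows and m columns.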
5. An artificial intelligence aided writing system according to claim 4, wherein the semantic association algorithm combines the target text with the literary work matrix library for calculation and uses the Euclidean distance formula, comprising the following steps:
setting the target text input by the user in step E as the X text, and obtaining the n-dimensional feature vector $X = (X_1, X_2, \ldots, X_n)$ of the X text through step B and step C, where X here denotes the feature vector of the X text and is distinct from the word vector set X of claim 4; letting the sentence to be compared be $Y = (Y_1, Y_2, \ldots, Y_n)$, the corresponding multi-dimensional formula is:
Figure FDA0002694937550000032
the distance between the X text and each of the sentence feature vectors is calculated according to this formula, that is, the distances between the X text and the millions of sentence feature vectors stored by the program are calculated in turn; these distances serve as the similarity between sentences, and the similar sentences are finally sorted.
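The formula figure referenced in claim 5 is not reproduced in this text. Assuming it shows the standard n-dimensional Euclidean distance that the claim names, with X and Y as defined in the claim, it would read:

```latex
d(X, Y) = \sqrt{\sum_{i=1}^{n} \left( X_i - Y_i \right)^2}
```

A smaller value of d(X, Y) then corresponds to a closer semantic match between the X text and the compared sentence.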
6. An artificial intelligence aided authoring system as claimed in claim 5 wherein said semantic association algorithm comprises an algorithm simplification process comprising the steps of:
first defining a transformation matrix of m rows
Figure FDA0002694937550000041
multiplying the m-row transformation matrix by the sentence vector of the X text to obtain:
Figure FDA0002694937550000042
combining all the memorized sentence feature vectors into a matrix G of m rows and n columns, meaning that the algorithm has memorized m sentences and each sentence feature vector has dimension n;
$D = X' - G$, where X is the X text input by the user, a matrix of one row and n columns, which is converted by the transformation matrix C into a matrix X' of m rows and n columns; subtracting the matrix G from X' gives the matrix D, which holds the differences between the sentence vector of the X text and all the sentence vectors in the literary work matrix library;
$E = D \odot D$,
where the operator "⊙" denotes the Hadamard product, a matrix operation: if $A = (a_{ij})$ and $B = (b_{ij})$ are two matrices of the same order and $c_{ij} = a_{ij} \times b_{ij}$, then the matrix $C = (c_{ij})$ is called the Hadamard product, or element-wise product, of A and B; in the formula, E is therefore the Hadamard product of the matrix D with itself, that is, every element of D is squared;
finally $F = E^{T}C$, where $E^{T}$ is the transpose of E and C is the transformation matrix; the resulting F is a matrix of m rows and one column whose values are the similarity between the X text and each sentence; arranging them in ascending order gives the list of sentences most similar to the X text; the square root of the original Euclidean formula is omitted, so the final formula is:
Figure FDA0002694937550000043
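The matrix steps of claim 6 can be traced with the sketch below. Two points are assumptions, because the formula figures are not reproduced here: the transformation matrix C is taken to be a column of m ones, and the final step is read as a row sum of E, which is the reading that produces the m × 1 result the claim describes. The function name and toy data are illustrative.

```python
import numpy as np

def squared_distances_matrix_form(x, G):
    """Squared Euclidean distance from x (length n) to every row of G (m x n)."""
    G = np.asarray(G, dtype=np.float64)
    m, n = G.shape
    X = np.asarray(x, dtype=np.float64).reshape(1, n)  # one row, n columns
    C = np.ones((m, 1))          # transformation matrix: column of m ones (assumed)
    X_prime = C @ X              # m x n: the query repeated on every row
    D = X_prime - G              # differences to every stored sentence vector
    E = D * D                    # Hadamard product of D with itself
    F = E @ np.ones((n, 1))      # row sums, m x 1 (assumed reading of the last step)
    return F

# Toy library of three 2-dimensional sentence vectors (made-up values).
G = np.array([[1.0, 2.0], [4.0, 6.0], [1.0, 3.0]])
F = squared_distances_matrix_form([1.0, 2.0], G)
order = np.argsort(F.ravel(), kind="stable")  # ascending: most similar first
```

Omitting the square root does not change the ranking, since the square root is monotonic on non-negative values.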
7. The system of claim 1, wherein the semantic matrix association algorithm is implemented on a GPU.
CN202011002905.4A 2020-09-22 2020-09-22 Artificial intelligence auxiliary writing system Pending CN114254645A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011002905.4A CN114254645A (en) 2020-09-22 2020-09-22 Artificial intelligence auxiliary writing system

Publications (1)

Publication Number Publication Date
CN114254645A true CN114254645A (en) 2022-03-29

Family

ID=80789616

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011002905.4A Pending CN114254645A (en) 2020-09-22 2020-09-22 Artificial intelligence auxiliary writing system

Country Status (1)

Country Link
CN (1) CN114254645A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination