CN118520864A - Bid text similarity detection system and detection method - Google Patents
Bid text similarity detection system and detection method
- Publication number
- CN118520864A (application number CN202410996347.XA)
- Authority
- CN
- China
- Prior art keywords
- bidding
- text
- similarity
- word
- module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F40/194—Calculation of difference between files
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
- G06F40/30—Semantic analysis
- G06N3/0442—Recurrent networks, e.g. Hopfield networks, characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
- G06N3/048—Activation functions
- G06N3/084—Backpropagation, e.g. using gradient descent
- G06N3/088—Non-supervised learning, e.g. competitive learning
- G06Q30/0185—Product, service or business identity fraud
- G06Q30/08—Auctions
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
A bid text similarity detection system and detection method comprise a supplier bid document generation client and an online detection subsystem connected with the client. The client comprises an information collection module and an encryption uploading module; the online detection subsystem comprises a decryption module, a text analysis module based on a deep learning algorithm, a data storage module and an information pushing module. The information collection module collects the original text data input by the supplier. The encryption uploading module encrypts the original text data, and the data storage module stores the bidding text data and the model output results. Before bid opening, the decryption module decrypts the encrypted bidding text; finally, according to the model output results, the list of suppliers requiring review and the portions of the bid documents whose detected similarity is too high are pushed to the bidding agency or the bidding party for checking, so as to detect collusive bidding (bid-rigging) behavior.
Description
Technical Field
The invention belongs to the technical field of text similarity detection, and particularly relates to a bidding text similarity detection system and a bidding text similarity detection method.
Background
Bid-rigging, in which bidders collude on their submissions, occurs frequently in the bidding process, and detecting this illegal behavior before bid opening has become a required step in many bidding projects. Generally, if the similarity of two or more bid documents is too high, the documents are considered suspect for bid-rigging and should be examined further. However, bid documents often run to hundreds of pages, and reviewing them entirely by hand is inefficient, so automated tools are needed to assist the text similarity comparison.
Most existing techniques for processing bidding text judge the similarity of whole bid documents with distance measures based on Euclidean or cosine distance. Their accuracy in identifying similar text is low, and the fixed templates present in bid document text easily cause misjudged results, leading to unnecessary manual review cost.
The bidirectional long short-term memory network BiLSTM (Bi-directional Long Short-Term Memory) carries information across multiple time steps and can extract time-step information in both the forward and reverse directions, so that both the preceding and the following context are taken into account. It performs well on time-series, natural language processing and other tasks.
The bidirectional encoding characterization model BERT (Bidirectional Encoder Representations from Transformers), proposed by Devlin et al. in 2018, is a pre-training language model based on the Transformer architecture. Through its key characteristics of bidirectionality, a pre-training method and the Transformer architecture, it has brought remarkable performance improvements in the field of natural language processing, particularly in tasks such as question answering, text classification and named entity recognition. The Chinese word vector model CW2Vec (Chinese Word to Vector) is a word vector model designed for the characteristics of Chinese: it takes the stroke information of Chinese characters into account when training word vectors, so it can effectively extract the semantic information of Chinese text, and it achieves good results on natural language processing tasks such as Chinese text classification, sentiment analysis and machine translation.
A twin (Siamese) network defines two network branches that are identical in structure and share parameters, and can be used to learn the similarity between given inputs. Most existing research on computing text similarity with deep learning models uses a twin network as the basic structure, but suffers from insufficient semantic information extraction, overlong training time and similar shortcomings. How to use a deep learning model effectively to compute the similarity of Chinese bidding text, fully extract the semantic information in the Chinese text, and improve the accuracy and adaptability of the model is a problem that those skilled in the art need to solve.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a bidding text similarity detection system and detection method. The detection system collects different parts of the bid documents, performs text mining with a deep learning algorithm, records the similarity judgments among the bid documents of all suppliers, and pushes the list of suppliers requiring re-examination, together with the portions of the bid documents with suspiciously high similarity, to the bidding agency or the bidding party.
In order to solve the technical problems, the invention adopts the following technical scheme:
A bidding text similarity detection system comprises a bidding document generation client, wherein the bidding document generation client is electrically connected with an online detection subsystem; the bidding document generation client comprises an information collection module and an encryption uploading module; the online detection subsystem comprises a data storage module, a decryption module, a text analysis module and an information push module; the encryption uploading module is electrically connected with the data storage module;
the information collection module is used for collecting original text data of a bidding text;
the encryption uploading module is used for encrypting and uploading the original text data to the data storage module;
The data storage module is used for storing original text data and numbering texts before storing;
The decryption module is used for reading bidding text fragments to be analyzed from the original text data in the data storage module, decrypting the bidding text fragments, and transmitting the decrypted data to the text analysis module;
the text analysis module is used for carrying out data preprocessing and text analysis;
the information pushing module is used for reading data from the data storage module and pushing the text portions whose detected similarity is too high to the reviewing party for checking, so as to detect bid-rigging behavior; similarity being "too high" means: exceeding a set similarity threshold.
A detection method of a bidding text similarity detection system comprises the following steps:
s1: collecting bidding texts input by suppliers, encrypting and uploading;
S2: decrypting the bidding text before opening the bid, and carrying out data preprocessing on the bidding text;
s3: constructing a deep learning model, and training the deep learning model by using a training set;
S4: after the deep learning model is trained, the preprocessed bid texts of all bidding suppliers are combined pairwise, technical scheme with technical scheme or service scheme with service scheme, and each pair is input into the trained deep learning model to obtain a prediction result. That is, the pairwise combinations comprise pairs of technical schemes and pairs of service schemes, but never a pair consisting of one technical scheme and one service scheme.
Preferably, the substeps of step S2 are as follows:
s2.1, selecting bidding texts to be compared according to a bidding agency or bidding party preset scheme;
s2.2, unifying character codes of the bidding text;
s2.3, removing useless information, wherein the useless information comprises special symbols, spaces, line feed symbols, uniform resource locators, mail addresses and frequently-occurring repeated contents; wherein, frequent occurrence refers to: the text of the same section repeatedly appears for more than 10 times; special symbols include geometric shapes, arrows, and emoticons;
s2.4, removing the stop words related to the bid document text using a stop word list, the stop words being common function words that carry little meaning on their own;
and s2.5, performing word segmentation on the bid texts to be compared and inputting the word segmentation results into the word vector model; a minimal sketch of this preprocessing pipeline is given below.
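As an illustration only, the following Python sketch walks through steps s2.2 to s2.5, assuming pyhanlp for segmentation (the tool named later in the embodiment). The stop-word set, the symbol ranges and the omission of the repeated-content rule are placeholder assumptions, since the patent does not publish its exact tables:

```python
import re

# Placeholder stop-word list; the patent specifies a stop-word table
# but not its contents.
STOP_WORDS = {"的", "了", "和"}

def preprocess(text: str) -> list[str]:
    """Steps s2.2-s2.5: normalize, strip noise, segment, drop stop words."""
    # s2.2: unify character width (full-width ASCII forms -> half-width).
    text = "".join(
        chr(ord(ch) - 0xFEE0) if 0xFF01 <= ord(ch) <= 0xFF5E else ch
        for ch in text
    )
    # s2.3: remove URLs, mail addresses, whitespace and special symbols
    # (arrows U+2190-U+21FF, geometric shapes U+25A0-U+25FF); the
    # "repeated more than 10 times" rule is omitted here for brevity.
    text = re.sub(r"https?://\S+|\S+@\S+", "", text)
    text = re.sub(r"[\s\u2190-\u21FF\u25A0-\u25FF]+", "", text)
    # s2.5: word segmentation with pyhanlp, as in the embodiment.
    from pyhanlp import HanLP
    words = [term.word for term in HanLP.segment(text)]
    # s2.4: remove stop words.
    return [w for w in words if w not in STOP_WORDS]
```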
Preferably, the substeps of step S3 are as follows:
S3.1, extracting semantic information by using a bidirectional coding characterization model and a Chinese word vector model at a feature extraction layer of the deep learning model, and splicing output vectors of the language model and the word vector model;
s3.2: performing dimension reduction processing on the vector after feature extraction by using a principal component analysis algorithm;
s3.3: inputting the vector sequence subjected to the dimension reduction by the principal component analysis algorithm into a bidirectional long-short-time memory network;
s3.4: a random discarding layer is added after the bidirectional long-short time memory network layer;
S3.5: and carrying out similarity calculation on vectors output by the twin network of the deep learning model, and outputting a result of similarity or not.
Preferably, the substeps of step s3.1 are as follows:
1) The bidirectional encoding characterization model uses the encoder modules of the Transformer model, and the multi-head self-attention mechanism of the encoder module first performs the self-attention calculation:

Given the embedding matrix $X$ of an input sequence, $Q=XW^{Q}$, $K=XW^{K}$ and $V=XW^{V}$, wherein $W^{Q}$, $W^{K}$ and $W^{V}$ are learnable weight matrices, $Q$ represents the query, $K$ represents the key, $V$ represents the value, and $Q$, $K$ and $V$ are all linear transformations of the input sequence. The self-attention score is calculated as follows:

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V;$$

$$\mathrm{softmax}(x_i)=\frac{e^{x_i}}{\sum_{j=1}^{C}e^{x_j}};$$

$\mathrm{Attention}(Q,K,V)$ represents the self-attention score;

$\mathrm{softmax}$ represents the normalized exponential function;

$x_i$ represents the value of the $i$-th element of the input vector;

$K^{T}$ represents the transpose of the key $K$;

the denominator sums the exponentials of all elements of the input vector, wherein $e$ represents the natural constant and $C$ represents the vector length; for example, if the input vector is $(1,2,3)$, then $\mathrm{softmax}(1,2,3)=\left(\tfrac{e^{1}}{e^{1}+e^{2}+e^{3}},\tfrac{e^{2}}{e^{1}+e^{2}+e^{3}},\tfrac{e^{3}}{e^{1}+e^{2}+e^{3}}\right)\approx(0.090,0.245,0.665)$;

The self-attention score indicates the degree of attention paid to the information of the other words in the sequence when the $i$-th word is processed, wherein $d_k$ is the dimension of the key, used to scale the dot-product result and prevent the values entering the normalized exponential function from becoming too large. The multi-head self-attention mechanism splits the input sequence into multiple "heads", each of which independently performs the self-attention operation, then concatenates the results and performs a linear transformation through an additional weight matrix $W^{O}$, which allows the model to attend to different representation subspaces of the input sequence simultaneously:

$$\mathrm{MultiHead}(Q,K,V)=\mathrm{Concat}(\mathrm{head}_1,\dots,\mathrm{head}_N)W^{O};$$

$\mathrm{MultiHead}(Q,K,V)$ represents the multi-head self-attention scoring matrix;

$\mathrm{Concat}()$ represents the splicing operation, i.e. the values of $\mathrm{head}_1,\dots,\mathrm{head}_N$ are concatenated side by side into one matrix;

$\mathrm{head}_i$ represents the self-attention score of each "head", with $\mathrm{head}_i=\mathrm{Attention}(QW_i^{Q},KW_i^{K},VW_i^{V})$;

$N$ represents the number of heads, generally 6 to 16;
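For concreteness, here is a minimal NumPy sketch of the scaled dot-product attention and the multi-head splitting just described; the weight shapes and default head count are illustrative assumptions, not taken from the patent:

```python
import numpy as np

def softmax(x, axis=-1):
    # Normalized exponential; subtracting the max leaves the result
    # unchanged but avoids overflow.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head(X, Wq, Wk, Wv, Wo, n_heads=8):
    # Project the sequence, split into n_heads subspaces, attend in each
    # independently, then concatenate and apply the output matrix W^O.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1] // n_heads
    heads = [
        self_attention(Q[:, i*d:(i+1)*d], K[:, i*d:(i+1)*d], V[:, i*d:(i+1)*d])
        for i in range(n_heads)
    ]
    return np.concatenate(heads, axis=-1) @ Wo
```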
2) In the Chinese word vector model, the similarity function between a word and its context is defined as follows:

$$\mathrm{sim}(w,c)=\sum_{q\in S(w)}\vec{q}\cdot\vec{c};$$

wherein $\mathrm{sim}(w,c)$ is the similarity function between the word $w$ and the context $c$, $\vec{q}$ represents an $n$-gram stroke vector of the current word, $\vec{c}$ is the word vector of the corresponding context word, $S(w)$ is the $n$-gram set of the current word $w$, and $q$ is an $n$-gram stroke element of the set $S(w)$. The prediction of the context $c$ based on the word $w$ is modeled with the normalized exponential function, and the objective function is as follows:

$$p(c\mid w)=\frac{\exp(\mathrm{sim}(w,c))}{\sum_{c'\in V}\exp(\mathrm{sim}(w,c'))};$$

wherein $p(c\mid w)$ is the objective function, meaning the probability that the context $c$ occurs given that the word $w$ occurs, $c'$ is a randomly selected word in the vocabulary $V$, called a negative sample, and $\mathrm{sim}(w,c')$ represents the similarity function between the word $w$ and the randomly selected context $c'$.

The loss function based on $n$-gram strokes is as follows:

$$\mathcal{L}=-\sum_{w\in D}\sum_{c\in T(w)}\left[\log\sigma(\mathrm{sim}(w,c))+\lambda\,\mathbb{E}_{c'\sim P}\big[\log\sigma(-\mathrm{sim}(w,c'))\big]\right];$$

wherein $\sigma$ is the S-type activation function:

$$\sigma(x)=\frac{1}{1+e^{-x}};$$

$T(w)$ is the set of all words within the current word window, $D$ is all the text of the training corpus, $\lambda$ is the number of randomly selected negative words, $c'$ is a negative sample, and $P$ is the word-frequency distribution according to which the negative samples are drawn;
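As a small illustration of the stroke n-grams and the similarity function sim(w, c), the sketch below uses a tiny hand-filled stroke dictionary; the full digitization scheme is described in the embodiment further on:

```python
import numpy as np

# Illustrative subset: strokes digitized as in the embodiment
# (1 horizontal, 3 left-falling, 4 right-falling).
STROKES = {"大": "134", "人": "34"}

def stroke_ngrams(word: str, n: int) -> list[str]:
    """n-gram stroke features; e.g. 3-grams of '大人' ('13434')."""
    seq = "".join(STROKES[ch] for ch in word)
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

def sim(word_grams: list[str], gram_vecs: dict, ctx_vec: np.ndarray) -> float:
    """sim(w, c) = sum of q . c over the n-gram stroke vectors q of w."""
    return sum(float(gram_vecs[q] @ ctx_vec) for q in word_grams)

# stroke_ngrams("大人", 3) -> ["134", "343", "434"]
```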
3) Splicing the vectors after feature extraction by the bidirectional encoding characterization model and the Chinese word vector model as the input of the bidirectional long short-term memory network:

$$F(seq)=\mathrm{Concat}(\mathrm{BERT}(seq),\ \mathrm{CW2Vec}(seq));$$

wherein $seq$ is the input word sequence and $\mathrm{Concat}()$ concatenates the two vectors.
Preferably, the principal component analysis algorithm is implemented as follows:

(1) after normalizing the original data, form an N × M matrix X by columns;

(2) compute the covariance matrix C of the matrix X;

(3) compute the eigenvalues of C and their corresponding eigenvectors;

(4) take the K largest eigenvalues, arrange them from largest to smallest, and arrange their corresponding eigenvectors as the rows of a matrix P;

(5) Y = PX is the dimension-reduced data matrix, arranged by columns, of size K × M.
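A compact NumPy sketch of steps (1) to (5), following the patent's convention that X is N × M with the data arranged by columns:

```python
import numpy as np

def pca(X: np.ndarray, k: int) -> np.ndarray:
    """Reduce an N x M column-sample matrix X to the K x M matrix Y = PX."""
    X = X - X.mean(axis=1, keepdims=True)     # (1) center each row
    C = X @ X.T / X.shape[1]                  # (2) covariance matrix of X
    vals, vecs = np.linalg.eigh(C)            # (3) eigenvalues/eigenvectors
    order = np.argsort(vals)[::-1][:k]        # (4) k largest eigenvalues
    P = vecs[:, order].T                      #     eigenvectors as rows
    return P @ X                              # (5) Y = PX, size K x M
```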
Preferably, in step s3.3, the bidirectional long short-term memory network is composed of two independent long short-term memory network layers; the two layers extract the information of the input sequence from the forward and reverse directions respectively. Each long short-term memory layer contains a plurality of neurons; each neuron receives the input sequence and the hidden state of the previous time step as input and comprises three control structures, the input gate $i_t$, the forget gate $f_t$ and the output gate $o_t$, together with the cell state $C_t$ and the hidden state $h_t$:

$$i_t=\sigma(W_i\cdot[h_{t-1},x_t]+b_i);$$

$$f_t=\sigma(W_f\cdot[h_{t-1},x_t]+b_f);$$

$$o_t=\sigma(W_o\cdot[h_{t-1},x_t]+b_o);$$

$$C_t=f_t\odot C_{t-1}+i_t\odot\tanh(W_C\cdot[h_{t-1},x_t]+b_C);$$

$$h_t=o_t\odot\tanh(C_t);$$

wherein $x_t$ is the input of the time step, $h_{t-1}$ is the hidden state of the previous time step, $C_{t-1}$ is the cell state of the previous time step, and $i_t$, $f_t$, $o_t$, $C_t$ and $h_t$ are the input gate, forget gate, output gate, cell state and hidden state respectively; $\sigma$ is the S-type activation function; $W_i$ and $b_i$ represent the weight matrix and bias term of the input gate; $W_f$ and $b_f$ represent the weight matrix and bias term of the forget gate; $W_o$ and $b_o$ represent the weight matrix and bias term of the output gate; $W_C$ and $b_C$ represent the weight matrix and bias term of the cell state; $\tanh$ is the hyperbolic tangent activation function:

$$\tanh(x)=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}};$$

The forward and backward transfer process of the bidirectional long short-term memory network is:

$$\overrightarrow{h_t}=\mathrm{LSTM}(x_t,\overrightarrow{h_{t-1}});$$

$$\overleftarrow{h_t}=\mathrm{LSTM}(x_t,\overleftarrow{h_{t+1}});$$

wherein $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ represent the semantic information of the time step extracted from the forward and backward directions respectively, $\overrightarrow{h_{t-1}}$ is the forward hidden state at time $t-1$, and $\overleftarrow{h_{t+1}}$ is the backward hidden state at time $t+1$; finally the forward and backward long short-term memory networks are spliced to obtain the output at time $t$:

$$h_t=\mathrm{Concat}(\overrightarrow{h_t},\overleftarrow{h_t});$$

$\mathrm{Concat}()$ represents concatenating the two vectors; $\overrightarrow{h_t}$ represents the forward hidden state at time $t$ and $\overleftarrow{h_t}$ represents the backward hidden state at time $t$.
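A minimal NumPy sketch of one LSTM time step and the bidirectional pass defined by the equations above; the weight dictionaries, their shapes and the initial states are left to the caller:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One time step; W/b hold the input, forget, output and cell
    parameters applied to the concatenation [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x_t])
    i_t = sigmoid(W["i"] @ z + b["i"])                        # input gate
    f_t = sigmoid(W["f"] @ z + b["f"])                        # forget gate
    o_t = sigmoid(W["o"] @ z + b["o"])                        # output gate
    c_t = f_t * c_prev + i_t * np.tanh(W["c"] @ z + b["c"])   # cell state
    h_t = o_t * np.tanh(c_t)                                  # hidden state
    return h_t, c_t

def bilstm(xs, Wf, bf, Wb, bb, h0, c0):
    """Forward and backward passes, concatenating h_t at every step."""
    fwd, h, c = [], h0, c0
    for x in xs:                      # forward direction
        h, c = lstm_step(x, h, c, Wf, bf)
        fwd.append(h)
    bwd, h, c = [], h0, c0
    for x in reversed(xs):            # backward direction
        h, c = lstm_step(x, h, c, Wb, bb)
        bwd.append(h)
    bwd.reverse()
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```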
Preferably, in step s3.5, the output layer of the deep learning model calculates the distance between vector A and vector B using cosine similarity; the cosine similarity formula is as follows:

$$\cos(A,B)=\frac{A\cdot B}{\|A\|\,\|B\|};$$

The binary cross-entropy function binary_crossentropy is used as the loss function:

$$L=-\frac{1}{n}\sum_{i=1}^{n}\big[y_i\log(p_i)+(1-y_i)\log(1-p_i)\big];$$

wherein $n$ is the number of samples, i.e. the number of data rows of the training set, $y_i$ is the similar/dissimilar label value (0 or 1) of the $i$-th comparison pair, and $p_i$ is the probability that the $i$-th comparison pair is predicted to be similar.
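For reference, the two formulas above take only a few lines of NumPy; the epsilon clip is a numerical-stability addition, not part of the patent:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # cos(A, B) = A . B / (|A| |B|)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def binary_cross_entropy(y: np.ndarray, p: np.ndarray, eps=1e-7) -> float:
    # L = -(1/n) * sum(y log p + (1 - y) log(1 - p))
    p = np.clip(p, eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))
```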
Preferably, after step S4 is completed, step S5 is performed: according to the output results of the model and the decision rule preset by the bidding agency or the bidding party, the list of suppliers requiring checking and the portions of the bid documents with suspiciously high similarity are pushed to the bidding agency or the bidding party for checking, so as to detect bid-rigging behavior.
A processing module executes program instructions, the program instructions implementing the detection method of the bidding text similarity detection system.
The invention has the following beneficial effects:
the invention can collect different parts of the suppliers' bid documents and detect text similarity according to the actual bidding business; the accuracy of the text similarity judgment is high, which avoids wasting excessive labor cost on checking suppliers suspected of bid-rigging; and the deep learning based model can be trained continuously as system data accumulates during actual use, improving the accuracy of model recognition.
Drawings
The invention is further illustrated by the following examples in conjunction with the accompanying drawings:
FIG. 1 is a schematic diagram of a system for detecting similarity of bidding texts;
FIG. 2 is a diagram of a method for detecting similarity of bidding text;
FIG. 3 is a schematic diagram of a deep learning model according to the present invention.
In the figure: the bid document generation client 1, the information collection module 101, the encryption uploading module 102, the online detection subsystem 2, the decryption module 201, the text analysis module 202, the data storage module 203 and the information pushing module 204.
Detailed Description
Example 1:
As shown in FIG. 1, the bidding text similarity detection system comprises a bidding document generation client, wherein the bidding document generation client is electrically connected with an online detection subsystem; the bidding document generation client comprises an information collection module and an encryption uploading module; the online detection subsystem comprises a data storage module, a decryption module, a text analysis module and an information push module; the encryption uploading module is electrically connected with the data storage module;
the information collection module is used for collecting original text data of a bidding text;
the encryption uploading module is used for encrypting and uploading the original text data to the data storage module;
The data storage module is used for storing original text data and numbering texts before storing;
The decryption module is used for reading bidding text fragments to be analyzed from the original text data in the data storage module, decrypting the bidding text fragments, and transmitting the decrypted data to the text analysis module;
the text analysis module is used for carrying out data preprocessing and text analysis;
the information pushing module is used for reading data from the data storage module and pushing the text portions whose detected similarity is too high to the reviewing party for checking, so as to detect bid-rigging behavior; similarity being "too high" means: exceeding a set similarity threshold.
The bid document generation client collects text data according to the different parts of the bid document. For example, a bidding project may require the bid document to contain a quotation table, performance proofs, a technical scheme, a security scheme, an after-sales service scheme, a project team configuration, a deviation table, qualification review materials and the like, and the supplier needs to enter the corresponding information for each part at the designated position in the client.
The information gathering module is used for gathering the raw text data input by the supplier. Under the relevant legal requirements, the bidding text must be stored in encrypted form before bid opening; the encryption uploading module encrypts and uploads the original text data, and the data storage module stores the original text data, numbering the texts of the different parts before storage. Before bid opening, the decryption module reads the bidding text fragments to be analyzed from the data storage module, decrypts them, and transmits the decrypted data to the text analysis module. The data storage module stores the determination results. For example, for supplier A's technical service scheme text, if it is similar to supplier B's technical service scheme text, the text number of supplier B's scheme is stored in the similarity determination field; if supplier A's text is similar to the schemes of several suppliers, the numbers of all similar texts are stored in the field in sequence, separated by commas. The information pushing module reads the data from the data storage module and pushes the supplier list and the portions of the bid documents with too-high detected similarity to the bidding agency or bidding party for checking, so as to detect bid-rigging behavior.
The collected text data is encrypted and uploaded to the data storage module of the online detection subsystem, which numbers the texts of the different parts and stores them in a database before storing the original text data. Before bid opening, the decryption module reads the data from the data storage module and sends the decrypted data to the text analysis module. After processing by the deep learning model in the text analysis module, a similar/dissimilar result is output for each pair of texts and stored in the data storage module. After all bidding texts of all bidding suppliers have been examined, the information pushing module reads the result data from the data storage module and pushes the list of suppliers requiring checking and the portions of the bid documents with too-high detected similarity to the bidding agency or bidding party for checking, so as to detect bid-rigging behavior.
The information pushing module pushes information by way of in-site messages, and the bidding agency or bidding party can download the result file online.
Example 2:
As shown in FIGS. 2-3, the deep learning model has a twin network structure: the two branches are identical in structure and share parameters. The specific structure is: input layer -> feature extraction layer -> PCA dimension reduction -> BiLSTM layer -> Dropout layer -> output layer. In FIG. 3, the bottom row denotes the input word segments; the vectors obtained after feature extraction and dimension reduction are the inputs of the LSTM time steps; the forward LSTM layer and the backward LSTM layer each contain their own neurons; and the output of a time step is spliced from the outputs of the forward and backward LSTM. The superscripts A and B in FIG. 3 denote the two different sentence sequences. LSTM denotes the long short-term memory network; BiLSTM denotes the bidirectional long short-term memory network; the BiLSTM layer denotes the bidirectional long short-term memory network layer; the Dropout layer denotes the random discarding layer; CW2Vec denotes the Chinese word vector model; PCA denotes the principal component analysis algorithm; BERT denotes the bidirectional encoding characterization model. The feature extraction layer splices the result vectors processed by the BERT model and the CW2Vec model.
The method comprises the following steps:
S1: raw text data of the bid document, i.e., bid text, entered by the supplier is collected and uploaded in an encrypted manner.
In specific use, the supplier uploads the different parts of the bid document through the bid document generation client. For example, when uploading a bid document, supplier A needs to upload the text of the bid letter, corporate qualifications, performance proofs, the bid quotation table of the price part, and, in the business part, the technical scheme, the acceptance scheme, the project team service configuration and other parts. After all parts have been uploaded, the client automatically generates the finished PDF bid document, and the encryption module encrypts the original text information and uploads it to the data storage module. Before the data storage module stores the bidding text data, it numbers the texts of the different parts and stores them in the database. In this embodiment, the numbering rule is: a four-digit bid document part number + a four-digit supplier number + a three-digit serial number.
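Under the stated rule the numbering is straightforward to sketch; the concrete part and supplier numbers below are hypothetical:

```python
def text_number(part_no: int, vendor_no: int, seq_no: int) -> str:
    """Four-digit part number + four-digit supplier number
    + three-digit serial number, per this embodiment's rule."""
    return f"{part_no:04d}{vendor_no:04d}{seq_no:03d}"

# e.g. part 3, supplier 12, first fragment -> "00030012001"
assert text_number(3, 12, 1) == "00030012001"
```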
S2: decrypting the bidding text before opening the bid, and performing data preprocessing on the bidding text.
And when the bid opening time published in the bid advertisement is reached, the decryption module automatically decrypts the vendor bid information and performs data preprocessing. In this embodiment, the data preprocessing steps are as follows:
s2.1, selecting the bidding texts to be compared according to the scheme preset by the bidding agency or bidding party; for example, only the performance section, the technical service scheme and the after-sales service scheme are selected for detection.

s2.2, unifying the character encoding: English characters are unified in case, Chinese characters are unified to simplified Chinese, and full-width characters are unified to half-width characters.

s2.3, removing useless information other than common punctuation marks, such as special symbols, spaces, line-break characters, URLs (uniform resource locators), mail addresses and frequently occurring repeated content; "frequently occurring" means the same piece of text appears more than 10 times, and special symbols include geometric shapes, arrows and emoticons.

s2.4, removing the stop words related to the bid document text using a stop word list; removing stop words generally does not affect the understanding of a sentence and reduces the computational load of the model.

s2.5, this embodiment uses the Chinese word segmentation tool in pyhanlp to segment the bid texts to be compared and inputs the segmentation results into the CW2Vec model. pyhanlp, a natural language processing library for the Python programming language, integrates a syntactic analyzer that works well, and the tool performs well on natural language processing tasks such as Chinese word segmentation, part-of-speech tagging and named entity recognition. Before text enters the BERT model, a [CLS] tag is added at the head of each sentence and a [SEP] tag at its tail, as sketched below.
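A short sketch of this step, assuming pyhanlp's HanLP.segment interface:

```python
from pyhanlp import HanLP  # Python wrapper around the HanLP toolkit

def segment(sentence: str) -> list[str]:
    """Word segmentation with pyhanlp, as used in this embodiment."""
    return [term.word for term in HanLP.segment(sentence)]

def add_bert_tags(sentence: str) -> str:
    """BERT input format: [CLS] at the sentence head, [SEP] at the tail."""
    return "[CLS]" + sentence + "[SEP]"
```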
S3: as shown in fig. 3, a deep learning model is constructed and trained using a training set.
After the preprocessing step is completed, the deep learning model in the text analysis module starts to analyze word segmentation results. In this embodiment, the steps of model construction and training are as follows:
s3.1, at the feature extraction layer, this embodiment uses the pre-trained language model BERT together with the word vector model CW2Vec, which is suited to Chinese, to extract semantic information, and splices the output vectors of the two models.
1) The bidirectional encoding characterization model BERT uses a bidirectional Transformer model. BERT is trained without supervision on large amounts of task-independent text and learns the general meaning of language through two tasks. One is the masked language model (English: Masked Language Model): the task randomly replaces a portion of the words with [MASK], a label used to mask words, and then tries to predict the masked words from the context, thereby learning the associations between words and the contextual representation; in this embodiment, 15% of the characters in a sentence are randomly selected for masking. The other task is next-sentence prediction: two sentences are selected at random and the model judges whether they are consecutive, thereby learning a sentence-level representation.
The input of the BERT model is the concatenation of the word vector, the text vector and the position vector. In this embodiment, the word vector of a single Chinese character is randomly initialized, and the values of the text vector and the position vector are learned automatically during model training. The final output of the BERT model is a 768-dimensional text vector.
BERT uses the encoder module of the Transformer model, and the encoder's multi-head self-attention mechanism first performs the self-attention calculation: given the embedding matrix $X$ of the input sequence, $Q=XW^{Q}$, $K=XW^{K}$ and $V=XW^{V}$, wherein $W^{Q}$, $W^{K}$ and $W^{V}$ are learnable weight matrices; the query ($Q$, English: Query), the key ($K$, English: Key) and the value ($V$, English: Value) are all linear transformations of the input sequence.

The self-attention score is calculated as follows:

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V;$$

The self-attention score indicates the degree of attention paid to the information of the other words in the sequence when the $i$-th word is processed, wherein $d_k$ is the dimension of the key, used to scale the dot-product result and prevent the values entering the softmax function from becoming too large. The multi-head self-attention mechanism splits the input sequence into multiple heads, each of which performs the self-attention operation independently; the results are then concatenated and linearly transformed through an additional weight matrix $W^{O}$, allowing the model to attend to different representation subspaces of the input sequence simultaneously:

$$\mathrm{MultiHead}(Q,K,V)=\mathrm{Concat}(\mathrm{head}_1,\dots,\mathrm{head}_N)W^{O};$$

wherein:

$$\mathrm{head}_i=\mathrm{Attention}(QW_i^{Q},KW_i^{K},VW_i^{V});$$
In this embodiment, a BERT model with 12 layers of Transformer encoders is used; the vocabulary size is 28996 and each attention layer has 12 heads.
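Such a configuration could be instantiated, for example, with the Hugging Face transformers library; this is an assumption for illustration, since the patent does not name an implementation:

```python
from transformers import BertConfig, BertModel

# 12 Transformer encoder layers, vocabulary size 28996, 12 attention
# heads per layer, 768-dimensional hidden (output) vectors.
config = BertConfig(
    vocab_size=28996,
    num_hidden_layers=12,
    num_attention_heads=12,
    hidden_size=768,
)
model = BertModel(config)  # randomly initialized; in practice a
                           # pre-trained checkpoint would be loaded
```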
2) The stroke structure and stroke order of Chinese characters carry a great deal of semantic information; for example, the characters for wood, woods and forest (木, 林, 森) have a certain similarity both in meaning and in strokes. The CW2Vec model uses the stroke structure information of Chinese characters to obtain word vectors and thereby extracts the semantic information of Chinese. In this embodiment, the steps for constructing word vectors with the CW2Vec model are as follows:
Stroke feature acquisition: obtain the stroke features of the words produced by the segmentation of step s2.5. For example, the character for "big" (大) contains the strokes horizontal, left-falling and right-falling, while the character for "person" (人) contains left-falling and right-falling.

Stroke feature digitization: each stroke type is represented by a digit. In the example above, the stroke information of "big" and "person" is digitized as "134" and "34", and the stroke information of the word "adult" (大人) is "13434".

N-gram stroke features: set a sliding window of size n and extract the n-gram features of the word's stroke sequence in order. For example, the 3-gram features extracted for the word "adult" are "134", "343" and "434"; the word's stroke sequence is then converted into vectors by one-hot encoding.
In the CW2Vec model, the similarity function between a word and its context is defined as follows:

$$\mathrm{sim}(w,c)=\sum_{q\in S(w)}\vec{q}\cdot\vec{c};$$

wherein $\vec{q}$ represents an $n$-gram stroke vector of the current word, $\vec{c}$ is the word vector of the corresponding context word, $S(w)$ is the $n$-gram set of the current word $w$, and $q$ is an $n$-gram element of $S(w)$. The prediction of the context $c$ based on the word $w$ is modeled with a softmax function, and the objective function is as follows:

$$p(c\mid w)=\frac{\exp(\mathrm{sim}(w,c))}{\sum_{c'\in V}\exp(\mathrm{sim}(w,c'))};$$

wherein $c'$ is a randomly selected word, called a "negative sample". The loss function based on $n$-gram strokes is as follows:

$$\mathcal{L}=-\sum_{w\in D}\sum_{c\in T(w)}\left[\log\sigma(\mathrm{sim}(w,c))+\lambda\,\mathbb{E}_{c'\sim P}\big[\log\sigma(-\mathrm{sim}(w,c'))\big]\right];$$

wherein $\sigma$ is the Sigmoid function:

$$\sigma(x)=\frac{1}{1+e^{-x}};$$

$T(w)$ is the set of all words within the current word window, $D$ is all the text of the training corpus, $\lambda$ is the number of randomly selected negative words, $c'$ is a negative sample, and $P$ is the word-frequency distribution according to which the negative samples are drawn.
3) In this embodiment, the vectors after feature extraction by the BERT model and the CW2Vec model are spliced as the input of the BiLSTM network:

$$F(seq)=\mathrm{Concat}(\mathrm{BERT}(seq),\ \mathrm{CW2Vec}(seq));$$

wherein $seq$ is the input word sequence.
S3.2: and performing dimension reduction processing on the vector after the feature extraction by using a PCA algorithm.
The BERT model output dimension is 768, and after splicing with the output of the CW2Vec model the dimension reaches the thousands. Reducing the dimension lowers the computational complexity, effectively removes data noise, and improves model accuracy. The PCA algorithm is implemented as follows:
(1) after normalizing the original data, form an N × M matrix X by columns;

(2) compute the covariance matrix C of the matrix X;

(3) compute the eigenvalues of C and their corresponding eigenvectors;

(4) take the K largest eigenvalues, arrange them from largest to smallest, and arrange their corresponding eigenvectors as the rows of a matrix P;

(5) Y = PX is the dimension-reduced data matrix, arranged by columns, of size K × M.
S3.3: the vector sequence after PCA dimension reduction is input to a bi-directional long and short term memory network BiLSTM.
BiLSTM consists of two separate LSTM layers, each layer extracting information of the input sequence from forward and reverse pairs, respectively, each LSTM layer containing a plurality of neurons, each neuron receiving as input the input sequence and the hidden state of the previous time step, each neuron comprising three control structures: input doorForgetful doorOutput doorAnd cell statusHidden state。
;
;
;
;
;
Wherein,Is an input of a time step and,Is the hidden state of the last time step,Is the state of the cells in the last time step,、、、、Respectively an input door, a forgetting door, an output door, a cell state and a hidden state; In order to activate the function in the form of an S, And (3) withRespectively representing a weight matrix and a bias term of an input gate; And (3) with Respectively representing a weight matrix and a bias term of the forgetting gate; And (3) with Respectively representing a weight matrix and a bias term of the output gate; And (3) with A weight matrix and a bias term respectively representing cell states; tanh is the hyperbolic tangent activation function:
;
The LSTM forward and backward transfer procedure is:
;
;
wherein the method comprises the steps of Is the hidden state at the time of forward t-1,Is hidden state at reverse t+1 moment, and finally the forward and backward LSTM are spliced to obtain output at t moment
;
S3.4: this embodiment also adds a Dropout layer after the bi-directional LSTM layer.
The Dropout layer randomly discards a portion of neurons so that the model can continue to learn from other information after losing some information, enhancing the robustness of the model, preventing overfitting. In this embodiment, the ratio of Dropout is set to 0.3.
And S3.5, carrying out similarity calculation on the vector output by the twin network, and outputting a result of similarity or not.
In this embodiment, the model output layer calculates the vector distance using cosine similarity; the cosine similarity formula is as follows:

$$\cos(A,B)=\frac{A\cdot B}{\|A\|\,\|B\|};$$

The binary cross-entropy function binary_crossentropy is used as the loss function:

$$L=-\frac{1}{n}\sum_{i=1}^{n}\big[y_i\log(p_i)+(1-y_i)\log(1-p_i)\big];$$

wherein $n$ is the number of samples, i.e. the number of data rows of the training set, $y_i$ is the similar/dissimilar label value (0 or 1) of the $i$-th comparison pair, and $p_i$ is the probability that the $i$-th comparison pair is predicted to be similar.
The model uses the Adam optimization algorithm, which combines momentum and an adaptive learning rate; the learning rate is set to 0.0001, and to avoid gradient explosion the gradient clipping threshold is set to the range -5 to 5.
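Putting the embodiment's settings together, the following Keras sketch shows one possible twin-network head: a shared BiLSTM encoder with Dropout 0.3, a cosine-similarity output, binary cross-entropy loss, and Adam with learning rate 0.0001 and clipping at 5. The sequence length, input dimension and hidden size are assumptions, and rescaling the cosine into [0, 1] is an implementation choice the patent does not specify:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

SEQ_LEN, DIM = 128, 256  # assumed length / PCA-reduced dimension

def make_encoder() -> Model:
    """Shared BiLSTM -> Dropout branch used by both twin inputs."""
    inp = layers.Input(shape=(SEQ_LEN, DIM))
    h = layers.Bidirectional(layers.LSTM(128))(inp)  # assumed hidden size
    h = layers.Dropout(0.3)(h)                       # ratio from embodiment
    return Model(inp, h)

encoder = make_encoder()  # one set of weights shared by both branches
xa = layers.Input(shape=(SEQ_LEN, DIM))
xb = layers.Input(shape=(SEQ_LEN, DIM))
va, vb = encoder(xa), encoder(xb)

# Cosine similarity in [-1, 1], rescaled to [0, 1] as a probability.
cos = layers.Dot(axes=1, normalize=True)([va, vb])
prob = layers.Lambda(lambda c: (c + 1.0) / 2.0)(cos)

model = Model([xa, xb], prob)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4, clipvalue=5.0),
    loss="binary_crossentropy",
)
```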
In this embodiment, the model is trained on LCQMC, the open-source Chinese text similarity dataset from Harbin Institute of Technology. Preferably, the data storage module can also store text data that has passed manual review, dynamically enlarging the training set.

Some sample data from the LCQMC dataset are shown in Table 1:
Table 1 training set samples:
The following are two example pieces of bidding text whose similarity is to be detected:
"after receiving the user service maintenance call, the information is fed back to the responsible person of the after-sales service department at the first time. "
"I will feed back the user maintenance information to the after-market responsible person the first time. "
S4: after the deep learning model is trained, the preprocessed bid texts of all bidding suppliers are combined pairwise, technical scheme with technical scheme or service scheme with service scheme, and each pair is input into the trained deep learning model to obtain a prediction result.
For example, if the service scheme part of the bid documents needs to be detected, those parts from all suppliers are combined pairwise and input into the model to obtain the prediction results: 1 for similar, 0 for dissimilar. The pairing is sketched below.
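The pairwise combination itself can be sketched with itertools.combinations; the supplier ids and dictionary layout are illustrative:

```python
from itertools import combinations

def scheme_pairs(schemes: dict[str, str]):
    """Pair like parts with like parts: every two suppliers' texts for
    the same bid-document part (e.g. the service scheme)."""
    for (id_a, text_a), (id_b, text_b) in combinations(schemes.items(), 2):
        yield (id_a, id_b), (text_a, text_b)

# e.g. scheme_pairs({"A": "...", "B": "...", "C": "..."}) yields
# the pairs (A, B), (A, C) and (B, C)
```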
Finally, a confusion matrix of all text similarity judgments is obtained, and the data storage module stores the determination results. For example, for supplier A's technical service scheme text, if it is similar to supplier B's technical service scheme text, the text number of supplier B's scheme is stored in the similarity determination field; if supplier A's text is similar to the schemes of several suppliers, the numbers of all similar texts are stored in the field in sequence, separated by commas.
After step S4 is completed, step S5 is performed: according to the output results of the model and the decision rule preset by the bidding agency or bidding party, the list of suppliers requiring checking and the portions of the bid documents with suspiciously high similarity are pushed to the bidding agency or bidding party for checking, so as to detect bid-rigging behavior.
In this embodiment, the similarity determination rule is preset by the bidding agency or the bidding party before the bidding text is collected. For example, when the technical service scheme, the after-sales service scheme and the performance proof are all judged similar, the bid documents are determined to be similar and suspected of bid-rigging, so review is required.
The above embodiments are only preferred embodiments of the present invention and should not be construed as limiting it; the scope of protection of the present invention is defined by the claims, including equivalents of the technical features in the claims; that is, equivalent replacements and modifications within this scope also fall within the scope of the invention.
Claims (9)
1. A bidding text similarity detection system is characterized in that: the system comprises a bidding document generation client, wherein the bidding document generation client is electrically connected with an online detection subsystem; the bidding document generation client comprises an information collection module and an encryption uploading module; the online detection subsystem comprises a data storage module, a decryption module, a text analysis module and an information push module; the encryption uploading module is electrically connected with the data storage module;
the information collection module is used for collecting original text data of a bidding text;
the encryption uploading module is used for encrypting and uploading the original text data to the data storage module;
the data storage module is used for storing original text data and numbering the original text data before storing;
The decryption module is used for reading bidding text fragments to be analyzed from the original text data in the data storage module, decrypting the bidding text fragments, and transmitting the decrypted data to the text analysis module;
the text analysis module is used for carrying out data preprocessing and text analysis;
the information pushing module is used for reading data from the data storage module and pushing the text portions whose detected similarity is too high to the reviewing party for checking, so as to detect bid-rigging behavior; similarity being "too high" means: exceeding a set similarity threshold.
2. The method for detecting the similarity of bidding texts according to claim 1, comprising the following steps:
s1: collecting bidding texts input by suppliers, encrypting and uploading;
S2: decrypting the bidding text before opening the bid, and carrying out data preprocessing on the bidding text;
s3: constructing a deep learning model, and training the deep learning model by using a training set;
s4: after the deep learning model is trained, the preprocessed bid texts of all bidding suppliers are combined pairwise, technical scheme with technical scheme or service scheme with service scheme, and each pair is input into the trained deep learning model to obtain a prediction result.
3. The method for detecting the similarity of bidding texts according to claim 2, characterized in that the substeps of step S2 are as follows:
s2.1, selecting bidding texts to be compared according to a bidding agency or bidding party preset scheme;
s2.2, unifying character codes of the bidding text;
s2.3, removing useless information, wherein the useless information comprises special symbols, spaces, line feed symbols, uniform resource locators, mail addresses and frequently-occurring repeated contents; wherein, frequent occurrence refers to: the text of the same section repeatedly appears for more than 10 times; special symbols include geometric shapes, arrows, and emoticons;
s2.4, removing the stop words related to the bid document text using a stop word list, the stop words being common function words that carry little meaning on their own;
and s2.5, word segmentation processing is carried out on the bid texts to be compared, and word segmentation results are input into a word vector model.
4. The method for detecting the similarity of bidding texts according to claim 2, characterized in that the substeps of step S3 are as follows:
S3.1, extracting semantic information by using a bidirectional coding characterization model and a Chinese word vector model at a feature extraction layer of the deep learning model, and splicing output vectors of the language model and the word vector model;
s3.2: performing dimension reduction processing on the vector after feature extraction by using a principal component analysis algorithm;
s3.3: inputting the vector sequence subjected to the dimension reduction by the principal component analysis algorithm into a bidirectional long-short-time memory network;
s3.4: a random discarding layer is added after the bidirectional long-short time memory network layer;
S3.5: and carrying out similarity calculation on vectors output by the twin network of the deep learning model, and outputting a result of similarity or not.
5. The method for detecting the similarity of bidding texts according to claim 4, characterized in that the substeps of step s3.1 are as follows:

1) the bidirectional encoding characterization model uses the encoder modules of the Transformer model, and the multi-head self-attention mechanism of the encoder module first performs the self-attention calculation:

given the embedding matrix $X$ of an input sequence, $Q=XW^{Q}$, $K=XW^{K}$ and $V=XW^{V}$, wherein $W^{Q}$, $W^{K}$ and $W^{V}$ are learnable weight matrices, $Q$ represents the query, $K$ represents the key, $V$ represents the value, and $Q$, $K$ and $V$ are all linear transformations of the input sequence; the self-attention score is calculated as follows:

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V;$$

$$\mathrm{softmax}(x_i)=\frac{e^{x_i}}{\sum_{j=1}^{C}e^{x_j}};$$

wherein $\mathrm{Attention}(Q,K,V)$ represents the self-attention score and $K^{T}$ represents the transpose of the key $K$;

$\mathrm{softmax}$ represents the normalized exponential function;

$x_i$ represents the value of the $i$-th element of the input vector; the denominator sums the exponentials of all elements of the input vector, wherein $e$ represents the natural constant and $C$ represents the vector length;

the self-attention score indicates the degree of attention paid to the information of the other words in the sequence when the $i$-th word is processed, wherein $d_k$ is the dimension of the key; the multi-head self-attention mechanism splits the input sequence into a plurality of heads, each of which independently performs the self-attention operation, then concatenates the results and performs a linear transformation through an additional weight matrix $W^{O}$, allowing the model to attend to different representation subspaces of the input sequence simultaneously:

$$\mathrm{MultiHead}(Q,K,V)=\mathrm{Concat}(\mathrm{head}_1,\dots,\mathrm{head}_N)W^{O};$$

wherein $\mathrm{MultiHead}(Q,K,V)$ represents the multi-head self-attention scoring matrix;

$\mathrm{Concat}()$ represents the splicing operation, i.e. the values of $\mathrm{head}_1,\dots,\mathrm{head}_N$ are concatenated side by side into one matrix;

$\mathrm{head}_i$ represents the self-attention score of each "head", with $\mathrm{head}_i=\mathrm{Attention}(QW_i^{Q},KW_i^{K},VW_i^{V})$;

$N$ represents the number of heads;
2) In the Chinese word vector model, the similarity function between a word and its context is defined as follows:
sim(w, c) = \sum_{q \in S(w)} \vec{q} · \vec{c};
where sim(w, c) is the similarity function between the word w and the context c, \vec{q} represents an n-gram stroke vector of the current word, \vec{c} is the word vector of the corresponding context word, S(w) is the n-gram stroke set of the current word w, and q is an n-gram stroke element of the set S(w). The prediction of the context c based on the word w is modeled by simulation with a normalized exponential function; the objective function is as follows:
p(c | w) = e^{sim(w, c)} / \sum_{c' \in V} e^{sim(w, c')};
where p(c | w) is the objective function, meaning the probability that the context c occurs given the word w; c' is a word randomly selected from the vocabulary V, called a negative sample; sim(w, c') represents the similarity function between the word w and the randomly selected context c'.
The loss function based on n-gram strokes is as follows:
L = - \sum_{w \in D} \sum_{c \in T(w)} [ \log \sigma(sim(w, c)) + \lambda E_{c' \sim P} \log \sigma(-sim(w, c')) ];
where \sigma is the S-type (sigmoid) activation function:
\sigma(x) = 1 / (1 + e^{-x});
T(w) is the set of all words within the current word window, D is all the text of the training corpus, \lambda is the number of randomly selected negative samples, and the expectation E_{c' \sim P} indicates that the negative samples c' are drawn according to the word-frequency distribution P (a numpy sketch follows this subsection).
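Again purely illustrative, a minimal numpy sketch of this stroke n-gram similarity and its negative-sampling loss for a single (word, context) pair. The n-gram keys and the ngram_vecs table are hypothetical, and the extraction of stroke n-grams from Chinese characters is omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sim(word_ngrams, context_vec, ngram_vecs):
    """sim(w, c): sum of dot products between the word's stroke n-gram vectors and c."""
    return sum(ngram_vecs[q] @ context_vec for q in word_ngrams)

def pair_loss(word_ngrams, context_vec, neg_context_vecs, ngram_vecs):
    """Negative-sampling loss for one (word, context) pair:
    -log sigma(sim(w, c)) - sum over negatives c' of log sigma(-sim(w, c'))."""
    loss = -np.log(sigmoid(sim(word_ngrams, context_vec, ngram_vecs)))
    for cv in neg_context_vecs:          # negatives drawn by word-frequency distribution
        loss -= np.log(sigmoid(-sim(word_ngrams, cv, ngram_vecs)))
    return float(loss)

# Hypothetical tiny example: two stroke n-grams, one context word, one negative sample.
ngram_vecs = {"ng1": np.array([0.1, 0.2]), "ng2": np.array([0.3, -0.1])}
print(pair_loss(["ng1", "ng2"], np.array([0.5, 0.5]), [np.array([-0.2, 0.4])], ngram_vecs))
```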
3) The vectors obtained after feature extraction by the bidirectional coding characterization model and the Chinese word vector model are spliced to serve as the input of the bidirectional long-short-time memory network:
f(seq) = Concat(BERT(seq), CW2Vec(seq));
where seq is the input word sequence, Concat(·) performs the series splicing operation on the two vectors, BERT represents the bidirectional coding characterization model, and CW2Vec represents the Chinese word vector model.
6. The method for detecting the similarity of bidding texts according to claim 4, wherein the principal component analysis algorithm is implemented through the following steps (a numpy sketch follows this list):
(1) after normalizing the original data, form a matrix X of size N × M by columns;
(2) compute the covariance matrix C of the matrix X;
(3) compute the eigenvalues of the matrix C and their corresponding eigenvectors;
(4) take the K largest eigenvalues, arrange them from large to small, and arrange the corresponding eigenvectors by rows into a matrix P;
(5) Y = PX is the dimension-reduced data matrix, arranged by columns, with matrix size K × M.
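A minimal numpy sketch of steps (1)-(5), assuming the columns of X are samples and the rows are features:

```python
import numpy as np

def pca_reduce(X, k):
    """X: (N features, M samples), already normalized per step (1). Returns Y = PX, (k, M)."""
    X = X - X.mean(axis=1, keepdims=True)     # center each feature row
    C = np.cov(X)                             # (2) covariance matrix C, shape (N, N)
    eigvals, eigvecs = np.linalg.eigh(C)      # (3) eigen-decomposition (C is symmetric)
    order = np.argsort(eigvals)[::-1][:k]     # (4) indices of the k largest eigenvalues
    P = eigvecs[:, order].T                   # corresponding eigenvectors as rows of P
    return P @ X                              # (5) Y = PX, shape (k, M)

rng = np.random.default_rng(0)
print(pca_reduce(rng.standard_normal((10, 50)), k=3).shape)  # (3, 50)
```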
7. The method for detecting the similarity of bidding texts according to claim 4, wherein in step S3.3 the bidirectional long-short-time memory network is composed of two independent long-short-time memory network layers, each layer extracting information of the input sequence in the forward and reverse directions respectively. Each long-short-time memory network layer contains a plurality of neurons; each neuron receives the input sequence and the hidden state of the previous time step as input, and comprises three control structures, the input gate i_t, the forget gate f_t and the output gate o_t, together with the cell state c_t and the hidden state h_t:
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i);
f_t = \sigma(W_f [h_{t-1}, x_t] + b_f);
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o);
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_c [h_{t-1}, x_t] + b_c);
h_t = o_t ⊙ tanh(c_t);
where x_t is the input of the current time step, h_{t-1} is the hidden state of the previous time step, c_{t-1} is the cell state of the previous time step, and i_t, f_t, o_t and h_t are the input gate, the forget gate, the output gate, the cell state and the hidden state respectively; \sigma is the S-type activation function; W_i and b_i respectively represent the weight matrix and bias term of the input gate; W_f and b_f respectively represent the weight matrix and bias term of the forget gate; W_o and b_o respectively represent the weight matrix and bias term of the output gate; W_c and b_c respectively represent the weight matrix and bias term of the cell state; tanh is the hyperbolic tangent activation function:
tanh(x) = (e^x - e^{-x}) / (e^x + e^{-x});
where e^x and e^{-x} respectively represent the natural constant e raised to the power x and to the power -x. The forward and backward transfer processes of the bidirectional long-short-time memory network are as follows:
\overrightarrow{h}_t = LSTM_{fwd}(x_t, \overrightarrow{h}_{t-1});
\overleftarrow{h}_t = LSTM_{bwd}(x_t, \overleftarrow{h}_{t+1});
where \overrightarrow{h}_t and \overleftarrow{h}_t represent the semantic information of time step t extracted in the forward and backward directions respectively, \overrightarrow{h}_{t-1} is the hidden state at forward time t-1, and \overleftarrow{h}_{t+1} is the hidden state at reverse time t+1. Finally, the outputs of the forward and backward long-short-time memory networks are spliced to obtain the output at time t:
h_t = Concat(\overrightarrow{h}_t, \overleftarrow{h}_t);
where Concat(·) concatenates the two vectors, \overrightarrow{h}_t indicating the hidden state at forward time t and \overleftarrow{h}_t the hidden state at reverse time t (a numpy sketch follows this claim).
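For illustration, a minimal numpy sketch of one LSTM step and the bidirectional pass. Stacking the four gate pre-activations into a single weight matrix W of shape (4·hidden, hidden + input) is an assumption of this sketch, not claim language.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One step: W stacks the i, f, o and candidate-cell blocks over [h_prev; x_t]."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    i, f, o, g = np.split(z, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    c_t = f * c_prev + i * g          # cell state update
    h_t = o * np.tanh(c_t)            # hidden state
    return h_t, c_t

def bilstm(xs, W_f, b_f, W_b, b_b, hidden):
    """Run forward and backward LSTMs over xs, concatenating hidden states per step."""
    def run(seq, W, b):
        h, c, out = np.zeros(hidden), np.zeros(hidden), []
        for x in seq:
            h, c = lstm_step(x, h, c, W, b)
            out.append(h)
        return out
    fwd = run(xs, W_f, b_f)
    bwd = run(xs[::-1], W_b, b_b)[::-1]   # reverse pass, re-aligned to time order
    return [np.concatenate(p) for p in zip(fwd, bwd)]

rng = np.random.default_rng(0)
xs = list(rng.standard_normal((6, 4)))    # 6 time steps, input dim 4, hidden dim 3
W_f, W_b = rng.standard_normal((2, 12, 7))
b_f, b_b = rng.standard_normal((2, 12))
print(bilstm(xs, W_f, b_f, W_b, b_b, hidden=3)[0].shape)  # (6,) = 2 * hidden per step
```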
8. The method for detecting the similarity of bidding texts according to claim 1, wherein in step S3.5 the output layer of the deep learning model calculates the distance between the vector A and the vector B by using the cosine similarity, with the following calculation formula:
cos(A, B) = (A · B) / (\|A\| \|B\|);
The binary cross-entropy function is used as the loss function:
L = -(1/n) \sum_{i=1}^{n} [ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) ];
where n is the number of samples, i.e. the number of data entries in the training set, y_i is the similar/dissimilar label (0 or 1) of the i-th comparison pair, and p_i is the predicted probability that the i-th pair of compared texts is similar (a numpy sketch follows this claim).
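A minimal numpy sketch of the cosine similarity and binary cross-entropy loss defined above; the epsilon clipping is an implementation detail added here to avoid log(0), not part of the claim.

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(A, B) = A·B / (||A|| ||B||)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def bce_loss(y_true, y_pred, eps=1e-12):
    """L = -(1/n) * sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ]."""
    y_pred = np.clip(y_pred, eps, 1 - eps)   # guard against log(0)
    return float(-np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 1.0])))  # ~0.707
print(bce_loss(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.7])))
```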
9. The method for detecting the similarity of bidding texts according to claim 1, wherein after step S4 is completed, step S5 is performed: according to the output result of the model and the preset decision rules of the bidding agency or the bid-inviting party, the list of suppliers to be checked and the parts of the bidding documents with excessively high suspected similarity are pushed to the bidding agency or the bid-inviting party for review, so as to detect collusive bid-rigging behavior.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410996347.XA CN118520864B (en) | 2024-07-24 | 2024-07-24 | Bid text similarity detection system and detection method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118520864A true CN118520864A (en) | 2024-08-20 |
CN118520864B CN118520864B (en) | 2024-10-15 |
Family
ID=92281547
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410996347.XA Active CN118520864B (en) | 2024-07-24 | 2024-07-24 | Bid text similarity detection system and detection method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118520864B (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111191903A (en) * | 2019-12-24 | 2020-05-22 | 中科金审(北京)科技有限公司 | Early warning method and device for monitoring bid document, server and storage medium |
CN112347786A (en) * | 2020-10-27 | 2021-02-09 | 阳光保险集团股份有限公司 | Artificial intelligence scoring training method and device |
CN113129118A (en) * | 2021-05-17 | 2021-07-16 | 政采云有限公司 | Method and device for identifying label string marking behaviors based on natural language processing |
CN117408650A (en) * | 2023-12-15 | 2024-01-16 | 辽宁省网联数字科技产业有限公司 | Digital bidding document making and evaluating system based on artificial intelligence |
Non-Patent Citations (1)
Title |
---|
Liu Jiantao et al.: "Research on the Informatization Construction of Electronic Bidding in Traffic Engineering", Electronic World, 15 January 2014 (2014-01-15), pages 10-12 *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |