CN111581229B - SQL statement generation method and device, computer equipment and storage medium - Google Patents
SQL statement generation method and device, computer equipment and storage medium
- Publication number
- CN111581229B (CN202010218628.4A)
- Authority
- CN
- China
- Prior art keywords
- model
- prediction
- sql statement
- information
- result
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/242—Query formulation
- G06F16/2433—Query languages
- G06F16/2445—Data retrieval commands; View definitions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/242—Query formulation
- G06F16/2428—Query predicate definition using graphical user interfaces, including menus and forms
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/248—Presentation of query results
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Databases & Information Systems (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Human Computer Interaction (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method and a device for generating SQL statements, computer equipment and a storage medium. The method comprises the following steps: acquiring Chinese query data information; inputting the Chinese query data information into a preprocessing model and performing expansion processing to obtain extended information containing Chinese content; inputting the extended information into a Bert model and performing part-of-speech sequence tagging on the extended information to obtain a vector matrix; inputting the vector matrix into a Bi-LSTM model and performing vector prediction processing on the vector matrix to obtain a recognition result comprising at least one predicted SQL statement; inputting each predicted SQL statement into a verification model, verifying each predicted SQL statement, obtaining a quality coefficient corresponding to each predicted SQL statement, and determining a final SQL statement corresponding to the Chinese query data information; and performing a data query according to the final SQL statement to obtain a query result, and displaying the query result on a query interface or playing the query result. The invention thus automatically generates SQL statements from the Chinese query information provided by a user and queries the database with those statements to obtain the required data, thereby improving efficiency.
Description
Technical Field
The present invention relates to the field of data processing, and in particular, to a method and an apparatus for generating an SQL statement, a computer device, and a storage medium.
Background
At present, in the field of internet technology, service scenarios change constantly and new ones are continuously added, so more and more data is stored in databases, and extracting the required data from such a huge volume has become an urgent need. In the prior art, the required data is mainly queried interactively from a database using SQL (Structured Query Language). In the traditional approach, a business person states a business requirement in Chinese and passes it to a professional skilled in SQL, usually over several rounds of communication; the professional writes the Chinese business requirement as an SQL statement and verifies the statement; if the verification fails, the statement has to be rewritten or the requirement discussed again; finally, the result of executing the verified SQL statement is returned to the business person. The whole process is cumbersome, so data queries are slow, user experience is poor, the required level of expertise is high, development pressure is heavy, and operating costs are high.
Disclosure of Invention
The invention provides a method and a device for generating SQL statements, computer equipment and a storage medium, which automatically generate an SQL statement from Chinese query information provided by a user and query a database with that statement to obtain the required data, thereby greatly lowering the expertise threshold, improving efficiency and user satisfaction, improving recognition accuracy, and greatly reducing operating costs.
A method for generating SQL statements comprises the following steps:
receiving a query instruction, and acquiring Chinese query data information input from a query interface;
inputting the Chinese query data information into a preset preprocessing model, and performing expansion processing on the Chinese query data information through the preprocessing model to obtain extended information containing Chinese content;
inputting the extended information into a trained Bert model, and performing part-of-speech sequence tagging on the extended information through the Bert model to obtain a vector matrix output by the Bert model;
inputting the vector matrix into a trained Bi-LSTM model, and performing vector prediction processing on the vector matrix through the Bi-LSTM model to obtain a recognition result output by the Bi-LSTM model; the recognition result comprises at least one predicted SQL statement;
inputting each predicted SQL statement into a preset verification model, verifying each predicted SQL statement through the verification model, obtaining a quality coefficient output by the verification model and corresponding to each predicted SQL statement, and determining a final SQL statement corresponding to the Chinese query data information according to the quality coefficient corresponding to each predicted SQL statement;
and performing data query according to the final SQL statement to obtain a query result corresponding to the Chinese query data information, and displaying the query result on the query interface or playing the query result.
An apparatus for generating an SQL statement, comprising:
the receiving module is used for receiving the query instruction and acquiring Chinese query data information recorded on a query interface;
the expansion module is used for inputting the Chinese query data information into a preset preprocessing model and performing expansion processing on the Chinese query data information through the preprocessing model to obtain extended information containing Chinese content;
the output module is used for inputting the extended information into a trained Bert model, and performing part-of-speech sequence labeling on the extended information through the Bert model to acquire a vector matrix output by the Bert model;
the prediction module is used for inputting the vector matrix into a trained Bi-LSTM model, and performing vector prediction processing on the vector matrix through the Bi-LSTM model to obtain a recognition result output by the Bi-LSTM model; the recognition result comprises at least one predicted SQL statement;
the determining module is used for inputting each predicted SQL statement into a preset verification model, verifying each predicted SQL statement through the verification model, acquiring a quality coefficient which is output by the verification model and corresponds to each predicted SQL statement, and determining a final SQL statement corresponding to the Chinese query data information according to the quality coefficient corresponding to each predicted SQL statement;
and the query module is used for performing data query according to the final SQL statement to obtain a query result corresponding to the Chinese query data information, and displaying the query result on the query interface or playing the query result.
A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the SQL statement generation method when executing the computer program.
A computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above-described SQL statement generation method.
According to the method, Chinese query data information entered on a query interface is acquired and input into the preprocessing model for expansion processing; the extended information output by the preprocessing model is input into the Bert model, which outputs a vector matrix after part-of-speech sequence tagging; the vector matrix is input into the Bi-LSTM model for vector prediction processing to obtain a recognition result; each predicted SQL statement in the recognition result is input into the verification model for verification; a final SQL statement is determined according to the quality coefficient obtained for each predicted SQL statement; and a query result is obtained by querying according to the final SQL statement.
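The last two steps of this summary can be illustrated with a minimal Python sketch. It assumes that the predicted SQL statement with the highest quality coefficient is chosen as the final statement; the text only says the choice is made "according to the quality coefficient". The names `select_final_sql` and `verify` are hypothetical, and `verify` stands in for the verification model.

```python
from typing import Callable, List


def select_final_sql(predicted: List[str], verify: Callable[[str], float]) -> str:
    """Return the predicted SQL statement with the highest quality coefficient.

    `verify` is a hypothetical stand-in for the verification model: it maps
    one predicted SQL statement to its quality coefficient.
    Assumption: "highest coefficient wins"; the source does not state the rule.
    """
    if not predicted:
        raise ValueError("the recognition result contains no predicted SQL statement")
    return max(predicted, key=verify)


# Toy quality coefficients, for illustration only.
toy_scores = {
    "SELECT income FROM t WHERE year = 2019": 0.91,
    "SELECT income FROM t": 0.40,
}
final_sql = select_final_sql(list(toy_scores), toy_scores.get)
```

With the toy scores above, the statement that keeps the year condition is selected.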
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of an application environment of a method for generating an SQL statement according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method for generating an SQL statement according to an embodiment of the present invention;
FIG. 3 is a flowchart of step S20 of a method for generating an SQL statement according to an embodiment of the present invention;
FIG. 4 is a flowchart of step S30 of a method for generating an SQL statement according to an embodiment of the present invention;
FIG. 5 is a flowchart of step S40 of a method for generating an SQL statement according to an embodiment of the present invention;
FIG. 6 is a flowchart of step S403 of a method for generating an SQL statement according to an embodiment of the present invention;
FIG. 7 is a flowchart of step S40 of a method for generating an SQL statement according to another embodiment of the invention;
FIG. 8 is a flowchart of step S50 of a method for generating an SQL statement according to an embodiment of the present invention;
FIG. 9 is a schematic block diagram of an apparatus for generating an SQL statement according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of a computer device in an embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without inventive step based on the embodiments of the present invention, are within the scope of protection of the present invention.
The method for generating the SQL statement provided by the invention can be applied to the application environment shown in figure 1, wherein a client (computer equipment) communicates with a server through a network. The client (computer device) includes, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, cameras, and portable wearable devices. The server may be implemented as a stand-alone server or as a server cluster comprised of multiple servers.
In an embodiment, as shown in fig. 2, a method for generating an SQL statement is provided, which mainly includes the following steps S10 to S60:
S10, receiving a query instruction, and acquiring Chinese query data information entered on a query interface.
Understandably, the query interface is an application display interface on which a user performs queries, and the Chinese query data information is information, containing Chinese content, about the data to be queried. For example, the Chinese query data information may be "How much was Beida's income in the year two zero one nine?". The Chinese query data information can be obtained by converting text, voice or pictures, which provides a platform for text interaction, voice interaction and picture interaction, meets users' needs across multiple platforms, and improves user satisfaction.
In an embodiment, before step S10, that is, before receiving the query instruction and acquiring the Chinese query data information entered on the query interface, the method includes:
S101, receiving a data input instruction and acquiring input information.
Understandably, the data input instruction is an instruction triggered after the input information is entered on a display interface of the application program, and the input information is acquired after the data input instruction is received. The way the input information is acquired can be set as required; for example, it may be read directly from the data input instruction, or fetched from a storage path of the input information carried in the data input instruction.
S102, inputting the input information into a preset recognition model, and recognizing the input information through the recognition model to obtain an input type result; wherein the input type results include text, voice and image.
Understandably, the recognition model is a preset model for recognizing the input information, the input information is input to the recognition model, the recognition model can determine the input type result according to the format of the input information, and the input type result comprises text, voice and images.
S103, acquiring a conversion model corresponding to the input type result.
Understandably, the conversion model corresponding to the input type result is determined according to that result. The conversion models comprise a text conversion model, a voice conversion model and an image conversion model: if the input type result is text, the text conversion model is acquired; if it is voice, the voice conversion model is acquired; if it is image, the image conversion model is acquired. Each conversion model is a trained neural network model, so acquiring the more targeted conversion model improves conversion efficiency and accuracy.
S104, inputting the input information into the conversion model corresponding to the input type result, converting the input information through the conversion model, outputting a conversion result, and determining the conversion result as the Chinese query data information.
Understandably, the input information is input into the conversion model corresponding to the input type result, the conversion is carried out through the conversion model, and the conversion result is determined as the Chinese query data information.
Therefore, the invention provides a platform for text interaction, voice interaction and picture interaction, which meets the requirement of users on multiple platforms, and converts input information into Chinese query data information through a conversion model with strong pertinence, thereby improving the conversion accuracy and improving the user experience satisfaction.
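Steps S101 to S104 amount to a dispatch on the recognized input type. The following Python sketch illustrates that control flow only. It assumes the format of the input is visible from a file suffix, and the three conversion functions are hypothetical placeholders for the trained conversion models, which the text does not specify further.

```python
from typing import Callable, Dict


def text_conversion_model(data: bytes) -> str:
    # Placeholder: treat the payload as UTF-8 text.
    return data.decode("utf-8").strip()


def voice_conversion_model(data: bytes) -> str:
    # Placeholder for a trained speech-to-text model (not part of this sketch).
    raise NotImplementedError("voice conversion model is not included")


def image_conversion_model(data: bytes) -> str:
    # Placeholder for a trained image-to-text model (not part of this sketch).
    raise NotImplementedError("image conversion model is not included")


CONVERSION_MODELS: Dict[str, Callable[[bytes], str]] = {
    "text": text_conversion_model,
    "voice": voice_conversion_model,
    "image": image_conversion_model,
}


def identify_input_type(file_name: str) -> str:
    """S102: derive the input type result from the format of the input.

    Assumption: the format is read from the file suffix.
    """
    suffix = file_name.rsplit(".", 1)[-1].lower()
    if suffix in {"wav", "mp3"}:
        return "voice"
    if suffix in {"png", "jpg", "jpeg"}:
        return "image"
    return "text"


def to_chinese_query(file_name: str, data: bytes) -> str:
    """S103 and S104: pick the matching conversion model and apply it."""
    model = CONVERSION_MODELS[identify_input_type(file_name)]
    return model(data)
```

Keeping the models in a lookup table means a further input type only needs one new entry and one new recognition rule.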
S20, inputting the Chinese query data information into a preset preprocessing model, and performing expansion processing on the Chinese query data information through the preprocessing model to obtain extended information containing Chinese content.
Understandably, the preprocessing model is a model that performs expansion processing on the input Chinese query data information, and the extended information is the information, still containing Chinese content, that results from the expansion processing. The expansion processing includes one or more of conversion processing, splitting processing, replacement processing and extraction processing. The conversion processing converts numbers written in Chinese characters into Arabic numerals, and also converts a number together with the word attached to it into Arabic numerals; for example, the Chinese-character amount "two hundred and eighty point seven yuan" is converted into "280.7". The splitting processing splits the information into a plurality of texts. The replacement processing replaces recognized abbreviations or common short forms with their full names. The extraction processing extracts, from a preset list name dictionary, the column names whose matching degree with a text in the information reaches a preset threshold, that is, it identifies the column names that are highly relevant to the content of the information. The extraction processing further includes joining the extracted column names with separators and splicing them after the information; the separator can be set as required, such as a semicolon or a colon. For example, for the information "Shanghai population", the extraction processing yields "; province; number of people".
Therefore, the Chinese query data information can be converted into complete and easily-recognized content through the preprocessing model, and the accuracy of subsequent conversion is enhanced.
In an embodiment, as shown in fig. 3, in step S20, performing the expansion processing on the Chinese query data information through the preprocessing model to obtain the extended information containing Chinese content includes:
S201, the preprocessing model converts the Chinese query data information containing numbers in Chinese-character format through a number conversion method to obtain first conversion information.
Understandably, the number conversion method converts numbers written in Chinese characters into Arabic numerals, and converts a number together with the word attached to it into Arabic numerals; the preprocessing model applies this method to obtain the first conversion information.
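As an illustration of step S201, the following Python sketch converts integers written in Chinese characters into Arabic numerals. It is a simplified stand-in for the number conversion method, which the text does not specify: it handles digit-by-digit numbers (such as years) and positional numbers up to the 亿 range, ignores decimals and measure words, and would also rewrite numeral characters that appear inside ordinary words. All names are hypothetical.

```python
import re

DIGITS = {"零": 0, "〇": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}
BIG_UNITS = {"万": 10_000, "亿": 100_000_000}
NUMERAL_PATTERN = re.compile(
    "[" + "".join(DIGITS) + "".join(SMALL_UNITS) + "".join(BIG_UNITS) + "]+"
)


def chinese_numeral_to_int(numeral: str) -> int:
    """Convert one Chinese-character integer to an int."""
    # Digit-by-digit form, e.g. a year read out one digit at a time.
    if all(ch in DIGITS for ch in numeral):
        return int("".join(str(DIGITS[ch]) for ch in numeral))
    total = section = digit = 0
    for ch in numeral:
        if ch in DIGITS:
            digit = DIGITS[ch]
        elif ch in SMALL_UNITS:
            # A bare unit such as a leading 十 counts as 1 x unit.
            section += (digit or 1) * SMALL_UNITS[ch]
            digit = 0
        elif ch in BIG_UNITS:
            total += (section + digit) * BIG_UNITS[ch]
            section = digit = 0
        else:
            raise ValueError(f"not a Chinese numeral character: {ch!r}")
    return total + section + digit


def convert_numbers(text: str) -> str:
    """Replace every run of Chinese numeral characters with Arabic numerals."""
    return NUMERAL_PATTERN.sub(
        lambda match: str(chinese_numeral_to_int(match.group())), text
    )
```

For example, `convert_numbers("北大二零一九年收入是多少")` rewrites only the year and leaves the rest of the question unchanged.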
S202, the preprocessing model divides the first conversion information into a plurality of texts through a natural language processing technology; wherein each text is associated with a sequence number.
Understandably, natural language processing (NLP) is an artificial intelligence technology that can segment a string of text into individual characters and words. The first conversion information is split by the natural language processing technology into a plurality of texts, each a character or a word and each associated with a sequence number. For example, the first conversion information "What is the total population of China?" is split into "China's", "total number of people", "is" and "how many", where "China's" is associated with sequence number "1", "total number of people" with sequence number "2", "is" with sequence number "3", and "how many" with sequence number "4".
S203, the preprocessing model matches each text with the abbreviation texts in a preset synonym dictionary.
Understandably, the synonym dictionary is a preset dictionary that collects synonyms. It can be set as required: for example, it can be selected according to the user or the field, or formed by combining a universal dictionary with a preferred dictionary corresponding to the user. The texts are matched one by one against the abbreviation texts in the synonym dictionary, so that whether each text is an abbreviation can be identified.
S204, when the text does not match any abbreviation text, marking the text as an original text; and when the text matches an abbreviation text, replacing the text with the full name of the matched abbreviation text, and marking the replaced text as a replacement text.
Understandably, every abbreviation text has a corresponding full name; for example, the abbreviation text "Beida" corresponds to the full name "Beijing University". If the text does not match any abbreviation text in the synonym dictionary, the text is marked as an original text, which indicates that the text is not an abbreviation; if the text matches an abbreviation text in the synonym dictionary, the text is replaced with the full name of the matched abbreviation text, and the replaced text is marked as a replacement text.
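Steps S202 to S204 can be sketched in Python as follows. The sketch assumes the splitting of S202 has already been performed by some word segmenter, so it starts from a list of texts; the synonym dictionary content and the names `Segment` and `replace_abbreviations` are hypothetical.

```python
from typing import Dict, List, NamedTuple


class Segment(NamedTuple):
    seq: int    # sequence number assigned in S202
    text: str   # original text, or the full name after replacement
    kind: str   # "original" or "replacement"


# Hypothetical synonym dictionary: abbreviation text -> full name.
SYNONYM_DICT: Dict[str, str] = {"北大": "北京大学"}


def replace_abbreviations(texts: List[str],
                          synonym_dict: Dict[str, str]) -> List[Segment]:
    """S203 and S204: look up each text and mark it as original or replacement."""
    segments = []
    for seq, text in enumerate(texts, start=1):
        full_name = synonym_dict.get(text)
        if full_name is None:
            segments.append(Segment(seq, text, "original"))
        else:
            segments.append(Segment(seq, full_name, "replacement"))
    return segments


segments = replace_abbreviations(
    ["北大", "2019", "年", "收入", "是", "多少"], SYNONYM_DICT
)
```

Carrying the sequence number with each text lets a later step restore the original word order after replacement.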
S205, through a character string matching algorithm and a Chinese similarity matching algorithm, acquiring, from all the column names of a preset list name dictionary, the column names whose matching degree with the original texts or the replacement texts reaches a preset threshold, recording those column names as reinforced column names, and recording all the reinforced column names as reinforcement information.
Understandably, the list name dictionary is a preset dictionary that collects the column names in all lists, and it can be updated as the column names in the lists are updated. The matching degree, obtained through the character string matching algorithm and the Chinese similarity matching algorithm, is the degree of similarity between an original text or replacement text and a column name in the list name dictionary. The column names whose matching degree with the original texts or the replacement texts reaches the preset threshold are recorded as reinforced column names, and all the reinforced column names are recorded as reinforcement information.
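A minimal Python sketch of step S205 follows. The text does not name the character string matching algorithm or the Chinese similarity matching algorithm, so this sketch substitutes a substring test and the ratio from the standard library's `difflib.SequenceMatcher`. The threshold value, the rule that one sufficiently similar text is enough, and the function names are all assumptions.

```python
from difflib import SequenceMatcher
from typing import List


def matching_degree(text: str, column_name: str) -> float:
    """Similarity between one text and one column name, in [0, 1]."""
    # String matching stand-in: containment counts as a full match.
    if text and (text in column_name or column_name in text):
        return 1.0
    # Similarity stand-in: character-level ratio from difflib.
    return SequenceMatcher(None, text, column_name).ratio()


def reinforced_column_names(texts: List[str], column_names: List[str],
                            threshold: float = 0.6) -> List[str]:
    """Keep the column names that reach the threshold for at least one text."""
    return [
        name for name in column_names
        if any(matching_degree(text, name) >= threshold for text in texts)
    ]


# Hypothetical list name dictionary and question texts.
columns = ["省份", "人数", "人口数量", "收入"]
matched = reinforced_column_names(["上海", "人口"], columns)
```

A character-level ratio cannot relate a place name such as "上海" to the column "省份"; a match of that kind would need the semantic similarity model this sketch leaves out.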
S206, splicing all the replacement texts and all the original texts according to the sequence numbers to obtain question information.
Understandably, all the replacement texts and all the original texts are spliced in the order of their sequence numbers, and the spliced result is recorded as the question information.
S207, splicing the question information and the reinforcement information according to a preset splicing rule to generate the extended information.
Understandably, the splicing rule can be set as required; for example, the rule may place the question information first and the reinforcement information after it, so as to generate the extended information.
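The two splicing steps can be sketched in Python as below, using the example rule of question information first and reinforcement information after it. The semicolon separator placed before each column name follows the "Shanghai population" example given earlier for the extraction processing; the function name and the tuple layout are assumptions.

```python
from typing import List, Tuple


def build_extended_information(segments: List[Tuple[int, str]],
                               reinforced_columns: List[str],
                               separator: str = ";") -> str:
    """S206 and S207: rebuild the question, then append the column names.

    `segments` holds (sequence number, text) pairs in any order.
    """
    # S206: splice the texts in sequence-number order.
    ordered = sorted(segments, key=lambda item: item[0])
    question_information = "".join(text for _, text in ordered)
    # S207: question first, then each column name preceded by the separator.
    reinforcement_information = "".join(
        separator + name for name in reinforced_columns
    )
    return question_information + reinforcement_information


extended = build_extended_information(
    [(2, "人口"), (1, "上海")], ["省份", "人数"]
)
```

Here the out-of-order segments are restored to "上海人口" before the two column names are appended.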
Therefore, the Chinese query data information can be converted into complete and easily recognized extended information through the number conversion method, the natural language processing technology, the character string matching algorithm and the Chinese similarity matching algorithm in the preprocessing model, which improves the accuracy of subsequent conversion.
S30, inputting the extended information into a trained Bert model, and performing part-of-speech sequence tagging on the extended information through the Bert model to obtain a vector matrix output by the Bert model.
Understandably, the Bert model is a model obtained by training based on the BERT (Bidirectional Encoder Representations from Transformers) method. The Bert model encodes each character or word in the extended information through bidirectional encoding and tags it with a sequence label, thereby converting the extended information into a set of vector matrices. Each vector matrix is a multidimensional matrix containing the extended information and the vector values corresponding to the extended information; the dimension of the vector matrix can be set as required, for example 4, 6 or 8 dimensions. Each vector matrix includes a question matrix and a reinforcement matrix. The question matrix corresponds one-to-one to the question information in the extended information, that is, the matrix corresponding to the question information is extracted from the vector matrix and determined as the question matrix; the reinforcement matrix corresponds one-to-one to the reinforcement information in the extended information, that is, the matrix corresponding to the reinforcement information is extracted from the vector matrix and determined as the reinforcement matrix.
Therefore, the extended information can be accurately encoded and labeled through the Bert model, which improves recognition accuracy.
In an embodiment, as shown in fig. 4, in the step S30, that is, performing part-of-speech sequence tagging on the extended information through the Bert model, to obtain a vector matrix output by the Bert model, the method includes:
S301, the Bert model divides the extended information into a plurality of single characters.
Understandably, the Bert model splits the extended information into single characters, where each single character is a character or a group of digits.
S302, the Bert model carries out part-of-speech prediction on each single character to obtain a sequence label corresponding to each single character.
Understandably, the Bert model encodes each single character into a sequence label through bidirectional encoding and attaches the label to that single character.
S303, combining all the single characters and the sequence labels corresponding to the single characters to obtain label information corresponding to the extended information.
Understandably, each single character is spliced with its corresponding sequence label, and the spliced pairs are then arranged according to the positions of the single characters in the extended information to obtain the label information corresponding to the extended information.
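The data flow of steps S301 to S303 can be sketched in Python as follows. The splitting rule (one character per unit, with a run of digits kept together) follows the description above. The label predictor is a hypothetical toy function passed in by the caller; in the described method the labels come from the trained Bert model, which this sketch does not reproduce.

```python
import re
from typing import Callable, List, Tuple


def split_single_characters(extended_information: str) -> List[str]:
    """S301: split into single characters, keeping a group of digits whole."""
    return re.findall(r"\d+(?:\.\d+)?|\S", extended_information)


def build_label_information(
    extended_information: str,
    predict_label: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """S302 and S303: pair every unit with its label, in original order."""
    return [
        (unit, predict_label(unit))
        for unit in split_single_characters(extended_information)
    ]


def toy_label(unit: str) -> str:
    # Toy predictor for illustration only: numbers versus everything else.
    return "NUM" if unit[0].isdigit() else "CHAR"


label_information = build_label_information("北大2019年", toy_label)
```

Because the pairs are produced in reading order, their positions already match the positions of the units in the extended information.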
S304, performing character feature extraction on the label information through the Bert model to obtain the vector matrix, output by the Bert model, that corresponds to the extended information.
Understandably, the character features are features related to Chinese characters, symbols, letters or English words, and the Bert model outputs the vector matrix corresponding to the extended information according to the extracted character features.
Therefore, the extended information can be accurately encoded and labeled through the Bert model, which improves recognition accuracy.
S40, inputting the vector matrix into a trained Bi-LSTM model, and performing vector prediction processing on the vector matrix through the Bi-LSTM model to obtain a recognition result output by the Bi-LSTM model; the recognition result comprises at least one predicted SQL statement.
Understandably, the Bi-LSTM model is a model obtained by training with the bidirectional long short-term memory (Bi-LSTM) method. The Bi-LSTM model is also called a named entity recognition model; that is, it performs named entity recognition on an input matrix and outputs a recognition result for that matrix. The Bi-LSTM model includes a first prediction model, a second prediction model and a third prediction model.
The first prediction model extracts first vector features of the vector matrix and performs recognition according to the extracted first vector features to predict a first prediction result. The first vector features may be vector features related to the where clause in an SQL statement. The where clause may include a condition value, an operator, a where column name and a where column number: the condition value is a value that satisfies the condition, the operator is a general operator in SQL, the where column name is the name of the column on which the condition is queried, and the where column number is the number of where column names, so a constraint relation exists between the where column name and the where column number. The first prediction result may be the statement content related to the where clause in the SQL statement.
The second prediction model extracts second vector features of the vector matrix and performs recognition according to the extracted second vector features to predict a second prediction result. The second vector features may be vector features related to the inter-condition operator in the SQL statement; the inter-condition operator includes "and" and "or". The second prediction result may be the result related to the inter-condition operator in the SQL statement.
The third prediction model extracts third vector features of the vector matrix and performs recognition according to the extracted third vector features to predict a third prediction result. The third vector features may be vector features related to the select statement in an SQL statement. The select statement may include a select column name, an aggregation function and a select column number: the select column name is the name of a queried column (the where column name and the select column name may be the same or different), the aggregation function is a general aggregation function in SQL, and the select column number is the number of select column names, so a constraint relation exists between the select column name and the select column number. The third prediction result may be the statement content related to the select statement in the SQL statement.
The first prediction result, the second prediction result and the third prediction result are combined to generate the recognition result. The recognition result is an array of predicted SQL statements, in which each predicted SQL statement is one element.
Therefore, each prediction model obtains a different prediction result by extracting different vector features, so that each model concentrates on a single task and interfering information is reduced.
In one embodiment, as shown in fig. 5, the vector matrix includes a question matrix and a reinforcement matrix; the question matrix corresponds to the question information, and the reinforcement matrix corresponds to the reinforcement information. Step S40, that is, inputting the vector matrix into the trained Bi-LSTM model and performing vector prediction processing on the vector matrix through the Bi-LSTM model to obtain the recognition result output by the Bi-LSTM model, includes:
S401, inputting the vector matrix into a first prediction model in the Bi-LSTM model, inputting the question matrix into a second prediction model in the Bi-LSTM model, and inputting the reinforcement matrix into a third prediction model in the Bi-LSTM model.
Understandably, the vector matrix output by the Bert model is obtained together with the question matrix and the reinforcement matrix it contains; the vector matrix is input to the first prediction model, the question matrix is input to the second prediction model, and the reinforcement matrix is input to the third prediction model.
S402, the first prediction model performs prediction processing by extracting first vector features in the vector matrix to obtain a first prediction result of the vector matrix; meanwhile, the second prediction model performs prediction processing by extracting second vector features in the question matrix to obtain a second prediction result of the question matrix, and the third prediction model performs prediction processing by extracting third vector features in the reinforcement matrix to obtain a third prediction result of the reinforcement matrix.
Understandably, the first vector feature may be a vector feature related to the where clause in an SQL statement, the second vector feature may be a vector feature related to the inter-condition operator in the SQL statement, and the third vector feature may be a vector feature related to the select statement in the SQL statement. The first prediction result is an array of a plurality of first prediction array elements related to the where clause; each first prediction array element is associated with a first prediction probability, which is the probability with which that first prediction array element is predicted. The second prediction result is an array of a plurality of second prediction array elements related to the inter-condition operator; each second prediction array element is associated with a second prediction probability, which is the probability with which that second prediction array element is predicted. The third prediction result is an array of a plurality of third prediction array elements related to the select statement; each third prediction array element is associated with a third prediction probability, which is the probability with which that third prediction array element is predicted.
S403, combining the first prediction result, the second prediction result and the third prediction result to generate the recognition result.
Understandably, the first prediction result, the second prediction result and the third prediction result are combined according to a preset combination requirement to generate the recognition result. The combination requirement may be set as needed; for example, the results are combined in the order of the select statement, the inter-condition operator and the where clause in an SQL statement.
Therefore, each prediction model in the Bi-LSTM model obtains a different prediction result by extracting different vector features, so that each model concentrates on a single task and interfering information is reduced.
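The routing in steps S401 and S402 can be sketched as follows. This is a minimal illustration, not the patented implementation: the three sub-models are replaced by hypothetical stub functions that return fixed candidates, the names `run_bi_lstm`, `stub_where`, `stub_operator` and `stub_select` are invented for the example, and the vector matrix is assumed to be the question rows followed by the reinforcement rows.

```python
from typing import Callable, Dict, List, Tuple

Matrix = List[List[float]]
# A prediction result: a list of (array element, prediction probability) pairs.
Prediction = List[Tuple[str, float]]
SubModel = Callable[[Matrix], Prediction]


def run_bi_lstm(
    question_matrix: Matrix,
    reinforcement_matrix: Matrix,
    first_model: SubModel,
    second_model: SubModel,
    third_model: SubModel,
) -> Dict[str, Prediction]:
    # Assumption: the vector matrix is the question rows followed by
    # the reinforcement rows.
    vector_matrix = question_matrix + reinforcement_matrix
    # S401/S402: each sub-model receives only its own input.
    return {
        "where": first_model(vector_matrix),
        "operator": second_model(question_matrix),
        "select": third_model(reinforcement_matrix),
    }


# Hypothetical stubs standing in for the trained sub-models.
def stub_where(matrix: Matrix) -> Prediction:
    return [("where city = 'Shenzhen'", 0.9)]


def stub_operator(matrix: Matrix) -> Prediction:
    return [("and", 0.7), ("or", 0.3)]


def stub_select(matrix: Matrix) -> Prediction:
    return [("select count(id)", 0.8)]


results = run_bi_lstm(
    [[0.1, 0.2]], [[0.3, 0.4]], stub_where, stub_operator, stub_select
)
```

Keeping the three results under separate keys mirrors the separation of concerns described above; combining them is a later step.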
In an embodiment, as shown in fig. 6, the step S403 of combining the first prediction result, the second prediction result and the third prediction result to generate the recognition result includes:
S4031, extracting a first prediction array element in the first prediction result and the first prediction probability associated with it, a second prediction array element in the second prediction result and the second prediction probability associated with it, and a third prediction array element in the third prediction result and the third prediction probability associated with it;
S4032, combining one of the first prediction array elements, one of the second prediction array elements and one of the third prediction array elements into one predicted SQL statement;
Understandably, one first prediction array element, one second prediction array element and one third prediction array element are combined according to a preset combination requirement to obtain one predicted SQL statement. The combination requirement may be set as needed; for example, the elements are spliced in the order of the select statement, the inter-condition operator and the where clause in an SQL statement.
S4033, determining the recognition probability corresponding to each predicted SQL statement according to the first, second and third prediction probabilities corresponding to the predicted SQL statement;
Understandably, the first prediction probability, the second prediction probability and the third prediction probability corresponding to the predicted SQL statement are combined by weighted summation to obtain the recognition probability.
S4034, sorting all the recognition probabilities, and recording the sorted recognition probabilities together with their corresponding predicted SQL statements as the recognition result.
Understandably, the sorting rule can be set as needed; preferably, the recognition probabilities are sorted in descending order.
Therefore, one first prediction array element, one second prediction array element and one third prediction array element are combined into one predicted SQL statement, the recognition probability of each predicted SQL statement is determined, all the recognition probabilities are sorted, and the sorted recognition probabilities together with their corresponding predicted SQL statements are recorded as the recognition result, so that the predicted SQL statement with the highest recognition probability can be conveniently obtained later.
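Steps S4031 to S4034 can be sketched as follows. This is an illustration under stated assumptions: the function name `combine_predictions` and the weights `(0.4, 0.2, 0.4)` are hypothetical, and each predicted SQL statement is kept as a tuple of its three parts in the stated order (select, inter-condition operator, where) because the text does not specify how the parts are rendered into final SQL text.

```python
from itertools import product
from typing import List, Tuple

Prediction = List[Tuple[str, float]]
Statement = Tuple[str, str, str]  # (select part, operator, where part)


def combine_predictions(
    where_preds: Prediction,
    operator_preds: Prediction,
    select_preds: Prediction,
    weights: Tuple[float, float, float] = (0.4, 0.2, 0.4),  # assumed values
) -> List[Tuple[Statement, float]]:
    w1, w2, w3 = weights
    candidates = []
    # S4031/S4032: take one element from each prediction result.
    for (where, p1), (op, p2), (select, p3) in product(
        where_preds, operator_preds, select_preds
    ):
        # S4033: weighted sum of the three prediction probabilities.
        probability = w1 * p1 + w2 * p2 + w3 * p3
        candidates.append(((select, op, where), probability))
    # S4034: sort by recognition probability, highest first.
    return sorted(candidates, key=lambda item: item[1], reverse=True)


ranked = combine_predictions(
    [("w1", 0.9), ("w2", 0.1)], [("and", 0.5)], [("s1", 0.8)]
)
best_statement, best_probability = ranked[0]
```

With descending order, the first entry of the returned list is the candidate that a later step would verify first.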
In an embodiment, as shown in fig. 7, before step S40, that is, before inputting the vector matrix into the trained Bi-LSTM model and performing vector prediction processing on the vector matrix through the Bi-LSTM model to obtain the recognition result output by the Bi-LSTM model, the recognition result comprising at least one predicted SQL statement, the method includes:
S404, obtaining a vector matrix sample; the vector matrix sample includes a question matrix sample and a reinforcement matrix sample and is associated with an SQL statement label; the SQL statement label comprises question labels in one-to-one correspondence with the question matrix samples and reinforcement labels in one-to-one correspondence with the reinforcement matrix samples.
Understandably, the vector matrix sample is a multidimensional matrix of vector values obtained by converting collected sample information through the Bert model. The sample information consists of historical input information that was submitted for query and the input reinforcement information obtained by processing that input information with the character string matching algorithm and the Chinese similarity matching algorithm. The vector matrix sample comprises the question matrix sample and the reinforcement matrix sample: the question matrix sample is the matrix corresponding to the input information in the vector matrix sample, and the reinforcement matrix sample is the matrix corresponding to the input reinforcement information in the vector matrix sample. The SQL statement label is an SQL statement written to query the sample information, the question label is the statement content related to the input information extracted from the SQL statement label, and the reinforcement label is the statement content related to the input reinforcement information extracted from the SQL statement label.
S405, inputting the vector matrix sample into a first prediction model in an initial Bi-LSTM model, inputting the question matrix sample into a second prediction model in the initial Bi-LSTM model, and inputting the reinforcement matrix sample into a third prediction model in the initial Bi-LSTM model; wherein the initial Bi-LSTM model comprises initial parameters.
Understandably, the initial Bi-LSTM model is a neural network model based on the bidirectional long short-term memory (Bi-LSTM) method and includes a first prediction model, a second prediction model and a third prediction model. The initial parameters can be set as needed; for example, they may be randomly assigned parameter values or preset parameter values. The vector matrix sample is input into the first prediction model in the initial Bi-LSTM model; at the same time, the question matrix sample is input into the second prediction model and the reinforcement matrix sample is input into the third prediction model in the initial Bi-LSTM model.
S406, the first prediction model performs prediction processing by extracting a first vector feature in the vector matrix sample to obtain a first sample prediction result of the vector matrix sample, the second prediction model performs prediction processing by extracting a second vector feature in the question matrix sample to obtain a second sample prediction result of the question matrix sample, and the third prediction model performs prediction processing by extracting a third vector feature in the reinforcement matrix sample to obtain a third sample prediction result of the reinforcement matrix sample.
Understandably, the first prediction model extracts a first vector feature of the vector matrix sample and performs prediction processing according to the extracted first vector feature to predict a first sample prediction result; the first vector feature may be a vector feature related to the where clause in an SQL statement, and the first sample prediction result may be statement content related to the where clause in the SQL statement. The second prediction model extracts a second vector feature of the question matrix sample and performs prediction processing according to the extracted second vector feature to predict a second sample prediction result; the second vector feature may be a vector feature related to the inter-condition operator in an SQL statement, and the second sample prediction result may be a result related to the inter-condition operator in the SQL statement. The third prediction model extracts a third vector feature of the reinforcement matrix sample and performs prediction processing according to the extracted third vector feature to predict a third sample prediction result; the third vector feature may be a vector feature related to the select statement in an SQL statement, and the third sample prediction result may be statement content related to the select statement in the SQL statement.
S407, determining a first loss value according to the first sample prediction result and the SQL statement label, determining a second loss value according to the second sample prediction result and the question label, and determining a third loss value according to the third sample prediction result and the reinforcement label.
Understandably, the first sample prediction result and the SQL statement label are input into a first loss function in the initial Bi-LSTM model; the first loss function calculates the degree of match between the first sample prediction result and the SQL statement label, and the first loss value is obtained from it. The second sample prediction result and the question label are input into a second loss function in the initial Bi-LSTM model; the second loss function calculates the degree of match between the second sample prediction result and the question label, and the second loss value is obtained from it. The third sample prediction result and the reinforcement label are input into a third loss function in the initial Bi-LSTM model; the third loss function calculates the degree of match between the third sample prediction result and the reinforcement label, and the third loss value is obtained from it.
S408, combining the first sample prediction result, the second sample prediction result and the third sample prediction result to generate a sample recognition result, and determining a total loss value according to the first loss value, the second loss value and the third loss value.
Understandably, the first sample prediction result, the second sample prediction result and the third sample prediction result are spliced according to a preset combination requirement to generate the sample recognition result. The combination requirement may be set as needed; for example, the results are spliced in the order of the select statement, the inter-condition operator and the where clause in an SQL statement. At the same time, the first loss value, the second loss value and the third loss value are input into a total loss function of the initial Bi-LSTM model; the total loss function calculates the total loss value from the first loss value, the second loss value and the third loss value by a weighted average formula.
And S409, recording the initial Bi-LSTM model after convergence as the trained Bi-LSTM model when the total loss value reaches a preset convergence condition.
Understandably, the convergence condition may be set as needed; for example, the convergence condition may be that the total loss value is smaller than a set threshold (e.g. 0.002). When the total loss value reaches the preset convergence condition, training stops; at this point the parameters of the initial Bi-LSTM model are no longer updated, and the converged initial Bi-LSTM model is recorded as the trained Bi-LSTM model.
And S410, when the total loss value does not reach a preset convergence condition, iteratively updating the initial parameters of the initial Bi-LSTM model until the total loss value reaches the preset convergence condition, and recording the converged initial Bi-LSTM model as a trained Bi-LSTM model.
Understandably, the convergence condition may also be that, after a preset number of training iterations, the total loss value is small and no longer decreases; for example, the convergence condition is that the total loss value is small and no longer decreases after 8000 iterations. When the total loss value does not reach the convergence condition, the initial parameters of the initial Bi-LSTM model are iteratively updated until the total loss value reaches the preset convergence condition; training then stops, and the converged initial Bi-LSTM model is recorded as the trained Bi-LSTM model.
Therefore, the first prediction model, the second prediction model and the third prediction model in the initial Bi-LSTM model are trained, a total loss value is determined from the first loss value, the second loss value and the third loss value, the initial parameters of the initial Bi-LSTM model are iterated continuously according to the total loss value until the total loss value reaches the preset convergence condition, and the converged initial Bi-LSTM model is recorded as the trained Bi-LSTM model, which can improve recognition speed and recognition quality.
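The loop formed by steps S406 to S410 can be sketched as follows. This is a sketch under stated assumptions, not the actual training code: `compute_losses` and `update_parameters` are hypothetical callbacks standing in for the forward pass and the parameter update, equal loss weights are assumed, and the `patience` parameter is an invented way of making "no longer decreases" testable.

```python
from typing import Callable, List, Sequence


def total_loss(losses: Sequence[float], weights: Sequence[float]) -> float:
    # Weighted average of the first, second and third loss values.
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)


def has_converged(
    history: List[float],
    threshold: float = 0.002,  # example threshold from the text
    max_steps: int = 8000,     # example iteration count from the text
    patience: int = 3,         # assumption: window used to detect a plateau
) -> bool:
    # Condition A: the total loss value is below the set threshold.
    if history and history[-1] < threshold:
        return True
    # Condition B: after the preset number of iterations, the total loss
    # value has stopped decreasing over the last `patience` steps.
    if len(history) >= max_steps and len(history) > patience:
        return min(history[-patience:]) >= history[-patience - 1]
    return False


def train(
    compute_losses: Callable[[], Sequence[float]],
    update_parameters: Callable[[float], None],
    weights: Sequence[float] = (1.0, 1.0, 1.0),  # assumed equal weights
) -> int:
    history: List[float] = []
    while True:
        history.append(total_loss(compute_losses(), weights))  # S406-S408
        if has_converged(history):                             # S409
            return len(history)
        update_parameters(history[-1])                         # S410


# Toy stand-in for a model: every update halves all three loss values.
state = {"loss": 1.0}


def halve_loss(_total: float) -> None:
    state["loss"] /= 2


steps = train(lambda: (state["loss"],) * 3, halve_loss)
```

In the toy run the loss halves on every update, so the threshold condition ends the loop long before the iteration-count condition is consulted.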
And S50, inputting each predicted SQL statement into a preset verification model, verifying each predicted SQL statement through the verification model, obtaining a quality coefficient which is output by the verification model and corresponds to each predicted SQL statement, and determining a final SQL statement corresponding to the Chinese query data information according to the quality coefficient corresponding to each predicted SQL statement.
Understandably, the verification model is a model that performs a verification operation on the input predicted SQL statement. The verification operation executes the predicted SQL statement and checks the execution process and the executed result. The rules of the verification may be set as needed; preferably, the rules are: 1. when the operator in the where clause of the predicted SQL statement is an equal sign, checking whether the condition value in the where clause appears in the data of the execution result; 2. when the execution result of the predicted SQL statement is empty, the predicted SQL statement is regarded as poor. A quality coefficient corresponding to the predicted SQL statement is determined according to the verification rules and the recognition probability corresponding to the predicted SQL statement, where the recognition probability is the probability that the predicted SQL statement meets the prediction requirement. The calculation method of the quality coefficient can be set as needed; for example, it can be a weighted calculation method. The quality coefficient measures the quality of executing the predicted SQL statement. One predicted SQL statement is selected according to all the quality coefficients and recorded as the final SQL statement corresponding to the Chinese query data information.
Therefore, all the predicted SQL sentences are verified through the verification model, the SQL sentences which do not meet the requirements are removed, and the accuracy and hit rate of SQL sentence generation are improved.
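The two preferred verification rules can be sketched against an in-memory SQLite database as follows. This is one possible reading, not the verification model itself: the function name `execution_result`, the representation of where conditions as `(column, operator, value)` tuples, the choice of SQLite, and the decision to score a statement that fails to execute as zero are all assumptions made for the example. Rule 2 is read here as "a non-empty result earns a point".

```python
import sqlite3
from typing import List, Tuple

# Assumed structured form of a where condition: (column, operator, value).
Condition = Tuple[str, str, object]


def execution_result(
    conn: sqlite3.Connection, sql: str, conditions: List[Condition]
) -> int:
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return 0  # assumption: a statement that cannot run scores zero
    score = 0
    # Rule 1: for "=" conditions, the condition value should appear
    # somewhere in the data of the execution result.
    equal_values = [value for _, op, value in conditions if op == "="]
    cells = {cell for row in rows for cell in row}
    if equal_values and all(value in cells for value in equal_values):
        score += 1
    # Rule 2: an empty execution result marks the statement as poor,
    # so only a non-empty result earns a point.
    if rows:
        score += 1
    return score


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (city TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)", [("Shenzhen", 10), ("Beijing", 20)]
)

good = execution_result(
    conn,
    "SELECT city, amount FROM sales WHERE city = 'Shenzhen'",
    [("city", "=", "Shenzhen")],
)
empty = execution_result(
    conn,
    "SELECT city, amount FROM sales WHERE city = 'Paris'",
    [("city", "=", "Paris")],
)
broken = execution_result(conn, "SELECT nope FROM missing", [])
```

The returned integer plays the role of the execution result, from which a quality coefficient is derived in the later steps.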
In an embodiment, as shown in fig. 8, the step S50 of verifying each predicted SQL statement through the verification model, obtaining the quality coefficient output by the verification model for each predicted SQL statement, and determining the final SQL statement corresponding to the Chinese query data information according to the quality coefficient corresponding to each predicted SQL statement includes:
S501, recording the predicted SQL statement with the highest recognition probability among all the predicted SQL statements as a first SQL statement.
Understandably, according to the recognition probability in all the predicted SQL statements, acquiring the predicted SQL statement with the highest recognition probability, and recording the predicted SQL statement as the first SQL statement.
S502, inputting the first SQL statement into the verification model, and verifying the first SQL statement through the verification model to obtain an execution result corresponding to the first SQL statement;
Understandably, the rules of the verification may be set as needed; preferably, the rules are: 1. when the operator in the where clause of the predicted SQL statement is an equal sign, checking whether the condition value in the where clause appears in the data of the execution result; 2. when the execution result of the predicted SQL statement is empty, the predicted SQL statement is regarded as poor. For each verification rule that is satisfied, one is added to the execution result corresponding to the predicted SQL statement; if no verification rule is satisfied, the execution result is zero.
S503, determining a quality coefficient corresponding to the first SQL statement according to the execution result corresponding to the first SQL statement.
Understandably, the corresponding quality coefficient is obtained by performing conversion according to the execution result, and the format of the quality coefficient may be set according to requirements, for example, the format of the quality coefficient may be a percentage format.
S504, when the quality coefficient reaches a preset threshold value, setting the quality coefficients corresponding to all the predicted SQL statements except the first SQL statement to be zero, and meanwhile, determining the first SQL statement as the final SQL statement.
Understandably, the threshold may be set according to a requirement, for example, the threshold may be set to 20%, if the quality coefficient reaches the threshold, the quality coefficients corresponding to all the predicted SQL statements except the first SQL statement are set to zero, and it is determined that the first SQL statement is the final SQL statement, that is, the SQL statement to be queried.
In an embodiment, after the step S503, that is, after the determining the quality coefficient corresponding to the first SQL statement according to the execution result corresponding to the first SQL statement, the method further includes:
S505, when the quality coefficient does not reach a preset threshold value, updating the recognition probability corresponding to the first SQL statement to zero.
S506, recording the predicted SQL statement with the highest recognition probability among all the predicted SQL statements as a second SQL statement.
And S507, inputting the second SQL statement into the verification model, and verifying the second SQL statement through the verification model to obtain an execution result corresponding to the second SQL statement.
And S508, determining the quality coefficient corresponding to the second SQL statement according to the execution result corresponding to the second SQL statement.
And S509, when the quality coefficient reaches a preset threshold value, setting the quality coefficients corresponding to all the predicted SQL statements except the second SQL statement to be zero, and meanwhile, determining the second SQL statement as the final SQL statement.
In this way, only the predicted SQL statement with the highest recognition probability among all the predicted SQL statements (the first SQL statement) is verified at first, and a quality coefficient is obtained from the execution result of the verification. If the quality coefficient reaches the preset threshold, the first SQL statement is determined as the final SQL statement. If it does not, the recognition probability of the first SQL statement is set to zero so that it ranks lowest, and the predicted SQL statement that now has the highest recognition probability (the second SQL statement) is verified. This verification process is repeated until the final SQL statement is obtained, which reduces the number of executions on the server and shortens the execution time.
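The verify-in-rank-order loop of steps S501 to S509 can be sketched as follows. This is a minimal illustration: `pick_final_statement` and the `quality_of` callback (standing in for the verification model plus the conversion to a quality coefficient) are hypothetical names, the 0.2 default mirrors the 20% example threshold, and returning `None` when no candidate passes is an assumption, since the text does not cover that case.

```python
from typing import Callable, Dict, Optional


def pick_final_statement(
    recognition: Dict[str, float],
    quality_of: Callable[[str], float],
    threshold: float = 0.2,  # example threshold (20%) from the text
) -> Optional[str]:
    # Work on a copy so the caller's recognition probabilities are kept.
    remaining = dict(recognition)
    while any(p > 0 for p in remaining.values()):
        # S501/S506: candidate with the highest recognition probability.
        candidate = max(remaining, key=remaining.get)
        # S502-S503 / S507-S508: verify and obtain the quality coefficient.
        if quality_of(candidate) >= threshold:
            return candidate  # S504/S509: this is the final SQL statement
        # S505: the candidate failed, so its probability becomes zero.
        remaining[candidate] = 0.0
    return None  # assumption: no candidate reached the threshold


verified = []  # records which candidates were actually executed


def fake_quality(sql: str) -> float:
    verified.append(sql)
    return {"a": 0.0, "b": 0.5, "c": 1.0}[sql]


final = pick_final_statement({"a": 0.9, "b": 0.6, "c": 0.3}, fake_quality)
```

In the example only candidates `a` and `b` are verified; `c` is never executed, which is the saving the paragraph above describes.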
In an embodiment, the step S50 of verifying each predicted SQL statement through the verification model, obtaining the quality coefficient output by the verification model for each predicted SQL statement, and determining the final SQL statement corresponding to the Chinese query data information according to the quality coefficient corresponding to each predicted SQL statement further includes:
S510, verifying each predicted SQL statement through the verification model to obtain an execution result corresponding to each predicted SQL statement.
Understandably, the rules of the verification may be set as needed; preferably, the rules are: 1. when the operator in the where clause of the predicted SQL statement is an equal sign, checking whether the condition value in the where clause appears in the data of the execution result; 2. when the execution result of the predicted SQL statement is empty, the predicted SQL statement is regarded as poor. For each verification rule that is satisfied, one is added to the execution result corresponding to the predicted SQL statement; if no verification rule is satisfied, the execution result is zero.
And S511, determining the quality coefficient corresponding to each predicted SQL statement according to the identification probability and the execution result corresponding to each predicted SQL statement.
Understandably, the determination manner may be set according to a requirement, and preferably, the determination manner may be that the quality coefficient corresponding to the predicted SQL statement is obtained by multiplying the recognition probability of the predicted SQL statement by the execution result corresponding to the predicted SQL statement.
And S512, determining the predicted SQL statement corresponding to the maximum quality coefficient in all the quality coefficients as the final SQL statement.
Understandably, the quality coefficient is used for measuring the quality effect of the execution of the predicted SQL statement, and the predicted SQL statement corresponding to the largest quality coefficient (with the best quality effect) among all the quality coefficients is determined as the final SQL statement.
Therefore, the quality coefficient of each predicted SQL statement is determined through the verification model, so that the final SQL statement is determined, the SQL statement with the best quality effect after execution is obtained, and the generation of the only and best SQL statement is realized.
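Steps S511 and S512 can be sketched as follows, using the preferred rule that the quality coefficient is the recognition probability multiplied by the execution result. The function name `select_by_quality` and the dictionary-based inputs are hypothetical; treating a statement with no recorded execution result as zero is an assumption.

```python
from typing import Dict, Tuple


def select_by_quality(
    recognition: Dict[str, float], execution: Dict[str, int]
) -> Tuple[str, Dict[str, float]]:
    # S511: quality coefficient = recognition probability * execution result.
    # Assumption: a statement without an execution result counts as zero.
    quality = {
        sql: probability * execution.get(sql, 0)
        for sql, probability in recognition.items()
    }
    # S512: the statement with the largest quality coefficient is final.
    final = max(quality, key=quality.get)
    return final, quality


final_sql, coefficients = select_by_quality(
    {"a": 0.9, "b": 0.6}, {"a": 0, "b": 2}
)
```

Here statement `a` has the higher recognition probability but an execution result of zero, so `b` is selected.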
And S60, performing data query according to the final SQL statement to obtain a query result corresponding to the Chinese query data information, and displaying the query result on the query interface or playing the query result.
Understandably, a data query is performed on the database according to the final SQL statement to obtain the query result, which is the result that the Chinese query data information asks for. The query result is then encapsulated; the encapsulation mode may be set as needed, for example by performing pie chart analysis on the query result. The encapsulated query result is displayed on the query interface so that the user can view it conveniently.
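One way to picture the pie chart encapsulation is sketched below. It is only an illustration: the function name `query_as_pie_shares`, the use of SQLite, and the assumption that the final SQL statement returns rows of the form (label, numeric value) are not taken from the text. The function returns each label's share of the total, which a front end could render as pie slices.

```python
import sqlite3
from typing import Dict


def query_as_pie_shares(conn: sqlite3.Connection, final_sql: str) -> Dict[str, float]:
    # Run the final SQL statement; rows are assumed to be (label, value).
    rows = conn.execute(final_sql).fetchall()
    total = sum(value for _, value in rows)
    if not total:
        return {}  # nothing to chart for an empty or all-zero result
    # Each slice is the label's fraction of the total.
    return {str(label): value / total for label, value in rows}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (city TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)", [("Shenzhen", 10), ("Beijing", 30)]
)
shares = query_as_pie_shares(
    conn, "SELECT city, SUM(amount) FROM sales GROUP BY city"
)
```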
Therefore, the invention acquires the Chinese query data information entered on a query interface and inputs it into the preprocessing model for expansion processing; the expansion information output by the preprocessing model is input into the Bert model, which performs part-of-speech sequence tagging and outputs a vector matrix; the vector matrix is input into the Bi-LSTM model for vector prediction processing to obtain the recognition result; each predicted SQL statement in the recognition result is input into the verification model for verification; the final SQL statement is determined according to the quality coefficient obtained for each predicted SQL statement; and a query is performed according to the final SQL statement to obtain the query result (the result that the Chinese query data information asks for). The invention thus provides a method that automatically generates an SQL statement from the Chinese query information provided by a user and queries the database with that statement to obtain the required data, which lowers the level of expertise required, shortens the execution cycle, improves efficiency, user satisfaction and recognition accuracy, and reduces operating cost.
In an embodiment, an SQL statement generation apparatus is provided, which corresponds one-to-one to the SQL statement generation method in the above embodiments. As shown in fig. 9, the SQL statement generation apparatus includes a receiving module 11, an expansion module 12, an output module 13, a prediction module 14, a determining module 15 and a query module 16. The functional modules are described in detail as follows:
the receiving module 11 is configured to receive the query instruction and acquire the Chinese query data information entered on the query interface;
the expansion module 12 is configured to input the Chinese query data information into a preset preprocessing model, and perform expansion processing on the Chinese query data information through the preprocessing model to obtain expansion information containing Chinese content;
the output module 13 is configured to input the expansion information into a trained Bert model, perform part-of-speech sequence tagging on the expansion information through the Bert model, and acquire a vector matrix output by the Bert model;
the prediction module 14 is configured to input the vector matrix into a trained Bi-LSTM model, perform vector prediction processing on the vector matrix through the Bi-LSTM model, and obtain a recognition result output by the Bi-LSTM model; the recognition result comprises at least one predicted SQL statement;
the determining module 15 is configured to input each of the predicted SQL statements into a preset verification model, verify each of the predicted SQL statements through the verification model, obtain the quality coefficient output by the verification model for each of the predicted SQL statements, and determine a final SQL statement corresponding to the Chinese query data information according to the quality coefficient corresponding to each of the predicted SQL statements;
and the query module 16 is configured to perform a data query according to the final SQL statement to obtain a query result corresponding to the Chinese query data information, and display the query result on the query interface or play the query result.
For the specific definition of the SQL statement generation apparatus, reference may be made to the above definition of the SQL statement generation method, which is not repeated here. The modules in the SQL statement generation apparatus may be implemented wholly or partially by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory in the computer device in software form, so that the processor can call them and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 10. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of generating an SQL statement.
In one embodiment, a computer device is provided, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and the processor executes the computer program to implement the method for generating the SQL statement in the foregoing embodiments.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which when executed by a processor implements the method for generating the SQL statement in the above-described embodiments.
Those skilled in the art will understand that all or part of the processes of the methods in the above embodiments can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, the division into functional units and modules described above is given only as an example. In practical applications, the functions may be allocated to different functional units and modules as needed; that is, the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the functions described above.
The above-mentioned embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.
Claims (9)
1. A method for generating an SQL statement is characterized by comprising the following steps:
receiving a query instruction, and acquiring Chinese query data information input from a query interface;
inputting the Chinese query data information into a preset preprocessing model, and performing expansion processing on the Chinese query data information through the preprocessing model to obtain expanded information containing Chinese content;
inputting the expanded information into a trained Bert model, and performing part-of-speech sequence tagging on the expanded information through the Bert model to obtain a vector matrix output by the Bert model;
inputting the vector matrix into a trained Bi-LSTM model, and performing vector prediction processing on the vector matrix through the Bi-LSTM model to obtain a recognition result output by the Bi-LSTM model; the recognition result comprises at least one predicted SQL statement;
inputting each predicted SQL statement into a preset verification model, verifying each predicted SQL statement through the verification model, obtaining a quality coefficient output by the verification model and corresponding to each predicted SQL statement, and determining a final SQL statement corresponding to the Chinese query data information according to the quality coefficient corresponding to each predicted SQL statement;
performing data query according to the final SQL statement to obtain a query result corresponding to the Chinese query data information, and displaying the query result on the query interface or playing the query result;
wherein the performing expansion processing on the Chinese query data information through the preprocessing model to obtain the expanded information containing Chinese content comprises:
converting, by the preprocessing model through a number conversion method, the numbers in Chinese-character format contained in the Chinese query data information to obtain first conversion information;
the preprocessing model splits the first conversion information into a plurality of texts through a natural language processing technology; wherein each of said texts is associated with one sequence number;
the preprocessing model matches each text against the abbreviation texts in a preset synonym dictionary;
when the text does not match any abbreviation text, marking the text as an original text; when the text matches an abbreviation text, replacing the text with the full name of the matched abbreviation text, and marking the replaced text as a substitute text;
acquiring, from all the column names of a preset column name dictionary through a character string matching algorithm and a Chinese similarity matching algorithm, the column names whose matching degree with the original texts or the substitute texts reaches a preset threshold value, recording these column names as reinforced column names, and recording all the reinforced column names as reinforcement information; the column name dictionary is a dictionary of the set of column names in all preset lists;
splicing all the substitute texts with all the original texts according to the sequence numbers to obtain question information;
and splicing the question information and the reinforcement information according to a preset splicing rule to generate the expanded information.
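The expansion steps of claim 1 can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption made for illustration and is not taken from the patent: the dictionary contents, the forward maximum matching used as the text splitter, `difflib.SequenceMatcher` standing in for the character string and Chinese similarity matching algorithms, the 0.6 threshold, and the `[SEP]` splicing rule.

```python
import difflib
import re

# Hypothetical lookup tables; the claim only says these dictionaries are "preset".
CN_DIGITS = {"零": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
             "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
CN_UNITS = {"十": 10, "百": 100, "千": 1000}
ABBREVIATIONS = {"营收": "营业收入"}           # abbreviation text -> full name
COLUMN_NAMES = ["营业收入", "净利润", "年份"]    # column name dictionary
VOCAB = set(ABBREVIATIONS) | set(COLUMN_NAMES) | {"查询"}
NUMERAL_RUN = re.compile("[" + "".join(CN_DIGITS) + "".join(CN_UNITS) + "]+")


def chinese_to_int(numeral):
    """Convert a Chinese numeral below 10000 to an int."""
    if not any(ch in CN_UNITS for ch in numeral):
        # Positional form such as a year: read the digits one by one.
        return int("".join(str(CN_DIGITS[ch]) for ch in numeral))
    total, digit = 0, 0
    for ch in numeral:
        if ch in CN_DIGITS:
            digit = CN_DIGITS[ch]
        else:  # unit character; a bare unit counts as 1 x unit
            total += (digit or 1) * CN_UNITS[ch]
            digit = 0
    return total + digit


def convert_numbers(text):
    """Step 1: rewrite runs of Chinese numerals as Arabic digits.
    Naive: it also rewrites numeral characters that occur inside ordinary words."""
    return NUMERAL_RUN.sub(lambda m: str(chinese_to_int(m.group())), text)


def segment(text, max_len=4):
    """Step 2: forward maximum matching, a stand-in for a real word segmenter."""
    tokens, i = [], 0
    while i < len(text):
        digits = re.match(r"\d+", text[i:])
        if digits:
            piece = digits.group()
        else:
            piece = next((text[i:i + n] for n in range(max_len, 1, -1)
                          if text[i:i + n] in VOCAB), text[i])
        tokens.append(piece)
        i += len(piece)
    return tokens


def expand(question, threshold=0.6, sep="[SEP]"):
    numbered = list(enumerate(segment(convert_numbers(question))))
    # Step 3: substitute texts get the full name, original texts pass through.
    resolved = [(i, ABBREVIATIONS.get(t, t)) for i, t in numbered]
    # Step 5: splice the texts back together in sequence-number order.
    question_info = "".join(t for _, t in sorted(resolved))

    def best_match(column):
        return max((difflib.SequenceMatcher(None, column, t).ratio()
                    for _, t in resolved), default=0.0)

    # Step 4: keep column names whose best similarity reaches the threshold.
    reinforced = [c for c in COLUMN_NAMES if best_match(c) >= threshold]
    # Step 6: splice question information and reinforcement information.
    return sep.join([question_info] + reinforced)


print(expand("查询二零一九年的营收"))
```

A production system would presumably use a proper Chinese word segmenter and a similarity measure tuned for Chinese; the sketch only shows how the six steps feed into one another.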
2. The method for generating an SQL statement according to claim 1, wherein the performing part-of-speech sequence tagging on the expanded information through the Bert model to obtain a vector matrix output by the Bert model comprises:
the Bert model splits the expanded information into a plurality of single characters;
the Bert model carries out part-of-speech prediction on each single character to obtain a sequence label corresponding to each single character;
combining all the single characters and the sequence labels corresponding to the single characters to obtain labeled information corresponding to the expanded information;
and performing character feature extraction on the labeled information through the Bert model to obtain a vector matrix which is output by the Bert model and corresponds to the expanded information.
3. The method for generating an SQL statement according to claim 1, wherein the vector matrix includes a question matrix and a reinforcement matrix, the question matrix corresponds to the question information, and the reinforcement matrix corresponds to the reinforcement information;
the inputting the vector matrix into the trained Bi-LSTM model, and performing vector prediction processing on the vector matrix through the Bi-LSTM model to obtain the recognition result output by the Bi-LSTM model, includes:
inputting the vector matrix into a first prediction model in the Bi-LSTM model, inputting the question matrix into a second prediction model in the Bi-LSTM model, and inputting the reinforcement matrix into a third prediction model in the Bi-LSTM model;
the first prediction model carries out prediction processing by extracting first vector features in the vector matrix to obtain a first prediction result of the vector matrix; meanwhile, the second prediction model carries out prediction processing by extracting second vector features in the question matrix to obtain a second prediction result of the question matrix, and the third prediction model carries out prediction processing by extracting third vector features in the reinforcement matrix to obtain a third prediction result of the reinforcement matrix; the first vector features are vector features related to a where clause in an SQL statement, the second vector features are vector features related to an inter-condition operator in the SQL statement, and the third vector features are vector features related to a select statement in the SQL statement; the first prediction result is predicted statement content related to the where clause in the SQL statement, the second prediction result is a predicted result related to the inter-condition operator in the SQL statement, and the third prediction result is predicted statement content related to the select statement in the SQL statement;
and combining the first prediction result, the second prediction result and the third prediction result to generate the recognition result.
4. The method of generating an SQL statement according to claim 3, wherein the combining the first prediction result, the second prediction result, and the third prediction result to generate the recognition result comprises:
extracting a first prediction array element in the first prediction result and a first prediction probability associated therewith, a second prediction array element in the second prediction result and a second prediction probability associated therewith, and a third prediction array element in the third prediction result and a third prediction probability associated therewith;
combining one of said first predicted array elements, one of said second predicted array elements, and one of said third predicted array elements into one of said predicted SQL statements; the first prediction result comprises an array of first prediction array elements associated with the where clause, the second prediction result comprises an array of second prediction array elements associated with the inter-condition operators, and the third prediction result comprises an array of third prediction array elements associated with the select statement;
determining a recognition probability corresponding to each predicted SQL statement according to the first prediction probability, the second prediction probability and the third prediction probability corresponding to the predicted SQL statement;
and sorting all the recognition probabilities, and recording the sorted recognition probabilities and the predicted SQL statements corresponding to them as the recognition result.
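The combination step of claim 4 can be sketched as follows. The claim states only that the recognition probability is determined "according to" the three prediction probabilities; multiplying them is an assumption made here, as are the function name `combine_predictions`, the table name `t`, and the representation of each array element as a (fragment, probability) pair.

```python
from itertools import product


def combine_predictions(where_preds, op_preds, select_preds, table="t"):
    """Build every candidate SQL statement and rank it by recognition probability.

    where_preds:  list of (tuple of condition strings, probability)
    op_preds:     list of (inter-condition operator, probability)
    select_preds: list of (select column expression, probability)
    """
    candidates = []
    for (conds, p_where), (op, p_op), (cols, p_select) in product(
            where_preds, op_preds, select_preds):
        # The operator only appears when there is more than one condition.
        sql = f"SELECT {cols} FROM {table} WHERE " + f" {op} ".join(conds)
        # Assumption: the three heads are treated as independent.
        candidates.append((sql, p_where * p_op * p_select))
    # Highest recognition probability first.
    return sorted(candidates, key=lambda c: c[1], reverse=True)


ranked = combine_predictions(
    where_preds=[(("year = 2019", "revenue > 100"), 0.8), (("year = 2019",), 0.2)],
    op_preds=[("AND", 0.9), ("OR", 0.1)],
    select_preds=[("revenue", 0.7), ("profit", 0.3)],
)
for sql, prob in ranked[:3]:
    print(f"{prob:.3f}  {sql}")
```

Because one element is taken from each array, the number of candidates is the product of the three array lengths, so a real system would likely keep only the few most probable elements per array.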
5. The method for generating an SQL statement according to claim 4, wherein before the inputting the vector matrix into the trained Bi-LSTM model and performing vector prediction processing on the vector matrix through the Bi-LSTM model to obtain the recognition result output by the Bi-LSTM model, the recognition result comprising at least one predicted SQL statement, the method further comprises:
obtaining a vector matrix sample; the SQL statement label comprises question labels which are in one-to-one correspondence with the question matrix samples and reinforced labels which are in one-to-one correspondence with the reinforced matrix samples;
inputting the vector matrix samples into a first prediction model in an initial Bi-LSTM model, inputting the question matrix samples into a second prediction model in the initial Bi-LSTM model, and inputting the reinforcement matrix samples into a third prediction model in the initial Bi-LSTM model; wherein the initial Bi-LSTM model comprises initial parameters;
the first prediction model performs prediction processing by extracting first vector features in the vector matrix samples to obtain first sample prediction results of the vector matrix samples, the second prediction model performs prediction processing by extracting second vector features in the question matrix samples to obtain second sample prediction results of the question matrix samples, and the third prediction model performs prediction processing by extracting third vector features in the reinforcement matrix samples to obtain third sample prediction results of the reinforcement matrix samples;
confirming a first loss value according to the first sample prediction result and the SQL statement label, confirming a second loss value according to the second sample prediction result and the question label, and confirming a third loss value according to the third sample prediction result and the reinforced label;
combining the first sample prediction result, the second sample prediction result and the third sample prediction result to generate a sample identification result, and determining a total loss value according to the first loss value, the second loss value and the third loss value;
when the total loss value reaches a preset convergence condition, recording the converged initial Bi-LSTM model as a trained Bi-LSTM model;
and when the total loss value does not reach the preset convergence condition, iteratively updating the initial parameters of the initial Bi-LSTM model until the total loss value reaches the preset convergence condition, and recording the converged initial Bi-LSTM model as a trained Bi-LSTM model.
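The control flow of claim 5 (three per-model losses, one total loss, iterate until a convergence condition holds) can be shown with a toy sketch. It is not a Bi-LSTM: each prediction model is replaced by a single scalar parameter with a squared-error loss so that the loop runs without any deep learning library. Summing the three losses, the learning rate, and the tolerance are all assumptions; the claim only says the total loss is determined "according to" the three loss values.

```python
def train(initial_params, labels, lr=0.1, tol=1e-6, max_iters=10_000):
    """Toy stand-in for the training loop: one scalar parameter per prediction model.

    initial_params / labels: dicts keyed by 'where', 'operator', 'select'.
    Returns (trained parameters, final total loss, number of updates performed).
    """
    params = dict(initial_params)
    for step in range(max_iters):
        # One loss per prediction model (squared error against its label).
        losses = {k: (params[k] - labels[k]) ** 2 for k in params}
        total_loss = sum(losses.values())   # assumption: total = plain sum
        if total_loss <= tol:               # preset convergence condition
            return params, total_loss, step
        # Gradient step on each parameter: d/dp (p - y)^2 = 2 (p - y).
        for k in params:
            params[k] -= lr * 2 * (params[k] - labels[k])
    raise RuntimeError("total loss did not reach the convergence condition")


trained, loss, steps = train(
    {"where": 0.0, "operator": 0.0, "select": 0.0},
    {"where": 1.0, "operator": -2.0, "select": 0.5},
)
print(trained, loss, steps)
```

In an actual implementation the three losses would typically come from the three prediction heads of the network, and a weighted sum could be used instead of the plain sum shown here.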
6. The method for generating an SQL statement according to claim 1, wherein the verifying each of the predicted SQL statements through the verification model to obtain quality coefficients output by the verification model and corresponding to each of the predicted SQL statements, and determining a final SQL statement corresponding to the Chinese query data information according to the quality coefficients corresponding to each of the predicted SQL statements comprises:
recording the predicted SQL statement with the highest recognition probability in all the predicted SQL statements as a first SQL statement;
inputting the first SQL statement into the verification model, and verifying the first SQL statement through the verification model to obtain an execution result corresponding to the first SQL statement;
determining a quality coefficient corresponding to the first SQL statement according to an execution result corresponding to the first SQL statement;
and when the quality coefficient reaches a preset threshold value, setting the quality coefficients corresponding to all the predicted SQL statements except the first SQL statement to be zero, and simultaneously determining the first SQL statement as the final SQL statement.
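The verification step of claim 6 can be sketched with Python's built-in `sqlite3` module standing in for the verification model. The scoring rule is hypothetical (1.0 when the statement executes and returns rows, 0.5 when it executes but returns nothing, 0.0 when execution fails), as are the function names and the threshold. The claim quoted above does not say what happens when the threshold is not reached, so the sketch simply returns `None` in that case.

```python
import sqlite3


def quality_coefficient(conn, sql):
    """Hypothetical scoring derived from the execution result of one statement."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return 0.0              # statement could not be executed
    return 1.0 if rows else 0.5


def choose_final(conn, ranked, threshold=1.0):
    """ranked: list of (predicted SQL, recognition probability), best first."""
    first_sql = ranked[0][0]    # statement with the highest recognition probability
    coefficient = quality_coefficient(conn, first_sql)
    if coefficient >= threshold:
        # All other predicted statements get a quality coefficient of zero.
        coefficients = {sql: 0.0 for sql, _ in ranked[1:]}
        coefficients[first_sql] = coefficient
        return first_sql, coefficients
    return None, {first_sql: coefficient}  # behaviour not covered by this claim


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (year INTEGER, revenue REAL)")
conn.execute("INSERT INTO t VALUES (2019, 150.0)")
final_sql, coefficients = choose_final(conn, [
    ("SELECT revenue FROM t WHERE year = 2019", 0.504),
    ("SELECT profit FROM t WHERE year = 2019", 0.216),
])
print(final_sql, coefficients)
```

Executing model-generated SQL against a live database carries obvious risk; a read-only connection or a scratch copy of the schema would be a sensible precaution in practice.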
7. An apparatus for generating an SQL statement, comprising:
the receiving module is used for receiving the query instruction and acquiring the Chinese query data information input from the query interface;
the expansion module is used for inputting the Chinese query data information into a preset preprocessing model, and performing expansion processing on the Chinese query data information through the preprocessing model to obtain expanded information containing Chinese content;
the output module is used for inputting the expanded information into a trained Bert model, and performing part-of-speech sequence tagging on the expanded information through the Bert model to acquire a vector matrix output by the Bert model;
the prediction module is used for inputting the vector matrix into a trained Bi-LSTM model, and performing vector prediction processing on the vector matrix through the Bi-LSTM model to obtain a recognition result output by the Bi-LSTM model; the recognition result comprises at least one predicted SQL statement;
the determining module is used for inputting each predicted SQL statement into a preset verification model, verifying each predicted SQL statement through the verification model, acquiring a quality coefficient which is output by the verification model and corresponds to each predicted SQL statement, and determining a final SQL statement corresponding to the Chinese query data information according to the quality coefficient corresponding to each predicted SQL statement;
the query module is used for carrying out data query according to the final SQL statement to obtain a query result corresponding to the Chinese query data information, and displaying the query result on the query interface or playing the query result;
the expansion module is further configured to:
converting, by the preprocessing model through a number conversion method, the numbers in Chinese-character format contained in the Chinese query data information to obtain first conversion information;
the preprocessing model splits the first conversion information into a plurality of texts through a natural language processing technology; wherein each of said texts is associated with one sequence number;
the preprocessing model matches each text against the abbreviation texts in a preset synonym dictionary;
when the text does not match any abbreviation text, marking the text as an original text; when the text matches an abbreviation text, replacing the text with the full name of the matched abbreviation text, and marking the replaced text as a substitute text;
acquiring, from all the column names of a preset column name dictionary through a character string matching algorithm and a Chinese similarity matching algorithm, the column names whose matching degree with the original texts or the substitute texts reaches a preset threshold value, recording these column names as reinforced column names, and recording all the reinforced column names as reinforcement information; the column name dictionary is a dictionary of the set of column names in all preset lists;
splicing all the substitute texts with all the original texts according to the sequence numbers to obtain question information;
and splicing the question information and the reinforcement information according to a preset splicing rule to generate the expanded information.
8. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method for generating the SQL statement according to any one of claims 1 to 6 when executing the computer program.
9. A computer-readable storage medium storing a computer program, wherein the computer program is used for implementing the method for generating the SQL statement according to any one of claims 1 to 6 when executed by a processor.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010218628.4A CN111581229B (en) | 2020-03-25 | 2020-03-25 | SQL statement generation method and device, computer equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010218628.4A CN111581229B (en) | 2020-03-25 | 2020-03-25 | SQL statement generation method and device, computer equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111581229A (en) | 2020-08-25 |
CN111581229B (en) | 2023-04-18 |
Family
ID=72124173
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010218628.4A Active CN111581229B (en) | 2020-03-25 | 2020-03-25 | SQL statement generation method and device, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111581229B (en) |
Families Citing this family (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114168622A (en) * | 2020-09-10 | 2022-03-11 | 北京达佳互联信息技术有限公司 | Data query method and device based on domain specific language |
CN112328621B (en) * | 2020-11-05 | 2023-11-21 | 中国平安财产保险股份有限公司 | SQL conversion method, SQL conversion device, SQL conversion computer equipment and SQL conversion computer readable storage medium |
CN112380238B (en) * | 2020-11-16 | 2024-06-28 | 平安科技(深圳)有限公司 | Database data query method and device, electronic equipment and storage medium |
CN112463819B (en) * | 2020-11-26 | 2024-10-25 | 北京宏景世纪软件股份有限公司 | Calculation method, device, equipment and storage medium based on Chinese expression |
CN112506952A (en) * | 2020-12-11 | 2021-03-16 | 中信银行股份有限公司 | Data inquiry device and data inquiry method |
CN112732741B (en) * | 2020-12-31 | 2024-06-07 | 平安科技(深圳)有限公司 | SQL sentence generation method, device, server and computer readable storage medium |
CN113051286A (en) * | 2021-04-20 | 2021-06-29 | 中国工商银行股份有限公司 | Method and device for generating SQL (structured query language) statement conversion model |
CN113177123B (en) * | 2021-04-29 | 2023-11-17 | 思必驰科技股份有限公司 | Optimization method and system for text-to-SQL model |
CN113127450A (en) * | 2021-04-30 | 2021-07-16 | 平安普惠企业管理有限公司 | Data maintenance method and device, computer equipment and storage medium |
CN113656540B (en) * | 2021-08-06 | 2023-09-08 | 北京仁科互动网络技术有限公司 | BI query method, device, equipment and medium based on NL2SQL |
CN113868370A (en) * | 2021-08-20 | 2021-12-31 | 深延科技(北京)有限公司 | Text recommendation method and device, electronic equipment and computer-readable storage medium |
CN113726787B (en) * | 2021-08-31 | 2023-02-07 | 中国平安人寿保险股份有限公司 | SQL injection generation method, device, equipment and storage medium |
CN113886420B (en) * | 2021-09-29 | 2024-06-25 | 平安国际智慧城市科技股份有限公司 | SQL sentence generation method and device, electronic equipment and storage medium |
CN114116771A (en) * | 2021-11-29 | 2022-03-01 | 如果科技有限公司 | Voice control data analysis method and device, terminal equipment and storage medium |
CN114021573B (en) * | 2022-01-05 | 2022-04-22 | 苏州浪潮智能科技有限公司 | Natural language processing method, device, equipment and readable storage medium |
CN114691716A (en) * | 2022-04-11 | 2022-07-01 | 平安国际智慧城市科技股份有限公司 | SQL statement conversion method, device, equipment and computer readable storage medium |
CN117056416B (en) * | 2023-08-16 | 2024-05-07 | 杭州观远数据有限公司 | Flexible construction and management method for visualized data set model |
CN116991877B (en) * | 2023-09-25 | 2024-01-02 | 城云科技(中国)有限公司 | Method, device and application for generating structured query statement |
CN117131070B (en) * | 2023-10-27 | 2024-02-09 | 之江实验室 | Self-adaptive rule-guided large language model generation SQL system |
CN117667978B (en) * | 2023-12-07 | 2024-08-06 | 上海迈伺通健康科技有限公司 | Computer system for operating database by Chinese instruction |
CN118210818B (en) * | 2024-05-16 | 2024-08-20 | 武汉人工智能研究院 | SQL sentence generation method, device, electronic equipment and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105868313A (en) * | 2016-03-25 | 2016-08-17 | 浙江大学 | Mapping knowledge domain questioning and answering system and method based on template matching technique |
CN109271504A (en) * | 2018-11-07 | 2019-01-25 | 爱因互动科技发展(北京)有限公司 | The method of the reasoning dialogue of knowledge based map |
CN109408526A (en) * | 2018-10-12 | 2019-03-01 | 平安科技(深圳)有限公司 | SQL statement generation method, device, computer equipment and storage medium |
CN110309306A (en) * | 2019-06-19 | 2019-10-08 | 淮阴工学院 | A kind of Document Modeling classification method based on WSD level memory network |
CN110334210A (en) * | 2019-05-30 | 2019-10-15 | 哈尔滨理工大学 | A kind of Chinese sentiment analysis method merged based on BERT with LSTM, CNN |
- 2020-03-25 CN CN202010218628.4A patent/CN111581229B/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105868313A (en) * | 2016-03-25 | 2016-08-17 | 浙江大学 | Mapping knowledge domain questioning and answering system and method based on template matching technique |
CN109408526A (en) * | 2018-10-12 | 2019-03-01 | 平安科技(深圳)有限公司 | SQL statement generation method, device, computer equipment and storage medium |
CN109271504A (en) * | 2018-11-07 | 2019-01-25 | 爱因互动科技发展(北京)有限公司 | The method of the reasoning dialogue of knowledge based map |
CN110334210A (en) * | 2019-05-30 | 2019-10-15 | 哈尔滨理工大学 | A kind of Chinese sentiment analysis method merged based on BERT with LSTM, CNN |
CN110309306A (en) * | 2019-06-19 | 2019-10-08 | 淮阴工学院 | A kind of Document Modeling classification method based on WSD level memory network |
Non-Patent Citations (2)
Title |
---|
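Meina Song. Hierarchical Schema Representation for Text-to-SQL Parsing With Decomposing Decoding. IEEE Access. 2019, full text. * |
杨梦琴 (Yang Mengqin). Research on Semantics-Driven Data Query and Intelligent Visualization. China Excellent Master's Theses Electronic Journal Network. 2019, (No. 4), full text. * |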
Also Published As
Publication number | Publication date |
---|---|
CN111581229A (en) | 2020-08-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111581229B (en) | SQL statement generation method and device, computer equipment and storage medium | |
CN110457431B (en) | Knowledge graph-based question and answer method and device, computer equipment and storage medium | |
CN109376222B (en) | Question-answer matching degree calculation method, question-answer automatic matching method and device | |
CN110688853B (en) | Sequence labeling method and device, computer equipment and storage medium | |
CN114139551A (en) | Method and device for training intention recognition model and method and device for recognizing intention | |
CN113553853B (en) | Named entity recognition method and device, computer equipment and storage medium | |
CN113449489B (en) | Punctuation mark labeling method, punctuation mark labeling device, computer equipment and storage medium | |
CN112860919B (en) | Data labeling method, device, equipment and storage medium based on generation model | |
CN111429204A (en) | Hotel recommendation method, system, electronic equipment and storage medium | |
CN111695338A (en) | Interview content refining method, device, equipment and medium based on artificial intelligence | |
CN113887229A (en) | Address information identification method and device, computer equipment and storage medium | |
CN112766319A (en) | Dialogue intention recognition model training method and device, computer equipment and medium | |
CN113886550A (en) | Question-answer matching method, device, equipment and storage medium based on attention mechanism | |
CN112949320B (en) | Sequence labeling method, device, equipment and medium based on conditional random field | |
CN113656547A (en) | Text matching method, device, equipment and storage medium | |
CN115357699A (en) | Text extraction method, device, equipment and storage medium | |
CN113779994A (en) | Element extraction method and device, computer equipment and storage medium | |
CN115438650B (en) | Contract text error correction method, system, equipment and medium fusing multi-source characteristics | |
CN111554275A (en) | Speech recognition method, device, equipment and computer readable storage medium | |
CN113051920A (en) | Named entity recognition method and device, computer equipment and storage medium | |
CN111898339A (en) | Ancient poetry generation method, device, equipment and medium based on constraint decoding | |
CN115545035B (en) | Text entity recognition model and construction method, device and application thereof | |
CN112149424A (en) | Semantic matching method and device, computer equipment and storage medium | |
CN116166858A (en) | Information recommendation method, device, equipment and storage medium based on artificial intelligence | |
CN114638229A (en) | Entity identification method, device, medium and equipment of record data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||