CN112001188B

CN112001188B - Method and device for rapidly realizing NL2SQL based on vectorization semantic rule

Info

Publication number: CN112001188B
Application number: CN202011184694.0A
Authority: CN
Inventors: 肖超峰; 李智; 钱泓锦; 刘占亮
Original assignee: Beijing Zhiyuan Artificial Intelligence Research Institute
Current assignee: Beijing Zhiyuan Artificial Intelligence Research Institute
Priority date: 2020-10-30
Filing date: 2020-10-30
Publication date: 2021-03-16
Anticipated expiration: 2040-10-30
Also published as: CN112001188A

Abstract

The invention discloses a method and a device for rapidly implementing NL2SQL based on vectorized semantic rules. The method comprises the following steps: performing word segmentation and entity recognition on a first sentence expressed in natural language; replacing each recognized entity in the first sentence with its preset entity type to obtain a second sentence; recognizing the second sentence according to a preset semantic rule template to obtain semantic fragments; matching the semantic fragments to obtain table and field information of a business database; and generating an SQL statement according to the table and field information of the business database. NL2SQL can be implemented rapidly without depending on a complex system or database. Because semantic fragments in natural-language sentences are recognized based on vectorized semantic rules, the accuracy and generalization capability of semantic search are improved and a high recall rate is achieved.

Description

Method and device for rapidly realizing NL2SQL based on vectorization semantic rule

Technical Field

The invention relates to the technical field of natural language processing, in particular to a method and a device for quickly realizing NL2SQL based on vectorized semantic rules.

Background

In the field of semantic search, freely querying target data in a database through natural language has become an emerging research hotspot in the industry. Converting natural language into a standard semantic representation that a computer can understand and execute is a subtask of semantic parsing. NL2SQL (Natural Language to SQL) is a technology that converts a user's natural-language sentence into an SQL statement that a computer can execute.

In actual professional application scenarios, labeled corpora for the professional field are insufficient or entirely lacking, so a corresponding model cannot be trained; rapidly implementing NL2SQL in combination with a business data model therefore remains a difficult problem. In addition, the field attributes parsed from the natural-language sentence lack an accurate mapping to the fields in the business database, so the whole processing pipeline is complex and executable SQL cannot be generated correctly.

Disclosure of Invention

In order to solve the problems in the prior art, the invention provides the following technical scheme.

One aspect of the present invention provides a method for rapidly implementing NL2SQL based on vectorized semantic rules, comprising:

performing word segmentation processing and entity recognition on a first sentence based on a natural language;

replacing the corresponding entity in the first statement by using a preset entity type to obtain a second statement;

identifying the second sentence according to a preset semantic rule template to obtain a semantic segment;

obtaining table and field information of a service database according to the semantic fragment matching;

and generating SQL sentences according to the table and field information of the service database.

Preferably, the performing word segmentation processing and entity recognition on the first sentence based on the natural language comprises:

performing word segmentation processing and entity recognition on the first sentence by using a predefined entity rule template and a conventional dictionary to obtain a preset entity type corresponding to an entity in the first sentence; the entity rule template includes a custom professional domain dictionary.

Preferably, the entity rule template further comprises a third party model interface for invoking a third party entity recognition model.

Preferably, the recognizing the second sentence according to a preset semantic rule template to obtain a semantic segment includes:

matching each participle in the second sentence with a word in the semantic rule template to obtain a semantic rule;

and identifying elements of the semantic fragments according to the semantic rules.

Preferably, matching each participle in the second sentence with a word in the semantic rule template comprises:

converting the participles into participle vectors, and calculating the similarity between the participle vectors and the vectors of the words in the semantic rule template; and if the similarity reaches a threshold value, replacing the participle with a word in the semantic rule template.

Preferably, the obtaining of the table and field information of the service database according to the semantic segment matching includes:

matching the elements of the semantic fragments with the information description rules of preset business database tables to obtain the corresponding table information, field information and inter-table association information.

Preferably, the generating an SQL statement according to the table and field information of the service database includes:

obtaining the constituent elements of the SQL statement according to the corresponding table information, field information and inter-table association information;

and filling the structural elements into a rule template of the SQL statement to generate the SQL statement.

The second aspect of the present invention provides a device for quickly implementing NL2SQL based on vectorized semantic rules, including:

the entity recognition module is used for performing word segmentation processing and entity recognition on the first sentence based on the natural language;

the entity type replacing module is used for replacing the corresponding entity in the first statement by using a preset entity type to obtain a second statement;

the semantic segment recognition module is used for recognizing the second sentence according to a preset semantic rule template to obtain a semantic segment;

the service database table matching module is used for obtaining the table and field information of the service database according to the semantic segment matching;

and the SQL statement generating module is used for generating the SQL statement according to the table and field information of the business database.

A third aspect of the invention provides a memory storing a plurality of instructions for implementing the method described above.

A fourth aspect of the present invention provides an electronic device, comprising a processor and a memory connected to the processor, wherein the memory stores a plurality of instructions, and the instructions are loaded and executed by the processor, so that the processor can execute the method.

The invention has the following beneficial effects: in the technical solution provided by the invention, for application scenarios in a professional field and in combination with the structured description information of the business data, matching files are simply configured in advance according to rule templates. Using the configured files, word segmentation is performed on the natural-language sentence, semantic fragments are recognized according to the semantic rules with the assistance of entities, the business database tables are then matched according to the field information in the semantic fragments, and the SQL statement is finally generated. NL2SQL can be implemented rapidly with only a simple and easily understood file template configuration, without depending on a complex system or database. Because the semantic fragments in natural-language sentences are recognized based on vectorized semantic rules, good accuracy and generalization capability are ensured and the recall rate is high.

Drawings

FIG. 1 is a schematic flow chart of a method for rapidly implementing NL2SQL based on vectorized semantic rules according to the present invention;

FIG. 2 is a schematic structural diagram of the device for rapidly implementing NL2SQL based on vectorized semantic rules according to the present invention.

Detailed Description

In order to better understand the technical solution, the technical solution will be described in detail with reference to the drawings and the specific embodiments.

The method provided by the invention can be implemented in the following terminal environment, and the terminal can comprise one or more of the following components: a processor, a memory, and a display screen. Wherein the memory has stored therein at least one instruction that is loaded and executed by the processor to implement the methods described in the embodiments described below.

A processor may include one or more processing cores. The processor connects the various parts of the terminal through various interfaces and lines, and performs the functions of the terminal and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory and by calling data stored in the memory.

The memory may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). The memory may be used to store instructions, programs, code sets, or instruction sets.

The display screen is used for displaying user interfaces of all the application programs.

In addition, those skilled in the art will appreciate that the above-described terminal configurations are not intended to be limiting, and that the terminal may include more or fewer components, or some components may be combined, or a different arrangement of components. For example, the terminal further includes a radio frequency circuit, an input unit, a sensor, an audio circuit, a power supply, and other components, which are not described herein again.

Example one

As shown in fig. 1, an embodiment of the present invention provides a method for quickly implementing NL2SQL based on vectorized semantic rules, including:

S101, performing word segmentation processing and entity recognition on a first sentence based on a natural language;

S102, replacing a corresponding entity in the first sentence by using a preset entity type to obtain a second sentence;

S103, identifying the second sentence according to a preset semantic rule template to obtain a semantic fragment;

S104, obtaining table and field information of a business database according to the semantic fragment matching;

and S105, generating an SQL statement according to the table and field information of the business database.

Executing step S101, specifically including: performing word segmentation processing and entity recognition on the first sentence by using a predefined entity rule template and a conventional dictionary to obtain a preset entity type corresponding to an entity in the first sentence; the entity rule template includes a custom professional domain dictionary.

In a specific professional business field there are many specialized terms and newly coined words, and a conventional dictionary often cannot segment natural-language sentences of that field accurately. To solve this problem, the embodiment of the invention uses a custom professional domain dictionary, in which a user can define specialized terms and/or newly coined words, as well as synonyms, stop words, keywords, entities and the like. Because the custom professional domain dictionary matches the professional sentences better, using it for word segmentation and entity recognition improves precision and suits the business scenario better.

In the actual application process, an entity rule template can be predefined, and a customized professional domain dictionary is configured in the entity rule template. The content format of the entity rule template is entity type and dictionary name, and a plurality of words are separated by commas, which is exemplified as follows:

#{person}=http://ip:port/person q=    // interface address of the third-party person-name entity model
Lixia, Lijian, Liwen, ...    // custom dictionary
#stock name
Digital video signal
#{date}=http://ip:port/date q=    // time entity
#{place}=http://ip:port/place q=    // location entity
#end

In a specific implementation, a Trie (dictionary tree) built from the conventional dictionary may be combined with the entity rule template, and the forward and reverse maximum matching algorithm is applied to segment the input natural-language first sentence and thereby recognize entities. For example,

First sentence: men with height greater than 175 and blood type B who have been to Beijing in the last 3 months.

Word segmentation and entity recognition result: last 3 months/date  been to/vq  Beijing/place  (particle)/uj  height/v  greater than/d  175/x  blood type/n  B/x  (particle)/uj  male/n.
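The forward and reverse maximum matching described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it uses a plain Python set instead of a Trie, the tie-breaking rule (fewer tokens, then fewer single characters, then the reverse scan) is a common heuristic assumed here, and the Chinese sample strings and the `VOCAB` and `ENTITY_TYPES` tables are hypothetical.

```python
def forward_max_match(text, vocab, max_len):
    """Scan left to right, always taking the longest dictionary word."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:  # fall back to a single character
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text, vocab, max_len):
    """Scan right to left, always taking the longest dictionary word."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                j -= size
                break
    return tokens[::-1]


def bidirectional_max_match(text, vocab):
    """Run both scans; prefer fewer tokens, then fewer single characters."""
    max_len = max(map(len, vocab), default=1)
    fwd = forward_max_match(text, vocab, max_len)
    bwd = backward_max_match(text, vocab, max_len)

    def cost(tokens):
        return (len(tokens), sum(len(t) == 1 for t in tokens))

    return fwd if cost(fwd) < cost(bwd) else bwd


def tag_entities(tokens, entity_types):
    """Attach the configured entity type (or None) to every token."""
    return [(token, entity_types.get(token)) for token in tokens]


# Hypothetical dictionary: "last 3 months", "been to", "Beijing", "height".
VOCAB = {"最近3个月", "去过", "北京", "身高"}
ENTITY_TYPES = {"最近3个月": "date", "北京": "place"}

tokens = bidirectional_max_match("最近3个月去过北京", VOCAB)
tagged = tag_entities(tokens, ENTITY_TYPES)
```

A Trie would give the same segmentation while avoiding repeated substring lookups; the set is used here only to keep the sketch short.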

In a preferred embodiment of the present invention, the entity rule template further comprises a third-party model interface for invoking a third-party entity recognition model, as shown in the example above. The third-party entity recognition model can be called directly through the configured interface address. The parameter is expressed, for example, as #{date}=http://ip:port/date q=, and the standard data return format is, for example:

{"type":"date","entity":"2013-06-08"},

{"type":"person","entity":"Liu Xiaohua"}.

Because the customized professional field dictionary comprises clear professional field vocabularies, the customized professional field dictionary is preferentially adopted in the process of word segmentation and entity recognition, and the accuracy is higher compared with the third-party entity recognition model.
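The priority of the custom dictionary over the third-party model can be sketched as below. The sketch assumes the model returns a JSON array of records in the return format shown above; the function names and the sample conflict are hypothetical.

```python
import json


def parse_model_response(payload):
    """Turn a JSON array of {"type", "entity"} records into {entity: type}."""
    return {item["entity"]: item["type"] for item in json.loads(payload)}


def merge_entities(dictionary_hits, model_hits):
    """Combine both sources; the custom dictionary wins on conflicts."""
    merged = dict(model_hits)
    merged.update(dictionary_hits)
    return merged


# Hypothetical response: the model mislabels "Lixia" as a place.
payload = (
    '[{"type": "date", "entity": "2013-06-08"},'
    ' {"type": "place", "entity": "Lixia"}]'
)
dictionary_hits = {"Lixia": "person"}  # entry from the custom dictionary
entities = merge_entities(dictionary_hits, parse_model_response(payload))
```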

Step S102 is executed: the corresponding entities in the first sentence are replaced with their entity types to obtain the second sentence. For example, in

last 3 months/date  been to/vq  Beijing/place  (particle)/uj  height/v  greater than/d  175/x  blood type/n  B/x  (particle)/uj  male/n

"last 3 months" is the date entity {date} and "Beijing" is the place entity {place}. After the entities are replaced with their entity types, the second sentence obtained is:

men with height greater than 175 and blood type B who have been to {place} in {date}.
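Step S102 amounts to a simple substitution over the tagged tokens. The following sketch assumes the tokens arrive as (word, tag) pairs in segmentation order; the function name and the set of entity tags are illustrative, and the original entity text is kept so that it can be restored when values are extracted later.

```python
ENTITY_TAGS = {"date", "place", "person"}  # assumed set of configured entity types


def replace_entities(tagged_tokens):
    """Replace entity tokens with {type} placeholders and remember the originals."""
    second_sentence, slots = [], {}
    for word, tag in tagged_tokens:
        if tag in ENTITY_TAGS:
            second_sentence.append("{" + tag + "}")
            slots.setdefault(tag, []).append(word)
        else:
            second_sentence.append(word)
    return second_sentence, slots


tagged_tokens = [
    ("last 3 months", "date"), ("been to", "vq"), ("Beijing", "place"),
    ("height", "v"), ("greater than", "d"), ("175", "x"),
    ("blood type", "n"), ("B", "x"), ("male", "n"),
]
second_sentence, slots = replace_entities(tagged_tokens)
```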

And S103, identifying the second sentence according to a preset semantic rule template to obtain a semantic segment. The method specifically comprises the following steps: matching each participle in the second sentence with a word in the semantic rule template to obtain a semantic rule; and identifying elements of the semantic fragments according to the semantic rules.

The semantic rule template is provided with words, semantic rules and element fields. Examples are as follows:

#height    // semantic fragment rule template
vector=[height, ...]    // in use, each word is vectorized with word2vector into a 128-dimensional representation
frag=[height(greater than|less than|)\d{3}(cm|m)],[...]    // semantic rules; multiple rules are separated by #; {slot} represents the different expressions of the words above
operate=[greater than: >],[less than: <],...    // obtained on the basis of frag
value=[\d{2,3}],[...]    // field value obtained on the basis of frag
mapvalue=    // mapping to the actual database value
islike=0|1    // whether to enable fuzzy search
#gender
vector=[male, female, ...]    // each word is vectorized into a 128-dimensional representation
frag=[(male|female)]
operate=[]
value=[(male|female)]
mapvalue=[male: 1],[female: 0]    // mapping to the actual database value
islike=1
#blood type
vector=[blood type, AB, O, ...]    // each slot word is vectorized into a 128-dimensional representation
frag=[blood type(A|B|AB|O), (A|B|AB|O) type, ...]
operate=[]
value=[(A|B|AB|O)]
islike=0
#destination
vector=[go, ...]    // in use, each word is vectorized into a 128-dimensional representation
frag=[been to{place}, to{place}, ...]
operate=[]
value=[{place}]
islike=0
#time
slot=[recently, these days, ...]
frag=[{date}, ...]
operate=[]
value=[{date}, \d{1,3}]
islike=0
#end

As an example, suppose the second sentence is:

men with height greater than 175 and blood type B who have been to {place} in {date}.

The participle "been to" is matched against the frag words in the semantic rule template, yielding the semantic rule "been to {place}". According to this rule, the semantic fragment "been to Beijing" is identified in the first sentence and its element information is extracted: #destination, with operate "=" and value "Beijing".

As another example, the participle "height" is matched against the frag word "height" in the semantic rule template, yielding the semantic rule "height(greater than|less than|)\d{3}(cm|m)". According to this rule, the semantic fragment "height greater than 175" is identified in the first sentence and its element information is extracted: #height, with operate "greater than" and value "175".
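The two matches above can be reproduced with ordinary regular expressions. The sketch below is only an approximation of the template: the `RULES` structure, the named groups, the English token order of the sample sentence and the default operator "=" for rules without an operate mapping are all assumptions made for illustration.

```python
import re

# Hypothetical in-memory form of two entries of the semantic rule template.
RULES = [
    {
        "name": "height",
        "frag": re.compile(r"height (?P<op>greater than|less than) (?P<value>\d{2,3})"),
        "operate": {"greater than": ">", "less than": "<"},
    },
    {
        "name": "destination",
        "frag": re.compile(r"(?:been to|go to) (?P<value>\{place\})"),
        "operate": {},
    },
]


def extract_fragments(sentence, slots):
    """Apply every rule to the second sentence and collect fragment elements."""
    found = []
    for rule in RULES:
        match = rule["frag"].search(sentence)
        if match is None:
            continue
        groups = match.groupdict()
        # Rules without an operator mapping default to equality.
        operate = rule["operate"].get(groups.get("op"), "=")
        value = groups["value"]
        if value.startswith("{") and value.endswith("}"):
            value = slots[value[1:-1]]  # restore the original entity text
        found.append({"entType": rule["name"], "frag": match.group(0),
                      "operate": operate, "value": value})
    return found


sentence = "{date} been to {place} height greater than 175 blood type B male"
fragments = extract_fragments(sentence, {"date": "last 3 months", "place": "Beijing"})
```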

In a preferred embodiment of the present invention, when each participle in the second sentence is matched with a word in the semantic rule template, the participle is converted into a participle vector, and the similarity between the participle vector and the vector of the word in the semantic rule template is calculated; and if the similarity reaches a threshold value, replacing the participle with a word in the semantic rule template, and then matching, so that the recall rate and generalization capability of semantic recognition can be greatly improved.

In the specific implementation process, the above-mentioned word segmentation conversion and replacement may be performed before matching under all circumstances, or may be performed when the Frag word cannot be matched by using the original word segmentation. The specific setting can be carried out according to the actual situation.

As an example, suppose the sentence is:

men with height greater than 175 and blood type B who have visited {place} in {date}.

The participle "visited" does not match the frag word "been to" in the semantic rule template, so no semantic rule can be obtained directly. The participle "visited" is therefore converted into a vector and its similarity to the vector of "been to" is calculated; if the similarity reaches the threshold, "visited" is replaced with "been to" and matching is performed again.
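The vector-based replacement can be sketched with cosine similarity. This is a toy illustration under stated assumptions: the text specifies 128-dimensional word2vec vectors but neither the similarity measure nor the threshold, so cosine similarity, the threshold of 0.8 and the 3-dimensional `VECTORS` table below are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def normalize_token(token, template_words, vectors, threshold=0.8):
    """Replace a token with the most similar template word above the threshold."""
    if token in template_words or token not in vectors:
        return token
    best_word, best_score = token, threshold
    for word in template_words:
        if word not in vectors:
            continue
        score = cosine(vectors[token], vectors[word])
        if score >= best_score:
            best_word, best_score = word, score
    return best_word


# Toy 3-dimensional vectors standing in for 128-dimensional word2vec output.
VECTORS = {
    "been to": [1.0, 0.0, 0.0],
    "visited": [0.9, 0.1, 0.0],
    "height": [0.0, 1.0, 0.0],
    "blood type": [0.0, 0.0, 1.0],
}
TEMPLATE_WORDS = ["been to", "height"]

normalized = [normalize_token(t, TEMPLATE_WORDS, VECTORS)
              for t in ["visited", "blood type", "male"]]
```

Running the replacement only when exact matching fails, as the preceding paragraph allows, avoids computing similarities for tokens that already match.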

Step S104 is executed: the table and field information of the business database is obtained by matching the semantic fragments, which specifically comprises: matching the elements of the semantic fragments with the information description rules of preset business database tables to obtain the corresponding table information, field information and inter-table association information.

The information description rules of the business database tables can be configured in a template. They describe the concrete table structures and inter-table associations of the database, so that SQL can finally be generated to query the database and retrieve data. In the configuration file of the following example, table associations are described in the section beginning with #table, and business fields are described in the section beginning with #fields, in the format table name = [field description: field name: type], with multiple fields separated by commas.

#table
[person.id=behavior.person_id],[...]
#fields
person=[height:height:int, gender:gender:string, blood type:bloodtype:string, birth place:csd:string, ...]
behavior=[destination:ddd:string, arrival time:ddsj:date, ...]
#end
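A configuration file of this shape can be read with a few lines of code. The parser below is a sketch that assumes every field entry has the three parts named in the text (field description, field name, type) and silently skips anything else, such as the "..." placeholders; the function name and the `CONFIG` sample are hypothetical.

```python
import re


def parse_table_rules(text):
    """Return (joins, fields) from a #table / #fields configuration text."""
    joins, fields, section = [], {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "#end":
            continue
        if line.startswith("#"):
            section = line[1:]  # "table" or "fields"
        elif section == "table":
            joins.extend(re.findall(r"\[([^\]]+)\]", line))
        elif section == "fields":
            table, _, body = line.partition("=")
            for item in body.strip().strip("[]").split(","):
                parts = [p.strip() for p in item.split(":")]
                if len(parts) == 3:  # description : field name : type
                    description, column, col_type = parts
                    fields[description] = {"table": table.strip(),
                                           "field": column,
                                           "fieldType": col_type}
    return joins, fields


CONFIG = """
#table
[person.id=behavior.person_id]
#fields
person=[height:height:int,gender:gender:string,blood type:bloodtype:string,birth place:csd:string]
behavior=[destination:ddd:string,arrival time:ddsj:date]
#end
"""
joins, fields = parse_table_rules(CONFIG)
```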

Using the semantic fragments recognized in step S103 (#time, #destination, #gender, #blood type, #height and so on), the information description rules of the business database tables (for example the person and behavior entries above) are matched to obtain the tables, the fields and the inter-table association information, such as the field name, field type and operator of each relevant field. The table and field information can be output in JSON format, with the following result:

[
  {
    "entType": "height",
    "frag": "height 175cm",
    "operate": ">",
    "value": "175",
    "table": "person",
    "field": "height",
    "fieldType": "double"
  },
  {
    "entType": "gender",
    "frag": "male",
    "operate": "=",
    "value": "male",
    "table": "person",
    "field": "gender",
    "fieldType": "string"
  },
  {
    "entType": "blood type",
    "frag": "blood type B",
    "operate": "=",
    "value": "B",
    "table": "person",
    "field": "bloodtype",
    "fieldType": "string"
  },
  {
    "entType": "arrival time",
    "frag": "last 3 months",
    "operate": "",
    "value": "3",
    "table": "behavior",
    "field": "ddsj",
    "fieldType": "date"
  },
  {
    "entType": "destination",
    "frag": "been to Beijing",
    "operate": "=",
    "value": "Beijing",
    "table": "behavior",
    "field": "ddd",
    "fieldType": "string"
  }
]
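Producing records like these is essentially a lookup of each fragment's type in the field descriptions. The sketch below assumes the field descriptions are already available as a dictionary keyed by field description; `FIELDS`, `match_fields` and the sample fragments are hypothetical, and fragments with no configured field are simply dropped.

```python
import json

# Hypothetical field descriptions, keyed by the field description text.
FIELDS = {
    "height": {"table": "person", "field": "height", "fieldType": "double"},
    "blood type": {"table": "person", "field": "bloodtype", "fieldType": "string"},
    "destination": {"table": "behavior", "field": "ddd", "fieldType": "string"},
}


def match_fields(fragments, fields):
    """Attach table, field and type to every fragment that has a description."""
    matched = []
    for fragment in fragments:
        info = fields.get(fragment["entType"])
        if info is not None:
            matched.append({**fragment, **info})
    return matched


fragments = [
    {"entType": "height", "frag": "height greater than 175", "operate": ">", "value": "175"},
    {"entType": "destination", "frag": "been to Beijing", "operate": "=", "value": "Beijing"},
    {"entType": "weight", "frag": "weight 70", "operate": "=", "value": "70"},
]
records = match_fields(fragments, FIELDS)
output = json.dumps(records, ensure_ascii=False, indent=2)
```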

Step S105 is executed: an SQL statement is generated according to the table and field information of the business database, which specifically comprises: obtaining the constituent elements of the SQL statement according to the corresponding table information, field information and inter-table association information; and filling the constituent elements into a rule template of the SQL statement to generate the SQL statement.

As an example, the rule template of the SQL statement is: select A from B where C and D, where A, B, C and D are the constituent elements of the SQL statement.

Following the example above, the values of "table" and "field" are taken from the table and field information of the business database obtained in step S104. After splicing, constituent element A is: person.height, person.gender, person.bloodtype, behavior.ddsj, behavior.ddd; constituent element B is: person person, behavior behavior; constituent element C is: person.id = behavior.person_id; constituent element D consists of conditions of the form table.field + operate + value. Multiple conditions are joined with and by default, and multiple conditions on the same field are joined with or. The result is: person.gender = 'male' and person.bloodtype = 'B' and person.height > 175 and behavior.ddd = 'Beijing' and behavior.ddsj >= 'xxx'.

Filling constituent elements A, B, C and D into the rule template generates the following SQL statement:

select person.height, person.gender, person.bloodtype, behavior.ddsj, behavior.ddd from person person, behavior behavior where person.id = behavior.person_id and person.gender = 'male' and person.bloodtype = 'B' and person.height > 175 and behavior.ddd = 'Beijing' and behavior.ddsj >= 'xxx'.
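Filling the select A from B where C and D template can be sketched as follows. Several choices here are assumptions rather than part of the described method: table aliases are omitted, records with an empty operator contribute only a column (the text does not say how the date condition is derived), the operator whitelist and the quoting helper are illustrative, and a production system would normally bind values as query parameters instead of splicing them into the string.

```python
import re

NUMERIC_TYPES = {"int", "double"}
ALLOWED_OPERATORS = {"=", ">", "<", ">=", "<="}


def sql_literal(value, field_type):
    """Render a value as an SQL literal; numbers are validated, strings quoted."""
    if field_type in NUMERIC_TYPES:
        if not re.fullmatch(r"-?\d+(\.\d+)?", value):
            raise ValueError(f"not a number: {value!r}")
        return value
    return "'" + value.replace("'", "''") + "'"


def build_sql(records, joins):
    """Fill the template: select A from B where C and D."""
    columns, tables, by_column = [], [], {}
    for rec in records:
        column = f"{rec['table']}.{rec['field']}"
        if column not in columns:
            columns.append(column)          # element A
        if rec["table"] not in tables:
            tables.append(rec["table"])     # element B
        if rec["operate"]:                  # empty operator: column only
            if rec["operate"] not in ALLOWED_OPERATORS:
                raise ValueError(f"unsupported operator: {rec['operate']!r}")
            literal = sql_literal(rec["value"], rec["fieldType"])
            by_column.setdefault(column, []).append(
                f"{column} {rec['operate']} {literal}")
    # Element C: keep only joins whose tables all take part in the query.
    clauses = [j for j in joins
               if {side.split(".")[0].strip() for side in j.split("=")} <= set(tables)]
    # Element D: "or" within one column, "and" across columns.
    for parts in by_column.values():
        clauses.append(parts[0] if len(parts) == 1 else "(" + " or ".join(parts) + ")")
    sql = f"select {', '.join(columns)} from {', '.join(tables)}"
    return sql + (" where " + " and ".join(clauses) if clauses else "")


RECORDS = [
    {"operate": ">", "value": "175", "table": "person", "field": "height", "fieldType": "double"},
    {"operate": "=", "value": "male", "table": "person", "field": "gender", "fieldType": "string"},
    {"operate": "=", "value": "B", "table": "person", "field": "bloodtype", "fieldType": "string"},
    {"operate": "", "value": "3", "table": "behavior", "field": "ddsj", "fieldType": "date"},
    {"operate": "=", "value": "Beijing", "table": "behavior", "field": "ddd", "fieldType": "string"},
]
sql = build_sql(RECORDS, ["person.id=behavior.person_id"])
```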

The method provided by the invention has the following beneficial effects:

Firstly, semantic fragments in natural-language sentences are recognized based on vectorized semantic rules, and the relevant field information in the semantic fragments is extracted.

Secondly, a professional domain dictionary can be configured conveniently, and a third-party entity recognition model can be referenced flexibly by configuring an interface address. The combination of the two provides good extensibility and usability.

Thirdly, the business data table structure, the professional domain dictionary and the semantic recognition rules can all be configured through a simple and easily understood file template format, and the configured elements can reference one another across templates. Operation is simple, clear and convenient, does not depend on a complex system or database, and is easy to extend.

Fourthly, fuzzy search on fields can be enabled through configuration, mapping between recognized attribute values and database field values is supported, and multi-table association queries are also supported, which broadens the range of application.

Example two

As shown in fig. 2, another aspect of the present invention further includes a functional module architecture completely corresponding to and consistent with the foregoing method flow, that is, an embodiment of the present invention further provides an apparatus for quickly implementing NL2SQL based on vectorized semantic rules, including:

an entity recognition module 201, configured to perform word segmentation processing and entity recognition on a first sentence based on a natural language;

an entity type replacing module 202, configured to replace a corresponding entity in the first statement with a preset entity type, to obtain a second statement;

the semantic segment recognition module 203 is configured to recognize the second sentence according to a preset semantic rule template to obtain a semantic segment;

a service database table matching module 204, configured to obtain the table and field information of the service database according to the semantic segment matching;

the SQL statement generating module 205 is configured to generate an SQL statement according to the table and field information of the service database.

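Further, performing word segmentation processing and entity recognition on the first sentence based on the natural language comprises: performing word segmentation processing and entity recognition on the first sentence by using a predefined entity rule template and a conventional dictionary to obtain a preset entity type corresponding to an entity in the first sentence, where the entity rule template includes a custom professional domain dictionary.

Still further, the entity rule template further includes a third-party model interface for invoking a third-party entity recognition model.

Further, recognizing the second sentence according to a preset semantic rule template to obtain a semantic fragment includes: matching each participle in the second sentence with a word in the semantic rule template to obtain a semantic rule; and identifying the elements of the semantic fragment according to the semantic rule.

Further, matching each participle in the second sentence with a word in the semantic rule template comprises: converting the participle into a participle vector, and calculating the similarity between the participle vector and the vectors of the words in the semantic rule template; and if the similarity reaches a threshold value, replacing the participle with the word in the semantic rule template.

Further, obtaining the table and field information of the business database according to the semantic fragment matching includes: matching the elements of the semantic fragment with the information description rules of preset business database tables to obtain the corresponding table information, field information and inter-table association information.

Further, generating an SQL statement according to the table and field information of the business database includes: obtaining the constituent elements of the SQL statement according to the corresponding table information, field information and inter-table association information; and filling the constituent elements into a rule template of the SQL statement to generate the SQL statement.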

The device can be implemented by the method provided in the first embodiment, and the specific implementation method can be referred to the description in the first embodiment, which is not described herein again.

The invention also provides a memory storing a plurality of instructions for implementing the method according to the first embodiment.

The invention also provides an electronic device comprising a processor and a memory connected to the processor, wherein the memory stores a plurality of instructions, and the instructions can be loaded and executed by the processor to enable the processor to execute the method according to the first embodiment.

While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention. It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. A method for rapidly realizing NL2SQL based on vectorized semantic rules is characterized by comprising the following steps:

performing word segmentation processing and entity recognition on a first sentence based on a natural language, wherein the word segmentation processing and the entity recognition comprise the following steps:

performing word segmentation processing and entity recognition on the first sentence by using a predefined entity rule template and a conventional dictionary to obtain a preset entity type corresponding to an entity in the first sentence; the entity rule template comprises a self-defined professional field dictionary; the entity rule template further comprises a third party model interface for calling a third party entity identification model;

identifying the second sentence according to a preset semantic rule template to obtain a semantic segment, wherein the semantic segment comprises:

identifying elements of semantic fragments according to the semantic rules;

wherein, the semantic rule template is provided with words, semantic rules and element fields;

obtaining the table and field information of the service database according to the semantic fragment matching, wherein the table and field information comprises the following steps:

matching the elements of the semantic fragments with information description rules of a preset business database table to obtain corresponding table information, field information and correlation information among tables;

wherein, the table and field information output is in a JSON format;

generating SQL sentences according to the table and field information of the service database, including: obtaining the structural elements of the SQL statement according to the information and the field information of the corresponding tables and the correlation information among the tables;

filling the constituent elements into a rule template of an SQL statement to generate the SQL statement;

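the matching each participle in the second sentence with a term in the semantic rule template comprises: converting the participles into participle vectors, and calculating the similarity between the participle vectors and the vectors of the words in the semantic rule template; and if the similarity reaches a threshold value, replacing the participle with the word in the semantic rule template.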

2. A memory storing a plurality of instructions for implementing the method of claim 1.

3. An electronic device comprising a processor and a memory coupled to the processor, the memory storing a plurality of instructions that are loadable and executable by the processor to enable the processor to perform the method of claim 1.