CN118467669A - Index construction method, field searching method, device, equipment and medium

Info

Publication number: CN118467669A
Application number: CN202410565836.XA
Authority: CN
Inventors: 马海寅; 胡清霜; 别彬彬; 王尧舒; 沈吉明; 屈爽; 苏阳华; 谭洁; 梁建铖; 田野
Original assignee: Shenzhen Institute of Computing Sciences
Current assignee: Shenzhen Institute of Computing Sciences
Priority date: 2024-05-09
Filing date: 2024-05-09
Publication date: 2024-08-09

Abstract

The application belongs to the technical field of data processing, and particularly relates to an index construction method, a field searching method, a device, equipment and a medium. In the method, for any data column in a public data set, field extraction is performed on the data column according to the field extraction mode corresponding to the data type of the data column to obtain field features and feature vectors of the data column; an inverted index is constructed according to the field features and storage information, and a vector index is constructed according to the feature vectors and the storage information. At query time, a first data set is determined according to the data to be queried and the inverted index, a second data set is determined according to the vector to be queried and the vector index, a candidate data set is formed from the first data set and the second data set, and all data in the candidate data set are sorted according to the data to be queried, the sorted data being the query result. A finer-grained index structure is thereby constructed within the public data set, and the comprehensiveness and accuracy of search results are improved.

Description

Index construction method, field searching method, device, equipment and medium

Technical Field

The application belongs to the technical field of data processing, and particularly relates to an index construction method, a field searching method, a device, equipment and a medium.

Background

In government open-data scenarios, open platforms provide users with a search function over massive amounts of data in order to fully realize the value of the open data. At present, traditional data search technology mainly searches and matches open data using shallow information such as data set names, file names and data set types. Although this approach can locate data quickly to a certain extent, the search results are often limited to data directly related to the search keywords, such as data set names, and finer-grained search cannot be provided, so the results are monotonous and lack depth and breadth. Moreover, government open data often involves multiple departments and business fields, and association relations exist among the open data; traditional search technology often cannot retrieve other data that is indirectly associated with the query, so users cannot obtain comprehensive and coherent data, which in turn limits the value of the open data. Therefore, how to provide users with refined search and improve the accuracy of data search is a problem to be solved.

Disclosure of Invention

In view of the above, the embodiments of the present application provide an index construction method, a field searching method, a device, equipment, and a medium, so as to solve the problem of how to provide a user with a refined search and improve the accuracy of data search.

In a first aspect, an embodiment of the present application provides an index construction method, where the index construction method includes:

acquiring a public data set and, for any data column in the public data set, performing field extraction on the data column according to the field extraction mode corresponding to the data type of the data column to obtain field features of the data column, and determining a feature vector corresponding to each field feature;

and acquiring the storage information of each data column in the public data set, constructing an inverted index according to the field features and the storage information of each data column, and constructing a vector index according to the feature vectors and the storage information of each data column.

In a second aspect, an embodiment of the present application provides a field searching method, where after obtaining an inverted index and a vector index by using the index construction method described in the first aspect, the field searching method includes:

acquiring data to be queried, and determining a vector to be queried of the data to be queried;

determining a first data set according to the data to be queried and the inverted index, and determining a second data set according to the vector to be queried and the vector index;

forming a candidate data set from the first data set and the second data set;

and sorting all the data in the candidate data set according to the data to be queried, the sorted data being the query result.

In a third aspect, an embodiment of the present application provides an index construction apparatus, including:

The feature extraction module is used for acquiring a public data set, carrying out field extraction on any data column in the public data set according to a field extraction mode corresponding to the data type of the data column to obtain field features of the data column, and determining feature vectors corresponding to each field feature;

The index construction module is used for acquiring the storage information of each data column in the public data set, constructing an inverted index according to the field characteristics and the storage information of each data column, and constructing a vector index according to the characteristic vector and the storage information of each data column.

In a fourth aspect, an embodiment of the present application provides a field searching apparatus, where after obtaining an inverted index and a vector index by using the index construction method described in the first aspect, the field searching apparatus includes:

The query acquisition module is used for acquiring data to be queried and determining vectors to be queried of the data to be queried;

The first determining module is used for determining a first data set according to the data to be queried and the inverted index, and determining a second data set according to the vector to be queried and the vector index;

a second determining module for forming a candidate data set from the first data set and the second data set;

and the result acquisition module is used for sequencing all the data in the candidate data set according to the data to be queried, and the sequenced data is a query result.

In a fifth aspect, an embodiment of the present application provides a computer device, the computer device including a processor, a memory, and a computer program stored in the memory and executable on the processor, the processor implementing the index building method according to the first aspect or the field searching method according to the second aspect when executing the computer program.

In a sixth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the index construction method according to the first aspect or the field search method according to the second aspect.

Compared with the prior art, the embodiments of the application have the following beneficial effects. When the index is constructed, for any data column in the public data set, field extraction is performed on the data column using the field extraction mode corresponding to its data type to obtain the field features and feature vectors of the data column; an inverted index is constructed according to the field features and storage information of each data column, and a vector index is constructed according to the feature vectors and storage information of each data column. At query time, a first data set is determined according to the data to be queried and the inverted index, a second data set is determined according to the vector to be queried and the vector index, a candidate data set is formed from the first data set and the second data set, and all data in the candidate data set are sorted according to the data to be queried, the sorted data being the query result. Because the inverted index and the vector index are constructed from the field features, feature vectors and storage information of the data columns in the public data set, a finer-grained index structure is built within the public data set. At query time, information matching the query data can therefore be found quickly through both indexes, enabling finer-grained search within the public data set and improving the comprehensiveness and accuracy of search results.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a schematic diagram of an application environment of an index construction method and a field searching method according to a first embodiment of the present application;

Fig. 2 is a flow chart of an index construction method according to a second embodiment of the present application;

Fig. 3 is a flowchart of an index construction method according to a third embodiment of the present application;

Fig. 4 is a flowchart of an index construction method according to a fourth embodiment of the present application;

Fig. 5 is a flowchart of a field searching method according to a fifth embodiment of the present application;

Fig. 6 is a flowchart of a field searching method according to a sixth embodiment of the present application;

Fig. 7 is a flowchart of a field searching method according to a seventh embodiment of the present application;

Fig. 8 is a schematic structural diagram of an index construction device according to an eighth embodiment of the present application;

Fig. 9 is a schematic structural diagram of a field searching apparatus according to a ninth embodiment of the present application;

Fig. 10 is a schematic structural diagram of a computer device according to a tenth embodiment of the present application.

Detailed Description

In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, techniques, etc., in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.

It should be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

It should also be understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.

As used in the present description and the appended claims, the term "if" may be interpreted as "when" or "once" or "in response to a determination" or "in response to detection" depending on the context. Similarly, the phrase "if it is determined" or "if a [described condition or event] is detected" may be interpreted, depending on the context, as meaning "upon determination" or "in response to determination" or "upon detection of a [described condition or event]" or "in response to detection of a [described condition or event]".

Furthermore, the terms "first," "second," "third," and the like in the description of the present specification and in the appended claims, are used for distinguishing between descriptions and not necessarily for indicating or implying a relative importance.

Reference in the specification to "one embodiment" or "some embodiments" or the like means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in the specification are not necessarily all referring to the same embodiment, but mean "one or more but not all embodiments" unless expressly specified otherwise. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.

The embodiments of the application can acquire and process the related data based on artificial intelligence technology. Artificial intelligence here refers to using a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results.

Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.

It should be understood that the sequence numbers of the steps in the following embodiments do not mean the order of execution, and the execution order of the processes should be determined by the functions and the internal logic, and should not be construed as limiting the implementation process of the embodiments of the present application.

In order to illustrate the technical scheme of the application, the following description is made by specific examples.

The index construction method and the field searching method provided by the first embodiment of the application can be applied to an application environment as shown in fig. 1, wherein a server communicates with a client, the server provides index construction and field searching services, and the client triggers index construction and field searching tasks to the server. The client includes, but is not limited to, a palm computer, a desktop computer, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a cloud computer device, a Personal Digital Assistant (PDA), and the like. The computer device corresponding to the server may be implemented by an independent server or a server cluster formed by a plurality of servers.

Referring to fig. 2, a flowchart of an index building method according to a second embodiment of the present application is shown, where the index building method is applied to a server in fig. 1, and the server is connected to a client to obtain a public data set sent by the client. As shown in fig. 2, the index construction method may include the steps of:

Step S201, a public data set is obtained, for any data column in the public data set, according to a field extraction mode corresponding to the data type of the data column, field extraction is carried out on the data column, field characteristics of the data column are obtained, and feature vectors corresponding to each field characteristic are determined.

In the embodiment of the application, a data set may refer to a collection of text data used for storing and organizing text-type data, and the data in a data set may be stored in the form of a table, a file, or the like. A public data set may refer to a data set that can be shared and used. For a public data set stored in the form of a table, each column represents a different feature or variable and describes one specific aspect of each sample in the data set; each such column is a data column. For example, a column in a tabular public data set may be a data column representing an age feature or a data column representing a gender feature. A field extraction mode may refer to a technique for extracting important features from the data in a data column, a field feature may refer to data representing an important feature of the data column, and a feature vector may refer to a vector representation of a field feature.

According to the different features or variables that the data columns in the public data set represent, the data columns may be divided by data type into enumeration-type data columns, numeric-type data columns, text-type data columns, and the like. For example, an enumeration-type data column may refer to a data column, such as one representing a gender feature, a level feature or a state feature, whose values are drawn from a fixed set of values; a numeric-type data column may refer to a data column composed of numerical values, such as one representing an age feature, a score feature or a quantity feature; and a text-type data column may refer to a data column composed of text, such as one representing a name feature, an address feature or a description feature.

Specifically, in determining the field features of an enumeration-type data column, the number of occurrences of each enumeration value in the data column can be counted, and an enumeration value whose number of occurrences is greater than a predefined threshold is determined to be a field feature of the data column. In determining the field features of a numeric-type data column, the minimum value, the maximum value, and the 5%, 15%, 50% and 95% quantiles of the data column may be calculated by a minimum value calculation function, a maximum value calculation function and a quantile calculation function, respectively, and used as the field features of the data column; for example, in the Python programming language, the minimum value may be calculated by the np.min() function, the maximum value by the np.max() function, and the quantiles by the np.percentile() function. In determining the field features of a text-type data column, for meaningful text data composed of character strings such as digits, letters or Chinese characters, key fields of the text data can be extracted through entity recognition and relation extraction techniques, and fine-grained type recognition is performed to obtain the field features of the data column; for text data without semantic meaning composed of such character strings, the text data can be divided into subsequences, the term frequency-inverse document frequency (TF-IDF) value of each subsequence in the data column is calculated and used as the importance score of that subsequence, and the subsequences whose importance scores are greater than a predefined threshold are extracted as the field features of the data column.
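The three extraction modes described above can be illustrated with a minimal Python sketch. The function names, the default thresholds, the use of character n-grams as "subsequences", the treatment of each cell as one document, and the smoothed IDF formula are all assumptions made for illustration; the application does not fix any of these choices.

```python
from collections import Counter
import math

import numpy as np


def enum_features(values, min_count=2):
    """Keep enumeration values that occur more than min_count times (assumed threshold)."""
    counts = Counter(values)
    return sorted(v for v, c in counts.items() if c > min_count)


def numeric_features(values):
    """Minimum, maximum and the 5%, 15%, 50%, 95% quantiles of a numeric column."""
    arr = np.asarray(values, dtype=float)
    feats = {"min": float(np.min(arr)), "max": float(np.max(arr))}
    for q in (5, 15, 50, 95):
        feats[f"p{q}"] = float(np.percentile(arr, q))
    return feats


def subsequence_features(cells, n=3, threshold=0.1):
    """Score character n-grams by their mean TF-IDF over the cells of one column.

    Each cell is treated as one document; the smoothed IDF
    log((1 + N) / (1 + df)) + 1 is an assumption of this sketch.
    """
    docs = [[c[i:i + n] for i in range(len(c) - n + 1)] for c in cells]
    doc_freq = Counter(g for d in docs for g in set(d))
    scores = Counter()
    for d in docs:
        for gram, k in Counter(d).items():
            idf = math.log((1 + len(docs)) / (1 + doc_freq[gram])) + 1.0
            scores[gram] += (k / len(d)) * idf / len(docs)
    return sorted(g for g, s in scores.items() if s > threshold)
```

In this sketch a code-like column such as ["AB123", "AB456", "AB789"] scored with 2-grams yields a higher importance score for the shared prefix "AB" than for the varying digit pairs, so only the prefix survives a suitably chosen threshold.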

In determining the feature vectors corresponding to the field features of enumeration-type, numeric-type and text-type data columns, the field features extracted from the data columns of each data type can be converted into high-dimensional vector representations by a model trained in advance for vector conversion, thereby obtaining the feature vector corresponding to each field feature.

Step S202, obtaining the storage information of each data column in the public data set, constructing an inverted index according to the field characteristics and the storage information of each data column, and constructing a vector index according to the characteristic vector and the storage information of each data column.

In the embodiment of the present application, the storage information may refer to the storage location of a data column in the public data set. For example, the storage information of a data column may indicate that the data column is stored in the public data set numbered 2, or, more specifically, that it is stored in the public data set numbered 2 with data column number 3. The inverted index may refer to an index structure that stores the mapping between field features and the storage information of those field features in the public data set. The index structure may include key values and an array corresponding to each key value: a key value stores an inverted index item such as a word or phrase, and the array corresponding to the key value stores the storage information corresponding to that word or phrase. The vector index may refer to an index structure built from feature vectors; for example, the vector index may be a Faiss index.

Specifically, in constructing the inverted index, the storage information of each data column in the public data set is first acquired; this storage information is also the storage information of the field features of that data column. A dictionary and an inverted list of the inverted index structure are then constructed with an index construction tool. Next, word segmentation is performed on the field features of each data column to obtain the words or phrases corresponding to the field features. Finally, each word or phrase is written as a key value into a blank key value of the dictionary, a unique identifier is assigned to each word or phrase in the dictionary, and the storage information corresponding to each word or phrase is written into the array of the corresponding key value in the inverted list. In constructing the vector index, the feature vectors and the storage information can be written into a preset vector index structure to obtain the vector index.
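As a rough illustration of the two structures, the following Python sketch keeps a dictionary that assigns each word a unique identifier, an inverted list that maps each identifier to an array of storage information, and a flat vector index. The class names, the (data set number, column number) tuple used as storage information, and the whitespace split standing in for word segmentation are assumptions of this sketch; the flat list with exhaustive search is a simple stand-in and is not a Faiss index.

```python
from collections import defaultdict
import math


class InvertedIndex:
    def __init__(self):
        self.term_ids = {}                 # dictionary: word or phrase -> unique identifier
        self.postings = defaultdict(list)  # inverted list: identifier -> storage information

    def add(self, field_feature, storage_info):
        # Whitespace splitting stands in for a real word segmentation tool.
        for term in field_feature.lower().split():
            term_id = self.term_ids.setdefault(term, len(self.term_ids))
            if storage_info not in self.postings[term_id]:
                self.postings[term_id].append(storage_info)

    def lookup(self, term):
        term_id = self.term_ids.get(term.lower())
        return [] if term_id is None else list(self.postings[term_id])


class VectorIndex:
    """Flat list of (feature vector, storage information) pairs with exhaustive search."""

    def __init__(self):
        self.entries = []

    def add(self, vector, storage_info):
        self.entries.append((tuple(vector), storage_info))

    def nearest(self, query, k=1):
        ranked = sorted(self.entries, key=lambda e: math.dist(e[0], query))
        return [info for _, info in ranked[:k]]


inverted = InvertedIndex()
inverted.add("resident age", (2, 3))   # data set 2, column 3
inverted.add("age group", (2, 5))

vectors = VectorIndex()
vectors.add([0.1, 0.9], (2, 3))
vectors.add([0.8, 0.2], (2, 5))
```

Looking up "age" in this example returns both columns, while a query vector close to [0.8, 0.2] returns the storage information of column 5.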

In the embodiment of the application, for a data column of any data type in the public data set, field extraction is performed on the data column using the field extraction mode corresponding to its data type to obtain the field features and feature vectors of the data column; the inverted index is constructed according to the field features and the storage information of each data column, and the vector index is constructed according to the feature vectors and the storage information of each data column. A finer-grained index structure is thereby constructed within the public data set, so that at query time information matching the query data can be found quickly through the constructed inverted index and vector index, finer-grained search within the public data set becomes possible, and the comprehensiveness and accuracy of search results are improved.

Referring to fig. 3, a flowchart of an index construction method according to a third embodiment of the present application is shown. As shown in fig. 3, after obtaining the field features and the feature vectors of the data columns in step S201, the index construction method may further include the following steps:

Step S301, calculating the similarity of the feature vectors of any two data columns to obtain a similarity calculation result, and if the similarity calculation result is greater than a similarity threshold value, determining that the two data columns are suspected matching pairs.

In the embodiment of the application, the similarity may refer to the distance between two feature vectors in Euclidean space, the similarity calculation result may refer to the distance between the feature vectors of two data columns in Euclidean space, the similarity threshold may refer to a predefined similarity value, and a suspected matching pair may refer to two potentially matching data columns whose similarity calculation result is greater than the similarity threshold.

Specifically, a similarity calculation formula can be used to calculate the similarity between the feature vectors of any two data columns in the public data set to obtain a similarity calculation result, and the similarity calculation result of the two data columns is compared with the similarity threshold. If the similarity calculation result is greater than the similarity threshold, the two data columns are determined to be a suspected matching pair. In this way, the suspected matching pairs in the public data set are recalled through a similarity join method.
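A minimal Python sketch of this recall step follows. The application does not give the similarity calculation formula; the sketch assumes the score 1 / (1 + d), where d is the Euclidean distance, so that a larger score means closer vectors and the "greater than the similarity threshold" test applies directly. The function names and the exhaustive pairwise loop are likewise illustrative only.

```python
import math
from itertools import combinations


def similarity(u, v):
    """Assumed score in (0, 1]: 1 / (1 + Euclidean distance)."""
    return 1.0 / (1.0 + math.dist(u, v))


def suspected_pairs(column_vectors, threshold):
    """Return every pair of column names whose similarity exceeds the threshold."""
    pairs = []
    for (name_a, vec_a), (name_b, vec_b) in combinations(sorted(column_vectors.items()), 2):
        if similarity(vec_a, vec_b) > threshold:
            pairs.append((name_a, name_b))
    return pairs


column_vectors = {
    "district_name": [0.0, 0.0],
    "area_name": [0.0, 0.1],
    "unit_price": [5.0, 5.0],
}
pairs = suspected_pairs(column_vectors, threshold=0.8)
```

Comparing every pair costs time quadratic in the number of columns; for a large public data set the candidate pairs would more plausibly be retrieved through the vector index itself.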

Step S302, a preset number of first data are extracted from first data columns in the suspected matching pair, and all the first data are matched with second data columns in the suspected matching pair, so that a matching result is obtained.

In the embodiment of the application, the first data column may refer to any one data column in the suspected matching pair, the second data column may refer to another data column in the suspected matching pair except the first data column, the first data may refer to data in the first data column, the preset number may refer to the preset extraction number of the first data, and the matching result may refer to a matching result of the first data and the second data column in the suspected matching pair, and may include two matching results of matching and non-matching.

Specifically, a preset number of first data can first be extracted from the first data column of the suspected matching pair by a bilateral sampling method. Each first data is then matched against the data in the second data column through a predefined operator, such as a number, a comparison symbol or an inclusion relation; if a first data matches any data in the second data column, the matching result of that first data with the second data column is determined to be a match.

Step S303, counting the proportion, among the preset number of first data, of the first data whose matching result is a match; if the proportion exceeds a proportion threshold, determining that the suspected matching pair passes verification, and forming associated information from the field features of the first data column and the second data column.

Step S304, after the first data column and the second data column are written into the inverted index, writing the associated information into the arrays of the first data column and the second data column in the inverted index.

In the embodiment of the present application, the proportion threshold may refer to a preset proportion value, and the associated information may refer to information associating the field features of the first data column and the second data column; for example, the associated information may be an identifier indicating that the field features of the first data column and the second data column match, an association link or address between the first data column and the second data column, and so on.

Specifically, first, according to the matching result of each first data with the second data column obtained in step S302, the number of first data whose matching result is a match is determined, and the proportion of that number in the preset number is calculated. Then, if the proportion exceeds the proportion threshold, the suspected matching pair is determined to pass verification, the field features of the first data column and the second data column are formed into associated information, and the two data columns are determined to be a data column pair. Finally, after the field features and the storage information of the first data column and the second data column have been written into the corresponding arrays of the inverted index in step S202, the associated information is written into those arrays. In this way, when the inverted index is queried by the field feature corresponding to a key value, not only can the data columns corresponding to the storage information in the array of that key value be found, but the data columns associated with those data columns can also be obtained.
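The verification and write-back of steps S302 to S304 might look as follows in Python. This is a sketch under several assumptions: plain seeded random sampling replaces the bilateral sampling method, which the application does not detail; equality and substring containment are the only operators; and each array entry of the inverted index is a dict with hypothetical "storage" and "associated" keys.

```python
import random


def verify_pair(first_column, second_column, sample_size, ratio_threshold, seed=0):
    """Sample first data, match each against the second column, compare the match ratio."""
    rng = random.Random(seed)
    sample = rng.sample(first_column, min(sample_size, len(first_column)))
    second = [str(x) for x in second_column]

    def matches(value):
        value = str(value)
        # Assumed operators: equality or containment in any second-column value.
        return any(value == other or value in other for other in second)

    matched = sum(1 for value in sample if matches(value))
    ratio = matched / len(sample) if sample else 0.0
    return ratio > ratio_threshold, ratio


def link_columns(postings, first_info, second_info):
    """Write each column's storage information into the other's array entries."""
    for entries in postings.values():
        for entry in entries:
            if entry["storage"] == first_info and second_info not in entry["associated"]:
                entry["associated"].append(second_info)
            elif entry["storage"] == second_info and first_info not in entry["associated"]:
                entry["associated"].append(first_info)


postings = {
    "district": [{"storage": (1, 0), "associated": []}],
    "area": [{"storage": (2, 3), "associated": []}],
}
ok, ratio = verify_pair(["A1", "A2", "A3", "A4"], ["A1", "A2", "A3", "Z9"],
                        sample_size=4, ratio_threshold=0.6)
if ok:
    link_columns(postings, (1, 0), (2, 3))
```

After the link is written, a lookup of "district" returns the entry for data set 1, column 0 together with the associated column (2, 3), which is the behaviour the paragraph above describes.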

According to the embodiment of the application, the similarity calculation is carried out according to the feature vectors of any two data columns, each data column is filtered, a suspected matching pair with the similarity calculation result larger than the similarity threshold value is obtained, the data in the suspected matching pair is extracted and verified, the associated first data column and second data column are obtained, and the associated information is written into an array of inverted indexes. The method has the advantages that a finer index structure in the public data set is constructed, so that when a user inquires according to the inverted index, the user can search the interior of the public data set more finely, not only can the data columns corresponding to the storage information in the array corresponding to the key value be inquired, but also the data columns associated with the data columns corresponding to the storage information can be obtained, the inquiry efficiency is improved, and meanwhile, the comprehensiveness and the accuracy of the inquiry result are also improved.

Referring to fig. 4, a flowchart of an index construction method according to a fourth embodiment of the present application is shown. As shown in fig. 4, the step S202 of constructing an inverted index according to the field characteristics and the storage information of each data column may include the following steps:

Step S401, if the public data set contains long text data, compressing and summarizing the long text data through a trained language model to obtain a summary document.

In the embodiment of the present application, long text data may refer to data containing a large amount of text content; for example, the long text data may be a paragraph containing multiple meanings. The trained language model may refer to a language model that has been trained for text compression and summarization; for example, it may be a trained large language model (Large Language Model, LLM). The summary document may refer to the data obtained by compressing and summarizing the long text data.

For example, a paragraph containing multiple meanings can be compressed and summarized by a trained LLM into one or two natural-language sentences, with little semantic loss, to serve as the summary document.

Step S402, performing word segmentation processing on field features of each data column to obtain a first word segmentation result of the corresponding data column.

Step S403, constructing an inverted index, writing the first word segmentation result of each data column into a blank key value of the inverted index, and writing the storage information of the corresponding data column into an array of the corresponding key value.

Step S404, obtaining the storage information of each induced document, and respectively performing word segmentation on each induced document to obtain a second word segmentation result of the corresponding induced document.

Step S405, writing the second word segmentation result of each induced document into a blank key value of the inverted index, and writing the storage information of the corresponding induced document into an array of the corresponding key value.

In the embodiment of the application, the first word segmentation result may refer to each word or phrase obtained by word segmentation processing on the field features, the storage information of an induced document may refer to the storage position of the induced document in the public data set, and the second word segmentation result may refer to each word or phrase obtained by word segmentation processing on the induced document.

Specifically, first, the dictionary and the inverted list of the inverted index structure may be constructed with an index construction tool. Then, the words or phrases obtained by word segmentation of the data columns and of the induced documents are written into blank key values of the dictionary, and a unique identifier is allocated to each word or phrase in the dictionary. Finally, the storage information corresponding to each word or phrase in the dictionary is written into the array of the corresponding key value in the inverted list.
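As a rough sketch of the dictionary and inverted-list structure just described, the following assumes a trivial whitespace segmenter and tuple-shaped storage information. The class name `InvertedIndex`, the `tokenize` helper and the sample storage tuples are hypothetical; Chinese field features would need a real word segmenter in place of `tokenize`.

```python
from collections import defaultdict


def tokenize(text):
    """Placeholder segmenter (assumption): lowercase, split on whitespace and underscores."""
    return text.lower().replace("_", " ").split()


class InvertedIndex:
    def __init__(self):
        self.dictionary = {}               # word or phrase -> unique identifier
        self.postings = defaultdict(list)  # identifier -> array of storage information

    def add(self, text, storage_info):
        """Segment the text and record storage_info under every resulting term."""
        for term in dict.fromkeys(tokenize(text)):  # drop repeated terms, keep order
            term_id = self.dictionary.setdefault(term, len(self.dictionary))
            if storage_info not in self.postings[term_id]:
                self.postings[term_id].append(storage_info)

    def lookup(self, term):
        """Return the storage information recorded for a term, or an empty list."""
        term_id = self.dictionary.get(term.lower())
        return list(self.postings[term_id]) if term_id is not None else []


index = InvertedIndex()
# Field features of data columns (first word segmentation results).
index.add("unit name", ("dataset_a", "table_1", "col_2"))
index.add("unit address", ("dataset_b", "table_1", "col_5"))
# An induced document produced from long text (second word segmentation result).
index.add("Summary of unit registration rules", ("dataset_c", "doc_7"))

unit_hits = index.lookup("unit")
```

Data columns and induced documents share one dictionary here, so a single lookup for "unit" returns both column locations and the document location.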

In the embodiment of the application, if the public data set contains long text data, the long text data is compressed and summarized to obtain induced documents; word segmentation is performed on the field features of each data column and on each induced document to obtain the first and second word segmentation results; these results are written into blank key values of the inverted index, and the corresponding storage information is written into the arrays of the corresponding key values. With the inverted index, the storage position of a word or phrase can be located quickly from its index item, which improves query efficiency. Because the inverted index is built from the features of the data columns and the long text data in the public data set, a more refined search inside the public data set can be performed, which improves the comprehensiveness and accuracy of the search results.

Referring to fig. 5, a flowchart of a field searching method according to a fifth embodiment of the present application is shown. The field searching method is applied to the server in fig. 1, which is connected to the client to obtain the data to be queried sent by the client, and is performed after the inverted index and the vector index have been obtained by the index construction method of any one of the second to fourth embodiments. As shown in fig. 5, the field searching method may include the following steps:

Step S501, data to be queried is obtained, and a vector to be queried of the data to be queried is determined.

In the embodiment of the application, the data to be queried may refer to keyword data such as words or phrases to be queried, and the vector to be queried may refer to a vectorized representation of the data to be queried. Specifically, the data to be queried may be converted into the vector to be queried through a corresponding vector conversion model, for example, a word embedding model, a BERT model, and the like.

Step S502, a first data set is determined according to the data to be queried and the inverted index, and a second data set is determined according to the vector to be queried and the vector index.

In the embodiment of the application, the first data set may refer to a corresponding public data set obtained by querying the inverted index according to the data to be queried, and the second data set may refer to a corresponding public data set obtained by querying the vector index according to the vector to be queried.

Specifically, in the process of determining the first data set, prefix filtering is performed on data to be queried to obtain filtered data to be queried, then the filtered data to be queried is used as an index item, an inverted index is queried to obtain a corresponding data column and/or long text data, finally, according to the data to be queried, the queried data column and/or the long text data are filtered to obtain a filtered data column and/or the long text data, and the filtered data column and/or a public data set corresponding to the long text data are determined as the first data set.

In the process of determining the second data set, first, query parameters of the vector index, such as the query radius and the number of query results to return, may be set. Then, according to the set query parameters, the vector index is queried with the vector to be queried as the index item: the similarity score between the vector to be queried and each feature vector in the vector index is calculated, and the feature vectors are sorted by similarity score. Finally, according to the sorting result, as many feature vectors as the configured number of query results are returned, and the public data sets corresponding to the returned feature vectors are determined as the second data set.
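The vector-index query can be illustrated with a brute-force scan. This sketch makes several assumptions not stated in the application: cosine similarity is the similarity score, the "query radius" is treated as a minimum similarity, and the vector index is a plain list of `(feature_vector, storage_info)` pairs. A production system would more likely use an approximate nearest-neighbour library.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def query_vector_index(vector_index, query_vec, top_k=2, min_similarity=0.0):
    """Score every feature vector, keep those within the radius, return the top_k."""
    scored = []
    for feature_vec, storage_info in vector_index:
        score = cosine(query_vec, feature_vec)
        if score >= min_similarity:          # "query radius", read as a similarity floor
            scored.append((score, storage_info))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]                    # number of returned query results


# Hypothetical index entries: (feature vector, public data set it belongs to).
vector_index = [
    ([1.0, 0.0], "dataset_a"),
    ([0.8, 0.6], "dataset_b"),
    ([0.0, 1.0], "dataset_c"),
]
hits = query_vector_index(vector_index, [1.0, 0.1], top_k=2, min_similarity=0.5)
second_data_set = [storage_info for _, storage_info in hits]
```

With the assumed inputs, `dataset_c` falls outside the similarity floor and the remaining two entries are returned in descending order of score.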

Step S503, forming a candidate data set according to the first data set and the second data set.

Step S504, according to the data to be queried, all the data in the candidate data set are ordered, and the ordered data is the query result.

In the embodiment of the application, the candidate data set may refer to a set of public data sets formed by the first data set and the second data set, and the query result may refer to a query result obtained by performing field search according to the data to be queried.

Specifically, firstly, comparing a first data set obtained by inquiring an inverted index with a second data set obtained by inquiring a vector index, removing repeated public data sets, merging to obtain candidate data sets, then, calculating the relevance score of data to be inquired and the public data sets aiming at any public data set in the candidate data sets, and finally, sorting all public data sets in the candidate data sets according to the relevance score, wherein the obtained sorting result is the inquiring result.
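A minimal sketch of the merge, de-duplication and sorting described above follows. The data set identifiers and the relevance scores are invented for illustration, and the scoring function is passed in as a parameter because the application does not fix a formula at this point.

```python
def build_candidates(first_data_set, second_data_set):
    """Merge both result lists, dropping duplicates while keeping first-seen order."""
    return list(dict.fromkeys([*first_data_set, *second_data_set]))


def rank(candidates, score_fn):
    """Sort candidates by relevance score, highest first."""
    return sorted(candidates, key=score_fn, reverse=True)


first_data_set = ["ds_a", "ds_b"]     # from the inverted index (hypothetical)
second_data_set = ["ds_b", "ds_c"]    # from the vector index (hypothetical)
candidates = build_candidates(first_data_set, second_data_set)

# Hypothetical relevance scores of each public data set against the query.
scores = {"ds_a": 0.4, "ds_b": 0.9, "ds_c": 0.1}
query_result = rank(candidates, scores.get)
```

Using `dict.fromkeys` keeps the merge linear in the number of results and preserves the order in which data sets were first seen, which makes ties in the later sort deterministic.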

According to the embodiment of the application, a first data set is determined according to the data to be queried and the inverted index, a second data set is obtained according to the vector to be queried corresponding to the data to be queried and the vector index, a candidate data set is obtained according to the first data set and the second data set, all data in the candidate data set are sorted according to the data to be queried, and the sorted data is the query result. Because the candidate data set is obtained by querying both the inverted index and the vector index, a more refined search is performed in the public data sets, which improves the comprehensiveness and accuracy of the candidate data set. Fine ranking the candidate data set then yields a clearer query result that is more convenient for the user's subsequent processing.

Referring to fig. 6, a flow chart of a field searching method according to a sixth embodiment of the present application is shown. As shown in fig. 6, the determining the first data set according to the data to be queried and the inverted index in the step S502 may include the following steps:

Step S601, prefix filtering is carried out on the data to be queried, and filtered data to be queried is obtained.

Step S602, candidate data are obtained according to the filtered data to be queried and the inverted index.

In the embodiment of the application, prefix filtering may refer to a data processing technology for screening out data with a specific prefix in a data query or processing process, and candidate data may refer to a data column obtained by query and/or long text data according to filtered data to be queried and inverted indexes.

Specifically, in the process of obtaining candidate data, first, word segmentation may be performed on the data to be queried through a character string processing technique, dividing it into a plurality of fields. Then, prefix filtering is performed on the data to be queried based on the divided fields to obtain the filtered data to be queried. Finally, the inverted index is queried with the filtered data to be queried as the index item, and the data columns and/or long text data obtained are the candidate data.

For example, if the data to be queried is "unit name" and it is divided into fields of length 1, the divided data to be queried may be {"single", "bit", "name"}. If prefix filtering of the divided data yields "single" as the filtered data to be queried, the inverted index is queried with "single" as the index item, and the data columns and/or long text data obtained are the candidate data.
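The split-then-prefix lookup in this example can be sketched as below. The sketch assumes that prefix filtering simply keeps the first `prefix_len` fields, which is one possible reading of the text; the dictionary-shaped index, its keys and the column identifiers are hypothetical, and the query is split into single characters to mirror the length-1 fields above.

```python
def split_fields(query, length=1):
    """Split a query string into consecutive fields of the given length."""
    return [query[i:i + length] for i in range(0, len(query), length)]


def prefix_filter(fields, prefix_len=1):
    """Keep only the leading fields (assumed meaning of prefix filtering here)."""
    return fields[:prefix_len]


def candidates_from_index(inverted_index, query, prefix_len=1):
    """Query the inverted index with the prefix fields and collect distinct hits."""
    found = []
    for field in prefix_filter(split_fields(query), prefix_len):
        for storage_info in inverted_index.get(field, []):
            if storage_info not in found:
                found.append(storage_info)
    return found


# Hypothetical inverted index keyed by single-character fields.
inverted_index = {"u": ["col_1", "col_2"], "n": ["col_3"]}
fields = split_fields("unit")
candidate_data = candidates_from_index(inverted_index, "unit")
```

Only the prefix field is looked up, so the index is probed once instead of once per field; the remaining fields are used later when the candidates are filtered against the full query.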

Step S603, filtering the candidate data according to the data to be queried to obtain a first data set.

Specifically, in the process of obtaining the first data set, first, word segmentation may be performed, through a character string processing technique, on the data of each data column and on the long text data in the candidate data, dividing each candidate data into a plurality of candidate fields; word segmentation is likewise performed on the data to be queried, dividing it into a plurality of fields. For any candidate data, each candidate field is compared with the fields of the data to be queried through predefined operators such as numbers, comparison symbols and containment relations, and the number of candidate fields that are the same as fields of the data to be queried is recorded. Second, the recorded number is compared with a preset number; if the recorded number exceeds the preset number, the candidate data is determined to be similar to the data to be queried, and the data set in which the candidate data is located is determined to be a candidate data set. Then, for all candidate data whose recorded number exceeds the preset number, the term frequency-inverse document frequency (TF-IDF) of each such candidate data across all public data sets is calculated, and a confidence value is obtained from the calculated values. Finally, the first data set is obtained by applying the filtering logic to the candidate data according to the confidence values.
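The overlap count and the TF-IDF based confidence can be sketched as follows. The application does not give the exact formulas, so everything here is an assumption made for illustration: fields are compared by equality only, the confidence is the sum of TF-IDF values of the query fields, the IDF uses a smoothed logarithm, and the names `shared_field_count`, `tf_idf` and `filter_candidates` are hypothetical.

```python
import math


def shared_field_count(candidate_fields, query_fields):
    """Number of candidate fields that also occur among the query fields."""
    query_set = set(query_fields)
    return sum(1 for field in candidate_fields if field in query_set)


def tf_idf(term, doc_fields, all_docs):
    """TF-IDF with a smoothed IDF (one common variant, assumed here)."""
    if not doc_fields:
        return 0.0
    tf = doc_fields.count(term) / len(doc_fields)
    df = sum(1 for fields in all_docs if term in fields)
    return tf * math.log((1 + len(all_docs)) / (1 + df))


def filter_candidates(candidates, query_fields, preset_count=1):
    """Keep candidates sharing more than preset_count fields; attach a confidence."""
    all_docs = list(candidates.values())
    kept = {}
    for name, fields in candidates.items():
        if shared_field_count(fields, query_fields) > preset_count:
            kept[name] = sum(tf_idf(term, fields, all_docs) for term in query_fields)
    return kept


# Hypothetical candidate data, already split into candidate fields.
candidates = {
    "col_a": ["unit", "name", "code"],
    "col_b": ["unit", "address"],
    "col_c": ["road", "name"],
}
confidences = filter_candidates(candidates, ["unit", "name"], preset_count=1)
```

Here only `col_a` shares more than one field with the query, so it is the only candidate that receives a confidence value; a further threshold on that value could serve as the final filtering step.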

According to the embodiment of the application, prefix filtering is carried out on the data to be queried to obtain filtered data to be queried, candidate data is obtained according to the filtered data to be queried and the inverted index, and the candidate data is filtered according to the data to be queried to obtain a first data set. Prefix filtering the data to be queried and querying the inverted index with the filtered data narrows the query range of the inverted index and improves query efficiency; filtering the candidate data obtained by the query to obtain the first data set improves the accuracy of the public data sets obtained by the query.

Referring to fig. 7, a flow chart of a field searching method according to a seventh embodiment of the present application is provided. As shown in fig. 7, in the step S504, all the data in the candidate data set are ranked according to the data to be queried, and the ranked data is a query result, which may include the following steps:

Step S701, word segmentation processing is carried out on the data to be queried to obtain a third word segmentation result.

Step S702, calculating the relevance score of the third word segmentation result and each data in the candidate data set, and sorting all the data in the candidate data set according to the relevance score, wherein the sorted data is the query result.

In the embodiment of the application, the third word segmentation result may refer to each word or phrase obtained by word segmentation processing of the data to be queried.

Specifically, first, word segmentation is performed on the data to be queried, and each word or phrase obtained forms the third word segmentation result. Second, for any such word or phrase and any candidate data set, the relevance score between the word or phrase and the candidate data set is calculated. Then, the relevance scores of all words or phrases of the data to be queried with respect to that candidate data set are summed to obtain the relevance score of the candidate data set. Finally, the candidate data sets are sorted by their relevance scores with respect to the data to be queried, and the sorted result is the query result.
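The per-term scoring and summation can be illustrated with a deliberately simple scorer. The application does not name the relevance function, so the sketch assumes a raw occurrence count per term (a weighting scheme such as BM25 could be substituted); the whitespace segmentation, the function names and the sample texts are likewise hypothetical.

```python
def term_score(term, text):
    """Hypothetical per-term relevance: how often the term occurs in the text."""
    return text.lower().split().count(term)


def relevance(query, text):
    """Sum the per-term scores over all terms of the segmented query."""
    terms = query.lower().split()   # third word segmentation result (whitespace split)
    return sum(term_score(term, text) for term in terms)


def rank_candidates(query, candidates):
    """Order candidate data sets by descending relevance, then by name for ties."""
    scored = [(relevance(query, text), name) for name, text in candidates.items()]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored]


# Hypothetical candidate data sets with a text description each.
candidates = {
    "ds_a": "unit name and unit address",
    "ds_b": "road name list",
    "ds_c": "budget amount",
}
query_result = rank_candidates("unit name", candidates)
```

Breaking ties by name is a design choice of the sketch that keeps the output stable across runs; the application itself does not specify tie handling.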

According to the embodiment of the application, word segmentation is carried out on the data to be queried to obtain a third word segmentation result, the relevance score of the third word segmentation result and each data in the candidate data set is calculated, all the data in the candidate data set are sorted according to the relevance score, and the sorted data is the query result. Fine ranking the first data set obtained by querying the inverted index and the second data set obtained by querying the vector index against the data to be queried improves the comprehensiveness and accuracy of the search result.

Fig. 8 shows a block diagram of an index building device according to an eighth embodiment of the present application, which corresponds to the index building method of the above embodiment, where the index building device is applied to the server in fig. 1, and the server is connected to the client to obtain the public data set sent by the client. For convenience of explanation, only portions relevant to the embodiments of the present application are shown.

Referring to fig. 8, the index constructing apparatus includes:

The feature extraction module 81 is configured to obtain a public data set, perform field extraction on any data column in the public data set according to a field extraction mode corresponding to a data type of the data column, obtain field features of the data column, and determine feature vectors corresponding to each field feature;

the index construction module 82 is configured to obtain the storage information of each data column in the public data set, construct an inverted index according to the field feature and the storage information of each data column, and construct a vector index according to the feature vector and the storage information of each data column.

Optionally, the index building device further includes:

the similarity calculation module is used for calculating the similarity of the feature vectors of any two data columns to obtain a similarity calculation result, and if the similarity calculation result is larger than a similarity threshold value, determining that the two data columns are suspected matching pairs;

The matching module is used for extracting a preset number of first data from the first data columns in the suspected matching pair, and matching all the first data with the second data columns in the suspected matching pair to obtain a matching result;

the association determining module is used for counting the proportion, within the preset number, of the first data whose matching result is a match, and, if the proportion exceeds a proportion threshold, determining that the suspected matching pair passes verification and forming association information from the field features of the first data column and the second data column;

and the association writing module is used for writing the association information into an array in the inverted index after the first data column and the second data column are written into the inverted index.

Optionally, the index building device further includes:

The induction module is used for compressing and inducing the long text data through a trained language model if the public data set contains the long text data, so as to obtain an induction document;

the first word segmentation module is used for respectively carrying out word segmentation on the field characteristics of each data column to obtain a first word segmentation result of the corresponding data column;

the first writing module is used for constructing an inverted index, writing a first word segmentation result of each data column into a blank key value of the inverted index, and writing storage information of the corresponding data column into an array of the corresponding key value;

the second word segmentation module is used for acquiring the storage information of each induced document, and respectively carrying out word segmentation on each induced document to obtain a second word segmentation result of the corresponding induced document;

and the second writing module is used for writing the second word segmentation result of each induced document into a blank key value of the inverted index, and writing the storage information of the corresponding induced document into an array of the corresponding key value.

It should be noted that, because the content of information interaction and execution process between the modules and the embodiment of the method of the present application are based on the same concept, specific functions and technical effects thereof may be referred to in the method embodiment section, and details thereof are not repeated herein.

Fig. 9 shows a block diagram of a field searching device according to a ninth embodiment of the present application, which corresponds to the field searching method of the above embodiment, where the field searching device is applied to the server in fig. 1, and the server is connected to the client to obtain the data to be queried sent by the client. For convenience of explanation, only portions relevant to the embodiments of the present application are shown.

Referring to fig. 9, after the inverted index and the vector index are obtained by using the index construction method according to any one of the second to fourth embodiments, the field searching apparatus includes:

The query obtaining module 91 is configured to obtain data to be queried, and determine a vector to be queried of the data to be queried;

a first determining module 92, configured to determine a first data set according to the data to be queried and the inverted index, and determine a second data set according to the vector to be queried and the vector index;

a second determining module 93 for forming a candidate data set from the first data set and the second data set;

The result obtaining module 94 is configured to sort all the data in the candidate data set according to the data to be queried, and obtain sorted data as a query result.

Optionally, the first determining module 92 includes:

The prefix filtering unit is used for performing prefix filtering on the data to be queried to obtain filtered data to be queried;

the inverted query unit is used for obtaining candidate data according to the filtered data to be queried and the inverted index;

And the data filtering unit is used for filtering the candidate data according to the data to be queried to obtain a first data set.

Optionally, the result obtaining module 94 includes:

the third word segmentation unit is used for carrying out word segmentation on the data to be queried to obtain a third word segmentation result;

And the sorting unit is used for calculating the relevance score of the third word segmentation result and each data in the candidate data set, sorting all the data in the candidate data set according to the relevance score, and obtaining the sorted data as a query result.

Fig. 10 is a schematic structural diagram of a computer device according to a tenth embodiment of the present application. As shown in fig. 10, the computer device of this embodiment includes: at least one processor (only one shown in fig. 10), a memory, and a computer program stored in the memory and executable on the at least one processor, the processor implementing the steps of any of the various index building methods described above or the various field searching method embodiments described above when executing the computer program.

The computer device may include, but is not limited to, a processor and a memory. It will be appreciated by those skilled in the art that fig. 10 is merely an example of a computer device and is not intended to be limiting; a computer device may include more or fewer components than shown, may combine certain components, or may use different components, and may, for example, also include a network interface, a display screen, an input device, and the like.

The processor may be a central processing unit (Central Processing Unit, CPU), or may be another general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.

The memory includes a readable storage medium, an internal memory, etc., where the internal memory may be the memory of the computer device, the internal memory providing an environment for the execution of an operating system and computer-readable instructions in the readable storage medium. The readable storage medium may be a hard disk of the computer device, and in other embodiments may be an external storage device of the computer device, for example, a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card), etc. provided on the computer device. Further, the memory may also include both internal storage units and external storage devices of the computer device. The memory is used to store an operating system, application programs, a boot loader (BootLoader), data, and other programs, such as the program code of computer programs. The memory may also be used to temporarily store data that has been output or is to be output.

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, the specific names of the functional units and modules are only for distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working process of the units and modules in the above device may refer to the corresponding process in the foregoing method embodiment, which is not described herein again. The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the present application may implement all or part of the flow of the method of the above-described embodiment, and may be implemented by a computer program to instruct related hardware, and the computer program may be stored in a computer readable storage medium, where the computer program, when executed by a processor, may implement the steps of the method embodiment described above. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, executable files or in some intermediate form, etc. 
The computer readable medium may include at least: any entity or device capable of carrying computer program code, a recording medium, a computer memory, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, for example, a USB flash drive, a removable hard disk, a magnetic disk or an optical disk. In some jurisdictions, in accordance with legislation and patent practice, computer readable media may not include electrical carrier signals and telecommunications signals.

The present application may also be implemented as a computer program product for implementing all or part of the steps of the method embodiments described above, when the computer program product is run on a computer device, causing the computer device to execute the steps of the method embodiments described above.

In the foregoing embodiments, each embodiment is described with its own emphasis; for parts that are not described or illustrated in detail in a particular embodiment, reference may be made to the related descriptions of other embodiments.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

In the embodiments provided by the present application, it should be understood that the disclosed apparatus/computer device and method may be implemented in other manners. For example, the apparatus/computer device embodiments described above are merely illustrative, e.g., the division of modules or units is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices or units, which may be in electrical, mechanical or other forms.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

The above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.

Claims

1. An index construction method, characterized in that the index construction method comprises:

2. The index construction method according to claim 1, further comprising, after the obtaining the field features and feature vectors of the data column:

Calculating the similarity of the feature vectors of any two data columns to obtain a similarity calculation result, and determining the two data columns as suspected matching pairs if the similarity calculation result is larger than a similarity threshold;

Extracting a preset number of first data from the first data columns in the suspected matching pair, and matching all the first data with the second data columns in the suspected matching pair to obtain a matching result;

Counting the proportion of the matched first data in the preset quantity as a matching result, if the proportion exceeds a proportion threshold value, determining that the suspected matching pair passes verification, and forming field characteristics of the first data column and the second data column into associated information;

After the first data column and the second data column are written into the inverted index, the association information is written into an array of the first data column and the second data column in the inverted index.

3. The index construction method according to claim 1, characterized in that the index construction method further comprises:

If the public data set contains long text data, compressing and summarizing the long text data through a trained language model to obtain a summarizing document;

the constructing an inverted index according to the field characteristics and the storage information of each data column comprises the following steps:

respectively performing word segmentation processing on field characteristics of each data column to obtain a first word segmentation result of the corresponding data column;

Constructing an inverted index, writing a first word segmentation result of each data column into a blank key value of the inverted index, and writing storage information of the corresponding data column into an array of the corresponding key value;

Acquiring storage information of each induction document, and respectively performing word segmentation on each induction document to obtain a second word segmentation result of the corresponding induction document;

And writing the second word segmentation result of each induced document into a blank key value of the inverted index, and writing the storage information of the corresponding induced document into an array corresponding to the key value.

4. A field searching method, characterized in that after the inverted index and the vector index are obtained by using the index construction method according to any one of claims 1 to 3, the field searching method comprises:

forming a candidate data set from the first data set and the second data set;

5. The field searching method according to claim 4, wherein the determining the first data set according to the data to be queried and the inverted index comprises:

Prefix filtering is carried out on the data to be queried to obtain filtered data to be queried;

obtaining candidate data according to the filtered data to be queried and the inverted index;

and filtering the candidate data according to the data to be queried to obtain a first data set.

6. The method for searching fields according to claim 4, wherein said sorting all data in said candidate data set according to said data to be queried, to obtain sorted data as a query result, comprises:

performing word segmentation processing on the data to be queried to obtain a third word segmentation result;

and calculating the relevance score of the third word segmentation result and each data in the candidate data set, and sorting all the data in the candidate data set according to the relevance score, wherein the sorted data is a query result.

7. An index building device, characterized in that the index building device comprises:

8. A field searching apparatus, characterized in that after obtaining an inverted index and a vector index using the index construction method according to any one of claims 1 to 3, the field searching apparatus comprises:

9. A computer device comprising a processor, a memory and a computer program stored in the memory and executable on the processor, the processor implementing the index building method of any one of claims 1 to 3 or the field searching method of any one of claims 4 to 6 when the computer program is executed.

10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the index construction method of any one of claims 1 to 3, or the field search method of any one of claims 4 to 6.