WO1992009960A1 - Data retrieving device - Google Patents
- Publication number
- WO1992009960A1 (PCT/JP1991/000011)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- character
- search
- character set
- code
- position information
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
- G06F16/90344—Query processing by using string matching techniques
Definitions
- the present invention relates to an information search processing method for performing information search.
- The present invention is particularly suited to full-text search processing and to partial-match search processing using multiple keywords. It significantly reduces the number of comparisons between the search input and the full text or registered keywords being searched, and so provides a high-speed information search method.
- INDUSTRIAL APPLICABILITY The present invention is suitable for an information search processing method for performing a full-text search process or a multi-keyword search in a database system.
- Conventionally, a sequential search method is used in which the input character string specified by the searcher is treated as a keyword character string and record numbers are found from the keywords that match the search conditions. Alternatively, a search file is created and stored in index format from searchable text entered from the keyboard, and keywords matching the input character string and the search conditions are found using the index structure of the search file.
- This index method is generally used as a partial-match search technique using multiple keywords.
- the sequential search processing method of the multi-keyword search processing requires the same search time as the sequential search method of the full-text search processing.
- When dedicated hardware is used, transferring character strings between the computer that performs the search processing and the dedicated processor or LSI takes time. Realizing high-speed performance that is satisfactory for the system as a whole is therefore an issue.
- The index method in multi-keyword search can speed up partial-match search, but has the disadvantage that the search file becomes huge. For this reason exact, prefix, and suffix matches are used, but intermediate matches are often not supported. Supporting intermediate matches requires a large number of additional indexes on top of the search indexes for exact, prefix, and suffix matches, so the storage capacity of the search file becomes huge, the search time increases, and the search file becomes hard to handle. Some systems do not even support every keyword prefix and suffix because of the size of the search file. However, searchers often remember only distinctive characters or character strings within a keyword, which frequently calls for an intermediate match, so support for intermediate matching has been in demand.
- Characters are taken one at a time from the first character of the string, each is combined with the characters that follow it to form a character set of r characters in total, and a search file is created from character set groups grouped by character set type.
- Alternatively, a search file is created from character groups grouped by character type. It was found that search can be sped up by collating character sets, or character continuity, against the search file at search time.
- From the above viewpoints, the present invention makes it possible to realize high-speed full-text search, or partial-match search using multiple keywords, over a large number of documents.
- This eliminates the need to transfer character strings to dedicated processors and LSIs, and allows arbitrary character strings to be searched by focusing on character sets and character set positions, or on characters and character positions.
- The purpose is to provide such an information retrieval processing method.
- A first feature of the present invention comprises: search unit identification code assigning means for dividing a character string to be searched into search units, which are the units in which search is performed, and assigning an ascending code to each search unit; attribute code assigning means for assigning to each search unit an attribute code indicating its logical division; character set position order code assigning means for extracting characters from the character string to be searched one at a time, forming a character set from each character and the characters that follow it, r characters in total, and creating a character set position order code indicating the position of the first character of the character set within the search unit; and means for generating character set position information from the search unit identification code, character set position order code, and attribute code, and storing it in an area for each character set type to create a search file.
- When n is the maximum number of characters in a search unit and a is the maximum number of attributes, it is desirable to give the character set position information as the numeric code {(search unit identification code × n) + character set position order code} × a + attribute code.
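The numeric code {(search unit identification code × n) + character set position order code} × a + attribute code can be sketched as a pair of pack and unpack helpers. This is a minimal illustration, not the patent's implementation: the function names are hypothetical, and it assumes zero-based position and attribute values (0 ≤ position < n, 0 ≤ attribute < a) so that the code can be unpacked unambiguously.

```python
def encode_position(unit_id, position, attribute, n, a):
    """Pack the three codes into one integer.

    Assumption (not stated in the source): position and attribute are
    zero-based, which makes decoding with divmod unambiguous.
    """
    if not (0 <= position < n and 0 <= attribute < a):
        raise ValueError("position or attribute out of range")
    return (unit_id * n + position) * a + attribute


def decode_position(code, n, a):
    """Recover (unit_id, position, attribute) from the packed integer."""
    rest, attribute = divmod(code, a)
    unit_id, position = divmod(rest, n)
    return unit_id, position, attribute


code = encode_position(3, 17, 2, n=1000, a=8)
print(code, decode_position(code, n=1000, a=8))
```

One consequence of this layout is that packed codes sort first by search unit, then by position, so entries registered in search unit order stay in ascending numeric order.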
- A second feature of the present invention uses the search file created in the first feature and decomposes the search input character string, from its first character, into character sets of r characters each to form a search input character set string.
- The last character set may then contain (r-1) or fewer characters, so that a character set of r characters cannot be formed.
- A third feature of the present invention is to create a search file in which character position information is stored for each character type. It comprises: search unit identification code assigning means for dividing the search target character string into search units and assigning a code to each search unit in ascending order; attribute code assigning means for assigning to each search unit an attribute code indicating its logical division; character position order code assigning means for assigning to each character in the character string to be searched a character position order code indicating its position in the search unit; and means for generating character position information comprising the search unit identification code, character position order code, and attribute code.
- n Maximum number of search unit characters
- A fourth feature of the present invention performs search processing using the search file created by the third feature. It comprises: means for extracting from the search file the character position information of the same characters as those in the search input character string; means for extracting, from the character position information of each extracted character, the combinations that share a common search unit identification code, whose character position order codes follow the same order as the characters of the search input, and whose attribute code equals that of the search input; and means for outputting, based on the extracted combinations, the search unit to which the character string belongs and its character position as the search result. It is desirable to extract the combinations of character position information in order, starting from the search input character with the lowest full-text occurrence frequency.
- A fifth feature of the present invention relates to multi-keyword search. It comprises: record identification code assigning means for assigning an ascending code to each record to be searched; keyword attribute code assigning means for assigning to each keyword included in the record an attribute code indicating the logical division of the keyword; character set position order code assigning means for taking characters out of the keyword one at a time, forming a character set from each character and the characters that follow it, r characters in total, and assigning a character set position order code indicating the position of the first character of the character set; and means for generating character set position information comprising the record identification code, keyword attribute code, and character set position order code.
- The character set position information is created by arranging each keyword of the record in the keyword attribute area corresponding to its keyword attribute code, and converting the character set position order code and the record identification code into a code consisting of integers.
- n Number of characters in keyword string
- A sixth feature of the present invention relates to search processing on a search file created according to the fifth feature. It comprises: means for extracting, from the character set position information of the character sets, the combinations in which the record identification code and the keyword attribute code are common, the difference between the character set position order codes equals the difference between the first-character positions of the corresponding character sets in the search input string, and the keyword attribute code is the same as that of the search input; and means for outputting, based on the extracted combinations, the record identification code corresponding to the search input character string as the search result.
- The character set position information that can form the same character set string as the search input character set string is extracted starting from the character set in the search input with the lowest occurrence frequency across all keywords.
- i is the character set position order code of the character set with a high frequency of appearance
- j is the character set position order code of the character set with the character set position order code i. It is desirable to extract a combination of character set position information that matches i) and j).
- the keyword is a Western character string containing symbols
- a search file that uses kanji as character position information in units of one character and kana characters as character set position information in units of two characters can be used.
- A seventh feature of the present invention is a multi-keyword search that uses character position information in units of one character. It comprises: record identification code assigning means for assigning an ascending code to each record to be searched; keyword attribute code assigning means for assigning to each keyword in the record an attribute code indicating its logical class; character position order code assigning means for decomposing the keyword into individual characters and assigning each character a character position order code indicating its position in the keyword; and means for generating character position information comprising the record identification code, keyword attribute code, and character position order code, storing it in an area for each character type, and generating a search file.
- The character position information is created by arranging each keyword of the record in the keyword attribute area corresponding to its keyword attribute code, and converting the attribute code and character position order code into a code consisting of integers.
- n Number of characters in keyword string
- Pa Keyword attribute code It is desirable to be given as the preceding numeric code in the key word sequence of the keyword attribute area of a.
- An eighth feature of the present invention relates to a search process for a search file created according to the seventh feature, and includes a search file created according to the seventh feature, and is the same as a character constituting a search input character string.
- the frequency of occurrence of the same character string in the document is low.
- Kojien Japanese language dictionary published by Iwanami Shoten
- the frequency of appearance of kana characters among them is as high as 53200 times on average.
- the frequency of appearance of the two-letter kana character string is low, with an average frequency of 472 times.
- the search input is n characters
- The collation targets extracted from the whole text will then be (n / 2) × 472 pieces of character set position information on average.
- the frequency of appearance of two kanji character strings is even lower than that of kana characters, and the collation target extracted from the whole sentence is less than that of kana characters.
- The JIS first-level kanji appear 1155 times each on average in the descriptions of the Kojien headwords. For this reason, if the search input is n characters drawn from the 2965 JIS first-level kanji, the collation targets extracted from the Kojien headword descriptions will be n × 1155 on average.
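The averages quoted above (53200 for a single kana character, 472 for a two-character kana string, 1155 for a JIS first-level kanji) can be turned into a small worked calculation. This is only an arithmetic sketch: the function names are hypothetical, and the n × 53200 figure for matching kana one character at a time is an assumption made by analogy with the n × 1155 kanji case, not a number stated in the text.

```python
KANA_SINGLE_AVG = 53200   # average occurrences of one kana character
KANA_PAIR_AVG = 472       # average occurrences of a two-character kana string
KANJI_SINGLE_AVG = 1155   # average occurrences of one JIS first-level kanji


def candidates_single(n, avg):
    """Collation targets when an n-character input is matched one character at a time."""
    return n * avg


def candidates_pairs(n, avg):
    """Collation targets when an n-character input is split into n/2 two-character sets."""
    return (n // 2) * avg


n = 10
print("kana, one character at a time:", candidates_single(n, KANA_SINGLE_AVG))
print("kana, two-character sets:     ", candidates_pairs(n, KANA_PAIR_AVG))
print("kanji, one character at a time:", candidates_single(n, KANJI_SINGLE_AVG))
```

Under these figures, grouping kana into two-character sets cuts the candidates for a ten-character input by more than two orders of magnitude, which is consistent with the choice of two-character units for kana and one-character units for kanji.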
- Since the search input is generally several tens of characters or less, the number of comparisons, even for a character string that includes high-frequency characters, is significantly smaller than when all characters are collated sequentially.
- the search target is rapidly narrowed down.
- Among the character set strings of the search target candidates obtained so far, those that differ from the search input character set string are deleted, so the search targets are narrowed down with each constituent character set that is matched.
- Matching is performed in order, starting from the character set with the lowest full-text occurrence frequency in the search input, or the lowest occurrence frequency across all keywords, so the number of comparisons can be reduced.
- A search file is created that stores, for each character set, character set position information indicating where that character set occurs in the character string to be searched (the full text or the registered keywords). Collating the search input character set string against this search file greatly reduces the number of comparison operations in character string search.
- For characters with a low frequency of appearance, such as kanji, the number of comparison operations can be significantly reduced.
- This search file is created as follows. Note that this explanation is based on an example of a character set for full-text search processing.
- a character string to be searched is divided into search units.
- When the search target character string is a book or a paper, it is composed of a table of contents, title, chapter or section titles, body text, figure or table titles, and references. Since each of these parts is logically distinct, each can be treated as a search unit. Books and papers are therefore divided logically into search units, and identification codes are assigned to the search units in ascending order of appearance.
- A long body text can be divided into a plurality of search units, and identification codes can be assigned to each of them in series with the other search units.
- Each search unit is assigned an attribute code that indicates, as its attribute, its logical type, such as table of contents, preface, title, or body text.
- Characters are extracted from the character string one at a time, starting from the first, and a character set is created from each character and the characters that follow it, r characters in total.
- For each character set, character set position information is created from the search unit identification code, the character set position order code indicating the position of the first character of the character set, and the attribute code of the search unit. It is stored in an area provided for each character set type, which creates the search file.
- This search file has a file structure in which character set position information is stored for each character set type.
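The creation procedure described above amounts to building an inverted file keyed by character set. The following is a minimal sketch under stated assumptions: `build_search_file` is a hypothetical name, search units are supplied as (attribute, text) pairs, the position information is kept as a plain (search unit, position, attribute) tuple rather than a packed integer, and no end-mark padding is applied to the last characters of a unit.

```python
from collections import defaultdict


def build_search_file(units, r=3):
    """Map each r-character set to the places where it occurs.

    units: list of (attribute, text) pairs. Search unit identification
    codes are assigned in ascending order of appearance, starting at 1.
    """
    search_file = defaultdict(list)
    for unit_id, (attribute, text) in enumerate(units, start=1):
        # Slide over the text one character at a time.
        for start in range(len(text) - r + 1):
            charset = text[start:start + r]
            # Position order code = 1-based position of the set's first character.
            search_file[charset].append((unit_id, start + 1, attribute))
    return search_file


search_file = build_search_file([(1, "abcab"), (5, "cabc")], r=2)
for charset in sorted(search_file):
    print(charset, search_file[charset])
```

Because units are processed in ascending order, the entries within each character set group are already sorted by search unit and position.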
- The search input is decomposed from its first character into character sets of r characters each to form a search input character set string, and the character set position information of the same character sets is looked up in the search file.
- From this, the combinations of character set position information are checked and extracted that have the same search unit identification code, a difference in character set position order code equal to the difference between the first-character positions of the corresponding character sets in the search input character string, and the same attribute code.
- When the search input string is decomposed from the first character into character sets of r characters, the last character set may contain (r-1) or fewer characters, so that a set of r characters cannot be formed. In that case, as many characters as are missing are taken from the end of the character set immediately before the last one and concatenated to the front of the last character set, giving a character set of r characters.
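The decomposition of the search input, including the handling of a short final character set, can be sketched as follows. The function name `split_query` and the (offset, character set) return shape are assumptions for illustration; offsets are zero-based positions of each set's first character in the search input.

```python
def split_query(query, r=3):
    """Split the search input into r-character sets, left to right.

    If the final set would be short, it borrows the missing characters from
    the end of the preceding set, which is the same as taking the last r
    characters of the input. Returns (offset, charset) pairs.
    """
    if len(query) < r:
        raise ValueError("search input is shorter than one character set")
    sets = [(i, query[i:i + r]) for i in range(0, len(query) - r + 1, r)]
    if len(query) % r:
        start = len(query) - r
        sets.append((start, query[start:]))
    return sets


print(split_query("document", r=3))
print(split_query("retrieval", r=3))
```

Keeping the offset with each set matters for matching, since the borrowed final set overlaps the one before it and its expected position in the text shifts accordingly.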
- This matching process checks the continuity of the character set string and the attribute match between the search input and the search file. From the character set position information in the search file, it extracts the combinations of character sets that share a common search unit identification code, whose difference in character set position order code equals the difference between the first-character positions of the corresponding character sets in the search input character string, and whose attribute code is the same as that of the search input.
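The continuity check can be sketched by normalizing every stored position to the position at which the whole search input would have to start: two entries are consistent exactly when they share a search unit and attribute and their position difference equals their offset difference. This is a sketch under assumptions: `find_matches` is a hypothetical name, the search file is a dictionary from character set to (search unit, position, attribute) tuples, and the query is a list of (offset, character set) pairs.

```python
def find_matches(search_file, query_sets, attribute=None):
    """Return (search unit, start position) pairs where the query sets line up.

    search_file: dict mapping a character set to (unit, position, attribute) tuples.
    query_sets:  list of (offset, charset) pairs for the search input.
    attribute:   optional attribute code the match must carry.
    """
    first_offset, first_set = query_sets[0]
    # Each candidate is keyed by where the whole query would begin.
    candidates = {
        (unit, pos - first_offset, attr)
        for unit, pos, attr in search_file.get(first_set, [])
        if attribute is None or attr == attribute
    }
    for offset, charset in query_sets[1:]:
        aligned = {(unit, pos - offset, attr)
                   for unit, pos, attr in search_file.get(charset, [])}
        candidates &= aligned  # keep only positions consistent with this set
    return sorted((unit, start) for unit, start, _ in candidates)


search_file = {
    "doc": [(1, 2, 5), (2, 1, 3)],
    "ume": [(1, 5, 5), (2, 4, 3)],
    "ent": [(1, 7, 5), (1, 9, 5), (2, 6, 3)],
}
query = [(0, "doc"), (3, "ume"), (5, "ent")]
print(find_matches(search_file, query))
print(find_matches(search_file, query, attribute=3))
```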
- When a character string that matches the search input is found in this way, the search unit is identified from the search unit identification code, and the character position of each character in the character set, counted from the first character of the search unit, is extracted. Both are output to the searcher as the search result.
- When single characters are used instead of character sets, the search file is created by storing the character position information of each character of the full text in the area for its character type.
- The search input character string is decomposed into individual characters, the character position information of each character is extracted from the search file, and the combinations of character position information are extracted that share a common search unit identification code, follow the same order as the search input character string, and have the same attribute code as the search input. The search unit and character position are output as the search result.
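The single-character variant can be sketched in the same way, with one entry per character instead of one per character set. The names `build_char_file` and `search_chars` are hypothetical, the position information is an unpacked (search unit, position, attribute) tuple, and the membership test uses a simple list scan for clarity rather than speed. The search input is assumed to be non-empty.

```python
from collections import defaultdict


def build_char_file(units):
    """Store (unit, position, attribute) for every character, grouped by character."""
    char_file = defaultdict(list)
    for unit_id, (attribute, text) in enumerate(units, start=1):
        for pos, ch in enumerate(text, start=1):
            char_file[ch].append((unit_id, pos, attribute))
    return char_file


def search_chars(char_file, query, attribute=None):
    """Find (unit, start position) pairs where the query characters occur consecutively."""
    hits = []
    for unit_id, pos, attr in char_file.get(query[0], []):
        if attribute is not None and attr != attribute:
            continue
        # Each later character must appear in the same unit, with the same
        # attribute, at the position implied by its order in the query.
        if all((unit_id, pos + k, attr) in char_file.get(ch, [])
               for k, ch in enumerate(query[1:], start=1)):
            hits.append((unit_id, pos))
    return hits


char_file = build_char_file([(5, "東京都庁"), (5, "京都市")])
print(search_chars(char_file, "京都"))
print(search_chars(char_file, "東京都"))
```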
- For multi-keyword search, each record having keywords is assigned an ascending record identification code in order of registration, and each keyword is assigned a keyword attribute code indicating its logical type as an attribute.
- Each character or character set in the keyword is assigned a character position order code or character set position order code. Character position information or character set position information is created from these three codes and stored in the area for each character type or character set type to create a search file.
- A pair consisting of the search input character string and the attribute of the search input character string is entered.
- The search input string is decomposed into single characters or character sets, and the character position information or character set position information of the same characters or character sets is extracted from the search file.
- From the extracted information, the combinations of character position information or character set position information are extracted that have the same record identification code, the same order of character position order codes or character set position order codes as the search input, and the same keyword attribute code as the search input.
- The record identification number is extracted as the search result from the extracted combinations of character position information or character set position information.
- FIG. 1 is a configuration example of an information search processing device used in an embodiment of the present invention.
- FIG. 2 shows an example of a search file according to the first embodiment.
- FIG. 3 is a list of the second and third character combinations in each character set group of the first embodiment.
- FIG. 4 shows the character set group address table of the first embodiment.
- FIG. 5 shows an example of registration of a search file according to the first embodiment.
- FIG. 6 is a flowchart illustrating a search file creation processing procedure according to the first embodiment.
- FIG. 7 is a flowchart illustrating a search processing procedure according to the first embodiment.
- FIG. 8 shows a search file according to the second embodiment.
- FIG. 9 shows a list of character set groups according to the second embodiment.
- FIG. 10 is a character set group address table according to the second embodiment.
- FIG. 11 shows an example of registration of a search file according to the second embodiment.
- FIG. 12 is a character column address table of the third embodiment.
- FIG. 13 shows an example of registration of a search file according to the third embodiment.
- FIGS. 14A and 14B are flowcharts for explaining a search file creation processing procedure according to the third embodiment.
- FIG. 15 is a flowchart illustrating a search processing procedure according to the third embodiment.
- FIG. 16 shows an example of a keyword sequence according to the fourth embodiment.
- FIG. 17 shows an example of character set position information creation according to the fourth embodiment.
- FIG. 18 shows an example of registration of a search file according to the fourth embodiment.
- FIGS. 19A and 19B are flowcharts illustrating a search file creation procedure according to the fourth embodiment.
- FIGS. 20A and 20B are flowcharts illustrating a search processing procedure according to the fourth embodiment.
- FIG. 21 shows an example of a keyword string according to the fifth embodiment.
- FIG. 22 shows an example of character set position information creation according to the fifth embodiment.
- FIG. 23 shows an example of registration of a search file according to the fifth embodiment.
- FIG. 24 shows an example of character position information creation according to the sixth embodiment.
- FIG. 25 shows an example of registration of a search file according to the sixth embodiment.
- FIG. 26A is a flowchart illustrating a search file creation procedure according to the sixth embodiment.
- FIGS. 27A and 27B are flowcharts illustrating a search processing procedure according to the sixth embodiment.
- FIG. 1 shows the configuration of an information search processing device according to an embodiment of the present invention.
- The information search processing device of the present embodiment comprises: a CPU 1 that performs various arithmetic and determination processing; a memory 2 that stores the programs for search processing, search file creation, and so on, the search files created or used in search processing, and the search inputs; an input/output unit 3 for connecting a keyboard 4 and a display 5; an external storage device control unit 6 for connecting an external storage device 7 that stores information; and a common bus 8 that connects the CPU 1, the memory 2, the input/output unit 3, and the external storage device control unit 6.
- the first embodiment is an embodiment in which a European character document is targeted for full-text search.
- Characters are taken one at a time from the first character of the character string to be searched, and a character set consisting of each character and the next two characters is created.
- The embodiment consists of search file creation processing, which creates a search file made up of character set groups grouped by character set type, and search processing, which collates the search input against the search file and extracts the character strings that match it.
- The search file creation processing can be roughly divided into three steps: 1) reserving the search file area, 2) creating and assigning character set position information for each character set, and 3) registering the character set position information in the search file, grouped by character set type. Each of these steps is described below.
- the search file is composed of a character set group arranged in the character order of ASCII codes “20” to “7F” described in the ASCII code table.
- Each character set group consists of three characters whose first character is the character that represents the name of each character set shown in Fig. 2.
- the second and third characters of each character set group consist of the characters described in the ASCII code table as shown in Figure 3.
- For example, character set group A consists of the three-character sets whose first character is "A" and whose second and third characters range over the ASCII characters. Character sets of three characters in total are created from the text to be registered, and the appearance frequency of each is counted.
- From this, the number of pieces of character set position information registered in each character set type group of the search file is known, so an area can be secured for the search file composed of all character set type groups.
- The start address of each character set type group, stored contiguously in the search file, can then be determined.
- The character set group address table shown in Fig. 4 lists the start addresses of the character set type groups in the order in which the character sets are described in Figs. 2 and 3.
- The character set position information described here consists of the search unit number, which indicates the order of appearance of the search unit to which the character set belongs; the character set position number, which indicates where the character set appears in the search unit by the position of its first character; and the attribute number, which indicates the logical type of the search unit.
- search units and their attributes will be described.
- a typical book consists of a table of contents, preface, chapter or section titles, body text, figure or table titles, references, etc., and appears in this order.
- When searching the contents of such a book, it is convenient to use these parts as the search units and to output the matching search unit as the search result, which often suits the purpose of the search. In practice, only the titles or only the body text is often specified as the search target, depending on the purpose.
- A search unit indicates a logical division of the character string to be searched.
- An attribute number is given to each search unit according to its logical division. For example, assign "1" to the table of contents, "2" to the preface, "3" to chapter or section titles, "4" to figure or table titles, "5" to the body text, and "6" to the references.
- Numbers are assigned to the search units in ascending order from 1, in order of appearance. These are used as the search unit numbers.
- If the body text is long, divide it appropriately. The text can be split into multiple search units, each assigned a search unit number in order of appearance.
- Characters are taken one at a time from the beginning of the search unit, a character set consisting of each character and the next two characters is created, and character set position numbers are assigned to the character sets in ascending order: 1, 2, 3, and so on.
- Two special characters, EM (end mark)
- The search unit number, character set position number, and attribute number given above are converted into a single code consisting of integers to create the character set position information.
- When the maximum number of characters in a search unit is n and the maximum number of attributes is a, the character set position information is given as {(search unit number × n) + character set position number} × a + attribute number.
- The character set type groups are stored in the search file in the order described in FIG. 2 and FIG. 3, and the character set position information is registered in each character set type group.
- Character set position information is registered by storing it at the head of the unstored area of the corresponding character set type group. If registration proceeds in search unit order, the character set position information in each character set type group is therefore in ascending numerical order.
- Fig. 5 shows an example in which the character set position information of "document", described above, is registered in a search file. The character set position information in each group is stored in ascending order. If each piece of character set position information occupies 4 bytes, the file capacity is as shown below.
- Additional registration is performed by appending the new character set position information at the head of the unstored area of the group corresponding to each character set of the added document.
- Deletion is performed by overwriting the relevant character set position information, in the group corresponding to each character set of the deleted document, with a special symbol (here, the ASCII code “0000”). As a result, both additional registration and deletion can be performed in a short time.
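Append-only registration and deletion by overwriting with a special symbol can be sketched as below. This is an illustration under assumptions, not the patent's file layout: the function names are hypothetical, each group is a Python list instead of a reserved file area, and `None` stands in for the special deletion symbol.

```python
TOMBSTONE = None  # stand-in for the special symbol that marks a deleted entry


def register(search_file, unit_id, attribute, text, r=3):
    """Append the unit's position information to the end of each group."""
    for start in range(len(text) - r + 1):
        group = search_file.setdefault(text[start:start + r], [])
        group.append((unit_id, start + 1, attribute))


def delete(search_file, unit_id, text, r=3):
    """Overwrite the unit's entries in place; nothing is moved or compacted."""
    for start in range(len(text) - r + 1):
        group = search_file.get(text[start:start + r], [])
        for i, entry in enumerate(group):
            if entry is not TOMBSTONE and entry[0] == unit_id:
                group[i] = TOMBSTONE


def live_entries(search_file, charset):
    """Entries a search should consider, skipping deleted slots."""
    return [e for e in search_file.get(charset, []) if e is not TOMBSTONE]


search_file = {}
register(search_file, 1, 5, "abcd")
register(search_file, 2, 5, "bcde")
delete(search_file, 1, "abcd")
print(search_file["bcd"], live_entries(search_file, "bcd"))
```

Overwriting in place keeps every other entry at its original location, which is why neither operation requires rewriting the rest of a group.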
- The character set position information stored for each character set type group in this search file can be retrieved by using the start addresses of the character set groups in the character set group address table of Fig. 4 as a directory.
- Fig. 6 shows the flow of the above search file creation process.
- the search input character string is decomposed from the first character into a character set consisting of three characters, and a search input character set string is created.
- the search input character set sequence is rearranged in order from the character set with the lowest occurrence frequency of all sentences.
- the search unit Ban and the character position number that indicates the position of each character in the character set from the first character in the search unit are output as search results.
- the search input string is decomposed into a character set consisting of three characters from the first character so that it can be compared with the character set stored in the search file, and is used as the search input character set.
- a character set is divided into three character units from the first character, the last character set may be shorter than three characters, and a character set of three character units may not be created.
- the characters for the number of underscores are extracted from the last part of the character set immediately before the last character set, and are connected to the front part of the last character set to create a three-character unit character set.
- the character set group head addresses in the character set group address table, which indicate the start address of each character set type group in the search file, are referred to, the full-text occurrence frequency of each search input character set is checked, and the search input character set sequence is sorted in ascending order of full-text occurrence frequency.
- the head address in the character set group address table indicates the first address of each character set type group stored in the search file. From the difference between the head addresses of adjacent groups, the number of character set position information entries stored in each character set type group, and hence the frequency with which each character set type appears in the full text, can be determined.
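A rough sketch of deriving occurrence counts from head addresses and sorting the search input accordingly is given below. It assumes that addresses are counted in entries, that groups are packed back to back, and that the end of the file is known; the names `group_sizes`, `sort_by_frequency`, `head_addresses`, and `end_address` are hypothetical.

```python
def group_sizes(head_addresses, end_address):
    """Map each character set to the number of slots between its head
    address and the head address of the next group."""
    ordered = sorted(head_addresses.items(), key=lambda item: item[1])
    sizes = {}
    for index, (char_set, head) in enumerate(ordered):
        if index + 1 < len(ordered):
            next_head = ordered[index + 1][1]
        else:
            next_head = end_address
        sizes[char_set] = next_head - head
    return sizes


def sort_by_frequency(query_sets, sizes):
    """Order the search input character sets from rarest to most frequent."""
    return sorted(query_sets, key=lambda char_set: sizes.get(char_set, 0))
```

A character set absent from the table is treated here as having frequency zero, which places it first; that choice is an assumption of the sketch.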
- the number of matches with the character set position information of each character set stored in the search file can be extremely reduced by performing collation matching from character sets with low occurrence frequency of all sentences. That is, when checking the continuity of each character set by comparing the character set position information, the search unit number, the character set position identification number, and the attribute number in the character set position information in the two character set type groups are compared. Therefore, if the number of character set position information stored in the two character set type groups is small, the number of times of collation can be reduced accordingly. Therefore, when collating character set position information, collation is performed from a character set with a low frequency of full-text appearance, thereby reducing the number of times of collation. In particular, as the number of search input characters increases, the rate of inclusion of a character set with a low appearance frequency increases, so the reduction effect is large.
- the character set position information stored in each character set type group is extracted by referring to the character set group address table, starting from the character set with the lowest full-text occurrence frequency. Then, based on the extracted character set position information and proceeding from the character set type group with the lowest full-text occurrence frequency, combinations of character set position information are extracted for which the search unit is the same in each character set type group and the difference in character set position numbers equals the difference between the first character positions of the corresponding character sets in the search input character string.
- the comparison of character set position information between character set type groups is as follows: the difference is taken between the character set position information of the character set type group with the low full-text occurrence frequency and that of the character set type group with the high full-text occurrence frequency, and the continuity of the character sets is checked.
- the number of times of collation is reduced by deleting discontinuous character set position information from the collation target.
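The continuity check between two character set type groups can be illustrated as below. This sketch stores each entry as a `(search_unit, position, attribute)` tuple and uses a hash set for the lookup, which is a substitution made for brevity rather than the comparison procedure of the patent; the function name `collate` and the tuple layout are assumptions.

```python
def collate(low_group, high_group, offset):
    """Return (low, high) pairs that lie in the same search unit, carry the
    same attribute, and are separated by the required position offset.

    `offset` is position(high) - position(low) in the search input.
    """
    wanted = set(high_group)
    pairs = []
    for unit, position, attribute in low_group:
        partner = (unit, position + offset, attribute)
        if partner in wanted:
            pairs.append(((unit, position, attribute), partner))
    return pairs
```

Iterating over the low-frequency group keeps the loop short, and entries without a partner simply drop out of further collation.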
- the number of matches between these two groups is only 7 times in total, and it is not necessary to check all character set position information in the group.
- the character set position information that matches the attribute specified in the search input can be extracted.
- the search unit number and the character position numbers indicating the position of each character in the character set, counted from the first character in the search unit, are extracted as search results. If there is more than one search input, then for the second and subsequent search inputs the character set position information having the search unit numbers obtained so far is first extracted from the character set type group corresponding to the first character set of that search input, and the processing then continues with the next character set of the search input. In this way, only character sets contained in the same search units as the results of the first search input are extracted for the second and subsequent search inputs.
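Handling several search inputs can be pictured with the following sketch, which restricts a group to search units that have already matched and combines the per-input results. The function names and the `(search_unit, position, attribute)` tuple layout are assumptions for illustration only.

```python
def restrict_to_units(group, allowed_units):
    """Keep only entries whose search unit number was already matched."""
    return [entry for entry in group if entry[0] in allowed_units]


def combine_results(per_input_units):
    """Search unit numbers that satisfy every search input."""
    result = None
    for units in per_input_units:
        result = set(units) if result is None else result & set(units)
    return result or set()
```

Filtering before collation means later search inputs only examine entries from search units that are still candidates.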
- the search unit of the search unit number “8” to which this character string belongs and the character position number “121 to 127” are output as the search results.
- This search processing operation is shown as a flowchart in FIG.
- the search input is taken out, the search input character string is divided into character sets in units of three characters from the first character, and a search input character set string is created.
- the number ai is set, and the appearance frequency of each character set is checked with reference to the character set group address table and sorted in ascending frequency (S41 to S44). Then, the character set position information stored in the character set type group corresponding to the rearranged character set is extracted from the search file (S45).
- the character set position number of the character set with the low full-text occurrence frequency in the search input character set string is denoted i, and the character set position number of the character set with the high full-text occurrence frequency is denoted j.
- the character set position information having the attribute number ai is selected from the character set position information, and the search unit and character set that match the search input are selected.
- a character position number indicating the position of each constituent character from the first character in the search unit is output as a search result (S49, S50). If the collation is to be continued in step S48, the character set position information from the previous matching result is collated with the character set position information stored in the character set type group corresponding to the next character set in the rearranged search input character set string (S46).
- the Japanese character string is a character string containing kanji. Kanji have far more character types than Western characters, so the frequency with which the same kanji recurs is very low compared with Western text, which uses alphabetic characters. For example, although many terms in Japanese character strings use the two characters "communication", the frequency of occurrence of a character string that is identical over four characters, such as "communication line" or "communication device", is very low. Katakana and hiragana characters also have more character types than European characters.
- the search process can therefore be performed quickly even with a search file organized by character type for each kanji character, or with a two-character character set search file.
- In this second embodiment, a description will be given of search file creation and search processing using a character set composed of two characters.
- This second embodiment is basically the same as the first embodiment in which a character set consisting of three characters is processed. However, the difference is that a search file and a character set group address table are created using the JIS code table because Japanese processing is performed.
- the search file of the second embodiment is composed of a character set group arranged in the character order described in the JIS code table as shown in FIG.
- Each character set group is made up of character sets consisting of two-character strings whose first character is the character in question, arranged in the order shown in the JIS code table, as shown in the character set group list in Fig. 9.
- the character set group address table shown in FIG. 10 is a table in which the head addresses of the character set type groups are arranged in the order in which they are described in the character set group list in FIG.
- "Correspondence” in this character string is decomposed into the character set of "Communication”, “Response”, “Document”, and “Calligraphy”.
- character set position information of “801215”, “801225”, “801235”, and “801245” is given, and the character set position information is stored in the area of the search file.
- Figure 11 shows an example of storing the character set position information of this “correspondence document” in a search file. Since the procedure of the search file creation process is the same as that of the first embodiment, the flowchart is omitted.
- the input search input character string is decomposed from the first character into a character set in units of two characters, and a search input character set string is created.
- a character set type group corresponding to the character set is extracted from the search file and collated, a combination of character set position information that can form the search input character set string is extracted, and from the extracted character set position information, the character set position information having the same attribute as the search input is extracted as a collation match.
- the search unit number and the character position numbers indicating the position of each character in the character set, counted from the first character in the search unit, are output as search results.
- When a search input string is decomposed from the first character into character sets of two characters, the last character set may consist of only one character, so that a two-character character set cannot be created. In this case, one character is taken from the end of the character set immediately before the last character set and concatenated to the front of the last character set to create a two-character character set.
- the search input character set is "communication" and "document”.
- the full-text appearance frequency of "communication" is lower than that of "document", so the matching is performed in this order. First, the character set position information in the character set group field of "communication" is collated with that in the character set group field of "document" in the search file. Since the character positions of "tsuru" and "sentence" in the search input "correspondence" are "1" and "3", respectively, character set position information separated by the same two character positions is extracted, and "801215" in "communication" and "801235" in "document" of the search file in FIG. 11 can be extracted as a connected combination.
- search condition is “body”
- the position numbers “121 to 124” are extracted as a search result.
- the procedure of the search process is the same as that of the first embodiment, so that the flowchart is omitted.
- the third embodiment differs from the second embodiment in that a search file of single character types is created instead of a search file of character set types.
- the processing is basically the same.
- the character address table and the search file are slightly different from those in the second embodiment because a character type group is generated for each character.
- the characters constituting the entire Japanese text are classified, the appearance frequency is counted for the character types described in the JIS code table, and the area for the search file is secured.
- a character column address table in which the head addresses of the character type groups, corresponding to FIG. 10 of the second embodiment, are arranged in the order described in the JIS code table is created as shown in FIG.
- Compared with the address table of the second embodiment, this character column address table describes the head address for each character type. Since the character types conform to JIS Level 1 and JIS Level 2, only 8836 character fields, including unused codes, are required.
- $\text{character position information code} = \text{search unit number} \times n + \text{character position number} \times a + \text{attribute number}$
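The position information code can be packed and unpacked as in the sketch below. The multipliers `N = 100000` and `A = 10` are not stated in this passage; they are assumptions chosen so that search unit 8, character position 121, and attribute 5 give the example value "801215". The function names are hypothetical.

```python
N = 100_000  # multiplier for the search unit number (assumed from the examples)
A = 10       # multiplier for the character position number (assumed)


def encode(search_unit, position, attribute):
    """Pack the three fields into one integer code."""
    return search_unit * N + position * A + attribute


def decode(code):
    """Recover (search_unit, position, attribute) from an integer code."""
    search_unit, rest = divmod(code, N)
    position, attribute = divmod(rest, A)
    return search_unit, position, attribute
```

With these multipliers, two adjacent characters in the same search unit and with the same attribute differ by exactly 10 in their codes, so continuity can be tested by integer subtraction alone.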
- Character type groups are stored in the search file in the order described in the JIS code table based on the character column address table shown in FIG. As a result, a search file shown in FIG. 13 in which character position information is stored by being divided into character type groups is created.
- Figure 14 shows a flowchart of this search file creation process.
- the head address of the character column in the character column address table corresponding to each constituent character of the search input character string is calculated.
- the search input characters are rearranged in ascending order of appearance frequency, the character position information stored in the character type group corresponding to each character is extracted, and, based on the extracted character position information, collation proceeds from the character type group with the lowest occurrence frequency.
- combinations of character position information are extracted for which the search unit is the same in each character type group and the difference in character position numbers equals the character position difference in the search input string.
- the collation of this character position information is as follows, where the character position number of the character with the low full-text occurrence frequency is i and the character position number of the character with the high full-text occurrence frequency is j.
- character position information having a common search unit and character continuity between character type groups is extracted, and character position information having the same attribute as the search input is extracted from the extracted character position information.
- a search unit and a character position that match the search input are extracted from the character position information that matches.
- the full-text appearance frequency of the characters increases in the order "writing", "sentence", "shin", "tsu", and the collation is performed in this order.
- using the above equation (5), the difference between the character position information extracted from the character column of "Book" and that extracted from the character column of "Sentence" in the search file is "-10".
- the character position information “801245” in the “book” of the search file and “801235” in the “sentence” can be extracted as continuous character position information.
- a search file for each kanji character and for a continuous katakana character and a hiragana character as a two-character set.
- katakana characters are often used as technical terms, and kana characters may be entered as search input character strings.
- continuous katakana and hiragana characters are used.
- Creating a search file as a two-character set is also effective for speeding up the search.
- An example of a book search system will be described as a multi-keyword information search method. Records in the book search system consist of keywords such as book title, author name, publisher name, year of publication, and abstract. Each record containing these keywords is registered to create a search file, and a keyword or a partial character string of a keyword is input as a search input to search for and output the corresponding record. The creation of this search file will be described.
- record identification codes are assigned to the records to be searched in ascending order according to the registration order.
- a keyword type code indicating the attribute is assigned with the logical type of the keyword included in each record as an attribute.
- keyword attribute codes indicating attributes such as the book title, author name, publisher name, publication year, and abstract are assigned, and a logical association is made between the search input and the keywords of the book search system.
- the searcher specifies a keyword for storing the book to be searched for as a search input.
- the keyword is decomposed into single characters or character sets, and each character is assigned a character position sequence code indicating its position from the beginning of the keyword, or each character set is assigned a character set position sequence code indicating the position of its first character from the beginning of the keyword.
- From the record identification code, the keyword attribute code, and the character position sequence code or character set position sequence code, character position information for each character of the keyword or character set position information for each character set is generated. At this time, the first character position of the keyword, preset for each keyword attribute code, is added to the character position information or character set position information as a constant so that the keyword attribute can be represented by the character position.
- This character position information or character set position information is grouped by character type or character set type, and these groups are assembled to create a search file. Therefore, this search file has a file structure in which character position information is stored for each character type or character set position information for each character set type.
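Grouping position information by character type can be sketched as follows for the single-character case. The input shape `{record_id: {attribute_code: keyword}}`, the tuple layout, and the name `build_index` are assumptions; an in-memory dictionary stands in for the search file.

```python
from collections import defaultdict


def build_index(records):
    """Group (record_id, attribute_code, position) entries by character.

    `records` maps record_id -> {attribute_code: keyword}; positions are
    1-based within each keyword.
    """
    index = defaultdict(list)
    for record_id in sorted(records):
        for attribute_code, keyword in sorted(records[record_id].items()):
            for position, char in enumerate(keyword, start=1):
                index[char].append((record_id, attribute_code, position))
    return index
```

Processing records in ascending identifier order keeps each group sorted, which mirrors the ascending registration order described for the search file.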
- a search input character string and a search input character string attribute are each input.
- the search input string is decomposed into individual characters or character sets, and the character position information of the same characters as those making up the search input, or the character set position information of the same character sets as those making up the search input, is retrieved from the search file.
- character position information or character set position information is collated and extracted when the record identification code and the keyword attribute code are common, the character position sequence code or character set position sequence code is in the same order as that of the search input string, and the keyword attribute code is the same as that of the search input. From the extracted character position information or character set position information, record identification codes common to all search input character strings are extracted as search results.
- for the keyword sequence created from the multiple keywords of each record to be searched, the constituent characters of each keyword are taken out one at a time from the first character, a character set of three characters in total is created from that character and the characters that follow it, and a search file consisting of character set type groups, one for each character set type, is created.
- this search file creation processing includes (1) securing a search file area, (2) assigning character set position information to each character set, and (3) storing the character set position information, grouped by character set type, in the search file.
- the search file is composed of character set groups arranged in the order of the characters listed in the ASCII code table.
- the second and third characters of each character set group are configured as shown in the second and third character combination list of the character set group in FIG. 3, as in the first embodiment, and are arranged in the order described in the character set group address table.
- the character set position information described here is assigned to the character sets that compose each keyword in a keyword sequence created by arranging each keyword of the record in the keyword attribute area corresponding to its keyword attribute number.
- the record number will be described.
- a general book search system searches books using keywords such as book name, author name, publisher name, year of publication, and abstract.
- the record is a search target composed of the keywords of book title, author name, publisher name, publication year, and abstract.
- a searcher specifies a book to be searched by using a keyword as a search input or by searching for a stored keyword.
- the book search system assigns keyword attributes to keywords such as the book name, author name, publisher name, year of publication, and abstract, for example, so that there is a logical association between the search input and the keywords in the book search system.
- "1" is assigned to the book name, "2" to the author name, "3" to the publisher name, "4" to the publication year, and "5" to the abstract as the keyword attribute number.
- the character set position number will be described. For each keyword, one character at a time is extracted from the beginning of the keyword, a character set of three characters in total is created from that character and the characters that follow it, and the numbers 1, 2, 3, ... assigned in order of creation are the character set position numbers. To the last character of the keyword, two special symbols EM (end mark) indicating the end of the keyword are added, character sets are formed by concatenation with these EM symbols, and character set position numbers are given to them as well. The EM symbol is assigned the ASCII code "7F" of "DEL" in the ASCII code table. Next, the keyword string will be described.
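The generation of overlapping three-character sets with end marks can be sketched as below. The function name `keyword_sets` is hypothetical; the end mark is represented by the ASCII DEL character (0x7F) as stated above.

```python
EM = "\x7f"  # ASCII DEL, used as the end mark


def keyword_sets(keyword, size=3):
    """Return (character_set, position_number) for every start position.

    The keyword is padded with size - 1 end marks so that the sets
    starting at the last characters are still full length.
    """
    padded = keyword + EM * (size - 1)
    return [(padded[i:i + size], i + 1) for i in range(len(keyword))]
```

For a keyword "abc" this produces "abc" at position 1, "bc" plus one end mark at position 2, and "c" plus two end marks at position 3.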
- the keyword string is a character string formed by connecting all the keywords of the record. That is, the keywords are arranged in fixed-length keyword attribute areas corresponding to the keyword attribute numbers, and a keyword sequence is created. Thus, the attribute of the keyword to which a character set belongs can be determined from its character position in the keyword string. Note that, following each keyword attribute area, an EM symbol indicating the delimitation of the keyword attribute area is placed in the keyword string. This EM symbol is the same as the special symbol EM indicating the end of the keyword.
- the character set position information is created by converting all the character sets constituting the keyword from the record number, the keyword attribute number, and the character set position number to codes consisting of integers.
- This character set position information is an integer code given by the following equation (6).
- the keyword sequence is as shown in FIG.
- the character set position information of each character set is configured as shown in FIG.
- since the character set position information is composed of four-byte codes in this way, it is possible to handle about $2^{32} / 1169 \approx 3.67$ million keyword strings of 1169 characters each.
- the character set position information assigned to each character set is registered in a search file.
- the character set type groups are stored in the search file in the order described in the ASCII code table shown in Figs. Then, the character set position information of each character set is registered in each character set type group. The registration of the character set position information is performed by storing the character set position information at the head of the unstored area of the corresponding character set type group. For this reason, if record numbers are given in the order of registration, character set position information will be registered in ascending numerical order in the character set type group.
- Figure 18 shows an example of registering the character set position information of the above-mentioned book name "Electronic Publishing" in a search file.
- the character set position information in each group is stored in ascending order.
- This file size is, if the character set position information is 4 bytes,
- additional registration is performed by adding a new code at the head of the unstored area of the group corresponding to each character set of each keyword in the additional record. Deletion is performed by changing the character set position information in the group corresponding to each character set of each keyword of the deleted record to a special symbol (here, ASCII code "0000"). As a result, addition and deletion can be performed in a short time.
- the character set position information of each character set in this search file can be obtained by using the head address of each character set group in the character set group address table of FIG. 4, shown in the first embodiment, as a directory.
- Figures 19a and 19b show the flow of the above search file creation process.
- the frequency of occurrence of the character set type is counted to create a character set column address table (S111, 112), and an area for the search file is secured (S113).
- the character set column directory (character set column heading area), which indicates the address in the search file that stores the character set type group of the character set at character set position number P, is extracted from the character set column address table (S120), and the character set position information is stored at the head of the unstored area of the search file indicated by the character set column directory (S121).
- the process proceeds to the next keyword processing (S124, S125).
- the registration processing is completed (S126).
- the search process has the following configuration, as in the first embodiment.
- the search input character string is decomposed from the first character into a character set consisting of three characters, and a search input character set string is created.
- character set type groups are retrieved from the search file in the order of the rearranged character set string, and combinations of character set position information that can form the search input character set string are extracted from the character set position information stored there.
- the search input character string is decomposed into three-character units from the first character so that it can be compared with the character sets stored in the search file.
- the last character set may be less than three characters, and a character set may not be created. At this time, it extracts the missing characters from the end of the character set immediately before the last character set and concatenates them with the front of the last character set to create a three-character character set.
- the occurrence frequency of each search input character set is checked by referring to the character set group heading area in the character set group address table, which indicates the head address of each character set type group in the search file.
- the character set position information stored in each character set type group column is extracted, starting from the character set with the lowest frequency of occurrence, by referring to the character set group address table. Then, based on the extracted character set position information and in order from the character set type group with the lowest occurrence frequency, combinations of character set position information are extracted for which the record number and the keyword attribute number are the same and the difference in character set position numbers between character set type groups equals the difference between the first character positions of the corresponding character sets in the search input character string.
- This character set position information collation is based on the case where the character set position number with low occurrence frequency is i and the character set position number with high appearance frequency is j in all keywords in the search input character set string.
- the keyword attribute is verified from the character set position number of the character set position information obtained by the character string collation. That is, if the character set position number is 1 to 64, the keyword attribute of the character set position information is the book name; if it is 66 to 97, the author name; if it is 99 to 162, the publisher name; if it is 164 to 167, the year of publication; and if it is 169 to 1168, the abstract. Therefore, only the character set position information whose attribute is the same as that specified in the search input is extracted from the character set position information obtained by character set collation.
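Deriving the keyword attribute from a position in the keyword string amounts to a range lookup, as in this sketch. The ranges for the first four attributes are taken from the text above; the upper bound 1168 for the abstract is an assumption inferred from a 1169-character keyword string, and the names `ATTRIBUTE_RANGES` and `attribute_of` are hypothetical.

```python
ATTRIBUTE_RANGES = [
    (1, 64, "book name"),
    (66, 97, "author name"),
    (99, 162, "publisher name"),
    (164, 167, "publication year"),
    (169, 1168, "abstract"),  # upper bound is an assumption
]


def attribute_of(position):
    """Return the keyword attribute for a position in the keyword string."""
    for first, last, name in ATTRIBUTE_RANGES:
        if first <= position <= last:
            return name
    return None  # a delimiter position or a position outside the string
```

The gaps at positions 65, 98, 163, and 168 correspond to the EM symbols that delimit the keyword attribute areas.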
- the character set position information extracted from the character set group of "Ele" in the search file is collated with that extracted from the character set group of "ctr". In the search input "Electro", the character positions of "E" and "c" are "1" and "4", respectively. Therefore, the character set position information for which the character set position difference is "3" is extracted, and "116901" in "Ele" and "116904" in "ctr" of the search file in FIG. 18 can be extracted as a combination of continuous character set position information.
- the character set position information "116901", "116904", and "116905" can be seen to belong to character sets that are continuous and share the same record number and keyword attribute number. Furthermore, since the keyword attribute is "book name", the character set position information whose character set position number is 1 to 64 is selected from the character set position information remaining after the character set string matching so far, and "116901", "116904", and "116905" can be extracted.
- This search processing operation is shown as a flowchart in FIGS. 20a and 20b.
- search input is extracted, and a search input character set string is created by dividing the character string into three-character units from the beginning of the search input character string.
- the search input character set sequence is rearranged in ascending order of occurrence frequency in all keywords (S136).
- the character set position information stored in the character set type group column corresponding to the rearranged character set is extracted from the search file (S137).
- the character set position number of the character set with the low frequency of occurrence in all keywords in the search input character set string is denoted i, and the character set position number of the character set with the high frequency is denoted j.
- the character set position information is taken out (S138).
- the same process is performed for the remaining character sets in the search input character set string (S139, S140), and from the remaining character set position information only the record numbers whose character set position number lies within the character position range Pa of the keyword attribute number a are taken out.
- the following equation (9) is used to extract the character set position number from the character set position information.
- search processing for other phonetic characters can be performed in the same manner.
- the fifth embodiment stands in the same relationship to the fourth embodiment as the second embodiment does to the first embodiment.
- a search file is created according to the JIS code table using character sets of two characters.
- the search file creation processing and search processing procedure of the fifth embodiment are the same as those of the fourth embodiment except that the number of keyword characters and the setting of the keyword attribute area are different.
- it is effective to use a two-character set search file in the search processing of Japanese documents that use Kana characters and Kanji whose character types are more common than European characters.
- kana characters may be used as the character set search file according to the fifth embodiment
- kanji may be used as the character type group search file for each character according to the sixth embodiment.
- the sixth embodiment has the same relationship as the first embodiment and the third embodiment with respect to the second embodiment.
- character position information is stored in units of one character.
- a search file composed of character type groups is used.
- the sixth embodiment creates character position information in units of one character. The character position information is therefore represented by the code: record number × n + (P_a - 1) + p
- the character position information is configured as shown in FIG. Fig. 25 shows an example in which the character position information of the book name "Correspondence of communication document" is registered in the search file.
- FIGS. 27a and 27b show a flowchart of the search process.
- the procedure of the search file creation process and the search process is basically the same as in the fourth embodiment.
- it differs in that the search file is composed of character type groups in units of one character and is constructed on the basis of the JIS code for Japanese processing.
- the present invention provides a character set consisting of a search unit identifier: a symbol, a character set position order code, and an attribute number indicating the number of search units to which the character set belongs for each character set type of the character string to be searched.
- create a search file that stores this character set position information, search this search file, extract the character set position information for each character set type that composes the input character string, and search for a character string that matches the search input.
- create a search file that stores character position information for each character type, extract the character position information for each character type that constitutes the character string of the search input, and match the search input Search for character strings.
- the present invention has the following excellent effects.
- any character string can be searched because the search processing focuses on character sets and character positions, and there is no need to extract character strings at registration time as in the index method or pre-search method of full-text search processing.
- a high-speed search can be realized only by software without using dedicated hardware, so that a full-text search can be efficiently performed by a general-purpose information processing device, and the versatility is high.
- a character string consisting of characters with few character types, such as European characters, can also be searched by creating a search file that stores character set position information in the character set type group that composes the character string.
- the frequency of occurrence of the same character string is low, so that the frequency of appearance of each character set can be kept low, and search matching can be performed in a character set with a low frequency of appearance, thus enabling high-speed search.
- since the search process only needs to extract the character position information or character set position information of the characters or character sets in the search input character string, the time required to transfer the contents of the search file to main memory is reduced even when that information resides in an external storage device, and the search process can be sped up.
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
This invention relates to a data retrieving device that enables high-speed retrieval and collation of arbitrary character strings in a database, in full-text retrieval mode or in a mode using multiple keywords. A character string to be retrieved, including keywords, is divided into individual characters or into character sets each consisting of a plurality of characters. For every character or character set, character position information is generated, comprising an identification code of the character string unit to be retrieved to which the character or character set belongs, a character position sequence code indicating the character position in the character string, and an attribute code indicating the logical partition of the character string. A retrieval file in which the character position information is grouped by character kind or character set kind is thus prepared beforehand. For a retrieval input, the character position information of the characters or character sets making up the retrieval input is fetched from the retrieval file and collated with the input, and the character strings of the retrieval object that are continuous and whose attribute code coincides with that of the retrieval input are taken out of the retrieval file. The number of character string collations is thereby decreased, and high-speed partial-match retrieval or high-speed full-text retrieval is attained.
Description
Information retrieval processing device
[Technical Field]
The present invention relates to an information retrieval processing method. It is particularly suited to full-text search processing and to partial-match search processing using multiple keywords: it greatly reduces the number of comparisons between the search input and the full text or registered keywords being searched, so that information can be retrieved at high speed. The invention is suited to information retrieval processing that performs full-text search or multi-keyword search in a database system.
[Background Art]
Two full-text search techniques have been in common use. The sequential search method collates the search input character string against the text from beginning to end and selects the documents that match the input character string and search conditions specified by the searcher. The index method extracts keywords from the full text in advance and creates a search file from them. There is also a pre-search method, which tabulates the characters and character strings appearing in the full text and narrows down the documents in which the characters and character strings obtained by decomposing the search input character string appear.
Because the sequential search method collates the search input character string against the full text from beginning to end, searching documents that contain a large amount of text takes a long time. Dedicated processors and LSIs that perform high-speed character string matching have therefore been proposed for searching large document collections. These approaches, however, restrict the hardware that can be used, and transferring character strings between the computer that performs the search and the dedicated processor or LSI takes time, so achieving a speed that is satisfactory for the system as a whole remains a problem. The index method can speed up search by arbitrary character strings, but has the drawback that the search file becomes huge; as a result, search by arbitrary character strings is not fully supported. The pre-search method requires a dedicated processing mechanism and dedicated character string matching hardware to achieve high speed, and improving the accuracy of the character strings extracted at registration is also an issue.
Two partial-match search techniques using multiple keywords are in common use. The sequential search method scans the keywords for those that contain the input character string specified by the searcher and satisfy the search conditions, and obtains their record numbers. The index method creates from the keywords the character strings that may be entered as search input, stores them in a search file in index form, and uses the index structure of the search file to find the keywords that match the input character string and search conditions specified by the searcher.
However, sequential search in multi-keyword search is as slow as sequential search in full-text search. If dedicated hardware is used, the choice of hardware is restricted, and transferring character strings between the computer that performs the search and the dedicated processor or LSI takes time. Achieving a speed that is satisfactory for the system as a whole therefore remains a problem.
The index method in multi-keyword search can speed up partial-match search, but has the drawback that the search file becomes huge. For this reason exact-match, prefix-match, and suffix-match searches are used, while intermediate (infix) match is often not supported. The main reasons are that intermediate matching requires a large number of indexes in addition to those for exact, prefix, and suffix matching, so that the storage capacity of the search file becomes huge; that search time increases accordingly; and that maintaining the search file is not easy. Some systems do not even support all prefix and suffix matches of the keywords because of limits on the size of the search file. Searchers, however, often remember a distinctive character or character string within a keyword, so support for partial-match search including intermediate match is needed for searches to proceed smoothly.
The inventor noted that the same character or character string appears with low frequency in full text and in words that can serve as keywords. He found that search can be sped up as follows. Taking each character of the search target character string or keyword in turn from the first character, a character set is formed from that character and the characters that follow it, r characters in total, and a search file is created from character set groups grouped by character set type; alternatively, a search file is created from character groups grouped by individual character. At search time, the continuity of the character sets or characters is collated within the search file.
In view of the above, an object of the present invention is to provide a highly versatile information retrieval processing method that speeds up full-text search over large document collections and partial-match search using multiple keywords, is not tied to specific hardware, requires no transfer of character strings to a dedicated processor or LSI because the search is performed in main memory, and allows search by arbitrary character strings by focusing on character sets and character set positions, or on characters and character positions.
[Disclosure of the Invention]
The first feature of the present invention comprises: search unit identification code assigning means that divides the character string to be searched into search units, which are the units on which search is performed, and assigns an ascending code to each search unit; attribute code assigning means that assigns to each search unit an attribute code indicating its logical division; character set position order code assigning means that takes the characters of the character string to be searched one at a time, forms a character set from each character and the characters that follow it, r characters in total, and assigns a character set position order code indicating the position within the search unit of the first character of the character set; and means that creates character set position information consisting of the search unit identification code, the character set position order code, and the attribute code, and stores this information in an area for each character set type to create a search file.
The character set position information is preferably given as the numeric code
{(search unit identification code × n) + character set position order code} × a + attribute code
where n is the maximum number of characters in a search unit and a is the maximum number of attributes.
In this way, a search file for full-text search can be created from the position information of character sets consisting of multiple characters.
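The numeric code above can be illustrated with a minimal Python sketch. The function names `encode` and `decode` are hypothetical, and the sketch assumes zero-based position order codes (0 to n - 1) and zero-based attribute codes (0 to a - 1); the source does not state which base is used.

```python
def encode(unit_id, position, attribute, n, a):
    """Pack (search unit id, position order code, attribute code) into one integer.

    Assumption: position is in 0..n-1 and attribute is in 0..a-1.
    """
    if not 0 <= position < n:
        raise ValueError("position order code out of range")
    if not 0 <= attribute < a:
        raise ValueError("attribute code out of range")
    return (unit_id * n + position) * a + attribute


def decode(code, n, a):
    """Recover (search unit id, position order code, attribute code) from a code."""
    rest, attribute = divmod(code, a)
    unit_id, position = divmod(rest, n)
    return unit_id, position, attribute
```

Under these assumptions, two codes from the same search unit with the same attribute differ by exactly (difference in position) × a, which is the property the search step relies on.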
The second feature of the present invention comprises: the search file created according to the first feature; means that decomposes the characters of the search input character string, from the first character, into character sets of r characters to create a search input character set string, and retrieves from the search file the character set position information stored under the same character set type as each of these character sets; means that extracts, from the retrieved character set position information, those combinations in which the search unit identification code is common, the difference between the character set position order codes equals the difference between the first-character positions of the corresponding character sets in the search input character string, and the attribute code equals that of the search input; and means that, based on the extracted combinations, outputs as the search result the search unit to which the character set string belongs and the character position, counted from the first character of the search unit, of each character of each character set.
When the search input character string is decomposed from its first character into character sets of r characters, the last character set may contain (r - 1) or fewer characters, so that a character set of r characters cannot be formed. In that case it is preferable to take the missing number of characters from the end of the immediately preceding character set and prepend them to the last character set to form a character set of r characters.
It is also preferable to extract the combinations of character set position information that can form the same character set string as the search input character set string starting from the character set of the search input with the lowest occurrence frequency.
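The decomposition of the search input, including the handling of a short final character set, might look like the following sketch. The name `split_query` is hypothetical, offsets are zero-based by assumption, and rejecting inputs shorter than r is a choice made here, not something the source specifies.

```python
def split_query(query, r):
    """Return (offset, character_set) pairs covering the query in r-character units.

    If the final unit would be shorter than r, it is extended backwards with
    characters from the end of the preceding unit, so it overlaps that unit.
    """
    if len(query) < r:
        raise ValueError("query is shorter than the character set size r")
    sets = [(i, query[i:i + r]) for i in range(0, len(query) - r + 1, r)]
    if len(query) % r:
        start = len(query) - r  # borrow the missing characters from the previous set
        sets.append((start, query[start:]))
    return sets
```

For a seven-character input with r = 3, the last set starts at offset 4 rather than 6, so each set still carries a valid first-character offset for the position-difference check.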
In extracting the combinations of character set position information that can form the same character set string as the search input character set string, let i be the character set position order code of a character set with a low occurrence frequency in the full text and j that of a character set with a high occurrence frequency. It is then preferable to extract the combinations that satisfy
(character set position information of the character set with order code i) - (character set position information of the character set with order code j) = (i - j) × (maximum number of attributes).
When the character string to be searched is a European-language string that includes symbols, it is preferable to use character sets of at least three characters and a search file composed of character set type groups containing only European characters, symbols included.
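The difference condition above can be sketched in Python as follows. This is only an illustration under stated assumptions: `build_index` and `search` are hypothetical names, position and attribute codes are zero-based, and the explicit same-unit check is added here so that a match cannot straddle two search units.

```python
from collections import defaultdict


def build_index(units, r, n, a):
    """units: list of (text, attribute_code). Returns {character_set: [codes]}."""
    index = defaultdict(list)
    for unit_id, (text, attribute) in enumerate(units):
        for pos in range(len(text) - r + 1):
            index[text[pos:pos + r]].append((unit_id * n + pos) * a + attribute)
    return index


def search(index, query_sets, attribute, n, a):
    """query_sets: list of (offset, character_set) pairs from the search input.

    Returns sorted (unit_id, start_position) pairs of matches.
    """
    # Start from the character set with the fewest postings (lowest frequency).
    ordered = sorted(query_sets, key=lambda s: len(index.get(s[1], ())))
    i, rarest = ordered[0]
    candidates = [c for c in index.get(rarest, ()) if c % a == attribute]
    for j, char_set in ordered[1:]:
        postings = set(index.get(char_set, ()))
        kept = []
        for c in candidates:
            partner = c - (i - j) * a  # code_i - code_j == (i - j) * a
            if partner in postings and partner // (n * a) == c // (n * a):
                kept.append(c)
        candidates = kept
    return sorted({(c // (n * a), (c // a) % n - i) for c in candidates})
```

Because the attribute code occupies the low-order part of the code, requiring the exact difference (i - j) × a also forces both character sets to carry the same attribute code.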
When the character string to be searched is a Japanese string that includes kanji, it is preferable to use a search file composed of character set type groups of at least two characters.
The third feature of the present invention creates a search file in which character position information is stored for each character type. It comprises: search unit identification code assigning means that divides the character string to be searched into search units, which are the units on which search is performed, and assigns an ascending code to each search unit; attribute code assigning means that assigns to each search unit an attribute code indicating its logical division; character position order code assigning means that assigns to each character of the character string to be searched a character position order code indicating its position within the search unit; and means that creates character position information consisting of the search unit identification code, the character position order code, and the attribute code, and stores this information in an area for each character type to create a search file.
This character position information is preferably given as the numeric code
{(search unit identification code × n) + character position order code} × a + attribute code
n: maximum number of characters in a search unit
a: maximum number of attributes.
The fourth feature of the present invention performs search processing using the search file created according to the third feature. It comprises: the search file created according to the third feature; means that retrieves from the search file the character position information of the same characters as those making up the search input character string; means that extracts, from the retrieved character position information, those combinations in which the search unit identification code is common, the character position order codes follow the same order as the characters of the search input, and the attribute code equals that of the search input; and means that, based on the extracted combinations, outputs as the search result the search unit to which the character string belongs and the character positions.
The combinations of character position information are preferably extracted starting from the search input character with the lowest occurrence frequency in the full text.
In extracting the combinations of character position information that can form the same character string as the search input, let i be the character position order code of a character with a low occurrence frequency in the full text and j that of a character with a high occurrence frequency. It is then preferable to extract the combinations that satisfy
(character position information of the character with order code i) - (character position information of the character with order code j) = (i - j) × (maximum number of attributes).
The fifth feature of the present invention concerns multi-keyword search. It comprises: record identification code assigning means that assigns an ascending code to each record to be searched; keyword attribute code assigning means that assigns to each keyword of the record an attribute code indicating the logical division of the keyword; character set position order code assigning means that takes the characters of the keyword one at a time, forms a character set from each character and the characters that follow it, r characters in total, and assigns a character set position order code indicating the position within the keyword of the first character of the character set; and means that creates character set position information consisting of the record identification code, the keyword attribute code, and the character set position order code, and stores this information in an area for each character set type to create a search file.
The character set position information is created for a keyword string, formed by arranging each keyword of the record in the keyword attribute area corresponding to its keyword attribute code, by converting every character set of each keyword into an integer code from the record identification code, the keyword attribute code, and the character set position order code. It is preferably given as the numeric code
record identification code × n + (P_a - 1) + character set position order code
n: number of characters in the keyword string
P_a: starting position, in the keyword string, of the keyword attribute area for keyword attribute code a.
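As an illustration of this keyword-string code, the sketch below lays two keyword attribute areas out in a 30-character keyword string. The layout in `AREAS`, the value of `N`, and both function names are hypothetical, and P_a and the order code are assumed to be one-based.

```python
# Hypothetical layout: attribute code -> (P_a, last column), one-based columns.
AREAS = {1: (1, 20), 2: (21, 30)}
N = 30  # number of characters in the keyword string


def encode_keyword_position(record_id, attribute, order):
    """record_id * N + (P_a - 1) + order, with a one-based order code."""
    start, end = AREAS[attribute]
    column = (start - 1) + order
    if not start <= column <= end:
        raise ValueError("order code falls outside the keyword attribute area")
    return record_id * N + column


def decode_keyword_position(code):
    """Recover (record_id, attribute, order) from a code."""
    record_id, column0 = divmod(code - 1, N)
    column = column0 + 1
    for attribute, (start, end) in AREAS.items():
        if start <= column <= end:
            return record_id, attribute, column - start + 1
    raise ValueError("column lies in no keyword attribute area")
```

With this layout the attribute is not stored separately: it follows from which area the column falls in, so a search restricted to one attribute only has to keep codes whose column lies in that area's range.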
The sixth feature of the present invention concerns search processing of the search file created according to the fifth feature. It comprises: the search file created according to the fifth feature; means that decomposes the characters of the search input character string, from the first character, into character sets of r characters to create a search input character set string, and retrieves from the search file the character set position information of the same character sets; means that extracts, from the retrieved character set position information, those combinations in which the record identification code and keyword attribute code are common, the difference between the character set position order codes equals the difference between the first-character positions of the corresponding character sets in the search input character string, and the keyword attribute code equals that of the search input; and means that, based on the extracted combinations, outputs as the search result the record identification code corresponding to the search input character string.
In extracting the character set position information that can form the same character set string as the search input character set string, let i be the character set position order code of a character set with a low occurrence frequency in all keywords and j that of a character set with a high occurrence frequency. It is then preferable to extract the combinations that satisfy
(character set position information of the character set with order code i) - (character set position information of the character set with order code j) = i - j.
When the keywords are European-language strings that include symbols, it is preferable to use character sets of at least three characters or symbols and a search file composed of character set type groups containing only European characters, symbols included.
When the keywords contain kanji, a search file may be used that holds character position information in units of one character for kanji and character set position information in units of two characters for kana.
The seventh feature of the present invention uses character position information in units of one character for multi-keyword search. It comprises: record identification code assigning means that assigns an ascending code to each record to be searched; keyword attribute code assigning means that assigns to each keyword of the record an attribute code indicating the logical division of the keyword; character position order code assigning means that decomposes the keyword into its characters and assigns to each character a character position order code indicating its position within the keyword; and means that creates character position information consisting of the record identification code, the keyword attribute code, and the character position order code, and stores this information in an area for each character type to create a search file.
The character position information is created for a keyword string, formed by arranging each keyword of the record in the keyword attribute area corresponding to its keyword attribute code, by converting every character of each keyword into an integer code from the record identification code, the keyword attribute code, and the character position order code. It is preferably given as the numeric code
record identification code × n + (P_a - 1) + character position order code
n: number of characters in the keyword string
P_a: starting position, in the keyword string, of the keyword attribute area for keyword attribute code a.
The eighth feature of the present invention concerns search processing of the search file created according to the seventh feature. It comprises: the search file created according to the seventh feature; means that retrieves from the search file the character position information of the same characters as those making up the search input character string; means that extracts, from the retrieved character position information, those combinations in which the record identification code and keyword attribute code are common, the character position order codes follow the same order as the characters of the search input, and the keyword attribute code equals that of the search input; and means that, based on the extracted combinations, outputs as the search result the record identification code corresponding to the search input character string.
In extracting the combinations of character position information that can form the same character string as the search input character string, let i be the character position order code of a character with a low occurrence frequency in all keywords and j that of a character with a high occurrence frequency. It is then preferable to extract the combinations that satisfy
(character position information of the character with order code i) - (character position information of the character with order code j) = i - j.
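A single-character keyword search along these lines might be sketched as follows. Everything named here (`AREAS`, `N`, `build_keyword_index`, `search_keywords`) is hypothetical; the sketch assumes one keyword per attribute area, one-based columns, and the code record identification code × n + (P_a - 1) + order code described above.

```python
from collections import defaultdict

N = 30                              # keyword string length (assumed)
AREAS = {1: (1, 20), 2: (21, 30)}   # attribute code -> (P_a, last column), assumed


def build_keyword_index(records):
    """records: {record_id: {attribute: keyword}} -> {character: [codes]}."""
    index = defaultdict(list)
    for record_id, keywords in records.items():
        for attribute, keyword in keywords.items():
            start, end = AREAS[attribute]
            if start - 1 + len(keyword) > end:
                raise ValueError("keyword does not fit its attribute area")
            for order, ch in enumerate(keyword, start=1):
                index[ch].append(record_id * N + (start - 1) + order)
    return index


def search_keywords(index, query, attribute):
    """Return sorted record ids whose keyword in the given area contains query."""
    if not query:
        return []
    start, end = AREAS[attribute]
    # Process the query characters from rarest to most frequent.
    chars = sorted(enumerate(query, start=1), key=lambda p: len(index.get(p[1], ())))
    i, rarest = chars[0]

    def column(code):
        return (code - 1) % N + 1

    # Keep only codes for which the whole query fits inside the attribute area.
    candidates = [c for c in index.get(rarest, ())
                  if start <= column(c) - (i - 1) and column(c) - i + len(query) <= end]
    for j, ch in chars[1:]:
        postings = set(index.get(ch, ()))
        candidates = [c for c in candidates if c - (i - j) in postings]  # diff == i - j
    return sorted({(c - 1) // N for c in candidates})
```

Since the match is confined to one attribute area, the plain difference i - j is sufficient; no multiplication by the number of attributes is needed, in contrast to the full-text code.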
The principle of the present invention will now be described.
The same character string recurs only rarely within a document. For example, the explanatory text for the headwords of Kojien (a Japanese dictionary published by Iwanami Shoten) contains about 9 million characters, and each kana character occurs in it about 53,200 times on average, which is high. A string of two kana characters, however, occurs only 472 times on average. Therefore, if two kana characters are treated as one character set and the search input has n characters, the matching targets extracted from the full text amount on average to (n / 2) × 472 items of character set position information. Kanji have more distinct characters than kana, so a string of two kanji occurs even less often than a two-kana string, and the matching targets extracted from the full text are correspondingly fewer.
Even taking kanji one character at a time, the JIS level 1 kanji occur on average 1,155 times each in the Kojien explanatory text mentioned above. Therefore, for the 2,965 JIS level 1 kanji, when the search input has n characters, the matching targets extracted from the explanatory text amount on average to n × 1,155 characters.
Because a search input is generally a few dozen characters or fewer, even a character string containing frequently occurring characters requires far fewer comparisons than sequentially matching every character of the text.
For example, although many terms use the two-character string 「通信」 (communication), in strings of the form 「通信‥」, such as 「通信回線」 (communication line) and 「通信装置」 (communication device), the same characters recur less often after 「通信」. As a result, once the strings 「回線」 (line) or 「装置」 (device) that follow 「通信」 are matched, the search targets narrow down rapidly. In this way, as the constituent character sets of the search input character set string are matched against the full text or the registered keywords, character set strings that differ from the search input character set string are removed from the candidates obtained so far, and the search targets are narrowed down with each constituent character set that is matched. In particular, matching the character sets of the search input in ascending order of their occurrence frequency in the full text, or across all keywords, narrows the targets further and reduces the number of match operations.
Therefore, a search file is created that stores, for each character set type, character set position information indicating where each constituent character set is located in the character string to be searched (the full text or the registered keywords). Matching the search input character set string against this search file greatly reduces the number of match operations in character string search.
Furthermore, for characters with a low occurrence frequency, such as kanji, the search file may be created without forming character sets, by storing position information character by character in an area for each character type. Matching the search input character string against such a search file likewise greatly reduces the number of match operations.
The search file is created as follows. The explanation uses the example of character sets for full-text search processing.
First, the character string to be searched is divided into search units. If the character string is, for example, a book or a paper, it is composed, in order, of a table of contents, a preface, titles of chapters or sections, body text, titles of figures or tables, and references. Because these parts are logically distinct, each can serve as a search unit. The book or paper is therefore divided logically into search units, and identification codes are assigned to the search units in ascending order of appearance. The body text may be split into several search units, each of which receives an identification code in the same series as the other search units. In addition, because each search unit has a logical type, such as table of contents, preface, title, or body text, that logical type is treated as an attribute, and an attribute code indicating it is assigned to the search unit.
Next, characters are taken from the character string one at a time starting from the first character, and each character together with the characters that follow it, r characters in total, forms a character set. For each character set, character set position information is created, consisting of the search unit identification code, a character set position order code indicating the position of the first character of the character set, and the attribute code of the search unit. This information is stored in an area provided for each character set type, yielding a search file that holds the character string to be searched organized by character set type.

The search file thus has a file structure in which character set position information is stored for each character set type.
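The file structure described above can be sketched in Python as a mapping from each character set to its list of position entries. This is only an illustration under assumptions: the function name `build_search_file`, the tuple layout `(unit id, position, attribute)`, and the choice to skip trailing windows shorter than r characters are not taken from the specification.

```python
from collections import defaultdict

def build_search_file(units, r=2):
    """Build a mapping: character set -> list of (unit_id, position, attribute).

    units: list of (text, attribute_code) pairs, in order of appearance.
    Unit ids are assigned 1, 2, 3, ... and positions are 1-based.
    Assumption: windows shorter than r at the end of a unit are skipped.
    """
    search_file = defaultdict(list)
    for unit_id, (text, attribute) in enumerate(units, start=1):
        # Slide an r-character window one character at a time.
        for pos in range(1, len(text) - r + 2):
            char_set = text[pos - 1:pos - 1 + r]
            search_file[char_set].append((unit_id, pos, attribute))
    return search_file
```

Because units are processed in order, the entries within each character set are naturally sorted by unit and then by position.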
In search processing, the search input is decomposed from its first character into character sets of r characters each to form the search input character set string. The character set position information of the character sets identical to the decomposed ones is fetched from the search file, and the combinations of character set position information are matched and extracted in which the search unit identification code is common, the difference between the character set position order codes equals the difference between the first-character positions of the corresponding character sets in the search input, and the attribute codes are equal. When the search input is decomposed into r-character sets from its first character, the last character set may contain (r − 1) characters or fewer, so that an r-character set cannot be formed. In that case, the missing number of characters is taken from the tail of the immediately preceding character set and prepended to the last character set to make up an r-character set.
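The decomposition of the search input, including the handling of a short final character set, might look like the following sketch. The function name `split_query` and the decision to reject inputs shorter than r characters are assumptions made for illustration; the text does not say how such inputs are treated.

```python
def split_query(query, r):
    """Split a query into r-character sets, returning (offset, char_set) pairs.

    Offsets are 0-based positions of each set's first character in the query.
    If the last set would be shorter than r, it is completed with characters
    borrowed from the tail of the preceding set, so it overlaps that set.
    """
    if len(query) < r:
        # Assumption: queries shorter than one character set are rejected.
        raise ValueError("query is shorter than one character set")
    sets = [(start, query[start:start + r])
            for start in range(0, len(query) - r + 1, r)]
    if len(query) % r:
        start = len(query) - r  # back up so the final set has r characters
        sets.append((start, query[start:]))
    return sets
```

For an eight-character input with r = 3, the final set starts at offset 5 and shares one character with the set before it; the recorded offsets keep the position-difference check consistent.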
This matching checks that the character set strings of the search input and the search file agree both in continuity and in attribute. It is carried out by extracting, from the character set position information in the search file, the combinations of character sets in which the search unit identification code is common, the difference between the character set position order codes equals the difference between the first-character positions of the corresponding character sets in the search input, and the attribute code is the same as that of the search input.
As a result, the whole search file need not be scanned; only the character set position information of the character sets that also appear in the search input has to be matched, so the number of comparisons is far smaller than with sequential matching. Moreover, because the same character string generally recurs rarely, the search targets are narrowed down each time an r-character set is matched, and the number of comparisons keeps falling.
Furthermore, when the character set position information fetched from the search file is matched, starting with the character sets of the search input that occur least frequently in the full text narrows the search targets further and reduces the number of match operations even more.
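A minimal sketch of this matching, processing the rarest character set first, is given below. It is illustrative only: the names `build_index` and `search`, the tuple layout, and the use of set intersection to test the position-difference condition are assumptions, not the method of the embodiments.

```python
from collections import defaultdict

def build_index(units, r):
    """units: list of (text, attribute); returns char_set -> [(unit, pos, attr)]."""
    index = defaultdict(list)
    for unit_id, (text, attr) in enumerate(units, start=1):
        for i in range(len(text) - r + 1):
            index[text[i:i + r]].append((unit_id, i + 1, attr))
    return index

def search(index, query, attr, r):
    """Return sorted (unit_id, start_position) pairs where query occurs."""
    if len(query) < r:
        raise ValueError("query is shorter than one character set")
    starts = list(range(0, len(query) - r + 1, r))
    if len(query) % r:
        starts.append(len(query) - r)  # final set borrows from the previous one
    parts = [(s, query[s:s + r]) for s in starts]
    # Rarest character set first, so the candidate set shrinks quickly.
    parts.sort(key=lambda part: len(index.get(part[1], ())))
    candidates = None
    for offset, char_set in parts:
        # Each entry implies where the whole query would have to start.
        implied = {(unit, pos - offset)
                   for unit, pos, a in index.get(char_set, ()) if a == attr}
        candidates = implied if candidates is None else candidates & implied
        if not candidates:
            break
    return sorted(candidates)
```

Two entries imply the same start position exactly when they share a search unit and their position difference equals the offset difference of their character sets in the query, which is the condition stated in the text.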
When a character string matching the search input has been found in this way, the search unit to be extracted is obtained from its search unit identification code, together with the character positions, counted from the first character of the search unit, of the characters making up the character sets, and these are output to the searcher as the search result.

When a search file organized by character type is used for full-text search, the search file is created by storing each character of the full text in the area for its character type. The search input character string is decomposed into individual characters, the character position information of each character is fetched from the search file, and the combinations of character position information are extracted in which the search unit identification code is common, the order equals that of the search input character string, and the attribute code is the same as that of the search input. The search unit and the character positions are then output as the search result.
In the case of multi-keyword search, records that have keywords are given record identification codes in ascending order of registration. Each keyword is given a keyword attribute code, which indicates the logical type of the keyword as its attribute, and a character position order code or character set position order code within the keyword. Character position information or character set position information is created from these three codes and stored in the area for each character type or each character set to create the search file.
In multi-keyword search processing, one or more pairs of a search input character string and a search input character string attribute are input. Each search input character string is decomposed into single characters or into character sets, and the character position information of the characters, or the character set position information of the character sets, that make up the search input is fetched from the search file. The combinations of character position information or character set position information are then matched and extracted in which the record identification code is common and the character position order codes or character set position order codes and the keyword attribute code agree with the search input. The record identification code is taken from the extracted combinations as the search result.
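The single-character variant of multi-keyword search can be sketched as follows. The names `build_keyword_file` and `find_records` are hypothetical, and the sketch assumes that each record holds at most one keyword per attribute code; how several input pairs are combined (for example with AND or OR) is not stated in this passage and is left to the caller.

```python
from collections import defaultdict

def build_keyword_file(records):
    """records: list of dicts {attribute_code: keyword}; record ids are 1, 2, ...

    Returns a mapping: character -> [(record_id, attribute_code, position)].
    """
    search_file = defaultdict(list)
    for record_id, keywords in enumerate(records, start=1):
        for attr, keyword in keywords.items():
            for pos, ch in enumerate(keyword, start=1):
                search_file[ch].append((record_id, attr, pos))
    return search_file

def find_records(search_file, text, attr):
    """Return record ids whose keyword with attribute attr contains text."""
    if not text:
        return []
    starts = None
    for offset, ch in enumerate(text):
        # Keep entries with the requested attribute, normalised to the
        # position where the whole input would have to start.
        implied = {(rec, pos - offset)
                   for rec, a, pos in search_file.get(ch, ()) if a == attr}
        starts = implied if starts is None else starts & implied
        if not starts:
            return []
    return sorted({rec for rec, _ in starts})
```

The result is a partial match: a keyword is reported whenever the input occurs anywhere inside it.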
[Brief Description of the Drawings]
FIG. 1 shows a configuration example of the information search processing device used in the embodiments of the present invention.
FIG. 2 shows an example of the search file of the first embodiment.
FIG. 3 lists the combinations of second and third characters in each character set group of the first embodiment.
FIG. 4 shows the character set group address table of the first embodiment.
FIG. 5 shows a registration example of the search file of the first embodiment.
FIG. 6 is a flowchart of the search file creation procedure of the first embodiment.
FIG. 7 is a flowchart of the search procedure of the first embodiment.
FIG. 8 shows the search file of the second embodiment.
FIG. 9 lists the character set groups of the second embodiment.
FIG. 10 shows the character set group address table of the second embodiment.
FIG. 11 shows a registration example of the search file of the second embodiment.
FIG. 12 shows the character column address table of the third embodiment.
FIG. 13 shows a registration example of the search file of the third embodiment.
FIGS. 14a and 14b are flowcharts of the search file creation procedure of the third embodiment.
FIG. 15 is a flowchart of the search procedure of the third embodiment.
FIG. 16 shows an example of a keyword string of the fourth embodiment.
FIG. 17 shows an example of creating character set position information in the fourth embodiment.
FIG. 18 shows a registration example of the search file of the fourth embodiment.
FIGS. 19a and 19b are flowcharts of the search file creation procedure of the fourth embodiment.
FIGS. 20a and 20b are flowcharts of the search procedure of the fourth embodiment.
FIG. 21 shows an example of a keyword string of the fifth embodiment.
FIG. 22 shows an example of creating character set position information in the fifth embodiment.
FIG. 23 shows a registration example of the search file of the fifth embodiment.
FIG. 24 shows an example of creating character position information in the sixth embodiment.
FIG. 25 shows a registration example of the search file of the sixth embodiment.
FIGS. 26a and 26b are flowcharts of the search file creation procedure of the sixth embodiment.
FIGS. 27a and 27b are flowcharts of the search procedure of the sixth embodiment.
[Best Mode for Carrying Out the Invention]
FIG. 1 shows the configuration of the information search processing device in the embodiments of the present invention. The device comprises a CPU 1 that performs arithmetic and decision processing; a memory 2 that stores the programs for search processing, search file creation and the like, the search file that has been created or is used for search processing, the search input, and so on; an input/output unit 3 to which a keyboard 4 and a display 5 are connected; an external storage device control unit 6 to which an external storage device 7 holding various information is connected; and a common bus 8 that connects the CPU 1, the memory 2, the input/output unit 3, and the external storage device control unit 6.
Next, the information search processing of the first embodiment is described. The first embodiment is intended in particular for full-text search of documents written in Western characters.
The information search processing of this embodiment is divided into two parts. The first is search file creation processing: characters are taken one at a time from the first character of the character string to be searched, each character and the characters that follow it, three characters in total, form a character set, and a search file is created that consists of character set groups, one for each character set type. The second is search processing, which matches the search input against the search file and extracts the character strings that match the search input.
First, the search file creation processing is described.
The search file creation processing can be divided roughly into three steps: ① securing the search file area, ② assigning character set position information to each character set, and ③ storing the character set position information, grouped by character set type, in the search file. Each step is described in turn.
① Securing the search file area
As shown in FIG. 2, the search file consists of character set groups arranged in the character order of ASCII codes "20" through "7F" of the ASCII code table. Each character set group consists of three-character character sets whose first character is the character that names the group in FIG. 2. The second and third characters of the character sets in each group are characters listed in the ASCII code table, as shown in FIG. 3. For example, the A character set group consists of the character sets "AA ", "AA!", ..., "AA}", "AA~". Characters are therefore taken from the full text one at a time starting from the first character, each character and the characters that follow it are combined into a three-character character set, and the occurrences of each character set type are counted. This gives the number of items of character set position information to be registered in each character set type group of the search file, so the area for the search file, which is made up of all the character set type groups, can be secured. At the same time, the number of items registered in each group also gives the head address of each character set type group, since the groups are stored contiguously in the search file. The character set group address table of FIG. 4 lists these head addresses in the order in which the character sets are given in FIGS. 2 and 3.
② Assigning character set position information to each character set
The character set position information described here is created from three items: a search unit number, which indicates the order in which the search unit containing the character set appears; a character set position number, which gives the position at which the character set appears in the search unit, expressed as the position of its first character; and an attribute number, which indicates the logical type of the search unit.
First, search units and their attributes are described. A typical book, for example, consists of parts such as a table of contents, a preface, chapter or section titles, body text, figure or table titles, and references, which appear roughly in this order. When the contents of the book are searched, it is convenient to treat these parts as the search units and to output the matching search units as the search result, and doing so often suits the purpose of the search. This is because, in practice, searches frequently specify only the titles or only the body text as the search target, depending on the purpose.
Therefore, when a book is searched as a full-text search target, it is preferable to output the search results divided according to the logical parts that make up the book. Because a search unit represents a logical classification of the character string to be searched, each search unit is given an attribute number according to its logical category. For example, the attribute number "1" is assigned to the table of contents, "2" to the preface, "3" to chapter or section titles, "4" to figure or table titles, "5" to the body text, and "6" to the references.
Numbers are then assigned to the search units in ascending order from 1, in the order in which they appear in the book. These are the search unit numbers. If the body text is long, it may be divided at suitable points into several search units, and search unit numbers are then assigned in the order in which each search unit appears.
Next, for each search unit, characters are taken one at a time from the beginning of the search unit, each character and the characters that follow it are combined into a three-character character set, and the character sets are numbered 1, 2, 3, ... in ascending order of creation; these are the character set position numbers. Two special characters EM (end mark), which indicate the end, are appended after the last character of the search unit, and the final characters are concatenated with these EM characters to form character sets, which are also given character set position numbers. The EM character is the ASCII code "7F", which is "DEL" in the ASCII code table.

From the search unit number, character set position number, and attribute number assigned in this way, each character set of the search unit is converted into an integer code to create the character set position information.
When the maximum number of characters in a search unit is n and the maximum number of attributes is a, the character set position information is the numeric code given by

character set position information code = (search unit number × n + character set position number) × a + attribute number ... (1)
For example, let the maximum number of characters in a search unit be n = 10000 and the maximum number of attributes be a = 10, and suppose that the character string "document" appears at character positions 121 to 130 from the beginning of the eighth search unit, which is body text (attribute number 5). The string "document" is decomposed into the character sets "doc", "ocu", "cum", "ume", "men", "ent", "nt ", and "t ", which are given the character set position information "801215", "801225", "801235", "801245", "801255", "801265", "801275", and "801285" respectively.
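The numeric code of formula (1) and the end-mark padding can be checked with a short sketch. The helper names `position_codes` and `decode` are hypothetical, and the sketch assumes that position numbers stay below n and attribute numbers below a so that decoding is unambiguous.

```python
EM = "\x7f"  # end mark: ASCII DEL (code 7F), as stated in the text

def position_codes(unit_number, text, attribute, n=10000, a=10, r=3):
    """Return [(char_set, code)] for every character position of one unit.

    The unit text is padded with r - 1 end marks so that the last
    characters also form r-character sets.
    """
    padded = text + EM * (r - 1)
    codes = []
    for pos in range(1, len(text) + 1):
        char_set = padded[pos - 1:pos - 1 + r]
        code = (unit_number * n + pos) * a + attribute  # formula (1)
        codes.append((char_set, code))
    return codes

def decode(code, n=10000, a=10):
    """Invert formula (1): code -> (unit number, position number, attribute)."""
    rest, attribute = divmod(code, a)
    unit_number, pos = divmod(rest, n)
    return unit_number, pos, attribute
```

With n = 10000 and a = 10, the character set "doc" at position 121 of search unit 8 with attribute 5 receives the code 801215, matching the example in the text.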
If the character set position information is held as a 4-byte code in this way, about 43,000 search units (2^32 ÷ (n × a)) of up to 10,000 characters each can be handled.
③ Registering the character set position information in the search file
Next, the character set position information assigned to each character set is registered in the search file.
As described above, the character set type groups are stored in the search file in the order given in FIGS. 2 and 3, and the character set position information is registered in each group. Registration is performed by storing each item of character set position information at the head of the unstored area of the corresponding character set type group. Consequently, if search units are registered in order, the character set position information within each character set type group is stored in ascending numerical order.
FIG. 5 shows an example in which the character set position information of "document" described above has been registered in the search file. The character set position information within each group is stored in ascending order. When each item of character set position information occupies 4 bytes, the file capacity is as follows.
4 bytes × Σ (number of characters in search unit i), where the sum starts at i = 0 and runs over all search units.

Character set position information is added by appending the new items at the head of the unstored area of the group corresponding to each character set of the added document. Deletion is performed by changing the items of character set position information in the groups corresponding to each character set of the deleted document to a special symbol (here the ASCII code "0000"). Addition and deletion can therefore both be carried out in a short time.
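Addition and deletion as described here can be sketched as an append and an in-place overwrite. The sketch makes several assumptions: the search file is modeled as a dictionary of lists rather than a contiguous area, the special symbol "0000" is represented by the integer 0, and the helper names are hypothetical.

```python
EM = "\x7f"   # end mark (ASCII DEL)
DELETED = 0   # assumption: the special symbol "0000" is modeled as integer 0

def codes_for_unit(unit, text, attr, n=10000, a=10, r=3):
    """Yield (char_set, position code) for every character of one search unit."""
    padded = text + EM * (r - 1)
    for pos in range(1, len(text) + 1):
        yield padded[pos - 1:pos - 1 + r], (unit * n + pos) * a + attr

def add_unit(search_file, unit, text, attr):
    """Append each code at the end of its group (the head of the free area)."""
    for char_set, code in codes_for_unit(unit, text, attr):
        search_file.setdefault(char_set, []).append(code)

def delete_unit(search_file, unit, text, attr):
    """Overwrite each code of the deleted unit with the special symbol."""
    for char_set, code in codes_for_unit(unit, text, attr):
        group = search_file[char_set]
        group[group.index(code)] = DELETED
```

Neither operation moves existing entries, which is why both are quick; a search over such a file has to skip entries equal to the special symbol.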
As noted above, the character set position information stored in each character set type group of the search file can be fetched by using the head address of each character set group in the character set group address table of FIG. 4 as a directory.
FIG. 6 shows the flow of the search file creation processing described above.
That is, the occurrences of each character set are counted and the character set group address table is created (S11, S12), and the area for the search file is secured (S13). Next, the search unit registration order counter is initialized to k = 1, so that the search unit number is "1", the maximum number of characters per search unit is set to n = 10000, and the maximum number of attributes is set to a = 10 (S14). The first search unit is then fetched (S15). This completes the preprocessing for registration. Registration then proceeds search unit by search unit. First, the character set position number is set to p = 1, and the number of characters m of the search unit being registered and its attribute number a_i are set (S16). Next, in order from the first character of the search unit, the character set position information for character set position number p is created by the formula

D = (k × 10000 + p) × 10 + a_i ... (2)

(S17). The head address of the character set group in the search file that holds the same character set type as the character set at position number p is fetched from the character set group address table (S18), and the character set position information is stored in the first row of the unstored area of the character set group that this address indicates (S19). Then p = p + 1 and m = m − 1 are set, and once all the character sets in the search unit have been processed, processing moves on to the next search unit (S23, S24).
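The flow of FIG. 6 can be approximated by a two-pass sketch: one pass to count character sets and derive group head addresses, and one pass to fill a flat array. The function name `build_search_file`, the use of a Python list as the file area, and addresses counted in entries rather than bytes are assumptions for illustration; the step labels in the comments are only a rough correspondence.

```python
from collections import Counter

EM = "\x7f"  # end mark (ASCII DEL)

def build_search_file(units, n=10000, a=10, r=3):
    """units: list of (text, attribute number); unit numbers are k = 1, 2, ...

    Returns (address_table, search_file), where address_table maps each
    character set to the head index of its group in the flat search_file.
    """
    def char_sets(text):
        padded = text + EM * (r - 1)
        return [padded[i:i + r] for i in range(len(text))]

    # S11-S12: count occurrences and derive each group's head address.
    counts = Counter(cs for text, _ in units for cs in char_sets(text))
    address_table, total = {}, 0
    for cs in sorted(counts):  # groups laid out in character-code order
        address_table[cs] = total
        total += counts[cs]

    # S13: reserve the whole search file area.
    search_file = [0] * total
    next_free = dict(address_table)  # head of each group's unstored area

    # S14-S24: register every character set of every search unit.
    for k, (text, attr) in enumerate(units, start=1):
        for p, cs in enumerate(char_sets(text), start=1):
            search_file[next_free[cs]] = (k * n + p) * a + attr  # formula (2)
            next_free[cs] += 1
    return address_table, search_file
```

Because search units are registered in order, the codes inside each group end up in ascending order without any sorting step.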
Next, the search processing that uses the search file created in this way is described.
This embodiment describes an example in which a full-text search is performed by string matching: character set sequences identical to the search input character set sequence are found from the character set position information retrieved from the search file. The search processing consists broadly of the following steps.
① The search input character string is divided from its first character into character sets of three characters each, producing the search input character set sequence.
② For each character set in the search input character set sequence, the corresponding character set group head address in the character set group address table is calculated.
③ The search input character set sequence is reordered, starting from the character set with the lowest full-text occurrence frequency.
④ Following the reordered sequence from its head, the corresponding character set type groups are retrieved from the search file, and from the character set position information stored in them the combinations of position information that can form the search input character set sequence are extracted.
⑤ From the extracted character set position information, the entries that have the same attribute as the search input are taken as matches.
⑥ From the matched character set position information, the search unit number and the character position numbers, which give the position of each character of the character set counted from the first character of the search unit, are output as the search result.
Each step is now described in detail.
① Creating the search input character set sequence
So that it can be compared with the character sets stored in the search file, the search input character string is divided from its first character into character sets of three characters each, giving the search input character set sequence. When the string is divided in this way, the last character set may contain fewer than three characters, so that a three-character set cannot be formed. In that case, the missing number of characters is taken from the end of the character set immediately before the last one and prepended to the last character set to form a three-character set.
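The splitting rule, including the handling of a short final character set, can be sketched as follows. The function name is hypothetical, and raising an error for an input shorter than one character set is an assumption; the text does not say how that case is handled.

```python
def split_search_input(text, set_len=3):
    """Split `text` into consecutive character sets of `set_len` characters.

    Returns (position, character_set) pairs with positions counted from 1.
    A short final set is completed with trailing characters of the set before it.
    """
    if len(text) < set_len:
        # Assumption: inputs shorter than one character set are rejected.
        raise ValueError("search input shorter than one character set")
    sets = []
    for start in range(0, len(text), set_len):
        if start + set_len > len(text):
            start = len(text) - set_len  # borrow characters from the preceding set
        sets.append((start + 1, text[start:start + set_len]))
    return sets


print(split_search_input("documen"))
```

For the input "documen" this yields "doc" at position 1, "ume" at position 4 and "men" at position 5.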
② Calculating the character set group head address in the character set group address table for each search input character set
As in the creation of the search file, the character set rank shown in Figs. 2 and 3 is calculated for each search input character set and used as the address pointer of that character set in the character set group address table.
③ Reordering by occurrence frequency
Next, the character set group head addresses in the character set group address table, which give the head address of each character set type group in the search file, are consulted to find the full-text occurrence frequency of each search input character set, and the search input character set sequence is reordered from the lowest full-text occurrence frequency upward. As described above, each head address in the character set group address table points to the start of a character set type group stored in the search file. Taking the difference from the head address of the next character set group therefore gives the number of pieces of character set position information stored in that group, which is the frequency with which that character set type occurs in the full text.
Matching from the character sets with the lowest full-text occurrence frequency greatly reduces the number of comparisons against the character set position information stored in the search file. When character set position information is compared to check the contiguity of character sets, the search unit number, character set position number, and attribute number in the position information of two character set type groups are compared, so the fewer pieces of position information those two groups hold, the fewer comparisons are needed. Matching therefore starts from the character sets with the lowest full-text occurrence frequency. The reduction is especially large for longer search inputs, because the longer the input, the more likely it is to contain a character set with a low occurrence frequency.
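Deriving the frequencies from adjacent head addresses and reordering the character sets can be sketched as follows. The function names, the rank mapping, and all addresses are made-up illustration values, not data from the patent.

```python
def group_sizes(head_addresses, end_address):
    """Size of each group = head address of the next group minus its own head address."""
    bounds = list(head_addresses) + [end_address]
    return [bounds[i + 1] - bounds[i] for i in range(len(head_addresses))]


def order_by_frequency(char_sets, rank_of, head_addresses, end_address):
    """Return the character sets sorted from the rarest group to the most frequent."""
    sizes = group_sizes(head_addresses, end_address)
    return sorted(char_sets, key=lambda cs: sizes[rank_of[cs]])


# Hypothetical address table: rank -> head address, plus the end of the last group.
rank_of = {"doc": 0, "men": 1, "ume": 2}
head_addresses = [0, 40, 130]
print(group_sizes(head_addresses, end_address=142))
print(order_by_frequency(["doc", "ume", "men"], rank_of, head_addresses, end_address=142))
```

With these assumed addresses the group sizes are 40, 90 and 12, so the order becomes "ume", "doc", "men".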
④ Matching the character set sequence
Starting from the character set with the lowest full-text occurrence frequency, the character set group address table is consulted and the character set position information stored in each character set type group is retrieved. Then, based on the retrieved position information and proceeding from the least frequent character set type group, combinations of character set position information are extracted for which the search unit is the same across the groups and the difference in character set position numbers equals the difference between the first-character positions of the corresponding character sets in the search input character string. With a the maximum number of attributes, i the character set position number of the less frequent character set in the search input character set sequence, and j that of the more frequent character set, the comparison is
{(character set position information of the character set at position number i) − (character set position information of the character set at position number j)} = (i − j) × a … (3)
In this matching between character set type groups, the difference is taken between the position information of a less frequent group and that of a more frequent group, and the contiguity of the character sets is checked from it.
Let ABC and DEF be any two character sets in the search input character string, let L be the difference between the character positions of A and D, and let A_x and D_y be pieces of character set position information from group ABC and group DEF respectively. The corresponding position information is then extracted as follows:
if A_x + L·a > D_y, D_y is deleted;
if A_x + L·a < D_y, A_x is deleted;
if A_x + L·a = D_y, A_x and D_y are taken as a match and both are removed from further comparison,
where a is the maximum number of attributes.
Deleting non-contiguous character set position information from the comparison targets in this way reduces the number of comparisons.
For example, suppose the character set position information of group ABC is
100052, 200113, 300105, 500205, 600083, 700054
that of group DEF is
100022, 300015, 300135, [illegible]35, 500025
and the character position difference is L = 3 with a maximum number of attributes a = 10.
The two groups can then be matched with only seven comparisons in total; there is no need to compare every piece of character set position information in one group against every piece in the other.
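The deletion rule amounts to a single two-pointer pass over two ascending lists, which the following sketch makes explicit. It assumes both groups are stored in ascending order; the function name is hypothetical. The fourth DEF entry is garbled in the source text, so 400035 is used here purely as an assumed placeholder that lies between its neighbours.

```python
def match_groups(group_a, group_d, char_gap, a=10):
    """Two-pointer scan over two ascending lists of position information.

    Returns (matches, comparisons); a match is a pair (x, y) with x + char_gap * a == y.
    """
    i = j = comparisons = 0
    matches = []
    shift = char_gap * a
    while i < len(group_a) and j < len(group_d):
        comparisons += 1
        target = group_a[i] + shift
        if target > group_d[j]:
            j += 1  # D_y is too small to match this or any later A_x
        elif target < group_d[j]:
            i += 1  # A_x is too small to match this or any later D_y
        else:
            matches.append((group_a[i], group_d[j]))
            i += 1
            j += 1
    return matches, comparisons


group_abc = [100052, 200113, 300105, 500205, 600083, 700054]
group_def = [100022, 300015, 300135, 400035, 500025]  # 400035 is an assumed value
print(match_groups(group_abc, group_def, char_gap=3))
```

Under these assumptions the scan finds the single pair (300105, 300135) and stops after seven comparisons, once group DEF is exhausted.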
⑤ Matching the attribute number
From the character set position information obtained by matching the character set sequence, the entries whose attribute number is the same as that of the search input are taken out. This yields the character set position information that matches the attribute specified in the search input.
⑥ Extracting the search result
From the character set position information taken out, the search unit number and the character position numbers, which give the position of each character of the character set counted from the first character of the search unit, are extracted as the search result.
When there are several search inputs, the second and subsequent inputs are handled as follows: from the character set type group corresponding to the first character set of the search input, only the character set position information whose search unit number was obtained by the preceding inputs is taken out, and the processing then continues from the next character set of the search input. The purpose is to extract, for the second and subsequent search inputs, only character sets contained in the same search units as the result of the first search input.
Steps ② to ⑥ above are now illustrated with a concrete example. Suppose that the body text is specified as the search target and that "documen" is specified as the search input character string. The attribute number of the body text is 5. The search file of Fig. 5 is used.
Since the search input is "documen", the search input character sets are "doc", "ume", and "n". Because "n" is a single character, it is joined to the two characters preceding it to form "men". Suppose the full-text occurrence frequencies are ordered "ume" < "doc" < "men" and matching proceeds in that order. First, the character set position information taken from the "ume" group of the search file is compared with that taken from the "doc" group. In the search input "documen" the character positions of "u" and "d" are 4 and 1, so position information whose difference is 30, the position difference multiplied by the maximum number of attributes 10, is extracted; in the search file of Fig. 5, 801245 in "ume" and 801215 in "doc" are extracted as a contiguous combination. Next, this result is compared with the position information taken from the "men" group. The character positions of "u" and "m" are 4 and 5, so position information whose difference is −10 is extracted; 801245 in "ume" and 801255 in "men" are extracted as a contiguous combination.
Furthermore, since the search condition is the body text, the entries with attribute number 5 are selected from the position information that survived string matching, giving 801215, 801245, and 801255.
Therefore the search unit with search unit number 8, to which this character string belongs, and the character position numbers 121 to 127 are output as the search result.
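The worked example can be traced end to end with the sketch below. It is a simplification under stated assumptions: it uses set lookups against the rarest group instead of the ordered pairwise scan the text describes, the function name is hypothetical, and every index entry other than 801215, 801245 and 801255 is an invented distractor.

```python
A = 10      # maximum number of attributes
N = 10000   # maximum characters per search unit


def search(index, ordered_sets, attr_no):
    """ordered_sets: (position, character_set) pairs, rarest first.

    Returns (search unit number, position of the first input character) per hit.
    """
    base_pos, base_set = ordered_sets[0]
    survivors = set(index.get(base_set, []))  # position information of the rarest set
    for pos, char_set in ordered_sets[1:]:
        group = set(index.get(char_set, []))
        shift = (pos - base_pos) * A          # expected difference, as in formula (3)
        survivors = {d for d in survivors if d + shift in group}
    hits = []
    for d in sorted(survivors):
        if d % A == attr_no:                  # attribute check (step 5)
            unit_no, set_pos = divmod(d // A, N)
            hits.append((unit_no, set_pos - base_pos + 1))
    return hits


# Hypothetical search file contents; the second entry of each group is a distractor.
index = {
    "doc": [801215, 901015],
    "ume": [301125, 801245],
    "men": [501335, 801255],
}
hits = search(index, [(4, "ume"), (1, "doc"), (5, "men")], attr_no=5)
print(hits)
```

The only hit is search unit 8 with the input starting at position 121; a seven-character input therefore covers positions 121 to 127.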
This search processing is shown as a flowchart in Fig. 7.
That is, the search input is taken in; the search input character string is divided from its first character into character sets of three characters each to create the search input character set sequence; the number of comparisons n is set to the number of character sets minus 1 and the attribute number a_i is set; and the occurrence frequency of each character set is looked up in the character set group address table and the character sets are reordered from the least frequent (S41 to S44). The character set position information stored in the character set type groups corresponding to the reordered character sets is then retrieved from the search file (S45). Between two character set type groups, with i the character set position number of the character set with the lower full-text occurrence frequency in the search input character set sequence and j that of the character set with the higher full-text occurrence frequency, the position information satisfying (character set position information of the character set at position number i) − (character set position information of the character set at position number j) = (i − j) × (maximum number of attributes) is taken out as the matching result (S46). After it has been determined whether matching is complete (S47, S48), the character set position information with attribute number a_i is selected, and the search unit matching the search input and the character position numbers, which give the position of each character of the character set counted from the first character of the search unit, are output as the search result (S49, S50). If matching continues at step S48, the position information of the matches so far is compared with the position information stored in the character set type group corresponding to the next character set in the reordered sequence (S46).
When faster full-text search is required, increasing the number of characters per character set lowers the occurrence frequency of each character set and so reduces the amount of character set position information stored in each character set type group, which readily yields a further speed-up.
The example above shows English text processed in ASCII code, but full-text search of French or German can be sped up with the same character set structure and search file structure. Search processing for other languages written in phonetic scripts can be handled in the same way.
Next, as the second and third embodiments, examples of full-text search on Japanese, in which kana (phonetic characters) and kanji (ideographic characters) are used together, are described.
Japanese text is a mixture of kanji and kana. Kanji have far more character types than the Latin alphabet, so a given kanji recurs much less often than a letter does in Western-language text. For example, even though many Japanese terms contain the two-character string 「通信」 (communication), four-character strings beginning with it, such as 「通信回線」 (communication line) and 「通信装置」 (communication device), recur with all four characters identical far less often. Katakana and hiragana also have more character types than Latin letters. For text containing kanji, therefore, search processing can be sped up even with a search file organized by single character type or with a search file of two-character character sets.
Next, the second embodiment is described.
The second embodiment describes search file creation and search processing using character sets composed of two characters. It is basically the same as the first embodiment, which processes character sets of three characters. It differs in that, because Japanese is processed, the search file and the character set group address table are created using the JIS code table.
A concrete description follows.
As shown in Fig. 8, the search file of the second embodiment is organized as blocks of character sets arranged in the order of the characters listed in the JIS code table. As shown in the character set list of Fig. 9, each block consists of the character set groups for the two-character strings whose first character is the listed character, again in JIS code table order. The character set group address table shown in Fig. 10 lists the head addresses of these character set type groups in the order in which they appear in the list of Fig. 9.
As in the first embodiment, let the maximum number of characters per search unit be n = 10000 and the maximum number of attributes be a = 10. If the string 「通信文書の」 occurs at character positions 121 to 125 from the start of the body text (attribute number 5) that forms the eighth search unit, it is decomposed into the character sets 「通信」, 「信文」, 「文書」, and 「書の」, which are given the character set position information 801215, 801225, 801235, and 801245 respectively, and this position information is stored in the search file area. Fig. 11 shows an example in which the character set position information of 「通信文書」 is stored in the search file. The procedure for creating the search file is the same as in the first embodiment, so its flowchart is omitted.
In search processing with a search file created in this way, the search input character string is divided from its first character into character sets of two characters each to create the search input character set sequence. The character set type groups corresponding to these character sets are retrieved from the search file and compared, the combinations of character set position information that can form the search input character set sequence are extracted, and from these the position information with the same attribute as the search input is taken as a match. From the matched character set position information, the search unit number and the character position numbers, which give the position of each character of the character set counted from the first character of the search unit, are output as the search result. When the search input character string is divided into two-character sets, the last character set may consist of a single character, so that a two-character set cannot be formed. In that case, one character is taken from the end of the character set immediately before the last one and prepended to the last character set to form a two-character set.
When 「通信文書」 is specified as the search input character string, the search input character sets are 「通信」 and 「文書」. Suppose the full-text occurrence frequencies are ordered 「通信」 < 「文書」 and matching proceeds in that order. The character set position information taken from the 「通信」 group and from the 「文書」 group of the search file is compared. In the search input 「通信文書」 the character positions of 「通」 and 「文」 are 1 and 3, so position information whose difference is −20, the position difference multiplied by the maximum number of attributes 10, is extracted; in the search file of Fig. 11, 801215 in 「通信」 and 801235 in 「文書」 are extracted as a contiguous combination. Since the search condition is the body text, 801215 and 801235 are selected as the position information with attribute number 5, and the search unit with the common search unit number 8 and the character position numbers 121 to 124 are obtained as the search result. The procedure for search processing is the same as in the first embodiment, so its flowchart is omitted.
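Reading the result back out of the matched position information can be sketched as follows. The function names are hypothetical, and n = 10000 and a = 10 are the values stated for this embodiment.

```python
def decode_position(d, n=10000, a=10):
    """Split packed position information into (unit number, character position, attribute)."""
    body, attr_no = divmod(d, a)
    unit_no, char_pos = divmod(body, n)
    return unit_no, char_pos, attr_no


def result_span(matches, set_len=2):
    """Return (unit number, first position, last position) covered by one hit.

    matches: position information of the matched character sets of that hit.
    """
    decoded = [decode_position(d) for d in matches]
    units = {unit_no for unit_no, _, _ in decoded}
    if len(units) != 1:
        raise ValueError("all character sets of a hit must share one search unit")
    positions = [char_pos for _, char_pos, _ in decoded]
    return units.pop(), min(positions), max(positions) + set_len - 1


print(decode_position(801235))
print(result_span([801215, 801235]))
```

Here 801235 decodes to search unit 8, position 123, attribute 5, and the pair 801215 and 801235 spans positions 121 to 124 of search unit 8.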
Next, as the third embodiment, the case in which a search file organized by single character type is created and searched is described. Because kanji have many character types, search processing can be sped up even with a search file of character type groups built one character at a time.
The third embodiment differs from the second only in whether the search file is organized by character set type or by single character type; the search file creation processing and the search processing are basically the same.
In the search file creation processing, because a character type group is generated for each single character, the structure of the character column address table and of the search file differs slightly from that of the second embodiment.
The three operations of the search file creation processing, namely ① securing the search file area, ② assigning character position information to each constituent character, and ③ storing the character position information in the file grouped by character type, differ in detail but are basically unchanged from the first and second embodiments.
① Securing the search file area
In the third embodiment, the constituent characters of the full Japanese text are classified, the occurrence frequency is counted for each character type listed in the JIS code table, and the area for the search file is secured. From this, the character column address table shown in Fig. 12 is created; it corresponds to Fig. 10 of the second embodiment and lists the head addresses of the character type groups in JIS code table order. Compared with the address table of the second embodiment, this table holds one head address per character type, and because it follows JIS level 1 and JIS level 2, only 8836 character columns are needed, including unused codes.
② Assigning character position information to each constituent character
Because this embodiment assigns position information to each individual character, character position numbers are assigned in ascending order 1, 2, 3, … from the first character of each search unit. With n the maximum number of characters per search unit and a the maximum number of attributes, the character position information is given by
character position information code = {search unit number × n + character position number} × a + attribute number … (4)
For example, as in the second embodiment, if the string 「通信文書」 occurs at character positions 121 to 124 from the start of the body text (attribute number 5) that forms the eighth search unit, the characters 「通」, 「信」, 「文」, and 「書」 are given the character position information 801215, 801225, 801235, and 801245 respectively.
③ Registering character position information in the search file
The character type groups are stored in the search file in JIS code table order, based on the character column address table shown in Fig. 12. The result is the search file shown in Fig. 13, in which the character position information is stored by character type group. Fig. 14 shows a flowchart of this search file creation processing.
Next, search processing on the search file organized by character type is described.
First, the character column start address in the character column address table is calculated for each constituent character of the search input character string. The characters of the search input character string are then reordered starting from the one with the lowest frequency of occurrence, and the character position information stored in the character type group of each character is retrieved. Working from the character type group with the lowest frequency of occurrence, combinations of character position information are extracted across the character type groups such that the search units are equal and the difference between the character position numbers equals the difference between the character positions in the search input character string.
For this matching of character position information, let i be the character position number of a character with a low full-text frequency of occurrence in the search input character string, and j the character position number of a character with a high full-text frequency of occurrence. It then suffices to extract the combinations of character position information that satisfy
{(character position information of the character at character position number i) - (character position information of the character at character position number j)} = (i - j) × a ... (5)
where a is the maximum number of attributes.
As a result, character position information that shares a search unit and is contiguous across the character type groups is extracted, and from it the character position information having the same attribute as the search input is taken as a match. The search unit and character positions that match the search input are obtained from the matched character position information.
As a concrete example, suppose that the body text (attribute number 5) is specified as the search target and "通信文書" is specified as the search input character string.
Assume that the full-text frequencies of occurrence of the characters are in the order "書" < "文" < "信" < "通" and that matching proceeds in this order. First, the character position information retrieved from the "書" character column of the search file and that retrieved from the "文" character column are compared using equation (5), and the pairs whose codes differ by 10 are extracted; this yields "801245" in "書" and "801235" in "文" as contiguous character position information. Next, the character position information in "書" that survived this match is compared, again using equation (5), with the character position information retrieved from the character column of the search file for "信", and the pairs whose codes differ by 20 are extracted; this yields "801245" in "書" and "801225" in "信" as a contiguous combination. In the same way, "801245" in "書" and "801215" in "通" are extracted as a contiguous combination. Finally, because the search condition is the body text, the character position information "801215" to "801245", whose attribute number is "5", is extracted from the character position information that survived the string matching.
Consequently, search unit number "8", to which this character string belongs, and character position numbers "121 to 124" are output as the search result. Fig. 15 shows a flowchart of this search processing.
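The matching walk-through above can be sketched in Python as follows. This is a minimal illustration, not the flow of Fig. 15: the search file is modelled as a dictionary of posting lists, the distractor codes are invented, and n = 10000 and a = 10 are assumed values that fit the codes in the example.

```python
N = 10_000  # assumed maximum characters per search unit
A = 10      # assumed maximum number of attributes

# Hypothetical search file: character -> ascending list of position codes.
# Only the 8012x5 entries come from the example; the others are made up.
search_file = {
    "通": [301115, 801215],
    "信": [801225, 905555],
    "文": [120345, 801235, 801305],
    "書": [801245],
}


def search(query, attr_no, index, n=N, a=A):
    """Return (search unit, first character position) for each match."""
    # Process the query characters rarest first (shortest posting list).
    order = sorted(enumerate(query), key=lambda t: len(index.get(t[1], [])))
    base_off, base_ch = order[0]
    survivors = set(index.get(base_ch, []))
    for off, ch in order[1:]:
        postings = set(index.get(ch, []))
        shift = (off - base_off) * a  # equation (5): code difference = (i - j) * a
        survivors = {c for c in survivors if c + shift in postings}
    # Convert to the code of the first query character, then filter by attribute.
    starts = sorted(c - base_off * a for c in survivors)
    hits = [c for c in starts if c % a == attr_no]
    return [divmod(c // a, n) for c in hits]


print(search("通信文書", 5, search_file))  # [(8, 121)]
```

Starting from the rarest character keeps the candidate set small, so each later posting list is probed only for the few survivors. The sketch ignores matches that would straddle a search unit boundary.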
In this embodiment, the search file may also be created with one entry per character for kanji and with two-character sets for consecutive katakana or hiragana characters. Katakana is used particularly often for technical terms, and kana characters may be entered as the search input character string, so building the search file in this way is also effective for speeding up the search.
Next, examples of a partial-match search processing method using multiple keywords are described as the fourth to sixth embodiments.
The multi-keyword information search method is explained using a book search system as an example. A record in the book search system consists of keywords such as the book title, author name, publisher name, year of publication and abstract. Each record containing these keywords is registered to create a search file, and a keyword or a partial character string of a keyword is entered as the search input to retrieve and output the corresponding records. The creation of this search file is described first.
First, a record identification code is assigned to each record to be searched, in ascending order of registration. Next, the logical type of each keyword in a record is treated as an attribute, and a keyword attribute code indicating that attribute is assigned. In a book search system, keyword attribute codes are assigned for attributes such as the book title, author name, publisher name, year of publication and abstract, which establishes a logical association between the search input and the keywords of the book search system. The searcher uses as the search input a keyword that readily identifies the book, or a keyword that he or she remembers. Each keyword is then decomposed into single characters or character sets. Each character is given a character position order code indicating its position from the beginning of the keyword, or each character set is given a character set position order code indicating the position of its first character from the beginning of the keyword. From the record identification code, the keyword attribute code, and the character position order code or character set position order code, the character position information of each character of the keyword, or the character set position information of each character set, is generated. So that the keyword attribute can be expressed by the character position, the first character position of the keyword, preset for each keyword attribute code, is added as a constant to the character position information or character set position information. The character position information or character set position information is grouped by character type or character set type, and these groups are collected to create the search file. The search file therefore has a file structure in which character position information is stored by character type, or character set position information by character set type.
In the search processing, one or more pairs of a search input character string and a search input character string attribute are entered. Each search input character string is decomposed into single characters or character sets, and the character position information of the same characters as those of the search input, or the character set position information of the same character sets as those of the search input, is retrieved from the search file. Combinations of character position information or character set position information are then matched and extracted such that the record identification code and keyword attribute code are common, the character position order codes or character set position order codes are in the same order as those of the search input character string, and the keyword attribute code equals that of the search input. From the extracted character position information or character set position information, the record identification codes common to all search input character strings are output as the search result.
The fourth embodiment is described next.
The information search processing of the fourth embodiment is divided into two parts. The first is search file creation processing: a keyword string is created from the multiple keywords of each record to be searched; the constituent characters of each keyword are taken one at a time starting from the first character; each is combined with the characters that follow it to form a three-character character set; and a search file is created from character set type groups, one per character set type. The second is search processing, which matches the search input against the search file and extracts the records whose keywords match the search input.
The search file creation processing is described first.
As in the first embodiment, the search file creation processing can be divided into three steps: ① securing the search file area, ② assigning character set position information to the character sets that make up each keyword, and ③ storing the character set position information, grouped by character set type, in the search file. Each step is described below.
① Securing the search file area
As shown in Fig. 2, used for the first embodiment, the search file consists of character set groups arranged in the order in which the characters are listed in the ASCII code table. As in the first embodiment, the second and third characters of each character set group are organized as in the list of second- and third-character combinations of Fig. 3, and are arranged in the order given in the character set group address table shown in Fig. 4.
② Assigning character set position information to the character sets that make up each keyword
The character set position information described here is defined on a keyword string, which is created by placing each keyword of a record in the keyword attribute area corresponding to its keyword attribute number. It is built from three values: the record number, which indicates the order in which the record containing the character set was registered; the character set position number, which indicates where the character set appears in the keyword by the position of its first character; and the keyword attribute number, which indicates the logical type of the keyword.
The record number is described first. A typical book search system searches for books by keywords such as the book title, author name, publisher name, year of publication and abstract. A record is the object of the search and consists of these keywords; records are numbered in ascending order from 1 in the order in which they are registered, and this number is the record number.
The keyword attribute number is described next. A searcher generally enters a keyword that readily identifies the book, or a keyword that he or she remembers. The book search system therefore attaches a keyword attribute to each keyword, such as book title, author name, publisher name, year of publication and abstract, establishing a logical association between the search input and the keywords of the book search system. Here the keyword attribute numbers are "1" for the book title, "2" for the author name, "3" for the publisher name, "4" for the year of publication and "5" for the abstract.
The character set position number is described next. For each keyword, characters are taken one at a time from the beginning, each is combined with the characters that follow it to form a three-character character set, and the sets are numbered 1, 2, 3, ... in ascending order of creation; this number is the character set position number. Two copies of the special symbol EM (end mark), which indicates the end of a keyword, are appended after the last character of the keyword, and the final characters are combined with these EM symbols to form character sets, which also receive character set position numbers. The EM symbol is assigned the ASCII code "7F", that of "DEL" in the ASCII code table.
The keyword string is described next. To perform a partial-match search on the keywords of a record by matching against the search input character set string, all keywords of the record are concatenated into a single character string, called the keyword string. Specifically, each keyword is placed in a fixed-length keyword attribute area corresponding to its keyword attribute number. The attribute of the keyword to which a character set belongs can therefore be determined from its character position in the keyword string. Each keyword attribute area is followed in the keyword string by an EM symbol that delimits the area; this is the same symbol as the EM that marks the end of a keyword.
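The decomposition of a keyword into three-character sets can be illustrated with a small Python sketch. The function name `character_sets` is hypothetical; the EM symbol is taken to be the single character 0x7F (DEL), as stated above.

```python
EM = "\x7f"  # end mark: ASCII DEL (0x7F)


def character_sets(keyword, width=3):
    """Return (position number, character set) pairs, numbered from 1.

    Two EM symbols (width - 1 in general) are appended so that every
    character of the keyword, including the last, starts a full set.
    """
    padded = keyword + EM * (width - 1)
    return [(p + 1, padded[p:p + width]) for p in range(len(keyword))]


for number, cset in character_sets("Kist"):
    print(number, repr(cset))
```

A keyword of m characters therefore yields exactly m character sets, the last two of which contain one and two EM symbols respectively.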
For this keyword string, every character set that makes up a keyword is converted, using the record number, keyword attribute number and character set position number, into an integer code that serves as its character set position information. The character set position information is the integer code given by equation (6).
character set position information code = record number × n + (Pa - 1) + p ... (6)
n: number of characters in the keyword string
Pa: first character position, within the keyword string, of the keyword attribute area for keyword attribute number a
p: character set position number
For example, consider a book search system in which the keyword attribute areas of the keyword string are 64 bytes (64 characters) for the book title, 32 bytes (32 characters) for the author name, 64 bytes (64 characters) for the publisher name, 4 bytes (4 characters) for the year of publication and 1000 bytes (1000 characters) for the abstract. If the record with record number 100 has "book title = Electronic Publishing", "author name = Joost Kist", "publisher = CROOM HELM", "year of publication = 1990" and "abstract = With ~ society", the keyword string is as shown in Fig. 16. The keyword string is 1169 bytes (1169 characters) long, so the character set position information of each character set is composed as shown in Fig. 17.
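The layout of the keyword string and equation (6) can be sketched in Python as below. The sketch assumes that each keyword attribute area is followed by exactly one EM delimiter, which is what makes the five areas add up to 1169 characters; the helper names are hypothetical, and the printed codes have not been compared with Fig. 17.

```python
# Area sizes from the example, keyed by keyword attribute number.
AREA_SIZES = {1: 64, 2: 32, 3: 64, 4: 4, 5: 1000}


def layout(sizes):
    """Return ({attribute: first position Pa}, total length n), 1-based."""
    starts, pos = {}, 1
    for attr in sorted(sizes):
        starts[attr] = pos
        pos += sizes[attr] + 1  # the area plus one trailing EM delimiter (assumption)
    return starts, pos - 1


P, N = layout(AREA_SIZES)


def encode(record_no, attr_no, p):
    """Equation (6): record_no * n + (Pa - 1) + p."""
    return record_no * N + (P[attr_no] - 1) + p


print(P, N)              # {1: 1, 2: 66, 3: 99, 4: 164, 5: 169} 1169
print(encode(100, 1, 1)) # first character set of the book title: 116901
print(encode(100, 2, 1)) # first character set of the author name: 116966
```

Since (Pa - 1) + p never reaches n for a character set inside an area, the record number can be recovered as code // n and the position in the keyword string as code % n.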
If each item of character set position information is held as a 4-byte code in this way, 2^32 ÷ 1169 ≈ 3.67 million keyword strings of 1169 characters can be handled.
③ Registration of character set position information in the search file
Next, the character set position information assigned to each character set is registered in the search file.
As described above, the character set type groups are stored in the search file in the order given by the ASCII code table shown in Figs. 2 and 3. The character set position information of each character set is registered in its character set type group by storing it at the head of the unused area of that group. Consequently, if record numbers are assigned in order of registration, the character set position information within each character set type group is registered in ascending numerical order.
Fig. 18 shows an example in which the character set position information of the book title "Electronic Publishing" above has been registered in the search file. The character set position information within each group is stored in ascending order. If each item of character set position information occupies 4 bytes, the file size is
4 bytes × {(number of characters in the book title) + (number of characters in the author name) + (number of characters in the publisher name) + 4 + (number of characters in the abstract)}.
As in the first embodiment, character set position information is added by appending a new code at the head of the unused area of the group corresponding to each character set of each keyword in the record being added. Deletion is performed by changing the relevant character set position information, in the group corresponding to each character set of each keyword in the record being deleted, to a special symbol (here the ASCII code "0000"). Both addition and deletion can therefore be carried out quickly.
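The append-only addition and in-place deletion described above can be sketched as follows. This is an illustrative model only: the search file is held as in-memory Python lists rather than preallocated file areas, the class and method names are hypothetical, and the integer 0 stands in for the special deletion symbol.

```python
from collections import defaultdict

TOMBSTONE = 0  # stands in for the special symbol written on deletion (assumption)


class SearchFile:
    """Character set type groups, each an append-only list of position codes."""

    def __init__(self):
        self.groups = defaultdict(list)

    def add(self, char_set, code):
        # New codes go to the head of the unused area, i.e. the end of the list.
        self.groups[char_set].append(code)

    def delete(self, char_set, code):
        # Overwrite in place; nothing is shifted, so deletion stays cheap.
        group = self.groups.get(char_set, [])
        for i, stored in enumerate(group):
            if stored == code:
                group[i] = TOMBSTONE
                return True
        return False

    def postings(self, char_set):
        # Readers skip deleted slots.
        return [c for c in self.groups.get(char_set, []) if c != TOMBSTONE]


sf = SearchFile()
sf.add("Ele", 116901)
sf.add("Ele", 233801)
sf.delete("Ele", 116901)
print(sf.postings("Ele"))  # [233801]
```

Overwriting rather than removing keeps every other entry at its original offset, at the cost of leaving unused slots until the file is rebuilt.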
As noted above, the character set position information stored in each character set type group of this search file can be retrieved by using as a directory the character set group start addresses in the character set group address table of Fig. 4, shown for the first embodiment.
Figs. 19a and 19b show the flow of this search file creation processing.
First, the number of occurrences of each character set type is counted to create the character set column address table (S111, S112), and the search file area is secured (S113). Next, the record registration order counter is initialized to k = 1, so that the record number is "1"; the number of characters in the keyword string is set to n = 1169; and the first character positions of the keyword attribute areas are set to P1 = 1 for the book title, P2 = 66 for the author name, P3 = 99 for the publisher name, P4 = 164 for the year of publication and P5 = 169 for the abstract (S114). The first record is then fetched (S115). This completes the preprocessing for registration.
Registration then proceeds record by record. The keyword attribute number is set to a = 1 (S116), and the keyword with keyword attribute number a is extracted from the record (S117). The number of characters in the keyword is set to m, the character set position number to p = 1, and Pa to the first character position of the keyword attribute area for keyword attribute number a (S118). Then, starting from the first character of the extracted keyword, the character set position information for character set position number p is created using the equation
character set position information = k × n + (Pa - 1) + p ... (7)
(S119).
The character set column directory (the character set column start address), which identifies the character set column of the search file holding the character set type group of the character set at character set position number p, is fetched from the character set column address table (S120), and the character set position information is stored in the first row of the unused area of the search file indicated by that directory (S121). Then p = p + 1 and m = m - 1 are set; once all character sets in the keyword have been processed (S122, S123), the keyword attribute number is incremented by a = a + 1 and processing moves to the next keyword (S124, S125). Once all keywords of the record have been processed, the record registration order counter is incremented by k = k + 1 and processing moves to the next record (S126, S127, S128). The registration processing ends when all records have been processed (S126).
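The registration loop of steps S114 to S128 can be condensed into the following Python sketch. It is a simplification under stated assumptions: the search file is a dictionary of lists instead of areas sized in advance from occurrence counts (S111 to S113), records are given as hypothetical dictionaries mapping keyword attribute number to keyword, and keywords are assumed to fit their attribute areas.

```python
from collections import defaultdict

EM = "\x7f"                                     # end mark (DEL)
N = 1169                                        # n: keyword string length
P = {1: 1, 2: 66, 3: 99, 4: 164, 5: 169}        # Pa: first position of each area


def register(records):
    """Build character set type groups from a list of {attribute: keyword} dicts."""
    search_file = defaultdict(list)
    for k, record in enumerate(records, start=1):    # k: record number
        for a in sorted(record):                     # a: keyword attribute number
            keyword = record[a]
            padded = keyword + EM * 2                # two EM symbols close the keyword
            for p in range(1, len(keyword) + 1):     # p: character set position number
                char_set = padded[p - 1:p + 2]
                code = k * N + (P[a] - 1) + p        # character set position information
                search_file[char_set].append(code)   # head of the unused area
    return search_file


records = [
    {1: "Electronic Publishing", 2: "Joost Kist"},
    {1: "Electronic Publishing"},
]
sf = register(records)
print(sf["Ele"])  # [1170, 2339]
print(sf["Kis"])  # [1241]
```

Because records, attributes and positions are all visited in ascending order, each group ends up sorted without any explicit sorting step.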
Next, search processing using the search file created in this way is described.
In this embodiment, the character set position information retrieved from the search file is matched to find keywords containing the same character string as the search input character string; after confirming that the attribute is the same as that of the search input, the records common to all search input character strings are retrieved.
As in the first embodiment, the search processing consists of the following steps.
① The search input character string is decomposed from its first character into three-character character sets to create the search input character set string.
② For each character set in the search input character set string, the character set group start address in the character set group address table is calculated.
③ The search input character set string is reordered starting from the character set with the lowest frequency of occurrence.
④ Working from the beginning of the reordered character set string, the corresponding character set type groups are retrieved from the search file, and the combinations of character set position information that can form the search input character set string are extracted from the character set position information stored there.
⑤ From the extracted character set position information, the items having the same attribute as the search input are taken as matches.
⑥ After ① to ⑤ have been repeated for each search input, the record numbers common to all search input character strings are output as the search result.
Each step is described concretely below.
① Creating the search input character set string
As in the first embodiment, the search input character string is decomposed from its first character into three-character character sets so that it can be matched against the character sets stored in the search file; the result is the search input character set string. When the string is decomposed in this way, the last character set may contain fewer than three characters and so cannot form a character set. In that case, the missing number of characters is taken from the end of the character set immediately preceding the last one and prepended to the last character set to form a three-character set.
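The decomposition of a search input, including the handling of a short final set, might look like this in Python. The function name is hypothetical, offsets are 0-based positions of each set's first character within the query, and rejecting queries shorter than three characters is an assumption, since the text does not cover that case.

```python
def split_query(query, width=3):
    """Split a query into (offset, character set) pairs of exactly `width` characters."""
    if len(query) < width:
        # Not covered by the description; rejected here as an assumption.
        raise ValueError("query shorter than one character set")
    sets = [(off, query[off:off + width])
            for off in range(0, len(query) - width + 1, width)]
    if len(query) % width:
        # The last set is short: borrow the missing characters from the end
        # of the preceding set, which is the same as taking the final `width`
        # characters of the query.
        off = len(query) - width
        sets.append((off, query[off:]))
    return sets


print(split_query("Electronic"))
# [(0, 'Ele'), (3, 'ctr'), (6, 'oni'), (7, 'nic')]
```

Keeping the offset with each set is what later allows the position differences between sets to be checked against the search file.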
② Calculating the character set group start address in the character set group address table for each search input character set
As when creating the search file in the first embodiment, the listing order of each search input character set among the character sets shown in Figs. 1 and 3 is calculated and used as the address pointer for that search input character set in the character set group address table.
③ Reordering by frequency of occurrence
As in the first embodiment, the frequency of occurrence of each search input character set is determined by referring to the character set group start addresses in the character set group address table, which give the start address of each character set type group in the search file, and the search input character set string is reordered starting from the set with the lowest frequency of occurrence across all keywords.
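One way to read frequencies off the address table is sketched below. It assumes, hypothetically, that the groups are stored contiguously in table order, so that the size of a group is the distance from its start address to the next one; the actual table layout is given in the figures and may differ.

```python
def group_sizes(start_addresses, end_address):
    """Derive each group's entry count from consecutive start addresses.

    `start_addresses` maps character set -> start address, in storage order
    (Python dicts preserve insertion order); `end_address` closes the last group.
    """
    names = list(start_addresses)
    bounds = [start_addresses[name] for name in names] + [end_address]
    return {name: bounds[i + 1] - bounds[i] for i, name in enumerate(names)}


def order_by_rarity(query_sets, sizes):
    """Sort (offset, character set) pairs, least frequent character set first."""
    return sorted(query_sets, key=lambda s: sizes.get(s[1], 0))


# Made-up address table for illustration.
starts = {"Ele": 0, "ctr": 3, "nic": 4, "oni": 9}
sizes = group_sizes(starts, end_address=11)
query_sets = [(0, "Ele"), (3, "ctr"), (6, "oni"), (7, "nic")]
print(order_by_rarity(query_sets, sizes))
# [(3, 'ctr'), (6, 'oni'), (0, 'Ele'), (7, 'nic')]
```

A character set whose group is empty sorts first, which lets a search stop immediately when some part of the query never occurs.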
④ Character string matching
As in the first embodiment, starting from the character set with the lowest frequency of occurrence, the character set position information stored in each character set type group column is retrieved by referring to the character set group address table. Then, working from the character set type group with the lowest frequency of occurrence, combinations of character set position information are extracted across the character set type groups such that the record number and keyword attribute number are equal and the difference between the character set position numbers equals the difference between the first character positions of the corresponding character sets in the search input character string.
For this matching of character set position information, let i be the character set position number of a character set with a low frequency of occurrence across all keywords in the search input character set string, and j the character set position number of one with a high frequency of occurrence. Matching is then performed with the equation
(character set position information of the character set at character set position number i) - (character set position information of the character set at character set position number j) = i - j ... (8)
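Equation (8) amounts to an offset-shifted intersection of posting lists, which can be sketched in Python as follows. The posting lists are made up for illustration (only the 1169-based codes for record 100 follow from equation (6)), and the function name is hypothetical.

```python
def match_sets(query_sets, postings):
    """Return the codes of the first query character set for every full match.

    `query_sets` is a list of (offset, character set) pairs for the query;
    `postings` maps character set -> list of character set position codes.
    """
    # Rarest character set first, so the candidate set starts small.
    ordered = sorted(query_sets, key=lambda s: len(postings.get(s[1], [])))
    base_off, base_set = ordered[0]
    survivors = set(postings.get(base_set, []))
    for off, cset in ordered[1:]:
        codes = set(postings.get(cset, []))
        # Equation (8): the code difference must equal the position difference.
        survivors = {c for c in survivors if c + (off - base_off) in codes}
    return sorted(c - base_off for c in survivors)


postings = {
    "Ele": [1170, 116901],
    "ctr": [116904],
    "oni": [2000, 116907],
    "nic": [116908, 240000, 250000],
}
query_sets = [(0, "Ele"), (3, "ctr"), (6, "oni"), (7, "nic")]
print(match_sets(query_sets, postings))  # [116901]
```

The single integer comparison checks the record number and the position at once, because both are packed into the same code.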
⑤ Keyword attribute number collation
The keyword attribute is checked using the character set position number of the character set position information obtained from the character string collation. If the character set position number is 1 to 64, the keyword attribute of the character set position information is book name; if 66 to 97, author name; if 99 to 162, publisher name; if 164 to 167, publication year; and if 169 to 1168, abstract. Accordingly, from the character set position information obtained by the character set sequence collation, only the character set position information whose attribute matches the attribute specified at search input is retrieved.
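The range test above can be sketched as a simple lookup. The ranges are those given in the text; the names `ATTRIBUTE_RANGES` and `attribute_of` are hypothetical.

```python
# Character position ranges of the keyword attribute areas (fourth embodiment).
ATTRIBUTE_RANGES = {
    "book name": (1, 64),
    "author name": (66, 97),
    "publisher name": (99, 162),
    "publication year": (164, 167),
    "abstract": (169, 1168),
}


def attribute_of(position_number):
    """Return the keyword attribute whose area contains the position number."""
    for name, (first, last) in ATTRIBUTE_RANGES.items():
        if first <= position_number <= last:
            return name
    return None  # the position lies between two attribute areas
```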
⑥ Record number extraction
Steps ① to ⑤ are repeated for each search input, and the record numbers common to all search input character strings are extracted from the character set position information obtained for each search input character string.
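Extracting the common record numbers amounts to a set intersection, sketched below. The function name `common_records` and the sample values are hypothetical; the sketch assumes the fourth embodiment's encoding, in which the record number is the quotient of the position information divided by the keyword string character count n = 1169.

```python
def common_records(infos_per_input, n=1169):
    """Return the record numbers shared by every search input.

    infos_per_input: one list of surviving character set position
                     information per search input character string
    """
    record_sets = [{info // n for info in infos} for infos in infos_per_input]
    return set.intersection(*record_sets) if record_sets else set()


# Two search inputs: the first matched records 100 and 200, the second 100 and 300.
shared = common_records([[116901, 233805], [116970, 350800]])
```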
When a single search input is specified as multiple character strings, as is often the case when the target is a field with many characters such as an abstract, the second and subsequent character strings are handled as follows once keyword attribute collation of the first character string has finished. From the character set type group of the first character set to be collated in that character string, the character set position information having the record numbers and keyword attribute numbers obtained so far is retrieved. The retrieved character set position information is then treated as the character set type group of the leading character set for character set sequence collation, and collation is performed against the other character sets in the same character string.
The operations of ② to ⑥ above are now described with a specific example.
Suppose that book name is specified as the search target and "Electro" is specified as the search input character string. In this case the attribute number of the book name keyword attribute is "1". Since the search input is "Electro", the search input character sets are "Ele", "ctr", and "o". However, because "o" is a single character, it is joined with the two characters preceding it to form "tro". Assuming that the occurrence frequencies satisfy "Ele" < "ctr" < "tro", collation is performed in this order.
First, the character set position information retrieved from the "Ele" character set group field of the search file is collated with that retrieved from the "ctr" character set group field. Since the character positions of "E" and "c" in the search input "Electro" are "1" and "4" respectively, character set position information whose character set position difference is "−3" is extracted. From the search file of FIG. 18, "116901" in "Ele" and "116904" in "ctr" can thus be extracted as a contiguous combination of character set position information.
Next, this collation result is collated with the character set position information retrieved from the "tro" character set group field. Since the character positions of "E" and "t" in the search input "Electro" are "1" and "5" respectively, character set position information whose character set position difference is "−4" is extracted. The character set position information "116901" in "Ele", which is the collation result above, and "116905" in "tro" in the search file of FIG. 18 can thus be extracted as a contiguous combination.
Therefore, for the search input "Electro", the character set position information "116901", "116904", and "116905" are found to be character sets that are contiguous and have equal record numbers and keyword attribute numbers. Furthermore, since the keyword attribute is "book name", "116901", "116904", and "116905" can be extracted from the character set position information remaining after the character set sequence collation as the character set position information whose character position numbers lie in the range 1 to 64.
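The division of "Electro" into "Ele", "ctr", and "tro" can be sketched as follows. The function name `split_search_input` is hypothetical, and the handling of a trailing fragment (sliding it back so that it again holds r characters) is an interpretation of the "o" to "tro" step described above.

```python
def split_search_input(text, r=3):
    """Split text into r-character sets, returning (set, 1-based start position).

    A trailing fragment shorter than r is completed with the characters
    that precede it, so its start position moves back accordingly.
    """
    sets = []
    for start in range(0, len(text), r):
        chunk = text[start:start + r]
        if len(chunk) < r and len(text) >= r:
            start = len(text) - r       # slide back to fill the last set
            chunk = text[start:]
        sets.append((chunk, start + 1))
    return sets


electro_sets = split_search_input("Electro")
```

The start positions 1, 4, and 5 returned for "Electro" are the values from which the differences −3 and −4 in the worked example are obtained.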
Since the keyword string has 1169 characters, the character set position numbers are found to be 1, 4, and 5 from 116901 ÷ 1169 = 100 remainder 1, and likewise for 116904 and 116905. The same division also shows that the record number to which this character string belongs is 100.
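This decoding is an integer division with remainder, as the short sketch below shows. The function name `decode` is hypothetical; the values are those of the example above.

```python
def decode(position_info, n=1169):
    """Split position information into (record number, position number)."""
    record_number, position_number = divmod(position_info, n)
    return record_number, position_number


decoded = [decode(info) for info in (116901, 116904, 116905)]
```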
This search processing operation is shown as a flowchart in FIGS. 20a and 20b.
That is, the keyword string character count is set to n = 1169; the character position ranges Pa of the keyword attribute areas are set to P1 = 1 to 64 for book name, P2 = 66 to 97 for author name, P3 = 99 to 162 for publisher name, P4 = 164 to 167 for publication year, and P5 = 169 to 1168 for abstract; and the keyword attribute number is set to a = 1 (S131). If there is a search input character string for keyword attribute number a, it is retrieved (S132, S133). From this point the search input character string is collated. The search input is taken and divided, from the beginning of the search input character string, into character sets of three characters each to create a search input character set sequence, and the number of character sets minus 1 is taken as the collation count q (S133, S134). The search input character set sequence is rearranged in ascending order of occurrence frequency across all keywords (S136). The character set position information stored in the character set type group fields corresponding to the rearranged character sets is then retrieved from the search file (S137). Next, where i is the character set position number of the character set with the lower occurrence frequency across all keywords and j is that of the character set with the higher occurrence frequency, the character set position information satisfying (character set position information of the character set at position number i) − (character set position information of the character set at position number j) = i − j is retrieved (S138). The same processing is performed for the remaining character sets of the search input character set sequence (S139, S140), and from the remaining character set position information only the record numbers whose character set position numbers lie within the character position range Pa of keyword attribute number a are retrieved. The following equation (9) is used to obtain the character set position number from the character set position information.
(character set position information) ÷ (keyword string character count) = record number, remainder character set position number … (9)
The processing up to this point identifies the record numbers that contain the search input character string and whose keyword has the attribute specified in the search input (S141). The same processing is performed through to the abstract, and the record numbers having keywords with the attributes specified in the search input are retrieved (S142, S143). When collation of all search input character strings is complete, the record numbers common to all search input character strings among the remaining record numbers are output as the search result (S144).
The above embodiment has been described for the case of one or more search inputs. The case of multiple search inputs has been described as an example in which a logical AND is taken between the search inputs. Where multiple search inputs involve logical operations other than AND, the record numbers remaining as collation results are associated with each search input, the specified logical operation is performed, and the record numbers that satisfy it are output as the search result.
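The flow for a single search input against the book name attribute can be sketched end to end as follows. This is a simplified illustration under stated assumptions, not the flowchart itself: the names `build_search_file` and `search` and the two sample titles are hypothetical, only the book name area (Pa = 1) is indexed, titles are assumed to fit within its 64 characters, and each candidate is normalized to the start of the search input, which is equivalent to applying equation (8) pairwise.

```python
from collections import defaultdict

N = 1169  # keyword string character count n of the fourth embodiment


def build_search_file(titles, r=3):
    """Map each r-character set to its list of position information.

    titles: {record number: book name}; the book name area starts at
    position 1, so position information = record number * N + position.
    """
    search_file = defaultdict(list)
    for record_number, title in titles.items():
        for offset in range(len(title) - r + 1):
            position_info = record_number * N + (offset + 1)
            search_file[title[offset:offset + r]].append(position_info)
    return search_file


def search(search_file, query, r=3):
    """Return sorted (record number, start position) pairs matching query."""
    if len(query) < r:
        raise ValueError("query is shorter than one character set")
    # Divide the query into r-character sets; a short tail slides back.
    starts = list(range(0, len(query) - r + 1, r))
    if len(query) % r:
        starts.append(len(query) - r)
    sets = [(query[s:s + r], s + 1) for s in starts]
    # Collate from the least frequent character set.
    sets.sort(key=lambda cs: len(search_file.get(cs[0], ())))
    first_set, first_pos = sets[0]
    # Each candidate is the position information of the query's first character.
    candidates = {info - (first_pos - 1) for info in search_file.get(first_set, ())}
    for char_set, pos in sets[1:]:
        postings = set(search_file.get(char_set, ()))
        candidates = {c for c in candidates if c + (pos - 1) in postings}
    return sorted(divmod(c, N) for c in candidates)  # equation (9)


search_file = build_search_file({100: "Electronic Mail", 101: "Selected Essays"})
hits = search(search_file, "Electro")
```

With these two sample titles, only record 100 contains "Electro", starting at character position 1. A query containing a character set that is absent from the search file is rejected as soon as that set is reached, because the absent set sorts first.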
As in the first embodiment, search processing for other phonetic scripts can be performed in the same way.
When high-speed search is required, increasing the number of characters making up a character set further lowers the occurrence frequency of each character set and reduces the amount of character set position information stored in each character set type group, so that higher speed can readily be achieved.
Next, a fifth embodiment will be described.
The fifth embodiment relates to the fourth embodiment in the same way as the second embodiment relates to the first: for Japanese search processing, character sets of two characters are used and the search file is created according to the JIS code table.
That is, consider a book search system in which the keyword attribute area sizes of the keyword string are book name = 64 bytes (32 characters), author name = 32 bytes (16 characters), publisher name = 64 bytes (32 characters), publication year = 8 bytes (4 characters), and abstract = 400 bytes (200 characters). For the record with record number 100 having "book name = 通信文書の構造", "author name = 田中一郎", "publisher = 太平洋出版", "publication year = 1990", and "abstract = 初めての人にも〜てしている", the keyword string is as shown in FIG. 21, in the same manner as in the fourth embodiment. Since the keyword string is then 578 bytes (289 characters), the character set position information of each character set is created as shown in FIG. 22.
FIG. 23 shows an example of a search file in which the character set position information for the book name "通信文書の構造" is registered.
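The figure of 289 characters is consistent with the five areas (32 + 16 + 32 + 4 + 200 = 284 characters) if one separator position follows each area. That separator is an inference from the numbers, not something stated in this passage, but the same assumption also reproduces the fourth embodiment's ranges (1 to 64, 66 to 97, and so on) and n = 1169 when its abstract area is taken as 1000 characters. A sketch with the hypothetical helper `attribute_layout`:

```python
def attribute_layout(sizes):
    """Compute (first, last) positions per attribute and the total length n.

    sizes: list of (attribute name, size in characters), in keyword order.
    Assumes one separator position after each area (inferred, see above).
    """
    layout, start = {}, 1
    for name, size in sizes:
        layout[name] = (start, start + size - 1)
        start += size + 1  # skip the assumed separator position
    return layout, start - 1


fifth_layout, fifth_n = attribute_layout([
    ("book name", 32), ("author name", 16), ("publisher name", 32),
    ("publication year", 4), ("abstract", 200),
])
fourth_layout, fourth_n = attribute_layout([
    ("book name", 64), ("author name", 32), ("publisher name", 64),
    ("publication year", 4), ("abstract", 1000),
])
```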
The search file creation procedure and the search procedure of the fifth embodiment are the same as those of the fourth embodiment, differing only in the keyword character count and the keyword attribute area settings.
As described for the second embodiment, using a search file of two-character sets is effective for search processing of Japanese documents, which use kana and kanji and therefore have far more character types than European scripts. As noted for the third embodiment, a character set search file according to the fifth embodiment may be used for kana only, with a character type group search file in units of one character according to the sixth embodiment used for kanji.
Next, a sixth embodiment will be described.
The sixth embodiment stands in the same relation as the third embodiment does to the first and second embodiments: for Japanese containing kanji, it uses a search file composed of character type groups that store character position information in units of one character.
Given the record of the keyword string shown in FIG. 21 for the fifth embodiment, the sixth embodiment creates character position information in units of one character, so the character position information is the numeric code given by
character position information code = record number × n + (Pa − 1) + p
n: keyword string character count
Pa: leading character position, within the keyword string, of the keyword attribute area for keyword attribute number a
p: character position number
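The code can be computed directly from the formula, as sketched below. The function name `character_position_code` is hypothetical, and p is read here as the character's position counted from the start of its attribute area, which is the reading under which (Pa − 1) + p gives the position within the keyword string.

```python
def character_position_code(record_number, n, area_start, p):
    """Encode record number x n + (Pa - 1) + p.

    n:          keyword string character count
    area_start: Pa, leading position of the keyword attribute area
    p:          character position number within that area (assumed reading)
    """
    return record_number * n + (area_start - 1) + p


# First character of the book name (Pa = 1) of record 100, with n = 289.
first_code = character_position_code(100, 289, 1, 1)
```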
Accordingly, given the record of the keyword string shown in FIG. 21 for the fifth embodiment, the character position information is structured as shown in FIG. 24. FIG. 25 shows an example in which the character position information for the book name "通信文書の構造" is registered in the search file.
Flowcharts of the search file creation processing in the sixth embodiment are shown in FIGS. 26a and 26b, and flowcharts of the search processing in FIGS. 27a and 27b.
The search file creation and search procedures are basically the same as in the fourth embodiment. They differ in that the search file is composed of character type groups in units of one character and is organized on the basis of the JIS code for Japanese processing.
[Industrial Applicability]
In the present invention, a search file is created that stores, for each character set type of the character strings to be searched, character set position information consisting of the identification code of the search unit to which the character set belongs, a character set position order code, and an attribute number indicating the type of the search unit. For each character set type making up a search input character string, the corresponding character set position information is retrieved from this search file, and character strings matching the search input are found. For scripts with many character types, a search file storing character position information by character type is created, the character position information is retrieved for each character type making up the search input character string, and character strings matching the search input are found.
The present invention therefore has the following advantages.
(1) Because the number of character string collations required for search processing can be reduced, high-speed collation is possible.
(2) Because search processing operates on character sets and character positions, arbitrary character strings can be searched, and there is no need to extract character strings at registration time as in the index method or pre-search method of full-text search.
(3) High-speed search can be achieved in software alone without dedicated hardware, so full-text search can be performed efficiently on general-purpose information processing equipment, giving wide applicability.
(4) When partial-match search using multiple keywords is performed in a database system, no huge index of character strings for partial-match search is needed, unlike the conventional index method, and the search file can be created automatically from the keywords of the records to be searched, so the database system can be built economically.
(5) When applied to a full-text search database system, no keyword extraction is needed to create the search file, and the search file can be created automatically from machine-input character strings such as papers, so the database system can be built economically and efficiently.
(6) Even for character strings in scripts with few character types, such as European scripts, creating and searching a search file that stores character set position information by character set type group keeps the occurrence frequency of each character set low, because the same character sequence recurs infrequently. Collation can then proceed from infrequent character sets, enabling high-speed search.
(7) Search processing need only retrieve the character position information or character set position information for the characters or character sets of the search input character string. Even when that information resides on an external storage device, little time is needed to transfer the relevant search file contents to main memory, and search processing can be sped up.
Claims
1. An information search processing system comprising:
search unit identification code assigning means for dividing a character string to be searched into search units, each being a unit of search, and assigning codes in ascending order to the search units;
attribute code assigning means for assigning to each of the divided search units an attribute code indicating the logical category of that search unit;
character set position order code assigning means for forming, for each character of the character string to be searched, a character set consisting of that character and the characters following it, r characters in total (where r is a natural number of 2 or more), and assigning a character set position order code indicating the position of the leading character of the character set within the search unit to which the character set belongs; and
means for creating character set position information consisting of the search unit identification code, the character set position order code, and the attribute code, and creating a search file by storing the character set position information in an area for each character set type.
2. The information search processing system according to claim 1, wherein the character set position information is given as the numeric code
{(search unit identification code × n) + character set position order code} × a + attribute code
n: maximum number of characters in a search unit
a: maximum number of attributes.
3. An information search processing system comprising:
a search file storing, for each character set type, character set position information created for each character set making up a character string to be searched, the character set position information consisting of a search unit identification code assigned in ascending order to the search unit, which consists of a sequence of character sets and is the unit of search, a character set position order code indicating the position of the leading character of the character set within the search unit, and an attribute code indicating the logical category of the search unit;
means for forming a search input character set sequence by decomposing the characters of a search input character string, from the leading character, into character sets of r characters each, and retrieving from the search file the character set position information of the character sets identical to the decomposed character sets;
means for extracting, from among the retrieved character set position information of the character sets, combinations of character set position information for which the search unit identification code is common, the difference in character set position order codes equals the difference in leading character positions of the corresponding character sets in the search input character string, and the attribute code equals that of the search input; and
means for outputting as the search result, on the basis of the extracted combinations of character set position information, the search unit to which the character set sequence belongs and the character positions indicating the position, from the leading character of the search unit, of each character making up each character set.
4. The information search processing system according to claim 3, wherein the extraction of combinations of character set position information that can form the same character set sequence as the search input character set sequence is performed in order starting from the character set with the lowest occurrence frequency in the search input.
5. The information search processing system according to claim 3 or claim 4, wherein the extraction of combinations of character set position information that can form the same character set sequence as the search input character set sequence is performed by extracting combinations of character set position information satisfying
(character set position information of the character set with character set position order code i) − (character set position information of the character set with character set position order code j) = (i − j) × (maximum number of attributes),
where i is the character set position order code of the character set with the lower occurrence frequency and j is the character set position order code of the character set with the higher occurrence frequency.
6. The information search processing system according to any one of claims 1 to 5, wherein, when the character string to be searched is a European-script character string including symbols, a search file is used that contains only character set types of European characters including symbols, with character sets of at least three characters or symbols.
7. The information search processing system according to any one of claims 1 to 5, wherein, when the character string to be searched is a Japanese character string including kanji, a search file composed of character set types of two characters each is used.
8. The information search processing system according to any one of claims 1 to 5, wherein, when the character string to be searched is a Japanese character string including kanji, a search file composed of character set types of at least two characters each is used for kana characters.
9. An information search processing system comprising:
search unit identification code assigning means for dividing a character string to be searched into search units, each being a unit of search, and assigning codes in ascending order to the search units;
attribute code assigning means for assigning to each of the divided search units an attribute code indicating the logical category of that search unit;
character position order code assigning means for assigning, to each character of the character string to be searched, a character position order code indicating its position within the search unit; and
means for creating character position information consisting of the search unit identification code, the character position order code, and the attribute code, and creating a search file by storing the character position information in an area for each character type.
10. The information search processing system according to claim 9, wherein the character position information is given as the numeric code
{(search unit identification code × n) + character position order code} × a + attribute code
n: maximum number of characters in a search unit
a: maximum number of attributes.
11. An information search processing system comprising:
a search file storing, for each character type, character position information for each character making up a character string to be searched, the character position information consisting of a search unit identification code assigned in ascending order to the search unit, which is the unit of character search, a character position order code indicating the position of the character within the search unit, and an attribute code indicating the logical category of the search unit;
means for retrieving from the search file the character position information of the characters identical to the characters making up a search input character string;
means for extracting, from among the retrieved character position information of the characters, combinations of character position information for which the search unit identification code is common and the character position order codes correspond to the search input character string; and
means for outputting as the search result, on the basis of the extracted combinations of character position information, the search unit to which the character string belongs and the character positions.
12. The information search processing system according to claim 11, wherein the extraction of combinations of character position information that can form the search input character string is performed in order starting from the search input character with the lowest occurrence frequency.
13. The information search processing system according to claim 11 or claim 12, wherein the extraction of combinations of character position information that can form the search input character string is performed by extracting combinations of character position information satisfying
(character position information of the character with character position order code i) − (character position information of the character with character position order code j) = (i − j) × (maximum number of attributes),
where i is the character position order code of the character with the lower occurrence frequency and j is the character position order code of the character with the higher occurrence frequency.
14. An information retrieval processing system comprising:
record identification code assigning means for assigning a code in ascending order to each record to be searched;
keyword attribute code assigning means for assigning, to each keyword of the record, an attribute code indicating the logical division of the keyword;
character set position order code assigning means for forming, for each character of the keyword, a character set consisting of that character and the characters that follow it, r characters in total (where r is a natural number of 2 or more), and for assigning to the character set a character set position order code indicating the position within the keyword of the first character of the character set; and
means for creating character set position information consisting of the record identification code, the keyword attribute code, and the character set position order code, and for creating a search file by storing the character set position information in an area for each character set type.
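As an illustration of the search file creation in claim 14, the sketch below splits each keyword into overlapping r-character sets and files each one under its character set type. It is a sketch under assumptions: the function names and the dictionary-based record format are hypothetical, and trailing windows shorter than r characters are simply skipped, a detail the claim does not specify.

```python
from collections import defaultdict


def character_sets(keyword, r=2):
    """Yield (position, character_set) for each full r-character window.

    Positions are 1-based and refer to the first character of the set.
    Assumption: windows shorter than r at the end of the keyword are skipped.
    """
    for start in range(len(keyword) - r + 1):
        yield start + 1, keyword[start:start + r]


def build_character_set_file(records, r=2):
    """records: list of dicts mapping keyword attribute code -> keyword.

    Record identification codes are assigned in ascending order from 1.
    """
    search_file = defaultdict(list)
    for record_id, keywords in enumerate(records, start=1):
        for attribute, keyword in keywords.items():
            for position, cset in character_sets(keyword, r):
                search_file[cset].append((record_id, attribute, position))
    return search_file


records = [{1: "patent", 2: "tent"}]
search_file = build_character_set_file(records, r=2)
print(search_file["te"])  # [(1, 1, 3), (1, 2, 1)]
```

Indexing r-character sets instead of single characters shortens each posting list, at the cost of a larger number of distinct character set types.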
15. The information retrieval processing system according to claim 14, wherein, for every character set constituting each keyword of a keyword string in which the keywords constituting a record are arranged in keyword attribute areas corresponding to their keyword attribute codes, the character set position information is given as the numeric code
(record identification code) × n + (Pa - 1) + (character set position order code)
where
n: number of characters in the keyword string
Pa: first character position, within the keyword string, of the keyword attribute area for keyword attribute code a.
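The numeric code of claim 15 packs the record, the attribute area, and the position into one integer. The sketch below works through the arithmetic; the function names `encode` and `decode`, the sample layout (n = 20, attribute area 2 starting at position 11), and the assumption that every position stays within 1..n are illustrative and not taken from the patent.

```python
def encode(record_id, area_start, position, n):
    """record_id * n + (Pa - 1) + position, with Pa given as area_start."""
    return record_id * n + (area_start - 1) + position


def decode(code, n):
    """Recover (record_id, position within the keyword string).

    Assumes the position within the keyword string lies in 1..n, so the
    code can be split unambiguously with one division.
    """
    record_id, offset = divmod(code - 1, n)
    return record_id, offset + 1


# Hypothetical layout: keyword string of n = 20 characters,
# attribute area 2 begins at keyword string position 11.
code = encode(record_id=3, area_start=11, position=2, n=20)
print(code)              # 72
print(decode(code, 20))  # (3, 12)
```

Because codes in the same record differ by exactly the distance between the positions they describe, adjacency of two character sets can be checked by subtracting their codes.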
16. An information retrieval processing system comprising:
a search file that stores, for each character set type, character set position information for the keyword strings of the records to be searched, the character set position information consisting, for each character constituting each keyword, of a record identification code assigned in ascending order to each record, a keyword attribute code indicating the logical division of the keywords of the record, and a character set position order code indicating the position within the keyword of the first character of the character set;
means for forming a search input character set string by decomposing the constituent characters of a search input character string, starting from the first character, into character sets of r characters each, and for retrieving from the search file the character set position information of the character sets identical to the decomposed character sets;
means for extracting, from the retrieved character set position information of each character set, combinations of character set position information that share a common record identification code and keyword attribute code, whose character set position order codes differ by the same amount as the first-character positions of the corresponding character sets in the search input character string, and whose keyword attribute code equals that of the search input; and
means for outputting, as a search result, the record identification code corresponding to the search input, based on the extracted combinations of character set position information.
17. The information retrieval processing system according to claim 16, wherein the combinations of character set position information that can form the same character set string as the search input character set string are extracted in order, starting from the character set of the search input character set string with the lowest frequency of occurrence across all keywords.
18. The information retrieval processing system according to claim 16 or claim 17, wherein, letting i be the character set position order code of a character set of the search input character set string with a low frequency of occurrence across all keywords and j be the character set position order code of a character set with a high frequency of occurrence, the combinations of character set position information that can form the same character set string as the search input character set string are extracted as those satisfying
(character set position information of the character set with character set position order code i) - (character set position information of the character set with character set position order code j) = i - j.
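The rarest-first extraction of claims 17 and 18 can be sketched as follows. This is a sketch under assumptions rather than the patented procedure: `postings` is a hypothetical mapping from each character set to its list of numeric codes, the query is decomposed into overlapping r-character windows, and only the difference test is applied (the explicit comparison of keyword attribute codes required by claim 16 is left out).

```python
def decompose(query, r=2):
    """Split the query into (position, character_set) windows, 1-based."""
    return [(start + 1, query[start:start + r]) for start in range(len(query) - r + 1)]


def match_codes(postings, query, r=2):
    """Return the numeric codes at which the query's first character set begins."""
    sets = decompose(query, r)
    if not sets:
        return []
    # Rarest first: the character set with the shortest posting list
    # anchors the candidates, so later intersections stay small.
    sets.sort(key=lambda item: len(postings.get(item[1], ())))
    i, rarest = sets[0]
    candidates = set(postings.get(rarest, ()))
    for j, cset in sets[1:]:
        other = set(postings.get(cset, ()))
        # Difference test: keep code_i only if code_i - code_j == i - j.
        candidates = {c for c in candidates if c - (i - j) in other}
        if not candidates:
            break
    # Shift from the anchor position back to the start of the query.
    return sorted(c - (i - 1) for c in candidates)


# Hypothetical codes with n = 20: record 1 holds "patent", record 2 holds "tent".
postings = {"pa": [21], "at": [22], "te": [23, 41], "en": [24, 42], "nt": [25, 43]}
print(match_codes(postings, "tent"))  # [23, 41]
print(match_codes(postings, "aten"))  # [22]
```

Starting from the least frequent character set means the candidate set is never larger than the shortest posting list, which reduces the number of comparisons against the longer lists.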
19. The information retrieval processing system according to any one of claims 14 to 18, wherein, when the keywords are Western-alphabet character strings containing symbols, a search file is used that consists only of character set types of Western characters, including symbols, formed as character sets of at least three characters or symbols.
20. The information retrieval processing system according to any one of claims 14 to 18, wherein, when the keywords are Japanese character strings containing kanji, a search file composed of character set types of two characters each is used.
21. The information retrieval processing system according to any one of claims 14 to 18, wherein, when the keywords are Japanese character strings containing kanji, a search file composed of character set types of at least two characters each is used for kana characters.
22. An information retrieval processing system comprising:
record identification code assigning means for assigning a code in ascending order to each record to be searched;
keyword attribute code assigning means for assigning, to each keyword of the record, an attribute code indicating the logical division of the keyword;
character position order code assigning means for decomposing the keyword into individual characters and assigning to each character a character position order code indicating its position within the keyword; and
means for creating character position information consisting of the record identification code, the keyword attribute code, and the character position order code, and for creating a search file by storing the character position information in an area for each character type.
23. The information retrieval processing system according to claim 22, wherein, for every character constituting each keyword of a keyword string in which the keywords constituting a record are arranged in keyword attribute areas corresponding to their keyword attribute codes, the character position information is given as the numeric code
(record identification code) × n + (Pa - 1) + (character position order code)
where
n: number of characters in the keyword string
Pa: first character position, within the keyword string, of the keyword attribute area for keyword attribute code a.
24. An information retrieval processing system comprising:
a search file that stores, for each character type, character position information for the keyword strings of the records to be searched, the character position information consisting, for each character constituting each keyword, of a record identification code assigned in ascending order to each record, a keyword attribute code indicating the logical division of the keywords of the record, and a character position order code indicating the position of that character within the keyword;
means for retrieving from the search file the character position information of the characters identical to the constituent characters of a search input character string;
means for extracting, from the retrieved character position information of each character, combinations of character position information that share a common record identification code and keyword attribute code, whose character position order codes are in the same order as the characters of the search input character string, and whose keyword attribute code equals that of the search input; and
means for outputting, as a search result, the record identification code corresponding to the search input, based on the extracted combinations of character position information.
25. The information retrieval processing system according to claim 24, wherein the combinations of character position information that can form the search input character string are extracted in order, starting from the search input character with the lowest frequency of occurrence across all keywords.
26. The information retrieval processing system according to claim 24 or claim 25, wherein, letting i be the character position order code of a character with a low frequency of occurrence and j be the character position order code of a character with a high frequency of occurrence, the combinations of character position information that can form the search input character string are extracted as those satisfying
(character position information of the character with character position order code i) - (character position information of the character with character position order code j) = i - j.
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2/338546 | 1990-11-30 | ||
JP2338546A JPH0782504B2 (en) | 1990-11-30 | 1990-11-30 | Information retrieval processing method and retrieval file creation device |
JP2/417609 | 1990-12-12 | ||
JP2417609A JPH07109603B2 (en) | 1990-12-12 | 1990-12-12 | Information retrieval processing method and retrieval file creation device |
Publications (1)
Publication Number | Publication Date |
---|---|
WO1992009960A1 (en) | 1992-06-11 |
Family
ID=26576122
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP1991/000011 WO1992009960A1 (en) | 1990-11-30 | 1991-01-10 | Data retrieving device |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO1992009960A1 (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0595539A1 (en) * | 1992-10-30 | 1994-05-04 | AT&T Corp. | A sequential pattern memory searching and storage management technique |
US5913216A (en) * | 1996-03-19 | 1999-06-15 | Lucent Technologies, Inc. | Sequential pattern memory searching and storage management technique |
CN111369980A (en) * | 2020-02-27 | 2020-07-03 | 网易有道信息技术(北京)有限公司江苏分公司 | Voice detection method and device, electronic equipment and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4554631A (en) * | 1983-07-13 | 1985-11-19 | At&T Bell Laboratories | Keyword search automatic limiting method |
US4606002A (en) * | 1983-05-02 | 1986-08-12 | Wang Laboratories, Inc. | B-tree structured data base using sparse array bit maps to store inverted lists |
JPS6435627A (en) * | 1987-07-31 | 1989-02-06 | Fujitsu Ltd | Data retrieving system |
JPS6435626A (en) * | 1987-07-31 | 1989-02-06 | Fujitsu Ltd | Word retrieving system |
JPS6436329A (en) * | 1987-07-31 | 1989-02-07 | Nec Corp | Character string registration retriever |
-
1991
- 1991-01-10 WO PCT/JP1991/000011 patent/WO1992009960A1/en unknown
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4606002A (en) * | 1983-05-02 | 1986-08-12 | Wang Laboratories, Inc. | B-tree structured data base using sparse array bit maps to store inverted lists |
US4554631A (en) * | 1983-07-13 | 1985-11-19 | At&T Bell Laboratories | Keyword search automatic limiting method |
JPS6435627A (en) * | 1987-07-31 | 1989-02-06 | Fujitsu Ltd | Data retrieving system |
JPS6435626A (en) * | 1987-07-31 | 1989-02-06 | Fujitsu Ltd | Word retrieving system |
JPS6436329A (en) * | 1987-07-31 | 1989-02-07 | Nec Corp | Character string registration retriever |
Non-Patent Citations (2)
Title |
---|
I. FLORES, "Data Management", 10 August 1972, TAKEUCHI SCHOTEN (TOKYO), p. 201-220, (I. FLORES, "Data Structure and Management", 1970, PRENTICE-HALL). * |
MASAYUKI TAKEDA, "High-speed Pattern Matching Algorithm for Total Text Processing", 1991, Treatises from Informatics Symposium Lecture, 8 January 1991. * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0595539A1 (en) * | 1992-10-30 | 1994-05-04 | AT&T Corp. | A sequential pattern memory searching and storage management technique |
US5913216A (en) * | 1996-03-19 | 1999-06-15 | Lucent Technologies, Inc. | Sequential pattern memory searching and storage management technique |
CN111369980A (en) * | 2020-02-27 | 2020-07-03 | 网易有道信息技术(北京)有限公司江苏分公司 | Voice detection method and device, electronic equipment and storage medium |
CN111369980B (en) * | 2020-02-27 | 2023-06-02 | 网易有道信息技术(江苏)有限公司 | Voice detection method, device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Robertson et al. | Applications of n‐grams in textual information systems | |
US4775956A (en) | Method and system for information storing and retrieval using word stems and derivative pattern codes representing familes of affixes | |
JP3160201B2 (en) | Information retrieval method and information retrieval device | |
US5590317A (en) | Document information compression and retrieval system and document information registration and retrieval method | |
US5995962A (en) | Sort system for merging database entries | |
US5523946A (en) | Compact encoding of multi-lingual translation dictionaries | |
US20090193005A1 (en) | Processor for Fast Contextual Matching | |
JPH08249354A (en) | Word index and word index generating device and document retrieval device | |
Keskustalo et al. | Non-adjacent digrams improve matching of cross-lingual spelling variants | |
JP2669601B2 (en) | Information retrieval method and system | |
JP2833580B2 (en) | Full-text index creation device and full-text database search device | |
JP3220865B2 (en) | Full text search method | |
JPH04205560A (en) | Information retrieval processing system | |
JPH0740275B2 (en) | Keyword automatic evaluation system | |
Hockey et al. | The Oxford concordance program version 2 | |
Robertson et al. | A comparison of spelling-correction methods for the identification of word forms in historical text databases | |
JP2519130B2 (en) | Multi-word information retrieval processing method and retrieval file creation device | |
JP2519129B2 (en) | Multi-word information retrieval processing method and retrieval file creation device | |
WO1992009960A1 (en) | Data retrieving device | |
JPH04326164A (en) | Data base retrieval system | |
JP3081093B2 (en) | Index creation method and apparatus and document search apparatus | |
JPH04215181A (en) | Information retrieval processing system | |
JPH03150668A (en) | Input character string normalization system for retrieval system | |
JPH10177575A (en) | Device and method for extracting word and phrase and information storing medium | |
Williams et al. | Document retrieval using a substring index |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | AK | Designated states | Kind code of ref document: A1; Designated state(s): CA US |
 | AL | Designated countries for regional patents | Kind code of ref document: A1; Designated state(s): DE FR GB |
 | NENP | Non-entry into the national phase | Ref country code: CA |