
CN118069590A - Forward index processing method, device, medium and equipment for searching database - Google Patents

Info

Publication number
CN118069590A
Authority
CN
China
Prior art keywords
forward index
index structure
data source
written
document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410479400.9A
Other languages
Chinese (zh)
Other versions
CN118069590B (en)
Inventor
曾勇
刘佳鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Extreme Data Beijing Technology Co ltd
Original Assignee
Extreme Data Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Extreme Data Beijing Technology Co ltd
Priority to CN202410479400.9A
Publication of CN118069590A
Application granted
Publication of CN118069590B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G06F16/11 File system administration, e.g. details of archiving or snapshots
    • G06F16/128 Details of file system snapshots on the file-level, e.g. snapshot creation, administration, deletion
    • G06F16/14 Details of searching files based on file metadata
    • G06F16/17 Details of further file system functions
    • G06F16/172 Caching, prefetching or hoarding of files
    • G06F16/18 File system types
    • G06F16/1805 Append-only file systems, e.g. using logs or journals to store data
    • G06F16/1815 Journaling file systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a forward index processing method, device, medium and equipment for a search database. When a background thread detects that a document has been written to the write-ahead log (WAL), it creates a third forward index structure and writes the document into it. When the first forward index structure reaches its data-volume threshold, it stops accepting writes and signals that the third forward index structure should also be built up to the same position in the write-ahead log. The completed third forward index structure is then written to the disk of the data source; once the write is complete and the structure is readable, a new data source is created and the data source is switched. In the invention, the forward index and the inverted index are logically separated, and documents are written into a purpose-built forward index structure, which makes processing flexible when only the forward index is involved. In addition, the created forward index structures can be searched in real time, which improves search efficiency.

Description

Forward index processing method, device, medium and equipment for searching database
Technical Field
The present invention relates to the field of database technologies, and in particular, to a method, an apparatus, a medium, and a device for processing a forward index for searching a database.
Background
Lucene is an open-source search engine library used by many search engines. In Lucene's terminology, a segment is the smallest unit that is written and searched. When a new document is added, a new segment is created for it in memory, and its content becomes searchable only after the in-memory segment has been written to disk. Lucene therefore cannot search in real time: its memory serves only as a cache for accumulating batches, and searches can operate only on segments that have been flushed to disk.
Moreover, when a search is performed on an added document, the forward index created from the document and its inverted index are operated on together; that is, the forward index and the inverted index are not logically separated, so there is no flexibility when only the forward index needs to be processed.
Disclosure of Invention
Based on this, it is necessary to provide a forward index processing method, apparatus, medium and device for a search database, so as to solve the problems that processing a standalone forward index structure lacks flexibility and that a document can be searched only after it has been flushed to disk.
A forward index processing method for a search database, wherein the search database comprises a memory, a disk, a write-ahead log and a background thread; data in the memory and on the disk together serve as a data source, and an atomic variable tracks the latest data source; as data is continuously written, new data sources are created, the data source version increases accordingly, and the atomic variable is updated accordingly; two forward index structures that can be searched in real time are initially created in the memory as a first data source, and the atomic variable initially tracks the first data source; the method comprises the following steps:
Acquiring an uploaded document;
Writing the document into the first forward index structure in the memory of the latest data source tracked by the atomic variable, and writing the document into the write-ahead log; the first forward index structure is the one of the two forward index structures in the data source that is currently being written, the second forward index structure is the one that is not currently being written, and the first and second forward index structures have the same structure;
When the background thread detects that the document has been written to the write-ahead log, creating a third forward index structure through the background thread, and writing the document into the created third forward index structure;
When the amount of data written to the first forward index structure reaches a threshold, preparing to switch the data source and update the latest data source: the first forward index structure stops accepting writes and signals that the third forward index structure should also be built up to the same position in the write-ahead log; the completed third forward index structure is written to the disk; after the disk write is complete and the structure is readable, a second data source is created, which, relative to the first data source, adds the third forward index structure, removes the first forward index structure, takes the current second forward index structure as the updated first forward index structure, and takes a newly created forward index structure as the updated second forward index structure; once the second data source has been created, the atomic variable that tracks the latest data source is atomically updated.
In one embodiment, the search database further comprises a search engine, the method further comprising:
If a real-time search request from a client is received while documents are being written, the search engine accesses the latest data source through the atomic variable and creates a snapshot of that data source, the snapshot preserving the state of the data source at the moment the search engine accessed it; within the snapshot, the search engine accesses both the memory and the disk, reading and searching the in-memory forward index structures relevant to the real-time search request to obtain a first search result, and reading and searching the on-disk forward index structures relevant to the real-time search request to obtain a second search result;
Merging the first search result and the second search result to obtain a real-time search response, and returning the real-time search response to the client.
In one embodiment, the method further comprises: the snapshot is created on the latest data source accessed by the search engine; when the data source is switched, a new data source is created, and the old data source is deleted only after the switch and after all searches that created snapshots on the old data source have completed.
In one embodiment, the method further comprises: when a deletion operation is performed on content in a written document, adding a deletion mark at the metadata layer for the content to be deleted, wherein the deletion mark is part of a forward index structure in the data source, and the snapshot created when the search database responds to a search includes the deletion mark; wherein the written document is a document that has been written to the search database, and content carrying the deletion mark cannot be read during a search.
In one embodiment, in the forward index structure in the data source, each document is recorded with a timestamp of when it was completely written, the method further comprising:
Each time a modification operation is executed on a written document, acquiring a modified document corresponding to the written document; the written document is a document written in a search database, and the modified document is a document obtained by modifying the written document currently; the time stamp is part of a forward index structure in the data source, and when the search database responds to the search, the created snapshot contains the time stamp;
After the modified document is completely written, recording a time stamp of the modified document when the modified document is completely written as an updated target time stamp;
and when searching the written document, reading the document corresponding to the current target timestamp.
In one embodiment, writing the document into the first forward index structure in the memory of the data source and into the write-ahead log includes:
after a field-level real-time processing function is enabled, determining the real-time fields and the non-real-time fields in the document;
and writing the real-time fields of the document into the first forward index structure in the memory of the data source, and writing both the real-time fields and the non-real-time fields of the document into the write-ahead log.
In one embodiment, the method further comprises:
when a non-real-time field is searched in real time, if an exact search result is not required, reading the corresponding forward index structure from the disk of the data source and searching it;
if an exact search result is required, reading the documents from both the forward index structure on the disk of the data source and the write-ahead log, and searching them.
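The field-level embodiment above can be illustrated with a minimal Python sketch. All names (`write`, `search_non_realtime`, the `title` and `body` fields) are hypothetical, plain dicts and lists stand in for the forward index structures and the write-ahead log, and documents are assumed never to be modified, so an exact search is simply the union of on-disk hits and hits replayed from the log.

```python
def write(doc_id, doc, realtime_fields, memory, wal):
    """Keep only real-time fields in memory; log the complete document."""
    memory[doc_id] = {k: v for k, v in doc.items() if k in realtime_fields}
    wal.append((doc_id, dict(doc)))


def search_non_realtime(field, value, disk, wal, exact):
    """Search a non-real-time field.

    exact=False: consult only the on-disk forward index (may miss recent writes).
    exact=True:  additionally scan the write-ahead log for documents not yet on disk.
    """
    hits = {doc_id for doc_id, doc in disk.items() if doc.get(field) == value}
    if exact:
        hits |= {doc_id for doc_id, doc in wal if doc.get(field) == value}
    return sorted(hits)


memory, wal = {}, []
disk = {"d1": {"title": "a", "body": "old"}}          # already archived document
write("d2", {"title": "b", "body": "old"}, {"title"}, memory, wal)

approx = search_non_realtime("body", "old", disk, wal, exact=False)
exact = search_non_realtime("body", "old", disk, wal, exact=True)
```

The approximate search is cheaper because it skips the log scan; the exact search pays that cost to include documents whose non-real-time fields exist only in the log so far.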
A forward index processing device for a search database, wherein the search database comprises a memory, a disk, a write-ahead log and a background thread; data in the memory and on the disk together serve as a data source, and an atomic variable tracks the latest data source; as data is continuously written, new data sources are created, the data source version increases accordingly, and the atomic variable is updated accordingly; two forward index structures that can be searched in real time are initially created in the memory as a first data source, and the atomic variable initially tracks the first data source; the forward index processing device comprises:
A first writing module, configured to acquire an uploaded document, write the document into the first forward index structure in the memory of the latest data source tracked by the atomic variable, and write the document into the write-ahead log; the first forward index structure is the one of the two forward index structures in the data source that is currently being written, the second forward index structure is the one that is not currently being written, and the first and second forward index structures have the same structure;
A second writing module, configured to create a third forward index structure through the background thread when the background thread detects that the document has been written to the write-ahead log, and to write the document into the created third forward index structure; and, when the amount of data written to the first forward index structure reaches a threshold, to prepare to switch the data source and update the latest data source: the first forward index structure stops accepting writes and signals that the third forward index structure should also be built up to the same position in the write-ahead log; the completed third forward index structure is written to the disk; after the disk write is complete and the structure is readable, a second data source is created, which, relative to the first data source, adds the third forward index structure, removes the first forward index structure, takes the current second forward index structure as the updated first forward index structure, and takes a newly created forward index structure as the updated second forward index structure; once the second data source has been created, the atomic variable that tracks the latest data source is atomically updated.
A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the forward index processing method of searching a database described above.
A forward index processing device for searching a database, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the forward index processing method for searching a database described above.
The invention provides a forward index processing method, device, medium and equipment for a search database, in which a document is written into the first forward index structure in the memory of the latest data source tracked by the atomic variable and, at the same time, into the write-ahead log. When the background thread detects that the document has been written to the write-ahead log, it creates a third forward index structure and writes the document into it. When the first forward index structure reaches the data-volume threshold, the data source is prepared for switching: the first forward index structure stops accepting writes and signals that the third forward index structure should also be built up to the same position in the write-ahead log, and the completed third forward index structure is written to the disk. After the disk write is complete and the structure is readable, a second data source is created which, relative to the first data source, removes the first forward index structure, takes the current second forward index structure as the updated first forward index structure, and takes a newly created forward index structure as the updated second forward index structure. The creation of the new data source is then complete, and the atomic variable that tracks the latest data source is atomically updated. In the invention, the forward index and the inverted index are logically separated, and documents are written into a purpose-built forward index structure, which makes processing flexible when only the forward index is involved. In addition, the created forward index structures can be searched in real time, which avoids the situation in which a search is possible only after data has been flushed to disk and improves search efficiency.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Wherein:
FIG. 1 is a flow chart of a forward index processing method for searching a database;
FIG. 2 is an overall process flow for forward indexing;
FIG. 3 is a flow chart of writing a document into the first forward index structure in the memory of the latest data source and into the write-ahead log;
FIG. 4 is a flow diagram of a background thread detecting and creating a third forward index structure for a document;
FIG. 5 is a schematic flow chart of writing a third forward index structure created to disk and creating a new data source;
FIG. 6 is a schematic flow chart of switching data sources, wherein two data sources coexist;
FIG. 7 is a schematic diagram of a process for searching a database;
FIG. 8 is a schematic diagram of a forward index processing device for searching a database;
FIG. 9 is a block diagram of the structure of a forward index processing apparatus for searching a database.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The terms first, second and the like in the description and in the claims and in the above-described figures are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
Referring to FIG. 1, FIG. 1 is a flowchart of a forward index processing method for a search database according to an embodiment. FIG. 2 shows the overall processing flow of the forward index. Optionally, the search database is named Pizza; it includes the memory and the disk of the data source, a write-ahead log (WAL), and a background thread (Background Thread). Two forward index structures that can be searched in real time are initially created in the memory of the data source, and forward index structures that can be searched in real time are written to the disk of the data source during subsequent processing. In this embodiment, the created forward index structure has two characteristics: it can be searched in real time, and it is separate from other indexes such as the inverted index, supporting only the writing of content related to the forward index.
The method for processing the forward index for searching the database in the embodiment comprises the following steps:
S101, acquiring the uploaded document.
Optionally, the search database may acquire uploaded documents through a user interface, an application program interface, batch import, integration with a third-party service, or other means. Which approach is used depends on the design and usage scenario of the database.
S102, writing the document into the first forward index structure in the memory of the latest data source tracked by the atomic variable, and writing the document into the write-ahead log.
The first forward index structure is the one of the two forward index structures that is currently being written, the second forward index structure is the one that is not currently being written, and the first and second forward index structures have the same structure.
As shown in FIG. 3, in this embodiment the forward index structure in the memory of the latest data source tracked by the atomic variable is called a Realtime Docstore. A Realtime Docstore is very fast to write and may be implemented as a hash table, a skip list or another data structure. Here, Realtime Docstore1 is the first forward index structure because it is currently being written, and Realtime Docstore2 is the second forward index structure because it is not currently being written. Both Realtime Docstore1 and Realtime Docstore2 are searchable in real time while this step is performed, but only Realtime Docstore1 has content written to it, so only queries against Realtime Docstore1 return valid content.
At the same time, the document is also written to a write-ahead log (Write-Ahead Log, WAL), which serves as a staging area that temporarily holds the document. The advantage is that even if a system crash or power failure occurs before the data is written to disk, the search database Pizza can still recover from the information in the write-ahead log, which ensures the consistency and durability of the database.
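Step S102 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the class and function names are hypothetical, a dict stands in for the hash table, and an in-memory list stands in for the log file. Appending to the log before updating memory is also an assumption; the text only says both writes happen.

```python
import json


class RealtimeDocstore:
    """In-memory forward index: document id -> stored fields (a dict here)."""

    def __init__(self, name):
        self.name = name
        self.docs = {}

    def put(self, doc_id, fields):
        self.docs[doc_id] = fields

    def get(self, doc_id):
        return self.docs.get(doc_id)


class WriteAheadLog:
    """Append-only record list; a real log would append to a file and sync it."""

    def __init__(self):
        self.records = []

    def append(self, doc_id, fields):
        self.records.append(json.dumps({"id": doc_id, "fields": fields}))
        return len(self.records)  # position just past the new record


def write_document(active, wal, doc_id, fields):
    """Log the document, then make it searchable in the active structure."""
    position = wal.append(doc_id, fields)
    active.put(doc_id, fields)
    return position


rt1 = RealtimeDocstore("Realtime Docstore1")  # first structure: receives writes
rt2 = RealtimeDocstore("Realtime Docstore2")  # second structure: idle for now
wal = WriteAheadLog()
pos = write_document(rt1, wal, "doc-1", {"title": "hello"})
```

After the call, `rt1.get("doc-1")` returns the stored fields immediately, while `rt2` stays empty but remains queryable, matching the description that both structures are searchable and only the first holds content.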
S103, when the background thread detects that the document has been written to the write-ahead log, creating a third forward index structure through the background thread, and writing the document into the created third forward index structure.
As shown in FIG. 4, a background thread in the system is responsible for monitoring write operations in the write-ahead log; when it detects that documents have been written, it creates a third forward index structure for those documents.
It will be appreciated that both forward index structures, Realtime Docstore1 and Realtime Docstore2, are searchable in real time while this step is performed.
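The background thread of step S103 can be sketched as a follower that replays log records into the third forward index structure. The `WalFollower` class and its attributes are hypothetical names, and the log is modeled as a shared Python list of `(doc_id, fields)` tuples; a production system would tail a file and run continuously rather than for a single catch-up call.

```python
import threading


class WalFollower:
    """Replays write-ahead log records into a third forward index structure."""

    def __init__(self, wal_records):
        self.wal = wal_records  # shared append-only list of (doc_id, fields)
        self.applied = 0        # log position already replayed
        self.third = {}         # third forward index structure being built

    def catch_up_to(self, position):
        """Replay records until the structure reflects the log up to `position`."""
        while self.applied < position:
            doc_id, fields = self.wal[self.applied]
            self.third[doc_id] = fields
            self.applied += 1
        return self.applied


wal_records = [("doc-1", {"title": "a"}), ("doc-2", {"title": "b"})]
follower = WalFollower(wal_records)

# Run the replay on a background thread, as the description suggests.
worker = threading.Thread(target=follower.catch_up_to, args=(len(wal_records),))
worker.start()
worker.join()
```

Tracking an explicit `applied` position is what later allows the first forward index structure to ask that the third structure be built "to the same position" in the log before it is written to disk.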
S104, when the amount of data written to the first forward index structure reaches a threshold, the data source is prepared for switching and the latest data source is updated: the first forward index structure stops accepting writes and signals that the third forward index structure should also be built up to the same position in the write-ahead log; the completed third forward index structure is written to the disk; after the disk write is complete and the structure is readable, a second data source is created, which, relative to the first data source, adds the third forward index structure, removes the first forward index structure, takes the current second forward index structure as the updated first forward index structure, and takes a newly created forward index structure as the updated second forward index structure; once the second data source has been created, the atomic variable that tracks the latest data source is atomically updated.
Specifically, once the first forward index structure Realtime Docstore1 is full (i.e., the amount of data written reaches the threshold), it becomes immutable; the immutable Realtime Docstore1 no longer accepts writes but can still serve searches.
When the amount of data written to the first forward index structure reaches the threshold, the first forward index structure stops accepting writes and signals that the third forward index structure should also be built up to the same position in the write-ahead log, so that the third forward index structure matches the first forward index structure in both structure and written content, which maintains the consistency and integrity of the data. Next, as shown in FIG. 5 (a), the search database Pizza writes the completed third forward index structure to disk so that the data is persisted for long-term storage. At this point the creation of the new data source, i.e., the second data source, is complete. As shown in FIG. 6, the atomic variable that tracks the latest data source is updated so that it tracks the second data source. The previous first data source is not deleted immediately.
As shown in FIG. 5 (b), after the third forward index structure has been completely written to the disk, it forms a forward archive file, Archive Docstore, that can be searched in real time. The Archive Docstore offers a high compression ratio and high-performance reads, and is particularly suitable for storing large-scale structured data. As the above example shows, each Realtime Docstore corresponds one-to-one to the Archive Docstore generated from it on disk. Further, after the disk write is complete and the file is readable, a second data source is created. Relative to the first data source, it contains the Archive Docstore newly written to disk, removes Realtime Docstore1, takes the current second forward index structure Realtime Docstore2 as the updated first forward index structure, and takes a newly created forward index structure, Realtime Docstore3, as the updated second forward index structure to receive subsequent document writes. Thus there are always exactly two Realtime Docstore structures throughout the process: the old forward index structure is flushed to an archive file while a new forward index structure is continuously created.
It will be appreciated that the forward index structures Realtime Docstore2 and Realtime Docstore3 and the forward archive file Archive Docstore are all searchable in real time while this step is performed.
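The data source switch in step S104 can be sketched as building a new immutable data source and swapping a single reference. Everything here is an illustrative assumption: `DataSource`, `AtomicRef` and `switch` are hypothetical names, structures are represented by their names only, a lock-guarded reference stands in for the atomic variable, and `"Archive Docstore1"` is an invented label for the archive produced from Realtime Docstore1.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSource:
    version: int
    realtime: tuple  # (first/active, second/standby) in-memory structures
    archives: tuple  # archive files on disk


class AtomicRef:
    """A lock-guarded reference standing in for the atomic variable."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def get(self):
        with self._lock:
            return self._value

    def set(self, value):
        with self._lock:
            self._value = value


def switch(latest, archive_name, new_standby):
    """Create the next data source and publish it atomically."""
    old = latest.get()
    _sealed, standby = old.realtime  # the sealed first structure is dropped
    new = DataSource(
        version=old.version + 1,
        realtime=(standby, new_standby),         # second becomes first
        archives=old.archives + (archive_name,)  # archive replaces sealed one
    )
    latest.set(new)
    return old, new


latest = AtomicRef(DataSource(1, ("Realtime Docstore1", "Realtime Docstore2"), ()))
old, new = switch(latest, "Archive Docstore1", "Realtime Docstore3")
```

Because each `DataSource` is immutable, a reader that obtained `old` before the swap keeps a consistent view, which is why the old data source does not need to be deleted immediately.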
Further, in one embodiment, the search database also includes a search engine (Query Engine), which is responsible for processing search requests from clients. On this basis, the search database also performs the following steps: as shown in FIG. 7, if a real-time search request from a client is received while documents are being written, the search engine accesses the latest data source through the atomic variable and creates a snapshot of that data source; the snapshot preserves the state of the data source at the moment the search engine accessed it.
Within the snapshot, the search engine accesses both the memory and the disk: it reads and searches the in-memory forward index structures relevant to the real-time search request to obtain a first search result, and reads and searches the on-disk forward index structures relevant to the request to obtain a second search result. The first and second search results are then merged into a real-time search response, which is returned to the client.
In this way, for searches that require only the forward index, the search database can respond to client search requests and return results promptly. Because the search results from the memory and the disk are merged, nothing is missing from the response returned to the client.
Further, the snapshot is created on the latest data source accessed by the search engine. When the data source is switched, a new data source is created, and the old data source is deleted only after the switch and after all searches that created snapshots on the old data source have completed.
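The snapshot search and the deferred deletion of the old data source can be sketched with a reader count. This is a single-threaded illustration with hypothetical names (`SnapshotSource`, `acquire`, `release`, `retire`); a concurrent implementation would need to protect the counter, and the text does not specify how pending searches are tracked.

```python
class SnapshotSource:
    """A data source that is deleted only after it is retired and unused."""

    def __init__(self, version, memory, disk):
        self.version = version
        self.memory = memory   # list of in-memory forward index structures
        self.disk = disk       # list of on-disk forward index structures
        self.readers = 0       # searches currently holding a snapshot
        self.retired = False   # True once a newer data source replaced it
        self.deleted = False

    def acquire(self):
        self.readers += 1
        return self

    def release(self):
        self.readers -= 1
        self._maybe_delete()

    def retire(self):
        self.retired = True
        self._maybe_delete()

    def _maybe_delete(self):
        if self.retired and self.readers == 0:
            self.deleted = True


def search(snapshot, predicate):
    """Search memory and disk within one snapshot and merge the results."""
    first = [d for store in snapshot.memory for d in store.values() if predicate(d)]
    second = [d for store in snapshot.disk for d in store.values() if predicate(d)]
    return first + second


doc = {"id": "a", "tag": "x"}
old = SnapshotSource(1, memory=[{"a": doc}, {}], disk=[])
snap = old.acquire()               # a search starts on the old data source
old.retire()                       # the data source is switched meanwhile
alive_during_search = not old.deleted
hits = search(snap, lambda d: d["tag"] == "x")
snap.release()                     # last search finishes; old source can go
```

The old data source survives the switch for as long as a search holds its snapshot, and is marked deleted as soon as the last such search releases it.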
In the forward index processing method for a search database described above, a document is written into the first forward index structure in the memory of the latest data source tracked by the atomic variable and, at the same time, into the write-ahead log. When the background thread detects that the document has been written to the write-ahead log, it creates a third forward index structure and writes the document into it. When the first forward index structure reaches the data-volume threshold, the data source is prepared for switching: the first forward index structure stops accepting writes and signals that the third forward index structure should also be built up to the same position in the write-ahead log, and the completed third forward index structure is written to the disk. After the disk write is complete and the structure is readable, a second data source is created which, relative to the first data source, removes the first forward index structure, takes the current second forward index structure as the updated first forward index structure, and takes a newly created forward index structure as the updated second forward index structure. The creation of the new data source is then complete, and the atomic variable that tracks the latest data source is atomically updated. The old first data source is not deleted immediately.
In this embodiment, the forward index and the inverted index are logically separated, and documents are written into a purpose-built forward index structure, which makes processing flexible when only the forward index is involved. In addition, the created forward index structures can be searched in real time, which avoids the situation in which a search is possible only after data has been flushed to disk and improves search efficiency.
Further, in a specific embodiment, the following steps are also performed: when a deletion operation is performed on the content in the written document, a deletion mark is added at the metadata layer to the content to be deleted in the written document; wherein the written document is a document written in the search database, and the content added with the deletion mark cannot be read during searching.
Specifically, when handling a deletion operation on the forward index structure, the search database Pizza does not actually delete the already written document data from the database. Instead, it adds a special mark at the metadata level of the document, indicating that the document has been deleted. The deletion marks are part of the forward index structure in the data source, and when the search database responds to a search, the created snapshot contains the deletion marks.
When a search operation is performed, the search database Pizza checks the metadata of the documents, filters the documents which have been marked for deletion, ensures that the documents are not returned to the search results, and thus realizes the logical deletion of the documents.
It will be appreciated that this mark-based deletion handles document deletion effectively while avoiding the data inconsistency and performance problems that directly deleting document data may cause. It also makes it easy to restore a deleted document when necessary, without performing a complicated data recovery operation.
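The logical deletion described above can be illustrated with a short sketch. It assumes, purely for illustration, that deletion marks are kept as a set of document ids alongside the stored documents; the names `ForwardIndex`, `delete`, `restore` and `search` are hypothetical and not taken from the patent.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ForwardIndex:
    docs: Dict[int, dict] = field(default_factory=dict)
    deleted: Set[int] = field(default_factory=set)   # deletion marks (metadata)

    def delete(self, doc_id: int) -> None:
        # Logical delete: only the mark is added, the stored data is untouched.
        if doc_id in self.docs:
            self.deleted.add(doc_id)

    def restore(self, doc_id: int) -> None:
        # Removing the mark is enough to make the document visible again.
        self.deleted.discard(doc_id)

    def search(self, field_name: str, value) -> List[int]:
        # Marked documents are filtered out and never reach the results.
        return sorted(doc_id for doc_id, doc in self.docs.items()
                      if doc_id not in self.deleted
                      and doc.get(field_name) == value)
```

Because the marks live in the same structure as the documents, a snapshot of the structure carries them along without extra bookkeeping.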
Further, in the forward index structure in the data source, each document is recorded with a timestamp at the moment it is completely written, indicating when the document was fully written into the search database; this helps track the versions and change history of the document. The timestamp is part of the forward index structure in the data source, and when the search database responds to a search, the created snapshot contains the timestamp information. In a specific embodiment, the following steps are also performed: each time a modification operation is executed on a written document, a modified document corresponding to the written document is acquired, where the written document is a document already written into the search database and the modified document is the document obtained by the current modification of the written document; after the modified document is completely written, the timestamp recorded when it was completely written is taken as the updated target timestamp; and when the written document is searched, the document corresponding to the current target timestamp is read.
Specifically, when a modification operation is performed on a written document, the search database Pizza acquires the modified document corresponding to the written document. In other words, a modification operation actually creates a new, latest version of the document. The modified document is then written, and once it has been completely written into the search database Pizza, Pizza records the timestamp of that complete write as the updated target timestamp, thereby recording the latest version information of the document. Finally, when a search operation is executed, the search database Pizza reads the document version corresponding to the current target timestamp, so a read returns the latest version of the document rather than an old one.
It will be appreciated that, similar to the delete operation, this modification operation does not directly modify the original document, but creates a new version of the document and records the latest version information of the document in the metadata. This has the advantage that a historical version of the document can be retained and the document of the latest version retrieved when required, while avoiding direct modification of the original document.
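A minimal sketch of this timestamp-based versioning follows. It is an illustration under stated assumptions rather than the patented implementation: a monotonically increasing counter stands in for the real timestamp, and the names `VersionedStore`, `versions` and `target` are hypothetical.

```python
import itertools
from typing import Dict, Optional


class VersionedStore:
    def __init__(self) -> None:
        self._clock = itertools.count(1)              # stand-in for a timestamp source
        self.versions: Dict[int, Dict[int, dict]] = {}  # doc id -> {timestamp: document}
        self.target: Dict[int, int] = {}              # doc id -> target timestamp

    def write(self, doc_id: int, doc: dict) -> int:
        ts = next(self._clock)
        # Store the complete new version first; older versions are kept.
        self.versions.setdefault(doc_id, {})[ts] = dict(doc)
        # Only then record its timestamp as the updated target timestamp.
        self.target[doc_id] = ts
        return ts

    def read(self, doc_id: int) -> Optional[dict]:
        # A read follows the current target timestamp, i.e. the latest version.
        ts = self.target.get(doc_id)
        return None if ts is None else self.versions[doc_id][ts]
```

Updating the target timestamp only after the new version is fully stored means a reader never observes a partially written document.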
Further, the real-time framework of the search database Pizza depends on in-memory structures: the forward index structure must be created in memory to support real-time search, at the cost of higher memory usage. In some scenarios, however, a user needs real-time search only for certain fields. Thus, in one embodiment, the following optimization may be applied to S102: after the field-level real-time processing function is enabled, the real-time fields and non-real-time fields in the document are determined; the real-time fields in the document are written into the first forward index structure in the memory of the latest data source, and both the real-time fields and the non-real-time fields in the document are written into the write-ahead log.
Specifically, after the field-level real-time processing function is enabled, the search database Pizza determines which fields in the document need real-time search (the real-time fields) and which do not (the non-real-time fields). For example, Pizza can decide according to marks supplied with the document: fields carrying a real-time mark are treated as real-time fields, and fields carrying a non-real-time mark are treated as non-real-time fields. Alternatively, it can analyze how often the data of each field changes: if the value of a field changes frequently and the change must be reflected in search results in real time, the field may need real-time search; conversely, if the data of a field changes rarely, or the change has little effect on search results, the field may be considered not to need real-time search. Then only the real-time fields are written into the first forward index structure in memory, while both real-time and non-real-time fields are written into the write-ahead log.
It will be appreciated that this reduces memory usage and improves system performance and stability. Users who need real-time search for only some fields can configure this flexibly according to their needs, saving memory resources.
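The field-level split described above can be sketched as a single write function. The function name `split_write` and its parameters are hypothetical; the set of real-time fields is assumed to be supplied by configuration, and plain dictionaries and lists stand in for the in-memory forward index structure and the write-ahead log.

```python
from typing import Dict, Iterable, List, Tuple


def split_write(doc_id: int,
                doc: dict,
                realtime_fields: Iterable[str],
                memory_index: Dict[int, dict],
                wal: List[Tuple[int, dict]]) -> None:
    """Write real-time fields to memory and the full document to the log."""
    realtime = set(realtime_fields)
    # Only real-time fields occupy memory.
    memory_index[doc_id] = {k: v for k, v in doc.items() if k in realtime}
    # The log always receives the complete document.
    wal.append((doc_id, dict(doc)))
```

For example, with `realtime_fields=["price"]`, a document `{"price": 9, "body": "long text"}` contributes only `price` to the in-memory structure, while the log keeps both fields.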
Further, in a specific embodiment, the following steps are also performed: when a non-real-time field is searched in real time, if an accurate search result is not required, the corresponding forward index structure is read from the disk of the data source snapshot and searched; if an accurate search result is required, the document is read from both the forward index structure on the disk of the data source snapshot and the write-ahead log, and then searched.
Specifically, after the field-level real-time processing function is enabled, searches fall into two categories: (1) searches involving real-time fields; and (2) searches involving non-real-time fields. The process of searching real-time fields is unchanged by the field-level real-time processing function and is not described again. For searches involving non-real-time fields, the user can configure whether accurate results are required. If an accurate result is not required, the search database Pizza reads the corresponding forward index structure from the disk of the data source snapshot and searches it; non-real-time field data whose index structure has not yet been written to disk cannot be found. If an accurate result is required, the search database Pizza obtains data from both the forward index structure on the disk of the data source snapshot and the write-ahead log and searches it, because the complete data is stored in the write-ahead log.
It can be appreciated that such a design can flexibly select a suitable data source for searching according to the accuracy requirement of the user on the search result, thereby meeting the real-time search requirement in different scenes.
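The two search modes for non-real-time fields can be sketched as follows. This is an illustrative sketch only: the function name `search_non_realtime` and the parameter `wal_flushed` (the number of log entries already reflected on disk) are assumptions introduced here, and the patent does not specify how the log and disk results are combined.

```python
from typing import Dict, List, Tuple


def search_non_realtime(field_name: str,
                        value,
                        disk_index: Dict[int, dict],
                        wal: List[Tuple[int, dict]],
                        wal_flushed: int,
                        exact: bool) -> List[int]:
    """Search a non-real-time field on disk, optionally adding the log tail."""
    hits = {doc_id for doc_id, doc in disk_index.items()
            if doc.get(field_name) == value}
    if exact:
        # Replay log entries that have not yet reached the disk structure.
        for doc_id, doc in wal[wal_flushed:]:
            if doc.get(field_name) == value:
                hits.add(doc_id)
            else:
                hits.discard(doc_id)   # a newer logged version supersedes disk
    return sorted(hits)
```

With `exact=False` the search costs a single disk read but may miss recent writes; with `exact=True` it also scans the unflushed part of the log.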
In one embodiment, as shown in fig. 8, a forward index processing device for a search database is provided. The search database comprises a memory, a disk, a write-ahead log and a background thread; data in the memory and on the disk together serve as a data source; an atomic variable tracks the latest data source; as data is continuously written, new data sources are created, the version of the data source increases accordingly, and the atomic variable is updated accordingly; two forward index structures that can be searched in real time are initially created in the memory as a first data source, and the atomic variable initially tracks the first data source. The device includes:
A first writing module 801, configured to acquire an uploaded document, write the document into the first forward index structure in the memory of the latest data source tracked by the atomic variable, and write the document into the write-ahead log; wherein the first forward index structure is the one of the two forward index structures in the data source that is currently being written, the second forward index structure is the one of the two forward index structures in the data source that is not currently being written, and the first forward index structure is the same as the second forward index structure;
A second writing module 802, configured to create a third forward index structure through the background thread when the background thread detects that the document has been written into the write-ahead log, and to write the document into the created third forward index structure; and, when the amount of data written to the first forward index structure reaches a threshold, the data source prepares to switch and the latest data source is updated: the first forward index structure stops accepting writes and a notification is issued so that the third forward index structure is likewise built up to the same position in the write-ahead log; the created third forward index structure is written to the disk; after the disk write is complete and the structure can be read, a second data source is created, wherein, relative to the first data source, the second data source includes the third forward index structure and omits the first forward index structure, the current second forward index structure serves as the updated first forward index structure, and a newly created forward index structure serves as the updated second forward index structure; and when creation of the second data source is complete, the atomic variable tracking the latest data source is updated atomically.
FIG. 9 illustrates an internal block diagram of a forward index processing device for a search database in one embodiment. As shown in fig. 9, the forward index processing device for the search database includes a processor, a memory, and a network interface connected through a system bus. The memory includes a nonvolatile storage medium and an internal memory. The nonvolatile storage medium of the device stores an operating system and may also store a computer program which, when executed by the processor, causes the processor to implement the forward index processing method for the search database. The internal memory may also store a computer program which, when executed by the processor, causes the processor to perform the forward index processing method for the search database. It will be appreciated by those skilled in the art that the structure shown in fig. 9 is merely a block diagram of the part of the structure relevant to the present solution and does not limit the forward index processing device to which the present solution is applied; a particular forward index processing device for a search database may include more or fewer components than those shown, may combine some components, or may have a different arrangement of components.
A computer readable storage medium storing a computer program which, when executed by a processor, performs the following steps: acquiring an uploaded document; writing the document into the first forward index structure in the memory of the latest data source tracked by the atomic variable, and writing the document into the write-ahead log, wherein the first forward index structure is the one of the two forward index structures in the data source that is currently being written, the second forward index structure is the one of the two forward index structures in the data source that is not currently being written, and the first forward index structure is the same as the second forward index structure; when the background thread detects that the document has been written into the write-ahead log, creating a third forward index structure through the background thread and writing the document into the created third forward index structure; when the amount of data written to the first forward index structure reaches a threshold, the data source prepares to switch and the latest data source is updated: the first forward index structure stops accepting writes and a notification is issued so that the third forward index structure is likewise built up to the same position in the write-ahead log; the created third forward index structure is written to the disk; after the disk write is complete and the structure can be read, a second data source is created, wherein, relative to the first data source, the second data source includes the third forward index structure and omits the first forward index structure, the current second forward index structure serves as the updated first forward index structure, and a newly created forward index structure serves as the updated second forward index structure; and when creation of the second data source is complete, the atomic variable tracking the latest data source is updated atomically.
A forward index processing device for a search database, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the following steps when executing the computer program: acquiring an uploaded document; writing the document into the first forward index structure in the memory of the latest data source tracked by the atomic variable, and writing the document into the write-ahead log, wherein the first forward index structure is the one of the two forward index structures in the data source that is currently being written, the second forward index structure is the one of the two forward index structures in the data source that is not currently being written, and the first forward index structure is the same as the second forward index structure; when the background thread detects that the document has been written into the write-ahead log, creating a third forward index structure through the background thread and writing the document into the created third forward index structure; when the amount of data written to the first forward index structure reaches a threshold, the data source prepares to switch and the latest data source is updated: the first forward index structure stops accepting writes and a notification is issued so that the third forward index structure is likewise built up to the same position in the write-ahead log; the created third forward index structure is written to the disk; after the disk write is complete and the structure can be read, a second data source is created, wherein, relative to the first data source, the second data source includes the third forward index structure and omits the first forward index structure, the current second forward index structure serves as the updated first forward index structure, and a newly created forward index structure serves as the updated second forward index structure; and when creation of the second data source is complete, the atomic variable tracking the latest data source is updated atomically.
It should be noted that the above method, device, apparatus and computer readable storage medium for processing a forward index of a search database belong to a general inventive concept, and the content in the embodiments of the method, device, apparatus and computer readable storage medium for processing a forward index of a search database may be mutually applicable.
Those skilled in the art will appreciate that all or part of the methods in the above embodiments may be implemented by a computer program stored in a non-transitory computer-readable storage medium, and that the program, when executed, may comprise the steps of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to be within the scope of this description.
The foregoing examples illustrate only a few embodiments of the application. They are described specifically and in detail, but are not to be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the application, all of which fall within the scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims (10)

1. A forward index processing method for a search database, characterized in that the search database comprises a memory, a disk, a write-ahead log and a background thread; data in the memory and on the disk together serve as a data source; an atomic variable tracks the latest data source; as data is continuously written, new data sources are created, the version of the data source increases accordingly, and the atomic variable is updated accordingly; two forward index structures that can be searched in real time are initially created in the memory as a first data source, and the atomic variable initially tracks the first data source; the method comprising the following steps:
Acquiring an uploaded document;
Writing the document into the first forward index structure in the memory of the latest data source tracked by the atomic variable, and writing the document into the write-ahead log; wherein the first forward index structure is the one of the two forward index structures in the data source that is currently being written, the second forward index structure is the one of the two forward index structures in the data source that is not currently being written, and the first forward index structure is the same as the second forward index structure;
when the background thread detects that the document has been written into the write-ahead log, creating a third forward index structure through the background thread, and writing the document into the created third forward index structure;
When the amount of data written to the first forward index structure reaches a threshold, the data source prepares to switch and the latest data source is updated: the first forward index structure stops accepting writes and a notification is issued so that the third forward index structure is likewise built up to the same position in the write-ahead log; the created third forward index structure is written to the disk; after the disk write is complete and the structure can be read, a second data source is created, wherein, relative to the first data source, the second data source includes the third forward index structure and omits the first forward index structure, the current second forward index structure serves as the updated first forward index structure, and a newly created forward index structure serves as the updated second forward index structure; and when creation of the second data source is complete, the atomic variable tracking the latest data source is updated atomically.
2. The method of claim 1, wherein the search database further comprises a search engine, the method further comprising:
If a real-time search request from a client is received while the document is being written, the search engine accesses the latest data source through the atomic variable and creates a snapshot of the data source, the snapshot preserving the state of the data source at the time the search engine accessed it; within the snapshot, the search engine accesses the memory and the disk at the same time, reading the forward index structure in the memory that corresponds to the real-time search request and searching it to obtain a first search result, and reading the forward index structure on the disk that corresponds to the real-time search request and searching it to obtain a second search result;
And merging the first search result and the second search result to obtain a real-time search response, and feeding back the real-time search response to the client.
3. The method according to claim 2, wherein the snapshot is created on the latest data source accessed by the search engine; when the data source switches, a new data source is created, and the old data source is deleted after the switch, once all searches that created snapshots on the old data source have completed.
4. The method according to claim 1 or 2, characterized in that the method further comprises: when a deletion operation is performed on content in a written document, adding a deletion mark at the metadata layer to the content to be deleted in the written document, wherein the deletion mark is part of the forward index structure in the data source, and when the search database responds to a search, the created snapshot contains the deletion mark; wherein the written document is a document written into the search database, and content to which the deletion mark has been added cannot be read during a search.
5. The method of claim 1 or 2, wherein in the forward index structure in a data source, each document is recorded with a timestamp of when it was completely written, the method further comprising:
Each time a modification operation is executed on a written document, acquiring a modified document corresponding to the written document; the written document is a document written in a search database, and the modified document is a document obtained by modifying the written document currently; the time stamp is part of a forward index structure in the data source, and when the search database responds to the search, the created snapshot contains the time stamp;
After the modified document is completely written, recording a time stamp of the modified document when the modified document is completely written as an updated target time stamp;
and when searching the written document, reading the document corresponding to the current target timestamp.
6. The method of claim 1, wherein writing the document into the first forward index structure in the memory of the latest data source tracked by the atomic variable and writing the document into the write-ahead log comprises:
after a field level real-time processing function is started, determining real-time fields and non-real-time fields in the document;
and writing the real-time fields in the document into the first forward index structure in the memory of the data source, and writing both the real-time fields and the non-real-time fields in the document into the write-ahead log.
7. The method of claim 6, wherein the method further comprises:
when the non-real-time field is searched in real time, if an accurate search result is not needed, reading a corresponding forward index structure from the disk of the data source and searching;
If an accurate search result is needed, reading the document from both the forward index structure on the disk of the data source and the write-ahead log, and searching.
8. A forward index processing device for a search database, characterized in that the search database comprises a memory, a disk, a write-ahead log and a background thread; data in the memory and on the disk together serve as a data source; an atomic variable tracks the latest data source; as data is continuously written, new data sources are created, the version of the data source increases accordingly, and the atomic variable is updated accordingly; two forward index structures that can be searched in real time are initially created in the memory as a first data source, and the atomic variable initially tracks the first data source; the forward index processing device for the search database comprising:
The first writing module, configured to acquire an uploaded document, write the document into the first forward index structure in the memory of the latest data source tracked by the atomic variable, and write the document into the write-ahead log; wherein the first forward index structure is the one of the two forward index structures in the data source that is currently being written, the second forward index structure is the one of the two forward index structures in the data source that is not currently being written, and the first forward index structure is the same as the second forward index structure;
The second writing module, configured to create a third forward index structure through the background thread when the background thread detects that the document has been written into the write-ahead log, and to write the document into the created third forward index structure; and, when the amount of data written to the first forward index structure reaches a threshold, the data source prepares to switch and the latest data source is updated: the first forward index structure stops accepting writes and a notification is issued so that the third forward index structure is likewise built up to the same position in the write-ahead log; the created third forward index structure is written to the disk; after the disk write is complete and the structure can be read, a second data source is created, wherein, relative to the first data source, the second data source includes the third forward index structure and omits the first forward index structure, the current second forward index structure serves as the updated first forward index structure, and a newly created forward index structure serves as the updated second forward index structure; and when creation of the second data source is complete, the atomic variable tracking the latest data source is updated atomically.
9. A computer readable storage medium, characterized in that a computer program is stored, which, when being executed by a processor, causes the processor to perform the steps of the method according to any of claims 1 to 7.
10. A forward index processing device for searching a database, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the method of any one of claims 1 to 7.
CN202410479400.9A 2024-04-22 2024-04-22 Forward index processing method, device, medium and equipment for searching database Active CN118069590B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410479400.9A CN118069590B (en) 2024-04-22 2024-04-22 Forward index processing method, device, medium and equipment for searching database

Publications (2)

Publication Number Publication Date
CN118069590A true CN118069590A (en) 2024-05-24
CN118069590B CN118069590B (en) 2024-06-21

Family

ID=91100686

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410479400.9A Active CN118069590B (en) 2024-04-22 2024-04-22 Forward index processing method, device, medium and equipment for searching database

Country Status (1)

Country Link
CN (1) CN118069590B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120150864A1 (en) * 2010-12-14 2012-06-14 Oracle International Corporation Text indexing for updateable tokenized text
CN109815194A (en) * 2019-02-01 2019-05-28 北京沃东天骏信息技术有限公司 Indexing means, indexing unit, computer readable storage medium and electronic equipment
US20190347343A1 (en) * 2018-05-09 2019-11-14 Palantir Technologies Inc. Systems and methods for indexing and searching
CN113641780A (en) * 2021-10-15 2021-11-12 阿里云计算有限公司 Search method, system, device, storage medium and computer program product
CN116226497A (en) * 2023-03-08 2023-06-06 杭州网易云音乐科技有限公司 Retrieval method, medium, device and computing equipment
CN116594962A (en) * 2023-04-14 2023-08-15 百度在线网络技术(北京)有限公司 Access request processing method and device and forward index system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
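Pan Longxi; Sun Le: "Indexing technology based on dynamic document collections" (基于动态文档集的索引技术), Application Research of Computers (计算机应用研究), no. 01, 15 January 2009 (2009-01-15), pages 21 - 24 *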

Also Published As

Publication number Publication date
CN118069590B (en) 2024-06-21

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant