WO1997011427A1

WO1997011427A1 - Method and apparatus for fast access cdrom file structure

Info

Publication number: WO1997011427A1
Application number: PCT/US1996/015228
Authority: WO
Inventors: Bruce C. Atherton
Original assignee: International Compu Research, Inc.
Priority date: 1995-09-22
Filing date: 1996-09-23
Publication date: 1997-03-27

Abstract

A method and apparatus for a fast access CDROM file structure for indexed databases. The invention takes advantage of a sequential, multi-step look-up process for inverted indexes by physically interleaving an initial index with its corresponding inverted index on a CDROM disk. A speed improvement in accessing the indexes is realized by ensuring that the second of two required seeks in the process is almost always a longitudinal seek to a data block following a data block read after the first seek. Thus, lateral seeks are substantially reduced, thereby reducing average seek time.

Description

METHOD AND APPARATUS FOR FAST ACCESS CDROM FILE STRUCTURE

BACKGROUND OF THE INVENTION

1. Field of the Invention This invention relates to computer storage devices, and more particularly to a method and apparatus for a fast access CDROM file structure for indexed databases.

2. Description of Related Art

Certain types of large databases, particularly text databases, use an "inverted index" to provide a rapid method of finding particular data specified by a user. As is known in the art, an inverted index for a full-text database provides a pointer or set of pointers to every significant term in the database (high frequency "noise" words, like "the", "it", etc., are sometimes omitted). Accordingly, it is common to use an inverted index in conjunction with a text database to permit easy access to specific documents and subportions of documents pointed to by the index.

An inverted index includes a lookup table or initial index to locate selected entries in the inverted index. The initial index typically is a sorted table of counts and offsets referencing the inverted index. The offset defines a location in the inverted index at which a desired term is referenced, while the count defines the number of such terms in sequence. The inverted index itself typically includes a document or reference number, and a vector pointing to the location in the database of the actual document or text containing the selected term.

FIGURE 1 is a block diagram showing the logical structure of a typical prior art inverted index and initial index in the context of a database storing legal case law documents. As is known in the art, the original text (in this example, legal cases) is searched for ter s to index (i.e., noise words and any desired special terms are ignored) and an inverted index is built. The result of such a search typically comprises a document identifier and a vector pointing to the original text for each significant term (a vector may point to the actual location within a document of a significant term). The inverted index comprises a data file 10 typically having fixed length records or entries 11 , as shown in FIGURE 1. (Each inverted index record 11 is shown in FIGURE 1 with a record address, shown as the first number of the pair of numbers in the DOC column, and content, shown in parentheses in the DOC column. As is known, the record address is not normally stored with the record, but is only a referent to the content of the record. Thus, record #121 has a document identifier content of 1011 in the example shown). Each record 11 of the inverted index data file 10 stores a document identifier and an associated vector, as determined in the index generation process.

The initial index also comprises a data file 15 typically having fixed length records or entries 16. Each initial index record 16 typically has two values, a count value and an offset value. As noted above, the offset defines a location in the inverted index at which a desired term is referenced, while the count defines the number of such terms in sequence. (Each initial index record 16 is shown in FIGURE 1 with a record address, shown as the first number ofthe pair of numbers in the COUNT column, and content, shown in parentheses in the COUNT column. Again, the record address is not normally stored with the record, but is only a referent to the record. Thus, record #21 has a count of 4 in the example shown).

In building an inverted index, records 11 for document identifiers that point to documents containing the same terms are typically grouped together. For example, if four cases contain the term "LARCENY", the records 11 for those four cases are grouped together in the data file 10 comprising the inverted index. A record 16 would then be created in the initial index data file 15 with an offset value pointing to the first such inverted index record 11, and a count value indicating the number (four) of sequential related inverted index records 11.

To use the inverted index, a search term or query is entered by a user. Suppose, for example, that a user wants to find every case in a legal database containing the term "LARCENY". As is known in the art, the user's input can be parsed against a lookup table or query dictionary, to find a value associated with the selected term. Such a query dictionary 20 is shown in FIGURE 1. In the example shown, "LARCENY" is associated with the query number 21. The query number is used as a pointer to a corresponding record address in the initial index data file 15. In this example, entry of "LARCENY" points to an initial index record 16 that indicates four relevant inverted index records 11 are located beginning at address 121 of the inverted index data file 10. Other inverted index file structures are possible, but most have characteristics and functionality similar to that described above.

A text database and an associated inverted index can be quite large. CDROM 's (Compact Disc Read Only Memory's) are capable of storing large amounts of data, and thus are frequently used for storing such databases and associated index files.

Typically, each index is stored as a separate file each comprising a plurality of physically contiguous blocks or clusters. An example of such storage is shown in FIGURE 2, which is a stylized diagram of the physical layout on a CDROM disk 40 of a typical prior art inverted index data file 10 and initial index data file 15. Two "tracks" 41 , 42 of data on the CDROM disk 40 are shown, as are two blocks 43, 44 defining regions for storing data (in actuality, a CDROM uses a single spiral track to store data, but stored data on the spiral track is accessible as blocks in a manner similar to magnetic disk drives). Thus, typically, the inverted index data file 10 is stored contiguously (for example, in track 41) while the initial index data file 15 is stored contiguously (for example, in track 42) but separate and apart from the inverted index data file 10. The corresponding text database would also be stored on the CDROM disk 40.

A problem peculiar to CDROM systems is that they use a variable speed motor to maintain constant linear velocity along the single spiral track of data. The motor must speed up or slow down if the transducer head is moved in or out (known as a "lateral seek") along a radius of the disk 40. Block identification along a track is slow because of the lack of unique tracks, and so "longitudinal seeks" along a track (i.e., from block to block) also take time. (However, longitudinal seeks are much faster than lateral seeks, and are typically determined by the rotational latency of disk, which is constant due to the constant linear velocity characteristic of CDROM systems. A "double speed" CDROM system has approximately one-half of the rotational latency of a "single speed" CDROM system). In addition, the transducer mass of a CDROM drive is relatively large. Therefore, average seek times across a CDROM are very long (typically 20 to 40 times) compared to average seek times for magnetic disk drives or magneto-optical drives. Further, average lateral seek time is typically many times the average longitudinal seek time for a typical "double speed" CDROM system.

Accordingly, if an inverted index data file 10 and initial index data file 15 are stored on a CDROM disk 40 as described above, data access to a large inverted index is slow compared to a magnetic disk "hard" drive. More particularly, in a typical configuration, for each query term number (e.g., "21" for

"LARCENY"), a first CDROM lateral disk seek and one disk read are required to access the corresponding record 16 in the initial index data file 15. The offset field in the retrieved record 16 is then used as a pointer to the inverted index data file 10, generally requiring a second lateral disk seek. The number of disk reads from the inverted index data file 10 depends on the count value in the retrieved initial index record 16, but is a minimum of one. The average time required to process one query term number (not including the time to parse the original query term, "LARCENY" in this case) is approximated by the following formula: 2S + (R * (((N / (B / L)) + 2)) where: S = average CDROM seek time

R = sequential disk read time for one block N = number of inverted index records containing an indexed term B = size in bytes of a disk block L = size in bytes of a record in the inverted index data file.

Typically, sequential read times are significantly faster than seek times, so the seek time is the dominant factor in the above equation. A substantial increase in processing speed would be achieved if the number of required seeks could be reduced or the average time of a seek reduced.

The present invention provides such an improvement for an inverted indexed database stored on a CDROM.

SUMMARY OF THE INVENTION

The present invention is a method and apparatus for a fast access CDROM file struc¬ ture for indexed databases. The invention takes advantage of the sequential, multi- step look-up process for inverted indexes by physically interleaving an initial index with its corresponding inverted index on a CDROM disk. A speed improvement in accessing the indexes is realized by ensuring that the second seek in the process described above is almost always a longitudinal seek to a block following the block read after the first seek. Thus, lateral seeks are substantially reduced, thereby reducing average seek time.

The details of the preferred embodiment of the present invention are set forth in the accompanying drawings and the description below. Once the details of the invention are known, numerous additional innovations and changes will become obvious to one skilled in the art.

BRIEF DESCRIPTION OF THE DRAWINGS

FIGURE 1 is a block diagram showing the logical structure of a typical prior art inverted index and initial index in the context of a database storing legal case law documents.

FIGURE 2 is a stylized diagram of the physical layout on a CDROM disk of a typical prior art inverted index data file and initial index data file.

FIGURE 3 is a stylized diagram of the physical layout on a CDROM disk of an inverted index data file and initial index data file in accordance with the present invention.

Like reference numbers and designations in the various drawings indicate like elements.

DETAILED DESCRIPTION OF THE INVENTION

Throughout this description, the preferred embodiment and examples shown should be considered as exemplars, rather than as limitations on the present invention.

FIGURE 3 is a stylized diagram of the physical layout on a CDROM disk 40 of an inverted index data file and initial index data file in accordance with the present invention. Two "tracks" 41 , 42 of data on the CDROM disk 40 are shown. In contrast to the prior art, rather than storing the inverted index data file 10 in contiguous blocks separate and apart from the initial index data file 15, blocks from each file are physically interleaved longitudinally (i.e., along) the spiral track (or, if CDROM technology changes, along concentric tracks). Thus, in FIGURE 3, blocks 43a and 43b are used to store the inverted index data file 10, while blocks 44a and 44b are used to store the initial index data file 15. The CDROM disk 40 is readable by a conventional CDROM drive having a drive motor, transducer head, and read and control circuitry. A speed improvement in accessing the indexes is realized by ensuring that the second of the two required seeks in the access process is almost always a longitudinal seek to a second data block following a first data block read after the first seek (which may be a lateral seek). (In some circumstances, a lateral seek may be required to access the second block, but the interleave of blocks means that most probably such would not be the case.) Thus, lateral seeks are substantially reduced, thereby reducing average seek time.

Using the example given in FIGURE 1 , the comparable process for accessing an inverted index on a CDROM in accordance with the present invention would be as follows:

1. The user enters a query (e.g., "LARCENY"), which is parsed into a query number (21).

2. The query number is used to initiate a first seek (which may be a lateral seek) to a block in the initial index data file 15. 3. An initial index record 16 (record 21) corresponding to the query number (21) is read from the block to obtain a count (4) and an offset pointer (121) to the inverted index data file 10.

4. The offset value is used to initiate a second seek to a block in the inverted index data file 10. If the inverted index record 11 pointed to by the offset pointer is physically located in the next or a nearby (e.g. , same circumferential track) block after the block containing the read initial index record 16, in accordance with the present invention, only a longitudinal seek is required.

5. The number of inverted index records 11 indicated by the count value (4) are read beginning at the record (121) indicated by the offset pointer (121).

While FIGURE 3 suggests that each index occupies the same number of blocks, the number of blocks on a circumferential "track" (taking into account the spiral nature of a CDROM disk track) dedicated to one index may be larger or smaller than the number of blocks dedicated to the other index. Thus, for example, if the initial index data file 15 comprises numerous small records, and the inverted index data file 10 comprises fewer larger records, it may be advantageous to interleave a small number of blocks (e.g., one) from the initial index data file 15 with a larger number of blocks (e.g., 25) from the inverted index data file 10. In general, this approach attempts to match the number of inverted index data file 10 blocks with the number of initial index data file 15 blocks containing records 16 with offset pointers corresponding to such inverted index data file 10 blocks. As a further example, if 100 initial index data file 15 blocks contain all of the offset pointers to 10,000 inverted index data file 10 blocks, on average, one initial index data file 15 block would be interleaved for every ten inverted index data file 10 blocks.

Further, the ratio of inverted index blocks to initial index blocks need not be constant, as were the underlying database contains a high frequency of a certain significant terms compared to the frequency of other significant terms. In such a case, one particular initial index block may contain records pointing to a large number of inverted index blocks having few distinct terms (i.e., the count value is high, indicating numerous contiguous inverted index records 11). Conversely, another particular initial index block may contain records pointing to a small number of inverted index blocks having many distinct terms (i.e., the count value is low, indicating few contiguous related inverted index records 11). Accordingly, by mapping initial index block records to inverted index records, a "custom" interleave for a particular database can be generated.

In the preferred embodiment, the distribution of initial index data file 15 blocks to inverted index data file 10 blocks is empirically determined on a case-by-case basis, taking into consideration the size of the two files, the size of the records in each file, and the distribution of inverted index records 11. In general, the records 11 of each inverted index data file 10 block should be distributed on the CDROM closely following the initial index data file 15 block containing the associated pointers for such records 11. Other algorithms can be devised for determining the "distribution frequency" of initial index data file 15 blocks to inverted index data file 10 blocks, but all come within the scope of the inventive concept described above.

Writing of blocks in the interleaved structure taught by the present invention is by conventional means. For example, the blocks 43a, 43b, 44a, 44b can be stored on a magnetic disk and fetched in the selected interleave order for writing to a recordable CDROM or to a writable master CDROM for mass-production. Altematively, the blocks 43a, 43b, 44a, 44b can be stored on a magnetic disk in the selected interleave order and then written as a single data stream to a recordable CDROM or to a writable master CDROM for mass-production.

Accordingly, the invention takes advantage of the sequential, multi-step look-up process for inverted indexes by physically interleaving an initial index with its corresponding inverted index on a CDROM disk. A speed improvement in accessing the indexes is realized by ensuring that the second of two required seeks in the process is almost always a longitudinal seek to a data block following a data block read after the first seek. Thus, lateral seeks are substantially reduced, thereby reducing average seek time.

The invention is also generally applicable to any similar arrangement of data on a

CDROM or similar device, where a first seek to and read of records of one file is followed by a seek to the records of a second file.

A number of embodiments of the present invention have been described. Neverthe¬ less, it will be understood that various modifications may be made without departing from the spirit and scope of the invention. For example, instead of parsing a search term into a query number, the search term entered by a user may be used to directly access the initial index data file 15. Further, as is known in the art, better performance may be achieved by caching the disk file allocation table (FAT) used in accessing blocks on a CDROM. Accordingly, it is to be understood that the invention is not to be limited by the specific illustrated embodiment, but only by the scope of the appended claims.

Claims

What is claimed is

1 A method for providing a fast access CDROM file structure for related files comprising the steps of

(a) providing a first file comprising a plurality of first records organized as first blocks, such first records including pointers to second records of a second file, wherein at least one first record must be accessed before accessing a second record,

(b) providing the second index comprising a plurality of such second records organized as second blocks,

(c) storing the first blocks of the first index and the second blocks of the second index on a CDROM disk, the CDROM disk being readable by a

CDROM system that has a lateral seek time and a longitudinal seek time, such that the first blocks are interleaved with the second blocks, wherein access to a first record requires one of a lateral or longitudinal seek to one such first block, while access to a second record pointed to by such first record most probably requires only a longitudinal seek to one such second block, thereby reducing average total seek time

2. A method for providing a fast access CDROM file structure for an indexed database, comprising the steps of:

(a) providing a first index comprising a plurality of first records organized as first blocks, such first records including pointers to second records of a second index, wherein at least one first record must be accessed before accessing a second record;

(b) providing the second index comprising a plurality of such second records organized as second blocks, such second records including index pointers to data records; (c) storing the data records indexed by the second index on a CDROM disk, the CDROM disk being readable by a CDROM system that has a lateral seek time and a longitudinal seek time;

(d) storing the first blocks of the first index and the second blocks of the second index on the CDROM disk such that the first blocks are interleaved with the second blocks, wherein access to a first record requires one of a lateral or longitudinal seek to one such first block, while access to a second record pointed to by such first record most probably requires only a longitudinal seek to one such second block, thereby reducing average total seek time.

3. A method for providing a fast access CDROM file structure for an indexed database, comprising the steps of:

(a) providing an initial index comprising a plurality of initial index records organized as first blocks, such initial index records including pointers to inverted index records of an inverted index, wherein at least one initial index record must be accessed before accessing an inverted index record;

(b) providing the inverted index comprising a plurality of such inverted index records organized as second blocks, such inverted index records including pointers to data records; (c) storing the data records indexed by the inverted index on a CDROM disk, the CDROM disk being readable by a CDROM system that has a lateral seek time and a longitudinal seek time;

(d) storing the first blocks of the initial index and the second blocks of the inverted index on the CDROM disk such that the first blocks are interleaved with the second blocks, wherein access to an initial index record requires one of a lateral or longitudinal seek to one such first block, while access to an inverted index record pointed to by such initial index record most probably requires only a longitudinal seek to one such second block, thereby reducing average total seek time.

4. A CDROM disk having stored thereon a file structure for an indexed database, the CDROM disk being readable by a CDROM system that has a lateral seek time and a longitudinal seek time, the CDROM disk including:

(a) a first index comprising a plurality of first records organized as first blocks, such first records including pointers to second records of a second index, wherein at least one first record must be accessed before accessing a second record;

(b) the second index comprising a plurality of such second records organized as second blocks, such second records including index pointers to data records;

(c) data records indexed by the second index;

(d) wherein the first blocks of the first index and the second blocks of the second index are stored on the CDROM disk such that the first blocks are longitudinally interleaved with the second blocks, wherein access to a first record requires one of a lateral or longitudinal seek to one such first block, while access to a second record pointed to by such first record most probably requires only a longitudinal seek to one such second block, thereby reducing average total seek time.

5. A CDROM system for accessing a CDROM disk having stored thereon a file structure for an indexed database, the CDROM system having a lateral seek time and a longitudinal seek time when accessing such a CDROM disk, the

CDROM system including: (a) a first index stored on a CDROM disk and comprising a plurality of first records organized as first blocks, such first records including pointers to second records of a second index, wherein at least one first record must be accessed before accessing a second record;

(b) the second index stored on the CDROM disk and comprising a plurality of such second records organized as second blocks, such second records including index pointers to data records;

(c) data records stored on the CDROM disk and indexed by the second index;

(d) wherein the first blocks of the first index and the second blocks of the second index are stored on the CDROM disk such that the first blocks are longitudinally interleaved with the second blocks, wherein access to a first record requires one of a lateral or longitudinal seek to one such first block, while access to a second record pointed to by such first record most probably requires only a longitudinal seek to one such second block, thereby reducing average total seek time; (e) a CDROM reader configured to access data on the CDROM disk.

6. A method for providing a fast access storage device file structure for related files, comprising the steps of:

(a) providing a first file comprising a plurality of first records organized as first blocks, such first records including pointers to second records of a second file, wherein at least one first record must be accessed before accessing a second record;

(b) providing the second index comprising a plurality of such second records organized as second blocks;

(c) storing the first blocks of the first index and the second blocks of the second index on a storage device, the storage device being readable by a storage device system that has a lateral seek time and a longitudinal seek time, such that the first blocks are interleaved with the second blocks, wherein access to a first record requires one of a lateral or longitudinal seek to one such first block, while access to a second record pointed to by such first record most probably requires only a longitudinal seek to one such second block, thereby reducing average total seek time.

7. A method for providing a fast access storage device file structure for an indexed database, comprising the steps of:

(b) providing the second index comprising a plurality of such second records organized as second blocks, such second records including index pointers to data records; (c) storing the data records indexed by the second index on a storage device, the storage device being readable by a storage device system that has a lateral seek time and a longitudinal seek time;

(d) storing the first blocks of the first index and the second blocks of the second index on the storage device such that the first blocks are inter- leaved with the second blocks, wherein access to a first record requires one of a lateral or longitudinal seek to one such first block, while access to a second record pointed to by such first record most probably requires only a longitudinal seek to one such second block, thereby reducing average total seek time.