WO2008144018A1 - A device and method for an efficient storage manager for copy-on-write snapshots
- Publication number: WO2008144018A1 (PCT application PCT/US2008/006366)
- Authority: WIPO (PCT)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/23—Updating
- G06F16/2308—Concurrency control
Description
- The present invention relates to snapshot storage management systems. More specifically, the present invention relates to snapshot storage management systems that provide applications with the ability to discriminate among snapshots efficiently.
- A new generation of storage systems exploit decreasing storage costs and efficient versioning techniques to allow applications to take snapshots of past states and retain them for long durations. Snapshot analysis is becoming increasingly important. For example, an Intensive Care Unit (ICU) monitoring system may analyze the information on patients' past response to treatment.
- Over time, current snapshot techniques can produce large volumes of snapshots. Indiscriminately keeping all snapshots accessible is impractical, even if raw disk storage is cheap, because administering such large-volume storage is expensive over a long duration. Moreover, not all snapshots are equally valuable. Some are of value for a long time, some for a short time, and some may require faster access; for example, a patient monitoring system might retain readings showing an abnormal behavior, and recent snapshots may require faster access than older snapshots.
- Current snapshot systems do not provide applications with the ability to discriminate efficiently among snapshots, so that valuable snapshots remain accessible while less valuable snapshots are discarded or moved off-line. The problem is that incremental copy-on-write, the basic technique that makes snapshot creation efficient, entangles successive snapshots on disk. Separating entangled snapshots creates disk fragmentation that reduces snapshot system performance over time.
- Figure 1 shows an embodiment of a page-based representation of BITE
- Figure 2 shows an embodiment of a split copy-on-write
- Figure 3 shows an embodiment of archiving split snapshots
- Figure 4 shows an embodiment of a rank-tree policy
- Figure 5 shows an embodiment of eager ranked segregation
- Figure 6 shows an embodiment of a diff-based representation of BITE
- Figure 7 shows an embodiment of lazy segregation
- Figure 8 shows an embodiment of reclamation in Hybrid
- Figure 9 shows a graph of λ_dp relative to Thor in an embodiment of Hybrid
- Figure 10 shows a graph of λ_dp relative to Thor in an embodiment having multiple clients
- Thresher is a new snapshot storage management system, based on a novel snapshot technique that is the first to provide applications the ability to discriminate among snapshots efficiently.
- An application may provide a discrimination policy that ranks snapshots. The policy may be specified when snapshots are taken, or later after snapshots have been created. Thresher may efficiently disentangle differently ranked snapshots, allowing valuable snapshots to be stored with faster access or to remain accessible for longer, while also allowing less valuable snapshots to be discarded. These operations may be executed in Thresher without creating disk fragmentation.
- Thresher is based on two key innovations. First, a novel technique called "ranked segregation" may efficiently separate on disk the states of differently-ranked copy-on-write snapshots, enabling no-copy reclamation without fragmentation. Second, while most snapshot systems rely on a no-overwrite update approach, Thresher may rely on a novel update-in-place technique that provides an efficient way to transform snapshot representation as snapshots are created.
- The ranked segregation technique may be efficiently composed with different snapshot representations to lower the storage management costs for several useful discrimination policies.
- When applications need to defer snapshot discrimination, for example until after examining one or more subsequent snapshots to identify abnormalities, Thresher may segregate the normal and abnormal snapshots efficiently by composing ranked segregation with a compact diff-based representation to reduce the cost of copying.
- For applications that need faster access to recent snapshots, Thresher may compose ranked segregation with a dual snapshot representation that is less compact but provides faster access.
- A snapshot storage manager, like a garbage collector, must be designed with a concrete system in mind, and must perform well for different application workloads.
- To explore how the performance of these techniques depends on the storage system workload, Thresher was prototyped in an experimental snapshot system that allows flexible control of workload parameters. Update density and overwriting were identified as the two key parameters that determine the performance of a snapshot storage manager. Measurements of the Thresher prototype indicate that the new techniques are efficient and scalable, imposing a minimal (4%) performance penalty on common expected workloads.
- Thresher is described herein in the context of a concrete system, but it will be appreciated by one skilled in the art that the techniques described are more general and may be readily applied to other systems.
- Thresher is implemented in SNAP, the snapshot system that provides snapshots for the Thor object storage system.
- SNAP assumes that applications are structured as sequences of "transactions" accessing a storage system. It supports Back-In-Time Execution (BITE), a capability of a storage system where applications running general code may run against read-only snapshots in addition to the current state.
- The snapshots may reflect transactionally consistent historical states.
- An application may choose which snapshots it wants to access so that snapshots may reflect states meaningful to the application.
- Applications may take snapshots at unlimited "resolution", e.g. after each transaction, without disrupting access to the current state.
- Thresher may allow applications to discriminate among snapshots by incorporating a snapshot discrimination policy into the following three snapshot operations: a request to take a snapshot ("snapshot request", or "declaration") that provides a discrimination policy, or indicates lazy discrimination; a request to access a snapshot ("snapshot access"); and a request to specify a discrimination policy for a snapshot ("discrimination request").
- The operations may have the following semantics.
- Informally, an application may take a snapshot by asking for a snapshot "now".
- This snapshot request may be serialized along with other transactions and other snapshots. That is, a snapshot may reflect all state-modifications by transactions serialized before this request, but does not reflect modifications by transactions serialized after.
- A snapshot request may return a snapshot "name" that applications can use to refer to this snapshot later, e.g. to specify a discrimination policy for a snapshot. For simplicity, it is assumed that snapshots are assigned unique sequence numbers that correspond to the order in which they occur.
- A snapshot access request may specify which snapshot an application wants to use for back-in-time execution.
- The request may return a consistent set of object states, allowing the read-only transaction to run as if it were running against the current storage state.
- A discrimination policy may rank snapshots.
- A rank may be a numeric score assigned to a snapshot. Thresher may interpret the ranking to determine the relative lifetimes of snapshots and the relative snapshot access latency.
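- A minimal sketch of these three operations and their semantics is given below. It is illustrative only: the names (`SnapshotStore`, `declare`, `access`, `discriminate`) and the whole-state copy are assumptions, not the patent's implementation, which copies state incrementally on write.

```python
# Illustrative sketch of the three snapshot operations; names and the
# whole-state copy are assumptions (the real system is copy-on-write).
from dataclasses import dataclass, field

@dataclass
class Snapshot:
    seq: int                 # unique sequence number, in declaration order
    rank: int = 1            # lowest rank by default
    pages: dict = field(default_factory=dict)

class SnapshotStore:
    def __init__(self):
        self.current = {}    # current state: page id -> page contents
        self.snapshots = {}
        self.next_seq = 1

    def declare(self, rank=None):
        """Take a snapshot "now"; reflects all transactions serialized
        before this request. rank=None indicates lazy discrimination."""
        seq = self.next_seq
        self.next_seq += 1
        self.snapshots[seq] = Snapshot(seq, rank or 1, dict(self.current))
        return seq           # the snapshot "name"

    def access(self, seq):
        """Return a consistent, read-only set of states for BITE."""
        return dict(self.snapshots[seq].pages)

    def discriminate(self, seq, rank):
        """Specify (or bump up) a snapshot's rank after declaration."""
        self.snapshots[seq].rank = max(self.snapshots[seq].rank, rank)

store = SnapshotStore()
store.current["P"] = "v0"
s = store.declare()          # declared with the lowest rank
store.discriminate(s, 3)     # lazily bumped up after an abnormality is found
```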
- A snapshot storage management system needs to be efficient and must not unduly slow down the snapshot system.
- The general approach to snapshot discrimination may be applicable to snapshot systems that separate snapshots from the current storage system state.
- Such so-called "split" snapshot systems may rely on update-in-place storage and create snapshots by copying out the past states, unlike snapshot systems that rely on no-overwrite storage and do not separate snapshot and current states.
- Split snapshots may be attractive in long-lived systems because they allow creation of high-frequency snapshots without disrupting access to the current state, while preserving the on-disk object clustering for the current state. This approach may take advantage of the separation between snapshot and current states to provide efficient snapshot discrimination.
- A specialized snapshot representation may be created which is tailored to the discrimination policy while copying out the past states.
- Thor has a client/server architecture.
- Servers provide persistent storage (called database storage) for objects.
- Clients may cache copies of the objects and run applications that interact with the system by making calls to methods of cached objects. Method calls may occur within the context of a transaction.
- A transaction commit may cause all modifications to become persistent, while an abort may leave no transaction changes in the persistent state.
- The system may use optimistic concurrency control.
- A client may send its read and write object sets with modified object states to the server when the application asks to commit the transaction. If no conflicts are detected, the server may commit the transaction.
- An object may belong to a particular server. The object within a server may be uniquely identified by an object reference (Oref). Objects may be clustered into 8KB pages. Typically objects are small and there are many of them in a page.
- An object Oref is composed of a PageID and an oid.
- The PageID identifies the containing page and allows the lookup of an object location using a page table.
- The oid is an index into an offset table stored in the page.
- The offset table contains the object offsets within the page. This indirection may allow moving an object within a page without changing the references to it.
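- The Oref indirection just described can be illustrated with a short sketch. The bit split between PageID and oid, and the offset-table layout, are assumptions for illustration; the text specifies only that an Oref combines a PageID with an oid that indexes the page's offset table.

```python
# Illustrative Oref lookup; the 16-bit oid field and page layout are assumed.
OID_BITS = 16

def make_oref(page_id: int, oid: int) -> int:
    return (page_id << OID_BITS) | oid

def lookup(oref: int, page_table: dict, disk: dict) -> bytes:
    page_id = oref >> OID_BITS            # identifies the containing page
    oid = oref & ((1 << OID_BITS) - 1)    # index into the page's offset table
    offsets, data = disk[page_table[page_id]]
    return data[offsets[oid]:offsets[oid + 1]]   # the object can move within
                                                 # the page without changing its Oref

# usage: page 7 stored at disk location 0, holding two objects
disk = {0: ([0, 3, 8], b"fooxbars")}
page_table = {7: 0}
print(lookup(make_oref(7, 0), page_table, disk))  # b'foo'
print(lookup(make_oref(7, 1), page_table, disk))  # b'xbars'
```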
- When an object is needed by a transaction, the client may fetch the containing page from the server. Only modified objects may be shipped back to the server when the transaction commits. Thor provides transaction durability using the ARIES no-force no-steal redo log protocol. Since only modified objects may be shipped back at commit time, the server may need to do an installation read (iread) to obtain the containing page from disk.
- An in-memory, recoverable cache called the Modified Object Buffer (MOB) may store the committed modifications, allowing ireads to be deferred and write absorption to be increased. The modifications may be propagated to the disk by a background "cleaner" thread that cleans the MOB. The cleaner may process the MOB in transaction log order to facilitate the truncation of the transaction log. For each modified object encountered, it may read the page containing the object from disk (an iread) if the page is not cached, install all modifications in the MOB for objects in that page, write the updated page back to disk, and remove the objects from the MOB.
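- A minimal sketch of the cleaning pass just described, under assumed data structures (the MOB is modeled as an ordered map keyed by (page id, oid) in commit order):

```python
# Illustrative MOB cleaning: process entries in transaction-log order,
# iread each affected page once, install every pending modification for
# that page (write absorption), write the page back, and drop the entries.
from collections import OrderedDict

def clean_mob(mob, read_page, write_page):
    """mob: OrderedDict[(page_id, oid) -> object state], oldest first."""
    while mob:
        (page_id, _oid) = next(iter(mob))        # oldest entry, log order
        page = read_page(page_id)                # installation read (iread)
        for key in [k for k in mob if k[0] == page_id]:
            page[key[1]] = mob.pop(key)          # install all mods for the page
        write_page(page_id, page)                # cleaner is the only page writer

# usage with a toy disk
disk = {1: {0: "old"}, 2: {0: "old"}}
mob = OrderedDict([((1, 0), "new-a"), ((2, 0), "new-b"), ((1, 5), "new-c")])
clean_mob(mob, lambda p: dict(disk[p]), disk.__setitem__)
print(disk)  # {1: {0: 'new-a', 5: 'new-c'}, 2: {0: 'new-b'}}
```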
- The server may also manage an in-memory page cache used to serve client fetch requests. Before returning a requested page to the client, the server may update the cache copy, installing all modifications in the MOB for that page so that the fetched page reflects the up-to-date committed state.
- The page cache may use Least Recently Used (LRU) replacement but discards old dirty pages (depending on ireads to read them back during MOB cleaning) rather than writing them back to disk immediately. Therefore, the cleaner thread may be the only component of the system that writes pages to disk.
- SNAP may create snapshots by copying out the past storage system states onto a separate snapshot archive disk.
- A snapshot may provide the same abstraction as the storage system, consisting of "snapshot pages" and a "snapshot page table". This may allow unmodified application code running in the storage system to run as BITE over a snapshot.
- SNAP may copy snapshot pages and snapshot page table mappings into the archive during cleaning. It may use an incremental copy-on-write technique specialized for split snapshots: a snapshot page is constructed and copied into the archive when a page on the database disk is about to be overwritten the first time after a snapshot is declared. Archiving a page may create a snapshot page table mapping for the archived page.
- To access a snapshot v, a client application may send a "snapshot access request" to the server.
- The server may construct an Archive Page Table for version v (APT_v) and "mount" it for the client.
- APT_v may map each page in snapshot v to its archive address, or indicate that the page is in the database.
- When the server receives a page fetch request from the client, it may look up the page in APT_v and read it from either the archive or the database. Since snapshots may be accessed read-only, APT_v can be shared by all clients mounting snapshot v.
- Figure 1 shows an embodiment of how unmodified client application code accesses objects in snapshot v that includes both archived and database pages.
- For example, client code may request object y on page Q.
- The server may look up Q in APT_v, load page Q_v from the archive, and send it to the client. Later on, client code may follow a reference from y to x in the client cache, requesting object x in page P from the server.
- The server may look up P in APT_v and find that the page P for snapshot v is still in the database.
- The server then reads P from the database and sends it to the client.
- The archive representation for a snapshot page may include the complete storage system page. This representation is referred to as "page-based".
- The following sections describe different snapshot page representations, specialized to various discrimination policies.
- A snapshot page can have a more compact representation based on modified object diffs, or it can have two different representations.
- Such variations in snapshot representation are transparent to the application code running BITE, since the archive read operation reconstructs the snapshot page into the storage system representation before sending it to the client.
- Snapshot declarations may partition transaction history into "spans".
- The span of a snapshot v starts with its declaration and ends with the declaration of the next snapshot (v+1).
- When a page P is first modified in v's span, snapshot v "records" its version of P.
- For example, snapshot v records pages P and S (retaining the pre-states modified by transaction tx2) and the page T (retaining the pre-state modified by transaction tx1). Note that there may not be a need to retain the pre-state of page P modified by transaction tx3, since it is not the first modification of P in the span.
- If P is not modified in v's span, a later snapshot may record v's version of P.
- For example, v's version of page Q may be recorded by a later snapshot v+1, which also records its own version of P.
- Snapshot pages may be constructed and copied into the archive during cleaning, when the pre-states of modified pages about to be overwritten in the database are available in memory. Since the cleaner may run asynchronously with the snapshot declaration, the snapshot system needs to prevent snapshot states from being overwritten by the on-going transactions. For example, if several snapshots are declared between two successive cleaning rounds, and a page P gets modified after each snapshot declaration, the snapshot system may have to retain a different version of P for each snapshot.
- SNAP may prevent snapshot state overwriting without blocking the on-going transactions. It may retain the pre-states needed for snapshot creation in an in-memory data structure called the Versioned Modified Object Buffer (VMOB).
- The VMOB may contain a queue of buckets, one for each snapshot. Bucket v may hold modifications committed in v's span. As transactions commit modifications, modified objects may be added to the bucket of the latest snapshot (Step 1, Figure 3).
- The declaration of a new snapshot may create a new mutable bucket, and make the preceding snapshot bucket immutable, preventing the overwriting of the needed snapshot states.
- The cleaner may update the database by cleaning the modifications in the VMOB and, in the process of cleaning, construct the snapshot pages for archiving.
- Steps 2-5 in Figure 3 trace this process.
- The cleaner may first obtain a database copy of P.
- The cleaner may then use P and the modifications in the buckets to create all the needed snapshot versions of P before updating P in the database.
- The cleaner may construct the version of P recorded by v simply by using the database copy of P.
- The cleaner may then update P by applying the modifications in bucket v, remove the modifications from bucket v, and proceed to the following bucket.
- The updated P will be the version of P recorded by the snapshot that has the next modification to P in its bucket. This process may be repeated for all pages with modifications in the VMOB, constructing the recorded snapshot pages for the snapshots corresponding to the immutable VMOB buckets.
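- The versioned cleaning step can be sketched as follows for a single page P. The bucket structure is an assumption for illustration; the pre-state archived for a snapshot is the page state before the first modification in that snapshot's span:

```python
# Illustrative versioned cleaning of one page P: walk the immutable buckets
# in snapshot order; a snapshot whose bucket modifies P records the pre-state,
# then the bucket's modifications are applied before moving on.
def clean_page(page_id, db_page, buckets, archive):
    """buckets: list of (snapshot_seq, {(page_id, oid): state}) in order."""
    current = dict(db_page)                        # database copy of P
    for seq, mods in buckets:
        page_mods = {oid: s for (pid, oid), s in mods.items() if pid == page_id}
        if page_mods:
            archive[(seq, page_id)] = dict(current)  # snapshot seq records P
            current.update(page_mods)                # apply seq's span mods
    return current                                   # written back to the database

archive = {}
buckets = [(1, {("P", 0): "a1"}),                  # P modified in span of 1
           (2, {}),                                # P untouched in span of 2
           (3, {("P", 0): "a3"})]                  # P modified again in span of 3
clean_page("P", {0: "a0"}, buckets, archive)
print(archive)  # {(1, 'P'): {0: 'a0'}, (3, 'P'): {0: 'a1'}}
# Snapshot 2's version of P is the one recorded by snapshot 3, found by the
# forward scan described below.
```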
- The cleaner may write the recorded pages into the archive sequentially, in snapshot order, thus creating "incremental" snapshots.
- The mappings for the archived snapshot pages may be collected in versioned incremental snapshot page tables.
- The versioned page table for snapshot v, denoted VPT_v, may be a data structure containing the mappings (from page id to archive address) for the pages recorded by snapshot v.
- As pages are archived, their mappings may be inserted into VPT_v.
- VPT_v may be archived as well.
- The cleaner may write the VPTs sequentially, in snapshot order, into a separate archive data structure. This way, a forward sequential scan through the archived incremental page tables, from VPT_v onward, may find the mappings for all the archived pages that belong to snapshot v. Namely, the mapping for v's version of page P may be found either in VPT_v or, if not there, in the VPT of the first subsequently declared snapshot that records P. SNAP efficiently bounds the length of the scan; for brevity, the bounded scan protocol is not reviewed here.
- SNAP may also need to identify the snapshot v pages that are still in the current database.
- Highest Archived Version (HAV) may be an auxiliary data structure that tracks the highest archived version for each page. If HAV(P) < v, the snapshot v page P may be in the database.
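- The forward-scan reconstruction and the HAV check can be sketched together. The structures are assumptions for illustration, and the bounded-scan optimization is omitted, as in the text:

```python
# Illustrative construction of APT_v: scan the incremental VPTs forward from
# VPT_v; the first later snapshot that records a page supplies v's mapping.
# Pages with HAV(page) < v have never been archived at or after v and are
# therefore still in the current database.
def build_apt(v, vpts, hav, all_pages):
    """vpts: {seq: {page_id: archive_address}}; hav: {page_id: highest seq}."""
    apt = {}
    for seq in sorted(s for s in vpts if s >= v):       # forward scan
        for page_id, addr in vpts[seq].items():
            apt.setdefault(page_id, addr)               # earliest recorder wins
    for page_id in all_pages:
        if hav.get(page_id, 0) < v:
            apt[page_id] = "DATABASE"                   # read from the database
    return apt

vpts = {1: {"P": 100}, 2: {"P": 300, "Q": 200}}
print(build_apt(1, vpts, {"P": 2, "Q": 2}, ["P", "Q", "S"]))
# {'P': 100, 'Q': 200, 'S': 'DATABASE'}
```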
- A snapshot discrimination policy may specify that older snapshots outlive more recently declared snapshots. Since snapshots may be archived incrementally, managing storage for snapshots according to such a discrimination policy can be costly. Pages that belong to a longer-lived snapshot may be recorded by a later short-lived snapshot, thus entangling short-lived and long-lived pages. When different lifetime pages are entangled, discarding shorter-lived pages may create archive storage fragmentation. For example, consider two consecutive snapshots v and v+1 in Figure 2, with v recording pages P_v and S_v, and v+1 recording pages P_{v+1}, Q_{v+1} and S_{v+1}. The page Q_{v+1} recorded by v+1 belongs to both snapshots v and v+1.
- Thresher exploits the copying of past states in a split snapshot system.
- Thresher may "segregate" the long-lived and the short-lived snapshot pages and copy different lifetime pages into different archive areas.
- As a result, reclamation may create no fragmentation.
- If the archive system knows at snapshot v+1 creation time that it is shorter-lived than v, it can store the long-lived snapshot pages P_v, S_v and Q_{v+1} in a long-lived archive area and the transient pages P_{v+1} and S_{v+1} in a short-lived area, so that the shorter-lived pages can be reclaimed without fragmentation.
- This approach, therefore, combines a discrimination policy and a discrimination mechanism.
- Below, the discrimination policies supported in Thresher are categorized, and the discrimination mechanisms for the different policies are described.
- A snapshot discrimination policy may convey to the snapshot storage management system the importance of snapshots, so that more important snapshots can have longer lifetimes or can be stored with faster access. Thresher may support a class of flexible discrimination policies, described below using an example.
- An application may specify a discrimination policy by providing a relative snapshot ranking. Higher-ranked snapshots may be deemed more important. By default, every snapshot may be created with the lowest rank. An application can "bump up" the importance of a snapshot by assigning it a higher rank.
- For example, in an ICU, a policy may assign the lowest rank to snapshots corresponding to minute-by-minute vital signs monitor readings, a higher rank to the monitor readings that correspond to hourly nurses' checkups, and a yet higher rank to the readings viewed in doctors' rounds. Within a given rank level, more recent snapshots may be considered more important.
- The discrimination policy may assign longer lifetimes to more important snapshots, defining a 3-level sliding window hierarchy of snapshot lifetimes.
- This general class of discrimination policies is referred to as "rank-tree". In a rank-tree policy, rank levels are given integer values 1 through k, and the following properties hold (a sketch of such a policy follows the list):
- RT1: A snapshot ranked at level i, i > 1, corresponds to a snapshot at each lower rank level from 1 to (i - 1).
- RT2: Ranking a snapshot at a higher rank level increases its lifetime.
- RT3: Within a rank level, more recent snapshots outlive older snapshots.
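- A minimal sketch of such a policy, with illustrative per-level sliding windows (the window sizes are assumptions, not part of the policy definition):

```python
# Illustrative k-level rank-tree: a level-i snapshot also counts at every
# lower level (RT1); each level retains a sliding window of its most recent
# snapshots, so a higher rank extends lifetime (RT2) and reclamation within a
# level is temporal (RT3).
def live_snapshots(declared, windows):
    """declared: [(seq, rank), ...] in declaration order;
    windows[i-1]: how many level-i snapshots are retained."""
    live = set()
    for level, window in enumerate(windows, start=1):
        at_level = [seq for seq, rank in declared if rank >= level]   # RT1
        live.update(at_level[-window:])                # RT2 and RT3
    return sorted(live)

declared = [(1, 3), (2, 1), (3, 1), (4, 2), (5, 1), (6, 1)]
print(live_snapshots(declared, windows=[2, 2, 2]))  # [1, 4, 5, 6]
```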
- Figure 4 depicts an embodiment of a 3-level rank-tree policy for the hospital example, where snapshot number 1, ranked at level 3, corresponds to a monitor reading that was sent for inspection to both the nurse and the doctor, but snapshot number 4 was only sent to the nurse.
- An application can specify a rank-tree policy "eagerly" by providing a snapshot rank at snapshot declaration time, or "lazily", by providing the rank after declaring a snapshot.
- An application can also ask to store recent snapshots with faster access. In the hospital example above, the importance and the relative lifetimes of the snapshots associated with routine procedure are likely to be known in advance, so the hospital application can specify a snapshot discrimination policy eagerly.
- The eager "ranked segregation" protocol may provide efficient discrimination for eager rank-tree policies.
- For each rank level i, the protocol may assign a separate archive region to hold the snapshot pages (in a "volume^i") and the snapshot page tables (VPT^i) for snapshots at level i.
- The protocol may segregate the different lifetime pages and copy them into the corresponding regions. This way, each region may contain pages and page tables with the same lifetime, and temporal reclamation of snapshots (satisfying policy property RT2) within a region does not create disk fragmentation.
- Figure 3 shows an embodiment of a segregated archive.
- Snapshots ranked at level i may be archived in the same incremental manner as in SNAP, and at the same low sequential cost.
- The cost is low because, by using sufficiently large write buffers (one for each volume), archiving to multiple volumes can be as efficient as strictly sequential archiving into one volume. Since it is expected that the rank-tree will be quite shallow, the total amount of memory allocated to write buffers is small.
- The eager ranked segregation works as follows.
- The declaration of a snapshot v with a rank specified at level k (k > 1) may create a separate incremental snapshot page table VPT_v^i for every rank level i (i ≤ k).
- The incremental page table VPT_v^i may collect the mappings for the pages recorded by snapshot v at level i. Since the incremental tables in VPT^i may map the pages recorded by all the snapshots at level i, the basic snapshot page table reconstruction protocol based on a forward scan through VPT^i, detailed above, can be used in region i to reconstruct snapshot page tables for level i snapshots.
- The recorded pages may contain the pre-state before the first page modification in the snapshot span. Since the span of snapshot v at level i (denoted S_v^i) may include the spans of all the lower-level snapshots declared during S_v^i, pages recorded by a level i snapshot v may also be recorded by some of these lower-ranked snapshots. In Figure 4, the span of snapshot 4 ranked at level 2 includes the spans of snapshots (4), 5 and 6 at level 1. Therefore, a page recorded by snapshot 4 at level 2 may also be recorded by one of the snapshots (4), 5, or 6 at level 1.
- A page P recorded by snapshots at multiple levels may be archived in the volume of the highest-ranked snapshot that records P.
- The highest recorder "captures" P. Segregating archived pages this way guarantees that a volume of the shorter-lived snapshots contains no longer-lived pages, and therefore temporal reclamation within a volume creates no fragmentation.
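- The capture rule can be sketched as follows; the volume and page table structures are assumptions for illustration:

```python
# Illustrative eager segregation of one recorded page version: archive it once,
# in the volume of the highest-ranked recording snapshot ("capture"), and
# insert the mapping into the page table of every recording level.
def archive_recorded_page(page_id, page, recorders, volumes, vpts):
    """recorders: {level: snapshot seq} for each level recording this version."""
    capture = max(recorders)                       # highest recorder captures P
    volumes[capture].append((page_id, page))
    addr = (capture, len(volumes[capture]) - 1)    # address inside volume^capture
    for level, seq in recorders.items():
        vpts.setdefault((level, seq), {})[page_id] = addr
    return addr

volumes, vpts = {1: [], 2: []}, {}
# As in Figure 5: P's pre-state is recorded by v_5 at level 1 and v_4 at level 2,
# so it is captured in the level-2 volume, and both page tables map to it.
archive_recorded_page("P", "pre-state", {1: 5, 2: 4}, volumes, vpts)
print(vpts)  # {(1, 5): {'P': (2, 0)}, (2, 4): {'P': (2, 0)}}
```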
- The mappings in a snapshot page table VPT_v^i in region i may point to the pages recorded by snapshot v in whatever region these pages are archived. Snapshot reclamation may need to ensure that the snapshot page table mappings are safe, that is, that they do not point to reclaimed pages.
- The segregation protocol may guarantee the safety of the snapshot page table mappings by enforcing the following invariants, I1 and I2, that constrain the intra-level and inter-level reclamation order for snapshot pages and page tables:
- I1: VPT_v^i and the pages recorded by snapshot v that are captured in volume^i are reclaimed together, in temporal snapshot order.
- Invariant I1 may ensure that in each given rank-tree level, the snapshot page table mappings are safe when they point to pages captured in volumes within the same level.
- Invariant I2 may ensure that the snapshot page table mappings are safe when they point to pages captured in volumes above their level. Note that the rank-tree policy property RT2 only requires that "bumping up" a lower-ranked snapshot v to level k extend its lifetime; it does not constrain the lifetimes of the lower-level snapshots declared in the span of v at level k. Invariant I2 may ensure the safety of the snapshot table mappings for these later lower-level snapshots.
- Figure 5 depicts an embodiment of the eager segregation protocol for a two-level rank-tree policy.
- Snapshot v_4, specified at level 2, has a snapshot page table at both level 1 and level 2.
- The archived page P modified within the span of snapshot v_5 is recorded by snapshot v_5, and also by the level 2 snapshot v_4.
- This version of P is archived in the volume of the highest recording snapshot (denoted volume^2_{v_4}).
- The snapshot page tables of both recording snapshots, VPT^1_{v_5} and VPT^2_{v_4}, contain this mapping for P.
- The pre-state of page Q modified within the span of v_6 is also captured in volume^2_{v_4}.
- P is modified again within the span of snapshot v_6.
- This later version of P may not be recorded by snapshot v_4 at level 2, since v_4 has already recorded its version of P.
- This later version of P may be archived in volume^1 and its mapping may be inserted into VPT^1_{v_6}.
- Invariant I1 may guarantee that in VPT^1_{v_6} the mapping for page P in volume^1 is safe.
- Invariant I2 may guarantee that in VPT^1_{v_6} the mapping for page Q in volume^2_{v_4} is safe.
- Some applications may need to defer snapshot ranking until after the snapshot has already been declared, i.e. use a lazy rank-tree policy.
- In this case, snapshot discrimination may be costly because it requires copying.
- The lazy segregation protocol may provide efficient lazy discrimination by combining two techniques to reduce the cost of copying. First, it may use a more compact diff-based representation for snapshot pages, so that there is less to copy. Second, the diff-based representation (as explained below) may include a component that has a page-based snapshot representation. This page-based component may be segregated without copying, using the eager segregation protocol.
- The compact diff-based representation may implement the same abstraction of snapshot pages and snapshot page tables as the page-based snapshot representation. It may be similar to a database redo recovery log, consisting of sequential repetitions of two types of components: checkpoints and diffs.
- The checkpoints may be incremental page-based snapshots declared periodically by the storage management system.
- The diffs may be versioned page diffs, consisting of versioned object modifications clustered by page. Since typically only a small fraction of the objects in a page may be modified by a transaction, and moreover, many attributes do not change, it is expected that the diffs will be compact.
- The log repetitions containing the diffs and the checkpoints may be archived sequentially, with diffs and checkpoints written into different archive data structures.
- The incremental snapshot page tables may collect the archived page mappings for the checkpoint snapshots.
- A simple page index structure may keep track of the page-diffs in each log repetition (the diffs in one log repetition are referred to as a "diff extent").
- The cleaner may sort the diffs in an in-memory buffer, assembling the page-based diffs for the diff extents.
- The available sorting buffer size may determine the length of diff extents. Since frequent checkpoints decrease the compactness of the diff-based representation, to get better compactness, the cleaner may create several diff extents in a single log repetition. Increasing the number of diff extents may slow down BITE. This trade-off is similar to that of the recovery log. For brevity, the details of how the diff-based representation is constructed are omitted; they can be found in "Timebox: A High Performance Archive for Split Snapshots" (Hao Xu, PhD Thesis, Brandeis University, 2005), which is incorporated by reference in its entirety. Below, some of the issues related to the compactness of the diff-based representation that are relevant to snapshot storage management performance are discussed.
- The snapshots declared between checkpoints may be reconstructed by first mounting the snapshot page table for the closest (referred to as "base") checkpoint and the corresponding diff-page index. This may allow BITE to access the checkpoint pages and the corresponding page-diffs.
- To reconstruct a page P, the server may read page P from the checkpoint, then read the diff-pages for P from all the needed diff extents and apply them to the checkpoint P in order.
- Figure 6 shows an embodiment of reconstructing a page P in a diff-based snapshot v from a checkpoint page P_0 and diff-pages contained in several diff extents.
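- A minimal sketch of this reconstruction, under an assumed representation in which a diff-page is a map of object updates:

```python
# Illustrative diff-based reconstruction of page P for a snapshot: start from
# the base checkpoint's copy of P and apply P's diff-pages from each needed
# diff extent, in log order.
def reconstruct_page(page_id, checkpoint, extents):
    """checkpoint: {page_id: {oid: state}}; extents: diff extents in order,
    already truncated at the requested snapshot."""
    page = dict(checkpoint.get(page_id, {}))
    for extent in extents:
        page.update(extent.get(page_id, {}))   # apply this extent's diff-page
    return page

checkpoint = {"P": {0: "x0", 1: "y0"}}
extents = [{"P": {0: "x1"}},                   # first extent's diff-page for P
           {"P": {1: "y2"}, "Q": {0: "q1"}}]   # second extent
print(reconstruct_page("P", checkpoint, extents))  # {0: 'x1', 1: 'y2'}
```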
- To discriminate lazily, the system may simply read back the archived diff extents, assemble the diff extents for the longer-lived snapshots, create the corresponding long-lived base checkpoints, and archive the retained snapshots sequentially into a longer-lived area. If diffs are compact, the cost of copying may be low.
- The long-lived base checkpoints may be created without copying, by separating the long-lived and short-lived checkpoint pages using eager segregation. Since checkpoints may simply be page-based snapshots declared periodically by the system, the system can derive the ranks for the base checkpoints once the application specifies the snapshot ranks. Knowing ranks at checkpoint declaration time may enable eager segregation.
- Figure 7 shows an embodiment of the eager rank-tree policy for checkpoints in lazy segregation.
- The representation for level-1 snapshots has the diff extents E_1, E_2 and E_3 (in the archive region G_d^1) associated with the base checkpoints B_1, B_2 and B_3.
- Eventually, extents E_1, E_2 and E_3 may be merged into a longer-lived extent E (in region G_d^2).
- This merged extent E has a base checkpoint B_1.
- The lazy segregation protocol may be optimized for the case where the application specifies the snapshot rank within a limited time period after snapshot declaration, which is expected to be the common case. If the limit is exceeded, the system may reclaim shorter-lived base checkpoints by copying out longer-lived pages, at a much higher cost. The same approach can also be used if the application needs to change the discrimination policy.
- The diff-based representation may be more compact, but has slower BITE than the page-based representation.
- Some applications require lazy discrimination but also need low-latency BITE on a recent window of snapshots, for example, to examine the recent snapshots and identify the ones to be retained.
- The eager segregation protocol may allow efficient composition of the diff-based and page-based representations to provide fast BITE on recent snapshots and lazy snapshot discrimination.
- The composed representation, called "Hybrid", works as follows. When an application declares a snapshot, Hybrid may create two snapshot representations. A page-based representation may be created in a separate archive region that maintains a sliding window of W recent snapshots, reclaimed temporally. BITE on snapshots within W may run on the fast page-based representation.
- Hybrid may also create a diff-based representation for the snapshots.
- BITE on snapshots outside W may run on the slower diff-based representation. Snapshots within W therefore may have two representations (page-based and diff-based).
- The eager segregation protocol can be used to efficiently compose the two representations and provide efficient reclamation.
- The system may specify an eager rank-tree policy that ranks the page-based snapshots as lowest-rank (level-0) rank-tree snapshots, but ranks the ones that correspond to the system-declared checkpoints in the diff-based representation as level-1.
- The checkpoints may be further discriminated by bumping up the rank of the longer-lived checkpoints.
- The eager segregation protocol may retain the snapshots declared by the system as checkpoints without copying, and may reclaim the aged snapshots in the page-based window W without fragmentation.
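- How Hybrid drives the eager protocol can be sketched with an illustrative rank assignment; the checkpoint spacing and the retention period are assumptions:

```python
# Illustrative Hybrid ranking: every snapshot gets a level-0 page-based copy
# (the sliding window W, reclaimed temporally); periodically a snapshot doubles
# as a base checkpoint of the diff-based representation (level 1); some
# checkpoints are bumped higher to live longer.
def hybrid_rank(seq, checkpoint_every=3, long_lived_every=6):
    if seq % long_lived_every == 1:
        return 2        # longer-lived base checkpoint
    if seq % checkpoint_every == 1:
        return 1        # ordinary base checkpoint
    return 0            # page-based copy only, lives in window W

print([hybrid_rank(s) for s in range(1, 8)])  # [2, 0, 0, 1, 0, 0, 2]
```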
- The cost of checkpoint creation and segregation may be completely absorbed into the cost of creating the page-based snapshots, resulting in a lower archiving cost than the simple sum of the two representations.
- Figure 8 shows an embodiment of reclamation in the Hybrid system that adds faster BITE to the snapshots in Figure 7.
- The system may create the page-based snapshots V_i and use them to run fast BITE on recent snapshots. Snapshots V_1 and V_4 may be used as base checkpoints B_1 and B_2 for the diff-based representation, and checkpoint B_1 may be retained as a longer-lived checkpoint.
- The system may specify an eager rank-tree policy, ranking snapshots V_i at level 0, and bumping up V_1 to level 2 and V_4 to level 1. This may allow the eager segregation protocol to create the checkpoints B_1 and B_2, and eventually reclaim B_2, V_5 and V_6 without copying.
- The metric λ_dp captures the non-disruptiveness of an I/O-bound storage system. This metric is used to gauge the impact of snapshot discrimination.
- Let r_pour be the "pouring rate": the average object cache (MOB/VMOB) free-space consumption speed due to incoming transaction commits, which insert modified objects.
- Let r_drain be the "draining rate": the average growth rate of object cache free space produced by MOB/VMOB cleaning.
- λ_dp is defined as λ_dp = r_drain / r_pour, and indicates how well the draining keeps up with the pouring. If λ_dp > 1, the system operates within its capacity and the foreground transaction performance is not affected by background cleaning activities. If λ_dp < 1, the system is overloaded, transaction commits eventually block on free object cache space, and clients experience commit delays. For example, if cleaning frees cache space at 12MB/s while commits consume it at 8MB/s, then λ_dp = 1.5 and cleaning keeps up.
- Let t_clean be the average cleaning time per dirty database page; t_clean determines r_drain. In Thresher, t_clean reflects, in addition to the database ireads and writes, the cost of snapshot creation and snapshot discrimination. Since snapshots are created on a separate disk in parallel with the database cleaning, the cost of snapshot-related activity can be at least partially hidden behind the database cleaning.
- Overwriting is an update workload parameter, defined as the percentage of repeated modifications to the same object or page. The amount of overwriting affects both r_pour and r_drain.
- Workloads with overwriting cause less cleaning in the storage system, because the object cache (MOB/VMOB) absorbs repeated modifications, but high-frequency snapshots may need to archive most of the repeated modifications. With less cleaning, it may be harder to hide archiving behind cleaning, so snapshots may become more disruptive.
- On the other hand, workloads with repeated modifications reduce the amount of copying when lazy discrimination copies diffs.
- To quantify these effects, r_pour and r_drain are measured experimentally in a system with and without discrimination, for a range of workloads with low, medium and high degrees of overwriting, and the resulting λ_dp is analyzed.
- λ_dp may determine the maximum throughput of an I/O-bound storage system. Measuring the maximum throughput in a system with and without discrimination could provide an end-to-end metric for gauging the impact of discrimination. λ_dp is focused on here because it allows a better explanation of the complex dependency between the workload parameters and the cost of discrimination.
- The effectiveness of the diff-based representation in reducing the copying cost may depend on the "compactness" of the representation.
- Compactness is characterized by a relative snapshot retention metric R, defined as the size of the snapshot state written into the archive for a given snapshot history length H, relative to the size of the snapshot state for H captured in full snapshot pages.
- R is 1 for the page-based representation.
- R for the diff-based representation has two contributing components: R_ckp for the checkpoints and R_diff for the diffs.
- "Density", a workload parameter defined as the fraction of a page that gets modified by an update, may determine R_diff.
- R_ckp may depend on the frequency of checkpoints, determined by L, the number of snapshots declared in the history interval corresponding to one log repetition.
- Increasing L may decrease R_ckp, since checkpoints are page-based snapshots that record the first pre-state for each page modified in the log repetition.
- Increasing L by increasing d, the number of diff extents in a log repetition, may raise the snapshot page reconstruction cost for BITE.
- Increasing L without increasing d may require additional server memory for the cleaner to sort diffs when assembling diff pages.
- The diff-based representation may not be compact if transactions modify all the objects in a page. Common update workloads have sparse modifications, because most applications modify far fewer objects than they read. The compactness of the diff-based representation is determined by measuring R_diff and R_ckp for workloads with the expected medium and low update density.
- Thresher implements the described techniques in SNAP, including the support for recovery during normal operation, but without the failure recovery procedure. This allows evaluation of system performance in the absence of failures. Comparing the performance of Thresher and SNAP reveals a meaningful snapshot discrimination cost, because SNAP is very efficient: even at high snapshot frequencies it has a low impact on the storage system.
- An OO7 transaction includes a read-only traversal (T1) or a read-write traversal (T2a or T2b).
- The traversals T2a and T2b generate workloads with a fixed amount of object overwriting and density.
- Extended traversals, summarized below, were implemented to allow control of these parameters.
- A variant traversal T2a' is used that extends T2a to update a randomly selected AtomicPart object of a CompositePart, instead of always modifying the same (root) object as in T2a.
- Each T2a' traversal modifies 500 objects.
- The desired amount of overwriting is achieved by adjusting the object update history in a sequence of T2a' traversals.
- A workload parameter controls the amount of overwriting.
- The experiments use three settings for this parameter, corresponding to low (0.08), medium (0.30) and very high (0.50) degrees of overwriting.
- Like T2a', a T2f traversal modifies 500 objects.
- Density here is the average number of modified AtomicPart objects on a dirty page when the dirty page is written back to the database (on average, a page in OO7 has 27 such objects).
- T2f modifies a group of AtomicPart objects around the chosen one. Denote by T2f-g the workload with a group of size g.
- T2f-1 is essentially T2a'.
- The workload density is controlled by specifying the size of the group.
- Since repeated T2f-g traversals update multiple objects on each data page, due to the write absorption provided by the MOB, T2f-g, like T2a', also controls the overwriting between traversals.
- The single-client experiments run with snapshot frequency 1, declaring a snapshot after each transaction, in a 3-user OO7 database (185MB in size).
- The multi-client scalability experiments run with snapshot frequency 10 in a large database (140GB in size).
- The size of a single private client module is the same in both configurations. All the reported results show the mean of at least three trials, with a maximum standard deviation of 3%.
- The storage system server runs on a Linux (kernel 2.4.20) workstation with dual 64-bit 3GHz Xeon CPUs and 1GB of RAM.
- Two Seagate Cheetah disks (model ST3146707LC, 10000 RPM, 4.7ms average seek, Ultra320 SCSI) are directly attached to the server via an LSI Fusion MPT adapter.
- The database and the archive reside on separate raw hard disks.
- The implementation uses Linux raw devices and direct I/O to bypass the file system cache.
- The clients run on workstations with a single 850MHz Pentium III CPU and 512MB of RAM.
- The clients and the server are interconnected via a 100Mbps switched network.
- In single-client experiments, the server is configured with 18MB of page cache (10% of the database size) and a 2MB MOB in Thor. In multi-client experiments, the server is configured with 30MB of page cache and 8-11MB of MOB in Thor.
- The snapshot systems are configured with slightly more memory for the VMOB, so that the same number of dirty database pages is generated in all snapshot systems, normalizing the r_drain comparison to Thor.
- The cost of eager discrimination in Thresher includes the cost of creating VPTs for higher-level snapshots.
- Table 1 shows t_clean in Thresher for a two-level eager rank-tree, with the inter-level retention fraction f set to one snapshot in 200, 400, 800, and 1600.
- For comparison, t_clean in SNAP is 5.07ms.
- The small incremental page tables contribute a very small fraction (0.02% to 0.14%) of the overall archiving cost, even for the lowest-level snapshots, rendering eager segregation essentially free. This result is important because eager segregation is used to reduce the cost of lazy segregation and of the hybrid representation.
- A key factor affecting the cleaning costs in the diff-based systems is the compactness of the diff-based representation.
- A diff-based system configured with a 4MB sorting buffer, under medium overwriting, has a very low R_ckp (0.5% - 2%) for the low-density workload, for which R_diff is 0.3%.
- For the medium-density workload, R_diff is 3.7%.
- Table 2 shows the cleaning costs and λ_dp for all four systems, for the medium-density workload with low, medium, and high overwriting.
- The t_clean measured in the Lazy and Diff systems includes the database iread and write costs, the CPU cost of processing the VMOB, the page archiving and checkpointing cost via parallel I/O, snapshot page table archiving, and the cost of sorting diffs and creating diff extents, but does not directly include the cost of reading and archiving diffs, since this activity is performed asynchronously with cleaning.
- Instead, the measured t_diff reflects these diff-related costs (including I/O on diff extents and diff page index maintenance) per dirty database page.
- The t_clean measured for SNAP and Thor includes the corresponding relevant cost components.
- Compared to Diff, Lazy has a higher t_diff, reflecting the diff copying overhead. This overhead decreases as the overwriting rate increases; t_diff does not drop proportionally to the overwriting increase because the dominant cost component of creating higher-level extents, reading back the extents in the lowest level, is insensitive to the overwriting rate. Lazy pays no checkpoint segregation cost because it uses the eager protocol.
- The Hybrid system incurs the archiving costs of a page-based snapshot system, plus the costs of diff extent creation and segregation, making it the costliest of the Thresher configurations. Workload density impacts the diff-related costs.
- Figure 9 shows how the non-disruptiveness λ_dp of Hybrid decreases relative to Thor for workloads with low, medium and high density and a fixed medium overwriting. A denser workload implies more diff I/O. Under the densest possible workload, included here for comparison, the drop of λ_dp of Hybrid relative to Thor is 13.6%, whereas for the common expected medium- and low-density workloads the drop is 3.2% and 1.8%, respectively.
- BITE performance in Diff and SNAP is compared to Thor.
- The experiment creates Diff and SNAP archives by running 16000 medium-density, medium-overwriting traversals, declaring a snapshot after each traversal.
- In SNAP, the incremental VPT protocol checkpoints VPTs at 4MB intervals to bound the reconstruction scan cost.
- The APT mounting time, which depends on the distance from the VPT checkpoint, follows a sawtooth pattern between 21.05ms and 48.77ms.
- The latency of BITE is determined by the average fetch cost via the APT (4.10ms per page).
- Diff mounts a snapshot by mounting the closest checkpoint, i.e. reconstructing the checkpoint page table (at the same cost as a VPT in SNAP), and mounting the involved page index structures, with an average page index mounting time of 7.61ms.
- To reconstruct a requested page P, Diff reads the checkpoint page and the d diff-pages of P.
- The average cost to fetch a checkpoint page is 5.80ms, and the cost to fetch a diff-page from one extent is 5.42ms.
- The cost of constructing the requested page version, by applying the diff-pages back to the checkpoint page, is negligible.
- Table 3 shows the average end-to-end BITE cost measured at the client side by running one standard OO7 T1 traversal against Thor, SNAP and Diff, respectively. Hybrid has the latency of SNAP for recent snapshots, and the latency of Diff otherwise. The end-to-end BITE latency (page fetch cost) increases over time as pages are archived. Table 3 lists the numbers corresponding to a particular point in system execution history, with the intention of providing a general indication of BITE performance on the different representations compared to the performance of accessing the current database. The performance gap between page-based and diff-based BITE motivates the hybrid representation.
- In the multi-client experiments, the database size is 140GB, which virtually contains over 3000 OO7 modules. Each client accesses its private module (45MB in size) in the database.
- The private modules of the testing clients are evenly allocated within the 140GB space.
- Under the increasing client loads, the λ_dp of Thor is 2.85, 1.64 and 1.30, respectively.
- Figure 10 shows the decrease of λ_dp in Hybrid relative to Thor as the load increases. Note that adding more concurrent clients does not cause Hybrid to perform worse; in fact, with an 8-client concurrent workload, Hybrid performs better than with a single-client workload.
Abstract
Thresher is a new snapshot storage management system, based on a novel snapshot technique that is the first to provide applications the ability to discriminate among snapshots efficiently. An application may provide a discrimination policy that ranks snapshots. The policy may be specified when snapshots are taken, or later after snapshots have been created. Thresher may efficiently disentangle differently ranked snapshots, allowing valuable snapshots to be stored with faster access or to remain accessible for longer, while also allowing less valuable snapshots to be discarded. These operations may be executed in Thresher without creating disk fragmentation.
Description
A Device and Method for an Efficient Storage Manager for Copy-on-Write Snapshots
BACKGROUND OF THE INVENTION
Field of the Invention
[001] The present invention relates to snapshot storage management systems. More specifically, the present invention relates to snapshot storage management systems that provide applications with the ability to discriminate among snapshots efficiently.
Description of the Prior Art
[002] A new generation of storage systems exploit decreasing storage costs and efficient versioning techniques to allow applications to take snapshots of past states and retain them for long durations. Snapshot analysis is becoming increasingly important. For example, an Intensive Care Unit (ICU) monitoring system may analyze the information on patients' past response to treatment.
[003] Over time, current snapshot techniques can produce large volumes of snapshots. Indiscriminately keeping all snapshots accessible is impractical, even if raw disk storage is cheap, because administering such large-volume storage is expensive over a long duration. Moreover, not all snapshots are equally valuable. Some are of value for a long time, some for a short time. Some may require faster access. For example, a patient monitoring system might retain readings showing an abnormal behavior. Recent snapshots may require faster access than older snapshots.
[004] Current snapshot systems do not provide applications with the ability to discriminate efficiently among snapshots, so that valuable snapshots remain accessible while less valuable snapshots are discarded or moved off-line. The problem is that incremental copy-on-write, the basic technique that makes snapshot creation efficient, entangles on the disk successive snapshots. Separating entangled snapshots creates disk fragmentation that reduces snapshot system performance over time.
BRIEF DESCRIPTION OF THE DRAWINGS
[005] Embodiments of the invention will be understood and appreciated more fully from the following detailed description in conjunction with the drawings in which like reference numerals indicate corresponding, analogous or similar elements, and in which:
[006] Figure 1 shows an embodiment of a page-based representation of BITE;
[007] Figure 2 shows an embodiment of a split copy-on-write;
[008] Figure 3 shows an embodiment of archiving split snapshots;
[009] Figure 4 shows an embodiment of rank-tree policy;
[0010] Figure 5 shows an embodiment of eager ranked segregation;
[0011] Figure 6 shows an embodiment of a diff-based representation of BITE;
[0012] Figure 7 shows an embodiment of lazy segregation;
[0013] Figure 8 shows an embodiment of reclamation in Hybrid;
[0014] Figure 9 shows a graph λdP relative to Thor in an embodiment of Hybrid; and
[0015] Figure 10 shows a graph λ^p relative to Thor in an embodiment having multiple clients.
DESCRIPTION OF THE PREFERRED EMBODIMENT
[0016] In the following description, various aspects of the present invention will be described. For purposes of explanation, specific configurations and details are set forth in order to provide a thorough understanding of the present invention. However, it will be apparent to one skilled in the art that the present invention may be practiced without the specific details presented herein. Furthermore, well- known features may be omitted or simplified in order not to obscure the present invention. Various examples are given throughout this description. These are merely descriptions of specific embodiments of the invention. The scope of the invention is not limited to the examples given.
[0017] Thresher is a new snapshot storage management system, based on a novel snapshot technique that is the first to provide applications the ability to discriminate among snapshots efficiently. An application may provide a discrimination policy that ranks snapshots. The policy may be specified when snapshots are taken, or later after snapshots have been created. Thresher may efficiently disentangle differently ranked snapshots, allowing valuable snapshots to be stored with faster access or to remain accessible for longer, while also allowing less valuable snapshots to be discarded. These operations may be executed in Thresher without creating disk fragmentation.
[0018] Thresher is based on two key innovations. First, a novel technique called "ranked segregation" may efficiently separate on disk the states of differently-ranked copy-on-write snapshots, enabling no-
copy reclamation without fragmentation. Second, while most snapshot systems rely on a no-overwrite update approach, Thresher may rely on a novel update-in-place technique that provides an efficient way to transform snapshot representation as snapshots are created.
[0019] The ranked segregation technique may be efficiently composed with different snapshot representations to lower the storage management costs for several useful discrimination policies. When applications need to defer snapshot discrimination, for example until after examining one or more subsequent snapshots to identify abnormalities, Thresher may segregate the normal and abnormal snapshots efficiently by composing ranked segregation with a compact diff-based representation to reduce the cost of copying. For applications that need faster access to recent snapshots, Thresher may compose ranked segregation with a dual snapshot representation that is less compact but provides faster access.
[0020] A snapshot storage manager, like a garbage collector, must be designed with a concrete system in mind, and must perform well for different application workloads. To explore how the performance of these techniques depends on the storage system workload, Thresher was prototyped in an experimental snapshot system that allows flexible control of workload parameters. Update density and overwriting were identified as the two key parameters that determine the performance of a snapshot storage manager. Measurements of the Thresher prototype indicate that the new techniques are efficient and scalable, imposing minimal (4%) performance penalty on common expected workloads.
[0021] Thresher is described herein in the context of a concrete system but it will be appreciated by one skilled in the art that the techniques described are more general and may be readily applied to other systems.
[0022] Thresher is implemented in SNAP, the snapshot system that provides snapshots for the Thor object storage system. SNAP assumes that applications are structured as sequences of "transactions" accessing a storage system. It supports Back-In-Time Execution (BITE), a capability of a storage system where applications running general code may run against read-only snapshots in addition to the current state. The snapshots may reflect transactionally consistent historical states. An application may choose which snapshots it wants to access so that snapshots may reflect states meaningful to the application. Applications may take snapshots at unlimited "resolution", e.g. after each transaction, without disrupting access to the current state.
[0023] Thresher may allow applications to discriminate among snapshots by incorporating a snapshot discrimination policy into the following three snapshot operations: a request to take a snapshot ("snapshot request", or "declaration") that provides a discrimination policy, or indicates lazy discrimination, a request
to access a snapshot ("snapshot access"), and a request to specify a discrimination policy for a snapshot ("discrimination request").
[0024] The operations may have the following semantics. Informally, an application may take a snapshot by asking for a snapshot "now". This snapshot request may be serialized along with other transactions and other snapshots. That is, a snapshot may reflect all state-modifications by transactions serialized before this request, but does not reflect modifications by transactions serialized after. A snapshot request may return a snapshot "name" that applications can use to refer to this snapshot later, e.g. to specify a discrimination policy for a snapshot. For simplicity, it is assumed that snapshots are assigned unique sequence numbers that correspond to the order in which they occur. A snapshot access request may specify which snapshot an application wants to use for back-in-time execution. The request may return a consistent set of object states, allowing the read-only transaction to run as if it were running against the current storage state. A discrimination policy may rank snapshots. A rank may be a numeric score assigned to a snapshot. Thresher may interpret the ranking to determine the relative lifetimes of snapshots and the relative snapshot access latency.
[0025] A snapshot storage management system needs to be efficient and not unduly slow-down the snapshot system. The general approach to snapshot discrimination may be applicable to snapshot systems that separate snapshots from the current storage system state. Such so-called "split" snapshot systems may rely on update-in-place storage and create snapshots by copying out the past states, unlike snapshot systems that rely on no-overwrite storage and do not separate snapshot and current states. Split snapshots may be attractive in long-lived systems because they allow creation of high-frequency snapshots without disrupting access to the current state while preserving the on-disk object clustering for the current state. This approach may take advantage of the separation between snapshot and current states to provide efficient snapshot discrimination. A specialized snapshot representation may be created which is tailored to the discrimination policy while copying out the past states.
[0026] Thor has a client/server architecture. Servers provide persistent storage (called database storage) for objects. Clients may cache copies of the objects and run applications that interact with the system by making calls to methods of cached objects. Method calls may occur within the context of a transaction. A transaction commit may cause all modifications to become persistent, while an abort may leave no transaction changes in the persistent state. The system may use optimistic concurrency control. A client may send its read and write object sets, with modified object states, to the server when the application asks to commit the transaction. If no conflicts are detected, the server may commit the transaction.
[0027] An object may belong to a particular server. The object within a server may be uniquely identified by an object reference (Oref). Objects may be clustered into 8KB pages. Typically objects are small and there are many of them in a page. An object Oref is composed of a PageID and an oid. The PageID identifies the containing page and allows the lookup of an object location using a page table. The oid is an index into an offset table stored in the page. The offset table contains the object offsets within the page. This indirection may allow moving an object within a page without changing the references to it.
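The Oref indirection can be summarized with a minimal sketch. The structures below (Oref, Page, locate) are illustrative assumptions; only the PageID/oid decomposition and the offset-table indirection come from the description above.

```python
from collections import namedtuple

# An Oref names an object by its containing page and its slot in the
# page's offset table; the 8KB page size comes from the text above.
Oref = namedtuple("Oref", ["page_id", "oid"])

class Page:
    def __init__(self):
        self.offsets = []            # oid -> byte offset within the page
        self.data = bytearray(8192)  # 8KB page body

def locate(page_table, oref):
    """Two-level lookup: the page table finds the page, the page's
    offset table finds the object, so objects can move within a page."""
    page = page_table[oref.page_id]
    return page.data[page.offsets[oref.oid]:]
```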
[0028] When an object is needed by a transaction, the client may fetch the containing page from the server. Only modified objects may be shipped back to the server when the transaction commits. Thor provides transaction durability using the ARIES no-force no-steal redo log protocol. Since only modified objects may be shipped back at commit time, the server may need to do an installation read (iread) to obtain the containing page from disk. An in-memory, recoverable cache called the Modified Object Buffer (MOB) may store the committed modifications, allowing ireads to be deferred and write absorption to be increased. The modifications may be propagated to the disk by a background "cleaner" thread that cleans the MOB. The cleaner may process the MOB in transaction log order to facilitate the truncation of the transaction log. For each modified object encountered, it may read the page containing the object from disk (an iread) if the page is not cached, install all modifications in the MOB for objects in that page, write the updated page back to disk, and remove the objects from the MOB.
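The cleaner's iread-install-write-remove cycle can be sketched as follows, under simplifying assumptions: the MOB is modeled as an ordered list of (page id, oid, state) entries, pages as dictionaries, and the helper structures are hypothetical stand-ins.

```python
def clean_mob(mob, disk, cache):
    """mob: ordered list of (page_id, oid, state) in transaction log order;
    disk: page_id -> {oid: state}; cache: page_id -> {oid: state} or absent."""
    for page_id in [entry[0] for entry in mob]:            # log order
        pending = [e for e in mob if e[0] == page_id]
        if not pending:
            continue                                       # already cleaned
        page = cache.get(page_id) or dict(disk[page_id])   # iread if uncached
        for _, oid, state in pending:
            page[oid] = state                              # install modifications
        disk[page_id] = page                               # write updated page back
        mob[:] = [e for e in mob if e[0] != page_id]       # drop objects from MOB
```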
[0029] The server also may manage an in-memory page cache used to serve client fetch requests. Before returning a requested page to the client, the server may update the cache copy, installing all modifications in the MOB for that page so that the fetched page reflects the up-to-date committed state. The page cache may use Least Recently Used (LRU) replacement but discards old dirty pages (it may depend on ireads to read them back during MOB cleaning) rather than writing them back to disk immediately. Therefore, the cleaner thread may be the only component of the system that writes pages to disk.
[0030] SNAP may create snapshots by copying out the past storage system states onto a separate snapshot archive disk. A snapshot may provide the same abstraction as the storage system, consisting of "snapshot pages" and a "snapshot page table". This may allow unmodified application code running in the storage system to run as BITE over a snapshot.
[0031] SNAP may copy snapshot pages and snapshot page table mappings into the archive during cleaning. It may use an incremental copy-on-write technique specialized for split snapshots: a snapshot page is constructed and copied into the archive when a page on the database disk is about to be overwritten for the first time after a snapshot is declared. Archiving a page may create a snapshot page table mapping for the archived page.
[0032] Consider the pages of snapshot v and page table mappings over the transaction history starting with the snapshot v declaration. At the declaration point, all snapshot v pages are in the database and all the snapshot v page table mappings point to the database. Later, after several update transactions have committed modifications, some of the snapshot v pages may have been copied into the archive, while the rest are still in the database. If a page P has not been modified since v was declared, the snapshot v version of P is in the database. If P has been modified since v was declared, the snapshot v version of P is in the archive. The snapshot v page table mappings track this information, i.e. the archive or database address of each page in snapshot v.
Snapshot access
[0033] The following describes how BITE of unmodified application code running on a snapshot may use a snapshot page table to look up objects and transparently redirect object references within a snapshot between database and archive pages.
[0034] To request a snapshot v, a client application may send a "snapshot access request" to the server. The server may construct an Archive Page Table (APT) for version v (APT_v) and "mount" it for the client. APT_v may map each page in snapshot v to its archive address or indicate that the page is in the database. Once APT_v is mounted, the server may receive page fetch requests from the client, look up pages in APT_v, and read them from either the archive or the database. Since snapshots may be accessed read-only, APT_v can be shared by all clients mounting snapshot v.
[0035] Figure 1 shows an embodiment of how unmodified client application code accesses objects in snapshot v that includes both archived and database pages. For simplicity, it is assumed that all committed modifications have already been propagated to the database and the archive disk. In the figure, client code may request object y on page Q. The server may look up Q in APT_v, load page Q_v from the archive, and send it to the client. Later on, client code may follow a reference from y to x in the client cache, requesting object x in page P from the server. The server may look up P in APT_v and find that the page P for snapshot v is still in the database. The server reads P from the database and sends it to the client.
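The redirect logic of a snapshot access condenses into a few lines. The sketch below assumes hypothetical archive and database objects, each with a read method; the point is only the APT_v lookup that decides where each snapshot page lives.

```python
IN_DATABASE = object()   # sentinel: snapshot page still lives in the database

def fetch_snapshot_page(apt_v, page_id, archive, database):
    """Redirect a BITE page fetch using APT_v; 'archive' and 'database'
    are stand-ins for the two disks, each exposing a read() method."""
    location = apt_v[page_id]
    if location is IN_DATABASE:
        return database.read(page_id)   # page unmodified since snapshot v
    return archive.read(location)       # archived pre-state of the page
```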
[0036] In SNAP, the archive representation for a snapshot page may include the complete storage system page. This representation is referred to as "page-based". The following describes different snapshot page representations, specialized to various discrimination policies. For example, a snapshot page can have a more compact representation based on modified object diffs, or it can have two different representations. Such variations in snapshot representation are transparent to the application code running BITE, since the archive read operation reconstructs the snapshot page into the storage system representation before sending it to the client.
Snapshot creation
[0037] The notions of a "snapshot span" and pages "recorded" by a snapshot capture the incremental copy-on-write manner in which SNAP archives snapshot pages and snapshot page tables. Snapshot declarations may partition transaction history into "spans". The span of a snapshot v starts with its declaration and ends with the declaration of the next snapshot (v+1). Consider the first modification of a page P in the span of a snapshot v. The pre-state of P may belong to snapshot v and has to be eventually copied into the archive. We say snapshot v "records" its version of P. In Figure 2, snapshot v records pages P and S (retaining the pre-states modified by transaction tx2) and page T (retaining the pre-state modified by transaction tx1). Note that there may not be a need to retain the pre-state of page P modified by transaction tx3, since it is not the first modification of P in the span.
[0038] If v does not record a version of page P, but P is modified after v is declared, in the span of a later snapshot, the later snapshot may record v's version of P. In the above example, v's version of page Q may be recorded by a later snapshot v+1, which also records its own version of P.
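The recording rule can be modeled as a toy computation: given snapshot declaration times and a time-ordered modification history, the pre-state of a page's first modification within a span belongs to the snapshot that opened the span. All names below are illustrative assumptions.

```python
def recorded_versions(declarations, modifications):
    """declarations: sorted snapshot declaration times; modifications:
    time-ordered (time, page) pairs. Returns (snapshot, page) pairs
    naming which snapshot records which page version."""
    recorded = set()
    for t, page in modifications:
        span_owner = max(d for d in declarations if d <= t)
        recorded.add((span_owner, page))   # repeats within a span are no-ops
    return sorted(recorded)

# e.g. recorded_versions([0, 10], [(3, "P"), (5, "P"), (12, "P")])
# yields [(0, "P"), (10, "P")]: the second modification at time 5 is not
# the first in snapshot 0's span, so no extra pre-state is retained.
```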
[0039] Snapshot pages may be constructed and copied into the archive during cleaning when the pre- states of modified pages about to be overwritten in the database are available in memory. Since the cleaner may run asynchronously with the snapshot declaration, the snapshot system needs to prevent snapshot states from being overwritten by the on-going transactions. For example, if several snapshots are declared between two successive cleaning rounds, and a page P gets modified after each snapshot declaration, the snapshot system may have to retain a different version of P for each snapshot.
[0040] SNAP may prevent snapshot state overwriting without blocking the on-going transactions. It may retain the pre-states needed for snapshot creation in an in-memory data structure called the Versioned Modified Object Buffer (VMOB). The VMOB may contain a queue of buckets, one for each snapshot. Bucket v may hold modifications committed in v's span. As transactions commit modifications, modified objects may be added to the bucket of the latest snapshot (Step 1, Figure 3). The declaration of a new snapshot may create a new mutable bucket, and make the preceding snapshot bucket immutable, preventing the overwriting of the needed snapshot states.
[0041] A cleaner may update the database by cleaning the modifications in the VMOB, and in the process of cleaning, construct the snapshot pages for archiving. Steps 2-5 in Figure 3 trace this process. To clean a page P, the cleaner may first obtain a database copy of P. The cleaner may then use P and the modifications in the buckets to create all the needed snapshot versions of P before updating P in the database. Let v be the first bucket containing modifications to P, i.e. snapshot v records its version of P. The cleaner may construct the version of P recorded by v simply by using the database copy of P. The cleaner may then update P by applying modifications in bucket v, remove the modifications from the bucket v, and proceed to the following bucket. The updated P will be the version of P recorded by the snapshot that has the next modification to P in its bucket. This process may be repeated for all pages with modifications in VMOB, constructing the recorded snapshot pages for the snapshots corresponding to the immutable VMOB buckets.
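The per-page cleaning step described in this paragraph can be sketched as follows; the bucket encoding (an ordered list of (snapshot, modifications) pairs keyed by (page id, oid)) is an assumed simplification of the VMOB.

```python
def clean_page(page_id, db_page, buckets):
    """db_page: {oid: state}; buckets: ordered [(snapshot, mods)] where
    mods maps (page_id, oid) -> state. Returns the page versions to
    archive and the rolled-forward database copy of the page."""
    current = dict(db_page)
    to_archive = []
    for snapshot, mods in buckets:
        page_mods = {oid: s for (pid, oid), s in mods.items() if pid == page_id}
        if page_mods:
            to_archive.append((snapshot, dict(current)))  # pre-state recorded by snapshot
            current.update(page_mods)                     # apply this span's updates
    return to_archive, current
```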
[0042] The cleaner may write the recorded pages into the archive sequentially in snapshot order, thus creating "incremental" snapshots. The mappings for the archived snapshot pages may be collected in versioned incremental snapshot page tables. VPT_v (the versioned page table for snapshot v) may be a data structure containing the mappings (from page id to archive address) for the pages recorded by snapshot v. As pages recorded by v are archived, mappings may be inserted into VPT_v. After all pages recorded by v have been archived, VPT_v may be archived as well.
[0043] The cleaner may write the VPTs sequentially, in snapshot order, into a separate archive data structure. This way, a forward sequential scan through the archived incremental page tables from VPT_v onward may find the mappings for all the archived pages that belong to snapshot v. Namely, the mapping for v's version of page P may be found either in VPT_v or, if not there, in the VPT of the first subsequently declared snapshot that records P. SNAP efficiently bounds the length of the scan. For brevity, the bounded scan protocol is not reviewed here.
[0044] To construct a snapshot page table for snapshot v for BITE, SNAP may need to identify the snapshot v pages that are in the current database. Highest Archived Version (HAV) may be an auxiliary data structure that tracks the highest archived version for each page. If HAV(P) < v, the snapshot v page P may be in the database.
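Putting the last two paragraphs together, mounting a snapshot page table might look like the following sketch, which assumes vpts is an ordered mapping from snapshot number to its incremental VPT and ignores the bounded-scan optimization.

```python
def mount_apt(v, vpts, hav, snapshot_pages):
    """vpts: ordered {snapshot: {page_id: archive_addr}} in snapshot order;
    hav: page_id -> highest archived version; snapshot_pages: ids of the
    pages belonging to snapshot v."""
    apt = {}
    for page_id in snapshot_pages:
        if hav.get(page_id, 0) < v:
            apt[page_id] = "database"        # never archived for v or later
            continue
        for w, table in vpts.items():        # forward scan from VPT_v onward
            if w >= v and page_id in table:
                apt[page_id] = table[page_id]
                break
    return apt
```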
[0045] A snapshot discrimination policy may specify that older snapshots outlive more recently declared snapshots. Since snapshots may be archived incrementally, managing storage for snapshots according to such a discrimination policy can be costly. Pages that belong to a longer-lived snapshot may be recorded by a later short-lived snapshot, thus entangling short-lived and long-lived pages. When different lifetime pages are entangled, discarding shorter-lived pages may create archive storage fragmentation. For example, consider two consecutive snapshots v and v+1 in Figure 2, with v recording page versions P_v and S_v, and v+1 recording pages P_v+1, Q_v+1 and S_v+1. The page Q_v+1 recorded by v+1 belongs to both snapshots v and v+1. If the discrimination policy specifies that v is long-lived but v+1 is transient, reclaiming v+1 before v may create disk fragmentation. This is because we need to reclaim P_v+1 and S_v+1 but not Q_v+1, since Q_v+1 is needed by the long-lived v.
[0046] In a long-lived system, disk fragmentation may degrade archive performance, causing non-sequential archive disk writes. The alternative approach, which copies out the pages of the long-lived snapshots, may incur the high cost of random disk reads. But to remain non-disruptive, the snapshot system may need to keep the archiving costs low, i.e. limit the amount of archiving I/O and rely on low-cost sequential archive writes. The challenge is to support snapshot discrimination efficiently.
[0047] Thresher exploits the copying of past states in a split snapshot system. When the application provides a snapshot discrimination policy that determines the lifetimes of snapshots, Thresher may "segregate" the long-lived and the short-lived snapshot pages and copy different lifetime pages into different archive areas. When no long-lived pages are stored in short-lived areas, reclamation may create no fragmentation. In the example above, if the archive system knows at snapshot v+1 creation time that v+1 is shorter-lived than v, it can store the long-lived snapshot pages P_v, S_v and Q_v+1 in a long-lived archive area and the transient P_v+1 and S_v+1 pages in a short-lived area, so that the shorter-lived pages can be reclaimed without fragmentation.
[0048] This approach, therefore, combines a discrimination policy and a discrimination mechanism. Below, the discrimination policies supported in Thresher are categorized, followed by a description of the discrimination mechanisms for the different policies.
Discrimination policies
[0049] A snapshot discrimination policy may convey to the snapshot storage management system the importance of snapshots, so that more important snapshots can have longer lifetimes or can be stored with faster access. Thresher may support a class of flexible discrimination policies, described below using an example. An application may specify a discrimination policy by providing a relative snapshot ranking. Higher-ranked snapshots may be deemed more important. By default, every snapshot may be created with the lowest rank. An application can "bump up" the importance of a snapshot by assigning it a higher rank. In a hospital ICU patient database, a policy may assign the lowest rank to snapshots corresponding to minute-by-minute vital signs monitor readings, a higher rank to the monitor readings that correspond to hourly nurses' checkups, and a yet higher rank to the readings viewed in doctors' rounds. Within a given rank level, more recent snapshots may be considered more important. The discrimination policy may assign longer lifetimes to more important snapshots, defining a 3-level sliding window hierarchy of snapshot lifetimes.
[0050] The above policy may be representative of a general class of discrimination policies called "rank-tree". More precisely, a k-level rank-tree policy has the following properties, assuming rank levels are given integer values 1 through k (a sketch of such a policy follows the list):
RT1: A snapshot ranked at level i, i > 1, corresponds to a snapshot at each lower rank level from 1 to (i - 1).
RT2: Ranking a snapshot at a higher rank level increases its lifetime.
RT3: Within a rank level, more recent snapshots outlive older snapshots.
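A toy model of such a policy might look like the following; the per-level sliding windows and the retained helper are assumptions used only to illustrate how RT1-RT3 interact.

```python
def retained(snapshots, windows):
    """snapshots: {sequence_number: rank}; windows: {level: how many of the
    most recent snapshots at that level to keep}. By RT1 a snapshot ranked
    r counts at every level up to r; by RT3 each level keeps its newest;
    wider windows at lower levels give RT2's shorter lifetimes there."""
    keep = set()
    for level, width in windows.items():
        at_level = [s for s, r in sorted(snapshots.items()) if r >= level]
        keep.update(at_level[-width:])   # the 'width' most recent at this level
    return sorted(keep)

# e.g. retained({1: 3, 2: 1, 3: 1, 4: 2, 5: 1, 6: 1}, {1: 3, 2: 2, 3: 1})
# keeps snapshot 1 (level 3), snapshots 1 and 4 (level 2), and
# snapshots 4, 5, 6 (level 1), i.e. [1, 4, 5, 6].
```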
[0051] Figure 4 depicts an embodiment of a 3-level rank-tree policy for the hospital example, where snapshot number 1, ranked at level 3, corresponds to a monitor reading that was sent for inspection to both the nurse and the doctor, but snapshot number 4 was only sent to the nurse.
[0052] An application can specify a rank-tree policy "eagerly" by providing a snapshot rank at snapshot declaration time, or "lazily", by providing the rank after declaring a snapshot. An application can also ask to store recent snapshots with faster access. In the hospital example above, the importance and the relative lifetimes of the snapshots associated with routine procedure are likely to be known in advance, so the hospital application can specify a snapshot discrimination policy eagerly.
[0053] The eager "ranked segregation" protocol may provide efficient discrimination for eager rank-tree policies. The protocol may assign a separate archive region to hold the snapshot pages (volume^i) and snapshot page tables (VPT^i) for snapshots at level i. During snapshot creation, the protocol may segregate the different lifetime pages and copy them into the corresponding regions. This way, each region may contain pages and page tables with the same lifetime, and temporal reclamation of snapshots (satisfying policy property RT2) within a region does not create disk fragmentation. Figure 3 shows an embodiment of a segregated archive.
[0054] At each rank level i, snapshots ranked at level i may be archived in the same incremental manner as in SNAP and at the same low sequential cost. The cost is low because, by using sufficiently large write buffers (one for each volume), archiving to multiple volumes can be as efficient as strictly sequential archiving into one volume. Since the rank-tree is expected to be quite shallow, the total amount of memory allocated to write buffers is small.
[0055] The eager ranked segregation works as follows. The declaration of a snapshot v with a rank specified at level k (k ≥ 1) may create a separate incremental snapshot page table VPT_v^i for every rank level i (i ≤ k). The incremental page table VPT_v^i may collect the mappings for the pages recorded by snapshot v at level i. Since the incremental tables in VPT^i may map the pages recorded by all the snapshots at level i, the basic snapshot page table reconstruction protocol based on a forward scan through VPT^i, detailed above, can be used in region i to reconstruct snapshot page tables for level i snapshots.
[0056] The recorded pages may contain the pre-state before the first page modification in the snapshot span. Since the span of snapshot v at level i (denoted S_v^i) may include the spans of all the lower-level snapshots declared during S_v^i, pages recorded by a level i snapshot v may also be recorded by some of these lower-ranked snapshots. In Figure 4, the span of snapshot 4 ranked at level 2 includes the spans of snapshots (4), 5 and 6 at level 1. Therefore, a page recorded by snapshot 4 at level 2 may also be recorded by one of the snapshots (4), 5, or 6 at level 1.
[0057] A page P recorded by snapshots at multiple levels may be archived in the volume of the highest-ranked snapshot that records P. We say that the highest recorder "captures" P. Segregating archived pages this way guarantees that a volume of the shorter-lived snapshots contains no longer-lived pages, and therefore temporal reclamation within a volume creates no fragmentation.
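A minimal sketch of the capture rule might look like this; the volume and VPT encodings (append-only lists and dictionaries) are assumed simplifications, not Thresher's on-disk layout.

```python
def archive_version(page_id, page_bytes, recorders, volumes, vpts):
    """recorders: [(snapshot, level)] that record this page version;
    volumes: {level: append-only list}; vpts: {(snapshot, level): dict}.
    The version is written once, into the volume of the highest-level
    recorder (the capturer); every recorder's VPT gets the mapping."""
    top = max(level for _, level in recorders)
    volumes[top].append(page_bytes)
    addr = (top, len(volumes[top]) - 1)      # archive address of the single copy
    for snapshot, level in recorders:
        vpts[(snapshot, level)][page_id] = addr
    return addr
```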
[0058] The mappings in a snapshot page table VPT_v^i in region i may point to the pages recorded by snapshot v in whatever region these pages are archived. Snapshot reclamation may need to insure that the snapshot page table mappings are safe, that is, that they do not point to reclaimed pages. The segregation protocol may guarantee the safety of the snapshot page table mappings by enforcing the following invariant I that constrains the intra-level and inter-level reclamation order for snapshot pages and page tables:
I1: VPT_v^i and the pages recorded by snapshot v that are captured in volume^i are reclaimed together, in temporal snapshot order.
I2: Pages recorded by snapshot v at level k (k > 1), captured in volume^k, are reclaimed after the pages recorded by all level i (i < k) snapshots declared in the span of snapshot v at level k.
[0059] Invariant I1 may insure that, in each given rank-tree level, the snapshot page table mappings are safe when they point to pages captured in volumes within the same level. Invariant I2 may insure that the snapshot page table mappings are safe when they point to pages captured in volumes above their level. Note that the rank-tree policy property RT2 only requires that "bumping up" a lower-ranked snapshot v to level k may extend its lifetime, but it does not constrain the lifetimes of the lower-level snapshots declared in the span of v at level k. Invariant I2 may insure the safety of the snapshot table mappings for these later lower-level snapshots.
[0060] Figure 5 depicts an embodiment of the eager segregation protocol for a two-level rank-tree policy. Snapshot v4, specified at level 2, has a snapshot page table at both level 1 and level 2. The archived page P, modified within the span of snapshot v5, is recorded by snapshot v5 and also by the level-2 snapshot v4. This version of P is archived in the volume of the highest recording snapshot (denoted volume_v4). The snapshot page tables of both recording snapshots, VPT_v5 and VPT_v4, contain this mapping for P. Similarly, the pre-state of page Q modified within the span of v6 is also captured in volume_v4. P is modified again within the span of snapshot v6. This later version of P may not be recorded by snapshot v4 at level 2, since v4 has already recorded its version of P. This later version of P may be archived in volume_v6 and its mapping may be inserted into VPT_v6. Invariant I1 may guarantee that in VPT_v6 the mapping for page P in volume_v6 is safe. Invariant I2 may guarantee that in VPT_v6 the mapping for page Q in volume_v4 is safe.
[0061] Some applications may need to defer snapshot ranking to after the snapshot has already been declared (use a lazy rank-tree policy). When snapshots are archived first and ranked later, snapshot discrimination may be costly because it requires copying. The lazy segregation protocol may provide efficient lazy discrimination by combining two techniques to reduce the cost of copying. First, it may use a more compact diff-based representation for snapshot pages so that there is less to copy. Second, the diff-based representation (as explained below) may include a component that has a page-based snapshot representation. This page-based component may be segregated without copying using the eager segregation protocol.
Diff-based snapshots
[0062] The compact diff-based representation may implement the same abstraction of snapshot pages and snapshot page tables as the page-based snapshot representation. It may be similar to a database redo recovery log, consisting of sequential repetitions of two types of components, checkpoints and diffs. The checkpoints may be incremental page-based snapshots declared periodically by the storage management system. The diffs may be versioned page diffs, consisting of versioned object modifications clustered by page. Since typically only a small fraction of the objects in a page may be modified by a transaction, and moreover many attributes do not change, it is expected that the diffs will be compact.
[0063] The log repetitions containing the diffs and the checkpoints may be archived sequentially, with diffs and checkpoints written into different archive data structures. As in SNAP, the incremental snapshot page tables may collect the archived page mappings for the checkpoint snapshots. A simple page index structure may keep track of the page-diffs in each log repetition (the diffs in one log repetition are referred to as a "diff extent").
[0064] To create the diff-based representation, the cleaner may sort the diffs in an in-memory buffer, assembling the page-based diffs for the diff extents. The available sorting buffer size may determine the length of the diff extents. Since frequent checkpoints decrease the compactness of the diff-based representation, the cleaner may create several diff extents in a single log repetition to get better compactness. Increasing the number of diff extents may slow down BITE. This trade-off is similar to that of a recovery log. For brevity, the details of how the diff-based representation is constructed are omitted. The details can be found in "Timebox: A High Performance Archive for Split Snapshots" (Hao Xu, PhD Thesis, Brandeis University, 2005), which is incorporated by reference in its entirety. Some of the issues related to the compactness of the diff-based representation that are relevant to snapshot storage management performance are discussed below.
[0065] The snapshots declared between checkpoints may be reconstructed by first mounting the snapshot page table for the closest (referred to as "base") checkpoint and the corresponding diff-page index. This may allow BITE to access the checkpoint pages and the corresponding page-diffs. To reconstruct P_v, the version of P in snapshot v, the server may read page P from the checkpoint, then read the diff-pages for P from all the needed diff extents, and apply them to the checkpoint P in order. Figure 6 shows an embodiment of reconstructing a page P in a diff-based snapshot v from a checkpoint page P_0 and diff-pages contained in several diff extents.
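The reconstruction loop can be sketched in a few lines, modeling a page as a dictionary from oid to object state and a diff extent as a mapping from page id to its page-diff; these encodings are assumptions for illustration.

```python
def reconstruct_page(page_id, checkpoint, diff_extents):
    """checkpoint: {page_id: {oid: state}} for the base checkpoint;
    diff_extents: the extents between the base checkpoint and snapshot v,
    oldest first, each mapping page_id -> {oid: new_state}."""
    page = dict(checkpoint[page_id])          # start from checkpoint page P_0
    for extent in diff_extents:
        for oid, state in extent.get(page_id, {}).items():
            page[oid] = state                 # later diffs overwrite earlier ones
    return page
```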
Segregation
[0066] When an application eventually provides a snapshot ranking, the system may simply read back the archived diff extents, assemble the diff extents for the longer-lived snapshots, create the corresponding long-lived base checkpoints, and archive the retained snapshots sequentially into a longer-lived area. If diffs are compact, the cost of copying may be low. The long-lived base checkpoints may be created without copying by separating the long-lived and short-lived checkpoint pages using eager segregation.
Since checkpoints may simply be page-based snapshots declared periodically by the system, the system can derive the ranks for the base checkpoints once the application specifies the snapshot ranks. Knowing ranks at checkpoint declaration time may enable eager segregation.
[0067] Consider two adjacent log repetitions L_i and L_i+1 for level-1 snapshots, with corresponding base checkpoints B_i and B_i+1. Suppose the base checkpoint B_i+1 is to be reclaimed when the adjacent level-1 diff extents are merged into one level-2 diff extent. Declaring the base checkpoint B_i a level-2 rank-tree snapshot, and the base checkpoint B_i+1 a level-1 rank-tree snapshot, may allow reclaiming the pages of B_i+1 without fragmentation or copying.
[0068] Figure 7 shows an embodiment of an eager rank-tree policy for checkpoints in lazy segregation. A representation for level-1 snapshots has the diff extents E_1, E_2 and E_3 (in the archive region G_diff^1) associated with the base checkpoints B_1, B_2 and B_3. To create the level-2 snapshots, E_1, E_2 and E_3 may be merged into extent E (in region G_diff^2). This extent E has a base checkpoint B_1. Eventually, extents E_1, E_2, E_3 and checkpoints B_2, B_3 may be reclaimed. Since B_1 was ranked at declaration time as a longer-lived rank-2 snapshot, the eager segregation protocol may let B_1 capture all the checkpoint pages it records, allowing the system to reclaim the shorter-lived pages of B_2 and B_3 without fragmentation.
[0069] The lazy segregation protocol may be optimized for the case where the application specifies snapshot rank within a limited time period after snapshot declaration, which is expected to be the common case. If the limit is exceeded, the system may reclaim shorter-lived base checkpoints by copying out longer-lived pages at a much higher cost. The same approach can also be used if the application needs to change the discrimination policy.
[0070] The diff-based representation may be more compact but has slower BITE than the page-based representation. Some applications require lazy discrimination but also need low-latency BITE on a recent window of snapshots, for example, to examine the recent snapshots and identify the ones to be retained. The eager segregation protocol may allow efficient composition of diff-based and page-based representations to provide fast BITE on recent snapshots, and lazy snapshot discrimination. The composed representation, called "Hybrid", works as follows. When an application declares a snapshot, Hybrid may create two snapshot representations. A page-based representation may be created in a separate archive region that maintains a sliding window of W recent snapshots, reclaimed temporally. BITE on snapshots within W may run on the fast page-based representation. In addition, to enable efficient lazy discrimination, Hybrid may create a diff-based representation for the snapshots. BITE on snapshots outside W may run on the slower diff-based representation. Snapshots within W therefore may have two representations (page-based and diff-based).
[0071] The eager segregation protocol can be used to efficiently compose the two representations and provide efficient reclamation. To achieve the efficient composition, the system may specify an eager rank-tree policy that ranks the page-based snapshots as lowest-rank (level-0) rank-tree snapshots, but specifies the ones that correspond to the system-declared checkpoints in the diff-based representation as level-1. As in lazy segregation, the checkpoints may be further discriminated by bumping up the rank of the longer-lived checkpoints. With such an eager policy, the eager segregation protocol may retain the snapshots declared by the system as checkpoints without copying, and may reclaim the aged snapshots in the page-based window W without fragmentation. The cost of checkpoint creation and segregation may be completely absorbed into the cost of creating the page-based snapshots, resulting in a lower archiving cost than the simple sum of the two representations.
[0072] Figure 8 shows an embodiment of reclamation in the Hybrid system that adds faster BITE to the snapshots in Figure 7. The system may create the page-based snapshots V_i and use them to run fast BITE on recent snapshots. Snapshots V_1 and V_4 may be used as base checkpoints B_1 and B_2 for the diff-based representation, and checkpoint B_1 may be retained as a longer-lived checkpoint. The system may specify an eager rank-tree policy, ranking snapshots V_i at level-0, and bumping up V_1 to level-2 and V_4 to level-1. This may allow the eager segregation protocol to create the checkpoints B_1 and B_2, and eventually reclaim B_2, V_5 and V_6 without copying.
[0073] Efficient discrimination should not significantly increase the cost of snapshots. The discrimination techniques are analyzed under a range of workloads and are shown to have minimal impact on the snapshot system performance. The results of the experiments are presented below; the evaluation approach is explained first.
Cost of discrimination
[0074] The metric λ_dp captures the non-disruptiveness of an I/O-bound storage system. This metric is used to gauge the impact of snapshot discrimination. Let r_pour be the "pouring rate" - the average object cache (MOB/VMOB) free-space consumption speed due to incoming transaction commits, which insert modified objects. Let r_drain be the "draining rate" - the average growth rate of object cache free space produced by MOB/VMOB cleaning. λ_dp is defined as:

λ_dp = r_drain / r_pour

λ_dp indicates how well the draining keeps up with the pouring. If λ_dp > 1, the system operates within its capacity and the foreground transaction performance is not affected by background cleaning activities. If λ_dp < 1, the system is overloaded, transaction commits eventually block on free object cache space, and clients experience commit delay.
[0075] Let t_clean be the average cleaning time per dirty database page; t_clean determines r_drain. In Thresher, t_clean reflects, in addition to the database ireads and writes, the cost of snapshot creation and snapshot discrimination. Since snapshots are created on a separate disk in parallel with the database cleaning, the cost of snapshot-related activity can be partially hidden behind the cost of database cleaning.
[0076] "Overwriting" (α) is an update workload parameter, defined as the percentage of repeated modifications to the same object or page, α affects both rpour and r</raιπ. When overwriting increases, updates cause less cleaning in the storage system because the object cache (MOB/VMOB) absorbs repeated modifications, but high frequency snapshots may need to archive most of the repeated modifications. With less cleaning, it may be harder to hide archiving behind cleaning, so snapshots may become more disruptive. On the other hand, workloads with repeated modifications reduce the amount of copying when lazy discrimination copies diffs. For example, for a two-level discrimination policy that retains one snapshot out of every hundred, of all the repeated modifications to a given object o, archived for the short-lived level- 1 snapshots, only one (last) modification gets retained in the level-2 snapshots. To gauge the impact of discrimination on the non-disruptiveness, rpour and r,/ram are measured experimentally in a system with and without discrimination for a range of workloads with low, medium and high degree of overwriting, and analyze the resulting λdP.
[0077] λ_dp may determine the maximum throughput of an I/O-bound storage system. Measuring the maximum throughput in a system with and without discrimination could provide an end-to-end metric for gauging the impact of discrimination. The focus is on λ_dp because it allows better explanation of the complex dependency between workload parameters and the cost of discrimination.
Compactness of representation
[0078] The effectiveness of the diff-based representation in reducing copying cost may depend on the "compactness" of the representation. Compactness is characterized by a relative snapshot retention metric R, defined as the size of the snapshot state written into the archive for a given snapshot history length H, relative to the size of the snapshot state for H captured in full snapshot pages. R = 1 for the page-based representation. R of the diff-based representation has two contributing components, R_ckp for the checkpoints and R_diff for the diffs. "Density" (β), a workload parameter defined as the fraction of the page that gets modified by an update, may determine R_diff. For example, in a static update workload where, any time a page is updated, the same half of the page gets modified, R_diff = 0.5. R_ckp may depend on the frequency of checkpoints, determined by L, the number of snapshots declared in the history interval corresponding to one log repetition. In workloads with overwriting, increasing L may decrease R_ckp, since checkpoints are page-based snapshots that record the first pre-state for each page modified in the log repetition. Increasing L by increasing d, the number of diff extents in a log repetition, may raise the snapshot page reconstruction cost for BITE. Increasing L without increasing d may require additional server memory for the cleaner to sort diffs when assembling diff pages.
[0079] The diff-based representation may not be compact if transactions modify all the objects in a page. Common update workloads have sparse modifications, because most applications modify far fewer objects than they read. The compactness of the diff-based representation is determined by measuring R_diff and R_ckp for workloads with the expected medium and low update density.
EXAMPLES
[0080] Thresher implements in SNAP the techniques described, including support for recovery during normal operation, but without the failure recovery procedure. This allows evaluation of system performance in the absence of failures. Comparing the performance of Thresher and SNAP reveals a meaningful snapshot discrimination cost, because SNAP is very efficient: even at high snapshot frequencies it has low impact on the storage system.
Workloads
[0081] To study the impact of the workload, the standard multi-user OO7 benchmark for object storage systems is used. The benchmark definition is omitted for brevity. An OO7 transaction includes a read-only traversal (T1), or a read-write traversal (T2a or T2b). The traversals T2a and T2b generate workloads with a fixed amount of object overwriting and density. Extended traversals, summarized below, were implemented to allow control of these parameters. To control the degree of overwriting, a variant traversal T2a' is used that extends T2a to update a randomly selected AtomicPart object of a CompositePart, instead of always modifying the same (root) object as in T2a. Like T2a, each T2a' traversal modifies 500 objects. The desired amount of overwriting is achieved by adjusting the object update history in a sequence of T2a' traversals. The workload parameter α controls the amount of overwriting. The experiments use three settings for α, corresponding to low (0.08), medium (0.30) and very high (0.50) degrees of overwriting.
[0082] To control density, a variant of traversal T2a' was developed, called T2f (which also modifies 500 objects), that allows determining β, the average number of modified AtomicPart objects on a dirty page when the dirty page is written back to the database (on average, a page in OO7 has 27 such objects). Unlike T2a', which modifies one AtomicPart in the CompositePart, T2f modifies a group of AtomicPart objects around the chosen one. Denote by T2f-g the workload with a group of size g. T2f-1 is essentially T2a'.
[0083] The workload density β is controlled by specifying the size of the group. In addition, since repeated T2f-g traversals update multiple objects on each data page due to the write-absorption provided by the MOB, T2f-g, like T2a', also controls the overwriting between traversals. The group size and the desired overwriting are specified, and β is determined experimentally in the resulting workload. For example, given 2MB of VMOB (the standard configuration in Thor and SNAP for a single-client workload), the measured β of repeated T2f-1 is 7.6 (medium α, transactions 50% on a private module, 50% on a public module). T2f-180, which modifies almost every AtomicPart in a module, has β = 26, yielding almost the highest possible workload density for the OO7 benchmark. The experiments use workloads corresponding to three settings of density β: low (T2f-1, β = 7.6), medium (T2f-26, β = 16) and very high (T2f-180, β = 26). Unless otherwise specified, a medium overwriting rate is used.
Experimental configuration
[0084] Two experimental system configurations are used. The single-client experiments run with snapshot frequency 1, declaring a snapshot after each transaction, in a 3-user OO7 database (185MB in size). The multi-client scalability experiments run with snapshot frequency 10 in a large database (140GB in size). The size of a single private client module is the same in both configurations. All the reported results show the mean of at least three trials with maximum standard deviation at 3%.
[0085] The storage system server runs on a Linux (kernel 2.4.20) workstation with dual 64-bit Xeon 3GHz CPUs and 1GB RAM. Two Seagate Cheetah disks (model ST3146707LC, 10000 rpm, 4.7ms avg seek, Ultra320 SCSI) attach directly to the server via an LSI Fusion MPT adapter. The database and the archive reside on separate raw hard disks. The implementation uses Linux raw devices and direct I/O to bypass the file system cache. The client(s) run on workstations with a single P3 850MHz CPU and 512MB of RAM. The clients and server are inter-connected via a 100Mbps switched network. In single-client experiments, the server is configured with 18MB of page cache (10% of the database size) and a 2MB MOB in Thor. In multi-client experiments, the server is configured with 30MB of page cache and 8-11MB of MOB in Thor. The snapshot systems are configured with slightly more memory for the VMOB, so that the same number of dirty database pages is generated in all snapshot systems, normalizing the r_drain comparison to Thor.
[0086] In turn, the performance of eager segregation, lazy segregation, the hybrid representation, and BITE is analyzed under a single-client workload, and then system scalability is evaluated under multiple concurrent client workloads.
Snapshot discrimination
Eager segregation
[0087] Compared to SNAP, the cost of eager discrimination in Thresher includes the cost of creating VPTs for higher-level snapshots. Table 1 shows t_clean in Thresher for a two-level eager rank-tree with the inter-level retention fraction f set to one snapshot in 200, 400, 800, and 1600. The t_clean in SNAP is 5.07ms. Not surprisingly, the results show no noticeable change, regardless of the retention fraction. The small incremental page tables contribute a very small fraction (0.02% to 0.14%) of the overall archiving cost, even for the lowest-level snapshots, rendering eager segregation essentially free of cost. This result is important, because eager segregation is used to reduce the cost of lazy segregation and the hybrid representation.
Table 1: t_clean: eager segregation

f | 200 | 400 | 800 | 1600
---|---|---|---|---
t_clean | 5.08ms | 5.07ms | 5.10ms | 5.08ms
Lazy segregation
[0088] The cost of lazy segregation was analyzed for a 2-level rank-tree by comparing the cleaning costs, and the resulting λ_dp, in four different system configurations: Thresher with lazily segregated diff-based snapshots ("Lazy"), Thresher with unsegregated diff-based snapshots ("Diff"), page-based (unsegregated) snapshots ("SNAP"), and the storage system without snapshots ("Thor"), under workloads with a wide range of density and overwriting parameters. The complete results, omitted for lack of space, can be found in "Timebox: A High Performance Archive for Split Snapshots". Here the focus is on the low and medium overwriting and density parameter values, which are expected to be more common.
[0089] A key factor affecting the cleaning costs in the diff-based systems is the compactness of the diff-based representation. A diff-based system configured with a 4MB sorting buffer, with medium overwriting, has a very low R_ckp (0.5% - 2%) for the low density workload (R_diff is 0.3%). For the medium density workload (R_diff is 3.7%), the larger diffs fill the sorting buffer faster, but R_ckp decreases from 10.1% to 4.8% when d increases from 2 to 4 diff extents. These results point to the space saving benefits offered by the diff-based representation.
[0090] Table 2 shows the cleaning costs and λ_dp for all four systems for the medium density workload with low, medium, and high overwriting. The t_clean measured in the Lazy and Diff systems includes the database iread and write cost, the CPU cost for processing the VMOB, the page archiving and checkpointing cost via parallel I/O, snapshot page table archiving, and the cost of sorting diffs and creating diff extents, but does not directly include the cost of reading and archiving diffs, since this activity is performed asynchronously with cleaning. The measured t_diff reflects these diff-related costs (including I/O on diff extents, and diff page index maintenance) per dirty database page. The t_clean measured for SNAP and Thor includes the (obvious) relevant cost components.
[0091] Compared to Diff, Lazy has a higher t_diff, reflecting the diff copying overhead. This overhead decreases as the overwriting rate increases. t_diff does not drop proportionally to the overwriting increase, because the dominant cost component of creating higher-level extents, reading back the extents in the lowest level, is insensitive to the overwriting rate. Lazy pays no checkpoint segregation cost because it uses the eager protocol.
[0092] Next, consider non-disruptiveness. r_pour is measured, and λ_dp is conservatively computed for Diff and Lazy by adding t_diff to t_clean, approximating a very busy system where diff I/O is forced to run synchronously with the cleaning. When overwriting is low, λ_dp in all snapshot systems is close to Thor. When overwriting is high, all systems have a high λ_dp because there is very little cleaning in the storage system, and R is low in Diff and Lazy. Importantly, even with the conservative adjustment, λ_dp in both diff-based systems is very close to SNAP, while providing significantly more compact snapshots. Note that all snapshot systems declare snapshots after each traversal transaction. "Snap: Efficient Snapshots for Back-in-Time Execution" (Liuba Shrira and Hao Xu, Proceedings of the 21st International Conference on Data Engineering, Tokyo, Japan, April 2005), which is incorporated by reference in its entirety, shows that λ_dp increases quickly as snapshot frequency decreases.
Table 2: Lazy segregation and overwriting
Hybrid
[0093] The Hybrid system incurs the archiving costs of a page-based snapshot system, plus the costs of diff extent creation and segregation, making it the costliest of the Thresher configurations. Workload density impacts the diff-related costs. Figure 9 shows how the non-disruptiveness λ_dp of Hybrid decreases relative to Thor for workloads with low, medium and high density and a fixed medium overwriting. A denser workload implies more diff I/O. Under the densest possible workload, included here for comparison, the drop of λ_dp of Hybrid relative to Thor is 13.6%, whereas for the common expected medium and low density workloads the drop is 3.2% and 1.8% respectively. Note that in all configurations, because the system's λ_dp is greater than 1, there is no client-side performance difference observed between Hybrid and Thor. As a result, the metric λ_dp directly reflects the server's "cleaning" speed (t_clean). The results in Figure 9 indicate that Hybrid is a feasible solution for systems that need fast BITE and lazy discrimination (or snapshot compactness).
Back-in-time execution
[0094] BITE in Diff and SNAP is compared to Thor. The experiment creates Diff and SNAP archives by running 16000 medium density, medium overwriting traversals, declaring a snapshot after each traversal. The incremental VPT protocol checkpoints VPTs at 4MB intervals to bound the reconstruction scan cost. The APT mounting time, which depends on the distance from the VPT checkpoint, follows a seesaw pattern between 21.05ms and 48.77ms. The latency of BITE is determined by the average fetch cost via the APT (4.10ms per page).
[0095] Diff mounts a snapshot by mounting the closest checkpoint, i.e. reconstructing the checkpoint page table (at the same cost as a VPT in SNAP), and mounting the involved page index structures, with an average page index mounting time of 7.61ms. To access a page P, Diff reads the checkpoint page and the d diff-pages of P. The average cost to fetch a checkpoint page is 5.80ms; to fetch a diff-page from one extent, 5.42ms. The cost of constructing the requested page version by applying the diff-pages to the checkpoint page is negligible.
[0096] Table 3 shows the average end-to-end BITE cost measured at the client side by running one standard OO7 T1 traversal against Thor, SNAP and Diff respectively. Hybrid has the latency of SNAP for recent snapshots, and the latency of Diff otherwise. The end-to-end BITE latency (page fetch cost) increases over time as pages are archived. Table 3 lists the numbers corresponding to a particular point in the system execution history, with the intention of providing a general indication of BITE performance on the different representations compared to the performance of accessing the current database. The performance gap between page-based and diff-based BITE motivates the hybrid representation.
Table 3: End-to-end BITE performance
 | current db | page-based | diff-based
---|---|---|---
T1 traversal | 17.53s | 27.06s | 42.11s
Scalability
[0097] To show the impact of discrimination in a heavily loaded system, Thresher (Hybrid) and Thor are compared as the storage system load increases, for single-client, 4-client and 8-client loads, with a medium density and medium overwriting workload. (An 8-client load saturates the capacity of the storage system.)
[0098] The database size is 140GB, which virtually contains over 3000 OO7 modules. Each client accesses its private module (45MB in size) in the database. The private modules of the testing clients are evenly allocated within the 140GB space. Under the 1-client, 4-client and 8-client workloads, the λ_dp of Thor is 2.85, 1.64 and 1.30 respectively. These λ_dp values indicate that Thor is heavily loaded under the multi-client workloads. Figure 10 shows the decrease of λ_dp in Hybrid relative to Thor as the load increases. Note that adding more concurrent clients does not cause Hybrid to perform worse. In fact, with the 8-client concurrent workload, Hybrid performs better than with the single-client workload. This is because, with private modules evenly allocated across the large database, the database random read costs increase compared to the single-client workload, hiding the cost of sequential archiving during cleaning more effectively. Under all concurrent client workloads, Hybrid, the costliest Thresher configuration, is non-disruptive.
Claims
1. A database management system comprising: a database; a server configured to execute transactions on objects within pages stored in said database; and an archive configured to store snapshots of said pages, each of said snapshots corresponding to a state of said database before one of said transactions, wherein each of said snapshots is ranked according to importance.
2. The database management system of claim 1, wherein higher ranked snapshots are stored for a longer period of time than lower ranked snapshots.
3. The database management system of claim 2, wherein said snapshots are segregated into separate storage areas within said archive, based on the rank of said snapshots.
4. The database management system of claim 1, wherein said snapshots are ranked at the time of snapshot creation.
5. The database management system of claim 1, wherein said snapshots are ranked after the snapshots are created.
6. The database management system of claim 1, wherein said snapshot pages are stored in their entirety only at designated checkpoint events.
7. The database management system of claim 6, wherein changes to objects within said snapshot pages since the last checkpoint event are stored separately within said archive.
8. The database management system of claim 7, wherein said snapshot pages are retrieved by loading the snapshot pages of the most recent checkpoint preceding the time of said snapshot creation and applying the changes to the said objects within said snapshot pages which have occurred between the checkpoint event and the time of snapshot creation.
9. A method for storing and retrieving snapshots of pages in a database comprising: collecting one or more pages within said database corresponding to the state of said database before a given transaction; executing a transaction on at least one object in said one or more pages; storing said collected one or more pages as a snapshot in an archive; and ranking said snapshot based on importance.
10. The method of claim 9, wherein said step of storing said snapshot comprises storing higher ranked snapshots for a longer period of time than lower ranked snapshots.
11. The method of claim 9, further comprising: segregating said snapshots into separate storage areas within said archive, based on the rank of said snapshots.
12. The method of claim 9, wherein said step of ranking said snapshot comprises ranking said snapshot at the time of snapshot creation.
13. The method of claim 9, wherein said step of ranking said snapshot comprises ranking said snapshot after the snapshot is created.
14. The method of claim 9, wherein said step of storing said snapshot comprises storing said snapshot pages in their entirety only at designated checkpoint events.
15. The method of claim 14, wherein said step of storing said snapshot comprises storing changes to objects within said snapshot pages since the last checkpoint event separately within said archive.
16. The method of claim 15, further comprising: retrieving said snapshot pages by loading the snapshot pages of the most recent checkpoint preceding the time of said snapshot creation and applying the changes to the said objects within said snapshot pages which have occurred between the checkpoint event and the time of snapshot creation.
Applications Claiming Priority (2)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
US92453107P | 2007-05-18 | 2007-05-18 |
US60/924,531 | 2007-05-18 | |
Publications (1)

Publication Number | Publication Date
---|---
WO2008144018A1 (en) | 2008-11-27