Application-level crash resistance · Issue #4430 · cockroachdb/pebble · GitHub

Application-level crash resistance #4430


Open
rjl493456442 opened this issue Mar 27, 2025 · 7 comments

@rjl493456442
Contributor
rjl493456442 commented Mar 27, 2025

Pebble supports sync and async write options.

Sync write: the write operation blocks until the WAL (Write-Ahead Log) is fsync'd.
Async write: the write operation returns as soon as the data is queued, without
waiting for it to be written to the WAL.

Sync mode ensures durability across a machine crash, while in async mode recent
writes can be lost if the application crashes.

Could we provide a third option, in which the write operation blocks until the
data is written to the WAL but without an fsync? On some platforms (e.g. macOS)
we have observed that fsync is very expensive.

Application-level crash resistance is fairly important. Is it an option worth considering?

Jira issue: PEBBLE-367

@RaduBerinde
Member

You can implement your own vfs.FS that does nothing in its fsync-type methods. But there may be consistency issues after recovery if the machine does crash.

@RaduBerinde
Member

You might also be able to achieve this through configuration, for example by putting the Pebble store on a separate partition mounted with special options that disable syncing. You can definitely do this on Linux; I'm not sure about macOS.
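A related process-level approach on Linux (my addition, not something from the thread) is the libeatmydata LD_PRELOAD shim, which turns the fsync family of calls into no-ops for a single process without touching mount options:

```shell
# Debian/Ubuntu: apt install eatmydata
eatmydata ./my-pebble-app

# Equivalent explicit form (library path varies by distribution):
LD_PRELOAD=libeatmydata.so ./my-pebble-app
```

The caveats in the next comment about disabling all syncs apply equally here.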

@petermattis
Collaborator

I happened to be experimenting in this area last week. Disabling all syncs from Pebble is problematic on Linux because doing so allows a large amount of dirty data to build up in the OS, which then gets flushed at arbitrary times (this is why we trickle out syncs for sstables while writing them). Additionally, if you disable syncs at the filesystem level, the sync command usually doesn't work either, so you can't even get durability on a graceful machine shutdown.

Disabling syncs for just the WAL should be done via a config option. It definitely comes with caveats. The code to do this is pretty localized within record.LogWriter. Here is a patch which shows the one location in the WAL code where syncs have to be disabled:

 func (w *LogWriter) syncWithLatency() (time.Duration, error) {
-       start := crtime.NowMono()
-       err := w.s.Sync()
-       syncLatency := start.Elapsed()
-       return syncLatency, err
+       return time.Duration(0), nil
+       // start := crtime.NowMono()
+       // err := w.s.Sync()
+       // syncLatency := start.Elapsed()
+       // return syncLatency, err
 }

When running in this mode you'd want to set options.WALBytesPerSync to something like 512KB so that you don't build up dirty data for the WAL. When that option is set, Pebble will use the sync_file_range system call to asynchronously sync the WAL periodically. I've verified this works.

@pav-kv
Contributor
pav-kv commented Apr 6, 2025

Since running without syncs isn’t safe, it would be nice to detect situations when Pebble might have lost some writes.

@petermattis and I discussed this a little, and below is a sketch for the approach. The idea is to have a “proof” that Pebble has been synced, or the system has been running continuously between consecutive Pebble starts and couldn't lose writes (with high probability).

To detect that the system has been running continuously, we need some kind of “epoch” that changes with every system start. A good candidate for this is the system boot time, which can be found with grep btime /proc/stat (or sysctl kern.boottime on macOS).

Example on my machine:

$ sysctl kern.boottime
kern.boottime: { sec = 1742501182, usec = 290861 } Thu Mar 20 20:06:22 2025

When Pebble starts up, check for a LAST_BOOT file.

  • If LAST_BOOT does not exist, this is a first start, or a clean shutdown had been performed. Create LAST_BOOT and put the current epoch in it.
  • If LAST_BOOT exists, compare the epoch noted in that file with the current one. If they mismatch, that means the system restarted since the last time Pebble was running, and there hasn’t been a clean shutdown. There are options on how to handle it (could be configurable): either start with a loud message that there could be data loss, or refuse to start Pebble.
  • On clean shutdown (after WAL is synced), remove LAST_BOOT.
  • Operators can also force a clean shutdown by syncing Pebble directory / file system (e.g. run the sync command), and removing LAST_BOOT.

@rjl493456442
Contributor Author
rjl493456442 commented Apr 17, 2025

@petermattis that sounds like a reasonable approach!

Using WriteOption with sync=false can be very useful on SSDs that aren’t particularly powerful, or on platforms with very slow fsync operations (e.g., macOS).

With WALBytesPerSync, the background flusher will periodically issue an fsync to the WAL file, helping to limit data loss in the case of a machine-level failure.

Any plan to ship this new feature?

@rjl493456442
Contributor Author

May I also request another feature? Would it be possible to expose an external API to explicitly fsync the WAL?

This would be very useful in our use case, where we have two independent storage engines: (a) Pebble and (b) a set of raw files. We need to ensure that the write order between the two is always respected. We have this kind of API for (b) to fsync all the uncommitted files, and it would be nice to have the same from Pebble.

@petermattis
Collaborator

@rjl493456442 There are no current plans to work on this functionality.
