Application-level crash resistance · Issue #4430 · cockroachdb/pebble · GitHub

Application-level crash resistance #4430


Open
rjl493456442 opened this issue Mar 27, 2025 · 7 comments

@rjl493456442
Contributor
rjl493456442 commented Mar 27, 2025

Pebble supports sync and async write options.

Sync write: the write operation blocks until the WAL (Write-Ahead Log) is fsync'd.
Async write: the write operation returns as soon as the data is queued, without
waiting for it to be written to the WAL.

Sync mode ensures durability across a machine crash, while in async mode recent
writes can be lost if the application crashes.

Could we provide a third option, in which the write operation blocks until the
data is written to the WAL but without an fsync? On some platforms (e.g. macOS)
we have observed that fsync is very expensive.

Application-level crash resistance is fairly important. Is it an option worth considering?

Jira issue: PEBBLE-367

@RaduBerinde
Member

You can implement your own vfs.FS that does nothing in its fsync-type methods. But there may be consistency issues after recovery if the machine does crash.

@RaduBerinde
Member

You might also be able to achieve this through configuration, for example by putting the Pebble store on a separate partition mounted with special options that disable syncing. You can definitely do this on Linux; I'm not sure about macOS.
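A related process-level approach on Linux (my addition, not something from the thread) is the libeatmydata LD_PRELOAD shim, which turns the fsync family of calls into no-ops for a single process without touching mount options:

```shell
# Debian/Ubuntu: apt install eatmydata
eatmydata ./my-pebble-app

# Equivalent explicit form (library path varies by distribution):
LD_PRELOAD=libeatmydata.so ./my-pebble-app
```

The caveats in the next comment about disabling all syncs apply equally here.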

@petermattis
Collaborator

I happened to be experimenting in this area last week. Disabling all syncs from Pebble is problematic on Linux because doing so allows a large amount of dirty data to build up in the OS, which then gets flushed at arbitrary times (this is why we trickle out syncs for sstables while writing them). Additionally, if you disable syncs at the filesystem level, the sync command usually doesn't work either, so you can't even get durability on a graceful machine shutdown.

Disabling syncs for just the WAL should be done via a config option. It definitely comes with caveats. The code to do this is pretty localized within record.LogWriter. Here is a patch which shows the one location in the WAL code where syncs have to be disabled:

 func (w *LogWriter) syncWithLatency() (time.Duration, error) {
-       start := crtime.NowMono()
-       err := w.s.Sync()
-       syncLatency := start.Elapsed()
-       return syncLatency, err
+       return time.Duration(0), nil
+       // start := crtime.NowMono()
+       // err := w.s.Sync()
+       // syncLatency := start.Elapsed()
+       // return syncLatency, err
 }

When running in this mode you'd want to set options.WALBytesPerSync to something like 512KB so that you don't build up dirty data for the WAL. When that option is set, Pebble will use the sync_file_range system call to asynchronously sync the WAL periodically. I've verified this works.

@pav-kv
Contributor
pav-kv commented Apr 6, 2025

Since running without syncs isn’t safe, it would be nice to detect situations when Pebble might have lost some writes.

@petermattis and I discussed this a little, and below is a sketch for the approach. The idea is to have a “proof” that Pebble has been synced, or the system has been running continuously between consecutive Pebble starts and couldn't lose writes (with high probability).

To detect that the system has been running continuously, we need some kind of “epoch” that changes with every system start. A good candidate for this is the system boot time, which can be found with grep btime /proc/stat (or sysctl kern.boottime on macOS).

Example on my machine:

$ sysctl kern.boottime
kern.boottime: { sec = 1742501182, usec = 290861 } Thu Mar 20 20:06:22 2025

When Pebble starts up, check for a LAST_BOOT file.

  • If LAST_BOOT does not exist, this is a first start, or a clean shutdown had been performed. Create LAST_BOOT and put the current epoch in it.
  • If LAST_BOOT exists, compare the epoch noted in that file with the current one. If they mismatch, that means the system restarted since the last time Pebble was running, and there hasn’t been a clean shutdown. There are options on how to handle it (could be configurable): either start with a loud message that there could be data loss, or refuse to start Pebble.
  • On clean shutdown (after WAL is synced), remove LAST_BOOT.
  • Operators can also force a clean shutdown by syncing Pebble directory / file system (e.g. run the sync command), and removing LAST_BOOT.

@rjl493456442
Contributor Author
rjl493456442 commented Apr 17, 2025

@petermattis that sounds like a reasonable approach!

Using WriteOption with sync=false can be very useful on SSDs that aren’t particularly powerful, or on platforms with very slow fsync operations (e.g., macOS).

With WALBytesPerSync, the background flusher will periodically issue an fsync to the WAL file, helping to limit data loss in the case of a machine-level failure.

Any plan to ship this new feature?

@rjl493456442
Contributor Author

May I also request another feature? Would it be possible to expose an external API to explicitly fsync the WAL?

This would be very useful in our use case, where we have two independent storage engines: (a) Pebble and (b) a set of raw files. We need to ensure that the write order between the two is always respected. We have this kind of API for (b) to fsync all the uncommitted files, and it would be nice to have the same from Pebble.

@petermattis
Collaborator

@rjl493456442 There are no current plans to work on this functionality.
