Provide a method to get instance vclock after all writes in progress are finished · Issue #10142 · tarantool/tarantool · GitHub

Provide a method to get instance vclock after all writes in progress are finished #10142


Closed
sergepetrenko opened this issue Jun 18, 2024 · 5 comments · Fixed by #10422
Labels: 3.3 (Target is 3.3 and all newer release/master branches), feature (A new functionality)

Comments

@sergepetrenko
Collaborator

It's common practice to implement master switch via the following steps:

  1. Issue box.cfg{read_only = true} on the old master
  2. Remember the lsn of the last written transaction on that master (box.info.lsn)
  3. Wait for replication up to that lsn on the new desired master
  4. Issue box.cfg{read_only = false} on the new master

These steps are supposed to guarantee that, when both the old and the new master are alive, the switch happens only after the new master has received everything written by the old master. But there's a problem: box.cfg{read_only = true} only affects transactions that haven't been sent to WAL yet. In other words, setting box.cfg{read_only = true} doesn't guarantee that box.info.lsn stops growing immediately. It might still need some time to reflect the last writes.

Reproducer:
-- Step 1.
-- Start a Tarantool built in debug mode
box.cfg{}
-- Step 2.
box.schema.space.create('test'):create_index('pk')
-- Step 3.
box.error.injection.set('ERRINJ_WAL_DELAY', true)
-- Step 4.
require('fiber').new(function() box.space.test:insert{1} end)
-- Step 5.
box.cfg{read_only=true}
-- Step 6.
box.info.lsn -- Will show something like 2.
-- Step 7.
box.error.injection.set('ERRINJ_WAL_DELAY', false)
-- Step 8.
box.info.lsn -- Will show a bigger value. For example, 3.

Let's provide a box.ctl method to sync WAL, for example box.ctl.wal_sync(). It will call the C wal_sync function and make sure that all ongoing writes are finished. With this, implementing failover correctly becomes possible: the user has to call box.ctl.wal_sync() before remembering box.info.lsn.
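The corrected switch procedure could be sketched roughly as follows. This is only an illustration under assumptions: the function names, the timeout and the polling interval are hypothetical, box.ctl.wal_sync() is the method proposed in this issue, and skipping vclock component 0 assumes it tracks local (non-replicated) writes.

```lua
-- Sketch only: helper names, timeout and polling interval are hypothetical.

-- Pure helper: true when `current` has reached every component of `target`.
-- Component 0 is assumed to track local (non-replicated) writes, so it is skipped.
local function vclock_reached(current, target)
    for id, lsn in pairs(target) do
        if id ~= 0 and (current[id] or 0) < lsn then
            return false
        end
    end
    return true
end

-- Run on the old master: stop accepting writes, wait for the in-flight
-- ones, and only then remember the position.
local function demote_and_get_vclock()
    box.cfg{read_only = true}
    -- Waits for writes already sent to WAL; raises an error if a write failed.
    box.ctl.wal_sync()
    return box.info.vclock
end

-- Run on the new master: wait until replication reaches `target`, then promote.
local function promote_after(target, timeout)
    local fiber = require('fiber')
    local deadline = fiber.clock() + timeout
    while not vclock_reached(box.info.vclock, target) do
        if fiber.clock() > deadline then
            return false, 'timed out waiting for replication'
        end
        fiber.sleep(0.01)
    end
    box.cfg{read_only = false}
    return true
end
```

The vclock returned by demote_and_get_vclock() would be passed to promote_after() on the new master, for instance over net.box. Comparing whole vclocks rather than a single lsn is a design choice of this sketch, not something the issue prescribes.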

@sergepetrenko sergepetrenko added the feature A new functionality label Jun 18, 2024
@sergepetrenko
Collaborator Author

See also #9937.

@R-omk
R-omk commented Jul 15, 2024

Will it be possible to use this to read committed data even if we don't enable MVCC?

Now we use something like this as a workaround:

box.begin()
space:select() -- read data
space:update({'non-existent key'}, {}) -- dummy write
box.commit() -- Waiting here gives indirect confirmation that the data read above was committed.

@sergepetrenko
Collaborator Author

@R-omk

That's a clever hack you have! I haven't thought of that myself.

Will it be possible to use this to read committed data even if we don't enable MVCC?

Now we use something like this as a workaround:

box.begin()
space:select() -- read data
space:update({'non-existent key'}, {}) -- dummy write
box.commit() -- Waiting here gives indirect confirmation that the data read above was committed.

I think not. At least not with the current implementation.

  1. wal_sync yields, so it won't be possible to call it inside an open transaction without mvcc.
  2. We could introduce some box.commit() flag to make it sync WAL, probably even reuse the existing txn_isolation = 'read-committed', which would internally do wal_sync even if the transaction has nothing to write. But it seems this would be worse performance-wise than the hack you propose, because in your approach a dummy write goes to the same existing cbus message between tx and WAL, whereas wal_sync creates a separate cbus message for each wal_sync() call.

Having a ton of transactions all doing box.commit{wal_sync=true} we could clog the cbus quite easily.

Of course we might redesign wal_sync to make use of already scheduled cbus message between tx and wal, but this would require extra effort, and I'm not sure that it's worth it.

@R-omk
R-omk commented Aug 5, 2024

@sergepetrenko

wal_sync yields, so it won't be possible to call it inside an open transaction

I don't need it inside the transaction. It's enough to know that the writes that were in the queue have completed; this would indicate that none of the previous operations read dirty data, if I'm not mistaken.

But it seems this would be worse performance-wise than the hack you propose.

My hack assumes that there will always be a dummy write, but if there is nothing in the queue, it doesn't have to wait for anything at all; this should immediately indicate that none of the previous reads were dirty.

My issue is precisely about handling more efficiently the situations when it is known in advance that all writing transactions have already completed.

For example, I could extend my hack with a counter of open transactions: increase it at the beginning of a transaction and decrease it in the on_commit/on_rollback trigger. In this case a dummy write is needed only if the counter is not zero. It seems like it should work, again, unless I'm missing something.
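The counter idea could look roughly like this. It is a minimal sketch, not a verified solution: tracked_begin and has_pending_writes are hypothetical names, and it assumes every writing transaction in the application is started through tracked_begin.

```lua
-- Sketch only: helper names are hypothetical.
-- Number of transactions that were started but have not finished yet.
local pending = 0

-- Start a transaction and register triggers that decrement the counter
-- once the transaction is either committed or rolled back.
local function tracked_begin()
    box.begin()
    pending = pending + 1
    local function finished()
        pending = pending - 1
    end
    box.on_commit(finished)
    box.on_rollback(finished)
end

-- A dummy write (or any other WAL barrier) is needed only when this is true.
local function has_pending_writes()
    return pending > 0
end
```

A reader would check has_pending_writes() after its reads and fall back to the dummy-write transaction only when it returns true.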

@sergepetrenko
Collaborator Author
sergepetrenko commented Aug 5, 2024

@R-omk

Ok, I see. It seems you could do the following then:

  1. Perform a bunch of reads from a single fiber without yields
  2. Issue wal_sync from outside the transaction. If it completes successfully, everything you have read is certainly written. In case it fails, it doesn't guarantee that the reads you performed are bad, but still you can't be sure anymore.

It seems this stays correct for any number of reads from any number of fibers, as long as wal_sync is executed in the same event loop iteration as the reads (how to achieve this? Probably some fiber.wakeup() or fiber_reschedule() magic).
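For the single-fiber case, the two steps above might be wrapped like this. It is a sketch under assumptions: read_then_confirm is a hypothetical helper, read_fn must not yield (for example, plain memtx selects without MVCC), and box.ctl.wal_sync() is the method proposed in this issue.

```lua
-- Sketch only: read_then_confirm is a hypothetical helper.
-- read_fn performs all reads without yielding and returns their result.
local function read_then_confirm(read_fn)
    local result = read_fn()
    -- Called outside of any transaction, because wal_sync yields.
    local ok, err = pcall(box.ctl.wal_sync)
    if not ok then
        -- The reads are not necessarily dirty, but this is no longer certain.
        return nil, err
    end
    -- Everything read above has been written to WAL.
    return result
end
```

A call such as read_then_confirm(function() return box.space.test:select() end) would return the tuples only after the sync succeeds, and nil plus the error otherwise.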

LevKats added a commit to LevKats/tarantool that referenced this issue Aug 15, 2024
@TarantoolBot document
Title: `box.ctl.wal_sync` can be used in lua to wait write flush

Now one can use `box.ctl.wal_sync()` to access the C `wal_sync` function
needed for synchronisation between old and new masters and any other
situations when the user needs a consistent vclock between instances.

Fixes tarantool#10142
LevKats added a commit to LevKats/tarantool that referenced this issue Aug 21, 2024
@TarantoolBot document
Title: `box.ctl.wal_sync` can be used in lua to wait write flush

Now one can use `box.ctl.wal_sync()` to wait until all submitted
writes are successfully flushed to the disk. If a write fails, it
throws an error. It is primarily needed for synchronisation between
old and new masters or actually any other situations when the user
needs a consistent vclock between instances.

Fixes tarantool#10142
LevKats added a commit to LevKats/tarantool that referenced this issue Aug 23, 2024
@TarantoolBot document
Title: `box.ctl.wal_sync` can be used in lua to wait write flush

Now one can use `box.ctl.wal_sync()` to wait until all submitted
writes are successfully flushed to the disk. If a write fails, it
throws an error. It is primarily needed for synchronisation between
old and new masters or actually any other situations when the user
needs a consistent vclock between instances.

Fixes tarantool#10142
LevKats added a commit to LevKats/tarantool that referenced this issue Sep 9, 2024
@TarantoolBot document
Title: `box.ctl.wal_sync` can be used in lua to wait write flush

Now one can use `box.ctl.wal_sync()` to wait until all submitted
writes are successfully flushed to the disk. If a write fails, it
throws an error. It is primarily needed for synchronisation between
old and new masters or actually any other situations when the user
needs a consistent vclock between instances.

Fixes tarantool#10142
LevKats added a commit to LevKats/tarantool that referenced this issue Sep 11, 2024
@TarantoolBot document
Title: `box.ctl.wal_sync` can be used in lua to wait write flush

Now one can use `box.ctl.wal_sync()` to wait until all submitted
writes are successfully flushed to the disk. If a write fails, it
throws an error. After the function is executed, one may reliably
use box.info.vclock for comparisons when choosing a new master.

Fixes tarantool#10142
@Totktonada Totktonada added the 3.3 Target is 3.3 and all newer release/master branches label Sep 11, 2024

4 participants