PS-9703 "Upstream 8.0.41 release does not fully fix PS-9144" #8
Conversation
LGTM @dlenev. And a very well written commit message with no ambiguity. Appreciate the work with examples :)
Maybe let's add a thank-you message for the reporter's test case?
PS-5741: Incorrect use of memset_s in keyring_vault.

Fixed the usage of memset_s. The arguments should be: void memset_s(void *dest, size_t dest_max, int c, size_t n), where the 2nd argument is the size of the buffer and the 3rd argument is the character to fill.

---------------------------------------------------------------------------

PS-7769 - Fix use-after-return error in audit_log_exclude_accounts_validate

*Problem:* `st_mysql_value::val_str` might return a pointer to `buf`, which is deleted after the function returns. Therefore the value in `save` is invalid after returning from the function. In this particular case the error does not manifest, as `val_str` returns memory allocated with `thd_strmake` and does not use `buf`.

*Solution:* Allocate memory with `thd_strmake` so the memory in `save` is not local.

---------------------------------------------------------------------------

Fix test main.bug12969156 when WITH_ASAN=ON

*Problem:* ASAN complains about a stack-buffer-overflow in function `mysql_heartbeat`:

```
==90890==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7fe746d06d14 at pc 0x7fe760f5b017 bp 0x7fe746d06cd0 sp 0x7fe746d06478
WRITE of size 24 at 0x7fe746d06d14 thread T16777215
Address 0x7fe746d06d14 is located in stack of thread T26 at offset 340 in frame
    #0 0x7fe746d0a55c in mysql_heartbeat(void*) /home/yura/ws/percona-server/plugin/daemon_example/daemon_example.cc:62

  This frame has 4 object(s):
    [48, 56) 'result' (line 66)
    [80, 112) '_db_stack_frame_' (line 63)
    [144, 200) 'tm_tmp' (line 67)
    [240, 340) 'buffer' (line 65) <== Memory access at offset 340 overflows this variable
HINT: this may be a false positive if your program uses some custom stack unwind mechanism, swapcontext or vfork
      (longjmp and C++ exceptions *are* supported)
Thread T26 created by T25 here:
    #0 0x7fe760f5f6d5 in __interceptor_pthread_create ../../../../src/libsanitizer/asan/asan_interceptors.cpp:216
    #1 0x557ccbbcb857 in my_thread_create /home/yura/ws/percona-server/mysys/my_thread.c:104
    #2 0x7fe746d0b21a in daemon_example_plugin_init /home/yura/ws/percona-server/plugin/daemon_example/daemon_example.cc:148
    #3 0x557ccb4c69c7 in plugin_initialize /home/yura/ws/percona-server/sql/sql_plugin.cc:1279
    #4 0x557ccb4d19cd in mysql_install_plugin /home/yura/ws/percona-server/sql/sql_plugin.cc:2279
    #5 0x557ccb4d218f in Sql_cmd_install_plugin::execute(THD*) /home/yura/ws/percona-server/sql/sql_plugin.cc:4664
    #6 0x557ccb47695e in mysql_execute_command(THD*, bool) /home/yura/ws/percona-server/sql/sql_parse.cc:5160
    #7 0x557ccb47977c in mysql_parse(THD*, Parser_state*, bool) /home/yura/ws/percona-server/sql/sql_parse.cc:5952
    #8 0x557ccb47b6c2 in dispatch_command(THD*, COM_DATA const*, enum_server_command) /home/yura/ws/percona-server/sql/sql_parse.cc:1544
    #9 0x557ccb47de1d in do_command(THD*) /home/yura/ws/percona-server/sql/sql_parse.cc:1065
    #10 0x557ccb6ac294 in handle_connection /home/yura/ws/percona-server/sql/conn_handler/connection_handler_per_thread.cc:325
    #11 0x557ccbbfabb0 in pfs_spawn_thread /home/yura/ws/percona-server/storage/perfschema/pfs.cc:2198
    #12 0x7fe760ab544f in start_thread nptl/pthread_create.c:473
```

The reason is that `my_thread_cancel` is used to finish the daemon thread. This is not an orderly way of finishing the thread. ASAN does not register that the stack variables are no longer used, which generates the error above. This is a benign error, as all the variables are on the stack.

*Solution:* Finish the thread in an orderly way by using a signalling variable.
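To illustrate the last fix, here is a minimal sketch of the orderly-shutdown pattern it describes, using standard C++ primitives; the flag and function names are hypothetical, not the plugin's actual identifiers:

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical stand-ins for the daemon_example plugin's heartbeat thread.
std::atomic<bool> heartbeat_stop{false};

void heartbeat_loop() {
  // The daemon loop polls the signalling variable and exits on its own,
  // so its stack frame unwinds normally and ASAN sees the locals die.
  while (!heartbeat_stop.load(std::memory_order_relaxed)) {
    // ... write a heartbeat record ...
    std::this_thread::sleep_for(std::chrono::seconds(1));
  }
}

void plugin_deinit(std::thread &heartbeat_thread) {
  heartbeat_stop.store(true);  // signal instead of cancelling the thread
  heartbeat_thread.join();     // wait for an orderly exit
}
```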
---------------------------------------------------------------------------

PS-8204: Fix XML escape rules for audit plugin

https://jira.percona.com/browse/PS-8204

There was a wrong length specified for some XML escape rules. As a result, the terminating null symbol from the replacement rule was copied into the resulting string. This led to query text truncation in the audit log file.

In addition, added empty replacement rules for the '\b' and '\f' symbols, which simply remove them from the resulting string. These symbols are not supported in XML 1.0.

---------------------------------------------------------------------------

PS-8854: Add main.percona_udf MTR test

Add a test to check the FNV1A_64, FNV_64, and MURMUR_HASH user-defined functions.

---------------------------------------------------------------------------

PS-9369: Fix currently processed query comparison in audit_log

https://perconadev.atlassian.net/browse/PS-9369

The audit_log plugin uses a stack to keep track of the table access operations performed in the scope of one query. It compares the last known table access query string, stored on top of this stack, with the actual query in the audit event being processed, to decide whether a new record should be pushed to the stack or it is time to clean records from the stack.

Currently audit_log simply compares char* variables to decide whether this is the same query string. This approach doesn't work. As a result, the plugin loses control of the stack size, and the stack starts growing over time, consuming memory. This issue is not noticeable on short-lived server connections, as the memory is freed once the connection is closed. At the same time it leads to extra memory consumption for long-running server connections.

The following is done to fix the issue:
- The query is sent along with the audit event as a MYSQL_LEX_CSTRING structure. It is not correct to ignore the MYSQL_LEX_CSTRING.length comparison, as sometimes the MYSQL_LEX_CSTRING.str pointer may not be initialised properly. Added a string length check to make sure the structure contains a valid string.
- Used strncmp to compare the actual strings instead of comparing char* variables (see the sketch below).
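A minimal sketch of the corrected comparison, assuming a structure shaped like MySQL's MYSQL_LEX_CSTRING; the helper name is hypothetical:

```cpp
#include <cstddef>
#include <cstring>

// Shape of MYSQL_LEX_CSTRING (as in MySQL's mysql_lex_string.h).
struct MYSQL_LEX_CSTRING {
  const char *str;
  size_t length;
};

// Hypothetical helper: decide whether the event carries the same query
// text as the record on top of the plugin's stack.
static bool same_query(const MYSQL_LEX_CSTRING &a,
                       const MYSQL_LEX_CSTRING &b) {
  // A zero length means str may not be initialised; treat it as no match.
  if (a.length == 0 || b.length == 0) return false;
  // Comparing a.str == b.str (the old approach) misses equal text stored
  // at different addresses; compare length and content instead.
  return a.length == b.length && strncmp(a.str, b.str, a.length) == 0;
}
```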
…ocal DDL executed

https://perconadev.atlassian.net/browse/PS-9018

Problem
-------
In high-concurrency scenarios, a MySQL replica can enter a deadlock due to a race condition between the replica applier thread and the client thread performing a binlog group commit.

Analysis
--------
At least 3 threads are needed for this deadlock to happen:
1. One client thread
2. Two replica applier threads

How does this deadlock happen?
------------------------------
0. Binlog is enabled on the replica, but log_replica_updates is disabled.
1. Initially, both the "Commit Order" and "Binlog Flush" queues are empty.
2. Replica applier thread 1 enters the group commit pipeline to register in the "Commit Order" queue, since `log-replica-updates` is disabled on the replica node.
3. Since both the "Commit Order" and "Binlog Flush" queues are empty, applier thread 1
   3.1. Becomes leader (in Commit_stage_manager::enroll_for()).
   3.2. Registers in the commit order queue.
   3.3. Acquires the lock MYSQL_BIN_LOG::LOCK_log.
   3.4. The Commit Order queue is emptied, but the lock MYSQL_BIN_LOG::LOCK_log is not yet released.
   NOTE: The SE commit for the applier thread is already done by the time it reaches here.
4. Replica applier thread 2 enters the group commit pipeline to register in the "Commit Order" queue, since `log-replica-updates` is disabled on the replica node.
5. Since the "Commit Order" queue is empty (emptied by applier thread 1 in 3.4), applier thread 2
   5.1. Becomes leader (in Commit_stage_manager::enroll_for()).
   5.2. Registers in the commit order queue.
   5.3. Tries to acquire the lock MYSQL_BIN_LOG::LOCK_log. Since it is held by applier thread 1, it waits until the lock is released.
6. The client thread enters the group commit pipeline to register in the "Binlog Flush" queue.
7. Since the "Commit Order" queue is not empty (applier thread 2 is in the queue), it enters the conditional wait `m_stage_cond_leader` with the intention to become the leader for both the "Binlog Flush" and "Commit Order" queues.
8. Applier thread 1 releases the lock MYSQL_BIN_LOG::LOCK_log and proceeds to update the GTID by calling gtid_state->update_commit_group() from Commit_order_manager::flush_engine_and_signal_threads().
9. Applier thread 2 acquires the lock MYSQL_BIN_LOG::LOCK_log.
   9.1. It checks if there is any thread waiting in the "Binlog Flush" queue to become the leader. Here it finds the client thread waiting to be the leader.
   9.2. It releases the lock MYSQL_BIN_LOG::LOCK_log, signals on the cond_var `m_stage_cond_leader`, and enters a conditional wait until the thread's `tx_commit_pending` is set to false by the client thread (done in Commit_stage_manager::process_final_stage_for_ordered_commit_group(), called by the client thread from fetch_and_process_flush_stage_queue()).
10. The client thread wakes up from the cond_var `m_stage_cond_leader`. The thread has now become the leader, and it is its responsibility to update the GTID of applier thread 2.
    10.1. It acquires the lock MYSQL_BIN_LOG::LOCK_log.
    10.2. Returns from `enroll_for()` and proceeds to process the "Commit Order" and "Binlog Flush" queues.
    10.3. Fetches the "Commit Order" and "Binlog Flush" queues.
    10.4. Performs the storage engine flush by calling ha_flush_logs() from fetch_and_process_flush_stage_queue().
    10.5. Proceeds to update the GTID of threads in the "Commit Order" queue by calling gtid_state->update_commit_group() from Commit_stage_manager::process_final_stage_for_ordered_commit_group().
11. At this point we have:
- the client thread performing a GTID update on behalf of applier thread 2 (from step 10.5), and
- applier thread 1 performing a GTID update for itself (from step 8).

Due to the lack of proper synchronization between the above two threads, there exists a time window where both threads can call gtid_state->update_commit_group() concurrently. In subsequent steps, both threads simultaneously try to modify the contents of the array `commit_group_sidnos`, which is used to track the lock status of sidnos. This concurrent access to `update_commit_group()` can cause a lock leak, resulting in one thread acquiring the sidno lock and not releasing it at all:

```
Client thread                             Applier Thread 1
-----------------------------------------------------------------------------
update_commit_group() =>                  update_commit_group() =>
  global_sid_lock->rdlock();                global_sid_lock->rdlock();
calls update_gtids_impl_lock_sidnos()     calls update_gtids_impl_lock_sidnos()
  set commit_group_sidnos[2] = true         set commit_group_sidnos[2] = true
lock_sidno(2) -> waits                    lock_sidno(2) -> successful

                                          update_gtids_impl_own_gtid() ->
                                            Add the thd->owned_gtid in
                                            `executed_gtids()`
                                          if (commit_group_sidnos[2]) {
                                            unlock_sidno(2);
                                            commit_group_sidnos[2] = false;
                                          }

Client thread continues..
lock_sidno(2) -> successful
update_gtids_impl_own_gtid() ->
  Add the thd->owned_gtid in
  `executed_gtids()`
if (commit_group_sidnos[2]) {    <=== this check fails and the
  unlock_sidno(2);                    lock is not released.
  commit_group_sidnos[2] = false;
}

Client thread continues without releasing the lock
-----------------------------------------------------------------------------
```

12. As the above lock leak can also happen the other way around, i.e. the applier thread fails to unlock, there can be different consequences hereafter.

13. If the client thread continues without releasing the lock, then at a later stage it can enter a deadlock with an applier thread performing a GTID update, with the following stack traces:

```
Client thread
-------------
#1  __GI___lll_lock_wait
#2  ___pthread_mutex_lock
#3  native_mutex_lock            <= waits for commit lock while holding sidno lock
#4  Commit_stage_manager::enroll_for
#5  MYSQL_BIN_LOG::change_stage
#6  MYSQL_BIN_LOG::ordered_commit
#7  MYSQL_BIN_LOG::commit
#8  ha_commit_trans
#9  trans_commit_implicit
#10 mysql_create_like_table
#11 Sql_cmd_create_table::execute
#12 mysql_execute_command
#13 dispatch_sql_command

Applier thread
--------------
#1  ___pthread_mutex_lock
#2  native_mutex_lock
#3  safe_mutex_lock
#4  Gtid_state::update_gtids_impl_lock_sidnos   <= waits for sidno lock
#5  Gtid_state::update_commit_group
#6  Commit_order_manager::flush_engine_and_signal_threads   <= acquires commit lock here
#7  Commit_order_manager::finish
#8  Commit_order_manager::wait_and_finish
#9  ha_commit_low
#10 trx_coordinator::commit_in_engines
#11 MYSQL_BIN_LOG::commit
#12 ha_commit_trans
#13 trans_commit
#14 Xid_log_event::do_commit
#15 Xid_apply_log_event::do_apply_event_worker
#16 Slave_worker::slave_worker_exec_event
#17 slave_worker_exec_job_group
#18 handle_slave_worker
```

14. If the applier thread continues without releasing the lock, then at a later stage it can perform recursive locking while setting the GTID for the next transaction (in set_gtid_next()).
In debug builds, the above case hits the assertion `safe_mutex_assert_not_owner()`, meaning the lock is already acquired by the replica applier thread when it tries to re-acquire the lock.

Solution
--------
In the above problematic example, when seen from each thread individually, we can conclude that there is no problem in the order of lock acquisition; thus there is no need to change the lock order. However, the root cause of this problem is that multiple threads can concurrently access the array `Gtid_state::commit_group_sidnos`. In its initial implementation, threads were expected to hold `MYSQL_BIN_LOG::LOCK_commit` before modifying its contents, but this was not considered when upstream implemented WL#7846 (MTS: slave-preserve-commit-order when log-slave-updates/binlog is disabled).

With this patch, we now ensure that `MYSQL_BIN_LOG::LOCK_commit` is acquired by the client thread (binlog flush leader) when it tries to perform the GTID update on behalf of threads waiting in the "Commit Order" queue, thus providing a guarantee that the `Gtid_state::commit_group_sidnos` array is never accessed without the protection of `MYSQL_BIN_LOG::LOCK_commit`.
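A minimal sketch of that invariant, with a std::mutex standing in for MYSQL_BIN_LOG::LOCK_commit and simplified types; all names here are illustrative, not the server's actual code:

```cpp
#include <mutex>
#include <vector>

// Illustrative stand-ins: LOCK_commit guards commit_group_sidnos.
std::mutex LOCK_commit;
std::vector<bool> commit_group_sidnos(64, false);

// After the patch, the binlog flush leader takes LOCK_commit before
// updating GTIDs on behalf of the "Commit Order" queue, so the
// check-then-unlock sequence on commit_group_sidnos[] can no longer
// interleave with an applier performing the same update for itself.
void update_commit_group_for_waiters(int sidno) {
  std::lock_guard<std::mutex> guard(LOCK_commit);
  commit_group_sidnos[sidno] = true;
  // ... lock_sidno(sidno), update owned GTIDs ...
  if (commit_group_sidnos[sidno]) {
    // unlock_sidno(sidno);
    commit_group_sidnos[sidno] = false;
  }
}
```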
…tion fault

https://perconadev.atlassian.net/browse/PS-9719

Problem
-------
When changing binlog_transaction_dependency_tracking under a high-load workload, MySQL can get a segmentation fault.

Analysis
--------
Address sanitizer runs exposed the heap-use-after-free:

```
READ of size 8 at 0x6030002c3298 thread T52
#0  _M_hash_code()
#1  _M_bucket_index()
..
#7  std::unordered_map::insert()
#8  Writeset_trx_dependency_tracker::get_dependency()
#9  Transaction_dependency_tracker::get_dependency()
#10 MYSQL_BIN_LOG::write_transaction()
#11 binlog_cache_data::flush()
#12 binlog_cache_mngr::flush()
#13 MYSQL_BIN_LOG::flush_thread_caches()
#14 MYSQL_BIN_LOG::process_flush_stage_queue()
#15 MYSQL_BIN_LOG::ordered_commit()
#16 MYSQL_BIN_LOG::commit()

freed by thread T49 here:
#0  operator delete()
...
#7  std::unordered_map::clear()
#8  Writeset_trx_dependency_tracker::rotate(long)
#9  Transaction_dependency_tracker::tracking_mode_changed()
#10 update_binlog_transaction_dependency_tracking
#11 sys_var::update()
```

- The Writeset_trx_dependency_tracker uses a std::unordered_map for storing dependency information.
- When a transaction is committing, the committing thread inserts the dependency information into this map through get_dependency().
- When the tracking mode is changed, the map is cleared by Writeset_trx_dependency_tracker::rotate(). Note that no lock/mutex is taken during the rotation.
- As rotate() and get_dependency() can be called concurrently from different threads with no mutex protection to handle it, a segmentation fault can result when get_dependency() tries to insert into the already-deleted map.

Solution
--------
Use std::shared_ptr with atomic load/store for safer dependency-tracker map rotation.
- Replaced direct usage of the map with std::shared_ptr in the Writeset_trx_dependency_tracker class.
- Modified the implementation of rotate() to use std::atomic_load and std::atomic_store to enable thread-safe reads and rotations.

With the new solution the rotation happens in an atomic manner, so transactions calling get_dependency() always use the object returned by the shared_ptr. Even if rotate() happens in parallel, the memory won't be freed until all readers are done.
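A minimal sketch of the pattern, under the assumption that map insertions themselves are otherwise serialized by the caller (in the server, the flush-stage leader serializes writers); class and member names are illustrative, not the server's exact code:

```cpp
#include <cstdint>
#include <memory>
#include <unordered_map>

class Writeset_history {
  using Map = std::unordered_map<uint64_t, int64_t>;
  std::shared_ptr<Map> m_map = std::make_shared<Map>();

 public:
  void record(uint64_t writeset_hash, int64_t sequence_number) {
    // Pin the current map; even if rotate() swaps it out concurrently,
    // this instance stays alive until the local shared_ptr is dropped.
    std::shared_ptr<Map> map = std::atomic_load(&m_map);
    (*map)[writeset_hash] = sequence_number;
  }

  void rotate() {
    // Atomically publish a fresh, empty map. The old map is freed only
    // when the last reader releases its reference, which removes the
    // use-after-free seen in the ASAN report above.
    std::atomic_store(&m_map, std::make_shared<Map>());
  }
};
```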
Hello Satya! Thanks for your review! In the meantime, I am closing this PR to keep things tidy.
https://perconadev.atlassian.net/browse/PS-9703
Problem:
--------
An ALTER TABLE which rebuilds an InnoDB table using the INPLACE algorithm might sometimes lead to row loss if a concurrent purge happens on the table being ALTERed.
Analysis:
---------
This issue was introduced in Upstream version 8.0.41 as an unwanted side effect of the fixes for bug#115608 (PS-9144), in which a similar problem is observed but in a different scenario, and bug#115511 (PS-9214). It was propagated to Percona Server 8.0.41-32, in which we opted for reverting our versions of the fixes for PS-9144 and PS-9214 in favour of the Upstream ones.
A new implementation of parallel ALTER TABLE INPLACE in InnoDB was introduced in MySQL 8.0.27. Its code is used for online table rebuild even in the single-thread case.
This implementation iterates over all the rows in the table, in the general case handling different subtrees of the B-tree in different threads. This iteration over table rows needs to be paused from time to time to commit the InnoDB mini-transaction (MTR) and release the page latches it holds. This is necessary to give way to concurrent actions on the B-tree being scanned (this happens when switching to the next page) or before flushing rows of the new version of the table from the in-memory buffer to the B-tree. In order to resume iteration after such a pause, the persistent cursor position saved before the pause is restored.
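Schematically, the pause/resume cycle looks like the following sketch (signatures simplified; the real logic lives in InnoDB's PCursor wrapper around btr_pcur_t, so this is not the actual server code):

```cpp
// Simplified sketch of the scan pause/resume cycle. pcur is the
// persistent cursor, mtr the mini-transaction holding page latches.
void pause_and_resume_scan(btr_pcur_t *pcur, mtr_t *mtr) {
  pcur->store_position(mtr);  // remember where the scan stopped
  mtr->commit();              // release the page latches held by the scan

  // ... concurrent inserts, purge and page merges may run here ...

  mtr->start();
  pcur->restore_position(BTR_SEARCH_LEAF, mtr,
                         UT_LOCATION_HERE);  // re-latch and reposition
}
```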
The problem described above occurs when we try to save and then restore the position of a cursor pointing to the page supremum before switching to the next page. In post-8.0.41 code this is done by simply calling the btr_pcur_t::store_position()/restore_position() methods for the cursor that points to the supremum. In 8.0.42-based code this is done in the PCursor::save_previous_user_record_as_last_processed() and PCursor::restore_to_first_unprocessed() pair of methods.
However, this doesn't work correctly in the scenario where, after we have saved the cursor position and then committed the mini-transaction/released the latches on the current page, the next page is merged into the current one (after purge removes some records from it). In this case the cursor position is still restored as pointing to the page supremum, and thus the rows which were moved over by the merge are erroneously skipped.
Let us take a look at an example. Assume that we have two pages, p1 : [inf, a, b, c, sup] and the next one, p2 : [inf, d, e, f, sup]. Our thread, one of the parallel ALTER TABLE worker threads, has completed its scan of p1, so its cursor is positioned on the p1 : 'sup' record. Now it needs to switch to page p2, but also give way to threads concurrently updating the table. So it needs to make a cursor savepoint, commit the mini-transaction and release the latches.
In post-8.0.41 code we simply do btr_pcur_t::store_position()/restore_position() with the cursor positioned on the p1 : 'sup' record, so the following might happen: a concurrent purge on page p2 might delete some record from it (e.g. 'f') and decide to merge this page into page p1.
If this happens while the latches are released, the merge goes through, resulting in page p1 with the following contents: p1 : [inf, a, b, c, d, e, sup]. The savepoint for p1 : 'sup' won't be invalidated (one can say that savepoints for sup and inf are not safe against concurrent merges in this respect), and after the cursor is restored the iteration will continue on the next page, skipping records 'd' and 'e'.
Fix:
----
This patch solves the problem by working around the issue with saving/restoring a cursor that points to the supremum. Instead of storing the position of the supremum record, PCursor::save_previous_user_record_as_last_processed() now stores the position of the record that precedes it. PCursor::restore_to_first_unprocessed() then restores in two steps: 1) it restores the position of this preceding record (or of its closest predecessor, if the record was purged in the meantime), and then 2) moves one step forward, assuming that this will land on the supremum record at which the cursor pointed originally. If this is not the case, i.e. a user record was added to the page by a merge (or by a simple concurrent insert), we assume that this record and the following ones are unprocessed. The caller of PCursor::restore_to_first_unprocessed() detects this situation by checking whether the cursor is positioned on the supremum and, if not, handles it by resuming processing from the record under the cursor.
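In sketch form the patched pair looks roughly like this (simplified free-function shape with abbreviated bodies; the real methods live on InnoDB's PCursor and carry more bookkeeping, so treat this as an illustration, not the server's code):

```cpp
// Simplified sketch of the patched save/restore pair.
void save_previous_user_record_as_last_processed(btr_pcur_t *pcur,
                                                 mtr_t *mtr) {
  // The cursor is on the page supremum: step back to the last user
  // record ('c' in the example) and store that position instead of the
  // supremum, which is not safe against concurrent page merges.
  pcur->move_to_prev_on_page();
  pcur->store_position(mtr);
}

void restore_to_first_unprocessed(btr_pcur_t *pcur, mtr_t *mtr) {
  // 1) Reposition on the stored user record, or on its closest
  //    predecessor if it was purged while the latches were released.
  pcur->restore_position(BTR_SEARCH_LEAF, mtr, UT_LOCATION_HERE);
  // 2) Compensate for the step back. If a merge (or a plain insert)
  //    moved user records over the old supremum, this lands on the
  //    first unprocessed record ('d' in the example) instead.
  pcur->move_to_next_on_page();
}
```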
Let us return to the above example to explain how the fix works. PCursor::save_previous_user_record_as_last_processed() takes a step back before calling btr_pcur_t::store_position(), so for a cursor positioned on p1 : 'sup' it is actually the position corresponding to p1 : 'c' that is saved. If the merge happens while the latches are released, we still get p1 : [inf, a, b, c, d, e, sup] and the savepoint is not invalidated. PCursor::restore_to_first_unprocessed() calls btr_pcur_t::restore_position(), gets a cursor pointing to p1 : 'c' as a result, and then tries to compensate for the step back by moving the cursor one step forward, making it point to p1 : 'd'. The code which does the scanning detects that saving/restoring resulted in a jump from the supremum record to a user record and resumes iteration from p1 : 'd' without skipping any records.
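The caller-side check can be pictured like this (again a simplified sketch, not the server's exact code):

```cpp
// After restore: if the cursor is NOT on the supremum, records were
// moved onto this page while the latches were released, so the scan
// must resume from the record under the cursor rather than jump to
// the next page.
bool resume_from_current_record(btr_pcur_t *pcur) {
  return !page_rec_is_supremum(pcur->get_rec());
}
```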