executor: fix hang in hash agg when exceeding memory limit leads to panic (#57641) #57696
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files

Coverage Diff
Metric      release-8.5   #57696
Coverage    ?             56.8661%
Files       ?             1770
Lines       ?             626484
Branches    ?             0
Hits        ?             356257
Misses      ?             246123
Partials    ?             24104
LGTM
/retest
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: XuHuaiyu, xzhangxian1008
This is an automated cherry-pick of #57641
What problem does this PR solve?
Issue Number: close #57546
Problem Summary:
A partial worker gets its input by calling the `getChildInput` function. Inside this function, `Consume` is called to track the memory occupied by the input chunk. When the SQL statement's memory usage is high, `Consume` may panic. According to the synchronization rule, each time a chunk is obtained from `getChildInput`, `Done` must be called on the `inflightChunkSync` variable. However, when the panic happens, `Done` is never called, so the counter in `inflightChunkSync` never drops to 0, which leads to a hang.

Solution: Each time `getChildInput` is woken up by the channel, it sets a variable named `needDone`. When a panic happens in `getChildInput`, the function catches the panic, checks whether `needDone` is true to decide whether `Done` must be called on `inflightChunkSync`, and finally rethrows the panic.

What changed and how does it work?
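The recover-and-rethrow pattern described in the solution can be sketched in isolation as follows. This is a minimal illustration, not the code from this PR: the `chunk` and `tracker` types, the `runWorker` helper, and the function signatures are hypothetical, and `inflightChunkSync` is assumed here to behave like a `sync.WaitGroup`. Only the names `getChildInput`, `Consume`, `Done`, and `needDone` come from the description above.

```go
package main

import (
	"fmt"
	"sync"
)

// chunk stands in for an input chunk (hypothetical type).
type chunk struct{ bytes int64 }

// tracker is a hypothetical memory tracker whose Consume panics over the limit.
type tracker struct{ used, limit int64 }

func (t *tracker) Consume(n int64) {
	t.used += n
	if t.used > t.limit {
		panic("memory limit exceeded")
	}
}

// getChildInput receives one chunk and tracks its memory. Once a chunk has
// been received, one Done is owed on inflight. If Consume panics, the
// deferred function calls Done itself and then rethrows the panic.
func getChildInput(in <-chan *chunk, inflight *sync.WaitGroup, mem *tracker) (c *chunk, ok bool) {
	needDone := false
	defer func() {
		if r := recover(); r != nil {
			if needDone {
				inflight.Done()
			}
			panic(r) // rethrow so the caller's error handling still runs
		}
	}()

	c, ok = <-in
	if !ok {
		return nil, false
	}
	needDone = true // a chunk is in hand, so Done is now owed
	mem.Consume(c.bytes)
	return c, true
}

// runWorker feeds one chunk to a worker goroutine and waits for the in-flight
// counter to reach zero. It returns the recovered panic value, or nil.
func runWorker(chunkBytes, limit int64) (recovered interface{}) {
	in := make(chan *chunk, 1)
	inflight := &sync.WaitGroup{}
	mem := &tracker{limit: limit}

	inflight.Add(1) // producer side: one chunk is now in flight
	in <- &chunk{bytes: chunkBytes}
	close(in)

	workerDone := make(chan struct{})
	go func() {
		defer close(workerDone)
		defer func() { recovered = recover() }()
		for {
			c, ok := getChildInput(in, inflight, mem)
			if !ok {
				return
			}
			_ = c           // chunk processing would happen here
			inflight.Done() // normal path: the caller marks the chunk finished
		}
	}()

	inflight.Wait() // would block forever if Done were skipped on panic
	<-workerDone
	return recovered
}

func main() {
	fmt.Println(runWorker(10, 100))  // chunk fits within the limit
	fmt.Println(runWorker(200, 100)) // Consume panics, yet Wait still returns
}
```

In this sketch, removing the `inflight.Done()` call from the deferred function makes the second `runWorker` call block in `inflight.Wait()`, which mirrors the hang described in the problem summary.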
Check List
Tests
Side effects
Documentation
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.