-
Notifications
You must be signed in to change notification settings - Fork 69
Add worker visibility API - heartbeat and list worker satus #599
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
temporal/api/worker/v1/message.proto
Outdated
google.protobuf.Timestamp last_heartbeat_time = 12; | ||
|
||
WorkerTaskStatus workflow_task_status = 13; | ||
WorkerTaskStatus activity_task_status = 14; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Go a worker can also have a session worker, conceptually in the SDK the session worker is under the users worker, but is listening on a different task queue so I guess it would heartbeat separately?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will need SDK input on this. I don't know what that worker is doing, and what is the value for users to separate it? (and on what task queue it is listening?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worker sessions are a Go SDK feature https://docs.temporal.io/develop/go/sessions, that let a user schedule multiple activities on the same worker. To achieve this the worker polls on a couple of additional task queues (one task queue for session creation and one to schedule the activities). From the users perspective it is logically under one worker, but there are actually three task queues being used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like we don't need to handle the go session thing for now. Unless we decided to make the session more generally available across SDKs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also don't want to add sdk-specific metrics.
Worth to mention - info in this API is expected to be "top level".
There will also be an interface to get "low level" metrics/information from workers. example - "last 100 errors", etc.
This will be done via "worker commands". Getting "session" information can be part of that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not in this version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will we support DescribeWorker
? I think it is important information to understand how it may affect the HTTP path for heartbeat and other individual worker RPC calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed. Yes we will, and DescribeWorker will contain more information, like config/environment/user metadata/etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we might want to include it here so we can understand what it looks like, specifically with regards to lookup (ID, identity, etc). I think that has a bearing on how we do some other things here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't want to bloat the scope of this PR. I don't know what will be inside DescribeWorker, it is still under discussion, and in early stage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My concern is this PR can kinda be affected by both 1) how you will uniquely identify a worker and 2) what's in the worker info model vs live-queried model. Maybe we can just have a high level discussion on those. Mainly for number 1, I just want to know whether I can look up a worker by its identity (the thing we put on events even though it technically can be non-unique), and for number 2, I just want to confirm everything in this PR we are sure isn't better served live-obtaining from the worker.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just want to know whether I can look up a worker by its identity
The only reason we have worker id (or worker_instanse_key) is because worker identity may be not unique, right?
What should happen if there is more then 1 worker with the same identity? Returning "repeated" for DescribeWorker? seems clumsy for me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just want to confirm everything in this PR we are sure isn't better served live-obtaining from the worker.
One of the requirement - be able to handle 1M workers. And then you call ListWorker... Fan out to 1M workers? What about paging? Have internal list of active workers, select XXX, fun out only those? But we need to keep this list, and keep it up to date, and thus we come to heartbeats again. This is in short.
temporal/api/worker/v1/message.proto
Outdated
string process_id = 2; | ||
|
||
// freeform (e.g. "k8s container abc123", etc.) | ||
string host_context = 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the purpose of this value? Are we asking users to set this? Maybe we should just have a general purpose "summary"/"description" for the worker? And shouldn't we allow it to be end-to-end encrypted? And therefore, maybe we should accept a temporal.api.sdk.v1.UserMetadata metadata
field that is not specific to host info.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
may be we don't need it at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
memo? (if it needs to be in the list). Unindexed, encryptable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
after some thinking - removed. This may be a part of DescribeWorker API. For that API status is only a part of it, it can contain much more details about host/environment/user metadata/etc.
string worker_id = 1; | ||
|
||
// Worker host information. Required. | ||
WorkerHostInfo host_info = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure there is value in breaking this into a separate message from an API POV, but not a big deal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we are discussing on adding k8s specific info, so lets keep it separated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like that is removed, so do we still need this separated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, lets keep it separated. It is nice to have all host related info in one place. Are there any concerns?
//* SdkVersion | ||
//* Uptime | ||
//* LastHeartbeatTime | ||
//* Status |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is "Status" here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added comments to the status field. Essentially it is "running/shutting down/shutdown".
We may think about adding "Stuck" status, that we will set when extracting the data.
WorkerPollerInfo activity_poller_info = 16; | ||
WorkerPollerInfo nexus_poller_info = 17; | ||
|
||
float cpu_usage_percent = 18; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cpu usage is surprisingly tricky to measure and interpret. Over what time range is this measured? Does it include kernel time or just user time? Treating it as relative to some other thing makes it even worse: relative to the machine? the container soft limits? hard limits? To one core? What about burst (frequency scaling or burstable cycles in some cloud vm types)? SMT?
To do this properly you need multiple stats. But to keep it simple, maybe something like:
float cpu_usage_percent = 18; | |
// Average number of CPU cores used per second by this worker process (user time only), | |
// over the time between the previous heartbeat and this heartbeat. | |
float cpu_cores_used = 18; |
i.e. ignore the machine/container and just count cpu time.
WorkerPollerInfo nexus_poller_info = 17; | ||
|
||
float cpu_usage_percent = 18; | ||
int64 memory_usage_bytes = 19; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Total allocated or in-use? Just heap or stacks too? Counting shared memory or only private?
I'm wondering if cpu/memory shouldn't even be here, maybe leave that to monitoring systems that know how to deal with all the intricacies?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm.
All those are a good questions David, but I was under impression we already have those in SDK, and those questions are answered :).
Guess I need to check my assumptions.
Same for CPU
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, not in https://docs.temporal.io/references/sdk-metrics.
I will start the discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be ok just dropping all of these system/process metrics for the initial version. I think worker status is not necessarily system/process status. If we want to have a system/process metrics section (maybe later?), we should more clearly define exactly how the values are obtained.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, we do have them in SDK - we need them for the resource based tuner. David is right that properly measuring them is extremely tricky, though (and there are some known problems with my existing implementation). This would be a forcing function to fix those.
I'm OK with adding them in later, but I do think we want them, for the same reason we want all this other stuff, which is that people simply don't set up other forms of monitoring.
int64 memory_usage_bytes = 19; | ||
|
||
// A Workflow Task found a cached Workflow Execution to run against. | ||
int32 sticky_cache_hit = 20; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these counters since process start or since the last heartbeat?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now this is something we already have: https://docs.temporal.io/references/sdk-metrics
So I guess since process start.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do need clear delineation in docs between values that are reset every heartbeat and ones that are cumulative
|
||
// Holds everything needed to identify the worker host/process context | ||
message WorkerHostInfo { | ||
// Worker host identifier, should be unique for the namespace. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does "unique for the namespace" mean? obviously multiple workers can run on the same host...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have namespaces, and in the boundary of a single namespace we can have multiple workers. All those workers should be distinguishable between is other. Thus - unique across namespace. So there can be only one worker "W" in namespace "N1", but there can be worker "W" in namespace "N2"
// Number of tasks processed in the last minute. | ||
int32 processed_tasks_last_minute = 5; | ||
// Number of failed tasks processed in the last minute. | ||
int32 failure_tasks_last_minute = 6; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think the field names should assume heartbeat frequency. If there are values that are only for a certain window, then the start/end of the window needs to be provided as well IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"last_minute" doesn't assume heartbeat frequency. 5 sec, 1min or 10 min - " processed_tasks_last_minute" should be the same. It is "current processing speed", and speed at the moment of reporting doesn't depend on the frequency of reporting.
I choose "last_minute" because I think "per_sec" is too low, not sure if average number of tasks per sec is greater then one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should consider going ahead and adding WorkerInfo
on ShutdownWorker
at this time (I would also support doing it for poll calls too if we know we're going to have it, but I understand waiting there)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't mind - not much work for me. But for the first iteration it is not needed. So up to SDK team. @Sushisource ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be easy to add on shutdown since we already have to call the shutdown API
int32 slots_used = 2; | ||
|
||
// Total number of tasks processed by the worker so far. | ||
int32 processed_tasks = 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you clarify whether this is "successfully processed tasks" or just any processed tasks? Can you also clarify whether this is just received tasks or completed tasks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually "total" indicates both successful and failed. Like "total workflows".
And for me "processed" implies that they are completed.
I will try to make it more clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"completed" often can mean "completed successfully" and "completed unsuccessfully". Yeah, can just clarify. And also this means that we aren't showing how many are being processed at this time? That can kinda be obtained from slots used (though that includes poller count), so not necessarily required. There is a difference between "slots reserved" and "slots in use" that we may want to clarify here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And also this means that we aren't showing how many are being processed at this time? That can kinda be obtained from slots used (though that includes poller count), so not necessarily required.
I though "occupied slots" has the information only on currently processing (not processed) tasks?
If task is completed slot is free, isn't that the case?
Either way the intention is - total (commutative) value from the start. For more recent we will have "processed_tasks_last_minute" (or whatever the name will be)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated the comments.
44b650a
to
46cd519
Compare
NOT FOR MERGE. No need to sign off. Once the number of comments go down I will create a new PR.
The goal of this PR is to review the proposed APIs.
It will not necessary be merged, treat it as a part of Design review.
What changed?
2 new APIs added:
Why?
Part of the worker visibility work.
Note:
Currently metrics are not supported as a part of List operation query. But it may happen, based on user feedback.