Add worker visibility API - heartbeat and list worker status by ychebotarev · Pull Request #599 · temporalio/api · GitHub

Add worker visibility API - heartbeat and list worker status #599

Open · wants to merge 6 commits into master

Conversation

Contributor (@ychebotarev) commented May 28, 2025:

NOT FOR MERGE. No need to sign off. Once the number of comments goes down I will create a new PR.
The goal of this PR is to review the proposed APIs.
It will not necessarily be merged; treat it as part of the design review.

What changed?
2 new APIs added:

  • WorkerHeartbeat
  • ListWorkerStatus
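
A minimal sketch of how these two RPCs might be wired together, following the usual request/response naming convention; the service and message names below are assumptions for illustration, not the proto under review:

// Sketch only: service and message names are illustrative assumptions.
service WorkerVisibilityService {
  // Reports liveness and top-level status for a single worker instance.
  rpc WorkerHeartbeat (WorkerHeartbeatRequest) returns (WorkerHeartbeatResponse) {}
  // Lists known workers in a namespace, with filtering and paging.
  rpc ListWorkerStatus (ListWorkerStatusRequest) returns (ListWorkerStatusResponse) {}
}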

Why?
Part of the worker visibility work.

Note:
Currently, metrics are not supported as part of the List operation query filter, but that may change based on user feedback.

google.protobuf.Timestamp last_heartbeat_time = 12;

WorkerTaskStatus workflow_task_status = 13;
WorkerTaskStatus activity_task_status = 14;
Contributor:

In Go a worker can also have a session worker, conceptually in the SDK the session worker is under the users worker, but is listening on a different task queue so I guess it would heartbeat separately?

Contributor Author:

I will need SDK input on this. I don't know what that worker is doing, or what value users would get from separating it (and what task queue is it listening on?).

Contributor:

Worker sessions are a Go SDK feature (https://docs.temporal.io/develop/go/sessions) that lets a user schedule multiple activities on the same worker. To achieve this, the worker polls a couple of additional task queues (one for session creation and one to schedule the activities). From the user's perspective it is logically one worker, but there are actually three task queues in use.

Member:

I feel like we don't need to handle the Go session thing for now. Unless we decide to make sessions more generally available across SDKs.

Contributor Author:

I also don't want to add SDK-specific metrics.

Worth mentioning: info in this API is expected to be "top level".
There will also be an interface to get "low level" metrics/information from workers, for example "last 100 errors", etc.
This will be done via "worker commands"; getting "session" information can be part of that.

Contributor Author:

Not in this version.

Member (@cretz), May 29, 2025:

Will we support DescribeWorker? I think that is important to know in order to understand how it may affect the HTTP path for heartbeat and other individual worker RPC calls.

Contributor Author:

Discussed. Yes we will, and DescribeWorker will contain more information, like config/environment/user metadata/etc.

Member:

I think we might want to include it here so we can understand what it looks like, specifically with regards to lookup (ID, identity, etc). I think that has a bearing on how we do some other things here.

Contributor Author:

I don't want to bloat the scope of this PR. I don't know what will be inside DescribeWorker; it is still under discussion, and at an early stage.

Member:

My concern is this PR can kinda be affected by both 1) how you will uniquely identify a worker and 2) what's in the worker info model vs live-queried model. Maybe we can just have a high level discussion on those. Mainly for number 1, I just want to know whether I can look up a worker by its identity (the thing we put on events even though it technically can be non-unique), and for number 2, I just want to confirm everything in this PR we are sure isn't better served live-obtaining from the worker.

Contributor Author:

I just want to know whether I can look up a worker by its identity
The only reason we have worker id (or worker_instance_key) is that worker identity may not be unique, right?

What should happen if there is more than one worker with the same identity? Returning "repeated" for DescribeWorker seems clumsy to me.

Contributor Author:

I just want to confirm everything in this PR we are sure isn't better served live-obtaining from the worker.

One of the requirements is to be able to handle 1M workers. And then you call ListWorker... Fan out to 1M workers? What about paging? Keep an internal list of active workers, select XXX, and fan out only to those? But we need to keep that list, and keep it up to date, and thus we come back to heartbeats. That is the short version.
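
To make the paging point concrete, here is a sketch of what a paged list request could look like, mirroring other Temporal list APIs; field names and numbers are illustrative assumptions, not the proto under review:

// Sketch only: field names/numbers are illustrative assumptions.
message ListWorkerStatusRequest {
  string namespace = 1;
  int32 page_size = 2;
  bytes next_page_token = 3;
  // Visibility-style filter, e.g. by task queue or worker status.
  string query = 4;
}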

string process_id = 2;

// freeform (e.g. "k8s container abc123", etc.)
string host_context = 3;
Member:

What is the purpose of this value? Are we asking users to set this? Maybe we should just have a general purpose "summary"/"description" for the worker? And shouldn't we allow it to be end-to-end encrypted? And therefore, maybe we should accept a temporal.api.sdk.v1.UserMetadata metadata field that is not specific to host info.
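
For reference, a sketch of what that alternative could look like; the field number and placement are illustrative assumptions:

// Sketch only: field number is an illustrative assumption.
// Worker-supplied, end-to-end encryptable metadata, not specific to host info.
temporal.api.sdk.v1.UserMetadata metadata = 4;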

Contributor Author:

Maybe we don't need it at all?

Contributor Author:

A memo, maybe (if it needs to be in the list)? Unindexed, encryptable.

Contributor Author:

After some thinking: removed. This may become part of the DescribeWorker API; for that API, status is only one part, and it can contain much more detail about host/environment/user metadata/etc.

string worker_id = 1;

// Worker host information. Required.
WorkerHostInfo host_info = 2;
Member:

Not sure there is value in breaking this into a separate message from an API POV, but not a big deal

Contributor Author:

We are discussing adding k8s-specific info, so let's keep it separate.

Member:

Sounds like that is removed, so do we still need this separate?

Contributor Author:

Yeah, let's keep it separate. It is nice to have all host-related info in one place. Are there any concerns?

//* SdkVersion
//* Uptime
//* LastHeartbeatTime
//* Status
Member:

What is "Status" here?

Contributor Author:

I added comments to the status field. Essentially it is "running/shutting down/shutdown".
We may consider adding a "Stuck" status that we would set when extracting the data.
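
A sketch of the status values described above; the exact enum and value names are assumptions, not the proto under review:

// Sketch only: enum and value names are illustrative assumptions.
enum WorkerStatus {
  WORKER_STATUS_UNSPECIFIED = 0;
  WORKER_STATUS_RUNNING = 1;
  WORKER_STATUS_SHUTTING_DOWN = 2;
  WORKER_STATUS_SHUTDOWN = 3;
}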

WorkerPollerInfo activity_poller_info = 16;
WorkerPollerInfo nexus_poller_info = 17;

float cpu_usage_percent = 18;
Member:

CPU usage is surprisingly tricky to measure and interpret. Over what time range is this measured? Does it include kernel time or just user time? Treating it as relative to some other thing makes it even worse: relative to the machine? The container soft limits? Hard limits? To one core? What about burst (frequency scaling or burstable cycles in some cloud VM types)? SMT?

To do this properly you need multiple stats. But to keep it simple, maybe something like:

Suggested change:
- float cpu_usage_percent = 18;
+ // Average number of CPU cores used per second by this worker process (user time only),
+ // over the time between the previous heartbeat and this heartbeat.
+ float cpu_cores_used = 18;

i.e. ignore the machine/container and just count cpu time.

WorkerPollerInfo nexus_poller_info = 17;

float cpu_usage_percent = 18;
int64 memory_usage_bytes = 19;
Member:

Total allocated or in-use? Just heap or stacks too? Counting shared memory or only private?

I'm wondering if cpu/memory shouldn't even be here, maybe leave that to monitoring systems that know how to deal with all the intricacies?

Contributor Author:

Hm.
All good questions, David, but I was under the impression we already have those in the SDK, and that those questions are answered :).
Guess I need to check my assumptions.
Same for CPU

Contributor Author:

Ok, not in https://docs.temporal.io/references/sdk-metrics.
I will start the discussion.

Member (@cretz), May 30, 2025:

I would be ok just dropping all of these system/process metrics for the initial version. I think worker status is not necessarily system/process status. If we want to have a system/process metrics section (maybe later?), we should more clearly define exactly how the values are obtained.

Member:

Well, we do have them in the SDK - we need them for the resource-based tuner. David is right that properly measuring them is extremely tricky, though (and there are some known problems with my existing implementation). This would be a forcing function to fix those.

I'm OK with adding them in later, but I do think we want them, for the same reason we want all this other stuff, which is that people simply don't set up other forms of monitoring.

int64 memory_usage_bytes = 19;

// A Workflow Task found a cached Workflow Execution to run against.
int32 sticky_cache_hit = 20;
Member:

Are these counters since process start or since the last heartbeat?

Contributor Author:

Now this is something we already have: https://docs.temporal.io/references/sdk-metrics
So I guess since process start.

Member:

We do need a clear delineation in the docs between values that are reset every heartbeat and ones that are cumulative.


// Holds everything needed to identify the worker host/process context
message WorkerHostInfo {
// Worker host identifier, should be unique for the namespace.
Member:

what does "unique for the namespace" mean? obviously multiple workers can run on the same host...

Contributor Author:

We have namespaces, and within a single namespace we can have multiple workers. All those workers should be distinguishable from each other, thus unique across the namespace. So there can be only one worker "W" in namespace "N1", but there can also be a worker "W" in namespace "N2".

Comment on lines +39 to +41
// Number of tasks processed in the last minute.
int32 processed_tasks_last_minute = 5;
// Number of failed tasks processed in the last minute.
int32 failure_tasks_last_minute = 6;
Member:

I do not think the field names should assume heartbeat frequency. If there are values that are only for a certain window, then the start/end of the window needs to be provided as well IMO.
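
A sketch of one way to make the window explicit, per the comment above; field names and numbers are illustrative assumptions:

// Sketch only: field names/numbers are illustrative assumptions.
// Counts cover the window [window_start_time, window_end_time], which the worker
// reports explicitly instead of baking a duration into the field name.
int32 processed_tasks_in_window = 5;
int32 failed_tasks_in_window = 6;
google.protobuf.Timestamp window_start_time = 7;
google.protobuf.Timestamp window_end_time = 8;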

Contributor Author:

"last_minute" doesn't assume heartbeat frequency. 5 sec, 1min or 10 min - " processed_tasks_last_minute" should be the same. It is "current processing speed", and speed at the moment of reporting doesn't depend on the frequency of reporting.
I choose "last_minute" because I think "per_sec" is too low, not sure if average number of tasks per sec is greater then one.

Member:

I think we should consider going ahead and adding WorkerInfo on ShutdownWorker at this time (I would also support doing it for poll calls too if we know we're going to have it, but I understand waiting there)

Contributor Author:

I don't mind - it's not much work for me. But for the first iteration it is not needed, so it's up to the SDK team. @Sushisource?

Member:

It should be easy to add on shutdown since we already have to call the shutdown API

int32 slots_used = 2;

// Total number of tasks processed by the worker so far.
int32 processed_tasks = 3;
Member:

Can you clarify whether this is "successfully processed tasks" or just any processed tasks? Can you also clarify whether this is just received tasks or completed tasks?

Contributor Author (@ychebotarev), May 30, 2025:

Usually "total" indicates both successful and failed. Like "total workflows".
And for me "processed" implies that they are completed.
I will try to make it more clear.

Member (@cretz), May 30, 2025:

"completed" often can mean "completed successfully" and "completed unsuccessfully". Yeah, can just clarify. And also this means that we aren't showing how many are being processed at this time? That can kinda be obtained from slots used (though that includes poller count), so not necessarily required. There is a difference between "slots reserved" and "slots in use" that we may want to clarify here.

Contributor Author:

And also this means that we aren't showing how many are being processed at this time? That can kinda be obtained from slots used (though that includes poller count), so not necessarily required.

I though "occupied slots" has the information only on currently processing (not processed) tasks?
If task is completed slot is free, isn't that the case?

Either way the intention is - total (commutative) value from the start. For more recent we will have "processed_tasks_last_minute" (or whatever the name will be)

Contributor Author:

Updated the comments.
