Add worker visibility API - heartbeat and list worker status by ychebotarev · Pull Request #599 · temporalio/api · GitHub

Add worker visibility API - heartbeat and list worker status #599

Open · wants to merge 6 commits into master

Conversation

Contributor (@ychebotarev) commented May 28, 2025:

NOT FOR MERGE. No need to sign off. Once the number of comments goes down I will create a new PR.
The goal of this PR is to review the proposed APIs.
It will not necessarily be merged; treat it as part of the design review.

What changed?
2 new APIs added:

  • WorkerHeartbeat
  • ListWorkerStatus
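
A minimal sketch of how these two RPCs might be wired together, following the usual request/response naming convention; the service and message names below are assumptions for illustration, not the proto under review:

// Sketch only: service and message names are illustrative assumptions.
service WorkerVisibilityService {
  // Reports liveness and top-level status for a single worker instance.
  rpc WorkerHeartbeat (WorkerHeartbeatRequest) returns (WorkerHeartbeatResponse) {}
  // Lists known workers in a namespace, with filtering and paging.
  rpc ListWorkerStatus (ListWorkerStatusRequest) returns (ListWorkerStatusResponse) {}
}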

Why?
Part of the worker visibility work.

Note:
Currently, metrics are not supported as part of the List operation query filter, but that may change based on user feedback.

google.protobuf.Timestamp last_heartbeat_time = 12;

WorkerTaskStatus workflow_task_status = 13;
WorkerTaskStatus activity_task_status = 14;
Contributor:

In Go a worker can also have a session worker, conceptually in the SDK the session worker is under the users worker, but is listening on a different task queue so I guess it would heartbeat separately?

Contributor Author:

I will need SDK input on this. I don't know what that worker is doing, or what value users would get from separating it (and what task queue is it listening on?).

Contributor:

Worker sessions are a Go SDK feature (https://docs.temporal.io/develop/go/sessions) that lets a user schedule multiple activities on the same worker. To achieve this, the worker polls a couple of additional task queues (one for session creation and one to schedule the activities). From the user's perspective it is logically one worker, but there are actually three task queues in use.

Member:

I feel like we don't need to handle the Go session thing for now. Unless we decide to make sessions more generally available across SDKs.

Contributor Author:

I also don't want to add SDK-specific metrics.

Worth mentioning: info in this API is expected to be "top level".
There will also be an interface to get "low level" metrics/information from workers, for example "last 100 errors", etc.
This will be done via "worker commands"; getting "session" information can be part of that.

Contributor Author:

Not in this version.

Member (@cretz), May 29, 2025:

Will we support DescribeWorker? I think that is important to know in order to understand how it may affect the HTTP path for heartbeat and other individual worker RPC calls.

Contributor Author:

Discussed. Yes we will, and DescribeWorker will contain more information, like config/environment/user metadata/etc.

Member:

I think we might want to include it here so we can understand what it looks like, specifically with regards to lookup (ID, identity, etc). I think that has a bearing on how we do some other things here.

Contributor Author:

I don't want to bloat the scope of this PR. I don't know what will be inside DescribeWorker; it is still under discussion, and at an early stage.

Member:

My concern is this PR can kinda be affected by both 1) how you will uniquely identify a worker and 2) what's in the worker info model vs live-queried model. Maybe we can just have a high level discussion on those. Mainly for number 1, I just want to know whether I can look up a worker by its identity (the thing we put on events even though it technically can be non-unique), and for number 2, I just want to confirm everything in this PR we are sure isn't better served live-obtaining from the worker.

Contributor Author:

I just want to know whether I can look up a worker by its identity
The only reason we have worker id (or worker_instance_key) is that worker identity may not be unique, right?

What should happen if there is more than one worker with the same identity? Returning "repeated" for DescribeWorker seems clumsy to me.

Contributor Author:

I just want to confirm everything in this PR we are sure isn't better served live-obtaining from the worker.

One of the requirements is to be able to handle 1M workers. And then you call ListWorker... Fan out to 1M workers? What about paging? Keep an internal list of active workers, select XXX, and fan out only to those? But we need to keep that list, and keep it up to date, and thus we come back to heartbeats. That is the short version.
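
To make the paging point concrete, here is a sketch of what a paged list request could look like, mirroring other Temporal list APIs; field names and numbers are illustrative assumptions, not the proto under review:

// Sketch only: field names/numbers are illustrative assumptions.
message ListWorkerStatusRequest {
  string namespace = 1;
  int32 page_size = 2;
  bytes next_page_token = 3;
  // Visibility-style filter, e.g. by task queue or worker status.
  string query = 4;
}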

string process_id = 2;

// freeform (e.g. "k8s container abc123", etc.)
string host_context = 3;
Member:

What is the purpose of this value? Are we asking users to set this? Maybe we should just have a general purpose "summary"/"description" for the worker? And shouldn't we allow it to be end-to-end encrypted? And therefore, maybe we should accept a temporal.api.sdk.v1.UserMetadata metadata field that is not specific to host info.
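
For reference, a sketch of what that alternative could look like; the field number and placement are illustrative assumptions:

// Sketch only: field number is an illustrative assumption.
// Worker-supplied, end-to-end encryptable metadata, not specific to host info.
temporal.api.sdk.v1.UserMetadata metadata = 4;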

Contributor Author:

Maybe we don't need it at all?

Contributor Author:

A memo, maybe (if it needs to be in the list)? Unindexed, encryptable.

Contributor Author:

After some thinking: removed. This may become part of the DescribeWorker API; for that API, status is only one part, and it can contain much more detail about host/environment/user metadata/etc.

string worker_id = 1;

// Worker host information. Required.
WorkerHostInfo host_info = 2;
Member:

Not sure there is value in breaking this into a separate message from an API POV, but not a big deal

Contributor Author:

We are discussing adding k8s-specific info, so let's keep it separate.

Member:

Sounds like that is removed, so do we still need this separate?

Contributor Author:

Yeah, let's keep it separate. It is nice to have all host-related info in one place. Are there any concerns?

//* SdkVersion
//* Uptime
//* LastHeartbeatTime
//* Status
Member:

What is "Status" here?

Contributor Author:

I added comments to the status field. Essentially it is "running/shutting down/shutdown".
We may consider adding a "Stuck" status that we would set when extracting the data.
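
A sketch of the status values described above; the exact enum and value names are assumptions, not the proto under review:

// Sketch only: enum and value names are illustrative assumptions.
enum WorkerStatus {
  WORKER_STATUS_UNSPECIFIED = 0;
  WORKER_STATUS_RUNNING = 1;
  WORKER_STATUS_SHUTTING_DOWN = 2;
  WORKER_STATUS_SHUTDOWN = 3;
}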

WorkerPollerInfo activity_poller_info = 16;
WorkerPollerInfo nexus_poller_info = 17;

float cpu_usage_percent = 18;
Member:

CPU usage is surprisingly tricky to measure and interpret. Over what time range is this measured? Does it include kernel time or just user time? Treating it as relative to some other thing makes it even worse: relative to the machine? The container soft limits? Hard limits? To one core? What about burst (frequency scaling or burstable cycles in some cloud VM types)? SMT?

To do this properly you need multiple stats. But to keep it simple, maybe something like:

Suggested change:
- float cpu_usage_percent = 18;
+ // Average number of CPU cores used per second by this worker process (user time only),
+ // over the time between the previous heartbeat and this heartbeat.
+ float cpu_cores_used = 18;

i.e. ignore the machine/container and just count cpu time.

WorkerPollerInfo nexus_poller_info = 17;

float cpu_usage_percent = 18;
int64 memory_usage_bytes = 19;
Member:

Total allocated or in-use? Just heap or stacks too? Counting shared memory or only private?

I'm wondering if cpu/memory shouldn't even be here, maybe leave that to monitoring systems that know how to deal with all the intricacies?

Contributor Author:

Hm.
All good questions, David, but I was under the impression we already have those in the SDK, and that those questions are answered :).
Guess I need to check my assumptions.
Same for CPU

Contributor Author:

Ok, not in https://docs.temporal.io/references/sdk-metrics.
I will start the discussion.

Member (@cretz), May 30, 2025:

I would be ok just dropping all of these system/process metrics for the initial version. I think worker status is not necessarily system/process status. If we want to have a system/process metrics section (maybe later?), we should more clearly define exactly how the values are obtained.

Member:

Well, we do have them in the SDK - we need them for the resource-based tuner. David is right that properly measuring them is extremely tricky, though (and there are some known problems with my existing implementation). This would be a forcing function to fix those.

I'm OK with adding them in later, but I do think we want them, for the same reason we want all this other stuff, which is that people simply don't set up other forms of monitoring.

int64 memory_usage_bytes = 19;

// A Workflow Task found a cached Workflow Execution to run against.
int32 sticky_cache_hit = 20;
Member:

Are these counters since process start or since the last heartbeat?

Contributor Author:

Now this is something we already have: https://docs.temporal.io/references/sdk-metrics
So I guess since process start.

Member:

We do need a clear delineation in the docs between values that are reset every heartbeat and ones that are cumulative.


// Holds everything needed to identify the worker host/process context
message WorkerHostInfo {
// Worker host identifier, should be unique for the namespace.
Member:

what does "unique for the namespace" mean? obviously multiple workers can run on the same host...

Contributor Author:

We have namespaces, and within a single namespace we can have multiple workers. All those workers should be distinguishable from each other, thus unique across the namespace. So there can be only one worker "W" in namespace "N1", but there can also be a worker "W" in namespace "N2".

Comment on lines +39 to +41
// Number of tasks processed in the last minute.
int32 processed_tasks_last_minute = 5;
// Number of failed tasks processed in the last minute.
int32 failure_tasks_last_minute = 6;
Member:

I do not think the field names should assume heartbeat frequency. If there are values that are only for a certain window, then the start/end of the window needs to be provided as well IMO.
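
A sketch of one way to make the window explicit, per the comment above; field names and numbers are illustrative assumptions:

// Sketch only: field names/numbers are illustrative assumptions.
// Counts cover the window [window_start_time, window_end_time], which the worker
// reports explicitly instead of baking a duration into the field name.
int32 processed_tasks_in_window = 5;
int32 failed_tasks_in_window = 6;
google.protobuf.Timestamp window_start_time = 7;
google.protobuf.Timestamp window_end_time = 8;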

Contributor Author:

"last_minute" doesn't assume heartbeat frequency. 5 sec, 1min or 10 min - " processed_tasks_last_minute" should be the same. It is "current processing speed", and speed at the moment of reporting doesn't depend on the frequency of reporting.
I choose "last_minute" because I think "per_sec" is too low, not sure if average number of tasks per sec is greater then one.

Member:

I think we should consider going ahead and adding WorkerInfo on ShutdownWorker at this time (I would also support doing it for poll calls too if we know we're going to have it, but I understand waiting there)

Contributor Author:

I don't mind - it's not much work for me. But for the first iteration it is not needed, so it's up to the SDK team. @Sushisource?

Member:

It should be easy to add on shutdown since we already have to call the shutdown API

int32 slots_used = 2;

// Total number of tasks processed by the worker so far.
int32 processed_tasks = 3;
Member:

Can you clarify whether this is "successfully processed tasks" or just any processed tasks? Can you also clarify whether this is just received tasks or completed tasks?

Contributor Author (@ychebotarev), May 30, 2025:

Usually "total" indicates both successful and failed. Like "total workflows".
And for me "processed" implies that they are completed.
I will try to make it more clear.

Member (@cretz), May 30, 2025:

"completed" often can mean "completed successfully" and "completed unsuccessfully". Yeah, can just clarify. And also this means that we aren't showing how many are being processed at this time? That can kinda be obtained from slots used (though that includes poller count), so not necessarily required. There is a difference between "slots reserved" and "slots in use" that we may want to clarify here.

Contributor Author:

And also this means that we aren't showing how many are being processed at this time? That can kinda be obtained from slots used (though that includes poller count), so not necessarily required.

I though "occupied slots" has the information only on currently processing (not processed) tasks?
If task is completed slot is free, isn't that the case?

Either way the intention is - total (commutative) value from the start. For more recent we will have "processed_tasks_last_minute" (or whatever the name will be)

Contributor Author:

Updated the comments.
