zesDeviceProcessesGetState is returning 78000003 (ZE_RESULT_ERROR_UNSUPPORTED_FEATURE) · Issue #809 · intel/compute-runtime · GitHub

zesDeviceProcessesGetState is returning 78000003 (ZE_RESULT_ERROR_UNSUPPORTED_FEATURE) #809

Open
jketreno opened this issue Feb 8, 2025 · 20 comments
Labels
bug in queue L0 Sysman Issue related to L0 Sysman

Comments

@jketreno
jketreno commented Feb 8, 2025

I'm writing a small ze-top like utility to monitor the B580. It looks like zesDeviceProcessesGetState should be able to tell me the info for processes using the GPU. However, it always returns ZE_RESULT_ERROR_UNSUPPORTED_FEATURE. That error return code is documented for other APIs, but doesn't seem to be in the list of valid return codes for zesDeviceProcessesGetState

I have a valid device handle, which I'm using to call zesDeviceEnumEngineGroups to get usage info from the engines, and that's working well.

I've tried running as sudo in case there was a permissions issue, but that didn't help.

#define _MAX_PROCESSES 2048
processCount = _MAX_PROCESSES;
zes_process_state_t allProcesses[_MAX_PROCESSES];
ret = zesDeviceProcessesGetState(hSysmanHandle, &processCount, allProcesses);
if (ret != ZE_RESULT_SUCCESS && ret != ZE_RESULT_ERROR_INVALID_SIZE) {
    fprintf(stderr, "Unable to get process information (ret count %u): %08X (%s)\n", processCount, ret, ze_error_to_str(ret));
}
...

The above outputs:

Unable to get process information (ret count 2048): 78000003 (ZE_RESULT_ERROR_UNSUPPORTED_FEATURE)

I've tried setting processCount to 0 to have it tell me how many process items to use, but that has the same error code returned.

I'm using libze-intel-gpu1 version 24.52.32224.5-124.10ppa2, and libze1 version 1.19.2.0-1076~24.10.

Thanks,
James

JablonskiMateusz added the "L0 Sysman" (Issue related to L0 Sysman) label Feb 10, 2025
@jketreno
Author

Adding additional context; it looks like the device handle I was using was for the integrated Intel UHD 770:

Output while UHD 770 is running a workload, and I monitor the UHD 770:

Device 0: 868080A7-0400-0000-0002-000000000000
 BDF: 0000:0000:0002:0000
 PCI ID: 8086:A780
 Subdevices: 0
 Serial Number: unknown
 Board Number: unknown
 Brand Name: unknown
 Model Name: Intel(R) UHD Graphics 770
 Vendor Name: Intel(R) Corporation
 Driver Version: 7209A40C3CFCD5142354A9F
 Type: GPU
 Is integrated with host: Yes
 Is a sub-device: No
 Supports error correcting memory: No
 Supports on-demand pauge-faulting: No
Device 0: 7 engines found.
 Engine 0:
  Type: ZES_ENGINE_GROUP_RENDER_SINGLE
  Sub-device: No
 Engine 1:
  Type: ZES_ENGINE_GROUP_MEDIA_DECODE_SINGLE
  Sub-device: No
 Engine 2:
  Type: ZES_ENGINE_GROUP_MEDIA_DECODE_SINGLE
  Sub-device: No
 Engine 3:
  Type: ZES_ENGINE_GROUP_MEDIA_ENCODE_SINGLE
  Sub-device: No
 Engine 4:
  Type: ZES_ENGINE_GROUP_MEDIA_ENCODE_SINGLE
  Sub-device: No
 Engine 5:
  Type: ZES_ENGINE_GROUP_COPY_SINGLE
  Sub-device: No
 Engine 6:
  Type: ZES_ENGINE_GROUP_MEDIA_ENHANCEMENT_SINGLE
  Sub-device: No
INFO: No temperature sensors to monitor.
Monitoring 7 engines.
ZES_ENGINE_GROUP_RENDER_SINGLE: N/A
ZES_ENGINE_GROUP_MEDIA_DECODE_SINGLE: N/A
ZES_ENGINE_GROUP_MEDIA_DECODE_SINGLE: N/A
ZES_ENGINE_GROUP_MEDIA_ENCODE_SINGLE: N/A
ZES_ENGINE_GROUP_MEDIA_ENCODE_SINGLE: N/A
ZES_ENGINE_GROUP_COPY_SINGLE: N/A
ZES_ENGINE_GROUP_MEDIA_ENHANCEMENT_SINGLE: N/A
Unable to get process information (ret count 2048): 78000003 (ZE_RESULT_ERROR_UNSUPPORTED_FEATURE)
ZES_ENGINE_GROUP_RENDER_SINGLE: 98%
ZES_ENGINE_GROUP_MEDIA_DECODE_SINGLE: 0%
ZES_ENGINE_GROUP_MEDIA_DECODE_SINGLE: 0%
ZES_ENGINE_GROUP_MEDIA_ENCODE_SINGLE: 0%
ZES_ENGINE_GROUP_MEDIA_ENCODE_SINGLE: 0%
ZES_ENGINE_GROUP_COPY_SINGLE: 0%
ZES_ENGINE_GROUP_MEDIA_ENHANCEMENT_SINGLE: 0%
Unable to get process information (ret count 2048): 78000003 (ZE_RESULT_ERROR_UNSUPPORTED_FEATURE)
...

I had mistakenly thought the B580 would have engine groups, so I took the existence of engine groups to mean it was running on the B580. So while zesDeviceProcessesGetState is working correctly on the B580, it is failing on the UHD 770.
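
Since the call is supported on one device and not on the other, a monitor could treat ZE_RESULT_ERROR_UNSUPPORTED_FEATURE as a per-device capability result rather than an error to print on every refresh. A minimal sketch, assuming the numeric values printed in this thread; `monitor_state`, `handle_process_query_result`, and the `SKETCH_` constants are hypothetical names, not part of Level Zero or of the utility discussed here:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Values as printed in this thread; the real definitions live in ze_api.h. */
#define SKETCH_RESULT_SUCCESS             0x00000000u
#define SKETCH_RESULT_UNSUPPORTED_FEATURE 0x78000003u

/* Hypothetical per-device state kept by the monitor between refreshes. */
typedef struct {
    bool process_query_disabled; /* set once the driver reports "unsupported" */
} monitor_state;

/* Returns true when the process list is usable for this refresh. */
bool handle_process_query_result(monitor_state *state, uint32_t ret)
{
    if (ret == SKETCH_RESULT_SUCCESS)
        return true;

    if (ret == SKETCH_RESULT_UNSUPPORTED_FEATURE) {
        /* Report once, then remember that this device cannot answer. */
        if (!state->process_query_disabled) {
            fprintf(stderr, "Process information is unsupported on this device; disabling the query\n");
            state->process_query_disabled = true;
        }
        return false;
    }

    /* Any other code is reported every time, as in the original snippet. */
    fprintf(stderr, "Unable to get process information: %08X\n", (unsigned)ret);
    return false;
}
```

The caller would pass in the return value of zesDeviceProcessesGetState and skip the query for that device once `process_query_disabled` is set.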

When I run the workload on the B580 and monitor it, zesDeviceProcessesGetState shows activity on engine type ZES_ENGINE_TYPE_FLAG_COMPUTE, but zesDeviceEnumEngineGroups does not return any engine groups for the B580. Is there another way to track compute utilization with the B580, or is there a kernel parameter required to turn that on in the Xe driver?

Output while running workload on B580 and monitor its usage:

Device 0: 86800BE2-0000-0000-0300-000000000000
 BDF: 0000:0003:0000:0000
 PCI ID: 8086:E20B
 Subdevices: 0
 Serial Number: unknown
 Board Number: unknown
 Brand Name: unknown
 Model Name: Intel(R) Graphics [0xe20b]
 Vendor Name: Intel(R) Corporation
 Driver Version: 977D4CB66F62C239FD56D33
 Type: GPU
 Is integrated with host: No
 Is a sub-device: No
 Supports error correcting memory: No
 Supports on-demand pauge-faulting: Yes
Device 0: 0 engines found.
INFO: No temperature sensors to monitor.
INFO: No engines to monitor.
       26537 python chat.py                 MEM: 5556486144           SHR: 0                    FLAGS: COMPUTE
       26537 python chat.py                 MEM: 5556486144           SHR: 0                    FLAGS: COMPUTE
       26537 python chat.py                 MEM: 5556486144           SHR: 0                    FLAGS: COMPUTE
...

An oddity is when running the workload on the integrated GPU (i915) the query to the B580 for process stats is showing the process that the i915 driver is using, but with no engine group flags:

Output while UHD 770 is running a workload, and I monitor the B580:

Device 0: 86800BE2-0000-0000-0300-000000000000
 BDF: 0000:0003:0000:0000
 PCI ID: 8086:E20B
 Subdevices: 0
 Serial Number: unknown
 Board Number: unknown
 Brand Name: unknown
 Model Name: Intel(R) Graphics [0xe20b]
 Vendor Name: Intel(R) Corporation
 Driver Version: 977D4CB66F62C239FD56D33
 Type: GPU
 Is integrated with host: No
 Is a sub-device: No
 Supports error correcting memory: No
 Supports on-demand pauge-faulting: Yes
Device 0: 0 engines found.
INFO: No temperature sensors to monitor.
INFO: No engines to monitor.
       23724 python chat.py                 MEM: 3420160              SHR: 0                    FLAGS:

@saik-intel
Contributor

@jketreno we will look into this internally and update you.

@saik-intel
Contributor

When I run the workload on the B580 and monitor it, zesDeviceProcessesGetState shows activity on engine type ZES_ENGINE_TYPE_FLAG_COMPUTE, but zesDeviceEnumEngineGroups does not return any engine groups for the B580. Is there another way to track compute utilization with the B580, or is there a kernel parameter required to turn that on in the Xe driver?

[Sai] The Xe driver upstream patch is in review and waiting to be merged; it will be merged once it is ready. Regarding the other issue you raised for the UHD 770, we see it working, as shown in the log below:

root@DUT6051BMGSVC:/home/gta/level_zero/bin# export ZELLO_SYSMAN_USE_ZESINIT=1; export ZES_ENABLE_SYSMAN=1; export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/gta/level_zero/libs/:/home/gta/level_zero/latest_loader/:/home/gta/level_zero/bin/;
root@DUT6051BMGSVC:/home/gta/level_zero/bin# ./zello_sysman -g
ZES_ENABLE_SYSMAN environment variable Set
Sysman Initialization done via zesInit
 ----  Global Operations tests ----
properties.numSubdevices = 0
properties.serialNumber = unknown
properties.boardNumber = unknown
properties.brandName = Intel(R) Corporation
properties.modelName = Intel(R) UHD Graphics 770
properties.vendorName = Intel(R) Corporation
properties.driverVersion = BABE9C47939376BE4C71D06
properties.core.type = 1
properties.core.vendorId = 32902
properties.core.deviceId = 42880
properties.core.flags = 1
properties.core.coreClockRate = 1650
properties.core.maxHardwareContexts = 65536
properties.core.maxCommandQueuePriority = 0
properties.core.numThreadsPerEU = 7
properties.core.numEUsPerSubslice = 16
properties.core.numSubslicesPerSlice = 2
properties.core.numSlices = 1
properties.core.timerResolution = 52
properties.core.timestampValidBits = 36
properties.core.kernelTimestampValidBits = 32
properties.core.uuid =
134 128 128 167 4 0 0 0 0 2 0 0 0 0 0 0
properties.core.name = Intel(R) UHD Graphics 770
reset status: 0
repair0
 ----  Global Operations tests ----
properties.numSubdevices = 0
properties.serialNumber = unknown
properties.boardNumber = unknown
properties.brandName = Intel(R) Corporation
properties.modelName = Intel(R) Arc(TM) B580 Graphics
properties.vendorName = Intel(R) Corporation
properties.driverVersion = BABE9C47939376BE4C71D06
properties.core.type = 1
properties.core.vendorId = 32902
properties.core.deviceId = 57867
properties.core.flags = 8
properties.core.coreClockRate = 2850
properties.core.maxHardwareContexts = 65536
properties.core.maxCommandQueuePriority = 0
properties.core.numThreadsPerEU = 8
properties.core.numEUsPerSubslice = 8
properties.core.numSubslicesPerSlice = 4
properties.core.numSlices = 5
properties.core.timerResolution = 52
properties.core.timestampValidBits = 64
properties.core.kernelTimestampValidBits = 64
properties.core.uuid =
134 128 11 226 0 0 0 0 3 0 0 0 0 0 0 0
properties.core.name = Intel(R) Arc(TM) B580 Graphics
reset status: 0
repair0

@eero-t
eero-t commented Feb 12, 2025

This looks like the relevant kernel patch series, but it's for the Xe KMD tree, not upstream: https://patchwork.freedesktop.org/series/144408/

@jketreno
Author

[...] Regarding the other issue you raised for the UHD 770, we see it working, as shown in the log below:

root@DUT6051BMGSVC:/home/gta/level_zero/bin#
export ZELLO_SYSMAN_USE_ZESINIT=1;
export ZES_ENABLE_SYSMAN=1;
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/gta/level_zero/libs/:/home/gta/level_zero/latest_loader/:/home/gta/level_zero/bin/;
root@DUT6051BMGSVC:/home/gta/level_zero/bin# ./zello_sysman -g
...

Reproduction of U770 failure

Running Ubuntu Oracular (24.10) with the linux-intel kernel and all other packages updated to latest versions as of 2025-02-20.

Find version of libze-intel-gpu1 on system

$ uname -a
Linux battle-linux 6.11.0-1006-intel #6-Ubuntu SMP PREEMPT_DYNAMIC Thu Jan  9 18:18:10 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
$ apt-cache policy libze-intel-gpu1
libze-intel-gpu1:
  Installed: 24.52.32224.14-1~24.10~ppa2
  Candidate: 24.52.32224.14-1~24.10~ppa2
  Version table:
 *** 24.52.32224.14-1~24.10~ppa2 500
        500 https://ppa.launchpadcontent.net/kobuk-team/intel-graphics/ubuntu oracular/main amd64 Packages
        100 /var/lib/dpkg/status
     24.35.30872.24-1 500
        500 http://us.archive.ubuntu.com/ubuntu oracular/universe amd64 Packages
$ apt-cache policy libze-dev
libze-dev:
  Installed: 1.19.2.0-1076~24.10
  Candidate: 1.19.2.0-1076~24.10
  Version table:
 *** 1.19.2.0-1076~24.10 500
        500 https://ppa.launchpadcontent.net/kobuk-team/intel-graphics/ubuntu oracular/main amd64 Packages
        100 /var/lib/dpkg/status
     1.17.42-1 500
        500 http://us.archive.ubuntu.com/ubuntu oracular/universe amd64 Packages

Get compute-runtime source matching the version of libze-intel-gpu1

git clone https://github.com/intel/compute-runtime.git
cd compute-runtime
git tag  | grep 24.52.32224.14
git checkout 24.52.32224.14
cd level_zero/tools/test/black_box_tests/

Build zello_sysman

g++ -O2 -Wall -o zello_sysman  zello_sysman.cpp -lze_loader -locloc

Test

export ZELLO_SYSMAN_USE_ZESINIT=1
export ZES_ENABLE_SYSMAN=1
./zello_sysman -g

Output:

ZES_ENABLE_SYSMAN environment variable Set
Sysman Initialization done via zesInit
...
[...deleted 0xe20b output...]
...
 ----  Global Operations tests ---- 
properties.numSubdevices = 0
properties.serialNumber = unknown
properties.boardNumber = unknown
properties.brandName = unknown
properties.modelName = Intel(R) UHD Graphics 770
properties.vendorName = Intel(R) Corporation
properties.driverVersion = 7209A40C3CFCD5142354A9F
properties.core.type = 1
properties.core.vendorId = 32902
properties.core.deviceId = 42880
properties.core.flags = 1
properties.core.coreClockRate = 1650
properties.core.maxHardwareContexts = 65536
properties.core.maxCommandQueuePriority = 0
properties.core.numThreadsPerEU = 7
properties.core.numEUsPerSubslice = 16
properties.core.numSubslicesPerSlice = 2
properties.core.numSlices = 1
properties.core.timerResolution = 52
properties.core.timestampValidBits = 36
properties.core.kernelTimestampValidBits = 32
properties.core.uuid = 
134 128 128 167 4 0 0 0 0 2 0 0 0 0 0 0 
properties.core.name = Intel(R) UHD Graphics 770
ZE_RESULT_ERROR_UNSUPPORTED_FEATURE returned by zesDeviceProcessesGetState(device, &count, nullptr): testSysmanGlobalOperations: 1433
ZE_RESULT_ERROR_UNSUPPORTED_FEATURE returned by zesDeviceProcessesGetState(device, &count, processes.data()): testSysmanGlobalOperations: 1435
reset status: 0
repair0

@jketreno
Author

This looks like the relevant kernel patch series, but it's for the Xe KMD tree, not upstream: https://patchwork.freedesktop.org/series/144408/

I see it is failing in the patch tests:

Image

Assuming those errors get fixed, am I correct that the flow will be Xe KMD tree -> DRM next -> DRM -> kernel.org? Or would these go straight to kernel.org as a bug fix to the existing Xe KMD driver? Or might they get picked up in the linux-intel kernel in the Ubuntu intel-graphics PPA?

I'm just trying to figure out if I should abandon trying to get the B580 to work for a few more months while I wait for these patches to land, or if there might be a shorter path. I'm not too keen on rip/replace my system's kernel with one I build from source as I tend to end up with other random system failures anytime I use a tip-of-tree kernel and I'm trying to keep this system as a "production" config vs. a franken-developer config :)

Thanks,
James

@eero-t
eero-t commented Feb 21, 2025

Assuming those errors get fixed, am I correct that the flow will be Xe KMD tree -> DRM next -> DRM -> kernel.org? Or would these go straight to kernel.org as a bug fix to the existing Xe KMD driver?

I'm not a kernel developer, but it's a new feature (for the Xe KMD), and I would think even bug fixes normally go through the DRM integration tree, to make sure they do not break anything.

Or might they get picked up in the linux-intel kernel in the Ubuntu intel-graphics PPA?

I'm not familiar with that. Ubuntu HWE packages are LTS backports of things that have been tested for a few months in the latest non-LTS releases, so those would have quite a lot of delay, but I guess PPAs could include anything. I don't think they would do backporting though, at least not for things like metrics, which do not block using the HW. So either it would be an upstream kernel with the Xe stuff already merged, or a kernel test package from the Xe driver repo.

In the latter case, I personally would rather build test kernels myself. One might be able to fork-lift the latest driver source from the driver integration repo into the distro (HWE) kernel version sources; either the whole driver, or specific source file(s). If you do that, you could notify upstream whether it worked or not (add your Tested-By tag if you tested the patch series).

I'm just trying to figure out if I should abandon trying to get the B580 to work for a few more months while I wait for these patches to land, or if there might be a shorter path. I'm not too keen on rip/replace my system's kernel with one I build from source as I tend to end up with other random system failures anytime I use a tip-of-tree kernel and I'm trying to keep this system as a "production" config vs. a franken-developer config :)

While one could use a stripped distro kernel config for building one's own kernels, to speed up the builds, I'd use the configs as-is (as much as possible) when wanting to make sure everything works as expected. If you do build your own kernel and it fails, it would be good to notify upstream about that, at least about reproducible issues.

@eero-t
eero-t commented Feb 25, 2025

@jketreno
Author
jketreno commented Mar 1, 2025

I downloaded a tarball of the drm-xe-next and I now see temperature sensors reporting:

Device: 8086:E20B (Intel(R) Graphics [0xe20b])
Engines: 0
Temperature Sensors: 3
Processes: 0
Sensor 0: 52C
Sensor 1: 47C
Sensor 2: 52C

However, I'm still not seeing engines. I do see engines with the i915 driver:

Device: 8086:A780 (Intel(R) UHD Graphics 770)
Engines: 7
Temperature Sensors: 0
Processes: 0
...E_GROUP_RENDER_SINGLE [    0%                                                                   ]
...P_MEDIA_DECODE_SINGLE [    0%                                                                   ]
...P_MEDIA_DECODE_SINGLE [    0%                                                                   ]
...P_MEDIA_ENCODE_SINGLE [    0%                                                                   ]
...P_MEDIA_ENCODE_SINGLE [    0%                                                                   ]
...INE_GROUP_COPY_SINGLE [    0%                                                                   ]
...IA_ENHANCEMENT_SINGLE [    0%                                                                   ]

Are there changes in L0 needed to use the added perf interface on the Xe driver?

I'm still seeing the error with L0 Sysman when I call zesDeviceProcessesGetState() on the i915 driver:

Unable to get process information (ret 78000003): ZE_RESULT_ERROR_UNSUPPORTED_FEATURE

@eero-t
eero-t commented Mar 3, 2025

I downloaded a tarball of the drm-xe-next and I now see temperature sensors reporting:

Btw. I would suggest using a tagged version of that, if you are not.

AFAIK integration trees are rebased to the upstream kernel regularly, and I would suggest using something that's rebased on (or close to) an upstream release version, not something rebased e.g. on an rc1 release (where most of the new changes for the next kernel release are merged, and not yet stabilized).

Are there changes in L0 needed to use the added perf interface on the Xe driver?

Does perf list the GPU "perf" events for it?

I'm still seeing the error with L0 Sysman when I call zesDeviceProcessesGetState() on the i915 driver:

Unable to get process information (ret 78000003): ZE_RESULT_ERROR_UNSUPPORTED_FEATURE

Did you mean Xe driver, not i915 one?

@jketreno
Author
jketreno commented Mar 3, 2025

I downloaded a tarball of the drm-xe-next and I now see temperature sensors reporting:

Btw. I would suggest using a tagged version of that, if you are not.

AFAIK integration trees are rebased to the upstream kernel regularly, and I would suggest using something that's rebased on (or close to) an upstream release version, not something rebased e.g. on an rc1 release (where most of the new changes for the next kernel release are merged, and not yet stabilized).

I'll keep my eye open for such a tree; while the tree I used did start showing temperature sensors via L0, docker no longer worked so I had to revert back to the linux-intel kernel from the Canonical PPA.

Are there changes in L0 needed to use the added perf interface on the Xe driver?

Does perf list the GPU "perf" events for it?

I wasn't sure how to use perf to list the engines. I'm not booted into the drm-xe-next kernel, so I can't run that right now.

As a side note, sensors does list some sensors for the B580:

$ sensors xe-pci-0300
xe-pci-0300
Adapter: PCI adapter
card:             N/A  (max =   0.00 W)
pkg:              N/A  (max =   0.00 W, crit = 420.00 W)
card:         20.75 kJ
pkg:          10.11 kJ

I haven't added code to my utility to query zesDeviceEnumPowerDomains yet; I'm hopeful I'll see similar details from L0 once I do.

I'm still seeing the error with L0 Sysman when I call zesDeviceProcessesGetState() on the i915 driver:

Unable to get process information (ret 78000003): ZE_RESULT_ERROR_UNSUPPORTED_FEATURE

Did you mean Xe driver, not i915 one?

For the UHD 770 (integrated Raptor Lake graphics) it is using the i915 kernel driver, which gives the ZE_RESULT_ERROR_UNSUPPORTED_FEATURE when calling zesDeviceProcessesGetState:

$ lspci -k | grep -EA3 'VGA|3D'
00:02.0 VGA compatible controller: Intel Corporation Raptor Lake-S GT1 [UHD Graphics 770] (rev 04)
        DeviceName: Onboard IGD
        Subsystem: ASUSTeK Computer Inc. Device 8882
        Kernel driver in use: i915
--
03:00.0 VGA compatible controller: Intel Corporation Battlemage G21 [Intel Graphics]
        Subsystem: ASRock Incorporation Device 6020
        Kernel driver in use: xe
        Kernel modules: xe

If I block i915 from being used and instead have the U770 use the Xe driver, it doesn't list engines (with the linux-intel kernel); however, it can report the processes using the GPU:

$ ze-monitor --device 
Device: 8086:A780 (Intel(R) UHD Graphics 770)
Engines: 0
Temperature Sensors: 0
Processes: 6
1 /sbin/init splash  MEM: 0 SHR: 0 FLAGS: RENDER
1606 /usr/lib/systemd/systemd-logind  MEM: 0 SHR: 0 FLAGS: RENDER
5164 /usr/bin/gnome-shell  MEM: 0 SHR: 0 FLAGS: RENDER
5237 /usr/bin/Xwayland :1024 -rootless -noreset -accessx -core -auth /run/user/120/.mutter-Xwaylanda
5552 /usr/libexec/mutter-x11-frames  MEM: 0 SHR: 0 FLAGS:
5566 /usr/libexec/ibus-x11 --kill-daemon  MEM: 0 SHR: 0 FLAGS:

This is the URL for the little utility I'm building:

ze-monitor

Cheers,
James

@eero-t
eero-t commented Mar 4, 2025

I wasn't sure how to use perf to list the engines. I'm not booted into the drm-xe-next kernel, so I can't run that right now.

The patch series shows examples of that:

Btw. Looking at the first commits in the series, you also need a new enough GuC version:

@jketreno
Author
jketreno commented Apr 14, 2025

An update for anyone who finds this in the future: I just built the kernel.org 6.15-rc2 kernel from source and installed the latest GuC and HuC firmware from linux-firmware.git (prior to updating GuC, dmesg reported that engine activity required a newer version of GuC):

cd /lib/firmware/xe
sudo wget -O bmg_guc_70.bin https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/plain/xe/bmg_guc_70.bin
sudo wget -O bmg_huc.bin https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/plain/xe/bmg_huc.bin
sudo update-initramfs -u

After a reboot:

$ sudo dmesg | grep -i guc
[    3.348002] xe 0000:03:00.0: [drm] Using GuC firmware from xe/bmg_guc_70.bin version 70.44.1
[    3.358922] xe 0000:03:00.0: [drm] Using GuC firmware from xe/bmg_guc_70.bin version 70.44.1
[    3.551883] i915 0000:00:02.0: [drm] GT0: GuC firmware i915/tgl_guc_70.bin version 70.29.2
[    3.555327] i915 0000:00:02.0: [drm] GT0: GUC: submission enabled
[    3.555328] i915 0000:00:02.0: [drm] GT0: GUC: SLPC enabled
[    3.555656] i915 0000:00:02.0: [drm] GT0: GUC: RC enabled
$ uname -r
6.15.0-rc2

While perf seems to list the counters, they are not being surfaced from L0 using libze-intel-gpu1 25.05.32567.18 for the B580 via Xe. The U770 still reports engines through i915:

sudo ./perf list | grep 'xe.*engine'
  xe_0000_03_00.0/engine-active-ticks/               [Kernel PMU event]
  xe_0000_03_00.0/engine-total-ticks/                [Kernel PMU event]
  xe_0000_03_00.0/gt=0..15,event=0..0xfff,engine_class=0..255,.../modifier[Raw event descriptor]
  xe:xe_guc_engine_activity                          [Tracepoint event]
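
The two counters listed above suggest that utilization could be derived as the ratio of their deltas between two reads. A minimal sketch under that assumption (the meaning of engine-active-ticks and engine-total-ticks is inferred from their names; `engine_ticks_sample` and `engine_busy_percent` are hypothetical, and reading the counters themselves is not shown):

```c
#include <stdint.h>

/* One read of the two xe PMU counters for a single engine (hypothetical container). */
typedef struct {
    uint64_t active_ticks; /* value of engine-active-ticks */
    uint64_t total_ticks;  /* value of engine-total-ticks */
} engine_ticks_sample;

/* Busy percentage over the interval between two samples. */
double engine_busy_percent(engine_ticks_sample prev, engine_ticks_sample curr)
{
    uint64_t total = curr.total_ticks - prev.total_ticks;
    uint64_t active = curr.active_ticks - prev.active_ticks;

    if (total == 0)
        return 0.0; /* no time elapsed between samples; avoid dividing by zero */

    return 100.0 * (double)active / (double)total;
}
```

This mirrors how a percentage is usually derived from a pair of monotonically increasing counters; whether L0 Sysman ends up exposing the Xe counters the same way is outside what this sketch shows.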

Minimal test case I'm using:

cat << EOF > engines.c
#include <level_zero/ze_api.h>  // for _ze_result_t, ze_result_t, ZE_MAX_DE...
#include <level_zero/zes_api.h> // for zes_device_handle_t, _zes_structure_...
#include <stdio.h>              // for printf, fprintf, stderr, size_t, NULL
#include <stdlib.h>             // for free, calloc

#define nullptr NULL

// gcc -o engines engines.c -I/usr/include/ze -lze_loader
int main() {
  if (zesInit(0) != ZE_RESULT_SUCCESS)
  {
      printf("Can't initialize the API\n");
      return -1;
  }

  // Discover all the drivers
  uint32_t driversCount = 0;
  zesDriverGet(&driversCount, nullptr);

  if (driversCount == 0)
  {
      fprintf(stderr, "No ze sysman drivers found.\n");
      return 1;
  }

  printf("%d drivers found\n", driversCount);

  // Allocate memory for driver handles
  zes_driver_handle_t *drivers = (zes_driver_handle_t *)calloc(driversCount, sizeof(zes_driver_handle_t));
  
  if (!drivers)
  {
      fprintf(stderr, "Memory allocation failed for drivers\n");
      return 1;
  }

  zesDriverGet(&driversCount, drivers);

  for (uint32_t driver = 0; driver < driversCount; ++driver)
  {
      // Discover devices in a driver
      uint32_t deviceCount = 0;
      zesDeviceGet(drivers[driver], &deviceCount, nullptr);
      if (deviceCount == 0)
      {
          printf("Driver %i:\n  No devices found\n", driver);
          continue;
      }

      printf("Driver %i:\n  %d devices found\n", driver, deviceCount);

      // Allocate memory for device handles
      zes_device_handle_t * deviceHandles = (zes_device_handle_t *)calloc(deviceCount, sizeof(zes_device_handle_t));
      if (!deviceHandles)
      {
          printf("Memory allocation failed for devices\n");
          return 1;
      }

      zesDeviceGet(drivers[driver], &deviceCount, deviceHandles);

      // Walk through each device and get properties
      for (uint32_t device = 0; device < deviceCount; ++device)
      {
        uint32_t count = 0;
        ze_result_t result;
        zes_device_properties_t deviceProperties = {
          .stype = ZES_STRUCTURE_TYPE_DEVICE_PROPERTIES,
        };

        result = zesDeviceGetProperties(deviceHandles[device], &deviceProperties);
        if (result != ZE_RESULT_SUCCESS)
        {
            fprintf(stderr, "zesDeviceGetProperties failed: %08x\n", result);
            return 1;
        }
        printf("Device %d: %04X:%04X (%s)\n",
               device,
               deviceProperties.core.vendorId,
               deviceProperties.core.deviceId,
               deviceProperties.modelName);
        result = zesDeviceEnumEngineGroups(deviceHandles[device], &count, nullptr);
        if (result != ZE_RESULT_SUCCESS)
        {
            fprintf(stderr, "Failed to enumerate engine groups: %08x\n", result);
            return 1;
        }
    
        printf("Device %d: %d engine groups found\n", device, count);
        if (count > 0)
        {
            zes_engine_handle_t *engineHandles = (zes_engine_handle_t *)calloc(count, sizeof(zes_engine_handle_t));
    
            result = zesDeviceEnumEngineGroups(deviceHandles[device], &count, engineHandles);
            if (result != ZE_RESULT_SUCCESS)
            {
                fprintf(stderr, "Failed to retrieve engine groups: %08x\n", result);
                return 1;
            }
            for (size_t i = 0; i < count; ++i) 
            {
                zes_engine_properties_t engineProperties = {
                    .stype = ZES_STRUCTURE_TYPE_ENGINE_PROPERTIES,
                };
                result = zesEngineGetProperties(engineHandles[i], &engineProperties);
                if (result != ZE_RESULT_SUCCESS)
                {
                    fprintf(stderr, "Failed to retrieve engine properties: %08x\n", result);
                    return 1;
                }
    
                printf("Engine %zu: Type %d\n", i, engineProperties.type);
            }
        }
      }
  }

  return 0;
}
EOF

Then compile engines.c and run it via sudo:

gcc -o engines engines.c -I/usr/include/ze -lze_loader
sudo ./engines 
1 drivers found
Driver 0:
  2 devices found
Device 0: 8086:E20B (Intel(R) Arc(TM) B580 Graphics)
Device 0: 0 engine groups found
Device 1: 8086:A780 (Intel(R) UHD Graphics 770)
Device 1: 7 engine groups found
Engine 0: Type 5
Engine 1: Type 6
Engine 2: Type 6
Engine 3: Type 7
Engine 4: Type 7
Engine 5: Type 8
Engine 6: Type 9

Note the '0 engine groups found' for the B580 (device 0):

Device 0: 8086:E20B (Intel(R) Arc(TM) B580 Graphics)
Device 0: 0 engine groups found
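
For readers comparing the numeric types above with the names printed earlier in this thread for the same UHD 770, a small lookup can make the output easier to read. This is only a sketch: the mapping for 5 through 9 is inferred by lining up the two outputs in this thread, the helper name `engine_group_name` is hypothetical, and the values should be checked against `zes_engine_group_t` in zes_api.h:

```c
/* Map the numeric engine group types seen in this thread to their names.
 * Only the values that appear in the outputs above are listed. */
const char *engine_group_name(int type)
{
    switch (type) {
    case 5: return "ZES_ENGINE_GROUP_RENDER_SINGLE";
    case 6: return "ZES_ENGINE_GROUP_MEDIA_DECODE_SINGLE";
    case 7: return "ZES_ENGINE_GROUP_MEDIA_ENCODE_SINGLE";
    case 8: return "ZES_ENGINE_GROUP_COPY_SINGLE";
    case 9: return "ZES_ENGINE_GROUP_MEDIA_ENHANCEMENT_SINGLE";
    default: return "unknown"; /* not listed in this sketch */
    }
}
```

In the test program, the last printf could then print `engine_group_name(engineProperties.type)` alongside the number.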

Am I missing something?

Thanks,
James

@eero-t
eero-t commented Apr 24, 2025

Looking at the test code, you need to set the .pNext members either to NULL or to a valid address, see e.g.:

(You could use calloc() instead of malloc().)

@jketreno
Author

Thanks for taking the time to look through the example; it was a simplified minimal quick case I put together to show the problem -- I've updated the code to use calloc to initialize all the values to 0. The problem persists, however -- engine groups are found for the U770 but not for the B580.

@eero-t
eero-t commented Apr 25, 2025

Properties struct definitions still have undefined .pNext values (may cause a crash or weird behavior).

@jketreno
Author

Those structures are using designated initializers to initialize the structures to 0 except for the fields explicitly set to a value.

@eero-t
eero-t commented Apr 25, 2025

Ah, right, when at least one member is initialized, C99+ initializes the rest to zero/NULL.

And if perf sees the engine activity counters, then FW + Xe KMD should be fine, and the problem is indeed on the UMD side, i.e. this project/repo.
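
The zero-initialization point above can be checked in isolation. A minimal sketch, using a hypothetical `sketch_properties` struct that only imitates the stype/pNext layout idea of the zes_*_properties_t structs (it is not a Level Zero type):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in: a type tag, an extension pointer, and one payload field. */
typedef struct {
    uint32_t stype;
    void *pNext;
    uint32_t numSubdevices;
} sketch_properties;

sketch_properties make_properties(void)
{
    /* Only .stype is named; C99 and later implicitly initialize the
     * remaining members as if they had static storage duration,
     * i.e. pNext becomes NULL and numSubdevices becomes 0. */
    sketch_properties p = {
        .stype = 1,
    };
    return p;
}
```

A plain declaration without any initializer (`sketch_properties p;`) gives no such guarantee for an automatic variable, which is the case the earlier comments were worried about.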

@jketreno
Author
jketreno commented May 8, 2025

Since this issue was originally titled about the ZE_RESULT_ERROR_UNSUPPORTED_FEATURE with the U770, and the engine activity issue on the B580 was discovered while looking into that topic, should I re-file a new issue that covers that part?

@eero-t
eero-t commented May 8, 2025

Yes, separate tickets for separate issues please.
