8000 Various of fixes for the HeadlessMode and more by rouming · Pull Request #8011 · diasurgical/DevilutionX · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Various of fixes for the HeadlessMode and more #8011

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

rouming
Copy link
Contributor
@rouming rouming commented May 23, 2025

This PR addresses various crashes in HeadlessMode and lost keystrokes caused by the game menu or incorrect event polling. These issues were observed during AI agent training, and commits targeting them are part of the RFC #7974 PR. I decided to split the original RFC on several PRs because it is unnecessarily complex and has an unclear status. This PR includes the following commits for easier review:

  • Ensured keystrokes are not dropped in game menu or during progress events:
  • ShowProgress now preserves non-custom SDL events, ensuring no events are lost
  • Fixed crashes in HeadlessMode by avoiding SDL or audio access
  • Resolved a GCC warning in logging utility by ensuring a return value in all cases

@rouming rouming force-pushed the part1--port-ai-patches branch 2 times, most recently from 992b4c8 to d26c399 Compare May 23, 2025 17:18
@rouming
Copy link
Contributor Author
rouming commented May 23, 2025

It seems my SDL event processing fix needs some massagging for SDL1. I'm looking into the SDL1.2 SDL_PeepEvents() implementation, which doesn't support an event type range but instead uses a 32-bit mask, and I am confused about one thing: the code clearly states that you can only have 8 custom events (here), but interfac.h defines 13 events. According to the mask check here you will never get any event picked up from the queue due to the 32-bit overflow. Therefore, all custom events starting from WM_DIABRETOWN should not be ever processed for the SDL1.2, but apparently this is not true, right? What am I missing here?

@glebm
Copy link
Collaborator
glebm commented May 23, 2025

but apparently this is not true, right? What am I missing here?

Probably that there is no actual limitation as it's just a 32-bit integer in the end of the day, and we don't use SDL_PeepEvents anywhere currently. The masked query for SDL_PeepEvents won't work because SDL_EVENTMASK will overflow.

Let's reduce the number of custom events by putting the actual event identifier into event.user.code? This will reduce the number of custom events to just 1.

@rouming
Copy link
Contributor Author
rouming commented May 24, 2025

but apparently this is not true, right? What am I missing here?

Probably that there is no actual limitation as it's just a 32-bit integer in the end of the day, and we don't use SLD_PeepEvents anywhere currently. The masked query for SDL_PeepEvents won't work because SDL_EVENTMASK will overflow.

SDL_PollEvents calls SDL_PeepEvents with a mask set to SDL_ALLEVENTS (which is (unsigned)-1) (here), so current code should be affected. Therefore not clear how that works for events with types >= 32.

Let's reduce the number of custom events by putting the actual event identifier into event.user.code? This will reduce the number of custom events to just 1.

Right, that what I thought should be used for sdl 1.2, but there is no such code in DevilutionX, thus my confusion.

I'll try to make a few tests in order to prove / disprove that events with types >= 32 can be missed on the sdl 1.2 before going deeper in these changes.

@rouming
Copy link
Contributor Author
rouming commented May 24, 2025

Oh, it's even worse, than just missing events. C17 standard (I assume same for all others, including CPP) says the following for bit shift:

6.5.7 Bitwise shift operators

If the value of the right operand is negative or is greater than or equal to the width of the promoted left operand, the behavior is undefined.

And here we are "lucky" that at least on x86 shl instruction (left shift) behaves as cnt mod n, where cnt is number of bits to shift and n is a width of the operand. For example from here

Also, Intel's manual[1] states that the results are undefined when cnt is greater than the operand size, but at least for 32- and 64-bit data sizes it has been observed that shift operations are performed by (cnt mod n), with n being the data size.

So on x86 regardless of bits count the value is wrapped:

#include <assert.h>

unsigned shift(unsigned bits)
{
	return 1<<bits;
}

int main(int argc, char *argv[])
{
	assert(shift(33) == (1<<(33 % 32)));
	return 0;
}

But on other platforms this is undefined and if 0 is returned (as I expected from the very beginning), then we get missing events on SDL 1.2

@StephenCWills
Copy link
Member

It might help explain this.

#if (defined(__APPLE__) || defined(__3DS__)) && defined(USE_SDL1)
// On Tiger PPC, SDL_PushEvent from a background thread appears to do nothing.
#define SDL_PUSH_EVENT_BG_THREAD_WORKS 0
#else
#define SDL_PUSH_EVENT_BG_THREAD_WORKS 1
#endif

We should just fix it as glebm described, and then I can test on 3DS.

@rouming
Copy link
Contributor Author
rouming commented May 24, 2025

It might help explain this.

#if (defined(__APPLE__) || defined(__3DS__)) && defined(USE_SDL1)
// On Tiger PPC, SDL_PushEvent from a background thread appears to do nothing.
#define SDL_PUSH_EVENT_BG_THREAD_WORKS 0
#else
#define SDL_PUSH_EVENT_BG_THREAD_WORKS 1
#endif

Great there is an issue which can be explained by this. I was scratching my head, thinking it couldn't go unnoticed.

We should just fix it as glebm described, and then I can test on 3DS.

Awesome. I will rebase then on top of your PR.

@rouming rouming force-pushed the part1--port-ai-patches branch from d26c399 to 66921e0 Compare May 25, 2025 10:59
@rouming
Copy link
Contributor Author
rouming commented May 25, 2025
8000

@StephenCWills @glebm thanks folks for helping me resolve the SDL1.2 issue. Now with only one type of custom event SDL_PeepEvents() logic becomes simpler.

Am I correct that there is no full ctest run on the github for the USE_SDL1 case?

@glebm
Copy link
Collaborator
glebm commented May 25, 2025

There isn't one but it'd be good to add it to the existing SDL1 Linux workflow

@rouming
Copy link
Contributor Author
rouming commented May 30, 2025

Hey folks, just a quick reminder to review this PR when you get a chance! Thanks.

glebm
glebm previously approved these changes May 30, 2025
@rouming
Copy link
Contributor Author
rouming commented Jun 5, 2025

Anything else to address from my side or this can be merged?

@glebm
Copy link
Collaborator
glebm commented Jun 5, 2025

Looks good to me, just waiting for another maintainer to also have a look and merge! /cc @StephenCWills @AJenbo

@rouming
Copy link
Contributor Author
rouming commented Jun 5, 2025

Thanks @glebm for reviewing that.

Comment on lines 676 to 679
// We call `CustonEventHandler.poll` directly instead of
// calling the `FetchMessage` to process real events rather
// than the recorded ones in demo mode.
while (CurrentEventHandler.poll(&event)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you have a look at #7712? It appears to me that you've replaced every call to PollEvent() with SDL_PollEvent(), which I expect will reintroduce this bug for a second time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I accidentally replaced PollEvent() here and in FetchMessage_Real() with SDL_PollEvent or SDL_PeepEvent(). I missed that PollEvent() also handles controller states and motion by calling:

UnlockControllerState(*event);
ProcessControllerMotion(*event);

Good catch, thanks.

I updated the code by splitting the commit in question on two for simpler review:

  • The first one splits the EventHandler into the actual handler and poller callback. This commit should not change the logic of event processing and is just preparation for the next commit. The PollEvent() calls from input.h should be kept as they are.
  • The second commit introduces SDL_PeepEvent() instead of SDL_PollEvent() for all the message handlers which are called from the ShowProgress() routine. This is actually a fix for the problem described.

The main idea is that ShowProgress() handles only custom events and SDL_QUIT, leaving other events in the queue for future processing. PollEvent() is replaced with PollEventCustom(), which performs the same input processing but uses the SDL_PeepEvent() call for polling.

Please, take a look.

Copy link
Collaborator
@glebm glebm Jun 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we want to leave events in the queue rather than process and ignore most of them? Wouldn't it be odd if I pressed some keys in the loading screen and suddenly they got processed all at once once I get in-game?

Perhaps we only want to leave certain types of events in the queue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I encountered a nasty problem of lost keystroke events when a hero warps to another location (I assume there are other, not only warping). In my case these were save/load/pause keystrokes, which were dropped while being processed in the ShowProgress() routine.

Unfortunately once the event injected to the queue, I don't have a clue from outside, was a keystroke dropped or not yet processed. This breaks a lot of things in the training procedure and lead to an agent hang waiting for the key to be completed.

Wouldn't it be odd if I pressed some keys in the loading screen and suddenly they got processed all at once once I get in-game?

Well, this is of course arguable, but this is the behavior I expect. To be frank, I do not see this would really break the human experience - the game load time is not as it was in 90th - almost instance, I doubt anyone will really notice anything unusual if any prior event executes after a hero warps to another level. But a lost key breaks the automation, that's for sure.

Copy link
Collaborator
@glebm glebm Jun 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps automation should not be sending keystrokes during loading?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once any SDL event lands in the queue between steps 2 and 3, it will be lost. This logic is demonstrated by my snippet.

You're right that human input is asynchronous and delayed at many stages from perception to motor action to event dispatch. But regardless of a delay along the way, the final destination of the event is a queue, and what matters is the actual moment when an event added to the queue.

No, I maintain that the moment when an event is queued does not matter. Rather, what I mean is, it is completely irrelevant to the human experience. The queue merely dictates the order in which those events are processed, nothing more. This ordering of events does not promise anything about the order in which external actions actually occurred except that it is accurate "within reason", mostly defined by how accurately a human can perceive the ordering of the physical actions that trigger those events.

To illustrate what I mean, consider that what we are talking about is a race condition between a key press and an in-game event. The former is triggered by physical action, and the moment it occurs is perceived by the user based on the tactile and auditory feedback they get from pressing the key on the keyboard. The latter is subject to the game's frame rate and CPU speed, and the moment that it occurs is perceived by the user through processing the visual signals coming from the display. These physical processes are completely independent, and the "true order in which they occurred" cannot be properly perceived by a human. That is why the human player can't tell the difference between a key press that gets queued between 2 and 3 and a keypress that gets queued after 3. Whether we change the behavior of the logic that processes events from the queue or not, the human experience does not change.

So the problem does affect human players, just not consistently or predictably, which arguably makes it worse, since these bugs are the hardest to diagnose. That's why I believe it's worth addressing.

I don't know what bugs you're talking about. Processing events in the order in which they are queued is not a bug. That's how the event queue is supposed to work.

My agent experienced the problem with lost events and hanging in infinite wait. This PR ensures that no events are lost, and extensive RL training (I usually run up to a hundred instsances simultaneously on a machine, which of course increases the chance to hit an issue) does not reproduce an issue for months.

It is asynchronous, and the only guarantee needed is that no events are lost, which this PR addresses.

If it is asynchronous, then there can be no guarantee. Even if you change the cutoff point from 2 to 3, the bot has no guarantee that the event it sent before it observed 3 actually made it to the queue before the moment 3 actually occurred. It sounds to me like you'd be better off implementing a timeout based on feedback from the game so the bot can recognize, the same way a human player would, that the action it intended to execute wasn't executed.

I'm not sure I follow...

What I was saying was that the logic you are using to determine whether the key press should or should not be processed is not generally applicable to all events. Ignore whether I think that the behavior regarding the save action or the ESC key is actually a bug. Whether you are able to show that there exists a bug with a particular input event does not and cannot prove that there exists a problem for all input events. There is no clear conceptual problem that requires us to process any input event in a different order from the order in which it is queued. All events are equal in the eyes of the event queue.

If you run this on the latest master, you should see something like the following https://youtu.be/90Q2Txev9c0. The player warps and continues moving unless another mouse button is pressed.

I don't suppose you've tried changing the order of those events? I ran some tests just now, and it doesn't seem to matter whether the SDL_RELEASED event is processed before or after the custom event. Try running this code. The bug you've found has nothing to do with whether the engine "discards" the mouse up event.

--- a/Source/player.cpp
+++ b/Source/player.cpp
@@ -2860,12 +2862,34 @@ StartNewLvl(Player &player, interface_mode fom, int lvl)
        if (&player == MyPlayer) {
                player._pmode = PM_NEWLVL;
                player._pInvincible = true;
+
                SDL_Event event;
+
+               // Let's imagine the MOUSE BUTTON DOWN event lands in the
+               // queue right before the custom event was pushed.
+               event.type = SDL_MOUSEBUTTONDOWN;
+               event.button.x = MousePosition.x;
+               event.button.y = MousePosition.y;
+               event.button.button = SDL_BUTTON_LEFT;
+               event.button.state = SDL_PRESSED;
+               SDL_PushEvent(&event);
+
+               // and MOUSE BUTTON UP event lands in the queue also right before
+               // the custom event was pushed. This event will not be discarded
+               // by the `ShowProgress()` filter yet the player never stops
+               // moving until next MOUSE BUTTON sequence or when they reach their destination.
+               event.type = SDL_MOUSEBUTTONUP;
+               event.button.button = SDL_BUTTON_LEFT;
+               event.button.state = SDL_RELEASED;
+               SDL_PushEvent(&event);
+
                CustomEventToSdlEvent(event, fom);
                SDL_PushEvent(&event);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is why the human player can't tell the difference between a key press that gets queued between 2 and 3 and a keypress that gets queued after 3. Whether we change the behavior of the logic that processes events from the queue or not, the human experience does not change.

Absolutely agree - there's no visual difference for a human to spot regarding when an event is executed. But if a human presses ESC (or any other key action to your choice) expecting the game to show a menu and this doesn't happen, then that's an issue. Of course, we could speculate that the key was pressed almost simultaneously with the moment when a loading screen should show up, but in my opinion, this isn't proper behavior for an event queue. Lost events are a nightmare to debug and fix.

I don't know what bugs you're talking about.

Fair enough. It’s a weird edge case and not easy to catch. I guess it just comes down to different styles: some people are more reactive, others more proactive about fixing stuff.

Processing events in the order in which they are queued is not a bug. That's how the event queue is supposed to work.

Correct, the order is not an issue, but discarding events is.

If it is asynchronous, then there can be no guarantee. Even if you change the cutoff point from 2 to 3, the bot has no guarantee that the event it sent before it observed 3 actually made it to the queue before the moment 3 actually occurred. It sounds to me like you'd be better off implementing a timeout based on feedback from the game so the bot can recognize, the same way a human player would, that the action it intended to execute wasn't executed.

The agent does not concern itself with when an event is executed relative to the queue, this is out of the scope of this discussion. What the agent expects is that the event eventually reaches completion, regardless of event position in the queue. Unfortunately, with the current queue design, this is not guaranteed, thus this PR.

I don't suppose you've tried changing the order of those events?

No, I have not.

I ran some tests just now, and it doesn't seem to matter whether the SDL_RELEASED event is processed before or after the custom event. Try running this code.

Interesting. That's seems to me a regression, I was able to bisect to this one:

3e6b501 ("Fix pathfinding and increase player path limit")

Cc: @glebm

The bug you've found has nothing to do with whether the engine "discards" the mouse up event.

Ooops. But still, the bug I've found doesn't prove the opposite, that discard won't happen, right? :) And peeking into the queue to see whether the UP event exists or not won't be difficult.

Copy link
Contributor Author
@rouming rouming Jun 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A mouse DOWN button event in the queue before warp leads to MakePlrPath being called after a warp with the following arguments (in my quick test): pos.future=(25,31), target=(82,44), where pos.future belongs to the current dungeon (town), but the target "leaked" from the dungeon that the player just warped from.

The commit 3e6b501 I mentioned in the previous comment increases the walk path, which is why it is much easier to reproduce a spontaneous walk after a warp (before the commit, 25 steps were not enough and a player would stay put), but the commit does not cause the described issue. Sorry, @glebm, for the noise.

To my limited understanding, there might need to be some logic which resets the walk destination variable after a warp. Anyway, I leave this to devilutionx experts.

Copy link
Member
@StephenCWills StephenCWills Jun 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course, we could speculate that the key was pressed almost simultaneously with the moment when a loading screen should show up, but in my opinion, this isn't proper behavior for an event queue. Lost events are a nightmare to debug and fix.

This is not speculation. That's literally the thing you're describing. The event has to be queued in the same "display frame" as the custom event is queued, otherwise they would be processed in separate iterations of the main loop. If they are processed in separate iterations of the main loop, then there is no choice; you have to process them in the order they appear in the queue. Even if they are not processed in separate iterations of the main loop, you may as well process them in the order they appear in the queue, because there is no difference as far as the engine is concerned.

Correct, the order is not an issue, but discarding events is.

The agent does not concern itself with when an event is executed relative to the queue, this is out of the scope of this discussion. What the agent expects is that the event eventually reaches completion, regardless of event position in the queue. Unfortunately, with the current queue design, this is not guaranteed, thus this PR.

Perhaps I haven't been clear enough about this. No events are actually "lost" or "discarded" in the process of transitioning to the loading screen. The mouse up event that you claim was lost is actually being processed by the ProgressEventHandler() function, which invokes the DisableInputEventHandler() function, which in turn sets sgbMouseDown = CLICK_NONE. The only way in which it is "lost" is the fact that it doesn't get processed by the GameEventHandler() function. I am arguing that this is proper because that handler executes in an entirely different context where a mouse up event might open a UI menu or something. It only stands to reason that a mouse up event would work differently on the loading screen, which doesn't have a UI menu to open.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me sum it all up so we can wrap this long discussion and close the PR:

  1. Non-custom events on the loading screen are intentionally dropped to prevent a character from executing "old" events right after a warp.

  2. Even if a non-custom event is dropped just before the loading screen appears, it's not a big deal, because a human player won't notice whether it was dropped right before or during the loading screen.

  3. All event handlers in the engine should be able to handle dropped or partially dropped events (e.g. a key press consists of a "down" and an "up" events, and only the "up" event might get dropped because of point 1, thus "partially dropped" naming). Any bugs (like the one where the character keeps moving after a warp) should be fixed in their own handlers and are probably not caused by point 1 (at least there are no known such issues by now).

  4. Any automation (like Lua scripts, etc.) has its own logic and needs to deal with dropped events (as described in point 1) on its own.

Points 1 and 4 are crucial for my automated agent, which makes me stick to my own branch. That's totally fine with me, I just wanted to clarify things.

Thanks to all of you folks for such a long conversation.

A quite legitimate warning: if FMT_EXCEPTIONS is not defined we should
return something from the function. Let it be an empty string.

Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>
rouming added 4 commits June 6, 2025 12:07
There are various message handlers in the code, including:

  PollEvent()
  FetchMessage()
  FetchMessage_Real()

These functions internally call SDL_PollEvent() and then handle the
received events differently depending on the context.

This commit separates the actual handlers from the poll invocations by
adding an additional `poll` callback parameter. This parameter can be
either SDL_PollEvent() or something else, allowing the caller of the
message handler to decide which events need to be processed.

This commit does not change the logic of event processing. Its main
purpose is to make it possible to replace SDL_PollEvent() with
SDL_PeepEvents() in the next patch.

Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>
`ShowProgress` is called on any custom event, such as when a player
warps to a new level. It is designed to handle only custom events and
discards others from the queue. This behavior results in lost
keystrokes sent by the AI agent. For instance, the issue was
reproduced when attempting to pause the game by sending the PAUSE key,
which was lost due to the behavior of `ShowProgress`.

This change introduces a poller along with the custom handler. For the
`ShowProgress` routine, the poller will peek at only custom events,
leaving others in the queue.

Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>
Patch fixes various crashes and accesses to the audio or SDL
subsystems that were not properly initialized in HeadlessMode:

 * Don't create an SDL window
 * Fully disable audio/sndfx during headless mode
 * Init SDL with SDL_INIT_EVENTS, otherwise an error:
       "The event system has been shut down"

Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>
This patch updates the function `gmenu_presskeys()` to return `true`
if the keys are actually handled. This allows the caller function
`PressKey()` to continue processing keys and not silently drop them,
even when the game menu is active.

Why bother? First of all, there should not be any reasons not to
handle keys once the game menu is open. For example, why not handle
the "Quick Load" shortcut once the player is dead and the menu is
shown on the screen?

The other reason is more practical. In headless mode, once the game is
controlled by a 3rd party application (training of an AI agent, for
example), lost key presses lead to a broken training procedure.
Therefore, even if the menu is raised, it is crucial not to lose a
keystroke.

Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>
@rouming rouming force-pushed the part1--port-ai-patches branch from 3a0afe3 to 1512614 Compare June 6, 2025 10:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants
0