8000 Swarm gets blocked/deadlocked during tests · Issue #96 · bitwalker/swarm · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content
Swarm gets blocked/deadlocked during tests #96
Open
@sashaafm

Description

@sashaafm

Hello @bitwalker, we've been using Swarm for some time now and recently we started having a problem where our test suite seems to get blocked or deadlocked when it runs. It always get stuck at at the same point, printing the same logs and getting blocked there.

Below are the logs which are printed before the suite getting stuck:

[warn] [swarm on primary@127.0.0.1] [tracker:start_pid_remotely] failed to start {:tracker, "user_id", "acc_id"} on nonode@nohost: nodedown, retrying operation.. pid=<0.3704.0> application=swarm module=Swarm.Logger function=warn/1 file=/presence-api/deps/swarm/lib/swarm/logger.ex line=15 
[error] [swarm on primary@127.0.0.1] [tracker:start_pid_remotely] ** (ErlangError) Erlang error: {:function_clause, {:gen_statem, :call, [{Swarm.Tracker, {:error, {:invalid_ring, :no_nodes}}}, {:track, {:tracker, "user_id", "acc_id"}, %{mfa: {Presence.Users.Swarm, :start, [{:user, "user_id", "acc_id"}, {:controller, "user_id", "acc_id"}, {:user_supervisor, "user_id", "acc_id"}, {:tracker, "user_id", "acc_id"}, %Presence.Models.User{account_id: "acc_id", user_id: "user_id"}]}}}, :infinity]}}
    (stdlib) gen.erl:253: :gen.do_for_proc({Swarm.Tracker, {:error, {:invalid_ring, :no_nodes}}}, #Function<0.94381411/1 in :gen.call/4>)
    (stdlib) gen_statem.erl:591: :gen_statem.call_dirty/4
    (swarm) lib/swarm/tracker/tracker.ex:1115: Swarm.Tracker.start_pid_remotely/6
    (elixir) lib/task/supervised.ex:88: Task.Supervised.do_apply/2
    (stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3
 pid=<0.3704.0> application=swarm module=Swarm.Logger function=error/1 file=/presence-api/deps/swarm/lib/swarm/logger.ex line=27 
[error] Task #PID<0.3704.0> started from Swarm.Tracker terminating
** (Protocol.UndefinedError) protocol String.Chars not implemented for {:error, {:invalid_ring, :no_nodes}}. This protocol is implemented for: Atom, BitString, Date, DateTime, Float, Integer, List, NaiveDateTime, Time, URI, Version, Version.Requirement
    (elixir) /usr/local/src/elixir/lib/elixir/lib/string/chars.ex:3: String.Chars.impl_for!/1
    (elixir) /usr/local/src/elixir/lib/elixir/lib/string/chars.ex:22: String.Chars.to_string/1
    (swarm) lib/swarm/tracker/tracker.ex:1152: Swarm.Tracker.start_pid_remotely/6
    (elixir) lib/task/supervised.ex:88: Task.Supervised.do_apply/2
    (stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3
Function: #Function<1.98420104/0 in Swarm.Tracker.do_track/2>
    Args: [] pid=<0.3704.0> error_logger=format 

So far we've got no idea why it is getting stuck. Even worse, I had a hunch about what could be causing the problem. I altered the code I thought was problematic and the suite no longer got stuck (locally, running in a Docker container). However, the same problem seems to persist in our CI pipeline, which runs the same exact Docker image and test script. I've even executed the tests with the --trace flag, which serialises all tests and should prevent deadlocks (I think).

I'm not sure if this is a bug in Swarm, or how we should proceed to try to dig deeper. I also find it quite odd that ExUnit isn't timing out... Maybe it's getting stuck between test cases?

Also what could cause such behaviour?

Metadata

Metadata

Assignees

No one assigned

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions

      0