Description
Hello @bitwalker, we've been using Swarm for some time now and recently we started having a problem where our test suite seems to get blocked or deadlocked when it runs. It always get stuck at at the same point, printing the same logs and getting blocked there.
Below are the logs which are printed before the suite getting stuck:
[warn] [swarm on primary@127.0.0.1] [tracker:start_pid_remotely] failed to start {:tracker, "user_id", "acc_id"} on nonode@nohost: nodedown, retrying operation.. pid=<0.3704.0> application=swarm module=Swarm.Logger function=warn/1 file=/presence-api/deps/swarm/lib/swarm/logger.ex line=15
[error] [swarm on primary@127.0.0.1] [tracker:start_pid_remotely] ** (ErlangError) Erlang error: {:function_clause, {:gen_statem, :call, [{Swarm.Tracker, {:error, {:invalid_ring, :no_nodes}}}, {:track, {:tracker, "user_id", "acc_id"}, %{mfa: {Presence.Users.Swarm, :start, [{:user, "user_id", "acc_id"}, {:controller, "user_id", "acc_id"}, {:user_supervisor, "user_id", "acc_id"}, {:tracker, "user_id", "acc_id"}, %Presence.Models.User{account_id: "acc_id", user_id: "user_id"}]}}}, :infinity]}}
(stdlib) gen.erl:253: :gen.do_for_proc({Swarm.Tracker, {:error, {:invalid_ring, :no_nodes}}}, #Function<0.94381411/1 in :gen.call/4>)
(stdlib) gen_statem.erl:591: :gen_statem.call_dirty/4
(swarm) lib/swarm/tracker/tracker.ex:1115: Swarm.Tracker.start_pid_remotely/6
(elixir) lib/task/supervised.ex:88: Task.Supervised.do_apply/2
(stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3
pid=<0.3704.0> application=swarm module=Swarm.Logger function=error/1 file=/presence-api/deps/swarm/lib/swarm/logger.ex line=27
[error] Task #PID<0.3704.0> started from Swarm.Tracker terminating
** (Protocol.UndefinedError) protocol String.Chars not implemented for {:error, {:invalid_ring, :no_nodes}}. This protocol is implemented for: Atom, BitString, Date, DateTime, Float, Integer, List, NaiveDateTime, Time, URI, Version, Version.Requirement
(elixir) /usr/local/src/elixir/lib/elixir/lib/string/chars.ex:3: String.Chars.impl_for!/1
(elixir) /usr/local/src/elixir/lib/elixir/lib/string/chars.ex:22: String.Chars.to_string/1
(swarm) lib/swarm/tracker/tracker.ex:1152: Swarm.Tracker.start_pid_remotely/6
(elixir) lib/task/supervised.ex:88: Task.Supervised.do_apply/2
(stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3
Function: #Function<1.98420104/0 in Swarm.Tracker.do_track/2>
Args: [] pid=<0.3704.0> error_logger=format
So far we've got no idea why it is getting stuck. Even worse, I had a hunch about what could be causing the problem. I altered the code I thought was problematic and the suite no longer got stuck (locally, running in a Docker container). However, the same problem seems to persist in our CI pipeline, which runs the same exact Docker image and test script. I've even executed the tests with the --trace
flag, which serialises all tests and should prevent deadlocks (I think).
I'm not sure if this is a bug in Swarm, or how we should proceed to try to dig deeper. I also find it quite odd that ExUnit isn't timing out... Maybe it's getting stuck between test cases?
Also what could cause such behaviour?