Frequent leader changes on localhost under default configuration · Issue #57 · logcabin/logcabin · GitHub

Closed
bjmnbraun opened this issue Nov 18, 2014 · 5 comments

@bjmnbraun

I am observing poor performance in LogCabin after following the steps in the "Running a real cluster" section of the documentation, with 3 servers and 1 client all connected over loopback on the same 4-core machine.

Observed performance:
Varies from 1.5 to 2.5 "writeEx" operations per second.

Reproduce:
Follow the directions in the documentation to set up a simple 3-server Raft cluster, all over loopback. Then modify Examples/SmokeTest.cc so that the client writes "ha" to the file 50 times, then reads back the value and asserts that it is "ha".

I measured runtime using "time python2 scripts/smoketest.py". There is some initialization time due to setting up the configuration, but most of the time is spent processing real work.

Warnings / Error messages:
The message "Server/RaftConsensus.cc:266 in callRPC() WARNING[1:Peer(2)]: RPC to server failed: RPC canceled by user" appears 14 times in debug/1. Also 14 times, I see "Server/RaftConsensus.cc:256 in callRPC() WARNING[1:Peer(2)]: RPC to server succeeded after xxx failures", where xxx ranges from 1 to 3.

I also observed that the leadership changed 21 times (!) in the execution.

Perhaps the low performance is due to something causing the RPCs to be canceled, leading to unnecessary leader changes (why?).

@bjmnbraun bjmnbraun changed the title Performance of logcabin under default configuration Low performance of logcabin under default configuration Nov 18, 2014
@ongardie
Member

Hey @bjmnbraun ,

Thanks for the issue. I'd expect the current performance of master to be low (hundreds or low thousands of ops per second), and I hope to improve that soon. But it sounds like you're running into a more serious problem due to leader changes. This might take a little back and forth.

A bit of explanation on the log messages:

  • RPC cancellations are expected when a candidate or leader steps down: that server no longer has any use for the results of the RPCs, so it cancels them, which just means that it'll ignore any result.
  • "RPC to server succeeded after xxx failures" would usually indicate that the endpoint wasn't reachable for a while and then became reachable.

It's possible that the default timeouts are set too low for your configuration. If that were the case, your leader might not be able to heartbeat in time to stop itself from stepping down or stop other servers from starting elections.
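If the timeouts do turn out to be the problem, one mitigation would be raising the election timeout in each server's config file. A hedged sketch only: the option name `electionTimeoutMilliseconds` and the 500 ms default are assumptions from memory, so check the sample config shipped with your checkout before relying on them.

```
# Illustrative server config fragment; verify the option name
# against your version's sample config before using it.
electionTimeoutMilliseconds = 2000
```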

The election timeouts might be too low if your CPU, network, or disks are slower than mine, especially if I'm not using the default timeouts myself (I don't remember). The disk is most suspect, since you're on localhost: network should be fast, but you're fsyncing from all servers to the same disk. Also, the filesystem backend in master is very inefficient.
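One way to test the slow-disk theory directly is to time synchronous writes on the same filesystem the servers use. This generic probe (not part of LogCabin's scripts) roughly mimics the small fsync'd appends a Raft log performs:

```shell
# Time 200 small synchronous (O_DSYNC) 4 KiB writes, a rough stand-in
# for Raft log appends. A healthy SSD finishes well under a second;
# a busy spinning disk can take several seconds.
dd if=/dev/zero of=/tmp/dsync-probe bs=4k count=200 oflag=dsync
rm -f /tmp/dsync-probe
```

With three servers fsyncing to one spinning disk, every append from every server contends for the same write head, which is exactly the overload scenario described above.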

So let's entertain the hypothesis that your disk is slow and overloaded, causing election timeouts to fire and leader changes to occur. It's plausible, and it's pretty easy to test. Can you try putting your storage files in a tmpfs (/dev/shm usually), or alternatively, try wrapping the LogCabin invocations with eatmydata ( https://www.flamingspork.com/projects/libeatmydata/ )?
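Both suggestions, sketched as shell commands; the storage directory name and server invocation below are illustrative, not exact paths from the docs:

```shell
# Option 1: put storage on tmpfs so fsync is nearly free (no durability
# across reboots). Create a directory here, then point each server's
# storage path at it in the config before starting.
mkdir -p /dev/shm/logcabin-storage

# Option 2: neutralize fsync/fdatasync for the whole process instead,
# e.g. (invocation is a guess at your setup):
#   eatmydata build/LogCabin --config logcabin-1.conf
```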

@bjmnbraun
Author

Cool, putting the debug files and storage files under /dev/shm dramatically improves performance.

I now see slightly more than 1k writes per second when using a single client (50K writes in roughly 42 seconds, about 1,190 writes/s). I also see no WARNING messages in the logs now, and the leader is unchanged for the whole experiment.

@ongardie
Member

Good to hear, thanks for trying that out. Obviously you've lost durability now, but maybe that's OK for now if you're just poking around? Things might improve if you run on separate machines instead of all on localhost, especially if those machines have SSDs. Hopefully the new storage backend will also help longer term, though I don't yet know when I'll get to testing and merging it (you might want to subscribe to #30).

Also, what is your localhost? Dedicated or VM? SSD or spinny disk? ext4 on Linux?

@bjmnbraun
Author

Localhost is far from ideal: ext3 on a spinning disk; it's a Lenovo ThinkPad. Yes, I was just poking around really :)

@ongardie
Member

Thanks, that sort of makes sense then. Let's leave this issue open as a reminder to at least add a note to the README, after #58.

@ongardie ongardie changed the title Low performance of logcabin under default configuration Frequent leader changes on localhost under default configuration Nov 19, 2014
@ongardie ongardie added the perf label Feb 27, 2015
@ongardie ongardie self-assigned this Apr 29, 2015
@ongardie ongardie added this to the 1.0.0 milestone Apr 29, 2015
nhardt pushed a commit to nhardt/logcabin that referenced this issue Aug 21, 2015
Close logcabin#57: Frequent leader changes on localhost under default configuration