Frequent leader changes on localhost under default configuration #57
Hey @bjmnbraun, thanks for the issue. I'd expect the current performance of master to be low (hundreds or low thousands of ops per second), and I hope to improve that soon. But it sounds like you're running into a more serious problem due to leader changes. This might take a little back and forth. A bit of explanation on the log messages:

It's possible that the default timeouts are set too low for your configuration. If that were the case, your leader might not be able to heartbeat in time to stop itself from stepping down or to stop other servers from starting elections. The election timeouts might be too low if your CPU, network, or disks are slower than mine, especially if I'm not using the default timeouts myself (I don't remember). The disk is the most suspect, since you're on localhost: the network should be fast, but you're fsyncing from all servers to the same disk. Also, the filesystem backend in master is very inefficient. So let's entertain the hypothesis that your disk is slow and overloaded, causing election timeouts to fire and leader changes to occur. It's plausible, and it's pretty easy to test. Can you try putting your storage files in a tmpfs (/dev/shm usually), or alternatively, try wrapping the LogCabin invocations with eatmydata (https://www.flamingspork.com/projects/libeatmydata/)?
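Something along these lines, for example (just a rough sketch, not exact commands: the binary path and config file name follow the README's "Running a real cluster" example, and the storage directory location here is an assumption you'd adjust for your setup):

```sh
# Rough sketch; paths and flags may not match your setup exactly.

# Option 1: keep the server's storage on tmpfs so fsyncs hit memory, not disk.
mkdir -p /dev/shm/logcabin-1
# Point the storage directory at tmpfs, e.g. via a symlink
# (adjust "storage" to wherever your server actually keeps its files):
ln -s /dev/shm/logcabin-1 storage

# Option 2: wrap each server so its fsync/fdatasync calls become no-ops.
eatmydata build/LogCabin --config logcabin-1.conf
```

Either way you give up durability across crashes, but it isolates whether fsync latency is what's tripping the election timeouts.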
Cool, putting the debug files and storage files under /dev/shm dramatically improves performance. I now see slightly more than 1k writes per second when using a single client (I ran 50K writes; the duration was roughly 42 seconds). I also see no WARNING messages in the logs now, and the leader is unchanged for the whole experiment.
Good to hear, thanks for trying that out. Obviously you've lost durability now, but maybe that's OK if you're just poking around? Things might improve if you run on separate machines instead of all on localhost, especially if those machines have SSDs. Hopefully the new storage backend will also help longer term, though I don't yet know when I'll get to testing and merging it (you might want to subscribe to #30). Also, what is your localhost? Dedicated or VM? SSD or spinny disk? ext4 on Linux?
Localhost is far from ideal: ext3 on a spinny disk; it's a Lenovo ThinkPad. Yes, I was just poking around really :)
Thanks, that sort of makes sense then. Let's leave this issue open as a reminder to at least add a note to the README, after #58. |
Close logcabin#57: Frequent leader changes on localhost under default configuration
I am observing poor performance in LogCabin after performing the steps described in the "Running a real cluster" section of the documentation, running 3 servers and 1 client, all connected over loopback on the same 4-core machine.
Observed performance:
Varies from 1.5 to 2.5 writeEx calls per second.
Reproduce:
Follow the directions in the documentation to set up a simple 3-server Raft cluster, all over loopback. Then, modify Examples/SmokeTest.cc so that the client writes "ha" to the file 50 times, then reads back the value and asserts that it is "ha" (a sketch of the modification is below).
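Roughly, the modified part looks like this (a sketch only, not my exact diff; it assumes the LogCabin::Client Cluster/Tree API that SmokeTest.cc already uses, and the cluster address and tree path here are placeholders):

```cpp
// Sketch of the modified Examples/SmokeTest.cc workload (not the exact diff).
#include <cassert>
#include <string>
#include <LogCabin/Client.h>

int main()
{
    // Placeholder cluster address; the real test takes this from its options.
    LogCabin::Client::Cluster cluster("127.0.0.1:5254");
    LogCabin::Client::Tree tree = cluster.getTree();
    for (int i = 0; i < 50; ++i)
        tree.writeEx("/smoketest", "ha");        // 50 sequential writes
    assert(tree.readEx("/smoketest") == "ha");   // read back and check
    return 0;
}
```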
I measured runtime using "time python2 scripts/smoketest.py". There is some initialization time due to setting up the configuration, but most of the time is spent processing real work.
Warnings / Error messages:
The message "Server/RaftConsensus.cc:266 in callRPC() WARNING[1:Peer(2)]: RPC to server failed: RPC canceled by user" appears 14 times in debug/1. Also 14 times, I see " Server/RaftConsensus.cc:256 in callRPC() WARNING[1:Peer(2)]: RPC to server succeeded after xxx failures" where xxx ranges from 1 to 3.
I also observed that the leadership changed 21 times (!) in the execution.
Perhaps the low performance is due to something causing the RPCs to be canceled, leading to unnecessary leader changes (why?).