Overhaul Hopper, break up mono-lock hegemony #19
Conversation
Hopper is pokey. We need to expose more telemetry about Hopper's internal workings so Ops folks can make good decisions about how queues are behaving in practice. A good chunk of that work will be teasing the information _away_ into a queryable form. Work is inspired by http://blog.troutwine.us/tag/hopper/ Signed-off-by: Brian L. Troutwine <blt@postmates.com>
I have done a fair bit of trimming of FsSync. We no longer have _so_ much repetitive fiddling going on in that shared structure and the sender / receiver are simpler for it. I _hope_ to get the FsSync totally destroyed. The only things that remain are a count of _disk_ writes and the VecDeque. I don't have a plan per se but maybe one will come to me. Signed-off-by: Brian L. Troutwine <blt@postmates.com>
The key insight here is that _if_ we write information into the deque buffer we can instruct the receiver to switch into disk mode without coordinating through any other memory points. This does mean every payload incurs an additional enum variant cost but, meh, I'll take it. What's _especially_ nice is this opens up the possibility of keeping order without flushing the whole of the memory buf -- since we can bump the disk read indicator as needed -- and will, I think, allow us to use a deque not protected by a mutex. I think. There are a few tests that still need to be uncommented; that'll come shortly. Signed-off-by: Brian L. Troutwine <blt@postmates.com>
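To make that concrete, here's a rough sketch of the idea -- `Placement` and its variants are illustrative names only, not hopper's actual types:

```rust
// Hypothetical sketch of the variant-in-the-deque idea described above.
// Every slot pays for one extra discriminant, but a `Disk` marker lets a
// Sender tell the Receiver to switch into disk-read mode purely through the
// deque, with no shared FsSync-style structure to coordinate on.
enum Placement<T> {
    Memory(T),   // the payload itself, stored in the in-memory slot
    Disk(usize), // "go read this many serialized payloads from the queue file"
}

fn describe<T: std::fmt::Debug>(slot: &Placement<T>) {
    match slot {
        Placement::Memory(event) => println!("in-memory event: {:?}", event),
        Placement::Disk(n) => println!("read {} payloads from disk", n),
    }
}

fn main() {
    describe(&Placement::Memory(42_u64));
    describe(&Placement::<u64>::Disk(3));
}
```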
The test runs are _noticeably_ faster now. :D Signed-off-by: Brian L. Troutwine <blt@postmates.com>
We were not previously compressing on-disk representations. This is now corrected. The compression level is not configurable and is pegged to 'fast'. Signed-off-by: Brian L. Troutwine <blt@postmates.com>
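For illustration, a minimal sketch of what that write path looks like, assuming a flate2-style DEFLATE encoder (the commit above doesn't pin down the codec); the point is only that the level is hard-wired to the fast preset rather than exposed in the configuration:

```rust
use std::io::Write;

use flate2::write::DeflateEncoder;
use flate2::Compression;

// Sketch only: compress a serialized payload before it hits the queue file.
// `Compression::fast()` mirrors the "pegged to fast" choice; the level is
// deliberately not part of the public configuration.
fn compress_payload(raw: &[u8]) -> std::io::Result<Vec<u8>> {
    let mut encoder = DeflateEncoder::new(Vec::new(), Compression::fast());
    encoder.write_all(raw)?;
    encoder.finish()
}
```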
This is causing -- I think -- Travis to spin out. It does _not_ fail on my local machine, probably because Travis disks are very slow. As a part of this work I'm going to add proper AFL testing back into the project. The AFL runs won't happen in Travis but they'll be there. Signed-off-by: Brian L. Troutwine <blt@postmates.com>
The current queue has issues in that it does not actually work across threads: the pointers don't update. Which makes sense. Signed-off-by: Brian L. Troutwine <blt@postmates.com>
Okey dokey, so this is some kind of progress. What's happening here is it's possible for the enq to lose the thread if it rumbles oddly with deq. The thing is, this won't square with my need to attempt an enq, get bounced and then deq_n. I need to invest in contiguous storage, which I believe will be better for me anyhow: I can put the enq/deq index into the enq_lock / deq_lock mutexes and skip out on the condvar entirely. Signed-off-by: Brian L. Troutwine <blt@postmates.com>
This implementation is still not perfect: there are pauses and the occasional crash, but it's actually workable and we've hit the finicky QC stage of things now. Nifty. Signed-off-by: Brian L. Troutwine <blt@postmates.com>
We still need to do an AFL run and longer-duration QC runs as well, but this is progress. The primary change is we no longer point to contiguous options but to const pointers, being sure to synchronize on size as AoMP suggests for ARM and friends' sake. Special thanks to https://users.rust-lang.org/t/why-do-my-distinct-values-resolve-to-the-same-pointer/15517/3 for help figuring out my stack allocation issue. Signed-off-by: Brian L. Troutwine <blt@postmates.com>
I've currently got a double mut issue with the way deque works but I'm closer to where I need to be. Namely, the sender / receiver _should_ have fine grained control over their locks and the sender loop is closer to functional. Now I just have to figure out how to take a lock and then meddle with it. I think the answer is pushing InnerQueue back up into Queue. Signed-off-by: Brian L. Troutwine <blt@postmates.com>
This commit manages to work back some of the mutable borrow errors that plagued the last commit. The main difficulty now is located entirely in Sender, which, uh, is progress. Somehow I need to fiddle with write_to_disk so that we can do rotations without having to write to self, or something. Signed-off-by: Brian L. Troutwine <blt@postmates.com>
This commit was hard won. Note now that we no longer leak memory when we pop/push and correctly set the offsets when doing so. I'm keeping a 'check' of which all-up tests are passing, going through one-by-one. Signed-off-by: Brian L. Troutwine <blt@postmates.com>
The tests were a bit rough, I'm now realizing. In the next commit I'll do more to probe at the implementation, answering these questions: * In a single sender / single receiver setup do we get things in sender write-order? * In a multi-sender / single receiver setup do we get all the same elements back, in the end? Signed-off-by: Brian L. Troutwine <blt@postmates.com>
The approach taken to the quickcheck tests from deque has been migrated upward into the main library. The deque module is now tested only indirectly. Signed-off-by: Brian L. Troutwine <blt@postmates.com>
This benchmark replaces the older bencher-based benchmark which, itself, didn't do too much to probe the functional speed of hopper. The timing I get out of this now is... fishy, considering that one of the hopper tests should be heading to disk and therefore be much slower. Tom Santero has some ideas there. Also, we're significantly more pokey than stdlib mpsc right now. Signed-off-by: Brian L. Troutwine <blt@postmates.com>
I believe in the AFL runs I'm seeing crashes to do with over-consumption of the filesystem. Right now hopper assumes that filesystem ops will never fail, which is not correct. This commit starts to pay some of that down. The API will have to change to accommodate the rest. Signed-off-by: Brian L. Troutwine <blt@postmates.com>
Well hey! Who knew that thread::spawn could crash? The library authors, for one. I sure didn't. Signed-off-by: Brian L. Troutwine <blt@postmates.com>
Whew! Well guess what, you can get a lot of fun crashes when manipulating the filesystem if there are no more FDs available. Fun fact: hopper startup is now much less expensive because of the way Sender clone is reworked. Signed-off-by: Brian L. Troutwine <blt@postmates.com>
This is the first commit that is ready for review. There's still work to be done but it's polish. Well, except for that need for Drop. That's an actual problem. Signed-off-by: Brian L. Troutwine <blt@postmates.com>
Current problems: if the on-disk queue size is very small relative to the volume of incoming writes we burn through queue files quickly and run into FD exhaustion. I'm considering making the minimum on-disk queue size relatively normal, something like 1Mb. That won't absolve us of FD exhaustion problems but will make life considerably nicer for almost all real systems.
This commit includes a lot of material to guard against failures owing to file-descriptor exhaustion on-system. Mostly this just means careful looping and passing of errors up the call-stack, which previously we assumed "would not happen" and let slide. I think there might be a collision happening somewhere where two receivers start up at once and one deletes the other's queue files. I think adding a strict mode -- where a Receiver will refuse to start if there are queue files present -- would be a solid idea to avoid this situation. Signed-off-by: Brian L. Troutwine <blt@postmates.com>
Some test data:

Benchmark runner system

The following benchmarks were performed on a machine with a 12-core Intel Xeon E5-1650 and 32GB RAM.
- hopper_tst/HopperInput { in_memory_total: 16384, max_disk_total: 32768, total_senders: 4, ith: 8192 }
- hopper_tst/HopperInput { in_memory_total: 4096, max_disk_total: 32768, total_senders: 4, ith: 8192 }
- mspc_tst
Quickcheck

I also ran over 5 million iterations of QuickCheck tests overnight, all passing.
Note: still running up against FD exhaustion issues on my laptop (2-core Intel i7-7500U, 16GB RAM), which might be a more accurate representation of resource allocation within cloud environments / k8s pods provisioned on AWS/GCE. Worth continued investigation IMO.
@tsantero Agreed. I'll keep plugging away.
Prior to this commit the Receiver was responsible for nuking unused queue files before starting up. Consider what happens if the receiver is very, very late to the party: it'll toast the files it needs to read, then fall over when it goes looking for them. Oops! Signed-off-by: Brian L. Troutwine <blt@postmates.com>
My last commit broke the comment test we had in place. I've shuffled senders to start first as they create the hierarchy of directories needed. Signed-off-by: Brian L. Troutwine <blt@postmates.com>
These changes are non-functional and satisfy clippy. Signed-off-by: Brian L. Troutwine <blt@postmates.com>
@tsantero FWIW I think we're in the clear. I passed 1M quickcheck tests locally and I have an AFL run going on two cores.
This commit is non-functional but it does add a lot of commentary to the codebase. The rustdocs are also updated and brought up to date. Signed-off-by: Brian L. Troutwine <blt@postmates.com>
src/deque.rs (outdated)

```rust
};
let elem: Box<T> = Box::from_raw(*self.data.offset((*guard).offset) as *mut T);
*self.data.offset((*guard).offset) = ptr::null_mut();
self.size.fetch_sub(1, Ordering::Release);
```
This needs to be moved before any of the other pointer modifications, else we might end up allowing Receiver to pull something that is about to be popped off.
Some questions and comments for you, @blt
src/deque.rs (outdated)

```rust
}

pub unsafe fn pop_front(&self) -> T {
    let mut guard = self.front_lock.lock().expect("front lock poisoned");
```
Any reason why we're not using the `self.lock_front()` interface here?
Can't think of one. Historical accident, I'd say.
```rust
    .to_str()
    .unwrap()
    .parse::<usize>()
    .unwrap();
```
Should you skip over unparseable entries instead of crashing?
Historically we've assumed that if someone fiddles with hopper's queue directory it should just fail to start or otherwise fall over. Maybe that's a bad assumption?
```rust
        .unwrap();
    max = cmp::max(num, max);
}
Ok(max)
```
Does this need to be a Result if it never returns an error?
It will return an error if any of the `?` trigger. That'll happen when the system has exhausted all its file descriptors, for instance.
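A sketch of the shape under discussion -- not the actual hopper function -- showing the `?` propagation and, per the suggestion above, skipping unparseable entries rather than crashing:

```rust
use std::{cmp, fs, io, path::Path};

// Hypothetical sketch: each `?` bubbles an io::Error up to the caller (EMFILE
// when the process is out of file descriptors, for instance), so the Result is
// doing real work even though the happy path always ends in Ok(max). Entries
// whose names don't parse as usize are simply ignored here.
fn max_queue_file_index(dir: &Path) -> Result<usize, io::Error> {
    let mut max = 0;
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        if let Some(num) = entry
            .file_name()
            .to_str()
            .and_then(|s| s.parse::<usize>().ok())
        {
            max = cmp::max(num, max);
        }
    }
    Ok(max)
}
```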
```rust
    min = cmp::min(num, min);
}
assert!(worked);
Ok(min)
```
Same here about crashing and Result. Why assert instead of returning an error?
A fine question. No good reason.
Oh, I get what's going on here. This is a guard to ensure we don't start the receiver before the sender, in which case there'd be no file for it to take a handle on.
In the future we could make file opening a lazy thing, rather than something we do in `Receiver::new`.
```rust
if data_dir.is_dir() {
    for directory_entry in fs::read_dir(data_dir)? {
        let de = directory_entry?;
        fs::remove_file(de.path())?
```
`mkdir` in the data directory is an attack vector here.
Not sure I follow entirely? Or, to be clear, I guess hopper is not resilient at all to someone fiddling with its data directory. I'd like to understand what scenario you've cooked up though.
```diff
-                fslock.disk_writes_to_read -= 1;
-                return Some(event);
+                self.disk_writes_to_read -= 1;
+                return Ok(event);
             }
             Err(e) => panic!("Failed decoding. Skipping {:?}", e),
```
Error says skipping, but isn't this a crash?
Heh, the crashing-est kind of skip. Will correct the language.
src/sender.rs (outdated)

```rust
    guard: &mut MutexGuard<BackGuardInner<SenderSync>>,
) -> Result<(), (T, super::Error)> {
    let mut buf: Vec<u8> = Vec::with_capacity(64);
    buf.clear();
```
This clear shouldn't be necessary, as the length of this vector is initially zero
Ah, yep. Totally redundant.
It turns out there's still a bug in the code. It's possible to cause the Receiver to lock up. This happens when the queue is full and we pull two items, even though the atomic counter had previously been incremented to show the presence of only one of those items. The Receiver and Sender then double-pop -- in effect -- the item. Signed-off-by: Brian L. Troutwine <blt@postmates.com>
Okay, as demonstrated in ae75bc3 we have a problem. If a Sender calls send against a full queue we end up holding a value with nowhere to put it. Which, okay, how do we deal with overflow? Can't throw it out. Here's what we do: the Sender flips into disk-write mode on overflow. The value gets written to disk and the Sender attempts to write a disk placement to the queue. If it can't, it stays in disk mode. If it can, it goes back to memory mode. We need to be careful to flush a Sender in disk-mode to the queue on Drop etc etc but I think this'll work out okay.
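A rough sketch of that state machine, with made-up names rather than the real Sender internals:

```rust
// Illustrative only: `SenderMode` and its methods are hypothetical, not
// hopper's actual Sender API. The sender tracks which mode it is in and how
// many disk writes are outstanding until a disk placement lands on the queue.
enum SenderMode {
    Memory,
    Disk { outstanding_writes: usize },
}

impl SenderMode {
    // The in-memory enqueue bounced: spill the value to disk and remember
    // that one more serialized write is waiting to be advertised.
    fn on_overflow(&mut self) {
        match self {
            SenderMode::Memory => *self = SenderMode::Disk { outstanding_writes: 1 },
            SenderMode::Disk { outstanding_writes } => *outstanding_writes += 1,
        }
    }

    // The queue accepted a disk placement covering the outstanding writes:
    // flip back to memory mode.
    fn on_placement_accepted(&mut self) {
        *self = SenderMode::Memory;
    }
}
```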
Well, there's progress here of a kind. pop_back is now gone, we only have two operations on the deque, and once something's in there it's in there until it's permanently gone. Good. We also block the sender on count size rather than in-memory representation. Also good. Issue is, tests still fail and the need for a flush command to drive the last disk writes out is a tedious API. The flush is only needed on shutdown, though, so that's a positive for actual Hopper use. Signed-off-by: Brian L. Troutwine <blt@postmates.com>
We were forgetting to reset the total number of disk writes outstanding when the Sender was flushed. This instructed the Receiver to wait for writes that would never come. Signed-off-by: Brian L. Troutwine <blt@postmates.com>
One issue we had with the previous now-we-are-done commit was that none of our tests looped. For any given input it's possible that a concurrency bug will force failure in 1/10000 runs, say. Which, boo-urns, right? This commit does not include any 10k loops but does have a 2.5k loop. The cautious developer is encouraged to bulk this up during development. Signed-off-by: Brian L. Troutwine <blt@postmates.com>
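A sketch of the looping idea, using std's mpsc as a stand-in for an actual hopper channel; the count mirrors the 2.5k loop mentioned above:

```rust
// Illustrative only: re-run a cheap round-trip assertion many times so that a
// rare interleaving has a chance to surface. Bump the count locally while
// hunting concurrency bugs.
#[test]
fn round_trip_loops() {
    for _ in 0..2_500 {
        let (tx, rx) = std::sync::mpsc::channel::<u64>();
        tx.send(42).unwrap();
        assert_eq!(rx.recv().unwrap(), 42);
    }
}
```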
@dparton ready for re-review.
Okey dokey! I'm pretty sure at this point we're back on track. I've passed 1M QC tests plus the new loopers and the deque is about as bare-bones as I can conceive making it at this point, which is a good sign I do believe. Mostly this commit is fiddling with documentation. I did adjust the benchmark to have the same ith values so we could compare apples to apples. Signed-off-by: Brian L. Troutwine <blt@postmates.com>
There's a lot going on in this PR and it's all in the service of supporting postmates/cernan#411 and postmates/cernan#412.
Prior to this PR a hopper caller had no good way of strictly limiting how many items were kept in memory: this value was hard-coded. We've changed the interface to allow this. Further, the old FsSync method -- where all senders and receivers rendezvous on a single locked structure -- is now broken up. We have an in-memory deque protected on either end by a mutex. Senders are isolated from one another but the receiver is able to jam independently.
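Roughly, the shape is something like this -- a heavily simplified sketch, not the real src/deque.rs, which works over raw slots with unsafe code and carries more state in its guard structs:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

// Conceptual sketch only. Senders serialize on `back_lock`, the Receiver
// holds `front_lock`, and the atomic `size` is the only point the two ends
// share, so the ends never block one another.
struct BackGuardInner {
    offset: usize, // slot the next sender write lands in
}

struct FrontGuardInner {
    offset: usize, // slot the next receiver read comes from
}

struct TwoLockDeque {
    capacity: usize,
    size: AtomicUsize,
    back_lock: Mutex<BackGuardInner>,
    front_lock: Mutex<FrontGuardInner>,
}

impl TwoLockDeque {
    // A sender checks fullness against the shared count, never against
    // anything the receiver holds a lock on.
    fn is_full(&self) -> bool {
        self.size.load(Ordering::Acquire) == self.capacity
    }
}
```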
Overall, the Sender / Receiver code is more straightforward but the deque has introduced a fair bit of unsafe code into hopper. We've seen only a 3x performance drop compared to stdlib MPSC on Linux machines, so, uh, that's not so bad.