Description
Hi all,
we've implemented an rpc server based on evhtp + libevent2 and we have many http clients connecting to our rpc server.
We start from 10 clients to 2000, and the rpc server works okay. However, when we increase the number of clients to 2000+ ( or sometimes 3000+) it fails, printing out the log: event_queue_remove_active: 0x??? not on queue 8
, and exit with -1
.
We tried two versions(libevent-release-2.0.22-stable
& libevent-2.1.8-stable
), both show the error. We also tried the debug version and it strangely didn't show up. However, after we read the code we found the execution flow of the debug version is not the same as that of the relase one, though they used the same optimization level(-O2
). So we modified the logging functions and added some log prints at places like event_add
/event_del
/queue_remove
/insert
and compile it as release and run. The error shows again and here's all the output:
** backtrace:**
#0 0x000000000602a5e5 in raise () from /lib64/libc.so.6
#1 0x000000000602bdc5 in abort () from /lib64/libc.so.6
#2 0x000000000602370e in __assert_fail_base () from /lib64/libc.so.6
#3 0x00000000060237d0 in __assert_fail () from /lib64/libc.so.6
#4 0x000000000763d1ca in fatal_cb (err=-559030611) at xxxx.cpp:81
#5 0x0000000007c5d291 in event_exit (errcode=-559030611) at log.c:102
#6 0x0000000007c5d35c in event_errx (eval=-559030611, fmt=<value optimized out>) at log.c:162
#7 0x0000000007c4fe7d in event_queue_remove_active (base=0x6c557e0, evcb=0xaae5410) at event.c:3200
#8 0x0000000007c54fc4 in event_process_active_single_queue (base=0x6c557e0, activeq=0x6c62020, max_to_process=2147483647, endtime=0x0) at event.c:1606
#9 0x0000000007c55aef in event_process_active (base=0x6c557e0, flags=<value optimized out>) at event.c:1738
#10 event_base_loop (base=0x6c557e0, flags=<value optimized out>) at event.c:1961
...
(NOTE: line numbers may not be the same as the original code as we've added some logging codes)
modified log functions and logs:
#ifdef EVENT_DEBUG_LOGGING_ENABLED
#define event_debug(x) do { \
if (event_debug_get_logging_mask_()) { \
event_debugx_ x; \
} \
} while (0)
#else
#define event_debug(x) ((void)0)
#endif
--- our own log functions
#define event_user_debug(x) do { \
event_debugx_ x; \
} while (0)
#define event_user_test_log_evcb(evcb, tv) do{ \
struct timeval cur_tv; \
event_base_gettimeofday_cached(NULL, &cur_tv); \
event_user_debug(( \
"%s: evcb: %p , %s%s%s%s%s%s%s%s%s call %p, curtime(s/us):%d/%d", \
__func__, \
evcb, \
evcb->evcb_flags & EVLIST_TIMEOUT ? "EVLIST_TIMEOUT " : " ", \
evcb->evcb_flags & EVLIST_INSERTED ? "EVLIST_INSERTED " : " ", \
evcb->evcb_flags & EVLIST_SIGNAL ? "EVLIST_SIGNAL " : " ", \
evcb->evcb_flags & EVLIST_ACTIVE ? "EVLIST_ACTIVE " : " ", \
evcb->evcb_flags & EVLIST_INTERNAL ? "EVLIST_INTERNAL " : " ", \
evcb->evcb_flags & EVLIST_ACTIVE_LATER ? "EVLIST_ACTIVE_LATER " : " ", \
evcb->evcb_flags & EVLIST_FINALIZING ? "EVLIST_FINALIZING " : " ", \
evcb->evcb_flags & EVLIST_INIT ? "EVLIST_INIT " : " ", \
tv ? "EV_TIMEOUT " : " ", \
evcb, \
cur_tv.tv_sec, \
cur_tv.tv_usec)); \
} while(0)
#define event_user_test_log_ev(ev, tv) do{ \
struct timeval cur_tv; \
event_base_gettimeofday_cached(NULL, &cur_tv); \
event_user_debug(( \
"%s: event: %p (fd "EV_SOCK_FMT"), %s%s%s%scall %p, curtime(s/us):%d/%d", \
__func__, \
ev, \
EV_SOCK_ARG(ev->ev_fd), \
ev->ev_events & EV_READ ? "EV_READ " : " ", \
ev->ev_events & EV_WRITE ? "EV_WRITE " : " ", \
ev->ev_events & EV_CLOSED ? "EV_CLOSED " : " ", \
tv ? "EV_TIMEOUT " : " ", \
ev->ev_callback, \
cur_tv.tv_sec, \
cur_tv.tv_usec)); \
event_user_test_log_evcb(event_to_event_callback(ev), tv); \
} while(0)
we call these log functions at the beginning of thses functions:
event_add
event_add_nolock_
event_del
event_del_
event_del_nolock_
event_queue_insert_inserted
event_queue_remove_inserted
event_queue_insert_active
event_queue_remove_active
--- logs collected
...
[debug] event_queue_remove_active: evcb: 0xaae5410 , EVLIST_ACTIVE EVLIST_INIT call 0xaae5410, curtime(s/us):1487269625/736809
[debug] event_del: event: 0xaae5410 (fd 45), EV_WRITE call 0x7c4cf70, curtime(s/us):1487269625/737094
[debug] event_del: evcb: 0xaae5410 , EVLIST_INIT call 0xaae5410, curtime(s/us):1487269625/737170
[debug] event_del_: event: 0xaae5410 (fd 45), EV_WRITE call 0x7c4cf70, curtime(s/us):1487269625/737223
[debug] event_del_: evcb: 0xaae5410 , EVLIST_INIT call 0xaae5410, curtime(s/us):1487269625/737617
[debug] event_del_nolock_: event: 0xaae5410 (fd 45), EV_WRITE call 0x7c4cf70, curtime(s/us):1487269625/737698
[debug] event_del_nolock_: evcb: 0xaae5410 , EVLIST_INIT call 0xaae5410, curtime(s/us):1487269625/737760
...
[debug] event_del_nolock_: event: 0xaae5410 (fd 45), EV_WRITE call 0x7c4cf70, curtime(s/us):1487269625/874871
[debug] event_del_nolock_: evcb: 0xaae5410 , EVLIST_INIT call 0xaae5410, curtime(s/us):1487269625/874931
[debug] event_queue_remove_active: evcb: 0x6d332b0 , EVLIST_INSERTED EVLIST_ACTIVE EVLIST_INIT call 0x6d332b0, curtime(s/us):1487269625/875019
...
[debug] event_queue_remove_inserted: evcb: 0xad024e0 , EVLIST_INSERTED EVLIST_INIT call 0xad024e0, curtime(s/us):1487269627/632445
[debug] event_queue_remove_active: evcb: 0x6cfda80 , EVLIST_ACTIVE EVLIST_FINALIZING EVLIST_INIT call 0x6cfda80, curtime(s/us):1487269627/634394
[debug] event_queue_remove_active: evcb: 0xa8c6c10 , EVLIST_ACTIVE EVLIST_FINALIZING EVLIST_INIT call 0xa8c6c10, curtime(s/us):1487269627/635058
...
[debug] event_queue_insert_active: evcb: 0x6c54888 , EVLIST_INSERTED EVLIST_INIT call 0x6c54888, curtime(s/us):1487269627/658787
[debug] event_queue_insert_active: evcb: 0xab05410 , EVLIST_INSERTED EVLIST_INIT call 0xab05410, curtime(s/us):1487269627/658976
[debug] event_queue_insert_active: evcb: 0xaae5410 , EVLIST_INIT call 0xaae5410, curtime(s/us):1487269627/659050 -<<<===========0xaae5410 persistent and inserted again but with no EVLIST_INSERTED flag set !!!
...
[debug] event_queue_insert_active: evcb: 0x6d21dd0 , EVLIST_INSERTED EVLIST_INIT call 0x6d21dd0, curtime(s/us):1487269627/684201
[debug] event_queue_remove_active: evcb: 0x6c54888 , EVLIST_INSERTED EVLIST_ACTIVE EVLIST_INIT call 0x6c54888, curtime(s/us):1487269627/684312
[debug] event_add: event: 0x6cfd4b0 (fd -1), EV_READ call 0x7e90790, curtime(s/us):1487269627/684589
[debug] event_add: evcb: 0x6cfd4b0 , EVLIST_INIT call 0x6cfd4b0, curtime(s/us):1487269627/684664
[debug] event_add_nolock_: event: 0x6cfd4b0 (fd -1), EV_READ call 0x7e90790, curtime(s/us):1487269627/684721
[debug] event_add_nolock_: evcb: 0x6cfd4b0 , EVLIST_INIT call 0x6cfd4b0, curtime(s/us):1487269627/684782
[debug] event_queue_insert_inserted: event: 0x6cfd4b0 (fd -1), EV_READ call 0x7e90790, curtime(s/us):1487269627/685126
[debug] event_queue_insert_inserted: evcb: 0x6cfd4b0 , EVLIST_INIT call 0x6cfd4b0, curtime(s/us):1487269627/685201
[debug] event_add: event: 0x6ce5730 (fd 28), EV_READ call 0x7c4da90, curtime(s/us):1487269627/685302
[debug] event_add: evcb: 0x6ce5730 , EVLIST_INIT call 0x6ce5730, curtime(s/us):1487269627/685373
[debug] event_add_nolock_: event: 0x6ce5730 (fd 28), EV_READ call 0x7c4da90, curtime(s/us):1487269627/685429
[debug] event_add_nolock_: evcb: 0x6ce5730 , EVLIST_INIT call 0x6ce5730, curtime(s/us):1487269627/685482
[debug] event_queue_insert_inserted: event: 0x6ce5730 (fd 28), EV_READ call 0x7c4da90, curtime(s/us):1487269627/685566
[debug] event_queue_insert_inserted: evcb: 0x6ce5730 , EVLIST_INIT call 0x6ce5730, curtime(s/us):1487269627/685619
[debug] event_add: event: 0x6dcb180 (fd -1), EV_READ call 0x7e90790, curtime(s/us):1487269627/685818
[debug] event_add: evcb: 0x6dcb180 , EVLIST_INIT call 0x6dcb180, curtime(s/us):1487269627/685883
[debug] event_add_nolock_: event: 0x6dcb180 (fd -1), EV_READ call 0x7e90790, curtime(s/us):1487269627/685936
[debug] event_add_nolock_: evcb: 0x6dcb180 , EVLIST_INIT call 0x6dcb180, curtime(s/us):1487269627/686036
[debug] event_queue_insert_inserted: event: 0x6dcb180 (fd -1), EV_READ call 0x7e90790, curtime(s/us):1487269627/686091
[debug] event_queue_insert_inserted: evcb: 0x6dcb180 , EVLIST_INIT call 0x6dcb180, curtime(s/us):1487269627/686143
...
[debug] event_queue_remove_active: evcb: 0xab05410 , EVLIST_INSERTED EVLIST_ACTIVE EVLIST_INIT call 0xab05410, curtime(s/us):1487269627/738224
[debug] event_del: event: 0xab05410 (fd 64), EV_READ call 0x7c4da90, curtime(s/us):1487269627/738444
[debug] event_del: evcb: 0xab05410 , EVLIST_INSERTED EVLIST_INIT call 0xab05410, curtime(s/us):1487269627/738512
[debug] event_del_: event: 0xab05410 (fd 64), EV_READ call 0x7c4da90, curtime(s/us):1487269627/738565
[debug] event_del_: evcb: 0xab05410 , EVLIST_INSERTED EVLIST_INIT call 0xab05410, curtime(s/us):1487269627/738619
[debug] event_del_nolock_: event: 0xab05410 (fd 64), EV_READ call 0x7c4da90, curtime(s/us):1487269627/738676
[debug] event_del_nolock_: evcb: 0xab05410 , EVLIST_INSERTED EVLIST_INIT call 0xab05410, curtime(s/us):1487269627/738732
[debug] event_queue_remove_inserted: event: 0xab05410 (fd 64), EV_READ call 0x7c4da90, curtime(s/us):1487269627/738784
[debug] event_queue_remove_inserted: evcb: 0xab05410 , EVLIST_INSERTED EVLIST_INIT call 0xab05410, curtime(s/us):1487269627/738866
[debug] event_add: event: 0xab05410 (fd 64), EV_READ call 0x7c4da90, curtime(s/us):1487269627/738985
[debug] event_add: evcb: 0xab05410 , EVLIST_INIT call 0xab05410, curtime(s/us):1487269627/739038
[debug] event_add_nolock_: event: 0xab05410 (fd 64), EV_READ call 0x7c4da90, curtime(s/us):1487269627/739089
[debug] event_add_nolock_: evcb: 0xab05410 , EVLIST_INIT call 0xab05410, curtime(s/us):1487269627/739138
[debug] event_queue_insert_inserted: event: 0xab05410 (fd 64), EV_READ call 0x7c4da90, curtime(s/us):1487269627/739200
[debug] event_queue_insert_inserted: evcb: 0xab05410 , EVLIST_INIT call 0xab05410, curtime(s/us):1487269627/739252
[debug] event_queue_remove_active: evcb: 0xaae5410 , EVLIST_INIT call 0xaae5410, curtime(s/us):1487269627/739309
[err] event_queue_remove_active: 0xaae5410 not on queue 8 -<<<===========0xaae5410
It fails when 0xaae5410
is removed from the active queue while its EVLIST_ACTIVE
is not set.
Seems some negative data race exists and the EVLIST_ACTIVE
mask or EVLIST_INSERTED
mask is set in a wrong order , thus vialoating the implied happen-before relation?
We compared with other callbacks from log seq and found the general pattern:
EVLIST_INSERTED
EVLIST_ACTIVE
EVLIST_INIT
three masks were set when remove like:
[debug] event_queue_insert_active: evcb: 0xab05410 , EVLIST_INSERTED EVLIST_INIT call 0xab05410, curtime(s/us):1487269627/658976
[debug] event_queue_remove_active: evcb: 0xab05410 , EVLIST_INSERTED EVLIST_ACTIVE EVLIST_INIT call 0xab05410, curtime(s/us):1487269627/738224
or EVLIST_ACTIVE
EVLIST_FINALIZING
EVLIST_INIT
three masks were set
[debug] event_queue_insert_active: evcb: 0x6ec9ef0 , EVLIST_INIT call 0x6ec9ef0, curtime(s/us):1487269627/473269
[debug] event_queue_remove_active: evcb: 0x6ec9ef0 , EVLIST_ACTIVE EVLIST_FINALIZING EVLIST_INIT call 0x6ec9ef0, curtime(s/us):1487269627/654246
but the failed one:
[debug] event_queue_insert_active: evcb: 0xaae5410 , EVLIST_INIT call 0xaae5410, curtime(s/us):1487269627/659050
[debug] event_queue_remove_active: evcb: 0xaae5410 , EVLIST_INIT call 0xaae5410, curtime(s/us):1487269627/739309
[err] event_queue_remove_active: 0xaae5410 not on queue 8
os version:
[root@localhost ~]# cat /etc/issue
CentOS release 6.5 (Final)
Kernel \r on an \m
[root@localhost ~]# uname -a
Linux localhost.localdomain 2.6.32-431.el6.x86_64 #1 SMP Fri Nov 22 03:15:09 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
Any suggestions and ideas will be appreciated, please help.