nanocoap: problem with [re]transmission behaviour: (some blocks are sent twice immediately) #18217

fabian18 · 2022-06-16T10:43:40Z

Description

During the testing of #17962, I discovered that something in the transmission behaviour of nanocoap is not working correctly. Using e.g. tests/nanocoap_cli: client put fe80::3478:9157:9fd5:81c3%5 5683 /somepath somedata, to create a resource on a server. The first transmission succeeds, but if you repeat the command, a retransmission is also triggered immediately. Even though the first transmission succeeds, the first call to sock_udp_sendv() in nanocoap_sock_request_cb() returns the expected value, but the message is not sent out (?). At least it does not appear in wireshark. Then the first call to sock_udp_recv_buf() always returns -ETIMEDOUT. But then in the second call to sock_udp_sendv(), the message is sent out and appears in wireshark.

Steps to reproduce the issue

I use libcoap´s coap-server: coap-server -A fe80::3478:9157:9fd5:81c3%eth0 -d 1.
I connect nucleo-f767zi over Ethernet with my notebook, where the server is running.
I compile and flash tests/nanocoap_cli: CFLAGS+=-DCONFIG_GNRC_IPV6_NIB_ARSM=1 BOARD=miot-nucleo-f767zi make flash term.
I open wireshark and repeat the nanocoap client put command: client put fe80::3478:9157:9fd5:81c3%5 5683 /somepath somedata and see the attached result.

Expected results

The nanocoap client should not immediately trigger a retransmission, but wait for [ACK_TIMEOUT, ACK_TIMEOUT * RANDOM_FACTOR], before it triggers the first retransmission.

Actual results

Most of the time, the retransmission happens immediately, if the server is not very fast with the ACK.
nanocoap_cli_put.pcapng.gz

Versions

Linux 5.18.3-arch1-1
(but that´s not relevant)

The text was updated successfully, but these errors were encountered:

benpicco · 2022-06-21T11:13:07Z

Hm this seems to be one of those problems that go away when setting ENABLE_DEBUG to 1 😩

fabian18 · 2022-06-22T09:26:57Z

With ENABLE_DEBUG=1 in nanocoap/sock.c, I get a failed assertion:

> client put fe80::3478:9157:9fd5:81c3%5 5683 /somepath somedata
2022-06-22 11:22:45,749 # client put fe80::3478:9157:9fd5:81c3%5 5683 /somepath somedata
2022-06-22 11:22:45,752 # nanocli: sending msg ID 1, 25 bytes
2022-06-22 11:22:45,755 # nanocoap: send 25 bytes (5 tries left)
2022-06-22 11:22:45,760 # nanocoap: waiting for response (timeout: 2092058 µs)
2022-06-22 11:22:45,762 # nanocoap: response code=204
2022-06-22 11:22:45,767 # nanocoap: waiting for response (timeout: 2085058 µs)
2022-06-22 11:22:45,773 # sys/net/gnrc/pktbuf/gnrc_pktbuf.c:93 => *** RIOT kernel panic:
2022-06-22 11:22:45,774 # FAILED ASSERTION.
2022-06-22 11:22
8000
:45,775 # 
2022-06-22 11:22:45,776 # *** halted.
2022-06-22 11:22:45,776 # 
2022-06-22 11:23:22,860 # Exiting Pyterm

assert(gnrc_pktbuf_contains(pkt));

benpicco · 2022-06-22T10:28:43Z

Wait what 😨

Is that with the current master?
When I try

% coap-server-notls -A fe80::90a7:a6ff:fe4b:2e32%tapbr0 -d 1

and

> client put fe80::90a7:a6ff:fe4b:2e32 5683 /somepath somedata

I get

client put fe80::90a7:a6ff:fe4b:2e32 5683 /somepath somedata
nanocli: sending msg ID 1, 25 bytes
nanocoap: send 25 bytes (5 tries left)
nanocoap: waiting for response (timeout: 2448721 µs)
nanocoap: response code=201
nanocoap: waiting for response (timeout: 2447721 µs)
nanocli: response Success, code 2.01, empty payload

fabian18 · 2022-06-22T11:23:39Z

Is that with the current master?

master 1638606

You have to send the request multiple times. Your last comment shows only 2.01 Created.

fabian18 · 2022-06-22T11:29:34Z

and with current master branch, yes

benpicco · 2022-06-22T11:47:21Z

I don't see this on native, is this on stm32?

8000

fabian18 · 2022-06-22T11:56:44Z

yes on stm32

benpicco · 2022-06-22T23:30:38Z

Could also be an out of bounds write on the Ethernet driver, corrupting the pktbuf free list.

can you try with #18227

fabian18 · 2022-06-26T19:13:14Z

can you try with #18227

CFLAGS+=-DCONFIG_GNRC_PKTBUF_CHECK_USE_AFTER_FREE=1 makes no difference, the same assertion is triggered.

benpicco · 2022-07-17T20:11:45Z

What 'works' surprisingly well is

--- a/sys/net/application_layer/nanocoap/sock.c
+++ b/sys/net/application_layer/nanocoap/sock.c
@@ -174,7 +174,7 @@ ssize_t nanocoap_sock_request_cb(nanocoap_sock_t *sock, coap_pkt_t *pkt,
         case STATE_RESPONSE_RCVD:
         case STATE_RESPONSE_OK:
             if (ctx == NULL) {
-                DEBUG("nanocoap: waiting for response (timeout: %"PRIu32" µs)\n",
+                printf("nanocoap: waiting for response (timeout: %"PRIu32" µs)\n",
                       _deadline_left_us(deadline));
             }
             const void *old_ctx = ctx;

However I don't see an immediate re-transmission in the error case, instead the node appears to wait for the timeout and never gets the response. I do see the response in Wireshark however when sniffing directly on the node's interface. (DOSE, with #18163)

benpicco · 2022-08-08T12:45:08Z

So my issue was caused by a different thread being scheduled every 10 ms (and calling ztimer_set each time), which in turn caused the UART RX interrupt to be delayed enough that it would only be executed after another byte was received already. The lost byte caused a CRC mismatch and the frame to be dropped, even though it was complete on the wire.

Adding the printf (or a busy loop, really) before calling sock_udp_recv_buf() meant that the other thread did not get scheduled and the RX interrupts would occur in time. But the proper fix is of course to pause all other threads when downloading firmware updates (since improving RIOTs real-time grantees is certainly no easy task).

I assume the behavior you saw was then caused by issue described in #18416 - so can we close this now?

fabian18 · 2022-08-08T21:19:31Z

Good news: With #18416 no assertion fires anymore, and it is functioning very well.

But still, on the second attempt to upload a file, some blocks are transmitted twice immediately. With ENABLE_DEBUG=1 it is Block 1 and without it is Block 0.
So there still is something odd, but CoAP makes it work nevertheless.

fabian18 · 2022-08-08T21:30:54Z

so can we close this now?

I would keep it open, because it is only partially fixed.

fabian18 · 2022-11-25T10:45:23Z

fixed with #18969

benpicco added Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors) Area: network Area: Networking labels Jun 16, 2022

benpicco assigned miri64 and benpicco Jun 16, 2022

fabian18 changed the title ~~nanocoap: problem with [re]transmission behaviour~~ nanocoap: problem with [re]transmission behaviour: (some blocks are sent twice immediately) Aug 8, 2022

fabian18 mentioned this issue Nov 25, 2022

cpu/stm32/periph_timer: fix spurious IRQs and race conditions #18969

Merged

fabian18 closed this as completed Nov 25, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

nanocoap: problem with [re]transmission behaviour: (some blocks are sent twice immediately) #18217

nanocoap: problem with [re]transmission behaviour: (some blocks are sent twice immediately) #18217

nanocoap: problem with [re]transmission behaviour: (some blocks are sent twice immediately) #18217

nanocoap: problem with [re]transmission behaviour: (some blocks are sent twice immediately) #18217

Comments

Description

Steps to reproduce the issue

Expected results

Actual results

Versions