8000 nanocoap: problem with [re]transmission behaviour: (some blocks are sent twice immediately) · Issue #18217 · RIOT-OS/RIOT · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

nanocoap: problem with [re]transmission behaviour: (some blocks are sent twice immediately) #18217

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
fabian18 opened this issue Jun 16, 2022 · 14 comments
Assignees
Labels
Area: network Area: Networking Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors)

Comments

@fabian18
Copy link
Contributor

Description

During the testing of #17962, I discovered that something in the transmission behaviour of nanocoap is not working correctly. Using e.g. tests/nanocoap_cli: client put fe80::3478:9157:9fd5:81c3%5 5683 /somepath somedata, to create a resource on a server. The first transmission succeeds, but if you repeat the command, a retransmission is also triggered immediately. Even though the first transmission succeeds, the first call to sock_udp_sendv() in nanocoap_sock_request_cb() returns the expected value, but the message is not sent out (?). At least it does not appear in wireshark. Then the first call to sock_udp_recv_buf() always returns -ETIMEDOUT. But then in the second call to sock_udp_sendv(), the message is sent out and appears in wireshark.

Steps to reproduce the issue

  • I use libcoap´s coap-server: coap-server -A fe80::3478:9157:9fd5:81c3%eth0 -d 1.
  • I connect nucleo-f767zi over Ethernet with my notebook, where the server is running.
  • I compile and flash tests/nanocoap_cli: CFLAGS+=-DCONFIG_GNRC_IPV6_NIB_ARSM=1 BOARD=miot-nucleo-f767zi make flash term.
  • I open wireshark and repeat the nanocoap client put command: client put fe80::3478:9157:9fd5:81c3%5 5683 /somepath somedata and see the attached result.

Expected results

The nanocoap client should not immediately trigger a retransmission, but wait for [ACK_TIMEOUT, ACK_TIMEOUT * RANDOM_FACTOR], before it triggers the first retransmission.

Actual results

Most of the time, the retransmission happens immediately, if the server is not very fast with the ACK.
nanocoap_cli_put.pcapng.gz

Versions

Linux 5.18.3-arch1-1
(but that´s not relevant)

@benpicco benpicco added Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors) Area: network Area: Networking labels Jun 16, 2022
@benpicco
Copy link
Contributor

Hm this seems to be one of those problems that go away when setting ENABLE_DEBUG to 1 😩

@fabian18
Copy link
Contributor Author

With ENABLE_DEBUG=1 in nanocoap/sock.c, I get a failed assertion:

> client put fe80::3478:9157:9fd5:81c3%5 5683 /somepath somedata
2022-06-22 11:22:45,749 # client put fe80::3478:9157:9fd5:81c3%5 5683 /somepath somedata
2022-06-22 11:22:45,752 # nanocli: sending msg ID 1, 25 bytes
2022-06-22 11:22:45,755 # nanocoap: send 25 bytes (5 tries left)
2022-06-22 11:22:45,760 # nanocoap: waiting for response (timeout: 2092058 µs)
2022-06-22 11:22:45,762 # nanocoap: response code=204
2022-06-22 11:22:45,767 # nanocoap: waiting for response (timeout: 2085058 µs)
2022-06-22 11:22:45,773 # sys/net/gnrc/pktbuf/gnrc_pktbuf.c:93 => *** RIOT kernel panic:
2022-06-22 11:22:45,774 # FAILED ASSERTION.
2022-06-22 11:22
8000
:45,775 # 
2022-06-22 11:22:45,776 # *** halted.
2022-06-22 11:22:45,776 # 
2022-06-22 11:23:22,860 # Exiting Pyterm
assert(gnrc_pktbuf_contains(pkt));

@benpicco
Copy link
Contributor

Wait what 😨

Is that with the current master?
When I try

% coap-server-notls -A fe80::90a7:a6ff:fe4b:2e32%tapbr0 -d 1

and

> client put fe80::90a7:a6ff:fe4b:2e32 5683 /somepath somedata

I get

client put fe80::90a7:a6ff:fe4b:2e32 5683 /somepath somedata
nanocli: sending msg ID 1, 25 bytes
nanocoap: send 25 bytes (5 tries left)
nanocoap: waiting for response (timeout: 2448721 µs)
nanocoap: response code=201
nanocoap: waiting for response (timeout: 2447721 µs)
nanocli: response Success, code 2.01, empty payload

@fabian18
Copy link
Contributor Author

Is that with the current master?

master 1638606

You have to send the request multiple times. Your last comment shows only 2.01 Created.

@fabian18
Copy link
Contributor Author

and with current master branch, yes

@benpicco
Copy link
Contributor

I don't see this on native, is this on stm32?

8000

@fabian18
Copy link
Contributor Author

yes on stm32

@benpicco
Copy link
Contributor

Could also be an out of bounds write on the Ethernet driver, corrupting the pktbuf free list.

can you try with #18227

@fabian18
Copy link
Contributor Author

can you try with #18227

CFLAGS+=-DCONFIG_GNRC_PKTBUF_CHECK_USE_AFTER_FREE=1 makes no difference, the same assertion is triggered.

@benpicco
Copy link
Contributor

What 'works' surprisingly well is

--- a/sys/net/application_layer/nanocoap/sock.c
+++ b/sys/net/application_layer/nanocoap/sock.c
@@ -174,7 +174,7 @@ ssize_t nanocoap_sock_request_cb(nanocoap_sock_t *sock, coap_pkt_t *pkt,
         case STATE_RESPONSE_RCVD:
         case STATE_RESPONSE_OK:
             if (ctx == NULL) {
-                DEBUG("nanocoap: waiting for response (timeout: %"PRIu32" µs)\n",
+                printf("nanocoap: waiting for response (timeout: %"PRIu32" µs)\n",
                       _deadline_left_us(deadline));
             }
             const void *old_ctx = ctx;

However I don't see an immediate re-transmission in the error case, instead the node appears to wait for the timeout and never gets the response. I do see the response in Wireshark however when sniffing directly on the node's interface. (DOSE, with #18163)

@benpicco
Copy link
Contributor
benpicco commented Aug 8, 2022

So my issue was caused by a different thread being scheduled every 10 ms (and calling ztimer_set each time), which in turn caused the UART RX interrupt to be delayed enough that it would only be executed after another byte was received already. The lost byte caused a CRC mismatch and the frame to be dropped, even though it was complete on the wire.

Adding the printf (or a busy loop, really) before calling sock_udp_recv_buf() meant that the other thread did not get scheduled and the RX interrupts would occur in time. But the proper fix is of course to pause all other threads when downloading firmware updates (since improving RIOTs real-time grantees is certainly no easy task).

I assume the behavior you saw was then caused by issue described in #18416 - so can we close this now?

@fabian18
Copy link
Contributor Author
fabian18 commented Aug 8, 2022

Good news: With #18416 no assertion fires anymore, and it is functioning very well.

But still, on the second attempt to upload a file, some blocks are transmitted twice immediately. With ENABLE_DEBUG=1 it is Block 1 and without it is Block 0.
So there still is something odd, but CoAP makes it work nevertheless.

@fabian18
Copy link
Contributor Author
fabian18 commented Aug 8, 2022

so can we close this now?

I would keep it open, because it is only partially fixed.

@fabian18 fabian18 changed the title nanocoap: problem with [re]transmission behaviour nanocoap: problem with [re]transmission behaviour: (some blocks are sent twice immediately) Aug 8, 2022
@fabian18
Copy link
Contributor Author

fixed with #18969

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: network Area: Networking Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors)
Projects
None yet
Development

No branches or pull requests

3 participants
0