latency packets can be "lost" at high packet rates #8
Comments
Yes, I've unfortunately seen this exact behavior before. It doesn't happen on X550 or i40e NICs, so it looks like a hardware issue and we can't do anything about it :( Let me know if you find anything useful or a work-around. Software timestamping should work.
BTW, I tried this on i40e, and I still lose some packets: 5.8% at 19.84 Mpps. I will try a software timestamping approach to see if these packets are truly being lost.
Interesting result; I only tested it on an X710 10 Gbit i40e NIC at 14.88 Mpps (since that was the only one currently in a loop-back config in our lab).
I just talked to Franck Baudin at the DPDK Summit about this issue and he stressed how important this is for you guys in the OPNFV project. I unfortunately don't have access to any directly connected ixgbe or XL710 NICs at the moment, so I'll set up a test system with a direct connection between two ixgbe ports in my lab on Monday. I've just tested an X710 (i40e 10 GbE NIC) and that NIC works fine.
Okay, I've set up a few loopback connections and found the following:
Specific to 82599 NICs:
Specific to XL710 NICs:
To conclude:
So I believe this is not a big problem; it merely reduces the sample rate for timestamps at high packet loads. Use the device Rx/Tx counters to report throughput and packet loss and ignore non-timestamped packets; this simply means that the sample rate will be lower under full load. Certainly not a good thing, but it doesn't look like we can do better with this hardware. Maybe it's also possible to install an explicit drop filter for non-timestamped packets (like the commented-out :setPromisc(false) call in the example script). However, I'm not sure whether that works and whether the counters still work (they don't with promisc = false, but I think they should with an fdir filter).

BTW: timestamping at full load is not a useful scenario in many cases. For example, if you are forwarding between two ports with the same speed, then buffers might fill up due to short interruptions on the DuT, and it's not possible for the DuT to "catch up" since packets are coming in at the same rate at which they can be sent out. This will be visible as latency that increases over time for no obvious reason.
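For reference, reporting throughput and loss from the NIC's own counters can be done with MoonGen's stats module. Below is a minimal sketch of such a counter task; it assumes the newDevRxCounter/newDevTxCounter helpers from recent MoonGen/libmoon, and exact module and function names may differ in older releases.

```lua
local mg    = require "moongen"   -- called "dpdk" in older MoonGen versions
local stats = require "stats"

-- Report throughput and loss from the device's hardware Rx/Tx counters,
-- independent of how many timestamp samples actually made it through.
local function counterTask(txDev, rxDev)
	local txCtr = stats:newDevTxCounter(txDev, "plain")
	local rxCtr = stats:newDevRxCounter(rxDev, "plain")
	while mg.running() do
		txCtr:update()
		rxCtr:update()
		mg.sleepMillis(1000)
	end
	txCtr:finalize()
	rxCtr:finalize()
	-- loss = total packets seen by the tx counter minus those seen by the rx counter
end
```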
Thanks for all the testing and information. Initially this problem was quite severe around 10 Mpps (losing every single latency packet), but that was on a much older version of MoonGen/DPDK. More recent versions are significantly better, showing only a small percentage of loss. I'll run your test script on the latest code just to be sure I am seeing the same thing.

I agree that timestamping at full load might not be useful if the DUT cannot sustain zero packet loss. However, we tune the DUT quite extensively to obtain zero packet loss, and typically test this for 2 hours, sometimes 12 hours or more. Technically this is not full load, because we need the DUT to process packets at a slightly higher rate than it receives them, so that when some preemption occurs and buffer use increases, the buffer can later be "drained" before the next preemption happens. But at this maximum sustained no-loss rate, we really do want a good characterization of latency.
We have been using timestamper:measureLatency() to measure latency while running bulk network traffic concurrently in the background. Some of these latency packets can get "lost", and I am not sure why this is happening yet. Somewhere around 10 Mpps is where we start seeing loss; at 14.7 Mpps, we get up to 25% loss. During this test, the bulk traffic (14.7 Mpps) has exactly zero loss. The latency packets are sent at about 100 per second.
I have added debug code to measureLatency() to report the different ways loss can happen, and it is always because the time to wait for the packet has expired.
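For reference, the same accounting can also be done from the caller side without patching measureLatency(), assuming it returns nil when the wait for the timestamped packet expires (which matches the behavior described above). A rough sketch, with the queue setup omitted and names taken from the MoonGen example scripts:

```lua
local ts   = require "timestamping"
local hist = require "histogram"
local mg   = require "moongen"  -- called "dpdk" in older MoonGen versions

-- txQueue/rxQueue are the queues dedicated to timestamping.
local function latencyTask(txQueue, rxQueue)
	local timestamper = ts:newTimestamper(txQueue, rxQueue)
	local latencyHist = hist:new()
	local sent, lost = 0, 0
	while mg.running() do
		local lat = timestamper:measureLatency() -- assumed to return nil if the wait expires
		sent = sent + 1
		if lat then
			latencyHist:update(lat)
		else
			lost = lost + 1
		end
	end
	latencyHist:print()
	print(("latency samples: %d sent, %d lost (%.2f%%)"):format(
		sent, lost, 100 * lost / math.max(sent, 1)))
end
```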
So far these tests use 64-byte frames for the bulk traffic and 76 (80) byte frames for the latency packets. If we increase the latency frame size to 124 bytes, we get all of the packets.
The network adapter is an Intel Niantic (82599), and the two ports are connected to each other (no other system involved).
I am wondering if this has anything to do with filtering on the adapter. Have you seen anything like this? I will probably look at software timestamping next to see if the same problem is there.