NSX Deployment network packet loss issues

Last Updated: 5/6/22

Problem:

We deployed NSX-T onto our Dell EMC VxRail cluster and immediately started having network issues (packet loss) that impacted many, but not all, Windows VMs. It was hard to track down because nothing was offline; there were just connection issues.

The problem was easiest to see by running this command: ping -t 127.0.0.1

Let that run for a while, then press Ctrl+Break to see a summary of packet loss. Less than 1% is normal for a working server. If you see 2% to 14%, as I was seeing, then something is wrong. The packet loss could be seen when pinging any address (localhost, gateway, other servers, internet), but only outbound, not inbound from a non-affected VM.
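To compare many VMs, the loss figure can be pulled out of saved ping output rather than read by eye. The following is a minimal sketch, assuming English-language Windows ping output; the function names and the "borderline" band between 1% and 2% are my own assumptions, since the numbers above only cover "less than 1%" and "2% to 14%".

```python
import re

# Matches the counters in the summary line of English-language Windows ping, e.g.
#   Packets: Sent = 200, Received = 184, Lost = 16 (8% loss),
SUMMARY_RE = re.compile(r"Sent = (\d+), Received = (\d+), Lost = (\d+)")


def loss_percent(ping_output: str) -> float:
    """Return the packet loss percentage from the last summary in the output."""
    matches = SUMMARY_RE.findall(ping_output)
    if not matches:
        raise ValueError("no ping summary found in output")
    # Ctrl+Break can print several running summaries; use the most recent one.
    sent, _received, lost = (int(n) for n in matches[-1])
    if sent == 0:
        raise ValueError("summary reports zero packets sent")
    return 100.0 * lost / sent


def classify(loss: float) -> str:
    """Bucket a loss percentage using the thresholds described in this article."""
    if loss < 1.0:
        return "normal"
    if loss >= 2.0:
        return "affected"
    return "borderline"  # assumption: the article does not cover 1% to 2%


if __name__ == "__main__":
    sample = (
        "Ping statistics for 127.0.0.1:\n"
        "    Packets: Sent = 200, Received = 184, Lost = 16 (8% loss),\n"
    )
    loss = loss_percent(sample)
    print(f"{loss:.1f}% loss: {classify(loss)}")
```

The output of ping -t can be redirected to a file on each VM and the file contents passed to loss_percent.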

I tested with tcping to make sure that the problem was not limited to ICMP (ping). Tcping also showed packet loss.
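If tcping is not available, the same kind of check can be approximated with repeated TCP connection attempts. This is only a rough sketch and not how tcping itself is implemented; the function name, parameters, and defaults are assumptions. Any failed connection (timeout or refusal) is counted, so point it at a port that is known to be listening.

```python
import socket
import time


def tcp_loss_percent(host: str, port: int, attempts: int = 100,
                     timeout: float = 1.0, interval: float = 0.0) -> float:
    """Try to open a TCP connection `attempts` times and return the failure rate in percent."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    failures = 0
    for _ in range(attempts):
        try:
            # Open and immediately close the connection; only the handshake matters here.
            with socket.create_connection((host, port), timeout=timeout):
                pass
        except OSError:
            # Covers timeouts, refused connections, and unreachable hosts.
            failures += 1
        if interval > 0:
            time.sleep(interval)
    return 100.0 * failures / attempts
```

For example, tcp_loss_percent("192.0.2.10", 445, attempts=200, interval=0.5) would probe a placeholder address on the SMB port; substitute a server and port from your own environment.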

 

Some affected VMs, but not all of them, also started logging this event in the Windows event log:
Source: AFD
Event ID: 16002
Message: Closing UDP socket with local port number in process is taking longer than expected.

This may be related to VMs that had more network load, but I am unsure.

 

I could not find a pattern among the affected VMs. I looked for VMs on the same host, VMs with the same vmnic type, the same guest OS version, and so on. Linux VMs did seem to be unaffected, but we didn't have enough Linux VMs to test this well.

 

Resolution:

We removed NSX, but the problem continued until we rebooted all affected VMs.

 

We are still not sure what will happen when we re-deploy NSX.

 

UPDATE Fall of 2021:

This was not a problem with the NSX component that gets installed on ESXi. Removing that component and rebooting the VMs does fix the problem, but the real issue was the NSX agent drivers for Windows, which are installed as part of VMware Tools. There is a known filter driver issue that Microsoft and VMware are working on; hopefully it will be resolved by the time you read this. Until then, you have to avoid the NSX driver with certain versions of VMware Tools. There were no VMware KB articles about this when it happened to us, but there are now. There are also scripts to remove this component while keeping the rest of VMware Tools.




