Virtualized networks have their own packet loss problems and are inherently one layer more complex than physical network environments. Packet loss on physical networks generally stems from simple causes: insufficient bandwidth, hardware faults, and queue overflow.
In virtualized environments, however, traffic between virtual machines (VMs) on the same physical host, as well as traffic between VMs and external networks, is affected by at least four hidden risk factors:
Scheduling and queue mechanisms of virtual switches (vSwitch, Linux Bridge, OVS)
Scheduling latency of the hypervisor CPU, which prevents vCPUs from processing network adapter interrupts in a timely manner
Configuration differences among virtual network adapters and their queues (vhost-net, virtio, VMXNET3)
Multi-queue settings of physical network adapters and virtual machine memory management mechanisms (e.g., Copy-on-Write, Memory Ballooning)
Combined, these factors lead to a puzzling scenario: no packet loss is detected inside the VM, yet actual service traffic suffers from packet loss. Only testing methods tailored for virtualization can pinpoint the root causes.
Below are five critical precautions; overlooking any of them easily leads to misjudgment.
Precaution 1: Test multiple paths, not just connectivity to the gateway
Reason
Traffic between VMs on the same host is transmitted via memory copy through the virtual switch and rarely passes through the physical network adapter. It features very low latency and almost no packet loss. In contrast, traffic destined for external networks has to traverse the physical network adapter and the host protocol stack, which presents a completely different set of packet loss risks.
Recommended Practice
Run packet loss tests with at least 1,000 test packets across three scenarios:
Test VM ↔ another VM on the same host (to evaluate virtual switch performance)
Test VM ↔ host gateway (to examine the virtualization layer and physical network adapter ingress)
Test VM ↔ external public IP (to inspect the full end-to-end link)
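The three-path comparison can be scripted. The following is a minimal Python sketch, assuming a Linux VM with iputils ping; the function names and the target addresses in the usage note are hypothetical and not part of the original procedure.

```python
import re
import subprocess

# Summary line printed by ping, for example:
# "1000 packets transmitted, 988 received, 1.2% packet loss, time 1001ms"
_LOSS_RE = re.compile(
    r"\d+ packets transmitted, \d+ (?:packets )?received.*?([\d.]+)% packet loss"
)


def parse_ping_loss(output: str) -> float:
    """Return the packet loss percentage from ping's summary line."""
    match = _LOSS_RE.search(output)
    if match is None:
        raise ValueError("no ping summary line found")
    return float(match.group(1))


def ping_loss(target: str, count: int = 1000) -> float:
    """Run `ping -c COUNT -q TARGET` and return the loss percentage."""
    result = subprocess.run(
        ["ping", "-c", str(count), "-q", target],
        capture_output=True, text=True, check=False,
    )
    return parse_ping_loss(result.stdout)


def compare_paths(targets: dict[str, str], count: int = 1000) -> dict[str, float]:
    """Measure loss for each labelled path, e.g. same-host VM, gateway, external."""
    return {label: ping_loss(addr, count) for label, addr in targets.items()}
```

A call such as compare_paths({"same-host VM": "10.0.0.12", "gateway": "10.0.0.1", "external": "203.0.113.10"}) would return one loss percentage per path; the addresses are placeholders. At ping's default one-second interval, 1,000 probes per path take roughly 17 minutes.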
Typical Case
A user reported slow public network access to their web server, while access to the database (on the same host) remained fast. Ping tests with 1,000 packets showed 0% packet loss between co-located VMs and 1.2% packet loss to external addresses. The root cause was uneven distribution of soft interrupts on the host's physical network adapter, which triggered packet loss on outbound traffic.
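One way to spot the kind of soft interrupt imbalance described in this case is to look at the NET_RX row of /proc/softirqs on the host. The sketch below is an illustration only: the function names and the factor of 2 used as an imbalance threshold are assumptions, not values from the case.

```python
def net_rx_share(softirqs_text: str) -> list[float]:
    """Return each CPU's fraction of NET_RX softirqs from /proc/softirqs text."""
    for line in softirqs_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NET_RX:":
            counts = [int(value) for value in fields[1:]]
            total = sum(counts)
            return [count / total if total else 0.0 for count in counts]
    raise ValueError("NET_RX row not found")


def is_imbalanced(shares: list[float], factor: float = 2.0) -> bool:
    """Heuristic (assumed threshold): one CPU handles more than
    `factor` times its fair share of receive softirqs."""
    fair_share = 1.0 / len(shares)
    return max(shares) > factor * fair_share
```

On a host, the input would come from open("/proc/softirqs").read(). The counters are cumulative since boot, so comparing two samples taken a few seconds apart gives a more current picture than a single read.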
Precaution 2: Use virtio drivers on KVM and avoid e1000 emulation
Reason
The default emulated e1000 network adapter delivers poor performance and is highly prone to packet loss under high packets-per-second (PPS) loads. As a paravirtualized driver, virtio works with vhost-net or vhost-user to move the packet data path out of QEMU, drastically reducing packet loss.
Check Command (inside Linux VM)
plaintext
ethtool -i eth0 | grep driver
If the result shows e1000, switch the VM to a virtio network adapter immediately (the driver is then reported as virtio_net).
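When many VMs need checking, the driver line can be extracted programmatically. This is a small sketch under the assumption that the output of ethtool -i has been captured as text; the helper names and the set of paravirtualized drivers are chosen for illustration.

```python
# Driver names as reported by `ethtool -i` for paravirtualized adapters.
PARAVIRT_DRIVERS = {"virtio_net", "vmxnet3"}


def driver_from_ethtool(output: str) -> str:
    """Return the value of the 'driver:' line in `ethtool -i` output."""
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "driver":
            return value.strip()
    raise ValueError("no driver line found")


def is_paravirtualized(output: str) -> bool:
    """True if the interface uses a paravirtualized driver."""
    return driver_from_ethtool(output) in PARAVIRT_DRIVERS
```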
Performance Gap
In tests using small 64-byte UDP packets, the e1000 adapter suffers a 5% packet loss rate at 5,000 PPS, while virtio maintains a loss rate below 0.1% even at 50,000 PPS.
Precaution 3: Watch out for rate limiting and packet loss caused by security groups and distributed firewalls on VMware
Reason
VMware NSX or vSphere distributed switches may have traffic shaping and security policies enabled. Based on the token bucket algorithm, these policies drop packets as soon as instantaneous traffic exceeds the configured rate and burst allowance, and no related error statistics are recorded inside the VM.
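To see why a token-bucket policer drops bursts while leaving evenly paced traffic untouched, it helps to simulate one. The sketch below is a generic token bucket in Python, not VMware's implementation; the function name, rate, and burst values are hypothetical.

```python
def token_bucket_drops(arrivals: list[float], rate_pps: float, burst: int) -> int:
    """Simulate a token-bucket policer and return the number of dropped packets.

    arrivals: packet arrival times in seconds, in ascending order.
    rate_pps: tokens added per second (one token admits one packet).
    burst:    bucket capacity, i.e. the largest burst admitted at once.
    """
    tokens = float(burst)          # the bucket starts full
    last = arrivals[0] if arrivals else 0.0
    dropped = 0
    for now in arrivals:
        # Refill for the elapsed time, capped at the bucket capacity.
        tokens = min(float(burst), tokens + (now - last) * rate_pps)
        last = now
        if tokens >= 1.0:
            tokens -= 1.0          # packet admitted
        else:
            dropped += 1           # bucket empty: packet dropped silently
    return dropped
```

With a 100 PPS rate and a burst of 100, a burst of 200 packets arriving at the same instant loses 100 of them, whereas the same 100 PPS spread evenly over a second loses none. The sender sees nothing but missing replies, which matches the symptom described above.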
Typical Case
A user observed 0.5% packet loss via ping tests. Packet capture with tcpdump inside the VM confirmed that ICMP Echo Requests were sent normally, but no replies were received. The issue was ICMP rate limiting (capped at 100 PPS by default) enforced by the VMware distributed firewall. Solution: Adjust the security policy or disable ICMP rate limiting.
Verification Methods
Check Security Policy and Traffic Shaping settings for the VM in VMware vCenter.
If policy modification is unavailable, create a temporary port group without restrictions for comparison testing.
Precaution 4: Identify pseudo packet loss caused by CPU contention and vCPU overload
Reason
When the host CPU is overcommitted (e.g., 16 vCPUs scheduled on 4 physical cores) and multiple VMs compete for computing resources, vCPUs may be preempted for tens of milliseconds. In this case, the network interrupt handler inside the VM cannot respond promptly, so the physical network adapter's ring buffer overflows. The packet loss occurs on the host side and leaves no error logs inside the VM.
Detection Methods
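On KVM host: Run virsh domstats <vm-name> --vcpu and monitor the per-vCPU vcpu.<n>.wait values (the --cpu-total group only reports cpu.time, cpu.user, and cpu.system).
On VMware host: Launch esxtop, press c to open the CPU view, and check the %RDY value. A reading above 5% indicates severe CPU contention.
Inside the VM: If the packet loss rate of ping -c 1000 is significantly higher than that of ping -c 10, periodic CPU contention is the likely cause, because a short run tends to miss intermittent vCPU stalls.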
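A complementary in-VM check is to look at how the lost probes are distributed. Losses that arrive as runs of consecutive sequence numbers at intervals are what periodic vCPU stalls would be expected to produce, whereas isolated single losses point elsewhere. This is a heuristic sketch, not a method from the original text; it assumes Linux iputils ping output captured without the -q option, and the function name is hypothetical.

```python
import re

_SEQ_RE = re.compile(r"icmp_seq=(\d+)")


def loss_bursts(ping_output: str, sent: int) -> list[tuple[int, int]]:
    """Return (first_lost_seq, run_length) for each run of consecutive lost probes.

    Only reply lines ("... bytes from ...") count as received; iputils ping
    numbers probes from 1.
    """
    received = set()
    for line in ping_output.splitlines():
        if "bytes from" in line:
            match = _SEQ_RE.search(line)
            if match:
                received.add(int(match.group(1)))

    bursts = []
    run_start = None
    for seq in range(1, sent + 1):
        if seq not in received:
            if run_start is None:
                run_start = seq            # a new run of losses begins
        elif run_start is not None:
            bursts.append((run_start, seq - run_start))
            run_start = None
    if run_start is not None:              # losses continued to the last probe
        bursts.append((run_start, sent - run_start + 1))
    return bursts
```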
Solutions
Configure CPU pinning to bind vCPUs to dedicated physical cores.
Reduce the number of vCPUs to cut scheduling overhead.
Allocate CPU reservations (MHz) for latency-sensitive services on VMware.
Precaution 5: Test with real-world packet sizes instead of only default 64-byte packets
Reason
Virtual network adapters process large and small packets via different paths. Large sends (exceeding the MTU) trigger GSO/TSO offloading, while small packets mainly test PPS processing capability. Testing only with small packets cannot reveal the real packet loss of services such as file transfer.
Recommended Test Packet Sizes
64 bytes: Test the PPS limit and interrupt processing capability
512 bytes: Simulate typical RPC and database traffic
1400 bytes: Test behavior near the MTU threshold, just below the size at which segmentation and reassembly begin
Random sizes: Simulate real business traffic
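The size list above translates into ping invocations as follows. This sketch assumes the sizes refer to the ICMP message size, matching ping's default of 56 data bytes plus an 8-byte ICMP header, which the original text does not state explicitly; the function names, the seed, and the number of random samples are hypothetical.

```python
import random

ICMP_HEADER_BYTES = 8  # ping -s sets the data size; the ICMP header is added on top


def ping_argv(target: str, icmp_size: int, count: int = 1000) -> list[str]:
    """Build a ping command whose ICMP message is `icmp_size` bytes long."""
    payload = icmp_size - ICMP_HEADER_BYTES
    if payload < 0:
        raise ValueError("size is smaller than the ICMP header")
    return ["ping", "-c", str(count), "-s", str(payload), target]


def size_plan(random_samples: int = 3, seed: int = 0) -> list[int]:
    """Fixed sizes from the list above plus a few reproducible random sizes."""
    rng = random.Random(seed)
    return [64, 512, 1400] + [rng.randint(64, 1400) for _ in range(random_samples)]
```

Each argument list returned by ping_argv can be passed to subprocess.run. Seeding the generator keeps the random sizes identical between runs, so results before and after a configuration change remain comparable.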
TSO/GSO Note for KVM
Run the command below to check GSO/TSO status:
plaintext
ethtool -k eth0 | grep -E 'tcp-segmentation-offload|generic-segmentation-offload'
If TSO is disabled, packet segmentation is handled by the VM kernel, which raises CPU usage and increases the risk of packet loss.
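The same check can be automated by parsing the feature list. The sketch below assumes the output of ethtool -k has already been captured as text; the function name and the default feature tuple are illustrative choices.

```python
def offload_states(
    output: str,
    features: tuple[str, ...] = (
        "tcp-segmentation-offload",
        "generic-segmentation-offload",
    ),
) -> dict[str, bool]:
    """Map each requested feature to True (on) or False (off)."""
    states = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        name = key.strip()
        words = value.split()
        # Values look like "on", "off", or "off [fixed]"; only the first word matters.
        if sep and name in features and words:
            states[name] = words[0] == "on"
    return states
```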
Golden 3-Step Workflow for Packet Loss Testing in Virtualization
Step 1: Baseline Test (exclude virtualization layer faults)
Run tests between two VMs on the same physical host:
plaintext
# On VM1 (receiver)
iperf3 -s
# On VM2 (client)
iperf3 -c <VM1_IP> -u -b 1000M -t 30
If the packet loss rate exceeds 0.1%, optimize the virtual switch or vCPU configuration first.
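To apply the 0.1% criterion automatically, the client can be run with iperf3's JSON output option (-J) and the UDP summary read from the report. A minimal sketch follows, assuming the report has been captured as a string; the function names are hypothetical.

```python
import json


def udp_loss_percent(report_json: str) -> float:
    """Compute UDP loss from the end-of-test summary of an `iperf3 -u -J` report."""
    summary = json.loads(report_json)["end"]["sum"]
    packets = summary["packets"]
    return 100.0 * summary["lost_packets"] / packets if packets else 0.0


def baseline_ok(report_json: str, threshold_percent: float = 0.1) -> bool:
    """True if same-host loss stays within the 0.1% baseline threshold."""
    return udp_loss_percent(report_json) <= threshold_percent
```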
Step 2: External Link Test
Use mtr to trace the full path from the host to the external target:
plaintext
mtr -r -c 500 --report-wide <External IP>
Check packet loss at each hop. If loss starts from the host gateway, the fault lies with the physical network adapter or the hypervisor network stack.
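When reading the report, loss that shows up at one intermediate hop but not at later hops usually reflects that router rate-limiting its ICMP replies rather than real forwarding loss. The hop where loss begins and then persists to the final target is the one to investigate. The sketch below encodes that reading rule for mtr's plain-text report; the function names are hypothetical and the parser assumes the default report column layout.

```python
import re

# Matches report rows such as "  4.|-- 198.51.100.1   1.2%   500   8.0 ..."
_HOP_RE = re.compile(r"^\s*(\d+)\.\|--\s+(\S+)\s+([\d.]+)%?\s")


def parse_mtr_report(report: str) -> list[tuple[int, str, float]]:
    """Return (hop_number, host, loss_percent) for each hop row."""
    hops = []
    for line in report.splitlines():
        match = _HOP_RE.match(line)
        if match:
            hops.append((int(match.group(1)), match.group(2), float(match.group(3))))
    return hops


def first_persistent_loss(hops, threshold: float = 0.0):
    """Return the earliest hop from which every later hop also shows loss,
    or None if the final hop is loss-free."""
    start = None
    for hop in reversed(hops):
        if hop[2] > threshold:
            start = hop
        else:
            break
    return start
```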
Step 3: Stress Test (locate the packet loss threshold)
Gradually increase the traffic load (via ping -f, or iperf3 -b stepped from 100M to 1000M), and record the bandwidth/PPS value at which the packet loss rate jumps from 0% to 0.1%. This value is the maximum stable bandwidth under the current configuration.
Case
A KVM VM maintained 0% packet loss at 100M to 200M bandwidth, while loss rose to 0.8% at 300M. Troubleshooting found only 1 queue enabled on the physical network adapter and just 1 vCPU assigned. After enabling 4 queues and assigning a matching 4 vCPUs, the stable bandwidth threshold increased to 800M.
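The threshold search in Step 3 can be expressed in a few lines. The sketch below is a hedged illustration: the function names are hypothetical, and the measure callback stands in for whatever actually runs the load test (for example an iperf3 wrapper) and returns a loss percentage.

```python
from typing import Callable, Optional


def sweep(measure: Callable[[int], float],
          start: int = 100, stop: int = 1000, step: int = 100) -> list[tuple[int, float]]:
    """Run `measure(rate_mbps)` at each step and collect (rate, loss_percent)."""
    return [(rate, measure(rate)) for rate in range(start, stop + 1, step)]


def max_stable_rate(measurements: list[tuple[int, float]],
                    loss_threshold: float = 0.1) -> Optional[int]:
    """Highest rate reached, in ascending order, before loss first hits the threshold."""
    stable = None
    for rate, loss in sorted(measurements):
        if loss >= loss_threshold:
            break
        stable = rate
    return stable
```

Fed with the figures from the case above, (100, 0.0), (200, 0.0), and (300, 0.8), max_stable_rate returns 200.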
Summary: Do's and Don'ts for Virtualized Packet Loss Testing
| Do | Don't |
| --- | --- |
| Test three paths: co-located VM, gateway, and external network | Test only connectivity to the default gateway |
| Adopt virtio/VMXNET3 network drivers | Use emulated e1000 adapters |
| Monitor host CPU ready time | Rely solely on in-VM packet loss statistics |
In virtualized environments, 90% of packet loss issues stem from resource scheduling rather than physical network faults. When facing VM packet loss, first check host CPU and memory contention, then verify virtual network adapter drivers and queue configurations, and finally troubleshoot physical links. Standardized packet loss testing will then reflect the actual service experience in virtualization.