HELIOS Tech Info #154
Fri, 10 Jan 2014
10 Gb Ethernet tuning in VMware ESX/Mac/Linux environments
Summary
This Tech Info gives tuning advice for 10 Gb Ethernet environments to enable optimum performance. By default, 10 Gb Ethernet already works well for most usage cases. However, when the fastest end-to-end file server performance is needed, a proper setup with optimized parameters can boost throughput considerably and come close to the 10 Gb limit on the server.
Today's clients and servers usually use 1 Gb Ethernet, which allows data transfers of about 100 MByte/sec. Every client can easily saturate 1 Gb Ethernet, so 10 Gb Ethernet is needed to raise performance. At minimum, servers should be connected to the switch via 10 Gb Ethernet to serve the majority of 1 Gb clients. For workstations, 10 Gb Ethernet makes sense when file server transfer rates of more than 50 MB/sec are needed, e.g. 350 MB/sec for uncompressed HD video editing directly from a HELIOS server volume.
The following table shows maximum throughput in different Ethernet technologies:
Network           | MB/s | PC clients handling this
====================================================
10 Mbit Ethernet  |    1 | Every PC in the last 20 years
100 Mbit Ethernet |   10 | Every PC in the last 15 years
1 Gbit Ethernet   |  100 | 500 MHz PCs in the last 10 years
10 Gbit Ethernet  | 1000 | Multicore PCs since 2010
This table clearly shows that today's PCs can easily utilize Ethernet networks delivering hundreds of megabytes per second. For medium and larger networks, 10 Gb Ethernet is required to keep up with client performance.
This Tech Info is suitable for 10 Gb Ethernet server to 10 Gb client usage cases, e.g.:
- Full HD video editing directly from the server (uncompressed video)
- Using very large files, e.g. VM images, directly from the server
- Copying very large files between server and workstation
- Running backups between workstation and server
- High-performance computing that requires accessing a large number of shared files
- Server-to-server file synchronization or backups
We expect that this Tech Info provides good advice for many 10 Gb Ethernet environments. We have focused on testing VMware ESX/Mac/Linux environments. The various settings can in general also be applied to other setups.
Please note: This tuning should only be applied in 10 Gb Ethernet environments; 1 Gb Ethernet works fine out of the box. Wrong buffer or packet sizes can reduce performance or introduce incompatibilities.
We look forward to seeing more 10 Gb Ethernet networks.
Table of Contents
- Equipment Used for Testing
- Jumbo Frames
- AFP Server DSI Block Size
- Server and Client Network TCP Buffer & Tuning
- References
Equipment Used for Testing
- Client
  - Mac Pro, 2x 2.8 GHz quad-core Intel Xeon CPUs
  - 2 GB memory
  - Small Tree 10 GbE card (Intel 8259x), then current driver v. 2.20
  - 10 GbE network card is available in PCI slot 2 and known to the OS as "en3"
- Server
  - IBM X3650 with VMware ESXi 5
  - QLogic 10 GbE network card
  - IBM built-in PCI RAID-5 storage
- VM server setup
  - Virtualized HELIOS Virtual Server Appliance (VSA) with 4 CPUs
  - 2 GB memory
  - 10 GbE network card "vmxnet3" is available in slot 0 and known to the OS as "eth0"
- Test utility
  - HELIOS LanTest network performance and reliability testing utility
Jumbo Frames
- Benefit/drawback
Jumbo frames are Ethernet frames with more than 1500 bytes of payload.
In this Tech Info we will use the phrase "jumbo frame" for Ethernet frames of 9000 bytes payload.
The benefit of using jumbo frames is usually less CPU usage, less processing overhead, and the potential of higher network throughput.
There are no drawbacks to using jumbo frames, but careful implementation is required.
In order to use jumbo frames, not only the server and client, but also all the network entities in between, must support jumbo frames of the same size. This includes routers and switches, as well as Virtual Machine logical switches.
- Configuration of Mac, Linux, VMware
Before reconfiguring any of the client or server network interfaces, you should make sure that basic TCP connectivity is working. See the connectivity test sections (below) for more details.
- Mac configuration
Setting network card specific parameters can be done via the "System Preferences":
System Preferences > Network > <select card | e.g.: PCI Ethernet Slot 2> > Advanced > Hardware
There you can change several parameter settings. The values should be similar to the ones below:
Configure : Manually
Speed : 10 GbaseSR
Duplex : full-duplex,flow-control
MTU : Custom
enter value : 9000
Confirm with "OK"/"Apply".
Switch back once to verify that the settings were applied.
If the default settings show up again, you need to configure them using "ifconfig".
Another option is to set network card specific options in a Terminal session with "ifconfig".
This must be done as the superuser:
# sudo ifconfig en3 mtu 9000
Please note: "ifconfig" will set the "mtu" only temporarily and after a boot this setting will be gone.
You can set this permanently by adding the above "ifconfig" sequence to one of the boot scripts.
For example, you can add the line "ifconfig en3 mtu 9000" at the end of "/etc/rc.common".
You can use an editor like "vi" or "pico" to edit the file.
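For example, assuming the interface "en3" from our test setup, the following Terminal commands append the setting to "/etc/rc.common" and check the active MTU (a sketch only; adjust the interface name to your card):
# sudo sh -c 'echo "ifconfig en3 mtu 9000" >> /etc/rc.common'
# ifconfig en3 | grep mtu
The second command should report "mtu 9000" once the setting is active.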
- Linux configuration
On Linux you can set network card specific options in a "Terminal" session with "ifconfig". This must be done as the super user:
# ifconfig eth0 mtu 9000
Please note: "ifconfig" will set the "mtu" only temporarily and after a boot this setting will be gone.
You can set this permanently by adding the above "ifconfig" sequence to one of the boot scripts.
On the VSA you can also add the "mtu" change to the "iface" block of the "eth0" interface in the network configuration file "/etc/network/interfaces".
The block would look similar to this one:
iface eth0 inet static
    address 192.168.2.1
    netmask 255.255.255.0
    mtu 9000
You can use an editor like "vi" or "pico" to edit the file.
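After changing the configuration and re-activating the interface, you can verify the MTU from a shell, e.g. for the "eth0" interface of our VSA setup:
# ifconfig eth0 | grep -i mtu
or, with the newer "ip" tool:
# ip link show eth0
Both should report an MTU of 9000.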
- VMware configuration
By default, an Intel E1000 card driver is installed for a Linux VM.
We also tested the VMware "vmxnet3" driver and found it to perform slightly better.
In order to get this driver, you have to install and configure the VMware tools in the Linux VM first.
- Configure the "vmxnet3" driver in vSphere
In the "vSphere" client, choose the ESXi 5 host on which the virtual machine is running. Select the virtual machine, and in tab "Summary" choose "Edit Settings".
In the "Hardware" tab choose the button "Add", select "Ethernet Adapter" and continue with "Next>". As "Adapter Type" choose "VMXNET 3" and select your "Network Connection" you want to connect this adapter to.
Then continue with Next >
, verify the displayed settings and Finish
.
- Setting MTU to 9000
In the "vSphere" client, choose the ESXi 5 host on which the virtual machine is running. Then choose the tab "Configuration" and from the "Hardware" list box select the "Networking" item.
Identify the "vSwitch" connected to the 10GbE network and open its "Properties" configuration. Then edit the "vSwitch" configuration "Advanced Properties"/"MTU" and change its value to 9000.
- Basic TCP connectivity test
From both client and server, verify that the other machine is reachable via "ping":
client # ping 192.168.2.1
PING 192.168.2.1 (192.168.2.1): 56 data bytes
64 bytes from 192.168.2.1: icmp_seq=0 ttl=64 time=0.354 ms
64 bytes from 192.168.2.1: icmp_seq=1 ttl=64 time=0.162 ms
..
server # ping 192.168.2.2
PING 192.168.2.2 (192.168.2.2): 56 data bytes
64 bytes from 192.168.2.2: icmp_seq=0 ttl=64 time=0.358 ms
64 bytes from 192.168.2.2: icmp_seq=1 ttl=64 time=0.232 ms
..
- Jumbo packet TCP connectivity test
When basic TCP connectivity is working, and the MTU has been adjusted to 9000 on server, client and all TCP network entities in between, verify also that packets with a payload of 9000 bytes can be exchanged.
For this, call "ping" with a "size" value and optionally a count, like this:
client # ping -c 2 -s 9000 192.168.2.1
PING 192.168.2.1 (192.168.2.1): 9000 data bytes
9008 bytes from 192.168.2.1: icmp_seq=0 ttl=64 time=0.340 ms
9008 bytes from 192.168.2.1: icmp_seq=1 ttl=64 time=0.292 ms
--- 192.168.2.1 ping statistics ---
2 packets transmitted, 2 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 0.292/0.316/0.340/0.024 ms
server # ping -c 2 -s 9000 192.168.2.2
PING 192.168.2.2 (192.168.2.2) 9000(9028) bytes of data.
9008 bytes from 192.168.2.2: icmp_req=1 ttl=64 time=0.240 ms
9008 bytes from 192.168.2.2: icmp_req=2 ttl=64 time=0.244 ms
--- 192.168.2.2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.240/0.242/0.244/0.002 ms
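Please note that the IP and ICMP headers add 28 bytes, so a "ping" with a 9000-byte payload is actually sent fragmented even with a 9000-byte MTU. To verify that a full 9000-byte frame really passes unfragmented, you can set the "don't fragment" flag and use a payload of 8972 bytes (9000 minus 28). A sketch, using the flags of the Mac OS X and Linux ping variants respectively:
client # ping -c 2 -D -s 8972 192.168.2.1
server # ping -c 2 -M do -s 8972 192.168.2.2
If any device in the path does not accept jumbo frames, these pings will fail with a "message too long" or fragmentation error.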
AFP Server DSI Block Size
- What is the DSI block size?
The so-called DSI block size, or "request quantum", is exchanged by AFP client and server when a new AFP session is established.
It specifies the maximum amount of data to be processed with one single DSI command.
By default, EtherShare offers a value of 128 KB. If necessary, this value can be increased.
- Why increase its value?
This would make sense when there are applications that read or write larger amounts of data with one call.
If server and clients are sufficiently fast, and the TCP connection is also fast and reliable, increasing the "dsiblocksize" can result in a much higher throughput.
- Adjust DSI block size preference for EtherShare
This is done via the HELIOS "prefvalue" command.
For example, this command sequence sets the "dsiblocksize" preference to 1 MB (1024*1024):
# prefvalue -k Programs/afpsrv/dsiblocksize -t int 1048576
Thereafter a restart of the HELIOS "afpsrv" is required:
# srvutil stop -f afpsrv
# srvutil start -f afpsrv
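The same command can be used to try other block sizes; only the byte value changes. For example, a more conservative 512 KB (512*1024) would be set like this, again followed by the "afpsrv" restart shown above:
# prefvalue -k Programs/afpsrv/dsiblocksize -t int 524288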
Server and Client Network TCP Buffer & Tuning
- Client
On Mac OS X, kernel parameters of interest are:
Name                   | Default | Max     | Tuned
====================================================
kern.ipc.maxsockbuf    | 4194304 | 4194304 | 4194304
net.inet.tcp.recvspace | 131072  | 3727360 | 1000000
net.inet.tcp.sendspace | 131072  | 3727360 | 3000000
We conducted our tests with the values from the "Tuned" column.
Current values can be displayed with "sysctl", e.g.:
# sysctl net.inet.tcp.recvspace
net.inet.tcp.recvspace: 131072
You can also list multiple parameters with one single call to "sysctl":
# sysctl kern.ipc.maxsockbuf net.inet.tcp.recvspace net.inet.tcp.sendspace
kern.ipc.maxsockbuf: 4194304
net.inet.tcp.recvspace: 131072
net.inet.tcp.sendspace: 131072
You can set new values with option "-w", e.g.:
# sysctl -w net.inet.tcp.recvspace=1000000
net.inet.tcp.recvspace: 1000000
Please note: These changes are only temporary and will fall back to the defaults after the next boot.
In order to make these changes permanent, you either have to set them after each boot process with "sysctl -w", or specify them in the configuration file "/etc/sysctl.conf".
By default, this file does not exist on Mac OS X 10.8/10.9.
You can create it with an editor like "pico" or "vi".
Then enter the parameter=value tuples like this:
kern.ipc.maxsockbuf=4194304
net.inet.tcp.recvspace=1000000
net.inet.tcp.sendspace=3000000
Please note: You must NOT enter the "sysctl -w" command before the parameter=value entry.
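For a quick test before making the change permanent, the complete tuned set from the table above can be applied to the running system in one go (a sketch using the values we tested with; adjust them to your own measurements):
# sysctl -w kern.ipc.maxsockbuf=4194304
# sysctl -w net.inet.tcp.recvspace=1000000
# sysctl -w net.inet.tcp.sendspace=3000000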
- Server
On Debian Linux, kernel parameters of interest are:
Name              | Default | Tuned
========================================
net.core.wmem_max | 131071  | 12582912
net.core.rmem_max | 131071  | 12582912

Name              | Default            | Tuned
=====================================================
net.ipv4.tcp_rmem | 4096 87380 4194304 | 4096 87380 12582912
net.ipv4.tcp_wmem | 4096 16384 4194304 | 4096 16384 12582912
Current values can be displayed with "sysctl", e.g.:
# sysctl net.core.wmem_max
net.core.wmem_max: 131071
You can also list multiple parameters with one single call to "sysctl":
# sysctl net.core.wmem_max net.core.rmem_max net.ipv4.tcp_rmem net.ipv4.tcp_wmem
net.core.wmem_max = 131071
net.core.rmem_max = 131071
net.ipv4.tcp_rmem = 4096 87380 4194304
net.ipv4.tcp_wmem = 4096 16384 4194304
You can set new values with the option "-w", e.g.:
# sysctl -w net.core.wmem_max=12582912
net.core.wmem_max: 12582912
# sysctl -w 'net.ipv4.tcp_rmem=4096 87380 12582912'
net.ipv4.tcp_rmem = 4096 87380 12582912
Don't forget the single quotes around the 'parameter=value' pair.
These changes are only temporary and will fall back to the defaults after the next boot.
In order to make these changes permanent, you either have to set them after each boot process with "sysctl", or specify them in the configuration file "/etc/sysctl.conf".
By default this file does exist on Debian Linux 6.0.6.
You can edit it with an editor like "pico" or "vi".
Then enter the parameter=value tuples like this:
net.core.wmem_max = 12582912
net.core.rmem_max = 12582912
net.ipv4.tcp_rmem = 4096 87380 12582912
net.ipv4.tcp_wmem = 4096 16384 12582912
Please note: You must NOT enter the "sysctl -w" command before the parameter=value entry.
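Again, for a quick test before editing "/etc/sysctl.conf", the tuned values can be applied to the running system in one go (values as tested on our VSA):
# sysctl -w net.core.wmem_max=12582912
# sysctl -w net.core.rmem_max=12582912
# sysctl -w 'net.ipv4.tcp_rmem=4096 87380 12582912'
# sysctl -w 'net.ipv4.tcp_wmem=4096 16384 12582912'
Alternatively, after editing "/etc/sysctl.conf" you can load the file immediately with "sysctl -p".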
- Server network card settings
Depending on the network card and driver used, it is possible to adjust certain card characteristics.
Of interest here are the hardware RX/TX ring buffers. The defaults are:
# ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096
Current hardware settings:
RX: 288
RX Mini: 0
RX Jumbo: 0
TX: 512
We increased these to 1024:
# ethtool -G eth0 rx 1024
# ethtool -G eth0 tx 1024
Please note: These changes are only temporary and will fall back to the defaults after the next boot.
In order to make these changes permanent, you either have to set them after each boot process with "ethtool", or add them to one of the boot scripts.
As we tested on the VSA, we set up a HELIOS startup script which is called during every start/stop of HELIOS.
The HELIOS startup scripts reside in "/usr/local/helios/etc/startstop", and we added a script "01eth0bufs", that runs "ethtool" and sets the number of card buffers each time HELIOS is started.
========================================
#!/bin/sh
## set eth0 10gb rx and tx buffers
# /sbin/ethtool -G eth0 rx 1024
# /sbin/ethtool -G eth0 tx 1024

case "$1" in
    pre-start)
        # raise the rx/tx ring buffers before HELIOS starts
        /sbin/ethtool -G eth0 rx 1024 > /dev/null 2>&1
        /sbin/ethtool -G eth0 tx 1024 > /dev/null 2>&1
        ;;
    post-start)
        ;;
    pre-stop)
        ;;
    post-stop)
        # set the ring buffers back after HELIOS has been stopped
        /sbin/ethtool -G eth0 rx 348 > /dev/null 2>&1
        /sbin/ethtool -G eth0 tx 548 > /dev/null 2>&1
        ;;
    *)
        echo "Usage: $0 { pre-start | post-start | pre-stop | post-stop }"
        exit 1
        ;;
esac
exit 0
========================================
After you have created this script, make sure to make it executable, e.g. "chmod a+x 01eth0bufs".
- PCI bus performance on Mac OS X
If you don't get near the expected throughput, verify that the PCI card is using the maximal "Link" settings. HELIOS LanTest can be used to test throughput.
During testing we found that a cold boot is required for the PCI Ethernet card to reach the maximum "Link Width" of 8.
A warm reboot of the same OS X version, or rebooting between different installed OS X versions, may result in "Link Width" values as low as 1.
With that setting you won't be able to achieve 10 GbE throughput.
Check at:
About This Mac > More Info > System Report > Hardware > PCI-Cards > ethernet
that the values are similar to these:
Link Width: x8
Link Speed: 2.5 GT/s
The absolute values may vary depending on the used Ethernet card.
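The same information can also be read in a Terminal session via "system_profiler" (a sketch; the exact wording of the entries depends on the installed card and OS X version):
# system_profiler SPPCIDataType
Look for the "Link Width" and "Link Speed" lines of the 10 GbE card in the output.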
Please note: The server and client network TCP buffer tuning above was determined by testing in multiple cycles to find out which buffer configurations offer the best and most reliable read/write performance. Results may vary with different operating systems and different 10 Gb Ethernet NICs and drivers. Wrong network tuning can also result in slower performance or even a faulty or failing network.
References