Compare HAProxy performance on x86_64 and arm64 CPU architectures

Martin Grigorov
Jul 9, 2020

HAProxy 2.2 was released a few days ago, so I decided to run my load tests against it on my x86_64 and aarch64 VMs:

  • x86_64
  • aarch64

Note: the VMs are as close as possible in their hardware capabilities: same type and amount of RAM, same disks, network cards and bandwidth. The CPUs are also as similar as possible, but there are some differences:

  • the CPU frequency: 3000 MHz (x86_64) vs 2400 MHz (aarch64)
  • BogoMIPS: 6000 (x86_64) vs 200 (aarch64)
  • Level 1 caches: 128 KiB (x86_64) vs 512 KiB (aarch64)

Both VMs run Ubuntu 20.04 with the latest software updates.

HAProxy is built from source from the master branch, so it might include a few changes made after the haproxy-2.2 tag was cut!

I've tried to fine-tune it as much as I could by following all the best practices I was able to find in the official documentation and on the web.

The HAProxy config is:

This way HAProxy is used as a load balancer in front of four HTTP servers.

To also use it as an SSL terminator, one just needs to comment out line 34 and uncomment line 35.
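
The full configuration was embedded as a gist in the original post and is not reproduced here. A minimal sketch of what it could look like (the section names, ports and certificate path are my assumptions; only the 8081 backend port appears later in the post):

global
    daemon

defaults
    mode http
    timeout connect 5s
    timeout client 30s
    timeout server 30s

frontend fe_main
    bind :8080
    # bind :8443 ssl crt /etc/haproxy/certs/site.pem  # swap with the line above for TLS termination
    default_backend be_http

backend be_http
    server app1 127.0.0.1:8081 check
    server app2 127.0.0.1:8082 check
    server app3 127.0.0.1:8083 check
    server app4 127.0.0.1:8084 check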

I achieved the best results with the multithreaded setup. As the documentation says, this is the recommended setup anyway, but it also gave me almost twice the throughput! The best results were with 32 threads: the throughput kept increasing from 8 to 16 and from 16 to 32 threads, but dropped with 64 threads.

I've also pinned each thread to stay on the same CPU for its lifetime with cpu-map 1/all 0-7.

The other important setting is the algorithm used to balance between the backends. Just like in Willy Tarreau's tests, leastconn gave the best performance for me.
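
Put together, the tuning directives discussed above amount to something like this (a sketch, placed in the global section and the backend of the config above):

global
    nbthread 32
    cpu-map 1/all 0-7

backend be_http
    balance leastconn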

As recommended in the HAProxy Enterprise documentation, I've disabled irqbalance.
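
On Ubuntu 20.04 irqbalance runs as a systemd service, so disabling it looks like this:

sudo systemctl stop irqbalance
sudo systemctl disable irqbalance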

Finally I’ve applied the following kernel settings:

sudo sysctl -w net.ipv4.ip_local_port_range="1024 65024"
sudo sysctl -w net.ipv4.tcp_max_syn_backlog=100000
sudo sysctl -w net.core.netdev_max_backlog=100000
sudo sysctl -w net.ipv4.tcp_tw_reuse=1
sudo sysctl -w fs.file-max=500000

fs.file-max also goes together with a change in /etc/security/limits.conf:

root soft nofile 500000
root hard nofile 500000
* soft nofile 500000
* hard nofile 500000

For the backend I used very simple HTTP servers written in Golang. They just write "Hello World" back to the client, without doing any additional reading/writing from/to disk or network:
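
The server code itself was embedded as a gist in the original post. A minimal sketch of such a server (reading the listening port from the PORT environment variable, as used later with numactl) could look like this:

package main

import (
	"fmt"
	"net/http"
	"os"
)

func main() {
	// The listening port comes from the environment, e.g. PORT=8081
	port := os.Getenv("PORT")
	if port == "" {
		port = "8080"
	}

	// Every request gets the same static response; no disk access and no extra network calls
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "Hello World")
	})

	if err := http.ListenAndServe(":"+port, nil); err != nil {
		panic(err)
	}
}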

As a load-testing client I used WRK with the same setup as for testing Apache Tomcat.
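
The exact parameters are the ones from that Tomcat test; a typical WRK invocation looks along these lines (the thread count, connection count, duration and host below are placeholders, not the actual values):

wrk -t 8 -c 400 -d 60s http://<haproxy-host>:8080/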

And now the results:

  • aarch64, HTTP
  • x86_64, HTTP
  • aarch64, HTTPS
  • x86_64, HTTPS

What we see here is:

  • HAProxy is almost twice as fast on the x86_64 VM as on the aarch64 VM!
  • TLS offloading decreases the throughput by around 5-8%

Update 1 (Jul 10 2020): To check whether the Golang-based HTTP servers are the bottleneck in the above testing, I decided to run the same WRK load tests directly against one of the backends, i.e. skipping HAProxy.

  • aarch64, HTTP
  • x86_64, HTTP

Here we see that the HTTP server running on aarch64 is around 30% faster than on x86_64!

And the more important observation is that the throughput is several times better when not using a load balancer at all! I think the problem here is in my setup: both HAProxy and the four backend servers run on the same VM, so they fight for resources! I will pin the Golang servers to their own CPU cores and let HAProxy use only the other four CPU cores! Stay tuned for an update!

Update 2 (Jul 10 2020):

To pin the processes to specific CPUs I will use numactl.

$ numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 16012 MB
node 0 free: 170 MB
node distances:
node   0
  0:  10

I’ve pinned the Golang HTTP servers with:

numactl --cpunodebind=0 --membind=0 --physcpubind=4 env PORT=8081 go run etc/haproxy/load/http-server.go

i.e. this backend instance is pinned to CPU node 0 and to physical CPU 4. The other three backend servers are pinned respectively to physical CPUs 5, 6 and 7.

I've also slightly changed the HAProxy configuration:

nbthread 4
cpu-map 1/all 0-3

i.e. HAProxy will spawn 4 threads and they will be pinned to physical CPUs 0–3.

With these changes the results stayed the same for aarch64:

but dropped for x86_64:

and the same for HTTP (no TLS):

  • aarch64
  • x86_64

So now HAProxy is a bit faster on aarch64 than on x86_64, but still far slower than the "no load balancer" approach with its 120 000+ requests per second.

Update 3 (Jul 10 2020): After seeing that the performance of the Golang HTTP server is so good (120-160K reqs/sec), and to simplify the setup, I decided to remove the CPU pinning from Update 2 and to use the backends from the other VM: when WRK hits HAProxy on the aarch64 VM, HAProxy load balances between the backends running on the x86_64 VM, and when WRK hits HAProxy on the x86_64 VM, it uses the Golang HTTP servers running on the aarch64 VM. And here are the new results (the config change itself is sketched right after them):

  • aarch64, HTTP
  • x86_64, HTTP
  • aarch64, HTTPS
  • x86_64, HTTPS
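
The Update 3 change itself only requires pointing the backend server lines at the other VM's address instead of localhost, e.g. (reusing the placeholder names from the sketch above; 192.0.2.10 stands in for the other VM's IP):

backend be_http
    balance leastconn
    server app1 192.0.2.10:8081 check
    server app2 192.0.2.10:8082 check
    server app3 192.0.2.10:8083 check
    server app4 192.0.2.10:8084 check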

Update 4 (Jul 16 2020):

Thanks to the comments by Willy Tarreau (the HAProxy creator!) I was able to further improve the performance results on both VMs:

Removing the nbthread and cpu-map settings and letting HAProxy pick the best values on its own brought this improvement:

  • aarch64, HTTP: 16688.53 (value from Update 3) -> 19446.49
  • x86_64, HTTP: 25908.50 -> 32309.90
  • aarch64, HTTPS: 16821.60 -> 19049.35
  • x86_64, HTTPS: 30376.95 -> 31555.68

Removing option http-server-close (this one was there by mistake) improved things even more:

  • aarch64, HTTP: 19446.49 -> 25046.17
  • x86_64, HTTP: 32309.90 -> 45398.78
  • aarch64, HTTPS: 19049.35 -> 25003.79
  • x86_64, HTTPS: 31555.68 -> 41769.52

The biggest improvement came from replacing the command-line option -d (debug mode) with -D (daemon mode). A very important typo! In debug mode HAProxy was producing around 2 GB of logs, which I was redirecting to a file on disk to avoid the constant repainting of the console. Changing the log levels to err, crit or emerg didn't help, so initially I started redirecting the logs to /dev/null, and this increased the performance a lot (numbers below). Then I asked in the HAProxy forums and the issue (debug mode) was pointed out!
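
The corrected way to start it is along these lines (the config path is an assumption):

haproxy -D -f /etc/haproxy/haproxy.cfg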

Running in non-debug mode and without writing to the disk led to these nicer results:

  • aarch64, HTTP: 25046.17 -> 90534.57
  • x86_64, HTTP: 45398.78 -> 64960.36
  • aarch64, HTTPS: 25003.79 -> 82475.74
  • x86_64, HTTPS: 41769.52 -> 74960.51

We see two interesting things here:

  1. aarch64 started performing better than x86_64 for both HTTP and HTTPS! Both VMs use the same kind of disks (same manufacturer, same model, same IOPS), but for some reason Ubuntu 20.04 writes slower on aarch64 than on x86_64. I will run some disk I/O benchmark tool to verify this (a possible run is sketched after this list)!
  2. x86_64 gives better results for HTTPS than for HTTP. I have no explanation for this at the moment; decrypting the data should add processing time, not reduce it.
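
A possible disk benchmark run for point 1, using fio with example job parameters, to be repeated on both VMs and compared:

fio --name=seqwrite --rw=write --bs=1M --size=1G --numjobs=1 --directory=/tmp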

Happy hacking and stay safe!
