cachegrand can leverage hugepages for several optimizations and, although their use is not mandatory, it's expected to run in production with them enabled.
When hugepages are enabled it's possible to use the Fast Fixed Memory Allocator (FFMA), which is capable of providing a ~10% boost, in some cases even more, thanks to the advanced techniques it puts in place to allocate and free memory of specific fixed sizes.
cachegrand can also leverage hugepages to run its own code from them; this dramatically reduces TLB cache misses and provides up to a 5% speed bump.
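As a minimal sketch, hugepages can be reserved at runtime through the vm.nr_hugepages sysctl; the count below (1024 pages of 2 MiB each on most x86-64 systems, i.e. 2 GiB) is just an illustrative value, the right amount depends on the available memory and the configured workload:
# Reserve 1024 hugepages (2 MiB each on most x86-64 systems)
sudo sysctl -w vm.nr_hugepages=1024
# Confirm how many hugepages have actually been reserved
grep HugePages_Total /proc/meminfo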
Below are a number of tips and suggestions to help improve cachegrand's performance; in general they are useful for any network platform built around cache locality.
Receive Side Scaling, or RSS, is a mechanism provided in hardware by network cards to distribute packets evenly across a number of receive queues using IRQs. The mechanism works great, but data received via a queue can be processed by different CPUs, causing cache misses. To reduce the cache misses it's possible to bind one queue to one specific CPU using /proc/interrupts and /proc/irq/IRQ_NUMBER/smp_affinity_list.
As part of the configuration, it's necessary to disable the irqbalance service.
Replace __NIC__ with the network card interface in use and keep the dash after the name.
# Stop irqbalance so it doesn't override the manual IRQ affinity
sudo systemctl stop irqbalance.service
# Collect the IRQ numbers assigned to the NIC receive queues
export IRQS=($(grep __NIC__\- /proc/interrupts | awk '{print $1}' | tr -d :))
# Pin each queue's IRQ to one CPU (queue 0 -> CPU 0, queue 1 -> CPU 1, ...)
for I in ${!IRQS[@]}; do
    echo $I | sudo tee /proc/irq/${IRQS[I]}/smp_affinity_list > /dev/null;
done;
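To double-check that the affinity has been applied, the IRQS array from the snippet above can be reused to print the CPU each IRQ is now bound to:
# Print the CPU bound to each receive queue IRQ
for I in ${!IRQS[@]}; do
    echo "IRQ ${IRQS[I]} -> CPU $(cat /proc/irq/${IRQS[I]}/smp_affinity_list)";
done;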
The systemctl stop irqbalance.service command may not work as-is on all distributions; it has been tested only on Ubuntu.
It's also important to configure cachegrand to start a worker for each available receive queue, to get the best vertical scalability. It's possible to start more than one worker per queue, or more workers than queues, but of course the benefits will be limited.
Here's a simple snippet to list the number of queues available for your NIC:
grep __NIC__\- /proc/interrupts -c
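If the number of queues doesn't match the number of workers you want to run, many drivers let you resize it through ethtool's channel configuration. This is a sketch under the assumption that the driver exposes combined channels (some drivers expose separate rx/tx counts instead), and 8 is just an example value:
# Show the current and maximum number of channels (queues)
sudo ethtool -l __NIC__
# Resize to 8 combined channels, e.g. to match 8 cachegrand workers
sudo ethtool -L __NIC__ combined 8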
While RSS is a mechanism provided to tune the distribution and locality of the incoming data, Transmit Packet Steering, or XPS, is used for the outgoing packets.
Replace __NIC__ with the network card interface in use.
# Stop irqbalance so it doesn't override the manual queue affinity
sudo systemctl stop irqbalance.service
# Collect the NIC transmit queues in natural order (tx-0, tx-1, ...)
export TXQUEUES=($(ls -1qdv /sys/class/net/__NIC__/queues/tx-*))
# xps_cpus takes a hex CPU bitmask: 2**i sets bit i, mapping tx queue i to CPU i
for i in ${!TXQUEUES[@]}; do
    printf '%x' $((2**i)) | sudo tee ${TXQUEUES[i]}/xps_cpus > /dev/null;
done;
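As with RSS, the mapping can be verified by reading the masks back, reusing the TXQUEUES array from above:
# Print the CPU bitmask assigned to each transmit queue
for i in ${!TXQUEUES[@]}; do
    echo "${TXQUEUES[i]##*/}: $(cat ${TXQUEUES[i]}/xps_cpus)";
done;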
The systemctl stop irqbalance.service command may not work as-is on all distributions; it has been tested only on Ubuntu.
Enabling busy polling lets the kernel actively poll the network device for new packets instead of waiting for interrupts, trading some CPU time for lower latency:
sudo sysctl net.core.busy_poll=1
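sysctl changes like this one don't survive a reboot. A minimal sketch to persist the setting, assuming a hypothetical drop-in file name under /etc/sysctl.d:
# Persist the setting across reboots (the file name is just an example)
echo 'net.core.busy_poll=1' | sudo tee /etc/sysctl.d/99-cachegrand-tuning.conf
# Reload all sysctl configuration files
sudo sysctl --system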
When using Virtual Machines that don't access a physical network card via a VF (Virtual Function), it's possible to turn off the queueing of unprocessed packets, because the virtual driver will normally just accept everything that arrives.
Replace __NIC__ with the network card interface in use.
# Make noqueue the default queueing discipline
sudo sysctl net.core.default_qdisc=noqueue
# Attach the multi-queue (mq) root qdisc; its per-queue children pick up the noqueue default
sudo tc qdisc replace dev __NIC__ root mq
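To confirm the change, the qdisc tree can be listed; each transmit queue should now report the noqueue discipline under the mq root:
tc qdisc show dev __NIC__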
The Generic Receive Offload, or GRO, is a mechanism provided by the kernel to merge packets together at kernel level, to reduce the number of packets processed by the network stack.
While it's normally a very good idea to keep it on, when using cachegrand there might be cases where it's convenient to turn it off, as the vast majority of commands will fit in a single packet (which can be as long as the MTU), further reducing extra processing.
This, though, can have an impact with bigger commands if they are used more often than the others: for example, if the Redis SET command is very often used with blocks of data bigger than 1400 bytes, the network stack will have to process and generate multiple events for these.
sudo ethtool -K __NIC__ gro off
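The current offload state can be checked with ethtool's lowercase -k flag:
# Show whether GRO is currently enabled for the NIC
sudo ethtool -k __NIC__ | grep generic-receive-offload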