Tuning the Linux Kernel and TCP Parameters with Sysctl

Posted on April 25, 2017 at 1:10 pm

There are many guides online about Linux kernel and TCP tuning. In this post I have tried to collect the most useful and detailed tips from the best of them, aimed at scaling a Linux server so it can handle more concurrent connections.

Some time ago I wrote about optimizing Linux Sysctl.conf parameters.

This is a more advanced post about Linux TCP and Kernel optimization.

Sysctl.conf Optimization

This is the /etc/sysctl.conf file I use on my servers (Debian 8.7.1). I have included references and personal comments:

# Increase number of max open-files
fs.file-max = 150000
# Increase max number of PIDs
kernel.pid_max = 4194303
# Increase range of ports that can be used
net.ipv4.ip_local_port_range = 1024 65535
# https://tweaked.io/guide/kernel/
# Forking servers, like PostgreSQL or Apache, scale to much higher levels of concurrent connections if this is made larger
# https://tweaked.io/guide/kernel/
# Various PostgreSQL users have reported (on the postgresql performance mailing list) gains up to 30% on highly concurrent workloads on multi-core systems
kernel.sched_autogroup_enabled = 0
# https://github.com/ton31337/tools/wiki/tcp_slow_start_after_idle---tcp_no_metrics_save-performance
# Avoid falling back to slow start after a connection goes idle
# https://github.com/ton31337/tools/wiki/Is-net.ipv4.tcp_abort_on_overflow-good-or-not%3F
# Enable TCP window scaling (enabled by default)
# https://en.wikipedia.org/wiki/TCP_window_scale_option
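# Enables fast recycling of TIME_WAIT sockets.
# (Use with caution according to the kernel documentation! It is known to break
# connections from clients behind NAT, and the option was removed in Linux 4.12.)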
net.ipv4.tcp_tw_recycle = 1
# Allow reuse of sockets in TIME_WAIT state for new connections
# only when it is safe from the network stack’s perspective.
net.ipv4.tcp_tw_reuse = 1
# Turn on SYN-flood protections
# Only retry creating TCP connections twice
# Minimize the time it takes for a connection attempt to fail
# How many retries TCP makes on data segments (default 15)
# Some guides suggest to reduce this value
# Optimize connection queues
# https://www.linode.com/docs/web-servers/nginx/configure-nginx-for-optimized-performance
# Increase the number of packets that can be queued
net.core.netdev_max_backlog = 3240000
# Max number of "backlogged sockets" (connection requests that can be queued for any given listening socket)
net.core.somaxconn = 50000
# Increase max number of sockets allowed in TIME_WAIT
net.ipv4.tcp_max_tw_buckets = 1440000
# Max number of remembered connection requests (SYN received) that have not yet been acknowledged by the connecting client
net.ipv4.tcp_max_syn_backlog = 3240000
# TCP memory tuning
# View memory TCP actually uses with: cat /proc/net/sockstat
# *** These values are auto-created based on your server specs ***
# *** Edit these parameters with caution because they will use more RAM ***
# Changes suggested by IBM on https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/Welcome%20to%20High%20Performance%20Computing%20%28HPC%29%20Central/page/Linux%20System%20Tuning%20Recommendations
# Increase the default socket buffer read size (rmem_default) and write size (wmem_default)
# *** Maybe recommended only for high-RAM servers? ***
# Increase the max ancillary (option) buffer size per socket (optmem_max), max socket buffer read size (rmem_max), max socket buffer write size (wmem_max)
# 16MB per socket - which sounds like a lot, but will virtually never consume that much
# rmem_max and wmem_max cap the buffer sizes an application can request with SO_RCVBUF/SO_SNDBUF;
# TCP autotuning is bounded by the max field of tcp_rmem/tcp_wmem instead. optmem_max is unrelated to tcp_mem.
# Configure the Min, Pressure, Max values
# (tcp_mem is measured in memory pages; tcp_rmem and tcp_wmem are in bytes)
# Useful mostly for very high-traffic websites that have a lot of RAM
# Consider that we already set the *_max values to 16777216
# So you may want to comment out these three lines
net.ipv4.tcp_mem=16777216 16777216 16777216
net.ipv4.tcp_wmem=4096 87380 16777216
net.ipv4.tcp_rmem=4096 87380 16777216
# Keepalive optimizations
# By default, the keepalive routines wait for two hours (7200 secs) before sending the first keepalive probe,
# and then resend it every 75 seconds. If no ACK response is received for 9 consecutive times, the connection is marked as broken. 
# The default values are: tcp_keepalive_time = 7200, tcp_keepalive_intvl = 75, tcp_keepalive_probes = 9
# We decrease the defaults as follows, so a dead peer is detected after 600 + 9 * 10 = 690 seconds instead of 7200 + 9 * 75 = 7875 seconds:
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 9
# The TCP FIN timeout is the time an orphaned connection (one the application has already closed) stays in FIN_WAIT_2 before the kernel drops it and its port can be reused.
# The default is often 60 seconds, but can normally be safely reduced to 30 or even 15 seconds
# https://www.linode.com/docs/web-servers/nginx/configure-nginx-for-optimized-performance
net.ipv4.tcp_fin_timeout = 7
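
One detail of the TCP memory settings above is easy to miss: net.ipv4.tcp_mem is counted in memory pages, while tcp_rmem and tcp_wmem are counted in bytes. Here is a minimal sketch of the conversion. The helper name pages_to_mib is my own invention (not a standard tool), and it assumes a 4096-byte page if getconf cannot report the real size:

```shell
#!/bin/sh
# Hypothetical helper: convert a tcp_mem value (in pages) to MiB.
# Usage: pages_to_mib PAGES [PAGE_SIZE_BYTES]
pages_to_mib() {
    pages=$1
    # Use the given page size, else ask getconf, else assume 4096 bytes.
    page_size=${2:-$(getconf PAGESIZE 2>/dev/null || echo 4096)}
    echo $(( pages * page_size / 1024 / 1024 ))
}

# The tcp_mem value from the file above, with 4096-byte pages
pages_to_mib 16777216 4096
```

With 4 KiB pages this prints 65536 (MiB), so the tcp_mem line above allows TCP up to 64 GiB, which on most servers means "no limit". To see the values the kernel calculated at boot, run: cat /proc/sys/net/ipv4/tcp_mem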

The following modifications caused many 500 errors, so I removed them:

# Disable TCP SACK (TCP Selective Acknowledgement), DSACK (duplicate TCP SACK), and FACK (Forward Acknowledgement)
# SACK requires enabling tcp_timestamps and adds some packet overhead
# Only advised in cases of packet loss on the network
net.ipv4.tcp_sack = 0
net.ipv4.tcp_dsack = 0
net.ipv4.tcp_fack = 0
# Disable TCP timestamps
# Can have a performance overhead and is only advised in cases where sack is needed (see tcp_sack)

Run “sysctl -p” as root to apply the sysctl changes (I also reboot the server).
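
To confirm that the values were really applied, the running value of each key can be read back from /proc/sys and compared with the file. This is a rough sketch only: the function name check_sysctl_conf and its optional second argument (an alternative root directory instead of /proc/sys, handy for dry runs) are hypothetical, and it only understands simple "key = value" lines.

```shell
#!/bin/sh
# Hypothetical helper: report keys whose running value differs from the file.
# Usage: check_sysctl_conf [CONF_FILE] [PROC_ROOT]
check_sysctl_conf() {
    conf=${1:-/etc/sysctl.conf}
    root=${2:-/proc/sys}
    # Drop comments and blank lines, then split each line on the first "=".
    grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$conf" |
    while IFS='=' read -r key want; do
        # Strip whitespace from the key; collapse whitespace runs in the value.
        key=$(echo "$key" | tr -d '[:space:]')
        want=$(echo $want)
        # net.ipv4.tcp_fin_timeout -> net/ipv4/tcp_fin_timeout
        path="$root/$(echo "$key" | tr . /)"
        if [ ! -r "$path" ]; then
            echo "MISSING $key"
            continue
        fi
        # /proc separates multiple values with tabs; normalize to single spaces.
        have=$(echo $(cat "$path"))
        [ "$have" = "$want" ] || echo "DIFF $key: running='$have' file='$want'"
    done
}

# Example:
# check_sysctl_conf /etc/sysctl.conf
```

No output means every key in the file matches the running kernel. A MISSING line usually means the key does not exist on this kernel version.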

Reduce Disk I/O Requests

Another optimization I have made on my servers is to mount the /webserver partition with “noatime”, which stops the kernel from updating the access time every time a file is read and so reduces disk I/O. Just edit /etc/fstab and add “noatime” to the partition that holds the web server data (vhosts, database, etc.):

UUID=[...] /webserver    ext4    defaults,noexec,nodev,nosuid,noatime        0       2

For the changes to take effect, reboot the server or remount the partition:

mount -o remount /webserver

Use “mount” to verify that /webserver has been remounted with the “noatime” attribute.
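
The same check can be scripted instead of reading the mount output by eye. This is a small sketch that looks up a mount point in /proc/mounts and tests for an option. The function name has_mount_option and its optional third argument (a different mounts file) are hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: succeed if MOUNT_POINT is mounted with OPTION.
# Usage: has_mount_option MOUNT_POINT OPTION [MOUNTS_FILE]
has_mount_option() {
    # /proc/mounts columns: device, mount point, fs type, options, dump, pass
    awk -v mp="$1" -v opt="$2" '
        $2 == mp {
            # Re-evaluate on every match so the last (topmost) mount wins.
            found = 0
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++) if (opts[i] == opt) found = 1
        }
        END { exit(found ? 0 : 1) }
    ' "${3:-/proc/mounts}"
}

# Example:
# has_mount_option /webserver noatime && echo "noatime is active"
```

The function returns a non-zero status both when the option is absent and when the mount point is not found, so it can be used directly in an if statement or a monitoring check.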

You can also disable access-time updates on the / partition and on other partitions.

Disable Nginx Access Log

Reduce disk I/O by disabling web server access logs:

access_log off;

Concurrent Connections Test

This is a screenshot of the concurrent connections handled with the above changes:

I used https://loader.io/ to stress-test the server.

This is a screenshot without any sysctl.conf changes (a lot of 500 errors):

This is a screenshot without the sysctl.conf “TCP memory tuning”:

References and Links

Here are the guides I used to create the sysctl.conf file:

Find detailed information about all TCP variables:

Useful tips about Linux TCP and kernel optimizations:
Optimizing servers – Tuning the GNU/Linux Kernel
Linux System Tuning Recommendations by IBM
Part 1: Lessons learned tuning TCP and Nginx in EC2
Using TCP keepalive under Linux
Kernel: The “Out of socket memory” error
Sysctl tweaks – Sysctl Network tweaks and settings for VMs / VPS
How to Configure nginx for Optimized Performance by Linode

Updated on April 29, 2017 at 6:47 pm
