The Sophos UTM is a "next generation" firewall (ie, does more than just packet filtering), based on Linux. It is deployed as a physical or virtual appliance with a wide range of licensed features.
Among the licensable features is "Webserver Protection", via application level proxying. It is implemented by associating a "virtual webserver" (front end) with one or more backend ("real") webservers -- and depending on how it is configured, can optionally do load balancing, SSL termination, and "Web Application Firewalling". (All the configuration is under "Webserver Protection" -> "Web Application Firewall", but if there is no firewall profile associated then no meaningful "application firewalling" will take place -- without that firewall it is basically a reverse proxy load balancer.)
Recently a client was experiencing a performance problem with websites hosted behind the "Webserver Protection" of a Sophos UTM. At times (busier periods of site usage, we think) the time to answer various requests would be 10-12 seconds -- while requests made directly to the backend webserver would be answered in fractions of a second. The slowness remained even with "no firewall profile" set (ie, just the reverse proxy). So I took a look under the hood.
The Webserver Protection feature (at least on Sophos UTM version 9.352) is implemented via an Apache HTTPD (2.4.10) server with the mod_proxy functionality enabled (and what seems to be mod_security for some of the "firewall" functionality). The basic reverse proxy configuration seems to be sane, and capable of doing SSL termination if needed and/or load balancing (via configuring multiple backends -- even with a single backend, a mod_proxy_balancer configuration seems to be present in the template).
The Apache HTTPD reverse proxy runs in a chroot under /var/storage/chroot-reverseproxy, with logging via a pipe to /var/log/reverseproxy.log. The configuration for the reverse proxy is in /var/storage/chroot-reverseproxy/usr/apache/conf, including the obvious reverseproxy.conf for the actual proxy configuration. It appears a single set of Apache HTTPD process instances is shared among all websites behind the Sophos UTM Webserver Protection -- so any scalability limitation applies to the aggregate load of all requests for all protected websites.
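If you want to poke around yourself, something like the following (a sketch only, run as root on the UTM, using the paths above) will list the configuration files and show where the worker limits are set:
ls /var/storage/chroot-reverseproxy/usr/apache/conf/
grep -n -E 'ServerLimit|ThreadsPerChild|MaxRequestWorkers' /var/storage/chroot-reverseproxy/usr/apache/conf/mpm.conf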
Watching reverseproxy.log showed that for the affected UTM, around the time that users were noticing problems, there were on the order of 100-200 requests/second being logged. And the request processing times recorded in the log were approximately consistent with the times for directly requesting from the backend server. (Note that logging uses a CustomLog format, defined in httpd.conf, which includes time="%D" -- meaning "The time taken to serve the request, in microseconds." Because the time is in microseconds, 1,000,000 means 1 second, and values in the thousands represent just milliseconds.)
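Since the service time is logged in that key="value" style, a quick way to check how many recent requests were slow at the proxy is to pull out the time= field and compare it against a threshold (a sketch only; the 5,000,000 microsecond cut-off, ie 5 seconds, is an arbitrary choice):
tail -100000 /var/log/reverseproxy.log |
sed 's/^.*time="//; s/".*$//;' |
awk '$1 > 5000000 { slow++ } END { print slow+0, "of", NR, "requests took over 5 seconds" }'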
With an Apache HTTPD server, the next thing to investigate for performance issues is how busy the server instances are -- for which Apache provides mod_status.
Conveniently the Sophos UTM reverse proxy is already configured with mod_status enabled -- in status.conf. However it is only listening on localhost (127.0.0.1), on port TCP/4080. Accessing it requires an ssh port forward, eg:
ssh -L 4080:127.0.0.1:4080 loginuser@UTM
after which you can then access:
http://localhost:4080/status
http://localhost:4080/lb-status
(the first is the main Apache HTTPD status page, and the second is the status of the load balancer instances.)
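mod_status also has a machine-readable form (append "?auto" to the URL), which makes it easy to keep an eye on the busy/idle worker counts from your own machine through the same ssh port forward, eg:
watch -n 5 'curl -s "http://localhost:4080/status?auto" | egrep "BusyWorkers|IdleWorkers"'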
Looking at the main status page revealed the problem:
800 requests currently being processed, 0 idle workers
That is "all workers busy, please hold". In particular due to HTTP
Keep Alive, the workers will remain busy (waiting to serve another
request to a recent client) for the
KeepAliveTimeout
-- which now defaults to 5 seconds, but is set to 15 seconds on
the Sophos UTM (KeepAliveTimeout 15
in httpd.conf
; note that
the MaxKeepAliveRequests 100
is the number of requests on the
connection before it is
closed
not the number of connections that can be waiting in Keep Alive
state).
The limit of 800 workers comes from the configuration of the Apache HTTPD worker MPM, in mpm.conf, which on the Sophos UTM contains:
LoadModule mpm_worker_module /usr/apache/modules/mod_mpm_worker.so
ServerLimit 16
ThreadsPerChild 50
ThreadLimit 50
MaxRequestWorkers 800
MinSpareThreads 25
MaxSpareThreads 75
Sensibly this is larger than what appear to be the stock Apache HTTPD worker MPM defaults, which would result in only 400 workers. Unfortunately for a busy website, or particularly a series of busy websites behind a single Sophos UTM, it clearly is not large enough.
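For reference, the arithmetic behind those two figures (assuming the stock worker MPM defaults are ServerLimit 16 and ThreadsPerChild 25, which is my understanding):
Sophos UTM:      ServerLimit 16 * ThreadsPerChild 50 = 800 workers
Stock defaults:  ServerLimit 16 * ThreadsPerChild 25 = 400 workers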
Once the UTM reverse proxy becomes busy enough, additional requests are queued waiting for a worker -- up to ListenBackLog requests, which defaults to 511 (and appears not to be overridden on the Sophos UTM). Basic statistics suggests that requests near the head of that queue will need to wait on average about half the KeepAliveTimeout (15 seconds) for another connection to time out and a worker to become free -- so around 7-8 seconds. Requests later in the queue would potentially need to wait for 1-3 full connection timeouts, as the default queue is nearly as long as the number of workers (and on a busy site, new requests will be arriving all the time to keep the queue full...).
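To put some very rough numbers on that (a back-of-envelope sketch of my own, assuming the 800 keep-alive expiries are spread evenly across the 15 second timeout and the queue is serviced in order):
800 workers / 15 seconds             ~= 53 workers freed per second
511 queued requests / ~53 per second ~= 10 seconds wait from the back of a full queue
100-200 new requests per second       >  53 per second, so the queue keeps refilling
That is in the same ballpark as the 10-12 second response times that were being observed.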
Having discovered all of this, the question becomes "how many workers are required to handle usual website loads?". If we assume that each webpage requires accessing multiple resources (eg, HTML, CSS, Javascript, Javascript callbacks, images, etc), then it is likely that each client will make multiple connections to the webserver. In most modern web browsers this is up to 6 persistent connections per client; historically the limit used to be 4. For simplicity we can assume that requests by a client with an active worker will be answered promptly (since the backend is still performing well), and thus it will not need to make as many parallel connections as it can to fetch smaller resources. This makes the older "4 connections per client" still a reasonable rule of thumb. (Sites transferring larger files may not be able to make that assumption.)
With 800 workers, and 4 connections (and thus workers) per client, that means that 200 clients (800/4) can be served simultaneously -- ie, in any given "15 second" window. (The actual number of requests will be higher -- but a lot of the "client time" will be that 15 second timeout per connection.) Beyond that number of clients, one or more requests from a client will end up waiting in the backlog queue. Of note, those "clients" are not necessarily "IP addresses" in the request log, because:
multiple machines behind a NAT firewall are separate clients with their own ideas about being able to use 4-6 connections each
multiple processes on each machine are "separate clients", again with their own ideas about being able to use 4-6 connections each
potentially threads in other server processes that do not use connection pooling will also be making repeated new connections
And also of note, things like issuing client side redirects (301, 302) will add to the request load and may cause additional connections to be used (eg, if there are pipelined requests in the first connection that received the redirect).
It is relatively easy to get an idea of how many IP addresses are making requests in a given period, eg from the Sophos UTM graphs in the user interface, or something like:
tail -100000 /var/log/reverseproxy.log |
sed 's/^.*srcip="//; s/".*$//;' | sort | uniq | wc -l
but harder to get a good idea of the number of clients -- especially if there is considerable inter-server traffic via HTTP as well.
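A related, equally crude, check is to look for source IPs making a disproportionate number of requests -- NAT gateways and servers making HTTP calls without connection pooling (as noted above) tend to stand out, eg:
tail -100000 /var/log/reverseproxy.log |
sed 's/^.*srcip="//; s/".*$//;' | sort | uniq -c | sort -rn | head -20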
For the site I was investigating, tweaking that tail line count to go back about an hour, it turned out there were about 35,000 different IP addresses making requests in the last hour. That's around 9-10 new IPs per second on average. So it was pretty easy to see -- even without counting multiple clients behind the same IP -- that any clumping of requests into a shorter period of time could easily exceed 200 clients making connections over a 15-60 second period. And once the reverse proxy started to queue connections, it could take quite a while for the load to quiet down enough that it would ever catch up.
Clearly, the number of workers needs to be set quite a bit higher to handle even normal concentrations of requests. An exact number is difficult to calculate, but aiming to handle 1600 clients -- 6400 workers -- would at least provide 8 times the capacity and hopefully handle most "normal" concentrations of traffic. Adding more workers will require more RAM which also needs to be taken into account -- but fortunately my client has the UTM in a virtual machine, so giving it more RAM is pretty easy. (This older Apache tuning guide has lots of useful information on trade-offs on tuning the Apache Worker model.)
Unfortunately it appears the Sophos UTM does not provide any user-accessible interface to tweak the Apache HTTPD worker MPM values; requests to Sophos Support suggested that they would need to schedule a Sophos Engineer to log into the affected UTM and adjust the parameters. If one were brave, and skilled with Linux administration, it might be possible to change the mpm.conf configuration values by hand and restart the reverse proxy (a reload is insufficient -- because shared memory is allocated based on these values, several of them are ignored except at httpd server startup).
(ETA, 2016-01-21: Unfortunately changing the mpm.conf file by hand does not survive a UTM reboot; mpm.conf is rewritten, with the defaults, on reboot. So a hand edit of the file is useful only for testing. It turns out the values come from the ConfD database, which is managed with confd-client.plx, also callable via the alias cc as root on the UTM. It is not really documented for customer use, but there are hints online.)
Possible replacement values, assuming ample free RAM (at least 1GB more than needed for the default config; possibly more):
LoadModule mpm_worker_module /usr/apache/modules/mod_mpm_worker.so
ServerLimit 32
ThreadsPerChild 200
ThreadLimit 200
MaxRequestWorkers 6400
MinSpareThreads 25
MaxSpareThreads 600
Note that MaxRequestWorkers must be set to ServerLimit * ThreadsPerChild (or lower), otherwise it will be capped; and ThreadLimit (the hard limit) must be equal to or higher than ThreadsPerChild (the soft limit), or ThreadsPerChild will be capped. The shared memory allocation is based in part on ServerLimit * ThreadLimit, so ThreadLimit should not be set too high. (I believe ThreadsPerChild is the only one of these that can be meaningfully changed at runtime, and only up to the ThreadLimit in effect when the server was started.)
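As a quick sanity check of the proposed values against those constraints:
ServerLimit 32 * ThreadsPerChild 200 = 6400  (matches MaxRequestWorkers 6400)
ThreadLimit 200 >= ThreadsPerChild 200       (so ThreadsPerChild is not capped)
ServerLimit 32 * ThreadLimit 200     = 6400  (what the shared memory sizing is based on)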
Also worth noting, allowing 6400 workers will make the process/thread listing on the UTM very long! So smaller values may be preferable if that much capacity is not needed. For this reason MinSpareThreads is left at its Sophos UTM default value, but MaxSpareThreads is scaled up proportionately, to avoid constantly destroying and starting new threads. Together they are likely to mean there are 1000-2000 threads running at all times, given the load experienced.
It is also tempting to consider reducing the KeepAliveTimeout, perhaps back down to the default value of 5 seconds, as this would cause workers to become free again more quickly. Doing so might possibly halve the number of workers required by a site serving a lot of small requests (ie, where the KeepAliveTimeout dominates the total connection time). Unfortunately KeepAliveTimeout is buried inside httpd.conf, and so is not as easy to change (eg, it is less clear whether httpd.conf is regenerated on changes in the Sophos UTM management interface).
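If you do want to experiment with that, the relevant directives are at least easy to locate (a sketch only, using the chroot path from earlier):
grep -n -E 'KeepAliveTimeout|MaxKeepAliveRequests' /var/storage/chroot-reverseproxy/usr/apache/conf/httpd.conf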
Finally, this dependency on separate clients making connections, and those connections being held open, makes load testing more complicated than it otherwise would be. Since the backend answers individual requests quickly, simply making lots of requests is not sufficient. What it really needs is:
lots of clients making requests in parallel
preferably from lots of different IPs
using HTTP Keep Alive, and
always waiting for the Server to close the connection
And the volume of clients and requests needs to be reasonably equivalent to the volume of production traffic -- 35,000 distinct client IPs in an hour, most likely with multiple clients behind many of those. That is getting close to DDOS-style testing....
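As a very rough first approximation from a single test machine (and hence a single source IP), something like ApacheBench with keep-alive enabled can at least exercise the worker limits -- though it keeps its connections busy rather than idling in Keep Alive the way real browsers do, so it still understates the problem. A sketch, with a placeholder URL:
ab -k -c 400 -n 200000 https://www.example.com/
(-k enables HTTP Keep Alive, -c is the number of concurrent connections, -n the total number of requests.)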
ETA, 2016-01-21: Permanent configuration is in the Sophos UTM Confd, which is managed by /usr/local/bin/confd-client.plx (apparently a compiled Perl program, from the name and behaviour); it can also be run via the alias cc. Changes made here are definitely at your own risk (and may not even be supportable). You probably want to make a configuration backup before you start making changes.
When run interactively, the client has a "help" command that provides some hints on usage; the relevant section is reverse_proxy, and within that there are some values which correspond to ServerLimit, ThreadLimit and MaxSpareThreads (the other values seem to be derived from those). The existing value is printed as part of the command prompt, so you can check the current value before changing it.
Once done, you still need to stop/start the reverse proxy to have it running with the new values; see below. (I think the Confd changes do a reload or restart, but for these particular changes a full stop/start cycle is required -- you can visit http://localhost:4080/status to confirm whether it has taken effect or not.) These changes do appear to survive a reboot.
ETA, 2016-01-21: The script to start/stop the reverse proxy is:
/var/mdw/scripts/reverseproxy stop
sleep 10
/var/mdw/scripts/reverseproxy start
(as listed in a Sophos UTM knowledgebase article on the WAF.)
Because the process/thread values are involved in shared memory allocation at server startup, they only change on a fresh start of the Apache HTTPD server. So a restart or reload will not be sufficient; and a pause is needed after the stop to ensure that the processes all exit, before it is possible to start new ones without being told "reverseproxy already running". (Note that some other httpd processes run on the system for things like the management interface, so not all httpd processes will be started afresh by that.)
ETA, 2018-01-30: Fixed up localhost URLs to include the port number so traffic goes through the ssh port forward. Also of note, today we had to reapply these "permanent" changes to the "same" UTM -- in practice the most plausible explanation is that the UTM got rebuilt via config export and import, and maybe these tweaked settings are not included in the config export/import process. Certainly something to check carefully after any subsequent migration.