The Sophos UTM is a "next generation" firewall (ie, does more than just packet filtering), based on Linux. It is deployed as a physical or virtual appliance with a wide range of licensed features.
Among the licensable features is "Webserver Protection", via application level proxying. It is implemented by associating a "virtual webserver" (front end) with one or more backend ("real") webservers -- and depending on how it is configured, can optionally do load balancing, SSL termination, and "Web Application Firewalling". (All the configuration is under "Webserver Protection" -> "Web Application Firewall", but if there is no firewall profile associated then no meaningful "application firewalling" will take place -- without that firewall it is basically a reverse proxy load balancer.)
Recently a client was experiencing a performance problem with websites hosted behind the "Webserver Protection" of a Sophos UTM. At times (busier periods of site usage, we think) the time to answer various requests would be 10-12 seconds -- while requests made directly to the backend webserver would be answered in fractions of a second. The slowness remained even with "no firewall profile" set (ie, just the reverse proxy). So I took a look under the hood.
The Webserver Protection feature (at least on Sophos UTM version 9.352) is implemented via an Apache HTTPD (2.4.10) server with the mod_proxy functionality enabled (and what seems to be mod_security for some of the "firewall" functionality). The basic reverse proxy configuration seems to be sane, and capable of doing SSL termination if needed and/or load balancing (via configuring multiple backends -- even with a single backend, a mod_proxy_balancer configuration seems to be present in the template).
The Apache HTTPD reverse proxy runs in a chroot under /var/storage/chroot-reverseproxy, with logging via a pipe to /var/log/reverseproxy.log. The configuration for the reverse proxy is in /var/storage/chroot-reverseproxy/usr/apache/conf, including the obvious reverseproxy.conf for the actual proxy configuration. It appears a single set of Apache HTTPD process instances is shared among all websites behind the Sophos UTM Webserver Protection -- so any scalability limitation applies to the aggregate load of all requests for all protected websites.
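If you want to poke around yourself, something like the following (a sketch only, run as root on the UTM, using the paths above) will list the configuration files and show where the worker limits are set:
ls /var/storage/chroot-reverseproxy/usr/apache/conf/
grep -n -E 'ServerLimit|ThreadsPerChild|MaxRequestWorkers' /var/storage/chroot-reverseproxy/usr/apache/conf/mpm.conf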
Watching reverseproxy.log showed that for the affected UTM, around the time that users were noticing problems, there were on the order of 100-200 requests/second being logged. And the request processing times recorded in the log were approximately consistent with the times for directly requesting from the backend server. (Note that logging uses a CustomLog format, defined in httpd.conf, which includes time="%D" -- meaning "The time taken to serve the request, in microseconds." Because the time is in microseconds, 1,000,000 means 1 second, and values in the thousands represent just milliseconds.)
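Since the service time is logged in that key="value" style, a quick way to check how many recent requests were slow at the proxy is to pull out the time= field and compare it against a threshold (a sketch only; the 5,000,000 microsecond cut-off, ie 5 seconds, is an arbitrary choice):
tail -100000 /var/log/reverseproxy.log |
sed 's/^.*time="//; s/".*$//;' |
awk '$1 > 5000000 { slow++ } END { print slow+0, "of", NR, "requests took over 5 seconds" }'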
With an Apache HTTPD server, the next thing to investigate for performance issues is how busy the server instances are -- for which Apache provides mod_status.
Conveniently the Sophos UTM reverse proxy is already configured with mod_status enabled -- in status.conf. However it is only listening on localhost (127.0.0.1), on port TCP/4080. Accessing it requires an ssh port forward, eg:
ssh -L 4080:127.0.0.1:4080 loginuser@UTM
after which you can then access:
http://localhost:4080/status
http://localhost:4080/lb-status
(the first is the main Apache HTTPD status page, and the second is the status of the load balancer instances.)
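mod_status also has a machine-readable form (append "?auto" to the URL), which makes it easy to keep an eye on the busy/idle worker counts from your own machine through the same ssh port forward, eg:
watch -n 5 'curl -s "http://localhost:4080/status?auto" | egrep "BusyWorkers|IdleWorkers"'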
Looking at the main status page revealed the problem:
800 requests currently being processed, 0 idle workers
That is "all workers busy, please hold". In particular due to HTTP
Keep Alive, the workers will remain busy (waiting to serve another
request to a recent client) for the
KeepAliveTimeout
-- which now defaults to 5 seconds, but is set to 15 seconds on
the Sophos UTM (KeepAliveTimeout 15
in httpd.conf
; note that
the MaxKeepAliveRequests 100
is the number of requests on the
connection before it is
closed
not the number of connections that can be waiting in Keep Alive
state).
The limit of 800 workers comes from the configuration of the Apache HTTPD worker MPM, in mpm.conf, which on the Sophos UTM contains:
LoadModule mpm_worker_module /usr/apache/modules/mod_mpm_worker.so
ServerLimit 16
ThreadsPerChild 50
ThreadLimit 50
MaxRequestWorkers 800
MinSpareThreads 25
MaxSpareThreads 75
Sensibly this is larger than what appear to be the stock Apache HTTPD worker MPM defaults, which would result in only 400 workers. Unfortunately for a busy website, or particularly a series of busy websites behind a single Sophos UTM, it clearly is not large enough.
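For reference, the arithmetic behind those two figures (assuming the stock worker MPM defaults are ServerLimit 16 and ThreadsPerChild 25, which is my understanding):
Sophos UTM:      ServerLimit 16 * ThreadsPerChild 50 = 800 workers
Stock defaults:  ServerLimit 16 * ThreadsPerChild 25 = 400 workers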
Once the UTM reverse proxy becomes busy enough, additional requests are queued waiting for a worker -- up to ListenBackLog requests, which defaults to 511 (and appears not to be overridden on the Sophos UTM). Basic statistics suggests that requests near the head of that queue will need to wait on average about half the KeepAliveTimeout (15 seconds) for another connection to time out and a worker to become free -- so around 7-8 seconds. Requests later in the queue would potentially need to wait for 1-3 full connection timeouts, as the default queue is nearly as long as the number of workers (and on a busy site, new requests will be arriving all the time to keep the queue full...).
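To put some very rough numbers on that (a back-of-envelope sketch of my own, assuming the 800 keep-alive expiries are spread evenly across the 15 second timeout and the queue is serviced in order):
800 workers / 15 seconds             ~= 53 workers freed per second
511 queued requests / ~53 per second ~= 10 seconds wait from the back of a full queue
100-200 new requests per second       >  53 per second, so the queue keeps refilling
That is in the same ballpark as the 10-12 second response times that were being observed.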
Having discovered all of this, the question becomes "how many workers are required to handle usual website loads?". If we assume that each webpage requires accessing multiple resources (eg, HTML, CSS, Javascript, Javascript callbacks, images, etc), then it is likely that each client will make multiple connections to the webserver. In most modern web browsers this is up to 6 persistent connections per client; historically the limit used to be 4. For simplicity we can assume that requests by a client with an active worker will be answered promptly (since the backend is still performing well), and thus it will not need to make as many parallel connections as it can to fetch smaller resources. This makes the older "4 connections per client" still a reasonable rule of thumb. (Sites transferring larger files may not be able to make that assumption.)
With 800 workers, and 4 connections (and thus workers) per client, that means that 200 clients (800/4) can be served simultaneously -- ie, in any given "15 second" window. (The actual number of requests will be higher -- but a lot of the "client time" will be that 15 second timeout per connection.) Beyond that number of clients, one or more requests from a client will end up waiting in the backlog queue. Of note, those "clients" are not necessarily "IP addresses" in the request log, because:
multiple machines behind a NAT firewall are separate clients with their own ideas about being able to use 4-6 connections each
multiple processes on each machine are "separate clients", again with their own ideas about being able to use 4-6 connections each
potentially threads in other server processes that do not use connection pooling will also be making repeated new connections
And also of note, things like issuing client side redirects (301, 302) will add to the request load and may cause additional connections to be used (eg, if there are pipelined requests in the first connection that received the redirect).
It is relatively easy to get an idea of how many IP addresses are making requests in a given period, eg from the Sophos UTM graphs in the user interface, or something like:
tail -100000 /var/log/reverseproxy.log |
sed 's/^.*srcip="//; s/".*$//;' | sort | uniq | wc -l
but harder to get a good idea of the number of clients -- especially if there is considerable inter-server traffic via HTTP as well.
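A related, equally crude, check is to look for source IPs making a disproportionate number of requests -- NAT gateways and servers making HTTP calls without connection pooling (as noted above) tend to stand out, eg:
tail -100000 /var/log/reverseproxy.log |
sed 's/^.*srcip="//; s/".*$//;' | sort | uniq -c | sort -rn | head -20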
For the site I was investigating, tweaking that tail line count to go back about an hour, it turned out there were about 35,000 different IP addresses making requests in the last hour. That's around 9-10 new IPs per second on average. So it was pretty easy to see -- even without counting multiple clients behind the same IP -- that any clumping of requests into a shorter period of time could easily exceed 200 clients making connections over a 15-60 second period. And once the reverse proxy started to queue connections, it could take quite a while for the load to quiet down enough that it would ever catch up.
Clearly, the number of workers needs to be set quite a bit higher to handle even normal concentrations of requests. An exact number is difficult to calculate, but aiming to handle 1600 clients -- 6400 workers -- would at least provide 8 times the capacity and hopefully handle most "normal" concentrations of traffic. Adding more workers will require more RAM which also needs to be taken into account -- but fortunately my client has the UTM in a virtual machine, so giving it more RAM is pretty easy. (This older Apache tuning guide has lots of useful information on trade-offs on tuning the Apache Worker model.)
Unfortunately it appears the Sophos UTM does not provide any user-accessible interface to tweak the Apache HTTPD worker MPM values; requests to Sophos Support suggested that they would need to schedule a Sophos Engineer to log into the affected UTM and adjust the parameters. If one were brave, and skilled with Linux administration, it might be possible to change the mpm.conf configuration values by hand and restart the reverse proxy (a reload is insufficient -- because shared memory is allocated based on these values, several of them are ignored except at httpd server startup).
(ETA, 2016-01-21: Unfortunately changing the mpm.conf file by hand does not survive a UTM reboot; mpm.conf is rewritten, with the defaults, on reboot. So a hand edit of the file is useful only for testing. It turns out the values come from the ConfD database, which is managed with confd-client.plx, also callable via the alias cc as root on the UTM. It is not really documented for customer use, but there are hints online.)
Possible replacement values, assuming ample free RAM (at least 1GB more than needed for the default config; possibly more):
LoadModule mpm_worker_module /usr/apache/modules/mod_mpm_worker.so
ServerLimit 32
ThreadsPerChild 200
ThreadLimit 200
MaxRequestWorkers 6400
MinSpareThreads 25
MaxSpareThreads 600
Note that MaxRequestWorkers must be set to ServerLimit * ThreadsPerChild (or lower), otherwise it will be capped; and ThreadLimit (the hard limit) must be equal to or higher than ThreadsPerChild (the soft limit), or ThreadsPerChild will be capped. The shared memory allocation is based in part on ServerLimit * ThreadLimit, so ThreadLimit should not be set too high. (I believe ThreadsPerChild is the only one of these that can be meaningfully changed at runtime, and only up to the ThreadLimit in effect when the server was started.)
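As a quick sanity check of the proposed values against those constraints:
ServerLimit 32 * ThreadsPerChild 200 = 6400  (matches MaxRequestWorkers 6400)
ThreadLimit 200 >= ThreadsPerChild 200       (so ThreadsPerChild is not capped)
ServerLimit 32 * ThreadLimit 200     = 6400  (what the shared memory sizing is based on)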
Also worth noting, allowing 6400 workers will make the process/thread listing on the UTM very long! So smaller values may be preferable if that much capacity is not needed. For this reason MinSpareThreads is left at its Sophos UTM default value, but MaxSpareThreads is scaled up proportionately, to avoid constantly destroying and starting new threads. Together they are likely to mean there are 1000-2000 threads running at all times, given the load experienced.
It is also tempting to consider reducing the KeepAliveTimeout, perhaps back down to the default value of 5 seconds, as this would cause workers to become free again more quickly. Doing so might possibly halve the number of workers required by a site serving a lot of small requests (ie, where the KeepAliveTimeout dominates the total connection time). Unfortunately KeepAliveTimeout is buried inside httpd.conf, and so is not as easy to change (eg, it is less clear whether httpd.conf is regenerated on changes in the Sophos UTM management interface).
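If you do want to experiment with that, the relevant directives are at least easy to locate (a sketch only, using the chroot path from earlier):
grep -n -E 'KeepAliveTimeout|MaxKeepAliveRequests' /var/storage/chroot-reverseproxy/usr/apache/conf/httpd.conf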
Finally, this dependency on separate clients making connections, and those connections being held open, makes load testing more complicated than it otherwise would be. Since the backend answers individual requests quickly, simply making lots of requests is not sufficient. What it really needs is:
lots of clients making requests in parallel
preferably from lots of different IPs
using HTTP Keep Alive, and
always waiting for the Server to close the connection
And the volume of clients and requests needs to be reasonably equivalent to the volume of production traffic -- 35,000 distinct client IPs in an hour, most likely with multiple clients behind many of those. That is getting close to DDOS-style testing....
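As a very rough first approximation from a single test machine (and hence a single source IP), something like ApacheBench with keep-alive enabled can at least exercise the worker limits -- though it keeps its connections busy rather than idling in Keep Alive the way real browsers do, so it still understates the problem. A sketch, with a placeholder URL:
ab -k -c 400 -n 200000 https://www.example.com/
(-k enables HTTP Keep Alive, -c is the number of concurrent connections, -n the total number of requests.)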
ETA, 2016-01-21: Permanent configuration is in the Sophos UTM Confd, which is managed by /usr/local/bin/confd-client.plx (apparently a compiled Perl program, from the name and behaviour); it can also be run via the alias cc. Changes made here are definitely at your own risk (and may not even be supportable). You probably want to make a configuration backup before you start making changes.
When run interactively, the client has a "help" command that provides some hints on usage; the relevant section is reverse_proxy, and within that there are some values which correspond to ServerLimit, ThreadLimit and MaxSpareThreads (the other values seem to be derived from those). The existing value is printed as part of the command prompt, so you can check the current value before changing it.
Once done, you still need to stop/start the reverse proxy to have it running with the new values; see below. (I think the Confd changes do a reload or restart, but for these particular changes a full stop/start cycle is required -- you can visit http://localhost:4080/status to confirm whether it has taken effect or not.) These changes do appear to survive a reboot.
ETA, 2016-01-21: The script to start/stop the reverse proxy is:
/var/mdw/scripts/reverseproxy stop
sleep 10
/var/mdw/scripts/reverseproxy start
(as listed in a Sophos UTM knowledgebase article on the WAF.)
Because the process/thread values are involved in shared memory allocation at server startup, they only change on a fresh start of the Apache HTTPD server. So a restart or reload will not be sufficient; and a pause is needed after the stop to ensure that the processes all exit, before it is possible to start new ones without being told "reverseproxy already running". (Note that some other httpd processes run on the system for things like the management interface, so not all httpd processes will be started afresh by that.)
ETA, 2018-01-30: Fixed up localhost URLs to include the port number so traffic goes through the ssh port forward. Also of note, today we had to reapply these "permanent" changes to the "same" UTM -- in practice the most plausible explanation is that the UTM got rebuilt via config export and import, and maybe these tweaked settings are not included in the config export/import process. Certainly something to check carefully after any subsequent migration.