High availability and clustering

Whilst we have pretty good uptime and availability on most of the systems we run here, we do get the odd hardware problem. When it’s a disk that’s failed, that’s not a big deal: our physical servers all run mirrored pairs of disks in hot-swap enclosures, so we can swap the failed drive out easily.

We’ve recently been doing a lot of work on reducing the number of physical servers we have by moving to a virtualised environment (QEMU+KVM running on Scientific Linux 6 x86_64 hosts). With an ageing server estate, we have some fairly major drivers to virtualise more of the hardware. Virtualising the estate also gives us the opportunity to review how our systems are configured and to look at whether we can improve service by introducing high-availability and load-balancing.

In the past week I’ve been looking at our web presence, www.cs.bham.ac.uk. The current infrastructure for this was installed in 2006, and things were a little different back then. Before the 2006 system, the site ran on a Sun Ultra 5 with an NFS-mounted file-store. The power supply to the building was unreliable, and the file-server could take several hours to reboot following a power outage. There was demand to provide at least some web presence for external users whilst waiting for the systems to come back. We built a system out of two Sun X2100 servers: one a stand-alone web-server serving just the front few pages of the site, the other using NFS mounts to serve the rest. We used the Apache httpd proxy module on the front-end to transparently forward requests to the back-end hardware.

Fast forward six years. We still use an NFS file-server for pretty much everything, but it’s considerably more reliable; you’d expect that, as it’s now a NetApp storage system. We also have high-speed interconnects, with 10GbE links to the filer and between the core switching fabric.

LVS Direct Routing

We’ve got a pair of SL6 machines running in our virtual environment, configured as a hot/standby LVS load-balancer. We’re using piranha to manage the LVS configs. It’s something I’ve used before, but never really in anger (we did some experimentation a few years ago running Samba in a cluster, and it was fine until we failed over a node…). LVS is actually really easy to get installed and working, and fail-over between the nodes seems to work reliably. It’s not so great when you change the config, though, as you have to restart pulse, and the second node has a habit of taking over the cluster at that point!
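
Piranha is essentially a small web GUI that writes out /etc/sysconfig/ha/lvs.cf, which pulse and its helpers then act on. As a hand-written (and heavily trimmed) sketch rather than piranha’s exact output, with made-up addresses, the sort of thing it manages looks like this:

    # /etc/sysconfig/ha/lvs.cf -- trimmed sketch, all addresses made up
    primary = 10.0.0.2            # active director
    backup = 10.0.0.3             # hot standby, monitored via the pulse heartbeat
    backup_active = 1
    network = direct              # LVS direct routing
    virtual www {
        active = 1
        address = 10.0.0.10 eth0:1   # the VIP clients actually connect to
        port = 80
        scheduler = wlc
        protocol = tcp
        server web1 {
            address = 10.0.0.21
            active = 1
            weight = 1
        }
        server web2 {
            address = 10.0.0.22
            active = 1
            weight = 1
        }
    }

The backup director brings the virtual IP up itself if the primary stops answering the heartbeat, which is the fail-over that has behaved reliably for us; restarting pulse to pick up a config change presumably looks a lot like the primary going quiet, hence the unwanted take-overs.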

I’ve been pondering how to move the www service into a high-availability configuration. One option is to continue to have a front-end node using mod_proxy to forward requests to a set of back-end servers. The problem with this is that the front-end server will either be a physical machine or a VM (and the VM host servers themselves depend on our NFS file-server). It could be an HA cluster of front-end machines, but it would still rely on the load balancers, which again are VMs and require the NFS server to be up.

mod_proxy

We already use mod_proxy in httpd to hand requests off to back-end servers; for example, our “personal” home-pages are served from a completely separate VM. In the past, personal pages with “issues” have had an impact on the main www web presence, so we separated them. The following config snippet shows what we do here:

    <Proxy balancer://staffwebcluster>
      BalancerMember http://staffweb-lb-1.cs.bham.ac.uk loadfactor=1
      BalancerMember http://staffweb-lb-2.cs.bham.ac.uk loadfactor=1
      ProxySet lbmethod=bytraffic
    </Proxy>
    <Proxy balancer://staffwebclusterssl>
      BalancerMember https://staffweb-lb-1.cs.bham.ac.uk loadfactor=1
      BalancerMember https://staffweb-lb-2.cs.bham.ac.uk loadfactor=1
      ProxySet lbmethod=bytraffic
    </Proxy>
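    # The RewriteCond below applies only to the first RewriteRule, so HTTPS
    # requests for /~user go to the SSL balancer and everything else falls
    # through to the plain-HTTP balancer ([P] implies [L], so only one rule
    # fires per request).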
    RewriteCond %{HTTPS} on
    RewriteRule ^/~(.*) balancer://staffwebclusterssl/~$1 [P]
    RewriteRule ^/~(.*) balancer://staffwebcluster/~$1 [P]

The problem with this approach for a “front-end” server cluster is that we’ve carefully used firewall marks on the load balancer to ensure visitors hit the same front-end server for both their HTTP and HTTPS transactions, but we then have no control over which back-end server a client ends up on, as that’s determined by Apache’s load balancing.
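
For anyone unfamiliar with the firewall-mark trick: it is just iptables giving port 80 and port 443 traffic for the virtual IP the same mark, and the LVS virtual server then matching on that mark (with persistence) rather than on a single port, so one visitor’s HTTP and HTTPS connections land on the same real server. Roughly, with a made-up VIP:

    # Mark HTTP and HTTPS traffic to the VIP identically
    # (10.0.0.10 is a made-up address for illustration)
    iptables -t mangle -A PREROUTING -p tcp -d 10.0.0.10 --dport 80  -j MARK --set-mark 80
    iptables -t mangle -A PREROUTING -p tcp -d 10.0.0.10 --dport 443 -j MARK --set-mark 80

Piranha then lets you attach that mark and a persistence timeout to the virtual server definition instead of a port number.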

So, given that we rely on our NFS server being up for practically the whole system to be available, do we really still need a front-end/back-end configuration? I’m fairly sure we don’t, in real terms. So whilst we’ll continue to use mod_proxy to let us run whole sections of our web-server on different real hosts (some parts of the www site are even proxied to an IIS server), we’ll be dropping the front-end/back-end approach and letting the load-balancer handle the traffic for us.
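
One wrinkle worth remembering with LVS direct routing is that the director forwards packets without rewriting the destination address, so each real web server has to accept traffic for the virtual IP without answering ARP for it. A sketch of one common way of arranging that on the real servers (again with a made-up VIP, and not necessarily how piranha would have you do it):

    # On each real server: accept traffic for the VIP without advertising it via ARP
    # (10.0.0.10 is a made-up address for illustration)
    ip addr add 10.0.0.10/32 dev lo
    sysctl -w net.ipv4.conf.all.arp_ignore=1
    sysctl -w net.ipv4.conf.all.arp_announce=2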
