High availability and clustering

Whilst we have pretty good uptime and availability on most of the systems we run here, we do get the odd hardware problem. When it’s a disk that’s gone down, that’s not a problem, as our physical servers all run mirrored pairs in hot-swap enclosures, so we can swap the disk out easily.

We’ve recently been doing a lot of work on reducing the number of physical servers we have by moving into a virtualised environment (QEMU+KVM running on Scientific Linux 6 x86_64 hardware). We’ve got some fairly major drivers to virtualise more hardware due to an ageing server estate. Virtualising our estate gives us the opportunity to review how we’ve got systems configured and to look at whether we can improve service by introducing high-availability and load-balancing.

In the past week I’ve been looking at our web presence www.cs.bham.ac.uk. The current infrastructure for this was installed in 2006 and things were a little different back then. Before the 2006 system, the site was running on a Sun Ultra 5 with an NFS-mounted file-store. The power supply to the building was unreliable, and the file-server could take several hours to reboot following a power-outage. There was demand to be able to provide at least some web-presence for external users whilst waiting for the systems to reboot. We built a system out of two Sun X2100 servers: one a stand-alone web-server serving the first few web-pages of the site, the second using NFS mounts to serve the rest of the site. We used the Apache httpd proxy module to transparently forward requests on to the back-end hardware.

Fast forward 6 years. We still use an NFS file-server for all our stuff, but it’s considerably more reliable. You’d expect that, as we have a NetApp storage system. We also have high speed interconnects between our switching fabric and the network (10GbE links to the filer and between the core fabric).

LVS Direct Routing

We’ve got a pair of SL6 machines running in our virtual environment which are configured as a hot/standby LVS load-balancer. We’re using piranha to manage the LVS configs. It’s something I’ve used before, but never really in anger (we did some experimentation a few years ago running Samba in a cluster. It was fine till we did fail-over of a node…). LVS is actually really easy to get installed and working, and the fail-over between nodes seems to work reliably. It’s not great when you change the config, as you have to restart pulse, and the second node has a habit of taking over the cluster at that point!

I’ve been pondering how to move the www service into a high-availability configuration. One option is to continue to have a front-end node using mod_proxy to forward requests to a set of back-end servers. The problem with this is that the front-end server will either be a physical machine, or be a VM (which requires our NFS file-server for the VM host servers). It could be an HA cluster of front-end machines, but it would still rely on the load balancers, which again are VMs and require the NFS server to be up.

mod_proxy

We already use mod_proxy in httpd to handle calls to back-end servers. For example, our “personal” home-pages are served from a completely separate VM. In the past, personal pages with “issues” have had an impact on the main www web presence, so we separated them. The following config snippet shows what we do here:

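    # Balancer pool for plain HTTP requests to personal pages
    <Proxy balancer://staffwebcluster>
      BalancerMember http://staffweb-lb-1.cs.bham.ac.uk loadfactor=1
      BalancerMember http://staffweb-lb-2.cs.bham.ac.uk loadfactor=1
      ProxySet lbmethod=bytraffic
    </Proxy>
    # Matching pool for HTTPS requests
    <Proxy balancer://staffwebclusterssl>
      BalancerMember https://staffweb-lb-1.cs.bham.ac.uk loadfactor=1
      BalancerMember https://staffweb-lb-2.cs.bham.ac.uk loadfactor=1
      ProxySet lbmethod=bytraffic
    </Proxy>
    # RewriteCond only applies to the rule immediately after it, so HTTPS
    # requests for /~user are proxied to the SSL pool by the first rule;
    # everything else falls through to the second rule and the HTTP pool.
    RewriteCond %{HTTPS} on
    RewriteRule ^/~(.*) balancer://staffwebclusterssl/~$1 [P]
    RewriteRule ^/~(.*) balancer://staffwebcluster/~$1 [P]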

The problem with this approach for a “front-end” server cluster is that we’ve carefully used firewall marks on the load balancer to ensure visitors hit the same front-end server for both HTTP and HTTPS transactions, but then we have no control over the back-end server the client connects to, as it’s determined by Apache’s load balancing.

So given that we rely on our NFS server being up for practically the whole system to be available, do we really still need a front-end/back-end configuration? I’m fairly sure we don’t in real terms. So whilst we’ll continue to use mod_proxy to allow us to run whole sections of our web-server on different real hosts (some parts of the www site are even proxied to an IIS server), we’ll be dropping the front-end/back-end approach and letting the load-balancer handle the traffic for us.
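As a rough sketch of what proxying one section of the site to a different real host can look like with mod_proxy (this is not our actual config: the path /legacy-app/ and the host iis-host.example.org are made-up placeholders, and it assumes mod_proxy and mod_proxy_http are loaded):

```apache
# Hypothetical example: hand one section of the site to a separate host.
# Both the URL path and the back-end host name are placeholders.
ProxyPass        /legacy-app/ http://iis-host.example.org/legacy-app/
# Rewrite Location headers in redirects from the back-end so clients
# keep talking to the public host name rather than the back-end.
ProxyPassReverse /legacy-app/ http://iis-host.example.org/legacy-app/
```

With something like this in place, the load-balancer only needs to know about the web-servers themselves; which real host serves a given section stays a detail of the httpd config.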

Research Computing Blogs …

A few years ago we looked at providing a blog service to support a teaching module. Back then the only options for multiple blogs were to install multiple instances of WordPress or to opt for WordPress MU, and with MU we quickly found we were on an outdated version. Happily, things have moved on a long way with WordPress since then: there are now “networks”, which form an integrated part of the WordPress code base.

http://researchblogs.cs.bham.ac.uk

So today I proudly announce that we’re now providing http://researchblogs.cs.bham.ac.uk/ so that research members of the school can create blogs about their research.

We really don’t like just bunging in new systems which aren’t integrated into anything else, so we’re using a couple of plug-ins to help tie authentication into our normal authentication systems. The http-authentication plugin allows one to use Apache auth to provide logins. A few years back I wrote the Apache authentication module we use, and this provides integrated cookie-based authentication across a number of our sites.
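To give a flavour of the Apache side of this, here’s a minimal sketch of protecting the WordPress login page with Apache auth so that the plugin can pick up the authenticated user. It is only an illustration: AuthType Basic and the password file path are stand-ins (assumptions) for our own cookie-based module, whose directives aren’t shown here.

```apache
# Hypothetical stand-in: Basic auth in place of the in-house cookie module.
# The realm name and the AuthUserFile path are placeholders.
<Files "wp-login.php">
    AuthType Basic
    AuthName "Research blogs"
    AuthUserFile /path/to/htpasswd
    # Any successfully authenticated user is let through; Apache then
    # exposes the login name to PHP as REMOTE_USER.
    Require valid-user
</Files>
```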

So researchers here can now register on-line for blogging. This sets them up inside the WordPress world.

Why not just use normal WordPress registration?

Whilst WordPress has configuration options to disable registration or to restrict it to email domains, at present we only want to allow our research staff to register, so we’ve provided a click-to-register option. We also don’t want to allow anyone who can use the system to create their own blogs; again, we’d like to restrict that to a sub-set of users.

There’s no easy way to accomplish this sensibly with WordPress right now. There are no command-line tools, and poking things directly into the WordPress database is just going to cause trouble in the future, so I’ve written some code to act as an API. Internally, my API code uses the Curl module in PHP to authenticate into WordPress as a trusted user, which then allows it to make calls from the WordPress function reference. So, for example, my API code logs in as an admin user internally and then calls the get_blog_details function to find out info on a blog. The main page of the site uses this internally to render the list of all the blogs (along with a tweak to the index handler for Apache to load a different page by default). This means we can list currently active and archived blogs, with the list derived using the WordPress functions rather than by poking into the database directly.
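The index handler tweak mentioned above could be something along these lines. This is a sketch rather than our real config: the directory path and the file name bloglist.php are made-up placeholders.

```apache
# Hypothetical example: serve a custom blog listing page for the site root,
# falling back to WordPress's own index.php if it isn't there.
# Both the path and bloglist.php are placeholders.
<Directory "/path/to/researchblogs/docroot">
    DirectoryIndex bloglist.php index.php
</Directory>
```

DirectoryIndex tries the listed files in order, so WordPress still handles the request if the custom page is ever removed.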

As we know our researchers have collaborators, we’ve also built part of the API to allow staff to add external users, who are authenticated using the WordPress internal authentication system. This means staff will be able to add collaborators and allow them to post onto research blogs. Hopefully this will make it a workable solution for our researchers!

And what’s with the robot man?

That’s the building we’re located in. The statue is right outside my office. It’s called “Faraday” and was designed by Eduardo Paolozzi.

And if we’ve changed the theme since posting … he’ll be gone by now!
