Proxying to GlassFish

We’re currently developing a new coursework submission system, and the developer is building it in Java running on GlassFish. We generally don’t like to expose individual servers to the world, so we proxy URLs using Apache httpd’s mod_proxy: for example, https://www.cs.bham.ac.uk/internal/students/submission will be the URL used, but the application is actually running on some back-end GlassFish servers.

When testing uploads of files (generally large, but possibly over slow connections), we found that we were getting the following in the httpd error_log:

proxy: pass request body failed to

And the client would error with something like:

net::ERR_CONNECTION_RESET

Eventually, after much Googling and finding similar reports, I came upon the Apache httpd directive we needed to tweak:

RequestReadTimeout body=10,MinRate=1000

Basically what we are saying here is that the initial timeout for receiving the request body is 10 seconds, but it is extended by 1 second for every 1000 bytes of data received.
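For reference, the relevant bits of our httpd config end up looking something like this – the back-end hostname and port here are made up for illustration, and RequestReadTimeout is provided by mod_reqtimeout, so that needs to be loaded:

  LoadModule reqtimeout_module modules/mod_reqtimeout.so

  # 10 second timeout on reading the request body, extended by
  # 1 second for every 1000 bytes received
  RequestReadTimeout body=10,MinRate=1000

  # hand the submission system off to the back-end GlassFish instance
  ProxyPass        /internal/students/submission http://glassfish-backend.example:8080/submission
  ProxyPassReverse /internal/students/submission http://glassfish-backend.example:8080/submission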

Configuring Active Directory authentication integration – Building an OpenIndiana based ZFS File Server – part 3

Getting Kerberos based authentication working with Active Directory is actually pretty simple, and there are numerous blog posts out there on the topic, so I’m probably mostly covering old ground on the basic integration stuff.

Our Active Directory already has schema extensions to hold Unix account data – initially Services for Unix (SFU), but we added the Server 2008 schema, which adds the RFC 2307 attributes we now use for Linux authentication.

First off, I should point out that we have a disjoint DNS namespace between our AD and our normal client DNS: our AD is socs-ad.cs.bham.ac.uk, but our clients are all in cs.bham.ac.uk. This shouldn’t really cause any problems for most people. I’ve only come across three cases: 1) back in about 2003, a NetApp filer couldn’t work out the DNS name as it didn’t match the NetBIOS name (fixed a long time ago); 2) about 18 months ago when experimenting with SCCM and AMT provisioning, which doesn’t support disjoint DNS; 3) with the OI in-kernel CIFS server … see part 5!

First, we need to edit some config files:

/etc/resolv.conf
  domain  cs.bham.ac.uk
  search cs.bham.ac.uk socs-ad.cs.bham.ac.uk
  nameserver  147.188.192.4
  nameserver  147.188.192.8
cp /etc/inet/ntp.client /etc/inet/ntp.conf
/etc/inet/ntp.conf
  server ad1.cs.bham.ac.uk
  server ad2.cs.bham.ac.uk
  server timehost.cs.bham.ac.uk

The first two are our DCs, the third is a general NTP server – they should all be pretty much in step though!

Finally, enable ntp:

svcadm enable ntp

To help with Kerberos principal generation, I grabbed a copy of adjoin. Note that because we have a disjoint namespace, I had to hack it a little, otherwise it tries to add the full Windows domain to the hostname in the SPNs:

###fqdn=${nodename}.$dom
fqdn=bigstore.cs.bham.ac.uk
./adjoin -f

Check you have a correct-looking machine principal in the keytab:

klist -e -k /etc/krb5/krb5.keytab

And enable a couple of services we’ll need:

svcadm enable /network/dns/client
svcadm enable /system/name-service-cache

We also need to configure pam.conf to use Kerberos, so you need to add a couple of lines similar to:

 other   auth     required    pam_unix_cred.so.1
 other   auth     sufficient  pam_krb5.so.1 debug
 other   auth     required    pam_unix_auth.so.1
 other   account  requisite   pam_roles.so.1
 other   account  required    pam_krb5.so.1 debug nowarn
 other   account  required    pam_unix_account.so.1
 other   password requisite   pam_authtok_check.so.1
 other   password sufficient  pam_krb5.so.1 debug
 other   password required    pam_authtok_store.so.1

The debug flag is optional and helps with working out why things aren’t working. The nowarn on the middle (account) line is needed to stop a password expiry warning on every password based login – our AD passwords are set to never expire, but without this, it warns about expiry in 9244 days.

We now need to edit a file so that we can configure LDAP for passwd/group data; we want to remove all references to ldap except for passwd, group and automount:

/etc/nsswitch.ldap
  passwd:     files ldap
  group:      files ldap
  hosts:      files dns
  ipnodes:    files dns
  automount:  files ldap

Before we go any further, we also need to tweak the krb5 config:

/etc/krb5.conf
  [libdefaults]
    default_tkt_enctypes = rc4-hmac arcfour-hmac arcfour-hmac-md5
    default_tgs_enctypes = rc4-hmac arcfour-hmac arcfour-hmac-md5
    permitted_enctypes = rc4-hmac arcfour-hmac arcfour-hmac-md5

You might not need to do this – our DCs are quite old, running Server 2003 – but without it, Kerberos authentication wouldn’t work.
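For completeness, the rest of our krb5.conf is just the usual realm plumbing, something along these lines (realm and KDC names matching our AD, with the second domain_realm entry being the bit that copes with our disjoint namespace):

  [libdefaults]
    default_realm = SOCS-AD.CS.BHAM.AC.UK

  [realms]
    SOCS-AD.CS.BHAM.AC.UK = {
      kdc = ad1.socs-ad.cs.bham.ac.uk
      kdc = ad2.socs-ad.cs.bham.ac.uk
    }

  [domain_realm]
    .socs-ad.cs.bham.ac.uk = SOCS-AD.CS.BHAM.AC.UK
    .cs.bham.ac.uk = SOCS-AD.CS.BHAM.AC.UK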

We now need to configure LDAP:

 ldapclient -v manual \
 -a credentialLevel=self \
 -a authenticationMethod=sasl/gssapi \
 -a defaultSearchBase=dc=socs-ad,dc=cs,dc=bham,dc=ac,dc=uk \
 -a defaultSearchScope=sub \
 -a domainName=socs-ad.cs.bham.ac.uk \
 -a defaultServerList="ad1.socs-ad.cs.bham.ac.uk ad2.socs-ad.cs.bham.ac.uk" \
 -a attributeMap=passwd:gecos=cn \
 -a attributeMap=passwd:homedirectory=unixHomeDirectory \
 -a attributeMap=passwd:uid=sAMAccountName \
 -a attributeMap=group:uniqueMember=member \
 -a attributeMap=group:cn=sAMAccountName \
 -a objectClassMap=group:posixGroup=group \
 -a objectClassMap=passwd:posixAccount=user \
 -a objectClassMap=shadow:shadowAccount=user \
 -a serviceSearchDescriptor='passwd:ou=people,dc=socs-ad,dc=cs,dc=bham,dc=ac,dc=uk?sub' \
 -a serviceSearchDescriptor='shadow:ou=people,dc=socs-ad,dc=cs,dc=bham,dc=ac,dc=uk?sub?memberOf=CN=sysop,OU=Groups of People,OU=Groups,DC=socs-ad,DC=cs,DC=bham,DC=ac,DC=uk' \
 -a serviceSearchDescriptor='group:ou=groups,dc=socs-ad,dc=cs,dc=bham,dc=ac,dc=uk?sub?(&(objectClass=group)(gidNumber=*))'

You’d need to tweak it for your environment of course. Importantly, you need the bit

(&(objectClass=group)(gidNumber=*))

for the group serviceSearchDescriptor, otherwise you’ll get spurious results if you have groups with no gidNumber assigned. Ideally we’d also have similar filters for passwd and shadow, but that didn’t seem to work properly.

Restart the nscd daemon:

svcadm restart name-service-cache

You could try doing an LDAP search with something like:

ldapsearch -R -T -h ad1.socs-ad.cs.bham.ac.uk -o authzid= -o mech=gssapi -b dc=socs-ad,dc=cs,dc=bham,dc=ac,dc=uk -s sub cn=jaffle

and getent should now work:

getent passwd
getent group

Restricting login access

So, we’ve managed to integrate our password data on the server. We pretty much need access to all our directory users for NFS to work properly, so that usernames and UIDs match; however, this means anyone can log in to the server. There’s no equivalent of Linux’s pam_access, and there doesn’t appear to be any native way of specifying Unix groups of people who can log in to the system. The closest I found was pam_list, however this only works with netgroups, and as we don’t use those for anything any more, they were never migrated to our AD – and anyway, we’ve got perfectly good Unix groups of people in use on our other systems.

After running round in circles for a while, and almost creating netgroups, I came across a solution that seems to work nicely. It’s a bit of a hack, but actually it’s quite a nice solution for us. The key is in the ldapclient definition for shadow:

-a serviceSearchDescriptor='shadow:ou=people,dc=socs-ad,dc=cs,dc=bham,dc=ac,dc=uk?sub?memberOf=CN=sysop,OU=Groups of People,OU=Groups,DC=socs-ad,DC=cs,DC=bham,DC=ac,DC=uk'

Note that we add an LDAP filter requiring membership of a specific LDAP group. This means that shadow data is only present for members of that group, and hey presto – we’ve got Unix group based authorisation working on OpenIndiana. If you wanted multiple groups, you’d have to tweak the filter with some parentheses and an OR (|), as in the sketch below.
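Something like this should do it – the second group here (research-staff) is purely illustrative:

 -a serviceSearchDescriptor='shadow:ou=people,dc=socs-ad,dc=cs,dc=bham,dc=ac,dc=uk?sub?(|(memberOf=CN=sysop,OU=Groups of People,OU=Groups,DC=socs-ad,DC=cs,DC=bham,DC=ac,DC=uk)(memberOf=CN=research-staff,OU=Groups of People,OU=Groups,DC=socs-ad,DC=cs,DC=bham,DC=ac,DC=uk))'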

And automount/autofs?

Autofs from LDAP was a little bit more complicated to get going, probably because we have Linux format/named autofs maps in our AD. Technically we don’t need this bit working for our file-server, but we thought that for completeness we’d investigate the options for it.

First off, we’ll edit a couple of files:

/etc/auto_home
  #+auto_home

/etc/auto_master
  #
  +auto_master
  /bham   +auto.linux
  #+auto.master

A long time ago, we used NIS for both Solaris and Linux. Back then the maps had to be kept separate, as Linux autofs didn’t support the nested/multi-mount entries that Solaris did. e.g. under Solaris we could do:

/bham
    ... /foo
    ... /baa
    ... /otherdir
            .... /foo
            .... /bar

But under Linux, it had to be:

/bham
    ... /foo
    ... /baa

and in another Linux map:

/bham
    ... /otherdir
            .... /foo
            .... /bar

When the NIS maps were migrated to LDAP for Linux as we rolled out Scientific Linux 6, this split got carried over as well. Now of course, this won’t work with Solaris, and neither does it work with OpenIndiana.

After a bit of consideration, we thought we were going to have to build a separate set of maps for OI again. But then we found that Linux autofs 5 now supports Sun-format multi-mount (nested) entries, so we can use traditional Solaris format maps for nested directories in a single map. A quick test and an early morning edit to the maps, and we can now use the same maps under both OSes.

ldapclient mod \
  -a "serviceSearchDescriptor=auto_master:cn=auto.master,OU=automount,OU=Maps,dc=socs-ad,dc=cs,dc=bham,dc=ac,dc=uk" \
  -a "serviceSearchDescriptor=auto.home:cn=auto.home-linux,OU=automount,OU=Maps,dc=socs-ad,dc=cs,dc=bham,dc=ac,dc=uk" \
  -a "serviceSearchDescriptor=auto.linux:cn=auto.linux,OU=automount,OU=Maps,dc=socs-ad,dc=cs,dc=bham,dc=ac,dc=uk" \
  -a "serviceSearchDescriptor=auto.home-linux:cn=auto.home-linux,OU=automount,OU=Maps,dc=socs-ad,dc=cs,dc=bham,dc=ac,dc=uk" \
  -a objectclassMap=automount:automountMap=nisMap \
  -a objectclassMap=automount:automount=nisObject \
  -a objectclassMap=auto.home-linux:automount=nisObject \
  -a objectclassMap=auto.linux:automount=nisObject \
  -a attributeMap=automount:automountMapName=nisMapName \
  -a attributeMap=automount:automountKey=cn \
  -a attributeMap=automount:automountInformation=nisMapEntry \
  -a attributeMap=auto.home-linux:automountMapName=nisMapName \
  -a attributeMap=auto.home-linux:automountKey=cn \
  -a attributeMap=auto.home-linux:automountInformation=nisMapEntry \
  -a attributeMap=auto.linux:automountMapName=nisMapName \
  -a attributeMap=auto.linux:automountKey=cn \
  -a attributeMap=auto.linux:automountInformation=nisMapEntry

One caveat to note is that we had to map each of the top-level named maps. One might think that the lines:

  -a attributeMap=automount:automountMapName=nisMapName \
  -a attributeMap=automount:automountKey=cn \
  -a attributeMap=automount:automountInformation=nisMapEntry \

would inherit, but apparently not!

So all that’s left to do is enable autofs:

svcadm enable autofs

Just as a side note on the format of the maps we use for autofs: as I’ve mentioned, they are stored in our Active Directory. We’ve created a number of “nisMap” objects; for example, the object at “cn=auto.linux,OU=automount,OU=Maps,dc=socs-ad,dc=cs,dc=bham,dc=ac,dc=uk” is a nisMap object (probably created using ADSI Edit, but I think there’s a tab available if you install the right roles on the server).

The nisMap object then contains a number of nisObject objects. e.g.:

CN=bin,CN=auto.linux,OU=automount,OU=Maps,DC=socs-ad,DC=cs,DC=bham,DC=ac,DC=uk
nisMapName -> bin
nisMapEntry -> -rw,suid,hard,intr jaffle:/vol/vol1/bham.linux/bin

For a multi-mount map entry, each mount point is just space separated in the nisMapEntry, e.g.:

CN=htdocs,CN=auto.linux,OU=automount,OU=Maps,DC=socs-ad,DC=cs,DC=bham,DC=ac,DC=uk
nisMapName -> htdocs
nisMapEntry -> /events -rw,hard,intr jaffle:/vol/vol1/htdocs/web-events /hci -rw,hard,intr jaffle:/vol/vol1/htdocs/web-hci ...

(and yes, if you’re reading in the feed, the parts did get mixed up … I wrote part 4 before 3!)

part 1 – Hardware and Basic information
part 2 – Base Operating System Installation
part 3 – Configuring Active Directory authentication integration
part 4 – Configuring the storage pool and auto snapshots
part 5 – NFS Server config & In-kernel CIFS

Configuring the Storage Pools – Building an OpenIndiana based ZFS File Server – part 4

ZFS offers a number of different types of RAID configuration: raid-z1 is basically like traditional RAID 5 with a single parity disk, raid-z2 has two parity disks, and there’s also raid-z3 with three parity disks. Actually, “parity disk” isn’t strictly correct, as the parity is distributed across the disks in the set rather than living on dedicated drives.

Using multiple parity sets is important, particularly when using high capacity disks. It’s not necessarily just a case of losing multiple disks, but also a question of what happens if you get block failures on a high capacity disk during a rebuild. In fact, on our NetApp filer head, we’ve been using RAID-DP for some years now. We’ve actually had a double disk failure occur on a SATA shelf, but we didn’t lose any data, and the hot spare spun in quickly enough that we’d have had to lose another two disks to actually lose anything.

For this file-server we’ve got a 16 disk JBOD array, and we had a few discussions about how to configure the array and whether we should have hot-spares. Looking around, people always seem to suggest having a global hot-spare available. However, with raid-z3, we’ve decided to challenge that view.

We had been taking it for granted that we’d run at least raid-z2 with dual parity and a hot spare; however, with raid-z3 we don’t believe we need a hot-spare. Using all the disks in raid-z3, we’re not burning any more disks than we would have done with raid-z2 plus a hot spare, and the extra parity set effectively acts as if we’d got a hot-spare, except it’s already resilvered and spinning with the data available. So in that sense, we’d be able to lose three disks and still not have lost any data – some performance, yes, but no actual data until we’d lost a fourth disk. That’s the same as raid-z2 plus a hot spare, except there we’d have a window of risk whilst the hot spare was resilvering …
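To put some numbers on that for our 16 disk shelf of (currently) 2TB drives:

  raid-z2 + hot spare: 15 disk vdev = 13 data + 2 parity, plus 1 idle spare  -> ~26TB usable
  raid-z3, no spare:   16 disk vdev = 13 data + 3 parity                     -> ~26TB usable

Same usable capacity (give or take ZFS overheads) either way, but with raid-z3 the sixteenth disk is already part of the redundancy rather than sitting idle waiting to resilver.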

Of course, you need to take a look at the value of your data and consider the options yourself! There’s an Oracle paper on RAID strategy for high capacity drives.

Configuring the Pool

All the examples of building a pool use the name tank for the pool, so I’m going to stick with tradition and call ours tank as well. First off, we need to work out which disks we want to add to the pool. format is our friend here:

root@bigstore:~# format
Searching for disks...done

AVAILABLE DISK SELECTIONS:
       0. c2t10d0 <ATA-Hitachi HUA72303-A5C0-2.00TB>
          /pci@0,0/pci8086,3410@9/pci8086,346c@0/sd@a,0
       1. c2t11d0 <ATA-Hitachi HUA72303-A5C0-2.00TB>
          /pci@0,0/pci8086,3410@9/pci8086,346c@0/sd@b,0
       2. c2t12d0 <ATA-Hitachi HUA72303-A5C0-2.00TB>
          /pci@0,0/pci8086,3410@9/pci8086,346c@0/sd@c,0
       3. c2t13d0 <ATA-Hitachi HUA72303-A5C0-2.00TB>
          /pci@0,0/pci8086,3410@9/pci8086,346c@0/sd@d,0
       4. c2t14d0 <ATA-Hitachi HUA72303-A5C0-2.00TB>
          /pci@0,0/pci8086,3410@9/pci8086,346c@0/sd@e,0
       5. c2t15d0 <ATA-Hitachi HUA72303-A5C0-2.00TB>
          /pci@0,0/pci8086,3410@9/pci8086,346c@0/sd@f,0
       6. c2t16d0 <ATA-Hitachi HUA72303-A5C0-2.00TB>
          /pci@0,0/pci8086,3410@9/pci8086,346c@0/sd@10,0
       7. c2t17d0 <ATA-Hitachi HUA72303-A5C0-2.00TB>
          /pci@0,0/pci8086,3410@9/pci8086,346c@0/sd@11,0
       8. c2t18d0 <ATA-Hitachi HUA72303-A5C0-2.00TB>
          /pci@0,0/pci8086,3410@9/pci8086,346c@0/sd@12,0
       9. c2t19d0 <ATA-Hitachi HUA72303-A5C0-2.00TB>
          /pci@0,0/pci8086,3410@9/pci8086,346c@0/sd@13,0
      10. c2t20d0 <ATA-Hitachi HUA72303-A5C0-2.00TB>
          /pci@0,0/pci8086,3410@9/pci8086,346c@0/sd@14,0
      11. c2t21d0 <ATA-Hitachi HUA72303-A5C0-2.00TB>
          /pci@0,0/pci8086,3410@9/pci8086,346c@0/sd@15,0
      12. c2t22d0 <ATA-Hitachi HUA72303-A5C0-2.00TB>
          /pci@0,0/pci8086,3410@9/pci8086,346c@0/sd@16,0
      13. c2t23d0 <ATA-Hitachi HUA72303-A5C0-2.00TB>
          /pci@0,0/pci8086,3410@9/pci8086,346c@0/sd@17,0
      14. c2t24d0 <ATA-Hitachi HUA72303-A5C0-2.00TB>
          /pci@0,0/pci8086,3410@9/pci8086,346c@0/sd@18,0
      15. c2t25d0 <ATA-Hitachi HUA72303-A5C0-2.00TB>
          /pci@0,0/pci8086,3410@9/pci8086,346c@0/sd@19,0
      16. c3t0d0 <ATA-INTELSSDSA2BT04-0362 cyl 6228 alt 2 hd 224 sec 56>
          /pci@0,0/pci8086,3a40@1c/pci8086,3505@0/sd@0,0
      17. c3t1d0 <ATA-INTELSSDSA2BT04-0362 cyl 6228 alt 2 hd 224 sec 56>
          /pci@0,0/pci8086,3a40@1c/pci8086,3505@0/sd@1,0
      18. c3t2d0 <ATA-INTEL SSDSA2BT04-0362-37.27GB>
          /pci@0,0/pci8086,3a40@1c/pci8086,3505@0/sd@2,0
      19. c3t3d0 <ATA-INTEL SSDSA2BT04-0362-37.27GB>
          /pci@0,0/pci8086,3a40@1c/pci8086,3505@0/sd@3,0
      20. c3t4d0 <ATA-INTEL SSDSA2BW16-0362-149.05GB>
          /pci@0,0/pci8086,3a40@1c/pci8086,3505@0/sd@4,0
      21. c3t5d0 <ATA-INTEL SSDSA2BW16-0362-149.05GB>
          /pci@0,0/pci8086,3a40@1c/pci8086,3505@0/sd@5,0
Specify disk (enter its number):

Disks 0-15 are the SATA drives in the JBOD. Note that they are truncated to 2TB at present, though they are actually 3TB drives – as I mentioned in part 1, the LSI 1604e chip on the external SAS module truncates the disks, and we’re awaiting a new controller. When it arrives we’ll have to re-create the pool, but 2TB is fine for our initial configuration and testing!

Now let’s create the pool with all the disks using raidz3:

zpool create tank raidz3 c2t10d0 c2t11d0 c2t12d0 c2t13d0 c2t14d0 c2t15d0 c2t16d0 c2t17d0 c2t18d0 c2t19d0 c2t20d0 c2t21d0 c2t22d0 c2t23d0 c2t24d0 c2t25d0

To act as a read-cache, we also have 2x 160GB Intel SSDs, so we need to add them as L2ARC cache into the array:

zpool add tank cache c3t4d0 c3t5d0

You can’t mirror the L2ARC, though I did read some interesting thoughts on doing so. I don’t think it’s necessary in our case, however I could see how suddenly losing the cache in some environments might have a massive performance impact that could become mission critical.

We’ve also got a couple of SSDs in the system to act as the ZFS intent log (ZIL), so we’ll add them to the array:

zpool add tank log mirror c3t2d0 c3t3d0

Note that they are mirrored. The ZIL is used to speed up writes: data is written through the SSD before going back to spinning disk, which allows a write to be acknowledged to the client even if it hasn’t yet been fully committed to the main pool. It’s there as a last chance should a write to spinning disk not have completed in the event of a system event (e.g. power failure). With high-speed drives it’s perhaps not necessary, but under heavy load it improves write performance. Given it’s important that it’s consistent, we mirror it.

Ideally the drives should be SLC rather than MLC, but there’s a cost trade-off to be had there.
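At this point it’s worth a quick sanity check that the pool layout looks as intended – one raidz3 vdev of the sixteen SATA drives, a mirrored log and the two cache devices:

zpool status tank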

De-duplication and compression

ZFS allows de-duplication and compression to be enabled at pool or dataset/file-system level. We’ll enable both at the pool level and the file-systems within it will inherit the settings:

zfs set dedup=on tank
zfs set compression=on tank
zfs get compression tank
zfs get compressratio tank

The latter command shows the compression ratio, which will vary over time depending on what files are in the tank.

When thinking about de-duplication, it’s important to consider how much storage you have and how much RAM is available. This is because you want the de-duplication hash tables to be in memory as much as possible for write speed. We’ve only got 12GB RAM in our server, but from what I’ve read, the L2ARC SSDs should also be able to hold some of the tables and pick up some of the workload there.

When setting up de-duplication, we had some discussions about hashing and how it might work in relation to hash clashes … Bonwick’s blog has a lot of answers on this topic!

Just to touch briefly on quotas when using compression and de-duplication. This isn’t something I’ve seen an answer on, and I don’t have time to look at the source code, but here’s my supposition: quotas do take account of compression, i.e. they account for the blocks used on disk rather than the size of the file; quotas don’t take account of any de-duplicated data. I’m pretty sure this must be the case, otherwise the first block owner would be penalised, and the last remaining block owner could suddenly go massively over quota if the other copies were deleted.
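Quotas themselves are just another per-dataset property, so it’s easy enough to set one and watch the space accounting alongside the compression ratio (the 500G figure here is just an example):

zfs set quota=500G tank/research
zfs get used,available,quota,compressratio tank/research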

And a quick note on compression … this is something which I was mildly bemused by on first look:

milo 31% du -sh /data/private/downloads/
4.1G	/data/private/downloads/
milo 32% date && cp -Rp /data/private/downloads/* . && date
Fri May 25 15:25:39 BST 2012
Fri May 25 15:28:29 BST 2012

milo 34% du -sh .
3.0G	.

So by the time the files were copied to the NFS mounted, compressed ZFS file-system, they’d shrunk by 1.1G. Thinking about it, I should have realised that du shows the size of the blocks used rather than the file sizes, but at first glance I was a little bemused!

Automatic snapshots

We’d really like to have automatic snapshots available for our file-systems, and time-slider can provide this. The only real problem is that time-slider is a GUI application, which isn’t ideal on a file-server. Anyway, I found it’s possible to configure auto-snapshots from the terminal. First off, install time-slider:

 pkg install time-slider
 svcadm restart dbus
 svcadm enable time-slider

Note, if you don’t restart dbus after installing time-slider, you’ll get python errors out of the time-slider service like:

dbus.exceptions.DBusException: org.freedesktop.DBus.Error.AccessDenied: Connection ":1.3" is not allowed to own the service "org.opensolaris.TimeSlider" due to security policies in the configuration file

To configure, you need to set the hidden ZFS property on the pool and then enable the auto-snapshots that you require:

zfs set com.sun:auto-snapshot=true tank
svcadm enable auto-snapshot:hourly
svcadm enable auto-snapshot:weekly
svcadm enable auto-snapshot:daily

This blog gives more details on configuring it; in short, you need to export the auto-snapshot manifest, edit it and then re-import it.

However, you should also be able to manipulate the service using svccfg, for example:

svccfg -s auto-snapshot:daily setprop zfs/keep = astring: '7'

would change the keep count for daily snapshots to 7. The interesting properties (for whichever of monthly, daily, hourly or weekly you’re changing) are:

svccfg -s auto-snapshot:daily listprop zfs
zfs           application
zfs/interval  astring  days
zfs/period    astring  1
zfs/keep      astring  6
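One gotcha to be aware of: after a svccfg setprop, I believe you need to refresh the service instance before the running service picks up the new value:

svcadm refresh auto-snapshot:daily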

To list your snapshots, use the following:

root@bigstore:~# zfs list -t snapshot
NAME                                                  USED  AVAIL  REFER  MOUNTPOINT
rpool/ROOT/openindiana-1@install                     17.7M      -  1.52G  -
rpool/ROOT/openindiana-1@2012-05-18-14:54:49          317M      -  1.82G  -
rpool/ROOT/openindiana-1@2012-05-24-10:58:07         65.6M      -  2.32G  -
tank@zfs-auto-snap_weekly-2012-05-24-12h14               0      -  90.6K  -
tank@zfs-auto-snap_daily-2012-05-27-12h14                0      -  90.6K  -
tank@zfs-auto-snap_hourly-2012-05-28-11h14               0      -  90.6K  -
tank/research@zfs-auto-snap_weekly-2012-05-24-12h14  55.0K      -  87.3K  -
tank/research@zfs-auto-snap_hourly-2012-05-24-13h14  55.0K      -  87.3K  -
tank/research@zfs-auto-snap_hourly-2012-05-25-09h14  64.7K      -   113K  -
tank/research@zfs-auto-snap_hourly-2012-05-25-11h14  55.0K      -   116K  -
tank/research@zfs-auto-snap_daily-2012-05-25-12h14   71.2K      -  25.2M  -
tank/research@zfs-auto-snap_hourly-2012-05-25-16h14   162K      -  3.00G  -
tank/research@zfs-auto-snap_daily-2012-05-26-12h14   58.2K      -  3.00G  -
tank/research@zfs-auto-snap_hourly-2012-05-27-04h14  58.2K      -  3.00G  -
tank/research@zfs-auto-snap_daily-2012-05-27-12h14       0      -  3.00G  -
tank/research@zfs-auto-snap_hourly-2012-05-28-11h14      0      -  3.00G  -

Just as a caveat on auto-snapshots: the above output is from a system that has been running weekly, hourly and daily snapshots for several days. Note that the hourly snapshots seem a bit sporadic. This is because intermediate (hourly or daily) snapshots that are 0K in size get removed, i.e. snapshots are only kept where there’s actually been a change in the data. This seems quite sensible really …

Finally, we want to make the snapshots visible to NFS users:

zfs set snapdir=visible tank/research

For Windows users connecting via the in-kernel CIFS server, the previous versions tab should work.

part 1 – Hardware and Basic information
part 2 – Base Operating System Installation
part 3 – Configuring Active Directory authentication integration
part 4 – Configuring the storage pool and auto snapshots
part 5 – NFS Server config & In-kernel CIFS

Building an OpenIndiana based ZFS File Server – part 2

OpenIndiana Installation

Getting OpenIndiana onto our file-server hardware was a pretty simple affair: download the memory stick image, dd it onto a fresh memory stick and install. Actually, we did struggle with this to start with – we were planning on testing it on an old PC in my office (a 2007 generation 965 based Core 2 Duo PC – I can’t believe these are machines coming out of desktop service!).

Basically, follow the text based installer – it’s quite reminiscent of pre-Jumpstart installs! We didn’t create any additional users, as we were planning on integrating with our Active Directory for authentication. If you do create one, the installer will disable root logins by default.

The installer creates a default ZFS rpool for the root file-system; as we want that to be a mirror, we needed to add a second device once it had booted. This blog entry has some instructions on this. As it’s been a while since I’ve done Solaris, I’d actually approached it in a different manner (using format and fdisk to manipulate the disk partitions by hand), but prtvtoc | fmthard is of course the way we used to do it with Solstice DiskSuite, and is quicker and easier.
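For the record, copying the label across that way is a one-liner, something like this with our device names:

prtvtoc /dev/rdsk/c3t0d0s2 | fmthard -s - /dev/rdsk/c3t1d0s2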

If we hadn’t already installed Linux onto the disk to test initially, we probably wouldn’t have got the error

cannot label 'c3t1d0': EFI labeled devices are not supported
    on root pools

when trying to attach the disk directly by running

zpool attach rpool c3t0d0s0 c3t1d0s0

Clearing and reinitialising the partition table resolved that problem though. Don’t forget to use the installgrub command on the second disk to install the boot loader onto the disk, otherwise you won’t actually be able to boot off it in the event of a primary disk failure!

installgrub -m /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c3t1d0s0

Ensure that the disks have finished resilvering before you reboot the system! (zpool status will tell you!).

The resilver process in ZFS will seem different to those familiar with mirrors in DiskSuite or Linux software RAID, as it doesn’t create an exact block-for-block copy of the disks; it only copies data blocks, so the data layout on the two disks is likely to differ. It also means that if you have a 100GB drive with only 8GB of data, you only need to mirror the 8GB of data, not all the empty file-system blocks as well …

Updating the OS

Since OpenIndiana 151a was released, a bunch of updates have been made available, so you want to upgrade to the latest image. First off, update the pkg package:

pkg install package/pkg

You can then do a trial run image update:

pkg image-update -nv

Run it again without the “n” flag to actually do the update. This will create a new boot environment, which you can list using the command:

root@bigstore:~# beadm list
BE            Active Mountpoint Space Policy Created
openindiana   -      -          8.17M static 2012-05-18 15:11
openindiana-1 NR     /          3.95G static 2012-05-18 15:54
openindiana-2 -      -          98.0K static 2012-05-24 11:58

That’s it, you’re installed and all up to date!

In addition to the base Operating System repositories, we also add some extra repos:

pkg set-publisher -p http://pkg.openindiana.org/sfe
pkg set-publisher -p http://pkg.openindiana.org/sfe-encumbered

Enabling Serial Console

All of our servers are connected to Cyclades serial console servers, which let us connect to the serial ports via ssh from remote locations – great when bits of the system go down. To enable the serial console, you need to edit the grub config file; as we have a ZFS root pool, it’s located at /rpool/boot/grub/menu.lst. You need to comment out the splashimage line if present (as you can’t display the XPM down a serial port!), then add a couple more lines to enable serial access for grub:

 ###splashimage /boot/grub/splash.xpm.gz
 serial --unit=0 --speed=9600
 terminal --timeout=5 serial

You also need to find the kernel line and append some data to that:

-B console=ttya,ttya-mode="9600,8,n,1,-"

As there was already a -B flag in use, our resulting kernel line looked like this:

kernel$ /platform/i86pc/kernel/$ISADIR/unix -B $ZFS-BOOTFS,console=ttya,ttya-mode="9600,8,n,1,-"

You might want to tweak the line-speed, parity etc for your environment.

Static IP address

For our servers, we prefer static IP addressing rather than DHCP. As we selected auto-configure during the install, we need to swap to a static IP address. Note that if you don’t have access via a serial or real console, you’ll likely get disconnected!

First off, we need to disable the DHCP client. The OpenIndiana DHCP client is part of the nwam service:

svcadm disable network/physical:nwam

Boom! Down goes the network connection. So make sure you have access via another method! I’ve been adminning Solaris for a long time (since 2001), so I’m going to configure networking in the traditional (pre Solaris 10) manner … with some files, though you can also do this step with the ipadm command (there’s a rough sketch of that further down).

/etc/hostname.ixgbe0
    bigstore
/etc/hosts
    147.188.203.45 bigstore bigstore.cs.bham.ac.uk bigstore.local
    ::1 bigstore bigstore.local localhost loghost
/etc/defaultrouter
    147.188.203.6
/etc/netmasks
    147.188.0.0     255.255.255.0
/etc/resolv.conf
    domain  cs.bham.ac.uk
    search cs.bham.ac.uk
    nameserver  147.188.192.4
    nameserver  147.188.192.8
cp /etc/nsswitch.dns /etc/nsswitch.conf

Finally we need to enable the static IP configuration service:

svcadm enable network/physical:default
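For completeness, the rough ipadm equivalent would be something along these lines (same interface, address and gateway as in the files above – I’ve not actually done it this way, so treat it as a sketch):

ipadm create-if ixgbe0
ipadm create-addr -T static -a 147.188.203.45/24 ixgbe0/v4
route -p add default 147.188.203.6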

part 1 – Hardware and Basic information
part 2 – Base Operating System Installation
part 3 – Configuring Active Directory authentication integration
part 4 – Configuring the storage pool and auto snapshots
part 5 – NFS Server config & In-kernel CIFS

Building an OpenIndiana based ZFS File Server

We’ve recently been looking at options to replace one of our ageing file-servers, which stores research data. The data currently sits on an out-of-warranty Infortrend RAID array that we purchased back in July 2005. It’s got 3TB of storage, so it was a reasonable size when we bought it. It’s attached to a Sun V20z box running Linux, serving data via NFS and Samba.

Our home directory storage is hung off a NetApp FAS3140 filer head, but we just couldn’t afford to purchase additional storage on there for the huge volumes of research data people want to keep.

So we looked around. Spoke to some people. Then came up with a plan. We’ve decided to home-brew a file-server based on ZFS as the underlying file system. ZFS was developed by Sun for Solaris; it’s designed to cope with huge file-systems and provides a number of key features for modern file-systems, including snapshots, de-duplication, cloning and management of disk storage in pools. You can create and destroy file-systems on the fly within a ZFS pool (a file-system looks like a directory in userland – you just create it with a simple zfs create command).
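Just to illustrate how lightweight that is (the dataset names here are only examples):

zfs create tank/projects
zfs list
zfs destroy tank/projects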

Since OpenSolaris, there have been ports of ZFS to other operating systems: FreeBSD has an implementation, and there’s even ZFS on Linux.

Hardware spec

As we were pondering an Illumos based OS for the system, we wanted kit we’d be fairly confident was supported. There’s no real compatibility list for Illumos, but after some digging around we found hardware we were fairly happy would work.

One pair of 40GB SSDs forms the mirrored Operating System pool, another mirrored pair of SSDs is for the ZFS intent log (ZIL), and the two 160GB SSDs provide an L2ARC cache.

Once the kit arrived and was racked up, a bit of testing was done, but I was out of the office for a few weeks so didn’t inspect it properly until later. In short, the external SAS module (based on the LSI 1604e chipset) was causing us some problems: the 3TB drives were being truncated to 2TB. After several calls to Viglen technical support, it transpired that the external SAS module doesn’t support 3TB drives. They’ve agreed to ship out a new LSI 9205-8e HBA, which we believe should work fine with OpenIndiana.

Initially we were offered a MegaRAID card, but there’s no mention of a Solaris 10 driver on the LSI site, so we steered clear of it.

And for an operating system?

We started off with one of our standard Scientific Linux 6 server installs to test out the hardware. We also built ZFS on Linux to try it out. It works great (we later even imported a pool created under Linux into a SmartOS instance). Our main issues with this approach are:

  • We have to build ZFS for the kernel, and kernel updates would break our ZFS access – we’d have to rebuild for each update.
  • ZFS on Linux with the POSIX layer isn’t as mature as ZFS on Illumos
  • We’d have to use Samba to share to our Windows clients
  • ZFS snapshots are inaccessible from NFS clients

We were always planning on looking at an Illumos based build (we did consider Oracle Solaris on non-Oracle hardware, but the cost is quite high). So we looked around at our options.

SmartOS

Hey, wouldn’t it be neat if we could run SmartOS on it? This is an Illumos based build which boots off a memory stick, and you then run KVM/zones virtual machines inside it to provide services. The outer SmartOS is transient, but the inner VMs are persistent on the ZFS pool. The big problem, though, is that you can only run an NFS server in the global zone, as it’s tightly hooked into the kernel. That basically scuppers the plan to use SmartOS here.

OpenIndiana

There are numerous other Illumos based distributions about, but OpenIndiana is probably the closest to how OpenSolaris was when it was “closed”, and possibly the most stable of the distributions.

Installing OI is easy from a USB memory stick or CD. Getting it fully integrated into our systems was another matter! We had a bunch of issues with Kerberos authentication (we authenticate Linux against our AD), with autofs and our AD/LDAP maps, and then with getting the in-kernel CIFS service working, as we have a disjoint DNS namespace (our Windows DNS name is different from our normal cs.bham.ac.uk domain) – it broke the Kerberos authentication we’d already got working!

I’ll post some more details on getting things working in another article!

part 1 – Hardware and Basic information
part 2 – Base Operating System Installation
part 3 – Configuring Active Directory authentication integration
part 4 – Configuring the storage pool and auto snapshots
part 5 – NFS Server config & In-kernel CIFS

High availability and clustering

Whilst we have pretty good uptime and availability on most of the systems we run here, we do get the odd hardware problem. When it’s a disk that’s gone down, that’s not a problem, as our physical servers all run mirrored pairs in hot-swap enclosures, so we can swap disks out easily.

We’ve recently been doing a lot of work into reducing the number of physical servers we have by moving into a virtualised environment (QEMU+KVM running on Scientific Linux 6 x86_64 hardware). We’ve got some fairly major drivers to virtualise more hardware due to an ageing server estate. Virtualising our estate gives us the opportunity to review how we’ve got systems configured and to look at if we can improve service by introducing high-availability and load-balancing.

In the past week I’ve been looking at our web presence, www.cs.bham.ac.uk. The current infrastructure for this was installed in 2006, and things were a little different back then. Before the 2006 system, the site was running on a Sun Ultra 5 with NFS mounted file-store. The power supply to the building was unreliable, and the file-server could take several hours to reboot following a power outage. There was demand to provide at least some web presence for external users whilst waiting for the systems to reboot. We built a system out of two Sun X2100 servers: one a stand-alone web-server serving the very front few web pages, the second using NFS mounts to serve the rest of the site. We used the Apache httpd proxy module to transparently forward requests to the back-end hardware.

Fast forward 6 years. We still use an NFS file-server for all our stuff, but it’s considerably more reliable – you’d expect that, as we have a NetApp storage system. We also have high speed interconnects in the network (10GbE links to the filer and between the core switching fabric).

LVS Direct Routing

We’ve got a pair of SL6 machines running in our virtual environment which are configured as a hot/standby LVS load-balancer. We’re using piranha to manage the LVS configs. It’s something I’ve used before, but never really in anger (we did some experimentation a few years ago running Samba in a cluster – it was fine till we did fail-over of a node…). LVS is actually really easy to get installed and working, and the fail-over between nodes seems to work reliably. It’s not great when you change the config, though, as you have to restart pulse – and the second node has a habit of taking over the cluster at that point!

I’ve been pondering how to move the www service into a high-availability configuration. One option is to continue to have a front-end node using mod_proxy to forward requests to a back-end set of servers. The problem with this is that the front-end server will either be a physical machine, or a VM (which requires our NFS file-server to be up for the VM host servers). It could be an HA cluster of front-end machines, but it would still rely on the load balancers, which again are VMs and require the NFS server to be up.

mod_proxy

We already use mod_proxy in httpd to hand off calls to back-end servers; for example, our “personal” home pages are served from a completely separate VM. In the past, personal pages with “issues” have had an impact on the main www web presence, so we separated them. The following config snippet shows what we do here:

    <Proxy balancer://staffwebcluster>
      BalancerMember http://staffweb-lb-1.cs.bham.ac.uk loadfactor=1
      BalancerMember http://staffweb-lb-2.cs.bham.ac.uk loadfactor=1
      ProxySet lbmethod=bytraffic
    </Proxy>
    <Proxy balancer://staffwebclusterssl>
      BalancerMember https://staffweb-lb-1.cs.bham.ac.uk loadfactor=1
      BalancerMember https://staffweb-lb-2.cs.bham.ac.uk loadfactor=1
      ProxySet lbmethod=bytraffic
    </Proxy>
    RewriteCond %{HTTPS} on
    RewriteRule ^/~(.*) balancer://staffwebclusterssl/~$1 [P]
    RewriteRule ^/~(.*) balancer://staffwebcluster/~$1 [P]

The problem with this approach for a “front-end” server cluster is that we’ve carefully used firewall marks on the load balancer to ensure visitors hit the same front-end server for both HTTP and HTTPS transactions, but we then have no control over which back-end server the client connects to, as that’s determined by Apache’s load balancing.
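For reference, the firewall marks are just iptables mangle rules on the LVS directors – roughly like the line below, with a made-up VIP – and the mark is then used as the virtual service definition (with persistence) in the LVS config, so that HTTP and HTTPS from one client stick to the same real server:

iptables -t mangle -A PREROUTING -d 147.188.203.80/32 -p tcp -m multiport --dports 80,443 -j MARK --set-mark 80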

So, given that we rely on our NFS server being up for practically the whole system to be available, do we really still need a front-end/back-end configuration? I’m fairly sure we don’t, in real terms. So whilst we’ll continue to use mod_proxy to allow us to run whole sections of our web-server on different real hosts (some parts of the www site are even proxied to an IIS server), we’ll be dropping the front-end/back-end approach and letting the load-balancer handle the traffic for us.