Configuring the Storage Pools – Building an OpenIndiana based ZFS File Server – part 4

ZFS offers a number of different RAID configurations. raid-z1 is basically like traditional RAID 5 with a single parity disk, raid-z2 has two parity disks, and there’s also raid-z3 with three parity disks. Actually, “parity disk” isn’t strictly correct: the parity is distributed across the disks in the set, so raid-z3 keeps three parity blocks per stripe rather than three dedicated disks.
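By way of illustration, the three levels are just different keywords to zpool create (the disk names here are made up):

zpool create tank raidz1 c0t0d0 c0t1d0 c0t2d0 c0t3d0                 # single parity
zpool create tank raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0          # double parity
zpool create tank raidz3 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0   # triple parity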

Using multiple parity sets is important, particularly when using high capacity disks. It’s not just a case of losing multiple whole disks; you also need to account for what happens if you get block failures on a high capacity disk. On our NetApp filer head we’ve been using RAID-DP for some years now, and we’ve actually had a double disk failure occur on a SATA shelf without losing any data; the hot spare spun in quickly enough that we’d have had to lose another two disks before actually losing anything.

For this file-server we’ve got a 16 disk JBOD array, and we had a few discussions about how to configure the array and whether we should have hot-spares. Looking around, people seem to always suggest having a global hot-spare available. However, with raid-z3, we’ve decided to challenge that view.

We’d been taking it for granted that we’d run at least raid-z2 with dual parity and a hot spare, but with raid-z3 we don’t believe we need a hot-spare. Using all the disks in raid-z3, we’re not burning any more disks than we would have done with raid-z2 plus a hot spare, and the extra parity set effectively acts as a hot-spare that’s already resilvered and spinning, with the data available. In that sense we’d be able to lose three disks and still not have lost any data (some performance, yes, but no actual data until we’d lost a fourth disk). That’s the same as raid-z2 plus a hot spare, except there we’d carry a risk factor whilst the hot spare was resilvering.
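Putting rough numbers on that for our 16 x 2TB disks (a back-of-envelope sketch, ignoring ZFS overheads):

# raid-z2 over 15 disks + 1 hot spare: 13 data disks (~26TB usable), survives
#   2 simultaneous failures, with a window of risk while the spare resilvers
# raid-z3 over all 16 disks, no spare:  13 data disks (~26TB usable), survives
#   3 simultaneous failures, with no resilver window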

Of course, you need to take a look at the value of your data and consider the options yourself! There’s an Oracle paper on RAID strategy for high capacity drives.

Configuring the Pool

All the examples of building a pool use the name tank for the pool name, so I’m going to hang with tradition and call ours tank as well. First off, we need to work out which disks we want to add to the pool; format is our friend here:

root@bigstore:~# format
Searching for disks...done

AVAILABLE DISK SELECTIONS:
       0. c2t10d0 <ATA-Hitachi HUA72303-A5C0-2.00TB>
          /pci@0,0/pci8086,3410@9/pci8086,346c@0/sd@a,0
       1. c2t11d0 <ATA-Hitachi HUA72303-A5C0-2.00TB>
          /pci@0,0/pci8086,3410@9/pci8086,346c@0/sd@b,0
       2. c2t12d0 <ATA-Hitachi HUA72303-A5C0-2.00TB>
          /pci@0,0/pci8086,3410@9/pci8086,346c@0/sd@c,0
       3. c2t13d0 <ATA-Hitachi HUA72303-A5C0-2.00TB>
          /pci@0,0/pci8086,3410@9/pci8086,346c@0/sd@d,0
       4. c2t14d0 <ATA-Hitachi HUA72303-A5C0-2.00TB>
          /pci@0,0/pci8086,3410@9/pci8086,346c@0/sd@e,0
       5. c2t15d0 <ATA-Hitachi HUA72303-A5C0-2.00TB>
          /pci@0,0/pci8086,3410@9/pci8086,346c@0/sd@f,0
       6. c2t16d0 <ATA-Hitachi HUA72303-A5C0-2.00TB>
          /pci@0,0/pci8086,3410@9/pci8086,346c@0/sd@10,0
       7. c2t17d0 <ATA-Hitachi HUA72303-A5C0-2.00TB>
          /pci@0,0/pci8086,3410@9/pci8086,346c@0/sd@11,0
       8. c2t18d0 <ATA-Hitachi HUA72303-A5C0-2.00TB>
          /pci@0,0/pci8086,3410@9/pci8086,346c@0/sd@12,0
       9. c2t19d0 <ATA-Hitachi HUA72303-A5C0-2.00TB>
          /pci@0,0/pci8086,3410@9/pci8086,346c@0/sd@13,0
      10. c2t20d0 <ATA-Hitachi HUA72303-A5C0-2.00TB>
          /pci@0,0/pci8086,3410@9/pci8086,346c@0/sd@14,0
      11. c2t21d0 <ATA-Hitachi HUA72303-A5C0-2.00TB>
          /pci@0,0/pci8086,3410@9/pci8086,346c@0/sd@15,0
      12. c2t22d0 <ATA-Hitachi HUA72303-A5C0-2.00TB>
          /pci@0,0/pci8086,3410@9/pci8086,346c@0/sd@16,0
      13. c2t23d0 <ATA-Hitachi HUA72303-A5C0-2.00TB>
          /pci@0,0/pci8086,3410@9/pci8086,346c@0/sd@17,0
      14. c2t24d0 <ATA-Hitachi HUA72303-A5C0-2.00TB>
          /pci@0,0/pci8086,3410@9/pci8086,346c@0/sd@18,0
      15. c2t25d0 <ATA-Hitachi HUA72303-A5C0-2.00TB>
          /pci@0,0/pci8086,3410@9/pci8086,346c@0/sd@19,0
      16. c3t0d0 <ATA-INTELSSDSA2BT04-0362 cyl 6228 alt 2 hd 224 sec 56>
          /pci@0,0/pci8086,3a40@1c/pci8086,3505@0/sd@0,0
      17. c3t1d0 <ATA-INTELSSDSA2BT04-0362 cyl 6228 alt 2 hd 224 sec 56>
          /pci@0,0/pci8086,3a40@1c/pci8086,3505@0/sd@1,0
      18. c3t2d0 <ATA-INTEL SSDSA2BT04-0362-37.27GB>
          /pci@0,0/pci8086,3a40@1c/pci8086,3505@0/sd@2,0
      19. c3t3d0 <ATA-INTEL SSDSA2BT04-0362-37.27GB>
          /pci@0,0/pci8086,3a40@1c/pci8086,3505@0/sd@3,0
      20. c3t4d0 <ATA-INTEL SSDSA2BW16-0362-149.05GB>
          /pci@0,0/pci8086,3a40@1c/pci8086,3505@0/sd@4,0
      21. c3t5d0 <ATA-INTEL SSDSA2BW16-0362-149.05GB>
          /pci@0,0/pci8086,3a40@1c/pci8086,3505@0/sd@5,0
Specify disk (enter its number):

Disks 0-15 are the SATA drives in the JBOD. Note that they currently show up as 2TB disks, though they are actually 3TB; as I mentioned in part 1, the LSI 1604e chip on the external SAS module truncates the disks, and we’re awaiting a new controller. When the new controller arrives we’ll have to re-create the pool, but 2TB is fine for our initial configuration and testing!

Now let’s create the pool with all the disks using raidz3:

zpool create tank raidz3 c2t10d0 c2t11d0 c2t12d0 c2t13d0 c2t14d0 c2t15d0 c2t16d0 c2t17d0 c2t18d0 c2t19d0 c2t20d0 c2t21d0 c2t22d0 c2t23d0 c2t24d0 c2t25d0
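Once that returns, it’s worth a quick sanity check that the pool has the shape we expect; zpool status should show all 16 disks under a single raidz3 vdev:

zpool status tank   # layout and health of each device
zpool list tank     # overall size and allocation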

To act as a read-cache, we also have 2x 160GB Intel SSDs, so we add them to the pool as L2ARC cache devices:

zpool add tank cache c3t4d0 c3t5d0

You can’t mirror the L2ARC, though I did read some interesting thoughts on doing so. I don’t think it’s necessary in our case, but I can see how suddenly losing the cache in some environments might have a performance impact big enough to become mission critical.
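If you want to keep an eye on how the cache devices are actually being used, zpool iostat will break the numbers down per device:

zpool iostat -v tank 5   # -v shows per-device stats (including cache), every 5 seconds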

We’ve also got a couple of SSDs in the system to act as the ZFS intent log (ZIL), so we’ll add them to the array:

zpool add tank log mirror c3t2d0 c3t3d0

Note that they are mirrored. The ZIL logs synchronous writes to fast, stable storage so that a write can be acknowledged to the client before it has been fully committed to spinning disk; the log is only ever read back as a last chance should a system event (e.g. a power failure) stop those writes completing against the main pool. With high-speed drives it’s perhaps not necessary, but under heavy load it gives a real boost to synchronous write performance. Given it’s important that it’s consistent, we mirror it.
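It’s worth noting the log devices only come into play for synchronous writes, of which NFS generates plenty. The behaviour is tunable per dataset via the sync property if you ever want to measure the effect:

zfs get sync tank   # 'standard' by default
# sync=disabled would bypass the ZIL entirely -- fast, but unsafe for NFS clients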

Ideally the drives should be SLC rather than MLC, but there’s a cost trade-off to be had there.

De-duplication and compression

ZFS allows de-duplication and compression to be enabled at the pool or dataset/file-system level. We’ll enable both at the pool level and the file-systems will inherit from there:

zfs set dedup=on tank
zfs set compression=on tank
zfs get compression tank
zfs get compressratio tank

The last command shows the compression ratio, which will vary over time depending on what files are in the tank.
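For the de-duplication side, the pool-wide ratio shows up in the DEDUP column of zpool list; zdb should also be able to dump the de-duplication table statistics, though I’ve not leant on it heavily:

zpool list tank    # the DEDUP column shows the pool-wide dedup ratio
zdb -DD tank       # DDT histogram and summary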

When thinking about de-duplication, it’s important to consider how much storage you have and how much RAM is available. This is because you want the de-duplication hash tables to be in memory as much as possible for write speed. We’ve only got 12GB RAM in our server, but from what I’ve read, the L2ARC SSDs should also be able to hold some of the tables and pick up some of the workload there.
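As a rough back-of-envelope sketch (the ~320 bytes per unique block figure is one that gets quoted around, so treat it as an estimate):

# ~2TB of unique data at a 128K average block size ~= 16 million blocks
# 16M blocks x ~320 bytes per dedup table entry   ~= 5GB of dedup table
# That won't all sit in 12GB RAM alongside the ARC, hence leaning on the L2ARC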

When setting up de-duplication, we had some discussions about hashing and how it might work in relation to hash clashes … Bonwick’s blog has a lot of answers on this topic!

Just to touch briefly on quotas when using compression and de-duplication. This isn’t something I’ve seen an answer on, and I don’t have time to dig through the source code, but here’s my supposition: quotas do take account of compression, i.e. they account for the blocks used on disk rather than the apparent size of the file. Quotas don’t take account of any de-duplicated data. I’m pretty sure this must be the case, otherwise the first owner of a block would be penalised, and the final owner could suddenly go massively over quota if the other copies were deleted.
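If you wanted to test the compression half of that supposition, a quick experiment would be something like the following (a sketch only; tank/quotatest is just a made-up dataset name):

# Sketch only -- tank/quotatest is a hypothetical dataset
zfs create -o quota=1G tank/quotatest
cp some-compressible-file /tank/quotatest/
zfs get used,quota tank/quotatest   # 'used' should reflect compressed on-disk blocks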

And a quick note on compression … this is something which I was mildly bemused by on first look:

milo 31% du -sh /data/private/downloads/
4.1G	/data/private/downloads/
milo 32% date && cp -Rp /data/private/downloads/* . && date
Fri May 25 15:25:39 BST 2012
Fri May 25 15:28:29 BST 2012

milo 34% du -sh .
3.0G	.

So by the time the files were copied to the NFS-mounted, compressed ZFS file-system, they’d shrunk by 1.1G. Thinking about it, I should have realised that du shows the size of the blocks used rather than the file sizes, but at first glance I was a little bemused!
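The quick way to see both numbers side by side on the client is to compare the apparent file size with the blocks actually allocated (the file name here is just for illustration):

ls -lh somefile   # apparent file size
du -h somefile    # blocks actually used on disk, post-compression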

Automatic snapshots

We’d really like to have automatic snapshots available for our file-systems, and time-slider can provide this. The only real problem is that time-slider is a GUI application, which isn’t ideal on a file-server. Anyway, I found it’s possible to configure auto-snapshots from the terminal. First off, install time-slider:

 pkg install time-slider
 svcadm restart dbus
 svcadm enable time-slider

Note: if you don’t restart dbus after installing time-slider, you’ll get Python errors out of the time-slider service like:

dbus.exceptions.DBusException: org.freedesktop.DBus.Error.AccessDenied: Connection ":1.3" is not allowed to own the service "org.opensolaris.TimeSlider" due to security policies in the configuration file

To configure, you need to set the hidden ZFS property on the pool and then enable the auto-snapshots that you require:

zfs set com.sun:auto-snapshot=true tank
svcadm enable auto-snapshot:hourly
svcadm enable auto-snapshot:weekly
svcadm enable auto-snapshot:daily
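As a quick sanity check that everything came online:

svcs | grep auto-snapshot   # one line per enabled schedule
svcs time-slider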

This blog gives more details on configuring; in short, you need to export the auto-snapshot manifest, edit it and then re-import it.

However, you should also be able to manipulate the service using svccfg, for example:

svccfg -s auto-snapshot:daily setprop zfs/keep= astring: '7'

Would change the keep period for daily snapshots to 7. The interesting properties (whichever of monthly, daily, hourly or weekly you are changing) are:

svccfg -s auto-snapshot:daily listprop zfs
zfs           application
zfs/interval  astring  days
zfs/period    astring  1
zfs/keep      astring  6
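So, for example, to keep 48 hourly snapshots rather than the default, set the property and then refresh the instance so the running service picks up the change:

svccfg -s auto-snapshot:hourly setprop zfs/keep= astring: '48'
svcadm refresh auto-snapshot:hourly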

To list your snapshots, use the following:

root@bigstore:~# zfs list -t snapshot
NAME                                                  USED  AVAIL  REFER  MOUNTPOINT
rpool/ROOT/openindiana-1@install                     17.7M      -  1.52G  -
rpool/ROOT/openindiana-1@2012-05-18-14:54:49          317M      -  1.82G  -
rpool/ROOT/openindiana-1@2012-05-24-10:58:07         65.6M      -  2.32G  -
tank@zfs-auto-snap_weekly-2012-05-24-12h14               0      -  90.6K  -
tank@zfs-auto-snap_daily-2012-05-27-12h14                0      -  90.6K  -
tank@zfs-auto-snap_hourly-2012-05-28-11h14               0      -  90.6K  -
tank/research@zfs-auto-snap_weekly-2012-05-24-12h14  55.0K      -  87.3K  -
tank/research@zfs-auto-snap_hourly-2012-05-24-13h14  55.0K      -  87.3K  -
tank/research@zfs-auto-snap_hourly-2012-05-25-09h14  64.7K      -   113K  -
tank/research@zfs-auto-snap_hourly-2012-05-25-11h14  55.0K      -   116K  -
tank/research@zfs-auto-snap_daily-2012-05-25-12h14   71.2K      -  25.2M  -
tank/research@zfs-auto-snap_hourly-2012-05-25-16h14   162K      -  3.00G  -
tank/research@zfs-auto-snap_daily-2012-05-26-12h14   58.2K      -  3.00G  -
tank/research@zfs-auto-snap_hourly-2012-05-27-04h14  58.2K      -  3.00G  -
tank/research@zfs-auto-snap_daily-2012-05-27-12h14       0      -  3.00G  -
tank/research@zfs-auto-snap_hourly-2012-05-28-11h14      0      -  3.00G  -

Just as a caveat on auto-snapshots: the above output is from a system that has been running weekly, daily and hourly snapshots for several days. Note that the hourly snapshots seem a bit sporadic in their availability. This is because intermediate (hourly or daily) snapshots that are 0K in size get removed, i.e. snapshots are only listed where there’s actually been a change in the data. This seems quite sensible really…

Finally, we want to make the snapshots visible to NFS users:

zfs set snapdir=visible tank/research
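With that set, users can browse into the (normally hidden) .zfs directory at the root of the file-system and pull back old copies of files themselves; the paths below are from one of our NFS clients, just as an illustration:

# On an NFS client; mount path and file name are illustrative
ls /data/research/.zfs/snapshot/
cp /data/research/.zfs/snapshot/zfs-auto-snap_daily-2012-05-27-12h14/somefile .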

For Windows users connecting via the in-kernel CIFS server, the Previous Versions tab should work.

part 1 – Hardware and Basic information
part 2 – Base Operating System Installation
part 3 – Configuring Active Directory authentication integration
part 4 – Configuring the storage pool and auto snapshots
part 5 – NFS Server config & In-kernel CIFS

Building an OpenIndiana based ZFS File Server

We’ve recently been looking at options to replace one of our ageing file-servers which stores research data. The data currently sits on an out-of-warranty Infortrend RAID array which we purchased back in July 2005. Basically it’s got 3TB of storage, so it was a reasonable size when we bought it. It’s attached to a Sun V20z box running Linux, serving data via NFS and Samba.

Our home directory storage is hung off a NetApp FAS3140 filer head, but we just couldn’t afford to purchase additional storage on there for the huge volumes of research data people want to keep.

So we looked around, spoke to some people, and came up with a plan: we’ve decided to home-brew a file-server based on ZFS as the underlying file system. ZFS was developed by Sun for Solaris and is designed to cope with huge file-systems, providing a number of key features for modern file-systems, including snapshotting, de-duplication, cloning and management of disk storage in pools. You can create and destroy file-systems on the fly within a ZFS pool; a file-system looks like a directory in userland, and you just create it with a simple zfs create command, as shown below.
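For instance, a minimal sketch of carving out a file-system and giving it its own mount point (names purely illustrative):

zfs create tank/research
zfs set mountpoint=/data/research tank/research   # optional; defaults to /tank/research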

Since OpenSolaris, ZFS has been ported to other operating systems: FreeBSD has an implementation, and there’s even ZFS on Linux.

Hardware spec

As we were pondering an Illumos based OS for the system, we wanted kit we’d be fairly happy was supported. There’s no real hardware compatibility list for Illumos, but digging around we found some kit we were fairly happy would work.

One pair of 40GB SSDs is the mirrored pair for the operating system, another mirrored pair of SSDs is for the ZFS intent log (ZIL), and the two 160GB SSDs provide an L2ARC cache.

Once the kit arrived and was racked up, a bit of testing was done, but I was out of the office for a few weeks so didn’t inspect it properly until later. In short, the external SAS module (based on the LSI 1604e chipset) was causing us problems: the 3TB drives were being truncated to 2TB. After several calls to Viglen technical support, it transpires that the external SAS module doesn’t support 3TB drives. They’ve agreed to ship out a new LSI 9205-8e HBA, which we believe should work fine with OpenIndiana.

Initially we were offered a MegaRAID card, but there’s no mention of a Solaris 10 driver on the LSI site, so we steered clear of it.

And for an operating system?

We started off with one of our standard Scientific Linux 6 server installs to test out the hardware. We also built up ZFS on Linux to try it out. It works great (and we later even imported a pool created under Linux into a SmartOS instance). Our main issues with this approach were:

  • We have to build ZFS for the kernel, and kernel updates would break our ZFS access – we’d have to rebuild for each update.
  • ZFS on Linux with the POSIX layer isn’t as mature as ZFS on Illumos
  • We’d have to use Samba to share to our Windows clients
  • ZFS snapshots are inaccessible from NFS clients

We were always planning on looking at an Illumos based build (we did consider Oracle Solaris on non-Oracle hardware, but the cost is quite high). So we looked around at our options.

SmartOS

Hey, wouldn’t it be neat if we could run SmartOS on it? This is an Illumos based build which boots off a memory stick; you then run KVM/zones virtual machines inside it to provide services. The outer SmartOS is transient, but the inner VMs are persistent on the ZFS pool. The big problem with this, though, is that you can only run an NFS server in the global zone, as it’s tightly hooked into the kernel. This basically scuppers the plan to use SmartOS as our option here.

OpenIndiana

There are numerous other Illumos based distributions about, but OpenIndiana is probably the closest to how OpenSolaris was when it was “closed”, and possibly the most stable of the distributions.

Installing OI is easy from a USB memory stick or CD. Getting it fully integrated into our environment was another matter! We had a bunch of issues with Kerberos authentication (we authenticate Linux against our AD), with autofs and our AD/LDAP maps, and then with getting the in-kernel CIFS service working, as we have a disjoint DNS namespace (our Windows DNS name is different from our normal working cs.bham.ac.uk) and it broke the Kerberos authentication we’d already got working!

I’ll post some more details on getting things working in another article!

part 1 – Hardware and Basic information
part 2 – Base Operating System Installation
part 3 – Configuring Active Directory authentication integration
part 4 – Configuring the storage pool and auto snapshots
part 5 – NFS Server config & In-kernel CIFS