Configuring the Storage Pools – Building an OpenIndiana based ZFS File Server – part 4

ZFS offers a number of different types of RAID configuration. raid-z1 is basically like traditional raid5 with a single parity disk, raid-z2 has two parity disks and there’s also raid-z3 with three. Actually, “parity disk” isn’t strictly correct, as the parity is distributed across the disks in the set rather than sitting on dedicated drives; with raid-z3 there are effectively three copies of parity available for each stripe.

Using multiple parity sets is important, particularly when using high capacity disks. It’s not necessarily just a case of losing multiple disks, but also taking account of what happens if you get block failures on a high capacity disk. In fact, on our NetApp filer head we’ve been using RAID-DP for some years now. We’ve actually had a double disk failure on a SATA shelf, but we didn’t lose any data; the hot spare spun in quickly enough that we’d have had to lose another two disks before actually losing anything.

For this file-server we’ve got a 16 disk JBOD array, and we had a few discussions about how to configure it: should we have hot-spares? Looking around, people seem to always suggest having a global hot-spare available. However, with raid-z3 we’ve decided to challenge that view.

We were taking for granted running at least raid-z2 with dual parity and a hot spare, however with raid-z3 we don’t believe we need a hot-spare. Using all the disks in raid-z3, we’re not burning any more disks than we would have done with raid-z2 + hot spare, and the extra parity set effectively acts as if we’d got a hot-spare, except it’s already resilvered and spinning with the data available. So in that sense we’d be able to lose three disks and still not have lost any data (some performance, yes, but not actually any data until we’d lost a fourth disk). And that’s the same as raid-z2 + hot spare, except there we’d be carrying extra risk whilst the hot spare was resilvering …
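
As a rough sanity check of the “not burning any more disks” point (using the 2TB per disk the controller currently exposes, and ignoring ZFS metadata overhead):

16 disks as raid-z3:          13 data + 3 parity            = ~26TB usable
16 disks as raid-z2 + spare:  13 data + 2 parity + 1 spare  = ~26TB usable

Either way 13 disks’ worth of capacity is usable; the raid-z3 layout just keeps the sixteenth disk doing useful work as parity rather than sat idle as a spare.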

Of course, you need to take a look at the value of your data and consider the options yourself! There’s an Oracle paper on RAID strategy for high capacity drives.

Configuring the Pool

All the examples of building a pool use the name tank for the pool name, so I’m going to hang with tradition and call ours tank as well. First off, we need to work out which disks we want to add to the pool; format is our friend here:

root@bigstore:~# format
Searching for disks...done

AVAILABLE DISK SELECTIONS:
       0. c2t10d0 <ATA-Hitachi HUA72303-A5C0-2.00TB>
          /pci@0,0/pci8086,3410@9/pci8086,346c@0/sd@a,0
       1. c2t11d0 <ATA-Hitachi HUA72303-A5C0-2.00TB>
          /pci@0,0/pci8086,3410@9/pci8086,346c@0/sd@b,0
       2. c2t12d0 <ATA-Hitachi HUA72303-A5C0-2.00TB>
          /pci@0,0/pci8086,3410@9/pci8086,346c@0/sd@c,0
       3. c2t13d0 <ATA-Hitachi HUA72303-A5C0-2.00TB>
          /pci@0,0/pci8086,3410@9/pci8086,346c@0/sd@d,0
       4. c2t14d0 <ATA-Hitachi HUA72303-A5C0-2.00TB>
          /pci@0,0/pci8086,3410@9/pci8086,346c@0/sd@e,0
       5. c2t15d0 <ATA-Hitachi HUA72303-A5C0-2.00TB>
          /pci@0,0/pci8086,3410@9/pci8086,346c@0/sd@f,0
       6. c2t16d0 <ATA-Hitachi HUA72303-A5C0-2.00TB>
          /pci@0,0/pci8086,3410@9/pci8086,346c@0/sd@10,0
       7. c2t17d0 <ATA-Hitachi HUA72303-A5C0-2.00TB>
          /pci@0,0/pci8086,3410@9/pci8086,346c@0/sd@11,0
       8. c2t18d0 <ATA-Hitachi HUA72303-A5C0-2.00TB>
          /pci@0,0/pci8086,3410@9/pci8086,346c@0/sd@12,0
       9. c2t19d0 <ATA-Hitachi HUA72303-A5C0-2.00TB>
          /pci@0,0/pci8086,3410@9/pci8086,346c@0/sd@13,0
      10. c2t20d0 <ATA-Hitachi HUA72303-A5C0-2.00TB>
          /pci@0,0/pci8086,3410@9/pci8086,346c@0/sd@14,0
      11. c2t21d0 <ATA-Hitachi HUA72303-A5C0-2.00TB>
          /pci@0,0/pci8086,3410@9/pci8086,346c@0/sd@15,0
      12. c2t22d0 <ATA-Hitachi HUA72303-A5C0-2.00TB>
          /pci@0,0/pci8086,3410@9/pci8086,346c@0/sd@16,0
      13. c2t23d0 <ATA-Hitachi HUA72303-A5C0-2.00TB>
          /pci@0,0/pci8086,3410@9/pci8086,346c@0/sd@17,0
      14. c2t24d0 <ATA-Hitachi HUA72303-A5C0-2.00TB>
          /pci@0,0/pci8086,3410@9/pci8086,346c@0/sd@18,0
      15. c2t25d0 <ATA-Hitachi HUA72303-A5C0-2.00TB>
          /pci@0,0/pci8086,3410@9/pci8086,346c@0/sd@19,0
      16. c3t0d0 <ATA-INTELSSDSA2BT04-0362 cyl 6228 alt 2 hd 224 sec 56>
          /pci@0,0/pci8086,3a40@1c/pci8086,3505@0/sd@0,0
      17. c3t1d0 <ATA-INTELSSDSA2BT04-0362 cyl 6228 alt 2 hd 224 sec 56>
          /pci@0,0/pci8086,3a40@1c/pci8086,3505@0/sd@1,0
      18. c3t2d0 <ATA-INTEL SSDSA2BT04-0362-37.27GB>
          /pci@0,0/pci8086,3a40@1c/pci8086,3505@0/sd@2,0
      19. c3t3d0 <ATA-INTEL SSDSA2BT04-0362-37.27GB>
          /pci@0,0/pci8086,3a40@1c/pci8086,3505@0/sd@3,0
      20. c3t4d0 <ATA-INTEL SSDSA2BW16-0362-149.05GB>
          /pci@0,0/pci8086,3a40@1c/pci8086,3505@0/sd@4,0
      21. c3t5d0 <ATA-INTEL SSDSA2BW16-0362-149.05GB>
          /pci@0,0/pci8086,3a40@1c/pci8086,3505@0/sd@5,0
Specify disk (enter its number):

Disks 0-15 are the SATA drives in the JBOD. Note they are truncated to 2TB at present, though they are actually 3TB disks; as I mentioned in part 1, the LSI 1604e chip on the external SAS module truncates them, and we’re awaiting a new controller. When it arrives we’ll have to re-create the pool, but 2TB is fine for our initial configuration and testing!

Now let’s create the pool with all 16 disks using raidz3:

zpool create tank raidz3 c2t10d0 c2t11d0 c2t12d0 c2t13d0 c2t14d0 c2t15d0 c2t16d0 c2t17d0 c2t18d0 c2t19d0 c2t20d0 c2t21d0 c2t22d0 c2t23d0 c2t24d0 c2t25d0
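
It’s worth a quick sanity check that the layout is what we expect before going any further; the standard status and list commands should show a single raidz3-0 vdev containing all 16 disks, plus the overall pool size:

zpool status tank
zpool list tank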

To act as a read cache, we also have 2x 160GB Intel SSDs, so we add them to the pool as L2ARC cache devices:

zpool add tank cache c3t4d0 c3t5d0

You can’t mirror L2ARC, though I did read some interesting thoughts on doing so. I don’t think it’s necessary in our case; however, I could see how suddenly losing the cache in some environments might have a big enough performance impact to become mission critical.

We’ve also got a couple of SSDs in the system to act as the ZFS intent log (ZIL), so we’ll add them to the pool as a mirrored log device:

zpool add tank log mirror c3t2d0 c3t3d0

Note that they are mirrored. The ZIL is used to speed up writes by writing the data through the SSD before it goes back to spinning disk; its purpose is to allow a write to be acknowledged to a client even if it hasn’t yet been fully committed to spinning disk, and it acts as a last-chance copy should a write not have completed when something like a power failure hits. With high-speed drives it’s perhaps not necessary, but under heavy load it will improve write performance. Given it’s important that it stays consistent, we mirror it.

Ideally the drives should be SLC rather than MLC, but there’s a cost trade-off to be had there.
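
With the cache and log devices added, zpool status will show them as separate cache and logs sections beneath the raidz3 vdev, and zpool iostat is handy for keeping an eye on how the L2ARC warms up over time (the trailing 5 is just a polling interval in seconds):

zpool iostat -v tank 5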

De-duplication and compression

ZFS allows de-duplication and compression to be enabled at pool or dataset/file-system level. We’ll enable both at the pool level and the datasets will inherit from there:

zfs set dedup=on tank
zfs set compression=on tank
zfs get compression tank
zfs get compressratio tank

The latter command shows the compression ratio, which will vary over time depending on what files are in the tank.

When thinking about de-duplication, it’s important to consider how much storage you have and how much RAM is available. This is because you want the de-duplication hash tables to be in memory as much as possible for write speed. We’ve only got 12GB RAM in our server, but from what I’ve read, the L2ARC SSDs should also be able to hold some of the tables and pick up some of the workload there.
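
If you want a feel for how big the dedup tables are likely to get for a given set of data, zdb can simulate the DDT without actually enabling anything; be warned it can take a while and use a fair bit of memory itself on a large pool. Once dedup is in use, zpool status -D shows the real DDT statistics:

zdb -S tank
zpool status -D tank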

When setting up de-duplication, we had some discussions about hashing and how it might work in relation to hash clashes … Bonwick’s blog has a lot of answers on this topic!
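
One option worth knowing about on that front: ZFS can be told to do a byte-for-byte comparison whenever a hash matches, at the cost of an extra read on deduplicated writes. We’ve stuck with plain dedup=on ourselves, but the belt-and-braces setting looks like this:

zfs set dedup=verify tank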

Just to touch briefly on quotas when using compression and de-duplication. This isn’t something that I’ve seen an answer on, and I don’t have time to look at the source code, but here’s my supposition: quotas do take account of compression, i.e. they account for the blocks used on disk rather than the size of the file, but they don’t take account of any de-duplicated data. I’m pretty sure this must be the case, otherwise the first block owner would be penalised, and the last block owner could suddenly go massively over quota if the other copies were deleted.

And a quick note on compression … this is something that mildly bemused me at first look:

milo 31% du -sh /data/private/downloads/
4.1G	/data/private/downloads/
milo 32% date && cp -Rp /data/private/downloads/* . && date
Fri May 25 15:25:39 BST 2012
Fri May 25 15:28:29 BST 2012

milo 34% du -sh .
3.0G	.

So by the time the files were copied to the NFS mounted compressed ZFS file-system, they’d shrunk by 1.1G. Now thinking about this, I should have realised that du was showing the size of blocks used, rather than the file-sizes, but at first glance, I was a little bemused!
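
The quick way to see what’s going on is to compare the apparent file size with the blocks actually allocated; for example, on some hypothetical file sitting on the compressed file-system:

ls -lh somefile.dat
du -h somefile.dat

ls reports the logical size of the file, while du reports the (compressed) blocks it occupies on disk, which is why the copy appears to have shrunk.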

Automatic snapshots

We’d really like to have automatic snapshots available for our file-systems, and time-slider can provide this. The only real problem is that time-slider is a GUI application, which isn’t ideal on a file-server. Anyway, I found it’s possible to configure auto-snapshots from the terminal. First off, install time-slider:

 pkg install time-slider
 svcadm restart dbus
 svcadm enable time-slider

Note: if you don’t restart dbus after installing time-slider, you’ll get Python errors out of the time-slider service like:

dbus.exceptions.DBusException: org.freedesktop.DBus.Error.AccessDenied: Connection ":1.3" is not allowed to own the service "org.opensolaris.TimeSlider" due to security policies in the configuration file
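
If the service has already dropped into maintenance because of this, something along these lines should show the error and let you recover once dbus has been restarted:

svcs -xv time-slider
svcadm clear time-slider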

To configure, you need to set the hidden ZFS property on the pool and then enable the auto-snapshots that you require:

zfs set com.sun:auto-snapshot=true tank
svcadm enable auto-snapshot:hourly
svcadm enable auto-snapshot:weekly
svcadm enable auto-snapshot:daily

This blog gives more details on configuring it; in short, you need to export the auto-snapshot manifest, edit it and then re-import it.
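
As a rough sketch of that export/edit/import cycle (the temporary path is just an example):

svccfg export auto-snapshot > /tmp/auto-snapshot.xml
vi /tmp/auto-snapshot.xml
svccfg import /tmp/auto-snapshot.xml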

However, you should also be able to manipulate the service using svccfg, for example:

svccfg -s auto-snapshot:daily setprop zfs/keep= astring: '7'

This would change the keep period for daily snapshots to 7. The interesting properties (depending on whether you are changing the monthly, daily, hourly or weekly instance) are:

svccfg -s auto-snapshot:daily listprop zfs
zfs           application
zfs/interval  astring  days
zfs/period    astring  1
zfs/keep      astring  6
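
As with any SMF property change, the instance needs refreshing (and restarting, to be safe) before the new value takes effect:

svcadm refresh auto-snapshot:daily
svcadm restart auto-snapshot:daily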

To list your snapshots, use the following:

root@bigstore:~# zfs list -t snapshot
NAME                                                  USED  AVAIL  REFER  MOUNTPOINT
rpool/ROOT/openindiana-1@install                     17.7M      -  1.52G  -
rpool/ROOT/openindiana-1@2012-05-18-14:54:49          317M      -  1.82G  -
rpool/ROOT/openindiana-1@2012-05-24-10:58:07         65.6M      -  2.32G  -
tank@zfs-auto-snap_weekly-2012-05-24-12h14               0      -  90.6K  -
tank@zfs-auto-snap_daily-2012-05-27-12h14                0      -  90.6K  -
tank@zfs-auto-snap_hourly-2012-05-28-11h14               0      -  90.6K  -
tank/research@zfs-auto-snap_weekly-2012-05-24-12h14  55.0K      -  87.3K  -
tank/research@zfs-auto-snap_hourly-2012-05-24-13h14  55.0K      -  87.3K  -
tank/research@zfs-auto-snap_hourly-2012-05-25-09h14  64.7K      -   113K  -
tank/research@zfs-auto-snap_hourly-2012-05-25-11h14  55.0K      -   116K  -
tank/research@zfs-auto-snap_daily-2012-05-25-12h14   71.2K      -  25.2M  -
tank/research@zfs-auto-snap_hourly-2012-05-25-16h14   162K      -  3.00G  -
tank/research@zfs-auto-snap_daily-2012-05-26-12h14   58.2K      -  3.00G  -
tank/research@zfs-auto-snap_hourly-2012-05-27-04h14  58.2K      -  3.00G  -
tank/research@zfs-auto-snap_daily-2012-05-27-12h14       0      -  3.00G  -
tank/research@zfs-auto-snap_hourly-2012-05-28-11h14      0      -  3.00G  -

Just as a caveat on auto-snapshots, the above output is from a system that has been running weekly, daily and hourly snapshots for several days. Note that the hourly snapshots seem a bit sporadic in their availability. This is because intermediate (hourly or daily) snapshots that are 0K in size get removed, i.e. snapshots will only be listed where there’s actually a change in the data. This seems quite sensible really…

Finally, we want to make the snapshots visible to NFS users:

zfs set snapdir=visible tank/research
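
With that set, NFS clients can browse the snapshots themselves under the hidden .zfs directory at the root of the file-system (the mount point below is just an example):

ls /mnt/research/.zfs/snapshot/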

For Windows users connecting via the in-kernel CIFS server, the previous versions tab should work.

part 1 – Hardware and Basic information
part 2 – Base Operating System Installation
part 3 – Configuring Active Directory authentication integration
part 4 – Configuring the storage pool and auto snapshots
part 5 – NFS Server config & In-kernel CIFS
