ZFS - Part 1

By Paulus, 5 May, 2024

At work, we use ZFS for our High-Performance Compute cluster, and I was asked to investigate making some changes. As I was doing my research, I started to really like what I saw:

  • Native encryption
  • Snapshots/Rollbacks
  • Automatically expands when new space is added to the pool
  • Store a user-specified number of copies of data or metadata
  • Automated and typically silent self-healing
  • Native data compression and de-duplication
  • Efficient array rebuilds 
  • Datasets for different volumes so you can apply different properties

There was one thing I didn't care for, although in practice it doesn't matter much: the file system can only grow, and for the most part you're stuck with the virtual devices you define. The only time I would want to shrink my storage would be if I were moving from RAID6 to RAID10. Since the pool can be grown dynamically, I can simply replace the disks in a vdev with larger ones, which makes the issue irrelevant.
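
As an aside, here is a rough sketch of how that replacement works in practice, assuming a pool named tank (the example pool used throughout this article) and hypothetical disk IDs. With autoexpand enabled, the pool grows once every disk in the vdev has been swapped for a larger one.


# Allow the pool to grow when larger disks become available
zpool set autoexpand=on tank
# Replace each disk in the vdev with a larger one, letting the
# resilver finish before moving on to the next disk
zpool replace tank ata-OLD-DISK-1 ata-NEW-LARGER-DISK-1
zpool status tank
zpool replace tank ata-OLD-DISK-2 ata-NEW-LARGER-DISK-2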

This article will go over the basics of planning and setting up a storage pool. There will be some features covered such as setting up a dataset to be encrypted and snapshots. Managing privileges, more advanced encryption options, and file sharing will be saved for another article.

Terminology

  • clone A file system whose initial contents are identical to the contents of a snapshot.
  • dataset A generic name for ZFS components: clones, file systems, snapshots, and volumes. Each dataset is identified by a name of the form pool/path[@snapshot], where:
    • pool The name of the pool that the dataset belongs to.
    • path Slash-delimited path name for the dataset component. This can be deeper than one level.
    • snapshot This portion only applies to snapshots.
  • mirror A virtual device that stores identical copies of data on two or more disks.
  • pool  A grouping of virtual devices that describe the layout and physical characteristics of the available storage. 
  • RAID-Z A virtual device that stores data and parity across multiple disks.
  • resilvering Equivalent to rebuild, where data is either copied from a mirror or rebuilt from the remaining drives.
  • snapshot A read-only view of data as it was at a specific point in time.
  • virtual device A logical device in a pool. A virtual device is also known as a vdev and can be a physical device, a file, or a collection of devices.
  • volume A dataset that represents a block device.

Planning

As with anything, thought must be put into how the pool is set up, especially when deciding how each vdev will be laid out. In earlier versions of ZFS, once a vdev was created its layout was fixed. Now a vdev can be changed, as long as the change does not reduce the available space. For example, I can create a vdev from a single drive and later attach another drive to it as a mirror, or add the drive to the pool as a stripe.

When creating a new pool and vdevs here are some general rules and tips:

  • striped-mirror/RAID10: in this configuration the pool contains a series of mirror vdevs, each made up of two disks holding identical data.
    • Cheaper to upgrade, since you only need two new drives at a time to add capacity
    • Quickly recovers from a failed disk
    • Resilver does not impact performance much
    • Best for both read and write performance
    • Good for volumes that will hold virtual machines
  • RAIDZ for file storage or archives
    • Data is not written often or in large amounts
    • Good for accessing data quickly
  • Roughly 1/3 of a RAIDZ vdev should be parity, for example:
    • RAIDZ1 vdev should have three disks
    • RAIDZ2 vdev should have six disks
    • RAIDZ3 vdev should have nine disks
  • Resilvering puts a lot of stress on the drives
  • Keep the vdev slim by not having more than nine drives
  • Using more devices in a RAIDZ vdev will increase read speed

Generally, a striped-mirror is what people go with, and if you need more space it's easy to add as you need it. If you need to maximize space because you have a lot of data, then look into striped RAIDZ, as sketched below.
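
A striped RAIDZ pool is created the same way as the striped-mirror shown later: just list more than one raidz vdev in the zpool create command. A rough sketch, reusing six of the example disks from the next section:


zpool create tank raidz ata-ST8000VN004-3CP101_00000001 ata-ST8000VN004-3CP101_00000002 ata-ST8000VN004-3CP101_00000003 raidz ata-ST8000VN004-3CP101_00000004 ata-ST8000VN004-3CP101_00000005 ata-ST8000VN004-3CP101_00000006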

Creating the Storage Pool

When adding disks to a vdev, it's best to use an identifier that is static. By using something like /dev/disk/by-id, you can easily tell which physical disk is having a problem. Using /dev/sdX is still an option, but it might take more effort to figure out which disk is the culprit; unless you're working in a data center where a server might have hundreds of disks, it's not as big of a deal.


ls -lh /dev/disk/by-id/ata-ST8000VN004-*
lrwxrwxrwx 1 root root  9 Mar 18 03:34 /dev/disk/by-id/ata-ST8000VN004-3CP101_00000001 -> ../../sda
lrwxrwxrwx 1 root root  9 Mar 18 02:56 /dev/disk/by-id/ata-ST8000VN004-3CP101_00000002 -> ../../sdb
lrwxrwxrwx 1 root root  9 Mar 18 03:35 /dev/disk/by-id/ata-ST8000VN004-3CP101_00000003 -> ../../sdc
lrwxrwxrwx 1 root root  9 Mar 18 02:26 /dev/disk/by-id/ata-ST8000VN004-3CP101_00000004 -> ../../sdd
lrwxrwxrwx 1 root root  9 Mar 18 03:34 /dev/disk/by-id/ata-ST8000VN004-3CP101_00000005 -> ../../sde
lrwxrwxrwx 1 root root  9 Mar 18 02:56 /dev/disk/by-id/ata-ST8000VN004-3CP101_00000006 -> ../../sdf
lrwxrwxrwx 1 root root  9 Mar 18 03:35 /dev/disk/by-id/ata-ST8000VN004-3CP101_00000007 -> ../../sdg
lrwxrwxrwx 1 root root  9 Mar 18 02:26 /dev/disk/by-id/ata-ST8000VN004-3CP101_00000008 -> ../../sdh
lrwxrwxrwx 1 root root  9 Mar 18 03:34 /dev/disk/by-id/ata-ST8000VN004-3CP101_00000009 -> ../../sdi
lrwxrwxrwx 1 root root  9 Mar 18 02:56 /dev/disk/by-id/ata-ST8000VN004-3CP101_00000010 -> ../../sdj
lrwxrwxrwx 1 root root  9 Mar 18 03:35 /dev/disk/by-id/ata-ST8000VN004-3CP101_00000011 -> ../../sdk
lrwxrwxrwx 1 root root  9 Mar 18 02:26 /dev/disk/by-id/ata-ST8000VN004-3CP101_00000012 -> ../../sdl

Stripe


zpool create tank ata-ST8000VN004-3CP101_00000001
zpool status
  pool: tank
 state: ONLINE
config:
        NAME                               STATE     READ WRITE CKSUM
        tank                               ONLINE       0     0     0
          ata-ST8000VN004-3CP101_00000001  ONLINE       0     0     0
errors: No known data errors
zpool add tank ata-ST8000VN004-3CP101_00000002
zpool status
  pool: tank
 state: ONLINE
config:
        NAME                               STATE     READ WRITE CKSUM
        tank                               ONLINE       0     0     0
          ata-ST8000VN004-3CP101_00000001  ONLINE       0     0     0
          ata-ST8000VN004-3CP101_00000002  ONLINE       0     0     0
errors: No known data errors

Mirror


zpool create tank mirror ata-ST8000VN004-3CP101_00000001 ata-ST8000VN004-3CP101_00000002
zpool status
  pool: tank
 state: ONLINE
config:
        NAME                                 STATE     READ WRITE CKSUM
        tank                                 ONLINE       0     0     0
          mirror-0                           ONLINE       0     0     0
            ata-ST8000VN004-3CP101_00000001  ONLINE       0     0     0
            ata-ST8000VN004-3CP101_00000002  ONLINE       0     0     0
errors: No known data errors

Stripe-Mirror

To create something equivalent to RAID10.


zpool create tank mirror ata-ST8000VN004-3CP101_00000001 ata-ST8000VN004-3CP101_00000002 mirror ata-ST8000VN004-3CP101_00000003 ata-ST8000VN004-3CP101_00000004
zpool status
  pool: tank
 state: ONLINE
  scan: resilvered 432K in 00:00:00 with 0 errors on Mon Mar 18 03:35:05 2024
config:
        NAME                                 STATE     READ WRITE CKSUM
        tank                                 ONLINE       0     0     0
          mirror-0                           ONLINE       0     0     0
            ata-ST8000VN004-3CP101_00000001  ONLINE       0     0     0
            ata-ST8000VN004-3CP101_00000002  ONLINE       0     0     0
          mirror-1                           ONLINE       0     0     0
            ata-ST8000VN004-3CP101_00000003  ONLINE       0     0     0
            ata-ST8000VN004-3CP101_00000004  ONLINE       0     0     0
errors: No known data errors

To go from a stripe to a mirror:


zpool attach tank ata-ST8000VN004-3CP101_00000001 ata-ST8000VN004-3CP101_00000002
zpool status
  pool: tank
 state: ONLINE
  scan: resilvered 1.08M in 00:00:01 with 0 errors on Mon Mar 18 03:34:44 2024
config:
        NAME                                 STATE     READ WRITE CKSUM
        tank                                 ONLINE       0     0     0
          mirror-0                           ONLINE       0     0     0
            ata-ST8000VN004-3CP101_00000001  ONLINE       0     0     0
            ata-ST8000VN004-3CP101_00000002  ONLINE       0     0     0
          ata-ST8000VN004-3CP101_00000003    ONLINE       0     0     0
errors: No known data errors
zpool attach tank ata-ST8000VN004-3CP101_00000003 ata-ST8000VN004-3CP101_00000004
zpool status
  pool: tank
 state: ONLINE
  scan: resilvered 432K in 00:00:00 with 0 errors on Mon Mar 18 03:35:05 2024
config:
        NAME                                 STATE     READ WRITE CKSUM
        tank                                 ONLINE       0     0     0
          mirror-0                           ONLINE       0     0     0
            ata-ST8000VN004-3CP101_00000001  ONLINE       0     0     0
            ata-ST8000VN004-3CP101_00000002  ONLINE       0     0     0
          mirror-1                           ONLINE       0     0     0
            ata-ST8000VN004-3CP101_00000003  ONLINE       0     0     0
            ata-ST8000VN004-3CP101_00000004  ONLINE       0     0     0
errors: No known data errors

RAIDZ


zpool create tank raidz ata-ST8000VN004-3CP101_00000001 ata-ST8000VN004-3CP101_00000002 ata-ST8000VN004-3CP101_00000003
zpool status
  pool: tank
 state: ONLINE
  scan: resilvered 432K in 00:00:00 with 0 errors on Mon Mar 18 03:35:05 2024
config:
        NAME                                 STATE     READ WRITE CKSUM
        tank                                 ONLINE       0     0     0
          raidz-0                            ONLINE       0     0     0
            ata-ST8000VN004-3CP101_00000001  ONLINE       0     0     0
            ata-ST8000VN004-3CP101_00000002  ONLINE       0     0     0
            ata-ST8000VN004-3CP101_00000003  ONLINE       0     0     0
errors: No known data errors

RAIDZ2


zpool create tank raidz2 ata-ST8000VN004-3CP101_00000001 ata-ST8000VN004-3CP101_00000002 ata-ST8000VN004-3CP101_00000003 ata-ST8000VN004-3CP101_00000004 ata-ST8000VN004-3CP101_00000005 ata-ST8000VN004-3CP101_00000006
zpool status
  pool: tank
 state: ONLINE
  scan: resilvered 432K in 00:00:00 with 0 errors on Mon Mar 18 03:35:05 2024
config:
        NAME                                 STATE     READ WRITE CKSUM
        tank                                 ONLINE       0     0     0
          raidz2-0                           ONLINE       0     0     0
            ata-ST8000VN004-3CP101_00000001  ONLINE       0     0     0
            ata-ST8000VN004-3CP101_00000002  ONLINE       0     0     0
            ata-ST8000VN004-3CP101_00000003  ONLINE       0     0     0
            ata-ST8000VN004-3CP101_00000004  ONLINE       0     0     0 
            ata-ST8000VN004-3CP101_00000005  ONLINE       0     0     0
            ata-ST8000VN004-3CP101_00000006  ONLINE       0     0     0
errors: No known data errors

RAIDZ3


zpool create tank raidz3 ata-ST8000VN004-3CP101_00000001 ata-ST8000VN004-3CP101_00000002 ata-ST8000VN004-3CP101_00000003 ata-ST8000VN004-3CP101_00000004 ata-ST8000VN004-3CP101_00000005 ata-ST8000VN004-3CP101_00000006 ata-ST8000VN004-3CP101_00000007 ata-ST8000VN004-3CP101_00000008 ata-ST8000VN004-3CP101_00000009
zpool status
  pool: tank
 state: ONLINE
  scan: resilvered 432K in 00:00:00 with 0 errors on Mon Mar 18 03:35:05 2024
config:
        NAME                                 STATE     READ WRITE CKSUM
        tank                                 ONLINE       0     0     0
          raidz3-0                           ONLINE       0     0     0
            ata-ST8000VN004-3CP101_00000001  ONLINE       0     0     0
            ata-ST8000VN004-3CP101_00000002  ONLINE       0     0     0
            ata-ST8000VN004-3CP101_00000003  ONLINE       0     0     0
            ata-ST8000VN004-3CP101_00000004  ONLINE       0     0     0
            ata-ST8000VN004-3CP101_00000005  ONLINE       0     0     0
            ata-ST8000VN004-3CP101_00000006  ONLINE       0     0     0
            ata-ST8000VN004-3CP101_00000007  ONLINE       0     0     0
            ata-ST8000VN004-3CP101_00000008  ONLINE       0     0     0
            ata-ST8000VN004-3CP101_00000009  ONLINE       0     0     0
errors: No known data errors

Creating the Dataset

Once the pool has been defined we can start creating datasets:


zfs create -o mountpoint=/data tank/data

ZFS automatically mounts file systems when they are created and when the system boots. If you need to change mount options or explicitly mount or unmount file systems, you can use the zfs mount and zfs umount commands.

If you don't specify a mount point, it will default to the pool name followed by the dataset path, in this case /tank/data.
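
For example, a minimal sketch of moving an existing dataset to a different mount point (the /srv/data path here is just an example):


# Unmount, change the mount point, and mount it again
zfs umount tank/data
zfs set mountpoint=/srv/data tank/data
zfs mount tank/data
# Show everything ZFS currently has mounted
zfs mount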

Encryption

ZFS encryption works at the dataset level, unlike other encryption options such as LUKS, which encrypt the entire disk. You can use ZFS on top of fully encrypted disks, but each disk must be decrypted before the pool can be imported.

Encryption is done on datasets, not pools, and it is inherited: by creating an encrypted dataset called tank/encrypted, all child datasets under it will also be encrypted, e.g., tank/encrypted/home, tank/encrypted/home/user, etc.

It is important to know that native encryption does not encrypt all metadata, so that some maintenance tasks can still be performed on an unmounted encrypted dataset. What can be exposed are the name, size, usage, and properties of the dataset; the sizes and contents of individual files are not exposed.


# Create a new dataset with a passphrase
zfs create -o encryption=on -o keylocation=prompt -o keyformat=passphrase tank/encrypted
# Verify the settings for the encrypted datasets
zfs get encryption,keylocation,keyformat tank/encrypted
zfs umount tank/encrypted
zfs unload-key tank/encrypted
zfs mount tank/encrypted
cannot mount 'tank/encrypted': encryption key not loaded
zfs load-key tank/encrypted
Enter passphrase for 'tank/encrypted':
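
To illustrate the inheritance mentioned earlier, any child dataset created under tank/encrypted automatically picks up its encryption settings; the encryptionroot property shows which dataset holds the wrapping key. A small sketch:


zfs create tank/encrypted/home
zfs get -r encryption,encryptionroot tank/encrypted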

The key you supply only encrypts the master key that actually encrypts the data, so the key can be changed without the data needing to be re-encrypted:


zfs change-key -l -o keylocation=location -o keyformat=format -o pbkdf2iters=value poolname/dataset
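
For example, re-wrapping the master key of tank/encrypted with a new passphrase might look like this (a sketch; -l loads the current key first if it isn't already loaded):


zfs change-key -l -o keylocation=prompt -o keyformat=passphrase tank/encrypted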

Moving Pools

First we need to unmount all the datasets before we can export the pool.


zfs umount -a
zpool export tank

Move the drives to the new system and then run the following commands on that system. The -f flag forces the import if the pool was not cleanly exported.


zpool import [tank]
zpool import -f [tank]
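
If you're unsure of the pool name, running zpool import with no arguments lists the pools available to import, and the -d option points it at a specific device directory if the disks aren't found automatically (a sketch):


zpool import
zpool import -d /dev/disk/by-id tank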

Conclusion

That covers the basics of creating pools and datasets. In part two I will cover more advanced aspects of ZFS such as setting properties and sharing.