About 18 months ago I wondered why the XFS FAQ recommended a stripe width of half the number of disks for RAID-10, as the underlying rationale did not seem to be properly explained anywhere (see XFS FAQ on sunit, swidth values). The answer turned out to be that the RAID-0 portion of RAID-10 dominates the layout choices.

I suggested extending the FAQ to provide some rationale, but Dave Chinner (the main Linux XFS maintainer) said "The FAQ is not the place to explain how the filesystem optimises allocation for different types of storage", and pointed at a section of the XFS admin doc on alignment to storage geometry, which at the time -- and now, 18 months later -- reads:

==== Alignment to storage geometry

TODO: This is extremely complex and requires an entire chapter to itself.

which is... rather sparse, because Dave had not had time to write that "chapter to itself".

At the time I offered to write a "sysadmin's view" of the relevant considerations; the writing got delayed by actual work, but apparently would still be greatly appreciated.

I eventually posted what I had written to the XFS mailing list in February 2018, where it seems to have been lost in the noise and ignored.

Since it is now nearly a year later, and nothing seems to have happened with the documentation I wrote -- and the mailing list location is not very searchable either -- I have decided to repost it here on my blog as a (slightly) more permanent home. It appears unlikely to be incorporated into the XFS documentation.

So below is that original, year-old documentation draft. The advice below is unreviewed by the XFS maintainers (or anybody else, AFAICT), and has just been converted from the Linux kernel documentation RST format to Markdown (for my blog). The conversion was done with pandoc plus a bunch of manual editing for all the things pandoc missed or was confused by (headings, lists, command line examples, etc).

I would suggest double checking anything below against other sources before relying on it. If there is no other documentation to check, perhaps ask on the XFS Mailing List instead.


Alignment to storage geometry

XFS can be used on a wide variety of storage technology (spinning magnetic disks, SSDs), on single disks or spanned across multiple disks (with software or hardware RAID). Potentially there are multiple layers of abstraction between the physical storage medium and the file system (XFS), including software layers like LVM, and potentially flash translation layers or hierarchical storage management.

Each of these technology choices has its own requirements for best alignment, and/or its own trade-offs between latency and performance, and the combination of multiple layers may introduce additional alignment or layout constraints.

The goal of file system alignment to the storage geometry is to:

  • maximise throughput (eg, through locality or parallelism)

  • minimise latency (at least for common activities)

  • minimise storage overhead (such as write amplification due to read-modify-write -- RMW -- cycles).

Physical Storage Technology

Modern storage technology divides into two broad categories:

  • magnetic storage on spinning media (eg, HDD)

  • flash storage (eg, SSD or NVMe)

These two storage technology families have distinct features that influence the optimal file system layout.

Magnetic Storage: accessing magnetic storage requires moving a physical read/write head across the magnetic media, which takes a non-trivial amount of time (ms). The seek time required to move the head to the correct location is approximately linearly proportional to the distance the head needs to move, which means two locations near each other are faster to access than two locations far apart. Performance can be improved by locating data regularly accessed together "near" each other. (See also Wikipedia Overview of HDD performance characteristics.)

4KiB physical sectors HDD: Most larger modern magnetic HDDs (many 2TiB+, almost all 4TiB+) use 4KiB physical sectors to help minimise storage overhead (of sector headers/footers and inter-sector gaps), and thus maximise storage density. But for backwards compatibility they continue to present the illusion of 512 byte logical sectors. Alignment of file system data structures and user data blocks to the start of (4KiB) physical sectors avoids unnecessarily spanning a read or write across two physical sectors, and thus avoids write amplification.
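
The sector sizes a device is presenting can be checked before laying anything out on it (the device name below is a placeholder -- substitute the relevant disk):

```sh
# Logical and physical sector sizes, in bytes, as seen by the kernel:
cat /sys/block/sda/queue/logical_block_size
cat /sys/block/sda/queue/physical_block_size

# The same information via util-linux:
blockdev --getss --getpbsz /dev/sda
```

A 512e drive (4KiB physical sectors behind 512 byte logical sectors) will report 512 and 4096 respectively.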

Flash Storage: Flash storage has both a page size (the smallest unit that can be written at once), and an erase block size (the smallest unit that can be erased) which is typically much larger (eg, 128KiB). A key limitation of flash storage is that individual bits can only be written in one direction; returning them to the other value requires erasing the whole erase block. This means that updates to physical flash storage usually involve an erase cycle to "blank the slate" to a single common value, followed by writing the bits that should have the other value (and writing back the unmodified data -- a read-modify-write cycle). To further complicate matters, most flash storage physical media has a limitation on how many times a given physical storage cell can be erased, depending on the technology used (typically on the order of 10,000 times).

To compensate for these technological limitations, all flash storage suitable for use with XFS uses a Flash Translation Layer within the device, which provides both wear levelling and relocation of individual pages to different erase blocks as they are updated (to minimise the amount that needs to be updated with each write, and reduce the frequency blocks are erased). These are often implemented on-device as a type of log structured file system, hidden within the device.

For a file system like XFS, a key consideration is to avoid spanning data structures across erase block boundaries, as that would mean multiple erase blocks need updating for a single change. Write amplification within the SSD may still result in multiple updates to physical media for a single update, but this can be reduced by advising the flash storage of blocks that do not need to be preserved (eg, with the discard mount option, or by using fstrim) so it stops copying those blocks around.
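
For the trim side of this, a minimal sketch (the mount point is an example):

```sh
# One-off trim of unused blocks on a mounted XFS file system:
fstrim -v /srv/data

# Many distributions instead ship a periodic timer for this (systemd-based
# systems; availability depends on the distribution):
systemctl enable --now fstrim.timer
```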

RAID

RAID provides a way to combine multiple storage devices into one larger logical storage device, with better performance or more redundancy (and sometimes both, eg, RAID-10). There are multiple RAID array arrangements ("levels") with different performance considerations. RAID can be implemented both directly in the Linux kernel ("software RAID", eg the "MD" subsystem), or within a dedicated controller card ("hardware RAID"). The filesystem layout considerations are similar for both, but where the "MD" subsystem is used modern user space tools can often automatically determine key RAID parameters and use those to tune the layout of higher layers; for hardware RAID these key values typically need to be manually determined and provided to user space tools by hand.
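
For a Linux MD array, for example, the key values can be read back from the array itself (the array name is illustrative):

```sh
# RAID level, number of active devices, and chunk size of an MD array:
mdadm --detail /dev/md0 | grep -E 'Raid Level|Raid Devices|Chunk Size'

# The chunk size is also exposed via sysfs, in bytes:
cat /sys/block/md0/md/chunk_size
```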

RAID 0 stripes data across two or more storage devices, with the aim of increasing performance, but provides no redundancy (in fact the data is more at risk as failure of any disk probably renders the data inaccessible). For XFS storage layout the key consideration is to maximise parallel access to all the underlying storage devices by avoiding "hot spots" that are reliant on a single underlying device.

RAID 1 duplicates data (identically) across two or more storage devices, with the aim of increasing redundancy. It may provide a small read performance boost if data can be read from multiple disks at once, but provides no write performance boost (data needs to be written to all disks). There are no special XFS storage layout considerations for RAID 1, as every disk has the same data.

RAID 5 organises data into stripes across three or more storage devices, where N-1 storage devices contain file system data, and the remaining storage device contains parity information which allows recalculation of the contents of any one other storage device (eg, in the event that a storage device fails). To avoid the "parity" block being a hot spot, its location is rotated amongst all the member storage devices (unlike RAID 4, which had a parity hot spot). Writes to RAID 5 require reading multiple elements of the RAID 5 parity block set (to be able to recalculate the parity values), and writing at least the modified data block and parity block. The performance of RAID 5 is improved by having a high hit rate on caching (thus avoiding the read part of the read-modify-write cycle), but there is still an inevitable write overhead.

For XFS storage layout on RAID 5 the key considerations are the read-modify-write cycle to update the parity blocks (and avoiding needing to unnecessarily modify multiple parity blocks), as well as increasing parallelism by avoiding hot spots on a single underlying storage device. For this XFS needs to know both the stripe size on an underlying disk, and how many of those stripes can be stored before it cycles back to the same underlying disk (N-1).

RAID 6 is an extension of the RAID 5 idea, which uses two parity blocks per set, so N-2 storage devices contain file system data and the remaining two storage devices contain parity information. This increases the overhead of writes, for the benefit of being able to recover information if more than one storage device fails at the same time (including, eg, during the recovery from the first storage device failing -- a not unknown event with larger storage devices and thus longer RAID parity rebuild times).

For XFS storage layout on RAID 6, the considerations are the same as RAID 5, but only N-2 disks contain user data.

RAID 10 is a conceptual combination of RAID 1 and RAID 0, across at least four underlying storage devices. It provides both storage redundancy (like RAID 1) and interleaving for performance (like RAID 0). The write performance (particularly for smaller writes) is usually better than RAID 5/6, at the cost of less usable storage space. For XFS storage layout the RAID-0 performance considerations apply -- spread the work across the underlying storage devices to maximise parallelism.

A further layout consideration with RAID is that RAID implementations typically need to store some metadata with each RAID array that helps locate the underlying storage devices. This metadata may be stored at the start or end of the RAID member devices. If it is stored at the start of the member devices, then this may introduce alignment considerations. For instance, the Linux "MD" subsystem has multiple metadata formats: formats 0.9/1.0 store the metadata at the end of the RAID member devices, and formats 1.1/1.2 store the metadata at the beginning of the RAID member devices. Modern user space tools will typically try to ensure user data starts on a 1MiB boundary ("Data Offset").

Hardware RAID controllers may use either of these techniques too, and may require manual determination of the relevant offsets from documentation or vendor tools.
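
For Linux MD member devices the metadata format and resulting data offset can be confirmed directly (the member device name is illustrative):

```sh
# Metadata version and where the user data starts within the member device
# (the Data Offset is reported in 512-byte sectors; 2048 sectors = 1MiB):
mdadm --examine /dev/sdb1 | grep -E 'Version|Data Offset'
```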

Disk partitioning

Disk partitioning affects file system alignment to the underlying storage blocks in two ways:

  • the starting sectors of each partition need to be aligned to the underlying storage blocks for best performance. With modern Linux user space tools this will typically happen automatically, but older Linux and other tools often would attempt to align to historically relevant boundaries (eg, 63-sector tracks) that are not only irrelevant to modern storage technology but due to the odd number (63) result in misalignment to the underlying storage blocks (eg, 4KiB sector HDD, 128KiB erase block SSD, or RAID array stripes).

  • the partitioning system may require storing metadata about the partition locations between partitions (eg, MBR logical partitions), which may throw off the alignment of the start of the partition from the optimal location. Use of GPT partitioning is recommended for modern systems to avoid this, or if MBR partitioning is used either use only the 4 primary partitions or take extra care when adding logical partitions.

Modern Linux user space tools will typically attempt to align on 1MiB boundaries to maximise the chance of achieving a good alignment; beware if using older tools, or storage media partitioned with older tools.
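
A quick way to sanity check existing partitions (device and partition number are examples):

```sh
# Ask parted whether partition 1 is optimally aligned for the device:
parted /dev/sda align-check optimal 1

# Or inspect the starting sector directly; a start that is a multiple of
# 2048 (512-byte sectors) is on a 1MiB boundary:
cat /sys/block/sda/sda1/start
```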

Storage Virtualisation and Encryption

Storage virtualisation, such as the Linux kernel LVM (Logical Volume Manager), introduces another layer of abstraction between the storage device and the file system. These layers may also need to store their own metadata, which may affect alignment with the underlying storage sectors or erase blocks.

LVM needs to store metadata on the physical volumes (PV) -- typically 192KiB at the start of the physical volume (check the "1st PE" value with pvs -o name,pe_start). This holds physical volume information as well as volume group (VG) and logical volume (LV) information. The size of this metadata can be adjusted at pvcreate time to help improve alignment of the user data with the underlying storage.
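
For example, to check where user data currently starts, and to control it when creating a new physical volume (the device and alignment value are illustrative):

```sh
# Where user data ("1st PE") starts within each physical volume:
pvs -o name,pe_start

# Align the start of user data when creating a new PV, eg to a full RAID
# stripe; 1m here is just an example value:
pvcreate --dataalignment 1m /dev/md0
```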

Encrypted volumes (such as LUKS) also need to store their own metadata at the start of the volume. The size of this metadata depends on the key size used for encryption. Typical sizes are 1MiB (256-bit key) or 2MiB (512-bit key), stored at the start of the underlying volume. These headers may also cause alignment issues with the underlying storage, although probably only in the case of wider RAID 5/6/10 sets. The --align-payload argument to cryptsetup may be used to influence the data alignment of the user data in the encrypted volume (it takes a value in 512 byte logical sectors), or a detached header (--header DEVICE) may be used to store the header somewhere other than the start of the underlying device.
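
A sketch of those two approaches (device names and values are illustrative; luksFormat is destructive, so only run it on an empty volume):

```sh
# Align the encrypted payload to a 1MiB boundary (2048 x 512-byte sectors):
cryptsetup luksFormat --align-payload 2048 /dev/md0

# Or keep the LUKS header in a detached file so the payload starts at the
# beginning of the underlying device (depending on the cryptsetup version
# the header file may need to be created first):
cryptsetup luksFormat --header /path/to/header.img /dev/md0
```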

Determining su/sw values

Assuming every layer in your storage stack is properly aligned with the underlying layers, the remaining step is to give mkfs.xfs appropriate values to guide the XFS layout across the underlying storage to minimise latency and hot spots and maximise performance. In some simple cases (eg, modern Linux software RAID) mkfs.xfs can automatically determine these values; in other cases they may need to be manually calculated and supplied.

The key values to control layout are:

  • su: the stripe unit size, in bytes (m or g suffixes can be used for MiB or GiB) -- the amount that can be updated on a single underlying device (eg, a RAID set member)

  • sw: the stripe width, as the number of member elements storing user data before wrapping around to the first storage device again (ie, excluding parity disks, spares, etc); this is used to distribute data/metadata (and thus work) between multiple members of the underlying storage to reduce hot spots and increase parallelism.

When multiple layers of storage technology are involved, you want to ensure that each higher layer has a block size that is the same as the underlying layer, or an even multiple of the underlying layer, and then give that largest multiple to mkfs.xfs.
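
For instance (a sketch with illustrative values): a 6-disk RAID 6 with a 64KiB chunk has a full data stripe of 64KiB x (6 - 2) = 256KiB, so the partition start, LVM data start, and any LUKS payload offset above it should each fall on a multiple of 256KiB (which a 1MiB boundary satisfies), and the same geometry is eventually passed to mkfs.xfs:

```sh
# 6-disk RAID 6, 64KiB chunk: full data stripe = 64KiB * (6 - 2) = 256KiB.
# With every layer above starting on a 1MiB boundary (a multiple of 256KiB),
# the file system is created with the matching su/sw geometry
# (the LV name is just an example):
mkfs.xfs -d su=64k,sw=4 /dev/vg0/data
```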

Formulas for calculating appropriate values for various storage technologies:

  • HDD: alignment to physical sector size (512 bytes or 4KiB). This will happen automatically due to XFS defaulting to 4KiB block sizes.

  • Flash Storage: alignment to erase blocks (eg, 128 KiB). If you have a single flash storage device, specify su=ERASE_BLOCK_SIZE and sw=1.

  • RAID 0: Set su=RAID_CHUNK_SIZE and sw=NUMBER_OF_ACTIVE_DISKS, to spread the work as evenly as possible across all member disks.

  • RAID 1: No special values required; use the values required by the underlying storage.

  • RAID 5: Set su=RAID_CHUNK_SIZE and sw=(NUMBER_OF_ACTIVE_DISKS-1), as one disk is used for parity so the wrap around to the first disk happens one disk earlier than the full RAID set width.

  • RAID 6: Set su=RAID_CHUNK_SIZE and sw=(NUMBER_OF_ACTIVE_DISKS-2), as two disks are used for parity so the wrap around to the first disk happens two disks earlier than the full RAID set width.

  • RAID-10: The RAID 0 portion of RAID-10 dominates alignment considerations. The RAID 1 redundancy reduces the effective number of active disks, eg 2-way mirroring halves the effective number of active disks, and 3-way mirroring reduces it to one third. Calculate the number of effective active disks, and then use the RAID 0 values. Eg, for 2-way RAID 10 mirroring, use su=RAID_CHUNK_SIZE and sw=(NUMBER_OF_MEMBER_DISKS / 2).

  • RAID-50/RAID-60: These are logical combinations of RAID 5 and RAID 0, or RAID 6 and RAID 0 respectively. Both the RAID 5/6 and the RAID 0 performance characteristics matter. Calculate the number of disks holding parity (2+ for RAID 50; 4+ for RAID 60) and subtract that from the number of disks in the RAID set to get the number of data disks. Then use su=RAID_CHUNK_SIZE and sw=NUMBER_OF_DATA_DISKS.

For the purpose of calculating these values, only the active storage devices in the RAID set should be included; spares, even dedicated spares, are outside the layout considerations.
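
A few sketches of applying those formulas (chunk sizes, disk counts, and device names are all illustrative; for Linux MD arrays a modern mkfs.xfs will usually detect these values automatically):

```sh
# Single SSD with a 128KiB erase block: su = erase block, sw = 1
mkfs.xfs -d su=128k,sw=1 /dev/sdb1

# 6-disk RAID 5, 64KiB chunk: 6 - 1 = 5 data disks
mkfs.xfs -d su=64k,sw=5 /dev/md0

# 8-disk RAID 10 (2-way mirrors), 512KiB chunk: 8 / 2 = 4 effective data members
mkfs.xfs -d su=512k,sw=4 /dev/md1
```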

A note on sunit/swidth versus su/sw

Alignment values were historically specified as sunit/swidth values, which are given in 512-byte sectors, with swidth being some multiple of sunit. These units were historically useful when all storage technology used 512-byte logical and physical sectors, and sizes were often reported by underlying layers in physical sectors. However, they are increasingly difficult to work with for modern storage technology with its variety of physical sector and block sizes.

The su/sw values, introduced later, provide a value in bytes (su) and a number of occurrences (sw), which are easier to work with when calculating values for a variety of physical sector and block sizes.

Logically:

  • sunit = su / 512
  • swidth = sunit * sw

With the result that swidth = (su / 512) * sw.

Use of sunit / swidth is discouraged, and use of su / sw is encouraged to avoid confusion.

WARNING: beware that while the sunit/swidth values are specified to mkfs.xfs in 512-byte sectors, they are reported by mkfs.xfs (and xfs_info) in file system blocks (typically 4KiB, shown in the bsize value). This can be very confusing, and is another reason to prefer specifying values with su / sw and to ignore the sunit / swidth options to mkfs.xfs.
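
For example (abbreviated, illustrative output), a file system created with su=64k,sw=4 on 4KiB blocks is reported like this:

```sh
# File system created with, eg: mkfs.xfs -d su=64k,sw=4 /dev/md0
# (device and mount point are illustrative)
xfs_info /srv/data
# ...
# data     =    bsize=4096   blocks=..., imaxpct=25
#          =    sunit=16     swidth=64 blks
# ...
```

Here sunit=16 and swidth=64 are in 4KiB blocks (16 x 4KiB = 64KiB = su; 64 x 4KiB = 256KiB = su x sw), whereas the same geometry given as mkfs.xfs sunit/swidth options would have been sunit=128 and swidth=512 (in 512-byte sectors) -- exactly the mismatch that causes confusion.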