Introduction
On modern storage media and storage subsystems, file system alignment
to "larger than 512 byte" boundaries is increasingly important for
achieving good write performance, and on some media for avoiding
excessive wear on the underlying media (due to additional write
amplification). For about the last 8 years, the Linux kernel has
supported a "/sys/block/*/alignment_offset" metric which indicates
the number of bytes needed to get a particular layer back into
alignment with the underlying storage media, and for the last few years
Linux distributions have included tools that attempt to do automatic
alignment where possible. This helps newer systems, particularly
new installations, but cannot automatically fix older systems.
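A quick way to check where a system currently stands is to read those sysfs values directly (a sketch; device names will differ from system to system):
# 0 means the layer is naturally aligned; a non-zero value is the number of
# bytes of adjustment needed to get that layer back into alignment
grep . /sys/block/sd*/sd*/alignment_offset
grep . /sys/block/md*/alignment_offset
grep . /sys/block/dm-*/alignment_offset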
I had one older system (originally installed at least 15 years ago, and "grandfather's axe" upgraded through various versions of hardware) that ended up having all of:
4KB (4096 byte) physical sectors (on a pair of WDC WD20EFRX-68A 2TB drives):
ewen@linux:~$ cat /sys/block/sda/device/model
WDC WD20EFRX-68A
ewen@linux:~$ cat /sys/block/sda/queue/physical_block_size
4096
ewen@linux:~$ cat /sys/block/sda/queue/logical_block_size
512
ewen@linux:~$
although curiously the second apparently identical drive detects with 512-byte physical sectors, at least at present:
ewen@tv:~$ cat /sys/block/sdb/device/model
WDC WD20EFRX-68A
ewen@tv:~$ cat /sys/block/sdb/queue/physical_block_size
512
ewen@tv:~$ cat /sys/block/sdb/queue/logical_block_size
512
ewen@tv:~$
for reasons I do not understand (both drives were purchased at approximately the same time, as far as I can recall)
1990s-style partition layout, with partitions starting on a "cylinder" boundary:
(parted) unit s
(parted) print
Model: ATA WDC WD20EFRX-68A (scsi)
Disk /dev/sda: 3907029168s
Sector size (logical/physical): 512B/4096B
Partition Table: msdos
Disk Flags:

Number  Start        End          Size         Type      File system  Flags
 1      63s          498014s      497952s      primary   ext4         raid
 2      498015s      4498199s     4000185s     primary                raid
 3      4498200s     8498384s     4000185s     primary                raid
 4      8498385s     3907024064s  3898525680s  extended
 5      8498448s     72501344s    64002897s    logical                raid
 6      72501408s    584508959s   512007552s   logical                raid
 7      584509023s   1096516574s  512007552s   logical                raid
 8      1096516638s  1608524189s  512007552s   logical                raid
 9      1608524253s  2120531804s  512007552s   logical                raid
10      2120531868s  2632539419s  512007552s   logical                raid
11      2632539483s  3144547034s  512007552s   logical                raid
12      3144547098s  3656554649s  512007552s   logical                raid
13      3656554713s  3907024064s  250469352s   logical                raid
(parted)
Linux MD RAID-1 and LVM with no adjustments for the partition offsets to the physical block boundaries (due to being created with old tools), and
Linux file systems (ext4, xfs) created with no adjustments to the physical block boundaries (due to being created with old tools)
I knew that this misalignment was happening at the time I swapped in the newer (2TB) disks a few years ago, but did not have time to figure out the correct method to manually align all the layers, so I decided to just accept the lower performance. (Fortunately, being magnetic storage rather than SSDs, there was not an additional risk of excessive drive wear caused by the misalignment.)
After upgrading to a modern Debian Linux version, including a new kernel, this misalignment was made more visible again, including in kernel messages on every boot:
device-mapper: table: 253:2: adding target device (start sect 511967232 len 24903680) caused an alignment inconsistency
device-mapper: table: 253:4: adding target device (start sect 511967232 len 511967232) caused an alignment inconsistency
device-mapper: table: 253:4: adding target device (start sect 1023934464 len 511967232) caused an alignment inconsistency
device-mapper: table: 253:4: adding target device (start sect 1535901696 len 36962304) caused an alignment inconsistency
device-mapper: table: 253:4: adding target device (start sect 1572864000 len 209715200) caused an alignment inconsistency
device-mapper: table: 253:5: adding target device (start sect 34078720 len 39321600) caused an alignment inconsistency
so I planned to eventually re-align the partitions on the underlying drives to match the modern "optimal" conventions (ie, start partitions at 1 MiB boundaries). I finally got time to do that realignment over the Christmas/New Year period, during a "staycation" that let me be around periodically for all the steps required.
Overall process
The system in question had two drives (both 2TB), in RAID-1 for redundancy. Since I did not want to lose the RAID redundancy during the process my general approach was:
Obtain a 2TB external drive
Partition the 2TB external drive with 1 MiB aligned partitions matching the desired partition layout, marked as "raid" partitions
Extend the Linux MD RAID-1 to cover three drives, including the 2TB external drive, and wait for the RAID arrays to resync.
Then for each drive to be repartitioned, remove the drive from the RAID-1 sets (leaving the other original drive and the external drive), repartition it optimally, and re-add the drive back into the RAID-1 sets and wait for the RAID arrays to resync (then repeat for the other drive).
Remove the external 2TB drive from all the RAID sets
Reboot the system to ensure it detected the original two drives as now aligned.
This process took about 2 days to complete, most of which was waiting for the 2TB of RAID arrays to sync onto the external drive, as the system in question had only USB-2 (not USB-3), and thus copies onto the external drive went at about 30MB/s and took 18-20 hours. The last stage of repartitioning and resyncing the original drives went much faster, as the copies onto those drives went at over 120MB/s (reasonable for SATA 1.5 Gbps connected drives: the drives are SATA 6 Gbps capable, but the host controller is only SATA 1.5 Gbps).
NOTE: If you are going to attempt to follow this process yourself, I strongly recommend that you have a separate backup of the system -- other than on the RAID disks you are modifying -- as accidentally removing or overwriting the wrong thing at the wrong time during this process could easily lead to a difficult to recover system or permanent data loss.
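If you are following this process, it is also worth capturing the existing layout before touching anything, so there is a record to refer back to; for example (a sketch -- the output filenames are only illustrative):
sudo sfdisk -d /dev/sda > sda-partition-table.txt    # dump the existing MBR/EBR partition table
sudo sfdisk -d /dev/sdb > sdb-partition-table.txt
sudo mdadm --detail --scan > raid-arrays-before.txt  # record RAID array members and UUIDs
cat /proc/mdstat >> raid-arrays-before.txt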
Partition alignment
Modern parted is capable of creating "optimal" aligned partitions
if you give it the "-a optimal" flag when you start it up. However,
due to the age of this system I needed to recreate an MBR partition
table with 13 partitions on it -- necessitating several logical
partitions -- which come with their own alignment challenges (it
turns out that the Extended Boot Record logical partitions require
a linked list of partition records between the logical partitions,
thus requiring some space between each partition); on a modern
system using a GUID Partition Table avoids most of these challenges.
(Some choose not to align the extended partition, as there is no
user data in the extended partition, so only the logical partition
alignment matters; but this is only a minor help if you have
multiple logical partitions.)
After some thought I concluded that my desired partitioning had:
The first partition starting at 1MiB (2048s)
Every other partition starting on a MiB boundary
Every subsequent partition as close as possible to the previous one
All partitions at least a multiple of the physical sector size (4KiB)
All partitions larger than the original partitions on the disk, so that the RAID-1 resync would trivially work
Minimise wasted "rounding up" space
The last partition on the disks absorbing all of the reductions in disk space needed to meet the other considerations
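The arithmetic behind these constraints is just unit conversion: parted's sector values here are 512-byte logical sectors, so a MiB boundary is a multiple of 2048 sectors and a 4 KiB physical sector is 8 logical sectors. For example:
echo $(( 1024 * 1024 / 512 ))   # sectors per MiB: 2048
echo $(( 245 * 2048 ))          # start sector of a partition beginning at 245MiB: 501760
echo $(( 4096 / 512 ))          # logical sectors per 4KiB physical sector: 8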
Unfortunately the need for multiple "logical" partitions, the desire
to start every partition on a MiB boundary, the fact that parted most
easily supports creating partitions that are N MiB long, and a desire
not to waste space between partitions all ended up conflicting with
the need to have an Extended Boot Record between every logical
partition. I could either accept losing 1 MiB between each logical
partition -- to hold the 512 byte EBR and have everything start/end
on a MiB boundary -- or use more advanced methods to describe to
parted what I needed. I chose to use the more advanced methods.
Creating the first part of the partition table was pretty easy (here /dev/sdc was the 2TB external drive; but I followed the same partitioning on the original drives when I got to rebuilding them):
ewen@linux:~$ sudo parted -a optimal /dev/sdc
(parted) mklabel msdos
Warning: The existing disk label on /dev/sdc will be destroyed and all data on
this disk will be lost. Do you want to continue?
Yes/No? yes
(parted) quit
ewen@linux:~$ sudo parted -a optimal /dev/sdc
(parted) unit s
(parted) print
Model: WD Elements 25A2 (scsi)
Disk /dev/sdc: 3906963456s
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Disk Flags:
Number Start End Size Type File system Flags
(parted) mkpart primary 1MiB 245MiB
(parted) set 1 raid on
(parted) mkpart primary 245MiB 2199MiB
(parted) set 2 raid on
(parted) mkpart primary 2199MiB 4153MiB
(parted) set 3 raid on
(parted) mkpart extended 4153MiB 100%
This gave a drive with three primary partitions starting on MiB boundaries, and an extended partition which covered the remainder (majority) of the drive. Each primary partition was marked as a RAID partition.
After that it got more complicated. I needed to start each logical partition on a MiB boundary, and then finish it just before the next MiB boundary, to allow room for the Extended Boot Record to sit in between. For the first logical partition I chose to sacrifice 1 MiB, and simply start it on the next MiB boundary, but for the end position I needed to figure out "4KiB less than the next MiB" (ie, one physical sector) so as to leave room for the EBR and then start the next logical partition on a MiB boundary -- with minimal wasted space.
I calculated the first one by hand, as it needed a unique size, and specified it in sectors (ie, 512-byte units -- logical sectors):
(parted) mkpart logical 4154MiB 72511484s # 35406MiB - 4 sectors
(parted) set 5 raid on
Then for most of the rest, they were all the same size, and so the pattern was quite repetitive. To solve this I wrote a trivial, hacky Python script to generate the right parted commands:
base = 35406     # end of the first (hand-calculated) logical partition, in MiB
inc = 250004     # size of each of the repeated logical partitions, in MiB
for i in range(7):
    start = base + (i * inc)         # partition start, always on a MiB boundary
    end = base + ((i + 1) * inc)     # next MiB boundary, where the following partition starts
    last = (end * 2 * 1024) - 4      # inclusive end sector: MiB * 2048 sectors, less 4 sectors for the EBR gap
    print("mkpart logical {0:7d}MiB {1:10d}s # {2:7d}MiB - 4 sectors".format(start, last, end))
and then fed that output to parted:
(parted) mkpart logical 35406MiB 584519676s # 285410MiB - 4 sectors
(parted) mkpart logical 285410MiB 1096527868s # 535414MiB - 4 sectors
(parted) mkpart logical 535414MiB 1608536060s # 785418MiB - 4 sectors
(parted) mkpart logical 785418MiB 2120544252s # 1035422MiB - 4 sectors
(parted) mkpart logical 1035422MiB 2632552444s # 1285426MiB - 4 sectors
(parted) mkpart logical 1285426MiB 3144560636s # 1535430MiB - 4 sectors
(parted) mkpart logical 1535430MiB 3656568828s # 1785434MiB - 4 sectors
to create all the consistently sized partitions (this would have been
easier if parted
had supported start/size values, which is what is
actually stored in the MBR/EBR -- but it requires start/end values, which
need more manual calculation; seems like a poor UI to me).
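For what it is worth, the conversion from start/size (as stored in the MBR/EBR) to the inclusive end sector that parted wants is just end = start + size - 1; for example, for the first of the repeated logical partitions:
# start/size in 512-byte sectors, converted to parted's inclusive end sector
start=72511488
size=512008189
echo "${start}s $(( start + size - 1 ))s"   # prints: 72511488s 584519676s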
After that I could create the final partition to use the remainder of the disk, which is trivial to specify:
(parted) mkpart logical 1785434MiB 100%
and then mark all the other partitions as "raid" partitions:
(parted) set 6 raid on
(parted) set 7 raid on
(parted) set 8 raid on
(parted) set 9 raid on
(parted) set 10 raid on
(parted) set 11 raid on
(parted) set 12 raid on
(parted) set 13 raid on
which gave me a final partition table of:
(parted) unit s
(parted) print
Model: WD Elements 25A2 (scsi)
Disk /dev/sdc: 3906963456s
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Disk Flags:
Number Start End Size Type File system Flags
1 2048s 501759s 499712s primary raid, lba
2 501760s 4503551s 4001792s primary raid, lba
3 4503552s 8505343s 4001792s primary raid, lba
4 8505344s 3906963455s 3898458112s extended lba
5 8507392s 72511484s 64004093s logical raid, lba
6 72511488s 584519676s 512008189s logical raid, lba
7 584519680s 1096527868s 512008189s logical raid, lba
8 1096527872s 1608536060s 512008189s logical raid, lba
9 1608536064s 2120544252s 512008189s logical raid, lba
10 2120544256s 2632552444s 512008189s logical raid, lba
11 2632552448s 3144560636s 512008189s logical raid, lba
12 3144560640s 3656568828s 512008189s logical raid, lba
13 3656568832s 3906963455s 250394624s logical raid, lba
(parted)
As a final double check I also used the parted "align-check" command to
check the alignment (and manually divided each starting sector by 2048
to ensure it was on a MiB boundary -- 2048 = 2 * 1024, as the values
are in 512-byte sectors):
(parted) align-check optimal 1
1 aligned
(parted) align-check optimal 2
2 aligned
(parted) align-check optimal 3
3 aligned
(parted) align-check optimal 4
4 aligned
(parted) align-check optimal 5
5 aligned
(parted) align-check optimal 6
6 aligned
(parted) align-check optimal 7
7 aligned
(parted) align-check optimal 8
8 aligned
(parted) align-check optimal 9
9 aligned
(parted) align-check optimal 10
10 aligned
(parted) align-check optimal 11
11 aligned
(parted) align-check optimal 12
12 aligned
(parted) align-check optimal 13
13 aligned
(parted)
And then exited to work with this final partition table:
(parted) quit
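The manual divide-by-2048 check can also be scripted against the partition start sectors the kernel reports in sysfs, once it has re-read the partition table (a sketch for the external drive; adjust the device name as needed):
for s in /sys/block/sdc/sdc*/start; do
    start=$(cat "$s")
    # a MiB boundary is a multiple of 2048 512-byte sectors
    if [ $(( start % 2048 )) -eq 0 ]; then
        echo "$s: $start (MiB aligned)"
    else
        echo "$s: $start (NOT MiB aligned)"
    fi
done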
For more on partition alignment, particularly with MD / LVM layers as well, see Thomas Krenn's post on Partition Alignment, and a great set of slides on partition / MD / LVM alignment. Of note, both Linux MD RAID-1 metadata (1.2) and LVM Physical Volume metadata will take up some space at the start of the partition if you accept the modern defaults.
For Linux MD RAID-1, metadata 1.2 is at the start of the partition,
and the data then begins at the "Data Offset" within the partition.
(Linux MD RAID metadata 0.9 is at the end of the disk, so there is
no offset, which is sometimes useful, including for /boot partitions.)
You can see the offset in use by examining the individual RAID-1
elements: on metadata 1.2 RAID sets there is a "Data Offset" value
reported, which is typically 1 MiB (2048 * 512-byte sectors):
ewen@linux:~$ sudo mdadm -E /dev/sda2 | egrep "Version|Offset"
Version : 1.2
Data Offset : 2048 sectors
Super Offset : 8 sectors
ewen@linux:~$
although RAID sets created with more modern mdadm
tools might have
larger offsets (possibly for bitmaps to speed up resync?):
ewen@linux:~$ sudo mdadm -E /dev/sda13 | egrep "Version|Offset"
Version : 1.2
Data Offset : 131072 sectors
Super Offset : 8 sectors
ewen@linux:~$
These result in unused space in the MD RAID-1 elements which can be seen:
ewen@linux:~$ sudo mdadm -E /dev/sda2 | egrep "Unused"
Unused Space : before=1960 sectors, after=1632 sectors
ewen@linux:~$ sudo mdadm -E /dev/sda13 | egrep "Unused"
Unused Space : before=130984 sectors, after=65536 sectors
ewen@linux:~$
although in this case the unused space at the end is most likely due
to rounding up the partition sizes from those in the originally created
RAID array. (The "--data-offset" is computed automatically, but may
be overridden from the command line when the array is created -- on a
per-member-device basis. But presumably if the data offset is too
small, various metadata -- such as bitmaps -- cannot be stored.)
By default, modern LVM appears to start the data area 192 KiB into
its physical volumes (PV), which can be seen by checking the
"pe_start" value:
ewen@linux:~$ sudo pvs -o +pe_start /dev/md26
PV VG Fmt Attr PSize PFree 1st PE
/dev/md26 r1 lvm2 a-- 244.12g 0 192.00k
ewen@linux:~$ sudo pvs -o +pe_start /dev/md32
PV VG Fmt Attr PSize PFree 1st PE
/dev/md32 lvm2 --- 244.14g 244.14g 192.00k
ewen@linux:~$
and can be controlled at pvcreate time with the --metadatasize and
--dataalignment values (as well as an optional manual override of the
offset).
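If you did need to override those defaults when creating a new PV, it might look something like this (a hedged sketch -- /dev/md99 is a hypothetical device and the sizes are only illustrative; check the pvcreate man page for the exact semantics on your version):
# reserve 1MiB for LVM metadata and align the start of the data area to 1MiB
sudo pvcreate --metadatasize 1m --dataalignment 1m /dev/md99
# confirm where the first physical extent now starts
sudo pvs -o +pe_start /dev/md99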
Fortunately all of these values (1MiB == 2048s, 64MiB == 131072s,
192 KiB) are multiples of 4 KiB, so providing you are only
aligning to 4 KiB boundaries you do not need to worry about additional
alignment options as long as the underlying partitions are aligned. But
if you need to align to, eg, larger SSD erase blocks or larger
hardware RAID stripes, you may need to adjust the MD and LVM alignment
options as well to avoid leaving the underlying file system misaligned.
(If you are using modern Linux tools for all the software layers --
RAID, LVM, etc -- then the alignment_offset values will generally let
the defaults keep everything aligned; if there are any hardware layers
you will need to provide additional information to ensure the best
alignment.)
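A convenient way to see the kernel's view of the whole stack at once is lsblk's topology output (assuming a reasonably recent util-linux):
# the ALIGNMENT, MIN-IO, OPT-IO and PHY-SEC columns show what each layer has inherited
lsblk --topology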
RAID rebuild
Having created a third (external 2TB) disk with suitably aligned partitions I could then move on to resyncing all the RAID arrays 3 times (once onto the external drive, and then once onto each of the internal drives). A useful Debian User post outlined the process for extending the RAID-1 array onto a third disk, and then removing the third disk again, which provided the basis of my approach.
The first step was to extend all but the last RAID array onto the new disk (the last one needed special treatment as it was getting smaller, but fortunately it did not have any data on it yet). Growing onto the third disk is fairly simple:
sudo mdadm --grow /dev/md21 --level=1 --raid-devices=3 --add /dev/sdc1
sudo mdadm --grow /dev/md22 --level=1 --raid-devices=3 --add /dev/sdc2
sudo mdadm --grow /dev/md23 --level=1 --raid-devices=3 --add /dev/sdc3
sudo mdadm --grow /dev/md25 --level=1 --raid-devices=3 --add /dev/sdc5
sudo mdadm --grow /dev/md26 --level=1 --raid-devices=3 --add /dev/sdc6
sudo mdadm --grow /dev/md27 --level=1 --raid-devices=3 --add /dev/sdc7
sudo mdadm --grow /dev/md28 --level=1 --raid-devices=3 --add /dev/sdc8
sudo mdadm --grow /dev/md29 --level=1 --raid-devices=3 --add /dev/sdc9
sudo mdadm --grow /dev/md30 --level=1 --raid-devices=3 --add /dev/sdc10
sudo mdadm --grow /dev/md31 --level=1 --raid-devices=3 --add /dev/sdc11
sudo mdadm --grow /dev/md32 --level=1 --raid-devices=3 --add /dev/sdc12
although as mentioned above it did take a long time (most of a calendar day) due to the external drive being connected via USB-2, and thus limited to about 30MB/s.
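Progress of the resync can be watched while waiting, for example:
# shows resync percentage and estimated finish time for each array
cat /proc/mdstat
# or refresh the view every minute
watch -n 60 cat /proc/mdstat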
The second step was to destroy the RAID array for the last partition on the disks, discarding all data on it (fortunately none in my case), as that partition had to get smaller as described above. If you have important data on that last RAID array you will need to copy it somewhere else before proceeding.
ewen@linux:~$ head -4 /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md33 : active (auto-read-only) raid1 sdb13[0]
125233580 blocks super 1.2 [2/1] [U_]
ewen@linux:~$ sudo mdadm --stop /dev/md33
[sudo] password for ewen:
mdadm: stopped /dev/md33
ewen@linux:~$ sudo mdadm --remove /dev/md33
ewen@linux:~$ grep md33 /proc/mdstat
ewen@linux:~$ sudo mdadm --zero-superblock /dev/sda13
ewen@linux:~$ sudo mdadm --zero-superblock /dev/sdb13
ewen@linux:~$
After this, check the output of /proc/mdstat to ensure that all the
remaining RAID sets are happy, and show three active disks ("UUU") --
the two original disks, and the temporary external disk. If you are
not sure everything is perfectly prepared, sort out the remaining
issues before proceeding, as the next step will break the first original
drive out of the RAID-1 arrays.
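One quick way to spot any array that is not yet fully three-way in sync is to look for status lines that do not show three 'U's (a sketch):
# any output here means an array still has a missing or rebuilding member
grep -E '\[[U_]+\]' /proc/mdstat | grep -v 'UUU'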
The third step, when everything is ready, is to remove the first original drive from the RAID-1 arrays:
sudo mdadm /dev/md21 --fail /dev/sda1 --remove /dev/sda1
sudo mdadm /dev/md22 --fail /dev/sda2 --remove /dev/sda2
sudo mdadm /dev/md23 --fail /dev/sda3 --remove /dev/sda3
sudo mdadm /dev/md25 --fail /dev/sda5 --remove /dev/sda5
sudo mdadm /dev/md26 --fail /dev/sda6 --remove /dev/sda6
sudo mdadm /dev/md27 --fail /dev/sda7 --remove /dev/sda7
sudo mdadm /dev/md28 --fail /dev/sda8 --remove /dev/sda8
sudo mdadm /dev/md29 --fail /dev/sda9 --remove /dev/sda9
sudo mdadm /dev/md30 --fail /dev/sda10 --remove /dev/sda10
sudo mdadm /dev/md31 --fail /dev/sda11 --remove /dev/sda11
sudo mdadm /dev/md32 --fail /dev/sda12 --remove /dev/sda12
and then repartition the drive following the instructions above (ie, to be identical to the 2TB external drive, other than the size of the final partition).
When the partitioning is complete, run:
ewen@linux:~$ sudo partprobe -d -s /dev/sda
/dev/sda: msdos partitions 1 2 3 4 <5 6 7 8 9 10 11 12 13>
ewen@linux:~$ sudo partprobe -s /dev/sda
/dev/sda: msdos partitions 1 2 3 4 <5 6 7 8 9 10 11 12 13>
ewen@linux:~$
to ensure the new partitions are recognised, and then also compare the output of:
ewen@linux:~$ sudo parted -a optimal /dev/sda unit s print
Model: ATA WDC WD20EFRX-68A (scsi)
Disk /dev/sda: 3907029168s
Sector size (logical/physical): 512B/4096B
Partition Table: msdos
Disk Flags:
Number Start End Size Type File system Flags
1 2048s 501759s 499712s primary raid
2 501760s 4503551s 4001792s primary raid
3 4503552s 8505343s 4001792s primary raid
4 8505344s 3907028991s 3898523648s extended lba
5 8507392s 72511484s 64004093s logical raid
6 72511488s 584519676s 512008189s logical raid
7 584519680s 1096527868s 512008189s logical raid
8 1096527872s 1608536060s 512008189s logical raid
9 1608536064s 2120544252s 512008189s logical raid
10 2120544256s 2632552444s 512008189s logical raid
11 2632552448s 3144560636s 512008189s logical raid
12 3144560640s 3656568828s 512008189s logical raid
13 3656568832s 3907028991s 250460160s logical raid
ewen@linux:~$
with the start/size sectors recognised by Linux as active:
ewen@linux:/sys/block/sda$ for PART in 1 2 3 5 6 7 8 9 10 11 12 13; do echo "sda${PART}: " $(cat "sda${PART}/start") $(cat "sda${PART}/size"); done
sda1: 2048 499712
sda2: 501760 4001792
sda3: 4503552 4001792
sda5: 8507392 64004093
sda6: 72511488 512008189
sda7: 584519680 512008189
sda8: 1096527872 512008189
sda9: 1608536064 512008189
sda10: 2120544256 512008189
sda11: 2632552448 512008189
sda12: 3144560640 512008189
sda13: 3656568832 250460160
ewen@linux:/sys/block/sda$
to ensure that Linux will copy onto the new partitions, not old locations on the disk.
The fourth step is to add the original first drive back into the RAID-1 sets, and wait for them to all resync:
sudo mdadm --manage /dev/md21 --add /dev/sda1
sudo mdadm --manage /dev/md22 --add /dev/sda2
sudo mdadm --manage /dev/md23 --add /dev/sda3
sudo mdadm --manage /dev/md25 --add /dev/sda5
sudo mdadm --manage /dev/md26 --add /dev/sda6
sudo mdadm --manage /dev/md27 --add /dev/sda7
sudo mdadm --manage /dev/md28 --add /dev/sda8
sudo mdadm --manage /dev/md29 --add /dev/sda9
sudo mdadm --manage /dev/md30 --add /dev/sda10
sudo mdadm --manage /dev/md31 --add /dev/sda11
sudo mdadm --manage /dev/md32 --add /dev/sda12
which in my case took about 6 hours.
Once this is done, the same steps can be repeated to remove the /dev/sdb*
partitions, repartition the /dev/sdb
drive, re-check the partitions are
correctly recognised, and then re-add the /dev/sdb*
partitions into the
RAID sets.
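Since the md device numbers on this system are simply the partition number plus 20, the second pass can also be written as a couple of loops rather than typing each command out again (a sketch of the same commands as above, not something that was run verbatim here):
for p in 1 2 3 5 6 7 8 9 10 11 12; do
    sudo mdadm /dev/md$(( p + 20 )) --fail /dev/sdb$p --remove /dev/sdb$p
done
# ... repartition /dev/sdb and re-check the partitions as above, then ...
for p in 1 2 3 5 6 7 8 9 10 11 12; do
    sudo mdadm --manage /dev/md$(( p + 20 )) --add /dev/sdb$p
done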
Of note, when I started adding the /dev/sda*
partitions back in after
repartitioning, I got warnings saying:
ewen@linux:/sys/block$ sudo dmesg -T | grep misaligned
[Mon Jan 1 09:22:20 2018] md21: Warning: Device sda1 is misaligned
[Mon Jan 1 09:22:35 2018] md22: Warning: Device sda2 is misaligned
[Mon Jan 1 10:07:26 2018] md27: Warning: Device sda7 is misaligned
[Mon Jan 1 10:49:12 2018] md28: Warning: Device sda8 is misaligned
[Mon Jan 1 10:49:21 2018] md29: Warning: Device sda9 is misaligned
[Mon Jan 1 11:30:04 2018] md30: Warning: Device sda10 is misaligned
[Mon Jan 1 12:26:56 2018] md31: Warning: Device sda11 is misaligned
[Mon Jan 1 12:45:08 2018] md32: Warning: Device sda12 is misaligned
ewen@linux:/sys/block$
and when I went checking I found that the "alignment_offset" values had
been set to "-1" in the affected cases:
ewen@tv:/sys/block$ grep . md*/alignment_offset
md21/alignment_offset:-1
md22/alignment_offset:-1
md23/alignment_offset:0
md25/alignment_offset:0
md26/alignment_offset:0
md27/alignment_offset:-1
md28/alignment_offset:-1
md29/alignment_offset:-1
md30/alignment_offset:-1
md31/alignment_offset:-1
md32/alignment_offset:-1
ewen@tv:/sys/block$
Those alignment offsets should normally be the number of bytes to adjust by to achieve alignment again -- I saw values like 3072, 3584, etc, in them prior to aligning the underlying physical partitions properly -- and "0" indicates that the layer is already naturally aligned.
After some hunting it turned out that "-1" was a special magic value meaning basically "alignment is impossible":
* Returns 0 if the top and bottom queue_limits are compatible. The
* top device's block sizes and alignment offsets may be adjusted to
* ensure alignment with the bottom device. If no compatible sizes
* and alignments exist, -1 is returned and the resulting top
* queue_limits will have the misaligned flag set to indicate that
* the alignment_offset is undefined.
My conclusion was that, because the RAID-1 sets had stayed active
throughout, the previous alignment offset of the old /dev/sda*
partitions was non-zero, and the RAID-1 sets were attempting to find
an alignment offset for the new /dev/sda* partitions that would match
both the physical sectors and those same old offsets -- but there was
no valid offset that matched both the old and new /dev/sda* partition
offsets, hence the "-1". So I chose to ignore those warnings and carry on.
Once both original drives had been repartitioned and resync'd, the next
step was to recreate the /dev/md33
RAID partition again, on the smaller
partitions:
ewen@tv:~$ sudo mdadm --zero-superblock /dev/sda13
ewen@tv:~$ sudo mdadm --zero-superblock /dev/sdb13
ewen@tv:~$ sudo mdadm --zero-superblock /dev/sdc13
ewen@tv:~$ sudo mdadm --create /dev/md33 --level=1 --raid-devices=3 --chunk=4M /dev/sda13 /dev/sdb13 /dev/sdc13
mdadm: Note: this array has metadata at the start and
may not be suitable as a boot device. If you plan to
store '/boot' on this device please ensure that
your boot-loader understands md/v1.x metadata, or use
--metadata=0.90
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md33 started.
ewen@tv:~$
(because I was not booting from that partition, metadata 1.2 was fine, and gave more options -- this one was created with recovery bitmaps).
Note that in this case I chose to create the RAID-1 set including three drives, because the external 2TB drive was slightly smaller, and I wanted the option of later resync'ing it onto that drive as an offsite backup.
At this point it is useful to update /etc/mdadm/mdadm.conf with the
UUID of the new RAID set, to ensure that the configuration stays in
sync and the RAID arrays can be auto-started.
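One way to get the correct ARRAY line is from mdadm itself (a sketch -- review the output before appending it, and remove any stale ARRAY line for the old /dev/md33 first):
# print ARRAY lines (with UUIDs) for all currently assembled arrays
sudo mdadm --detail --scan
# or just the one-line summary for the newly created array
sudo mdadm --detail --brief /dev/md33
The initramfs update further below then picks up the new configuration at boot time.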
When that new RAID set completed resync'ing, I then removed the 2TB external drive from all the RAID sets, and set them back to "2-way" RAID to avoid the RAID sets sitting there partly failed:
sudo mdadm /dev/md21 --fail /dev/sdc1 --remove /dev/sdc1
sudo mdadm /dev/md22 --fail /dev/sdc2 --remove /dev/sdc2
sudo mdadm /dev/md23 --fail /dev/sdc3 --remove /dev/sdc3
sudo mdadm /dev/md25 --fail /dev/sdc5 --remove /dev/sdc5
sudo mdadm /dev/md26 --fail /dev/sdc6 --remove /dev/sdc6
sudo mdadm /dev/md27 --fail /dev/sdc7 --remove /dev/sdc7
sudo mdadm /dev/md28 --fail /dev/sdc8 --remove /dev/sdc8
sudo mdadm /dev/md29 --fail /dev/sdc9 --remove /dev/sdc9
sudo mdadm /dev/md30 --fail /dev/sdc10 --remove /dev/sdc10
sudo mdadm /dev/md31 --fail /dev/sdc11 --remove /dev/sdc11
sudo mdadm /dev/md32 --fail /dev/sdc12 --remove /dev/sdc12
sudo mdadm /dev/md33 --fail /dev/sdc13 --remove /dev/sdc13
sudo mdadm --grow /dev/md21 --raid-devices=2
sudo mdadm --grow /dev/md22 --raid-devices=2
sudo mdadm --grow /dev/md23 --raid-devices=2
sudo mdadm --grow /dev/md25 --raid-devices=2
sudo mdadm --grow /dev/md26 --raid-devices=2
sudo mdadm --grow /dev/md27 --raid-devices=2
sudo mdadm --grow /dev/md28 --raid-devices=2
sudo mdadm --grow /dev/md29 --raid-devices=2
sudo mdadm --grow /dev/md30 --raid-devices=2
sudo mdadm --grow /dev/md31 --raid-devices=2
sudo mdadm --grow /dev/md32 --raid-devices=2
sudo mdadm --grow /dev/md33 --raid-devices=2
and then checked for any remaining missing drive references or "sdc" references:
ewen@linux:~$ cat /proc/mdstat | grep "_"
ewen@linux:~$ cat /proc/mdstat | grep "sdc"
ewen@linux:~$
Then I unplugged the 2TB external drive, to keep for now as an offline backup.
To make sure that the system still booted, I reinstalled grub and updated the initramfs to pick up the new RAID UUIDs:
ewen@linux:~$ sudo grub-install /dev/sda
Installing for i386-pc platform.
Installation finished. No error reported.
ewen@linux:~$ sudo grub-install /dev/sdb
Installing for i386-pc platform.
Installation finished. No error reported.
ewen@linux:~$ sudo update-initramfs -u
update-initramfs: Generating /boot/initrd.img-4.9.0-4-686-pae
ewen@linux:~$
ewen@linux:~$ sudo update-grub
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-4.9.0-4-686-pae
Found initrd image: /boot/initrd.img-4.9.0-4-686-pae
Found linux image: /boot/vmlinuz-4.9.0-3-686-pae
Found initrd image: /boot/initrd.img-4.9.0-3-686-pae
Found linux image: /boot/vmlinuz-3.16.0-0.bpo.4-686-pae
Found initrd image: /boot/initrd.img-3.16.0-0.bpo.4-686-pae
Found memtest86 image: /memtest86.bin
Found memtest86+ image: /memtest86+.bin
Found memtest86+ multiboot image: /memtest86+_multiboot.bin
done
ewen@linux:~$
and then rebooted the system to make sure it could boot cleanly by itself. Fortunately it rebooted automatically without any issues!
After reboot I checked for reports of misalignment:
ewen@linux:~$ uptime
11:53:28 up 4 min, 1 user, load average: 0.05, 0.42, 0.25
ewen@linux:~$ sudo dmesg -T | grep -i misaligned
ewen@linux:~$ sudo dmesg -T | grep alignment
ewen@linux:~$ sudo dmesg -T | grep inconsistency
ewen@linux:~$
and was pleased to find that none were reported. I also checked all
the alignment_offset
values, and was pleased to see all of those
were now "0" -- ie "naturally aligned" (in this case to the 4KiB
physical sector boundaries):
ewen@linux:~$ cat /sys/block/sda/sda*/alignment_offset
0
0
0
0
0
0
0
0
0
0
0
0
0
ewen@linux:~$ cat /sys/block/sdb/sdb*/alignment_offset
0
0
0
0
0
0
0
0
0
0
0
0
0
ewen@linux:~$ cat /sys/block/md*/alignment_offset
0
0
0
0
0
0
0
0
0
0
0
0
ewen@linux:~$ cat /sys/block/dm*/alignment_offset
0
0
0
0
0
0
0
ewen@linux:~$
It is too soon to tell if this has any actual practical benefits in
performance due to improving the alignment. But not being reminded that
I "did it wrong" several years ago when putting the disks in -- due to
the fdisk
partition defaults at the time being wrong for 4 KiB physical
sector disks -- seems worth the effort anyway.