Introduction
On modern storage media and storage subsystems, file system alignment
to "larger than 512 byte" boundaries is increasingly important for
achieving good write performance, and on some media for avoiding
excessive wear on the underlying media (due to additional write
amplification). For about the last 8 years, the Linux kernel has
supported a "/sys/block/*/alignment_offset" metric which indicates
the number of bytes needed to get a particular layer back into
alignment with the underlying storage media, and for the last few years
Linux distributions have included tools that attempt to do automatic
alignment where possible. This helps newer systems, particularly
new installations, but cannot automatically fix older systems.
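A quick way to check where a system currently stands is to read those sysfs values directly (a sketch; device names will differ from system to system):
# 0 means the layer is naturally aligned; a non-zero value is the number of
# bytes of adjustment needed to get that layer back into alignment
grep . /sys/block/sd*/sd*/alignment_offset
grep . /sys/block/md*/alignment_offset
grep . /sys/block/dm-*/alignment_offset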
I had one older system (originally installed at least 15 years ago, and "grandfather's axe" upgraded through various versions of hardware) that ended up having all of:
4KB (4096 byte) physical sectors (on a pair of WDC WD20EFRX-68A 2TB drives):
ewen@linux:~$ cat /sys/block/sda/device/model
WDC WD20EFRX-68A
ewen@linux:~$ cat /sys/block/sda/queue/physical_block_size
4096
ewen@linux:~$ cat /sys/block/sda/queue/logical_block_size
512
ewen@linux:~$
although curiously the second apparently identical drive detects with 512-byte physical sectors, at least at present:
ewen@tv:~$ cat /sys/block/sdb/device/model
WDC WD20EFRX-68A
ewen@tv:~$ cat /sys/block/sdb/queue/physical_block_size
512
ewen@tv:~$ cat /sys/block/sdb/queue/logical_block_size
512
ewen@tv:~$
for reasons I do not understand (both drives were purchased at approximately the same time, as far as I can recall)
1990s-style partition layout, with partitions starting on a "cylinder" boundary:
(parted) unit s
(parted) print
Model: ATA WDC WD20EFRX-68A (scsi)
Disk /dev/sda: 3907029168s
Sector size (logical/physical): 512B/4096B
Partition Table: msdos
Disk Flags:

Number  Start        End          Size         Type      File system  Flags
 1      63s          498014s      497952s      primary   ext4         raid
 2      498015s      4498199s     4000185s     primary                raid
 3      4498200s     8498384s     4000185s     primary                raid
 4      8498385s     3907024064s  3898525680s  extended
 5      8498448s     72501344s    64002897s    logical                raid
 6      72501408s    584508959s   512007552s   logical                raid
 7      584509023s   1096516574s  512007552s   logical                raid
 8      1096516638s  1608524189s  512007552s   logical                raid
 9      1608524253s  2120531804s  512007552s   logical                raid
10      2120531868s  2632539419s  512007552s   logical                raid
11      2632539483s  3144547034s  512007552s   logical                raid
12      3144547098s  3656554649s  512007552s   logical                raid
13      3656554713s  3907024064s  250469352s   logical                raid
(parted)
Linux MD RAID-1 and LVM with no adjustments for the partition offsets to the physical block boundaries (due to being created with old tools), and
Linux file systems (ext4, xfs) created with no adjustments to the physical block boundaries (due to being created with old tools)
I knew that this misalignment was happening at the time I swapped in the newer (2TB) disks a few years ago, but did not have time to figure out the correct method to manually align all the layers, so I decided to just accept the lower performance. (Fortunately, being magnetic storage rather than SSDs, there was not an additional risk of excessive drive wear caused by the misalignment.)
After upgrading to a modern Debian Linux version, including a new kernel, this misalignment was made more visible again, including in kernel messages on every boot:
device-mapper: table: 253:2: adding target device (start sect 511967232 len 24903680) caused an alignment inconsistency
device-mapper: table: 253:4: adding target device (start sect 511967232 len 511967232) caused an alignment inconsistency
device-mapper: table: 253:4: adding target device (start sect 1023934464 len 511967232) caused an alignment inconsistency
device-mapper: table: 253:4: adding target device (start sect 1535901696 len 36962304) caused an alignment inconsistency
device-mapper: table: 253:4: adding target device (start sect 1572864000 len 209715200) caused an alignment inconsistency
device-mapper: table: 253:5: adding target device (start sect 34078720 len 39321600) caused an alignment inconsistency
so I planned to eventually re-align the partitions on the underlying drives to match the modern "optimal" conventions (ie, start partitions at 1 MiB boundaries). I finally got time to do that realignment over the Christmas/New Year period, during a "staycation" that let me be around periodically for all the steps required.
Overall process
The system in question had two drives (both 2TB), in RAID-1 for redundancy. Since I did not want to lose the RAID redundancy during the process my general approach was:
Obtain a 2TB external drive
Partition the 2TB external drive with 1 MiB aligned partitions matching the desired partition layout, marked as "raid" partitions
Extend the Linux MD RAID-1 to cover three drives, including the 2TB external drive, and wait for the RAID arrays to resync.
Then for each drive to be repartitioned, remove the drive from the RAID-1 sets (leaving the other original drive and the external drive), repartition it optimally, and re-add the drive back into the RAID-1 sets and wait for the RAID arrays to resync (then repeat for the other drive).
Remove the external 2TB drive from all the RAID sets
Reboot the system to ensure it detected the original two drives as now aligned.
This process took about 2 days to complete, most of which was waiting for the 2TB of RAID arrays to sync onto the external drive, as the system in question had only USB-2 (not USB-3), and thus copies onto the external drive went at about 30MB/s and took 18-20 hours. The last stage of repartitioning and resyncing the original drives went much faster, as the copies onto those drives went at over 120MB/s (reasonable for SATA 1.5 Gbps connected drives: the drives are SATA 6 Gbps capable, but the host controller is only SATA 1.5 Gbps).
NOTE: If you are going to attempt to follow this process yourself, I strongly recommend that you have a separate backup of the system -- other than on the RAID disks you are modifying -- as accidentally removing or overwriting the wrong thing at the wrong time during this process could easily lead to a difficult to recover system or permanent data loss.
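If you are following this process, it is also worth capturing the existing layout before touching anything, so there is a record to refer back to; for example (a sketch -- the output filenames are only illustrative):
sudo sfdisk -d /dev/sda > sda-partition-table.txt    # dump the existing MBR/EBR partition table
sudo sfdisk -d /dev/sdb > sdb-partition-table.txt
sudo mdadm --detail --scan > raid-arrays-before.txt  # record RAID array members and UUIDs
cat /proc/mdstat >> raid-arrays-before.txt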
Partition alignment
Modern parted is capable of creating "optimal" aligned partitions
if you give it the "-a optimal" flag when you start it up. However,
due to the age of this system I needed to recreate an MBR partition
table with 13 partitions on it -- necessitating several logical
partitions -- which come with their own alignment challenges (it
turns out that the Extended Boot Record logical partitions require
a linked list of partition records between the logical partitions,
thus requiring some space between each partition); on a modern
system using a GUID Partition Table avoids most of these challenges.
(Some choose not to align the extended partition, as there is no
user data in the extended partition, so only the logical partition
alignment matters; but this is only a minor help if you have
multiple logical partitions.)
After some thought I concluded that my desired partitioning had:
The first partition starting at 1MiB (2048s)
Every other partition starting on a MiB boundary
Every subsequent partition as close as possible to the previous one
All partitions at least a multiple of the physical sector size (4KiB)
All partitions larger than the original partitions on the disk, so that the RAID-1 resync would trivially work
Minimise wasted "rounding up" space
The last partition on the disks absorbing all of the reductions in disk space needed to meet the other considerations
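The arithmetic behind these constraints is just unit conversion: parted's sector values here are 512-byte logical sectors, so a MiB boundary is a multiple of 2048 sectors and a 4 KiB physical sector is 8 logical sectors. For example:
echo $(( 1024 * 1024 / 512 ))   # sectors per MiB: 2048
echo $(( 245 * 2048 ))          # start sector of a partition beginning at 245MiB: 501760
echo $(( 4096 / 512 ))          # logical sectors per 4KiB physical sector: 8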
Unfortunately the need for multiple "logical" partitions, the desire
to start every partition on a MiB boundary, the fact that parted most
easily supports creating partitions that are N MiB long, and a desire
not to waste space between partitions all ended up conflicting with
the need to have an Extended Boot Record between every logical
partition. I could either accept losing 1 MiB between each logical
partition -- to hold the 512 byte EBR and have everything start/end
on a MiB boundary -- or use more advanced methods to describe to
parted what I needed. I chose to use the more advanced methods.
Creating the first part of the partition table was pretty easy (here /dev/sdc was the 2TB external drive; but I followed the same partitioning on the original drives when I got to rebuilding them):
ewen@linux:~$ sudo parted -a optimal /dev/sdc
(parted) mklabel msdos
Warning: The existing disk label on /dev/sdc will be destroyed and all data on
this disk will be lost. Do you want to continue?
Yes/No? yes
(parted) quit
ewen@linux:~$ sudo parted -a optimal /dev/sdc
(parted) unit s
(parted) print
Model: WD Elements 25A2 (scsi)
Disk /dev/sdc: 3906963456s
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Disk Flags:
Number Start End Size Type File system Flags
(parted) mkpart primary 1MiB 245MiB
(parted) set 1 raid on
(parted) mkpart primary 245MiB 2199MiB
(parted) set 2 raid on
(parted) mkpart primary 2199MiB 4153MiB
(parted) set 3 raid on
(parted) mkpart extended 4153MiB 100%
This gave a drive with three primary partitions starting on MiB boundaries, and an extended partition which covered the remainder (majority) of the drive. Each primary partition was marked as a RAID partition.
After that it got more complicated. I needed to start each logical partition on a MiB boundary, and then finish it just before the next MiB boundary, to allow room for the Extended Boot Record to sit in between. For the first logical partition I chose to sacrifice 1 MiB, and simply start it on the next MiB boundary, but for the end position I needed to figure out "4KiB less than the next MiB" (ie, one physical sector) so as to leave room for the EBR and then start the next logical partition on a MiB boundary -- with minimal wasted space.
I calculated the first one by hand, as it needed a unique size, and specified it in sectors (ie, 512-byte units -- logical sectors):
(parted) mkpart logical 4154MiB 72511484s # 35406MiB - 4 sectors
(parted) set 5 raid on
Then for most of the rest, they were all the same size, and so the pattern was quite repetitive. To solve this I wrote a trivial, hacky Python script to generate the right parted commands:
base = 35406     # end of the first (hand-calculated) logical partition, in MiB
inc = 250004     # size of each of the repeated logical partitions, in MiB
for i in range(7):
    start = base + (i * inc)         # partition start, always on a MiB boundary
    end = base + ((i + 1) * inc)     # next MiB boundary, where the following partition starts
    last = (end * 2 * 1024) - 4      # inclusive end sector: MiB * 2048 sectors, less 4 sectors for the EBR gap
    print("mkpart logical {0:7d}MiB {1:10d}s # {2:7d}MiB - 4 sectors".format(start, last, end))
and then fed that output to parted:
(parted) mkpart logical 35406MiB 584519676s # 285410MiB - 4 sectors
(parted) mkpart logical 285410MiB 1096527868s # 535414MiB - 4 sectors
(parted) mkpart logical 535414MiB 1608536060s # 785418MiB - 4 sectors
(parted) mkpart logical 785418MiB 2120544252s # 1035422MiB - 4 sectors
(parted) mkpart logical 1035422MiB 2632552444s # 1285426MiB - 4 sectors
(parted) mkpart logical 1285426MiB 3144560636s # 1535430MiB - 4 sectors
(parted) mkpart logical 1535430MiB 3656568828s # 1785434MiB - 4 sectors
to create all the consistently sized partitions (this would have been
easier if parted
had supported start/size values, which is what is
actually stored in the MBR/EBR -- but it requires start/end values, which
need more manual calculation; seems like a poor UI to me).
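For what it is worth, the conversion from start/size (as stored in the MBR/EBR) to the inclusive end sector that parted wants is just end = start + size - 1; for example, for the first of the repeated logical partitions:
# start/size in 512-byte sectors, converted to parted's inclusive end sector
start=72511488
size=512008189
echo "${start}s $(( start + size - 1 ))s"   # prints: 72511488s 584519676s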
After that I could create the final partition to use the remainder of the disk, which is trivial to specify:
(parted) mkpart logical 1785434MiB 100%
and then mark all the other partitions as "raid" partitions:
(parted) set 6 raid on
(parted) set 7 raid on
(parted) set 8 raid on
(parted) set 9 raid on
(parted) set 10 raid on
(parted) set 11 raid on
(parted) set 12 raid on
(parted) set 13 raid on
which gave me a final partition table of:
(parted) unit s
(parted) print
Model: WD Elements 25A2 (scsi)
Disk /dev/sdc: 3906963456s
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Disk Flags:
Number Start End Size Type File system Flags
1 2048s 501759s 499712s primary raid, lba
2 501760s 4503551s 4001792s primary raid, lba
3 4503552s 8505343s 4001792s primary raid, lba
4 8505344s 3906963455s 3898458112s extended lba
5 8507392s 72511484s 64004093s logical raid, lba
6 72511488s 584519676s 512008189s logical raid, lba
7 584519680s 1096527868s 512008189s logical raid, lba
8 1096527872s 1608536060s 512008189s logical raid, lba
9 1608536064s 2120544252s 512008189s logical raid, lba
10 2120544256s 2632552444s 512008189s logical raid, lba
11 2632552448s 3144560636s 512008189s logical raid, lba
12 3144560640s 3656568828s 512008189s logical raid, lba
13 3656568832s 3906963455s 250394624s logical raid, lba
(parted)
As a final double check I also used the parted "align-check" command to
check the alignment (and manually divided each starting sector by 2048
to ensure it was on a MiB boundary -- 2048 = 2 * 1024, as the values
are in 512-byte sectors):
(parted) align-check optimal 1
1 aligned
(parted) align-check optimal 2
2 aligned
(parted) align-check optimal 3
3 aligned
(parted) align-check optimal 4
4 aligned
(parted) align-check optimal 5
5 aligned
(parted) align-check optimal 6
6 aligned
(parted) align-check optimal 7
7 aligned
(parted) align-check optimal 8
8 aligned
(parted) align-check optimal 9
9 aligned
(parted) align-check optimal 10
10 aligned
(parted) align-check optimal 11
11 aligned
(parted) align-check optimal 12
12 aligned
(parted) align-check optimal 13
13 aligned
(parted)
And then exited to work with this final partition table:
(parted) quit
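The manual divide-by-2048 check can also be scripted against the partition start sectors the kernel reports in sysfs, once it has re-read the partition table (a sketch for the external drive; adjust the device name as needed):
for s in /sys/block/sdc/sdc*/start; do
    start=$(cat "$s")
    # a MiB boundary is a multiple of 2048 512-byte sectors
    if [ $(( start % 2048 )) -eq 0 ]; then
        echo "$s: $start (MiB aligned)"
    else
        echo "$s: $start (NOT MiB aligned)"
    fi
done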
For more on partition alignment, particularly with MD / LVM layers as well, see Thomas Krenn's post on Partition Alignment, and a great set of slides on partition / MD / LVM alignment. Of note, both Linux MD RAID-1 metadata (1.2) and LVM Physical Volume metadata will take up some space at the start of the partition if you accept the modern defaults.
For Linux MD RAID-1, metadata 1.2 is at the start of the partition,
and the data then begins at the "Data Offset" within the partition.
(Linux MD RAID metadata 0.9 is at the end of the disk, so there is
no offset, which is sometimes useful, including for /boot partitions.)
You can see the offset in use by examining the individual RAID-1
elements: on metadata 1.2 RAID sets there is a "Data Offset" value
reported, which is typically 1 MiB (2048 * 512-byte sectors):
ewen@linux:~$ sudo mdadm -E /dev/sda2 | egrep "Version|Offset"
Version : 1.2
Data Offset : 2048 sectors
Super Offset : 8 sectors
ewen@linux:~$
although RAID sets created with more modern mdadm
tools might have
larger offsets (possibly for bitmaps to speed up resync?):
ewen@linux:~$ sudo mdadm -E /dev/sda13 | egrep "Version|Offset"
Version : 1.2
Data Offset : 131072 sectors
Super Offset : 8 sectors
ewen@linux:~$
These result in unused space in the MD RAID-1 elements which can be seen:
ewen@linux:~$ sudo mdadm -E /dev/sda2 | egrep "Unused"
Unused Space : before=1960 sectors, after=1632 sectors
ewen@linux:~$ sudo mdadm -E /dev/sda13 | egrep "Unused"
Unused Space : before=130984 sectors, after=65536 sectors
ewen@linux:~$
although in this case the unused space at the end is most likely due
to rounding up the partition sizes from those in the originally created
RAID array. (The "--data-offset" is computed automatically, but may
be overridden from the command line when the array is created -- on a
per-member-device basis. But presumably if the data offset is too
small, various metadata -- such as bitmaps -- cannot be stored.)
By default, modern LVM appears to start the data area 192 KiB into
its physical volumes (PV), which can be seen by checking the
"pe_start" value:
ewen@linux:~$ sudo pvs -o +pe_start /dev/md26
PV VG Fmt Attr PSize PFree 1st PE
/dev/md26 r1 lvm2 a-- 244.12g 0 192.00k
ewen@linux:~$ sudo pvs -o +pe_start /dev/md32
PV VG Fmt Attr PSize PFree 1st PE
/dev/md32 lvm2 --- 244.14g 244.14g 192.00k
ewen@linux:~$
and can be controlled at pvcreate time with the --metadatasize and
--dataalignment values (as well as an optional manual override of the
offset).
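If you did need to override those defaults when creating a new PV, it might look something like this (a hedged sketch -- /dev/md99 is a hypothetical device and the sizes are only illustrative; check the pvcreate man page for the exact semantics on your version):
# reserve 1MiB for LVM metadata and align the start of the data area to 1MiB
sudo pvcreate --metadatasize 1m --dataalignment 1m /dev/md99
# confirm where the first physical extent now starts
sudo pvs -o +pe_start /dev/md99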
Fortunately all of these values (1MiB == 2048s, 64MiB == 131072s,
192 KiB) are multiples of 4 KiB, so providing you are only
aligning to 4 KiB boundaries you do not need to worry about additional
alignment options as long as the underlying partitions are aligned. But
if you need to align to, eg, larger SSD erase blocks or larger
hardware RAID stripes, you may need to adjust the MD and LVM alignment
options as well to avoid leaving the underlying file system misaligned.
(If you are using modern Linux tools for all the software layers --
RAID, LVM, etc -- then the alignment_offset values will generally let
the defaults keep everything aligned; if there are any hardware layers
you will need to provide additional information to ensure the best
alignment.)
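A convenient way to see the kernel's view of the whole stack at once is lsblk's topology output (assuming a reasonably recent util-linux):
# the ALIGNMENT, MIN-IO, OPT-IO and PHY-SEC columns show what each layer has inherited
lsblk --topology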
RAID rebuild
Having created a third (external 2TB) disk with suitably aligned partitions I could then move on to resyncing all the RAID arrays 3 times (once onto the external drive, and then once onto each of the internal drives). A useful Debian User post outlined the process for extending the RAID-1 array onto a third disk, and then removing the third disk again, which provided the basis of my approach.
The first step was to extend all but the last RAID array onto the new disk (the last one needed special treatment as it was getting smaller, but fortunately it did not have any data on it yet). Growing onto the third disk is fairly simple:
sudo mdadm --grow /dev/md21 --level=1 --raid-devices=3 --add /dev/sdc1
sudo mdadm --grow /dev/md22 --level=1 --raid-devices=3 --add /dev/sdc2
sudo mdadm --grow /dev/md23 --level=1 --raid-devices=3 --add /dev/sdc3
sudo mdadm --grow /dev/md25 --level=1 --raid-devices=3 --add /dev/sdc5
sudo mdadm --grow /dev/md26 --level=1 --raid-devices=3 --add /dev/sdc6
sudo mdadm --grow /dev/md27 --level=1 --raid-devices=3 --add /dev/sdc7
sudo mdadm --grow /dev/md28 --level=1 --raid-devices=3 --add /dev/sdc8
sudo mdadm --grow /dev/md29 --level=1 --raid-devices=3 --add /dev/sdc9
sudo mdadm --grow /dev/md30 --level=1 --raid-devices=3 --add /dev/sdc10
sudo mdadm --grow /dev/md31 --level=1 --raid-devices=3 --add /dev/sdc11
sudo mdadm --grow /dev/md32 --level=1 --raid-devices=3 --add /dev/sdc12
although as mentioned above it did take a long time (most of a calendar day) due to the external drive being connected via USB-2, and thus limited to about 30MB/s.
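Progress of the resync can be watched while waiting, for example:
# shows resync percentage and estimated finish time for each array
cat /proc/mdstat
# or refresh the view every minute
watch -n 60 cat /proc/mdstat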
The second step was to destroy the RAID array for the last partition on the disks, discarding all data on it (fortunately none in my case), as that partition had to get smaller as described above. If you have important data on that last RAID array you will need to copy it somewhere else before proceeding.
ewen@linux:~$ head -4 /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md33 : active (auto-read-only) raid1 sdb13[0]
125233580 blocks super 1.2 [2/1] [U_]
ewen@linux:~$ sudo mdadm --stop /dev/md33
[sudo] password for ewen:
mdadm: stopped /dev/md33
ewen@linux:~$ sudo mdadm --remove /dev/md33
ewen@linux:~$ grep md33 /proc/mdstat
ewen@linux:~$ sudo mdadm --zero-superblock /dev/sda13
ewen@linux:~$ sudo mdadm --zero-superblock /dev/sdb13
ewen@linux:~$
After this, check the output of /proc/mdstat to ensure that all the
remaining RAID sets are happy, and show three active disks ("UUU") --
the two original disks, and the temporary external disk. If you are
not sure everything is perfectly prepared, sort out the remaining
issues before proceeding, as the next step will break the first original
drive out of the RAID-1 arrays.
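One quick way to spot any array that is not yet fully three-way in sync is to look for status lines that do not show three 'U's (a sketch):
# any output here means an array still has a missing or rebuilding member
grep -E '\[[U_]+\]' /proc/mdstat | grep -v 'UUU'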
The third step, when everything is ready, is to remove the first original drive from the RAID-1 arrays:
sudo mdadm /dev/md21 --fail /dev/sda1 --remove /dev/sda1
sudo mdadm /dev/md22 --fail /dev/sda2 --remove /dev/sda2
sudo mdadm /dev/md23 --fail /dev/sda3 --remove /dev/sda3
sudo mdadm /dev/md25 --fail /dev/sda5 --remove /dev/sda5
sudo mdadm /dev/md26 --fail /dev/sda6 --remove /dev/sda6
sudo mdadm /dev/md27 --fail /dev/sda7 --remove /dev/sda7
sudo mdadm /dev/md28 --fail /dev/sda8 --remove /dev/sda8
sudo mdadm /dev/md29 --fail /dev/sda9 --remove /dev/sda9
sudo mdadm /dev/md30 --fail /dev/sda10 --remove /dev/sda10
sudo mdadm /dev/md31 --fail /dev/sda11 --remove /dev/sda11
sudo mdadm /dev/md32 --fail /dev/sda12 --remove /dev/sda12
and then repartition the drive following the instructions above (ie, to be identical to the 2TB external drive, other than the size of the final partition).
When the partitioning is complete, run:
ewen@linux:~$ sudo partprobe -d -s /dev/sda
/dev/sda: msdos partitions 1 2 3 4 <5 6 7 8 9 10 11 12 13>
ewen@linux:~$ sudo partprobe -s /dev/sda
/dev/sda: msdos partitions 1 2 3 4 <5 6 7 8 9 10 11 12 13>
ewen@linux:~$
to ensure the new partitions are recognised, and then also compare the output of:
ewen@linux:~$ sudo parted -a optimal /dev/sda unit s print
Model: ATA WDC WD20EFRX-68A (scsi)
Disk /dev/sda: 3907029168s
Sector size (logical/physical): 512B/4096B
Partition Table: msdos
Disk Flags:
Number Start End Size Type File system Flags
1 2048s 501759s 499712s primary raid
2 501760s 4503551s 4001792s primary raid
3 4503552s 8505343s 4001792s primary raid
4 8505344s 3907028991s 3898523648s extended lba
5 8507392s 72511484s 64004093s logical raid
6 72511488s 584519676s 512008189s logical raid
7 584519680s 1096527868s 512008189s logical raid
8 1096527872s 1608536060s 512008189s logical raid
9 1608536064s 2120544252s 512008189s logical raid
10 2120544256s 2632552444s 512008189s logical raid
11 2632552448s 3144560636s 512008189s logical raid
12 3144560640s 3656568828s 512008189s logical raid
13 3656568832s 3907028991s 250460160s logical raid
ewen@linux:~$
with the start/size sectors recognised by Linux as active:
ewen@linux:/sys/block/sda$ for PART in 1 2 3 5 6 7 8 9 10 11 12 13; do echo "sda${PART}: " $(cat "sda${PART}/start") $(cat "sda${PART}/size"); done
sda1: 2048 499712
sda2: 501760 4001792
sda3: 4503552 4001792
sda5: 8507392 64004093
sda6: 72511488 512008189
sda7: 584519680 512008189
sda8: 1096527872 512008189
sda9: 1608536064 512008189
sda10: 2120544256 512008189
sda11: 2632552448 512008189
sda12: 3144560640 512008189
sda13: 3656568832 250460160
ewen@linux:/sys/block/sda$
to ensure that Linux will copy onto the new partitions, not old locations on the disk.
The fourth step is to add the original first drive back into the RAID-1 sets, and wait for them to all resync:
sudo mdadm --manage /dev/md21 --add /dev/sda1
sudo mdadm --manage /dev/md22 --add /dev/sda2
sudo mdadm --manage /dev/md23 --add /dev/sda3
sudo mdadm --manage /dev/md25 --add /dev/sda5
sudo mdadm --manage /dev/md26 --add /dev/sda6
sudo mdadm --manage /dev/md27 --add /dev/sda7
sudo mdadm --manage /dev/md28 --add /dev/sda8
sudo mdadm --manage /dev/md29 --add /dev/sda9
sudo mdadm --manage /dev/md30 --add /dev/sda10
sudo mdadm --manage /dev/md31 --add /dev/sda11
sudo mdadm --manage /dev/md32 --add /dev/sda12
which in my case took about 6 hours.
Once this is done, the same steps can be repeated to remove the /dev/sdb*
partitions, repartition the /dev/sdb
drive, re-check the partitions are
correctly recognised, and then re-add the /dev/sdb*
partitions into the
RAID sets.
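Since the md device numbers on this system are simply the partition number plus 20, the second pass can also be written as a couple of loops rather than typing each command out again (a sketch of the same commands as above, not something that was run verbatim here):
for p in 1 2 3 5 6 7 8 9 10 11 12; do
    sudo mdadm /dev/md$(( p + 20 )) --fail /dev/sdb$p --remove /dev/sdb$p
done
# ... repartition /dev/sdb and re-check the partitions as above, then ...
for p in 1 2 3 5 6 7 8 9 10 11 12; do
    sudo mdadm --manage /dev/md$(( p + 20 )) --add /dev/sdb$p
done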
Of note, when I started adding the /dev/sda*
partitions back in after
repartitioning, I got warnings saying:
ewen@linux:/sys/block$ sudo dmesg -T | grep misaligned
[Mon Jan 1 09:22:20 2018] md21: Warning: Device sda1 is misaligned
[Mon Jan 1 09:22:35 2018] md22: Warning: Device sda2 is misaligned
[Mon Jan 1 10:07:26 2018] md27: Warning: Device sda7 is misaligned
[Mon Jan 1 10:49:12 2018] md28: Warning: Device sda8 is misaligned
[Mon Jan 1 10:49:21 2018] md29: Warning: Device sda9 is misaligned
[Mon Jan 1 11:30:04 2018] md30: Warning: Device sda10 is misaligned
[Mon Jan 1 12:26:56 2018] md31: Warning: Device sda11 is misaligned
[Mon Jan 1 12:45:08 2018] md32: Warning: Device sda12 is misaligned
ewen@linux:/sys/block$
and when I went checking I found that the "alignment_offset" values had
been set to "-1" in the affected cases:
ewen@tv:/sys/block$ grep . md*/alignment_offset
md21/alignment_offset:-1
md22/alignment_offset:-1
md23/alignment_offset:0
md25/alignment_offset:0
md26/alignment_offset:0
md27/alignment_offset:-1
md28/alignment_offset:-1
md29/alignment_offset:-1
md30/alignment_offset:-1
md31/alignment_offset:-1
md32/alignment_offset:-1
ewen@tv:/sys/block$
Those alignment offsets should normally be the number of bytes to adjust by to achieve alignment again -- I saw values like 3072, 3584, etc, in them prior to aligning the underlying physical partitions properly -- and "0" indicates that the layer is already naturally aligned.
After some hunting it turned out that "-1" was a special magic value meaning basically "alignment is impossible":
* Returns 0 if the top and bottom queue_limits are compatible. The
* top device's block sizes and alignment offsets may be adjusted to
* ensure alignment with the bottom device. If no compatible sizes
* and alignments exist, -1 is returned and the resulting top
* queue_limits will have the misaligned flag set to indicate that
* the alignment_offset is undefined.
My conclusion was that, because the RAID-1 sets had stayed active
throughout, the previous alignment offset of the old /dev/sda*
partitions was non-zero, and the RAID-1 sets were attempting to find
an alignment offset for the new /dev/sda* partitions that would match
both the physical sectors and those same old offsets -- but there was
no valid offset that matched both the old and new /dev/sda* partition
offsets, hence the "-1". So I chose to ignore those warnings and carry on.
Once both original drives had been repartitioned and resync'd, the next
step was to recreate the /dev/md33
RAID partition again, on the smaller
partitions:
ewen@tv:~$ sudo mdadm --zero-superblock /dev/sda13
ewen@tv:~$ sudo mdadm --zero-superblock /dev/sdb13
ewen@tv:~$ sudo mdadm --zero-superblock /dev/sdc13
ewen@tv:~$ sudo mdadm --create /dev/md33 --level=1 --raid-devices=3 --chunk=4M /dev/sda13 /dev/sdb13 /dev/sdc13
mdadm: Note: this array has metadata at the start and
may not be suitable as a boot device. If you plan to
store '/boot' on this device please ensure that
your boot-loader understands md/v1.x metadata, or use
--metadata=0.90
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md33 started.
ewen@tv:~$
(because I was not booting from that partition, metadata 1.2 was fine, and gave more options -- this one was created with recovery bitmaps).
Note that in this case I chose to create the RAID-1 set including three drives, because the external 2TB drive was slightly smaller, and I wanted the option of later resync'ing it onto that drive as an offsite backup.
At this point it is useful to update /etc/mdadm/mdadm.conf with the
UUID of the new RAID set, to ensure that the configuration stays in
sync and the RAID arrays can be auto-started.
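One way to get the correct ARRAY line is from mdadm itself (a sketch -- review the output before appending it, and remove any stale ARRAY line for the old /dev/md33 first):
# print ARRAY lines (with UUIDs) for all currently assembled arrays
sudo mdadm --detail --scan
# or just the one-line summary for the newly created array
sudo mdadm --detail --brief /dev/md33
The initramfs update further below then picks up the new configuration at boot time.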
When that new RAID set completed resync'ing, I then removed the 2TB external drive from all the RAID sets, and set them back to "2-way" RAID to avoid the RAID sets sitting there partly failed:
sudo mdadm /dev/md21 --fail /dev/sdc1 --remove /dev/sdc1
sudo mdadm /dev/md22 --fail /dev/sdc2 --remove /dev/sdc2
sudo mdadm /dev/md23 --fail /dev/sdc3 --remove /dev/sdc3
sudo mdadm /dev/md25 --fail /dev/sdc5 --remove /dev/sdc5
sudo mdadm /dev/md26 --fail /dev/sdc6 --remove /dev/sdc6
sudo mdadm /dev/md27 --fail /dev/sdc7 --remove /dev/sdc7
sudo mdadm /dev/md28 --fail /dev/sdc8 --remove /dev/sdc8
sudo mdadm /dev/md29 --fail /dev/sdc9 --remove /dev/sdc9
sudo mdadm /dev/md30 --fail /dev/sdc10 --remove /dev/sdc10
sudo mdadm /dev/md31 --fail /dev/sdc11 --remove /dev/sdc11
sudo mdadm /dev/md32 --fail /dev/sdc12 --remove /dev/sdc12
sudo mdadm /dev/md33 --fail /dev/sdc13 --remove /dev/sdc13
sudo mdadm --grow /dev/md21 --raid-devices=2
sudo mdadm --grow /dev/md22 --raid-devices=2
sudo mdadm --grow /dev/md23 --raid-devices=2
sudo mdadm --grow /dev/md25 --raid-devices=2
sudo mdadm --grow /dev/md26 --raid-devices=2
sudo mdadm --grow /dev/md27 --raid-devices=2
sudo mdadm --grow /dev/md28 --raid-devices=2
sudo mdadm --grow /dev/md29 --raid-devices=2
sudo mdadm --grow /dev/md30 --raid-devices=2
sudo mdadm --grow /dev/md31 --raid-devices=2
sudo mdadm --grow /dev/md32 --raid-devices=2
sudo mdadm --grow /dev/md33 --raid-devices=2
and then checked for any remaining missing drive references or "sdc" references:
ewen@linux:~$ cat /proc/mdstat | grep "_"
ewen@linux:~$ cat /proc/mdstat | grep "sdc"
ewen@linux:~$
Then I unplugged the 2TB external drive, to keep for now as an offline backup.
To make sure that the system still booted, I reinstalled grub and updated the initramfs to pick up the new RAID UUIDs:
ewen@linux:~$ sudo grub-install /dev/sda
Installing for i386-pc platform.
Installation finished. No error reported.
ewen@linux:~$ sudo grub-install /dev/sdb
Installing for i386-pc platform.
Installation finished. No error reported.
ewen@linux:~$ sudo update-initramfs -u
update-initramfs: Generating /boot/initrd.img-4.9.0-4-686-pae
ewen@linux:~$
ewen@linux:~$ sudo update-grub
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-4.9.0-4-686-pae
Found initrd image: /boot/initrd.img-4.9.0-4-686-pae
Found linux image: /boot/vmlinuz-4.9.0-3-686-pae
Found initrd image: /boot/initrd.img-4.9.0-3-686-pae
Found linux image: /boot/vmlinuz-3.16.0-0.bpo.4-686-pae
Found initrd image: /boot/initrd.img-3.16.0-0.bpo.4-686-pae
Found memtest86 image: /memtest86.bin
Found memtest86+ image: /memtest86+.bin
Found memtest86+ multiboot image: /memtest86+_multiboot.bin
done
ewen@linux:~$
and then rebooted the system to make sure it could boot cleanly by itself. Fortunately it rebooted automatically without any issues!
After reboot I checked for reports of misalignment:
ewen@linux:~$ uptime
11:53:28 up 4 min, 1 user, load average: 0.05, 0.42, 0.25
ewen@linux:~$ sudo dmesg -T | grep -i misaligned
ewen@linux:~$ sudo dmesg -T | grep alignment
ewen@linux:~$ sudo dmesg -T | grep inconsistency
ewen@linux:~$
and was pleased to find that none were reported. I also checked all
the alignment_offset
values, and was pleased to see all of those
were now "0" -- ie "naturally aligned" (in this case to the 4KiB
physical sector boundaries):
ewen@linux:~$ cat /sys/block/sda/sda*/alignment_offset
0
0
0
0
0
0
0
0
0
0
0
0
0
ewen@linux:~$ cat /sys/block/sdb/sdb*/alignment_offset
0
0
0
0
0
0
0
0
0
0
0
0
0
ewen@linux:~$ cat /sys/block/md*/alignment_offset
0
0
0
0
0
0
0
0
0
0
0
0
ewen@linux:~$ cat /sys/block/dm*/alignment_offset
0
0
0
0
0
0
0
ewen@linux:~$
It is too soon to tell if this has any actual practical benefits in
performance due to improving the alignment. But not being reminded that
I "did it wrong" several years ago when putting the disks in -- due to
the fdisk
partition defaults at the time being wrong for 4 KiB physical
sector disks -- seems worth the effort anyway.