Replacing a Disk in RAID

Let’s say the server has two disks, /dev/sda and /dev/sdb, combined into a software RAID1 array managed by mdadm.

One of the disks, for example /dev/sdb, has failed and must be replaced.

Please note that before replacing a disk, it is advisable to remove it from the array.

Removing a Disk From the Array

View the array state by running the following:

cat /proc/mdstat 

Personalities : [raid1] 
md1 : active raid1 sda3[0] sdb3[1]
      975628288 blocks super 1.2 [2/2] [UU]
      bitmap: 3/8 pages [12KB], 65536KB chunk

md0 : active raid1 sda2[2] sdb2[1]
      999872 blocks super 1.2 [2/2] [UU]
      
unused devices: <none>

In this case, two arrays are assembled: md0 consists of sda2 and sdb2, and md1 consists of sda3 and sdb3.

On this server, md0 holds /boot, while md1 holds the LVM volumes for swap and root.

lsblk
NAME             MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
loop0              7:0    0   985M  1 loop  
sda                8:0    0 931.5G  0 disk  
├─sda1             8:1    0     1M  0 part  
├─sda2             8:2    0   977M  0 part  
│ └─md0            9:0    0 976.4M  0 raid1 
└─sda3             8:3    0 930.6G  0 part  
  └─md1            9:1    0 930.4G  0 raid1 
    ├─vg0-swap_1 253:0    0   4.8G  0 lvm   
    └─vg0-root   253:1    0 925.7G  0 lvm   /
sdb                8:16   0 931.5G  0 disk  
├─sdb1             8:17   0     1M  0 part  
├─sdb2             8:18   0   977M  0 part  
│ └─md0            9:0    0 976.4M  0 raid1 
└─sdb3             8:19   0 930.6G  0 part  
  └─md1            9:1    0 930.4G  0 raid1 
    ├─vg0-swap_1 253:0    0   4.8G  0 lvm   
    └─vg0-root   253:1    0 925.7G  0 lvm   /
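
To inspect an individual array in more detail, including each member's state, you can also use mdadm itself:

mdadm --detail /dev/md1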

Remove the sdb partitions from both arrays:

mdadm /dev/md0 --remove /dev/sdb2
mdadm /dev/md1 --remove /dev/sdb3

If mdadm has not marked the partitions as failed, it still considers them active and in use, and the remove commands above fail with an error saying the device is busy.

In this case, mark the partitions as failed before removing them:

mdadm /dev/md0 -f /dev/sdb2
mdadm /dev/md1 -f /dev/sdb3

Then run the removal commands again.
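
Both steps can also be combined into a single command per array:

mdadm /dev/md0 --fail /dev/sdb2 --remove /dev/sdb2
mdadm /dev/md1 --fail /dev/sdb3 --remove /dev/sdb3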

After removing the failed disk from the array, request a replacement by creating a ticket that specifies the serial number (s/n) of the failed disk. Whether the swap can be done without downtime depends on the server configuration.
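
If you need to look up the serial number, smartctl from the smartmontools package (assuming it is installed) prints it together with the rest of the disk identity; lsblk can show it as well:

smartctl -i /dev/sdb | grep -i serial
lsblk -o NAME,SERIAL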

Determining the Partition Table Type (GPT or MBR) and Copying It to the New Disk

After the failed disk has been replaced, the new disk must be added to the array. To do this, first determine the partition table type, GPT or MBR, using gdisk.

Install gdisk:

apt-get install gdisk -y

Run the following:

gdisk -l /dev/sda

Where /dev/sda is a healthy disk in the RAID.

The output looks as follows for MBR:

Partition table scan:
MBR: MBR only
BSD: not present
APM: not present
GPT: not present

And something like this for GPT:

Partition table scan:
MBR: protective
BSD: not present
APM: not present
GPT: present
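
As a quick cross-check, recent versions of lsblk (util-linux) can report the partition table type directly:

lsblk -o NAME,PTTYPE /dev/sda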

Before adding the disk to the array, create the same partitions on it as on sda. The procedure depends on the partition table type.

Copying the Partition Layout for GPT

To copy the partition layout for GPT:

sgdisk -R /dev/sdb /dev/sda

Please note that the disk the layout is copied to comes first, and the disk the layout is copied from comes second (that is, the layout is copied from sda to sdb). If you swap them, the layout on the remaining healthy disk will be destroyed.

An alternative way to copy the partition layout:

sgdisk --backup=table /dev/sda
sgdisk --load-backup=table /dev/sdb

After copying, assign new random GUIDs to the disk and its partitions:

sgdisk -G /dev/sdb
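
To verify the result, print both partition tables and compare them:

sgdisk -p /dev/sda
sgdisk -p /dev/sdb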

Copying the Partition Layout for MBR

To copy the partition layout for MBR:

sfdisk -d /dev/sda | sfdisk /dev/sdb

Please note that here the order is reversed: the disk the layout is copied from comes first, and the disk it is copied to comes second.

If the new partitions do not show up in the system, re-read the partition table by running the following:

sfdisk -R /dev/sdb
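
On newer versions of util-linux, sfdisk no longer has the -R option; in that case the partition table can be re-read with partprobe (from the parted package) or blockdev:

partprobe /dev/sdb
blockdev --rereadpt /dev/sdb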

Adding a Disk to the Array

Once the partitions on /dev/sdb have been created, you can add the disk to the arrays:

mdadm /dev/md0 -a /dev/sdb2
mdadm /dev/md1 -a /dev/sdb3

After adding the disk to the array, synchronization starts. How long it takes depends on the disk size and type (SSD or HDD):

cat /proc/mdstat 
Personalities : [raid1] 
md1 : active raid1 sda3[1] sdb3[0]
      975628288 blocks super 1.2 [2/1] [U_]
      [============>........]  recovery = 64.7% (632091968/975628288) finish=41.1min speed=139092K/sec
      bitmap: 3/8 pages [12KB], 65536KB chunk

md0 : active raid1 sda2[2] sdb2[1]
      999872 blocks super 1.2 [2/2] [UU]
      
unused devices: <none>
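
To watch the progress update continuously:

watch -n 5 cat /proc/mdstat

If the rebuild is being throttled, you can raise the kernel's minimum resync speed (the value is in KiB/s; this is an optional tuning step):

echo 100000 > /proc/sys/dev/raid/speed_limit_min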

Installing a Boot Loader

After adding the disk to the array, you need to install a boot loader on it.

If the server is booted into normal mode, or you are already chrooted into the installed system, this can be done by running the following:

grub-install /dev/sdb
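
On Debian-based systems such as the one in this guide, it is also worth regenerating the GRUB configuration afterwards:

update-grub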

If the server is booted into Recovery or Rescue mode, i.e. from a live CD, the boot loader installation looks like this:

  1. Mount the root file system to /mnt. On this server, root is the LVM volume vg0-root on top of md1, so activate the LVM volumes first:
vgchange -ay
mount /dev/mapper/vg0-root /mnt
  2. Mount boot:
mount /dev/md0 /mnt/boot
  3. Mount /dev, /proc and /sys:
mount --bind /dev /mnt/dev
mount --bind /proc /mnt/proc
mount --bind /sys  /mnt/sys
  4. chroot into the mounted file system:
chroot /mnt
  5. Install grub on sdb:
grub-install /dev/sdb
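  6. Leave the chroot and unmount the file systems before rebooting:
exit
umount /mnt/sys /mnt/proc /mnt/dev /mnt/boot /mnt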

Now you can try to boot into normal mode.

Replacing a Failed Disk

You can manually mark a disk in the array as failed using --fail (-f):

mdadm /dev/md0 --fail /dev/sda1

or

mdadm /dev/md0 -f /dev/sda1

You can then remove it from the array using --remove (-r):

mdadm /dev/md0 --remove /dev/sda1

or

mdadm /dev/md0 -r /dev/sda1

You can add a new disk to the array using --add (-a) or --re-add:

mdadm /dev/md0 --add /dev/sda1

or

mdadm /dev/md0 -a /dev/sda1
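
--re-add is intended for a member that was previously removed from the array: if the array has a write-intent bitmap (as in the examples above), only the blocks that changed while the device was absent are resynchronized:

mdadm /dev/md0 --re-add /dev/sda1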

Error while Restoring the Boot Loader after Replacing the Disk in RAID1

If the following error appears while installing grub:

root # grub-install --root-directory=/boot /dev/sda
Could not find device for /boot/boot: not found or not a block device

Run the following:

root # grep -v rootfs /proc/mounts > /etc/mtab
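
After regenerating /etc/mtab, retry the installation:

grub-install /dev/sda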