Published by Pascal on 28. Aug 2017 12:40:22
Updated by Pascal on 14. Nov 2017 15:05:43
------------------------------------------------------------------------

Here at Encodo, we host our services on our own infrastructure which, after 12
years, has grown quite large. This article, though, is about our migration away
from VMware to Proxmox.

So, here's how we proceeded:

We set up a test environment as close as possible to the new one before buying
the new server, so we could test everything. This was also our first contact
with software RAID and its monitoring capabilities.

[Install the Hypervisor]

Installation time, here it goes:

  * Install the latest Proxmox [1]: this is very straightforward, so I won't go
    into that part.
  * After the installation is done, log in via ssh and check the syslog for
    errors (we had some NTP issues, so I fixed that before doing anything else).
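
A quick way to do that post-install check, assuming a systemd-based install
with systemd-timesyncd (which is what recent Proxmox versions ship; your NTP
setup may differ):

# Scan the journal of the current boot for warnings and errors
journalctl -b -p warning

# Check whether the clock is actually synchronized (our issue was NTP)
timedatectl status
systemctl status systemd-timesyncd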

[Check Disks]

We have our 3 disks for the RAID 5. We do not have a lot of files to store, so
we use 1TB disks, which should still be OK (see "Why RAID 5 stops working in
2009" as to why you shouldn't do RAID 5 anymore).

We set up Proxmox on a 256GB SSD. Our production server will have 4x 1TB SSDs,
one of which is a spare. Note down the serial numbers of all your disks. I don't
care how you do it (take pictures or whatever), but if you ever need to know
which slot contains which disk, or whether the failing disk is really in the
slot you think it is, solid documentation helps a ton.

You should check your disks for errors beforehand! Do a full smartctl check and
find out which disk is which. This is key: we even took pictures prior to
inserting the disks into the server (and put them in our wiki) so we have the
serial number available for each slot.

See which disk is which:

for x in {a..e}; do smartctl -a /dev/sd$x | grep 'Serial' | xargs echo "/dev/sd$x: "; done

Start a long test for each disk:

for x in {a..e}; do smartctl -t long /dev/sd$x; done

See "SMART tests with smartctl"
 for more
detailed information.

[Disk Layout & Building the RAID]

We'll assume the following hard disk layout:


/dev/sda = System Disk (Proxmox installation)
/dev/sdb = RAID 5, 1
/dev/sdc = RAID 5, 2
/dev/sdd = RAID 5, 3
/dev/sde = RAID 5 Spare disk
/dev/sdf = RAID 1, 1
/dev/sdg = RAID 1, 2
/dev/sdh = Temporary disk for migration

When the SMART long test is done (it usually takes a few hours), you can verify
the result with

smartctl -a /dev/sdX
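
If you'd rather not check each disk individually, the self-test results can
also be read in a loop (same /dev/sda..sde naming as above):

# Print the self-test log for every disk
for x in {a..e}; do echo "=== /dev/sd$x ==="; smartctl -l selftest /dev/sd$x; done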

Now that we know our disks are OK, we can proceed with creating the software
RAID. Make sure you pick the correct disks:

mdadm --create --verbose /dev/md0 --level=5 --raid-devices=3 /dev/sdb /dev/sdc /dev/sdd

The RAID 5 starts building immediately, but you can also start using it right
away. Since I had other things on my hands, I waited for it to finish.
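
To keep an eye on the build progress, the kernel's md status file is enough (no
assumptions here beyond the /dev/md0 name used above):

# Shows sync progress, speed and estimated time remaining
cat /proc/mdstat

# Or watch it until the initial sync is done
watch -n 10 cat /proc/mdstat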

Add the spare disk (if you have one) and persist the array configuration to
/etc/mdadm/mdadm.conf:

mdadm --add /dev/md0 /dev/sde
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
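
It's worth verifying the result; something like this works (the update-initramfs
step is our own addition, it just makes sure the boot environment picks up the
new mdadm.conf):

# The array should now show 3 active devices plus 1 spare
mdadm --detail /dev/md0

# Confirm the ARRAY line really made it into the config
grep ARRAY /etc/mdadm/mdadm.conf

# Rebuild the initramfs so the array is assembled with this config at boot
update-initramfs -u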

[Configure Monitoring]

Edit the email address in /etc/mdadm/mdadm.conf to a valid email address within
your network and test it with:

mdadm --monitor --scan --test -1
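
For reference, the relevant line in /etc/mdadm/mdadm.conf looks roughly like
this (the address is a placeholder):

# /etc/mdadm/mdadm.conf
MAILADDR valid@domain.com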

Once you know that your monitoring mails come through, add active monitoring for
the RAID device:

mdadm --monitor --daemonise --mail=valid@domain.com --delay=1800 /dev/md0

To finish up monitoring, it's important to read mismatch_cnt from
/sys/block/md0/md/mismatch_cnt periodically to make sure the hardware is OK. We
use our very old Nagios installation for this and got a working script for the
check from "Mdadm checkarray" by Thomas Krenn.

[Creating and Mounting Volumes]

Back to building! We now need to make the newly created storage available to
Proxmox. To do this, we create a PV, a VG, a thin pool for the VM disks and a
separate migration volume. The thin pool gets 90% of the storage; the remaining
10% is enough for us to migrate 2 VMs at a time and gets formatted with XFS:


# Create the physical volume and the volume group on the RAID device
pvcreate /dev/md0
vgcreate raid5vg /dev/md0
# Thin pool "raid5lv" for the VM disks (90% of the space)
lvcreate -l 90%FREE -T raid5vg/raid5lv
# Regular LV for the migration scratch space (the remaining 10%)
lvcreate -n migrationlv -l 100%FREE raid5vg
mkfs.xfs /dev/mapper/raid5vg-migrationlv

Mount the formatted migration logical volume (if you want it to survive a
reboot, also add it to fstab; see the sketch below):

mkdir /mnt/migration
mount /dev/mapper/raid5vg-migrationlv /mnt/migration
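
The corresponding fstab entry would look something like this (our own sketch,
not taken from the production config):

# /etc/fstab
/dev/mapper/raid5vg-migrationlv  /mnt/migration  xfs  defaults  0  0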

If you don't have the disk space to migrate the VMs like this, add an
additional disk (/dev/sdh in our case). Create a new partition on it with

fdisk /dev/sdh
n

Accept all the defaults for max size. Then format the partition with xfs and
mount it:

mkfs.xfs /dev/sdh1
mkdir /mnt/largemigration
mount /dev/sdh1 /mnt/largemigration
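
Alternatively, if you prefer to skip fdisk's interactive prompts, the same
single full-size partition can be created non-interactively (assuming parted is
available on the host):

# GPT label plus one partition spanning the whole disk
parted -s /dev/sdh mklabel gpt mkpart primary xfs 0% 100%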

Now you can go to your Proxmox installation and add the thin pool (and your
largemigration partition if you have it) under Datacenter -> Storage -> Add.
Give it an ID (I called it raid5 because I'm very creative), Volume Group:
raid5vg, Thin Pool: raid5lv.
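
The same can also be done from the shell with pvesm instead of clicking through
the UI; roughly like this (double-check the option names against your Proxmox
version):

# Thin pool on the RAID 5 (same settings as in the UI above)
pvesm add lvmthin raid5 --vgname raid5vg --thinpool raid5lv --content images,rootdir

# Directory storage for the migration mount
pvesm add dir migration --path /mnt/migration --content images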

[Extra: Upgrade Proxmox]

At this time, we'd bought our Proxmox license and did a dist upgrade from 4.4 to
5.0, which had just been released. To do that, follow the upgrade document from
the Proxmox wiki, or install 5.0 right away.
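
For reference, the upgrade boils down to a Debian Jessie-to-Stretch dist
upgrade; roughly like this (a sketch from memory; follow the official wiki
document for the authoritative steps):

# Bring the 4.4 installation fully up to date first
apt-get update && apt-get dist-upgrade

# Switch the Debian and Proxmox repositories from jessie to stretch
sed -i 's/jessie/stretch/g' /etc/apt/sources.list
sed -i 's/jessie/stretch/g' /etc/apt/sources.list.d/pve-enterprise.list

# Upgrade to Proxmox 5.0 and reboot
apt-get update && apt-get dist-upgrade
reboot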

[Migrating VMs]

Now that the storage is in place, we are all set to create our VMs and do the
migration. Here's the process we used. There are probably more elegant and
efficient ways to do this, but this way works for both our Ubuntu installations
and our Windows VMs:

   1. In ESXi: Shut down the VM to migrate
   2. Download the vmdk file from the VMware storage, or activate SSH on ESXi
      and scp the vmdk including the flat file (important!) directly to
      /mnt/migration (or /mnt/largemigration respectively).
   3. Shrink the vmdk if you actually downloaded it locally (use the non-flat
      descriptor file as input if the flat file doesn't work) [2]:
      vdiskmanager-windows.exe -r vmname.vmdk -t 0 vmname-pve.vmdk
   4. Copy the new file (vmname-pve.vmdk) to Proxmox via scp into the migration
      directory /mnt/migration (or /mnt/largemigration respectively)
   5. Ssh into your Proxmox installation and convert the disk to qcow2:
      qemu-img convert -f vmdk /mnt/migration/vmname-pve.vmdk \
          -O qcow2 /mnt/migration/vmname-pve.qcow2
   6. In the meantime you can create a new VM:
      1. In general: give it the same resources as it had in the old hypervisor
      2. Do not attach a cd/dvd
      3. Set the disk to at least the size of the vmdk image
      4. Make sure the image is in the "migration" storage
      5. Note the ID of the VM; you're going to need it in the next step
   7. Once the conversion to qcow2 is done, overwrite the existing image with
      the converted one. Make sure you get the correct ID and that the target
      .qcow2 file exists. Overwrite with no remorse:
      mv /mnt/migration/vmname-pve.qcow2 \
          /mnt/migration/images/<vmid>/vm-<vmid>-disk-1.qcow2
   8. When this is done, boot the VM and check that it comes up and runs
   9. If it does, go to Proxmox and move the disk to the RAID 5:
      1. Select the VM you just started
      2. Go to Hardware
      3. Click on Hard Disk
      4. Click on Move Disk
      5. Select the raid5 storage and check the Delete Source checkbox
      6. This will happen live

That's it. Now repeat these last steps for all the VMs, in our case around 20,
which is just barely manageable without any automation.
If you have more VMs, you could automate more of this, for example by copying
the VMs directly from ESXi to Proxmox via scp and doing the initial conversion
there (see the sketch below).
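
Purely as an illustration of what that automation could look like, here is a
rough sketch. The ESXi hostname, the datastore path and the assumption that the
target VM (with its qcow2 disk) already exists on the "migration" storage are
all ours; the VM ID still has to be supplied by hand:

#!/bin/bash
# Rough sketch: pull a VM disk straight from ESXi and convert it on Proxmox.
# Usage: ./migrate-vm.sh <esxi-host> <path-to-vmdk-on-esxi> <proxmox-vmid>
set -e

ESXI_HOST="$1"     # e.g. root@esxi.example.local (hypothetical hostname)
VMDK_PATH="$2"     # e.g. /vmfs/volumes/datastore1/vmname/vmname.vmdk
VMID="$3"          # ID of the already-created Proxmox VM
MIGRATION_DIR=/mnt/migration

# Copy the descriptor and the flat file (the latter holds the actual data)
scp "$ESXI_HOST:${VMDK_PATH%.vmdk}.vmdk" \
    "$ESXI_HOST:${VMDK_PATH%.vmdk}-flat.vmdk" \
    "$MIGRATION_DIR/"

# Convert straight onto the disk of the prepared target VM
qemu-img convert -f vmdk "$MIGRATION_DIR/$(basename "$VMDK_PATH")" \
    -O qcow2 "$MIGRATION_DIR/images/$VMID/vm-$VMID-disk-1.qcow2"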

--------------------------------------------------------------------------------


[1] We initially installed Proxmox 4.4, then upgraded to 5.0 during the
    migration.


[2] You can get the vdiskmanager from "Repairing a virtual disk in Fusion 3.1
    and Workstation 7.1 (1023856)" under "Attachments".