I’ve recently begun playing with ceph’s rbd pool as a way to provide network block devices for libvirt guests managed through Pacemaker, having had success with drbd and iscsi. This post should be considered notes of my ongoing experiments, and not a hard-and-fast ‘howto’ for this concept. Nevertheless, it might be useful to someone!

Concepts

We’re using ceph’s rbd (RADOS block device) pool to provide the storage for the VMs. If you’ve used iscsi before, this is a similar concept, but with a replicated, distributed backend for the data.

We’re using pacemaker with the VirtualDomain RA to manage libvirt (kvm) instances.

Hardware

You’ll need a minimum of 2 nodes to run the osd daemons (the object store - i.e. the data) and 3 or more nodes (always an odd number, to maintain quorum) for the mon daemons (ceph pool monitoring). The ceph documentation gives suggested hardware requirements for these.

You’ll also need 2 (or more) nodes for VMs to allow live migration. CPU and memory are important here.

I started with 4 nodes - 2 disk servers and 2 vm servers. 3 (or more) of the nodes run pacemaker (to allow quorum), and one (or more) vm server hosts the extra mon daemon. This was mostly due to the hardware I had to hand falling into two categories of ‘fast disk’ and ‘fast cpu/mem’ - but the pool can expand later as needed.

Ceph Configuration

Follow the ceph documentation on setting up a pool - using mkcephfs (or ceph-deploy) to get things going.

Make sure that the path you point your osds at is the mountpoint of an xfs or btrfs filesystem.
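For illustration, an osd backed by a dedicated xfs partition might be mounted via /etc/fstab like so (the device name and mount options here are assumptions, not taken from the cluster above):

```
# /etc/fstab - hypothetical entry for osd.0's data disk
/dev/sdb1  /srv/ceph/osd-0  xfs  noatime,inode64  0 0
```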

I use the following /etc/ceph/ceph.conf:

  
[global]  
auth supported = none # see also cephx  
  
[mon]  
mon data = /srv/ceph/mon-$id  
[mon.0]  
host = disksrv1  
mon addr = 192.168.1.1:6789  
[mon.1]  
host = disksrv2  
mon addr = 192.168.1.2:6789  
[mon.2]  
host = vmsrv1  
mon addr = 192.168.1.3:6789  
  
[osd]  
osd data = /srv/ceph/osd-$id  
osd journal = /srv/ceph/osd-$id/journal # In production, use an SSD or at least its own partition  
osd journal size = 2000 # Only needed if using a file for the journal (2GB is a good start for GbE)  
  
[osd.0]  
host = disksrv1  
[osd.1]  
host = disksrv2  

When you start ceph, run ‘ceph status’ on one of the mons and wait for it to settle - look out for HEALTH_OK, and check that the appropriate number of mons and osds are running.
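The wait for HEALTH_OK can be scripted - here’s a rough sketch, where is_healthy is a hypothetical helper that just inspects the string ‘ceph health’ prints:

```shell
#!/bin/sh
# Hypothetical helper: succeeds when a 'ceph health' string reports HEALTH_OK.
is_healthy() {
    case "$1" in
        HEALTH_OK*) return 0 ;;
        *)          return 1 ;;
    esac
}

# On a mon node you would poll until the cluster settles:
#   until is_healthy "$(ceph health)"; do sleep 5; done
```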

Guest Disk Creation

You now need to create a disk in the pool for the guest to use. This is done from any of the mon nodes with the following command - replacing [megabytes] with the disk size, and [poolname] and [guestname] with the names of your rbd pool and guest vm. If --pool is omitted, it will default to the ‘rbd’ pool:

  
rbd create [guestname] --size [megabytes] --pool [poolname]  

Note that rbd images are thinly provisioned - that is, no space is used until data is written to the image, and the size is only an upper limit. You can later change the size of a disk with:

  
rbd resize [guestname] --size [megabytes] --pool [poolname]  

Full documentation for the rbd commands is available on the Ceph wiki.

By default, ceph pools have replication set to 2 - i.e. 2 copies of all data. If you are paranoid, you can increase this number, but be aware that this requires a corresponding increase in the number of osds, and will also incur a performance hit.

Libvirt configuration

RBD support in qemu has been around for a while - definitely in the 0.15.1 releases. Some distributions don’t compile with it enabled - in which case you need to compile qemu yourself with the --enable-rbd configure option.

To see if your version has rbd support, try the following command (after having created the rbd image above):

  
qemu-img info -f rbd rbd:[poolname]/[guestname]  

You should see something like:

  
image: rbd:rbd/guest1  
file format: rbd  
virtual size: 1.0G (1073741824 bytes)  
disk size: unavailable  
cluster_size: 4194304  
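Those numbers are worth a quick sanity check - the virtual size is exactly 1 GiB, and cluster_size is rbd’s default 4 MiB object size:

```shell
# Check the byte counts qemu-img reports against powers of two:
echo $((1024 * 1024 * 1024))   # 1073741824 - the 'virtual size'
echo $((4 * 1024 * 1024))      # 4194304    - the 'cluster_size'
```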

You need to ensure that your various libvirt daemons can communicate for migration. You can use TLS (recommended, using certificates for auth), ssh, or tcp with no auth (good for testing, but insecure). See the libvirt Remote documentation for information. Note that many distros also need you to tweak the libvirtd startup options to include --listen, in /etc/sysconfig/libvirtd or /etc/default/libvirtd.
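On Red Hat-style distros, for example, that looks something like this (the exact variable name varies by distro, so treat this as a sketch):

```
# /etc/sysconfig/libvirtd (Debian-style distros use /etc/default/ instead)
LIBVIRTD_ARGS="--listen"
```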

Once communication is established, you need to create a libvirt XML configuration for each guest, and deploy it to all the vm servers. The important addition is a disk of type ‘network’ with source protocol ‘rbd’, and the name set to [poolname]/[guestname] from the rbd commands above.
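Using the pool and guest names from the rbd examples above, and the mon addresses from the sample ceph.conf, the disk element might look like this (a sketch - adjust the target device and bus to taste):

```xml
<disk type='network' device='disk'>
  <driver name='qemu' type='raw'/>
  <source protocol='rbd' name='rbd/guest1'>
    <host name='192.168.1.1' port='6789'/>
    <host name='192.168.1.2' port='6789'/>
    <host name='192.168.1.3' port='6789'/>
  </source>
  <target dev='vda' bus='virtio'/>
</disk>
```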

Pacemaker

I am assuming here that you are familiar with the workings of pacemaker. At present, this guide only covers using pacemaker to manage libvirt - though there are also resource agents, written by the excellent people at Hastexo, for monitoring ceph’s init daemons and for ‘mounting’ rbd images. You should also ensure that your OS automatically starts libvirtd, but does not automatically start any guests (e.g. via the libvirt-guests init.d script or libvirt’s autostart).

Be aware that if you are running pacemaker to manage both VirtualDomain guests AND ceph, you may need to put location rules in place to prevent ceph running on hosts with ‘VM hardware’ and libvirt guests running on hosts with ‘disk hardware’.
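For example, with the node names used earlier, a pair of crm location constraints keeping guest1 off the disk servers might look like this (a sketch, not tested against this cluster):

```
location guest1-not-on-disksrv1 guest1 -inf: disksrv1
location guest1-not-on-disksrv2 guest1 -inf: disksrv2
```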

The relevant crm configuration snippet for each guest will look something like this:

  
primitive guest1 ocf:heartbeat:VirtualDomain \  
params config=/etc/libvirtcfg/guest1.xml \  
hypervisor="qemu:///system" migration_transport="tls" meta allow-migrate="true" \  
op start timeout="300" op stop timeout="300" \  
op monitor depth="0" timeout="30" interval="10" \  
op migrate_from timeout="300" \  
op migrate_to timeout="300"  

If you are used to running with DRBD and iscsi, this might seem quite short - however, since libvirt handles all of the rbd access, much of the complexity disappears.


Matthew Richardson