Red Hat Cluster Sucks... errr, Suite

06 Apr '06 - 21:19 by benr

In the continuing Zimbra saga, not only did I have to deploy RHEL, but I had to deploy Red Hat Cluster Suite. RH Cluster is a GPL suite that has largely been developed by Red Hat, with code dating back to 2001. You can download the source here. Alternatively, if you don't want to pony up the $500 for it (on top of the $800 for RHEL ES Standard), you can use CentOS and their build of both GFS and Cluster Suite. Note that the Red Hat Global Filesystem (GFS, formerly property of Sistina Software, bought by Red Hat in 2003) is not an included part of Cluster Suite and will run you an additional $2,200 per node... yes, I said PER node. So on a cost basis that means a two-node cluster runs you $2,600 without GFS and $7,000 with GFS. Must be good to command that price, huh? Think again.

Before going any further, I'll point out that also available in the HA cluster space on Linux are Veritas Cluster Server (VCS) and HP Serviceguard.

So after you've bought all this stuff, you first need to install RHEL. When the install completes it should ask for all your Red Hat support information and register your system with the Red Hat Network. Post-install, go to the Red Hat Network site and look at your entitlements, where you'll see the system(s) you've installed. On that page you should set the Base Subscription if it's not already set (set it to RHEL 4 ES, or whatever is right for your contract), and then add an additional channel to it: Cluster Suite. Now, back on the RHEL box, you can use up2date to install Cluster Suite:

    Updating RHEL and Installing Cluster Suite
  1. up2date -f -l: This will list the available updates for this system (-f ensures that it updates the kernel as well.)
  2. up2date -f -u: Perform the update itself. This can take an hour or two. Reboot when complete.
  3. up2date -f --installall --channel rhel-i386-es-4-cluster: This installs the Cluster Suite itself and any dependencies.
  4. Done, poke around in /usr/sbin to see the cluster utilities, such as clustat and clusvcadm.
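
Put together, the whole sequence looks roughly like this (the channel label is the one from step 3; yours may differ depending on architecture and contract):

# List available updates; -f includes kernel packages
up2date -f -l
# Apply the updates, then reboot onto the new kernel
up2date -f -u
reboot
# Pull in Cluster Suite and its dependencies from the cluster channel
up2date -f --installall --channel rhel-i386-es-4-cluster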

Once RH Cluster is installed, you'll find the tools in /usr/sbin. Here's the breakdown of available tools:

    RH Cluster Administration Tools
  • /usr/sbin/clustat: Display the current status of the cluster. (Sun Cluster equiv: scstat)
  • /usr/sbin/clusvcadm: Change the state of the services, such as enabling, disabling, relocating, etc. (Sun Cluster equiv: scswitch)
  • /usr/sbin/system-config-cluster: A Cluster Configuration GUI. It simplifies the creation of a cluster.conf as well as acting as a GUI management interface.
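
To give a feel for the two command-line tools, here's roughly what day-to-day invocations look like (service and node names match the clustat output shown further down; check the man pages, as flags can vary by release):

# Show cluster membership and service status
clustat
# Relocate a service to a specific node
clusvcadm -r webmail1.XX -m zimbra5.XX
# Disable a service, then re-enable it on a particular node
clusvcadm -d webmail1.XX
clusvcadm -e webmail1.XX -m zimbra4.XX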

There are two things to note if you're new to Red Hat Cluster. First, you need to use a Fence Device (you can go without one, but that's highly frowned upon and unsupported by many vendors). Second, you do not require shared storage. The device typically used as a Fence Device is an APC MasterSwitch: in the event that a node is unresponsive, a surviving node can (don't laugh) power cycle its partner. This method is also apparently used in some failover situations to ensure that the node wasn't doing anything it shouldn't be doing prior to failover. Most other clusters need a quorum device, but RH Cluster does not (new in version 4, apparently), which means you don't require shared storage for cluster operation at all unless your services actually need to store something on it.
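
For a sense of what fencing actually does under the hood, the APC agent boils down to telling the MasterSwitch to yank power on a given outlet. Something like the following sketch, from memory: the agent is fence_apc, but the flags may differ by version, and the IP, login, and outlet number here are made up.

# Power-cycle outlet 3 on the MasterSwitch, i.e. "shoot" the unresponsive node
fence_apc -a 10.10.0.10 -l apc -p secret -n 3 -o reboot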

The cluster configuration is stored in a single XML file: /etc/cluster/cluster.conf. You can configure a new cluster by creating the cluster.conf by hand, reusing a pre-existing one, or using the /usr/sbin/system-config-cluster GUI tool. Using the GUI is, of course, the supported method. (A rough sketch of a cluster.conf follows the list of components below.)

Cluster configuration consists of the following components:

  • Cluster Nodes: Nodes that are members of the cluster; also specified here are the number of votes each node has and which fence device port controls it.
  • Fence Devices: One or more fence devices, such as an APC MasterSwitch, including the IP address, username, and password used to log in to and control the device.
  • Failover Domains: A logical grouping of nodes which can fail over to each other.
  • Shared Resources: A resource used by a cluster service, such as a GFS or other shared filesystem, an IP address, an NFS mount, a script, or a Samba service.
  • Services: An HA service provided by the cluster, which ties together shared resources within a failover domain, using one or more nodes and their associated fence devices.
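
To make these components concrete, here is a rough, hand-trimmed sketch of a two-node cluster.conf that ties them together. The cluster, node, fence device, and service names are hypothetical, and attribute names may vary slightly between Cluster Suite releases, so treat it as illustrative only:

<?xml version="1.0"?>
<cluster name="zimbra" config_version="1">
  <clusternodes>
    <clusternode name="zimbra4.XX" votes="1">
      <fence>
        <method name="1">
          <device name="apc1" port="1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="zimbra5.XX" votes="1">
      <fence>
        <method name="1">
          <device name="apc1" port="2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="apc1" agent="fence_apc" ipaddr="10.10.0.10" login="apc" passwd="secret"/>
  </fencedevices>
  <rm>
    <failoverdomains>
      <failoverdomain name="webmail-domain" ordered="1" restricted="1">
        <failoverdomainnode name="zimbra4.XX" priority="1"/>
        <failoverdomainnode name="zimbra5.XX" priority="2"/>
      </failoverdomain>
    </failoverdomains>
    <resources>
      <ip address="10.10.0.147" monitor_link="1"/>
      <script name="webmail-init" file="/etc/init.d/webmail"/>
    </resources>
    <service name="webmail1.XX" domain="webmail-domain" autostart="1">
      <ip ref="10.10.0.147"/>
      <script ref="webmail-init"/>
    </service>
  </rm>
</cluster>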

Perhaps the most important of these is the "Script" shared resource. This script is a standard RC script (such as those in /etc/init.d) that accepts at least three arguments: start, stop, and status (or monitor). When a cluster service is started, the appropriate node is selected and the shared resources are given to it, such as mounting a shared filesystem and assuming a shared IP address. The node then runs the script with the start argument to bring the service up. From then on, every 30 seconds, it runs the script with the status argument to verify that the service is indeed still online. In the event of a graceful failover, the script is run with the stop argument to shut the service down before all the resources are moved to the new node and the service is started there.
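
To illustrate, a minimal script resource could look something like this (the service name and paths are hypothetical; what matters is that the status exit code is honest, since rgmanager uses it to decide whether the service is still healthy):

#!/bin/bash
# Minimal RC-style cluster script: start, stop, status
case "$1" in
  start)
    /opt/myapp/bin/myapp --daemon      # hypothetical daemon start
    ;;
  stop)
    /opt/myapp/bin/myapp --shutdown    # hypothetical clean shutdown
    ;;
  status)
    # A non-zero exit here tells the cluster the service has failed
    pgrep -f /opt/myapp/bin/myapp > /dev/null
    ;;
  *)
    echo "Usage: $0 {start|stop|status}"
    exit 1
    ;;
esac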

The whole setup is pretty flimsy in comparison to other HA suites such as IBM's HACMP and Sun's Sun Cluster. It's akin to tying dental floss between two nodes. Using a network PDU is like holding a gun to the head of each node: answer me or else. You'll notice that there are no explicit interconnects.

[root@zimbra4 cluster]# clustat
Member Status: Quorate

  Member Name                              Status
  ------ ----                              ------
  zimbra4.XX                     Online, Local, rgmanager
  zimbra5.XX                     Online, rgmanager
  zimbra6.XX                     Online, rgmanager

  Service Name         Owner (Last)                   State
  ------- ----         ----- ------                   -----
  webmail1.XX          zimbra4.XX                     started
  webmail2.XX          zimbra5.XX                     started

Although it might be flimsy, it does work well in some situations. Because you don't need explicit interconnects or a shared quorum device, very little pre-planning is needed for a simple cluster setup, so long as you've got a MasterSwitch handy. If you, for instance, wanted to set up an HA Apache service, you'd just use the /etc/init.d/httpd script, add a shared IP, share your htdocs/ on, say, an NFS mount point which is set up as a shared resource, edit your httpd.conf to point at the right htdocs/ directory, and you're basically done. Of course, when doing this, make sure you don't allow Apache to start up on boot by itself (chkconfig httpd off).
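
As a rough sketch, the node-side preparation for that Apache example boils down to something like this (the DocumentRoot path is made up; the shared IP, the NFS mount, and the /etc/init.d/httpd script themselves get defined as cluster resources, not here):

# Keep init from starting Apache on boot; the cluster starts it instead
chkconfig httpd off
# In httpd.conf, point DocumentRoot at the shared htdocs, e.g.:
#   DocumentRoot "/mnt/shared/htdocs"
# Sanity check by hand before handing the service over to the cluster
service httpd start && service httpd status && service httpd stop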

So for small services it might work well, but would I run Oracle or DB2 on it? Not a chance in hell. Here are my gripes:

  1. Shared IPs don't show up as IP aliases in ifconfig. This has got to be a bug. If a shared IP is present, I should see its address in ifconfig as eth0:1 or something, but you don't. This makes checking the current location of the address difficult (i.e., telnet/ssh to it and see where you end up). This seems to be due to the fact that RH Cluster doesn't tie shared IPs to specific interfaces, which is problematic in and of itself, imho. Either way, it would still be nice if it showed up as something like "clu1".
  2. Shared IP address "drift". I have run into numerous problems with the shared IP just drifting to its failover partner. The shared storage doesn't move and the service itself doesn't move, just the IP, which means the service is effectively down, although the cluster is totally unaware of the problem (as checked with clustat). To resolve the issue I've got to disable the service completely and then restart it on the appropriate node (i.e., clusvcadm -d mysvc followed by clusvcadm -e mysvc -m node1; see the sketch after this list).
  3. Unexpected shutdown of a service. Things are humming along fine and then I get a call from QA: the service is down. If it wasn't IP drift it would be an unexpected failover or shutdown of the service. clustat may or may not know what's going on in these cases, and in the case of a failover it often reported that the service was still running on the previous (pre-failover) node when in fact it was not.

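For what it's worth, here is roughly how I check for drift and put things back (names follow the clustat output above; substitute your own shared address, and see the update below for the ip command):

# What does the cluster think is going on?
clustat
# Is the shared address actually plumbed on this node? (run on each node)
ip addr list | grep "10.10.0.147"
# Disable the service, then re-enable it on the node it belongs on
clusvcadm -d webmail1.XX
clusvcadm -e webmail1.XX -m zimbra4.XX
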
I just can't find anything to like about Red Hat Cluster Suite. If I wanted a lightweight cluster solution I'd opt for something that's tried, true, and enterprise grade, such as Veritas Cluster Server. If you want a totally integrated and comprehensive clustering solution, Sun Cluster is the way to go, hands down, but that requires Solaris, and thus doesn't really apply here.

I'm aware that some of the issues listed above may be unresolved bugs, some may be monitoring issues, etc. But this is supposed to be an enterprise-ready suite that I paid a lot of money for, and it just doesn't act like one. Some of these issues are possibly due to Zimbra's monitoring scripts, but regardless, I'm bothered that RH Cluster doesn't have a way to deal with these situations like a true solution (say, Sun Cluster or HACMP) does. Couple this with the fact that the documentation is some of the worst I've ever seen. Flip through the docs here.

UPDATE: I've been digging around the source for RH Cluster this afternoon. Apparently, although ifconfig won't show you the shared IP, ip (yes, that's a command) will. Example:

[root@zimbra4 cluster]# ifconfig -a
eth0      Link encap:Ethernet  HWaddr 00:50:8B:D3:8D:51
          inet addr:10.10.0.144  Bcast:10.10.3.255  Mask:255.255.252.0
          inet6 addr: fe80::250:8bff:fed3:8d51/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:25040869 errors:0 dropped:0 overruns:0 frame:0
          TX packets:18583752 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1373798465 (1.2 GiB)  TX bytes:893112790 (851.7 MiB)

eth1      Link encap:Ethernet  HWaddr 00:50:8B:D3:8D:5B
          BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:2757771 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2757771 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:296459762 (282.7 MiB)  TX bytes:296459762 (282.7 MiB)

sit0      Link encap:IPv6-in-IPv4
          NOARP  MTU:1480  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

[root@zimbra4 cluster]# ip addr list
1: lo: <LOOPBACK,UP> mtu 16436 qdisc noqueue
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:50:8b:d3:8d:51 brd ff:ff:ff:ff:ff:ff
    inet 10.10.0.144/22 brd 10.10.3.255 scope global eth0
    inet 10.10.0.147/32 scope global eth0   <--- That's the Shared IP

    inet6 fe80::250:8bff:fed3:8d51/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop qlen 1000
    link/ether 00:50:8b:d3:8d:5b brd ff:ff:ff:ff:ff:ff
4: sit0: <NOARP> mtu 1480 qdisc noop
    link/sit 0.0.0.0 brd 0.0.0.0

As for the drifting IP address problem... I started to wonder if it might be because of the way Red Hat Cluster monitors the interface. If it were doing a ping test, that would explain what I've been seeing, because the address would in fact be online; it just isn't on the right system. Looking at rgmanager/src/resources/ip.sh, it appears that this is exactly the problem. Why it's drifting in the first place, I can't say, but clearly Red Hat Cluster's method of monitoring the links is open to some serious issues.
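
To illustrate why that matters, here is a rough sketch of the two styles of check (not the actual ip.sh code, just the shape of the problem): a ping succeeds as long as the address answers from anywhere on the network, while a local check only passes if this particular node owns the address.

SHARED_IP=10.10.0.147    # the shared address from the output above

# Ping-style check: passes even if the address has drifted to the partner node
ping -c 1 -w 2 $SHARED_IP > /dev/null && echo "ping: address answers (somewhere)"

# Local check: passes only if the address is plumbed on *this* node
ip addr list | grep -q " $SHARED_IP/" && echo "local: this node owns the address"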


- - C O M M E N T S - -

greetings !

Matthew - 13 June '06 - 21:09
