Multi-cluster Ganglia on EC2 or unicast networks
Ganglia is a scalable distributed monitoring system, and is excellent for keeping tabs on 10s or 1,000s of servers.
It works best on multicast-capable networks, where its army of gmonds will chat to one another with minimal configuration and the gmetad federator can ask any of them for data. That said, it’s fairly simple to set up a single-cluster implementation on a unicast network too.
However, one of Ganglia’s awesome features - multiple clusters - is considerably more complicated to set up on a unicast network. There are a few ways to do it:
- Dedicate one gmond in each cluster as the receiver, and have all the others in that cluster send to it. This is OK, but if that one server goes down, there’ll be no stats for that cluster.
- Run one receiving gmond per cluster on a dedicated reporting machine (e.g. the same one gmetad is on). Works, but annoying to configure, as you’ve got a big stack of gmonds running on one server.
- Emulate multicast support by having each gmond in each cluster send/receive to all or some of the others. A central gmetad is configured to point at one or more of these gmonds for each data_source (cluster).
None of these are as elegant as the default multicast configuration, but we’ll have to make do. I’ve opted for the third option, as I believe it strikes the best balance between reliability and ease of maintenance. To avoid having 100s of nodes spamming all the others constantly, you can designate a handful of receiver nodes in each cluster and have all the others report to those.
Note: doing this by hand will not be fun. Combine this guide with your server automation tool of choice - I’ll be using Chef, so you’ll need to translate the instructions yourself if you’re using something else.
Installing Ganglia
First you’ll want to have a basic Ganglia set up with your chosen automation tool. I’m using Heavy Water’s Ganglia cookbook as a base. Although it’ll need heavy tweaking for this use case, it’s a good place to start.
Once you have gmonds installed on each node you want to monitor, and gmetad (and, optionally, ganglia-web) installed on a designated monitoring server, we can get started with the configuration.
Cluster attributes
In Chef, you can use your roles to describe the cluster for each node. For example, I have roles like app server, mongodb server etc. and I’d like each of these to be a cluster. So, in my roles/app_server.rb file, I have the following attributes, which will be used later:
override_attributes(
  :ganglia => {
    :gmond => {
      :cluster_name => 'app servers'
    }
  }
)
You’ll probably also want to add some defaults into a base role which is applied to every monitored node:
default_attributes(
  :ganglia => {
    :gmond => {
      :cluster_name => 'generic server',
      :cluster_owner => 'Your Organisation'
    }
  }
)
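If you want to confirm the attribute is coming through as expected, knife can show it per node (the node name here is hypothetical):
knife node show app-server-1 -a ganglia.gmond.cluster_name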
gmond configuration
gmond.conf is the main configuration file for gmond. Unless a gmond is configured to be deaf, it will open up a UDP server on the specified port and allow other gmonds to send data to it. On a multicast network, these gmonds would all join the same multicast group and share information automatically, but we’ll need to specifically tell each gmond which hosts to send data to.
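For reference, the deaf flag (and its counterpart, mute) lives in the globals section of gmond.conf; the stock defaults shown below mean a node both receives and sends:
globals {
  deaf = no    # still listens for metrics from other gmonds
  mute = no    # still sends this node's own metrics
}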
You have a choice here: if your clusters are fairly small (I’d say 10 servers or fewer), you can tell each gmond in a cluster to spread data to every other gmond in the same cluster. This is simpler to configure; however, if your clusters are quite large, it would involve a lot of connections on every server, which doesn’t strike me as the best idea. Instead, you could nominate a handful of servers in each cluster to be the receivers, and have all the others send to those. As long as one of those receivers is online, all the other servers in that cluster can continue to share data.
Get the list of nodes in the same cluster
In Chef, this can be done with a search. So, in your recipe where you create gmond.conf, add something like:
this_cluster = node[:ganglia][:gmond][:cluster_name]
gmonds_in_cluster = []
search(:node, "ganglia_gmond_cluster_name:\"#{this_cluster}\" AND chef_environment:#{node.chef_environment}") do |n|
  gmonds_in_cluster << n[:ipaddress]
end
Then pass this into your gmond.conf template and do:
<% @gmonds_in_cluster.each do |host| %>
udp_send_channel {
  host = <%= host %>
  port = 8659
  ttl = 1
}
<% end %>

udp_recv_channel {
  port = 8659
}
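As a sanity check, here’s roughly what that renders to on one node, assuming the search found this node plus one other at the hypothetical addresses 10.0.1.5 and 10.0.1.6 (your gmond.conf also needs a tcp_accept_channel, port 8649 by default, so that gmetad and nc can poll it; the default gmond.conf shipped with Ganglia includes one, so check your template keeps it):
udp_send_channel {
  host = 10.0.1.5
  port = 8659
  ttl = 1
}
udp_send_channel {
  host = 10.0.1.6
  port = 8659
  ttl = 1
}
udp_recv_channel {
  port = 8659
}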
Once Chef has converged on all the nodes in that cluster, each gmond.conf will have a list of other gmonds to send to.
If you wanted to limit how many receivers you have (so, instead of every gmond talking to every other gmond, you designate only a few and set the others to be deaf), you could adjust the search logic we used above. You could sort the list of nodes in the cluster and attach an attribute to the first two, e.g. node.set_unless[:ganglia][:gmond][:receiver] = true.
Then your search only looks for nodes where ganglia_gmond_receiver:true and, when you configure the gmetad, you restrict the data_source to use only these nodes too. This should reduce network activity whilst still maintaining redundancy.
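Here’s a rough sketch of how that designation might look in a recipe, run before the gmond.conf template resource. The attribute names and the choice of two receivers are just the examples from above, not something the cookbook provides:
# Find every node in this cluster, sort for a stable ordering,
# and treat the first two as receivers
this_cluster = node[:ganglia][:gmond][:cluster_name]
cluster_nodes = search(:node, "ganglia_gmond_cluster_name:\"#{this_cluster}\" AND chef_environment:#{node.chef_environment}")
receivers = cluster_nodes.map { |n| n[:ipaddress] }.sort.first(2)

# Mark this node as a receiver if it made the cut
node.set_unless[:ganglia][:gmond][:receiver] = receivers.include?(node[:ipaddress])

# Hypothetical attribute: wire this into the globals section of your
# gmond.conf template so non-receivers are deaf (send only, never receive)
node.set[:ganglia][:gmond][:deaf] = !node[:ganglia][:gmond][:receiver]
Sorting by IP is arbitrary but deterministic, so every node in the cluster agrees on who the receivers are without any extra coordination.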
gmetad configuration
The gmetad daemon provides federation across all your gmonds. Each cluster is represented in /etc/gmetad.conf by a data_source option, e.g.:
data_source "app servers" host1 host2
When the gmetad looks for a host for a particular data source, it tries the first one and uses the others for redundancy. This is the reason we couldn’t have a far simpler configuration where each gmond is independent and the gmetad is configured to ask all of them for data - it would only ever get data from one, and ignore the rest.
Get the list of clusters and gmond hosts
Configuring gmetad is fairly straightforward: we just use search to get a list of all known clusters along with their gmond hosts:
# find all nodes running ganglia so we can collect clusters
clusters = {}
# all my nodes with ganglia have the ganglia recipe - adjust as needed
search(:node, "recipes:ganglia AND chef_environment:#{node.chef_environment}") do |n|
  name = n[:ganglia][:gmond][:cluster_name]
  (clusters[name] ||= []) << n[:ipaddress]
end
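With a couple of clusters defined, the clusters hash ends up looking something like this (names and addresses are hypothetical):
{
  'app servers'     => ['10.0.1.5', '10.0.1.6'],
  'mongodb servers' => ['10.0.2.7']
}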
If only a limited subset of your gmonds are acting as receivers, make sure you’re only feeding those nodes into gmetad, for example:
search(:node, "recipes:ganglia AND ganglia_gmond_receiver:true AND chef_environment:#{node.chef_environment}") do |n|
  name = n[:ganglia][:gmond][:cluster_name]
  (clusters[name] ||= []) << n[:ipaddress]
end
Then pass the list of clusters and hosts into your gmetad.conf template:
<% @clusters.each do |name, hosts| %>
data_source "<%= name %>" <%= hosts.join(' ') %>
<% end %>
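With the hypothetical clusters hash from earlier, the rendered gmetad.conf would contain lines like:
data_source "app servers" 10.0.1.5 10.0.1.6
data_source "mongodb servers" 10.0.2.7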
Testing it and debugging
That should do it for the configuration. If everything has worked, you’ll be able to turn it all on, set up your monitoring server with a gmetad and the Ganglia web UI, and see all your clusters and data appear.
If you run into problems, it’s easiest to start debugging at a cluster level. Turn off your gmetad and all the gmonds in a single cluster, then start up two gmonds on two machines in that cluster.
On one of them, use nc localhost 8649. This should splurge out some XML. Inside the <CLUSTER> tag you should, if the gmonds are talking to each other, have at least two <HOST> entries, one for each node running gmond.
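For reference, healthy output looks roughly like this (trimmed heavily; the names, addresses and metrics are hypothetical, not verbatim gmond output):
<GANGLIA_XML VERSION="3.6.0" SOURCE="gmond">
  <CLUSTER NAME="app servers" OWNER="Your Organisation" ...>
    <HOST NAME="ip-10-0-1-5" IP="10.0.1.5" ...>
      <METRIC NAME="load_one" VAL="0.12" .../>
      ...
    </HOST>
    <HOST NAME="ip-10-0-1-6" IP="10.0.1.6" ...>
      ...
    </HOST>
  </CLUSTER>
</GANGLIA_XML>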
Common things to check:
- Firewall configuration (or security groups on AWS):
  - gmonds all communicate via a UDP port (8649 by default; the template above uses 8659), so this needs to be open for in/out connections on the nodes
  - gmetad gets data from gmonds via TCP port 8649 (default), so this should be open on any nodes running gmond that are marked as receivers
- Each gmond should be configured to list itself as a udp_send_channel, even if it’s the only node in a cluster. Not listing this will probably result in weird issues (see the snippet below).
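As a rough illustration, the node at the hypothetical address 10.0.1.5 would include itself among its own send channels:
udp_send_channel {
  host = 10.0.1.5
  port = 8659
  ttl = 1
}
The search-based template above should take care of this automatically, since the node’s own IP comes back in the search results once the node has been indexed by the Chef server.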
I have this implementation running in production at LoyaltyLion and it’s working great. If you run into any issues, feel free to comment and I’ll see if I can help.