clarkdave.net

Null-safety with JSON and PostgreSQL

If you’re using the native JSON support in PostgreSQL you may, from time to time, come across the issue of null safety: for example, when querying a nested JSON object which contains null values. Attempting to use one of the JSON operators on such a null will result in an error like this:

ERROR:  cannot extract element from a scalar

PostgreSQL’s -> operator is generally pretty good at soaking up null references. You can chain it on a nested object and it’ll always return null if it encounters a key which doesn’t exist. However, if it encounters a key whose value is actually null, you’ll get the above error.

The easiest way to avoid this is to make sure you never set a JSON key to null if you’re ever going to treat that key as an object. If you can’t make that guarantee though, you’ve got a few other options to navigate safely around the null.
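As a rough illustration of the problem and one possible guard (the events table and its data column here are invented, and this assumes the user key holds either an object or a null):

-- raises an error when data->'user' is a JSON null rather than an object
SELECT data->'user'->>'name' FROM events;

-- ->> returns SQL NULL for a JSON null, so the CASE bails out
-- before -> is ever applied to the scalar
SELECT CASE
  WHEN data->>'user' IS NULL THEN NULL
  ELSE data->'user'->>'name'
END AS user_name
FROM events;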

Historical records with PostgreSQL, temporal tables and SQL:2011

Sometimes you need to find out what a record looked like at some point in the past. This is known as the Slowly Changing Dimension problem. Most database models - by design - don’t keep the history of a record when it’s updated. But there are plenty of reasons why you might need to do this, such as:

  • audit/security purposes
  • implementing undo functionality
  • showing a model’s change over time for stats or comparison

There are a few ways to do this in PostgreSQL, but this article is going to focus on the implementation provided by the SQL:2011 standard, which added support for temporal databases. It’s also going to focus on actually querying that historical data, with some real-life use cases.

PostgreSQL doesn’t support these features natively, but the temporal_tables extension approximates them. It requires PostgreSQL 9.2 or higher, as that was the first version with support for a timestamp range data type.

Before you dig in, it’s important to note that this extension does not provide complete support for the 2011 standard. Specifically, there is no support for the new syntax for querying across historical tables, such as the FOR SYSTEM_TIME AS OF clause.

These are generally conveniences though - the temporal_tables extension takes care of keeping the history up to date, and you’ll at least be adopting a standard used by other databases, as opposed to rolling your own solution or using something application-specific.
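
To give a flavour of what that looks like, here’s a minimal sketch (the subscriptions table is invented; the sys_period column, history table and versioning() trigger function follow the extension’s documentation):

-- current rows live here; sys_period records when each version became current
CREATE TABLE subscriptions (
  id serial PRIMARY KEY,
  plan text NOT NULL,
  sys_period tstzrange NOT NULL DEFAULT tstzrange(current_timestamp, null)
);

-- superseded versions are copied here by the trigger
CREATE TABLE subscriptions_history (LIKE subscriptions);

CREATE TRIGGER versioning_trigger
  BEFORE INSERT OR UPDATE OR DELETE ON subscriptions
  FOR EACH ROW EXECUTE PROCEDURE versioning('sys_period', 'subscriptions_history', true);

-- what did subscription 42 look like at the start of 2014?
-- (if it hasn't changed since, the row will still be in subscriptions itself)
SELECT * FROM subscriptions_history
WHERE id = 42 AND sys_period @> '2014-01-01 00:00:00+00'::timestamptz;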

How to use webpack with Rails

webpack is a powerful module bundler, primarily designed for front-end development, which can integrate nicely with Bower and npm JavaScript modules.

It has quite a few advantages over the typical Rails methods of managing front-end JS, but can still slot in seamlessly alongside Sprockets and the asset pipeline. Unfortunately, the documentation for webpack is still a bit hard to digest, especially if you want to integrate it with Rails.

If you’re still not sold on using webpack, here’s some of what it can do for you:

  • manage all your front-end JS (& their dependencies) using npm or Bower
  • automatically preprocess CoffeeScript, ES6, etc
  • output source maps for absolutely everything, with minimal effort
  • help you separate the JS for different pages into different files, with ‘common’ modules automatically shared across all pages
  • split off large modules into separate files which are only downloaded on demand (via require.ensure)

If that sounds good to you, read on to see how to use all this with either an existing Rails app or a brand new one. By the way, although this is Rails specific, some of what’s here might be of benefit when combining webpack with any Rails-like framework.

Tracking errors with Logstash and Sentry

Logstash is an excellent way to eat up logs and send them elsewhere. In a typical setup you’ll send them to Elasticsearch and the excellent Kibana for viewing and analysis, which works well but is missing a vital part: being alerted when your application throws errors.

There’s a whole bunch of ways you can deal with errors without Logstash, one of which is Sentry. This is a software service which takes your errors, samples and groups them and, crucially, alerts you by email (among other options). Sentry can plug in to most applications with its various Raven clients, which allow you to track errors inside your application and send them to Sentry directly.

But, well, if you’re already using Logstash to log everything (including errors), wouldn’t it be great to just have Logstash send errors on to Sentry for you? I think so! And, luckily, it’s quick and easy to do!

What can you do with PostgreSQL and JSON?

PostgreSQL 9.2 added a native JSON data type, but didn’t add much else. You’ve got three options if you actually want to do something with it:

  1. Wait for PostgreSQL 9.3 (or use the beta)
  2. Use the plv8 extension. Valid option, but more DIY (you’ll have to define your own functions)
  3. Use the json_enhancements extension, which backports the new JSON functionality in 9.3 to 9.2

I wanted to use this stuff now, so I opted to go with option 3. I wrote a blog post which should help you get going if you want to go this route: adding json_enhancements to PostgreSQL 9.2.

So let’s assume you’re on either 9.3, or 9.2 with json_enhancements. What can you do? Lots! All the new JSON operators and functions are in the 9.3 documentation, so I’m going to run through some of the more fun things you can do along with a real-world use case.
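
A quick taste of the kind of thing that’s possible (the users table and its json data column here are invented):

-- -> returns json, ->> returns text
SELECT data->'address'->>'city' FROM users;

-- turn an entire row into a json object
SELECT row_to_json(users) FROM users WHERE id = 1;

-- expand a json array (here, data->'tags') into one row per element
SELECT json_array_elements(data->'tags') FROM users;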

Adding json_enhancements to PostgreSQL 9.2

If you keep up with PostgreSQL developments there’s no way you’ve missed the JSON datatype introduced in 9.2, and the upcoming JSON functions in 9.3. The biggest change here is that in 9.3 it’ll be possible to query your JSON columns without needing to use plv8, which embeds the V8 JavaScript engine in Postgres.

That said, the new JSON functions have been backported to 9.2 as the json_enhancements extension, and can be used right now. Before deciding if this is for you, you’ll probably want to play around with it on your development machine. If so, this guide might help!
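
Once the extension has been built and installed against your 9.2 instance, enabling it and giving it a quick sanity check is just a couple of statements (the sample query is only an illustration):

CREATE EXTENSION json_enhancements;

-- the 9.3-style operators should now work on 9.2
SELECT '{"name": "dave"}'::json->>'name';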

Multi-cluster Ganglia on EC2 or unicast networks

Ganglia is a scalable distributed monitoring system, and is excellent for keeping tabs on tens or thousands of servers.

It works best on multicast-capable networks, where its army of gmonds will chat to one another with minimal configuration and the gmetad federator can ask any of them for data. It’s fairly simple to set up a single-cluster implementation on a unicast network too.

However, one of Ganglia’s awesome features - multiple clusters - is considerably more complicated to set up on a unicast network. There are a few options to do it:

  • Dedicate one gmond in each cluster as the receiver, and have all the others in that cluster send to it. This is OK, but if that one server goes down, there’ll be no stats for that cluster.
  • Run one receiving gmond per cluster on a dedicated reporting machine (e.g. same as gmetad is on). Works, but annoying to configure as you’ve got a big stack of gmonds running on one server.
  • Emulate multicast support by having each gmond in each cluster send/receive to all or some of the others. A central gmetad is configured to point at one or more of these gmonds for each data_source (cluster).

None of these are as elegant as the default multicast configuration, but we’ll have to make do. I’ve opted for the third option as I believe it strikes the best balance between reliability and ease of maintenance. To avoid having hundreds of nodes constantly spamming all the others, you can designate a handful of receiver nodes in each cluster and have all the others report to those.

Note: doing this by hand will not be fun. Combine this guide with your server automation tool of choice - I’ll be using Chef, so you’ll need to translate the instructions yourself if you’re using something else.

Guide to setting up Nagios with Chef

Nagios is an awesome open source tool for monitoring servers and applications and, being such a mission to install and configure, it’s a perfect use case for Chef. Opscode have a nice Nagios cookbook, but it’s still not the most straightforward thing to get running.

To make things easier, I’ll explain here how to use this cookbook to set up a Nagios server and have it monitor multiple client servers and applications such as PostgreSQL, MongoDB and nginx.

Creating and bootstrapping EC2 instances in an Amazon VPC using Chef & Knife

Instances in an Amazon VPC are a little trickier than usual to create using the knife ec2 server create command, because they are, of course, private. However, if you have an SSH gateway you can use that to create and bootstrap them.

There is one small caveat: if you create a server in a public subnet, it will not be able to access the Internet (and thus complete the Chef bootstrap process) until you assign it an Elastic IP. The knife ec2 plugin doesn’t let you do this, so you’ll either have to do it manually after the server has been created, or script it using the AWS API.
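
If you go the scripted route, the Elastic IP step might look roughly like this with the aws-sdk gem (v2 or later); the region and instance id are placeholders:

require 'aws-sdk'

ec2 = Aws::EC2::Client.new(region: 'eu-west-1')

# allocate a new Elastic IP in the VPC and attach it to the freshly created
# instance so it can reach the Internet and finish the Chef bootstrap
allocation = ec2.allocate_address(domain: 'vpc')
ec2.associate_address(
  instance_id: 'i-0123456789abcdef0', # the id reported by knife ec2 server create
  allocation_id: allocation.allocation_id
)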

Managing EBS volumes with Chef

Chef works great for creating and attaching Amazon EBS volumes to instances, and there’s an Opscode chef-aws cookbook which has providers for working with single EBS volumes and also for creating RAID arrays from multiple volumes.

The chef-aws cookbook is fairly straightforward on its own, but requires a little DIY to get it fully working. I’ll explain here how to use this cookbook to effectively create single or RAID EBS volumes and allow these to be customised using attributes in roles or environments (so you could have your staging DB use a single 20GB volume, and your production DB use a 200GB RAID).
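
The gist of the attribute-driven approach is something like this (the node attributes here are invented; the aws_ebs_volume resource and its parameters come from the cookbook):

# read the desired size from an attribute so a role or environment can override it
volume_size = node['myapp']['db']['ebs_size'] || 20

aws_ebs_volume 'db_volume' do
  aws_access_key node['aws']['access_key_id']
  aws_secret_access_key node['aws']['secret_access_key']
  size volume_size
  device '/dev/sdi'
  action [:create, :attach]
end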

Syntax highlighting in nanoc

Syntax highlighting is easy in nanoc using the (built-in) colorize_syntax filter and Pygments. Pygments is an extremely robust Python library with support for many languages. You wouldn’t ordinarily want to call out to a Python application from your Ruby app, but because nanoc compiles your site ahead of time this method works great.
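
In the Rules file it boils down to one extra filter line; this sketch assumes the pygmentize binary is on your PATH and that you’re already filtering Markdown with kramdown:

compile '*' do
  filter :kramdown
  # shells out to the pygmentize CLI to colour any annotated code blocks
  filter :colorize_syntax, :default_colorizer => :pygmentize
  layout 'default'
end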

Vagrant, Chef and the sudo cookbook

Vagrant is great. Chef is great. Let’s say you’re using Vagrant to test a Chef config for a server. Maybe you want to add the Opscode sudo cookbook to manage your sudo package and sudoers? If so, be careful!

This particular sudo cookbook replaces the /etc/sudoers file to do its thing, wiping out the %sudo ALL=(ALL:ALL) ALL line in the process. It just so happens that, on a typical Vagrant box, the vagrant user is part of this group, and depends on it to be able to do, well, anything.

Blindly installing the sudo cookbook (as I did) will therefore result in your vagrant user being powerless. Side-effects of this include network failures, an inability to shut down the VM (vagrant halt) and much more!

If you are dead-set on using the sudo cookbook, make sure you add the vagrant user to the sysadmin group, or add the %sudo ALL=(ALL:ALL) ALL line back in to the sudoers file.
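
With the chef_solo provisioner you can do that straight from the Vagrantfile, for example by granting the vagrant user sudo through the cookbook’s own attributes (the attribute names below are the ones the Opscode sudo cookbook reads; double-check them against your version):

Vagrant.configure('2') do |config|
  # (box, cookbook paths etc. omitted)
  config.vm.provision :chef_solo do |chef|
    chef.add_recipe 'sudo'
    # keep the vagrant user in charge after the cookbook rewrites /etc/sudoers
    chef.json = {
      'authorization' => {
        'sudo' => {
          'users' => ['vagrant'],
          'passwordless' => true
        }
      }
    }
  end
end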

Node.js Cluster and working directory shenanigans

Cluster is an excellent built-in Node.js module which lets you run a master node process which spawns, balances and maintains multiple child processes. One nice advantage of this is you can reload the worker processes independently, so you can get zero-downtime deploys by deploying the new code and asking the master to reload its workers.

Let’s say we have the following master setup:

cluster.setupMaster({
  exec: '/app/current/worker.js'
});

On start, and when asked to reload, the cluster master forks node /app/current/worker.js processes. My problem came about because of my deploy process, which uses Capistrano. On a deploy, Capistrano puts the application code into a timestamped /app/releases directory, and then points the /app/current symlink at that release.

Following the deploy, the cluster master happily reloads its workers. The problem is, the working directory of the cluster master is an old release (specifically, whatever the /app/current symlink pointed at when the master was last started). When Cluster forks a child process, that child process has the same working directory as the master.

What ended up occurring was that node /app/current/worker.js would run, and correctly load the worker.js file from the current symlink of /app/current (so, the latest release). However, anything inside worker.js which relied on process.cwd(), such as require('./lib/something'), would actually resolve to an old file.

The special __dirname variable would, however, show the real directory, as __dirname is the directory of the currently executing script. So, knowing that, the fix was simple: at the top of worker.js, make sure the cwd is the same as __dirname and change it if not.

if (__dirname !== process.cwd()) {
  process.chdir(__dirname);
}

Send deploy keys to Bitbucket in a Chef recipe

A common thing to do with Chef and app server configuration is to create a ‘deploy’ user. This user will be involved with the deployment of code and often needs read-only access to the source repository. In my case, this was Bitbucket, but this procedure should copy across with a few tweaks for GitHub or most other providers too.

In the case of Bitbucket (and GitHub) a deploy user is given read-only access to a repository through their ssh key. Because we’re creating our deploy user through Chef anyway, along with their ssh key, it makes a lot of sense to send that key off to Bitbucket at the same time, and that’s what this little recipe does:

# create the deploy user
user "deploy" do
  shell "/bin/bash"
  home "/home/deploy"
  supports :manage_home => true
end

chef_gem 'httparty'

# make sure the .ssh directory exists before generating the key
directory '/home/deploy/.ssh' do
  owner 'deploy'
  mode '0700'
end

# create their ssh key
execute 'generate ssh key for deploy' do
  user 'deploy'
  creates '/home/deploy/.ssh/id_rsa'
  command 'ssh-keygen -t rsa -q -f /home/deploy/.ssh/id_rsa -P ""'
  notifies :run, "ruby_block[add_ssh_key_to_bitbucket]"
  notifies :run, "execute[add_bitbucket_to_known_hosts]"
end

# add bitbucket.org to known hosts, so future deploys won't be interrupted
execute "add_bitbucket_to_known_hosts" do
  action :nothing # only run when ssh key is created
  user 'deploy'
  command 'ssh-keyscan -H bitbucket.org >> /home/deploy/.ssh/known_hosts'
end

# send id_rsa.pub over to Bitbucket as a new deploy key
ruby_block "add_ssh_key_to_bitbucket" do
  action :nothing # only run when ssh key is created
  block do
    require 'httparty'
    url = "https://api.bitbucket.org/1.0/repositories/#{node['bitbucket_user']}/repo-name/deploy-keys"
    response = HTTParty.post(url, {
      :basic_auth => {
        :username => node['bitbucket_user'],
        :password => node['bitbucket_pass']
      },
      :body => {
        :label => 'deploy@' + node['fqdn'],
        :key => File.read('/home/deploy/.ssh/id_rsa.pub')
      }
    })

    unless response.code == 200 or response.code == 201
      Chef::Log.warn("Could not add deploy key to Bitbucket, response: #{response.body}")
      Chef::Log.warn("Add the key manually:")
      Chef::Log.info(File.read('/home/deploy/.ssh/id_rsa.pub'))
    end
  end
end

The bitbucket_user and bitbucket_pass attributes should be set somewhere, and in the url you’ll want to change repo-name to the actual repo you’re deploying to. Bitbucket only lets you add deploy keys per repository, so if this user will be deploying from multiple repositories this is a good place to do it - just update the Ruby block so it loops through all your repositories and sends a deploy key off for each one.
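
For example, with a (made-up) bitbucket_repos attribute listing the repo names, the ruby_block above could become something like:

ruby_block "add_ssh_key_to_bitbucket" do
  action :nothing
  block do
    require 'httparty'
    public_key = File.read('/home/deploy/.ssh/id_rsa.pub')

    # send the same deploy key to every repository this server deploys from
    node['bitbucket_repos'].each do |repo|
      url = "https://api.bitbucket.org/1.0/repositories/#{node['bitbucket_user']}/#{repo}/deploy-keys"
      HTTParty.post(url, {
        :basic_auth => {
          :username => node['bitbucket_user'],
          :password => node['bitbucket_pass']
        },
        :body => { :label => 'deploy@' + node['fqdn'], :key => public_key }
      })
    end
  end
end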

You’ll most likely want to run this only on production or staging environments, otherwise you could end up adding dozens of junk deploy keys to Bitbucket while you’re spinning up all those Vagrant VMs!

Munin and PostgreSQL monitoring

I ran into a little hiccup when trying to configure Munin to monitor PostgreSQL. After linking the ‘postgres_’ plugins and restarting munin-node, no Postgres stats were appearing and I was seeing error messages in the munin-node.log like this:

Service 'postgres_size_ALL' exited with status 1/0
Service 'postgres_locks_ALL' exited with status 1/0
Service 'postgres_cache_ALL' exited with status 1/0

Not very helpful but, it turns out, easy to fix. The Munin Postgres plugins use Perl and the DBD::Pg module to talk to your PostgreSQL database, so if either of these is missing you’ll get these errors.

The solution is to install the DBD::Pg module from CPAN. If you’re using Chef, add the perl cookbook and then run cpan_module 'DBD::Pg' in a recipe somewhere.
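
That amounts to a couple of lines (the perl cookbook provides the cpan_module resource):

# pulls in the cpan_module resource
include_recipe 'perl'

# the Munin postgres_* plugins need DBD::Pg to talk to PostgreSQL
cpan_module 'DBD::Pg'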