2015-01-30 - Progress - Tony Finch
The last couple of weeks have been a bit slow, being busy with email and DNS support, an unwell child, and surprise 0day. But on Wednesday I managed to clear the decks so that on Thursday I could get down to some serious rollout planning.
My aim is to do a forklift upgrade of our DNS servers - a tier 1 service - with negligible downtime, and with a backout plan in case of fuckups.
- Solaris Zones
- Rollout plans
- Ansible - configuration vs orchestration
- When to write an Ansible module
- Start stupid and expect to fail
- Learn from failure
- Consistent and quick rollouts
- Result
Solaris Zones
Our existing DNS service is based on Solaris Zones. The nice thing about this is that I can quickly and safely halt a zone - which stops the software and unconfigures the network interface - and if the replacement does not work I can restart the zone - which brings up the interfaces and the software.
Even better, the old servers have a couple of test zones which I can bounce up and down without a care. These give me enormous freedom to test my migration scripts without worrying about breaking things and with a high degree of confidence that my tests are very similar to the real thing.
Testability gives you confidence, and confidence gives you productivity.
Before I started setting up our new recursive DNS servers, I ran
zoneadm -z testdns* halt on the old servers so that I could
use the testdns addresses for developing and testing
our keepalived setup.
So I had the testdns zones in reserve for developing and testing the
rollout/backout scripts.
Rollout plans
The authoritative and recursive parts of the new setup are quite different, so they require different rollout plans.
On the authoritative side we will have a virtual machine for each service address. I have not designed the new authoritative servers for any server-level or network-level high availability, since the DNS protocol should be able to cope well enough. This is similar in principle to our existing Solaris Zones setup. The vague rollout plan is to set up new authdns servers on standby addresses, then renumber them to take over from the old servers. This article is not about the authdns rollout plan.
On the recursive side, there are four physical servers any of which can host any of the recdns or testdns addresses, managed by keepalived. The vague rollout plan is to disable a zone on the old servers then enable its service address on the keepalived cluster.
Ansible - configuration vs orchestration
So far I have been using Ansible in a simple way as a configuration management system, treating it as a fairly declarative language for stating what the configuration of my servers should be, and then being able to run the playbooks to find out and/or fix where reality differs from intention.
But Ansible can also do orchestration: scripting a co-ordinated sequence of actions across disparate sets of servers. Just what I need for my rollout plans!
When to write an Ansible module
The first thing I needed was a good way to drive zoneadm from Ansible. I have found that using Ansible as a glorified shell script driver is pretty unsatisfactory, because its shell and command modules are too general to provide proper support for its idempotence and check-mode features. Rather than messing around with shell commands, it is much more satisfactory (in terms of reward/effort) to write a custom module.
My zoneadm module does the bare minimum: it runs zoneadm list -pi
to get the current state of the machine's zones, checks if the target
state matches the current state, and if not it runs zoneadm boot or
zoneadm halt as required. It can only handle zone states that are
"installed" or "running". 60 lines of uncomplicated Python, nice.
Start stupid and expect to fail
After I had a good way to wrangle zones it was time to do a quick
hack to see if a trial rollout would work. I wrote the following
playbook which does three things: move the testdns1 zone from running
to installed, change the Ansible configuration to enable testdns1 on
the keepalived cluster, then push the new keepalived configuration to
the cluster.
---
- hosts: helen2.csi.cam.ac.uk
  tasks:
    - zoneadm: name=testdns1 state=installed
- hosts: localhost
  tasks:
    - command: bin/vrrp_toggle rollout testdns1
- hosts: rec
  roles:
    - keepalived
This is quick and dirty, hardcoded all the way, except for the vrrp_toggle
command which is the main reality check.
The vrrp_toggle script just changes the value of an Ansible variable
called vrrp_enable which lists which VRRP instances should be included
in the keepalived configuration. The keepalived configuration is
generated from a Jinja2 template, and each vrrp_instance (testdns1
etc.) is emitted if the instance name is not commented out of the
vrrp_enable list.
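To make the comment-toggling idea concrete, here is a minimal sketch of what such a script might do. The layout of the variables file is an assumption (a YAML list with one "- name" entry per line, disabled entries prefixed with "#"), and toggle_instance is a name made up for this example, not the real vrrp_toggle.

```python
import re


def toggle_instance(vars_text, instance, enable):
    """Comment or uncomment one `- name` entry in a YAML list.

    Assumed (hypothetical) file layout:
        vrrp_enable:
          - recdns0
          # - testdns1
    """
    pattern = re.compile(r"^(\s*)(#\s*)?-\s*%s\s*$" % re.escape(instance))
    out = []
    found = False
    for line in vars_text.splitlines():
        match = pattern.match(line)
        if match:
            found = True
            indent = match.group(1)
            if enable:
                line = "%s- %s" % (indent, instance)
            else:
                line = "%s# - %s" % (indent, instance)
        out.append(line)
    if not found:
        raise KeyError(instance)
    return "\n".join(out) + "\n"
```

A Jinja2 template that loops over vrrp_enable then never sees the commented-out names, so their vrrp_instance blocks are simply not generated.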
Fail.
Ansible does not re-read variables if you change them in the middle of a playbook like this. Good. That is the right thing to do.
The other way in which this playbook is stupid is that there would actually be 8 of them: rollout and backout for each of 2 recdns and 2 testdns instances. Writing them individually is begging for typos; repeated code that is similar but systematically different is one of the most common ways to introduce bugs.
Learn from failure
So the right thing to do is tweak the variable then run the playbook.
And note the vrrp_toggle command arguments describe almost everything
you need to know to generate the playbook! (The only thing missing is
the mapping from instance name (like testdns1) to parent host (like
helen2).)
So I changed the vrrp_toggle script into a rec-rollout / rec-backout
script, which tweaks the vrrp_enable variable and generates the
appropriate playbook. The playbook consists of just two tasks, whose
order depends on whether we are doing rollout or backout, and which
have a few straightforward place-holder substitutions.
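A minimal sketch of this kind of playbook generator follows, assuming the two plays are the zoneadm play and the keepalived play from the earlier hack. The ZONE_HOST mapping and the function name are illustrative; only the testdns1 to helen2 pairing comes from the playbook quoted above, and the backout ordering (withdraw the address from keepalived, then boot the zone) is an assumption rather than something spelled out here.

```python
# Hypothetical instance -> parent host mapping; only this one pair is known.
ZONE_HOST = {
    "testdns1": "helen2.csi.cam.ac.uk",
}

ZONE_PLAY = """\
- hosts: {host}
  tasks:
    - zoneadm: name={instance} state={state}
"""

KEEPALIVED_PLAY = """\
- hosts: rec
  roles:
    - keepalived
"""


def generate_playbook(action, instance):
    """Return playbook text for rolling one instance out or backing it out."""
    host = ZONE_HOST[instance]
    if action == "rollout":
        # Halt the old zone first, then bring the address up on the cluster.
        plays = [
            ZONE_PLAY.format(host=host, instance=instance, state="installed"),
            KEEPALIVED_PLAY,
        ]
    elif action == "backout":
        # Assumed order: drop the address from the cluster, then boot the zone.
        plays = [
            KEEPALIVED_PLAY,
            ZONE_PLAY.format(host=host, instance=instance, state="running"),
        ]
    else:
        raise ValueError("action must be 'rollout' or 'backout'")
    return "---\n" + "".join(plays)
```

Because every playbook comes out of the same two templates, a mistake in either one shows up in every generated case, including the testdns ones.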
The nice thing about this kind of templating is that if you screw it up (like I did at first), usually a large proportion of the cases fail, probably including your test cases; whereas with clone-and-hack there will be a nasty surprise in a case you didn't test.
Consistent and quick rollouts
In the playbook I quoted above I am using my keepalived role, so I can be absolutely sure that my rollout/backout plan remains consistent with my configuration management setup. Nice!
However the keepalived role does several configuration tasks, most of which are not necessary in this situation. In fact all I need to do is copy across the templated configuration file and tell keepalived to reload it if the file has changed.
Ansible tags are for just this kind of optimization. I added a line to my keepalived.conf task:
tags: quick
Only one task needed tagging because the keepalived.conf task has a
handler to tell keepalived to reload its configuration when that
changes, which is the other important action. So now I can run my
rollout/backout playbooks with a --tags quick argument, so only the
quick tasks (and if necessary their handlers) are run.
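For completeness, a wrapper script might assemble the ansible-playbook command line along these lines. This is only a sketch: the function name and the playbook filename are made up, while --tags and --check are standard ansible-playbook options.

```python
def playbook_command(playbook, quick=True, check=False):
    """Build the argv for running a generated playbook."""
    argv = ["ansible-playbook"]
    if quick:
        # Run only tasks tagged "quick" (plus any handlers they notify).
        argv += ["--tags", "quick"]
    if check:
        # Dry run: report what would change without changing it.
        argv.append("--check")
    argv.append(playbook)
    return argv


# Example with a hypothetical generated playbook file:
# subprocess.check_call(playbook_command("rec-rollout.yml"))
```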
Result
Once I had got all that working, I was able to easily flip testdns0 and testdns1 back and forth between the old and new setups. Each switchover takes about ten seconds, which is not bad - it is less than a typical DNS lookup timeout.
There are a couple more improvements to make before I do the rollout
for real. I should improve the molly guard to make better use of
ansible-playbook --check. And I should
pre-populate the new servers' caches with the Alexa Top 1,000,000 list
to reduce post-rollout latency. (If you have a similar UK-centric
popular domains list, please tell me so I can feed that to the servers
as well!)
