HashiCorp has added drift detection to Terraform Cloud, specifically Terraform Cloud Business, as a public beta. While this is broadly good news, there are some nuances to how it apparently works that I’d like to explore.
The HashiCorp blog announcing drift detection is worth a read to get HashiCorp’s perspective on the challenge. I noted with some interest that HashiCorp does more than just detect drift here. It also provides the ability to take action as a result: either overwriting the change, or accepting it.
I’m not sure HashiCorp has actually gone far enough here, though this is a welcome addition to Terraform functionality. To explain why, let’s quickly recap why drift detection is needed and then look at how HashiCorp does things.
Why drift detection?
Terraform is essentially a state management tool. You encode a desired state for infrastructure in your configuration files as a plan and then change your infrastructure to match the plan using terraform apply. Terraform will then convert your (somewhat abstracted) plan into actual infrastructure commands, adapted to match the specific infrastructure you’re using.
Terraform tries to provide a useful abstraction over infrastructure so it doesn’t really matter much which instance/server/storage array you’re using to provide the infrastructure service. The abstraction makes infrastructure easier to deal with at scale, since you don’t need to care about the names of the specific machines in use, and infrastructure becomes just a set of commodity materials that you combine in various ways to make something useful.
But the map is not the territory. Reality may well look different from whatever is in your config files, potentially immediately after you’ve run terraform apply. This can be for a variety of good, and not so good, reasons.
If someone changes things manually instead of changing the configuration files and then running terraform apply again, you get drift. If you never go back and update your config files, you can end up with a big difference between what your map says and what the territory looks like. That can make things go horribly awry.
If you make changes to your plan based on what you think the state of the world looks like and you’re quite wrong about reality, your changes might fail. Sometimes the failure might be substantial. Automation only tends to work well if the system stays well within the parameters of what the automation can handle. If reality drifts outside the system parameters, the automation stops working.
The bigger the drift between your map and the territory, the bigger the problem. If you let things go on long enough, your automation tool becomes functionally useless.
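To make the idea concrete, here is a minimal sketch of drift as a diff between the map and the territory. It's plain Python, not anything Terraform actually runs, and the attribute names and values are made up for illustration.

```python
def detect_drift(desired, actual):
    """Return {attribute: (desired_value, actual_value)} for every mismatch.

    Attributes missing from one side are reported with None on that side.
    """
    drift = {}
    for key in desired.keys() | actual.keys():
        want = desired.get(key)
        have = actual.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical resource attributes, for illustration only.
config = {"instance_type": "t3.small", "port": 443}
reality = {"instance_type": "t3.large", "port": 443}  # someone resized it by hand

print(detect_drift(config, reality))
```

The more entries that diff contains, the less your config tells you about what is actually running.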
There are a few ways to handle drift. The first is to just clobber the changes by re-asserting the dominance of the map. This works well for correcting unauthorised changes to the system that shouldn’t have been done. You just revert the state back to what it should be. This is the principle behind backups: a copy of the known state you can restore if the wrong changes occur.
You could already do this with Terraform, but drift detection adds the monitoring piece that was missing. With just terraform apply, there’s no monitoring, unless you do it manually yourself. That’s hard, because humans are bad at attention to detail and tend to miss things. Computers are really good at being pedantic arseholes, so getting them to watch for minute changes in configuration that differ from the plan is something they can do well.
But you don’t always want to clobber the changes to production with whatever was in the plan. Sometimes stuff happens that needs changes to be made outside of what Terraform is set up to handle. Sometimes you just want to accept that the change is okay.
The computer doesn’t know that it’s okay, because computers are breathtakingly stupid, and they need a human’s help to say that the change is fine. Terraform’s drift detection does this by running a refresh-only plan. The refresh-only plan option was added last year, and it updates Terraform’s state cache with what the world actually looks like.
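The two responses to drift, overwrite or accept, can be sketched as a toy model. This is Python with plain dictionaries standing in for the config, Terraform's state cache and the real infrastructure. The function names are mine and hypothetical; they are not Terraform commands or APIs.

```python
def overwrite(config, state, reality):
    """Re-assert the config: reality and the state cache are reset to match it."""
    return dict(config), dict(config), dict(config)


def accept(config, state, reality):
    """Refresh-only style: the state cache catches up with reality.

    The config and the real infrastructure are left exactly as they were.
    """
    return dict(config), dict(reality), dict(reality)


# Hypothetical values: the config says t3.small, someone resized by hand.
config = {"instance_type": "t3.small"}
state = {"instance_type": "t3.small"}
reality = {"instance_type": "t3.large"}

print(overwrite(config, state, reality))  # all three agree on t3.small
print(accept(config, state, reality))     # state and reality agree on t3.large
```

Note which of the three copies the accept path leaves alone.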
But there’s a final piece here that I think is still missing from how Terraform works.
Closing the loop
There are three locations for state with Terraform, not just the map and the territory. As I just mentioned, Terraform has a state cache.
Terraform’s internal state cache is what gets updated when you run terraform plan. This is so that Terraform doesn’t have to constantly go out to look at the whole world whenever you make a change. That takes too long when you’re managing a lot of infrastructure. If you’re managing everything exclusively through Terraform, it’s not a problem, because changes can’t happen any other way.
Right? Uh, so what about the drift we’ve just been talking about?
Yeah, it turns out that “that should never happen” is a really brittle way to build a system. Even if you’re supposedly ‘all in’ on Terraform, weirdness happens. And most organisations won’t be all in on Terraform all at once. Moving to it takes time. There will be drift.
Right now, the only way to handle drift that you want to keep is to manually go and update your configuration files to include the desirable change. Using drift detection and a refresh-only plan only changes the state cache to match reality; it doesn’t change the source config that actually drives everything.
That’s the piece I want to see Terraform do as well when you accept a change: automate updating of the source configuration to match both reality and the state cache. That way, the next terraform apply won’t clobber the accepted changes, and you won’t have to rely on supremely disciplined and highly accurate humans to ensure the changes are made to the source files.
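Here is a rough sketch of what closing that loop means, again as a toy Python model rather than anything Terraform provides. Both pending_changes and write_back are hypothetical helpers I've invented for illustration, and the config is a flat dictionary instead of real configuration files.

```python
def pending_changes(config, state):
    """What the next apply would change: {attribute: (current, planned)}."""
    return {
        key: (state.get(key), planned)
        for key, planned in config.items()
        if state.get(key) != planned
    }


def write_back(config, state):
    """Hypothetical 'close the loop' step: copy accepted values into the config.

    Only attributes the config already manages are rewritten.
    """
    return {key: state.get(key, value) for key, value in config.items()}


# Hypothetical values: the drift to t3.large has been accepted into the state cache.
config = {"instance_type": "t3.small", "port": 443}
state_after_accept = {"instance_type": "t3.large", "port": 443}

# Without write-back, the next apply still wants to revert the accepted change.
before = pending_changes(config, state_after_accept)

# With write-back, config and state agree, so there is nothing left to clobber.
after = pending_changes(write_back(config, state_after_accept), state_after_accept)

print(before)
print(after)
```

In the toy version the write-back is a one-liner. Doing the same against real source files is a much harder job than copying values between dictionaries.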
This is really hard to do, so I won’t be surprised to learn that Terraform doesn’t do this yet. I’ve asked, and hopefully HashiCorp get back to me with some details about this.
Adding drift detection is a good thing, and this will genuinely help people who use Terraform to manage their infrastructure. I just hope that the really smart folks at HashiCorp can figure out how to remove this last bit of boring human toil from the process.