16 April, 2023

Multi-env AWS with Terraform and Terragrunt revisited

Since publishing the first version of this article I've built a startup using the same techniques to manage our AWS infrastructure. The fundamental concept has worked really well, but along the way I've made a few improvements to multi-region support and reduced duplication in the Terragrunt configuration. It's nothing groundbreaking, but there's enough there to warrant an updated article. I've pushed some new example code to the same repository and the original code is on a branch called v1.

Multi-region support

Operating across two continents, we needed our product to work in multiple regions from day one. I made a few tweaks to our Terraform stack to support this. Within each environment directory I added a directory for each region, while keeping global and non-AWS modules at the top level.

- prod
   - cloudfront
   - datadog
   - eu-west-2
   - s3
   - us-east-1
   - ...
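
Each region directory contains the region-scoped modules for that environment; a hypothetical eu-west-2 layout (module names are illustrative) might be:

- eu-west-2
   - certs
   - region.yaml
   - vpc
   - ...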

Within each region directory is a region.yaml file that contains the region name

region: eu-west-2

In base.hcl we add some logic to resolve the current region as a module input

# Load env vars...
locals {
  env_vars = yamldecode(join("\n", [
    file(find_in_parent_folders("env.yaml")),
    fileexists("${get_terragrunt_dir()}/../region.yaml") ? file("${get_terragrunt_dir()}/../region.yaml") : ""
  ]))
  aws_root_account_id = "<ROOT_ACCOUNT_ID>"
}

# ...and make them available as inputs
inputs = {
  region              = try(local.env_vars.region, "eu-west-2") # Default to eu-west-2
  env                 = local.env_vars.env
  aws_account_id      = local.env_vars.aws_account_id
  aws_root_account_id = local.aws_root_account_id
}
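
The env and aws_account_id keys come from the per-environment env.yaml that base.hcl loads above; a hypothetical prod example (the account ID is a placeholder) might look like:

env: prod
aws_account_id: "<PROD_ACCOUNT_ID>"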

We can now access region as a variable in every Terraform module

variable "region" {
  type = string
}

And our generated provider block is configured to select the relevant region

provider "aws" {
  region  = "${local.env_vars.region}"
  profile = "${local.env_vars.env}"
}
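
For reference, that provider block is emitted from base.hcl by a Terragrunt generate block. A minimal sketch (the generated file name and if_exists strategy are my assumptions) might look like:

generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "aws" {
  region  = "${local.env_vars.region}"
  profile = "${local.env_vars.env}"
}
EOF
}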

Slightly DRYer Terragrunt

Although this is well documented elsewhere, I moved a lot of the common Terragrunt configuration to a separate directory. For example, /env/_common/certs.hcl contains

terraform {
  source = "../../../../modules//certs"
}

dependency "lets_encrypt" {
  config_path = "../../lets_encrypt"
}

inputs = {
  example_com_acme_certificate = dependency.lets_encrypt.outputs.acme_certificate
}

which has the advantage of standardising dependency and input configuration for cert modules in all regions and environments. /env/<env_name>/<region>/certs/terragrunt.hcl can then be reduced to:

include {
  path = find_in_parent_folders("base.hcl")
}

include "certs" {
  path = "${get_terragrunt_dir()}/../../../_common/certs.hcl"
}

while retaining the option to add additional dependencies or inputs per-env as required.
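
For example, one environment could layer an extra input on top of the shared certs configuration. A hedged sketch (the merge_strategy and input name are illustrative):

include {
  path = find_in_parent_folders("base.hcl")
}

include "certs" {
  # deep merge so nested inputs and dependency blocks from certs.hcl are combined rather than replaced
  path           = "${get_terragrunt_dir()}/../../../_common/certs.hcl"
  merge_strategy = "deep"
}

# env-specific input merged on top of those defined in certs.hcl
inputs = {
  extra_domain_names = ["staging.example.com"]
}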

Running the stack

I had some really useful feedback from @yb-jmogavero that made me realise I hadn't made it clear how this stack is meant to be run. In summary, I've had a go at running this in CI/CD, but at our scale it's quicker (and a bit less scary) for me to make infrastructure changes manually.

As a compromise we run

terragrunt run-all init -upgrade --terragrunt-non-interactive && terragrunt run-all plan --terragrunt-non-interactive

in /env whenever infrastructure changes are present in a pull request.

We use the Terragrunt dependencies block to make explicit the implicit dependencies between Terragrunt modules, e.g. between aws_accounts and iam_roles. This ensures that the run-all command won't attempt to plan a module before its dependencies have been executed. You can verify your module dependencies by installing the dot graphing tool (part of Graphviz) and running terragrunt graph-dependencies | dot -Tsvg > graph.svg in /env.
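
A sketch of what that looks like in the iam_roles module's terragrunt.hcl (the relative path is an assumption about the layout):

dependencies {
  paths = ["../aws_accounts"]
}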

It is worth mentioning that, at the time of writing, run-all will still execute a module even if execution of its dependents fails. This can create a very sticky situation if you're using Terragrunt to delete a stack, as subsequent attempts to delete the upstream module will fail.

This approach has worked reasonably well for us and gives the team automatic feedback if their infrastructure changes fail to plan. It means I only need to review pull requests that are known to plan correctly, and I can then apply them and fix any errors that arise at apply time.

Thanks for reading and I hope it was useful. Submit a PR/issue if you've got any suggestions; I'm always up for a chat and keen to hear what you're building.

© 2023 Henry Course