# terraform
h
We just inherited a Terraform project and were wondering if anyone has any tips / tricks on how to manage it, and also whether certain issues we are facing are just common issues with this kind of project. We have not worked a lot with Terraform in the past.

Background information
The project has two repos:
• iac-modules - containing reusable modules
• iac - where we normally work and deploy from

The iac repo contains code for multiple AWS accounts and has the following folder structure: the root contains one folder per account, and each account can contain folders like:
• core
• mgmt
• datastores
• services
• security

Each team can have multiple accounts, but for each account within a team we use symlinked terraform files, e.g.
awsteam1test -> services -> eks -> team1-eks.tf symlinked to /etc/team1-eks
awsteam2prod -> services -> eks -> team1-eks.tf symlinked to /etc/team1-eks

We have no deploy pipeline (pipelines have not been possible to set up before because the repo is owned by an external company; we are now in the process of moving it to a repo we own and control). The project was developed and managed by 1 person, we are now 1.5 people managing it.

Current issues we are facing
• We have issues keeping things up to date, or rather, things are only updated when we have to work within a specific folder. This causes us to have to fix unrelated changes before we can deploy a potentially easy change.
• When we update a symlinked file, we often forget to update all environments and end up updating only the one environment we needed to do a change in.
• The iac-modules repo smells; my predecessor mentioned that the modules were created by an external company and that they reused whatever they had in house / were used to working with.
• How do we generally keep things up to date whenever there are new changes (e.g. is it a good idea to always apply every little change, collect changes and apply them in batches, or, as we do today, only update when we need to change something)?
• Probably more issues that we are not aware of.

Current routines we have
• Talk to each other before we do changes / deploys and ask each other if there are any unrelated changes causing problems.
a
👋 There is a lot of opportunity there! I think @Kief Morris may be a person with some good overarching advice? I know we worked together on shaping up a tf repo a few years ago and he sees this a lot.
Most concrete thing I can offer: now that you have the ability to run pipelines, I HIGHLY recommend running plans on at least a daily basis to identify drift. Ideally you would get broken builds when there is drift and then fix them in your regular processes. But even if you never look at the outputs, when you come back to the repo and have to do the fixes before you can get started, you will at least know roughly when the drift happened, which gives you the context (and the confidence) to fix it faster.
h
Thanks for the input! Sounds like a good candidate to put on top of our improvements list 🙌
k
At a first glance, the symlinking sounds like it could make it hard to develop and test changes to the code without disrupting other teams. Even if you use branches or something to make changes (which I wouldn't recommend), I think it would mean teams can't control when they're ready to deploy a change to the infra code.
I prefer being able to treat infra code as a versioned artifact, like an application artifact. I think most IaC setups manage code, libraries, and sharing in ways that you wouldn't do with application code.
h
Thanks for the input. I also prefer separate folders with potential duplicates over symlinks; you also easily get an overview of what's different in each environment by diffing the folders. The symlinks could work if you were able to perfectly separate the code and keep only the configuration in each folder, but they cause problems again when you have to update the code behind the symlink, and I doubt we will ever be able to perfectly separate code and config.
k
I find having separate folders gets very messy. You end up with differences that are hard to manage and mistakes when copying changes across them.
I'd recommend using modules, with versioning, either referring to them from git repos or a module registry
That way teams can pull and test module versions when they're ready, giving them control and flexibility, without managing drift between multiple copies of the same code
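To sketch what that could look like (the registry address, module path, version and inputs below are all placeholders, not your actual modules):
```hcl
module "eks" {
  # Option 1: a module registry, pinned to a release
  source  = "app.terraform.io/your-org/eks/aws"
  version = "1.4.0"

  # Option 2: a tag in the iac-modules git repo (use instead of the registry source;
  # git sources are pinned with ?ref= and don't take a version argument)
  # source = "git::https://github.com/your-org/iac-modules.git//eks?ref=v1.4.0"

  cluster_name    = "team1-test"
  cluster_version = "1.29"
}
```
Each team then bumps the version/ref in their own account folder when they are ready to take the new module release.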
t
So your repo is really close to being pretty ideal from the sound of it. Someone went to great lengths to set up symlinks to decrease code duplication... which, imo, is the right idea but the wrong execution. I'm going to throw out the tool I use for Terraform, Terramate, as a way to manage this mess a bit better. Here's an example repo in AWS that has the code separated out into nested stacks for AWS accounts that contain stacks that contain modules. The things I would do here are as follows:
1. Get rid of the symlinks
2. Use Terramate to generate all the code that was in the symlinked files
3. Treat all the code in the account folders as artifacts (this goes back to what Kief said above). You never modify the code in the account folders directly, only by changing Terramate configs.
4. Profit?
When that's done, you'll have a very DRY declarative IaC config that is easier to maintain. Furthermore, because all the changes in each environment are contained in code and easily readable, it makes it SUPER gitops friendly.
The tradeoff of having repos set up like this is that they're really complicated to keep track of and deploy correctly. I'd also check out the GitHub Actions in the above linked repo to see how you can manage the CI/CD in an automated fashion.
Once you get the repo and your process under control, you'll have a bunch of small stacks with a really limited blast radius. This means it'll be safer to run; since the stacks are smaller it'll speed up CI/CD times, and you'll have fewer merge conflicts while your team is working together. Small stacks are definitely, imo, the best IaC pattern possible.
h
@Kief Morris not sure if I follow, or maybe I was unclear. By folders I mean where you use / reference the modules per account, e.g.
team1test -> eks -> eks-module.tf <-- one module with versioning that different teams can use
This is the setup we have today, except that we use symlinks to keep everything the same across each team's environments. There are some signs of making variables out of everything where the module is used, e.g. in eks-module.tf, so that we can override things like version using local variables.
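Roughly, that pattern looks something like the sketch below (file names, module address and inputs are made up, not our actual code):
```hcl
# services/eks/team1-eks.tf - the symlinked file, identical in every account
module "eks" {
  # the module source/ref still has to be a literal string, so it is shared too
  source = "git::https://github.com/your-org/iac-modules.git//eks?ref=v1.4.0"

  cluster_name    = local.cluster_name
  cluster_version = local.cluster_version
}

# services/eks/locals.tf - NOT symlinked, differs per account
locals {
  cluster_name    = "team1-test"
  cluster_version = "1.29"
}
```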
Thanks for the input @Troy Knapp, we have looked at, but not used, Terramate and Terragrunt, and we were worried about the complexity they would add to our Terraform setup.
t
You have complexity now that you're not able to manage.
h
I can’t argue with that point 😅
t
One of my friends wrote this article a couple years ago, and I think her clever insight applies here as well. She frames this problem as "Where do you want your mess?" or WDYWYM for short. When doing almost anything computer related, we make compromises and push complexity somewhere where we can make tradeoffs that benefit us. I very firmly believe that Terraform/Tofu is no different.

The secret to a well designed IaC repo is to uncouple as much as possible and eliminate as many hard dependencies as you can... all while breaking the runs up to be as small as possible. This increases the complexity of the deploy, but computers are pretty good at doing complex things. Both Terragrunt and Terramate let you define a graph of the relationships between stacks and resolve deployments in a sane manner. Both allow you to use outputs from one stack as inputs into another, so any dependencies between stacks can be formalized and known.

IMO, Terramate gets an edge when it comes to change detection and its ability to run only the single stacks that actually changed, as opposed to having to run all the infra they depend on (Terragrunt can do this as well... but it's a paid feature). The other thing I like about Terramate is that it just looks and feels like Terraform from a user's perspective. You can still plan and apply locally just the same as you would with Terraform by itself. The big difference is that Terramate handles module versioning and all the other boilerplate Terraform code you need to get things running.
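As an example of formalizing a dependency between stacks, this is roughly what it looks like in Terragrunt (the stack path and output names here are made up):
```hcl
# terragrunt.hcl in a hypothetical eks stack
dependency "vpc" {
  config_path = "../vpc"
}

inputs = {
  vpc_id     = dependency.vpc.outputs.vpc_id
  subnet_ids = dependency.vpc.outputs.private_subnet_ids
}
```
The tool then knows the vpc stack has to be applied before the eks stack, instead of that ordering living in someone's head.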
h
Thanks for that! I'll try to formulate some kind of summary based on what I've learned so far:
1. We do not use modules properly and we should probably read / learn more about modules.
2. I guess our symlinks are an attempt to get rid of the mess from copy pasting stacks between environments, but instead we get another mess, particularly when we use shared variables; it has become difficult to figure out which variable to use and where to find them. Following this pattern, if we don't think about where we want our mess to be, our next move would probably be to something that manages our config variables (terragrunt?).
3. My predecessor created separate folders per stack, which to me seemed like a mess when it comes to deploying an application that has been split into multiple stacks. If I had known about terraform graph I could probably have used it to see the dependencies. Instead I tried to reduce that mess by creating a single folder for a single application (not a big application, but it ends up using many different modules and becomes quite large).

We are not sure where we want to keep our mess, but we do know what we do not want to do:
• spend too much time maintaining stacks (updating module versions etc)
• create a mess for each other when we deploy locally and forget to commit changes
Probably more, but that comes with time I guess.
^ the above only focuses on deployment, we have not started on the issue of debugging iac
t
> 2. I guess our symlinks are an attempt to get rid of the mess from copy pasting stacks between environments, but instead we get another mess, particularly when we use shared variables; it has become difficult to figure out which variable to use and where to find them. Following this pattern, if we don't think about where we want our mess to be, our next move would probably be to something that manages our config variables (terragrunt?).

One thing that I'll point out is that my current setup is only the last step of my journey; there are a lot of other possibilities as well. Symlinks feel like a clever way to get around Terraform workspaces... which kinda suck, but are a good way to keep code DRY as well. I've made some pretty good patterns over the years using config overrides for workspaces that do well for this sort of thing. This module has been INVALUABLE to that end. Basically, I just set up a bunch of maps in the locals with their keys being the workspace names of all the stacks (in your case the workspace would be equal to the account ID or account name), then pulled in a global config module and merged it against the local overrides. That gave me an easy to use pattern without too much repetition. That way you can enumerate your local variables in a DRY way and have it be reasonably easy to read.
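A rough sketch of that pattern, assuming workspaces named after accounts (the account names and settings below are placeholders):
```hcl
locals {
  defaults = {
    instance_type   = "m6i.large"
    cluster_version = "1.29"
  }

  # one key per workspace (here: per account)
  overrides = {
    awsteam1test = { instance_type = "t3.large" }
    awsteam1prod = { cluster_version = "1.28" }
  }

  # built-in merge() is enough for flat maps; the deepmerge module mentioned
  # in the next message does the same thing for nested maps
  config = merge(local.defaults, lookup(local.overrides, terraform.workspace, {}))
}
```
Everything downstream then reads local.config.* instead of scattering per-environment values around the stack.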
h
that’s one of the things that complicates making a choice for us: there are so many possibilities, and when we read articles about one of these possibilities, they generally don’t mention the path the authors have taken to get there. And it’s often difficult to realise the value you get when you jump straight to the end of someone else’s current path. https://registry.terraform.io/modules/Invicton-Labs/deepmerge/null/latest looks promising if we are going to follow our next immediate step in our journey, but we will also look at terramate again.
t
So... one of the cool things about Terramate is that its configs naturally override themselves in the same way, so there's no need for that module. You can have a global config at the root of the repo, then overrides in each of the account folders for ALL the nested stacks in that folder to inherit. You can have a config change in the EKS folder that only the nested stacks in the EKS folder have, etc. I really like this approach because when you write your Terraform, the values that Terramate inherited at that stack get written directly into the generated Terraform code. There's really not much need for local variables for that sort of thing. It makes the code a LOT easier to read and grok.
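Very roughly, that inheritance looks something like this (paths, values, and the module address are made up for illustration):
```hcl
# /globals.tm.hcl at the repo root - default for every stack
globals {
  cluster_version = "1.29"
}

# /awsteam1prod/globals.tm.hcl - override inherited by all nested stacks in this account
globals {
  cluster_version = "1.28"
}

# /awsteam1prod/services/eks/eks.tm.hcl - generates eks.tf with the inherited value baked in
generate_hcl "eks.tf" {
  content {
    module "eks" {
      source          = "git::https://github.com/your-org/iac-modules.git//eks?ref=v1.4.0"
      cluster_version = global.cluster_version
    }
  }
}
```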
I'll note that I'll still use locals especially if I need to do something dynamic like data queries or whatever.
I'll also note that module sources in Terraform don't allow interpolation (which is, frankly, mind blowing), and neither do provider versions. So one of the things I ran up against trying to have a DRY setup is: how do I upgrade a module in dev and let the changes cook before I move them to stage or prod? The only way to do that with a workspace or symlinked stack is to delay applying the changes until they are fully vetted in a lower environment. This is ok if you have a TACOS that can handle it well... but it's not very GitOps friendly. Now what's actually deployed only eventually matches what's in your repo while you're trying things out.
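To make the limitation concrete (repo URL and ref are placeholders):
```hcl
# Not valid Terraform today - module sources must be literal strings:
# module "eks" {
#   source = "git::https://github.com/your-org/iac-modules.git//eks?ref=${var.eks_module_ref}"
# }

# Only a hard-coded ref works, so every workspace / symlinked copy upgrades at the same time:
module "eks" {
  source = "git::https://github.com/your-org/iac-modules.git//eks?ref=v1.5.0"
}
```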
OpenTofu is actively fixing the interpolation of module sources and provider versions, btw.
h
thanks, that’s good to know!
> The only way to do that with a workspace or symlinked stack is to delay applying the changes until they are fully vetted in a lower environment. This is ok if you have a TACOS that can handle it well... but it's not very GitOps friendly. Now what's actually deployed only eventually matches what's in your repo while you're trying things out.
Would you mind elaborating on this a little more? I’m not understanding what it entails. From my current viewpoint, using gitops, everything is only eventually consistent anyway, since the changes have to be applied and that might take some time. I'll also mention here that we have the option to use https://github.com/flux-iac/tofu-controller in our kubernetes cluster as a gitops style pipeline, with continuous drift detection and the ability to create dependencies between stacks.
t
> Would you mind elaborating on this a little more? I’m not understanding what it entails.
Let's say, for example, you want to make a change in a stack, but that stack is symlinked or is part of a workspace. Let's also assume you have 3 environments: dev/stage/prod. Because the code is shared between all three environments, when you make a change in one place, it changes in all three. This is both a strength and a weakness. It's a strength because it keeps all your environments in sync. (You can't imagine how many problems I could have fixed if I had just had the ability to have stage and prod match completely.) It's a weakness because once you commit and merge your code, you have to figure out how to manage testing your lower environments first, then rolling it out to upper environments. For small changes, this isn't a big deal... but for big ones it certainly can be.

For example, over the last three weeks I've migrated all my app's secrets away from AWS Secrets Manager to Doppler. Every single secret was touched, new DB users with new permissions were created... it was a BIG change, and its rollout needed to be carefully controlled. I made three PRs, one for each environment, three weeks ago: merged the first into dev, let it cook for a week, merged the second to stage, let it cook for a week, then merged the last into prod yesterday.

Imagine if I had made a similarly sweeping change in a workspace/symlink. Git has a linear history, so my workflow would be: commit the big change, test on dev, wait for a week WHILE BLOCKING ALL CHANGES FROM BEING APPLIED TO STAGING. Any change that I want to apply to stage now includes the big change that I want to let cook on dev, therefore I can't apply it. In the case described above that means I'm essentially blocking prod for 3 weeks. Furthermore, I have a job, and just because things are being tested on a lower environment doesn't mean I have a 3 week vacation. What if I make ANOTHER big change that I need to test for a bit in lower environments first? If I apply that to dev, then any of my future applies in stage/prod would include THAT code too. Or what happens if I have to make an emergency change on prod while my awesome new feature is cooking on stage?

None of these are insurmountable problems, but they are cases that you need to think about and plan for. Maybe push your dev team to have better integration tests on their application that you can run... Maybe create integration tests of your own to test changes faster. Maybe have a TACOS that can queue up changes that you can run sequentially (as opposed to just applying latest). Maybe you need to be able to revert changes on the fly...

So, the above problems are the mess you get for having all your terraform code reused. THIS is why I got on the Terramate train. I now generate tofu code (I use OpenTofu, not Terraform, personally) as artifacts. That allows me to split up complicated changes into one PR per environment without code duplication. This does make my repo more complicated to contribute to, and that is the mess I'm willing to deal with.
h
Now I understand what you are talking about and see the weakness you mean. I’ve also looked into other tools, like https://github.com/flux-iac/tofu-controller that I referenced; it is missing a couple of features that other tools have. env0 and spacelift look nice, but their drift detection is enterprise only. atlantis looks like it’s only a workflow engine with no guard rails (though we have OPA and can configure guard rails if we host atlantis in kubernetes), and self-hosting creates another mess of course. Then terramate: does it have its own pipelines or do you add it to your existing pipelines? I saw some references to a PR workflow with terramate.
t
Env0 is probably my favorite TACOS. Their pricing model is awesome, and their runners are very well architected. Andrew, their head of sales engineering, is awesome. Their workflow system also works VERY well with workspaces (in fact, it's designed to use them). For me, one of the biggest problems I had with Atlantis was the lack of support for multiple parallel builds. IMO, having tons of stacks is THE IaC pattern to follow because of the ability to parallelize runs, and only being able to kick off one at a time is a huge bummer.
> terramate, does it have its own pipelines or do you add it to your existing pipelines?
Terramate is kinda the anti-TACOS? Their entire system is built on using GitHub Actions (like in the example repo I linked to earlier). We run all our Terramate runners on a GitHub ARC operator in our infra k8s cluster. Terramate's cloud offering is still in beta, but it's pretty much some really cool slack integrations (it DMs you when your apply fails), a great way to monitor stack deployments and resource changes, and reports on stack health. Their cloud offering does WAY less than TACOS I've used in the past, but with the strong GitHub integration it really doesn't need to do as much.
The way we've got terramate working now is that we have a job that does change detection and generates a matrix of all the things that changed in each account/folder, then runs any changes resident to those nested stacks. Frankly speaking, I like Env0's pipelines better... but I really don't need as much fancy UI help with terramate.
h
having the ability to run builds in parallel is a must. Thanks for all the information so far, it’s been very valuable for me! 🙌 I think my next step will be to test out some of the different tools to get a feel for them.
Little update here from our end:

Pipelines
We still do not have a good pipeline. At the moment there's only one person (me) working on the terraform repo and changes are rare, so we will revisit this topic when we are more than one person. That said, we did check out some of the bigger iac pipeline providers, but they ended up being too expensive for us at the moment. We also tried out the tofu-controller, but were not able to test it fully because the flux source controller does not work directly with symlinks, ref https://github.com/fluxcd/flux2/issues/3619. We did not explore the include feature https://fluxcd.io/flux/components/source/gitrepositories/#include mentioned in the issue. We also tried out burrito https://github.com/padok-team/burrito, but here we are facing issues with locked terraform stacks: if the burrito pipeline somehow fails in the middle, the stack gets locked and burrito is not able to force unlock it. We are using karpenter and spot instances, which increases the chance of the burrito pipeline stopping in the middle. You could probably solve this somehow, but we decided not to invest any further time into it.

Our current focus
We have shifted our focus to looking into ways of letting our teams provision iac by themselves. Normally I think you would build the abstractions for the teams using terraform modules, but in our case we don't have proper terraform pipelines, we have more k8s expertise than terraform expertise, and we are already exposing the teams to kubernetes manifests. So we are currently looking into kube resource orchestrator (kro) https://kro.run/ (https://github.com/kro-run/kro) to create the abstraction layer and package multiple services in one, together with aws controllers for kubernetes https://aws-controllers-k8s.github.io/community/docs/community/overview/ - we are already all in on aws, and this gives us the pipelines we are missing with terraform.
a
Very nice choice! I see a lot of people choosing YAML over terraform for templating. I personally used the GCP version (Google Config Connector) at my last role and it helped cross the barrier into the app teams.
h
Yeah, it seems like the easier choice for us, though worth mentioning that ack / kro is not something that will be able to solve all our cases, but it should be able to solve most of the ones related to app teams and self provisioning of infrastructure.
t
After a brief foray into Bicep, I've quickly become skeptical of any IaC solutions that don't do state tracking. That's been one of the thin lines that's kept me from other IaC solutions. ACK looks really interesting though...