Havard Noren
10/28/2024, 11:36 AM
• iac-modules - containing reusable modules
• iac - where we normally work and deploy from
The iac repo contains code for multiple AWS accounts and has the following folder structure:
root contains one folder per account, each account can contain folders like:
• core
• mgmt
• datastores
• services
• security
Each team can have multiple accounts, but for each account within a team we use symlinked Terraform files, e.g.
awsteam1test -> services -> eks -> team1-eks.tf (symlinked to /etc/team1-eks)
awsteam2prod -> services -> eks -> team1-eks.tf (symlinked to /etc/team1-eks)
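For illustration, that layout might look roughly like this (using the account and file names from the example above; the exact tree is an assumption):
```
iac/
├── awsteam1test/
│   ├── core/
│   ├── mgmt/
│   ├── datastores/
│   ├── security/
│   └── services/
│       └── eks/
│           └── team1-eks.tf -> /etc/team1-eks   # symlink to the shared file
└── awsteam2prod/
    └── services/
        └── eks/
            └── team1-eks.tf -> /etc/team1-eks   # symlink to the same file
```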
We have no deploy pipeline (pipelines have not been possible to set up before because the repo is owned by an external company; we are now in the process of moving it to a repo we own and control).
The project was developed and managed by 1 person; we are now 1.5 people managing it.
Current issues we are facing
• we have issues keeping things up to date, or rather, things are only updated when we have to work within a specific folder. This causes us to have to fix unrelated changes before we can deploy a potentially easy change.
• when we update the symlinked file, we often forget to update all environments and end up updating only the one environment we needed to do a change in.
• the iac-modules repo smells; my predecessor mentioned that the modules were created by an external company and that they reused whatever they had in house / were used to working with
• how do we generally keep things up to date whenever there are new changes (e.g. is it a good idea to always apply every little change, collect changes and update them in batches, or, as we do today, only update when we need to change something)
• probably more issues that we are not aware of
Current routines we have
• talk to each other before we do changes / deploys and ask each other if there are any unrelated changes causing problems.
Abby Bangser
10/28/2024, 11:45 AM
Abby Bangser
10/28/2024, 11:47 AM
Havard Noren
10/28/2024, 11:59 AM
Kief Morris
10/28/2024, 12:42 PM
Kief Morris
10/28/2024, 12:43 PM
Havard Noren
10/28/2024, 12:50 PM
Kief Morris
10/28/2024, 1:00 PM
Kief Morris
10/28/2024, 1:00 PM
Kief Morris
10/28/2024, 1:01 PM
Troy Knapp
10/28/2024, 1:12 PM
Troy Knapp
10/28/2024, 1:15 PM
Troy Knapp
10/28/2024, 1:18 PM
Havard Noren
10/28/2024, 1:18 PM
> folders
I mean where you use / reference the modules per account, e.g.
team1test -> eks -> eks-module.tf <-- one module with versioning that different teams can use
This is the setup we have today, except that we use symlinks to keep everything the same for each team's environments. There are some signs of making variables out of everything where the module is used, e.g. in eks-module.tf, so that we can override things like the version using local variables.
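As a rough sketch of what such a shared, symlinked file could look like (the module source and names below are assumptions for illustration, not the actual code):
```
# team1-eks.tf - shared via symlink across a team's accounts (hypothetical names)

locals {
  # Per-environment overrides would live in a non-symlinked locals file
  # next to this one, so each account can pin its own values.
  eks_module_version = "20.8.4"
  cluster_name       = "team1-test"
}

module "eks" {
  source  = "terraform-aws-modules/eks/aws" # assumed public module, for illustration
  version = local.eks_module_version

  cluster_name    = local.cluster_name
  cluster_version = "1.29"
}
```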
Havard Noren
10/28/2024, 1:21 PM
Troy Knapp
10/28/2024, 1:23 PM
Havard Noren
10/28/2024, 1:29 PM
Troy Knapp
10/28/2024, 1:43 PM
Havard Noren
10/28/2024, 2:18 PM
mess from copy pasting stacks between environments, but instead we get another mess, particularly when we use shared variables: it has become difficult to figure out which variable to use and where to find them. Following this pattern, if we don't think about where we want our mess to be, our next move would probably be to something that manages our config variables (terragrunt?).
3. My predecessor created separate folders per stack, which to me seemed like a mess when it comes to deploying an application that has been split into multiple stacks. If I had known about terraform graph, I could probably have used that to see the dependencies. Instead I tried to reduce that mess by creating a single folder for a single application (not a big application, but it ends up using many different modules).
We are not sure where we want to keep our mess, but we do know what we do not want to do:
• spending too much time maintaining stacks (updating module versions etc.)
• creating a mess for each other when we deploy locally and forget to commit changes
Probably more, but that comes with time I guess.
Havard Noren
10/28/2024, 2:19 PM
Troy Knapp
10/28/2024, 2:28 PM
> mess from copy pasting stacks between environments, but instead we get another mess, particularly when we use shared variables: it has become difficult to figure out which variable to use and where to find them. Following this pattern, if we don't think about where we want our mess to be, our next move would probably be to something that manages our config variables (terragrunt?).
One thing that I'll point out is that my current setup is only the last step of my journey; there are a lot of other possibilities as well. Symlinks feel like a clever way to get around Terraform workspaces... which kinda suck, but are a good way to keep code DRY as well.
I've made some pretty good patterns over the years using config overrides for workspaces that do well for this sort of thing.
This module has been INVALUABLE to that end. Basically, I just set up a bunch of maps in the locals with their keys being the workspace names (in your case the workspace would be equal to the account ID or account name) of all the stacks, then pulled in a global config module and merged it against the local overrides. That gave me an easy to use pattern without too much repetition.
That way you can enumerate your local variables in a DRY way and have it be reasonably easy to read.
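A minimal sketch of the workspace-keyed override pattern Troy describes, using only plain Terraform built-ins (the module he refers to presumably does a deeper merge; all names and values here are illustrative assumptions):
```
locals {
  # Global defaults shared by every account/workspace.
  defaults = {
    eks_version   = "1.29"
    instance_type = "m6i.large"
  }

  # Per-workspace overrides, keyed by workspace (account) name.
  overrides = {
    awsteam1test = {
      instance_type = "t3.large"
    }
    awsteam1prod = {
      eks_version = "1.28"
    }
  }

  # Shallow merge of the defaults with the current workspace's overrides.
  # (Nested maps would need a deep-merge module rather than merge().)
  config = merge(
    local.defaults,
    lookup(local.overrides, terraform.workspace, {})
  )
}

# Usage elsewhere: local.config.eks_version, local.config.instance_type, ...
```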
Havard Noren
10/28/2024, 2:41 PM
terramate again.
Troy Knapp
10/28/2024, 2:48 PM
Troy Knapp
10/28/2024, 2:50 PM
Troy Knapp
10/28/2024, 3:00 PM
Troy Knapp
10/28/2024, 3:01 PM
Havard Noren
10/29/2024, 8:22 AM
> So the only way to do that if using a workspace or symlinked stack is to just delay applying the changes until they are fully vetted in a lower environment. This is ok if you have a TACOS that can handle this well... but its not very GitOps friendly. Now you have code that's eventually going to be compatible with what's in your repo while you're trying things out
Would you mind elaborating on this a little more? I'm not understanding what it entails.
From my current viewpoint, using GitOps, every change will always be eventually compatible, since the changes have to be applied and that might take some time.
Just to also mention here that we have the option to use https://github.com/flux-iac/tofu-controller in our Kubernetes cluster as a GitOps-style pipeline, with continuous drift detection and the ability to create dependencies between stacks.
Troy Knapp
10/29/2024, 12:31 PM
> Would you mind elaborating on this a little more? I'm not understanding what it entails.
Let's say, for example, you want to make a change in a stack, but that stack is symlinked or it's part of a workspace. Let's also assume you have 3 environments: dev/stage/prod. Because the code is shared between all three environments, when you make a change in one place, it changes in all three. This is both a strength and a weakness. It's a strength because it keeps all your environments in sync. (You can't imagine how many problems I could have fixed if I had just had the ability to have stage and prod match completely.) It's a weakness because once you commit and merge your code, you have to figure out how to manage testing your lower environments first, then rolling it out to upper environments.
For small changes, this isn't a big deal... but for big ones it certainly can be. For example, over the last three weeks I've migrated all my app's secrets away from AWS Secrets Manager to Doppler. Every single secret was touched, new DB users with new permissions were created... it was a BIG change, and its rollout needed to be carefully controlled. I made three PRs, one for each environment, three weeks ago: merged the first into dev, let it cook for a week, merged the second to stage, let it cook for a week, then merged the last into prod yesterday.
Imagine if I had made a similarly sweeping change in a workspace/symlink. Git has a linear history, so my workflow would be: commit the big change, test on dev, wait for a week WHILE BLOCKING ALL CHANGES FROM BEING APPLIED TO STAGING. Any changes that I want to apply to stage now include the big awesome change that I want to let cook on dev, therefore I can't apply them. This means, in the case described above, that I'm essentially blocking prod for 3 weeks. Furthermore, I have a job, and just because things are being tested on a lower environment doesn't mean I have a 3 week vacation. What if I make ANOTHER big change that I need to test for a bit in lower environments first? If I apply that to dev, then any of my future applies in stage/prod would include THAT code too. Or what happens if I have to make an emergency change on prod while my awesome new feature is cooking on stage?
None of these are insurmountable problems, but they are cases that you need to think about and plan for. Maybe push your dev team to have better integration tests on their application that you can run... Maybe create integration tests of your own to test changes faster. Maybe have a TACOS that can queue up changes that you can run sequentially (as opposed to just applying latest). Maybe you need to be able to revert changes on the fly...
So, the above problems are the mess that you get for having all your Terraform code reused. THIS is why I got on the Terramate train. I now generate tofu (I use OpenTofu, not Terraform personally) code as artifacts. That allows me to split up complicated changes into one PR per environment without having code duplication. This does make my repo more complicated to contribute to, and that is the mess that I'm willing to deal with.
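Roughly, the Terramate approach Troy mentions works by keeping per-environment stacks with their own globals and generating the Terraform/OpenTofu code from a shared template, so the generated files can be committed and promoted one environment at a time. A sketch with assumed names (consult the Terramate docs for the authoritative syntax):
```
# stacks/dev/stack.tm.hcl - one stack per environment (illustrative)
stack {
  name        = "eks-dev"
  description = "EKS for the dev environment"
}

globals {
  environment = "dev"
  eks_version = "1.29"
}

# Shared config in a parent directory: `terramate generate` renders eks.tf
# into each stack, producing per-environment code you commit and review.
generate_hcl "eks.tf" {
  content {
    module "eks" {
      source          = "terraform-aws-modules/eks/aws"
      cluster_name    = "team1-${global.environment}"
      cluster_version = global.eks_version
    }
  }
}
```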
Havard Noren
10/29/2024, 12:55 PM
Troy Knapp
10/29/2024, 1:00 PM
Troy Knapp
10/29/2024, 1:09 PM
Troy Knapp
10/29/2024, 1:22 PM
Havard Noren
10/29/2024, 2:02 PM
Havard Noren
03/17/2025, 9:43 AM
We tried tofu-controller, but were not able to test it fully because the Flux source controller does not work directly with symlinks, ref https://github.com/fluxcd/flux2/issues/3619
We did not explore the include feature https://fluxcd.io/flux/components/source/gitrepositories/#include mentioned in the issue.
We also tried out burrito https://github.com/padok-team/burrito, but here we are facing issues with locked Terraform stacks. If the burrito pipeline somehow fails in the middle, the stack gets locked and burrito is not able to force-unlock it. We are using Karpenter and spot instances, which increases the chance of the burrito pipeline stopping in the middle. You could probably solve this somehow, but we decided not to invest any further time into it.
Our current focus
We have shifted our focus to looking into ways of letting our teams provision IaC by themselves. Normally, I think, you would make the abstractions for the teams using Terraform modules, but in our case we don't have proper Terraform pipelines, we have more k8s expertise than Terraform expertise, and we are already exposing the teams to Kubernetes manifests. To do this we are currently looking into:
• kube resource orchestrator (kro) https://github.com/kro-run/kro / https://kro.run/ to create the abstraction layer and package multiple services in one
• aws controllers for kubernetes (ACK) https://aws-controllers-k8s.github.io/community/docs/community/overview/ - we are already all-in on AWS and this gives us the missing pipelines we don't have with Terraform
Abby Bangser
03/17/2025, 10:21 AM
Havard Noren
03/17/2025, 10:35 AM
Troy Knapp
03/17/2025, 12:50 PM