# general
j
Hi everyone, I want to ask about best practices for managing a common resource in Terraform. Here's the detail of our case: we have 50+ micro-services running on our Kubernetes cluster, and most of them use the EKS OIDC connector to assume an AWS role and get authorization for AWS services. Because the current K8s cluster version is very old, we decided to migrate all existing services to a new Kubernetes cluster. Since OIDC connectors are dedicated to each cluster, we need to add the new cluster's OIDC connector to those service roles' IAM assume policies. Currently each service keeps its IAM assume policy in its own Terraform folder, which means we need to update 50+ Terraform folders and apply all of them, which is a bit painful for the SRE team. I wonder, for common resources such as the IAM assume policies, should we move them to a centralized Terraform folder, or keep them in each micro-service's folder? Please share your thoughts. Thanks!
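For concreteness, a minimal sketch of the kind of per-service trust policy being updated: a role that trusts the OIDC provider of both the old and the new cluster during the migration. The account ID, issuer URLs, namespace and service name below are placeholders.

```hcl
locals {
  # Placeholder OIDC issuer paths for the old and new clusters
  oidc_issuers = [
    "oidc.eks.eu-west-1.amazonaws.com/id/OLDCLUSTER",
    "oidc.eks.eu-west-1.amazonaws.com/id/NEWCLUSTER",
  ]
  account_id = "123456789012"
}

data "aws_iam_policy_document" "assume" {
  # One trust statement per cluster OIDC provider
  dynamic "statement" {
    for_each = toset(local.oidc_issuers)
    content {
      actions = ["sts:AssumeRoleWithWebIdentity"]

      principals {
        type        = "Federated"
        identifiers = ["arn:aws:iam::${local.account_id}:oidc-provider/${statement.value}"]
      }

      # Restrict the role to one Kubernetes service account
      condition {
        test     = "StringEquals"
        variable = "${statement.value}:sub"
        values   = ["system:serviceaccount:my-namespace:my-service"]
      }
    }
  }
}

resource "aws_iam_role" "service" {
  name               = "my-service"
  assume_role_policy = data.aws_iam_policy_document.assume.json
}
```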
d
I think it’s a matter of what team is directly responsible for maintaining the IaC. If distributed teams are responsible for maintaining their own IaC per corporate governance standards, I would keep it in each micro service. If you have a centralized infrastructure team that maintains all TF state, I would create a central repository.
a
We use an in-house EKS IAM module to create the Role (with the AssumeRolePolicy pre-configured as necessary) that is used in each service's project. That way we can easily create them, and if we need to update them all we can change the module and then use automation to trigger downstream changes to each project (for every invocation of the module).
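A hypothetical shape of that pattern, purely illustrative (the module source and inputs are invented, not a real registry module): each service project pins a shared module that owns the assume-role policy, so adding a new cluster's OIDC provider is a single change inside the module plus automated re-plans downstream.

```hcl
# Illustrative only: module source and inputs are assumptions, not a real module.
module "service_role" {
  source = "git::https://example.com/infra/terraform-eks-service-role.git?ref=v2.3.0"

  service_name = "payments-api"
  namespace    = "payments"

  # The module owns the AssumeRolePolicy; adding a new cluster's OIDC
  # provider only requires a change inside the module, then automation
  # re-plans every project that pins it.
  cluster_names = ["prod-eks-old", "prod-eks-new"]
}
```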
k
Team ownership is one of the forces to consider when deciding how to design this stuff. Another is change - what tends to change together? Centralizing all of the infra code in one place, as you see, can make the system brittle by coupling everything to the deployment of that infrastructure. Andrew's setup is a good way to decouple by making separate deployed instances while still sharing the code. You can still run into tangles where there are dependencies across different components, so you end up needing to coordinate the deployment of all the pieces. I advocate being rigorous about designing infrastructure as loosely coupled pieces, and being able to deploy each separately; pipelines and tests for the components can help by forcing you to keep things decoupled (as with unit testing and TDD for application code, it drives good design practices).
There are some interesting tools and things that let you specify what infrastructure to deploy for each service or application, so the spec can live with the application code, but refer to the infrastructure code to deploy. I'm looking at things like Score (hi Humanitec) and https://sst.dev/, although I haven't dug deep into either one.
s
It sounds like you're using IRSA? If so, I've found success using the https://registry.terraform.io/modules/terraform-aws-modules/iam/aws/latest/submodules/iam-role-for-service-accounts-eks module. Then in Terraform, use the Kubernetes provider to create the service account (from a regular .yml manifest) and reference the module's role in the annotations.
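A minimal sketch of that setup, assuming the v5 interface of the terraform-aws-modules IRSA submodule (input names may differ between versions) and using the kubernetes_service_account resource in place of a raw .yml manifest; the names, namespace and module.eks references are placeholders.

```hcl
module "my_service_irsa" {
  source  = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"
  version = "~> 5.0"

  role_name = "my-service"

  # Trust the cluster's OIDC provider for one namespace:serviceaccount pair
  oidc_providers = {
    this = {
      provider_arn               = module.eks.oidc_provider_arn # placeholder reference
      namespace_service_accounts = ["my-namespace:my-service"]
    }
  }
}

# Service account created via the Kubernetes provider, annotated with the
# role ARN so IRSA injects the web-identity credentials into the pods.
resource "kubernetes_service_account" "my_service" {
  metadata {
    name      = "my-service"
    namespace = "my-namespace"
    annotations = {
      "eks.amazonaws.com/role-arn" = module.my_service_irsa.iam_role_arn
    }
  }
}
```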
j
We have a bit of a larger workload, and we've settled on a 3-tier system: (1) "system", (2) "shared" and (3) "application". 1 and 2 are environment (and AWS-account) centric, 3 is application centric. We can ignore system for now; shared is where we provision things like EKS clusters, as we classify them as shared features that applications depend on. Another shared 'feature' is state. We don't allow state persistence in EKS by default; state has to go into S3, RDS, MSK etc. That way the application lifecycle gets much more robust.

We also assume an EKS cluster might be lost or compromised, and as such it must be replaceable without impacting state. Therefore, we put EKS clusters in their own AWS account, and data in a separate AWS account. That does mean that any cluster that uses IRSA will need to have an Identity Provider entry for the cluster CA in that data environment.

So what we ended up doing to enable this is storing a record of every active cluster in an S3 bucket (pretty easy with Terraform to get some map into JSON format). This allows you to store things like the cluster FQDN, name, environment, CA fingerprint etc. in a way that can be easily retrieved later.

Then we have two modules that are used with this data. The first is a cluster record module that pulls in all those files and turns them back into a local Terraform map, which you can then filter in a for loop to only get the clusters you want (i.e. only 'development' or 'production'). You end up with a list of data about those clusters, which you then use with a second module: identity provider. You give it a list of the client IDs and thumbprints you want it to support, which you get an up-to-date copy of from the first module. It ensures that you can create identity providers anywhere you want, with support for any source cluster you want. Operationally, that means you can give any AWS account an identity provider that works with IRSA for the clusters of your choice (since IRSA uses the AWS general STS endpoints anyway).

A third module is used for setting up Roles with an AssumeRoleWithWebIdentity trust policy. We made it usable on an application level, where you provide it with at least the name of your application and the cluster it runs in (it will use those parameters to generate the subject for the Kubernetes SA and read the cluster records to verify the client IDs exist). It then generates an AWS IAM Role that can be assumed with the web identity, based on the Identity Provider that was pre-provisioned with the module in that shared stage.

The only thing you need to ensure to make this work in a stable fashion is to have something like a workflow engine or Renovate bot automatically trigger a terraform plan (and apply if needed) when a cluster is added or removed, so identity providers and roles are updated if the infrastructure changes. But that's something you would need to do anyway if you ever add or remove a cluster, even if it is only once a year.
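A rough sketch of the cluster-record mechanism described above, under stated assumptions: the bucket name, record fields, environment filter, and the module.eks/var references are placeholders, and the actual record and identity-provider modules are in-house, so this only illustrates the moving parts.

```hcl
# Each cluster stack writes its own record as JSON.
resource "aws_s3_object" "cluster_record" {
  bucket       = "example-cluster-records"                    # placeholder bucket
  key          = "clusters/${module.eks.cluster_name}.json"   # placeholder reference
  content_type = "application/json"
  content = jsonencode({
    name        = module.eks.cluster_name
    environment = var.environment
    issuer_url  = module.eks.cluster_oidc_issuer_url
    thumbprint  = var.cluster_ca_thumbprint                   # placeholder variable
  })
}

# Elsewhere, a "cluster records" module reads all records back and filters them.
data "aws_s3_objects" "records" {
  bucket = "example-cluster-records"
  prefix = "clusters/"
}

data "aws_s3_object" "record" {
  for_each = toset(data.aws_s3_objects.records.keys)
  bucket   = "example-cluster-records"
  key      = each.value
}

locals {
  all_clusters  = { for k, obj in data.aws_s3_object.record : k => jsondecode(obj.body) }
  # Keep only the environment you care about, e.g. production
  prod_clusters = { for k, c in local.all_clusters : k => c if c.environment == "production" }
}

# The identity-provider module can then create one OIDC provider per cluster
# in whatever account needs IRSA against those clusters.
resource "aws_iam_openid_connect_provider" "cluster" {
  for_each        = local.prod_clusters
  url             = each.value.issuer_url
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = [each.value.thumbprint]
}
```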
It's all pretty much based on the Well-Architected Framework from AWS, the cloud architecture blueprint from Google and the EKS Security Best Practices (also from AWS). HashiCorp didn't have a standard for this, and neither did the AWS Terraform module library.
s
@John Keates that makes sense, we did something similar at one point where we abstracted the network/workloads/applications as a test run for permission boundaries. We output the tf state into .json and used it through the different abstractions. Honestly it didn't work out the way we wanted it to, but it's along the same thread.