# platform-blueprints
t
Hello! I’d love to pick your brains about the abstractions you build over your platforms and the service templates they have. Our goal is to:
• provide autonomy for teams to work with both application code and infrastructure,
• abstract golden paths/most common use cases,
  ◦ abstract infrastructure, but give the ability to extend it,
• make it easy to roll out template updates to the existing services that were created with these templates.
Let’s start with an example:
1. Team A needs a containerised backend app with a database.
2. We have a template for that - they just clone it and modify the Python code; everything else is set up and part of the same repository, including deployments and infrastructure in Terraform and Terragrunt.
3. Next, they need to add an SQS queue: add aws_sqs_queue in the Terraform files and set up the correct IAM permissions for the container to publish messages to the queue.
4. Then they add DynamoDB with Kinesis & S3.
How do you approach building the abstractions here? Let’s say that the initial containerised app and database is a golden path for us. Currently, we don’t have much of an abstraction layer and people need to play with Terraform directly on every change. We’d love to build an abstraction layer over it, but are wondering about a few choices:
• Migrating towards Terraform CDK to have TypeScript/Python flexibility
• Using or creating a YAML syntax similar to score.dev or AWS Copilot
• Leveraging Terraform modules more
How do you approach it? Have you tried any solutions that turned out to be the wrong direction? What worked out for you, that engineers were happy with?
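To make step 3 of the example concrete, here is a minimal Python sketch of the IAM policy document the container's role would need in order to publish to one queue. `sqs:SendMessage` is the SQS action for publishing; the function name and the queue ARN are hypothetical and only for illustration.

```python
import json


def sqs_publish_policy(queue_arn: str) -> dict:
    """Build an IAM policy document that allows publishing to a single SQS queue."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["sqs:SendMessage"],
                "Resource": queue_arn,
            }
        ],
    }


# Hypothetical ARN, for illustration only.
policy = sqs_publish_policy("arn:aws:sqs:eu-west-1:123456789012:orders")
print(json.dumps(policy, indent=2))
```

An abstraction layer would generate this kind of statement from a short "this service publishes to queue X" declaration, so teams don't have to hand-write it on every change.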
j
Hi, if you are on AWS, AWS CDK seems to be a good choice (meaning it will cover all 3 of your requirements). No idea about the other 2 choices you've mentioned.
t
Hi! We evaluated AWS CDK some time ago and ruled it out because we also need e.g. the GitLab Terraform provider. AFAIR AWS CDK doesn't support 3rd party services, do you know if something changed here? I saw Datadog has their own construct, but I'm not sure if adoption by 3rd party vendors is picking up now 🤔
b
It sounds like you have a need for Terraform because it can handle all expected providers, platforms, and third parties. If you distribute templates for developers, don't neglect lifecycle management. (e.g. How do you perform upgrades and fix bugs and roll out those changes?) You could start with a baseline set of Terraform modules.
You also have to consider (a) what resources do you have to maintain this (b) what does the future look like (what platforms are we supporting, how skilled are engineers)
t
Thanks Bradley! We kind of have a set of baseline Terraform modules, but we're looking to introduce more abstraction over it, without losing the flexibility to still tinker with details and handle the cases that are not covered by the abstraction layer itself. We internally call it a "composable platform" project. On your considerations: we have resources and skilled engineers, and we use mainly AWS for cloud resources and GitLab for CI/CD. On the other hand, I'd love to avoid building too many things in-house and spend more time on more important problems 😄 Have you had similar considerations in your case?
b
Having your tech stack constrained helps you immensely. You can avoid edge cases from supporting different platforms.
m
is the current way to do this fully documented end-to-end, step-by-step? if not, going through that exercise and finding the biggest gaps/manual steps might help, and I'm sure your teams would appreciate the docs while you work on the platform.
v
For me, the golden path would be this:
1. Developer should be able to create a new backend app in the list of all services.
2. They can then add a dependency for SQS. My platform can then show them the available queues and ask if they want to use one of them or create a new one, and provide the queue details in environment variables that they can use.
3. Developer does a release, the backend app is deployed, it uses environment variables to access the queue, and the queue is also created by the platform. All the required permissions, alerts and monitoring are also configured by the platform.
4. My platform should also enable a Bring Your Own Queue mechanism.
We have achieved all of this using Terraform and a lot of custom scripts and code, but now it takes care of upgrades, rollbacks, backups and recovery, autoscaling, access control and even IaC code updates.
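A rough Python sketch of steps 2 to 4, under stated assumptions: the platform keeps a map of known queue URLs, creates any queue that is missing, and hands the result to the app as environment variables. All names here (`resolve_queue_env`, the `_QUEUE_URL` suffix, the example URLs) are hypothetical, and `create` stands in for whatever actually provisions the queue.

```python
from typing import Callable


def env_var_name(queue_name: str) -> str:
    """Turn a queue name such as 'email-events' into 'EMAIL_EVENTS_QUEUE_URL'."""
    return queue_name.upper().replace("-", "_") + "_QUEUE_URL"


def resolve_queue_env(
    deps: list[str],
    existing: dict[str, str],
    create: Callable[[str], str],
) -> dict[str, str]:
    """Map each declared queue dependency to an environment variable.

    Queues already known to the platform (including ones a team brought
    themselves) are reused; missing ones are created via `create`.
    """
    env = {}
    for name in deps:
        url = existing.get(name)
        if url is None:
            url = create(name)      # platform provisions the queue
            existing[name] = url    # and remembers it for other services
        env[env_var_name(name)] = url
    return env


# Hypothetical stand-in for real provisioning.
def fake_create(name: str) -> str:
    return f"https://sqs.example.invalid/{name}"


known = {"orders": "https://sqs.example.invalid/orders"}
env = resolve_queue_env(["orders", "email-events"], known, fake_create)
print(env)
```

Bring Your Own Queue falls out of the same shape: a team registers its own queue URL in the `existing` map and the platform reuses it instead of creating a new one.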
p
It sounds like you’re trying to solve for an interface and some labor for certain managed cloud services. Instead of the client teams adding Terraform config and such, you want to automate its provisioning. This is usually approached as a GUI or CLI where the user chooses what they want and then an automation engine does all the things. The tricky part here is rolling out platform updates, allowing clients to subscribe to changes from a core repository. I think the best model is to build the UI scaffolding (doesn’t have to be a GUI), the update engine, and one example provider, e.g. SQS. Then, allow some inner sourcing - teams who want managed service X can contribute it to the platform in a lightweight way.
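One way to picture the "one example provider, then let teams contribute" model is a small registry that maps a requested kind to a function producing the resource description. This is only a sketch in Python: the `provider` decorator, `plan` function and request shape are hypothetical, and the only real name is the Terraform resource type `aws_sqs_queue`.

```python
from typing import Callable, Dict, List

Provider = Callable[[dict], dict]
_PROVIDERS: Dict[str, Provider] = {}


def provider(kind: str) -> Callable[[Provider], Provider]:
    """Register a function as the provider for one kind of managed service."""
    def register(fn: Provider) -> Provider:
        _PROVIDERS[kind] = fn
        return fn
    return register


@provider("sqs")
def sqs(spec: dict) -> dict:
    # The single built-in example provider; a team could add "dynamodb" the same way.
    return {"type": "aws_sqs_queue", "name": spec["name"]}


def plan(requests: List[dict]) -> List[dict]:
    """Translate user requests into resource descriptions, rejecting unknown kinds."""
    unknown = sorted({r["kind"] for r in requests if r["kind"] not in _PROVIDERS})
    if unknown:
        raise ValueError(f"no provider registered for: {', '.join(unknown)}")
    return [_PROVIDERS[r["kind"]](r) for r in requests]


print(plan([{"kind": "sqs", "name": "orders"}]))
```

Contributing managed service X then means adding one decorated function, while the scaffolding and the update engine stay with the platform team.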
y
We're in the exact same process, @Tomek Fiechowski. We're using Terraform, AWS and GitHub Actions. We're building a golden path, so for instance we've decided our devs should use AWS fargate. We haven't solved the "update templates" part in an easy way yet. Currently, our devs update their Terraform IaC setup with what we tell them (via Slack and our website). We're now experimenting with the GitHub Terraform Provider, and also looking into Copier and https://cuelang.org/, to see if we can utilize any of those somehow. Perhaps we can create PRs to team IaC repos, or perhaps they can use some internal tool to get updates.
t
Thanks @Yngvar Kristiansen for your response! I'll look into Cuelang, but Copier is the tool we're already using 💪 I've been away from the implementation details for a while, but the team praises it for its capabilities - to automatically update majority of the services through creating PRs to them 🙌 From our current experience, it's a tool worth recommending
We're building a golden path, so for instance we've decided our devs should use AWS fargate.
Have you already agreed on the abstraction layer? Fargate will cover the containerised environment but what about other resources, like queues, databases or something more fancy? Are developers expected to have good knowledge about Cloud resources and IaC behind these or you are creating some abstraction over this?
Feedback we've received from developers is that they lack clear boundaries in their services - they create a service out of a template and at first glance they can modify everything, but the reality is that there are places that only my team would touch. It leads to confusion what they can modify or not
y
it's a tool worth recommending
OK good to know!
Have you already agreed on the abstraction layer?
We are trying not to invent our own abstraction layer (a possibly huge topic); instead we're mostly relying on terraform-aws-modules for the various AWS resources out there, for instance ECS. They're quite flexible and support the various needs teams have. We have agreed upon a set of basic AWS services: ECS with Fargate, and Aurora postgres serverless v2 for databases, unless people have very specific needs. Otherwise most teams just use standard AWS stuff like VPC, ALB, Parameter Store or Secrets Manager. We haven't covered the observability stack yet, but it looks like we're going for Grafana, possibly with managed Prometheus / AMP and CloudWatch as data sources, and using the OpenTelemetry stack.
Just a few teams use queues, they use SQS I believe. We might add it to our golden path based on needs down the road.
Are developers expected to have good knowledge about Cloud resources and IaC behind these or you are creating some abstraction over this?
We are aiming for our devs to know the cloud resources they are using. I can talk a lot about this topic though 😄 Some find it easy, some find it hard, some are not interested in learning those things at all.
It leads to confusion what they can modify or not
We have tried to communicate very clearly that teams are responsible for their own infrastructure, but with good help from the platform team (my team) and AWS support. So we are spending most of our time helping teams, and unfortunately don't have much time to improve our platform product. At this point in time, however, I find that some people/teams don't really have the competence to upgrade the AWS provider and other Terraform providers when there are breaking changes.
It leads to confusion what they can modify or not
Interesting, we might be heading in that direction as well. How template lifecycle / updates are handled is probably strongly related to this. Perhaps simply putting most configuration in Terraform variables rather than in the middle of the resources can provide a good enough separation of what the team normally should be touching and not.
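A tiny Python sketch of that separation, assuming the platform publishes a list of variables teams are expected to change and a CI check flags anything else. The variable names and the `check_overrides` helper are hypothetical, purely to illustrate the idea.

```python
# Hypothetical set of Terraform variables that teams are expected to change.
TEAM_OWNED = {"container_cpu", "container_memory", "desired_count"}


def check_overrides(overrides: dict) -> list[str]:
    """Return the overridden variables that fall outside the team-owned set."""
    return sorted(key for key in overrides if key not in TEAM_OWNED)


# Example: one allowed change and one that would need the platform team.
flagged = check_overrides({"desired_count": 3, "vpc_id": "vpc-123"})
print(flagged)  # ['vpc_id']
```

A check like this doesn't block anyone from touching the rest; it just makes the boundary visible in the PR.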
to automatically update majority of the services through creating PRs
You haven't by chance open sourced this? 😄
t
Aurora postgres serverless v2
How do you manage updating the database versions? I think we use the same, but there are already some pending updates to newer versions. Not sure if our Platform should do it or the dev teams 🤔
Just a few teams use queues, they use SQS I believe.
How does local development look if they need an SQS queue?
You haven't by chance open sourced this? 😄
Not yet 😄 But I shared an idea with the team to give a presentation about it at one of the upcoming Platform/DevOps conferences - will let you know if we do it 💪 We could also jump on a call some time and involve some members of our teams to exchange experiences 💪
y
How do you manage updating the database versions?
We haven't done it yet, we're on 13 still. For minor versions, AWS updates automatically. Usually, we document how to do changes and upgrades, so teams can follow a guide. For more complex operations, a guide would not be a suitable fit, and then we either do it together (mob programming), or we do it for them.
How does local development look if they need an SQS queue?
I don't know for SQS specifically, I haven't used it myself.
will let you know if we do it 💪 We could also jump on a call some time and involve some members of our teams to exchange experiences 💪
Cool, sounds good 👌