# platform-blueprints
o
Hello. I hope you could help me with some basic questions about IDPs, because I am very skeptical of the whole idea.
• I like the idea of service catalogs, but who would keep it up to date? Would we need to maintain a list somewhere with all the relevant links, or autodiscover? If the latter, how do you filter out things which are not services but just utils? E.g. our team has a lot of util agent repos, but none of them are production services.
• How is a service catalog different from a good Confluence page with all the necessary info + links?
• Would a service creation workflow create the repo, with all the default files, Terraform, pipelines etc.? Would it deploy a basic dummy app somewhere?
• Would an IDP be able to support teams that use repo-per-service vs teams that use monorepos?
• What about resource creation of "protected" resources, e.g. DNS and firewall rules, during service creation? These assets are done via Terraform where my team has to review PR changes. Should devs be able to just create things without review, automatically? Or should the IDP just raise a PR for us to look at?
• How do you manage changes to services after service creation? If teams want to e.g. update their health probes or replica count, they can currently just go to the Terraform we made for them and update it. How would an IDP help here?
• If we come up with an improvement to golden paths, e.g. better Terraform modules, how would an IDP help us apply it to existing infra? Would we have to re-create the service, or would it try to apply on existing? What about conflicting changes?
• How can you restrict that only a specific set of people in a team can create services? Currently this is done with rights on the Terraform repo.
a
there’s a lot to unpack there. The answer to all of your questions is mostly “it depends on the idp and how you set it up”
what is the core business problem you’re trying to solve?
o
Sadly, I am not trying to solve one. I was asked to look into IDPs. 2 devs suggested that they used Backstage before in their previous company and it worked well for creating cookie-cutter services. Personally, I find the entire thing a little pointless because a) we don't create services that often, and b) it takes 10 minutes of copy-pasting to be done with the task.
a
yup exactly 😉 we arrived at that thesis when we were starting a company and opted to focus on managing services instead of creating them. My POV is the flow looks like: create, connect, coordinate. Creation is a single-player, usually one-time event; connecting things together is also very rare; coordinating actions around a service is where things get interesting - the problem is it's the "write" path for a service, so now it's less about discovery and more about "deployment" or "orchestration" of the thing
(connecting is probably a 2-3 player type event)
I believe the IDP is an emergent property (or can be an emergent property) if you build other pieces correctly
I think the focus on IDPs is a huge waste of time unless you're a massive enterprise and want to use it for discovery
b
Rather than look straight at the solution, maybe you could work with your team(s) to look more at the perceived problems. I find a path-to-prod workshop with representatives from all teams really useful for this. You can then work out where the real delays and pain points are. E.g. if you don't create new projects very often, then maybe a catalogue doesn't solve the problem. But you may find out that dev teams struggle to know what functions other teams have written... so an IDP (or API Gateway) might help with discovery. E.g. you mention that your team approves Terraform PRs... how often do you deny the request? If the answer is very rarely, then perhaps that should be a retrospective review step, rather than an approval step.
o
We have about 160 developers in 12 squads, each squad having at least 1 service, but some having 10+. All of them use our Terraform modules for standard IAC (apart from protected resources), all of them use our YAML AzDO pipelines. Aside from minor manual things like creating an AzDO repo or raising a PR for protected resources, most teams can get by following an existing example from another squad. There are deviations from the standard after the fact (people change Terraform for their services), but I don't see how an IDP would help.
b
The only way to answer that question, I think, is to ask them 🙂 They must perceive some problems somewhere if they're keen to bring in an IDP. The weakest argument I've heard out in the field is that the existing pipeline "wasn't very pretty"... the devs wanted a dashboard of the pipeline. But it wasn't just vanity, they wanted to be able to show their leadership teams what automation looks like, so they could continue to attract funding for automation. So, a dashboard was born 😉
One useful trick I use when I know I’m biased in my thinking is to approach it from another angle. E.g. rather than approaching as “I don’t think an IDP offers any benefit over what we do today”, I’d instead ask myself “what would need to be true for an IDP to be a good idea”. I can then look at how that compares to what the company is doing today and what their aspirations are for the future. If it doesn’t match, it’s not a good idea.
o
“what would need to be true for an IDP to be a good idea”. - having 3 dedicated people instead of half of one would be a good start 🙂 We have a lot of pieces there already. For their chief reason - creation of consistent services - a simple PowerShell script to create a repo + modules + boilerplate code would be a better time investment, even if not pretty.
b
@Oskar Mamrzynski I think you have good instincts in seeking a solution to a problem rather than the other way around. For the purpose of having an informed conversation with your colleagues, I think it's worth peeling back this comment: "having 3 dedicated people ... would be a good start".
• What objectives are these 3 people achieving?
• Do you need 3 people for a short burst, or are they needed for long-term activities? (Every team is different, it would be great if you could share with everybody.)
An IDP is a broad term that may solve the challenges you're facing, but it's difficult to know until you enumerate those.
j
not to distract from the great conversation. it’s not just about service creation. it’s also about service sunset/deletion. many platform teams miss the concept of full lifecycle mechanics, e.g. CRUD in their workflow creation and curation
c
Hey Oskar!
Two observations upfront and then I'll try to step through your questions - caveat is that I don't fully understand your situation, so please consume with a pinch of salt.
1. You use the acronym IDP, but for me it's unclear what you're actually after. You reference Backstage, which indicates you're looking for a dev portal, but usually the 'P' is interpreted as 'platform' here - you're in the "platform engineering" Slack ;-)
2. You mentioned that two developers who used Backstage in their previous company were the inception point of your investigation. I am guessing here that this experience is still fresh, as in they're recent joiners of your company. It might be that, for you, the whole tooling, experience and process is crystal clear and easy to understand - but after all, you're using it daily and also at least partly created it. For someone new to the company it might be completely different and hard to understand - hence their wish for a portal as an abstraction layer to make this easier to understand.
I like the idea of service catalogs, but who would keep it up to date? Would we need to maintain a list somewhere with all the relevant links, or autodiscover? If the latter, how do you filter out things which are not services but just utils? E.g. our team has a lot of util agents repos, but none of them are production services.
You can think of "the catalog" as a data warehouse. There is normally an ingestion process that will collect all relevant information. How this works depends on the product you're using. For Backstage, you would have a YAML file stored alongside the source code in the repo, and Backstage would ingest that file periodically and update the catalog entry from there. Different types of entities can be distinguished easily in the catalog, so you could ingest both and easily tell services and utils apart. Depending on the product you use, you can even understand and visualize the dependencies between catalog entries (and their corresponding source code repos).
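For illustration, in Backstage's case the per-repo descriptor is a catalog-info.yaml roughly like the sketch below (names are made up, not from your setup); the `spec.type` field is what lets the catalog tell services and utils apart:

```yaml
# Minimal sketch of a Backstage catalog-info.yaml kept next to the source code.
# Names are illustrative only.
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payments-api
  description: Handles payment processing
spec:
  type: service        # a util/agent repo would declare e.g. type: library or type: tool
  lifecycle: production
  owner: squad-payments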
How is service catalog different from a good Confluence page with all the necessary info + links?
I am not going to dive into the religious aspect of that question. To state the obvious: if you're using a catalog, the info can be maintained by the devs in context, in an abstract format, right in their IDE. Keeping everything in the right format and up to date is then handled by automation so it can be visualized nicely. Confluence only covers the "can be visualized" part.
Would service creation workflow create the repo, with all the default files, terraform, pipelines etc? Would it deploy a basic dummy app to somewhere?
There is no simple answer to that; normally you would configure the scaffolding process to carry out whatever you want it to. If you want it to deploy a dummy app somewhere - make it do that! If you want it deployed by an already included pipeline - include the creation of the pipeline in the process. You get the idea...
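As a rough, hedged sketch of what such a scaffolding configuration can look like, here is the shape of a Backstage Software Template. The actions (`fetch:template`, `publish:azure`) are standard scaffolder actions, but the skeleton path, parameter names and repo URL format are assumptions:

```yaml
# Hedged sketch of a Backstage Software Template for a "cookie cutter" service.
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: standard-service
  title: Create a new standard service
spec:
  owner: platform-team
  type: service
  parameters:
    - title: Service details
      required: [name]
      properties:
        name:
          type: string
          description: Name of the new service
  steps:
    - id: fetch
      name: Fetch boilerplate
      action: fetch:template
      input:
        url: ./skeleton          # boilerplate code, Terraform, pipeline YAML live here
        values:
          name: ${{ parameters.name }}
    - id: publish
      name: Create repo and push
      action: publish:azure
      input:
        # assumption: exact repoUrl query parameters depend on your SCM integration
        repoUrl: dev.azure.com?organization=my-org&project=my-project&repo=${{ parameters.name }}
```

Further steps (registering the new component in the catalog, kicking off a first pipeline run, deploying a dummy app) would be appended to `steps` in the same way.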
Would IDP be able to support teams that use repo per service vs teams that use monorepos?
TL;DR - yes. Depending on the product you use, this is easy or hard. You enter the realm of "what would a platform support more easily than just a portal" very quickly here.
What about resource creation of “protected” resources, e.g. DNS and Firewall rules during service creation? These assets are done via Terraform where my team has to review PR changes. Should devs be able to just create things without review, automatically? Or should the IDP just raise a PR for us to look at?
There is a recurring theme of "depends on the product you use" here. Good platforms will let you use your TF modules but abstract them away behind a layer that is more accessible to developers. Usually there should be nothing for a developer to configure for e.g. a DNS record - it really should be computed from the service name and your domain name being entered into a simple template. Firewall rules are the same - the only thing a developer should be reasoning about is whether their service is publicly exposed or internal; based on that, the right FW config should be generated. This eliminates the need for a PR altogether, as the generated config cannot go wrong - "developer self-service" is the keyword here.
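To make the self-service idea concrete, here is a purely hypothetical sketch of the only input a developer might provide, with DNS and firewall config derived by the platform (field names are illustrative, not from any specific product):

```yaml
# Hypothetical developer-facing request; all field names are made up for illustration.
service:
  name: payments-api
  exposure: internal      # the only security-relevant choice the dev makes: internal | public
# The platform computes the rest from this, e.g.:
#   DNS record:    payments-api.<team-domain>
#   firewall rule: allow 443 from the internal load balancer only (because exposure: internal)
```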
How do you manage changes to services after service creation? If teams want to e.g. update their health probes or replica count - they can currently just go to Terraform we made for them and update it. How would an IDP help here?
“Go to the Terraform” is usually not what a dev wants to do or hear. Once again, I think it’s about the abstraction for that interface - the wish for a portal in front of that, which makes it more accessible and reduces the possibility for error. You could have a look at Score (http://score.dev) for inspiration.
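For reference, a Score file describes the workload (including things like health probes) in a platform-neutral way, and the platform translates it into whatever Terraform/Helm sits behind it. A hedged sketch with made-up names; the resource output placeholders depend on how the platform provisions the database:

```yaml
# Hedged sketch of a Score workload spec; image, port and resource outputs are illustrative.
apiVersion: score.dev/v1b1
metadata:
  name: payments-api
containers:
  main:
    image: registry.example.com/payments-api:1.4.2
    readinessProbe:
      httpGet:
        path: /healthz
        port: 8080
    variables:
      # assumption: output names depend on the provisioner wired up behind "postgres"
      DB_HOST: ${resources.db.host}
resources:
  db:
    type: postgres
```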
If we come up with an improvement to golden paths, e.g. better Terraform modules - how would IDP help us apply it to existing infra? Would we have to re-create the service, or would it try to apply on existing? What about conflicting changes?
Simply put - a good IDP (platform!) abstracts the TF away for the developers. You evolve it disconnected from them in the platform team, and as soon as you're ready for rollout, the IDP helps you centrally manage that rollout to the projects in a sensible way - e.g. a stage- and group-based rollout: dev environments of test teams first, then dev for all, followed by staging for all and finally prod for all. While doing that, your IDP should make sure that the next deployment the dev triggers is executed against the new TF module version. How the state is managed and what the update policy is should be encoded into the TF modules, as this may vary per resource you manage (which is exactly the cognitive load you want to take away from the devs by abstracting the TF from them).
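No specific product syntax implied, but a staged rollout like that might be encoded roughly like this (purely hypothetical field names):

```yaml
# Hypothetical, product-agnostic rollout plan for a new version of a shared Terraform module.
module: service-baseline
targetVersion: 2.4.0
waves:
  - name: dev-pilot         # dev environments of a couple of friendly/test squads first
    environments: [dev]
    teams: [squad-a, squad-b]
  - name: dev-all
    environments: [dev]
    teams: all
  - name: staging-all
    environments: [staging]
    teams: all
  - name: prod-all          # only after the earlier waves are healthy
    environments: [prod]
    teams: all
```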
How can you restrict that only specific set of people in a team can create services? Currently this is done with rights on Terraform repo.
Any credible IDP I know of (and here you can let the “P” be platform or portal - doesn’t matter) supports RBAC. Happy to have a virtual coffee together if you want more specific feedback on any of those points.
a
Only going to offer a very small extension to these amazingly detailed answers... "Abstracting the Terraform" is often the reason people start with platforms, but if that keeps being the goal, the org often ends up in maintenance hell:
• Not knowing if the person requesting the infra or writing the Terraform is in charge of the resulting infra
• Not knowing who can/should update for security and feature requests
• Users being frustrated by the wrapper limitations, but the wrapper creators not having time to extend or desire to meet "every edge case"
And more. The key, if moving into the platform space, is to not just abstract the code but to abstract the maintenance and cognitive load. Become an internal service provider: just as the infra team depends on a cloud, your app devs can depend on the platform, with clear lines of ownership for levels of performance/resilience/etc. That very well might have been obvious to everyone else, but I often run into that as an opportunity for clarity in discussions/expectations.
o
Thank you all for these amazing answers. IDP in my OP = Internal Developer Portal, not Platform. That's what every (hype) article calls the acronym. I definitely do not want to abstract away Terraform from devs, precisely for this reason - "Users being frustrated by the wrapper limitations but the wrapper creators not having time to extend or desire to meet 'every edge case'". I expect devs to know enough Terraform to be able to extend what my team has written, but not do things from scratch. We hardcode a lot of practices into modules so that they do not drown in a sea of parameters to set. What I like about the current process is that we follow the standard programming workflow of commit, PR, review, auto-deploy that they are familiar with, whereas a lot of these Portals look like they'd just give ClickOps to devs.
c
Hey @Oskar Mamrzynski!
Users being frustrated by the wrapper limitations but the wrapper creators not having time to extend or desire to meet “every edge case”
This looks a bit contrived, as it can be true for your modules as well, and can be made worse by not providing enough capacity or willingness during PR review. Platforms should provide a clear contract, which includes the possibility to opt out. If you want to transcend that concept to the golden path metaphor, users should be able to opt out of using a golden path completely or partially, depending on their needs. The reasoning behind that is simple - you should never assume that a platform (including a portal, which is usually one of the components of a platform) can solve 100% of all cases. But if the platform can extend its benefits to 80% of your org (the devs who want their cognitive load reduced, for whom the abstractions work and who don't want to indulge in manually solving repetitive tasks), you might gain enough capacity to care more for the remaining 20% who are not on the platform - and everybody wins, because overall you'll be delivering value more quickly and with better quality.
o
Can you give an example of a platform you've built before that would easily allow for enforcing consistency after a service has already been created? For example, we deployed a lot of services already with no pod security context in k8s. I could create a new golden path with that in for brand new services, but that doesn't solve existing ones. I have not seen any ID(Portal) solution that helps to solve this.
a
@Oskar Mamrzynski AFAIK there's nothing that truly solves this problem. Approaches I've taken at scaled orgs are to create a back pressure mechanism that forces the transition. For example: by policy you must deploy code every N days, and if not you lose some type of privilege / get paged until resolved; templates are "versioned" and recomputed at deployment time; versions can be deprecated.
This will force the adoption because they will either A) want to deploy code because of a new feature, or B) get paged until they deploy
A is an incentive that almost all orgs have and B you need to create
o
On a similar note, we have the 10%-er teams who go beyond the golden path. Sometimes it works out well, but I have seen some awful implementations as well, which we will now have to correct. 😞
a
oh that's an easier problem: create a larger back pressure mechanism that makes it hard for them to stay off the golden path
For example, introduce mandatory "chaos engineering" type things where you fail over regions or something. Make it super easy on the golden path for that case to get handled
You cannot get out of that, because there is a real business (BCP/DR) reason; if you want to stay off the golden path, you have to justify the headcount to build up your own BCP/DR capability to handle it. Track this as a list of exceptions
Don't correct things yourself; design a system that is self-correcting, turn it on, and manage exceptions through budget.
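Nobody in the thread has named a tool for this, but one concrete example of a self-correcting mechanism for the earlier pod security context question is a cluster admission policy: it flags (or blocks) every existing and new workload that misses the setting, regardless of how it was created. A hedged Kyverno-style sketch:

```yaml
# Hedged sketch of a Kyverno policy that audits Pods missing runAsNonRoot;
# switching validationFailureAction to Enforce turns the audit into back pressure.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-run-as-non-root
spec:
  validationFailureAction: Audit
  rules:
    - name: check-pod-security-context
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Pods must set spec.securityContext.runAsNonRoot to true"
        pattern:
          spec:
            securityContext:
              runAsNonRoot: "true"
```

Run in Audit mode it simply produces the list of exceptions to manage; Enforce mode is the stronger back pressure discussed above.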
c
@Oskar Mamrzynski
> Can you give an example of a platform you've built before that would easily allow for enforcing consistency after a service has already been created? For example, we deployed a lot of services already with no pod security context in k8s. I could create a new golden path with that in for brand new services, but that doesn't solve existing ones. I have not seen any ID(Portal) solution that helps to solve this.
I don't want to turn this into a product pitch, but Humanitec's platform orchestrator can do precisely this. The principle or pattern you want to look for is dynamic configuration management (DCM). You can have a look at the OSS and fully terraformed reference architectures for IDPs here. Also happy to drink a virtual coffee together and answer any questions.
a
I don't think it can actually do that.
It cannot manage the organizational incentive structure
c
I was referring to Oskar’s question @Andrew Fong. Let me edit my post for clarification.
a
DCM is fine if you fully trust that the DCM never requires a human on the owning team to validate things
most service teams will still want to have some type of last mile validation (and probably should)
also dealing w/ sprawl and then requiring entire system migration seems suboptimal
c
I would also be against creating back pressure. Any form of punishment will create the will to avoid the solution altogether. If you cannot build enough value into your golden path that people actually want to use it, then it might not be golden after all. To give an example: if you pre-vet your golden DB configs with security and compliance, then any instance automatically created from those templates will get a checkmark during the next audit - no more questions needed. People might be inclined to actually use that 🙂
a
You just defined back pressure btw
If you do not use the golden path you will not get a checkmark in the next audit.
c
No - your example was “either they adapt or you take something away from them” while I would always emphasize “you use it, you get something as an added bonus”. That’s the difference between pressure and pull.
a
I can change that to the affirmative: “Use this path and you’ll never have to worry about deployments getting stalled for a review by the compliance team”
The key is that leadership must hold the line if they violate it
b
I think @Clemens Jütte is highlighting a better approach, which is to provide incentives rather than punishment for not achieving business objectives. With punishment, the platform team is placing themselves in between the business objective (i.e. the next audit) and the developers, which creates conflict. Instead, the platform team should provide patterns as enablers to the business objectives.
a
The platform team is not placing themselves there - leadership is, and must, otherwise you're just hoping
b
The whole shift from DevOps to Platform Engineering was to avoid operations getting shot because they're the messenger (to use a cliché)
a
All we’re discussing is the semantics of the message. At the end of the day if you are not using the golden path you will have issues - what those issues result in is increased resource utilization by the team not on the golden path
o
I was more asking about a technical solution, or an example of how IDPs tackle service changes after the service has been created. Of course right now I can make that change with a new TF module version, and then have to raise 20+ PRs for different microservices. Whether they get merged is the human aspect that you are currently debating, because it could break things. 🙂 Currently, the carrot doesn't work on devs because most of the time they don't care. New security context config? Ports over 1024? Don't care, as long as my app is still working. So the "stick" is that we do it for them.
a
Resource utilization = cost; cost will result in someone asking hard questions; hard questions are back pressure
@Oskar Mamrzynski look at things like https://www.grit.io/ if you want to handle the PR issues. It's based off systems like Rosey at Google for handling these use cases
There is a point where the carrot isn't enough at scale. Stragglers will always need a push
o
This is also why I don't agree with abstracting things away too much for devs. If they don't care about how we configured Redis, Kafka, k8s, networking etc., and then cry to us at 8pm when things are broken because they didn't pay attention to how it was configured and what the limits are, then we can't help them.
j
@Oskar Mamrzynski, worth creating t-shirt sizes for developers for each platform?
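To picture what t-shirt sizes usually mean in practice: a small, curated lookup that devs pick from instead of raw replica/CPU/memory numbers. A purely hypothetical sketch, with illustrative values:

```yaml
# Hypothetical t-shirt size presets a platform team might curate; numbers are illustrative only.
sizes:
  small:
    replicas: 2
    cpu: "250m"
    memory: "256Mi"
  medium:
    replicas: 3
    cpu: "500m"
    memory: "512Mi"
  large:
    replicas: 6
    cpu: "1"
    memory: "1Gi"
# A service then just declares:  size: medium
```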
a
fwiw business objectives should always be aligned to the customer. We are putting in speed bumps to help you deliver a tremendous customer experience, and we're making these speed bumps very clear with a sign ahead. The notion that back pressure is bad comes from a lack of ability to explain the larger picture and communicate it well. I have never seen great engineering teams not understand the "why" if it's communicated well. Gates without context are what cause problems
you can’t paper over bad leadership/culture with tech
c
@Andrew Fong that's an interesting one I've not yet used. For the same concern I would normally advise https://github.com/openrewrite or the paid offering https://www.moderne.io/ .
@Oskar Mamrzynski, what you describe is usually happening because there’s a combination of poor communication of change and no gated rollouts. If you’re able to have a controlled rollout to just the dev environments of some friendly teams, then go to all dev environments and only ever propagate further (be that in teams or in stages) when the problems are solved (possibly once again with clear contracts and steps to opt out of an upgrade for some time - e.g. while a critical patch is being rolled towards production or capacity is built in for that change in the team) then you normally don’t have these issues. But all of these facilities require manual work (and lots of it!) in an environment where every change and rollout is a PR that needs to be filed, reviewed and tracked. So, I understand why such a system might not be in scope for you right now, but it can be with an IDP.
b
This is also why I don't agree with abstracting things away too much for devs.
@Oskar Mamrzynski I really like leaning into this idea. There is a good balance for devs. To some extent, a developer says "I just want a container app with postgres and redis. I don't want to babysit it, just go". However, when/if things go wrong, what will it take to get back on the happy path? In the Internal Developer Platforms I've built, the one strategy we employed that had massive impact was a heavy focus on curation. Most of this comes down to infrastructure modules (e.g. Terraform, Helm, <insert other tech>):
• Put yourself in the mind of a developer who isn't a cloud/k8s expert. How do I know which modules to use?
• What ways can this go wrong for a developer? Are there good error messages that allow a non-expert dev to course-correct?
• Have we built automated testing so that when someone makes a change to IaC, it doesn't ripple to the user? (see the sketch below)
• When we roll out changes to a developer, do they know how to upgrade without causing downtime?
I wrote a high-level piece on Terraform curation earlier this year on this topic: https://www.nullstone.io/blog-posts/terraform-self-service
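On the automated-testing bullet above, a minimal sketch of an Azure DevOps pipeline that sanity-checks a shared Terraform module on every change (the module path and Terraform being available on the agent are assumptions):

```yaml
# Hedged sketch: CI checks for a shared Terraform module in an Azure DevOps YAML pipeline.
# With Azure Repos, PR validation is typically attached to this pipeline via a branch policy.
trigger:
  branches:
    include:
      - main

pool:
  vmImage: ubuntu-latest

steps:
  - script: |
      terraform -chdir=modules/service-baseline init -backend=false
      terraform -chdir=modules/service-baseline validate
      terraform -chdir=modules/service-baseline fmt -check -recursive
    displayName: Validate and lint the shared module
```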
a
On our platform we have data indexers which provide accurate pricing data, and we offer pricing/compute sync options for privately hosted clusters, like externally owned EKS clusters, e.g. only sync the i3x, or only ones < 1k. You can also create org/user-level permissions and reshare infra app definitions. We offer a perfect balance of abstraction vs control: you can rely on abstractions up to the point of not knowing what k8s even is, or you can fine-grain control anything, like daemonset installs or custom k8s operators. It provides a UI for easy multi-cloud/region management, lets you filter by app category, and even do in-line edits + rollouts, which is ultra useful for rapid development. It summarizes workload requirement requests, offers node tainting by app during deployment, and orchestrates cloud server/disk provisioning which is automatically tracked by us. You don't need Terraform at all. You can use our GitOps flow, which we call "functions as code", or others like Flux/Argo, or a mix using more traditional infra as code.