# platform-leadership
a
So platform engineering is a relatively new concept. I am a bit curious, since we have developed an AI-powered platform for platform and DevOps engineers: does the following problem statement exist in other orgs as well? Platform engineers are tasked with creating solutions that allow teams to be more stream-aligned. This is especially difficult because it means empowering SDEs or full-stack devs to provision infrastructure themselves (at least in the case of the one client we are working with). Do platform engineers need a platform where they can audit what's going on across these stream-aligned teams, while the devs get some autonomy over infrastructure provisioning and code deployment through a no-code, ChatGPT-like interface (no need to learn IaC), effectively removing bottlenecks in software development?
c
Hey Abhinav! You might want to read more into platform engineering then - a good starting resource is https://internaldeveloperplatform.org/ . Generally speaking, “developer autonomy by self-service” is almost always a goal for a platform; only highly regulated scenarios deviate from it. I wouldn’t make ChatGPT a mandatory part of the mix of interfaces, because there are other options to achieve the same outcome, but it’s a path you can choose.
a
Hey Jay, so the objective here is to eliminate the need to learn IaC, especially for engineers who already have a ton of responsibilities with software development and design. Although we are in the very nascent stages of the implementation, we are seeing a 3-5x reduction in time and effort while provisioning infrastructure. We originally intended it to be a tool that boosts the productivity of DevOps engineers, but we got this use case from the client themselves. I want to understand if this is a more common problem statement.
c
There is frankly not enough information to answer whether it makes sense conceptually. Personally I have not yet received this requirement. The usual route is to have dedicated people who provide IaC templates that can be scaled to the developers through a formalized and declarative interface. Think platform engineers writing Terraform modules and developers consuming them via an abstracted interface, which might be “just values files” or some kind of spec like Score or Radius. The problem I would see with a chat-based interface is that either you are limited to an opinionated set of templates that need to be written and adapted to the customer's context anyway, or you are opening yourself up to LLM problems like hallucinations and non-deterministic replication of patterns (or the lack of it). How do you make sure that the same prompt leads to exactly the same infra in two environments if you don’t stay on primitives? How do you progress an infrastructure change through different stages? Just a few things that jump to my mind.
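A minimal sketch of the "values file in, fixed template out" route, to make the determinism point concrete. Everything here is an assumption for illustration: the module template, the field names, and the `render` / `fingerprint` helpers are hypothetical and not taken from any tool mentioned in this thread. The idea is only that a plain template rendered from the same values gives byte-identical output, so two environments can be compared by hash.

```python
import hashlib
import json

# Hypothetical module template maintained by platform engineers.
MODULE_TEMPLATE = '''module "{name}" {{
  source        = "{source}"
  instance_type = "{instance_type}"
  replicas      = {replicas}
}}
'''


def render(values: dict) -> str:
    """Render the template from a plain values mapping; no model is involved."""
    return MODULE_TEMPLATE.format(**values)


def fingerprint(values: dict) -> str:
    """Hash the canonical JSON form of the values so environments can be compared."""
    canonical = json.dumps(values, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same values, written in a different key order, as two environments might hold them.
staging = {"name": "api", "source": "./modules/service",
           "instance_type": "t3.micro", "replicas": 2}
production = {"replicas": 2, "instance_type": "t3.micro",
              "source": "./modules/service", "name": "api"}

same_output = render(staging) == render(production)
same_fingerprint = fingerprint(staging) == fingerprint(production)
```

With a prompt-driven generator the equivalent check would have to be made on the generated output after the fact, which is the concern raised above.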
k
it's definitely the future, because we won't need the portals and standardized forms / standardized permissions. One developer can say "I need all resources to start my Java project", another can say "I need a cluster, a repository, etc." Both can trigger the same provisioning scripts but create different responsibility boundaries (a dev who wants to manage the resources and CI/CD will do it; for the ones who don't, the Platform Team will do it, simply based on the evaluation of the prompt). But for now I see the same risks @Clemens Jütte mentioned about hallucinations and uncertainty of the outcomes, producing a potential mess. I personally believe that Platform Engineering is about giving developers SDLC capabilities that are as "serverless & abstract" as possible, so they care only about the code and business logic. "This is my piece of application, simply run it somewhere & give me troubleshooting capability for it". And the same will happen with LLMs: "just give me an easy interface to train a model" (so another topic for Platform Engineering; we need to learn fine-tuning etc. Maybe it will be a different platform than classic SDLC, I don't know yet. So far in my company we have started serving developers with this as well). You have started a very large, interesting topic though 🙂
a
Yeah, I completely agree with @Clemens Jütte that covering this over text is hardly possible without writing a proper article outlining the solution for each challenge. But just to be brief:
1. We are limiting the hallucination factor by using LLMs only to gauge user intent. We rely on our backend system rather than giving too much autonomy to the LLM. We will further reduce the hallucination issue by training an SLM.
2. We have designed solutions to overcome the issue of reusing prompts for the same infrastructure by blending key terms with natural language. For example, you can deploy a simple infra with the following prompts (these are from demos of our platform available on YouTube):
   a. "Create a vpc with cidr 10.0.0.0/16. Using the vpc just created, create a private subnet with cidr 10.0.1.0/24."
   b. "Using the subnet created above, create an instance with 20gb storage and size t3.micro."
   c. The key thing to remember here is that the user doesn't have to define the context every single time, because we manage the compliance as well. For example, if the user forgets to mention the subnet, our platform by default picks the first subnet from a list of pre-defined secure subnets.
3. Infrastructure changes are tracked on a per-resource basis, the same way version control is implemented in Terraform.
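A rough sketch of the "key terms plus safe defaults" idea from point 2, not the actual platform. Plain regular expressions stand in for the intent step here (the real system is described as using an LLM for that), and the subnet IDs, the 8 GB default, and the `parse_instance_request` helper are all made up for illustration.

```python
import re

# Hypothetical allow-list of pre-approved subnets; the IDs are invented.
SECURE_SUBNETS = ["subnet-secure-a", "subnet-secure-b"]

# Assumed default when the prompt does not mention storage.
DEFAULT_STORAGE_GB = 8

INSTANCE_RE = re.compile(r"\b(t\d[a-z]?\.\w+)\b")
STORAGE_RE = re.compile(r"(\d+)\s*gb\b", re.IGNORECASE)
SUBNET_RE = re.compile(r"\b(subnet-[\w-]+)\b")


def parse_instance_request(prompt: str) -> dict:
    """Extract key terms from a prompt and fill gaps from fixed defaults."""
    size = INSTANCE_RE.search(prompt)
    storage = STORAGE_RE.search(prompt)
    subnet = SUBNET_RE.search(prompt)
    if size is None:
        raise ValueError("no instance size found in prompt")
    return {
        "instance_type": size.group(1),
        "storage_gb": int(storage.group(1)) if storage else DEFAULT_STORAGE_GB,
        # No explicit subnet ID: fall back to the first pre-approved one.
        "subnet": subnet.group(1) if subnet else SECURE_SUBNETS[0],
    }


request = parse_instance_request(
    "Using the subnet created above, create an instance "
    "with 20gb storage and size t3.micro."
)
```

Because the structured request comes from fixed patterns and a fixed default list, the same prompt always resolves to the same request, which is the property the key-term approach is after.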
@Krzysztof H. that's why we have taken on this challenge. Right now, new tools keep popping up every month or so offering maybe a 10% improvement on the existing solutions. This creates a big headache for the DevOps engineer to keep track of the latest tools. That's why we decided to integrate with popular DevOps tools and offer a standard natural language based interface. We are combining infrastructure provisioning and code deployment with monitoring coming soon.
j
we did go the GenAI route for our chatbot experience of the platform. we found that the responses were more stable and predictable after we fed the GenAI solution our own documentation that was fairly prescriptive in terms of code generation. that helped stabilize the responses, eliminate hallucinations and make them durable and reproducible across invocations and users. that said, we did include an escape hatch for users interacting with the bot in case they were getting stuck. please think "press 0 to talk to a service agent" type of thing
k
Have you measured the actual change at all? For example, fewer questions from developers, increased platform utilization, or whether developers actually want to use it? I am super curious whether there is real value or it's just a nice addition
j
we do capture a point-in-time end user satisfaction after a chatbot exchange and we do measure the number of times people have to opt for the escape hatch route. other metrics we capture:
• # of slack questions / threads in public support channels that have to be serviced by platform engineers vs chatbots. the goal here is to offload the 80% of the "low value/high distraction" inquiries while still allowing the team to engage with the 20% "medium to high value, lower distraction" set of topics
• we capture some arbitrary level of engineering hours per slack thread, usually in the 20-30 mins range and then assign some engineering cost time. we do measure the duration of each chat bot interaction and its outcome (successful, unsuccessful and escalated to human, unsuccessful and escalated to human due to wrong avenue, user story, etc). that helps us quantify the value add on both sides, e.g. are you getting both correct responses and saving time while still in flow
• finally, we track all of that against TAM adoption and overall platform NPS. the relationships here are murkier and we don't have a good way to attribute delta fluctuations from just the chat bot experience alone
• outside of data, we talk to users to get their sentiment / feedback so we can calibrate ourselves on that. after all, end user is higher precedence than data captured
sorry for the wall of text, i hope this is helpful
i can't share graphs and charts with you but we do have those on the metrics above
senior leaders are extremely sensitive to anything AI "win" related so we get asked to write up some of these scenarios and present them to more senior leadership audiences
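A back-of-the-envelope sketch of how the metrics listed above (outcome per interaction, escalation rate, engineer time per thread) could be rolled up. This is not the team's actual tooling: the `Interaction` record, the outcome labels, the 25-minute figure (midpoint of the 20-30 minute range mentioned above), and the hourly cost are all assumptions.

```python
from collections import Counter
from dataclasses import dataclass

# Assumed figures for illustration only.
ENGINEER_MIN_PER_THREAD = 25   # midpoint of the 20-30 minute range
HOURLY_COST = 100.0            # made-up engineering cost per hour


@dataclass
class Interaction:
    outcome: str         # "successful", "unsuccessful", or "escalated"
    duration_min: float  # how long the chatbot exchange took


def summarize(interactions: list) -> dict:
    """Roll up chatbot interactions into deflection and cost figures."""
    if not interactions:
        raise ValueError("no interactions to summarize")
    counts = Counter(i.outcome for i in interactions)
    total = len(interactions)
    deflected = counts["successful"]  # threads an engineer did not have to service
    hours_saved = deflected * ENGINEER_MIN_PER_THREAD / 60
    return {
        "deflection_rate": deflected / total,
        "escalation_rate": counts["escalated"] / total,
        "avg_duration_min": sum(i.duration_min for i in interactions) / total,
        "engineer_hours_saved": hours_saved,
        "cost_saved": hours_saved * HOURLY_COST,
    }


log = [
    Interaction("successful", 3.0),
    Interaction("successful", 5.0),
    Interaction("escalated", 8.0),
    Interaction("successful", 2.0),
]
report = summarize(log)
```

Attribution against adoption or NPS is deliberately left out, since that relationship is described above as murky.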
k
<3 this is a wall of value, not just a text! Thank you so much, it’s very insightful & with a lot to think about
g
Such an insightful discussion here
e
@Jordan Chernev So you use GenAI to capture feedback on your platform ref ^ ?
j
No, we use simpler mechanisms than that. Not any different from a traditional NPS response / survey and/or a Slack workflow to keep track of # of threads and metadata fields around the user
c
@Jordan Chernev, just to understand that part better - you’re not only using a chatbot as an interface, you’re also using GenAI to create code that is being executed to create/update infrastructure? The way to stabilize the generated code was by feeding the model your documentation as a learning target?
e
@Jordan Chernev was just wondering what your post was on 🤔
j
> you’re not only using a chatbot as an interface, you’re also using GenAI to create code that is being executed to create/update infrastructure? The way to stabilize the generated code was by feeding the model your documentation as a learning target?
yes, yes and yes
users of the platform don’t like reading documentation we have prepared for them with IaC code generation (TF in our case) so we fed it to the LLM and then the chatbot became the interface to that
we met the customers where they are at - slack as opposed to where they don’t like spending time - reading docs
the “magic” sections in the documentation were js-powered templates that a user could provide input parameters so we could generate valid TF code for them on the fly - we made it that easy but if people don’t spend time with the docs, they won’t see and use that
i hope this helps
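A small sketch of the parameterized-template idea described above, where a user supplies input parameters and gets valid TF code back. The original templates were js-powered; this uses Python's `string.Template` instead, and the module source path, variable names, and `generate_tf` helper are hypothetical.

```python
from string import Template

# Hypothetical template; the module source and variable names are illustrative.
BUCKET_TEMPLATE = Template('''module "$name" {
  source     = "./modules/s3-bucket"
  bucket     = "$bucket"
  versioning = $versioning
}
''')


def generate_tf(name: str, bucket: str, versioning: bool) -> str:
    """Fill the template from user-supplied parameters after a basic check."""
    # Keep the module label to letters, digits, and underscores.
    if not name.replace("_", "").isalnum():
        raise ValueError(f"invalid module name: {name!r}")
    return BUCKET_TEMPLATE.substitute(
        name=name,
        bucket=bucket,
        versioning="true" if versioning else "false",
    )


snippet = generate_tf("team_a_logs", "team-a-logs", True)
```

The output is fully determined by the inputs, which fits the observation that prescriptive documentation made the chatbot's responses stable.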
a
We are eventually moving towards GenAI for most of our day to day tasks, even as engineers. We are also working on generative automation. Essentially, reproducible automation workflows that can be created on the fly using AI.
Can I get some feedback on the three major problem statements we are solving with our platform?
1. High cost of acquisition of tools and engineers creates a big barrier to entry for companies looking to create or expand a Platform/DevOps team.
2. Multiple tools for the exact same task in the market. A big headache for engineers.
3. Lack of a singular platform that can handle infrastructure creation, code deployment and monitoring. Integrations have to be created manually and maintained in-house.
c
Ah I see! That’s interesting @Jordan Chernev! Thanks for sharing! It totally explains why the AI can map input to outcome so well, if it already had pre-opinionated generation paths to learn from. That is exactly the part where I always struggle to commit myself and believe it will work. I always see myself having to generate this high-value training input first before the output becomes predictably correct, and then asking myself if the interface is going to make that big of a difference. From your statements, I guess this has been a successful project up to now? I would love to hear about it - have you submitted a talk for PlatformCon for that?
j
> From your statements, I guess this has been a successful project up to now?
yes, it’s in active use
> I would love to hear about it - have you submitted a talk for PlatformCon for that?
there isn’t much more to it 🙂 i gave you 90% of it earlier. no platformcon talk for me this year. the one thing we hadn’t done yet is convert this to a plugin so a developer would be able to tell the bot to just merge the required changes for them as a part of an interaction
i found this product last night which made me think of this thread - https://www.cycle.app/. i haven’t played with it or tested but sharing either way as it’s potentially relevant to what we discussed. if someone does end up using it, i would be curious to hear their feedback
r
We’ve taken a low-code approach to this: a library of Terraform modules that can be fully configured using YAML inputs. This removes the “learning IaC” challenge and would be much easier to test & support than TF code generated by AI. I guess AI could choose the module from the library and generate the YAML. This way the dev can use a supported architecture based on a pattern built by a platform engineer without having to learn IaC.
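A minimal sketch of why this route is easy to test: the developer's inputs can be checked against a fixed schema before anything reaches Terraform. The schema, field names, and `to_tfvars_json` helper are hypothetical. A plain dict stands in for the loaded YAML (Python's standard library has no YAML parser), and the result is emitted as JSON, which Terraform can read as a `.tfvars.json` variables file.

```python
import json

# Sentinel marking inputs that have no default.
REQUIRED = object()

# Hypothetical input schema for one module in the library:
# input name -> (expected type, default or REQUIRED).
SERVICE_MODULE_SCHEMA = {
    "name": (str, REQUIRED),
    "instance_type": (str, "t3.micro"),
    "replicas": (int, 1),
}


def to_tfvars_json(inputs: dict, schema: dict = SERVICE_MODULE_SCHEMA) -> str:
    """Validate developer inputs against the schema and emit tfvars JSON."""
    unknown = set(inputs) - set(schema)
    if unknown:
        raise ValueError(f"unsupported inputs: {sorted(unknown)}")
    resolved = {}
    for key, (expected_type, default) in schema.items():
        value = inputs.get(key, default)
        if value is REQUIRED:
            raise ValueError(f"missing required input: {key}")
        if not isinstance(value, expected_type):
            raise TypeError(f"{key} must be of type {expected_type.__name__}")
        resolved[key] = value
    return json.dumps(resolved, indent=2, sort_keys=True)


# What a developer's YAML file might look like once loaded.
tfvars = to_tfvars_json({"name": "api", "replicas": 3})
```

If an AI generated the YAML, the same validation would apply to its output, so a hallucinated field is rejected before provisioning.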