# platform-engineering-in-edge-computing
t
Hi, what are some of the challenges with platform engineering at the edge? Can someone share their approach? My feeling is that edge has a lot of additional complexity that I imagine most developers don't want to deal with. They just want code to run closest to end users
j
Hi, from my point of view the main challenge is to abstract complexity: design and build simple platforms running basic services that follow your organization's best practices, and propose these simple platforms as templates to your devs. So:
1. Define your guidelines for deploying and maintaining platforms running on edge infra
2. Create some simple platforms following those guidelines and propose them to some developers (kind of beta testers)
3. Add CI/CD and observability to your platform
4. Open the tool up
5. Add more complex platforms when teams are ready, and iterate.
I hope I understood your question correctly, and I also hope it helps a bit.
c
Hi @Timothy Fong, I have a brief write-up on the topic here: https://avassa.io/articles/edge-computing-platform-engineering/ but I will say that I observe two things that I believe will “trickle through” to developers via e.g. an IDP:
1. The hardware in many on-site edges is heterogeneous and pretty limited. So you want to make sure that e.g. an application that requires a GPU is only scheduled on nodes that actually have a GPU available.
2. It is unlikely that you want to run all applications in all locations with the same configuration and version, so there is a need to describe to the orchestration layer under which circumstances to run the applications. Circumstances can be e.g. geographical, configured (i.e. “stores of size medium”), or directly related to resource availability.
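A rough sketch of what the GPU constraint in (1) could look like (an illustrative toy scheduler, not any specific product's API; all names are invented):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    labels: dict = field(default_factory=dict)  # e.g. {"gpu": True, "region": "eu"}

def eligible_nodes(nodes, requirements):
    """Return only the nodes whose labels satisfy every workload requirement."""
    return [n for n in nodes
            if all(n.labels.get(k) == v for k, v in requirements.items())]

nodes = [
    Node("store-12-a", {"gpu": True, "region": "eu"}),
    Node("store-12-b", {"gpu": False, "region": "eu"}),
]
print([n.name for n in eligible_nodes(nodes, {"gpu": True})])  # prints ['store-12-a']
```

A workload that requires a GPU simply never appears on `store-12-b`; real orchestrators (e.g. Kubernetes node selectors) apply the same idea with richer matching.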
h
To add to what Carl mentioned, hardware can also have a bad/spotty connection to the internet and be limited in resources; it's an interesting space
t
Thanks @Carl Moberg, this is good. How much declarative granularity is granted to the developer, versus being given more abstract categories? E.g. should it be down to the specific GPU set that they specify? Or are these categories pre-determined by workload (ML, rendering, physics)?
When you say version, are you talking about the version of the software or the package? And for availability, is that like a logical tree (first resource x, then resource y if x is not available)?
@jean-philippe Foures makes sense in principle. Can you give a concrete example of an abstraction from a best practice?
c
@Timothy Fong too early to tell on the question about declarative granularity imho. We (the industry) need a little more experience. I think we’ll have three types of data to match against: configured (think labels), hardware configuration (host-level memory, is a GPU available, is a camera attached, etc.), and “state” (e.g. how much memory is available right now). The first two are easy to match against as they move slowly or are under complete administrative control. The third one (state) is harder, as scheduling based on matching against parameters that may change at any time leads to complexity.
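One way to sketch those three data types (a toy model with invented field names, not a real scheduler): treat the slow-moving data as hard constraints, and the volatile state only as a ranking signal, so a momentary dip doesn't flap the placement decision:

```python
def place(workload, sites):
    """Hard-match on slow-moving data; rank candidates by volatile state."""
    # (1) configured labels and (2) hardware configuration: hard constraints
    candidates = [s for s in sites
                  if workload["labels"].items() <= s["labels"].items()
                  and s["hw"]["mem_mb"] >= workload["min_mem_mb"]]
    # (3) state: a soft preference only, so fluctuating free memory merely
    # reorders candidates instead of repeatedly invalidating the placement
    return sorted(candidates, key=lambda s: -s["state"]["free_mem_mb"])
```

Here `place` returns eligible sites best-first; a caller would pick the head of the list and fall back down it.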
When I say version I do mean a versioned “application” which usually consists of a set of versioned container images. And, yes, there’s tree-like ordering required.
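The tree-like ordering could look something like this (an illustrative fallback resolver; the option names are made up, not a real spec):

```python
def resolve(preferences, available):
    """Walk a priority-ordered list and return the first available option."""
    for option in preferences:
        if option in available:
            return option
    return None  # nothing in the fallback chain was available

# Prefer app version 2.1, fall back to 2.0 where 2.1 has not rolled out yet.
picked = resolve(["app:2.1", "app:2.0"], {"app:2.0"})
```

The same shape works for resources ("first x resource, then y resource if not available") as asked above.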
t
Thanks @Carl Moberg. Do you think, for now, it wouldn’t seem naive as a product manager to both a) allow optional granularity while b) figuring out, with the platform team and the value streams, what they understand the workload requirements to be at the workload level? I like Score's workload-oriented approach, even though the Score spec includes things like CPU, memory, and other details.
@Carl Moberg I read the docs; it seems powerful. If the platform team had a product manager and used Avassa, it seems like there would be no need to capture requirements and abstractions, since the labels and other matching of workloads can just be done self-serve? Have you found that the platform teams among your customers are small enough that they didn’t need a product manager to harmonize different implementations?
I have a core and edge model and my gut is
On the flip side, I am talking to a company with its own edge nodes about joining the platform team. This describes how I imagine they would want to approach building a platform to use the edge.
Can you give me an example of how this works in the real world? We have a write-once-run-everywhere approach, but we are very geo-heterogeneous. Some countries have poor connections.
We have this limitation as well. So how do you think an edge platform should best address this? I am not sure having different application versions at a higher level works for us. But I have been thinking that perhaps the kinds of services used could vary. For example, rendering for low-capacity and poorly-connected sites could use a lower-resolution service.
j
@Timothy Fong regarding https://platformengin-b0m7058.slack.com/archives/C04RLL5BQ4W/p1682709932189249?thread_ts=1682661129.236639&cid=C04RLL5BQ4W: as an example of an abstraction, you can create some bricks using your preferred resources: a web server composed of instance + local storage + image + security groups, etc., combined with a DB and so on. You can also pre-build entire platforms and propose them directly to your devs.
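A toy illustration of the bricks idea (all names and values here are invented, just to show the composition):

```python
# Each "brick" bundles related resources into a reusable, pre-approved unit.
WEB_BRICK = {"instance": "small", "storage": "local-ssd",
             "image": "web:1.4", "security_groups": ["web-ingress"]}
DB_BRICK = {"instance": "medium", "storage": "persistent",
            "image": "db:2.1", "security_groups": ["db-internal"]}

def platform_template(*bricks):
    """Combine bricks into a platform definition a dev team can deploy as-is."""
    return {"components": list(bricks)}

template = platform_template(WEB_BRICK, DB_BRICK)
```

Devs pick from templates built out of bricks that already encode the guidelines, rather than assembling raw resources themselves.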
c
For the question in https://platformengin-b0m7058.slack.com/archives/C04RLL5BQ4W/p1682815226222929?thread_ts=1682661129.236639&cid=C04RLL5BQ4W I notice that among our users, the platform teams are in charge of the definition of labels, and they then allow application teams to match against them. A sign of maturity is of course whether the label taxonomy is created collaboratively between platform and app/value stream-teams.
And for the question in https://platformengin-b0m7058.slack.com/archives/C04RLL5BQ4W/p1682818643187659?thread_ts=1682661129.236639&cid=C04RLL5BQ4W IMHO there are two ways of thinking about it. Either (1) as a scheduling/placement challenge, i.e. how do we formally define where a specific application (consisting of several containers) should be started and kept alive. In the Avassa system we use site- and host-level label matching for that. Or (2) as a configuration-management challenge, i.e. where the same application needs different configuration based on which site it is running on. In this case, the configuration distribution mechanism needs to be able to access the same label space as the placement/scheduling and use that.
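A minimal sketch of option (2), assuming configuration is selected from the same label space as placement (the rule structure and field names are assumptions for illustration, not Avassa's actual API):

```python
def config_for_site(site_labels, rules):
    """Return the first configuration whose selector matches the site's labels."""
    for selector, config in rules:
        if selector.items() <= site_labels.items():
            return config
    return {"resolution": "default"}  # fallback when no rule matches

# Same application everywhere; only the configuration varies per site.
rules = [
    ({"connectivity": "poor"}, {"resolution": "low"}),
    ({"size": "large"}, {"resolution": "high"}),
]
```

This covers the low-resolution-rendering example above: a poorly-connected site matches the first rule and gets the low-resolution configuration, with no separate application version needed.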