
Cruise Hall

01/28/2023, 4:06 AM
Heyo!👋 Any observability 🔭 platform product owners (or eng leads) out there? I’d love to connect and learn what you’re learning! And I’d especially appreciate help fine-tuning a mental model - I’m calling it The Observability Iceberg - that I’m piloting with my team! (Look closely and you’ll probably see it’s just a one-dimensional view of a Wardley Map 😉, but one thing at a time)

jean-philippe Foures

01/28/2023, 10:02 AM
Nice work.

Abby Bangser

01/28/2023, 12:16 PM
Super interesting! thanks for sharing @Cruise Hall :thankyou: Curious how you use this in practice to address the goal you set
…help SRE & platform teams prioritize the most valuable work…
As in, all of those things need to be done (without installing the OTel collector, you can’t add logs to traces in the apps). My understanding is that Wardley map conversations are about how to move the bottom of your iceberg into things you can buy rather than build. So I would maybe expect to see examples like “use a hosted telemetry collector” or something like that there?

Cruise Hall

01/28/2023, 4:53 PM
@Abby Bangser great question - I’d make the case that capabilities below the waterline can be necessary to enable capabilities above the waterline, but those necessary capabilities aren’t valuable in their own right - they’re only valuable once they enable capabilities above the waterline.
^^ This is probably the primary observation that the model is meant to convey; any other insights are probably just consequences of this assertion.
Applied to the OTel collector example, maybe the team realizes that the operators need logs more immediately than they need traces (but can imagine a future where they could leverage traces as well):
• If the lift to install the OTel collector is equal to the lift to install a logs-only agent, the OTel collector looks good b/c it enables an experience in the short term and gives the option of building an experience in the long term
• If the logs-only agent is much easier and faster to install than the OTel collector, the logs-only agent looks good
In both cases the team assumes some risk until (e.g.) a viable log search experience is actually delivered to the user. But the “Iceberg” would caution against exerting any additional effort to enable traces if the team can’t point to the experience that the tracing plumbing will enable for the users.
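One way to make the waterline argument concrete is a minimal sketch like the one below. It assumes capabilities can be modeled as a small dependency graph; the `Capability` class, the `reaches_user_experience` helper, and the example entries are all hypothetical illustrations, not part of the Iceberg model itself.

```python
from dataclasses import dataclass, field


@dataclass
class Capability:
    """Hypothetical model of one item on the iceberg."""
    name: str
    above_waterline: bool  # True if users experience it directly
    enables: list["Capability"] = field(default_factory=list)


def reaches_user_experience(cap: Capability) -> bool:
    """Return True if cap is a user-facing experience or (transitively) enables one."""
    if cap.above_waterline:
        return True
    return any(reaches_user_experience(child) for child in cap.enables)


# Illustrative entries only, loosely based on the OTel example in this thread.
log_search = Capability("log search experience", above_waterline=True)
logs_agent = Capability("logs-only agent", above_waterline=False, enables=[log_search])
otel_collector = Capability("OTel collector", above_waterline=False, enables=[log_search])
trace_plumbing = Capability("trace plumbing", above_waterline=False)  # no experience identified yet

for cap in (logs_agent, otel_collector, trace_plumbing):
    status = "worth prioritizing" if reaches_user_experience(cap) else "hold until an experience is identified"
    print(f"{cap.name}: {status}")
```

In this sketch the trace plumbing is flagged only because nobody has pointed to the experience it would enable; linking it to a concrete tracing experience would change the result.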
Oh and I forgot to address a real example - our team provided a log search dashboard to users of a data workflow platform, and last year we noticed that the time-to-search latency skyrocketed. Further analysis revealed that increased log volume for one framework (of many) was to blame. Our order of operations was:
1. Create a dashboard per framework to insulate as many users as possible (experience engineering)
2. Delete the offending log.info statement (signal mining)
3. Prioritize a structured logging architecture initiative to give us more levers to improve dashboard performance in the future (foundational plumbing)
1 paid off immediately for users of non-affected frameworks. 2 paid off in a sprint or two for users of the impacted framework. And 3 is still in progress, and will only deliver value to the dashboard users after we use structured logs to improve its performance - however we’ve since found other ways to leverage structured logs to power other experiences, improving the likelihood that the initiative will be worth the effort.
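For readers unfamiliar with structured logging, here is a minimal sketch using Python's standard `logging` module. It is not the team's actual architecture; the `JsonFormatter` class, the `make_logger` helper, and the `framework` field are assumptions chosen to show how a per-record field could give a dashboard something to filter on.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object (hypothetical field names)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # "framework" is an assumed field, passed via the `extra` argument.
            "framework": getattr(record, "framework", "unknown"),
        }
        return json.dumps(payload)


def make_logger(name: str, stream) -> logging.Logger:
    """Build a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]  # replace any handlers from earlier calls
    logger.propagate = False     # avoid duplicate output via the root logger
    return logger


# Usage: each record carries its framework as a field rather than free text.
buffer = io.StringIO()
log = make_logger("workflow", buffer)
log.info("task finished", extra={"framework": "framework-a"})
print(buffer.getvalue().strip())
```

With a field like this on every record, a log backend could filter or index by framework instead of searching message text, which is one of the "levers" a structured logging initiative might provide.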

Sonja Chevre

01/30/2023, 7:04 AM
oh this is super interesting! I used to work as a product manager for an observability platform, now working in the API space and in charge of the cloud and platform ops - and of course very interested in the power of observability. also I'm a fan of Wardley Map - so your post really got my interest!

Abby Bangser

01/30/2023, 10:34 AM
Ah that is really great to share Cruise! Thanks for making the example so concrete. It sounds like it really helps push for focused prioritisation 👏