# general
d
Subject:
Challenges in End-to-End Integration/App Testing
Problem Statement: I am currently leading a developer experience charter at a company with a development team of around 500 engineers. My team and I are responsible for improving developer experience and have built an Internal Developer Platform (IDP) that enables the creation of ephemeral environments and service deployments. While the IDP has resolved many feature-testing challenges, allowing service owners to independently deploy and test services in smaller groups, we continue to face significant challenges with end-to-end app testing (the full flow through the app):
1. Service Stability: Deploying a large number of services (e.g., 50) simultaneously in a testing environment often results in multiple failure scenarios, such as:
◦ Deployment issues due to network interruptions.
◦ Resource limitations.
◦ Failures caused by other unforeseen factors.
2. Test Data Consistency: Ensuring consistent and reliable test data across multiple services remains a bottleneck. For example:
◦ Services like the wallet service must have the wallet correctly initialized for the user under test.
◦ Maintaining consistent test data across ephemeral environments is challenging and often leads to unreliable test outcomes.
3. Up-to-Date Service Definitions: Incremental changes to service code, database schemas, seed data, etc., are occasionally missed by service owners. The resulting outdated service definitions make it hard to establish a stable environment with fully functional business flows.
Research and Insights: I’ve explored solutions to these challenges and came across Uber’s approach of creating a sandbox environment. In their approach, the base environment mirrors production, and services with new feature code changes are deployed in a sandbox (referred to as a "slate").
Request for Input: Has anyone faced similar challenges and successfully resolved them? I’d love to hear about your experiences, insights, or any best practices you can share.
a
@Deepak Chougule If Uber’s approach is appealing, you may want to check out Signadot, as it’s a more general, k8s-native implementation of Uber’s approach. (Full disclosure: I’m the founder.) Thanks
p
Sounds to me like typical problems of an imperative implementation; I would look deeper into more declarative ways of building things (the K8s Operator pattern)
s
Do you need to be deploying all those services to get confidence in the thing being deployed? It may be worth siloing off some of the more problematic services and using something like Consumer Driven Contract Testing so that services are coupled to an interface, not an implementation. Sometimes it's not a platform engineering problem but an architectural one.
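The contract-testing idea above can be sketched without committing to any particular framework. Below is a minimal, hedged illustration in Python of a consumer-driven contract check (the contract fields, service names, and response values are all hypothetical; real setups typically use a tool like Pact rather than hand-rolled checks like this):

```python
# Minimal consumer-driven contract check (illustrative; all names are hypothetical).
# The consumer publishes the shape of the response it relies on, and the
# provider's CI verifies its real responses against that contract, so the two
# services can be tested against an interface without deploying both together.

CONSUMER_CONTRACT = {
    # field name -> expected Python type
    "user_id": str,
    "wallet_balance": int,
    "currency": str,
}

def verify_against_contract(response: dict, contract: dict) -> list:
    """Return a list of contract violations (empty means the provider conforms)."""
    violations = []
    for field, expected_type in contract.items():
        if field not in response:
            violations.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            violations.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(response[field]).__name__}"
            )
    return violations

# In the provider's CI, run the check against a real (or recorded) response:
provider_response = {"user_id": "u-123", "wallet_balance": 250, "currency": "INR"}
assert verify_against_contract(provider_response, CONSUMER_CONTRACT) == []
```

The point is that a problematic service only has to keep this contract green to stay compatible; it no longer needs to be co-deployed with every consumer for each test run.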
m
The way I've handled this in the past is to maintain a "stable" collection of services alongside the dev/test deploys, using service mesh config to isolate the test service(s) from the stable ones. Requests from a test service would route to the stable versions unless there was a test version in their namespace/network segment.
The one limitation of what I ran in the past is that your entry point has to be the test service. You can overcome this with fancier routing, but I've never run that at scale.
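This kind of fallback routing can be expressed directly in mesh config. A hedged sketch using an Istio VirtualService (Istio itself, the header name, and the service/namespace names are illustrative assumptions, not details from the thread):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: wallet-service
  namespace: stable
spec:
  hosts:
    - wallet-service
  http:
    # Requests tagged with a sandbox header go to the test deployment...
    - match:
        - headers:
            x-sandbox-id:
              exact: my-feature-branch
      route:
        - destination:
            host: wallet-service.test-my-feature-branch.svc.cluster.local
    # ...everything else falls back to the stable version.
    - route:
        - destination:
            host: wallet-service.stable.svc.cluster.local
```

One rule like this per service gives the "route to stable unless a test version exists" behavior; the header must be propagated across hops for multi-service flows to stay in the sandbox.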
r
the approach that @Michael Weinberg recommends is pretty neat if you have a very complex environment. What we’ve seen is that, for most cases, deploying an end-to-end environment with all the services that your test needs gets the job done, is more straightforward, and requires less ‘magic’ (e.g. service mesh or other types of networking redirection). In my experience, Kubernetes is much better than people anticipate at running a lot of services on a very small footprint.
@Deepak Chougule Of the issues you mention, Up-to-Date Service Definitions is definitely the most challenging. Deploying 50 containers at scale in Kubernetes requires some tinkering, but it’s not super difficult. I’ve worked with teams where each dev would routinely deploy 50-100 services. It becomes more of a cost problem than anything else.
Data for ephemeral environments is super interesting as well. This is, IMO, where the “fallback to a stable service” approach tends to become more challenging. It’s a lot easier to get consistent data when every developer has an environment with every service; that way they have full control of the data lifecycle. Having datasets per service is riskier for the exact reason you mention: if every service isn’t consistent, the data won’t work. An end-to-end approach to data sets works best here, in my opinion, but I would love to hear what others are doing.
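The "end-to-end approach to data sets" can be sketched as a single idempotent seed step that owns test data for the whole ephemeral environment. A minimal Python illustration (the service names, seed user, and wallet balance are hypothetical; real seeding would call each service's API or database rather than in-memory dicts):

```python
# Sketch of an idempotent, end-to-end seed step for an ephemeral environment
# (illustrative; all names and values are hypothetical assumptions).
# One script owns the data lifecycle for the whole environment, so every
# service sees the same test user and the wallet is guaranteed to be initialized.

TEST_USER = {"user_id": "e2e-test-user", "currency": "INR"}

def seed_environment(services: dict) -> dict:
    """Seed the same test user into every service's store; safe to re-run."""
    results = {}
    for name, store in services.items():
        # "Upsert" semantics keep the step idempotent: re-running the seed
        # after a partial failure converges to the same state.
        store.setdefault(TEST_USER["user_id"], {"currency": TEST_USER["currency"]})
        if name == "wallet":
            # The wallet service needs an initialized balance for the flow to work.
            store[TEST_USER["user_id"]].setdefault("balance", 1000)
        results[name] = store[TEST_USER["user_id"]]
    return results

# In-memory stand-ins for each service's datastore:
services = {"users": {}, "wallet": {}, "orders": {}}
first = seed_environment(services)
second = seed_environment(services)  # re-running changes nothing
assert first == second
assert services["wallet"]["e2e-test-user"]["balance"] == 1000
```

Because one script seeds all services from one definition of the test user, the wallet-not-initialized class of inconsistency can't arise between services in the same environment.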
m
For sure. The approach I described was an evolution from deploy all the services. YMMV on how easy that is in practice
d
@Shane Dowling Thanks for your response. In most cases, deploying all of the services for end-to-end testing is not required. But for big feature releases, where multiple services are involved in completing a connected flow, we need to test through the app to make sure the end-to-end flow works as expected.
@Michael Weinberg @Ramiro Berrelleza Thanks for your responses. Yeah, I have read about the approach you are describing: maintain a stable env that mirrors production (in many cases people use the production env itself as the stable env), deploy services that have changes into a sandbox env, and route requests to the stable env using a header. Maintaining the stable env is the challenge here; keeping it identical to production is hard, especially when it comes to data. One approach is to take production snapshots periodically and restore them to the stable env, but then you need a dedicated set of people to manage and fix issues in this env. If, instead of maintaining a separate stable env, we treat prod as the stable env, there are a lot of complicated scenarios to take care of, like message queues, async flows, etc., and we are playing directly with prod data, which might get corrupted if we miss any flow.