When I first started my current job, I had to fly out to San Francisco for an onboarding session with a handful of other new hires from across the entire enterprise. The then-VP of engineering, Patrick, led a Q&A session and I raised my hand, incredulous, to ask, “Do we really have 900 repos in GitHub?”
My previous company had a monolith that my fellow senior engineers talked about splitting up, but just figuring out what the infrastructure should look like seemed daunting for a small company. And earlier that summer, I’d had a conversation about microservices with an acquaintance at a barbecue, but I didn’t get the sense that either of us really got the point. My understanding was that microservices were only for scaled-out engineering teams, but I had no idea what the correct scale was.
Patrick just nodded his head firmly. I couldn’t tell if he was genuinely unfazed by the number. Was 900 repos a lot? Surely some of them were archived. Surely not all of them were deployable services.
At the time, our tech org was just shy of 500 people. And experientially, none of it felt chaotic. We had multiple well-staffed platform teams who had clearly put a lot of thought into ensuring the developer experience worked smoothly:
- Well-documented standards that made setting up new services, APIs, and client gems simple.
- Pact tests that both validated API-level contracts and whose artifacts acted as mock data for local development.
- All services had a staging environment and there was a service dedicated to seeding QA data across systems.
This felt like the deliberate outcome of a highly collaborative group of engineers who had the breathing room to think about what made sense rather than “well, this is the cool thing, let’s try it.”
When the distributed architecture presents nameless, unknowable friction
My new team was actively trying to split an order history API from an order service into its own separate service for retrieving product data related to the order history. To complicate matters, the company had two types of orders and the service being migrated only provided details for one order type, while a separate payments service provided data for the other order type. This precarious bifurcation led the team to build a data aggregation service, whose sole job was simply to return the IDs of the products in the order history.
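To make the shape of that aggregation service concrete, here is a minimal sketch of the idea, not the actual implementation. Every name in it (`OrderLine`, `product_ids_for_customer`, the two fetcher functions, the sample IDs) is hypothetical, and the fetchers stand in for network calls to the two upstream services.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class OrderLine:
    """One purchased product within an order (hypothetical shape)."""
    order_id: str
    product_id: str


# A fetcher stands in for a call to one upstream service.
Fetcher = Callable[[str], Iterable[OrderLine]]


def product_ids_for_customer(customer_id: str, fetchers: Iterable[Fetcher]) -> list[str]:
    """Return de-duplicated product IDs across all order sources, in first-seen order."""
    seen: dict[str, None] = {}  # dicts preserve insertion order
    for fetch in fetchers:
        for line in fetch(customer_id):
            seen.setdefault(line.product_id, None)
    return list(seen)


# Hypothetical stand-ins for the two systems that each knew about one order type.
def fetch_from_order_service(customer_id: str) -> list[OrderLine]:
    return [OrderLine("o-1", "p-10"), OrderLine("o-2", "p-11")]


def fetch_from_payments_service(customer_id: str) -> list[OrderLine]:
    return [OrderLine("o-3", "p-11"), OrderLine("o-4", "p-12")]


product_ids = product_ids_for_customer(
    "c-1", [fetch_from_order_service, fetch_from_payments_service]
)
```

Seen this way, the whole service amounts to a union over two sources, which is the kind of thing a single well-modeled order table would answer with one query.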
Years later, it’s clear that the pain point was a lack of investment in the upstream data modeling. The downstream team cut corners by using services to work around the problem rather than having the team that owned the upstream order systems fix the data modeling. Why? I don’t have the answer to that. But in hindsight, it’s clear that the way the order domain was split into multiple services owned by multiple teams hid the burden well.
I rolled out a giant sheet of paper on my dining room table and started diagramming all the moving parts of the migration. I had questions about the intended split of domains, but without yet understanding what the company’s near-term business goals were, I didn’t really know how to ask the questions I wanted to ask.
Three years later, I’d put together a slide deck proposing deprecating the service, alongside three other services that all arguably fit into an “order management” domain. But by the time I made the proposal, everyone was too overwhelmed by the complexity and the reality of the world we lived in, where phrases like “scrappy” and “MVP” and “we’ll do the tech investment next quarter” dominated. Within a week, the people in charge of the roadmaps would conveniently forget the tech investment that had been sacrificed for shipping velocity. I watched as multiple people raised hands during my presentation, cutting me off, asking how any of this simplification was possible, without offering me the chance to walk through it.
I became exhausted by the skepticism.
No one ever sat me down and told me what the point of a distributed architecture was, but I had inferred it pretty quickly:
- It allows some degree of resilience during a partial outage if the system itself is designed with graceful degradation in mind.
- It encourages cleanly decoupled domains with strong ownership boundaries.
- It removes bottlenecks.
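As a rough illustration of the first point, graceful degradation can be as simple as deciding which dependencies are essential and which are optional. The sketch below is an assumption-laden example rather than anything from our codebase: `build_order_history_page` and both fetcher arguments are hypothetical names, and a real service would catch specific timeout or connection errors instead of a bare `Exception`.

```python
from typing import Callable


def build_order_history_page(
    fetch_orders: Callable[[], list],
    fetch_recommendations: Callable[[], list],
) -> dict:
    """Assemble a page from one essential and one optional dependency."""
    # Essential: if orders cannot be loaded, let the error propagate.
    orders = fetch_orders()

    # Optional: if recommendations are down, serve the page without them.
    try:
        recommendations = fetch_recommendations()
        degraded = False
    except Exception:  # assumption: narrow this to real client errors in practice
        recommendations = []
        degraded = True

    return {
        "orders": orders,
        "recommendations": recommendations,
        "degraded": degraded,
    }


def recommendations_unavailable() -> list:
    """Hypothetical stand-in for a recommendations service that is down."""
    raise ConnectionError("recommendations service unavailable")


page = build_order_history_page(lambda: ["o-1", "o-2"], recommendations_unavailable)
```

The partial outage only costs the page its recommendations. That benefit exists only because the caller was written to tolerate the failure, which is the "if" in the bullet above.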
But I also saw the pain points just as quickly:
- The pact tests (API schema contracts) between services were awkward and painful to maintain, requiring multiple pull requests that had to be merged in a very particular order to not break CI.
- Microservices create decoupling to the point where, when introduced into highly siloed org charts, they actually cause horrifying organizational friction. To roll out Feature A for Q2, we now need Teams X, Y, and Z to sequence work in their roadmaps rather than have the entire feature be single-owner.
- To work around that friction and meet unreasonably tight deadlines, teams end up creating new microservices that further fragment the domain of another team.
The result:
Only a few months into the job, I felt an incredible urgency to find the right abstractions for the domains I was involved with, mostly systems that constellated around a fledgling e-commerce platform emerging out of an existing subscription service. I had no experience with e-commerce, but I had a lot of opinions nonetheless.
No one asked me to, but I stood up a series of services around what I felt were the proper domains for an e-commerce platform: a centralized product data cache, a recommendations gateway system, and an orchestration service that stitched everything together. And to be clear: I was merely following my gut on these decisions. I had no intellectual awareness of orchestration architecture patterns or what the appropriate jargon-y response was to the question of, “Why clean domain boundaries?” My mind just knew: these things categorically fit together and I will arrange them as such.
It’s not really shitty. This post is just a placeholder. Where I test out that Mermaid is rendering diagrams: