How often do you hear this sentence? Probably a little too often. As an engineer who has worked on a microservices architecture for the past 4 years, I would like to share with you my most valuable lessons, confident that they will be helpful to anyone trying to put a similar architecture together, and confident that I’m probably still doing most of it wrong.
We’ll travel through time, from the early moments in the development of the architecture where everything makes sense, feels great and just snaps into place, and we can watch our first microservices taking their first steps. Then, we’ll move along into very fast-paced development cycles where lots gets done and developers are able to contribute to the codebase without stepping on each other’s feet, and we’ll eventually step together into the dark side (don’t worry, I’ll hold your hand) where the world starts slowing down and standardisation evolves from a “nice to have” into a necessary step towards lesser development freedom. Eventually, we’ll reach a tipping point where we can take a step back and have a look at the overall result of our development efforts and analyse where things went wrong and how we can go back to the good times, to our Developers’ Heaven, where productivity and freedom shine bright.
The quest begins
Let’s start from the beginning. We have our beautiful blank canvas, we can draw all sorts of beautiful things on it, the first requirements come along, and it just feels natural that we can start by writing our first service thinking of it one day being part of a massively distributed software system, where this one service will just do its part. We’re still a small team, maybe 5 to 10 developers, we work together on the essential things needed by our new software system: a framework to build simple services, the infrastructure where we can run our software on (in the cloud, of course), and a continuous integration and continuous delivery pipeline for it.
At this point we don’t focus too much on the infrastructure: we’ll just pick a random and easy way to deploy some code to our cloud provider of choice and be done with it. Instead, we concentrate on making our first service as beautiful as possible. It’s our first child and it must be spoiled, right? We put together a configuration management tool and set up our CI/CD pipeline, which can, in theory, be reused for other services in the architecture, and voilà, we’ve deployed our first service into our testing and production environments. The service isn’t really doing anything, but it’s running somewhere, so now it will be easy to make it useful, and we are ready to add new services when necessary.
With all the necessary building blocks in place we start splitting up the work. It feels nice: developers have a few whiteboard sessions, come up with the first few services and can start parallelising work. Management is really happy as the MVP is being built really quickly, last-minute changes are easy to make, and developers look engaged and happy, too.
Time to start growing up
The product is doing rather well and the company has a few more ideas, so it starts hiring more developers, the architecture allows the work to be parallelised anyway, we can build things faster and nuke the competition.
New joiners are very excited to join such a great development environment, everyone comes in with their fair share of experience and debates aren’t really necessary, there’s a service for everyone that needs some work done to deliver the next product features, so let’s keep cranking out more code.
We are now a development team of 20 to 30 developers and, as usually happens, some of them quit, for whatever reason; it doesn’t really matter. What matters is that some of the developers have to step in and take over the services that were worked on by the leavers, and they don’t like what they see. “What’s this mess? Who wrote this? How is this even in production? Where are the tests? Why doesn’t this service even run locally?” are just a handful of the questions that arise from taking a look at services that were worked on by just one or two developers in isolation.
The developers in charge of taking them over start expressing their wish to go back to the services they were working on before, because theirs are the only ones which are beautiful and perfect (never heard that one before?). The overall development team, which is always willing to learn from its mistakes, sets up a meeting and everyone agrees to add pair programming and code reviews to the mix, plus some degree of documentation, to prevent this from happening again. So, some of the freedom goes up in smoke, but it’s for the greater good, and we’re very collaborative people anyway.
Our meeting isn’t even over and management comes into the room informing us of a production issue, customers are calling up complaining something isn’t working as expected. Everyone rushes back to their keyboards and starts having a look at their own services, everything looks good, it must be some other team’s problem, but which one? Unfortunately, there’s no easy way to tell. Some teams haven’t set up proper monitoring yet, logging goes into different places and it’s hard to track where the failures are occurring. We know the blame game doesn’t take us anywhere, so we start helping each other out to review the processes and see where it could be going wrong. Eventually, the issue is found and the team responsible patches the issue and quickly releases the fix to production thanks to our continuous delivery pipeline, happy days! Not so fast.
The patch was actually a database migration, which is now breaking another service. How was this not spotted in the testing environment? Unfortunately one of the teams didn’t have time to write an automated test for that feature on their side and the issue was not noticed. Management is getting nervous. The second team has to update their service to adhere to the new database schema. They manage to fix the issue in the end, but wait a second… why are we still sharing a database? Page 1 of every book about microservices says we should not do that, ever. I guess we need another meeting.
The meeting between the developers involved in the production issue comes up with a rather simple solution. First of all, one of the services should master the data and provide HTTP APIs so that the other services can access it, and of course, the APIs will not contain unplanned breaking changes. If a service needs to master some data, from now on, the team responsible for it must deploy and maintain their own database, and with this comes the responsibility of maintaining backups, scheduled updates, and so on. The developers also did not forget how hard it was to find the source of the production issue, so they decided to get distributed logging right as well, to be able to trace what is happening across the distributed system and where things go wrong, rather than relying on each team to be responsible. Otherwise there’s a very high chance of encountering the “not my problem” syndrome, where production is broken but every team’s system “looks good”.
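To make the distributed logging idea a bit more concrete, here is a minimal sketch of one common approach: every service writes structured log lines to the same place and forwards a correlation id with each request. Everything in it is an assumption made for illustration, not something the teams in this story used: the `X-Correlation-Id` header name, the `orders` and `billing` services, and the in-memory stream standing in for a central log store.

```python
import io
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON line so a central store can index it."""

    def format(self, record):
        return json.dumps({
            "service": record.name,
            "level": record.levelname,
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
        })


def make_logger(service, stream):
    """Build a logger for one (hypothetical) service writing to a shared stream."""
    logger = logging.getLogger(service)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger, headers, message):
    """Reuse the caller's correlation id, or mint a new one at the edge."""
    correlation_id = headers.get("X-Correlation-Id") or str(uuid.uuid4())
    logger.info(message, extra={"correlation_id": correlation_id})
    # Headers to attach to any downstream call, so the id keeps travelling.
    return {"X-Correlation-Id": correlation_id}


shared_stream = io.StringIO()  # stands in for the central log store
orders = make_logger("orders", shared_stream)
billing = make_logger("billing", shared_stream)

outgoing_headers = handle_request(orders, {}, "order received")
handle_request(billing, outgoing_headers, "invoice created")

log_lines = [json.loads(line) for line in shared_stream.getvalue().splitlines()]
```

Filtering `log_lines` by a single `correlation_id` then shows the path of one customer request across both services, which is exactly what was missing when every team’s system “looked good”.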
Is this time to hire a centralised DevOps team then? It sounds like a possible solution, but the word centralised goes against developers’ freedom, everyone agrees that it’s much better to just try to standardise on a logging solution and some basic metrics, even though this will mean duplicating most of the infrastructure work. Management starts to see targets not being hit, as the new infrastructure pieces are a burden to the team’s productivity for the next few weeks, but it’s still in their interest to carry on the work. It sounds much better to have something in place to alert the teams when things go wrong in production before customers do so, next time.
What just hit the fan?
The company keeps growing. It now consists of a decent number of teams and more than 50 developers. We are still releasing new features but it feels like things are slowing down, and no one can understand why. New requirements come along, simple features, but now those features require a few teams to get together into a room and figure out who’s responsible for what: “who should be the owner of that particular piece of data?”, “how do we migrate this data set or API without breaking other parts of the system?”, “when does your team have time to take on some of the necessary work for this feature?”. It feels wrong. We should be decoupled and able to work on things independently, yet this new feature would require 10 services to hit another service’s API. Performance-wise this is not going to work, and that one service could also turn into a single point of failure, which we don’t want in our distributed system. Let’s have an architecture meeting to try and nail this down.
The most skilled developers, the ones that really-really-really love solving complex problems, get into a room, and they all come up with a rather simple solution, the classic “E-U-R-E-K-A-!” moment. We just need another piece of infrastructure, a message broker, and data replication so each service can keep a replicated state of the data and we can reduce the coupling and workload between services. So much simpler! Well, no, it’s not. Migrating an architecture based on RPC into an event-driven one, embracing at-least-once delivery semantics, message ordering, data replication, message serialisation and evolution, and eventual consistency is a MASSIVE DEVELOPMENT UNDERTAKING (I would make this bigger and bolder if I could).
The developers agree on adding the piece of infrastructure and start phasing it in where necessary, but again the infrastructure work needs to go into each team’s backlog. Management people start losing some hair on this one.
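To give a feel for just one item on that list, here is a minimal sketch of what at-least-once delivery asks of every consumer: the broker may deliver the same message twice, so handlers have to be idempotent. The event shape, the `event_id` field and the in-memory set are assumptions made up for this illustration; a real service would need durable storage for the processed ids, updated atomically with the state change.

```python
class IdempotentConsumer:
    """Applies each event once, even if the broker redelivers it."""

    def __init__(self):
        # Assumption: kept in memory for the sketch; in a real service this
        # must survive restarts and be updated together with the balances.
        self.processed_ids = set()
        self.balances = {}

    def handle(self, event):
        """Return True if the event was applied, False if it was a duplicate."""
        if event["event_id"] in self.processed_ids:
            return False
        account = event["account"]
        self.balances[account] = self.balances.get(account, 0) + event["amount"]
        self.processed_ids.add(event["event_id"])
        return True


consumer = IdempotentConsumer()
event = {"event_id": "evt-1", "account": "acme", "amount": 50}

# The broker delivers the same event twice; the second delivery is ignored.
applied = [consumer.handle(event), consumer.handle(event)]
```

Without the duplicate check the account would be credited twice, and this has to be thought through for every consumer of every event, in every service.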
Work carries on as normal, but developers become slightly less motivated as more and more constraints are limiting their freedom to make choices about their services and technologies.
The most powerful selling point of microservices is usually the fact that you can have a heterogeneous set of languages and technologies and use the right one for the right purpose. This is definitely the case. You could use the most advanced tools in one service to crunch big data. In another, you could use pure functional languages to build a beautifully smart billing engine. In a web app you could just use one of the standard MVC frameworks.
The main party-pooper is the fact that services need to communicate. They will need to share state or knowledge to some degree, with either APIs or events, and they will need to make their logic available to others. This usually means services will need to be accessible, most likely either via HTTP or via the message broker of choice. The chosen technologies should support these communication mechanisms as flawlessly as possible. The developers would need to re-implement the basic needs of the infrastructure, like HTTP, CI/CD pipelines, distributed logging and tracing, monitoring, message broker consumers and producers, plus probably other pieces of the infrastructure. Each one of these tasks will need to happen for each language or technology in use. This is not very viable if you start using 10 different programming languages, for example. Even Google and Facebook have their own standard programming languages, and they don’t have a development team of merely 50 developers.
At the other end of the spectrum you could choose to use only one technology for your microservices communication. This is also a little extreme in my opinion, and the best way, as always, must be in between. You have to find the right sweet spot that works for you and your company.
One day some of the developers realise that they are spending most of their time writing infrastructure code, and little to no business logic. Before writing the code to make an HTTP request to another service, someone takes a peek at the other team’s service code and notices that the HTTP request gets converted into their business model, then a one-liner checks for one condition based on the input parameters of the request and returns the result, which then needs to get serialised back into an HTTP response and sent back to the calling service, which then needs to deserialise the response and turn it into something it can use on its side. But wait, all of this infrastructure code could be replaced by copy-pasting a one-liner of stateless logic, and yet a copy-paste doesn’t feel right either. So where’s the problem?
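A toy sketch of that situation, with everything invented for the example (the discount rule, the field names, and a plain function call standing in for the network hop), might look like this:

```python
import json


def is_eligible_for_discount(order_total, is_returning_customer):
    """The entire business rule: one stateless condition (hypothetical)."""
    return is_returning_customer and order_total >= 100


def handle_http_request(raw_body):
    """Server side: deserialise, run the one-liner, serialise the result."""
    payload = json.loads(raw_body)
    eligible = is_eligible_for_discount(
        payload["order_total"], payload["is_returning_customer"]
    )
    return json.dumps({"eligible": eligible})


def call_remote_service(order_total, is_returning_customer):
    """Client side: serialise, 'send', deserialise the response."""
    raw_request = json.dumps({
        "order_total": order_total,
        "is_returning_customer": is_returning_customer,
    })
    # A direct call stands in for the HTTP round trip, with its timeouts,
    # retries and error handling left out of the sketch.
    raw_response = handle_http_request(raw_request)
    return json.loads(raw_response)["eligible"]


remote_answer = call_remote_service(120, True)
local_answer = is_eligible_for_discount(120, True)
```

Both answers are the same, but only one line of the whole block is business logic. Everything else exists to move that line across a service boundary.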
To see the actual problem you might need to zoom out. Stop thinking about only your team’s services, start thinking about the overall picture. If something feels wrong, it could probably be improved by talking with the other teams involved, and together collaborate to find a better way. Most of us developers prefer not having to interact with others, but we should try to overcome this limitation and take the initiative to proactively improve what’s wrong with the overall architecture, as it’s the most valuable piece of work we’ll do (more than writing another HTTP endpoint anyway).
The silver lining
The teams finally get together and start revisiting some of the early choices they’ve made when splitting up the services. Taking a better look all together from the high ground made the developers notice where they could just do things differently. Some services might need to be repurposed, some might be killed, but it’s a worthy sacrifice, even if it’s a beautiful newborn baby (I should probably stop using this metaphor now…). On top of that, we can now take a step back and start rationalising most of the infrastructure. Developers are expressing how much they dislike having to write all of this infrastructure code and having to justify the longer timelines to the management team for rather simple tasks. With the change, productivity and engagement go back to higher levels, and that’s where you want your development teams to be.
After more than 4 years working on a microservices-based distributed system I’ve learnt that it mostly comes down to drawing the right lines in your domain to optimise the teams’ performance. If two services are communicating too much and seem too coupled, they probably need to be merged back into one, and don’t be afraid of doing so. If you’re working in a small team, you can start with a well-structured monolith as well. Just build it in a way that could be easily split and deployed separately but keep it together until more developers come along to help out or the codebase grows too large.
There is a very basic analysis you could do for your services right now: count the lines of code of your projects and split them into two categories, infrastructure and useful business logic. If you hit the point where more than 50% of your services’ code is related to the infrastructure, you might want to consider this a red flag and revisit the domain model and your overall architecture.
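A rough way to run that count could look like the sketch below. It assumes, purely for illustration, that infrastructure code lives in directories with recognisable names (the `INFRA_DIR_NAMES` set is made up, so adjust it to your own layout) and that non-blank lines are a good enough measure.

```python
from pathlib import Path

# Assumption: directory names that mark infrastructure code in your projects.
INFRA_DIR_NAMES = {"http", "messaging", "logging", "config", "deploy"}


def count_lines(path):
    """Count the non-blank lines in a single file."""
    with open(path, encoding="utf-8", errors="ignore") as source:
        return sum(1 for line in source if line.strip())


def infrastructure_ratio(root, suffix=".py"):
    """Return the share of lines living under infrastructure directories."""
    root = Path(root)
    infra = business = 0
    for path in root.rglob("*" + suffix):
        # Only the directories below the project root decide the category.
        directories = path.relative_to(root).parts[:-1]
        lines = count_lines(path)
        if INFRA_DIR_NAMES.intersection(directories):
            infra += lines
        else:
            business += lines
    total = infra + business
    return infra / total if total else 0.0
```

Calling `infrastructure_ratio("path/to/your/service")` returns a number between 0 and 1. Anything above 0.5 is the red flag described above.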
If you are coming in prepared, having read tons of books and watched a lot of talks but with no real-world experience, you will definitely be able to avoid some of the most common mistakes made when adopting microservices. But, make sure you write those down and review them from time to time or they might come back to haunt you.
For those of you still thinking that building a distributed system is not that hard, think again. Challenges are around every corner. Technologies move very quickly and new services become legacy in a matter of months. Distributed transactions must be avoided at all costs. Sharing code will need to happen with caution, or it could cause incompatibilities when different services upgrade to newer frameworks or libraries. DevOps must be done right, and security concerns are multiplied by the number of applications running in production. Truly independent deployments with zero downtime require some thought. New joiners can cause unnoticed damage in larger development teams. You can be the one that prevents all of this from happening. Don’t let your distributed system become legacy, be heard.
Disclaimer: this article was mostly a work of fiction, the examples used are not related in any way to my current or past working environments. I hope you still enjoyed reading it, though.