Lessons Learned from Decommissioning a Legacy Service
February 22, 2017
Last quarter, I worked on a project called “kill-inbox-war.” Inbox-war was a frontend service that served the legacy LinkedIn messaging UI, RPC (Remote Procedure Call) endpoints that predate Rest.li, and some parts of notifications and invitations for LinkedIn. Even though most of the messaging functionality had already been moved out of inbox-war before the project started, the service still handled more than 1,400 QPS (queries per second) at peak. After almost four months of hard work, the QPS dropped to almost 0. The graph below shows inbox-war QPS for the past six months.
Figure: inbox-war QPS for the past six months
In this post, I would like to share some of my experiences and what I have learned from this project.
Four “easy” steps to kill an old service
Figure out which endpoints are still in use. Using logs, InGraphs, Inspector (an internal tool to track pageviews), service-call events, etc., you can get a clear picture of what endpoints are still in use. Build a table that has an analysis of traffic for all the endpoints still in use.
Identify alternatives for each endpoint. Old services can usually be decommissioned because most of their functionality has been rewritten with better implementations. Identify the alternatives you have for each of the remaining endpoints.
Figure out a strategy to deprecate each endpoint individually. Start with the easiest endpoint, one that has a straightforward alternative, then tackle the others one by one. Some of the alternatives might be implemented differently: for example, the original endpoint might use HTTP GET while the replacement uses HTTP POST, or vice versa. Implementation differences like this need to be addressed individually. Sometimes it can be messy, but don’t lose patience.
Fix corner cases. It is practically guaranteed that your analysis will not cover every endpoint that is still in use. Some use cases have so little traffic that they never show up in the data. Expect these to surface and leave some extra time for addressing them.
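As a rough illustration of the traffic table from the first step, here is a minimal Python sketch. The log format ("<METHOD> <PATH> <STATUS>"), the endpoint paths, and the function name traffic_table are all assumptions made for the example; they are not the actual inbox-war logs or tooling. Counting per HTTP method as well as per path helps low-traffic variants of an endpoint stand out.

```python
from collections import Counter

# Hypothetical access-log lines in "<METHOD> <PATH> <STATUS>" form.
# The format and the endpoint paths are assumptions for illustration only.
SAMPLE_LOG = """\
GET /inbox/messages 200
GET /inbox/messages 200
POST /inbox/messages 200
GET /inbox/invitations 200
GET /inbox/messages?page=2 200
"""


def traffic_table(log_text):
    """Return rows of (path, method, count, share), busiest first."""
    counts = Counter()
    for line in log_text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        method, raw_path, _status = parts
        path = raw_path.split("?", 1)[0]  # ignore the query string
        counts[(path, method)] += 1

    total = sum(counts.values())
    return [
        (path, method, n, n / total)
        for (path, method), n in counts.most_common()
    ]


if __name__ == "__main__":
    for path, method, n, share in traffic_table(SAMPLE_LOG):
        print(f"{path:<20} {method:<5} {n:>5} {share:>7.1%}")
```

In practice the input would come from whatever log or metrics source is available, and the rows would feed the per-endpoint analysis described above.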
Lessons learned from decommissioning a legacy service
I have learned a lot of things from doing this project.
The code you write today will last longer than you imagine. There is an industry rule of thumb, the 2-4-6 rule, which says that frontend services usually last around two years, APIs around four years, and backend services around six years. In reality, services will probably last much longer than that. For example, inbox-war was created in 2009 and stayed around for more than seven years, even though it is a frontend service.
Think carefully about the design and implementation. People reading your code in the future may never meet you and will never attend your design meetings. Write code in a generic way that is easy to follow, rather than writing code in a hacky way to be cool. When removing one endpoint, I thought it was a simple endpoint that could be migrated by redirecting it to an alternative endpoint. But after I did this, another team complained that some functionality was broken. I figured out that a small share of the endpoint's traffic used HTTP POST, while 99% of the traffic used HTTP GET. From this example, I learned that you should use an endpoint for its intended purpose, rather than hacking it a certain way just to get your application to work. While a hack might help you get things done faster, future maintainers might face a lot of challenges when working with your code. Quoting from one of the engineers on our team, Swapnil Ghike: “Code is communication. It’s our primary and the longest lasting tool to inform our thoughts to our current and future team. Thus, we should do this communication right.”
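One way to catch this kind of surprise before flipping a redirect is to compare the HTTP methods actually observed on the old endpoint with the methods the replacement accepts. The sketch below is only an illustration under assumptions: the function name plan_migration, the input shapes, and the traffic numbers are hypothetical, and it is not how the inbox-war migration was actually done. It relies on one standard HTTP fact: a 307 redirect tells the client to repeat the request with the same method, whereas clients often turn a POST into a GET after a 301 or 302.

```python
from http import HTTPStatus


def plan_migration(observed_methods, replacement_methods):
    """Suggest a per-method plan for moving one endpoint to its replacement.

    observed_methods: dict of HTTP method -> request count on the old endpoint
    replacement_methods: set of HTTP methods the replacement endpoint accepts
    Both inputs are hypothetical shapes chosen for this example.
    """
    plan = {}
    for method, count in observed_methods.items():
        if count == 0:
            continue  # no live traffic for this method, nothing to migrate
        if method in replacement_methods:
            # 307 asks the client to resend with the same method and body.
            code = HTTPStatus.TEMPORARY_REDIRECT.value
            plan[method] = f"redirect with {code}"
        else:
            # A plain redirect would break these callers.
            plan[method] = "needs adapter or client change"
    return plan


if __name__ == "__main__":
    # Made-up numbers echoing the 99% GET / small POST split described above.
    observed = {"GET": 9900, "POST": 100}
    for method, action in plan_migration(observed, {"GET"}).items():
        print(method, "->", action)
```

Even a tiny share of traffic on an unexpected method shows up as its own row in the plan, which is the point: the rare case gets an explicit decision instead of being discovered after the change ships.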
Always seek a clean solution. Many bug patches are intended to be temporary fixes, but the code will probably stay there forever. You don’t know your users. Your corner case will definitely be hit some day, especially in large-scale apps used by millions or billions of people.
Delete code that is no longer in use. In some cases, I found code and files that had been superseded by new implementations but never deleted from the code base. I ended up changing files that were no longer in use, deploying and testing locally, and spending days trying to figure out why I didn’t see my changes. People from other teams will not have the domain knowledge in your area. The cost of figuring out which exact file is in use will be much higher than the cost of deleting the code and files when they first became obsolete. Keep a clean code base. Don’t confuse your readers.
Mindset you need when decommissioning legacy services
Patience is always a virtue. This is probably the most important thing to remember. You will face scenarios where the people who wrote the code or know the context have already left the company. Nobody remaining knows the code anymore, and you will need to figure it out on your own. Don’t panic. The people who know about the specific implementation might have left, but there will likely still be people who know about the framework used. Or, if you cannot find people who know about the framework, you can find people who at least know about the underlying technology. Software engineering is all about logic. With some patience and time, you can reverse engineer how things are put together.
Communication is important. You might face scenarios where other teams are using the endpoints in your old service. The alternative service provides a replacement, but the behavior might differ from the previous one. In this situation, communicate with the other teams about the issue, explaining the benefits gained from decommissioning old services, the impacts decommissioning might have (good or bad), and what needs to be done when issues surface. Plan and prepare carefully ahead of time.
Legacy services, of course, are not only an issue at LinkedIn. Dealing with legacy services has always been a headache for technology companies. All new services today will become old one day, and meanwhile we will keep stacking functionality on top of them.
Finally, let me summarize my lessons learned in two sentences. First, when you develop new services, think carefully, write clean code, and set a high bar for maintaining the services. Second, when you deprecate old services, be patient and courageous.