2024/02/17 - Harms of "Maintenance mode"

Context

While I was calmly watching a movie on a Wednesday evening, PagerDuty started yelling at me through the cheap speakers of my phone. Time to strap in and figure out what was causing chaos!

What was happening?

In short, a specific feature1 that my team had inherited had been flagged by a notable customer as lacking security precautions around the distribution of static assets.

What's so special?

This just sounds like business as usual, right? Unfortunately, incident response dragged on into the night much longer than expected. The culprit: a lack of knowledge.

Lack of knowledge on my part? Yes.
Lack of knowledge on everybody's part? Sadly, also yes.

See, the codebase where this feature was implemented had been placed in what you could call maintenance mode. My colleagues and I had inherited it as such. The only work ever done on it consisted of minor changes to resolve a vulnerability or to keep packages updated (due to minimal bandwidth).

The historical context was that this feature set had been built by external contractors at a time before internal standards had emerged. The codebase never underwent the more thorough design process that our engineers are required to follow today. As such, it was a total edge case that never saw any threat modelling.

Side-effects of growth

This situation was a direct side-effect of the company rapidly evolving out of its startup phase. When initially built, the feature was a lead-generating product that the business considered financially promising. It was a feature being pumped out of the feature factory to generate customer appetite. Rapid growth caused turnover, and the original domain experts and their know-how were lost.

Key difficulties

Here are some things that were quite notable about the codebase.

1. It was built rapidly with external knowledge and a foreign stack

The fact that this product had been contracted out meant that it did not abide by the patterns the company was trying to put in place. Although very intelligently built, it did not fit the mold of how our engineers operated day to day. We lacked the habit, familiarity, and intuition that come from working within its tech stack.

2. Documentation was well defined yet outdated

There was a good amount of documentation in the form of READMEs and system diagrams. Unfortunately, these had never been updated. Functionalities of the codebase had changed, and our only reference for the design had fallen out of date. An initial effort had been diligently executed, yet our own laziness bit us in the long run.

3. It lived in its own bubble of cloud infrastructure

This product had been built out in full isolation, down to the infrastructure level. This meant that even access to a cloud console for exploration and debugging was granted only on a need-to-know basis. You can imagine that it is kind of hard to get infrastructure support during an incident when your SRE team doesn't have access either.

4. It was quite well designed and operated reliably

This final item is the punchline. The colleague who came to my rescue had the most context on it and pointed out to me: "You really pulled the short straw; this thing hasn't caused an incident in over two years."

Since the codebase operated as designed, it wasn't very noisy. This was a problem. It happily kept handling its traffic load and flew under the radar of any decision makers within the company. The teams that successively owned it assumed everything was fine since no support tickets or incidents were being triggered for it. The service remained in maintenance mode.

If it had been prone to failure and an annoyance, its poor security design choices might have been signaled earlier.

Outcomes

  • Mean time to incident resolution increased considerably.
  • Learning had to be done by spelunking on the spot.
  • Solutions to adapt the initial product design had to be discovered under pressure.
  • Cruft was finally made visible to stakeholders and prioritized!

A couple of lessons learned

Although it was a pain to turn on my neurons outside of work hours, the experience was highly beneficial. Here are some thoughts:

Security debt of operational solutions

Technical cruft tends to accumulate faster than headcount grows. Since there were never enough resources to allocate to refactoring the project toward a more modern paradigm, it was left unattended. This was a case of the good old "don't touch it if it isn't broken". Unfortunately, security debt can be a lot more subtle than general software implementation debt.

Be quick to deprecate

This is more of a mental note to myself: if a feature doesn't provide considerable ARR or is underutilized, champion its destruction and burn it to the ground even if it is functional. New product initiatives will rarely constrain themselves to adapting to legacy implementations; they will simply be built from the ground up according to current requirements.

A service catalog is not enough

Another interesting point is that this product's existence wasn't unknown to the team owning it. It lived in our software catalog and had up-to-date ownership. Even with all of this, risks inherent to its design still lurked. Although we kept track of it, we couldn't fully gauge the urgency of care it required. Never assume that your software is safe just because it hasn't been touched recently and has already passed security scans.

Prioritize valuable analytics

As owners of the codebase, we had very little visibility into the impact of the product. Metrics were minimal since it had been built quickly to respond to short-term needs. How many customers used it? What was the ballpark of the profit it generated? How much scale could it handle? All of these are questions that should be answerable in a split second, yet nobody could answer them at the time. Maybe tacking an OTel metric or two onto the key flows could have given a great deal of insight via a dashboard.
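
To make that concrete, here is a minimal sketch of the kind of counter I have in mind, using the OpenTelemetry Python API. The meter name, metric name, attributes, and the serve_asset function are hypothetical and would need to map onto the product's actual flows and the organization's real metrics backend.

    # Minimal sketch (hypothetical names): count requests on a key flow so a
    # dashboard can answer "who uses this, and how much?" at a glance.
    from opentelemetry import metrics
    from opentelemetry.sdk.metrics import MeterProvider
    from opentelemetry.sdk.metrics.export import (
        ConsoleMetricExporter,
        PeriodicExportingMetricReader,
    )

    # Export to the console for the sketch; in practice this would point at
    # whatever metrics backend feeds the dashboards.
    reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
    metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

    meter = metrics.get_meter("static-asset-delivery")  # hypothetical service name
    asset_requests = meter.create_counter(
        "asset_requests_served",
        unit="1",
        description="Static asset requests served, by customer tier",
    )

    def serve_asset(customer_tier: str, asset_id: str) -> None:
        # ... the existing delivery logic would live here ...
        asset_requests.add(1, {"customer.tier": customer_tier})

Even a single counter like this, broken down by customer tier, would have answered the usage question from a dashboard instead of requiring spelunking during an incident.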

Summary

Although messy, this was a fun time. I got to play around with some serverless pipelines and a couple of AWS services that I don't get to use in my day-to-day. Sometimes, even chaotic moments can bring a breath of fresh air!

1

Implementation details and any names of services/features were omitted for security and confidentiality reasons. The operational lessons feel more valuable to me in this case than the technical details.