Contrary to the name given, firefighters wish to prevent fires rather than fight fires. A lot of people in fire safety always think about prevention first then only about fighting methods. I had briefly worked in the space of safety systems, for an office building many fire safety protocols are in place like monitoring, automated extinguishing, ease of access, education etc. A fire is best prevented or extinguished when it is too small to cause any damage.
Software industry on the other hand idolizes fire fighting metaphorically. We have progressed with tech so much that instead spending energy on monitoring and alerting alone, there are systems which can scale up and down based on traffic, restart itself when not responding, retry when messages fail to deliver etc without manual intervention.
With developers also taking up operations, they are equipped with power to take the call on how to design and run the application. Building a resilient system takes deliberate effort and often have to think about how will the system perform under some conditions. It can’t be caught with unit & integration test cases easily, may be performance tests can catch but it is often not built in the continuous delivery pipeline.
Let us take a scenario, a solution will take about 5 days to introduce some functionality in two different layers to take it to prod but the solution will be very resilient with very less chance of failure versus another solution that can be developed and deployed in a day but it has to be manual scaled up and down depending on traffic situations.
For a reader the solution 1 may look obvious when looking in isolation, but for a set of people who are constantly developing and operating software will not give a second thought to understand the repercussions of solution 2 and may end up taking it, so that there is less work on their plate.
How can we avoid this? Akin to fire prevention we have to set architectural and design decisions that can keep resiliency and self healing as primary considerations for an established system. This can be relaxed for products that are looking for a market fitment, which can be more relaxed compared to an established product.
In one of the situations I was asked when I pointed out a similar problem “What is wrong? I am anyways monitoring, I will go ahead and fix as soon as the alert comes”. I replied “By the time you fought the fire, you already lost some business to competition and also ended spending more money during operations than developing and taking to market”. When people did a simple math they understood the magnitude of it.