I'll let you all in on a secret. The big tech companies don't keep their websites perpetually available because of magic. These companies are able to serve their users on the web so consistently because of a few specific strategies: knowing the total cost of downtime, having a plan for failure, measuring success, and being nimble enough to learn from mistakes. This means honest corporate introspection and a culture of learning from both failures and successes.
The Cost of Downtime
The cost of website downtime is not simply lost revenue (is this downed service even a revenue generating portion of your technology?), but a much more complex consideration. How do you quantify and attach a value to a disenfranchised customer base or measure brand impact for an unavailable public service? Or, in a world where sourcing, hiring, and retaining skilled technologists is a herculean effort, how is the morale of your team--from DevOps to project managers and everyone that they interact with--affected by downtime? And lastly, what is your team *not doing* because they're firefighting something that they'd really like to fix?
Downtime can be classified roughly into four major areas of cost:
(Lost revenue) + (technological repair costs) + (social repair costs) + (opportunity costs) = total cost of downtime.
The first two numbers are usually discrete, known, and frequently compelling.
Lost revenue on something like an e-commerce site is easy to calculate:
(Revenue per minute [or hour]) x (number of minutes [or hours] the site is unavailable).
It's also a number that should be available for assessing the value of a technical upgrade. For something like an internal service, lost revenue is more difficult to calculate but can be approximated by:
(Number of employees affected) x (hours lost) x (FTE [full-time equivalent] costs).
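Taken together, the formulas above make for a quick back-of-the-envelope calculator. A minimal sketch follows; every figure in it is a hypothetical example, and the social-repair and opportunity terms are rough placeholders for the costs this article argues are hard to quantify:

```python
# Back-of-the-envelope downtime cost calculator.
# All dollar figures below are hypothetical examples, not benchmarks.

def lost_revenue_ecommerce(revenue_per_minute: float, minutes_down: float) -> float:
    """(Revenue per minute) x (number of minutes the site is unavailable)."""
    return revenue_per_minute * minutes_down

def lost_revenue_internal(employees_affected: int, hours_lost: float,
                          fte_cost_per_hour: float) -> float:
    """(Employees affected) x (hours lost) x (FTE cost per hour)."""
    return employees_affected * hours_lost * fte_cost_per_hour

def total_downtime_cost(lost_revenue: float, tech_repair: float,
                        social_repair: float, opportunity: float) -> float:
    """Lost revenue + technological repair + social repair + opportunity costs."""
    return lost_revenue + tech_repair + social_repair + opportunity

# Example: a 90-minute outage on a site earning $500/minute,
# with rough guesses plugged in for the harder-to-quantify terms.
revenue = lost_revenue_ecommerce(500, 90)
total = total_downtime_cost(revenue,
                            tech_repair=10_000,
                            social_repair=15_000,   # placeholder estimate
                            opportunity=20_000)     # placeholder estimate
print(f"Lost revenue: ${revenue:,.0f}, total cost: ${total:,.0f}")
```

Even with placeholder values for the last two terms, writing the calculation down makes the point: the visible lost revenue is often only half the bill.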
Technological Repair Costs:
Tech repair costs depend critically on your style of infrastructure, your application deployment, and how close you are to the nirvana of 'continuous delivery'. Fixed physical infrastructure with lots of shared/multi-tenant components (e.g. a monster hardware load balancer serving 100 applications) is a lot more rigid and brittle than a software-defined infrastructure where each component can be safely torn down and replaced independently of everything else. A continuous delivery / DevOps philosophy can be used to advantage to test and deploy scalability fixes and bug fixes alike.
Companies that are forging a Cloud strategy using utility compute (IaaS, PaaS, DaaS, BaaS, SaaS) instead of hosted or on-premises hardware are able to be more experimental and iterative in their responses to failure. The ability to make incremental architectural changes, without large capital costs to resize a system or months-long projects to upgrade, is a significant business advantage.
Perhaps you have only 24 hours' notice before some enormous news event will drive 10x traffic to your already struggling website. It's much easier to add a software load balancer into your AWS infrastructure than it is to get a hardware load balancer ordered, shipped, installed and configured. This advantage isn't only fiscal. Corporate nimbleness has a direct and positive impact on the second half of the equation as well.
For most companies, the last two pieces are much harder to quantify.
Social Repair Costs:
Social repair has two major components - the customer and the employee. And as it happens, they're coupled. I've never met someone who wakes in the morning and thinks "How can I give someone terrible service today?" These people may exist, but I haven't run into them and I think you haven't either. More often the person tasked with fixing the website is embarrassed by delivering poor service and is holding him or herself to a higher standard of delivery than the business does... initially. He or she will do everything possible during the outage to restore service, sometimes at great personal physical cost. Anyone who has taken care of an ill child or a misbehaving technology for 24 or 36 hours straight knows it can take more than a week to feel restored.
The social cost of technical debt and outages weakens the foundation of a company's service. It is a direct expense to staff morale, new initiatives, and the end user. Empowering your team to address issues through judicious experimentation and rapid iteration keeps them engaged and improving the customer experience. This approach tackles the social cost of an outage directly. When a team can fix problems they know about, they're happier and they improve the service they're delivering--making for happier customers as well.
Opportunity Costs:
Outages take time, and responding to and fixing an outage incurs an opportunity cost. Good outage resolution is forward progress and improves the structural foundation for the future.
Businesses are supported and facilitated by technology and vice versa. The most valuable team members during and in the aftermath of an outage are those who can effectively liaise between senior or executive staff and the technology teams. They code-switch, speaking both the business and technology dialects. They ensure that both teams negotiate for the best experience for the user, communicating the problem and resolution to the customers.
When it comes to communicating with customers, it's best to take the gracious and transparent approach, "We pooched it. Here's how we are fixing the underlying problems. First A, then B;" rather than a vague obfuscating acknowledgement that "there were issues." Earlier this week, I was told a story about an ISP that lost a full day of email delivery for 250k users. They sent an email to each customer transparently, explaining what happened, with a personalized section including the "from" line email address and timestamp of the lost mail. In response, they received dramatically more encouragement than lamentation.
There are volumes written about how to conduct gracious postmortems and how to enable and support a learning culture. If you haven't read anything by the Etsy team about postmortems, please go read Gracious Postmortems. Dave Zwieback's The Human Side of Postmortems is another great resource. But all this introspection should not wait for an outage. You can circumvent some of the cost of an outage by skipping the outage part and going straight to the cleanup. Go ask your team about the list of things they'd like to fix. I'm willing to bet it's a long list.