Do You Have a Downtime Response and Escalation Plan?

By Pete Czech

Let’s imagine a nightmare scenario. It’s 3 am, and you’re sound asleep. Then, you get the call – your website is down. What next? Unfortunately, you are now experiencing something almost every website owner or manager has gone through: catastrophic downtime at what seems like the worst time possible. So… What do you do?

I was inspired to write this post by the events of the last week. A new worm is propagating across WordPress sites. It looks for vulnerabilities in the WordPress installation and associated plug-ins – vulnerabilities that we don’t even know about yet – and installs software on the host computer. In this case, the installed software was mining cryptocurrency, which uses a lot of processing power. Once the miner started running, websites ground to a standstill because the server’s CPU was locked up mining the currency the bad guys prefer.

Luckily, the majority of our clients are engaged in our 24/7/365 emergency response plans. We didn’t even wake them up – the next day we alerted them to the situation and proceeded to fill in any gaps and eliminate the problematic software. However, when these situations occur, I always think about the majority of websites running WordPress that have zero escalation plan in place. With that in mind, I want to highlight some essential components of an escalation plan that would help you avoid losing any more sleep.

First, though, let me summarize our approach to this problem, with a less than desirable name: MRDR. Though, I guess it’s appropriate because we want to kill anything that can cause you to lose sleep, right?

MONITOR
RESPOND
DIAGNOSE
REACT

Let’s dig into each item above in some detail.

M: Monitoring

First, you need to install monitoring software. Without it, your first indication that the site is down will come from a customer, a co-worker or, even worse, the boss. Server monitoring is absolutely essential to track your website performance and uptime.

There are a variety of decent monitoring packages you can subscribe to that monitor your server and website remotely. These services work by making connections from the outside world to your server via a variety of protocols. When the connection works, nothing is reported. However, if the connection slows down or stops altogether, you are notified. I utilize some of these services myself, but, as I’ll explain in a second, they are not enough alone to properly manage your online properties.
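To make the idea concrete, here is a minimal sketch of what one of these outside-in checks does. It is not how any particular monitoring service works; the function names, the 3-second "slow" threshold, and the choice to treat only 5xx responses as "down" are all assumptions made for illustration.

```python
import time
import urllib.error
import urllib.request
from typing import Optional

SLOW_AFTER_SECONDS = 3.0  # assumed threshold; tune it for your own site


def classify(status: Optional[int], elapsed: float,
             slow_after: float = SLOW_AFTER_SECONDS) -> str:
    """Turn one probe result into 'up', 'slow', or 'down'."""
    # No answer at all, or a server-side error, counts as down.
    if status is None or status >= 500:
        return "down"
    # The site answered, but too slowly to call it healthy.
    if elapsed > slow_after:
        return "slow"
    return "up"


def probe(url: str, timeout: float = 10.0) -> str:
    """Fetch the URL once from the outside, the way a visitor would."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        status = err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        status = None  # DNS failure, refused connection, or timeout
    return classify(status, time.monotonic() - started)
```

Running `probe("https://example.com/")` on a schedule and alerting on anything other than "up" is roughly the whole service. Note how little it can tell you: up, slow, or down, and nothing about why.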

The issue with subscription models is that they are extremely black and white – either your site is up, or it isn’t. By the time these monitoring services know your site is failing to respond, it’s often too late to even access the server to diagnose the issue. These services are basically just an automated imitation of a user. Some are capable of logging deeper into your server for additional alerts; however, that’s something you may not want to deal with from a security perspective. For example, if you monitor MySQL remotely, then you are also opening up MySQL to remote access, which may not be worth the risk.

The best solution is locally installed software, running on the server itself, that reports back to you on a variety of different metrics. Any number of failures can cause a website to go down. You can be inundated with traffic, for example. Or, you can be overrun with an internal process taking down the site. Your hard drive can fail or fill up. So many possibilities can present themselves, and it’s essential that you have a monitoring solution that can alert you to many of those issues before end users know they are a problem. Unfortunately, this isn’t an easy thing to do – you need “root” level server access, for example. Many shared hosting providers won’t even let you install that type of software. So, while this is the preferred method of protecting yourself against surprises, it’s also an advanced-level service that you should probably rely on an expert to assist with.
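As a rough illustration of the kind of check a locally installed agent performs, the sketch below reads disk usage and CPU load on the server itself and returns warnings before anything has actually failed. The thresholds and function names are assumptions for this example, not settings from any real monitoring package, and `os.getloadavg()` is only available on Unix-like systems.

```python
import os
import shutil
from typing import List

DISK_WARN_FRACTION = 0.90  # assumed: warn when the disk is 90% full
LOAD_WARN_PER_CPU = 1.5    # assumed: warn when load exceeds 1.5 per core


def evaluate(disk_used: float, load_per_cpu: float) -> List[str]:
    """Compare two metrics against their thresholds and list any warnings."""
    warnings = []
    if disk_used >= DISK_WARN_FRACTION:
        warnings.append(f"disk {disk_used:.0%} full")
    if load_per_cpu >= LOAD_WARN_PER_CPU:
        warnings.append(f"load {load_per_cpu:.2f} per core")
    return warnings


def collect(path: str = "/") -> List[str]:
    """Read the live metrics from this machine and evaluate them."""
    usage = shutil.disk_usage(path)
    five_minute_load = os.getloadavg()[1]  # Unix only
    cores = os.cpu_count() or 1
    return evaluate(usage.used / usage.total, five_minute_load / cores)
```

A runaway process like the cryptocurrency miner described earlier would show up here as a sustained load warning, often well before an outside check reports the site as down. A real agent would track many more metrics and send the warnings somewhere a responder will see them.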

Most clients today are utilizing services such as Amazon AWS or Rackspace for virtualized server instances, which do offer root-level access and allow you to install this type of software. We recommend this for almost every client unless they truly do not care about performance or uptime issues. If you do have a minimum deliverability requirement, then these are the services which would best handle the requirement for intense monitoring.

R: Respond

The second essential component of an escalation plan is having a first responder that can diagnose an issue within a documented period of time. For example, our 24/7/365 monitoring service includes a 15-minute guaranteed first response time. You need to make sure your escalation plan has this level of assurance in place. Otherwise, it isn’t a set of procedures per se but rather just a list of “how things should be done”. In my experience, rarely is that what actually happens! And, in my experience, the sooner an issue is spotted, the quicker it is fixed.
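A guaranteed response time only means something if you can check it. The sketch below shows one simple way to test whether an alert was acknowledged inside the window, using the 15-minute figure quoted above. The function and parameter names are hypothetical, and a real system would pull the timestamps from its alerting tool.

```python
from datetime import datetime, timedelta
from typing import Optional

FIRST_RESPONSE_LIMIT = timedelta(minutes=15)  # the figure quoted above


def is_breached(alerted_at: datetime,
                acknowledged_at: Optional[datetime],
                now: datetime,
                limit: timedelta = FIRST_RESPONSE_LIMIT) -> bool:
    """Return True if the first-response guarantee has been missed."""
    deadline = alerted_at + limit
    if acknowledged_at is not None:
        # Someone responded: the only question is whether it was in time.
        return acknowledged_at > deadline
    # Nobody has responded yet: breached once the deadline has passed.
    return now > deadline
```

Reviewing every alert against a check like this is what turns “how things should be done” into a procedure you can hold someone to.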

The key to response is speed – you need to know that the responders will actually respond within the guaranteed timeframe. This is the secret to avoiding those 3AM wake-up calls. I’ve seen organizations attempt to set this up a variety of ways. First, I’ve seen clients who do this in-house, assigning certain employees to be “on-call” for time periods to ensure coverage. I find this works well when you need development support associated with an outage, but not very well when it comes to first response. The best first responders are literally already awake and already at work waiting for issues to arise. This is hard for organizations to provide in-house unless they have a significant IT infrastructure already in place. 

The second approach is outsourcing. First response is something that makes total sense to outsource – it’s about speed, diagnostics (which we’ll review in a minute), and then escalation to the appropriate resources for patching. Since this is basically an insurance policy, it doesn’t make sense to build an internal team when there are other options that specialize in this discipline available and at a much lower price point.

Finally, many of the larger hosting and infrastructure companies such as Rackspace offer system administration services, but unfortunately, they fall short. They focus solely on the server but don’t dig any further. Once they figure out that an issue lies with any of the software on the server, they leave it to you to figure the rest out. A good system administrator must also have a depth of understanding about the software on the server such that they can at the very least make an in-depth diagnosis and assist developers in making the necessary adjustments to solve the issue. I find that with hosting companies this is especially difficult as they are monitoring thousands of servers and customers at a time, whereas outsourced system administrators have a much smaller client base and therefore have knowledge of your situation, business, and software running on the server in the first place.

D: Diagnose

A first responder needs two things. First, they must have the access and ability to dig into a server and understand what is happening. This is a mostly diagnostic task, and a system administrator is the best candidate for this position. However, a system administrator alone often isn’t enough. You also need to make sure that the system administrator understands the software on the server. As I mentioned above, having a system administrator on call that doesn’t understand what your server is doing from a software and business perspective really just means they will funnel your issue down the line of escalation faster, with little more insight than a soft diagnosis. For example, they may tell you your database is down because it is overloaded, but they won’t tell you why: that’s your problem.

Secondly, your system administrator must have the ability to act on your behalf as necessary. Sometimes they will need to escalate the issues they find to your hosting company. If you have an outsourced systems team, they must be able to communicate with your hosting company directly via a support or ticketing system, because, depending on the issue at hand, many requests can only be handled by the host. Empower your system administrators to act on your behalf and you will not be spending long nights in front of the computer when the you-know-what hits the fan. To be clear: it’s better to have a system administrator who understands your company and software and can escalate to your host than to have your host serve as the system administrator and escalate issues right back to you. There is a difference between the two.

With these points made, the key to first response can best be summarized as:

  • First responders should be system administrators who are strong at diagnosing issues, and also have exposure and a level of understanding in relation to the software and purpose of the server.
  • There should be a guaranteed response time that is official policy for downtime alerts.
  • They need access to escalate issues to the hosting or network provider to avoid having you stuck as the middleman.
  • They need to be able to communicate collaboratively with the next layer of support rather than just pass the buck.

R: React

Ultimately, the role of the first responder is to diagnose, stabilize and then escalate if necessary to the next step. Most of the time, when a website or web application fails, the failure can be traced back to one of a small set of recurring causes. As in our earlier example, a WordPress worm installs malicious software on your server, usually via some hole or vulnerability within the software. In this case, the fix is two-fold: first, you need to diagnose the problem and clean out the offending software. Then, you need to patch the vulnerability.

System administrators are typically not going to dig too deeply into the patching, but rather focus on the cleaning and diagnostics. This means that you need a subsequent stage in your escalation plan: how to react to a diagnosis from a software perspective. This step is always a bit trickier. Having development support at 3AM isn’t as easy as having an on-call administrator. So, as part of most escalation plans, we focus on having the administrator triage and stabilize until developers can respond. In the WordPress example, the administrator, who is empowered to make decisions, could take any number of steps, such as cleaning out the offending software and then preventing further installs with temporary measures on the server. Then, when the developers become available, the administrator can communicate with them to ensure a long-term fix.

Developers should step in for precisely that reason – crafting any long-term fixes that need to be in place in response to these new threats or optimizing the code to prevent further downtime. When the plan in place is executed well, the system administrator should be able to provide such detail to developers that a fix is apparent and quick to put into motion. This is an escalation plan that can leave you sleeping easy at night.
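The hand-offs described in this post can be written down as a small routing table, which is one way to keep an escalation plan from living only in people’s heads. The sketch below is illustrative only: the three layers, the owner labels, and the rule that the administrator keeps ownership of an application problem until the site is stable are my reading of the process above, not a formal specification.

```python
from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    """Where the first responder's diagnosis places the problem."""
    HOSTING = "hosting or network"
    SERVER = "server"
    APPLICATION = "application code"


@dataclass(frozen=True)
class Handoff:
    owner: str
    can_wait_for_business_hours: bool


# Hypothetical routing table; adapt the owners to your own organization.
ROUTES = {
    Layer.HOSTING: Handoff("hosting provider, via the administrator", False),
    Layer.SERVER: Handoff("system administrator", False),
    Layer.APPLICATION: Handoff("developers", True),
}


def escalate(layer: Layer, stabilized: bool) -> Handoff:
    """Decide who owns the incident next."""
    if layer is Layer.APPLICATION and not stabilized:
        # Developers may not be reachable at 3AM, so the administrator
        # triages and stabilizes first, then hands over a diagnosis.
        return ROUTES[Layer.SERVER]
    return ROUTES[layer]
```

For the WordPress worm, `escalate(Layer.APPLICATION, stabilized=False)` keeps the incident with the administrator until the miner is cleaned out, and only then does the long-term patch go to the developers.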

Bonus Step! R: Revise

I am throwing in this bonus step because it’s important to continuously improve your processes. I’m a big believer in continuous improvement both in business and in all of my pursuits. Your team should look at each instance that your escalation plan is utilized as an opportunity to revise and make the process more efficient. Each time the plan is executed, look for areas that would benefit from improvement and revise the plan as necessary. A plan should be a set strategy to be executed but that doesn’t mean it’s set in stone and can never be improved upon. Take advantage of actual instances and case studies to improve how you respond, and the result will be that future incidents will be even more smoothly managed.

Side Note: Updates aren’t enough

I was speaking with a colleague yesterday and the topic of WordPress and vulnerabilities came up. They maintained their opinion that when updated, the software is perfectly secure. I want to reiterate that oftentimes, new vulnerabilities come so quickly that it’s almost impossible to prevent infiltration. Updates are not always enough – you can easily be the victim of an infiltration before updates are even available. If your company relies on your website or you store valuable user data, you must have an escalation plan in place to respond quickly to any issues.

Wrapping Up

I highly encourage all customers to consider 24/7/365 monitoring and downtime support services. It’s an insurance policy that protects your server, software and data, and allows for a quick response to any issues. Downtime is inevitable – it happens to everyone. Those who have processes and procedures in place do the best when it comes to bouncing back from an unfortunate situation. Hopefully, this post will help you get ahead of future frustration by encouraging you to formulate your plan of attack, mitigating any risk of future catastrophic downtime (or lost sleep!).
