Looking for the best methods on how to avoid server downtime?
Running a business or online store comes with the expectation from your clients that your site will be accessible whenever they desire to shop. Individuals around the globe are accessing the Internet 24/7. Sites that provide a good or service to those visitors need to be online to accommodate any and all potential traffic or customers, regardless of the day or time.
When a site or server becomes inaccessible, a condition known as being down, downtime, or an outage, the cost to customers, the business, and the site's reputation can be drastic. Sales can be severely impacted and the business brand can be tarnished.
Understanding server downtime, taking it seriously, and planning ahead for serious issues can lessen or eliminate the impact of unexpected problems.
What is Server Downtime?
Server downtime is experienced when your on-premise or hosted server is not accessible via the Internet. This can mean that a website is:
- Non-responsive due to timing out or a failure in the application that runs the website.
- Not fully loading.
- Presenting some type of error.
- Dependent upon a 3rd party application or component that is working incorrectly.
- Missing critical components for operating correctly.
- Not functioning in a way that allows visitors to use the site as expected.
What are the Causes of Server Downtime?
The list of reasons why a website or server can go down and become unreachable is nearly endless. From floods to fires and simple human errors, there is a potential for something to cause an outage nearly every day, regardless of where you are.
We can sort outage causes into three categories: physical, software, and human.
Servers connecting to the Internet depend on a complicated combination of network cabling, electrical power, and hardware components. Any issue with one of the physical aspects of your hosting infrastructure will present potential downtime.
From a hard drive failing, to a brief interruption in the supply of power to your server or a cable getting pulled loose, the resiliency of your site to stay online is directly related to the ability of your servers to cope with some type of physical issue. Some components may even be outside the control of your infrastructure and the data center you're in, such as fiber cuts impacting an upstream Internet Service Provider (ISP).
Any server is going to operate software of some type, and all software is prone to experience issues. These could be problems brought about by a bug in the software itself or a configuration error made by an administrator. Software issues may also leave your site online while still preventing it from generating revenue.
For instance, if you use a 3rd party payment processor and have a misconfiguration, your site may not be able to reach the processor. It would appear as though your site is up, but your customers will not be able to complete any transactions. This is a form of downtime in which a single aspect of your site or hosting infrastructure is not accessible to customers.
Your site may experience downtime due to a mistake made by an administrator, personnel at the data center where you host, or even a malicious individual who has gained unauthorized access to your site and servers. Humans, by our nature, are prone to making mistakes.
Entire sections of the Internet have been taken offline because of a network administrator accidentally typing the wrong command on a router for an Internet service provider.
Google has experienced downtime because an engineer automated a process that adversely impacted hundreds of servers. Facebook experienced downtime because someone forgot to renew an SSL certificate.
All of these were simple mistakes made by a person, many of which were unintentional.
These are some of the most difficult outages to prepare for or attempt to prevent, since it is almost impossible to fully anticipate the actions that an individual may take. Even with all the potential for a person to create issues, this can be mitigated to some degree by taking specific steps to limit access and monitor changes (discussed more below).
What is the Cost of Downtime?
An event that causes downtime to your server and site carries a cost, whether it is a direct impact on your sales or a remark online that discourages others from visiting. Costs may be monetary, or they may damage your brand, lowering your reputation among potential and current customers.
When your site is experiencing issues or is offline, this has a direct impact on the people visiting your business. Any issue that keeps your site from loading properly, or from loading at all, is going to dissuade visitors from waiting or coming back.
Research from Google has shown that an increase of just 9 seconds in page loading time increases the risk of a visitor leaving your site by 123%.
To convert your visitors into customers, you need to ensure that your site is online and accessible. Any barrier your visitors have with visiting your site or completing a purchase will result in a potential loss of sale, potential loss of a return customer, and further damage to your business.
The most profound issue of your site being inaccessible is the loss of sales and revenue generation. Issues accessing your site, even if it simply loads a bit more slowly, can have a direct effect on sales. If your site is completely offline and there is no way for visitors to even access it, the loss will be ongoing and potentially devastating.
A research paper published by IDC discovered that for SMBs, the cost of downtime can range from $8,220 per hour to $25,600 per hour (taking into account loss of sales/productivity, cost of IT resources to get back online, and other factors).
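To make the hourly figures concrete, here is a minimal sketch that scales the quoted range to an outage of a given length. It assumes the cost grows linearly with time, which is a simplification; the function name and the 90-minute example are hypothetical.

```python
# Hourly cost range for SMBs quoted above (USD per hour of downtime).
LOW_COST_PER_HOUR = 8_220
HIGH_COST_PER_HOUR = 25_600


def outage_cost_range(outage_minutes: float) -> tuple[float, float]:
    """Return the (low, high) estimated cost in USD for an outage of the given length.

    Assumes cost scales linearly with duration, which is a simplification.
    """
    if outage_minutes < 0:
        raise ValueError("outage_minutes must not be negative")
    hours = outage_minutes / 60
    return (LOW_COST_PER_HOUR * hours, HIGH_COST_PER_HOUR * hours)


# A hypothetical 90-minute outage.
low, high = outage_cost_range(90)
print(f"Estimated cost: ${low:,.0f} to ${high:,.0f}")
```

Under that linear assumption, a 90-minute outage lands between $12,330 and $38,400.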
Downtime is detrimental, and depending on the size of your business and outage endured, could result in your organization eventually going out of business.
Reputation and Brand
An often overlooked, but equally important, aspect of server downtime is the loss of consumer confidence. Word-of-mouth advertising is key for businesses to grow, and having your customers be your ambassadors is one of the best ways to bring potential new customers into your business.
When your site is offline, though, you begin to lose the respect and customer loyalty that regular and repeat clients build with your business. Visitors to a site expect it to be up, responsive, and capable of processing their requests and purchases.
When you're unable to provide a reliable server, your customers (regular and potential) will look elsewhere for a service or site that is up and offers what they want.
Often, you will see that the most vocal customers, both in person and online, are those who are upset about an interaction or experience they had. These negative remarks can have a damaging effect on your business as they spread amongst people through social media posts, further damaging your reputation and costing you additional business.
How to Avoid Server Downtime
The idea of never experiencing downtime is an almost impossible concept to achieve. Even the largest companies and the most prominent online presences experience downtime in some fashion.
Oftentimes, these companies make guarantees of being up 99.999% (referred to as "five nines"), which equates to less than six minutes of downtime per year. This degree of uptime is achieved using a number of infrastructure designs and factors to help minimize the impact of issues as they arise.
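The "five nines" figure is simple arithmetic, and the same calculation works for any uptime target. The sketch below assumes a 365-day year; the function name is made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows about {minutes:.2f} minutes of downtime per year")
```

For 99.999% this works out to roughly 5.26 minutes per year, while 99.9% allows about 525.6 minutes, or close to nine hours.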
Here are the top four methods for how to avoid server downtime:
1. Monitoring and Alerting Systems
One of the most important steps you can take in preventing downtime is knowing what is going on with your infrastructure at all times. Being able to spot issues before they interrupt access to your site is crucial. To do so, you will need to monitor your infrastructure for both performance and threats.
Numerous software packages and services exist (such as Grafana, Munin, or Pingdom) to allow you insight into how your infrastructure and site are performing. These services will help you monitor server health, such as:
- Server load.
- Disk space.
- Hardware health.
- Page load times.
- Software status.
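As a rough idea of what such tools check under the hood, here is a minimal sketch of a local health check for two of the items above, disk space and server load, using only the Python standard library. The thresholds are hypothetical, and `os.getloadavg()` is only available on Unix-like systems. A dedicated monitoring service will do far more than this.

```python
import os
import shutil

# Hypothetical thresholds; tune them for your own servers.
MAX_DISK_USED_PERCENT = 90.0
MAX_LOAD_PER_CPU = 1.5


def disk_used_percent(path: str = "/") -> float:
    """Return how full the filesystem containing `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def load_per_cpu() -> float:
    """Return the 1-minute load average divided by the CPU count (Unix only)."""
    one_minute_load = os.getloadavg()[0]
    return one_minute_load / (os.cpu_count() or 1)


def find_problems(disk_percent: float, load: float) -> list[str]:
    """Compare measurements against the thresholds and describe any breaches."""
    problems = []
    if disk_percent > MAX_DISK_USED_PERCENT:
        problems.append(f"disk is {disk_percent:.1f}% full")
    if load > MAX_LOAD_PER_CPU:
        problems.append(f"load per CPU is {load:.2f}")
    return problems


if __name__ == "__main__":
    for problem in find_problems(disk_used_percent(), load_per_cpu()):
        print("ALERT:", problem)
```

Run on a schedule (for example from cron), a check like this can feed whatever alerting channel you already use.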
Threat detection and monitoring are also crucial for keeping malicious software and actors at bay. Software such as Threat Stack Oversight Intrusion Detection System and Alert Logic Security & Compliance Suite will help you with:
- Threat monitoring.
- Intrusion detection.
- 24/7 incident response.
- And more.
Additionally, you can utilize off-network services to gain an understanding of how visitors to your site will experience it, detailing how long it takes for the site to completely load from various parts of the world or if certain service providers are having issues reaching your site.
This early warning of potential issues can help you get ahead of a problem and prevent it from becoming an issue that results in actual downtime.
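An external check of this kind can be sketched in a few lines. The example below times how long it takes to download a page and classifies the result. It measures only the HTML document itself, not images, scripts, or rendering, so treat it as a rough signal. The 3-second threshold and the function names are assumptions for illustration.

```python
import time
import urllib.request
from typing import Optional


def measure_load_seconds(url: str, timeout: float = 10.0) -> Optional[float]:
    """Fetch the URL and return the elapsed seconds, or None if the request failed."""
    started = time.perf_counter()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()  # download the full body, not just the headers
    except OSError:  # URLError, HTTPError, and timeouts are all OSError subclasses
        return None
    return time.perf_counter() - started


def classify(load_seconds: Optional[float], slow_after: float = 3.0) -> str:
    """Turn a measurement into a simple status: "down", "slow", or "ok"."""
    if load_seconds is None:
        return "down"
    if load_seconds > slow_after:
        return "slow"
    return "ok"


# Example usage (replace the placeholder URL with your own site):
# status = classify(measure_load_seconds("https://example.com/"))
# print(status)
```

Running the same script from machines in several regions gives a crude version of what off-network monitoring services provide.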
2. High Availability
If your site has to be able to survive any type of physical outage (such as a piece of hardware in the server failing or a power outage to a server), one of the first steps is to make sure that you're using a high availability (HA) setup.
High availability can be achieved by utilizing one server (we can call this the primary) to handle all traffic, with an additional server (called the secondary) sitting and waiting for an event to take place, such as a failure of the primary. This additional server is constantly synchronizing data and files with the primary server.
When the primary server experiences an issue, the secondary server almost instantly takes over and continues serving up your site. This specific type of relationship can be called automatic failover or active/passive, and is extremely common, especially with database servers.
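The routing decision in an active/passive pair can be sketched as follows. This shows only the selection logic; real failover tooling also handles data synchronization and moving the IP address or DNS record. The server names and the `is_healthy` callback are hypothetical.

```python
from typing import Callable


def choose_active_server(
    primary: str,
    secondary: str,
    is_healthy: Callable[[str], bool],
) -> str:
    """Send traffic to the primary while it is healthy, otherwise fail over."""
    if is_healthy(primary):
        return primary
    if is_healthy(secondary):
        return secondary
    raise RuntimeError("no healthy server available")


# Simulated health states standing in for real health checks.
health = {"db-primary": True, "db-secondary": True}
print(choose_active_server("db-primary", "db-secondary", health.get))

health["db-primary"] = False  # the primary experiences an issue
print(choose_active_server("db-primary", "db-secondary", health.get))
```

The first call selects `db-primary`; once the primary is marked unhealthy, the second call selects `db-secondary`.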
Another form of high availability to be aware of is an active/active server relationship. In this form of HA, you have both servers simultaneously receiving data and serving it back to the visitor, while synchronizing data between each other. The main benefit to this is not waiting for the secondary server to take over in the event of an issue.
An active/active HA setup is much more complicated and requires careful preparation and close monitoring to ensure you don't have issues, but is reliable and safeguards SMEs with mission-critical workloads or applications that need to stay online.
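For contrast with active/passive, the sketch below shows one way traffic might be spread across an active/active pair: requests rotate between the servers, and an unhealthy server is simply skipped, so there is no takeover delay. Round-robin is only one possible balancing strategy, chosen here as an assumption, and the server names are hypothetical. The data synchronization between servers, which is the hard part, is not shown.

```python
from itertools import cycle
from typing import Callable, Iterator, Sequence


def healthy_round_robin(
    servers: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Iterator[str]:
    """Yield servers in rotation, skipping any that fail the health check."""
    rotation = cycle(servers)
    while True:
        # Try each server at most once per request before giving up.
        for _ in range(len(servers)):
            candidate = next(rotation)
            if is_healthy(candidate):
                yield candidate
                break
        else:
            raise RuntimeError("no healthy server available")


# Simulated health states standing in for real health checks.
health = {"web-a": True, "web-b": True}
balancer = healthy_round_robin(["web-a", "web-b"], lambda name: health[name])

print([next(balancer) for _ in range(4)])  # alternates between both servers
health["web-a"] = False                    # web-a fails
print([next(balancer) for _ in range(2)])  # all traffic now goes to web-b
```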
3. Geographic Redundancy
Another way to achieve high availability is to locate your hosting infrastructure in physically different areas separated by great distances. The idea is that if a natural disaster strikes, or a catastrophic power outage occurs, your locations are far enough apart that both are not impacted.
When the outage occurs at location A, your servers at location B detect the issue and are ready to receive the traffic. While twice as expensive, this is one of the most effective ways to ensure your site stays online.
The revenue generated while online during an event that otherwise would take a site down can easily cover the cost of the second set of infrastructure for large enough businesses.
Geo-redundant solutions are highly complex and often require numerous services and monitoring solutions to effectively perform the switch from location A to B. Data synchronization (to ensure that whatever location your visitors access is a mirror of the other), DNS changes (needed for directing client browsers to the appropriate location when a site goes offline), and multiple health checks (to ensure that a single failed ping does not trigger a failover of your entire site) are just a few of the pieces needed to effectively and safely operate infrastructure in a geo-redundant fashion.
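The "multiple health checks" idea can be illustrated with a small sketch: traffic only moves from location A to location B after several consecutive failed checks, so one dropped ping is ignored. The class name and the threshold of three are assumptions, and this version deliberately does not fail back automatically when A recovers. The actual DNS change and data synchronization are outside its scope.

```python
class FailoverMonitor:
    """Switch from location A to B only after repeated consecutive failures."""

    def __init__(self, failures_before_failover: int = 3) -> None:
        self.failures_before_failover = failures_before_failover
        self.consecutive_failures = 0
        self.active_location = "A"

    def record_check(self, location_a_healthy: bool) -> str:
        """Record one health check result and return the active location."""
        if location_a_healthy:
            # Any success resets the streak; a single blip never triggers failover.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failures_before_failover:
                # No automatic fail-back: returning to A is left as a manual step.
                self.active_location = "B"
        return self.active_location


monitor = FailoverMonitor(failures_before_failover=3)
checks = [False, True, False, False, False]  # one blip, then a real outage
for healthy in checks:
    print(monitor.record_check(healthy))
```

With these simulated results the monitor stays on A through the single failed check and switches to B only on the third consecutive failure.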
These types of setups are often reserved for the hosting environments where it is absolutely critical that the application or website stay online.
4. Code Versioning and Reverting
Above, I briefly touched on the idea of human involvement causing an outage. While it is impossible to protect against every potential issue an individual can introduce to our hosting infrastructure, we are able to take precautions to minimize the impact and risk.
Ensuring that any change that an employee (or even yourself) makes is reviewed by another individual is an excellent step in verifying that the code or alterations are safe, sane, and won't introduce a breaking change. This code review or peer review is a critical step that larger organizations take to ensure that an accidental typo or conflict isn't missed.
However, mistakes happen and people are fallible. To help protect against this, code versioning can be used to help reduce downtime from a recently implemented change. When using versioning, any and all changes are automatically documented, creating a history of the changes made.
In the event that some alteration broke part of your site (be it a visual discrepancy, a connection to some local or third party service, or even the accidental deletion of files), you can see the exact change committed and revert it. This running log of changes allows for easy tracking of what has been done, and enables you to pinpoint exactly when a breaking change occurred and what needs to be done to correct it.
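As one concrete illustration, the sketch below assumes Git as the versioning tool (the text above does not name one) and wraps two standard Git commands: `git log --oneline` to list recent changes and `git revert --no-edit` to undo a specific commit by creating a new commit that reverses it. The helper names, repository path, and commit hash are hypothetical.

```python
import subprocess


def recent_changes_command(limit: int = 10) -> list[str]:
    """Build the Git command that lists the most recent commits, one per line."""
    return ["git", "log", "--oneline", "-n", str(limit)]


def revert_command(commit: str) -> list[str]:
    """Build the Git command that adds a new commit undoing the given commit."""
    return ["git", "revert", "--no-edit", commit]


def run_git(command: list[str], repo_dir: str) -> str:
    """Run a Git command inside repo_dir and return its output.

    Raises subprocess.CalledProcessError if Git reports a failure.
    """
    completed = subprocess.run(
        command, cwd=repo_dir, check=True, capture_output=True, text=True
    )
    return completed.stdout


# Example usage (path and commit hash are placeholders):
# print(run_git(recent_changes_command(5), "/var/www/site"))
# run_git(revert_command("abc1234"), "/var/www/site")
```

Because a revert adds a new commit rather than rewriting history, the record of both the breaking change and its fix stays in the log.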
Prepare for Server Downtime Now
Server downtime is a potentially damaging event for your business. At some point, almost every site will experience some type of downtime, even if it's an issue outside of their control. There are numerous causes and potential points of failure when hosting a website, all of which can cause visitors to have a poor experience or be completely prevented from accessing your site.
Knowing what these pitfalls are and how to best prepare for them is a step you can take to help minimize the risk and damage of any type of downtime event.