An increasing number of software houses are now moving to a microservices model for their applications. By definition this means splitting a previously tightly coupled system into many components. Among the benefits that people hope for from an architecture like this are high availability of the individual components and service continuity.

Based on conversations that I have had with many of you, I thought it would make sense to write an article on high availability, as the concept is often misunderstood. I will refer to some Azure services in this article, but many concepts apply across clouds and deployment models.

What is your Service Level Objective (SLO)?

When we talk about cloud services, we often mention an SLA or “Service Level Agreement”. An SLA is a level of service that your cloud provider legally commits to. In many cases these agreements are intentionally vague or high level. For example, many SaaS products will meet their SLA as long as users receive a valid response from the service in under 2 minutes. (Going by this statement alone, an empty web page is a “valid” response, as it comes back without an error code.)

A “Service Level Objective” on the other hand is an internal measure that a service is committed to achieving. However – crucially – there is no monetary or other compensation to the customer if the objective is missed.

Based on both your SLA and SLO targets you may choose different availability configurations, deployment methods and deployment targets.
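To make an availability target tangible, it helps to translate the percentage into a downtime budget. The snippet below is a minimal sketch of that arithmetic; the function name `allowed_downtime_minutes`, the 30-day period and the example targets are my own assumptions for illustration, not figures from any particular SLA.

```python
def allowed_downtime_minutes(target_percent: float, period_days: float = 30.0) -> float:
    """Return the downtime budget, in minutes, for an availability target.

    target_percent: availability target, e.g. 99.9 (hypothetical example value).
    period_days: length of the measurement window (assumed to be 30 days here).
    """
    if not 0.0 <= target_percent <= 100.0:
        raise ValueError("target_percent must be between 0 and 100")
    period_minutes = period_days * 24 * 60
    return period_minutes * (1.0 - target_percent / 100.0)


if __name__ == "__main__":
    # Illustrative targets only; check your own SLA/SLO for the real numbers.
    for target in (99.0, 99.9, 99.95, 99.99):
        budget = allowed_downtime_minutes(target)
        print(f"{target}% over 30 days allows roughly {budget:.1f} minutes of downtime")
```

A target of 99.9% over 30 days leaves roughly 43 minutes of downtime, which is a useful number to hold against how long a reboot or a failed deployment actually takes you to recover from.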

It’s ultimately about money…

Based on what I have seen out in the field, a reason not to run in a highly available configuration is often simply down to the commercials.

A system that can have 8 hours of downtime every evening might be able to get away without a highly available config, but this largely depends on what the system does during the rest of the day, when it is required. Again, the SLA/SLO consideration comes up.

More often than not, running in a highly available configuration will require at least twice the hardware, which obviously has implications for cost.

Different people, different systems, different HA configs

But when is a system highly available? There are a few different schools of thought in this area.

Many will only consider a service highly available if each of its components is highly available.
Deployments in more than one geographical location might also form part of the consideration.
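One way to see why the "every component" view matters: if a request needs all components to be up, their availabilities multiply. The sketch below assumes that components fail independently and that every request depends on all of them; the function name and the figures are hypothetical and only there to illustrate the calculation.

```python
from math import prod
from typing import Iterable


def serial_availability(component_availabilities: Iterable[float]) -> float:
    """Availability of a chain in which every component must be up.

    Assumes independent failures; each value is a fraction such as 0.9995.
    """
    return prod(component_availabilities)


if __name__ == "__main__":
    # Hypothetical figures for a web tier and a database tier.
    web, database = 0.9995, 0.9999
    combined = serial_availability([web, database])
    print(f"Web {web:.4%} and database {database:.4%} in series: {combined:.4%}")
```

The combined figure is always lower than the weakest component on its own, so one component without a highly available configuration drags the whole service down.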

Consider the scenarios below, bearing in mind that they are only some of the possible options.

Scenario 1 Our web app and database are deployed onto a single machine and traffic is hitting the machine directly. If the machine loses power, needs to reboot, or experiences downtime for any other reason, then our entire product is offline.

Scenario 2 In this scenario we have made our solution slightly more available by splitting the IIS and SQL workload onto two dedicated boxes. This means that if our data machine goes down, then we can display an error message on the website and still perform any actions that do not require the database. If the website goes down, then our data would still be online, but without a client to talk to it.

Scenario 3 Here we have chosen to deploy our solution into Azure PaaS. A web app uses an Azure SQL service. Both have a tight SLA with the cloud provider and therefore provide, in many cases, higher availability than a deployment to single sets of machines. Under the hood, the PaaS offering allows traffic to be load balanced across several machine instances in the same geographic region.

The only three things that could really impact our highly available setup in this scenario are a datacentre outage, a deployment failure, or a network problem.

Scenario 4 Uses a traffic manager (a DNS-level router, which directs clients to a datacentre through its DNS responses) to distribute traffic between two geographically separate datacentres. This accounts for the concern raised in the previous scenario around datacentre outages. It also requires twice the hardware, though.
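To get a feel for what the second datacentre buys you, here is a minimal sketch of the redundancy arithmetic. It assumes the two deployments fail independently and that either one can serve all traffic on its own; the function name and all of the numbers are hypothetical. Note that the traffic manager itself sits in series with both deployments.

```python
def redundant_availability(single_availability: float, copies: int = 2) -> float:
    """Availability of `copies` independent deployments where any one suffices."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The service is down only when every copy is down at the same time.
    return 1.0 - (1.0 - single_availability) ** copies


if __name__ == "__main__":
    # Hypothetical figures, not taken from any provider's SLA.
    region = 0.999
    router = 0.9999
    both_regions = redundant_availability(region, copies=2)
    end_to_end = router * both_regions  # the router is in series with the regions
    print(f"One region: {region:.4%}")
    print(f"Two regions: {both_regions:.4%}")
    print(f"Including the router: {end_to_end:.4%}")
```

In practice failures are rarely fully independent (a bad deployment can hit both regions at once), so treat the result as an upper bound rather than a promise.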

Now that we are distributing traffic across datacentres, however, we have new challenges to solve. Our data needs to be replicated between the two active replicas of our application. Scenario 4 might also get us into a problem where a faulty data deployment results in half of the traffic being sent to a broken experience.

Scenario 5 Fixes this at the expense of an additional traffic manager. This also ends up creating a longer route between service tiers, but ultimately delivers better availability and a better user experience in most cases.