I have written a few articles about this notion in the past, but all of them were in relation to IaaS…
- Understanding Storage Performance Limits (not complete anymore but still valid)
- Making sure you stay online during an IaaS maintenance event (they are now very infrequent)
- Determining your storage account location (less relevant when using managed disks)
- Safeguards and keeping your VM secure
…and I wanted to take the opportunity to create a similar article for PaaS as part of my Azure Governance series.
Why do we recommend PaaS over IaaS?
Quite simply, the experience is so much better. Even setting aside the lower cost, PaaS is definitely the way to go when it comes to Azure.
Not only do you have to worry a lot less about scaling individual compute components, you also get an instant advantage from not having to manage server operating systems, group policy and domains.
This advantage is not just monetary, but also extends to your team’s technical capacity. People who have time to be creative will usually be creative. If they are busy just running your system, you will most definitely see less innovation.
Last but not least, Azure PaaS enables a much easier transition into modern software delivery. Rather than having to orchestrate DSC, Puppet, Chef, or Acronis, you simply spin up a resource and deploy your binary (whether that’s with a Microsoft tool, Octopus, or another deployment tool).
With all that being said, I also want to stress that PaaS does not work for every solution and that there are valid cases where IaaS makes more sense.
With that out of the way, let’s dive into the main content of this article.
Understanding our shared responsibility when it comes to disaster recovery, high availability, and geo-replication in Azure PaaS
Making your code respond to a failover
Many PaaS services in Azure offer geo-replication (for example: Cosmos DB, Azure SQL DB, Storage Accounts). While geo-replication and failover for these resources are handled by the Azure platform, it is important that we write our applications in such a way that they can respond to these events.
When designing your application ask yourself some of the following questions:
- Would this call work if it happened on a different machine?
- What is my requirement for data consistency?
- If I executed this call twice in a row, would it have the same result?
- If this call times out, what happens?
- If this call fails half way through, how do I recover?
If you have looked into microservices recently, many of these will sound familiar. That is because different PaaS components work independently, much like services do. Even if their internal architecture is nothing like a microservice, we need to understand the external context.
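To make a few of these questions concrete, here is a minimal Python sketch of a retry wrapper that rotates across regional endpoints and sends the same idempotency key on every attempt. It is not tied to any Azure SDK: `TransientError`, `call_with_retry`, the endpoint names, and the `operation` callback are all hypothetical names made up for illustration, and the sketch assumes the service on the other end de-duplicates requests by that key.

```python
import time
import uuid
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (timeout, failover in progress)."""


def call_with_retry(
    operation: Callable[[str, str], T],
    endpoints: Sequence[str],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry an operation, rotating through endpoints with exponential backoff.

    The same idempotency key is passed on every attempt, so a receiver that
    tracks keys can ignore a duplicate if an earlier attempt succeeded but
    its response was lost.
    """
    if not endpoints:
        raise ValueError("at least one endpoint is required")

    idempotency_key = str(uuid.uuid4())
    last_error = None

    for attempt in range(max_attempts):
        # Rotate: attempt 0 hits the first endpoint, attempt 1 the next, and so on.
        endpoint = endpoints[attempt % len(endpoints)]
        try:
            return operation(endpoint, idempotency_key)
        except TransientError as error:
            last_error = error
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...

    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

The `operation` callback receives the endpoint and the key, so it answers "would this call work on a different machine?" and "what if I executed it twice?" in one place. Only errors you have explicitly classified as transient are retried; anything else surfaces immediately.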
Active / Active
You may want to consider running your application in different regions. You can use Azure Traffic Manager to guide traffic to your resources.
A number of different configurations are available:
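- Failover (called Priority in current Traffic Manager) – only use the secondary region if the primary endpoint is down
- Location (called Performance) – use the endpoint closest to the user, measured by network latency
- Round robin (called Weighted, with equal weights) – use one endpoint for one user then the other for the next user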
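The difference between these three configurations can be sketched in a few lines of Python. This is only an in-process illustration of the selection logic, not how Traffic Manager is implemented (it works at the DNS level). The `Endpoint` class, its fields, and the function names are hypothetical, and "closest" is approximated here by a measured latency value.

```python
import itertools
from dataclasses import dataclass
from typing import Iterator, List, Sequence


@dataclass(frozen=True)
class Endpoint:
    name: str
    priority: int        # lower value = preferred, used for failover order
    latency_ms: float    # hypothetical latency as seen from the caller
    healthy: bool = True


def _healthy(endpoints: Sequence[Endpoint]) -> List[Endpoint]:
    """Keep only healthy endpoints; fail loudly if none are left."""
    candidates = [e for e in endpoints if e.healthy]
    if not candidates:
        raise LookupError("no healthy endpoint available")
    return candidates


def pick_failover(endpoints: Sequence[Endpoint]) -> Endpoint:
    """Failover: always the highest-priority endpoint that is still healthy."""
    return min(_healthy(endpoints), key=lambda e: e.priority)


def pick_closest(endpoints: Sequence[Endpoint]) -> Endpoint:
    """Location: the healthy endpoint with the lowest latency to the caller."""
    return min(_healthy(endpoints), key=lambda e: e.latency_ms)


def round_robin(endpoints: Sequence[Endpoint]) -> Iterator[Endpoint]:
    """Round robin: hand out healthy endpoints in turn, one per request."""
    return itertools.cycle(_healthy(endpoints))
```

For example, with a primary that is down, `pick_failover` falls through to the next priority, while `round_robin` simply skips the unhealthy endpoint and alternates between the rest.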
Many customers choose to place a Traffic Manager profile in front of each tier of their application.
While this adds a slight routing overhead, it means that you will still have a consistent experience if one service in a particular region is down.
If you use Traffic Manager only as a “front door” service before your customer reaches your application, then they may experience issues if some of the lower-level services are unavailable in the region. In exchange, this approach has a slight performance and cost advantage.
Test your setup frequently
We recommend that you test your application regularly and initiate manual failovers. It is a very good idea to put instrumentation in place that measures the impact of a failover. Application Insights is a great place to start.
- In Active/Passive scenarios, make sure that primaries and secondaries fail over automatically or through a script when a failover happens
- In Prod/Min scenarios, make sure that scale up can happen while the application remains online after a failover
- In Active/Active scenarios, make sure that your data consistency requirement aligns with the type of failover
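As a rough idea of what "measuring the impact" can look like, the following Python sketch turns a series of timestamped availability probes into outage windows and a total downtime figure. It assumes you already collect such probes during a test failover with whatever instrumentation you use. `Probe`, `outage_windows`, and `total_downtime` are hypothetical names, not part of Application Insights or any Azure SDK.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Probe:
    timestamp: float  # seconds since the start of the test
    ok: bool          # True if the application answered correctly


def outage_windows(probes: Iterable[Probe]) -> List[Tuple[float, float]]:
    """Return (start, end) pairs for each run of failed probes.

    An outage starts at the first failed probe and ends at the next
    successful one. If the test ends while probes are still failing,
    the window is closed at the last probe's timestamp.
    """
    ordered = sorted(probes, key=lambda p: p.timestamp)
    windows: List[Tuple[float, float]] = []
    start = None

    for probe in ordered:
        if not probe.ok and start is None:
            start = probe.timestamp            # outage begins
        elif probe.ok and start is not None:
            windows.append((start, probe.timestamp))  # outage ends
            start = None

    if start is not None:
        windows.append((start, ordered[-1].timestamp))
    return windows


def total_downtime(probes: Iterable[Probe]) -> float:
    """Sum the length of all outage windows, in seconds."""
    return sum(end - start for start, end in outage_windows(probes))
```

Running this after every scheduled failover test gives you a number to track over time. The result is only as precise as the probe interval, so a five-second interval means each window can be off by up to five seconds at either end.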