OCI Compute Failover Scenario using OCI Functions

Recently I had a use case with one of my customers that sounds something like this: “I have an application running in OCI on Compute Virtual Machines. I want my application to be resilient, so I created VMs in multiple Availability Domains. I would like to have a Primary VM (in AD1) that will handle all traffic, and a Failover VM (in AD2) that will be stopped, but in case Primary VM is unhealthy – the Failover VM will be started automatically and traffic will go to the Failover instead.”

How do we do this?

Well, if the Failover VM were in a Running state all the time, the failover would be really simple: just mark it as Backup in the Load Balancer and that would be all.

As this use case requires the Failover VM to be stopped, we will use an Alarm to trigger an OCI Function that will Start the Failover VM if the Primary VM is unhealthy.

Description

Right, so let’s talk details.

We have an application running in OCI on Virtual Machines. The Virtual Machines are in a private subnet, so public access is made through a public Load Balancer.

We have a Primary VM that sits in Availability Domain 1 and a Failover VM in Availability Domain 2. All traffic is routed to the Primary VM, as the Failover VM is marked as Backup in the Load Balancer and is NOT running. Because it is stopped, the Load Balancer will show the Failover VM in Critical state, but that is expected.

In case the Primary VM becomes unhealthy, an alarm will trigger an OCI Function that will automatically START the Failover VM and STOP the Primary VM, if not already stopped.
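The decision described above can be sketched in a few lines of Python. This is only an illustration of the rule "start the Failover VM, stop the Primary VM if it is still running"; the helper name plan_failover is hypothetical and the function in the repository may be structured differently. The state names are the usual OCI instance lifecycle states.

```python
from typing import List, Tuple


def plan_failover(primary_state: str, failover_state: str) -> List[Tuple[str, str]]:
    """Return the (vm, action) pairs to perform for the given lifecycle states.

    Hypothetical helper for illustration only, not the author's code.
    """
    actions: List[Tuple[str, str]] = []
    # Start the Failover VM unless it is already running or on its way up.
    if failover_state not in ("RUNNING", "STARTING"):
        actions.append(("failover_vm", "START"))
    # Stop the Primary VM only if it is still running.
    if primary_state == "RUNNING":
        actions.append(("primary_vm", "STOP"))
    return actions


# Web server down, but the Primary VM itself is still up:
print(plan_failover("RUNNING", "STOPPED"))
# -> [('failover_vm', 'START'), ('primary_vm', 'STOP')]

# Primary VM already stopped:
print(plan_failover("STOPPED", "STOPPED"))
# -> [('failover_vm', 'START')]
```

With the OCI Python SDK, each pair would typically translate into a call to ComputeClient.instance_action(instance_id, action), which is left out here so the sketch runs without credentials.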

The Load Balancer will route the traffic to the Failover VM until the Primary VM is healthy again. Once the Primary VM is healthy again, the Failover VM can be stopped.

Setup

Load Balancer Backend Set Configuration

Menu -> Networking -> Load Balancers

Note: The creation of the Virtual Machines and the Load Balancer is excluded from this post as there is nothing specific to those steps.

Once the VMs and the Load Balancer are created, we need to add the VMs to a backend set in the Load Balancer (this step can be done during the creation of the Load Balancer as well):

  • Create a Backend Set

I’ll use Weighted Round Robin for Traffic Distribution Policy and HTTP Health Check on port 80, but you can configure this as you wish.

  • Add backends

Select your backends (I have my two VMs in AD1 and AD2). You can also check “Automatically add security list rules” and then click Add.

  • Select your Failover VM (in my case webtest_ad2), click on Actions and then Edit to mark it as Backup

This way, the Load Balancer will only send traffic to the Primary VM (webtest_ad1 in my case) as long as it is healthy, and will only switch and send traffic to the Failover VM (webtest_ad2) if the Primary VM is unhealthy. Once the Primary VM is healthy again, the Load Balancer will switch back and send traffic only to the Primary, and you can drain all connections and stop the Failover VM.
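To make the Backup behaviour concrete, here is a minimal Python model of the selection rule: a backend marked as Backup is used only when no healthy non-backup backend is left. The Backend class and eligible_backends function are hypothetical names for this illustration; the real Load Balancer applies this logic internally.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Backend:
    name: str
    healthy: bool
    backup: bool = False


def eligible_backends(backends: List[Backend]) -> List[str]:
    """Return the names of the backends that may receive traffic."""
    # Healthy backends that are not marked as Backup always win.
    primaries = [b.name for b in backends if b.healthy and not b.backup]
    if primaries:
        return primaries
    # Otherwise fall back to healthy Backup backends (possibly none).
    return [b.name for b in backends if b.healthy and b.backup]


normal = [Backend("webtest_ad1", healthy=True),
          Backend("webtest_ad2", healthy=False, backup=True)]
failed_over = [Backend("webtest_ad1", healthy=False),
               Backend("webtest_ad2", healthy=True, backup=True)]

print(eligible_backends(normal))       # -> ['webtest_ad1']
print(eligible_backends(failed_over))  # -> ['webtest_ad2']
```

Note that while the Failover VM is still stopped and the Primary is down, the function returns an empty list, which corresponds to the downtime window of this setup.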

Deploy Failover Function

Menu -> Developer Services -> Applications (under Functions)

Before deploying the OCI Function, make sure you set up your environment for using Functions – follow one of the quickstart guides available (Cloud Shell is recommended).

  • Create a new Application – we’ll call it oci-compute-failover-app
  • Follow the first 7 steps from the Getting Started Guide using Cloud Shell to be able to deploy the Function
  • Download the source code for the Failover Function and navigate to the Function’s folder
git clone https://github.com/iavladu/oci-compute-failover.git

cd oci-compute-failover
  • Deploy the Failover Function to your newly created Application
fn -v deploy --app <your app name>

# Example: 
fn -v deploy --app oci-compute-failover-app
  • Copy the OCID of your Primary VM and Failover VM and set the following config values:
fn config function <app-name> oci-compute-failover primary_vm <Primary VM OCID>
fn config function <app-name> oci-compute-failover failover_vm <Failover VM OCID>

# Example:
fn config function oci-compute-failover-app oci-compute-failover primary_vm ocid1.instance.oc1.fra.asdfagvewrve33121e1d3243rd
fn config function oci-compute-failover-app oci-compute-failover failover_vm ocid1.instance.oc1.fra.asrrrr223423sdvsvddfge345gf
  • You can also set this key-value pair configuration from the Console, in the Function -> Configuration
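Function configuration values are made available to the running function as environment variables, so the two OCIDs set above can be read and sanity-checked as in the sketch below. The helper load_failover_config and its validation rule are assumptions for illustration; the deployed function may read its configuration differently (for example through the FDK context).

```python
import os
from typing import Dict, Mapping

# Keys set earlier with "fn config function ...".
REQUIRED_KEYS = ("primary_vm", "failover_vm")


def load_failover_config(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Read and validate the two instance OCIDs (hypothetical helper)."""
    config: Dict[str, str] = {}
    for key in REQUIRED_KEYS:
        value = env.get(key, "").strip()
        # Fail early if a key is missing or does not look like an instance OCID.
        if not value.startswith("ocid1.instance."):
            raise ValueError(f"config key {key!r} must be an instance OCID, got {value!r}")
        config[key] = value
    return config


# Example with a plain dict standing in for the function's environment
# (the OCIDs are made-up placeholders):
example_env = {
    "primary_vm": "ocid1.instance.oc1.fra.exampleprimary",
    "failover_vm": "ocid1.instance.oc1.fra.examplefailover",
}
print(load_failover_config(example_env))
```

Validating the values at the start of the function makes a misconfiguration show up in the function logs right away instead of as a failed start or stop call later.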

Create Notification to call Function

Menu -> Developer Services -> Notifications (under Application Integration)

  • Create a new Topic
  • Create a new Subscription and select Function protocol
    • Select the OCI Compute Failover application and function

Make sure you have all the necessary permissions.

  • You can add additional subscriptions like email or Slack to get notified when your alarm is triggered
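When the Topic invokes the Function, the alarm message is passed as the request body. As an optional safeguard, a function could act only when the alarm transitions to firing. The sketch below assumes the message is JSON with a "type" field whose value is "OK_TO_FIRING" on that transition; verify the field names against a real message from your tenancy before relying on them. The helper should_fail_over is hypothetical and is not taken from the repository.

```python
import json


def should_fail_over(message_body: str) -> bool:
    """Return True only for an alarm message that reports a new firing state.

    Assumes a JSON payload with a "type" field (assumption, check a real message).
    """
    try:
        message = json.loads(message_body)
    except (TypeError, ValueError):
        # Not JSON (or no body at all): do nothing rather than guess.
        return False
    return isinstance(message, dict) and message.get("type") == "OK_TO_FIRING"


print(should_fail_over('{"type": "OK_TO_FIRING", "title": "primary down"}'))  # -> True
print(should_fail_over('{"type": "FIRING_TO_OK"}'))                           # -> False
print(should_fail_over("not json"))                                            # -> False
```

Ignoring the recovery message avoids starting the Failover VM again at the moment the Primary has just become healthy.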

Create Alarm to Trigger Function

Menu -> Observability & Management -> Alarm Definitions (under Monitoring)

  • Create a new Alarm
  • Give it a name, body and select the preferred severity

  • Under Metric Description
    • Select the compartment
    • Select oci_lbaas as Metric Namespace
    • Select UnHealthyBackendServers as Metric Name

  • Under Metric Dimensions
    • Select resourceId as Dimension Name and the Load Balancer OCID as Dimension Value
    • Add Additional dimension
    • Select backendSetName as Dimension Name and the Backend Set Name as Dimension Value

  • Under Trigger Rule
    • Select greater than or equal to as Operator and 2 as Value
    • This is because the Failover VM is always reported as unhealthy (it is stopped), so one unhealthy backend is the normal state; if two backends are unhealthy, it means that the Primary is down

  • In the Notifications section, select the failover Topic created earlier and Save
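The alarm settings above boil down to a single query plus a threshold. The sketch below builds an equivalent Monitoring Query Language (MQL) expression and shows the threshold reasoning. The 1m interval and the max() statistic are assumptions (the Console may pick a different default), and both function names are hypothetical.

```python
def build_alarm_query(load_balancer_ocid: str, backend_set_name: str, threshold: int = 2) -> str:
    """Build an MQL expression for the alarm (interval and statistic are assumptions)."""
    return (
        f'UnHealthyBackendServers[1m]{{resourceId = "{load_balancer_ocid}", '
        f'backendSetName = "{backend_set_name}"}}.max() >= {threshold}'
    )


def primary_is_down(unhealthy_backends: int, threshold: int = 2) -> bool:
    """One unhealthy backend is normal (the stopped Failover VM); two means the Primary failed."""
    return unhealthy_backends >= threshold


# Placeholder OCID and backend set name for illustration:
print(build_alarm_query("ocid1.loadbalancer.oc1..example", "my-backend-set"))
print(primary_is_down(1))  # -> False (only the stopped Failover VM is unhealthy)
print(primary_is_down(2))  # -> True  (the Primary is unhealthy as well)
```

The generated expression can be compared with what the Console shows when you switch the alarm's query editor to advanced mode.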

The configuration is now complete!

Test the Failover

All traffic should go to the Primary VM under normal circumstances. In order to test the failover, there are two ways:

  • Stop the Primary VM
    • If the Primary VM is shut down, the Alarm will be triggered, and the Failover Function will be invoked
    • The Function will start the Failover VM and will not attempt to stop the Primary VM, as the status is not RUNNING
    • Once the Failover VM is up and running, the Load Balancer will automatically send traffic to this machine instead
    • If the Primary VM becomes healthy again in the meantime, the Load Balancer will switch back to it, traffic will go only to the Primary, and the Failover VM can be stopped
  • Stop the web server on the Primary VM (or the service that is subject to the health check), but not the VM itself
    • If the health check on the Load Balancer (in this case, HTTP) becomes unhealthy, the alarm will be triggered, and the Failover Function will be invoked
    • The Function will start the Failover VM and will also stop the Primary VM
    • Once the Failover VM is up and running, the Load Balancer will automatically send traffic to this machine instead
    • If the Primary VM becomes healthy again in the meantime, the Load Balancer will switch back to it, traffic will go only to the Primary, and the Failover VM can be stopped

Conclusions

This way of doing the failover will have a downtime of approximately 3-5 minutes, made up of the following steps:

  • The Load Balancer notices that the node is not healthy
  • The alarm gets fired with 1 min delay because the condition must be maintained for 1 min minimum
  • The Function is called to start the Failover VM
  • The Failover VM is started
  • The Load Balancer sees the new instance as healthy
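The total is simply the sum of the steps above. The sketch below adds up a (min, max) range in seconds per step. Only the 60-second alarm delay comes from this post; every other number is an illustrative placeholder, not a measurement, so replace them with values observed in your own environment.

```python
from typing import Dict, Tuple


def downtime_range(stages: Dict[str, Tuple[int, int]]) -> Tuple[int, int]:
    """Sum the per-stage (min, max) durations in seconds."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high


stages = {
    "lb_detects_unhealthy": (30, 60),   # placeholder, depends on health check settings
    "alarm_trigger_delay": (60, 60),    # 1 minute, as stated above
    "function_invocation": (10, 30),    # placeholder
    "failover_vm_start": (60, 120),     # placeholder, depends on shape and image
    "lb_marks_healthy": (20, 30),       # placeholder, depends on health check settings
}

low, high = downtime_range(stages)
print(f"{low / 60:.1f}-{high / 60:.1f} minutes")  # -> 3.0-5.0 minutes
```

Looking at the stages separately shows where tuning helps: the health check interval and the VM boot time are the parts you control, while the alarm delay is fixed at one minute.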

If a failover with almost zero downtime is needed, then the Failover VM should be up-and-running, and the Load Balancer will switch traffic automatically between primary and backup without the need of an OCI Function.

Ionut Adrian Vladu

I enjoy building python scripts for…everything! I am a Cloud enthusiast and I like to keep up with technology. When I'm not behind a computer, I like taking photos -- Visit My 500px profile -- or sit back and enjoy Formula 1 race weekends. Currently, working as a Tech Cloud Specialist @ Oracle