In this article I will cover a solution designed to reduce the cost of AWS infrastructure while at the same time ensuring the uptime of the applications.

Requirements

The requirements are as follows:

  • use an NLB for the ingress solution, with traffic being served by all nodes
  • ensure the applications are evicted in time when the Spot instances are reclaimed by AWS

Solution

Below is the minimum setup required to have this solution working.

EKS provisioning

For cluster provisioning with Terraform there is a very nice Terraform module which can be used. The link contains the actual configuration required to create the cluster with a dedicated autoscaling group for Spot instances.

Another way to deploy an EKS cluster with Spot instances is to use eksctl; a minimal sketch of such a configuration is shown below.
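The following eksctl ClusterConfig is only a sketch of what such a setup might look like, with one Spot node group and one On-Demand node group; the cluster name, region, instance types, sizes, labels and taint are placeholders to be adapted to your environment.

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: spot-cluster              # placeholder name
  region: eu-west-1               # placeholder region
nodeGroups:
- name: spot
  minSize: 2
  maxSize: 10
  instancesDistribution:
    instanceTypes: ["m5.large", "m5a.large", "m4.large"]
    onDemandBaseCapacity: 0
    onDemandPercentageAboveBaseCapacity: 0   # 100% Spot in this group
    spotInstancePools: 3
  labels:
    lifecycle: Ec2Spot                       # label used later for node affinity (assumption)
  taints:
    spotInstance: "true:PreferNoSchedule"    # taint used later for tolerations (assumption)
- name: on-demand
  minSize: 1
  maxSize: 5
  instanceType: m5.large
  labels:
    lifecycle: OnDemand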

The cluster should also be deployed with an autoscaling group of On-Demand instances. The applications should be configured with affinity and tolerations which ensure the pods can also be started on On-Demand instances in case AWS is not able to provide Spot instances.
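A minimal pod spec fragment for this, assuming the lifecycle label and spotInstance taint from the eksctl sketch above, could look like the following: the pods prefer Spot nodes but are still allowed to land on On-Demand nodes.

spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: lifecycle
            operator: In
            values:
            - Ec2Spot             # prefer Spot nodes, fall back to On-Demand when none are available
  tolerations:
  - key: spotInstance
    operator: Equal
    value: "true"
    effect: PreferNoSchedule      # tolerate the Spot taint from the eksctl sketch above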

Spot termination handler

In one of my previous articles I mentioned how Nginx Ingress can be used with an NLB.

When AWS reclaims a Spot instance, a two-minute notice is sent: the endpoint http://169.254.169.254/latest/meta-data/spot/termination-time becomes available on that specific instance.

In order to monitor that endpoint there is a very nice addon called kube-spot-termination-notice-handler, to which I've added some additional options, one of them being to also deregister the node from the Target Groups. More details on how to do that can be found in the above repo. All the credit goes to the team who initially developed the addon.
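As a rough sketch, the handler runs as a DaemonSet on the Spot nodes and polls the metadata endpoint above. The image, labels and environment variables below are placeholders; check the repo's chart for the real values, and attach the IAM role from the next example via the node instance profile, kube2iam or IRSA, depending on your setup.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kube-spot-termination-notice-handler
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: kube-spot-termination-notice-handler
  template:
    metadata:
      labels:
        app: kube-spot-termination-notice-handler
    spec:
      serviceAccountName: kube-spot-termination-notice-handler
      nodeSelector:
        lifecycle: Ec2Spot                           # run only on Spot nodes (label is an assumption)
      containers:
      - name: handler
        image: kube-spot-termination-notice-handler  # placeholder, use the image from the repo
        env:
        - name: NODE_NAME                            # the node to drain (env name may differ in the chart)
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName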

The IAM role permissions used by the kube-spot-termination-notice-handler will look like the example below. They allow the addon to remove the instance from the Target Groups as soon as the reclaim notice has been sent by AWS.

Make sure the NLB is configured with a node drain (deregistration delay) of at most 60 seconds; one way of setting this is sketched after the policy below.

{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "VisualEditor0",
    "Effect": "Allow",
    "Action": [
      "elasticloadbalancing:DeregisterTargets",
      "elasticloadbalancing:Describe*",
      "elasticloadbalancing:Modify*",
      "elasticloadbalancing:Register*",
      "autoscaling:Describe*",
      "autoscaling:DetachInstances"
    ],
    "Resource": "*"
  }]
}
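The 60-second drain mentioned above is a Target Group attribute. If the NLB is managed by the AWS Load Balancer Controller, it can be set directly on the ingress Service with the target-group-attributes annotation, as in the sketch below; with the in-tree cloud provider the attribute has to be set on the Target Groups themselves (for example from Terraform). The Service name, ports and selector are placeholders.

apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "external"
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "instance"
    # keep the drain time at or below 60 seconds
    service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: "deregistration_delay.timeout_seconds=60"
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: ingress-nginx
  ports:
  - name: http
    port: 80
    targetPort: http
  - name: https
    port: 443
    targetPort: https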

After this step we are left with roughly 60 seconds to drain the node. Another option added to the addon is to drain the nodes in parallel, using application labels, which improves the node drain time. We are now left with one last issue to resolve: the cluster's capacity to schedule the evicted pods.

Something very important: all cluster applications need to have a PodDisruptionBudget. This prevents the pods of the same application from being terminated all at once in case they are all running on the same node.
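A minimal PodDisruptionBudget for a hypothetical my-app Deployment could look like this (the name, selector and minAvailable value are placeholders):

apiVersion: policy/v1beta1        # policy/v1 on Kubernetes 1.21 and newer
kind: PodDisruptionBudget
metadata:
  name: my-app
spec:
  minAvailable: 1                 # always keep at least one pod of the application running
  selector:
    matchLabels:
      app: my-app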

Overprovisioning

The cluster needs to be overprovisioned in order to have spare capacity to reschedule the evicted pods quickly. For this we will use another addon called cluster-overprovisioner. The addon has a formula to calculate how much the cluster should be overprovisioned based on how large it is. Pause containers are deployed with a lower priority class, so when the normal applications are evicted Kubernetes terminates the pause containers and schedules the evicted pods almost instantly. The pending pause containers then trigger the cluster to scale up.
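The pattern the addon implements boils down to a low-priority class plus a pause Deployment whose resource requests reserve the headroom; the sketch below is only illustrative, and the replica count and requests are placeholders that the addon's formula would normally derive from the cluster size.

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -1                          # lower than the default priority (0) used by normal workloads
globalDefault: false
description: "Priority class for the placeholder pause pods"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
  namespace: kube-system
spec:
  replicas: 2                      # placeholder; sized by the overprovisioner formula
  selector:
    matchLabels:
      app: overprovisioning
  template:
    metadata:
      labels:
        app: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
      - name: pause
        image: k8s.gcr.io/pause    # does nothing, only reserves capacity
        resources:
          requests:
            cpu: "1"               # placeholder headroom, roughly the size of the largest app pods
            memory: 1Gi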

Cluster fine tuning

By default the EKS cluster comes with decent tuning when it comes to API interactions, but this is not enough to have fast proxy rule updates when the pods are evicted. The kube-proxy args can be found in the official Kubernetes documentation. Depending on the size of the cluster, the following args will need to be tweaked to accelerate the kube-proxy updates of the node iptables:

  • kube-api-qps set to ~300
  • kube-api-burst set to ~400
  • iptables-sync-period (optional) set to 5s
  • ipvs-sync-period (optional) set to 5s

A very nice article about that can be found here.

The actual kube-proxy configuration will look like the example below:

.................
      containers:
      - command:
        - kube-proxy
        - --v=2
        - --ipvs-sync-period=5s
        - --iptables-sync-period=5s
        - --kube-api-qps=300
        - --kube-api-burst=400
        - --config=/var/lib/kube-proxy-config/config
        image: kube-proxy
..................

Conclusion

Saving money on infrastructure is not always easy. The development and infrastructure teams need to work together for such a solution to work and not cause more loss for the company than it actually saves by using Spot instances.

In short, the requirements look like this:

  • first of all, the applications need to have fast start/stop times. Having health checks for the apps also makes a difference.
  • the addons need to be configured correctly and fine-tuned to stay in sync with the infrastructure (aligning the NLB Target Group node draining time with the kube-spot-termination-notice-handler is one example).
  • all cluster apps (core and company apps) need to be properly configured with at least PDBs and HPAs; a minimal HPA sketch is shown after this list. Both of these have their own set of requirements in order to function correctly. PodAntiAffinity and tolerations are an added bonus if they are configured. A set of best practices for Kubernetes deployments can be found here. On top of everything, ensure the applications have been load tested and the resources have been correctly configured for each application.
  • the container images need to be small or use a common base image so the pull is fast. On top of that, the image registry needs to be in the same region as the cluster to speed up the pull.
  • the cluster needs to be provisioned with extra args, especially for kube-proxy.
  • metrics and log alerts are also essential for a deployment like this.
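To complement the PodDisruptionBudget shown earlier, a minimal HorizontalPodAutoscaler for the same hypothetical my-app Deployment could look like this (the API version, replica bounds and CPU target are placeholders to adjust per application):

apiVersion: autoscaling/v2beta2    # autoscaling/v2 on newer clusters
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2                   # keep at least two replicas so minAvailable: 1 in the PDB can be honoured
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70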