This article is about AWS cost optimisation using Cloud Custodian and Jenkins, both running in Kubernetes.

Powering off development infrastructure at night, on weekends and during holidays can make a big difference to cloud charges. AWS offers discounts for long-term commitments, but these do not always help, and companies still end up paying for capacity they do not use.

Let’s take for example the monthly cost of a Linux m5.large instance:

Pricing option                                                          Monthly cost
On-Demand                                                               ~$81
1 year Standard no-upfront                                              ~$51
3 years Standard no-upfront                                             ~$35
On-Demand, 12 hours a day Monday to Friday (22 working days per month)  ~$29
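
The scheduled figure can be checked with a little arithmetic. The sketch below derives an hourly rate from the ~$81 On-Demand figure above, assuming an average month of 730 hours (an assumption, not a number from AWS pricing), and then charges only for 12 hours on 22 days. The function name is made up for illustration.

```python
HOURS_PER_MONTH = 730  # assumption: 8760 hours per year / 12 months


def scheduled_monthly_cost(always_on_monthly, hours_per_day, days_per_month):
    """Estimate the monthly cost when the instance only runs on a schedule."""
    # Hourly rate implied by the always-on monthly price
    hourly_rate = always_on_monthly / HOURS_PER_MONTH
    return hourly_rate * hours_per_day * days_per_month


if __name__ == "__main__":
    on_demand = 81.0  # approximate On-Demand monthly cost quoted above
    scheduled = scheduled_monthly_cost(on_demand, 12, 22)
    saving = 1 - scheduled / on_demand
    print(f"scheduled: ~${scheduled:.2f} per month, saving ~{saving:.0%}")
```

Under these assumptions the result lands at roughly $29 per month, in line with the last row above and below the 3 years commitment price.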

The cost can be reduced even further by using Spot instances with a power schedule.

Cloud Custodian is a fine tool when it comes to API interaction with different cloud providers. In this article I will focus on AWS EC2. The solution can easily be extended to other AWS offerings such as RDS and Auto Scaling groups, and it can also be used with other cloud providers.

Cloud Custodian documentation is excellent and the community is always there to help.

How Cloud Custodian policies work

Policies are written in YAML. A schedule works at hour granularity; there is no concept of minutes. For example, with the schedule on=[(M-F,8)];off=[(M-F,20)];tz=gb an instance is powered ON by whichever Cloud Custodian run happens between 8:00 and 8:59, and powered OFF by whichever run happens between 20:00 and 20:59.
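
To make the hour window concrete, here is a minimal Python sketch of how such a schedule string could be read and matched against a timestamp. It is not Cloud Custodian's own parser; the function names are made up for illustration, and the timestamp is assumed to be already expressed in the schedule's timezone. The day letters M, T, W, H, F, S, U follow the Cloud Custodian documentation.

```python
import re
from datetime import datetime

# Day letters used in offhours schedules, Monday first
DAYS = ["M", "T", "W", "H", "F", "S", "U"]


def expand_days(spec):
    """Turn 'M-F' into ['M', 'T', 'W', 'H', 'F']; a single letter stays as is."""
    if "-" in spec:
        first, last = spec.split("-")
        return DAYS[DAYS.index(first):DAYS.index(last) + 1]
    return [spec]


def parse_schedule(value):
    """Split 'on=[(M-F,8)];off=[(M-F,20)];tz=gb' into its three parts."""
    schedule = {"on": [], "off": [], "tz": None}
    for part in value.split(";"):
        key, _, raw = part.partition("=")
        if key == "tz":
            schedule["tz"] = raw
        elif key in ("on", "off"):
            # Each entry looks like (DAYS,HOUR), e.g. (M-F,8)
            for days, hour in re.findall(r"\(([A-Z](?:-[A-Z])?),(\d{1,2})\)", raw):
                schedule[key].append((expand_days(days), int(hour)))
    return schedule


def matches(entries, when):
    """True if `when` falls on a listed day, anywhere inside the listed hour."""
    day = DAYS[when.weekday()]  # weekday() returns 0 for Monday
    return any(day in days and when.hour == hour for days, hour in entries)


if __name__ == "__main__":
    schedule = parse_schedule("on=[(M-F,8)];off=[(M-F,20)];tz=gb")
    monday_0859 = datetime(2024, 1, 1, 8, 59)  # 1 January 2024 was a Monday
    monday_0900 = datetime(2024, 1, 1, 9, 0)
    print(matches(schedule["on"], monday_0859))  # inside the 8:00-8:59 window
    print(matches(schedule["on"], monday_0900))  # window already closed
```

The second check shows why the minutes do not matter: a run at 8:59 still matches, while a run at 9:00 has missed the window for that day.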

Requirements

Below is a short example of how Cloud Custodian can help reduce the AWS bill. No Cloud Custodian instance runs permanently; instead, Jenkins schedules a Cloud Custodian container in Kubernetes. The prerequisites are:

  • a Jenkins installation deployed to Kubernetes
  • an AWS account with some EC2 instances
  • Jenkins pods able to assume IAM roles (kube2iam preferably)

AWS IAM role

An IAM role called AwsMaid needs to be created for Cloud Custodian, as in the example below. ec2:DescribeInstances is included because Cloud Custodian has to list the instances before it can filter them by tag. The role will need further permissions to cover RDS instances and Auto Scaling groups.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "CloudCustodianStartStop",
            "Action": [
                "ec2:DescribeInstances",
                "ec2:DescribeTags",
                "ec2:StopInstances",
                "ec2:StartInstances"
            ],
            "Effect": "Allow",
            "Resource": "*"
        }
    ]
}

Jenkins config

The JCasC (Jenkins Configuration as Code) plugin configuration is shown below. Add it to the Jenkins CasC YAML configuration.

jenkins:
  clouds:
  - kubernetes:
      connectTimeout: 5
      containerCapStr: "100"
      jenkinsTunnel: "jenkins-agent.default.svc.cluster.local:50000"
      jenkinsUrl: "http://jenkins.default.svc.cluster.local:8080"
      maxRequestsPerHostStr: "32"
      name: "kubernetes"
      readTimeout: 15
      templates:
      # Cloud Custodian container configuration     
      - annotations:
        - key: "iam.amazonaws.com/role"
          value: "AwsMaid"
        containers:
        - args: "cat"
          command: "/bin/sh -c"
          image: "cloudcustodian/c7n:0.8.45.3"
          livenessProbe:
            failureThreshold: 0
            initialDelaySeconds: 0
            periodSeconds: 0
            successThreshold: 0
            timeoutSeconds: 0
          name: "cloud-custodian"
          resourceRequestCpu: "300m"
          resourceRequestMemory: "100Mi"
          ttyEnabled: true
        label: "cloud-custodian"
        name: "cloud-custodian"
        namespace: "default"
        podRetention: "never"
        workspaceVolume:
          emptyDirWorkspaceVolume:
            memory: false

Because a schedule only matches during a single hour, the job should be scheduled to run 3 or 4 times per hour, so that one failed or delayed run does not mean the window is missed. The pipeline looks like this:

node("cloud-custodian") {
  stage("power-schedule") {
    try {
      checkout poll: false,
        scm: [
          $class: 'GitSCM',
          branches: [[name: '*/master']],
          doGenerateSubmoduleConfigurations: false,
          userRemoteConfigs: [[url: CLOUD_CUSTODIAN_POLICIES_REPO]]
        ]

      // force 20 minutes timeout
      timeout(time: 20) {
        container(name: 'cloud-custodian') {
          dir('cloud-custodian') {
            sh """
                set +x
                custodian run -v \
                -s out policy.yaml \
                --region eu-west-1 \
                --region eu-west-2 \
                --region eu-central-1 \
                --cache-period=0
            """
          }
        }
      }
    } catch(err) {
      currentBuild.result = 'FAILURE'
      echo "Caught: ${err}"
      error(err.message)
    }
  }
}

Cloud Custodian policy

The easiest way to manage the policies is to keep them under version control, so the file below can be saved as policy.yaml in the cloud-custodian folder of a Git repository. The Jenkins pipeline above pulls the policies from the repository given in the variable CLOUD_CUSTODIAN_POLICIES_REPO and runs them from the folder cloud-custodian. The simplest policy, which can also be found in the documentation, is shown below. Cloud Custodian parses the tags of all EC2 instances and acts accordingly.

policies:
  - name: offhours-stop
    resource: ec2
    filters:
      - type: offhour
    actions:
      - stop

  - name: offhours-start
    resource: ec2
    filters:
      - type: onhour
    actions:
      - start

EC2 tagging

The policy above powers instances ON and OFF by parsing the tags of all EC2 instances. An instance tagged with the Key maid_offhours and a Value of on=[(M-F,8)];off=[(M-F,20)];tz=gb will be powered ON at 8AM and powered OFF at 8PM, Monday to Friday, in the gb timezone. More advanced filters can be applied; details can be found in the Cloud Custodian documentation.

Key Value
maid_offhours on=[(M-F,8)];off=[(M-F,20)];tz=gb
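
Tagging many instances by hand is error-prone, so the tag can also be built and applied from a script. The following Python sketch assumes boto3 is installed and that the credentials in use are allowed to call ec2:CreateTags, which the AwsMaid role above does not grant. The helper names, the default region and the instance ID in the usage comment are hypothetical.

```python
def offhours_tag(on_hour, off_hour, days="M-F", tz="gb", key="maid_offhours"):
    """Build the Key/Value pair shown above for a simple on/off schedule."""
    for hour in (on_hour, off_hour):
        if not 0 <= hour <= 23:
            raise ValueError(f"hour must be between 0 and 23, got {hour}")
    value = f"on=[({days},{on_hour})];off=[({days},{off_hour})];tz={tz}"
    return {"Key": key, "Value": value}


def tag_instances(instance_ids, tag, region="eu-west-1"):
    """Apply the tag to the given EC2 instances through the AWS API."""
    # Imported here so offhours_tag() can be used without boto3 installed
    import boto3

    ec2 = boto3.client("ec2", region_name=region)
    ec2.create_tags(Resources=list(instance_ids), Tags=[tag])


if __name__ == "__main__":
    tag = offhours_tag(8, 20)
    print(tag["Key"], tag["Value"])
    # Hypothetical instance ID; uncomment to apply the tag for real:
    # tag_instances(["i-0123456789abcdef0"], tag)
```

Keeping the string construction separate from the API call makes it easy to check the tag value before anything is changed in the account.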