EKS Cluster Autoscaling with Karpenter
I’m going to walk through setting up Karpenter for EKS using Terraform and no third-party modules. The required networking and EKS cluster will need to be set up beforehand; you can see how to set up an EKS cluster here. If you’re wondering why you’d use Karpenter over the standard Cluster Autoscaler, there are a few big quality-of-life reasons: it can work with as many instance families as you choose without creating multiple node groups, it is zone- and price-aware, and it offers flexible, granular scaling options. All of the referenced Terraform code can be obtained here.
Providers
These are the providers that we’ll be using in the environment.
providers.tf
locals {
  env    = "sandbox"
  region = "us-east-1"
}

provider "aws" {
  region = local.region
  default_tags {
    tags = {
      env       = local.env
      terraform = true
    }
  }
}

provider "helm" {
  kubernetes {
    host                   = module.eks-cluster.endpoint
    cluster_ca_certificate = base64decode(module.eks-cluster.certificate)
    exec {
      api_version = "client.authentication.k8s.io/v1beta1"
      # This requires the awscli to be installed locally where Terraform is executed
      args    = ["eks", "get-token", "--cluster-name", module.eks-cluster.name]
      command = "aws"
    }
  }
}

provider "kubectl" {
  apply_retry_count      = 5
  host                   = module.eks-cluster.endpoint
  cluster_ca_certificate = base64decode(module.eks-cluster.certificate)
  load_config_file       = false

  exec {
    api_version = "client.authentication.k8s.io/v1beta1"
    command     = "aws"
    # This requires the awscli to be installed locally where Terraform is executed
    args = ["eks", "get-token", "--cluster-name", module.eks-cluster.name]
  }
}
versions.tf
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
    kubectl = {
      source  = "alekc/kubectl"
      version = "~> 2.0.3"
    }
    helm = {
      source  = "hashicorp/helm"
      version = "~> 2.11.0"
    }
  }
  required_version = "~> 1.5.7"
}
Initialize the module where needed. Here we’re pulling some output data from the EKS module.
1module "karpenter" {
2 source = "../../modules/karpenter"
3 env = local.env
4 region = local.region
5 cluster_name = module.eks-cluster.name
6 cluster_endpoint = module.eks-cluster.endpoint
7 irsa_oidc_provider_arn = module.eks-cluster.oidc_provider_arn
8 eks_node_role_arn = module.eks-cluster.node_role_arn
9 karpenter_version = "v0.32.1"
10 worker_node_types = ["t3.medium", "t3a.medium"]
11 worker_node_capacity_types = ["spot", "on-demand"]
12 worker_node_arch = ["amd64"]
13}
Module files
We’ll be using Helm to deploy Karpenter to the EKS cluster and updating the “aws-auth” ConfigMap to include the Karpenter node role. Using the kubectl provider, we’re applying the NodeClass and NodePool manifests. One important thing to highlight is that we’re targeting subnets and security groups by the “karpenter.sh/discovery” tag, so make sure those tags are set before running; you can see an example of this in the VPC module in the repo referenced earlier, and a minimal sketch follows below. The reason I’m using this particular kubectl provider is that the kubernetes_manifest resource in the Kubernetes provider fails at plan time because the Karpenter CRDs don’t exist yet.
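If your subnets and security groups are managed in the same Terraform codebase, the tagging can be as simple as the sketch below. This is a hypothetical example: the output names module.vpc.private_subnet_ids and module.eks-cluster.node_security_group_id are assumptions and will differ depending on how your networking and EKS modules are written.
# Hypothetical sketch: tag the private subnets and the node security group
# with karpenter.sh/discovery so the EC2NodeClass selectors can find them.
resource "aws_ec2_tag" "karpenter_subnets" {
  # Assumed module output; replace with your subnet IDs
  for_each    = toset(module.vpc.private_subnet_ids)
  resource_id = each.value
  key         = "karpenter.sh/discovery"
  value       = module.eks-cluster.name
}

resource "aws_ec2_tag" "karpenter_security_group" {
  # Assumed module output; replace with your node security group ID
  resource_id = module.eks-cluster.node_security_group_id
  key         = "karpenter.sh/discovery"
  value       = module.eks-cluster.name
}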
Another thing to note: this example uses the Bottlerocket OS for the nodes and pins the Karpenter controller pods to the core node group. We don’t want them potentially getting shuffled onto an EC2 instance that Karpenter itself created.
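For that pinning to work, the core node group has to label its nodes with role: core so the controller’s node affinity in the Helm values below has something to match. A minimal sketch, assuming the core node group is a plain aws_eks_node_group resource (the role ARN and subnet variables here are placeholders, not part of this module):
# Hypothetical core node group; the labels block is the part the Karpenter
# controller's nodeAffinity on role=core relies on.
resource "aws_eks_node_group" "core" {
  cluster_name    = module.eks-cluster.name
  node_group_name = "core-${local.env}"
  node_role_arn   = var.core_node_role_arn # placeholder
  subnet_ids      = var.private_subnet_ids # placeholder
  instance_types  = ["t3.medium"]

  labels = {
    role = "core"
  }

  scaling_config {
    desired_size = 2
    min_size     = 2
    max_size     = 3
  }
}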
Karpenter’s API docs explain each of the settings in NodeClass and NodePool. You can get as general or as specific as you want with the NodePool, such as setting minimum/maximum core counts, or skipping explicit instance types and only constraining categories and sizes.
main.tf
1resource "helm_release" "karpenter" {
2 namespace = "karpenter"
3 create_namespace = true
4 name = "karpenter"
5 repository = "oci://public.ecr.aws/karpenter"
6 chart = "karpenter"
7 version = var.karpenter_version
8
9 values = [
10 <<-EOT
11 settings:
12 clusterName: ${var.cluster_name}
13 clusterEndpoint: ${var.cluster_endpoint}
14 interruptionQueueName: ${aws_sqs_queue.karpenter.name}
15 aws:
16 defaultInstanceProfile: ${aws_iam_instance_profile.karpenter.name}
17 serviceAccount:
18 annotations:
19 eks.amazonaws.com/role-arn: ${aws_iam_role.karpenter_irsa.arn}
20 affinity:
21 nodeAffinity:
22 requiredDuringSchedulingIgnoredDuringExecution:
23 nodeSelectorTerms:
24 - matchExpressions:
25 - key: role
26 operator: In
27 values:
28 - core
29 EOT
30 ]
31}
32
33resource "kubectl_manifest" "aws_auth_config" {
34 yaml_body = <<-YAML
35 apiVersion: v1
36 kind: ConfigMap
37 metadata:
38 name: aws-auth
39 namespace: kube-system
40 data:
41 mapRoles: |
42 - groups:
43 - system:bootstrappers
44 - system:nodes
45 rolearn: "${var.eks_node_role_arn}"
46 username: system:node:{{EC2PrivateDNSName}}
47 - groups:
48 - system:bootstrappers
49 - system:nodes
50 rolearn: "${aws_iam_role.karpenter_node.arn}"
51 username: system:node:{{EC2PrivateDNSName}}
52 YAML
53
54 depends_on = [
55 helm_release.karpenter
56 ]
57}
58
59resource "kubectl_manifest" "karpenter_node_class" {
60 yaml_body = <<-YAML
61 apiVersion: karpenter.k8s.aws/v1beta1
62 kind: EC2NodeClass
63 metadata:
64 name: default
65 spec:
66 amiFamily: Bottlerocket
67 role: "karpenter-node-${var.cluster_name}"
68 subnetSelectorTerms:
69 - tags:
70 karpenter.sh/discovery: "${var.cluster_name}"
71 securityGroupSelectorTerms:
72 - tags:
73 karpenter.sh/discovery: "${var.cluster_name}"
74 tags:
75 platform: eks
76 Name: "eks-karpenter-${var.env}"
77 karpenter.sh/discovery: "${var.cluster_name}"
78 metadataOptions:
79 httpEndpoint: enabled
80 httpProtocolIPv6: disabled
81 httpPutResponseHopLimit: 2
82 httpTokens: required
83 blockDeviceMappings:
84 # Root device
85 - deviceName: /dev/xvda
86 ebs:
87 volumeSize: 4Gi
88 volumeType: gp3
89 encrypted: true
90 # Data device: Container resources such as images and logs
91 - deviceName: /dev/xvdb
92 ebs:
93 volumeSize: 20Gi
94 volumeType: gp3
95 encrypted: true
96 YAML
97
98 depends_on = [
99 helm_release.karpenter
100 ]
101}
102
103resource "kubectl_manifest" "karpenter_node_pool" {
104 yaml_body = templatefile("../../modules/aws/eks-addons/karpenter/files/node_pool.yaml", {
105 INSTANCE_TYPES = jsonencode(var.worker_node_types)
106 CAPACITY_TYPES = jsonencode(var.worker_node_capacity_types)
107 INSTANCE_ARCH = jsonencode(var.worker_node_arch)
108 })
109
110 depends_on = [
111 helm_release.karpenter
112 ]
113}
node_pool.yaml
The NodePool manifest is where we can really fine-tune the instance types if we choose to. You can be as general as an instance category, such as the C family, or as specific as individual instance sizes; a broader example follows the manifest.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ${INSTANCE_ARCH}
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ${CAPACITY_TYPES}
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ${INSTANCE_TYPES}
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["2"]
      nodeClassRef:
        name: default
      kubelet:
        maxPods: 110
  limits:
    cpu: 1000
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h # 30 * 24h = 720h
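As an example of going broader instead of pinning exact instance types, a second pool could allow any current-generation instance from the c, m, or r categories with 4 to 16 vCPUs. This is a hypothetical variant using Karpenter’s well-known labels, written as an inline kubectl_manifest rather than a template file, and it reuses the default EC2NodeClass defined earlier:
# Hypothetical broader pool: instance categories and vCPU counts instead of
# explicit instance types.
resource "kubectl_manifest" "karpenter_node_pool_general" {
  yaml_body = <<-YAML
    apiVersion: karpenter.sh/v1beta1
    kind: NodePool
    metadata:
      name: general
    spec:
      template:
        spec:
          requirements:
            - key: karpenter.k8s.aws/instance-category
              operator: In
              values: ["c", "m", "r"]
            - key: karpenter.k8s.aws/instance-cpu
              operator: In
              values: ["4", "8", "16"]
            - key: karpenter.k8s.aws/instance-generation
              operator: Gt
              values: ["2"]
            - key: karpenter.sh/capacity-type
              operator: In
              values: ["spot", "on-demand"]
          nodeClassRef:
            name: default
      limits:
        cpu: 1000
  YAML

  depends_on = [
    helm_release.karpenter
  ]
}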
data.tf
1data "aws_caller_identity" "current" {}
Next up are the IRSA role used by the Karpenter controller and the node role assigned to the nodes it launches. An IAM OIDC provider needs to exist for the cluster so that the Karpenter service account can assume an IAM role. The scoped-down IRSA policy referenced below can be found here. If you get an error creating the Spot service-linked role, you may already have it set up in your account.
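If your EKS module doesn’t already create the IAM OIDC provider, a minimal sketch of doing it yourself is shown here; it assumes the cluster resource is named aws_eks_cluster.this and that the hashicorp/tls provider is available, so adjust the references to match your setup.
# Hypothetical sketch: IAM OIDC provider for the cluster, required for IRSA.
data "tls_certificate" "oidc" {
  url = aws_eks_cluster.this.identity[0].oidc[0].issuer
}

resource "aws_iam_openid_connect_provider" "eks" {
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = [data.tls_certificate.oidc.certificates[0].sha1_fingerprint]
  url             = aws_eks_cluster.this.identity[0].oidc[0].issuer
}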
iam.tf
locals {
  irsa_oidc_provider_url = replace(var.irsa_oidc_provider_arn, "/^(.*provider/)/", "")
  account_id             = data.aws_caller_identity.current.account_id
}

resource "aws_iam_instance_profile" "karpenter" {
  name = "karpenter-irsa-${var.env}"
  role = aws_iam_role.karpenter_irsa.name
}

# Create the service-linked role for Spot support
resource "aws_iam_service_linked_role" "spot" {
  aws_service_name = "spot.amazonaws.com"
}

resource "aws_iam_role" "karpenter_node" {
  name = "karpenter-node-${var.env}"
  assume_role_policy = jsonencode({
    Statement : [
      {
        Action : "sts:AssumeRole",
        Effect : "Allow",
        Principal : {
          "Service" : "ec2.amazonaws.com"
        }
      }
    ],
    Version : "2012-10-17"
  })

  managed_policy_arns = [
    "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly",
    "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy",
    "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy",
    "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
  ]
}

data "aws_iam_policy_document" "irsa_assume_role" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [var.irsa_oidc_provider_arn]
    }
    condition {
      test     = "StringEquals"
      variable = "${local.irsa_oidc_provider_url}:sub"
      values   = ["system:serviceaccount:karpenter:karpenter"]
    }
    condition {
      test     = "StringEquals"
      variable = "${local.irsa_oidc_provider_url}:aud"
      values   = ["sts.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "karpenter_irsa" {
  name               = "karpenter-irsa-${var.env}"
  assume_role_policy = data.aws_iam_policy_document.irsa_assume_role.json
  managed_policy_arns = [
    aws_iam_policy.karpenter_irsa.arn
  ]
}

69resource "aws_iam_policy" "karpenter_irsa" {
70 name = "karpenter-irsa-${var.env}"
71 policy = templatefile("../../modules/karpenter/files/irsa_policy.json", {
72 AWS_ACCOUNT_ID = data.aws_caller_identity.current.account_id
73 AWS_REGION = var.region
74 CLUSTER_NAME = var.cluster_name
75 })
76}
This SQS queue will notify Karpenter of spot interruptions and instance health events.
sqs.tf
1resource "aws_sqs_queue" "karpenter" {
2 message_retention_seconds = 300
3 name = "${var.cluster_name}-karpenter"
4}
5
6resource "aws_sqs_queue_policy" "karpenter" {
7 policy = data.aws_iam_policy_document.node_termination_queue.json
8 queue_url = aws_sqs_queue.karpenter.url
9}
10
11data "aws_iam_policy_document" "node_termination_queue" {
12 statement {
13 resources = [aws_sqs_queue.karpenter.arn]
14 sid = "EC2InterruptionPolicy"
15 actions = ["sqs:SendMessage"]
16 principals {
17 type = "Service"
18 identifiers = [
19 "events.amazonaws.com",
20 "sqs.amazonaws.com"
21 ]
22 }
23 }
24}
These are the CloudWatch event rules for the interruption and health events mentioned above; each rule sends its events to the SQS queue.
cloudwatch.tf
locals {
  events = {
    health_event = {
      name        = "HealthEvent"
      description = "Karpenter interrupt - AWS health event"
      event_pattern = {
        source      = ["aws.health"]
        detail-type = ["AWS Health Event"]
      }
    }
    spot_interrupt = {
      name        = "SpotInterrupt"
      description = "Karpenter interrupt - EC2 spot instance interruption warning"
      event_pattern = {
        source      = ["aws.ec2"]
        detail-type = ["EC2 Spot Instance Interruption Warning"]
      }
    }
    instance_rebalance = {
      name        = "InstanceRebalance"
      description = "Karpenter interrupt - EC2 instance rebalance recommendation"
      event_pattern = {
        source      = ["aws.ec2"]
        detail-type = ["EC2 Instance Rebalance Recommendation"]
      }
    }
    instance_state_change = {
      name        = "InstanceStateChange"
      description = "Karpenter interrupt - EC2 instance state-change notification"
      event_pattern = {
        source      = ["aws.ec2"]
        detail-type = ["EC2 Instance State-change Notification"]
      }
    }
  }
}

resource "aws_cloudwatch_event_rule" "this" {
  for_each = local.events

  name_prefix   = "${each.value.name}-"
  description   = each.value.description
  event_pattern = jsonencode(each.value.event_pattern)

  tags = {
    ClusterName = var.cluster_name
  }
}

resource "aws_cloudwatch_event_target" "this" {
  for_each = local.events

  rule      = aws_cloudwatch_event_rule.this[each.key].name
  target_id = "KarpenterInterruptionQueueTarget"
  arn       = aws_sqs_queue.karpenter.arn
}
variables.tf
1variable "cluster_name" {
2 type = string
3}
4variable "cluster_endpoint" {
5 type = string
6}
7variable "env" {
8 type = string
9}
10variable "region" {
11 type = string
12}
13variable "irsa_oidc_provider_arn" {
14 type = string
15}
16variable "eks_node_role_arn" {
17 type = string
18}
19variable "karpenter_version" {
20 type = string
21}
22variable "worker_node_types" {
23 type = list(string)
24}
25variable "worker_node_capacity_types" {
26 type = list(string)
27}
28variable "worker_node_arch" {
29 type = list(string)
30}
Demo
An easy way to test this is to use the pause container to force autoscaling. Adjust the CPU request depending on your instance types.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
spec:
  replicas: 0
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      terminationGracePeriodSeconds: 0
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
          resources:
            requests:
              cpu: 1
kubectl apply -f ./test.yaml
kubectl scale deployment inflate --replicas 5
You should see additional nodes being created fairly quickly to schedule the new pods. If you don’t see any activity, you can view the Karpenter controller logs with this command:
kubectl logs -f -n karpenter -l app.kubernetes.io/name=karpenter -c controller