EKS Cluster Autoscaling with Karpenter
I’m going to walk through setting up Karpenter for EKS using Terraform and no third-party modules. The required networking and EKS cluster will need to be set up beforehand; you can see how to set up an EKS cluster here. If you’re wondering why you’d use Karpenter over the standard Cluster Autoscaler, there are a few big quality-of-life reasons: it can work with as many instance families as you choose without creating multiple node groups, it is zone- and price-aware, and it offers flexible, granular scaling options. All of the referenced Terraform code can be obtained here.
Providers
These are the providers that we’ll be using in the environment.
providers.tf
locals {
  env    = "sandbox"
  region = "us-east-1"
}

provider "aws" {
  region = local.region
  default_tags {
    tags = {
      env       = local.env
      terraform = true
    }
  }
}

provider "helm" {
  kubernetes {
    host                   = module.eks-cluster.endpoint
    cluster_ca_certificate = base64decode(module.eks-cluster.certificate)
    exec {
      api_version = "client.authentication.k8s.io/v1beta1"
      # This requires the awscli to be installed locally where Terraform is executed
      args    = ["eks", "get-token", "--cluster-name", module.eks-cluster.name]
      command = "aws"
    }
  }
}

provider "kubectl" {
  apply_retry_count      = 5
  host                   = module.eks-cluster.endpoint
  cluster_ca_certificate = base64decode(module.eks-cluster.certificate)
  load_config_file       = false

  exec {
    api_version = "client.authentication.k8s.io/v1beta1"
    command     = "aws"
    # This requires the awscli to be installed locally where Terraform is executed
    args = ["eks", "get-token", "--cluster-name", module.eks-cluster.name]
  }
}
versions.tf
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
    kubectl = {
      source  = "alekc/kubectl"
      version = "~> 2.0.3"
    }
    helm = {
      source  = "hashicorp/helm"
      version = "~> 2.11.0"
    }
  }
  required_version = "~> 1.5.7"
}
Initialize the module where needed. Here we’re pulling some output data from the EKS module.
1module "karpenter" {
2 source = "../../modules/karpenter"
3 env = local.env
4 region = local.region
5 cluster_name = module.eks-cluster.name
6 cluster_endpoint = module.eks-cluster.endpoint
7 irsa_oidc_provider_arn = module.eks-cluster.oidc_provider_arn
8 eks_node_role_arn = module.eks-cluster.node_role_arn
9 karpenter_version = "v0.32.1"
10 worker_node_types = ["t3.medium", "t3a.medium"]
11 worker_node_capacity_types = ["spot", "on-demand"]
12 worker_node_arch = ["amd64"]
13}
Module files
We’ll be using Helm to deploy Karpenter to the EKS cluster and updating the “aws-auth” ConfigMap to include the Karpenter node role. Using the kubectl provider, we’re applying the NodeClass and NodePool manifests. One important thing to highlight is that we’re targeting subnets and security groups by the “karpenter.sh/discovery” tag, so make sure those tags are set before running; you can see an example of this in the VPC module in the repo referenced earlier, and a minimal sketch follows below. The reason I’m using this particular kubectl provider is that the kubernetes_manifest resource in the Kubernetes provider fails at plan time because the Karpenter CRDs don’t exist yet.
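If your subnets and security groups are managed in the same Terraform codebase, the tagging can be as simple as the sketch below. This is a hypothetical example: the output names module.vpc.private_subnet_ids and module.eks-cluster.node_security_group_id are assumptions and will differ depending on how your networking and EKS modules are written.
# Hypothetical sketch: tag the private subnets and the node security group
# with karpenter.sh/discovery so the EC2NodeClass selectors can find them.
resource "aws_ec2_tag" "karpenter_subnets" {
  # Assumed module output; replace with your subnet IDs
  for_each    = toset(module.vpc.private_subnet_ids)
  resource_id = each.value
  key         = "karpenter.sh/discovery"
  value       = module.eks-cluster.name
}

resource "aws_ec2_tag" "karpenter_security_group" {
  # Assumed module output; replace with your node security group ID
  resource_id = module.eks-cluster.node_security_group_id
  key         = "karpenter.sh/discovery"
  value       = module.eks-cluster.name
}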
Another thing to note: this example uses the Bottlerocket OS for the nodes and pins the Karpenter controller pods to the core node group. We don’t want them potentially getting shuffled onto an EC2 instance that Karpenter itself created.
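For that pinning to work, the core node group has to label its nodes with role: core so the controller’s node affinity in the Helm values below has something to match. A minimal sketch, assuming the core node group is a plain aws_eks_node_group resource (the role ARN and subnet variables here are placeholders, not part of this module):
# Hypothetical core node group; the labels block is the part the Karpenter
# controller's nodeAffinity on role=core relies on.
resource "aws_eks_node_group" "core" {
  cluster_name    = module.eks-cluster.name
  node_group_name = "core-${local.env}"
  node_role_arn   = var.core_node_role_arn # placeholder
  subnet_ids      = var.private_subnet_ids # placeholder
  instance_types  = ["t3.medium"]

  labels = {
    role = "core"
  }

  scaling_config {
    desired_size = 2
    min_size     = 2
    max_size     = 3
  }
}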
Karpenter’s API docs explain each of the settings in NodeClass and NodePool. You can get as general or as specific as you want with the NodePool, such as setting minimum/maximum core counts, or skipping explicit instance types and only constraining categories and sizes.
main.tf
1resource "helm_release" "karpenter" {
2 namespace = "karpenter"
3 create_namespace = true
4 name = "karpenter"
5 repository = "oci://public.ecr.aws/karpenter"
6 chart = "karpenter"
7 version = var.karpenter_version
8
9 values = [
10 <<-EOT
11 settings:
12 clusterName: ${var.cluster_name}
13 clusterEndpoint: ${var.cluster_endpoint}
14 interruptionQueueName: ${aws_sqs_queue.karpenter.name}
15 aws:
16 defaultInstanceProfile: ${aws_iam_instance_profile.karpenter.name}
17 serviceAccount:
18 annotations:
19 eks.amazonaws.com/role-arn: ${aws_iam_role.karpenter_irsa.arn}
20 affinity:
21 nodeAffinity:
22 requiredDuringSchedulingIgnoredDuringExecution:
23 nodeSelectorTerms:
24 - matchExpressions:
25 - key: role
26 operator: In
27 values:
28 - core
29 EOT
30 ]
31}
32
33resource "kubectl_manifest" "aws_auth_config" {
34 yaml_body = <<-YAML
35 apiVersion: v1
36 kind: ConfigMap
37 metadata:
38 name: aws-auth
39 namespace: kube-system
40 data:
41 mapRoles: |
42 - groups:
43 - system:bootstrappers
44 - system:nodes
45 rolearn: "${var.eks_node_role_arn}"
46 username: system:node:{{EC2PrivateDNSName}}
47 - groups:
48 - system:bootstrappers
49 - system:nodes
50 rolearn: "${aws_iam_role.karpenter_node.arn}"
51 username: system:node:{{EC2PrivateDNSName}}
52 YAML
53
54 depends_on = [
55 helm_release.karpenter
56 ]
57}
58
59resource "kubectl_manifest" "karpenter_node_class" {
60 yaml_body = <<-YAML
61 apiVersion: karpenter.k8s.aws/v1beta1
62 kind: EC2NodeClass
63 metadata:
64 name: default
65 spec:
66 amiFamily: Bottlerocket
67 role: "karpenter-node-${var.cluster_name}"
68 subnetSelectorTerms:
69 - tags:
70 karpenter.sh/discovery: "${var.cluster_name}"
71 securityGroupSelectorTerms:
72 - tags:
73 karpenter.sh/discovery: "${var.cluster_name}"
74 tags:
75 platform: eks
76 Name: "eks-karpenter-${var.env}"
77 karpenter.sh/discovery: "${var.cluster_name}"
78 metadataOptions:
79 httpEndpoint: enabled
80 httpProtocolIPv6: disabled
81 httpPutResponseHopLimit: 2
82 httpTokens: required
83 blockDeviceMappings:
84 # Root device
85 - deviceName: /dev/xvda
86 ebs:
87 volumeSize: 4Gi
88 volumeType: gp3
89 encrypted: true
90 # Data device: Container resources such as images and logs
91 - deviceName: /dev/xvdb
92 ebs:
93 volumeSize: 20Gi
94 volumeType: gp3
95 encrypted: true
96 YAML
97
98 depends_on = [
99 helm_release.karpenter
100 ]
101}
102
103resource "kubectl_manifest" "karpenter_node_pool" {
104 yaml_body = templatefile("../../modules/aws/eks-addons/karpenter/files/node_pool.yaml", {
105 INSTANCE_TYPES = jsonencode(var.worker_node_types)
106 CAPACITY_TYPES = jsonencode(var.worker_node_capacity_types)
107 INSTANCE_ARCH = jsonencode(var.worker_node_arch)
108 })
109
110 depends_on = [
111 helm_release.karpenter
112 ]
113}
node_pool.yaml
The NodePool manifest is where we can really fine-tune the instance types if we choose to. You can be as general as an instance category, such as the C family, or as specific as individual instance sizes; a broader example follows the manifest.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ${INSTANCE_ARCH}
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ${CAPACITY_TYPES}
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ${INSTANCE_TYPES}
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["2"]
      nodeClassRef:
        name: default
      kubelet:
        maxPods: 110
  limits:
    cpu: 1000
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h # 30 * 24h = 720h
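As an example of going broader instead of pinning exact instance types, a second pool could allow any current-generation instance from the c, m, or r categories with 4 to 16 vCPUs. This is a hypothetical variant using Karpenter’s well-known labels, written as an inline kubectl_manifest rather than a template file, and it reuses the default EC2NodeClass defined earlier:
# Hypothetical broader pool: instance categories and vCPU counts instead of
# explicit instance types.
resource "kubectl_manifest" "karpenter_node_pool_general" {
  yaml_body = <<-YAML
    apiVersion: karpenter.sh/v1beta1
    kind: NodePool
    metadata:
      name: general
    spec:
      template:
        spec:
          requirements:
            - key: karpenter.k8s.aws/instance-category
              operator: In
              values: ["c", "m", "r"]
            - key: karpenter.k8s.aws/instance-cpu
              operator: In
              values: ["4", "8", "16"]
            - key: karpenter.k8s.aws/instance-generation
              operator: Gt
              values: ["2"]
            - key: karpenter.sh/capacity-type
              operator: In
              values: ["spot", "on-demand"]
          nodeClassRef:
            name: default
      limits:
        cpu: 1000
  YAML

  depends_on = [
    helm_release.karpenter
  ]
}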
data.tf
1data "aws_caller_identity" "current" {}
Next up are the IRSA role used by the Karpenter controller and the node role assigned to the nodes it launches. An IAM OIDC provider needs to exist for the cluster so that the Karpenter service account can assume an IAM role. The scoped-down IRSA policy referenced below can be found here. If you get an error creating the Spot service-linked role, you may already have it set up in your account.
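If your EKS module doesn’t already create the IAM OIDC provider, a minimal sketch of doing it yourself is shown here; it assumes the cluster resource is named aws_eks_cluster.this and that the hashicorp/tls provider is available, so adjust the references to match your setup.
# Hypothetical sketch: IAM OIDC provider for the cluster, required for IRSA.
data "tls_certificate" "oidc" {
  url = aws_eks_cluster.this.identity[0].oidc[0].issuer
}

resource "aws_iam_openid_connect_provider" "eks" {
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = [data.tls_certificate.oidc.certificates[0].sha1_fingerprint]
  url             = aws_eks_cluster.this.identity[0].oidc[0].issuer
}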
iam.tf
locals {
  irsa_oidc_provider_url = replace(var.irsa_oidc_provider_arn, "/^(.*provider/)/", "")
  account_id             = data.aws_caller_identity.current.account_id
}

resource "aws_iam_instance_profile" "karpenter" {
  name = "karpenter-irsa-${var.env}"
  role = aws_iam_role.karpenter_irsa.name
}

# Create the service-linked role for Spot support
resource "aws_iam_service_linked_role" "spot" {
  aws_service_name = "spot.amazonaws.com"
}

resource "aws_iam_role" "karpenter_node" {
  name = "karpenter-node-${var.env}"
  assume_role_policy = jsonencode({
    Statement : [
      {
        Action : "sts:AssumeRole",
        Effect : "Allow",
        Principal : {
          "Service" : "ec2.amazonaws.com"
        }
      }
    ],
    Version : "2012-10-17"
  })

  managed_policy_arns = [
    "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly",
    "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy",
    "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy",
    "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
  ]
}

data "aws_iam_policy_document" "irsa_assume_role" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [var.irsa_oidc_provider_arn]
    }
    condition {
      test     = "StringEquals"
      variable = "${local.irsa_oidc_provider_url}:sub"
      values   = ["system:serviceaccount:karpenter:karpenter"]
    }
    condition {
      test     = "StringEquals"
      variable = "${local.irsa_oidc_provider_url}:aud"
      values   = ["sts.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "karpenter_irsa" {
  name               = "karpenter-irsa-${var.env}"
  assume_role_policy = data.aws_iam_policy_document.irsa_assume_role.json
  managed_policy_arns = [
    aws_iam_policy.karpenter_irsa.arn
  ]
}

69resource "aws_iam_policy" "karpenter_irsa" {
70 name = "karpenter-irsa-${var.env}"
71 policy = templatefile("../../modules/karpenter/files/irsa_policy.json", {
72 AWS_ACCOUNT_ID = data.aws_caller_identity.current.account_id
73 AWS_REGION = var.region
74 CLUSTER_NAME = var.cluster_name
75 })
76}
This SQS queue will notify Karpenter of spot interruptions and instance health events.
sqs.tf
1resource "aws_sqs_queue" "karpenter" {
2 message_retention_seconds = 300
3 name = "${var.cluster_name}-karpenter"
4}
5
6resource "aws_sqs_queue_policy" "karpenter" {
7 policy = data.aws_iam_policy_document.node_termination_queue.json
8 queue_url = aws_sqs_queue.karpenter.url
9}
10
11data "aws_iam_policy_document" "node_termination_queue" {
12 statement {
13 resources = [aws_sqs_queue.karpenter.arn]
14 sid = "EC2InterruptionPolicy"
15 actions = ["sqs:SendMessage"]
16 principals {
17 type = "Service"
18 identifiers = [
19 "events.amazonaws.com",
20 "sqs.amazonaws.com"
21 ]
22 }
23 }
24}
These are the CloudWatch event rules for the interruption and health events mentioned above; each rule sends its events to the SQS queue.
cloudwatch.tf
locals {
  events = {
    health_event = {
      name        = "HealthEvent"
      description = "Karpenter interrupt - AWS health event"
      event_pattern = {
        source      = ["aws.health"]
        detail-type = ["AWS Health Event"]
      }
    }
    spot_interrupt = {
      name        = "SpotInterrupt"
      description = "Karpenter interrupt - EC2 spot instance interruption warning"
      event_pattern = {
        source      = ["aws.ec2"]
        detail-type = ["EC2 Spot Instance Interruption Warning"]
      }
    }
    instance_rebalance = {
      name        = "InstanceRebalance"
      description = "Karpenter interrupt - EC2 instance rebalance recommendation"
      event_pattern = {
        source      = ["aws.ec2"]
        detail-type = ["EC2 Instance Rebalance Recommendation"]
      }
    }
    instance_state_change = {
      name        = "InstanceStateChange"
      description = "Karpenter interrupt - EC2 instance state-change notification"
      event_pattern = {
        source      = ["aws.ec2"]
        detail-type = ["EC2 Instance State-change Notification"]
      }
    }
  }
}

resource "aws_cloudwatch_event_rule" "this" {
  for_each = local.events

  name_prefix   = "${each.value.name}-"
  description   = each.value.description
  event_pattern = jsonencode(each.value.event_pattern)

  tags = {
    ClusterName = var.cluster_name
  }
}

resource "aws_cloudwatch_event_target" "this" {
  for_each = local.events

  rule      = aws_cloudwatch_event_rule.this[each.key].name
  target_id = "KarpenterInterruptionQueueTarget"
  arn       = aws_sqs_queue.karpenter.arn
}
variables.tf
1variable "cluster_name" {
2 type = string
3}
4variable "cluster_endpoint" {
5 type = string
6}
7variable "env" {
8 type = string
9}
10variable "region" {
11 type = string
12}
13variable "irsa_oidc_provider_arn" {
14 type = string
15}
16variable "eks_node_role_arn" {
17 type = string
18}
19variable "karpenter_version" {
20 type = string
21}
22variable "worker_node_types" {
23 type = list(string)
24}
25variable "worker_node_capacity_types" {
26 type = list(string)
27}
28variable "worker_node_arch" {
29 type = list(string)
30}
Demo
An easy way to test this is to use the pause container to force autoscaling. Adjust the CPU request depending on your instance types.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
spec:
  replicas: 0
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      terminationGracePeriodSeconds: 0
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
          resources:
            requests:
              cpu: 1
kubectl apply -f ./test.yaml
kubectl scale deployment inflate --replicas 5
You should see additional nodes being created fairly quickly to schedule the new pods. If you don’t see any activity, you can view the Karpenter controller logs with this command:
kubectl logs -f -n karpenter -l app.kubernetes.io/name=karpenter -c controller