Self-hosted GitHub Actions runners on Kubernetes using Karpenter and ARC

In this post, I’m going to show how to set up self-hosted GitHub Actions runners on Kubernetes using the Actions Runner Controller (ARC) and Karpenter, all managed with Terraform. Using Karpenter is not a requirement; however, it makes life easier when you use multiple instance types and a mix of spot and on-demand capacity. This post won’t get into the details of setting up Karpenter since that is done here.

This first section is an example Terraform module for GHA ARC on EKS. As of this writing, there is a bug in version ‘0.9.1’ that kills a new runner node before it’s fully ready, so avoid that release for now.

One important thing to keep in mind when using ARC is that the runner image is not the same as GitHub-hosted runners. You will find you need to install dependencies that you didn’t need before (e.g. python, maven, git, jq).
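
In practice that means a job usually needs an explicit install step before anything else. A minimal sketch of what the top of a job might look like, assuming the default Ubuntu-based actions-runner image (the package list is just illustrative):

    steps:
      # the ARC runner image ships with very little, so install the job's tooling up front
      - name: Install dependencies
        run: sudo apt-get update && sudo apt-get install -y git jq python3 maven
      - uses: actions/checkout@v4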

Module

main.tf

module "gha_arc" {
  source      = "../../../../modules/aws/eks_services/gha_arc"
  arc_version = "0.9.0"
}

providers.tf

provider "helm" {
  kubernetes {
    host                   = data.aws_eks_cluster.ops.endpoint
    cluster_ca_certificate = base64decode(data.aws_eks_cluster.ops.certificate_authority[0].data)
    exec {
      api_version = "client.authentication.k8s.io/v1beta1"
      # This requires the awscli to be installed locally where Terraform is executed
      args    = ["eks", "get-token", "--cluster-name", data.aws_eks_cluster.ops.name]
      command = "aws"
    }
  }
}
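
The module also creates a Kubernetes secret (shown further down), so a kubernetes provider is needed as well. The same exec-based authentication works there; a sketch, assuming the same ops cluster data source:

provider "kubernetes" {
  host                   = data.aws_eks_cluster.ops.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.ops.certificate_authority[0].data)
  exec {
    api_version = "client.authentication.k8s.io/v1beta1"
    # Same as above: requires the awscli wherever Terraform runs
    args    = ["eks", "get-token", "--cluster-name", data.aws_eks_cluster.ops.name]
    command = "aws"
  }
}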

Module files

ARC consists of two Helm charts: one for the controller and one for the runner scale sets. The controller needs to be installed first.

main.tf

This is the controller service, which is installed into the “arc-systems” namespace.

The values are optional, but I’ll explain the reasoning behind them. I use Grafana, so I turned on metrics and added an annotation to scrape the metrics it exposes. I also set a node affinity to avoid the controller pod being placed on a node provisioned by Karpenter.

resource "helm_release" "arc_systems" {
  namespace        = "arc-systems"
  create_namespace = true
  reuse_values     = false
  name             = "actions-runner-controller"
  repository       = "oci://ghcr.io/actions/actions-runner-controller-charts"
  chart            = "gha-runner-scale-set-controller"
  version          = var.arc_version
  values = [
    <<-EOT
    podAnnotations:
      k8s.grafana.com/scrape: "true"
    metrics:
      controllerManagerAddr: ":8080"
      listenerAddr: ":8080"
      listenerEndpoint: "/metrics"
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: role
              operator: In
              values:
              - core
    EOT
  ]
}

For my GHA workflows, I have jobs that require different core counts, so there are two runner scale sets here, each requesting a specific core size.

Here is a general workload runner sized for two cores. I found that if I request exactly two cores, Karpenter provisions a four-core instance because of the daemon pods on the node, so I request “1.5” to land on a two-core instance type.

Notes on the values settings:

  • telling Karpenter to do-not-disrupt the node for any reason, such as bin-packing, so a running job doesn’t get killed.
  • setting “dind” mode for the public actions that use Dockerfiles.
  • since the ARC runner image is bare minimum (not the same as a GitHub-hosted runner), I’ve added a workaround to install “git” immediately on startup. You could potentially use a custom runner image; however, I didn’t have any luck with that. Installing “git” at a minimum lets the “actions/checkout” step use git to check out the repo instead of downloading an archive of it, which is what happens when git isn’t installed.
  • installing the listener pods on the core nodes rather than on temporary nodes provisioned by Karpenter
  • telling the runner pod to run on my “general” node type in the Karpenter pool
  • NOTE: update the githubConfigUrl setting to your organization
  • NOTE: githubConfigSecret in this case uses a custom GitHub App to generate the temporary token instead of a Personal Access Token. Directions to create an app for this purpose are located here

resource "helm_release" "arc_runners" {
  namespace        = "arc-runners"
  create_namespace = true
  reuse_values     = false
  name             = "arc-runner-set"
  repository       = "oci://ghcr.io/actions/actions-runner-controller-charts"
  chart            = "gha-runner-scale-set"
  version          = var.arc_version

  values = [
    <<-EOT
    githubConfigUrl: "https://github.com/<org>"
    githubConfigSecret: arc-app
    runnerGroup: k8s
    runnerScaleSetName: k8s-runner
    minRunners: 0
    maxRunners: 100
    containerMode:
      type: "dind"
    template:
      metadata:
        annotations:
          karpenter.sh/do-not-disrupt: "true"
      spec:
        containers:
          - name: runner
            image: ghcr.io/actions/actions-runner:latest
            command: ["/bin/bash","-c","sudo apt-get update && sudo apt-get install git -y && /home/runner/run.sh"]
            resources:
              requests:
                cpu: "1.5"
        nodeSelector:
          node-type: general
    listenerTemplate:
      spec:
        containers:
          - name: listener
            securityContext:
              runAsUser: 1000
        nodeSelector:
          role: core
    EOT
  ]
  depends_on = [
    helm_release.arc_systems
  ]
}
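
Workflows pick a scale set by putting its runnerScaleSetName in runs-on. A minimal sketch of a job targeting this general runner set (the build step is just a placeholder):

name: build
on: [push]

jobs:
  build:
    runs-on: k8s-runner
    steps:
      - uses: actions/checkout@v4
      - run: make build

Jobs that need more cores would use runs-on: k8s-runner-8-core from the performance scale set below.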

Larger core count runner scale set:

resource "helm_release" "arc_runners_performance" {
  namespace        = "arc-runners"
  create_namespace = true
  reuse_values     = false
  name             = "arc-runner-set-performance"
  repository       = "oci://ghcr.io/actions/actions-runner-controller-charts"
  chart            = "gha-runner-scale-set"
  version          = var.arc_version

  values = [
    <<-EOT
    githubConfigUrl: "https://github.com/<org>"
    githubConfigSecret: arc-app
    runnerGroup: k8s
    runnerScaleSetName: k8s-runner-8-core
    minRunners: 0
    maxRunners: 100
    template:
      metadata:
        annotations:
          karpenter.sh/do-not-disrupt: "true"
      spec:
        containers:
          - name: runner
            image: ghcr.io/actions/actions-runner:latest
            command: ["/bin/bash","-c","sudo apt-get update && sudo apt-get install git -y && /home/runner/run.sh"]
            resources:
              requests:
                cpu: "7.5"
        nodeSelector:
          node-type: performance
    listenerTemplate:
      spec:
        containers:
          - name: listener
            securityContext:
              runAsUser: 1000
        nodeSelector:
          role: core
    EOT
  ]
  depends_on = [
    helm_release.arc_systems
  ]
}

This is the secret referenced above by the “githubConfigSecret” value in the runner scale sets.

resource "kubernetes_secret" "arc_app" {
  metadata {
    name      = "arc-app"
    namespace = "arc-runners"
  }

  data = {
    github_app_id              = data.aws_ssm_parameter.github_app_id.value
    github_app_installation_id = data.aws_ssm_parameter.github_app_install_id.value
    github_app_private_key     = data.aws_ssm_parameter.github_app_private_key.value
  }

  type                           = "Opaque"
  wait_for_service_account_token = false
}

variables.tf

variable "arc_version" {
  type = string
}

Using Parameter Store to store the values for the GitHub App secret.

data.tf

data "aws_ssm_parameter" "github_app_id" {
  name = "/eks/gha_arc/app_id"
}

data "aws_ssm_parameter" "github_app_install_id" {
  name = "/eks/gha_arc/app_install_id"
}

data "aws_ssm_parameter" "github_app_private_key" {
  name = "/eks/gha_arc/app_private_key"
}
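
These data sources assume the parameters already exist. A sketch of creating them in a separate bootstrap stack (the variable names and key file name are assumptions, and note the values will end up in that stack’s state):

resource "aws_ssm_parameter" "github_app_id" {
  name  = "/eks/gha_arc/app_id"
  type  = "SecureString"
  value = var.github_app_id
}

resource "aws_ssm_parameter" "github_app_install_id" {
  name  = "/eks/gha_arc/app_install_id"
  type  = "SecureString"
  value = var.github_app_install_id
}

resource "aws_ssm_parameter" "github_app_private_key" {
  name  = "/eks/gha_arc/app_private_key"
  type  = "SecureString"
  # private key downloaded when creating the GitHub App
  value = file("arc-app.private-key.pem")
}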

Karpenter

Here’s a quick rundown of what the Karpenter module looks like and the two node pools. You can set instance families instead of specific node types, but I like to be specific here.

module "karpenter" {
  source                 = "../../../../modules/aws/eks_services/karpenter"
  env                    = local.env
  region                 = local.region
  karpenter_version      = "0.36.0"
  cluster_name           = data.aws_eks_cluster.ops.name
  cluster_endpoint       = data.aws_eks_cluster.ops.endpoint
  irsa_oidc_provider_arn = data.terraform_remote_state.eks.outputs.oidc_provider_arn
  eks_node_role_arn      = data.aws_iam_role.node_ops.arn
  general_node_types = [
    "c5.large",
    "c5a.large",
    "c6i.large",
    "c6a.large",
    "c7a.large"
  ]
  general_node_capacity_types = ["on-demand", "spot"]
  perf_node_types = [
    "c5.2xlarge",
    "c5a.2xlarge",
    "c6i.2xlarge",
    "c6a.2xlarge",
    "c7a.2xlarge"
  ]
  perf_node_capacity_types = ["on-demand", "spot"]
  node_arch                = ["amd64"]
  node_volume_size         = 50
}

One important setting to highlight here is the disruption block. The default consolidation policy is “WhenUnderutilized”, which is meant for stateless workloads that can tolerate being moved while Karpenter bin-packs the nodes. We don’t want that for GHA workflow jobs, which would be killed mid-run, so these pools only consolidate nodes once they are empty.

General node pool (2 cores)

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: general
spec:
  template:
    metadata:
      labels:
        node-type: general
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ${INSTANCE_ARCH}
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ${GENERAL_CAPACITY_TYPES}
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ${GENERAL_NODE_TYPES}
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["2"]
      nodeClassRef:
        name: bottlerocket
      kubelet:
        maxPods: 110
  limits:
    cpu: 100
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30s
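
The ${...} placeholders come from the module variables. One way to render and apply this manifest from Terraform is templatefile() with the gavinbunney/kubectl provider; a sketch, assuming the manifest is saved as nodepool_general.yaml.tpl inside the module (the resource and file names are assumptions):

resource "kubectl_manifest" "nodepool_general" {
  # jsonencode() turns the Terraform lists into YAML flow sequences like ["on-demand", "spot"]
  yaml_body = templatefile("${path.module}/nodepool_general.yaml.tpl", {
    INSTANCE_ARCH          = jsonencode(var.node_arch)
    GENERAL_CAPACITY_TYPES = jsonencode(var.general_node_capacity_types)
    GENERAL_NODE_TYPES     = jsonencode(var.general_node_types)
  })
}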

Larger 8 core node pool

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: performance
spec:
  template:
    metadata:
      labels:
        node-type: performance
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ${INSTANCE_ARCH}
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ${PERF_CAPACITY_TYPES}
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ${PERF_NODE_TYPES}
        - key: karpenter.k8s.aws/instance-cpu
          operator: In
          values: ["8"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["2"]
      nodeClassRef:
        name: bottlerocket
      kubelet:
        maxPods: 110
  limits:
    cpu: 200
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30s
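
Both pools reference a nodeClassRef named bottlerocket, which isn’t shown above. A rough sketch of what that EC2NodeClass could look like (the discovery tags and node role are assumptions specific to your cluster; the data volume corresponds to node_volume_size):

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: bottlerocket
spec:
  amiFamily: Bottlerocket
  role: "<your EKS node role name>"
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "<your cluster name>"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "<your cluster name>"
  blockDeviceMappings:
    # Bottlerocket keeps images and container storage on a second data volume
    - deviceName: /dev/xvdb
      ebs:
        volumeSize: 50Gi
        volumeType: gp3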