Self-hosted GitHub Actions runners on Kubernetes using Karpenter and ARC
In this post, I’m going to show how to set up self-hosted GitHub Actions runners on Kubernetes using the Actions Runner Controller (ARC) and Karpenter with Terraform. Using Karpenter is not a requirement; however, it makes life easier when using multiple instance types and a mix of spot and on-demand capacity. This post won’t get into the details of setting up Karpenter since that is covered here.
This first section is an example Terraform module for GHA ARC on EKS. As of this writing, there is a bug in version ‘0.9.1’ that will kill a new runner node before it’s fully ready, so avoid that release for now.
One important thing to keep in mind when using ARC is that the runner image is not the same as the GitHub-hosted runners. You will find that you need to install dependencies you didn’t need before (e.g., python, maven, git, jq).
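For example, a job that ran fine on GitHub-hosted runners may need an explicit install step on these runners. A minimal sketch of such a workflow step, assuming the Ubuntu-based ARC runner image with passwordless sudo (the package list is illustrative):

steps:
  - name: Install tools the hosted runners include by default
    run: |
      sudo apt-get update
      sudo apt-get install -y git jq python3 maven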
Module
main.tf
1module "gha_arc" {
2 source = "../../../../modules/aws/eks_services/gha_arc"
3 arc_version = "0.9.0"
4}
providers.tf
1provider "helm" {
2 kubernetes {
3 host = data.aws_eks_cluster.ops.endpoint
4 cluster_ca_certificate = base64decode(data.aws_eks_cluster.ops.certificate_authority[0].data)
5 exec {
6 api_version = "client.authentication.k8s.io/v1beta1"
7 # This requires the awscli to be installed locally where Terraform is executed
8 args = ["eks", "get-token", "--cluster-name", data.aws_eks_cluster.ops.name]
9 command = "aws"
10 }
11 }
12}
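The provider above references a data.aws_eks_cluster.ops data source that isn’t shown. A minimal sketch, assuming the cluster name (here “ops”) is supplied elsewhere:

data "aws_eks_cluster" "ops" {
  name = "ops"
}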
Module files
ARC consists of two Helm charts: one for the controller and one for the runner scale sets. First, we need to install the controller.
main.tf
This is the controller service, which is installed into the “arc-systems” namespace.
The values are optional, but I’ll explain the reasoning behind them. I use Grafana, so I turned on metrics and added an annotation to scrape the metrics endpoint. I also set a node affinity to keep the controller pod off nodes provisioned by Karpenter.
1resource "helm_release" "arc_systems" {
2 namespace = "arc-systems"
3 create_namespace = true
4 reuse_values = false
5 name = "actions-runner-controller"
6 repository = "oci://ghcr.io/actions/actions-runner-controller-charts"
7 chart = "gha-runner-scale-set-controller"
8 version = var.arc_version
9 values = [
10 <<-EOT
11 podAnnotations:
12 k8s.grafana.com/scrape: "true"
13 metrics:
14 controllerManagerAddr: ":8080"
15 listenerAddr: ":8080"
16 listenerEndpoint: "/metrics"
17 affinity:
18 nodeAffinity:
19 requiredDuringSchedulingIgnoredDuringExecution:
20 nodeSelectorTerms:
21 - matchExpressions:
22 - key: role
23 operator: In
24 values:
25 - core
26 EOT
27 ]
28}
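Once this is applied, the controller pod should come up in the “arc-systems” namespace; a quick sanity check, assuming kubectl is pointed at the cluster:

kubectl get pods -n arc-systems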
For my GHA workflows, I have jobs that require different core counts, so there are two runner scale sets here, each requesting a specific core size.
Here is a general workload runner with two cores. I found that if I requested exactly two cores, Karpenter would provision a four-core instance because of the daemon pods on the node, so I set the request to “1.5” to land on a two-core instance type.
On the values settings:
- telling Karpenter not to disrupt the node (via the karpenter.sh/do-not-disrupt annotation) during events such as consolidation/bin-packing, so as not to kill a running job.
- setting “dind” mode for the public actions that use Dockerfiles.
- since the ARC runner image is bare minimum (not the same as a GitHub-hosted runner), I’ve set a workaround to install “git” immediately on startup. You could potentially use a custom runner image; however, I didn’t have any luck with that. Installing “git” at a minimum allows the “actions/checkout” step to use git to check out the repo instead of downloading an archive of it.
- installing the listener pods on the core nodes and not the temporary nodes set up by Karpenter
- telling the runner pod to run on my “general” node type in the Karpenter pool
- NOTE: update the githubConfigUrl setting to your organization
- NOTE: githubConfigSecret in this case uses a custom GitHub App to generate the temporary token instead of a Personal Access Token. Directions to create an app for this purpose are located here
1resource "helm_release" "arc_runners" {
2 namespace = "arc-runners"
3 create_namespace = true
4 reuse_values = false
5 name = "arc-runner-set"
6 repository = "oci://ghcr.io/actions/actions-runner-controller-charts"
7 chart = "gha-runner-scale-set"
8 version = var.arc_version
9
10 values = [
11 <<-EOT
12 githubConfigUrl: "https://github.com/<org>"
13 githubConfigSecret: arc-app
14 runnerGroup: k8s
15 runnerScaleSetName: k8s-runner
16 minRunners: 0
17 maxRunners: 100
18 containerMode:
19 type: "dind"
20 template:
21 spec:
22 metadata:
23 annotations:
24 karpenter.sh/do-not-disrupt: "true"
25 containers:
26 - name: runner
27 image: ghcr.io/actions/actions-runner:latest
28 command: ["/bin/bash","-c","sudo apt-get update && sudo apt-get install git -y && /home/runner/run.sh"]
29 resources:
30 requests:
31 cpu: "1.5"
32 nodeSelector:
33 node-type: general
34 listenerTemplate:
35 spec:
36 containers:
37 - name: listener
38 securityContext:
39 runAsUser: 1000
40 nodeSelector:
41 role: core
42 EOT
43 ]
44 depends_on = [
45 helm_release.arc_systems
46 ]
47}
Larger core count runner scale set:
1resource "helm_release" "arc_runners_performance" {
2 namespace = "arc-runners"
3 create_namespace = true
4 reuse_values = false
5 name = "arc-runner-set-performance"
6 repository = "oci://ghcr.io/actions/actions-runner-controller-charts"
7 chart = "gha-runner-scale-set"
8 version = var.arc_version
9
10 values = [
11 <<-EOT
12 githubConfigUrl: "https://github.com/<org>"
13 githubConfigSecret: arc-app
14 runnerGroup: k8s
15 runnerScaleSetName: k8s-runner-8-core
16 minRunners: 0
17 maxRunners: 100
18 template:
19 spec:
20 metadata:
21 annotations:
22 karpenter.sh/do-not-disrupt: "true"
23 containers:
24 - name: runner
25 image: ghcr.io/actions/actions-runner:latest
26 command: ["/bin/bash","-c","sudo apt-get update && sudo apt-get install git -y && /home/runner/run.sh"]
27 resources:
28 requests:
29 cpu: "7.5"
30 nodeSelector:
31 node-type: performance
32 listenerTemplate:
33 spec:
34 containers:
35 - name: listener
36 securityContext:
37 runAsUser: 1000
38 nodeSelector:
39 role: core
40 EOT
41 ]
42 depends_on = [
43 helm_release.arc_systems
44 ]
45}
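With both scale sets installed, workflow jobs target them via runs-on using the runnerScaleSetName values defined above. A hedged sketch of a workflow (the job names and build commands are illustrative):

jobs:
  build:
    runs-on: k8s-runner
    steps:
      - uses: actions/checkout@v4
      - run: make build
  integration-tests:
    runs-on: k8s-runner-8-core
    steps:
      - uses: actions/checkout@v4
      - run: make integration-tests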
This is the secret referenced above by the “githubConfigSecret” value in the runner scale sets.
1resource "kubernetes_secret" "arc_app" {
2 metadata {
3 name = "arc-app"
4 namespace = "arc-runners"
5 }
6
7 data = {
8 github_app_id = data.aws_ssm_parameter.github_app_id.value
9 github_app_installation_id = data.aws_ssm_parameter.github_app_install_id.value
10 github_app_private_key = data.aws_ssm_parameter.github_app_private_key.value
11 }
12
13 type = "Opaque"
14 wait_for_service_account_token = false
15}
variables.tf
1variable "arc_version" {
2 type = string
3}
Using Parameter Store to store the values for the GitHub App secret.
data.tf
1data "aws_ssm_parameter" "github_app_id" {
2 name = "/eks/gha_arc/app_id"
3}
4
5data "aws_ssm_parameter" "github_app_install_id" {
6 name = "/eks/gha_arc/app_install_id"
7}
8
9data "aws_ssm_parameter" "github_app_private_key" {
10 name = "/eks/gha_arc/app_private_key"
11}
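The parameters themselves are created outside this module. A hedged example of seeding them with the AWS CLI (the values and key file name are placeholders):

aws ssm put-parameter --name /eks/gha_arc/app_id --type SecureString --value "<github app id>"
aws ssm put-parameter --name /eks/gha_arc/app_install_id --type SecureString --value "<installation id>"
aws ssm put-parameter --name /eks/gha_arc/app_private_key --type SecureString --value file://arc-app.private-key.pem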
Karpenter
Here’s a quick rundown of what the Karpenter module looks like and the two node pools. You can set instance families instead of specific node types, but I like to be specific here (a sketch of that variant follows the module block below).
1module "karpenter" {
2 source = "../../../../modules/aws/eks_services/karpenter"
3 env = local.env
4 region = local.region
5 karpenter_version = "0.36.0"
6 cluster_name = data.aws_eks_cluster.ops.name
7 cluster_endpoint = data.aws_eks_cluster.ops.endpoint
8 irsa_oidc_provider_arn = data.terraform_remote_state.eks.outputs.oidc_provider_arn
9 eks_node_role_arn = data.aws_iam_role.node_ops.arn
10 general_node_types = [
11 "c5.large",
12 "c5a.large",
13 "c6.large",
14 "c6a.large",
15 "c7a.large"
16 ]
17 general_node_capacity_types = ["on-demand", "spot"]
18 perf_node_types = [
19 "c5.2xlarge",
20 "c5a.2xlarge",
21 "c6.2xlarge",
22 "c6a.2xlarge",
23 "c7a.2xlarge"
24 ]
25 perf_node_capacity_types = ["on-demand", "spot"]
26 node_arch = ["amd64"]
27 node_volume_size = 50
28}
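If you do prefer instance families over exact sizes, the node pool requirements could use Karpenter’s instance-family and instance-cpu labels instead of instance-type. A minimal sketch (the family list is an example):

- key: karpenter.k8s.aws/instance-family
  operator: In
  values: ["c5", "c5a", "c6a", "c7a"]
- key: karpenter.k8s.aws/instance-cpu
  operator: In
  values: ["2"]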
One important setting to highlight here is the disruption setting. The default consolidation policy is “WhenUnderutilized”, which is geared toward stateless workloads that can be disrupted while Karpenter bin-packs nodes. We don’t want GHA workflow jobs killed mid-run, so these pools use “WhenEmpty” instead.
General node pool (2 cores)
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: general
spec:
  template:
    metadata:
      labels:
        node-type: general
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ${INSTANCE_ARCH}
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ${GENERAL_CAPACITY_TYPES}
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ${GENERAL_NODE_TYPES}
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["2"]
      nodeClassRef:
        name: bottlerocket
      kubelet:
        maxPods: 110
  limits:
    cpu: 100
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30s
Larger 8-core node pool
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: performance
spec:
  template:
    metadata:
      labels:
        node-type: performance
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ${INSTANCE_ARCH}
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ${PERF_CAPACITY_TYPES}
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ${PERF_NODE_TYPES}
        - key: karpenter.k8s.aws/instance-cpu
          operator: In
          values: ["8"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["2"]
      nodeClassRef:
        name: bottlerocket
      kubelet:
        maxPods: 110
  limits:
    cpu: 200
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30s
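Both node pools reference a “bottlerocket” node class that isn’t shown here. A minimal sketch of what it could look like, assuming Bottlerocket AMIs and the usual karpenter.sh/discovery tags on subnets and security groups (the role and tag values are placeholders):

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: bottlerocket
spec:
  amiFamily: Bottlerocket
  role: ${NODE_ROLE_NAME}
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: ${CLUSTER_NAME}
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: ${CLUSTER_NAME}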