GitLab Runner Scale From 0 on EKS
If you’ve set up a project in GitLab you are probably familiar with the `gitlab-runner`. It is used by GitLab to run all the pipeline jobs. If you have a self-hosted environment you’ve definitely already set one up yourself, but if you’re using the Cloud version, then it has been done for you. There are still times when you’ll want to set up your own runner, even when working with GitLab in the cloud. We found that because we have a mono-repo it can take a long time to build on the shared runners, and it also consumes a lot of our monthly CI minutes.
Since we already have a Kubernetes environment, setting up a runner in Kubernetes was a no-brainer. With some tweaking I was able to get the runner to use AWS Spot Instances and an S3 cache. This means we can run a single build on a large instance type and save a lot of money.
The guide below details how to configure a Kubernetes executor, S3 Cache, and AWS Spot Instances.
Configuring EKS for Spot Instances⌗
To make use of Spot instances during a build pipeline we first need to create an autoscaling group (ASG) that is made up of only spot instances. The ASG needs to be “tainted” to avoid it being used by anything other than build jobs. There are also additional labels and tags that need to be added so that the `cluster-autoscaler` can scale the node group up from 0 instances. If the tags are not present, the `cluster-autoscaler` will fail to find a node group matching the affinity defined.
Creating a managed node group⌗
To set up a managed node group using `eksctl`, you can create a YAML file similar to the config snippet below. The node group defined here is composed of EC2 instance types that provide 8 vCPUs and either 16 or 32 GiB of RAM. Most of our builds are CPU intensive, so the vCPUs are more important. The ordering of the instance types should make `c5a.2xlarge` the preferred build server, but any from the list can be selected depending on availability. The `instanceTypes` need to match what is available in your region. I have seen that the `a` CPU type (AMD) is not available in every region; if it is available you can save a few cents more by preferring `a` over Intel.
```yaml
managedNodeGroups:
  - name: build-8vcpu-16gb-32gb-1b
    spot: true
    tags:
      k8s.io/cluster-autoscaler/node-template/label/nodegroup-type: stateless
      k8s.io/cluster-autoscaler/node-template/label/instance-type: spot
      k8s.io/cluster-autoscaler/node-template/label/intent: build
      k8s.io/cluster-autoscaler/node-template/taint/buildInstance: "true:NoSchedule"
    labels:
      nodegroup-type: stateless
      instance-type: spot
      intent: build
    taints:
      - key: buildInstance
        value: "true"
        effect: NoSchedule
    instanceTypes:
      - c5a.2xlarge
      - t3.2xlarge
      - c5.2xlarge
      - c5ad.2xlarge
      - m5.2xlarge
      - c5d.2xlarge
    desiredCapacity: 0
    minSize: 0
    maxSize: 5
    availabilityZones: ["af-south-1b"]
    iam:
      withAddonPolicies:
        certManager: true
        autoScaler: true
        externalDNS: true
    ssh: # use existing EC2 key
      publicKeyName: your-ssh-key
```
Apply the configuration to your cluster:

```shell
eksctl create nodegroup --config-file=./eks-cluster.yaml --include=build-8vcpu-16gb-32gb-1b
```
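As a quick sanity check after creation (assuming a hypothetical cluster name of `my-cluster`; no nodes will be listed yet because the group scales from 0):

```shell
# The node group should be listed with desired capacity 0
eksctl get nodegroup --cluster my-cluster

# Nothing is expected here until a pending build pod triggers a scale-up
kubectl get nodes -l intent=build
```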
Autoscaling group tags⌗
Once the node group has been set up, it is vitally important that you add the tags that will allow scale up from 0. During node group creation `eksctl` adds the tags to the EKS configuration, but they are missing from the ASG.
From the AWS Console:

- Switch to EC2
- At the bottom of the left hand column select Auto Scaling Groups
- Select the autoscaling group you have just created; the name will match the `eksctl` managed node group name
- Scroll down to Tags and add the following tags:

```
k8s.io/cluster-autoscaler/node-template/label/instance-type: spot
k8s.io/cluster-autoscaler/node-template/label/intent: build
k8s.io/cluster-autoscaler/node-template/label/nodegroup-type: stateless
k8s.io/cluster-autoscaler/node-template/taint/buildInstance: true:NoSchedule
```
These are the same tags as in the `eksctl` config and will allow `cluster-autoscaler` to scale up from 0 instances.
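If you prefer the CLI to the console, the same tags can be applied with `aws autoscaling create-or-update-tags` (here `your-asg-name` is a placeholder for the ASG that `eksctl` created):

```shell
# One call per tag; shown once here, repeat for the other three
# (nodegroup-type, instance-type and the buildInstance taint)
aws autoscaling create-or-update-tags --tags \
  "ResourceId=your-asg-name,ResourceType=auto-scaling-group,PropagateAtLaunch=true,Key=k8s.io/cluster-autoscaler/node-template/label/intent,Value=build"
```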
S3 Cache Bucket⌗
The `gitlab-runner` can be set up to use an S3 bucket for its cache. This is the preferred method when using a Kubernetes executor with spot instances, because each build job will run on a new instance, so ephemeral storage with a cache volume is not an option.
Bucket Details⌗
Create a bucket in your preferred region using all the standard bucket creation options: the bucket should not be public, and no additional special configuration is applied.
The `gitlab-runner` will use the bucket for build pipeline caches, and each cache object can be hundreds of megabytes. Depending on how you have configured your GitLab CI job, a build pipeline may run for every merge request and every commit to a release branch, which means this cache will build up indefinitely if it is never cleared. To prevent wasted expense, the cache should be expired after a certain number of days.

When a new job is started, the runner will look for the cache, and if it cannot find it, it will simply ignore it and carry on with the build. As long as you don’t depend on the cache being there every time, there is no impact on the actual build process. Branches that execute frequently will be unaffected because they are constantly updating their cache.
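For context, what ends up in the bucket is whatever a job declares under `cache:` in its `.gitlab-ci.yml`. A minimal, hypothetical example (the image and paths are illustrative, not from our pipeline):

```yaml
build:
  image: node:18
  cache:
    # One cache object per branch; zipped and uploaded to S3 after the job,
    # restored (if present) before the next job with the same key
    key: "$CI_COMMIT_REF_SLUG"
    paths:
      - node_modules/
  script:
    - npm ci
    - npm run build
```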
To apply a retention policy, open the AWS Console:

- Switch to S3
- Select your bucket from the list
- Select the Management tab
- Click on Create lifecycle rule
- Input a suitable name, for example `delete_files_older_than_60_days`
- Enter a prefix to limit the scope to `project/`. This may be unnecessary since the GitLab runner only uses the `project` folder in the bucket.
- Under Lifecycle rule actions select Expire current versions of objects
- For Days after object creation input a value in days, for example `60`
- Click on Create rule
The above configuration will delete all cache objects older than 60 days. This will prevent the bucket filling up with large cache objects that will never be used again.
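The same rule can be created from the CLI with `aws s3api put-bucket-lifecycle-configuration` (the bucket name is a placeholder):

```shell
aws s3api put-bucket-lifecycle-configuration \
  --bucket your-cache-bucket \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "delete_files_older_than_60_days",
      "Status": "Enabled",
      "Filter": { "Prefix": "project/" },
      "Expiration": { "Days": 60 }
    }]
  }'
```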
IAM User⌗
The `gitlab-runner` will need full access to the cache bucket. Create a policy that provides access to the bucket. The policy below is probably overkill, but you can use it and just modify it for your bucket name:
```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketTagging",
                "s3:DeleteObjectVersion",
                "s3:GetObjectVersionTagging",
                "s3:ListBucketVersions",
                "s3:GetBucketLogging",
                "s3:RestoreObject",
                "s3:CreateBucket",
                "s3:ListBucket",
                "s3:GetObjectVersionAttributes",
                "s3:GetBucketPolicy",
                "s3:ReplicateObject",
                "s3:PutEncryptionConfiguration",
                "s3:GetObjectAcl",
                "s3:GetBucketObjectLockConfiguration",
                "s3:AbortMultipartUpload",
                "s3:PutBucketTagging",
                "s3:GetObjectVersionAcl",
                "s3:GetObjectTagging",
                "s3:GetBucketOwnershipControls",
                "s3:PutObjectTagging",
                "s3:DeleteObject",
                "s3:PutBucketVersioning",
                "s3:DeleteObjectTagging",
                "s3:GetBucketPublicAccessBlock",
                "s3:GetBucketPolicyStatus",
                "s3:ListBucketMultipartUploads",
                "s3:GetObjectRetention",
                "s3:GetBucketWebsite",
                "s3:GetObjectAttributes",
                "s3:PutObjectLegalHold",
                "s3:GetBucketVersioning",
                "s3:PutBucketCORS",
                "s3:GetBucketAcl",
                "s3:GetObjectLegalHold",
                "s3:GetBucketNotification",
                "s3:ListMultipartUploadParts",
                "s3:PutObject",
                "s3:GetObject",
                "s3:PutBucketNotification",
                "s3:PutBucketWebsite",
                "s3:PutObjectRetention",
                "s3:PutBucketLogging",
                "s3:GetBucketCORS",
                "s3:PutBucketObjectLockConfiguration",
                "s3:GetBucketLocation",
                "s3:ReplicateDelete",
                "s3:GetObjectVersion"
            ],
            "Resource": [
                "arn:aws:s3:::<your-cache-bucket>/*",
                "arn:aws:s3:::<your-cache-bucket>"
            ]
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": "s3:ListAllMyBuckets",
            "Resource": "*"
        }
    ]
}
```
Now create a user and attach only the newly created S3 runner cache policy. Save the AWS Access Key ID and Secret Access Key; these need to be saved as a secret in Kubernetes.
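The console steps sketched as CLI calls (the user name and policy ARN are hypothetical; `create-access-key` prints the key pair you need for the Kubernetes secret):

```shell
aws iam create-user --user-name gitlab-runner-cache
aws iam attach-user-policy --user-name gitlab-runner-cache \
  --policy-arn "arn:aws:iam::123456789012:policy/S3RunnerCachePolicy"
aws iam create-access-key --user-name gitlab-runner-cache
```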
Gitlab Runner Deployment⌗
Environment Setup⌗
To prepare for the `gitlab-runner` you first need to create the namespace and the S3 cache secret:

```shell
kubectl create namespace gitlab-runner
```
Then create the secret using the AWS access key and secret you created earlier:

```shell
kubectl create secret generic gitlab-runner-s3-access \
  --from-literal=accesskey="AKI..." \
  --from-literal=secretkey="Ax..." \
  -n gitlab-runner
```
Helm Deploy⌗
The simplest way to deploy the runner is to make use of the official Helm chart.

```shell
helm repo add gitlab https://charts.gitlab.io
```

Then create the `values.yaml` file that will be used. For this deployment there is a fair amount of configuration required to ensure the runners make use of the EC2 Spot Instances.
You will also need to get your `runnerRegistrationToken` from GitLab. The most useful runner type is a group runner. You can get the registration token for a new group runner at the root group in GitLab. Click the blue box in the top right and copy the Runner Registration Token; this will be inserted into the `values.yaml`.

The full `values.yaml` is as follows, and the finer details are explained below:
```yaml
imagePullSecrets:
  - name: your-pull-secrets
gitlabUrl: https://gitlab.com/
runnerRegistrationToken: XXXXX
unregisterRunners: true
metrics:
  enabled: true
  serviceMonitor:
    enabled: true
service:
  enabled: true
  name: gitlab-k8s-runner
podAnnotations:
  downscaler/exclude: "true"
runners:
  cache:
    secretName: gitlab-runner-s3-access
  privileged: true
  config: |
    [[runners]]
      environment = [
        "DOCKER_HOST=tcp://docker:2376",
        "DOCKER_TLS_CERTDIR=/certs",
        "DOCKER_TLS_VERIFY=1",
        "DOCKER_CERT_PATH=$DOCKER_TLS_CERTDIR/client"
      ]
      [runners.cache]
        Type = "s3"
        Shared = true
        [runners.cache.s3]
          ServerAddress = "s3.amazonaws.com"
          BucketName = "your-cache-bucket"
          BucketLocation = "eu-west-1"
          Insecure = false
          AuthenticationType = "access-key"
      [runners.kubernetes]
        namespace = "{{.Release.Namespace}}"
        image = "docker:19.03.12"
        privileged = true
        poll_timeout = 300
        poll_interval = 10
        cpu_request = "2"
        [[runners.kubernetes.volumes.empty_dir]]
          name = "certs"
          mount_path = "/certs/client"
          medium = "Memory"
        [runners.kubernetes.affinity]
          [runners.kubernetes.affinity.node_affinity]
            [[runners.kubernetes.affinity.node_affinity.preferred_during_scheduling_ignored_during_execution]]
              weight = 1
              [runners.kubernetes.affinity.node_affinity.preferred_during_scheduling_ignored_during_execution.preference]
                [[runners.kubernetes.affinity.node_affinity.preferred_during_scheduling_ignored_during_execution.preference.match_expressions]]
                  key = "eks.amazonaws.com/capacityType"
                  operator = "In"
                  values = ["SPOT"]
          [runners.kubernetes.affinity.node_affinity.required_during_scheduling_ignored_during_execution]
            [[runners.kubernetes.affinity.node_affinity.required_during_scheduling_ignored_during_execution.node_selector_terms]]
              [[runners.kubernetes.affinity.node_affinity.required_during_scheduling_ignored_during_execution.node_selector_terms.match_expressions]]
                key = "intent"
                operator = "In"
                values = ["build"]
        [runners.kubernetes.node_tolerations]
          "buildInstance=true" = "NoSchedule"
        [runners.kubernetes.pod_annotations]
          "downscaler/exclude" = "true"
rbac:
  create: true
```
If you are happy with the configuration you can apply it with:

```shell
helm upgrade --install --namespace gitlab-runner --create-namespace gitlab-runner -f values.yaml gitlab/gitlab-runner
```
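After the install you can check that the runner pod came up (the deployment name follows the Helm release name, `gitlab-runner` in the command above):

```shell
kubectl get pods -n gitlab-runner
kubectl logs -n gitlab-runner deployment/gitlab-runner | tail -n 20
```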
There are a few important points to note with the configuration above.
Docker-in-Docker Builds⌗
A lot of build pipelines make use of Docker-in-Docker (DinD) to build and push Docker images. Running DinD in Kubernetes (or even in Docker) requires a few special tweaks in the `gitlab-runner` configuration.

The `environment` field is used to set environment variables that are passed to each build. DinD builds require additional environment variables that are used by Docker when running the build.
```toml
environment = [
  "DOCKER_HOST=tcp://docker:2376",
  "DOCKER_TLS_CERTDIR=/certs",
  "DOCKER_TLS_VERIFY=1",
  "DOCKER_CERT_PATH=$DOCKER_TLS_CERTDIR/client"
]
```
When a build launches in Kubernetes it creates a pod with all the required services. The `DOCKER_HOST` is therefore accessible at `docker`, and we use port `2376` because TLS is enabled. The `DOCKER_TLS_CERTDIR` and `DOCKER_CERT_PATH` variables tell Docker where to find the certificates that are used between the client and daemon.
```toml
[runners.kubernetes]
  image = "docker:19.03.12"
  privileged = true
```
In the `runners.kubernetes` section we also need to set `privileged` to true, which will allow containers access to the Docker daemon during the build. The `image` is also set to an official `docker` image, which will be used by default if no image is specified in the `.gitlab-ci.yml`.
```toml
[[runners.kubernetes.volumes.empty_dir]]
  name = "certs"
  mount_path = "/certs/client"
  medium = "Memory"
```
This section is critical because it creates an in-memory volume that is shared between all the services in a pod. This allows the `docker` service to create the client certificates that will be used by the build container.
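With all of that in the runner config, a DinD job in `.gitlab-ci.yml` only needs the `docker:dind` service; a hypothetical example using GitLab’s predefined registry variables:

```yaml
docker-build:
  image: docker:19.03.12
  services:
    # Runs the Docker daemon as a sidecar in the build pod; the runner-level
    # env vars and the shared /certs volume wire the client up to it
    - docker:19.03.12-dind
  script:
    - docker login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" "$CI_REGISTRY"
    - docker build -t "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA" .
    - docker push "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"
```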
Node Affinity and Tolerations⌗
To make sure that build jobs are launched on Spot instances we need to create a node affinity in the `config.toml`:
```toml
[runners.kubernetes.affinity]
  [runners.kubernetes.affinity.node_affinity]
    [[runners.kubernetes.affinity.node_affinity.preferred_during_scheduling_ignored_during_execution]]
      weight = 1
      [runners.kubernetes.affinity.node_affinity.preferred_during_scheduling_ignored_during_execution.preference]
        [[runners.kubernetes.affinity.node_affinity.preferred_during_scheduling_ignored_during_execution.preference.match_expressions]]
          key = "eks.amazonaws.com/capacityType"
          operator = "In"
          values = ["SPOT"]
  [runners.kubernetes.affinity.node_affinity.required_during_scheduling_ignored_during_execution]
    [[runners.kubernetes.affinity.node_affinity.required_during_scheduling_ignored_during_execution.node_selector_terms]]
      [[runners.kubernetes.affinity.node_affinity.required_during_scheduling_ignored_during_execution.node_selector_terms.match_expressions]]
        key = "intent"
        operator = "In"
        values = ["build"]
```
The configuration above will instruct the Kubernetes scheduler to prefer instances labelled `eks.amazonaws.com/capacityType: SPOT` and require instances labelled `intent: build`.
```toml
[runners.kubernetes.node_tolerations]
  "buildInstance=true" = "NoSchedule"
```
Because we tainted the ASG we also need to add a toleration to make sure the build pods can tolerate the taint. This toleration will allow build pods to ignore the `NoSchedule` taint on the build nodes.
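For reference, the TOML above corresponds to roughly this toleration and affinity on each generated build pod (sketched by hand, not pulled from a live pod):

```yaml
tolerations:
  - key: buildInstance
    operator: Equal
    value: "true"
    effect: NoSchedule
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: intent
              operator: In
              values: ["build"]
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
            - key: eks.amazonaws.com/capacityType
              operator: In
              values: ["SPOT"]
```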
End Notes⌗
Spot instances are a great way to save money when using AWS, and the combination of spot and EKS is fantastic. With the above configuration you can access cheap, high-spec hardware for executing your builds.
The only drawback is the time it takes for a new node to be created. In the configuration above I have the `cpu_request` value for each build set to `2`. This means that if there are no nodes with 2 CPUs available, a new node needs to be created. This relies on the `cluster-autoscaler` expanding the ASG in reaction to a pod that is stuck in the `Pending` state. All of this happens reasonably fast, but it does add between 2 and 4 minutes to a build. If you have frequent builds this is less of an issue, because the builds will follow on from each other and reuse the existing nodes before `cluster-autoscaler` has a chance to remove them.
If I had time, I would look at setting up cluster overprovisioning to keep at least 1 node waiting for a build during office hours. Maybe that should be a future post…
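The usual approach, for anyone who wants to try, is a low-priority placeholder deployment: a pause pod that reserves a build node and is preempted the moment a real build pod arrives. A rough, untested sketch (all names are mine):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: build-overprovisioning
value: -10
description: "Placeholder pods that any real workload can preempt"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: build-overprovisioner
  namespace: gitlab-runner
spec:
  replicas: 1
  selector:
    matchLabels:
      app: build-overprovisioner
  template:
    metadata:
      labels:
        app: build-overprovisioner
    spec:
      priorityClassName: build-overprovisioning
      # Same toleration and node selection as the build pods, so the
      # placeholder keeps a spot build node warm
      tolerations:
        - key: buildInstance
          operator: Equal
          value: "true"
          effect: NoSchedule
      nodeSelector:
        intent: build
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "2"
```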