If you’ve set up a project in GitLab you are probably familiar with the gitlab-runner. It is used by GitLab to run all the pipeline jobs. If you have a self-hosted environment you’ve definitely already set one up yourself, but if you’re using the Cloud version, then it has been done for you. There are times when you’ll want to set up your own runner, even when working with GitLab in the cloud. We found that, because we have a mono-repo, builds can take a long time on the shared runners and also consume a lot of our monthly CI minutes.

Since we already have a Kubernetes environment, setting up a runner in Kubernetes was a no-brainer. With some tweaking I was able to get the runner to use AWS Spot Instances and the S3 cache. This means we can run a single build on a large instance type and save a lot of money.

The guide below details how to configure a Kubernetes executor, S3 Cache, and AWS Spot Instances.

Configuring EKS for Spot Instances

To make use of Spot Instances during a build pipeline we first need to create an autoscaling group that is made up only of Spot Instances. The ASG needs to be “tainted” to prevent it being used by anything other than build jobs. There are also additional labels and tags that need to be added so that the cluster-autoscaler can scale the node group up from 0 instances. If the tags are not present, the cluster-autoscaler will fail to find a node group matching the affinity defined.

Creating a managed node group

To set up a managed node group using eksctl, you can create a YAML file similar to the config snippet below. The node group defined here is composed of EC2 instance types that provide 8 vCPUs and either 16 or 32 GiB of RAM. Most of our builds are CPU intensive, so the CPU cores are more important. The ordering of the instance types should make c5a.2xlarge the preferred build server, but any type from the list can be selected depending on availability. The instanceTypes need to match what is available in your region. I have seen that the “a” (AMD) variants are not available in every region; where they are available you can save a few cents more by preferring AMD over Intel.

managedNodeGroups:
  - name: build-8vcpu-16gb-32gb-1b
    spot: true
    tags:
      k8s.io/cluster-autoscaler/node-template/label/nodegroup-type: stateless
      k8s.io/cluster-autoscaler/node-template/label/instance-type: spot
      k8s.io/cluster-autoscaler/node-template/label/intent: build
      k8s.io/cluster-autoscaler/node-template/taint/buildInstance: "true:NoSchedule"
    labels:
      nodegroup-type: stateless
      instance-type: spot
      intent: build
    taints:
      - key: buildInstance
        value: "true"
        effect: NoSchedule
    instanceTypes:
      - c5a.2xlarge
      - t3.2xlarge
      - c5.2xlarge
      - c5ad.2xlarge
      - m5.2xlarge
      - c5d.2xlarge
    desiredCapacity: 0
    minSize: 0
    maxSize: 5
    availabilityZones: ["af-south-1b"]
    iam:
      withAddonPolicies:
        certManager: true
        autoScaler: true
        externalDNS: true
    ssh: # use existing EC2 key
      publicKeyName: your-ssh-key

Apply the configuration to your cluster

eksctl create nodegroup --config-file=./eks-cluster.yaml --include=build-8vcpu-16gb-32gb-1b
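
You can confirm the node group was created with eksctl (the cluster name is a placeholder for your own):

eksctl get nodegroup --cluster=<your-cluster>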

Autoscaling group tags

Once the nodegroup has been set up, it is vitally important that you add the tags that allow scaling up from 0. eksctl adds these tags to the EKS configuration during nodegroup creation, but they are missing from the underlying ASG.

Read more about this in the Scaling from 0 with EKS and Spot Instances post

From the AWS Console

  1. Switch to EC2
  2. At the bottom of the left-hand column select Auto Scaling Groups
  3. Select the autoscaling group you have just created; the name will match the eksctl managed node group name
  4. Scroll down to Tags and add the following tags
    1. k8s.io/cluster-autoscaler/node-template/label/instance-type: spot
    2. k8s.io/cluster-autoscaler/node-template/label/intent: build
    3. k8s.io/cluster-autoscaler/node-template/label/nodegroup-type: stateless
    4. k8s.io/cluster-autoscaler/node-template/taint/buildInstance: true:NoSchedule

These are the same tags as in the eksctl config and will allow the cluster-autoscaler to scale up from 0 instances.
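
If you prefer the CLI, the same tags can be added with aws autoscaling create-or-update-tags. A sketch, using a placeholder ASG name (eksctl generates the real one, which you can look up in the console or via describe-auto-scaling-groups):

aws autoscaling create-or-update-tags --tags \
  "ResourceId=<your-asg-name>,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/node-template/label/instance-type,Value=spot,PropagateAtLaunch=true" \
  "ResourceId=<your-asg-name>,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/node-template/label/intent,Value=build,PropagateAtLaunch=true" \
  "ResourceId=<your-asg-name>,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/node-template/label/nodegroup-type,Value=stateless,PropagateAtLaunch=true" \
  "ResourceId=<your-asg-name>,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/node-template/taint/buildInstance,Value=true:NoSchedule,PropagateAtLaunch=true"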

S3 Cache Bucket

The gitlab-runner can be set up to use an S3 bucket for its cache. This is the preferred method when using a Kubernetes executor with Spot Instances, because each build job will run on a new instance, so ephemeral storage with a cache volume is not an option.

Bucket Details

Create a bucket in your preferred region using the standard bucket creation options: the bucket should not be public, and no additional special configuration is needed.
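
If you prefer the CLI over the console, a minimal sketch (bucket name and region are placeholders; LocationConstraint is required for any region other than us-east-1):

aws s3api create-bucket \
  --bucket <your-cache-bucket> \
  --region <your-region> \
  --create-bucket-configuration LocationConstraint=<your-region>

# Keep the bucket private
aws s3api put-public-access-block \
  --bucket <your-cache-bucket> \
  --public-access-block-configuration BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true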

The gitlab-runner will use the bucket for build pipeline caches, and each cache object can be hundreds of megabytes. Depending on how you have configured your GitLab CI job, a build pipeline may run for every merge request and every commit to a release branch, which means the cache will build up indefinitely if it is never cleared. To prevent wasted expense, the cache should be expired after a certain number of days.

When a new job starts, the runner will look for the cache; if it cannot find it, it will simply ignore it and carry on with the build. As long as you don’t depend on the cache being present every time, there is no impact on the actual build process. Branches that build frequently are unaffected because they are constantly refreshing their cache.

To apply a retention policy, open the AWS Console

  1. Switch to S3
  2. Select your bucket from the list
  3. Select the Management tab
  4. Click on Create lifecycle rule
    1. Input a suitable name, for example delete_files_older_than_60_days
    2. Enter a prefix to limit the scope to project/. This may be unnecessary since the GitLab runner only uses the project folder in the bucket.
    3. Under Lifecycle rule actions select Expire current versions of objects
    4. For Days after object creation, input a value in days, for example 60
  5. Click on Create rule

The above configuration will delete all cache objects older than 60 days. This will prevent the bucket from filling up with large cache objects that will never be used again.
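
The same rule can be created from the CLI with put-bucket-lifecycle-configuration; a sketch matching the console steps above:

aws s3api put-bucket-lifecycle-configuration \
  --bucket <your-cache-bucket> \
  --lifecycle-configuration '{
    "Rules": [
      {
        "ID": "delete_files_older_than_60_days",
        "Filter": { "Prefix": "project/" },
        "Status": "Enabled",
        "Expiration": { "Days": 60 }
      }
    ]
  }'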

IAM User

The gitlab-runner will need full access to the cache bucket. Create a policy that provides that access. The policy below is probably broader than strictly necessary, but you can use it as a starting point and just substitute your bucket name:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketTagging",
        "s3:DeleteObjectVersion",
        "s3:GetObjectVersionTagging",
        "s3:ListBucketVersions",
        "s3:GetBucketLogging",
        "s3:RestoreObject",
        "s3:CreateBucket",
        "s3:ListBucket",
        "s3:GetObjectVersionAttributes",
        "s3:GetBucketPolicy",
        "s3:ReplicateObject",
        "s3:PutEncryptionConfiguration",
        "s3:GetObjectAcl",
        "s3:GetBucketObjectLockConfiguration",
        "s3:AbortMultipartUpload",
        "s3:PutBucketTagging",
        "s3:GetObjectVersionAcl",
        "s3:GetObjectTagging",
        "s3:GetBucketOwnershipControls",
        "s3:PutObjectTagging",
        "s3:DeleteObject",
        "s3:PutBucketVersioning",
        "s3:DeleteObjectTagging",
        "s3:GetBucketPublicAccessBlock",
        "s3:GetBucketPolicyStatus",
        "s3:ListBucketMultipartUploads",
        "s3:GetObjectRetention",
        "s3:GetBucketWebsite",
        "s3:GetObjectAttributes",
        "s3:PutObjectLegalHold",
        "s3:GetBucketVersioning",
        "s3:PutBucketCORS",
        "s3:GetBucketAcl",
        "s3:GetObjectLegalHold",
        "s3:GetBucketNotification",
        "s3:ListMultipartUploadParts",
        "s3:PutObject",
        "s3:GetObject",
        "s3:PutBucketNotification",
        "s3:PutBucketWebsite",
        "s3:PutObjectRetention",
        "s3:PutBucketLogging",
        "s3:GetBucketCORS",
        "s3:PutBucketObjectLockConfiguration",
        "s3:GetBucketLocation",
        "s3:ReplicateDelete",
        "s3:GetObjectVersion"
      ],
      "Resource": [
        "arn:aws:s3:::<your-cache-bucket>/*",
        "arn:aws:s3:::<your-cache-bucket>"
      ]
    },
    {
      "Sid": "VisualEditor1",
      "Effect": "Allow",
      "Action": "s3:ListAllMyBuckets",
      "Resource": "*"
    }
  ]
}

Now create a user and attach only the newly created S3 runner cache policy. Save the AWS Access Key ID and Secret Access Key; these will be stored as a secret in Kubernetes.
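
For reference, the same steps with the AWS CLI (the policy and user names are placeholders; the policy document is the JSON above saved to a file):

# Create the policy, the user, and attach the one to the other
aws iam create-policy --policy-name s3-runner-cache --policy-document file://s3-runner-cache.json
aws iam create-user --user-name gitlab-runner-cache
aws iam attach-user-policy --user-name gitlab-runner-cache \
  --policy-arn arn:aws:iam::<account-id>:policy/s3-runner-cache

# Generate the key pair; note the AccessKeyId and SecretAccessKey in the output
aws iam create-access-key --user-name gitlab-runner-cache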

GitLab Runner Deployment

Environment Setup

To prepare for the gitlab-runner you first need to create the namespace and S3 Cache Secret

kubectl create namespace gitlab-runner

Then create the secret using the AWS Access key and secret you created earlier

kubectl create secret generic gitlab-runner-s3-access \
  --from-literal=accesskey="AKI..." \
  --from-literal=secretkey="Ax..." \
  -n gitlab-runner

Helm Deploy

The simplest way to deploy the runner is to make use of the official Helm chart.

helm repo add gitlab https://charts.gitlab.io
helm repo update

Then create the values.yaml file for the deployment. There is a fair amount of configuration required to ensure the runners make use of the EC2 Spot Instances.

You will also need to get your runnerRegistrationToken from GitLab. The most useful runner type is a group runner; you can get the registration token for a new group runner from the root group in GitLab.

Click the blue box in the top right and copy the Runner Registration Token; this will be inserted into the values.yaml.

The full values.yaml is as follows and finer details are explained below

imagePullSecrets:
  - name: your-pull-secrets

gitlabUrl: https://gitlab.com/

runnerRegistrationToken: XXXXX

unregisterRunners: true

metrics:
  enabled: true
  serviceMonitor:
    enabled: true

service:
  enabled: true

name: gitlab-k8s-runner

podAnnotations:
  downscaler/exclude: "true"

runners:
  cache:
    secretName: gitlab-runner-s3-access
  privileged: true
  config: |
    [[runners]]
      environment = [
        "DOCKER_HOST=tcp://docker:2376",
        "DOCKER_TLS_CERTDIR=/certs",
        "DOCKER_TLS_VERIFY=1",
        "DOCKER_CERT_PATH=$DOCKER_TLS_CERTDIR/client"
        ]
      [runners.cache]
        Type = "s3"
        Shared = true
      [runners.cache.s3]
        ServerAddress = "s3.amazonaws.com"
        BucketName = "your-cache-bucket"
        BucketLocation = "eu-west-1"
        Insecure = false
        AuthenticationType = "access-key"
      [runners.kubernetes]
        namespace = "{{.Release.Namespace}}"
        image = "docker:19.03.12"
        privileged = true
        poll_timeout = 300
        poll_interval = 10
        cpu_request = "2"
        [[runners.kubernetes.volumes.empty_dir]]
          name = "certs"
          mount_path = "/certs/client"
          medium = "Memory"
        [runners.kubernetes.affinity]
          [runners.kubernetes.affinity.node_affinity]
            [[runners.kubernetes.affinity.node_affinity.preferred_during_scheduling_ignored_during_execution]]
              weight = 1
              [runners.kubernetes.affinity.node_affinity.preferred_during_scheduling_ignored_during_execution.preference]
                [[runners.kubernetes.affinity.node_affinity.preferred_during_scheduling_ignored_during_execution.preference.match_expressions]]
                  key = "eks.amazonaws.com/capacityType"
                  operator = "In"
                  values = ["SPOT"]
            [runners.kubernetes.affinity.node_affinity.required_during_scheduling_ignored_during_execution]
              [[runners.kubernetes.affinity.node_affinity.required_during_scheduling_ignored_during_execution.node_selector_terms]]
                [[runners.kubernetes.affinity.node_affinity.required_during_scheduling_ignored_during_execution.node_selector_terms.match_expressions]]
                  key = "intent"
                  operator = "In"
                  values = ["build"]
        [runners.kubernetes.node_tolerations]
          "buildInstance=true" = "NoSchedule"
        [runners.kubernetes.pod_annotations]
          "downscaler/exclude" = "true"    

rbac:
  create: true

If you are happy with the configuration you can apply it with

helm upgrade --install --namespace gitlab-runner --create-namespace gitlab-runner -f values.yaml gitlab/gitlab-runner
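
Once the release is up, a quick sanity check that the manager pod is running and registering (the label selector assumes the chart's default naming):

kubectl -n gitlab-runner get pods
kubectl -n gitlab-runner logs -l app=gitlab-runner --tail=50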

There are a few important points to note with the configuration above…

Docker-in-Docker Builds

A lot of build pipelines make use of Docker-in-Docker (DinD) to build and push Docker images. Running DinD in Kubernetes (or even in Docker) requires a few special tweaks in the gitlab-runner configuration.

The environment field is used to set environment variables that are passed to each build. DinD builds require additional variables that are used by Docker when running the build.

environment = [
  "DOCKER_HOST=tcp://docker:2376",
  "DOCKER_TLS_CERTDIR=/certs",
  "DOCKER_TLS_VERIFY=1",
  "DOCKER_CERT_PATH=$DOCKER_TLS_CERTDIR/client"
]

When a build launches in Kubernetes, a pod is created with all the required services. The Docker daemon is therefore reachable at the hostname docker, and we use port 2376 because TLS is enabled. The DOCKER_TLS_CERTDIR and DOCKER_CERT_PATH variables tell the Docker client where to find the certificates used between the client and the daemon.

[runners.kubernetes]
  image = "docker:19.03.12"
  privileged = true

In the runners.kubernetes section we also need to set privileged to true, which allows containers access to the Docker daemon during the build. The image is set to an official Docker image, which will be used by default if no image is specified in the .gitlab-ci.yml.

[[runners.kubernetes.volumes.empty_dir]]
  name = "certs"
  mount_path = "/certs/client"
  medium = "Memory"

This section is critical because it creates an in-memory volume that is shared between all the containers in a pod. This allows the Docker service to create the client certificates that will be used by the build container.
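
To see how a pipeline consumes this setup, here is a minimal .gitlab-ci.yml job that builds and pushes an image over DinD (the job name and registry URL are illustrative, and a docker login step would normally precede the push):

build-image:
  image: docker:19.03.12
  services:
    - docker:19.03.12-dind
  script:
    # The DOCKER_HOST/TLS variables from the runner config point these calls at the dind service
    - docker build -t registry.example.com/my-app:$CI_COMMIT_SHORT_SHA .
    - docker push registry.example.com/my-app:$CI_COMMIT_SHORT_SHA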

Node Affinity and tolerations

To make sure that build jobs are launched on Spot Instances, we need to define a node affinity in the config.toml.

[runners.kubernetes.affinity]
  [runners.kubernetes.affinity.node_affinity]
    [[runners.kubernetes.affinity.node_affinity.preferred_during_scheduling_ignored_during_execution]]
      weight = 1
      [runners.kubernetes.affinity.node_affinity.preferred_during_scheduling_ignored_during_execution.preference]
        [[runners.kubernetes.affinity.node_affinity.preferred_during_scheduling_ignored_during_execution.preference.match_expressions]]
          key = "eks.amazonaws.com/capacityType"
          operator = "In"
          values = ["SPOT"]
    [runners.kubernetes.affinity.node_affinity.required_during_scheduling_ignored_during_execution]
      [[runners.kubernetes.affinity.node_affinity.required_during_scheduling_ignored_during_execution.node_selector_terms]]
        [[runners.kubernetes.affinity.node_affinity.required_during_scheduling_ignored_during_execution.node_selector_terms.match_expressions]]
          key = "intent"
          operator = "In"
          values = ["build"]

The configuration above will instruct the Kubernetes scheduler to prefer instances labelled as eks.amazonaws.com/capacityType: SPOT and require instances labelled as intent: build.

[runners.kubernetes.node_tolerations]
  "buildInstance=true" = "NoSchedule"

Because we tainted the ASG, we also need to add a toleration so that build pods can be scheduled onto the tainted nodes; the toleration lets build pods tolerate the NoSchedule taint on build nodes.
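
Once a build has forced a scale-up, you can sanity-check that the labels and the taint actually landed on a node (assuming kubectl access to the cluster):

# Nodes carrying the build label
kubectl get nodes -l intent=build

# Confirm the taint is present
kubectl describe nodes -l intent=build | grep -i taint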

End Notes

Spot Instances are a great way to save money when using AWS, and the combination of Spot and EKS is fantastic. With the above configuration you get access to cheap, high-spec hardware for executing your builds.

The only drawback is the time it takes for a new node to be created. In my configuration above, the cpu_request value for each build is set to 2, which means that if no node has 2 CPUs available, a new node must be created. This relies on the cluster-autoscaler expanding the ASG in reaction to a pod stuck in the Pending state. All of this happens reasonably fast, but it does add between 2 and 4 minutes to a build. If you have frequent builds this is less of an issue, because builds will follow on from each other and reuse existing nodes before the cluster-autoscaler has a chance to remove them.

If I had time, I would look at setting up cluster overprovisioning to keep at least 1 node waiting for a build during office hours. Maybe that should be a future post…