Spot instances for the win!

Cloud computing is supposed to be cheap, right?

No longer do we need to fork out £5-10k for some silicon and tin, and pay for space and the power, the cables and the install, etc, etc. Building in the cloud meant we could go and provision a host and leave it running for a few hours and remove it when we were done. No PO/finance hoops to jump through, no approvals needed, just provision the host and do your work.

So, in some ways this was true: there was little or no upfront cost, and it’s easier to beg forgiveness than ask permission, right? But the fact is we’ve moved on from the times when AWS was a demo environment, or a test site, or something that the devs were just toying with. Now it’s common for AWS (or Azure, or GCE) to be your only compute environment, and the bills are much bigger now. AWS has become our biggest platform cost and so we’re always looking for ways to reduce our cost commitments there.

At the same time that AWS and cloud have become mainstream for many of us, so too have microservices, and while the development and testing benefits of microservices are well recognised, the little-recognised truth is that they also cost more to run. Why? Because as much as I may be doing the same amount of ‘computing’ as I was in the monolith (though I suspect we were actually doing less there), each microservice now wants its own pool of memory. The PHP app that we ran happily on a single 2GB server with 1 CPU has now been split out into 40 different components, each with its own baseline memory consumption of 100MB, so I’ve already doubled my cost base just by using a more ‘efficient’ architecture.

Of course, AWS offers many ways of reducing your compute costs with them. There are many flavours of machine available, each with memory and CPU offerings tuned to your requirements. You can get 50%+ savings on the cost of compute power by committing to paying for the system for 3 years (you want the flexible benefits of cloud computing, right?). Beware the no-upfront reservations though – you’ll lose most of the benefits of elastic computing, with very little cost saving in return.

You could of course use an alternative provider; Google bends over backwards to prove they have a better, cheaper IaaS, but the truth is we’re currently too in-bed and too busy to move provider (we’ve only just finished migrating away from Rackspace, so we’re in no hurry to start again!)

So, how can we win this game? Spot instances. OK, so they may get turned off at any moment, but for the majority of common machine types you will pay around 20% of the on-demand price for a spot instance. Looking at the historical pricing of spot instances also gives you a pretty good idea of how likely it is that a spot instance will be abruptly terminated. If you bid at the on-demand price for a machine – i.e. what you were GOING to pay anyway – but put it on a spot instance instead, you’ll end up paying ~20% of what you would have, and your machine will almost certainly still be there in 3 months’ time. As long as your bid price remains above the spot price, your machine will stay on and you will pay the spot price, not your bid!

AWS Spot Price History

What if this isn’t certain enough for you? If you really want to take advantage of spot instances, build your system to accommodate failure and then hedge your bids across multiple compute pools of different instance types. You can also reserve a baseline of machines, which you calculate to be the bare minimum needed to run your apps, and then use spots to supplement that baseline pool in order to give your systems more burst capacity.
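One concrete way to hedge across pools on AWS is a spot fleet request spanning several instance types and subnets. The sketch below is the sort of config you would pass to aws ec2 request-spot-fleet; the AMI ID, IAM role ARN and subnet IDs are placeholders you’d replace with your own:

```json
{
  "SpotPrice": "0.12",
  "TargetCapacity": 4,
  "AllocationStrategy": "lowestPrice",
  "IamFleetRole": "arn:aws:iam::111111111111:role/my-fleet-role",
  "LaunchSpecifications": [
    { "InstanceType": "m4.large", "ImageId": "ami-aaaa1111", "SubnetId": "subnet-aaaa1111" },
    { "InstanceType": "c4.large", "ImageId": "ami-aaaa1111", "SubnetId": "subnet-bbbb2222" }
  ]
}
```

You’d submit it with aws ec2 request-spot-fleet --spot-fleet-request-config file://fleet.json. Setting SpotPrice at (or near) the on-demand price is exactly the “bid what you were going to pay anyway” approach: the fleet keeps 4 instances running as long as any of the listed pools stays below your bid.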

How about moving your build pipeline or that load test environment on to spot instances?

Sure, you can’t bet your house on them, but with the right approach to risk you can certainly save a ton of money on your compute costs.

Running Jenkins on Kubernetes

by Sion Williams

tl;dr

This guide will take you through the steps necessary to continuously deliver your software to end users by leveraging Amazon Web Services and Jenkins to orchestrate the software delivery pipeline. If you are not familiar with basic Kubernetes concepts, have a look at Kubernetes 101.

In order to accomplish this goal you will use the following Jenkins plugins:

  • Jenkins EC2 Plugin – starts Jenkins build slaves in AWS when builds are requested and terminates those instances when builds complete, freeing up resources for the rest of the cluster
  • Bitbucket OAuth Plugin – allows you to add your Bitbucket OAuth credentials to Jenkins

In order to deploy the application with Kubernetes you will use the following resources:

  • Deployments – replicate our application across our Kubernetes nodes and allow us to do a controlled rolling update of our software across the fleet of application instances
  • Services – load balancing and service discovery for our internal services
  • Volumes – persistent storage for containers

Credit

This article is an AWS variant of the original Google Cloud Platform article found here.

Prerequisites

  1. An Amazon Web Services Account
  2. A running Kubernetes cluster

Containers in Production

Containers are ideal for stateless applications and are meant to be ephemeral. This means no data or logs should be stored in the container otherwise they’ll be lost when the container terminates.

– Arun Gupta

The data for Jenkins is stored in the container filesystem. If the container terminates then the entire state of the application is lost. To ensure that we don’t lose our configuration each time a container restarts we need to add a Persistent Volume.

Adding a Persistent Volume

From the Jenkins documentation we know that the directory we want to persist is the Jenkins home directory, which in the container is located at /var/jenkins_home (assuming you are using the official Jenkins container). This is the directory where all our plugins are installed and where job and config information is kept.

At this point we’re faced with a chicken and egg situation; we want to mount a volume where Jenkins Home is located, but if we do that the volume will be empty. To overcome this hurdle we first need to add our volume to a sacrificial instance in AWS, install Jenkins, copy the contents of Jenkins Home to the volume, detach it, then finally add it to the container.
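The copy step of that dance can be sketched roughly as follows. To keep this safe to run as-is, SRC and DEST default to throw-away demo directories; on the real surrogate instance SRC would be /var/jenkins_home and DEST wherever you mounted the EBS volume (both of those paths are assumptions to adapt):

```shell
# Sketch of copying Jenkins home onto the surrogate-mounted volume.
# SRC/DEST default to demo directories; override them on a real instance,
# e.g. SRC=/var/jenkins_home DEST=/mnt/jenkins-vol
SRC="${SRC:-/tmp/demo-jenkins-home}"
DEST="${DEST:-/tmp/demo-jenkins-vol}"

mkdir -p "$SRC" "$DEST"
touch "$SRC/config.xml"   # stand-in for the real Jenkins config

# -R recurses, -p preserves ownership and permissions (which matter for
# the fsGroup trick discussed under Gotchas); "/." includes dotfiles too
cp -Rp "$SRC/." "$DEST/"

ls "$DEST"
```

Once the volume holds a faithful copy, detach it from the surrogate and it’s ready to be referenced from the PersistentVolume manifest.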

Gotchas

Make sure that the user and group permissions in the Jenkins home are the same. Failure to do so will cause certain write operations in the container to fail. We will discuss the Security Context in more detail later in this article.

To recursively change permissions of group to equal owner, use:

$ sudo chmod -R g=u .

Now that we have our volume populated with the Jenkins data we can start writing the Kubernetes manifests. The main things of note are the name, volumeID and storage.

jenkins-pv.yml

apiVersion: v1
kind: PersistentVolume
metadata:
  name: jenkins-data
spec:
  capacity:
    storage: 30Gi
  accessModes:
  - ReadWriteOnce
  awsElasticBlockStore:
    volumeID: aws://eu-west-1a/vol-XXXXXX
    fsType: ext4

With this manifest we have told Kubernetes where our volume is held. Now we need to tell Kubernetes that we want to make a claim on it. We do that with a Persistent Volume Claim.

jenkins-pvc.yml

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: jenkins-data
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 30Gi

In the file above we are telling Kubernetes that we would like to claim the full 30Gi. We will associate this claim with a container in the next section.

Create a Jenkins Deployment and Service

Here you’ll create a deployment running a Jenkins container with a persistent disk attached containing the Jenkins home directory.

jenkins-deployment.yml

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    app: jenkins
  name: jenkins
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jenkins
  template:
    metadata:
      labels:
        app: jenkins
    spec:
      containers:
      - image: jenkins:2.19.2
        imagePullPolicy: IfNotPresent
        name: jenkins
        ports:
        - containerPort: 8080
          protocol: TCP
          name: web
        - containerPort: 50000
          protocol: TCP
          name: slaves
        resources:
          limits:
            cpu: 500m
            memory: 1000Mi
          requests:
            cpu: 500m
            memory: 1000Mi
        volumeMounts:
        - mountPath: /var/jenkins_home
          name: jenkinshome
      securityContext:
        fsGroup: 1000
      volumes:
      - name: jenkinshome
        persistentVolumeClaim:
          claimName: jenkins-data

There’s a lot of information in this file. As the post is already getting long, I’m only going to pull out the most important parts.

Volume Mounts

Earlier we created a persistent volume and a volume claim. We made a claim on the PersistentVolume using the PersistentVolumeClaim, and now we need to attach that claim to our container. We do this using the claim name, which hopefully you can see ties the manifests together: in this case, jenkins-data.

Security Context

This is where I had the most problems. I found that when I used the surrogate method of getting the files onto the volume I forgot to set the correct ownership and permissions. By setting the group permissions to be the same as the user’s, when we deploy to Kubernetes we can use the fsGroup feature. This feature lets the Jenkins user in the container have the correct permissions on the directories via the group-level permissions. We set this to 1000 as per the documentation.

If all is well you should now be able to start each of the resources:

kubectl create -f jenkins-pv.yml -f jenkins-pvc.yml -f jenkins-deployment.yml

As long as you don’t have any issues at this stage, you can now expose the instance using a load balancer. In this example we are provisioning an AWS load balancer with our AWS-provided cert.

jenkins-svc.yml

apiVersion: v1
kind: Service
metadata:
  labels:
    app: jenkins
  name: jenkins
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-ssl-cert: "arn:aws:acm:eu-west-1:xxxxxxxxxxxx:certificate/bac080bc-8f03-4cc0-a8b5-xxxxxxxxxxxxx"
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: "http"
    service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "443"
spec:
  ports:
  - name: securejenkinsport
    port: 443
    targetPort: 8080
  - name: slaves
    port: 50000
    protocol: TCP
    targetPort: 50000
  selector:
    app: jenkins
  type: LoadBalancer
  loadBalancerSourceRanges:
  - x.x.x.x/32

In the snippet above we also use the loadBalancerSourceRanges feature to whitelist our office. We aren’t making our CI publicly available, so this is a nice way of making it private.

I’m not going to get into the specifics of DNS etc. here, but if that’s all configured you should now be able to access your Jenkins. You can get the ingress URL using the following:

kubectl get -o jsonpath="{.status.loadBalancer.ingress[0].hostname}" svc/jenkins

EC2 Plugin

I guess you’re wondering; “why after all that effort with Kubernetes are you creating AWS instances as slaves?” Well, our cluster has a finite pool of resource. We want elasticity with the Jenkins slaves, but equally, we don’t want a large pool sat idle waiting for work.

We are using the EC2 Plugin so that our builder nodes will be automatically launched as necessary when the Jenkins master requests them. Upon completion of their work they will automatically be terminated, and we don’t get charged for anything that isn’t running. This does come with a time penalty for spinning up new VMs, but we’re OK with that. We mitigate some of that cost by leaving them up for 10 mins after a build, so that any new builds can jump straight on the resource.

There’s a great article on how to configure this plugin here.

Bitbucket OAuth

Our Active Directory is managed externally, so integrating Jenkins with AD was a little bit of a headache. Instead, we opted to integrate Jenkins with Bitbucket OAuth, which is useful because we know all of our engineers will have accounts. The documentation is very clear and accurate, so I would recommend following that guide.

Building Kubernetes Clusters

We’re in the early stages of deploying a platform built with microservices on Kubernetes. While there is a growing number of alternatives to k8s (all the cool kids are calling it k8s, or kube), Mesos, Nomad and Swarm being some of the bigger names, we came to the decision that k8s has the right balance of features, maturity and out-of-the-box ease of use to make it the right choice for us. For the time being, at least.

You may know that k8s was gifted to the world by Google. In many ways it’s a watered-down version of the Omega and Borg engines they use to run their own apps worldwide across millions of servers, so it’s got good provenance. The cynical among you (myself included) may suggest that Google gave us k8s as a way of enticing us to Google Compute Engine. K8s is clearly designed to run on GCE first and foremost, with a number of features only working there. That’s not to say it can’t be run in other places; it can, and does get used everywhere from laptops and bare metal to other clouds. For example, we run it in AWS, and while functionality on AWS lags behind GCE, k8s still provides pretty much everything we need at this point.

As you can imagine, setting up k8s could be pretty complicated: there are a lot of elements involved in something you can trust to run your applications in production with little more than a definition in a YAML file. Fortunately the k8s team provide a pretty thorough build script, called ‘kube-up’, to set it all up for you.

Kube-up is a script* capable of installing a fully functional cluster on anything from your laptop (Vagrant), through Rackspace, Azure, AWS and VMware, to Google Compute and Container Engine (GCE & GKE). Configuration and customisation for your requirements is done by modifying values in the scripts, or preferably by exporting the appropriate settings into your env vars before running the script.

For a couple of reasons, which seemed good at the time, we’re running in AWS. While support for AWS is pretty good, the main missing feature we’ve noticed so far is the ingress resource, which provides advanced L7 control such as rate limiting. It’s actually pretty difficult to find good information on what is supported, both in the kube-up script and once k8s is running and in use. The best option is to read through the script, see which environment variables are mentioned, and then have a play with them.

Along with a kube-up script, there is also a kube-down script (supplied in the tar file downloaded by kube-up). This can be very handy if you’re building and rebuilding clusters to better understand what you need but be warned, it also means it’s perfectly feasible to delete a cluster you didn’t want deleted.

So far I’ve found a few guidelines which I think should be stuck to when using kube-up. These, with the reasons why, are:

Create a stand-alone config file (a list of export Env=Vars) and source that file before running kube-up, instead of modifying the downloaded config files.

Having gone through the build process a couple of times now, I’ve come to the conclusion that the best route is to define all the EnvVar overrides in a stand-alone file and source that file before running the main kube-up script. By default, kube-up will re-download the tar and replace the script directory, blowing away any overrides you may have configured. Downloading a new version of the tar file means you benefit from any fixes and improvements, and keeping your config outside it means you don’t have to keep re-defining it. I should add, too, that I have had to hack the contents of various scripts to get them to run without errors, so using the latest version does help minimise this.
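As an example, such a stand-alone config might look like this. The variable names are taken from the AWS support in kube-up of roughly this vintage and do change between releases, so verify each one against the scripts in the tar you downloaded; the values are illustrative:

```shell
# kube-up-env.sh - source me before running kube-up.sh
# Names/values are illustrative; check them against your kube-up version.
export KUBERNETES_PROVIDER=aws
export KUBE_AWS_ZONE=eu-west-1a
export KUBE_AWS_INSTANCE_PREFIX=k8s-staging   # logical cluster name, not the default
export MASTER_SIZE=m4.large
export NODE_SIZE=m4.large
export NUM_NODES=4
```

Then run something like: source kube-up-env.sh && ./kubernetes/cluster/kube-up.sh. Re-downloading the tar no longer costs you your overrides.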

Don’t use the default Kubernetes cluster name; create a logical name (something that makes sense to use and still identifies this cluster when 3-4 other clusters are running alongside it).

Kube-up/down both rely on the information held in ~/.kube. This directory is created when you run kube-up and lets the kubectl tool know where to connect and what credentials to use to manage the system through the API. If you have multiple clusters and have the details for the ‘wrong’ cluster stored in this file, kube-down will merrily delete the wrong cluster.

In addition to this, in AWS, kube-up/down both rely heavily on AWS name tags. These tags are used during the whole lifecycle of the cluster, so they are important at all times. When kube-up provisions the cluster it will tag resources so it knows which ones it manages. The same tags are used by the master to control the cluster, for example to add the appropriate instance-specific routes to the AWS route tables. If the tags are missing, or duplicated (which can happen if you are building and tearing down clusters frequently and miss something in the tear-down), you can end up with a cluster which is reported as fully functional, but in which applications will fail to run.

One problem I found was that, having laid out a nice VPC config (including subnets and route tables) with Terraform and provisioned the system, when I came to deploying the k8s cluster the kube-up script failed to bind its route table to the subnet I had told it to use. It failed because I had already defined one myself in Terraform. Kube-up did report this as an error, but carried on and provisioned what looked like a fully functioning cluster. It wasn’t until the following day that we identified that important per-node routes were missing. Kube-up had provisioned and tagged a route table, and because that table was tagged, that’s the table the kube master was updating as minions were provisioned. The problem was that this route table was not associated with my subnet. Once I had tagged my terraformed subnet with the appropriate k8s tag, the master updated the correct table with new routes for minions. I had to manually copy across the routes from the other table for the existing minions.
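In Terraform terms, the eventual fix amounted to tagging the subnet with the same tag kube-up uses to find its resources. A sketch (the resource names and the k8s-staging cluster name are made up; the KubernetesCluster tag key is the one the AWS kube-up tooling of this era looked for, but check yours):

```hcl
resource "aws_subnet" "k8s" {
  vpc_id     = "${aws_vpc.main.id}"
  cidr_block = "10.240.0.0/24"

  tags = {
    Name = "k8s-staging-subnet"
    # Must match the cluster name kube-up is using, or the master will
    # update a route table that isn't associated with this subnet.
    KubernetesCluster = "k8s-staging"
  }
}
```

With the subnet tagged like this, the master picks the right route table to update when minions come and go.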

Understand your network topology before building the cluster and define IP ranges for the cluster that don’t collide with your existing network and allow for more clusters to be provisioned alongside in the future. 

If, for example, you choose to deploy two separate clusters using the kube-up scripts, they will both end up with the same IP addressing, and they will only be accessible over the internet. While this isn’t the end of the world, it’s not ideal, and being able to access them using their private IP/name space is a huge improvement. Of course, if the kube-up provisioned IP range is the same as one of your internal networks, or you have two VPCs with the same IP ranges, this becomes impossible. Having well-thought-out networks and IP ranges also makes routing and security far simpler: if you know all your production services sit in one range, you can easily configure your firewalls to restrict access to that whole range.

You can pre-build the VPC, networks, gateways, route tables, etc., but if you do, make sure they’re kube-up friendly by adding the right tags (which match the custom name you defined above).

When building with default configs, kube-up will provision a new VPC in AWS. While this is great when you want to just get something up and running, it’s pretty likely you’ll actually want to build a cluster in a pre-existing VPC. You may also already have a way of building and managing these. We like to provision things with Terraform, and while we found a way to configure kube-up to use an existing VPC (and to change its networking accordingly), there are still a number of caveats.

K8s makes heavy use of some networking tricks to provide an easy-to-use interface, which means that to really understand k8s (you’re running your production apps on this, right? So you want a good idea of how it’s running, right?) you should also have a good understanding of its networks. In essence, Kubernetes makes use of two largely distinct networks. The first provides IPs to the master and nodes, and allows you to reach the surface of the cluster (to manage it, deploy apps onto it, and have those apps served to the world). The second network is used to manage where the apps are within the cluster, allowing the scheduler to do what it needs to without you having to worry about which node an app is deployed to and which port it’s on. If either of these network ranges collides with one of your existing networks you can get sub-optimal behaviour, even if that just means having to jump through hoops to reach your cluster.
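In kube-up these two ranges are driven by environment variables along these lines. The names and values are examples from the scripts of this era, so verify them against your copy, and pick CIDRs that don’t collide with anything you already route:

```shell
# Pod network: each node is handed a slice of this range for its containers.
export CLUSTER_IP_RANGE=10.246.0.0/16
# Service network: virtual IPs for Services, never routed outside the cluster.
export SERVICE_CLUSTER_IP_RANGE=10.247.0.0/16
```

Setting these in the same stand-alone config file you source before kube-up keeps each cluster’s addressing distinct and documented.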

Update the security groups as soon as the system is built to restrict access to the nodes. We’ve built ours in a VPC with a VPN connection to our other systems, so we can restrict access to private ranges only. 

Also note that, by default, although kube-up will provision a private network for you in AWS, all the nodes end up with public addresses, and a security group which allows access to them from anywhere over SSH (and HTTP/S for the master). This strikes me as a little scary.

  • Kube-up is in fact far more than just a single script; it downloads a whole tar file of scripts, but let’s keep it simple.