Continuous deployment - a case study

Continuous deployment is considered, by some of the enlightened, to be the holy grail in organisations that develop software.

What is continuous deployment then? My interpretation is that every change is deployed into production. That is, every change that passes the quality gates the team has created.

A relaxed version is continuous delivery. Continuous delivery means that every change that passes the quality gates ends up as a release candidate, one that is possible to deploy into production. The deployment will, however, be done later. If ever.

I will describe how I implemented continuous deployment for one product and continuous delivery for three others for a client.

Why?

Why would you want to do that? What is the purpose of deploying new things to production so often?

The answer is that it allows our users to use the new, shiny product. We allow them to use it as soon as possible and will therefore be able to get fast feedback from real users. We will soon know if a change we made was a good change or not.

We are able to minimise waste. Code that has been committed to a version control system, VCS, and just lies there is waste. It is work that has been done but isn't used. We don't know if it is valuable or not, and until it is used, it is worthless.

In other words, we want to maximise feedback from our users and minimise waste in the form of unused inventory.

Updates are complicated

Isn't it a problem to always deliver new stuff? What about training our users?

Well, have you thought about Gmail? Assuming that you use Gmail, when was the last time you attended a course describing their latest updates? Oh, never? Why? Well, their changes are small and incremental. I am not even sure when a new version was released. I am, however, sure that Gmail is extremely different today compared to when it was released some ten+ years ago.

In other words, small and incremental changes are most likely not a problem. There are many products out there that are updated frequently but in such small steps that the users don't have to be trained to use the new features.

How was it implemented?

How was I able to implement continuous deployment and continuous delivery then?

The implementation was done using these, seemingly easy, steps:

Package the product as an easy to install package
Make the package available for an installation tool
Deploy the package

Easy to install package

Our target environment was Red Hat Linux. This meant that an easy to install package in our environment was an RPM package.

Building RPM packages is easy when you use Gradle. There is a nice suite of plugins built by Netflix called Nebula. We used the Ospackage plugin that allows you to build RPM packages using Gradle.

So building RPMs from Gradle is not too complicated. There is still a lot of work that you have to do to make the installation, upgrade and un-installation behave as you want. But it is doable and easy to test. We used local CentOS hosts that we brought up and down using Vagrant. This allowed us to install on a clean environment, upgrade a few times and then un-install. And start over again whenever we needed.

Make the package available for an installation tool

The next step was to promote the RPMs to a package manager. The standard way of sharing RPMs is to use YUM repositories.

Artifactory is able to serve as a YUM repo. The free version is unfortunately not able to answer the question "Which is the latest artifact?". This is a feature that is only available in Artifactory Pro.

Getting the latest available artifact is very important when you use provisioning tools like Puppet. It makes writing the manifest for a host easy. Instead of specifying the exact version, you specify that the latest version of a product should be installed.

The result was that we opted for using Artifactory Pro.
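To illustrate what "latest" means here, this is a minimal Python sketch of picking the newest package from a list of version-release strings. It is a simplification: it assumes purely numeric, dot-separated versions followed by a numeric release, while real RPM version comparison has more rules. The function names and the sample data are made up for the illustration and are not taken from Artifactory or Puppet.

```python
def parse(version_release: str) -> tuple:
    """Split 'X.Y.Z-R' into a tuple of integers that sorts numerically."""
    version, release = version_release.split("-")
    return tuple(int(part) for part in version.split(".")) + (int(release),)


def latest(candidates: list) -> str:
    """Return the newest entry, comparing numbers rather than text."""
    return max(candidates, key=parse)


# Hypothetical sample data: a plain text sort would get this wrong,
# since "1.2.3-9" sorts after "1.2.3-27" alphabetically.
available = ["1.2.3-9", "1.2.3-27", "1.2.3-101", "1.2.2-300"]
print(latest(available))
```

The point is that somebody has to do this comparison. When the repository can answer the question, the Puppet manifest doesn't have to name an exact version.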

Deploying Maven artifacts to Artifactory using the Artifactory plugin available for Gradle is easy. Deploying non-Maven artifacts turned out to be very complicated. Asking the JFrog support didn't result in anything usable. Their suggestions were complicated and didn't handle the case of uploading the artifacts as a part of a Gradle build.

Artifactory, however, has a REST API. Using the REST API allowed us to write a Gradle plugin that deploys a local RPM to Artifactory. We called the plugin gradle-artifactory-rpm-plugin. A clone of the source code is available on GitHub.
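The plugin itself was written for Gradle and its code is not reproduced here. As a rough illustration of the kind of call involved, this Python sketch builds an HTTP PUT request that would upload one RPM to a repository path. It assumes that an artifact is deployed with a PUT to base URL, repository name and file name, and that basic authentication is accepted. The host, repository, file name and credentials are all hypothetical. The request is only built, never sent.

```python
import base64
import urllib.request


def build_deploy_request(base_url, repo, rpm_name, payload, user, password):
    """Build (but do not send) a PUT request that uploads one RPM."""
    url = f"{base_url.rstrip('/')}/{repo}/{rpm_name}"
    request = urllib.request.Request(url, data=payload, method="PUT")
    # Basic authentication is an assumption; other schemes may be in use.
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    return request


request = build_deploy_request(
    "http://artifactory.example.com/artifactory",  # hypothetical host
    "stage",                                       # hypothetical repository
    "product-1.2.3-27.noarch.rpm",                 # hypothetical file name
    b"rpm bytes would go here",
    "deployer",
    "secret",
)
print(request.get_method(), request.full_url)
# Actually sending it would be: urllib.request.urlopen(request)
```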

The plugin supported these tasks:

Deploy to a stage repository - prepare for installation in an integration test instance
Promote from stage to utv - our first test instance
Promote from utv to test - our second acceptance test instance
Promote from test to prod - our production environment

The idea was that we should be able to test the installation in an integration test environment before trying to install it anywhere else. The integration test environment was the only environment where all the testing was automated.
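The promotion chain can be sketched as a small Python model. This is not the plugin's implementation. The repository names come from the list above, but the function names are invented, and modelling a promotion as a move rather than a copy is an assumption.

```python
PROMOTION_ORDER = ["stage", "utv", "test", "prod"]


def next_repository(current: str) -> str:
    """Return the repository a package is promoted to from `current`."""
    index = PROMOTION_ORDER.index(current)
    if index == len(PROMOTION_ORDER) - 1:
        raise ValueError(f"{current} is the final repository")
    return PROMOTION_ORDER[index + 1]


def promote(repositories: dict, package: str, source: str) -> str:
    """Move `package` one step along the chain and return the target name."""
    target = next_repository(source)
    repositories[source].remove(package)  # assumption: promotion is a move
    repositories[target].add(package)
    return target


repositories = {name: set() for name in PROMOTION_ORDER}
repositories["stage"].add("product-1.2.3-27")  # hypothetical package

location = "stage"
while location != "prod":
    location = promote(repositories, "product-1.2.3-27", location)
    print("promoted to", location)
```

Because a package can only move one step at a time, nothing reaches prod without having passed through every repository before it.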

We never really used the stage repository, but we had the possibility if we wanted. One team had a need for it, but they never got around to using it before I left. The net result was that all artifacts just passed through the stage repository before they were promoted to the utv repository.

Deploy the package

When the RPM packages were available in the YUM repo it was time to actually install them.

Puppet is idempotent, that is, you should expect the same result no matter how many times you execute Puppet. This means that running it may result in no change if there aren't any settings that should be changed. This in turn means that we should be able to execute Puppet often and it shouldn't be dangerous. Nothing will happen if there isn't a new version available.

This was not the case in our situation. We had Puppet manifests that weren't idempotent so executing them often wasn't anything we were comfortable with.
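For readers who haven't met the term, idempotence can be sketched in a few lines of Python. This is an illustration only, not how Puppet works internally. The function name and the dictionary standing in for a host are invented.

```python
def ensure_version(host: dict, package: str, wanted: str) -> bool:
    """Bring `package` on `host` to `wanted`; return True if anything changed."""
    if host.get(package) == wanted:
        return False  # already in the desired state, nothing to do
    host[package] = wanted
    return True


host = {}  # stand-in for the installed packages on one machine
first = ensure_version(host, "product", "1.2.3-27")
second = ensure_version(host, "product", "1.2.3-27")
print(first, second, host)
```

The first call changes the host, the second one finds nothing to do. A manifest that behaves like this is safe to run as often as you like.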

We also had problems with too many requests to Red Hat. We were allowed to ask for new artifacts 100 times per host per day. This became a problem since each execution resulted in two queries to Red Hat. Running Puppet every thirty minutes resulted in 96 requests per day and we had problems with hosts being banned.

Our solution was not to run Puppet on a frequent schedule but instead to trigger it from the build script. This was done using another Gradle plugin, Gradle Puppet Plugin, that executed puppet agent -vt when we knew that there was a new package available.

The net result was that we could automate building packages, promote them to a YUM repository and finally trigger an installation of them. All of this was done using Gradle and scheduled by Jenkins.

This allowed us to implement continuous deployment for one system and continuous delivery for three systems.

A better solution than writing a plugin that triggered the execution of Puppet would have been to fix the manifests and possibly host the YUM repos we searched locally. And then allow Puppet to run on a schedule every 30 minutes. This wasn't anything I was able to implement before I left.

Obstacles

This may seem like a pretty straightforward solution. What were the problems? There were many problems. Many technical. But the hardest problems are, as usual, people problems.

No manager buy in

The management didn't really care about the availability of the services we offered. Our customers were people with reading disabilities such as bad eyesight, blindness or dyslexia. A few hours of downtime for our public facing services every second week was considered OK.

Mundane and therefore error prone routine tasks were not seen as anything we should automate. Automating them would increase the quality. But that wasn't anything that was prioritized.

A large inventory of unused functionality, such as code written but not deployed, was not viewed as something that should be avoided.

This lack of management buy-in was a factor that slowed us down, but it didn't stop us from showing that it was possible to implement.

SNAPSHOT not allowed in the version number

One issue that we faced and that had to be resolved was the version numbering of the artifacts. It turns out that the concept of SNAPSHOT conflicts with the version number rules that the RPM standard defines.

The clash with SNAPSHOT may be seen as a problem, but we turned it into an advantage by ignoring the SNAPSHOT part and using the build number as the revision of a version. We used the build number from Jenkins and thus created 1.2.3-27 from the version number 1.2.3-SNAPSHOT and build 27. This was sufficient for us to be able to always work with the latest artifacts.
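The mapping is simple enough to show as a Python sketch. Our build did this in Gradle, so the function below is an illustration with an invented name, not the code we ran.

```python
SNAPSHOT_SUFFIX = "-SNAPSHOT"


def rpm_version(project_version: str, build_number: int) -> str:
    """Drop a trailing -SNAPSHOT and append the CI build number as revision."""
    if project_version.endswith(SNAPSHOT_SUFFIX):
        project_version = project_version[: -len(SNAPSHOT_SUFFIX)]
    return f"{project_version}-{build_number}"


print(rpm_version("1.2.3-SNAPSHOT", 27))
```

Since the build number only ever increases, every new build of 1.2.3 gets a higher revision than the one before it.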

SNAPSHOT versions

The concept of SNAPSHOT versions is complicated. At least if you don't view every commit as a potential release candidate.

One comment that came up was

"But I have a job that releases a new version, tags it and increases the version number to the next development version. When the new version number is committed and pushed, the build job is triggered and builds again. This creates a new RPM. This new RPM will be the latest. The latest will now be a development version with a SNAPSHOT version number. What should I do?"

One solution could be to acknowledge that the idea of SNAPSHOTs is broken. Then, if you insist on using SNAPSHOTs, consider two jobs. One job that builds and verifies the product, another that delivers RPMs at the same time as you are releasing and tagging the product.

The better solution is to acknowledge that the idea of SNAPSHOTs doesn't work, install whatever version you created, and distinguish between builds using a revision number after the version number. One way of creating the revision number is to use the build number from your CI server.

This may or may not be your desired solution. It all depends on the users. If you have many external users who really look at the version number and try to determine if they should use your latest release, then you have one case where it might make sense to more or less manually release the product and carefully handle the version numbers.

Semantic Versioning 2.0.0 is great when you have many clients. This particular client built internal applications with one customer: the team that built the application, or the other team. Both teams were co-located and sat next to each other, less than 5 m apart.

If you have one internal user, and that user is your own team, then the benefits of automation could be larger than any fundamentalist idea about version numbering.

Too many requests to Red Hat

Our Red Hat, RHEL, hosts subscribed to an official update channel from Red Hat. The idea is great: you get official updates as soon as they are available. This channel had a limitation: you are only allowed to ask for updates for one host 100 times each day. If you ask more often than that, that host will be banned. Banning means that the host will not be able to get any updates at all until it has been manually unlocked by an administrator.

100 times a day may seem like a high number and it is probably set high on purpose by Red Hat. The problem we faced was, however, that our Puppet installation asked twice every execution. Even when it was executed in dry run mode. And the default, and recommended, setting is to run Puppet every 30 minutes. The net result was that we asked Red Hat for updates at least 96 times a day. Too close to the limit for comfort. Two extra Puppet executions were enough to use up the Red Hat allowance. We had lots of hosts banned before we realized why.

The solution was to schedule the execution less often, every sixth hour. But still run Puppet in dry run mode because we still didn't have good, idempotent, Puppet manifests.
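The arithmetic behind those numbers, as a small Python sketch. The limit of 100 requests and the two queries per execution are the figures described above. The function names are invented.

```python
MINUTES_PER_DAY = 24 * 60
DAILY_LIMIT = 100      # requests allowed per host per day
QUERIES_PER_RUN = 2    # each Puppet execution asked Red Hat twice


def daily_queries(interval_minutes: int) -> int:
    """Requests per day caused by the scheduled Puppet executions alone."""
    return (MINUTES_PER_DAY // interval_minutes) * QUERIES_PER_RUN


def extra_runs_within_limit(interval_minutes: int) -> int:
    """How many unscheduled executions fit before the limit is used up."""
    return (DAILY_LIMIT - daily_queries(interval_minutes)) // QUERIES_PER_RUN


print(daily_queries(30), extra_runs_within_limit(30))    # every 30 minutes
print(daily_queries(360), extra_runs_within_limit(360))  # every sixth hour
```

With a run every 30 minutes there is room for two unscheduled executions a day. With a run every sixth hour the schedule only uses 8 of the 100 requests.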

Updating the hosts was done manually until we implemented a Gradle plugin that was able to execute Puppet remotely. This allowed us to execute Puppet once there was an update for any product that should be installed on a specific host. This was the last piece we needed to be able to set up continuous deployment for at least one product.

Branches for the Puppet manifests

We used branches in our version control system, Git, for the Puppet manifests. The idea was to promote from the development branch to the test branch and finally to the production branch.

This may sound like a reasonable way to separate development from testing and finally production. The problems showed up when deployment changes for one system should be promoted but not the changes for another system. The suggested solution was to cherry pick the changes.

Cherry picking is not as easy as one could wish. It is possible, but unnecessarily complicated.

Using packages and promoting artifacts to different repositories solved the cherry picking problem. The changes needed to the Puppet manifests gradually faded away, and after a while there wasn't any need for changes to the manifests for the systems that used RPM packages.

Conclusion

It takes time to implement continuous delivery and continuous deployment. It is not impossible. It is even simple in the right setting. Simple does, however, not mean easy.

It is complicated if the environment, management and coworkers, doesn't care if the systems are down or not.

It is possible to do using open source tooling. The only non-open-source tool we really needed was Artifactory Pro. We had to use the professional version to get the YUM support we needed.

Almost all problems are people problems. The technical challenges are possible to overcome even if it takes time.

Acknowledgements

I would like to thank Malin Ekholm and Johan Helmfrid for proofreading.

Resources

Gradle Artifactory Rpm Plugin - Promoting RPMs to Artifactory
Gradle Puppet Plugin - Running Puppet remotely
Nebula Ospackage Plugin - A Netflix Gradle plugin for creating RPMs
Artifactory - The package manager used
Thomas Sundberg - The author