Zero-downtime deployment with HAProxy as load balancer

Filed under: Automation, Continuous deployment, Linux, Tools. Tags: Gatling, HAProxy. Thomas Sundberg, 2017-03-29

I want to deploy new versions of an application with no downtime. It turns out to be a bit tricky. Here is one solution that sort of works.

The Problem

I am not in control of the deployment process. All I can do is monitor a URL and stop sending traffic to it if there are errors.

I want to deploy small changes often to reduce the risk associated with large deploys. This is not a distributed system with lots of small services; it is a monolith that is redeployed often.

The Solution

The solution is to have more than one server handling the load and to divide the traffic between these servers. The technique is called load balancing and is not new. All I have to do is set up a load balancer and configure it properly.

Two categories of load balancers

Load balancers work either on layer 4, the transport layer, or on layer 7, the application layer. The layers refer to the OSI model. I want to load balance a web application, so a layer 7 load balancer is what I need.

Using HAProxy as a layer 7 load balancer does the trick.

Installing HAProxy

The installation of HAProxy differs between systems. I installed it on Ubuntu 16.04 like this:

    apt-get install software-properties-common
    add-apt-repository ppa:vbernat/haproxy-1.7

    apt-get update
    apt-get install haproxy

I found the instructions at https://haproxy.debian.net/ and was able to install the latest version, 1.7 as of this writing.

Configure HAProxy

Installing HAProxy was the easy part; the real work was tuning its configuration. I ended up with this configuration in /etc/haproxy/haproxy.cfg:


    global
        log /dev/log local0
        log /dev/log local1 notice
        maxconn 2000
        chroot /var/lib/haproxy
        # Admin socket for runtime commands
        stats socket /run/haproxy/admin.sock mode 660 level admin
        stats timeout 30s
        user haproxy
        group haproxy
        daemon

    defaults
        log     global
        mode    http
        option  httplog
        option  dontlognull
        # Timeouts without a unit are in milliseconds
        timeout connect  5000
        timeout client  10000
        timeout server  10000

    frontend loadbalanser
        stats enable
        stats uri     /admin?stats
        bind *:80
        mode http
        default_backend gfr

    backend gfr
        stats enable
        stats uri /admin?stats
        mode http
        balance roundrobin
        option forwardfor
        http-request set-header X-Forwarded-Port %[dst_port]
        # Health check: a server is healthy if the WSDL can be fetched
        option httpchk GET /service/foretag/6.0/ws?wsdl
        server gfr1 l7700744.ata.ams.se:8580 check rise 8 downinter 30000ms observe layer7 on-error mark-down
        server gfr2 l7700745.ata.ams.se:8580 check rise 8 downinter 30000ms observe layer7 on-error mark-down

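The most important part is the last two lines. They specify two different servers that should handle the load.

The real magic, and the tuning, was finding values for the server lines that let a deploy happen while the servers were in use. I kept the servers busy with load generated by Gatling. On each server line, check turns on health checks, rise 8 requires eight consecutive successful checks before a server is considered up again, and downinter 30000ms checks a server that is down every 30 seconds. observe layer7 together with on-error mark-down also watches the responses to real traffic and marks the server as down when enough of them are errors.

The health check is an HTTP call to a URL that serves the WSDL for a web service. If the WSDL isn't available, the application isn't up and running.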
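To see what the health check sees, the same request can be made by hand. This is a minimal sketch that assumes curl is installed. The function names are made up for illustration, and the path is the one from option httpchk above. HAProxy counts 2xx and 3xx responses to an HTTP check as healthy, which is what is_healthy_status mirrors.

```shell
#!/bin/sh
# Sketch of what the HTTP health check boils down to.
# Function names are hypothetical; the path comes from "option httpchk".

# HAProxy treats 2xx and 3xx responses to an HTTP check as healthy.
is_healthy_status() {
    case "$1" in
        2??|3??) return 0 ;;
        *)       return 1 ;;
    esac
}

# Ask one backend server for the WSDL and print UP or DOWN.
# $1 is host:port, for example l7700744.ata.ams.se:8580
probe_server() {
    # -s silent, -o discard the body, -w print only the status code,
    # --max-time give up after 5 seconds.
    # curl reports 000 when it gets no HTTP response at all.
    status=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 \
        "http://$1/service/foretag/6.0/ws?wsdl") || true
    if is_healthy_status "$status"; then
        echo "UP ($status)"
    else
        echo "DOWN ($status)"
    fi
}
```

Calling probe_server l7700744.ata.ams.se:8580 during a deploy should show the server flipping from UP to DOWN and back.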

Result

The load balancing works. When a server responds with an error, that particular server is marked as down. It will come back when the deploy is done and the expected wsdl is available again.
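The state of each server can be followed on the stats page at /admin?stats, or from a terminal through the admin socket declared in the global section. The following is a sketch under a few assumptions: socat is installed, the user is allowed to read the socket, and the function names are invented for this example. It sends show stat to the socket and picks out the server name and status columns by looking them up in the CSV header.

```shell
#!/bin/sh
# Sketch: print the status of every server in one backend using the
# admin socket. Requires socat; function names are hypothetical.

ADMIN_SOCK="${ADMIN_SOCK:-/run/haproxy/admin.sock}"

# Read "show stat" CSV on stdin and print "<server> <status>" for the
# backend named in $1. Column positions are taken from the header line
# ("# pxname,svname,...") instead of being hard coded.
backend_status() {
    awk -F, -v backend="$1" '
        NR == 1 {
            sub(/^# /, "", $1)
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        $(col["pxname"]) == backend && $(col["svname"]) != "BACKEND" {
            print $(col["svname"]), $(col["status"])
        }
    '
}

# Query the running HAProxy and show the servers of backend $1.
show_backend() {
    echo "show stat" | socat stdio "$ADMIN_SOCK" | backend_status "$1"
}
```

Running show_backend gfr in a loop during a deploy makes it easy to see when a server leaves and rejoins the rotation.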

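I still lose a few calls during deployment. With a constant load of about twice the production load, I lose approximately ten calls per server when it is redeployed. That may be no coincidence: with observe layer7, HAProxy waits for error-limit consecutive errors, ten by default, before on-error mark-down kicks in, so a lower error-limit could be worth trying. Losing calls is not good, but given that I'm not able to alter the deploy process, I guess it will have to do.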
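Counting lost calls does not require a full load test. As a rough stand-in for Gatling, and not what I used for the numbers above, a loop around curl can call the load balancer at a steady pace and count every response that is not a 2xx. The function names and the example URL are assumptions for this sketch.

```shell
#!/bin/sh
# Rough stand-in for a load test: call one URL repeatedly and count the
# calls that did not return 2xx. Function names are hypothetical.

# Read one HTTP status code per line on stdin, print "<total> <failed>".
count_failures() {
    awk '
        { total++ }
        $1 !~ /^2[0-9][0-9]$/ { failed++ }
        END { print total + 0, failed + 0 }
    '
}

# Call the URL in $1 exactly $2 times with a short pause between calls.
# curl prints 000 as the status code when it gets no response.
hammer() {
    i=0
    while [ "$i" -lt "$2" ]; do
        curl -s -o /dev/null -w '%{http_code}\n' --max-time 5 "$1" || true
        i=$((i + 1))
        sleep 0.1   # fractional sleep works with GNU sleep on Linux
    done
}
```

For example, hammer "http://localhost/service/foretag/6.0/ws?wsdl" 600 | count_failures, started just before a deploy, prints the number of calls made and the number that failed. The host name here assumes the loop runs on the load balancer itself.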

I wish I could find a setting that resends a failed call once to another server, but I can't find one that works. The option redispatch seemed promising, but it didn't work well for me. With option redispatch and retries set, I lost more traffic than without them. In HAProxy 1.7, both settings only cover failed connection attempts, not requests that a server has accepted and then answers with an error.

A better result

If I could change the deploy process, I would change it so that the server that is about to be redeployed is removed from the load balancer before the deploy. HAProxy is really good at reloading its configuration. A script that removes a server, reloads HAProxy's configuration, performs the deployment, adds the server again, and finally reloads the configuration would not be too hard to write. This would give me a real zero-downtime deployment, not just the short-downtime deployment I am able to achieve with this setup.
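The steps above could be sketched roughly like this. I have not run this against the real deploy process, so treat it as an outline under stated assumptions: GNU sed is available, HAProxy is reloaded with systemctl reload haproxy, and the deploy command is whatever is passed in. Taking a server out is done here by appending HAProxy's disabled keyword to its server line, which is one way among several, and all function names are made up.

```shell
#!/bin/sh
# Sketch of a deploy wrapper: disable a server, reload, deploy, enable
# the server again, reload. Function names and defaults are hypothetical.

HAPROXY_CFG="${HAPROXY_CFG:-/etc/haproxy/haproxy.cfg}"
RELOAD_CMD="${RELOAD_CMD:-systemctl reload haproxy}"

# Remove a trailing "disabled" keyword from the "server <name> ..." line.
enable_server() {
    sed -i -E "/^[[:space:]]*server[[:space:]]+$1[[:space:]]/ s/[[:space:]]+disabled[[:space:]]*\$//" "$HAPROXY_CFG"
}

# Append "disabled" to the "server <name> ..." line.
disable_server() {
    enable_server "$1"   # keeps the edit idempotent
    sed -i -E "/^[[:space:]]*server[[:space:]]+$1[[:space:]]/ s/\$/ disabled/" "$HAPROXY_CFG"
}

# Reload only if HAProxy accepts the edited file ("-c" checks the config).
reload_haproxy() {
    haproxy -c -f "$HAPROXY_CFG" && $RELOAD_CMD
}

# $1 is the server name, the remaining arguments are the deploy command.
deploy_one() {
    server=$1
    shift
    disable_server "$server" && reload_haproxy || return 1
    result=0
    "$@" || result=$?
    # Put the server back even if the deploy failed.
    enable_server "$server" && reload_haproxy || return 1
    return "$result"
}
```

A call could look like deploy_one gfr1 ./deploy.sh l7700744.ata.ams.se, where deploy.sh stands in for the real deployment. Because the configuration declares an admin socket, sending disable server gfr/gfr1 and enable server gfr/gfr1 to that socket would be an alternative that avoids editing the file and reloading.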

Conclusion

HAProxy works very well. It is possible to re-configure it during usage without losing traffic.

Acknowledgements

I would like to thank Malin Ekholm for proofreading.
