I want to deploy new versions of an application with no downtime. It turns out to be a bit tricky. Here is one solution that sort of works.
I am not in control of the deployment process. All I can do is monitor a URL and stop sending traffic to it if there are errors.
I want to deploy small changes often to reduce the risk associated with large deploys. This is not a distributed system with lots of small services; it is a monolith that is redeployed often.
The solution is to have more than one server handling the load and to divide the traffic between them. The technique is called load balancing and is not new. All I have to do is set up a load balancer and configure it properly.
Load balancers work either on layer 4, the transport layer, or on layer 7, the application layer. The layers refer to the OSI model. I want to load balance a web application, so a layer 7 load balancer is what I need.
Using HAProxy as a layer 7 load balancer does the trick.
The installation of HAProxy differs between systems. I installed it on Ubuntu 16.04 like this:
apt-get install software-properties-common
add-apt-repository ppa:vbernat/haproxy-1.7
apt-get update
apt-get install haproxy
I found the instructions at https://haproxy.debian.net/ and was able to install the latest version, 1.7 as of this writing.
Installing HAProxy was the easy part. The real work was in tuning its configuration. I ended up with this:
global
    log /dev/log local0
    log /dev/log local1 notice
    maxconn 2000
    chroot /var/lib/haproxy
    stats socket /run/haproxy/admin.sock mode 660 level admin
    stats timeout 30s
    user haproxy
    group haproxy
    daemon

defaults
    log global
    mode http
    option httplog
    option dontlognull
    timeout connect 5000
    timeout client 10000
    timeout server 10000

frontend loadbalanser
    stats enable
    stats uri /admin?stats
    bind *:80
    mode http
    default_backend gfr

backend gfr
    stats enable
    stats uri /admin?stats
    mode http
    balance roundrobin
    option forwardfor
    http-request set-header X-Forwarded-Port %[dst_port]
    option httpchk GET /service/foretag/6.0/ws?wsdl
    server gfr1 l7700744.ata.ams.se:8580 check rise 8 downinter 30000ms observe layer7 on-error mark-down
    server gfr2 l7700745.ata.ams.se:8580 check rise 8 downinter 30000ms observe layer7 on-error mark-down
The most important part is the last two lines. They specify the two servers that should handle the load.
option httpchk defines how the check will be done.
The real magic, and the tuning, was finding values for the server specification so that a deploy could be done while the servers were in use. I put the servers under load generated with Gatling.
The health check is an HTTP call to a URL where I check whether the WSDL for a web service is available. If it isn't, the application isn't up and running.
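To see roughly what the load balancer sees, the health check can be approximated by hand. The sketch below is an assumption about how one might do that with curl, not part of the setup described here: the function name check_backend and the five second timeout are made up for illustration, while the path is the one given to option httpchk. HAProxy treats 2xx and 3xx responses as healthy, and the sketch does the same.

```shell
#!/bin/sh
# Approximate HAProxy's HTTP health check by hand.
# The path comes from `option httpchk` in the configuration.
CHECK_PATH='/service/foretag/6.0/ws?wsdl'

# check_backend HOST:PORT
# Prints UP if the check URL answers with a 2xx or 3xx status, DOWN otherwise.
check_backend() {
  local backend="$1" status
  # -s: silent, -o /dev/null: discard the body,
  # -m 5: give up after 5 seconds (assumed value), -w: print only the status code
  status=$(curl -s -o /dev/null -m 5 -w '%{http_code}' "http://${backend}${CHECK_PATH}") || status=000
  case "$status" in
    2??|3??) echo "UP" ;;
    *)       echo "DOWN" ;;
  esac
}
```

Calling check_backend l7700744.ata.ams.se:8580 during a deploy should print DOWN until the WSDL is served again.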
The load balancing works. When a server responds with an error, that particular server is marked as down. It will come back when the deploy is done and the expected WSDL is available again.
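One way to watch a server go down and come back is to ask HAProxy itself through the admin socket declared in the global section. The following is a minimal sketch, assuming socat is installed. The function names haproxy_cmd and server_status are hypothetical. It relies on the CSV output of show stat, where field 1 is the proxy name, field 2 the server name and field 18 the status.

```shell
#!/bin/sh
# Socket path taken from the `stats socket` line in the configuration.
HAPROXY_SOCKET="${HAPROXY_SOCKET:-/run/haproxy/admin.sock}"

# Send one command to HAProxy's admin socket (assumes socat is installed).
haproxy_cmd() {
  echo "$1" | socat stdio "$HAPROXY_SOCKET"
}

# server_status BACKEND
# Prints "backend/server status" for every server in the given backend,
# skipping the summary row that HAProxy names BACKEND.
server_status() {
  haproxy_cmd "show stat" |
    awk -F, -v be="$1" '$1 == be && $2 != "BACKEND" { print $1 "/" $2, $18 }'
}
```

Running server_status gfr repeatedly during a deploy shows when each server changes state.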
I still lose a few calls during deployment. With constant load, about twice the production load, I lose approximately ten calls per server when they are reinstalled. That's not good, but given that I'm not able to alter the deploy process, I guess it will have to do.
I wish I could find a setting that resends a failed call once to another server, but I can't find one that works.
The option redispatch seemed promising, but it didn't work well for me. When I had option redispatch and a companion setting enabled, I lost more traffic than without them.
If I could change the deploy process, I would change it so that the server that is about to be redeployed is removed from the load balancer before the deploy. HAProxy is really good at reloading its configuration. A script that removes a server, reloads HAProxy's configuration, performs the deployment, adds the server again, and finally reloads the configuration would not be too hard to write. This would give me a real zero-downtime deployment, not just the short-downtime deployment I am able to achieve with this setup.
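Such a script could look something like the sketch below. Note that it swaps in a different technique from the one described above: instead of editing the configuration and reloading, it uses the disable server and enable server commands on the admin socket, which the configuration already opens at level admin. The function name rolling_deploy, the DRAIN_SECONDS variable and its default of ten seconds are assumptions for illustration, and the deploy command is whatever is passed in. It also assumes socat is installed.

```shell
#!/bin/sh
# Socket path taken from the `stats socket` line in the configuration.
HAPROXY_SOCKET="${HAPROXY_SOCKET:-/run/haproxy/admin.sock}"
# Assumption: time to wait for in-flight requests to finish before deploying.
DRAIN_SECONDS="${DRAIN_SECONDS:-10}"

# Send one command to HAProxy's admin socket (assumes socat is installed).
haproxy_cmd() {
  echo "$1" | socat stdio "$HAPROXY_SOCKET"
}

# rolling_deploy BACKEND/SERVER COMMAND [ARGS...]
# Takes the server out of rotation, runs the deploy command and puts the
# server back only if the command succeeded.
rolling_deploy() {
  local target="$1"
  shift
  haproxy_cmd "disable server $target"   # no new requests go to this server
  sleep "$DRAIN_SECONDS"
  if "$@"; then
    haproxy_cmd "enable server $target"  # back in rotation
  else
    echo "deploy failed, leaving $target disabled" >&2
    return 1
  fi
}
```

A call might look like rolling_deploy gfr/gfr1 ./deploy.sh gfr1, where ./deploy.sh is a placeholder for the real deployment step. Running it for one server at a time keeps the other server available throughout.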
HAProxy works very well. It is possible to re-configure it during usage without losing traffic.
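When reloading, it is worth checking the new configuration first so that a typo does not take the load balancer down. This is a small sketch under some assumptions: the configuration lives at /etc/haproxy/haproxy.cfg, HAProxy runs as a systemd service named haproxy, and the function name safe_reload is made up.

```shell
#!/bin/sh
# Assumption: default configuration path of the Debian/Ubuntu package.
HAPROXY_CFG="${HAPROXY_CFG:-/etc/haproxy/haproxy.cfg}"

# Reload HAProxy only if the configuration file passes its own syntax check.
safe_reload() {
  # -c: check the configuration and exit, -q: quiet, -f: file to check
  if haproxy -c -q -f "$HAPROXY_CFG"; then
    systemctl reload haproxy
  else
    echo "configuration check failed, not reloading" >&2
    return 1
  fi
}
```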
I would like to thank Malin Ekholm for proofreading.