Traps on the way of Blue-Green deployments with Docker
Docker (as well as Kubernetes) offers you a way to update your applications with no downtime through a common strategy called
Blue-Green deployment.
Blue-Green deployments work in this way:
- Your currently deployed application ("Green") is serving the incoming traffic.
- A new version of your application is deployed ("Blue") and tested, but is not yet receiving any traffic.
- When "Blue" is ready, we can start sending the incoming traffic to "Blue" too.
- At this point we have two copies of our application running in parallel (the "Green" and the "Blue").
- Now we stop sending incoming traffic to the "Green" application, so "Blue" handles all the incoming traffic.
- Since "Green" is not receiving any traffic anymore, it can be safely removed.
- "Blue" is then marked as "Green", so that a newer version can later be deployed using the same strategy.
Docker Swarm blue-green deployment
If you are using Docker Swarm, the following stack file implements the blue-green deployment strategy.
# app.yml
version: '3.4'
services:
  app:
    image: acme/todo-list:${VERSION}
    deploy:
      update_config:
        order: start-first
acme/todo-list is a simple todo-list web application. The update_config.order: start-first setting instructs Docker Swarm to start the new container before stopping the old one, which is what gives us the blue-green deploy strategy.
To deploy v1 of our todo-list in a Docker Swarm cluster we can run:
VERSION=v1 docker stack deploy todolist_app -c app.yml
If we want to update the app to v2 we can run:
VERSION=v2 docker stack deploy todolist_app -c app.yml
The update will follow the blue-green deploy strategy as described before.
Docker will keep v1 running and will deploy v2.
When v2 is ready, it will redirect all the traffic to v2 and will remove v1. Neat!
But if we look at the logs of our load balancer we can see something like:
1.2.3.4 - - [13/Jun/2019:06:00:10 +0000] "GET /toto/list HTTP/1.1" 502 150 ....
The status code is 502, "Bad Gateway".
This HTTP status code means that the load balancer received an invalid response (or no response at all) from the application server.
Why is that?
To remove v1, once it has stopped sending it incoming traffic, Docker sends the SIGTERM signal to the app and waits up to 10 seconds for the app to gracefully terminate itself.
If v1 is still running after 10 seconds, Docker brutally kills the v1 app (with SIGKILL).
This terminates any connection the app still had open, and the requests pending on those connections receive a 502 error.
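To make "gracefully terminate" concrete, here is a minimal shell sketch of the application side. It is not taken from the todo-list app: the drain_demo name and the messages are made up for illustration. A toy worker traps SIGTERM, finishes the request it is working on, and only then exits. The script sends the signal to itself as a stand-in for the one Docker would send.

```shell
#!/usr/bin/env bash
# Toy worker, not a real server: it only shows the order of events.
drain_demo() {
  bash -c '
    stopping=0
    trap "stopping=1" TERM      # on SIGTERM: remember to stop, but do not exit yet
    kill -TERM $$               # stand-in for the signal Docker sends mid-request
    echo "in-flight request completed"
    if [ "$stopping" -eq 1 ]; then
      echo "no new work accepted, exiting cleanly"
    fi
  '
}

drain_demo
```

An app that needs longer than Docker's waiting time to drain is still killed, which is the case the next section deals with.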
Graceful stop
We can change the number of seconds Docker waits before killing the container by tuning the stop_grace_period parameter:
# app.yml
version: '3.4'
services:
  php:
    image: acme/todo-list:${VERSION}
    stop_grace_period: 120s
    deploy:
      update_config:
        order: start-first
With this configuration, after sending the SIGTERM signal, Docker will wait up to two minutes before killing the app.
Depending on the specific logic and response times of your application, you can tell Docker how long to wait for your application to terminate before forcing it.
Blue-green deployments and PHP-FPM
If you are using PHP-FPM, the previous configuration might not be enough.
Unfortunately (?) PHP-FPM is configured by default to terminate immediately after receiving the SIGTERM signal.
Even if Docker is ready to wait for 10 seconds (or any other value you might have configured with stop_grace_period), PHP will terminate itself (and all the requests being served) without waiting.
This will lead again to 502 errors.
To solve this we also have to instruct PHP to give itself enough time to finish serving the pending requests, by tuning the process_control_timeout parameter (see the PHP-FPM documentation for the full list of configuration directives).
By setting process_control_timeout = 5, PHP-FPM will wait up to 5 seconds before exiting and killing all the processes that were still serving requests.
We can add this parameter in the Dockerfile when building our PHP image.
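# Dockerfile
FROM php:fpm
# ...
# Use a drop-in file (its name here is an arbitrary choice): overwriting the
# main php-fpm.conf would discard the image's include of php-fpm.d/*.conf.
RUN { \
        echo '[global]'; \
        echo 'process_control_timeout = 5'; \
    } | tee /usr/local/etc/php-fpm.d/zz-timeouts.conf
# ...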
Just as Docker waits for a container to finish its job, PHP will now wait up to 5 seconds for its child processes to finish serving their requests.
In this way we have configured both how long Docker waits before terminating the container and how long PHP waits for the requests to complete.
If PHP is able to stop in less than 5 seconds, it will do so (for instance when all the pending requests are served quickly).
The same applies to Docker.
These timeouts therefore only come into play in the worst case.
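One consequence of the mechanics above, stated here as an inference rather than something the configuration enforces: the PHP-FPM timeout only helps if it is shorter than Docker's waiting time, otherwise Docker kills the container while PHP is still draining. A small sanity check could look like the sketch below, where timeouts_nest is a hypothetical helper name and the values are the 5 and 120 seconds used in this post.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds only if PHP-FPM's timeout fits inside
# Docker's grace period (both given in whole seconds).
timeouts_nest() {
  php_timeout=$1
  docker_grace=$2
  if [ "$php_timeout" -lt "$docker_grace" ]; then
    echo "ok: PHP-FPM (${php_timeout}s) finishes before the ${docker_grace}s Docker limit"
  else
    echo "warning: Docker (${docker_grace}s) may kill the container before PHP-FPM (${php_timeout}s) is done"
    return 1
  fi
}

# process_control_timeout = 5, stop_grace_period: 120s
timeouts_nest 5 120
```

Keeping the two values in one place, for instance in a deploy script, makes it harder to raise one and forget the other.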
This post was first published on https://www.goetas.com/blog/traps-on-the-way-of-blue-green-deployments/.