A robust means of deploying web applications with Amazon Web Services is to use an Elastic Load Balancer (ELB) to balance requests across an “Auto Scaling Group” (ASG) of EC2 instances. As well as scaling horizontally, this set-up allows automated canary (aka blue-green) deployments, where each new application version is deployed as a new ASG that replaces the existing EC2 instances: a so-called “immutable infrastructure” approach.
Such a procedure relies on ELB “health check” requests to test that the new EC2 instances are ready to take production traffic (and the old instances can be terminated). For canary deployments, it’s important that the health check is accurate: false positives lead to broken applications being brought into production, causing errors, downtime, and other sadness.
CircleCI, our continuous integration service, packages up the application when the tests pass on master and uploads the tarball to Atlas.
Atlas then employs Packer with a set of uploaded configuration (e.g. Puppet manifests and modules) to create a new Amazon Machine Image (AMI).
Atlas then deploys this AMI into production using Terraform. Terraform brings the new AMI into production as described above, creating a new ASG and launch configuration that uses the new AMI.
Evolving this process over the last few months has highlighted the importance of getting the health check right. Below are some tips. An example Django application run with uWSGI and NGINX is used but most of the advice translates to other frameworks and HTTP servers.
While this article is nominally about health checks, the TL;DR is that you can build great things with HashiCorp’s products. Specifically, if you use AWS and haven’t checked out Terraform before: do that today.
A health-check Django view
Our ELB health check is configured in Terraform as:
In other words: an EC2 instance is considered healthy when two consecutive HTTP requests to the health-check URL return a 200 status (within five seconds).
Let’s start simple:
It would have been easier to use NGINX to respond to the health-check request directly without troubling uWSGI. But we get considerably more value by going a layer deeper and getting the Django application to respond. Several classes of problem are prevented by doing this since health checks fail when uWSGI can’t start the Python application.
Unhealthy instances can’t run the Python application
As per the 12-factor app guidelines, our EC2 instances are stateless and read their configuration from environment variables. These are set by Upstart, sourcing a configuration file managed by consul-template:
We ensure the Python application cannot start with missing or invalid configuration by using simple wrapper functions in the Django settings. If, say, the SECRET_KEY environment variable isn’t defined in Consul, uWSGI won’t be able to start the Python application and health checks will fail. This practice ensures canary deployments fail if configuration is missing.
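Wrapper functions of this kind might look like the sketch below. The names `env`, `env_int` and `MissingSetting` are hypothetical, chosen for illustration; the only assumption is that every required value arrives as an environment variable.

```python
import os


class MissingSetting(RuntimeError):
    """Raised at import time when required configuration is absent or invalid."""


def env(name):
    """Return the value of a required environment variable, or fail loudly."""
    value = os.environ.get(name, "")
    if not value.strip():
        raise MissingSetting(f"Environment variable {name} is not set")
    return value


def env_int(name):
    """Return a required environment variable parsed as an integer."""
    raw = env(name)
    try:
        return int(raw)
    except ValueError:
        raise MissingSetting(
            f"Environment variable {name}={raw!r} is not an integer"
        ) from None


# In the settings module, something like:
#   SECRET_KEY = env("SECRET_KEY")
```

Because the exception is raised while the settings are being imported, the application never starts and every health-check request fails.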
Assuming uWSGI can start the Python application, let’s examine the set-up that allows Django to respond successfully to the health check.
We terminate TLS on the ELB and proxy requests to port 80 of the EC2 instance.
For normal user requests, we use the X_FORWARDED_PROTO header to ensure TLS is used. However, we don’t want this for health-check requests, so we handle them separately in the NGINX configuration.
Here we use a separate log file as we don’t want health-check requests included in the main access log, which we forward as a JSON event stream to Loggly (this is super-useful).
ELB health-check requests use the private IP address of the EC2 instance as the host header so we need to ensure such requests are correctly handled by the Django application.
For NGINX, this isn’t a problem as we proxy to the Django application in the catch-all virtualhost (the first one defined).
For the Django application to respond correctly, the private IP address must be in the ALLOWED_HOSTS setting or Django will return a “400 Bad Request” response. Since webservers are ephemeral, this setting needs to be set dynamically, normally by calling the AWS internal metadata service during start-up. You can make such a request in the EC2 “user-data” and write the value to a config file, or call the metadata service when the settings module is imported. The former may look something like:
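Whichever option is used, the two steps are the same: ask the metadata service for the private IP, then add it to ALLOWED_HOSTS. Here is a rough Python sketch of those steps, assuming the instance still allows unauthenticated (IMDSv1) metadata requests. The function names and the injectable `opener` argument are our own, added so the code can be exercised away from EC2.

```python
import urllib.request

# Link-local address of the EC2 instance metadata service.
METADATA_URL = "http://169.254.169.254/latest/meta-data/local-ipv4"


def fetch_private_ip(url=METADATA_URL, timeout=2, opener=urllib.request.urlopen):
    """Return this instance's private IPv4 address, or None if unavailable."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.read().decode("ascii").strip()
    except OSError:
        # Covers URLError and socket timeouts, e.g. when not running on EC2.
        return None


def build_allowed_hosts(public_hosts, private_ip):
    """Return the public hostnames plus the private IP, if one was found."""
    hosts = list(public_hosts)
    if private_ip and private_ip not in hosts:
        hosts.append(private_ip)
    return hosts
```

If the lookup fails, the private IP is simply left out, Django answers the ELB with a 400, and the instance never comes into service, which is the outcome we want.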
At this point, the simple health-check view defined above will happily respond to requests. Let’s now extend the implementation of the health-check view.
Check pages render correctly
You can use the Django test client to run a simple smoke test on your site. For example, checking the homepage loads.
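A helper along these lines could do the job. This is a sketch rather than a definitive implementation: the name `page_renders` is hypothetical, and it assumes the homepage lives at "/" and answers with a 200 rather than a redirect.

```python
def page_renders(client, path="/"):
    """Return True if a GET request for `path` yields a 200 response."""
    try:
        response = client.get(path)
    except Exception:
        # The Django test client re-raises exceptions from views by
        # default; a crashing page is exactly what we want to catch.
        return False
    return response.status_code == 200
```

In the application, `client` would be an instance of `django.test.Client`. Outside the test runner the client's default host name, "testserver", is not in ALLOWED_HOSTS, so you will probably need to construct it with an allowed host, for example `Client(HTTP_HOST="www.example.com")` with your own domain.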
We can use this helper in our view function:
Check migrations have applied successfully
As shown above, we attempt to apply migrations when Upstart starts the Django application. Should any of these migrations fail, we don’t want to bring that machine into production. Hence we check for unapplied migrations as part of the health check:
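One way to make this check is to ask Django's `MigrationExecutor` for its outstanding migration plan, as sketched below. The function names and the optional `executor` argument are our own; the argument exists only so the logic can be tried without a database.

```python
def unapplied_migrations(executor=None):
    """Return the migration steps that have not yet been applied."""
    if executor is None:
        # Imported lazily so this module can be imported before Django
        # settings are configured.
        from django.db import connection
        from django.db.migrations.executor import MigrationExecutor

        executor = MigrationExecutor(connection)
    # Leaf nodes are the latest migration of every installed app.
    targets = executor.loader.graph.leaf_nodes()
    # The plan is a list of (migration, backwards) pairs still to run.
    return executor.migration_plan(targets)


def migrations_applied(executor=None):
    """Return True when the database schema is fully migrated."""
    return not unapplied_migrations(executor)
```

Note that building the executor reads the migrations table, so this adds a database query to every health-check request.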
Our extended health check view function now looks like:
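The core of such a view can be sketched without any Django at all, assuming each check is a callable that returns a truthy value on success. The name `run_checks` and the choice of 503 for failures are ours; any non-200 status tells the ELB the instance is unhealthy.

```python
def run_checks(checks):
    """Run each named check and return a (status_code, body) pair."""
    failures = []
    for name, check in checks.items():
        try:
            passed = check()
        except Exception:
            # A check that blows up counts as a failed check.
            passed = False
        if not passed:
            failures.append(name)
    if failures:
        return 503, "failed: " + ", ".join(failures)
    return 200, "ok"
```

In the view itself, the returned pair maps directly onto `HttpResponse(body, status=status)`. Naming the failed checks in the body makes it much quicker to see why a new ASG never came into service.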
And there you have it: an effective health check view for Django applications.