Here’s the situation in a nutshell: you have a service running in the Amazon EC2 Container Service (ECS) environment. Your architecture consists of a set of nodes, as shown in the sketch below.
Everything “lives” inside a private address space “in the Cloud”, on Amazon Web Services (AWS). The ECS cluster consists of “nodes”, each of which is an “instance” launched from the AWS EC2 environment. These instances are of a certain “type” (e.g. “t2.medium”), which determines, for example, how much memory and computing power will be available, and they have networking characteristics attributed to them (like private and potentially public address spaces). Within each of these nodes inside our cluster, a number of Docker containers are running. You don’t have to use Docker as a means of provisioning/launching services, but that’s a different topic. Each of these Docker containers, in this example, represents a micro service. That’s the 20,000 ft view, more or less.
These micro services need to get their configuration information from somewhere. Assuming that the recipe for building one of those micro services is stored on GitHub, for example, some of the configuration information can be packaged with that repository. Sensitive information, like API keys and other secrets, obviously cannot be stored in such a publicly accessible location. With Python projects, people often use a `config.py` module for general information and a `local_config.py` module (listed in the `.gitignore` file) for, e.g., overwriting stub keys with the actual secret keys. When you deploy your micro service, you need to take care that you deploy (the right) `local_config.py` along with your code. When you have a lot of micro services, this clearly is a pain and mistakes are bound to happen.
That is where a service like Consul comes in very handy. Consul is a lot more than just a key/value store for configuration information, by the way. When you deploy Consul alongside your micro services on AWS, you have a central configuration service that is accessible to all your micro services. Assuming you have configured things correctly, the Consul service is only accessible within the private address space used within your ECS cluster. In other words, nobody in the outside world is able to access the configuration stored in Consul.
It would be nice to make frequent backups of your Consul key/value store. Consul does not come with something built in, so you need to roll your own backup solution. In essence you need something that is able to create a dump of the contents of the store. You can write your own utility around an HTTP request like
curl http://<Consul host>:<Consul port>/v1/kv/?recurse
and do all the “dirty work” yourself. Or, you can use a Python client like consulate to do that “dirty work” for you. In other words, getting the data needed for your backup is pretty easy. How you want to implement it is something that needs a little bit more thought. You want your backups done frequently, so a cron job seems like the natural solution. Where do you run your cron job? Do you package it with one of your micro services? Do you package it with your Consul setup? Do you create a separate “stuff” container that runs all kinds of management scripts? How about initiating the backup outside the Cloud? A minor additional question is: where do I store my backups? On AWS (S3) or locally?
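To give an idea of what that “dirty work” amounts to, here is a minimal sketch using only the Python standard library. Consul returns the records as a JSON list in which every value is base64-encoded, so most of the work is decoding. The function names, the default host and port, and the assumption that all values are UTF-8 text are choices made for this sketch, not something taken from the actual backup utility.

```python
import base64
import json
from urllib.request import urlopen


def decode_kv_dump(raw_json):
    """Turn the JSON returned by /v1/kv/?recurse into a {key: value} dict."""
    records = json.loads(raw_json)
    decoded = {}
    for record in records:
        value = record.get("Value")
        # Consul base64-encodes values; a key without a value comes back as null.
        # Assumption: every stored value is UTF-8 text.
        decoded[record["Key"]] = (
            base64.b64decode(value).decode("utf-8") if value is not None else None
        )
    return decoded


def fetch_kv_dump(host="localhost", port=8500):
    """Fetch all records from the store (host and port are placeholders)."""
    url = "http://{}:{}/v1/kv/?recurse".format(host, port)
    with urlopen(url) as response:
        return decode_kv_dump(response.read().decode("utf-8"))
```

Keeping the decoding separate from the HTTP call makes it easy to check the former without a running Consul instance.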
Let’s start with the minor additional question of “where”. AWS S3 seems like the most convenient answer. Your backup will live in the same private address space, more readily available for a restore, and storage on AWS is dirt cheap. Storing it locally involves a little bit more coding, but works fine too. If you want to look inside a backup file stored on S3, you have to download it first, so that may be a slight advantage of having it stored locally.
Where to run the cron job is to some degree a matter of taste too. Packaging it with an existing service (like a micro service or Consul itself) seems like bad design to me. The whole idea of a micro service is that it does X and X alone, with no room for “oh, and a little bit of Y too”. How about having a “management container” or “management node” within your cluster? Besides doing backups, you probably want to run all kinds of health checking, metrics gathering and other scripts. This choice is a bit more philosophical. Unless you have some sort of management UI on top of your AWS environment (like Rancher), which does all the communicating with AWS “under the hood”, you will probably need to run a local script if you want to do something like restoring a Consul backup. If that is the case, maybe you want to have your backup utility run locally too, just from a “completeness” point of view. Backup and restore are probably just two modes of one utility. Sometimes containers crash, which could theoretically stop your backups. But there is a solution for that: a new container will get spun up, with all your management services. My personal choice was to have the backup initiated locally, through a cron job running on an on-prem server. How do you make that happen?
That is where the concept of “Task Definitions” comes in. A “Task Definition” in the AWS ECS environment is a recipe to “do something”. This “something” can either be a “service” or a “task”. A “service” is in general something you expect to keep running; think “micro service” here, for example. You start the service and it keeps running until “something happens”; normally this means that you stop or restart the service (after an update). A “task” is in general a more transient event: you start it, it makes something happen and then it exits. This is exactly what we need.
So, what ingredients do you need for a “Task Definition”? The main ingredient, really, is an “image”, specifically a “Docker image”. A Docker image is a read-only template. For example, an image could contain an Ubuntu operating system with Apache and your web application installed. Images are used to create Docker containers; they are the build component of Docker. In turn, a Docker image is created from a recipe, listed in a file called the Dockerfile. Docker images can be stored in various ways: on Docker Hub, in a local repository, in a third-party repository like Quay.io, or within AWS itself.
In our case, when the Docker container is created, it has one very clear purpose:
- connect to the Consul key/value store,
- retrieve all records,
- store the records in a file,
- ship that file to a well-defined bucket on AWS S3.
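The last two steps of this list could be sketched roughly as follows. This is not the code from the actual container: the file naming scheme, the “consul-backups” key prefix and the function names are assumptions for illustration, and `records` stands for whatever dictionary the first two steps produced. The S3 client is passed in (for example the result of `boto3.client("s3")`) so that the functions themselves do not depend on AWS credentials being present.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def write_backup_file(records, directory, now=None):
    """Dump the records to a timestamped JSON file and return its path."""
    now = now or datetime.now(timezone.utc)
    # Hypothetical naming scheme: one file per backup run.
    filename = "consul-kv-{}.json".format(now.strftime("%Y%m%dT%H%M%SZ"))
    path = os.path.join(directory, filename)
    with open(path, "w") as handle:
        json.dump(records, handle, indent=2, sort_keys=True)
    return path


def ship_to_s3(s3_client, path, bucket, prefix="consul-backups"):
    """Upload the backup file to the bucket and return the S3 key that was used."""
    key = "{}/{}".format(prefix, os.path.basename(path))
    # upload_file(Filename, Bucket, Key) is the boto3 S3 client call.
    s3_client.upload_file(path, bucket, key)
    return key
```

Putting a timestamp in the file name keeps older backups around instead of overwriting them, which is usually what you want for a restore.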
To make this happen, we need to translate this into a Dockerfile. In the FROM clause in the Dockerfile, you specify the base image on which the Docker container is built (like Ubuntu, CentOS, Debian, …). I chose phusion, but that is not necessarily the best choice. I probably could have chosen something like busybox. One difference between the various choices is the size of the resulting Docker image; this could be a critical factor. I’ll probably spend a future blog post on this.
The rest of the ingredients are listed in this GitHub repo. Using the Dockerfile, I created a Docker image and stored it on Docker Hub. Now we can use this image in the Task Definition:
I have removed a lot of details from this, to keep things simple and focus on the most important aspects. The Docker image has been bolded above. By just listing adsabs/consul-backup, AWS “knows” that it needs to look on Docker Hub if it cannot find the image locally. By adding a tag after a colon, a specific version will get downloaded. The Docker container that gets created when you execute “Run Task” within AWS ECS for this particular Task Definition will mount “/tmp” from the node on “/tmp” in the container. I wanted this so that a log file would stick around, even after the Docker container was removed.
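For readers who prefer code over the AWS console, the aspects described above (the image, its tag and the “/tmp” mount) could be expressed roughly as follows, as keyword arguments for boto3’s `register_task_definition`. This is only a sketch and not the actual Task Definition: the family and container names, the memory value and the environment variable name `BACKUP_MODE` are all assumptions made for illustration.

```python
def build_task_definition(image_tag="latest", mode="backup"):
    """Return keyword arguments for boto3's ecs.register_task_definition.

    Every value except the image name and the /tmp mount is a placeholder.
    """
    return {
        "family": "consul-backup",  # assumed family name
        "containerDefinitions": [
            {
                "name": "consul-backup",  # assumed container name
                # Docker Hub image; the part after the colon selects the version.
                "image": "adsabs/consul-backup:{}".format(image_tag),
                "memory": 128,  # MiB, placeholder value
                "essential": True,
                # Hypothetical variable name, not taken from backup.py.
                "environment": [{"name": "BACKUP_MODE", "value": mode}],
                # Mount the node's /tmp on /tmp inside the container.
                "mountPoints": [{"sourceVolume": "tmp", "containerPath": "/tmp"}],
            }
        ],
        # The volume refers to a path on the node (the EC2 instance).
        "volumes": [{"name": "tmp", "host": {"sourcePath": "/tmp"}}],
    }
```

With an ECS client at hand, registering it would be a matter of `client.register_task_definition(**build_task_definition())`.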
When “Run Task” is executed, a Docker container is created from the image, and the command specified after “CMD” in the Dockerfile is executed. This runs the Python script backup.py, which looks at its environment variables to decide whether to do a backup or a restore. I think the source code of backup.py is pretty self-explanatory, so instead let’s look at how to turn this setup into a local cron job.
The key ingredient here is the “boto3” Python module, specifically the component that deals with AWS ECS. You would need a method along the following lines:
def run_task(cluster, desiredCount, taskDefinition):
    """
    Thin wrapper around boto3 ecs.run_task
    :param cluster: The short name or full Amazon Resource Name (ARN) of the cluster to run the task on. If you do not specify a cluster, the default cluster is assumed.
    :param desiredCount: The number of instantiations of the task to place on the cluster.
    :param taskDefinition: The family and revision (family:revision) or full Amazon Resource Name (ARN) of the task definition to run. If a revision is not specified, the latest ACTIVE revision is used.
    """
    client = get_boto_session().client('ecs')
    # Assumed completion: start `desiredCount` copies of the task via ECS RunTask
    return client.run_task(cluster=cluster, taskDefinition=taskDefinition, count=desiredCount)
where cluster refers to the name of the cluster and taskDefinition would be something like “consul-backup:12” (where 12 refers to the “revision number” of the Task Definition). The method “get_boto_session()” is something like
def get_boto_session():
    """
    Gets a boto3 session using credentials stored in app.config; assumes an
    app context is active
    :return: boto3.session instance
    """
Now we have all the ingredients to initiate backups of the Consul key/value store from a local server.
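Putting these ingredients together, a script that cron could call might look like the sketch below. It is a standalone variation under stated assumptions, not the code used in practice: the credentials come from a plain dictionary instead of app.config, the dictionary keys (`AWS_ACCESS_KEY`, `AWS_SECRET_KEY`, `AWS_REGION`) and the function names are hypothetical, and the check on the “failures” list in the response is an addition.

```python
def make_ecs_client(config):
    """Create an ECS client from a plain dict of credentials (hypothetical keys)."""
    import boto3  # imported lazily so start_backup can be used without boto3

    session = boto3.session.Session(
        aws_access_key_id=config["AWS_ACCESS_KEY"],
        aws_secret_access_key=config["AWS_SECRET_KEY"],
        region_name=config["AWS_REGION"],
    )
    return session.client("ecs")


def start_backup(client, cluster, task_definition, count=1):
    """Start the backup task and return the ARNs of the tasks ECS launched."""
    response = client.run_task(
        cluster=cluster, taskDefinition=task_definition, count=count
    )
    # RunTask reports placement problems in "failures" rather than raising.
    failures = response.get("failures", [])
    if failures:
        raise RuntimeError("ECS could not place the task: {}".format(failures))
    return [task["taskArn"] for task in response.get("tasks", [])]
```

A crontab entry on the on-prem server would then run a small script that builds the client and calls `start_backup(client, "my-cluster", "consul-backup:12")`, where the cluster name is a placeholder. Raising on failures means cron can mail you when a backup did not start.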