Intro
In this post, I discuss some thoughts about AWS’s Elastic Container Service (ECS). ECS is a management layer for running Docker containers that adds the following concepts:
- task: An ECS task is equivalent to a running container.
- task definition: An ECS task definition is equivalent to a `docker-compose.yml`, defining a multi-container application.
- service: An ECS service manages many ECS tasks, handling deployments of new task definitions, autoscaling, and task monitoring.
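These concepts map roughly onto the awscli as follows (the cluster, service, and task definition names below are made up for illustration):

aws ecs register-task-definition --cli-input-json file://myapp-taskdef.json   # creates a new task definition revision
aws ecs create-service --cluster dev --service-name myapp --task-definition myapp:1 --desired-count 2
aws ecs list-tasks --cluster dev --service-name myapp   # the running tasks managed by that service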
Cross-environment task definitions
One of the first things I wanted to do was share the same task definition across environments. Say I have three clusters: dev, stage, and prod. An ECS service is defined per cluster, but I want to avoid creating a separate task definition for each environment. To run different application versions in each environment, we can simply point each environment’s service at a different task definition revision.
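For example, each cluster’s service can be pinned to whichever revision of the shared task definition that environment should run (the family name and revision numbers here are made up):

aws ecs update-service --cluster dev --service myapp --task-definition myapp:42    # latest build
aws ecs update-service --cluster prod --service myapp --task-definition myapp:40   # known-good build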
I followed the advice in this article [1] and handled this with a combination of a directory that I mount into each container as a volume and a custom entrypoint script for each container. The directory contains a shell script which defines what environment we are running in, and the entrypoint script knows which arguments to pass to the app based on that environment.
How do I get the data into the node directory? Via Ansible.
Ansible Local
I have two problems to solve:
- Each node needs to have a directory containing information about the node’s environment and/or the cluster it’s running on.
- I want to support spot instances and autoscaling.
I could support both of these solely with AMIs, but I also want to be able to update nodes in place, so I decided to run Ansible on each node over a local connection.
I use Packer to create an AMI based on the most recent ECS-optimized AMI, using a `source_ami_filter` like:
{
  "source_ami_filter": {
    "filters": {
      "virtualization-type": "hvm",
      "name": "amzn-ami-*-amazon-ecs-optimized",
      "root-device-type": "ebs"
    },
    "owners": ["amazon"],
    "most_recent": true
  }
}
and include an Ansible provisioner like:
{
  "provisioners": [{
    "type": "ansible",
    "playbook_file": "playbooks/ecs.yml"
  }]
}
The provisioner installs Ansible locally on the node, installs cron tasks to run Ansible repeatedly, and adds an rc.local file to run Ansible on startup. It also copies playbook files that:
- Pull updated playbooks from a location in S3
- Create a directory containing environment information which can be mounted as a container volume
- Perform other important configuration, such as rsyslog and NTP setup
- Install cron tasks that run Ansible every N minutes
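As a rough sketch, the cron entry and the script it runs might look like the following (the bucket name, paths, and schedule are placeholders, not my actual values):

# /etc/cron.d/ansible-local
*/15 * * * * root /usr/local/bin/run-ansible-local.sh >> /var/log/ansible-local.log 2>&1

# /usr/local/bin/run-ansible-local.sh
#!/bin/bash
set -euo pipefail
aws s3 sync s3://my-playbook-bucket/playbooks /opt/playbooks    # pull updated playbooks from S3
ansible-playbook -i localhost, -c local /opt/playbooks/ecs.yml  # apply them to this node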
Creating Environment Config
I mentioned that the Ansible scripts create an environment config file in a directory which can be mounted as a container volume.
In the tasks below, we retrieve the instance tags for the node we are on and look for a tag called `env`. We set a fact called `ec2_tagged_environment`, which is referenced by the template `node.config.j2`.
- name: Load EC2 Instance Facts
  block:
    - ec2_metadata_facts:
    - ec2_tag:
        state: list
        resource: "{{ ansible_ec2_instance_id }}"
        region: "{{ ansible_ec2_placement_region }}"
      register: ec2_tag_list
    - set_fact:
        ec2_tagged_environment: "{{ ec2_tag_list.tags.env }}"
      when: "'env' in ec2_tag_list.tags"

- template: src=node.config.j2 dest=/etc/container_config/node.config
`node.config.j2` can be a shell script that can be sourced, assuming your entrypoint is a shell script:
#
MY_ENVIRONMENT="{{ ec2_tagged_environment }}"
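For illustration, an entrypoint along these lines could source that file and pick its arguments per environment (the application path, flags, and config names are made up):

#!/bin/sh
# entrypoint.sh - sketch only
. /etc/container_config/node.config   # sets MY_ENVIRONMENT

case "$MY_ENVIRONMENT" in
  prod)  exec /app/myapp --config /app/conf/prod.yml ;;
  stage) exec /app/myapp --config /app/conf/stage.yml ;;
  *)     exec /app/myapp --config /app/conf/dev.yml ;;
esac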
Cron Jobs
I’ve been unsure about how to implement cron tasks. Right now I have split each cron task out into its own container and have CloudWatch Events trigger task runs. This works, but has the unfortunate side effect of leaving a lot of stopped containers around, which ruins an otherwise useful metric.
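For reference, the CloudWatch Events wiring can be set up with the awscli roughly as follows (the rule name, schedule, account IDs, ARNs, and role are placeholders):

aws events put-rule --name nightly-report --schedule-expression "cron(0 3 * * ? *)"
aws events put-targets --rule nightly-report --targets '[{
  "Id": "1",
  "Arn": "arn:aws:ecs:us-east-1:123456789012:cluster/prod",
  "RoleArn": "arn:aws:iam::123456789012:role/ecsEventsRole",
  "EcsParameters": {"TaskDefinitionArn": "arn:aws:ecs:us-east-1:123456789012:task-definition/nightly-report:1", "TaskCount": 1}
}]'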
Three alternatives I may try are:
- Convert the cron jobs to use AWS Lambda
- Try to run `crond` as the entrypoint to the container (I’m not sure how container stopping will work with the tasks that `crond` spawns)
- Try to use some non-`crond` scheduler
Monitoring
I recommend using logspout with your logging platform of choice. The DataDog agent is also great to have on each node, running as a container so that it can pick up container metrics automatically. Since some of the ECS-related DataDog metrics are infrequently updated, it’s useful to have CloudWatch dashboards and alarms for the more important metrics.
To use the containers above, we have to make sure one of each is running on every node at all times. This can typically be accomplished via an ECS service, but that becomes problematic when autoscaling is involved. This is another thing we can handle with Ansible.
We can have Ansible do the following each time it is run:
- Run a `docker ps` to list running containers
- Search the list for logspout and DataDog
- Start the appropriate task using the awscli (`aws ecs start-task`)
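A sketch of that check (the container name, task definition name, and the jq dependency are my assumptions):

#!/bin/bash
# ensure-monitoring.sh - sketch only
METADATA=$(curl -s http://localhost:51678/v1/metadata)   # ECS agent introspection API
CLUSTER=$(echo "$METADATA" | jq -r .Cluster)
INSTANCE_ARN=$(echo "$METADATA" | jq -r .ContainerInstanceArn)

if ! docker ps --format '{{.Names}}' | grep -q logspout; then
  aws ecs start-task --cluster "$CLUSTER" --task-definition logspout --container-instances "$INSTANCE_ARN"
fi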
Spot Instances
For a dev environment, using all spot instances may be acceptable. For production, it’s probably not. To handle that case, I suggest creating both a spot fleet and an autoscaling group. The autoscaling group should scale based on sustained CPU or memory utilization. The spot fleet can provide extra capacity for non-critical workloads, or it can be used in conjunction with autoscaling policies to handle peak load.
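A target-tracking policy on the autoscaling group is one way to express that sustained-CPU rule; a sketch (the group name and target value are placeholders):

aws autoscaling put-scaling-policy \
  --auto-scaling-group-name ecs-prod-asg \
  --policy-name cpu-target-tracking \
  --policy-type TargetTrackingScaling \
  --target-tracking-configuration '{"PredefinedMetricSpecification": {"PredefinedMetricType": "ASGAverageCPUUtilization"}, "TargetValue": 60.0}'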
Application Load Balancer (ALB)
ALB has been a great way to route requests to my containers. When you create an ECS service whose container uses a dynamic port, you can specify an associated target group. When tasks start, the dynamic port used by the container is registered with the target group. That target group can be associated with an ALB listener and routed to by Host header (among other things).
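The pieces involved can be sketched with the awscli (the names, VPC ID, and ARNs are placeholders):

# Target group that the ECS service registers its dynamic ports into
aws elbv2 create-target-group --name myapp-prod --protocol HTTP --port 80 \
  --vpc-id vpc-0123456789abcdef0 --target-type instance

# Listener rule forwarding requests for a given Host header to that target group
aws elbv2 create-rule --listener-arn "$LISTENER_ARN" --priority 10 \
  --conditions Field=host-header,Values=myapp.example.com \
  --actions Type=forward,TargetGroupArn="$TARGET_GROUP_ARN"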
Dynamic Ports
Speaking of ports, one of the biggest sources of frustration I’ve encountered was trying to get JMX monitoring working with a container using a dynamic port. This is partially a problem with the JMX protocol, in that it embeds the port in the protocol itself, and partially with ECS, in that there doesn’t seem to be a direct way to let a container know what dynamic port has been assigned to it. I went in circles with this one and never found a good solution.
At this point, I have given up on being able to connect directly to a container using JMX; however, JMX metrics could be retrieved by using a sidecar container and an EXPOSE declaration with each service that exposes JMX.
Conclusion
I’ve found ECS to be a good-enough solution to the container management problem. There’s tight integration with VPCs, security groups, and role-based authorization, and it feels very cohesive with the rest of the AWS ecosystem. Although I’ve looked into Amazon Elastic Container Service for Kubernetes (EKS), it seems better suited for those not heavily invested in other AWS services.