Running a Docker Container on AWS EC2

30 Aug 2018 - Tags: aws, docker, and tools


Amazon Web Services’ Elastic Compute Cloud (AWS EC2) is Amazon’s cloud computing platform, which allows users to rent server time to run their own applications. EC2 provides a selection of hardware configurations (or instance types), so customers can choose the one which best suits their needs. For example, a scientific modelling application might use a compute-optimized EC2 instance; an in-memory database might use a memory-optimized EC2 instance; an application training deep neural networks might use an EC2 instance with GPUs; and a Hadoop application might use a storage-optimized EC2 instance. This allows customers to rent a hardware environment which suits their needs without having to purchase that hardware up front.

In this post, we’ll see how to set everything up for a new AWS account, launch an instance, run a Docker container on it, upload our data, run some Python code, download the results, and shut down our instance.


Create an AWS Account

To run an EC2 instance, you’ll first need an Amazon Web Services account. You can sign up for a Free Tier account at the AWS website by clicking on the “Create an AWS Account” button and following the instructions.

Create an IAM User

After creating an account, you should create an Identity and Access Management (IAM) user through which you can administer your rented instances.

  1. Sign in to the IAM console using your AWS account information.
  2. In the panel on the left, choose “Users”, and then select the “Add user” button.
  3. For the User name, enter “Administrator” (or whatever you want, really).
  4. Select the “AWS Management Console access” checkbox.
  5. Select the “Custom password” radio button and enter the password you’d like to use for this IAM user.
  6. Unselect the “User must create a new password at next sign-in” checkbox (unless you’re making the account for someone else, in which case this will enable them to set their own password).
  7. Then select “Next: Permissions”.
  8. Select “Add user to group”, and choose “Create group”.
  9. For the Group name, enter “Administrators” (or, again, whatever you want).
  10. In the table of policies, select the check box next to the AdministratorAccess policy (with the description “Provides full access to AWS services”).
  11. Select “Create group”.
  12. Select the checkbox next to your newly-created group.
  13. Select “Next: Review”, and then “Create user”.
  14. You can now log out of the AWS console, and log back in as the IAM Administrator user you just created. Go to https://<your_aws_account_id>.signin.aws.amazon.com/console/, where <your_aws_account_id> is your AWS account number without the hyphens.
  15. Sign in using the IAM user username (Administrator), and the password you set for that user.
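The sign-in address in step 14 is just your account number with the hyphens removed, dropped into a fixed URL pattern. Here is a minimal sketch of that substitution; the function name and the example account number are made up for illustration.

```python
def iam_signin_url(account_id: str) -> str:
    """Build the IAM sign-in URL from an account number, dropping hyphens."""
    digits = account_id.replace("-", "").strip()
    if not digits.isdigit():
        raise ValueError(f"expected only digits and hyphens, got {account_id!r}")
    return f"https://{digits}.signin.aws.amazon.com/console/"


# "1234-5678-9012" is a made-up account number used only for illustration.
print(iam_signin_url("1234-5678-9012"))
```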

Create a Key Pair

To secure your connection to your instance, you’ll want to create a cryptographic key pair. This will allow you to connect to your instance using the private key via SSH.

  1. Sign in to the AWS EC2 console using your IAM Administrator user information.
  2. In the panel on the left, in the “NETWORK & SECURITY” section, select “Key Pairs”.
  3. In the upper-right, select the region (e.g. “US East (Ohio)” or “US West (Oregon)”) for which you want to generate the key pair. It’s probably fine to leave it as is. However, if you want to run instances in different regions, you’ll need to generate a new key pair for each region - key pairs are region-specific.
  4. Click the blue “Create Key Pair” button.
  5. Enter a name for your new key pair. I’ll refer to the name you chose below as <key-pair-name>. Amazon recommends using <username>-key-pair-<region>, where <username> is your IAM username (or some other name that you’ll remember), and <region> is the AWS region you’ll be connecting to (the region shown in the upper-right of the page).

The private key will be downloaded by your browser. Open up a terminal and set user permissions for the file:

chmod 400 <key-pair-name>.pem

If you’re on Windows, you can either use PuTTY (see the “To prepare to connect to a Linux instance from Windows using PuTTY” section of this page), or, as I’d personally recommend, use cmder, which comes with an SSH client and a bash emulator that let you run that chmod command directly.

Save the private key file in a safe place, because you won’t be able to download it again! However, you can always create a new key pair when you launch an instance.
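If you’d rather script the naming convention and the permission change than type them, here is a rough Python equivalent using only the standard library. The helper names are my own, and the permission check assumes a Unix-like file system (it won’t behave the same on Windows).

```python
import os
import stat


def key_pair_name(username: str, region: str) -> str:
    """Name a key pair following the <username>-key-pair-<region> convention."""
    return f"{username}-key-pair-{region}"


def make_owner_read_only(pem_path: str) -> None:
    """Equivalent of `chmod 400 <pem_path>`: only the owner may read the file."""
    os.chmod(pem_path, stat.S_IRUSR)


def is_owner_read_only(pem_path: str) -> bool:
    """True if the file's permission bits are exactly 400."""
    return stat.S_IMODE(os.stat(pem_path).st_mode) == 0o400
```

For example, key_pair_name("Administrator", "us-east-2") gives a name for the Ohio region, and make_owner_read_only can then be applied to the downloaded .pem file with that name.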

Create a Security Group

You’ll also want to create what’s called a “security group”, which sets limits on what kind of network traffic your instance can receive.

  1. From the upper-right of the EC2 console, select the region for the security group you’re about to create (e.g. “US East (Ohio)”). This should be the same region for which you created the key pair (and will probably already be set to the one you want).
  2. In the panel on the left, under “NETWORK & SECURITY”, select “Security Groups”.
  3. Click the blue “Create Security Group” button.
  4. For the Security group name, enter a name for your new security group. Amazon recommends <name>_SG_<region>, where <name> is some name you’ll remember, and <region> is the AWS region for which you’re creating the security group (because, like key pairs, security groups are region-specific).
  5. Enter a description of your security group (e.g. “My default SG for Ohio”)
  6. In the “Inbound” tab, create these three rules:

       Type    Source
       HTTP    Anywhere
       HTTPS   Anywhere
       SSH     My IP

These rules ensure that your instance will accept HTTP and HTTPS requests from any IP, but only SSH connections from the IP address of the computer you’re currently using, which helps keep your instance secure. You can of course change this to accept whatever IP addresses or range of addresses from which you want to connect to your instance.

Finally, click the blue “Create” button to create the security group.
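To make the effect of the three inbound rules (HTTP and HTTPS from anywhere, SSH from your own IP only) concrete, here is a small simulation of how they filter traffic. It’s only a sketch of the idea, not how AWS evaluates rules internally; the ports are the standard ones for each protocol, and the “My IP” address is a placeholder from a documentation-only range.

```python
import ipaddress

# Hypothetical "My IP" value; 203.0.113.7 is from a documentation-only range.
MY_IP = "203.0.113.7"

# (rule type, TCP port, allowed source network)
INBOUND_RULES = [
    ("HTTP", 80, "0.0.0.0/0"),     # Source = Anywhere
    ("HTTPS", 443, "0.0.0.0/0"),   # Source = Anywhere
    ("SSH", 22, f"{MY_IP}/32"),    # Source = My IP
]


def is_allowed(port: int, source_ip: str, rules=INBOUND_RULES) -> bool:
    """Return True if some rule admits traffic to `port` from `source_ip`."""
    address = ipaddress.ip_address(source_ip)
    return any(
        port == rule_port and address in ipaddress.ip_network(source)
        for _, rule_port, source in rules
    )
```

With these rules, is_allowed(80, "198.51.100.20") is true for any visitor, while is_allowed(22, "198.51.100.20") is false because SSH is only open to MY_IP.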

(Optional) Request Service Limit Increase

Amazon Web Services’ free tier includes 750 hours per month of one specific instance type, t2.micro. This instance has only 1 virtual CPU and 1 GB of RAM. This is enough to test things out before trying them on a beefier instance type for which you will be charged (and even enough to run a simple webserver), but not enough to do any serious data processing.

To actually run your code on a beefier instance than t2.micro, you will have to request a limit increase for the instance type you want to use. The default limit is 1 t2.micro instance (at a time), and 0 for all other instance types. This is both for users’ protection (so customers don’t accidentally fire up 1000 expensive instances and rack up thousands of dollars in charges), and also for Amazon’s own protection (so they don’t have to waste server time running those instances which were accidentally fired up or never terminated).

To request a limit increase,

  1. Go to the EC2 console.
  2. In the panel on the left side of the page, go to “Limits”.
  3. Find the instance type you want to run, and select the “Request limit increase” link next to it. You’ll be brought to a “Create case” page in the service center.
  4. Select the “Service Limit Increase” radio button.
  5. From the “Limit type” drop-down, select “EC2 Instances”. This will give you some additional drop-downs.
  6. From the “Region” drop-down, select the service region (e.g. Ohio) for which you want to change your instance limit.
  7. From the “Primary Instance Type” drop-down, select the instance type whose limit you want to change. To decide which instance you want to use, take a look at the different instance types and their pricing.
  8. From the “Limit” drop-down, select “Instance Limit”.
  9. In the “New limit value” textbox, enter the max number of instances (of the type you selected) that you want to be able to run simultaneously.
  10. In the “Use Case Description”, enter a short description of what you’ll be using this instance type for (e.g. “data science / feature engineering”).
  11. Click the “Submit” button.

After a few days, you’ll hear back from Amazon Web Services about your service limit increase. They may or may not approve the increase - they usually only approve incremental increases. For example, they’ll likely approve an increase to use m5.xlarge, but may not approve an increase to use m5.4xlarge until you’ve actually run jobs on larger instances a few times.

It’s also worth checking your instance limits (in the EC2 console, go to “Limits” in the pane on the left) after your service limits have been increased. I found that AWS actually increased my limits across the board for lower-to-mid level instance types after I requested a limit increase only for m5.large.

Once your service instance limit has been increased for your desired instance, you can run an instance of that type!

Launch an EC2 Instance

Now that we have things set up, we can launch our AWS virtual machine (called an “instance”). From the EC2 console, click the blue “Launch Instance” button.

Select the Amazon Machine Image (AMI) you want to use. If you want to ensure you’re using only AMIs you won’t be charged for, select the “Free tier only” checkbox on the left. We’ll use the “Amazon Linux 2” AMI.

Select the Instance Type you want and click the blue “Review and Launch” button. For testing things out, you’ll probably want to select the t2.micro instance type (which is free). However, for running more serious data analyses, you’ll want to select an instance type which has enough memory and processing power. Here’s a list of AWS instance types and their pricing.

Remember that instances other than t2.micro are not free! So don’t launch an instance of any other type just yet. I’d suggest trying everything in this post out using the t2.micro instance first (which, again, is free), and only going back and starting a paid instance once you know what you’re doing.

Now we’ll set our instance to belong to the security group we created earlier. Under “Security Groups”, click the “Edit security groups” link.

Select the “Select an existing security group” radio button, and select the box next to the security group you created earlier.

If you want to connect to your instance from a different IP address than the one you were using when you set up the security group, you’ll have to either edit your security group, or create a new security group which allows connections from your new IP.

To create a new security group,

  1. Select the “Create a new security group” radio button.
  2. Give your new security group a name, if you want.
  3. Enter the same three rules as above (Type=HTTP,Source=Anywhere; Type=HTTPS,Source=Anywhere; Type=SSH,Source=My IP)

Or, to edit an existing security group (you only have to do this if you’re connecting from a different IP than you were when you set up the security group):

  1. Go back to the EC2 console and select “Security groups” from the panel on the left.
  2. Select the box next to the security group you want to change.
  3. Click the “Actions” button and select “Edit inbound rules”.
  4. Change the “Source” of the SSH rule to be “My IP”.
  5. Click the blue “Save” button.

After you’ve selected a security group (or created a new one), click the blue “Review and Launch” button.

Click the blue “Launch” button.

Here you can either use the key pair you created earlier, or create a new one (for example, if you no longer have access to the earlier private key file). To use the key pair you created earlier, select “Choose an existing key pair” from the drop-down, and then select the name of the key pair from the second drop-down. To create a new one, select “Create a new key pair” from the drop-down, give it a name, select “Download Key Pair”, and repeat the chmod command from earlier on the downloaded file.

Finally, click the acknowledgement checkbox and click the blue “Launch Instances” button. Then click the blue “View Instances” button (or, from the EC2 console, go to “Instances” in the panel on the left) to see information about your instance, which should now be up and running!
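The console clicks above come down to four choices: an AMI, an instance type, a security group, and a key pair. If you later want to script the launch, here is a hedged sketch of how those same choices could be passed to boto3 (the AWS SDK for Python). It assumes boto3 is installed and your AWS credentials are configured; the function names are mine, and the AMI and security group IDs you pass in are your own values from the console.

```python
def launch_parameters(ami_id, key_pair_name, security_group_id,
                      instance_type="t2.micro"):
    """Collect the keyword arguments for boto3's EC2 `run_instances` call."""
    return {
        "ImageId": ami_id,                        # the AMI you picked
        "InstanceType": instance_type,            # t2.micro is the free one
        "KeyName": key_pair_name,                 # key pair created earlier
        "SecurityGroupIds": [security_group_id],  # security group created earlier
        "MinCount": 1,
        "MaxCount": 1,
    }


def launch(parameters, region):
    """Start one instance and return its ID (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so launch_parameters works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.run_instances(**parameters)
    return response["Instances"][0]["InstanceId"]
```

Defaulting the instance type to t2.micro keeps an accidental launch within the free tier.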

Connect to an EC2 Instance

To connect to your AWS instance via SSH, run (in a terminal):

ssh -i <key-pair-name>.pem ec2-user@<publicDNS>

where <publicDNS> is the public DNS of your instance. You can see the public DNS of your instance by going to the Instances page (in the EC2 console, in the panel on the left under “INSTANCES”, click on “Instances”), and selecting the instance you want. In the lower right of the page, in the “Description” tab, there will be a domain after “Public DNS (IPv4)”, for example ec2-99-999-99-99.us-east-2.compute.amazonaws.com. Use this as the <publicDNS> when logging in via SSH.
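If you connect often, it can be handy to assemble the SSH command from the key pair name and the public DNS rather than retyping it. A minimal sketch (the helper name and the key pair name are illustrative; the DNS is the placeholder from above):

```python
import shlex


def ssh_command(key_pair_name, public_dns, user="ec2-user"):
    """Return the argument list for `ssh -i <key-pair-name>.pem <user>@<publicDNS>`."""
    return ["ssh", "-i", f"{key_pair_name}.pem", f"{user}@{public_dns}"]


command = ssh_command(
    "Administrator-key-pair-us-east-2",
    "ec2-99-999-99-99.us-east-2.compute.amazonaws.com",
)
print(shlex.join(command))  # the command as you would type it in a terminal
```

Passing the list to subprocess.run(command) would open the connection from Python; the list form avoids any shell-quoting surprises.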

Again, if you’re on Windows, you’ll have to use PuTTY, cmder, or some other SSH client for Windows.

Install Docker, Build, and Run a Container

The next thing we’ll want to do is start a Docker container in which to run our code. We’ll use a Docker container not only because it makes execution of our code reproducible, but also because it automates the installation of all the software we’ll need.

We’ll use the yum package manager to install docker. First update the packages already installed on the instance by running (within your shell after SSH-ing into your instance):

sudo yum update -y

Then install docker with:

sudo yum install -y docker

And then start the docker service:

sudo service docker start

Finally, add ec2-user to the docker group so you don’t have to type sudo before every docker command:

sudo usermod -a -G docker ec2-user

Log out of your instance and log back in again so the group change takes effect (type exit and then reconnect with ssh -i <key-pair-name>.pem ec2-user@<publicDNS>).

Make sure docker has started up successfully by running the following command, which should show information about the docker installation, including how many containers and images are on the instance:

docker info
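If you want a script to check this rather than reading the output yourself, here is a small sketch that treats a successful exit of docker info as “ready”. I’m assuming that a missing docker executable or a non-zero exit status means the service isn’t usable yet; the function name is my own.

```python
import shutil
import subprocess


def docker_is_ready(timeout_seconds: float = 10.0) -> bool:
    """True if the docker CLI is installed and `docker info` exits successfully."""
    if shutil.which("docker") is None:
        return False  # CLI not installed or not on PATH
    try:
        result = subprocess.run(
            ["docker", "info"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


print("docker ready:", docker_is_ready())
```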

Now you can either create a new docker image from scratch, or run one that’s already been created. To run a docker container from an image on Docker Hub (the docker equivalent of GitHub), run:

docker run -it <owner>/<image>

Where <owner> is the owner on Docker Hub of the image you want to run, and <image> is the image’s name. The command will automatically download the image from Docker Hub and start a container from it. The i and t options keep the container’s standard input open and attach a terminal to it, so the container runs interactively and you get dropped into a console within it. For example, to use Kaggle’s docker image for Python, run (though note that this won’t work on a t2.micro instance because it doesn’t have enough memory!):

docker run -it kaggle/python

To use the image we’ll be building below, you can run the image from my Docker Hub repository:

docker run -it winsto99/featuretools

You can also create your own docker image from scratch! To run automated feature engineering with featuretools, we’ll only need numpy, pandas, and featuretools (and their dependencies). So, we can build a relatively simple Dockerfile:

FROM ubuntu:latest

RUN apt-get update
RUN apt-get install -y python3 python3-pip

RUN pip3 install numpy \
    pandas \
    featuretools

Save the above code to a file called Dockerfile.

To build the docker image from that Dockerfile (if it’s the only file named Dockerfile in your current directory), run

docker build -t <image-name> .

where <image-name> is the name you want to give your docker image (for example, featuretools).

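If you have multiple Dockerfiles with different file names in the same directory, you can build a specific one by redirecting that file into the build command (note that a build from standard input has no build context, so the Dockerfile can’t COPY or ADD local files):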

docker build -t <image-name> - < <dockerfile-name>

where <dockerfile-name> is the filename of the dockerfile you want to build.

You can see all the images you have built or downloaded to your instance by running

docker images

You can now push the docker image to your Docker Hub repository. This both makes it easier for you to use it in the future (you can just docker pull the image instead of building it from scratch), and also makes it easy for you to share the image with others (they can just docker pull the image). If you don’t have a Docker Hub account, you can head over to the Docker Hub website and create a free account.

Back at your EC2 instance shell, log in to your Docker Hub account by running

docker login

and entering your Docker Hub username and password.

Then, tag your image with

docker tag <image-name> <username>/<image-name>

Where <username> is your Docker Hub username, and <image-name> is the name of the image you want to push to Docker Hub.

Finally, you can push your image to Docker Hub with

docker push <username>/<image-name>
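The tag and push steps always go together, so they’re easy to wrap in a small helper. This is just a sketch: the function names are mine, it defaults to a dry run that only prints the commands, and running it for real assumes you’ve already done docker login.

```python
import subprocess


def publish_steps(image_name: str, username: str):
    """Return the tag and push commands, in the order they should run."""
    repository = f"{username}/{image_name}"
    return [
        ["docker", "tag", image_name, repository],
        ["docker", "push", repository],
    ]


def publish(image_name: str, username: str, dry_run: bool = True) -> None:
    """Print each command; when dry_run is False, also execute it."""
    for step in publish_steps(image_name, username):
        print(" ".join(step))
        if not dry_run:
            subprocess.run(step, check=True)  # stop at the first failure


publish("featuretools", "winsto99")  # dry run: prints, executes nothing
```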

To run the image you built, run:

docker run -it <image-name>

Or, if you haven’t built the image on this instance but you’ve previously pushed it to Docker Hub, run:

docker run -it <username>/<image-name>

Now you’ve been dropped into a bash shell running in your Docker container!

Note that an even better way to use Docker containers with EC2 instances, especially if you want to run multiple instances, is Amazon’s Elastic Container Service. But, that topic is for another day!

Upload Data to an EC2 Instance

Now you need to copy your data from your local machine (or wherever) to your EC2 instance. There are a few different ways you can do this - we’ll go through them one by one.

Using scp

To load external data via scp, at a shell from within your container, run:

scp <remote-username>@<remote-domain>:<remote-path> <ec2-path>

where <remote-username>@<remote-domain> is the username and domain of a server from which you want to copy data, <remote-path> is the path on that server for the data you want to copy, and <ec2-path> is the destination path for the copied data.

A few notes on flags you might consider using (add the flag after scp):

Using Amazon S3

To load data from Amazon S3 (Simple Storage Service - the first 5GB and 20,000 file retrievals are free, I believe), you’ll need to use the boto3 package, the AWS SDK for Python. See the instructions here on how to install, set up, and use boto3.
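As a rough idea of what the boto3 side looks like, here is a minimal sketch that downloads a single object. It assumes boto3 is installed and your AWS credentials are already configured; the helper names and any bucket or key you pass in are placeholders of my own.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str):
    """Split 's3://bucket/path/to/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def download_from_s3(uri: str, local_path: str) -> None:
    """Download one object (needs boto3 installed and AWS credentials)."""
    import boto3  # imported lazily so split_s3_uri works without boto3

    bucket, key = split_s3_uri(uri)
    boto3.client("s3").download_file(bucket, key, local_path)
```

For example, download_from_s3("s3://example-bucket/data/train.csv", "train.csv") would fetch that object into the current directory, where example-bucket stands in for a bucket you own.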

Using docker cp

If you want to transfer data from your local machine to your home folder on your AWS instance, run (in a terminal on your local machine):

scp -i <key-pair-name>.pem <filename> ec2-user@<publicDNS>:~/

where <filename> is the file you want to transfer.

But then the data is only on your virtual machine, not within the docker container! To copy data into a running docker container, SSH into your instance and run:

docker cp <filename> <container_name>:/desired/path/<filename>

where <filename> is the file name of the data you want to transfer, /desired/path is the path within your docker container where you want the data to be copied to, and <container_name> is the name of the running container. You can see information about running containers by running

docker ps

The value in the “NAMES” column is what you want to use for the <container_name>. Docker comes up with some… interesting default names for containers, like “peaceful_wozniak” and “hopeful_heisenberg”. If you don’t like the randomly-generated names, you can give your container a name of your own by passing the --name <custom_name> argument to docker run.
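Looking up the container name and assembling the docker cp command can also be scripted. The sketch below relies on docker ps accepting a --format template so that it prints only the names; the helper names are my own, and running_container_names needs docker to be installed and running.

```python
import subprocess


def parse_container_names(ps_output: str):
    """Turn the output of `docker ps --format '{{.Names}}'` into a list of names."""
    return [line.strip() for line in ps_output.splitlines() if line.strip()]


def running_container_names():
    """Ask docker for the names of running containers (requires docker)."""
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        capture_output=True, text=True, check=True,
    )
    return parse_container_names(result.stdout)


def docker_cp_command(filename: str, container_name: str, desired_path: str):
    """Build `docker cp <filename> <container_name>:<desired_path>/<filename>`."""
    target = f"{container_name}:{desired_path.rstrip('/')}/{filename}"
    return ["docker", "cp", filename, target]
```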

Using Docker Volumes

However, instead of uploading data from your computer and then copying it to the container, a perhaps simpler alternative is to use docker volumes to share data between the host and container. When starting the container, you can specify folders which will be shared between the host and the container. That way, when you upload data to your EC2 host (via scp or any other method), you’ll be able to access it from within the docker container. When you want to upload data from a local machine, this seems to me to be the simplest option.

When starting the container, run:

docker run -it -v /<dir-in-host>/:/<dir-in-container>/:rw <image-name>

This will allow you to access anything in the dir-in-host folder on the host instance from the dir-in-container folder inside the docker container!

For example,

docker run -it -v /home/ec2-user/data/:/root/data/:rw winsto99/featuretools

will create the folder /home/ec2-user/data in your EC2 instance, and /root/data in the docker container running on that instance. Anything you put in /home/ec2-user/data from outside the container, you’ll be able to access in /root/data from within the container, and vice-versa.
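One thing to watch with -v is that both sides should be absolute paths; as far as I know, Docker treats a host side that isn’t an absolute path as the name of a volume rather than a folder. Here is a small sketch that builds the -v value and catches that mistake (the helper name is mine):

```python
import posixpath


def volume_flag(host_dir: str, container_dir: str, mode: str = "rw") -> str:
    """Build the value passed to `docker run -v` for a shared folder."""
    if mode not in ("rw", "ro"):
        raise ValueError("mode must be 'rw' or 'ro'")
    for path in (host_dir, container_dir):
        if not posixpath.isabs(path):
            raise ValueError(f"expected an absolute path, got {path!r}")
    return f"{host_dir}:{container_dir}:{mode}"


print(volume_flag("/home/ec2-user/data/", "/root/data/"))
```

Passing mode="ro" instead makes the folder read-only inside the container, which can be a sensible default for input data.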

Using the Kaggle CLI

Finally, as a special case: if you want to download the data for a Kaggle competition, you can copy your Kaggle API token kaggle.json to the container running on your EC2 instance (see the Kaggle API page about getting that set up if you’re interested), install the Kaggle package with

pip3 install kaggle

and then download the data via the Kaggle CLI with

kaggle competitions download -c <competition-name>

or

kaggle datasets download -d <username>/<dataset_name>

depending on what you want to download.
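Once kaggle.json is inside the container, it needs to be somewhere the Kaggle CLI will find it. The sketch below assumes the CLI looks for the token at ~/.kaggle/kaggle.json and prefers it to be readable only by its owner; that’s my understanding, so check the Kaggle API page if it doesn’t work. The helper name is my own.

```python
import os
import shutil
import stat


def install_kaggle_token(token_path: str, home_dir: str) -> str:
    """Copy kaggle.json to <home_dir>/.kaggle/ and restrict it to its owner."""
    kaggle_dir = os.path.join(home_dir, ".kaggle")
    os.makedirs(kaggle_dir, exist_ok=True)
    destination = os.path.join(kaggle_dir, "kaggle.json")
    shutil.copyfile(token_path, destination)
    os.chmod(destination, stat.S_IRUSR | stat.S_IWUSR)  # same as chmod 600
    return destination
```

Inside the container this might be called as install_kaggle_token("kaggle.json", os.path.expanduser("~")).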

Run Code on an EC2 Instance

Now you can run your code inside the container on the virtual machine, just like you would on your personal computer! Running a Jupyter notebook server can be a bit more complicated, since you need to expose a port or just use a pre-built image.

Download Data from an AWS Instance

EC2 instances can be left continuously running, which is what you’ll want to do if you’re running a server or a website or a service. However, if you just ran a batch of analysis code, you’ll want to download the results. If you shut down your instance before saving the results somewhere, that data will be lost! After your analysis code has finished running and produced output data, you can either copy the data from your container to a server using scp, write the data to an S3 bucket using boto3, or use some other method.
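If you’re copying results back to your own computer with scp, it’s a two-step job when the results live inside the container: first copy them out of the container onto the instance, then fetch them from the instance. A sketch of the two commands (the helper name and arguments are illustrative, and the first command is assumed to be run from ec2-user’s home directory):

```python
def download_steps(container_name, container_path, filename,
                   key_pair_name, public_dns):
    """Return (command to run on the instance, command to run locally)."""
    # Step 1, on the EC2 instance: copy the file out of the container.
    on_instance = [
        "docker", "cp",
        f"{container_name}:{container_path}/{filename}", filename,
    ]
    # Step 2, on your local machine: fetch the file from the instance.
    on_local = [
        "scp", "-i", f"{key_pair_name}.pem",
        f"ec2-user@{public_dns}:~/{filename}", ".",
    ]
    return on_instance, on_local
```

If you started the container with a shared folder, the first step isn’t needed, since the results are already on the instance.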

Shut Down an AWS Instance

When you’re done using your instance, you should terminate it. With a free tier instance, this is both polite and practical - you won’t be able to launch another free tier instance until you’ve terminated the one you have running (i.e., you are only allowed to have one running free tier instance at any given time).

This is even more important when using a non-free-tier instance, because you’re charged for each second the instance runs!
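To get a feel for how quickly a forgotten instance adds up, here is a back-of-the-envelope calculation. It simply prorates an hourly rate by the second; the rate below is made up for illustration, so look up the real price of your instance type.

```python
def estimated_charge(hourly_rate_usd: float, seconds_running: int) -> float:
    """Estimate the charge for an instance billed per second at an hourly rate."""
    if seconds_running < 0:
        raise ValueError("seconds_running must be non-negative")
    return hourly_rate_usd * seconds_running / 3600


# 0.20 USD/hour is a made-up rate for illustration, not a real AWS price.
weekend_seconds = 2 * 24 * 3600  # an instance left running for two days
print(f"{estimated_charge(0.20, weekend_seconds):.2f} USD")
```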

To shut down an instance,

  1. Go to the EC2 console.
  2. In the panel on the left, under “INSTANCES”, select “Instances”.
  3. Select the box next to the instance you wish to terminate.
  4. Click the “Actions” button at the top, and select “Instance State” -> “Terminate”.
  5. A confirmation dialog will pop up - hit the blue “Yes, Terminate” button to end the instance.