★❤✰ Vicki Boykis ★❤✰

How to prepare an AWS test image for PyTorch

I’ve been getting started in open-source development with PyTorch, beginning with running and testing the PyTorch examples in distributed mode. Big thanks to Mark for reviewing and merging my first PR.

My current MacBook Pro doesn’t support using PyTorch with GPUs, although as of PyTorch 1.12 that’s changed, and you can read Sebastian’s review of his experience with it here. But for me, the easiest way to run PyTorch and its associated tests is to spin up a relatively small GPU-based instance in AWS for testing (GCP offers similar functionality) and then tear it down.

I’ve updated the instructions in the official docs, but thought I’d add them here as well, mostly as reference to myself for how to do it.

These instructions assume you have:

  1. An AWS account and
  2. AWS CLI set up
  3. A unique key-pair for logging into your instance

We’ll be spinning up a g4dn.4xlarge instance. This is one of the lowest-cost GPU instance types and is fine for a couple of hours of testing example runs. If your model is large and memory-intensive, you’ll want something larger.

Current specs (as of summer 2022) are: 1 GPU, 16 vCPUs, 64 GiB of memory, 225 GB NVMe SSD, up to 25 Gbps network performance. The cost is $1.20/hour, so if you accidentally leave it running for a month, it’s going to cost ~$860. One quick way to preempt this is to set up billing alarms.
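As a sketch, you can also create a billing alarm from the CLI. This assumes you’ve already enabled billing alerts in your account settings and have an SNS topic to notify; the topic ARN and the $50 threshold below are placeholders:

```shell
# Alarm when estimated monthly charges exceed $50.
# Billing metrics only exist in us-east-1.
aws cloudwatch put-metric-alarm \
  --region us-east-1 \
  --alarm-name monthly-spend-over-50 \
  --namespace "AWS/Billing" \
  --metric-name EstimatedCharges \
  --dimensions Name=Currency,Value=USD \
  --statistic Maximum \
  --period 21600 \
  --evaluation-periods 1 \
  --threshold 50 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:billing-alerts  # placeholder ARN
```

This needs a live AWS account with billing alerts enabled, so treat it as a starting point rather than something to paste verbatim.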

We’ll be creating this instance from the AWS CLI, although you can also do this from the console; it just takes a bit longer.

From the command line, run the command to create a new EC2 instance:

--image-id - the ID of the Deep Learning AMI
--instance-type - g4dn.4xlarge, our Deep Learning instance type
--key-name pytorch - the EC2 key pair you created
--security-groups [your security group] - make sure this security group has ingress/egress for ports 22, 80, and 443

Note: the ID of the image gets updated every several days; here’s a script to find out what it is when you create the instance. Thanks, David!
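If you want a one-liner instead, something like this describe-images query also works — note the AMI name pattern here is my assumption for the Ubuntu 18.04 Deep Learning AMI, not necessarily what the linked script does:

```shell
# Look up the newest Ubuntu 18.04 Deep Learning AMI published by Amazon
aws ec2 describe-images \
  --owners amazon \
  --filters "Name=name,Values=Deep Learning AMI (Ubuntu 18.04) Version *" \
  --query "sort_by(Images, &CreationDate)[-1].ImageId" \
  --output text
```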

aws ec2 run-instances --image-id ami-0403bb4876c18c180 --instance-type g4dn.4xlarge --key-name pytorch  --security-groups [your security group]

Once it’s set up, ssh into it using:

ssh -i "yourkey.pem" ubuntu@theinstancename.compute-1.amazonaws.com (ubuntu is the user here)

=============================================================================
       __|  __|_  )
       _|  (     /   Deep Learning AMI (Ubuntu 18.04) Version 59
      ___|\___|___|
=============================================================================

Welcome to Ubuntu 18.04.6 LTS (GNU/Linux 5.4.0-1069-aws x86_64)

Please use one of the following commands to start the required environment with the framework of your choice:
for TensorFlow 2.7 with Python3.8 (CUDA 11.2 and Intel MKL-DNN) ____________________________ source activate tensorflow2_p38
for PyTorch 1.10 with Python3.8 (CUDA 11.1 and Intel MKL) ______________________________________ source activate pytorch_p38
for AWS MX 1.8 (+Keras2) with Python3.7 (CUDA 11.0 and Intel MKL-DNN) ____________________________ source activate mxnet_p37
for AWS MX(+AWS Neuron) with Python3 __________________________________________________ source activate aws_neuron_mxnet_p36
for Tensorflow(+AWS Neuron) with Python3 _________________________________________ source activate aws_neuron_tensorflow_p36
for PyTorch (+AWS Neuron) with Python3 ______________________________________________ source activate aws_neuron_pytorch_p36
for AWS MX(+Amazon Elastic Inference) with Python3 ______________________________________ source activate amazonei_mxnet_p36
for base Python3 (CUDA 11.0) _______________________________________________________________________ source activate python3

The deep learning instance comes with PyTorch pre-built in a conda environment, but these dependencies tend to fall out of sync with each other. It’s usually easier to start from scratch, especially since the examples repo installs its own dependencies.

So then you can run:

#!/bin/bash -x
# Set up a working directory owned by the ubuntu user (uid 1000)
mkdir -p /home/ubuntu/my_examples
cd /home/ubuntu/my_examples
chown -R 1000:1000 .
git clone https://github.com/pytorch/examples.git
cd examples
# Make conda available in this shell and in future logins
echo ". /home/ubuntu/anaconda3/etc/profile.d/conda.sh" >> /home/ubuntu/.bashrc
. /home/ubuntu/anaconda3/etc/profile.d/conda.sh
source /home/ubuntu/.bashrc
# Create a fresh environment and install PyTorch with CUDA support
conda create -y --name pytorchenv
conda activate pytorchenv
conda install -y -c pytorch pytorch torchvision cudatoolkit=10.1

And you should be good to go!

There is an even easier way to do this if you want to avoid running a bunch of bash commands by hand: you can package the commands as part of the launch command using --user-data, which pushes data to your instance at launch. Here’s more on how user-data works.
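A minimal version of that file might look like the sketch below — cloud-init executes it once, as root, at first boot, which is why the chown back to the ubuntu user matters:

```shell
#!/bin/bash -x
# install_pytorch.txt: a plain text file, run once by cloud-init as root at launch
mkdir -p /home/ubuntu/my_examples
cd /home/ubuntu/my_examples
git clone https://github.com/pytorch/examples.git
# Hand ownership back to the ubuntu user, since this script runs as root
chown -R ubuntu:ubuntu /home/ubuntu/my_examples
```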

In order to push data to the instance, you have to use an instance profile, which is tied to an IAM role and which you can set up from the command line. It’s important to note that the file, even though it runs bash, is a text file.
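Sketching that instance-profile setup from the CLI — this assumes an IAM role already exists, and reuses the name EC2_Access for both the profile and the role as a placeholder:

```shell
# Create an instance profile and attach an existing IAM role to it
aws iam create-instance-profile --instance-profile-name EC2_Access
aws iam add-role-to-instance-profile \
  --instance-profile-name EC2_Access \
  --role-name EC2_Access   # assumes a role with this name already exists
```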

aws ec2 run-instances --image-id ami-0403bb4876c18c180 --instance-type g4dn.4xlarge --key-name pytorch --security-groups [your security group] --iam-instance-profile '{"Name": "EC2_Access" }' --user-data file://install_pytorch.txt 

To make sure that your bash script ran correctly, you can tail the instance setup logs, which are located in these two places:

/var/log/cloud-init.log 
/var/log/cloud-init-output.log