Understanding and Using Amazon Web Services (AWS)
In the computer vision research world, we're always in need of more processing power. Amazon Web Services (AWS) is a good option that can in many cases be quite economical. There are two main services of interest: Elastic Compute Cloud (EC2), which provides machines you can connect to and run programs on; and Simple Storage Service (S3), which provides long-term storage of files accessible via the web (or not, if you prefer). Here are the essentials you need to know to get started:
Amazon provides you with virtual machines that appear to be 'bare-metal' -- you install everything on them, including the operating system.
If working with only a few machines, I generally start them up through the AWS web console using a standard Ubuntu machine image (available here; see the right side of the page and pick one with EBS boot, the latest Ubuntu release, 64-bit). Once they've booted, I use parallel-ssh to scp a startup script to all the machines and then run it on all of them.
The script installs all needed libraries/programs, checks out my code from my SVN, compiles it, downloads any needed data files (either from my office machine or sometimes I put them on S3 beforehand), and then starts running them, piping stdout and stderr to logfiles, in case I need to debug anything later on.
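A minimal sketch of that deployment step, assuming the instances' hostnames are listed in a hypothetical hosts.txt file and the startup script is called setup.sh (parallel-scp and parallel-ssh come from the pssh package; the filenames and user are illustrative, not from the original):

```shell
# Copy the startup script to every machine listed in hosts.txt
# (one hostname per line), logging in as the default ubuntu user.
parallel-scp -h hosts.txt -l ubuntu setup.sh /tmp/setup.sh

# Run it everywhere, detached, with stdout/stderr piped to a logfile
# so it can be inspected later if something needs debugging.
parallel-ssh -h hosts.txt -l ubuntu \
    "nohup bash /tmp/setup.sh > /tmp/setup.log 2>&1 &"
```

Piping each machine's output to a logfile mirrors the logging approach described above.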
In general, I set up almost all of my distributed code to use a very simple producers + consumers queue-based system. The way it works is that one or more producers create jobs to be done and put them on distributed queues. For example, when I was running a face attribute classification system, if a user submitted 100 faces to be classified according to 50 attributes, I created 50 jobs (one for each attribute). For each job, I created an input and an output queue. The server put each face URL into each input queue as a separate item (i.e., 100 items per queue). The workers (consumers), meanwhile, run a main loop that looks like this:
    while 1:
        item = inputq.get()
        result = process(item)
        outputq.put((item, result))
Finally, the server has a separate thread that simply reads from the output queue and stores the results to disk/memory/a database:
    while 1:
        item, result = outputq.get()
        database.write(item, result)
For the queueing, I wrote a bunch of code that builds on redis, a NoSQL key-value store that I absolutely love. Nowadays there are some premade queueing solutions; I've started trying out celery, which seems quite good.
This architecture is very simple and efficient for several reasons: it naturally handles adding or removing workers and producers, and it maintains maximum throughput without any fancy scheduling.
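As a self-contained sketch of the whole pattern, here's a single-process version using Python's standard queue and threading modules in place of the distributed redis queues (process() here is a stand-in for the real per-item work, e.g. feature extraction):

```python
import queue
import threading

def process(item):
    # Stand-in for the real per-item work (e.g., extracting features).
    return item * 2

def worker(inputq, outputq):
    # Consumer main loop: pull an item, process it, push the result.
    while True:
        item = inputq.get()
        if item is None:   # sentinel: no more work for this worker
            break
        outputq.put((item, process(item)))

# One input/output queue pair per job; queue.Queue stands in for a
# distributed queue (e.g., a redis list) in this single-process sketch.
inputq, outputq = queue.Queue(), queue.Queue()

# Producer: enqueue the items for this job.
items = list(range(100))
for item in items:
    inputq.put(item)

# Start a few workers (on EC2 these would be separate machines/cores).
workers = [threading.Thread(target=worker, args=(inputq, outputq))
           for _ in range(4)]
for w in workers:
    w.start()
for _ in workers:
    inputq.put(None)   # one shutdown sentinel per worker
for w in workers:
    w.join()

# "Server" side: drain the output queue and store the results.
results = dict(outputq.get() for _ in items)
print(len(results), results[3])  # → 100 6
```

The same loop structure works unchanged when queue.Queue is swapped for a network-backed queue; only the get/put calls change.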
There are two main costs on EC2: computation time and data transfer. Amazon only charges for data OUT from EC2 to the internet, so if you are, e.g., using EC2 to compute features for images, you pay for the size of the computed features (the original image size doesn't matter, since data going in is free). Both are 'pay for what you use', with no minimum monthly fees. Here's how to estimate each.
The data transfer cost is $0.12/GB. So, e.g., if the computed features are 100 KB per image, then for 1 million images you're looking at 100 GB total = $12. Of course, if you need to do some other tasks on the features and you do those on EC2 as well (e.g., classification), then you might end up sending much less data back, which would lower this cost.
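That arithmetic as a quick sketch (the $0.12/GB rate and 100 KB/image figure are from above; the 1-million-image count is the assumption used throughout this running example):

```python
# Data-transfer cost estimate; rates from the text, image count assumed.
images = 1_000_000
feature_kb = 100           # computed feature size per image, in KB
price_per_gb = 0.12        # $/GB for data out of EC2

total_gb = images * feature_kb / 1e6   # KB -> GB (decimal units)
cost = total_gb * price_per_gb
print(f"{total_gb:.0f} GB -> ${cost:.2f}")  # → 100 GB -> $12.00
```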
For computation time, EC2 is nearly perfectly elastic, meaning you can run 1 machine for 1000 hours or 100 machines for 10 hours and the cost is the same (modulo rounding -- you are charged per machine-hour, rounded up). So if you are batch processing data in a trivially parallelizable way across both machines and cores (e.g., feature extraction), then you only need to figure out how many 'EC2 core-hours' your total task will take. The simplest way is to start up one EC2 instance, see how long it takes to do 1 task, then multiply by the number of tasks.
The different types of machines available and their prices are listed here. If you are CPU limited, the 'compute optimized' instances are best; if you are RAM limited, 'memory optimized'. The per-core cost for all of the 'Compute Optimized - Current Generation' machines (c3.*) is the same: 13.33 EC2 core-hours/$. Say your task takes 5 seconds per EC2 core and you have 1 million tasks to do. 1 million tasks = 5 million 'EC2 core-seconds' = 1389 'EC2 core-hours'. Then simply divide by the above constant (13.33 core-hours/$) to get the total computation cost: $104. (This is linear in CPU time, so if your task takes, e.g., 5x as much time, the cost is 5x as much.)
So the total cost for this hypothetical feature-extraction job would be roughly $115 to process a million images, assuming 5 seconds/image and 100 KB per extracted feature.
As for the total wall-clock time it'll take to extract all these features: you are initially limited to 20 machines at a time (although you can get that limit raised with a simple email). If you use the beefiest c3 machine ('c3.8xlarge', which has 32 cores), you have 20*32 = 640 cores working at once, so the total wall-clock time would be 1389 EC2 core-hours / 640 cores = ~2.2 hours.
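The compute-cost and wall-clock numbers follow from the same kind of back-of-the-envelope arithmetic (all figures -- 13.33 core-hours/$, 5 s/task, 1 million tasks, and the 20-machine x 32-core limit -- are from the text):

```python
# Compute-cost and wall-clock estimate using the figures from the text.
tasks = 1_000_000
secs_per_task = 5                  # measured time per task per EC2 core
core_hours_per_dollar = 13.33      # c3.* per-core rate

core_hours = tasks * secs_per_task / 3600
cost = core_hours / core_hours_per_dollar

machines, cores_per_machine = 20, 32   # default limit x c3.8xlarge
wall_clock_hours = core_hours / (machines * cores_per_machine)

print(f"{core_hours:.0f} core-hours, ${cost:.0f}, {wall_clock_hours:.1f} h")
# → 1389 core-hours, $104, 2.2 h
```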
There are 3 different ways to pay for machines, depending on your usage scenario:
- on-demand: this is the standard option, with no prepayment, and guaranteed availability.
- reserved: if you plan on using machines for at least a year, then this lets you prepay some amount in exchange for lower hourly rates. I think if your utilization is at least 40% or so, then it makes sense to do this.
- spot pricing: if what you are doing is not super-time-critical, you can bid for machines at a given price. Typically, spot prices are about 1/3 of normal on-demand pricing. I.e., if a machine normally costs $1.20 per hour, you can bid $0.40, and if Amazon has excess capacity at that price, you get a machine. As soon as the spot price rises above your bid, your machines get terminated. When I have a lot of computation to do that is not very time-critical, I use this to save a lot of money. Of course, you have to make sure to save your results as soon as they are calculated; otherwise, if the machine is killed, you lose them.
For S3 storage, there are 2 types you'll use commonly:
- normal: for any data that you want to be absolutely sure will not get corrupted over very long periods of time. The storage cost is $0.03/GB up to the 1st terabyte, and then it gets cheaper if you have more data.
- reduced redundancy: this is less reliable (99.99% vs. 99.999999999% durability for normal) -- i.e., you can expect about 1 file out of 10,000 to get corrupted per year. The price is 80% of the normal price. This level of reliability is more than good enough for anything you have backed up elsewhere, or for data you only need for a few months.
For both S3 and EC2, you are charged for bandwidth OUT of Amazon at $0.12/GB, but bandwidth INTO Amazon is free. There are also small per-request charges for S3 (both in and out), which are usually not very much. Bandwidth between EC2 and S3 is fast and free.