Effective Use of Amazon Mechanical Turk (MTurk)
Introduction
In computer vision, perhaps nothing has had as big an impact in the last ~8 years as Amazon Mechanical Turk (MTurk). Back in 2011 when I was applying for jobs, I argued that the two general enablers of modern computer vision successes have been increased processing power and the availability of far more labeled data; I'd argue that the latter has been more important, and within that, the ease of getting labels was less foreseeable than the availability of more (and more realistic) data. MTurk is a way of getting such labels cheaply and efficiently for large amounts of data.
So what is MTurk? It is a marketplace for small jobs that can be done virtually. These are called 'Human Intelligence Tasks' (HITs). Workers who have registered with Mechanical Turk can sign on and complete the HITs. HITs typically consist of web pages which ask the worker for some sort of judgement or input. For computer vision researchers, the webpage will often contain some image(s) and ask the worker to either describe what's in the image, label specific parts or pixels on the image, or confirm some existing labels/guesses. For each HIT, a worker is typically paid a few cents by the requester, and the task should take between 30 seconds and a few minutes.
Getting good quality labels requires understanding a few main factors:
- How to structure jobs for efficient completion while discouraging laziness and cheating
- How to get jobs done at a reasonable quality level while keeping costs low (in terms of money and time)
Creating Jobs
A job in Mechanical Turk is defined using 3 pieces:
- A page template: This consists of HTML code that defines what a worker sees, with placeholders for template arguments such as image URLs (see the sketch after this list). You write standard HTML, including CSS stylesheets for custom styling and JavaScript for adding interactive, intuitive user-interface functionality, or even links to external databases, etc. This is a very flexible system, and as more applications move to the web, it will be increasingly familiar to software developers of all types. Amazon provides a few standard templates, which might cover your use-case if you're lucky; if not, templates are pretty easy to write with some basic HTML experience.
- Batch data file: A batch of data to fill into the template, given as a Comma-Separated Values (CSV) file. All the variables defined in the page template are filled in here, for as many individual HITs as you want to create. It's easy to create these from any programming language, or even Excel. You can also have as many lines (i.e., HITs) as you want within a single CSV file, allowing for the creation of small and large batches alike.
- Additional metadata: Some metadata about the job, such as keywords, price, quality filters, etc. These are used to determine how workers can find your job, which workers are eligible to work on it (i.e., they must have a certain approval rating on past jobs), how much they get paid, how much time they have available to work on each job, how many individual workers must complete each job, etc. You set this once per job, and it's easy to modify if you want to experiment with different prices, etc.
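As a concrete illustration, here is a minimal sketch of what a page template might look like for a simple yes/no image question. The variable names (${image_url}, ${question}) are hypothetical placeholders; they just need to match the column headers in your CSV:

    <!-- ${image_url} and ${question} are filled in from the CSV, one row per HIT. -->
    <p>${question}</p>
    <img src="${image_url}" width="400">
    <p>
      <label><input type="radio" name="answer" value="yes"> Yes</label>
      <label><input type="radio" name="answer" value="no"> No</label>
    </p>

The matching batch file would then start with a header row image_url,question followed by one line per HIT, and each worker's selection comes back as a corresponding answer column in the results you download.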
You can create as many templates as you want, and you can upload as many CSVs as you want. The cost of running a job is set by you, and Amazon takes an additional 10% or $0.005 per HIT, whichever is higher. Once you submit a job, there's a progress page showing how fast workers are completing the HITs. You can even view results as a table while in progress. Once done, you can export the results as a CSV file to analyze them as you wish. Workers are paid either automatically after a time limit (default 1 week), or you can explicitly approve/reject individual HITs. This means that if you find workers who are sloppy or cheating, you can reject them (they are not paid) and additionally ban them forever from working on your jobs.
To get better insight as a requester, it's worthwhile to examine the process from the standpoint of a worker. (In fact, I'd strongly recommend you register as a worker and try doing some jobs to see what it's like.) Here's the process:
- You search for jobs to work on. You generally want to maximize your hourly revenue, with some factor in there for how fun/annoying the work is. Because MTurk is set up as a post-pay system which requires approval from requesters, you are taking a risk when working on any new job -- you have no idea how likely it is that you will actually get paid for your work, nor how long it will take you in practice to do work good enough to get accepted most of the time. This risk means that there's a strong incentive to work on jobs you've done before, and failing that, for requesters you've worked for before.
- If there are no jobs from known requesters, though, then you have to pick a new type of job. So you look at the samples for various jobs and see how easy each looks. You might also do external research (e.g., on forums) to see if others have any feedback on particular employers. Many workers have a strong bias for finding jobs with many HITs available -- so that if it works out, they can 'grind' on that type of HIT for quite a while.
- Once a worker picks a HIT type they like, they will sometimes do just a few and then wait a few days to see how quickly and reliably they get paid. If not quick enough, they might never do another HIT by that requester again. (Life's too short to work for assholes/slowpokes!)
- If they get paid promptly and without too much fuss, then they might start doing lots of your HITs. They will get better over time, leading to higher hourly revenues (more jobs done in the same amount of time) and hopefully better quality as well.
Based on this workflow, it should now be clearer to you (as a requester) that you have to work at building a good reputation, and like other trust-based systems, it's much easier to lose it than gain it!
Pricing and Speed
The right way to think about costs on MTurk is that they affect only the rate at which jobs get done. You can pay as little as you want and still get your jobs done, but it will just take much longer to complete. So if, e.g., your HIT pays $0.02 per 30-second job and you can get a batch of 1000 such jobs done in a few hours, paying $0.04 instead might make it take less than an hour total for the same work done.
That being said, as a researcher, your goal is to get accurate labels with the least amount of effort on your part, and so it's often in your benefit to spend more in order to get higher-quality workers who will be more devoted. If you're calculating the hourly wage for your workers, make sure to take into account that the average shown to you in the 'manage your HITs' panel covers only the time spent within the HIT itself; there's additional time involved for the workers in submitting each HIT, accepting the next one, and every 25 HITs they also have to solve a CAPTCHA (to prevent abuse by bots).
Another important fact to keep in mind is that there are only two countries in which Amazon pays workers in cash: America and India; in other countries, workers are given Amazon credits or other non-cash compensation. This causes (or is a reflection of?) the workforce being disproportionately dominated by workers in these two countries. As a consequence, there is somewhat of a bimodal distribution in the rates at which workers are willing to work: Americans will tend to avoid jobs at very low pay (relative to other options in the US), and so those jobs tend to be completed by Indians. (Note that this is only very loosely true, because many workers in the US use MTurk as only a small supplemental source of income and are thus somewhat less pay-sensitive.) So, very roughly speaking, low-paying jobs will tend to be completed by Indian workers, and higher-paying ones by Americans. This can dramatically affect your job quality due to language and cultural issues. If your job involves American colloquialisms, or requires very precise knowledge of English (e.g., if you need workers to distinguish between subtly different terms), it might be difficult for non-native English speakers to complete. On the other hand, if you just want to get the gender of faces labeled, it probably won't matter.
Ultimately, as with most questions about MTurk, the way to really find what works is to try out small (but not too small!) batches and evaluate the quality of results for yourself. The work involved in MTurk is mostly front-loaded, meaning you'll have to put in some effort initially to design and set up your jobs, but once that's done, it's very easy to submit more batches or make small tweaks. This makes it quite conducive to running experiments, such as on pricing or small design changes.
Best Practices
There are many non-obvious things about making the best use of Mechanical Turk which I've picked up over my many years of using it. However, as MTurk has changed drastically during this time period (both technologically and in its policies and the makeup of its workers), not all of this advice might be relevant or correct today. Nevertheless, most of the higher-level intuition should still help guide you.
First, a lot of basic information is listed in the appendix of my PhD thesis.
Next, the following is adapted from an email exchange with someone who had questions about a simple survey task he was trying to run on MTurk. His task asked workers to compare the image output from two different algorithms and say which one seemed more like the ground truth (also shown).
Question: When setting up a task, there's an option for Time allotted per assignment. Do the workers consider this allotted time as a measure of difficulty of the task? I expect that my task would take around 5 seconds but would setting it to like 5 minutes hurt?
Answer: I've never changed this option from the default. I think it's normally comically high, like 1 hour or more. I don't think workers really use it to decide what to do. Instead, in practice they simply do a few of the tasks and see what it's like (i.e., how much time it typically takes them, and how much they would get paid for that). So the setting that matters much more is the time until HITs are automatically approved (and paid). This value should be as low as possible, as workers are essentially taking on risk for that many days by doing a job. I think the default is 1 week, meaning that a worker will do a few jobs, and then wait 1 week to see if you are a 'good employer' -- someone who accepts most of his HITs -- or a bad one (who rejects HITs for some reason). If you instead reduce it down to a few days at most, or even just 1 day, then you will have more workers willing to try out your job. This is especially important early on, as you have to build up your reputation as an employer who pays his workers quickly and without hassle.
Question: Should I limit by country?
Answer: We didn't limit by country (it wasn't an option back when I was using Mturk). [Update May 2014: see the pricing section for why this might be a good idea in some cases.]
Question: I picked 1 cent per 5-second task, which works out to $7.20 per hour for a worker. Do you think that's appropriate?
Answer: If the task really takes 5 seconds on average, then 1 cent is slightly on the higher side of what people are usually paid [as of May 2013]. The usual solution is to put more than one 'job' in each HIT, so that it takes a bit longer. Remember that you are being charged cost + 10%, or cost + 0.5 cents (whichever is higher)... so until you get to 5 cents per HIT, you're paying proportionally more per HIT to Amazon in fees. If your total volume is not very much (~$100), this doesn't matter, but at larger volumes, these costs can add up.
But in general, the thing to understand about costs on MTurk is that they don't determine IF your job gets done, but rather WHEN your job gets done. More money == faster completion. It's hard to judge how much money is enough, so usually I start with the bare minimum, submit a small job (~20 HITs or so), and see how long it takes workers to complete it. If it's too slow, I make the next batch a bit more expensive and repeat this process until I'm happy with the speed.
BTW, this kind of iteration is almost always needed, not just for pricing, but also to evaluate how well workers are completing the task (in terms of accuracy) and also for debugging.
Question: Do they have a testing mode where I can see what the workers would actually see? I only see 'Preview HITs', and the submit button doesn't do anything.
Answer: There is a 'sandbox' where you can try things out, but in practice I found it awkward to set up and use. The reason is that normally you want to set up the qualifications for a job such that you only accept workers who have a 95% or better acceptance rate and who have completed at least a few hundred HITs. But since you yourself will not have done that many HITs yet, you're not allowed to try out your own jobs!
So instead, just use the preview mode to see how the job looks and behaves. This will give you 90% of the info you need. One thing you should do here is make your browser window no bigger than 1280x1024, which is often the maximum resolution for many MTurk workers (sometimes even less). At this resolution, there should be a minimum of scrolling required, and ESPECIALLY no horizontal scrolling! The remaining 10% is making sure submission works. For this, it's easiest to just submit a small test batch, as described above, and see if any workers do it. After you submit a batch, within at most 1 hour you should have at least a few people who've done your HITs. (Usually much faster, like within 5-10 minutes of submission.) If not, then there is probably a problem with the submit button, and you should cancel the job and debug.
BTW, you don't have to wait until a batch finishes to examine results. You can see and download results at any point, and it will show you all the HITs done so far. This is very useful early on as you're debugging, and realize that there are various problems in your scripts -- you simply cancel the job, and then submit a new version after you've fixed the bug.
Question: How should I deal with spammers? Do I ban them?
Answer: One of the most important things on mturk is to be very conservative about not paying workers. Your reputation matters a lot, so don't reject worker payment unless you're 100% sure they cheated. If they didn't cheat, but they're not very accurate, then it's most likely a problem in the way you set the job up, and you should change things to make it easier/more likely for workers to do the right thing. But if you reject these 'poor quality' (but honest) workers, some of them can get very upset and completely ruin your reputation, making it impossible to get work done. It's better in the long run to just 'absorb' these costs and keep them to a minimum rather than try to get rid of them completely.
We didn't worry about spammers in any of our labeling tasks, but I know others have had to deal with that problem. In general, if you design your tasks such that it is not trivially easy to cheat, then spammers are usually not a huge issue. It also helps if your jobs are very small and cheap, as then there is often less incentive for spammers to devote time and resources to figure out how to cheat. Finally, if you do decide to flag/not pay/report spammers, be super-careful that you don't accidentally ban non-spammers. There are various forums where workers can discuss tasks, and I've heard of cases where people who were unfairly banned got really angry and caused many other workers to stop working for that person as well.
A simpler alternative is to just ratchet up the number of workers doing the same task. For simple things like attribute labeling, we required 3 responses per image, but for the face verification task, we had 10 responses each. I think we might also have thrown out outliers from this 10, but I'm not sure about that.
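If you go this route, the aggregation itself is straightforward. Here's a minimal JavaScript sketch of one simple way to combine multiple responses per image by majority vote; the record format is hypothetical, and this isn't necessarily what we did:

    // Combine multiple workers' answers for the same image by majority vote.
    // `responses` is a hypothetical array of {imageUrl, answer} records,
    // one per completed assignment.
    function majorityVote(responses) {
      const votes = {};   // imageUrl -> {answer -> count}
      for (const {imageUrl, answer} of responses) {
        votes[imageUrl] = votes[imageUrl] || {};
        votes[imageUrl][answer] = (votes[imageUrl][answer] || 0) + 1;
      }
      const labels = {};  // imageUrl -> most common answer
      for (const [imageUrl, counts] of Object.entries(votes)) {
        labels[imageUrl] = Object.entries(counts)
          .sort((a, b) => b[1] - a[1])[0][0];
      }
      return labels;
    }

One way to approximate the 'throwing out outliers' step is to look at the vote margin and simply discard images where workers disagree heavily.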
Question: Can we have a qualification test of our own, where each worker has to answer a couple of questions, to make sure they don't just make random guesses? Do you think this is needed, or is the standard 95% qualification option enough? Also, if needed, does such a feature exist on Amazon?
Answer: I would first try the simple 95% qualification [Update May 2014: 98%] before you move on to more sophisticated things. Run some smallish batches and see if the results look reasonable. It is possible to add custom qualification tasks, but I've never done them, so I don't know how they work. I think they also drastically cut down on the number of workers who are willing to do them, so only do it if it's absolutely necessary.
Question: Here's my current interface. How does it look?
Answer:
- I think you need an additional instruction line at the beginning: 'If both outputs seem equally likely, or neither, please choose the appropriate option on the right.'
- You should make the text a bit bigger, especially the different options.
- You should make sure that the text itself is a click target, so workers don't have to click on the little radio buttons. Bigger click targets are the #1 thing to optimize on MTurk (#2 is reducing scrolling). In HTML, making the text a label with the appropriate for='id_of_radiobutton' attribute will make it clickable (see the sketch after this list).
- The placement of the 'both' and 'neither' options is a bit weird and might be missed by some workers; they don't quite look like options right now. I think you want them closer to the options for A and B, so people realize that they are making a choice amongst 4 things.
- Actually, one thing that I've found helps in surveys is to give each option a different background color (light-colored, of course, so you can still read the text). Make these consistent across all jobs, so that workers mentally associate, e.g., pink=option A, sky blue=option B, lilac=both, yellow=neither, or something like that. You want to reduce the cognitive load for the workers.
- The text above each photo is a bit weird. I would use: Input (Age 5) and Output (Age 8), or put Age 5/8 on the next row below Input/Output.
- I'm sure you know this, but if you want these kinds of surveys to be meaningful, you have to randomize which photo you put as A and which as B (i.e., don't always assign the same method's output to A, because some people will just pick A all the time).
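Putting the click-target and color suggestions together, a minimal sketch of the four options might look like the following; the ids, colors, and wording are just placeholders:

    <div>
      <input type="radio" name="choice" id="optA" value="A">
      <label for="optA" style="background-color: #ffd6e0; padding: 8px;">Output A looks more realistic</label>

      <input type="radio" name="choice" id="optB" value="B">
      <label for="optB" style="background-color: #d6ecff; padding: 8px;">Output B looks more realistic</label>

      <input type="radio" name="choice" id="optBoth" value="both">
      <label for="optBoth" style="background-color: #e8d6ff; padding: 8px;">Both look equally realistic</label>

      <input type="radio" name="choice" id="optNeither" value="neither">
      <label for="optNeither" style="background-color: #fff9c4; padding: 8px;">Neither looks realistic</label>
    </div>

Because the whole label is clickable (via the for attribute), the worker never has to aim for the small radio button itself.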
This kind of job, btw, is ideal for including multiple jobs on the same HIT, since it's so quick. It might also help give better responses, as people can see a few samples at once and get a better feel for realism.
Finally, this kind of job is also the easiest to cheat on, so you'll have to be a bit careful about using the results. One thing you'll want to do is filter out people who always answer the same thing. (But don't reject or ban them, as I explained before.) Another thing you'll want to do is have a high multiplicity on the job, i.e., have at least 5-7 if not 10-15 users do the same job (this is a configuration setting). Note that it's much better to use MTurk's multiplicity option rather than simply creating multiple identical HITs, since MTurk can then guarantee that different workers do each one.
Question: I issued a test batch of 15 HITs and found that it's very slow. The first batch I tried used the standard settings: Master, 95% acceptance rate, and >= 1000 HITs completed. I only got 3 HITs done within 2 hours. Then, for the second batch, I removed the Master qualification, kept the 95% acceptance rate, and decreased the requirement to >= 100 HITs. It's been running for about an hour and 13/15 HITs are done.
Answer: Starting in January 2013, Amazon made it much harder for international workers to register. (Read the comments section for discussion about how it's not an actual ban, but that it has just become much harder for international workers to get approved.) Concretely, this has meant that the labor pool on MTurk is smaller than it used to be, and the remaining workers can thus command a higher price.
In particular, the default worker qualifications (under 'Advanced' when creating a new job) now include 'worker must be Master'. These are workers who have gone through a more stringent review process. While their work quality might be better (I don't have a good sense of whether this is actually true), this pool of workers is even smaller. So for very simple jobs (where there's little chance of screwing up), it's usually better to uncheck this option. Good replacement criteria are 'worker must have completed at least 1000 jobs' and 'worker approval rate >= 95%' (or thereabouts). [Update May 2014: it looks like 95% is too low now; 98% should be the new minimum. Also, confirmation that Masters is probably not useful.]
13 HITs in one hour is still pretty slow, though.
Question: I've heard that running overnight might be faster because of the workers from India. But from your experience, do you know how I should adjust the setting to get the result faster?
Answer: There is definitely large variability between daytime and nighttime completion rates, but neither is consistently faster.
Question: The average time per assignment is now around 1:15 minutes and the effective hourly rate is $4.90. I'm paying 10 cents per HIT, which consists of 10 questions.
Answer: 10 cents per HIT sounds a bit high, although if it's taking 1:15, that's not too bad. One thing you can try is halving both: 5 cents for 5 questions. Sometimes workers prefer that, since each HIT is faster (even though the total money earned should theoretically be the same).
Also, workers can see how many HITs in total are available for a job. If you've submitted a very small test batch (15 HITs), then many workers will skip it because they think it won't be worthwhile to learn a new type of HIT if there are not that many to do. Many workers like to 'grind' on a task for a long time, so it's nice when there are hundreds or thousands of HITs available.
So I'd create larger test batches (at least 100 if not more), and just cancel them as soon as you get a sense for how they're going (i.e., to change some settings/price).
Question: I'm running the full task with the same settings. It's pretty fast indeed: 772 tasks done in 43 minutes.
Answer: Awesome! Remember to check the results so far to make sure people seem to be doing it correctly. It's often useful to write a script to parse the output and show results by worker ID. This will show you any patterns, e.g., if someone marks the same answer for all inputs.
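For example, here's a minimal Node.js sketch of such a script. It assumes the downloaded results file is named results.csv, has a WorkerId column, and has an answer column called Answer.choice (the answer column name depends on your form field names). It also naively splits on commas; real result files quote their fields, so use a proper CSV parser if your answers can contain commas.

    // Tally each worker's answer distribution from a downloaded results CSV.
    const fs = require('fs');

    const lines = fs.readFileSync('results.csv', 'utf8').trim().split('\n');
    const header = lines[0].split(',').map(h => h.replace(/"/g, ''));
    const workerCol = header.indexOf('WorkerId');
    const answerCol = header.indexOf('Answer.choice');

    const byWorker = {};
    for (const line of lines.slice(1)) {
      const cols = line.split(',').map(c => c.replace(/"/g, ''));
      const worker = cols[workerCol];
      const answer = cols[answerCol];
      byWorker[worker] = byWorker[worker] || {};
      byWorker[worker][answer] = (byWorker[worker][answer] || 0) + 1;
    }

    // A worker who always gives the same answer will stand out immediately.
    for (const [worker, counts] of Object.entries(byWorker)) {
      console.log(worker, JSON.stringify(counts));
    }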
Question: Do you think 100 HITs and a 95% rate is too low? A Turker sent me an email telling me that's kind of low, especially the 100 HITs.
Answer: Yeah, the 100 HITs requirement is very low. I'd set it to 1000, maybe. Also, how many workers are you having do each task? And are you ensuring that the worker is forced to make a choice before hitting submit? If not, some workers might not enter anything.
Question: 3 workers. Oh, how do I ensure that? I have all the results now, but there are a few that missed 1-2 questions.
Answer: It's on the main settings page for the job. I think it's called 'number of workers per HIT' or something like that. For surveys like yours, you probably want 10 or so.
Question: Oh, I mean how do I ensure that all questions are answered before they click the submit button?
Answer: Oh, you have to do that in JavaScript. Start off with the submit button disabled, and enable it once the user has clicked one of the answer buttons. But if you only got a few unanswered questions, then it's probably not worth doing this.
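Here's a minimal sketch of that idea, going one step further by only enabling the submit button once every question has a selection. The question names (q1, q2) and the button id are hypothetical:

    <form>
      <p>Q1: <label><input type="radio" name="q1" value="A"> A</label>
             <label><input type="radio" name="q1" value="B"> B</label></p>
      <p>Q2: <label><input type="radio" name="q2" value="A"> A</label>
             <label><input type="radio" name="q2" value="B"> B</label></p>
      <input type="submit" id="submitButton" disabled>
    </form>
    <script>
      var questionNames = ['q1', 'q2'];
      function updateSubmit() {
        // Enable submit only once every question has a selected radio button.
        var allAnswered = questionNames.every(function (name) {
          return document.querySelector('input[name="' + name + '"]:checked') !== null;
        });
        document.getElementById('submitButton').disabled = !allAnswered;
      }
      document.querySelectorAll('input[type="radio"]').forEach(function (radio) {
        radio.addEventListener('change', updateSubmit);
      });
    </script>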