I am now the proud(?) owner of a brand new ThinkStation. Nice, fast box, and loaded up with dev and analysis goodies.
Based on discussion with Dr. Hurst, I’m going to look into using Mechanical Turk as a way of gathering large amounts of biometric data. Basically, the goal is to have Turkers log in and type training text. The question is how. And that question has two parts – how to interface with the Turk system and how to get the best results. I’ll be researching these over the next week or so, but I thought I’d put down some initial thoughts here first.
I think a good way to get data would be to write a simple game that presents words or phrases to a user and has them type those words back into the system. Points are given for speed (up to 10 seconds per word?) and accuracy (edit distance from the prompt word). Points are converted into cash somehow using the Turk API?
The list of words should be the 100 most-used English words? Words/phrases are presented randomly. There would be some kind of upper and lower limit on the number of words a user can enter so the system is not abused. In addition, IP address/browser info can be checked as a rough cull for repeat users.
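The scoring idea above can be sketched in a few lines. Note this is only a guess at the point formula – the 10-second speed cap is from the notes, but the way speed and edit distance combine into points (and the class/method names) is entirely hypothetical:

```java
// Sketch of the typing-game scoring idea. The exact point formula is a
// placeholder, not a settled design: full speed credit under 10 seconds,
// minus an edit-distance penalty for typos.
public class TypingScore {

    // Standard Levenshtein edit distance between the prompt and what was typed.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int sub = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                curr[j] = Math.min(sub, Math.min(prev[j] + 1, curr[j - 1] + 1));
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Hypothetical scoring: up to 10 points for speed, plus one point per
    // correctly-placed character (prompt length minus edit distance).
    static int score(String prompt, String typed, double seconds) {
        int speedPoints = (int) Math.max(0, 10 - Math.floor(seconds));
        int accuracyPoints = Math.max(0, prompt.length() - editDistance(prompt, typed));
        return speedPoints + accuracyPoints;
    }
}
```

The nice property of edit distance here is that it degrades gracefully: a one-character typo costs one point rather than zeroing the word out.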
Ok, I’m now the proud owner of an AWS account and a Turk Requester account. I’ve also found the Amazon Mechanical Turk Getting Started Guide, though the Kindle download is extremely slow this Christmas Day.
- Found the online version of the Getting Started Guide
- Amazon Mechanical Turk API Reference
- Amazon Mechanical Turk Developer Guide.
- The Sandbox Testing Environment
- Command-line tools download
Setting up sandbox accounts. Some things appear to time out. Not sure if I’ll be able to use the API directly; I may have to check against a cut-and-pasted session UID instead.
Set up Admin, dev and read-only IAM user accounts.
Accessed the production and sandbox accounts with the command-line tools. Since I already had a JRE, I just downloaded the install directory. You need to create an MTURK_CMD_HOME environment variable that points to the root of your Turk install – in my case, ‘C:\TurkTools\aws-mturk-clt-1.3.1’. Do not add this value to your PATH – it makes Java throw a fit. The other undocumented thing you must do is change the service-url values for the accounts from http to https.
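For reference, the relevant lines in the CLT properties files end up looking something like this. The key names and endpoint URLs below are from memory of the shipped samples, so check against your own files – the only change that matters is http becoming https:

```properties
# mturk.properties (sandbox flavor) – the shipped file uses http;
# the tools fail until the scheme is changed to https.
service_url=https://mechanicalturk.sandbox.amazonaws.com/?Service=AWSMechanicalTurkRequester
access_key=YOUR_ACCESS_KEY
secret_key=YOUR_SECRET_KEY
```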
To log in successfully, I was not able to use the IAM accounts and had to use a root key. And sure enough, when I looked over the AWS Request Authentication page, there it was: “Amazon Mechanical Turk does not use AWS Identity and Access Management (IAM) credentials.” Sigh. But that’s 4 tabs I can close and ignore.
Setting up the Java project for the HIT tutorial.
- Needed to put the mturk.properties file at the root of the project, since that’s where System.getProperty("user.dir") said it should go.
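The reason the file has to sit there: relative paths in Java resolve against the JVM’s working directory, which is exactly what `user.dir` reports. A quick sanity check (the class and method names are mine, just for illustration):

```java
import java.io.File;

// Shows where a relative filename like "mturk.properties" actually
// resolves: against the JVM's working directory (the "user.dir" property),
// which for an IDE-launched app is typically the project root.
public class WhereIsUserDir {

    static String resolve(String relativePath) {
        return new File(System.getProperty("user.dir"), relativePath).getAbsolutePath();
    }

    public static void main(String[] args) {
        System.out.println(resolve("mturk.properties"));
    }
}
```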
Success! Sent a task, logged in, and performed the task in the sandbox. Now I have to see how to get the results and such.
And the API information on Amazon makes no sense. It looks like the Java API is not actually built by Amazon; the only true access is through SOAP/REST. The Java API is in the following locations:
- Amazon, which points to…
If you download the zip file, the javadoc API is available in the docs directory, and there appear to be well-commented samples in the samples directory.
Took parts from the simple_survey.java and reviewer.java examples and have managed to post and approve. Need to see if already-approved values can be retrieved. If they can be, then I can check against previous UIDs. Then I could either pull down the session_ids as a list or create a new PHP page that returns the session_ids RESTfully. Kinda like that.
The PHP page that returns session_ids is done and up: http://philfeldman.com/iRevApps/dbREST.php. I’ve now got Java code running that pulls down the entire XML document and searches through it for a result that I can match against the submitted values. I need to check that the value isn’t already approved, which means going through the approved items first to look for the sessionIDs.
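The matching step looks roughly like this. I’m guessing at the XML shape that dbREST.php returns (a flat list of `<session_id>` elements under a root); the class and method names are mine:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Sketch of matching a Turker-submitted value against the session ids
// served by the PHP page. The <sessions>/<session_id> document shape is
// an assumption about dbREST.php's output, not its actual format.
public class SessionMatcher {

    // Pull every <session_id> text value out of the XML document.
    static List<String> sessionIds(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList nodes = doc.getElementsByTagName("session_id");
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            ids.add(nodes.item(i).getTextContent().trim());
        }
        return ids;
    }

    // True if the value a worker submitted matches a known session id.
    static boolean matches(String submitted, String xml) throws Exception {
        return sessionIds(xml).contains(submitted.trim());
    }
}
```

In the real code the XML string would come from an HTTP GET against the PHP page rather than being passed in directly.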
And, of course, there’s a new wrinkle. A worker can only access a HIT once, which has led me to load up quite a few HITs. And I’ve noticed that you can’t search across multiple HITs, so I need to keep track of them and fill the arrays from multiple sources. Or change the db to have an ‘approved’ flag, which I don’t want to do if I don’t have to. I’d rather start by keeping all the HIT ids and iterating over them. If that fails, I’ll make another RESTful interface to the db that returns a list of approved session_ids.
Iterating over the HITs in order of creation seems to work fine. At least good enough for now.
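The bookkeeping above, with the actual MTurk calls stubbed out as plain data, amounts to a sweep over HITs in creation order that skips anything already approved. All the names here are mine, not the SDK’s – in practice the per-HIT submission lists would come from the Turk API:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Model of the multi-HIT iteration: hitSubmissions maps each HIT id
// (insertion order = creation order) to the session ids submitted
// against it; approved holds session ids that have already been paid.
public class HitSweep {

    // Walk HITs in creation order and collect submissions not yet
    // approved, de-duplicating session ids seen across HITs as we go.
    static List<String> pendingSessionIds(
            LinkedHashMap<String, List<String>> hitSubmissions,
            Set<String> approved) {
        Set<String> seen = new HashSet<>(approved);
        List<String> pending = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : hitSubmissions.entrySet()) {
            for (String sessionId : e.getValue()) {
                if (seen.add(sessionId)) {  // false if already approved or already seen
                    pending.add(sessionId);
                }
            }
        }
        return pending;
    }
}
```

Keeping the approved set in memory is what lets the db stay free of an ‘approved’ flag, at the cost of re-walking the HIT list each sweep.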
Starting on the training wizard directive. While looking around for a source of training text I initially tried vocabulary building sites but wound up at famous quotations, at least for the moment. Here’s one that generates random quotes.
The wizard is done, along with its initial implementation as my Turk data-input page. Currently, the pages are:
- irev3 – the main app
- irevTurk – the Mechanical Turk data collection
- irevdb – data pulls and formatting from the db.
And I appear to have turk data!