Author Archives: pgfeldman

New computer, new plans

I am now the proud(?) owner of a brand new ThinkStation. Nice, fast box, and loaded up with dev and analysis goodies.

Based on discussion with Dr. Hurst, I’m going to look into using Mechanical Turk as a way of gathering large amounts of biometric data. Basically, the goal is to have Turkers log in and type training text. The question is how. And that question has two parts – how to interface with the Turk system and how to get the best results. I’ll be researching these over the next week or so, but I thought I’d put down some initial thoughts here first.

I think a good way to get data would be to write a simple game that presents words or phrases to a user and has them type those words back into the system. Points are given for speed (up to 10 seconds per word?) and accuracy (edit distance from the word). Points are converted into cash somehow using the Turk API?
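A minimal sketch of how that scoring might work. The edit distance is the standard Levenshtein distance; the 50/50 split between speed and accuracy, the 10-second cap, and the names editDistance and scoreEntry are all assumptions made up for illustration, not a settled design.

```javascript
// Levenshtein edit distance between two strings (classic dynamic programming,
// keeping only the previous row of the table).
function editDistance(a, b) {
    var prev = [];
    var i, j;
    for (j = 0; j <= b.length; j++) { prev[j] = j; }
    for (i = 1; i <= a.length; i++) {
        var curr = [i];
        for (j = 1; j <= b.length; j++) {
            var cost = a.charAt(i - 1) === b.charAt(j - 1) ? 0 : 1;
            curr[j] = Math.min(prev[j] + 1,         // deletion
                               curr[j - 1] + 1,     // insertion
                               prev[j - 1] + cost); // substitution
        }
        prev = curr;
    }
    return prev[b.length];
}

// Hypothetical scoring rule: up to 50 points for speed (0 points at 10 s or more)
// and up to 50 points for accuracy (0 points once the distance reaches the word length).
function scoreEntry(target, typed, elapsedMs) {
    var MAX_MS = 10000;
    var speed = Math.max(0, 1 - elapsedMs / MAX_MS);
    var accuracy = Math.max(0, 1 - editDistance(target, typed) / Math.max(1, target.length));
    return Math.round(50 * speed + 50 * accuracy);
}
```

With these weights, a perfect and instantaneous entry scores 100, and a slow entry with every letter wrong scores 0.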

The list of words should be the 100 most-used English words? Words/phrases are presented randomly. There is some kind of upper and lower limit on the number of words that a user can enter so the system is not abused. In addition, IP address/browser info can be checked as a rough cull for repeat users.

Ok, I’m now the proud owner of an AWS account and a Turk Requester account. I’ve also found the Amazon Mechanical Turk Getting Started Guide, though the Kindle download is extremely slow this Christmas Day.

Setting up sandbox accounts. Some things appear to time out. Not sure if I’ll be able to use the API directly. I may have to do a check against the session uid cut-and-paste.

Set up Admin, dev and read-only IAM user accounts.

Accessed the production and sandbox accounts with the command line tools. Since I already had a JRE, I just downloaded the installed directory. You need to create an MTURK_CMD_HOME environment variable that points to the root of your Turk install. In my case, that is ‘C:\TurkTools\aws-mturk-clt-1.3.1’. Do not add this value to your path; it makes Java throw a fit. The other undocumented thing that you must do is change the service-url values for the accounts from http to https.

To log in successfully, I was not able to use the IAM accounts and had to use a rootkey. And sure enough, when I looked over the AWS Request Authentication page, there it was: Amazon Mechanical Turk does not use AWS Identity and Access Management (IAM) credentials. Sigh. But that’s 4 tabs I can close and ignore.

Setting up the Java project for the HIT tutorial.

  • Since I’ve been using IntelliJ for all my JavaScript and PHP, I thought I’d see how it is with Java. The first confusion was how to grab all the libraries/jar files that Turk needs. To add items, use File->Project Structure. This brings up a ‘Project Structure’ dialog. Pick Modules under Project Settings, then click the Sources/Paths/Dependencies tab. Click the green ‘+’ on the far right-hand side, then select Jar or Directories. It doesn’t seem to recurse down the tree, but you can shift-click to select multiple one-level directories. This should then populate the selected folders in the list that has the Java and <Module Source> listed. Once you hit OK at the bottom of the dialog, you can verify that the jar files are listed under the External Libraries heading in the Project panel.
  • Needed to put the mturk.properties file at the root of the project, since that’s where System.getProperty(“user.dir”) said I should.

Success! Sent a task, logged in and performed the task in the sandbox. Now I have to see how to get the results and such.

And the API information on Amazon makes no sense. It looks like the Java API is not actually built by Amazon; the only true access is through SOAP/REST. The Java API is in the following locations:

If you download the zip file, the javadoc API is available in the docs directory, and there appear to be well-commented samples in the samples directory.

Took parts from the simple_survey.java and reviewer.java examples and have managed to post and approve. Need to see if already approved values can be retrieved. If they can be, then I can check against previous uids. Then I could either pull down the session_ids as a list or create a new PHP page that returns the session_ids restfully. Kinda like that.

The PHP page that returns session_ids is done and up: http://philfeldman.com/iRevApps/dbREST.php. I’ve now got Java code running that will pull down the entire XML document and search through it looking for a result that I can then match against the submitted values. I need to check to see that the value isn’t already approved, which means going through the approved items first to look for the session_ids.

And, of course, there’s a new wrinkle. A worker can only access an HIT once, which has led me to loading up quite a few HITs. And I’ve noticed that you can’t search across multiple HITs, so I need to keep track of them and fill the arrays from multiple sources. Or change the db to have an ‘approved’ flag, which I don’t want to do if I don’t have to.  I’d rather start with keeping all the HIT ids and iterating over them. If that fails, I’ll make another RESTful interface with the db that will have a list of approved session_ids.

Iterating over the HITs in order of creation seems to work fine. At least good enough for now.
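The bookkeeping can be sketched roughly as follows. The real code is Java against the Turk API; this is a language-neutral illustration in JavaScript, and the shape of the hits array, along with the names collectApprovedSessionIds and isPayable, is hypothetical. The only thing taken from Turk itself is the 'Approved' assignment status.

```javascript
// hits: [{hitId, creationTime, assignments: [{status, sessionId}]}]  (assumed shape)
// Walks every HIT in order of creation and remembers which session ids
// have already been approved, and under which HIT.
function collectApprovedSessionIds(hits) {
    var approved = {};
    hits.slice()
        .sort(function (a, b) { return a.creationTime - b.creationTime; })
        .forEach(function (hit) {
            hit.assignments.forEach(function (asg) {
                var seen = Object.prototype.hasOwnProperty.call(approved, asg.sessionId);
                if (asg.status === 'Approved' && !seen) {
                    approved[asg.sessionId] = hit.hitId;
                }
            });
        });
    return approved;
}

// A submitted id is payable if the db knows it and no earlier HIT has approved it.
function isPayable(submittedId, knownSessionIds, approved) {
    var known = knownSessionIds.indexOf(submittedId) !== -1;
    var alreadyPaid = Object.prototype.hasOwnProperty.call(approved, submittedId);
    return known && !alreadyPaid;
}
```

Keeping the approved ids in a plain lookup object means the db does not need an ‘approved’ flag, at the cost of re-reading every HIT on each run.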

Bought and set up tajour.com. Got a little nervous about pointing a bunch of turkers at philfeldman.com. I will need to make it https after a while too. Will need to move over the javascript files and create a new turk page.

Starting on the training wizard directive. While looking around for a source of training text I initially tried vocabulary building sites but wound up at famous quotations, at least for the moment. Here’s one that generates random quotes.

The wizard is done, and its initial implementation is my Turk data input page. Currently, the pages are:

  • irev3 – the main app
  • irevTurk – the mechanical turk data collection
  • irevdb – data pulls and formatting from the db.

And I appear to have turk data!

First Draft?

I think I’ve made enough progress with the coding to have something useful. And everything is using the AngularJS framework, so I’m pretty buzzword compliant. Well, that may be too bold a statement, but at least it can make data that can be analyzed by something a bit more rigorous than just looking at it in a spreadsheet.

Here’s the current state of things:

The data analysis app: http://tajour.com/iRevApps/irevdb.html

Changes:
  • Improved the UX on both apps. The main app first.
    • You can now look at other posters’ posts without ‘logging in’. You have to type the passphrase to add anything though.
    • It’s now possible to search through the posts by search term and date
    • The code is now more modular and maintainable (I know, not that academic, but it makes me happy)
    • Twitter and Facebook crossposting are coming.
  • For the db app
    • Dropdown selection of common queries
    • Tab selection of query output
    • Rule-based parsing (currently keyDownUp, keyDownDown and word)
    • Excel-ready CSV output for all rules
    • WEKA-ready output for keyDownUp, keyDownDown and word. A caveat on this. The WEKA ARFF format wants to have all session information in a single row. This has two ramifications:
      • There has to be a column for every key/word, including the misspellings. For the training task it’s not so bad, but for the free form text it means there are going to be a lot of columns. WEKA has a marker ‘?’ for missing data, so I’m going to start with that, but it may be that the data will have to be ‘cleaned’ by deleting uncommon words.
      • Since there is only one column per key/word, keys and words that are typed multiple times have to be grouped somehow. Right now I’m averaging, but that loses a lot of information. I may add a standard deviation measure, but that will mean double the columns. Something to ponder.
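The two ramifications above can be sketched like this: one row per session, one numeric column per key/word, repeated entries averaged, and ‘?’ wherever a session never produced that key/word. The function name toArff and the shape of the observations array are hypothetical, and the sketch assumes the column names are plain alphanumeric tokens (anything with spaces or punctuation would need quoting in a real ARFF file).

```javascript
// observations: [{sessionId, key, ms}]  (assumed shape)
function toArff(relation, observations) {
    var sums = {};   // sessionId -> key -> {total, count}
    var keys = {};   // set of every key/word seen in any session
    observations.forEach(function (o) {
        keys[o.key] = true;
        var session = sums[o.sessionId] || (sums[o.sessionId] = {});
        var cell = session[o.key] || (session[o.key] = {total: 0, count: 0});
        cell.total += o.ms;
        cell.count += 1;
    });

    var columns = Object.keys(keys).sort();
    var lines = ['@RELATION ' + relation, ''];
    columns.forEach(function (k) { lines.push('@ATTRIBUTE ' + k + ' NUMERIC'); });
    lines.push('', '@DATA');

    // One row per session: the mean for each column, or '?' for missing data.
    Object.keys(sums).sort().forEach(function (sid) {
        var row = columns.map(function (k) {
            var cell = sums[sid][k];
            return cell ? String(cell.total / cell.count) : '?';
        });
        lines.push(row.join(','));
    });
    return lines.join('\n');
}
```

Adding a standard deviation would mean emitting a second attribute per key/word from the same pass, which is where the column count doubles.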

Lastly, Larry Sanger (co-founder of Wikipedia) has started a wiki-ish news site. It’s possible that I could piggyback on this effort, or at least use some of their ideas/code. It’s called infobitt.com. There’s a good manifesto here.

Normally, I would be able to start analyzing data now, with WEKA and SPSS (which I bought/leased about a week ago), but my home dev computer died and I’m waiting for a replacement right now. Frustrating.

Cluster Analysis in Mathematica

UMBC appears to have a Wolfram Pro account and student copies of Mathematica, covered by tuition, it seems. I need to do cluster analysis on words, trigraphs and digraphs. This seems to be a serious win. One option is to use the heavy client. This page seems to cover that.

I wonder if I can use Alpha Pro as a service for an analysis page though. That could be very cool. It certainly seems like a possibility. More as this progresses…

Notes:

  • R Commander Two-way Analysis of Variance Model – https://www.youtube.com/watch?v=uSI1CIHEZcc
  • Success! In that I was able to read in a file (Insert->File Path…), then click Import under the line. Boy, that’s intuitive…
  • ANOVA (yes, all caps) runs like this: ANOVA[myModel, {myFactor1, myFactor2, All}, {myFactor1, myFactor2}]

Trigraphs and digraphs and milliseconds oh my!

I’ve been reading my papers on recognizing biometrics from keystroke info, and the approaches generally seem to work either from training neural nets or from examining the timing of certain letter patterns, particularly digraphs and trigraphs. I’m currently working on parsing the raw data into more manageable data that can be stored in a form-specific table:

CREATE TABLE IF NOT EXISTS `trigraph_table` (
  `uid` int(11) NOT NULL AUTO_INCREMENT,
  `session_id` varchar(255) NOT NULL,
  `word` varchar(255) NOT NULL,
  `milliseconds` int(11) NOT NULL,
  PRIMARY KEY (`uid`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=1 ;
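A rough sketch of the parsing step that could fill such a table. It assumes the raw data is a list of {keyCode, time, status} events, and it assumes a trigraph’s duration is the time from the first key-down to the third key-down; both are illustrative choices, as is the name extractNgraphs. Any non-letter key-down (space, shift, punctuation) breaks the current run, so n-graphs never span a word boundary.

```javascript
// events: [{keyCode, time, status}] with status 'down' or 'up'  (assumed shape)
// n: 2 for digraphs, 3 for trigraphs
function extractNgraphs(sessionId, events, n) {
    var rows = [];
    var run = [];   // the most recent consecutive letter key-downs, at most n long
    events.forEach(function (e) {
        if (e.status !== 'down') { return; }
        if (e.keyCode < 65 || e.keyCode > 90) {   // not A-Z: the run is broken
            run = [];
            return;
        }
        run.push(e);
        if (run.length > n) { run.shift(); }
        if (run.length === n) {
            rows.push({
                session_id: sessionId,
                word: run.map(function (k) {
                    return String.fromCharCode(k.keyCode).toLowerCase();
                }).join(''),
                milliseconds: run[n - 1].time - run[0].time
            });
        }
    });
    return rows;
}
```

Each returned object lines up with the session_id, word and milliseconds columns, so the same function can feed a digraph table by passing n = 2.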

I can point back to the session_table for all the associated data if there is not sufficient clustering with just the session_id. Speaking of the session_table, it’s getting downright scary:

CREATE TABLE IF NOT EXISTS `session_table` (
`uid` int(11) NOT NULL AUTO_INCREMENT,
`session_id` varchar(255) NOT NULL,
`type` int(11) NOT NULL,
`entry_time` datetime NOT NULL,
`ip_address` varchar(255) NOT NULL,
`browser` varchar(255) NOT NULL,
`referrer` varchar(255) NOT NULL,
`submitted_text` text NOT NULL,
`raw` MEDIUMTEXT NOT NULL,
`parent_session_id` varchar(255) DEFAULT NULL,
`veracity` int(11) NOT NULL,
`hostname` varchar(255) NOT NULL,
`city` varchar(255) NOT NULL,
`region` varchar(255) NOT NULL,
`country` varchar(255) NOT NULL,
`latlong` varchar(255) NOT NULL,
`service_provider` varchar(255) NOT NULL,
`postal` varchar(255) NOT NULL,
  PRIMARY KEY (`uid`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=1 ;

The idea that we are anonymous is just silly. From the moment we connect to a server, we’re basically exposed. And I’m just getting started with gathering data. Imagine what experienced web developers can do, particularly once cookies are enabled. Minimising that damage to someone who posts to the site is probably on the list of things to look at. Maybe server hardening? Certainly no links from the page.

On a different note, I’ve been thinking a bit about how confident we need to be of a source before we can start determining information trustworthiness. It seems to me that if we can show that the information comes from a wide number of individuals, we gain something as well, even if we can’t sufficiently distinguish any one of them. That does trip up the case of a single individual reporting a unique scoop, though.

Another thing that’s kind of interesting is typos and spell check. Keeping track of typos is easy – just run everything through a dictionary. A different, also potentially useful thing is to look at the differences in words produced by keystrokes and the submitted text. Where those differ, some sort of spell correction was used, which doesn’t behave like a paste, or it would be trapped. Anyway, it’s another form of interesting data.
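One way that comparison might look, as a sketch only. It replays the key-down events into text (letters, space and backspace only, ignoring cursor movement and selection), then compares the result word by word against the submitted text. The event shape and the names replayKeys and correctedWords are hypothetical.

```javascript
// Rebuild the text that the keystrokes alone would have produced.
function replayKeys(events) {
    var text = '';
    events.forEach(function (e) {
        if (e.status !== 'down') { return; }
        if (e.keyCode === 8) {                              // backspace
            text = text.slice(0, -1);
        } else if (e.keyCode === 32) {                      // space
            text += ' ';
        } else if (e.keyCode >= 65 && e.keyCode <= 90) {    // A-Z
            text += String.fromCharCode(e.keyCode).toLowerCase();
        }
    });
    return text;
}

// Return the positions where the typed word and the submitted word differ.
function correctedWords(events, submittedText) {
    var typed = replayKeys(events).split(/\s+/).filter(Boolean);
    var submitted = submittedText.toLowerCase()
        .replace(/[^a-z\s]/g, '')
        .split(/\s+/).filter(Boolean);
    var diffs = [];
    var count = Math.min(typed.length, submitted.length);
    for (var i = 0; i < count; i++) {
        if (typed[i] !== submitted[i]) {
            diffs.push({index: i, typed: typed[i], submitted: submitted[i]});
        }
    }
    return diffs;
}
```

A non-empty result suggests that something other than the keyboard (spell correction, autocomplete) changed the text between typing and submission.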

 

Plans coming together

Ok, things are getting close. I have all the code pieces talking in a single application (http://philfeldman.com/irev1.html). After playing around with the ways that the key down/up events can be trapped, I decided to do as little processing as possible and simply record the keycode, time and status (up/down). The main reason for this is that things like the shift key are pressed while other keys are typed and then released. This way it’s easier to see that happen.
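A minimal sketch of that kind of recorder, assuming nothing more than the keyCode from each event and a timestamp. The name createKeyRecorder and the injectable clock are illustrative choices, not the code running on the page.

```javascript
// Records {keyCode, time, status} for every key event, with no other processing.
// 'now' is an optional clock function so the recorder can be exercised outside a browser.
function createKeyRecorder(now) {
    var clock = now || Date.now;
    var log = [];
    function record(status) {
        return function (e) {
            log.push({keyCode: e.keyCode, time: clock(), status: status});
        };
    }
    return {
        onKeyDown: record('down'),
        onKeyUp: record('up'),
        getLog: function () { return log.slice(); }
    };
}

// In a browser the handlers would be attached to the input, for example:
//   var rec = createKeyRecorder();
//   var input = document.getElementById('submittedTextInput');
//   input.addEventListener('keydown', rec.onKeyDown);
//   input.addEventListener('keyup', rec.onKeyUp);
```

Because down and up are logged separately, a shift key held across several letters shows up as one ‘down’, the letter events, and then one ‘up’.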

I also needed to prevent pasting, since that makes everything more complex (recognizing paste events, working around them, etc). It turns out that YUI doesn’t seem to handle the paste event, so you have to get it from the document directly:

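// Grab the text input straight from the DOM, since YUI doesn't seem to expose the paste event
var pasteTrap = document.getElementById('submittedTextInput');
pasteTrap.onpaste = function(e){
    alert("Paste is not allowed");
    return false; // returning false cancels the default paste action
};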

As usual, we have the fine contributors to StackOverflow to point the way to do this.

Amazingly, it even works in all browsers.

Next is cleanup, putting all the pieces into modules where they belong and doing some better css. I think something like secret might be pretty easy to put together. Colored backgrounds before coding up picture loading. But along those lines.

Last thing for the day is to finish the next pass at the IRB submission.

Safe(er) Data and Nonexistent Functions

If you want to reduce the likelihood of a SQL injection attack, use precompiled queries. Nice in theory, tougher in practice. The nub of the problem appears to be the way that PHP binds data to execute the insert or the pull. With a nice, vulnerable query you can use string manipulation functions and as such make nice, general functions. However, if you’re mean, you can add something like “;DROP TABLE students;” and poof, the table students is gone. Now, there should be a nice call that returns everything as an associative array, but that doesn’t seem to be reliable across PHP installations, so we need to work with the much more restrictive fetch().

Things to remember:

  • Everything has to happen when the statement is available, between prepare() and close().
  • Use bind_param(String datatypes…) to send data and bind_result for returning data. bind_param is less picky – you can access elements of an array directly. For bind_result you have to have individual variables declared.
  • When things go wrong in the PHP mysql code, it is likely that an HTML table will be returned. That will need to be handled.
  • Stringify and parse of objects into and out of JSON may or may not handle hierarchies. Watch what goes on in the debugger.
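On the last point, a quick check with the browser’s native JSON object (used here in place of Y.JSON, and assumed to behave the same way) shows what survives a round trip. The post object and its field names are made up for the example.

```javascript
// Serialize to JSON and back again, as would happen between the page and the PHP layer.
function roundTrip(value) {
    return JSON.parse(JSON.stringify(value));
}

var post = {
    sessionId: 'abc123',
    raw: [
        {keyCode: 72, time: 0, status: 'down'},
        {keyCode: 72, time: 45, status: 'up'}
    ],
    meta: {browser: {name: 'example'}, note: undefined},
    when: new Date(0)
};

var copy = roundTrip(post);

// Nested arrays and objects come back intact:
//   copy.raw[1].status      is 'up'
//   copy.meta.browser.name  is 'example'
// Properties whose value is undefined are dropped:
//   'note' in copy.meta     is false
// Dates come back as ISO strings, not Date objects:
//   typeof copy.when        is 'string'
```

So plain hierarchies are fine, but anything that is not plain data (undefined values, functions, Date objects) needs to be converted by hand on one side or the other.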

Anyway that just about doubled the line count in the middleware and bound the PHP code much more tightly to the form of the database. That being said, this is intended to have some production values in it anyway, so that may be a good thing. The new and improved results are in the same old place, namely io2.html. Next comes the integration of all that DB work, the recognizer part, and the panel part.

Basic Chores

Not much to write about, but some good work got squeezed in today. First, I was able to transition over to mysqli, which turned out to be nearly painless. I’ve been working on a thin layer that’s admittedly got some security holes, but that’s not what I’m trying to work through and the data’s junk anyway.

So to get use out of all this stuff, I need to have everything run on a server. I use Dreamhost, who I like a lot and have been with for years, and they give you PHP and mysql out of the box. So today was the day to try and take all the pieces that I have gotten working on my dev machine and migrate them to a place that people can access.  It did mean getting familiar with SSH and PuTTY all over again though.

The first step was creating a database. Since I’m on a shared server, that’s not as simple as when you own the instance, but Dreamhost has a dashboard that makes this pretty reasonable. It does take time for everything to trickle through, though. Once it was up and running I created a new copy of the same old table I’ve been using for my tests and populated it with the same old data.

Once that was done I fired up WinSCP and copied the files over, changed the config file and tried running the php script on the command line. Imagine my surprise when everything ran right the first time. And then compound that again when the web page worked as well. And both of those files had no changes. Repeat after me:

“Configuration files are wonderful”

“Relative addressing is also”

Anyway, here it is in all its glory: io2.html.

The next part is handling the submission of data to the db, which is making me a bit nervous about SQL injection. I may just use the YUI Escape object to modify the string so it isn’t dangerous. Nope, that won’t work, but we can use blobs. Here’s how (from here):

/**
 * update the files table with the new blob from the file specified
 * by the filepath
 * @param int $id
 * @param string $filePath
 * @param string $mime
 * @return boolean
 */
function updateBlob($id,$filePath,$mime) {
    $blob = fopen($filePath,'rb');

    $sql = "UPDATE files
            SET mime = :mime,
                data = :data
            WHERE id = :id";

    $stmt = $this->conn->prepare($sql);

    $stmt->bindParam(':mime',$mime);
    $stmt->bindParam(':data',$blob,PDO::PARAM_LOB);
    $stmt->bindParam(':id',$id);

    return $stmt->execute();
}

On a related note, I wonder how many of our actions can be stereotyped in a way that can be detected in the browser?

Some good parts, kind of integrated

We are packing up, so I’m done for now. Progress has been pretty good. The core parts of the posting module are done, though they are not yet managed by a “topic manager” or something similar. I have YUI talking via PHP to MySQL, sending objects that contain data that will be needed in a structured way. That turned out to be much harder than I thought, simply because I couldn’t make the debugger in IntelliJ work in the PHP server file in such a way that I could watch an HTTP request come in. In olden days, I would have RTFM about the process and worked from that, but now with OpenSource, I’ve become very dependent on the debugger to tell me what’s actually going on. Many times things don’t correspond with (often stale) documentation. So in the end, I put together a light PHP class that pretty much echoed POST calls back at me so that I could look at them in the JavaScript debugger. That burned a day. Sigh.

The last (new) thing to do is to make the database access robust. I did my code based on Learning PHP, MySQL, and JavaScript, a generally fine book, but it still uses the deprecated “mysql_*” calls. I need to update that and have some generalized data return structures built. Then that part should be reasonably static from here on out.

Integration by parts

It’s vacation so I must be coding. Today it’s outside with a lovely view of the rolling hills surrounding Deep Creek Lake.

I’m working on getting all the pieces for the first version of the Recognizer working. The goal is to have a webapp that allows for original  “posts” or “comments” that pertain to a post. I’m thinking that both of these can be handled in the same table:

CREATE TABLE IF NOT EXISTS `session_table` (
`uid` int(11) NOT NULL AUTO_INCREMENT,
`session_id` varchar(255) NOT NULL,
`type` int(11) NOT NULL,
`entry_time` datetime NOT NULL,
`ip_address` varchar(255) NOT NULL,
`browser` varchar(255) NOT NULL,
`referrer` varchar(255) NOT NULL,
`submitted_text` text NOT NULL,
`raw` text NOT NULL,
`parent_session_id` varchar(255) DEFAULT NULL,
`veracity` int(11) NOT NULL,
PRIMARY KEY (`uid`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=1 ;

There is extra data being stored here (browser, ip, etc) simply because it will make correlations potentially easier at this stage, when I have little idea what works. I’m also storing the raw data along with the post so it can all be processed later, in multiple ways. At this point, the intent is to save the raw data as key/value pairs in JSON, mostly because it’s easy to make that conversion using Y.JSON.stringify.

The other thing that I’ll need to organize the posts/comments is a topic table. I think that for the time being, it can be taken from the title of the first post. Comments can then point to the topic, and that allows for filtering of what to see. Additional filtering at this level can be keywords, trustworthiness, location, etc.