Trustworthy Anonymous Citizen Journalism From Really Scary Places

Tentative findings/implications for design

  • Pervasive serendipity
  • Civilizational vs. tribal affordances

The current applications:

“The Friction of Fear versus the Currency of Trust”

Would you like reporting from places like Syria or North Korea to be safer and more trustworthy? Are you familiar with apps like Secret and Whisper but wish they were better? Me too.

People have a need to speak out about things even when that may land them in trouble. So far, the internet has provided many ways to do so: blogs, Facebook, Twitter, etc. Unfortunately, it’s all too easy to get in trouble speaking your mind in these media, because we are all highly traceable unless we take extraordinary precautions.

I believe that technology is failing to provide these people with a usable means of expressing themselves without fear of discovery. People will always need to speak from the safety of true (or at least good enough) anonymity.

But anonymity isn’t enough. We, the public, need to be able to know how trustworthy information from such anonymous sources is. Without some way of determining trust, true news can easily get lost in a wash of misinformation, trolling and rumour.

I’m researching ways to report trustworthy news anonymously by trying to produce technologies that will allow people to report the news from scary places without fear of retaliation against themselves, their families or their friends.

This work is in support of my PhD thesis, which has the working title “Trustworthy Anonymous Citizen Journalism from Really Scary Places.” The goal of this effort is to provide a means of producing useful, reliable information/news in areas where gathering information is particularly difficult (e.g. the Syrian Civil War, Mexican drug cartels, etc.). Information should be produced anonymously in a way that a user cannot be counterfeited, and users are never identified sufficiently to be placed at risk. It will initially be a web-interactive platform where individuals can post news items anonymously, but in such a way that the information in the post can be cross-referenced to determine the likely veracity or ‘trustworthiness’ of the post.

In such a system anonymity is critical since other actors may desperately wish to determine the identity of the poster. The goal for this research is to determine ways that anonymity can be so thoroughly “baked in” to the design that even if the servers were completely hacked, no information that could reliably point to a particular individual could be recovered.

To determine trustworthiness, it helps greatly if the software can “recognize” a user from their interaction, without ever knowing any other identifying information (such as a login). Work has been done that can identify users [1][2][3], and that can evaluate the user’s level of cognitive stress [4] from typing patterns. Building on this work in a browser context, I intend to determine the viability of using instrumented measures of user behavior (typing patterns, word use, etc.) to recognize returning users to websites or internet-capable applications without using any specific identifying information.
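As a concrete illustration of the kind of instrumented measure involved, the sketch below derives a small feature vector from keystroke timings (dwell and flight times are the staples of the cited keystroke-dynamics work). The event format, feature names, and sample values are assumptions for illustration, not taken from the cited papers.

```python
from statistics import mean, stdev

def keystroke_features(events):
    """events: list of (key, press_ms, release_ms) in chronological order."""
    dwell = [release - press for _, press, release in events]   # hold time per key
    flight = [events[i + 1][1] - events[i][2]                   # release-to-next-press gap
              for i in range(len(events) - 1)]
    return {
        "dwell_mean": mean(dwell),
        "dwell_std": stdev(dwell) if len(dwell) > 1 else 0.0,
        "flight_mean": mean(flight) if flight else 0.0,
    }

# Invented sample: three keystrokes with press/release times in milliseconds.
print(keystroke_features([("t", 0, 90), ("h", 150, 230), ("e", 300, 370)]))
```

A recognizer would compare vectors like this one against stored profiles probabilistically, since a user’s timings drift over sessions.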

In this way, the system need never know the user’s offline identity. Only the information provided by the user becomes valuable, and the connections or correlations between information from various users provides the basis for determining relative trustworthiness.

In other words, if a number of users assert over a period of time that “the sky is blue”, then that element gains in trustworthiness. On the other hand, even if one user “trolls” the system with repeated statements that the sky is “yellow with purple polka-dots”, that information can be correlated with a particular (still anonymous) user and classified accordingly.
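The corroboration idea above can be sketched as follows; the scoring rule (count distinct anonymous users per assertion, ignoring repeats by the same user) is a deliberately minimal placeholder for whatever correlation scheme the real system uses.

```python
from collections import defaultdict

def assertion_scores(reports):
    """reports: list of (anon_user_id, assertion) pairs."""
    supporters = defaultdict(set)
    for user, assertion in reports:
        supporters[assertion].add(user)   # repeats by the same user count once
    return {a: len(users) for a, users in supporters.items()}

reports = [
    ("u1", "the sky is blue"),
    ("u2", "the sky is blue"),
    ("u3", "the sky is blue"),
    ("u4", "yellow with purple polka-dots"),
    ("u4", "yellow with purple polka-dots"),   # one user trolling repeatedly
]
print(assertion_scores(reports))
# {'the sky is blue': 3, 'yellow with purple polka-dots': 1}
```

Note that the troll’s repetition buys no extra weight, while their (still anonymous) identity accumulates a classifiable track record.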

Currently I’m working on a pilot study to produce software that can reliably (though not absolutely) distinguish among submissions by multiple users, so that one user’s corpus can be told apart from another’s while maintaining absolute anonymity. Once users can be recognized, topics and tags can be associated with those users using clustering and other statistical means.

100% recognition is not required here, so this is different from efforts to replace passwords, for example. If the system is only 50% confident that a novel news item is coming from a “trustworthy” source, then the reliability weight of the news item is proportionately reduced. With luck, the system should prove to be reasonably robust. Further integration with other, external fact-checking sites may also be used to determine the veracity of items posted by users.
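A minimal sketch of that proportional weighting, assuming a simple linear rule (a placeholder, not a final design):

```python
def item_weight(source_trust, recognition_confidence):
    """Both inputs in [0, 1]; returns the discounted reliability weight."""
    return source_trust * recognition_confidence

# A source rated 0.8 trustworthy, recognized with only 50% confidence:
print(item_weight(0.8, 0.5))  # 0.4
```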

Subsequent studies will attempt to extend the work of the pilot study across progressively larger populations of users so that any limitations of the initial studies can be uncovered.

  1. Chang, M., et al. “Capturing Cognitive Fingerprints from Keystroke Dynamics for Active Authentication.” (2013): 1-1.
  2. Monrose, Fabian, and Aviel D. Rubin. “Keystroke dynamics as a biometric for authentication.” Future Generation computer systems 16.4 (2000): 351-359.
  3. Haider, Sajjad, Ahmed Abbas, and Abbas K. Zaidi. “A multi-technique approach for user identification through keystroke dynamics.” Systems, Man, and Cybernetics, 2000 IEEE International Conference on. Vol. 2. IEEE, 2000.
  4. Vizer, Lisa Michele. Detecting cognitive stress and impairment using keystroke and linguistic features of typed text: Toward a method for continuous monitoring of cognitive status. Diss. University of Maryland, Baltimore County, 2013.

What follows are my notes and thinking on the topic.

This came up in an online discussion and I thought it was worth sharing.

If you don't have time to read this, this is probably the most useful part: the Introduction section covers the major points that any research project needs to be able to answer quickly and clearly:

1. What is the problem?
2. Why is it interesting and important?
3. Why is it hard? (E.g., why do naive approaches fail?)
4. Why hasn't it been solved before? (Or, what's wrong with previous proposed solutions? How does mine differ?)
5. What are the key components of my approach and results? Also include any specific limitations.


Current trustworthiness project

  • Pull stories from Google News (top-level feeds: World, national, entertainment, etc.) RSS, parse them, and put them in the database
  • Using Alchemy NLP, pull out the authors, subjects, links, etc. from the stories. Search for them in the Alchemy News API. Use this to populate relevant tables (author, etc.) that can point back to the main article
  • We’ll need some ratings tables as well. The information should include the rating and the links that support the statement. If there is freeform text, we could run some NLP on it for sentiment, etc.
  • Provide the list to the browser as the navigator, with the trustworthiness annotations
  • When a story is clicked, show the associated network with that story. Each item can be clicked to bring up information about that attribute (as a pop-up?)
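The first step of the pipeline above can be sketched as follows. The feed content is hardcoded so the example is self-contained; a real run would fetch the Google News RSS URLs, and the Alchemy enrichment, ratings tables, and database steps are omitted.

```python
import xml.etree.ElementTree as ET

# Hardcoded stand-in for a Google News-style RSS feed.
RSS = """<rss version="2.0"><channel><title>World</title>
<item><title>Example story</title><link>http://example.com/story</link>
<pubDate>Mon, 01 Dec 2014 10:00:00 GMT</pubDate></item>
</channel></rss>"""

def parse_feed(xml_text):
    """Return one database-ready record per <item>."""
    root = ET.fromstring(xml_text)
    return [{"title": item.findtext("title"),
             "link": item.findtext("link"),
             "published": item.findtext("pubDate")}
            for item in root.iter("item")]

for row in parse_feed(RSS):
    print(row["title"], row["link"])
```

Each record would then be handed to the NLP step to extract authors, subjects, and links before insertion into the database.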

This is why the world needs this

What is the one thing you know and how do your research questions relate to that?

And things to consider

The research questions that pertain to this effort are:

  1. How to determine identity (uniqueness?) in an anonymous way? Just to make things harder, we need ways that can’t be used indirectly to identify someone, like GPS movement patterns. The assumption to be tested is that identity can be recognized by detecting patterns of action, rather than a login, for example. Also, the anonymization needs to happen on the client. Websites can be hijacked (although for initial testing and gaming, raw data on the server makes sense).
  2. How to determine the trustworthiness of information gathered anonymously, using crowdsourcing techniques?
  3. How do people behave when they know they are anonymous? And given the tradeoffs that hide identity, such as aggregation?

The initial approaches for answering these questions will be to look at the following:

  1. Examine the biometrics that can be detected using mobile technology, and see if this information can be used to reliably and uniquely detect whether a single user is interacting with the mobile device. Examples of work in this field are:
    1. Unique in the Crowd: The privacy bounds of human mobility
    2. Identifying User Traits by Mining Smart Phone Accelerometer Data
    3. Human Identification via Gait Recognition Using Accelerometer Gyro Forces
    4. A Password So Secret, You Don’t Consciously Know It (similar)
    5. Accelerometer-Based Transportation Mode Detection on Smartphones
    6. Capturing Cognitive Fingerprints from Keystroke Dynamics (overview article)
    7. Cell Phone-Based Biometric Identification
    8. LatentGesture
    9. Using hidden markov models for accelerometer-based biometric gait recognition
    10. Biometric Gait Authentication Using Accelerometer Sensor
    11. And here’s a data source with approximately 60 million unique samples of accelerometer data collected from 387 different devices. With code that can be used apparently.
    12. Researchers develop ‘narrative authentication’ system – not sure if this is directly relevant, but think about other possible but nonobvious ways that identity could be built up by observing patterns of usage, activity patterns, etc.
    13. Extracting insights from the shape of complex data using topology – This paper applies topological methods to study complex high dimensional data sets by extracting shapes (patterns) and obtaining insights about them. Our method combines the best features of existing standard methodologies such as principal component and cluster analyses to provide a geometric representation of complex data sets. Through this hybrid method, we often find subgroups in data sets that traditional methodologies fail to find.
    14. Using Machine Learning and NodeJS to detect the gender of Instagram Users
    15. Typing patterns could be good for a proof-of-concept, or as part of a final system. It looks like all the calculation could be done on the device to produce an ID. My guess is that the value would tend to drift, so users would be recognized using probabilities.
      1. Typing Patterns: A Key to User Identification
      2. Privacy: Gone with the Typing! Identifying Web Users by Their Typing Patterns
      3. Keystroke dynamics as a biometric for authentication
      4. A multi-technique approach for user identification through keystroke dynamics
      5. How the way you type can shatter anonymity—even on Tor
    16. Once we can recognize people, can we tell when they are under stress or unreliable? (a related article on normalcy profiles in computers)
    17. Avoiding Crowdsourcing problems
      1. How Mechanical Turk is Broken | MIT Technology Review
      2. Creating speech and language data with Amazon’s Mechanical Turk
      3. Soylent: a word processor with a crowd inside
      4. The DARPA balloon challenge?
    18. Via Andrea Choiniere
      1. Thesis on this topic, including code used for walking detection and step cycle calculation:
      2. Research paper implementing accelerometer biometric test using controlled phone position and controlled walkway:
      3. Example using controlled phone position and quasi-controlled activities:
      4. Example using various activities but under quasi-controlled conditions:
    19. Meeting with Andrea on 12.30.13
      1. Looks like uniqueness can be determined by word choice in a corpus as small as 500. That does make things easier. It also allows for triangulation against other metrics, which would allow for looking at accelerometer data from several body positions for example. Though I’m not sure that’s needed.
      2. An issue to consider is that people who might use the site would have other examples of their writing. This means that an anonymous source could be identified. As a way around this, a vector-based translation algorithm could be trained using identification code to remove/modify the parts of the user’s language that are identifiable.
      3. And actually, this means that I could build a simple website that you train once, then write to. The site then transpiles the user’s words and publishes to twitter or Facebook for example. This just addresses the anonymous part of the problem, not the trustworthiness part, but it’s nice low-hanging fruit.
    20. Fika – 10.17.14
      1. Flame War Detection using Naive Bayes – Amy
      2. Typing Patterns: A Key to User Identification – Amy
      3. Keystroke data – Amy
      4. Andrés Monroy-Hernández – Helena
      5. Jeanine Finn – Alyson
  2. Examine how groups of people interact with “newsworthy” events. Are there means by which the use of multiple visual and audio perspectives can make it unlikely that the event is counterfeit in some way? Some work has already been done with respect to event authenticity as seen in this Poynter article. The technology behind shooting acoustic analysis could be useful here, as discussed by the Washington Post. And of course, there’s Storify, Storyful and Project EPIC.
    1. Cornell Social Lab – CityBeat automated news gathering from social networks:
    2. Social Physics
  3. A third element combines the previous two, by attempting to determine the “trustworthiness” of an individual based on prior interactions with “newsworthy” events.
    1. Belief Dynamics and Decision Making “..models must consider the roles of beliefs, attitudes, and sacred values within a culture, and how they interact with institutional constraints and perceived external pressures. They must address behaviors within a culture at the levels of the individual, the group, and the governing body. The most important objective of our MURI is to bring together models of beliefs and behaviors at each of the three levels, showing how the levels interact and influence one another.”
    2. Understanding Support Vector Machines (a collection of nice blog postings and tutorials)
    3. RELEVANCE: A Review of and a Framework for the Thinking on the Notion in Information Science. Basic thinking about how connections determine relevance from 1976. OCR copy in project folder.
    4. Proximity of multiple users reporting aspects of the same story might be helpful. One way to determine if multiple observers were colocated at the same time could be to use a secondary “acoustical” network. Ultrasonic (timestamped?) signals from one device could then be incorporated into another devices’ feed, which could be used to help validate both signals. A paper that touches on this for other reasons is here:
    5. Evidentiality for text trustworthiness detection: Evidentiality is an important clue for text trustworthiness detection. With the binarized vector setting, the evidentiality-based text representation model performed considerably better than both the bag-of-word model and the content word based model. Most crucially, we show that the best trustworthiness detection result is achieved when evidentiality is incorporated in a linguistically sophisticated model where their meanings are interpreted in both semantic and pragmatic terms.
    6. Analyzing collective behavior from blogs using swarm intelligence: We introduce a nature-inspired theory to model collective behavior from the observed data on blogs using swarm intelligence, where the goal is to accurately model and predict the future behavior of a large population after observing their interactions during a training phase. Specifically, an ant colony optimization model is trained with behavioral trend from the blog data and is tested over real-world blogs. Promising results were obtained in trend prediction using ant colony based pheromone classifier and CHI statistical measure.
    7. No, Torture Doesn’t Make Terrorists Tell The Truth — But Here’s What Actually Works  “This approach can also separate liars from truth-tellers. When recalling their experiences in a cognitive interview, people who are telling the truth give longer and more detailed answers. Their recollections also tend to grow as more details come back into focus. Liars, on the other hand, typically tell a bare-bones story that doesn’t develop with retelling.”“Credibility is all in the words people use,” Meissner told BuzzFeed News. “It’s in the way they tell their story.” And crucially, it seems hard to game the system. Telling a lie is more mentally demanding than telling the truth, and hiding this cognitive effort is harder than concealing signs of stress.”
  4. A possible fourth element is a way of determining the anonymous identity of a device. As this article shows, it is possible to uniquely identify a portable device from the characteristics of its sensors. This means that it may be possible to identify both the person and the device. Should trust be reduced if a known person is using a new device? Is a device that was used by a trusted person an indicator that the next person is more/less trustworthy than normal?
  5. Since “following” an individual allows the creation of a social network that can be used to identify an individual, subscribers to the repositories can track “themes” or “ideas”. It might be that these themes are automatically generated, crowdsourced using tags, or some other means of categorization. A question to be addressed is whether a category can consist of reports from only one reporter
    1. Creating a repository or portal could be something like Zooniverse. (Wikipedia entry)
    2. Gaining Wisdom From Crowds (Communication of the ACM)
  6. Education. How to help potential reporters become good ones, while not providing clues to their identity? Is it possible to add a level of awareness so that the system can look at a report that someone wants to upload and identify potential ways that they could be identified from the information?
  7. Trolling and/or teamwork
    1. Teamwork OP: Riot on making ‘good’ the easy choice
  8. videogame studies
    1. How World of Warcraft Might Help Head off the Next Pandemic
    2. Virtual Epidemics as Learning Laboratories in Virtual Worlds.
    3. Journal of Virtual Worlds Research
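One idea recurring in the notes above, recognizing a returning author purely from word choice, can be sketched as cosine similarity over word-frequency vectors. Real authorship-attribution work uses much richer features (see JGAAP); the texts and anonymous IDs here are invented for illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-frequency Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_match(corpora, text):
    """corpora: {anon_id: prior text}. Returns (anon_id, similarity)."""
    vec = Counter(text.lower().split())
    scores = {uid: cosine(Counter(t.lower().split()), vec)
              for uid, t in corpora.items()}
    return max(scores.items(), key=lambda kv: kv[1])

corpora = {"anon-1": "the checkpoint on the north road was closed again",
           "anon-2": "market prices rose sharply across several districts"}
uid, score = best_match(corpora, "the north road checkpoint closed at dawn")
print(uid)  # anon-1
```

Since similarity is a probability-like score rather than a hard match, this dovetails with the earlier point that 100% recognition is not required.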


  • Wickr – “The Most Trusted Messenger in the World.” Trusted by world leaders, executives, journalists, human rights activists and your friends.
  • Google NewsLab – “collaborate with journalists and entrepreneurs to help build the future of media.”
  • AlchemyData News API: (video) (API details) Simple query provides news and blog content enriched with NLP and highly targeted search, trend analysis and historical access to news and blog content.
  • Alchemy Natural Language Processing (demo) (features) Offers 12 API functions as part of its text analysis service, each of which uses sophisticated natural language processing techniques to analyze your content and add high-level semantic information.
  • Watson Developer Cloud offers a variety of services for developing cognitive applications. Each Watson service provides a Representational State Transfer (REST) Application Programming Interface (API) for interacting with the service. IBM Bluemix™ is the cloud platform in which you deploy applications that are developed using Watson services.
  • Microsoft Search API (News) – Lots of oddities to use. Here’s my example.
  • Newswhip Spike – Know exactly what stories, writers and events are getting engagement in real time. Focus on niche topics, places and specialist publications
  • Infobitt – Manifesto(s), long and short. Larry Sanger is a co-founder of Wikipedia. To contact (another committee member?)
  • Jstacs – A Java framework for statistical analysis and classification of biological sequences
  • Mining Twitter with R
  • MentionMap
  • WEKA
  • JGAAP (Java Graphical Authorship Attribution Program)
  • SigmaJS is a JavaScript library dedicated to graph drawing. It makes it easy to publish networks on Web pages, and allows developers to integrate network exploration in rich Web applications.
  • Media Cloud – Media Cloud is a project that seeks to track news content comprehensively – providing open, free, and flexible tools for quantitative analysis of media trends.(The Berkman Center for Internet & Society at Harvard University)
  • Truthy – Information diffusion research at Indiana University (paper and article)
  • Let’s Encrypt – The objective of Let’s Encrypt and the ACME protocol is to make it possible to set up an HTTPS server and have it automatically obtain a browser-trusted certificate, without any human intervention. This is accomplished by running a certificate management agent on the web server.
  • ipInfo – service to get location information from ip address
    // PHP example: query ipinfo.io's JSON endpoint for the visitor's location
    $ip = $_SERVER['REMOTE_ADDR'];
    $details = json_decode(file_get_contents("https://ipinfo.io/{$ip}/json"));
    echo $details->city; // -> "Mountain View"
  • YUI AngularJS
  • PHP
  • Apache
  • Quora API – Quora is an interesting ‘news’ site that has its own ways of determining veracity by tracking who posts what and how it’s upvoted. It turns out they have an unofficial REST API. It might be possible to tie into it using a sort of calculated identity, and have an exchange with the users of the site.
  • Reddit-related


  • Computation + Journalism: The Computation+Journalism Symposium is a celebration and synthesis of new ways to find and tell news stories with, by, and about data and algorithms. It is a venue to seed new collaborations between journalists and computer and data scientists: a bazaar for the exchange of ideas between industry/practice and academia/research.

Relevant literature (newest on top):

Presentation of the information

People who might be useful to involve

  • Jonathan Grudin
  • Roy Rada – agreed to be on committee
  • Lina Zhou
  • Bin Zhou
  • Kevin Crowston
  • Leysia Palen. This hews closely to Project EPIC. – agreed to be on committee.
  • Jon Callas
  • James Graves – Keyboard pattern recognition.
  • Delip Rao – Lead author on several papers during his time at the Human Language Technology Center of Excellence at Johns Hopkins University that dealt with algorithmic methods of identification. Many of his co-authored papers talk about “latent attributes,” those implicit specific details about people that can be surfaced, including ethnicity and gender.


Fellow Students

  • Amir Karami <>
  • Ali Azari <>

The experiment to determine the validity of the approaches will have two parts.

The first will be the analysis of the crowdsourced mobile data to determine “identity” of users, their trustworthiness, and the likelihood that a documented event is authentic.

To produce data for the first part of the system to analyze, the second part will be the development of a multiplayer online game (MOG) that will track users as they engage in a game of “space invaders” loosely based on the television series “V”. In this game, registered users will interact with an Augmented Reality scenario where events such as UFO appearances, alien artifact discoveries, and so forth will be presented as game play elements. Participants will have to discover and document these events, which will become more varied and complex as the story unfolds. Some research to bear in mind.

Using this framework, we will have the ability to know exactly which registered user did what in the context of the game. We will know what information with respect to game events has been created. As such, the analysis systems will have both a “real world” dataset produced by the biometric and recording game components as well as clean meta information about how the data originated. This means that all analysis results will be testable with respect to the actual events.
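A minimal sketch of the evaluation that this ground truth enables, with invented event and user IDs: the game server’s log supplies the true identities, and the anonymous recognizer’s output is scored against it.

```python
def recognition_accuracy(predicted, actual):
    """predicted/actual: {event_id: user_id}. Fraction of shared events where
    the anonymous recognizer matched the registered identity."""
    shared = set(predicted) & set(actual)
    if not shared:
        return 0.0
    hits = sum(1 for e in shared if predicted[e] == actual[e])
    return hits / len(shared)

actual = {"e1": "u1", "e2": "u2", "e3": "u1", "e4": "u3"}      # game server log
predicted = {"e1": "u1", "e2": "u2", "e3": "u2", "e4": "u3"}   # recognizer output
print(recognition_accuracy(predicted, actual))  # 0.75
```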

The primary experiment will consist of having users interact with the game scenario in the roles of insurgents or occupiers. The game will start with a surreptitious alien invasion and, depending on the number of players, build toward larger-scale conflicts. All “weapons” in the game are indirect fire, where a target is designated using a mobile device. Initially, players will only be able to track aliens (to join with or oppose). Weapons will become available over time as the story arc progresses. Players have to develop sufficient trust to be granted control of weapons. Other players may try to interfere with the trust relationship, etc.

The result of all this activity is to produce a large amount of clean data for the purposes of analysis and to test possible biometric solutions and information validation techniques. Given that MMOGs can easily have millions of users, it seems reasonable that this mechanism for gathering data should result in thousands to millions of high-quality data sets.

Once the analytic components are sufficiently robust and accurate within the game context, the possibility of the software elements being successful in an actual event is quite high.

Gates (Bold items are major milestones)

  • Determine cross-platform mobile development environment that will support biometrics and AR game concept
  • If needed determine AR library
  • Develop proof of concept AR game “scene” that allows for the recording, commenting and saving of an event.
  • Develop proof of concept biometric “recognizer”
  • Story development and production of game “bible”
  • Initial front-end game framework design (app and server)
    • User registration and roles
      • Resistance Forces
      • Occupation Forces
    • Push notifications
    • AR event presentation
    • AR documenting
      • Video
      • Stills
      • Text
      • Other? (shooting down UFOs, etc)
    • Server GIS-based game engine.
      • Event production, management and recording of meta information
        • Who saw what and where
        • What they did
    • Recording and storage of data for later analysis
  • Initial back-end game framework design (Server and webpage)
    • Biometric integration (automatic login later?)
    • Event recognition and synchronization
    • Communication between users
      • Since the game is based on anonymous users, how do they establish communication channels? F2F meetings? (hold two phones together and shake to guarantee proximity?)
      • C3 could be based on groups that have met F2F reaching out to other users based on behavior
      • User behavior tracking – needs to be useful enough to see if someone is worth recruiting but not enough to trap them. 
    • Recording and storage of data for later analysis
      • Development of interfaces for various customers, such as the news media
  • Integration engine for correlating front-end and back-end data
  • Front-end coding
  • Back-end coding
  • Generation of production game assets
  • Closed alpha release (invitation)
  • Debugging and initial data analysis.
  • Initial paper(s)
  • Open game alpha release
  • Debugging and data analysis
  • Beta release
  • Production release
  • Algorithm refinement until accuracy goals are achieved
  • Decoupling of back-end from game engine for use “in the wild”
