Relevance, Pertinence (Vantinence?)

Initial decisions/constraints

  • Using the Research Through Design (RTD) process, I started with a small set of assumptions and constraints, and used the process of development to determine what was technically possible, likely to resonate with a user, and scalable. The space in which I have placed myself (generation of trustworthy information from anonymous sources) is itself a constraint, since people cannot be rated directly as individuals; they can only be associated, and only probabilistically, with the information that the data entry webpage produces.
  • The system needed to be a webpage for the greatest reach. No app or plugins should be needed. JavaScript is now a standard(ish) part of every browser, so the decision was to use JavaScript and Angular 1.x for all client-side development. In addition, I decided to use the TypeScript language and compiler to produce the JavaScript, in an effort to produce more robust software.
  • Login and user management are included for this part (as opposed to the anonymous posting part), because information browsing is not as risky as content generation, and the ability to track a user’s browsing habits over time could be useful. There is always the option of allowing a guest user.
  • Database: A general relational database (MySQL) was chosen for the server information store because it allowed high flexibility in the construction of the data. A relational database can mimic graph and object databases better than they can mimic a relational database. Within this general approach, the data structures, and some of the historical situations that led to certain choices, are discussed below:
    • Users:
      CREATE TABLE tn_users (
          id_index INT AUTO_INCREMENT PRIMARY KEY NOT NULL,
          created_on DATE,
          last_accessed DATE,
          login VARCHAR(255) UNIQUE NOT NULL,
          password VARCHAR(255) NOT NULL
      );
      • This table provides minimal information about the user. The DATE fields for creation and last access provide some insight into how long the user has been active in the system. However, the decision was made to embed any additional information about user actions within other data structures.
    • Types:
      CREATE TABLE tn_types (
          id_index INT AUTO_INCREMENT PRIMARY KEY NOT NULL,
          name VARCHAR(255)
      );
      • Based on the RTD process, a goal of this system was to be as flexible as possible. It was reasonable to assume that items, associations, and other data structures in the database might have differing types. Rather than predefining them, this table allows types to be added as the need becomes apparent. Since the queries generally depend on the id_index for speed, types can be added but not removed. The initial types were UNKNOWN, USER, ITEM, RATING, and URL. Later, after incorporating NLP capabilities, KEYWORDS, QUERY, ENTITIES, CONCEPTS, AUTHOR, TITLE, and SENTIMENT were added. As the ability to adjust the local network interactively was added, the HOST, TRUSTWORTHINESS, COMPUTED, and EXPLICIT types were added. As these interactions were refined, the DEPRECATED, MERGE, WAMPETER, and IDENTITY types were added.
    • Items:
      CREATE TABLE tn_items (
          id_index      INT AUTO_INCREMENT PRIMARY KEY NOT NULL,
          created_on    DATE,
          item_type     INT NOT NULL, 
          user_id       INT,  
          text          TEXT, 
          float_val     FLOAT, 
          int_val       INT, 
          link          VARCHAR(255), 
          title         VARCHAR(2046), 
          pub_date      DATE, 
          image         VARCHAR(255), 
          guid          VARCHAR(255) 
      );
      • One of the initial concepts used in the design is that items (webpages, entities, etc.) are singletons. Multiple networks can share the same item, but there is only one copy, which can be easily identified by the item’s globally unique ID (guid). The idea behind this was that it would lower the storage requirements and also allow the comparison of multiple networks that use the same item. For example, if multiple networks access the same Wikipedia page, that information can be tracked with a simple query as opposed to a much more extensive search.
      • Other information associated with an item includes the user_id of the individual who first accessed it. The rest of the fields reflect a best guess as to what might need to be saved in the context of an internet result. They are based loosely on the RSS 2.0 item specification.
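      • The shared-item lookup described above can be sketched on the client side as a simple scan. This is only an illustration under stated assumptions: the row shapes mirror a subset of the tn_items and tn_associations columns, the function name networksUsingItem is hypothetical, and an item is assumed to belong to a network when some association in that network uses it as a source or a target.

```typescript
// Row shapes mirror a subset of the tn_items and tn_associations columns.
interface ItemRow {
  id_index: number;
  guid: string;
}
interface ItemLinkRow {
  network_id: number;
  source_id: number;
  target_id: number;
}

// Hypothetical helper: list the networks that reference the item with this guid.
// Assumption: an item is "in" a network when an association in that network
// uses it as a source or a target.
function networksUsingItem(
  guid: string,
  items: ItemRow[],
  associations: ItemLinkRow[],
): number[] {
  // Items are singletons, so at most one row carries this guid.
  let itemId: number | null = null;
  for (const row of items) {
    if (row.guid === guid) {
      itemId = row.id_index;
      break;
    }
  }
  if (itemId === null) {
    return [];
  }
  const networkIds: number[] = [];
  for (const assoc of associations) {
    const usesItem = assoc.source_id === itemId || assoc.target_id === itemId;
    if (usesItem && networkIds.indexOf(assoc.network_id) === -1) {
      networkIds.push(assoc.network_id);
    }
  }
  return networkIds.sort((a, b) => a - b);
}
```

      • Because there is only one copy of each item, the scan never has to reconcile duplicates; on the server the same idea would be a single join between the two tables.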
    • Touches:
      CREATE TABLE tn_touches (
          id_index         INT AUTO_INCREMENT PRIMARY KEY,
          touch_date       DATE,
          user_id          INT, 
          item_id          INT
      );
      • The intent of this table is to track the overall behavior(?) of the items that make up a network. Rather than adding multiple copies of an item, the system adds a small record containing the pointer to the item, the id of the user that requested the item, and when the request happened. All other information is provided in the item record.
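      • A minimal in-memory sketch of this bookkeeping follows. It is not the server implementation: the TouchStore shape and the recordAccess name are hypothetical, ids are generated by counting rows as a stand-in for AUTO_INCREMENT, and dates are passed in as plain strings.

```typescript
// Row shapes mirror a subset of the tn_items and tn_touches columns.
interface StoredItem {
  id_index: number;
  guid: string;
  user_id: number; // the user who first accessed the item
}
interface TouchRow {
  id_index: number;
  touch_date: string;
  user_id: number;
  item_id: number;
}
// Hypothetical in-memory stand-in for the two tables.
interface TouchStore {
  items: StoredItem[];
  touches: TouchRow[];
}

// Hypothetical helper: reuse the single copy of an item and log one touch.
function recordAccess(
  store: TouchStore,
  userId: number,
  guid: string,
  touchDate: string,
): TouchRow {
  let item: StoredItem | null = null;
  for (const candidate of store.items) {
    if (candidate.guid === guid) {
      item = candidate;
      break;
    }
  }
  if (item === null) {
    // First access: create the one and only copy of the item.
    item = { id_index: store.items.length + 1, guid: guid, user_id: userId };
    store.items.push(item);
  }
  // Every access, first or repeated, adds only a small touch record.
  const touch: TouchRow = {
    id_index: store.touches.length + 1,
    touch_date: touchDate,
    user_id: userId,
    item_id: item.id_index,
  };
  store.touches.push(touch);
  return touch;
}
```

      • Two users requesting the same guid would leave one item row (owned by the first requester) and two touch rows.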
    • Associations:
      CREATE TABLE tn_associations (
          id_index    INT AUTO_INCREMENT PRIMARY KEY NOT NULL,
          created_on  DATE,
          weight      FLOAT,
          network_id  INT,
          user_id     INT,
          assoc_type  INT, -- references tn_types
          source_id   INT, -- references tn_items
          target_id   INT, -- references tn_items
          guid        VARCHAR(255) 
      );
      • Associations are used to create a distinct network. An association is directed, with a source and a target. It can also be typed and associated with the user that created it.
      • Once I realized that PageRank could be calculated in the browser in real time, I added a FLOAT weight field to save manipulations that were performed in the browser.
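      • To illustrate why an in-browser calculation is feasible, here is a small sketch of PageRank by power iteration over the directed associations of one network. It is a textbook formulation rather than the code used in this system: the pageRank name, the damping factor of 0.85, the fixed iteration count, and the uniform handling of nodes without outgoing links are all assumptions, and the weight field is ignored.

```typescript
// Edge shape mirrors the source_id / target_id columns of tn_associations.
interface DirectedEdge {
  source_id: number;
  target_id: number;
}

// Hypothetical helper: power-iteration PageRank over one network.
// Assumes nodeIds holds unique ids and every edge endpoint appears in it.
function pageRank(
  nodeIds: number[],
  edges: DirectedEdge[],
  damping: number = 0.85,
  iterations: number = 50,
): Record<number, number> {
  const n = nodeIds.length;
  let rank: Record<number, number> = {};
  const outDegree: Record<number, number> = {};
  for (const id of nodeIds) {
    rank[id] = 1 / n;
    outDegree[id] = 0;
  }
  for (const edge of edges) {
    outDegree[edge.source_id] = (outDegree[edge.source_id] ?? 0) + 1;
  }
  for (let step = 0; step < iterations; step++) {
    // Rank held by nodes with no outgoing links is spread over all nodes.
    let danglingMass = 0;
    for (const id of nodeIds) {
      if ((outDegree[id] ?? 0) === 0) {
        danglingMass += rank[id] ?? 0;
      }
    }
    const base = (1 - damping) / n + (damping * danglingMass) / n;
    const next: Record<number, number> = {};
    for (const id of nodeIds) {
      next[id] = base;
    }
    // Each node passes an equal share of its rank along every outgoing edge.
    for (const edge of edges) {
      const share = (rank[edge.source_id] ?? 0) / (outDegree[edge.source_id] ?? 1);
      next[edge.target_id] = (next[edge.target_id] ?? 0) + damping * share;
    }
    rank = next;
  }
  return rank;
}
```

      • Each iteration is one pass over the nodes and one pass over the edges, which is why a network of modest size can be re-ranked interactively. The resulting values are the kind of number that could be written back to the weight field.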
    • Networks:
      CREATE TABLE tn_networks (
          id_index       INT AUTO_INCREMENT PRIMARY KEY,
          user_id        INT,
          dictionary_id  INT,
          is_private     INT,
          read_only      INT,
          archive        INT,
          description    TEXT,
          name           VARCHAR(255)
      );
      • The network table contains rows that describe the basic components of the network: the owner of the network, the name, and publishing information. Because a user might need a ‘scratch pad’ network when trying out ideas, networks also contain an ‘archive’ flag. If the flag is zero, the network is deleted the next time the server is asked to add a network for that user.
      • As development of the system continued, it became apparent that there needed to be a way to add a particular context to a network. This necessitated the development of the dictionary system and the addition of a dictionary index field to the network table. The structure of the dictionary is discussed below.
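      • The scratch-pad behavior of the archive flag can be sketched as follows. This is an illustration only: the addNetwork name and the in-memory array are hypothetical, the new id is computed as one past the largest existing id as a stand-in for AUTO_INCREMENT, and cleanup of the deleted network's associations is not shown.

```typescript
// Row shape mirrors a subset of the tn_networks columns.
interface NetworkRow {
  id_index: number;
  user_id: number;
  name: string;
  archive: number; // 0 = scratch pad, non-zero = keep
}

// Hypothetical helper: add a network for a user, first dropping that
// user's unarchived (archive === 0) networks.
function addNetwork(
  networks: NetworkRow[],
  userId: number,
  networkName: string,
  archive: number,
): NetworkRow[] {
  // Keep every row that belongs to another user or is flagged for archiving.
  const kept = networks.filter(
    (row) => row.user_id !== userId || row.archive !== 0,
  );
  // Stand-in for AUTO_INCREMENT: one past the largest id seen so far.
  let maxId = 0;
  for (const row of networks) {
    maxId = Math.max(maxId, row.id_index);
  }
  kept.push({
    id_index: maxId + 1,
    user_id: userId,
    name: networkName,
    archive: archive,
  });
  return kept;
}
```

      • Only the requesting user's scratch networks are affected; other users' unarchived networks survive until those users add a network of their own.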
    • Dictionary Entries:
      CREATE TABLE tn_dictionary_entries (
          id_index        INT AUTO_INCREMENT PRIMARY KEY,
          word            VARCHAR(255) NOT NULL,
          parent          INT NOT NULL DEFAULT 0,
          dict_id         INT,
          user_id         INT,
          word_type       INT,
          source_count    INT NOT NULL DEFAULT 0,
          description     VARCHAR(1024),
          server_code     TEXT
      );
      • A dictionary entry consists of several related elements beyond the basic id_index, word, and owning dictionary. Since dictionaries may have multiple definitions for the same word (Java, for example), each dictionary has its own entry for the word, rather than there being a single word that is accessed by many dictionaries. Additional information for each word includes:
        • Parent
          • This is the index of another word in the current dictionary. Taxonomies can be built using this mechanism, though multiple parents are not allowed. This allows the establishment of a particular context. For example, Java -> drink could be a member of a food-focused taxonomy, while Java -> programming language would be used in a software context.
        • Word type
          • My current thinking is that the type would indicate whether the word is a KEYWORD, CONCEPT, ENTITY, or other type.
        • Source Count
          • If a source is used to create a particular dictionary, the source count would be the number of times that the word appears in the training(?) corpus. This may not be an appropriate location for this, as a dictionary may be run against many texts, and each count may be useful. Should there be a ‘corpus’ table that correlates networks, words and counts? It could get re-run every time the network or the dictionary is modified…
        • Description
          • A short description that the user might use to clear up ambiguity, either when choosing a word or a dictionary. For example, Java in two different dictionaries would have:
            • Noun: the main island of Indonesia.
            • Trademark: a high-level, object-oriented computer programming language.
          • The description could also carry simple markup, so that words in the definition that appear elsewhere in the dictionary point to their own entries. This could allow a more sophisticated model of what the meaning is.
        • Server Code
          • The intent here is to have interpreted code that could be used to perform common actions on text to see if it is an instance of the word. For example, stemming, regular expressions, and edit distance are all actions that might be selectively applied on a word-by-word basis.
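      • Two of the fields above lend themselves to a short sketch: walking the Parent chain to recover a word's context, and an edit-distance check of the kind the Server Code field might apply. Both functions are hypothetical illustrations, not the system's code. The sketch assumes that a parent value of 0 (the column default) means "no parent", that Java -> drink means the Java entry's parent is the drink entry, and it uses the standard Levenshtein distance.

```typescript
// Row shape mirrors a subset of the tn_dictionary_entries columns.
interface DictionaryEntry {
  id_index: number;
  word: string;
  parent: number; // assumed: 0 means "no parent"
  dict_id: number;
}

// Hypothetical helper: follow parent links from an entry up to its root.
function contextPath(entries: DictionaryEntry[], entryId: number): string[] {
  const byId: Record<number, DictionaryEntry> = {};
  for (const entry of entries) {
    byId[entry.id_index] = entry;
  }
  const path: string[] = [];
  const visited: Record<number, boolean> = {};
  let current: DictionaryEntry | undefined = byId[entryId];
  // The visited map guards against accidental cycles in the parent links.
  while (current !== undefined && !visited[current.id_index]) {
    visited[current.id_index] = true;
    path.push(current.word);
    current = current.parent === 0 ? undefined : byId[current.parent];
  }
  return path;
}

// Standard Levenshtein distance, computed one row at a time.
function editDistance(a: string, b: string): number {
  let previous: number[] = [];
  for (let j = 0; j <= b.length; j++) {
    previous.push(j);
  }
  for (let i = 1; i <= a.length; i++) {
    const currentRow: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a.charAt(i - 1) === b.charAt(j - 1) ? 0 : 1;
      const substitution = (previous[j - 1] ?? 0) + cost;
      const deletion = (previous[j] ?? 0) + 1;
      const insertion = (currentRow[j - 1] ?? 0) + 1;
      currentRow.push(Math.min(substitution, deletion, insertion));
    }
    previous = currentRow;
  }
  return previous[b.length] ?? 0;
}

// Two dictionaries, each with its own entry for "Java".
const demoEntries: DictionaryEntry[] = [
  { id_index: 1, word: "drink", parent: 0, dict_id: 1 },
  { id_index: 2, word: "Java", parent: 1, dict_id: 1 },
  { id_index: 3, word: "programming language", parent: 0, dict_id: 2 },
  { id_index: 4, word: "Java", parent: 3, dict_id: 2 },
];
const foodContext = contextPath(demoEntries, 2); // ["Java", "drink"]
const softwareContext = contextPath(demoEntries, 4); // ["Java", "programming language"]
```

      • Because each dictionary holds its own entry for a word, the same surface form resolves to different paths depending on which entry is chosen. A per-word matcher could then accept text whose edit distance to the entry's word falls under some threshold; what that threshold should be is left open here.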
    • Dictionaries:
      CREATE TABLE tn_dictionaries (
          id_index         INT AUTO_INCREMENT PRIMARY KEY,
          user_id          INT NOT NULL,
          is_private       INT NOT NULL DEFAULT 0,
          read_only        INT NOT NULL DEFAULT 0,
          archive          INT NOT NULL DEFAULT 0,
          dictionary_name  VARCHAR(255) ,
          source_text      TEXT
      );
      • Dictionaries are similar to networks in that they are the reference point for a collection of items (or words, in this case). Dictionaries are associated with a user and have a name and private/read-only/archive flags. They also have a text field that indicates the corpus they were derived from, though I now think that there should be a tn_corpus table.